Evaluation Configuration

Define evaluation criteria, success thresholds, and reporting in configs/evaluation.yaml.

Overview

The evaluation configuration controls how experiment results are scored and reported:

  • Evaluation Goal: High-level objective description
  • Evaluators: Rule-based checks and LLM judges that score each trace
  • Aggregation: How individual scores combine
  • Limits: Cost controls for LLM evaluations
  • Success Criteria (Phase 2): Performance, quality, and functionality thresholds
  • Additional Analysis (Phase 2): Persona breakdowns, outlier detection, trends
  • Report Configuration (Phase 2): Customizable Markdown and HTML reports
  • Advanced Settings (Phase 2): Statistical tests, alerts

File Location

fluxloop/my-agent/
└── configs/
    └── evaluation.yaml   # ← This file

Quick Start Example

# Minimal evaluation config
evaluators:
  - name: basic_checks
    type: rule_based
    enabled: true
    weight: 1.0
    rules:
      - check: output_not_empty
      - check: success

aggregate:
  method: weighted_sum
  threshold: 0.7

Evaluation Goal

Type: object
Phase: Phase 2

High-level description of evaluation objectives.

evaluation_goal:
  text: |
    Verify that the agent provides clear, persona-aware responses
    while meeting latency and accuracy targets.

Purpose:

  • Appears in reports and dashboards
  • Guides evaluator selection
  • Communicates intent to stakeholders

Evaluators

Evaluators score each trace based on specific criteria.

Evaluator Types

Type       | Description                                 | Use Case
rule_based | Predefined checks (latency, keywords, etc.) | Objective, deterministic criteria
llm_judge  | LLM scores subjective quality               | Relevance, coherence, helpfulness

Rule-Based Evaluators

Structure

evaluators:
  - name: string          # Unique evaluator identifier
    type: rule_based
    enabled: boolean      # Enable/disable
    weight: number        # Importance weight (0.0-1.0)
    rules:                # List of checks
      - check: string
        ...parameters

Example

- name: completeness
  type: rule_based
  enabled: true
  weight: 0.3
  rules:
    - check: output_not_empty
    - check: success
    - check: latency_under
      budget_ms: 1200

Available Rule Checks

output_not_empty

Verify output field is not empty.

- check: output_not_empty

Passes if:

  • output field exists and is a non-empty string or object

success

Verify trace completed successfully.

- check: success

Passes if:

  • Trace status is SUCCESS
  • No errors occurred

latency_under

Check execution time under budget.

- check: latency_under
  budget_ms: 1000

Parameters:

  • budget_ms: Maximum duration in milliseconds

Passes if:

  • duration_ms < budget_ms

token_usage_under

Check token consumption.

- check: token_usage_under
  max_total_tokens: 4000
  max_prompt_tokens: 2000
  max_completion_tokens: 2000

Parameters:

  • max_total_tokens: Total token budget
  • max_prompt_tokens: Input token budget (optional)
  • max_completion_tokens: Output token budget (optional)

contains

Check if target field contains keywords.

- check: contains
  target: output
  keywords: ["help", "assist", "support"]

Parameters:

  • target: Field to check (output, input, metadata.X)
  • keywords: List of keywords (case-insensitive)

Passes if:

  • Target contains ANY keyword

not_contains

Check if target field does NOT contain keywords.

- check: not_contains
  target: output
  keywords: ["error", "sorry", "unavailable"]

Parameters:

  • target: Field to check
  • keywords: List of forbidden keywords

Passes if:

  • Target contains NONE of the keywords

similarity

Check similarity between target and expected value.

- check: similarity
  target: output
  expected_path: metadata.expected
  method: difflib
  threshold: 0.7

Parameters:

  • target: Field to compare
  • expected_path: Path to expected value
  • method: difflib | levenshtein | embedding
  • threshold: Minimum similarity (0.0-1.0)
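
For reference, the difflib method corresponds to a 0.0-1.0 ratio like the one computed in this minimal Python sketch; the helper name and example strings are illustrative, not FluxLoop's internal implementation.

from difflib import SequenceMatcher

# Illustrative only: computes the kind of 0.0-1.0 ratio that a difflib-based
# similarity check would compare against the configured threshold.
def difflib_similarity(actual: str, expected: str) -> float:
    return SequenceMatcher(None, actual, expected).ratio()

score = difflib_similarity("Your order has shipped.", "The order has shipped.")
print(score, score >= 0.7)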

LLM Judge Evaluators

Structure

evaluators:
  - name: string
    type: llm_judge
    enabled: boolean
    weight: number
    model: string
    model_parameters: object   # Optional
    prompt_template: string
    max_score: number
    parser: string
    metadata: object           # Optional

Example

- name: response_quality
  type: llm_judge
  enabled: true
  weight: 0.5
  model: gpt-4o-mini
  model_parameters:
    reasoning:
      effort: medium
    text:
      verbosity: medium
  prompt_template: |
    Score the assistant's response from 1-10.

    User Input: {input}
    Assistant Output: {output}

    Consider:
    - Relevance to the question
    - Completeness of answer
    - Clarity and conciseness

    Provide only a number from 1-10.
  max_score: 10
  parser: first_number_1_10
  metadata:
    sample_response: "Score: 8"

Fields

Field            | Type   | Required | Description
model            | string | Yes      | LLM model name (e.g., gpt-4o-mini)
model_parameters | object | No       | Model-specific parameters
prompt_template  | string | Yes      | Scoring prompt with placeholders
max_score        | number | Yes      | Maximum possible score
parser           | string | Yes      | How to extract the score from the LLM response
metadata         | object | No       | Additional metadata (sample responses, etc.)

Model Parameters

For models supporting structured outputs (e.g., OpenAI):

model_parameters:
  reasoning:
    effort: medium      # low | medium | high
  text:
    verbosity: medium   # low | medium | high

Prompt Templates

Use placeholders for trace data:

Placeholder    | Description
{input}        | User input text
{output}       | Agent output text
{persona}      | Persona name
{duration_ms}  | Execution time
{metadata.X}   | Custom metadata field
{observations} | List of observations

Example:

prompt_template: |
  You are evaluating a customer support agent.

  Persona: {persona}
  User Question: {input}
  Agent Response: {output}
  Response Time: {duration_ms}ms

  Rate the response quality from 1-10 considering:
  1. Accuracy and correctness
  2. Helpfulness and completeness
  3. Appropriate tone for persona
  4. Response time efficiency

  Output only a number 1-10.
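
To make the placeholder behavior concrete, here is a rough Python sketch of filling a template from a trace record; the field values are invented for illustration and this is not the renderer FluxLoop ships.

# Hypothetical trace fields mirroring the placeholder table above.
trace = {
    "persona": "novice_user",
    "input": "How do I reset my password?",
    "output": "Click 'Forgot password' on the login page.",
    "duration_ms": 850,
}

template = (
    "Persona: {persona}\n"
    "User Question: {input}\n"
    "Agent Response: {output}\n"
    "Response Time: {duration_ms}ms"
)

print(template.format(**trace))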

Parsers

Parser            | Description                  | Example Output
first_number_1_10 | Extract first number 1-10    | "The score is 8" → 8
json_score        | Parse JSON {"score": X}      | {"score": 7} → 7
first_float       | Extract first floating point | "Score: 8.5/10" → 8.5
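
The parsers can be approximated in a few lines of Python; the regular expressions below are an assumption about the behavior described in the table, not the library's actual code.

import json
import re

# Illustrative parsers only.
def first_number_1_10(text):
    match = re.search(r"\b(10|[1-9])\b", text)
    return int(match.group(1)) if match else None

def json_score(text):
    try:
        return float(json.loads(text)["score"])
    except (ValueError, KeyError, TypeError):
        return None

def first_float(text):
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group(0)) if match else None

print(first_number_1_10("The score is 8"))   # 8
print(json_score('{"score": 7}'))            # 7.0
print(first_float("Score: 8.5/10"))          # 8.5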

Pre-Built Evaluators

FluxLoop provides pre-configured LLM judge evaluators:

intent_recognition

Scores how well the agent recognized user intent.

- name: intent_recognition
  type: llm_judge
  enabled: true
  weight: 0.25
  model: gpt-4o-mini
  prompt_template: |
    # Automatically populated from built-in template
  max_score: 10
  parser: first_number_1_10

response_consistency

Checks if the response is logically consistent.

- name: response_consistency
  type: llm_judge
  enabled: true
  weight: 0.25
  model: gpt-4o-mini
  max_score: 10
  parser: first_number_1_10

response_clarity

Evaluates clarity and conciseness.

- name: response_clarity
  type: llm_judge
  enabled: true
  weight: 0.2
  model: gpt-4o-mini
  max_score: 10
  parser: first_number_1_10

information_completeness

Checks if all necessary information is provided.

- name: information_completeness
  type: llm_judge
  enabled: false   # Disabled by default
  weight: 0.1
  model: gpt-4o-mini
  max_score: 10
  parser: first_number_1_10

Aggregation

aggregate.method

Type: string
Options: weighted_sum | average
Default: weighted_sum

How to combine evaluator scores.

aggregate:
  method: weighted_sum

weighted_sum:

final_score = Σ(evaluator_score × evaluator_weight) / Σ(evaluator_weight)

average:

final_score = Σ(evaluator_score) / count(evaluators)
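
A quick numeric sketch of the two methods (scores normalized to 0.0-1.0; the values are illustrative):

# (evaluator_score, weight) pairs for three evaluators.
scores = [(0.9, 0.3), (0.7, 0.35), (0.8, 0.35)]

weighted_sum = sum(s * w for s, w in scores) / sum(w for _, w in scores)
average = sum(s for s, _ in scores) / len(scores)

print(round(weighted_sum, 3))  # 0.795
print(round(average, 3))       # 0.8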

aggregate.threshold

Type: number
Range: 0.0-1.0
Default: 0.7

Pass/fail threshold.

aggregate:
  threshold: 0.75   # 75% to pass

Behavior:

  • final_score >= threshold → PASS
  • final_score < threshold → FAIL

aggregate.by_persona

Type: boolean
Default: true

Group statistics by persona.

aggregate:
  by_persona: true

Enables:

  • Per-persona scores in reports
  • Persona-specific analysis
  • Identifying persona-specific issues

Limits (Cost Controls)

limits.sample_rate

Type: number
Range: 0.0-1.0
Default: 1.0

Fraction of traces to evaluate with LLM judges.

limits:
  sample_rate: 0.3   # Evaluate 30% of traces

Use Cases:

Rate    | Use Case
1.0     | Full evaluation, < 100 traces
0.5     | Moderate sampling, 100-500 traces
0.1-0.3 | Large experiments, 1000+ traces
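
Conceptually, sampling works like the sketch below; this illustrates the effect of the setting, not FluxLoop's sampling code.

import random

# Roughly sample_rate of traces get LLM-judged; rule-based checks are cheap
# and can still run on every trace.
sample_rate = 0.3
traces = [f"trace_{i}" for i in range(1000)]  # hypothetical trace IDs
judged = [t for t in traces if random.random() < sample_rate]
print(len(judged))  # ~300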

limits.max_llm_calls

Type: integer
Default: 50

Maximum LLM API calls.

limits:
  max_llm_calls: 100

Stops evaluation after reaching limit to control costs.

limits.timeout_seconds

Type: integer
Default: 60

Timeout per LLM evaluation call.

limits:
  timeout_seconds: 30

limits.cache

Type: string | null
Default: evaluation_cache.jsonl

Cache file for LLM responses.

limits:
  cache: evaluation_cache.jsonl

Benefits:

  • Avoid re-evaluating same traces
  • Reduce API costs
  • Speed up re-runs
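
For intuition, a JSONL cache can be read as one JSON object per line; the key fields used below (trace_id, evaluator, score) are assumptions for illustration, not a documented schema.

import json
from pathlib import Path

cache_path = Path("evaluation_cache.jsonl")
cache = {}
if cache_path.exists():
    for line in cache_path.read_text().splitlines():
        entry = json.loads(line)
        # Hypothetical keying: skip the LLM call if this trace/evaluator pair
        # was already scored in a previous run.
        cache[(entry["trace_id"], entry["evaluator"])] = entry["score"]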

Success Criteria (Phase 2)

Define pass/fail thresholds across multiple dimensions.

Performance Criteria

success_criteria:
  performance:
    all_traces_successful: true
    avg_response_time:
      enabled: true
      threshold_ms: 2000
    max_response_time:
      enabled: true
      threshold_ms: 5000
    error_rate:
      enabled: true
      threshold_percent: 5

Criterion             | Description
all_traces_successful | All traces must succeed (no errors)
avg_response_time     | Average latency under threshold
max_response_time     | Maximum latency under threshold
error_rate            | Error percentage under threshold
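
The performance checks reduce to simple arithmetic over trace results, as in this sketch; the field names are assumptions for illustration.

traces = [
    {"status": "SUCCESS", "duration_ms": 1200},
    {"status": "SUCCESS", "duration_ms": 1500},
    {"status": "ERROR", "duration_ms": 2100},
]

durations = [t["duration_ms"] for t in traces]
error_rate = 100 * sum(t["status"] != "SUCCESS" for t in traces) / len(traces)

passed = (
    sum(durations) / len(durations) <= 2000   # avg_response_time
    and max(durations) <= 5000                # max_response_time
    and error_rate <= 5                       # error_rate (percent)
)
print(passed)  # False: error_rate is 33.3%, above the 5% threshold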

Quality Criteria

success_criteria:
  quality:
    intent_recognition: true
    response_consistency: true
    response_clarity: true
    information_completeness: false

Evaluator names that must pass their thresholds.

Functionality Criteria

success_criteria:
  functionality:
    tool_calling:
      enabled: true
      all_calls_successful: true
      appropriate_selection: true
      correct_parameters: true
      proper_timing: true
      handles_failures: true

Tool calling requirements:

Check                 | Description
all_calls_successful  | All tool calls succeed
appropriate_selection | Correct tools chosen
correct_parameters    | Valid parameters passed
proper_timing         | Tools called in logical order
handles_failures      | Gracefully handles tool errors

Additional Analysis (Phase 2)

Persona Analysis

additional_analysis:
  persona:
    enabled: true
    focus_personas: ["expert_user", "novice_user"]

Generates:

  • Per-persona performance breakdown
  • Persona-specific failure patterns
  • Comparative persona analysis

Performance Analysis

additional_analysis:
  performance:
    detect_outliers: true
    trend_analysis: true

Features:

  • Outlier detection (statistical anomalies)
  • Trend analysis over time
  • Performance distribution charts

Failure Analysis

additional_analysis:
  failures:
    enabled: true
    categorize_causes: true

Provides:

  • Failure categorization
  • Root cause analysis
  • Common failure patterns

Baseline Comparison

additional_analysis:
  comparison:
    enabled: true
    baseline_path: "experiments/baseline_v1/evaluation/summary.json"

Compares:

  • Score changes vs baseline
  • Performance regression detection
  • Improvement/degradation analysis
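
A baseline comparison boils down to diffing two summaries; the summary.json field used below (avg_score) and the current-run path are hypothetical, shown only to illustrate the idea.

import json

with open("experiments/baseline_v1/evaluation/summary.json") as f:
    baseline = json.load(f)
with open("experiments/current_run/evaluation/summary.json") as f:  # hypothetical path
    current = json.load(f)

delta = current["avg_score"] - baseline["avg_score"]  # hypothetical field
print(f"Score change vs baseline: {delta:+.3f}")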

Report Configuration (Phase 2)

report.style

Type: string
Options: quick | standard | detailed
Default: standard

Report detail level.

report:
  style: standard

Style    | Sections                            | Use Case
quick    | Executive summary, key metrics      | Quick checks, CI/CD
standard | + Detailed results, recommendations | Regular testing
detailed | + Statistical analysis, examples    | Deep analysis

report.sections

report:
  sections:
    executive_summary: true
    key_metrics: true
    detailed_results: true
    statistical_analysis: false
    failure_cases: true
    success_examples: false
    recommendations: true
    action_items: true

Available Sections:

Section              | Description
executive_summary    | High-level overview
key_metrics          | Core statistics
detailed_results     | Per-evaluator breakdown
statistical_analysis | Advanced stats
failure_cases        | Failed trace examples
success_examples     | Successful trace examples
recommendations      | Improvement suggestions
action_items         | Specific next steps

report.visualizations

report:
  visualizations:
    charts_and_graphs: true
    tables: true
    interactive: true

Options:

  • charts_and_graphs: Include charts in HTML report
  • tables: Data tables
  • interactive: Interactive HTML elements

report.tone

Type: string
Options: technical | executive | balanced
Default: balanced

Report writing style.

report:
  tone: executive

Tone      | Audience   | Style
technical | Engineers  | Detailed, metrics-focused
executive | Leadership | High-level, business impact
balanced  | Mixed      | Both technical and business

report.output

Type: string
Options: md | html | both
Default: both

Report format(s) to generate.

report:
  output: both

Generates:

  • md: report.md (Markdown)
  • html: report.html (Interactive HTML)
  • both: Both formats

report.template_path

Type: string | null
Default: null

Custom HTML template path.

report:
  template_path: "templates/custom_report.html"

Use for:

  • Company branding
  • Custom visualizations
  • Specific formatting requirements

Advanced Settings (Phase 2)

Statistical Tests

advanced:
  statistical_tests:
    enabled: true
    significance_level: 0.05
    confidence_interval: 0.95

Enables:

  • Hypothesis testing
  • Confidence intervals
  • Statistical significance checks
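
As a reference, a 95% confidence interval for the mean score can be computed with the Python standard library; this mirrors the confidence_interval setting above but is not FluxLoop's implementation.

from statistics import NormalDist, mean, stdev

scores = [0.78, 0.82, 0.74, 0.88, 0.80, 0.76]   # illustrative final scores
z = NormalDist().inv_cdf(0.975)                  # two-sided 95%
margin = z * stdev(scores) / len(scores) ** 0.5
print(f"{mean(scores):.3f} ± {margin:.3f}")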

Outlier Handling

advanced:
  outliers:
    detection: true
    handling: analyze_separately   # remove | analyze_separately | include

Handling           | Description
remove             | Exclude outliers from aggregate stats
analyze_separately | Include but call out in reports
include            | Treat as normal data points
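
One common detection approach is the interquartile-range rule sketched below; the exact method FluxLoop uses is not specified here, so treat this as an illustration of the concept.

from statistics import quantiles

durations = [900, 980, 1050, 1100, 1200, 4800]   # illustrative latencies (ms)
q1, _, q3 = quantiles(durations, n=4)
iqr = q3 - q1
outliers = [d for d in durations if d < q1 - 1.5 * iqr or d > q3 + 1.5 * iqr]
print(outliers)  # [4800]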

Alerts

advanced:
  alerts:
    enabled: true
    conditions:
      - metric: "error_rate"
        threshold: 10
        operator: ">"
      - metric: "avg_score"
        threshold: 0.7
        operator: "<"

Alert Conditions:

  • metric: Metric to monitor
  • threshold: Threshold value
  • operator: >, <, >=, <=, ==
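
Evaluating a condition amounts to applying the operator to the metric value, as in this sketch (the metric values are illustrative):

import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "==": operator.eq}

metrics = {"error_rate": 12.0, "avg_score": 0.74}
conditions = [
    {"metric": "error_rate", "threshold": 10, "operator": ">"},
    {"metric": "avg_score", "threshold": 0.7, "operator": "<"},
]

fired = [c["metric"] for c in conditions
         if OPS[c["operator"]](metrics[c["metric"]], c["threshold"])]
print(fired)  # ['error_rate']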

Complete Examples

Minimal (Quick Start)

evaluators:
  - name: basic
    type: rule_based
    enabled: true
    weight: 1.0
    rules:
      - check: output_not_empty
      - check: success

aggregate:
  method: weighted_sum
  threshold: 0.7

Standard Testing

evaluation_goal:
  text: "Validate agent provides accurate, helpful responses"

evaluators:
  - name: completeness
    type: rule_based
    enabled: true
    weight: 0.3
    rules:
      - check: output_not_empty
      - check: success
      - check: latency_under
        budget_ms: 2000

  - name: intent_recognition
    type: llm_judge
    enabled: true
    weight: 0.35
    model: gpt-4o-mini
    max_score: 10
    parser: first_number_1_10

  - name: response_clarity
    type: llm_judge
    enabled: true
    weight: 0.35
    model: gpt-4o-mini
    max_score: 10
    parser: first_number_1_10

aggregate:
  method: weighted_sum
  threshold: 0.75
  by_persona: true

limits:
  sample_rate: 0.5
  max_llm_calls: 100
  cache: evaluation_cache.jsonl

Production Validation (Phase 2)

evaluation_goal:
  text: |
    Comprehensive validation before production deployment.
    Ensure quality, performance, and reliability targets.

evaluators:
  - name: not_empty
    type: rule_based
    enabled: true
    weight: 0.2
    rules:
      - check: output_not_empty

  - name: token_budget
    type: rule_based
    enabled: true
    weight: 0.2
    rules:
      - check: token_usage_under
        max_total_tokens: 4000

  - name: intent_recognition
    type: llm_judge
    enabled: true
    weight: 0.3
    model: gpt-4o
    max_score: 10
    parser: first_number_1_10

  - name: response_quality
    type: llm_judge
    enabled: true
    weight: 0.3
    model: gpt-4o
    max_score: 10
    parser: first_number_1_10

aggregate:
  method: weighted_sum
  threshold: 0.8
  by_persona: true

limits:
  sample_rate: 1.0
  max_llm_calls: 500
  cache: prod_eval_cache.jsonl

success_criteria:
  performance:
    all_traces_successful: true
    avg_response_time:
      enabled: true
      threshold_ms: 1500
    error_rate:
      enabled: true
      threshold_percent: 2

  quality:
    intent_recognition: true
    response_quality: true

additional_analysis:
  persona:
    enabled: true
  performance:
    detect_outliers: true
    trend_analysis: true
  failures:
    enabled: true
    categorize_causes: true
  comparison:
    enabled: true
    baseline_path: "experiments/baseline/evaluation/summary.json"

report:
  style: detailed
  sections:
    executive_summary: true
    key_metrics: true
    detailed_results: true
    statistical_analysis: true
    failure_cases: true
    recommendations: true
    action_items: true
  visualizations:
    charts_and_graphs: true
    tables: true
    interactive: true
  tone: balanced
  output: both

advanced:
  statistical_tests:
    enabled: true
    significance_level: 0.05
  outliers:
    detection: true
    handling: analyze_separately
  alerts:
    enabled: true
    conditions:
      - metric: "error_rate"
        threshold: 5
        operator: ">"

See Also