Evaluation Configuration
Define evaluation criteria, success thresholds, and reporting in configs/evaluation.yaml.
Overview
The evaluation configuration controls how experiment results are scored and reported:
- Evaluation Goal: High-level objective description
- Evaluators: Rule-based checks and LLM judges that score each trace
- Aggregation: How individual scores combine
- Limits: Cost controls for LLM evaluations
- Success Criteria (Phase 2): Performance, quality, and functionality thresholds
- Additional Analysis (Phase 2): Persona breakdowns, outlier detection, trends
- Report Configuration (Phase 2): Customizable Markdown and HTML reports
- Advanced Settings (Phase 2): Statistical tests, alerts
File Location
fluxloop/my-agent/
└── configs/
└── evaluation.yaml # ← This file
Quick Start Example
# Minimal evaluation config
evaluators:
- name: basic_checks
type: rule_based
enabled: true
weight: 1.0
rules:
- check: output_not_empty
- check: success
aggregate:
method: weighted_sum
threshold: 0.7
Evaluation Goal
Type: object
Phase: Phase 2
High-level description of evaluation objectives.
evaluation_goal:
text: |
Verify that the agent provides clear, persona-aware responses
while meeting latency and accuracy targets.
Purpose:
- Appears in reports and dashboards
- Guides evaluator selection
- Communicates intent to stakeholders
Evaluators
Evaluators score each trace based on specific criteria.
Evaluator Types
| Type | Description | Use Case |
|---|---|---|
rule_based | Predefined checks (latency, keywords, etc.) | Objective, deterministic criteria |
llm_judge | LLM scores subjective quality | Relevance, coherence, helpfulness |
Rule-Based Evaluators
Structure
evaluators:
- name: string # Unique evaluator identifier
type: rule_based
enabled: boolean # Enable/disable
weight: number # Importance weight (0.0-1.0)
rules: # List of checks
- check: string
...parameters
Example
- name: completeness
type: rule_based
enabled: true
weight: 0.3
rules:
- check: output_not_empty
- check: success
- check: latency_under
budget_ms: 1200
Available Rule Checks
output_not_empty
Verify output field is not empty.
- check: output_not_empty
Passes if:
- The output field exists and is a non-empty string or object
success
Verify trace completed successfully.
- check: success
Passes if:
- Trace status is SUCCESS
- No errors occurred
latency_under
Check execution time under budget.
- check: latency_under
budget_ms: 1000
Parameters:
- budget_ms: Maximum duration in milliseconds
Passes if:
- duration_ms < budget_ms
token_usage_under
Check token consumption.
- check: token_usage_under
max_total_tokens: 4000
max_prompt_tokens: 2000
max_completion_tokens: 2000
Parameters:
- max_total_tokens: Total token budget
- max_prompt_tokens: Input token budget (optional)
- max_completion_tokens: Output token budget (optional)
contains
Check if target field contains keywords.
- check: contains
target: output
keywords: ["help", "assist", "support"]
Parameters:
- target: Field to check (output, input, metadata.X)
- keywords: List of keywords (case-insensitive)
Passes if:
- Target contains ANY keyword
not_contains
Check if target field does NOT contain keywords.
- check: not_contains
target: output
keywords: ["error", "sorry", "unavailable"]
Parameters:
- target: Field to check
- keywords: List of forbidden keywords
Passes if:
- Target contains NONE of the keywords
similarity
Check similarity between target and expected value.
- check: similarity
target: output
expected_path: metadata.expected
method: difflib
threshold: 0.7
Parameters:
- target: Field to compare
- expected_path: Path to the expected value
- method: difflib | levenshtein | embedding
- threshold: Minimum similarity (0.0-1.0)
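For the difflib method, the score corresponds to Python's standard-library sequence matching. A minimal sketch of how such a check could be computed (illustrative only, not FluxLoop's actual evaluator code):
# Illustrative sketch of a difflib-based similarity check; not FluxLoop's actual implementation.
import difflib

def similarity_passes(actual: str, expected: str, threshold: float = 0.7) -> bool:
    # SequenceMatcher.ratio() returns a similarity score in [0.0, 1.0]
    ratio = difflib.SequenceMatcher(None, actual, expected).ratio()
    return ratio >= threshold

print(similarity_passes("Your order has shipped.", "The order has shipped.", 0.7))  # True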
LLM Judge Evaluators
Structure
evaluators:
- name: string
type: llm_judge
enabled: boolean
weight: number
model: string
model_parameters: object # Optional
prompt_template: string
max_score: number
parser: string
metadata: object # Optional
Example
- name: response_quality
type: llm_judge
enabled: true
weight: 0.5
model: gpt-4o-mini
model_parameters:
reasoning:
effort: medium
text:
verbosity: medium
prompt_template: |
Score the assistant's response from 1-10.
User Input: {input}
Assistant Output: {output}
Consider:
- Relevance to the question
- Completeness of answer
- Clarity and conciseness
Provide only a number from 1-10.
max_score: 10
parser: first_number_1_10
metadata:
sample_response: "Score: 8"
Fields
| Field | Type | Required | Description |
|---|---|---|---|
model | string | Yes | LLM model name (e.g., gpt-4o-mini) |
model_parameters | object | No | Model-specific parameters |
prompt_template | string | Yes | Scoring prompt with placeholders |
max_score | number | Yes | Maximum possible score |
parser | string | Yes | How to extract score from LLM response |
metadata | object | No | Additional metadata (sample responses, etc.) |
Model Parameters
For models supporting structured outputs (e.g., OpenAI):
model_parameters:
reasoning:
effort: medium # low | medium | high
text:
verbosity: medium # low | medium | high
Prompt Templates
Use placeholders for trace data:
| Placeholder | Description |
|---|---|
{input} | User input text |
{output} | Agent output text |
{persona} | Persona name |
{duration_ms} | Execution time |
{metadata.X} | Custom metadata field |
{observations} | List of observations |
Example:
prompt_template: |
You are evaluating a customer support agent.
Persona: {persona}
User Question: {input}
Agent Response: {output}
Response Time: {duration_ms}ms
Rate the response quality from 1-10 considering:
1. Accuracy and correctness
2. Helpfulness and completeness
3. Appropriate tone for persona
4. Response time efficiency
Output only a number 1-10.
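Conceptually, placeholder substitution behaves like standard string formatting over trace fields. A rough sketch with hypothetical trace values (the real engine also resolves nested placeholders such as {metadata.X}):
# Conceptual sketch of placeholder substitution; trace values are hypothetical.
template = (
    "User Question: {input}\n"
    "Agent Response: {output}\n"
    "Response Time: {duration_ms}ms\n"
    "Rate the response quality from 1-10."
)
trace = {"input": "Where is my order?", "output": "It ships tomorrow.", "duration_ms": 850}
print(template.format(**trace))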
Parsers
| Parser | Description | Example Output |
|---|---|---|
first_number_1_10 | Extract first number 1-10 | "The score is 8" → 8 |
json_score | Parse JSON {"score": X} | {"score": 7} → 7 |
first_float | Extract first floating point | "Score: 8.5/10" → 8.5 |
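As an illustration of what these parsers do (a sketch, not the actual parser implementations):
# Illustrative parser sketches; the real FluxLoop parsers may differ in edge-case handling.
import json
import re

def first_number_1_10(text: str):
    # Return the first integer in the 1-10 range found in the response.
    for match in re.finditer(r"\d+", text):
        value = int(match.group())
        if 1 <= value <= 10:
            return value
    return None

def json_score(text: str):
    # Parse a {"score": X} payload.
    try:
        return json.loads(text).get("score")
    except json.JSONDecodeError:
        return None

print(first_number_1_10("The score is 8"))  # 8
print(json_score('{"score": 7}'))           # 7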
Pre-Built Evaluators
FluxLoop provides pre-configured LLM judge evaluators:
intent_recognition
Scores how well the agent recognized user intent.
- name: intent_recognition
type: llm_judge
enabled: true
weight: 0.25
model: gpt-4o-mini
prompt_template: |
# Automatically populated from built-in template
max_score: 10
parser: first_number_1_10
response_consistency
Checks if the response is logically consistent.
- name: response_consistency
type: llm_judge
enabled: true
weight: 0.25
model: gpt-4o-mini
max_score: 10
parser: first_number_1_10
response_clarity
Evaluates clarity and conciseness.
- name: response_clarity
type: llm_judge
enabled: true
weight: 0.2
model: gpt-4o-mini
max_score: 10
parser: first_number_1_10
information_completeness
Checks if all necessary information is provided.
- name: information_completeness
type: llm_judge
enabled: false # Disabled by default
weight: 0.1
model: gpt-4o-mini
max_score: 10
parser: first_number_1_10
Aggregation
aggregate.method
Type: string
Options: weighted_sum | average
Default: weighted_sum
How to combine evaluator scores.
aggregate:
method: weighted_sum
weighted_sum:
final_score = Σ(evaluator_score × evaluator_weight) / Σ(evaluator_weight)
average:
final_score = Σ(evaluator_score) / count(evaluators)
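A worked example of weighted_sum, assuming each evaluator's score is first normalized to 0.0-1.0 (for LLM judges, score divided by max_score); the evaluator names, scores, and weights below are hypothetical:
# Worked weighted_sum example; evaluator names, scores, and weights are hypothetical.
scores  = {"completeness": 1.0, "intent_recognition": 0.8, "response_clarity": 0.6}
weights = {"completeness": 0.3, "intent_recognition": 0.35, "response_clarity": 0.35}

final_score = sum(scores[n] * weights[n] for n in scores) / sum(weights.values())
print(round(final_score, 2))  # 0.79 -> passes a threshold of 0.75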
aggregate.threshold
Type: number
Range: 0.0-1.0
Default: 0.7
Pass/fail threshold.
aggregate:
threshold: 0.75 # 75% to pass
Behavior:
- final_score >= threshold → PASS
- final_score < threshold → FAIL
aggregate.by_persona
Type: boolean
Default: true
Group statistics by persona.
aggregate:
by_persona: true
Enables:
- Per-persona scores in reports
- Persona-specific analysis
- Identifying persona-specific issues
Limits (Cost Controls)
limits.sample_rate
Type: number
Range: 0.0-1.0
Default: 1.0
Fraction of traces to evaluate with LLM judges.
limits:
sample_rate: 0.3 # Evaluate 30% of traces
Use Cases:
| Rate | Use Case |
|---|---|
1.0 | Full evaluation, < 100 traces |
0.5 | Moderate sampling, 100-500 traces |
0.1-0.3 | Large experiments, 1000+ traces |
limits.max_llm_calls
Type: integer
Default: 50
Maximum LLM API calls.
limits:
max_llm_calls: 100
Evaluation stops after reaching this limit, to control costs.
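Taken together, sample_rate and max_llm_calls bound LLM-judge cost. A rough sketch of how the two limits might interact (an assumption about the semantics, not the exact implementation):
# Assumed interaction of sample_rate and max_llm_calls; not FluxLoop's actual scheduler.
import random

def traces_to_judge(traces, sample_rate=0.3, max_llm_calls=100):
    sampled = [t for t in traces if random.random() < sample_rate]  # probabilistic sampling
    return sampled[:max_llm_calls]                                  # hard cap on API calls

# With 1,000 traces, roughly 300 are sampled and at most 100 reach the judge model.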
limits.timeout_seconds
Type: integer
Default: 60
Timeout per LLM evaluation call.
limits:
timeout_seconds: 30
limits.cache
Type: string | null
Default: evaluation_cache.jsonl
Cache file for LLM responses.
limits:
cache: evaluation_cache.jsonl
Benefits:
- Avoid re-evaluating the same traces
- Reduce API costs
- Speed up re-runs
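The cache is a JSONL file of previously scored traces. A hypothetical sketch of how entries could be keyed and reused (the actual record schema is not documented here):
# Hypothetical JSONL cache lookup; field names (trace_id, evaluator, score) are assumptions.
import json
from pathlib import Path

def load_cache(path: str = "evaluation_cache.jsonl") -> dict:
    cache = {}
    cache_file = Path(path)
    if cache_file.exists():
        for line in cache_file.read_text().splitlines():
            entry = json.loads(line)
            cache[(entry["trace_id"], entry["evaluator"])] = entry["score"]
    return cache

cache = load_cache()
# Skip the LLM call when (trace_id, evaluator) is already present in the cache.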
Success Criteria (Phase 2)
Define pass/fail thresholds across multiple dimensions.
Performance Criteria
success_criteria:
performance:
all_traces_successful: true
avg_response_time:
enabled: true
threshold_ms: 2000
max_response_time:
enabled: true
threshold_ms: 5000
error_rate:
enabled: true
threshold_percent: 5
| Criterion | Description |
|---|---|
all_traces_successful | All traces must succeed (no errors) |
avg_response_time | Average latency under threshold |
max_response_time | Maximum latency under threshold |
error_rate | Error percentage under threshold |
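As a sketch of how these thresholds translate into a pass/fail decision over run-level statistics (the field names here are illustrative assumptions):
# Illustrative pass/fail check over aggregate run statistics; field names are assumptions.
def performance_passes(stats: dict) -> bool:
    return (
        stats["error_count"] == 0              # all_traces_successful
        and stats["avg_duration_ms"] <= 2000   # avg_response_time.threshold_ms
        and stats["max_duration_ms"] <= 5000   # max_response_time.threshold_ms
        and stats["error_rate_percent"] <= 5   # error_rate.threshold_percent
    )

print(performance_passes(
    {"error_count": 0, "avg_duration_ms": 1430, "max_duration_ms": 3900, "error_rate_percent": 0.0}
))  # True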
Quality Criteria
success_criteria:
quality:
intent_recognition: true
response_consistency: true
response_clarity: true
information_completeness: false
Each key is an evaluator name; set it to true to require that evaluator to pass its threshold.
Functionality Criteria
success_criteria:
functionality:
tool_calling:
enabled: true
all_calls_successful: true
appropriate_selection: true
correct_parameters: true
proper_timing: true
handles_failures: true
Tool calling requirements:
| Check | Description |
|---|---|
all_calls_successful | All tool calls succeed |
appropriate_selection | Correct tools chosen |
correct_parameters | Valid parameters passed |
proper_timing | Tools called in logical order |
handles_failures | Gracefully handles tool errors |
Additional Analysis (Phase 2)
Persona Analysis
additional_analysis:
persona:
enabled: true
focus_personas: ["expert_user", "novice_user"]
Generates:
- Per-persona performance breakdown
- Persona-specific failure patterns
- Comparative persona analysis
Performance Analysis
additional_analysis:
performance:
detect_outliers: true
trend_analysis: true
Features:
- Outlier detection (statistical anomalies)
- Trend analysis over time
- Performance distribution charts
Failure Analysis
additional_analysis:
failures:
enabled: true
categorize_causes: true
Provides:
- Failure categorization
- Root cause analysis
- Common failure patterns
Baseline Comparison
additional_analysis:
comparison:
enabled: true
baseline_path: "experiments/baseline_v1/evaluation/summary.json"
Compares:
- Score changes vs baseline
- Performance regression detection
- Improvement/degradation analysis
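A minimal sketch of what this comparison amounts to, assuming both runs expose a top-level final score in their summary.json (the actual schema and key names are not shown in this document):
# Hypothetical baseline comparison; the summary.json schema and key names are assumptions.
import json

def score_delta(current_path: str, baseline_path: str, key: str = "final_score") -> float:
    with open(current_path) as cur, open(baseline_path) as base:
        return json.load(cur)[key] - json.load(base)[key]

# Positive delta = improvement over baseline; negative = regression.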
Report Configuration (Phase 2)
report.style
Type: string
Options: quick | standard | detailed
Default: standard
Report detail level.
report:
style: standard
| Style | Sections | Use Case |
|---|---|---|
quick | Executive summary, key metrics | Quick checks, CI/CD |
standard | + Detailed results, recommendations | Regular testing |
detailed | + Statistical analysis, examples | Deep analysis |
report.sections
report:
sections:
executive_summary: true
key_metrics: true
detailed_results: true
statistical_analysis: false
failure_cases: true
success_examples: false
recommendations: true
action_items: true
Available Sections:
| Section | Description |
|---|---|
executive_summary | High-level overview |
key_metrics | Core statistics |
detailed_results | Per-evaluator breakdown |
statistical_analysis | Advanced stats |
failure_cases | Failed trace examples |
success_examples | Successful trace examples |
recommendations | Improvement suggestions |
action_items | Specific next steps |
report.visualizations
report:
visualizations:
charts_and_graphs: true
tables: true
interactive: true
Options:
- charts_and_graphs: Include charts in the HTML report
- tables: Data tables
- interactive: Interactive HTML elements
report.tone
Type: string
Options: technical | executive | balanced
Default: balanced
Report writing style.
report:
tone: executive
| Tone | Audience | Style |
|---|---|---|
technical | Engineers | Detailed, metrics-focused |
executive | Leadership | High-level, business impact |
balanced | Mixed | Both technical and business |
report.output
Type: string
Options: md | html | both
Default: both
Report format(s) to generate.
report:
output: both
Generates:
- md: report.md (Markdown)
- html: report.html (Interactive HTML)
- both: Both formats
report.template_path
Type: string | null
Default: null
Custom HTML template path.
report:
template_path: "templates/custom_report.html"
Use for:
- Company branding
- Custom visualizations
- Specific formatting requirements
Advanced Settings (Phase 2)
Statistical Tests
advanced:
statistical_tests:
enabled: true
significance_level: 0.05
confidence_interval: 0.95
Enables:
- Hypothesis testing
- Confidence intervals
- Statistical significance checks
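For example, a two-sample t-test between baseline and current scores at significance_level 0.05 is the kind of check this setting enables (an illustrative sketch using SciPy, not FluxLoop's internal code; the scores are made up):
# Illustrative significance check; the scores below are made-up examples.
from scipy import stats

baseline_scores = [0.72, 0.75, 0.78, 0.74, 0.71]
current_scores  = [0.80, 0.83, 0.79, 0.85, 0.82]

t_stat, p_value = stats.ttest_ind(baseline_scores, current_scores)
print(p_value < 0.05)  # True -> the change is significant at the 0.05 level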
Outlier Handling
advanced:
outliers:
detection: true
handling: analyze_separately # remove | analyze_separately | include
| Handling | Description |
|---|---|
remove | Exclude outliers from aggregate stats |
analyze_separately | Include but call out in reports |
include | Treat as normal data points |
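The detection method is not specified in this file; a common approach, and a reasonable mental model, is the 1.5×IQR rule over latency or score distributions:
# 1.5×IQR outlier rule as a mental model; FluxLoop's actual detection method may differ.
import statistics

def iqr_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

print(iqr_outliers([820, 840, 860, 880, 900, 3200]))  # [3200] (a latency spike, in ms)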
Alerts
advanced:
alerts:
enabled: true
conditions:
- metric: "error_rate"
threshold: 10
operator: ">"
- metric: "avg_score"
threshold: 0.7
operator: "<"
Alert Conditions:
- metric: Metric to monitor
- threshold: Threshold value
- operator: One of >, <, >=, <=, ==
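A sketch of how such conditions might be evaluated against run metrics (hypothetical helper, not part of the FluxLoop API; metric values are illustrative):
# Hypothetical alert evaluation; metric names and values are illustrative.
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le, "==": operator.eq}

def triggered_alerts(metrics: dict, conditions: list) -> list:
    return [c for c in conditions if OPS[c["operator"]](metrics[c["metric"]], c["threshold"])]

conditions = [
    {"metric": "error_rate", "threshold": 10, "operator": ">"},
    {"metric": "avg_score", "threshold": 0.7, "operator": "<"},
]
print(triggered_alerts({"error_rate": 12.5, "avg_score": 0.82}, conditions))
# -> only the error_rate condition fires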
Complete Examples
Minimal (Quick Start)
evaluators:
- name: basic
type: rule_based
enabled: true
weight: 1.0
rules:
- check: output_not_empty
- check: success
aggregate:
method: weighted_sum
threshold: 0.7
Standard Testing
evaluation_goal:
text: "Validate agent provides accurate, helpful responses"
evaluators:
- name: completeness
type: rule_based
enabled: true
weight: 0.3
rules:
- check: output_not_empty
- check: success
- check: latency_under
budget_ms: 2000
- name: intent_recognition
type: llm_judge
enabled: true
weight: 0.35
model: gpt-4o-mini
max_score: 10
parser: first_number_1_10
- name: response_clarity
type: llm_judge
enabled: true
weight: 0.35
model: gpt-4o-mini
max_score: 10
parser: first_number_1_10
aggregate:
method: weighted_sum
threshold: 0.75
by_persona: true
limits:
sample_rate: 0.5
max_llm_calls: 100
cache: evaluation_cache.jsonl
Production Validation (Phase 2)
evaluation_goal:
text: |
Comprehensive validation before production deployment.
Ensure quality, performance, and reliability targets.
evaluators:
- name: not_empty
type: rule_based
enabled: true
weight: 0.2
rules:
- check: output_not_empty
- name: token_budget
type: rule_based
enabled: true
weight: 0.2
rules:
- check: token_usage_under
max_total_tokens: 4000
- name: intent_recognition
type: llm_judge
enabled: true
weight: 0.3
model: gpt-4o
max_score: 10
parser: first_number_1_10
- name: response_quality
type: llm_judge
enabled: true
weight: 0.3
model: gpt-4o
max_score: 10
parser: first_number_1_10
aggregate:
method: weighted_sum
threshold: 0.8
by_persona: true
limits:
sample_rate: 1.0
max_llm_calls: 500
cache: prod_eval_cache.jsonl
success_criteria:
performance:
all_traces_successful: true
avg_response_time:
enabled: true
threshold_ms: 1500
error_rate:
enabled: true
threshold_percent: 2
quality:
intent_recognition: true
response_quality: true
additional_analysis:
persona:
enabled: true
performance:
detect_outliers: true
trend_analysis: true
failures:
enabled: true
categorize_causes: true
comparison:
enabled: true
baseline_path: "experiments/baseline/evaluation/summary.json"
report:
style: detailed
sections:
executive_summary: true
key_metrics: true
detailed_results: true
statistical_analysis: true
failure_cases: true
recommendations: true
action_items: true
visualizations:
charts_and_graphs: true
tables: true
interactive: true
tone: balanced
output: both
advanced:
statistical_tests:
enabled: true
significance_level: 0.05
outliers:
detection: true
handling: analyze_separately
alerts:
enabled: true
conditions:
- metric: "error_rate"
threshold: 5
operator: ">"
See Also
- evaluate Command - Evaluation CLI reference
- Basic Workflow - Complete workflow guide
- Simulation Configuration - Experiment settings
- Input Configuration - Input generation