
CI/CD Integration

Integrate FluxLoop into CI/CD pipelines for automated agent testing and regression detection.

Overview

FluxLoop can be integrated into CI/CD pipelines to:

  • Automate regression testing after code changes
  • Validate agent quality before deployment
  • Track performance trends over time
  • Enforce quality gates based on evaluation scores
  • Generate test reports for stakeholders

This guide covers GitHub Actions, GitLab CI, Jenkins, and Docker-based setups.


Quick Setup

Prerequisites

  1. FluxLoop project with configuration in configs/
  2. Agent code instrumented with FluxLoop SDK
  3. Base inputs defined in configs/input.yaml
  4. Evaluators configured in configs/evaluation.yaml
  5. API keys (OpenAI, Anthropic, etc.) stored as CI secrets (see the sketch below for the GitHub CLI)
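
As an example of the last prerequisite, keys can be registered as repository secrets with the GitHub CLI (a sketch; the secret names simply match the environment variables used later in this guide, and GitLab or Jenkins have their own secret stores):

# Requires an authenticated gh CLI with access to the repository
gh secret set OPENAI_API_KEY --body "$OPENAI_API_KEY"
gh secret set ANTHROPIC_API_KEY --body "$ANTHROPIC_API_KEY"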

Key Principles

For CI/CD environments (a generic job script tying these together follows the list):

  1. Use deterministic mode for input generation (or commit pre-generated inputs)
  2. Set fixed seed in configs/simulation.yaml for reproducibility
  3. Cache dependencies (Python packages, MCP index)
  4. Store API keys as secrets, not in code
  5. Generate artifacts (reports, traces) for review
  6. Fail the pipeline if evaluation scores fall below the threshold
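
Every pipeline in this guide reduces to the same command sequence. A minimal sketch of a generic job script that applies these principles in any CI runner (paths follow the project layout above; scripts/check_quality.py is the quality-gate script shown under Best Practices):

#!/usr/bin/env bash
# Generic CI job: generate inputs deterministically, run, parse, evaluate, then gate
set -euo pipefail

cd fluxloop/my-agent

fluxloop generate inputs --limit 20 --mode deterministic
fluxloop run experiment --iterations 1

LATEST_EXP=$(ls -td experiments/*/ | head -1)
fluxloop parse experiment "$LATEST_EXP"
fluxloop evaluate experiment "$LATEST_EXP"

# Exit non-zero (failing the job) if scores are below threshold
python3 scripts/check_quality.py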

GitHub Actions

Basic Workflow

Create .github/workflows/fluxloop-test.yml:

name: FluxLoop Agent Tests

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test-agent:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dependencies
        run: |
          pip install fluxloop-cli fluxloop
          pip install -r requirements.txt

      - name: Verify setup
        run: fluxloop doctor

      - name: Generate inputs (deterministic)
        run: |
          cd fluxloop/my-agent
          fluxloop generate inputs --limit 20 --mode deterministic

      - name: Run experiment
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          FLUXLOOP_ENABLED: 'true'
        run: |
          cd fluxloop/my-agent
          fluxloop run experiment --iterations 1

      - name: Parse results
        run: |
          cd fluxloop/my-agent
          LATEST_EXP=$(ls -td experiments/*/ | head -1)
          fluxloop parse experiment "$LATEST_EXP"

      - name: Evaluate results
        run: |
          cd fluxloop/my-agent
          LATEST_EXP=$(ls -td experiments/*/ | head -1)
          fluxloop evaluate experiment "$LATEST_EXP"

      - name: Check evaluation threshold
        run: |
          cd fluxloop/my-agent
          LATEST_EXP=$(ls -td experiments/*/ | head -1)
          export LATEST_EXP
          python3 << 'EOF'
          import json
          import os
          import sys

          # LATEST_EXP is exported by the shell above
          with open(os.path.join(os.environ["LATEST_EXP"], "evaluation", "summary.json")) as f:
              summary = json.load(f)

          score = summary.get("overall_score", 0)
          threshold = 0.7

          print(f"Score: {score:.2f}, Threshold: {threshold}")

          if score < threshold:
              print(f"FAIL: Score {score:.2f} below threshold {threshold}")
              sys.exit(1)
          else:
              print(f"PASS: Score {score:.2f} meets threshold {threshold}")
          EOF

      - name: Upload test artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: fluxloop-results
          path: |
            fluxloop/my-agent/experiments/
            fluxloop/my-agent/inputs/
          retention-days: 30

      - name: Comment on PR (if PR)
        if: github.event_name == 'pull_request' && always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const expDir = 'fluxloop/my-agent/experiments/';
            const latest = fs.readdirSync(expDir)
              .filter(f => fs.statSync(expDir + f).isDirectory())
              .sort()
              .reverse()[0];

            const summary = JSON.parse(
              fs.readFileSync(`${expDir}${latest}/evaluation/summary.json`)
            );

            const body = `
            ## FluxLoop Evaluation Results

            **Overall Score:** ${summary.overall_score.toFixed(2)} / 1.00
            **Status:** ${summary.pass_fail_status}
            **Traces:** ${summary.total_traces}

            ### Evaluator Scores
            ${Object.entries(summary.by_evaluator || {}).map(([name, data]) =>
              `- **${name}**: ${data.score.toFixed(2)}`
            ).join('\n')}

            [View Full Report](../actions/runs/${context.runId})
            `;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

Advanced: Comparison with Baseline

Compare results against a baseline:

      - name: Download baseline results
        uses: dawidd6/action-download-artifact@v3
        with:
          workflow: fluxloop-test.yml
          branch: main
          name: baseline-summary
          path: fluxloop/my-agent/baseline/
        continue-on-error: true

      - name: Compare with baseline
        run: |
          cd fluxloop/my-agent
          LATEST_EXP=$(ls -td experiments/*/ | head -1)

          if [ -f "baseline/summary.json" ]; then
            fluxloop evaluate experiment "$LATEST_EXP" \
              --baseline baseline/summary.json
          fi

      - name: Save new baseline (on main)
        if: github.ref == 'refs/heads/main'
        uses: actions/upload-artifact@v4
        with:
          name: baseline-summary
          path: fluxloop/my-agent/experiments/*/evaluation/summary.json

Scheduled Regression Testing

Run tests on a schedule:

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM UTC
  workflow_dispatch:     # Manual trigger

GitLab CI

Basic Pipeline

Create .gitlab-ci.yml:

image: python:3.11

variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

cache:
  paths:
    - .cache/pip
    - venv/

stages:
  - setup
  - test
  - evaluate
  - report

before_script:
  - python -m venv venv
  - source venv/bin/activate
  - pip install fluxloop-cli fluxloop
  - pip install -r requirements.txt

setup:
  stage: setup
  script:
    - cd fluxloop/my-agent
    - fluxloop doctor
    - fluxloop config validate
  artifacts:
    reports:
      dotenv: build.env

generate_inputs:
  stage: setup
  script:
    - cd fluxloop/my-agent
    - fluxloop generate inputs --limit 20 --mode deterministic
  artifacts:
    paths:
      - fluxloop/my-agent/inputs/
    expire_in: 1 week

run_experiment:
  stage: test
  dependencies:
    - generate_inputs
  script:
    - cd fluxloop/my-agent
    - fluxloop run experiment --iterations 1
  artifacts:
    paths:
      - fluxloop/my-agent/experiments/
    expire_in: 1 month

parse_results:
  stage: evaluate
  dependencies:
    - run_experiment
  script:
    - cd fluxloop/my-agent
    - LATEST_EXP=$(ls -td experiments/*/ | head -1)
    - fluxloop parse experiment "$LATEST_EXP"
  artifacts:
    paths:
      - fluxloop/my-agent/experiments/*/per_trace_analysis/
    expire_in: 1 month

evaluate_results:
  stage: evaluate
  dependencies:
    - parse_results
  script:
    - cd fluxloop/my-agent
    - LATEST_EXP=$(ls -td experiments/*/ | head -1)
    - fluxloop evaluate experiment "$LATEST_EXP"
    - export LATEST_EXP
    - |
      python3 << 'EOF'
      import json
      import os
      import sys

      # LATEST_EXP is exported by the script lines above
      with open(os.path.join(os.environ["LATEST_EXP"], "evaluation", "summary.json")) as f:
          summary = json.load(f)

      score = summary.get("overall_score", 0)
      if score < 0.7:
          print(f"FAIL: Score {score:.2f} below 0.7")
          sys.exit(1)
      EOF
  artifacts:
    paths:
      - fluxloop/my-agent/experiments/*/evaluation/
    expire_in: 1 month
    reports:
      junit: fluxloop/my-agent/experiments/*/evaluation/junit.xml

generate_report:
  stage: report
  dependencies:
    - evaluate_results
  script:
    - mkdir -p public
    - cd fluxloop/my-agent
    - LATEST_EXP=$(ls -td experiments/*/ | head -1)
    - cp "$LATEST_EXP/evaluation/report.html" ../../public/index.html
  artifacts:
    paths:
      - public
  only:
    - main

Jenkins

Jenkinsfile

pipeline {
    agent any

    environment {
        OPENAI_API_KEY = credentials('openai-api-key')
        FLUXLOOP_ENABLED = 'true'
    }

    stages {
        stage('Setup') {
            steps {
                sh '''
                    python3 -m venv venv
                    . venv/bin/activate
                    pip install fluxloop-cli fluxloop
                    pip install -r requirements.txt
                '''
            }
        }

        stage('Verify') {
            steps {
                sh '''
                    . venv/bin/activate
                    cd fluxloop/my-agent
                    fluxloop doctor
                    fluxloop config validate
                '''
            }
        }

        stage('Generate Inputs') {
            steps {
                sh '''
                    . venv/bin/activate
                    cd fluxloop/my-agent
                    fluxloop generate inputs --limit 20 --mode deterministic
                '''
            }
        }

        stage('Run Experiment') {
            steps {
                sh '''
                    . venv/bin/activate
                    cd fluxloop/my-agent
                    fluxloop run experiment --iterations 1
                '''
            }
        }

        stage('Evaluate') {
            steps {
                sh '''
                    . venv/bin/activate
                    cd fluxloop/my-agent
                    LATEST_EXP=$(ls -td experiments/*/ | head -1)
                    fluxloop parse experiment "$LATEST_EXP"
                    fluxloop evaluate experiment "$LATEST_EXP"
                '''
            }
        }

        stage('Quality Gate') {
            steps {
                script {
                    // readJSON does not expand globs; resolve the summary path first
                    // (findFiles and readJSON come from the Pipeline Utility Steps plugin)
                    def summaries = findFiles(glob: 'fluxloop/my-agent/experiments/*/evaluation/summary.json')
                    def summary = readJSON file: summaries[0].path
                    def score = summary.overall_score

                    if (score < 0.7) {
                        error("Quality gate failed: score ${score} below 0.7")
                    }
                }
            }
        }
    }

    post {
        always {
            archiveArtifacts artifacts: 'fluxloop/my-agent/experiments/**/*',
                             allowEmptyArchive: true

            publishHTML([
                reportDir: 'fluxloop/my-agent/experiments/*/evaluation/',
                reportFiles: 'report.html',
                reportName: 'FluxLoop Evaluation Report'
            ])
        }
    }
}

Docker Integration

Dockerfile

FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir \
        fluxloop-cli \
        fluxloop \
        -r requirements.txt

# Copy project
COPY . .

# Verify the FluxLoop setup at build time
RUN cd fluxloop/my-agent && \
    fluxloop doctor

# Default command
CMD ["bash", "-c", "cd fluxloop/my-agent && fluxloop run experiment"]

Docker Compose for Testing

version: '3.8'

services:
  fluxloop-test:
    build: .
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - FLUXLOOP_ENABLED=true
    volumes:
      - ./fluxloop:/app/fluxloop
      - test-results:/app/fluxloop/my-agent/experiments
    command: |
      bash -c "
        cd fluxloop/my-agent &&
        fluxloop generate inputs --limit 20 --mode deterministic &&
        fluxloop run experiment --iterations 1 &&
        fluxloop parse experiment experiments/*/ &&
        fluxloop evaluate experiment experiments/*/
      "

volumes:
  test-results:

Run tests:

docker-compose run fluxloop-test

Best Practices

1. Use Deterministic Mode in CI

# configs/input.yaml (for CI)
input_generation:
  mode: deterministic  # or commit pre-generated inputs

# configs/simulation.yaml
seed: 42  # Fixed seed for reproducibility

2. Separate Test Configs from Production

fluxloop/my-agent/
├── configs/             # Production configs
│   ├── project.yaml
│   ├── input.yaml
│   ├── simulation.yaml
│   └── evaluation.yaml
└── configs-ci/          # CI-specific configs
    ├── input.yaml       # Deterministic, fewer inputs
    ├── simulation.yaml  # Lower iterations, fixed seed
    └── evaluation.yaml  # Stricter thresholds

Run with CI configs:

# Override config directory
cp -r configs-ci/* configs/
fluxloop run experiment

3. Cache Wisely

Cache these:

  • Python packages (pip cache)
  • Generated inputs (if deterministic)
  • MCP index (~/.fluxloop/mcp/index/)

Don't cache:

  • Experiment outputs
  • Traces
  • Evaluation results

4. Store Secrets Securely

GitHub Actions:

env:
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

GitLab CI (define OPENAI_API_KEY as a masked CI/CD variable under Settings → CI/CD → Variables; GitLab injects it into every job as an environment variable, so it never appears in .gitlab-ci.yml):

variables:
  FLUXLOOP_ENABLED: "true"  # non-secret values can live in .gitlab-ci.yml
# OPENAI_API_KEY comes from the project's CI/CD variables, not from this file

Jenkins:

environment {
    OPENAI_API_KEY = credentials('openai-api-key')
}

5. Set Quality Gates

# check_quality.py
import glob
import json
import os
import sys

# Locate the most recent experiment directory (same convention as the CLI steps above)
latest = sorted(glob.glob("experiments/*/"), key=os.path.getmtime)[-1]

with open(os.path.join(latest, "evaluation", "summary.json")) as f:
    summary = json.load(f)

score = summary["overall_score"]
threshold = 0.7

# Individual evaluators can be gated as well
intent_score = summary["by_evaluator"]["intent_recognition"]["score"]
token_score = summary["by_evaluator"]["token_budget"]["score"]

if score < threshold:
    print(f"❌ FAIL: Overall score {score:.2f} < {threshold}")
    sys.exit(1)

if intent_score < 0.8:
    print(f"❌ FAIL: Intent recognition {intent_score:.2f} < 0.8")
    sys.exit(1)

print("✅ PASS: All quality gates passed")

Monitoring and Reporting

Track Metrics Over Time

Store evaluation results in a time-series database:

# upload_metrics.py
import glob
import json
import os
from datetime import datetime

import requests

# Locate the most recent experiment directory
latest = sorted(glob.glob("experiments/*/"), key=os.path.getmtime)[-1]

with open(os.path.join(latest, "evaluation", "summary.json")) as f:
    summary = json.load(f)

# Send to monitoring system
requests.post("https://metrics.example.com/fluxloop", json={
    "timestamp": datetime.now().isoformat(),
    "project": "my-agent",
    "branch": os.getenv("CI_COMMIT_BRANCH"),
    "overall_score": summary["overall_score"],
    "by_evaluator": summary["by_evaluator"],
    "total_traces": summary["total_traces"],
})

Generate Trend Reports

# Compare last N runs
python3 << 'EOF'
import glob
import json

experiments = sorted(glob.glob("experiments/*/evaluation/summary.json"))[-10:]

for exp in experiments:
    with open(exp) as f:
        summary = json.load(f)
    print(f"{exp}: {summary['overall_score']:.2f}")
EOF

Troubleshooting CI/CD

Tests Pass Locally but Fail in CI

Common causes:

  1. Missing environment variables

    # Check what's set
    fluxloop config env
  2. Different Python version

    # Pin Python version
    python-version: '3.11'
  3. Non-deterministic inputs

    # Use fixed seed
    seed: 42
    mode: deterministic

Slow CI Runs

Optimizations:

  1. Reduce test scope

    # Fewer inputs
    fluxloop generate inputs --limit 10

    # Single iteration
    fluxloop run experiment --iterations 1
  2. Sample LLM evaluations

     # configs/evaluation.yaml
     limits:
       sample_rate: 0.2  # Only 20% of traces
       max_llm_calls: 10
  3. Cache dependencies

     - uses: actions/cache@v4
       with:
         path: ~/.cache/pip
         key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}

Example: Full GitHub Actions Workflow

Comprehensive example with all best practices:

name: FluxLoop CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.11', '3.12']

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'

      - name: Cache MCP index
        uses: actions/cache@v4
        with:
          path: ~/.fluxloop/mcp/index
          key: mcp-index-v1

      - name: Install dependencies
        run: |
          pip install fluxloop-cli fluxloop fluxloop-mcp
          pip install -r requirements.txt

      - name: Verify installation
        run: fluxloop doctor

      - name: Prepare test environment
        run: |
          cd fluxloop/my-agent
          cp -r configs-ci/* configs/
          fluxloop config validate

      - name: Run tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          cd fluxloop/my-agent
          fluxloop generate inputs --limit 20 --mode deterministic
          fluxloop run experiment --iterations 1

          LATEST=$(ls -td experiments/*/ | head -1)
          fluxloop parse experiment "$LATEST"
          fluxloop evaluate experiment "$LATEST"

      - name: Quality gate
        run: |
          cd fluxloop/my-agent
          python3 scripts/check_quality.py

      - name: Upload artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: results-py${{ matrix.python-version }}
          path: fluxloop/my-agent/experiments/

      - name: Report to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            // ... (PR comment script from earlier)

See Also