CI/CD Integration
Integrate FluxLoop into CI/CD pipelines for automated agent testing and regression detection.
Overview
FluxLoop can be integrated into CI/CD pipelines to:
- Automate regression testing after code changes
- Validate agent quality before deployment
- Track performance trends over time
- Enforce quality gates based on evaluation scores
- Generate test reports for stakeholders
This guide covers GitHub Actions, GitLab CI, and generic CI/CD setup.
Quick Setup
Prerequisites
- FluxLoop project with configuration in configs/
- Agent code instrumented with the FluxLoop SDK
- Base inputs defined in configs/input.yaml
- Evaluators configured in configs/evaluation.yaml
- API keys (OpenAI, Anthropic, etc.) stored as secrets
Key Principles
For CI/CD environments:
- Use deterministic mode for input generation (or commit pre-generated inputs)
- Set a fixed seed in configs/simulation.yaml for reproducibility
- Cache dependencies (Python packages, MCP index)
- Store API keys as secrets, not in code
- Generate artifacts (reports, traces) for review
- Fail the pipeline if evaluation scores fall below your threshold
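The reproducibility principles above can be enforced with a small pre-flight check before any experiment runs. The sketch below is illustrative only: it assumes PyYAML is installed and that your configs use the seed and input_generation.mode keys shown later in the Best Practices section.

# ci_preflight.py (illustrative sketch; key names mirror the Best Practices section)
import sys
import yaml  # requires PyYAML

with open("configs/simulation.yaml") as f:
    sim = yaml.safe_load(f) or {}
with open("configs/input.yaml") as f:
    inputs = yaml.safe_load(f) or {}

errors = []
if sim.get("seed") is None:
    errors.append("configs/simulation.yaml: no fixed seed set")
if inputs.get("input_generation", {}).get("mode") != "deterministic":
    errors.append("configs/input.yaml: input_generation.mode is not deterministic")

if errors:
    print("Pre-flight check failed:")
    for err in errors:
        print(f"  - {err}")
    sys.exit(1)
print("Pre-flight check passed")

Run it from the project directory (cd fluxloop/my-agent && python3 ci_preflight.py) as an early CI step so misconfigured pipelines fail before spending API credits.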
GitHub Actions
Basic Workflow
Create .github/workflows/fluxloop-test.yml:
name: FluxLoop Agent Tests
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
jobs:
test-agent:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
cache: 'pip'
- name: Install dependencies
run: |
pip install fluxloop-cli fluxloop
pip install -r requirements.txt
- name: Verify setup
run: fluxloop doctor
- name: Generate inputs (deterministic)
run: |
cd fluxloop/my-agent
fluxloop generate inputs --limit 20 --mode deterministic
- name: Run experiment
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
FLUXLOOP_ENABLED: 'true'
run: |
cd fluxloop/my-agent
fluxloop run experiment --iterations 1
- name: Parse results
run: |
cd fluxloop/my-agent
LATEST_EXP=$(ls -td experiments/*/ | head -1)
fluxloop parse experiment "$LATEST_EXP"
- name: Evaluate results
run: |
cd fluxloop/my-agent
LATEST_EXP=$(ls -td experiments/*/ | head -1)
fluxloop evaluate experiment "$LATEST_EXP"
      - name: Check evaluation threshold
        run: |
          cd fluxloop/my-agent
          LATEST_EXP=$(ls -td experiments/*/ | head -1)
          export LATEST_EXP
          python3 << 'EOF'
          import json
          import os
          import sys
          exp_dir = os.environ["LATEST_EXP"]
          with open(os.path.join(exp_dir, "evaluation", "summary.json")) as f:
              summary = json.load(f)
          score = summary.get("overall_score", 0)
          threshold = 0.7
          print(f"Score: {score:.2f}, Threshold: {threshold}")
          if score < threshold:
              print(f"FAIL: Score {score:.2f} below threshold {threshold}")
              sys.exit(1)
          else:
              print(f"PASS: Score {score:.2f} meets threshold {threshold}")
          EOF
- name: Upload test artifacts
if: always()
uses: actions/upload-artifact@v4
with:
name: fluxloop-results
path: |
fluxloop/my-agent/experiments/
fluxloop/my-agent/inputs/
retention-days: 30
- name: Comment on PR (if PR)
if: github.event_name == 'pull_request' && always()
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const expDir = 'fluxloop/my-agent/experiments/';
const latest = fs.readdirSync(expDir)
.filter(f => fs.statSync(expDir + f).isDirectory())
.sort()
.reverse()[0];
const summary = JSON.parse(
fs.readFileSync(`${expDir}${latest}/evaluation/summary.json`)
);
const body = `
## FluxLoop Evaluation Results
**Overall Score:** ${summary.overall_score.toFixed(2)} / 1.00
**Status:** ${summary.pass_fail_status}
**Traces:** ${summary.total_traces}
### Evaluator Scores
${Object.entries(summary.by_evaluator || {}).map(([name, data]) =>
`- **${name}**: ${data.score.toFixed(2)}`
).join('\n')}
[View Full Report](../actions/runs/${context.runId})
`;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
Advanced: Comparison with Baseline
Compare results against a baseline:
- name: Download baseline results
uses: dawidd6/action-download-artifact@v3
with:
workflow: fluxloop-test.yml
branch: main
name: baseline-summary
path: baseline/
continue-on-error: true
- name: Compare with baseline
run: |
cd fluxloop/my-agent
LATEST_EXP=$(ls -td experiments/*/ | head -1)
if [ -f "baseline/summary.json" ]; then
fluxloop evaluate experiment "$LATEST_EXP" \
--baseline baseline/summary.json
fi
- name: Save new baseline (on main)
if: github.ref == 'refs/heads/main'
uses: actions/upload-artifact@v4
with:
name: baseline-summary
path: fluxloop/my-agent/experiments/*/evaluation/summary.json
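If your FluxLoop CLI version does not support the --baseline option, the same comparison can be done with a short script against the downloaded baseline artifact. This is a minimal sketch, assuming both files contain the overall_score field used throughout this guide; the 0.05 regression margin is an example value.

# compare_baseline.py (minimal sketch; the 0.05 margin is an example value)
import json
import sys

def load_score(path):
    with open(path) as f:
        return json.load(f).get("overall_score", 0.0)

current = load_score(sys.argv[1])   # e.g. "$LATEST_EXP/evaluation/summary.json"
baseline = load_score(sys.argv[2])  # e.g. baseline/summary.json

print(f"Current: {current:.2f}  Baseline: {baseline:.2f}")
if current < baseline - 0.05:
    print(f"FAIL: score regressed from {baseline:.2f} to {current:.2f}")
    sys.exit(1)
print("PASS: no significant regression against baseline")

Call it from the "Compare with baseline" step in place of the --baseline flag, e.g. python3 compare_baseline.py "$LATEST_EXP/evaluation/summary.json" baseline/summary.json.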
Scheduled Regression Testing
Run tests on a schedule:
on:
schedule:
- cron: '0 2 * * *' # Daily at 2 AM UTC
workflow_dispatch: # Manual trigger
GitLab CI
Basic Pipeline
Create .gitlab-ci.yml:
image: python:3.11
variables:
PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
cache:
paths:
- .cache/pip
- venv/
stages:
- setup
- test
- evaluate
- report
before_script:
- python -m venv venv
- source venv/bin/activate
- pip install fluxloop-cli fluxloop
- pip install -r requirements.txt
setup:
stage: setup
script:
- cd fluxloop/my-agent
- fluxloop doctor
- fluxloop config validate
generate_inputs:
stage: setup
script:
- cd fluxloop/my-agent
- fluxloop generate inputs --limit 20 --mode deterministic
artifacts:
paths:
- fluxloop/my-agent/inputs/
expire_in: 1 week
run_experiment:
stage: test
dependencies:
- generate_inputs
script:
- cd fluxloop/my-agent
- fluxloop run experiment --iterations 1
artifacts:
paths:
- fluxloop/my-agent/experiments/
expire_in: 1 month
parse_results:
stage: evaluate
  needs:
    - run_experiment
script:
- cd fluxloop/my-agent
- LATEST_EXP=$(ls -td experiments/*/ | head -1)
- fluxloop parse experiment "$LATEST_EXP"
artifacts:
paths:
- fluxloop/my-agent/experiments/*/per_trace_analysis/
expire_in: 1 month
evaluate_results:
stage: evaluate
  needs:
    - run_experiment
    - parse_results
script:
- cd fluxloop/my-agent
    - export LATEST_EXP=$(ls -td experiments/*/ | head -1)
    - fluxloop evaluate experiment "$LATEST_EXP"
    - |
      python3 << 'EOF'
      import json
      import os
      import sys
      with open(os.path.join(os.environ["LATEST_EXP"], "evaluation", "summary.json")) as f:
          summary = json.load(f)
      score = summary.get("overall_score", 0)
      if score < 0.7:
          print(f"FAIL: Score {score:.2f} below 0.7")
          sys.exit(1)
      EOF
artifacts:
paths:
- fluxloop/my-agent/experiments/*/evaluation/
expire_in: 1 month
reports:
junit: fluxloop/my-agent/experiments/*/evaluation/junit.xml
generate_report:
stage: report
dependencies:
- evaluate_results
script:
    - mkdir -p public
    - LATEST_EXP=$(ls -td fluxloop/my-agent/experiments/*/ | head -1)
    - cp "$LATEST_EXP/evaluation/report.html" public/index.html
artifacts:
paths:
- public
only:
- main
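The junit report path declared in evaluate_results assumes a JUnit XML file is present in the evaluation output. If your FluxLoop version does not emit one, a small converter can produce it from summary.json. The sketch below is an assumption-laden example (the by_evaluator layout mirrors the summary structure used elsewhere in this guide), not a built-in FluxLoop feature.

# summary_to_junit.py (illustrative converter; not part of FluxLoop)
import json
import sys
import xml.etree.ElementTree as ET

summary_path, junit_path, threshold = sys.argv[1], sys.argv[2], float(sys.argv[3])

with open(summary_path) as f:
    summary = json.load(f)

evaluators = summary.get("by_evaluator", {})
suite = ET.Element("testsuite", name="fluxloop-evaluation")
failures = 0
for name, data in evaluators.items():
    score = data.get("score", 0.0)
    case = ET.SubElement(suite, "testcase", classname="fluxloop", name=name)
    if score < threshold:
        failures += 1
        failure = ET.SubElement(case, "failure", message=f"score {score:.2f} below {threshold}")
        failure.text = json.dumps(data)

suite.set("tests", str(len(evaluators)))
suite.set("failures", str(failures))
ET.ElementTree(suite).write(junit_path, encoding="utf-8", xml_declaration=True)

Add it as an extra script line in evaluate_results, e.g. python3 summary_to_junit.py "$LATEST_EXP/evaluation/summary.json" "$LATEST_EXP/evaluation/junit.xml" 0.7.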
Jenkins
Jenkinsfile
pipeline {
agent any
environment {
OPENAI_API_KEY = credentials('openai-api-key')
FLUXLOOP_ENABLED = 'true'
}
stages {
stage('Setup') {
steps {
sh '''
python3 -m venv venv
. venv/bin/activate
pip install fluxloop-cli fluxloop
pip install -r requirements.txt
'''
}
}
stage('Verify') {
steps {
sh '''
. venv/bin/activate
cd fluxloop/my-agent
fluxloop doctor
fluxloop config validate
'''
}
}
stage('Generate Inputs') {
steps {
sh '''
. venv/bin/activate
cd fluxloop/my-agent
fluxloop generate inputs --limit 20 --mode deterministic
'''
}
}
stage('Run Experiment') {
steps {
sh '''
. venv/bin/activate
cd fluxloop/my-agent
fluxloop run experiment --iterations 1
'''
}
}
stage('Evaluate') {
steps {
sh '''
. venv/bin/activate
cd fluxloop/my-agent
LATEST_EXP=$(ls -td experiments/*/ | head -1)
fluxloop parse experiment "$LATEST_EXP"
fluxloop evaluate experiment "$LATEST_EXP"
'''
}
}
stage('Quality Gate') {
steps {
script {
                    def latest = sh(
                        script: "ls -td fluxloop/my-agent/experiments/*/ | head -1",
                        returnStdout: true
                    ).trim()
                    def summary = readJSON file: "${latest}evaluation/summary.json"
                    def score = summary.overall_score
if (score < 0.7) {
error("Quality gate failed: score ${score} below 0.7")
}
}
}
}
}
post {
always {
archiveArtifacts artifacts: 'fluxloop/my-agent/experiments/**/*',
allowEmptyArchive: true
publishHTML([
reportDir: 'fluxloop/my-agent/experiments/*/evaluation/',
reportFiles: 'report.html',
reportName: 'FluxLoop Evaluation Report'
])
}
}
}
Docker Integration
Dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir \
fluxloop-cli \
fluxloop \
-r requirements.txt
# Copy project
COPY . .
# Set up FluxLoop
RUN cd fluxloop/my-agent && \
fluxloop doctor
# Default command
CMD ["bash", "-c", "cd fluxloop/my-agent && fluxloop run experiment"]
Docker Compose for Testing
version: '3.8'
services:
fluxloop-test:
build: .
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY}
- FLUXLOOP_ENABLED=true
volumes:
- ./fluxloop:/app/fluxloop
- test-results:/app/fluxloop/my-agent/experiments
command: |
bash -c "
cd fluxloop/my-agent &&
fluxloop generate inputs --limit 20 --mode deterministic &&
fluxloop run experiment --iterations 1 &&
fluxloop parse experiment experiments/*/ &&
fluxloop evaluate experiment experiments/*/
"
volumes:
test-results:
Run tests:
docker-compose run fluxloop-test
Best Practices
1. Use Deterministic Mode in CI
# configs/input.yaml (for CI)
input_generation:
mode: deterministic # or commit pre-generated inputs
# configs/simulation.yaml
seed: 42 # Fixed seed for reproducibility
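To confirm that input generation really is deterministic in CI, you can fingerprint the generated inputs and compare the hash across runs. The sketch below simply hashes everything under inputs/ (the directory used for artifacts earlier in this guide); comparing against a committed inputs.sha256 file is an illustrative convention, not a FluxLoop feature.

# hash_inputs.py (illustrative; compares a fingerprint of inputs/ against a committed value)
import hashlib
import pathlib
import sys

inputs_dir = pathlib.Path("inputs")
if not inputs_dir.is_dir():
    print("inputs/ directory not found; run fluxloop generate inputs first")
    sys.exit(1)

digest = hashlib.sha256()
for path in sorted(inputs_dir.rglob("*")):
    if path.is_file():
        digest.update(str(path.relative_to(inputs_dir)).encode())
        digest.update(path.read_bytes())

fingerprint = digest.hexdigest()
print(f"inputs fingerprint: {fingerprint}")

expected_file = pathlib.Path("inputs.sha256")
if expected_file.exists() and expected_file.read_text().strip() != fingerprint:
    print("FAIL: generated inputs differ from the committed fingerprint")
    sys.exit(1)

Run it right after fluxloop generate inputs, and commit inputs.sha256 once the generated inputs are stable.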
2. Separate Test Configs from Production
fluxloop/my-agent/
├── configs/ # Production configs
│ ├── project.yaml
│ ├── input.yaml
│ ├── simulation.yaml
│ └── evaluation.yaml
└── configs-ci/ # CI-specific configs
├── input.yaml # Deterministic, fewer inputs
├── simulation.yaml # Lower iterations, fixed seed
└── evaluation.yaml # Stricter thresholds
Run with CI configs:
# Override config directory
cp -r configs-ci/* configs/
fluxloop run experiment
3. Cache Wisely
Cache these:
- Python packages (pip cache)
- Generated inputs (if deterministic)
- MCP index (~/.fluxloop/mcp/index/)
Don't cache:
- Experiment outputs
- Traces
- Evaluation results
4. Store Secrets Securely
GitHub Actions:
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
GitLab CI: define OPENAI_API_KEY as a masked CI/CD variable (Settings > CI/CD > Variables); it is then available to jobs as $OPENAI_API_KEY without appearing in .gitlab-ci.yml.
Jenkins:
environment {
OPENAI_API_KEY = credentials('openai-api-key')
}
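Whichever CI system you use, it helps to fail fast with a clear message when a secret is missing, without ever printing its value. A minimal check, assuming the environment variable names used in this guide:

# check_secrets.py (minimal sketch; extend REQUIRED with whatever keys your agent needs)
import os
import sys

REQUIRED = ["OPENAI_API_KEY"]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    # Print only the variable names, never the values
    print(f"Missing required secrets: {', '.join(missing)}")
    sys.exit(1)
print("All required secrets are present")

Run it as an early step (python3 check_secrets.py) in any of the pipelines above.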
5. Set Quality Gates
# check_quality.py
import json
import sys
with open("experiments/latest/evaluation/summary.json") as f:
summary = json.load(f)
score = summary["overall_score"]
threshold = 0.7
# Can also check individual evaluators
intent_score = summary["by_evaluator"]["intent_recognition"]["score"]
token_budget_score = summary["by_evaluator"]["token_budget"]["score"]
if score < threshold:
    print(f"❌ FAIL: Overall score {score:.2f} < {threshold}")
    sys.exit(1)
if intent_score < 0.8:
    print(f"❌ FAIL: Intent recognition {intent_score:.2f} < 0.8")
    sys.exit(1)
if token_budget_score < 0.8:
    print(f"❌ FAIL: Token budget {token_budget_score:.2f} < 0.8")
    sys.exit(1)
print(f"✅ PASS: All quality gates passed")
Monitoring and Reporting
Track Metrics Over Time
Store evaluation results in a time-series database:
# upload_metrics.py
import json
import os
import requests
from datetime import datetime
with open("experiments/latest/evaluation/summary.json") as f:
summary = json.load(f)
# Send to monitoring system
requests.post("https://metrics.example.com/fluxloop", json={
"timestamp": datetime.now().isoformat(),
"project": "my-agent",
"branch": os.getenv("CI_COMMIT_BRANCH"),
"overall_score": summary["overall_score"],
"by_evaluator": summary["by_evaluator"],
"total_traces": summary["total_traces"],
})
Generate Trend Reports
# Compare last N runs
python3 << 'EOF'
import json
import glob
experiments = sorted(glob.glob("experiments/*/evaluation/summary.json"))[-10:]
for exp in experiments:
with open(exp) as f:
summary = json.load(f)
print(f"{exp}: {summary['overall_score']:.2f}")
EOF
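Beyond printing the scores, the same data can drive simple regression detection over time. The sketch below flags a run whose score drops noticeably below the rolling average of the previous runs; the 0.05 margin and three-run window are example values, and it assumes past summaries are still available (for example, via downloaded artifacts).

# trend_check.py (illustrative; margin and window size are example values)
import glob
import json
import sys

paths = sorted(glob.glob("experiments/*/evaluation/summary.json"))
scores = []
for path in paths:
    with open(path) as f:
        scores.append(json.load(f).get("overall_score", 0.0))

if len(scores) >= 4:
    latest = scores[-1]
    window = scores[-4:-1]  # the three runs before the latest
    average = sum(window) / len(window)
    print(f"Latest: {latest:.2f}  Rolling average: {average:.2f}")
    if latest < average - 0.05:
        print("FAIL: latest score is more than 0.05 below the recent average")
        sys.exit(1)
print("No regression detected")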
Troubleshooting CI/CD
Tests Pass Locally but Fail in CI
Common causes:
- Missing environment variables

  # Check what's set
  fluxloop config env

- Different Python version

  # Pin Python version
  python-version: '3.11'

- Non-deterministic inputs

  # Use fixed seed
  seed: 42
  mode: deterministic
Slow CI Runs
Optimizations:
- Reduce test scope

  # Fewer inputs
  fluxloop generate inputs --limit 10
  # Single iteration
  fluxloop run experiment --iterations 1

- Sample LLM evaluations

  # configs/evaluation.yaml
  limits:
    sample_rate: 0.2  # Only 20% of traces
    max_llm_calls: 10

- Cache dependencies

  - uses: actions/cache@v4
    with:
      path: ~/.cache/pip
      key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
Example: Full GitHub Actions Workflow
Comprehensive example with all best practices:
name: FluxLoop CI
on:
push:
branches: [main]
pull_request:
branches: [main]
schedule:
- cron: '0 2 * * *' # Daily at 2 AM
jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.11', '3.12']
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: 'pip'
- name: Cache MCP index
uses: actions/cache@v4
with:
path: ~/.fluxloop/mcp/index
key: mcp-index-v1
- name: Install dependencies
run: |
pip install fluxloop-cli fluxloop fluxloop-mcp
pip install -r requirements.txt
- name: Verify installation
run: fluxloop doctor
- name: Prepare test environment
run: |
cd fluxloop/my-agent
cp -r configs-ci/* configs/
fluxloop config validate
- name: Run tests
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
cd fluxloop/my-agent
fluxloop generate inputs --limit 20 --mode deterministic
fluxloop run experiment --iterations 1
LATEST=$(ls -td experiments/*/ | head -1)
fluxloop parse experiment "$LATEST"
fluxloop evaluate experiment "$LATEST"
- name: Quality gate
run: |
cd fluxloop/my-agent
python3 scripts/check_quality.py
- name: Upload artifacts
if: always()
uses: actions/upload-artifact@v4
with:
name: results-py${{ matrix.python-version }}
path: fluxloop/my-agent/experiments/
- name: Report to PR
if: github.event_name == 'pull_request'
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
// ... (PR comment script from earlier)
See Also
- Basic Workflow - Standard FluxLoop workflow
- Multi-Turn Workflow - Dynamic conversations
- Commands Reference - CLI command documentation
- Evaluation Configuration - Success criteria and thresholds