CI/CD Integration
Integrate FluxLoop into CI/CD pipelines for automated agent testing and regression detection.
Overview
FluxLoop can be integrated into CI/CD pipelines to:
- Automate regression testing after code changes
- Validate agent quality before deployment
- Track performance trends over time
- Enforce quality gates based on evaluation scores
- Generate test reports for stakeholders
This guide covers GitHub Actions, GitLab CI, and generic CI/CD setup.
Quick Setup
Prerequisites
- FluxLoop project with configuration in configs/
- Agent code instrumented with the FluxLoop SDK
- Base inputs defined in configs/input.yaml
- Evaluators configured in configs/evaluation.yaml
- API keys (OpenAI, Anthropic, etc.) stored as secrets
Key Principles
For CI/CD environments:
- Use deterministic mode for input generation (or commit pre-generated inputs)
- Set a fixed seed in configs/simulation.yaml for reproducibility
- Cache dependencies (Python packages, MCP index)
- Store API keys as secrets, not in code
- Generate artifacts (reports, traces) for review
- Fail the pipeline if evaluation scores fall below your threshold
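The reproducibility principles above can be enforced with a small pre-flight check before any experiment runs. The sketch below is illustrative only: it assumes PyYAML is installed and that your configs use the seed and input_generation.mode keys shown later in the Best Practices section.

# ci_preflight.py (illustrative sketch; key names mirror the Best Practices section)
import sys
import yaml  # requires PyYAML

with open("configs/simulation.yaml") as f:
    sim = yaml.safe_load(f) or {}
with open("configs/input.yaml") as f:
    inputs = yaml.safe_load(f) or {}

errors = []
if sim.get("seed") is None:
    errors.append("configs/simulation.yaml: no fixed seed set")
if inputs.get("input_generation", {}).get("mode") != "deterministic":
    errors.append("configs/input.yaml: input_generation.mode is not deterministic")

if errors:
    print("Pre-flight check failed:")
    for err in errors:
        print(f"  - {err}")
    sys.exit(1)
print("Pre-flight check passed")

Run it from the project directory (cd fluxloop/my-agent && python3 ci_preflight.py) as an early CI step so misconfigured pipelines fail before spending API credits.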
GitHub Actions
Basic Workflow
Create .github/workflows/fluxloop-test.yml:
name: FluxLoop Agent Tests
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
jobs:
test-agent:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
cache: 'pip'
- name: Install dependencies
run: |
pip install fluxloop-cli fluxloop
pip install -r requirements.txt
- name: Verify setup
run: fluxloop doctor
- name: Generate inputs (deterministic)
run: |
cd fluxloop/my-agent
fluxloop generate inputs --limit 20 --mode deterministic
- name: Run experiment
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
FLUXLOOP_ENABLED: 'true'
run: |
cd fluxloop/my-agent
fluxloop run experiment --iterations 1
- name: Parse results
run: |
cd fluxloop/my-agent
LATEST_EXP=$(ls -td experiments/*/ | head -1)
fluxloop parse experiment "$LATEST_EXP"
- name: Evaluate results
run: |
cd fluxloop/my-agent
LATEST_EXP=$(ls -td experiments/*/ | head -1)
fluxloop evaluate experiment "$LATEST_EXP"
      - name: Check evaluation threshold
        run: |
          cd fluxloop/my-agent
          LATEST_EXP=$(ls -td experiments/*/ | head -1)
          export LATEST_EXP
          python3 << 'EOF'
          import json
          import os
          import sys
          exp_dir = os.environ["LATEST_EXP"]
          with open(os.path.join(exp_dir, "evaluation", "summary.json")) as f:
              summary = json.load(f)
          score = summary.get("overall_score", 0)
          threshold = 0.7
          print(f"Score: {score:.2f}, Threshold: {threshold}")
          if score < threshold:
              print(f"FAIL: Score {score:.2f} below threshold {threshold}")
              sys.exit(1)
          else:
              print(f"PASS: Score {score:.2f} meets threshold {threshold}")
          EOF
- name: Upload test artifacts
if: always()
uses: actions/upload-artifact@v4
with:
name: fluxloop-results
path: |
fluxloop/my-agent/experiments/
fluxloop/my-agent/inputs/
retention-days: 30
- name: Comment on PR (if PR)
if: github.event_name == 'pull_request' && always()
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const expDir = 'fluxloop/my-agent/experiments/';
const latest = fs.readdirSync(expDir)
.filter(f => fs.statSync(expDir + f).isDirectory())
.sort()
.reverse()[0];
const summary = JSON.parse(
fs.readFileSync(`${expDir}${latest}/evaluation/summary.json`)
);
const body = `
## FluxLoop Evaluation Results
**Overall Score:** ${summary.overall_score.toFixed(2)} / 1.00
**Status:** ${summary.pass_fail_status}
**Traces:** ${summary.total_traces}
### Evaluator Scores
${Object.entries(summary.by_evaluator || {}).map(([name, data]) =>
`- **${name}**: ${data.score.toFixed(2)}`
).join('\n')}
[View Full Report](../actions/runs/${context.runId})
`;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
Advanced: Comparison with Baseline
Compare results against a baseline:
- name: Download baseline results
uses: dawidd6/action-download-artifact@v3
with:
workflow: fluxloop-test.yml
branch: main
name: baseline-summary
path: baseline/
continue-on-error: true
- name: Compare with baseline
run: |
cd fluxloop/my-agent
LATEST_EXP=$(ls -td experiments/*/ | head -1)
if [ -f "baseline/summary.json" ]; then
fluxloop evaluate experiment "$LATEST_EXP" \
--baseline baseline/summary.json
fi
- name: Save new baseline (on main)
if: github.ref == 'refs/heads/main'
uses: actions/upload-artifact@v4
with:
name: baseline-summary
path: fluxloop/my-agent/experiments/*/evaluation/summary.json
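If your FluxLoop CLI version does not support the --baseline option, the same comparison can be done with a short script against the downloaded baseline artifact. This is a minimal sketch, assuming both files contain the overall_score field used throughout this guide; the 0.05 regression margin is an example value.

# compare_baseline.py (minimal sketch; the 0.05 margin is an example value)
import json
import sys

def load_score(path):
    with open(path) as f:
        return json.load(f).get("overall_score", 0.0)

current = load_score(sys.argv[1])   # e.g. "$LATEST_EXP/evaluation/summary.json"
baseline = load_score(sys.argv[2])  # e.g. baseline/summary.json

print(f"Current: {current:.2f}  Baseline: {baseline:.2f}")
if current < baseline - 0.05:
    print(f"FAIL: score regressed from {baseline:.2f} to {current:.2f}")
    sys.exit(1)
print("PASS: no significant regression against baseline")

Call it from the "Compare with baseline" step in place of the --baseline flag, e.g. python3 compare_baseline.py "$LATEST_EXP/evaluation/summary.json" baseline/summary.json.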
Scheduled Regression Testing
Run tests on a schedule:
on:
schedule:
- cron: '0 2 * * *' # Daily at 2 AM UTC
workflow_dispatch: # Manual trigger
GitLab CI
Basic Pipeline
Create .gitlab-ci.yml:
image: python:3.11
variables:
PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
cache:
paths:
- .cache/pip
- venv/
stages:
- setup
- test
- evaluate
- report
before_script:
- python -m venv venv
- source venv/bin/activate
- pip install fluxloop-cli fluxloop
- pip install -r requirements.txt
setup:
stage: setup
script:
- cd fluxloop/my-agent
- fluxloop doctor
- fluxloop config validate
generate_inputs:
stage: setup
script:
- cd fluxloop/my-agent
- fluxloop generate inputs --limit 20 --mode deterministic
artifacts:
paths:
- fluxloop/my-agent/inputs/
expire_in: 1 week
run_experiment:
stage: test
dependencies:
- generate_inputs
script:
- cd fluxloop/my-agent
- fluxloop run experiment --iterations 1
artifacts:
paths:
- fluxloop/my-agent/experiments/
expire_in: 1 month
parse_results:
stage: evaluate
  needs:
    - run_experiment
script:
- cd fluxloop/my-agent
- LATEST_EXP=$(ls -td experiments/*/ | head -1)
- fluxloop parse experiment "$LATEST_EXP"
artifacts:
paths:
- fluxloop/my-agent/experiments/*/per_trace_analysis/
expire_in: 1 month
evaluate_results:
stage: evaluate
  needs:
    - run_experiment
    - parse_results
script:
- cd fluxloop/my-agent
    - export LATEST_EXP=$(ls -td experiments/*/ | head -1)
    - fluxloop evaluate experiment "$LATEST_EXP"
    - |
      python3 << 'EOF'
      import json
      import os
      import sys
      with open(os.path.join(os.environ["LATEST_EXP"], "evaluation", "summary.json")) as f:
          summary = json.load(f)
      score = summary.get("overall_score", 0)
      if score < 0.7:
          print(f"FAIL: Score {score:.2f} below 0.7")
          sys.exit(1)
      EOF
artifacts:
paths:
- fluxloop/my-agent/experiments/*/evaluation/
expire_in: 1 month
reports:
junit: fluxloop/my-agent/experiments/*/evaluation/junit.xml
generate_report:
stage: report
dependencies:
- evaluate_results
script:
    - mkdir -p public
    - LATEST_EXP=$(ls -td fluxloop/my-agent/experiments/*/ | head -1)
    - cp "$LATEST_EXP/evaluation/report.html" public/index.html
artifacts:
paths:
- public
only:
- main
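The junit report path declared in evaluate_results assumes a JUnit XML file is present in the evaluation output. If your FluxLoop version does not emit one, a small converter can produce it from summary.json. The sketch below is an assumption-laden example (the by_evaluator layout mirrors the summary structure used elsewhere in this guide), not a built-in FluxLoop feature.

# summary_to_junit.py (illustrative converter; not part of FluxLoop)
import json
import sys
import xml.etree.ElementTree as ET

summary_path, junit_path, threshold = sys.argv[1], sys.argv[2], float(sys.argv[3])

with open(summary_path) as f:
    summary = json.load(f)

evaluators = summary.get("by_evaluator", {})
suite = ET.Element("testsuite", name="fluxloop-evaluation")
failures = 0
for name, data in evaluators.items():
    score = data.get("score", 0.0)
    case = ET.SubElement(suite, "testcase", classname="fluxloop", name=name)
    if score < threshold:
        failures += 1
        failure = ET.SubElement(case, "failure", message=f"score {score:.2f} below {threshold}")
        failure.text = json.dumps(data)

suite.set("tests", str(len(evaluators)))
suite.set("failures", str(failures))
ET.ElementTree(suite).write(junit_path, encoding="utf-8", xml_declaration=True)

Add it as an extra script line in evaluate_results, e.g. python3 summary_to_junit.py "$LATEST_EXP/evaluation/summary.json" "$LATEST_EXP/evaluation/junit.xml" 0.7.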
Jenkins
Jenkinsfile
pipeline {
agent any
environment {
OPENAI_API_KEY = credentials('openai-api-key')
FLUXLOOP_ENABLED = 'true'
}
stages {
stage('Setup') {
steps {
sh '''
python3 -m venv venv
. venv/bin/activate
pip install fluxloop-cli fluxloop
pip install -r requirements.txt
'''
}
}
stage('Verify') {
steps {
sh '''
. venv/bin/activate
cd fluxloop/my-agent
fluxloop doctor
fluxloop config validate
'''
}
}
stage('Generate Inputs') {
steps {
sh '''
. venv/bin/activate
cd fluxloop/my-agent
fluxloop generate inputs --limit 20 --mode deterministic
'''
}
}
stage('Run Experiment') {
steps {
sh '''
. venv/bin/activate
cd fluxloop/my-agent
fluxloop run experiment --iterations 1
'''
}
}
stage('Evaluate') {
steps {
sh '''
. venv/bin/activate
cd fluxloop/my-agent
LATEST_EXP=$(ls -td experiments/*/ | head -1)
fluxloop parse experiment "$LATEST_EXP"
fluxloop evaluate experiment "$LATEST_EXP"
'''
}
}
stage('Quality Gate') {
steps {
script {
                    def latest = sh(
                        script: "ls -td fluxloop/my-agent/experiments/*/ | head -1",
                        returnStdout: true
                    ).trim()
                    def summary = readJSON file: "${latest}evaluation/summary.json"
                    def score = summary.overall_score
if (score < 0.7) {
error("Quality gate failed: score ${score} below 0.7")
}
}
}
}
}
post {
always {
archiveArtifacts artifacts: 'fluxloop/my-agent/experiments/**/*',
allowEmptyArchive: true
publishHTML([
reportDir: 'fluxloop/my-agent/experiments/*/evaluation/',
reportFiles: 'report.html',
reportName: 'FluxLoop Evaluation Report'
])
}
}
}
Docker Integration
Dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir \
fluxloop-cli \
fluxloop \
-r requirements.txt
# Copy project
COPY . .
# Set up FluxLoop
RUN cd fluxloop/my-agent && \
fluxloop doctor
# Default command
CMD ["bash", "-c", "cd fluxloop/my-agent && fluxloop run experiment"]
Docker Compose for Testing
version: '3.8'
services:
fluxloop-test:
build: .
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY}
- FLUXLOOP_ENABLED=true
volumes:
- ./fluxloop:/app/fluxloop
- test-results:/app/fluxloop/my-agent/experiments
command: |
bash -c "
cd fluxloop/my-agent &&
fluxloop generate inputs --limit 20 --mode deterministic &&
fluxloop run experiment --iterations 1 &&
fluxloop parse experiment experiments/*/ &&
fluxloop evaluate experiment experiments/*/
"
volumes:
test-results:
Run tests:
docker-compose run fluxloop-test
Best Practices
1. Use Deterministic Mode in CI
# configs/input.yaml (for CI)
input_generation:
mode: deterministic # or commit pre-generated inputs
# configs/simulation.yaml
seed: 42 # Fixed seed for reproducibility
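To confirm that input generation really is deterministic in CI, you can fingerprint the generated inputs and compare the hash across runs. The sketch below simply hashes everything under inputs/ (the directory used for artifacts earlier in this guide); comparing against a committed inputs.sha256 file is an illustrative convention, not a FluxLoop feature.

# hash_inputs.py (illustrative; compares a fingerprint of inputs/ against a committed value)
import hashlib
import pathlib
import sys

inputs_dir = pathlib.Path("inputs")
if not inputs_dir.is_dir():
    print("inputs/ directory not found; run fluxloop generate inputs first")
    sys.exit(1)

digest = hashlib.sha256()
for path in sorted(inputs_dir.rglob("*")):
    if path.is_file():
        digest.update(str(path.relative_to(inputs_dir)).encode())
        digest.update(path.read_bytes())

fingerprint = digest.hexdigest()
print(f"inputs fingerprint: {fingerprint}")

expected_file = pathlib.Path("inputs.sha256")
if expected_file.exists() and expected_file.read_text().strip() != fingerprint:
    print("FAIL: generated inputs differ from the committed fingerprint")
    sys.exit(1)

Run it right after fluxloop generate inputs, and commit inputs.sha256 once the generated inputs are stable.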
2. Separate Test Configs from Production
fluxloop/my-agent/
├── configs/ # Production configs
│ ├── project.yaml
│ ├── input.yaml
│ ├── simulation.yaml
│ └── evaluation.yaml
└── configs-ci/ # CI-specific configs
├── input.yaml # Deterministic, fewer inputs
├── simulation.yaml # Lower iterations, fixed seed
└── evaluation.yaml # Stricter thresholds
Run with CI configs:
# Override config directory
cp -r configs-ci/* configs/
fluxloop run experiment
3. Cache Wisely
Cache these:
- Python packages (pip cache)
- Generated inputs (if deterministic)
- MCP index (~/.fluxloop/mcp/index/)
Don't cache:
- Experiment outputs
- Traces
- Evaluation results
4. Store Secrets Securely
GitHub Actions:
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
GitLab CI: define OPENAI_API_KEY as a masked CI/CD variable (Settings > CI/CD > Variables); it is then available to jobs as $OPENAI_API_KEY without appearing in .gitlab-ci.yml.
Jenkins:
environment {
OPENAI_API_KEY = credentials('openai-api-key')
}
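Whichever CI system you use, it helps to fail fast with a clear message when a secret is missing, without ever printing its value. A minimal check, assuming the environment variable names used in this guide:

# check_secrets.py (minimal sketch; extend REQUIRED with whatever keys your agent needs)
import os
import sys

REQUIRED = ["OPENAI_API_KEY"]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    # Print only the variable names, never the values
    print(f"Missing required secrets: {', '.join(missing)}")
    sys.exit(1)
print("All required secrets are present")

Run it as an early step (python3 check_secrets.py) in any of the pipelines above.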
5. Set Quality Gates
# check_quality.py
import json
import sys
with open("experiments/latest/evaluation/summary.json") as f:
summary = json.load(f)
score = summary["overall_score"]
threshold = 0.7
# Can also check individual evaluators
intent_score = summary["by_evaluator"]["intent_recognition"]["score"]
token_budget_score = summary["by_evaluator"]["token_budget"]["score"]
if score < threshold:
    print(f"❌ FAIL: Overall score {score:.2f} < {threshold}")
    sys.exit(1)
if intent_score < 0.8:
    print(f"❌ FAIL: Intent recognition {intent_score:.2f} < 0.8")
    sys.exit(1)
if token_budget_score < 0.8:
    print(f"❌ FAIL: Token budget {token_budget_score:.2f} < 0.8")
    sys.exit(1)
print(f"✅ PASS: All quality gates passed")
Monitoring and Reporting
Track Metrics Over Time
Store evaluation results in a time-series database:
# upload_metrics.py
import json
import os
import requests
from datetime import datetime
with open("experiments/latest/evaluation/summary.json") as f:
summary = json.load(f)
# Send to monitoring system
requests.post("https://metrics.example.com/fluxloop", json={
"timestamp": datetime.now().isoformat(),
"project": "my-agent",
"branch": os.getenv("CI_COMMIT_BRANCH"),
"overall_score": summary["overall_score"],
"by_evaluator": summary["by_evaluator"],
"total_traces": summary["total_traces"],
})
Generate Trend Reports
# Compare last N runs
python3 << 'EOF'
import json
import glob
experiments = sorted(glob.glob("experiments/*/evaluation/summary.json"))[-10:]
for exp in experiments:
with open(exp) as f:
summary = json.load(f)
print(f"{exp}: {summary['overall_score']:.2f}")
EOF
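Beyond printing the scores, the same data can drive simple regression detection over time. The sketch below flags a run whose score drops noticeably below the rolling average of the previous runs; the 0.05 margin and three-run window are example values, and it assumes past summaries are still available (for example, via downloaded artifacts).

# trend_check.py (illustrative; margin and window size are example values)
import glob
import json
import sys

paths = sorted(glob.glob("experiments/*/evaluation/summary.json"))
scores = []
for path in paths:
    with open(path) as f:
        scores.append(json.load(f).get("overall_score", 0.0))

if len(scores) >= 4:
    latest = scores[-1]
    window = scores[-4:-1]  # the three runs before the latest
    average = sum(window) / len(window)
    print(f"Latest: {latest:.2f}  Rolling average: {average:.2f}")
    if latest < average - 0.05:
        print("FAIL: latest score is more than 0.05 below the recent average")
        sys.exit(1)
print("No regression detected")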
Troubleshooting CI/CD
Tests Pass Locally but Fail in CI
Common causes:
- Missing environment variables

  # Check what's set
  fluxloop config env

- Different Python version

  # Pin Python version
  python-version: '3.11'

- Non-deterministic inputs

  # Use fixed seed
  seed: 42
  mode: deterministic
Slow CI Runs
Optimizations:
- Reduce test scope

  # Fewer inputs
  fluxloop generate inputs --limit 10
  # Single iteration
  fluxloop run experiment --iterations 1

- Sample LLM evaluations

  # configs/evaluation.yaml
  limits:
    sample_rate: 0.2  # Only 20% of traces
    max_llm_calls: 10

- Cache dependencies

  - uses: actions/cache@v4
    with:
      path: ~/.cache/pip
      key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
Example: Full GitHub Actions Workflow
Comprehensive example with all best practices:
name: FluxLoop CI
on:
push:
branches: [main]
pull_request:
branches: [main]
schedule:
- cron: '0 2 * * *' # Daily at 2 AM
jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.11', '3.12']
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: 'pip'
- name: Cache MCP index
uses: actions/cache@v4
with:
path: ~/.fluxloop/mcp/index
key: mcp-index-v1
- name: Install dependencies
run: |
pip install fluxloop-cli fluxloop fluxloop-mcp
pip install -r requirements.txt
- name: Verify installation
run: fluxloop doctor
- name: Prepare test environment
run: |
cd fluxloop/my-agent
cp -r configs-ci/* configs/
fluxloop config validate
- name: Run tests
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
cd fluxloop/my-agent
fluxloop generate inputs --limit 20 --mode deterministic
fluxloop run experiment --iterations 1
LATEST=$(ls -td experiments/*/ | head -1)
fluxloop parse experiment "$LATEST"
fluxloop evaluate experiment "$LATEST"
- name: Quality gate
run: |
cd fluxloop/my-agent
python3 scripts/check_quality.py
- name: Upload artifacts
if: always()
uses: actions/upload-artifact@v4
with:
name: results-py${{ matrix.python-version }}
path: fluxloop/my-agent/experiments/
- name: Report to PR
if: github.event_name == 'pull_request'
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
// ... (PR comment script from earlier)
See Also
- Basic Workflow - Standard FluxLoop workflow
- Multi-Turn Workflow - Dynamic conversations
- Commands Reference - CLI command documentation
- Evaluation Configuration - Success criteria and thresholds