
CI/CD LLM Eval with GitHub Actions: The 2026 Workflow

Q: Why do most CI LLM eval gates fail in production?

Because they nail one corner of the triangle and skip the other two. The triangle is cheap, fast, statistically significant. A 30-example dataset that runs in 40 seconds for 12 cents fails the third corner: the variance band on the per-rubric mean is wider than the regressions you're trying to catch, so the gate fires false alarms half the time, engineers learn to ignore it, and the green check stops meaning anything. The other failure mode is a 2,000-example LLM-judge sweep on every push: statistically clean, but $9 per PR and 22 minutes to verdict, which gets quietly disabled by quarter end. The working gate runs cheap deterministic checks plus a classifier-cascade on every PR, reserves the full LLM-judge sweep for nightly main against a versioned dataset, and gates promotion to canary on a Welch's t-test against the trailing 7-day baseline.

Q: What's the right exit-code partition for the fi CLI in GitHub Actions?

Five codes earn their keep. 0 = success, 2 = assertion failed (real regression, hard-fail the PR check), 3 = assertion warning under --strict (slack-notify, don't block), 6 = API error (retry with backoff, then fail loudly), 7 = timeout (raise the timeout-minutes ceiling or shard the suite). The reason the partition matters: `grep` heuristics on stdout break the moment the SDK reformats a log line. Exit codes are the stable contract. CI policies wire cleanly on top: hard-fail on 2, retry on 6, alarm on 3, escalate on 7. The fi CLI ships these codes natively at `fi/cli/assertions/exit_codes.py`, so the gate plugs into GitHub Actions, GitLab CI, Buildkite, Jenkins, or CircleCI without translation.

Q: How big should the golden set be for a PR gate?

100 to 200 cases per route, sampled from production traces. Below 100 the variance on per-rubric means is wide enough that a 2-point drop is indistinguishable from judge noise; the gate raises false alarms and gets disabled. Above 500 the per-PR LLM-judge bill grows faster than detection sharpens, and the PR-feedback budget burns past the 5-minute ceiling that keeps engineers in flow. Composition over count: roughly 60 percent happy-path queries, 20 percent edge cases (ambiguous, multi-hop, time-sensitive), 10 percent refusal cases, 10 percent the hardest historical failures from incident reports. Grow the set weekly by promoting failing production traces through Error Feed (HDBSCAN clustering plus a Sonnet 4.5 Judge agent that writes the `immediate_fix` and a 4-dim score). Nightly corpora go to 500-2,000 because cost amortises over one run.
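As a rough illustration of the composition rule, the sketch below turns the 60/20/10/10 mix into per-bucket quotas for a golden set of a given size. The bucket names and the `bucket_quotas` helper are hypothetical, not part of the SDK or the fi CLI.

```python
# Hypothetical bucket names; shares are the 60/20/10/10 mix, in whole percent.
MIX_PERCENT = {
    "happy_path": 60,
    "edge_case": 20,
    "refusal": 10,
    "hard_failure": 10,
}


def bucket_quotas(total, mix=MIX_PERCENT):
    """Split `total` golden cases across buckets using integer arithmetic."""
    quotas = {name: total * pct // 100 for name, pct in mix.items()}
    # Rounding can leave a few cases unassigned; give them to the largest bucket.
    largest = max(mix, key=mix.get)
    quotas[largest] += total - sum(quotas.values())
    return quotas


print(bucket_quotas(200))
```

Integer percentages keep the quotas exact, so the buckets always sum to the requested set size.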

Q: How does a classifier cascade actually save money?

A DeBERTa-class NLI classifier triages every case in under 50 ms on CPU. On a 200-example suite, 70-85 percent of cases land cleanly on one side or the other (clear pass, clear fail, no disagreement between classifier and golden label). Only the low-confidence or disagreement subset escalates to the frontier judge. The frontier judge then runs against 30-60 examples instead of 200, which is the difference between $3 per PR and 20 cents per PR. The cascade trap: don't cascade on subjective axes the classifier can't score (helpfulness, tone, brand voice). Cascade where the classifier has a clean target (NLI faithfulness, claim_support, citation validity); reserve the frontier judge for rubrics that need semantic scoring. Future AGI's `ai-evaluation` SDK ships the NLI-backed local equivalents (`faithfulness`, `claim_support`, `rag_faithfulness`, `factual_consistency`) as the cascade's classifier tier; the Platform's classifier-backed evals run at lower per-eval cost than Galileo Luna-2.

Q: What's the right way to gate on statistical significance without flaky CI?

Three rules. First, pin the judge model and temperature, version them in the rubric file, and bump the version explicitly when you change them; that turns judge drift into a diff instead of a mystery. Second, gate on a delta against the trailing 7-day rolling baseline using Welch's t-test (continuous rubrics) or a two-proportion z-test (binary rubrics like citation validity), not against a single prior PR. Third, fire the gate only when p is under 0.05 AND the effect size exceeds the rubric's noise floor (around 0.03 on a 0-1 scale; tune per rubric). For long-tail failures that hide in averages, gate on p95_score instead of avg_score. The fi CLI exposes `pass_rate`, `avg_score`, `p50_score`, `p90_score`, and `p95_score` as native assertion metrics, so the gate compares percentiles natively without `grep`-on-stdout heuristics.

Q: Should the eval run on every commit or only on PRs?

Only on PRs, only when the PR touches paths that affect agent behavior. GitHub Actions `paths` filter on `prompts/**`, `evals/**`, `src/agent/**`, `src/tools/**`, plus the workflow file itself, scopes the trigger so a CSS change doesn't burn judge tokens. Add `concurrency.cancel-in-progress: true` keyed on the head ref so three rapid pushes don't fan out into three concurrent eval runs. Running on every commit floods the dataset with churn, trains engineers to ignore eval status, and inflates the judge bill without sharpening the signal. The PR gate is the merge decision; the nightly batch is the drift detector; the canary is the live-traffic guardrail. Three triggers, three jobs.

Q: What does per-route scoping mean and why does it matter?

A PR touching `agents/legal/` reruns only the legal-RAG suite, not the support-bot suite or the sales-agent suite. The pattern: tag each example in the golden set with its route, expose a `--routes` flag to pytest (or `pytest -m route_legal` markers), and have a small `affected_routes.py` script read the PR diff and emit the affected route list as a GitHub Actions step output. The PR gate then runs only the matching shards. This is what keeps a multi-route monorepo's PR-feedback budget bounded. Without it, a 200-example suite multiplied across 8 routes is a 1,600-example sweep on every PR, which prices the gate out by month two. The fi CLI honors path scoping natively; the SDK's four distributed runners (Celery, Ray, Temporal, Kubernetes) handle the case where a single route's nightly corpus crosses 2,000 examples.

Q: How does the canary auto-rollback fit with PR-gate eval?

The PR gate proves the change doesn't regress on the curated golden set. The canary proves the change doesn't regress on the live traffic distribution, which is the only distribution that pays the bill. After merge, the Agent Command Center gateway routes 1-5 percent of production traffic to the new prompt while the rest stays on the incumbent. The same rubric library scores both populations; a rolling-mean drop beyond the calibrated threshold (typically 2-3 percentage points sustained over 15-60 minutes) trips the auto-rollback. The gateway emits per-request response headers (cost, latency, model used, fallback used, guardrail triggered) so the canary script has cost and routing telemetry next to the eval scores. If the canary holds for 24 hours, ramp to 25 percent, then 100 percent. Same rubric library scores all three populations (PR set, canary sample, production), so the numbers are comparable end to end.

Cheap, fast, statistically significant LLM eval gates in GitHub Actions: classifier cascade, fi CLI exit codes, Welch's t-test, auto-rollback.

March 31, 2026

Updated May 20, 2026

14 min read

ci-cd github-actions llm-evaluation eval-gates regression-testing statistical-gating ai-gateway 2026


A PR lands at 4:47 pm. The eval workflow runs 247 LLM-as-judge calls across nine rubrics on 200 examples. Four minutes, $3.20, green check. Twelve hours later, the support agent starts refusing valid questions on a class of queries the dataset never covered. Suite Groundedness held at 0.91. Production Groundedness on the affected traffic sat at 0.68 for those twelve hours. The gate fired green because it didn’t ask the right question.

Most CI LLM eval gates are smoke tests in a trench coat. Tiny dataset, mean against a frozen floor, pass on anything short of catastrophe. The dataset isn’t representative, the floor isn’t calibrated against baseline variance, the threshold doesn’t separate signal from judge noise. The green check is theater.

The opinion this post earns: CI eval on every PR has to be cheap, fast, and statistically significant. Pick any two and the gate is theater. A 30-example dataset that runs in 40 seconds for 12 cents fails the third corner. A 2,000-example LLM-judge sweep at $9 per PR fails the first two. The working pattern runs cheap deterministic checks plus a classifier-cascade on every PR, the full LLM-judge sweep nightly against a versioned dataset, and paired-comparison statistical gating with auto-rollback on the canary. The math (not vibes) decides which regressions block.

This guide is the working playbook. The pytest fixture pattern, the fi CLI exit-code partition, the full GitHub Actions YAML with path-scoped triggers and concurrency cancellation, per-route scoping, Welch’s t-test gating, and the FAGI vendor surface that ships the stack. Code shaped against the ai-evaluation SDK and the fi CLI.

TL;DR: the gate that doesn’t lie

Step	Decision	Rule
1. Triage	Classifier → judge → human	Cheap NLI rubrics first. Frontier judge only on disagreement. Human review on flagged clusters.
2. Triggers	PR-blocking vs nightly vs canary	Cheap rubrics + deterministic floors on PR. Full LLM-judge sweep nightly. Same rubrics on canary.
3. Path scope	Per-route triggers	`paths:` filter on the workflow. `affected_routes.py` script emits route list. `concurrency.cancel-in-progress: true`.
4. Pytest	Parameterized + ai-evaluation SDK	`@pytest.mark.parametrize` per route. `Evaluator(max_workers=16)`. Distributed runners past single-host.
5. Exit codes	0/2/3/6/7 partition	0 success, 2 assertion fail, 3 warning, 6 API error, 7 timeout. Hard contract for CI policy.
6. Statistical gate	Floor + delta + percentile	Welch’s t-test on continuous rubrics, z-test on binary, p95 on tail. p < 0.05 AND effect floor.
7. Canary auto-rollback	Gateway-level	1-5% live traffic. Rolling-mean drift > threshold trips rollback. Gateway response headers feed cost.

The hard ones are 5 and 6. The rest is plumbing.

The cheap-fast-significant triangle

Three constraints fight for the gate’s budget. Cheap, fast, statistically significant. The instinct is to optimize for one corner and hope the others land. They don’t. They actively trade against each other, and the gate that misses any one is theater.

Cheap. Cents per PR, not dollars. A frontier judge at $0.015 per call against 200 examples and nine rubrics is $27 per PR. Multiply by 12 PRs per day across the team and the eval bill outpaces the model API spend by month three. Somebody quietly disables the gate.
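The arithmetic behind that bill is simple enough to keep in a script next to the workflow. A minimal sketch, assuming one judge call per (example, rubric) pair; `judge_cost_per_pr` is an illustrative helper, not an SDK function.

```python
def judge_cost_per_pr(examples, rubrics, price_per_call):
    """Cost of one full sweep: one judge call per (example, rubric) pair."""
    return examples * rubrics * price_per_call


full_sweep = judge_cost_per_pr(examples=200, rubrics=9, price_per_call=0.015)
daily_bill = full_sweep * 12  # 12 PRs per day across the team

print(f"${full_sweep:.2f} per PR, ${daily_bill:.2f} per day")
# prints "$27.00 per PR, $324.00 per day"
```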

Fast. Under five minutes push to verdict. Above ten, engineers merge and apologize. CI is a queue, and the queue measures the slowest thing on the critical path.

Statistically significant. The gate fails when the regression is real and passes when the variance is judge noise. A 30-example dataset gives a 95 percent confidence interval of roughly +/- 0.07 on a 0-1 rubric mean (assuming a per-example standard deviation near 0.2). A 2-point drop (0.02) sits well inside that band, so a gate that fires on it raises false alarms half the time. Teams fire-drilled twice learn to ignore the third.
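A quick way to size the dataset is the normal-approximation half-width of the confidence interval, 1.96 * s / sqrt(n). The sketch below assumes a per-example standard deviation of 0.2 on a 0-1 scale; that value is an illustrative assumption, so measure your own rubric's spread before trusting the numbers.

```python
import math


def ci_half_width(std_dev, n, z=1.96):
    """Normal-approximation half-width of a 95% CI on a sample mean."""
    return z * std_dev / math.sqrt(n)


# Assumed per-example standard deviation of 0.2 on a 0-1 rubric.
for n in (30, 100, 200, 500):
    print(n, round(ci_half_width(0.2, n), 3))
```

Under that assumption the half-width is about 0.07 at 30 examples and drops below 0.03 at 200, which is why 100 to 200 cases per route is where a small drop starts to separate from noise.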

Classifier cascade fixes cheap. Path scoping plus concurrency cancellation fixes fast. Sample sizing plus delta gating with Welch’s t-test fixes significant. Each lever exists; the design call is wiring all three at once.

The triage: classifier, judge, human

Three layers, in order of cost and order of decision.

Cheap deterministic + classifier (runs on every PR). JSON schema validation, regex on tool-call structure, exact-match on golden examples, citation existence, and the 8 sub-10ms Scanner classes (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner). On top of that, the NLI-backed classifier rubrics (faithfulness, claim_support, rag_faithfulness, factual_consistency) score the semantic axes that pattern-matching can’t see. Microseconds to tens of milliseconds per call, no API spend.

Frontier judge (runs nightly on the full corpus, runs on PR only for cases the classifier can’t decide). This is where Groundedness, ContextAdherence, Completeness, ChunkAttribution, AnswerRefusal, TaskCompletion, EvaluateFunctionCalling earn their bill. The judge runs against the 30-60 examples the classifier flagged as low-confidence, not the 200-example whole. That’s the cascade’s payoff: the cost drops by roughly an order of magnitude without losing signal.

Human review (runs on flagged clusters, not on every example). Error Feed (inside the FAGI eval stack) uses HDBSCAN soft-clustering over ClickHouse-stored span embeddings to group failures into named issues. A Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars, 90 percent prompt-cache hit ratio) writes the RCA, evidence, an immediate_fix, and a 4-dim score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each). The human sees the cluster, not the raw stream. Five minutes a day of triage instead of five hours.

The cascade trap: don’t cascade on rubrics where the classifier has no clean target. Helpfulness, tone, and brand voice need a judge from the start; faithfulness, citation validity, and refusal calibration have a clean target the classifier can hit. Cascade where the classifier earns its keep.
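The escalation rule can be sketched in a few lines. This is a minimal illustration, not the SDK's cascade implementation: the `Triage` record, the `needs_judge` helper, and the 0.9 confidence cut-off are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Triage:
    case_id: str
    classifier_label: bool  # classifier verdict (True = pass)
    confidence: float       # classifier confidence in [0, 1]
    golden_label: bool      # expected verdict from the golden set


def needs_judge(t, min_confidence=0.9):
    """Escalate on low confidence or on disagreement with the golden label."""
    return t.confidence < min_confidence or t.classifier_label != t.golden_label


def split_cascade(cases, min_confidence=0.9):
    """Partition cases into classifier-settled and judge-escalated lists."""
    settled, escalated = [], []
    for t in cases:
        (escalated if needs_judge(t, min_confidence) else settled).append(t)
    return settled, escalated


cases = [
    Triage("a", True, 0.97, True),    # confident and agrees: settled
    Triage("b", True, 0.62, True),    # low confidence: escalate
    Triage("c", False, 0.95, True),   # confident but disagrees: escalate
]
settled, escalated = split_cascade(cases)
print(len(settled), len(escalated))
```

Only the escalated list goes to the frontier judge, so the judge bill scales with the classifier's uncertainty rather than with the size of the suite.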

The pytest pattern: parameterized + ai-evaluation SDK

The pytest fixture pattern that keeps the gate honest. Path-scoped, parameterized per route, calls Evaluator(max_workers=16) for in-process parallelism, delegates to the SDK’s distributed runners past the single-host limit.

# tests/test_eval_gate.py
import os
import statistics
import json
from pathlib import Path

import pytest
from scipy import stats
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextRelevance, Completeness,
    ChunkAttribution, AnswerRefusal, FactualAccuracy,
)
from fi.testcases import TestCase

BASELINE_DIR = Path("evals/baselines")          # rolling 7-day, JSON per route
GOLDEN_DIR   = Path("evals/golden")              # JSONL per route
ROUTES       = ["support-rag", "legal-rag", "sales-agent"]

# Absolute floors (catch catastrophic regressions)
FLOORS = {
    "Groundedness":     0.85,
    "ContextRelevance": 0.80,
    "Completeness":     0.75,
    "ChunkAttribution": 0.80,
    "AnswerRefusal":    0.90,
    "FactualAccuracy":  0.80,
}

# Per-rubric noise floor (effect size below this is treated as variance)
NOISE_FLOOR = 0.03

evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
    max_workers=16,
)


def regression_gate(current, baseline, alpha=0.05, min_effect=NOISE_FLOOR):
    """Fail only if mean dropped AND change is significant AND effect exceeds floor."""
    delta = statistics.mean(current) - statistics.mean(baseline)
    if delta >= 0:
        return True, f"no regression (delta=+{delta:.3f})"
    _, p = stats.ttest_ind(current, baseline, equal_var=False)
    if p >= alpha:
        return True, f"delta={delta:.3f}, p={p:.3f} (not significant)"
    if abs(delta) < min_effect:
        return True, f"delta={delta:.3f} below effect floor {min_effect}"
    return False, f"regression: delta={delta:.3f}, p={p:.3f}"


def load_cases(route):
    with (GOLDEN_DIR / f"{route}.jsonl").open() as fh:
        for line in fh:
            row = json.loads(line)
            yield TestCase(
                input=row["question"],
                output=row["candidate_response"],
                context="\n\n".join(row["chunks"]),
                expected_response=row.get("expected"),
            )


@pytest.mark.parametrize("route", ROUTES)
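def test_pr_gate(route, request):
    # --routes must be registered via pytest_addoption in conftest.py;
    # when it is absent or empty, fall back to evaluating every route.
    selected = request.config.getoption("--routes", default=None) or ",".join(ROUTES)
    if route not in selected.split(","):
        pytest.skip(f"{route} not affected by this PR")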

    cases = list(load_cases(route))
    templates = [Groundedness(), ContextRelevance(), Completeness(),
                 ChunkAttribution(), AnswerRefusal(), FactualAccuracy()]

    result = evaluator.evaluate(eval_templates=templates, inputs=cases)
    scores_by_rubric = result.aggregate_by_template()
    baselines = json.loads((BASELINE_DIR / f"{route}.json").read_text())

    failures = []
    for name, current in scores_by_rubric.items():
        # Floor check (catches catastrophic drops)
        if statistics.mean(current) < FLOORS[name]:
            failures.append(f"{route}.{name}: mean below floor {FLOORS[name]}")
        # Delta check (catches slow drift)
        passed, reason = regression_gate(current, baselines[name])
        if not passed:
            failures.append(f"{route}.{name}: {reason}")

    assert not failures, "\n".join(failures)

Four design calls earn their keep. Versioned baseline per route (rolling 7-day, JSON in git so the diff is reviewable). Per-route scope in the parametrize plus a --routes CLI flag (skipped routes don’t burn judge tokens). Both a floor check (catches the catastrophic drop) and a delta check with Welch’s t-test (catches slow drift the floor misses). And max_workers=16 so the in-process parallelism saturates the judge provider’s rate limit before the distributed runner has to take over.

The aggregate_by_template() call returns per-example score arrays per rubric, which is what feeds the statistical test. If you only have the means, you can’t run the t-test; the gate then collapses to “did the average drop”, which is the noise-bait pattern the post is arguing against.
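The test reads a --routes option from the pytest config, and pytest rejects unknown flags unless they are registered. A minimal sketch of the conftest.py hook that would register it; the default value (all three routes) and the help text are assumptions.

```python
# conftest.py
def pytest_addoption(parser):
    """Register the --routes flag consumed by tests/test_eval_gate.py."""
    parser.addoption(
        "--routes",
        action="store",
        default="support-rag,legal-rag,sales-agent",  # assumed default: every route
        help="Comma-separated routes to evaluate; other routes are skipped.",
    )
```

With the hook in place, pytest tests/test_eval_gate.py --routes=legal-rag runs the legal suite and skips the other two.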

The fi CLI exit codes

The fi CLI is the layer that makes the gate wire cleanly into any CI runner without grep heuristics on stdout. Exit codes are the stable contract; log lines reformat between SDK versions.

# evals/fi-evaluation.yaml: native CI assertions
data:
  - "evals/golden/support-rag.jsonl"
  - "evals/golden/legal-rag.jsonl"

evaluations:
  - template: "groundedness"
    config:
      mode: "cascade"        # NLI classifier first, frontier judge on disagreement

  - template: "answer_refusal"

  - template: "citation_validity"

assertions:
  - template: "groundedness"
    condition: "p95_score >= 0.78"    # gate on the tail, not just the mean
    on_fail: "error"
  - template: "answer_refusal"
    condition: "pass_rate >= 0.95"
    on_fail: "error"
  - template: "citation_validity"
    condition: "pass_rate >= 0.99"
    on_fail: "error"

Run it with fi run --check --strict --parallel 16 -c evals/fi-evaluation.yaml. The exit-code partition (verified at fi/cli/assertions/exit_codes.py):

Code	Meaning	CI policy
0	All assertions passed	Merge proceeds
1	Evaluation execution error	Investigate; do not retry blindly
2	One or more assertions failed	Hard-fail the PR check
3	`--strict` warning (passed but flagged)	Slack-notify; do not block
4	Configuration file error	Fix the YAML; not a model issue
5	Test data file error	Fix the dataset path or format
6	API connection / auth error	Retry with backoff; fail loudly if persists
7	Evaluation timeout	Raise `timeout-minutes` or shard the suite

The reason the partition matters: CI policies wire on top of stable exit codes. Hard-fail on 2, retry on 6, alarm on 3, escalate on 7. The moment the gate logic is “grep the stdout for the word ERROR”, a log-line refactor breaks every workflow. Treat the exit code as the contract.
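For runners where shell-level branching gets awkward, the same policy can live in a thin wrapper. The sketch below is illustrative: the action names, the `run_gate` helper, and the retry count are assumptions, and only the exit-code meanings come from the table above.

```python
import subprocess
import time

# Exit code -> CI action, following the partition above.
POLICY = {
    0: "pass",
    2: "block",     # assertion failed: hard-fail the PR check
    3: "notify",    # --strict warning: alert, do not block
    6: "retry",     # API or auth error: retry with backoff
    7: "escalate",  # timeout: raise the ceiling or shard the suite
}


def action_for(exit_code):
    """Codes 1, 4, and 5 (execution, config, data errors) need a human."""
    return POLICY.get(exit_code, "investigate")


def run_gate(cmd, max_retries=2, sleep=time.sleep):
    """Run the gate command, retrying only on exit code 6."""
    for attempt in range(max_retries + 1):
        action = action_for(subprocess.run(cmd).returncode)
        if action != "retry":
            return action
        if attempt < max_retries:
            sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
    return "block"  # API error persisted after retries: fail loudly
```

A caller would pass the gate command as a list, for example the fi run invocation shown earlier split into arguments, and map the returned action onto the runner's own pass, warn, or fail mechanics.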

Native assertion conditions cover pass_rate, avg_score, p50_score, p90_score, p95_score, and runtime percentiles. The percentile assertions are the ones long-tail failures need: a regression that pushes p95_score below a tail floor while leaving the mean intact fails on the percentile and tells the on-call where to look.

The full GitHub Actions workflow

The drop-in YAML. Path-scoped triggers, concurrency.cancel-in-progress, sharding via matrix, the fi CLI cheap-rubric gate, the pytest statistical delta gate.

# .github/workflows/eval-pr-gate.yml
name: eval-pr-gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
      - "src/agent/**"
      - "src/tools/**"
      - ".github/workflows/eval-pr-gate.yml"

concurrency:
  group: eval-pr-gate-${{ github.head_ref }}
  cancel-in-progress: true

permissions:
  contents: read
  pull-requests: write

jobs:
  detect-routes:
    runs-on: ubuntu-latest
    outputs:
      routes: ${{ steps.affected.outputs.routes }}
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 2 }
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: pip }
      - id: affected
        run: |
          ROUTES=$(python evals/affected_routes.py)
          echo "routes=$ROUTES" >> "$GITHUB_OUTPUT"

  pr-gate:
    needs: detect-routes
    if: needs.detect-routes.outputs.routes != '[]'
    runs-on: ubuntu-latest
    timeout-minutes: 8
    strategy:
      fail-fast: false
      matrix:
        route: ${{ fromJson(needs.detect-routes.outputs.routes) }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: pip }
      - uses: actions/cache@v4
        with:
          path: .eval_cache
          key: evals-${{ hashFiles('evals/rubrics/**', 'evals/golden/**') }}

      - run: pip install -r requirements.txt

      # Layer 1: cheap-rubric gate via fi CLI (classifier cascade + assertions)
      - name: Cheap-rubric gate
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: |
          fi run --check --strict --parallel 16 \
            -c evals/fi-evaluation.yaml \
            --filter route=${{ matrix.route }}

      # Layer 2: statistical delta gate (Welch's t-test vs 7-day baseline)
      - name: Statistical delta gate
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: |
          pytest tests/test_eval_gate.py -n auto \
            --routes=${{ matrix.route }} \
            --json-report --json-report-file=eval-report-${{ matrix.route }}.json

      - if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-report-${{ matrix.route }}-${{ github.sha }}
          path: eval-report-${{ matrix.route }}.json

      - if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const r = JSON.parse(fs.readFileSync(
              'eval-report-${{ matrix.route }}.json', 'utf8'));
            const body = `### Eval gate: \`${{ matrix.route }}\` failed\n\n` +
                         '```json\n' + JSON.stringify(r.tests, null, 2) + '\n```';
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner, repo: context.repo.repo, body,
            });

Five habits pay back the first week. Path-scoped triggers so a CSS-only PR doesn’t burn judge tokens. concurrency.cancel-in-progress: true keyed on github.head_ref so three rapid pushes don’t fan out into three concurrent suites. Matrix sharding by route so a multi-route monorepo’s PRs scale linearly with the change surface, not with the route count. actions/cache@v4 keyed on rubric and dataset SHAs so reruns on unchanged code hit cache. And an eval report artifact per route on every run; a borderline regression turns into a 3-minute conversation instead of a 30-minute argument.

A separate workflow triggered on a schedule (cron: "17 7 * * *") runs the full nightly sweep across every route on the larger corpus (500-2,000 examples) and posts the daily baseline back into Observe. That keeps the rolling 7-day window fresh.

Statistical gating: Welch’s t-test and bootstrap CI

This is the section most CI eval guides skip. It’s the one separating a working gate from theater.

A green check should mean “this PR did not introduce a statistically significant regression,” not “mean Groundedness on 30 examples sat above 0.85.” The two statements look identical and aren’t.

Two thresholds, both calibrated, neither absolute alone:

Absolute floor per rubric. Catches the catastrophic regression. Starting floors: Groundedness >= 0.85, ContextRelevance >= 0.80, Completeness >= 0.75, ChunkAttribution >= 0.80, citation validity >= 0.99 (compliance work). Tune per workload.
Delta gate against the trailing 7-day rolling baseline. Welch’s t-test on per-example score arrays (continuous rubrics). Two-proportion z-test on pass rates (binary rubrics). Fail when p < 0.05 AND the effect size exceeds the rubric’s noise floor (~0.03 on a 0-1 scale).
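The pytest gate earlier covers the Welch's t-test path; the binary-rubric path is not shown in code, so here is a minimal sketch of a two-proportion z-test using only the standard library. It is written as a one-sided test (is the current pass rate lower than baseline?), which is a design choice made here, and `two_proportion_z` is an illustrative helper rather than an SDK function.

```python
import math


def two_proportion_z(pass_cur, n_cur, pass_base, n_base):
    """One-sided pooled z-test: is the current pass rate below the baseline?

    Returns (z, p_value), where p_value = P(Z <= z) under the null.
    """
    p_cur, p_base = pass_cur / n_cur, pass_base / n_base
    pooled = (pass_cur + pass_base) / (n_cur + n_base)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cur + 1 / n_base))
    if se == 0:
        return 0.0, 1.0  # both samples all-pass or all-fail: nothing to test
    z = (p_cur - p_base) / se
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z
    return z, p_value


# 180/200 passing on the PR vs 196/200 on the baseline.
z, p = two_proportion_z(180, 200, 196, 200)
regressed = p < 0.05 and (196 / 200 - 180 / 200) > 0.03  # significance AND effect floor
print(round(z, 2), regressed)
```

The normal approximation gets shaky when pass rates sit very close to 0 or 1 on small samples; for a rubric like citation validity at 0.99, an exact test may be the safer choice.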

Two extensions earn their keep on long-tail products. First, gate on p95_score instead of avg_score when the failure mode is tail-shaped (a regression that pushes the worst 5 percent below a tail floor while leaving the mean intact is the most common slow-burn). Second, use a paired-comparison bootstrap CI when the same eval examples run through the candidate and the baseline; the paired test has tighter variance than the independent two-sample test, so the gate detects smaller effect sizes at the same sample budget.

def paired_bootstrap_ci(candidate, baseline, n_boot=10_000, alpha=0.05):
    """Bootstrap CI on the paired delta. Same examples through both pipelines."""
    import numpy as np
    paired = np.array(candidate) - np.array(baseline)
    boot = [np.random.choice(paired, len(paired), replace=True).mean()
            for _ in range(n_boot)]
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return paired.mean(), (lo, hi)

If the 95 percent CI for the paired delta sits entirely below zero, the regression is real. If it straddles zero, the change is variance. The bootstrap doesn’t assume a parametric distribution, which matters because rubric scores are often bimodal (cluster at 1 and near the floor, sparse in the middle).

The gate lives by one rule: the baseline is a rolling 7-day production observation, not a frozen number. Models drift, prompts drift, datasets drift. The gate drifts with them or it starts catching ordinary movement instead of regressions.
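One way to keep the baseline rolling is to pool per-example scores from the last seven nightly runs before handing them to the statistical test. A minimal sketch, assuming nightly scores are stored per date for each rubric; that storage shape and the `rolling_baseline` helper are hypothetical.

```python
from datetime import date, timedelta


def rolling_baseline(daily_scores, today, window_days=7):
    """Pool per-example scores from nightly runs in the trailing window.

    daily_scores maps a run date to that night's per-example scores.
    Runs dated `today` or later are excluded so a PR never tests against itself.
    """
    cutoff = today - timedelta(days=window_days)
    pooled = []
    for run_date in sorted(daily_scores):
        if cutoff <= run_date < today:
            pooled.extend(daily_scores[run_date])
    return pooled


history = {
    date(2026, 5, 10): [0.70, 0.72],  # older than 7 days: dropped
    date(2026, 5, 15): [0.90, 0.88],
    date(2026, 5, 19): [0.91, 0.93],
}
baseline = rolling_baseline(history, today=date(2026, 5, 20))
print(baseline)
```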

Per-route scoping: only eval what the PR touched

A 200-example suite multiplied across 8 routes is a 1,600-example sweep on every PR. The gate prices itself out by month two unless every PR runs only the suites it actually affects.

The pattern: tag each golden-set example with its route, expose a --routes CLI flag to pytest, and run a small affected_routes.py script that reads the PR diff and emits the matching route list as a GitHub Actions step output. The matrix shard above consumes that list.

# evals/affected_routes.py
import json
import subprocess

ROUTE_PATHS = {
    "support-rag":  ["src/agent/support/", "prompts/support_agent",
                     "evals/golden/support-rag.jsonl"],
    "legal-rag":    ["src/agent/legal/", "prompts/legal_agent",
                     "evals/golden/legal-rag.jsonl"],
    "sales-agent":  ["src/agent/sales/", "prompts/sales_agent",
                     "evals/golden/sales-agent.jsonl"],
}

def affected():
    diff = subprocess.check_output(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"], text=True).splitlines()
    hits = set()
    for route, paths in ROUTE_PATHS.items():
        if any(d.startswith(p) for d in diff for p in paths):
            hits.add(route)
    # Shared paths: rerun every route
    shared = ["src/shared/", "evals/rubrics/", ".github/workflows/eval-pr-gate.yml"]
    if any(d.startswith(p) for d in diff for p in shared):
        hits = set(ROUTE_PATHS.keys())
    return sorted(hits)

if __name__ == "__main__":
    print(json.dumps(affected()))

The shared-paths bucket is the trap most teams hit. A change to evals/rubrics/ or to a shared library affects every route, and a per-route filter would miss it. The script falls back to “rerun every route” when shared paths land in the diff. That’s the right default; the cost of a full sweep on a rubric change is the price of keeping the gate honest.

For datasets above a few thousand examples per route, the SDK ships four distributed runners (Celery, Ray, Temporal, Kubernetes) that parallelize across a worker pool past the single-host limit. The cheap path is Evaluator(max_workers=16) saturating the judge provider’s rate limit; the distributed runners take over when one CI runner stops keeping up.

Canary auto-rollback and the gateway

The PR gate proves the change doesn’t regress on the curated golden set. The canary proves the change doesn’t regress on the live traffic distribution, which is the only distribution that pays the bill.

The pattern uses the Agent Command Center gateway to route 1-5 percent of production traffic to the new prompt while the rest stays on the incumbent. The same rubric library scores both populations via span-attached scoring (EvalTag plus traceAI). A rolling-mean drop beyond the calibrated threshold (typically 2-3 percentage points sustained over 15-60 minutes) trips the gateway’s auto-rollback hook.

# .github/workflows/eval-canary.yml
name: eval-canary
on:
  workflow_dispatch:
    inputs:
      candidate_prompt: { required: true }
      ramp_percent:     { default: "5" }
jobs:
  canary:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: pip }
      - run: pip install -r requirements.txt
      - name: Register canary route
        env:
          AGENTCC_API_KEY: ${{ secrets.AGENTCC_API_KEY }}
        run: |
          python scripts/register_canary.py \
            --prompt "${{ inputs.candidate_prompt }}" \
            --percent "${{ inputs.ramp_percent }}"
      - name: Sample canary traces and score
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: python scripts/score_canary.py --window 30m
      - name: Promote or roll back
        run: python scripts/canary_decision.py

The decision script reads the same rubric scores produced by the PR gate, runs the paired-comparison bootstrap CI on the canary versus the incumbent rolling mean, and either ramps to the next stage (5 → 25 → 100 percent) or hits the gateway’s rollback endpoint. The gateway emits per-request response headers (cost, latency, model used, fallback used, routing strategy, guardrail triggered) on every call, so the canary script has cost and routing telemetry next to the eval scores. Same rubric in CI and in production means the numbers are comparable end to end; engineers stop arguing which number is “real” and start fixing the bug.
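The "sustained drop" condition is the part that keeps a single noisy window from triggering a rollback. A minimal sketch of that check, assuming the scoring step emits one rolling mean per fixed window for each population; `should_rollback`, the 0.02 threshold, and the three-window requirement are illustrative choices, not the gateway's built-in logic.

```python
def should_rollback(candidate_means, incumbent_means, threshold=0.02, sustain=3):
    """Trip when the candidate trails the incumbent by more than `threshold`
    for `sustain` consecutive windows (for example, 3 x 5-minute windows)."""
    streak = 0
    for cand, inc in zip(candidate_means, incumbent_means):
        streak = streak + 1 if (inc - cand) > threshold else 0
        if streak >= sustain:
            return True
    return False


incumbent = [0.90, 0.90, 0.90, 0.90]
print(should_rollback([0.90, 0.87, 0.86, 0.86], incumbent))  # sustained drop
print(should_rollback([0.90, 0.86, 0.90, 0.90], incumbent))  # single dip
```

Resetting the streak on any healthy window is deliberate: a rollback should answer a persistent gap, and the bootstrap CI on the paired delta remains the check for whether the gap is real.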

Anti-patterns the team usually hits

Running on every commit instead of every PR. Trains the team to ignore red builds, blows the judge token budget, and floods the dataset with churn. PRs only, with a paths: filter and concurrency.cancel-in-progress.
No baseline. Treating each run as standalone means you can’t tell a noise-floor flake from a real regression. The rolling 7-day JSON committed to git is the cheapest fix.
No per-rubric scoring. Aggregating to a single “LLM quality 0.84” number hides which rubric regressed. The per-rubric vector with per-rubric floors is the working pattern.
Floor without delta gate. Slow regressions slip under the floor for months. The delta gate with Welch’s t-test catches the drift the floor misses.
30-example dataset, mean-based gate. Variance wider than the regressions you’re catching. Grow to 100-200 per route, or gate on p95_score.
LLM-as-judge at the base. The most expensive layer running on every PR on every example. Engineers disable the gate by month two. Classifier cascade first; the frontier judge only on disagreement.
No notification on nightly drift. The cron job runs, the artifact uploads, nobody looks. A Slack webhook on failure is the bare minimum.
No rollback path on canary. Without the gateway-level promote-or-rollback hook, the canary becomes a slow-motion deploy that no one knows how to abort.
Static dataset frozen at launch. A 2024 set evaluating a 2026 product is a benchmark, not a regression suite. Error Feed promotes failing production traces weekly.
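To make the per-rubric and baseline items above concrete, here is a small sketch of a per-rubric check. The function name, the rubric names, and the 0.03 raw-drop threshold are illustrative assumptions; the raw delta stands in for the significance test the article recommends and is not a replacement for it.

```python
def check_rubrics(scores, floors, baseline, max_drop=0.03):
    """Return one human-readable failure per regressed rubric.

    scores, floors, baseline: dicts keyed by rubric name, values on a 0-1 scale.
    max_drop is a hypothetical noise floor; tune it per rubric.
    """
    failures = []
    for rubric, score in scores.items():
        floor = floors.get(rubric)
        if floor is not None and score < floor:
            failures.append(f"{rubric}: {score:.2f} below floor {floor:.2f}")
        base = baseline.get(rubric)
        if base is not None and base - score > max_drop:
            failures.append(f"{rubric}: dropped {base - score:.2f} vs baseline {base:.2f}")
    return failures


# Example: the aggregate mean is 0.84 and both rubrics clear their floors,
# yet completeness has drifted 0.06 below its rolling baseline.
scores = {"groundedness": 0.88, "completeness": 0.80}
floors = {"groundedness": 0.85, "completeness": 0.75}
baseline = {"groundedness": 0.89, "completeness": 0.86}
failures = check_rubrics(scores, floors, baseline)
```

Here a floor-only gate or a single averaged score would pass, while the per-rubric delta flags the one rubric that moved.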

How Future AGI ships the CI gate

Future AGI ships the eval stack as a package, designed for the cheap-fast-significant triangle. Start with the SDK and the fi CLI for code-defined gates. Graduate to the Platform when the cascade, the closed loop, and the gateway start mattering.

ai-evaluation SDK (Apache 2.0). 60+ EvalTemplate classes including the seven RAG metrics (Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy) plus AnswerRefusal, TaskCompletion, EvaluateFunctionCalling, PromptInjection, DataPrivacyCompliance. 8 sub-10ms Scanner classes for the deterministic base (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner). NLI-backed local rubrics (faithfulness, claim_support, rag_faithfulness, factual_consistency) for the classifier tier. Four distributed runners (Celery, Ray, Temporal, Kubernetes) for batch regression. RailType.INPUT/OUTPUT/RETRIEVAL plus AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED for ensemble rubrics.
fi CLI. fi init --template rag|basic|safety|agent scaffolds fi-evaluation.yaml. fi run --check --strict --parallel 16 evaluates assertions on pass_rate, avg_score, p50/p90/p95_score, and runtime percentiles, with CI-distinct exit codes (0/1/2/3/4/5/6/7) as a hard contract. --mode hybrid routes between local classifier rubrics and cloud judges; --offline runs classifier-only when the network is down.
Future AGI Platform. Self-improving evaluators tuned by thumbs feedback. An in-product authoring agent writes RAG-specific rubrics from natural-language descriptions. Classifier-backed evals at lower per-eval cost than Galileo Luna-2, which makes weekly full-suite reruns the default rather than the exception.
traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C#. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY) at register() time. 14 span kinds including a first-class RETRIEVER. 62 built-in evals wire server-side via EvalTag with zero inline latency.
Error Feed (inside the eval stack). HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every trace failure into a named issue. A Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars, 90 percent prompt-cache hit ratio) writes the RCA, evidence, an immediate_fix, and a 4-dim score. Linear integration ships today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
Agent Command Center. OpenAI-compatible gateway in a single Go binary (Apache 2.0). 100+ providers, 18+ built-in guardrail scanners plus 15 third-party adapters, exact and semantic caching, MCP and A2A protocol support, OTel/Prometheus observability. Cloud-hosted at gateway.futureagi.com or self-hostable. SOC 2 Type II, HIPAA, GDPR, and CCPA certified.
agent-opt (Apache 2.0). Six optimizers (ProTeGi, GEPA, MetaPrompt, BayesianSearch, RandomSearch, PromptWizard) for prompt regression triage on green PRs.

The pieces are independent. Drop ai-evaluation plus the fi CLI into your CI this afternoon. Bring traceAI, Error Feed, and the Platform online as the program matures.

Ready to wire a CI gate that doesn’t lie? Run pip install ai-evaluation, then fi init --template rag, point the data paths at your real eval set, set FI_API_KEY and FI_SECRET_KEY in CI secrets, and add fi run --check --strict to your pull-request workflow. Your next regression has somewhere to land.

Frequently asked questions

Why do most CI LLM eval gates fail in production?

Because they nail one corner of the triangle and skip the other two. The triangle is cheap, fast, statistically significant. A 30-example dataset that runs in 40 seconds for 12 cents fails the third corner: the variance band on the per-rubric mean is wider than the regressions you're trying to catch, so the gate fires false alarms half the time, engineers learn to ignore it, and the green check stops meaning anything. The other failure mode is a 2,000-example LLM-judge sweep on every push: statistically clean, but $9 per PR and 22 minutes to verdict, which gets quietly disabled by quarter end. The working gate runs cheap deterministic checks plus a classifier-cascade on every PR, reserves the full LLM-judge sweep for nightly main against a versioned dataset, and gates promotion to canary on a Welch's t-test against the trailing 7-day baseline.

What's the right exit-code partition for the fi CLI in GitHub Actions?

Four non-zero codes earn their keep alongside 0 = success: 2 = assertion failed (real regression, hard-fail the PR check), 3 = assertion warning under --strict (slack-notify, don't block), 6 = API error (retry with backoff, then fail loudly), 7 = timeout (raise the timeout-minutes ceiling or shard the suite). The reason the partition matters: `grep` heuristics on stdout break the moment the SDK reformats a log line. Exit codes are the stable contract. CI policies wire cleanly on top: hard-fail on 2, retry on 6, alarm on 3, escalate on 7. The fi CLI ships these codes natively at `fi/cli/assertions/exit_codes.py`, so the gate plugs into GitHub Actions, GitLab CI, Buildkite, Jenkins, or CircleCI without translation.
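One way to express that policy in a wrapper script is a plain lookup table. This is a sketch: the `Action` enum and helper names are hypothetical, and treating any unlisted code as a hard failure is an assumption rather than documented CLI behavior.

```python
from enum import Enum


class Action(Enum):
    PASS = "pass"
    FAIL = "hard-fail the PR check"
    WARN = "notify Slack, do not block"
    RETRY = "retry with backoff, then fail loudly"
    ESCALATE = "raise the timeout ceiling or shard the suite"


# Exit codes as described in the answer above.
POLICY = {
    0: Action.PASS,
    2: Action.FAIL,
    3: Action.WARN,
    6: Action.RETRY,
    7: Action.ESCALATE,
}


def action_for(exit_code: int) -> Action:
    """Map an exit code to a CI action; unlisted codes fail closed (assumption)."""
    return POLICY.get(exit_code, Action.FAIL)


def blocks_merge(exit_code: int) -> bool:
    """Only success (0) and a warning (3) leave the PR check green."""
    return action_for(exit_code) not in (Action.PASS, Action.WARN)
```

Keeping the mapping in one table means a change in policy, such as blocking on warnings, is a one-line diff rather than a rewrite of shell conditionals.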

How big should the golden set be for a PR gate?

100 to 200 cases per route, sampled from production traces. Below 100 the variance on per-rubric means is wide enough that a 2-point drop is indistinguishable from judge noise; the gate raises false alarms and gets disabled. Above 500 the per-PR LLM-judge bill grows faster than detection sharpens, and the PR-feedback budget burns past the 5-minute ceiling that keeps engineers in flow. Composition over count: roughly 60 percent happy-path queries, 20 percent edge cases (ambiguous, multi-hop, time-sensitive), 10 percent refusal cases, 10 percent the hardest historical failures from incident reports. Grow the set weekly by promoting failing production traces through Error Feed (HDBSCAN clustering plus a Sonnet 4.5 Judge agent that writes the `immediate_fix` and a 4-dim score). Nightly corpora go to 500-2,000 because cost amortises over one run.
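The composition split can be turned into per-category quotas with a few lines of integer arithmetic. The category keys and the function name below are illustrative; the percentages are the ones given above.

```python
# Target mix from the text, in whole percent to avoid float rounding.
MIX_PERCENT = {"happy_path": 60, "edge_case": 20, "refusal": 10, "hard_failure": 10}


def quotas(total: int, mix=MIX_PERCENT) -> dict:
    """Split `total` cases across categories using largest-remainder rounding."""
    counts = {name: total * pct // 100 for name, pct in mix.items()}
    remainders = {name: total * pct % 100 for name, pct in mix.items()}
    leftover = total - sum(counts.values())
    # Hand any leftover cases to the categories with the largest remainders.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts
```

For a 150-case route this yields 90 happy-path, 30 edge, 15 refusal, and 15 hard-failure cases; sampling within each quota from production traces is a separate step not shown here.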

How does a classifier cascade actually save money?

A DeBERTa-class NLI classifier triages every case in under 50 ms on CPU. On a 200-example suite, 70-85 percent of cases land cleanly on one side or the other (clear pass, clear fail, no disagreement between classifier and golden label). Only the low-confidence or disagreement subset escalates to the frontier judge. The frontier judge then runs against 30-60 examples instead of 200, which is the difference between $3 per PR and 20 cents per PR. The cascade trap: don't cascade on subjective axes the classifier can't score (helpfulness, tone, brand voice). Cascade where the classifier has a clean target (NLI faithfulness, claim_support, citation validity); reserve the frontier judge for rubrics that need semantic scoring. Future AGI's `ai-evaluation` SDK ships the NLI-backed local equivalents (`faithfulness`, `claim_support`, `rag_faithfulness`, `factual_consistency`) as the cascade's classifier tier; the Platform's classifier-backed evals run at lower per-eval cost than Galileo Luna-2.
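The routing logic of such a cascade is small. The sketch below assumes the classifier returns an entailment probability per case and uses hypothetical confidence cutoffs of 0.90 and 0.10; it covers the low-confidence path only, not the disagreement-with-golden-label path, and the `judge` callable is a stand-in for whatever frontier-judge call a team uses.

```python
def route_case(entailment_prob, pass_above=0.90, fail_below=0.10):
    """Decide a case from classifier confidence alone, or mark it for escalation."""
    if entailment_prob >= pass_above:
        return "pass"
    if entailment_prob <= fail_below:
        return "fail"
    return "escalate"


def run_cascade(probs, judge):
    """probs: {case_id: entailment probability}; judge: case_id -> "pass" or "fail".

    Returns (verdicts, escalated_ids). The judge is called only for
    cases the classifier could not decide confidently.
    """
    verdicts, escalated = {}, []
    for case_id, prob in probs.items():
        verdict = route_case(prob)
        if verdict == "escalate":
            escalated.append(case_id)
            verdict = judge(case_id)  # the expensive call, made only here
        verdicts[case_id] = verdict
    return verdicts, escalated
```

The length of the escalated list relative to the suite size is the number to watch: it is the fraction of the judge bill the cascade still pays.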

What's the right way to gate on statistical significance without flaky CI?

Three rules. First, pin the judge model and temperature, version them in the rubric file, and bump the version explicitly when you change them; that turns judge drift into a diff instead of a mystery. Second, gate on a delta against the trailing 7-day rolling baseline using Welch's t-test (continuous rubrics) or a two-proportion z-test (binary rubrics like citation validity), not against a single prior PR. Third, fire the gate only when p is under 0.05 AND the effect size exceeds the rubric's noise floor (around 0.03 on a 0-1 scale; tune per rubric). For long-tail failures that hide in averages, gate on p95_score instead of avg_score. The fi CLI exposes `pass_rate`, `avg_score`, `p50_score`, `p90_score`, and `p95_score` as native assertion metrics, so the gate compares percentiles natively without `grep`-on-stdout heuristics.
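For a binary rubric, the third rule can be sketched with the standard library alone. The function name and return shape are illustrative, and using a one-sided test (is the PR worse than the baseline?) is an assumption; the text does not say whether the test is one-sided or two-sided.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_gate(base_pass, base_n, pr_pass, pr_n, alpha=0.05, noise_floor=0.03):
    """Return (fail_gate, p_value, drop) for a binary rubric such as citation validity.

    base_*: pass count and sample size pooled over the trailing baseline window.
    pr_*:   pass count and sample size for the PR run.
    """
    drop = base_pass / base_n - pr_pass / pr_n
    pooled = (base_pass + pr_pass) / (base_n + pr_n)
    se = sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / pr_n))
    if se == 0:  # every case passed or every case failed in both samples
        return False, 1.0, drop
    z = drop / se
    p_value = 1 - NormalDist().cdf(z)  # one-sided: probability of a drop this large by chance
    # Fire only when the drop is both significant and larger than the noise floor.
    return (p_value < alpha and drop > noise_floor), p_value, drop
```

The two conditions guard against different failure modes: the p-value filters small-sample noise, while the noise floor stops a very large baseline from flagging drops too small to matter.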

Should the eval run on every commit or only on PRs?

Only on PRs, only when the PR touches paths that affect agent behavior. GitHub Actions `paths` filter on `prompts/**`, `evals/**`, `src/agent/**`, `src/tools/**`, plus the workflow file itself, scopes the trigger so a CSS change doesn't burn judge tokens. Add `concurrency.cancel-in-progress: true` keyed on the head ref so three rapid pushes don't fan out into three concurrent eval runs. Running on every commit floods the dataset with churn, trains engineers to ignore eval status, and inflates the judge bill without sharpening the signal. The PR gate is the merge decision; the nightly batch is the drift detector; the canary is the live-traffic guardrail. Three triggers, three jobs.

What does per-route scoping mean and why does it matter?

A PR touching `agents/legal/` reruns only the legal-RAG suite, not the support-bot suite or the sales-agent suite. The pattern: tag each example in the golden set with its route, expose a `--routes` flag to pytest (or `pytest -m route_legal` markers), and have a small `affected_routes.py` script read the PR diff and emit the affected route list as a GitHub Actions step output. The PR gate then runs only the matching shards. This is what keeps a multi-route monorepo's PR-feedback budget bounded. Without it, a 200-example suite multiplied across 8 routes is a 1,600-example sweep on every PR, which prices the gate out by month two. The fi CLI honors path scoping natively; the SDK's four distributed runners (Celery, Ray, Temporal, Kubernetes) handle the case where a single route's nightly corpus crosses 2,000 examples.
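A sketch of what such an `affected_routes.py` might look like. The prefix-to-route mapping and the shared-path list are hypothetical; the only external mechanism relied on is GitHub Actions reading `name=value` lines appended to the file named by the `GITHUB_OUTPUT` environment variable.

```python
import json
import os

# Hypothetical mapping from path prefixes to route names.
ROUTE_PREFIXES = {
    "agents/legal/": "legal",
    "agents/support/": "support",
    "agents/sales/": "sales",
}
# Hypothetical shared paths: a change here reruns every route.
SHARED_PREFIXES = ("prompts/shared/", "evals/common/")


def affected_routes(changed_files):
    """Return the sorted list of routes touched by a list of changed paths."""
    routes = set()
    for path in changed_files:
        if path.startswith(SHARED_PREFIXES):
            return sorted(ROUTE_PREFIXES.values())
        for prefix, route in ROUTE_PREFIXES.items():
            if path.startswith(prefix):
                routes.add(route)
    return sorted(routes)


def emit_step_output(routes, name="routes"):
    """Append `name=<json list>` to the GITHUB_OUTPUT file when it is set."""
    line = f"{name}={json.dumps(routes)}"
    output_path = os.environ.get("GITHUB_OUTPUT")
    if output_path:
        with open(output_path, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")
    return line
```

The changed-file list would typically come from `git diff --name-only` against the PR base; an empty route list is a signal for the workflow to skip the eval job entirely.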

How does the canary auto-rollback fit with PR-gate eval?

The PR gate proves the change doesn't regress on the curated golden set. The canary proves the change doesn't regress on the live traffic distribution, which is the only distribution that pays the bill. After merge, the Agent Command Center gateway routes 1-5 percent of production traffic to the new prompt while the rest stays on the incumbent. The same rubric library scores both populations; a rolling-mean drop beyond the calibrated threshold (typically 2-3 percentage points sustained over 15-60 minutes) trips the auto-rollback. The gateway emits per-request response headers (cost, latency, model used, fallback used, guardrail triggered) so the canary script has cost and routing telemetry next to the eval scores. If the canary holds for 24 hours, ramp to 25 percent, then 100 percent. Same rubric library scores all three populations (PR set, canary sample, production), so the numbers are comparable end to end.
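The "sustained drop" condition can be sketched as a small monitor that is fed one batch of canary scores per scoring interval. The class name, window size, and the three-interval sustain count are assumptions for illustration; the 0.02 threshold corresponds to the low end of the 2-3 point range mentioned above.

```python
from collections import deque
from statistics import fmean


class CanaryMonitor:
    """Trip a rollback when the rolling mean stays below baseline for several intervals."""

    def __init__(self, incumbent_mean, threshold=0.02, window=50, sustain=3):
        self.incumbent_mean = incumbent_mean
        self.threshold = threshold          # allowed drop on a 0-1 scale
        self.scores = deque(maxlen=window)  # rolling window of canary scores
        self.sustain = sustain              # consecutive breaching intervals required
        self.breaches = 0

    def observe_batch(self, batch_scores):
        """Feed one scoring interval; return True when rollback should trip."""
        self.scores.extend(batch_scores)
        if not self.scores:
            return False
        drop = self.incumbent_mean - fmean(self.scores)
        # A single healthy interval resets the streak, so brief dips do not trip it.
        self.breaches = self.breaches + 1 if drop > self.threshold else 0
        return self.breaches >= self.sustain
```

Requiring consecutive breaches is one simple way to encode "sustained"; with a 5-minute scoring interval, a sustain count of 3 corresponds to the 15-minute end of the range in the text.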
