CI/CD LLM Eval with GitHub Actions: The 2026 Workflow
Cheap-fast-statistically-significant LLM eval gates in GitHub Actions: classifier cascade, fi CLI exit codes, Welch's t-test, path-scoped triggers, auto-rollback.
A PR lands at 4:47 pm. The eval workflow runs 247 LLM-as-judge calls across nine rubrics on 200 examples. Four minutes, $3.20, green check. Twelve hours later, the support agent starts refusing valid questions on a class of queries the dataset never covered. Suite Groundedness held at 0.91. Production Groundedness on the affected traffic sat at 0.68 for those twelve hours. The gate fired green because it didn’t ask the right question.
Most CI LLM eval gates are smoke tests in a trench coat. Tiny dataset, mean against a frozen floor, pass on anything short of catastrophe. The dataset isn’t representative, the floor isn’t calibrated against baseline variance, the threshold doesn’t separate signal from judge noise. The green check is theater.
The opinion this post earns: CI eval on every PR has to be cheap, fast, and statistically significant. Pick any two and the gate is theater. A 30-example dataset that runs in 40 seconds for 12 cents fails the third corner. A 2,000-example LLM-judge sweep at $9 per PR fails the first two. The working pattern runs cheap deterministic checks plus a classifier-cascade on every PR, the full LLM-judge sweep nightly against a versioned dataset, and paired-comparison statistical gating with auto-rollback on the canary. The math (not vibes) decides which regressions block.
This guide is the working playbook. The pytest fixture pattern, the fi CLI exit-code partition, the full GitHub Actions YAML with path-scoped triggers and concurrency cancellation, per-route scoping, Welch’s t-test gating, and the FAGI vendor surface that ships the stack. Code shaped against the ai-evaluation SDK and the fi CLI.
TL;DR: the gate that doesn’t lie
| Step | Decision | Rule |
|---|---|---|
| 1. Triage | Classifier → judge → human | Cheap NLI rubrics first. Frontier judge only on disagreement. Human review on flagged clusters. |
| 2. Triggers | PR-blocking vs nightly vs canary | Cheap rubrics + deterministic floors on PR. Full LLM-judge sweep nightly. Same rubrics on canary. |
| 3. Path scope | Per-route triggers | paths: filter on the workflow. affected_routes.py script emits route list. concurrency.cancel-in-progress: true. |
| 4. Pytest | Parameterized + ai-evaluation SDK | @pytest.mark.parametrize per route. Evaluator(max_workers=16). Distributed runners past single-host. |
| 5. Exit codes | 0/2/3/6/7 partition | 0 success, 2 assertion fail, 3 warning, 6 API error, 7 timeout. Hard contract for CI policy. |
| 6. Statistical gate | Floor + delta + percentile | Welch’s t-test on continuous rubrics, z-test on binary, p95 on tail. p < 0.05 AND effect floor. |
| 7. Canary auto-rollback | Gateway-level | 1-5% live traffic. Rolling-mean drift > threshold trips rollback. Gateway response headers feed cost. |
The hard ones are 5 and 6. The rest is plumbing.
The cheap-fast-significant triangle
Three constraints fight for the gate’s budget. Cheap, fast, statistically significant. The instinct is to optimize for one corner and hope the others land. They don’t. They actively trade against each other, and the gate that misses any one is theater.
Cheap. Cents per PR, not dollars. A frontier judge at $0.015 per call against 200 examples and nine rubrics is $27 per PR. Multiply by 12 PRs per day across the team and the eval bill outpaces the model API spend by month three. Somebody quietly disables the gate.
Fast. Under five minutes push to verdict. Above ten, engineers merge and apologize. CI is a queue, and the queue measures the slowest thing on the critical path.
Statistically significant. The gate fails when the regression is real and passes when the variance is judge noise. A 30-example dataset gives a 95 percent confidence interval of roughly +/- 0.07 on a 0-1 rubric mean (assuming a per-example standard deviation near 0.2). A 2-percentage-point drop sits well inside that noise band; a gate that fires on it raises false alarms about half the time. Teams fire-drilled twice learn to ignore the third.
Classifier cascade fixes cheap. Path scoping plus concurrency cancellation fixes fast. Sample sizing plus delta gating with Welch’s t-test fixes significant. Each lever exists; the design call is wiring all three at once.
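The numbers behind the cheap and statistically significant corners come from short arithmetic. A minimal sketch, assuming a normal-approximation interval and a per-example score standard deviation of 0.2 (an illustrative value, not a figure from any real run); the function names are hypothetical:

```python
import math
from statistics import NormalDist


def judge_cost(examples: int, rubrics: int, price_per_call: float) -> float:
    """Cost of one full LLM-judge sweep: one call per (example, rubric) pair."""
    return examples * rubrics * price_per_call


def ci_half_width(sd: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of a normal-approximation confidence interval on a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / math.sqrt(n)


def examples_needed(sd: float, margin: float, confidence: float = 0.95) -> int:
    """Smallest n whose interval half-width is at most `margin`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sd / margin) ** 2)


ASSUMED_SD = 0.2  # assumption: per-example score spread on a 0-1 scale

print(f"per-PR judge cost: ${judge_cost(200, 9, 0.015):.2f}")
print(f"30 examples:  +/- {ci_half_width(ASSUMED_SD, 30):.3f}")
print(f"200 examples: +/- {ci_half_width(ASSUMED_SD, 200):.3f}")
print(f"examples for +/- 0.03: {examples_needed(ASSUMED_SD, 0.03)}")
```

Swap in the standard deviation measured on your own baseline runs; the required sample size scales with its square, so a noisier rubric needs a much larger golden set for the same margin.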
The triage: classifier, judge, human
Three layers, in order of cost and order of decision.
Cheap deterministic + classifier (runs on every PR). JSON schema validation, regex on tool-call structure, exact-match on golden examples, citation existence, and the 8 sub-10ms Scanner classes (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner). On top of that, the NLI-backed classifier rubrics (faithfulness, claim_support, rag_faithfulness, factual_consistency) score the semantic axes that pattern-matching can’t see. Microseconds to tens of milliseconds per call, no API spend.
Frontier judge (runs nightly on the full corpus, runs on PR only for cases the classifier can’t decide). This is where Groundedness, ContextAdherence, Completeness, ChunkAttribution, AnswerRefusal, TaskCompletion, EvaluateFunctionCalling earn their bill. The judge runs against the 30-60 examples the classifier flagged as low-confidence, not the 200-example whole. That’s the cascade’s payoff: judge calls on the PR path drop by roughly 3-7x (200 examples down to 30-60) without losing signal.
Human review (runs on flagged clusters, not on every example). Error Feed (inside the FAGI eval stack) uses HDBSCAN soft-clustering over ClickHouse-stored span embeddings to group failures into named issues. A Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars, 90 percent prompt-cache hit ratio) writes the RCA, evidence, an immediate_fix, and a 4-dim score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each). The human sees the cluster, not the raw stream. Five minutes a day of triage instead of five hours.
The cascade trap: don’t escalate to a frontier judge on rubrics where the classifier can’t have a target. Helpfulness, tone, brand voice need a judge from the start; faithfulness, citation validity, refusal calibration have a clean target the classifier can hit. Cascade where the classifier earns its keep.
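The routing rule behind the cascade can be sketched in a few lines. This is an illustration of the idea, not the SDK's implementation: `classify` and `judge` are hypothetical callables you would back with an NLI rubric and a frontier judge, and the 0.35/0.65 uncertainty band is an assumed threshold to calibrate per rubric.

```python
from typing import Callable, Iterable


def cascade_scores(
    examples: Iterable[dict],
    classify: Callable[[dict], float],
    judge: Callable[[dict], float],
    low: float = 0.35,
    high: float = 0.65,
) -> tuple[list[float], int]:
    """Score with the cheap classifier; call the judge only in the uncertain band.

    Returns (scores, judge_calls) so the caller can track judge spend.
    """
    scores: list[float] = []
    judge_calls = 0
    for example in examples:
        score = classify(example)
        if low < score < high:
            # Classifier is not confident either way: escalate this example.
            score = judge(example)
            judge_calls += 1
        scores.append(score)
    return scores, judge_calls
```

Logging `judge_calls` per run makes the cost saving measurable: if most examples land in the uncertain band, the rubric probably belongs in the judge-from-the-start group.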
The pytest pattern: parameterized + ai-evaluation SDK
The pytest fixture pattern that keeps the gate honest. Path-scoped, parameterized per route, calls Evaluator(max_workers=16) for in-process parallelism, delegates to the SDK’s distributed runners past the single-host limit.
# tests/test_eval_gate.py
import json
import os
import statistics
from pathlib import Path

import pytest
from scipy import stats

from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextRelevance, Completeness,
    ChunkAttribution, AnswerRefusal, FactualAccuracy,
)
from fi.testcases import TestCase

BASELINE_DIR = Path("evals/baselines")  # rolling 7-day, JSON per route
GOLDEN_DIR = Path("evals/golden")       # JSONL per route
ROUTES = ["support-rag", "legal-rag", "sales-agent"]

# Absolute floors (catch catastrophic regressions)
FLOORS = {
    "Groundedness": 0.85,
    "ContextRelevance": 0.80,
    "Completeness": 0.75,
    "ChunkAttribution": 0.80,
    "AnswerRefusal": 0.90,
    "FactualAccuracy": 0.80,
}

# Per-rubric noise floor (effect size below this is treated as variance)
NOISE_FLOOR = 0.03

evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
    max_workers=16,
)


def regression_gate(current, baseline, alpha=0.05, min_effect=NOISE_FLOOR):
    """Fail only if mean dropped AND change is significant AND effect exceeds floor."""
    delta = statistics.mean(current) - statistics.mean(baseline)
    if delta >= 0:
        return True, f"no regression (delta=+{delta:.3f})"
    _, p = stats.ttest_ind(current, baseline, equal_var=False)
    if p >= alpha:
        return True, f"delta={delta:.3f}, p={p:.3f} (not significant)"
    if abs(delta) < min_effect:
        return True, f"delta={delta:.3f} below effect floor {min_effect}"
    return False, f"regression: delta={delta:.3f}, p={p:.3f}"


def load_cases(route):
    with (GOLDEN_DIR / f"{route}.jsonl").open() as fh:
        for line in fh:
            row = json.loads(line)
            yield TestCase(
                input=row["question"],
                output=row["candidate_response"],
                context="\n\n".join(row["chunks"]),
                expected_response=row.get("expected"),
            )


@pytest.mark.parametrize("route", ROUTES)
def test_pr_gate(route, request):
    if route not in request.config.getoption("--routes").split(","):
        pytest.skip(f"{route} not affected by this PR")

    cases = list(load_cases(route))
    templates = [Groundedness(), ContextRelevance(), Completeness(),
                 ChunkAttribution(), AnswerRefusal(), FactualAccuracy()]
    result = evaluator.evaluate(eval_templates=templates, inputs=cases)
    scores_by_rubric = result.aggregate_by_template()
    baselines = json.loads((BASELINE_DIR / f"{route}.json").read_text())

    failures = []
    for name, current in scores_by_rubric.items():
        # Floor check (catches catastrophic drops)
        if statistics.mean(current) < FLOORS[name]:
            failures.append(f"{route}.{name}: mean below floor {FLOORS[name]}")
        # Delta check (catches slow drift)
        passed, reason = regression_gate(current, baselines[name])
        if not passed:
            failures.append(f"{route}.{name}: {reason}")

    assert not failures, "\n".join(failures)
Four design calls earn their keep. Versioned baseline per route (rolling 7-day, JSON in git so the diff is reviewable). Per-route scope in the parametrize plus a --routes CLI flag (skipped routes don’t burn judge tokens). Both a floor check (catches the catastrophic drop) and a delta check with Welch’s t-test (catches slow drift the floor misses). And max_workers=16 so the in-process parallelism saturates the judge provider’s rate limit before the distributed runner has to take over.
The aggregate_by_template() call returns per-example score arrays per rubric, which is what feeds the statistical test. If you only have the means, you can’t run the t-test; the gate then collapses to “did the average drop”, which is the noise-bait pattern the post is arguing against.
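The test reads `--routes` through `request.config.getoption`, which only works if the option is registered with pytest. A minimal `conftest.py` sketch, assuming you want every route to run when the flag is omitted (the default value is this sketch's choice, not something the SDK provides):

```python
# conftest.py (sketch)
DEFAULT_ROUTES = "support-rag,legal-rag,sales-agent"


def pytest_addoption(parser):
    """Register --routes so request.config.getoption("--routes") resolves."""
    parser.addoption(
        "--routes",
        action="store",
        default=DEFAULT_ROUTES,
        help="Comma-separated routes to evaluate; all other routes are skipped.",
    )
```

With this in place, `pytest tests/test_eval_gate.py --routes=support-rag` skips the other two parametrized cases, and a bare `pytest` run evaluates all three.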
The fi CLI exit codes
The fi CLI is the layer that makes the gate wire cleanly into any CI runner without grep heuristics on stdout. Exit codes are the stable contract; log lines reformat between SDK versions.
# evals/fi-evaluation.yaml: native CI assertions
data:
  - "evals/golden/support-rag.jsonl"
  - "evals/golden/legal-rag.jsonl"

evaluations:
  - template: "groundedness"
    config:
      mode: "cascade"  # NLI classifier first, frontier judge on disagreement
  - template: "answer_refusal"
  - template: "citation_validity"

assertions:
  - template: "groundedness"
    condition: "p95_score >= 0.78"  # gate on the tail, not just the mean
    on_fail: "error"
  - template: "answer_refusal"
    condition: "pass_rate >= 0.95"
    on_fail: "error"
  - template: "citation_validity"
    condition: "pass_rate >= 0.99"
    on_fail: "error"
Run it with fi run --check --strict --parallel 16 -c evals/fi-evaluation.yaml. The exit-code partition (verified at fi/cli/assertions/exit_codes.py):
| Code | Meaning | CI policy |
|---|---|---|
| 0 | All assertions passed | Merge proceeds |
| 1 | Evaluation execution error | Investigate; do not retry blindly |
| 2 | One or more assertions failed | Hard-fail the PR check |
| 3 | --strict warning (passed but flagged) | Slack-notify; do not block |
| 4 | Configuration file error | Fix the YAML; not a model issue |
| 5 | Test data file error | Fix the dataset path or format |
| 6 | API connection / auth error | Retry with backoff; fail loudly if persists |
| 7 | Evaluation timeout | Raise timeout-minutes or shard the suite |
The reason the partition matters: CI policies wire on top of stable exit codes. Hard-fail on 2, retry on 6, alarm on 3, escalate on 7. The moment the gate logic is “grep the stdout for the word ERROR”, a log-line refactor breaks every workflow. Treat the exit code as the contract.
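One way to wire that policy is a thin wrapper that maps each exit code to an action and retries only on the retryable one. A minimal sketch: the code meanings follow the table above, while the action names, retry count, and backoff are this sketch's own assumptions.

```python
import subprocess
import time

# Exit-code meanings follow the table above; action names are illustrative.
POLICY = {
    0: "pass",
    1: "fail",      # execution error: investigate, no blind retry
    2: "fail",      # assertion failed: hard-fail the PR check
    3: "warn",      # --strict warning: notify, do not block
    4: "fail",      # configuration file error
    5: "fail",      # test data file error
    6: "retry",     # API connection / auth error
    7: "escalate",  # timeout: raise timeout-minutes or shard
}


def action_for(exit_code: int) -> str:
    """Map an exit code to a CI action; unknown codes fail closed."""
    return POLICY.get(exit_code, "fail")


def run_gate(cmd: list[str], max_retries: int = 2, backoff_seconds: float = 5.0) -> str:
    """Run the gate command, retrying with exponential backoff only on 'retry'."""
    for attempt in range(max_retries + 1):
        action = action_for(subprocess.run(cmd).returncode)
        if action != "retry":
            return action
        if attempt < max_retries:
            time.sleep(backoff_seconds * 2 ** attempt)
    return "fail"  # retries exhausted: fail loudly
```

A CI step would call `run_gate(["fi", "run", "--check", "--strict", "-c", "evals/fi-evaluation.yaml"])` and exit non-zero on "fail" or "escalate", while "warn" posts a notification and lets the check pass.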
Native assertion conditions cover pass_rate, avg_score, p50_score, p90_score, p95_score, and runtime percentiles. The percentile assertions are the ones long-tail failures need: a regression that pushes p95_score below a tail floor while leaving the mean intact fails on the percentile and tells the on-call where to look.
The full GitHub Actions workflow
The drop-in YAML. Path-scoped triggers, concurrency.cancel-in-progress, sharding via matrix, the fi CLI cheap-rubric gate, the pytest statistical delta gate.
# .github/workflows/eval-pr-gate.yml
name: eval-pr-gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
      - "src/agent/**"
      - "src/tools/**"
      - ".github/workflows/eval-pr-gate.yml"

concurrency:
  group: eval-pr-gate-${{ github.head_ref }}
  cancel-in-progress: true

permissions:
  contents: read
  pull-requests: write

jobs:
  detect-routes:
    runs-on: ubuntu-latest
    outputs:
      routes: ${{ steps.affected.outputs.routes }}
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 2 }
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: pip }
      - id: affected
        run: |
          ROUTES=$(python evals/affected_routes.py)
          echo "routes=$ROUTES" >> "$GITHUB_OUTPUT"

  pr-gate:
    needs: detect-routes
    if: needs.detect-routes.outputs.routes != '[]'
    runs-on: ubuntu-latest
    timeout-minutes: 8
    strategy:
      fail-fast: false
      matrix:
        route: ${{ fromJson(needs.detect-routes.outputs.routes) }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: pip }
      - uses: actions/cache@v4
        with:
          path: .eval_cache
          key: evals-${{ hashFiles('evals/rubrics/**', 'evals/golden/**') }}
      - run: pip install -r requirements.txt

      # Layer 1: cheap-rubric gate via fi CLI (classifier cascade + assertions)
      - name: Cheap-rubric gate
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: |
          fi run --check --strict --parallel 16 \
            -c evals/fi-evaluation.yaml \
            --filter route=${{ matrix.route }}

      # Layer 2: statistical delta gate (Welch's t-test vs 7-day baseline)
      - name: Statistical delta gate
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: |
          pytest tests/test_eval_gate.py -n auto \
            --routes=${{ matrix.route }} \
            --json-report --json-report-file=eval-report-${{ matrix.route }}.json

      - if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-report-${{ matrix.route }}-${{ github.sha }}
          path: eval-report-${{ matrix.route }}.json

      - if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const path = 'eval-report-${{ matrix.route }}.json';
            // The report exists only if the pytest step ran; Layer 1 may have failed first.
            const detail = fs.existsSync(path)
              ? '```json\n' + JSON.stringify(
                  JSON.parse(fs.readFileSync(path, 'utf8')).tests, null, 2) + '\n```'
              : 'No pytest report was produced; check the cheap-rubric gate logs.';
            const body = `### Eval gate: \`${{ matrix.route }}\` failed\n\n` + detail;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner, repo: context.repo.repo, body,
            });
Five habits pay back the first week. Path-scoped triggers so a CSS-only PR doesn’t burn judge tokens. concurrency.cancel-in-progress: true keyed on github.head_ref so three rapid pushes don’t fan out into three concurrent suites. Matrix sharding by route so a multi-route monorepo’s PRs scale linearly with the change surface, not with the route count. actions/cache@v4 keyed on rubric and dataset SHAs so reruns on unchanged code hit cache. And an eval report artifact per route on every run; a borderline regression turns into a 3-minute conversation instead of a 30-minute argument.
A separate schedule: cron: "17 7 * * *" workflow runs the full nightly sweep across every route on the larger corpus (500-2,000 examples) and posts the daily baseline back into Observe. That keeps the rolling 7-day window fresh.
Statistical gating: Welch’s t-test and bootstrap CI
This is the section most CI eval guides skip. It’s the one separating a working gate from theater.
A green check should mean “this PR did not introduce a statistically significant regression,” not “mean Groundedness on 30 examples sat above 0.85.” The two statements look identical and aren’t.
Two thresholds, both calibrated, neither absolute alone:
- Absolute floor per rubric. Catches the catastrophic regression. Starting floors: Groundedness >= 0.85, ContextRelevance >= 0.80, Completeness >= 0.75, ChunkAttribution >= 0.80, citation validity >= 0.99 (compliance work). Tune per workload.
- Delta gate against the trailing 7-day rolling baseline. Welch’s t-test on per-example score arrays (continuous rubrics). Two-proportion z-test on pass rates (binary rubrics). Fail when p < 0.05 AND the effect size exceeds the rubric’s noise floor (~0.03 on a 0-1 scale).
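For the binary rubrics, the two-proportion z-test can be sketched with the standard library alone. This version is one-sided (it only asks whether the current pass rate is lower), which is a choice made here for illustration; the function names and the 0.03 effect floor default are assumptions mirroring the continuous gate.

```python
import math
from statistics import NormalDist


def two_proportion_z(pass_cur: int, n_cur: int, pass_base: int, n_base: int):
    """One-sided two-proportion z-test: is the current pass rate lower?

    Returns (z, p) where p = P(Z <= z) under the pooled null hypothesis.
    """
    p_cur, p_base = pass_cur / n_cur, pass_base / n_base
    pooled = (pass_cur + pass_base) / (n_cur + n_base)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cur + 1 / n_base))
    if se == 0:  # both samples all-pass or all-fail: no evidence of a drop
        return 0.0, 1.0
    z = (p_cur - p_base) / se
    return z, NormalDist().cdf(z)


def binary_regression_gate(pass_cur, n_cur, pass_base, n_base,
                           alpha=0.05, min_effect=0.03):
    """Fail only if the pass rate dropped, the drop is significant, and it exceeds the floor."""
    delta = pass_cur / n_cur - pass_base / n_base
    z, p = two_proportion_z(pass_cur, n_cur, pass_base, n_base)
    if delta < 0 and p < alpha and abs(delta) >= min_effect:
        return False, f"regression: delta={delta:.3f}, z={z:.2f}, p={p:.4f}"
    return True, f"delta={delta:.3f}, p={p:.4f}"
```

The normal approximation gets unreliable when pass rates sit very close to 1.0 on small samples, which is common for rubrics like citation validity; an exact test is the safer choice there.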
Two extensions earn their keep on long-tail products. First, gate on p95_score instead of avg_score when the failure mode is tail-shaped (a regression that pushes the worst 5 percent below a tail floor while leaving the mean intact is the most common slow-burn). Second, use a paired-comparison bootstrap CI when the same eval examples run through the candidate and the baseline; the paired test has tighter variance than the independent two-sample test, so the gate detects smaller effect sizes at the same sample budget.
import numpy as np


def paired_bootstrap_ci(candidate, baseline, n_boot=10_000, alpha=0.05, seed=None):
    """Bootstrap CI on the paired delta. Same examples through both pipelines."""
    paired = np.asarray(candidate, dtype=float) - np.asarray(baseline, dtype=float)
    rng = np.random.default_rng(seed)  # pass a seed for reproducible CI runs
    # Resample example indices with replacement; one row per bootstrap replicate.
    idx = rng.integers(0, len(paired), size=(n_boot, len(paired)))
    boot = paired[idx].mean(axis=1)
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return paired.mean(), (lo, hi)
If the 95 percent CI for the paired delta sits entirely below zero, the regression is real. If it straddles zero, the change is variance. The bootstrap doesn’t assume a parametric distribution, which matters because rubric scores are often bimodal (cluster at 1 and near the floor, sparse in the middle).
The gate lives by one rule: the baseline is a rolling 7-day production observation, not a frozen number. Models drift, prompts drift, datasets drift. The gate drifts with them or it starts catching ordinary movement instead of regressions.
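Keeping the baseline rolling means rebuilding it from recent runs instead of editing a number by hand. A minimal sketch, assuming each nightly run is available as a mapping from rubric name to per-example scores keyed by run date; that layout and the function name are hypothetical, not an SDK format.

```python
from datetime import date, timedelta


def rolling_baseline(
    nightly: dict[date, dict[str, list[float]]],
    today: date,
    days: int = 7,
) -> dict[str, list[float]]:
    """Pool per-example scores per rubric from the last `days` nightly runs."""
    cutoff = today - timedelta(days=days)
    pooled: dict[str, list[float]] = {}
    for run_date in sorted(nightly):
        # Keep runs inside the window (cutoff, today]; older runs age out.
        if cutoff < run_date <= today:
            for rubric, scores in nightly[run_date].items():
                pooled.setdefault(rubric, []).extend(scores)
    return pooled
```

Serializing the result with `json.dumps` gives a per-route file whose values are score arrays per rubric, which is the shape the Welch's t-test needs on the baseline side.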
Per-route scoping: only eval what the PR touched
A 200-example suite multiplied across 8 routes is a 1,600-example sweep on every PR. The gate prices itself out by month two unless every PR runs only the suites it actually affects.
The pattern: tag each golden-set example with its route, expose a --routes CLI flag to pytest, and run a small affected_routes.py script that reads the PR diff and emits the matching route list as a GitHub Actions step output. The matrix shard above consumes that list.
# evals/affected_routes.py
import json
import subprocess

ROUTE_PATHS = {
    "support-rag": ["src/agent/support/", "prompts/support_agent",
                    "evals/golden/support-rag.jsonl"],
    "legal-rag": ["src/agent/legal/", "prompts/legal_agent",
                  "evals/golden/legal-rag.jsonl"],
    "sales-agent": ["src/agent/sales/", "prompts/sales_agent",
                    "evals/golden/sales-agent.jsonl"],
}


def affected():
    diff = subprocess.check_output(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"], text=True).splitlines()
    hits = set()
    for route, paths in ROUTE_PATHS.items():
        if any(d.startswith(p) for d in diff for p in paths):
            hits.add(route)
    # Shared paths: rerun every route
    shared = ["src/shared/", "evals/rubrics/", ".github/workflows/eval-pr-gate.yml"]
    if any(d.startswith(p) for d in diff for p in shared):
        hits = set(ROUTE_PATHS.keys())
    return sorted(hits)


if __name__ == "__main__":
    print(json.dumps(affected()))
The shared-paths bucket is the trap most teams hit. A change to evals/rubrics/ or to a shared library affects every route, and a per-route filter would miss it. The script falls back to “rerun every route” when shared paths land in the diff. That’s the right default; the cost of a full sweep on a rubric change is the price of keeping the gate honest.
For datasets above a few thousand examples per route, the SDK ships four distributed runners (Celery, Ray, Temporal, Kubernetes) that parallelize across a worker pool past the single-host limit. The cheap path is Evaluator(max_workers=16) saturating the judge provider’s rate limit; the distributed runners take over when one CI runner stops keeping up.
Canary auto-rollback and the gateway
The PR gate proves the change doesn’t regress on the curated golden set. The canary proves the change doesn’t regress on the live traffic distribution, which is the only distribution that pays the bill.
The pattern uses the Agent Command Center gateway to route 1-5 percent of production traffic to the new prompt while the rest stays on the incumbent. The same rubric library scores both populations via span-attached scoring (EvalTag plus traceAI). A rolling-mean drop beyond the calibrated threshold (typically 2-3 percentage points sustained over 15-60 minutes) trips the gateway’s auto-rollback hook.
# .github/workflows/eval-canary.yml
name: eval-canary

on:
  workflow_dispatch:
    inputs:
      candidate_prompt: { required: true }
      ramp_percent: { default: "5" }

jobs:
  canary:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: pip }
      - run: pip install -r requirements.txt

      - name: Register canary route
        env:
          AGENTCC_API_KEY: ${{ secrets.AGENTCC_API_KEY }}
        # Block scalar: a plain multi-line scalar would fold the newlines
        # and break the backslash continuations.
        run: |
          python scripts/register_canary.py \
            --prompt "${{ inputs.candidate_prompt }}" \
            --percent "${{ inputs.ramp_percent }}"

      - name: Sample canary traces and score
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: python scripts/score_canary.py --window 30m

      - name: Promote or roll back
        run: python scripts/canary_decision.py
The decision script reads the same rubric scores produced by the PR gate, runs the paired-comparison bootstrap CI on the canary versus the incumbent rolling mean, and either ramps to the next stage (5 → 25 → 100 percent) or hits the gateway’s rollback endpoint. The gateway emits per-request response headers (cost, latency, model used, fallback used, routing strategy, guardrail triggered) on every call, so the canary script has cost and routing telemetry next to the eval scores. Same rubric in CI and in production means the numbers are comparable end to end; engineers stop arguing which number is “real” and start fixing the bug.
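The rolling-mean trip wire at the core of that decision can be sketched as follows. This is a simplified stand-in for `scripts/canary_decision.py`, which the post does not show: it counts observations instead of minutes, compares against a fixed incumbent mean, and the window, threshold, and sustain values are assumptions to calibrate.

```python
from collections import deque
from statistics import fmean


def canary_decision(canary_scores, incumbent_mean, window=50,
                    max_drop=0.02, sustain=3):
    """Return 'rollback' once the rolling mean sits more than `max_drop`
    below the incumbent for `sustain` consecutive observations, else 'promote'."""
    recent = deque(maxlen=window)
    breaches = 0
    for score in canary_scores:
        recent.append(score)
        if len(recent) < window:
            continue  # wait for a full window before judging
        if incumbent_mean - fmean(recent) > max_drop:
            breaches += 1
            if breaches >= sustain:
                return "rollback"
        else:
            breaches = 0  # drift must be sustained, not a single dip
    return "promote"
```

Requiring several consecutive breaches keeps one bad batch of traces from aborting a healthy rollout; the paired bootstrap CI is the stronger check when the same requests can be scored on both arms.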
Anti-patterns the team usually hits
- Running on every commit instead of every PR. Pulls the team into ignoring red builds, blows the judge token budget, and floods the dataset with churn. PRs only, with a paths: filter and concurrency.cancel-in-progress.
- No baseline. Treating each run as standalone means you can’t tell a noise-floor flake from a real regression. The rolling 7-day JSON committed to git is the cheapest fix.
- No per-rubric scoring. Aggregating to a single “LLM quality 0.84” number hides which rubric regressed. The per-rubric vector with per-rubric floors is the working pattern.
- Floor without delta gate. Slow regressions slip under the floor for months. The delta gate with Welch’s t-test catches the drift the floor misses.
- 30-example dataset, mean-based gate. Variance wider than the regressions you’re catching. Grow to 100-200 per route, or gate on p95_score.
- LLM-as-judge at the base. The most expensive layer running on every PR on every example. Engineers disable the gate by month two. Classifier cascade first; the frontier judge only on disagreement.
- No notification on nightly drift. The cron job runs, the artifact uploads, nobody looks. A Slack webhook on failure is the bare minimum.
- No rollback path on canary. Without the gateway-level promote-or-rollback hook, the canary becomes a slow-motion deploy that no one knows how to abort.
- Static dataset frozen at launch. A 2024 set evaluating a 2026 product is a benchmark, not a regression suite. Error Feed promotes failing production traces weekly.
How Future AGI ships the CI gate
Future AGI ships the eval stack as a package, designed for the cheap-fast-significant triangle. Start with the SDK and the fi CLI for code-defined gates. Graduate to the Platform when the cascade, the closed loop, and the gateway start mattering.
- ai-evaluation SDK (Apache 2.0). 60+ EvalTemplate classes including the seven RAG metrics (Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy) plus AnswerRefusal, TaskCompletion, EvaluateFunctionCalling, PromptInjection, DataPrivacyCompliance. 8 sub-10ms Scanner classes for the deterministic base (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner). NLI-backed local rubrics (faithfulness, claim_support, rag_faithfulness, factual_consistency) for the classifier tier. Four distributed runners (Celery, Ray, Temporal, Kubernetes) for batch regression. RailType.INPUT/OUTPUT/RETRIEVAL plus AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED for ensemble rubrics.
- fi CLI. fi init --template rag|basic|safety|agent scaffolds fi-evaluation.yaml. fi run --check --strict --parallel 16 evaluates assertions on pass_rate, avg_score, p50/p90/p95_score, and runtime percentiles, with CI-distinct exit codes (0/1/2/3/4/5/6/7) as a hard contract. --mode hybrid routes between local classifier rubrics and cloud judges; --offline runs classifier-only when the network is down.
- Future AGI Platform. Self-improving evaluators tuned by thumbs feedback. An in-product authoring agent writes RAG-specific rubrics from natural-language descriptions. Classifier-backed evals at lower per-eval cost than Galileo Luna-2, which makes weekly full-suite reruns the default rather than the exception.
- traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C#. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY) at register() time. 14 span kinds including a first-class RETRIEVER. 62 built-in evals wire server-side via EvalTag with zero inline latency.
- Error Feed (inside the eval stack). HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every trace failure into a named issue. A Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars, 90 percent prompt-cache hit ratio) writes the RCA, evidence, an immediate_fix, and a 4-dim score. Linear integration ships today; Slack, GitHub, Jira, and PagerDuty land on the roadmap.
- Agent Command Center. OpenAI-compatible gateway in a single Go binary (Apache 2.0). 100+ providers, 18+ built-in guardrail scanners plus 15 third-party adapters, exact and semantic caching, MCP and A2A protocol support, OTel/Prometheus observability. Cloud-hosted at gateway.futureagi.com or self-hostable. SOC 2 Type II, HIPAA, GDPR, and CCPA certified.
- agent-opt (Apache 2.0). Six optimizers (ProTeGi, GEPA, MetaPrompt, BayesianSearch, RandomSearch, PromptWizard) for prompt regression triage on green PRs.
The pieces are independent. Drop ai-evaluation plus the fi CLI into your CI this afternoon. Bring traceAI, Error Feed, and the Platform online as the program matures.
Ready to wire a CI gate that doesn’t lie? Run pip install ai-evaluation, then fi init --template rag, point the data paths at your real eval set, set FI_API_KEY and FI_SECRET_KEY in CI secrets, and add fi run --check --strict to your pull-request workflow. Your next regression has somewhere to land.
Related reading
- The 2026 LLM Evaluation Playbook
- Evaluate RAG Applications in CI/CD (2026)
- LLM Testing in 2026: Methods and Strategies
- CI/CD for AI Agents: Best Practices (2026)
- A/B Testing LLM Prompts: Best Practices (2026)
- Deterministic LLM Evaluation Metrics (2026)
- LLM Arena as a Judge: Pairwise Comparison Evals (2026)
- Why Agents Pass Evals and Fail in Production (2026)
Frequently asked questions
Why do most CI LLM eval gates fail in production?
What's the right exit-code partition for the fi CLI in GitHub Actions?
How big should the golden set be for a PR gate?
How does a classifier cascade actually save money?
What's the right way to gate on statistical significance without flaky CI?
Should the eval run on every commit or only on PRs?
What does per-route scoping mean and why does it matter?
How does the canary auto-rollback fit with PR-gate eval?