How to Evaluate RAG Applications in CI/CD Pipelines (2026)
RAG eval in CI/CD without the theatre: the cheap-fast-significant triangle, statistical gating, sharded parallelism, classifier cascades, production bridge.
A PR lands at 4:47 pm. The RAG eval runs in 38 seconds. Green check. Twelve hours later, support tickets trickle in. The trace tree shows the retriever quietly switched its top-1 chunk on a class of queries the 30-example dataset never covered. Suite Groundedness held at 0.91. Production Groundedness on the affected traffic sat at 0.62 for those twelve hours. The gate fired green because it didn’t ask the right question.
Most CI eval gates for RAG are smoke tests in a trench coat. Tiny dataset, mean against a frozen floor, pass on anything short of catastrophe. The dataset isn’t representative, the floor isn’t calibrated to baseline variance, the threshold doesn’t separate signal from judge noise. The green check is theatre.
The opinion this post earns: the CI eval triangle is cheap, fast, and statistically significant. Pick any two and the gate is theatre. A 30-example dataset for 12 cents per PR fails the third corner. A 2,000-example sweep for 9 dollars per PR fails the first two. Design a gate that holds all three. The math (not vibes) decides which regressions block.
This guide is the working playbook: trigger tiers, dataset sizing, statistical gating with deltas not means, shard-parallel pytest, classifier cascade for cost, the GitHub Actions wiring, and the production bridge that keeps CI and prod scoring the same way. Code shaped against the ai-evaluation SDK and the fi CLI.
TL;DR: the gate that doesn’t lie
| Step | Decision | Rule |
|---|---|---|
| 1. Triggers | PR-blocking vs nightly vs canary | Cheap rubrics on PR. LLM-judge sweep nightly. Same rubrics on canary. |
| 2. Dataset | 100-200 cases per route | Sampled from production. Versioned. Grown weekly through Error Feed. |
| 3. Metric stack | Five core rubrics minimum | Groundedness, ContextRelevance, Completeness, ChunkAttribution, citation validity. |
| 4. Statistical gate | Floor + delta | Absolute floor catches breaks. Welch’s t-test on delta vs 7-day baseline catches drift. |
| 5. Parallelism | Shard by route | Path-scoped triggers. Classifier cascade. Cached pair verdicts. |
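| 6. CI wiring | GitHub Actions workflow | Path-scoped triggers, verdict cache, parallel pytest, fi CLI assertions. |
| 7. Production bridge | Same rubric, two contexts | EvalTag attaches the CI rubric as a span-attached scorer. |
| 8. Closed loop | Failing traces ratchet the dataset | Error Feed clusters; engineer promotes; gate gets stronger. |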
The hard ones are 4 and 5. The rest is plumbing.
The CI eval triangle
Three constraints fight for the gate’s budget.
Cheap. Cents per PR, not dollars. If every push lights up a frontier judge bill, someone quietly disables the gate by quarter end. A gate that costs more than the feature is a gate that gets bypassed.
Fast. Under five minutes push to verdict. Above ten, engineers merge and apologise. CI is a queue, and the queue measures the slowest thing on the critical path.
Statistically significant. The gate fails when the regression is real and passes when the variance is judge noise. A 30-example dataset gives a 95 percent confidence interval of roughly +/- 0.07 on a 0-1 rubric mean, assuming a per-example standard deviation near 0.2. A 2-point drop (0.02) sits inside that noise band; a gate firing on it raises false alarms half the time. Teams fire-drilled twice learn to ignore the third.
Sharding and classifier cascade fix cheap. Path scoping and cache fix fast. Sample sizing and delta gating fix significant.
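To make the noise band concrete, here is a minimal sketch of the normal-approximation interval behind that figure. The per-example standard deviation of 0.2 is an assumed value for illustration, not a number from this post; measure it on your own baseline runs.

```python
import math
from statistics import NormalDist


def ci_half_width(std_dev: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of a normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 at 95 percent
    return z * std_dev / math.sqrt(n)


# Assumed standard deviation of 0.2 per example (illustrative only).
for n in (30, 100, 200, 500):
    print(n, round(ci_half_width(0.2, n), 3))
```

The half-width shrinks with the square root of the dataset size, which is why going from 30 to 200 examples tightens the band far more than going from 200 to 500.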
Step 1: trigger tiers, not one gate to rule them all
The first design mistake is running the full LLM-judge sweep on every push. The full sweep is the right gate for nightly main, not for an in-progress feature branch.
Three tiers, three cadences:
- PR-blocking, every push. Cheap classifier rubrics (NLI faithfulness, claim_support, factual_consistency) plus deterministic floors (citation validity, schema compliance, latency budget). Under three minutes against 100-200 examples. Blocks merge.
- Nightly main, full sweep. The LLM-judge stack (Groundedness, ContextRelevance, ContextAdherence, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy) against the versioned dataset. 15-30 minutes. Blocks the next promotion to canary.
- Canary observation, sampled live. Same rubrics attached as span-attached scorers via `EvalTag` against 5-10 percent of live traffic. Alarms on rolling-mean drift.
PR-blocking keeps engineers honest on the change in front of them. Nightly catches the regression a cheap rubric missed. Canary catches failure modes the dataset doesn’t know about. Conflating the three produces the “200 LLM-judge calls on every commit” pattern that prices the gate out of existence. Non-negotiable across all three: rubric definitions live in the same repo as application code, pinned alongside the prompt and the chunker.
Step 2: dataset composition is the lever, not size
A 2,000-example dataset built from the test author’s imagination loses to a 200-example dataset sampled from production every time. The dataset is the gate’s worldview; if the worldview misses the failure modes that show up at 2 am, the gate misses them too.
A RAG eval case carries four fields the gate actually reads:
```python
from dataclasses import dataclass, field


@dataclass
class RAGExample:
    id: str
    route: str                  # "legal-rag", "support-rag", ...
    question: str
    expected_answer: str
    expected_chunks: list[str]  # ground-truth doc IDs or text spans
    metadata: dict = field(default_factory=dict)
```
Composition over count. A 200-example set that earns its keep covers happy-path queries (top 20 percent of intents), edge cases (ambiguous, multi-hop, time-sensitive), refusal cases, negative cases, and the hardest 10 percent of historical failures from incident reports. expected_chunks is the field most teams skip and regret: without it you can score generation but not retrieval, and the bisect costs a day instead of an hour.
Below 100 examples per route, variance on per-rubric means sinks signal-to-noise under one. Above 500 the per-PR LLM-judge bill grows faster than detection sharpens. PR-blocking sweet spot: 100-200 per route. Nightly can go wider because cost amortises over one run.
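One way to sanity-check those bounds is the standard two-sample size formula for comparing means. This is a rough sketch, not the method behind the numbers above: the standard deviation, the 80 percent power target, and the function name `required_n` are all assumptions for illustration.

```python
import math
from statistics import NormalDist


def required_n(std_dev: float, min_delta: float,
               alpha: float = 0.05, power: float = 0.80) -> int:
    """Examples per group needed to detect a mean shift of min_delta (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired detection rate
    n = 2 * ((z_alpha + z_power) * std_dev / min_delta) ** 2
    return math.ceil(n)


# Assumed per-example standard deviation of 0.2 (illustrative only).
for delta in (0.10, 0.07, 0.05, 0.03):
    print(f"detect a {delta:.2f} drop: {required_n(0.2, delta)} examples per group")
```

Halving the detectable drop roughly quadruples the dataset, which is the cost curve that makes very large PR-blocking sets hard to justify.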
Step 3: the metric stack, five core rubrics
The five rubrics that gate most RAG regressions in CI:
| Metric | What it scores | What it catches | Layer |
|---|---|---|---|
| Groundedness | Are answer claims supported by retrieval context? | Hallucinated facts | Generation |
| ContextRelevance | Are retrieved chunks relevant to the query? | Noisy retrieval | Retrieval |
| Completeness | Did the response cover what the question asked? | Partial answers | Generation |
| ChunkAttribution | Which retrieved chunks the answer used | Generator ignoring good chunks | Both |
| Citation validity | Do cited spans exist verbatim in retrieval context? | Fabricated citations | Deterministic |
Split the suite by layer so the bisect collapses to a single comparison. A drop in ContextRelevance with stable Groundedness means the retriever regressed. A drop in Groundedness with stable ContextRelevance means the generator regressed. A combined “end-to-end correctness” rubric hides which layer moved.
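The bisect rule can be written down as a small helper. This is a sketch under assumptions: `localize_regression` and its 0.03 noise floor are hypothetical, and the "both" case is left for a human to investigate because the rule above does not cover it.

```python
def localize_regression(deltas: dict[str, float], noise_floor: float = 0.03) -> str:
    """Map per-rubric deltas (current minus baseline) to the layer that moved."""
    retrieval_moved = deltas.get("context_relevance", 0.0) <= -noise_floor
    generation_moved = deltas.get("groundedness", 0.0) <= -noise_floor
    if retrieval_moved and generation_moved:
        return "both"        # not separable from these two rubrics alone
    if retrieval_moved:
        return "retrieval"   # ContextRelevance dropped, Groundedness stable
    if generation_moved:
        return "generation"  # Groundedness dropped, ContextRelevance stable
    return "none"


print(localize_regression({"context_relevance": -0.08, "groundedness": -0.01}))
```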
Citation validity is the cheapest rubric in the stack. String match plus Levenshtein tolerance; running it on 100 percent of responses keeps the LLM-judge bill bounded to rubrics that need semantic scoring. Add domain rubrics as the pipeline matures: regulatory accuracy for legal/medical, retrieval token cost, latency budget. The five above are the floor.
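A minimal sketch of the deterministic citation check, assuming citations arrive as quoted text spans. The function names and the 5 percent edit budget are hypothetical; the fuzzy step uses the classic approximate-substring variant of Levenshtein distance, written out in plain Python so the block has no dependencies.

```python
def min_edit_distance_to_substring(needle: str, haystack: str) -> int:
    """Smallest Levenshtein distance between needle and any substring of haystack."""
    # Row 0 is all zeros: an empty needle matches at any position for free.
    prev = [0] * (len(haystack) + 1)
    for i, nc in enumerate(needle, start=1):
        curr = [i] + [0] * len(haystack)
        for j, hc in enumerate(haystack, start=1):
            curr[j] = min(
                prev[j] + 1,               # drop a needle character
                curr[j - 1] + 1,           # skip a haystack character
                prev[j - 1] + (nc != hc),  # match or substitute
            )
        prev = curr
    return min(prev)


def citation_is_valid(cited_span: str, chunks: list[str], tolerance: float = 0.05) -> bool:
    """True if the cited span appears in some chunk, exactly or within the edit budget."""
    span = " ".join(cited_span.split()).lower()
    if not span:
        return False
    budget = int(len(span) * tolerance)
    for chunk in chunks:
        text = " ".join(chunk.split()).lower()
        if span in text:  # cheap exact match first
            return True
        if budget and min_edit_distance_to_substring(span, text) <= budget:
            return True
    return False


chunks = ["The notice period is thirty days from receipt."]
print(citation_is_valid("notice period is thirty dayz", chunks))   # one typo, within budget
print(citation_is_valid("notice period is ninety days", chunks))   # different claim
```

The fuzzy pass is quadratic in span and chunk length, so the exact-match shortcut matters when this runs on every response.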
Step 4: statistical gating, not “did the mean drop”
This is the section most CI eval guides skip. It’s the one separating a working gate from theatre.
A green check should mean “this PR did not introduce a statistically significant regression,” not “mean Groundedness on 30 examples sat above 0.85.” The two statements look identical and are not.
Two thresholds, both calibrated:
- Absolute floor per rubric. Catches catastrophic regressions. Starting floors: Groundedness >= 0.85, ContextRelevance >= 0.80, Completeness >= 0.75, ChunkAttribution >= 0.80, citation validity >= 0.99. Tune per workload.
- Delta gate against the trailing 7-day rolling baseline. Welch’s t-test on per-example scores. Fail when p < 0.05 and the effect size exceeds the rubric’s noise floor.
```python
import statistics

from scipy import stats


def regression_gate(current, baseline, alpha=0.05, min_effect=0.03):
    """Fail only if the mean dropped, the change is significant, and the effect is big enough."""
    delta = statistics.mean(current) - statistics.mean(baseline)
    if delta >= 0:
        return True, f"no regression (delta=+{delta:.3f})"
    _, p = stats.ttest_ind(current, baseline, equal_var=False)
    if p >= alpha:
        return True, f"delta={delta:.3f}, p={p:.3f} (not significant)"
    if abs(delta) < min_effect:
        return True, f"delta={delta:.3f} below effect floor {min_effect}"
    return False, f"regression: delta={delta:.3f}, p={p:.3f}"
```
Swap the t-test for a two-proportion z-test on pass-rate rubrics like citation validity. For long-tail failures that hide in averages, gate on percentiles. The fi CLI ships pass_rate, avg_score, p50/p90/p95_score, and runtime percentiles as native assertion metrics; a regression pushing p95_score below a tail floor while leaving the mean intact fails on the percentile and tells the on-call where to look.
```yaml
# fi-evaluation.yaml: native CI assertions
assertions:
  - template: "groundedness"
    condition: "p95_score >= 0.78"   # gate on the tail, not the mean
    on_fail: "error"
  - template: "context_relevance"
    condition: "pass_rate >= 0.90"
    on_fail: "error"
  - template: "citation_validity"
    condition: "pass_rate >= 0.99"
    on_fail: "error"
```
fi run --check exits with code 2 on assertion failure (distinct from code 1 for evaluation errors), 3 when --strict assertions warn, 6 on API failure. That exit-code partition is what makes CI policies (retry on 6, hard-fail on 2, slack-notify on 3) wire cleanly into GitHub Actions or Buildkite without grep heuristics on stdout.
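As an illustration of that policy, a thin wrapper can translate exit codes into CI behaviour. This is a sketch under assumptions: the action labels, the retry count, the backoff, and the choice to treat code 3 as non-blocking are all hypothetical, and the notification is a stderr line standing in for a real Slack call.

```python
import subprocess
import sys
import time

# Exit codes as described above; the action labels are hypothetical.
POLICY = {0: "pass", 1: "fail", 2: "fail", 3: "notify", 6: "retry"}


def decide(exit_code: int) -> str:
    """Map an exit code to an action; unknown codes fail closed."""
    return POLICY.get(exit_code, "fail")


def run_gate(cmd: list[str], max_retries: int = 2, backoff_s: float = 5.0) -> int:
    """Run the gate command and return the exit code the CI job should report."""
    attempt = 0
    while True:
        code = subprocess.run(cmd).returncode
        action = decide(code)
        if action == "retry" and attempt < max_retries:
            attempt += 1
            time.sleep(backoff_s * attempt)  # linear backoff before the next try
            continue
        if action == "pass":
            return 0
        if action == "notify":
            print("eval gate: strict warnings raised, notify the channel", file=sys.stderr)
            return 0  # assumption: warnings notify but do not block
        return code   # hard failure, or retries exhausted


# In CI this might be called as:
# sys.exit(run_gate(["fi", "run", "--check", "--strict"]))
```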
The gate lives by one rule: the baseline is a rolling 7-day production observation, not a frozen number. Models drift, prompts drift, datasets drift; the gate drifts with them or it starts catching ordinary movement instead of regressions.
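For pass-rate rubrics such as citation validity, the two-proportion z-test mentioned earlier in this step can be sketched with the standard library alone. The function name `pass_rate_gate` and the example counts are hypothetical; the test is two-sided to mirror the t-test gate.

```python
import math
from statistics import NormalDist


def pass_rate_gate(cur_pass: int, cur_n: int, base_pass: int, base_n: int,
                   alpha: float = 0.05):
    """Fail only if the pass rate dropped and the drop is statistically significant."""
    p_cur, p_base = cur_pass / cur_n, base_pass / base_n
    if p_cur >= p_base:
        return True, f"no regression ({p_cur:.3f} vs {p_base:.3f})"
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (cur_pass + base_pass) / (cur_n + base_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / cur_n + 1 / base_n))
    z = (p_cur - p_base) / se
    p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided
    if p_value >= alpha:
        return True, f"drop not significant (z={z:.2f}, p={p_value:.3f})"
    return False, f"regression: {p_cur:.3f} vs {p_base:.3f} (z={z:.2f}, p={p_value:.4f})"


# Hypothetical counts: 200 current examples against a 1,400-example rolling baseline.
print(pass_rate_gate(180, 200, 1386, 1400))
print(pass_rate_gate(197, 200, 1386, 1400))
```

Near a 0.99 floor the normal approximation gets shaky with small counts, so an exact test may be a better fit when failures are in the single digits.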
Step 5: shard the eval set, cascade the judge
A 200-example suite against an LLM judge at 1.2 seconds per call takes four minutes serially. Sharded across 8 workers it takes 30 seconds. The numbers stop being theoretical the moment merge queue depth crosses three.
- Path-scoped triggers. A PR touching `rag/legal/` reruns only the legal-RAG evals. GitHub Actions `on.pull_request.paths` is the cheap version; route-tagged datasets plus `pytest -m route_legal` is sharper.
- Shard by route, parallelise within shard. `Evaluator(max_workers=16)` saturates the rate limit on most judge providers; the SDK’s four distributed runners (Celery, Ray, Temporal, Kubernetes) take over past the single-host limit.
- Classifier cascade. A DeBERTa-class classifier triages every example; only low-confidence or disagreement cases escalate to the frontier judge. On a 200-example suite the cascade pushes 70-85 percent through the classifier at fractional cost.
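The shard-then-parallelise idea is independent of any SDK. Here is a generic sketch using only the standard library, where `judge` is a hypothetical callable standing in for one rubric call and examples are plain dicts with a `route` key.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def shard_by_route(examples: list[dict]) -> dict[str, list[dict]]:
    """Group examples by their route tag."""
    shards = defaultdict(list)
    for ex in examples:
        shards[ex["route"]].append(ex)
    return dict(shards)


def score_shards(examples: list[dict], judge, max_workers: int = 8) -> dict[str, list]:
    """Score every shard concurrently; results keep each shard's input order."""
    shards = shard_by_route(examples)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front so workers stay busy across shards.
        futures = {
            route: [pool.submit(judge, ex) for ex in shard]
            for route, shard in shards.items()
        }
        return {route: [f.result() for f in fs] for route, fs in futures.items()}


examples = [
    {"route": "legal-rag", "question": "q1"},
    {"route": "support-rag", "question": "q22"},
    {"route": "legal-rag", "question": "q333"},
]
print(score_shards(examples, judge=lambda ex: len(ex["question"])))
```

Threads suit this workload because judge calls are network-bound; the worker count is then capped by the provider's rate limit, not by CPU.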
```python
import statistics

import pytest
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextRelevance, Completeness, ChunkAttribution,
)
from fi.testcases import TestCase

# regression_gate is the Step 4 function. changed_routes, run_pipeline, and
# citation_invalid are project helpers assumed to be defined elsewhere.

evaluator = Evaluator(max_workers=16)  # FI_API_KEY / FI_SECRET_KEY from env
ROUTES = ["legal-rag", "support-rag", "internal-search"]
RUBRICS = {
    "groundedness": (Groundedness(), 0.85),
    "context_relevance": (ContextRelevance(), 0.80),
    "completeness": (Completeness(), 0.75),
    "chunk_attribution": (ChunkAttribution(), 0.80),
}


@pytest.mark.parametrize("route", ROUTES)
def test_rag_route(route, dataset_for_route, baseline_loader, request):
    if route not in changed_routes(request):
        pytest.skip(f"{route} not affected by this PR")

    test_cases = []
    for ex in dataset_for_route(route):
        answer, chunks = run_pipeline(ex)
        # Deterministic floor first: no judge call is spent on a fabricated citation.
        if citation_invalid(answer, chunks):
            pytest.fail(f"{ex.id}: fabricated citation (deterministic floor)")
        test_cases.append(TestCase(input=ex.question, output=answer,
                                   context="\n\n".join(chunks), id=ex.id))

    baselines = baseline_loader(route)
    for name, (template, floor) in RUBRICS.items():
        result = evaluator.evaluate(eval_templates=[template], inputs=test_cases)
        scores = [m.value for r in result.eval_results for m in r.metrics]
        passed, reason = regression_gate(scores, baselines[name])
        assert passed, f"{route}.{name}: {reason}"
        assert statistics.mean(scores) >= floor, \
            f"{route}.{name} below absolute floor {floor}"
```
Cache verdicts keyed on (rubric_version, judge_model, input_hash, output_hash); reruns on unchanged code hit cache and finish in under a minute. Invalidate on rubric or judge version bump, not on every PR.
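A minimal sketch of such a cache, assuming verdicts are plain numeric scores stored as JSON files. `verdict_key` and `VerdictCache` are hypothetical names; the `.eval_cache` default is chosen only to line up with a CI cache directory.

```python
import hashlib
import json
from pathlib import Path


def _sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def verdict_key(rubric_version: str, judge_model: str,
                input_text: str, output_text: str) -> str:
    """Stable key over (rubric_version, judge_model, input_hash, output_hash)."""
    parts = [rubric_version, judge_model, _sha(input_text), _sha(output_text)]
    return _sha(json.dumps(parts))


class VerdictCache:
    """File-backed cache: one small JSON file per verdict."""

    def __init__(self, root: str = ".eval_cache"):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def get_or_score(self, key: str, score_fn):
        path = self.root / f"{key}.json"
        if path.exists():  # cache hit: skip the judge call
            return json.loads(path.read_text())["score"]
        score = score_fn()
        path.write_text(json.dumps({"score": score}))
        return score
```

Because the rubric version and judge model are part of the key, bumping either one misses the cache naturally; no explicit invalidation step is needed.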
The cascade trap: don’t cascade on subjective axes a classifier can’t score. A 2B-parameter classifier won’t call helpfulness or tone better than a frontier judge. Cascade where the classifier has a clean target (NLI faithfulness, claim_support, citation validity); reserve the frontier judge for rubrics where it earns its bill.
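The escalation rule itself is small. This sketch covers only the low-confidence path (disagreement-based escalation is omitted), and `classifier`, `judge`, and the 0.9 confidence threshold are hypothetical stand-ins, not SDK calls.

```python
from typing import Callable


def run_cascade(
    examples: list[dict],
    classifier: Callable[[dict], tuple[float, float]],  # returns (score, confidence)
    judge: Callable[[dict], float],                      # expensive frontier judge
    min_confidence: float = 0.9,
) -> tuple[list[float], float]:
    """Score with the classifier; escalate only low-confidence cases to the judge."""
    scores, escalated = [], 0
    for ex in examples:
        score, confidence = classifier(ex)
        if confidence < min_confidence:
            score = judge(ex)  # pay for the judge only when the classifier is unsure
            escalated += 1
        scores.append(score)
    escalation_rate = escalated / len(examples) if examples else 0.0
    return scores, escalation_rate


# Toy stand-ins: the classifier reads precomputed fields, the judge returns a constant.
demo = [{"s": 0.9, "c": 0.95}, {"s": 0.4, "c": 0.60}]
print(run_cascade(demo, classifier=lambda ex: (ex["s"], ex["c"]), judge=lambda ex: 0.5))
```

Tracking the escalation rate per rubric is a cheap way to spot rubrics where the classifier rarely commits and the cascade is not paying for itself.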
Step 6: the GitHub Actions workflow
Drop-in. Path-scoped triggers, cache, parallel pytest, fi CLI assertions:
```yaml
name: RAG Evals
on:
  pull_request:
    paths: ["rag/**", "evals/**", "fi-evaluation.yaml"]
concurrency:
  group: rag-evals-${{ github.head_ref }}
  cancel-in-progress: true
jobs:
  pr-gate:
    runs-on: ubuntu-latest
    timeout-minutes: 8
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 2 }
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: "pip" }
      - uses: actions/cache@v4
        with:
          path: .eval_cache
          key: evals-${{ hashFiles('evals/rubrics/**', 'evals/datasets/**') }}
      - run: pip install -r requirements.txt
      - id: routes
        run: echo "routes=$(python evals/affected_routes.py)" >> "$GITHUB_OUTPUT"
      - name: Cheap-rubric gate (fi CLI assertions)
        if: steps.routes.outputs.routes != '[]'
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: fi run --check --strict --parallel 16 -c evals/fi-evaluation.yaml
      - name: Statistical delta gate (pytest)
        if: steps.routes.outputs.routes != '[]'
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
          BASELINE_WINDOW_DAYS: "7"
        run: pytest evals/test_rag.py -n auto --routes='${{ steps.routes.outputs.routes }}'
      - if: always()
        uses: actions/upload-artifact@v4
        # Block style here: "${{ ... }}" is not valid inside a YAML flow mapping.
        with:
          name: eval-report-${{ github.sha }}
          path: eval-report.json
```
A separate schedule: cron: "0 8 * * *" workflow runs the full nightly sweep across all routes and posts the daily baseline back into Observe. Same shape on GitLab CI, Buildkite, Jenkins, or CircleCI: pytest, the fi CLI, and a cache action.
Three habits pay back the first week. Path-scoped triggers so “every push reruns every eval” doesn’t price the gate out of existence. concurrency cancel-in-progress so three pushes don’t fan out into three concurrent suites. An eval report artifact with per-rubric scores, dataset diffs, and failing examples; that’s what turns a borderline regression into a 3-minute conversation instead of a 30-minute argument.
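The workflow calls `evals/affected_routes.py` without showing it. One possible shape is sketched below; the path prefixes are hypothetical and need to match your repo layout. It prints a JSON list, so an empty result is the literal `[]` that the workflow's `if:` conditions compare against.

```python
import json
import subprocess

# Hypothetical mapping from route name to the source paths that affect it.
ROUTE_PREFIXES = {
    "legal-rag": ("rag/legal/",),
    "support-rag": ("rag/support/",),
    "internal-search": ("rag/search/",),
}
# Hypothetical shared paths: a change here reruns every route.
SHARED_PREFIXES = ("rag/common/", "evals/rubrics/", "fi-evaluation.yaml")


def affected_routes(changed: list[str]) -> list[str]:
    """Return the routes whose evals should rerun for this set of changed files."""
    if any(path.startswith(SHARED_PREFIXES) for path in changed):
        return sorted(ROUTE_PREFIXES)
    return sorted(
        route for route, prefixes in ROUTE_PREFIXES.items()
        if any(path.startswith(prefixes) for path in changed)
    )


def changed_files(base: str = "HEAD~1") -> list[str]:
    """List files changed since base; relies on the fetch-depth: 2 checkout."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base, "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


# Script entry point, as the workflow would invoke it:
# print(json.dumps(affected_routes(changed_files())))
```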
Step 7: bridge to production with the same rubric
Offline CI catches regressions you can think of. Production catches the rest. Same EvalTemplate definition, two contexts. The moment CI and prod scoring disagree, you’re shipping against a worldview no longer correlated with what users see.
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalTag, EvalTagType, EvalSpanKind, EvalName, ProjectType,
)

eval_tags = [
    EvalTag(eval_name=EvalName.GROUNDEDNESS,
            type=EvalTagType.OBSERVATION_SPAN, value=EvalSpanKind.LLM, config={},
            mapping={"context": "retrieval.documents", "output": "output.value"}),
    EvalTag(eval_name=EvalName.CONTEXT_RELEVANCE,
            type=EvalTagType.OBSERVATION_SPAN, value=EvalSpanKind.RETRIEVER, config={},
            mapping={"input": "input.value", "context": "retrieval.documents"}),
]

register(project_type=ProjectType.OBSERVE, project_name="legal-rag",
         eval_tags=eval_tags)
```
The score writes back as a span attribute on the OTel trace. A failing trace surfaces with its rubric score next to latency and chunk IDs; the on-call engineer reads the regression and the failing trace in the same place. traceAI (Apache 2.0) ships 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C#. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY) ingest into Phoenix or Traceloop without re-instrumenting. 14 span kinds include a first-class RETRIEVER; 62 built-in evals wire server-side via EvalTag.
Sample 5-10 percent of production traffic for LLM-judge rubrics; cheap rubrics run on 100 percent. Alarm on a 2-5 point sustained drop in per-rubric rolling mean over 15-60 minutes per route. The delta between the offline CI baseline and the production rolling mean is itself a quality signal; the gap tells you how representative the dataset still is.
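The alarm rule can be sketched as a time-windowed rolling mean compared against the baseline. `RollingDriftAlarm` is a hypothetical class, and the 30-minute window, 0.03 drop, and minimum sample count are illustrative picks from inside the ranges above.

```python
from collections import deque


class RollingDriftAlarm:
    """Fires when the rolling mean over a time window sits below baseline by `drop`."""

    def __init__(self, baseline_mean: float, drop: float = 0.03,
                 window_s: float = 1800.0, min_samples: int = 30):
        self.baseline_mean = baseline_mean
        self.drop = drop
        self.window_s = window_s
        self.min_samples = min_samples
        self._samples = deque()  # (timestamp, score) pairs, oldest first
        self._total = 0.0

    def observe(self, ts: float, score: float) -> bool:
        """Record one scored span; return True if the alarm should fire."""
        self._samples.append((ts, score))
        self._total += score
        # Evict samples that fell out of the time window.
        while self._samples and self._samples[0][0] < ts - self.window_s:
            _, old = self._samples.popleft()
            self._total -= old
        if len(self._samples) < self.min_samples:
            return False  # too little traffic to call it sustained
        rolling_mean = self._total / len(self._samples)
        return self.baseline_mean - rolling_mean >= self.drop


alarm = RollingDriftAlarm(baseline_mean=0.91)
fired = [alarm.observe(ts=float(i * 10), score=0.62) for i in range(40)]
print(fired.index(True))  # index of the first observation that trips the alarm
```

The minimum sample count keeps a single bad response on a quiet route from paging anyone; one instance per route and rubric keeps the alarms separable.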
Step 8: close the loop, ratchet the dataset
The dataset stops being a regression suite the moment production drifts past it. The loop keeps the gate honest.
Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every trace failure into a named issue. A Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars, 90 percent prompt-cache hit ratio) reads the failing trace and writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each). Those fixes feed the Platform’s self-improving evaluators so the rubric ages with the product instead of decaying.
Self-improving evaluators retune from thumbs feedback; the next PR’s gate runs against a sharper definition. The on-call engineer (or a scheduled job) lifts representative traces from each named issue into the eval set with rubric labels. The next PR touching that path clears the new entries or fails. Linear integration ships today; Slack, GitHub, Jira, and PagerDuty land on the roadmap.
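The promotion step can be as small as a deduplicating merge. This sketch assumes failing traces arrive as dicts with `issue`, `route`, and `question` keys; that shape, the `promote_traces` name, and the per-issue cap are hypothetical. Promoted cases are flagged for labelling and left without answers, since those have to come from a person.

```python
import hashlib
from collections import defaultdict


def _fingerprint(question: str) -> str:
    """Hash of the whitespace- and case-normalised question, used for deduplication."""
    normalized = " ".join(question.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def promote_traces(dataset: list[dict], failing_traces: list[dict],
                   per_issue: int = 3) -> list[dict]:
    """Return the dataset plus up to per_issue new cases from each named issue."""
    seen = {_fingerprint(ex["question"]) for ex in dataset}
    taken = defaultdict(int)
    promoted = []
    for trace in failing_traces:
        fp = _fingerprint(trace["question"])
        if fp in seen or taken[trace["issue"]] >= per_issue:
            continue  # already covered, or this issue has enough representatives
        seen.add(fp)
        taken[trace["issue"]] += 1
        promoted.append({
            "id": f"prod-{fp[:12]}",
            "route": trace["route"],
            "question": trace["question"],
            "expected_answer": None,  # must be labelled before the case gates anything
            "expected_chunks": [],
            "metadata": {"issue": trace["issue"], "needs_label": True},
        })
    return dataset + promoted
```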
Common pitfalls
- Scoring only Groundedness. Catches hallucinations; misses retrieval regressions. Run all five core rubrics or the gate misses half the failure surface.
- No `expected_chunks` in the dataset. Can’t score retrieval recall without ground truth. Label work pays back the first time retrieval regresses.
- Floating judge model. Score drift across runs of “the same eval.” Pin and version the judge alongside the rubric.
- No delta gate, only floors. Slow regressions slip under the floor for months. The delta gate catches the drift.
- 30-example dataset, mean-based gate. Variance wider than the regressions you’re catching. Grow the dataset or gate on percentiles.
- Full LLM-judge sweep on every PR. Prices the gate out of existence. PR-blocking gets cheap rubrics; the LLM-judge sweep runs nightly.
- Static dataset frozen at launch. A 2024 set evaluating a 2026 product is a benchmark, not a regression suite.
- No cache layer. CI cost explodes; flaky network errors block PRs. Cache verdicts; invalidate on rubric or judge version change.
- Production observer with a different rubric than CI. Engineers argue which number is “real” instead of fixing the bug. Pin one definition; run it both places.
Three deliberate tradeoffs
- Statistical gating slows the first PR. Building the baseline window takes a week of nightly runs. The payoff is engineers learn that a red PR means a real regression. Without that trust the gate gets bypassed.
- Sharded parallelism costs orchestration. Path-scoped triggers, route-aware test selection, and a distributed runner add wiring. The lift is the difference between a 4-minute and a 30-second PR-blocking gate, which is what keeps the gate enabled. The fi CLI’s `--check` plus the SDK’s distributed runners pre-build most of it.
- Classifier cascade adds rubric-design discipline. Cascades work on rubrics with a clean target; subjective rubrics still need a frontier judge. Classifying which rubrics cascade is a one-time design call. The Platform’s classifier-backed evals run below Galileo Luna-2 per-call cost on the ones that do, which is what makes weekly full-suite reruns the default.
How Future AGI ships RAG eval for CI/CD
Future AGI ships the eval stack as a package. Start with the SDK and the fi CLI for code-defined gates. Graduate to the Platform when you want self-improving rubrics and a dashboard for production observation.
- ai-evaluation SDK (Apache 2.0): 60+ `EvalTemplate` classes including the seven RAG metrics (`Groundedness`, `ContextAdherence`, `ContextRelevance`, `Completeness`, `ChunkAttribution`, `ChunkUtilization`, `FactualAccuracy`). Real API: `from fi.evals import Evaluator`; `Evaluator(...).evaluate(eval_templates=[Template()], inputs=[TestCase(...)])`. 13 guardrail backends (9 open-weight), 8 sub-10ms Scanners, four distributed runners. NLI-backed local rubrics (`faithfulness`, `claim_support`, `rag_faithfulness`, `factual_consistency`) provide the cascade’s classifier tier at a fraction of frontier-judge cost.
- `fi` CLI: `fi init --template rag` scaffolds `fi-evaluation.yaml`. `fi run --check --strict --parallel 16` evaluates assertions on `pass_rate`, `avg_score`, `p50/p90/p95_score`, and runtime percentiles, with CI-distinct exit codes (0/2/3/6). `--mode hybrid` routes between local classifier rubrics and cloud judges; `--offline` runs classifier-only when the network is down.
- Future AGI Platform: self-improving evaluators tuned by thumbs feedback; an in-product authoring agent writes RAG-specific rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- traceAI (Apache 2.0): 50+ AI surfaces across Python, TypeScript, Java, C#. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY) at `register()` time. 14 span kinds with a first-class `RETRIEVER`; 62 built-in evals via `EvalTag` with zero inline latency.
- Error Feed (inside the eval stack): HDBSCAN clustering plus Sonnet 4.5 Judge writes the `immediate_fix`; fixes feed the Platform’s self-improving evaluators.
- agent-opt (Apache 2.0): six optimizers (`RandomSearchOptimizer`, `BayesianSearchOptimizer`, `MetaPromptOptimizer`, `ProTeGi`, `GEPAOptimizer`, `PromptWizardOptimizer`).
- Agent Command Center: 17 MB Go binary self-hosts in your VPC. 20+ providers via six native adapters plus OpenAI-compatible presets. RBAC, SOC 2 Type II, HIPAA, GDPR, and CCPA certified, AWS Marketplace.
The pieces are independent: drop ai-evaluation plus the fi CLI into your CI this afternoon; bring traceAI, Error Feed, and the Platform online as the program matures.
Ready to wire a RAG CI gate that doesn’t lie? Run pip install ai-evaluation, then fi init --template rag, point the data paths at your real eval set, set FI_API_KEY and FI_SECRET_KEY in CI secrets, and add fi run --check --strict to your pull-request workflow. Your next regression has somewhere to land.
Related reading
- The 2026 LLM Evaluation Playbook
- Best RAG Evaluation Tools (2026)
- CI/CD for AI Agents Best Practices (2026)
- How to Build (and Evaluate) a PDF QA Chatbot in 2026
- How to Build (and Evaluate) a Contract Review RAG Agent in 2026
- LLM Arena as a Judge: Pairwise Comparison Evals (2026)
- Deterministic LLM Evaluation Metrics (2026)
- Synthetic Test Data for LLM Evaluation (2026)