Engineering

How to Evaluate RAG Applications in CI/CD Pipelines (2026)

RAG eval in CI/CD without theatre: the cheap-fast-significant triangle, statistical gating, sharded parallelism, classifier cascades, production bridge.

May 20, 2026

13 min read

rag rag-evaluation ci-cd llm-evaluation ai-gateway github-actions 2026

A PR lands at 4:47 pm. The RAG eval runs in 38 seconds. Green check. Twelve hours later, support tickets trickle in. The trace tree shows the retriever quietly switched its top-1 chunk on a class of queries the 30-example dataset never covered. Suite Groundedness held at 0.91. Production Groundedness on the affected traffic sat at 0.62 for those twelve hours. The gate fired green because it didn’t ask the right question.

Most CI eval gates for RAG are smoke tests in a trench coat. Tiny dataset, mean against a frozen floor, pass on anything short of catastrophe. The dataset isn’t representative, the floor isn’t calibrated to baseline variance, the threshold doesn’t separate signal from judge noise. The green check is theatre.

The opinion this post earns: the CI eval triangle is cheap, fast, and statistically significant. Pick any two and the gate is theatre. A 30-example dataset for 12 cents per PR fails the third corner. A 2,000-example sweep for 9 dollars per PR fails the first two. Design a gate that holds all three. The math (not vibes) decides which regressions block.

This guide is the working playbook: trigger tiers, dataset sizing, statistical gating with deltas not means, shard-parallel pytest, classifier cascade for cost, the GitHub Actions wiring, and the production bridge that keeps CI and prod scoring the same way. Code shaped against the ai-evaluation SDK and the fi CLI.

TL;DR: the gate that doesn’t lie

Step	Decision	Rule
1. Triggers	PR-blocking vs nightly vs canary	Cheap rubrics on PR. LLM-judge sweep nightly. Same rubrics on canary.
2. Dataset	100-200 cases per route	Sampled from production. Versioned. Grown weekly through Error Feed.
3. Metric stack	Five core rubrics minimum	Groundedness, ContextRelevance, Completeness, ChunkAttribution, citation validity.
4. Statistical gate	Floor + delta	Absolute floor catches breaks. Welch’s t-test on delta vs 7-day baseline catches drift.
5. Parallelism	Shard by route	Path-scoped triggers. Classifier cascade. Cached pair verdicts.
6. Production bridge	Same rubric, two contexts	EvalTag attaches the CI rubric as a span-attached scorer.
7. Closed loop	Failing traces ratchet the dataset	Error Feed clusters; engineer promotes; gate gets stronger.

The hard ones are 4 and 5. The rest is plumbing.

The CI eval triangle

Three constraints fight for the gate’s budget.

Cheap. Cents per PR, not dollars. If every push lights up a frontier judge bill, someone quietly disables the gate by quarter end. A gate that costs more than the feature is a gate that gets bypassed.

Fast. Under five minutes push to verdict. Above ten, engineers merge and apologise. CI is a queue, and the queue measures the slowest thing on the critical path.

Statistically significant. The gate fails when the regression is real and passes when the variance is judge noise. A 30-example dataset gives a 95 percent confidence interval of roughly +/- 0.07 on a 0-1 rubric mean (assuming a per-example standard deviation near 0.2). A 2-point (0.02) drop sits well inside that noise band, so a gate that fires on it is as likely to raise a false alarm as to catch a real regression. Teams fire-drilled twice learn to ignore the third.
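The noise band can be checked with the normal-approximation half-width, 1.96 times the standard deviation over the square root of n. A minimal sketch, assuming scores on a 0-1 scale and a per-example standard deviation of 0.2 (an illustrative assumption, not a measured value):

```python
import math

def ci_half_width(std_dev: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a mean."""
    return z * std_dev / math.sqrt(n)

# Assumed per-example standard deviation on a 0-1 rubric score.
ASSUMED_STD = 0.2

for n in (30, 100, 200):
    print(f"n={n:>3}: mean +/- {ci_half_width(ASSUMED_STD, n):.3f}")
```

Under that assumption, 30 examples give roughly +/- 0.07 and 200 examples narrow the band to roughly +/- 0.03. Measure the real standard deviation per rubric before trusting either number.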

Sharding and classifier cascade fix cheap. Path scoping and cache fix fast. Sample sizing and delta gating fix significant.

Step 1: trigger tiers, not one gate to rule them all

The first design mistake is running the full LLM-judge sweep on every push. The full sweep is the right gate for nightly main, not for an in-progress feature branch.

Three tiers, three cadences:

PR-blocking, every push. Cheap classifier rubrics (NLI faithfulness, claim_support, factual_consistency) plus deterministic floors (citation validity, schema compliance, latency budget). Under three minutes against 100-200 examples. Blocks merge.
Nightly main, full sweep. The LLM-judge stack (Groundedness, ContextRelevance, ContextAdherence, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy) against the versioned dataset. 15-30 minutes. Blocks the next promotion to canary.
Canary observation, sampled live. Same rubrics attached as span-attached scorers via EvalTag against 5-10 percent of live traffic. Alarms on rolling-mean drift.

PR-blocking keeps engineers honest on the change in front of them. Nightly catches the regression a cheap rubric missed. Canary catches failure modes the dataset doesn’t know about. Conflating the three produces the “200 LLM-judge calls on every commit” pattern that prices the gate out of existence. Non-negotiable across all three: rubric definitions live in the same repo as application code, pinned alongside the prompt and the chunker.

Step 2: dataset composition is the lever, not size

A 2,000-example dataset built from the test author’s imagination loses to a 200-example dataset sampled from production every time. The dataset is the gate’s worldview; if the worldview misses the failure modes that show up at 2 am, the gate misses them too.

A RAG eval case carries four fields the gate actually reads (route, question, expected_answer, expected_chunks), plus an id and free-form metadata:

from dataclasses import dataclass, field

@dataclass
class RAGExample:
    id: str
    route: str                              # "legal-rag", "support-rag", ...
    question: str
    expected_answer: str
    expected_chunks: list[str]              # ground-truth doc IDs or text spans
    metadata: dict = field(default_factory=dict)

Composition over count. A 200-example set that earns its keep covers happy-path queries (top 20 percent of intents), edge cases (ambiguous, multi-hop, time-sensitive), refusal cases, negative cases, and the hardest 10 percent of historical failures from incident reports. expected_chunks is the field most teams skip and regret: without it you can score generation but not retrieval, and the bisect costs a day instead of an hour.
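Composition is easy to audit before the suite runs. A rough sketch, assuming each case is a plain dict and carries a category label under metadata["category"]; the label names and the dict shape are hypothetical, not part of the SDK:

```python
from collections import Counter

# Hypothetical category labels, read from metadata["category"] on each case.
REQUIRED_CATEGORIES = {
    "happy_path", "edge_case", "refusal", "negative", "historical_failure",
}

def composition_report(examples: list[dict]) -> dict:
    """Summarise category coverage and flag cases with no retrieval ground truth."""
    counts = Counter(
        ex.get("metadata", {}).get("category", "unlabeled") for ex in examples
    )
    return {
        "counts": dict(counts),
        "missing_categories": sorted(REQUIRED_CATEGORIES - set(counts)),
        "missing_expected_chunks": [
            ex["id"] for ex in examples if not ex.get("expected_chunks")
        ],
    }
```

Failing the dataset build when missing_categories is non-empty keeps a route from quietly losing its refusal or negative cases as the set grows.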

Below 100 examples per route, variance on per-rubric means sinks signal-to-noise under one. Above 500 the per-PR LLM-judge bill grows faster than detection sharpens. PR-blocking sweet spot: 100-200 per route. Nightly can go wider because cost amortises over one run.

Step 3: the metric stack, five rubrics, no fewer

The five rubrics that gate most RAG regressions in CI:

Metric	What it scores	What it catches	Layer
Groundedness	Are answer claims supported by retrieval context?	Hallucinated facts	Generation
ContextRelevance	Are retrieved chunks relevant to the query?	Noisy retrieval	Retrieval
Completeness	Did the response cover what the question asked?	Partial answers	Generation
ChunkAttribution	Which retrieved chunks the answer used	Generator ignoring good chunks	Both
Citation validity	Do cited spans exist verbatim in retrieval context?	Fabricated citations	Deterministic

Split the suite by layer so the bisect collapses to a single step. A drop in ContextRelevance with stable Groundedness means the retriever regressed. A drop in Groundedness with stable ContextRelevance means the generator regressed. A combined “end-to-end correctness” rubric hides which layer moved.

Citation validity is the cheapest rubric in the stack. String match plus Levenshtein tolerance; running it on 100 percent of responses keeps the LLM-judge bill bounded to rubrics that need semantic scoring. Add domain rubrics as the pipeline matures: regulatory accuracy for legal/medical, retrieval token cost, latency budget. The five above are the floor.
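A deterministic citation check along these lines needs nothing beyond the standard library. This is a sketch of one possible implementation, not the SDK's scorer; the function names and the 5 percent edit budget are assumptions:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the two-row dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def citation_is_valid(cited: str, chunks: list[str], tolerance: float = 0.05) -> bool:
    """True if the cited span occurs in a chunk, exactly or within the edit budget."""
    cited = " ".join(cited.split()).lower()   # normalise whitespace and case
    if not cited:
        return False
    budget = int(len(cited) * tolerance)
    for chunk in chunks:
        text = " ".join(chunk.split()).lower()
        if cited in text:
            return True
        if budget == 0:
            continue
        # Slide a window of the cited length across the chunk (quadratic, fine for CI sizes).
        for start in range(len(text) - len(cited) + 1):
            if edit_distance(cited, text[start:start + len(cited)]) <= budget:
                return True
    return False
```

The sliding window is deliberately naive. For long chunks a dedicated fuzzy-matching library would be faster, but the exact-match fast path handles most valid citations first.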

Step 4: statistical gating, not “did the mean drop”

This is the section most CI eval guides skip. It’s the one separating a working gate from theatre.

A green check should mean “this PR did not introduce a statistically significant regression,” not “mean Groundedness on 30 examples sat above 0.85.” The two statements look identical and are not.

Two thresholds, both calibrated:

Absolute floor per rubric. Catches catastrophic regressions. Starting floors: Groundedness >= 0.85, ContextRelevance >= 0.80, Completeness >= 0.75, ChunkAttribution >= 0.80, citation validity >= 0.99. Tune per workload.
Delta gate against the trailing 7-day rolling baseline. Welch’s t-test on per-example scores. Fail when p < 0.05 and the effect size exceeds the rubric’s noise floor.

import statistics
from scipy import stats

def regression_gate(current, baseline, alpha=0.05, min_effect=0.03):
    """Fail only if the mean dropped, the change is significant, and the effect is big enough."""
    delta = statistics.mean(current) - statistics.mean(baseline)
    if delta >= 0:
        return True, f"no regression (delta=+{delta:.3f})"
    _, p = stats.ttest_ind(current, baseline, equal_var=False)
    if p >= alpha:
        return True, f"delta={delta:.3f}, p={p:.3f} (not significant)"
    if abs(delta) < min_effect:
        return True, f"delta={delta:.3f} below effect floor {min_effect}"
    return False, f"regression: delta={delta:.3f}, p={p:.3f}"

Swap the t-test for a two-proportion z-test on pass-rate rubrics like citation validity. For long-tail failures that hide in averages, gate on percentiles. The fi CLI ships pass_rate, avg_score, p50/p90/p95_score, and runtime percentiles as native assertion metrics; a regression pushing p95_score below a tail floor while leaving the mean intact fails on the percentile and tells the on-call where to look.

# fi-evaluation.yaml: native CI assertions
assertions:
  - template: "groundedness"
    condition: "p95_score >= 0.78"   # gate on the tail, not the mean
    on_fail: "error"
  - template: "context_relevance"
    condition: "pass_rate >= 0.90"
    on_fail: "error"
  - template: "citation_validity"
    condition: "pass_rate >= 0.99"
    on_fail: "error"
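For pass-rate rubrics, the two-proportion z-test mentioned above fits in a few lines. A minimal sketch using a one-sided test, since only drops matter; the function name and signature are illustrative, and the normal approximation gets weak when pass rates sit very close to 1 on small samples:

```python
import math
from statistics import NormalDist

def pass_rate_regressed(cur_pass: int, cur_n: int,
                        base_pass: int, base_n: int,
                        alpha: float = 0.05) -> tuple[bool, float]:
    """One-sided two-proportion z-test: is the current pass rate below baseline?"""
    p_cur, p_base = cur_pass / cur_n, base_pass / base_n
    pooled = (cur_pass + base_pass) / (cur_n + base_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / cur_n + 1 / base_n))
    if se == 0:                      # both samples all-pass or all-fail
        return False, 1.0
    z = (p_cur - p_base) / se
    p_value = NormalDist().cdf(z)    # lower tail only: we test for a drop
    return p_value < alpha, p_value
```

As with the t-test gate, pairing the significance check with a minimum effect size keeps a statistically detectable but trivial dip from blocking a merge.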

fi run --check exits with code 2 on assertion failure (distinct from code 1 for evaluation errors), 3 when --strict assertions warn, 6 on API failure. That exit-code partition is what makes CI policies (retry on 6, hard-fail on 2, slack-notify on 3) wire cleanly into GitHub Actions or Buildkite without grep heuristics on stdout.
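One way to encode that policy is a thin wrapper around the gate command. A sketch only: the exit codes are the ones described above, while the action names, the retry count, and the choice to hard-fail on code 1 are assumptions:

```python
import subprocess

# Exit codes as described above; the action names are illustrative.
POLICY = {0: "pass", 1: "hard-fail", 2: "hard-fail", 3: "notify", 6: "retry"}

def policy_for(exit_code: int) -> str:
    """Map a gate exit code to a CI action; unknown codes fail closed."""
    return POLICY.get(exit_code, "hard-fail")

def run_gate(cmd: list[str], max_retries: int = 2) -> str:
    """Run the gate command, retrying only while the policy says 'retry'."""
    for _ in range(max_retries + 1):
        action = policy_for(subprocess.run(cmd).returncode)
        if action != "retry":
            return action
    return "hard-fail"  # retries exhausted
```

A call such as run_gate(["fi", "run", "--check", "--strict"]) would then return one of the action names for the surrounding CI step to act on.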

The gate lives by one rule: the baseline is a rolling 7-day production observation, not a frozen number. Models drift, prompts drift, datasets drift; the gate drifts with them or it starts catching ordinary movement instead of regressions.
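Assembling that baseline can be as simple as pooling per-example scores from the trailing window. A rough sketch, assuming scores are stored per run date in a dict; the storage shape and function name are hypothetical:

```python
from datetime import date, timedelta

def rolling_baseline(runs: dict[date, list[float]], today: date,
                     window_days: int = 7) -> list[float]:
    """Pool per-example scores from runs in the trailing window, excluding today."""
    start = today - timedelta(days=window_days)
    pooled: list[float] = []
    for run_date in sorted(runs):
        if start <= run_date < today:
            pooled.extend(runs[run_date])
    return pooled
```

The pooled list is what regression_gate takes as its baseline argument. Excluding today keeps the PR under test from contaminating its own reference distribution.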

Step 5: shard the eval set, cascade the judge

A 200-example suite against an LLM judge at 1.2 seconds per call takes four minutes serially. Sharded across 8 workers it takes 30 seconds. The numbers stop being theoretical the moment merge queue depth crosses three.

Path-scoped triggers. A PR touching rag/legal/ reruns only the legal-RAG evals. GitHub Actions on.pull_request.paths is the cheap version; route-tagged datasets plus pytest -m route_legal is sharper.
Shard by route, parallelise within shard. Evaluator(max_workers=16) saturates the rate limit on most judge providers; the SDK’s four distributed runners (Celery, Ray, Temporal, Kubernetes) take over past the single-host limit.
Classifier cascade. A DeBERTa-class classifier triages every example; only low-confidence or disagreement cases escalate to the frontier judge. On a 200-example suite the cascade pushes 70-85 percent through the classifier at fractional cost.

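import statistics  # needed for the absolute-floor check below

import pytest
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextRelevance, Completeness, ChunkAttribution,
)
from fi.testcases import TestCase

# regression_gate is the Step 4 function. changed_routes, run_pipeline, and
# citation_invalid are project-local helpers; dataset_for_route and
# baseline_loader are fixtures expected in conftest.py.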

evaluator = Evaluator(max_workers=16)  # FI_API_KEY / FI_SECRET_KEY from env

ROUTES = ["legal-rag", "support-rag", "internal-search"]
RUBRICS = {
    "groundedness":      (Groundedness(),     0.85),
    "context_relevance": (ContextRelevance(), 0.80),
    "completeness":      (Completeness(),     0.75),
    "chunk_attribution": (ChunkAttribution(), 0.80),
}

@pytest.mark.parametrize("route", ROUTES)
def test_rag_route(route, dataset_for_route, baseline_loader, request):
    if route not in changed_routes(request):
        pytest.skip(f"{route} not affected by this PR")

    test_cases = []
    for ex in dataset_for_route(route):
        answer, chunks = run_pipeline(ex)
        if citation_invalid(answer, chunks):
            pytest.fail(f"{ex.id}: fabricated citation (deterministic floor)")
        test_cases.append(TestCase(input=ex.question, output=answer,
                                   context="\n\n".join(chunks), id=ex.id))

    baselines = baseline_loader(route)
    for name, (template, floor) in RUBRICS.items():
        result = evaluator.evaluate(eval_templates=[template], inputs=test_cases)
        scores = [m.value for r in result.eval_results for m in r.metrics]
        passed, reason = regression_gate(scores, baselines[name])
        assert passed, f"{route}.{name}: {reason}"
        assert statistics.mean(scores) >= floor, \
            f"{route}.{name} below absolute floor {floor}"

Cache verdicts keyed on (rubric_version, judge_model, input_hash, output_hash); reruns on unchanged code hit cache and finish in under a minute. Invalidate on rubric or judge version bump, not on every PR.
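A file-backed cache with that key shape might look like the following. This is a sketch, not the SDK's cache; the class and function names are made up for illustration, and .eval_cache is reused only because the workflow in Step 6 caches that directory:

```python
import hashlib
import json
from pathlib import Path

def _digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def cache_key(rubric_version: str, judge_model: str,
              input_text: str, output_text: str) -> str:
    """Stable key over (rubric_version, judge_model, input_hash, output_hash)."""
    raw = json.dumps([rubric_version, judge_model,
                      _digest(input_text), _digest(output_text)])
    return _digest(raw)

class VerdictCache:
    """One JSON file per verdict under a cache directory."""

    def __init__(self, root: str = ".eval_cache"):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def get_or_score(self, key: str, score_fn):
        """Return the cached score, or call score_fn once and store the result."""
        path = self.root / f"{key}.json"
        if path.exists():
            return json.loads(path.read_text())["score"]
        score = score_fn()
        path.write_text(json.dumps({"score": score}))
        return score
```

Because the rubric version and judge model are part of the key, bumping either one invalidates old verdicts without any explicit purge step.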

The cascade trap: don’t cascade on subjective axes a classifier can’t score. A 2B-parameter classifier won’t call helpfulness or tone better than a frontier judge. Cascade where the classifier has a clean target (NLI faithfulness, claim_support, citation validity); reserve the frontier judge for rubrics where it earns its bill.
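The cascade itself is a small routing function. A minimal sketch with the classifier and the judge passed in as plain callables; both signatures and the 0.9 confidence cut-off are assumptions, and only the low-confidence escalation path is shown, not classifier disagreement:

```python
from typing import Callable

def cascade_score(example: dict,
                  classifier: Callable[[dict], tuple[float, float]],
                  judge: Callable[[dict], float],
                  min_confidence: float = 0.9) -> tuple[float, str]:
    """Return (score, tier); escalate to the judge only on low classifier confidence."""
    score, confidence = classifier(example)
    if confidence >= min_confidence:
        return score, "classifier"
    return judge(example), "judge"
```

Logging the returned tier per example gives the escalation rate directly, which is the number to watch when tuning min_confidence.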

Step 6: the GitHub Actions workflow

Drop-in. Path-scoped triggers, cache, parallel pytest, fi CLI assertions:

name: RAG Evals
on:
  pull_request:
    paths: ["rag/**", "evals/**", "fi-evaluation.yaml"]
concurrency:
  group: rag-evals-${{ github.head_ref }}
  cancel-in-progress: true
jobs:
  pr-gate:
    runs-on: ubuntu-latest
    timeout-minutes: 8
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 2 }
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: "pip" }
      - uses: actions/cache@v4
        with:
          path: .eval_cache
          key: evals-${{ hashFiles('evals/rubrics/**', 'evals/datasets/**') }}
      - run: pip install -r requirements.txt
      - id: routes
        run: echo "routes=$(python evals/affected_routes.py)" >> "$GITHUB_OUTPUT"
      - name: Cheap-rubric gate (fi CLI assertions)
        if: steps.routes.outputs.routes != '[]'
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: fi run --check --strict --parallel 16 -c evals/fi-evaluation.yaml
      - name: Statistical delta gate (pytest)
        if: steps.routes.outputs.routes != '[]'
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
          BASELINE_WINDOW_DAYS: "7"
        run: pytest evals/test_rag.py -n auto --routes='${{ steps.routes.outputs.routes }}'
      - if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-report-${{ github.sha }}
          path: eval-report.json

A separate workflow on a schedule trigger (cron: "0 8 * * *") runs the full nightly sweep across all routes and posts the daily baseline back into Observe. The same shape carries over to GitLab CI, Buildkite, Jenkins, or CircleCI: pytest, the fi CLI, and a cache step. The CI/CD LLM eval with GitHub Actions guide covers the workflow patterns in more depth.

Three habits pay back the first week. Path-scoped triggers so “every push reruns every eval” doesn’t price the gate out of existence. concurrency cancel-in-progress so three pushes don’t fan out into three concurrent suites. An eval report artifact with per-rubric scores, dataset diffs, and failing examples; that’s what turns a borderline regression into a 3-minute conversation instead of a 30-minute argument.
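The workflow leans on evals/affected_routes.py, which is not shown. One plausible shape for it, with a hypothetical prefix-to-route mapping and a hypothetical list of shared paths that force every route to rerun:

```python
import json

# Hypothetical mapping from path prefixes to route names.
ROUTE_PREFIXES = {
    "rag/legal/": "legal-rag",
    "rag/support/": "support-rag",
    "rag/search/": "internal-search",
}
# Changes under these prefixes touch shared code, so every route reruns.
SHARED_PREFIXES = ("rag/common/", "evals/rubrics/")

def affected_routes(changed_files: list[str]) -> list[str]:
    """Return the sorted route names touched by a list of changed file paths."""
    if any(f.startswith(SHARED_PREFIXES) for f in changed_files):
        return sorted(set(ROUTE_PREFIXES.values()))
    hit = {route
           for f in changed_files
           for prefix, route in ROUTE_PREFIXES.items()
           if f.startswith(prefix)}
    return sorted(hit)

def routes_json(changed_files: list[str]) -> str:
    """JSON form for the step output; an unaffected PR yields '[]'."""
    return json.dumps(affected_routes(changed_files))
```

In CI the file list would come from something like git diff --name-only against the PR base, and the printed JSON is what the later steps compare against '[]'.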

Step 7: bridge to production with the same rubric

Offline CI catches regressions you can think of. Production catches the rest. Same EvalTemplate definition, two contexts. The moment CI and prod scoring disagree, you’re shipping against a worldview no longer correlated with what users see.

from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalTag, EvalTagType, EvalSpanKind, EvalName, ProjectType,
)

eval_tags = [
    EvalTag(eval_name=EvalName.GROUNDEDNESS,
            type=EvalTagType.OBSERVATION_SPAN, value=EvalSpanKind.LLM, config={},
            mapping={"context": "retrieval.documents", "output": "output.value"}),
    EvalTag(eval_name=EvalName.CONTEXT_RELEVANCE,
            type=EvalTagType.OBSERVATION_SPAN, value=EvalSpanKind.RETRIEVER, config={},
            mapping={"input": "input.value", "context": "retrieval.documents"}),
]

register(project_type=ProjectType.OBSERVE, project_name="legal-rag",
         eval_tags=eval_tags)

The score writes back as a span attribute on the OTel trace. A failing trace surfaces with its rubric score next to latency and chunk IDs; the on-call engineer reads the regression and the failing trace in the same place. traceAI (Apache 2.0) ships 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C#. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY) ingest into Phoenix or Traceloop without re-instrumenting. 14 span kinds include a first-class RETRIEVER; 62 built-in evals wire server-side via EvalTag.

Sample 5-10 percent of production traffic for LLM-judge rubrics; cheap rubrics run on 100 percent. Alarm on a 2-5 point sustained drop in per-rubric rolling mean over 15-60 minutes per route. The delta between the offline CI baseline and the production rolling mean is itself a quality signal; the gap tells you how representative the dataset still is.

Step 8: close the loop, ratchet the dataset

The dataset stops being a regression suite the moment production drifts past it. The loop keeps the gate honest.

Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every trace failure into a named issue. A Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars, 90 percent prompt-cache hit ratio) reads the failing trace and writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each). Those fixes feed the Platform’s self-improving evaluators so the rubric ages with the product instead of decaying.

Self-improving evaluators retune from thumbs feedback; the next PR’s gate runs against a sharper definition. The on-call engineer (or a scheduled job) lifts representative traces from each named issue into the eval set with rubric labels. The next PR touching that path clears the new entries or fails. Linear integration ships today; Slack, GitHub, Jira, and PagerDuty land on the roadmap.

Common pitfalls

Scoring only Groundedness. Catches hallucinations; misses retrieval regressions. Run all five core rubrics or the gate misses half the failure surface.
No expected_chunks in the dataset. Can’t score retrieval recall without ground truth. Label work pays back the first time retrieval regresses.
Floating judge model. Score drift across runs of “the same eval.” Pin and version the judge alongside the rubric.
No delta gate, only floors. Slow regressions slip under the floor for months. The delta gate catches the drift.
30-example dataset, mean-based gate. Variance wider than the regressions you’re catching. Grow the dataset or gate on percentiles.
Full LLM-judge sweep on every PR. Prices the gate out of existence. PR-blocking gets cheap rubrics; the LLM-judge sweep runs nightly.
Static dataset frozen at launch. A 2024 set evaluating a 2026 product is a benchmark, not a regression suite.
No cache layer. CI cost explodes; flaky network errors block PRs. Cache verdicts; invalidate on rubric or judge version change.
Production observer with a different rubric than CI. Engineers argue which number is “real” instead of fixing the bug. Pin one definition; run it both places.

Three deliberate tradeoffs

Statistical gating slows the first PR. Building the baseline window takes a week of nightly runs. The payoff is engineers learn that a red PR means a real regression. Without that trust the gate gets bypassed.
Sharded parallelism costs orchestration. Path-scoped triggers, route-aware test selection, and a distributed runner add wiring. The lift is the difference between a 4-minute and a 30-second PR-blocking gate, which is what keeps the gate enabled. The fi CLI’s --check plus the SDK’s distributed runners pre-build most of it.
Classifier cascade adds rubric-design discipline. Cascades work on rubrics with a clean target; subjective rubrics still need a frontier judge. Classifying which rubrics cascade is a one-time design call. The Platform’s classifier-backed evals run below Galileo Luna-2 per-call cost on the ones that do, which is what makes weekly full-suite reruns the default.

How Future AGI ships RAG eval for CI/CD

Future AGI ships the eval stack as a package. Start with the SDK and the fi CLI for code-defined gates. Graduate to the Platform when you want self-improving rubrics and a dashboard for production observation.

ai-evaluation SDK (Apache 2.0) — 60+ EvalTemplate classes including the seven RAG metrics (Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy). Real API: from fi.evals import Evaluator; Evaluator(...).evaluate(eval_templates=[Template()], inputs=[TestCase(...)]). 13 guardrail backends (9 open-weight), 8 sub-10ms Scanners, four distributed runners. NLI-backed local rubrics (faithfulness, claim_support, rag_faithfulness, factual_consistency) provide the cascade’s classifier tier at a fraction of frontier-judge cost.
fi CLI — fi init --template rag scaffolds fi-evaluation.yaml. fi run --check --strict --parallel 16 evaluates assertions on pass_rate, avg_score, p50/p90/p95_score, and runtime percentiles, with CI-distinct exit codes (0/2/3/6). --mode hybrid routes between local classifier rubrics and cloud judges; --offline runs classifier-only when the network is down.
Future AGI Platform — self-improving evaluators tuned by thumbs feedback; an in-product authoring agent writes RAG-specific rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
traceAI (Apache 2.0) — 50+ AI surfaces across Python, TypeScript, Java, C#. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY) at register() time. 14 span kinds with a first-class RETRIEVER; 62 built-in evals via EvalTag with zero inline latency.
Error Feed (inside the eval stack) — HDBSCAN clustering plus Sonnet 4.5 Judge writes the immediate_fix; fixes feed the Platform’s self-improving evaluators.
agent-opt (Apache 2.0) — six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer).
Agent Command Center — 17 MB Go binary self-hosts in your VPC. 20+ providers via six native adapters plus OpenAI-compatible presets. RBAC built in. SOC 2 Type II, HIPAA, GDPR, and CCPA certified. Available on AWS Marketplace.

The pieces are independent: drop ai-evaluation plus the fi CLI into your CI this afternoon; bring traceAI, Error Feed, and the Platform online as the program matures.

Ready to wire a RAG CI gate that doesn’t lie? Run pip install ai-evaluation, then fi init --template rag, point the data paths at your real eval set, set FI_API_KEY and FI_SECRET_KEY in CI secrets, and add fi run --check --strict to your pull-request workflow. Your next regression has somewhere to land.

Frequently asked questions

What does a RAG CI gate actually need to do?

Three things, and not just the first. Score the right metrics on a representative dataset. Decide whether a regression is statistically real or single-example noise. Run cheap and fast enough that engineers don't disable it on a deadline. Most CI eval setups nail the first and skip the other two, which is why their green checks mean nothing. The CI eval triangle is cheap, fast, statistically significant. A gate that misses any of the three is theatre.

How big should the RAG eval dataset be for CI?

Start at 100 to 200 cases per route, sampled from real user questions where possible. Below 100 the variance on per-rubric means is wide enough that a 2-point drop is indistinguishable from judge noise. Above 500 the per-PR judge bill grows faster than the regression signal sharpens. The dataset earns its keep through composition, not size: representative happy-path queries, edge cases, refusal scenarios, multi-hop reasoning, and the hardest 10 percent of historical failures. Grow weekly by promoting failing production traces with Error Feed.

How do I keep CI eval cost bounded?

Four levers compound. Cache judge calls keyed on (rubric version, input, output, model); reruns hit cache. Shard the dataset by route so a PR touching the legal RAG retriever doesn't rerun the support-bot evals. Cascade classifier-backed rubrics in front of frontier judges; the classifier triages every case and only disagreements or low-confidence calls escalate. Run the full frontier-judge sweep on nightly, not on every PR. With caching plus cascade, a 200-example suite runs for cents per PR on the Future AGI Platform, which prices classifier-backed evals below Galileo Luna-2.

What threshold should a RAG CI gate fail on?

Two thresholds, both calibrated, neither sufficient alone. An absolute floor per rubric catches catastrophic regressions (Groundedness >= 0.85, ContextRelevance >= 0.80, Completeness >= 0.75, citation validity >= 0.99 for compliance work). A delta gate fails the PR when the mean drops by more than the statistical noise floor relative to the trailing 7-day rolling baseline. The delta gate is the one most teams skip and it's the one that matters; the floor catches the obvious break, the delta catches the slow regression.

When should the gate block the PR versus fire a warning?

Three tiers, by trigger. On every PR: cheap classifier-backed rubrics plus deterministic floors (citation validity, schema compliance, latency budget). Block on failure. On nightly main: the full LLM-as-judge sweep across all routes against a versioned dataset. Block the next promotion to canary on regression. On canary traffic: production observation with the same rubrics scoring sampled live spans. Alarm on rolling-mean drift; fire a rollback signal when the drift sustains beyond the inter-rater baseline. PR-blocking the full LLM-judge sweep on every push is how teams disable the gate by 6 pm.

How is statistical significance computed for a per-rubric delta?

Welch's t-test on the per-example score arrays against the baseline distribution. The gate fires when the p-value is below an agreed threshold (0.05 is the standard starting point) and the effect size is large enough to matter (a 0.01 shift on a 0-1 rubric is not a regression). For binary rubrics like citation validity, use a two-proportion z-test on pass rates. The fi CLI ships pass_rate, avg_score, p50_score, p90_score, and p95_score as native assertion metrics so the gate compares percentiles, not just means, which is the right move when long-tail failures matter more than averages.

Can the same rubric run in CI and in production?

It has to, or the CI gate stops correlating with what users see. The CI definition lives in code; the production observer runs the identical definition against live OTel spans as a span-attached score via traceAI plus EvalTag. A regression in CI that doesn't show in production means the dataset stopped being representative. A drift in production with green CI means the dataset is missing a failure mode the gate now needs to learn. Both signals are worth acting on.

How does Future AGI handle RAG eval in CI?

Future AGI ships the eval stack as a package. The ai-evaluation SDK (Apache 2.0) exposes seven RAG-specific EvalTemplate classes (Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy) plus NLI-backed local equivalents (faithfulness, claim_support, rag_faithfulness) that run a DeBERTa classifier instead of a frontier judge at a fraction of the cost. The fi CLI ships a native CI gate (fi run --check) with assertion conditions on pass_rate, avg_score, p50/p90/p95_score, and runtime percentiles, plus CI-distinct exit codes (0 success, 2 assertion failed, 3 assertion warning, 6 API error). The Future AGI Platform layers self-improving evaluators tuned by thumbs feedback and classifier-backed pairwise scoring at lower per-eval cost than Galileo Luna-2. The same EvalTemplate runs in pytest, in fi run, in GitHub Actions, and as a span-attached scorer on live traces via traceAI (50+ AI surfaces across Python, TypeScript, Java, C#).
