LLM Testing in 2026: Methods and Strategies
The 2026 LLM testing pyramid. Deterministic unit checks at the base, regression eval in the middle, red-team and canary at the top, wired into pytest and CI.
Table of Contents
A PR lands. The LLM test suite runs 247 LLM-as-judge calls on 200 examples across nine rubrics. Four minutes, $3.20, green check. Twelve hours later a customer reports the bot returning malformed JSON the application can’t parse. The judge scored 0.91 for groundedness; nobody asked whether the JSON parsed. A json.loads on every output, costing literally nothing, would have caught it on the first PR. The expensive thing ran. The cheap thing didn’t.
That’s the pyramid inverted. The opinion this post earns: LLM testing is a pyramid, like everything else, and putting LLM-as-judge at the base is the anti-pattern that prices the gate out of existence. Deterministic checks belong at the base (schema, format, scanners, citation validity). Rubric-based eval belongs in the middle, gated on statistical significance. Red-team and canary belong at the top, surgical and expensive. The shape isn’t new. The cost curve is, and it’s flipped from classical software testing, which is what makes the shape matter more than ever.
This guide walks the five-layer pyramid: what each layer scores, what it catches, what it costs, and how it wires into pytest plus the fi CLI. Code shaped against the ai-evaluation SDK.
TL;DR: the pyramid, top to bottom
| Layer | What it scores | Cost per run | Cadence |
|---|---|---|---|
| 5. Canary | Real-traffic regressions on 1-5% live | Free (eval cost) | Every release |
| 4. Red-team | Adversarial prompts, jailbreak surface | Dollars | Pre-canary, nightly |
| 3. Integration | Tool calls, retrieval, multi-turn | Cents-dollars | Nightly |
| 2. Regression eval | Rubric-based scoring on golden set | Cents | Every PR, sharded |
| 1. Deterministic | Schema, regex, scanners, exact-match | Microseconds | Every commit |
Bottom-heavy by volume, top-heavy by cost. The opposite shape is the anti-pattern.
The cost curve flip
Classical pyramids worked because the cheap checks were numerous and the expensive ones were rare. A unit test cost nothing, a UI test cost a CI minute, an end-to-end test cost five. The math front-loaded the cheap signal naturally.
LLM testing flipped the cost curve. The cheapest checks (schema, regex, string match) still cost essentially nothing. The layer above (rubric-based eval) costs real money. A frontier judge at $0.015 per call against 200 examples is $3 per PR, before parallel rubrics multiply it. Red-team suites run dollars. Canary is free in eval cost but expensive in rollback risk if the policy is wrong.
The flip is what makes the pyramid shape matter more, not less. A team that puts the most expensive layer at the base, a full LLM-judge sweep on every commit, burns the eval budget on signal that doesn’t compound. The cheap deterministic checks they skipped catch the JSON parse failure, the missing citation, the prompt-injection leak. Spend the cheap budget freely. Spend the judge budget where it earns its keep.
Layer 1: deterministic unit checks (the base)
The base of the pyramid is everything that runs in microseconds and never lies. JSON schema validation, regex on tool-call structure, exact-match on golden examples, citation existence (the cited span exists verbatim in retrieval context), format checks, and the sub-10ms scanners that catch categorical failures.
from pydantic import BaseModel
from fi.evals.guardrails.scanners import JailbreakScanner, SecretsScanner
class SupportResponse(BaseModel):
intent: str
answer: str
citations: list[str]
def test_response_schema_parses():
raw = run_support_bot("How do I reset my password?")
SupportResponse.model_validate_json(raw) # raises on bad schema
def test_no_prompt_injection_in_output():
out = run_agent("Summarise this document: ...")
assert not JailbreakScanner().scan(out).flagged
assert not SecretsScanner().scan(out).flagged
def test_citation_spans_exist(answer, retrieval_context):
for span in extract_citations(answer):
assert span in retrieval_context, f"fabricated citation: {span!r}"
Three rules earn their keep. (1) Run on every commit. No baseline, no sampling, no threshold tuning. The check passes or it doesn’t. (2) Fail fast. A bad schema means the contract is broken; stop the build, don’t score it. (3) Treat scanners as deterministic. The 8 Scanner classes live here, not the rubric layer. They return a boolean and cost nothing.
Cents per million, microseconds per call. Run them everywhere.
Layer 2: regression eval (the middle)
The middle of the pyramid is rubric-based scoring on a versioned golden dataset, run on every PR, gated on statistical significance. This is where LLM-as-judge earns its bill, and where most teams misallocate it.
The rubric set scores semantic correctness the deterministic layer can’t see: groundedness, refusal calibration, completeness, task completion, and the four-to-seven domain rubrics specific to your product. The full list runs nightly. The PR-blocking gate runs a cheap subset.
import pytest
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, AnswerRefusal, Completeness
from fi.testcases import TestCase
evaluator = Evaluator(max_workers=16) # FI_API_KEY / FI_SECRET_KEY in env
RUBRICS = {
"groundedness": (Groundedness(), 0.85),
"completeness": (Completeness(), 0.75),
"refusal": (AnswerRefusal(), 0.90),
}
@pytest.mark.parametrize("route", ["support", "legal", "sales"])
def test_regression(route, dataset, baseline_loader, request):
if route not in changed_routes(request):
pytest.skip(f"{route} not affected by this PR")
cases = [TestCase(input=ex.q, output=run_agent(ex), context=ex.ctx, id=ex.id)
for ex in dataset(route)]
baselines = baseline_loader(route)
for name, (template, floor) in RUBRICS.items():
result = evaluator.evaluate(eval_templates=[template], inputs=cases)
scores = [m.value for r in result.eval_results for m in r.metrics]
passed, reason = regression_gate(scores, baselines[name]) # Welch's t-test
assert passed, f"{route}.{name}: {reason}"
Four rules separate a working regression layer from theatre. (1) Versioned dataset, in git. Diff invisibility is how drift compounds. (2) Per-route scope. A PR touching the support prompt doesn’t rerun the legal suite. (3) Trailing 7-day baseline. Compare against the rolling production observation, not a frozen number. (4) Delta gate, not just floor. Welch’s t-test on per-example deltas catches slow regressions a floor misses; the floor catches catastrophic drops the delta gate writes off as variance. Both, calibrated.
For datasets above a few thousand examples, the SDK ships four distributed runners (Celery, Ray, Temporal, Kubernetes). The classifier cascade (NLI rubrics triage, frontier judge adjudicates the long tail) keeps the bill bounded. The Future AGI Platform runs classifier-backed evals at lower per-eval cost than Galileo Luna-2, which makes a 2,000-example nightly suite financially viable.
Layer 3: integration tests
Integration tests exercise the full stack: tool calls executed end-to-end, retrieval recall against ground-truth chunks, multi-turn flows, RAG pipeline composition. They cost more than rubric eval, but they aren’t adversarial. They check that the pieces fit.
The signal they catch that rubric eval misses: tool-call argument shape (right intent, wrong parameter name), retrieval ground-truth recall (the answer is grounded in retrieved chunks, but the right chunk wasn’t retrieved), session-state continuity (turn three references turn one correctly). EvaluateFunctionCalling, RagasContextRecall, ChunkAttribution, and the conversation-level templates live here. Run nightly, not on every PR, unless the suite fits inside the PR-blocking budget.
Layer 4: red-team tests
Adversarial signal sits above regression. Red-team scores how the system behaves under intentionally hostile input: jailbreak attempts, prompt injection, the OWASP LLM Top 10 surface, data-extraction probes, system-prompt leakage, refusal bypasses, multi-turn manipulation.
Three injection patterns earn their keep:
- Prompt injection. Static suites of 500-2,000 known patterns (DAN variants, role-play bypasses, encoded instructions, tool-misuse exploits). Pre-canary on every release.
- Generative red-team. A red-team agent generates adversarial inputs against your latest prompt; success rate is the metric. Weekly. Promote successful attacks into the static suite.
- Domain adversarial. Patterns specific to your surface: regulatory probes for legal, dosage-bait for healthcare, social-engineering for support.
The 8 scanners from Layer 1 also belong on every input and output at runtime, not just in red-team CI. Inline Protect guardrails (or GuardrailProtectWrapper for streaming) sit on the runtime path. Red-team CI is what tells you whether the guardrails actually hold.
Layer 5: canary deployment (the top)
The top is real traffic at low volume. Route 1-5 percent of live requests to the new build. Score it with the same rubrics CI uses. Auto-rollback on per-rubric regression beyond the calibrated threshold (typically 2-3 percentage points sustained over 15-60 minutes).
Two sub-patterns:
- Canary. New version serves; old version is the fallback. Rollback policy lives in the gateway.
- Shadow. Old version serves users; new version runs on the same input and is scored offline. Catches regressions without exposing them.
Future AGI’s Agent Command Center gateway supports shadow, mirror, and race modes as routing strategies, so the canary policy lives in gateway config rather than application code. Per-virtual-key budgets cap canary cost exposure.
Two rules: pre-register the rollback threshold before traffic touches the canary; alarm on a sustained delta, not a single spike, so a brittle judge call doesn’t roll back a working build.
Wiring the pyramid into CI
Three tiers, three cadences:
# .github/workflows/llm-tests.yml
name: LLM Tests
on:
pull_request:
paths: ["agents/**", "evals/**", "fi-evaluation.yaml"]
concurrency:
group: llm-tests-${{ github.head_ref }}
cancel-in-progress: true
jobs:
pr-gate:
runs-on: ubuntu-latest
timeout-minutes: 6
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: { python-version: "3.11", cache: "pip" }
- run: pip install -r requirements.txt
# Layer 1: deterministic (microseconds, every commit)
- run: pytest tests/deterministic/ -n auto
# Layer 2: regression eval, cheap subset (cents, every PR)
- env:
FI_API_KEY: ${{ secrets.FI_API_KEY }}
FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
run: fi run --check --strict --parallel 16 -c evals/fi-evaluation.yaml
# Layer 2 delta gate (Welch's t-test vs 7-day baseline)
- env:
FI_API_KEY: ${{ secrets.FI_API_KEY }}
FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
run: pytest evals/test_regression.py -n auto
The fi CLI is the layer that makes the gate wire cleanly into any CI runner. Exit code 0 = success, 2 = assertion failed, 3 = --strict warning, 6 = API error. CI policies (retry on 6, hard-fail on 2, slack-notify on 3) wire without grep heuristics on stdout. Native assertion conditions cover pass_rate, avg_score, p50_score, p90_score, p95_score, and runtime percentiles, so the gate can fire on the tail when long-tail failures hide in averages.
# fi-evaluation.yaml: the cheap PR-blocking subset
assertions:
- template: "groundedness"
condition: "p95_score >= 0.78"
on_fail: "error"
- template: "answer_refusal"
condition: "pass_rate >= 0.95"
on_fail: "error"
- template: "citation_validity"
condition: "pass_rate >= 0.99"
on_fail: "error"
A separate nightly workflow runs Layers 3 and 4 against the full dataset and posts the daily baseline back into Observe. Canary lives in the gateway config, not in CI.
The two anti-patterns
- Inverted pyramid. LLM-as-judge at the base, deterministic checks tacked on as an afterthought. Expensive, slow, noisy on every PR. Engineers disable the gate by month two. Common in teams that adopted a hosted judge surface and never built the deterministic layer.
- Flat stack. No layers. Every check is a
pytestassertion that happens to call an LLM. No statistical gating, no shape, no cadence. Common in teams that wrote the framework themselves over a weekend and never reached month three.
Braintrust and DeepEval’s testing primitives sit roughly at Layer 2; the deterministic layer they leave to you; integration and red-team they treat as adjacent products. The pyramid is a shape, not a vendor. Buy the runner, build the rubric, build the shape on top of whatever runner you bought.
Eval-in-prod and the closed loop
Offline tests gate regressions before deploy. Production catches what offline doesn’t. Three patterns: sample by signal (oversample failure candidates: low rubric score, thumbs-down, regeneration, refusal), classifier-backed continuous scoring at lower per-eval cost than Galileo Luna-2, and span-attached scores via traceAI so the eval, the trace, and the failure live in the same tree.
Error Feed closes the loop. HDBSCAN soft-clustering groups every trace failure into a named issue; a Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, 90 percent prompt-cache hit ratio) writes the RCA, evidence, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5). Promote representative traces from each named issue back into the Layer 2 dataset; the next PR’s gate runs against a sharper definition.
Common pitfalls
- LLM-as-judge at the base. The anti-pattern this post argues against. Judges in the middle, deterministic at the base.
- No deterministic layer at all. Every check is a rubric. The JSON parse failure ships because nobody ran
json.loads. - One rubric. “LLM quality 0.84” is not actionable. Per-rubric scoring is what surfaces specific failure modes.
- Floor without delta gate. Slow regressions slip under the floor for months until the cliff.
- 30-example dataset, mean-based gate. Variance wider than the regressions you’re catching. Grow the dataset or gate on percentiles.
- Canary without auto-rollback. Manual rollback turns canary into a slow incident.
- CI and production rubrics diverge. Engineers argue which number is real instead of fixing the bug. One definition, two contexts.
- Static dataset frozen at launch. A 2024 set evaluating a 2026 product is a benchmark, not a regression suite. Error Feed promotes weekly.
Three deliberate tradeoffs
- The deterministic base feels boring. Schema validation and regex don’t make conference talks. They catch the failures that wake the on-call at 3 am. Boring is the point.
- Statistical gating delays the first useful PR. A 7-day rolling baseline takes a week of nightly runs to populate. The payoff is engineers learn a red PR means a real regression.
- Red-team is expensive and worth it. Generative red-team runs dollars per pass; static suites cost cents. Both, weekly. The day you skip is the day someone publishes a jailbreak that works on you.
How Future AGI ships the pyramid
Future AGI ships the eval stack as a package, designed for the pyramid shape. Start with the SDK and the fi CLI for Layers 1 and 2. Graduate to the Platform when the cascade, the closed loop, and the gateway start mattering.
- ai-evaluation SDK (Apache 2.0). 60+
EvalTemplateclasses for Layer 2 (Groundedness,AnswerRefusal,Completeness,TaskCompletion,EvaluateFunctionCalling). 8 sub-10msScannerclasses for Layer 1 (JailbreakScanner,CodeInjectionScanner,SecretsScanner,MaliciousURLScanner,InvisibleCharScanner,LanguageScanner,TopicRestrictionScanner,RegexScanner). 4 distributed runners (Celery, Ray, Temporal, Kubernetes).RailType.INPUT/OUTPUT/RETRIEVALplusAggregationStrategy.ANY/ALL/MAJORITY/WEIGHTEDfor ensemble rubrics. fiCLI.fi init --template basic|rag|safety|agentscaffoldsfi-evaluation.yaml.fi run --check --strict --parallel 16evaluates assertions onpass_rate,avg_score,p50/p90/p95_score, and runtime percentiles, with CI-distinct exit codes (0/2/3/6).--mode hybridroutes between local classifier rubrics and cloud judges.- Future AGI Platform. Self-improving evaluators tuned by thumbs feedback; an in-product agent writes unlimited custom evaluators from natural language; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- Agent Command Center. 17 MB Go binary self-hosts in your VPC. 20+ providers via six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure). Shadow, mirror, race modes for Layer 5. 5-level hierarchical budgets cap canary cost. SOC 2 Type II, HIPAA, GDPR, CCPA certified.
- Error Feed (inside the eval stack). HDBSCAN clustering plus Sonnet 4.5 Judge writes the
immediate_fix; fixes feed self-improving evaluators. - traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, C#. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY). 14 span kinds; 62 built-in evals via
EvalTagserver-side at zero inline latency.
Ready to wire the pyramid? pip install ai-evaluation, fi init --template basic, point the YAML at your eval set, set FI_API_KEY and FI_SECRET_KEY in CI secrets, add the deterministic pytest suite at Layer 1 and fi run --check --strict at Layer 2. The shape works the same on every product you ship.
Related reading
Frequently asked questions
What is the LLM testing pyramid and why does it matter?
How is LLM testing different from regular software testing?
What is the cost curve flip in LLM testing?
What are the layers of the LLM test pyramid?
Where does LLM-as-judge belong in the pyramid?
How do I wire the LLM test pyramid into CI?
What does Future AGI ship for LLM testing?
How software engineers should map the test pyramid (unit, integration, e2e) onto LLM evaluation in 2026: the seven gaps, the analogy, and a five-step transition.
The 7-item LLM evaluation best practices checklist that actually ships: dataset, judge calibration, deterministic floor, CI gate, statistical math, production observability, closed loop. Paste it into a sprint.
Testing LLMs at scale is three primitives: distributed runners, classifier cascade, per-route sampling. Six evaluators ranked by what survives the burst at millions of evals per day.