Guides

LLM Testing in 2026: Methods and Strategies

The 2026 LLM testing pyramid. Deterministic unit checks at the base, regression eval in the middle, red-team and canary at the top, wired into pytest and CI.

·
Updated
·
10 min read
llm-testing ci-cd test-pyramid red-team ai-evaluation 2026
Editorial cover image for LLM Testing in 2026 Methods and Strategies
Table of Contents

A PR lands. The LLM test suite runs 247 LLM-as-judge calls on 200 examples across nine rubrics. Four minutes, $3.20, green check. Twelve hours later a customer reports the bot returning malformed JSON the application can’t parse. The judge scored 0.91 for groundedness; nobody asked whether the JSON parsed. A json.loads on every output, costing literally nothing, would have caught it on the first PR. The expensive thing ran. The cheap thing didn’t.

That’s the pyramid inverted. The opinion this post earns: LLM testing is a pyramid, like everything else, and putting LLM-as-judge at the base is the anti-pattern that prices the gate out of existence. Deterministic checks belong at the base (schema, format, scanners, citation validity). Rubric-based eval belongs in the middle, gated on statistical significance. Red-team and canary belong at the top, surgical and expensive. The shape isn’t new. The cost curve is, and it’s flipped from classical software testing, which is what makes the shape matter more than ever.

This guide walks the five-layer pyramid: what each layer scores, what it catches, what it costs, and how it wires into pytest plus the fi CLI. Code shaped against the ai-evaluation SDK.

TL;DR: the pyramid, top to bottom

LayerWhat it scoresCost per runCadence
5. CanaryReal-traffic regressions on 1-5% liveFree (eval cost)Every release
4. Red-teamAdversarial prompts, jailbreak surfaceDollarsPre-canary, nightly
3. IntegrationTool calls, retrieval, multi-turnCents-dollarsNightly
2. Regression evalRubric-based scoring on golden setCentsEvery PR, sharded
1. DeterministicSchema, regex, scanners, exact-matchMicrosecondsEvery commit

Bottom-heavy by volume, top-heavy by cost. The opposite shape is the anti-pattern.

The cost curve flip

Classical pyramids worked because the cheap checks were numerous and the expensive ones were rare. A unit test cost nothing, a UI test cost a CI minute, an end-to-end test cost five. The math front-loaded the cheap signal naturally.

LLM testing flipped the cost curve. The cheapest checks (schema, regex, string match) still cost essentially nothing. The layer above (rubric-based eval) costs real money. A frontier judge at $0.015 per call against 200 examples is $3 per PR, before parallel rubrics multiply it. Red-team suites run dollars. Canary is free in eval cost but expensive in rollback risk if the policy is wrong.

The flip is what makes the pyramid shape matter more, not less. A team that puts the most expensive layer at the base, a full LLM-judge sweep on every commit, burns the eval budget on signal that doesn’t compound. The cheap deterministic checks they skipped catch the JSON parse failure, the missing citation, the prompt-injection leak. Spend the cheap budget freely. Spend the judge budget where it earns its keep.

Layer 1: deterministic unit checks (the base)

The base of the pyramid is everything that runs in microseconds and never lies. JSON schema validation, regex on tool-call structure, exact-match on golden examples, citation existence (the cited span exists verbatim in retrieval context), format checks, and the sub-10ms scanners that catch categorical failures.

from pydantic import BaseModel
from fi.evals.guardrails.scanners import JailbreakScanner, SecretsScanner

class SupportResponse(BaseModel):
    intent: str
    answer: str
    citations: list[str]

def test_response_schema_parses():
    raw = run_support_bot("How do I reset my password?")
    SupportResponse.model_validate_json(raw)   # raises on bad schema

def test_no_prompt_injection_in_output():
    out = run_agent("Summarise this document: ...")
    assert not JailbreakScanner().scan(out).flagged
    assert not SecretsScanner().scan(out).flagged

def test_citation_spans_exist(answer, retrieval_context):
    for span in extract_citations(answer):
        assert span in retrieval_context, f"fabricated citation: {span!r}"

Three rules earn their keep. (1) Run on every commit. No baseline, no sampling, no threshold tuning. The check passes or it doesn’t. (2) Fail fast. A bad schema means the contract is broken; stop the build, don’t score it. (3) Treat scanners as deterministic. The 8 Scanner classes live here, not the rubric layer. They return a boolean and cost nothing.

Cents per million, microseconds per call. Run them everywhere.

Layer 2: regression eval (the middle)

The middle of the pyramid is rubric-based scoring on a versioned golden dataset, run on every PR, gated on statistical significance. This is where LLM-as-judge earns its bill, and where most teams misallocate it.

The rubric set scores semantic correctness the deterministic layer can’t see: groundedness, refusal calibration, completeness, task completion, and the four-to-seven domain rubrics specific to your product. The full list runs nightly. The PR-blocking gate runs a cheap subset.

import pytest
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, AnswerRefusal, Completeness
from fi.testcases import TestCase

evaluator = Evaluator(max_workers=16)  # FI_API_KEY / FI_SECRET_KEY in env

RUBRICS = {
    "groundedness": (Groundedness(), 0.85),
    "completeness": (Completeness(), 0.75),
    "refusal":      (AnswerRefusal(), 0.90),
}

@pytest.mark.parametrize("route", ["support", "legal", "sales"])
def test_regression(route, dataset, baseline_loader, request):
    if route not in changed_routes(request):
        pytest.skip(f"{route} not affected by this PR")

    cases = [TestCase(input=ex.q, output=run_agent(ex), context=ex.ctx, id=ex.id)
             for ex in dataset(route)]
    baselines = baseline_loader(route)
    for name, (template, floor) in RUBRICS.items():
        result = evaluator.evaluate(eval_templates=[template], inputs=cases)
        scores = [m.value for r in result.eval_results for m in r.metrics]
        passed, reason = regression_gate(scores, baselines[name])  # Welch's t-test
        assert passed, f"{route}.{name}: {reason}"

Four rules separate a working regression layer from theatre. (1) Versioned dataset, in git. Diff invisibility is how drift compounds. (2) Per-route scope. A PR touching the support prompt doesn’t rerun the legal suite. (3) Trailing 7-day baseline. Compare against the rolling production observation, not a frozen number. (4) Delta gate, not just floor. Welch’s t-test on per-example deltas catches slow regressions a floor misses; the floor catches catastrophic drops the delta gate writes off as variance. Both, calibrated.

For datasets above a few thousand examples, the SDK ships four distributed runners (Celery, Ray, Temporal, Kubernetes). The classifier cascade (NLI rubrics triage, frontier judge adjudicates the long tail) keeps the bill bounded. The Future AGI Platform runs classifier-backed evals at lower per-eval cost than Galileo Luna-2, which makes a 2,000-example nightly suite financially viable.

Layer 3: integration tests

Integration tests exercise the full stack: tool calls executed end-to-end, retrieval recall against ground-truth chunks, multi-turn flows, RAG pipeline composition. They cost more than rubric eval, but they aren’t adversarial. They check that the pieces fit.

The signal they catch that rubric eval misses: tool-call argument shape (right intent, wrong parameter name), retrieval ground-truth recall (the answer is grounded in retrieved chunks, but the right chunk wasn’t retrieved), session-state continuity (turn three references turn one correctly). EvaluateFunctionCalling, RagasContextRecall, ChunkAttribution, and the conversation-level templates live here. Run nightly, not on every PR, unless the suite fits inside the PR-blocking budget.

Layer 4: red-team tests

Adversarial signal sits above regression. Red-team scores how the system behaves under intentionally hostile input: jailbreak attempts, prompt injection, the OWASP LLM Top 10 surface, data-extraction probes, system-prompt leakage, refusal bypasses, multi-turn manipulation.

Three injection patterns earn their keep:

  • Prompt injection. Static suites of 500-2,000 known patterns (DAN variants, role-play bypasses, encoded instructions, tool-misuse exploits). Pre-canary on every release.
  • Generative red-team. A red-team agent generates adversarial inputs against your latest prompt; success rate is the metric. Weekly. Promote successful attacks into the static suite.
  • Domain adversarial. Patterns specific to your surface: regulatory probes for legal, dosage-bait for healthcare, social-engineering for support.

The 8 scanners from Layer 1 also belong on every input and output at runtime, not just in red-team CI. Inline Protect guardrails (or GuardrailProtectWrapper for streaming) sit on the runtime path. Red-team CI is what tells you whether the guardrails actually hold.

Layer 5: canary deployment (the top)

The top is real traffic at low volume. Route 1-5 percent of live requests to the new build. Score it with the same rubrics CI uses. Auto-rollback on per-rubric regression beyond the calibrated threshold (typically 2-3 percentage points sustained over 15-60 minutes).

Two sub-patterns:

  • Canary. New version serves; old version is the fallback. Rollback policy lives in the gateway.
  • Shadow. Old version serves users; new version runs on the same input and is scored offline. Catches regressions without exposing them.

Future AGI’s Agent Command Center gateway supports shadow, mirror, and race modes as routing strategies, so the canary policy lives in gateway config rather than application code. Per-virtual-key budgets cap canary cost exposure.

Two rules: pre-register the rollback threshold before traffic touches the canary; alarm on a sustained delta, not a single spike, so a brittle judge call doesn’t roll back a working build.

Wiring the pyramid into CI

Three tiers, three cadences:

# .github/workflows/llm-tests.yml
name: LLM Tests
on:
  pull_request:
    paths: ["agents/**", "evals/**", "fi-evaluation.yaml"]
concurrency:
  group: llm-tests-${{ github.head_ref }}
  cancel-in-progress: true
jobs:
  pr-gate:
    runs-on: ubuntu-latest
    timeout-minutes: 6
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: "pip" }
      - run: pip install -r requirements.txt
      # Layer 1: deterministic (microseconds, every commit)
      - run: pytest tests/deterministic/ -n auto
      # Layer 2: regression eval, cheap subset (cents, every PR)
      - env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: fi run --check --strict --parallel 16 -c evals/fi-evaluation.yaml
      # Layer 2 delta gate (Welch's t-test vs 7-day baseline)
      - env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: pytest evals/test_regression.py -n auto

The fi CLI is the layer that makes the gate wire cleanly into any CI runner. Exit code 0 = success, 2 = assertion failed, 3 = --strict warning, 6 = API error. CI policies (retry on 6, hard-fail on 2, slack-notify on 3) wire without grep heuristics on stdout. Native assertion conditions cover pass_rate, avg_score, p50_score, p90_score, p95_score, and runtime percentiles, so the gate can fire on the tail when long-tail failures hide in averages.

# fi-evaluation.yaml: the cheap PR-blocking subset
assertions:
  - template: "groundedness"
    condition: "p95_score >= 0.78"
    on_fail: "error"
  - template: "answer_refusal"
    condition: "pass_rate >= 0.95"
    on_fail: "error"
  - template: "citation_validity"
    condition: "pass_rate >= 0.99"
    on_fail: "error"

A separate nightly workflow runs Layers 3 and 4 against the full dataset and posts the daily baseline back into Observe. Canary lives in the gateway config, not in CI.

The two anti-patterns

  • Inverted pyramid. LLM-as-judge at the base, deterministic checks tacked on as an afterthought. Expensive, slow, noisy on every PR. Engineers disable the gate by month two. Common in teams that adopted a hosted judge surface and never built the deterministic layer.
  • Flat stack. No layers. Every check is a pytest assertion that happens to call an LLM. No statistical gating, no shape, no cadence. Common in teams that wrote the framework themselves over a weekend and never reached month three.

Braintrust and DeepEval’s testing primitives sit roughly at Layer 2; the deterministic layer they leave to you; integration and red-team they treat as adjacent products. The pyramid is a shape, not a vendor. Buy the runner, build the rubric, build the shape on top of whatever runner you bought.

Eval-in-prod and the closed loop

Offline tests gate regressions before deploy. Production catches what offline doesn’t. Three patterns: sample by signal (oversample failure candidates: low rubric score, thumbs-down, regeneration, refusal), classifier-backed continuous scoring at lower per-eval cost than Galileo Luna-2, and span-attached scores via traceAI so the eval, the trace, and the failure live in the same tree.

Error Feed closes the loop. HDBSCAN soft-clustering groups every trace failure into a named issue; a Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, 90 percent prompt-cache hit ratio) writes the RCA, evidence, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5). Promote representative traces from each named issue back into the Layer 2 dataset; the next PR’s gate runs against a sharper definition.

Common pitfalls

  • LLM-as-judge at the base. The anti-pattern this post argues against. Judges in the middle, deterministic at the base.
  • No deterministic layer at all. Every check is a rubric. The JSON parse failure ships because nobody ran json.loads.
  • One rubric. “LLM quality 0.84” is not actionable. Per-rubric scoring is what surfaces specific failure modes.
  • Floor without delta gate. Slow regressions slip under the floor for months until the cliff.
  • 30-example dataset, mean-based gate. Variance wider than the regressions you’re catching. Grow the dataset or gate on percentiles.
  • Canary without auto-rollback. Manual rollback turns canary into a slow incident.
  • CI and production rubrics diverge. Engineers argue which number is real instead of fixing the bug. One definition, two contexts.
  • Static dataset frozen at launch. A 2024 set evaluating a 2026 product is a benchmark, not a regression suite. Error Feed promotes weekly.

Three deliberate tradeoffs

  • The deterministic base feels boring. Schema validation and regex don’t make conference talks. They catch the failures that wake the on-call at 3 am. Boring is the point.
  • Statistical gating delays the first useful PR. A 7-day rolling baseline takes a week of nightly runs to populate. The payoff is engineers learn a red PR means a real regression.
  • Red-team is expensive and worth it. Generative red-team runs dollars per pass; static suites cost cents. Both, weekly. The day you skip is the day someone publishes a jailbreak that works on you.

How Future AGI ships the pyramid

Future AGI ships the eval stack as a package, designed for the pyramid shape. Start with the SDK and the fi CLI for Layers 1 and 2. Graduate to the Platform when the cascade, the closed loop, and the gateway start mattering.

  • ai-evaluation SDK (Apache 2.0). 60+ EvalTemplate classes for Layer 2 (Groundedness, AnswerRefusal, Completeness, TaskCompletion, EvaluateFunctionCalling). 8 sub-10ms Scanner classes for Layer 1 (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner). 4 distributed runners (Celery, Ray, Temporal, Kubernetes). RailType.INPUT/OUTPUT/RETRIEVAL plus AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED for ensemble rubrics.
  • fi CLI. fi init --template basic|rag|safety|agent scaffolds fi-evaluation.yaml. fi run --check --strict --parallel 16 evaluates assertions on pass_rate, avg_score, p50/p90/p95_score, and runtime percentiles, with CI-distinct exit codes (0/2/3/6). --mode hybrid routes between local classifier rubrics and cloud judges.
  • Future AGI Platform. Self-improving evaluators tuned by thumbs feedback; an in-product agent writes unlimited custom evaluators from natural language; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
  • Agent Command Center. 17 MB Go binary self-hosts in your VPC. 20+ providers via six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure). Shadow, mirror, race modes for Layer 5. 5-level hierarchical budgets cap canary cost. SOC 2 Type II, HIPAA, GDPR, CCPA certified.
  • Error Feed (inside the eval stack). HDBSCAN clustering plus Sonnet 4.5 Judge writes the immediate_fix; fixes feed self-improving evaluators.
  • traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, C#. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY). 14 span kinds; 62 built-in evals via EvalTag server-side at zero inline latency.

Ready to wire the pyramid? pip install ai-evaluation, fi init --template basic, point the YAML at your eval set, set FI_API_KEY and FI_SECRET_KEY in CI secrets, add the deterministic pytest suite at Layer 1 and fi run --check --strict at Layer 2. The shape works the same on every product you ship.

Frequently asked questions

What is the LLM testing pyramid and why does it matter?
The same three-tier shape classical test pyramids use, applied to LLM features. Deterministic unit tests at the base (schema, format, exact-match, regex, scanners, citation validity), regression eval in the middle (rubric-based scoring on a frozen golden set), red-team and canary at the top (adversarial and live-traffic). The shape matters because it front-loads cheap fast signals. The mistake teams make is putting LLM-as-judge at the base of their stack, which is the classic anti-pattern: a $0.04 judge call against every commit prices the gate out of existence inside a quarter, the judge noise drowns out small regressions, and engineers stop trusting the green check. Deterministic checks belong at the base, judges in the middle, red-team and canary at the top.
How is LLM testing different from regular software testing?
Regular tests assert exact outputs against a contract. LLM tests have to handle the same correct answer phrased twenty different ways, which means scoring against rubrics instead of asserting. The shape of the pyramid is identical to classical software testing, but the cost curve has flipped. In classical testing the cheapest checks are the most numerous and the expensive ones are rare. In LLM testing the temptation is to invert that, running expensive LLM-judge calls on every commit because they feel more thorough. They're not more thorough, they're noisier and slower. The pyramid restored: deterministic checks at the base, rubric-based eval in the middle, red-team and canary at the top, with cost compounding upward.
What is the cost curve flip in LLM testing?
Classical software tests cost nothing. A schema check or string match runs in microseconds, free, deterministic. LLM-as-judge calls cost real money: a frontier judge at 1.2 seconds and a few cents per call against a 200-example regression suite is dollars per PR, before you even reach red-team. The flip is what makes the pyramid shape matter more than ever. Putting the most expensive thing at the base, on every commit, on every example, is the anti-pattern that disables the gate. The right shape is to spend the cheap deterministic budget freely at the base, the rubric budget carefully in the middle, and the red-team plus canary budget surgically at the top.
What are the layers of the LLM test pyramid?
Five layers, from base to top. (1) Deterministic unit checks: JSON schema validation, regex, exact-match on golden examples, citation existence, Future AGI's 8 sub-10ms Scanners (jailbreak, code injection, secrets, malicious URLs, invisible chars, language, topic, regex). Cents per million, runs on every commit. (2) Regression eval: rubric-based scoring on a frozen golden set, Welch's t-test on per-example deltas against a 7-day rolling baseline. Cents per PR. (3) Integration tests: tool calls, retrieval, chains. (4) Red-team: adversarial prompts, jailbreak attempts, OWASP LLM Top 10 surface. Dollars per release. (5) Canary: live traffic at 1-5 percent with auto-rollback. Free in eval cost; expensive if the rollback policy is broken.
Where does LLM-as-judge belong in the pyramid?
In the middle, not the base. The middle is where rubrics earn their keep: groundedness, refusal calibration, completeness, task completion, the metrics that need semantic scoring instead of pattern matching. The base is where you put the checks that are deterministic, fast, and cheap enough to run on every commit without flinching. The top is where adversarial signal lives. Teams that put LLM-as-judge at the base end up with a 4-minute PR-blocking gate that costs dollars per push, fires false alarms on judge noise, and gets disabled by engineering leadership inside two quarters. The right place for a judge is the regression layer, gated on statistical significance, scoped per-route, calibrated against a 7-day baseline.
How do I wire the LLM test pyramid into CI?
Three tiers, three cadences. PR-blocking (every push): deterministic scanners plus cheap classifier rubrics. Under three minutes. Cents per run. Nightly main: full rubric sweep on the versioned dataset with delta gating via Welch's t-test against the 7-day baseline. 15-30 minutes. Pre-canary: red-team suite on the merged build. Block promotion to canary on regression. Then canary at 1-5 percent live traffic with auto-rollback on per-rubric regression. The fi CLI ships native CI assertions (pass_rate, avg_score, p50/p90/p95_score) with distinct exit codes (0/2/3/6) so the tiers wire cleanly into GitHub Actions, GitLab CI, or Buildkite without grep heuristics on stdout.
What does Future AGI ship for LLM testing?
The eval stack as a package. The ai-evaluation SDK (Apache 2.0) ships 60+ EvalTemplate classes for rubric eval, 8 sub-10ms Scanners as the deterministic base layer, four distributed runners (Celery, Ray, Temporal, Kubernetes) for batch regression, and classifier-backed evals at lower per-eval cost than Galileo Luna-2 for the cascade. The fi CLI ships native CI assertions on pass_rate and percentiles with CI-distinct exit codes. The Agent Command Center gateway ships shadow, mirror, and race modes for canary, with five-level hierarchical budgets that cap canary cost. Error Feed sits inside the eval stack: HDBSCAN clustering plus a Sonnet 4.5 Judge agent writes the immediate fix on every clustered failure. traceAI joins the test signal to the trace tree across 50+ AI surfaces.
Related Articles
View all
LLM Evaluation Best Practices Checklist for 2026
Guides

The 7-item LLM evaluation best practices checklist that actually ships: dataset, judge calibration, deterministic floor, CI gate, statistical math, production observability, closed loop. Paste it into a sprint.

NVJK Kartik
NVJK Kartik ·
13 min