LLM Eval vs Traditional QA: A 2026 Bridge for QA Teams
Traditional QA asserts pass/fail. LLM eval grades against a rubric. The pyramid, the golden set, the CI gate carry. The assertion library is what you replace.
Table of Contents
Your QA team has spent a decade making software reliable. Pytest fixtures, regression suites, red-team builds, CI gates that block the merge button on red. That discipline is exactly what an LLM feature needs, and most LLM teams in 2026 are short on it. The problem isn’t that QA stops applying. The problem is that one primitive, the boolean assertion, has to re-shape for an output space that doesn’t have a single right answer.
This post is the bridge. The thesis is one sentence: traditional QA is pass/fail against an expected output; LLM eval is a graded score against a rubric. Bring the discipline, replace the assertion library. The pyramid carries. The golden set carries. The CI gate carries. The pytest harness barely changes. The body of the assertion is the part you rewrite.
TL;DR: the bridge in one table
| Traditional QA | LLM eval (2026) | Carries? |
|---|---|---|
assertEqual(out, expected) | Score on a 0-to-1 rubric, gate on threshold | No (re-shape) |
| Deterministic by default | Non-deterministic by default | No (re-shape) |
| Tests rot when API changes | Tests rot when input distribution changes | No (re-shape) |
| Single pass/fail threshold | Per-rubric, per-route threshold | No (re-shape) |
| Flakiness = retry and investigate | Variance is baseline; measure it | No (re-shape) |
| Coverage = code paths | Coverage = input distribution | No (re-shape) |
| Pytest harness + fixtures + CI | Pytest harness + fixtures + CI | Yes, unchanged |
| Golden set / regression suite | Golden set / regression suite | Yes, refresh cadence changes |
| Test pyramid (unit / integration / e2e) | Test pyramid (deterministic / classifier / judge) | Yes, different verdict source |
| Bug-bash mentality | Red-team prompts and edge cases | Yes, unchanged |
| Feature team owns the tests | Feature team owns the rubric | Yes, with SME on the rubric |
Six primitives re-shape. Five carry. The two columns sit on the same harness.
What carries: the QA discipline that LLM eval inherits
The thing pre-2024 QA teams already do well is the part of LLM eval that matters most.
The pytest harness. The fixture model, parametrize markers, parallel workers, JUnit XML output, the GitHub Actions integration: all of it carries verbatim. The eval becomes a new kind of test inside the suite you already run. There’s no second tool to adopt and no second CI runner to maintain. Pytest doesn’t care whether the assertion is assertEqual or a 0-to-1 score against a threshold.
The test pyramid. Software QA has the classic pyramid: many cheap unit tests at the base, fewer integration tests in the middle, a small number of expensive end-to-end tests at the top. LLM eval has an identical pyramid with different verdict sources. Deterministic Scanners (JailbreakScanner, SecretsScanner, RegexScanner, and five others) live at the base: sub-10ms, free, behave exactly like traditional assertions. Classifier-backed rubrics (13 backends behind one Guardrails class) live in the middle: sub-second, fractional-cent. LLM-as-judge sits at the top for semantic rubrics like Groundedness and TaskCompletion: slower, dollar-figure-per-thousand, reserved for what the cheaper tiers can’t answer. Same pyramid, different judges.
The golden set. A curated, versioned, stratified set of test cases that the suite runs on every PR. QA teams already have this; it’s called the regression suite. The discipline transfers verbatim. What changes is the rot mechanism: traditional regression suites rot when the API surface drifts; LLM regression suites rot when the input distribution drifts (new user intents, jargon shifts, model upgrades). The golden-set design guide covers the refresh cadence: weekly promotion of failing production traces, deprecation of stale examples, and PR-reviewed additions.
The CI gate. Block the merge button on red. Same mechanic, rubric-shaped thresholds. The CI/CD for LLM eval guide has working GitHub Actions YAML. One detail to catch early: put the threshold table in the CI config, not in test bodies. Regulatory teams can then review and approve threshold changes via PR without rewriting Python.
The bug-bash mindset. QA’s instinct to break things is one of the most valuable skills on a 2026 LLM team. Adversarial prompts, edge cases, persona stress tests, and OWASP LLM Top 10 categories all slot into the golden set as test cases. Run a “red-team day” the way you’d run a bug bash. The prompts that produce broken behavior become permanent regression tests. The eight built-in Scanners cover most of the OWASP-style adversarial surface as deterministic checks; the residual semantic adversarial cases go to PromptInjection and Toxicity templates at the classifier or judge tier.
Test ownership. Software QA’s best practice has always been that the team owning the feature owns the tests. LLM eval inherits this with one tweak: the rubric is the test, and the subject-matter expert owns the rubric. A legal assistant should have rubrics owned by the legal team. The QA engineering function owns the harness, the gating discipline, and the variance management; the SME owns the meaning of “good.”
What changes: the assertion goes from boolean to graded
In traditional QA, an assertion is a boolean. Two strings either match or they don’t, a regex either matches or it doesn’t. assertEqual, assertContains, assertMatches. Green checkmark or red X.
LLM eval flips this. The assertion is a score on a 0-to-1 rubric, often across multiple dimensions in the same test case. Faithfulness might score 0.78, completeness 0.92, format compliance 0.95. The test passes when every rubric clears its threshold. That’s three booleans hidden behind three real-valued scores, with three thresholds tuned per route per use case.
The good news for QA engineers is that pytest absorbs this almost transparently. The fixture returns a list of scores, the test asserts each score against a threshold, the CI reporting renders the same. Here’s the canonical pattern:
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, TaskCompletion
import os
EVALUATOR = Evaluator(
fi_api_key=os.environ["FI_API_KEY"],
fi_secret_key=os.environ["FI_SECRET_KEY"],
)
def test_support_agent_grounded():
result = EVALUATOR.evaluate(
eval_templates=[Groundedness(), TaskCompletion()],
inputs=[{
"input": "What's the cancellation window?",
"output": agent_response,
"context": retrieved_chunks,
}],
)
scores = result.eval_results[0].metrics
assert scores["groundedness"] >= 0.80, f"groundedness {scores['groundedness']}"
assert scores["task_completion"] >= 0.85, f"task_completion {scores['task_completion']}"
The fixture is pytest-shape. The assertion is pytest-shape. The failure message is pytest-shape. Only the body of the assertion is new. The pytest harness for LLM evaluation walks the full GitHub Actions wiring with parallel workers and JUnit reports.
Three other primitives re-shape alongside the assertion. They’re each a small re-think.
Determinism. Run the same prompt twice, you get two different responses. Pin temperature to zero and a seed where the provider supports it, then sample N times per test case (5 to 10 is the common range) and aggregate. The Guardrails class in the ai-evaluation SDK ships AggregationStrategy.MAJORITY for exactly this. Three out of five samples fail, it’s a real failure. One out of five, it’s variance noise. The deterministic vs LLM-judge eval guide goes deeper.
Per-route thresholds. A Groundedness rubric scoring 0.82 is a pass for a casual chatbot and a fail for a regulated insurance assistant. The threshold lives per rubric per route, calibrated against a production baseline. Set the floor too low and broken behavior ships; set it too high and every PR fails for noise. The 2026 pattern is a threshold table keyed on route_name:
ROUTE_THRESHOLDS = {
"casual_chat": {"groundedness": 0.70, "task_completion": 0.75},
"support_agent": {"groundedness": 0.80, "task_completion": 0.85},
"regulated_advice": {"groundedness": 0.92, "task_completion": 0.90},
}
Same suite, same rubrics, different thresholds per route. The prompt regression testing guide covers the wiring across multiple routes without forking the suite.
Coverage. Code coverage asks “did the tests touch this line.” LLM eval coverage asks “does the golden set cover the intent classes, personas, locales, and adversarial prompts production actually sees.” A regression suite of 200 happy-path examples covers zero percent of the failure modes that ship bugs. The 2026 coverage report is stratified by intent and persona, plus a freshness number (median age of an example) and a hard-failure rate (what fraction of examples currently fail any rubric). All trend together over time. Same discipline as software coverage, different denominator.
Mapping pytest patterns to LLM eval
Most pytest patterns map directly. Here are the four QA teams reach for most.
Parametrize a test case across rubrics. In pytest you’d write @pytest.mark.parametrize over inputs. The same decorator carries; the rubrics live in the fixture:
import pytest
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, TaskCompletion, ContextAdherence
@pytest.fixture(scope="session")
def evaluator():
return Evaluator(
fi_api_key=os.environ["FI_API_KEY"],
fi_secret_key=os.environ["FI_SECRET_KEY"],
)
@pytest.mark.parametrize("case", GOLDEN_SET)
def test_support_route(evaluator, case):
result = evaluator.evaluate(
eval_templates=[Groundedness(), TaskCompletion(), ContextAdherence()],
inputs=[case.as_input()],
)
metrics = result.eval_results[0].metrics
for rubric, floor in ROUTE_THRESHOLDS[case.route].items():
assert metrics[rubric] >= floor, f"{rubric}={metrics[rubric]:.3f} < {floor}"
Smoke tier as deterministic scanners. A QA team’s “smoke” tier maps to the deterministic scanner pyramid base. Run the eight built-in Scanners on every PR for every test case: sub-10ms, no API cost, no variance. They behave exactly like the assertion library QA teams already use:
from fi.evals.guardrails.scanners import (
JailbreakScanner, CodeInjectionScanner, SecretsScanner,
MaliciousURLScanner, InvisibleCharScanner, LanguageScanner,
TopicRestrictionScanner, RegexScanner,
)
SMOKE_SCANNERS = [
JailbreakScanner(), CodeInjectionScanner(), SecretsScanner(),
MaliciousURLScanner(), InvisibleCharScanner(),
LanguageScanner(allowed=["en"]),
TopicRestrictionScanner(restricted=["medical_advice"]),
RegexScanner(patterns=[r"\bSSN:\s*\d{3}-\d{2}-\d{4}\b"]),
]
def test_smoke(model_output):
for scanner in SMOKE_SCANNERS:
result = scanner.scan(model_output)
assert not result.triggered, f"{scanner.__class__.__name__}: {result.reason}"
Markers for tiered runs. QA teams already use @pytest.mark.smoke, @pytest.mark.regression, @pytest.mark.e2e. The same markers carry to a three-tier LLM eval pyramid: @pytest.mark.scanner on every PR, @pytest.mark.classifier on nightly regression, @pytest.mark.judge on the high-stakes subset. CI runs each tier on its own schedule, same as the existing convention.
Conftest for shared fixtures. Stand up the Evaluator, the golden set loader, and the threshold table once in conftest.py. Every test file imports the fixture by name. No new tooling, no new CLI. The eval is one more kind of pytest test in a suite you already maintain.
The QA-to-LLM-eval transition mistakes we see most
Five mistakes show up in eval reviews when a QA team onboards its first LLM eval workstream. Each one is small to fix if you catch it before the suite calcifies around it.
Asserting boolean equality on an LLM output. assert response == expected is the QA team’s first instinct and the first thing to throw out. It fails or passes for the wrong reasons. The fix is the rubric-plus-threshold pattern above. Same fixture, different assertion body. The agent-passes-evals-fails-production post covers the failure modes downstream when this gets shipped to production.
Treating variance as flakiness. A test scores 0.85 on one run and 0.82 on the next, and the team retries until it passes. That’s how regression suites quietly turn off. Replace “retry on red” with “sample N times, aggregate.” AggregationStrategy.MAJORITY is the right default for most routes; AggregationStrategy.ALL for high-stakes ones where any single-sample failure should block. Variance is signal, not noise.
One threshold across all routes. A single score >= 0.80 floor is too loose for the regulated route and too tight for the casual one. The threshold belongs in a table keyed by route, calibrated against current production baseline minus a small margin (typically two percentage points). One floor across the suite is the LLM-eval equivalent of running every test in the same environment with the same flags.
Pinning the golden set at launch and never refreshing. The set tests 2023’s traffic forever. The score stays high. Production fails on intents the set never saw. The fix is a refresh pipeline: weekly promotion of failing production traces, deprecation of stale examples, PR review on additions. The LLM data drift detection guide covers the operational pattern.
One judge, one rubric. A team picks a single LLM-judge model and treats it as the eval. The score moves around without a story because one rubric never captures the full quality surface. Compose three or more rubrics via Guardrails, gate on each, and you get diagnostic separation: faithfulness, completeness, format, or refusal calibration each become their own pass/fail signal. Single-rubric eval is the LLM-judge bias mitigation failure mode in miniature.
If your team only fixes three, fix the boolean equality, the variance retry, and the single threshold. The other two are easier to retrofit later.
How Future AGI lowers the friction
The Future AGI eval stack is shaped like pytest because the QA team’s existing harness is the cheapest place to put the new assertion.
Pytest-shape API. Evaluator(...).evaluate(eval_templates=[...], inputs=[...]) returns a list of score objects. You assert scores against thresholds inside pytest fixtures. Same harness, same reporting, same parallel workers. 60+ pre-built EvalTemplate classes (Groundedness, ContextAdherence, FactualAccuracy, AnswerRefusal, TaskCompletion, LLMFunctionCalling, and more) replace the per-team custom-judge work most teams hand-roll.
Compose multiple rubrics like multiple assertions. Guardrails(rail_type=RailType.OUTPUT, aggregation=AggregationStrategy.MAJORITY) composes any combination of templates and decides pass/fail via aggregation strategy. ANY fails on any rubric. ALL requires every rubric to pass. MAJORITY is the right default for variance suppression at N=5.
Eight Scanners as deterministic assertions. JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner. Sub-10ms, no API cost, behave exactly like the assertion library QA teams already use. The right tool for the OWASP-style adversarial surface and for closed-form contracts (schema match, refusal regex, language allow-list).
Thirteen classifier backends as cheap-tier judges. Nine open-weight (Llama Guard 3 8B/1B, Qwen3-Guard 8B/4B/0.6B, Granite Guardian 8B/5B, WildGuard 7B, ShieldGemma 2B) for air-gapped self-hosting, four API (OpenAI Moderation, Azure Content Safety, Turing Flash, Turing Safety) for hosted convenience. All behind one Guardrails class. Sub-second latency, fractional-cent cost: the right tier for high-volume toxicity, PII, language detection where the question is categorical.
Four distributed runners. Celery, Ray, Temporal, Kubernetes. Same shape as parallel pytest workers; same scaling discipline a QA team already runs.
The Future AGI Platform adds two surfaces QA teams ask about repeatedly.
Self-improving evaluators. The Platform auto-retunes thresholds and rubric prompts from production thumbs up / thumbs down feedback. The self-improving AI agent pipeline covers the loop. This is the “self-healing test” QA leads have asked about for a decade, applied to rubric tuning rather than test code.
Error Feed inside the eval stack. HDBSCAN soft-clustering pulls failing production traces, groups them into named clusters, and runs a Sonnet 4.5 Judge that writes an immediate_fix per cluster. The cluster IDs become a new triage dimension. Honest framing: Linear is the only ticket-creating integration in production today; Slack, GitHub, Jira, and PagerDuty are on the roadmap. The clustering signal ships now and feeds into a QA team’s existing triage flow.
A five-step transition for a QA team
If you’re a QA lead handed an LLM workstream on Monday, here’s the order to run.
- Pick the rubric mapped to your most bug-prone behavior. For a RAG-style assistant,
Groundednessusually catches the biggest class of bugs. For a tool-using agent,TaskCompletionorLLMFunctionCalling. Wire one rubric into one pytest test on one production-pulled example. Get one green light. - Build the golden set from your bug-triage backlog. Pull the last six months of production incidents, support escalations, and known bad behaviors. Each becomes a test case. The backlog is the most under-used eval resource in any QA team’s archive; real incidents are higher-signal than any invented example.
- Gate every PR with the eval suite via GitHub Actions. Threshold floor = current production baseline minus two percentage points. Block merge below that floor until either the threshold is lowered with sign-off or the regression is fixed.
- Add the cheap fast tier. Wire the eight Scanners on every test case. Add one classifier backend for toxicity / safety. These run sub-second and cost fractions of a cent, so they go on every case without budget concern. Reserve LLM-judge calls for the high-stakes subset.
- Close the loop. Turn on Error Feed clustering and self-improving evaluators. Failing production traces cluster into named issues; the clusters promote into the golden set; the rubric prompt auto-retunes against the new examples. The regression suite refreshes itself instead of waiting for a quarterly review.
Two sprints, a working LLM eval practice on the pytest harness the team already runs. Discipline transfers, the assertion library re-shapes, the rest stays put.
Where this fits in the bridge series
This post is the QA-team companion to two adjacent bridges. The classical ML eval bridge covers the transition for data scientists coming from sklearn and MLflow. The product analytics bridge covers it for product teams coming from Mixpanel, Amplitude, and PostHog. Three different starting disciplines, one shared eval stack.
The closing line is the same in all three: bring the discipline that made your last system reliable. Replace only the primitives that actually break in an unbounded output space. For a QA team, the primitive that breaks is the boolean assertion. Everything else carries.
Ready to wire your first rubric into your existing pytest suite? Install ai-evaluation, drop a Groundedness rubric against your last fifty production traces this afternoon, and gate the next PR on the resulting score. Same harness, same CI, new assertion.
Frequently asked questions
What's the one-sentence difference between traditional QA and LLM eval?
Can a QA team use pytest for LLM evaluation?
What carries from QA to LLM eval, and what gets replaced?
How does a QA team handle non-determinism in LLM tests?
What's the biggest mistake QA teams make in their first LLM eval workstream?
How does Future AGI shorten the transition for QA teams?
Do red-team and bug-bash practices transfer from QA to LLM eval?
Cheap-fast-statistically-significant LLM eval gates in GitHub Actions: classifier cascade, fi CLI exit codes, Welch's t-test, path-scoped triggers, auto-rollback.
How software engineers should map the test pyramid (unit, integration, e2e) onto LLM evaluation in 2026: the seven gaps, the analogy, and a five-step transition.
Prompt regression is pytest for prompts. Three patterns: per-rubric assertion, per-route stratified eval, and paired comparison vs prior version with CI on the delta.