Guides

LLM Eval vs Traditional QA: A 2026 Bridge for QA Teams

Traditional QA asserts pass/fail. LLM eval grades against a rubric. The pyramid, the golden set, the CI gate carry, the assertion library you replace.

April 2, 2026

Updated May 20, 2026

12 min read

llm-evaluation qa-engineering regression-testing pytest ci-cd golden-set 2026

Table of Contents

Your QA team has spent a decade making software reliable. Pytest fixtures, regression suites, red-team builds, CI gates that block the merge button on red. That discipline is exactly what an LLM feature needs, and most LLM teams in 2026 are short on it. The problem isn’t that QA stops applying. The problem is that one primitive, the boolean assertion, has to re-shape for an output space that doesn’t have a single right answer.

This post is the bridge. The thesis is one sentence: traditional QA is pass/fail against an expected output; LLM eval is a graded score against a rubric. Bring the discipline, replace the assertion library. The pyramid carries. The golden set carries. The CI gate carries. The pytest harness barely changes. The body of the assertion is the part you rewrite.

TL;DR: the bridge in one table

Traditional QA	LLM eval (2026)	Carries?
`assertEqual(out, expected)`	Score on a 0-to-1 rubric, gate on threshold	No (re-shape)
Deterministic by default	Non-deterministic by default	No (re-shape)
Tests rot when API changes	Tests rot when input distribution changes	No (re-shape)
Single pass/fail threshold	Per-rubric, per-route threshold	No (re-shape)
Flakiness = retry and investigate	Variance is baseline; measure it	No (re-shape)
Coverage = code paths	Coverage = input distribution	No (re-shape)
Pytest harness + fixtures + CI	Pytest harness + fixtures + CI	Yes, unchanged
Golden set / regression suite	Golden set / regression suite	Yes, refresh cadence changes
Test pyramid (unit / integration / e2e)	Test pyramid (deterministic / classifier / judge)	Yes, different verdict source
Bug-bash mentality	Red-team prompts and edge cases	Yes, unchanged
Feature team owns the tests	Feature team owns the rubric	Yes, with SME on the rubric

Six primitives re-shape. Five carry. The two columns sit on the same harness.

What carries: the QA discipline that LLM eval inherits

The thing pre-2024 QA teams already do well is the part of LLM eval that matters most.

The pytest harness. The fixture model, parametrize markers, parallel workers, JUnit XML output, the GitHub Actions integration: all of it carries verbatim. The eval becomes a new kind of test inside the suite you already run. There’s no second tool to adopt and no second CI runner to maintain. Pytest doesn’t care whether the assertion is assertEqual or a 0-to-1 score against a threshold.

The test pyramid. Software QA has the classic pyramid: many cheap unit tests at the base, fewer integration tests in the middle, a small number of expensive end-to-end tests at the top. LLM eval has an identical pyramid with different verdict sources. Deterministic Scanners (JailbreakScanner, SecretsScanner, RegexScanner, and five others) live at the base: sub-10ms, free, behave exactly like traditional assertions. Classifier-backed rubrics (13 backends behind one Guardrails class) live in the middle: sub-second, fractional-cent. LLM-as-judge sits at the top for semantic rubrics like Groundedness and TaskCompletion: slower, dollar-figure-per-thousand, reserved for what the cheaper tiers can’t answer. Same pyramid, different judges.

The golden set. A curated, versioned, stratified set of test cases that the suite runs on every PR. QA teams already have this; it’s called the regression suite. The discipline transfers verbatim. What changes is the rot mechanism: traditional regression suites rot when the API surface drifts; LLM regression suites rot when the input distribution drifts (new user intents, jargon shifts, model upgrades). The golden-set design guide covers the refresh cadence: weekly promotion of failing production traces, deprecation of stale examples, and PR-reviewed additions.

The CI gate. Block the merge button on red. Same mechanic, rubric-shaped thresholds. The CI/CD for LLM eval guide has working GitHub Actions YAML. One detail to catch early: put the threshold table in the CI config, not in test bodies. Regulatory teams can then review and approve threshold changes via PR without rewriting Python.

The bug-bash mindset. QA’s instinct to break things is one of the most valuable skills on a 2026 LLM team. Adversarial prompts, edge cases, persona stress tests, and OWASP LLM Top 10 categories all slot into the golden set as test cases. Run a “red-team day” the way you’d run a bug bash. The prompts that produce broken behavior become permanent regression tests. The eight built-in Scanners cover most of the OWASP-style adversarial surface as deterministic checks; the residual semantic adversarial cases go to PromptInjection and Toxicity templates at the classifier or judge tier.

Test ownership. Software QA’s best practice has always been that the team owning the feature owns the tests. LLM eval inherits this with one tweak: the rubric is the test, and the subject-matter expert owns the rubric. A legal assistant should have rubrics owned by the legal team. The QA engineering function owns the harness, the gating discipline, and the variance management; the SME owns the meaning of “good.”

What changes: the assertion goes from boolean to graded

In traditional QA, an assertion is a boolean. Two strings either match or they don’t, a regex either matches or it doesn’t. assertEqual, assertContains, assertMatches. Green checkmark or red X.

LLM eval flips this. The assertion is a score on a 0-to-1 rubric, often across multiple dimensions in the same test case. Faithfulness might score 0.78, completeness 0.92, format compliance 0.95. The test passes when every rubric clears its threshold. That’s three booleans hidden behind three real-valued scores, with three thresholds tuned per route per use case.

The good news for QA engineers is that pytest absorbs this almost transparently. The fixture returns a list of scores, the test asserts each score against a threshold, the CI reporting renders the same. Here’s the canonical pattern:

from fi.evals import Evaluator
from fi.evals.templates import Groundedness, TaskCompletion
import os

EVALUATOR = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)


def test_support_agent_grounded():
    result = EVALUATOR.evaluate(
        eval_templates=[Groundedness(), TaskCompletion()],
        inputs=[{
            "input": "What's the cancellation window?",
            "output": agent_response,
            "context": retrieved_chunks,
        }],
    )
    scores = result.eval_results[0].metrics
    assert scores["groundedness"] >= 0.80, f"groundedness {scores['groundedness']}"
    assert scores["task_completion"] >= 0.85, f"task_completion {scores['task_completion']}"

The fixture is pytest-shape. The assertion is pytest-shape. The failure message is pytest-shape. Only the body of the assertion is new. The pytest harness for LLM evaluation walks the full GitHub Actions wiring with parallel workers and JUnit reports.

Three other primitives re-shape alongside the assertion. They’re each a small re-think.

Determinism. Run the same prompt twice, you get two different responses. Pin temperature to zero and a seed where the provider supports it, then sample N times per test case (5 to 10 is the common range) and aggregate. The Guardrails class in the ai-evaluation SDK ships AggregationStrategy.MAJORITY for exactly this. Three out of five samples fail, it’s a real failure. One out of five, it’s variance noise. The deterministic vs LLM-judge eval guide goes deeper.

Per-route thresholds. A Groundedness rubric scoring 0.82 is a pass for a casual chatbot and a fail for a regulated insurance assistant. The threshold lives per rubric per route, calibrated against a production baseline. Set the floor too low and broken behavior ships; set it too high and every PR fails for noise. The 2026 pattern is a threshold table keyed on route_name:

ROUTE_THRESHOLDS = {
    "casual_chat": {"groundedness": 0.70, "task_completion": 0.75},
    "support_agent": {"groundedness": 0.80, "task_completion": 0.85},
    "regulated_advice": {"groundedness": 0.92, "task_completion": 0.90},
}

Same suite, same rubrics, different thresholds per route. The prompt regression testing guide covers the wiring across multiple routes without forking the suite.

Coverage. Code coverage asks “did the tests touch this line.” LLM eval coverage asks “does the golden set cover the intent classes, personas, locales, and adversarial prompts production actually sees.” A regression suite of 200 happy-path examples covers zero percent of the failure modes that ship bugs. The 2026 coverage report is stratified by intent and persona, plus a freshness number (median age of an example) and a hard-failure rate (what fraction of examples currently fail any rubric). All trend together over time. Same discipline as software coverage, different denominator.

Mapping pytest patterns to LLM eval

Most pytest patterns map directly. Here are the four QA teams reach for most.

Parametrize a test case across rubrics. In pytest you’d write @pytest.mark.parametrize over inputs. The same decorator carries; the rubrics live in the fixture:

import pytest
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, TaskCompletion, ContextAdherence

@pytest.fixture(scope="session")
def evaluator():
    return Evaluator(
        fi_api_key=os.environ["FI_API_KEY"],
        fi_secret_key=os.environ["FI_SECRET_KEY"],
    )

@pytest.mark.parametrize("case", GOLDEN_SET)
def test_support_route(evaluator, case):
    result = evaluator.evaluate(
        eval_templates=[Groundedness(), TaskCompletion(), ContextAdherence()],
        inputs=[case.as_input()],
    )
    metrics = result.eval_results[0].metrics
    for rubric, floor in ROUTE_THRESHOLDS[case.route].items():
        assert metrics[rubric] >= floor, f"{rubric}={metrics[rubric]:.3f} < {floor}"

Smoke tier as deterministic scanners. A QA team’s “smoke” tier maps to the deterministic scanner pyramid base. Run the eight built-in Scanners on every PR for every test case: sub-10ms, no API cost, no variance. They behave exactly like the assertion library QA teams already use:

from fi.evals.guardrails.scanners import (
    JailbreakScanner, CodeInjectionScanner, SecretsScanner,
    MaliciousURLScanner, InvisibleCharScanner, LanguageScanner,
    TopicRestrictionScanner, RegexScanner,
)

SMOKE_SCANNERS = [
    JailbreakScanner(), CodeInjectionScanner(), SecretsScanner(),
    MaliciousURLScanner(), InvisibleCharScanner(),
    LanguageScanner(allowed=["en"]),
    TopicRestrictionScanner(restricted=["medical_advice"]),
    RegexScanner(patterns=[r"\bSSN:\s*\d{3}-\d{2}-\d{4}\b"]),
]

def test_smoke(model_output):
    for scanner in SMOKE_SCANNERS:
        result = scanner.scan(model_output)
        assert not result.triggered, f"{scanner.__class__.__name__}: {result.reason}"

Markers for tiered runs. QA teams already use @pytest.mark.smoke, @pytest.mark.regression, @pytest.mark.e2e. The same markers carry to a three-tier LLM eval pyramid: @pytest.mark.scanner on every PR, @pytest.mark.classifier on nightly regression, @pytest.mark.judge on the high-stakes subset. CI runs each tier on its own schedule, same as the existing convention.

Conftest for shared fixtures. Stand up the Evaluator, the golden set loader, and the threshold table once in conftest.py. Every test file imports the fixture by name. No new tooling, no new CLI. The eval is one more kind of pytest test in a suite you already maintain.

The QA-to-LLM-eval transition mistakes we see most

Five mistakes show up in eval reviews when a QA team onboards its first LLM eval workstream. Each one is small to fix if you catch it before the suite calcifies around it.

Asserting boolean equality on an LLM output. assert response == expected is the QA team’s first instinct and the first thing to throw out. It fails or passes for the wrong reasons. The fix is the rubric-plus-threshold pattern above. Same fixture, different assertion body. The agent-passes-evals-fails-production post covers the failure modes downstream when this gets shipped to production.

Treating variance as flakiness. A test scores 0.85 on one run and 0.82 on the next, and the team retries until it passes. That’s how regression suites quietly turn off. Replace “retry on red” with “sample N times, aggregate.” AggregationStrategy.MAJORITY is the right default for most routes; AggregationStrategy.ALL for high-stakes ones where any single-sample failure should block. Variance is signal, not noise.

One threshold across all routes. A single score >= 0.80 floor is too loose for the regulated route and too tight for the casual one. The threshold belongs in a table keyed by route, calibrated against current production baseline minus a small margin (typically two percentage points). One floor across the suite is the LLM-eval equivalent of running every test in the same environment with the same flags.

Pinning the golden set at launch and never refreshing. The set tests 2023’s traffic forever. The score stays high. Production fails on intents the set never saw. The fix is a refresh pipeline: weekly promotion of failing production traces, deprecation of stale examples, PR review on additions. The LLM data drift detection guide covers the operational pattern.

One judge, one rubric. A team picks a single LLM-judge model and treats it as the eval. The score moves around without a story because one rubric never captures the full quality surface. Compose three or more rubrics via Guardrails, gate on each, and you get diagnostic separation: faithfulness, completeness, format, or refusal calibration each become their own pass/fail signal. Single-rubric eval is the LLM-judge bias mitigation failure mode in miniature.

If your team only fixes three, fix the boolean equality, the variance retry, and the single threshold. The other two are easier to retrofit later.

How Future AGI lowers the friction

The Future AGI eval stack is shaped like pytest because the QA team’s existing harness is the cheapest place to put the new assertion.

Pytest-shape API. Evaluator(...).evaluate(eval_templates=[...], inputs=[...]) returns a list of score objects. You assert scores against thresholds inside pytest fixtures. Same harness, same reporting, same parallel workers. 60+ pre-built EvalTemplate classes (Groundedness, ContextAdherence, FactualAccuracy, AnswerRefusal, TaskCompletion, LLMFunctionCalling, and more) replace the per-team custom-judge work most teams hand-roll.

Compose multiple rubrics like multiple assertions. Guardrails(rail_type=RailType.OUTPUT, aggregation=AggregationStrategy.MAJORITY) composes any combination of templates and decides pass/fail via aggregation strategy. ANY fails on any rubric. ALL requires every rubric to pass. MAJORITY is the right default for variance suppression at N=5.

Eight Scanners as deterministic assertions. JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner. Sub-10ms, no API cost, behave exactly like the assertion library QA teams already use. The right tool for the OWASP-style adversarial surface and for closed-form contracts (schema match, refusal regex, language allow-list).

Thirteen classifier backends as cheap-tier judges. Nine open-weight (Llama Guard 3 8B/1B, Qwen3-Guard 8B/4B/0.6B, Granite Guardian 8B/5B, WildGuard 7B, ShieldGemma 2B) for air-gapped self-hosting, four API (OpenAI Moderation, Azure Content Safety, Turing Flash, Turing Safety) for hosted convenience. All behind one Guardrails class. Sub-second latency, fractional-cent cost: the right tier for high-volume toxicity, PII, language detection where the question is categorical.

Four distributed runners. Celery, Ray, Temporal, Kubernetes. Same shape as parallel pytest workers; same scaling discipline a QA team already runs.

The Future AGI Platform adds two surfaces QA teams ask about repeatedly.

Self-improving evaluators. The Platform auto-retunes thresholds and rubric prompts from production thumbs up / thumbs down feedback. The self-improving AI agent pipeline covers the loop. This is the “self-healing test” QA leads have asked about for a decade, applied to rubric tuning rather than test code.

Error Feed inside the eval stack. HDBSCAN soft-clustering pulls failing production traces, groups them into named clusters, and runs a Sonnet 4.5 Judge that writes an immediate_fix per cluster. The cluster IDs become a new triage dimension. Honest framing: Linear is the only ticket-creating integration in production today; Slack, GitHub, Jira, and PagerDuty are on the roadmap. The clustering signal ships now and feeds into a QA team’s existing triage flow.

A five-step transition for a QA team

If you’re a QA lead handed an LLM workstream on Monday, here’s the order to run.

Pick the rubric mapped to your most bug-prone behavior. For a RAG-style assistant, Groundedness usually catches the biggest class of bugs. For a tool-using agent, TaskCompletion or LLMFunctionCalling. Wire one rubric into one pytest test on one production-pulled example. Get one green light.
Build the golden set from your bug-triage backlog. Pull the last six months of production incidents, support escalations, and known bad behaviors. Each becomes a test case. The backlog is the most under-used eval resource in any QA team’s archive; real incidents are higher-signal than any invented example.
Gate every PR with the eval suite via GitHub Actions. Threshold floor = current production baseline minus two percentage points. Block merge below that floor until either the threshold is lowered with sign-off or the regression is fixed.
Add the cheap fast tier. Wire the eight Scanners on every test case. Add one classifier backend for toxicity / safety. These run sub-second and cost fractions of a cent, so they go on every case without budget concern. Reserve LLM-judge calls for the high-stakes subset.
Close the loop. Turn on Error Feed clustering and self-improving evaluators. Failing production traces cluster into named issues; the clusters promote into the golden set; the rubric prompt auto-retunes against the new examples. The regression suite refreshes itself instead of waiting for a quarterly review.

Two sprints, a working LLM eval practice on the pytest harness the team already runs. Discipline transfers, the assertion library re-shapes, the rest stays put.

Where this fits in the bridge series

This post is the QA-team companion to two adjacent bridges. The classical ML eval bridge covers the transition for data scientists coming from sklearn and MLflow. The product analytics bridge covers it for product teams coming from Mixpanel, Amplitude, and PostHog. Three different starting disciplines, one shared eval stack.

The closing line is the same in all three: bring the discipline that made your last system reliable. Replace only the primitives that actually break in an unbounded output space. For a QA team, the primitive that breaks is the boolean assertion. Everything else carries.

Ready to wire your first rubric into your existing pytest suite? Install ai-evaluation, drop a Groundedness rubric against your last fifty production traces this afternoon, and gate the next PR on the resulting score. Same harness, same CI, new assertion.

Frequently asked questions

What's the one-sentence difference between traditional QA and LLM eval?

Traditional QA is pass/fail against an expected output. LLM eval is a graded score against a rubric, with a per-route threshold gating the merge. The test pyramid still applies. The golden set still applies. The CI gate still applies. What changes is the assertion shape: instead of asserting equality, you assert that a 0-to-1 rubric score clears a calibrated floor. Most QA teams replace the assertion library and keep everything else.

Can a QA team use pytest for LLM evaluation?

Yes, and most should. The pytest harness keeps running the suite, owning fixtures, parametrizing test cases, and reporting to CI. The body of the assertion changes. Instead of assertEqual, you call Evaluator from the ai-evaluation SDK on each test case, read back a 0-to-1 score per rubric, and assert that the score clears a per-rubric threshold. Fixtures, markers, parallel workers, JUnit XML output, GitHub Actions integration all behave the same way. The transition is usually a one-day rewire for a team that already has a pytest regression suite.

What carries from QA to LLM eval, and what gets replaced?

Carries: the test pyramid (smoke / regression / e2e becomes deterministic / classifier / judge), the golden set discipline, regression gating on every PR, the CI status check that blocks merge, the bug-bash mindset that breaks things on purpose, and the principle that the feature team owns the tests. Replaces: the assertion library (boolean equality becomes graded rubric), the determinism assumption (run N samples and aggregate), and the coverage definition (code paths become input distribution). The discipline transfers. The primitives re-shape.

How does a QA team handle non-determinism in LLM tests?

Stop treating variance as a bug. Pin temperature to zero and a seed where the provider supports it, then run N samples per test case (5 to 10 is the common range) and aggregate. The Guardrails class in the ai-evaluation SDK ships AggregationStrategy.MAJORITY for exactly this case. If three out of five samples fail, it's a real failure. If one out of five fails, it's variance noise. A test that scores 0.78 plus or minus 0.04 is doing what a stochastic system should. A test that scores 0.78 plus or minus 0.25 is telling you the system itself is unstable, which is a bug worth fixing upstream.

What's the biggest mistake QA teams make in their first LLM eval workstream?

Reaching for boolean equality. A QA team's instinct is to assert response equals expected, and that assertion fails or passes for the wrong reasons against any LLM output. The fix is to swap the assertion body, not throw out the harness. The second-biggest mistake is treating variance as flakiness and retrying tests until they pass, which silently turns the regression suite off. The third is a single-threshold gate across all routes, which is too loose for high-stakes paths and too tight for casual ones. Each of these is a small re-think, not a discipline change.

How does Future AGI shorten the transition for QA teams?

By treating LLM eval as a pytest-shape API. Evaluator(...).evaluate(eval_templates=[...], inputs=[...]) returns a list of scores. The assertion is a one-line threshold check inside a pytest fixture. 60+ EvalTemplate classes (Groundedness, ContextAdherence, TaskCompletion, LLMFunctionCalling, and more) replace the per-team custom judge a QA team would otherwise have to build. Eight Scanners (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner) run sub-10ms deterministic checks that behave exactly like the old assertion library. Four distributed runners (Celery, Ray, Temporal, Kubernetes) parallelize the suite. The Platform layer adds self-improving evaluators that auto-retune thresholds from production feedback and an Error Feed that clusters production failures via HDBSCAN and feeds them back into the golden set.

Do red-team and bug-bash practices transfer from QA to LLM eval?

Directly. The bug-bash mindset is one of the most valuable skills on a 2026 LLM team. Adversarial prompts, edge-case inputs, persona stress tests, and OWASP LLM Top 10 categories (prompt injection, sensitive information disclosure, insecure output handling, model denial of service) all slot into the golden set as test cases. The bug bash becomes a red-team day. The prompts that produce broken behavior become permanent regression tests. QA's instinct to break things is the closest thing to a free pass on the LLM side of the bridge.

View all

Guides

CI/CD LLM Eval with GitHub Actions: 2026 Workflow

Cheap, fast, statistically significant LLM eval gates in GitHub Actions: classifier cascade, fi CLI exit codes, Welch's t-test, auto-rollback.

Nikhil Pareek · Mar 31, 2026

14 min

Guides

LLM Eval vs Software Testing: The 2026 Bridge for Dev Teams

How engineers should map the test pyramid (unit, integration, e2e) onto LLM eval in 2026: the seven gaps, the analogy, a five-step transition.

NVJK Kartik · Mar 15, 2026

16 min

Guides

Prompt Regression Testing: A Practical 2026 Guide

Prompt regression is pytest for prompts. Three patterns: per-rubric assertion, per-route stratified eval, paired comparison vs prior version with CI delta.

NVJK Kartik · Mar 14, 2026

11 min