Prompt Regression Testing: A Practical 2026 Guide
Prompt regression is pytest for prompts. Three patterns: per-rubric assertion, per-route stratified eval, and paired comparison vs prior version with CI on the delta.
A one-line edit lands in the support agent’s prompt — “Always respond in a warm, conversational tone.” Twelve hours later, refusal rate on legitimate refund requests is up 14 points. The new tone softened the refusal pattern that triggered escalations. Two other paths broke at the same time; the team hasn’t heard about them yet. Rolling back fixes the new regression and reintroduces the old one.
This is the prompt regression problem. Free-text prompts have invisible blast radius, delayed symptoms, and no clean rollback. The only test that catches this is a versioned suite that scores both prompts on the same examples and reports a statistical delta — not a 30-example smoke check, not a single mean against a floor.
The opinion this post earns: prompt regression testing is pytest for prompts. Three patterns hold up. Per-rubric assertion (Groundedness stays above 0.85). Per-route stratified eval (the rubric the prompt actually moved, on the cohort that moved). And paired comparison versus the pinned prior version, with a confidence interval on the delta. Anything else is screenshot-comparing the demo.
This guide is the working playbook: the three patterns, the pytest fixture against ai-evaluation, the CI workflow, version pinning, the paired-delta bootstrap, and the Future AGI vendor surface.
TL;DR: the suite that doesn’t lie
| Step | Decision | Rule |
|---|---|---|
| 1. Golden set | 100-300 paired cases per route | Sampled from production traces; stratified by intent x persona x edge-case. |
| 2. Per-rubric assertion | Floor per route per rubric | Groundedness >= 0.85, AnswerRefusal >= 0.90, citation validity >= 0.99. |
| 3. Per-route stratified | Score the rubric the prompt actually moved | Refusal route gets AnswerRefusal; RAG route gets Groundedness. |
| 4. Paired vs prior version | Per-case delta + bootstrap CI | Ship only when 95% CI doesn’t sit entirely below zero on any rubric. |
| 5. Version pinning | prompt_version_id on every trace and every eval row | The version flows into the CI baseline; baseline is queryable, not frozen. |
| 6. CI integration | pytest + ai-evaluation SDK + fi CLI exit codes | Path-scoped triggers, matrix shard per route, classifier cascade. |
| 7. Closed loop | Error Feed promotes production failures to new cases | The set grows weekly with last week’s misses, not from a sprint. |
Everything below is the math and the wiring.
Why prompt regression is uniquely hard
Three properties separate prompt edits from code edits. Invisible blast radius: a one-line wording change can interact with thousands of input variants; tightening “be concise” can drop the long-form citations compliance relies on. Delayed symptoms: the regression surfaces when a user hits the broken path, hours later on a busy route, days on a long-tail one, by which time other work has merged on top. Unclean rollback: reverting fixes the new regression and reintroduces the old one — the only honest answer is comparing both versions against the same examples at the same time, with a CI on the per-case delta.
Traditional unit tests don’t cover any of this because LLM outputs aren’t exact strings. Rubric-based scoring across a stable golden set is the only test that compresses behavior into a comparable number. The LLM evaluation playbook covers the rubric primitives this guide builds on.
The three patterns that earn their keep
Pattern 1: Per-rubric assertion (the floor)
The simplest pattern, shaped like a pytest assertion: per-rubric score stays above a pinned floor. Groundedness >= 0.85, ContextRelevance >= 0.80, AnswerRefusal >= 0.90, citation validity >= 0.99 on compliance routes.
The floor catches the catastrophic drop. A version that pushes Groundedness from 0.91 to 0.62 fails the floor. One that pushes it from 0.91 to 0.88 does not — which is why the floor alone isn’t enough.
Floors are per route, not global. A medical assistant’s IsHarmfulAdvice floor is 1.0. A summarizer’s Completeness floor might be 0.70 because the rubric is noisier. Set the floor at roughly the lower bound of the rubric’s observed range over the last month of stable production traffic.
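One way to turn "the lower bound of the observed range" into a number is to take a low percentile of the route's daily mean scores over the stable window and round it down. The sketch below is only that, a sketch: `suggest_floor`, the 5th-percentile default, and the 0.01 rounding step are illustrative assumptions, not part of the ai-evaluation SDK.

```python
import math

def suggest_floor(daily_means, pct=5.0, step=0.01):
    """Suggest a floor from a low percentile of daily mean scores, rounded down.

    daily_means: one mean rubric score per day of stable production traffic.
    pct and step are assumed defaults; tune them per route and rubric.
    """
    if not daily_means:
        raise ValueError("need at least one day of scores")
    xs = sorted(daily_means)
    # Linearly interpolated percentile between the two nearest ranks.
    rank = (len(xs) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    value = xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)
    # Round down to the step so the floor never sits above the observed bound.
    # The small epsilon guards against float error on exact multiples of step.
    return round(math.floor(value / step + 1e-9) * step, 6)
```

A noisier rubric produces a wider spread of daily means and therefore a lower suggested floor, which matches the Completeness example above.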
Pattern 2: Per-route stratified eval
A monolithic suite scoring every rubric against every case wastes signal and money. The rubric that catches a regression is the one the prompt moved. A refund-routing edit moves AnswerRefusal and TaskCompletion; it doesn’t move Groundedness because the route doesn’t run RAG.
Stratify the golden set by route and tag each case with intent x persona x edge-case. The pytest run reads the affected routes from the PR diff (a small affected_routes.py mapping changed paths to route IDs) and runs only the matching shards. A 200-case per-route suite clears under 3 minutes; a 1,600-case monorepo sweep doesn’t.
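A minimal sketch of what the path-to-route mapping might look like. The prefixes in `ROUTE_PREFIXES` and the helper names are hypothetical; the only contract taken from this post is that the output is a JSON list of route IDs, with `[]` meaning no route was touched.

```python
import json

# Hypothetical mapping from path prefixes to route IDs; adjust to the repo layout.
ROUTE_PREFIXES = {
    "prompts/support_agent/": ["support-rag"],
    "prompts/legal/": ["legal-rag"],
    "prompts/sales/": ["sales-agent"],
    # Shared agent code can move every route, so it maps to all of them.
    "src/agent/": ["support-rag", "legal-rag", "sales-agent"],
}

def routes_for(changed_paths, prefixes=ROUTE_PREFIXES):
    """Return the sorted, de-duplicated route IDs touched by the changed paths."""
    hit = set()
    for path in changed_paths:
        for prefix, routes in prefixes.items():
            if path.startswith(prefix):
                hit.update(routes)
        # A golden-set edit affects the route named by the file,
        # e.g. evals/golden/legal-rag.jsonl -> legal-rag.
        if path.startswith("evals/golden/") and path.endswith(".jsonl"):
            hit.add(path[len("evals/golden/"):-len(".jsonl")])
    return sorted(hit)

def as_matrix_json(changed_paths):
    """Serialize the affected routes as a JSON list for a CI matrix."""
    return json.dumps(routes_for(changed_paths))
```

In a real script the list of changed paths would come from something like `git diff --name-only` against the PR base; that part is left out here so the mapping logic stays testable on its own.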
Within a route, the (intent x persona) cell view surfaces “candidate better on first-time users, worse on power users” or “gained English, lost Spanish.” A net delta near zero can hide a 30/30/40 win/lose/flat split that’s actually a behavioral rewrite.
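The cell view can be sketched as a group-by over per-case deltas. This assumes each golden case carries `intent` and `persona` tags and that `cand` and `base` are score lists aligned with `cases`; the `eps` band that separates "flat" from a win or loss is an illustrative choice, not a recommended value.

```python
from collections import defaultdict

def cell_view(cases, cand, base, eps=0.02):
    """Group per-case deltas by (intent, persona) and count win/lose/flat per cell."""
    cells = defaultdict(lambda: {"n": 0, "win": 0, "lose": 0, "flat": 0, "sum": 0.0})
    for case, c, b in zip(cases, cand, base):
        d = c - b
        cell = cells[(case["intent"], case["persona"])]
        cell["n"] += 1
        cell["sum"] += d
        # Deltas inside the +/- eps band count as flat rather than as a win or loss.
        outcome = "win" if d > eps else "lose" if d < -eps else "flat"
        cell[outcome] += 1
    # Attach the mean delta so each cell shows both direction and split.
    return {k: {**v, "mean_delta": v["sum"] / v["n"]} for k, v in cells.items()}
```

Reading the win/lose/flat counts next to the mean delta is what exposes a cell whose net change is near zero but whose individual cases moved a lot.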
Pattern 3: Paired comparison vs the prior version
The pattern most regression suites skip. The baseline isn’t a number frozen at launch; it’s the pinned prior version’s score vector on the same examples. Run both versions, take per-case deltas, bootstrap a 95 percent CI on the delta vector. (The implementation lives in the pytest fixture below.)
Three rules on the paired CI. If the 95 percent CI sits entirely below zero on any rubric, the regression is real. If the CI straddles zero, the suite cannot distinguish the change from noise; ship if the floor still holds. If the CI sits entirely above zero, the candidate is a directional improvement and the lower bound tells you the smallest credible lift.
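The three rules reduce to a small decision function. The names `classify_delta_ci` and `should_block` are illustrative, and the sketch treats a CI that touches zero exactly as inconclusive, which is one reasonable reading of "entirely below zero".

```python
def classify_delta_ci(lo, hi):
    """Map a confidence interval on the paired delta to one of three verdicts."""
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    if hi < 0:
        return "regressed"     # whole interval below zero
    if lo > 0:
        return "improved"      # whole interval above zero; lo is the smallest credible lift
    return "inconclusive"      # interval contains zero

def should_block(lo, hi, candidate_mean, floor):
    """Block on a confirmed regression, or when the candidate mean breaks the floor."""
    return classify_delta_ci(lo, hi) == "regressed" or candidate_mean < floor
```

For example, `should_block(-0.01, 0.02, 0.88, 0.85)` is `False`: the interval straddles zero and the floor holds, so the change ships.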
The paired design pays for itself by killing between-example variance. Some inputs are just harder; an independent test lets that variance dominate the delta. Pairing analyzes the per-example difference, so between-example variance cancels and the CI tightens, often several-fold when the two versions score the same cases similarly. The A/B testing playbook covers the matched-pair math; the regression suite is the same machinery in a CI gate. Bootstrap is the right tool because rubric scores cluster (Groundedness near 1.0, refusal bimodal), and a t-test's normality assumption is hard to defend on those shapes at 100-300 cases.
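How much pairing tightens the interval depends on how correlated the two versions' scores are on the same case. A short worked identity, using symbols introduced here for illustration ($c_i$ and $b_i$ are the candidate and baseline scores on case $i$, $\sigma_c$ and $\sigma_b$ their standard deviations, $\rho$ their correlation, $n$ the number of cases):

```latex
\begin{aligned}
d_i &= c_i - b_i, \qquad \bar d = \frac{1}{n}\sum_{i=1}^{n} d_i \\
\operatorname{Var}(\bar d)_{\text{paired}} &= \frac{\sigma_c^2 + \sigma_b^2 - 2\rho\,\sigma_c\sigma_b}{n} \\
\operatorname{Var}(\bar c - \bar b)_{\text{independent}} &= \frac{\sigma_c^2 + \sigma_b^2}{n} \\
\text{with } \sigma_c = \sigma_b:\quad
\frac{\text{paired CI width}}{\text{independent CI width}} &\approx \sqrt{1 - \rho}
\end{aligned}
```

At $\rho = 0.9$ the paired interval is about 3 times narrower ($\sqrt{0.1} \approx 0.32$); it takes $\rho \approx 0.99$ to reach a full tenfold reduction. Minor prompt edits, which leave most outputs nearly unchanged, tend to sit at the high-correlation end.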
The promote-or-block rule: three triggers, any one blocks
- Floor. Any rubric’s per-route mean drops below the pinned floor.
- Paired CI. The bootstrap CI on the per-case delta sits entirely below zero on any rubric.
- Safety flip. Any safety rubric (PromptInjection, DataPrivacyCompliance, IsHarmfulAdvice, JailbreakScanner) flips a case from pass to fail.
Floor catches the catastrophic. Paired CI catches the drift. Safety flip catches the jailbreak the new prompt opened — even one case is non-negotiable on safety rubrics. The three cover what actually goes wrong; the rest is plumbing.
The pytest fixture: paired regression against the pinned baseline
# tests/test_prompt_regression.py
import json, os, statistics
from pathlib import Path

import numpy as np
import pytest
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextRelevance, Completeness,
    AnswerRefusal, PromptInjection, DataPrivacyCompliance,
)
from fi.testcases import TestCase

GOLDEN = Path("evals/golden")
BASELINES = Path("evals/baselines")  # pinned per-version score arrays in git
FLOORS = {
    "support-rag": {"Groundedness": 0.85, "AnswerRefusal": 0.90},
    "legal-rag": {"Groundedness": 0.88, "DataPrivacyCompliance": 0.99},
    "sales-agent": {"Completeness": 0.75},
}
SAFETY = {"PromptInjection", "DataPrivacyCompliance"}

evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
    max_workers=16,
)

def paired_delta_ci(candidate, baseline, n_boot=10_000, alpha=0.05):
    """Mean per-case delta plus a percentile-bootstrap CI on that mean."""
    rng = np.random.default_rng(42)  # fixed seed so reruns of the gate agree
    d = np.asarray(candidate, dtype=float) - np.asarray(baseline, dtype=float)
    # Resample the per-case deltas with replacement; keep each resample's mean.
    boot = np.array([rng.choice(d, len(d), replace=True).mean() for _ in range(n_boot)])
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(d.mean()), float(lo), float(hi)

# --routes is a custom option; it must be registered via pytest_addoption in conftest.py.
@pytest.mark.parametrize("route", ["support-rag", "legal-rag", "sales-agent"])
def test_prompt_regression(route, request):
    if route not in request.config.getoption("--routes").split(","):
        pytest.skip(f"{route} not affected by this PR")

    # One JSON object per line; blank lines are skipped.
    lines = (GOLDEN / f"{route}.jsonl").read_text().splitlines()
    cases = [json.loads(line) for line in lines if line.strip()]
    templates = [Groundedness(), ContextRelevance(), Completeness(),
                 AnswerRefusal(), PromptInjection(), DataPrivacyCompliance()]
    candidate = evaluator.evaluate(
        eval_templates=templates,
        inputs=[TestCase(**c, prompt_version=os.environ["CANDIDATE_VERSION"]) for c in cases],
    )
    baseline = json.loads((BASELINES / f"{route}.json").read_text())

    failures = []
    for rubric in candidate.aggregate_by_template():
        cand, base = candidate.scores(rubric), baseline[rubric]

        # Trigger 1: floor (compare against None so a floor of 0 is not ignored)
        floor = FLOORS.get(route, {}).get(rubric)
        if floor is not None and statistics.mean(cand) < floor:
            failures.append(f"{route}.{rubric}: mean below floor {floor}")

        # Trigger 2: paired-delta CI entirely below zero
        _, lo, hi = paired_delta_ci(cand, base)
        if hi < 0:
            failures.append(f"{route}.{rubric}: paired CI [{lo:.3f}, {hi:.3f}] regressed")

        # Trigger 3: safety pass-to-fail flip
        if rubric in SAFETY:
            flipped = sum(1 for c, b in zip(cand, base) if b >= 0.5 and c < 0.5)
            if flipped:
                failures.append(f"{route}.{rubric}: {flipped} cases flipped pass->fail")

    assert not failures, "\n".join(failures)
Four design calls earn their keep. The baseline is pinned per-version JSON in git — the diff is reviewable when it updates. Per-route parametrize plus a --routes flag means skipped routes don’t burn judge tokens. Three triggers per rubric, scored independently. max_workers=16 saturates the judge provider’s rate limit before the SDK’s distributed runners (Celery, Ray, Temporal, Kubernetes) need to take over.
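The fixture reads `--routes` through `request.config.getoption`, which only resolves if the option is registered with pytest. A minimal sketch of that registration, assuming it lives in `tests/conftest.py`; the default of "all three routes" is an assumption made here so a local run without the flag evaluates everything.

```python
# tests/conftest.py
def pytest_addoption(parser):
    """Register --routes so request.config.getoption("--routes") resolves."""
    parser.addoption(
        "--routes",
        action="store",
        # Assumed default: run every route when the flag is omitted.
        default="support-rag,legal-rag,sales-agent",
        help="Comma-separated route IDs to evaluate; others are skipped.",
    )
```

Running `pytest tests/test_prompt_regression.py --routes=legal-rag` then skips the other two parametrized cases before any judge call is made.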
Version pinning: the baseline isn’t a frozen number
A regression suite needs to know which prior version it’s pairing against. Prompt templates are versioned objects; every trace and every eval row carries prompt_version_id; the baseline JSON in git regenerates on every promote-to-main.
# prompts/support_agent/v24.yaml
version: 24
parent: 23
template: |
  You are an empathetic support agent for {brand}.
  Always cite the policy section number when refusing a refund.
  ...
variables: [brand, user_tier]
owners: [support-eng@company.com]
last_validated_against: evals/baselines/support-rag.json@sha:a3f1b9
The version flows into the CI baseline three ways. Promote-to-main writes the merged version’s per-case scores to evals/baselines/<route>.json keyed by prompt_version_id. The PR pytest reads that baseline and pairs against it. And traceAI’s prompt_version_id span attribute lets production scoring tie back to the same version, so the rolling production baseline stays in sync with the CI baseline.
A baseline frozen at launch decays — the judge model updates, the dataset grows, the production distribution shifts, and the gate either fires false alarms or stops firing. Regenerate on every merge to main; overlay a rolling 7-day production observation to catch drift the merge-time baseline misses. The prompt versioning post covers the versioning patterns in more depth.
CI integration: GitHub Actions, path scope, classifier cascade
# .github/workflows/prompt-regression.yml
name: prompt-regression
on:
  pull_request:
    paths: ["prompts/**", "evals/golden/**", "src/agent/**"]
concurrency:
  group: prompt-regression-${{ github.head_ref }}
  cancel-in-progress: true
jobs:
  detect-routes:
    runs-on: ubuntu-latest
    # Block style here: ${{ ... }} cannot sit unquoted inside a YAML flow mapping.
    outputs:
      routes: ${{ steps.aff.outputs.routes }}
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 2 }
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - id: aff
        run: echo "routes=$(python evals/affected_routes.py)" >> "$GITHUB_OUTPUT"
  regression:
    needs: detect-routes
    if: needs.detect-routes.outputs.routes != '[]'
    runs-on: ubuntu-latest
    timeout-minutes: 8
    strategy:
      fail-fast: false
      matrix:
        route: ${{ fromJson(needs.detect-routes.outputs.routes) }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: pip }
      # pytest is installed explicitly because layer 2 invokes it.
      - run: pip install ai-evaluation scipy numpy pytest
      # Layer 1: classifier cascade (NLI rubrics on every case, frontier on disagreement)
      - run: fi run --check --strict --parallel 16 -c evals/fi-evaluation.yaml --filter route=${{ matrix.route }}
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
      # Layer 2: paired-delta CI gate
      - run: pytest tests/test_prompt_regression.py --routes=${{ matrix.route }}
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
          CANDIDATE_VERSION: ${{ github.event.pull_request.head.sha }}
Four habits earn their keep. Path-scoped triggers so a CSS-only PR doesn’t burn judge tokens. concurrency.cancel-in-progress: true so three rapid pushes don’t fan out into three concurrent suites. Matrix sharding by route so the monorepo scales with the change surface, not with the route count. Classifier cascade in layer 1 (NLI rubrics first, frontier judge on disagreement) drops the per-PR judge bill by roughly an order of magnitude before layer 2’s paired-delta gate runs.
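The cascade idea can be sketched independently of any SDK. The version below makes two assumptions that the post does not spell out: "disagreement" means two or more cheap scorers landing on different sides of a pass threshold, and the cheap scorers and the frontier judge are plain callables that return a score in [0, 1]. All names are illustrative.

```python
def cascade_score(case, cheap_scorers, frontier_judge, threshold=0.5):
    """Return (score, used_frontier); escalate only when cheap verdicts disagree."""
    scores = [scorer(case) for scorer in cheap_scorers]
    # Collapse each cheap score to a pass/fail verdict at the threshold.
    verdicts = {score >= threshold for score in scores}
    if len(verdicts) == 1:
        # All cheap scorers agree: accept their average, no judge call.
        return sum(scores) / len(scores), False
    # Mixed verdicts: pay for the frontier judge on this case only.
    return frontier_judge(case), True
```

The judge bill then scales with the disagreement rate rather than with the size of the golden set, which is where the savings come from.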
The CI/CD LLM eval workflow covers the fi CLI exit-code partition (0/2/3/6/7) and nightly drift detector that pair with this gate.
Closed loop: production failures grow the regression set
A regression set frozen at launch decays. The cases that caught regressions a quarter ago aren’t the failure modes users hit now.
The pattern: production traces score with the same rubrics via traceAI’s EvalTag (the score writes as a span attribute next to prompt_version_id). Failing traces fall into an Error Feed queue. Error Feed soft-clusters with HDBSCAN over span embeddings — failures group into named clusters like “agent over-promises refund timeline” or “persona drifts to formal on Spanish input.” A Sonnet 4.5 Judge agent writes the RCA, evidence, an immediate_fix, and a 4-dim score. The immediate_fix becomes the spec for a new regression case (input, expected behavior, which rubric catches it). On reviewer approval the case joins the golden set.
The regression set grows weekly with last week’s misses, not from a sprint imagining edge cases at a whiteboard.
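The last step of the loop, approved failure to golden-set row, can be sketched as a pure function over JSONL lines. The field names (`expected_behavior`, `source_cluster`) and the shape of the `failure` dict are hypothetical; the de-duplication by normalized input is an added safeguard so the same production miss does not enter the set twice.

```python
import hashlib
import json

def case_key(text):
    """Stable de-duplication key: hash of the lowercased, whitespace-normalized input."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

def promote_case(golden_lines, failure, approved):
    """Return golden_lines plus one new JSONL row, or a copy if unapproved or duplicate."""
    if not approved:
        return list(golden_lines)
    existing = {case_key(json.loads(line)["input"]) for line in golden_lines if line.strip()}
    if case_key(failure["input"]) in existing:
        return list(golden_lines)
    row = {
        "input": failure["input"],
        "expected_behavior": failure["immediate_fix"],  # the fix becomes the case spec
        "rubric": failure["rubric"],                    # which rubric should catch it
        "source_cluster": failure["cluster"],           # provenance for later audits
    }
    return list(golden_lines) + [json.dumps(row, sort_keys=True)]
```

Keeping the function pure means the reviewer-approval step and the file write can live in whatever tooling the team already uses.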
How Future AGI ships prompt regression testing
Future AGI ships the eval stack as a package. The pieces compose; pick the ones you need.
- ai-evaluation SDK (Apache 2.0). 60+ EvalTemplate classes (Groundedness, ContextRelevance, Completeness, AnswerRefusal, PromptInjection, DataPrivacyCompliance, IsHarmfulAdvice, CustomLLMJudge). 8 sub-10ms Scanner classes for the deterministic base. NLI-backed local rubrics (faithfulness, claim_support, rag_faithfulness, factual_consistency) for the classifier-cascade tier. Four distributed runners (Celery, Ray, Temporal, Kubernetes).
- fi CLI. fi run --check --strict --parallel 16 with assertions on pass_rate, avg_score, p50/p90/p95_score. CI-distinct exit codes (0/2/3/6/7) as a hard contract for any CI runner; log lines reformat between SDK versions, exit codes don’t.
- traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, C#. EvalTag attaches the same rubric to live OTel spans; the score writes as a span attribute next to prompt_version_id. Same rubric in CI and in production means the numbers are comparable end to end.
- Error Feed. HDBSCAN soft-clustering over span embeddings plus the Sonnet 4.5 Judge agent that writes the immediate_fix. Linear integration ships today; Slack, GitHub, Jira, PagerDuty on the roadmap.
- Future AGI Platform. Self-improving evaluators tuned by thumbs feedback; classifier-backed evals at lower per-eval cost than Galileo Luna-2, which is what makes 200-case paired suites on every PR affordable.
- agent-opt (Apache 2.0). Six optimizers (RandomSearch, BayesianSearch, MetaPrompt, ProTeGi, GEPA, PromptWizard). The pattern: regression suite gates every change; the optimizer proposes candidates against the regression set; the suite decides whether the candidate promotes. See automated prompt improvement for the optimizer choice rubric.
- Agent Command Center. OpenAI-compatible gateway in a single Go binary. 100+ providers, 18+ built-in guardrail scanners, cohort-stable hashing for canary, per-rubric eval-gated rollback at the gateway hop, so the regression suite has somewhere to land on production traffic.
Ready to wire a regression gate? pip install ai-evaluation, fi init --template prompt-regression, point the golden set at your stratified per-route JSONL, set FI_API_KEY and FI_SECRET_KEY in CI secrets, add the pytest workflow above. The first PR that ships against this gate is the first PR whose author knows what it broke.
Anti-patterns the regression suite should avoid
- Pass-rate as the gate. “Suite passes at 87 percent” collapses per-case signal into a number that can’t distinguish a 14-point refusal collapse from judge noise. Track pass rate as a health check, never as the gate.
- Floor without paired CI. Slow drift slips under the floor for months. The paired-delta CI catches the regression the floor misses.
- Frozen baseline. A 2024 baseline scoring a 2026 prompt is a benchmark, not a regression suite. Regenerate on every merge to main.
- No version pinning. Without prompt_version_id on the trace and the eval row, “the baseline” is a hand-wave. The version is the join key between CI and production scoring.
- 30-case smoke set as the gate. Variance wider than the regressions you’re catching. Grow to 100-300 paired cases per route, or the gate raises false alarms half the time.
- LLM-as-judge on every case on every PR. $9 per PR at month two, quietly disabled at month three. Classifier cascade first; the frontier judge only on disagreement.
- No canary after the gate. A passing suite doesn’t cover the full input distribution. Mirror or shadow routing through the gateway catches the candidate-only failures the golden set didn’t anticipate.
What to do this week
- Pull 200 cases per route from production logs into a versioned JSONL golden set. Stratify by intent x persona x edge-case.
- Wire the three triggers (floor, paired CI, safety flip) into the pytest fixture above. Run it twice against the current prompt to record the noise floor.
- Pin prompt versions in YAML under prompts/. Write the promote-to-main job that regenerates evals/baselines/<route>.json keyed by prompt_version_id.
- Add the GitHub Actions workflow with path-scoped triggers, matrix shard per route, classifier cascade, paired-delta gate.
- Stand up Error Feed clustering on production traces. Queue the cluster-derived candidate cases for next month’s set growth.
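For the "run it twice" step, a small helper can summarize how far two runs of the same prompt drift apart on the same cases. This is a sketch under the assumption that both runs return per-case scores in the same order; `noise_floor` and its output keys are illustrative names.

```python
import statistics

def noise_floor(run_a, run_b):
    """Summarize run-to-run judge noise from two runs of the same prompt version."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and aligned case by case")
    diffs = [abs(a - b) for a, b in zip(run_a, run_b)]
    return {
        "mean_abs_delta": statistics.mean(diffs),  # typical per-case wobble
        "max_abs_delta": max(diffs),               # worst single-case disagreement
        "mean_shift": statistics.mean(run_a) - statistics.mean(run_b),  # drift in the route mean
    }
```

A candidate whose paired delta is smaller than the recorded `mean_shift` is moving less than the judge does on its own, which is useful context when reading a CI that straddles zero.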
The next prompt edit ships knowing what it broke — not from a Slack ping six days later.
Related reading
- A/B Testing LLM Prompts: The Statistical Playbook (2026)
- CI/CD LLM Eval with GitHub Actions (2026)
- What is Prompt Versioning?
- Best Prompt Testing Frameworks (2026)
- Automated Prompt Improvement (2026)
- The 2026 LLM Evaluation Playbook
- LLM as Judge Best Practices in 2026
- Best AI Prompt Management Tools (2026)