
Prompt Regression Testing: A Practical 2026 Guide

Prompt regression is pytest for prompts. Three patterns: per-rubric assertion, per-route stratified eval, and paired comparison vs the prior version with a confidence interval on the delta.


A one-line edit lands in the support agent’s prompt — “Always respond in a warm, conversational tone.” Twelve hours later, refusal rate on legitimate refund requests is up 14 points. The new tone softened the refusal pattern that triggered escalations. Two other paths broke at the same time; the team hasn’t heard about them yet. Rolling back fixes the new regression and reintroduces the old one.

This is the prompt regression problem. Free-text prompts have invisible blast radius, delayed symptoms, and no clean rollback. The only test that catches this is a versioned suite that scores both prompts on the same examples and reports a statistical delta — not a 30-example smoke check, not a single mean against a floor.

The opinion this post earns: prompt regression testing is pytest for prompts. Three patterns hold up. Per-rubric assertion (Groundedness stays above 0.85). Per-route stratified eval (the rubric the prompt actually moved, on the cohort that moved). And paired comparison versus the pinned prior version, with a confidence interval on the delta. Anything else is screenshot-comparing the demo.

This guide is the working playbook: the three patterns, the pytest fixture built on ai-evaluation, the CI workflow, version pinning, the paired-delta bootstrap, and the Future AGI vendor surface.

TL;DR: the suite that doesn’t lie

Step | Decision | Rule
1. Golden set | 100-300 paired cases per route | Sampled from production traces; stratified by intent x persona x edge-case.
2. Per-rubric assertion | Floor per route per rubric | Groundedness >= 0.85, AnswerRefusal >= 0.90, citation validity >= 0.99.
3. Per-route stratified | Score the rubric the prompt actually moved | Refusal route gets AnswerRefusal; RAG route gets Groundedness.
4. Paired vs prior version | Per-case delta + bootstrap CI | Ship only when 95% CI doesn’t sit entirely below zero on any rubric.
5. Version pinning | prompt_version_id on every trace and every eval row | The version flows into the CI baseline; baseline is queryable, not frozen.
6. CI integration | pytest + ai-evaluation SDK + fi CLI exit codes | Path-scoped triggers, matrix shard per route, classifier cascade.
7. Closed loop | Error Feed promotes production failures to new cases | The set grows weekly with last week’s misses, not from a sprint.

Everything below is the math and the wiring.

Why prompt regression is uniquely hard

Three properties separate prompt edits from code edits. Invisible blast radius: a one-line wording change can interact with thousands of input variants; tightening “be concise” can drop the long-form citations compliance relies on. Delayed symptoms: the regression surfaces only when a user hits the broken path, hours later on a busy route and days later on a long-tail one, by which time other work has merged on top. Unclean rollback: reverting fixes the new regression and reintroduces the old one, so the only honest answer is comparing both versions against the same examples at the same time, with a confidence interval on the per-case delta.

Traditional unit tests don’t cover any of this because LLM outputs aren’t exact strings. Rubric-based scoring across a stable golden set is the only test that compresses behavior into a comparable number. The LLM evaluation playbook covers the rubric primitives this guide builds on.

The three patterns that earn their keep

Pattern 1: Per-rubric assertion (the floor)

The simplest pattern, shaped like a pytest assertion: per-rubric score stays above a pinned floor. Groundedness >= 0.85, ContextRelevance >= 0.80, AnswerRefusal >= 0.90, citation validity >= 0.99 on compliance routes.

The floor catches the catastrophic drop. A version that pushes Groundedness from 0.91 to 0.62 fails the floor. One that pushes it from 0.91 to 0.88 does not — which is why the floor alone isn’t enough.

Floors are per route, not global. A medical assistant’s IsHarmfulAdvice floor is 1.0. A summarizer’s Completeness floor might be 0.70 because the rubric is noisier. Set the floor at roughly the lower bound of the rubric’s observed range over the last month of stable production traffic.
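One way to turn "the lower bound of the observed range" into a number, as a minimal sketch: take the daily per-route means for a rubric over the stable window and use a low percentile as the floor. The function name, the 5th-percentile default, and the sample history are assumptions for illustration; none of this is part of the ai-evaluation SDK.

```python
import math

def suggest_floor(daily_means, percentile=5.0, decimals=2):
    """Suggest a rubric floor from daily mean scores over a stable window.

    A low percentile is used instead of the strict minimum so that one
    bad day does not drag the floor down. The result is rounded down,
    so rounding never raises the floor.
    """
    if not daily_means:
        raise ValueError("need at least one daily mean")
    ordered = sorted(daily_means)
    # Linear-interpolated percentile over the sorted values.
    rank = (len(ordered) - 1) * percentile / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    value = ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)
    scale = 10 ** decimals
    # Small epsilon guards against float artifacts such as 0.29 * 100 = 28.999...
    return math.floor(value * scale + 1e-9) / scale

# Hypothetical history: ten daily Groundedness means for one route.
history = [0.91, 0.90, 0.89, 0.92, 0.88, 0.90, 0.91, 0.93, 0.89, 0.90]
floor = suggest_floor(history)
```

Re-run the calculation whenever the judge model or the golden set changes, since both shift the observed range.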

Pattern 2: Per-route stratified eval

A monolithic suite scoring every rubric against every case wastes signal and money. The rubric that catches a regression is the one the prompt moved. A refund-routing edit moves AnswerRefusal and TaskCompletion; it doesn’t move Groundedness because the route doesn’t run RAG.

Stratify the golden set by route and tag each case with intent x persona x edge-case. The pytest run reads the affected routes from the PR diff (a small affected_routes.py mapping changed paths to route IDs) and runs only the matching shards. A 200-case per-route suite clears in roughly 2-4 minutes; a 1,600-case monorepo sweep doesn’t.
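A minimal sketch of what the path-to-route mapping might look like. The glob patterns, the directory names other than prompts/support_agent, and the rule that shared agent code selects every route are all assumptions; the only constraint taken from the workflow is that the output is a JSON list, with "[]" meaning nothing to run.

```python
import json
from fnmatch import fnmatchcase

# Hypothetical mapping from path globs to route IDs; adjust to the repo layout.
ROUTE_PATTERNS = {
    "support-rag": ["prompts/support_agent/*", "evals/golden/support-rag.jsonl"],
    "legal-rag":   ["prompts/legal_agent/*", "evals/golden/legal-rag.jsonl"],
    "sales-agent": ["prompts/sales_agent/*", "evals/golden/sales-agent.jsonl"],
}
# Assumption: shared agent code can affect any route, so it selects all of them.
SHARED_PATTERNS = ["src/agent/*"]

def affected_routes(changed_paths):
    """Return the sorted route IDs touched by a list of changed file paths."""
    if any(fnmatchcase(p, pat) for p in changed_paths for pat in SHARED_PATTERNS):
        return sorted(ROUTE_PATTERNS)
    hit = {
        route
        for route, patterns in ROUTE_PATTERNS.items()
        for p in changed_paths
        if any(fnmatchcase(p, pat) for pat in patterns)
    }
    return sorted(hit)

def as_matrix_json(changed_paths):
    """Serialize the routes as a compact JSON list for a workflow matrix."""
    return json.dumps(affected_routes(changed_paths), separators=(",", ":"))
```

In CI the changed paths would typically come from `git diff --name-only` against the PR base, and the script would print `as_matrix_json(...)` on stdout.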

Within a route, the (intent x persona) cell view surfaces “candidate better on first-time users, worse on power users” or “gained English, lost Spanish.” A net delta near zero can hide a 30/30/40 win/lose/flat split that’s actually a behavioral rewrite.
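The cell view can be sketched as a small grouping function. The case schema (intent and persona keys) and the 0.02 tolerance that separates "flat" from a win or loss are assumptions for illustration.

```python
from collections import defaultdict

def cell_breakdown(cases, candidate, baseline, eps=0.02):
    """Group per-case deltas by (intent, persona) and count win/lose/flat.

    cases:     list of dicts with "intent" and "persona" keys (assumed schema)
    candidate: per-case scores for the new prompt, same order as cases
    baseline:  per-case scores for the prior prompt, same order as cases
    eps:       deltas within +/- eps count as flat (assumed tolerance)
    """
    if not (len(cases) == len(candidate) == len(baseline)):
        raise ValueError("cases and score vectors must align one-to-one")
    cells = defaultdict(lambda: {"win": 0, "lose": 0, "flat": 0, "sum": 0.0, "n": 0})
    for case, cand, base in zip(cases, candidate, baseline):
        delta = cand - base
        cell = cells[(case["intent"], case["persona"])]
        cell["win" if delta > eps else "lose" if delta < -eps else "flat"] += 1
        cell["sum"] += delta
        cell["n"] += 1
    return {
        key: {
            "win": s["win"], "lose": s["lose"], "flat": s["flat"],
            "n": s["n"], "mean_delta": s["sum"] / s["n"],
        }
        for key, s in cells.items()
    }
```

A route whose overall mean delta is near zero can still show one cell with all wins and another with all losses, which is the behavioral rewrite the net number hides.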

Pattern 3: Paired comparison vs the prior version

The pattern most regression suites skip. The baseline isn’t a number frozen at launch; it’s the pinned prior version’s score vector on the same examples. Run both versions, take per-case deltas, bootstrap a 95 percent CI on the delta vector. (The implementation lives in the pytest fixture below.)

Three rules on the paired CI. If the 95 percent CI sits entirely below zero on any rubric, the regression is real. If the CI straddles zero, the change is indistinguishable from noise at this sample size: ship if the floor still holds. If the CI sits entirely above zero, the candidate is a directional improvement and the lower bound tells you the smallest credible lift.
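The three rules reduce to a small decision function. This is a sketch; the verdict labels are illustrative names, not SDK values.

```python
def interpret_delta_ci(lo, hi):
    """Map a CI on the mean per-case delta (candidate - baseline) to a verdict."""
    if lo > hi:
        raise ValueError("lower bound must not exceed upper bound")
    if hi < 0:
        return "regressed"      # whole interval below zero
    if lo > 0:
        return "improved"       # whole interval above zero; lo is the smallest credible lift
    return "inconclusive"       # interval contains zero; fall back to the floor check
```

Only "regressed" blocks on its own; "inconclusive" defers to the floor and safety triggers.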

The paired design pays for itself by killing between-example variance. Some inputs are just harder; an independent test lets that variance dominate the delta. Pairing analyzes the per-example difference, so between-example variance cancels and the CI tightens, often severalfold when the two versions’ per-case scores are strongly correlated. The A/B testing playbook covers the matched-pair math; the regression suite is the same machinery in a continuous-integration gate. Bootstrap is the right tool because rubric scores cluster (Groundedness near 1.0, refusal bimodal), and the normality assumption behind a parametric t-test is shaky on those shapes.
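A small simulation makes the variance argument concrete. The data is synthetic and every number in it is an assumption: 200 cases, a shared per-example difficulty term, and a candidate that scores 0.02 lower on average. The point is only to compare the width of a paired bootstrap interval with an unpaired one on the same scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic scores: difficulty is shared by both versions of each case.
difficulty = rng.normal(0.80, 0.12, n)
baseline = np.clip(difficulty + rng.normal(0, 0.02, n), 0, 1)
candidate = np.clip(difficulty - 0.02 + rng.normal(0, 0.02, n), 0, 1)

def paired_ci(cand, base, n_boot=5_000):
    """Resample cases and keep each (candidate, baseline) pair together."""
    d = cand - base
    idx = rng.integers(0, len(d), size=(n_boot, len(d)))
    return np.percentile(d[idx].mean(axis=1), [2.5, 97.5])

def unpaired_ci(cand, base, n_boot=5_000):
    """Resample each arm independently, as an unpaired test would."""
    i = rng.integers(0, len(cand), size=(n_boot, len(cand)))
    j = rng.integers(0, len(base), size=(n_boot, len(base)))
    return np.percentile(cand[i].mean(axis=1) - base[j].mean(axis=1), [2.5, 97.5])

paired_lo, paired_hi = paired_ci(candidate, baseline)
unpaired_lo, unpaired_hi = unpaired_ci(candidate, baseline)
paired_width = paired_hi - paired_lo
unpaired_width = unpaired_hi - unpaired_lo
print(f"paired   CI width: {paired_width:.4f}")
print(f"unpaired CI width: {unpaired_width:.4f}")
```

How much the interval narrows depends on how strongly the two versions' scores correlate case by case; with weakly correlated scores the gain shrinks.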

The promote-or-block rule: three triggers, any one blocks

  1. Floor. Any rubric’s per-route mean drops below the pinned floor.
  2. Paired CI. The bootstrap CI on the per-case delta sits entirely below zero on any rubric.
  3. Safety flip. Any safety rubric (PromptInjection, DataPrivacyCompliance, IsHarmfulAdvice, JailbreakScanner) flips a case from pass to fail.

Floor catches the catastrophic. Paired CI catches the drift. Safety flip catches the jailbreak the new prompt opened — even one case is non-negotiable on safety rubrics. The three cover what actually goes wrong; the rest is plumbing.

The pytest fixture: paired regression against the pinned baseline

# tests/test_prompt_regression.py
import json, os, statistics
from pathlib import Path

import numpy as np
import pytest
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextRelevance, Completeness,
    AnswerRefusal, PromptInjection, DataPrivacyCompliance,
)
from fi.testcases import TestCase

GOLDEN    = Path("evals/golden")
BASELINES = Path("evals/baselines")    # pinned per-version score arrays in git

FLOORS = {
    "support-rag": {"Groundedness": 0.85, "AnswerRefusal": 0.90},
    "legal-rag":   {"Groundedness": 0.88, "DataPrivacyCompliance": 0.99},
    "sales-agent": {"Completeness": 0.75},
}
SAFETY = {"PromptInjection", "DataPrivacyCompliance"}  # safety rubrics scored in this fixture; extend per route

evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
    max_workers=16,
)


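def paired_delta_ci(candidate, baseline, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap CI on the mean per-case delta (candidate - baseline)."""
    if len(candidate) != len(baseline):
        raise ValueError("candidate and baseline must score the same cases in the same order")
    rng = np.random.default_rng(42)  # fixed seed keeps the gate deterministic across reruns
    d = np.array(candidate) - np.array(baseline)
    boot = np.array([rng.choice(d, len(d), replace=True).mean() for _ in range(n_boot)])
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(d.mean()), float(lo), float(hi)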


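@pytest.mark.parametrize("route", ["support-rag", "legal-rag", "sales-agent"])
def test_prompt_regression(route, request):
    # --routes must be registered with pytest_addoption in conftest.py
    if route not in (request.config.getoption("--routes") or "").split(","):
        pytest.skip(f"{route} not affected by this PR")

    # One JSON object per line; skip blank lines so a trailing newline is harmless
    lines = (GOLDEN / f"{route}.jsonl").read_text().splitlines()
    cases = [json.loads(line) for line in lines if line.strip()]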
    templates = [Groundedness(), ContextRelevance(), Completeness(),
                 AnswerRefusal(), PromptInjection(), DataPrivacyCompliance()]

    candidate = evaluator.evaluate(
        eval_templates=templates,
        inputs=[TestCase(**c, prompt_version=os.environ["CANDIDATE_VERSION"]) for c in cases],
    )
    baseline = json.loads((BASELINES / f"{route}.json").read_text())

    failures = []
    for rubric in candidate.aggregate_by_template():
        cand, base = candidate.scores(rubric), baseline[rubric]
        # Trigger 1: floor
        floor = FLOORS.get(route, {}).get(rubric)
        if floor is not None and statistics.mean(cand) < floor:
            failures.append(f"{route}.{rubric}: mean below floor {floor}")
        # Trigger 2: paired-delta CI entirely below zero
        _, lo, hi = paired_delta_ci(cand, base)
        if hi < 0:
            failures.append(f"{route}.{rubric}: paired CI [{lo:.3f}, {hi:.3f}] regressed")
        # Trigger 3: safety pass-to-fail flip
        if rubric in SAFETY:
            flipped = sum(1 for c, b in zip(cand, base) if b >= 0.5 and c < 0.5)
            if flipped:
                failures.append(f"{route}.{rubric}: {flipped} cases flipped pass->fail")

    assert not failures, "\n".join(failures)

Four design calls earn their keep. The baseline is pinned per-version JSON in git, so the diff is reviewable when it updates. Per-route parametrize plus a --routes flag means skipped routes don’t burn judge tokens. Three triggers per rubric, scored independently. max_workers=16 is enough to reach the judge provider’s rate limit from a single machine; the SDK’s distributed runners (Celery, Ray, Temporal, Kubernetes) only need to take over past that point.
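The fixture reads a --routes option that pytest does not define on its own. A minimal conftest.py sketch that registers it through pytest's standard pytest_addoption hook follows. The empty-string default is an assumption, chosen so that every route is skipped when the flag is omitted.

```python
# conftest.py
def pytest_addoption(parser):
    """Register --routes so tests can read it via request.config.getoption."""
    parser.addoption(
        "--routes",
        action="store",
        default="",
        help="Comma-separated route IDs to evaluate; all other routes are skipped.",
    )
```

With this in place, `pytest tests/test_prompt_regression.py --routes=support-rag,legal-rag` runs two shards and skips the third.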

Version pinning: the baseline isn’t a frozen number

A regression suite needs to know which prior version it’s pairing against. Prompt templates are versioned objects; every trace and every eval row carries prompt_version_id; the baseline JSON in git regenerates on every promote-to-main.

# prompts/support_agent/v24.yaml
version: 24
parent: 23
template: |
  You are an empathetic support agent for {brand}.
  Always cite the policy section number when refusing a refund.
  ...
variables: [brand, user_tier]
owners: [support-eng@company.com]
last_validated_against: evals/baselines/support-rag.json@sha:a3f1b9

The version flows into the CI baseline three ways. Promote-to-main writes the merged version’s per-case scores to evals/baselines/<route>.json keyed by prompt_version_id. The PR pytest reads that baseline and pairs against it. And traceAI’s prompt_version_id span attribute lets production scoring tie back to the same version, so the rolling production baseline stays in sync with the CI baseline.

A baseline frozen at launch decays — the judge model updates, the dataset grows, the production distribution shifts, and the gate either fires false alarms or stops firing. Regenerate on every merge to main; overlay a rolling 7-day production observation to catch drift the merge-time baseline misses. The prompt versioning post covers the versioning patterns in more depth.
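The promote-to-main step can be sketched as a function that writes the merged version's per-case scores to the baseline file. The JSON layout here is an assumption: rubric names as top-level keys (which is what the fixture's baseline[rubric] lookup expects), plus prompt_version_id and case_ids so a later run can verify it is pairing the same cases in the same order.

```python
import json
from pathlib import Path

def write_baseline(route, prompt_version_id, scores_by_rubric, case_ids,
                   out_dir="evals/baselines"):
    """Write per-case scores for the merged prompt version to <out_dir>/<route>.json."""
    for rubric, scores in scores_by_rubric.items():
        if len(scores) != len(case_ids):
            raise ValueError(f"{rubric}: {len(scores)} scores for {len(case_ids)} cases")
    payload = {
        "prompt_version_id": prompt_version_id,
        "case_ids": list(case_ids),
        **{rubric: list(scores) for rubric, scores in scores_by_rubric.items()},
    }
    path = Path(out_dir) / f"{route}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys and fixed indentation keep the git diff stable and reviewable.
    path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n")
    return path
```

Committing the resulting file in the same merge as the prompt change keeps the baseline and the version it describes in one reviewable diff.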

CI integration: GitHub Actions, path scope, classifier cascade

# .github/workflows/prompt-regression.yml
name: prompt-regression
on:
  pull_request:
    paths: ["prompts/**", "evals/golden/**", "src/agent/**"]

concurrency:
  group: prompt-regression-${{ github.head_ref }}
  cancel-in-progress: true

jobs:
  detect-routes:
    runs-on: ubuntu-latest
    outputs:
      routes: ${{ steps.aff.outputs.routes }}
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 2 }
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - id: aff
        run: echo "routes=$(python evals/affected_routes.py)" >> "$GITHUB_OUTPUT"

  regression:
    needs: detect-routes
    if: needs.detect-routes.outputs.routes != '[]'
    runs-on: ubuntu-latest
    timeout-minutes: 8
    strategy:
      fail-fast: false
      matrix:
        route: ${{ fromJson(needs.detect-routes.outputs.routes) }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: pip }
      - run: pip install ai-evaluation scipy numpy pytest
      # Layer 1: classifier cascade — NLI rubrics on every case, frontier on disagreement
      - run: fi run --check --strict --parallel 16 -c evals/fi-evaluation.yaml --filter route=${{ matrix.route }}
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
      # Layer 2: paired-delta CI gate
      - run: pytest tests/test_prompt_regression.py --routes=${{ matrix.route }}
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
          CANDIDATE_VERSION: ${{ github.event.pull_request.head.sha }}

Four habits earn their keep. Path-scoped triggers so a CSS-only PR doesn’t burn judge tokens. concurrency.cancel-in-progress: true so three rapid pushes don’t fan out into three concurrent suites. Matrix sharding by route so the monorepo scales with the change surface, not with the route count. Classifier cascade in layer 1 (NLI rubrics first, frontier judge on disagreement) drops the per-PR judge bill by roughly an order of magnitude before layer 2’s paired-delta gate runs.

The CI/CD LLM eval workflow covers the fi CLI exit-code partition (0/2/3/6/7) and nightly drift detector that pair with this gate.

Closed loop: production failures grow the regression set

A regression set frozen at launch decays. The cases that caught regressions a quarter ago aren’t the failure modes users hit now.

The pattern: production traces score with the same rubrics via traceAI’s EvalTag (the score writes as a span attribute next to prompt_version_id). Failing traces fall into an Error Feed queue. Error Feed soft-clusters with HDBSCAN over span embeddings — failures group into named clusters like “agent over-promises refund timeline” or “persona drifts to formal on Spanish input.” A Sonnet 4.5 Judge agent writes the RCA, evidence, an immediate_fix, and a 4-dim score. The immediate_fix becomes the spec for a new regression case (input, expected behavior, which rubric catches it). On reviewer approval the case joins the golden set.
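The last step, an approved case joining the golden set, can be sketched as an idempotent append to the route's JSONL file. The required field names are an assumed schema modeled on the spec described above (input, expected behavior, which rubric catches it), not an Error Feed export format.

```python
import json
from pathlib import Path

# Assumed case schema for illustration.
REQUIRED = ("id", "input", "expected_behavior", "rubric", "intent", "persona")

def append_approved_case(golden_path, case):
    """Append a reviewer-approved case to a JSONL golden set.

    Returns True if the case was added, False if its id was already present.
    """
    missing = [k for k in REQUIRED if k not in case]
    if missing:
        raise ValueError(f"case is missing fields: {missing}")
    path = Path(golden_path)
    text = path.read_text() if path.exists() else ""
    existing = {json.loads(line)["id"] for line in text.splitlines() if line.strip()}
    if case["id"] in existing:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as fh:
        if text and not text.endswith("\n"):
            fh.write("\n")  # repair a missing trailing newline before appending
        fh.write(json.dumps(case, sort_keys=True) + "\n")
    return True
```

Because the golden set lives in git, each approved case shows up as a one-line diff that a reviewer can inspect.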

The regression set grows weekly with last week’s misses, not from a sprint imagining edge cases at a whiteboard.

How Future AGI ships prompt regression testing

Future AGI ships the eval stack as a package. The pieces compose; pick the ones you need.

  • ai-evaluation SDK (Apache 2.0). 60+ EvalTemplate classes (Groundedness, ContextRelevance, Completeness, AnswerRefusal, PromptInjection, DataPrivacyCompliance, IsHarmfulAdvice, CustomLLMJudge). 8 sub-10ms Scanner classes for the deterministic base. NLI-backed local rubrics (faithfulness, claim_support, rag_faithfulness, factual_consistency) for the classifier-cascade tier. Four distributed runners (Celery, Ray, Temporal, Kubernetes).
  • fi CLI. fi run --check --strict --parallel 16 with assertions on pass_rate, avg_score, p50/p90/p95_score. CI-distinct exit codes (0/2/3/6/7) as a hard contract for any CI runner; log lines reformat between SDK versions, exit codes don’t.
  • traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, C#. EvalTag attaches the same rubric to live OTel spans; the score writes as a span attribute next to prompt_version_id. Same rubric in CI and in production means the numbers are comparable end to end.
  • Error Feed. HDBSCAN soft-clustering over span embeddings plus the Sonnet 4.5 Judge agent that writes the immediate_fix. Linear integration ships today; Slack, GitHub, Jira, PagerDuty on the roadmap.
  • Future AGI Platform. Self-improving evaluators tuned by thumbs feedback; classifier-backed evals at lower per-eval cost than Galileo Luna-2, which is what makes 200-case paired suites on every PR affordable.
  • agent-opt (Apache 2.0). Six optimizers (RandomSearch, BayesianSearch, MetaPrompt, ProTeGi, GEPA, PromptWizard). The pattern: regression suite gates every change; the optimizer proposes candidates against the regression set; the suite decides whether the candidate promotes. See automated prompt improvement for the optimizer choice rubric.
  • Agent Command Center. OpenAI-compatible gateway in a single Go binary. 100+ providers, 18+ built-in guardrail scanners, cohort-stable hashing for canary, per-rubric eval-gated rollback at the gateway hop — so the regression suite has somewhere to land on production traffic.

Ready to wire a regression gate? pip install ai-evaluation, fi init --template prompt-regression, point the golden set at your stratified per-route JSONL, set FI_API_KEY and FI_SECRET_KEY in CI secrets, add the pytest workflow above. The first PR that ships against this gate is the first PR whose author knows what it broke.

Anti-patterns the regression suite should avoid

  • Pass-rate as the gate. “Suite passes at 87 percent” collapses per-case signal into a number that can’t distinguish a 14-point refusal collapse from judge noise. Track pass rate as a health check, never as the gate.
  • Floor without paired CI. Slow drift slips under the floor for months. The paired-delta CI catches the regression the floor misses.
  • Frozen baseline. A 2024 baseline scoring a 2026 prompt is a benchmark, not a regression suite. Regenerate on every merge to main.
  • No version pinning. Without prompt_version_id on the trace and the eval row, “the baseline” is a hand-wave. The version is the join key between CI and production scoring.
  • 30-case smoke set as the gate. At 30 cases the paired CI is wider than the regressions you’re trying to catch, so real drift straddles zero while the mean-versus-floor check flaps on noise. Grow to 100-300 paired cases per route.
  • LLM-as-judge on every case on every PR. $9 per PR at month two, quietly disabled at month three. Classifier cascade first; the frontier judge only on disagreement.
  • No canary after the gate. A passing suite doesn’t cover the full input distribution. Mirror or shadow routing through the gateway catches the candidate-only failures the golden set didn’t anticipate.
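The 30-case anti-pattern can be checked with back-of-envelope arithmetic. Under a normal approximation, the 95 percent CI on the mean paired delta has half-width of about 1.96 * sd / sqrt(n), where sd is the standard deviation of the per-case deltas. The sd of 0.15 and the 0.03 regression size below are assumed values; measure your own from two runs of the suite. The bootstrap interval will differ somewhat from this approximation.

```python
import math

def ci_half_width(sd_delta, n, z=1.96):
    """Approximate half-width of the CI on the mean per-case delta."""
    return z * sd_delta / math.sqrt(n)

def cases_needed(sd_delta, smallest_regression, z=1.96):
    """Cases needed for the CI half-width to shrink to the regression size."""
    return math.ceil((z * sd_delta / smallest_regression) ** 2)

# Assumed inputs: per-case delta sd of 0.15, smallest regression worth catching 0.03.
half_width_at_30 = ci_half_width(0.15, 30)
half_width_at_200 = ci_half_width(0.15, 200)
minimum_cases = cases_needed(0.15, 0.03)
```

Treat the result as a lower bound: when the half-width only equals the regression size, the interval excludes zero roughly half the time, so reliable detection needs more cases than this.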

What to do this week

  1. Pull 200 cases per route from production logs into a versioned JSONL golden set. Stratify by intent x persona x edge-case.
  2. Wire the three triggers (floor, paired CI, safety flip) into the pytest fixture above. Run it twice against the current prompt to record the noise floor.
  3. Pin prompt versions in YAML under prompts/. Write the promote-to-main job that regenerates evals/baselines/<route>.json keyed by prompt_version_id.
  4. Add the GitHub Actions workflow with path-scoped triggers, matrix shard per route, classifier cascade, paired-delta gate.
  5. Stand up Error Feed clustering on production traces. Queue the cluster-derived candidate cases for next month’s set growth.
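Step 2's noise floor can be recorded with a short helper that compares two runs of the same prompt on the same cases. This is a sketch; the function name and the three summary statistics are illustrative choices.

```python
import statistics

def noise_floor(run_a, run_b):
    """Summarize judge noise from two runs of the same prompt on the same cases."""
    if len(run_a) != len(run_b) or len(run_a) < 2:
        raise ValueError("need two aligned runs with at least two cases")
    diffs = [a - b for a, b in zip(run_a, run_b)]
    return {
        "mean_shift": statistics.mean(diffs),                      # should sit near zero
        "sd_delta": statistics.stdev(diffs),                       # per-case judge noise
        "mean_abs_delta": statistics.mean([abs(d) for d in diffs]),
    }
```

If mean_shift is far from zero between two runs of an unchanged prompt, the judge itself is drifting and the paired gate will inherit that bias.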

The next prompt edit ships knowing what it broke — not from a Slack ping six days later.

Frequently asked questions

What is prompt regression testing?
Running a fixed evaluation suite against a prompt version and comparing per-rubric, per-route scores against the prior pinned version. It's pytest for prompts: assertion-based, version-pinned, run in CI on every PR. The difference from unit testing is the assertion target — instead of equality on a string, the gate asserts a rubric floor (Groundedness >= 0.85) and a paired delta against the prior version's score vector. The output isn't pass/fail per case; it's a confidence interval on the per-case delta. If the CI sits entirely below zero on any rubric, the new prompt regressed.
How big should the regression set be?
100 to 300 paired examples per route, sampled from production traces. Below 100 the paired bootstrap CI is wider than the regressions you're trying to catch, so real drift straddles zero and the mean-versus-floor check flaps on noise. Above 500 the judge bill grows faster than detection sharpens. The composition that holds up across routes: ~60 percent happy-path queries from real traffic, ~20 percent edge cases (long input, multilingual, ambiguous), ~10 percent refusal cases, ~10 percent the hardest historical failures from the incident log. Stratify by intent and persona so the per-route view in the diff has at least 3 cases per cell.
Three triggers, any one blocks. What are they?
(1) Any rubric's per-route mean drops below the pinned floor (Groundedness < 0.85, AnswerRefusal < 0.90, citation validity < 0.99). (2) The paired-delta 95 percent bootstrap CI on any rubric sits entirely below zero — a statistically significant regression, not a noise blip. (3) Any safety rubric (PromptInjection, DataPrivacyCompliance, IsHarmfulAdvice) flips a case from pass to fail. Floor catches catastrophic drops, paired CI catches slow drift, safety flip catches the jailbreak the new prompt opened. Tune the per-rubric floors and the noise-floor threshold per route once the suite has been running a few weeks.
How do I run prompt regression in CI without slowing PRs?
Path-scoped triggers on prompts/** and src/agent/**, route-scoped sharding via a matrix, classifier cascade first (NLI rubrics on every case, frontier judge only on the disagreement subset), and concurrency.cancel-in-progress on the head ref so rapid pushes don't fan out. A 200-case per-route suite running with Evaluator(max_workers=16) clears in 2-4 minutes for a single route, and the matrix shard means a multi-route monorepo doesn't multiply by route count. The fi CLI exit codes (0/2/3/6/7) wire cleanly into the workflow without grep-on-stdout heuristics.
Why pair the evaluation against the prior version instead of an absolute floor?
Absolute floors catch the catastrophic drop and miss the slow drift. A prompt that moves Groundedness from 0.91 to 0.88 still clears a 0.85 floor — the floor says ship — but the per-case paired delta is significantly negative on the cases that actually moved, which is the regression. Pair the candidate against the pinned prior version on the same examples, take per-case deltas, bootstrap a 95 percent CI on the delta vector. The CI is the gate, not the mean. The floor stays for the catastrophic case; the paired delta is what catches the drift the floor misses.
How does Future AGI close the loop from production failure to new regression case?
Error Feed (inside the eval stack) soft-clusters failing production traces via HDBSCAN over span embeddings. A Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars) writes the RCA, evidence, an immediate_fix description, and a 4-dim score. The immediate_fix becomes the spec for a new regression case (input, expected behavior, which rubric catches it). The case enters a review queue; on approval it joins the golden set. The regression set grows weekly with the failures it missed last week, instead of staying frozen at launch.
What about prompt optimization, not just regression?
agent-opt ships six optimizers (RandomSearch, BayesianSearch with Optuna and teacher-inferred few-shot, MetaPrompt, ProTeGi, GEPA, PromptWizard) that run eval-driven optimization. The pattern for 2026: regression suite gates every change. The optimizer proposes candidates against the regression set; the suite decides whether the candidate promotes. EarlyStoppingConfig cuts the search budget once improvements plateau. Direct trace-stream-to-optimizer ingestion is on the roadmap; the eval-driven loop ships now.