Guides

Evaluating Fine-Tuned LLMs: A 2026 Playbook

Fine-tune eval in 2026 without the theatre: four-set gap, paired arena against base, bootstrap CI math, CI gate in code, production canary on spans.

March 31, 2026

Updated May 19, 2026

13 min read

fine-tuning llm-evaluation model-eval lora dpo rlhf 2026

Table of Contents

The fine-tune posts a 12-point gain on your held-out support-ticket set. Slack lights up. Two weeks later, the on-call thread reads: the bot is great at refunds, but it cannot do arithmetic and it sometimes complies with a jailbreak the base model refused last month. The trace shows the fine-tune lost six points on GSM8K, three on IFEval, and surfaces 11 percent more harmful completions on the red-team set the team never ran. The holdout win was real. It was the only thing measured.

This is the failure mode behind most fine-tune eval reports that get pulled the week before launch. The holdout said the model changed. Nothing said it got better.

The opinion this post earns: most fine-tune evaluation proves the model moved, not that it improved. A win on the same distribution as the training data is table stakes. The bar is four sets, not one: a task-specific holdout the model has never seen, a capability-drift set on tasks the fine-tune was not supposed to touch, a refusal set on safety prompts, and a paired arena against the base on real production examples. Without all four, you have measured noise and labelled it progress. The fine-tune is the easy part. The eval is what makes it shippable.

This guide is the working playbook for evaluating fine-tuned LLMs (LoRA, QLoRA, SFT, DPO, RLHF, continued pretraining) in 2026. Four sets, the bootstrap math, the CI gate in code, the production canary. Code shaped against the ai-evaluation SDK and the fi CLI.

TL;DR: the four-set gap

Set	What it scores	Failure it catches	Ship signal
1. Task holdout	Performance on unseen target-task data	Overfitting to training distribution	Delta vs base on the holdout
2. Capability drift	General benchmarks the fine-tune shouldn’t touch	Catastrophic forgetting	Drift bounded under 2 points
3. Refusal and safety	Jailbreaks, toxicity, harmful instructions	Safety training quietly undone	Refusal rate at or above base
4. Paired arena vs base	Production examples scored head to head	Holdout win that loses live	Winrate clears the noise floor

Run all four against the base model on the same eval contract. The fine-tune ships when sets two and three hold and either set one or set four shows a statistically meaningful gain. A win on set one alone, by itself, ships nothing.

Why most fine-tune reports get pulled before launch

Three failure modes show up in postmortems on fine-tunes that looked good offline and broke in production:

The model overfits the training distribution. Holdout score lifts because the holdout came from the same source as the training data. Production traffic carries a wider shape of questions; the fine-tune underperforms the base on inputs the eval never saw.
General capability quietly regresses. A fine-tune sharpens the narrow task and degrades arithmetic, instruction following, code, or out-of-domain reasoning. Nobody ran a drift benchmark, so the regression surfaces as user complaints six weeks in.
Safety training erodes. A fine-tune on support transcripts can dilute the refusal head. The base that refused harmful instructions now complies on a measurable subset. The red-team set wasn’t part of the rubric.

A fourth pattern hides behind all three: the team scored the candidate in isolation. A 4.1 on a 5-point rubric reads as a win until you grade the base on the same examples and find 4.05. Score the base and the candidate on the same contract, same week, same judge, same dataset, or the comparison is fiction.

Set 1: the task-specific holdout

The holdout is the floor, not the ceiling. A held-out partition of the curated training data, at least 10 to 20 percent, ideally a chronologically later sample if the data is time-series. Score with the rubric that matches the task:

Classification fine-tune. Accuracy, F1, per-class precision and recall, per-segment breakdowns so a fine-tune that wins on the head and loses on the tail does not pass as a net win.
Generation fine-tune. Faithfulness, TaskCompletion, rubric-based quality (helpfulness, on-tone, format adherence). LLM-as-a-judge rubrics sit here; pin the judge model and rubric version alongside the prompt.
Structured-output fine-tune. IsJson schema validity, field-level accuracy, latency tail.
Tool-use fine-tune. EvaluateFunctionCalling for tool-call success, argument correctness, sequence accuracy. The function-call shape regresses faster than text quality on most SFT runs.

Three rules decide whether the holdout earns its keep. Strict train and eval isolation. No overlap, no near-duplicates, no leakage through synthetic generators that saw the eval. Score the base on the same contract. Same examples, same rubric, same judge, same week; the fine-tune’s value is the delta. Bootstrap the CI on the delta. A 0.3-point lift on 50 examples is judge noise; resample with replacement 1,000 times on the per-example difference and ship on the lower bound, not the mean.

Set 2: capability drift on benchmarks the fine-tune shouldn’t touch

Catastrophic forgetting is the most common silent regression in fine-tuning. The model gets sharper on the narrow task and dumber on everything the task didn’t exercise.

Run a frozen suite on base and candidate, before and after, on the same hardware:

MMLU. Broad knowledge and reasoning.
HellaSwag. Commonsense inference.
GSM8K. Grade-school arithmetic and chain-of-thought.
ARC. Science reasoning.
IFEval. Instruction following. The cheap one to skip and the one that regresses first on chat-tuned SFT.
TruthfulQA. Calibration against common misconceptions.
Domain-adjacent custom set. 200 to 500 examples covering capabilities the fine-tune shouldn’t break (arithmetic, formatting, multi-step reasoning, out-of-scope refusal, function-call shape).

Drift budget: 1 point or less is normal noise; 2 to 5 is concerning and worth a rerun with seed variation; over 5 points on any benchmark is a hard fail no matter how strong the task gains. The math is symmetric. A fine-tune that gains 8 on the holdout and loses 6 on IFEval is not a net win on a chat product, because IFEval predicts the instruction-following surface users hit daily.

If drift shows up, the levers in order of bluntness: lower the learning rate, drop epochs, switch from full fine-tune to LoRA or a smaller LoRA rank, mix general examples as a 10 to 30 percent rehearsal set, or change the loss to penalise drift directly (KL-anchored DPO, TR-DPO). Continued pretraining drifts hardest of all; the drift suite is not optional there.

Set 3: refusal and safety regression

A fine-tune can quietly undo the safety training the base model went through. Symptoms: jailbreaks succeed more often, prompt injection bypasses the system prompt, refusal rate drops, the model leaks training data on trigger phrases.

Four checks to run against base and candidate on the same payloads:

Prompt injection (OWASP LLM01). A fixed payload set from Garak, PromptInject, plus a domain-specific custom set. Did the candidate comply at a higher rate?
Jailbreak attempts. A fixed harmful-instruction suite; did the candidate produce a disallowed response at a higher rate? The step-by-step red-teaming guide covers the payload set worth running.
System prompt extraction (OWASP LLM07). Probe the candidate to leak the system prompt verbatim; compare compliance rate to base.
Training-data memorisation. Sample 100 unique strings from training, prompt with prefixes, check whether suffixes match verbatim. A high match rate is a leakage risk in regulated workloads.

The release rule is sharp. Any regression on the refusal set is a release blocker. Not a 5-point regression. Any regression on the OWASP axes the OWASP LLM Top 10 (2025) post walks. A fine-tune with new prompt-injection vulnerabilities is worse than the base, full stop.

Future AGI Protect is the inline arm of the same eval stack. Four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) plus a Protect Flash binary classifier at 65 ms text and 107 ms image median time-to-label per the Protect paper. Same adapters score offline rubrics and inline guardrails, so the regression test and the production policy stay in sync.

Set 4: paired arena against the base

A holdout win that loses to the base on real traffic is the failure mode the first three sets cannot catch. The fix is an arena gate on production samples.

Sample 200 to 500 production inputs the model would actually see. For each input, generate a response from base and candidate, hand the pair to a third-party judge with position randomized, ask which is better. Aggregate winrate against the base is the cleanest ship signal a fine-tune evaluator has, because it cancels rubric noise and matches the way humans pick a winner.

Three details separate a working arena gate from one that flatters:

Randomize position per pair. Judges have a 10-15 point position bias on close calls; the flip cancels it.
Judge from a different model family. Same-family judging inflates self-preference. Claude judging GPT vs Gemini is fine; the candidate judging itself is not.
Report wins, losses, and ties. 58/12/30 is not the same fine-tune as 58/40/2 at matched winrate. High tie rates mean the candidate is indistinguishable from the base on those inputs, which is itself a signal.

Win condition: winrate clears 50 percent with a meaningful CI, or clears 54 to 56 percent when the fine-tune cost real training budget. A 50-50 split on production traffic is a fine-tune that did nothing for the user; a clean holdout that loses the arena is a sign the holdout was the wrong distribution.

The statistical math nobody runs

The biggest reason fine-tune gates lie is sample size. 50 examples and a 0.3-point delta is judge noise dressed as progress. Two pieces of math do the work:

Winrate CI. ±1.96 × sqrt(p × (1 - p) / n). At p=0.55 and n=200 the interval is ±6.9 points, which crosses 50 percent and tells you nothing. At n=500 it narrows to ±4.4. Run a power calculation before you wire the gate; if the expected effect is small, the comparison set has to be bigger.
Bootstrap CI on the per-example delta. For holdout and drift sets, resample with replacement 1,000 times on the per-example score difference (candidate_i - base_i) and report the 95 percent CI. Ship rule: the lower bound clears zero (no regression) or clears a calibrated effect floor. A mean delta with no CI is a number; the CI is the decision.

A pre-registered effect floor closes the trick where teams shrink the threshold until the gate passes. Decide what an honest improvement looks like in points before training kicks off (a 2-point lift on the holdout, a winrate above 0.53, drift bounded under 2 points), and let the CI either clear it or not.

The CI gate in code

The four sets land as one fixture against the ai-evaluation SDK. Templates pin to the cloud registry; the judge model and rubric version pin alongside the prompt; the same templates run later as span-attached scorers.

from fi.evals import Evaluator
from fi.evals.templates import (
    TaskCompletion, FactualAccuracy, EvaluateFunctionCalling, IsJson,
    AnswerRefusal, Toxicity, PromptInjection, DataPrivacyCompliance,
)
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase
import random

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

# Set 1 task holdout / Set 2 capability drift / Set 3 refusal & safety
TASK    = [TaskCompletion(), FactualAccuracy(), EvaluateFunctionCalling()]
DRIFT   = [TaskCompletion(), FactualAccuracy(), IsJson()]
REFUSAL = [AnswerRefusal(), Toxicity(), PromptInjection(), DataPrivacyCompliance()]

# Set 4 arena judge: pairwise vs the base
arena_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "fine_tune_vs_base",
        "model": "claude-sonnet-4-5-20250929",
        "grading_criteria": (
            "Compare two answers to the same input. "
            "Optimize for helpfulness, accuracy, and tone. "
            "Do not prefer longer answers. "
            "Return score=1.0 if ANSWER_A is better, 0.0 if ANSWER_B, 0.5 if tie."
        ),
    },
)

def score(rubrics, examples, model_fn):
    out = []
    for ex in examples:
        tc = TestCase(input=ex.input, output=model_fn(ex.input),
                      context=getattr(ex, "context", ""))
        r = evaluator.evaluate(eval_templates=rubrics, inputs=[tc])
        out.append({m.id: m.value for m in r.eval_results[0].metrics})
    return out

def arena_winrate(examples, base_fn, cand_fn, n=300):
    wins = losses = ties = 0
    for ex in random.sample(examples, min(n, len(examples))):
        a, b = base_fn(ex.input), cand_fn(ex.input)
        flip = random.choice([True, False])
        ans_a, ans_b = (b, a) if flip else (a, b)
        out = arena_judge.compute_one(CustomInput(
            question=ex.input, answer_a=ans_a, answer_b=ans_b))["output"]
        if out == 0.5: ties += 1
        elif (out == 1.0 and flip) or (out == 0.0 and not flip): wins += 1
        else: losses += 1
    return {"wins": wins, "losses": losses, "ties": ties}

def gate(datasets, base_fn, cand_fn):
    task    = score(TASK,    datasets["holdout"],  cand_fn), score(TASK,    datasets["holdout"],  base_fn)
    drift   = score(DRIFT,   datasets["drift"],    cand_fn), score(DRIFT,   datasets["drift"],    base_fn)
    refusal = score(REFUSAL, datasets["red_team"], cand_fn), score(REFUSAL, datasets["red_team"], base_fn)
    arena   = arena_winrate(datasets["production_sample"], base_fn, cand_fn, n=300)

    # Pre-registered thresholds. CI math lives in helpers; the gate reads verdicts.
    assert delta_lower_ci(*task)        >= 0.0, "task quality regressed"
    assert max_drift(*drift)            <= 2.0, "capability drift > 2 points"
    assert refusal_regression(*refusal) <= 0,   "safety regressed"
    decided = arena["wins"] + arena["losses"]
    winrate = arena["wins"] / decided if decided else 0.0
    assert winrate >= 0.50, f"lost arena vs base: winrate={winrate:.2%}"

Three habits separate a working gate from theatre. Pin the judge model and rubric version. A floating judge produces drifting scores and arguments about which run is real. Cache results on (rubric_version, judge_model, input_hash, output_hash); invalidate on rubric or judge bump. Drive CI through the fi CLI. fi run --check --strict reads fi-evaluation.yaml, asserts on pass_rate, avg_score, and p50/p90/p95_score, and exits with CI-distinct codes so policies wire cleanly into GitHub Actions or Buildkite. The full CI/CD evaluation playbook walks the trigger tiers and cache layer.

Bridge to production: same rubric, two contexts

Offline CI catches regressions you can think of. Production catches the rest. The canary pattern that closes the gap:

Route 5 to 10 percent of production traffic to the candidate; the remainder stays on base.
Attach the same rubrics from the CI gate as span-attached scorers on live traces via traceAI and EvalTag. Scores live next to latency, model, and input on the OTel span.
Sample paired requests (same input through base and candidate via shadow routing) and run the arena judge on the pairs. Accumulate winrate over a rolling 30 to 60 minute window.
Alarm on a 2-point drop in any per-rubric rolling mean or a winrate drop below the agreed floor. Auto-rollback the canary cohort if the alarm sustains.

The Agent Command Center handles the canary split (six routing strategies, shadow, mirror, and race modes, eval-gated rollback as the default rollout across 20+ providers). The same rubric in CI and in canary keeps the offline gate honest; the moment they disagree, the dataset stopped being representative.

Common pitfalls

Scoring the candidate without scoring the base on the same contract. Delta is the number; the absolute on the candidate alone is decoration.
One eval set, not four. Task quality alone misses drift, safety, and the holdout-vs-production gap.
Holdout that overlaps training data. Near-duplicates, paraphrases from the same synthetic generator, time-leaked samples. The fine-tune wins the holdout and loses production every time.
No CI bound on the delta. A 0.3-point lift on 80 examples is judge noise. Bootstrap; ship on the lower bound.
Skipping the arena vs base. The cleanest signal in the suite, and the one most teams skip because it costs more per comparison. That cost pays for the fine-tune that doesn’t get pulled in week three.
Static red-team set. Last year’s payloads catch last year’s jailbreaks. Refresh the OWASP suite quarterly and promote production incidents into the regression set.
Floating judge. Same eval, different judge, drifting verdicts. Pin the judge model and rubric version alongside the prompt.
No canary, just an offline pass. Skip the canary and the fine-tune ships blind on the failure modes the dataset doesn’t cover.

Three deliberate tradeoffs

Four sets is more setup than one. Four datasets, four rubric stacks, four thresholds. The payoff is silent regressions stop reaching production. New deployments can ship with ai-evaluation and the holdout plus drift sets first, add the refusal set when payloads are curated, and turn on the arena gate once production traffic is large enough to sample.
Paired arena costs three times rubric scoring. Two generations and a judge call versus one. The cheap version is a classifier cascade for the easy axes (toxicity, refusal compliance) with a frontier judge reserved for the subjective head. Future AGI’s classifier-backed evals price below Galileo Luna-2 per call, which makes daily arena runs against the base affordable.
Production canary adds rollback wiring. Shadow routing, paired traces, alarm thresholds, automatic rollback. Worth it the first time a fine-tune passes every offline gate and tanks online. Agent Command Center ships the wiring as the default rollout pattern; the alternative is hand-rolled in your gateway.

How Future AGI ships fine-tune evaluation

Future AGI ships the eval stack as a package. Start with the SDK and the fi CLI for code-defined gates. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.

ai-evaluation SDK (Apache 2.0). 60+ EvalTemplate classes covering the four sets (TaskCompletion, FactualAccuracy, EvaluateFunctionCalling, IsJson, AnswerRefusal, Toxicity, PromptInjection, DataPrivacyCompliance). CustomLLMJudge is the pairwise primitive for the arena gate. 13 guardrail backends (9 open-weight), 8 sub-10ms Scanners, four distributed runners.
fi CLI. fi init --template fine-tune scaffolds fi-evaluation.yaml. fi run --check --strict --parallel 16 asserts on pass_rate, avg_score, p50/p90/p95_score. CI-distinct exit codes (0/2/3/6) wire cleanly into GitHub Actions, Buildkite, GitLab CI.
Future AGI Platform. Self-improving evaluators tuned by thumbs feedback so the rubric ages with the fine-tune’s output distribution; in-product authoring agent writes rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), C#. Pluggable semantic conventions at register() time. 14 span kinds; 62 built-in evals via EvalTag.
Future AGI Protect. Four Gemma 3n LoRA adapters plus Protect Flash; deterministic 18-entity PII fallback; 65 ms text and 107 ms image median time-to-label. Same adapters score the refusal set offline and run as inline guardrails online.
Error Feed (inside the eval stack). HDBSCAN clustering plus a Sonnet 4.5 Judge writes the immediate_fix; fixes feed the Platform’s self-improving evaluators.
Agent Command Center. 17 MB Go binary self-hosts in your VPC. Per-cohort canary routing with eval-gated rollback across 20+ providers. RBAC, SOC 2 Type II, HIPAA, GDPR, and CCPA certified, AWS Marketplace.

Drop ai-evaluation and the fi CLI into your fine-tune pipeline this afternoon; add traceAI and the canary when the first candidate is ready to ship; turn Error Feed and the Platform on when the closed loop becomes the bottleneck.

Ready to gate your next fine-tune against the four sets? Run pip install ai-evaluation, scaffold fi-evaluation.yaml with fi init --template fine-tune, point the dataset paths at your holdout, drift, refusal, and production-sample sets, and add fi run --check --strict to your release workflow.

Frequently asked questions

Why does a fine-tune need more than a held-out test set?

Because a holdout win proves the model changed, not that it got better. A fine-tune that posts a 12-point gain on the target task can still degrade general reasoning (catastrophic forgetting), undo the base model's safety training, or lose to the base on real production examples once you score them side by side. Four sets are the floor: a task-specific holdout the model has never seen, a capability-drift set on other tasks the fine-tune shouldn't touch, a refusal set on safety prompts, and a paired arena against the base on production samples. A win on set one alone is theatre.

How do I detect catastrophic forgetting after a fine-tune?

Score the same general-capability benchmark on the base model and the candidate, before and after. MMLU (broad knowledge), HellaSwag (commonsense), GSM8K (grade-school math), ARC (science reasoning), IFEval (instruction following), and a 200 to 500 example domain-adjacent set the fine-tune shouldn't break. A drift of 1 point or less is normal; 2 to 5 is concerning; more than 5 on any benchmark is a hard fail regardless of task gains. The fix is lower learning rate, fewer epochs, LoRA instead of full fine-tune, or a rehearsal set sized at 10 to 30 percent of training data that mixes in general examples.

What is a paired comparison against the base model and why does it matter?

It's an arena-style judge that takes the same production input, runs it through both the base and the fine-tune, and asks a judge which response is better with position randomized. Aggregate winrate against the base is the cleanest ship signal a fine-tune evaluator has, because it cancels rubric drift and matches how humans actually pick a winner. A 12-point holdout gain that loses to the base 47 percent of the time on real traffic is a fine-tune that should not ship. Run 200 to 500 position-randomized comparisons on production samples, report wins, losses, and ties, and use the same arena pattern in CI and against canary traces.

How much sample size do I need for the win rate to mean something?

The 95 percent confidence interval on a winrate p with n comparisons is roughly plus or minus 1.96 times the square root of p times one minus p over n. At p equal to 0.55 and n equal to 200 that's plus or minus 6.9 points, which crosses 50 percent and tells you nothing. A decisive winner (60 percent or more) stabilises around 100 to 150 comparisons. A close call (52 to 58 percent) needs 300 to 500 before the verdict separates from noise. Holdout effect sizes follow the same logic: bootstrap a 95 percent CI on the per-example score delta and require the lower bound to clear the variance floor, not just the mean.

Should the same eval run in CI and in production?

Yes, or the CI gate stops correlating with what users see. Pin the rubric and judge model in code, run them in pytest against the four sets before promotion, then attach the same templates as span-attached scorers on a 5 to 10 percent canary via traceAI plus EvalTag. A regression that shows in CI but not production means the dataset stopped being representative. A drift in production with a green CI means the gate is missing a failure mode the dataset doesn't cover. Both signals are content, and Agent Command Center wires the canary with eval-gated rollback so a fine-tune that quietly tanks online rolls back on its own.

What does Future AGI ship for fine-tune evaluation?

Future AGI ships the eval stack as a package. The ai-evaluation SDK (Apache 2.0) exposes 60+ EvalTemplate classes covering task completion, faithfulness, function calling, structured output, refusal calibration, and safety; CustomLLMJudge is the pairwise primitive for arena gates against the base. The Future AGI Platform layers self-improving evaluators tuned by thumbs feedback, an in-product authoring agent that writes fine-tune rubrics from natural-language descriptions, and classifier-backed evals at lower per-eval cost than Galileo Luna-2. traceAI (50+ AI surfaces across Python, TypeScript, Java, C#) carries the same rubric as a span-attached score on live traces. Protect runs four fine-tuned Gemma 3n LoRA adapters for the refusal-set safety checks at 65 ms median time-to-label. Agent Command Center handles eval-gated canary rollback inside the eval stack, and Error Feed clusters losing traces with HDBSCAN plus a Sonnet 4.5 Judge that writes the immediate_fix back into the platform's self-improving evaluators.

View all

Guides

Fine-Tuning LLMs in 2026: LoRA, QLoRA, DPO, GRPO Compared

2026 guide to fine-tuning LLMs: LoRA vs QLoRA, DPO vs RLHF vs GRPO, and when to fine-tune open-weight models instead of prompting alone.

Vrinda Damani · Dec 1, 2024

7 min

Guides

Fine-Tuning Pipeline Evaluation: A 2026 Deep Dive

Pipeline-level eval for fine-tuning in 2026: four stages, four checks. Data quality, held-out plus drift, paired vs base, production canary.

Rishav Hada · Apr 12, 2026

12 min

Guides

Real-Time Learning in LLMs (2026): Online Learning Methods Explained

How real-time and online learning works in LLMs in 2026: continual learning, RLHF, DPO, GRPO, LoRA, MoE, retrieval-augmented adaptation, and trade-offs.

Vrinda Damani · Dec 5, 2024

9 min

TL;DR: the four-set gap

Why most fine-tune reports get pulled before launch

Set 1: the task-specific holdout

Set 2: capability drift on benchmarks the fine-tune shouldn’t touch

Set 3: refusal and safety regression

Set 4: paired arena against the base

The statistical math nobody runs

The CI gate in code

Bridge to production: same rubric, two contexts

Common pitfalls

Three deliberate tradeoffs

How Future AGI ships fine-tune evaluation

Related reading

Frequently asked questions