Evaluating Fine-Tuned LLMs: A 2026 Playbook
Fine-tune eval in 2026 without the theatre: the four-set gap, paired arena against base, bootstrap CI math, the CI gate in code, and the production canary on live OTel spans.
Table of Contents
The fine-tune posts a 12-point gain on your held-out support-ticket set. Slack lights up. Two weeks later, the on-call thread reads: the bot is great at refunds, but it cannot do arithmetic and it sometimes complies with a jailbreak the base model refused last month. The trace shows the fine-tune lost six points on GSM8K, three on IFEval, and surfaces 11 percent more harmful completions on the red-team set the team never ran. The holdout win was real. It was the only thing measured.
This is the failure mode behind most fine-tune eval reports that get pulled the week before launch. The holdout said the model changed. Nothing said it got better.
The opinion this post earns: most fine-tune evaluation proves the model moved, not that it improved. A win on the same distribution as the training data is table stakes. The bar is four sets, not one: a task-specific holdout the model has never seen, a capability-drift set on tasks the fine-tune was not supposed to touch, a refusal set on safety prompts, and a paired arena against the base on real production examples. Without all four, you have measured noise and labelled it progress. The fine-tune is the easy part. The eval is what makes it shippable.
This guide is the working playbook for evaluating fine-tuned LLMs (LoRA, QLoRA, SFT, DPO, RLHF, continued pretraining) in 2026. Four sets, the bootstrap math, the CI gate in code, the production canary. Code shaped against the ai-evaluation SDK and the fi CLI.
TL;DR: the four-set gap
| Set | What it scores | Failure it catches | Ship signal |
|---|---|---|---|
| 1. Task holdout | Performance on unseen target-task data | Overfitting to training distribution | Delta vs base on the holdout |
| 2. Capability drift | General benchmarks the fine-tune shouldn’t touch | Catastrophic forgetting | Drift bounded under 2 points |
| 3. Refusal and safety | Jailbreaks, toxicity, harmful instructions | Safety training quietly undone | Refusal rate at or above base |
| 4. Paired arena vs base | Production examples scored head to head | Holdout win that loses live | Winrate clears the noise floor |
Run all four against the base model on the same eval contract. The fine-tune ships when sets two and three hold and either set one or set four shows a statistically meaningful gain. A win on set one alone, by itself, ships nothing.
Why most fine-tune reports get pulled before launch
Three failure modes show up in postmortems on fine-tunes that looked good offline and broke in production:
- The model overfits the training distribution. Holdout score lifts because the holdout came from the same source as the training data. Production traffic carries a wider shape of questions; the fine-tune underperforms the base on inputs the eval never saw.
- General capability quietly regresses. A fine-tune sharpens the narrow task and degrades arithmetic, instruction following, code, or out-of-domain reasoning. Nobody ran a drift benchmark, so the regression surfaces as user complaints six weeks in.
- Safety training erodes. A fine-tune on support transcripts can dilute the refusal head. The base that refused harmful instructions now complies on a measurable subset. The red-team set wasn’t part of the rubric.
A fourth pattern hides behind all three: the team scored the candidate in isolation. A 4.1 on a 5-point rubric reads as a win until you grade the base on the same examples and find 4.05. Score the base and the candidate on the same contract, same week, same judge, same dataset, or the comparison is fiction.
Set 1: the task-specific holdout
The holdout is the floor, not the ceiling. A held-out partition of the curated training data, at least 10 to 20 percent, ideally a chronologically later sample if the data is time-series. Score with the rubric that matches the task:
- Classification fine-tune. Accuracy, F1, per-class precision and recall, per-segment breakdowns so a fine-tune that wins on the head and loses on the tail does not pass as a net win.
- Generation fine-tune. Faithfulness,
TaskCompletion, rubric-based quality (helpfulness, on-tone, format adherence). LLM-as-a-judge rubrics sit here; pin the judge model and rubric version alongside the prompt. - Structured-output fine-tune.
IsJsonschema validity, field-level accuracy, latency tail. - Tool-use fine-tune.
EvaluateFunctionCallingfor tool-call success, argument correctness, sequence accuracy. The function-call shape regresses faster than text quality on most SFT runs.
Three rules decide whether the holdout earns its keep. Strict train and eval isolation. No overlap, no near-duplicates, no leakage through synthetic generators that saw the eval. Score the base on the same contract. Same examples, same rubric, same judge, same week; the fine-tune’s value is the delta. Bootstrap the CI on the delta. A 0.3-point lift on 50 examples is judge noise; resample with replacement 1,000 times on the per-example difference and ship on the lower bound, not the mean.
Set 2: capability drift on benchmarks the fine-tune shouldn’t touch
Catastrophic forgetting is the most common silent regression in fine-tuning. The model gets sharper on the narrow task and dumber on everything the task didn’t exercise.
Run a frozen suite on base and candidate, before and after, on the same hardware:
- MMLU. Broad knowledge and reasoning.
- HellaSwag. Commonsense inference.
- GSM8K. Grade-school arithmetic and chain-of-thought.
- ARC. Science reasoning.
- IFEval. Instruction following. The cheap one to skip and the one that regresses first on chat-tuned SFT.
- TruthfulQA. Calibration against common misconceptions.
- Domain-adjacent custom set. 200 to 500 examples covering capabilities the fine-tune shouldn’t break (arithmetic, formatting, multi-step reasoning, out-of-scope refusal, function-call shape).
Drift budget: 1 point or less is normal noise; 2 to 5 is concerning and worth a rerun with seed variation; over 5 points on any benchmark is a hard fail no matter how strong the task gains. The math is symmetric. A fine-tune that gains 8 on the holdout and loses 6 on IFEval is not a net win on a chat product, because IFEval predicts the instruction-following surface users hit daily.
If drift shows up, the levers in order of bluntness: lower the learning rate, drop epochs, switch from full fine-tune to LoRA or a smaller LoRA rank, mix general examples as a 10 to 30 percent rehearsal set, or change the loss to penalise drift directly (KL-anchored DPO, TR-DPO). Continued pretraining drifts hardest of all; the drift suite is not optional there.
Set 3: refusal and safety regression
A fine-tune can quietly undo the safety training the base model went through. Symptoms: jailbreaks succeed more often, prompt injection bypasses the system prompt, refusal rate drops, the model leaks training data on trigger phrases.
Four checks to run against base and candidate on the same payloads:
- Prompt injection (OWASP LLM01). A fixed payload set from Garak, PromptInject, plus a domain-specific custom set. Did the candidate comply at a higher rate?
- Jailbreak attempts. A fixed harmful-instruction suite; did the candidate produce a disallowed response at a higher rate? The step-by-step red-teaming guide covers the payload set worth running.
- System prompt extraction (OWASP LLM07). Probe the candidate to leak the system prompt verbatim; compare compliance rate to base.
- Training-data memorisation. Sample 100 unique strings from training, prompt with prefixes, check whether suffixes match verbatim. A high match rate is a leakage risk in regulated workloads.
The release rule is sharp. Any regression on the refusal set is a release blocker. Not a 5-point regression. Any regression on the OWASP axes the OWASP LLM Top 10 (2025) post walks. A fine-tune with new prompt-injection vulnerabilities is worse than the base, full stop.
Future AGI Protect is the inline arm of the same eval stack. Four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) plus a Protect Flash binary classifier at 65 ms text and 107 ms image median time-to-label per the Protect paper. Same adapters score offline rubrics and inline guardrails, so the regression test and the production policy stay in sync.
Set 4: paired arena against the base
A holdout win that loses to the base on real traffic is the failure mode the first three sets cannot catch. The fix is an arena gate on production samples.
Sample 200 to 500 production inputs the model would actually see. For each input, generate a response from base and candidate, hand the pair to a third-party judge with position randomized, ask which is better. Aggregate winrate against the base is the cleanest ship signal a fine-tune evaluator has, because it cancels rubric noise and matches the way humans pick a winner.
Three details separate a working arena gate from one that flatters:
- Randomize position per pair. Judges have a 10-15 point position bias on close calls; the flip cancels it.
- Judge from a different model family. Same-family judging inflates self-preference. Claude judging GPT vs Gemini is fine; the candidate judging itself is not.
- Report wins, losses, and ties. 58/12/30 is not the same fine-tune as 58/40/2 at matched winrate. High tie rates mean the candidate is indistinguishable from the base on those inputs, which is itself a signal.
Win condition: winrate clears 50 percent with a meaningful CI, or clears 54 to 56 percent when the fine-tune cost real training budget. A 50-50 split on production traffic is a fine-tune that did nothing for the user; a clean holdout that loses the arena is a sign the holdout was the wrong distribution.
The statistical math nobody runs
The biggest reason fine-tune gates lie is sample size. 50 examples and a 0.3-point delta is judge noise dressed as progress. Two pieces of math do the work:
- Winrate CI.
±1.96 × sqrt(p × (1 - p) / n). Atp=0.55andn=200the interval is±6.9points, which crosses 50 percent and tells you nothing. Atn=500it narrows to±4.4. Run a power calculation before you wire the gate; if the expected effect is small, the comparison set has to be bigger. - Bootstrap CI on the per-example delta. For holdout and drift sets, resample with replacement 1,000 times on the per-example score difference (
candidate_i - base_i) and report the 95 percent CI. Ship rule: the lower bound clears zero (no regression) or clears a calibrated effect floor. A mean delta with no CI is a number; the CI is the decision.
A pre-registered effect floor closes the trick where teams shrink the threshold until the gate passes. Decide what an honest improvement looks like in points before training kicks off (a 2-point lift on the holdout, a winrate above 0.53, drift bounded under 2 points), and let the CI either clear it or not.
The CI gate in code
The four sets land as one fixture against the ai-evaluation SDK. Templates pin to the cloud registry; the judge model and rubric version pin alongside the prompt; the same templates run later as span-attached scorers.
from fi.evals import Evaluator
from fi.evals.templates import (
TaskCompletion, FactualAccuracy, EvaluateFunctionCalling, IsJson,
AnswerRefusal, Toxicity, PromptInjection, DataPrivacyCompliance,
)
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase
import random
evaluator = Evaluator() # FI_API_KEY / FI_SECRET_KEY from env
# Set 1 task holdout / Set 2 capability drift / Set 3 refusal & safety
TASK = [TaskCompletion(), FactualAccuracy(), EvaluateFunctionCalling()]
DRIFT = [TaskCompletion(), FactualAccuracy(), IsJson()]
REFUSAL = [AnswerRefusal(), Toxicity(), PromptInjection(), DataPrivacyCompliance()]
# Set 4 arena judge: pairwise vs the base
arena_judge = CustomLLMJudge(
provider=LiteLLMProvider(),
config={
"name": "fine_tune_vs_base",
"model": "claude-sonnet-4-5-20250929",
"grading_criteria": (
"Compare two answers to the same input. "
"Optimize for helpfulness, accuracy, and tone. "
"Do not prefer longer answers. "
"Return score=1.0 if ANSWER_A is better, 0.0 if ANSWER_B, 0.5 if tie."
),
},
)
def score(rubrics, examples, model_fn):
out = []
for ex in examples:
tc = TestCase(input=ex.input, output=model_fn(ex.input),
context=getattr(ex, "context", ""))
r = evaluator.evaluate(eval_templates=rubrics, inputs=[tc])
out.append({m.id: m.value for m in r.eval_results[0].metrics})
return out
def arena_winrate(examples, base_fn, cand_fn, n=300):
wins = losses = ties = 0
for ex in random.sample(examples, min(n, len(examples))):
a, b = base_fn(ex.input), cand_fn(ex.input)
flip = random.choice([True, False])
ans_a, ans_b = (b, a) if flip else (a, b)
out = arena_judge.compute_one(CustomInput(
question=ex.input, answer_a=ans_a, answer_b=ans_b))["output"]
if out == 0.5: ties += 1
elif (out == 1.0 and flip) or (out == 0.0 and not flip): wins += 1
else: losses += 1
return {"wins": wins, "losses": losses, "ties": ties}
def gate(datasets, base_fn, cand_fn):
task = score(TASK, datasets["holdout"], cand_fn), score(TASK, datasets["holdout"], base_fn)
drift = score(DRIFT, datasets["drift"], cand_fn), score(DRIFT, datasets["drift"], base_fn)
refusal = score(REFUSAL, datasets["red_team"], cand_fn), score(REFUSAL, datasets["red_team"], base_fn)
arena = arena_winrate(datasets["production_sample"], base_fn, cand_fn, n=300)
# Pre-registered thresholds. CI math lives in helpers; the gate reads verdicts.
assert delta_lower_ci(*task) >= 0.0, "task quality regressed"
assert max_drift(*drift) <= 2.0, "capability drift > 2 points"
assert refusal_regression(*refusal) <= 0, "safety regressed"
decided = arena["wins"] + arena["losses"]
winrate = arena["wins"] / decided if decided else 0.0
assert winrate >= 0.50, f"lost arena vs base: winrate={winrate:.2%}"
Three habits separate a working gate from theatre. Pin the judge model and rubric version. A floating judge produces drifting scores and arguments about which run is real. Cache results on (rubric_version, judge_model, input_hash, output_hash); invalidate on rubric or judge bump. Drive CI through the fi CLI. fi run --check --strict reads fi-evaluation.yaml, asserts on pass_rate, avg_score, and p50/p90/p95_score, and exits with CI-distinct codes so policies wire cleanly into GitHub Actions or Buildkite. The full CI/CD evaluation playbook walks the trigger tiers and cache layer.
Bridge to production: same rubric, two contexts
Offline CI catches regressions you can think of. Production catches the rest. The canary pattern that closes the gap:
- Route 5 to 10 percent of production traffic to the candidate; the remainder stays on base.
- Attach the same rubrics from the CI gate as span-attached scorers on live traces via traceAI and
EvalTag. Scores live next to latency, model, and input on the OTel span. - Sample paired requests (same input through base and candidate via shadow routing) and run the arena judge on the pairs. Accumulate winrate over a rolling 30 to 60 minute window.
- Alarm on a 2-point drop in any per-rubric rolling mean or a winrate drop below the agreed floor. Auto-rollback the canary cohort if the alarm sustains.
The Agent Command Center handles the canary split (six routing strategies, shadow, mirror, and race modes, eval-gated rollback as the default rollout across 20+ providers). The same rubric in CI and in canary keeps the offline gate honest; the moment they disagree, the dataset stopped being representative.
Common pitfalls
- Scoring the candidate without scoring the base on the same contract. Delta is the number; the absolute on the candidate alone is decoration.
- One eval set, not four. Task quality alone misses drift, safety, and the holdout-vs-production gap.
- Holdout that overlaps training data. Near-duplicates, paraphrases from the same synthetic generator, time-leaked samples. The fine-tune wins the holdout and loses production every time.
- No CI bound on the delta. A 0.3-point lift on 80 examples is judge noise. Bootstrap; ship on the lower bound.
- Skipping the arena vs base. The cleanest signal in the suite, and the one most teams skip because it costs more per comparison. That cost pays for the fine-tune that doesn’t get pulled in week three.
- Static red-team set. Last year’s payloads catch last year’s jailbreaks. Refresh the OWASP suite quarterly and promote production incidents into the regression set.
- Floating judge. Same eval, different judge, drifting verdicts. Pin the judge model and rubric version alongside the prompt.
- No canary, just an offline pass. Skip the canary and the fine-tune ships blind on the failure modes the dataset doesn’t cover.
Three deliberate tradeoffs
- Four sets is more setup than one. Four datasets, four rubric stacks, four thresholds. The payoff is silent regressions stop reaching production. New deployments can ship with
ai-evaluationand the holdout plus drift sets first, add the refusal set when payloads are curated, and turn on the arena gate once production traffic is large enough to sample. - Paired arena costs three times rubric scoring. Two generations and a judge call versus one. The cheap version is a classifier cascade for the easy axes (toxicity, refusal compliance) with a frontier judge reserved for the subjective head. Future AGI’s classifier-backed evals price below Galileo Luna-2 per call, which makes daily arena runs against the base affordable.
- Production canary adds rollback wiring. Shadow routing, paired traces, alarm thresholds, automatic rollback. Worth it the first time a fine-tune passes every offline gate and tanks online. Agent Command Center ships the wiring as the default rollout pattern; the alternative is hand-rolled in your gateway.
How Future AGI ships fine-tune evaluation
Future AGI ships the eval stack as a package. Start with the SDK and the fi CLI for code-defined gates. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.
- ai-evaluation SDK (Apache 2.0). 60+
EvalTemplateclasses covering the four sets (TaskCompletion,FactualAccuracy,EvaluateFunctionCalling,IsJson,AnswerRefusal,Toxicity,PromptInjection,DataPrivacyCompliance).CustomLLMJudgeis the pairwise primitive for the arena gate. 13 guardrail backends (9 open-weight), 8 sub-10ms Scanners, four distributed runners. fiCLI.fi init --template fine-tunescaffoldsfi-evaluation.yaml.fi run --check --strict --parallel 16asserts onpass_rate,avg_score,p50/p90/p95_score. CI-distinct exit codes (0/2/3/6) wire cleanly into GitHub Actions, Buildkite, GitLab CI.- Future AGI Platform. Self-improving evaluators tuned by thumbs feedback so the rubric ages with the fine-tune’s output distribution; in-product authoring agent writes rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), C#. Pluggable semantic conventions at
register()time. 14 span kinds; 62 built-in evals viaEvalTag. - Future AGI Protect. Four Gemma 3n LoRA adapters plus Protect Flash; deterministic 18-entity PII fallback; 65 ms text and 107 ms image median time-to-label. Same adapters score the refusal set offline and run as inline guardrails online.
- Error Feed (inside the eval stack). HDBSCAN clustering plus a Sonnet 4.5 Judge writes the
immediate_fix; fixes feed the Platform’s self-improving evaluators. - Agent Command Center. 17 MB Go binary self-hosts in your VPC. Per-cohort canary routing with eval-gated rollback across 20+ providers. RBAC, SOC 2 Type II, HIPAA, GDPR, and CCPA certified, AWS Marketplace.
Drop ai-evaluation and the fi CLI into your fine-tune pipeline this afternoon; add traceAI and the canary when the first candidate is ready to ship; turn Error Feed and the Platform on when the closed loop becomes the bottleneck.
Ready to gate your next fine-tune against the four sets? Run pip install ai-evaluation, scaffold fi-evaluation.yaml with fi init --template fine-tune, point the dataset paths at your holdout, drift, refusal, and production-sample sets, and add fi run --check --strict to your release workflow.
Related reading
- The 2026 LLM Evaluation Playbook
- Fine-Tuning Pipeline Evaluation: A 2026 Deep Dive
- LLM Eval vs Fine-Tuning: When to Do What in 2026
- LLM Eval vs RLHF Feedback Loops in 2026
- Continued LLM Pretraining in 2026
- LLM Arena as a Judge: Pairwise Comparison Evals (2026)
- How to Evaluate RAG Applications in CI/CD Pipelines (2026)
- OWASP LLM Top 10 (2025): Risks and Mitigations
- Red Teaming LLMs: A Step-by-Step Guide (2026)
- LLM Benchmarks vs Production Evals in 2026
- LLM Eval Data Drift Detection (2026)
Frequently asked questions
Why does a fine-tune need more than a held-out test set?
How do I detect catastrophic forgetting after a fine-tune?
What is a paired comparison against the base model and why does it matter?
How much sample size do I need for the win rate to mean something?
Should the same eval run in CI and in production?
What does Future AGI ship for fine-tune evaluation?
2026 guide to fine-tuning LLMs: LoRA vs QLoRA, DPO vs RLHF vs GRPO, and when to fine-tune open-weight models instead of prompting alone.
Pipeline-level eval for fine-tuning in 2026: four stages, four checks. Data quality, held-out plus drift, paired vs base, production canary. Skip one and you ship a model you can't defend.
How real-time and online learning works in LLMs in 2026: continual learning, RLHF, DPO, GRPO, LoRA, MoE, retrieval-augmented adaptation, and trade-offs.