Fine-Tuning Pipeline Evaluation: A 2026 Deep Dive
Pipeline-level eval for fine-tuning in 2026: four stages, four checks. Data quality, held-out plus drift, paired vs base, production canary. Skip one and you ship a model you can't defend.
You wrap up a Llama 3 fine-tune for a medical summarization workflow. The training loss curve looks textbook: smooth decay, no spikes, low final value. You ship. Two weeks in, a customer reports the model refuses to summarize anything with the word “patient” in it. Your data team digs into the corpus and finds 12,000 rows of safety-training boilerplate accidentally mixed in from an upstream HuggingFace dataset. The loss curve never flagged it. Your held-out test set never flagged it. The artifact, scored against the rubric you built three weeks ago, looked fine.
The artifact was downstream of a broken pipeline, and only the pipeline could have told you. Fine-tune pipeline evaluation scores every stage that produces the model. The final checkpoint is the last gate, never the only one. The companion post on evaluating fine-tuned LLMs covers the four-set rubric you run on the artifact. This one goes upstream.
The opinion this post earns: a fine-tune pipeline is four stages and four checks. Data quality before training. Held-out plus capability drift after training. Paired comparison against the base on production samples. Production canary against the current model on live traffic. Skip a stage and you ship a model you can’t defend. Each stage catches a class of failures the others miss, and each costs less than the recovery for the failure it prevents.
The four-stage gate
Each stage has one job, one place in the pipeline where it runs, and one failure mode if you skip it.
| Stage | When it runs | What it scores | Failure if skipped |
|---|---|---|---|
| 1. Data quality | Before tokenization | Deduplication, leakage, label noise, injection | Poisoned rows reach the weights; full re-run |
| 2. Held-out + drift | After each checkpoint | Task quality on unseen data + capability drift on frozen suite | Catastrophic forgetting ships silently |
| 3. Paired vs base | Before promotion | Winrate against the base model on production samples | Held-out win that loses live |
| 4. Production canary | After promotion | Live traffic against the current model | Drift in the real distribution; rollback panic |
Run the four in order. Stage 1 stops training before it starts. Stage 2 stops the run before promotion. Stage 3 holds promotion. Stage 4 triggers auto-rollback through the routing layer. Each gate is a few extra eval calls; skipping any one is the pattern that fills postmortems.
Stage 1: data quality before training
Your fine-tuning corpus determines what the model learns. A clean corpus produces a model that generalizes. A dirty corpus produces a model that memorizes label noise, leaks credentials, and inherits whatever toxicity the upstream filter missed. Audit before tokenization, not after weights are baked.
Three checks earn their keep.
Deduplication, including the leakage path. Exact-duplicate removal is table stakes. The trickier surface is near-duplicates and synthetic-generator paraphrases that leak into your held-out partition. If the same record appears in training and eval under different surface forms, your fine-tune wins the holdout because it memorized, not because it learned. MinHash-LSH at Jaccard 0.85 catches the obvious cases; an embedding-cosine pass at 0.95 catches the rephrased ones. Run the dedup pass before you split, not after.
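A minimal sketch of the near-duplicate pass, run before the train/eval split. It computes exact Jaccard similarity on word shingles, which is a stand-in for MinHash-LSH and only practical on small corpora because it compares every row against every kept row. The function names, the shingle size `k`, and the sample rows are illustrative assumptions; only the 0.85 threshold comes from the text above.

```python
import re

def shingles(text: str, k: int = 3) -> frozenset:
    """Set of k-word shingles over lowercased word tokens (k is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return frozenset([" ".join(words)])
    return frozenset(" ".join(words[i:i + k]) for i in range(len(words) - k + 1))

def jaccard(a: frozenset, b: frozenset) -> float:
    """Exact Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedup_near_duplicates(texts, threshold: float = 0.85, k: int = 3):
    """Keep the first occurrence of each near-duplicate group.

    Quadratic in the number of kept rows; MinHash-LSH replaces this
    loop with approximate bucketing on a large corpus.
    """
    kept, kept_shingles = [], []
    for text in texts:
        s = shingles(text, k)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # near-duplicate of a row already kept
        kept.append(text)
        kept_shingles.append(s)
    return kept

# Hypothetical rows: the second differs from the first only in case and punctuation.
rows = [
    "Summarize the discharge note for the patient.",
    "summarize the discharge note for the patient",
    "List the medications in the chart.",
]
deduped = dedup_near_duplicates(rows)
```

Split `deduped` into train and held-out only after this pass, so a paraphrase cannot sit on both sides of the split. Reworded paraphrases share few shingles, which is why the text pairs this check with an embedding-cosine pass.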
Label noise. Sample 200 to 500 rows. Re-label them by hand or with a separate judge. If inter-rater agreement with the original label is below 90 percent, the corpus is noisier than the model can absorb cleanly. Either tighten the annotation rubric and re-label, or drop the noisy subset. A fine-tune trained on noisy labels learns the noise as signal and drifts on inputs the noise didn’t cover.
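The agreement check can be sketched in a few lines. This assumes labels are directly comparable values (strings or ints) and that the relabeled list is aligned with the sampled originals; the function names and the fixed seed are hypothetical, and only the 200 to 500 sample range and the 90 percent floor come from the text.

```python
import random

def sample_for_relabel(rows, n: int = 300, seed: int = 0):
    """Draw a reproducible random sample of rows to re-label."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

def label_agreement(original, relabeled) -> float:
    """Fraction of sampled rows where the new label matches the original."""
    if not original or len(original) != len(relabeled):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(o == r for o, r in zip(original, relabeled))
    return matches / len(original)

def corpus_is_too_noisy(original, relabeled, floor: float = 0.90) -> bool:
    """True when agreement falls below the floor."""
    return label_agreement(original, relabeled) < floor
```

Raw agreement overstates quality on heavily imbalanced label sets, so a chance-corrected statistic may be a better fit there; that choice is outside what the text specifies.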
Leakage, PII, and injection payloads. Any corpus drawn from logs, support tickets, or scraped web text carries API keys, customer emails, jailbreak attempts, prompt-injection strings, and zero-width Unicode that hides payloads inside otherwise-clean strings. The ai-evaluation SDK ships Scanners that run sub-10 ms per row, suitable for streaming a million-row corpus:
```python
from fi.evals.guardrails.scanners import (
    JailbreakScanner, SecretsScanner, MaliciousURLScanner,
    RegexScanner, InvisibleCharScanner, CodeInjectionScanner,
    TopicRestrictionScanner,
)

# One scanner per failure class; each runs on every row.
scanners = [
    JailbreakScanner(),
    SecretsScanner(),
    MaliciousURLScanner(),
    RegexScanner(patterns=["pii_email", "pii_phone"]),
    InvisibleCharScanner(),
    CodeInjectionScanner(),
]

def audit_row(row):
    """Return the names of the scanners that flagged this row."""
    flagged = []
    text = row["prompt"] + "\n" + row["completion"]
    for s in scanners:
        if not s.scan(text).passed:
            flagged.append(s.__class__.__name__)
    return flagged

# Keep only rows that no scanner flagged.
clean_corpus = [r for r in raw_corpus if not audit_row(r)]
```
The cost of dropping flagged rows is days of upstream work. The cost of training on them is a fine-tune that has to be re-run from scratch. Log distributional skew alongside the Scanner pass: per-class counts, per-source counts, length distribution. Imbalance above 10 to 1 between classes biases the fine-tune toward the majority. Ship a one-page corpus report with the run config so whoever inherits the model in six months can see what went in.
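A rough sketch of the skew log that could sit next to the Scanner pass. The `label` and `source` keys are hypothetical column names, and length is measured in characters rather than tokens to keep the example free of a tokenizer dependency; the 10 to 1 imbalance threshold is the one named above.

```python
from collections import Counter
from statistics import median

def corpus_report(rows, max_ratio: float = 10.0) -> dict:
    """Summarize per-class counts, per-source counts, and length distribution."""
    if not rows:
        raise ValueError("corpus is empty")
    class_counts = Counter(r["label"] for r in rows)    # hypothetical key
    source_counts = Counter(r["source"] for r in rows)  # hypothetical key
    lengths = [len(r["prompt"]) + len(r["completion"]) for r in rows]
    ratio = max(class_counts.values()) / min(class_counts.values())
    return {
        "rows": len(rows),
        "class_counts": dict(class_counts),
        "source_counts": dict(source_counts),
        "length_min": min(lengths),
        "length_median": median(lengths),
        "length_max": max(lengths),
        "class_imbalance_ratio": ratio,
        "imbalanced": ratio > max_ratio,
    }

# Hypothetical corpus: 11 rows of one class, 1 row of another.
example_rows = [
    {"prompt": "p", "completion": "c", "label": "summary", "source": "tickets"}
    for _ in range(11)
] + [{"prompt": "pp", "completion": "cc", "label": "refusal", "source": "logs"}]
report = corpus_report(example_rows)
```

Serializing the returned dict next to the run config is one way to produce the one-page corpus report.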
Stage 2: held-out plus capability drift
The held-out set is the floor, not the ceiling. Hold out 10 to 20 percent of the curated training data, ideally a chronologically later sample if the data is time-series, and score it against the rubric that matches the task. The mistake teams make is treating it as the whole story. A fine-tune that wins the held-out and loses base capabilities is a regression no matter how green the task-specific dashboard looks.
Two scores run on every checkpoint, side by side.
Held-out task quality. Run the same template suite you plan to ship with every N steps (5,000 is a reasonable default). The first checkpoint where loss keeps falling while held-out scores plateau is your real early-stop, not the loss minimum.
```python
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextAdherence, TaskCompletion,
    LLMFunctionCalling, AnswerRefusal,
)
from fi.testcases import TestCase

evaluator = Evaluator()

def per_checkpoint_score(model, holdout):
    """Score one checkpoint on the held-out set with the shipping template suite."""
    # Generate a fresh output from the checkpoint for every held-out row.
    inputs = [
        TestCase(
            input=row["prompt"],
            output=model.generate(row["prompt"]),
            context=row.get("context"),
            expected_output=row["reference"],
        )
        for row in holdout
    ]
    return evaluator.evaluate(
        eval_templates=[
            Groundedness(),
            ContextAdherence(),
            TaskCompletion(),
            LLMFunctionCalling(),
            AnswerRefusal(),
        ],
        inputs=inputs,
    )
```
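The early-stop rule described above (loss still falling, held-out scores flat) might be detected like this. It is a sketch under assumptions: the per-checkpoint history is a list of `(step, train_loss, heldout_score)` tuples, and `min_delta` is a hypothetical tolerance for what counts as a plateau. Neither the data shape nor the tolerance comes from the SDK.

```python
def early_stop_step(history, min_delta: float = 0.005):
    """Return the first step where loss still falls but held-out gain stalls.

    history: list of (step, train_loss, heldout_score), in training order.
    Returns None when no such checkpoint exists.
    """
    if not history:
        return None
    best_score = history[0][2]
    for prev, cur in zip(history, history[1:]):
        step, loss, score = cur
        loss_falling = loss < prev[1]
        plateaued = (score - best_score) < min_delta
        if loss_falling and plateaued:
            return step
        best_score = max(best_score, score)
    return None

# Hypothetical run: held-out score stops improving at step 15,000
# while training loss keeps dropping.
run = [
    (5_000, 1.20, 0.600),
    (10_000, 0.90, 0.700),
    (15_000, 0.70, 0.702),
    (20_000, 0.60, 0.700),
]
stop_at = early_stop_step(run)
```

A production version would likely require the plateau to hold for several consecutive checkpoints before stopping, since a single flat checkpoint can be noise.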
The four distributed runners (Celery, Ray, Temporal, Kubernetes) let you co-locate the eval pass with the training job. Ray is the natural pairing with Anyscale-style clusters; Temporal fits teams already using it for orchestration; Celery is the lowest-friction drop-in.
Capability drift on a frozen memory suite. Catastrophic forgetting is the most common silent regression in fine-tuning. The model gets sharper on the narrow task and dumber on everything the task didn’t exercise. Keep a frozen memory test set, separate from both training and the task-specific held-out, that covers tasks the base model handles well: arithmetic and chain-of-thought (GSM8K-style), format compliance (JSON, Markdown, tables), multi-turn instruction following, refusal of out-of-scope requests, and a handful of domain-adjacent tasks. Score the candidate at every checkpoint and plot memory-set scores alongside task-specific scores.
A drift of 1 point or less is normal noise. 2 to 5 is concerning and worth a rerun with seed variation. More than 5 points on any axis is a hard fail no matter how strong the task gains. The levers, in order of bluntness: lower learning rate, fewer epochs, LoRA instead of full fine-tune, smaller rank, or a rehearsal mix at 10 to 30 percent of training data drawn from a general corpus. Continued pretraining drifts hardest of all; the drift suite is not optional there.
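Those thresholds translate into a small gate. This sketch assumes scores are on a 0 to 100 scale and keyed by axis name; the axis names are made up. The text leaves drops between 1 and 2 points unclassified, so the sketch treats anything above 1 and up to 5 as concerning, which is an assumption rather than a rule from the source.

```python
def classify_drift(base_scores: dict, cand_scores: dict) -> dict:
    """Label each memory-suite axis by how far the candidate dropped from the base."""
    verdicts = {}
    for axis, base in base_scores.items():
        drop = base - cand_scores[axis]  # positive means the candidate got worse
        if drop > 5.0:
            verdicts[axis] = "fail"
        elif drop > 1.0:
            verdicts[axis] = "concerning"  # assumed to include the 1-2 point gap
        else:
            verdicts[axis] = "noise"       # includes improvements
    return verdicts

def drift_gate_passes(base_scores: dict, cand_scores: dict) -> bool:
    """Hard-fail the checkpoint if any single axis drops more than 5 points."""
    return all(v != "fail" for v in classify_drift(base_scores, cand_scores).values())

# Hypothetical memory-suite scores for the base model and one checkpoint.
base = {"arithmetic": 80.0, "json_format": 95.0, "refusal": 90.0}
cand = {"arithmetic": 79.5, "json_format": 92.0, "refusal": 83.0}
verdicts = classify_drift(base, cand)
```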
Hyperparameter sweeps belong inside this stage. The agent-opt library exposes BayesianSearchOptimizer (Optuna-backed, resumable across runs) with an EarlyStoppingConfig that cuts trials whose held-out scores track below the current best by a configurable margin. Trial scores come from the same template suite you use for per-checkpoint scoring, so the sweep optimizes the metric you actually ship on, not a proxy. Detail in automated optimization for agents.
Stage 3: paired comparison against the base
A held-out win that loses to the base on real traffic is the failure mode the first two stages cannot catch. Held-out scores measure the fine-tune against itself. Paired comparison measures it against the alternative you would actually ship if the fine-tune failed.
The pattern is an arena gate on production samples. Sample 200 to 500 production inputs the model would actually see. For each input, generate a response from both base and candidate, hand the pair to a third-party judge with position randomized, ask which is better. Aggregate winrate against the base, reported with wins, losses, and ties separately, is the cleanest single number a fine-tune evaluator has. It cancels rubric noise between runs and matches how humans pick a winner.
```python
import random

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

arena_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "fine_tune_vs_base",
        "model": "claude-sonnet-4-5-20250929",
        "grading_criteria": (
            "Compare two answers to the same input. "
            "Optimize for helpfulness, accuracy, tone. "
            "Do not prefer longer answers. "
            "Return 1.0 if ANSWER_A is better, 0.0 if ANSWER_B, 0.5 if tie."
        ),
    },
)

def arena_winrate(examples, base_fn, cand_fn, n=300):
    """Count candidate wins, losses, and ties against the base on sampled inputs."""
    wins = losses = ties = 0
    for ex in random.sample(examples, min(n, len(examples))):
        a, b = base_fn(ex.input), cand_fn(ex.input)
        # Randomize position: when flip is True the candidate is shown as ANSWER_A.
        flip = random.choice([True, False])
        ans_a, ans_b = (b, a) if flip else (a, b)
        out = arena_judge.compute_one(CustomInput(
            question=ex.input, answer_a=ans_a, answer_b=ans_b))["output"]
        if out == 0.5:
            ties += 1
        elif (out == 1.0 and flip) or (out == 0.0 and not flip):
            wins += 1  # the judge picked whichever slot held the candidate
        else:
            losses += 1
    return {"wins": wins, "losses": losses, "ties": ties}
```
Three details separate a working arena gate from one that flatters. Randomize position per pair. Judges have a 10-15 point position bias on close calls; the flip cancels it. Judge from a different model family. Same-family judging inflates self-preference. Claude judging GPT vs Gemini is fine; the candidate judging itself is not. Report wins, losses, and ties. 58/12/30 is not the same fine-tune as 58/40/2 at matched winrate. High tie rates mean the candidate is indistinguishable from the base on those inputs, which is itself a signal.
Sample size determines whether the verdict separates from noise. The 95 percent confidence interval on a winrate p with n comparisons is roughly ±1.96 × √(p(1 − p)/n). At p = 0.55 and n = 200 that’s ±6.9 points, which crosses 50 percent. A decisive winner (60 percent or more) stabilises around 100 to 150 comparisons; a close call (52 to 58 percent) needs 300 to 500. Decide the effect floor before training kicks off; shrinking the threshold after the fact is the most common way teams lie to themselves about fine-tune wins.
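The interval arithmetic is easy to check in code. This is a sketch of the normal-approximation formula from the paragraph above; the function names are illustrative, and `comparisons_needed` simply inverts the same formula to estimate how many comparisons shrink the half-width to a chosen margin.

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval on a winrate."""
    return z * math.sqrt(p * (1.0 - p) / n)

def comparisons_needed(p: float, margin: float, z: float = 1.96) -> int:
    """Smallest n whose half-width is at most `margin` under the same approximation."""
    return math.ceil((z ** 2) * p * (1.0 - p) / (margin ** 2))

half = ci_half_width(0.55, 200)          # about 0.069, i.e. plus or minus 6.9 points
crosses_fifty = (0.55 - half) < 0.50     # True: the interval includes 50 percent
n_for_55 = comparisons_needed(0.55, 0.05)  # margin chosen so the interval just clears 50
```

The normal approximation gets loose at small n or winrates near 0 or 1; a bootstrap or Wilson interval is a reasonable substitute in those cases.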
Win condition: the winrate clears 50 percent with a confidence interval that excludes it, and clears a higher floor of 54 to 56 percent when the fine-tune cost real training budget. A 50-50 split is a fine-tune that did nothing for the user. A clean held-out that loses the arena is a sign the held-out was the wrong distribution.
Stage 4: the production canary
Stages 1 through 3 catch failure modes you can think of and curate examples for. Stage 4 catches the rest: a user phrasing the request differently, a tool-call schema the fine-tune trained on a stale version of, a quantization tier that costs eight points on long-context groundedness in production but looked fine on the 4K dev set, a safety regression that only fires on payloads not in the OWASP suite.
The canary pattern:
- Route 5 to 10 percent of production traffic to the candidate; the remainder stays on the current model (base or previous fine-tune).
- Attach the same rubrics from stage 2 and stage 3 as span-attached scorers on live traces via traceAI and `EvalTag`. Scores live next to latency, model, and input on the OTel span.
- Sample paired requests through shadow routing so the same input lands on base and candidate. Run the arena judge on the pairs. Accumulate winrate over a rolling 30 to 60 minute window.
- Alarm on a 2-point drop in any per-rubric rolling mean or a winrate drop below the agreed floor. Auto-rollback the canary cohort if the alarm sustains.
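The alarm rule in the last step could look roughly like the following. It is a standalone sketch, not the Agent Command Center's rollback logic: the class name is hypothetical, the window is count-based rather than the 30 to 60 minute time window described above, and `sustain` (how many consecutive breaches count as "sustained") is an assumed parameter.

```python
from collections import deque

class RollingRubricAlarm:
    """Fire when the rolling mean of one rubric stays max_drop below baseline."""

    def __init__(self, baseline: float, window: int = 200,
                 max_drop: float = 2.0, sustain: int = 3):
        self.baseline = baseline
        self.max_drop = max_drop
        self.sustain = sustain
        self.scores = deque(maxlen=window)  # oldest score falls out automatically
        self.breaches = 0

    def observe(self, score: float) -> bool:
        """Record one live score; return True when rollback should trigger."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough traffic yet to trust the mean
        rolling_mean = sum(self.scores) / len(self.scores)
        if self.baseline - rolling_mean >= self.max_drop:
            self.breaches += 1
        else:
            self.breaches = 0  # recovery resets the sustain counter
        return self.breaches >= self.sustain

# Hypothetical stream of groundedness scores on the canary cohort.
alarm = RollingRubricAlarm(baseline=90.0, window=3, max_drop=2.0, sustain=2)
fired = [alarm.observe(s) for s in (90.0, 90.0, 90.0, 84.0, 84.0)]
```

One alarm instance per rubric keeps the per-rubric means independent, which matches the "any per-rubric rolling mean" wording.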
The Agent Command Center handles the canary split: six routing strategies, shadow, mirror, and race modes, eval-gated rollback as the default rollout pattern across 100+ providers. Per-call x-prism-cost, x-prism-latency-ms, x-prism-model-used, and x-prism-fallback-used headers put the fine-tune’s cost and latency next to the base on the same trace.
Production failures from the canary feed Error Feed, inside the eval stack. HDBSCAN soft-clusters embeddings of failed traces stored in ClickHouse; a Sonnet 4.5 Judge agent (30-turn budget, eight span-tools) reads each cluster and writes an immediate_fix. Common clusters on fine-tuned deployments:
- “Fine-tune over-refuses on medical-adjacent queries the base handles.” Cause: safety boilerplate mixed into the corpus. Fix: re-run the Scanner suite with a stricter `TopicRestrictionScanner`, drop offending rows, re-train from the last clean checkpoint.
- “Fine-tune loses tool-call schema adherence by epoch 5.” Cause: over-training on free-form completions diluting structured output. Fix: drop to 3 epochs, raise LoRA rank, keep the function-call template in the per-checkpoint rubric.
- “INT4 quantization loses 8 points on strict JSON-mode outputs.” Cause: quantization tier too aggressive for the rubric. Fix: deploy INT8 on the structured-output path, keep INT4 elsewhere.
The fixes feed back into the Platform’s self-improving evaluators, which retune per-evaluator thresholds from the next batch of feedback.
Anti-patterns
- Skipping the data-quality stage. Poisoned rows, leaked credentials, and injection payloads end up in the weights. Recovery is a full re-run, not a patch.
- Treating held-out as the whole story. Catastrophic forgetting and safety regression ship silently. The first signal is a customer complaint about a task no one thought to put in the test set.
- Skipping the paired comparison vs base. The cleanest signal in the suite, and the one most teams skip because it costs more per comparison. That cost pays for the fine-tune that doesn’t get pulled in week three.
- No production canary, just an offline pass. Skip the canary and the fine-tune ships blind on the failure modes the dataset doesn’t cover.
- Floating the judge model and rubric version. Same eval, different judge, drifting verdicts. Pin both alongside the prompt and cache results on `(rubric_version, judge_model, input_hash, output_hash)`.
- One eval pass per run. A fine-tune is a high-dimensional move. Per-checkpoint, per-quantization-tier, and per-canary-window passes are the difference between knowing what happened and guessing.
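The cache key named in the judge-pinning item can be built with the standard library alone. A minimal sketch, assuming SHA-256 over the raw strings and an in-memory dict as the store; the class and function names are hypothetical, and a real deployment would back this with a persistent store.

```python
import hashlib

def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def eval_cache_key(rubric_version: str, judge_model: str,
                   model_input: str, model_output: str) -> tuple:
    """Key of (rubric_version, judge_model, input_hash, output_hash)."""
    return (rubric_version, judge_model, _sha256(model_input), _sha256(model_output))

class EvalCache:
    """In-memory verdict cache; swapping judge or rubric yields a new key."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, key: tuple, compute):
        # compute is a zero-argument callable that runs the judge on a miss
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]

# Hypothetical usage: the judge runs once per distinct key.
calls = []
def fake_judge():
    calls.append(1)
    return 0.8

cache = EvalCache()
k1 = eval_cache_key("rubric-v3", "judge-a", "summarize this", "a summary")
k2 = eval_cache_key("rubric-v3", "judge-b", "summarize this", "a summary")
first = cache.get_or_compute(k1, fake_judge)
second = cache.get_or_compute(k1, fake_judge)   # cache hit, judge not re-run
third = cache.get_or_compute(k2, fake_judge)    # different judge, new verdict
```

Because the judge model and rubric version are part of the key, a verdict produced under one pinned pair is never served for another.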
Three honest caveats
- The `traceAI → dataset` connector is on the roadmap. Today, optimizer datasets come from offline files or platform exports. Eval-driven hyperparameter optimization ships today via the six optimizers and the resumable Optuna study.
- Error Feed integrates with Linear today. Slack, GitHub, Jira, and PagerDuty are on the roadmap; other tools pull clustered failures through the API in the meantime.
- Protect ML weights stay closed. The gateway self-hosts in your VPC with a deterministic fallback layer; the ML hop calls `api.futureagi.com` or a private vLLM deployment under enterprise license. The four Gemma 3n LoRA adapters are not redistributed.
How Future AGI ships the pipeline
The pieces are independent. Drop the Scanner suite into the pre-train audit today; wire the template suite into per-checkpoint scoring next; turn on the canary when the first candidate is ready to ship.
- ai-evaluation SDK (Apache 2.0): 60+ `EvalTemplate` classes; eight Scanners for the pre-train audit; 13 guardrail backends (9 open-weight, 4 API); four distributed runners (Celery, Ray, Temporal, Kubernetes); `CustomLLMJudge` as the pairwise primitive for stage 3.
- agent-opt (Apache 2.0): six optimizers including `BayesianSearchOptimizer` (Optuna-backed, resumable studies) with shared `EarlyStoppingConfig`; unified `Evaluator` over heuristics, LLM-judge, and the FAGI rubrics.
- traceAI (Apache 2.0): 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C#. Pluggable semantic conventions at `register()` time. 14 span kinds; 62 built-in evals via `EvalTag`. Wrap the training loop with `fi.span.kind=CHAIN` so per-checkpoint scores become span attributes.
- Future AGI Platform: self-improving evaluators tuned by thumbs feedback; in-product authoring agent writes rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- Agent Command Center: 17 MB Go binary self-hosts in your VPC. Per-cohort canary routing with eval-gated rollback across 100+ providers. SOC 2 Type II, HIPAA, GDPR, and CCPA certified, AWS Marketplace.
- Error Feed (inside the eval stack): HDBSCAN soft-clustering over ClickHouse embeddings plus a Sonnet 4.5 Judge agent that writes the `immediate_fix` per cluster. Linear OAuth wired today.
Ready to gate your next fine-tune pipeline? Drop pip install ai-evaluation into the pre-train audit, wire per-checkpoint scoring with the template suite, run the paired arena before promotion, and route the canary through Agent Command Center for the closed loop.
Related reading
- Evaluating Fine-Tuned LLMs: A 2026 Playbook
- The 2026 LLM Evaluation Playbook
- Automated Optimization for Agents (2026)
- LLM Arena as a Judge: Pairwise Comparison Evals (2026)
- How to Build an LLM Evaluation Framework From Scratch (2026)
- OWASP LLM Top 10 (2025): Risks and Mitigations
- LLM Benchmarks vs. Production Evals (2026)
- Agent Observability vs Evaluation vs Benchmarking (2026)
- Continued LLM Pretraining in 2026