LLM Eval vs Fine-Tuning: When to Do What in 2026
When eval-driven prompt optimization is enough, when fine-tuning earns its weight change, the seven axes that decide, and the five-step path that lets most teams ship without a retrain.
A team has spent three weeks on a customer-facing agent. The baseline scores 71 on TaskCompletion against the golden set. The product target is 85. The first instinct in the room is to schedule a fine-tune. The second instinct, from the engineer who has shipped this before, is to ask which failures actually need a weight change. Twenty minutes later, an Error Feed cluster review shows two-thirds of the misses are tone and format issues a prompt edit can fix. The fine-tune conversation moves to next quarter, and the team ships in nine days.
That moment, where the choice between “eval harder” and “fine-tune the model” gets made, decides whether the next release ships in a week or a quarter. Most teams get it wrong in both directions: too eager to fine-tune when the failures are prompt-shaped, or too averse to fine-tune when prompt optimization has visibly plateaued. This guide is the decision framework we apply across the deployments we watch ship in 2026: seven axes, a five-step path, the hybrid case when fine-tune is the right call, and the eval stack that proves which lever moved the metric.
This is the companion to the RAG vs fine-tuning decision framework, which covers knowledge access. The question here is different: when is the cheaper eval-driven path enough, and when does the gap justify the expense of changing weights.
TL;DR: most quality lift is prompt-shaped, not weight-shaped
| Lever | Cost | Time | Typical lift | When to pick |
|---|---|---|---|---|
| Eval-driven prompt optimization | Optimizer-run only | Hours to days | 2 to 6 points | Gap under 5 points; phrasing or layout failures |
| Classifier routing | Cheap classifier serving | Days | Cost down 30 to 60 percent; quality steady | High-volume mixed-difficulty traffic |
| RAG | Index + retrieval serving | 1 to 2 weeks | 3 to 8 points | Knowledge-shaped failures |
| Cascade with augment | Marginal | Days | Cost down a further 15 to 30 percent on top of routing | Cost is the bottleneck, not quality |
| Fine-tuning | $10k to $200k+ plus ongoing GPU serving | Weeks | 3 to 10 points | Steps 1 to 4 plateaued; volume justifies it |
If you only remember three things: the cheapest lever moves the metric the most for most teams, the order of operations matters more than the choice between any two levers, and fine-tuning is a real option, but it is the last option, not the first.
Why this question matters more in 2026
Two shifts collided this year. The first is that prompt optimization stopped being a vibes exercise. Six optimizers, held-out scoring, early stopping, and teacher-inferred few-shot mean you can run a real sweep against a real eval template and get a defensible artifact in a few hours. The second is that a cheap-tier fine-tuning run now takes days rather than weeks. LoRA on a distilled base costs less and takes less calendar time than it used to, which means the “we will never fine-tune” stance has weakened too.
The result is that the choice is genuinely live for more teams more often. The “we will just fine-tune” instinct overlooks how much of the gap closes with better evals and better prompts. The “we will never fine-tune” instinct underinvests when prompt engineering has visibly hit a wall. The discipline in 2026 is to run the order of operations honestly, with an eval template that does not move, and to let the data tell you when the gap is prompt-shaped versus weight-shaped.
The cost asymmetry is real. A prompt optimization run against a 300-row golden set with BayesianSearchOptimizer and a teacher inference model costs tens to low hundreds of dollars and produces a new prompt artifact, no new model to host. A small fine-tune on a cheap-tier base costs five figures and adds a serving line item that compounds with every request forever. The math only flips when the gap that prompt optimization left on the table is large enough to justify that compounding cost. That is a real condition, not a hypothetical, but it is also less common than the average team’s first instinct suggests.
The seven axes that decide
1. Quality gap from baseline
Score the current production prompt against the eval template that defines your release criterion. If the gap from baseline to target is under five points, prompt optimization will almost always close it. The agent-opt sweep against Groundedness or TaskCompletion with EarlyStoppingConfig typically lifts two to six points without touching weights. If the gap is over ten points, no single lever is enough; you will need the full five-step path, and fine-tuning is in scope for the residual after steps one through four. The five-to-ten point band is the honest middle: try routing and RAG before committing to a retrain.
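The bands above reduce to a small lookup. The sketch below is illustrative only: `lever_for_gap` is a hypothetical helper, and its cut-offs are the five- and ten-point thresholds from the paragraph, not a tuned rule.

```python
def lever_for_gap(baseline: float, target: float) -> str:
    """Map the baseline-to-target gap (in eval points) to the band described above."""
    gap = target - baseline
    if gap <= 0:
        return "target already met"
    if gap < 5:
        return "prompt optimization"
    if gap <= 10:
        return "prompt optimization, then routing and RAG before any retrain"
    return "full five-step path; fine-tuning in scope for the residual"


# The opening example: baseline 71, target 85, so a 14-point gap.
print(lever_for_gap(71, 85))
```

The band only says which levers are in scope. It does not replace scoring after each step against the same eval template.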
2. Cost of error
Per-call stakes change the bar. A consumer recommendation system can afford a wider quality band because each call’s downside is small. A medical pre-screen, a finance copilot, or a contract-redlining agent cannot. High-stakes per-call surfaces are where fine-tuning earns its expense, because tighter calibration of refusal behavior and output structure is what fine-tuning does best. Low-stakes surfaces almost never justify the operational tax.
3. Volume
Volume sets the amortization math. A fine-tune that lifts quality five points but adds two cents per call is a great deal at a billion calls a month and a disaster at a hundred thousand. Eval-driven prompt optimization has the opposite shape: a fixed run cost, no per-call premium, and a prompt artifact that you can ship across every traffic class. Low-volume agents almost always lose money on a fine-tune even when the quality math looks good in isolation.
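One way to read the amortization point is to spread the fixed monthly cost of a fine-tune (hosting plus training amortized over the model's lifetime) across call volume. The figures below are hypothetical and chosen only to show the shape of the curve. `fixed_cost_per_call` is an illustrative helper, not part of any SDK.

```python
def fixed_cost_per_call(fixed_monthly_cost: float, calls_per_month: int) -> float:
    """Spread a fixed monthly cost (hosting plus amortized training) over call volume."""
    if calls_per_month <= 0:
        raise ValueError("calls_per_month must be positive")
    return fixed_monthly_cost / calls_per_month


# Hypothetical figure for illustration only: $20,000 per month of fixed cost.
FIXED_MONTHLY = 20_000.0

high_volume = fixed_cost_per_call(FIXED_MONTHLY, 1_000_000_000)  # a billion calls
low_volume = fixed_cost_per_call(FIXED_MONTHLY, 100_000)         # a hundred thousand

print(f"high volume: ${high_volume:.5f} per call")
print(f"low volume:  ${low_volume:.2f} per call")
```

Under these assumed numbers the same fixed cost is a rounding error per call at the high end and twenty cents per call at the low end. That is why low-volume agents rarely recover the spend.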
4. Latency budget
If the strict service-level target is under 200 ms, the prompt bloat that comes with prompt optimization (longer system prompts, more few-shot examples, more instruction scaffolding) becomes the latency bottleneck. A fine-tuned small model with a shorter prompt can hit the target where a prompt-optimized larger model cannot. If the budget is over 500 ms, prompt optimization is fine; there is room for the extra tokens. The 200-to-500 ms band is where cascade and classifier routing earn their keep.
5. Style requirements
Tone, persona, refusal calibration, and structured-output discipline are where prompts are weakest and fine-tunes are strongest. A model can be told to “sound like a senior compliance officer” but it will revert under pressure. A fine-tune trained on a few hundred curated exchanges in that voice will hold the voice under adversarial prompts the eval set never saw. If the failure cluster is “the model is off-brand,” weights are the right lever. If it is “the model gets the facts wrong,” weights are the wrong lever.
6. Data sensitivity
If your data is per-tenant or under contractual isolation, do not fine-tune one model per tenant. The operational surface (versioning, eviction, retraining cadence, cost attribution) compounds with every new customer, and the unit economics are bleak by the tenth tenant. Eval-driven prompt optimization paired with retrieval over isolated indices is the right pattern. Reserve fine-tuning for the shared base case where one model serves many tenants.
7. Iteration speed needed
If the release is this week, prompt optimization is the only option that fits. If the release is this quarter, fine-tuning is on the table. The honest mental model is that fine-tuning is a heavy artifact with a real review cycle (training run, held-out scoring, canary, promotion), and that cycle does not collapse under deadline pressure. Plan for it as a quarter-scale move, not a week-scale one.
The five-step decision path
The order of operations matters more than any single choice. Run the cheap levers first, score after each one against the same eval template, and only escalate when the data shows the previous step plateaued. Skipping a step almost always costs more than running it in order would have.
Step 1: prompt-optimize with agent-opt
Start by running the six agent-opt optimizers against the failing template. RandomSearchOptimizer is the fast baseline. BayesianSearchOptimizer uses Optuna with a teacher inference model to infer few-shot examples that lift the metric. MetaPromptOptimizer, ProTeGi, GEPAOptimizer, and PromptWizardOptimizer cover the search landscape from different angles. EarlyStoppingConfig cuts the run when the held-out score plateaus, so the cost stays bounded.
The artifact is a new prompt scored against the same template you started with. If the lift is real and replicates on the held-out slice, you are done. If the lift is tiny or fragile, move to step two. The trace-stream-to-agent-opt connector that pulls live production traces into the optimizer is on the roadmap; today the path is eval-driven with golden sets and held-out scoring. That is enough to clear the prompt-shaped failures, which is most of them.
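The plateau rule is easy to picture as a generic loop. The sketch below is not the agent-opt API or the EarlyStoppingConfig implementation. It is a plain-Python illustration of patience-based early stopping on a held-out score, and every name in it (`search_with_early_stopping`, `patience`, `min_delta`, the toy scores) is an assumption made for the example.

```python
from typing import Callable, Iterable, Optional, Tuple


def search_with_early_stopping(
    candidates: Iterable[str],
    held_out_score: Callable[[str], float],
    patience: int = 3,
    min_delta: float = 0.5,
) -> Tuple[Optional[str], float]:
    """Score candidate prompts in order; stop after `patience` non-improving trials.

    A candidate counts as an improvement only if it beats the best held-out
    score so far by more than `min_delta` points.
    """
    best_prompt: Optional[str] = None
    best_score = float("-inf")
    stale = 0
    for prompt in candidates:
        score = held_out_score(prompt)
        if score > best_score + min_delta:
            best_prompt, best_score, stale = prompt, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # held-out score has plateaued; stop spending
    return best_prompt, best_score


# Toy held-out scores standing in for a real eval template run.
toy_scores = {"v1": 71.0, "v2": 74.0, "v3": 74.2, "v4": 73.0, "v5": 74.1, "v6": 90.0}
evaluated = []


def toy_held_out_score(prompt: str) -> float:
    evaluated.append(prompt)
    return toy_scores[prompt]


best = search_with_early_stopping(toy_scores, toy_held_out_score)
print(best, evaluated)
```

In the toy run the loop stops after three stale candidates and never scores `v6`. That is the trade the plateau rule makes: bounded cost in exchange for possibly missing a late winner, so patience is worth setting deliberately.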
Step 2: classifier-route the easy traffic
Most production traffic is bimodal: a high-volume easy slice and a small hard slice. A small classifier (one of the nine open-weight backends the FAGI gateway supports) can route the easy slice to a cheap-tier model and reserve flagship for the hard slice. The cost falls thirty to sixty percent and the quality typically holds, because the cheap-tier model was always good enough for the easy traffic; the team was overpaying because it was easier to send everything to flagship.
Classifier routing is the cost economics alternative to fine-tuning. Where a fine-tune amortizes by making one cheap model handle everything, classifier routing amortizes by sending each request to the model that can handle it for the lowest cost. The two strategies can compose, but routing comes first because it does not commit you to a training artifact.
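A minimal sketch of the routing decision, assuming a classifier that returns the probability a request is hard. The model names, the threshold, and `toy_hard_probability` are placeholders for illustration. A real deployment would call a trained classifier behind the gateway, whose interface is not shown here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    model: str
    p_hard: float


def route_request(
    text: str,
    hard_probability: Callable[[str], float],
    threshold: float = 0.5,
    cheap_model: str = "cheap-tier",
    flagship_model: str = "flagship",
) -> Route:
    """Send a request to the flagship only when the classifier thinks it is hard."""
    p_hard = hard_probability(text)
    model = flagship_model if p_hard >= threshold else cheap_model
    return Route(model=model, p_hard=p_hard)


def toy_hard_probability(text: str) -> float:
    """Stand-in for a trained classifier: long or contract-related requests count as hard."""
    is_hard = len(text.split()) > 30 or "contract" in text.lower()
    return 0.9 if is_hard else 0.1


print(route_request("How do I reset my password?", toy_hard_probability))
print(route_request("Redline clause 7 of this contract.", toy_hard_probability))
```

The threshold is the tuning knob. Score both slices against the same eval template before and after moving it, so the cost saving is not bought with a quality drop on the hard slice.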
Step 3: add RAG for knowledge-shaped failures
If the Error Feed cluster review shows the model failing because it does not know something, that is a knowledge gap, not a reasoning gap. RAG is the right lever. Add a retrieval pipeline, score with ContextAdherence and Groundedness, and check whether the failure cluster moves. If it does, you closed the gap with index work rather than weight work. The RAG vs fine-tuning decision framework covers the architecture and eval choices here in detail.

Step 4: cascade with augment
augment=True on the gateway runs a cheap-tier model first, scores the response in flight, and escalates to flagship only when the cheap-tier output is uncertain. The pattern cuts cost without cutting quality on the high-volume path. It composes with classifier routing (route first, then cascade within the route) and it composes with RAG (retrieve first, then cascade across model tiers). The cost curve typically falls another fifteen to thirty percent on top of routing alone.
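The cascade logic itself is small. The sketch below shows the general pattern only. How the gateway scores responses in flight is not described here, so the `confidence` callable, the 0.7 threshold, and the toy model functions are all assumptions.

```python
from typing import Callable, Tuple


def cascade(
    prompt: str,
    cheap_call: Callable[[str], str],
    flagship_call: Callable[[str], str],
    confidence: Callable[[str, str], float],
    min_confidence: float = 0.7,
) -> Tuple[str, bool]:
    """Try the cheap tier first; escalate only when its answer scores as uncertain.

    Returns the response and a flag saying whether the flagship was used.
    """
    draft = cheap_call(prompt)
    if confidence(prompt, draft) >= min_confidence:
        return draft, False
    return flagship_call(prompt), True


# Toy stand-ins for the two model tiers and the in-flight scorer.
def toy_cheap(prompt: str) -> str:
    return "I am not sure." if "indemnification" in prompt else "Your order ships Monday."


def toy_flagship(prompt: str) -> str:
    return "Flagship answer."


def toy_confidence(prompt: str, response: str) -> float:
    return 0.2 if "not sure" in response else 0.95


print(cascade("When does my order ship?", toy_cheap, toy_flagship, toy_confidence))
print(cascade("Explain the indemnification clause.", toy_cheap, toy_flagship, toy_confidence))
```

Escalated requests pay for both calls, so the saving depends on how rarely the cheap tier falls below the threshold.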
Step 5: fine-tune only if the gap is still open
If steps one through four left a measurable gap, and the gap is shaped like tone, format, calibration, or volume-amortization economics, fine-tuning is the right next move. Curate the training set against the same eval templates you use in production. Hold out a slice. Score every checkpoint. Promote only through canary with the eval stack scoring both the fine-tuned model and the baseline against the same golden set. The fine-tuning pipeline evaluation guide covers the training-loop eval; evaluating fine-tuned LLMs covers the post-deployment loop.
The discipline is to arrive at step five with evidence, not instinct. The Error Feed cluster review should show a cluster that prompt optimization did not move, that routing did not move, that RAG did not move, and whose immediate_fix in the Sonnet 4.5 Judge output reads like a weight-level change rather than a prompt edit. That is the signal that fine-tune is earning its place.
The FAGI grounding: what ships today
The eval-driven path is not a thought experiment. The pieces are shipping today across the Future AGI stack and they compose end to end.
The ai-evaluation SDK ships sixty-plus EvalTemplate classes that drive the eval surface. The ones that matter most for the eval-vs-fine-tune decision: Groundedness for grounded-output failures, ContextAdherence for retrieval-shaped failures, TaskCompletion for end-to-end agent failures, LLMFunctionCalling for tool-use failures, AnswerRefusal for refusal-calibration failures, plus Completeness and FactualAccuracy as the cross-cutting baselines.
agent-opt ships the six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer with Optuna and teacher-inferred few-shot, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) plus EarlyStoppingConfig. The optimizers are eval-driven against any of the sixty-plus templates today. The trace-stream connector that pulls production traces into the optimizer is on the active roadmap, which is the honest framing.
The FAGI gateway runs the routing and cascade layers. Nine open-weight classifier backends are the cost-economics alternative to fine-tune-or-flagship; augment=True runs the cascade. Five self-hosted backends mean a fine-tuned model can sit behind the gateway with the same OpenAI-compatible API as the cheap and flagship tiers. Headers like x-prism-cost, x-prism-latency-ms, x-prism-model-used, and x-prism-fallback-used emit canonical telemetry so the cost-per-call and tail-latency comparison between the eval-driven path and a fine-tuned path is real production data, not a synthetic benchmark.
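To compare paths on production data, the four headers can be folded into one record per call. The header names come from the paragraph above. The value formats (a decimal string for cost, a numeric string for latency, `true`/`false` for the fallback flag) are assumptions, as are the `CallTelemetry` and `parse_prism_headers` names.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CallTelemetry:
    cost: float
    latency_ms: float
    model_used: str
    fallback_used: bool


def parse_prism_headers(headers: Mapping[str, str]) -> CallTelemetry:
    """Build a telemetry record from response headers (value formats are assumed)."""
    h = {key.lower(): value for key, value in headers.items()}  # header names are case-insensitive
    return CallTelemetry(
        cost=float(h["x-prism-cost"]),
        latency_ms=float(h["x-prism-latency-ms"]),
        model_used=h["x-prism-model-used"],
        fallback_used=h.get("x-prism-fallback-used", "false").strip().lower() == "true",
    )


# Made-up header values for illustration.
sample = parse_prism_headers({
    "X-Prism-Cost": "0.0004",
    "X-Prism-Latency-Ms": "182",
    "X-Prism-Model-Used": "cheap-tier",
    "X-Prism-Fallback-Used": "false",
})
print(sample)
```

Grouping these records by `model_used` gives the cost-per-call and tail-latency comparison for each route.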
The Error Feed clusters production failures via HDBSCAN soft-clustering, then a Sonnet 4.5 Judge writes an immediate_fix per cluster. Most immediate_fix outputs are prompt edits, not weight retraining. That is the empirical case for running the order of operations honestly: when you look at the clusters, most of them point at prompt-shaped fixes. The Platform’s self-improving evaluators close the loop on prompt optimization without manual sweeps, which is the per-eval cost advantage over running a Galileo Luna-2 style critic on every request. Linear is the only Error Feed integration today; the rest of the integration surface is on the roadmap.
FAGI Protect, the gateway runtime guard rail, ships with closed ML weights for the policy models; the gateway itself self-hosts. The cost economics of running the routing and cascade layers in your own VPC are part of why classifier routing wins step two so cleanly.
The hybrid pattern: when fine-tune is the right call, eval still composes
Even when fine-tune wins the decision, eval-driven work does not go away. It composes at three distinct points along the fine-tune pipeline, and skipping any of them is the most common reason fine-tunes ship and immediately regress.
Pre-fine-tune, curate the training set against the same eval templates you score on in production. A fine-tune trained on a curated set that does not score against Groundedness and TaskCompletion is a fine-tune that will look good on training loss and miss the production metric. The discipline is to make the training distribution look like the eval distribution.
During fine-tune, hold out a slice and score every checkpoint. The optimizer-style early-stopping pattern carries over: stop training when held-out metrics plateau, not when training loss does. Capability regressions are common in checkpoints that look fine on a single template, so score against the full template suite, not just the one the fine-tune was supposed to lift.
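A sketch of suite-wide checkpoint selection, assuming each checkpoint has already been scored on every template. `pick_checkpoint`, the one-point regression tolerance, and the scores are hypothetical. The point is that a checkpoint is disqualified by a regression on any template, not just judged on the one it was meant to lift.

```python
from typing import Dict, List, Optional, Tuple

Scores = Dict[str, float]


def pick_checkpoint(
    checkpoints: List[Tuple[str, Scores]],
    baseline: Scores,
    primary: str,
    max_regression: float = 1.0,
) -> Optional[str]:
    """Return the checkpoint with the best primary score that regresses no other template."""
    best_name: Optional[str] = None
    best_primary = float("-inf")
    for name, scores in checkpoints:
        regressed = [
            template
            for template, base in baseline.items()
            if template != primary and scores[template] < base - max_regression
        ]
        if regressed:
            continue  # capability regression elsewhere in the suite
        if scores[primary] > best_primary:
            best_name, best_primary = name, scores[primary]
    return best_name


# Made-up held-out scores for illustration.
baseline = {"TaskCompletion": 71.0, "Groundedness": 88.0, "AnswerRefusal": 90.0}
checkpoints = [
    ("ckpt-1", {"TaskCompletion": 76.0, "Groundedness": 88.0, "AnswerRefusal": 90.0}),
    ("ckpt-2", {"TaskCompletion": 82.0, "Groundedness": 87.5, "AnswerRefusal": 89.5}),
    ("ckpt-3", {"TaskCompletion": 84.0, "Groundedness": 80.0, "AnswerRefusal": 90.0}),
]
chosen = pick_checkpoint(checkpoints, baseline, primary="TaskCompletion")
print(chosen)
```

In the toy data `ckpt-3` has the highest TaskCompletion score but loses eight points of Groundedness, so the selection falls back to `ckpt-2`.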
Post-fine-tune, canary-route through the gateway, score the fine-tuned model and the baseline against the same templates on real production traffic, and only promote if the lift survives the shape of real load. The classifier routing infrastructure from step two is what makes this canary cheap to run; you do not need a separate A/B framework, you just route a small slice of traffic to the fine-tuned model and let the eval templates score both.
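A minimal sketch of the canary split and the promotion check, under stated assumptions. The request is assigned by hashing its ID so the same request always lands on the same route, and promotion compares mean eval scores with a hypothetical two-point minimum lift. A production gate would likely add a confidence interval, which is omitted here.

```python
import hashlib
from statistics import mean
from typing import Sequence


def in_canary(request_id: str, canary_fraction: float) -> bool:
    """Deterministically assign a request to the canary slice by hashing its ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")          # integer in [0, 2**64)
    return bucket < int(canary_fraction * 2**64)


def should_promote(
    canary_scores: Sequence[float],
    baseline_scores: Sequence[float],
    min_lift: float = 2.0,
) -> bool:
    """Promote only if the fine-tuned route beats the baseline by at least `min_lift` points."""
    return mean(canary_scores) - mean(baseline_scores) >= min_lift


# Roughly five percent of hypothetical request IDs land in the canary.
share = sum(in_canary(f"req-{i}", 0.05) for i in range(10_000)) / 10_000
print(f"canary share: {share:.3f}")
print(should_promote([86.0, 84.0, 88.0], [80.0, 82.0, 81.0]))
```

Hash-based assignment keeps the split stable across retries, so both routes are scored on disjoint but comparable slices of real traffic.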
This is the case where eval and fine-tune are not opposing strategies. The fine-tune is just another route that the eval stack scores. The decision framework gets you to step five with evidence; the eval stack then makes sure step five was the right call.
The cost-benefit math, honestly
Prompt optimization via agent-opt: optimizer-run cost (a few hundred dollars at most for a typical 300-row golden set sweep across the six optimizers) plus zero serving overhead because the artifact is a prompt, not a model. Typical lift: two to six points on the primary template. The cost is fixed, the lift is real, the artifact ships in days.
Fine-tuning on a cheap-tier base: ten to fifty thousand dollars for the training run on a small open-weight base, plus the ongoing GPU serving cost (cents per call that compound forever). Frontier-base fine-tuning runs two hundred thousand and up. Typical lift: three to ten points if the gap was actually weight-shaped, less if it was not. The cost is variable, the lift can be real or illusory, and the artifact ships in weeks.
The eval gate that decides between them is mechanical, not philosophical. Run steps one through four. Score against the same templates each time. If the residual gap is over the threshold the product team set and the failure cluster reads as weight-shaped (tone, format, calibration, volume amortization), step five is on the table. If the residual gap is under threshold or the failure cluster reads as prompt-shaped, step five is not on the table yet.
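Written out, the gate is two conditions joined by `and`. The sketch below assumes the cluster review has already labeled the dominant failure shape. The label strings and the `fine_tune_in_scope` helper are hypothetical, chosen to mirror the four weight-shaped categories named above.

```python
WEIGHT_SHAPED = {"tone", "format", "calibration", "volume_amortization"}


def fine_tune_in_scope(
    score_after_step4: float,
    target: float,
    residual_threshold: float,
    cluster_shape: str,
) -> bool:
    """Step five is on the table only if the residual gap is large AND weight-shaped."""
    residual = target - score_after_step4
    return residual > residual_threshold and cluster_shape in WEIGHT_SHAPED


# Illustrative numbers: target 85, product threshold of 2 points.
print(fine_tune_in_scope(80.0, 85.0, 2.0, "tone"))        # large gap, weight-shaped
print(fine_tune_in_scope(80.0, 85.0, 2.0, "knowledge"))   # large gap, but not weight-shaped
print(fine_tune_in_scope(84.0, 85.0, 2.0, "tone"))        # weight-shaped, but under threshold
```

Both inputs come from work already done in steps one through four, which is what keeps the decision mechanical.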
The mistake teams make is to skip the eval gate and run the math from instinct. Instinct says “we should fine-tune to be serious.” The data, when teams actually run the order of operations, says most of the residual is still prompt-shaped after the first sweep and most teams ship without ever reaching step five.
Anti-patterns to avoid
Four show up across teams that get this wrong, each with a recognizable failure shape.
Reaching for fine-tune as the first lever. The pattern looks like a kickoff meeting where the engineering lead opens with “we will fine-tune the base model on our data.” Three weeks later the fine-tune ships with a five-point lift that prompt optimization would have delivered in two days, and the team has a serving line item that compounds forever for no marginal benefit. The fix is to make step one (run agent-opt against the failing template) a kickoff-meeting requirement, not a stretch goal.
Ignoring prompt optimization entirely. The pattern looks like a team that writes the production prompt by hand, never sweeps it against an eval template, and assumes any gap from baseline to target needs a weight change. Eighty percent of the lift is in prompt-shaped territory the team is not exploring. The fix is the same as the first: agent-opt before any fine-tune conversation.
Fine-tuning per tenant. The pattern looks like a team that builds a great per-tenant fine-tune for the first customer, then discovers by tenant five that the operational surface (versioning, eviction, retraining cadence, cost attribution) does not scale. The fix is to fine-tune the shared base case once and use retrieval and prompt optimization for tenant-specific shape.
Deciding without an Error Feed cluster review. The pattern looks like a team that fine-tunes against a guess at the failure mode rather than a clustered view of the actual failure shape. Half the fine-tunes that ship in this pattern fail to move the production metric because they were trained on a hypothesis the data does not support. The fix is to run the cluster review (HDBSCAN soft-cluster the recent failures, read the Sonnet 4.5 Judge immediate_fix outputs, confirm the failure shape matches the lever you are about to pull) before committing to any step-five expense.
The common thread across all four is that the cost of the wrong choice is paid in weeks, not days, and the cost of running the order of operations honestly is one optimizer sweep and one cluster review. The asymmetry favors the cheap discipline every time.
The honest framing on what ships today
A few callouts to keep the picture clean. The six agent-opt optimizers ship today and run eval-driven against any of the sixty-plus EvalTemplate classes. The trace-stream-to-agent-opt connector that turns live production traces into optimizer input is on the active roadmap. Eval-driven prompt optimization with golden sets and held-out scoring is the surface that ships now.
The FAGI gateway runs nine open-weight classifier backends, the cascade with augment=True, and five self-hosted backends that can host a fine-tuned model behind the same OpenAI-compatible API as the cheap and flagship tiers. The Error Feed integration ships today against Linear; broader coverage is on the roadmap. FAGI Protect ships with closed ML weights; the gateway itself self-hosts. The Platform’s self-improving evaluators close the loop on prompt optimization without manual sweeps at a lower per-eval cost than running a heavyweight critic on every request.
The decision framework holds either way. The order of operations does not depend on the roadmap; it depends on the cost asymmetry between prompt-shaped fixes and weight-shaped fixes. Run the cheap levers first, score after each one, and only pay the expensive lever when the data says you have to.
Where to go next
The RAG vs fine-tuning decision framework covers the knowledge-access version of this question. The fine-tuning pipeline evaluation guide covers the eval loop you need during a fine-tune run, and evaluating fine-tuned LLMs covers the post-deployment loop. For the prompt optimization landscape, see prompt optimization at scale and top prompt optimization tools. The agent observability vs evaluation vs benchmarking guide sets the broader vocabulary, and the AI evaluation open-source library walks the sixty-plus EvalTemplate classes.