Your Agent Passes Evals and Fails in Production. Here's Why. (2026)
Your eval set is a snapshot. Production is a river. The six drift modes that age every eval set the day it ships, and the trace-as-eval-surface loop that closes the gap.
3:14 am. The customer-support agent that shipped Wednesday at 0.91 average on a CI eval suite (47 scenarios, four rubrics, three sprints green) is now quoting refund amounts off by an order of magnitude, contradicting itself across turns, and on one trace handing a user another customer’s order. You pull the failing conversations. Every single one passes the per-turn faithfulness rubric. Every single one failed the user. The eval says ship. Production says you already broke trust.
This is the trace-eval gap, and it’s the most reliable on-call pattern of 2026. The instinct is to blame the rubric. The right read is that the rubric was fine in March and ageing in April; production has been moving the entire time. Your eval set is a snapshot. Production is a river. Evals don’t ship. Traces do. Every release ages your eval set the day it lands; without a pipeline that promotes new failure modes back into the offline set, the offline-pass-prod-fail gap is mathematical, not accidental.
The opinion this post earns: the right unit of evaluation is the production trace, not the curated test case. Static offline evals are a necessary regression gate, never a sufficient ship gate, because they measure against a world that stopped existing the day they froze. The architecture that closes the gap runs the same rubric in both places, scores the live span, clusters the failures, ratchets the offline set off what production already broke. This guide walks the six drift modes, the 4-D trace score, the Error Feed loop, and the honest map of what ships today versus what’s roadmap.
TL;DR: six drift modes age every eval set
| Drift mode | What the offline eval sees | What production actually does |
|---|---|---|
| Dataset drift | All curated cases pass | New user intents the set never had |
| Tool-API drift | Mocked tool returns the same shape | Vendor changed schema, error codes, or rate limits |
| Prompt drift | Rubric written for v3, frozen in git | Prompt is on v17, rubric still grades v3 |
| Retrieval-corpus drift | Index frozen at eval-build time | Index doubled, chunker bumped, same query, new chunks |
| User-distribution drift | Hand-authored test inputs | Real traffic looks nothing like the test set |
| Agent-step compounding | Per-step success 95 percent | Eight steps multiply to 66 percent end-to-end |
Each is a different timescale. Dataset and user-distribution drift creep in weeks. Tool-API and prompt drift land overnight. Retrieval-corpus drift is silent until a re-index. Agent-step compounding is structural and was never going to be caught by single-turn rubrics. None of them are a “more evals” problem. They are an architecture problem.
The eval set ages the day it ships
When you froze the eval set, you froze a hypothesis about what users would do, which tools would behave how, which prompts would still be in production, and which chunks the retriever would surface. Six months later, every one of those hypotheses moved a different distance. The rubric scoring 0.91 in CI is grading a world the agent does not live in anymore.
The rubric is not wrong. The rubric is stale, the same way a unit test is stale when the function it covers got refactored two months ago. Both pass. Neither protects you. The fix in software engineering is to write the test for the current function. The fix in agent eval is to grade the current trace.
What the strongest teams ship in 2026: the same rubric definition runs in pytest against a versioned dataset for the CI gate and against live OpenTelemetry spans as a span-attached score on production traffic. The CI gate is the floor. The span-attached score is the river. The dataset grows weekly from the spans the rubric just flagged. The eval surface is no longer a snapshot. It is the trace stream, sampled and scored continuously.
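A minimal sketch of "same rubric, two call sites", under loud assumptions: the toy rubric, the dataset fields (`context`, `response`), and the span attribute keys (`input.context`, `output.value`, `eval.*`) are all hypothetical names chosen for illustration, not the ai-evaluation or traceAI API. The point is only that one function definition feeds both the CI gate and the span score.

```python
import re
from dataclasses import dataclass

# Toy pattern: matches "$120" or "$120.00"; does not handle thousands separators.
AMOUNT = re.compile(r"\$\d+(?:\.\d{2})?")


@dataclass(frozen=True)
class RubricResult:
    name: str
    score: float  # fraction of quoted amounts found in the context, 0.0 to 1.0
    passed: bool


def refund_amount_rubric(context: str, response: str) -> RubricResult:
    """Toy rubric: every dollar amount quoted in the response must appear in the context."""
    quoted = AMOUNT.findall(response)
    allowed = set(AMOUNT.findall(context))
    if not quoted:
        return RubricResult("refund_amount_grounded", 1.0, True)
    score = sum(a in allowed for a in quoted) / len(quoted)
    return RubricResult("refund_amount_grounded", score, score == 1.0)


def score_dataset(cases: list[dict]) -> float:
    """CI path: mean rubric score over a versioned dataset (hypothetical field names)."""
    results = [refund_amount_rubric(c["context"], c["response"]) for c in cases]
    return sum(r.score for r in results) / len(results)


def score_span(attributes: dict) -> dict:
    """Production path: the same rubric, returned as attributes to attach to the span."""
    r = refund_amount_rubric(attributes["input.context"], attributes["output.value"])
    return {f"eval.{r.name}.score": r.score, f"eval.{r.name}.passed": r.passed}
```

In a pytest gate, `score_dataset` would sit behind an assertion on a threshold; on live traffic, the dict from `score_span` would be written onto the span by whatever tracing layer is in use.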
Drift 1: dataset drift
The eval set was written at launch. Users found intents the test authors never anticipated. The eval still passes because the dataset never moved.
Tell. Offline scores flat for months, production complaints diversify, the team cannot reproduce most reported failures on the test set.
Fix. Sample failing traces weekly. Bucket by user segment, intent, and judge score. Promote the hardest 5 to 10 percent into the eval set with version tags. Every promoted trace is a regression future PRs cannot break.
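One way the weekly selection could look, as a sketch. The trace fields (`segment`, `intent`, `judge_score`) are hypothetical, and "hardest" is read here as "lowest judge score within its bucket", which is an assumption rather than a rule from any particular platform.

```python
import math
from collections import defaultdict


def select_for_promotion(failing_traces: list[dict], fraction: float, version: str) -> list[dict]:
    """Bucket by (segment, intent), then take the lowest-scoring slice of each bucket."""
    buckets = defaultdict(list)
    for trace in failing_traces:
        buckets[(trace["segment"], trace["intent"])].append(trace)

    promoted = []
    for key in sorted(buckets):
        hardest_first = sorted(buckets[key], key=lambda t: t["judge_score"])
        # At least one trace per bucket, so rare intents are not dropped.
        k = max(1, math.ceil(len(hardest_first) * fraction))
        # Tag each promoted trace with the dataset version it enters.
        promoted.extend({**t, "dataset_version": version} for t in hardest_first[:k])
    return promoted
```

Taking a slice per bucket instead of a global bottom-N keeps one noisy segment from crowding every other intent out of the eval set.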
Drift 2: tool-API drift
You mocked the tool call in CI. The real endpoint changed schema, error shape, or rate-limit headers on a Tuesday. The agent retries, the retry loop times out, the agent fabricates a reasonable-sounding answer. CI is green because the mock still returns the old payload.
Tell. Tool-call latency climbs, retries climb, the per-response rubric still passes, and cost-per-success creeps up.
Fix. Score tool-call success as its own rubric on live spans. EvaluateFunctionCalling grades argument shape and call sequence. A failing tool call shows up in the trace tree right next to the failing response, scored. The mocked CI test catches your regression; the span-attached score catches the vendor’s.
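A sketch of a tool-call check that runs against the live payload instead of the mock. This is not EvaluateFunctionCalling; the expected fields, the retry limit, and the result keys are hypothetical and stand in for whatever contract the real tool has.

```python
# Hypothetical contract for a refund-lookup tool.
EXPECTED_FIELDS = {"order_id": str, "refund_cents": int, "currency": str}


def score_tool_span(payload: dict, retries: int, max_retries: int = 1) -> dict:
    """Grade one tool call: response shape plus retry count."""
    missing = sorted(k for k in EXPECTED_FIELDS if k not in payload)
    wrong_type = sorted(
        k for k, expected in EXPECTED_FIELDS.items()
        if k in payload and not isinstance(payload[k], expected)
    )
    passed = not missing and not wrong_type and retries <= max_retries
    return {"passed": passed, "missing": missing, "wrong_type": wrong_type, "retries": retries}
```

If the vendor renames `refund_cents` overnight, the mock in CI keeps returning the old shape and stays green, while this check fails on the first live span that carries the new payload.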
Drift 3: prompt drift
You shipped v17 of the prompt on Friday. The rubric was written for v3 in February. The rubric still grades the criteria v3 cared about. The agent is being evaluated for the wrong thing.
Tell. A senior engineer reads ten traces and disagrees with the judge on six but cannot articulate why. The judge is grading by the old contract.
Fix. Version the rubric in the same PR as the prompt it scores. Treat the rubric like a contract test: when the prompt’s intent moves, the rubric moves with it, and the next CI run regrades the dataset under the new contract. Track judge-versus-human agreement on a small calibration set; when it drops, the rubric is overdue.
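Tracking judge-versus-human agreement can be as small as the sketch below. The 0.8 floor is a placeholder assumption; a real floor would come from how often two human raters agree with each other on the same calibration set.

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of calibration items where the judge's verdict matches the human label."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def rubric_is_overdue(judge: list[bool], human: list[bool], floor: float = 0.8) -> bool:
    """Flag the rubric for revision when agreement falls below the floor (assumed value)."""
    return agreement_rate(judge, human) < floor
```

The "disagrees on six of ten traces" case from the Tell above comes out at 0.4 agreement, well under any plausible floor.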
Drift 4: retrieval-corpus drift
The retriever you evaluated in March indexed 12,000 documents at chunk size 800. By May the index has 38,000 documents, the chunker was bumped during a re-embed, and the same query lands on different top-k chunks. The generator dutifully grounds in whatever it was handed. Groundedness still scores 0.94. The answer is grounded in the wrong material.
Tell. Generation rubrics hold. Users say the bot is “less helpful than last quarter.” Trace inspection shows the top-1 chunk shifted on a class of queries.
Fix. Split the eval suite by layer. Retrieval rubrics (ContextRelevance, ChunkAttribution, ChunkUtilization) catch the index drift before generation rubrics absorb it. A drop in context relevance with stable groundedness means the retriever moved. A drop in groundedness with stable context relevance means the generator did. One bisect instead of three days. Covered in Evaluate RAG in CI/CD (2026).
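The bisect rule in that paragraph is mechanical enough to write down. A sketch, assuming each layer reports a mean score per window and that a drop of 0.05 counts as a regression; both the metric keys and the threshold are illustrative.

```python
def attribute_regression(baseline: dict, current: dict, min_drop: float = 0.05) -> str:
    """Name the layer that moved, given per-layer mean scores for two windows."""
    retrieval_dropped = baseline["context_relevance"] - current["context_relevance"] >= min_drop
    generation_dropped = baseline["groundedness"] - current["groundedness"] >= min_drop

    if retrieval_dropped and generation_dropped:
        return "both"
    if retrieval_dropped:
        return "retriever"   # context relevance fell, groundedness held
    if generation_dropped:
        return "generator"   # groundedness fell, context relevance held
    return "stable"
```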
Drift 5: user-distribution drift
Your eval set was hand-authored or sampled from launch-month traffic. Six months later, real users hit the agent with slang, multi-language code-switching, longer prompts, screenshots, and chains of follow-ups the dataset never had. The judge calibrated against the curated set reads 15 points lower on live traffic.
Tell. A spot-check of production traces, scored by hand, disagrees with the judge by 15+ points. Engineers stop trusting the rubric and start reading traces directly.
Fix. Calibrate the judge against production samples, not the dataset. Each rubric ships with a small human-labelled calibration set drawn from production. Track judge-versus-human drift as its own metric. The Future AGI Platform retunes evaluators end-to-end from thumbs up/down and relabels in the in-product UI, which makes weekly recalibration a default instead of a cost decision.
Drift 6: agent-step compounding
Every per-step rubric scores 95 percent. The agent makes eight tool calls per session. 0.95 to the eighth is 0.66. Roughly one session in three ends up structurally wrong even when every individual step looks right. The rubric never multiplied.
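The arithmetic, spelled out. This assumes step failures are independent, which real agents only approximate; correlated failures shift the number but not the shape of the problem.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to hit a target end-to-end rate."""
    return target ** (1 / steps)


print(round(end_to_end_success(0.95, 8), 3))  # 0.663
print(round(required_per_step(0.90, 8), 4))   # 0.9869
```

Read the second line backwards: to get nine sessions in ten right across eight steps, each step has to succeed almost 99 percent of the time.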
Tell. Per-turn metrics high. Conversation Completeness, outcome rate, CSAT low. Tickets read “the bot kept asking me the same question” or “the bot said yes then said no.”
Fix. Score the trace as a unit. Add Conversation Completeness, Role Adherence, Knowledge Retention, Turn Relevancy on the conversation. Add Optimal Plan Execution on the span tree. Multi-turn metrics are noisier per dollar than per-turn, but they correlate with user experience an order of magnitude better. The Multi-Turn LLM Evaluation playbook walks the metric stack.
Why static offline evals cannot catch any of these
The shared property of all six drift modes: they happen after the eval set was frozen. A static dataset cannot encode a hypothesis it does not yet have. The CI gate is a regression test on a hypothesis you wrote in the past. The drift is a hypothesis production has not surfaced cleanly enough to label.
This is not a “your dataset is too small” problem. A 10,000-example offline set from March still does not contain May’s tool-schema change, June’s prompt revision, July’s index re-embed, or August’s users phrasing their questions a new way. Scale does not fix the snapshot. Only sampling production does, and sampling production means the eval surface lives where the agent lives.
The reframe: the trace is the eval case. The curated dataset is the regression seed; live spans are the working set. Failures cluster, the rubric scores them as they happen, the named clusters become the next batch of dataset entries, and the loop closes. Offline pass is necessary. Trace-attached pass is sufficient.
The four-dimensional trace score
Per-turn faithfulness on the final response is not enough granularity to diagnose a drifting agent. The trace score Future AGI’s Error Feed Judge writes back on every failing trace is four-dimensional, scored 1 to 5 each:
- Factual grounding. Did the agent stay anchored in the retrieved or supplied context, or did it confabulate? Catches retrieval-corpus drift and dataset drift at the response level.
- Privacy and safety. Did the agent leak PII, cross a tenant boundary, or comply with a jailbreak it should have refused? Catches tool-API drift on permissions and prompt drift on the refusal head.
- Instruction adherence. Did the agent follow the system prompt and refuse what should have been refused? Catches prompt drift directly: when v17 says one thing and the agent does another, this is the axis that drops.
- Optimal plan execution. Did the agent pick the right tool, in the right order, without redundant calls, retries, or unreachable branches? Catches agent-step compounding and tool-API drift on the call graph.
Four axes, four kinds of regression, one composite. When the composite drops on a trace, the axes tell you which drift mode just bit you. The same axes run in CI on the offline set and on live spans, so the diagnostic vocabulary is identical in both places.
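As a data structure, the four axes might look like the sketch below. The class, the unweighted-mean composite, and the hint table are illustrative assumptions, not the Error Feed schema; the axis-to-drift mapping is taken from the bullet list above.

```python
from dataclasses import dataclass, fields

# Which drift modes each axis tends to surface, per the list above.
DRIFT_HINTS = {
    "factual_grounding": ["retrieval-corpus drift", "dataset drift"],
    "privacy_and_safety": ["tool-API drift", "prompt drift"],
    "instruction_adherence": ["prompt drift"],
    "optimal_plan_execution": ["agent-step compounding", "tool-API drift"],
}


@dataclass(frozen=True)
class TraceScore:
    factual_grounding: int
    privacy_and_safety: int
    instruction_adherence: int
    optimal_plan_execution: int

    def __post_init__(self):
        for f in fields(self):
            value = getattr(self, f.name)
            if not 1 <= value <= 5:
                raise ValueError(f"{f.name} must be between 1 and 5, got {value}")

    def composite(self) -> float:
        """Unweighted mean of the four axes (an assumption; weights may differ)."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)

    def weakest_axis(self) -> str:
        """Name of the lowest-scoring axis; ties resolve to the first declared."""
        return min((f.name for f in fields(self)), key=lambda name: getattr(self, name))


score = TraceScore(5, 5, 2, 4)
print(score.composite(), score.weakest_axis(), DRIFT_HINTS[score.weakest_axis()])
```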
Error Feed: the loop closer
A trace score is a metric. A loop closer is a system. Error Feed is the part of the eval stack that turns the metric stream into a working diagnostic loop.
Mechanics. Failing traces flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering groups them into named issues at prob >= 0.4, so noise points stay recoverable. Each cluster fires a JudgeAgent on Claude Sonnet 4.5 (Bedrock) for a 30-turn investigation across 8 span-tools (including read_span, get_children, get_spans_by_type across 11 observation types, search_spans, submit_finding, submit_scores, submit_summary), with a Claude Haiku “Chauffeur” summarising spans over 3000 characters. Prompt cache hit ratio sits around 90 percent, which keeps the bill survivable.
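The thresholding step can be sketched without the clustering library itself. The code below assumes soft membership vectors (one probability per cluster, per trace) have already been produced by an HDBSCAN implementation; the function name and input shape are hypothetical, and this is not Error Feed's code.

```python
def assign_soft_clusters(memberships: dict[str, list[float]], threshold: float = 0.4):
    """Assign each trace to its most likely cluster if the probability clears the threshold.

    Traces below the threshold are kept as noise together with their best
    candidate cluster, so they can be re-attached later instead of discarded.
    """
    assigned: dict[int, list[str]] = {}
    noise: list[tuple[str, int, float]] = []
    for trace_id, probs in memberships.items():
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            assigned.setdefault(best, []).append(trace_id)
        else:
            noise.append((trace_id, best, probs[best]))
    return assigned, noise
```

Keeping the best candidate next to each noise point is what "recoverable" means in practice here: when a cluster grows next week, the borderline traces already know where they would land.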
Per cluster, the Judge emits three things engineers actually read: a 5-category, 30-subtype taxonomy classification, the 4-D trace score above, and an immediate_fix string naming the change to ship today (rubric edit, prompt patch, tool-call guard, retrieval-filter tweak).
The fix feeds the Platform’s self-improving evaluators so the rubric ages with the product. The cluster becomes a candidate dataset entry; the on-call engineer promotes representative traces into the offline set. The next PR touching that path has to clear them. Linear ships today (OAuth-wired one-click); Slack, GitHub, Jira, and PagerDuty are on the roadmap. Every incident becomes a regression test the team never has to write again.
The promote-back pattern
Closing the loop is a workflow, not a feature. Five steps:
- Cluster. HDBSCAN groups failing traces into named issues. No engineer triages a flat list of 800 failures.
- Score. The Judge writes the 4-D score, the taxonomy, and the immediate_fix.
- Promote. The on-call engineer accepts the cluster, selects 3 to 10 representative traces, commits them into the offline eval set with route tags and rubric labels.
- Re-gate. The next CI run grades the new entries with the same rubric the production scorer used. The next PR touching that path cannot regress them.
- Optimize. agent-opt searches the prompt space on the expanded set; the fix has to clear the rubric in CI before it ships.
Cadence: weekly on active products, faster on volatile launches. Static sets older than a quarter rarely match production. Drift is visible within two to three weeks on fast-moving agents. Sample failing traces, low-judge-score examples, and a stratified slice across segments. Annotate, version, commit.
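The promote step, reduced to a sketch: build dataset records from a cluster's representative traces and append them to a JSONL eval set. The record fields and the JSONL layout are assumptions for illustration; the 3-to-10 bound comes from the step list above.

```python
import json


def build_promotions(cluster: str, traces: list[dict], route: str,
                     rubric: str, dataset_version: str) -> list[dict]:
    """Turn representative traces into eval-set records (hypothetical schema)."""
    if not 3 <= len(traces) <= 10:
        raise ValueError("promote 3 to 10 representative traces per cluster")
    return [
        {
            "id": t["trace_id"],
            "input": t["input"],
            "route": route,
            "rubric": rubric,
            "source_cluster": cluster,
            "dataset_version": dataset_version,
        }
        for t in traces
    ]


def append_jsonl(existing: str, records: list[dict]) -> str:
    """Return the dataset content with new records appended, skipping ids already present."""
    if existing and not existing.endswith("\n"):
        existing += "\n"
    seen = {json.loads(line)["id"] for line in existing.splitlines() if line.strip()}
    fresh = [json.dumps(r, sort_keys=True) for r in records if r["id"] not in seen]
    return existing + "".join(line + "\n" for line in fresh)
```

Skipping ids that are already present makes the promote step safe to re-run, which matters when the same cluster resurfaces two weeks in a row.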
What good looks like (and the honest roadmap line)
Six things teams that close the gap ship. Most ship two or three.
- Same rubric in CI and production. Code-defined, versioned dataset on PRs, live spans on canary, same judge and prompt in both places.
- Span-attached scores. 4-D scores as OTel span attributes; trace tree and score live together.
- Multi-turn and outcome metrics. Conversation Completeness, Role Adherence, plus domain outcomes (resolved, filed, booked).
- Tool-call and retrieval scoring as their own layers. Graded independently of the final response.
- Auto-clustering with auto-RCA. HDBSCAN clusters, Judge writes immediate_fix, each cluster is a candidate dataset entry.
- Closed loop into the offline set. New clusters become regression tests in the next CI run.
Eval-driven optimization ships today via agent-opt: six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer Optuna-backed, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer), uniform EarlyStoppingConfig, unified Evaluator over heuristics, LLM-judge, and 70+ rubrics. Error Feed surfaces clusters, the engineer promotes traces, agent-opt searches against the same rubric, the winning candidate ships.
The honest constraint: a direct trace-stream-to-agent-opt connector (continuous optimization on live spans without the dataset round-trip) is on the active roadmap, not shipped. Teams that want continuous optimization run the loop weekly through the promote step today. Pretending the direct connector ships when it does not is the kind of vendor claim that costs trust the first time an engineer reads the code.
Three deliberate tradeoffs
- Closing the loop costs operational surface. Span-attached scores plus auto-clustering plus a promote workflow is more parts than pytest evals/. Payoff: a regression suite that ratchets stronger over time. New deployments can ship with traceAI plus ai-evaluation alone and turn the loop on later.
- Self-improving evaluators need their own monitoring. A rubric that calibrates against live traces can drift in directions you did not intend. Pin a small human-labelled hold-out; alarm when the judge disagrees with it by more than the inter-rater baseline.
- Trace-attached eval is noisier per dollar than offline scoring. A 4-D rubric on a 30-second trace costs more than per-turn scoring on a 200-token case, and variance is higher. Sample by failure signal, not uniformly. The classifier cascade in front of the frontier judge keeps unit economics survivable; the Platform prices classifier-backed scoring below Galileo Luna-2.
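"Sample by failure signal, not uniformly" can be sketched as a budgeted picker. The signals used here (retries, a thumbs-down, a low score from a cheap classifier) and the small uniform baseline slice are assumptions about what a team might have on hand, not a description of the Platform's cascade.

```python
import random


def has_failure_signal(trace: dict) -> bool:
    """Cheap pre-filter run before the expensive judge (hypothetical fields)."""
    return bool(
        trace.get("retries", 0) > 0
        or trace.get("thumbs_down", False)
        or trace.get("classifier_score", 1.0) < 0.5
    )


def sample_for_judging(traces: list[dict], budget: int,
                       baseline_fraction: float = 0.1, seed: int = 0) -> list[dict]:
    """Spend most of the judge budget on flagged traces, a small slice on clean ones."""
    rng = random.Random(seed)
    flagged = [t for t in traces if has_failure_signal(t)]
    clean = [t for t in traces if not has_failure_signal(t)]

    # The baseline slice keeps an eye on failures the cheap signals miss.
    baseline_n = min(len(clean), int(budget * baseline_fraction))
    flagged_n = min(len(flagged), budget - baseline_n)
    return rng.sample(flagged, flagged_n) + rng.sample(clean, baseline_n)
```

When flagged traces are scarce the function returns fewer than `budget` items instead of padding with clean traffic, so the frontier judge is not paid to confirm what already looks fine.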
How Future AGI ships the trace-eval bridge
Future AGI ships the eval stack as a package, not a single product. Start with the SDK for code-defined evals. Graduate to the Platform when the loop needs self-improving rubrics, in-product authoring, and classifier-backed cost economics.
ai-evaluation (Apache 2.0) is the code-first surface. 70+ EvalTemplate classes (Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy, PromptInjection, DataPrivacyCompliance, AnswerRefusal, TaskCompletion, EvaluateFunctionCalling, and the rest). Real Evaluator(fi_api_key=..., fi_secret_key=...).evaluate(...) API. 13 guardrail backends, 9 open-weight (LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B). Four distributed runners (Celery, Ray, Temporal, Kubernetes). Multi-modal CustomLLMJudge.
traceAI (Apache 2.0) carries the same rubric as a span-attached score on live traffic. 50+ AI surfaces across Python, TypeScript, Java, C#. 14 span kinds including TOOL, RETRIEVER, AGENT, A2A_CLIENT, A2A_SERVER, EVALUATOR, GUARDRAIL, VECTOR_DB (Phoenix ships 8, Langfuse 5). Pluggable semantic conventions at register() time (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY). Server-side EvalTag wires rubric to span at zero added inference latency.
The Future AGI Platform is the operational layer: self-improving evaluators retune from thumbs feedback, an in-product authoring agent writes custom rubrics from natural-language descriptions, classifier-backed evals run at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the stack as the clustering and what-to-fix layer (mechanics above).
Ready to close the gap on your own agent? Start with the ai-evaluation SDK quickstart, wire one EvalTemplate against your current dataset in pytest, then attach the same template as an EvalTag on live traces via traceAI. The same rubric running in both places is the diff that closes the trace-eval gap.
Related reading
Building an LLM eval framework is a one-week project and a one-year maintenance burden. The eight components, the honest cost map, and what to build vs buy.
Eval dataset drift is the silent killer. A 2026 methodology for catching input, prompt-template, and retrieval-corpus drift before your CI gate tests yesterday's traffic.
RAG eval in CI/CD without the theatre: the cheap-fast-significant triangle, statistical gating, sharded parallelism, classifier cascades, production bridge.