Your Agent Passes Evals and Fails in Production. Here's Why. (2026)

Your eval set is a snapshot, production is a river. Six drift modes that age eval sets, and the trace-as-eval loop that closes the gap.

April 21, 2026

Updated May 19, 2026

14 min read

llm-evaluation agent-evaluation production-llm trace-eval llm-observability 2026

3:14 am. The customer-support agent that shipped Wednesday at 0.91 average on a CI eval suite (47 scenarios, four rubrics, three sprints green) is now quoting refund amounts off by an order of magnitude, contradicting itself across turns, and on one trace handing a user another customer’s order. You pull the failing conversations. Every single one passes the per-turn faithfulness rubric. Every single one failed the user. The eval says ship. Production says you already broke trust.

This is the trace-eval gap, and it’s the most reliable on-call pattern of 2026. The instinct is to blame the rubric. The right read is that the rubric was fine in March and ageing in April; production has been moving the entire time. Your eval set is a snapshot. Production is a river. Evals don’t ship. Traces do. Every release ages your eval set the day it lands; without a pipeline that promotes new failure modes back into the offline set, the offline-pass-prod-fail gap is mathematical, not accidental.

The opinion this post earns: the right unit of evaluation is the production trace, not the curated test case. Static offline evals are a necessary regression gate, never a sufficient ship gate, because they measure against a world that stopped existing the day they froze. The architecture that closes the gap runs the same rubric in both places, scores the live span, clusters the failures, and ratchets the offline set from what production already broke. This guide walks the six drift modes, the 4-D trace score, the Error Feed loop, and the honest map of what ships today versus what is still on the roadmap.

TL;DR: six drift modes age every eval set

Drift mode	What the offline eval sees	What production actually does
Dataset drift	All curated cases pass	New user intents the set never had
Tool-API drift	Mocked tool returns the same shape	Vendor changed schema, error codes, or rate limits
Prompt drift	Rubric written for v3, frozen in git	Prompt is on v17, rubric still grades v3
Retrieval-corpus drift	Index frozen at eval-build time	Index doubled, chunker bumped, same query, new chunks
User-distribution drift	Hand-authored test inputs	Real traffic looks nothing like the test set
Agent-step compounding	Per-step success 95 percent	Eight steps multiply to 66 percent end-to-end

Each runs on a different timescale. Dataset and user-distribution drift creep in over weeks. Tool-API and prompt drift land overnight. Retrieval-corpus drift is silent until a re-index. Agent-step compounding is structural and was never going to be caught by single-turn rubrics. None of them is a “more evals” problem. They are an architecture problem.

The eval set ages the day it ships

When you froze the eval set, you froze a hypothesis about what users would do, which tools would behave how, which prompts would still be in production, and which chunks the retriever would surface. Six months later, every one of those hypotheses moved a different distance. The rubric scoring 0.91 in CI is grading a world the agent does not live in anymore.

The rubric is not wrong. The rubric is stale, the same way a unit test is stale when the function it covers got refactored two months ago. Both pass. Neither protects you. The fix in software engineering is to write the test for the current function. The fix in agent eval is to grade the current trace.

What the strongest teams ship in 2026: the same rubric definition runs in pytest against a versioned dataset for the CI gate and against live OpenTelemetry spans as a span-attached score on production traffic. The CI gate is the floor. The span-attached score is the river. The dataset grows weekly from the spans the rubric just flagged. The eval surface is no longer a snapshot. It is the trace stream, sampled and scored continuously.
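
A minimal sketch of the "one rubric, two call sites" idea, assuming a toy token-overlap rubric stands in for a real judge. The names groundedness_rubric, ci_gate, score_span, and the eval.groundedness attribute are hypothetical and not taken from any SDK.

```python
from typing import Dict, List


def groundedness_rubric(response: str, context: str) -> float:
    """Toy rubric: fraction of response tokens that also appear in the context."""
    resp_tokens = response.lower().split()
    if not resp_tokens:
        return 0.0
    ctx_tokens = set(context.lower().split())
    return sum(t in ctx_tokens for t in resp_tokens) / len(resp_tokens)


def ci_gate(dataset: List[Dict[str, str]], threshold: float) -> bool:
    """Offline call site: every versioned case must clear the threshold."""
    return all(
        groundedness_rubric(case["response"], case["context"]) >= threshold
        for case in dataset
    )


def score_span(span_attributes: Dict[str, object]) -> Dict[str, object]:
    """Online call site: attach the same rubric's score to a live span's attributes."""
    score = groundedness_rubric(
        str(span_attributes["output"]), str(span_attributes["context"])
    )
    # "eval.groundedness" is an illustrative attribute name, not a standard one.
    return {**span_attributes, "eval.groundedness": score}


dataset = [{"response": "refund is 20 dollars", "context": "the refund is 20 dollars"}]
ci_passed = ci_gate(dataset, threshold=0.8)
live = score_span({"output": "refund is 200 dollars", "context": "the refund is 20 dollars"})
```

The point of the shape is that both call sites import one function, so a rubric change lands in CI and on live traffic in the same commit.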

Drift 1: dataset drift

The eval set was written at launch. Users found intents the test authors never anticipated. The eval still passes because the dataset never moved.

Tell. Offline scores flat for months, production complaints diversify, the team cannot reproduce most reported failures on the test set.

Fix. Sample failing traces weekly. Bucket by user segment, intent, and judge score. Promote the hardest 5 to 10 percent into the eval set with version tags. Every promoted trace is a regression future PRs cannot break.
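
The weekly selection can be sketched roughly as follows, assuming each failing trace is a dict with segment, intent, and judge_score keys (hypothetical field names) and that a lower judge score means a harder case.

```python
import math
from collections import defaultdict
from typing import Dict, List


def select_for_promotion(traces: List[dict], fraction: float = 0.1) -> List[dict]:
    """Bucket failing traces by (segment, intent), then keep the lowest-scoring
    `fraction` of each bucket, with at least one trace per bucket."""
    buckets: Dict[tuple, List[dict]] = defaultdict(list)
    for trace in traces:
        buckets[(trace["segment"], trace["intent"])].append(trace)

    promoted: List[dict] = []
    for bucket in buckets.values():
        bucket.sort(key=lambda t: t["judge_score"])  # lowest score = hardest
        keep = max(1, math.ceil(len(bucket) * fraction))
        promoted.extend(bucket[:keep])
    return promoted


traces = [
    {"id": i, "segment": "smb", "intent": "refund", "judge_score": i / 10}
    for i in range(10)
] + [{"id": 99, "segment": "enterprise", "intent": "billing", "judge_score": 0.4}]

promoted = select_for_promotion(traces, fraction=0.1)
```

Keeping at least one trace per bucket is a design choice in this sketch: it stops a rare segment from being crowded out by a high-volume one.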

Drift 2: tool-API drift

You mocked the tool call in CI. The real endpoint changed schema, error shape, or rate-limit headers on a Tuesday. The agent retries, the retry loop times out, the agent fabricates a reasonable-sounding answer. CI is green because the mock still returns the old payload.

Tell. Tool-call latency climbs, retries climb, per-response rubric still passes, cost-per-success creeps up.

Fix. Score tool-call success as its own rubric on live spans. EvaluateFunctionCalling grades argument shape and call sequence. A failing tool call shows up in the trace tree right next to the failing response, scored. The mocked CI test catches your regression; the span-attached score catches the vendor’s.
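
As a rough illustration of scoring the tool call separately from the response, the sketch below checks a tool payload against the shape the agent expects. It is a hand-rolled stand-in, not the EvaluateFunctionCalling template; score_tool_call and the payload fields are hypothetical.

```python
from typing import Any, Dict


def score_tool_call(response: Dict[str, Any], expected_fields: Dict[str, type]) -> Dict[str, Any]:
    """Return a pass/fail flag plus the reasons, so the span carries its own diagnosis."""
    problems = []
    for name, expected_type in expected_fields.items():
        if name not in response:
            problems.append(f"missing field: {name}")
        elif not isinstance(response[name], expected_type):
            problems.append(f"wrong type for {name}: {type(response[name]).__name__}")
    return {"tool_call_ok": not problems, "problems": problems}


expected = {"order_id": str, "refund_amount": float}
old_payload = {"order_id": "A-1", "refund_amount": 19.99}            # what the CI mock returns
new_payload = {"order_id": "A-1", "refund": {"amount_cents": 1999}}  # vendor renamed the field

mock_result = score_tool_call(old_payload, expected)
live_result = score_tool_call(new_payload, expected)
```

The mock passes and the live payload fails on the same check, which is the gap the span-attached score is meant to expose.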

Drift 3: prompt drift

You shipped v17 of the prompt on Friday. The rubric was written for v3 in February. The rubric still grades the criteria v3 cared about. The agent is being evaluated for the wrong thing.

Tell. A senior engineer reads ten traces and disagrees with the judge on six but cannot articulate why. The judge is grading by the old contract.

Fix. Version the rubric in the same PR as the prompt it scores. Treat the rubric like a contract test: when the prompt’s intent moves, the rubric moves with it, and the next CI run regrades the dataset under the new contract. Track judge-versus-human agreement on a small calibration set; when it drops, the rubric is overdue.
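
One simple way to track judge-versus-human agreement is a plain match rate over the calibration set. The sketch below assumes binary pass/fail verdicts and an illustrative 0.8 alarm threshold; neither is prescribed by the text above.

```python
from typing import List


def agreement_rate(judge_labels: List[bool], human_labels: List[bool]) -> float:
    """Fraction of calibration examples where judge and human give the same verdict."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not judge_labels:
        raise ValueError("calibration set is empty")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def rubric_is_overdue(
    judge_labels: List[bool], human_labels: List[bool], min_agreement: float = 0.8
) -> bool:
    """Flag the rubric for review when agreement falls below the chosen floor."""
    return agreement_rate(judge_labels, human_labels) < min_agreement


# The "tell" above: an engineer reads ten traces and disagrees with the judge on six.
judge = [True] * 10
human = [True] * 4 + [False] * 6
agreement = agreement_rate(judge, human)
overdue = rubric_is_overdue(judge, human)
```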

Drift 4: retrieval-corpus drift

The retriever you evaluated in March indexed 12,000 documents at chunk size 800. By May the index has 38,000 documents, the chunker changed during a re-embed, and the same query lands on different top-k chunks. The generator dutifully grounds in whatever it was handed. Groundedness still scores 0.94. The answer is grounded in the wrong material.

Tell. Generation rubrics hold. Users say the bot is “less helpful than last quarter.” Trace inspection shows the top-1 chunk shifted on a class of queries.

Fix. Split the eval suite by layer. Retrieval rubrics (ContextRelevance, ChunkAttribution, ChunkUtilization) catch the index drift before generation rubrics absorb it. A drop in context relevance with stable groundedness means the retriever moved. A drop in groundedness with stable context relevance means the generator did. One bisect instead of three days. Covered in Evaluate RAG in CI/CD (2026).
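
The bisect logic reads naturally as a small decision function. This is a sketch under assumptions: the metric keys, the 0.05 tolerance, and the context-relevance numbers in the example are illustrative, with only the 0.94 groundedness figure borrowed from the scenario above.

```python
def bisect_layer(baseline: dict, current: dict, tolerance: float = 0.05) -> str:
    """Compare layer-level scores against a baseline and name the layer that moved."""
    relevance_drop = baseline["context_relevance"] - current["context_relevance"]
    grounded_drop = baseline["groundedness"] - current["groundedness"]
    retriever_moved = relevance_drop > tolerance
    generator_moved = grounded_drop > tolerance
    if retriever_moved and generator_moved:
        return "both"
    if retriever_moved:
        return "retriever"
    if generator_moved:
        return "generator"
    return "stable"


march = {"context_relevance": 0.88, "groundedness": 0.94}
may = {"context_relevance": 0.71, "groundedness": 0.94}
verdict = bisect_layer(march, may)
```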

Drift 5: user-distribution drift

Your eval set was hand-authored or sampled from launch-month traffic. Six months later, real users hit the agent with slang, multi-language code-switching, longer prompts, screenshots, and chains of follow-ups the dataset never had. The judge calibrated against the curated set reads 15 points lower on live traffic.

Tell. A spot-check of production traces, scored by hand, disagrees with the judge by 15+ points. Engineers stop trusting the rubric and start reading traces directly.

Fix. Calibrate the judge against production samples, not the dataset. Each rubric ships with a small human-labelled calibration set drawn from production. Track judge-versus-human drift as its own metric. The Future AGI Platform retunes evaluators end-to-end from thumbs up/down and relabels in the in-product UI, which makes weekly recalibration a default instead of a cost decision.
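
To make "judge-versus-human drift as its own metric" concrete, the sketch below computes the mean absolute gap in points between judge and human scores, once on a curated sample and once on a production sample. The 0 to 100 scale, the sample values, and the function name are assumptions for illustration.

```python
from statistics import mean
from typing import List, Tuple


def judge_human_gap(pairs: List[Tuple[float, float]]) -> float:
    """Mean absolute gap, in points, between judge and human scores on a 0-100 scale."""
    if not pairs:
        raise ValueError("need at least one (judge, human) pair")
    return mean(abs(judge - human) for judge, human in pairs)


# (judge_score, human_score) pairs; the values are made up for the example.
curated_sample = [(90, 88), (85, 86), (92, 90)]
production_sample = [(90, 70), (85, 72), (88, 75)]

curated_gap = judge_human_gap(curated_sample)
production_gap = judge_human_gap(production_sample)
needs_recalibration = production_gap >= 15
```

Reporting the two gaps side by side shows whether the judge is miscalibrated everywhere or only on traffic it was never tuned against.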

Drift 6: agent-step compounding

Every per-step rubric scores 95 percent. The agent makes eight tool calls per session. If the steps fail independently, 0.95 to the eighth is 0.66. Roughly one session in three ends up structurally wrong even when every individual step looks right. The rubric never multiplied.

Tell. Per-turn metrics high. Conversation Completeness, outcome rate, CSAT low. Tickets read “the bot kept asking me the same question” or “the bot said yes then said no.”

Fix. Score the trace as a unit. Add Conversation Completeness, Role Adherence, Knowledge Retention, Turn Relevancy on the conversation. Add Optimal Plan Execution on the span tree. Multi-turn metrics are noisier per dollar than per-turn metrics, but they correlate with user experience an order of magnitude better. The Multi-Turn LLM Evaluation playbook walks the metric stack.
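
The compounding arithmetic is short enough to check directly. The sketch assumes steps succeed or fail independently, which is a simplification: correlated failures would change the number.

```python
def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming steps fail independently."""
    return per_step_success ** steps


session_success = end_to_end_success(0.95, 8)
session_failure = 1 - session_success
print(f"{session_success:.2f} succeed, {session_failure:.2f} fail")
# prints "0.66 succeed, 0.34 fail"
```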

Why static offline evals cannot catch any of these

The shared property of all six drift modes: they happen after the eval set was frozen. A static dataset cannot encode a hypothesis it does not yet have. The CI gate is a regression test on a hypothesis you wrote in the past. The drift is a hypothesis production has not surfaced cleanly enough to label.

This is not a “your dataset is too small” problem. A 10,000-example offline set from March still does not contain May’s tool-schema change, June’s prompt revision, July’s index re-embed, or August’s users phrasing their questions a new way. Scale does not fix the snapshot. Only sampling production does, and sampling production means the eval surface lives where the agent lives. The mechanics behind each of these shifts are unpacked in what LLM drift actually is.

The reframe: the trace is the eval case. The curated dataset is the regression seed; live spans are the working set. Failures cluster, the rubric scores them as they happen, the named clusters become the next batch of dataset entries, and the loop closes. Offline pass is necessary. Trace-attached pass is sufficient.

The four-dimensional trace score

Per-turn faithfulness on the final response is not enough granularity to diagnose a drifting agent. The trace score Future AGI’s Error Feed Judge writes back on every failing trace is four-dimensional, scored 1 to 5 each:

Factual grounding. Did the agent stay anchored in the retrieved or supplied context, or did it confabulate? Catches retrieval-corpus drift and dataset drift at the response level.
Privacy and safety. Did the agent leak PII, cross a tenant boundary, or comply with a jailbreak it should have refused? Catches tool-API drift on permissions and prompt drift on refusal behavior.
Instruction adherence. Did the agent follow the system prompt and refuse what should have been refused? Catches prompt drift directly: when v17 says one thing and the agent does another, this is the axis that drops.
Optimal plan execution. Did the agent pick the right tool, in the right order, without redundant calls, retries, or unreachable branches? Catches agent-step compounding and tool-API drift on the call graph.

Four axes, four kinds of regression, one composite. When the composite drops on a trace, the axes tell you which drift mode just bit you. The same axes run in CI on the offline set and on live spans, so the diagnostic vocabulary is identical in both places.
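
A rough data-structure sketch of the four axes. The class and field names are hypothetical, and the composite is assumed to be an unweighted mean because the text does not define how the axes combine.

```python
from dataclasses import asdict, dataclass

# Axis-to-drift hints, paraphrased from the axis descriptions above.
AXIS_TO_DRIFT = {
    "factual_grounding": "retrieval-corpus or dataset drift",
    "privacy_safety": "tool-API drift on permissions or prompt drift on refusals",
    "instruction_adherence": "prompt drift",
    "optimal_plan_execution": "agent-step compounding or tool-API drift",
}


@dataclass(frozen=True)
class TraceScore:
    factual_grounding: int
    privacy_safety: int
    instruction_adherence: int
    optimal_plan_execution: int

    def __post_init__(self) -> None:
        for axis, value in asdict(self).items():
            if not 1 <= value <= 5:
                raise ValueError(f"{axis} must be between 1 and 5, got {value}")

    def composite(self) -> float:
        # Assumption: unweighted mean of the four axes.
        values = asdict(self).values()
        return sum(values) / len(values)

    def weakest_axis(self) -> str:
        scores = asdict(self)
        return min(scores, key=scores.get)


score = TraceScore(
    factual_grounding=5,
    privacy_safety=5,
    instruction_adherence=2,
    optimal_plan_execution=4,
)
suspect = AXIS_TO_DRIFT[score.weakest_axis()]
```

Reading the weakest axis first is one way to turn a composite drop into a hypothesis about which drift mode to investigate.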

Error Feed: the loop closer

A trace score is a metric. A loop closer is a system. Error Feed is the part of the eval stack that turns the metric stream into a working diagnostic loop.

Mechanics. Failing traces flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering groups them into named issues at prob >= 0.4, so noise points stay recoverable. Each cluster fires a JudgeAgent on Claude Sonnet 4.5 (Bedrock) for a 30-turn investigation across 8 span-tools (including read_span, get_children, get_spans_by_type across 11 observation types, search_spans, submit_finding, submit_scores, and submit_summary), with a Claude Haiku “Chauffeur” summarising spans over 3000 characters. Prompt cache hit ratio sits around 90 percent, which keeps the bill survivable.
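
The membership threshold can be illustrated without the clustering itself. The sketch below assumes an upstream soft-clustering step has already produced a per-cluster membership probability vector for each trace, and shows one possible reading of "noise points stay recoverable": a trace under the 0.4 cutoff is marked as noise but keeps a pointer to its nearest cluster. All names are hypothetical.

```python
from typing import Dict, List

NOISE = -1


def assign_clusters(
    memberships: Dict[str, List[float]], min_prob: float = 0.4
) -> Dict[str, dict]:
    """Assign each trace to its most likely cluster if the probability clears
    min_prob; otherwise label it noise but remember the nearest cluster."""
    assignments = {}
    for trace_id, probs in memberships.items():
        best = max(range(len(probs)), key=probs.__getitem__)
        confident = probs[best] >= min_prob
        assignments[trace_id] = {
            "cluster": best if confident else NOISE,
            "nearest_cluster": best,
            "prob": probs[best],
        }
    return assignments


memberships = {
    "t1": [0.70, 0.10],  # clearly cluster 0
    "t2": [0.35, 0.30],  # below the cutoff: noise, nearest cluster 0
    "t3": [0.05, 0.45],  # cluster 1
}
assignments = assign_clusters(memberships)
```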

Per cluster, the Judge emits three things engineers actually read: a 5-category, 30-subtype taxonomy classification, the 4-D trace score above, and an immediate_fix string naming the change to ship today (rubric edit, prompt patch, tool-call guard, retrieval-filter tweak).

The fix feeds the Platform’s self-improving evaluators so the rubric ages with the product. The cluster becomes a candidate dataset entry; the on-call engineer promotes representative traces into the offline set. The next PR touching that path has to clear them. Linear ships today (OAuth-wired one-click); Slack, GitHub, Jira, and PagerDuty are on the roadmap. Every incident becomes a regression test the team never has to write again.

The promote-back pattern

Closing the loop is a workflow, not a feature. Five steps:

Cluster. HDBSCAN groups failing traces into named issues. No engineer triages a flat list of 800 failures.
Score. The Judge writes the 4-D score, the taxonomy, and the immediate_fix.
Promote. The on-call engineer accepts the cluster, selects 3 to 10 representative traces, commits them into the offline eval set with route tags and rubric labels.
Re-gate. The next CI run grades the new entries with the same rubric the production scorer used. The next PR touching that path cannot regress them.
Optimize. agent-opt searches the prompt space on the expanded set; the fix has to clear the rubric in CI before it ships.
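
The promote step above could look something like this in code. It is a sketch: the entry schema, the promote_cluster name, and the version strings are assumptions, with only the 3 to 10 trace range, route tags, and rubric labels taken from the steps themselves.

```python
from typing import List


def promote_cluster(
    cluster_name: str,
    representative_traces: List[dict],
    eval_set: List[dict],
    route: str,
    rubrics: List[str],
    version: str,
) -> List[dict]:
    """Return a new eval set with the cluster's representative traces appended,
    skipping any trace that was already promoted."""
    if not 3 <= len(representative_traces) <= 10:
        raise ValueError("select 3 to 10 representative traces per cluster")
    already_promoted = {entry["trace_id"] for entry in eval_set}
    new_entries = [
        {
            "trace_id": trace["trace_id"],
            "input": trace["input"],
            "cluster": cluster_name,
            "route": route,
            "rubrics": list(rubrics),
            "added_in": version,
        }
        for trace in representative_traces
        if trace["trace_id"] not in already_promoted
    ]
    return eval_set + new_entries


eval_set = [{
    "trace_id": "t1", "input": "where is my refund", "cluster": "refund-amount-off",
    "route": "refunds", "rubrics": ["factual_grounding"], "added_in": "v12",
}]
traces = [{"trace_id": f"t{i}", "input": f"refund question {i}"} for i in (1, 2, 3)]
updated = promote_cluster(
    "refund-amount-off", traces, eval_set,
    route="refunds", rubrics=["factual_grounding"], version="v13",
)
```

Returning a new list instead of mutating the old one keeps each dataset version reproducible, which matters once the CI gate is pinned to a version tag.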

Cadence: weekly on active products, faster on volatile launches. Static sets older than a quarter rarely match production. Drift is visible within two to three weeks on fast-moving agents. Sample failing traces, low-judge-score examples, and a stratified slice across segments. Annotate, version, commit.

What good looks like (and the honest roadmap line)

Six things teams that close the gap ship. Most ship two or three.

Same rubric in CI and production. Code-defined, versioned dataset on PRs, live spans on canary, same judge and prompt in both places.
Span-attached scores. 4-D scores as OTel span attributes; trace tree and score live together.
Multi-turn and outcome metrics. Conversation Completeness, Role Adherence, plus domain outcomes (resolved, filed, booked).
Tool-call and retrieval scoring as their own layers. Graded independently of the final response.
Auto-clustering with auto-RCA. HDBSCAN clusters, Judge writes immediate_fix, each cluster is a candidate dataset entry.
Closed loop into the offline set. New clusters become regression tests in the next CI run.

Eval-driven optimization ships today via agent-opt: six optimizers (RandomSearchOptimizer, the Optuna-backed BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer), uniform EarlyStoppingConfig, unified Evaluator over heuristics, LLM-judge, and 70+ rubrics. Error Feed surfaces clusters, the engineer promotes traces, agent-opt searches against the same rubric, the winning candidate ships.

The honest constraint: a direct trace-stream-to-agent-opt connector (continuous optimization on live spans without the dataset round-trip) is on the active roadmap, not shipped. Teams that want continuous optimization run the loop weekly through the promote step today. Pretending the direct connector ships when it does not is the kind of vendor claim that costs trust the first time an engineer reads the code.

Three deliberate tradeoffs

Closing the loop costs operational surface. Span-attached scores plus auto-clustering plus a promote workflow is more parts than pytest evals/. Payoff: a regression suite that ratchets stronger over time. New deployments can ship with traceAI plus ai-evaluation alone and turn the loop on later.
Self-improving evaluators need their own monitoring. A rubric that calibrates against live traces can drift in directions you did not intend. Pin a small human-labelled hold-out; alarm when the judge disagrees with it by more than the inter-rater baseline.
Trace-attached eval is noisier per dollar than offline scoring. A 4-D rubric on a 30-second trace costs more than per-turn scoring on a 200-token case, and variance is higher. Sample by failure signal, not uniformly. The classifier cascade in front of the frontier judge keeps unit economics survivable; the Platform prices classifier-backed scoring below Galileo Luna-2.
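
"Sample by failure signal, not uniformly" can be sketched as a small sampler. The signal names (tool_error, thumbs_down, classifier_flag) and the baseline rate are hypothetical; the idea is that every flagged trace goes to the expensive judge, and only a thin uniform slice of the rest does.

```python
import random
from typing import List


def sample_for_judging(traces: List[dict], baseline_rate: float, seed: int = 0) -> List[dict]:
    """Send every trace with a failure signal to the judge, plus a small
    uniform slice of the remaining traces as a baseline."""
    rng = random.Random(seed)
    selected = []
    for trace in traces:
        has_signal = (
            trace.get("tool_error", False)
            or trace.get("thumbs_down", False)
            or trace.get("classifier_flag", False)
        )
        if has_signal or rng.random() < baseline_rate:
            selected.append(trace)
    return selected


traces = [
    {"id": 1, "tool_error": True},
    {"id": 2},
    {"id": 3, "thumbs_down": True},
    {"id": 4},
    {"id": 5, "classifier_flag": True},
]
flagged_only = sample_for_judging(traces, baseline_rate=0.0)
```

The uniform slice is worth keeping above zero in practice, since failures that trip no signal are exactly the ones a signal-only sampler would never see.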

How Future AGI ships the trace-eval bridge

Future AGI ships the eval stack as a package, not a single product. Start with the SDK for code-defined evals. Graduate to the Platform when the loop needs self-improving rubrics, in-product authoring, and classifier-backed cost economics.

ai-evaluation (Apache 2.0) is the code-first surface. 70+ EvalTemplate classes (Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy, PromptInjection, DataPrivacyCompliance, AnswerRefusal, TaskCompletion, EvaluateFunctionCalling, and the rest). Real Evaluator(fi_api_key=..., fi_secret_key=...).evaluate(...) API. 13 guardrail backends, 9 open-weight (LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B). Four distributed runners (Celery, Ray, Temporal, Kubernetes). Multi-modal CustomLLMJudge.

traceAI (Apache 2.0) carries the same rubric as a span-attached score on live traffic. 50+ AI surfaces across Python, TypeScript, Java, C#. 14 span kinds including TOOL, RETRIEVER, AGENT, A2A_CLIENT, A2A_SERVER, EVALUATOR, GUARDRAIL, VECTOR_DB (Phoenix ships 8, Langfuse 5). Pluggable semantic conventions at register() time (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY). Server-side EvalTag wires rubric to span at zero added inference latency.

The Future AGI Platform is the operational layer: self-improving evaluators retune from thumbs feedback, an in-product authoring agent writes custom rubrics from natural-language descriptions, classifier-backed evals run at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the stack as the clustering and what-to-fix layer (mechanics above).

Ready to close the gap on your own agent? Start with the ai-evaluation SDK quickstart, wire one EvalTemplate against your current dataset in pytest, then attach the same template as an EvalTag on live traces via traceAI. The same rubric running in both places is the diff that closes the trace-eval gap.

Frequently asked questions

Why does my agent pass the eval suite and still fail in production?

Because the eval set is a snapshot and production is a river. Six drift modes age the snapshot the day it ships. Dataset drift (new user intents the set never had). Tool-API drift (the third-party endpoint changed its schema, error shape, or rate limit). Prompt drift (you edited the prompt, the rubric did not move with it). Retrieval-corpus drift (the index grew, the chunker changed, the same query surfaces different chunks). User-distribution drift (real traffic looks nothing like the curated set). Agent-step compounding (eight tool calls at 95 percent each multiply to 66 percent end-to-end). The gap is mathematical, not accidental.

What replaces a static eval set?

Trace-as-eval-surface. The same rubric definition runs in pytest against a versioned dataset and against live OpenTelemetry spans in production, with the score attached to the span. Production failures cluster into named issues, the on-call engineer promotes representative traces into the offline set, and the next PR has to clear them. Evals don't ship. Traces do. The offline set ratchets stronger because production is the dataset the offline set is always catching up to.

What does a 4-dimensional trace score actually measure?

Four axes, scored 1 to 5 by the same judge on every trace. Factual grounding (did the agent stay anchored in the retrieved or supplied context, or did it confabulate). Privacy and safety (did the agent leak PII, cross a tenant boundary, or follow a jailbreak). Instruction adherence (did the agent obey the system prompt and refuse what it should have refused). Optimal plan execution (did the agent pick the right tool, in the right order, without redundant calls or loops). The composite is the trace's score; the four axes are the diagnosis when it drops.

How does Error Feed close the loop?

Error Feed clusters every failing trace into a named issue via HDBSCAN soft-clustering over span embeddings stored in ClickHouse. A Claude Sonnet 4.5 Judge agent on Bedrock runs a 30-turn investigation across 8 span-tools (read_span, get_children, get_spans_by_type, search_spans, submit_finding, submit_scores, submit_summary, plus a Haiku Chauffeur for spans over 3000 characters). The Judge emits a 5-category 30-subtype taxonomy classification, the 4-D trace score, and an immediate_fix string. Each fix feeds the Platform's self-improving evaluators; each cluster becomes a candidate dataset entry the engineer promotes into the offline set.

How often should the offline set refresh from production?

Weekly is the floor on active products. Sample failing traces (low 4-D scores), hardened edge cases, and a stratified slice across user segments. Annotate, version, commit. Static eval sets older than a quarter rarely match production patterns; we have seen the gap open within two to three weeks on fast-moving agents. The promote-to-dataset step is the loop. Without it, the offline set ages on autopilot.

Is offline eval still useful when traces are the source of truth?

Yes, and the two have different jobs. Offline is the regression gate that fires on a PR. Production is the drift signal that fires on a deploy. Drop neither. The bug is treating an offline pass as a sufficient condition for shipping when it's only a necessary one. The same rubric runs in both places; the gap between the offline mean and the production mean is its own first-class quality metric.

Where does agent-opt fit in the loop?

Eval-driven optimization ships today. agent-opt exposes six optimizers (RandomSearchOptimizer, the Optuna-backed BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) with a uniform EarlyStoppingConfig and an Evaluator over heuristics, LLM-judge, and 70+ Future AGI rubrics. You point an optimizer at the offline set Error Feed just expanded, and it searches the prompt space against the same rubric the CI gate uses. The direct trace-stream-to-agent-opt connector is on the active roadmap; today the loop runs through the offline dataset by design.

How does Future AGI ship the trace-eval bridge?

Future AGI ships the eval stack as a package. The ai-evaluation SDK (Apache 2.0) is the code-first surface: 70+ EvalTemplate classes, real Evaluator API, 13 guardrail backends, four distributed runners (Celery, Ray, Temporal, Kubernetes), a fi CLI with native CI assertions. traceAI (50+ AI surfaces across Python, TypeScript, Java, C#) carries the same rubric as a span-attached score on live traces with 14 span kinds and pluggable semantic conventions at register() time. The Future AGI Platform layers self-improving evaluators tuned by thumbs feedback and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the eval stack: HDBSCAN clustering plus a Sonnet 4.5 Judge writes the immediate_fix, fixes feed self-improving evaluators, traces promote into the dataset, agent-opt searches the prompt space against the same rubric.
