What Is Eval Drift?
Eval drift is when an LLM evaluation pipeline stops representing real production quality because the dataset, evaluator, threshold, or traffic distribution has changed. It is an evaluation reliability issue: a release can pass offline evals while users see more hallucinations, stale context, or bad tool choices. In FutureAGI, eval drift shows up across fi.datasets.Dataset runs, eval:* evaluator scores, and production traces; teams detect it by comparing cohorts, score deltas, and threshold breaches over time.
Why Eval Drift Matters in Production LLM and Agent Systems
Eval drift creates the worst kind of quality failure: the dashboard says green while the product is worse. A support assistant can keep a 0.90 average on a stale golden dataset while new refund-policy questions trigger unsupported answers. A coding agent can keep passing task-completion evals after a tool-schema update because the eval never exercises the new branch. The symptom is not one bad trace; it is disagreement between offline scores and production outcomes.
Developers feel it as false confidence. SREs see incident tickets or p99 latency spikes with no quality context. Product teams see thumbs-down rate rise after a model swap, yet the release gate still passed. Compliance teams lose audit confidence because the test evidence no longer matches live behavior.
Agentic systems make the problem sharper. Multi-step pipelines mix retrieval, planning, tool calls, model fallback, and memory. A small data shift in the first step can change every downstream evaluator. Logs often show cohort-specific drops: eval-fail-rate-by-cohort rises for enterprise users, hallucination tickets cluster around recently changed documents, or tool retries jump only on one route. Eval drift is the signal that your measurement system, not just your model, needs maintenance.
How FutureAGI Handles Eval Drift
FutureAGI’s approach is to treat eval drift as a dataset-version and evaluator-cohort problem, not as a single score. A team pins a fi.datasets.Dataset version for its golden set, attaches Groundedness, ContextRelevance, and ToolSelectionAccuracy through Dataset.add_evaluation(), and stores the resulting score distribution as the baseline. The same eval:* suite then runs on release candidates and sampled production traces from traceAI-langchain.
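A minimal sketch of that setup is below; the Dataset constructor arguments, the shape of the add_evaluation() argument, and the run() call are illustrative assumptions rather than the exact fi SDK signatures.
from fi.datasets import Dataset
from fi.evals import Groundedness, ContextRelevance, ToolSelectionAccuracy

# Pin the golden set to an explicit version so later runs compare like-for-like
# (name and version arguments are hypothetical).
golden = Dataset(name="support-golden", version="v12")

# Attach the evaluator suite used for both baseline and release-candidate runs.
golden.add_evaluation([Groundedness(), ContextRelevance(), ToolSelectionAccuracy()])

# Store the resulting score distribution as the drift baseline
# (run() is an illustrative method name, not a confirmed SDK call).
baseline_run = golden.run()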
A concrete workflow: a RAG support agent passes the golden dataset at 94% but production traces show a growing failure rate for queries tagged billing_policy_v3. The engineer slices by dataset cohort and evaluator class. ContextRelevance remains stable, but Groundedness drops eight points on the new policy cohort, while ToolSelectionAccuracy drops only for refund-tool traces. That split says the retriever is finding context, but the answer step is no longer using it consistently.
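That slice can be reproduced in plain Python, assuming each scored production trace is exported as a dict with a cohort tag, an evaluator name, and a score (field names here are hypothetical):
from collections import defaultdict
from statistics import mean

# Hypothetical exported rows: one per scored production trace.
rows = [
    {"cohort": "billing_policy_v3", "evaluator": "Groundedness", "score": 0.74},
    {"cohort": "billing_policy_v3", "evaluator": "ContextRelevance", "score": 0.87},
    {"cohort": "general", "evaluator": "Groundedness", "score": 0.90},
]

# Group scores by (cohort, evaluator) so drops can be localized to one slice.
by_slice = defaultdict(list)
for row in rows:
    by_slice[(row["cohort"], row["evaluator"])].append(row["score"])

# Mean score per (cohort, evaluator) pair; compare each against the golden baseline.
slice_means = {key: round(mean(scores), 3) for key, scores in by_slice.items()}
print(slice_means)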
FutureAGI then turns the diagnosis into action. The engineer promotes failing traces into a new Dataset version, reruns the baseline, and tightens the metric threshold for the affected evaluator before the next release gate. Unlike a single Ragas faithfulness check, this treats drift as a time series across datasets, evaluators, and production cohorts. If the regression is route-specific, the team can hold the model upgrade behind an Agent Command Center model fallback until the updated eval suite passes.
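One way to encode that tightened, cohort-specific gate is sketched below; the threshold values and matching scheme are illustrative, not recommended settings.
# Per-evaluator release-gate thresholds, tightened for the affected cohort.
thresholds = {
    ("billing_policy_v3", "Groundedness"): 0.88,  # raised after the regression
    ("*", "Groundedness"): 0.85,
    ("*", "ContextRelevance"): 0.80,
}

def gate(cohort: str, evaluator: str, score: float) -> bool:
    """Return True if the score clears the most specific matching threshold."""
    limit = thresholds.get((cohort, evaluator), thresholds.get(("*", evaluator), 0.0))
    return score >= limit

assert gate("billing_policy_v3", "Groundedness", 0.90)
assert not gate("billing_policy_v3", "Groundedness", 0.86)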
How to Measure or Detect Eval Drift
Measure eval drift by comparing the same business task across stable baselines, fresh production samples, and explicit evaluation windows.
- Evaluator score delta: compare Groundedness, ContextRelevance, and ToolSelectionAccuracy against the previous accepted Dataset run. Alert when any evaluator drops more than its threshold.
- Eval-fail-rate-by-cohort: chart the percentage of failed rows by user segment, route, prompt version, retriever index, and model variant (a sketch follows the score-delta check below).
- Dataset freshness: track the share of current production failure modes represented in the golden dataset. A stale dataset hides new errors.
- Trace disagreement: compare live trace outcomes with offline eval results; rising thumbs-down or escalation rate with green evals is a drift warning.
- Threshold churn: frequent threshold edits without new annotations usually means the measurement target is moving.
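The snippet below is a minimal check for the first signal: it compares evaluator averages from two runs, using hypothetical scores, and fails the release gate when any drop reaches the threshold.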
from fi.evals import Groundedness, ContextRelevance

# Evaluator suite attached to the Dataset run via Dataset.add_evaluation().
evaluators = [Groundedness(), ContextRelevance()]

# Mean scores from the pinned baseline run and the current release-candidate run.
baseline = {"Groundedness": 0.91, "ContextRelevance": 0.88}
current = {"Groundedness": 0.81, "ContextRelevance": 0.87}

# Per-evaluator score delta; fail the gate when any evaluator drops 0.05 or more.
drops = {name: baseline[name] - current[name] for name in baseline}
assert max(drops.values()) < 0.05, drops
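A companion sketch for eval-fail-rate-by-cohort, assuming each evaluated row carries a cohort tag and a pass/fail flag (field names are hypothetical):
from collections import Counter

# Hypothetical evaluated rows: cohort tag plus whether the evaluator passed.
rows = [
    {"cohort": "enterprise", "passed": False},
    {"cohort": "enterprise", "passed": True},
    {"cohort": "self_serve", "passed": True},
]

totals, failures = Counter(), Counter()
for row in rows:
    totals[row["cohort"]] += 1
    failures[row["cohort"]] += 0 if row["passed"] else 1

# Percentage of failed rows per cohort; chart this over time to spot drift.
fail_rate = {cohort: 100 * failures[cohort] / totals[cohort] for cohort in totals}
print(fail_rate)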
Common Mistakes
- Comparing different datasets. If rows changed, the score delta may be dataset drift, not quality drift.
- Treating evaluator upgrades as product regressions. Version evaluators separately so a judge-prompt change does not look like a model failure.
- Averaging away cohorts. A global pass can hide a 20-point drop for one customer segment or tool route.
- Changing thresholds after seeing results. Thresholds should encode risk tolerance before the run, not after a failed release gate.
- Ignoring production traces. Offline-only evals miss new intents, new documents, and tool paths that never existed in the golden set.
Frequently Asked Questions
What is eval drift?
Eval drift is when an evaluation suite no longer reflects real production quality because the dataset, evaluator, threshold, or traffic distribution changed. It makes offline scores disagree with what users see.
How is eval drift different from model drift?
Model drift is a change in the model's behavior. Eval drift is a breakdown in measurement: the evaluation suite no longer captures production risk, even when the model is unchanged.
How do you measure eval drift?
Use FutureAGI Dataset runs with evaluators such as Groundedness, ContextRelevance, and ToolSelectionAccuracy. Compare score distributions by cohort, Dataset version, and threshold over time.