What Is Drift (ML / LLM)?

Any time-based change in inputs, outputs, models, or evaluators that degrades a deployed system's quality if left unmonitored.

What Is Drift?

Drift is any time-based change that degrades a deployed model’s production behavior — inputs moving away from training distributions, label or task semantics shifting, model weights or routes changing, evaluators losing calibration. The umbrella term covers several specific subtypes: data drift, concept drift, model drift, prediction drift, and eval drift, each with its own diagnostic signal. FutureAGI handles drift as an observability problem: production traces, evaluation Dataset runs, and gateway routes are versioned and compared so the type of drift can be diagnosed, not just flagged.

Why Drift Matters in Production LLM and Agent Systems

Drift is the failure mode that does not show up in CI. The model passed every offline eval, the deploy was uneventful, the dashboard was green for the first month — and three months in, the success rate has fallen 8% and no one can say when the decline began. By the time it is visible to product, the diagnostic window in which the change was still localized has already closed.

Different roles see different symptoms. ML engineers see eval-fail-rate-by-cohort climbing for a specific user segment that did not exist at training time. SREs see token-cost-per-trace creeping up because the model now retries more, asks more clarifications, or hits a new fallback path. Product managers see an NPS dip with no obvious cause. Compliance teams see audit findings about answers in regulated contexts that the model would have refused six months earlier.

In 2026 multi-agent stacks, drift compounds across roles. A drifting embedding model changes retrieval geometry, which changes RAG context, which changes LLM outputs, which changes downstream tool calls. A trajectory-level evaluator catches the symptom in the agent output; only span-level versioning across embeddings, retrievers, planners, and tools localizes the source.

How FutureAGI Handles Drift

FutureAGI’s approach is to make every layer versionable and comparable so drift becomes a diff, not a vibe. The model route, the prompt template, the embedding model, the dataset, the evaluator — each carries an explicit version inside the trace. When a drift alert fires, you can subtract one version from another and see what changed.
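To make "subtract one version from another" concrete, here is a minimal pure-Python sketch of per-trace version pinning and diffing. The field names are illustrative, not FutureAGI's actual trace schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanVersions:
    """Versions pinned on a trace; the fields here are illustrative."""
    model_route: str
    prompt_template: str
    embedding_model: str
    evaluator: str


def diff(current: SpanVersions, reference: SpanVersions) -> dict:
    """Return only the fields that changed between two pinned version sets."""
    return {
        f: (getattr(current, f), getattr(reference, f))
        for f in current.__dataclass_fields__
        if getattr(current, f) != getattr(reference, f)
    }


reference = SpanVersions("gpt-4o@v3", "answer-v7", "text-emb-3@v1", "judge-v2")
current = SpanVersions("gpt-4o@v3", "answer-v8", "text-emb-3@v1", "judge-v3")
print(diff(current, reference))
# {'prompt_template': ('answer-v8', 'answer-v7'), 'evaluator': ('judge-v3', 'judge-v2')}
```

When an alert fires, the diff immediately narrows the investigation: here, the model and embedding layers are unchanged, so the prompt template and the evaluator are the only suspects.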

Concretely: a production RAG team runs through traceAI-llamaindex with a 5% trace sample piped into fi.datasets.Dataset as a rolling cohort. Every week, Dataset.add_evaluation runs Groundedness, ContextRelevance, and AnswerRelevancy on the new cohort. The platform compares the score distributions to last week’s cohort and to a frozen baseline, breaking the comparison out by user segment, route, and prompt-template version. When the team sees Groundedness drop on the “billing” cohort but not the “support” cohort, they know it is not a model regression — it is a content drift in the billing knowledge base.
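The weekly cohort comparison can be sketched with a two-sample Kolmogorov-Smirnov statistic on the score distributions. The cohorts, scores, and 0.5 alert threshold below are invented for illustration; this is not the platform's detection logic:

```python
import bisect


def ks_statistic(a, b):
    """Max vertical distance between the two samples' empirical CDFs."""
    a, b = sorted(a), sorted(b)
    cdf = lambda s, x: bisect.bisect_right(s, x) / len(s)
    return max(abs(cdf(a, x) - cdf(b, x)) for x in sorted(set(a) | set(b)))


# Frozen baseline vs. this week's rolling cohort, keyed by user segment.
baseline = {"billing": [0.90, 0.88, 0.92, 0.91], "support": [0.85, 0.90, 0.87, 0.89]}
this_week = {"billing": [0.70, 0.72, 0.69, 0.75], "support": [0.86, 0.88, 0.90, 0.87]}

for cohort in baseline:
    d = ks_statistic(baseline[cohort], this_week[cohort])
    print(f"{cohort}: KS={d:.2f} {'DRIFT' if d > 0.5 else 'ok'}")
# billing: KS=1.00 DRIFT
# support: KS=0.25 ok
```

Comparing full distributions per cohort, rather than a single global mean, is what surfaces the "billing drifted, support did not" pattern described above.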

On the gateway side, Agent Command Center supports traffic-mirroring for route-level comparisons and model fallback keyed on a quality threshold. If the live route’s drift exceeds the threshold, traffic falls back to the previous safe route while the team investigates. In practice, combining trace-level monitoring with dataset-level evaluation is what separates “we know something drifted” from “we know which layer drifted, when, and for which cohort.”
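The fallback behavior amounts to a rolling quality gate in front of the route. A minimal sketch — the class, route names, and thresholds are hypothetical, not the Agent Command Center API:

```python
from collections import deque


class FallbackRouter:
    """Route to the previous safe model when the rolling eval score drops."""

    def __init__(self, live: str, safe: str, floor: float = 0.8, window: int = 50):
        self.live, self.safe, self.floor = live, safe, floor
        self.scores = deque(maxlen=window)

    def record(self, eval_score: float) -> None:
        self.scores.append(eval_score)

    def route(self) -> str:
        if len(self.scores) < self.scores.maxlen:
            return self.live  # not enough evidence to fall back yet
        mean = sum(self.scores) / len(self.scores)
        return self.safe if mean < self.floor else self.live


router = FallbackRouter(live="gpt-4o@v4", safe="gpt-4o@v3", floor=0.8, window=5)
for score in [0.90, 0.85, 0.70, 0.60, 0.65]:
    router.record(score)
print(router.route())  # gpt-4o@v3 -- rolling mean 0.74 is under the floor
```

Requiring a full window before falling back avoids flapping on a single bad trace; a production gate would also add hysteresis before routing traffic back.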

How to Measure or Detect Drift

Detect drift through a layered set of signals tied to versioned artifacts:

  • fi.evals.Groundedness, ContextRelevance, AnswerRelevancy — run weekly against a sampled production cohort; track the score distribution, not just the mean.
  • Population stability index on input features — flags data-drift against a reference distribution.
  • Prediction-distribution comparison — side-by-side histogram of model outputs in week-N vs. week-N-1 surfaces prediction-drift.
  • Eval-fail-rate-by-cohort — split by user segment, locale, and route; cohort drift is invisible in the global mean.
  • Trace-level versioning — model route, prompt-template version, embedding-model version pinned per span.
  • User-feedback proxy — thumbs-down or task-abandonment rate; lags eval drift but is a useful confirmation signal.

For example, a single-trace spot check with the Groundedness evaluator:

from fi.evals import Groundedness

groundedness = Groundedness()
result = groundedness.evaluate(
    input="What is the refund window for digital products?",
    output="14 days from purchase, per the latest policy.",
    context="Refund policy: 14 days, digital orders.",
)
print(result.score)
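The population stability index from the signal list above takes only a few lines. This is a generic sketch with uniform binning and the common (but informal) rule of thumb that PSI above 0.25 signals a major shift; it is not platform code:

```python
import math


def psi(reference, current, bins=10, eps=1e-4):
    """PSI = sum over bins of (p_cur - p_ref) * ln(p_cur / p_ref)."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1  # clamp values outside the reference range
        return [max(c / len(sample), eps) for c in counts]  # eps avoids log(0)

    p_ref, p_cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))


ref = [i / 100 for i in range(100)]             # uniform reference feature
shifted = [0.5 + i / 200 for i in range(100)]   # mass shifted into the upper half
print(f"self PSI:    {psi(ref, ref):.3f}")      # 0.000 -- identical distributions
print(f"shifted PSI: {psi(ref, shifted):.3f}")  # well above the 0.25 alarm level
```

The reference distribution should be the frozen deploy-time baseline, not last week's window, or slow drift re-baselines itself away.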

Common Mistakes

  • Watching only the global mean. Cohort-level drift hides inside an unchanged global average.
  • Treating drift as one phenomenon. Data, concept, model, prediction, and eval drift have different fixes — diagnose, then act.
  • Skipping evaluator versioning. A new judge prompt changes scores even when the model did not; pin and version the evaluator like the model.
  • Static golden datasets. A fixed test set goes stale; refresh from production samples on a known cadence.
  • No baseline pinned at deploy. Without a frozen reference, you cannot quantify how far you have drifted.
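The last point can be made concrete by freezing a baseline artifact at deploy time. A hypothetical sketch — the file layout and fields are invented for illustration:

```python
import json
import statistics
import time


def pin_baseline(scores, versions, path="baseline.json"):
    """Freeze eval-score stats plus pinned versions as the deploy-time reference."""
    baseline = {
        "pinned_at": int(time.time()),
        "versions": versions,  # model route, prompt template, evaluator, ...
        "mean": statistics.mean(scores),
        "quartiles": statistics.quantiles(scores, n=4),
        "n": len(scores),
    }
    with open(path, "w") as f:
        json.dump(baseline, f, indent=2)
    return baseline


b = pin_baseline(
    scores=[0.91, 0.88, 0.93, 0.90, 0.87],
    versions={"model_route": "gpt-4o@v3", "prompt_template": "answer-v7", "evaluator": "judge-v2"},
)
print(b["mean"], b["n"])  # 0.898 5
```

Storing the pinned versions alongside the score statistics is what lets a later drift investigation say not only "scores fell" but "scores fell relative to this exact model, prompt, and evaluator."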

Frequently Asked Questions

What is drift in machine learning?

Drift is any time-based change in inputs, model behavior, label semantics, or evaluator calibration that degrades a deployed system's quality. It includes data drift, concept drift, model drift, prediction drift, and eval drift.

How is drift different from a regression?

A regression is a sudden, often release-driven drop in quality on a fixed test set. Drift is a gradual change in the production environment that erodes quality even when the model and code are unchanged.

How do you detect drift?

Compare input distributions, evaluator scores, and outcome metrics across rolling windows of production traces. FutureAGI pins datasets and evaluator versions so that any score change isolates the drift source.