Failure Modes

What Is Concept Drift?

Concept drift is an AI reliability failure mode where the mapping between inputs and correct outputs changes after deployment. In LLM and agent systems, it shows up in production datasets, traces, and evaluation pipelines when the same kind of request now requires a different answer because policies, user intents, tools, or labels changed. FutureAGI teams track it by comparing sdk:Dataset cohorts, evaluator scores, ground-truth labels, and user outcome signals over time.

Why It Matters in Production LLM/Agent Systems

Concept drift turns yesterday’s correct behavior into tomorrow’s wrong behavior without a broken API, timeout, or obvious exception. A benefits assistant may answer “contractors are not eligible” because that was true during evaluation, then fail after the company changes eligibility rules. A credit-support agent may keep classifying “payment pause” requests as hardship cases after the product team introduces a new loan-deferral policy. The input text looks familiar; the correct label changed.

The pain lands across the production chain. Developers see release gates pass while fresh tickets fail. SREs see stable latency and provider error rates, yet escalation rate rises. Product teams see thumbs-down clusters around new policies or new customer segments. Compliance teams lose confidence because old evidence no longer proves current behavior.

The symptoms are cohort-shaped. Look for rising eval-fail-rate-by-cohort, annotation disagreement on recent traffic, support escalations after policy changes, increasing fallback-response rate, or a gap between stable input embeddings and falling task success. In agentic systems, the drift can compound: a changed policy label affects retrieval, then the planner chooses the wrong tool, then the final answer looks grounded to obsolete context. 2026-era multi-step pipelines need drift checks at dataset, trace, and outcome layers because a single “model quality” score will hide where the relationship moved.
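As a concrete sketch of that cohort slicing, the snippet below computes eval-fail-rate-by-cohort from trace records. It assumes each record carries a policy_version, a tenant, and a pass/fail evaluator verdict; these field names are illustrative, not a fixed schema.

from collections import defaultdict

def eval_fail_rate_by_cohort(traces):
    # Group trace records by (policy_version, tenant) and compute the
    # share of records whose evaluator verdict failed in each cohort.
    totals, fails = defaultdict(int), defaultdict(int)
    for trace in traces:
        cohort = (trace["policy_version"], trace["tenant"])
        totals[cohort] += 1
        if not trace["eval_passed"]:
            fails[cohort] += 1
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}

A flat global average can hide a failing cohort: if one enterprise policy_version fails every evaluation while high-volume cohorts pass, the per-cohort view surfaces it and the blended score does not.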

How FutureAGI Handles Concept Drift

FutureAGI’s approach is to anchor concept drift in sdk:Dataset, implemented as `fi.datasets.Dataset`, rather than treating it as a vague line on a dashboard. A team starts with a reference Dataset built from accepted production rows: input, expected_response, policy_version, tenant, route, label, and timestamp. Each release candidate and daily traffic sample becomes another Dataset version, so the question is concrete: did this cohort’s correct answer move?
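A minimal sketch of that comparison in plain Python, assuming two Dataset versions have been exported as row dictionaries that share a stable id (the export shape and the id field are assumptions; the other fields mirror the schema above):

from collections import defaultdict

def label_movement_by_cohort(reference_rows, candidate_rows):
    # Share of rows per (policy_version, tenant) cohort whose
    # expected_response changed between two Dataset versions.
    reference = {row["id"]: row for row in reference_rows}
    moved, total = defaultdict(int), defaultdict(int)
    for row in candidate_rows:
        baseline = reference.get(row["id"])
        if baseline is None:
            continue  # new traffic with no baseline label to compare
        cohort = (row["policy_version"], row["tenant"])
        total[cohort] += 1
        if row["expected_response"] != baseline["expected_response"]:
            moved[cohort] += 1
    return {cohort: moved[cohort] / total[cohort] for cohort in total}

A movement rate concentrated in one policy_version cohort is exactly the concrete answer this workflow is after: this cohort’s correct answer moved.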

The team attaches Groundedness, ContextRelevance, and HallucinationScore through `Dataset.add_evaluation()`, then compares those scores with human labels or user-outcome labels. Example: a RAG support agent still retrieves relevant policy chunks, so ContextRelevance stays flat. But after a pricing-policy update, Groundedness drops for enterprise tenants because the answer is grounded in old context, and refund escalations rise. That split says the concept changed around policy meaning, not that the retriever stopped working.
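In code, that attachment step might look like the sketch below. The Dataset name and the exact add_evaluation() arguments are assumptions for illustration; verify the current signature against the SDK reference.

from fi.datasets import Dataset
from fi.evals import Groundedness, ContextRelevance, HallucinationScore

# Assumption: the Dataset is addressable by name and add_evaluation()
# accepts an evaluator instance; check the fi SDK docs for the exact form.
dataset = Dataset(name="support-agent-prod-sample")  # hypothetical name
for evaluator in (Groundedness(), ContextRelevance(), HallucinationScore()):
    dataset.add_evaluation(evaluator)

Comparing those scores against human or user-outcome labels per cohort is what separates "the retriever broke" from "the concept moved."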

Unlike Ragas faithfulness, which mainly checks whether an answer is supported by supplied context, this workflow compares evaluator scores against changing business truth. The engineer’s next action is operational: promote failed production traces into the Dataset, add a new policy-version cohort, adjust the metric threshold, and route high-risk traffic through an Agent Command Center model fallback or post-guardrail until the regression eval passes.
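A plain-Python sketch of the promotion step, assuming failed traces are available as dictionaries that carry a human-reviewed corrected answer (the helper and field names are hypothetical):

def promote_failed_traces(traces, new_policy_version):
    # Convert failed production traces into labeled regression rows
    # tagged with the new policy cohort, ready to append to the Dataset.
    rows = []
    for trace in traces:
        if trace["eval_passed"]:
            continue
        rows.append({
            "input": trace["input"],
            "expected_response": trace["corrected_answer"],  # human-reviewed
            "policy_version": new_policy_version,
            "tenant": trace["tenant"],
            "route": trace["route"],
        })
    return rows

The point of the promotion is that the next regression run scores the release against the new business truth, not the truth that held at evaluation time.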

How to Measure or Detect It

Detect concept drift by comparing both outcomes and explanations across stable baselines and fresh production samples.

  • Ground-truth label movement: compare expected labels by policy_version, tenant, product, geography, or route. Concept drift needs an outcome change, not only an input shift.
  • Evaluator deltas: Groundedness checks whether the response is supported by its context; ContextRelevance tracks whether retrieved context still matches the request; HallucinationScore tracks the risk of unsupported output over time.
  • Dashboard signals: alert on eval-fail-rate-by-cohort, label-disagreement-rate, fallback-response-rate, and human-escalation-rate after a known business or policy change.
  • Trace and dataset fields: keep model, prompt_version, retriever_index, route, policy_version, and Dataset version available for slicing.
  • User-feedback proxies: rising thumbs-down rate or support reopen rate with stable latency often points to meaning drift rather than infrastructure failure.
For a single row, the evaluator call looks like this minimal sketch (the row field names are illustrative):

from fi.evals import Groundedness

# row is one record from the reference or fresh-traffic Dataset.
evaluator = Groundedness()
result = evaluator.evaluate(
    input=row["question"],             # the original user request
    context=row["retrieved_context"],  # context the answer was grounded in
    output=row["answer"],              # the system response being scored
)

Treat any distance or divergence test as a hypothesis, not a verdict. Concept drift is confirmed only when the changed input-output relationship reduces task quality or changes the accepted label.
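To make that confirmation step concrete, the sketch below runs a standard two-proportion z-test on a cohort’s eval fail rate between a baseline window and a fresh sample; the counts and the 0.05 threshold are illustrative.

import math

def two_proportion_z(fails_a, n_a, fails_b, n_b):
    # One-sided two-proportion z-test: did the fail rate in the fresh
    # sample (B) rise above the baseline (A) by more than chance?
    p_a, p_b = fails_a / n_a, fails_b / n_b
    pooled = (fails_a + fails_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))  # normal-CDF tail
    return z, p_value

z, p = two_proportion_z(fails_a=12, n_a=400, fails_b=31, n_b=350)
if p < 0.05:
    print(f"cohort fail rate rose (z={z:.2f}, p={p:.4f}); investigate as drift")

A significant rise is still only the trigger for the Dataset review above; the label movement, not the statistic, is what confirms concept drift.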

Common Mistakes

  • Calling every distribution change concept drift. If inputs changed but the correct answer did not, you are looking at data drift.
  • Reusing a stale golden dataset. A passing score on old labels proves only that the system still solves old cases.
  • Averaging across tenants. One enterprise policy cohort can fail while the global score looks flat.
  • Confusing prompt drift with concept drift. A prompt rewrite changes system behavior; concept drift changes the target relationship.
  • Changing policy labels without versioning. Without policy_version, engineers cannot tell whether the model regressed or the business rule moved.

Frequently Asked Questions

What is concept drift?

Concept drift is when the relationship between inputs and correct outputs changes after deployment. The system may see similar-looking requests, but the right answer, label, or action has moved.

How is concept drift different from data drift?

Data drift means the input distribution changed. Concept drift means the input-to-output relationship changed, so the same kind of input may now require a different answer or label.

How do you measure concept drift?

In FutureAGI, compare `fi.datasets.Dataset` cohorts over time, then attach evaluators such as Groundedness, ContextRelevance, and HallucinationScore. Track eval-fail-rate-by-cohort against ground-truth labels and user outcomes.