What Is Out-of-Distribution (OOD)?
Data outside the baseline distribution used to train, evaluate, or monitor an AI system.
Out-of-distribution (OOD) data is input, retrieved context, labels, or tool output that falls outside the baseline distribution used to train, evaluate, or monitor an AI system. It is a reliability concept that cuts across data families: training sets, eval datasets, production traces, RAG retrieval, and agent workflows. In FutureAGI, teams track OOD risk through `sdk:Dataset`, evaluator results, and trace-linked cohorts so new intents, languages, schemas, or policies do not silently invalidate release decisions.
Why Out-of-Distribution Data Matters in Production LLM and Agent Systems
OOD data breaks the link between offline eval scores and live reliability. A support agent can pass every billing row in a reviewed dataset, then fail when a new invoice format, regional tax rule, or account tier appears in production. A RAG system can retrieve plausible but irrelevant policy text for an unseen question, leading to a grounded-looking hallucination downstream. A tool-calling agent can receive a payload shape that was absent from tests and choose a fallback path that no one scored.
The pain is spread across roles. Developers chase prompt changes when the real issue is missing coverage. SREs see spikes in retries, a shift toward higher `llm.token_count.prompt` buckets, or rising p99 latency as agents call extra tools. Product teams see thumbs-down clusters in a new cohort while the global eval pass rate stays flat. Compliance owners lose confidence that the approved dataset covers live policy language.
OOD risk is especially sharp in 2026-era multi-step pipelines. One unfamiliar input can change the plan, retrieve new documents, call a tool with novel arguments, and produce a final answer that looks fluent enough to bypass casual review. Symptoms include low nearest-neighbor similarity to baseline rows, high “unknown intent” labels, evaluator failures concentrated in recent traces, and sudden score gaps between baseline and live cohorts.
How FutureAGI Handles Out-of-Distribution Data
FutureAGI’s approach is to make OOD a dataset-and-trace workflow instead of a vague anomaly label. The anchor is `sdk:Dataset`, exposed as `fi.datasets.Dataset`. A team keeps a baseline Dataset with columns such as `input`, `expected_response`, `reference_context`, `cohort`, `intent`, `locale`, `schema_version`, `source_trace_id`, and `dataset_version`. Live production samples from traceAI-langchain are imported as a comparison cohort with the same columns.
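To make the layout concrete, here is a minimal sketch of one baseline row and one imported live row using the columns listed above. The values, cohort label, and trace ID are illustrative, and the code to load these rows into `fi.datasets.Dataset` is omitted because the exact constructor depends on the SDK version.

```python
# Illustrative row shapes only; loading them into fi.datasets.Dataset
# depends on your SDK version and is not shown here.
baseline_row = {
    "input": "How do I get a refund for a cancelled order?",
    "expected_response": "Refunds for cancelled orders are issued within 5 business days.",
    "reference_context": "refund-policy-v3.md#cancelled-orders",
    "cohort": "baseline",
    "intent": "refund_request",
    "locale": "en-US",
    "schema_version": "2024-11",
    "source_trace_id": None,
    "dataset_version": "v12",
}

live_row = {
    "input": "Can I get money back if only half my order shipped?",
    "expected_response": None,          # unreviewed production sample
    "reference_context": None,
    "cohort": "live-2026-w07",
    "intent": "unknown",                # not yet labeled by a reviewer
    "locale": "en-US",
    "schema_version": "2026-02",
    "source_trace_id": "trace-8f3a21",  # imported from traceAI-langchain
    "dataset_version": None,
}
```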
The first metric is coverage. Engineers compare each live input against baseline rows using `EmbeddingSimilarity`, then split the result by intent, locale, and `schema_version`. If the “refund after partial shipment” cohort has low nearest-neighbor similarity and a higher eval-fail-rate-by-cohort, those rows are not treated as a model bug yet. They run `ContextRelevance` on retrieved chunks and `Groundedness` on the final answer to see whether the failure starts in retrieval or generation.
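A minimal sketch of that coverage comparison follows. It assumes `EmbeddingSimilarity.evaluate` returns an object with a numeric `score`, as in the snippet in the measurement section below, and that `baseline_rows` and `live_rows` are lists of dicts shaped like the rows sketched earlier; the grouping key and the brute-force nearest-neighbor loop are illustrative, not a FutureAGI API.

```python
from collections import defaultdict

from fi.evals import EmbeddingSimilarity

scorer = EmbeddingSimilarity()

def nearest_baseline_similarity(live_row, baseline_rows):
    """Highest similarity between one live input and any reviewed baseline input."""
    scores = [
        scorer.evaluate(
            response=live_row["input"],
            expected_response=baseline["input"],
        ).score
        # In practice, restrict comparisons to the same locale to bound cost.
        for baseline in baseline_rows
        if baseline["locale"] == live_row["locale"]
    ]
    return max(scores) if scores else 0.0

# Group nearest-neighbor similarity by (intent, locale, schema_version) so a
# single unfamiliar cohort cannot hide inside the global average.
similarity_by_cohort = defaultdict(list)
for row in live_rows:
    key = (row["intent"], row["locale"], row["schema_version"])
    similarity_by_cohort[key].append(nearest_baseline_similarity(row, baseline_rows))

for key, sims in sorted(similarity_by_cohort.items()):
    print(key, round(sum(sims) / len(sims), 3))
```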
What happens next is operational. The engineer promotes representative OOD rows into a reviewed dataset version, adds expected answers or accepted tool paths, reruns the regression eval, and sets an alert when live rows fall below the similarity threshold for two deploy windows. Unlike a Great Expectations-style schema check, this catches semantic shifts where the JSON is valid but the user intent, document policy, or tool argument is new.
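The alerting rule at the end of that workflow can be expressed directly. A minimal sketch follows; the 0.75 threshold, the two-window count, and the per-window averages are placeholders to be tuned per cohort.

```python
SIMILARITY_THRESHOLD = 0.75  # illustrative; tune per cohort
CONSECUTIVE_WINDOWS = 2

def should_alert(window_similarities):
    """Fire when a cohort stays below threshold for two deploy windows in a row."""
    recent = window_similarities[-CONSECUTIVE_WINDOWS:]
    return (
        len(recent) == CONSECUTIVE_WINDOWS
        and all(score < SIMILARITY_THRESHOLD for score in recent)
    )

# Example: mean nearest-neighbor similarity per deploy window, oldest first.
print(should_alert([0.82, 0.74, 0.71]))  # True: two consecutive windows below threshold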
How to Measure or Detect Out-of-Distribution Data
Measure OOD data by comparing a stable baseline cohort with sampled live traffic, then separating semantic distance from ordinary quality failure:
- Baseline distance: `EmbeddingSimilarity` returns semantic similarity between texts; use it to compare a live input with its nearest reviewed baseline row.
- Cohort movement: Population Stability Index, Jensen-Shannon divergence, or Wasserstein distance can show structured metadata movement across locale, route, plan, or schema (see the PSI sketch below).
- RAG quality drop: falling `ContextRelevance` and `Groundedness` scores mean OOD queries may be pulling the wrong context or producing unsupported answers.
- Trace signals: watch eval-fail-rate-by-cohort, retrieval zero-result rate, tool retry rate, `llm.token_count.prompt`, p99 latency, and escalation rate.
- Reviewer signal: rising “unknown intent” labels or reviewer disagreement often appears before dashboard averages move.
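The snippet below applies the baseline-distance signal from the first bullet: it scores a single live input against a reviewed baseline row and prints the similarity score and reason.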
```python
from fi.evals import EmbeddingSimilarity

scorer = EmbeddingSimilarity()

# Score one live input against a reviewed baseline input.
result = scorer.evaluate(
    response=live_row["input"],
    expected_response=baseline_row["input"],
)

# A low score marks the live row as a candidate OOD sample for review.
print(result.score, result.reason)
```
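For the cohort-movement bullet, a Population Stability Index over a categorical metadata column such as locale needs nothing beyond the two cohorts themselves. This is a minimal standalone sketch; the smoothing constant, the example counts, and the common 0.1 / 0.25 interpretation bands are general conventions rather than FutureAGI settings.

```python
import math
from collections import Counter

def psi(baseline_values, live_values, smoothing=1e-4):
    """Population Stability Index between two categorical distributions."""
    categories = set(baseline_values) | set(live_values)
    base_counts = Counter(baseline_values)
    live_counts = Counter(live_values)
    total = 0.0
    for category in categories:
        # Fall back to a small constant when a category is missing from one cohort.
        p = base_counts[category] / len(baseline_values) or smoothing
        q = live_counts[category] / len(live_values) or smoothing
        total += (q - p) * math.log(q / p)
    return total

# Example: locale mix shifting toward de-DE in live traffic.
baseline_locales = ["en-US"] * 80 + ["en-GB"] * 15 + ["de-DE"] * 5
live_locales = ["en-US"] * 60 + ["en-GB"] * 10 + ["de-DE"] * 30
print(round(psi(baseline_locales, live_locales), 3))  # ≈ 0.53, above the common 0.25 band
```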
Common Mistakes
The expensive mistakes are usually measurement mistakes, not terminology mistakes:
- Treating every OOD case as bad traffic. Some OOD rows are emerging product demand and should become regression coverage.
- Using one global anomaly threshold. Locale, tenant tier, tool path, and document family often need separate baseline distributions.
- Checking schema validity only. Valid JSON can still contain a new intent, stale policy reference, or unsupported tool argument.
- Ignoring near-miss rows. Low similarity plus a passing answer may indicate brittle behavior before users feel it.
- Promoting OOD rows without review. Unverified live failures can introduce wrong expected answers and poison the next eval dataset.
Frequently Asked Questions
What is out-of-distribution data?
Out-of-distribution data is input, context, labels, or tool output outside the baseline distribution used to test an AI system. It matters because offline eval scores may stop predicting live behavior.
How is out-of-distribution data different from data drift?
OOD describes a sample or cohort that falls outside the expected distribution. Data drift is the gradual movement of production traffic over time that produces more OOD traffic relative to the baseline.
How do you measure OOD data with FutureAGI?
Use FutureAGI `sdk:Dataset` to compare baseline and live cohorts, then track `EmbeddingSimilarity`, `ContextRelevance`, `Groundedness`, and eval-fail-rate-by-cohort. Trace fields such as `llm.token_count.prompt` help explain the failure path.