Failure Modes

What Is the Hallucination Index?

A production metric that tracks unsupported, fabricated, or unverifiable AI claims across prompts, models, traces, or releases.

What Is the Hallucination Index?

The hallucination index is a failure-mode metric that summarizes the rate and severity of unsupported claims produced by an AI system. Instead of asking whether one answer hallucinated, it asks how a prompt, model, retriever, agent route, or release behaves across many outputs. It appears in eval pipelines and production traces where HallucinationScore can be tracked by cohort. FutureAGI teams use it to compare versions, set risk thresholds, and catch regressions before fabricated facts reach users.
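
As a rough sketch of the aggregation (illustrative only, not FutureAGI's implementation; the score range, threshold, and severity weights are assumptions), per-output hallucination scores can be rolled up into a cohort-level index:

from dataclasses import dataclass

@dataclass
class ScoredOutput:
    score: float     # per-output hallucination score, assumed 0.0 (supported) to 1.0 (fabricated)
    severity: float  # assumed severity weight, e.g. 1.0 for a wording mismatch, 3.0 for a regulated claim

def hallucination_index(outputs: list[ScoredOutput], threshold: float = 0.5) -> float:
    """Severity-weighted share of outputs whose score crosses the risk threshold."""
    if not outputs:
        return 0.0
    flagged = sum(o.severity for o in outputs if o.score >= threshold)
    return flagged / sum(o.severity for o in outputs)

# One severe fabrication outweighs several mild mismatches in the same cohort.
cohort = [ScoredOutput(0.9, 3.0), ScoredOutput(0.2, 1.0), ScoredOutput(0.6, 1.0)]
print(hallucination_index(cohort))  # 0.8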

Why It Matters in Production LLM and Agent Systems

Unsupported claims rarely crash a service. They ship as polished answers, cited summaries, tool plans, or extracted fields that look plausible until a user proves them wrong. A low-latency chatbot can still be unsafe if its hallucination index climbs after a prompt edit, retriever change, or model fallback. The obvious failure is a fabricated statement: a benefits bot invents a 60-day appeal window, or a coding assistant names an API parameter that does not exist. The quieter failure is drift: the same workflow answers correctly on the golden dataset, then starts inventing dates for a new customer segment.

The pain lands across the production team. Developers see evaluation regressions after releases. SREs see normal uptime but rising user corrections, replay failures, or eval-fail-rate-by-cohort. Compliance teams see unsupported claims in regulated messages. Product teams see confused users, escalations, and loss of trust.

Agentic systems raise the cost because one invented claim can become state. A planner hallucinates a tool capability, a retriever accepts the wrong premise, and later steps cite that fabricated premise as if it were evidence. By the end of a multi-step trace, the model is not just wrong; it has built a consistent story around the wrong fact. The hallucination index gives teams a production-level signal instead of isolated anecdotes.

How FutureAGI Handles the Hallucination Index

FutureAGI’s approach is to treat the hallucination index as an eval-backed release gate and a trace-backed monitoring signal. The specific FutureAGI anchor is eval:HallucinationScore, implemented by the HallucinationScore local metric in the evaluation stack. Teams use it alongside DetectHallucination, which flags hallucinated or unsupported claims, and Groundedness, which evaluates whether a response is grounded in the provided context.

A typical workflow starts with a dataset of prompts, retrieved context, and expected evidence. The team runs HallucinationScore before a model or prompt rollout and records the index by prompt version. After release, the same scoring is attached to production traces from a traceAI-langchain or traceAI-openai integration. Answer spans carry the model output, retrieval spans carry the supporting documents, and the dashboard groups the hallucination index by route, model, prompt version, and customer cohort.
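
A hedged sketch of that bookkeeping (the record fields and grouping helper are illustrative, not a FutureAGI API): group per-trace scores by prompt version or route and compute the index per group, so a regression surfaces in the affected cohort instead of disappearing into one blended number.

from collections import defaultdict

# Illustrative records; in practice these come from scored answer spans on production traces.
scored_traces = [
    {"prompt_version": "v12", "route": "support", "score": 0.1},
    {"prompt_version": "v13", "route": "support", "score": 0.7},
    {"prompt_version": "v13", "route": "support", "score": 0.6},
]

def index_by(traces, key, threshold=0.5):
    groups = defaultdict(list)
    for trace in traces:
        groups[trace[key]].append(trace["score"] >= threshold)
    return {k: sum(flags) / len(flags) for k, flags in groups.items()}

print(index_by(scored_traces, "prompt_version"))  # {'v12': 0.0, 'v13': 1.0}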

When the index crosses a threshold, the engineer has concrete actions: block the release, open a regression eval, tune retrieval, or add an Agent Command Center post-guardrail that routes high-risk answers to fallback or review. Ragas faithfulness used alone gives a RAG-specific view; a production hallucination index should also cover free-form generation, agent reasoning, and structured extraction. That broader view is what prevents a clean RAG score from hiding a failing agent trajectory.
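
A minimal gate sketch (budget values and action names are assumptions, not Agent Command Center configuration): compare the release-level index against the route's budget and escalate severe fabrications unconditionally.

def release_decision(index: float, budget: float, severe_failures: int) -> str:
    """Gate a release on the hallucination index and on any severe fabricated claims."""
    if severe_failures > 0:
        return "block"   # a single fabricated legal or medical claim fails the gate outright
    if index > budget:
        return "review"  # route high-risk answers to fallback or human review
    return "ship"

print(release_decision(index=0.04, budget=0.02, severe_failures=0))  # review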

How to Measure or Detect It

Use multiple signals; the index should not depend on one judge call.

  • fi.evals.HallucinationScore - returns a numeric hallucination score for response review and cohort trending.
  • fi.evals.DetectHallucination - flags hallucinated or unsupported claims in the output.
  • fi.evals.Groundedness - checks whether the response is grounded in provided context.
  • Trace fields - score the answer span against retrieved documents, tool outputs, or other evidence attached to the trace.
  • Dashboard signals - hallucination-index-by-release, eval-fail-rate-by-cohort, high-risk-answer-rate, and escalation-rate after answer.
  • User-feedback proxy - thumbs-down events within one minute of an answer, especially when users mark a factual correction.

A minimal single-answer check with the evaluator from the list above (the constructor and argument names follow this snippet as written; verify them against your installed SDK version):

from fi.evals import HallucinationScore

# Score one answer span against the context retrieved for it.
evaluator = HallucinationScore()
result = evaluator.evaluate(
    output="The warranty lasts five years.",  # model answer under review
    context="The warranty lasts one year."    # evidence attached to the trace
)
print(result.score)  # numeric score used for cohort trending

Common Mistakes

  • Treating the index as a model leaderboard. The same model can score differently by prompt, retriever, route, domain, and customer cohort.
  • Averaging away severe failures. One fabricated legal citation matters more than ten harmless wording mismatches; track severity bands.
  • Measuring only offline datasets. Production data drift changes the index after release; sample live traces too.
  • Confusing retrieval failure with hallucination. Poor context can cause unsupported answers; pair HallucinationScore with ContextRelevance and Groundedness.
  • Using one global threshold. A support bot, medical assistant, and marketing writer need different gates and escalation policies; see the sketch after this list.
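
A small sketch of per-route gates with severity bands (routes, band names, and budgets are illustrative):

# Each route sets its own budget per severity band; tighter domains get tighter gates.
GATES = {
    "marketing_writer":  {"minor": 0.10, "severe": 0.02},
    "support_bot":       {"minor": 0.05, "severe": 0.01},
    "medical_assistant": {"minor": 0.01, "severe": 0.00},
}

def violates_gate(route: str, index_by_band: dict) -> bool:
    """True if any severity band exceeds the budget configured for this route."""
    return any(index_by_band.get(band, 0.0) > limit for band, limit in GATES[route].items())

print(violates_gate("medical_assistant", {"minor": 0.005, "severe": 0.001}))  # True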

Frequently Asked Questions

What is the hallucination index?

The hallucination index is a production reliability score for unsupported, fabricated, or unverifiable claims across a cohort of AI outputs. It turns individual hallucination failures into a trendable release and monitoring signal.

How is the hallucination index different from hallucination detection?

Hallucination detection labels or scores one output. The hallucination index aggregates those signals across prompts, models, routes, user cohorts, or releases so teams can compare system behavior over time.

How do you measure the hallucination index?

FutureAGI measures it with eval:HallucinationScore on datasets and production traces, often paired with DetectHallucination and Groundedness. Teams threshold the score by route, prompt version, or release.