Evaluation

What Is a Hallucination Metric?

A hallucination metric is a quantitative score for how much of an LLM’s response is fabricated — that is, not supported by the input, the retrieved context, or known-correct reference text. Modern hallucination metrics layer three checks: a fast pre-screening pass over the response, a Natural Language Inference (NLI) classification of each claim as supported, contradicted, or neutral, and a contradiction-weighted aggregation. The result is a 0-1 score that drops sharply on contradictions, partially on unsupported claims, and stays high when the response is fully grounded. It runs on production traces and on offline regression datasets.
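
To make the NLI layer concrete, the sketch below classifies a single extracted claim against the retrieved context with an off-the-shelf cross-encoder NLI model from Hugging Face. This is illustrative only: the model name, and the use of transformers instead of FutureAGI's internal claim splitter and NLI model, are assumptions for the example.

# Illustrative only: classify one extracted claim against the retrieved context
# with an off-the-shelf NLI cross-encoder (not FutureAGI's internal model).
from transformers import pipeline

nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base")

context = "The Eiffel Tower, located in Paris, was completed in 1889 and stands 330 metres tall."
claim = "The Eiffel Tower is 1500m tall."

# Premise = context, hypothesis = claim; labels are entailment / neutral / contradiction.
result = nli({"text": context, "text_pair": claim})
print(result)  # e.g. [{'label': 'contradiction', 'score': 0.98}]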

Why It Matters in Production LLM and Agent Systems

Hallucinations are among the most-cited reasons production LLM rollouts get rolled back. The problem is not detecting cartoonish fabrications; those are easy to spot. The problem is the long tail: a confidently wrong date, a slightly-off product specification, a citation to a paper that does not exist, a regulation reference with the right language but the wrong section number. Without a quantitative hallucination metric, every individual case looks like an isolated bug, and the team treats them as a stream of fires rather than a measurable rate.

The pain is felt across the entire org. ML engineers need a number to track per release; “we hallucinate less” is not a deployable claim. Trust-and-safety teams need severity classification — a contradicted claim is worse than an unsupported one. Compliance leads need an auditable signal that they can attach to incident reports. Customer-facing teams need an early warning before users start filing tickets, and a hallucination rate moves before a satisfaction score does.

In 2026 agent stacks, the failure compounds. A planner that hallucinates a non-existent tool triggers a chain of downstream tool-call errors. An LLM-as-judge that hallucinates an evaluation rubric corrupts every eval downstream. Step-level hallucination scoring on every span, not just on final responses, is what catches these before they cascade.

How FutureAGI Handles Hallucination Metrics

FutureAGI’s approach is to ship a composite fi.evals.HallucinationScore paired with the simpler binary DetectHallucination, so teams can use the continuous score for trending and the binary check for gating. HallucinationScore runs the HallucinationSentinel for fast pre-screening, then the NLI layer for per-claim entailment and contradiction detection, then weights the components (support 0.6, contradiction 0.4 by default) into a 0-1 score. The output exposes counts of supported, unsupported, contradicted, and neutral claims so teams can debug which type of hallucination dominates. DetectHallucination is the cloud-template Pass/Fail gate intended for blocking releases.
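
The exact aggregation inside HallucinationScore is not spelled out here, so the sketch below is only a plausible reading of the weighted form: it takes the per-claim counts the evaluator exposes and combines them with the default weights (support 0.6, contradiction 0.4). The composite_score function and its formula are assumptions for illustration.

# Sketch of the composite aggregation, assuming a simple weighted form; only the
# default weights (support 0.6, contradiction 0.4) and the claim categories come
# from the description above, the formula itself is an assumption.
def composite_score(supported: int, unsupported: int, contradicted: int, neutral: int,
                    w_support: float = 0.6, w_contradiction: float = 0.4) -> float:
    total = supported + unsupported + contradicted + neutral
    if total == 0:
        return 1.0  # nothing to verify, treat as grounded
    support_ratio = supported / total                     # high when claims are backed by context
    contradiction_penalty = 1.0 - (contradicted / total)  # drops sharply on contradictions
    return w_support * support_ratio + w_contradiction * contradiction_penalty

# One contradicted claim out of three pulls the score down hard,
# while a neutral claim only dilutes the support component.
print(composite_score(supported=2, unsupported=0, contradicted=1, neutral=0))  # ~0.67
print(composite_score(supported=2, unsupported=0, contradicted=0, neutral=1))  # ~0.80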

Concretely: a knowledge-bot team running on traceAI-llamaindex instruments their RAG chain. They configure HallucinationScore to score every answer span and write the result back as a span event. The Agent Command Center dashboard plots the p10 hallucination score (the percentile most sensitive to regressions) and contradiction counts per day. When a model fallback to claude-3-5-haiku increases contradictions from 4/day to 21/day, the team adds a post-guardrail running DetectHallucination that intercepts low-scoring responses and returns a fallback answer before delivery to the user. The same composite metric then gates merges in CI: a build fails if the mean HallucinationScore drops more than 0.03 (3 points) below the baseline.
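
A minimal sketch of that CI gate, assuming result.eval_results[i].output carries the numeric 0-1 score (as in the minimal example further down); run_gate, the cases list, and the baseline handling are hypothetical stand-ins for your own regression runner and metrics store.

# Minimal CI gate sketch: fail the build if the mean HallucinationScore regresses
# more than 0.03 against a stored baseline. The cases list and baseline_mean are
# supplied by your own regression runner and metrics store (not shown).
import sys
from statistics import mean

from fi.evals import HallucinationScore

MAX_REGRESSION = 0.03

def run_gate(cases, baseline_mean):
    evaluator = HallucinationScore()
    results = evaluator.evaluate(cases)  # cases: [{"response": ..., "context": ...}, ...]
    scores = [float(r.output) for r in results.eval_results]  # assumes .output is the 0-1 score
    current_mean = mean(scores)
    if baseline_mean - current_mean > MAX_REGRESSION:
        print(f"FAIL: mean hallucination score {current_mean:.3f} vs baseline {baseline_mean:.3f}")
        sys.exit(1)
    print(f"PASS: mean hallucination score {current_mean:.3f} vs baseline {baseline_mean:.3f}")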

Unlike Galileo’s hallucination index, which is a single end-to-end score, FutureAGI exposes the decomposition (sentinel, support, contradiction) so engineers can debug which sub-signal is degrading.

How to Measure or Detect It

Hallucination metrics are directly measurable. Wire up:

  • fi.evals.HallucinationScore — composite 0-1 score with sentinel + NLI + contradiction decomposition.
  • fi.evals.DetectHallucination — Pass/Fail cloud-template gate with reason.
  • fi.evals.ContradictionDetection — narrower 1.0/0.0 signal that flags contradicted claims specifically.
  • OTel attributes llm.output and retrieval.documents — the inputs every hallucination evaluator needs.
  • p10 hallucination score and contradiction count (dashboard) — the two signals that move first under a regression.

Minimal Python:

# Composite evaluator: sentinel pre-screen, per-claim NLI support, contradiction weighting.
from fi.evals import HallucinationScore

evaluator = HallucinationScore()

# The height claim (1500m) contradicts the context (330 metres), so the score should drop sharply.
result = evaluator.evaluate([{
    "response": "The Eiffel Tower is in Paris and was completed in 1889. It is 1500m tall.",
    "context": "The Eiffel Tower, located in Paris, was completed in 1889 and stands 330 metres tall."
}])
print(result.eval_results[0].output, result.eval_results[0].reason)

Common Mistakes

  • Treating the composite score as a single opaque number. The decomposition matters: a 0.7 driven by neutral claims is fine; a 0.7 driven by contradictions is a release blocker.
  • Running hallucination metrics without a context or reference. Hallucination is defined relative to inputs. Score against context for RAG and reference for benchmarks; never just the response alone.
  • Using a single threshold across model families. Stronger models hallucinate differently from smaller ones; calibrate thresholds per model variant or per route (see the sketch after this list).
  • Treating a low hallucination score as proof of correctness. A response that quotes the context verbatim gets a perfect hallucination score and may still be irrelevant; pair with AnswerRelevancy.
  • Letting the same model that generated the response also score it. Self-evaluation collapses contradictions into “supported”; pin the judge model to a different family or use NLI-based metrics.
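
A minimal sketch of per-model threshold calibration, as referenced in the third bullet above; the model names and threshold values are placeholders, not recommendations, and should be calibrated against labelled traces for each variant or route.

# Illustrative per-model threshold map; the values are placeholders, not
# recommendations. Calibrate each against labelled traces for that model or route.
THRESHOLDS = {
    "gpt-4o": 0.85,
    "claude-3-5-haiku": 0.75,  # smaller fallback model with a different failure profile
    "default": 0.80,
}

def passes_gate(score: float, model: str) -> bool:
    # Gate on the composite 0-1 HallucinationScore with a per-model threshold.
    return score >= THRESHOLDS.get(model, THRESHOLDS["default"])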

Frequently Asked Questions

What is a hallucination metric?

A hallucination metric is a 0-1 score for how much of an LLM response is fabricated. Modern implementations combine fast pre-screening, NLI claim verification, and contradiction detection into a single composite signal.

How is a hallucination metric different from groundedness?

Groundedness is a strict pass/fail gate against retrieved context. A hallucination metric is a continuous 0-1 score that combines support, contradiction, and risk pre-screening — useful for trending and severity classification.

How do you measure a hallucination metric?

FutureAGI's fi.evals.HallucinationScore composes a sentinel risk pass, NLI claim entailment, and contradiction detection into a 0-1 score. DetectHallucination is the simpler Pass/Fail variant for binary gates.