Failure Modes

What Are LLM Hallucinations?

LLM hallucinations are confident, fluent outputs from a large language model that are factually wrong, unsupported by retrieved context, or fabricated entirely. They appear as invented citations, wrong dates, plausible-but-invalid code, and biographical details that never happened. Hallucinations are the single most common LLM failure mode in production AI systems. They show up across eval pipelines and live traces and are detected with grounding evaluators, NLI-based factual-consistency checks, and judge-model rubrics — then mitigated with retrieval, citation requirements, decoding tweaks, and post-guardrail review.

Why It Matters in Production LLM and Agent Systems

A hallucination that lands in a regulated workflow can be more expensive than a system outage. A medical chatbot that invents a drug interaction; a legal-research tool that cites a non-existent case; a procurement agent that fabricates a vendor SKU — each failure is a confident, fluent statement that downstream humans trust because the rest of the response sounds correct. Studies in 2025–2026 consistently show base hallucination rates between 3% and 27% on factual tasks even with current frontier models.

The pain spans every role. Engineers chase user reports with no clear repro. Product teams cannot ship answer features because hallucination rates fail risk review. Compliance teams cannot quantify residual hallucination for an audit. SREs see traces where the same prompt returns two contradictory answers run-over-run.

In 2026 agent stacks, hallucinations cascade. A planner LLM that hallucinates a tool name causes a tool timeout downstream. A retriever that hallucinates a chunk citation produces a confident answer the auditor cannot verify. A multi-agent system in which one agent hallucinates a fact propagates that fact through every subsequent step, and the final summary often launders the fabrication into something that looks audited. Every span that produces text needs hallucination scoring, not just the final output. Long context windows make this worse: a 200K-token plan that fabricates one parameter on step 12 still completes, still ships, and the fabrication is impossible to find post hoc without per-span eval coverage.
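
A minimal per-span scoring sketch, reusing the HallucinationScore call shown later in this section; the span dict fields and the 0.5 threshold are illustrative assumptions, and in practice the spans would come from your trace store rather than a plain list.

from fi.evals import HallucinationScore

halluc = HallucinationScore()

def score_text_spans(spans, threshold=0.5):
    # `spans` is assumed to be a list of dicts with "name", "input",
    # "output", and "context" keys pulled from the trace store.
    flagged = []
    for span in spans:
        if not span.get("output"):
            continue  # skip tool calls and other spans that produce no text
        result = halluc.evaluate(
            input=span.get("input", ""),
            output=span["output"],
            context=span.get("context", ""),
        )
        if result.score > threshold:  # assumed orientation: higher = more hallucinated
            flagged.append((span["name"], result.score))
    return flagged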

How FutureAGI Handles LLM Hallucinations

FutureAGI’s approach combines detection with prevention. fi.evals.HallucinationScore returns a comprehensive score against a (response, context, reference) triple and is the canonical detector. Groundedness and Faithfulness cover the RAG-specific case: every claim must be entailed by the retrieved context. FactualConsistency runs NLI between response and reference to catch contradictions. On the prevention side, Agent Command Center configures pre-guardrail retrieval enforcement and post-guardrail HallucinationScore checks; outputs scoring above the threshold are blocked or routed to a fallback response.
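
A minimal post-guardrail sketch along these lines, reusing the evaluate call from the detection example below; the 0.5 threshold and the fallback text are illustrative assumptions, and in production the equivalent check is configured through Agent Command Center rather than hand-rolled.

from fi.evals import HallucinationScore

halluc = HallucinationScore()
FALLBACK = "I could not verify that answer against the available sources."

def post_guardrail(query, response, retrieved, threshold=0.5):
    # Block or replace any response whose hallucination score exceeds the threshold.
    result = halluc.evaluate(input=query, output=response, context=retrieved)
    if result.score > threshold:  # assumed orientation: higher = more hallucinated
        return FALLBACK, result.score
    return response, result.score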

Concretely: a knowledge-base RAG product traced via traceAI-llamaindex runs Groundedness and HallucinationScore on a 10% sampled cohort and writes the scores as span_events. The team’s dashboard tracks hallucination rate by retriever cohort, and a regression-eval workflow re-runs the suite against every chunking-strategy change. For high-stakes outputs, every response must include citations; a CitationPresence post-guardrail blocks responses without source pointers. Unlike Ragas faithfulness, which scores only the final RAG output, FutureAGI’s per-span scoring localises the failure to a specific retriever or LLM call.
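
A sketch of the sampled-scoring step, assuming a simple random sampler; writing the resulting payload back to the trace as span_events is left to whatever tracing client is in use, so only the scoring itself is shown.

import random

from fi.evals import Groundedness, HallucinationScore

ground = Groundedness()
halluc = HallucinationScore()

def maybe_score(query, response, retrieved, sample_rate=0.10):
    # Score roughly 10% of traced RAG calls and return a span-event payload.
    if random.random() > sample_rate:
        return None
    g = ground.evaluate(output=response, context=retrieved)
    h = halluc.evaluate(input=query, output=response, context=retrieved)
    return {"groundedness": g.score, "hallucination_score": h.score}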

How to Measure or Detect It

  • fi.evals.HallucinationScore: comprehensive 0–1 score; the canonical hallucination metric.
  • fi.evals.Groundedness: per-claim source attribution score; primary RAG signal.
  • fi.evals.FactualConsistency: NLI-based contradiction detector against a reference.
  • fi.evals.CitationPresence: returns a boolean indicating whether the response cites at least one source.
  • Hallucination-rate-by-cohort: dashboard metric that surfaces which retriever, model, or prompt template produces the most fabrications.
  • User-feedback proxy: thumbs-down on factual answers correlates with hallucination but lags by hours.
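
A minimal detection sketch using the first two evaluators; the query, response, and retrieved values below are placeholders standing in for a single traced RAG call.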
from fi.evals import HallucinationScore, Groundedness

# Inputs from a single traced RAG call (illustrative placeholders)
query = "What is the maximum recommended dose of drug X?"
retrieved = "Drug X label: the maximum recommended dose is 40 mg per day."
response = "The maximum recommended dose of drug X is 40 mg per day."

halluc = HallucinationScore()
ground = Groundedness()

# HallucinationScore takes the full (input, output, context) triple;
# Groundedness needs only the response and the retrieved context.
h = halluc.evaluate(input=query, output=response, context=retrieved)
g = ground.evaluate(output=response, context=retrieved)
print(h.score, g.score)

Common Mistakes

  • Relying on temperature=0 as a fix. Lower temperature reduces variance but does not remove hallucination — the prior is still wrong.
  • Treating an LLM-as-a-judge as ground truth. A judge that shares biases with the generator misses fabrications they both find plausible.
  • Scoring only the final answer. Multi-hop pipelines hallucinate at intermediate steps; score every span.
  • Ignoring retrieval quality. Most “model hallucination” issues are actually retrieval failures — bad chunks force the model to invent.
  • Conflating refusal with hallucination. A model that says “I don’t know” is correct; treat refusal separately with AnswerRefusal scoring, as sketched after this list.
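
A sketch of that separation, assuming fi.evals exposes AnswerRefusal with an evaluate signature similar to the other evaluators; the exact API and the 0.5 cutoffs are assumptions, not confirmed behaviour.

from fi.evals import AnswerRefusal, HallucinationScore  # AnswerRefusal signature assumed

halluc = HallucinationScore()
refusal = AnswerRefusal()

def classify(query, response, retrieved):
    # Keep refusals out of the hallucination-rate metric.
    if refusal.evaluate(output=response).score > 0.5:  # assumed: high score = refusal
        return "refusal"
    h = halluc.evaluate(input=query, output=response, context=retrieved)
    return "hallucination" if h.score > 0.5 else "ok"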

Frequently Asked Questions

What are LLM hallucinations?

LLM hallucinations are confident, fluent outputs from a language model that are factually wrong, unsupported by source context, or fabricated. Examples include invented citations, made-up biographical facts, and plausible but invalid code.

How are LLM hallucinations different from factual errors?

A factual error is any wrong claim. A hallucination is specifically a fabricated claim presented with high confidence and surface plausibility — usually emerging from the model's distribution priors rather than a corrupted input.

How do you measure LLM hallucinations?

FutureAGI runs HallucinationScore for general-purpose detection, Groundedness for RAG-specific source attribution, and FactualConsistency for NLI-based contradiction checks against reference evidence.