LLM Hallucination: A 2026 Architectural Deep Dive

Hallucination is four distinct failure modes — factual, grounding, citation, and reasoning. Each needs a different detector and a different fix. The methodology, with code.


A senior engineer pings the channel: “Groundedness is 0.92 on prod, so why is legal flagging hallucinations in three out of every ten audited responses?” Of the three flagged responses, two contain fake case cites; the third has the right citation, but the reasoning doesn’t follow from it. Different bugs. One number on the dashboard. The team had been shipping a single rubric against what the audit found were four distinct failure modes hiding behind the average.

This post is about the four shapes hallucination actually takes and why each needs its own detector. The opinion this guide earns: hallucination is four distinct failure modes — factual, grounding, citation, and reasoning — and a single rubric averages all four into one number that tells the team nothing about which bug to fix. The mitigation for each is different. Treating all four as “hallucination” is treating four different bugs with the same patch.

Audience: ML and applied engineers shipping LLM systems where hallucination matters, plus senior researchers who want the definitive reference on the failure-mode taxonomy and the eval stack that catches each one. The code examples are written against the ai-evaluation SDK.

TL;DR: four failure modes, four detectors

| Failure mode | What it is | The detector | The fix |
| --- | --- | --- | --- |
| Factual | Claim contradicts world fact | Atomic decomposition + external fact-check | Retrieval or tool-call upstream |
| Grounding | Claim contradicts supplied context | Claim-level entailment against retrieval set | Stricter prompt + claim-level Groundedness |
| Citation | Source doesn’t exist, or doesn’t support the claim | Structural + resolvability + semantic rubric | Schema-enforced citations + registry check |
| Reasoning | Right-looking answer, broken inference chain | Step-by-step trace scoring | Refusal on low-confidence chains + chain rubric |

The teams shipping reliable agents in 2026 don’t pick one. They wire all four detectors into the offline eval, the runtime guardrails, and the post-hoc clustering loop. The single-rubric pattern is the most common hallucination-eval mistake.

Why one metric is wrong

Hallucination is the failure users hate most and product owners cannot trade off. Latency, cost, format drift, tool-call errors are negotiable. A confidently wrong claim wrapped in a clean answer is what breaks trust permanently. It’s also nearly impossible to eliminate at the model layer alone — parametric memory is a lossy compression of training data, and the loss is what produces invention.

The mistake most teams make next is bolting one “is the answer correct” judge prompt onto the regression set and calling that hallucination eval. That single rubric averages four distinct failure modes into one score that tells you something is wrong but not which thing. A team shipping against the single number iterates the prompt to lift the score by two points and ships a release that fixes one failure mode while regressing another. Dashboard green; audit red.

The four failure modes are real, distinguishable, and ship together in production traces. The eval stack that catches them is four detectors in parallel, each scoring the trace against the question it is qualified to answer.

The four failure modes

The literature converged on a coarse factuality-vs-faithfulness split in the early years (Huang et al.; FActScore by Min et al.). Production practice in 2026 refined that into four operational categories. The split matters because the four don’t share a fix.

Factual hallucination

A claim that contradicts a verifiable world fact. The model asserts that the Eiffel Tower is in Berlin, generates a plausible date for an event that happened in a different year, or invents an award the person never received. Factual hallucination is independent of the prompt; it’s a knowledge problem.

Detector. Atomic-claim decomposition plus external fact-check. The judge enumerates the claims, then for each either queries a knowledge base or routes verification through a tool-call. FactualAccuracy (eval_id 66) handles atomic decomposition; CustomLLMJudge wires the external fact-check into the grading step.
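The decompose-then-verify shape can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SDK's implementation: `KNOWN_FACTS` is a hypothetical in-memory stand-in for a knowledge base or tool call, and the sentence splitter is a naive substitute for judge-driven claim extraction.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for a knowledge base or tool call:
# maps a subject to its verified location.
KNOWN_FACTS = {
    "eiffel tower": "paris",
    "brandenburg gate": "berlin",
}


@dataclass
class ClaimVerdict:
    claim: str
    verdict: str  # "supported", "contradicted", or "unverifiable"


def split_claims(response: str) -> list[str]:
    """Naive decomposition (assumption): one claim per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def check_claim(claim: str) -> ClaimVerdict:
    """Look the claim up in the fact store; no match means unverifiable."""
    text = claim.lower()
    for subject, location in KNOWN_FACTS.items():
        if subject in text:
            verdict = "supported" if location in text else "contradicted"
            return ClaimVerdict(claim, verdict)
    return ClaimVerdict(claim, "unverifiable")


def factual_hallucination_rate(response: str) -> tuple[float, list[ClaimVerdict]]:
    """Share of checkable claims that contradict the fact store."""
    verdicts = [check_claim(c) for c in split_claims(response)]
    checkable = [v for v in verdicts if v.verdict != "unverifiable"]
    if not checkable:
        return 0.0, verdicts
    contradicted = sum(v.verdict == "contradicted" for v in checkable)
    return contradicted / len(checkable), verdicts


rate, verdicts = factual_hallucination_rate(
    "The Eiffel Tower is in Berlin. The Brandenburg Gate is in Berlin. It rained."
)
for v in verdicts:
    print(v.verdict, "|", v.claim)
print("factual hallucination rate:", rate)
```

Keeping "unverifiable" separate from "contradicted" matters: a claim the fact store cannot reach is a coverage gap, not evidence of hallucination.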

Fix. Move factual reliability upstream. A retrieval or tool-call architecture that pulls verified information into context before generation eliminates most factual hallucination by construction. The model generates from supplied facts instead of parametric memory.

Grounding hallucination

A claim that contradicts the supplied context. The classic example: a RAG system returning a confident answer about a policy clause when the retrieved chunks said something different, or a summarizer adding a detail not in the source. Grounding hallucination is the one that matters most in enterprise — in regulated domains the source of record is the supplied context, not the world. A grounded answer can still be factually wrong if the context was; a factually correct answer is still unfaithful if it bypassed the context.

Detector. Claim-level entailment against the retrieved set, not answer-level. Decompose the response into atomic claims, score each against the context, and report the per-claim rate. Groundedness (47) with augment=True runs an NLI classifier first and falls back to an LLM judge on borderline scores. ContextAdherence (5) checks the topical boundary; ChunkAttribution (11) and ChunkUtilization (12) audit which chunks contributed and which were ignored. The RAG faithfulness deep dive covers the cherry-pick and sycophancy edge cases that pure Groundedness misses.
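A rough sketch of per-claim scoring, assuming the claims have already been extracted. The `overlap_entails` function is a toy token-overlap stand-in for an NLI classifier (token overlap is not entailment), and every name here is illustrative rather than part of the SDK.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "on", "to", "in",
             "and", "or", "at", "per", "from"}


def content_tokens(text: str) -> set[str]:
    """Lowercased tokens minus stopwords; keeps '%' so '1%' stays one token."""
    return {t for t in re.findall(r"[a-z0-9%]+", text.lower()) if t not in STOPWORDS}


def overlap_entails(context: str, claim: str, threshold: float = 0.8) -> bool:
    """Toy stand-in for an NLI model: share of claim tokens present in context."""
    claim_toks = content_tokens(claim)
    if not claim_toks:
        return True
    shared = claim_toks & content_tokens(context)
    return len(shared) / len(claim_toks) >= threshold


def per_claim_groundedness(claims, context_chunks, entails=overlap_entails):
    """Return (supported share, per-claim verdicts) against the joined context."""
    context = " ".join(context_chunks)
    supported = [entails(context, claim) for claim in claims]
    return sum(supported) / len(claims), supported


context_chunks = [
    "Section 234A: simple interest at 1% per month or part thereof on unpaid tax.",
    "Computed from the day after the due date to the actual filing date.",
]
claims = [
    "Simple interest is 1% per month on unpaid tax.",
    "Interest is computed from the day after the due date.",
    "The penalty is capped at 10,000 rupees.",  # not in the context
]

rate, supported = per_claim_groundedness(claims, context_chunks)
print(supported, rate)
```

Swapping `entails` for a real classifier or judge call leaves the reporting shape unchanged: one verdict per claim, and a rate computed over claims rather than answers.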

Fix. Grounding-discipline prompts, plus citation enforcement at generation, plus per-claim entailment scoring at output. Report the per-claim rate instead of the answer-level mean; the per-claim rate is the number that matches what audit finds.

Citation hallucination

A claim attached to a source that doesn’t exist, or to a real source that doesn’t contain the claim. The famous legal-AI failure where the bot returns Smith v. Johnson, 412 F. Supp. 3d 891 (S.D.N.Y. 2019), a case that does not exist, lives here. Standard Groundedness scores the answer against the retrieval set as a whole and never checks whether the cited authority actually contains the claim attached to it. Most teams ship a structural rubric (does the citation parse) and stop there. The failure modes that hurt in regulated production (well-formed citations that don’t resolve, real citations attached to the wrong claim) only surface when you run all three rubric layers.

Detector. Three rubrics, not one. Structural: did the model emit a citation in the right schema (Pydantic, function-call validator, regex)? Resolvability: does the citation point at a fetchable real source (URL returns 200, DOI resolves, chunk_id exists in the retrieval log)? Semantic: does the cited passage actually contain the claim attached to it (judge scores entailment per claim)? EvaluateFunctionCalling handles the structural check deterministically; resolvability is a deterministic registry check; the semantic check is a CustomLLMJudge run on a sample of production traffic.
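The two deterministic layers can be sketched without any model call. The snippet below is an illustration under assumptions: the `chunk-N` ID scheme, the `Citation` fields, and the verbatim quoted-span check (used here as a cheap pre-filter before the semantic judge) are all hypothetical and not taken from the SDK.

```python
import re
from dataclasses import dataclass


@dataclass
class Citation:
    claim: str
    chunk_id: str
    quoted_span: str


CHUNK_ID_PATTERN = re.compile(r"^chunk-\d+$")  # hypothetical ID scheme


def structural_ok(c: Citation) -> bool:
    """Layer 1: the citation matches the expected schema."""
    return bool(CHUNK_ID_PATTERN.match(c.chunk_id)) and bool(c.quoted_span.strip())


def resolvable(c: Citation, retrieval_log: dict[str, str]) -> bool:
    """Layer 2: the cited chunk exists in the retrieval log."""
    return c.chunk_id in retrieval_log


def span_present(c: Citation, retrieval_log: dict[str, str]) -> bool:
    """Pre-filter for layer 3 (assumption): the quote appears verbatim in the chunk."""
    return c.quoted_span in retrieval_log.get(c.chunk_id, "")


def check_citation(c: Citation, retrieval_log: dict[str, str]) -> str:
    if not structural_ok(c):
        return "structural_fail"
    if not resolvable(c, retrieval_log):
        return "unresolvable"
    if not span_present(c, retrieval_log):
        return "span_mismatch"
    # Only citations that survive the cheap layers go to the semantic judge.
    return "needs_semantic_judge"


retrieval_log = {
    "chunk-1": "Section 234A: simple interest at 1% per month or part thereof on unpaid tax.",
}
citations = [
    Citation("Interest is 1% per month.", "chunk-1", "simple interest at 1% per month"),
    Citation("Interest is 1% per month.", "chunk-9", "simple interest at 1% per month"),
    Citation("Interest is 1% per month.", "Smith v. Johnson", "simple interest"),
    Citation("Interest is 2% per month.", "chunk-1", "interest at 2% per month"),
]
results = [check_citation(c, retrieval_log) for c in citations]
print(results)
```

Ordering the layers from cheapest to most expensive means the judge only sees citations that already parse, resolve, and quote real text.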

Fix. Force the output schema to include chunk_id and quoted span per claim, verify each citation against the retrieval log at runtime, run the semantic check as a span-attached scorer on a sample. The LLM citation attribution methodology covers per-domain calibration for legal, medical, and journalism.

Reasoning hallucination

A correct-looking final answer produced by a broken chain of inference. The model gets the arithmetic right but the premise wrong; the multi-hop conclusion follows from steps that contradict each other; the chain-of-thought says the policy excludes X and the final answer says X is covered. Reasoning hallucination is the failure mode the rise of reasoning models (o3, GPT-5 with reasoning effort, Claude 4.7 extended thinking, DeepSeek R1) makes more visible, not less — longer traces mean more places the chain can break.

Detector. Step-by-step trace scoring. The judge enumerates the reasoning steps, scores each against the supplied context or external fact-base, and checks step-to-step logical consistency. CustomLLMJudge with grading_criteria for per-step grading (correctness, support, consistency-with-prior-steps) handles this. Final-answer accuracy alone misses the class of failure where the model arrives at the right answer for the wrong reason — the class that breaks when the inputs shift. The LLM reasoning guide covers the model-side landscape.

Fix. Score the trace, not just the answer. Add a refusal threshold on low-confidence chains: when step-level support drops below a calibrated cutoff, AnswerRefusal triggers and the system asks a clarifying question or routes to a human. One extra judge call per response; payoff is catching right-answer-wrong-chain at the long tail.
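A refusal threshold over per-step scores might look like the sketch below. It assumes the judge has already produced a support score (0, 0.5, or 1) and a consistency flag per step; gating on the weakest step rather than the mean, and the 0.5 default cutoff, are illustrative choices that a team would calibrate.

```python
from dataclasses import dataclass


@dataclass
class StepScore:
    step: str
    support: float    # 1 supported, 0.5 partially supported, 0 contradicted
    consistent: bool  # agrees with the steps before it


def chain_decision(steps: list[StepScore], support_cutoff: float = 0.5) -> str:
    """Return "answer" or "refuse" for a scored reasoning chain.

    Assumption: the chain is gated on its weakest step, so a single
    contradicted step triggers a refusal even if the rest are supported.
    """
    if not steps:
        return "refuse"
    if any(not s.consistent for s in steps):
        return "refuse"
    weakest = min(s.support for s in steps)
    return "answer" if weakest >= support_cutoff else "refuse"


good_chain = [
    StepScore("Policy clause 4 lists exclusions.", 1.0, True),
    StepScore("Flood damage is not in the list.", 1.0, True),
    StepScore("So flood damage is covered.", 0.5, True),
]
broken_chain = [
    StepScore("Policy clause 4 excludes flood damage.", 1.0, True),
    StepScore("So flood damage is covered.", 1.0, False),  # contradicts step 1
]
unsupported_chain = [
    StepScore("The policy was issued in 2019.", 0.0, True),
    StepScore("So the 2019 terms apply.", 1.0, True),
]

print(chain_decision(good_chain))         # answer
print(chain_decision(broken_chain))       # refuse
print(chain_decision(unsupported_chain))  # refuse
```

The `broken_chain` case is the right-answer-wrong-chain pattern: every step looks supported on its own, and only the consistency check catches it.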

The eval stack

The production stack runs all four detectors against a versioned regression set in CI and against live traffic via span-attached scorers. The recurring call shape composes the templates per detector:

from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextAdherence, FactualAccuracy,
    ChunkAttribution, ChunkUtilization, Completeness,
    EvaluateFunctionCalling, AnswerRefusal, CustomLLMJudge,
)
from fi.testcases import TestCase

evaluator = Evaluator(fi_api_key="<your-key>", fi_secret_key="<your-secret>")

reasoning_judge = CustomLLMJudge(
    name="reasoning_step_check",
    grading_criteria=(
        "Enumerate the reasoning steps. For each, score 1 if supported by "
        "context or external fact, 0.5 if partially supported, 0 if "
        "contradicted. Check step-to-step consistency. Penalize "
        "right-answer-wrong-chain. Return per-step scores and the aggregate."
    ),
    model="gpt-5",
)

citation_judge = CustomLLMJudge(
    name="citation_semantic_check",
    grading_criteria=(
        "For each citation, score 1 if the cited passage supports the claim, "
        "0 if it doesn't, and flag if source_id doesn't resolve in the "
        "retrieval log. Return per-citation scores and the aggregate."
    ),
    model="gpt-5",
)

test = TestCase(
    input="What's the penalty for late filing under section 234A?",
    output="Section 234A imposes 1% per month on unpaid tax from the due date.",
    context=[
        "Section 234A: simple interest at 1% per month or part thereof on unpaid tax.",
        "Computed from the day after the due date to the actual filing date.",
    ],
)

result = evaluator.evaluate(
    eval_templates=[
        Groundedness(augment=True),     # grounding: claim-level entailment
        ContextAdherence(),             # grounding: topical boundary
        FactualAccuracy(),              # factual: atomic-claim verification
        ChunkAttribution(),             # grounding: chunks that contributed
        ChunkUtilization(),             # grounding: chunks that were ignored
        Completeness(),                 # grounding: missing-info detection
        AnswerRefusal(),                # reasoning + factual: abstention
        reasoning_judge,                # reasoning: per-step grading
        citation_judge,                 # citation: semantic check
    ],
    inputs=[test],
)
for metric in result[0].metrics:
    print(metric.name, metric.value, metric.reason)

The aggregate dashboard reports four numbers, not one: the hallucination rate per failure mode. The release-level metric is the rolling delta of each rate per route, not the absolute Groundedness mean. A sustained rise of two to five points on any of the four over 30-90 minutes is the signal that a new model version shipped, a chunker change landed, or the retrieval index rotated.
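One way to track the rolling delta for a single failure mode is sketched below. It simplifies the time window into a fixed count of recent responses, and the class name, window size, baseline, and two-point alert threshold are all illustrative assumptions.

```python
from collections import deque


class RollingRate:
    """Hallucination rate over the last `window` scored responses.

    Simplification (assumption): a count-based window stands in for the
    30-90 minute time window a production monitor would use.
    """

    def __init__(self, window: int, baseline: float) -> None:
        self.flags = deque(maxlen=window)  # oldest entries fall off automatically
        self.baseline = baseline

    def add(self, hallucinated: bool) -> None:
        self.flags.append(hallucinated)

    def rate(self) -> float:
        return sum(self.flags) / len(self.flags) if self.flags else 0.0

    def delta_points(self) -> float:
        """Rise over baseline, in percentage points."""
        return (self.rate() - self.baseline) * 100


# One monitor per failure mode and route; only "citation" is shown here.
monitor = RollingRate(window=100, baseline=0.05)
for i in range(100):
    monitor.add(i < 8)  # 8 of the last 100 responses flagged

delta = monitor.delta_points()
alert = delta >= 2.0  # illustrative two-point alert threshold
print(round(delta, 2), alert)
```

Running one `RollingRate` per (failure mode, route) pair gives the four-number dashboard; the alert fires on the delta rather than on the absolute rate.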

The augment cascade

Running an LLM judge on every span across four detectors gets expensive fast. The production pattern is a cascade: cheap heuristic first, classifier-backed check next, LLM-judge fallback on borderline only. The augment=True flag wires the cascade into Groundedness, ContextAdherence, and FactualAccuracy — DeBERTa-NLI for entailment, LLAMAGUARD_3_8B or QWEN3GUARD_8B for content-quality checks. Per-evaluation cost stays below Galileo Luna-2 because most claims clear the classifier and never reach the judge.
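The cascade control flow can be sketched independently of any particular model. In this illustration the classifier and judge are passed in as plain callables, the stubs return canned values, and the 0.2 and 0.8 cutoffs are placeholder thresholds; none of this reflects how augment=True is implemented internally.

```python
from typing import Callable


def cascade_verdict(
    claim: str,
    context: str,
    classifier: Callable[[str, str], float],
    judge: Callable[[str, str], bool],
    low: float = 0.2,
    high: float = 0.8,
) -> tuple[bool, str]:
    """Return (supported, stage_that_decided)."""
    # Stage 1: cheap heuristic. A verbatim match needs no model at all.
    if claim.lower() in context.lower():
        return True, "heuristic"
    # Stage 2: classifier score; confident scores on either side are final.
    score = classifier(context, claim)
    if score >= high:
        return True, "classifier"
    if score <= low:
        return False, "classifier"
    # Stage 3: only borderline scores pay for the judge.
    return judge(context, claim), "judge"


# Stubs for illustration: canned classifier scores and a judge that counts calls.
fake_scores = {
    "Interest accrues monthly on unpaid tax": 0.93,
    "The penalty is capped at 10,000 rupees": 0.04,
    "Interest applies only to unpaid tax": 0.55,
}
judge_calls = []


def stub_classifier(context: str, claim: str) -> float:
    return fake_scores[claim]


def stub_judge(context: str, claim: str) -> bool:
    judge_calls.append(claim)
    return True


context = "Section 234A: simple interest at 1% per month or part thereof on unpaid tax."
claims = ["simple interest at 1% per month", *fake_scores]
routes = [cascade_verdict(c, context, stub_classifier, stub_judge) for c in claims]
print(routes)
print("judge calls:", len(judge_calls))
```

The cost saving comes from the width of the band between `low` and `high`: the narrower the borderline band, the fewer claims reach the judge.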

Citation and reasoning detectors don’t have a classifier shortcut: the semantic citation check and the step-by-step reasoning rubric both need the judge. Production samples these at 5-20% of traffic in low-stakes deployments and 100% in compliance-sensitive ones. The LLM evaluation playbook covers the cascade economics.
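A deterministic sampler is one simple way to hold the judge-backed detectors to a fixed share of traffic. The sketch below hashes a trace ID into a bucket so the same trace always gets the same decision; the route names and rates are illustrative assumptions.

```python
import hashlib

# Illustrative per-route sampling rates for the judge-backed detectors.
SAMPLE_RATES = {
    "consumer_qa": 0.10,  # low-stakes: sample a fraction of traffic
    "compliance": 1.00,   # compliance-sensitive: score everything
}


def should_sample(trace_id: str, rate: float) -> bool:
    """Map the trace ID to a stable bucket in [0, 1) and compare to the rate."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def run_judge_detectors(trace_id: str, route: str) -> bool:
    return should_sample(trace_id, SAMPLE_RATES[route])


sampled = sum(run_judge_detectors(f"trace-{i}", "consumer_qa") for i in range(10_000))
print("sampled share on consumer_qa:", sampled / 10_000)
```

Hashing instead of calling a random number generator keeps the decision reproducible, so a trace sampled in production can be re-scored offline with the same inclusion rule.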

Span-attached scoring and Error Feed

The same eval call runs in CI on a versioned regression set and in production as span-attached scores via traceAI. Per-claim and per-step scores attach to RETRIEVER, LLM, and AGENT spans so the retrieved chunks, the cited authorities, and the verdict live in the same trace tree (14 span kinds, 50+ AI surfaces).

Error Feed sits inside the eval stack. HDBSCAN over ClickHouse-stored span embeddings groups failing traces into named issues: “fake case cite in coverage Q&A,” “reasoning chain contradicts retrieved exclusion clause,” “factual fabrication in niche-historical-figure summary.” A Claude Sonnet 4.5 Judge agent on Bedrock reads each failing trace, writes the RCA, evidence quotes, and an immediate_fix. Representative traces promote into the eval set under engineer sign-off. Linear is wired today; Slack, GitHub, Jira, PagerDuty are on the roadmap. The Platform adds a feedback loop: production thumbs up/down retune the rubric grading prompt and few-shot examples so the rubric calibrates against the team’s actual definition of acceptable.

Defense in depth: per-failure-mode mitigation

The teams shipping reliable agents in 2026 don’t run one mitigation across all four failure modes. They stack the mitigation that fits each:

  • Factual. Retrieval and tool-call upstream of generation. When the model can look up the fact, factual hallucination drops sharply. The RAG evaluation tools guide covers the retrieval-side patterns.
  • Grounding. Grounding-discipline prompts plus claim-level Groundedness with augment=True plus citation enforcement at the output schema. The per-claim rate, not the answer-level mean, is the binding number.
  • Citation. Structured citations in the output schema, runtime registry check (URL resolve, DOI lookup, chunk_id join), sampled semantic-entailment check on the (claim, cited_passage) pair.
  • Reasoning. Per-step trace scoring plus a refusal threshold on low-confidence chains. The cost is one judge call per trace; the payoff is catching right-answer-wrong-chain at the long tail.
  • Eval-gated CI before promotion. Run the four detectors against a versioned regression set on every PR. Fail the build if any failure-mode rate regresses by more than the calibrated delta from baseline. The CI/CD eval guide covers the gate shape.
  • Error Feed clustering in production. HDBSCAN over failure embeddings, Sonnet 4.5 Judge writes the fix, self-improving evaluators retune the thresholds. Catches the drift the regression set didn’t anticipate.
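The eval-gated CI step in the list above reduces to a per-mode comparison against a stored baseline. The sketch below assumes hallucination rates have already been computed for the regression set; the baseline numbers, tolerances, and function name are made up for illustration.

```python
def gate_failures(
    baseline: dict[str, float],
    current: dict[str, float],
    max_rise: dict[str, float],
) -> list[str]:
    """List every failure mode whose hallucination rate rose past its tolerance."""
    failures = []
    for mode, base_rate in baseline.items():
        rise = current[mode] - base_rate
        if rise > max_rise[mode]:
            failures.append(
                f"{mode}: {base_rate:.2%} -> {current[mode]:.2%} "
                f"(allowed rise {max_rise[mode]:.2%})"
            )
    return failures


# Illustrative numbers only.
baseline = {"factual": 0.04, "grounding": 0.06, "citation": 0.02, "reasoning": 0.05}
current = {"factual": 0.05, "grounding": 0.03, "citation": 0.06, "reasoning": 0.05}
max_rise = {"factual": 0.02, "grounding": 0.02, "citation": 0.02, "reasoning": 0.02}

failures = gate_failures(baseline, current, max_rise)
for line in failures:
    print("GATE FAIL", line)
```

In this example grounding improved by three points and citation regressed by four, so a single averaged score could stay flat while the gate correctly fails on citation. Inside a pytest run, the natural assertion would be that `failures` is empty.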

Skipping the mitigation for any one failure mode leaves a measurable gap. Running all four gets the system to a place where hallucination is a metric you watch per type, not a fire you fight in aggregate.

Anti-patterns to avoid

The mistakes that show up in nearly every team’s first hallucination-eval attempt:

  • Single hallucination rubric. One “is the answer correct” judge prompt averaged across every trace. Hides which of the four modes is breaking. Use four detectors.
  • Answer-level Groundedness only. A 0.92 mean groundedness can hide a per-claim hallucination rate of around 20%. Decompose to claims.
  • Structural citation check only. Misses the fake-case-that-parses class. Add resolvability and semantic.
  • Final-answer accuracy for reasoning models. Right answer, wrong chain ships when only the answer is graded. Score the trace.
  • No external fact-check for factual. Internal-context-only judges cannot catch factual hallucination by construction. Wire a knowledge base or tool-call into the rubric.
  • Frozen rubric. The team’s definition of acceptable drifts; the rubric needs to drift with it via self-improving evaluators or quarterly review.
  • No span-attached scoring. Without trace-attached scores, live failures never make it back into the dataset.
  • Treating hallucination as a fundamental limit. This becomes an excuse not to ship the four-detector stack. Hallucination is a measurable, addressable engineering problem.

How Future AGI ships the hallucination stack

Future AGI ships the four-detector stack as a package. Start with the SDK for code-defined evals across the four failure modes. Graduate to the Platform when you want self-improving rubrics and an in-product authoring agent.

  • ai-evaluation SDK (Apache 2.0): templates for all four modes. Factual: FactualAccuracy (66) plus CustomLLMJudge with a fact-check tool. Grounding: Groundedness (47), ContextAdherence (5), ChunkAttribution (11), ChunkUtilization (12), Completeness (10). Citation: EvaluateFunctionCalling plus CustomLLMJudge for resolvability and semantic. Reasoning: CustomLLMJudge with per-step grading_criteria plus AnswerRefusal. Local NLI primitives in python/fi/evals/metrics/rag/utils/ for the cheap entailment path.
  • Future AGI Platform: self-improving evaluators tuned by user feedback; in-product authoring agent writes reasoning and citation rubrics from natural language; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
  • Error Feed (inside the eval stack): HDBSCAN clustering groups failing traces into named issues per failure mode; Sonnet 4.5 Judge writes the immediate_fix; engineer-reviewed promotions feed the dataset.
  • traceAI (Apache 2.0): per-claim and per-step scores attach as span attributes on RETRIEVER, LLM, and AGENT spans so the eval result lives in the trace tree.

The honest tradeoff: running all four detectors costs roughly 2-3x a single-rubric eval because of the per-step reasoning judge and the semantic citation check. For regulated workloads where audit failure rate is the binding constraint, it’s the right tradeoff. For consumer Q&A, reasoning and citation can stay sampled at 5-20% while grounding and factual run at 100%.

Closing: four bugs, four patches

The two failure patterns we see most often are opposite mistakes. The first is teams treating hallucination as one problem, shipping one rubric, iterating against an aggregate that hides which mode is regressing. The second is teams treating hallucination as inherent, shipping without measurement, hoping the model improves next quarter. Both miss the four-mode structure and the four-detector answer.

The teams shipping reliable agents in 2026 treat hallucination the same way they treat any other reliability problem: measured per failure mode, defended in depth, iterated weekly from production feedback, closed-loop through clustering. The model layer is one piece of the stack. The four detectors are where most of the reliability lives.

Ready to wire the four-detector stack into a CI gate? Bind Groundedness(augment=True), FactualAccuracy, ChunkAttribution, ChunkUtilization, EvaluateFunctionCalling, plus the reasoning and citation CustomLLMJudge rubrics into a pytest fixture against the ai-evaluation SDK, then add the traceAI instrumentor when production traces start asking questions the CI gate missed.

Frequently asked questions

Why four types of hallucination instead of one?
Because the fix is different for each. Factual hallucination contradicts world fact and is solved with retrieval or tool-call. Grounding hallucination contradicts the supplied context and is solved with claim-level entailment scoring against the retrieval set. Citation hallucination cites a source that does not exist or does not support the claim attached to it, and is solved with a structural-plus-resolvability-plus-semantic citation rubric. Reasoning hallucination produces a correct-sounding answer through a broken chain of inference, and is solved by scoring the reasoning trace step-by-step. A single hallucination metric averages all four into one number and tells the team nothing about which bug to fix.
What is the difference between factual and grounding hallucination?
Factual hallucination is independent of the prompt. The model asserts that the Eiffel Tower is in Berlin; the assertion is wrong regardless of what context the user supplied. Grounding hallucination is dependent on the prompt. The model gives a confident answer about a policy clause when the retrieved chunks said something different; the answer might even be correct in the world, but it is unfaithful to the supplied source of record. Factual needs external fact-check (tool-call, knowledge base, web search). Grounding needs entailment scoring against the retrieval set. In regulated domains, the source of record is the supplied context, so grounding is the binding eval — a factually correct answer that bypassed the policy document is still wrong.
What is citation hallucination and why is it its own category?
Citation hallucination is the model inventing a source, or attaching a real source to a claim it does not support. The famous legal-AI failure mode where the bot drops in Smith v. Johnson, 412 F. Supp. 3d 891 (S.D.N.Y. 2019) — a case that does not exist — lives in this category. Standard Groundedness scores the answer against the retrieval set as a whole and never checks whether the cited authority actually contains the claim it is attached to. Citation needs its own three-rubric stack: structural (did the model emit a citation in the schema), resolvability (does the citation point at a fetchable real source), and semantic (does the source actually contain the claim). The legal, medical, and journalism deployments that ship without this rubric ship without an audit trail.
What is reasoning hallucination and how do you score it?
Reasoning hallucination is a correct-looking final answer produced by a broken inference chain. The model gets the arithmetic right but the premise wrong; the multi-hop conclusion follows from steps that contradict each other; the chain-of-thought says the policy excludes X but the answer says X is covered. Scoring requires the reasoning trace, not just the final answer. The judge enumerates the steps, scores each one against the supplied context or external fact-base, and checks step-to-step logical consistency. Future AGI's CustomLLMJudge with grading_criteria handles step-by-step scoring; final-answer accuracy alone misses the class of failure where the model arrives at the right answer for the wrong reason — which is the class that breaks at the long tail.
Why does atomic-claim decomposition matter for hallucination eval?
A single hallucination score per answer loses resolution. A response with five claims where one is fabricated still scores around 0.8 because the judge averages support across the response. Decomposing the response into atomic, standalone, declarative claims and scoring each one independently surfaces the per-claim hallucination rate, which is the number that matches what audit finds. Future AGI's FactualAccuracy template and the extract_atomic_claims primitive in the ai-evaluation SDK run claim extraction by default and expose the per-claim list in the eval result. The dashboard moves from 'answer pass 92%' to 'claim pass 78%'; the 78 is the production number.
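The gap between the two numbers is plain arithmetic. The worked example below uses made-up data (ten responses of five claims each) and a hypothetical answer-level pass threshold of 0.8 to show how a high answer pass rate and a lower claim pass rate come from the same verdicts.

```python
# Per-claim verdicts: 1 = supported, 0 = fabricated. Illustrative data only.
responses = [[1, 1, 1, 1, 0]] * 9 + [[1, 1, 1, 0, 0]]

ANSWER_PASS_THRESHOLD = 0.8  # hypothetical answer-level cutoff

# Answer-level view: average support per response, then threshold it.
answer_scores = [sum(r) / len(r) for r in responses]
answer_pass_rate = sum(s >= ANSWER_PASS_THRESHOLD for s in answer_scores) / len(responses)

# Claim-level view: pool every claim and count the supported ones.
all_claims = [verdict for r in responses for verdict in r]
claim_pass_rate = sum(all_claims) / len(all_claims)

print(f"answer pass rate: {answer_pass_rate:.0%}")  # 9 of 10 responses clear 0.8
print(f"claim pass rate:  {claim_pass_rate:.0%}")   # 39 of 50 claims supported
```

Nine of the ten responses contain a fabricated claim and still pass at the answer level, which is why the claim-level rate is the one an audit reproduces.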
What does Future AGI ship for hallucination evaluation?
The ai-evaluation SDK (Apache 2.0) ships templates that map to all four failure modes. Factual: FactualAccuracy (eval_id 66) for atomic-claim verification, CustomLLMJudge wired to an external fact-check tool for world-fact verification. Grounding: Groundedness (47), ContextAdherence (5), ChunkAttribution (11), ChunkUtilization (12), Completeness (10). Citation: EvaluateFunctionCalling for the structural schema check, plus CustomLLMJudge with per-domain resolvability and semantic rubrics. Reasoning: CustomLLMJudge with step-by-step grading_criteria, plus AnswerRefusal for the abstention path on broken chains. The augment=True flag cascades the cheap classifier first, judge only on borderline. traceAI attaches the per-claim scores to RETRIEVER and LLM spans so the eval result lives in the trace tree.
Can hallucination be eliminated at the model layer?
No. Parametric memory is lossy compression of training data, and the loss is what produces invention. Even retrieval-augmented systems hallucinate because the model can still pull from parametric memory when the supplied context is incomplete or ambiguous. The production answer is layered defense per failure mode: retrieval and tool-call upstream for factual, citation enforcement at generation for citation, claim-level entailment scoring for grounding, step-by-step trace scoring for reasoning, eval-gated CI before promotion, and Error Feed clustering with self-improving evaluators in production. Each layer drops the residual rate for its failure mode; combined they get most enterprise systems below the human-baseline error rate for the task.