LLM Hallucination: A 2026 Architectural Deep Dive
Hallucination is four distinct failure modes — factual, grounding, citation, and reasoning. Each needs a different detector and a different fix. The methodology, with code.
A senior engineer pings the channel: “Groundedness is 0.92 on prod, so why is legal flagging hallucinations in three out of every ten audited responses?” Two of the flags are fake case cites; in the third, the citation was right but the reasoning didn’t follow from it. Three flagged responses, at least two different bugs, one number on the dashboard. The team had been shipping a single rubric against four distinct failure modes hiding behind the average.
This post is about the four shapes hallucination actually takes and why each needs its own detector. The opinion this guide earns: hallucination is four distinct failure modes — factual, grounding, citation, and reasoning — and a single rubric averages all four into one number that tells the team nothing about which bug to fix. The mitigation for each is different. Treating all four as “hallucination” is treating four different bugs with the same patch.
Audience: ML and applied engineers shipping LLM systems where hallucination matters, plus senior researchers who want the definitive reference on the failure-mode taxonomy and the eval stack that catches each one. Code shaped against the ai-evaluation SDK.
TL;DR: four failure modes, four detectors
| Failure mode | What it is | The detector | The fix |
|---|---|---|---|
| Factual | Claim contradicts world fact | Atomic decomposition + external fact-check | Retrieval or tool-call upstream |
| Grounding | Claim contradicts supplied context | Claim-level entailment against retrieval set | Stricter prompt + claim-level Groundedness |
| Citation | Source doesn’t exist, or doesn’t support the claim | Structural + resolvability + semantic rubric | Schema-enforced citations + registry check |
| Reasoning | Right-looking answer, broken inference chain | Step-by-step trace scoring | Refusal on low-confidence chains + chain rubric |
The teams shipping reliable agents in 2026 don’t pick one. They wire all four detectors into the offline eval, the runtime guardrails, and the post-hoc clustering loop. The single-rubric pattern is the most common hallucination-eval mistake.
Why one metric is wrong
Hallucination is the failure users hate most and product owners cannot trade off. Latency, cost, format drift, tool-call errors are negotiable. A confidently wrong claim wrapped in a clean answer is what breaks trust permanently. It’s also nearly impossible to eliminate at the model layer alone — parametric memory is a lossy compression of training data, and the loss is what produces invention.
The mistake most teams make next is bolting one “is the answer correct” judge prompt onto the regression set and calling that hallucination eval. That single rubric averages four distinct failure modes into one score that tells you something is wrong but not which thing. A team shipping against the single number iterates the prompt to lift the score by two points and ships a release that fixes one failure mode while regressing another. Dashboard green; audit red.
The four failure modes are real, distinguishable, and ship together in production traces. The eval stack that catches them is four detectors in parallel, each scoring the trace against the question it is qualified to answer.
The four failure modes
The literature converged on a coarse factual-vs-faithfulness split in the early years (Huang et al.; Min et al.’s FActScore). Production practice in 2026 refined that into four operational categories. The split matters because the four don’t share a fix.
Factual hallucination
A claim that contradicts a verifiable world fact. The model asserts that the Eiffel Tower is in Berlin, generates a plausible date for an event that happened in a different year, or invents an award the person never received. Factual hallucination is independent of the prompt; it’s a knowledge problem.
Detector. Atomic-claim decomposition plus external fact-check. The judge enumerates the claims, then for each either queries a knowledge base or routes verification through a tool-call. FactualAccuracy (eval_id 66) handles atomic decomposition; CustomLLMJudge wires the external fact-check into the grading step.
Fix. Move factual reliability upstream. A retrieval or tool-call architecture that pulls verified information into context before generation eliminates most factual hallucination by construction. The model generates from supplied facts instead of parametric memory.
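A minimal sketch of the verification half of this detector, assuming the judge has already decomposed the response into atomic claims. The `KNOWLEDGE_BASE` dict, the `AtomicClaim` shape, and the three verdict labels are illustrative assumptions, not part of the ai-evaluation SDK; a real system would query a curated knowledge base or route the lookup through a tool call.

```python
from dataclasses import dataclass

# Hypothetical in-memory knowledge base keyed by (subject, relation).
# Stand-in for a curated KB or a tool call in a real system.
KNOWLEDGE_BASE = {
    ("Eiffel Tower", "located_in"): "Paris",
}


@dataclass(frozen=True)
class AtomicClaim:
    subject: str
    relation: str
    value: str


def check_claim(claim: AtomicClaim) -> str:
    """Return 'supported', 'contradicted', or 'unverifiable' for one claim."""
    known = KNOWLEDGE_BASE.get((claim.subject, claim.relation))
    if known is None:
        return "unverifiable"
    return "supported" if known == claim.value else "contradicted"


def factual_hallucination_rate(claims: list[AtomicClaim]) -> float:
    """Share of claims that the knowledge base contradicts."""
    if not claims:
        return 0.0
    verdicts = [check_claim(c) for c in claims]
    return verdicts.count("contradicted") / len(claims)


claims = [
    AtomicClaim("Eiffel Tower", "located_in", "Berlin"),
    AtomicClaim("Eiffel Tower", "located_in", "Paris"),
]
rate = factual_hallucination_rate(claims)
```

Keeping "unverifiable" separate from "contradicted" is a deliberate choice in this sketch: a claim the knowledge base cannot confirm is a coverage gap to route elsewhere, not evidence of fabrication.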
Grounding hallucination
A claim that contradicts the supplied context. The classic example: a RAG system returning a confident answer about a policy clause when the retrieved chunks said something different, or a summarizer adding a detail not in the source. Grounding hallucination is the one that matters most in enterprise — in regulated domains the source of record is the supplied context, not the world. A grounded answer can still be factually wrong if the context was; a factually correct answer is still unfaithful if it bypassed the context.
Detector. Claim-level entailment against the retrieved set, not answer-level. Decompose the response into atomic claims, score each against the context, report the per-claim rate. Groundedness (47) with augment=True runs an NLI classifier first and falls back to LLM judge on borderline scores. ContextAdherence (5) checks topical boundary; ChunkAttribution (11) and ChunkUtilization (12) audit which chunks contributed and which were ignored. The RAG faithfulness deep dive covers the cherry-pick and sycophancy edge cases pure Groundedness misses.
Fix. Grounding-discipline prompts, citation enforcement at generation, and per-claim entailment scoring at output. Report the per-claim rate rather than the answer-level mean; the per-claim rate is the number that matches what an audit finds.
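To make the gap between the two numbers concrete, here is a small worked example on made-up verdicts. The data is invented for illustration, and the function names are hypothetical; the per-claim verdicts would come from whatever entailment scorer you run.

```python
# Each inner list holds per-claim verdicts for one response:
# True = entailed by the retrieved context, False = not entailed.
# The verdicts below are invented for illustration.
responses = [
    [True, True, True, True],
    [True, True, True, False],
    [True, True, True, True, True],
    [True, False, True, True],
]


def answer_level_mean(responses):
    """Mean of per-response support fractions (the single dashboard number)."""
    return sum(sum(r) / len(r) for r in responses) / len(responses)


def per_claim_unsupported_rate(responses):
    """Share of all claims that are not entailed by the context."""
    claims = [verdict for r in responses for verdict in r]
    return claims.count(False) / len(claims)


def share_of_responses_with_unsupported_claim(responses):
    """Share of responses containing at least one unsupported claim."""
    return sum(1 for r in responses if not all(r)) / len(responses)


mean_score = answer_level_mean(responses)                          # 0.875
claim_rate = per_claim_unsupported_rate(responses)                 # 2 of 17 claims
flagged_share = share_of_responses_with_unsupported_claim(responses)  # 0.5
```

On this toy data the answer-level mean is 0.875 while half the responses carry at least one unsupported claim, which is the number an auditor reading response by response would report.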
Citation hallucination
A claim attached to a source that doesn’t exist, or to a real source that doesn’t contain the claim. The famous legal-AI failure where the bot returns Smith v. Johnson, 412 F. Supp. 3d 891 (S.D.N.Y. 2019), a case that does not exist, lives here. Standard Groundedness scores the answer against the retrieval set as a whole and never checks whether the cited authority actually contains the claim attached to it. Most teams ship a structural rubric (does the citation parse) and stop there. The failure modes that hurt in regulated production, such as well-formed citations that don’t resolve and real citations attached to the wrong claim, only surface when you run all three layers.
Detector. Three rubrics, not one. Structural: did the model emit a citation in the right schema (Pydantic, function-call validator, regex). Resolvability: does the citation point at a fetchable real source (URL returns 200, DOI resolves, chunk_id exists in the retrieval log). Semantic: does the cited passage actually contain the claim attached to it (judge scores entailment per claim). EvaluateFunctionCalling handles structural deterministically; resolvability is a deterministic registry check; semantic is a CustomLLMJudge sampled at production traffic.
Fix. Force the output schema to include chunk_id and quoted span per claim, verify each citation against the retrieval log at runtime, run the semantic check as a span-attached scorer on a sample. The LLM citation attribution methodology covers per-domain calibration for legal, medical, and journalism.
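The deterministic layers can be sketched in a few lines. This assumes a per-request retrieval log keyed by `chunk_id` and a `chunk-<n>` id format, both hypothetical. The verbatim-quote check is a cheap pre-filter only; it does not replace the semantic entailment judge, which still has to decide whether the quoted span supports the claim.

```python
import re
from dataclasses import dataclass

# Hypothetical retrieval log for one request: chunk_id -> chunk text.
RETRIEVAL_LOG = {
    "chunk-17": "Section 234A: simple interest at 1% per month "
                "or part thereof on unpaid tax.",
}

CHUNK_ID_PATTERN = re.compile(r"chunk-\d+")  # assumed id format


@dataclass(frozen=True)
class Citation:
    claim: str
    chunk_id: str
    quoted_span: str


def structural_ok(c: Citation) -> bool:
    """Layer 1: all fields present and the id matches the expected schema."""
    return bool(c.claim and c.quoted_span
                and CHUNK_ID_PATTERN.fullmatch(c.chunk_id))


def resolvable(c: Citation) -> bool:
    """Layer 2: the cited chunk actually exists in the retrieval log."""
    return c.chunk_id in RETRIEVAL_LOG


def quote_present(c: Citation) -> bool:
    """Pre-check for layer 3: the quoted span appears verbatim in the chunk."""
    return c.quoted_span in RETRIEVAL_LOG.get(c.chunk_id, "")


def check_citation(c: Citation) -> str:
    """Run the cheap checks in order and report the first failure."""
    if not structural_ok(c):
        return "structural_fail"
    if not resolvable(c):
        return "unresolvable"
    if not quote_present(c):
        return "quote_not_in_source"
    return "needs_semantic_judge"  # passed deterministic layers only


good = Citation("Interest is 1% per month.", "chunk-17",
                "simple interest at 1% per month")
fabricated = Citation("Interest is 1% per month.", "chunk-99",
                      "simple interest at 1% per month")
```

Here `fabricated` parses cleanly and would pass a structural-only rubric; it fails only at the registry lookup.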
Reasoning hallucination
A correct-looking final answer produced by a broken chain of inference. The model gets the arithmetic right but the premise wrong; the multi-hop conclusion follows from steps that contradict each other; the chain-of-thought says the policy excludes X and the final answer says X is covered. Reasoning hallucination is the failure mode the rise of reasoning models (o3, GPT-5 with reasoning effort, Claude 4.7 extended thinking, DeepSeek R1) makes more visible, not less — longer traces mean more places the chain can break.
Detector. Step-by-step trace scoring. The judge enumerates the reasoning steps, scores each against the supplied context or external fact-base, and checks step-to-step logical consistency. CustomLLMJudge with grading_criteria for per-step grading (correctness, support, consistency-with-prior-steps) handles this. Final-answer accuracy alone misses the class of failure where the model arrives at the right answer for the wrong reason — the class that breaks when the inputs shift. The LLM reasoning guide covers the model-side landscape.
Fix. Score the trace, not just the answer. Add a refusal threshold on low-confidence chains: when step-level support drops below a calibrated cutoff, AnswerRefusal triggers and the system asks a clarifying question or routes to a human. One extra judge call per response; payoff is catching right-answer-wrong-chain at the long tail.
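One way to express the refusal threshold, assuming the per-step judge returns scores on the 1 / 0.5 / 0 scale. The `SUPPORT_CUTOFF` value and the routing labels are placeholders; the cutoff needs calibrating against labelled traces.

```python
from typing import Sequence

# Placeholder cutoff; calibrate against labelled traces before relying on it.
SUPPORT_CUTOFF = 0.5


def weakest_step(step_scores: Sequence[float]) -> float:
    """Return the lowest per-step support score in the chain."""
    if not step_scores:
        raise ValueError("a reasoning trace needs at least one scored step")
    return min(step_scores)


def route(step_scores: Sequence[float], cutoff: float = SUPPORT_CUTOFF) -> str:
    """Refuse or escalate when any single step falls below the cutoff."""
    if weakest_step(step_scores) < cutoff:
        return "refuse_or_escalate"
    return "answer"


broken_chain = [1.0, 1.0, 0.0]   # one contradicted step
partial_chain = [1.0, 0.5, 1.0]  # one partially supported step

decision_broken = route(broken_chain)
decision_partial = route(partial_chain)
```

The sketch gates on the weakest step rather than the mean. `broken_chain` averages about 0.67 and would clear a mean-based cutoff of 0.5 even though one step is contradicted, which is exactly the right-answer-wrong-chain case.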
The eval stack
The production stack runs all four detectors against a versioned regression set in CI and against live traffic via span-attached scorers. The recurring call shape composes the templates per detector:
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextAdherence, FactualAccuracy,
    ChunkAttribution, ChunkUtilization, Completeness,
    EvaluateFunctionCalling, AnswerRefusal, CustomLLMJudge,
)
from fi.testcases import TestCase

evaluator = Evaluator(fi_api_key="<your-key>", fi_secret_key="<your-secret>")

reasoning_judge = CustomLLMJudge(
    name="reasoning_step_check",
    grading_criteria=(
        "Enumerate the reasoning steps. For each, score 1 if supported by "
        "context or external fact, 0.5 if partially supported, 0 if "
        "contradicted. Check step-to-step consistency. Penalize "
        "right-answer-wrong-chain. Return per-step scores and the aggregate."
    ),
    model="gpt-5",
)

citation_judge = CustomLLMJudge(
    name="citation_semantic_check",
    grading_criteria=(
        "For each citation, score 1 if the cited passage supports the claim, "
        "0 if it doesn't, and flag if source_id doesn't resolve in the "
        "retrieval log. Return per-citation scores and the aggregate."
    ),
    model="gpt-5",
)

test = TestCase(
    input="What's the penalty for late filing under section 234A?",
    output="Section 234A imposes 1% per month on unpaid tax from the due date.",
    context=[
        "Section 234A: simple interest at 1% per month or part thereof on unpaid tax.",
        "Computed from the day after the due date to the actual filing date.",
    ],
)

result = evaluator.evaluate(
    eval_templates=[
        Groundedness(augment=True),  # grounding: claim-level entailment
        ContextAdherence(),          # grounding: topical boundary
        FactualAccuracy(),           # factual: atomic-claim verification
        ChunkAttribution(),          # grounding: chunks that contributed
        ChunkUtilization(),          # grounding: chunks that were ignored
        Completeness(),              # grounding: missing-info detection
        AnswerRefusal(),             # reasoning + factual: abstention
        reasoning_judge,             # reasoning: per-step grading
        citation_judge,              # citation: semantic check
    ],
    inputs=[test],
)

for metric in result[0].metrics:
    print(metric.name, metric.value, metric.reason)
The aggregate dashboard reports four numbers, not one: the per-failure-mode hallucination rate. The release-level metric is the rolling delta of each rate per route, not the absolute Groundedness mean. A two-to-five-point sustained rise on any of the four over 30-90 minutes is the signal that a new model version shipped, the chunker change landed, or the retrieval index rotated.
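A sketch of the sustained-rise check, assuming each failure mode’s rate is bucketed at a fixed interval (say five minutes, so six buckets cover 30 minutes). The bucket size, the 0.02 minimum delta, and the sample series are illustrative assumptions.

```python
def sustained_rise(rates, baseline, min_delta=0.02, window=6):
    """True if each of the last `window` readings exceeds the baseline
    by at least `min_delta`. A single spike does not trigger."""
    if len(rates) < window:
        return False
    return all(r - baseline >= min_delta for r in rates[-window:])


def modes_in_alarm(series_by_mode, baseline_by_mode, min_delta=0.02, window=6):
    """Return the failure modes whose rate shows a sustained rise."""
    return sorted(
        mode
        for mode, series in series_by_mode.items()
        if sustained_rise(series, baseline_by_mode[mode], min_delta, window)
    )


# Invented per-bucket hallucination rates for one route.
series_by_mode = {
    "citation":  [0.05, 0.05, 0.09, 0.09, 0.10, 0.09, 0.09, 0.10],
    "grounding": [0.08, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08],
    "reasoning": [0.03, 0.03, 0.03, 0.09, 0.03, 0.03, 0.03, 0.03],
}
baseline_by_mode = {"citation": 0.05, "grounding": 0.08, "reasoning": 0.03}

alarms = modes_in_alarm(series_by_mode, baseline_by_mode)
```

Requiring every bucket in the window to clear the delta is the simplest definition of "sustained"; a production alert might instead tolerate one or two quiet buckets.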
The augment cascade
Running an LLM judge on every span across four detectors gets expensive fast. The production pattern is a cascade: cheap heuristic first, classifier-backed check next, LLM-judge fallback on borderline only. The augment=True flag wires the cascade into Groundedness, ContextAdherence, and FactualAccuracy — DeBERTa-NLI for entailment, LLAMAGUARD_3_8B or QWEN3GUARD_8B for content-quality checks. Per-evaluation cost stays below Galileo Luna-2 because most claims clear the classifier and never reach the judge.
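The control flow of such a cascade can be sketched with stand-in scorers. Everything here is hypothetical: `token_overlap` is a toy substitute for an NLI classifier, `fake_judge` stands in for an LLM call, and the 0.3 / 0.7 thresholds are placeholders. This is not how `augment=True` is implemented internally; it only illustrates the routing.

```python
def token_overlap(claim: str, context: str) -> float:
    """Toy stand-in for an NLI classifier: share of claim tokens in context."""
    claim_tokens = set(claim.lower().split())
    context_tokens = set(context.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & context_tokens) / len(claim_tokens)


def cascade_score(claim, context, classifier, judge, low=0.3, high=0.7):
    """Return (score, stage). Only borderline classifier scores reach the judge."""
    # Stage 1: cheap heuristic, verbatim containment.
    if claim.lower() in context.lower():
        return 1.0, "heuristic"
    # Stage 2: classifier decides the confident cases.
    p = classifier(claim, context)
    if p >= high:
        return 1.0, "classifier"
    if p <= low:
        return 0.0, "classifier"
    # Stage 3: expensive judge, borderline cases only.
    return judge(claim, context), "judge"


judge_calls = []


def fake_judge(claim: str, context: str) -> float:
    """Stand-in for an LLM judge; records the call and returns unsupported."""
    judge_calls.append(claim)
    return 0.0


context = "simple interest at 1% per month on unpaid tax"
claims = [
    "interest at 1% per month",          # verbatim in context
    "the penalty is a flat fee of 500",  # no overlap
    "interest accrues at 1% per year",   # borderline overlap
]
stages = [cascade_score(c, context, token_overlap, fake_judge)[1] for c in claims]
```

Only the third claim reaches the judge, which is where the cost saving comes from: the fewer claims land in the borderline band, the fewer judge calls you pay for.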
Citation and reasoning detectors don’t have a classifier shortcut: the semantic citation check and the per-step reasoning rubric both need the judge. Production samples these at 5-20% of traffic in low-stakes deployments and 100% in compliance-sensitive ones. The LLM evaluation playbook covers the cascade economics.
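If you sample, it helps for the decision to be deterministic per trace so that the citation and reasoning judges see the same traces and a rerun scores the same set. One common approach, sketched here with a hypothetical `in_sample` helper, hashes the trace id into a bucket.

```python
import hashlib


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically decide whether a trace falls in the sampled set.

    The first 8 bytes of the SHA-256 digest give a 64-bit bucket; the
    trace is sampled when the bucket falls below rate * 2**64.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled_share = sum(in_sample(t, 0.10) for t in trace_ids) / len(trace_ids)
```

Because the bucket depends only on the trace id, raising the rate from 5% to 20% keeps every previously sampled trace in the set, so historical comparisons stay valid.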
Span-attached scoring and Error Feed
The same eval call runs in CI on a versioned regression set and in production as span-attached scores via traceAI. Per-claim and per-step scores attach to RETRIEVER, LLM, and AGENT spans so the retrieved chunks, the cited authorities, and the verdict live in the same trace tree (14 span kinds, 50+ AI surfaces).
Error Feed sits inside the eval stack. HDBSCAN over ClickHouse-stored span embeddings groups failing traces into named issues: “fake case cite in coverage Q&A,” “reasoning chain contradicts retrieved exclusion clause,” “factual fabrication in niche-historical-figure summary.” A Claude Sonnet 4.5 Judge agent on Bedrock reads each failing trace, writes the RCA, evidence quotes, and an immediate_fix. Representative traces promote into the eval set under engineer sign-off. Linear is wired today; Slack, GitHub, Jira, PagerDuty are on the roadmap. The Platform adds a feedback loop: production thumbs up/down retune the rubric grading prompt and few-shot examples so the rubric calibrates against the team’s actual definition of acceptable.
Defense in depth: per-failure-mode mitigation
The teams shipping reliable agents in 2026 don’t run one mitigation across all four failure modes. They stack the mitigation that fits each:
- Factual. Retrieval and tool-call upstream of generation. When the model can look up the fact, factual hallucination drops sharply. The RAG evaluation tools guide covers the retrieval-side patterns.
- Grounding. Grounding-discipline prompts plus claim-level Groundedness with augment=True plus citation enforcement at the output schema. The per-claim rate, not the answer-level mean, is the binding number.
- Citation. Structured citations in the output schema, runtime registry check (URL resolve, DOI lookup, chunk_id join), sampled semantic-entailment check on the (claim, cited_passage) pair.
- Reasoning. Per-step trace scoring plus a refusal threshold on low-confidence chains. The cost is one judge call per trace; the payoff is catching right-answer-wrong-chain at the long tail.
- Eval-gated CI before promotion. Run the four detectors against a versioned regression set on every PR. Fail if any hallucination rate rises more than the calibrated delta from baseline. The CI/CD eval guide covers the gate shape.
- Error Feed clustering in production. HDBSCAN over failure embeddings, Sonnet 4.5 Judge writes the fix, self-improving evaluators retune the thresholds. Catches the drift the regression set didn’t anticipate.
Skipping the mitigation for any one failure mode leaves a measurable gap. Running all four gets the system to a place where hallucination is a metric you watch per type, not a fire you fight in aggregate.
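The CI gate from the list above reduces to a per-mode comparison against a stored baseline. A minimal sketch, assuming hallucination rates where higher is worse; the baseline values, the candidate values, and the 0.02 allowed delta are invented for illustration.

```python
def regressed_modes(baseline, current, allowed_delta):
    """Return failure modes whose hallucination rate rose by more than
    the allowed delta relative to the baseline run."""
    return sorted(
        mode
        for mode in baseline
        if current[mode] - baseline[mode] > allowed_delta[mode]
    )


# Invented rates from a baseline run and a candidate PR run.
baseline = {"factual": 0.04, "grounding": 0.08, "citation": 0.05, "reasoning": 0.06}
current = {"factual": 0.01, "grounding": 0.12, "citation": 0.055, "reasoning": 0.06}
allowed_delta = {mode: 0.02 for mode in baseline}

failures = regressed_modes(baseline, current, allowed_delta)
gate_passed = not failures

# For contrast: the single averaged number barely moves.
aggregate_change = (
    sum(current.values()) / len(current) - sum(baseline.values()) / len(baseline)
)
```

In this example the factual rate improved while grounding regressed by four points. The averaged rate moves by well under one point, so a gate on the aggregate would pass the release that the per-mode gate blocks.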
Anti-patterns to avoid
The mistakes that show up in nearly every team’s first hallucination-eval attempt:
- Single hallucination rubric. One “is the answer correct” judge prompt averaged across every trace. Hides which of the four modes is breaking. Use four detectors.
- Answer-level Groundedness only. A 0.92 mean groundedness can hide a 20% per-claim hallucination rate. Decompose to claims.
- Structural citation check only. Misses the fake-case-that-parses class. Add resolvability and semantic.
- Final-answer accuracy for reasoning models. Right answer, wrong chain ships when only the answer is graded. Score the trace.
- No external fact-check for factual. Internal-context-only judges cannot catch factual hallucination by construction. Wire a knowledge base or tool-call into the rubric.
- Frozen rubric. The team’s definition of acceptable drifts; the rubric needs to drift with it via self-improving evaluators or quarterly review.
- No span-attached scoring. Without trace-attached scores, live failures never make it back into the dataset.
- Treating hallucination as a fundamental limit. Excuse not to ship the four-detector stack. It’s a measurable, addressable engineering problem.
How Future AGI ships the hallucination stack
Future AGI ships the four-detector stack as a package. Start with the SDK for code-defined evals across the four failure modes. Graduate to the Platform when you want self-improving rubrics and an in-product authoring agent.
- ai-evaluation SDK (Apache 2.0): templates for all four modes. Factual: FactualAccuracy (66) plus CustomLLMJudge with a fact-check tool. Grounding: Groundedness (47), ContextAdherence (5), ChunkAttribution (11), ChunkUtilization (12), Completeness (10). Citation: EvaluateFunctionCalling plus CustomLLMJudge for resolvability and semantic. Reasoning: CustomLLMJudge with per-step grading_criteria plus AnswerRefusal. Local NLI primitives in python/fi/evals/metrics/rag/utils/ for the cheap entailment path.
- Future AGI Platform: self-improving evaluators tuned by user feedback; in-product authoring agent writes reasoning and citation rubrics from natural language; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- Error Feed (inside the eval stack): HDBSCAN clustering groups failing traces into named issues per failure mode; Sonnet 4.5 Judge writes the immediate_fix; engineer-reviewed promotions feed the dataset.
- traceAI (Apache 2.0): per-claim and per-step scores attach as span attributes on RETRIEVER, LLM, and AGENT spans so the eval result lives in the trace tree.
The honest tradeoff: running all four detectors costs roughly 2-3x a single-rubric eval because of the per-step reasoning judge and the semantic citation check. For regulated workloads where audit failure rate is the binding constraint, it’s the right tradeoff. For consumer Q&A, reasoning and citation can stay sampled at 5-20% while grounding and factual run at 100%.
Closing: four bugs, four patches
The two failure patterns we see most often are opposite mistakes. The first is teams treating hallucination as one problem, shipping one rubric, iterating against an aggregate that hides which mode is regressing. The second is teams treating hallucination as inherent, shipping without measurement, hoping the model improves next quarter. Both miss the four-mode structure and the four-detector answer.
The teams shipping reliable agents in 2026 treat hallucination the same way they treat any other reliability problem: measured per failure mode, defended in depth, iterated weekly from production feedback, closed-loop through clustering. The model layer is one piece of the stack. The four detectors are where most of the reliability lives.
Ready to wire the four-detector stack into a CI gate? Bind Groundedness(augment=True), FactualAccuracy, ChunkAttribution, ChunkUtilization, EvaluateFunctionCalling, plus the reasoning and citation CustomLLMJudge rubrics into a pytest fixture against the ai-evaluation SDK, then add the traceAI instrumentor when production traces start asking questions the CI gate missed.