Guides

Detect Hallucinations in Generative AI: 6 Methods That Actually Catch Them in Production (2026 Guide)

Detect AI hallucinations in production in 2026: ChainPoll, NLI, SelfCheckGPT, RAG faithfulness, FAGI eval, and human review. Code, latency, and trade-offs.


TL;DR How to Detect Hallucinations in Generative AI

Decision | Recommendation
Best end-to-end stack | Future AGI fi.evals hallucination plus faithfulness via turing_flash or turing_large
Best for RAG | Faithfulness plus chunk attribution and context precision
Best when no source | SelfCheckGPT style multi-sample consistency or ChainPoll consensus
Cheapest deterministic | NLI entailment (DeBERTa-v3-MNLI or similar) on response vs source
Highest precision | ChainPoll style multi-sample LLM judge with turing_large
Real-time pattern | NLI inline plus async LLM-judge attached to span via traceAI
Required guardrail layer | Agent Command Center BYOK gateway at /platform/monitor/command-center

Why Hallucinations Are Still The #1 Trust Problem in 2026

Even after a year of newer frontier model launches, hallucinations remain the single biggest reliability problem for production LLM apps. The Stanford HAI 2025 AI Index documents continued hallucination failures across the strongest frontier models, particularly on long-tail factual queries, and notes that retrieval augmentation reduces but does not eliminate the failure mode.

The cost is real. In 2023 a New York federal judge sanctioned a law firm for filing a brief with fabricated case citations generated by ChatGPT, and similar incidents have recurred across jurisdictions in the years since. Multiple medical-coding pilots have stalled in production because clinicians cannot afford a single confidently wrong code in a patient chart. Customer support agents now ship with output-level redaction and fallback flows specifically to contain hallucination blast radius.

Peer-reviewed studies on GPT-4 in clinical settings consistently find non-trivial error rates on complex diagnostic prompts (see for example Nori et al., 2023 on the limits of medical-question accuracy). Newer frontier models cut error rates but never to zero. Detection in production is not optional.

Intrinsic vs Extrinsic Hallucinations: Why You Need Both

In 2026 the field has converged on splitting the failure mode in two:

  • Intrinsic hallucination. The model contradicts the input, the retrieved context, or the system prompt. Example: a RAG bot summarizes a clause the source document explicitly negates.
  • Extrinsic hallucination. The model invents claims that cannot be verified against the source at all. Example: a summarization model adds a number that does not appear anywhere in the input.

The mitigations differ:

Type | Best detection | Best mitigation
Intrinsic | Faithfulness, NLI entailment vs source | Tighter retrieval, refuse-if-not-grounded
Extrinsic | Multi-sample consistency, external fact lookup | Retrieval augmentation, citations

Both should be measured in parallel: a model that scores well on faithfulness but still invents unverifiable claims (high extrinsic hallucination) is not safe.
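
A minimal sketch of running both checks side by side with fi.evals, assuming the hallucination and faithfulness templates covered in Method 1 below:

from fi.evals import evaluate

# Extrinsic check: free-form claims scored without a source
extrinsic = evaluate(
    eval_templates="hallucination",
    inputs={"input": user_query, "output": model_response},
    model_name="turing_flash",
)

# Intrinsic check: the same output scored against the retrieved context
intrinsic = evaluate(
    eval_templates="faithfulness",
    inputs={"input": user_query, "output": model_response, "context": retrieved_chunks},
    model_name="turing_flash",
)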

Method 1: FAGI fi.evals Hallucination and Faithfulness

The fastest production-ready path. Future AGI’s ai-evaluation SDK ships a named-template catalog of hallucination evaluators that wrap consensus LLM-judge logic plus deterministic checks.

import os
from fi.evals import evaluate

os.environ["FI_API_KEY"] = "your_fi_api_key"
os.environ["FI_SECRET_KEY"] = "your_fi_secret_key"

# Free-form hallucination scoring
result = evaluate(
    eval_templates="hallucination",
    inputs={
        "input": "When was the moon landing?",
        "output": "Apollo 11 landed on the moon in July 1969.",
    },
    model_name="turing_flash",
)
print(result.eval_results[0].metrics[0].value)  # 0 to 1

For RAG, switch to faithfulness and pass the retrieved context:

result = evaluate(
    eval_templates="faithfulness",
    inputs={
        "input": user_query,
        "output": model_response,
        "context": retrieved_chunks,
    },
    model_name="turing_flash",
)

For chunk-level diagnosis, run context_precision and chunk_attribution:

attribution = evaluate(
    eval_templates="chunk_attribution",
    inputs={
        "input": user_query,
        "output": model_response,
        "context": retrieved_chunks,
    },
    model_name="turing_flash",
)

Latency profile on Future AGI cloud:

Judge | Latency
turing_flash | about 1 to 2 seconds
turing_small | about 2 to 3 seconds
turing_large | about 3 to 5 seconds

Pick turing_flash for online scoring and turing_large for nightly regression suites and high-stakes flows. The SDK is Apache 2.0 (verified at github.com/future-agi/ai-evaluation).

Method 2: ChainPoll Style Multi-Sample LLM Judge

ChainPoll, introduced in the 2023 Galileo paper, asks an LLM judge to evaluate the same claim multiple times under varied conditions (different chain-of-thought rationales, different orderings) and aggregates the votes. The intuition: a hallucinated claim splits the judge’s votes; a correct claim does not.

You can roll your own via CustomLLMJudge. The pattern: define a judge, call evaluate multiple times, extract the numeric score from each result, then average. Replace the model placeholder with your provider’s current model identifier.

from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

judge = CustomLLMJudge(
    name="factuality_chainpoll",
    grading_criteria=(
        "Score 0 to 1. Is this claim supported by widely accepted facts? "
        "Return 0 if any part is fabricated, 1 if fully accurate."
    ),
    model=LiteLLMProvider(model="gpt-5-2025-08-07"),
)

question = "What year did Apollo 11 land on the moon?"
answer = "Apollo 11 landed on the moon in 1969."

raw_scores = []
for _ in range(5):
    result = judge.evaluate(input=question, output=answer)
    # The evaluator returns a structured result; extract the numeric metric value.
    raw_scores.append(result.eval_results[0].metrics[0].value)

final = sum(raw_scores) / len(raw_scores)

Strengths: high precision on factuality. Weaknesses: 5x judge cost, 5x latency unless you parallelize. Reserve for high-stakes flows or nightly batch.
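
To claw back some of that latency, fan the judge calls out instead of looping; a minimal sketch with a thread pool, assuming the judge object from the snippet above can be called concurrently:

from concurrent.futures import ThreadPoolExecutor

def score_once(_):
    result = judge.evaluate(input=question, output=answer)
    return result.eval_results[0].metrics[0].value

# Five judge calls in parallel instead of sequentially
with ThreadPoolExecutor(max_workers=5) as pool:
    raw_scores = list(pool.map(score_once, range(5)))

final = sum(raw_scores) / len(raw_scores)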

Method 3: NLI-Based Contradiction Detection

Cheap, deterministic, and well-suited for inline gating. Use a fine-tuned NLI model like MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli and score the entailment between the response and the source.

from transformers import pipeline

nli = pipeline(
    "text-classification",
    model="MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli",
)
result = nli(f"{source_text} </s></s> {response}")
# label in {"entailment", "neutral", "contradiction"}

If contradiction probability exceeds a threshold (often 0.5), flag for review or trigger a heavier LLM-judge check. NLI runs in tens of milliseconds on a single GPU and a few hundred milliseconds on CPU. The trade-off: NLI struggles with long-form summarization where claims are decomposed across sentences. Pair NLI with sentence-level decomposition for better recall.
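
A minimal sketch of that sentence-level decomposition, reusing the nli pipeline above with a naive regex splitter (swap in a real sentence tokenizer for production):

import re

def sentence_level_contradictions(source_text, response, threshold=0.5):
    # Naive split on sentence-ending punctuation; nltk or spaCy handle edge cases better
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    flagged = []
    for sentence in sentences:
        scores = nli({"text": source_text, "text_pair": sentence}, top_k=None)
        contradiction = next(s["score"] for s in scores if s["label"] == "contradiction")
        if contradiction > threshold:
            flagged.append((sentence, contradiction))
    return flagged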

Method 4: SelfCheckGPT Style Consistency Probing

When you have no external source to compare against, sample the model multiple times and measure inter-sample agreement. The 2023 SelfCheckGPT paper showed that inconsistency across stochastic samples correlates strongly with hallucination.

from openai import OpenAI

client = OpenAI()

def selfcheck(question, n=5):
    samples = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-5-2025-08-07",
            messages=[{"role": "user", "content": question}],
            temperature=1.0,
        )
        samples.append(resp.choices[0].message.content)
    return samples

Score agreement with an NLI model or with an LLM-judge consensus prompt. The NLI-based SelfCheckGPT variant combines stochastic sampling with entailment checks and typically improves precision over surface-level comparisons.
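
A minimal sketch of the NLI route, assuming the nli pipeline from Method 3 and the samples returned by selfcheck above: the main answer is checked against each sample and the contradiction scores are averaged.

def selfcheck_score(main_answer, samples):
    # High average contradiction across samples is a hallucination signal
    contradiction_scores = []
    for sample in samples:
        scores = nli({"text": sample, "text_pair": main_answer}, top_k=None)
        contradiction = next(s["score"] for s in scores if s["label"] == "contradiction")
        contradiction_scores.append(contradiction)
    return sum(contradiction_scores) / len(contradiction_scores)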

Use case: free-form generation without retrieval. Limitations: n times inference cost, struggles when the model is consistently wrong.

Method 5: RAG-Specific Faithfulness, Chunk Attribution, Context Precision

If your system is RAG, these three metrics are the workhorses. Every claim in the output should map to a retrieved chunk. Every retrieved chunk should be relevant. Every relevant chunk should actually be used.

Metric | What it answers
Faithfulness | Does the output stay grounded in the retrieved context?
Chunk attribution | Which chunks contributed to the output?
Context precision | Of the retrieved chunks, how many were actually relevant?
Context recall | Did retrieval surface all the chunks needed to answer?

Faithfulness, chunk attribution, and context precision ship as named templates in fi.evals; for context recall you can use your retrieval logs or check the current Future AGI docs for the latest evaluator catalog. The flow:

from fi.evals import evaluate

rag_inputs = {
    "input": question,
    "output": answer,
    "context": chunks,
}

faithfulness = evaluate(eval_templates="faithfulness", inputs=rag_inputs, model_name="turing_flash")
attribution = evaluate(eval_templates="chunk_attribution", inputs=rag_inputs, model_name="turing_flash")
precision = evaluate(eval_templates="context_precision", inputs=rag_inputs, model_name="turing_flash")
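
Context recall itself is a one-liner over your retrieval logs; a minimal sketch, assuming you log the retrieved chunk IDs per query and keep a hand-labeled set of the chunk IDs actually needed to answer (both names are illustrative):

def context_recall(retrieved_chunk_ids, required_chunk_ids):
    # Of the chunks needed to answer, how many did retrieval actually surface?
    required = set(required_chunk_ids)
    if not required:
        return 1.0
    return len(required & set(retrieved_chunk_ids)) / len(required)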

See RAG evaluation metrics for the full RAG eval playbook.

Method 6: Human-in-the-Loop With Active Learning

The highest-precision method, the lowest-scale. Sample low-confidence outputs (NLI contradiction, low judge score, high refusal rate) into a review queue. Subject-matter experts label them. Labels feed back into:

  1. Prompt updates. Common failure patterns become explicit instructions.
  2. Judge calibration. Disagreement between human and judge surfaces judge bias.
  3. Eval datasets. Labeled failures become regression cases for fi.simulate.

from fi.simulate import TestRunner, AgentInput

runner = TestRunner(
    name="human_labeled_hallucinations",
    inputs=[AgentInput(messages=[{"role": "user", "content": q}]) for q in failures],
)
runner.run(agent=my_agent_callable)

Pair human review with active learning sampling: prioritize outputs the judge is least sure about, not random samples. This maximizes information per labeled example.
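
A minimal sketch of uncertainty-based sampling, assuming each scored span is a dict carrying an id and a judge score between 0 and 1 (field names are illustrative):

scored_spans = [
    {"span_id": "a", "score": 0.92},
    {"span_id": "b", "score": 0.48},
    {"span_id": "c", "score": 0.10},
]

def select_for_review(spans, k=50):
    # Scores near 0.5 are the ones the judge is least sure about
    return sorted(spans, key=lambda s: abs(s["score"] - 0.5))[:k]

review_queue = select_for_review(scored_spans)  # span "b" is reviewed first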

Putting It Together: A 2026 Production Architecture

User request
  -> Gateway (Agent Command Center at /platform/monitor/command-center)
       - PII guard, prompt-injection guard, rate limit
  -> LLM call (traceAI instrumented span)
  -> Inline NLI check on response vs retrieved context
       - if contradiction > 0.5: refuse or fallback
  -> Return response to user
  -> Async fi.evals hallucination + faithfulness on the span
       - low scores feed alerting + dataset
  -> Sample low-score spans into human review queue
  -> Labels feed prompt + judge + regression suite
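
The inline gate in that flow is only a few lines; a minimal sketch, assuming the nli pipeline from Method 3 and a hypothetical fallback_response helper for the refusal path:

CONTRADICTION_THRESHOLD = 0.5

def gate_response(response, retrieved_context):
    scores = nli({"text": retrieved_context, "text_pair": response}, top_k=None)
    contradiction = next(s["score"] for s in scores if s["label"] == "contradiction")
    if contradiction > CONTRADICTION_THRESHOLD:
        # Refuse or fall back rather than return a likely-hallucinated answer
        return fallback_response()
    return response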

Instrumentation is one block:

import os
from fi_instrumentation import register, FITracer
from fi_instrumentation.fi_types import ProjectType
from openinference.instrumentation.openai import OpenAIInstrumentor

os.environ["FI_API_KEY"] = "your_fi_api_key"
os.environ["FI_SECRET_KEY"] = "your_fi_secret_key"

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="prod-hallucination-detection",
)
tracer = FITracer(trace_provider.get_tracer(__name__))
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

Every span is now evaluable. For deeper real-time eval patterns see Real-time LLM evaluation setup.

When To Pick Which Method

Use case | Primary | Backup
RAG chatbot | Faithfulness plus chunk attribution | NLI inline
Free-form summarization | ChainPoll style judge | SelfCheckGPT
Long-tail Q and A | Hallucination eval (turing_large) | Human review
High-stakes medical or legal | ChainPoll plus human review | NLI inline gate
Real-time chatbot | NLI inline plus async fi.evals | turing_flash judge
Multimodal output | Multimodal fi.evals | Vision grounding model

Common Pitfalls

Treating one metric as ground truth. No single metric catches both intrinsic and extrinsic hallucinations. Use at least two.

Ignoring judge bias. Different judge models disagree systematically. Always pin a judge model version per evaluator and re-validate after every judge upgrade.

Skipping calibration. A judge score of 0.7 means nothing until you have calibrated it against human labels in your domain. Run 100 human labels per evaluator at deployment.
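
A minimal calibration sketch, assuming a batch of human labels (1 = hallucinated) and judge scores for the same outputs, and following the convention above that lower judge scores mean more likely hallucination:

def calibrate_threshold(judge_scores, human_labels):
    # Sweep candidate cut-offs and keep the one that best reproduces the human verdicts
    best_threshold, best_agreement = 0.5, 0.0
    for i in range(1, 20):
        threshold = i / 20
        predictions = [1 if score < threshold else 0 for score in judge_scores]
        agreement = sum(p == y for p, y in zip(predictions, human_labels)) / len(human_labels)
        if agreement > best_agreement:
            best_threshold, best_agreement = threshold, agreement
    return best_threshold, best_agreement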

Single-shot prompts. ChainPoll style consensus consistently beats single-shot. If precision matters, vote.

Forgetting retrieval upstream. If your retriever drops the relevant chunk, no detector downstream can save you. Measure context recall first.

No regression suite. Real-time eval finds today’s problems. fi.simulate regression suites prevent tomorrow’s. Run both.

Where Hallucination Detection Goes Next

  • Calibrated abstention. Models that emit confidence and refuse when low. Active research areas like selective prediction and self-evaluation heads are landing in production stacks.
  • Tool-grounded detection. Models that call a retrieval tool to fact-check their own draft before returning it. Already shipping in agent frameworks via planner-verifier patterns.
  • Multimodal detection at scale. Vision and audio hallucination detection is an active area of growth across eval platforms. Check the current Future AGI docs for available multimodal evaluator templates before designing around specific capabilities.
  • Regulator-ready logs. EU AI Act obligations and similar frameworks push toward documented risk-management evidence for high-risk systems. Hallucination metric histories from fi.evals are practical compliance artifacts to retain.

Get Started in 15 Minutes

pip install ai-evaluation traceai-openai
export FI_API_KEY=...
export FI_SECRET_KEY=...

from fi.evals import evaluate

result = evaluate(
    eval_templates="hallucination",
    inputs={
        "input": "What was the boiling point of water at sea level?",
        "output": "Water boils at 100 degrees Celsius at sea level.",
    },
    model_name="turing_flash",
)
print(result.eval_results[0].metrics[0].value)

For the dashboard view, log in at app.futureagi.com. For routing, gateway, guardrails, and BYOK key management visit the Agent Command Center at /platform/monitor/command-center. Docs live at docs.futureagi.com.

Book a 30-minute call to wire hallucination detection into your stack in a sandbox.

Frequently asked questions

What is an AI hallucination in 2026?
A hallucination is any model output that is plausible-sounding but unsupported by the source material, the retrieved context, or verifiable real-world facts. In 2026 the working definition splits into intrinsic hallucinations (the output contradicts the prompt or context) and extrinsic hallucinations (the output adds claims that cannot be verified at all). Both should be measured separately because the mitigations differ.
How does FAGI's hallucination eval actually work?
Future AGI ships hallucination, faithfulness, and chunk attribution evaluators in the fi.evals catalog. You call evaluate with eval_templates equal to hallucination, pass the input and output (and context for RAG), and pick a judge model like turing_flash for sub-2-second scoring or turing_large for higher precision. The SDK is Apache 2.0 and the catalog is documented at docs.futureagi.com.
Which method has the lowest false-positive rate?
For RAG, faithfulness combined with chunk attribution gives the best precision because every claim is forced to map to a source chunk. For free-form generation, ChainPoll style multi-sample LLM judges with turing_large currently lead, at the cost of higher latency and judge cost. NLI is cheapest but trades precision for speed.
Can I run hallucination detection in real time without slowing user responses?
Yes. Run cheap heuristic checks like NLI inline and run LLM-judge evals asynchronously on the resulting trace span. With Future AGI traceAI the eval attaches to the span out of band and the user response is never blocked. For high-stakes flows route requests through the Agent Command Center gateway at /platform/monitor/command-center, gate with lightweight inline checks, and attach turing_flash or larger judges asynchronously or where 1 to 5 seconds of latency is acceptable.
How accurate is SelfCheckGPT in 2026?
SelfCheckGPT and its successors work well when no external source is available, but they cost n times the inference budget because you sample multiple completions. NLI-based variants of SelfCheckGPT improve precision by combining stochastic sampling with entailment checks. Use it as a fallback when ground-truth context is missing, not as a primary detector for RAG.
How do I detect hallucinations in multimodal output?
Cross-modal hallucinations (a model describing an object that is not in the image, or transcribing audio incorrectly) require multimodal judges. For vision specifically, pair caption faithfulness with object-grounding checks against a detection model. For coverage in fi.evals, check the current docs at docs.futureagi.com for the up-to-date evaluator catalog before relying on any specific evaluator template.
What is ChainPoll and when should I use it?
ChainPoll asks an LLM judge to evaluate the same claim multiple times under different conditions then aggregates the votes. It typically outperforms single-shot LLM judges on factuality. Use it when precision matters more than cost or latency. You can implement a ChainPoll-style pattern with the Future AGI CustomLLMJudge plus repeated calls, and fi.evals also exposes a hallucination evaluator template you can call directly.
How do I get started with FAGI's hallucination detection?
Install ai-evaluation, set FI_API_KEY and FI_SECRET_KEY, then call evaluate with eval_templates equal to hallucination on every relevant span. For RAG add context to the inputs dict and switch to faithfulness. For continuous production scoring instrument with traceAI from fi_instrumentation so every span is automatically evaluable in the Future AGI dashboard.