Evaluating RAG Faithfulness: A 2026 Deep Dive
Why answer-level Groundedness hides RAG hallucinations, and how claim-level decomposition, cherry-pick detection, and sycophantic-restatement scoring fix it. Methodology for senior ML engineers.
A senior engineer pings the channel: Groundedness is 0.92 on prod, so why is legal flagging hallucinations in three out of every ten audited responses? Both numbers are right. The eval averaged support across each response on a holistic 0-to-1 rubric. The audit decomposed each response into claims and asked whether each one was supported. Roughly three in ten answers had an unsupported claim; a quarter of those answers cherry-picked one chunk while ignoring the chunk that contradicted them. The eval said the response read as grounded. The audit asked whether each claim was.
This post is about the three failure modes answer-level Groundedness hides: cherry-picking from context, sycophantic restatement of the user’s premise, and the gap between claim-level grounded and sentence-level grounded. The opinion this guide earns: faithfulness scored at the answer level is a vibe check, not an eval. The eval that catches what matters is claim-level decomposition plus per-claim grounded scoring plus a contradiction scan over the chunks the model didn’t use.
Audience: senior ML engineers shipping RAG in production who already run Groundedness and want the next layer down. Code shaped against the ai-evaluation SDK, with primitives from python/fi/evals/metrics/rag/utils/.
TL;DR: the failure modes Groundedness misses
| Failure mode | What standard Groundedness sees | What it actually is |
|---|---|---|
| Multi-claim hallucination | 0.85 mean score, looks healthy | One in five claims unsupported per response |
| Cherry-picked context | Pass, every cited chunk supports its claim | Model ignored chunk 4 that contradicted the answer |
| Sycophantic restatement | Pass, answer paraphrases the question | Answer is the user’s premise, not the retrieval |
| Sentence vs claim mismatch | Per-sentence pass | Multi-claim sentences score 1.0 with one unsupported claim inside |
| Omitted qualifier | Pass, claim entails | “Covers MRI” passes against “covers MRI with prior auth” |
If you only build one extra layer: claim-level decomposition with per-claim grounded scoring. If you build two: add a contradiction scan over the chunks the model didn’t use. The sycophancy and qualifier-omission scorers earn their keep on the long tail.
Why answer-level Groundedness is a vibe check
Groundedness scored at the answer level returns one float per response. The judge reads the question, the answer, the context, and assigns a number for how grounded the whole thing feels. Two failure modes are baked into the metric by construction.
The first is averaging. A response with five claims where one is fabricated still scores around 0.8 — the judge smooths the one bad claim into the four good ones. Average that across 50,000 traces in a release and the dashboard reads 0.92. The per-claim hallucination rate underneath is 18 percent. Legal audits the per-claim rate; the dashboard shows the average.
The second is selection. The judge scores what’s in the answer against what’s in the context. It does not score what was left out. If the retrieval set has five chunks and the model used two, the judge checks the two it used. The three it skipped are invisible. When one of them contains the exclusion clause that contradicts the answer, the response is technically grounded and substantively wrong.
Both failures come from the same root: the answer is a single unit of evaluation, and the unit hides resolution. The fix is structural. Decompose the answer into claims. Score each claim. Scan the unused chunks for contradictions. Report the per-claim hallucination rate, not the mean response score.
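A toy calculation makes the averaging effect concrete. The verdicts below are invented, and the holistic score is approximated as the fraction of supported claims per response, which is an assumption about how a judge tends to behave, not a property of any specific metric.

```python
# Toy data (invented): each inner list holds per-claim verdicts for one response.
# True = supported by the retrieved context, False = unsupported.
responses = [
    [True, True, True, True, False],  # one fabricated claim out of five
    [True, True, True, True, True],
    [True, True, True, True, True],
    [True, True, True, False, True],
]

# Assumption: a holistic judge roughly tracks the fraction of supported claims.
answer_scores = [sum(r) / len(r) for r in responses]
mean_answer_score = sum(answer_scores) / len(answer_scores)

# The per-claim view: pool every claim across the release.
total_claims = sum(len(r) for r in responses)
unsupported_claims = sum(r.count(False) for r in responses)
per_claim_hallucination_rate = unsupported_claims / total_claims

# The audit view: how many responses contain at least one unsupported claim.
responses_with_unsupported = sum(1 for r in responses if not all(r))
response_failure_rate = responses_with_unsupported / len(responses)

print(f"mean answer-level score:      {mean_answer_score:.2f}")
print(f"per-claim hallucination rate: {per_claim_hallucination_rate:.2f}")
print(f"responses with a bad claim:   {response_failure_rate:.2f}")
```

With these made-up verdicts the mean sits at 0.90 while half of the responses carry at least one unsupported claim. The three numbers answer three different questions, and only the last two line up with what an audit counts.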
Claim-level decomposition: the methodology
Claim-level scoring has three steps and one knob.
Step 1: extract claims. Split the response into atomic, standalone, declarative propositions. Atomic means one fact per claim (“the plan covers MRI” and “MRI requires prior auth” are two claims). Standalone means verifiable without the rest of the answer. Declarative means a factual proposition, not a question or meta-comment.
A naive sentence split misses two patterns. Multi-claim sentences pack two or three propositions into one (“The policy covers outpatient MRI, requires prior auth, and excludes cosmetic procedures” is three claims). Spread claims run one proposition across two sentences. The extractor has to handle both. The SDK exposes this in python/fi/evals/metrics/rag/utils/claims.py:
```python
from fi.evals.metrics.rag.utils import extract_atomic_claims, check_claim_supported

response = (
    "The policy covers outpatient MRI with prior authorization. "
    "Cosmetic procedures are excluded. Coverage applies to members under 18."
)

atomic = extract_atomic_claims(response)
# ["The policy covers outpatient MRI.",
#  "Outpatient MRI requires prior authorization.",
#  "Cosmetic procedures are excluded from coverage.",
#  "Coverage applies to members under 18."]

context = [
    "Outpatient MRI is covered when prior authorization has been obtained.",
    "Cosmetic procedures are excluded from coverage.",
    "Eligibility extends to members aged 16 and above.",
]

# Score each atomic claim against the retrieved context, one row per claim
for claim in atomic:
    supported, score, src = check_claim_supported(claim, context, threshold=0.5)
    print(f"{claim} -> supported={supported} score={score:.2f}")
```
Step 2: score each claim against the context. check_claim_supported returns (supported, score, best_supporting_context) per claim. NLI for the cheap path; LLM judge for the borderline. The output is a list, one row per claim, not one float per response.
Step 3: report per-claim, not per-response. The release-level metric is the per-claim hallucination rate — fraction of claims with no supporting chunk — not the mean response score. That’s the number that matches what audit finds.
The knob: claim granularity. Coarser claims pass more easily; finer claims catch more drift at the cost of false positives on common-knowledge fragments. Regulated workloads run extract_atomic_claims; consumer Q&A stays at the standard split.
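Steps 2 and 3 can be sketched as a small aggregation layer. This is a minimal sketch, not the SDK's implementation: ClaimVerdict, score_claims, per_claim_hallucination_rate, and exact_match_support are hypothetical names, and the stand-in scorer only mimics the (supported, score, best_supporting_context) tuple shape described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical alias: any scorer returning (supported, score, best_chunk).
SupportFn = Callable[[str, Sequence[str]], Tuple[bool, float, Optional[str]]]


@dataclass
class ClaimVerdict:
    response_id: str
    claim: str
    supported: bool
    score: float
    source: Optional[str]


def score_claims(response_id: str, claims: Sequence[str],
                 context: Sequence[str], support_fn: SupportFn) -> List[ClaimVerdict]:
    """Return one ClaimVerdict per claim instead of one float per response."""
    verdicts = []
    for claim in claims:
        supported, score, source = support_fn(claim, context)
        verdicts.append(ClaimVerdict(response_id, claim, supported, score, source))
    return verdicts


def per_claim_hallucination_rate(verdicts: Sequence[ClaimVerdict]) -> float:
    """Fraction of claims with no supporting chunk, pooled across responses."""
    if not verdicts:
        return 0.0
    return sum(1 for v in verdicts if not v.supported) / len(verdicts)


def exact_match_support(claim: str, context: Sequence[str]):
    """Demo-only stand-in: a claim counts as supported if a chunk repeats it."""
    for chunk in context:
        if claim.lower() in chunk.lower():
            return True, 1.0, chunk
    return False, 0.0, None


context = ["Cosmetic procedures are excluded from coverage."]
claims = [
    "Cosmetic procedures are excluded from coverage.",
    "Coverage applies to members under 18.",
]
verdicts = score_claims("resp-1", claims, context, exact_match_support)
rate = per_claim_hallucination_rate(verdicts)
print(rate)
```

Keeping the verdict rows around, rather than only the rate, is what lets a failing claim be traced back to its response and its missing source.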
Sentence-level grounded vs claim-level grounded
The cheapest version of “claim-level” eval is sentence-level: split on sentence boundaries, score each sentence. It’s better than answer-level. It’s not the same as claim-level.
The multi-claim sentence. “The policy covers outpatient MRI, requires prior auth, and excludes cosmetic procedures” is one sentence and three claims. Scored as one sentence, the judge returns a single float — typically high, because two of three claims are supported. The unsupported portion smooths into the sentence score. Scored as three claims, the unsupported one drops to zero independently. Dashboard moves from “sentence pass 95%” to “claim pass 78%”. The 78 is the audit number.
The spread claim. “The plan covers MRI. Prior authorization is required” reads as two sentences but one composite claim with a qualifier. Sentence-level scoring marks both as related-but-not-supported when the context says “covers MRI without restriction.” Claim-level scoring extracts “the plan covers MRI with prior authorization” as one composite and runs it against the contradicting chunk. The fail is sharper and pins to the right material.
Sentence-level is a syntactic cheat for claim-level. It works when claims and sentences line up. It misses when they don’t. The cost of doing it right is one extraction call; the payoff is a hallucination rate that matches what your audit team finds.
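The gap between the two pass rates shows up even on two sentences. The labels below are hand-assigned and invented, and the sentence-level judge is modeled as "mean claim support, then threshold", which is an assumption used only to illustrate the smoothing.

```python
# Hand-labelled toy data: each sentence maps to its atomic claims and whether
# the context supports each one. All labels are invented for illustration.
sentences = {
    "The policy covers outpatient MRI, requires prior auth, "
    "and excludes cosmetic procedures.": [
        ("The policy covers outpatient MRI.", True),
        ("Outpatient MRI requires prior auth.", False),
        ("Cosmetic procedures are excluded.", True),
    ],
    "Claims must be filed within 90 days.": [
        ("Claims must be filed within 90 days.", True),
    ],
}

SENTENCE_PASS_THRESHOLD = 0.5  # assumed cut-off for a smoothed sentence score


def sentence_pass_rate(data):
    """Assumed sentence-level judge: mean support of packed claims, thresholded."""
    passed = 0
    for claim_labels in data.values():
        mean_support = sum(ok for _, ok in claim_labels) / len(claim_labels)
        if mean_support >= SENTENCE_PASS_THRESHOLD:
            passed += 1
    return passed / len(data)


def claim_pass_rate(data):
    """Claim-level view: every atomic claim is scored on its own."""
    labels = [ok for claim_labels in data.values() for _, ok in claim_labels]
    return sum(labels) / len(labels)


print(f"sentence pass rate: {sentence_pass_rate(sentences):.2f}")
print(f"claim pass rate:    {claim_pass_rate(sentences):.2f}")
```

Both sentences pass at the sentence level, yet one claim in four is unsupported. The unsupported claim only becomes visible once it is scored as its own row.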
Cherry-picking detection: the chunks the model didn’t use
The harder failure mode is the one Groundedness never sees: the model picked one chunk that confirmed the user’s framing and ignored the chunk in the same retrieval set that contradicted it. This is the lawsuit failure in legal RAG, the harm failure in medical RAG, the regulatory failure in finance RAG.
The detection has two parts.
Identify the unused chunks. ChunkUtilization flags low context use overall; for the per-chunk view, attribute each claim to its supporting chunk via check_attribution, then take the complement of attributed chunks.
Scan the unused chunks for contradictions or qualifiers the answer omits. For each unused chunk, run a contradiction check against the answer’s claims. The check_contradiction primitive in python/fi/evals/metrics/rag/utils/nli.py returns the entailment direction; flagged contradictions become the cherry-pick alarm.
```python
from fi.evals.metrics.rag.utils import (
    extract_atomic_claims, check_claim_supported, check_contradiction,
)

answer = "..."  # the model response under evaluation
retrieved_chunks = [{"id": "c1", "text": "..."}, {"id": "c2", "text": "..."}]
claims = extract_atomic_claims(answer)

# Mark every chunk that supports at least one claim
attributed_ids = set()
for claim in claims:
    for chunk in retrieved_chunks:
        supported, score, _ = check_claim_supported(claim, [chunk["text"]])
        if supported:
            attributed_ids.add(chunk["id"])

# The unused set is the complement of the attributed chunks
unused = [c for c in retrieved_chunks if c["id"] not in attributed_ids]

# Scan unused chunks for contradictions against the answer's claims
cherry_pick_flags = []
for claim in claims:
    for chunk in unused:
        is_contradiction, score = check_contradiction(claim, chunk["text"])
        if is_contradiction and score > 0.6:
            cherry_pick_flags.append({"claim": claim, "ignored_chunk": chunk["id"]})
```
The signature: Groundedness passes, ChunkAttribution passes, ChunkUtilization drops, the contradiction scan over the unused set fires. That combination is cherry-picking. Reporting it as a separate rubric is what moves the audit failure rate down.
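The four-signal signature can be expressed as a single predicate over a trace. This is a sketch under assumptions: TraceSignals and is_cherry_pick are hypothetical names, and "ChunkUtilization drops" is modeled as falling below a fixed threshold, which would need tuning per route.

```python
from dataclasses import dataclass


@dataclass
class TraceSignals:
    groundedness_pass: bool
    attribution_pass: bool
    chunk_utilization: float   # fraction of retrieved chunks the answer used
    contradiction_flags: int   # hits from the scan over unused chunks


# Assumed threshold for "utilization dropped"; tune per route.
LOW_UTILIZATION = 0.5


def is_cherry_pick(s: TraceSignals) -> bool:
    """All four conditions of the signature must hold together."""
    return (
        s.groundedness_pass
        and s.attribution_pass
        and s.chunk_utilization < LOW_UTILIZATION
        and s.contradiction_flags > 0
    )


# Grounded on the chunks it used, but an ignored chunk contradicts it.
suspect = TraceSignals(True, True, chunk_utilization=0.4, contradiction_flags=1)
# Low utilization alone is not cherry-picking: nothing in the unused set disagrees.
benign = TraceSignals(True, True, chunk_utilization=0.4, contradiction_flags=0)

print(is_cherry_pick(suspect), is_cherry_pick(benign))
```

Requiring all four signals keeps the alarm from firing on answers that simply needed fewer chunks.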
For high-stakes domains, layer a domain-specific CustomLLMJudge that names the omission pattern explicitly:
```python
from fi.evals.templates import CustomLLMJudge

cherry_pick_judge = CustomLLMJudge(
    name="cherry_pick_omission_check",
    grading_criteria=(
        "Score 0 if the answer makes a claim about coverage, eligibility, or "
        "exclusion that is contradicted or qualified by any chunk in the "
        "retrieval set the answer does not cite. Score 1 only if the answer "
        "incorporates every relevant qualifier or explicitly notes their "
        "absence. Pay attention to exclusion clauses, eligibility cutoffs, "
        "temporal qualifiers, prior-auth requirements, and quantifier scopes "
        "(all, some, only, except, unless)."
    ),
    model="gpt-5",
)
```
Sycophantic restatement: when the answer is the user’s premise
The other failure mode standard Groundedness misses is sycophantic restatement. The user asks “doesn’t the policy cover outpatient MRI?” and the model returns “yes, the policy covers outpatient MRI” without the retrieval actually supporting it — or with the retrieval saying the opposite. The answer feels grounded because it parses cleanly and the judge finds tangential context that mentions MRI.
The detection is two checks in series.
Premise similarity. Score semantic similarity between the user’s question and the model’s answer. High similarity with short, syntax-tracking answers is the sycophancy signature. “Yes, X covers Y” against “doesn’t X cover Y” scores near 1.0. A grounded answer carrying the qualifier — “X covers Y when prior auth is obtained” — scores lower.
Independent support. When premise similarity is high, re-run the answer’s claims against the context without the question in the entailment prompt. If per-claim support drops sharply, the model was grounding in the question, not the chunks. A drop of more than 0.3 in per-claim support is the flagging threshold.
```python
from fi.evals.templates import CustomLLMJudge

sycophancy_judge = CustomLLMJudge(
    name="sycophantic_restatement_check",
    grading_criteria=(
        "Score 0 if the answer is a near-verbatim restatement of the user's "
        "premise without qualifiers from the retrieval context. Score 0 if "
        "the answer's primary claims have no independent support in the "
        "chunks but high syntactic overlap with the question. Score 1 if the "
        "answer introduces facts, qualifiers, or distinctions from the "
        "retrieved context that the question did not contain."
    ),
    model="gpt-5",
)
```
Claim-level scoring catches sycophancy because the extracted claims don’t have independent support, even when the answer parses as related to the question. Combined with the premise-similarity check, the rubric flags the class of failure where the model is agreeing with the user instead of reading the documents.
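The two checks in series can be sketched without a model in the loop. Everything here is illustrative: premise_similarity uses Jaccard word overlap as a crude stand-in for an embedding similarity, the 0.6 similarity cut-off is assumed, and the two support scores are expected to come from whatever per-claim scorer is already in place.

```python
import re

SIMILARITY_THRESHOLD = 0.6     # assumed cut-off for "high" premise similarity
SUPPORT_DELTA_THRESHOLD = 0.3  # the support drop named in the text


def _tokens(text):
    """Lowercased word set; punctuation is discarded."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def premise_similarity(question, answer):
    """Jaccard overlap of word sets, standing in for semantic similarity."""
    q, a = _tokens(question), _tokens(answer)
    if not q or not a:
        return 0.0
    return len(q & a) / len(q | a)


def is_sycophantic(question, answer, support_with_question, support_without_question):
    """Flag when the answer mirrors the premise AND per-claim support
    collapses once the question is removed from the entailment prompt."""
    similar = premise_similarity(question, answer) >= SIMILARITY_THRESHOLD
    delta = support_with_question - support_without_question
    return similar and delta > SUPPORT_DELTA_THRESHOLD


question = "The policy covers outpatient MRI, right?"
restated = "Yes, the policy covers outpatient MRI."
qualified = "Outpatient MRI is covered only when prior authorization has been obtained."

print(is_sycophantic(question, restated, 0.9, 0.2))
print(is_sycophantic(question, qualified, 0.9, 0.2))
```

Running the similarity check first keeps the second, more expensive support pass off the answers that already diverge from the question's wording.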
The full faithfulness stack
The production pattern is a five-layer cascade. Each layer adds resolution and cost; pick the cut against domain risk.
- Deterministic citation verifier. Structured citations (chunk_id, span) verified against the retrieval set. Catches the cheapest class of hallucination — model citing a chunk that wasn’t retrieved, or a span that doesn’t exist. Runs in milliseconds. 100% of production.
- NLI-backed Groundedness on every claim. Claim extraction plus per-claim entailment via DeBERTa-NLI. Cheap, deterministic, scales. Default first pass on every span.
- LLM-judge fallback on borderline claims. Groundedness(augment=True) flips this on. The classifier handles the bulk; the judge handles paraphrase, multi-hop, and qualifier-heavy claims. Per-evaluation cost stays below Galileo Luna-2 because most claims clear the classifier and never reach the judge.
- Cherry-pick contradiction scan. ChunkUtilization plus the contradiction check over unused chunks. Sampled at 5-20% of traffic in low-stakes, 100% in compliance-sensitive.
- Sycophantic-restatement check. Premise similarity plus independent-support delta. Sampled where the question shape suggests confirmation seeking.
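The first layer is simple enough to sketch in full. This is a minimal sketch, assuming citations carry a chunk id plus character offsets into the chunk text; the Citation type, verify_citations, and the reason strings are hypothetical, not part of the SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    chunk_id: str
    start: int  # character offset into the chunk text, inclusive
    end: int    # character offset into the chunk text, exclusive


def verify_citations(citations, retrieved):
    """Return (citation, reason) for every citation that fails verification.

    `retrieved` maps chunk_id -> chunk text for this trace's retrieval set.
    """
    failures = []
    for c in citations:
        text = retrieved.get(c.chunk_id)
        if text is None:
            # The model cited a chunk that was never retrieved.
            failures.append((c, "chunk_not_retrieved"))
        elif not (0 <= c.start < c.end <= len(text)):
            # The cited span does not exist inside the chunk.
            failures.append((c, "span_out_of_range"))
    return failures


retrieved = {"c1": "Outpatient MRI is covered when prior authorization has been obtained."}
citations = [
    Citation("c1", 0, 14),    # valid span
    Citation("c9", 0, 5),     # chunk id not in the retrieval set
    Citation("c1", 10, 400),  # span runs past the end of the chunk
]
failures = verify_citations(citations, retrieved)
print([reason for _, reason in failures])
```

The check involves only dictionary lookups and bounds comparisons, so it can run on every response without a model call.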
The recurring call composes the templates that map to each layer:
```python
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextAdherence, FactualAccuracy,
    ChunkAttribution, ChunkUtilization,
)
from fi.testcases import TestCase

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

test_case = TestCase(
    input="Does the policy cover outpatient MRI?",
    output="Yes, the policy covers outpatient MRI with prior authorization.",
    context=[
        "Outpatient MRI is covered when prior authorization has been obtained.",
        "Cosmetic procedures are excluded from coverage.",
    ],
)

result = evaluator.evaluate(
    eval_templates=[
        Groundedness(augment=True),  # eval_id 47, cascade on
        ContextAdherence(),          # eval_id 5
        FactualAccuracy(),           # eval_id 66, atomic-claim path
        ChunkAttribution(),          # eval_id 11
        ChunkUtilization(),          # eval_id 12
    ],
    inputs=[test_case],
)

for metric in result[0].metrics:
    print(metric.name, metric.value, metric.reason)
```
Same call shape runs in CI on a versioned dataset and in production as span-attached scores via traceAI. The cherry-pick and sycophancy rubrics layer on as CustomLLMJudge templates.
Anti-patterns to avoid
The mistakes that show up in faithfulness eval reviews:
- Reporting one answer-level Groundedness number. Dashboard reads 0.92, audit finds 18% unsupported claims. Decompose.
- Splitting on sentences and calling it claim-level. Multi-claim sentences and spread claims diverge from sentence boundaries. Use extract_atomic_claims.
- No scan over unused chunks. ChunkUtilization without a contradiction check on the unused set misses cherry-picking.
- No sycophancy check on confirmation-seeking questions. “Doesn’t X” and “isn’t Y” prime the model to restate. Sample these routes.
- LLM judge on every claim. Cost spirals; drift breaks reproducibility. Classifier first via augment=True, judge on borderline only.
- Frozen rubric. New failures land weekly. Promote failing production traces into the eval set or the rubric stops catching the bugs your users file.
- No span-attached scores. Without trace integration, the eval lives on a dashboard nobody opens. Attach to RETRIEVER spans.
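The classifier-first routing that avoids the judge-on-every-claim anti-pattern looks roughly like this. It is a sketch of the general pattern, not of what augment=True does internally: cascade_verdict, the classifier and judge callables, the 0.35-0.65 borderline band, and the 0.5 judge cut-off are all assumptions.

```python
def cascade_verdict(claim, context, classifier, judge, low=0.35, high=0.65):
    """Classifier first; call the judge only when the score is borderline.

    `classifier` and `judge` are hypothetical callables that return a support
    score in [0, 1]. The band edges `low` and `high` are assumed defaults.
    """
    score = classifier(claim, context)
    if score >= high:
        return {"supported": True, "score": score, "path": "classifier"}
    if score <= low:
        return {"supported": False, "score": score, "path": "classifier"}
    # Borderline: pay for the judge on this claim only.
    judge_score = judge(claim, context)
    return {"supported": judge_score >= 0.5, "score": judge_score, "path": "judge"}


judge_calls = []


def fake_judge(claim, context):
    """Demo stand-in that records how often the expensive path is taken."""
    judge_calls.append(claim)
    return 0.9


clear = cascade_verdict("claim A", [], lambda c, ctx: 0.92, fake_judge)
borderline = cascade_verdict("claim B", [], lambda c, ctx: 0.50, fake_judge)
print(clear["path"], borderline["path"], len(judge_calls))
```

Recording the path alongside the verdict makes the judge's share of traffic a number that can be watched, which is where cost drift tends to show up first.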
Production observability and Error Feed clustering
The same rubric set runs as span-attached scorers against live traces via traceAI (Apache 2.0; 50+ AI surfaces across Python, TypeScript, Java, C#; 14 span kinds including first-class RETRIEVER, RERANKER, EMBEDDING). Sample 5-10% of traffic for the LLM-judge layer; run the classifier path and citation verifier on 100% because they’re cheap.
The signal that matters is the rolling delta of per-claim hallucination rate per route, not the absolute Groundedness mean. A sustained rise of 2-5 points over 30-90 minutes is the cue that a new model version shipped, a chunker change landed, or the retrieval index rotated.
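A rolling-delta alarm of this kind fits in a few lines. The sketch below assumes one rate sample per fixed window (for example, ten minutes), a fixed baseline per route, and "sustained" meaning every one of the last few windows exceeds the baseline by the minimum rise; RollingRateAlarm and its parameters are hypothetical.

```python
from collections import deque


class RollingRateAlarm:
    """Flag a sustained rise in per-claim hallucination rate for one route."""

    def __init__(self, baseline_pct, min_rise=2.0, sustain=3):
        self.baseline_pct = baseline_pct      # expected rate, in percent
        self.min_rise = min_rise              # rise in percentage points
        self.recent = deque(maxlen=sustain)   # last `sustain` window rates

    def observe(self, unsupported_claims, total_claims):
        """Record one window; return True if the alarm should fire."""
        if total_claims == 0:
            return False  # no traffic in this window, keep prior state
        rate_pct = 100.0 * unsupported_claims / total_claims
        self.recent.append(rate_pct)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(
            r - self.baseline_pct >= self.min_rise for r in self.recent
        )


alarm = RollingRateAlarm(baseline_pct=10.0)
# Three consecutive windows at 13-14% against a 10% baseline, then recovery.
fired = [alarm.observe(u, 100) for u in (13, 13, 14, 10)]
print(fired)
```

With ten-minute windows and sustain=3 the alarm needs thirty minutes of elevated rate before it fires, which filters out single-window spikes.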
Error Feed sits inside the eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups failing traces into named issues: “cherry-picked exclusion clause in coverage Q&A,” “sycophantic restatement of refund-policy question,” “qualifier stripped from temporal coverage clause.” A Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, eight span-tools including read_span, get_children, submit_finding) reads each failing trace, writes the RCA, evidence quotes, an immediate_fix, and a four-dim score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each).
Named issues are the unit of triage; immediate_fix is the unit of work. Representative traces from each cluster promote into the eval set under engineer or domain-expert sign-off. The dataset gets sharper every week; the per-claim hallucination rate drops because the rubric now catches what it missed last quarter. Linear is wired today; Slack, GitHub, Jira, PagerDuty are on the roadmap.
How Future AGI ships the faithfulness stack
Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.
- ai-evaluation SDK (Apache 2.0): from fi.evals import Evaluator. Canonical RAG templates as typed classes: Groundedness (47), ContextAdherence (5), Completeness (10), ChunkAttribution (11), ChunkUtilization (12), FactualAccuracy (66). Local NLI primitives in python/fi/evals/metrics/rag/utils/ (extract_claims, extract_atomic_claims, check_claim_supported, check_attribution, check_contradiction) that run on a DeBERTa model with no API call. CustomLLMJudge with grading_criteria for cherry-pick and sycophancy rubrics.
- Future AGI Platform: self-improving evaluators tuned by user feedback; in-product authoring agent generates custom cherry-pick and sycophancy rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2 for high-volume RAG scoring.
- Error Feed (inside the eval stack): HDBSCAN clustering groups failing traces into named issues; Sonnet 4.5 Judge writes the immediate_fix; engineer-reviewed promotions feed the dataset and the Platform’s self-improving evaluators.
- traceAI (Apache 2.0): per-claim scores attach as span attributes on RETRIEVER spans so the unused chunks live next to the verdict.
The honest tradeoff: cherry-pick and sycophancy detection cost roughly 1.5-2x standard Groundedness because of the contradiction scan and the second judge. For regulated workloads where audit failure rate is the binding constraint, it’s the right tradeoff. For consumer Q&A, the cherry-pick rubric can stay sampled.
Ready to wire the claim-level rubric into a CI gate? Bind Groundedness(augment=True), FactualAccuracy, ChunkAttribution, ChunkUtilization, plus the cherry-pick CustomLLMJudge into a pytest fixture against the ai-evaluation SDK, then add the traceAI instrumentor when production traces start asking questions the CI gate missed.