Evaluating RAG Faithfulness: A 2026 Deep Dive

Why answer-level Groundedness hides RAG hallucinations, and how claim-level decomposition, cherry-pick detection, and sycophancy scoring fix it.

February 22, 2026

Updated May 20, 2026

11 min read

rag llm-evaluation faithfulness groundedness hallucination 2026

A senior engineer pings the channel: “Groundedness is 0.92 on prod, so why is legal flagging hallucinations in three out of every ten audited responses?” Both numbers are right. The eval averaged support across each response on a holistic 0-to-1 rubric. The audit decomposed each response into claims and asked whether each one was supported. Three in ten answers had an unsupported claim; a quarter of those answers cherry-picked one chunk while ignoring the chunk that contradicted them. The eval said the response read as grounded. The audit asked whether each claim was.

This post is about the three failure modes answer-level Groundedness hides: cherry-picking from context, sycophantic restatement of the user’s premise, and the gap between claim-level grounded and sentence-level grounded. The opinion this guide earns: faithfulness scored at the answer level is a vibe check, not an eval. The eval that catches what matters is claim-level decomposition plus per-claim grounded scoring plus a contradiction scan over the chunks the model didn’t use.

Audience: senior ML engineers shipping RAG in production who already run Groundedness and want the next layer down. Code shaped against the ai-evaluation SDK, with primitives from python/fi/evals/metrics/rag/utils/.

TL;DR: the failure modes Groundedness misses

Failure mode	What standard Groundedness sees	What it actually is
Multi-claim hallucination	0.85 mean score, looks healthy	One in five claims unsupported per response
Cherry-picked context	Pass, every cited chunk supports its claim	Model ignored chunk 4 that contradicted the answer
Sycophantic restatement	Pass, answer paraphrases the question	Answer is the user’s premise, not the retrieval
Sentence vs claim mismatch	Per-sentence pass	Multi-claim sentences score 1.0 with one unsupported claim inside
Omitted qualifier	Pass, claim entails	“Covers MRI” passes against “covers MRI with prior auth”

If you only build one extra layer: claim-level decomposition with per-claim grounded scoring. If you build two: add a contradiction scan over the chunks the model didn’t use. The sycophancy and qualifier-omission scorers earn their keep on the long tail.

Why answer-level Groundedness is a vibe check

Groundedness scored at the answer level returns one float per response. The judge reads the question, the answer, the context, and assigns a number for how grounded the whole thing feels. Two failure modes are baked into the metric by construction.

The first is averaging. A response with five claims where one is fabricated still scores around 0.8 — the judge smooths the one bad claim into the four good ones. Average that across 50,000 traces in a release and the dashboard reads 0.92. The per-claim hallucination rate underneath is 18 percent. Legal audits the per-claim rate; the dashboard shows the average.

The second is selection. The judge scores what’s in the answer against what’s in the context. It does not score what was left out. If the retrieval set has five chunks and the model used two, the judge checks the two it used. The three it skipped are invisible. When one of them contains the exclusion clause that contradicts the answer, the response is technically grounded and substantively wrong.

Both failures come from the same root: the answer is a single unit of evaluation, and the unit hides resolution. The fix is structural. Decompose the answer into claims. Score each claim. Scan the unused chunks for contradictions. Report the per-claim hallucination rate, not the mean response score.
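The averaging effect is easy to reproduce with arithmetic. A minimal sketch, assuming made-up per-claim verdicts and assuming an answer-level judge behaves roughly like the fraction of supported claims in a response (neither assumption comes from a real audit):

```python
# Hypothetical per-claim verdicts for four responses (True = supported).
responses = [
    [True, True, True, True, False],   # one fabricated claim out of five
    [True, True, True, True, True],
    [True, True, True, True, True],
    [True, True, True, False, True],
]

# Assumption: the answer-level score is approximated by the share of supported claims.
answer_scores = [sum(claims) / len(claims) for claims in responses]
mean_answer_score = sum(answer_scores) / len(answer_scores)

# Per-claim view: pool every claim in the release and count the unsupported ones.
all_claims = [flag for claims in responses for flag in claims]
claim_hallucination_rate = 1 - sum(all_claims) / len(all_claims)

# Per-response view: share of responses carrying at least one unsupported claim.
responses_with_unsupported = sum(not all(claims) for claims in responses) / len(responses)

print(f"mean answer-level score:    {mean_answer_score:.2f}")
print(f"per-claim hallucination:    {claim_hallucination_rate:.2f}")
print(f"responses with a bad claim: {responses_with_unsupported:.2f}")
```

On this toy data the mean answer-level score is 0.90 while half the responses carry an unsupported claim, which is how the dashboard and the audit can both be right.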

Claim-level decomposition: the methodology

Claim-level scoring has three steps and one knob.

Step 1: extract claims. Split the response into atomic, standalone, declarative propositions. Atomic means one fact per claim (“the plan covers MRI” and “MRI requires prior auth” are two claims). Standalone means verifiable without the rest of the answer. Declarative means a factual proposition, not a question or meta-comment.

A naive sentence split misses two patterns. Multi-claim sentences pack two or three propositions into one (“The policy covers outpatient MRI, requires prior auth, and excludes cosmetic procedures” is three claims). Spread claims run one proposition across two sentences. The extractor has to handle both. The SDK exposes this in python/fi/evals/metrics/rag/utils/claims.py:

from fi.evals.metrics.rag.utils import extract_atomic_claims, check_claim_supported

response = (
    "The policy covers outpatient MRI with prior authorization. "
    "Cosmetic procedures are excluded. Coverage applies to members under 18."
)
atomic = extract_atomic_claims(response)
# ["The policy covers outpatient MRI.",
#  "Outpatient MRI requires prior authorization.",
#  "Cosmetic procedures are excluded from coverage.",
#  "Coverage applies to members under 18."]

context = [
    "Outpatient MRI is covered when prior authorization has been obtained.",
    "Cosmetic procedures are excluded from coverage.",
    "Eligibility extends to members aged 16 and above.",
]
for claim in atomic:
    supported, score, src = check_claim_supported(claim, context, threshold=0.5)
    print(f"{claim} -> supported={supported} score={score:.2f}")

Step 2: score each claim against the context. check_claim_supported returns (supported, score, best_supporting_context) per claim. NLI for the cheap path; LLM judge for the borderline. The output is a list, one row per claim, not one float per response.

Step 3: report per-claim, not per-response. The release-level metric is the per-claim hallucination rate — fraction of claims with no supporting chunk — not the mean response score. That’s the number that matches what audit finds.

The knob: claim granularity. Coarser claims pass more easily; finer claims catch more drift at the cost of false positives on common-knowledge fragments. Regulated workloads run extract_atomic_claims; consumer Q&A stays at the standard split.
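To make the shape of the per-claim output concrete, here is a sketch that swaps the NLI model for a crude token-overlap score. `toy_support`, `tokens`, and `STOPWORDS` are hypothetical stand-ins, not the SDK's `check_claim_supported`; only the (supported, score, best context) row shape mirrors the steps above.

```python
import re

# Small illustrative stopword list; a real scorer would not rely on this.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "from", "when", "has", "been", "and", "of"}

def tokens(text: str) -> set[str]:
    """Lowercased content words. A crude proxy, not real entailment."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def toy_support(claim: str, context: list[str], threshold: float = 0.5):
    """Return (supported, score, best_context) using token overlap as a stand-in for NLI."""
    claim_tokens = tokens(claim)
    best_score, best_chunk = 0.0, None
    for chunk in context:
        score = len(claim_tokens & tokens(chunk)) / max(len(claim_tokens), 1)
        if score > best_score:
            best_score, best_chunk = score, chunk
    return best_score >= threshold, best_score, best_chunk

claims = [
    "Cosmetic procedures are excluded from coverage.",
    "Coverage applies to members under 18.",
]
context = [
    "Cosmetic procedures are excluded from coverage.",
    "Eligibility extends to members aged 16 and above.",
]

# One row per claim, not one float per response.
rows = [(claim, *toy_support(claim, context)) for claim in claims]
unsupported = [claim for claim, supported, _, _ in rows if not supported]
hallucination_rate = len(unsupported) / len(rows)

for claim, supported, score, _ in rows:
    print(f"{claim} -> supported={supported} score={score:.2f}")
print(f"per-claim hallucination rate: {hallucination_rate:.2f}")
```

Swap `toy_support` for an NLI model or the SDK primitive and the loop stays the same; what matters is that `rows` holds one verdict per claim.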

Sentence-level grounded vs claim-level grounded

The cheapest version of “claim-level” eval is sentence-level: split on sentence boundaries, score each sentence. It’s better than answer-level. It’s not the same as claim-level.

The multi-claim sentence. “The policy covers outpatient MRI, requires prior auth, and excludes cosmetic procedures” is one sentence and three claims. Scored as one sentence, the judge returns a single float — typically high, because two of three claims are supported. The unsupported portion smooths into the sentence score. Scored as three claims, the unsupported one drops to zero independently. Dashboard moves from “sentence pass 95%” to “claim pass 78%”. The 78 is the audit number.

The spread claim. “The plan covers MRI. Prior authorization is required” reads as two sentences but carries one composite claim with a qualifier. When the context says “covers MRI without restriction,” sentence-level scoring passes the first sentence and leaves the second as a related-but-unsupported fragment with nothing to pin it to. Claim-level scoring extracts “the plan covers MRI with prior authorization” as one composite and runs it against the contradicting chunk. The fail is sharper and pins to the right material.

Sentence-level is a syntactic cheat for claim-level. It works when claims and sentences line up. It misses when they don’t. The cost of doing it right is one extraction call; the payoff is a hallucination rate that matches what your audit team finds.
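The gap between the two pass rates falls out of a small calculation. The sketch below assumes hypothetical claim verdicts and models the sentence-level judge as passing any sentence whose claims are mostly supported; `sentence_passes` and its 0.5 threshold are illustrative, not SDK behavior.

```python
# Hypothetical verdicts: each sentence maps to the support flags of the claims inside it.
sentences = {
    "The policy covers outpatient MRI, requires prior auth, and excludes cosmetic procedures.":
        [True, True, False],
    "Coverage applies to dependents.": [True],
    "Claims are processed within 30 days.": [True],
}

def sentence_passes(claim_flags: list[bool], threshold: float = 0.5) -> bool:
    """Assumption: a sentence-level judge passes when most of the sentence is supported."""
    return sum(claim_flags) / len(claim_flags) >= threshold

# Sentence-level view: the unsupported claim is smoothed into its sentence.
sentence_pass_rate = sum(sentence_passes(f) for f in sentences.values()) / len(sentences)

# Claim-level view: every claim is scored on its own.
all_flags = [flag for flags in sentences.values() for flag in flags]
claim_pass_rate = sum(all_flags) / len(all_flags)

print(f"sentence pass rate: {sentence_pass_rate:.0%}")
print(f"claim pass rate:    {claim_pass_rate:.0%}")
```

Every sentence passes here, yet one claim in five is unsupported. The multi-claim sentence is the only place the two views disagree, which is why the divergence grows with how densely the model packs its sentences.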

Cherry-picking detection: the chunks the model didn’t use

The harder failure mode is the one Groundedness never sees: the model picked one chunk that confirmed the user’s framing and ignored the chunk in the same retrieval set that contradicted it. This is the lawsuit failure in legal RAG, the harm failure in medical RAG, the regulatory failure in finance RAG.

The detection has two parts.

Identify the unused chunks. ChunkUtilization flags low context use overall; for the per-chunk view, attribute each claim to its supporting chunk via check_attribution, then take the complement of attributed chunks.

Scan the unused chunks for contradictions or qualifiers the answer omits. For each unused chunk, run a contradiction check against the answer’s claims. The check_contradiction primitive in python/fi/evals/metrics/rag/utils/nli.py returns the entailment direction; flagged contradictions become the cherry-pick alarm.

from fi.evals.metrics.rag.utils import (
    extract_atomic_claims, check_claim_supported, check_contradiction,
)

retrieved_chunks = [{"id": "c1", "text": "..."}, {"id": "c2", "text": "..."}]
answer = "..."  # generated response under evaluation (placeholder, like the chunk text above)
claims = extract_atomic_claims(answer)

# Mark every chunk that supports at least one claim
attributed_ids = set()
for claim in claims:
    for chunk in retrieved_chunks:
        supported, score, _ = check_claim_supported(claim, [chunk["text"]])
        if supported:
            attributed_ids.add(chunk["id"])

unused = [c for c in retrieved_chunks if c["id"] not in attributed_ids]

# Scan unused chunks for contradictions against the answer's claims
cherry_pick_flags = []
for claim in claims:
    for chunk in unused:
        is_contradiction, score = check_contradiction(claim, chunk["text"])
        if is_contradiction and score > 0.6:
            cherry_pick_flags.append({"claim": claim, "ignored_chunk": chunk["id"]})

The signature: Groundedness passes, ChunkAttribution passes, ChunkUtilization drops, the contradiction scan over the unused set fires. That combination is cherry-picking. Reporting it as a separate rubric is what moves the audit failure rate down.
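The signature can be written as a single predicate over the four signals. A sketch under stated assumptions: `TraceSignals`, `is_cherry_pick`, and both thresholds are hypothetical names and values chosen for illustration, not part of the SDK.

```python
from dataclasses import dataclass

@dataclass
class TraceSignals:
    groundedness: float        # support score for the answer, 0 to 1
    chunk_attribution: float   # do the cited chunks support their claims, 0 to 1
    chunk_utilization: float   # fraction of retrieved chunks the answer used
    contradiction_flags: int   # unused chunks that contradict at least one claim

def is_cherry_pick(s: TraceSignals, pass_at: float = 0.7, low_utilization: float = 0.5) -> bool:
    """Thresholds are illustrative assumptions, not SDK defaults."""
    looks_grounded = s.groundedness >= pass_at and s.chunk_attribution >= pass_at
    skipped_material = s.chunk_utilization < low_utilization
    return looks_grounded and skipped_material and s.contradiction_flags > 0

suspicious = TraceSignals(groundedness=0.91, chunk_attribution=0.88,
                          chunk_utilization=0.4, contradiction_flags=1)
clean = TraceSignals(groundedness=0.91, chunk_attribution=0.88,
                     chunk_utilization=0.9, contradiction_flags=0)

print(is_cherry_pick(suspicious), is_cherry_pick(clean))  # True False
```

Requiring all four conditions keeps the alarm narrow: low utilization alone is often benign (redundant chunks), and a contradiction against an answer that already fails Groundedness is an ordinary hallucination, not a cherry-pick.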

For high-stakes domains, layer a domain-specific CustomLLMJudge that names the omission pattern explicitly:

from fi.evals.templates import CustomLLMJudge

cherry_pick_judge = CustomLLMJudge(
    name="cherry_pick_omission_check",
    grading_criteria=(
        "Score 0 if the answer makes a claim about coverage, eligibility, or "
        "exclusion that is contradicted or qualified by any chunk in the "
        "retrieval set the answer does not cite. Score 1 only if the answer "
        "incorporates every relevant qualifier or explicitly notes their "
        "absence. Pay attention to exclusion clauses, eligibility cutoffs, "
        "temporal qualifiers, prior-auth requirements, and quantifier scopes "
        "(all, some, only, except, unless)."
    ),
    model="gpt-5",
)

Sycophantic restatement: when the answer is the user’s premise

The other failure mode standard Groundedness misses is sycophantic restatement. The user asks “doesn’t the policy cover outpatient MRI?” and the model returns “yes, the policy covers outpatient MRI” without the retrieval actually supporting it — or with the retrieval saying the opposite. The answer feels grounded because it parses cleanly and the judge finds tangential context that mentions MRI.

The detection is two checks in series.

Premise similarity. Score semantic similarity between the user’s question and the model’s answer. High similarity with short, syntax-tracking answers is the sycophancy signature. “Yes, X covers Y” against “doesn’t X cover Y” scores near 1.0. A grounded answer carrying the qualifier — “X covers Y when prior auth is obtained” — scores lower.

Independent support. When premise similarity is high, re-run the answer’s claims against the context without the question in the entailment prompt. If per-claim support drops sharply, the model was grounding in the question, not the chunks. A drop in support of more than 0.3 is the flagging threshold.

sycophancy_judge = CustomLLMJudge(
    name="sycophantic_restatement_check",
    grading_criteria=(
        "Score 0 if the answer is a near-verbatim restatement of the user's "
        "premise without qualifiers from the retrieval context. Score 0 if "
        "the answer's primary claims have no independent support in the "
        "chunks but high syntactic overlap with the question. Score 1 if the "
        "answer introduces facts, qualifiers, or distinctions from the "
        "retrieved context that the question did not contain."
    ),
    model="gpt-5",
)

Claim-level scoring catches sycophancy because the extracted claims don’t have independent support, even when the answer parses as related to the question. Combined with the premise-similarity check, the rubric flags the class of failure where the model is agreeing with the user instead of reading the documents.
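The two checks in series can be sketched without any model calls. This version assumes the two support scores have already been computed upstream (with and without the question in the entailment prompt) and uses `difflib.SequenceMatcher` as a cheap character-level stand-in for semantic similarity. `premise_similarity`, `sycophancy_flag`, and the 0.7 similarity cutoff are hypothetical; the 0.3 delta is the threshold named above.

```python
from difflib import SequenceMatcher

def premise_similarity(question: str, answer: str) -> float:
    """Character-level ratio in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, question.lower(), answer.lower()).ratio()

def sycophancy_flag(
    question: str,
    answer: str,
    support_with_question: float,
    support_without_question: float,
    similarity_at: float = 0.7,   # illustrative cutoff, an assumption
    delta_at: float = 0.3,        # support-drop threshold from the text
) -> bool:
    """Flag answers that track the premise and lose support once the question is removed."""
    high_similarity = premise_similarity(question, answer) >= similarity_at
    support_drop = support_with_question - support_without_question
    return high_similarity and support_drop > delta_at

question = "Doesn't the policy cover outpatient MRI?"
answer = "Yes, the policy covers outpatient MRI."

# Hypothetical support scores for the same claims, with and without the question in the prompt.
flagged = sycophancy_flag(question, answer,
                          support_with_question=0.9,
                          support_without_question=0.4)
print(flagged)
```

Running the similarity check first keeps the second, more expensive support pass off the majority of traffic, since only premise-tracking answers need it.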

The full faithfulness stack

The production pattern is a five-layer cascade. Each layer adds resolution and cost; pick the cut against domain risk.

Deterministic citation verifier. Structured citations (chunk_id, span) verified against the retrieval set. Catches the cheapest class of hallucination — model citing a chunk that wasn’t retrieved, or a span that doesn’t exist. Runs in milliseconds. 100% of production.
NLI-backed Groundedness on every claim. Claim extraction plus per-claim entailment via DeBERTa-NLI. Cheap, deterministic, scales. Default first pass on every span.
LLM-judge fallback on borderline claims. Groundedness(augment=True) flips this on. The classifier handles the bulk; the judge handles paraphrase, multi-hop, and qualifier-heavy claims. Per-evaluation cost stays below Galileo Luna-2 because most claims clear the classifier and never reach the judge.
Cherry-pick contradiction scan. ChunkUtilization plus the contradiction check over unused chunks. Sampled at 5-20% of traffic in low-stakes, 100% in compliance-sensitive.
Sycophantic-restatement check. Premise similarity plus independent-support delta. Sampled where the question shape suggests confirmation seeking.
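The first layer is simple enough to sketch in full. This version assumes a citation is a (chunk_id, quoted span text) pair and that the retrieval set is a dict from chunk id to text; `verify_citations` is a hypothetical helper, and a system that stores spans as character offsets would check bounds instead of substring membership.

```python
def verify_citations(citations, retrieved):
    """Return the citations that fail verification.

    citations: list of (chunk_id, span_text) pairs emitted by the model.
    retrieved: dict mapping chunk_id -> chunk text for this request.
    """
    failures = []
    for chunk_id, span in citations:
        chunk_text = retrieved.get(chunk_id)
        if chunk_text is None:
            # The model cited a chunk that was never retrieved.
            failures.append((chunk_id, span, "chunk not retrieved"))
        elif span not in chunk_text:
            # The chunk exists but the quoted span does not appear in it.
            failures.append((chunk_id, span, "span not in chunk"))
    return failures

retrieved = {
    "c1": "Outpatient MRI is covered when prior authorization has been obtained.",
    "c2": "Cosmetic procedures are excluded from coverage.",
}
citations = [
    ("c1", "Outpatient MRI is covered"),
    ("c2", "Cosmetic procedures are covered"),
    ("c9", "Members under 18 are eligible"),
]

for failure in verify_citations(citations, retrieved):
    print(failure)
```

Because the check is exact string matching over data already in memory, it can run on every response before any model-backed layer is invoked.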

The recurring call composes the templates that map to each layer:

from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextAdherence, FactualAccuracy,
    ChunkAttribution, ChunkUtilization,
)
from fi.testcases import TestCase

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

test_case = TestCase(
    input="Does the policy cover outpatient MRI?",
    output="Yes, the policy covers outpatient MRI with prior authorization.",
    context=[
        "Outpatient MRI is covered when prior authorization has been obtained.",
        "Cosmetic procedures are excluded from coverage.",
    ],
)

result = evaluator.evaluate(
    eval_templates=[
        Groundedness(augment=True),     # eval_id 47, cascade on
        ContextAdherence(),             # eval_id 5
        FactualAccuracy(),              # eval_id 66, atomic-claim path
        ChunkAttribution(),             # eval_id 11
        ChunkUtilization(),             # eval_id 12
    ],
    inputs=[test_case],
)
for metric in result[0].metrics:
    print(metric.name, metric.value, metric.reason)

Same call shape runs in CI on a versioned dataset and in production as span-attached scores via traceAI. The cherry-pick and sycophancy rubrics layer on as CustomLLMJudge templates.
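In CI, the per-claim verdicts usually end up behind a hard threshold. A minimal sketch of such a gate, assuming the verdicts have already been collected from an evaluator run; `claim_gate`, the test name, and the 5% budget are hypothetical and would be tuned per domain.

```python
def claim_gate(per_claim_verdicts: list[bool], max_hallucination_rate: float = 0.05) -> None:
    """Raise AssertionError when the per-claim hallucination rate exceeds the budget."""
    if not per_claim_verdicts:
        raise ValueError("no claims were scored; refusing to pass an empty eval run")
    rate = per_claim_verdicts.count(False) / len(per_claim_verdicts)
    assert rate <= max_hallucination_rate, (
        f"per-claim hallucination rate {rate:.1%} exceeds budget {max_hallucination_rate:.1%}"
    )

def test_release_candidate_faithfulness():
    # Hypothetical verdicts; in CI these would come from scoring the versioned dataset.
    verdicts = [True] * 97 + [False] * 3
    claim_gate(verdicts, max_hallucination_rate=0.05)
```

Gating on the per-claim rate rather than a mean score keeps the CI number aligned with what an audit would count. The empty-run guard matters because a silently skipped eval would otherwise pass.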

Anti-patterns to avoid

The mistakes that show up in faithfulness eval reviews:

Reporting one answer-level Groundedness number. Dashboard reads 0.92, audit finds 18% unsupported claims. Decompose.
Splitting on sentences and calling it claim-level. Multi-claim sentences and spread claims diverge from sentence boundaries. Use extract_atomic_claims.
No scan over unused chunks. ChunkUtilization without a contradiction check on the unused set misses cherry-picking.
No sycophancy check on confirmation-seeking questions. “Doesn’t X” and “isn’t Y” prime the model to restate. Sample these routes.
LLM judge on every claim. Cost spirals; drift breaks reproducibility. Classifier first via augment=True, judge on borderline only.
Frozen rubric. New failures land weekly. Promote failing production traces into the eval set or the rubric stops catching the bugs your users file.
No span-attached scores. Without trace integration, the eval lives on a dashboard nobody opens. Attach to RETRIEVER spans.

Production observability and Error Feed clustering

The same rubric set runs as span-attached scorers against live traces via traceAI (Apache 2.0; 50+ AI surfaces across Python, TypeScript, Java, C#; 14 span kinds including first-class RETRIEVER, RERANKER, EMBEDDING). Sample 5-10% of traffic for the LLM-judge layer; run the classifier path and citation verifier on 100% because they’re cheap.
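One way to implement the split between always-on and sampled layers is deterministic hashing on the trace id, so a given trace gets the same decision on every replay. The sketch below is an assumption about how such routing could look; `LAYER_SAMPLE_RATES`, `sampled`, and `layers_for` are hypothetical names, and the 10% judge rate is picked from the 5-10% range above.

```python
import hashlib

# Cheap layers run on everything; the LLM-judge layer is sampled.
LAYER_SAMPLE_RATES = {
    "citation_verifier": 1.0,
    "nli_groundedness": 1.0,
    "llm_judge": 0.10,
}

def sampled(trace_id: str, layer: str, rate: float) -> bool:
    """Deterministic sampling: the same trace always gets the same decision per layer."""
    if rate >= 1.0:
        return True
    digest = hashlib.sha256(f"{layer}:{trace_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate

def layers_for(trace_id: str) -> list[str]:
    """List the eval layers that should run for this trace."""
    return [layer for layer, rate in LAYER_SAMPLE_RATES.items()
            if sampled(trace_id, layer, rate)]

print(layers_for("trace-42"))
```

Hashing the layer name together with the trace id keeps the sampled populations independent across layers, so adding a second sampled rubric later does not pile onto the same 10% of traces.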

The signal that matters is the rolling delta of per-claim hallucination rate per route, not the absolute Groundedness mean. A sustained rise of 2-5 points over 30-90 minutes is the cue that a new model version shipped, a chunker change landed, or the retrieval index rotated.
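A sustained-rise check of this kind can be sketched as a small stateful monitor, one per route. The version below assumes fixed time windows (for example, 30 minutes each, so three windows cover 90 minutes) and a known baseline rate; `RateDriftMonitor` and its defaults are hypothetical.

```python
from collections import deque

class RateDriftMonitor:
    """Flags a sustained rise in per-claim hallucination rate against a fixed baseline."""

    def __init__(self, baseline: float, rise_points: float = 2.0, sustain_windows: int = 3):
        self.baseline = baseline            # e.g. 0.05 for a 5% baseline rate
        self.rise = rise_points / 100.0     # percentage points expressed as a fraction
        self.recent = deque(maxlen=sustain_windows)

    def observe(self, unsupported: int, total: int) -> bool:
        """Record one time window; return True once every recent window is elevated."""
        rate = unsupported / total if total else 0.0
        self.recent.append(rate - self.baseline >= self.rise)
        return len(self.recent) == self.recent.maxlen and all(self.recent)

monitor = RateDriftMonitor(baseline=0.05)
# Hypothetical (unsupported claims, total claims) counts for four consecutive windows.
windows = [(5, 100), (8, 100), (9, 100), (8, 100)]
alerts = [monitor.observe(unsupported, total) for unsupported, total in windows]
print(alerts)  # [False, False, False, True]
```

Requiring every window in the lookback to be elevated trades detection latency for fewer false alarms from a single noisy window.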

Error Feed sits inside the eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups failing traces into named issues: “cherry-picked exclusion clause in coverage Q&A,” “sycophantic restatement of refund-policy question,” “qualifier stripped from temporal coverage clause.” A Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, eight span-tools including read_span, get_children, submit_finding) reads each failing trace, writes the RCA, evidence quotes, an immediate_fix, and a four-dim score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each).

Named issues are the unit of triage; immediate_fix is the unit of work. Representative traces from each cluster promote into the eval set under engineer or domain-expert sign-off. The dataset gets sharper every week; the per-claim hallucination rate drops because the rubric now catches what it missed last quarter. Linear is wired today; Slack, GitHub, Jira, PagerDuty are on the roadmap.

How Future AGI ships the faithfulness stack

Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.

ai-evaluation SDK (Apache 2.0): from fi.evals import Evaluator. Canonical RAG templates as typed classes — Groundedness (47), ContextAdherence (5), Completeness (10), ChunkAttribution (11), ChunkUtilization (12), FactualAccuracy (66). Local NLI primitives in python/fi/evals/metrics/rag/utils/ — extract_claims, extract_atomic_claims, check_claim_supported, check_attribution, check_contradiction — that run on a DeBERTa model with no API call. CustomLLMJudge with grading_criteria for cherry-pick and sycophancy rubrics.
Future AGI Platform: self-improving evaluators tuned by user feedback; in-product authoring agent generates custom cherry-pick and sycophancy rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2 for high-volume RAG scoring.
Error Feed (inside the eval stack): HDBSCAN clustering groups failing traces into named issues; Sonnet 4.5 Judge writes the immediate_fix; engineer-reviewed promotions feed the dataset and the Platform’s self-improving evaluators.
traceAI (Apache 2.0): per-claim scores attach as span attributes on RETRIEVER spans so the unused chunks live next to the verdict.

The honest tradeoff: cherry-pick and sycophancy detection cost roughly 1.5-2x standard Groundedness because of the contradiction scan and the second judge. For regulated workloads where audit failure rate is the binding constraint, it’s the right tradeoff. For consumer Q&A, the cherry-pick rubric can stay sampled.

Ready to wire the claim-level rubric into a CI gate? Bind Groundedness(augment=True), FactualAccuracy, ChunkAttribution, ChunkUtilization, plus the cherry-pick CustomLLMJudge into a pytest fixture against the ai-evaluation SDK, then add the traceAI instrumentor when production traces start asking questions the CI gate missed.

Frequently asked questions

Why is answer-level Groundedness a misleading number?

Groundedness scored at the answer level returns one float for the whole response. An answer with five claims where one is hallucinated still scores around 0.8, because the judge averages support across the response or grades the response as a single unit. A 0.92 mean Groundedness across a release looks healthy and can hide a per-claim hallucination rate of roughly 18 to 20 percent. The metric reads as a quality score; what it actually measures is the fraction of the response that looks supported on a holistic read. Claim-level decomposition is the only way to recover the per-claim hallucination rate, and the per-claim rate is the one that ships to users.

What is cherry-pick detection and why does it matter for RAG?

Cherry-pick detection is the eval that asks whether the model selectively quoted one passage that confirmed the user's framing while ignoring other retrieved passages that contradicted or qualified it. Standard Groundedness passes when each cited span exists in the context. It says nothing about the spans the model chose not to use. In legal, medical, and policy RAG this is the failure mode that lands the lawsuit: every claim is technically grounded, but the answer omits the exclusion clause that lives in chunk 4 of the retrieval set. The eval requires running the retrieved set through a contradiction-and-qualifier scan and flagging answers that ignore material counter-evidence.

What is sycophantic restatement in RAG and how is it scored?

Sycophantic restatement is the failure mode where the model rephrases the user's question or premise as the answer, dressed up in retrieval-flavored language. The classic shape is a user asking 'doesn't the policy cover X?' and the model returning 'yes, the policy covers X' even when the retrieved context says nothing of the sort, or actually says the opposite. Groundedness can pass because the answer is short and the judge finds tangential support in the context. The detection rubric scores semantic similarity between the answer and the user's premise, then re-checks the answer's claims against the retrieved chunks with the question removed from the entailment prompt. High premise-similarity with weak independent support is the signature.

Why is claim-level grounded scoring different from sentence-level?

Sentence-level scoring splits on punctuation and checks each sentence against the context. Claim-level scoring extracts atomic, standalone, declarative propositions from the response and checks each one. A single sentence can carry two or three claims (one supported, two not); a single claim can span two sentences. Sentence-level scoring misses the multi-claim sentence and double-counts the spread claim. Claim-level scoring catches both. The cost is one extra LLM call to extract claims, and the resolution is the per-claim hallucination rate that survives audit. Future AGI's RAG faithfulness path runs claim extraction by default and exposes the per-claim list in the eval result.

How do Groundedness, ChunkAttribution, and ChunkUtilization compose to catch cherry-picking?

Groundedness scores whether claims are supported. ChunkAttribution scores whether the cited chunk supports the cited claim. ChunkUtilization scores how many retrieved chunks the generator actually used. Cherry-picking shows up as the pattern: Groundedness high, ChunkAttribution high on the chunks the model used, ChunkUtilization low because the model ignored half the retrieval set. The remaining unused chunks often contain the contradicting evidence. Adding a contradiction scan across the unused chunks is the cherry-pick detection layer; the three core metrics give you the signal that something was skipped, and the contradiction scan tells you what it was.

What does Future AGI ship for claim-level RAG faithfulness?

The ai-evaluation SDK (Apache 2.0) ships Groundedness (eval_id 47) backed by a classifier-with-judge cascade, FactualAccuracy (66) for atomic-claim verification, ContextAdherence (5) for boundary checking, ChunkAttribution (11), ChunkUtilization (12), and Completeness (10). The RAG utilities module exposes extract_claims, extract_atomic_claims, check_claim_supported, check_attribution, and check_contradiction as primitives. CustomLLMJudge with grading_criteria covers cherry-pick and sycophantic-restatement rubrics. traceAI attaches the per-claim scores to RETRIEVER spans so the retrieved chunks, the unused chunks, and the per-claim verdict live in the same trace tree.

How does the closed loop turn faithfulness failures into a stronger eval set?

Error Feed sits inside the eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups failing traces into named issues like 'cherry-picked exclusion clause' or 'sycophantic restatement of user premise.' A Sonnet 4.5 Judge agent reads each cluster, writes the RCA, the evidence quotes, and an immediate_fix. Engineer-reviewed promotions move representative failing traces into the eval set under a CI gate. The dataset gets sharper every week; the per-claim hallucination rate drops because the rubric now catches the failure mode it missed last quarter.
