Guides

RAG Evaluation Metrics: A Deep Dive (2026)

RAG eval is a bisection problem. Each metric pins the failure to retrieval, generation, or the cross-cut. A metric-by-metric methodology guide for senior ML engineers.

·
Updated
·
12 min read
rag evaluation groundedness ai-evaluation 2026
Editorial cover image for RAG Evaluation Metrics A Deep Dive
Table of Contents

A senior engineer pings the team channel: the RAG quality score dropped from 0.84 to 0.79 on yesterday’s release, can someone bisect. Three days later the bisect is still open because the quality score is a weighted average of seven rubrics and the team cannot tell whether retrieval regressed, generation regressed, or both moved at once. The PR that caused it was a one-line change to the chunker; ContextRelevance would have flagged it in five minutes. Nobody looked at ContextRelevance because the aggregate said 0.79.

This is the RAG eval mistake that compounds across releases. RAG has two failure surfaces — the retriever and the generator — and an aggregate quality score throws away the diagnostic that tells you which one moved. The opinion this post earns: RAG evaluation is a bisection problem, and each metric exists to pin the failure to a specific layer. Report rubrics per layer or you cannot debug regressions; report them as a single number and you have a dashboard, not an eval.

This guide is a methodology deep-dive on the canonical RAG metrics: what each one measures, where each one breaks, when to reach for which. Code shaped against the ai-evaluation SDK with the exact EvalTemplate IDs the SDK ships. Audience: senior ML engineers shipping RAG in production who already know what vector_store.search() returns and want the rubric set that survives quarter four.

TL;DR: the canonical RAG rubric set by layer

LayerMetricWhat it measuresFailure it catches
RetrievalContextRelevanceRetrieved chunks relevant to the queryRetriever surfaces the wrong chunks
RetrievalContextRecallRetrieved set covers the answer-bearing chunksRetriever misses the right chunk
RetrievalChunkAttributionEach cited chunk supports the cited claimCitation is fabricated or misaligned
RetrievalChunkUtilizationGenerator used what the retriever surfacedRetriever over-fetches; pay for unused tokens
GenerationGroundednessEvery claim supported by retrieval contextModel fabricates from parametric memory
GenerationContextAdherenceResponse stays inside the contextModel wanders into outside knowledge
GenerationAnswerRelevanceResponse addresses the question askedModel writes around the question
GenerationCompletenessResponse covers the required factsModel stops at the first fact
Cross-cutFactualAccuracyClaims true against external truthSource itself is wrong
Cross-cutCitation validityCited span exists verbatim in chunkHallucinated quote with real chunk ID

The bisection rule: retrieval rubrics fall first when the chunker, embedder, or reranker regressed; generation rubrics fall first when the prompt, model, or decoding regressed; cross-cut rubrics fall when either layer is broken or the source corpus drifted. Keep them tagged by layer in your eval store and the regression bisect collapses to a five-minute job.

Why the aggregate score hides the bug

The aggregate is seductive because it fits on a slide. It is also lossy in a specific way: it mixes orthogonal signals. ContextRelevance is a retrieval-side rubric; Groundedness is a generation-side rubric; FactualAccuracy is a cross-cutting rubric that fires when either layer or the source data is wrong. Averaging them produces a number with no derivative. A five-point drop could be retrieval, generation, the source, or three things at once moving in opposite directions.

The original RAG paper was clear about this structure: the retriever and the generator are separate components with separate failure modes. The community drifted to single-number scoring because it ships on a dashboard. The cost is opacity. A retrieval regression and a prompt regression look identical to a quality score, and the team’s response (re-run sweeps, swap models, escalate to the foundation-model vendor) is uncorrelated with the actual fix.

The fix is structural, not statistical. Tag every rubric by layer in the eval store. Compute layer-level aggregates if you must roll up: retrieval_health = mean(ContextRelevance, ChunkAttribution, ChunkUtilization) and generation_health = mean(Groundedness, AnswerRelevance, Completeness) are defensible. A single global RAG score is not.

The retrieval-level metrics

Retrieval-level rubrics score chunks against the query, independent of what the generator wrote. They are the upstream signal: every generation rubric depends on retrieval doing its job, which is why the bisect starts here.

ContextRelevance. For each retrieved chunk, is the chunk relevant to the user’s query. Computed by an LLM judge or a fine-tuned classifier scoring each chunk on a 0-to-1 scale, then averaging (or taking the top-k mean). Fails when vector similarity surfaces a chunk that lexically looks close but answers a different question — the classic “asked about Section 12, got Section 9” failure. Template: ContextRelevance (cloud eval_id=9).

ContextRecall. Did the retriever return the chunks that contain the answer. This is the recall side of retrieval and the rubric that catches missing chunks rather than wrong chunks. Requires an answer key with the expected source chunk IDs; cheap to compute once the dataset is labeled. Local templates in python/fi/evals/metrics/rag/retrieval/: context_recall, context_precision, context_entity_recall. Pair with the IR-style recall_at_k, precision_at_k, mrr, ndcg when you want the rank-aware version.

ChunkAttribution. When the response cites chunk X for a claim, does chunk X actually support that claim. Two failure modes: the chunk is cited but does not contain the quoted span (a citation hallucination), and the chunk is cited but supports a different claim (a citation misalignment). Matters most where citation correctness is a product requirement: legal, medical, financial, compliance. Template: ChunkAttribution (cloud eval_id=11). Local equivalent: source_attribution in python/fi/evals/metrics/rag/advanced/.

ChunkUtilization. How many of the retrieved chunks actually contributed to the response. Top-10 retrieval that the generator uses 2 of is paying token cost on 8 unused chunks plus the latency of generating context for them. The rubric earns its keep in cost optimisation: utilisation below 30 percent is a signal to retrieve fewer chunks, add a reranker before generation, or both. Template: ChunkUtilization (cloud eval_id=12). Local equivalent: context_utilization in python/fi/evals/metrics/rag/generation/.

from fi.evals import Evaluator
from fi.evals.templates import (
    ContextRelevance, ChunkAttribution, ChunkUtilization,
)

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
retrieval_health = ev.evaluate(
    eval_templates=[ContextRelevance(), ChunkAttribution(), ChunkUtilization()],
    inputs=[{"input": question, "output": answer, "context": retrieved_chunks}],
)

The generation-level metrics

Generation-level rubrics score the response against the retrieved context, taking retrieval as given. A regression here with retrieval rubrics steady means the bug is the generator: a prompt change, a model swap, a decoding parameter, or the model itself drifting.

Groundedness (a.k.a. Faithfulness). For every claim in the response, is the claim supported by at least one span of the retrieval context. The canonical RAG hallucination rubric. Extract claims (a sentence-splitter or a structured-output schema), score each claim’s entailment against the context using either an NLI model (cheap, deterministic, fast) or an LLM-as-judge (more expensive, handles nuance, slower), and return the fraction of supported claims. Ragas calls it faithfulness; the Ragas paper lays out the claim-extraction methodology that most implementations now follow. Template: Groundedness (cloud eval_id=47). Local equivalents: groundedness, faithfulness, claim_support (DeBERTa-NLI, no API call).

ContextAdherence. Does the response stick to the retrieval context or introduce information from outside. A response can be grounded (every claim is supported) and still fail adherence (the model added context-irrelevant elaboration the user did not ask for). Matters most in regulated workloads where off-policy answers are a compliance issue. Template: ContextAdherence (cloud eval_id=5).

AnswerRelevance. Does the response actually answer the question. The rubric that catches the model writing eloquently around the question without answering it, common in long-context RAG where the generator pads with context summary instead of extracting the asked-for fact. Implemented as a judge scoring (question, answer) directly or via the Ragas reverse-question trick: generate plausible questions from the answer and measure their similarity to the original. Local template: answer_relevancy in python/fi/evals/metrics/rag/generation/. The SDK exposes the related AnswerRefusal (cloud eval_id=88) for the inverse: did the model refuse when it should have answered.

Completeness. Does the response cover the important facts the question demanded. Fails when the model picks one fact and stops, the multi-part question pattern where the user asks three things and gets one. Cheapest to score with a pre-defined expected-fact list per question (much cheaper to annotate than a full reference answer); LLM-judge mode works without the explicit list but with more variance. Template: Completeness (cloud eval_id=10).

The cross-cutting metrics

Cross-cutting rubrics fail when either layer is broken or when the source data itself is wrong. They are the rubrics that surface bugs the layer-specific rubrics cannot see.

FactualAccuracy. Are the claims in the response true, regardless of the retrieval context. The rubric that fires when Groundedness passes but the answer is still wrong — usually because the retrieved source was wrong (stale documentation, an incorrect knowledge-base entry, a fabricated training example) and the model faithfully grounded the response in the wrong source. The fix is upstream of the model: data quality, source freshness, or removing the bad chunk from the corpus. Template: FactualAccuracy (cloud eval_id=66). Local equivalents: factual_consistency, claim_support, contradiction_detection in python/fi/evals/metrics/hallucination/.

Citation validity. Every cited span exists verbatim (or within a small fuzzy tolerance) in the retrieval context. Deterministic, no LLM judge required, runs in milliseconds. The cheapest rubric in the stack and the one that catches the most embarrassing class of failure: a fabricated quote pinned to a real chunk ID, where structured output passed and schema validation passed and the user reads a sentence the document does not contain. Run on every response in production; cost is negligible.

def citation_validity(answer, chunks: list[dict]) -> float:
    chunk_map = {c["id"]: c["text"] for c in chunks}
    if not answer.citations:
        return 1.0
    valid = sum(
        1 for c in answer.citations
        if c.chunk_id in chunk_map and c.quoted_span in chunk_map[c.chunk_id]
    )
    return valid / len(answer.citations)

Metric-pair tradeoffs

Three pairs of rubrics pull in opposite directions, and the tradeoff is what your eval set is designed to expose.

ContextRecall vs ContextPrecision. Increase top-k retrieval and recall goes up; precision goes down. Aggressive recall surfaces the answer-bearing chunk reliably and pays in noise that the generator must filter. Aggressive precision narrows the retrieval surface and risks missing the right chunk on hard queries. Tune the operating point against ChunkUtilization in production: utilisation below 25 percent says you are over-fetching; recall below 90 percent on your gold set says you are under-fetching.

Groundedness vs Completeness. A model rewarded only on groundedness learns to answer narrowly and refuse on partial coverage; it sticks to what the context strictly supports, which scores high on groundedness and low on completeness. A model rewarded only on completeness learns to elaborate beyond the context, which scores high on completeness and low on groundedness. The two have to be reported together or the prompt-tuning loop optimises one at the expense of the other.

ContextAdherence vs AnswerRelevance. Strict adherence forbids outside knowledge; some questions cannot be answered well without it (a definition the corpus does not contain, a follow-on the user expects from a competent assistant). The right move in compliance-sensitive domains is high adherence and accepting the relevance cost; the right move in casual Q&A is lower adherence and higher relevance. The product owner picks the operating point; the eval set encodes the choice.

The diagnostic decision tree

When a regression lands, walk the tree:

  1. ContextRelevance dropped? Retrieval regressed. Bisect the chunker, embedder, vector store config, or reranker change. Generation rubrics will follow within a release or two.
  2. ContextRelevance steady, Groundedness dropped? Generator regressed. Bisect the prompt, model version, temperature, or decoding params.
  3. Groundedness steady, FactualAccuracy dropped? Source data regressed. Audit the corpus: stale chunks, incorrect entries, freshly indexed bad data.
  4. Groundedness steady, Completeness dropped? Model is truncating or refusing more often. Check max tokens, refusal calibration, prompt edits that added “if unsure, say so” language.
  5. Groundedness steady, ChunkAttribution dropped? Citations are fabricating. The generator passed the schema validator with quotes that do not exist in the cited chunks. Tighten the structured-output schema and run citation validity deterministically.
  6. ChunkUtilization dropped? Retriever over-fetched (often after a top-k increase). Reduce top-k, add a reranker, or both.
  7. AnswerRelevance dropped, everything else steady? Prompt or model is wandering. Sample failing traces; the pattern is usually verbose pre-amble.

The tree is the whole point of layer-tagged rubrics. Without it you have a number that moved; with it you have a layer to fix.

Production observability and Error Feed clustering

Offline eval catches the regressions you can think of; production catches the rest. The same rubric set runs as span-attached scorers against live traces via traceAI (Apache 2.0; 50+ AI surfaces across Python, TypeScript, Java, C#; 14 span kinds including first-class RETRIEVER, RERANKER, EMBEDDING). Sample 5 to 10 percent of traffic for LLM-judge rubrics; run citation validity and ChunkUtilization on 100 percent because they are cheap.

The signal that matters is rolling-mean delta per rubric per route, not the absolute score. A 2 to 5 point sustained drop in ContextRelevance over 30 to 90 minutes is the prompt that the chunker change shipped. A drift between offline CI pass and online rolling mean is a representativeness signal on the eval set itself; close the gap by promoting failing production traces into the dataset.

Error Feed sits inside the eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every failing trace into a named issue. A Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, eight span-tools including read_span, get_children, submit_finding) reads the failing trace, writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score. The named issues are the unit of triage; the immediate_fix is the unit of work; representative traces from each cluster promote into the eval set under engineer or domain-expert sign-off. The dataset ratchets stronger every week; the CI gate catches more regressions every quarter.

Common RAG eval mistakes

The mistakes that show up in eval reviews almost every time:

  • Reporting one aggregate “RAG quality” score. Hides which layer moved. Tag rubrics by layer and report them separately.
  • Only running Groundedness. Misses retrieval bugs entirely. ContextRelevance is the first rubric, not the last.
  • No expected-fact list for Completeness. A reference answer is one human’s opinion; an expected-fact list is deterministic and cheaper to maintain.
  • No deterministic citation validity. LLM-judge attribution is the slow path; verbatim string match plus fuzzy tolerance catches fabricated quotes on every response for nearly free.
  • Frozen eval set. RAG eval data goes stale fast as source corpora and user-question distributions shift. Refresh weekly from production failures or the gate stops catching the bugs your users actually file.
  • Eval on a separate dashboard. Without span-attached scores, no one looks at the eval and the regression ships to production before anyone notices.

How Future AGI ships the RAG eval stack

Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.

  • ai-evaluation SDK (Apache 2.0): from fi.evals import Evaluator, Protect. Every canonical RAG template as a ready-to-use class: Groundedness (eval_id 47), ContextAdherence (5), ContextRelevance (9), Completeness (10), ChunkAttribution (11), ChunkUtilization (12), FactualAccuracy (66), AnswerRefusal (88). Local NLI-based equivalents (groundedness, faithfulness, claim_support, factual_consistency, contradiction_detection, context_recall, context_precision, context_entity_recall, noise_sensitivity, ndcg, mrr, precision_at_k, recall_at_k, context_utilization, answer_relevancy, multi_hop_reasoning, source_attribution) that run on a DeBERTa model with no API call. 13 guardrail backends (9 open-weight, 4 API). 8 sub-10ms Scanners. Four distributed runners (Celery, Ray, Temporal, Kubernetes). RailType.INPUT/OUTPUT/RETRIEVAL plus AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED.
  • Future AGI Platform: self-improving evaluators tuned by user feedback; in-product authoring agent generates custom RAG rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2 for high-volume RAG-trace scoring.
  • Error Feed (inside the eval stack): HDBSCAN clustering on ClickHouse embeddings groups failing RAG traces into named issues; Sonnet 4.5 Judge agent writes the immediate_fix; lawyer- or engineer-reviewed promotions feed the dataset and the Platform’s self-improving evaluators.
  • traceAI (Apache 2.0): 50+ AI surfaces across Python, TypeScript, Java, C#; 14 span kinds including RETRIEVER, RERANKER, EMBEDDING; pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY); 62 built-in evals via EvalTag for span-attached scoring on live traces.

Ready to wire the rubric set into a CI gate? Bind ContextRelevance, Groundedness, Completeness, ChunkAttribution, and citation validity into a pytest fixture this afternoon against the ai-evaluation SDK, then add the traceAI instrumentor when production traces start asking questions the CI gate missed.

Frequently asked questions

Why is reporting a single aggregate RAG quality score the wrong thing to do?
Because the aggregate hides which layer broke. A RAG pipeline has two failure surfaces: retrieval (wrong chunks surfaced) and generation (right chunks, wrong answer). The bug in production is almost always one or the other, not both. An aggregate that averages groundedness with context relevance flattens a retrieval regression and a generation regression into the same number, so the bisect that should take five minutes takes three days. Report each rubric separately and tag it by layer; the diagnostic is the metric, not the score. Aggregates are fine for executive dashboards once the layer-level numbers exist underneath, but never as the primary signal an engineer pages on.
What is the canonical split between retrieval-level and generation-level metrics?
Retrieval-level metrics score the chunks the retriever surfaced against the question and the corpus, independent of what the generator did with them: ContextRelevance, ContextRecall, ChunkAttribution, ChunkUtilization, and the IR-style Recall@k, Precision@k, MRR, nDCG. Generation-level metrics score the response against the retrieved context, independent of whether retrieval was good: Groundedness/Faithfulness, ContextAdherence, AnswerRelevance, Completeness. Cross-cutting metrics like FactualAccuracy and citation validity score the response against external truth or against the chunks themselves and fail when either layer is wrong. A regression that hits a retrieval rubric while generation rubrics hold steady tells you the retriever moved; the reverse tells you the generator moved.
What is the difference between Groundedness, ContextAdherence, and FactualAccuracy?
Groundedness asks whether every claim in the response is supported by some span of the retrieval context. ContextAdherence asks whether the response stays inside the context or introduces information from outside. FactualAccuracy asks whether the claims are true in the world, independent of the context. The classic failure pattern: a response can be perfectly grounded (every claim has a supporting span) while being factually inaccurate (because the retrieved source was wrong). A response can be grounded and factually correct but fail adherence (the model elaborated with outside knowledge). Run Groundedness when you trust your sources, FactualAccuracy when you do not, and ContextAdherence when off-policy content is a regulatory liability.
When do I need ChunkAttribution and ChunkUtilization specifically?
ChunkAttribution scores whether each cited chunk actually supports the claim it is cited for; it is the rubric that catches plausible-looking fabricated citations and matters most in legal, medical, financial, and compliance-sensitive RAG. ChunkUtilization scores how many of the retrieved chunks the generator actually used; it is the rubric that catches retrieval over-fetch and pays for itself the moment your context window costs start dominating per-query latency. The two often run together because they answer adjacent questions: did the citations land on the right chunks, and is the retriever pulling chunks the generator ignores. Treat them as a pair when citation correctness is a product requirement and as separate signals otherwise.
How do AnswerRelevance and Completeness differ, and why are they not the same rubric?
AnswerRelevance asks whether the response addresses the question that was asked; it fails when the model returns related-but-off-target content (the user asked about Section 12, the model wrote about Section 9). Completeness asks whether the response covers the important facts the question demanded; it fails when the model gets one fact right and stops. A response can be perfectly relevant but incomplete (right topic, missing facts), and a response can be complete but not relevant (every required fact is mentioned, but mostly buried in unrelated commentary). Both are generation-side, both require an expected-fact list or a judge with a clear rubric, and both move independently in production.
Where does the diagnostic decision tree start when a RAG regression lands in production?
Start with ContextRelevance. If it dropped, retrieval regressed (a chunker change, an embedding rotation, a reranker swap) and downstream rubrics will follow. If ContextRelevance held but Groundedness dropped, the generator regressed (a model swap, a prompt edit, a temperature change) and the bug is in generation. If Groundedness held but FactualAccuracy dropped, the source data is wrong and retrieval is grounding the response in a stale or incorrect chunk. If Completeness dropped while Groundedness held, the model is truncating or refusing more often. If ChunkAttribution dropped while Groundedness held, citations are fabricating. Each rubric pins the regression to a specific layer; together they let you bisect in under five minutes.
What does Future AGI ship for RAG eval, and where does it sit against alternatives?
The ai-evaluation SDK (Apache 2.0) ships every canonical RAG template as a typed EvalTemplate class — Groundedness (eval_id 47), ContextAdherence (5), ContextRelevance (9), Completeness (10), ChunkAttribution (11), ChunkUtilization (12), FactualAccuracy (66), AnswerRefusal (88) — alongside local NLI-based equivalents (faithfulness, claim_support, factual_consistency, context_recall, context_precision, context_entity_recall, noise_sensitivity, source_attribution, multi_hop_reasoning) that run on a DeBERTa model with no API call. Thirteen guardrail backends, eight sub-10ms Scanners, four distributed runners. The Future AGI Platform layers on self-improving evaluators tuned by user feedback and classifier-backed evals at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the eval stack and groups failing traces into named issues via HDBSCAN clustering plus a Sonnet 4.5 Judge that writes the immediate fix.
Related Articles
View all
Evaluating RAG Faithfulness: A 2026 Deep Dive
Guides

Why answer-level Groundedness hides RAG hallucinations, and how claim-level decomposition, cherry-pick detection, and sycophantic-restatement scoring fix it. Methodology for senior ML engineers.

NVJK Kartik
NVJK Kartik ·
11 min