RAG Evaluation Metrics: A Deep Dive (2026)
RAG eval is a bisection problem. Each metric pins the failure to retrieval, generation, or the cross-cut. A metric-by-metric methodology guide for senior ML engineers.
Table of Contents
A senior engineer pings the team channel: the RAG quality score dropped from 0.84 to 0.79 on yesterday’s release, can someone bisect. Three days later the bisect is still open because the quality score is a weighted average of seven rubrics and the team cannot tell whether retrieval regressed, generation regressed, or both moved at once. The PR that caused it was a one-line change to the chunker; ContextRelevance would have flagged it in five minutes. Nobody looked at ContextRelevance because the aggregate said 0.79.
This is the RAG eval mistake that compounds across releases. RAG has two failure surfaces — the retriever and the generator — and an aggregate quality score throws away the diagnostic that tells you which one moved. The opinion this post earns: RAG evaluation is a bisection problem, and each metric exists to pin the failure to a specific layer. Report rubrics per layer or you cannot debug regressions; report them as a single number and you have a dashboard, not an eval.
This guide is a methodology deep-dive on the canonical RAG metrics: what each one measures, where each one breaks, when to reach for which. Code shaped against the ai-evaluation SDK with the exact EvalTemplate IDs the SDK ships. Audience: senior ML engineers shipping RAG in production who already know what vector_store.search() returns and want the rubric set that survives quarter four.
TL;DR: the canonical RAG rubric set by layer
| Layer | Metric | What it measures | Failure it catches |
|---|---|---|---|
| Retrieval | ContextRelevance | Retrieved chunks relevant to the query | Retriever surfaces the wrong chunks |
| Retrieval | ContextRecall | Retrieved set covers the answer-bearing chunks | Retriever misses the right chunk |
| Retrieval | ChunkAttribution | Each cited chunk supports the cited claim | Citation is fabricated or misaligned |
| Retrieval | ChunkUtilization | Generator used what the retriever surfaced | Retriever over-fetches; pay for unused tokens |
| Generation | Groundedness | Every claim supported by retrieval context | Model fabricates from parametric memory |
| Generation | ContextAdherence | Response stays inside the context | Model wanders into outside knowledge |
| Generation | AnswerRelevance | Response addresses the question asked | Model writes around the question |
| Generation | Completeness | Response covers the required facts | Model stops at the first fact |
| Cross-cut | FactualAccuracy | Claims true against external truth | Source itself is wrong |
| Cross-cut | Citation validity | Cited span exists verbatim in chunk | Hallucinated quote with real chunk ID |
The bisection rule: retrieval rubrics fall first when the chunker, embedder, or reranker regressed; generation rubrics fall first when the prompt, model, or decoding regressed; cross-cut rubrics fall when either layer is broken or the source corpus drifted. Keep them tagged by layer in your eval store and the regression bisect collapses to a five-minute job.
Why the aggregate score hides the bug
The aggregate is seductive because it fits on a slide. It is also lossy in a specific way: it mixes orthogonal signals. ContextRelevance is a retrieval-side rubric; Groundedness is a generation-side rubric; FactualAccuracy is a cross-cutting rubric that fires when either layer or the source data is wrong. Averaging them produces a number with no derivative. A five-point drop could be retrieval, generation, the source, or three things at once moving in opposite directions.
The original RAG paper was clear about this structure: the retriever and the generator are separate components with separate failure modes. The community drifted to single-number scoring because it ships on a dashboard. The cost is opacity. A retrieval regression and a prompt regression look identical to a quality score, and the team’s response (re-run sweeps, swap models, escalate to the foundation-model vendor) is uncorrelated with the actual fix.
The fix is structural, not statistical. Tag every rubric by layer in the eval store. Compute layer-level aggregates if you must roll up: retrieval_health = mean(ContextRelevance, ChunkAttribution, ChunkUtilization) and generation_health = mean(Groundedness, AnswerRelevance, Completeness) are defensible. A single global RAG score is not.
The retrieval-level metrics
Retrieval-level rubrics score chunks against the query, independent of what the generator wrote. They are the upstream signal: every generation rubric depends on retrieval doing its job, which is why the bisect starts here.
ContextRelevance. For each retrieved chunk, is the chunk relevant to the user’s query. Computed by an LLM judge or a fine-tuned classifier scoring each chunk on a 0-to-1 scale, then averaging (or taking the top-k mean). Fails when vector similarity surfaces a chunk that lexically looks close but answers a different question — the classic “asked about Section 12, got Section 9” failure. Template: ContextRelevance (cloud eval_id=9).
ContextRecall. Did the retriever return the chunks that contain the answer. This is the recall side of retrieval and the rubric that catches missing chunks rather than wrong chunks. Requires an answer key with the expected source chunk IDs; cheap to compute once the dataset is labeled. Local templates in python/fi/evals/metrics/rag/retrieval/: context_recall, context_precision, context_entity_recall. Pair with the IR-style recall_at_k, precision_at_k, mrr, ndcg when you want the rank-aware version.
ChunkAttribution. When the response cites chunk X for a claim, does chunk X actually support that claim. Two failure modes: the chunk is cited but does not contain the quoted span (a citation hallucination), and the chunk is cited but supports a different claim (a citation misalignment). Matters most where citation correctness is a product requirement: legal, medical, financial, compliance. Template: ChunkAttribution (cloud eval_id=11). Local equivalent: source_attribution in python/fi/evals/metrics/rag/advanced/.
ChunkUtilization. How many of the retrieved chunks actually contributed to the response. Top-10 retrieval that the generator uses 2 of is paying token cost on 8 unused chunks plus the latency of generating context for them. The rubric earns its keep in cost optimisation: utilisation below 30 percent is a signal to retrieve fewer chunks, add a reranker before generation, or both. Template: ChunkUtilization (cloud eval_id=12). Local equivalent: context_utilization in python/fi/evals/metrics/rag/generation/.
from fi.evals import Evaluator
from fi.evals.templates import (
ContextRelevance, ChunkAttribution, ChunkUtilization,
)
ev = Evaluator(fi_api_key="...", fi_secret_key="...")
retrieval_health = ev.evaluate(
eval_templates=[ContextRelevance(), ChunkAttribution(), ChunkUtilization()],
inputs=[{"input": question, "output": answer, "context": retrieved_chunks}],
)
The generation-level metrics
Generation-level rubrics score the response against the retrieved context, taking retrieval as given. A regression here with retrieval rubrics steady means the bug is the generator: a prompt change, a model swap, a decoding parameter, or the model itself drifting.
Groundedness (a.k.a. Faithfulness). For every claim in the response, is the claim supported by at least one span of the retrieval context. The canonical RAG hallucination rubric. Extract claims (a sentence-splitter or a structured-output schema), score each claim’s entailment against the context using either an NLI model (cheap, deterministic, fast) or an LLM-as-judge (more expensive, handles nuance, slower), and return the fraction of supported claims. Ragas calls it faithfulness; the Ragas paper lays out the claim-extraction methodology that most implementations now follow. Template: Groundedness (cloud eval_id=47). Local equivalents: groundedness, faithfulness, claim_support (DeBERTa-NLI, no API call).
ContextAdherence. Does the response stick to the retrieval context or introduce information from outside. A response can be grounded (every claim is supported) and still fail adherence (the model added context-irrelevant elaboration the user did not ask for). Matters most in regulated workloads where off-policy answers are a compliance issue. Template: ContextAdherence (cloud eval_id=5).
AnswerRelevance. Does the response actually answer the question. The rubric that catches the model writing eloquently around the question without answering it, common in long-context RAG where the generator pads with context summary instead of extracting the asked-for fact. Implemented as a judge scoring (question, answer) directly or via the Ragas reverse-question trick: generate plausible questions from the answer and measure their similarity to the original. Local template: answer_relevancy in python/fi/evals/metrics/rag/generation/. The SDK exposes the related AnswerRefusal (cloud eval_id=88) for the inverse: did the model refuse when it should have answered.
Completeness. Does the response cover the important facts the question demanded. Fails when the model picks one fact and stops, the multi-part question pattern where the user asks three things and gets one. Cheapest to score with a pre-defined expected-fact list per question (much cheaper to annotate than a full reference answer); LLM-judge mode works without the explicit list but with more variance. Template: Completeness (cloud eval_id=10).
The cross-cutting metrics
Cross-cutting rubrics fail when either layer is broken or when the source data itself is wrong. They are the rubrics that surface bugs the layer-specific rubrics cannot see.
FactualAccuracy. Are the claims in the response true, regardless of the retrieval context. The rubric that fires when Groundedness passes but the answer is still wrong — usually because the retrieved source was wrong (stale documentation, an incorrect knowledge-base entry, a fabricated training example) and the model faithfully grounded the response in the wrong source. The fix is upstream of the model: data quality, source freshness, or removing the bad chunk from the corpus. Template: FactualAccuracy (cloud eval_id=66). Local equivalents: factual_consistency, claim_support, contradiction_detection in python/fi/evals/metrics/hallucination/.
Citation validity. Every cited span exists verbatim (or within a small fuzzy tolerance) in the retrieval context. Deterministic, no LLM judge required, runs in milliseconds. The cheapest rubric in the stack and the one that catches the most embarrassing class of failure: a fabricated quote pinned to a real chunk ID, where structured output passed and schema validation passed and the user reads a sentence the document does not contain. Run on every response in production; cost is negligible.
def citation_validity(answer, chunks: list[dict]) -> float:
chunk_map = {c["id"]: c["text"] for c in chunks}
if not answer.citations:
return 1.0
valid = sum(
1 for c in answer.citations
if c.chunk_id in chunk_map and c.quoted_span in chunk_map[c.chunk_id]
)
return valid / len(answer.citations)
Metric-pair tradeoffs
Three pairs of rubrics pull in opposite directions, and the tradeoff is what your eval set is designed to expose.
ContextRecall vs ContextPrecision. Increase top-k retrieval and recall goes up; precision goes down. Aggressive recall surfaces the answer-bearing chunk reliably and pays in noise that the generator must filter. Aggressive precision narrows the retrieval surface and risks missing the right chunk on hard queries. Tune the operating point against ChunkUtilization in production: utilisation below 25 percent says you are over-fetching; recall below 90 percent on your gold set says you are under-fetching.
Groundedness vs Completeness. A model rewarded only on groundedness learns to answer narrowly and refuse on partial coverage; it sticks to what the context strictly supports, which scores high on groundedness and low on completeness. A model rewarded only on completeness learns to elaborate beyond the context, which scores high on completeness and low on groundedness. The two have to be reported together or the prompt-tuning loop optimises one at the expense of the other.
ContextAdherence vs AnswerRelevance. Strict adherence forbids outside knowledge; some questions cannot be answered well without it (a definition the corpus does not contain, a follow-on the user expects from a competent assistant). The right move in compliance-sensitive domains is high adherence and accepting the relevance cost; the right move in casual Q&A is lower adherence and higher relevance. The product owner picks the operating point; the eval set encodes the choice.
The diagnostic decision tree
When a regression lands, walk the tree:
- ContextRelevance dropped? Retrieval regressed. Bisect the chunker, embedder, vector store config, or reranker change. Generation rubrics will follow within a release or two.
- ContextRelevance steady, Groundedness dropped? Generator regressed. Bisect the prompt, model version, temperature, or decoding params.
- Groundedness steady, FactualAccuracy dropped? Source data regressed. Audit the corpus: stale chunks, incorrect entries, freshly indexed bad data.
- Groundedness steady, Completeness dropped? Model is truncating or refusing more often. Check max tokens, refusal calibration, prompt edits that added “if unsure, say so” language.
- Groundedness steady, ChunkAttribution dropped? Citations are fabricating. The generator passed the schema validator with quotes that do not exist in the cited chunks. Tighten the structured-output schema and run citation validity deterministically.
- ChunkUtilization dropped? Retriever over-fetched (often after a top-k increase). Reduce top-k, add a reranker, or both.
- AnswerRelevance dropped, everything else steady? Prompt or model is wandering. Sample failing traces; the pattern is usually verbose pre-amble.
The tree is the whole point of layer-tagged rubrics. Without it you have a number that moved; with it you have a layer to fix.
Production observability and Error Feed clustering
Offline eval catches the regressions you can think of; production catches the rest. The same rubric set runs as span-attached scorers against live traces via traceAI (Apache 2.0; 50+ AI surfaces across Python, TypeScript, Java, C#; 14 span kinds including first-class RETRIEVER, RERANKER, EMBEDDING). Sample 5 to 10 percent of traffic for LLM-judge rubrics; run citation validity and ChunkUtilization on 100 percent because they are cheap.
The signal that matters is rolling-mean delta per rubric per route, not the absolute score. A 2 to 5 point sustained drop in ContextRelevance over 30 to 90 minutes is the prompt that the chunker change shipped. A drift between offline CI pass and online rolling mean is a representativeness signal on the eval set itself; close the gap by promoting failing production traces into the dataset.
Error Feed sits inside the eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every failing trace into a named issue. A Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, eight span-tools including read_span, get_children, submit_finding) reads the failing trace, writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score. The named issues are the unit of triage; the immediate_fix is the unit of work; representative traces from each cluster promote into the eval set under engineer or domain-expert sign-off. The dataset ratchets stronger every week; the CI gate catches more regressions every quarter.
Common RAG eval mistakes
The mistakes that show up in eval reviews almost every time:
- Reporting one aggregate “RAG quality” score. Hides which layer moved. Tag rubrics by layer and report them separately.
- Only running Groundedness. Misses retrieval bugs entirely. ContextRelevance is the first rubric, not the last.
- No expected-fact list for Completeness. A reference answer is one human’s opinion; an expected-fact list is deterministic and cheaper to maintain.
- No deterministic citation validity. LLM-judge attribution is the slow path; verbatim string match plus fuzzy tolerance catches fabricated quotes on every response for nearly free.
- Frozen eval set. RAG eval data goes stale fast as source corpora and user-question distributions shift. Refresh weekly from production failures or the gate stops catching the bugs your users actually file.
- Eval on a separate dashboard. Without span-attached scores, no one looks at the eval and the regression ships to production before anyone notices.
How Future AGI ships the RAG eval stack
Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.
- ai-evaluation SDK (Apache 2.0):
from fi.evals import Evaluator, Protect. Every canonical RAG template as a ready-to-use class:Groundedness(eval_id 47),ContextAdherence(5),ContextRelevance(9),Completeness(10),ChunkAttribution(11),ChunkUtilization(12),FactualAccuracy(66),AnswerRefusal(88). Local NLI-based equivalents (groundedness,faithfulness,claim_support,factual_consistency,contradiction_detection,context_recall,context_precision,context_entity_recall,noise_sensitivity,ndcg,mrr,precision_at_k,recall_at_k,context_utilization,answer_relevancy,multi_hop_reasoning,source_attribution) that run on a DeBERTa model with no API call. 13 guardrail backends (9 open-weight, 4 API). 8 sub-10ms Scanners. Four distributed runners (Celery, Ray, Temporal, Kubernetes).RailType.INPUT/OUTPUT/RETRIEVALplusAggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED. - Future AGI Platform: self-improving evaluators tuned by user feedback; in-product authoring agent generates custom RAG rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2 for high-volume RAG-trace scoring.
- Error Feed (inside the eval stack): HDBSCAN clustering on ClickHouse embeddings groups failing RAG traces into named issues; Sonnet 4.5 Judge agent writes the
immediate_fix; lawyer- or engineer-reviewed promotions feed the dataset and the Platform’s self-improving evaluators. - traceAI (Apache 2.0): 50+ AI surfaces across Python, TypeScript, Java, C#; 14 span kinds including
RETRIEVER,RERANKER,EMBEDDING; pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY); 62 built-in evals viaEvalTagfor span-attached scoring on live traces.
Ready to wire the rubric set into a CI gate? Bind ContextRelevance, Groundedness, Completeness, ChunkAttribution, and citation validity into a pytest fixture this afternoon against the ai-evaluation SDK, then add the traceAI instrumentor when production traces start asking questions the CI gate missed.
Related reading
Frequently asked questions
Why is reporting a single aggregate RAG quality score the wrong thing to do?
What is the canonical split between retrieval-level and generation-level metrics?
What is the difference between Groundedness, ContextAdherence, and FactualAccuracy?
When do I need ChunkAttribution and ChunkUtilization specifically?
How do AnswerRelevance and Completeness differ, and why are they not the same rubric?
Where does the diagnostic decision tree start when a RAG regression lands in production?
What does Future AGI ship for RAG eval, and where does it sit against alternatives?
Summarization eval is four rubrics, not one number: groundedness, completeness, factuality, conciseness. Scored independently, calibrated against humans, run in CI. The 2026 guide.
Citation eval is three rubrics, not one: did the model emit a citation, does it resolve, and does the source actually contain the claim. The methodology, with code.
Why answer-level Groundedness hides RAG hallucinations, and how claim-level decomposition, cherry-pick detection, and sycophantic-restatement scoring fix it. Methodology for senior ML engineers.