What Is RAG Evaluation?
The component-level measurement of retrieval-augmented generation quality across retrieval, generation, and final answer using metrics like groundedness and context relevance.
RAG evaluation is the discipline of measuring quality across a retrieval-augmented generation pipeline by scoring its components independently rather than judging only the final answer. It typically runs three families of evaluators: retrieval-quality (ContextRelevance, ChunkAttribution, ContextPrecision, ContextRecall), generation-quality (Groundedness, Faithfulness), and answer-quality (AnswerRelevancy, end-to-end TaskCompletion). The output is a per-component score per request, dashboarded and thresholded so engineers can pinpoint whether the retriever, the chunker, or the generator regressed. Without it, a fluent but wrong answer looks identical to a correct one, which is the canonical 2026 RAG hallucination failure mode.
Why RAG evaluation matters in production LLM and agent systems
Most RAG failures are not generation failures; they are silent retrieval failures. The LLM produces confident prose regardless of whether the retrieved chunks were relevant, so a broken retriever shows up as “the answers are fine, but customers say they’re wrong.” Without component-level evaluation, the team blames the model, swaps prompts for two weeks, and ships nothing because the prompt was never the problem. In our 2026 evals across 40+ enterprise RAG deployments, retrieval was the root cause of ~58% of quality regressions; generation accounted for ~24%, and the remainder split between prompt engineering regressions and model upgrades.
The pain shows up across roles. ML engineers see vague “quality is down” tickets without the signals to diagnose them. Retrieval engineers cannot prove their new embedding model is better without a relevance metric. Compliance leads cannot answer “how do you know the model isn’t fabricating policy quotes?” because they only have a thumbs-down rate. Product managers ship and roll back the same PR three times because the eval signal lags user complaints by days.
In 2026 agentic RAG stacks, the failure mode compounds. An agent that retrieves at step one, decides at step two, and acts at step three carries every retrieval error forward as a wrong tool call, a wrong refund, a wrong invoice. Trajectory-level evals plus per-step RAG evals are the only way to localise the breakage; otherwise you are debugging a four-step trace by reading text. Corrective-RAG and self-RAG patterns assume an evaluator is in the loop; without one, the “corrective” step has nothing to correct against.
The compliance dimension matters too. In regulated domains (legal, medical, financial), an ungrounded answer is not a quality bug; it is a control failure. The EU AI Act treats unsupported claims by a high-risk system as documented non-compliance under Art. 15 robustness. That moves Groundedness and Faithfulness from “nice to have” into release-blocking signals. RAG evaluation is the mechanism that produces the audit evidence those regulators ask for.
Long-context models (Claude Opus 4.7 at 1M tokens, Gemini 3.x at 2M, GPT-5.x at 1M as of May 2026) did not eliminate the need for RAG evaluation; they made NoiseSensitivity and ContextUtilization the critical signals. When you can stuff 800K tokens into context, the question shifts from “did the retriever find the doc?” to “did the model find the relevant 200 tokens inside the doc?” That is now a measurable failure mode, not a guess.
Why the leaderboards moved past Ragas in 2026
Ragas, RAGElo, and BEIR-style retrieval benchmarks dominated the 2023–2024 conversation. By 2026 the operational gap has become clear: those tools score offline against fixed datasets, but production RAG fails in ways fixed datasets cannot capture. Three failure classes show up in our 2026 incident reviews almost weekly:
- Index drift: an embedding-model upgrade or a chunking change shifts the retrieval distribution. Offline scores on the old golden dataset stay flat; production traces tell a different story.
- Query-distribution drift: users start asking new kinds of questions (a launch, a regulation change, a market event). The golden dataset has no coverage. ContextRelevance drops on the new cohort but the aggregate looks fine.
- Adversarial retrieval: a prompt injection lands in the index via an untriaged document. Retrieval surfaces it, generation follows it, the user gets compromised. Offline eval has no visibility.
These all require online, trace-anchored evaluation. That is the gap FutureAGI fills relative to a notebook-style Ragas pipeline.
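The query-distribution drift case can be illustrated with a small sketch: a per-cohort mean exposes a drop that the aggregate hides. The cohort tags, scores, and helper names below are hypothetical, and the 0.6 threshold is only an illustrative value for a retrieval score.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-trace results: (cohort tag, ContextRelevance score in [0, 1]).
scored_traces = [
    ("billing", 0.82), ("billing", 0.80), ("billing", 0.84),
    ("billing", 0.81), ("billing", 0.83), ("billing", 0.79),
    ("new_regulation", 0.41), ("new_regulation", 0.38),
]


def mean_by_cohort(rows):
    """Group scores by cohort tag and return the mean score per cohort."""
    buckets = defaultdict(list)
    for cohort, score in rows:
        buckets[cohort].append(score)
    return {cohort: mean(scores) for cohort, scores in buckets.items()}


def cohorts_below(rows, threshold):
    """Return the sorted cohort tags whose mean score is under the threshold."""
    return sorted(c for c, m in mean_by_cohort(rows).items() if m < threshold)


# The aggregate stays above the threshold while one cohort is clearly failing.
aggregate = mean(score for _, score in scored_traces)
flagged = cohorts_below(scored_traces, threshold=0.6)
print(f"aggregate={aggregate:.2f} flagged={flagged}")
```

Alerting on the flagged cohorts rather than on the aggregate is the design choice that makes a small, newly arriving query population visible.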
How FutureAGI handles RAG evaluation
FutureAGI’s approach is to ship RAG-specific evaluators as first-class fi.evals classes and wire them to traces collected via traceAI integrations. The headline metric is RAGScoreDetailed, which returns context relevance, groundedness, and answer relevancy in a single call, plus an aggregated RAGScore. Specialised evaluators give surgical diagnostics when the headline drops: ChunkAttribution (did the answer reference any retrieved chunk?), ChunkUtilization (how much of the chunk did it use?), NoiseSensitivity (does irrelevant context degrade the answer?), ContextEntityRecall (entity-level retrieval completeness), and MultiHopReasoning (chains across chunks).
Concretely: a team running a Haystack pipeline instruments with traceAI-haystack, captures retriever and generator spans, and samples production traces into a Dataset. They attach RAGScoreDetailed, ChunkAttribution, and NoiseSensitivity via Dataset.add_evaluation(). The dashboard shows three lines. When ContextRelevance drops, retrieval is the suspect: chunk size, chunking strategy, embedding model, or top-k. When Groundedness drops with ContextRelevance flat, the generator is hallucinating despite good context, usually because of a model swap or prompt template regression. When NoiseSensitivity rises, the retriever is bringing back distractors that are degrading reasoning. Each signal points at a different fix.
The same evaluators run online: fi.evals.Groundedness configured as a real-time eval fires on every span where retrieval.documents is present, writes its score back as a span event, and triggers an alert if the rolling fail-rate crosses threshold. That is RAG evaluation as production infrastructure, not a notebook artifact.
Unlike Ragas, which is offline-first and ties to a fixed dataset, or DeepEval’s RAGEval, which scores end-to-end without persisting per-component scores back to a span, FutureAGI ties every evaluator call back to a trace span via OTel attributes, so a Groundedness regression is debuggable down to the prompt, retrieved chunks, model version, embedding version, and reranker config that produced it.
Building the golden dataset for RAG evaluation
A RAG golden dataset is more than question/answer pairs. Each row needs five fields to support component-level evaluation:
- Question: the user input.
- Expected answer: for AnswerRelevancy and reference-based metrics.
- Expected retrieved chunks: labeled relevant documents for ContextRecall and ContextPrecision. Without this, retrieval-only evaluation is impossible.
- Acceptable hallucination tolerance: a per-row flag for whether the model can extrapolate or must stay strictly grounded.
- Cohort tag: domain, intent, complexity (single-hop vs multi-hop), refusal expected, etc. Required for slice-level analysis.
We recommend 400–1500 rows for an enterprise RAG eval, with 5–10% refreshed monthly from production. Dual-annotator labeling with arbitration is the bar; single-annotator labels carry 10–15% disagreement on free-text RAG answers, which caps the precision any evaluator can be shown to achieve against them.
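The five fields can be sketched as a small record type with a validation pass. This is a minimal illustration, not a FutureAGI schema: `GoldenRow`, `validate`, the field names, and the example rows are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GoldenRow:
    """One golden-dataset row carrying the five fields described above."""
    question: str
    expected_answer: str
    expected_chunk_ids: tuple      # labeled relevant documents or chunks
    strict_grounding: bool         # True = the answer must stay inside the context
    cohort: dict = field(default_factory=dict)  # e.g. domain, intent, complexity


def validate(row):
    """Return a list of problems that would block component-level evaluation."""
    problems = []
    if not row.question.strip():
        problems.append("question is empty")
    if not row.expected_answer.strip():
        problems.append("expected answer is empty")
    if not row.expected_chunk_ids:
        problems.append("no expected chunks: retrieval-only metrics cannot run")
    if not row.cohort:
        problems.append("no cohort tag: slice-level analysis cannot run")
    return problems


# Hypothetical rows for illustration only.
good_row = GoldenRow(
    question="What's our SLA on P1 incidents?",
    expected_answer="P1 incidents get a 1-hour response and 4-hour resolution.",
    expected_chunk_ids=("sla-doc#p1",),
    strict_grounding=True,
    cohort={"domain": "support", "complexity": "single-hop"},
)
incomplete_row = GoldenRow("Can I get a refund?", "Yes, within 30 days.", (), False)

print(validate(good_row))
print(validate(incomplete_row))
```

Running a check like this at ingestion time keeps rows without chunk labels or cohort tags from silently weakening retrieval and slice-level metrics.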
What to score at each RAG layer (May 2026)
| Layer | Question being answered | Primary evaluator(s) | Secondary evaluators | Alert threshold |
|---|---|---|---|---|
| Query rewrite | Is the rewritten query better than the original? | AnswerRelevancy on rewrite | none | < 0.7 |
| Retrieval | Did we fetch the right chunks? | ContextRelevance, ContextRecall | ContextPrecision, MRR, NDCG | < 0.6 |
| Chunk-level | Was each chunk useful? | ChunkAttribution, ChunkUtilization | none | Attribution = Failed |
| Rerank | Is top-K ordering correct? | ContextPrecision | NDCG | < 0.7 |
| Noise robustness | Does irrelevant context hurt? | NoiseSensitivity | none | > 0.3 (lower is better) |
| Generation grounding | Does the answer stay in-context? | Groundedness, Faithfulness | RAGFaithfulness | < 0.85 |
| Answer quality | Does the answer address the question? | AnswerRelevancy | IsHelpful, Completeness | < 0.8 |
| Citation | Are sources attributed? | SourceAttribution, CitationPresence | none | Failed |
| Multi-hop | Does reasoning chain across chunks? | MultiHopReasoning | none | < 0.7 |
| End-to-end | Aggregate health | RAGScore, RAGScoreDetailed | TaskCompletion | < 0.75 |
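As a rough sketch, the numeric thresholds in the table can be encoded as data and checked per request. The dictionary layout and the `breaches` helper are assumptions for illustration; the pass/fail rows (chunk attribution, citation) are left out, and the answer-quality value is used for AnswerRelevancy.

```python
# Numeric alert thresholds taken from the table above.
# "min": alert when the score falls below the value.
# "max": alert when the score rises above the value (lower is better).
THRESHOLDS = {
    "ContextRelevance": ("min", 0.6),
    "ContextPrecision": ("min", 0.7),
    "NoiseSensitivity": ("max", 0.3),
    "Faithfulness": ("min", 0.85),
    "AnswerRelevancy": ("min", 0.8),
    "MultiHopReasoning": ("min", 0.7),
    "RAGScore": ("min", 0.75),
}


def breaches(scores):
    """Return the sorted evaluator names whose score crosses its threshold.

    Evaluators missing from `scores` are skipped rather than treated as failures.
    """
    flagged = []
    for name, (direction, limit) in THRESHOLDS.items():
        if name not in scores:
            continue
        value = scores[name]
        if (direction == "min" and value < limit) or (direction == "max" and value > limit):
            flagged.append(name)
    return sorted(flagged)


# Hypothetical scores for a single request.
request_scores = {
    "ContextRelevance": 0.55,
    "NoiseSensitivity": 0.35,
    "Faithfulness": 0.90,
    "RAGScore": 0.75,
}
print(breaches(request_scores))
```

Keeping thresholds in one table-like structure makes per-cohort overrides a matter of swapping the dictionary rather than editing alert logic.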
Wiring online RAG evaluation into the trace pipeline
The 2026 best practice is online RAG evaluation: every production trace that contains a retrieval.documents span gets scored, with results written back as a span event. The setup we recommend:
- Sample rate: 100% for low-volume regulated routes; 5–20% for high-volume general routes.
- Evaluator stack: RAGScoreDetailed for headline, ChunkAttribution and NoiseSensitivity for component diagnostics.
- Alerting: trip a page when Groundedness fail-rate over a rolling 15-minute window exceeds 2x baseline.
- Sampling-into-eval: failed traces are auto-promoted into a candidate dataset, triaged by LLM-as-a-judge, and human-reviewed weekly.
- Cost control: cache evaluator results per (prompt_version, retriever_version, embedding_version, question_hash) tuple to avoid re-scoring identical traces.
This is the difference between a RAG application that learns from production and one that ships and prays.
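The alerting rule above (fail-rate over a rolling 15-minute window exceeding 2x baseline) can be sketched in plain Python. `RollingFailRateAlert`, the `min_samples` guard, and the example timestamps are assumptions made for illustration, not part of any FutureAGI API.

```python
from collections import deque


class RollingFailRateAlert:
    """Track pass/fail results in a time window and flag a fail-rate spike."""

    def __init__(self, baseline_fail_rate, window_seconds=15 * 60,
                 multiplier=2.0, min_samples=20):
        self.baseline = baseline_fail_rate
        self.window = window_seconds
        self.multiplier = multiplier
        self.min_samples = min_samples  # avoid paging on a handful of traces
        self.events = deque()           # (timestamp_seconds, failed) in arrival order
        self.failures = 0

    def record(self, timestamp, failed):
        """Add one evaluator result and return True if the alert should fire."""
        self.events.append((timestamp, failed))
        self.failures += int(failed)
        # Evict results that have fallen out of the rolling window.
        cutoff = timestamp - self.window
        while self.events and self.events[0][0] <= cutoff:
            _, old_failed = self.events.popleft()
            self.failures -= int(old_failed)
        return self.should_alert()

    def fail_rate(self):
        return self.failures / len(self.events) if self.events else 0.0

    def should_alert(self):
        enough = len(self.events) >= self.min_samples
        return enough and self.fail_rate() > self.multiplier * self.baseline


# Hypothetical stream: 40 passing results, then a burst of failures.
alert = RollingFailRateAlert(baseline_fail_rate=0.05)
quiet = [alert.record(t * 10, failed=False) for t in range(40)]
burst = [alert.record(400 + i * 10, failed=True) for i in range(5)]
print(any(quiet), burst)
```

The minimum-sample guard is a deliberate choice: on low-volume routes a single failure would otherwise exceed any multiple of a small baseline.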
Component-specific failure signatures and their fixes
The component breakdown lets you read a failure signature and know exactly where to look. The common signatures we see in 2026:
- ContextRelevance drops, others flat → embedding model, chunking strategy, or reranker config changed. Roll back the index pipeline and add a regression gate.
- Groundedness drops, ContextRelevance flat → generator prompt regressed, or the model upgrade is less faithful to context. Bisect on prompt versions and model.
- NoiseSensitivity rises, ContextRelevance flat → top-k is too high or reranker dropped its discriminative power. Re-run reranker config sweep.
- ChunkAttribution fails consistently → the model is generating from its weights, not the retrieved chunks. Tighten the prompt to require citation, or swap to a more grounded model.
- ContextRecall drops, ContextPrecision flat → corpus has new documents the retriever cannot find. Re-embed or update keyword indexing.
- MultiHopReasoning drops → multi-step queries are arriving but the retriever returns single-hop chunks. Consider agentic RAG with iterative retrieval.
- All components fine, RAGScore drops → answer quality is degrading despite retrieval and grounding. Check the answer-style judge: it may be reading verbosity or formatting changes as quality drops.
That kind of localised debugging is the whole point of component-level RAG evaluation.
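A simplified sketch of reading those signatures programmatically: given the change in each evaluator's mean score or pass rate against a baseline, return the first matching suspect. The `diagnose` function, the 0.05 tolerance, and the rule ordering are assumptions, and the rules are a coarse approximation of the list above rather than a complete decision procedure.

```python
def diagnose(delta, tolerance=0.05):
    """Map score changes (current minus baseline) to a likely suspect.

    `delta` holds per-evaluator changes; missing evaluators count as flat.
    """
    def change(name):
        return delta.get(name, 0.0)

    def dropped(name):
        return change(name) < -tolerance

    def rose(name):
        return change(name) > tolerance

    def flat(name):
        return abs(change(name)) <= tolerance

    if dropped("ContextRelevance"):
        return "retrieval: embedding model, chunking strategy, or reranker config"
    if dropped("Groundedness") and flat("ContextRelevance"):
        return "generation: prompt regression or a less faithful model"
    if rose("NoiseSensitivity") and flat("ContextRelevance"):
        return "noise: top-k too high or reranker lost discriminative power"
    if dropped("ContextRecall") and flat("ContextPrecision"):
        return "coverage: new documents the retriever cannot find"
    if dropped("MultiHopReasoning"):
        return "multi-hop: retriever returns single-hop chunks"
    if dropped("RAGScore"):
        return "judge: check the answer-style judge for verbosity or format bias"
    return "no known signature"


# Hypothetical deltas from two different regressions.
print(diagnose({"Groundedness": -0.12, "ContextRelevance": 0.01}))
print(diagnose({"NoiseSensitivity": 0.10}))
```

In practice the output of a helper like this is a starting hypothesis for the on-call engineer, not a verdict.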
The evaluator-judge model question
A 2026 RAG evaluator is almost always a judge model. Groundedness, Faithfulness, AnswerRelevancy, and the rest are LLM-as-a-judge implementations. The choice of judge model materially affects the eval. Our 2026 defaults:
- Use a judge from a different family than the generator. If you generate with Claude Sonnet 4.6, judge with GPT-5.x or Gemini 3.x.
- Pin the judge model version explicitly. A judge upgrade is an eval-system change, not a transparent improvement.
- Calibrate the judge against a human-labeled subset quarterly. If judge-vs-human agreement drops below 0.85, re-prompt or swap the judge.
- For high-stakes regulated routes, ensemble two judges and require agreement. Single-judge eval has 5–12% variance; two-judge ensemble cuts it roughly in half.
These rules apply to every cloud-template fi.evals evaluator. The same evaluator class can return different scores on the same input depending on judge-model and prompt version; track both.
Standard public anchors to bake into the regression set: RAGTruth (Salesforce, 18K labeled chunks, 4 task types; the cleanest signal for Groundedness, where the median frontier model fails on 5–8% of answers), RAGBench (12 RAG tasks across 6 domains, 100K+ examples with TRACe-style component labels), CRAG (Meta, 4400 stratified questions with adversarial noise injection; frontier RAG pipelines score 30–45% accuracy before any custom tuning), and MultiHop-RAG (2556 multi-hop questions over news, naive RAG ~35%, graph-augmented or sub-question-decomposed pipelines 65–75%). For long-context RAG specifically, RULER (NVIDIA, 4K–128K) and BABILong are the right stressors: frontier models drop 15–30 points on multi-hop variable tracking as context grows, which is the cleanest evidence that NoiseSensitivity belongs in every release gate.
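The calibration and ensemble rules can be sketched with two small helpers. Raw agreement rate is assumed here as the judge-vs-human measure (the text does not say whether it means raw agreement or a chance-corrected statistic), and routing disagreements to human review is one possible reading of "require agreement". All names and labels below are hypothetical.

```python
AGREEMENT_FLOOR = 0.85  # re-prompt or swap the judge below this level


def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge verdict matches the human label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def ensemble_verdict(judge_a_pass, judge_b_pass):
    """Combine two judges: agree -> that verdict, disagree -> human review."""
    if judge_a_pass and judge_b_pass:
        return "pass"
    if not judge_a_pass and not judge_b_pass:
        return "fail"
    return "needs_review"


# Hypothetical pass/fail labels on a ten-item calibration subset.
judge = [True, True, False, True, False, True, True, False, True, True]
human = [True, True, False, True, True, True, True, False, True, False]

rate = agreement_rate(judge, human)
needs_recalibration = rate < AGREEMENT_FLOOR
print(rate, needs_recalibration, ensemble_verdict(True, False))
```

A real calibration subset would be far larger than ten items; the size here only keeps the arithmetic easy to follow.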
How to measure RAG quality end-to-end
A complete RAG eval stack scores all three layers (retrieval, generation, answer) and adds robustness, ranking, and operational signals:
- Retrieval quality: fi.evals.ContextRelevance returns 0–1 on whether the retrieved passage answers the input; pair with ChunkAttribution (pass/fail) and ChunkUtilization (0–1) for chunk-level diagnosis.
- Generation grounding: fi.evals.Groundedness returns pass/fail on whether the answer is supported by the context; Faithfulness returns 0–1 across multiple claims; RAGFaithfulness is the RAG-specific multi-claim variant.
- Answer quality: fi.evals.AnswerRelevancy and RAGScore for end-to-end signal; Completeness for whether the answer covers the question fully.
- Robustness: fi.evals.NoiseSensitivity measures degradation when irrelevant context is added, which is critical for long-context RAG.
- Retrieval ranking: MRR, NDCG, PrecisionAtK, RecallAtK for retrieval engineering teams.
- Entity coverage: ContextEntityRecall when answers depend on named entities (legal, biomedical, financial).
- Reasoning chains: MultiHopReasoning for queries that span 2+ retrieved chunks.
- Dashboard signals: RAGScore mean per cohort, Groundedness fail-rate, ContextRelevance p10; alert on any of the three crossing threshold.
- User proxies: thumbs-down rate on sourced answers, citation click-through, escalation rate.
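    from fi.evals import RAGScoreDetailed, ChunkAttribution, NoiseSensitivity

    # Inputs for one scored request: user question, generated answer, retrieved chunks.
    question = "What's our SLA on P1 incidents?"
    answer = "P1 incidents are responded to within 1 hour."
    chunks = ["...P1 SLA: 1-hour response, 4-hour resolution..."]
    # Placeholder distractor for the noise test; use real off-topic chunks in practice.
    mixed_relevant_and_irrelevant_chunks = chunks + ["...office parking policy..."]

    scorer = RAGScoreDetailed()
    attribution = ChunkAttribution()
    noise = NoiseSensitivity()

    # Headline score plus the per-component breakdown.
    result = scorer.evaluate(input=question, output=answer, context=chunks)
    # Did the answer draw on any retrieved chunk?
    attribution_result = attribution.evaluate(output=answer, context=chunks)
    # Does the answer degrade when distractors are mixed into the context?
    noise_result = noise.evaluate(
        input=question,
        output=answer,
        context=mixed_relevant_and_irrelevant_chunks,
    )

    print(result.score, attribution_result.score, noise_result.score)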
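For the retrieval-ranking metrics named in the list, the standard binary-relevance definitions are short enough to sketch in plain Python. These are textbook formulas, not the fi.evals implementations; the function names and the document IDs are illustrative.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of all relevant documents that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """MRR over an iterable of (ranked_ids, relevant_ids) pairs."""
    ranks = [reciprocal_rank(ranked, relevant) for ranked, relevant in queries]
    return sum(ranks) / len(ranks) if ranks else 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical retrieval result: two of three relevant chunks found, at ranks 2 and 4.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d8"}
print(precision_at_k(ranked, relevant, 3), recall_at_k(ranked, relevant, 5))
print(mean_reciprocal_rank([(ranked, relevant)]), round(ndcg_at_k(ranked, relevant, 5), 3))
```

These metrics need the labeled expected chunks from the golden dataset, which is why retrieval-only evaluation is impossible without that field.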
Common mistakes (May 2026 edition)
- Scoring only the final answer. A single end-to-end score hides whether retrieval or generation broke. Run RAGScoreDetailed or component evaluators side by side from day one.
- Using BLEU or ROUGE for RAG. Reference-overlap metrics fail on open-ended answers and reward verbatim copying; use Groundedness, AnswerRelevancy, and Faithfulness instead.
- Evaluating only the golden dataset. Static eval sets miss real query distribution. Sample production traces continuously into the eval cohort. Pin a “live cohort” alongside the golden dataset.
- Letting the judge model match the generator. Self-evaluation inflates scores; pin the judge to a different model family, unlike Ragas, which often defaults to the same model used in the chain. See LLM-as-a-judge.
- No retrieval-only eval. Teams skip ContextRelevance because retrieval “feels right”, then spend weeks tuning prompts when the bug is in the embedding model.
- Ignoring NoiseSensitivity in long-context setups. A 1M-token context window does not mean the model uses 1M tokens well. Score noise robustness explicitly.
- No regression eval after re-indexing. Embedding upgrades, chunking changes, and reranker swaps shift retrieval distribution. Pin a regression eval and re-run on every index rebuild.
- Treating citations as proof of grounding. Models hallucinate citations too. Pair CitationPresence with Groundedness: a citation that doesn’t support the claim is worse than no citation.
- One global threshold across cohorts. Legal RAG and shopping RAG have different acceptable Groundedness thresholds. Slice by domain and tune per cohort.
- Skipping evaluation in agentic-RAG step zero. When an agent decides whether to retrieve, score the retrieval decision with ToolSelectionAccuracy, not just the retrieved chunks.
- Confusing Faithfulness with Groundedness. Groundedness is a coarse pass/fail on whether the answer stays inside the context. Faithfulness decomposes the answer into claims and scores each. For multi-claim answers, run both.
- Ignoring chunk overlap in evaluation. Two chunks that overlap can both score “relevant” while only the boundary text is actually useful. ChunkUtilization distinguishes them.
- Not scoring agentic RAG trajectories at the step level. When an agent retrieves three times before answering, score retrieval on each call; a single end-to-end RAGScore hides the middle-step failures.
- Treating evaluator versions as immutable. Cloud-template evaluators are themselves models; they get upgraded. Pin the evaluator version in your eval runs and rerun the regression set whenever you upgrade.
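The step-level point can be made concrete with a small sketch: a trajectory whose mean retrieval score clears the threshold while one retrieval call clearly fails. The trajectory data, field names, and the 0.6 threshold are hypothetical.

```python
from statistics import mean

# Hypothetical agent trajectory: three retrieval calls before the final answer,
# each with its own ContextRelevance score.
trajectory = [
    {"step": 1, "query": "refund policy for annual plans", "context_relevance": 0.91},
    {"step": 2, "query": "customer plan type", "context_relevance": 0.32},
    {"step": 3, "query": "refund processing time", "context_relevance": 0.88},
]


def failing_steps(steps, threshold, key="context_relevance"):
    """Return the step numbers whose score falls under the threshold."""
    return [s["step"] for s in steps if s[key] < threshold]


def weakest_step(steps, key="context_relevance"):
    """Return the step record with the lowest score."""
    return min(steps, key=lambda s: s[key])


# The trajectory-level mean passes a 0.6 gate even though step 2 does not.
trajectory_mean = mean(s["context_relevance"] for s in trajectory)
bad_steps = failing_steps(trajectory, threshold=0.6)
print(round(trajectory_mean, 2), bad_steps, weakest_step(trajectory)["query"])
```

Storing the score per retrieval call, rather than one number per trace, is what makes the failing query recoverable afterwards.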
Frequently Asked Questions
What is RAG evaluation?
RAG evaluation is the structured measurement of a retrieval-augmented generation pipeline across retrieval, generation, and answer layers, using metrics like context relevance, groundedness, and answer relevancy to localise where quality breaks.
How is RAG evaluation different from generic LLM evaluation?
Generic LLM evaluation scores the final response. RAG evaluation also scores the retrieval step, because most RAG failures originate in retrieval rather than the model. You need component-level signals or you cannot fix the right thing.
How do you measure RAG quality?
FutureAGI's fi.evals package ships RAGScore (single weighted score) and RAGScoreDetailed (per-component breakdown), plus standalone Groundedness, ContextRelevance, ChunkAttribution, and ChunkUtilization evaluators.
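To illustrate the difference between a single weighted score and a per-component breakdown, here is a minimal sketch of a weighted aggregate. The weights and the `weighted_rag_score` helper are assumptions chosen for illustration; they are not the weights or formula FutureAGI's RAGScore uses.

```python
# Hypothetical component weights; the real RAGScore weighting is not stated here.
WEIGHTS = {"context_relevance": 0.3, "groundedness": 0.4, "answer_relevancy": 0.3}


def weighted_rag_score(components, weights=WEIGHTS):
    """Weighted mean of component scores in [0, 1]."""
    missing = sorted(set(weights) - set(components))
    if missing:
        raise ValueError(f"missing component scores: {missing}")
    total_weight = sum(weights.values())
    return sum(components[name] * w for name, w in weights.items()) / total_weight


# A breakdown where weak retrieval is masked by strong generation and answer scores.
breakdown = {"context_relevance": 0.5, "groundedness": 0.9, "answer_relevancy": 0.8}
headline = weighted_rag_score(breakdown)
weakest = min(breakdown, key=breakdown.get)
print(round(headline, 2), weakest)
```

The headline lands at 0.75 while context relevance sits at 0.5, which is why the detailed breakdown is the more useful signal for debugging.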