RAG Evaluation Metrics in 2026: Faithfulness, Context Precision, Context Recall, Groundedness, Answer Relevance, with FAGI fi.evals
A legal research RAG ships and scores 0.91 faithfulness on the offline eval set. Three weeks into production, customers complain that 1 in 6 responses misses a key statute. The team checks faithfulness in the dashboard: still 0.91. They check context recall: 0.62. The retriever was missing the second statute on multi-hop questions; the generator was answering coherently from the partial context, so faithfulness stayed high. No retrieval-stage metric on the dashboard surfaced the regression. This is what RAG evaluation looks like when you measure only the generation. This post is the 2026 picture: the six metrics that anchor a complete RAG evaluation program, what each catches, what each misses, and how FutureAGI fi.evals (Apache 2.0) templates run them against your traces.
TL;DR: The six metrics that matter in 2026
| Metric | Layer | What it catches | fi.evals template |
|---|---|---|---|
| Faithfulness | Generation | Whole-answer fabrication vs context | faithfulness |
| Context precision | Retrieval | Irrelevant chunks in the retrieved set | context_relevance / context_precision |
| Context recall | Retrieval | Missing chunks the answer needed | context_recall |
| Groundedness | Generation | Per-sentence unsupported claims | groundedness |
| Answer relevance | Generation | Off-topic answers | answer_relevancy |
| Hallucination | Final answer | Claims no chunk supports (safety net) | hallucination |
If you only read one row: a 2026 RAG eval program tracks at least one retrieval-stage metric and at least one generation-stage metric. Tracking only generation hides retrieval regressions; tracking only retrieval misses fabrication.
The two-stage evaluation problem
A RAG system has two stages: retrieve and generate. Each can fail independently. A complete evaluation program needs at least one metric per stage.
Stage 1: Retrieval. Did the retriever surface the right chunks for the query?
- Two failure modes: missed relevant chunks (low recall), surfaced irrelevant chunks (low precision).
- Metrics: context precision, context recall, Precision@k, Recall@k, MRR, NDCG.
Stage 2: Generation. Did the generator use the retrieved chunks to produce a correct, supported, on-topic answer?
- Three failure modes: fabricated claims (low faithfulness), per-sentence ungrounded claims (low groundedness), off-topic answers (low answer relevance).
- Metrics: faithfulness, groundedness, answer relevance, hallucination.
A pipeline can fail at stage 1 (the retriever missed the chunk) and still produce a “faithful” answer (the model was honest about the partial context, but the user got an incomplete answer). A pipeline can succeed at stage 1 (the retriever returned the right chunk) and fail at stage 2 (the model fabricated anyway).
Both failure modes are common in production. Both need their own metric.
Metric 1: Faithfulness
What it asks: is the generated answer supported by the retrieved context, with no fabricated claims?
How to score it: an LLM judge sees the answer and the retrieved chunks. It scores 0 to 1 (or labels each claim as supported, partially supported, or unsupported) based on whether the answer makes claims the context does not back. The composite is the faithfulness score.
When it fires: the model fabricated a fact. The context did not say “the contract expires January 1”; the model said it did.
When it does not fire: the answer is incomplete because the retrieval was partial. The model answered coherently from incomplete context; the answer matches what was retrieved, so faithfulness is high, but the user is missing information. This is a retrieval-stage problem, not a faithfulness problem.
fi.evals template: faithfulness (Apache 2.0, runs as Python or cloud).
# Real FAGI API
from fi.evals import evaluate

result = evaluate(
    "faithfulness",
    output=generated_answer,
    context=retrieved_chunks_concatenated,
)
print(f"Faithfulness: {result.score:.3f}")
The fast cloud judge (turing_flash, ~1-2s) is the default for online sampling. The deeper judges (turing_small ~2-3s, turing_large ~3-5s) run as offline batch or as second-stage evaluators when latency is not constrained.
Metric 2: Context precision
What it asks: of the chunks the retriever returned, what fraction are relevant to the query?
How to score it: an LLM judge labels each retrieved chunk as relevant or not, given the query. Context precision is relevant chunks divided by total chunks. Often computed as a weighted version where higher-ranked chunks count more.
When it fires: the retriever returned 10 chunks, 3 are relevant, 7 are noise. The generator is now drowning in 7 irrelevant chunks; context windows are wasted; the model may misweight the noise.
Why it matters: low precision tanks generation quality even when recall is fine. Re-rankers (Cohere Rerank, BGE Reranker, Jina Reranker) live to fix low context precision.
fi.evals template: context_relevance or context_precision.
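A minimal sketch of the arithmetic, assuming the judge's per-chunk relevance labels are already in hand as booleans in retriever rank order; context_precision and weighted_context_precision are illustrative names, not fi.evals functions.
# Illustrative sketch, not the fi.evals implementation
def context_precision(relevant: list[bool]) -> float:
    """Plain precision: relevant chunks / total retrieved chunks."""
    return sum(relevant) / len(relevant) if relevant else 0.0

def weighted_context_precision(relevant: list[bool]) -> float:
    """Rank-weighted variant: average precision@k over the positions
    that hold a relevant chunk, so early hits count more."""
    hits, score = 0, 0.0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / hits if hits else 0.0

# Retriever returned 5 chunks; the judge marked ranks 1, 2, and 5 relevant.
labels = [True, True, False, False, True]
print(context_precision(labels))           # 0.6
print(weighted_context_precision(labels))  # (1/1 + 2/2 + 3/5) / 3 ≈ 0.867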
Metric 3: Context recall
What it asks: of the chunks needed to fully answer the query, what fraction did the retriever return?
How to score it: requires a ground-truth set (the chunks a human labeled as necessary for the answer). Context recall is necessary chunks retrieved divided by total necessary chunks.
When it fires: the answer needed 3 chunks; the retriever returned 2; the generator answers from the 2 it has and the user gets an incomplete answer.
Why it matters: low recall is the silent regression. The model can still produce a fluent, faithful-looking answer from incomplete context; faithfulness stays high; users get partial information. The only way to surface low recall is to measure it explicitly.
fi.evals template: context_recall (requires ground-truth chunk labels).
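The arithmetic is small once the labels exist. A sketch, assuming chunks carry stable IDs and gold_ids is the human-labeled set of chunks necessary for the answer:
# Illustrative sketch; requires ground-truth chunk labels
def context_recall(retrieved_ids: set[str], gold_ids: set[str]) -> float:
    """Necessary chunks retrieved / total necessary chunks."""
    if not gold_ids:
        return 1.0  # nothing was required, so nothing was missed
    return len(retrieved_ids & gold_ids) / len(gold_ids)

# The answer needed 3 chunks; the retriever surfaced 2 of them.
print(context_recall({"c1", "c2", "c9"}, {"c1", "c2", "c3"}))  # ≈ 0.667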
For depth on retrieval-quality monitoring, see Best Retrieval Quality Monitoring Tools 2026.
Metric 4: Groundedness
What it asks: for each sentence in the answer, is there a retrieved chunk that supports it?
How to score it: an LLM judge labels each sentence as grounded (a backing chunk exists) or ungrounded. Groundedness is the fraction of sentences grounded.
When it fires: 4 of 5 sentences in an answer are supported by retrieved chunks; the 5th is a fabrication. Faithfulness as a single 0-1 score might average to 0.8 and look fine; groundedness flags 1 of 5 sentences ungrounded and gives you the exact sentence to fix.
Why it matters: granularity. Faithfulness for the dashboard, groundedness for the diagnostic. Production teams track both.
fi.evals template: groundedness.
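A sketch of the aggregation, assuming a regex sentence splitter; judge_supports is a stand-in for the per-sentence LLM-judge call, not a fi.evals function.
# Illustrative sketch of per-sentence grounding
import re

def groundedness(answer: str, chunks: list[str], judge_supports) -> float:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    flags = [(s, judge_supports(s, chunks)) for s in sentences]
    ungrounded = [s for s, ok in flags if not ok]  # the exact sentences to fix
    return (len(flags) - len(ungrounded)) / len(flags)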
Metric 5: Answer relevance
What it asks: does the answer actually address the user’s question, or is it fluent, on-topic-sounding text that never answers it?
How to score it: an LLM judge scores how well the answer addresses the specific query, 0 to 1. Penalizes verbose hedging, off-topic preambles, and answers that talk around the question.
When it fires: the user asks “what’s our refund policy”; the answer is a faithful summary of the company’s general support policy that does not mention refunds.
Why it matters: a faithful, grounded answer to the wrong question is still useless. Answer relevance is the closing-the-loop metric.
fi.evals template: answer_relevancy (the FAGI/RAGAS canonical name).
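The call shape mirrors the offline loop in the three-layer section below; user_query and generated_answer are placeholder variables.
# Real FAGI API, same call shape as the offline loop below
from fi.evals import evaluate

result = evaluate("answer_relevancy", input=user_query, output=generated_answer)
print(f"Answer relevance: {result.score:.3f}")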
Metric 6: Hallucination
What it asks: does the answer make claims that no retrieved chunk supports? Often run reference-free (no labeled ground truth needed).
How to score it: a hallucination detector flags claims that are factually questionable or unsupported. Often complementary to faithfulness; runs as a final safety net before the response ships.
When it fires: the model invented something the context did not say. Faithfulness should also catch this; hallucination is the second line of defense.
fi.evals template: hallucination.
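Assuming the template takes the same output and context arguments as faithfulness (an assumption; verify against the fi.evals docs in Sources), the call looks like:
# Assumed call shape; verify argument names against the fi.evals docs
from fi.evals import evaluate

result = evaluate("hallucination", output=generated_answer, context=retrieved_chunks)
print(f"Hallucination: {result.score:.3f}")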
For the broader hallucination story, see Understanding LLM Hallucination 2026 and Detect Hallucination Generative AI 2025.
Classic IR metrics still earn their place
When you have labeled gold documents per query, the discrete metrics remain useful:
- Precision@k: fraction of top-k retrieved chunks that are relevant. Target 0.7+ for narrow domains, 0.5+ for broad.
- Recall@k: fraction of relevant chunks that appear in top-k. Target 0.8+ at k=20 for broad datasets.
- MRR (Mean Reciprocal Rank): 1/rank of the first relevant result, averaged across queries. Higher means the right chunk lands at the top.
- NDCG@k: relevance-weighted ranking score. Target 0.8+ at k=10.
These are the IR-flavored versions of context precision and context recall. Use them when you have labels; use the LLM-judge versions when you do not. Many production stacks compute both and reconcile.
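Minimal reference implementations for intuition, assuming binary relevance labels in retriever rank order; these are sketches, not a tuned eval harness.
# Illustrative sketches of the four IR metrics
import math

def precision_at_k(ranked: list[bool], k: int) -> float:
    return sum(ranked[:k]) / k

def recall_at_k(ranked: list[bool], k: int, total_relevant: int) -> float:
    return sum(ranked[:k]) / total_relevant if total_relevant else 0.0

def mrr(ranked_per_query: list[list[bool]]) -> float:
    """1/rank of the first relevant result, averaged across queries."""
    rr = [next((1 / (i + 1) for i, rel in enumerate(q) if rel), 0.0)
          for q in ranked_per_query]
    return sum(rr) / len(rr)

def ndcg_at_k(ranked: list[bool], k: int) -> float:
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked[:k]))
    ideal = sorted(ranked, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0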
How to run RAG evaluation: the three-layer pattern
Layer 1: Offline gold-set evaluation
Build a labeled set: 100 to 500 query-context-answer tuples. Each query has the gold chunks the retriever should return and the gold answer the generator should produce.
Run the eval suite on every prompt change, model swap, retriever change, or chunking-strategy change. The composite metric (faithfulness, context recall, answer relevance, optionally cost-per-correct-answer) gates the merge.
# Real FAGI API
from fi.evals import evaluate

results = []
for query, context, answer, gold_chunks in test_set:
    # gold_chunks feed the context_recall eval (requires labels; see Metric 3)
    faith = evaluate("faithfulness", output=answer, context=context)
    relev = evaluate("answer_relevancy", input=query, output=answer)
    results.append({
        "faithfulness": faith.score,
        "answer_relevance": relev.score,
    })

mean_faith = sum(r["faithfulness"] for r in results) / len(results)
mean_relev = sum(r["answer_relevance"] for r in results) / len(results)
print(f"Faithfulness: {mean_faith:.3f}, Relevance: {mean_relev:.3f}")
Layer 2: Online sample evaluation
Score a sample (5 to 20 percent) of production traffic with fi.evals templates running as cloud evaluators tied to OTel spans. The faster cloud judge (turing_flash, ~1-2s) covers the bulk of online sampling; the deeper judges (turing_small, turing_large) run async and attach scores to the trace.
This catches drift that the gold set will not: real users, real queries, real distribution shifts.
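A sketch of the sampling hook, assuming a synchronous evaluate call; maybe_score and SAMPLE_RATE are illustrative names, and in the real wiring the score attaches to the trace's OTel span (next section).
# Illustrative sampling hook
import random

from fi.evals import evaluate

SAMPLE_RATE = 0.10  # score 10 percent of production responses

def maybe_score(answer: str, context: str) -> float | None:
    if random.random() >= SAMPLE_RATE:
        return None  # skip the unsampled majority
    result = evaluate("faithfulness", output=answer, context=context)
    return result.score  # in production, attach this to the span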
Layer 3: Weekly human calibration
Sample 50 to 100 production traces. Have a domain reviewer label each on the same rubric as the LLM judge. Compute Cohen’s kappa between the human and the judge.
Target kappa is 0.6 or higher. Below 0.6, the judge is too noisy to trust; re-tune the judge prompt or swap judge models.
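The agreement computation is one call with scikit-learn (any Cohen's kappa implementation works); the labels here are invented for illustration.
# Judge-vs-human agreement on the shared rubric
from sklearn.metrics import cohen_kappa_score

human = ["grounded", "grounded", "ungrounded", "grounded", "ungrounded"]
judge = ["grounded", "ungrounded", "ungrounded", "grounded", "ungrounded"]

kappa = cohen_kappa_score(human, judge)
print(f"Cohen's kappa: {kappa:.2f}")  # compare against the 0.6 floor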
For depth on judge calibration, see Best LLM as Judge Platforms 2026.
The full picture: tracing every span
A 2026 RAG eval program does not just run metrics; it attaches them to spans.
# Pseudocode showing trace + eval attached at the span level
from fi_instrumentation import register
from traceai_llamaindex import LlamaIndexInstrumentor
from fi.evals import evaluate
register(project_name="rag-eval-demo")
LlamaIndexInstrumentor().instrument()
# When the agent runs, every retrieve, generate, and judge call emits an OTel span.
# fi.evals attaches scores to the spans:
# - retrieve span gets context_relevance and context_recall
# - generate span gets faithfulness and groundedness
# - the trace gets answer_relevance and the composite cost-per-correct-answer
The dashboard rolls per-span into per-trace, per-trace into per-day, per-day into the eval scorecard. When a customer flags a bad answer, the trace shows which retrieve span had low recall and which generate span had low groundedness; the fix is targeted, not a re-tune of the whole pipeline.

Figure 1: A 2026 RAG evaluation stack attaches per-stage metrics to per-stage spans.
How to choose metrics for your use case
Three questions:
- Do you have labeled gold chunks per query? If yes, classic IR metrics (Precision@k, Recall@k, MRR, NDCG) plus context_recall earn their place. If no, the LLM-judge versions (context_precision, context_relevance) are the path.
- Is the corpus broad (Q&A across many topics) or narrow (one domain)? Broad: prioritize recall (you have many ways to miss). Narrow: prioritize precision (noise is the bigger risk).
- What is the cost of a wrong answer? High-stakes (legal, medical, compliance): track all six metrics, gate ships on groundedness and faithfulness, run a guardrail at runtime. Low-stakes (FAQ, internal search): faithfulness and answer relevance are enough; skip the heavier metrics for speed.
Common failure modes
Measuring only generation
The most common mistake. Faithfulness and answer relevance look healthy; context recall is silently regressing. The fix is to score context_precision and context_recall on the retrieve span explicitly.
LLM-judge bias
The judge has its own biases: prefers longer answers, prefers verbose hedging, prefers the model family it shares with the system under test. The fix is to calibrate against humans (kappa 0.6 floor) and to use a different model for the judge than for the system.
Overfitting to the gold set
The gold set has 200 queries; the pipeline scores 0.92 on faithfulness on the gold set; in production, faithfulness on real traffic is 0.78. The fix is to keep refreshing the gold set with sampled production queries and to track online-eval scores alongside offline.
Composite metric drift
Optimize one metric and another regresses. Improve faithfulness by retrieving more chunks; context precision drops because more chunks means more noise. The fix is a composite (faithfulness times context_precision, or cost-per-correct-answer) that ties the metrics into one number.
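A toy multiplicative composite makes the trade-off visible; the numbers are invented.
# Illustrative composite gate
def composite(faithfulness: float, context_precision: float) -> float:
    return faithfulness * context_precision

print(f"{composite(0.88, 0.80):.3f}")  # 0.704 before the change
print(f"{composite(0.93, 0.65):.3f}")  # 0.605 after: faithfulness up, gate fails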
Reference-set staleness
The gold answers were labeled 6 months ago; the world changed; the gold answers are now wrong. The fix is a quarterly refresh of the gold set and a versioned eval run so historical scores are comparable.
For depth on RAG hallucinations specifically, see RAG Prompting to Reduce Hallucination and RAG Hallucinations: FutureAGI.
Where this is going in 2027
Three trends.
First, multi-hop and reasoning RAG evaluation matures. Today most metrics score a single-pass retrieve-generate cycle. The 2027 pattern is metrics that score the agent’s full trajectory across multiple retrieves, including the question of whether the agent retrieved redundantly or efficiently.
Second, ground-truth-free evaluation gets stronger. Reference-free hallucination detection improves to the point that a 100-query gold set is no longer the bottleneck for getting reliable production signals.
Third, the eval back-end becomes the dashboard. The trace + eval + metric layer (FutureAGI’s stack, the Phoenix + OpenInference stack, Langfuse) is where teams will spend most of their RAG operating time. Investment in this layer pays compounding returns.
How to start
- Build a 100 to 200 query labeled set. Each query has the gold chunks and the gold answer.
- Pick at least one retrieval metric and at least one generation metric. The minimum: context_recall plus faithfulness. The recommended: context_relevance, context_recall, faithfulness, groundedness, answer_relevance.
- Run the metrics offline on every prompt or model change. Block ship on a composite threshold.
- Wire trace + eval (traceAI + fi.evals, both Apache 2.0) for online sample evaluation on 5 to 20 percent of production traffic.
- Schedule weekly human review of 50 traces to calibrate the judge.
- Add a guardrail (FAGI Protect) at runtime for the worst failures: hallucinated content, PII leaks, prompt injection in retrieved chunks.
The full path lives in one stack with FutureAGI: traceAI for spans (Apache 2.0), fi.evals templates for the six core metrics (Apache 2.0), the dashboard for roll-ups, Protect for runtime. Self-host the Apache 2.0 cores or use the managed platform.
Sources
- RAGAS paper: https://arxiv.org/abs/2309.15217
- ARES paper: https://arxiv.org/abs/2311.09476
- FutureAGI fi.evals (Apache 2.0): https://github.com/future-agi/ai-evaluation
- FutureAGI traceAI (Apache 2.0): https://github.com/future-agi/traceAI
- FutureAGI RAG eval docs: https://docs.futureagi.com/docs/sdk/evals/metrics/rag/
- TREC IR evaluation primer: https://trec.nist.gov/pubs/trec16/appendices/measures.pdf
- HuggingFace evaluate library: https://github.com/huggingface/evaluate