RAG Evaluation Metrics in 2026: Faithfulness, Context Precision, Context Recall, Groundedness, Answer Relevance, with FAGI fi.evals
A legal research RAG ships and scores 0.91 faithfulness on the offline eval set. Three weeks into production, customers complain that 1 in 6 responses misses a key statute. The team checks faithfulness in the dashboard: still 0.91. They check context recall: 0.62. The retriever was missing the second statute on multi-hop questions; the generator was answering coherently from the partial context, so faithfulness stayed high. No retrieval-stage metric on the dashboard surfaced the regression. This is what RAG evaluation looks like when you measure only the generation. This post is the 2026 picture: the six metrics that anchor a complete RAG evaluation program, what each catches, what each misses, and how FutureAGI fi.evals (Apache 2.0) templates run them against your traces.
TL;DR: The six metrics that matter in 2026
| Metric | Layer | What it catches | fi.evals template |
|---|---|---|---|
| Faithfulness | Generation | Whole-answer fabrication vs context | faithfulness |
| Context precision | Retrieval | Irrelevant chunks in the retrieved set | context_relevance / context_precision |
| Context recall | Retrieval | Missing chunks the answer needed | context_recall |
| Groundedness | Generation | Per-sentence unsupported claims | groundedness |
| Answer relevance | Generation | Off-topic answers | answer_relevancy |
| Hallucination | Final answer | Claims no chunk supports (safety net) | hallucination |
If you only read one row: a 2026 RAG eval program tracks at least one retrieval-stage metric and at least one generation-stage metric. Tracking only generation hides retrieval regressions; tracking only retrieval misses fabrication.
The two-stage evaluation problem
A RAG system has two stages: retrieve and generate. Each can fail independently. A complete evaluation program needs at least one metric per stage.
Stage 1: Retrieval. Did the retriever surface the right chunks for the query?
- Two failure modes: missed relevant chunks (low recall), surfaced irrelevant chunks (low precision).
- Metrics: context precision, context recall, Precision@k, Recall@k, MRR, NDCG.
Stage 2: Generation. Did the generator use the retrieved chunks to produce a correct, supported, on-topic answer?
- Three failure modes: fabricated claims (low faithfulness), per-sentence ungrounded claims (low groundedness), off-topic answers (low answer relevance).
- Metrics: faithfulness, groundedness, answer relevance, hallucination.
A pipeline can fail at stage 1 (the retriever missed the chunk) and still produce a “faithful” answer (the model was honest about the partial context, but the user got an incomplete answer). A pipeline can succeed at stage 1 (the retriever returned the right chunk) and fail at stage 2 (the model fabricated anyway).
Both failure modes are common in production. Both need their own metric.
Metric 1: Faithfulness
What it asks: is the generated answer supported by the retrieved context, with no fabricated claims?
How to score it: an LLM judge sees the answer and the retrieved chunks. It scores 0 to 1 (or labels each claim as supported, partially supported, or unsupported) based on whether the answer makes claims the context does not back. The composite is the faithfulness score.
When it fires: the model fabricated a fact. The context did not say “the contract expires January 1”; the model said it did.
When it does not fire: the answer is incomplete because the retrieval was partial. The model answered coherently from incomplete context; the answer matches what was retrieved, so faithfulness is high, but the user is missing information. This is a retrieval-stage problem, not a faithfulness problem.
fi.evals template: faithfulness (Apache 2.0, runs as Python or cloud).
# Real FAGI API
from fi.evals import evaluate

result = evaluate(
    "faithfulness",
    output=generated_answer,
    context=retrieved_chunks_concatenated,
)
print(f"Faithfulness: {result.score:.3f}")
The fast cloud judge (turing_flash, ~1-2s) is the default for online sampling. The deeper judges (turing_small ~2-3s, turing_large ~3-5s) run as offline batch or as second-stage evaluators when latency is not constrained.
Metric 2: Context precision
What it asks: of the chunks the retriever returned, what fraction are relevant to the query?
How to score it: an LLM judge labels each retrieved chunk as relevant or not, given the query. Context precision is relevant chunks divided by total chunks. Often computed as a weighted version where higher-ranked chunks count more.
When it fires: the retriever returned 10 chunks, 3 are relevant, 7 are noise. The generator is now drowning in 7 irrelevant chunks; context windows are wasted; the model may misweight the noise.
Why it matters: low precision tanks generation quality even when recall is fine. Re-rankers (Cohere Rerank, BGE Reranker, Jina Reranker) live to fix low context precision.
fi.evals template: context_relevance or context_precision.
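A minimal sketch of the arithmetic, assuming the judge's per-chunk relevance labels are already in hand as booleans in retriever rank order; context_precision and weighted_context_precision are illustrative names, not fi.evals functions.
# Illustrative sketch, not the fi.evals implementation
def context_precision(relevant: list[bool]) -> float:
    """Plain precision: relevant chunks / total retrieved chunks."""
    return sum(relevant) / len(relevant) if relevant else 0.0

def weighted_context_precision(relevant: list[bool]) -> float:
    """Rank-weighted variant: average precision@k over the positions
    that hold a relevant chunk, so early hits count more."""
    hits, score = 0, 0.0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / hits if hits else 0.0

# Retriever returned 5 chunks; the judge marked ranks 1, 2, and 5 relevant.
labels = [True, True, False, False, True]
print(context_precision(labels))           # 0.6
print(weighted_context_precision(labels))  # (1/1 + 2/2 + 3/5) / 3 ≈ 0.867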
Metric 3: Context recall
What it asks: of the chunks needed to fully answer the query, what fraction did the retriever return?
How to score it: requires a ground-truth set (the chunks a human labeled as necessary for the answer). Context recall is necessary chunks retrieved divided by total necessary chunks.
When it fires: the answer needed 3 chunks; the retriever returned 2; the generator answers from the 2 it has and the user gets an incomplete answer.
Why it matters: low recall is the silent regression. The model can still produce a fluent, faithful-looking answer from incomplete context; faithfulness stays high; users get partial information. The only way to surface low recall is to measure it explicitly.
fi.evals template: context_recall (requires ground-truth chunk labels).
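The arithmetic is small once the labels exist. A sketch, assuming chunks carry stable IDs and gold_ids is the human-labeled set of chunks necessary for the answer:
# Illustrative sketch; requires ground-truth chunk labels
def context_recall(retrieved_ids: set[str], gold_ids: set[str]) -> float:
    """Necessary chunks retrieved / total necessary chunks."""
    if not gold_ids:
        return 1.0  # nothing was required, so nothing was missed
    return len(retrieved_ids & gold_ids) / len(gold_ids)

# The answer needed 3 chunks; the retriever surfaced 2 of them.
print(context_recall({"c1", "c2", "c9"}, {"c1", "c2", "c3"}))  # ≈ 0.667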
For depth on retrieval-quality monitoring, see Best Retrieval Quality Monitoring Tools 2026.
Metric 4: Groundedness
What it asks: for each sentence in the answer, is there a retrieved chunk that supports it?
How to score it: an LLM judge labels each sentence as grounded (a backing chunk exists) or ungrounded. Groundedness is the fraction of sentences grounded.
When it fires: 4 of 5 sentences in an answer are supported by retrieved chunks; the 5th is a fabrication. Faithfulness as a single 0-1 score might average to 0.8 and look fine; groundedness flags 1 of 5 sentences ungrounded and gives you the exact sentence to fix.
Why it matters: granularity. Faithfulness for the dashboard, groundedness for the diagnostic. Production teams track both.
fi.evals template: groundedness.
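A sketch of the aggregation, assuming a regex sentence splitter; judge_supports is a stand-in for the per-sentence LLM-judge call, not a fi.evals function.
# Illustrative sketch of per-sentence grounding
import re

def groundedness(answer: str, chunks: list[str], judge_supports) -> float:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    flags = [(s, judge_supports(s, chunks)) for s in sentences]
    ungrounded = [s for s, ok in flags if not ok]  # the exact sentences to fix
    return (len(flags) - len(ungrounded)) / len(flags)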
Metric 5: Answer relevance
What it asks: does the answer actually address the user’s question, or is it fluent, on-topic-sounding text that never answers it?
How to score it: an LLM judge scores how well the answer addresses the specific query, 0 to 1. Penalizes verbose hedging, off-topic preambles, and answers that talk around the question.
When it fires: the user asks “what’s our refund policy”; the answer is a faithful summary of the company’s general support policy that does not mention refunds.
Why it matters: a faithful, grounded answer to the wrong question is still useless. Answer relevance is the closing-the-loop metric.
fi.evals template: answer_relevancy (the FAGI/RAGAS canonical name).
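The call shape mirrors the offline loop in the three-layer section below; user_query and generated_answer are placeholder variables.
# Real FAGI API, same call shape as the offline loop below
from fi.evals import evaluate

result = evaluate("answer_relevancy", input=user_query, output=generated_answer)
print(f"Answer relevance: {result.score:.3f}")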
Metric 6: Hallucination
What it asks: does the answer make claims that no retrieved chunk supports? Often run reference-free (no labeled ground truth needed).
How to score it: a hallucination detector flags claims that are factually questionable or unsupported. Often complementary to faithfulness; runs as a final safety net before the response ships.
When it fires: the model invented something the context did not say. Faithfulness should also catch this; hallucination is the second line of defense.
fi.evals template: hallucination.
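Assuming the template takes the same output and context arguments as faithfulness (an assumption; verify against the fi.evals docs in Sources), the call looks like:
# Assumed call shape; verify argument names against the fi.evals docs
from fi.evals import evaluate

result = evaluate("hallucination", output=generated_answer, context=retrieved_chunks)
print(f"Hallucination: {result.score:.3f}")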
For the broader hallucination story, see Understanding LLM Hallucination 2026 and Detect Hallucination Generative AI 2025.
Classic IR metrics still earn their place
When you have labeled gold documents per query, the discrete metrics remain useful:
- Precision@k: fraction of top-k retrieved chunks that are relevant. Target 0.7+ for narrow domains, 0.5+ for broad.
- Recall@k: fraction of relevant chunks that appear in top-k. Target 0.8+ at k=20 for broad datasets.
- MRR (Mean Reciprocal Rank): 1/rank of the first relevant result, averaged across queries. Higher means the right chunk lands at the top.
- NDCG@k: relevance-weighted ranking score. Target 0.8+ at k=10.
These are the IR-flavored versions of context precision and context recall. Use them when you have labels; use the LLM-judge versions when you do not. Many production stacks compute both and reconcile.
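Minimal reference implementations for intuition, assuming binary relevance labels in retriever rank order; these are sketches, not a tuned eval harness.
# Illustrative sketches of the four IR metrics
import math

def precision_at_k(ranked: list[bool], k: int) -> float:
    return sum(ranked[:k]) / k

def recall_at_k(ranked: list[bool], k: int, total_relevant: int) -> float:
    return sum(ranked[:k]) / total_relevant if total_relevant else 0.0

def mrr(ranked_per_query: list[list[bool]]) -> float:
    """1/rank of the first relevant result, averaged across queries."""
    rr = [next((1 / (i + 1) for i, rel in enumerate(q) if rel), 0.0)
          for q in ranked_per_query]
    return sum(rr) / len(rr)

def ndcg_at_k(ranked: list[bool], k: int) -> float:
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked[:k]))
    ideal = sorted(ranked, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0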
How to run RAG evaluation: the three-layer pattern
Layer 1: Offline gold-set evaluation
Build a labeled set: 100 to 500 query-context-answer tuples. Each query has the gold chunks the retriever should return and the gold answer the generator should produce.
Run the eval suite on every prompt change, model swap, retriever change, or chunking-strategy change. The composite metric (faithfulness, context recall, answer relevance, optionally cost-per-correct-answer) gates the merge.
# Real FAGI API
from fi.evals import evaluate

results = []
for query, context, answer, gold_chunks in test_set:
    # gold_chunks feed the context_recall eval (requires labels; see Metric 3)
    faith = evaluate("faithfulness", output=answer, context=context)
    relev = evaluate("answer_relevancy", input=query, output=answer)
    results.append({
        "faithfulness": faith.score,
        "answer_relevance": relev.score,
    })

mean_faith = sum(r["faithfulness"] for r in results) / len(results)
mean_relev = sum(r["answer_relevance"] for r in results) / len(results)
print(f"Faithfulness: {mean_faith:.3f}, Relevance: {mean_relev:.3f}")
Layer 2: Online sample evaluation
Score a sample (5 to 20 percent) of production traffic with fi.evals templates running as cloud evaluators tied to OTel spans. The faster cloud judge (turing_flash, ~1-2s) covers the bulk of online sampling; the deeper judges (turing_small, turing_large) run async and attach scores to the trace.
This catches drift that the gold set will not: real users, real queries, real distribution shifts.
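A sketch of the sampling hook, assuming a synchronous evaluate call; maybe_score and SAMPLE_RATE are illustrative names, and in the real wiring the score attaches to the trace's OTel span (next section).
# Illustrative sampling hook
import random

from fi.evals import evaluate

SAMPLE_RATE = 0.10  # score 10 percent of production responses

def maybe_score(answer: str, context: str) -> float | None:
    if random.random() >= SAMPLE_RATE:
        return None  # skip the unsampled majority
    result = evaluate("faithfulness", output=answer, context=context)
    return result.score  # in production, attach this to the span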
Layer 3: Weekly human calibration
Sample 50 to 100 production traces. Have a domain reviewer label each on the same rubric as the LLM judge. Compute Cohen’s kappa between the human and the judge.
Target kappa is 0.6 or higher. Below 0.6, the judge is too noisy to trust; re-tune the judge prompt or swap judge models.
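The agreement computation is one call with scikit-learn (any Cohen's kappa implementation works); the labels here are invented for illustration.
# Judge-vs-human agreement on the shared rubric
from sklearn.metrics import cohen_kappa_score

human = ["grounded", "grounded", "ungrounded", "grounded", "ungrounded"]
judge = ["grounded", "ungrounded", "ungrounded", "grounded", "ungrounded"]

kappa = cohen_kappa_score(human, judge)
print(f"Cohen's kappa: {kappa:.2f}")  # compare against the 0.6 floor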
For depth on judge calibration, see Best LLM as Judge Platforms 2026.
The full picture: tracing every span
A 2026 RAG eval program does not just run metrics; it attaches them to spans.
# Pseudocode showing trace + eval attached at the span level
from fi_instrumentation import register
from traceai_llamaindex import LlamaIndexInstrumentor
from fi.evals import evaluate
register(project_name="rag-eval-demo")
LlamaIndexInstrumentor().instrument()
# When the agent runs, every retrieve, generate, and judge call emits an OTel span.
# fi.evals attaches scores to the spans:
# - retrieve span gets context_relevance and context_recall
# - generate span gets faithfulness and groundedness
# - the trace gets answer_relevance and the composite cost-per-correct-answer
The dashboard rolls per-span into per-trace, per-trace into per-day, per-day into the eval scorecard. When a customer flags a bad answer, the trace shows which retrieve span had low recall and which generate span had low groundedness; the fix is targeted, not a re-tune of the whole pipeline.

Figure 1: A 2026 RAG evaluation stack attaches per-stage metrics to per-stage spans.
How to choose metrics for your use case
Three questions:
- Do you have labeled gold chunks per query? If yes, classic IR metrics (Precision@k, Recall@k, MRR, NDCG) plus context_recall earn their place. If no, the LLM-judge versions (context_precision, context_relevance) are the path.
- Is the corpus broad (Q&A across many topics) or narrow (one domain)? Broad: prioritize recall (you have many ways to miss). Narrow: prioritize precision (noise is the bigger risk).
- What is the cost of a wrong answer? High-stakes (legal, medical, compliance): track all six metrics, gate ships on groundedness and faithfulness, run a guardrail at runtime. Low-stakes (FAQ, internal search): faithfulness and answer relevance are enough; skip the heavier metrics for speed.
Common failure modes
Measuring only generation
The most common mistake. Faithfulness and answer relevance look healthy; context recall is silently regressing. The fix is to score context_precision and context_recall on the retrieve span explicitly.
LLM-judge bias
The judge has its own biases: prefers longer answers, prefers verbose hedging, prefers the model family it shares with the system under test. The fix is to calibrate against humans (kappa 0.6 floor) and to use a different model for the judge than for the system.
Overfitting to the gold set
The gold set has 200 queries; the pipeline scores 0.92 on faithfulness on the gold set; in production, faithfulness on real traffic is 0.78. The fix is to keep refreshing the gold set with sampled production queries and to track online-eval scores alongside offline.
Composite metric drift
Optimize one metric and another regresses. Improve faithfulness by retrieving more chunks; context precision drops because more chunks means more noise. The fix is a composite (faithfulness times context_precision, or cost-per-correct-answer) that ties the metrics into one number.
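A toy multiplicative composite makes the trade-off visible; the numbers are invented.
# Illustrative composite gate
def composite(faithfulness: float, context_precision: float) -> float:
    return faithfulness * context_precision

print(f"{composite(0.88, 0.80):.3f}")  # 0.704 before the change
print(f"{composite(0.93, 0.65):.3f}")  # 0.605 after: faithfulness up, gate fails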
Reference-set staleness
The gold answers were labeled 6 months ago; the world changed; the gold answers are now wrong. The fix is a quarterly refresh of the gold set and a versioned eval run so historical scores are comparable.
For depth on RAG hallucinations specifically, see RAG Prompting to Reduce Hallucination and RAG Hallucinations: FutureAGI.
Where this is going in 2027
Three trends.
First, multi-hop and reasoning RAG evaluation matures. Today most metrics score a single-pass retrieve-generate cycle. The 2027 pattern is metrics that score the agent’s full trajectory across multiple retrieves, including the question of whether the agent retrieved redundantly or efficiently.
Second, ground-truth-free evaluation gets stronger. Reference-free hallucination detection improves to the point that a 100-query gold set is no longer the bottleneck for getting reliable production signals.
Third, the eval back-end becomes the dashboard. The trace + eval + metric layer (FutureAGI’s stack, the Phoenix + OpenInference stack, Langfuse) is where teams will spend most of their RAG operating time. Investment in this layer pays compounding returns.
How to start
- Build a 100 to 200 query labeled set. Each query has the gold chunks and the gold answer.
- Pick at least one retrieval metric and at least one generation metric. The minimum: context_recall plus faithfulness. The recommended: context_relevance, context_recall, faithfulness, groundedness, answer_relevance.
- Run the metrics offline on every prompt or model change. Block ship on a composite threshold.
- Wire trace + eval (traceAI + fi.evals, both Apache 2.0) for online sample evaluation on 5 to 20 percent of production traffic.
- Schedule weekly human review of 50 traces to calibrate the judge.
- Add a guardrail (FAGI Protect) at runtime for the worst failures: hallucinated content, PII leaks, prompt injection in retrieved chunks.
The full path lives in one stack with FutureAGI: traceAI for spans (Apache 2.0), fi.evals templates for the six core metrics (Apache 2.0), the dashboard for roll-ups, Protect for runtime. Self-host the Apache 2.0 cores or use the managed platform.
Sources
- RAGAS paper: https://arxiv.org/abs/2309.15217
- ARES paper: https://arxiv.org/abs/2311.09476
- FutureAGI fi.evals (Apache 2.0): https://github.com/future-agi/ai-evaluation
- FutureAGI traceAI (Apache 2.0): https://github.com/future-agi/traceAI
- FutureAGI RAG eval docs: https://docs.futureagi.com/docs/sdk/evals/metrics/rag/
- TREC IR evaluation primer: https://trec.nist.gov/pubs/trec16/appendices/measures.pdf
- HuggingFace evaluate library: https://github.com/huggingface/evaluate