What Is Context Recall?
A RAG retrieval metric scoring the fraction of the reference answer's information that can be inferred from the retrieved context, measured via NLI attribution.
Context recall is a RAG retrieval metric that quantifies how much of the information needed to answer the question was actually retrieved. The evaluator takes a reference (ground-truth) answer, splits it into sentences, and checks each one through Natural Language Inference attribution against the retrieved contexts. The score is the fraction of reference sentences that can be inferred from at least one chunk — 1.0 means the retriever got everything required, 0.0 means it got none of it. It is the metric to watch when answers are confidently wrong because a fact was never retrieved in the first place.
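Stated as a formula, a direct restatement of the definition above:

\[
\text{context recall} = \frac{\lvert \{\, s \in S_{\mathrm{ref}} : s \text{ is entailed by at least one retrieved chunk} \,\} \rvert}{\lvert S_{\mathrm{ref}} \rvert}
\]

where \(S_{\mathrm{ref}}\) is the set of sentences in the reference answer.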
Why It Matters in Production LLM and Agent Systems
Wrong-by-omission failures are the hardest RAG bugs to catch. The model produces a clean, confident answer that misses a critical clause — a returns-window exception, a deductible carve-out, a regional pricing rule — because the chunk that contained that clause never made it into the top-k. Faithfulness scores the answer high (it stayed inside the retrieved context). Groundedness passes (no unsupported claims). Only context recall surfaces the actual problem: the retriever returned context that supported a partial answer.
The pain is acute for retrieval engineers and product owners. An ML engineer ships a new chunking strategy that improves coherence but cuts the average chunk in half — recall drops because critical facts now span chunk boundaries. A product manager fields complaints about “the bot doesn’t know about the EU pricing exception” and discovers that the EU paragraph sits at rank 14, below the top-k cutoff. A compliance team is asked whether a medical bot has access to all relevant guidance and has no metric to prove it does.
In 2026-era agentic RAG, low recall is the signal that should trigger a query rewrite or a multi-hop retrieval. Self-RAG and corrective-RAG patterns rely on a recall-style signal to know when to retrieve again rather than answer from incomplete context.
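A minimal sketch of that gate in Python, where retrieve, rewrite_query, and coverage_score are hypothetical callables standing in for your retriever, a query reformulator, and a recall-style proxy (for example, an NLI coverage check over a draft answer), not any specific library API:

def corrective_retrieve(query, retrieve, rewrite_query, coverage_score,
                        threshold=0.8, max_rounds=3):
    # Retrieve once, then re-retrieve with a rewritten query while the
    # recall-style coverage signal stays below the threshold.
    contexts = retrieve(query)
    for _ in range(max_rounds - 1):
        if coverage_score(query, contexts) >= threshold:
            break
        query = rewrite_query(query, contexts)  # reformulate around the gap
        contexts = retrieve(query)              # retrieve again
    return contexts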
How FutureAGI Handles Context Recall
FutureAGI’s approach is to ship fi.evals.ContextRecall as a local NLI-driven metric that runs identically in offline evaluation and live tracing surfaces. The evaluator splits the reference answer into sentences, filters out sentences that are too short or contain no verb, and runs each through entailment-style attribution against the retrieved contexts — the same NLI layer that powers Faithfulness and RAGFaithfulness. It returns a 0-1 score with per-sentence attribution detail.
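In sketch form, that attribution loop looks roughly like this, with entails as a hypothetical stand-in for the NLI layer and a word-count filter approximating the short-sentence filtering (the verb check is omitted):

import re

def context_recall(reference, contexts, entails, min_words=3):
    # Split the reference answer into sentences and drop very short ones.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", reference)]
    sentences = [s for s in sentences if len(s.split()) >= min_words]
    if not sentences:
        return 0.0
    # A sentence counts as attributed if any retrieved chunk entails it.
    attributed = sum(
        any(entails(chunk, s) for chunk in contexts) for s in sentences
    )
    return attributed / len(sentences)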
Concretely: a knowledge-base team running on traceAI-llamaindex builds a regression Dataset of 500 question-answer pairs from real production tickets. They attach ContextRecall and ContextPrecision and run nightly against new index builds. When a chunking-config change pushes mean recall from 0.88 to 0.71, the dataset surfaces the failing rows — they cluster around long-form policy paragraphs being split at sentence boundaries. The team rolls back the chunking change, then runs GEPA to evolve a chunking strategy that preserves the failing patterns, all anchored to the recall regression as the optimisation signal.
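A sketch of that nightly job, assuming the 500 rows live in a JSON file of query/contexts/reference dicts and that each eval result's output field carries the numeric 0-1 score (as in the minimal example further down):

import json
from fi.evals import ContextRecall

# Load the regression dataset built from production tickets.
with open("regression_500.json") as f:  # hypothetical file name
    rows = json.load(f)

recall = ContextRecall()
results = recall.evaluate(rows)

scores = [r.output for r in results.eval_results]
mean_recall = sum(scores) / len(scores)
print(f"mean context recall: {mean_recall:.2f}")

# Flag a regression against the previous build's baseline (0.88 in the
# scenario above) and surface the worst rows for inspection.
if mean_recall < 0.88 - 0.05:
    worst = sorted(zip(scores, (r["query"] for r in rows)))[:20]
    for score, query in worst:
        print(f"{score:.2f}  {query}")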
Unlike Ragas context-recall, which is also NLI-attribution-based, FutureAGI’s evaluator returns per-sentence attribution detail in the response — not just the aggregate score — so the missing-chunk gap is debuggable inline with the trace.
How to Measure or Detect It
Context recall is directly measurable when a reference answer exists. Wire up:
- fi.evals.ContextRecall — NLI-based attribution from reference sentences to retrieved context.
- fi.evals.ContextEntityRecall — entity-level companion when you need to verify named-entity coverage specifically.
- fi.evals.RecallAtK — simpler counterpart for graded retrieval evaluation against labelled relevance.
- OTel attributes retrieval.documents and an offline reference field — the inputs every recall evaluator needs.
- Recall-vs-precision joint plot (dashboard) — the only retriever-evaluation visualisation that does not lie.
Minimal Python:
from fi.evals import ContextRecall

# Instantiate the local NLI-driven evaluator.
recall = ContextRecall()

# Each row supplies the query, the retrieved contexts, and a reference answer.
result = recall.evaluate([{
    "query": "What is the capital of France?",
    "contexts": ["Paris is the capital of France."],
    "reference": "The capital of France is Paris.",
}])

# A 0-1 score plus the per-sentence attribution reasoning.
print(result.eval_results[0].output, result.eval_results[0].reason)
Common Mistakes
- Computing recall on production traces without a reference answer. Recall requires ground truth; without it you are measuring something else. Run recall on labelled regression datasets, not raw production data.
- Reporting recall without precision. A recall-1.0 retriever that returns 200 chunks per query has terrible precision; the trade-off only shows in the joint plot.
- Using exact-match attribution instead of NLI. Word-overlap attribution misses paraphrased support — ContextRecall uses NLI on purpose.
- Letting reference answers grow stale. A reference written in 2024 against a 2024 corpus will drag recall scores down once the corpus updates. Version your reference answers with the corpus.
- Treating low recall as a chunking problem when it’s a retrieval problem. Confirm the missing fact is in the index at all before re-chunking — sometimes the chunk exists but ranks below top-k, which is a precision/reranker fix, not a chunking one.
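A quick way to run that last triage, sketched over plain chunk strings. The exact-substring check is deliberately crude (the word-overlap caveat above applies), but it is enough to separate the two failure modes:

def triage_missing_fact(fact, indexed_chunks, retrieved_top_k):
    # Distinguish "never indexed" from "indexed but ranked below top-k".
    in_index = any(fact in chunk for chunk in indexed_chunks)
    in_top_k = any(fact in chunk for chunk in retrieved_top_k)
    if not in_index:
        return "ingestion/chunking problem: the fact is not in the index"
    if not in_top_k:
        return "ranking problem: indexed but below top-k (reranker fix)"
    return "retrieved: the recall failure is elsewhere in the pipeline"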
Frequently Asked Questions
What is context recall in RAG?
Context recall is a 0-1 score for retrieval completeness — what share of the reference answer's information was actually present in the retrieved context. Lower recall means the retriever missed required chunks.
How is context recall different from context precision?
Recall measures completeness — did you retrieve everything you needed? Precision measures ranking — were the relevant chunks ranked above the irrelevant ones? Recall fixes coverage; precision fixes order.
How do you measure context recall?
FutureAGI's fi.evals.ContextRecall splits the reference answer into sentences and runs NLI attribution against the retrieved contexts, returning the fraction of sentences inferable from context.