Evaluation

What Is the Contextual Recall Metric?

A RAG metric that scores whether the retriever returned all the chunks needed to support the reference answer, by decomposing the answer into atomic claims and checking each against retrieval.

The contextual recall metric scores whether the retriever returned all the chunks needed to support the reference answer. It asks “of every relevant fact in the ground truth, how many appear in the retrieved context?” — typically by decomposing the reference answer into atomic claims and checking each claim against the retrieved chunks. It is the canonical signal for diagnosing under-retrieval. FutureAGI exposes it as fi.evals.ContextRecall, a local metric that runs in notebooks, datasets, and live traces with the same formula across surfaces.
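
To make the scoring arithmetic concrete, here is a minimal sketch of the claim-level recall formula; the claim list and counts are illustrative, not the evaluator's internals:

# Illustrative only: claim-level recall is supported claims over total claims.
claims_in_reference = [
    "Warranty covers parts and labor.",
    "Coverage lasts 12 months.",
    "The battery is covered for 6 months.",
]
claims_supported = 2  # say two of the three claims appear in the retrieved chunks
recall = claims_supported / len(claims_in_reference)
print(round(recall, 2))  # 0.67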

Why the Contextual Recall Metric Matters in Production LLM and Agent Systems

Hallucinations born from missing context are the hardest to debug. The user asks for a multi-fact answer; the retriever returns five chunks, four relevant and one missing the key fact; the model fills the gap with a confident invention. From the answer alone, the failure looks like a model-quality problem. From the retrieval alone, four-of-five looks healthy. Only contextual recall surfaces what was missing.

The pain hits retrieval engineers, RAG product owners, and answer-quality reviewers. Retrieval engineers see top-K accuracy that looks fine on aggregate but tail recall that quietly degrades. RAG product owners see hallucination rates that are unattributable across retrieval, reranking, and generation. Answer-quality reviewers cannot distinguish “the model lied” from “the model couldn’t see the right context.”

In 2026's agentic-RAG and self-RAG patterns, low contextual recall is the signal that should trigger a query rewrite or a retrieval expansion, not a re-rank of the same insufficient pool. Multi-hop agents need recall as a step-level signal so the agent itself can decide to retrieve again. Unlike Ragas, which uses a similar decomposition approach, FutureAGI's ContextRecall runs against a unified evaluator surface alongside ContextPrecision and Faithfulness, so the three retrieval signals plot on one dashboard.
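
A minimal sketch of that control loop is below. The retrieve, recall_estimate, and rewrite_query callables are hypothetical stand-ins (at query time there is no gold reference, so coverage would be estimated by an LLM judge), not FutureAGI APIs:

# Hypothetical agent step: retry retrieval when estimated recall is low,
# rather than re-ranking the same insufficient pool.
MAX_ATTEMPTS = 3
RECALL_FLOOR = 0.7

def retrieve_with_recall_gate(query, retrieve, recall_estimate, rewrite_query):
    current_query = query
    contexts = []
    for _ in range(MAX_ATTEMPTS):
        contexts = retrieve(current_query)
        if recall_estimate(current_query, contexts) >= RECALL_FLOOR:
            break  # coverage looks sufficient; stop retrieving
        # Low estimated recall: rewrite or expand the query and try again.
        current_query = rewrite_query(current_query, contexts)
    return current_query, contexts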

How FutureAGI Handles the Contextual Recall Metric

FutureAGI’s approach is to ship fi.evals.ContextRecall as a local metric usable across surfaces: notebook for ad-hoc analysis, Dataset.add_evaluation for offline regression, and trace-evaluator for live monitoring. Inputs are the user query, the retrieved contexts, and a reference answer. The evaluator decomposes the reference into atomic claims via an LLM judge, checks each claim against the retrieved chunks, and returns a 0–1 recall score plus the per-claim breakdown — so you see which claims were missed, not just the aggregate.
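
Conceptually, the evaluation step looks like the sketch below; judge_decompose and claim_supported stand in for the LLM-judge calls and are illustrative names, not SDK functions:

# Conceptual sketch of claim-level recall with a per-claim breakdown
# (illustrative, not the SDK's implementation).
def context_recall(reference, contexts, judge_decompose, claim_supported):
    claims = judge_decompose(reference)  # reference answer -> atomic claims
    breakdown = {
        claim: any(claim_supported(claim, chunk) for chunk in contexts)
        for claim in claims
    }
    score = sum(breakdown.values()) / len(claims) if claims else 0.0
    return score, breakdown  # aggregate score plus which claims were missed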

A concrete example: a customer-support RAG team is investigating a hallucination spike. They run ContextRecall and ContextPrecision against the last 1,000 traces. Precision is healthy at 0.87 median; recall has dropped from 0.79 to 0.61 over the past two weeks. The per-claim breakdown shows the missed claims cluster around product-spec details that recently shipped to the knowledge base under a new schema. The retriever was tuned on the old schema. The fix is a ChunkAttribution audit, a re-index, and a regression eval gate on a frozen Dataset of historical multi-fact questions.
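
A regression gate along those lines might look like the sketch below. Here regression_cases is a hypothetical list of query/contexts/reference dicts loaded from the frozen Dataset, the floor value is arbitrary, and the code assumes evaluate returns one eval_result with a numeric 0–1 output per input case, as in the usage example later on this page:

from fi.evals import ContextRecall

RECALL_FLOOR = 0.75  # illustrative threshold

def recall_gate(regression_cases):
    # Assumes one eval_result per input case, with a 0-1 recall score in .output.
    results = ContextRecall().evaluate(regression_cases).eval_results
    scores = sorted(float(r.output) for r in results)
    median = scores[len(scores) // 2]
    assert median >= RECALL_FLOOR, f"Context recall regressed: median={median:.2f}"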

We have found that contextual recall is the most reliable leading indicator of hallucinations from under-retrieval — measure it before you blame the model.

How to Measure or Detect It

Wire up the contextual recall metric:

  • fi.evals.ContextRecall — atomic-claim recall score against the retrieved contexts.
  • fi.evals.ContextEntityRecall — entity-level retrieval completeness, useful for KG-backed RAG.
  • fi.evals.RecallAtK — fraction of relevant items appearing in top K.
  • fi.evals.ChunkAttribution — which retrieved chunks the model actually used.
  • Per-claim recall breakdown — surface which claims the retrieval missed.

from fi.evals import ContextRecall

# Score retrieval completeness: did the retrieved contexts cover every
# claim in the reference answer?
result = ContextRecall().evaluate([{
    "query": "What's covered by the warranty?",
    "contexts": [
        "Warranty covers parts and labor for 12 months.",
        "Shipping is free over $50.",
    ],
    # The battery claim below is not supported by the contexts, so the
    # recall score should come back below 1.0.
    "reference": "Parts and labor are covered for 12 months; battery for 6 months."
}])
print(result.eval_results[0].output, result.eval_results[0].reason)

Common Mistakes

  • Measuring recall as a single number. The per-claim breakdown is what tells you which retrieval gap to fix.
  • Tuning recall in isolation. A 1.0-recall retriever that returns 50 chunks costs more, fits less in context, and ranks worse — read recall and precision together.
  • Skipping recall when answer accuracy looks fine. Tail-quality regressions hide inside healthy averages.
  • Using a static eval set as the world. Knowledge bases drift; refresh the recall benchmark on the same cadence as the index.
  • Confusing context recall with context utilization. Recall asks “did you retrieve it?” Utilization asks “did the model use it?”

Frequently Asked Questions

What is the contextual recall metric?

It is a RAG metric that scores retrieval completeness — whether the retriever returned all the chunks needed to support the reference answer. It typically decomposes the answer into atomic claims and checks each against the retrieved chunks.

How is contextual recall different from contextual precision?

Recall asks “of all the relevant chunks that exist, how many did you return?” Precision asks “of the chunks you returned, how well are they ranked?” Recall tunes the retriever; precision tunes the reranker.

How do you measure the contextual recall metric?

Run `fi.evals.ContextRecall` against a query, the retrieved contexts, and a reference answer; the evaluator decomposes the reference into atomic claims and checks each against the contexts. Pair with `ContextPrecision` for a complete retrieval view.