Evaluation

What Is Recall@K?

A ranking metric measuring the fraction of all relevant items retrieved within the top K results.

Recall@K is an LLM-evaluation metric for ranked retrieval that measures what fraction of all known relevant items appear within the top K results. In a RAG eval pipeline, production search trace, or agent candidate list, it tells you whether enough correct evidence reached the model before the cutoff. FutureAGI maps the anchor eval:RecallAtK to fi.evals.RecallAtK, so teams can catch missing coverage even when the top result looks plausible.
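
A minimal plain-Python sketch of that arithmetic, illustrative only and not the fi.evals implementation:

# Reference arithmetic for Recall@K; the supported evaluator is fi.evals.RecallAtK.
def recall_at_k(ranked_ids, relevant_ids, k):
    top_k = set(ranked_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids) if relevant_ids else 0.0

# Two of the three relevant items appear in the top 5, so Recall@5 = 2/3 (about 0.67).
print(recall_at_k(["a", "b", "c", "d", "e", "f"], ["b", "d", "f"], k=5))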

Why Recall@K Matters in Production LLM and Agent Systems

Recall@K matters because a system can own the right evidence and still fail when that evidence lands below the cutoff. A retriever may return five chunks to the prompt, while the decisive refund-policy exception sits at rank 7. The model then produces a clean answer from partial context. Nothing looks broken at the final generation layer; the failure started when the ranked list dropped a required item.

The pain shows up differently by team. Retrieval engineers see index rebuilds that keep latency stable but lose long-tail documents. SREs see higher retry rates when agents reformulate queries after weak first retrieval. Product teams see users complain that “the bot should know this,” even though the corpus contains the answer. Compliance teams cannot prove that regulated guidance was made available to the model for the cases that needed it.

Logs usually show low retrieved-relevant-count, repeated query rewrites, sparse citation clicks, or answer-quality regressions isolated to specific query cohorts. In the multi-step agent pipelines of 2026, Recall@K is also a control signal for whether the agent should retrieve again, call a search tool, or route to a fallback knowledge source. If the first retrieval step misses coverage, every downstream reasoning step, tool choice, and groundedness check inherits that ceiling.

How FutureAGI Handles Recall@K

FutureAGI’s approach is to treat Recall@K as a retrieval coverage gate, not a generic ranking score. The specific surface is eval:RecallAtK, implemented as fi.evals.RecallAtK in the FutureAGI evaluator inventory. The evaluator compares an ordered result list against a labelled set of relevant IDs and reports the fraction recovered inside the configured K. Teams usually pair it with PrecisionAtK, MRR, and NDCG so coverage, noise, first-hit rank, and graded ranking quality stay separate.
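
These scores can diverge on the same ranked list, which is why they are tracked side by side. A rough plain-Python illustration of the paired metrics, not the FutureAGI evaluators themselves:

def precision_at_k(ranked_ids, relevant_ids, k):
    # Share of the top-K slots occupied by relevant items.
    relevant = set(relevant_ids)
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k

def reciprocal_rank(ranked_ids, relevant_ids):
    # Rank position of the first relevant hit; MRR is the mean of this value across queries.
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

ranked = ["doc-7", "doc-2", "doc-9", "doc-4", "doc-1"]
relevant = ["doc-2", "doc-4", "doc-8"]
print(precision_at_k(ranked, relevant, k=5))  # 0.4: three of the five slots are filler
print(reciprocal_rank(ranked, relevant))      # 0.5: the first relevant hit sits at rank 2
# Recall@5 on the same list is 2/3: doc-8 never makes the cutoff.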

A real workflow: a support RAG team instruments a LangChain retriever with traceAI-langchain. Each retrieval span stores ordered candidate IDs in retrieval.documents, plus tags for retriever.version, tenant, and query intent. A FutureAGI Dataset contains the golden relevant_ids for support questions. Before a new embedding model ships, the release eval runs RecallAtK(config={"k": 5}) and blocks if p25 Recall@5 falls below 0.80 for billing, cancellation, or security-policy questions.
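
A sketch of that release gate, assuming per-example Recall@5 scores and intent tags have already been collected from the eval run; the record shape and helper names here are illustrative, not FutureAGI APIs:

from statistics import quantiles

# Hypothetical shape of the eval output: one record per golden question,
# carrying its intent cohort and its Recall@5 score from fi.evals.RecallAtK.
eval_records = [
    {"intent": "billing", "recall_at_5": 1.0},
    {"intent": "billing", "recall_at_5": 0.5},
    {"intent": "cancellation", "recall_at_5": 1.0},
    {"intent": "security-policy", "recall_at_5": 0.67},
]

GATED_INTENTS = ["billing", "cancellation", "security-policy"]
THRESHOLD = 0.80

def p25(values):
    # First quartile; statistics.quantiles needs at least two data points.
    return quantiles(values, n=4)[0] if len(values) > 1 else values[0]

failures = []
for intent in GATED_INTENTS:
    scores = [r["recall_at_5"] for r in eval_records if r["intent"] == intent]
    if scores and p25(scores) < THRESHOLD:
        failures.append(f"{intent}: p25 Recall@5 = {p25(scores):.2f}")
if failures:
    raise SystemExit("Release blocked -- " + "; ".join(failures))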

When the score drops, the engineer does not rewrite the prompt first. They open the failing examples, compare ranked_ids to relevant_ids, and check whether the missing IDs were absent from the index or merely ranked after K. If they were absent, the fix is ingestion or chunking. If they were late, the fix is reranking or hybrid-search weighting.
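
That triage is mechanical once the full ranked list is available alongside the gold labels; a minimal sketch:

def triage_misses(ranked_ids, relevant_ids, k):
    # Split missed relevant IDs into "never retrieved" vs "retrieved but ranked after K".
    top_k = set(ranked_ids[:k])
    returned = set(ranked_ids)
    absent, late = [], []
    for doc_id in relevant_ids:
        if doc_id in top_k:
            continue
        (late if doc_id in returned else absent).append(doc_id)
    return {"absent_from_results": absent, "ranked_after_k": late}

# doc-8 was never retrieved (ingestion or chunking problem);
# doc-4 came back but after the cutoff (reranking or hybrid-weighting problem).
print(triage_misses(
    ranked_ids=["doc-7", "doc-2", "doc-9", "doc-1", "doc-3", "doc-4"],
    relevant_ids=["doc-2", "doc-4", "doc-8"],
    k=5,
))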

Unlike Ragas context recall, which checks whether reference-answer statements are inferable from retrieved text, Recall@K checks labelled item coverage directly. That makes it sharp for search, RAG, memory lookup, and agent-candidate retrieval where the relevant item IDs are known.

How to Measure or Detect Recall@K

Measure Recall@K only when you have a ranked candidate list and a trusted set of relevant items. The formula is relevant_items_in_top_k / total_relevant_items. Track:

  • fi.evals.RecallAtK — returns the share of relevant IDs recovered inside the top K ranked results.
  • retrieval.documents — ordered trace data from the retrieval span; score the exact list the model receives.
  • Recall@K by cohort — dashboard p25, median, and fail rate by retriever version, tenant, language, and query intent.
  • Joint recall/precision view — RecallAtK catches missed coverage; PrecisionAtK catches irrelevant filler inside K.
  • User proxy — thumbs-down rate, citation-click corrections, and escalation rate for queries with low retrieval coverage.

Minimal Python:

from fi.evals import RecallAtK

metric = RecallAtK(config={"k": 5})
result = metric.evaluate([{
    "ranked_ids": ["doc-7", "doc-2", "doc-9", "doc-4", "doc-1"],
    "relevant_ids": ["doc-2", "doc-4", "doc-8"],
}])
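# doc-2 and doc-4 land inside the top 5 while doc-8 does not, so the expected Recall@5 is 2/3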
print(result.eval_results[0].output)

Set K to the number of retrieved items actually sent into the prompt or agent step, not the number returned by an upstream search API.

Common Mistakes

Most Recall@K bugs come from evaluating a cleaner list than the production system uses.

  • Confusing Recall@K with Precision@K. Recall@K rewards finding all relevant items; Precision@K rewards keeping irrelevant items out of the top K.
  • Choosing K after looking at the results. K should match the prompt budget, reranker output, or agent candidate limit in production.
  • Counting duplicate chunks as separate hits. Deduplicate by canonical document or passage ID before scoring coverage (see the dedup sketch after this list).
  • Reporting only the mean. Tail cohorts can lose required evidence while average Recall@K stays healthy.
  • Using incomplete labels. If the gold set omits relevant items, a better retriever can look worse than the baseline.
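
For the duplicate-chunk mistake, a minimal dedup pass before scoring, assuming each chunk ID can be mapped back to its canonical document ID:

def dedupe_to_canonical(ranked_chunk_ids, chunk_to_doc):
    # Collapse ranked chunk IDs to canonical document IDs, keeping first-occurrence order.
    seen, canonical = set(), []
    for chunk_id in ranked_chunk_ids:
        doc_id = chunk_to_doc.get(chunk_id, chunk_id)
        if doc_id not in seen:
            seen.add(doc_id)
            canonical.append(doc_id)
    return canonical

# Three chunks of doc-2 collapse into a single candidate before Recall@K is scored.
chunk_to_doc = {"doc-2#p1": "doc-2", "doc-2#p2": "doc-2", "doc-2#p3": "doc-2", "doc-9#p1": "doc-9"}
print(dedupe_to_canonical(["doc-2#p1", "doc-2#p2", "doc-9#p1", "doc-2#p3"], chunk_to_doc))
# ['doc-2', 'doc-9']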

Frequently Asked Questions

What is Recall@K?

Recall@K is the fraction of all known relevant items that appear within the top K ranked results. FutureAGI uses it to find retrieval coverage gaps before answer quality drops.

How is Recall@K different from Precision@K?

Recall@K asks how many relevant items were recovered inside K, while Precision@K asks how many of the K returned items were relevant. Recall@K penalizes missed evidence; Precision@K penalizes noisy evidence.

How do you measure Recall@K?

Use FutureAGI's `fi.evals.RecallAtK` on ranked result IDs and known relevant IDs. Track it by retriever version, query cohort, and the K value your model actually consumes.