Evaluation

What Is the MRR (Mean Reciprocal Rank) Metric?

A retrieval metric that averages the reciprocal of the rank of the first relevant result across all queries.

What Is MRR (Mean Reciprocal Rank)?

Mean Reciprocal Rank (MRR) is a ranking-quality metric that measures how high the first relevant result appears in a ranked list. For each query, take the reciprocal of the rank of the first correct hit — 1/1 at position one, 1/2 at position two, 1/3 at position three — then average across queries. The score lives in [0, 1], higher is better, and rewards putting the right answer at the top. In a FutureAGI RAG-evaluation workflow, MRR runs over retriever and reranker outputs alongside ContextRelevance and Recall@K.
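
As a quick sanity check, here is a tiny worked example in plain Python with three hypothetical queries whose first relevant result lands at ranks 1, 3, and 2:

# Rank of the first relevant result for three hypothetical queries
first_relevant_ranks = [1, 3, 2]

# MRR is the average of the reciprocal ranks: (1/1 + 1/3 + 1/2) / 3 ≈ 0.61
mrr = sum(1 / r for r in first_relevant_ranks) / len(first_relevant_ranks)
print(round(mrr, 2))  # 0.61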

Why It Matters in Production LLM and Agent Systems

For RAG and search systems, what users see is the top result. If the right document is buried at rank 7, the model never sees it either — most RAG pipelines feed only the top-K (typically 3–5) chunks into the prompt. A retriever with high recall but bad ranking is functionally useless: the right context is in the corpus, the system retrieved it, but the prioritisation step buried it below the cutoff. MRR is the single metric that captures whether your retriever is putting the right thing first.

The pain is concrete. A RAG team sees its faithfulness scores climb whenever it tunes the LLM, yet hits a plateau it cannot break — because the LLM sits downstream of a retriever that ranks the right chunk at position 4 while only the top-3 chunks are passed in. A search team ships a new embedding model; recall@10 improves but MRR drops, meaning the right documents are still being retrieved but surface lower, so users see fewer good answers at the top. An agent team sees tool-selection accuracy drop after a reranker change; the right tool is in the candidate list but ranked third, and the planner picks the top one.

In 2026-era agentic RAG, ranking matters even more. Multi-vector retrieval, hybrid search, and context-window pressure mean only the top 1–3 chunks survive into the LLM prompt. MRR and its companion nDCG become the metrics you optimise reranker training against — not because users care about ranks 4–10, but because the model literally cannot see them.

How FutureAGI Handles MRR

FutureAGI does not maintain a MeanReciprocalRank evaluator class — MRR is a derived metric you compute on top of relevance judgments. The way FutureAGI surfaces it:

  • Relevance signals: fi.evals.ContextRelevance and fi.evals.ChunkAttribution score each retrieved chunk’s relevance to the query.
  • Rank computation: a one-line aggregation across the scored chunks per query — find the first chunk above your relevance threshold, take 1/rank, average.
  • Dataset workflow: register the queries, retrieved chunks, and ground-truth labels in a Dataset, run the evaluator with Dataset.add_evaluation(), and compute MRR alongside Recall@K and Faithfulness.
  • Production tracking: wire traceAI-langchain or traceAI-llamaindex spans to capture the retrieved chunks per query, then run nightly MRR on a sampled cohort.

Concretely: a legal-document RAG team is comparing two rerankers — a cross-encoder and a Cohere reranker. They run 800 golden queries through both, score chunk relevance with ContextRelevance, and compute MRR. The cross-encoder scores 0.81 MRR; Cohere scores 0.74 — but Cohere’s Recall@10 is higher. The team picks the cross-encoder for top-3 retrieval and uses Cohere as a fallback, gating the choice through FutureAGI’s regression-eval workflow. Without MRR, the choice would have hinged on recall alone and quietly degraded faithfulness downstream.

How to Measure or Detect It

MRR is a simple aggregate but the signals it depends on come from your eval stack:

  • ContextRelevance: fi.evals.ContextRelevance returns a 0–1 score per chunk; threshold defines what counts as “relevant” for the rank computation.
  • ChunkAttribution: fi.evals.ChunkAttribution flags whether each chunk actually contributed to the final answer — pairs with MRR for end-to-end retrieval quality.
  • Per-query reciprocal-rank (dashboard): the distribution of 1/rank across queries; queries with a reciprocal rank of 0 (no relevant chunk retrieved) surface the worst retrieval failures.
  • MRR-by-cohort: split queries by language, query length, or topic; surfaces retrieval brittleness in slices. Both views are sketched after the minimal example below.
  • MRR vs. Recall@K: track both — MRR climbing with Recall flat means ranking improved; Recall climbing with MRR flat means coverage improved.

Minimal Python:

from fi.evals import ContextRelevance

# dataset: list of (query, retrieved_chunks) pairs
cr = ContextRelevance()
mrr_total = 0.0
for query, chunks in dataset:
    # score each retrieved chunk's relevance to the query (0-1)
    scores = [cr.evaluate(input=query, context=c).score for c in chunks]
    # rank of the first chunk above the relevance threshold (1-indexed)
    rank = next((i + 1 for i, s in enumerate(scores) if s > 0.7), None)
    # queries with no relevant chunk contribute 0 (track their rate separately)
    mrr_total += 1 / rank if rank else 0
print("MRR:", mrr_total / len(dataset))

Common Mistakes

  • Computing MRR with no relevance threshold. “First non-zero score” is not the same as “first relevant chunk” — pick a threshold (typically 0.6–0.8) and document it.
  • Reporting only mean MRR. A mean of 0.6 hides whether half the queries scored 1.0 and half scored 0.2 — show distribution, not just average.
  • Using MRR when many partial matches matter. For exploratory or multi-aspect queries, nDCG@K is the right metric; MRR over-weights position 1.
  • Ignoring out-of-corpus queries. Queries with no relevant document at all collapse MRR — track them as a separate “no-hit-rate” instead of letting them drag the mean.
  • Optimising MRR without checking faithfulness downstream. A retriever can rank a noisy chunk first and still score high MRR; pair it with Faithfulness or Groundedness.

Frequently Asked Questions

What is MRR (Mean Reciprocal Rank)?

MRR averages the reciprocal of the rank of the first relevant document across queries. It scores 1.0 when the right result is always at position 1, and lower as the right result drifts down the list.

How is MRR different from nDCG?

MRR cares only about the first relevant result's position. nDCG considers the relevance grade of every result up to a cutoff K. Use MRR for navigational and lookup queries; nDCG when many partial matches matter.
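
A small illustration of the difference, using hypothetical binary relevance labels: both rankings below put their first relevant result at position 2, so MRR cannot tell them apart, while nDCG@3 rewards the one that surfaces a second relevant result within the cutoff:

import math

def reciprocal_rank(labels):
    # 1 / rank of the first relevant item (1-indexed); 0.0 if none is relevant
    return next((1 / (i + 1) for i, rel in enumerate(labels) if rel), 0.0)

def ndcg_at_k(labels, k=3):
    # binary-relevance nDCG: DCG of this ranking divided by DCG of the ideal ranking
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(labels[:k]))
    ideal = sum(rel / math.log2(i + 2) for i, rel in enumerate(sorted(labels, reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0

a = [0, 1, 1]  # relevant results at ranks 2 and 3
b = [0, 1, 0]  # single relevant result at rank 2
print(reciprocal_rank(a), reciprocal_rank(b))       # 0.5 0.5  (identical MRR)
print(round(ndcg_at_k(a), 2), round(ndcg_at_k(b), 2))  # a scores higher nDCG@3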

How do you compute MRR for a RAG retriever?

For each query, find the rank of the first relevant chunk, take the reciprocal, and average. FutureAGI's ContextRelevance evaluator surfaces the relevance signal you compute MRR on top of.