Evaluation

What Is Mean Reciprocal Rank (MRR)?

A ranking metric that averages the reciprocal rank of the first relevant result across queries.

Mean reciprocal rank (MRR) is an LLM-evaluation metric for ranked retrieval, search, and agent tool-candidate lists that measures how early the first relevant result appears. For each query, it takes the reciprocal of the rank of the first correct item, then averages those values across queries. In a production eval pipeline or trace, MRR tells you whether the system finds a usable document, tool, answer, or route near the top before the model spends context, latency, or user trust on worse options.
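
In formula terms, MRR = (1 / |Q|) * Σ (1 / rank_i), where rank_i is the position of the first relevant item for query i, and queries with no relevant item in the list contribute 0. A minimal reference implementation of that definition, independent of any SDK and using illustrative field names (ranked_ids, relevant_ids), looks like this:

def reciprocal_rank(ranked_ids, relevant_ids):
    # 1 / position of the first relevant candidate; 0.0 if none appears
    for position, candidate in enumerate(ranked_ids, start=1):
        if candidate in relevant_ids:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(queries):
    # Average the per-query reciprocal ranks across the whole eval set
    return sum(
        reciprocal_rank(q["ranked_ids"], q["relevant_ids"]) for q in queries
    ) / len(queries)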

Why Mean Reciprocal Rank Matters in Production LLM and Agent Systems

Bad ranking fails quietly. A RAG system can retrieve the right policy paragraph at rank 8, but if the context window only admits the top 5 chunks, the model answers from weaker evidence. A tool-using agent can include the right API in a candidate list, yet still call a plausible but wrong tool because that tool was ranked first. MRR exposes this failure earlier than final-answer accuracy because it focuses on the first usable hit.

FutureAGI teams usually see the pain in three places: search logs with repeated query rewrites, traces where the model ignores late evidence, and dashboards where answer quality drops while raw recall stays flat. Developers debug reranker weights. SREs see latency rise as agents retry retrieval. Product teams see users click the second or third citation, then lose trust in the answer.

MRR matters even more in the multi-step pipelines of 2026 because each step creates another ranked decision: documents, tools, memory entries, routes, or follow-up actions. If the first relevant option is consistently buried, an agent burns tokens and tool calls before it can recover. The symptom is not always a hard failure; it is often a slow trace where the correct item becomes visible too late to help during live traffic.

How FutureAGI Handles Mean Reciprocal Rank

FutureAGI’s approach is to treat MRR as a first-hit reliability signal, not a generic quality score. The specific surface for eval:MRR is fi.evals.MRR, listed in the FutureAGI evaluator catalog as the local metric for “how quickly the first relevant result appears.” Engineers run it on ranked candidate lists from retrieval evals, search evals, agent-memory lookups, or tool-selection prefilters.

Example: a support-search agent built with traceAI-langchain retrieves 10 help-center chunks before the model writes an answer. Each retrieval span records the ordered candidate IDs, and the golden dataset stores the known relevant document IDs. A nightly regression eval runs MRR by tenant, language, and query type. If MRR drops from 0.74 to 0.51 after an index rebuild while RecallAtK stays flat, the retriever still finds the right document but ranks it too low. The engineer checks the reranker version, embedding model, recency boost, and hybrid-search merge rule instead of rewriting the prompt.
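
A sketch of that kind of slice-level reporting, assuming each logged query record carries a slice field such as tenant next to its ranked and relevant IDs (the field names here are hypothetical, not the traceAI-langchain span schema), and reusing the mean_reciprocal_rank helper sketched earlier:

from collections import defaultdict

def mrr_by_slice(records, slice_key="tenant"):
    # Group query records by the slice field, then score each group separately
    groups = defaultdict(list)
    for record in records:
        groups[record[slice_key]].append(record)
    return {key: mean_reciprocal_rank(rows) for key, rows in groups.items()}

A per-slice table like this is what makes a drop such as 0.74 to 0.51 attributable to a specific tenant, language, or query type instead of hiding inside a global average.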

Unlike NDCG, MRR ignores every relevant item after the first one. That makes it sharp for “can the system get to a usable answer fast?” but incomplete for graded relevance. In FutureAGI, teams usually pair MRR with ContextPrecision, RecallAtK, and a downstream Groundedness or Faithfulness eval. The release gate can block when first-hit ranking regresses, while traces still show whether the generated answer used the retrieved evidence correctly.
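
A release gate on first-hit ranking can then be a plain comparison of the candidate build's MRR against the last accepted baseline. The 10% relative-drop threshold below is an illustrative choice, not a FutureAGI default:

def mrr_gate(current_mrr, baseline_mrr, max_relative_drop=0.10):
    # Allow the release only if MRR has not regressed past the allowed fraction
    if baseline_mrr <= 0:
        return True  # no meaningful baseline yet; defer to other gates
    relative_drop = (baseline_mrr - current_mrr) / baseline_mrr
    return relative_drop <= max_relative_drop

# The regression from the example above (0.74 -> 0.51, roughly a 31% drop) would block.
assert not mrr_gate(0.51, 0.74)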

How to Measure or Detect Mean Reciprocal Rank

MRR is directly measurable when you have a ranked list and at least one known relevant item for each query. Watch these signals together:

  • fi.evals.MRR - returns the mean of 1 / rank for the first relevant candidate across queries.
  • fi.evals.PrecisionAtK - shows how dense the top k list is with relevant items.
  • fi.evals.NDCG - handles graded relevance when “excellent” and “acceptable” results should score differently.
  • Trace fields - store ordered candidates on retrieval spans, then join them to golden labels during eval.
  • Dashboard slices - plot MRR p25, top-1 miss rate, and eval-fail-rate-by-cohort by retriever version, query intent, tenant, and agent step.

Minimal Python:

from fi.evals import MRR

mrr = MRR()
# Each item pairs the ordered candidate IDs the system returned with the
# known relevant IDs from the golden dataset.
score = mrr.evaluate([
    {"ranked_ids": ["doc-9", "doc-3", "doc-1"], "relevant_ids": ["doc-3"]},
    {"ranked_ids": ["doc-4", "doc-8"], "relevant_ids": ["doc-4"]},
])
# First query hits at rank 2 (0.5), second at rank 1 (1.0), so MRR = 0.75.
print(score)

User-feedback proxies also help: citation click-through, thumbs-down rate after search-heavy answers, and escalation rate for queries where the relevant document appears below rank 3.

Track both mean and p25 because a healthy mean can hide long-tail queries where relevant candidates start at rank 6 or worse.
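
Both numbers can come straight from the per-query reciprocal ranks with the standard library; statistics.quantiles with n=4 returns quartile cut points, so the first entry is p25:

import statistics

def mrr_summary(reciprocal_ranks):
    # The mean hides the tail; p25 shows how the worst quarter of queries ranks
    return {
        "mean": statistics.fmean(reciprocal_ranks),
        "p25": statistics.quantiles(reciprocal_ranks, n=4)[0],
    }

print(mrr_summary([1.0, 0.5, 1.0, 0.2, 1.0, 0.125, 0.5, 1.0]))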

Common Mistakes

The mistakes are usually measurement design errors, not formula errors:

  • Treating MRR as a full ranking-quality score. It only checks the first relevant item; pair it with NDCG when graded relevance matters.
  • Mixing query types in one average. Navigational queries and exploratory questions have different acceptable first-hit ranks; segment thresholds by intent and workflow.
  • Scoring a different list than the model sees. Evaluate the exact ranked list sent into the prompt, reranker output, or agent step.
  • Rewarding retrieval without checking answer faithfulness. MRR can improve while the generator still cites the wrong claim or ignores the top hit.
  • Comparing scores across changing gold sets. Adding easier labels can raise MRR without any retrieval improvement; version the dataset and report cohort diffs.

Frequently Asked Questions

What is mean reciprocal rank?

Mean reciprocal rank is a ranking metric that scores how early the first relevant result appears. FutureAGI uses MRR for retrieval, search, and agent candidate-ranking evals.

How is MRR different from NDCG?

MRR only cares about the first relevant result, while NDCG can reward multiple relevant results with graded relevance. Use MRR when first-hit speed is the main failure mode.

How do you measure MRR?

Use FutureAGI's `fi.evals.MRR` evaluator on ranked candidates and known relevant IDs. Track the score by dataset slice, retriever version, and agent step.