What Is Dense Passage Retrieval?
A dense retrieval method that embeds queries and passages into one vector space, then retrieves passages by semantic similarity.
Dense passage retrieval is a RAG retrieval method that embeds a user query and candidate passages into the same vector space, then returns the passages with the nearest embeddings. It is the dense retriever stage inside a RAG pipeline or production trace, upstream of reranking and generation. FutureAGI evaluates dense passage retrieval with ContextRelevance, RecallAtK, MRR, and NDCG so teams can catch retrieval misses before the model answers from weak or missing evidence.
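The core mechanic fits in a few lines. Below is a minimal sketch, assuming a sentence-transformers bi-encoder; the model name and passages are illustrative, not from this article:

```python
# Minimal sketch of dense passage retrieval, assuming a
# sentence-transformers bi-encoder. Model name and passages
# are illustrative placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "EU invoices can be exported from Billing > Exports for seven years.",
    "API keys rotate every 90 days and are scoped per workspace.",
    "Refunds for EU customers follow the 14-day withdrawal policy.",
]
query = "Can I export EU invoices for audit?"

# Embed query and passages into the same vector space.
passage_embs = model.encode(passages, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Retrieve by semantic similarity: nearest embeddings win.
scores = util.cos_sim(query_emb, passage_embs)[0]
for rank, idx in enumerate(scores.argsort(descending=True), start=1):
    i = int(idx)
    print(rank, round(float(scores[i]), 3), passages[i])
```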
Why Dense Passage Retrieval Matters in Production LLM and Agent Systems
Retrieval errors are quiet. If dense passage retrieval returns the wrong top-k passages, the generator still writes a fluent answer, but that answer is grounded in the wrong evidence or no evidence at all. The failures stay invisible in logs: hallucinations downstream of a faulty retriever, missing source citations, and context starvation, where the model refuses to answer because the needed passage never arrived.
The pain spreads across the whole team. Retrieval engineers see RecallAtK flatten after an embedding-model change. SREs see p99 retrieval latency rise because the index is tuned for recall without a latency budget. Compliance teams cannot prove that policy answers came from approved documents. Product teams see thumbs-down feedback on answers that look plausible in logs. End users see confident responses that quote an adjacent policy instead of the one they asked about.
Dense passage retrieval is especially important in the agentic RAG systems of 2026. A support agent may retrieve policy, choose a refund tool, then draft a customer message. If the first retrieval step misses the EU refund policy and returns a generic billing page, every later action inherits the error. Unlike BM25, dense retrieval handles paraphrase well, but it is weaker on account IDs, SKUs, and exact legal clauses unless hybrid search or reranking is added.
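One common mitigation is to fuse the dense ranking with a lexical one. Here is a hedged sketch of reciprocal rank fusion (RRF), a standard fusion method; the document IDs and rankings are placeholders:

```python
# Hedged sketch of hybrid retrieval via reciprocal rank fusion (RRF),
# one common way to combine dense and BM25 rankings. All IDs are
# illustrative placeholders.
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

dense_ranking = ["billing_page", "eu_refund_policy", "faq"]        # paraphrase-friendly
bm25_ranking = ["eu_refund_policy", "sku_catalog", "billing_page"] # exact-term-friendly
print(rrf([dense_ranking, bm25_ranking]))
```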
How FutureAGI Handles Dense Passage Retrieval
FutureAGI’s approach is to evaluate dense passage retrieval before the generation step can hide the failure. In a typical workflow, a team instruments a LangChain or LlamaIndex RAG app with traceAI-langchain or traceAI-llamaindex. The retrieval span records the user query, retrieval.documents, retrieval.score, index version, embedding model, and top-k. Those fields become the inputs for retrieval evaluators rather than loose debug text in logs.
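As an illustration of those span fields only, a generic OpenTelemetry span carrying the same attributes might look like the sketch below. This is not the traceAI-langchain or traceAI-llamaindex API, which records these fields automatically; attribute names beyond retrieval.documents and retrieval.score are assumptions:

```python
# Hedged sketch: plain OpenTelemetry, not the traceAI instrumentor API.
# retrieval.documents and retrieval.score come from the text above;
# the other attribute names are illustrative assumptions.
from opentelemetry import trace

tracer = trace.get_tracer("rag-app")

with tracer.start_as_current_span("retrieval") as span:
    span.set_attribute("retrieval.query", "Can I export EU invoices for audit?")
    span.set_attribute("retrieval.documents", ["doc_17", "doc_42", "doc_08"])
    span.set_attribute("retrieval.score", [0.82, 0.74, 0.69])
    span.set_attribute("retrieval.top_k", 5)
    span.set_attribute("retrieval.index_version", "2026-01-eu")
    span.set_attribute("retrieval.embedding_model", "text-embedding-x")
```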
For the eval surface, FutureAGI uses fi.evals.ContextRelevance to score whether retrieved passages can answer the query, RecallAtK to check whether labelled gold passages appear in the returned set, MRR to measure how early the first relevant passage appears, and NDCG to evaluate graded ranking quality. ChunkAttribution can then verify whether the final answer cited or used any retrieved passage, which separates retrieval failure from generation neglect.
A real example: an enterprise search agent answers, “Can I export EU invoices for 2026 audits?” The dense retriever returns invoice API docs at ranks 1-3, while the EU retention policy sits at rank 11. FutureAGI shows low RecallAtK@5, low p25 ContextRelevance, and healthy generation groundedness. The engineer raises the alert threshold for ContextRelevance by corpus segment, adds hybrid search for jurisdiction terms, reruns the regression eval against the golden dataset, and only ships when MRR improves without increasing p99 retrieval latency.
How to Measure or Detect Dense Passage Retrieval Quality
Use component-level retrieval metrics before answer-level metrics:
- fi.evals.ContextRelevance: returns a relevance score and reason for query-versus-passages quality.
- RecallAtK: returns the fraction of labelled queries where the gold passage appears in the top-k set.
- MRR: measures how soon the first relevant passage appears; rank 1 matters more than rank 8.
- NDCG: scores ranking quality when relevance has grades, not just relevant or irrelevant labels.
- Trace fields: retrieval.documents, retrieval.score, top-k, index version, and embedding model.
- Dashboard signals: p10 ContextRelevance, RecallAtK@5, retrieval p99 latency, token-cost-per-trace, and thumbs-down rate by query cohort.
Measure offline on labelled query-passage pairs and online on sampled traces. Compare index versions by cohort rather than averaging all queries together; product-code lookups, policy questions, and exploratory questions fail differently. Treat thresholds as release gates: block a retriever change if RecallAtK@5 drops, or if p99 latency rises while NDCG is flat.
For example, scoring a single query against one retrieved passage:

```python
from fi.evals import ContextRelevance

result = ContextRelevance().evaluate(
    input="Can I export EU invoices for audit?",
    context="EU invoices can be exported from Billing > Exports for seven years."
)
print(result.score, result.reason)
```
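The ranking metrics themselves are easy to compute by hand on a labelled query, which is useful for sanity-checking evaluator output. A minimal sketch with illustrative data; the definitions match the metric list above:

```python
# Minimal sketches of recall@k, MRR, and NDCG for one labelled query.
# `ranked` is the retriever's ordered passage IDs; `gold` is the set of
# labelled gold IDs; `grades` maps passage ID to graded relevance.
# All data here is illustrative.
import math

def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    return len(set(ranked[:k]) & gold) / len(gold)

def mrr(ranked: list[str], gold: set[str]) -> float:
    for rank, doc in enumerate(ranked, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], grades: dict[str, int], k: int) -> float:
    dcg = sum(grades.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

ranked = ["api_docs", "billing_faq", "eu_retention_policy"]
print(recall_at_k(ranked, {"eu_retention_policy"}, k=5))  # 1.0
print(mrr(ranked, {"eu_retention_policy"}))               # 0.333...
print(ndcg_at_k(ranked, {"eu_retention_policy": 2, "billing_faq": 1}, k=5))
```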
Common Mistakes
- Treating dense passage retrieval as a better BM25. Dense search finds paraphrases, but BM25 is often stronger for IDs, exact titles, legal clauses, and policy names.
- Scoring only the final answer. Groundedness can stay high when the model faithfully uses irrelevant context. Measure retrieval before generation and store the retrieved passages.
- Increasing top-k as the default fix. Larger top-k raises token cost and distractors. Compare RecallAtK, NDCG, and p99 latency together before changing it.
- Mixing embeddings after model migration. Query and passage vectors must come from compatible embedding models. Re-embed passages, version the index, and compare cohorts.
- Skipping the reranker for policy or support search. Dense retrieval is good at recall; a reranker often decides whether rank 1 is usable enough for generation (see the sketch after this list).
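As a sketch of that reranking step, a cross-encoder pass over dense-retrieval candidates, assuming sentence-transformers; the model name and passages are illustrative:

```python
# Hedged sketch of cross-encoder reranking over dense-retrieval
# candidates. Model and passages are illustrative placeholders.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Can I export EU invoices for audit?"
candidates = [
    "Invoice API reference: GET /v1/invoices returns a paginated list.",
    "EU invoices can be exported from Billing > Exports for seven years.",
]

# Score each (query, passage) pair jointly; higher means more usable at rank 1.
scores = reranker.predict([(query, p) for p in candidates])
for score, passage in sorted(zip(scores, candidates), reverse=True):
    print(round(float(score), 3), passage)
```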
Frequently Asked Questions
What is dense passage retrieval?
Dense passage retrieval embeds a user query and candidate passages into the same vector space, then retrieves passages with the closest embeddings. It is commonly used as the retrieval stage in RAG systems.
How is dense passage retrieval different from BM25?
BM25 ranks passages by lexical term matching and term frequency. Dense passage retrieval ranks by semantic similarity, so it can find paraphrases, but it can miss exact identifiers unless paired with hybrid search.
How do you measure dense passage retrieval?
FutureAGI measures dense passage retrieval with fi.evals ContextRelevance, RecallAtK, MRR, and NDCG on labelled queries and sampled production traces. These signals show whether the right evidence appeared and where it ranked.