What Is Average Precision?

A ranking-quality metric that averages precision at every position where a relevant item appears in a single query's ranked result list.

Average precision (AP) is a ranking-quality metric that summarizes how well a retrieval system places relevant items at the top of a ranked list. It is the area under the precision-recall curve for a single query or, equivalently, the average of the precision values measured at every rank position where a relevant item appears. AP is the per-query primitive behind mean average precision (MAP), the canonical aggregate for retrieval benchmarks. In RAG and RAG-family stacks, AP scores the retriever and reranker stages. FutureAGI exposes the closest in-stack metrics through PrecisionAtK, RecallAtK, NDCG, and MRR evaluators on a versioned Dataset.
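To make the definition concrete, here is a minimal plain-Python sketch of AP for one query. It is not part of fi.evals; the function name `average_precision` is illustrative, and it assumes binary relevance labels, a ranked list without duplicate ids, and the common convention of dividing by the total number of relevant items (so relevant items that were never retrieved count against the score).

```python
from typing import Sequence, Set


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """AP for a single query under binary relevance (illustrative helper)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item appears.
            precision_sum += hits / rank
    # Dividing by all relevant items penalizes relevant documents that were missed.
    return precision_sum / len(relevant)


retrieved = ["doc-7", "doc-2", "doc-9", "doc-3", "doc-1"]
relevant = {"doc-2", "doc-1", "doc-4"}

# Hits at ranks 2 and 5 give precisions 1/2 and 2/5; (0.5 + 0.4) / 3 = 0.3
ap = average_precision(retrieved, relevant)
print(round(ap, 3))
```

Moving `doc-2` to rank 1 would raise the first precision term from 0.5 to 1.0, which is the rank sensitivity that recall@k alone does not show.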

In 2026, retrieval is rarely a single component. A typical production RAG path is query rewrite → hybrid retriever (dense + BM25) → cross-encoder reranker → context assembly, often plus a graph-RAG hop. Each stage has its own AP, and a single retrieval regression can hide inside an aggregate.

Why Average Precision Matters in Production LLM and Agent Systems

Retrieval quality determines RAG quality. If the right chunk is not in the top-k handed to the LLM, even a perfect generator cannot produce a grounded answer. AP measures exactly this: how aggressively does the system rank relevant items above irrelevant ones? Reporting only top-1 accuracy or recall@10 hides the difference between “the right answer is at position 1” and “the right answer is at position 9 buried under noise”. Both can satisfy recall@10, but the LLM only meaningfully attends to the first few chunks.

The pain feels different by role:

  • RAG engineers see hallucination spike when a new embedding model drops AP on a long-tail topic cohort.
  • SREs see latency creep when a reranker is added to compensate for poor retriever AP.
  • Product managers see thumbs-down rates climb on questions that look easy in QA.
  • Compliance leads see citation gaps when the LLM cites a chunk that wasn’t actually relevant.

In 2026 RAG stacks, AP analysis is critical because the retrieval surface has fragmented. A single query passes through a vector retriever (Qdrant, Pinecone, Weaviate), a BM25 retriever, a hybrid fuser (Reciprocal Rank Fusion is the default), and a cross-encoder reranker (Cohere Rerank 3, Voyage rerank-2, or open-weight bge-reranker-v3). Each stage has its own AP, and a regression in any one of them can degrade the context the LLM receives. The reliability question is not “what’s our overall AP” but “where in the retrieval chain did AP drop, and on which cohort.”
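For readers who have not seen the fusion step spelled out, the following is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The function name and the example document ids are hypothetical, and the smoothing constant `k=60` is the value commonly used in the literature, not something the stack above is confirmed to use.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists; each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["doc-2", "doc-7", "doc-1"]  # hypothetical dense-retriever output
bm25 = ["doc-1", "doc-2", "doc-9"]   # hypothetical BM25 output

fused = reciprocal_rank_fusion([dense, bm25])
print(fused)  # ['doc-2', 'doc-1', 'doc-7', 'doc-9']
```

Because the fused list is a new ranking, it has its own AP, which can be compared against the AP of each input list to see whether fusion helped or hurt for a given query.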

How FutureAGI Handles Average Precision and Ranking Quality

FutureAGI’s approach is to score ranking quality at every stage of the retrieval chain on a versioned dataset. There is no managed AveragePrecision evaluator in fi.evals: the standard ML metric is best computed per query and aggregated by the platform layer. The related rank-aware evaluators, however, are part of the local-metric stack and run identically.

The retrieval-stage measurement matrix:

| Stage | Primary metric | Secondary signal |
| --- | --- | --- |
| Query rewrite | AnswerRelevancy on rewritten query | Length delta |
| Dense retriever | RecallAtK (k=20-50) | ContextEntityRecall |
| BM25 retriever | RecallAtK | Length-normalized score |
| Hybrid fusion | PrecisionAtK (k=10) | Per-source contribution |
| Reranker | NDCG (k=10), MRR | Reranker latency p99 |
| LLM context | ContextRelevance (top-K) | Groundedness on answer |

A typical workflow: the team builds an evaluation Dataset of (query, relevance-labelled document set) pairs. They call Dataset.add_evaluation and attach PrecisionAtK, RecallAtK, NDCG, and MRR. Every retriever and reranker change is gated against this dataset. For online observability, retrieval spans are captured through traceAI-llamaindex, traceAI-langchain, or vector-DB-specific integrations like traceAI-pinecone and traceAI-weaviate. ContextRelevance runs on live traces to flag when retrieved context is irrelevant, and ContextEntityRecall verifies entity-level coverage.

When a candidate embedding model is being evaluated, the workflow uses Agent Command Center traffic-mirroring to send a percentage of live queries to the candidate while the production model still serves users. The mirrored cohort is scored with NDCG and PrecisionAtK against ground truth from the human-annotation queue. Promotion is gated on the regression eval result. When Groundedness drops in production but the LLM didn’t change, the FutureAGI dashboard’s eval-fail-rate-by-cohort on retrieval evaluators almost always identifies the upstream regression. Compared with Ragas, which reports retrieval metrics on static datasets, the FutureAGI surface keeps the same metric attached to live spans.

In our 2026 evals, the strongest predictor of RAG hallucination is reranker NDCG@5, not retriever recall@50. Recall buys you the right document somewhere in the haystack; NDCG buys you that document in front of the LLM’s attention window. Public anchors: on RAGTruth (18K labeled response chunks) and RAGBench (~100K examples across five industry domains), groundedness regressions correlate roughly 2x more strongly with NDCG@5 than with recall@50, and CRAG (Meta/KDD Cup 2024, ~4.4K questions across five domains) shows the same gap when long-tail entity queries are mixed in.

How to Measure or Detect Ranking Quality

Use rank-aware evaluators rather than raw recall numbers:

  • PrecisionAtK: fraction of the top-K results that are relevant. Use a K that matches what the LLM actually consumes.
  • RecallAtK: fraction of all relevant results that appear in the top K; pair it with precision.
  • NDCG: rank-position-weighted relevance, the strongest single ranking metric.
  • MRR: mean of the reciprocal rank of the first relevant result; useful when one canonical answer dominates.
  • ContextRelevance and ContextRecall: live evaluators for the chunks an LLM actually sees.
  • Per-cohort reporting: split metrics by topic, language, and query type; aggregates lie about long-tail performance.
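As a reference for what the rank-aware metrics in this list compute, here is a plain-Python sketch under binary relevance. These helpers are illustrative and are not the fi.evals implementations; in particular, precision divides by K even when fewer than K results are returned, and NDCG uses a log2 discount with gain 1 for relevant items, both of which are assumptions about convention.

```python
import math
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant items."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item; MRR is the mean of this over queries."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain NDCG: discounted gain divided by the gain of an ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(retrieved[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["doc-7", "doc-2", "doc-9", "doc-3", "doc-1"]
relevant = {"doc-2", "doc-1", "doc-4"}

p5 = precision_at_k(retrieved, relevant, 5)   # 2 relevant in 5 slots
r5 = recall_at_k(retrieved, relevant, 5)      # 2 of 3 relevant items found
rr = reciprocal_rank(retrieved, relevant)     # first hit at rank 2
ndcg5 = ndcg_at_k(retrieved, relevant, 5)
print(p5, r5, rr, round(ndcg5, 4))
```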

A minimal rank-aware check on retrieval output:

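from fi.evals import PrecisionAtK, NDCG, MRR

# Rank-aware evaluators; K should match what the LLM actually consumes.
precision = PrecisionAtK(k=5)
ndcg = NDCG(k=10)
mrr = MRR()

# Ranked retriever output (best first) and the labelled relevant set.
retrieved = ["doc-7", "doc-2", "doc-9", "doc-3", "doc-1"]
relevant = ["doc-2", "doc-1", "doc-4"]

# Each evaluator returns a result object; .score holds the metric value.
print(precision.evaluate(retrieved=retrieved, relevant=relevant).score)
print(ndcg.evaluate(retrieved=retrieved, relevant=relevant).score)
print(mrr.evaluate(retrieved=retrieved, relevant=relevant).score)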

Common Mistakes

  • Reporting only recall@k. Recall ignores rank order; AP and NDCG capture how aggressively relevant items are pushed to the top.
  • Using one global K. The right K matches what the LLM consumes (usually 3-10), not what the vector DB returns.
  • Ignoring cohort breakdowns. Long-tail queries can collapse silently while the average AP looks healthy.
  • Treating reranker AP gains as additive to retriever AP. Cross-encoder rerankers can only re-rank what the retriever returned; bad retriever recall caps AP.
  • No regression eval before swapping embedding models. Embedding swaps shift the entire vector geometry; AP regressions are easy to miss without a versioned dataset.
  • Static gold labels. Production traffic drifts; the labelled set needs monthly refresh.
  • Mixing graded and binary relevance. AP assumes binary labels; for graded relevance, NDCG is the right metric.
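The point about reranker gains not being additive can be checked with a small worked example. The sketch below is illustrative only (the helper names are hypothetical): it computes the best AP any reranker could reach on a fixed candidate set by placing every relevant candidate first, assuming binary labels and AP normalized by the total number of relevant items.

```python
from typing import List, Sequence, Set


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """AP for one query under binary relevance (illustrative helper)."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def best_case_ap_after_rerank(candidates: Sequence[str], relevant: Set[str]) -> float:
    """AP of an ideal reranker that can only reorder the given candidates."""
    # Stable sort: relevant candidates (key False) move ahead of irrelevant ones.
    ideal: List[str] = sorted(candidates, key=lambda d: d not in relevant)
    return average_precision(ideal, relevant)


candidates = ["doc-7", "doc-2", "doc-9", "doc-3", "doc-1"]
relevant = {"doc-2", "doc-1", "doc-4"}  # doc-4 was never retrieved

before = average_precision(candidates, relevant)
ceiling = best_case_ap_after_rerank(candidates, relevant)
print(round(before, 3), round(ceiling, 3))
```

Here the ceiling equals the retriever's recall on the candidate set (2 of 3 relevant documents), so no reranker can push AP for this query above roughly 0.667 until the retriever surfaces `doc-4`.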

Frequently Asked Questions

What is average precision?

Average precision (AP) is the area under the precision-recall curve for a single query: the average of the precision values at every rank position where a relevant document appears in the result list.

How is average precision different from mean average precision?

Average precision is per query; mean average precision is the mean of AP across all queries in a set. AP is the primitive; MAP is the aggregate that's reported on a benchmark.
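A small sketch of the relationship, using made-up queries: AP is computed per query, MAP is the mean over the query set, and the same per-query scores can be regrouped by cohort. The cohort labels, document ids, and helper name are hypothetical, and binary relevance is assumed.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Set, Tuple


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """AP for one query under binary relevance (illustrative helper)."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Hypothetical labelled queries: (cohort, ranked results, relevant ids).
queries: List[Tuple[str, List[str], Set[str]]] = [
    ("head", ["a", "b"], {"a"}),
    ("head", ["c", "d"], {"c"}),
    ("head", ["e", "f"], {"e"}),
    ("long_tail", ["x", "y", "z", "t"], {"t"}),
]

# Per-query primitive, then the benchmark-level aggregate.
ap_scores = [average_precision(ranked, rel) for _, ranked, rel in queries]
map_score = mean(ap_scores)

# The same AP scores grouped by cohort.
grouped: Dict[str, List[float]] = defaultdict(list)
for (cohort, _, _), ap in zip(queries, ap_scores):
    grouped[cohort].append(ap)
map_by_cohort = {cohort: mean(scores) for cohort, scores in grouped.items()}

print(map_score)      # 0.8125
print(map_by_cohort)  # {'head': 1.0, 'long_tail': 0.25}
```

The overall MAP of 0.8125 looks healthy while the long-tail cohort sits at 0.25, which is why the aggregate should be reported alongside a cohort breakdown.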

How do you measure average precision in a RAG pipeline?

Score retrieval results with `PrecisionAtK`, `RecallAtK`, and `NDCG` against a labelled relevance set; FutureAGI runs these on a versioned `Dataset` so changes to chunking, embedding, or rerankers are caught before deploy.