What Is Precision@K?
An evaluation metric that measures the fraction of top-K ranked retrieval results judged relevant to the query.
Precision@K is an LLM-evaluation metric for ranked retrieval that measures the fraction of the first K returned items that are relevant. In a RAG eval pipeline, production trace, or agent tool-result list, it answers a narrow question: did the system put useful evidence near the top? FutureAGI measures it with the PrecisionAtK evaluator so engineers can catch noisy retrieval before generation turns irrelevant chunks into unsupported answers.
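To make the definition concrete, here is a minimal sketch in plain Python. It is not the FutureAGI evaluator: the function name `precision_at_k` and the binary labels are assumptions for illustration. It divides by K even when fewer than K items come back, which is one common convention.

```python
from typing import Sequence

def precision_at_k(relevant: Sequence[bool], k: int) -> float:
    """Fraction of the first k ranked items that are relevant.

    `relevant` must be in retrieval order (rank 1 first).
    Divides by k, so a short result list is penalized.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(1 for r in relevant[:k] if r) / k

# Hypothetical ranked labels for one query: ranks 1, 3, and 4 are relevant.
labels = [True, False, True, True, False, False]
p5 = precision_at_k(labels, 5)  # 3 relevant items in the top 5
p3 = precision_at_k(labels, 3)  # 2 relevant items in the top 3
print(round(p5, 4), round(p3, 4))
```

The same ranked list gives different scores at different K, so K should be reported with every score.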
Why Precision@K Matters in Production LLM and Agent Systems
Low Precision@K fails loudly in user experience but quietly in infrastructure. A retriever may return twenty documents, and the answer model may see only the first five because of context-window limits, latency budgets, or prompt templates. If those first five are mostly irrelevant, the system can hallucinate from stale policy pages, cite the wrong source, or trigger a tool call with weak evidence.
Developers feel this as confusing RAG failures: the right document exists in the index, but it appears below the cutoff. SREs see answer-quality alerts without a matching outage. Product teams see thumbs-down comments like “it ignored the document I uploaded.” Compliance reviewers see citation failures when a generated answer references a non-authoritative chunk.
The metric is more important in 2026-era agentic systems because retrieval is no longer one step before one answer. Agents retrieve memory, search documents, call tools, rerank sources, and hand evidence between subagents. A low Precision@K at any step can poison the next action. Common trace symptoms include high ContextRelevance pass rates for one chunk but low answer faithfulness overall, relevant chunks just below rank K, high token use with low context use, and eval failures concentrated on long-tail queries. Precision@K narrows the investigation to ranked evidence quality instead of blaming the generator first.
How FutureAGI Handles Precision@K
FutureAGI’s approach is to treat Precision@K as a deterministic ranked-retrieval metric attached to datasets and traces at /platform/evaluate, not as a vague “retrieval quality” label. The glossary anchor is eval:PrecisionAtK; the concrete surface is the PrecisionAtK local metric in fi.evals. It computes the fraction of top-K relevance_scores that meet the configured relevance_threshold, returning an output score and a reason such as how many of the top K items were relevant.
Real workflow: a support RAG agent on Claude Opus 4.7 answers refund-policy questions. The team captures ranked contexts from a traceAI-langchain trace, scores each retrieved chunk against the query, and runs PrecisionAtK with k=5. If the score drops from 0.80 to 0.40 after a chunking change, the engineer opens the failed rows, compares the top five contexts, and checks whether the retriever or reranker introduced irrelevant pricing pages above refund policy text. The next action might be a regression eval on refund intents, a reranker threshold change, or a rollback of the chunking job.
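The regression check in that workflow can be sketched in plain Python. This is an illustrative stand-in, not the FutureAGI SDK: the helper names, the query ids, the relevance scores, and the `max_drop` tolerance are all hypothetical.

```python
def precision_at_k(scores, k, threshold):
    """Fraction of the top-k relevance scores at or above the threshold."""
    return sum(1 for s in scores[:k] if s >= threshold) / k

def find_regressions(baseline, candidate, k=5, threshold=0.7, max_drop=0.2):
    """Return query ids whose Precision@K dropped by more than max_drop.

    Both arguments map a query id to its ranked relevance scores.
    """
    regressions = {}
    for query_id, base_scores in baseline.items():
        before = precision_at_k(base_scores, k, threshold)
        after = precision_at_k(candidate[query_id], k, threshold)
        if before - after > max_drop:
            regressions[query_id] = (before, after)
    return regressions

# Hypothetical ranked relevance scores before and after a chunking change.
baseline = {
    "refund-01": [0.9, 0.8, 0.85, 0.75, 0.3],
    "refund-02": [0.95, 0.9, 0.2, 0.8, 0.7],
}
candidate = {
    "refund-01": [0.9, 0.3, 0.2, 0.75, 0.1],
    "refund-02": [0.95, 0.9, 0.2, 0.8, 0.7],
}
flagged = find_regressions(baseline, candidate)
print(flagged)
```

Flagging per query rather than comparing two averages keeps the failed rows visible, which is what the engineer needs in order to compare the top five contexts side by side.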
ContextPrecision is the adjacent FutureAGI metric when the question is ranking order across all retrieved chunks. RecallAtK is the companion when the question is missing evidence. Unlike Ragas context precision as a standalone offline report, FutureAGI keeps the score near the trace, dataset row, model version, and release cohort, so the failure points to the retrieval stage that changed.
How to Measure or Detect Precision@K
Measure Precision@K after defining what “relevant” means for the task. Relevance can come from human labels, synthetic gold labels, rubric scoring, or another evaluator, but the ranked list must preserve retrieval order.
- PrecisionAtK evaluator: returns the fraction of top-K relevance scores that meet the configured threshold.
- ContextPrecision comparison: helps when relevant chunks exist but are ranked below weaker chunks.
- Dashboard signal: track eval-fail-rate-by-cohort for retrieval rows, especially after index, embedding, reranker, or chunking changes.
- Trace signal: compare top-K contexts in traceAI-langchain traces with downstream answer failures and user thumbs-down rate.
- Counter-metric: pair with RecallAtK; perfect Precision@K can still hide a retriever that returns too little evidence.
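The counter-metric point is easy to see with numbers. The sketch below uses plain Python with hypothetical document ids and gold labels; it does not call the FutureAGI RecallAtK evaluator, and the helper names are assumptions.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(relevant_ids)

# Hypothetical case: every returned chunk is relevant, but the query
# needs six relevant chunks and the retriever surfaced only two of them.
ranked = ["refund-policy", "return-window"]
relevant = {"refund-policy", "return-window", "exceptions",
            "gift-cards", "regional-rules", "timelines"}
p2 = precision_at_k(ranked, relevant, k=2)
r2 = recall_at_k(ranked, relevant, k=2)
print(p2, round(r2, 4))
```

Precision is perfect here while recall is one third, which is the case a precision-only dashboard would miss.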
Minimal Python:
from fi.evals import PrecisionAtK
metric = PrecisionAtK(config={"k": 3, "relevance_threshold": 0.7})
result = metric.evaluate([{
    "query": "How do refunds work?",
    "contexts": ["refund policy", "pricing page", "return window"],
    "relevance_scores": [0.95, 0.2, 0.8],
}])
print(result.eval_results[0].output) # 0.6667
Precision@K vs. Other Ranking Metrics in 2026
Precision@K is one of four ranking metrics that matter for production RAG. The choice between them depends on what failure mode you care about:
| metric | best for | failure mode it catches |
|---|---|---|
| PrecisionAtK | dense top-K windows | irrelevant chunks in the consumed top-K |
| RecallAtK | coverage on rare evidence | required evidence missing from top-K |
| NDCG | ordered top-K with grades | useful evidence buried at rank 6-8 |
| MRR | single-answer lookup | first relevant result too low |
For most 2026 production stacks running Claude Opus 4.7 or GPT-5.x as the generator, the right pair is PrecisionAtK (catches noise) and RecallAtK (catches coverage gaps). Add NDCG when graded relevance labels are available and the reranker can choose between “useful” and “exactly correct” evidence. Skip MRR unless the task genuinely has one correct document; most enterprise RAG does not.
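For readers who want to see how the two order-sensitive metrics differ from a plain fraction, here is a minimal sketch in plain Python. It uses the linear-gain form of DCG and computes the ideal ordering from the retrieved list only, which assumes every relevant item was retrieved. The function names, grades, and labels are hypothetical.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain with the linear-gain formulation."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(grades[:k], start=1))

def ndcg_at_k(grades, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(relevant):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical graded labels: the best chunk (grade 3) is buried at rank 7.
grades = [0, 2, 1, 0, 0, 0, 3, 0]
ndcg = ndcg_at_k(grades, 8)
ideal_ndcg = ndcg_at_k(sorted(grades, reverse=True), 8)

# MRR is the mean reciprocal rank over a set of queries.
first_hits = [[False, True, False], [True, False, False], [False, False, False]]
mrr = sum(reciprocal_rank(r) for r in first_hits) / len(first_hits)
print(round(ndcg, 3), ideal_ndcg, mrr)
```

NDCG penalizes the buried grade-3 chunk even though a threshold-based Precision@8 would count it as one relevant item like any other.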
For external retrieval calibration, the BEIR benchmark (18 heterogeneous IR tasks across QA, fact-checking, citation prediction) and MTEB (Massive Text Embedding Benchmark, 56 datasets, 8 task types) report Precision@K alongside NDCG@10. Frontier cross-encoder rerankers (cohere-rerank-v3, voyage-rerank-2) typically lift Precision@5 by 4-9 points over dense-only retrieval on narrow-domain corpora. MultiHop-RAG (2,556 multi-hop questions) and RAGBench (100K examples across five domains) anchor the RAG-specific side; Precision@5 on multi-hop questions sits 8-15 points below single-hop for the same retriever.
The 2026-specific shift: long-context models change the optimal K. When Gemini 3 Pro can accept 1M tokens, teams are tempted to set K=50 and let the model sort it out. We’ve measured this: dense top-K windows above 15 chunks reduce Groundedness on policy QA, even when the model technically has the budget. The retriever still needs to filter; long context is a safety net, not a replacement for ranked retrieval quality. The pattern that works: K=5-8 for the prompt, plus a wider K=30 candidate set passed through a reranker step that rescores it before generation.
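The wide-then-narrow pattern can be sketched as a two-stage function. Everything here is a hypothetical stand-in: `retrieve_then_rerank`, the toy corpus, and the term-overlap scorer exist only so the example runs without a vector store or a real reranker.

```python
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    rerank_score: Callable[[str, str], float],
    wide_k: int = 30,
    prompt_k: int = 5,
) -> List[Tuple[str, float]]:
    """Pull wide_k candidates, rescore them, keep the best prompt_k."""
    candidates = retrieve(query, wide_k)
    scored = [(chunk, rerank_score(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:prompt_k]

# Toy corpus: the relevant chunks sit below rank 20 in retrieval order.
CORPUS = [f"pricing page {i}" for i in range(20)] + [
    "refund policy overview", "refund exceptions", "refund timelines",
]

def toy_retrieve(query: str, k: int) -> List[str]:
    """Pretend dense retrieval returned the corpus in this order."""
    return CORPUS[:k]

def toy_rerank_score(query: str, chunk: str) -> float:
    """Share of query terms that also appear in the chunk."""
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk.lower().split())
    return len(query_terms & chunk_terms) / len(query_terms)

top = retrieve_then_rerank("refund policy", toy_retrieve, toy_rerank_score)
narrow_only = toy_retrieve("refund policy", 5)
print([chunk for chunk, _ in top])
print(narrow_only)
```

In this toy setup, retrieving only five chunks returns nothing but pricing pages, while the wide pull plus rescoring moves the three refund chunks into the prompt window.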
Common Mistakes
Precision@K is easy to report and easy to misread because K, relevance labels, and retrieval order all shape the result.
- Choosing K from a benchmark, not the prompt. If the model sees five chunks, Precision@10 overstates production evidence quality.
- Ignoring relevance threshold drift. Thresholds of 0.5 and 0.8 can tell different stories for the same ranked list.
- Treating Precision@K as answer quality. It measures retrieved evidence density, not whether the final response used that evidence correctly.
- Optimizing only Precision@K. A retriever can return one perfect chunk and still miss other required evidence; check RecallAtK.
- Averaging across unrelated routes. Product docs, legal policies, and user-uploaded files have different relevance distributions and failure costs.
- Reporting Precision@K without the corresponding RecallAtK. A precise top-K can still miss the only required evidence; the pair is what tells the full story.
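The first two mistakes show up when the same ranked list is scored under different settings. The sketch below is plain Python with hypothetical relevance scores, not output from a real retriever.

```python
def precision_at_k(scores, k, threshold):
    """Fraction of the top-k relevance scores at or above the threshold."""
    return sum(1 for s in scores[:k] if s >= threshold) / k

# Hypothetical relevance scores for one query, in retrieval order.
# Most of the strong chunks sit at ranks 6-10, below a five-chunk prompt.
scores = [0.95, 0.3, 0.6, 0.2, 0.1, 0.85, 0.75, 0.55, 0.9, 0.4]

# Same list, same K, two thresholds.
by_threshold = {t: precision_at_k(scores, 5, t) for t in (0.5, 0.8)}

# Same list, same threshold, two cutoffs.
by_k = {k: precision_at_k(scores, k, 0.5) for k in (5, 10)}

print(by_threshold)
print(by_k)
```

If the prompt consumes five chunks, the K=10 figure overstates what the model actually sees, and moving the threshold from 0.5 to 0.8 halves the K=5 score without any change to the retriever.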
Frequently Asked Questions
What is Precision@K?
Precision@K is a ranked-retrieval evaluation metric that measures the share of the first K returned results that are relevant. In LLM and agent eval pipelines, it tells whether the context window is being filled with useful evidence.
How is Precision@K different from Recall@K?
Precision@K asks how much of the top-K list is relevant; Recall@K asks how much of all relevant evidence was retrieved within K. Precision@K punishes noisy top results, while Recall@K punishes missing evidence.
How do you measure Precision@K?
Use the FutureAGI PrecisionAtK evaluator with ranked relevance scores and a configured K. It returns the fraction of top-K items whose relevance score meets the threshold.