Precision@K is a ranked-retrieval evaluation metric that measures the share of the first K returned results that are relevant. In LLM and agent eval pipelines, it indicates whether the context window is being filled with useful evidence.

How is Precision@K different from Recall@K?

Precision@K asks how much of the top-K list is relevant; Recall@K asks how much of all relevant evidence was retrieved within K. Precision@K punishes noisy top results, while Recall@K punishes missing evidence.

How do you measure Precision@K?

Use the FutureAGI PrecisionAtK evaluator with ranked relevance scores and a configured K. It returns the fraction of top-K items whose relevance score meets the threshold.

What Is Precision@K? Definition & FutureAGI Guide (2026)

What Is Precision@K?

Precision@K is an LLM-evaluation metric for ranked retrieval that measures the fraction of the first K returned items that are relevant. In a RAG eval pipeline, production trace, or agent tool-result list, it answers a narrow question: did the system put useful evidence near the top? FutureAGI measures it with the PrecisionAtK evaluator so engineers can catch noisy retrieval before generation turns irrelevant chunks into unsupported answers.

Why Precision@K Matters in Production LLM and Agent Systems

Low Precision@K fails loudly in user experience but quietly in infrastructure. A retriever may return twenty documents, and the answer model may see only the first five because of context-window limits, latency budgets, or prompt templates. If those first five are mostly irrelevant, the system can hallucinate from stale policy pages, cite the wrong source, or trigger a tool call with weak evidence.

Developers feel this as confusing RAG failures: the right document exists in the index, but it appears below the cutoff. SREs see answer-quality alerts without a matching outage. Product teams see thumbs-down comments like “it ignored the document I uploaded.” Compliance reviewers see citation failures when a generated answer references a non-authoritative chunk.

The metric is more important in 2026-era agentic systems because retrieval is no longer one step before one answer. Agents retrieve memory, search documents, call tools, rerank sources, and hand evidence between subagents. A low Precision@K at any step can poison the next action. Common trace symptoms include high ContextRelevance pass rates for one chunk but low answer faithfulness overall, relevant chunks just below rank K, high token use with low context use, and eval failures concentrated on long-tail queries. Precision@K narrows the investigation to ranked evidence quality instead of blaming the generator first.

How FutureAGI Handles Precision@K

FutureAGI’s approach is to treat Precision@K as a deterministic ranked-retrieval metric attached to datasets and traces, not as a vague “retrieval quality” label. The glossary anchor is eval:PrecisionAtK; the concrete surface is the PrecisionAtK local metric in fi.evals. It computes the fraction of top-K relevance_scores that meet the configured relevance_threshold, returning an output score and a reason such as how many of the top K items were relevant.
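The computation described above can be sketched in plain Python. This is an illustrative re-implementation, not FutureAGI's code: the function name `precision_at_k` is hypothetical, and two behaviors are assumptions, namely that "meets the threshold" means greater than or equal, and that the denominator stays K even when fewer than K items were retrieved.

```python
from typing import Sequence


def precision_at_k(
    relevance_scores: Sequence[float], k: int, relevance_threshold: float
) -> float:
    """Fraction of the first k scores that are >= relevance_threshold."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = relevance_scores[:k]  # scores must already be in retrieval order
    hits = sum(1 for score in top_k if score >= relevance_threshold)
    # Assumption: divide by k, not len(top_k), so a short result list
    # is not rewarded for returning fewer items.
    return hits / k


score = precision_at_k([0.95, 0.2, 0.8, 0.1, 0.9], k=3, relevance_threshold=0.7)
print(f"Precision@3 = {score:.4f}")
```

Only the first three scores are inspected here, so the strong fifth item does not help the result. That is the property that makes the metric sensitive to ranking cutoffs.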

Real workflow: a support RAG agent answers refund-policy questions. The team captures ranked contexts from a traceAI-langchain trace, scores each retrieved chunk against the query, and runs PrecisionAtK with k=5. If the score drops from 0.80 to 0.40 after a chunking change, the engineer opens the failed rows, compares the top five contexts, and checks whether the retriever or reranker introduced irrelevant pricing pages above refund policy text. The next action might be a regression eval on refund intents, a reranker threshold change, or a rollback of the chunking job.

ContextPrecision is the adjacent FutureAGI metric when the question is ranking order across all retrieved chunks. RecallAtK is the companion when the question is missing evidence. Unlike Ragas context precision used as a standalone offline report, FutureAGI keeps the score next to the trace, dataset row, model version, and release cohort, so a failure points to the retrieval stage that changed.
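To make the precision and recall distinction concrete, here is a minimal sketch that computes both from the same ranked list. It assumes binary relevance labels and a known count of relevant items per query (for example from gold labels). The helper name `precision_recall_at_k` is hypothetical and not part of any FutureAGI API.

```python
from typing import Sequence, Tuple


def precision_recall_at_k(
    ranked_is_relevant: Sequence[bool], total_relevant: int, k: int
) -> Tuple[float, float]:
    """Return (Precision@K, Recall@K) for one query.

    ranked_is_relevant: relevance labels in retrieval order.
    total_relevant: number of relevant items that exist for the query.
    """
    hits = sum(ranked_is_relevant[:k])  # True counts as 1
    precision = hits / k
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall


ranked = [True, False, True, False, False]

# One perfect chunk at rank 1: precision is perfect, recall is poor.
p1, r1 = precision_recall_at_k(ranked, total_relevant=4, k=1)
# Widening to K=5 finds more evidence but dilutes the list.
p5, r5 = precision_recall_at_k(ranked, total_relevant=4, k=5)

print(f"K=1: precision={p1:.2f}, recall={r1:.2f}")
print(f"K=5: precision={p5:.2f}, recall={r5:.2f}")
```

The two values move in opposite directions as K grows in this example, which is why the metrics are usually read together.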

How to Measure or Detect Precision@K

Measure Precision@K after defining what “relevant” means for the task. Relevance can come from human labels, synthetic gold labels, rubric scoring, or another evaluator, but the ranked list must preserve retrieval order.

PrecisionAtK evaluator: returns the fraction of top-K relevance scores that meet the configured threshold.
ContextPrecision comparison: helps when relevant chunks exist but are ranked below weaker chunks.
Dashboard signal: track eval-fail-rate-by-cohort for retrieval rows, especially after index, embedding, reranker, or chunking changes.
Trace signal: compare top-K contexts in traceAI-langchain traces with downstream answer failures and user thumbs-down rate.
Counter-metric: pair with RecallAtK; perfect Precision@K can still hide a retriever that returns too little evidence.

Minimal Python:

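from fi.evals import PrecisionAtK

# Score the top 3 results; a chunk counts as relevant when its score meets 0.7.
metric = PrecisionAtK(config={"k": 3, "relevance_threshold": 0.7})
result = metric.evaluate([{
    "query": "How do refunds work?",
    "contexts": ["refund policy", "pricing page", "return window"],
    # One score per context, in retrieval order.
    "relevance_scores": [0.95, 0.2, 0.8],
}])
# Two of the three top-ranked scores meet the threshold: 2/3.
print(result.eval_results[0].output)  # 0.6667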
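The dashboard signal listed above, fail rate by cohort, can be prototyped offline before wiring it into any tooling. The sketch below is plain Python under stated assumptions: the row layout, the cohort names, and the `PASS_BAR` cutoff are hypothetical, and the local `precision_at_k` helper stands in for the evaluator.

```python
from collections import defaultdict


def precision_at_k(relevance_scores, k, relevance_threshold):
    """Fraction of the first k scores that meet the threshold."""
    hits = sum(1 for s in relevance_scores[:k] if s >= relevance_threshold)
    return hits / k


# Hypothetical eval rows: (cohort, relevance scores in retrieval order).
rows = [
    ("refund", [0.9, 0.8, 0.2, 0.9, 0.75]),
    ("refund", [0.9, 0.3, 0.2, 0.1, 0.75]),
    ("pricing", [0.8, 0.9, 0.9, 0.1, 0.95]),
]

# Assumed settings: K matches the prompt, PASS_BAR is the per-row pass cutoff.
K, THRESHOLD, PASS_BAR = 5, 0.7, 0.6

totals = defaultdict(int)
failures = defaultdict(int)
for cohort, scores in rows:
    totals[cohort] += 1
    if precision_at_k(scores, K, THRESHOLD) < PASS_BAR:
        failures[cohort] += 1

# Fail rate per cohort instead of one global average.
fail_rate = {cohort: failures[cohort] / totals[cohort] for cohort in totals}
for cohort, rate in sorted(fail_rate.items()):
    print(f"{cohort}: fail rate {rate:.2f}")
```

Grouping before aggregating keeps a regression in one cohort from being averaged away by healthy traffic elsewhere.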

Common Mistakes

Precision@K is easy to report and easy to misread because K, relevance labels, and retrieval order all shape the result.

Choosing K from a benchmark, not the prompt. If the model sees five chunks, Precision@10 scores chunks the model never reads and misstates production evidence quality.
Ignoring relevance threshold drift. Thresholds of 0.5 and 0.8 can tell different stories for the same ranked list.
Treating Precision@K as answer quality. It measures retrieved evidence density, not whether the final response used that evidence correctly.
Optimizing only Precision@K. A retriever can return one perfect chunk and still miss other required evidence; check RecallAtK.
Averaging across unrelated routes. Product docs, legal policies, and user-uploaded files have different relevance distributions and failure costs.
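The first two mistakes are easy to demonstrate by scoring one ranked list under several settings. The sketch below uses made-up relevance scores and a hypothetical local `precision_at_k` helper; it is meant only to show how much the reported number depends on K and the threshold.

```python
def precision_at_k(relevance_scores, k, relevance_threshold):
    """Fraction of the first k scores that meet the threshold."""
    hits = sum(1 for s in relevance_scores[:k] if s >= relevance_threshold)
    return hits / k


# Illustrative scores for one query, in retrieval order.
scores = [0.95, 0.6, 0.8, 0.55, 0.3, 0.9, 0.2, 0.1, 0.65, 0.4]

# Same ranked list, four different (K, threshold) settings.
table = {
    (k, threshold): precision_at_k(scores, k, threshold)
    for k in (5, 10)
    for threshold in (0.5, 0.8)
}

for (k, threshold), value in sorted(table.items()):
    print(f"K={k:>2}, threshold={threshold}: Precision@K = {value:.2f}")
```

Because the same retrieval output yields four different scores, a reported Precision@K is only comparable across runs when K and the threshold are fixed and recorded alongside it.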