Evaluation

What Is the Contextual Precision Metric?

An Average Precision-style RAG metric that scores whether relevant retrieved chunks rank above irrelevant ones in the retrieved list.

The contextual precision metric scores whether the chunks your retriever returned are ranked so that relevant ones sit above irrelevant ones. It walks the retrieved list top-down and computes an Average Precision-style score: a relevant chunk at rank 1 contributes near full weight, the same chunk at rank 5 contributes much less, and an irrelevant chunk near the top drags the score down sharply. It is the canonical metric for tuning rerankers, hybrid-search merge logic, and any retrieval setup where ranking quality drives downstream answer quality. FutureAGI exposes it as fi.evals.ContextPrecision across notebook, dataset, and live-trace evaluation.
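
A worked illustration, assuming the standard Average Precision formulation (the mean of Precision@k taken at each rank k that holds a relevant chunk): a retrieved list ordered [relevant, irrelevant, relevant] scores (1/1 + 2/3) / 2 ≈ 0.83, while the reordering [irrelevant, relevant, relevant] scores (1/2 + 2/3) / 2 ≈ 0.58, even though plain Precision@3 is 2/3 in both cases.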

Why the Contextual Precision Metric Matters in Production LLM and Agent Systems

LLMs are not order-agnostic. They attend disproportionately to chunks at the start and end of their context window — the well-documented “lost in the middle” effect — which means a retriever that returns the right chunk at rank 7 of 10 may as well not have returned it at all. A high-recall, low-precision retriever produces this exact failure pattern: the right answer is technically in the context, but the model never sees it.

The pain hits retrieval engineers, ML platform owners, and answer-quality reviewers. An ML engineer adds a reranker and answer accuracy improves. A colleague swaps in a cheaper reranker the next sprint, mean accuracy holds, latency drops, and the change ships. A month later contextual precision shows the cheaper reranker placed irrelevant chunks at rank 1 on 14% of long-tail queries — invisible because mean accuracy was diluted across the head.

In 2026's agentic-RAG and corrective-RAG patterns, low contextual precision is the signal that should trigger a re-rank or query rewrite, not a retry of the same retrieval call. Multi-hop agents need ranking precision as a step-level signal, not a batch-only metric. Unlike Ragas faithfulness, which evaluates the answer, contextual precision evaluates the retrieval, so teams can attribute regressions to the right component.
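
A minimal sketch of that gate, assuming a reference answer (or equivalent relevance signal) is available at the step and that the evaluator's output field is a numeric score, as printed in the snippet later in this section. retrieve() and rewrite_query() are hypothetical stand-ins for your own retrieval stack:

from fi.evals import ContextPrecision

PRECISION_GATE = 0.5  # illustrative threshold; tune per query type

def corrected_retrieve(query: str, reference: str) -> list[str]:
    # retrieve() and rewrite_query() are hypothetical stand-ins for your own
    # retrieval stack; only ContextPrecision comes from fi.evals.
    contexts = retrieve(query)
    result = ContextPrecision().evaluate([{
        "query": query, "contexts": contexts, "reference": reference,
    }])
    score = float(result.eval_results[0].output)  # assumes a numeric score
    if score < PRECISION_GATE:
        # Low ranking precision: rewrite the query and retrieve again
        # instead of retrying the identical retrieval call.
        contexts = retrieve(rewrite_query(query))
    return contexts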

How FutureAGI Handles the Contextual Precision Metric

FutureAGI’s approach is to ship fi.evals.ContextPrecision as a local metric you can run anywhere — notebook, dataset evaluation, live-trace evaluator, or scheduled regression — with the same Average Precision formula. Inputs are the user query, the ordered list of retrieved contexts, and a reference answer (or human relevance labels if available). The evaluator computes Precision@k weighted by relevance at each rank, then averages across the list.
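
The arithmetic itself is small. The sketch below is an illustrative reimplementation of that Average Precision computation over binary relevance labels; it is not the fi.evals.ContextPrecision code itself, which derives relevance from the query and reference rather than taking labels directly:

def average_precision(relevance: list[int]) -> float:
    # relevance[k] is 1 if the chunk at rank k + 1 is relevant, else 0.
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    seen = 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            seen += 1
            score += seen / rank  # Precision@rank, counted only at relevant ranks
    return score / total_relevant

print(average_precision([1, 0, 1]))  # ~0.83: relevant chunk leads the list
print(average_precision([0, 1, 1]))  # ~0.58: irrelevant chunk at rank 1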

A concrete example: a search team running on traceAI-pinecone ships a new reranker. They instrument retrieval spans with the ranked chunks and run ContextPrecision on every span. The Agent Command Center dashboard plots p25, median, and p75 precision by reranker version. When the cheap-reranker change drops p25 from 0.65 to 0.41, an alert fires before mean accuracy moves. The same evaluator runs offline on a golden ranking Dataset, so the regression is reproducible without waiting for production data.
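
The percentile roll-up itself is easy to reproduce offline. A sketch with hypothetical per-span scores grouped by reranker version (in practice the scores would come from ContextPrecision runs on retrieval spans or a golden Dataset):

import numpy as np

# Hypothetical ContextPrecision scores per retrieval span, by reranker version.
scores_by_version = {
    "reranker-v1": [0.90, 0.72, 0.65, 0.81, 0.70],
    "reranker-v2-cheap": [0.88, 0.41, 0.35, 0.79, 0.44],
}

for version, scores in scores_by_version.items():
    p25, p50, p75 = np.percentile(scores, [25, 50, 75])
    print(f"{version}: p25={p25:.2f} median={p50:.2f} p75={p75:.2f}")
    if p25 < 0.50:  # illustrative alert threshold on the tail, not the mean
        print(f"  alert: {version} regresses on the worst quartile of queries")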

We have found that pairing ContextPrecision with ContextRecall is the only honest way to evaluate a retriever — precision alone rewards small surgical retrieval, recall alone rewards dragnet retrieval. The trade-off shows up in the joint plot.
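
A sketch of the pairing, assuming ContextRecall accepts the same payload shape as ContextPrecision (check the SDK reference for the exact keys):

from fi.evals import ContextPrecision, ContextRecall

payload = [{
    "query": "What does the model do under load?",
    "contexts": [
        "Under load the system uses semantic cache and weighted routing.",
        "Pinecone supports hybrid search.",
    ],
    "reference": "It uses semantic cache and weighted routing under load.",
}]

# Score the same retrieval with both metrics; plot the pairs per retriever config.
precision = ContextPrecision().evaluate(payload).eval_results[0].output
recall = ContextRecall().evaluate(payload).eval_results[0].output
print(precision, recall)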

How to Measure or Detect It

Wire up the contextual precision metric:

  • fi.evals.ContextPrecision — Average Precision-style score across the ranked list.
  • fi.evals.PrecisionAtK — simpler companion that returns the fraction of top-K results that are relevant.
  • fi.evals.NDCG — Normalised Discounted Cumulative Gain for graded relevance labels.
  • OTel attribute retrieval.documents — the ordered list the evaluator scores.
  • p25 precision (dashboard) — the percentile that exposes a regressed reranker first.

from fi.evals import ContextPrecision

result = ContextPrecision().evaluate([{
    "query": "What does the model do under load?",
    "contexts": [
        "Under load the system uses semantic cache and weighted routing.",
        "Pinecone supports hybrid search.",
    ],
    "reference": "It uses semantic cache and weighted routing under load."
}])
print(result.eval_results[0].output, result.eval_results[0].reason)

Common Mistakes

  • Reporting Precision@k and calling it contextual precision. Precision@k ignores ordering within top k.
  • Measuring on raw vector-search output instead of the reranked list. If the model only sees reranked chunks, score the reranked list.
  • Tuning precision without watching recall. A precision-1.0 retriever that returns one chunk has zero recall on multi-fact questions.
  • Using contextual precision when graded relevance labels are available. NDCG handles “very relevant” vs “somewhat relevant” cleanly.
  • Setting a single global threshold across query types. Lookups need higher precision than exploratory queries.

Frequently Asked Questions

What is the contextual precision metric?

It is an Average Precision-style RAG metric that scores whether relevant retrieved chunks are ranked above irrelevant ones in the retrieved list. The same chunk earns more credit at rank 1 than at rank 5.

How is contextual precision different from precision-at-K?

Precision-at-K only counts the fraction of the top K that is relevant. Contextual precision uses the order within the top results — irrelevant chunks at high ranks penalise the score harder than the same chunks at low ranks.

How do you measure the contextual precision metric?

Run `fi.evals.ContextPrecision` against a query, the retrieved list, and a reference answer. Pair with `ContextRecall` for retrieval health and `NDCG` when you have graded relevance labels.