What Is Context Precision?

A RAG retrieval metric that scores whether relevant chunks are ranked higher than irrelevant ones in the retrieved list, using an Average Precision-style formula.

Context precision is a RAG retrieval metric that scores how well your retrieved chunks are ordered — specifically, whether the chunks relevant to the user’s query are ranked higher than the irrelevant ones. The evaluator walks the retrieved list and computes an Average Precision-style score: a relevant chunk at rank 1 contributes nearly its full weight, the same chunk at rank 5 contributes much less, and any irrelevant chunk near the top suppresses the score sharply. It is the metric you watch when tuning rerankers, hybrid-search merge logic, or any system where ranking quality drives downstream answer quality.
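
To make the rank weighting concrete, here is a minimal pure-Python sketch of the standard Average Precision-style formula. The binary relevance list is a hypothetical judgment per retrieved chunk, supplied by hand here; the fi.evals evaluator derives its own relevance judgments.

def context_precision(relevant: list[int]) -> float:
    """Average of Precision@k taken at each rank k that holds a relevant chunk."""
    score, hits = 0.0, 0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            score += hits / k  # Precision@k at this relevant rank
    return score / hits if hits else 0.0

print(context_precision([1, 0, 0, 0, 0]))  # relevant chunk at rank 1 -> 1.0
print(context_precision([0, 0, 0, 0, 1]))  # same chunk at rank 5 -> 0.2
print(context_precision([0, 1, 1, 0, 0]))  # irrelevant chunk on top -> ~0.58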

Why It Matters in Production LLM and Agent Systems

LLMs are not order-agnostic. Most production models attend disproportionately to the first and last items in a context window — the well-documented “lost in the middle” effect — which means a retrieval setup that returns the right chunk at rank 7 of 10 may as well not have returned it at all. A high-recall, low-precision retriever produces this exact failure: the right answer is in the context, but the model ignores it.

The pain hits retrieval engineers and infra owners hardest. An ML engineer adds a reranker and answer accuracy goes up — but a colleague pushes a cheaper reranker the next sprint, mean answer accuracy stays flat, and tail latency drops, so the change ships. A month later context precision shows the cheaper reranker is putting irrelevant chunks at rank 1 on 14% of long-tail queries; nobody noticed because mean accuracy was diluted across the head distribution. Without context precision, ranking regressions are invisible until a customer files a ticket.

In 2026’s agentic-RAG and corrective-RAG patterns, a low context-precision score is what should trigger a re-rank or a query rewrite, not a retry of the same retrieval call. Multi-hop agents that iterate on retrieval need precision as a first-class, step-level signal.
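
A sketch of that control flow, assuming hypothetical retrieve, rerank, rewrite_query, and precision_of callables and an illustrative 0.65 threshold; none of these names are FutureAGI APIs.

from typing import Callable

def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    rerank: Callable[[str, list[str]], list[str]],
    rewrite_query: Callable[[str], str],
    precision_of: Callable[[str, list[str]], float],
    threshold: float = 0.65,  # illustrative; tune per query intent
    max_rounds: int = 2,
) -> list[str]:
    """Re-rank first, then rewrite; never retry the same retrieval call verbatim."""
    contexts = retrieve(query)
    for _ in range(max_rounds):
        if precision_of(query, contexts) >= threshold:
            return contexts
        contexts = rerank(query, contexts)  # cheapest fix first
        if precision_of(query, contexts) >= threshold:
            return contexts
        query = rewrite_query(query)        # then reformulate the query
        contexts = retrieve(query)
    return contexts  # best effort after max_rounds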

How FutureAGI Handles Context Precision

FutureAGI’s approach is to ship fi.evals.ContextPrecision as a local metric you can run anywhere (notebook, dataset evaluation, or live trace evaluator) with the same Average Precision-style formula. Inputs are the user query, the ordered list of retrieved contexts, and a reference answer (or human relevance labels if you have them). The evaluator computes Precision@k at each rank that holds a relevant chunk, then averages those values over the number of relevant chunks. That is the standard Average Precision definition, implemented identically across the local metric and the cloud surfaces, so notebook scores match production scores.

Concretely: a search team running on traceAI-pinecone ships a new reranker, instruments retrieval spans with retrieval.documents carrying the ranked chunks, and runs ContextPrecision on every span. The Agent Command Center dashboard plots precision distribution by reranker version. The team sets an alert at “p25 below 0.65” — and when the cheap-reranker change crashes the p25 to 0.41, the alert fires before mean accuracy moves. The same evaluator then runs offline against a golden ranking dataset, so the regression is reproducible without waiting for production data.
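
A sketch of the alert check itself, using made-up per-span scores; pulling scores out of your trace store and the Agent Command Center alert wiring are assumed, not shown.

import numpy as np

# Hypothetical per-span ContextPrecision scores for one reranker version.
scores = [0.91, 0.88, 0.41, 0.95, 0.33, 0.87, 0.90, 0.52]

p25 = float(np.percentile(scores, 25))  # the tail percentile the alert watches
print(f"p25 context precision: {p25:.2f}")
if p25 < 0.65:  # the illustrative threshold from the example above
    print("ALERT: ranking regressed in the tail before mean accuracy moved")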

We have found that pairing ContextPrecision with ContextRecall is the only honest way to evaluate a retriever — precision alone rewards small, surgical retrieval; recall alone rewards dragnet retrieval; the trade-off shows up in their joint plot.
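
A sketch of that joint plot with invented paired scores; it assumes ContextRecall emits a 0-1 score per query the way ContextPrecision does.

import matplotlib.pyplot as plt

# Invented per-query (precision, recall) pairs for one retriever config.
precision = [0.95, 0.90, 0.88, 0.45, 0.97, 0.60, 0.72]
recall    = [0.40, 0.55, 0.80, 0.95, 0.35, 0.90, 0.75]

plt.scatter(precision, recall)
plt.xlabel("ContextPrecision")
plt.ylabel("ContextRecall")
plt.title("Surgical retrievers sit bottom-right, dragnets top-left")
plt.show()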

How to Measure or Detect It

Context precision is directly measurable. Wire up:

  • fi.evals.ContextPrecision — Average Precision-style score across the ranked retrieval list.
  • fi.evals.PrecisionAtK — simpler companion that returns the fraction of top-K results that are relevant.
  • fi.evals.NDCG — Normalised Discounted Cumulative Gain for graded relevance labels; the sketch after this list contrasts all three scores on one ranking.
  • OTel attribute retrieval.documents — the ordered list your evaluator scores.
  • p25 precision (dashboard) — the percentile that exposes a regressed reranker first.
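
To see how the three scores disagree on the same ranking, here is a pure-Python comparison on a hypothetical binary relevance list; these are the textbook formulas, not the fi.evals implementations.

import math

ranking = [1, 0, 1, 0, 0]  # hypothetical binary relevance, in ranked order

k = 3
p_at_k = sum(ranking[:k]) / k  # 2/3: blind to order inside the top K

hits, ap = 0, 0.0
for i, rel in enumerate(ranking, start=1):
    if rel:
        hits += 1
        ap += hits / i  # Precision@i at each relevant rank
ap /= hits  # (1 + 2/3) / 2 ≈ 0.83: order matters

dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ranking, start=1))
ideal = sorted(ranking, reverse=True)
idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
ndcg = dcg / idcg  # ≈ 0.92: log-rank discount, extends to graded labels

print(f"P@{k}={p_at_k:.2f}  AP={ap:.2f}  NDCG={ndcg:.2f}")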

Minimal Python:

from fi.evals import ContextPrecision

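# Local evaluator: the same formula runs in notebooks, dataset evals, and live traces.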
precision = ContextPrecision()

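# One test case: the query, the ordered retrieved contexts, and a reference answer.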
result = precision.evaluate([{
    "query": "What is machine learning?",
    "contexts": [
        "Machine learning is a branch of AI.",
        "The weather is nice today.",
    ],
    "reference": "Machine learning is an AI technique."
}])
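# output is the 0-1 score; reason is the evaluator's explanation of the ranking.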
print(result.eval_results[0].output, result.eval_results[0].reason)

Common Mistakes

  • Reporting Precision@k and calling it context precision. Precision@k ignores the order within the top k; context precision penalises irrelevant chunks at high ranks specifically.
  • Computing precision on the raw vector-search output instead of the reranked list. If your model only sees reranked chunks, that is the list you must score.
  • Tuning for precision without watching recall. A retriever that returns a single relevant chunk scores a perfect 1.0 on precision yet misses every other fact a multi-fact question needs; the scores must be read together.
  • Using context precision when graded relevance labels are available. Reach for NDCG instead — it handles “very relevant” vs “somewhat relevant” cleanly.
  • Setting a single global threshold across query types. Lookups need higher precision than exploratory queries — split alerts by query intent.

Frequently Asked Questions

What is context precision in RAG?

Context precision is a 0-1 score for retrieval ranking quality. It rewards setups where relevant chunks appear before irrelevant ones and penalises irrelevant chunks at top ranks.

How is context precision different from context recall?

Precision asks “of the chunks you returned, how well-ranked are the relevant ones?” Recall asks “of all the relevant chunks that exist, how many did you return?” Precision tunes the reranker; recall tunes the retriever.

How do you measure context precision?

FutureAGI’s fi.evals.ContextPrecision takes the query, the ranked list of retrieved contexts, and a reference answer, then computes an Average Precision-style score across the ranking.