What Is Context Relevance?

Context relevance is a RAG evaluation metric that scores whether the chunks your retriever returned are actually relevant and sufficient to answer the user’s query — judged independently of what the model later writes. The evaluator takes the input query and the retrieved context and returns a continuous score where higher means the context could plausibly support a correct answer. It runs at the retrieval span in production traces and on offline RAG evaluation datasets. Context relevance is the upstream signal: when it is low, every downstream metric is measuring the output of a broken retriever.

Why It Matters in Production LLM and Agent Systems

The most common RAG failure is not a model hallucination — it is the retriever returning chunks that do not answer the question. The model then either (a) admits it cannot answer, which looks like a refusal regression, or (b) fills in from parametric memory, which looks like a hallucination regression. The root cause in both cases is the same: bad retrieval. Without context relevance, you cannot tell the two failure modes apart.

The pain falls on retrieval and platform engineers. An ML engineer ships a new embedding model and groundedness stays at 0.92, but answer correctness drops 11 points — because context relevance silently fell off a cliff and the model is now grounding its answers in chunks unrelated to the query. A search platform owner gets a ticket that “the bot doesn’t know about Q3 numbers” and has to reproduce the retrieval to find that the right chunk was at rank 12, below the top-k cutoff. A product manager sees user satisfaction sliding without a corresponding eval-fail spike, because the model is politely refusing instead of hallucinating.

In the agentic-RAG, corrective-RAG, and self-RAG patterns common in 2026, context relevance is the signal that triggers a retrieval retry. The agent loop measures relevance after the first retrieval; if it falls below the threshold, the agent rewrites the query and retrieves again instead of generating from weak context.
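
A minimal sketch of that retry loop is below. It reuses the evaluate call shown in the Minimal Python example further down; the retrieve and rewrite_query helpers, the threshold, and the retry cap are illustrative assumptions, not part of the SDK.

from fi.evals import ContextRelevance

RELEVANCE_THRESHOLD = 0.6  # assumption: tune per query intent
MAX_RETRIES = 2            # assumption: cap on the corrective loop

def retrieve_with_retry(query, retrieve, rewrite_query):
    # retrieve() and rewrite_query() are placeholders for your own
    # retriever and query-rewriting step.
    evaluator = ContextRelevance()
    current_query = query
    for _ in range(MAX_RETRIES + 1):
        context = retrieve(current_query)
        result = evaluator.evaluate(input=query, context=context)
        if result.score >= RELEVANCE_THRESHOLD:
            return context, result.score
        # Weak context: rewrite the query and retrieve again instead of
        # generating from chunks that cannot answer the question.
        current_query = rewrite_query(query, context)
    return context, result.score  # caller decides whether to answer or abstain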

How FutureAGI Handles Context Relevance

FutureAGI’s approach is to expose two complementary evaluators because the question “is this context relevant?” can be asked from two angles. fi.evals.ContextRelevance takes the user query and retrieved context as inputs and returns a relevance score against the question — that is the retrieval-quality signal. fi.evals.ContextRelevanceToResponse is the local-metric companion that measures relevance from the response’s perspective, useful for distinguishing “retrieval was poor” from “retrieval was good but the model ignored it” — the canonical context-neglect ambiguity.

Concretely: a RAG team running on traceAI-langchain instruments their chain so the retrieval span carries retrieval.documents and the input span carries the user query. They configure ContextRelevance to score every retrieval, and the dashboard plots relevance distribution by index version, embedding model, and reranker config. When a new reranker drops the p25 relevance score from 0.74 to 0.52, the team knows the regression is upstream — they roll back the reranker rather than chasing the symptoms downstream in groundedness or answer accuracy.
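
For teams not on traceAI-langchain, a rough sketch of the same wiring with the plain OpenTelemetry SDK is below. The attribute names retrieval.documents and input.value match the signals listed in the next section; the tracer name, span name, and chunk serialisation are assumptions.

import json

from opentelemetry import trace

tracer = trace.get_tracer("rag-pipeline")  # assumption: any service-specific name works

def retrieve_and_record(query, retriever):
    # Record the user query and the retrieved chunks on the retrieval span
    # so a relevance evaluator can read both from the trace later.
    with tracer.start_as_current_span("retrieval") as span:
        span.set_attribute("input.value", query)
        chunks = retriever(query)  # placeholder for your retriever call
        span.set_attribute("retrieval.documents", json.dumps(chunks))
        return chunks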

Unlike Ragas context-relevancy, which only checks query-vs-context, FutureAGI’s pair lets you triangulate retrieval quality and model utilisation in the same trace.

How to Measure or Detect It

Context relevance is directly measurable. The signals to wire:

  • fi.evals.ContextRelevance — query-vs-context score for retrieval quality.
  • fi.evals.ContextRelevanceToResponse — response-vs-context relevance for context-neglect detection.
  • fi.evals.NoiseSensitivity — adjacent metric for how robust the system is to irrelevant context.
  • OTel attributes retrieval.documents and input.value — the inputs every relevance evaluator depends on.
  • p25 relevance score (dashboard) — the percentile that exposes a failing reranker before the mean does.

Minimal Python:

from fi.evals import ContextRelevance

evaluator = ContextRelevance()

result = evaluator.evaluate(
    input="Why doesn't honey go bad?",
    context="Honey never spoils because it has low moisture content and high acidity."
)
print(result.score, result.reason)
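
The companion check from the response’s perspective follows the same shape. The parameter names below are an assumption by analogy with ContextRelevance; confirm them against your SDK version.

from fi.evals import ContextRelevanceToResponse

evaluator = ContextRelevanceToResponse()

# Assumption: the evaluator accepts the generated response and the retrieved
# context; check the SDK docs for the exact argument names.
result = evaluator.evaluate(
    output="Honey doesn't spoil thanks to its low moisture and high acidity.",
    context="Honey never spoils because it has low moisture content and high acidity."
)
print(result.score, result.reason)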

Common Mistakes

  • Conflating context relevance with groundedness. Relevance asks if the context could answer the question. Groundedness asks if the response stayed inside the context. Both can fail independently.
  • Reading only the mean. Mean context relevance hides a failing reranker; the p25 and p10 expose it. Alert on percentiles, not averages (see the sketch after this list).
  • Scoring relevance without the reranker output. If you measure relevance on raw vector-search hits, you are not measuring what the model actually saw — score the reranked top-k.
  • Treating “relevant” as a fixed threshold across query types. A factual lookup needs higher relevance than an exploratory chat — split scores by query intent before alerting.
  • Ignoring ContextRelevanceToResponse when groundedness is high. A high groundedness score on irrelevant context means the model is parroting the chunk rather than answering — that is the context-neglect inverse.
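
A minimal sketch of percentile-over-mean alerting, using plain NumPy; the scores, thresholds, and alerting window are illustrative assumptions.

import numpy as np

# Assumption: per-retrieval ContextRelevance scores for one reranker/index
# configuration over the alerting window.
relevance_scores = [0.91, 0.88, 0.43, 0.79, 0.95, 0.38, 0.84, 0.90]

mean = float(np.mean(relevance_scores))
p25, p10 = np.percentile(relevance_scores, [25, 10])

# The mean can stay healthy while the lower tail collapses; alert on the tail.
if p25 < 0.6 or p10 < 0.4:  # assumption: thresholds tuned per query intent
    print(f"retrieval regression: mean={mean:.2f}, p25={p25:.2f}, p10={p10:.2f}")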

Frequently Asked Questions

What is context relevance?

Context relevance scores whether retrieved chunks are relevant and sufficient to answer the user's query. It evaluates the retriever, not the generator, so a low score points to retrieval failure rather than hallucination.

How is context relevance different from context precision?

Context relevance asks whether the retrieved context can answer the query at all. Context precision asks whether relevant chunks were ranked higher than irrelevant ones. Relevance is per-context; precision is per-ranking.

How do you measure context relevance?

FutureAGI's fi.evals.ContextRelevance takes the input query and retrieved context and returns a relevance score with a reason. ContextRelevanceToResponse is a complementary check for relevance vs the generated response.