What Is the Contextual Relevancy Metric?
A RAG metric that scores how relevant each retrieved chunk is to the user query, independent of whether the answer is grounded in those chunks.
What Is the Contextual Relevancy Metric?
The contextual relevancy metric scores how relevant the retrieved chunks are to the user query — independent of whether the answer is grounded in them. It is a per-context judgement that surfaces retriever noise before downstream metrics like faithfulness or answer quality are computed. FutureAGI exposes it through fi.evals.ContextRelevance for retrieved-chunk relevance and fi.evals.ContextRelevanceToResponse for response-driven relevance, both runnable across notebook, dataset, and live trace surfaces with the same evaluator contract.
Why the Contextual Relevancy Metric Matters in Production LLM and Agent Systems
A model fed irrelevant context behaves badly in three ways: it confabulates, it cites unrelated chunks, or it produces a confidently wrong answer. The first failure goes to the engineering team as “hallucination.” The second goes to the trust team as “false citation.” The third goes to the customer as a poor experience. All three trace back to one upstream problem — irrelevant context.
The pain hits retrieval engineers, RAG owners, and answer-quality reviewers. Retrieval engineers see vector-search recall numbers that look healthy on aggregate but include junk chunks at the top of long-tail queries. RAG owners see faithfulness scores drop without an obvious cause. Answer-quality reviewers cannot tell if the model misbehaved or the retrieval delivered a poisoned context.
In 2026 agentic-RAG patterns, contextual relevancy is the trigger for query-rewriting and re-retrieval steps. A self-RAG agent that detects low relevancy can rewrite the query, expand keywords, or call a different retriever — but only if relevancy is measured per chunk per query, not as an aggregate. Unlike Ragas’s similar metric, FutureAGI’s evaluators run on the same fi.evals contract used for retrieval, generation, and safety — so RAG dashboards do not stitch metrics across libraries.
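A minimal sketch of that trigger, assuming the aggregate score can be read from the evaluator result as in the example later on this page and that it is numeric; `retrieve` and `rewrite_query` are hypothetical stand-ins for your own retriever and query rewriter, and `threshold` is an illustrative value, not a recommendation.

```python
from fi.evals import ContextRelevance

def retrieve_with_relevancy_gate(query, retrieve, rewrite_query,
                                 threshold=0.5, max_attempts=2):
    """Re-retrieve with a rewritten query when the retrieved chunks score as irrelevant."""
    contexts = retrieve(query)
    for _ in range(max_attempts):
        result = ContextRelevance().evaluate([{"query": query, "contexts": contexts}])
        # Assumption: `output` carries the aggregate relevancy score for this retrieval.
        if result.eval_results[0].output >= threshold:
            return contexts
        query = rewrite_query(query)   # e.g. expand keywords or add entities
        contexts = retrieve(query)
    return contexts  # last attempt; flag the trace for review downstream
```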
How FutureAGI Handles the Contextual Relevancy Metric
FutureAGI’s approach is to provide two evaluator variants for two distinct questions. fi.evals.ContextRelevance answers “is this retrieved chunk relevant to the query?” — independent of any response. fi.evals.ContextRelevanceToResponse answers “is this retrieved chunk relevant given the response that was actually generated?” — which is sharper for diagnosing whether the model used the right slice of the context. Both run against the same query / contexts / response payload shape and return per-context plus aggregate scores.
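A minimal sketch of the response-driven variant, assuming it accepts the same payload with a response field added; the exact field name for the response is an assumption to verify against your fi.evals version.

```python
from fi.evals import ContextRelevanceToResponse

# Same query/contexts payload, plus the response the model actually produced.
result = ContextRelevanceToResponse().evaluate([{
    "query": "How long is the parts warranty?",
    "contexts": [
        "Parts and labor warranty: 12 months from delivery.",
        "Free shipping over $50.",
    ],
    "response": "The parts warranty runs for 12 months from delivery.",  # assumed field name
}])
print(result.eval_results[0].output, result.eval_results[0].reason)
```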
A concrete example: a legal-research RAG team sees faithfulness drop from 0.86 to 0.74 in a week. They run ContextRelevance on the same trace window and find median context relevancy is unchanged but p10 dropped from 0.62 to 0.31 — a fat tail of irrelevant chunks rising. They cross-reference with traceAI-pinecone retrieval spans and identify a re-index that changed embedding model versions silently. The fix is to pin the embedding model in the retrieval span and add a regression eval that runs ContextRelevance against a frozen Dataset of 500 legal queries; the eval ships with the next reranker promotion and prevents a repeat.
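A sketch of the percentile check from that example, assuming you have already collected one aggregate relevancy score per trace for the window you are comparing; the sample values below are illustrative.

```python
from statistics import median, quantiles

def relevancy_summary(scores):
    """Median and p10 of per-trace context-relevancy scores. A stable median
    with a sinking p10 is the signature of a fat tail of irrelevant chunks."""
    return median(scores), quantiles(scores, n=10)[0]

# Illustrative per-trace aggregates for one day of traffic.
med, p10 = relevancy_summary([0.91, 0.88, 0.74, 0.62, 0.35, 0.29])
print(f"median={med:.2f} p10={p10:.2f}")
```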
We have found that contextual relevancy is the upstream metric that explains otherwise mysterious drops in faithfulness or answer relevance — measure it before assuming a model bug, especially when retrieval pipelines change embeddings, chunkers, or rerankers in the same release window.
How to Measure or Detect It
Wire up contextual relevancy:
- `fi.evals.ContextRelevance` — per-chunk and aggregate relevance to the query.
- `fi.evals.ContextRelevanceToResponse` — relevance given the actual response, sharper for downstream attribution.
- `fi.evals.AnswerRelevancy` — pair with relevancy upstream to localize the failure.
- OTel attribute `retrieval.documents` — the chunk list the evaluator scores.
- p10 relevancy (dashboard) — the percentile that exposes a regressed retriever first.
```python
from fi.evals import ContextRelevance

# Score each retrieved chunk against the query; no response is needed for this variant.
result = ContextRelevance().evaluate([{
    "query": "How long is the parts warranty?",
    "contexts": [
        "Parts and labor warranty: 12 months from delivery.",  # relevant
        "Free shipping over $50.",                              # retriever noise
    ],
}])

# Per-context and aggregate relevance, plus the judge's reasoning.
print(result.eval_results[0].output, result.eval_results[0].reason)
```
Common Mistakes
- Reporting only mean relevancy. Median and p10 are where regressions hide.
- Confusing relevancy with grounding. A chunk can be relevant to the query while the answer is still ungrounded; faithfulness lives downstream.
- Skipping relevancy and going straight to faithfulness. You will misattribute retrieval failures as model failures.
- Running relevancy on the raw vector-search output, not the reranked list. If the model sees the reranked list, evaluate that list.
- One global threshold across query types. Lookup queries demand higher relevancy than exploratory queries; see the threshold sketch after this list.
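A sketch of that last point, per-query-type thresholds instead of one global cutoff; the query classes and values are illustrative, not recommendations.

```python
# Hypothetical threshold table: lookup queries demand tighter relevancy than exploratory ones.
RELEVANCY_THRESHOLDS = {"lookup": 0.8, "procedural": 0.7, "exploratory": 0.5}

def passes_relevancy(query_type: str, score: float) -> bool:
    """Gate a retrieval on the threshold for its query type, defaulting conservatively."""
    return score >= RELEVANCY_THRESHOLDS.get(query_type, 0.7)
```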
Frequently Asked Questions
What is the contextual relevancy metric?
It is a RAG metric that scores how relevant each retrieved chunk is to the user query, independent of whether the answer is grounded in those chunks. It surfaces retriever noise before downstream evaluation.
How is contextual relevancy different from contextual precision?
Relevancy scores per-chunk relevance to the query. Precision scores whether the relevant chunks are ranked above the irrelevant ones. You measure relevancy first, then precision uses that signal across the ranked list.
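A small sketch of that relationship, assuming binary per-chunk relevancy verdicts over the ranked list; this follows the common rank-weighted definition of contextual precision rather than any specific library's exact formula.

```python
def contextual_precision(relevant_flags):
    """Rank-weighted precision: precision@k averaged over the positions of
    relevant chunks, so relevant chunks ranked high score better."""
    precisions, seen_relevant = [], 0
    for k, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            seen_relevant += 1
            precisions.append(seen_relevant / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(contextual_precision([True, False, True]))   # 0.83: relevant chunks ranked near the top
print(contextual_precision([False, True, True]))   # 0.58: same chunks, worse ranking
```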
How do you measure the contextual relevancy metric?
Run `fi.evals.ContextRelevance` on the query plus retrieved contexts to get a per-chunk and aggregate relevance score. Pair with `ContextPrecision`, `ContextRecall`, and `Faithfulness` for a full RAG evaluation.