What Is the Contextual Relevancy RAG Metric?

The contextual relevancy RAG metric is the per-chunk relevance score in a RAG evaluation suite — it asks whether each retrieved chunk is actually relevant to the user query, independent of whether the answer is grounded in those chunks. Inside a RAG evaluation pipeline it is the upstream metric that explains downstream drops in faithfulness, answer relevance, or hallucination scores. FutureAGI exposes it through fi.evals.ContextRelevance and fi.evals.ContextRelevanceToResponse, with consistent semantics across notebook, dataset, and live trace surfaces.

Why the Contextual Relevancy RAG Metric Matters in Production LLM and Agent Systems

A RAG system fails in two layers, and contextual relevancy separates them. The retrieval layer can return junk; the generation layer can ignore good context or invent unsupported claims. If you only measure the answer, the two failure modes look identical — both produce wrong answers. But the fixes differ: improve embeddings or rerankers for retrieval failures, improve prompts or model choice for generation failures.

The pain hits RAG product owners, retrieval engineers, and quality reviewers. RAG owners see hallucination spikes that are unattributable. Retrieval engineers tune embeddings without knowing if the issue is upstream or downstream. Quality reviewers report “the answer was wrong” without enough signal to direct the fix.

In the agentic-RAG, corrective-RAG, and self-RAG patterns of 2026, contextual relevancy is the trigger for an additional retrieval step. The agent decides “this context is irrelevant, rewrite the query and try again” — but only when relevancy is measured per chunk per query, not as a batch-level aggregate. Compliance-sensitive RAG routes treat low relevancy as a refusal signal so the agent does not paper over a missing document with general knowledge. Unlike Ragas, which exposes a comparable metric in a separate library, FutureAGI’s fi.evals runs all RAG metrics under one contract so RAG dashboards do not stitch across tools.
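
Measured per chunk, the score can drive that corrective loop directly. Below is a minimal sketch of the pattern, not FutureAGI’s implementation: retrieve, score_chunk, and rewrite_query are hypothetical callables (score_chunk would typically wrap fi.evals.ContextRelevance), and the 0.5 threshold and three-attempt cap are illustrative.

from typing import Callable

def retrieve_with_correction(
    query: str,
    retrieve: Callable[[str], list[str]],      # your retriever
    score_chunk: Callable[[str, str], float],  # per-chunk relevancy, e.g. a wrapper around fi.evals.ContextRelevance
    rewrite_query: Callable[[str], str],       # query-rewrite step (LLM call or heuristic)
    threshold: float = 0.5,
    max_attempts: int = 3,
) -> list[str]:
    """Re-retrieve with a rewritten query until some chunk clears the relevancy threshold."""
    current_query = query
    for _ in range(max_attempts):
        chunks = retrieve(current_query)
        # Score each chunk against the original user query, not the rewritten one.
        relevant = [c for c in chunks if score_chunk(query, c) >= threshold]
        if relevant:
            return relevant
        # Every chunk was judged irrelevant: rewrite and retry instead of letting
        # the generator paper over the gap with general knowledge.
        current_query = rewrite_query(current_query)
    return []  # still nothing relevant: surface as a refusal signal downstream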

How FutureAGI Handles the Contextual Relevancy RAG Metric

FutureAGI’s approach is to ship contextual relevancy as one signal inside a coherent RAG evaluation suite. The relevant evaluators: fi.evals.ContextRelevance for query-only relevance, fi.evals.ContextRelevanceToResponse for response-aware relevance, fi.evals.ContextPrecision for ranking quality, fi.evals.ContextRecall for completeness, fi.evals.Faithfulness and fi.evals.RAGFaithfulness for grounding, and fi.evals.RAGScore as a composite. All run on the same query / contexts / response payload and emit comparable 0–1 scores.

A concrete example: a customer-support RAG team is investigating a faithfulness regression on a knowledge-base release. They run RAGScoreDetailed on the affected window. The breakdown shows context relevance dropped 0.18, context recall held flat, faithfulness dropped 0.12. The localization is clear — retrieval started returning irrelevant chunks because the new embedding model interacts badly with the new chunking strategy. The fix is to align embedding-and-chunking versions in the retrieval pipeline, with regression eval against a frozen Dataset of 500 representative tickets, gated in CI so the embedding-chunker mismatch cannot recur.
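
A minimal sketch of that CI gate, assuming per-metric mean scores over the frozen dataset have already been computed (for example by running fi.evals.RAGScoreDetailed over each ticket) and written to JSON; the file names, metric keys, and 0.05 regression budget are illustrative, not part of the FutureAGI SDK.

import json
import sys

REGRESSION_BUDGET = 0.05  # maximum allowed drop per metric between versions (illustrative)

def check_regression(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Compare per-metric mean scores (e.g. context_relevance, context_recall, faithfulness)."""
    failures = []
    for metric, base_score in baseline.items():
        drop = base_score - candidate.get(metric, 0.0)
        if drop > REGRESSION_BUDGET:
            failures.append(f"{metric}: dropped {drop:.2f} from baseline {base_score:.2f}")
    return failures

if __name__ == "__main__":
    # Mean scores over the frozen 500-ticket Dataset for the old and new retrieval pipelines.
    with open("baseline_scores.json") as f:
        baseline = json.load(f)
    with open("candidate_scores.json") as f:
        candidate = json.load(f)

    failures = check_regression(baseline, candidate)
    if failures:
        print("RAG eval regression:\n" + "\n".join(failures))
        sys.exit(1)  # fail the CI job so the embedding-chunker mismatch cannot ship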

We have found that the contextual relevancy RAG metric is the single most useful localization signal in a multi-component RAG failure — it cleanly separates “your retriever failed” from “your model failed,” which directs the engineering team to the right component on day one of an incident.

How to Measure or Detect It

Wire up the contextual relevancy RAG metric inside a full evaluation suite:

  • fi.evals.ContextRelevance — per-chunk and aggregate relevance to the query, returns a 0–1 score and per-chunk reasons.
  • fi.evals.ContextRelevanceToResponse — response-aware relevance, sharper for downstream attribution when the answer is already produced.
  • fi.evals.RAGScore / fi.evals.RAGScoreDetailed — composite RAG score for an end-to-end view across retrieval and generation.
  • fi.evals.Faithfulness — pair to localize retrieval vs. generation failure on the same trace.
  • OTel attribute retrieval.documents — the chunk list the evaluator scores; pair with retrieval.query and llm.output on the RAG span.
  • Dashboard signals — relevance-fail-rate by retriever version, by embedding model, by chunking strategy, and by query cohort.
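
In code, the query-only evaluator and the composite score look like this:
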
from fi.evals import ContextRelevance, RAGScoreDetailed

# Per-chunk and aggregate relevance of the retrieved chunks to the query.
rel = ContextRelevance().evaluate([{
    "query": "What does the parts warranty cover?",
    "contexts": [
        "Parts and labor warranty: 12 months from delivery.",
        "Free shipping over $50.",
    ],
}])

# Composite score across retrieval and generation on the same payload shape.
full = RAGScoreDetailed().evaluate([{
    "query": "...",
    "contexts": [...],
    "response": "...",
}])

Common Mistakes

  • Reading relevancy without grounding. A relevant context with low faithfulness still produces hallucinations; read both.
  • Single-number reporting. Per-chunk breakdown shows where the retriever poisoned the context.
  • Skipping the response-aware variant. ContextRelevanceToResponse is sharper when localizing a specific failed answer.
  • Treating low relevancy as a model bug. It almost always points upstream to embeddings, chunking, or reranking.
  • No regression cohort. Without a frozen dataset of representative queries, version-to-version comparisons are noisy.
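
The pairing in the first bullet can be made mechanical. A hypothetical localization helper, assuming a relevancy and a faithfulness score have already been computed for the same trace (for example via fi.evals.ContextRelevance and fi.evals.Faithfulness) and using an illustrative 0.5 cutoff:

def localize_failure(context_relevance: float, faithfulness: float, threshold: float = 0.5) -> str:
    """Map the two scores for one trace to the component that owns the fix."""
    if context_relevance < threshold:
        # Retrieval returned junk: look at embeddings, chunking, or reranking.
        return "retrieval failure"
    if faithfulness < threshold:
        # Context was fine but the answer is not grounded in it: look at prompts or model choice.
        return "generation failure"
    return "pass"

print(localize_failure(context_relevance=0.82, faithfulness=0.31))  # -> generation failure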

Frequently Asked Questions

What is the contextual relevancy RAG metric?

It is the per-chunk relevance score inside a RAG evaluation suite that asks whether each retrieved chunk is relevant to the user query, independent of whether the answer is grounded in the chunks.

How does it fit into a full RAG evaluation?

It is the upstream signal in a RAG evaluation. Pair it with `ContextPrecision` for ranking, `ContextRecall` for completeness, and `Faithfulness` for grounding to localize where a regression is occurring.

How do you measure the contextual relevancy RAG metric?

Run `fi.evals.ContextRelevance` against a query and the retrieved contexts; the evaluator returns per-chunk and aggregate relevance scores. For response-aware variants, use `ContextRelevanceToResponse`.