Evaluation

What Is Answer Relevancy?

An LLM evaluation metric that scores how directly a model's response addresses the user's query, blending keyword coverage, semantic similarity, and direct-answer indicators into a 0-1 score.

Answer relevancy is an LLM evaluation metric that scores how directly a model’s response addresses the user’s query. The evaluator combines three signals — keyword coverage between query and response, semantic similarity, and direct-answer indicators — into a 0-1 score and penalises off-topic, evasive, or over-general responses. It is independent of whether the answer is correct or grounded; a response can be factually right and still score low on relevancy if it answered an adjacent question. Engineers run answer relevancy on offline eval datasets and on production traces to catch query-misinterpretation regressions.

Why It Matters in Production LLM and Agent Systems

Off-topic or evasive answers are the silent killer of LLM products. The model is confident, the retrieved context is relevant, the claims are faithful — but the user asked about Q3 revenue and the model produced a paragraph about Q3 strategy. Without answer relevancy, this regression hides behind every other “passing” metric. It is also the failure mode users complain about most often: a wrong-question answer feels more frustrating than a wrong-fact answer, because the user walks away having learned nothing about what they actually asked.

The pain hits product, support, and ML teams. A product manager sees CSAT slip without any eval-fail signal because the model is producing technically-correct-but-irrelevant answers. A support team gets escalations on questions like “what’s the deductible?” being answered with policy-level overviews. An ML engineer rolls back a “helpful” prompt change that pushed the model toward expansive answers — relevancy fell, even though faithfulness rose.

In 2026 conversational and agentic stacks, multi-turn relevancy compounds. Each turn that drifts off-topic pulls the trajectory further from the user’s actual goal, and a single low-relevancy turn at step three can derail an eight-step agent. Step-level answer relevancy on every agent response catches this where end-to-end relevancy never can.

How FutureAGI Handles Answer Relevancy

FutureAGI’s approach is to ship fi.evals.AnswerRelevancy as a multi-signal local metric that runs in offline datasets and on production traces with the same code. The evaluator takes a query and a response and returns a 0-1 score blended from keyword coverage (default weight 0.3), semantic similarity over sentence embeddings (0.5), and structural direct-answer indicators (0.2). The weights are configurable per use case — keyword-weighted for short factual lookups, semantic-weighted for paraphrase-heavy chat — and the per-signal breakdown is returned alongside the aggregate.
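
The weighted blend described above can be sketched in plain Python. This is an illustrative toy, not the fi.evals internals; blend_relevancy and its signal inputs are hypothetical names, and the default weights simply mirror the 0.3/0.5/0.2 split stated above.

```python
# Toy sketch of a multi-signal relevancy blend (illustrative only):
# three 0-1 signals combined with configurable weights.

def blend_relevancy(keyword: float, semantic: float, structure: float,
                    weights: tuple = (0.3, 0.5, 0.2)) -> dict:
    """Blend per-signal scores into an aggregate, keeping the breakdown."""
    w_kw, w_sem, w_struct = weights
    aggregate = w_kw * keyword + w_sem * semantic + w_struct * structure
    return {
        "score": round(aggregate, 3),
        "breakdown": {"keyword": keyword, "semantic": semantic,
                      "structure": structure},
    }

# A paraphrase-heavy chat answer: low keyword overlap, high semantic match.
result = blend_relevancy(keyword=0.4, semantic=0.9, structure=0.8)
# 0.3*0.4 + 0.5*0.9 + 0.2*0.8 = 0.73
print(result["score"])
```

A keyword-weighted profile for short factual lookups would pass, say, weights=(0.5, 0.3, 0.2); returning the breakdown alongside the aggregate is what makes per-signal debugging possible.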

Concretely: a customer-support team running on traceAI-openai-agents instruments their assistant. They sample 5% of production conversations into an evaluation cohort and run AnswerRelevancy on every assistant turn. The Agent Command Center dashboard plots relevancy distribution by intent class. When a prompt change pushes mean relevancy from 0.84 to 0.71 specifically on billing queries, the team filters traces below 0.6, exports them as a Dataset, and runs PromptWizard to evolve a prompt that recovers the specificity. The optimisation loop is anchored to relevancy as the fitness function — the change ships only when relevancy clears the prior baseline.
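
The triage step in that loop can be sketched with plain Python over per-turn scores. The record shape below is hypothetical (in practice the scores come from running AnswerRelevancy over sampled traces), but the mean-by-intent split and below-threshold filter are the same operations the dashboard workflow performs.

```python
# Hypothetical per-turn records; in practice these come from running
# AnswerRelevancy over sampled production traces.
from collections import defaultdict

turns = [
    {"trace_id": "t1", "intent": "billing",  "relevancy": 0.55},
    {"trace_id": "t2", "intent": "billing",  "relevancy": 0.82},
    {"trace_id": "t3", "intent": "shipping", "relevancy": 0.91},
    {"trace_id": "t4", "intent": "billing",  "relevancy": 0.48},
]

# Mean relevancy per intent class: a global mean would hide a billing-only dip.
by_intent = defaultdict(list)
for turn in turns:
    by_intent[turn["intent"]].append(turn["relevancy"])
means = {intent: sum(s) / len(s) for intent, s in by_intent.items()}

# Turns below the 0.6 floor become the export cohort for prompt optimisation.
cohort = [turn["trace_id"] for turn in turns if turn["relevancy"] < 0.6]
print(means, cohort)
```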

Unlike Ragas answer-relevancy, which generates synthetic questions from the response and measures how well they match the original query, FutureAGI’s evaluator combines direct keyword and semantic signals — faster on production traffic and easier to debug per-signal.

How to Measure or Detect It

Answer relevancy is directly measurable. Wire up:

  • fi.evals.AnswerRelevancy — multi-signal 0-1 score with per-signal breakdown (keyword, semantic, structure).
  • fi.evals.Completeness — companion metric for whether the response fully answers the query rather than just touching on it.
  • fi.evals.IsHelpful — judge-style helpfulness score, complementary signal for chat.
  • OTel attributes input.value and llm.output — the inputs every relevancy evaluator depends on.
  • Mean and p25 relevancy by intent class (dashboard) — splitting by intent exposes regressions that a global mean hides.

Minimal Python:

from fi.evals import AnswerRelevancy

# Default signal weights: keyword 0.3, semantic 0.5, structure 0.2.
evaluator = AnswerRelevancy()

# evaluate() takes a list of query/response pairs, one result per pair.
result = evaluator.evaluate([{
    "query": "What is the capital of France?",
    "response": "The capital of France is Paris."
}])

# Each result carries the 0-1 score plus a written reason.
print(result.eval_results[0].output, result.eval_results[0].reason)

Common Mistakes

  • Reporting answer relevancy as a quality metric. Relevancy is a necessary condition, not sufficient — pair with Faithfulness for correctness and Groundedness for support.
  • Using BLEU or ROUGE in place of answer relevancy. Those compare to a reference answer; relevancy is reference-free and works on open-ended chat where there is no canonical gold answer.
  • Setting a single global threshold across query types. Lookups need higher relevancy than exploratory questions; split alerts by query intent or your alerts will be noisy.
  • Alerting on legitimate refusals. A safety-driven refusal of an unsafe prompt is correct behaviour but scores low on direct-answer indicators; gate by AnswerRefusal before alerting on relevancy.
  • Ignoring the per-signal breakdown. A low relevancy with high keyword coverage but low semantic similarity is a different regression than the inverse — use the breakdown to debug.
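
The refusal gate from the list above can be sketched as a single predicate. Field names here are illustrative: the is_refusal flag would come from an AnswerRefusal-style check, not from the relevancy evaluator itself.

```python
# Illustrative alerting gate: suppress relevancy alerts on legitimate refusals.
def should_alert(turn: dict, threshold: float = 0.6) -> bool:
    if turn.get("is_refusal"):
        # A safety-driven refusal legitimately scores low on direct-answer
        # indicators; do not page anyone for it.
        return False
    return turn["relevancy"] < threshold

should_alert({"relevancy": 0.31, "is_refusal": True})   # False: refusal, no alert
should_alert({"relevancy": 0.31, "is_refusal": False})  # True: genuine drift
```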

Frequently Asked Questions

What is answer relevancy?

Answer relevancy is a 0-1 metric for how directly a response addresses the user's query. It penalises off-topic, evasive, or overly general answers — independent of whether the answer is factually correct.

How is answer relevancy different from groundedness?

Answer relevancy asks 'does the response address the question?' Groundedness asks 'is the response supported by the retrieved context?' A response can be fully grounded yet answer the wrong question, and vice versa — they are orthogonal.

How do you measure answer relevancy?

FutureAGI's fi.evals.AnswerRelevancy combines keyword coverage between query and response, semantic similarity, and direct-answer indicators into a 0-1 score with a written reason.