What Is Intent Coverage?
Intent coverage is the fraction of user intents a RAG system, search index, or chatbot can correctly answer, out of the full universe of intents users actually express. It is the breadth-side metric: how many of the things users ask can the system handle at all, before we ask how well it handles them. Depth metrics like `AnswerRelevancy`, `Faithfulness`, and `ContextRelevance` measure quality per intent; intent coverage measures whether the intent is even in scope. FutureAGI computes it by labeling production queries with intent classes via `IntentClassification` and running evals per intent.
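As a formula: coverage = |intents answered correctly| ÷ |intents expressed in production|. A minimal sketch of that ratio, with hypothetical intent labels standing in for classifier output:

```python
# Minimal sketch: intent coverage as a set ratio. The intent labels here
# are hypothetical; in practice they come from running an intent classifier
# over production queries.
production_intents = {"refund_status", "address_change", "card_dispute",
                      "fee_explanation", "statement_export"}
supported_intents = {"refund_status", "address_change", "card_dispute"}

coverage = len(production_intents & supported_intents) / len(production_intents)
print(f"intent coverage: {coverage:.0%}")  # 60%
```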
Why It Matters in Production LLM and Agent Systems
A RAG system’s eval scores depend on which questions you tested, and most teams test the questions they remember to ask. The questions users actually ask are messier, narrower, and often outside the index. Without intent coverage, you can ship a “95% faithfulness” system that fails 40% of real users — because 40% of intents weren’t represented in the eval set, weren’t indexed, or weren’t supported by the prompt.
The pain hits across roles. Product owners see retention drop on a slice of users without an obvious metric to point to. Customer support absorbs the friction as escalations and one-star reviews. Engineering ships an upgrade that improves average score and breaks a long-tail intent. Compliance leads cannot answer “does the system answer all intents users ask, or refuse appropriately when it can’t?” without an intent-coverage frame.
In 2026-era stacks the surface widens. Agents handle multi-intent conversations where a single user turn carries two or three intents — “refund my October order and update my address.” A coverage gap on one intent means the agent fails the whole turn even if the other intent is handled. Coverage reporting must operate at the intent level inside a turn, not just the turn level.
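A hedged sketch of what intent-level scoring inside a single turn looks like; the decomposition step is an assumption (a classifier or LLM call in practice), not a FutureAGI API:

```python
# Hypothetical sketch: score coverage per intent inside one user turn.
# The decomposition output is hard-coded here; in practice it would come
# from an intent classifier or an LLM decomposition step.
turn = "Refund my October order and update my address."
intents = ["refund_request", "address_update"]  # assumed decomposition of `turn`

supported = {"refund_request"}  # intents the agent can actually handle
per_intent = {intent: intent in supported for intent in intents}
turn_passes = all(per_intent.values())  # one coverage gap fails the whole turn

print(per_intent)    # {'refund_request': True, 'address_update': False}
print(turn_passes)   # False
```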
How FutureAGI Handles Intent Coverage
FutureAGI’s approach is to make intent a first-class axis on every evaluation. At the dataset level, every row in a Dataset is enriched with an intent label (either hand-labeled or assigned by IntentClassification). Dataset.add_evaluation runs Faithfulness, ContextRelevance, and AnswerRelevancy per row, and the dashboard surfaces eval-fail-rate-by-intent. At the trace level, traceAI integrations classify each production query into an intent and store it on the span. At the gap-detection level, intents that appear in production traces but not in the golden dataset are flagged as coverage holes — ones that need either index expansion, prompt update, or explicit refusal handling.
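A sketch of the gap-detection step under stated assumptions: both label streams come from `IntentClassification`, and the under-representation threshold of three rows per intent is arbitrary:

```python
# Hedged sketch of gap detection: production-trace intents vs. golden-set
# intents. The labels and the 3-rows-per-intent threshold are assumptions.
from collections import Counter

production = Counter(["refund_status", "card_dispute", "fee_explanation",
                      "refund_status", "statement_export"])
golden = Counter(["refund_status", "refund_status", "card_dispute"])

missing = set(production) - set(golden)              # zero eval coverage
thin = {i for i in production if 0 < golden[i] < 3}  # under-represented

print("coverage holes:", sorted(missing))   # ['fee_explanation', 'statement_export']
print("under-represented:", sorted(thin))   # ['card_dispute', 'refund_status']
```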
Concretely: a fintech RAG team running on traceAI-langchain ships a knowledge base of 800 articles. They sample 5,000 production traces, classify them with `IntentClassification`, and find 47 intents. The eval dataset only covered 28 of those 47; 19 intents had zero coverage. Of the 28 covered, three had pass rates under 50%. The team prioritizes index expansion for the 19 missing intents, prompt fixes for the three weak ones, and writes a regression eval that gates future deploys on intent-level pass rate. Coverage stops being a guess and becomes a measurable production property you can regression-test.
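The deploy gate at the end of that workflow can be as simple as an assertion over per-intent pass rates. A sketch, assuming the rates have been exported from eval runs; the intent names and the 50% floor are illustrative, not a FutureAGI API:

```python
# Illustrative deploy gate: block the release when any intent's eval
# pass rate falls below an agreed floor. All values here are examples.
pass_rates = {"refund_status": 0.92, "card_dispute": 0.66, "fee_explanation": 0.88}
FLOOR = 0.50  # threshold chosen by the team

failing = {intent: rate for intent, rate in pass_rates.items() if rate < FLOOR}
assert not failing, f"intent-level regression: {failing}"  # trips if any rate drops below FLOOR
```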
How to Measure or Detect It
Coverage measurement is two operations: enumerate intents and evaluate per-intent. The signals:
- `IntentClassification` evaluator: classifies every query into an intent class for downstream slicing.
- `AnswerRelevancy` per-intent: measures whether the answer addresses the user’s intent.
- `ContextRelevance` per-intent: surfaces retrieval gaps for intents the index doesn’t support.
- `ContextRecall` per-intent: measures whether the index returned the right chunk for each intent.
- `Faithfulness` per-intent: catches hallucinated answers on out-of-coverage intents.
- Coverage gap report (dashboard signal): the set of intents seen in production traces but missing or under-represented in the eval dataset.
Minimal Python:
```python
from fi.evals import AnswerRelevancy, ContextRelevance

relevancy = AnswerRelevancy()
context = ContextRelevance()

# Placeholder inputs; in practice, loop over every row of the eval dataset.
user_query = "How do I dispute a chargeback on my card?"
model_response = "Open the Transactions tab, select the charge, and choose Dispute."

result = relevancy.evaluate(input=user_query, output=model_response)
print(result.score, result.reason)

# Slice the per-row results by intent label to compute coverage gaps.
```
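One way to do that slicing, assuming the per-row results have been collected alongside their intent labels (the column names here are illustrative):

```python
# Hypothetical slicing step: per-row eval outcomes joined with intent labels.
import pandas as pd

rows = pd.DataFrame({
    "intent": ["refund_status", "refund_status", "card_dispute", "fee_explanation"],
    "passed": [True, False, True, False],
})

per_intent = rows.groupby("intent")["passed"].mean()
print(per_intent)  # pass rate per intent; the low rows are your coverage gaps
```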
Common Mistakes
- Reporting only the global pass rate. A 95% global score on a 60% coverage set means 40% of users are silently unsupported.
- Building the eval set from team imagination. Real intents are messier than what engineers think to test; sample from production traces.
- Treating refusal as failure on out-of-scope intents. A clean refusal on an out-of-scope intent is correct behavior; the real failure is a silent hallucination (see the sketch after this list).
- Mixing intent labels with cohort labels. Intent is “what the user wants”; cohort is “who the user is” — track both, separately.
- Skipping multi-intent decomposition. When a turn carries multiple intents, score per-intent, not just per-turn — coverage is intent-level by definition.
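On the refusal point above, a minimal sketch of a scoring rule that rewards clean refusals on out-of-scope intents; the refusal check is an assumed heuristic (a keyword match here, an LLM-judge eval in practice):

```python
# Illustrative scoring rule: on an out-of-scope intent, a clean refusal
# passes and a confident (likely hallucinated) answer fails.
def out_of_scope_pass(response: str) -> bool:
    refusal_markers = ("can't help", "cannot help", "out of scope")  # assumed heuristic
    return any(marker in response.lower() for marker in refusal_markers)

print(out_of_scope_pass("I can't help with that topic here."))  # True (correct refusal)
print(out_of_scope_pass("Your October order ships Tuesday."))   # False (silent hallucination)
```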
Frequently Asked Questions
What is intent coverage?
Intent coverage is the percentage of user intents that a RAG system or chatbot can correctly answer, out of all intents users actually express. It is the breadth metric that pairs with depth metrics like answer relevancy.
How is intent coverage different from accuracy?
Accuracy measures correctness on the intents you tested. Intent coverage measures whether you tested, and can answer, all the intents users have. A system with 95% accuracy on 60% coverage answers only 0.95 × 0.60 = 57% of real queries correctly; the remaining 43% of users hit intents the system never covered.
How do you measure intent coverage?
Build an intent-labeled dataset from production traces, run `Faithfulness`, `ContextRelevance`, and `AnswerRelevancy` per-intent in FutureAGI, and report pass rate sliced by intent; the gaps are your coverage holes.