What Is Groundedness?
A RAG evaluation metric that measures whether every claim in an LLM response is supported by the retrieved context provided to the model.
What Is Groundedness?
Groundedness is a RAG evaluation metric that checks whether an LLM response is firmly supported by the retrieved context the model received. An evaluator breaks the output into claims and verifies each one against the context. if any claim adds information the context does not contain, the response is ungrounded. It runs in two surfaces: as part of an offline regression-eval suite over a golden dataset, and as a live evaluator on production traces inside platforms like FutureAGI. Groundedness is the primary signal for catching silent retrieval failures, and in 2026 agentic-RAG stacks it runs at every retrieval-augmented step, not just at the final answer.
Why groundedness matters in production LLM and agent systems
A RAG system that retrieves perfect context can still produce wrong answers. The model adds a date that was never in the chunk. It generalizes a single quote into a policy. It mentions a competitor that was not in the document. It hedges with “research suggests…” when the context contained no such research. None of these reads as a hallucination if you only spot-check. the answer sounds reasonable, the citation looks valid, the user nods. Without a groundedness check, this regression compounds across every release.
The pain is concentrated on engineering and trust-and-safety teams. An ML engineer ships a new chunking strategy and groundedness drops 14 points overnight. the chunks are now too small to entail full claims. A support team gets escalations from a customer who was told their refund window was 60 days when the policy actually said 30. A compliance lead is asked, mid-audit, “how do you know your medical-info bot only repeats what is in your knowledge base?” and the only honest answer is that nobody is checking. A legal-tech team discovers their contract assistant invented a clause number that didn’t exist in any of the retrieved documents.
In 2026 agentic-RAG stacks the failure mode multiplies. The retriever runs once per planning step, every step’s output feeds the next step’s retrieval, and a single ungrounded intermediate claim gets cited as fact two hops later. The dominant 2026 failure pattern isn’t “the model hallucinated”; it’s “the model hallucinated at step 2, the agent retrieved more context based on that hallucination at step 4, and step 6 produced a fluent, well-cited answer built on a fictional premise.” Step-level groundedness scoring on each retrieval-augmented span. paired with ChunkAttribution and ContextRelevance. is how you stop that chain.
Groundedness vs faithfulness vs context relevance. the 2026 distinctions
These three metrics are constantly conflated in vendor docs, and the conflation costs teams real production debugging time.
| Metric | Question answered | Output | Where to use |
|---|---|---|---|
| Groundedness | Does every claim trace to the retrieved context? | Pass / Fail with reason | Release gate, post-guardrail on RAG routes |
| Faithfulness | What proportion of claims are supported? | 0-1 continuous | Trending dashboards, model A/B |
| Context Relevance | Could the retrieved context answer the question at all? | 0-1 continuous | Retriever tuning, chunk size regression |
| Chunk Attribution | Did the model actually use the retrieved context? | Per-chunk score | Diagnosing “retrieval was good but model ignored it” |
| Answer Relevancy | Does the answer address the question? | 0-1 continuous | Catching grounded-but-evasive answers |
| Hallucination Score | Composite. input, context, and output | 0-1 composite | Overall reliability index across RAG and non-RAG |
Groundedness is the gate. Faithfulness is the gauge. Context relevance is the retriever diagnostic. Chunk attribution is the “did the model bother to read what we gave it?” check. Run all five and you can distinguish “retriever failure” from “model failure” from “prompt told the model to add general knowledge” in a single dashboard row.
How FutureAGI handles groundedness
FutureAGI’s approach is to expose groundedness as a first-class evaluator at three points in the lifecycle. The fi.evals.Groundedness class takes output and context (with optional input) and returns Pass or Fail with a written reason. the same evaluator whether you call it from a notebook, attach it to a Dataset for offline regression, or wire it onto a production trace via traceAI. The judge model is configurable and runs out-of-family from your production model by default, which removes the self-evaluation bias that inflates scores on competitor stacks.
Concretely: a RAG team using traceAI-langchain instruments their LangChain chain so every retrieval span carries retrieval.documents and the answer span carries llm.output. They configure Groundedness to score every answer span where retrieval ran, and the result is written back as a span event with eval.groundedness.score and eval.groundedness.reason. The dashboard then plots groundedness pass-rate by route, by gen_ai.request.model, and by chunking strategy. When a deploy drops the rate from 94% to 81%, the team diffs failing reasons against passing ones in the FutureAGI evaluation explorer. the failing reasons cluster around a new prompt that told the model to “supplement with general knowledge,” which is exactly the regression that broke groundedness.
The same Groundedness class can run as a post-guardrail inside Agent Command Center. On a healthcare or financial-services route, configure post-guardrail: [Groundedness, PII, ContentSafety]; ungrounded responses are blocked, redacted, or escalated rather than reaching the user. Unlike Ragas faithfulness, which scores claim-by-claim on a 0-1 scale but does not ship as a runtime block, FutureAGI’s Groundedness is a hard pass/fail gate intended to be paired with the continuous Faithfulness evaluator. one for blocking, one for trending. Unlike DeepEval’s faithfulness metric which leaves judge-model pinning to the user, FutureAGI defaults to a pinned out-of-family judge.
In our 2026 evals across customer healthcare, fintech, and legal-tech RAG stacks, we’ve found that groundedness pass-rate is the single best leading indicator of escalation rate two weeks later. A 5-point drop in groundedness on Monday reliably predicts a measurable rise in human-handoff rate by the following Monday. well before user complaints make it into a Jira backlog.
Groundedness in agentic-RAG: the multi-hop trap
A single-turn RAG question hits one retrieval, one generation, one groundedness check. A 2026 agentic-RAG flow hits between three and twenty. The failure pattern that costs teams the most: the agent retrieves cleanly at step 1, hallucinates a confident bridging claim at step 2, uses that claim as the query at step 3, retrieves new (now-irrelevant) context based on it, and produces a final answer at step 6 that is fluent, well-cited from step-5 documents, and entirely built on the step-2 fiction.
The instrumentation pattern that catches this:
- Score
Groundednesson every retrieval-augmented step, not just the final answer. - Pair with
ChunkAttributionper step. when the model “uses” retrieved content but the content was retrieved based on a hallucinated query, bothChunkAttributionandGroundednesspass at that step, but the previous step’sGroundednessfails. - Plot step-level groundedness as a trajectory heatmap. A single red cell in the middle of an otherwise-green trajectory is the leading indicator of a downstream wrong answer.
- Wire a post-guardrail on every intermediate retrieval-augmented step in safety-critical routes. If
Groundednessfails mid-trajectory, abort the agent loop rather than letting the chain continue.
This is the part where FutureAGI’s traceAI instrumentation matters more than the evaluator itself. The evaluator is widely available; the per-step trajectory view that makes the multi-hop failure visible is not.
Groundedness for non-RAG outputs
A common 2026 ask: “we don’t do RAG, but we want a groundedness number.” Three patterns work:
- Summarization. the source document is the “context”; the summary is the “output”. Use
Faithfulness(the continuous metric), notGroundednessper se. - Tool-using agents. the tool outputs become the context; the model’s final answer must be supported by what tools returned. Score
Groundednessagainst the concatenated tool-response payload. - Multi-turn chat. earlier turns from the user become the context; later assistant claims must be supported. Useful for catching the “model invented a user preference” failure.
Judge-model choice for groundedness
The judge model behind a groundedness evaluator is the single largest source of score variance across vendors. In our 2026 evals comparing judge configurations, three patterns are robust:
- Out-of-family pinning. If your production model is GPT-5.x, judge with Claude Opus 4.7 or Gemini 3 Pro. never another GPT. Same-family judging inflates pass rates by 3-8 points on subjective rubrics.
- Reasoning-mode judges. For groundedness specifically, judges in extended-thinking mode (Claude’s extended thinking, GPT-5’s deep reasoning) catch implicit unsupported claims that fast judges miss. The latency tax is worth it for offline runs.
- Two-judge consensus for release gates. On gate runs, require agreement between two out-of-family judges. Disagreements queue to human review through
fi.queues.AnnotationQueue. This is the configuration we recommend for high-stakes routes. - Cheap judges for monitoring. Inline production groundedness checks need cheap judges (Haiku 4.5, Gemini 3 Flash) because they run on every request. Reserve heavy judges for offline gate runs.
Groundedness over long context
2026 frontier models advertise 1M+ context windows, and teams sometimes assume that solves grounding. It does not. NVIDIA’s RULER benchmark (4K-128K context, 13 task categories) and LongBench v2 (~3K long-context examples, 8 task types) show retrieval quality and groundedness both degrade after roughly 32K-128K tokens depending on model, and on RAGTruth’s 18K labeled chunks the median frontier model still fails grounding on 5-8% of answers. The “lost in the middle” pattern is alive in 2026. claims grounded against context at the start and end of a long window are well-supported, claims grounded against the middle are systematically weaker.
The instrumentation pattern: record context length on every trace (gen_ai.usage.input_tokens), then plot groundedness pass-rate vs context length. The break point is where you should re-rank, chunk, or summarize before the final generation step.
How to measure or detect groundedness
Groundedness is measurable directly. The signals to wire up:
fi.evals.Groundedness. Pass/Fail per response with a reason string explaining which claim is unsupported.fi.evals.RAGFaithfulness. companion 0-1 score for the proportion of claims supported by context, used for trending.fi.evals.ChunkAttribution. whether the model actually used the retrieved chunks; distinguishes “ignored retrieval” from “misused retrieval.”fi.evals.ContextRelevance. whether retrieved context could have answered the query in the first place.fi.evals.AnswerRelevancy. guards against grounded-but-evasive responses that quote context but never answer the user.fi.evals.HallucinationScore. composite signal that combines context, input, and output for an overall hallucination index.- OTel attribute
retrieval.documents. the context payload your evaluator scores against; without it, groundedness cannot be computed. - OTel attribute
gen_ai.request.model. slice groundedness by model variant; sudden drops on one model usually mean a prompt change interacts badly with that model’s compliance behavior. - Eval-fail-rate-by-cohort (dashboard). groundedness fail rate split by route, model, chunking strategy, and tenant.
- Citation density. number of explicit citations per 100 tokens of output; abrupt drops correlate with grounded-to-ungrounded regressions.
Minimal Python:
from fi.evals import Groundedness, RAGFaithfulness, AnswerRelevancy
groundedness = Groundedness()
faithfulness = RAGFaithfulness()
relevancy = AnswerRelevancy()
result = groundedness.evaluate(
input="What is the refund window?",
output="The refund window is 60 days.",
context="Customers may request a refund within 30 days of purchase."
)
print(result.score, result.reason)
Pair the three: Groundedness for the gate, RAGFaithfulness for the trend, AnswerRelevancy to catch the trick of “grounded by saying nothing.”
For an online eval wired to a traceAI span on every retrieval-augmented step, configure Groundedness as a span-event evaluator with a pinned out-of-family judge:
from fi.evals import Groundedness, ChunkAttribution
from fi.tracing import traceAI
# Inline groundedness on every RAG span: score gets written back as a span event
groundedness = Groundedness(judge_model="claude-opus-4-7", mode="per_step")
chunk_attr = ChunkAttribution()
@traceAI.span(kind="llm", evaluators=[groundedness, chunk_attr])
def generate(question, retrieved_docs):
answer = llm.complete(question, context=retrieved_docs)
# Span event eval.groundedness.{score,reason} written automatically
# Hard-block on healthcare/legal routes:
if not groundedness.last_result.passed and route in {"clinical", "legal"}:
raise PostGuardrailBlock("ungrounded_answer")
return answer
Groundedness in the 2026 retrieval stack
A 2026 retrieval stack has more moving parts than the 2023 “embed + cosine” pattern, and each one creates a distinct groundedness failure surface:
- Hybrid retrieval (dense + BM25 + metadata filter). groundedness fails when the BM25 component pulls a chunk that looks relevant by keywords but lacks the entailing detail. The model writes a confident answer based on the keyword match.
- Re-rankers. a re-ranker can promote a chunk that mentions the query terms but doesn’t support the answer. Score
ContextRelevanceper chunk before re-ranking, thenGroundednesson the answer. - Query rewriting. when the agent rewrites the user query for retrieval, the rewritten query may pull chunks irrelevant to the original intent. Score
AnswerRelevancyagainst the original user input, not the rewritten query. - Graph-RAG. when retrieval traverses a knowledge graph, the entailment path can stretch multiple hops; pair
GroundednesswithMultiHopReasoningto score path-level support. - Long-context “no-retrieval” architectures. some 2026 stacks dump entire knowledge bases into the model context, skipping retrieval. Groundedness still applies (the context is still finite); pair with
RULER-style retrieval probes.
The pattern across all five: instrument the chunk and the answer separately, score Groundedness on the answer, ContextRelevance on each retrieved chunk, and plot the joint distribution. The cells where chunk relevance is high and answer groundedness is low are where you find the most-impactful retrieval bugs.
Groundedness for refusals
A subtle 2026 failure: the model correctly refuses to answer because the context doesn’t support a confident answer, and your groundedness evaluator flags the refusal as “ungrounded” because the refusal text doesn’t appear verbatim in the context. The fix is in the evaluator config. refusals should be evaluated against an AnswerRefusal rubric or excluded from Groundedness scoring, not penalized as ungrounded. We’ve seen teams ship “improvements” that lifted groundedness pass-rate by training the model to never refuse, which is exactly the wrong direction.
Common mistakes
- Conflating groundedness with answer correctness. A response can be grounded in retrieved context that is itself wrong. groundedness tests the model, not the corpus. Pair with a separate factual-accuracy check and a knowledge-base review cadence.
- Running groundedness without storing the retrieved context on the trace. If
retrieval.documentsis missing on the span, the evaluator has nothing to score against and silently fails open. Audit your traceAI integration on every deploy. - Using groundedness on summarization tasks where the source is the input, not retrieved context. That is a faithfulness problem, not a RAG-groundedness one. Run
Faithfulnessinstead, with the input as the context. - Treating a 100% groundedness pass-rate as success. It is also achievable by quoting the context verbatim and never answering the user’s question. pair with
AnswerRelevancy. The combined “grounded AND relevant” gate is what you actually want. - Letting the same model that generated the answer also judge groundedness. Self-evaluation inflates scores by 3-8 points in our 2026 evals; pin the judge to a different model family.
- Running groundedness only on the final answer in an agent trajectory. Intermediate retrieval-augmented steps can hallucinate and propagate; score every retrieval-augmented span, not just the last one.
- Ignoring the prompt as a regression source. When groundedness drops without a model or retriever change, the prompt almost always added phrases like “use your general knowledge” or “supplement with…”. Diff the prompt first.
- No per-route threshold. A 92% groundedness rate is acceptable for general chat and a release blocker for clinical answers. Set per-route gates, not one global number.
Frequently Asked Questions
What is groundedness in RAG?
Groundedness is a pass/fail metric that confirms every claim in a response is backed by the retrieved context. If the model adds information the context does not contain, the response fails.
How is groundedness different from faithfulness?
Groundedness checks that the response stays inside the provided context. Faithfulness measures the proportion of claims supported. it returns a continuous 0-1 score. Groundedness is the gate, faithfulness is the gauge.
How do you measure groundedness?
FutureAGI's fi.evals Groundedness evaluator takes the output and the retrieved context as inputs and returns Pass or Fail with a reason. You wire it onto every RAG span in production via traceAI.