Evaluation

What Is Groundedness?

A RAG evaluation metric that measures whether every claim in an LLM response is supported by the retrieved context provided to the model.

Groundedness is a RAG evaluation metric that checks whether an LLM response is firmly supported by the retrieved context the model received. An evaluator breaks the output into claims and verifies each one against the context; if any claim adds information the context does not contain, the response is ungrounded. It runs on two surfaces: as part of an offline regression suite over a golden dataset, and as a live evaluator on production traces inside platforms like FutureAGI. Groundedness is the primary signal for catching silent retrieval failures.
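
A minimal sketch of that claim-by-claim check, assuming a generic LLM judge; the helper names and the sentence-split decomposition are illustrative stand-ins, not FutureAGI's implementation:

def split_into_claims(response: str) -> list[str]:
    # Crude stand-in: a production evaluator uses an LLM to decompose the
    # response into atomic claims rather than splitting on sentences.
    return [s.strip() for s in response.split(".") if s.strip()]

def is_grounded(response: str, context: str, judge) -> tuple[bool, str]:
    # judge(claim, context) -> bool is any entailment checker, e.g. an LLM
    # prompted with "is this claim fully supported by the context?"
    for claim in split_into_claims(response):
        if not judge(claim, context):
            return False, f"Unsupported claim: {claim!r}"
    return True, "All claims are supported by the retrieved context."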

Why It Matters in Production LLM and Agent Systems

A RAG system that retrieves perfect context can still produce wrong answers. The model adds a date that was never in the chunk. It generalises a single quote into a policy. It mentions a competitor that was not in the document. None of these reads as a hallucination if you only spot-check: the answer sounds reasonable, the citation looks valid, the user nods. Without a groundedness check, these failures compound across every release.

The pain is concentrated on engineering and trust-and-safety teams. An ML engineer ships a new chunking strategy and groundedness drops 14 points overnight — the chunks are now too small to entail full claims. A support team gets escalations from a customer who was told their refund window was 60 days when the policy actually said 30. A compliance lead is asked, mid-audit, “how do you know your medical-info bot only repeats what is in your knowledge base?” and the only honest answer is that nobody is checking.

In 2026-era agentic-RAG stacks, the failure mode multiplies. The retriever runs once per planning step, every step’s output feeds the next step’s retrieval, and a single ungrounded intermediate claim gets cited as fact two hops later. Step-level groundedness scoring on each retrieval-augmented span is how you stop that chain.
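
A sketch of what step-level scoring can look like inside an agentic loop; the agent methods are placeholders for your own planner, retriever, and generator, and the result fields assume the Pass/Fail shape shown in the snippet further down:

def run_agent(question: str, agent, evaluator, max_steps: int = 5) -> list[str]:
    notes: list[str] = []
    for _ in range(max_steps):
        query = agent.plan_next_step(question, notes)   # placeholder planner
        docs = agent.retrieve(query)                    # placeholder retriever returning text chunks
        answer = agent.generate(query, docs, notes)     # placeholder generator

        # Score every retrieval-augmented span before its output can be
        # cited as fact by a later hop.
        result = evaluator.evaluate(input=query, output=answer,
                                    context="\n".join(docs))
        if result.score != "Pass":                      # assumes a Pass/Fail verdict
            continue                                    # drop the ungrounded intermediate
        notes.append(answer)
        if agent.is_done(question, notes):              # placeholder stop condition
            break
    return notes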

How FutureAGI Handles Groundedness

FutureAGI’s approach is to expose groundedness as a first-class evaluator at three points in the lifecycle. The fi.evals.Groundedness class takes output and context (with optional input) and returns Pass or Fail with a written reason — that is the same evaluator whether you call it from a notebook, attach it to a Dataset for offline regression, or wire it onto a production trace via traceAI.

Concretely: a RAG team using traceAI-langchain instruments their LangChain chain so every retrieval span carries retrieval.documents and the answer span carries llm.output. They configure Groundedness to score every answer span where retrieval ran, and the result is written back as a span event. The dashboard then plots groundedness pass-rate by route, by model variant, and by chunking strategy. When a deploy drops the rate from 94% to 81%, the team diffs failing reasons against passing ones in the FutureAGI evaluation explorer — the failing reasons cluster around a new prompt that told the model to “supplement with general knowledge”, which is exactly the regression that broke groundedness.
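
If you instrument by hand instead of through a traceAI integration, the non-negotiable part is that the span carries the retrieved documents. A rough OpenTelemetry sketch, where the retriever and LLM calls are placeholders and only the attribute names come from the conventions above:

from opentelemetry import trace

tracer = trace.get_tracer("rag-service")

def answer_with_rag(question: str, retriever, llm) -> str:
    with tracer.start_as_current_span("rag.answer") as span:
        docs = retriever.search(question)               # placeholder retriever
        # Without retrieval.documents on the span, the evaluator has
        # nothing to score against.
        span.set_attribute("retrieval.documents", [d.text for d in docs])
        answer = llm.complete(question, docs)           # placeholder LLM call
        span.set_attribute("llm.output", answer)
        return answer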

Unlike Ragas faithfulness, which scores claim-by-claim on a 0-1 scale, FutureAGI’s Groundedness is a hard pass/fail gate intended to be paired with the continuous Faithfulness evaluator — one for blocking, one for trending.

How to Measure or Detect It

Groundedness is measurable directly. The signals to wire up:

  • fi.evals.Groundedness — Pass/Fail per response with a reason string explaining which claim is unsupported.
  • fi.evals.RAGFaithfulness — companion 0-1 score for the proportion of claims supported by context, used for trending.
  • OTel attribute retrieval.documents — the context payload your evaluator scores against; without it, groundedness cannot be computed.
  • Eval-fail-rate-by-cohort (dashboard) — groundedness fail rate split by route, model, and chunking strategy.

Minimal Python:

from fi.evals import Groundedness

evaluator = Groundedness()

result = evaluator.evaluate(
    input="What is the refund window?",
    output="The refund window is 60 days.",
    context="Customers may request a refund within 30 days of purchase."
)
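# The 60-day claim is not supported by the 30-day policy in context,
# so this example should fail the check.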
print(result.score, result.reason)
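
The same evaluator doubles as the offline regression gate described earlier. A sketch of a golden-dataset check, assuming the evaluator instance above and that the Pass/Fail verdict is exposed on result.score:

golden_set = [
    {"input": "What is the refund window?",
     "output": "Refunds can be requested within 30 days of purchase.",
     "context": "Customers may request a refund within 30 days of purchase."},
    # ...more records from your golden dataset
]

failures = []
for record in golden_set:
    result = evaluator.evaluate(**record)
    if result.score != "Pass":
        failures.append((record["input"], result.reason))

pass_rate = 1 - len(failures) / len(golden_set)
print(f"groundedness pass rate: {pass_rate:.0%}")
for question, reason in failures:
    print("FAIL:", question, "-", reason)

# Block the release if the rate drops below your threshold.
assert pass_rate >= 0.90, "groundedness regression, see failing reasons above"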

Common Mistakes

  • Conflating groundedness with answer correctness. A response can be grounded in retrieved context that is itself wrong — groundedness tests the model, not the corpus. Pair with a separate factual-accuracy check.
  • Running groundedness without storing the retrieved context on the trace. If retrieval.documents is missing on the span, the evaluator has nothing to score against and silently fails open.
  • Using groundedness on summarisation tasks where the source is the input, not retrieved context. That is a faithfulness problem, not a RAG-groundedness one.
  • Treating a 100% groundedness pass-rate as success. It is also achievable by quoting the context verbatim and never answering the user’s question — pair with AnswerRelevancy, as in the sketch after this list.
  • Letting the same model that generated the answer also judge groundedness. Self-evaluation inflates scores; pin the judge to a different model family.
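
A sketch of pairing the two checks, covering the last two mistakes above. Whether fi.evals ships an AnswerRelevancy class with this exact evaluate() shape is an assumption to verify against the SDK reference, as is how the judge model is configured:

from fi.evals import Groundedness, AnswerRelevancy   # AnswerRelevancy name assumed

# Pin the judge to a different model family than the one that generated the
# answer (check the evaluators' configuration options for how to set this).
groundedness = Groundedness()
relevancy = AnswerRelevancy()

def release_gate(question: str, answer: str, context: str) -> bool:
    g = groundedness.evaluate(input=question, output=answer, context=context)
    r = relevancy.evaluate(input=question, output=answer, context=context)
    # Require both: grounded-but-evasive answers fail relevancy,
    # relevant-but-invented answers fail groundedness.
    return g.score == "Pass" and r.score == "Pass"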

Frequently Asked Questions

What is groundedness in RAG?

Groundedness is a pass/fail metric that confirms every claim in a response is backed by the retrieved context. If the model adds information the context does not contain, the response fails.

How is groundedness different from faithfulness?

Groundedness checks that the response stays inside the provided context. Faithfulness measures the proportion of claims supported — it returns a continuous 0-1 score. Groundedness is the gate, faithfulness is the gauge.

How do you measure groundedness?

FutureAGI's fi.evals Groundedness evaluator takes the output and the retrieved context as inputs and returns Pass or Fail with a reason. You wire it onto every RAG span in production via traceAI.