What Is Faithfulness?
A RAG evaluation metric scoring the proportion of claims in a response that are supported by the retrieved context, returned as a 0-1 score.
Faithfulness is a RAG evaluation metric that quantifies how much of a model’s response is actually supported by the retrieved context. The evaluator extracts atomic claims from the output, runs each through a Natural Language Inference (NLI) check against the context, and returns a score from 0.0 (every claim hallucinated) to 1.0 (fully faithful). It runs on offline RAG datasets and on live production spans, and it is the metric of choice when you need a continuous signal — say, to detect a 6-point regression — rather than the hard pass/fail of groundedness.
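In sketch form, under the definition above (the claim extractor and NLI check are passed in as callables because the real models behind them are implementation details of the evaluator, not functions this sketch can name):

from typing import Callable

def faithfulness_score(
    response: str,
    contexts: list[str],
    extract_claims: Callable[[str], list[str]],  # hypothetical: LLM or trained claim extractor
    nli_entails: Callable[[str, str], bool],     # hypothetical: NLI check, (premise, hypothesis) -> bool
) -> float:
    claims = extract_claims(response)
    if not claims:
        return 1.0  # no factual claims, nothing to hallucinate
    supported = sum(
        1 for claim in claims
        if any(nli_entails(ctx, claim) for ctx in contexts)
    )
    # 0.0 = every claim unsupported, 1.0 = fully faithful
    return supported / len(claims)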
Why It Matters in Production LLM and Agent Systems
Most RAG failures are partial. The model gets the headline right, then garnishes the answer with one sentence the context never mentioned — a date, a phone number, a product name pulled from parametric memory. A binary “is this hallucinating” check often passes that response. A continuous faithfulness score does not: it drops from 1.0 to 0.83, the regression dashboard lights up, and the trace surfaces the unsupported claim verbatim.
This shows up across three audiences. ML engineers need faithfulness as the leading indicator for retrieval, prompt, or model regressions — it moves before user-facing accuracy does. Product managers need it for SLAs: “we maintain ≥0.92 mean faithfulness on customer-facing answers” is a number you can ship to a contract. Compliance teams need it because a 0.95 faithfulness score with a per-claim breakdown is auditable evidence; “the model seemed fine” is not.
In 2026 agentic-RAG and self-RAG stacks, faithfulness becomes a step-level signal. Each retrieval-augmented step in a multi-hop trajectory should carry its own faithfulness score, because a single 0.4-faithfulness intermediate output corrupts every downstream tool call that consumes it. End-to-end metrics hide this; per-step faithfulness exposes it.
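A hedged sketch of what step-level scoring can look like. The trajectory shape here is illustrative, and it assumes RAGFaithfulness.evaluate accepts a batch and returns one entry per item, mirroring the single-item call shown later on this page:

from fi.evals import RAGFaithfulness

faithfulness = RAGFaithfulness()

# Illustrative multi-hop trajectory: one dict per retrieval-augmented step.
trajectory = [
    {"query": "...", "response": "...", "contexts": ["..."]},
    {"query": "...", "response": "...", "contexts": ["..."]},
]

results = faithfulness.evaluate(trajectory)
for i, res in enumerate(results.eval_results):
    score = float(res.output)  # assumes output carries the numeric 0-1 score
    if score < 0.6:  # illustrative floor; tune against your own baseline
        print(f"step {i}: faithfulness {score:.2f}, downstream tool calls at risk")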
How FutureAGI Handles Faithfulness
FutureAGI’s approach is to ship two faithfulness evaluators in fi.evals because the use cases differ. Faithfulness is the general claim-vs-context NLI scorer — it accepts a response, one or more context strings, and returns a 0-1 score with per-claim labels (supported, contradicted, neutral). RAGFaithfulness is the RAG-aware variant: it filters query echoing, supports multi-context retrieval lists, and exposes confidence-weighted scoring — the things that matter once you are running on actual production retrieval traffic, not synthetic benchmarks.
Concretely: a team running a documentation chat-bot on traceAI-llamaindex samples 10% of production spans, runs RAGFaithfulness on each, and writes the score back as a span event. The Agent Command Center dashboard plots mean faithfulness per index version. When a new chunking config drops mean faithfulness from 0.91 to 0.78, the team filters traces below 0.6, exports the failing claims into a regression dataset, and runs ProTeGi to evolve the system prompt against that dataset — closing the loop from production signal to fix.
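A sketch of the write-back step, assuming an OpenTelemetry span is in hand; the event name and attribute key are illustrative, not a fixed traceAI schema:

from fi.evals import RAGFaithfulness
from opentelemetry import trace

faithfulness = RAGFaithfulness()

def annotate_span(span: trace.Span, query: str, response: str, contexts: list[str]) -> float:
    result = faithfulness.evaluate([{
        "query": query,
        "response": response,
        "contexts": contexts,
    }])
    score = float(result.eval_results[0].output)  # assumes output is the 0-1 score
    # Illustrative event name and attribute; match your own span conventions.
    span.add_event("rag_faithfulness", attributes={"faithfulness.score": score})
    return score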
We have found that pairing Faithfulness (continuous, for trending) with Groundedness (binary, for blocking) catches partial hallucinations that either metric on its own would miss — Ragas-style faithfulness alone is the trend, but you still need a gate.
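A minimal sketch of that pairing. The binary gate below is a stand-in derived from the continuous score (every claim supported means the score is exactly 1.0 under the per-claim definition); swap in the actual Groundedness evaluator when you wire this up:

history: list[float] = []

def trend_and_gate(score: float) -> bool:
    history.append(score)  # continuous signal: feeds the trending dashboard
    # Stand-in binary gate: every claim supported <=> score is (numerically) 1.0.
    # Replace with the real Groundedness evaluator in production.
    return score >= 1.0 - 1e-9

print(trend_and_gate(0.83))  # partial hallucination: recorded for trending, blocked by the gate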
How to Measure or Detect It
Faithfulness is directly measurable. Wire up:
- fi.evals.Faithfulness — generic 0-1 score with per-claim entailment labels.
- fi.evals.RAGFaithfulness — RAG-aware variant with multi-context, query-echo filtering, and confidence weighting.
- fi.evals.RAGFaithfulnessWithReference — same plus a reference-answer signal for benchmark evaluations.
- OTel attribute retrieval.documents — the context array your evaluator scores against.
- Eval score percentiles (dashboard) — track p50 and p10 faithfulness; the p10 moves first under a regression.
Minimal Python:
from fi.evals import RAGFaithfulness

faithfulness = RAGFaithfulness()
result = faithfulness.evaluate([{
    "query": "What is the refund window?",
    # The second claim ("doubles for premium accounts") has no support in the
    # context below, so the score should land below 1.0.
    "response": "The refund window is 30 days, and it doubles for premium accounts.",
    "contexts": ["Customers may request a refund within 30 days of purchase."]
}])
print(result.eval_results[0].output, result.eval_results[0].reason)
Common Mistakes
- Treating faithfulness as a single gauge for RAG quality. A 0.95 faithfulness score on a wrong-question response is still a wrong answer. Pair with AnswerRelevancy and ContextRelevance.
- Using Faithfulness when you should use RAGFaithfulness. The basic class does not filter query echo or support multi-context retrieval, so your scores will be artificially high on RAG traffic.
- Computing faithfulness against the input prompt instead of retrieved context. That measures instruction adherence, not faithfulness: different evaluator, different failure mode.
- Setting alert thresholds without baselining the p10. Mean faithfulness moves slowly; the bottom decile moves first. Alert on the 10th percentile dropping below your floor (see the sketch after this list).
- Skipping NLI entailment in favour of word overlap. Word-overlap faithfulness rewards verbatim copying and misses paraphrased support; RAGFaithfulness uses NLI on purpose.
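The p10 alert from the fourth bullet, as a minimal numpy sketch; the 0.75 floor is illustrative, so baseline it against your own traffic first:

import numpy as np

def p10_regressed(scores: list[float], floor: float = 0.75) -> bool:
    # The mean moves slowly under a regression; the bottom decile moves
    # first, so the alert watches p10 rather than p50.
    return float(np.percentile(scores, 10)) < floor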
Frequently Asked Questions
What is faithfulness in LLM evaluation?
Faithfulness is a 0-1 metric for the share of claims in a response that are supported by the provided context. A score of 0.6 means 60% of claims were entailed by context.
How is faithfulness different from groundedness?
Faithfulness is continuous — it gives a 0-1 score per response. Groundedness is binary — Pass only if every claim is supported. Use faithfulness for trending and regressions, groundedness as a release gate.
How do you measure faithfulness?
FutureAGI's fi.evals.Faithfulness extracts claims from the response, runs NLI entailment against the context, and returns the support ratio. RAGFaithfulness adds RAG-specific handling like query-echo filtering and multi-context support.