What Is Factual Consistency?

Factual consistency is an LLM evaluation metric that scores how well a response’s claims agree with a reference answer or known-correct text. The evaluator extracts atomic claims and runs each through Natural Language Inference (NLI) checks against the reference: entailed claims score full credit, neutral claims get partial credit, and contradicted claims are penalised heavily. The output is a 0-1 score with per-claim status. Where groundedness measures support against retrieved context, factual consistency measures agreement against a reference, which makes it the go-to metric for benchmarks, golden datasets, and any evaluation with a canonical correct answer.
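
As a mental model, the entailed/neutral/contradicted weighting can be sketched in a few lines of Python. This is an illustration, not FutureAGI’s actual scoring function; the label names, the 0.5 neutral weight, and the empty-claims behaviour are all assumptions.

# Illustration only -- not FutureAGI's actual scoring function.
# Assumes each atomic claim has already been labelled by an NLI model.
CLAIM_WEIGHTS = {
    "entailed": 1.0,      # full credit
    "neutral": 0.5,       # partial credit (assumed weight)
    "contradicted": 0.0,  # heavily penalised
}

def consistency_score(claim_labels: list[str]) -> float:
    """Aggregate per-claim NLI labels into a 0-1 score."""
    if not claim_labels:
        return 1.0  # no checkable claims (edge-case choice is an assumption)
    return sum(CLAIM_WEIGHTS[label] for label in claim_labels) / len(claim_labels)

# Two entailed claims and one contradiction: 2.0 / 3 = 0.67
print(consistency_score(["entailed", "entailed", "contradicted"]))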

Why It Matters in Production LLM and Agent Systems

Many evaluation failures involve subtle contradictions rather than outright hallucinations. The model says a feature is launching in Q2 when the spec says Q3. It claims a product supports SAML when the docs say it requires SCIM. It flips a numeric range. None of these are full fabrications — the model is in the right neighbourhood, just contradicting the canonical source. A simple groundedness or hallucination check often misses these because the response is broadly consistent with retrieved context. Factual consistency against the reference catches them every time.
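
The Q2/Q3 flip is exactly the kind of pair an NLI model flags as a contradiction. A quick illustration with an off-the-shelf open-source model (not the evaluator’s internal backbone):

# Open-source NLI illustration; FutureAGI's internal model may differ.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

# premise = the spec/reference; hypothesis = the model's claim
print(nli({"text": "The feature launches in Q3.",
           "text_pair": "The feature is launching in Q2."}))
# -> label CONTRADICTION: a subtle flip, not an outright fabrication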

The pain falls on regression-testing teams and benchmark owners. An ML engineer runs a prompt change against the golden dataset and the average score barely moves — but factual consistency drops 8 points because the model is now flipping minor numbers in 1 in 12 responses. A compliance team needs evidence that a medical-info bot does not contradict canonical guidance; “looks consistent” is not auditable. A research team running benchmark suites needs an NLI-grounded score so contradictions are penalised more than mere omissions.

In 2026-era multi-step agent stacks, factual consistency at intermediate steps is what catches a planner step contradicting an earlier tool-call output before the contradiction propagates. Step-level FC scoring tied to OpenTelemetry (OTel) spans makes the contradiction debuggable.
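
A minimal sketch of that wiring, assuming the OpenTelemetry Python API and a stand-in FC check; the span name, attribute key, threshold, and score_step helper are illustrative, not a FutureAGI convention:

from opentelemetry import trace

tracer = trace.get_tracer("agent")

def score_step(step_output: str, prior_output: str) -> float:
    """Hypothetical stand-in for a step-level factual consistency check."""
    # Placeholder: full credit only if the tool's figure is echoed verbatim.
    # A real check would run NLI between the two texts.
    return 1.0 if prior_output.split(":")[1].strip() in step_output else 0.0

with tracer.start_as_current_span("planner.step") as span:
    prior_tool_output = "inventory: 42 units"        # earlier tool-call result
    plan = "order more stock; plan assumes 40 units" # planner's claim
    score = score_step(plan, prior_tool_output)
    span.set_attribute("eval.factual_consistency", score)  # queryable per step
    if score < 0.8:  # assumed gating threshold
        span.add_event("fc_contradiction_detected")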

How FutureAGI Handles Factual Consistency

FutureAGI’s approach is to ship two factual consistency evaluators because the question can be asked at two granularities. fi.evals.FactualConsistency is the local-metric NLI scorer — it returns a 0-1 score and a per-claim breakdown of consistent, contradicted, neutral, and unverified claims. fi.evals.IsFactuallyConsistent is the cloud-template variant that returns a Pass/Fail gate with an explanation, useful for binary release gates. Both share the same NLI backbone that powers Faithfulness and RAGFaithfulness, so the contradiction-detection behaviour is consistent across the eval surface.

Concretely: a knowledge-bot team building a regression Dataset of 800 reference answers attaches FactualConsistency and re-runs nightly against new model versions. The evaluator returns a 0.87 mean — but the per-claim detail shows 11 contradictions across 800 responses, all on numeric ranges. The team uses the failing rows as a Persona set in simulate-sdk, runs targeted simulations, and tunes a post-guardrail that re-checks numeric claims against the reference before the response leaves the gateway. The same evaluator then gates merges in CI: the build fails if mean factual consistency drops more than 2 points from the baseline.
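
The merge gate itself is a few lines in CI: compute the mean over the golden set and compare to the stored baseline. A sketch, assuming nightly results and the baseline are stored as JSON (file names and field names are illustrative):

# Sketch of the merge gate: fail on a >2-point mean FC drop vs baseline.
import json, sys

with open("nightly_results.json") as f:
    scores = [row["factual_consistency"] for row in json.load(f)]
mean_fc = sum(scores) / len(scores)

with open("baseline.json") as f:
    baseline = json.load(f)["mean_factual_consistency"]

if mean_fc < baseline - 0.02:  # 2 points on a 0-1 scale
    sys.exit(f"FC regression: {mean_fc:.3f} vs baseline {baseline:.3f}")
print(f"FC OK: {mean_fc:.3f} (baseline {baseline:.3f})")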

We have found that NLI-based factual consistency outperforms simple string-matching against references for production traffic — it tolerates paraphrase and surfaces real disagreement instead of penalising legitimate rewrites.

How to Measure or Detect It

Factual consistency is directly measurable when reference answers exist. Wire up:

  • fi.evals.FactualConsistency — local-metric 0-1 score with per-claim contradicted/consistent/neutral breakdown.
  • fi.evals.IsFactuallyConsistent — cloud-template Pass/Fail gate with explanation.
  • fi.evals.ContradictionDetection — narrower metric that returns 1.0 if no contradictions, 0.0 otherwise.
  • fi.evals.GroundTruthMatch — companion metric for direct match against gold.
  • Contradiction count by release (dashboard) — the discrete signal that exposes precision-of-claim regressions.

Minimal Python:

from fi.evals import FactualConsistency

evaluator = FactualConsistency()

# "in Paris" is entailed by the reference; "completed in 1887" contradicts
# it (the reference says 1889), so that claim drags the score down.
result = evaluator.evaluate([{
    "response": "The Eiffel Tower is in Paris and was completed in 1887.",
    "reference": "The Eiffel Tower, located in Paris, was completed in 1889."
}])

# 0-1 score plus the per-claim reasoning for any contradicted claims
print(result.eval_results[0].output, result.eval_results[0].reason)
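
For a binary merge gate, the cloud-template variant is used the same way. The call shape below mirrors the example above and is assumed rather than confirmed, so check it against the SDK reference:

from fi.evals import IsFactuallyConsistent

# Pass/Fail variant; call shape assumed to mirror FactualConsistency.
gate = IsFactuallyConsistent()
result = gate.evaluate([{
    "response": "The Eiffel Tower is in Paris and was completed in 1887.",
    "reference": "The Eiffel Tower, located in Paris, was completed in 1889."
}])
print(result.eval_results[0].output)  # expected: Fail (1887 contradicts 1889)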

Common Mistakes

  • Confusing factual consistency with groundedness. Consistency compares to a reference; groundedness compares to retrieved context. The former is for benchmarks; the latter is for RAG.
  • Using exact-match instead of NLI. String matching penalises legitimate paraphrase; NLI tolerates rephrasing while still catching contradictions.
  • Ignoring the contradiction count and reading only the aggregate. A 0.92 score with three outright contradictions is a different problem than a 0.92 score with 30 neutral-but-not-quite-entailed claims.
  • Treating neutral claims as failures. Neutral means the reference does not affirm or deny the claim — that is often legitimate elaboration, not a regression.
  • Letting reference answers go stale. A reference written months ago against an old model behaviour drags consistency scores down once the product moves on. Version your references.

Frequently Asked Questions

What is factual consistency in LLM evaluation?

Factual consistency is a 0-1 metric that uses NLI to check whether each claim in a response is entailed by, contradicted by, or neutral to a reference answer. Contradictions are penalised heavily.

How is factual consistency different from groundedness?

Factual consistency compares response claims against a reference answer or known-correct text. Groundedness compares response claims against retrieved context. Use factual consistency for benchmarks; use groundedness for RAG.

How do you measure factual consistency?

FutureAGI's fi.evals.FactualConsistency extracts claims from the response, runs NLI entailment and contradiction checks against the reference, and returns a 0-1 score with per-claim status detail.