What Is Semantic Entropy?

An entropy measure computed over clusters of meaning-equivalent LLM samples, used as a reference-free hallucination-detection signal.

Semantic entropy is a hallucination-detection signal that measures how much an LLM’s responses to the same prompt vary in meaning, not in wording. Farquhar et al. introduced it in their 2024 Nature paper as a fix for token-level entropy, which conflates two distinct sources of uncertainty: the model being uncertain about phrasing and the model being uncertain about the answer. Semantic entropy first clusters multiple sampled responses by meaning, then computes Shannon entropy over the clusters. A model that says the same thing five different ways scores low; a model that flips between three contradictory answers scores high. That makes it a strong reference-free hallucination predictor.
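
To make the arithmetic concrete, here is a minimal sketch in plain Python of the entropy computation over meaning clusters, using the two scenarios above as cluster sizes:

import math

def cluster_entropy(sizes):
    # Shannon entropy (in nats) over the cluster-size distribution:
    # p(cluster) = cluster size / total samples.
    n = sum(sizes)
    return -sum((c / n) * math.log(c / n) for c in sizes)

# Five samples, four paraphrases of one answer plus one outlier -> [4, 1].
print(cluster_entropy([4, 1]))     # ~0.500 nats: mostly consistent
# Five samples split across three contradictory answers -> [2, 2, 1].
print(cluster_entropy([2, 2, 1]))  # ~1.055 nats: flipping between meanings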

Why It Matters in Production LLM and Agent Systems

Most production hallucination detection runs without ground truth — by definition, you don’t know the right answer at inference time. That makes signals derived from the model’s own behavior the most practical option. Token-level entropy and log-probabilities are noisy because they fire on every paraphrase. Semantic entropy is the cleanest version of “the model is uncertain about what to say, not just how to say it.”

The pain shows up where hallucination is costly and ground truth is unavailable. A medical-information chatbot cannot run a fact-check pipeline at every turn; it can run multiple samples and check semantic entropy. A long-form summarisation agent producing answers about events too recent for the eval set can score its own confidence via semantic entropy and flag low-confidence outputs for human review. A RAG pipeline retrieving from a frequently-updated knowledge base can use semantic entropy as a fast pre-filter before the slower Groundedness evaluator.

Engineering leaders feel the cost trade-off directly. Sampling N responses per query multiplies inference cost. Pairing semantic entropy with a routing policy — high-entropy traces get human review, low-entropy traces ship — keeps the cost manageable while the worst hallucinations get caught.

In 2026-era reasoning models, semantic entropy is also a direct measure of chain-of-thought stability. If five sampled chains arrive at different conclusions, the model isn’t reasoning — it’s guessing. Surfacing that distinction is the difference between an agent that fails loudly and one that fails confidently.

How FutureAGI Handles Semantic Entropy

FutureAGI’s approach is to expose semantic entropy as a composable signal in the evaluator stack rather than a one-off research metric. The standard recipe: sample N responses (typically 5–10) at non-zero temperature, cluster them via fi.evals.FactualConsistency (which performs bidirectional-entailment NLI), and compute Shannon entropy over the cluster-size distribution. fi.evals.HallucinationScore already folds a similar approach into a single composite, so most teams use it as a one-line drop-in.
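
A sketch of that recipe, with a hypothetical entails(a, b) predicate standing in for the pairwise NLI call (in the real stack, FactualConsistency plays this role; its exact API is not shown here):

import math

def entails(premise: str, hypothesis: str) -> bool:
    # Hypothetical placeholder for one direction of an NLI entailment check;
    # in the FutureAGI stack, fi.evals.FactualConsistency fills this role.
    raise NotImplementedError

def semantic_entropy(samples: list[str]) -> float:
    # Greedy meaning-clustering: a sample joins a cluster iff it and the
    # cluster's representative entail each other in both directions.
    clusters: list[list[str]] = []
    for s in samples:
        for cluster in clusters:
            rep = cluster[0]
            if entails(s, rep) and entails(rep, s):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    # Shannon entropy (nats) over the cluster-size distribution.
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)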

For real-time production use the cheaper variant is fi.evals.EmbeddingSimilarity clustering — embed each sample, cluster by cosine-similarity threshold, take entropy. It’s an order of magnitude cheaper than NLI and correlates well enough for fast pre-filtering. Engineers route traces with high embedding-cluster entropy through the slower NLI path for confirmation.
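
A sketch of the fast variant, assuming a generic embed(text) function that returns a vector; the 0.85 cosine threshold is illustrative, not prescribed:

import math
import numpy as np

def embedding_cluster_entropy(samples, embed, threshold=0.85):
    # `embed` is assumed to return a vector; the threshold is task-tuned.
    vecs = []
    for s in samples:
        v = np.asarray(embed(s), dtype=float)
        vecs.append(v / np.linalg.norm(v))  # unit-normalise so dot = cosine
    anchors, sizes = [], []
    for v in vecs:
        for i, a in enumerate(anchors):
            if float(a @ v) >= threshold:  # close enough: same meaning cluster
                sizes[i] += 1
                break
        else:  # no cluster close enough: start a new one
            anchors.append(v)
            sizes.append(1)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in sizes)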

Concretely: a legal-research agent on traceAI-openai samples 5 responses per query at temperature 0.5, computes embedding-similarity-cluster entropy, and sets a threshold at 1.2 nats. Anything above the threshold gets a HallucinationScore re-evaluation and, if confirmed, is routed back through the agent with an additional retrieval pass before the final answer goes to the user. Hallucination escape rate drops from 7.4% to 1.2% with a 1.6x increase in inference cost — a trade the team can defend to leadership.
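
A sketch of that routing loop, with hypothetical helpers (sample_agent, fast_entropy, rerun_with_extra_retrieval) standing in for pieces the pipeline already has; the score direction on HallucinationScore is an assumption to check against your evaluator:

from fi.evals import HallucinationScore

ENTROPY_THRESHOLD = 1.2   # nats, picked from the threshold-vs-precision curve
CONFIRM_THRESHOLD = 0.5   # hypothetical cutoff; check your score's convention

def answer(query: str) -> str:
    # sample_agent draws one response at temperature 0.5; fast_entropy is
    # the embedding-cluster entropy sketched above. Both are hypothetical.
    samples = [sample_agent(query) for _ in range(5)]
    if fast_entropy(samples) <= ENTROPY_THRESHOLD:
        return samples[0]  # low semantic entropy: ship directly
    # High entropy: confirm with the slower composite evaluator.
    result = HallucinationScore().evaluate(
        input=query, output=samples[0], additional_outputs=samples[1:]
    )
    if result.score >= CONFIRM_THRESHOLD:  # assuming higher = more suspect
        return rerun_with_extra_retrieval(query)  # extra retrieval pass
    return samples[0]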

How to Measure or Detect It

Treat semantic entropy as a reference-free uncertainty signal:

  • HallucinationScore: composite hallucination metric that uses semantic-entropy-style sampling internally; the simplest entry point.
  • FactualConsistency: NLI-based pairwise judge used to build meaning clusters for the rigorous variant.
  • EmbeddingSimilarity: cheap clustering basis for the production-fast variant.
  • Cluster-count distribution (dashboard signal): track how often N samples collapse to 1, 2, or 3+ meaning clusters; the shape predicts hallucination rate.
  • Threshold-vs-precision curve: empirical curve relating semantic-entropy threshold to flagged-output precision; lets you pick an operating point (see the sketch after this list).
  • Cost-per-flag: total sampling-and-NLI cost divided by hallucinations caught; the key budget signal.
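
Two of these reduce to short computations. A sketch, assuming a labelled audit set of (entropy, is_hallucination) pairs; the threshold grid and helper names are illustrative:

def threshold_precision_curve(scored, thresholds=(0.4, 0.6, 0.8, 1.0, 1.2, 1.4)):
    # scored: list of (entropy, is_hallucination) pairs from a labelled audit.
    curve = []
    for t in thresholds:
        flags = [hallu for entropy, hallu in scored if entropy >= t]
        if flags:
            curve.append((t, sum(flags) / len(flags)))  # precision of flags
    return curve

def cost_per_flag(total_sampling_and_nli_cost, hallucinations_caught):
    # The key budget signal: spend per hallucination actually caught.
    return total_sampling_and_nli_cost / hallucinations_caught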

Minimal Python:

from fi.evals import HallucinationScore

# `generate` is your own sampling call (e.g., a wrapped chat-completion
# request); draw several samples for the same prompt at non-zero temperature.
samples = [generate(prompt) for _ in range(5)]

hallu = HallucinationScore()

# Score the primary answer against the alternative samples; the evaluator
# folds semantic-entropy-style clustering into a single composite score.
result = hallu.evaluate(
    input=prompt,
    output=samples[0],
    additional_outputs=samples[1:],
)
print(result.score, result.reason)

Common Mistakes

  • Using token-level entropy as a stand-in. Different metric — it fires on phrasing variation that has nothing to do with hallucination.
  • Sampling too few responses. N = 2 gives essentially no signal; use N >= 5.
  • Sampling at temperature 0. Samples are identical, entropy is zero, signal is useless.
  • Using a single embedding-similarity threshold for clustering. What counts as “same meaning” depends on the task; tune the threshold per domain.
  • Treating semantic entropy as a probability. It’s a relative signal — high entropy means more uncertain, but the absolute number doesn’t map to a calibrated hallucination probability.

Frequently Asked Questions

What is semantic entropy?

Semantic entropy is the entropy of an LLM's sampled answers when those samples are clustered by meaning rather than by surface text — a reference-free signal for hallucination.

How is semantic entropy different from token-level entropy?

Token entropy is high whenever phrasing varies, even when meaning is constant. Semantic entropy collapses paraphrases into one cluster, so it's high only when the underlying meanings differ — which is the actual hallucination signal.

How do you compute semantic entropy in FutureAGI?

Sample N responses, cluster by NLI bidirectional entailment via FactualConsistency, then compute Shannon entropy over the cluster sizes — or use HallucinationScore which folds this into a composite metric.