
Evaluating LLM Confidence and Uncertainty (2026): The Calibration Methodology

Logprob aggregation, semantic entropy, Brier score, and Platt scaling. The 2026 methodology for calibrated LLM confidence scores you can actually trust.


An agent answers a customer question about prescription dosage. The model says it’s 0.92 confident. The answer is wrong. The customer follows the advice and ends up in the emergency room. The post-mortem reads like every other one: 0.9 verbalized confidence corresponded to 64% accuracy on the company’s labeled hold-out set. Nobody had measured calibration before launch. The escalation rule was “if confidence less than 0.5, escalate” — and the model almost never reported below 0.7.

LLM-stated confidence is theater. The model says “95% sure” and is wrong 30% of the time. The number is a generated token sequence, not an internal probability, and RLHF rewards the confident-sounding sequence over the hedged one. Teams that ship without measuring calibration get a uniform high-confidence stream that fails silently at the top of the curve.

The methodology that produces a confidence score you can actually trust has three legs: logprob aggregation when the provider exposes token probabilities, semantic entropy across sampled completions when it doesn’t, and Brier score against a held-out labeled set so you know how far the raw signal is from the truth. Without that triangulation, confidence is a number you have no reason to trust.

This guide covers each leg, the post-hoc calibration step (Platt scaling, isotonic regression) that maps a raw signal onto a usable probability, and the production patterns that put the whole loop on every span.

TL;DR: the calibrated confidence stack

| Signal | What it measures | When to use |
|---|---|---|
| Verbalized confidence | Model’s self-reported probability | One signal in an ensemble; never alone |
| Mean top-1 logprob | Average chosen-token probability over claim-bearing tokens | API exposes logprobs; structured or classification output |
| Top-k token entropy | Shannon entropy of next-token distribution | Same as above; high entropy flags hesitation |
| Semantic entropy | Entropy of meaning-clustered completions | API hides logprobs; open-ended generation |
| Brier score / ECE | Calibration gap on held-out labels | The metric that says how far off the raw signal is |
| Platt / isotonic | Post-hoc calibrator | Mapping raw signal to a usable probability |
| Classifier head | Trained on (input, was_correct) pairs | Workload-specific; highest accuracy with enough data |

The pipeline is the same shape every time. Collect a raw signal. Measure miscalibration on a held-out set. Fit a calibrator. Recalibrate when the distribution shifts.

Why verbalized confidence misleads

Three structural reasons the verbal number is biased.

Pretraining doesn’t teach calibration. The model is trained to predict the next token, not to predict whether the resulting answer is right. Internal token probabilities are reasonably well calibrated for the next-token task, at least before preference tuning. The verbal sentence “I’m 90% sure” is itself a sequence of tokens the model has chosen, conditioned on whatever pattern of confidence language showed up in training. It has no direct connection to the internal probability of the underlying claim.

RLHF amplifies overconfidence. Reward models prefer answers that sound decisive. The trained model learns that hedging is penalized in the reward signal, so it hedges less than it should. Tian et al. (2023) and follow-up work show that across GPT-4, Claude, and Llama-2 families, verbalized confidence is monotonically related to accuracy but systematically overshoots it in the 0.7-1.0 region: stated confidence runs higher than observed accuracy. The gap is largest exactly where it matters most: high-stakes high-confidence decisions.

Self-reports are post-hoc. Asking the model “how sure are you” generates a number after the answer is already on the page. The number is consistent with whatever rubric the prompt provided, not with an internal estimate of correctness. A different prompt produces a different scale.

The fix is not a better prompt. The fix is to replace the verbal number, or augment it, with a signal that has a defensible relationship to correctness. Two such signals are publicly available: logprobs and ensemble disagreement.

Logprob aggregation: when the API exposes them

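OpenAI’s chat completions endpoint returns the chosen token’s log-probability when you pass logprobs=True, and the top-k alternatives at each position when you also set top_logprobs=k. Most open-weight inference servers (vLLM, TGI, llama.cpp) expose them on request. Support on hosted commercial APIs varies (Anthropic’s standard chat endpoint does not return them), so check the provider’s current API reference. When you can see logprobs, two derived signals dominate.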

Mean top-1 logprob over claim-bearing tokens. Average the log-probability of the chosen token across the spans that carry semantic content (entities, numbers, names). Filler tokens (the, of, and) ride at near-1.0 probability regardless of correctness; including them washes out the signal. For structured outputs, restrict the aggregate to the parsed field tokens.

Top-k token entropy. Shannon entropy over the top-k token distribution at each step. High entropy means the model was choosing between several plausible tokens at that position. Spikes in entropy on claim tokens are strong correlates of error.

A working aggregator:

import math

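def claim_token_uncertainty(tokens, claim_mask):
    # tokens: list of {token, top_logprob, top_k=[(tok, logp), ...]}
    # claim_mask: list of bool, True for claim-bearing positions
    logps, entropies = [], []
    for tok, is_claim in zip(tokens, claim_mask):
        if not is_claim:
            continue  # skip filler tokens so they don't wash out the signal
        logps.append(tok["top_logprob"])
        # Renormalize the truncated top-k distribution before taking entropy;
        # this approximates the full-vocabulary entropy at that position.
        probs = [math.exp(lp) for _, lp in tok["top_k"]]
        z = sum(probs)
        probs = [p / z for p in probs]
        entropies.append(-sum(p * math.log(p + 1e-12) for p in probs))
    if not logps:
        # No claim tokens: return NaN rather than 0.0, because a mean
        # log-probability of 0.0 would read as "probability 1.0".
        nan = float("nan")
        return {"mean_logprob": nan, "mean_entropy": nan}
    return {
        "mean_logprob": sum(logps) / len(logps),
        "mean_entropy": sum(entropies) / len(entropies),
    }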
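
The aggregator expects a specific input shape. As a minimal sketch, the adapter below builds that shape from an OpenAI-style chat completion choice, whose logprobs.content items carry token, logprob, and top_logprobs fields. The names tokens_from_choice and digit_or_capital_mask are hypothetical, and the mask is a deliberately crude stand-in for a real claim tagger (an NER model or the parsed output schema would do better). The response object is mocked here so the example runs offline.

```python
from types import SimpleNamespace

def tokens_from_choice(choice):
    """Hypothetical adapter: OpenAI-style choice -> list of dicts for the aggregator."""
    tokens = []
    for item in choice.logprobs.content:
        tokens.append({
            "token": item.token,
            "top_logprob": item.logprob,
            "top_k": [(alt.token, alt.logprob) for alt in item.top_logprobs],
        })
    return tokens

def digit_or_capital_mask(tokens):
    """Crude heuristic (an assumption, not a real claim tagger): flag tokens that
    contain a digit, or that are capitalized and not sentence-initial."""
    mask = []
    for i, tok in enumerate(tokens):
        text = tok["token"].strip()
        has_digit = any(ch.isdigit() for ch in text)
        capitalized = i > 0 and text[:1].isupper()
        mask.append(has_digit or capitalized)
    return mask

# Mocked response fragment for "The dose is 50 mg" (values are illustrative).
def _item(token, logprob, alts):
    return SimpleNamespace(
        token=token,
        logprob=logprob,
        top_logprobs=[SimpleNamespace(token=t, logprob=lp) for t, lp in alts],
    )

choice = SimpleNamespace(logprobs=SimpleNamespace(content=[
    _item("The", -0.01, [("The", -0.01), ("A", -4.6)]),
    _item(" dose", -0.05, [(" dose", -0.05), (" amount", -3.1)]),
    _item(" is", -0.02, [(" is", -0.02), (" was", -4.0)]),
    _item(" 50", -0.9, [(" 50", -0.9), (" 25", -1.2)]),
    _item(" mg", -0.03, [(" mg", -0.03), (" ml", -3.6)]),
]))

tokens = tokens_from_choice(choice)
mask = digit_or_capital_mask(tokens)
```

Only the numeric token is flagged in this fragment, which is also the position where the two top candidates sit close together.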

Where logprob aggregation falls short: reasoning models. OpenAI’s o-series hides the chain-of-thought tokens. The visible logprobs cover the final-answer surface, which is often a confident summary of an internally uncertain reasoning trace. Same pattern for any model behind a “reasoning” or “thinking” wrapper. Treat logprob aggregation as one signal for these models, not the primary one.

Semantic entropy: when logprobs are hidden

For any API that hides logprobs (most hosted commercial endpoints in 2026), the workaround is to estimate uncertainty from sampling behavior. The naive version is lexical entropy: sample N completions at temperature 0.7 and measure how often the surface strings agree. The problem is that paraphrases register as disagreement: “Paris is the capital” and “The capital is Paris” look distinct lexically while expressing the same answer.
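
To make the paraphrase problem concrete, here is a minimal sketch of the naive lexical version. The function name lexical_entropy and the light normalization (lowercasing, whitespace collapsing) are assumptions chosen for illustration.

```python
import math
from collections import Counter

def lexical_entropy(samples):
    """Shannon entropy (in nats) over surface strings after light normalization."""
    normalized = [" ".join(s.lower().split()) for s in samples]
    counts = Counter(normalized)
    n = len(normalized)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# All three samples give the same answer, but two surface forms appear.
samples = [
    "Paris is the capital.",
    "The capital is Paris.",
    "Paris is the capital.",
]
h = lexical_entropy(samples)
```

The three samples agree on the answer, yet the entropy is well above zero because the strings differ.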

Semantic entropy (Farquhar et al., Nature 2024) fixes this. Sample N completions, cluster them by meaning, and compute entropy over the cluster proportions instead of the raw strings. Two ways to cluster:

  • Bidirectional NLI. Two completions belong in the same cluster if a small NLI model says each entails the other. Computationally heavier, more accurate on paraphrase-heavy tasks.
  • Embedding then cluster. Embed each sample with a sentence encoder; cluster with HDBSCAN or agglomerative clustering at a tuned threshold. Cheaper, looser on edge cases.
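
The bidirectional check can be sketched as a small wrapper around any one-directional entailment predicate. The wrapper name make_bidirectional_equivalence is hypothetical, and toy_entails is a word-overlap stand-in used only so the example runs without a model; in practice the predicate would call a real NLI classifier.

```python
from typing import Callable

def make_bidirectional_equivalence(
    entails: Callable[[str, str], bool],
) -> Callable[[str, str], bool]:
    """Two strings are equivalent only if each entails the other."""
    def equivalent(a: str, b: str) -> bool:
        return entails(a, b) and entails(b, a)
    return equivalent

# Toy stand-in for an NLI model (an assumption for illustration only):
# the premise "entails" the hypothesis if it contains every content word.
STOPWORDS = {"the", "is", "a", "of", "in"}

def _content_words(text: str) -> set:
    return {w.strip(".,").lower() for w in text.split()} - STOPWORDS

def toy_entails(premise: str, hypothesis: str) -> bool:
    return _content_words(hypothesis) <= _content_words(premise)

equivalent = make_bidirectional_equivalence(toy_entails)
same = equivalent("Paris is the capital.", "The capital is Paris.")
different = equivalent("Paris is the capital of France.", "Paris is the capital.")
```

The second pair entails in one direction only, so it stays in separate clusters; that asymmetry is the reason for checking both directions.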

A reference implementation:

import math
from collections import Counter

def semantic_entropy(samples, equivalent_fn):
    # samples: list of generated strings
    # equivalent_fn(a, b) -> bool: bidirectional NLI or embedding-distance check
    clusters = []
    for s in samples:
        placed = False
        for cluster in clusters:
            if equivalent_fn(s, cluster[0]):
                cluster.append(s)
                placed = True
                break
        if not placed:
            clusters.append([s])
    n = len(samples)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p + 1e-12) for p in probs)

Farquhar et al. report semantic entropy detecting hallucinations 10 to 15 points more accurately than lexical entropy or verbalized confidence across question-answering benchmarks. The cost is the N samples. For an agent at 50 calls per second with N=5, the inference cost is 5x baseline. Route only the high-stakes calls through it. The Agent Command Center makes the cost auditable per call via agentcc_cost_total and agentcc_tokens_total Prometheus counters.

Where semantic entropy falls short: tasks where multiple correct answers exist (open-ended generation, creative tasks). Entropy is high because the answer space is large, not because the model is uncertain. Anchor the metric on tasks with a defensible ground truth.

Brier score, ECE, and the reliability diagram

A raw signal is useful only if you know how miscalibrated it is. The headline metric is Brier score: the mean squared error between predicted probability p_i and the binary correctness label y_i, averaged over N samples. Lower is better. Brier decomposes into a calibration term (how close binned predictions are to observed accuracy) and a sharpness term (how often the model commits to extreme probabilities versus sitting at 0.5). A constantly-0.5 model has perfect calibration on a 50/50 task and terrible sharpness; you want both.

Expected Calibration Error (ECE) is the binned reliability gap: split predictions into B confidence bins (10 is standard), compute the absolute gap between the bin’s observed accuracy and its mean predicted probability, and take the sample-weighted average across bins. ECE below 0.05 on your labeled set is a reasonable production target.
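
Written out, the two definitions above look like this. The symbols p_i, y_i, N, and B follow the text; n_b (the number of samples in bin b), acc(b) (observed accuracy in the bin), and conf(b) (mean predicted probability in the bin) are notation introduced here.

```latex
\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N} (p_i - y_i)^2,
\qquad
\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{N}\,
  \bigl|\operatorname{acc}(b) - \operatorname{conf}(b)\bigr|
```

As a sanity check, a model that always predicts 0.5 on a perfectly balanced task has ECE = 0 but BS = 0.25, which is why the two metrics are reported together.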

The reliability diagram is the picture both metrics summarize: predicted confidence on the x-axis, observed accuracy on the y-axis, one point per bin. A well-calibrated signal sits on the diagonal. Most off-the-shelf LLMs sit well below it in the 0.7-1.0 region: observed accuracy falls short of stated confidence, which is the over-confidence the RLHF argument predicts.
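
A dependency-free sketch of all three computations, under the assumption of equal-width bins on [0, 1]. The function names are illustrative, and the sample data is made up to show an over-confident bin next to a roughly calibrated one.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probability and 0/1 correctness."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def reliability_bins(probs, labels, n_bins=10):
    """One (mean_confidence, accuracy, count) row per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # p == 1.0 would index past the end, so clamp it into the last bin
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    rows = []
    for items in bins:
        if items:
            conf = sum(p for p, _ in items) / len(items)
            acc = sum(y for _, y in items) / len(items)
            rows.append((conf, acc, len(items)))
    return rows

def expected_calibration_error(probs, labels, n_bins=10):
    """Sample-weighted mean absolute gap between accuracy and confidence."""
    rows = reliability_bins(probs, labels, n_bins)
    total = sum(count for _, _, count in rows)
    return sum(count / total * abs(acc - conf) for conf, acc, count in rows)

# Ten answers at 0.9 confidence (6 correct) and ten at 0.55 (5 correct).
probs = [0.9] * 10 + [0.55] * 10
labels = [1] * 6 + [0] * 4 + [1] * 5 + [0] * 5

rows = reliability_bins(probs, labels)
bs = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
```

The rows are the points of the reliability diagram: the 0.9 bin lands at 0.6 accuracy, well under the diagonal, while the 0.55 bin is close to it.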

Compute all three on a held-out labeled set of 500 to 2000 production-representative examples. Pull traces from production, not invented test cases. Skew toward the hardest 10% of inputs (edge cases, rare intents). Label by hand. This is the gold-set; everything downstream depends on it. The synthetic test data guide covers how to scale labeled coverage without losing signal.

Calibration: Platt scaling and isotonic regression

The raw signal (verbalized, logprob mean, semantic entropy) is rarely calibrated out of the box. Two standard post-hoc fixes turn a miscalibrated signal into a usable probability.

Platt scaling fits a two-parameter logistic regression (one slope, one intercept) on (raw_signal, was_correct) pairs from a calibration split:

import numpy as np
from sklearn.linear_model import LogisticRegression

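def fit_platt(raw_signals, labels):
    # Platt scaling: logistic regression with a slope and an intercept on one feature
    X = np.array(raw_signals).reshape(-1, 1)
    y = np.array(labels)
    # A large C effectively switches off scikit-learn's default L2 penalty,
    # which would otherwise shrink the slope on small calibration sets.
    return LogisticRegression(C=1e6).fit(X, y)

def apply_platt(model, raw_signal):
    # Column 1 of predict_proba is P(was_correct = 1)
    return model.predict_proba([[raw_signal]])[0, 1]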

Works well when miscalibration is monotonic and roughly linear in logit space. Cheap to fit, stable on small calibration sets (200+ examples), the default starting point.

Isotonic regression fits a non-parametric monotonic mapping. Handles weirder calibration curves where the bias is non-linear:

from sklearn.isotonic import IsotonicRegression

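def fit_isotonic(raw_signals, labels):
    # increasing="auto" lets the fit choose the direction, so signals where
    # higher means less certain (entropy) work as well as confidences.
    return IsotonicRegression(increasing="auto", out_of_bounds="clip").fit(raw_signals, labels)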

Needs more data (1000+ examples for stability) but reads off the data without parametric assumptions. The right choice when Platt under-corrects in the tails.

Fit on a calibration split. Validate on a held-out test split. Check that Brier and ECE actually improve (they usually do — Platt scaling commonly cuts ECE by 30 to 60 percent on verbalized confidence). Then deploy the calibrator as a thin wrapper on the raw signal.
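
The fit, validate, compare loop can be sketched end to end on synthetic data. Everything here is an assumption for illustration: the data is simulated so that true accuracy sits 0.25 below the raw confidence, and the calibrator is fit with a small hand-written Newton solver (fit_platt_newton, a hypothetical name) rather than scikit-learn, so the example runs with the standard library alone.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt_newton(scores, labels, iters=25):
    """Fit P(correct) = sigmoid(a * score + b) by Newton's method on the log-loss."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            w = p * (1.0 - p)
            g_a += (p - y) * s   # gradient w.r.t. slope
            g_b += p - y         # gradient w.r.t. intercept
            h_aa += w * s * s    # Hessian entries
            h_ab += w * s
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b

def brier(probs, labels):
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def ece(probs, labels, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    gap = 0.0
    for items in bins:
        if items:
            conf = sum(p for p, _ in items) / len(items)
            acc = sum(y for _, y in items) / len(items)
            gap += len(items) / len(labels) * abs(acc - conf)
    return gap

# Simulated over-confident signal: true accuracy is raw confidence minus 0.25.
rng = random.Random(0)
raw = [rng.uniform(0.6, 1.0) for _ in range(4000)]
labels = [1 if rng.random() < r - 0.25 else 0 for r in raw]

cal_raw, cal_y = raw[:2000], labels[:2000]      # calibration split
test_raw, test_y = raw[2000:], labels[2000:]    # held-out test split

a, b = fit_platt_newton(cal_raw, cal_y)
test_cal = [sigmoid(a * r + b) for r in test_raw]

brier_before, brier_after = brier(test_raw, test_y), brier(test_cal, test_y)
ece_before, ece_after = ece(test_raw, test_y), ece(test_cal, test_y)
```

Comparing the before and after values on the test split, never the calibration split, is the check that decides whether the calibrator ships.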

Recalibrate when the model version changes, when the prompt template changes substantively, or when the input distribution shifts. The calibrator is a contract between the model and the production traffic; the contract breaks when either side moves.
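
One way to make that contract explicit is to store what the calibrator was fit against and check it on a schedule. This is a hedged sketch: CalibratorContract, needs_recalibration, and the 0.02 Brier tolerance are all hypothetical choices, not part of any library named in this guide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CalibratorContract:
    """What the calibrator was fit against (hypothetical structure)."""
    model_version: str
    prompt_template_hash: str
    baseline_brier: float  # Brier score on the held-out test split at fit time

def brier(probs, labels):
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def needs_recalibration(contract, model_version, prompt_template_hash,
                        recent_probs, recent_labels, tolerance=0.02):
    """Return (should_refit, reason). Checks the cheap triggers first."""
    if model_version != contract.model_version:
        return True, "model version changed"
    if prompt_template_hash != contract.prompt_template_hash:
        return True, "prompt template changed"
    # Distribution shift shows up as a worse Brier score on recent labeled traffic.
    drift = brier(recent_probs, recent_labels) - contract.baseline_brier
    if drift > tolerance:
        return True, f"Brier score rose by {drift:.3f}"
    return False, "within tolerance"

contract = CalibratorContract("model-v1", "abc123", baseline_brier=0.20)

stable = needs_recalibration(contract, "model-v1", "abc123",
                             [0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
shifted = needs_recalibration(contract, "model-v1", "abc123",
                              [0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0])
upgraded = needs_recalibration(contract, "model-v2", "abc123", [], [])
```

The drift check needs freshly labeled traffic, so it is only as current as the labeling cadence behind it.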

Production patterns

Three pieces wire the whole loop into production. Span instrumentation that captures the raw signals, a labeled set that anchors calibration, and a gate that routes uncertain answers to escalation.

Instrumentation with traceAI

traceAI (Apache 2.0, OpenTelemetry-native) captures confidence signals as custom span attributes alongside the standard OTel GenAI fields. Auto-instrumentation across OpenAI, LangChain, Groq, Portkey, and Gemini means the LLM span exists; you only have to add the uncertainty attributes.

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor
from openai import OpenAI
from opentelemetry import trace

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="confidence-eval",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

tracer = trace.get_tracer(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# nli_equivalent, extract_tokens, mask_claims, and majority_answer are
# application-specific helpers you supply.
def answer_with_uncertainty(prompt, n_samples=5):
    with tracer.start_as_current_span("llm.answer_with_uncertainty") as span:
        samples = [
            client.chat.completions.create(
                model="gpt-4.1",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
                logprobs=True,
                top_logprobs=5,
            )
            for _ in range(n_samples)
        ]
        sem_entropy = semantic_entropy(
            [s.choices[0].message.content for s in samples],
            equivalent_fn=nli_equivalent,
        )
        logp_signal = claim_token_uncertainty(
            extract_tokens(samples[0]),
            claim_mask=mask_claims(samples[0]),
        )
        span.set_attribute("llm.semantic_entropy", sem_entropy)
        span.set_attribute("llm.mean_top_logprob", logp_signal["mean_logprob"])
        span.set_attribute("llm.mean_entropy", logp_signal["mean_entropy"])
        span.set_attribute("llm.n_samples", n_samples)
        return majority_answer(samples), sem_entropy, logp_signal

The four custom attributes ride the trace tree. Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) at register() time mean the same trace ingests cleanly into Phoenix or Traceloop without re-instrumenting.

Evaluating against the labeled set

The ai-evaluation SDK (Apache 2.0) ships the rubric surface. AnswerRefusal, TaskCompletion, and Groundedness are three of the 60+ EvalTemplate classes that map directly to “did the agent escalate when it should” and “was the answer correct against context.” CustomLLMJudge is the authoring primitive for calibration-specific rubrics on top of those.

from fi.evals import Evaluator
from fi.evals.templates import AnswerRefusal, TaskCompletion, Groundedness
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase

evaluator = Evaluator(fi_api_key=..., fi_secret_key=...)

calibration_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "CalibrationScore",
        "model": "gpt-4.1",
        "grading_criteria": """Given the input, the agent's answer, and the
stated confidence (0-1), return 1.0 if confidence above 0.7 and answer correct,
or confidence below 0.3 and answer wrong. Return 0.0 if confidence above 0.7
and answer wrong (over-confident wrong).""",
    },
)

results = evaluator.evaluate(
    eval_templates=[AnswerRefusal(), TaskCompletion(), Groundedness(), calibration_judge],
    inputs=[
        TestCase(
            input=row.input,
            output=row.output,
            context=row.context,
            metadata={
                "stated_confidence": row.stated_confidence,
                "was_correct": row.was_correct,
            },
        )
        for row in labeled_set
    ],
)

The Platform’s self-improving evaluators retune from production thumbs feedback at lower per-eval cost than Galileo Luna-2, so the calibration rubric ages with the product instead of drifting against last quarter’s distribution.

Deploying the uncertainty gate

A calibrated probability becomes an operating threshold. The threshold is a business decision (medical triage operates at a different point on the curve than internal chat); the calibration is engineering.

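def route(prompt):
    # combine, verbalized_confidence, and escalate_to_human are helpers you supply;
    # platt is the calibrator returned by fit_platt on the combined raw signal.
    answer, sem_ent, logp = answer_with_uncertainty(prompt)
    raw = combine(sem_ent, logp["mean_logprob"], verbalized_confidence(answer))
    p_correct = apply_platt(platt, raw)  # calibrated probability, not a class label
    if p_correct < 0.65:
        return escalate_to_human(prompt, reason=f"p_correct={p_correct:.2f}")
    return answer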
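
The 0.65 cut-off above is a placeholder. One hedged way to choose it is to scan candidate thresholds on a labeled validation set and take the lowest one whose auto-answered slice meets a target accuracy, which maximizes coverage among the thresholds that qualify. The function name pick_threshold, the 0.05 grid, and the sample data are assumptions for illustration.

```python
def pick_threshold(p_correct, labels, target_accuracy, candidates=None):
    """Lowest threshold whose answered slice (p >= t) meets the accuracy target.

    Returns a dict with threshold, coverage, and accuracy, or None if no
    candidate threshold reaches the target.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(0, 100, 5)]
    for t in sorted(candidates):
        kept = [y for p, y in zip(p_correct, labels) if p >= t]
        if not kept:
            continue
        accuracy = sum(kept) / len(kept)
        if accuracy >= target_accuracy:
            return {
                "threshold": t,
                "coverage": len(kept) / len(labels),  # share answered without escalation
                "accuracy": accuracy,
            }
    return None

# Calibrated probabilities and hand labels from a validation split (made up).
p_valid = [0.2, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
y_valid = [0, 0, 1, 0, 1, 1, 1, 1]

choice = pick_threshold(p_valid, y_valid, target_accuracy=0.8)
impossible = pick_threshold(p_valid, y_valid, target_accuracy=1.01)
```

The target accuracy is the business input; coverage is what it costs in escalations.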

Error Feed clusters the failures via HDBSCAN soft-clustering over ClickHouse-stored embeddings. A Sonnet 4.5 Judge agent writes the immediate_fix per cluster against a 5-category 30-subtype taxonomy. The fixes feed back into the Platform’s self-improving evaluators so calibration drift surfaces as a specific recurring failure mode, not a vague score that moved.

Where Future AGI’s eval stack fits

Three surfaces, one story.

  • ai-evaluation SDK (Apache 2.0). 60+ EvalTemplate classes covering AnswerRefusal, TaskCompletion, Groundedness, plus 20+ local heuristic metrics (BLEU, ROUGE, JSON schema, embedding similarity) at sub-10 ms with zero API cost. CustomLLMJudge is the authoring primitive for CalibrationScore, HedgeLanguageQuality, and any workload-specific calibration rubric.
  • Future AGI Platform. Self-improving evaluators retune from production thumbs feedback, classifier-backed scoring at lower per-eval cost than Galileo Luna-2, in-product agent authoring for the calibration loop. The threshold-tuning loop closes without re-authoring rubrics.
  • Error Feed (inside the eval stack). HDBSCAN over ClickHouse, Sonnet 4.5 Judge writes immediate_fix per cluster against the 5-category 30-subtype taxonomy. Linear ships today as the only routed integration.

traceAI (Apache 2.0) ships 50+ AI surfaces across four languages with auto-instrumentation on OpenAI, LangChain, Groq, Portkey, and Gemini. The Agent Command Center is the SOC 2 Type II, HIPAA, GDPR, and CCPA certified hosted runtime per futureagi.com/trust, with 100+ providers, 18+ built-in guardrail scanners plus 15 third-party adapters, and agentcc_cost_total / agentcc_tokens_total Prometheus counters that make the N-sample semantic entropy cost auditable per call.

Ready to put calibrated confidence under your own workload? Start with the ai-evaluation SDK quickstart, instrument logprobs and semantic entropy on your LLM spans via traceAI, and fit a Platt calibrator on a 500-example labeled set this week. The calibrated probability is the artifact — not the verbal number the model generated.

Three takeaways for 2026

  1. Verbalized confidence is one signal, never the primary one. Logprob aggregation when the API exposes them, semantic entropy when it doesn’t. Combine.
  2. Brier score and ECE on a held-out labeled set are the contract. Without the labeled set, you don’t know how far off the raw signal is, and threshold tuning is hopeless.
  3. Post-hoc calibration is cheap, and you have to recalibrate. Platt scaling on 200+ examples will commonly cut ECE by 30 to 60 percent on verbalized confidence. Refit when the model, prompt, or distribution shifts.


Frequently asked questions

Why is verbalized LLM confidence unreliable?
Verbalized confidence (the model saying 'I'm 95% sure') is a generated token sequence, not an internal probability. RLHF rewards confident-sounding answers, so the model learns to under-hedge. Across the standard calibration literature, frontier models report confidence of 0.85 to 0.95 on questions they answer correctly 60 to 70 percent of the time. The verbal number correlates weakly with accuracy and is systematically biased upward. Treat it as a useful signal in an ensemble, never as the primary score. The calibrated alternatives are logprob aggregation when the API exposes token probabilities, semantic entropy across sampled completions when it does not, and a small classifier trained on held-out (input, was-correct) pairs.
What is semantic entropy and when do I use it?
Semantic entropy is the Shannon entropy of an LLM's answer distribution after meaning-equivalent answers are grouped. Sample N completions at temperature 0.7, cluster them by semantic equivalence (a small NLI model or an embedding-then-cluster step), and compute entropy over the cluster proportions. High entropy means the model is internally uncertain in meaning, not just in surface form. Use it on any API that hides logprobs (Anthropic's standard chat, hosted OpenAI o-series, most commercial endpoints), and on any task where multiple surface forms express the same answer. Farquhar et al. (2024) report semantic entropy detects hallucinations 10 to 15 points more accurately than lexical entropy or self-reported confidence.
What is Brier score and why is it the right calibration metric?
Brier score is the mean squared error between a predicted probability and the binary correctness label. It decomposes cleanly into calibration (how well predicted probabilities match observed accuracy) and sharpness (how often the model commits to confident predictions versus sitting at 0.5). Lower is better. Compute it on a held-out labeled set of 500 to 2000 production-representative examples. Pair Brier with Expected Calibration Error (ECE), which bins predictions and measures the gap to accuracy per bin, and a reliability diagram, which plots confidence against accuracy. A well-calibrated model sits on the diagonal of the reliability diagram. ECE below 0.05 on a labeled set is a reasonable production target.
How do I calibrate an LLM's confidence after measuring miscalibration?
Two standard post-hoc calibrators. Platt scaling fits a two-parameter logistic regression (a slope and an intercept) on (raw_confidence, was_correct) pairs from a holdout set; it works well when the miscalibration is monotonic and roughly linear in logit space. Isotonic regression fits a non-parametric monotonic mapping and handles weirder miscalibration curves at the cost of more data (1000+ examples for stability). Temperature scaling is Platt's single-knob cousin, common for classification models. Fit on a calibration split (not the test split), check Brier and ECE improvement, then deploy the mapping as a thin wrapper on the raw signal. Recalibrate when the model, prompt, or distribution shifts.
Should I always use logprob aggregation when it's available?
Yes when the task is classification, structured output, or any decision that lives on specific claim-bearing tokens. The mean top-1 log-probability on entity, number, and name tokens correlates strongly with correctness. For open-ended generation, raw token logprobs are dominated by filler tokens (the, of, and) that the model is certain about regardless of whether the answer is right. Filter for claim-bearing tokens, or aggregate logprobs only over the parsed structured field. The exception: OpenAI's reasoning models (o-series) hide the reasoning logprobs, so logprob aggregation captures only the final-answer surface, which is often a high-confidence summary of an internally uncertain reasoning trace. Pair with semantic entropy.
How does Future AGI fit into a confidence evaluation workflow?
Three surfaces. traceAI (Apache 2.0, OpenTelemetry-native) captures uncertainty signals as custom span attributes (llm.mean_top_logprob, llm.semantic_entropy, llm.verbalized_confidence) alongside the standard OTel GenAI fields. The ai-evaluation SDK (Apache 2.0) ships AnswerRefusal, TaskCompletion, and Groundedness templates plus CustomLLMJudge for authoring calibration rubrics (CalibrationScore, HedgeLanguageQuality) on top of the labeled set. The Future AGI Platform layers self-improving evaluators that retune from production thumbs feedback at lower per-eval cost than Galileo Luna-2, and Error Feed clusters miscalibrated traces via HDBSCAN over ClickHouse and writes an immediate_fix per cluster so the calibration rubric ages with the product.