What Is Self-Consistency Evaluation?

An evaluation method that scores whether repeated LLM or agent runs converge on the same answer, reasoning path, or tool decision.

Self-consistency evaluation is an LLM evaluation method that checks whether repeated runs of the same prompt, task, or agent trajectory converge on the same result. It appears in eval pipelines, regression suites, and production trace review when non-deterministic output makes one sample too weak to trust. FutureAGI teams use it to compare final answers, reasoning steps, tool choices, or labels, then convert disagreement into a metric threshold for release gates and alerts.

In 2026, self-consistency evaluation has moved from “nice-to-have for chain-of-thought research” to “table stakes for any agent that touches money, health, or legal decisions.” Frontier reasoning models with extended thinking (GPT-5.x thinking, Claude Opus 4.7 extended thinking, Gemini 3 Deep Think) introduce more sampling variance under the hood than the headline temperature setting suggests, so one-shot evaluation underweights stability risk. On AIME 2025 and FrontierMath (Epoch AI, frontier ~2%), majority voting over 16 samples lifts solve rate 10-20 points over a single pass, which is direct evidence that single-sample evaluation systematically under-reports both capability and instability.

Why self-consistency evaluation matters in production LLM and agent systems

The failure mode is not always “the model is wrong.” Often it is worse: the model is right once, wrong twice, and persuasive every time. A customer-support assistant answers the same refund question with three different policy interpretations. A retrieval agent chooses different tools for identical cases because one intermediate step drifts. A coding agent passes a smoke test on the first run, then edits a different file on replay. If you only score one output, that instability looks like success.

The pain spreads across teams. Developers cannot reproduce bugs because the same input no longer fails. SREs see noisy user complaints without a clean exception, because HTTP status and latency are normal. Product managers see cohort-level churn after a prompt update. Compliance teams lose confidence in audits when regulated answers vary by sample, and in healthcare, finance, and legal AI, that is a deal-breaker.

Self-consistency matters more for 2026 multi-step pipelines. Agents plan, retrieve, call tools, validate structured output, and sometimes hand work to another agent over A2A. A final answer can match the expected label while the trajectory is unstable underneath. Unlike Ragas faithfulness, which checks whether an answer is supported by context, self-consistency asks whether independent samples converge on the same answer or decision. Both signals matter: grounded but unstable systems still create operational risk.

How FutureAGI handles self-consistency evaluation

FutureAGI’s approach is to treat self-consistency as a repeatable evaluation contract, not a notebook replay. The anchor is CustomEvaluation: engineers define how samples are grouped, what fields are compared, and what agreement threshold counts as a pass. The evaluator sits beside Groundedness or ToolSelectionAccuracy when consistency depends on reasoning, evidence, or tool choice.

A real workflow. A claims agent receives one user question and runs five sampled attempts at a fixed model version (Claude Opus 4.7, temperature 0.3 for the policy classifier and 0.7 for the response draft). The dataset stores input, sample_id, output, normalized_answer, tool_name, reasoning_summary, trace_id, and cohort. CustomEvaluation compares normalized final answers first, then checks whether required tool decisions agree. The metric returns an agreement score, a pass/fail label, and a reason such as “3 of 5 samples selected the wrong policy lookup tool.”
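The two-stage comparison in this workflow can be sketched in plain Python. This is a minimal illustration, not the internals of CustomEvaluation: the `Sample` class, the `score_sample_group` function, the 0.8 threshold, and the sample values are all hypothetical, and the rule that a group is only as consistent as its weakest field is an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical subset of the dataset fields named in the text.
    sample_id: int
    normalized_answer: str
    tool_name: str


def score_sample_group(samples, threshold=0.8):
    """Return (agreement, label, reason) for one input's repeated samples."""
    n = len(samples)
    # Step 1: agreement on the normalized final answer.
    answer, answer_votes = Counter(s.normalized_answer for s in samples).most_common(1)[0]
    # Step 2: agreement on the tool decision.
    tool, tool_votes = Counter(s.tool_name for s in samples).most_common(1)[0]
    # Assumption: the group is only as consistent as its weakest compared field.
    agreement = min(answer_votes, tool_votes) / n
    label = "pass" if agreement >= threshold else "fail"
    reason = (
        f"{answer_votes} of {n} samples agreed on answer '{answer}'; "
        f"{tool_votes} of {n} selected tool '{tool}'."
    )
    return agreement, label, reason


group = [
    Sample(1, "refund_approved", "policy_lookup"),
    Sample(2, "refund_approved", "policy_lookup"),
    Sample(3, "refund_approved", "order_history"),
    Sample(4, "refund_approved", "policy_lookup"),
    Sample(5, "refund_denied", "order_history"),
]
agreement, label, reason = score_sample_group(group)
print(agreement, label, reason)
```

Here four of five answers agree but only three of five tool choices do, so the group fails on tool agreement even though the majority answer looks stable. Agreement alone does not say which tool or answer is correct; that still needs a reference label or a check such as Groundedness.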

For live systems, the same check runs on trace cohorts from traceAI-langchain. Fields such as llm.token_count.prompt, agent.trajectory.step, and trace_id help the engineer separate prompt-length instability from planner instability. If agreement drops below 0.85 for the enterprise_refund cohort, the team blocks the prompt release, sends disagreement clusters to annotation, and adds a regression eval before the next deploy. Agent Command Center can also route high-disagreement traffic through model fallback, but the eval result stays the auditable reason for that action.
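The cohort gate described above might look like the sketch below. It assumes each trace group has already been scored and stored as a row with `cohort` and `agreement` keys; those keys, the function names, and the `smb_refund` cohort are hypothetical, and averaging per-group scores is only one reasonable way to aggregate.

```python
from collections import defaultdict

AGREEMENT_THRESHOLD = 0.85  # the threshold quoted in the text


def cohort_agreement(results):
    """Average the per-group agreement scores within each cohort."""
    by_cohort = defaultdict(list)
    for row in results:
        by_cohort[row["cohort"]].append(row["agreement"])
    return {cohort: sum(scores) / len(scores) for cohort, scores in by_cohort.items()}


def blocked_cohorts(results, threshold=AGREEMENT_THRESHOLD):
    """Return the cohorts whose mean agreement falls below the threshold."""
    means = cohort_agreement(results)
    return sorted(cohort for cohort, mean in means.items() if mean < threshold)


# Illustrative rows only; real rows would come from scored trace cohorts.
results = [
    {"trace_id": "t1", "cohort": "enterprise_refund", "agreement": 0.6},
    {"trace_id": "t2", "cohort": "enterprise_refund", "agreement": 1.0},
    {"trace_id": "t3", "cohort": "smb_refund", "agreement": 1.0},
    {"trace_id": "t4", "cohort": "smb_refund", "agreement": 0.8},
]
blocked = blocked_cohorts(results)
release_allowed = not blocked
print(blocked, release_allowed)
```

Gating per cohort rather than on the global mean matters here: the overall average of these four rows is 0.85, which would pass, while the enterprise_refund cohort on its own does not.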

Where disagreement comes from

| Variance source | Symptom in trace | Where to look first |
| --- | --- | --- |
| Decoding randomness | Different wording, same answer | Lower temperature, set seed where available |
| Tool-call divergence | Different tools selected | ToolSelectionAccuracy, planner prompt |
| Retrieval order | Different chunks ranked top | Retriever determinism, reranker version |
| Reasoning instability | Same answer, different reasoning trace | Reasoning-mode budget, planner prompt |
| Memory contamination | Same prompt, different history | Memory writes, session boundaries |
| Model snapshot update | Quiet drift after vendor patch | Pin model snapshot or accept variance |

We’ve found that the first two rows account for most production self-consistency failures we see in customer audits.
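Because those two sources dominate, a first-pass triage can check for them before anything else. The sketch below is a rough heuristic under stated assumptions: the `triage_disagreement` function and the `output`, `normalized_answer`, and `tool_name` keys are hypothetical, and the other table rows are left as a single catch-all because they need trace data this example does not model.

```python
def triage_disagreement(runs):
    """Guess the first variance source to inspect for one sample group."""
    tools = {r["tool_name"] for r in runs}
    answers = {r["normalized_answer"] for r in runs}
    raw_outputs = {r["output"] for r in runs}

    if len(tools) > 1:
        # Different tools selected for the same input.
        return "tool-call divergence"
    if len(answers) == 1 and len(raw_outputs) > 1:
        # Different wording, same normalized answer.
        return "decoding randomness"
    if len(answers) > 1:
        return "unclassified: inspect retrieval, reasoning, memory, snapshot"
    return "no disagreement"


wording_only = [
    {"output": "Yes, you qualify.", "normalized_answer": "approved", "tool_name": "policy_lookup"},
    {"output": "You are eligible.", "normalized_answer": "approved", "tool_name": "policy_lookup"},
]
tool_split = [
    {"output": "Yes, you qualify.", "normalized_answer": "approved", "tool_name": "policy_lookup"},
    {"output": "Yes, you qualify.", "normalized_answer": "approved", "tool_name": "order_history"},
]
print(triage_disagreement(wording_only))
print(triage_disagreement(tool_split))
```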

How to measure self-consistency evaluation

Measure agreement across repeated samples, then inspect why disagreement happened:

  • CustomEvaluation score: returns a configured agreement score, label, and reason for each sample group.
  • Final-answer agreement: percentage of sampled outputs that normalize to the same answer or class.
  • Trajectory agreement: compare agent.trajectory.step, selected tools, and required validation steps across runs.
  • Dashboard signal: alert on self-consistency eval-fail-rate-by-cohort, not only the global average.
  • Variance by route: compare agreement by prompt version, model route, temperature setting, and retriever version.
  • User-feedback proxy: watch thumbs-down rate, escalation rate, and duplicate-ticket reopening after consistency drops.
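Trajectory agreement is the least obvious of these signals to compute, so here is one possible definition: the fraction of run pairs whose ordered step sequences match exactly. The `trajectory_agreement` function and the step names are hypothetical, and exact-match pairing is a deliberately strict assumption; a looser variant could compare only the selected tools.

```python
from itertools import combinations


def trajectory_agreement(trajectories):
    """Fraction of run pairs whose ordered step sequences match exactly."""
    pairs = list(combinations(trajectories, 2))
    if not pairs:
        return 1.0  # a single run cannot disagree with itself
    matching = sum(1 for a, b in pairs if a == b)
    return matching / len(pairs)


# Four replays of the same input; the third takes a different tool path.
runs = [
    ["plan", "policy_lookup", "validate", "respond"],
    ["plan", "policy_lookup", "validate", "respond"],
    ["plan", "order_history", "validate", "respond"],
    ["plan", "policy_lookup", "validate", "respond"],
]
print(trajectory_agreement(runs))
```

One divergent run out of four cuts pairwise agreement to half, because it breaks every pair it belongs to. That sensitivity is useful for catching unstable trajectories behind a stable final answer.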

Minimal Python:

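from fi.evals import CustomEvaluation

# Placeholder inputs for illustration only; replace with your own dataset rows.
question = "Can I get a refund on an annual plan after 45 days?"
samples = [
    "No. Annual plans are refundable within 30 days only.",
    "No. The 30-day refund window has passed.",
    "Yes. Annual plans can be refunded at any time.",
]
policy_doc = "Annual plans are refundable within 30 days of purchase."

# Define the agreement rubric once so every sample group is scored the same way.
consistency = CustomEvaluation(
    name="refund_answer_consistency_v2",
    rubric=(
        "Compare repeated outputs for the same input. "
        "Score 1-5 on agreement: 5=identical decision and tool; "
        "3=same final answer, different tool path; 1=different decisions."
    ),
)
# Score all samples for one input together, with the policy text as context.
result = consistency.evaluate(input=question, output=samples, context=policy_doc)
print(result.score, result.label, result.reason)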

Common mistakes

  • Sampling only twice. Two outputs can agree by luck; use enough samples to expose variance on high-risk cohorts. We default to 5 for free-form, 10 for high-stakes.
  • Comparing raw text only. Normalize labels, citations, tool names, and structured fields before scoring disagreement.
  • Ignoring correct minority answers. A 4-of-5 vote can still be wrong; pair consistency with Groundedness or human review.
  • Changing temperature during the eval. Keep sampling settings fixed, or the metric measures configuration noise.
  • Treating instability as a model-only issue. Retriever changes, tool timeouts, memory writes, and prompt length can also cause divergence.
  • Skipping reasoning-mode budget. Frontier extended-thinking modes have their own internal variance; if you do not cap thinking tokens, you cannot reproduce.
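The first mistake in this list can be made concrete with a little arithmetic. The sketch below assumes a simplified setting that is not from the source: a task with only two possible answers, independent samples, and a hypothetical unstable model that gives its more likely answer 70% of the time. The `chance_all_agree` function is illustrative.

```python
def chance_all_agree(p_modal, n_samples):
    """Probability that n independent samples are unanimous on a two-answer task.

    p_modal is the per-sample probability of the more likely answer.
    """
    return p_modal ** n_samples + (1 - p_modal) ** n_samples


for n in (2, 5, 10):
    print(n, round(chance_all_agree(0.7, n), 2))
```

Under these assumptions, two samples agree by chance about 58% of the time, five samples about 17%, and ten samples about 3%. A unanimous pair says little about stability, while unanimous agreement across five or ten samples is much harder to reach by luck.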

Frequently Asked Questions

What is self-consistency evaluation?

Self-consistency evaluation measures whether repeated LLM or agent runs converge on the same answer, reasoning path, or tool decision. FutureAGI implements it with fi.evals.CustomEvaluation and tracks the score across datasets or traces.

How is self-consistency evaluation different from self-consistency prompting?

Self-consistency prompting is a generation strategy that samples multiple reasoning paths and votes on an answer. Self-consistency evaluation is the measurement layer that scores how much those samples agree and whether the agreement is acceptable.

How do you measure self-consistency evaluation?

Run multiple samples for the same input, compare final answers and key decisions, then score agreement with fi.evals.CustomEvaluation. Track pass rate, variance, reason codes, and eval-fail-rate-by-cohort before using the metric as a release gate.