Evaluating LLM Self-Reflection Loops: The 3 Metrics That Matter (2026)

Self-reflection loops sometimes improve outputs and sometimes destroy them. Three metrics tell you which: the pre-vs-post delta, the over-correction rate, and cost-per-improvement.

March 3, 2026

Updated May 20, 2026

13 min read

llm-evaluation self-reflection reflexion critic self-consistency agent-evaluation llm-observability 2026

Most production reflection loops were switched on because Reflexion looked good in a paper and the framework shipped a one-line flag. Almost none have been evaluated on the only question that matters: when this loop fires on real traffic, does it make outputs better, worse, or the same, and at what price? Average accuracy is a confident answer to the wrong question. It hides the inputs the loop just degraded, the cost it added on routes it never helped, and the convergence failures the user sees as a spinner.

The opinion this post earns: self-reflection sometimes improves outputs and sometimes destroys them. The three metrics that turn reflection from hope into engineering are the pre-vs-post delta per question, the over-correction rate, and the cost-per-improvement. A loop that lifts mean accuracy by 4 points while quietly regressing 8 percent of correct answers and tripling latency looks better on the headline than a loop that lifts 2 points cleanly. They are very different systems, and the headline number alone cannot tell you which one to ship.

This guide walks the three metrics, the task-type matrix that decides where reflection earns its keep, the traceAI span shape, and the FAGI surface that closes the loop from failure cluster back to optimized prompt.

TL;DR: three metrics, one decision

Metric	What it measures	Failure if missing
Pre-vs-post delta per question	Paired score on the draft and the refined output, split by baseline-correct vs baseline-wrong	Mean accuracy hides every drift on the happy path
Over-correction rate	Percentage of already-correct cases the loop degraded	Ship a regression at scale on traffic the dataset under-represents
Cost-per-improvement	Tokens and seconds per point of quality gained, by round	Round budget is set by literature, not by your route’s marginal curve

Adjacent axes (reflection accuracy, convergence behavior, latency overhead) all collapse into one of the three. Reflection accuracy is the upstream cause of over-correction. Convergence is the downstream cause of cost-per-improvement blowing up. Latency is a unit of the cost number. Three metrics, one decision: turn the loop on, turn it off, or gate it by route.

Why average accuracy is the wrong unit

The first sin in reflection eval is reporting a single mean. Averaging treats a 10-point lift on a wrong answer and a 6-point drop on a correct one as moves of the same shape. One is the loop doing the job; the other is the loop creating a user-visible regression. Sum and divide and both disappear.

Three things tend to break once you look past the mean. The reflector hallucinates its own critique with the same blind spots that produced the bad answer; the refiner rewrites under a confident wrong critique. The loop thrashes on hard inputs that don’t have a stable answer, burning the round budget while the user watches a spinner. Cost piles up on routes where the lift is fractional, doubling the bill for a 0.3-point move on the headline. All three are invisible to a mean. All three are obvious the moment you pair the pre-score with the post-score and look at the distribution.

This is the same shape of bug as the "agent passes evals, fails production" pattern: a system aces a benchmark and regresses on real traffic because the dataset wasn’t asked to expose the regression. The fix is the same. Score in a way that makes the regression visible.

Metric 1: pre-vs-post delta per question

The right unit is paired scoring. Same input, two outputs (pre-reflection draft, post-reflection committed answer), same rubric, per-case delta. Three numbers fall out.

Lift on wrong-baseline cases. Subset to cases the pre-score flagged as bad. How much did the loop improve them? The value the loop adds.

Drift on right-baseline cases. Subset to cases the pre-score flagged as already good. How often did the loop make them worse? The over-correction tax.

Net per-case delta. Average across the dataset, weighted by your traffic shape, not the curated benchmark mix. The headline, after lift and drift have already been disclosed.

A loop that lifts wrong cases by 12 points but drifts 8 percent of right cases is a different system from one that lifts 6 points cleanly. The first is a coin flip with extra steps; the second is a real improvement. Depending on the mix of right and wrong baselines in the dataset, the mean alone can read them the same way.
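The three numbers above can be computed from nothing more than a list of paired scores. The following is a minimal sketch, not part of any SDK: `PairedCase`, `paired_report`, the 0-100 score scale, and the `pass_mark` of 70 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PairedCase:
    pre: float            # rubric score on the pre-reflection draft (assumed 0-100)
    post: float           # rubric score on the post-reflection answer (assumed 0-100)
    weight: float = 1.0   # share of production traffic this case stands for


def paired_report(cases, pass_mark=70.0):
    """Split by baseline correctness, then report lift, drift, and net delta."""
    if not cases:
        raise ValueError("need at least one paired case")
    wrong = [c for c in cases if c.pre < pass_mark]
    right = [c for c in cases if c.pre >= pass_mark]

    # Lift: mean score change on cases the draft got wrong.
    lift = mean(c.post - c.pre for c in wrong) if wrong else 0.0
    # Drift: share of already-good cases that scored lower after the loop.
    drift = sum(c.post < c.pre for c in right) / len(right) if right else 0.0
    # Net: traffic-weighted mean delta over the whole set.
    total_weight = sum(c.weight for c in cases)
    net = sum(c.weight * (c.post - c.pre) for c in cases) / total_weight
    return {"lift_on_wrong": lift, "drift_rate_on_right": drift, "net_delta": net}


cases = [
    PairedCase(pre=40, post=80),   # wrong draft, fixed by the loop
    PairedCase(pre=50, post=50),   # wrong draft, unchanged
    PairedCase(pre=90, post=60),   # right draft, degraded by the loop
    PairedCase(pre=95, post=95),   # right draft, untouched
]
report = paired_report(cases)
```

On this toy set the net delta is a modest positive number even though half of the right-baseline cases drifted, which is exactly what the split is meant to expose.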

Build the dataset so the subsets actually exist. A set weighted entirely toward known failures (the natural choice when you’re hunting lift) cannot measure drift, because there are no right-baseline cases to drift on. Skew toward representative traffic with both already-correct and known-wrong cases inside, 200 to 500 per route as a starting size. The deterministic versus LLM-judge evaluation split keeps the paired scoring cheap: deterministic checks pre and post, LLM judge only on cases where the deterministic results disagree.

Metric 2: over-correction rate

Over-correction is the silent killer: the rate at which the loop degrades an already-correct answer. The dataset most teams ship is weighted toward known failures (correct for measuring lift, wrong for catching drift), and the right-baseline cases that get rewritten by the loop disappear into the mean.

The mechanic upstream of over-correction is the reflector’s false-positive rate. The reflector makes a binary call per round: this is wrong, or this is fine. A false positive flags a correct answer as wrong, the refiner rewrites it, and the post-score drops. A false negative flags a wrong answer as fine, the loop commits the bad output, and the post-score never moves. Both rates move with the reflector prompt. A paranoid reflector reduces false negatives and inflates false positives; a conservative reflector goes the other way. The right operating point is route-specific, and you cannot find it without measuring both rates against ground truth.
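Measuring both rates needs only the ground-truth label for each draft and the reflector's verdict on it. A small sketch, assuming you have already logged those two booleans per case (the function name and record shape are hypothetical):

```python
def reflector_error_rates(records):
    """records: iterable of (draft_is_correct, reflector_flagged_wrong) booleans."""
    false_pos = false_neg = n_correct = n_wrong = 0
    for draft_is_correct, flagged_wrong in records:
        if draft_is_correct:
            n_correct += 1
            false_pos += int(flagged_wrong)       # correct draft flagged as wrong
        else:
            n_wrong += 1
            false_neg += int(not flagged_wrong)   # wrong draft waved through as fine
    return {
        "false_positive_rate": false_pos / n_correct if n_correct else 0.0,
        "false_negative_rate": false_neg / n_wrong if n_wrong else 0.0,
    }


records = [
    (True, False), (True, True), (True, False), (True, False),  # correct drafts
    (False, True), (False, False),                              # wrong drafts
]
rates = reflector_error_rates(records)
```

Re-running this after each reflector prompt change shows which way the operating point moved on that route.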

Three mitigations that ship in 2026:

Pre-post score gate. Run the rubric on both outputs. If the post-score drops below the pre-score by more than a threshold, roll back. Cheap, deterministic, catches over-correction at runtime even when the eval missed it.
Tighter reflector prompts. Vague rubrics (“is this answer good?”) drive false positives. Spell out what wrong means for the route.
External critic. Switch from self-critique to CRITIC against a real retrieval or tool check. An external verifier grades the answer instead of the reflector grading itself.
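The first mitigation, the pre-post score gate, is small enough to sketch in full. This is an illustration under assumptions: `gate`, `keyword_score`, and the `max_drop` parameter are hypothetical names, and the toy keyword scorer stands in for whatever deterministic rubric the route already uses.

```python
def gate(draft, refined, score_fn, max_drop=0.0):
    """Return (output_to_commit, rolled_back).

    Rolls back to the draft when the refined output scores lower than the
    draft by more than max_drop.
    """
    pre, post = score_fn(draft), score_fn(refined)
    if pre - post > max_drop:
        return draft, True
    return refined, False


def keyword_score(text, required=("refund", "30 days")):
    """Toy deterministic rubric: fraction of required phrases present."""
    return sum(phrase in text for phrase in required) / len(required)


draft = "a refund is available within 30 days of purchase"
refined = "a refund may be available in some cases"  # hedged, lost the specific
committed, rolled_back = gate(draft, refined, keyword_score)
```

Here the refined answer dropped the "30 days" detail, so the gate commits the original draft. `max_drop` is the per-route threshold: zero rolls back on any drop, a larger value tolerates small regressions.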

Build OverCorrectionRisk as a CustomLLMJudge rubric and run it on the right-baseline subset every regression cycle. The number wants to be near zero. Anything above a few percent on a high-volume route ships a regression at scale.

Metric 3: cost-per-improvement

Every reflection round is a full LLM call. Every round is a sequential round-trip. Both compound.

Cost per request is the sum of the pre-call cost, every reflector-call cost, and every refiner-call cost. A two-round Reflexion loop on a 4K-token context typically lands at 3 to 5 times the single-shot bill. Self-consistency with N=5 samples lands near 5x by construction. Latency is worse than cost because rounds cannot parallelize. A 1.5-second base call becomes a 4 to 5-second loop on two rounds.

The right reporting unit is marginal quality-per-additional-round. Round 1 might lift the mean score by 4 points at 1.8x cost. Round 2 might lift by 0.5 points at 3.2x. Round 3 is usually negative on both axes. Plot the curve and the round budget picks itself: stop where marginal lift drops below the marginal cost you’re willing to pay for that route.
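Reading the round budget off the curve can be sketched as follows. The first two points reuse the illustrative figures above (4 points at 1.8x, a further 0.5 points at 3.2x); the round-3 point and the `max_cost_per_point` budgets are invented for the example, and the cost multiples are read as cumulative, which is an assumption.

```python
def pick_round_budget(curve, max_cost_per_point):
    """curve[i] = (cumulative_lift_points, cumulative_cost_multiple) after round i+1.

    Keeps adding rounds until one has non-positive marginal lift or costs more
    per marginal point than the route is willing to pay.
    """
    budget = 0
    prev_lift, prev_cost = 0.0, 1.0   # round 0 is the single-shot baseline at 1.0x
    for lift, cost in curve:
        d_lift, d_cost = lift - prev_lift, cost - prev_cost
        if d_lift <= 0 or d_cost / d_lift > max_cost_per_point:
            break
        budget += 1
        prev_lift, prev_cost = lift, cost
    return budget


# Round 1: +4.0 points at 1.8x. Round 2: +0.5 more at 3.2x. Round 3: regresses.
curve = [(4.0, 1.8), (4.5, 3.2), (4.3, 4.6)]
strict_budget = pick_round_budget(curve, max_cost_per_point=0.5)
loose_budget = pick_round_budget(curve, max_cost_per_point=3.0)
```

A route that will pay at most 0.5x of single-shot cost per point stops after one round; a route that tolerates 3.0x per point keeps the second round, and neither keeps the third.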

The Agent Command Center gateway returns per-call cost and latency on every response:

x-prism-cost (USD)
x-prism-latency-ms
x-prism-model-used
x-prism-fallback-used
x-prism-routing-strategy

Wire the loop through the gateway and the cost lands on every round. No separate ledger. Sum across rounds for cost-per-request; divide by the per-case quality delta for cost-per-improvement. The wider story is in the AI agent cost optimization observability playbook.
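The sum-and-divide step might look like the sketch below. It assumes you have collected the response headers for each call in the loop as plain dictionaries with lower-case keys, and that the `x-prism-cost` and `x-prism-latency-ms` values parse as plain numbers; the function names are hypothetical.

```python
def loop_cost(call_headers):
    """Sum cost (USD) and latency (ms) over every call in one reflection loop."""
    cost = sum(float(h["x-prism-cost"]) for h in call_headers)
    # Rounds run sequentially, so wall-clock latency adds up as well.
    latency_ms = sum(float(h["x-prism-latency-ms"]) for h in call_headers)
    return cost, latency_ms


def cost_per_improvement(total_cost, pre_score, post_score):
    """USD per point of quality gained; None when the loop did not improve the case."""
    delta = post_score - pre_score
    if delta <= 0:
        return None   # no improvement to pay for: count the spend as pure overhead
    return total_cost / delta


call_headers = [
    {"x-prism-cost": "0.004", "x-prism-latency-ms": "1500"},  # draft
    {"x-prism-cost": "0.001", "x-prism-latency-ms": "900"},   # reflector
    {"x-prism-cost": "0.003", "x-prism-latency-ms": "1600"},  # refiner
]
total_cost, total_latency_ms = loop_cost(call_headers)
usd_per_point = cost_per_improvement(total_cost, pre_score=60, post_score=80)
```

Returning `None` for flat or negative deltas keeps those cases out of the ratio; they are better reported separately as spend with no lift.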

Where reflection helps and where it hurts

Most production reflection loops should run on a subset of traffic, not all of it. The decision is a function of task type, latency budget, and whether external verification exists.

Task type	Reflection helps?	Pattern	Why or why not
Code generation with unit tests	Yes	CRITIC	External verifier provides ground truth; refiner has signal to act on
Multi-step reasoning (math, planning)	Yes	Self-consistency or chain-of-self-correction	Step-level errors are cheaper to catch than avoid; majority vote dampens one-off slips
RAG factual answers	Often	CRITIC against retrieval	External Groundedness check beats self-critique; loop only fires when the check fails
High-volume cheap chat	No	None	Marginal lift is fractional; overhead never pays back
Real-time voice / in-call assistants	No	None	Sequential round-trip cost is structural; latency budget kills the loop before quality matters
Outputs already cascade-evaluated	No	None	Loop pays for verification a separate judge stack already covers
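The matrix can be turned into a route gate with a few lines of code. The sketch below is one possible encoding, not a shipped feature: the route labels, the function name, and the rule that a CRITIC route without a verifier gets no loop are assumptions layered on top of the table.

```python
from typing import Optional

# One entry per row of the matrix; the keys are hypothetical route labels.
PATTERN_BY_TASK = {
    "code_with_tests": "CRITIC",
    "multi_step_reasoning": "self-consistency",
    "rag_factual": "CRITIC",
    "cheap_chat": None,
    "realtime_voice": None,
    "cascade_evaluated": None,
}


def choose_pattern(task_type: str, latency_budget_ms: float,
                   expected_loop_ms: float,
                   has_external_verifier: bool) -> Optional[str]:
    """Return the reflection pattern for a route, or None to leave the loop off."""
    pattern = PATTERN_BY_TASK.get(task_type)
    if pattern is None:
        return None                      # the matrix says reflection does not pay here
    if expected_loop_ms > latency_budget_ms:
        return None                      # the loop would blow the latency budget
    if pattern == "CRITIC" and not has_external_verifier:
        return None                      # CRITIC needs a real check to grade against
    return pattern
```

Unknown task types fall through to `None`, so a new route gets no loop until someone has measured it.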

When reflection does belong on the route, there are five patterns to choose between.

Single-shot critique. One pass: draft, critique, rewrite, commit. Cheap and bounded. Common failure: the critic rewrites a correct answer with extra hedging, dropping specificity without changing the underlying fact.

Reflexion. Iterative critique plus retry with episodic memory; verbal lessons get written into a buffer that primes the next attempt (Shinn et al. 2023). Strong on tasks with external verification. Common failure: thrashing 4+ rounds without convergence when the lesson buffer doesn’t constrain the next attempt.

CRITIC. Tool-augmented critique (Gou et al. 2023). The verifier is a tool (search, retrieval, calculator, typed checker), not the LLM. Strong on factual outputs. Common failure: the tool query misses the right document and confirms a wrong fact.

Self-consistency. Sample N outputs at higher temperature, take the majority (Wang et al. 2022). No explicit critic. Strong on reasoning tasks where the failure mode is a one-off slip. Common failure: when the model has a shared bias, three of five samples hallucinate the same fact and the majority is wrong.

Chain-of-self-correction. Decompose the task, critique each step, recombine (building on the iterative feedback-and-refine loop of Self-Refine, Madaan et al. 2023). Strong on long-context analytical tasks. Common failure: step critiques are individually correct but the recomposition introduces inconsistencies the loop never sees.
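Of the five patterns, self-consistency is the simplest to show in code, because the "critic" is just a vote. A minimal sketch, assuming the N sampled answers have already been reduced to short comparable strings (the function name and normalization are illustrative):

```python
from collections import Counter


def majority_answer(samples):
    """Return (winning_answer, agreement) for a list of sampled answers."""
    counts = Counter(s.strip().lower() for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


samples = ["1947", "1947", "1950", "1947 ", "1949"]
answer, agreement = majority_answer(samples)
```

The agreement fraction measures how often the samples agree, not whether they are right. When the model has a shared bias, three of five samples can agree on the same wrong date and the vote returns it with the same 0.6.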

A pattern bake-off (same dataset, five branches, three metrics per branch) is one of the cleanest A/B tests on an agent. The LLM evaluation playbook covers the dataset scaffolding it plugs into.

Instrumenting the loop with traceAI

The eval signal only lands if the loop is instrumented. traceAI ships 14 OTel-aligned span kinds across Python, TypeScript, Java, and C# (50+ AI surfaces total). For a reflection loop, the structure is one outer CHAIN span, one LLM span per round for the reflector, one LLM span per round for the refiner, and custom attributes carrying the round metadata.

from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType,
    SpanAttributes,
    FiSpanKindValues,
)
from opentelemetry import trace

# Register the tracer provider once at startup.
register(
    project_type=ProjectType.OBSERVE,
    project_name="reflection-loop-eval",
)
tracer = trace.get_tracer(__name__)


# generate_draft, reflect, refine, and score are your application's own
# functions; they are placeholders here, not part of traceAI.
def run_reflection_loop(question, context, max_rounds=3):
    # One outer CHAIN span wraps the whole loop.
    with tracer.start_as_current_span("reflection_loop") as loop_span:
        loop_span.set_attribute(
            SpanAttributes.FI_SPAN_KIND,
            FiSpanKindValues.CHAIN.value,
        )

        draft = generate_draft(question, context)
        pre = score(draft, context)
        loop_span.set_attribute("reflection.pre_score", pre)

        current, final = draft, pre
        stop_reason = "max_rounds"  # overwritten if the reflector stops early
        for r in range(max_rounds):
            # One child span per round carries the round metadata.
            with tracer.start_as_current_span(f"reflect_round_{r}") as s:
                s.set_attribute("reflection.round", r)

                critique = reflect(current, question, context)
                if critique.verdict == "fine":
                    stop_reason = "verdict_fine"
                    s.set_attribute("reflection.stop_reason", stop_reason)
                    break

                current = refine(current, critique)
                final = score(current, context)
                s.set_attribute("reflection.post_score", final)

        # Reuse the last round's score rather than scoring the same output twice.
        loop_span.set_attribute("reflection.stop_reason", stop_reason)
        loop_span.set_attribute("reflection.final_score", final)
        return current

The span tree is the source of truth for every downstream rubric. The LLM observability self-hosting guide covers the deployment options for the trace store; the reflection eval runs the same way against hosted or self-hosted.

Wiring the rubric with ai-evaluation

Scoring is a paired run over the golden set: same input, two outputs (pre, post), same templates. The ai-evaluation SDK ships 60+ EvalTemplate classes; the core ones for reflection eval are Groundedness, ContextAdherence, TaskCompletion, FactualAccuracy, and AnswerRefusal.

import os

from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness,
    ContextAdherence,
    TaskCompletion,
    FactualAccuracy,
)
from fi.testcases import TestCase

# Read credentials from the environment instead of hard-coding them.
evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)

templates = [
    Groundedness(),
    ContextAdherence(),
    TaskCompletion(),
    FactualAccuracy(),
]


def _test_case(case, output):
    # Same input, context, and expected answer; only the output differs.
    return TestCase(
        input=case["question"],
        output=output,
        context=case["context"],
        expected_output=case["expected"],
    )


def paired_score(case, pre_output, post_output):
    # Score the draft and the refined output with the same templates.
    pre = _test_case(case, pre_output)
    post = _test_case(case, post_output)
    return {
        "pre": evaluator.evaluate(eval_templates=templates, inputs=[pre]),
        "post": evaluator.evaluate(eval_templates=templates, inputs=[post]),
    }
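Once each side of the pair has been reduced to one number per template, the per-axis lift is a subtraction. The sketch below assumes you have already extracted a numeric score per template into plain dictionaries; how to pull those numbers out of the evaluator's result object is deliberately not shown, and the function name is hypothetical.

```python
def per_template_delta(pre_scores, post_scores):
    """Post minus pre for every template scored on both sides."""
    return {
        name: round(post_scores[name] - pre_scores[name], 3)
        for name in pre_scores
        if name in post_scores
    }


# Hypothetical scores on a 0-1 scale for one case.
pre_scores = {"Groundedness": 0.6, "TaskCompletion": 1.0}
post_scores = {"Groundedness": 0.9, "TaskCompletion": 0.5}

deltas = per_template_delta(pre_scores, post_scores)
regressed = [name for name, d in deltas.items() if d < 0]
```

Keeping the deltas per template matters because a loop can improve grounding while breaking task completion on the same case, and a single blended score would hide the trade.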

For the reflection-specific axes, use CustomLLMJudge to define ReflectionAccuracy, OverCorrectionRisk, and ConvergenceCorrectness inline. ai-evaluation can also serve as the external critic in the CRITIC pattern: replace self-critique with a real Groundedness check against retrieval and the loop becomes a tool-augmented verifier. Four distributed runners (Celery, Ray, Temporal, Kubernetes) handle parallel sweeps; a five-pattern bake-off on a 1,000-case dataset goes from hours to minutes.

Closing the loop: Error Feed and agent-opt

Three metrics tell you which loops to keep and which to switch off. The next two questions are which failures to fix first, and what to change.

Failing traces (low post-score, high drift, thrash to budget) flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering at prob >= 0.4 groups them into named issues. Each cluster fires a Claude Sonnet 4.5 Judge (Bedrock) for a 30-turn investigation across 8 span-tools, with a Haiku Chauffeur summarising spans longer than 3000 characters. Per cluster, the Judge writes a classification against a taxonomy of 5 categories and 30 subtypes, the 4-D trace score, and an immediate_fix string naming the change to ship today. The Linear integration ships today via OAuth; Slack, GitHub, Jira, and PagerDuty are on the roadmap.

For reflection loops, clusters tend to be specific: “Reflexion thrashes 4+ rounds on multi-hop retrieval cases,” “self-consistency picks majority-wrong when 3/5 samples hallucinate the same date,” “single-shot critique rewrites correct answers with hedging on policy questions,” “CRITIC retrieval missed the relevant doc, external check confirmed the wrong fact.” Each cluster is a regression test the team never has to write; the on-call engineer promotes 3 to 10 representative traces into the offline set with the rubric attached.

Once the failure mode is named, two knobs are the obvious next move: the reflector prompt and the convergence threshold. Both can be optimized with agent-opt, which ships six optimizers (RandomSearchOptimizer, the Optuna-backed BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) and a uniform EarlyStoppingConfig. The optimizer reads the offline set that Error Feed expanded, treats the reflector prompt as the variable, and uses the three-metric score (with cost and over-correction weighted into the objective) as the target. A typical run lifts ReflectionAccuracy by 5 to 15 points and cuts over-correction by half on the route it was tuned for. Today this is eval-driven optimization; a direct trace-stream-to-agent-opt connector is on the active roadmap, not shipped.
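A three-metric objective of the kind described above could be as simple as a weighted sum. This sketch is independent of agent-opt and does not use its API; the function name, the weights, and the two candidate prompts are invented to show the shape of the trade-off.

```python
def reflection_objective(lift_points, over_correction_pct, cost_multiple,
                         w_over=2.0, w_cost=1.0):
    """Higher is better: lift, minus penalties for over-correction and extra cost."""
    extra_cost = cost_multiple - 1.0   # cost above the single-shot baseline
    return lift_points - w_over * over_correction_pct - w_cost * extra_cost


# Two hypothetical reflector prompts evaluated on the same offline set.
candidates = {
    "paranoid": reflection_objective(12.0, over_correction_pct=8.0, cost_multiple=3.0),
    "clean": reflection_objective(6.0, over_correction_pct=0.0, cost_multiple=1.8),
}
best = max(candidates, key=candidates.get)
```

With these weights the prompt that lifts less but never over-corrects wins. The weights are the per-route judgment call: a route where regressions are cheap would lower `w_over` and could flip the result.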

Three deliberate tradeoffs

Paired scoring costs twice the eval bill. Two outputs per case, two scoring passes. Run deterministic templates on both sides where possible; reserve the LLM judge for cases where pre and post diverge.
An external critic adds a dependency. CRITIC against retrieval means loop correctness now depends on retrieval quality. Add ContextRelevance to the rubric stack so a wrong retrieval that confirmed a wrong fact looks like a failure, not a pass.
A pre-post score gate caps the upside. Rolling back when the post-score drops below a threshold is the safest production gate, but it kills cases where the reflector improved a flawed-but-passing draft. Set the threshold per route, not globally.

How Future AGI ships the reflection-eval bridge

FAGI ships the eval stack as a package, not a single product. Start with the SDK for code-defined paired scoring; graduate to the Platform when the loop needs self-improving rubrics, in-product authoring, and classifier-backed cost economics.

ai-evaluation (Apache 2.0) is the code-first surface: 60+ EvalTemplate classes including Groundedness, ContextAdherence, TaskCompletion, FactualAccuracy, AnswerRefusal, plus CustomLLMJudge for ReflectionAccuracy, OverCorrectionRisk, ConvergenceCorrectness. Four distributed runners (Celery, Ray, Temporal, Kubernetes). 13 guardrail backends, 9 open-weight.

traceAI (Apache 2.0) is the OpenTelemetry-native instrumentation. 50+ AI surfaces across Python, TypeScript, Java, C#. 14 span kinds including CHAIN for the loop and LLM for each round. Pluggable semantic conventions at register() time.

Agent Command Center (Apache 2.0, single Go binary) is the gateway. 100+ providers, 18+ built-in guardrail scanners, exact and semantic caching. x-prism-cost and x-prism-latency-ms headers land on every call; per-round cost slides into the trace timeline. Self-host or use gateway.futureagi.com/v1.

The Future AGI Platform is the operational layer: self-improving evaluators retune from thumbs feedback, an in-product agent writes rubrics from natural-language descriptions, classifier-backed evals run at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the eval stack as the clustering and what-to-fix layer. agent-opt covers reflector-prompt and convergence-threshold optimization.

Ready to evaluate your own reflection loop? Start with the ai-evaluation SDK, wire paired scoring against your dataset, instrument with traceAI, route through Agent Command Center. Three metrics, one trace tree, one decision.

Frequently asked questions

How do you evaluate an LLM self-reflection loop?

Score each input twice with the same rubric, once on the pre-reflection draft and once on the post-reflection output, and report three numbers. The pre-vs-post delta per question (lift on wrong-baseline cases, drift on right-baseline cases). The over-correction rate (percentage of already-correct answers the loop made worse). The cost-per-improvement (extra tokens and seconds spent per point of quality gained). Average accuracy alone hides all three. A loop that lifts mean accuracy by 4 points while degrading 8 percent of correct answers and tripling latency is not the same system as a loop that lifts 2 points cleanly; the average number cannot tell them apart.

Why does average accuracy hide self-reflection failures?

Averaging lets a 10-point lift on a wrong answer cancel against a 6-point drop on a correct one, so a loop that fires correctly on 60 percent of cases and over-corrects on 8 percent can post a healthy mean even when roughly one user in twelve walks away with a degraded answer they would not have seen with reflection off. The right unit is paired per-question scoring with the dataset split into already-correct and already-wrong subsets, with lift and drift reported separately. The headline number has to clear three bars at once instead of one.

What is over-correction in a reflection loop?

Over-correction is the rate at which the loop degrades an already-correct answer. The reflector flags a fine response as wrong (false positive), the refiner rewrites it with extra hedging, dropped specificity, or a new factual slip, and the post score is lower than the pre score. Most teams never measure it because their eval dataset is weighted toward known failures, and the right-baseline cases that drift get masked by the lifted wrong cases. Mitigations include a pre-post score gate that rolls back when the loop made things worse, tighter reflector prompts, and switching to an external critic like a retrieval check instead of self-critique.

How do I calculate cost-per-improvement for a reflection loop?

Sum the token cost and wall-clock latency for every round, divide by the per-case quality delta, and plot the marginal curve across rounds. Round 1 might lift the mean score by 4 points at 1.8x the single-shot cost. Round 2 might lift by 0.5 points at 3.2x. Round 3 is usually negative on both axes. Read cost off the Agent Command Center gateway response headers (x-prism-cost, x-prism-latency-ms) on every call and the math is a sum, not a separate ledger. The round budget is set where the marginal curve flattens for the route, not where the literature says it should be.

When does self-reflection help versus hurt?

It helps on high-stakes single-shot tasks where errors are easier to detect than to avoid, on multi-step reasoning where each step can be checked, and on tasks where external verification is available (unit tests, schema checks, retrieval evidence). It hurts on real-time agents with tight latency budgets, on high-volume cheap routes where the marginal lift never pays back the round cost, and on outputs already cascade-evaluated by a separate judge stack where the loop pays for verification twice. The honest answer is that the gain-versus-cost tradeoff is rarely measured, so teams ship reflection by default and assume it works on traffic it has never been tested on.

What templates from ai-evaluation map onto self-reflection eval?

Run Groundedness, ContextAdherence, TaskCompletion, FactualAccuracy, and AnswerRefusal as paired scores on the pre-reflection draft and the post-reflection output; the delta per template is the per-axis lift. Build three custom CustomLLMJudge rubrics on top: ReflectionAccuracy (was the reflector's flag correct), OverCorrectionRisk (did refinement degrade a correct answer), ConvergenceCorrectness (did the loop stop for the right reason). ai-evaluation can also serve as the external critic in the CRITIC pattern: replace the self-critique step with a real Groundedness check against the retrieval result and the loop becomes a tool-augmented verifier instead of a self-graded one.

How does Future AGI capture reflection-loop cost and latency?

The Agent Command Center gateway returns per-call cost and latency in response headers (x-prism-cost, x-prism-latency-ms, x-prism-model-used). Wire the reflection loop through the gateway and the per-round numbers land on every call. traceAI carries the loop structure as one CHAIN span for the loop and one LLM span per round, with reflection.round, reflection.pre_score, and reflection.post_score as custom attributes, so the gateway cost rows align with the trace timeline. The Future AGI Platform's Error Feed clusters reflection failures via HDBSCAN soft-clustering, a Sonnet 4.5 Judge writes the immediate_fix per cluster, and the fix feeds the self-improving evaluators. Classifier-backed evals on the Platform run at lower per-eval cost than Galileo Luna-2.
