Evaluating LLM Self-Reflection Loops: The 3 Metrics That Matter (2026)
Self-reflection loops sometimes improve outputs and sometimes destroy them. The pre-vs-post delta, over-correction rate, and cost-per-improvement that turn reflection from hope into engineering.
Most production reflection loops were switched on because Reflexion looked good in a paper and the framework shipped a one-line flag. Almost none have been evaluated on the only question that matters: when this loop fires on real traffic, does it make outputs better, worse, or the same, and at what price? Average accuracy is a confident answer to the wrong question. It hides the inputs the loop just degraded, the cost it added on routes it never helped, and the convergence failures the user sees as a spinner.
The opinion this post earns: self-reflection sometimes improves outputs and sometimes destroys them. The three metrics that turn reflection from hope into engineering are the pre-vs-post delta per question, the over-correction rate, and the cost-per-improvement. A loop that lifts mean accuracy by 4 points while quietly regressing 8 percent of correct answers and tripling latency reads, on the headline, like a loop that earns the same 4 points cleanly. They are very different systems shipping under the same headline number.
This guide walks the three metrics, the task-type matrix that decides where reflection earns its keep, the traceAI span shape, and the Future AGI (FAGI) surface that closes the loop from failure cluster back to optimized prompt.
TL;DR: three metrics, one decision
| Metric | What it measures | Failure if missing |
|---|---|---|
| Pre-vs-post delta per question | Paired score on the draft and the refined output, split by baseline-correct vs baseline-wrong | Mean accuracy hides every drift on the happy path |
| Over-correction rate | Percentage of already-correct cases the loop degraded | Ship a regression at scale on traffic the dataset under-represents |
| Cost-per-improvement | Tokens and seconds per point of quality gained, by round | Round budget is set by literature, not by your route’s marginal curve |
Adjacent axes (reflection accuracy, convergence behavior, latency overhead) all collapse into one of the three. Reflection accuracy is the upstream cause of over-correction. Convergence is the downstream cause of cost-per-improvement blowing up. Latency is a unit of the cost number. Three metrics, one decision: turn the loop on, off, or route-gated.
Why average accuracy is the wrong unit
The first sin in reflection eval is reporting a single mean. Averaging treats a 10-point lift on a wrong answer and a 6-point drop on a correct one as moves of the same shape. One is the loop doing the job; the other is the loop creating a user-visible regression. Sum and divide and both disappear.
Three things tend to break once you look past the mean. The reflector hallucinates its own critique with the same blind spots that produced the bad answer; the refiner rewrites under a confident wrong critique. The loop thrashes on hard inputs that don’t have a stable answer, burning the round budget while the user watches a spinner. Cost piles up on routes where the lift is fractional, doubling the bill for a 0.3-point move on the headline. All three are invisible to a mean. All three are obvious the moment you pair the pre-score with the post-score and look at the distribution.
This is the same shape of bug as the "agent passes evals, fails production" pattern: a system aces a benchmark and regresses on real traffic because the dataset wasn’t asked to expose the regression. The fix is the same. Score in a way that makes the regression visible.
Metric 1: pre-vs-post delta per question
The right unit is paired scoring. Same input, two outputs (pre-reflection draft, post-reflection committed answer), same rubric, per-case delta. Three numbers fall out.
Lift on wrong-baseline cases. Subset to cases the pre-score flagged as bad. How much did the loop improve them? The value the loop adds.
Drift on right-baseline cases. Subset to cases the pre-score flagged as already good. How often did the loop make them worse? The over-correction tax.
Net per-case delta. Average across the dataset, weighted by your traffic shape, not the curated benchmark mix. The headline, after lift and drift have already been disclosed.
A loop that lifts wrong cases by 12 points but drifts 8 percent of right cases is a different system from one that lifts 6 points cleanly. The first is a coin flip with extra steps; the second is a real improvement. The mean alone reads them the same way.
Build the dataset so the subsets actually exist. A set weighted entirely toward known failures (the natural choice when you’re hunting lift) cannot measure drift, because there are no right-baseline cases to drift on. Skew toward representative traffic with both already-correct and known-wrong cases inside, 200 to 500 per route as a starting size. The deterministic versus LLM-judge evaluation split keeps the paired scoring cheap: deterministic checks pre and post, LLM judge only on cases where the deterministic results disagree.
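A minimal sketch of the three numbers above, assuming each case already carries a rubric score for the draft and for the committed answer. The `PairedCase` type, the `pass_mark` cut-off, and the `weight` field are illustrative names, not part of any SDK mentioned here.

```python
from dataclasses import dataclass


@dataclass
class PairedCase:
    pre: float            # rubric score of the pre-reflection draft
    post: float           # rubric score of the committed answer
    weight: float = 1.0   # share of production traffic this case stands for


def paired_report(cases, pass_mark=0.5):
    """Split paired scores by baseline correctness and summarize each subset."""
    wrong = [c for c in cases if c.pre < pass_mark]
    right = [c for c in cases if c.pre >= pass_mark]

    # Lift: mean improvement on cases the draft got wrong.
    lift = sum(c.post - c.pre for c in wrong) / len(wrong) if wrong else None
    # Drift: share of already-correct cases whose score went down.
    drift = sum(c.post < c.pre for c in right) / len(right) if right else None
    # Net: traffic-weighted mean delta over the whole set.
    total_w = sum(c.weight for c in cases)
    net = sum(c.weight * (c.post - c.pre) for c in cases) / total_w if total_w else None
    return {"lift_on_wrong": lift, "drift_rate_on_right": drift, "net_delta": net}


cases = [
    PairedCase(pre=0.25, post=0.75),  # wrong draft, fixed
    PairedCase(pre=0.25, post=0.25),  # wrong draft, unchanged
    PairedCase(pre=1.0, post=1.0),    # right draft, untouched
    PairedCase(pre=1.0, post=0.5),    # right draft, degraded
]
report = paired_report(cases)
```

On this toy set the net delta is zero while lift and drift are both large, which is the case a single mean cannot distinguish from a loop that did nothing.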
Metric 2: over-correction rate
Over-correction is the silent killer: the rate at which the loop degrades an already-correct answer. The dataset most teams ship is weighted toward known failures (correct for measuring lift, wrong for catching drift), and the right-baseline cases that get rewritten by the loop disappear into the mean.
The mechanic upstream of over-correction is the reflector’s false-positive rate. The reflector makes a binary call per round: this is wrong, or this is fine. A false positive flags a correct answer as wrong, the refiner rewrites it, and the post-score drops. A false negative flags a wrong answer as fine, the loop commits the bad output, and the post-score never moves. Both rates move with the reflector prompt. A paranoid reflector reduces false negatives and inflates false positives; a conservative reflector goes the other way. The right operating point is route-specific and you cannot find it without measuring both rates against ground truth.
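Measuring both rates against ground truth can be sketched as below, assuming you have logged, per reflector call, whether the answer was actually correct and whether the reflector flagged it as wrong. The record shape is a hypothetical convenience, not a format any tool here emits.

```python
def reflector_error_rates(records):
    """records: iterable of (answer_is_correct, reflector_flagged_wrong) booleans."""
    records = list(records)
    flags_on_correct = [flagged for is_correct, flagged in records if is_correct]
    flags_on_wrong = [flagged for is_correct, flagged in records if not is_correct]

    # False positive: a correct answer the reflector flagged as wrong.
    fp_rate = (
        sum(flags_on_correct) / len(flags_on_correct) if flags_on_correct else None
    )
    # False negative: a wrong answer the reflector waved through as fine.
    fn_rate = (
        sum(not f for f in flags_on_wrong) / len(flags_on_wrong)
        if flags_on_wrong
        else None
    )
    return {"false_positive_rate": fp_rate, "false_negative_rate": fn_rate}


records = [
    (True, False), (True, False), (True, False), (True, True),  # 1 of 4 correct answers flagged
    (False, True), (False, False),                              # 1 of 2 wrong answers missed
]
rates = reflector_error_rates(records)
```

Running this once per candidate reflector prompt gives the two numbers needed to pick an operating point for the route.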
Three mitigations that ship in 2026:
- Pre-post score gate. Run the rubric on both outputs. If the post-score drops below the pre-score by more than a threshold, roll back. Cheap, deterministic, catches over-correction at runtime even when the eval missed it.
- Tighter reflector prompts. Vague rubrics (“is this answer good?”) drive false positives. Spell out what wrong means for the route.
- External critic. Switch from self-critique to CRITIC against a real retrieval or tool check. An external verifier grades the answer instead of the reflector grading itself.
Build OverCorrectionRisk as a CustomLLMJudge rubric and run it on the right-baseline subset every regression cycle. The number wants to be near zero. Anything above a few percent on a high-volume route ships a regression at scale.
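The pre-post score gate from the mitigations above fits in a few lines. This is a sketch under stated assumptions: `score` is whatever rubric function the route already uses, `max_drop` is the per-route threshold, and `toy_score` is a stand-in used only to make the example runnable.

```python
def commit_with_gate(draft, refined, score, max_drop=0.0):
    """Return (committed_output, rolled_back) after comparing rubric scores."""
    pre, post = score(draft), score(refined)
    # Roll back to the draft when refinement cost more than the allowed drop.
    if pre - post > max_drop:
        return draft, True
    return refined, False


def toy_score(text):
    # Stand-in rubric: the answer is "correct" if it contains the expected value.
    return 1.0 if "42" in text else 0.0


kept, rolled_back = commit_with_gate(
    "The answer is 42.", "It is probably about 40.", toy_score
)
```

Logging `rolled_back` per request gives a runtime over-correction counter alongside the offline rate.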
Metric 3: cost-per-improvement
Every reflection round is a full LLM call. Every round is a sequential round-trip. Both compound.
Cost per request is the sum of the pre-call cost, every reflector-call cost, and every refiner-call cost. A two-round Reflexion loop on a 4K-token context typically lands at 3 to 5 times the single-shot bill. Self-consistency with N=5 samples lands near 5x by construction. Latency is worse than cost because rounds cannot parallelize. A 1.5-second base call becomes a 4 to 5-second loop on two rounds.
The right reporting unit is marginal quality-per-additional-round. Round 1 might lift the mean score by 4 points at 1.8x cost. Round 2 might lift by 0.5 points at 3.2x. Round 3 is usually negative on both axes. Plot the curve and the round budget picks itself: stop where marginal lift drops below the marginal cost you’re willing to pay for that route.
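One way to turn that curve into a round budget, as a rough sketch. It reads the figures above as cumulative multiples of the single-shot bill, which is an assumption; the baseline score of 70 and the round-3 point are invented for illustration, and `max_cost_per_point` is a hypothetical per-route knob.

```python
def pick_round_budget(curve, max_cost_per_point):
    """Return how many reflection rounds are worth running.

    curve[r] is (mean_score, cumulative_cost_multiple) after r rounds,
    with curve[0] being the single-shot draft.
    """
    budget = 0
    for r in range(1, len(curve)):
        d_score = curve[r][0] - curve[r - 1][0]
        d_cost = curve[r][1] - curve[r - 1][1]
        # Stop at the first round that adds no lift or costs too much per point.
        if d_score <= 0 or d_cost / d_score > max_cost_per_point:
            break
        budget = r
    return budget


curve = [
    (70.0, 1.0),  # draft only (hypothetical baseline)
    (74.0, 1.8),  # round 1: +4 points
    (74.5, 3.2),  # round 2: +0.5 points
    (74.0, 4.6),  # round 3: negative lift (hypothetical)
]
rounds = pick_round_budget(curve, max_cost_per_point=0.5)
```

With a ceiling of 0.5x of the base bill per point, this curve stops after round 1; a route willing to pay 3x per point would run two rounds.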
The Agent Command Center gateway returns per-call cost and latency on every response:
- x-prism-cost (USD)
- x-prism-latency-ms
- x-prism-model-used
- x-prism-fallback-used
- x-prism-routing-strategy
Wire the loop through the gateway and the cost lands on every round. No separate ledger. Sum across rounds for cost-per-request; divide by the per-case quality delta for cost-per-improvement. The wider story is in the AI agent cost optimization observability playbook.
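The sum-and-divide step might look like the sketch below. It assumes the response headers for each call have been collected into plain dicts and that both header values parse as decimal strings; the exact value format is not stated here, so treat that as an assumption to verify against real responses.

```python
def cost_per_improvement(call_headers, pre_score, post_score):
    """Sum gateway cost and latency over every call in one loop run.

    call_headers holds one dict of response headers per LLM call
    (the draft, then each reflector and refiner call).
    """
    # Assumption: both header values are plain decimal strings.
    cost_usd = sum(float(h["x-prism-cost"]) for h in call_headers)
    # Rounds run sequentially, so latencies add up.
    latency_ms = sum(float(h["x-prism-latency-ms"]) for h in call_headers)
    delta = post_score - pre_score
    improved = delta > 0
    return {
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
        # Undefined when the loop did not improve the case, so report None.
        "usd_per_point": cost_usd / delta if improved else None,
        "ms_per_point": latency_ms / delta if improved else None,
    }


# Illustrative values only: one draft, one reflector call, one refiner call.
headers = [
    {"x-prism-cost": "0.004", "x-prism-latency-ms": "1500"},
    {"x-prism-cost": "0.002", "x-prism-latency-ms": "900"},
    {"x-prism-cost": "0.004", "x-prism-latency-ms": "1600"},
]
result = cost_per_improvement(headers, pre_score=70.0, post_score=74.0)
```

Per-case ratios are undefined when the delta is zero or negative, so for a route-level number it is usually safer to divide total cost by total lift across the dataset rather than average the per-case ratios.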
Where reflection helps and where it hurts
Most production reflection loops should run on a subset of traffic, not all of it. The decision is a function of task type, latency budget, and whether external verification exists.
| Task type | Reflection helps? | Pattern | Why or why not |
|---|---|---|---|
| Code generation with unit tests | Yes | CRITIC | External verifier provides ground truth; refiner has signal to act on |
| Multi-step reasoning (math, planning) | Yes | Self-consistency or chain-of-self-correction | Step-level errors are cheaper to catch than avoid; majority vote dampens one-off slips |
| RAG factual answers | Often | CRITIC against retrieval | External Groundedness check beats self-critique; loop only fires when the check fails |
| High-volume cheap chat | No | None | Marginal lift is fractional; overhead never pays back |
| Real-time voice / in-call assistants | No | None | Sequential round-trip cost is structural; latency budget kills the loop before quality matters |
| Outputs already cascade-evaluated | No | None | Loop pays for verification a separate judge stack already covers |
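The matrix can be encoded as a route gate. The sketch below is one hypothetical encoding: the task-type keys, the latency arguments, and the default-off behavior for unknown routes are all assumptions layered on top of the table, and the multi-step row picks self-consistency where the table also allows chain-of-self-correction.

```python
# Pattern per task type, transcribed from the table; None means no reflection.
PATTERN_BY_TASK = {
    "code_generation_with_tests": "CRITIC",
    "multi_step_reasoning": "self-consistency",
    "rag_factual": "CRITIC",
    "high_volume_chat": None,
    "realtime_voice": None,
    "cascade_evaluated": None,
}


def choose_pattern(task_type, latency_budget_s, est_loop_latency_s):
    """Return the reflection pattern for a route, or None to skip the loop."""
    # Unknown task types default to no reflection until they have been measured.
    pattern = PATTERN_BY_TASK.get(task_type)
    if pattern is None:
        return None
    # Even a helpful pattern is off when the loop cannot fit the latency budget.
    if est_loop_latency_s > latency_budget_s:
        return None
    return pattern


decision = choose_pattern("rag_factual", latency_budget_s=6.0, est_loop_latency_s=4.5)
```

Keeping the gate as data rather than branching logic makes it easy to update a row when a route's measured lift or drift changes.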
Five patterns to choose between when reflection does belong on the route.
Single-shot critique. One pass: draft, critique, rewrite, commit. Cheap and bounded. Common failure: the critic rewrites a correct answer with extra hedging, dropping specificity without changing the underlying fact.
Reflexion. Iterative critique plus retry with episodic memory; verbal lessons get written into a buffer that primes the next attempt (Shinn et al. 2023). Strong on tasks with external verification. Common failure: thrashing 4+ rounds without convergence when the lesson buffer doesn’t constrain the next attempt.
CRITIC. Tool-augmented critique (Gou et al. 2023). The verifier is a tool (search, retrieval, calculator, typed checker), not the LLM. Strong on factual outputs. Common failure: the tool query misses the right document and confirms a wrong fact.
Self-consistency. Sample N outputs at higher temperature, take the majority (Wang et al. 2022). No explicit critic. Strong on reasoning tasks where the failure mode is a one-off slip. Common failure: when the model has a shared bias, three of five samples hallucinate the same fact and the majority is wrong.
Chain-of-self-correction. Decompose the task, critique each step, recombine (Self-Refine, Madaan et al. 2023). Strong on long-context analytical tasks. Common failure: step critiques are individually correct but the recomposition introduces inconsistencies the loop never sees.
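Of the five, self-consistency is the easiest to show in code, since the whole pattern is a vote. A minimal sketch, assuming the N sampled final answers have already been extracted as strings; the whitespace and case normalization is an illustrative choice, not part of the published method.

```python
from collections import Counter


def self_consistent_answer(samples):
    """Return (majority_answer, agreement_share) for sampled final answers."""
    if not samples:
        raise ValueError("need at least one sample")
    # Normalize lightly so trivially different strings count as the same vote.
    counts = Counter(s.strip().lower() for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


samples = ["1947", "1947", "1949", "1947 ", "1950"]
answer, agreement = self_consistent_answer(samples)
```

The agreement share is worth logging next to the answer, but it does not protect against the shared-bias failure described above: three samples agreeing on the same wrong date still win the vote.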
A pattern bake-off (same dataset, five branches, three metrics per branch) is one of the cleanest A/B tests on an agent. The LLM evaluation playbook covers the dataset scaffolding it plugs into.
Instrumenting the loop with traceAI
The eval signal only lands if the loop is instrumented. traceAI ships 14 OTel-aligned span kinds across Python, TypeScript, Java, and C# (50+ AI surfaces total). For a reflection loop, the structure is one outer CHAIN span, one LLM span per round for the reflector, one LLM span per round for the refiner, and custom attributes carrying the round metadata.

    from fi_instrumentation import register
    from fi_instrumentation.fi_types import (
        ProjectType,
        SpanAttributes,
        FiSpanKindValues,
    )
    from opentelemetry import trace

    register(
        project_type=ProjectType.OBSERVE,
        project_name="reflection-loop-eval",
    )
    tracer = trace.get_tracer(__name__)


    # generate_draft, score, reflect, and refine are the application's own helpers.
    def run_reflection_loop(question, context, max_rounds=3):
        # Outer CHAIN span wraps the whole loop; each round gets a child span.
        with tracer.start_as_current_span("reflection_loop") as loop_span:
            loop_span.set_attribute(
                SpanAttributes.FI_SPAN_KIND,
                FiSpanKindValues.CHAIN.value,
            )
            draft = generate_draft(question, context)
            pre = score(draft, context)
            loop_span.set_attribute("reflection.pre_score", pre)

            current = draft
            final = pre  # stays at the pre-score if no round rewrites the draft
            for r in range(max_rounds):
                with tracer.start_as_current_span(f"reflect_round_{r}") as s:
                    s.set_attribute("reflection.round", r)
                    critique = reflect(current, question, context)
                    if critique.verdict == "fine":
                        s.set_attribute("reflection.stop_reason", "verdict_fine")
                        break
                    current = refine(current, critique)
                    # Score each refined output once and reuse it for the final score.
                    final = score(current, context)
                    s.set_attribute("reflection.post_score", final)
            loop_span.set_attribute("reflection.final_score", final)
            return current

The span tree is the source of truth for every downstream rubric. The LLM observability self-hosting guide covers the deployment options for the trace store; the reflection eval runs the same way against hosted or self-hosted.
Wiring the rubric with ai-evaluation
Scoring is a paired run over the golden set: same input, two outputs (pre, post), same templates. The ai-evaluation SDK ships 60+ EvalTemplate classes; the core ones for reflection eval are Groundedness, ContextAdherence, TaskCompletion, FactualAccuracy, and AnswerRefusal.

    import os

    from fi.evals import Evaluator
    from fi.evals.templates import (
        Groundedness,
        ContextAdherence,
        TaskCompletion,
        FactualAccuracy,
    )
    from fi.testcases import TestCase

    # Credentials are read from the environment here; load them however your deployment does.
    evaluator = Evaluator(
        fi_api_key=os.environ["FI_API_KEY"],
        fi_secret_key=os.environ["FI_SECRET_KEY"],
    )
    templates = [
        Groundedness(),
        ContextAdherence(),
        TaskCompletion(),
        FactualAccuracy(),
    ]


    def paired_score(case, pre_output, post_output):
        # Same input, context, and expected output; only the model output differs.
        pre = TestCase(
            input=case["question"],
            output=pre_output,
            context=case["context"],
            expected_output=case["expected"],
        )
        post = TestCase(
            input=case["question"],
            output=post_output,
            context=case["context"],
            expected_output=case["expected"],
        )
        return {
            "pre": evaluator.evaluate(eval_templates=templates, inputs=[pre]),
            "post": evaluator.evaluate(eval_templates=templates, inputs=[post]),
        }

For the reflection-specific axes, use CustomLLMJudge to define ReflectionAccuracy, OverCorrectionRisk, and ConvergenceCorrectness inline. ai-evaluation can also serve as the external critic in the CRITIC pattern: replace self-critique with a real Groundedness check against retrieval and the loop becomes a tool-augmented verifier. Four distributed runners (Celery, Ray, Temporal, Kubernetes) handle parallel sweeps; a five-pattern bake-off on a 1,000-case dataset goes from hours to minutes.
Closing the loop: Error Feed and agent-opt
Three metrics tell you which loops to keep and which to switch off. The next two questions are which failures to fix first, and what to change.
Failing traces (low post-score, high drift, thrash to budget) flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering at prob >= 0.4 groups them into named issues. Each cluster fires a Claude Sonnet 4.5 Judge (Bedrock) for a 30-turn investigation across 8 span-tools, with a Haiku Chauffeur summarising spans over 3000 characters. Per cluster, the Judge writes a 5-category 30-subtype taxonomy classification, the 4-D trace score, and an immediate_fix string naming the change to ship today. Linear ships today via OAuth; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
For reflection loops, clusters tend to be specific: “Reflexion thrashes 4+ rounds on multi-hop retrieval cases,” “self-consistency picks majority-wrong when 3/5 samples hallucinate the same date,” “single-shot critique rewrites correct answers with hedging on policy questions,” “CRITIC retrieval missed the relevant doc, external check confirmed the wrong fact.” Each cluster is a regression test the team never has to write; the on-call engineer promotes 3 to 10 representative traces into the offline set with the rubric attached.
Once the failure mode is named, two knobs are the obvious next move: the reflector prompt and the convergence threshold. Both can be optimized with agent-opt, which ships six optimizers (RandomSearchOptimizer, the Optuna-backed BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) and a uniform EarlyStoppingConfig. The optimizer reads the offline set that Error Feed expanded, treats the reflector prompt as the variable, and uses the three-metric score (with cost and over-correction weighted into the objective) as the target. A typical run lifts ReflectionAccuracy by 5 to 15 points and cuts over-correction by half on the route it was tuned for. Today this is eval-driven optimization; a direct trace-stream-to-agent-opt connector is on the active roadmap, not shipped.
Three deliberate tradeoffs
- Paired scoring costs twice the eval bill. Two outputs per case, two scoring passes. Run deterministic templates on both sides where possible; reserve the LLM judge for cases where pre and post diverge.
- An external critic adds a dependency. CRITIC against retrieval means loop correctness now depends on retrieval quality. Add ContextRelevance to the rubric stack so a wrong retrieval that confirmed a wrong fact looks like a failure, not a pass.
- A pre-post score gate caps the upside. Rolling back when the post-score drops below a threshold is the safest production gate, but it kills cases where the reflector improved a flawed-but-passing draft. Set the threshold per route, not globally.
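The first tradeoff, deterministic checks on both sides with the LLM judge held back for divergent pairs, can be sketched as follows. Exact match stands in for whatever deterministic check the route uses, and `llm_judge` is a hypothetical callable returning graded scores for both outputs; neither is an API from the SDKs named in this post.

```python
def score_pair(pre_output, post_output, expected, llm_judge):
    """Score both outputs deterministically; call the judge only if they disagree."""
    pre_ok = pre_output.strip() == expected.strip()
    post_ok = post_output.strip() == expected.strip()
    if pre_ok == post_ok:
        # Deterministic results agree, so the expensive judge is skipped.
        return {"pre": float(pre_ok), "post": float(post_ok), "judge_called": False}
    verdict = llm_judge(pre_output, post_output, expected)
    return {"pre": verdict["pre"], "post": verdict["post"], "judge_called": True}


judge_calls = []


def stub_judge(pre_output, post_output, expected):
    # Stand-in for a real LLM judge; records that it was invoked.
    judge_calls.append((pre_output, post_output))
    return {"pre": 0.2, "post": 0.9}


same = score_pair("Paris", "Paris", "Paris", stub_judge)
diverged = score_pair("Lyon", "Paris", "Paris", stub_judge)
```

On a route where the loop leaves most answers unchanged, this keeps the judge bill proportional to the number of rewrites rather than to the dataset size.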
How Future AGI ships the reflection-eval bridge
FAGI ships the eval stack as a package, not a single product. Start with the SDK for code-defined paired scoring; graduate to the Platform when the loop needs self-improving rubrics, in-product authoring, and classifier-backed cost economics.
ai-evaluation (Apache 2.0) is the code-first surface: 60+ EvalTemplate classes including Groundedness, ContextAdherence, TaskCompletion, FactualAccuracy, AnswerRefusal, plus CustomLLMJudge for ReflectionAccuracy, OverCorrectionRisk, ConvergenceCorrectness. Four distributed runners (Celery, Ray, Temporal, Kubernetes). 13 guardrail backends, 9 open-weight.
traceAI (Apache 2.0) is the OpenTelemetry-native instrumentation. 50+ AI surfaces across Python, TypeScript, Java, C#. 14 span kinds including CHAIN for the loop and LLM for each round. Pluggable semantic conventions at register() time.
Agent Command Center (Apache 2.0, single Go binary) is the gateway. 100+ providers, 18+ built-in guardrail scanners, exact and semantic caching. x-prism-cost and x-prism-latency-ms headers land on every call; per-round cost slides into the trace timeline. Self-host or use gateway.futureagi.com/v1.
The Future AGI Platform is the operational layer: self-improving evaluators retune from thumbs feedback, an in-product agent writes rubrics from natural-language descriptions, classifier-backed evals run at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the eval stack as the clustering and what-to-fix layer. agent-opt covers reflector-prompt and convergence-threshold optimization.
Ready to evaluate your own reflection loop? Start with the ai-evaluation SDK, wire paired scoring against your dataset, instrument with traceAI, route through Agent Command Center. Three metrics, one trace tree, one decision.