What Is Reasoning Quality?
An agent-eval metric scoring the coherence, progression, and justification depth of an agent's per-step thoughts across a trajectory.
Reasoning quality is an agent-evaluation metric that scores how well an agent thought through its trajectory — coherence, logical progression, and justification depth across the per-step thoughts it emitted. The rule-based variant tracks thought count, average thought length, and the density of reasoning indicators (“because”, “therefore”, “first”, “then”, “however”). The LLM-judge variant scores against a coherence rubric. The metric returns a 0–1 score with a thought-count and indicator-count breakdown. In FutureAGI it is implemented as both the ReasoningQuality local-metric class and the ReasoningQualityEval framework class in fi.evals.
Why It Matters in Production LLM and Agent Systems
An agent that arrives at a correct answer through nonsense reasoning is an outage waiting to happen. Today’s prompt produced the right output by accident; tomorrow’s near-identical prompt will produce a confidently wrong one with the same kind of reasoning. Reasoning quality is the metric that catches that fragility before it ships.
The pain shows up in three places:
- ReAct-pattern agents that lose their Thought: step entirely after a prompt rewrite. Outcomes look fine on a regression set, but every production failure is now opaque because there is no chain to debug.
- Chain-of-thought agents whose thoughts collapse to one-line filler (“I’ll do that now”) instead of multi-step justification: a length-and-indicator regression.
- Agents whose thoughts fragment into disconnected statements (high count, low coherence), making the trajectory unreviewable for compliance.
In 2026 stacks where agents are subject to audit (EU AI Act for high-risk systems, internal model-risk reviews, customer SOC2 evidence), reasoning quality is what you point at when an auditor asks “how do you know this agent is making decisions you can trace?” A trajectory with high ReasoningQuality and an emitted thought per step is auditable; a black-box trajectory is not. Multi-agent systems compound this — a sub-agent with no reasoning makes the entire handoff chain unexplainable.
How FutureAGI Handles Reasoning Quality
FutureAGI offers two implementations that consume the same agent-trajectory input. The local fi.evals.ReasoningQuality class collects every non-empty step.thought in the trajectory, then scores three components: a length score (peaking at 10–30 words per thought), a reasoning-indicator density score (counting hits from a 16-keyword list including “because”, “therefore”, “since”, “first”, “then”, “however”), and a progression score (rewarding more thoughts up to a sensible cap). It is sub-millisecond, deterministic, and returns thought-level details. When the agent emits no explicit thoughts, the metric falls back to scanning actions for implicit indicators — partial credit rather than a hard zero.
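For intuition, the sketch below mirrors that three-component arithmetic. It is a minimal illustration, not FutureAGI’s implementation: the equal weights, the progression cap, the 0.3 fallback value, and the trimmed-down keyword set are all assumptions; in practice you call fi.evals.ReasoningQuality.
# Illustrative sketch of rule-based reasoning-quality scoring.
# Weights, progression cap, and fallback value are assumptions;
# fi.evals.ReasoningQuality is the real implementation.
INDICATORS = {"because", "therefore", "since", "first", "then", "however"}

def length_score(words: int) -> float:
    # Peaks for thoughts of 10-30 words, degrading outside that band.
    if 10 <= words <= 30:
        return 1.0
    return max(0.0, 1.0 - abs(words - 20) / 40.0)

def reasoning_quality(thoughts: list[str]) -> float:
    thoughts = [t for t in thoughts if t.strip()]
    if not thoughts:
        return 0.3  # stand-in for the implicit-indicator fallback on actions
    length = sum(length_score(len(t.split())) for t in thoughts) / len(thoughts)
    hits = sum(w.strip(".,;").lower() in INDICATORS
               for t in thoughts for w in t.split())
    density = min(1.0, hits / len(thoughts))     # indicator-density component
    progression = min(1.0, len(thoughts) / 5.0)  # more thoughts, up to a cap
    return round((length + density + progression) / 3.0, 3)
Under this sketch, two short, well-connected thoughts land mid-range: good length and indicator density, but little progression credit.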
The framework variant fi.evals.ReasoningQualityEval uses an LLM judge against a coherence rubric (“evaluate logical coherence of a response”) and is the right choice when you need nuance — for example, distinguishing “first… then… but actually…” that recovers from a wrong direction (good reasoning) from “first… then… finally…” that walks confidently into a wrong answer (bad reasoning despite passing the indicator density check). Concretely: a legal-research agent team using traceAI-anthropic runs ReasoningQuality on every regression for a fast signal and ReasoningQualityEval on a 10% sample for a higher-fidelity score. They alert on the cheaper rule-based metric and use the judge metric to root-cause divergences. Compared with Patronus’s coherence evaluator (LLM-only, slower, opaque to thought-level breakdowns), the FutureAGI two-tier approach lets teams calibrate cost against fidelity.
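The two-tier pattern is straightforward to wire up. Below is a minimal sketch assuming both classes expose the evaluate(trajectory=...) shape shown in the measurement section; the model= constructor argument, the thresholds, and the logging hooks are illustrative assumptions, not FutureAGI defaults.
import logging
import random

from fi.evals import ReasoningQuality, ReasoningQualityEval

log = logging.getLogger("agent-evals")
rule_metric = ReasoningQuality()
# Judge pinned to a different model family than the agent under test;
# the model= keyword is an assumed constructor argument.
judge_metric = ReasoningQualityEval(model="gpt-4o")

def score_trajectory(trajectory, sample_rate: float = 0.10):
    # Tier 1: cheap and deterministic, on every run -- this is what alerts fire on.
    fast = rule_metric.evaluate(trajectory=trajectory)
    if fast.score < 0.5:  # alert threshold is illustrative
        log.warning("reasoning-quality regression: %.2f", fast.score)
    # Tier 2: LLM judge on a ~10% sample -- used to root-cause divergences.
    if random.random() < sample_rate:
        deep = judge_metric.evaluate(trajectory=trajectory)
        if abs(deep.score - fast.score) > 0.3:
            log.info("rule/judge divergence: %.2f vs %.2f", fast.score, deep.score)
    return fast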
How to Measure or Detect It
Measurement signals tied to reasoning-quality evals:
- fi.evals.ReasoningQuality — returns a 0–1 score with thought_count, avg_thought_length, indicator_count. Sub-ms; run on every trajectory.
- fi.evals.ReasoningQualityEval — LLM-judge variant; pin the judge model to a different family from the agent and run on a sampled cohort.
- agent.trajectory.step.thought OTel attribute — the field the rule-based metric reads; missing on a span is itself a regression signal worth alerting on (see the check after the minimal example below).
- Indicator-density-by-cohort dashboard — segment by agent variant to see which prompt produces structured reasoning vs filler.
Minimal Python:
from fi.evals import ReasoningQuality

metric = ReasoningQuality()
# `run` stands in for a captured agent run; its trajectory carries the per-step thoughts.
result = metric.evaluate(trajectory=run.trajectory)
print(result.score, result.thought_count, result.indicator_count)
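The missing-thought signal from the list above can be checked without any metric at all. A sketch, assuming trajectory steps surface the agent.trajectory.step.thought attribute as a plain .thought field (the trajectory.steps shape is an assumption):
def missing_thought_steps(trajectory) -> list[int]:
    # Indices of steps with no emitted thought -- itself a regression signal.
    return [
        i for i, step in enumerate(trajectory.steps)  # .steps shape is assumed
        if not (getattr(step, "thought", "") or "").strip()
    ]

gaps = missing_thought_steps(run.trajectory)
if gaps:
    print(f"WARN: {len(gaps)} step(s) missing agent.trajectory.step.thought: {gaps}")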
Common Mistakes
- Treating reasoning quality as a substitute for task completion. A trajectory with high reasoning quality and a 0 completion is still a failed task — both belong on the dashboard.
- Running only the rule-based variant on subjective tasks. Indicator density rewards “because” and “therefore” even when the reasoning is wrong; pair with ReasoningQualityEval on sampled traces.
- Ignoring zero-thought trajectories. When an agent stops emitting Thought: steps after a prompt change, the metric will go to 0.3 (the implicit-fallback floor); that is a deploy-blocking regression, not a noisy data point (see the gate sketch below).
- Using the agent’s own model as the judge model in ReasoningQualityEval. Self-evaluation inflates scores; pin the judge to a different model family.
- Tuning the indicator list to fit your prompt. That converts the metric into a regex over your own template; keep the standard list and trust the score.
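To make the zero-thought mistake actionable, gate deploys on the fallback floor. A minimal sketch, assuming the evaluate(trajectory=...) shape from the example above; the 0.35 threshold (just above the 0.3 floor) is illustrative:
from fi.evals import ReasoningQuality

FLOOR_THRESHOLD = 0.35  # just above the implicit-fallback floor

def gate_deploy(regression_trajectories) -> bool:
    # Block the deploy if any trajectory sits at the no-thought fallback floor.
    metric = ReasoningQuality()
    failures = [
        t for t in regression_trajectories
        if metric.evaluate(trajectory=t).score < FLOOR_THRESHOLD
    ]
    if failures:
        print(f"BLOCK: {len(failures)} trajectory(ies) at or below the fallback floor")
        return False
    return True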
Frequently Asked Questions
What is reasoning quality in agent evaluation?
Reasoning quality is an agent-eval metric that scores the coherence, logical progression, and justification depth of an agent's per-step thoughts across a trajectory. FutureAGI provides both a rule-based and an LLM-judge implementation.
How is reasoning quality different from task completion?
Task completion measures whether the agent finished; reasoning quality measures how well it reasoned along the way. An agent can complete a task while emitting no thoughts at all and drop to the implicit-fallback floor on reasoning quality, which is exactly the signal you want when you specifically need ReAct-style transparency.
How do you measure reasoning quality?
FutureAGI's fi.evals.ReasoningQuality is rule-based — it counts thoughts, measures average length, and scores reasoning-indicator density. fi.evals.ReasoningQualityEval is the LLM-judge variant for nuanced rubric scoring. Pick based on speed and cost requirements.