What Is Step Efficiency (Agent Eval)?
An agent-eval metric scoring how few steps the agent used relative to optimal, penalising redundant tool calls and failed actions.
Step efficiency is an agent-evaluation metric that scores how cheaply an agent reached its outcome — fewer steps relative to optimal, no redundant tool calls, low failure rate. It blends three signals across the trajectory: the ratio of expected to actual steps (40%), a redundancy score that penalises duplicate tool.name + arguments signatures (30%), and the success rate of every individual ToolCall (30%). The metric returns a 0–1 score with a details dict. In FutureAGI it is the StepEfficiency class in fi.evals, used to surface token-cost regressions and infinite-loop risk before they reach production billing.
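The weighted blend described above can be sketched as a small function. This is an illustrative re-implementation of the arithmetic, not the library's internal code; the weights are the documented 40/30/30 defaults and the input signals are assumed to be pre-computed per trajectory.

```python
# Illustrative sketch of the documented 40/30/30 blend (not fi.evals internals).

def blend_step_efficiency(step_ratio: float,
                          redundancy_score: float,
                          success_rate: float,
                          weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted combination of the three per-trajectory signals, clamped to 0-1."""
    w_step, w_red, w_succ = weights
    score = w_step * step_ratio + w_red * redundancy_score + w_succ * success_rate
    return max(0.0, min(1.0, score))

# Example: 5 steps against 4 expected (ratio 0.8), one duplicate call among
# five (redundancy 0.8), every tool call succeeded (success 1.0):
score = blend_step_efficiency(0.8, 0.8, 1.0)  # 0.4*0.8 + 0.3*0.8 + 0.3*1.0 = 0.86
```

A clean trajectory (all three signals at 1.0) scores 1.0; degradation in any one signal pulls the blended score down proportionally to its weight.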
Why It Matters in Production LLM and Agent Systems
Two agents can produce the same correct answer with wildly different cost profiles. One reaches the goal in three clean steps; the other loops twelve times across redundant tool calls and a retry storm. From the user’s perspective both succeeded; from the engineering team’s perspective only one is sustainable. Step efficiency is the metric that makes that gap visible.
The pain shows up in three failure modes. The first is runaway cost: an agent that quietly takes 4× the steps after a prompt update, multiplying token spend without any change in the answer-quality signal. The second is loop risk: an agent that calls the same tool with the same arguments three times in a row before timing out — a near-miss for an infinite-loop-agent incident that a step-efficiency drop would have flagged in regression. The third is retry-storm: tool failures cascade as the agent re-issues the same call, masking a backend outage as agent slowness.
In 2026 stacks where agents run dozens of steps across MCP tools, sub-agents, and RAG retrievers, step efficiency is the early-warning signal for budget incidents. A 2-point drop on a regression dashboard maps to a real percentage on the monthly inference bill, and a redundant_steps > 0 count maps directly to wasted spend that compounds at fleet scale.
How FutureAGI Handles Step Efficiency
FutureAGI’s approach is a deterministic three-signal score that adapts to whatever baseline you have. The fi.evals.StepEfficiency class consumes an AgentTrajectoryInput and computes: a step-ratio component using expected_trajectory length when provided, falling back to task.max_steps, falling back to a built-in heuristic (1.0 if total_steps ≤ 10, else 10/total_steps); a redundancy component that hashes each tool call as tool.name:json.dumps(arguments, sort_keys=True) and penalises every duplicate; and a failure component using tool_call.success flags from the trajectory. Default weights are 40/30/30, configurable via the constructor (StepEfficiency(config={"expected_step_weight": 0.5, ...})). The output is a 0–1 score plus a details dict with total_steps, redundant_steps, failed_calls, and the active step baseline.
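Under the description above, the three components could look roughly like this. This is a hypothetical sketch, not the fi.evals source: the function names and the tool-call dict shape are illustrative assumptions; only the fallback order, the 10-step heuristic, and the name-plus-sorted-arguments signature come from the documented behaviour.

```python
import json

def step_ratio(total_steps, expected_steps=None, max_steps=None):
    """Expected-to-actual step ratio, with the documented fallback chain:
    expected_trajectory length -> task.max_steps -> built-in heuristic."""
    baseline = expected_steps or max_steps
    if baseline is not None:
        return min(1.0, baseline / total_steps)
    # Built-in heuristic: perfect up to 10 steps, then decays as 10/total.
    return 1.0 if total_steps <= 10 else 10 / total_steps

def redundancy_score(tool_calls):
    """Fraction of calls that are not duplicates of an earlier
    tool.name:json.dumps(arguments, sort_keys=True) signature."""
    if not tool_calls:
        return 1.0
    seen, duplicates = set(), 0
    for call in tool_calls:
        sig = f"{call['name']}:{json.dumps(call['args'], sort_keys=True)}"
        if sig in seen:
            duplicates += 1
        seen.add(sig)
    return 1.0 - duplicates / len(tool_calls)

def success_rate(tool_calls):
    """Share of tool calls whose success flag is True."""
    if not tool_calls:
        return 1.0
    return sum(c["success"] for c in tool_calls) / len(tool_calls)
```

Sorting argument keys before hashing means `{"id": 7, "q": "x"}` and `{"q": "x", "id": 7}` produce the same signature, so reordered-but-identical calls still count as duplicates.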
Concretely: a customer-support agent team using traceAI-langchain runs StepEfficiency on every nightly regression of their 800-task golden set. They alert on step_efficiency < 0.65 per cohort and on redundant_steps > 0 aggregated by tool name. When a model swap causes the agent to repeat a lookup_order call twice in 18% of trajectories, the efficiency score drops from 0.88 to 0.74 — caught in CI, not in the AWS bill at month-end. They pair the metric with TrajectoryScore so they see the full picture (completion held, efficiency dropped) and route the regression directly to a prompt fix. Compared with Arize Phoenix’s agent_evaluation traces (manual span inspection), StepEfficiency is one number that compresses path length, redundancy, and failure rate into a single alertable signal.
How to Measure or Detect It
Key measurement signals tied to StepEfficiency:
- fi.evals.StepEfficiency — returns a 0–1 score with total_steps, redundant_steps, failed_calls in details. Threshold per cohort.
- agent.trajectory.step OTel attribute — every step span carries step_number, tool_calls, and success flags; the eval reads these directly.
- Tokens-per-task dashboard signal — pair step efficiency with llm.token_count.total per trajectory; a divergence means cost-per-task regressed without quality regressing.
- Redundant-tool-call alert — a non-zero redundant_steps count over a 24h window is a loop-precursor signal; route to incident review.
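As a sketch of how these signals might gate a nightly regression run: the `gate` helper and the result-dict shape below are hypothetical; only the score and the details keys (`redundant_steps`) come from the metric's documented output, and the 0.65 threshold mirrors the cohort alert discussed earlier.

```python
# Hypothetical CI gate over a batch of StepEfficiency results.
# Each entry stands in for one trajectory's eval output.

def gate(results, min_score=0.65):
    """Collect failures: cohort scores below threshold, plus any trajectory
    with redundant steps (a loop-precursor signal worth routing to review)."""
    failures = []
    for r in results:
        if r["score"] < min_score:
            failures.append((r["task_id"], "low_efficiency"))
        if r["details"]["redundant_steps"] > 0:
            failures.append((r["task_id"], "redundant_steps"))
    return failures

results = [
    {"task_id": "t1", "score": 0.88, "details": {"redundant_steps": 0}},
    {"task_id": "t2", "score": 0.60, "details": {"redundant_steps": 2}},
]
print(gate(results))  # t2 trips both checks
```

Keeping the two checks separate matters: a trajectory can pass the score threshold while still carrying redundant calls, and that non-zero count is the earlier warning.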
Minimal Python:

```python
from fi.evals import StepEfficiency

metric = StepEfficiency()
result = metric.evaluate(
    trajectory=run.trajectory,
    expected_trajectory=run.gold_trajectory,
    task={"max_steps": 10},
)
print(result.score, result.details)
```
Common Mistakes
- Confusing step efficiency with trajectory score. Step efficiency is one dimension of agent quality; trajectory score is comprehensive. They are not interchangeable.
- Running it without expected_trajectory or max_steps. The eval falls back to a generic heuristic that under-penalises long trajectories — supply a baseline.
- Treating a high efficiency score as success. An agent can be highly efficient at the wrong task. Always pair with TaskCompletion.
- Ignoring the redundancy signal. A small redundancy count today is a loop incident next month — alert on it, do not just chart it.
- Tuning weights without re-baselining. Changing expected_step_weight mid-program makes historical scores incomparable; fork a new metric name.
Frequently Asked Questions
What is step efficiency in agent evaluation?
Step efficiency is an agent-eval metric that scores how few steps an agent used to reach its outcome, penalising redundant tool calls and failed actions. It returns a 0–1 score blended from step-ratio, redundancy, and call-success components.
How is step efficiency different from trajectory score?
Step efficiency is a single dimension — path length and redundancy. Trajectory score is comprehensive — it includes step efficiency as one of three components, alongside task completion and tool selection. Use step efficiency to track cost and loop risk; use trajectory score for overall agent quality.
How do you measure step efficiency?
FutureAGI's fi.evals.StepEfficiency consumes the agent trajectory plus an optional expected_trajectory or task.max_steps and returns a 0–1 score with a details dict containing total_steps, expected_steps, redundant_steps, and failed_calls.