How is agent planning different from the ReAct pattern?

Agent planning proposes or revises a path toward a goal. The ReAct pattern interleaves reasoning, action, and observation, so a ReAct agent may plan as part of its loop, but the pattern by itself does not guarantee a good plan.

How do you measure agent planning?

Measure agent planning with ToolSelectionAccuracy, GoalProgress, StepEfficiency, TaskCompletion, and trace fields such as agent.trajectory.step. Track eval-fail-rate-by-step, repeated steps, and token-cost-per-trace.

Agent Planning Definition, Metrics & Examples

What is agent planning?

Agent planning is how an AI agent chooses ordered steps, tools, constraints, and stop conditions before it acts. FutureAGI measures planning quality from traceAI spans and agent evaluators such as ToolSelectionAccuracy, GoalProgress, StepEfficiency, and TaskCompletion.

What Is Agent Planning?

Agent planning is the process an AI agent uses to choose ordered steps, tools, constraints, and stop conditions before it acts. It is an agent-reliability concept for multi-step LLM systems, visible in production traces as planner spans, tool-call decisions, memory reads, and trajectory steps. In FutureAGI, agent planning is measured from traceAI instrumentation and agent evals so engineers can catch wrong-tool calls, loops, wasted tokens, and unsafe actions before they reach users.

Why Agent Planning matters in production LLM and agent systems

Bad planning usually looks like general model unreliability until you inspect the trace. A support agent receives “refund the delayed order if shipping confirms loss” and plans search_docs -> draft_reply, skipping the shipping API and billing policy check. Another agent over-plans: it calls search, billing, search again, then escalates after six expensive steps. Both failures can produce fluent answers while hiding missing evidence, wrong permissions, or cost blowups.

Developers feel the pain first because the bug is not one prompt line. The failure is distributed across planner output, tool choice, memory state, and executor behavior. SRE sees p99 latency spikes and token-cost-per-trace jumps. Product sees inconsistent outcomes for the same user intent. Compliance sees weak audit trails: no clear record of why a tool was selected or why the agent stopped.

Common symptoms include repeated agent.trajectory.step values, high retry counts after planner spans, rising llm.token_count.prompt, tool calls with empty or stale arguments, and eval failures concentrated in multi-step cases. In 2026-era agent systems, planning also crosses MCP tools, framework graphs, long-context memory, and gateway routing. That raises the cost of a bad plan: one mistaken step can read the wrong account, write bad memory, call a paid API, and then produce a confident answer from polluted context.
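One of the symptoms above, repeated agent.trajectory.step values, can be checked with a few lines of plain Python. This is a minimal sketch, not a FutureAGI API: it assumes each trajectory record has already been exported as a dict with a "step" key, and the function name and the max_repeats parameter are hypothetical.

```python
from collections import Counter


def repeated_steps(trajectory, max_repeats=1):
    """Return {step_name: count} for steps seen more than max_repeats times."""
    counts = Counter(record["step"] for record in trajectory)
    return {name: n for name, n in counts.items() if n > max_repeats}


# Hypothetical export: one dict per agent.trajectory.step record.
trace = [
    {"step": "search"},
    {"step": "billing"},
    {"step": "search"},
    {"step": "escalate"},
]

print(repeated_steps(trace))  # {'search': 2}
```

A repeated step is not always a bug, so a check like this works better as a flag for review than as a hard failure.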

How FutureAGI handles Agent Planning

FutureAGI’s approach is to treat an agent plan as an evaluated trajectory, not as private reasoning text. With the langchain traceAI integration, planner, tool, retriever, and model spans share one trace context. The core fields are agent.trajectory.step, tool name, span status, latency, and llm.token_count.prompt. If the same workflow uses OpenAI Agents, the openai-agents traceAI integration gives the same production trace surface, so planning quality can be compared across framework versions instead of inferred from logs.

Example: a logistics agent plans classify_request -> lookup_order -> check_shipping -> apply_refund_policy -> draft_reply. FutureAGI attaches ToolSelectionAccuracy to the tool-choice step, GoalProgress to the overall trajectory, StepEfficiency to the number of steps, and TaskCompletion to the final outcome. If ToolSelectionAccuracy stays high but StepEfficiency drops, the planner likely chose correct tools but added unnecessary detours. If GoalProgress stalls after lookup_order, the planner may be missing a state transition into shipping verification.
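The two readings in the example above can be written down as simple rules. The sketch below is illustrative only: it assumes evaluator scores are already available as floats between 0 and 1, that a progress value can be recorded after each step, and that the 0.8 and 0.5 cutoffs are placeholders. The function names are hypothetical and are not part of any SDK.

```python
def diagnose(scores, high=0.8, low=0.5):
    """Map a score pattern to a hint about the planner (cutoffs are assumptions)."""
    hints = []
    if scores["tool_selection"] >= high and scores["step_efficiency"] < low:
        hints.append("correct tools, unnecessary detours")
    return hints


def stall_point(progress_by_step, done=1.0):
    """Return the first step that reached the final progress value when the
    trajectory ended short of `done`; return None if the goal was reached."""
    if not progress_by_step or progress_by_step[-1][1] >= done:
        return None
    final = progress_by_step[-1][1]
    for name, value in progress_by_step:
        if value >= final:
            return name
    return None


# Hypothetical cumulative progress recorded after each planned step.
progress = [
    ("classify_request", 0.2),
    ("lookup_order", 0.4),
    ("check_shipping", 0.4),
    ("draft_reply", 0.4),
]

print(diagnose({"tool_selection": 0.95, "step_efficiency": 0.3}))
print(stall_point(progress))
```

Here the stall point is lookup_order, which matches the case in the example where the plan lacks a working transition into shipping verification.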

Unlike ReAct prompting by itself, this does not assume that a reasoning string proves plan quality. The engineer sets an alert on eval-fail-rate-by-step, sends failing traces into a regression dataset, and gates the next prompt or model change on the same plan-quality checks. A fix might be a stop condition, a stricter tool schema, or a planner prompt patch, but the evidence comes from traceAI spans plus evaluator scores.
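The eval-fail-rate-by-step signal mentioned above can be approximated offline from exported evaluator results. The following is a sketch under stated assumptions: the record shape (a "step" name plus a boolean "passed") and the 0.25 alert threshold are hypothetical, and the code does not call FutureAGI.

```python
from collections import defaultdict


def fail_rate_by_step(results):
    """Return {step: failed / total} over a list of per-step eval results."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for result in results:
        totals[result["step"]] += 1
        if not result["passed"]:
            fails[result["step"]] += 1
    return {step: fails[step] / totals[step] for step in totals}


def steps_over_threshold(rates, threshold):
    """Return the steps whose fail rate exceeds the threshold, sorted by name."""
    return sorted(step for step, rate in rates.items() if rate > threshold)


# Hypothetical evaluator results, one record per evaluated step.
results = [
    {"step": "lookup_order", "passed": True},
    {"step": "lookup_order", "passed": True},
    {"step": "check_shipping", "passed": False},
    {"step": "check_shipping", "passed": True},
]

rates = fail_rate_by_step(results)
print(rates)                              # {'lookup_order': 0.0, 'check_shipping': 0.5}
print(steps_over_threshold(rates, 0.25))  # ['check_shipping']
```

The same function can run over a regression dataset before a prompt or model change ships, so the gate and the alert use one definition of failure.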

How to measure or detect Agent Planning quality

Measure agent planning at the trajectory level and at each planned decision point:

agent.trajectory.step: records ordered agent steps, making skipped verification, repeated loops, and premature stops visible.
ToolSelectionAccuracy: returns a tool-choice score for whether the selected tool matched the intent and state.
GoalProgress: scores whether the trajectory moved the agent toward the requested goal.
StepEfficiency: scores whether the plan used unnecessary or repeated steps.
Dashboard signals: eval-fail-rate-by-step, p99 latency, retry count, token-cost-per-trace, and tool-timeout rate by planned route.
User proxies: thumbs-down rate, human-escalation rate, and reopened-ticket rate for workflows with the same initial intent.

Minimal evaluator sketch:

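# Evaluator classes named in this article; check the SDK docs for exact call signatures.
from fi.evals import ToolSelectionAccuracy, GoalProgress, StepEfficiency

# Ordered agent steps, listed in the order they ran.
trajectory = [{"step": "lookup_order"}, {"step": "check_shipping"}]

# Score the same trajectory on tool choice, progress toward the goal, and wasted steps.
scores = {
    "tool_selection": ToolSelectionAccuracy().evaluate(trajectory),
    "goal_progress": GoalProgress().evaluate(trajectory),
    "step_efficiency": StepEfficiency().evaluate(trajectory),
}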

Set thresholds per workflow instead of one global score. A refund agent may require perfect ToolSelectionAccuracy before any billing write, while a research agent may tolerate extra search steps if GoalProgress improves. Review failed trajectories weekly and promote representative traces into regression evals so a prompt, tool schema, or model change cannot reintroduce the same planning error.
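Per-workflow thresholds can live in a small table that a release gate reads. This is a minimal sketch with made-up numbers: the workflow names, metric keys, and minimum scores are assumptions chosen to mirror the refund and research cases above, and scores are assumed to be floats between 0 and 1.

```python
# Hypothetical minimum scores per workflow; tune these against your own traces.
THRESHOLDS = {
    "refund": {"tool_selection": 1.0, "goal_progress": 0.8},
    "research": {"tool_selection": 0.7, "goal_progress": 0.8},
}


def failing_metrics(workflow, scores):
    """Return the metrics whose score is below that workflow's minimum."""
    limits = THRESHOLDS[workflow]
    # A metric missing from `scores` counts as 0.0, so it fails the gate.
    return sorted(
        metric for metric, minimum in limits.items()
        if scores.get(metric, 0.0) < minimum
    )


scores = {"tool_selection": 0.9, "goal_progress": 0.9}
print(failing_metrics("refund", scores))    # ['tool_selection']
print(failing_metrics("research", scores))  # []
```

The same scores block the refund workflow and pass the research workflow, which is the reason to avoid one global cutoff.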

Common mistakes

Treating a natural-language plan as proof of progress. The plan can look sensible while the trace shows repeated lookup, retries, skipped verification, or a missing state transition.
Scoring only final answers. Correct-looking outputs can arrive after expensive retries, so a release gate that checks only the answer never sees the latency, token cost, and wrong-tool recovery paths that become costly during traffic spikes.
Letting the planner choose tools without current state. Stale memory can route the agent to the wrong account, document, or customer record when tools share similar names.
Using one max-step limit as the control. It catches infinite loops late and misses short plans that skip authorization, policy, or evidence checks.
Merging planner and executor spans. Without separate agent.trajectory.step records, teams cannot tell during incident review whether a failure came from tool choice, tool output, or synthesis.
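The max-step mistake above suggests pairing a step limit with checks for steps that must not be skipped. The sketch below shows one way to do that before execution. It is not a FutureAGI feature: the rule format, the function name, and the default limit of 8 are hypothetical, and the step names are borrowed from the examples earlier in this article.

```python
def plan_violations(plan, required_before, max_steps=8):
    """Check a planned step list against a step limit and prerequisite rules.

    required_before maps a guarded step to the steps that must appear earlier.
    """
    problems = []
    if len(plan) > max_steps:
        problems.append(f"plan has {len(plan)} steps, limit is {max_steps}")
    for guarded, prerequisites in required_before.items():
        if guarded not in plan:
            continue
        # Only steps planned before the first occurrence of the guarded step count.
        seen = set(plan[: plan.index(guarded)])
        for step in prerequisites:
            if step not in seen:
                problems.append(f"{guarded} planned without prior {step}")
    return problems


# Hypothetical rule: a reply may only be drafted after shipping and policy checks.
rules = {"draft_reply": ["check_shipping", "apply_refund_policy"]}

short_plan = ["search_docs", "draft_reply"]
full_plan = [
    "classify_request",
    "lookup_order",
    "check_shipping",
    "apply_refund_policy",
    "draft_reply",
]

print(plan_violations(short_plan, rules))
print(plan_violations(full_plan, rules))  # []
```

The two-step plan passes any reasonable step limit yet fails both prerequisite rules, which is the gap a max-step control leaves open.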