What Is Agent Planning?

The step-selection process an AI agent uses to choose tools, order, constraints, and stop conditions before acting.

Agent planning is the process an AI agent uses to choose ordered steps, tools, constraints, and stop conditions before it acts. It is an agent-reliability concept for multi-step LLM systems, visible in production traces as planner spans, tool-call decisions, memory reads, and trajectory steps. In FutureAGI, agent planning is measured from traceAI instrumentation and agent evals so engineers can catch wrong-tool calls, loops, wasted tokens, and unsafe actions before they reach users. As of May 2026, frontier reasoning models (GPT-5.x, Claude Opus 4.7, Gemini 3 Pro) usually plan well zero-shot but still degrade past 10-12 steps without explicit progress evals.

Why Agent Planning matters in production LLM and agent systems

Bad planning usually looks like general model unreliability until you inspect the trace. A support agent receives “refund the delayed order if shipping confirms loss” and plans search_docs -> draft_reply, skipping the shipping API and billing policy check. Another agent over-plans: it calls search, billing, search again, then escalates after six expensive steps. Both failures can produce fluent answers while hiding missing evidence, wrong permissions, or cost blowups.

Developers feel the pain first because the bug is not in one prompt line: the failure is distributed across planner output, tool choice, memory state, and executor behavior. SREs see p99 latency spikes and token-cost-per-trace jumps. Product teams see inconsistent outcomes for the same user intent. Compliance teams see weak audit trails, with no clear record of why a tool was selected or why the agent stopped.

Common symptoms include repeated agent.trajectory.step values, high retry counts after planner spans, rising llm.token_count.prompt, tool calls with empty or stale arguments, and eval failures concentrated in multi-step cases. In 2026-era agent systems, planning also crosses MCP tools, agent framework graphs, long-context memory, and gateway routing. That raises the cost of a bad plan: one mistaken step can read the wrong account, write bad memory, call a paid API, and then produce a confident answer from polluted context. Planning-quality headroom is large on public references: on GAIA (Meta, 3-tier multi-hop assistant set) frontier agents still solve under half of Level 2 tasks, and on SWE-Bench Verified (500 human-validated GitHub issues) most failures trace back to the planning step that selected the wrong file or skipped a required sub-task before the executor ever ran.
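The first symptom, repeated agent.trajectory.step values, can be checked offline once step names are exported from a trace. The sketch below is a minimal illustration and not FutureAGI code: it assumes the steps of one trace are already available as a plain list of strings, and the function names and the max_repeats default are hypothetical.

```python
from collections import Counter


def repeated_steps(steps, max_repeats=1):
    """Return {step_name: count} for steps that occur more than max_repeats times."""
    counts = Counter(steps)
    return {name: n for name, n in counts.items() if n > max_repeats}


def longest_consecutive_run(steps):
    """Return (step_name, length) for the longest run of one step repeated in a row."""
    best_step, best_len = None, 0
    run_step, run_len = None, 0
    for step in steps:
        if step == run_step:
            run_len += 1
        else:
            run_step, run_len = step, 1
        if run_len > best_len:
            best_step, best_len = run_step, run_len
    return best_step, best_len


# Assumed export: step names taken from one trace, in execution order.
trace_steps = ["search", "billing", "search", "search", "escalate"]
loop_report = repeated_steps(trace_steps)
longest_run = longest_consecutive_run(trace_steps)
```

Counting total repeats surfaces over-planning across the whole trace, while the consecutive-run check points at tight retry loops; which limit is acceptable depends on the workflow.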

How FutureAGI handles Agent Planning

FutureAGI’s approach is to treat an agent plan as an evaluated trajectory, not as private reasoning text. With the langchain traceAI integration, planner, tool, retriever, and model spans share one trace context. The core fields are agent.trajectory.step, tool name, span status, latency, and llm.token_count.prompt. If the same workflow uses OpenAI Agents, the openai-agents traceAI integration gives the same production trace surface, so planning quality can be compared across framework versions instead of inferred from logs.

Example: a logistics agent on GPT-5.1 plans classify_request -> lookup_order -> check_shipping -> apply_refund_policy -> draft_reply. FutureAGI attaches ToolSelectionAccuracy to the tool-choice step, GoalProgress to the overall trajectory, StepEfficiency to the number of steps, and TaskCompletion to the final outcome. If ToolSelectionAccuracy stays high but StepEfficiency drops, the planner likely chose correct tools but added unnecessary detours. If GoalProgress stalls after lookup_order, the planner may be missing a state transition into shipping verification.
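The two readings above can be written down as a small triage rule. This is a hedged sketch: it assumes each evaluator result has already been reduced to a float between 0.0 and 1.0 where higher is better, and the diagnose function, the dictionary keys, and the low and high cutoffs are all hypothetical rather than part of FutureAGI.

```python
def diagnose(scores, low=0.5, high=0.8):
    """Map evaluator scores (assumed 0.0-1.0, higher is better) to a triage hint."""
    tool = scores["tool_selection"]
    progress = scores["goal_progress"]
    efficiency = scores["step_efficiency"]
    # Tools were right, but the plan took more steps than it needed.
    if tool >= high and efficiency < low:
        return "correct tools, unnecessary detours"
    # The trajectory stopped moving toward the goal.
    if progress < low:
        return "goal progress stalled: check for a missing state transition"
    # The planner picked a tool that did not match intent or state.
    if tool < low:
        return "wrong tool selected"
    return "no planning issue flagged"


hint = diagnose(
    {"tool_selection": 0.9, "goal_progress": 0.9, "step_efficiency": 0.3}
)
```

The rule order is a design choice: the detour case is checked first because it is the most specific combination, and a real triage rule would likely be tuned per workflow.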

Unlike ReAct prompting by itself, this does not assume that a reasoning string proves plan quality. The engineer sets an alert on eval-fail-rate-by-step, sends failing traces into a regression dataset, and gates the next prompt or model change on the same plan-quality checks. A fix might be a stop condition, a stricter tool schema, or a planner prompt patch, but the evidence comes from traceAI spans plus evaluator scores.
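An eval-fail-rate-by-step signal can be approximated from exported evaluation results. The sketch below assumes each result is a (step_name, passed) pair; that record shape, the function names, and the 0.4 alert threshold are illustrative assumptions, not the FutureAGI alerting API.

```python
from collections import defaultdict


def fail_rate_by_step(results):
    """results: iterable of (step_name, passed) pairs. Returns {step_name: fail_rate}."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for step, passed in results:
        totals[step] += 1
        if not passed:
            fails[step] += 1
    return {step: fails[step] / totals[step] for step in totals}


def steps_over_threshold(rates, threshold):
    """Return the step names whose fail rate exceeds the threshold, sorted by name."""
    return sorted(step for step, rate in rates.items() if rate > threshold)


# Hypothetical evaluation results gathered across several traces.
results = [
    ("lookup_order", True), ("lookup_order", True),
    ("check_shipping", False), ("check_shipping", True),
    ("apply_refund_policy", False), ("apply_refund_policy", False),
]
rates = fail_rate_by_step(results)
alerts = steps_over_threshold(rates, threshold=0.4)
```

Grouping failures by step rather than by trace is what lets a release gate say which planned decision regressed, instead of only reporting that the workflow got worse.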

Planning patterns and when they fit

There is no single best planning pattern; the right one depends on the task shape. The table below is the FutureAGI default mapping.

Pattern | Best for | Failure mode FutureAGI catches
ReAct (reason + act per step) | Open-ended research, debugging | Plan drift via GoalProgress flatlines
Plan-and-execute (plan once, then run) | Structured workflows | Plan-staleness via ToolSelectionAccuracy
Tree-of-thought | Hard reasoning, multiple solutions | Wasted branches via StepEfficiency
Hierarchical planning | Complex multi-tool workflows | Sub-plan failure via per-agent slicing
LLM compiler (parallel calls) | Independent sub-tasks | Wrong dependency graph via trajectory order
Workflow templates (AWM) | Frequent patterns | Wrong template via ToolSelectionAccuracy

How to measure or detect Agent Planning quality

Measure agent planning at the trajectory level and at each planned decision point:

  • agent.trajectory.step: records ordered agent steps, making skipped verification, repeated loops, and premature stops visible.
  • ToolSelectionAccuracy: returns a tool-choice score for whether the selected tool matched the intent and state.
  • GoalProgress: scores whether the trajectory moved the agent toward the requested goal.
  • StepEfficiency: scores whether the plan used unnecessary or repeated steps.
  • Dashboard signals: eval-fail-rate-by-step, p99 latency, retry count, token-cost-per-trace, and tool-timeout rate by planned route.
  • User proxies: thumbs-down rate, human-escalation rate, and reopened-ticket rate for workflows with the same initial intent.

Minimal evaluator sketch:

# Planning evaluators named in this section.
from fi.evals import ToolSelectionAccuracy, GoalProgress, StepEfficiency

# Ordered steps for one trace, in the order the agent executed them.
trajectory = [{"step": "lookup_order"}, {"step": "check_shipping"}]

# Score the same trajectory with each evaluator.
scores = {
    "tool_selection": ToolSelectionAccuracy().evaluate(trajectory),
    "goal_progress": GoalProgress().evaluate(trajectory),
    "step_efficiency": StepEfficiency().evaluate(trajectory),
}

Set thresholds per workflow instead of one global score. A refund agent may require perfect ToolSelectionAccuracy before any billing write, while a research agent may tolerate extra search steps if GoalProgress improves. Review failed trajectories weekly and promote representative traces into regression evals so a prompt, tool schema, or model change cannot reintroduce the same planning error.
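Per-workflow thresholds can be kept as plain configuration and checked before a release. This is a sketch under stated assumptions: scores are floats between 0.0 and 1.0, and the THRESHOLDS values, workflow names, and failed_checks helper are hypothetical examples rather than FutureAGI defaults.

```python
# Hypothetical minimum scores per workflow; the numbers are illustrative only.
THRESHOLDS = {
    "refund": {"tool_selection": 1.0, "goal_progress": 0.8, "step_efficiency": 0.7},
    "research": {"tool_selection": 0.8, "goal_progress": 0.8, "step_efficiency": 0.4},
}


def failed_checks(workflow, scores):
    """Return the metric names that fall below the workflow's minimum, sorted."""
    minimums = THRESHOLDS[workflow]
    # A missing metric is treated as 0.0 so it fails rather than silently passing.
    return sorted(
        metric for metric, floor in minimums.items()
        if scores.get(metric, 0.0) < floor
    )


scores = {"tool_selection": 0.9, "goal_progress": 0.85, "step_efficiency": 0.5}
refund_failures = failed_checks("refund", scores)
research_failures = failed_checks("research", scores)
```

The same scores fail the refund gate and pass the research gate, which is the point of setting thresholds per workflow instead of one global cutoff.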

Common mistakes

  • Treating a natural-language plan as proof of progress. The plan can look sensible while the trace shows repeated lookup, retries, skipped verification, or a missing state transition.
  • Scoring only final answers. Correct-looking outputs can arrive after expensive retries, hiding latency, token cost, and wrong-tool recovery paths from release gates during traffic spikes.
  • Letting the planner choose tools without current state. Stale memory can route the agent to the wrong account, document, or customer record when tools share similar names.
  • Using one max-step limit as the control. It catches infinite loops late and misses short plans that skip authorization, policy, or evidence checks.
  • Merging planner and executor spans. Without separate agent.trajectory.step records, teams cannot locate whether failure came from choice, tool output, or synthesis during incident review.
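The max-step mistake above can be illustrated with a check that combines a step limit with required-step ordering. The sketch below is an assumption-laden example: plan_violations and the REQUIRED_BEFORE mapping are hypothetical, and the step names are reused from the refund examples earlier in this article.

```python
def plan_violations(steps, required_before, max_steps=10):
    """Return a list of violation messages for one ordered plan.

    required_before maps a guarded step to the steps that must appear earlier.
    """
    violations = []
    if len(steps) > max_steps:
        violations.append(f"plan has {len(steps)} steps, limit is {max_steps}")
    seen = set()
    for step in steps:
        for prerequisite in required_before.get(step, []):
            if prerequisite not in seen:
                violations.append(f"{step} ran without prior {prerequisite}")
        seen.add(step)
    return violations


# Hypothetical policy: a reply may only be drafted after shipping and policy checks.
REQUIRED_BEFORE = {"draft_reply": ["check_shipping", "apply_refund_policy"]}

short_plan = ["search_docs", "draft_reply"]
full_plan = [
    "classify_request", "lookup_order", "check_shipping",
    "apply_refund_policy", "draft_reply",
]
short_plan_issues = plan_violations(short_plan, REQUIRED_BEFORE)
full_plan_issues = plan_violations(full_plan, REQUIRED_BEFORE)
```

The two-step plan passes any reasonable max-step limit yet fails both prerequisite checks, while the five-step plan passes cleanly.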

Frequently Asked Questions

What is agent planning?

Agent planning is how an AI agent chooses ordered steps, tools, constraints, and stop conditions before it acts. FutureAGI measures planning quality from traceAI spans and agent evaluators such as ToolSelectionAccuracy, GoalProgress, StepEfficiency, and TaskCompletion.

How is agent planning different from the ReAct pattern?

Agent planning proposes or revises a path toward a goal. The ReAct pattern interleaves reasoning, action, and observation, so a ReAct agent may use planning but does not guarantee a good plan.

How do you measure agent planning?

Measure agent planning with ToolSelectionAccuracy, GoalProgress, StepEfficiency, TaskCompletion, and trace fields such as agent.trajectory.step. Track eval-fail-rate-by-step, repeated steps, and token-cost-per-trace.