What Is an Agent Trajectory?
The ordered sequence of steps, tool calls, observations, retries, and final state produced during a multi-step agent run.
An agent trajectory is the ordered path an AI agent takes from task intake to final response, including planning, reasoning notes, tool calls, observations, retries, handoffs, and termination state. In evaluation work, it is the evidence record used to judge multi-step agent behavior instead of scoring only the final answer. The trajectory appears in production traces and eval datasets, where FutureAGI metrics such as TrajectoryScore read agent.trajectory.step spans to detect loops, wrong tools, stalled progress, and unsafe actions.
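As a mental model, a trajectory can be sketched as a typed record of steps. This is a minimal sketch for illustration only; the class and field names here are assumptions, not FutureAGI's actual span format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TrajectoryStep:
    """One step in an agent run (hypothetical shape, not a real schema)."""
    index: int
    action: str                     # e.g. "plan", "tool_call", "handoff", "final"
    tool: Optional[str] = None      # tool name when action == "tool_call"
    observation: Optional[str] = None
    retried: bool = False

@dataclass
class Trajectory:
    """An ordered agent run from task intake to termination."""
    task: str
    steps: List[TrajectoryStep] = field(default_factory=list)

    def termination_state(self) -> str:
        # The last recorded action tells you how the run ended.
        return self.steps[-1].action if self.steps else "empty"

# A tiny refund-agent run expressed in this shape.
run = Trajectory(task="issue refund for order 1042")
run.steps.append(TrajectoryStep(0, "plan"))
run.steps.append(TrajectoryStep(1, "tool_call", tool="lookup_order"))
run.steps.append(TrajectoryStep(2, "final"))
```

Keeping the run in a structured form like this, rather than as free-text logs, is what makes step-level evaluation and comparison between runs possible.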
Why Agent Trajectories Matter in Production LLM and Agent Systems
Scoring only the final answer hides the failure modes that make agents expensive and hard to operate. An agent can produce a correct refund message after calling the same order API four times, selecting the wrong shipping tool once, and recovering from two failed retries. A user may see success, but engineering sees higher latency, higher token cost, and a larger chance of a bad action on the next run.
The pain lands across the stack. Developers lose the causal story behind a regression because the final answer does not show the path. SREs see p99 latency, tool-timeout spikes, retry storms, and token-cost-per-trace increases without knowing which step caused the jump. Compliance and security teams care because trajectories show whether the agent touched an unapproved system, exposed sensitive context, or skipped a required guardrail. Product teams feel it as inconsistent user outcomes: two users ask the same thing and get different execution paths.
This is especially relevant for 2026-era agentic systems. A single request may include MCP tool calls, RAG retrieval, a planner, a verifier, and a sub-agent handoff. Each step is a chance for an agent loop, runaway cost, wrong-tool selection, or partial completion. The trajectory is the only artifact that ties those events together into one debuggable record.
How FutureAGI Handles Agent Trajectories
FutureAGI’s approach is to make the trajectory both traceable and scorable. The anchor surface for this term is the TrajectoryScore evaluator in fi.evals. In a support-agent workflow, traceAI-langchain captures each planner decision and tool call as an agent.trajectory.step span, then the same structured run is attached to an eval dataset. TrajectoryScore becomes the headline check, while StepEfficiency, ToolSelectionAccuracy, TaskCompletion, GoalProgress, and ReasoningQuality explain which part of the path failed.
Example: a refunds agent should inspect policy, fetch an order, decide eligibility, and issue or deny a refund. A new prompt still returns the right final text in most tests, but traces show two redundant lookup_order calls and one detour through a shipping tool. FutureAGI scores the trajectory, flags the ToolSelectionAccuracy and StepEfficiency drop, and links the failed eval row back to the production trace. The engineer does not tune against the final answer alone; they add a regression case, tighten the tool-selection rubric, and set an alert when trajectory score falls below the release threshold. Unlike Ragas faithfulness, which mainly checks whether an answer is supported by retrieved context, trajectory evaluation treats step order, action choice, and progress as first-class evidence.
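The redundant lookup_order pattern in this example can be caught with a simple step-count check before any model-based scoring runs. This is an illustrative sketch; the step dicts with a "tool" key are an assumed shape, not a FutureAGI API.

```python
from collections import Counter

def redundant_tool_calls(steps, threshold=1):
    """Return tools called more often than `threshold` times in one run."""
    counts = Counter(s["tool"] for s in steps if s.get("tool"))
    return {tool: n for tool, n in counts.items() if n > threshold}

# The trace from the refunds example: two lookups and a shipping detour.
trace = [
    {"tool": "inspect_policy"},
    {"tool": "lookup_order"},
    {"tool": "lookup_order"},    # redundant second lookup
    {"tool": "check_shipping"},  # detour through a shipping tool
    {"tool": "issue_refund"},
]
```

A cheap deterministic check like this complements the evaluator: it flags the obvious redundancy immediately, while TrajectoryScore and its component metrics judge whether the detour actually hurt the run.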
How to Measure or Detect an Agent Trajectory
Use these signals together rather than treating the trajectory as plain log text:
- fi.evals.TrajectoryScore: returns a 0-1 trajectory quality score that can gate a release or alert on a production cohort.
- agent.trajectory.step: the trace field to inspect for step count, selected tool, tool result, retry state, and termination state.
- Component regressions: pair the trajectory with StepEfficiency, ToolSelectionAccuracy, GoalProgress, and ReasoningQuality to find the failing dimension.
- Dashboard signals: track trajectory-score-by-cohort, token-cost-per-trace, eval-fail-rate-by-agent-version, and p99 latency per trajectory length.
- User-feedback proxy: rising thumbs-down rate or escalation rate on long trajectories often points to loop risk or partial completion.
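The dashboard signals above reduce to a small aggregation over trace rows. This sketch assumes each row carries a cohort label, a trajectory score, and a token count; that row shape is an assumption for illustration, not FutureAGI's schema.

```python
from collections import defaultdict
from statistics import mean

def cohort_summary(rows):
    """Aggregate trajectory-score-by-cohort and token-cost-per-trace."""
    by_cohort = defaultdict(list)
    for row in rows:
        by_cohort[row["cohort"]].append(row)
    return {
        cohort: {
            "mean_trajectory_score": mean(r["score"] for r in group),
            "mean_tokens_per_trace": mean(r["tokens"] for r in group),
            "traces": len(group),
        }
        for cohort, group in by_cohort.items()
    }

# Example rows, as they might be exported from a trace store.
rows = [
    {"cohort": "refunds-v2", "score": 0.9, "tokens": 1200},
    {"cohort": "refunds-v2", "score": 0.7, "tokens": 1800},
    {"cohort": "refunds-v1", "score": 0.8, "tokens": 1000},
]
```

Grouping by cohort (agent version, prompt version, or traffic segment) is what turns a stream of individual trajectories into the dashboard signals listed above.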
Minimal Python:
from fi.evals import TrajectoryScore

metric = TrajectoryScore()
result = metric.evaluate(
    trajectory=run.trajectory,
    task=run.task_definition,
)
print(result.score)
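The same score can gate a release. A minimal sketch, assuming each run's result.score has already been collected into a plain list of floats; the threshold value is illustrative, not a recommended default.

```python
RELEASE_THRESHOLD = 0.75  # assumed threshold, tune per agent and risk tolerance

def gate_release(scores, threshold=RELEASE_THRESHOLD):
    """Block the release if the mean trajectory score dips below threshold."""
    mean_score = sum(scores) / len(scores)
    return {"mean_score": mean_score, "passed": mean_score >= threshold}
```

The same comparison can back a production alert: evaluate a rolling cohort of traces and page when the mean score crosses the threshold.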
Common Mistakes
- Treating trajectories as raw logs. Without a schema for steps, actions, observations, and tool results, evals cannot compare runs.
- Scoring only the final answer. The answer can be correct while the path shows wrong tools, redundant calls, or unsafe intermediate actions.
- Ignoring failed intermediate calls. A recovered timeout still matters; it predicts p99 latency and retry-storm risk.
- Mixing agent versions in one baseline. Compare trajectories by prompt, model, tool catalog, and framework version, or regressions blur together.
- Storing private reasoning verbatim. Step summaries and action metadata are usually enough; avoid retaining sensitive reasoning internals when policy forbids it.
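To keep baselines from blurring across versions, runs can be bucketed by a configuration key built from the same four dimensions named above. The field names here are hypothetical placeholders for whatever metadata a team records per run.

```python
def baseline_key(run):
    """Build a comparison key from the run's configuration metadata."""
    return (
        run["prompt_version"],
        run["model"],
        run["tool_catalog"],
        run["framework_version"],
    )

def group_runs(runs):
    """Bucket runs so trajectories are only compared within one configuration."""
    groups = {}
    for run in runs:
        groups.setdefault(baseline_key(run), []).append(run)
    return groups

# Two runs that differ only in model should land in separate baselines.
runs = [
    {"prompt_version": "p7", "model": "model-a", "tool_catalog": "v3", "framework_version": "lc-0.2"},
    {"prompt_version": "p7", "model": "model-b", "tool_catalog": "v3", "framework_version": "lc-0.2"},
]
```

Comparing a trajectory only against runs with an identical key prevents a model upgrade and a prompt regression from canceling each other out in the aggregate.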
Frequently Asked Questions
What is an agent trajectory?
An agent trajectory is the ordered record of a multi-step agent run, including the goal, reasoning notes, tool calls, observations, retries, handoffs, and final answer. It is the evidence used to evaluate how the agent reached the outcome.
How is an agent trajectory different from trajectory score?
An agent trajectory is the raw structured path the agent took. Trajectory score is an evaluation metric that scores that path, usually alongside task completion, step efficiency, and tool selection accuracy.
How do you measure an agent trajectory?
FutureAGI uses fi.evals.TrajectoryScore on trajectory records and trace fields such as agent.trajectory.step. Teams track the score, component regressions, and step-level anomalies across eval datasets and production traces.