Evaluation

What Is Trajectory Score?

Trajectory score is a composite agent-evaluation metric that summarises an entire multi-step agent run in a single number. It blends three sub-scores, each computed over the same trajectory of thought-action-observation steps: task completion (40% of the weight), step efficiency (30%), and tool selection accuracy (30%). The metric returns a 0–1 score plus a component_scores dict, so engineers can see which dimension drove the result. In FutureAGI it is the TrajectoryScore class in fi.evals, and it is the canonical headline metric on agent regression dashboards.
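
The 40/30/30 blend can be sketched in plain Python. This is an illustration of the arithmetic, not the fi.evals internals; composite_trajectory_score is a hypothetical helper, and the component names simply mirror the component_scores dict described above.

```python
# Illustrative sketch of the weighted composite -- not the library's internals.
WEIGHTS = {"task_completion": 0.40, "step_efficiency": 0.30, "tool_selection": 0.30}

def composite_trajectory_score(component_scores: dict) -> float:
    """Weighted sum of the three sub-scores, clamped to the 0-1 range."""
    raw = sum(w * component_scores[name] for name, w in WEIGHTS.items())
    return max(0.0, min(1.0, raw))

# Completion held up while efficiency collapsed: the headline still lands near 0.79,
# which is why the per-component breakdown matters.
print(composite_trajectory_score(
    {"task_completion": 0.95, "step_efficiency": 0.45, "tool_selection": 0.90}
))
```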

Why It Matters in Production LLM and Agent Systems

A single answer-relevancy score on the final message tells you almost nothing about an agent. Two runs can produce the same correct answer, yet one took three clean steps while the other looped twelve times across two wrong tools and a retry storm. The first is a healthy production trace; the second is a slow-burning incident. Without a trajectory-level summary, both look identical on a leaderboard.

The pain is most acute for teams running daily regressions on agents in CI. Engineers see a release where mean answer-quality is unchanged, but the trajectory score quietly dropped from 0.84 to 0.71 because a prompt update made the agent take twice as many steps to reach the same answer — costing 2× tokens and 1.6× p99 latency. SREs see retry storms in tool calls but no eval signal that warned them. Product owners compare two agent variants and have no single metric to declare a winner.

In 2026-era multi-agent and MCP-connected stacks, trajectories sprawl: planner step, retriever, three tool calls, sub-agent handoff, two more tool calls, critique, final answer. Each adds a chance to fail without breaking the surface output. Trajectory score is the metric that compresses that complexity into one alertable signal — and unlike a hand-rolled aggregate, the FutureAGI implementation exposes the components so you never lose the why.

How FutureAGI Handles Trajectory Score

FutureAGI’s approach is to compose three deterministic, fast metrics into one score with surfaced sub-results. The fi.evals.TrajectoryScore class instantiates TaskCompletion, StepEfficiency, and ToolSelectionAccuracy internally, calls each on the same AgentTrajectoryInput, and returns the weighted sum plus a component_scores dict and a per-component reason. Default weights are 40/30/30 across completion, efficiency, and tool selection, configurable on construction (TrajectoryScore(config={"completion_weight": 0.5, ...})). The components are intentionally orthogonal: completion is about outcome, efficiency is about step count and redundancy, tool selection is about whether the right tools were used. A regression in any one shows up at the component level even when the headline score moves slowly.
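
A minimal sketch of that composition pattern, under loud assumptions: the ComponentResult shape, the compose helper, and the example reasons below are all illustrative stand-ins, not the library's actual types or outputs.

```python
from dataclasses import dataclass

@dataclass
class ComponentResult:
    score: float
    reason: str  # per-component reason, surfaced alongside the score

def compose(results: dict, weights: dict) -> dict:
    """Blend orthogonal sub-results into a headline score plus a breakdown."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    headline = sum(weights[name] * results[name].score for name in weights)
    return {
        "score": round(headline, 3),
        "component_scores": {name: r.score for name, r in results.items()},
        "reasons": {name: r.reason for name, r in results.items()},
    }

out = compose(
    {
        "task_completion": ComponentResult(1.0, "goal state reached"),
        "step_efficiency": ComponentResult(0.6, "2 redundant tool calls"),
        "tool_selection": ComponentResult(0.9, "1 wrong tool on step 4"),
    },
    {"task_completion": 0.4, "step_efficiency": 0.3, "tool_selection": 0.3},
)
print(out["score"], out["component_scores"])
```

The design point the sketch captures: the headline is cheap arithmetic over deterministic sub-results, so the breakdown is always available and never lags the score.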

Concretely: a research-agent team using traceAI-langgraph instruments their LangGraph runs, captures every step into agent.trajectory.step spans, and attaches TrajectoryScore to a 500-task golden dataset via Dataset.add_evaluation(). Their command-center dashboard tracks four series: trajectory score, completion, efficiency, and tool selection. When the headline score dips 4 points after a model swap, the breakdown shows efficiency dropped 11 points while completion held — pinpointing redundant tool calls in the new model’s trajectories. They alert on component_scores.step_efficiency directly, fix the prompt, and re-run regression. Compared with LangSmith’s “trajectory_evaluator” (LLM-judge based, slow, opaque), TrajectoryScore is rule-based, sub-second, and traceable to a specific failure dimension.

How to Measure or Detect It

Measurement signals tied to TrajectoryScore:

  • fi.evals.TrajectoryScore — returns a 0–1 score and a component_scores dict with task_completion, step_efficiency, tool_selection. Alert on the headline; root-cause via the breakdown.
  • agent.trajectory.step OTel attribute — the per-step span emitted by traceAI for agent frameworks; the eval reads tool_calls, thought, observation, and is_final from these spans.
  • Trajectory-score-by-cohort dashboard signal — segment by user intent, agent variant, or tool surface to find which slice regressed.
  • Component-divergence alert — a 5+ point gap between any two components is itself a signal that the agent’s behaviour has shifted, even when the headline holds steady.
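
The component-divergence alert in the last bullet can be prototyped in a few lines; the function name and the 0.05 default threshold here are illustrative choices, not part of the FutureAGI API.

```python
def component_divergence_alert(component_scores: dict, gap: float = 0.05) -> bool:
    """Fire when any two components sit more than `gap` apart (5 points on a 0-1 scale)."""
    values = list(component_scores.values())
    return max(values) - min(values) > gap

# The headline would still look steady (~0.88), but efficiency has drifted from the pack.
print(component_divergence_alert(
    {"task_completion": 0.92, "step_efficiency": 0.81, "tool_selection": 0.90}
))  # True
```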

Minimal Python:

from fi.evals import TrajectoryScore

metric = TrajectoryScore(config={"completion_weight": 0.4,
                                 "efficiency_weight": 0.3,
                                 "tool_weight": 0.3})
result = metric.evaluate(trajectory=run.trajectory,
                         task=run.task_definition,
                         available_tools=run.tool_catalogue)
print(result.score, result.component_scores)

Common Mistakes

  • Reporting only the headline score. A 0.78 trajectory score can come from balanced 0.78s or from completion=0.95, efficiency=0.45 — wildly different stories. Always log component scores.
  • Confusing trajectory score with step efficiency. Step efficiency is one component (30%); trajectory score is the comprehensive composite. Do not substitute one for the other.
  • Re-weighting components without baseline calibration. Changing weights mid-program makes historical scores incomparable; if you must rebalance, fork a new metric name.
  • Running it without task.required_tools or available_tools. The tool-selection sub-component degrades to call-success rate alone, hiding wrong-tool failures.
  • No regression eval per agent variant. A single trajectory-score number across your fleet of agents is a vanity metric — score per variant.
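
Scoring per variant, as the last bullet recommends, is a short group-by over run records. The runs data and variant names below are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical regression runs: (agent_variant, trajectory_score)
runs = [
    ("planner-v1", 0.84), ("planner-v1", 0.80),
    ("planner-v2", 0.71), ("planner-v2", 0.75),
]

# Group scores by variant so each variant gets its own series, not a fleet-wide blur.
by_variant = defaultdict(list)
for variant, score in runs:
    by_variant[variant].append(score)

for variant, scores in sorted(by_variant.items()):
    print(f"{variant}: {mean(scores):.2f}")
```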

Frequently Asked Questions

What is trajectory score?

Trajectory score is a composite agent-evaluation metric that blends task completion (40%), step efficiency (30%), and tool selection accuracy (30%) into a single 0–1 score, computed across the agent's full multi-step trajectory.

How is trajectory score different from step efficiency?

Step efficiency only measures how few steps the agent used relative to optimal. Trajectory score is comprehensive — it includes step efficiency as one of three components, alongside task completion and tool selection. Use trajectory score as the single-number agent-quality summary.

How do you measure trajectory score?

FutureAGI's fi.evals.TrajectoryScore consumes the full trajectory plus task definition and returns a weighted 0–1 score with a component_scores breakdown. The default weights (40/30/30) are configurable via the constructor.