Observability

What Is Agent Tracing?

Agent tracing records every model call, tool action, handoff, and decision step in a multi-step AI agent run.

Agent tracing is an AI observability practice that records each step of an agent run: model calls, tool choices, arguments, handoffs, retries, memory reads, and final output. It matters most in production, where a multi-step agent run would otherwise look like a set of disconnected LLM calls. FutureAGI uses traceAI instrumentation and TrajectoryScore evaluations to connect what the agent did with whether the trajectory passed evaluation and where the run regressed.

Why Agent Tracing Matters in Production LLM and Agent Systems

Most agent failures are not single bad completions. They are bad chains: the agent retrieves stale context, selects the wrong tool, retries a timeout, delegates to another agent, then gives a confident final answer that hides the earlier fault. Without agent tracing, the incident looks like “model quality dropped” when the real cause was a broken tool schema, a missing parent span, or a handoff loop.

The pain lands on several teams. Developers need the exact step that changed after a prompt, model, or tool update. SREs need latency and retry visibility across tools, not only p99 latency for the outer API call. Product teams need to explain why a user saw a wrong action. Compliance teams need a reviewable record when an agent touches customer data, payments, bookings, or internal systems.

Common symptoms include orphan spans, repeated agent.trajectory.step values, tool calls with empty arguments, high tool-error rate, rising token-cost-per-trace, and traces where the final answer has no clear supporting path. Unlike plain OpenTelemetry request tracing, agent tracing treats planning, tool choice, handoff, memory access, and final response as first-class events. That matters more in 2026-era agent pipelines because one user turn can cross multiple models, retrievers, tools, and sub-agents before anything reaches the user.
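One of these symptoms, orphan-span rate, is simple to compute from an exported trace. The sketch below is illustrative: the span fields (span_id, parent_id, kind) stand in for whatever schema your exporter emits, not a specific library's API.

```python
# Sketch: detecting orphan spans in an exported trace. Field names
# (span_id, parent_id, kind) are illustrative, not a fixed schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None only for the root span
    kind: str                 # e.g. "AGENT", "LLM", "TOOL"

def orphan_span_rate(spans: list[Span]) -> float:
    """Fraction of non-root spans whose parent is absent from the trace."""
    ids = {s.span_id for s in spans}
    non_root = [s for s in spans if s.parent_id is not None]
    if not non_root:
        return 0.0
    orphans = [s for s in non_root if s.parent_id not in ids]
    return len(orphans) / len(non_root)

spans = [
    Span("a1", None, "AGENT"),
    Span("l1", "a1", "LLM"),
    Span("t1", "gone", "TOOL"),  # parent span was dropped -> orphan
]
print(orphan_span_rate(spans))  # 0.5
```

A non-zero rate on this check usually means async context propagation broke somewhere between the agent span and its children.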

How FutureAGI Handles Agent Tracing

FutureAGI’s approach is to treat a trace as evidence for evaluation, not just a debugging transcript. In a FutureAGI workflow, traceAI:openai-agents instruments the OpenAI Agents SDK so model calls, tool invocations, handoffs, and agent steps appear in one trace tree. The same run can then be scored with eval:TrajectoryScore, the fi.evals metric class that provides a comprehensive trajectory evaluation score.

A concrete workflow looks like this: a support agent receives “refund the duplicate charge,” calls a customer lookup tool, hands off to a billing agent, calls a refund API, and writes the final response. traceAI emits spans tagged with fi.span.kind, gen_ai.request.model, tool names, tool arguments, and agent.trajectory.step. FutureAGI can attach TrajectoryScore to the trace, then slice failures by agent version, tool, route, or dataset cohort.
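The refund workflow above can be pictured as a sequence of span attributes. The attribute keys follow the conventions named in the text (fi.span.kind, gen_ai.request.model, agent.trajectory.step); the specific values, tool names, and model name are hypothetical.

```python
# Illustrative span attributes for the refund workflow. Keys follow the
# conventions named in the text; values and names are hypothetical.
refund_trace = [
    {"fi.span.kind": "AGENT", "agent.trajectory.step": 0,
     "name": "support_agent"},
    {"fi.span.kind": "TOOL", "agent.trajectory.step": 1,
     "tool.name": "customer_lookup",
     "tool.arguments": {"customer_id": "c_123"}},
    {"fi.span.kind": "AGENT", "agent.trajectory.step": 2,
     "name": "billing_agent"},  # handoff target
    {"fi.span.kind": "TOOL", "agent.trajectory.step": 3,
     "tool.name": "issue_refund",
     "tool.arguments": {"charge_id": "ch_456"}},
    {"fi.span.kind": "LLM", "agent.trajectory.step": 4,
     "gen_ai.request.model": "gpt-4o"},  # final response
]

# Step indices should be contiguous and ordered; gaps or repeats are
# exactly the skipped/duplicated actions the text describes.
steps = [s["agent.trajectory.step"] for s in refund_trace]
assert steps == sorted(steps), "reordered or repeated steps"
```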

When a regression appears, the engineer does not start from the final message. They inspect the first failed step. If TrajectoryScore drops after a new prompt, and the trace shows an extra billing handoff plus duplicate refund attempts, the next action is specific: tighten the tool description, add a threshold alert for repeated refund calls, and run a regression eval before shipping the prompt. For adjacent issues, ToolSelectionAccuracy can isolate wrong-tool choices while TaskCompletion checks whether the full run achieved the user’s goal. Unlike a plain LangSmith-style timeline used only for replay, the FutureAGI pattern joins replay, attributes, and evaluation on the same trace id.
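The "threshold alert for repeated refund calls" mentioned above can be sketched as a per-trace check for duplicate tool calls. The span shape and tool names are illustrative; in practice the spans come from the trace export.

```python
# Sketch: flag a trace when the same tool is called with identical
# arguments more than once. Span shape and names are illustrative.
from collections import Counter
import json

def duplicate_tool_calls(spans: list[dict], max_repeats: int = 1) -> list[str]:
    """Return tool names whose (name, arguments) pair exceeds max_repeats."""
    calls = Counter(
        (s["tool.name"], json.dumps(s.get("tool.arguments", {}), sort_keys=True))
        for s in spans
        if s.get("fi.span.kind") == "TOOL"
    )
    return [name for (name, _args), n in calls.items() if n > max_repeats]

trace = [
    {"fi.span.kind": "TOOL", "tool.name": "issue_refund",
     "tool.arguments": {"charge_id": "ch_456"}},
    {"fi.span.kind": "TOOL", "tool.name": "issue_refund",
     "tool.arguments": {"charge_id": "ch_456"}},  # duplicate refund attempt
    {"fi.span.kind": "LLM"},
]
print(duplicate_tool_calls(trace))  # ['issue_refund']
```

Serializing arguments with sort_keys=True makes the dedup key stable even when argument order differs between calls.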

How to Measure or Detect Agent Tracing Quality

Measure agent tracing at the span, trace, and evaluation layers:

  • TrajectoryScore: comprehensive trajectory evaluation score for whether the agent path was acceptable for the task.
  • agent.trajectory.step: step index or label used to detect skipped, repeated, or reordered actions.
  • fi.span.kind: span taxonomy that separates AGENT, LLM, TOOL, RETRIEVER, GUARDRAIL, and EVALUATOR spans.
  • orphan-span rate: percentage of agent, tool, or model spans without the expected parent; target near 0%.
  • token-cost-per-trace: cost of the whole agent run, not just the final model call.
  • eval-fail-rate-by-agent-version: share of traces that fail TrajectoryScore, ToolSelectionAccuracy, or TaskCompletion after a release.

Minimal evaluation sketch:

from fi.evals import TrajectoryScore

# user_request: the original user message.
# trace_steps: the ordered steps exported from the traced run
# (see the fi.evals docs for the expected shape).
score = TrajectoryScore().evaluate(
    input=user_request,
    trajectory=trace_steps,
)
print(score.score, score.reason)

User-feedback proxies help when labels are sparse: thumbs-down rate after tool-using turns, refund/escalation rate, and “agent said done but nothing changed” tickets.
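The first proxy, thumbs-down rate restricted to tool-using turns, is a small conditional rate. The turn fields below (used_tool, feedback) are illustrative stand-ins for whatever your feedback pipeline records.

```python
# Sketch: thumbs-down rate over tool-using turns only, one of the
# feedback proxies above. Turn fields are illustrative.
def tool_turn_thumbs_down_rate(turns: list[dict]) -> float:
    tool_turns = [t for t in turns if t["used_tool"]]
    if not tool_turns:
        return 0.0
    down = sum(1 for t in tool_turns if t.get("feedback") == "down")
    return down / len(tool_turns)

turns = [
    {"used_tool": True, "feedback": "down"},
    {"used_tool": True, "feedback": None},
    {"used_tool": False, "feedback": "down"},  # excluded: no tool use
]
print(tool_turn_thumbs_down_rate(turns))  # 0.5
```

Conditioning on tool use keeps plain-chat dissatisfaction from masking agent-specific failures.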

Common Mistakes

  • Tracing only the final LLM call. Agent bugs often live in tool choice, retry order, or handoff behavior; model spans alone hide that.
  • Dropping parent context across tools. If async tool spans orphan from the agent span, the trace cannot explain the real sequence.
  • Scoring only final answers. A good answer after unsafe or duplicate actions is still a bad trajectory; add TrajectoryScore.
  • Ignoring cost per trace. Multi-step agents can pass quality checks while silently doubling token or tool spend.
  • Mixing step names across services. Stable agent.trajectory.step values make releases comparable; ad hoc names break dashboards.
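One way to avoid the last mistake is to pin step labels to a shared enum that every service imports. The specific labels below are illustrative, not a FutureAGI-defined vocabulary.

```python
# Sketch: a shared enum of agent.trajectory.step labels so every service
# emits the same names. The specific labels are illustrative.
from enum import Enum

class TrajectoryStep(str, Enum):
    PLAN = "plan"
    TOOL_CALL = "tool_call"
    HANDOFF = "handoff"
    FINAL_RESPONSE = "final_response"

def step_attributes(step: TrajectoryStep, index: int) -> dict:
    """Build span attributes with a stable step label plus an index."""
    return {"agent.trajectory.step": f"{index}:{step.value}"}

print(step_attributes(TrajectoryStep.HANDOFF, 2))
# {'agent.trajectory.step': '2:handoff'}
```

With fixed labels, dashboards can compare releases step-by-step instead of matching ad hoc strings.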

Frequently Asked Questions

What is agent tracing?

Agent tracing records every model call, tool selection, argument, handoff, retry, memory read, and final output in a multi-step AI agent run.

How is agent tracing different from LLM tracing?

LLM tracing focuses on model requests and responses. Agent tracing adds the step graph around those calls: planning, tool use, handoffs, retries, memory access, and final task outcome.

How do you measure agent tracing quality?

Use traceAI spans tagged with agent.trajectory.step and evaluate the run with TrajectoryScore. Dashboards should also track orphan-span rate, per-step latency, tool-error rate, and eval-fail-rate by agent version.