What Is Multi-Agent Tracing?

Distributed tracing of every LLM call, tool call, and handoff inside a multi-agent system, linked under one trace ID for replay and evaluation.

Multi-agent tracing is an agent-observability practice that captures, links, and visualizes every step taken by every agent in a multi-agent system as one distributed trace. Each LLM call, tool call, handoff, and shared-memory read becomes a span tied to a root trace ID, so the team trajectory can be replayed as a single tree. In FutureAGI, this trace becomes the measurement surface for debugging coordination failures, attributing latency and cost, and running step-level evaluations.
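
To make that structure concrete, here is a minimal sketch of such a parent-child span tree using the plain OpenTelemetry Python API. The span names are illustrative; in practice the traceAI integrations emit these spans for you.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Install a concrete tracer provider (exporters omitted for brevity).
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("multi-agent-demo")

# Root span: establishes the one trace ID for the whole team trajectory.
with tracer.start_as_current_span("team.run"):
    # The planner's LLM call is a child span, inheriting the root trace ID.
    with tracer.start_as_current_span("planner.llm_call"):
        pass
    # The handoff to the researcher is a sibling span under the same root.
    with tracer.start_as_current_span("handoff.planner_to_researcher"):
        # The researcher's tool call nests under the handoff span.
        with tracer.start_as_current_span("researcher.tool_call"):
            pass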

Why Multi-Agent Tracing Matters in Production LLM and Agent Systems

In a multi-agent system, the most expensive failures are the ones that span agents. A planner picks the wrong specialist; the specialist returns half-completed work; the planner retries with a different specialist; the team eventually answers, but it burns eight minutes and $3 doing so. Without multi-agent tracing, the only signal you get is “request was slow” and “cost was high,” because a flat per-call log loses the parent-child relationship between agents.

The pain is felt across the team. Backend engineers cannot reproduce intermittent failures because they cannot see the inter-agent message that triggered them. SREs cannot attribute p99 latency to a specific agent. Compliance teams cannot prove which agent emitted a regulated piece of content. Product leads see the user-visible failure but cannot tell whether it was a planner bug, a tool bug, or a handoff bug.

For 2026 stacks built on LangGraph, AutoGen, CrewAI, and the OpenAI Agents SDK, multi-agent topologies are no longer rare: researcher + writer + critic teams, planner + N executors, and supervisor-with-subagents patterns are mainstream. A stack that changes this way forces observability to change with it. A single missing span breaks causality for the whole trace, and a flat trace breaks step-level evaluation.

How FutureAGI Handles Multi-Agent Tracing

FutureAGI’s approach is to ship framework-specific traceAI integrations that already know how to mark agent boundaries. traceAI-langchain instruments graph-style agent nodes; traceAI-openai-agents instruments handoff events; traceAI-crewai instruments crew task assignments; traceAI-autogen instruments group-chat messages. Each emits OpenTelemetry spans with agent.trajectory.step, the agent name, and the tool name, all under one root trace ID — so when a researcher hands off to a writer, the parent-child link survives.
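
Wiring one of these integrations up is typically a two-step registration, sketched below for LangChain. The register() helper and LangChainInstrumentor class are assumed from the traceAI package convention; verify the exact import paths against the integration's documentation.

from fi_instrumentation import register              # assumed setup helper
from traceai_langchain import LangChainInstrumentor  # assumed class name

# Register an OTel tracer provider that exports spans to FutureAGI.
tracer_provider = register(project_name="support-team")

# Instrument LangChain/LangGraph so every agent node, tool call, and
# handoff becomes a span under one root trace ID.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)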

The components in the FutureAGI multi-agent tracing stack are:

  • instrumentation, via the traceAI integration;
  • transport, via OTLP to the FutureAGI collector;
  • storage and indexing of traces with eval hooks;
  • the trace viewer, which renders agent boundaries differently from tool boundaries;
  • the eval layer, where TrajectoryScore and TaskCompletion run against the stored traces.

The case that ties them together: a support team sees eval-fail-rate-by-cohort spike for gpt-4o-mini-routed traffic; they open the worst trace; they see the planner agent (still on gpt-4o-mini) handing off to the wrong tool 12% of the time, with the parent-child agent link visible in the timeline. Unlike a trace-only LangSmith workflow, the evaluator score and the failing handoff live on the same trace. Without multi-agent tracing, that diagnosis takes hours of log spelunking.

How to Measure or Detect Multi-Agent Tracing

Multi-agent tracing is itself the measurement surface — but the signals you watch on top of it are:

  • agent.trajectory.step (OTel attribute): the canonical span field; presence on every span confirms instrumentation is correct.
  • trace span count per agent: a useful sanity check — sudden jumps usually mean an agent fell into a retry loop.
  • handoff span density: number of agent-handoff events per trace; spikes correlate with planner regressions.
  • TrajectoryScore: scored against the full multi-agent trace; returns one rating per joint trajectory.
  • TaskCompletion: 0–1 score on team-level success.
  • trace completeness rate (dashboard signal): percentage of traces where every expected agent emitted at least one span.
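
As a sketch of how two of these signals could be computed offline, assuming spans are exported as dictionaries with an attributes mapping (the field names here are illustrative, not a documented export schema):

from collections import Counter

def span_count_per_agent(spans):
    # spans: list of exported span dicts; "agent.name" is an assumed
    # attribute key, not a documented schema.
    return Counter(s["attributes"].get("agent.name", "unknown") for s in spans)

def trace_completeness_rate(traces, expected_agents):
    # traces: list of span lists, one per trace.
    # expected_agents: set of agent names that should appear in each trace.
    # Returns the fraction of traces where every expected agent
    # emitted at least one span.
    complete = sum(
        1 for spans in traces
        if expected_agents <= {s["attributes"].get("agent.name") for s in spans}
    )
    return complete / len(traces) if traces else 0.0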

Minimal Python (trajectory evaluation):

from fi.evals import TrajectoryScore

# user_goal: the original user request (str).
# traceai_trace_spans: the spans captured for one trace by the traceAI
# integration, passed as the trajectory to score.
score = TrajectoryScore().evaluate(
    input=user_goal,
    trajectory=traceai_trace_spans,
)

Common mistakes

  • Instrumenting only one agent in a multi-agent team. A trace with one missing agent loses the handoff link; debugging effectively becomes guesswork.
  • Stripping the trace ID at the agent boundary. If each agent runs in its own process and the trace context is not propagated, you get N flat traces instead of one tree (see the propagation sketch after this list).
  • Tracing without step-level evaluators. Traces show what happened; without TrajectoryScore or TaskCompletion you cannot tell whether what happened was correct.
  • Sampling traces uniformly in production. Multi-agent traces are dense; uniform sampling drops the rare-but-important failure trajectories. Sample tail-error traces at 100%.
  • Confusing inter-agent messages with tool calls. They are different span kinds; conflating them destroys handoff metrics and breaks agent-handoff dashboards.
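
A minimal sketch of propagating trace context across an agent-boundary message, using the standard OpenTelemetry propagation API. The message shape is illustrative; inject and extract are the standard OTel Python calls.

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("handoff-demo")

# Sender side: serialize the current trace context into the message
# so the receiving agent's spans join the same trace.
def send_task(task):
    carrier = {}
    inject(carrier)  # writes W3C traceparent headers into the dict
    return {"payload": task, "otel_context": carrier}

# Receiver side: restore the context before starting the agent's span,
# so the new span is a child in the original trace, not a new root.
def handle_task(message):
    ctx = extract(message["otel_context"])
    with tracer.start_as_current_span("writer.handle_task", context=ctx):
        ...  # the agent's work, traced under the original trace ID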

Frequently Asked Questions

What is multi-agent tracing?

Multi-agent tracing captures every LLM call, tool call, and handoff across every agent in a system under a single trace ID, so the team's full trajectory is one continuous timeline.

How is multi-agent tracing different from regular LLM tracing?

LLM tracing covers single-model calls. Multi-agent tracing additionally links agent-to-agent handoffs, shared memory reads, and tool dispatch across multiple agents, so causality survives the boundary between agents.

How do you implement multi-agent tracing?

Use FutureAGI's traceAI integrations for your framework (LangChain, OpenAI Agents, CrewAI, AutoGen). Each emits OpenTelemetry spans with agent.trajectory.step so spans share one trace ID across agents.