How to Trace and Debug Multi-Agent Systems: A Production Guide to Multi-Agent Observability

Last Updated: Mar 23, 2026
By: Rishav Hada
Time to read: 1 min read

  1. Introduction

Your multi-agent system works fine locally. Three agents coordinate, call tools, pass context, and return a clean answer. Then you deploy to production, and something breaks. The final output is wrong, but you have no idea which agent failed, which tool call returned garbage, or where the reasoning chain fell apart. This is the core problem multi-agent observability solves.

Multi-agent systems introduce failure modes that single-agent setups never face. Agents hand off tasks, share state, call external APIs, and make independent decisions. When one agent hallucinates or a tool call times out, that error cascades silently through the rest of the chain. Traditional logging gives you fragments. Distributed tracing for AI agents gives you the full picture: every decision, every tool invocation, every token spent across your entire agent workflow.

This guide covers how to trace multi-agent workflows end to end, how to debug AI agents when they fail in production, and how to build an observability stack that catches silent failures before your users do.

  2. Why AI Agents Fail in Production

Before we talk about tracing, let's understand what actually goes wrong. Multi-agent systems break differently than traditional software. Here are the failure categories that show up repeatedly in production:

Figure 1: AI Agents Fail in Production

Tool calling errors are the most common. An agent decides to call a function, but the parameters are malformed. The tool returns an error, and the agent either retries incorrectly or ignores the failure and hallucinates an answer instead. Without tool call tracing for LLM agents, you will never see this happen.

Silent failures in multi-agent systems are harder to catch. Agent A passes context to Agent B, but the context is incomplete or irrelevant. Agent B produces a confident but wrong response. No error is thrown. No exception is logged. The user just gets a bad answer, and your monitoring dashboard stays green.

LLM agent hallucination debugging becomes critical when agents fabricate tool outputs or invent data they never retrieved. In a multi-step agent workflow, a hallucination in step 2 corrupts every subsequent step. Standard logs will show the final output but not where the fabrication originated.

Latency compounding is another production killer. Each agent in a chain adds latency. If your orchestrator agent waits for a planner, a retriever, and a summarizer, a 2-second delay in any one of them can push total response time past user tolerance. Production multi-agent latency debugging requires span-level timing data that traditional monitoring tools do not provide.

  3. The Trace and Span Hierarchy for Agent Systems

If you come from backend engineering, you already know traces and spans from distributed systems. The same concept applies to multi-agent observability, but with AI-specific extensions.

A trace represents one complete execution of your agent system, from the initial user query to the final response. Within that trace, each operation gets a span. In multi-agent systems, the span hierarchy typically looks like this:

| Span Level     | What It Captures                  | Example                      |
|----------------|-----------------------------------|------------------------------|
| Root Span      | Full agent workflow execution     | invoke_agent triage_agent    |
| Agent Span     | Individual agent's processing     | invoke_agent research_agent  |
| LLM Span       | A single model call               | chat gpt-5                   |
| Tool Span      | External tool/API invocation      | execute_tool web_search      |
| Retriever Span | Vector DB or knowledge base query | retrieve context_store       |
| Embedding Span | Embedding generation              | embed text-embedding-3-small |

Table 1: Trace and Span Hierarchy for Agent Systems

Each span carries attributes: input tokens, output tokens, latency, model name, status code, and error type. When Agent A hands off to Agent B, the child span links back to the parent, preserving the full execution tree. This span and trace hierarchy for agents is what makes root cause analysis possible.

Here is what a complete trace tree looks like for a customer support multi-agent system handling the query "What is the status of my order #4521?":

Trace: abc-123
└── invoke_agent triage_agent              [4.2s]
    ├── chat gpt-5                         [600ms] → decides to route to order_lookup_agent
    ├── invoke_agent order_lookup_agent    [2.8s]
    │   ├── execute_tool order_api         [1.9s]  → GET /orders/4521
    │   └── chat gpt-4o                    [900ms] → formats order data into natural language
    └── invoke_agent response_agent        [800ms]
        └── chat gpt-5                     [800ms] → composes final user-facing reply

Every span in this tree is clickable. If the final answer is wrong, you can walk backward: did the response agent misinterpret the data? Did the order API return stale information? Did the triage agent route to the wrong sub-agent? The trace gives you the full chain of custody for every piece of information.
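That walk-backward process can be automated. Here is a minimal sketch that finds the deepest failing span in a trace tree; the `Span` class is a simplified stand-in for real trace data, not an OpenTelemetry type:

```python
# Sketch: walk a trace tree backward to the deepest ERROR span.
# Span is a simplified stand-in for real trace data, not an OTel API.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    status: str = "OK"          # "OK" or "ERROR"
    children: list = field(default_factory=list)

def find_root_cause(span):
    """Return the deepest ERROR span, or None if the trace is clean."""
    for child in span.children:
        cause = find_root_cause(child)
        if cause:
            return cause
    return span if span.status == "ERROR" else None

trace = Span("invoke_agent triage_agent", "ERROR", [
    Span("chat gpt-5"),
    Span("invoke_agent order_lookup_agent", "ERROR", [
        Span("execute_tool order_api", "ERROR"),
        Span("chat gpt-4o"),
    ]),
])

print(find_root_cause(trace).name)  # → execute_tool order_api
```

Errors that cascade upward through agent spans all trace back to the one leaf span where the failure originated, which is exactly the chain-of-custody property described above.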

The industry is converging on OpenTelemetry (OTel) as the standard for collecting this telemetry data. Microsoft, Google, IBM, and the broader open-source community are actively developing GenAI semantic conventions that standardize how agent telemetry is structured. The OpenTelemetry GenAI SIG has defined specific span operations like invoke_agent, create_agent, and execute_tool, along with standardized attributes like gen_ai.agent.name, gen_ai.request.model, and gen_ai.usage.input_tokens.

  4. How to Set Up Multi-Agent Observability

Setting up multi-agent observability involves three layers: instrumentation, collection, and visualization. Here is the practical breakdown.

Step 1: Instrument Your Agent Code

Auto-instrumentation is honestly the quickest way to get started. Most of the popular frameworks out there, like LangChain, CrewAI, OpenAI Agents SDK, and Pydantic AI, already support OpenTelemetry-based tracing, either built right in or through dedicated instrumentation libraries.

There are a few paths to instrument your agents, depending on your stack and how much control you want.

Manual OpenTelemetry instrumentation gives you full control. You create a TracerProvider, set up an OTLP exporter, and wrap your agent logic in custom spans. This works with any framework but requires you to manually define every span, set attributes like gen_ai.request.model and gen_ai.usage.input_tokens, and manage parent-child span relationships yourself. It is the most flexible option, but also the most labor-intensive.

Here is a minimal example of manual OTel setup for an agent call:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("multi-agent-app")

with tracer.start_as_current_span("invoke_agent triage_agent") as span:
    span.set_attribute("gen_ai.agent.name", "triage_agent")
    span.set_attribute("gen_ai.request.model", "gpt-5")
    # your agent logic here
    with tracer.start_as_current_span("execute_tool web_search") as tool_span:
        tool_span.set_attribute("gen_ai.tool.name", "web_search")
        # tool call logic here

This approach works but gets tedious fast in multi-agent setups where you have dozens of tool calls and handoffs to track.

Framework-native tracing is another option. LangChain has LangSmith callbacks, CrewAI emits telemetry events, and the OpenAI Agents SDK supports trace collection natively. These give you framework-specific visibility without writing custom span code, but they lock you into that framework's tracing format and backend. If you run multiple frameworks in the same system, you end up with fragmented traces that do not connect to each other.

Auto-instrumentation libraries sit in the middle. They patch supported frameworks at runtime and emit standardized OpenTelemetry spans automatically, no code changes to your agent logic required. This is the approach that scales best for production multi-agent systems because you get consistent span schemas across frameworks and can export to any OTel-compatible backend.

For example, using Future AGI's open-source TraceAI library (one such auto-instrumentation library), you can instrument an OpenAI-based agent in a few lines:

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="my_agent_project",
)

OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

From this point, every LLM call, tool invocation, and retriever hit is automatically captured as spans. No changes to your agent logic are needed.

For multi-agent setups using the OpenAI Agents SDK, you add the agents instrumentor:

from traceai_openai_agents import OpenAIAgentsInstrumentor
from traceai_mcp import MCPInstrumentor

OpenAIAgentsInstrumentor().instrument(tracer_provider=trace_provider)
MCPInstrumentor().instrument(tracer_provider=trace_provider)

This captures agent-to-agent handoffs, MCP tool calls, and the full execution graph across your multi-agent system.

Step 2: Export Traces to a Backend

TraceAI exports to any OpenTelemetry-compatible backend: Jaeger, Grafana Tempo, Datadog, or Future AGI's Observe platform. The traces flow through the standard OTLP (OpenTelemetry Protocol) pipeline, which means you are not locked into any single vendor.

Step 3: Visualize and Analyze

Once traces land in your backend, you can view each agent run as a nested timeline. Each node in the waterfall view represents a span. You can click into any span to inspect its input, output, latency, token count, and error status.

This is where the debugging power comes in. If a user reports a bad answer, you pull up the trace, walk the span tree, and find the exact point where the reasoning went wrong.

  5. Debugging Common Multi-Agent Failures

Let's walk through how to find the root cause of agent failures using trace data.

5.1 Debugging Tool Calling Errors

When an agent calls a tool and gets an error, the tool span will show a non-success status code along with the error message. Check these things in order:

  1. Was the tool called with correct parameters? Inspect the tool span's input attributes.

  2. Did the tool return an error or an empty result? Check the span's output and status.

  3. How did the agent react to the failure? Look at the next LLM span to see if the agent retried, hallucinated, or gave up.

Agent tool calling errors often stem from poorly structured prompts that cause the LLM to generate invalid function arguments. Trace data makes this immediately visible.

Here is what this looks like in a real trace. Say your booking agent calls a flight_search tool. In the tool span, you see:

Span: execute_tool flight_search
Status: ERROR
Attributes:
  gen_ai.tool.name: flight_search
  tool.input: {"origin": "NYC", "destination": "", "date": "2026-03-15"}
  tool.output: {"error": "destination is required"}
  gen_ai.agent.name: booking_agent

The destination field is empty. Now you go one span up to the LLM call that generated this tool invocation:

Span: chat gpt-4o
Attributes:
  gen_ai.request.model: gpt-4o
  llm.input: "User wants to fly from New York to somewhere warm next week"
  llm.output: {"tool_call": "flight_search", "args": {"origin": "NYC", "destination": "", "date": "2026-03-15"}}

The model could not resolve "somewhere warm" into a concrete destination, so it passed an empty string instead of asking for clarification. The fix here is a prompt-level change: add an instruction that tells the agent to ask the user for a specific destination when the query is ambiguous, rather than calling the tool with incomplete parameters.

Without span-level trace data, your logs would only show "flight_search failed" with no visibility into why the LLM generated bad arguments in the first place.
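A cheap way to surface this class of failure early is to validate required tool arguments before dispatch, so an empty field becomes an explicit, traceable error instead of a downstream API rejection. A minimal sketch, where the `REQUIRED` schema and its field names are illustrative:

```python
# Sketch: pre-dispatch validation of tool arguments. The REQUIRED
# schema is illustrative; a production system would derive it from
# the tool's function-calling JSON schema.
REQUIRED = {"flight_search": ["origin", "destination", "date"]}

def validate_tool_args(tool_name, args):
    """Return the list of missing or empty required fields for a tool call."""
    problems = []
    for f in REQUIRED.get(tool_name, []):
        if not str(args.get(f, "")).strip():
            problems.append(f)
    return problems

args = {"origin": "NYC", "destination": "", "date": "2026-03-15"}
print(validate_tool_args("flight_search", args))  # → ['destination']
```

When the check fails, the agent can be re-prompted to ask the user for the missing field, and the validation result can be attached to the tool span as an attribute so the failure shows up in traces.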

5.2 Debugging Hallucination in Multi-Step Workflows

LLM agent hallucination debugging in multi-agent systems requires comparing what the retriever actually returned against what the agent claimed. In your trace:

  1. Open the retriever span and check the retrieved documents.

  2. Open the subsequent LLM span and check the generated response.

  3. If the response contains claims not present in the retrieved context, you have a hallucination.

Here is what a hallucination looks like in trace data. Your research agent retrieves context from a knowledge base and then generates a summary:

Span: retrieve context_store
Attributes:
  retriever.query: "Q1 2026 revenue for Acme Corp"
  retriever.documents: [
    "Acme Corp reported $42M in Q1 2026 revenue, a 12% increase YoY."
  ]

Span: chat gpt-5
Attributes:
  gen_ai.request.model: gpt-5
  llm.input: [retrieved context + user query]
  llm.output: "Acme Corp reported $42M in Q1 2026 revenue, a 12% increase YoY, driven primarily by expansion into the European market."

The "driven primarily by expansion into the European market" part appears nowhere in the retrieved documents. That is the hallucination. In a single-agent setup, you might catch this. In a multi-agent pipeline, this fabricated detail gets passed to a downstream analyst agent that uses it as a factual input for its own reasoning, and the error compounds silently.

Catching this manually is impractical at scale, which is where automated evaluation comes in.

Automated evaluation metrics (like those provided by Future AGI's evaluation suite) can flag this automatically using LLM-as-judge for agent evaluation. These evaluators compare each LLM span's output against the retriever span's documents and assign a faithfulness score. When that score drops below your threshold, say 0.85, the trace gets flagged for review. This turns hallucination detection from a manual trace-reading exercise into an automated quality gate that runs on every single agent execution.
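To make the mechanics concrete, here is a deliberately crude stand-in for such an evaluator: a lexical-overlap heuristic that scores how much of the output's vocabulary is grounded in the retrieved documents. Real faithfulness evaluators use an LLM-as-judge; this sketch only illustrates the span-vs-span comparison:

```python
# Sketch: a crude lexical faithfulness check. Production evaluators use
# an LLM-as-judge; this token-overlap heuristic only illustrates
# comparing an LLM span's output against the retriever span's documents.
import re

def faithfulness_score(output: str, documents: list[str]) -> float:
    """Fraction of content tokens in the output that appear in the context."""
    tokenize = lambda s: set(re.findall(r"[a-z0-9$%]+", s.lower()))
    context = set().union(*(tokenize(d) for d in documents))
    out = tokenize(output)
    return len(out & context) / len(out) if out else 1.0

docs = ["Acme Corp reported $42M in Q1 2026 revenue, a 12% increase YoY."]
grounded = "Acme Corp reported $42M in Q1 2026 revenue."
fabricated = grounded + " Growth was driven by European market expansion."

assert faithfulness_score(grounded, docs) > 0.9    # fully supported
assert faithfulness_score(fabricated, docs) < 0.85  # below threshold: flag
```

The fabricated claim about European expansion pulls the score below the example 0.85 threshold, which is the same gating logic an automated evaluator applies to every trace.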

5.3 Debugging Latency Issues

For production multi-agent latency debugging, sort spans by duration. The waterfall view immediately shows which agent or tool call is the bottleneck. Common culprits include:

  • Retriever queries hitting large, unoptimized vector indexes

  • Sequential tool calls that could run in parallel

  • LLM calls using unnecessarily large context windows

  • Agent loops where the orchestrator retries the same failing step

You can also monitor agent token usage and cost per step by examining the gen_ai.usage.input_tokens and gen_ai.usage.output_tokens attributes on each LLM span.
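A minimal sketch of that cost accounting, with spans represented as plain attribute dicts and illustrative (not real) per-token prices:

```python
# Sketch: aggregating token usage into a per-trace dollar cost from
# span attributes. Spans are plain dicts here; in practice they come
# from your trace backend. PRICE values are illustrative, not real rates.
PRICE = {"gpt-5": (2.0, 8.0), "gpt-4o": (1.0, 4.0)}  # $/1M tokens (in, out)

def trace_cost(spans):
    """Total dollar cost of all LLM spans in one trace."""
    total = 0.0
    for s in spans:
        model = s.get("gen_ai.request.model")
        if model in PRICE:
            p_in, p_out = PRICE[model]
            total += s.get("gen_ai.usage.input_tokens", 0) / 1e6 * p_in
            total += s.get("gen_ai.usage.output_tokens", 0) / 1e6 * p_out
    return total

spans = [
    {"gen_ai.request.model": "gpt-5",
     "gen_ai.usage.input_tokens": 1200, "gen_ai.usage.output_tokens": 300},
    {"gen_ai.request.model": "gpt-4o",
     "gen_ai.usage.input_tokens": 800, "gen_ai.usage.output_tokens": 150},
]
print(f"${trace_cost(spans):.6f}")
```

Grouping the same sum by `gen_ai.agent.name` instead of by trace tells you which agent in the chain is burning the budget.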

To make this concrete: you notice P95 latency on your customer support agent pipeline has jumped from 4 seconds to 11 seconds. You pull up a slow trace and see the waterfall:

invoke_agent triage_agent               [11000ms]
  ├── chat gpt-4o                       [800ms]
  ├── invoke_agent retriever_agent      [6200ms]   ← bottleneck
  │     ├── retrieve vector_store       [5900ms]
  │     └── chat gpt-4o                 [300ms]
  └── invoke_agent summarizer_agent     [1800ms]
        └── chat gpt-4o                 [1800ms]

The retriever agent's vector store query is taking 5.9 seconds. You check the span attributes and see it is querying an unindexed collection with 2M+ documents using a broad embedding search with top_k=50. The fix is either indexing the collection, reducing top_k, or adding a metadata pre-filter to narrow the search space. Without span-level timing, you would only know the overall pipeline was slow, not that a single vector query was responsible for 54% of the total latency.
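For the "sequential tool calls that could run in parallel" case from the list above, the fix is usually as simple as gathering independent calls concurrently. A sketch with stand-in tool functions; only apply this when the calls do not depend on each other's results:

```python
# Sketch: running independent tool calls concurrently instead of
# sequentially. call_tool is a stand-in for a real API round-trip;
# only gather calls that do not depend on each other's outputs.
import asyncio
import time

async def call_tool(name, seconds):
    await asyncio.sleep(seconds)   # stands in for network latency
    return f"{name}:ok"

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(      # both calls in flight at once
        call_tool("order_api", 0.2),
        call_tool("shipping_api", 0.2),
    )
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.2f}s")    # ~0.2s total, not 0.4s
    return results, elapsed

asyncio.run(main())
```

In a trace waterfall, this change shows up as two tool spans overlapping in time instead of stacking end to end.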

  6. Evaluating Multi-Agent System Output

Tracing tells you what happened. Evaluation tells you if it was good. Combining both creates a closed feedback loop that drives continuous improvement.

6.1 Key Metrics for Multi-Agent Evaluation

| Metric                     | What It Measures                                                      | How to Compute                            |
|----------------------------|-----------------------------------------------------------------------|-------------------------------------------|
| Task Completion Rate       | % of queries where the final agent output correctly answers the user  | LLM-as-judge or human annotation          |
| Tool Accuracy              | % of tool calls with correct parameters and valid responses           | Span-level status code analysis           |
| Faithfulness Score         | Does the output match retrieved context?                              | Retriever span vs. LLM output comparison  |
| End-to-End Latency         | Total time from query to response                                     | Root span duration                        |
| Cost per Query             | Total token spend across all agents                                   | Sum of token counts across LLM spans      |
| Agent Handoff Success Rate | % of inter-agent handoffs that preserve required context              | Custom span attribute checks              |

Table 2: Key Metrics for Multi-Agent Evaluation

These multi-agent evaluation metrics give you quantitative signal on where your system is weak. When faithfulness drops, you know your retriever or grounding prompt needs work. When tool accuracy dips, you check for schema changes or API regressions.

6.2 Setting Up Alerts for Agent Quality Drift

Production systems need continuous monitoring, not one-time audits. Set up alerts for:

  • Latency spikes: When P95 latency exceeds your SLA threshold

  • Error rate increases: When tool span failure rate rises above baseline

  • Quality score drops: When automated evaluation scores trend downward

  • Token cost anomalies: When cost per query jumps unexpectedly (often indicates agent loops)

Future AGI's monitoring module supports OTEL-powered dashboards with configurable alerts for all of these signals. You can set thresholds on specific agents within a multi-agent chain, so you know exactly which agent is degrading.

  7. Multi-Agent System Architecture Tracing: Best Practices

After working with distributed tracing for AI agents across multiple production deployments, here are the practices that consistently make the biggest difference:

Instrument early, not after a production incident. Adding tracing after deployment is significantly harder than building it in from the start.

Name your spans descriptively. Use names like research_agent:web_search instead of generic tool_call. Clear span names save time during debugging.

Separate environments with project versions. Use distinct project names or version tags for dev, staging, and production traces to prevent test data from polluting production dashboards.

Trace agent state, not just inputs and outputs. If your agents maintain memory or state between steps, capture state transitions as span attributes. This is critical for agent state management debugging.

Combine tracing with automated evaluation. Raw traces give you the "what." Automated evals (faithfulness, relevance, safety scores) give you the "how good." Together, they tell the full story.

Use consistent span attributes across frameworks. If you run agents on LangChain and CrewAI within the same system, ensure both emit spans with the same attribute schema. OpenTelemetry semantic conventions handle this when you use compliant instrumentation libraries.

  8. How Multi-Agent Observability with Future AGI Works

Future AGI provides a complete observability and evaluation layer built for multi-agent systems. Its open-source TraceAI library handles instrumentation across 15+ frameworks (OpenAI, Anthropic, LangChain, CrewAI, DSPy, Pydantic AI, and more) with auto-instrumentation that requires zero changes to your agent code.

The platform's Agent Compass feature goes beyond traditional trace visualization. It automatically clusters errors, identifies root causes using a built-in error taxonomy, and suggests fixes. Instead of manually sifting through thousands of traces, you get grouped failure patterns with actionable diagnostics.

For evaluation, Future AGI packs in over 50 ready-to-use metrics, covering hallucination detection, context adherence scoring, and tool accuracy measurement. Since these checks run directly inside your production traces, you get real-time quality signals without needing to set up or manage a separate evaluation pipeline.

For teams running multi-step agent workflow monitoring at scale, Future AGI's Observe module tracks throughput, error rates, latency distributions, and cost per query across your entire agent fleet with customizable alert thresholds.

  9. Conclusion

Multi-agent observability is the difference between shipping agents that work in demos and agents that hold up in production. Without distributed tracing, you are debugging blind. Without automated evaluation, you are flying without instruments.

The key takeaways: instrument your agents from day one using OpenTelemetry-compatible tooling, build span hierarchies that reflect your actual agent architecture, combine tracing with automated evaluation for a complete feedback loop, and set up alerts to catch agent quality drift before your users notice. The tools exist. OpenTelemetry provides the standard, TraceAI provides the instrumentation, and Future AGI provides the platform to trace, evaluate, and monitor your multi-agent systems end to end.

Frequently Asked Questions

How do you debug LLM agent chains when errors do not throw exceptions?

What is the difference between logging and distributed tracing for AI agents?

How does multi-agent observability with Future AGI differ from standard monitoring tools?

Can you trace agent tool calls and API responses across different frameworks?

Rishav Hada is an Applied Scientist at Future AGI, specializing in AI evaluation and observability. Previously at Microsoft Research, he built frameworks for generative AI evaluation and multilingual language technologies. His research, funded by Twitter and Meta, has been published in top AI conferences and earned the Best Paper Award at FAccT’24.

