How to Trace and Debug Multi-Agent Systems in 2026: A Production Guide with traceAI, OpenTelemetry, and Span-Level Evals
Trace, debug, and evaluate multi-agent AI systems in 2026 with traceAI, OpenTelemetry spans, and span-level evals. Code samples, span trees, and three real failure cases.
TL;DR: How to Trace and Debug Multi-Agent Systems in 2026
| Question | One line answer |
|---|---|
| What instruments the agent? | traceAI (Apache 2.0, OpenTelemetry-native). Decorate the entrypoint with @tracer.agent and each tool with @tracer.tool. |
| Where do spans go? | Any OTLP-compatible backend (Future AGI Observe, Jaeger, Tempo, Datadog, Honeycomb). |
| How is a multi-agent failure debugged? | Walk the span tree: input, plan, retrieval, tool args, tool output, LLM rewrite, final answer. Score each span, find the bad one. |
| How do I catch regressions before users do? | Span-level evals via fi.evals.evaluate(...) plus an Agent Command Center gateway that gates on the score. |
| What about hallucination in a multi-step chain? | Score each LLM span for faithfulness against the upstream retriever span. Threshold of 0.85 is a workable starting point. |
| What changed in 2026? | OpenTelemetry GenAI semantic conventions stabilized; traceAI added agent and tool decorators; Agent Command Center routes through the same trace plane. |
Why Multi-Agent Systems Break Silently in Production and What Distributed Tracing Fixes
Your multi-agent system works fine locally. Three agents coordinate, call tools, pass context, and return a clean answer. Then you deploy to production, and something breaks. The final output is wrong, but you have no idea which agent failed, which tool call returned garbage, or where the reasoning chain fell apart. This is the core problem multi-agent observability solves.
Multi-agent systems introduce failure modes that single-agent setups never face. Agents hand off tasks, share state, call external APIs, and make independent decisions. When one agent hallucinates or a tool call times out, that error cascades silently through the rest of the chain. Traditional logging gives you fragments. Distributed tracing for AI agents gives you the full picture: every decision, every tool invocation, every token spent across your entire agent workflow.
This guide covers how to trace multi-agent workflows end to end, how to debug AI agents when they fail in production, and how to build an observability stack that catches silent failures before your users do. The code is built on Future AGI’s open source traceAI library (Apache 2.0, OpenTelemetry-native), fi.evals (Apache 2.0 evaluation library), and the Agent Command Center gateway for runtime policy enforcement.
Why AI Agents Fail in Production: Tool Calling Errors, Silent Failures, Hallucinations, and Latency Compounding
Multi-agent systems break differently than traditional software. Four failure categories show up repeatedly in production, and each one motivates a specific tracing requirement.
Tool calling errors are the most common. An agent decides to call a function, but the parameters are malformed. The tool returns an error, and the agent either retries incorrectly or ignores the failure and hallucinates an answer instead. Without tool call tracing for LLM agents, you will never see this happen.
Silent failures in multi-agent systems are harder to catch. Agent A passes context to Agent B, but the context is incomplete or irrelevant. Agent B produces a confident but wrong response. No error is thrown. No exception is logged. The user just gets a bad answer, and your monitoring dashboard stays green.
LLM agent hallucination debugging becomes critical when agents fabricate tool outputs or invent data they never retrieved. In a multi-step agent workflow, a hallucination in step 2 corrupts every subsequent step. Standard logs will show the final output but not where the fabrication originated.
Latency compounding is another production killer. Each agent in a chain adds latency. If your orchestrator agent waits for a planner, a retriever, and a summarizer, a 2-second delay in any one of them can push total response time past user tolerance. Production multi-agent latency debugging requires span-level timing data that traditional monitoring tools do not provide.
The Trace and Span Hierarchy for Agent Systems: Root, Agent, LLM, Tool, Retriever, and Embedding Spans
If you come from backend engineering, you already know traces and spans from distributed systems. The same concept applies to multi-agent observability, with AI-specific extensions defined by the OpenTelemetry GenAI SIG.
A trace represents one complete execution of your agent system, from the initial user query to the final response. Within that trace, each operation gets a span. In multi-agent systems, the span hierarchy typically looks like this:
| Span Level | What It Captures | Example |
|---|---|---|
| Root Span | Full agent workflow execution | invoke_agent triage_agent |
| Agent Span | Individual agent’s processing | invoke_agent research_agent |
| LLM Span | A single model call | chat gpt-5 |
| Tool Span | External tool or API invocation | execute_tool web_search |
| Retriever Span | Vector DB or knowledge base query | retrieve context_store |
| Embedding Span | Embedding generation | embed text-embedding-3-small |
Table 1: Span hierarchy for multi-agent systems.
Each span carries attributes: input tokens, output tokens, latency, model name, status code, and error type. When Agent A hands off to Agent B, the child span links back to the parent, preserving the full execution tree. This span and trace hierarchy is what makes root cause analysis possible.
Here is what a complete trace tree looks like for a customer support multi-agent system handling the query “What is the status of my order #4521?”:
```
Trace: abc-123
  invoke_agent triage_agent [4.2s]
    chat gpt-5 [600ms]                        decides to route to order_lookup_agent
    invoke_agent order_lookup_agent [2.8s]
      execute_tool order_api [1.9s]           GET /orders/4521
      chat gpt-5 [900ms]                      formats order data into natural language
    invoke_agent response_agent [800ms]
      chat gpt-5 [800ms]                      composes final user-facing reply
```
Every span in this tree is clickable. If the final answer is wrong, you can walk backward: did the response agent misinterpret the data? Did the order API return stale information? Did the triage agent route to the wrong sub-agent? The trace gives you the full chain of custody for every piece of information.
The industry has converged on OpenTelemetry (OTel) as the standard for collecting this telemetry data. Microsoft, Google, IBM, and the broader open-source community contributed GenAI semantic conventions that standardize how agent telemetry is structured. The conventions define specific span operations like invoke_agent, create_agent, and execute_tool, along with standardized attributes like gen_ai.agent.name, gen_ai.request.model, and gen_ai.usage.input_tokens.
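For a concrete sense of what those conventions look like on the wire, here is a minimal sketch using the raw OpenTelemetry SDK. The span name and attribute keys follow the GenAI conventions just described; the attribute values are illustrative, and in practice the traceAI instrumentors emit all of this for you.

```python
# Illustrative only: a GenAI-convention agent span emitted with the raw
# OpenTelemetry API. traceAI instrumentors produce these spans automatically.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("invoke_agent triage_agent") as span:
    span.set_attribute("gen_ai.operation.name", "invoke_agent")
    span.set_attribute("gen_ai.agent.name", "triage_agent")
    span.set_attribute("gen_ai.request.model", "gpt-5")
    # ... run the agent's logic inside this span, then record usage ...
    span.set_attribute("gen_ai.usage.input_tokens", 412)   # example values
    span.set_attribute("gen_ai.usage.output_tokens", 96)
```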
How to Set Up Multi-Agent Observability: Instrumentation with traceAI, Backend Export, and Trace Visualization
Setting up multi-agent observability involves three layers: instrumentation, collection, and visualization. Here is the practical breakdown in 2026.
Step 1: Instrument Your Agent with traceAI Decorators
Auto-instrumentation is the quickest way to start. Most popular frameworks (LangChain, CrewAI, OpenAI Agents SDK, Pydantic AI, Google ADK, LlamaIndex) ship a dedicated traceAI instrumentor. For workflows where you want explicit control over span names and structure, the @tracer.agent and @tracer.tool decorators are the 2026 idiomatic way to instrument.
```python
# Manual traceAI instrumentation with decorators.
# Requires: pip install fi-instrumentation-otel (Apache 2.0)
# Env: FI_API_KEY, FI_SECRET_KEY
from fi_instrumentation import register, FITracer
from fi_instrumentation.fi_types import ProjectType

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="multi-agent-prod",
)
tracer = FITracer(trace_provider.get_tracer(__name__))

@tracer.agent
def triage_agent(user_query: str) -> str:
    """Decide which sub-agent handles the query, then call it."""
    plan = decide_route(user_query)
    return run_sub_agent(plan, user_query)

@tracer.tool(name="order_api", description="Lookup order by ID")
def order_api(order_id: str) -> dict:
    """Fetch order from the backend."""
    return fetch_order(order_id)
```
Two lines of decoration give you a full parent-child span tree. The @tracer.agent decorator marks the entrypoint as an agent span; everything called inside it (including downstream @tracer.tool calls and any LLM call routed through a traceAI-instrumented client) attaches as a child span automatically.
For framework-native instrumentation, the dedicated instrumentor handles the wiring with zero code changes inside the agent. For an OpenAI Agents SDK app:
```python
# Framework auto-instrumentation. Zero changes to agent code.
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai_agents import OpenAIAgentsInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="agents-prod",
)
OpenAIAgentsInstrumentor().instrument(tracer_provider=trace_provider)
```
From this point, every agent-to-agent handoff, tool call, and LLM completion in the OpenAI Agents SDK runtime emits a structured span. Instrumentors exist for LangChain, CrewAI, LlamaIndex, DSPy, Pydantic AI, OpenAI, Anthropic, and more in the traceAI repo.
If your stack mixes frameworks (LangChain plus OpenAI Agents SDK plus custom code), call multiple instrumentors. They cooperate through the shared OpenTelemetry tracer provider, so spans from different frameworks land in one trace tree with one root.
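A sketch of that mixed setup follows: both instrumentors are pointed at the tracer provider returned by register(...). The OpenAI Agents instrumentor is the one shown above; the LangChain import assumes the same traceai_<framework> naming pattern, so verify the exact module name in the traceAI repo.

```python
# Mixed-framework instrumentation sharing one tracer provider, so LangChain and
# OpenAI Agents SDK spans land in the same trace tree.
# Note: traceai_langchain / LangChainInstrumentor is assumed to follow the same
# naming pattern as the OpenAI Agents instrumentor; check the traceAI repo.
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai_agents import OpenAIAgentsInstrumentor
from traceai_langchain import LangChainInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="mixed-agents-prod",
)
OpenAIAgentsInstrumentor().instrument(tracer_provider=trace_provider)
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
```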
Step 2: Export Traces to an OpenTelemetry-Compatible Backend Using OTLP
traceAI exports to any OpenTelemetry-compatible backend: Jaeger, Grafana Tempo, Datadog, Honeycomb, or Future AGI’s Observe. The traces flow through the standard OTLP (OpenTelemetry Protocol) pipeline, so you are not locked into any single vendor. By default register(...) configures the exporter to send to the Future AGI cloud project named in project_name; override the endpoint to point at any OTLP collector.
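If you also want spans in a self-hosted collector, one approach is to attach an additional OTLP span processor to the provider returned by register(...). This is a sketch that assumes the provider behaves like a standard OTel TracerProvider accepting extra span processors; the collector endpoint is an example value.

```python
# Sketch: also export spans to a self-hosted OTLP collector (Jaeger, Tempo, ...).
# Assumes the provider returned by register() accepts extra span processors;
# the endpoint is a placeholder for your collector's gRPC address.
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="multi-agent-prod",
)
trace_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
```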
Step 3: Visualize and Analyze Agent Traces with Waterfall Span Views
Once traces land in your backend, you can view each agent run as a nested timeline. Each node in the waterfall view represents a span. Click into any span to inspect its input, output, latency, token count, and error status.
This is where the debugging power comes in. If a user reports a bad answer, you pull up the trace, walk the span tree, and find the exact point where the reasoning went wrong.
Debugging Common Multi-Agent Failures: Tool Errors, Hallucinations, and Latency Issues
Three representative cases. Each shows the trace, the diagnosis, and the fix.
Case 1: Tool Calling Error Caused by Ambiguous LLM Output
Your booking agent calls a flight_search tool. The tool span shows:
```
Span: execute_tool flight_search
Status: ERROR
Attributes:
  gen_ai.tool.name: flight_search
  tool.input: {"origin": "NYC", "destination": "", "date": "2026-03-15"}
  tool.output: {"error": "destination is required"}
  gen_ai.agent.name: booking_agent
```
The destination field is empty. Now you walk one span up to the LLM call that generated this tool invocation:
```
Span: chat gpt-5
Attributes:
  gen_ai.request.model: gpt-5
  llm.input: "User wants to fly from New York to somewhere warm next week"
  llm.output: {"tool_call": "flight_search",
               "args": {"origin": "NYC", "destination": "", "date": "2026-03-15"}}
```
The model could not resolve “somewhere warm” into a concrete destination, so it passed an empty string instead of asking for clarification. The fix is a prompt-level change: add an instruction that tells the agent to ask the user for a specific destination when the query is ambiguous, rather than calling the tool with incomplete parameters. Without span-level trace data, your logs would only show “flight_search failed” with no visibility into why the LLM generated bad arguments in the first place.
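A minimal sketch of that prompt-level fix follows; the wording and the constant name are illustrative, not a canonical template.

```python
# Illustrative system-prompt addition: refuse to call the tool with incomplete
# arguments and ask a clarifying question instead. Wording is an example only.
BOOKING_AGENT_SYSTEM_PROMPT = """
You book flights by calling the flight_search tool.
Only call flight_search when origin, destination, and date are all concrete values.
If any field is missing or ambiguous (for example "somewhere warm"), do not call
the tool. Ask the user one clarifying question to pin down the missing value.
"""
```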
Case 2: Hallucination in a Multi-Step Workflow
LLM agent hallucination debugging in multi-agent systems requires comparing what the retriever actually returned against what the agent claimed. The pattern: open the retriever span, open the subsequent LLM span, compare. Anything in the LLM output not grounded in the retrieved context is a fabrication.
Your research agent retrieves context from a knowledge base and then generates a summary:
```
Span: retrieve context_store
Attributes:
  retriever.query: "Q1 2026 revenue for Acme Corp"
  retriever.documents: [
    "Acme Corp reported $42M in Q1 2026 revenue, a 12% increase YoY."
  ]

Span: chat gpt-5
Attributes:
  gen_ai.request.model: gpt-5
  llm.input: [retrieved context + user query]
  llm.output: "Acme Corp reported $42M in Q1 2026 revenue, a 12% increase YoY,
               driven primarily by expansion into the European market."
```
The “driven primarily by expansion into the European market” part appears nowhere in the retrieved documents. That is the hallucination. In a single-agent setup, you might catch this. In a multi-agent pipeline, this fabricated detail gets passed to a downstream analyst agent that uses it as a factual input for its own reasoning, and the error compounds silently.
Catching this manually is impractical at scale, which is where automated span-level evaluation comes in.
```python
# Run a faithfulness eval on the LLM span against the retriever span.
# Requires: pip install future-agi (Apache 2.0)
# Env: FI_API_KEY, FI_SECRET_KEY
from fi.evals import evaluate

result = evaluate(
    "faithfulness",
    output=llm_span_output,
    context="\n".join(retriever_span_documents),
    model="turing_flash",  # cloud judge, roughly 1-2s
)
print(result)  # score, reason, latency
```
The score attaches to the LLM span when the evaluate call runs inside the agent’s trace context. The pattern is to import the auto-enrichment helper from fi.evals.otel and call it once at startup; every subsequent evaluate() inside an active span writes its score, reason, and latency back to that span as attributes. See the next code block for the full setup. The trace tree in the Future AGI Observe dashboard now shows the failing span with its faithfulness score next to its input and output. When the score drops below your threshold (0.85 is a workable starting point for production), the trace gets flagged for review.
Case 3: Latency Outlier in a Single Span
For production multi-agent latency debugging, sort spans by duration. The waterfall view immediately shows which agent or tool call is the bottleneck. Common culprits include retriever queries on unindexed vector stores, sequential tool calls that could run in parallel, LLM calls with unnecessarily large context windows, and agent loops where the orchestrator retries the same failing step.
You notice p95 latency on your customer support agent pipeline has jumped from 4 seconds to about 9 seconds. You pull up a slow trace and see the waterfall:
```
invoke_agent triage_agent [200ms]
chat gpt-5 [800ms]
invoke_agent retriever_agent [6200ms]        <- bottleneck
  retrieve vector_store [5900ms]
  chat gpt-5 [300ms]
invoke_agent summarizer_agent [1800ms]
  chat gpt-5 [1800ms]
```
The retriever agent’s vector store query is taking 5.9 seconds out of the roughly 8.8 second wall time. You check the span attributes and see it is querying an unindexed collection with 2M+ documents using a broad embedding search with top_k=50. The fix is either indexing the collection, reducing top_k, or adding a metadata pre-filter to narrow the search space. Without span-level timing, you would only know the overall pipeline was slow, not that a single vector query was responsible for over half the total latency.
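A sketch of the narrowed query follows, using a hypothetical vector-store client; the client, method, and parameter names (and the helper embed() and the filter fields) are illustrative rather than tied to any specific store.

```python
# Remediation sketch with a hypothetical vector-store client: lower top_k and
# add a metadata pre-filter so similarity search runs over a smaller candidate set.
results = vector_store.search(                 # client and method are illustrative
    query_embedding=embed(user_query),
    top_k=8,                                   # was 50
    filter={"customer_id": customer_id, "doc_type": "order"},  # metadata pre-filter
)
```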
Evaluating Multi-Agent System Output: Key Metrics, Span-Level Evals, and Automated Alerts
Tracing tells you what happened. Evaluation tells you whether it was good. Combining both creates a closed feedback loop that drives continuous improvement.
Key Metrics for Multi-Agent Evaluation
| Metric | What It Measures | How to Compute |
|---|---|---|
| Task Completion Rate | Percent of queries where the final agent output correctly answers the user | LLM-as-judge or human annotation |
| Tool Accuracy | Percent of tool calls with correct parameters and valid responses | Span-level status code analysis |
| Faithfulness Score | Does the LLM output match the retrieved context | Retriever span vs LLM span comparison via evaluate("faithfulness", ...) |
| End-to-End Latency | Total time from query to response | Root span duration |
| Cost per Query | Total token spend across all agents | Sum of token counts across LLM spans |
| Agent Handoff Success Rate | Percent of inter-agent handoffs that preserve required context | Custom span attribute checks |
Table 2: Key Metrics for Multi-Agent Evaluation.
These metrics give you quantitative signal on where your system is weak. When faithfulness drops, your retriever or grounding prompt needs work. When tool accuracy dips, you check for schema changes or API regressions.
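As an example of how span attributes feed these metrics, here is a small sketch of cost per query computed from the LLM spans of a single trace. It assumes the spans are available as plain attribute dictionaries (for instance, pulled from your backend's query API), and the per-token prices are placeholders.

```python
# Sketch: cost per query from the gen_ai.usage.* attributes of a trace's LLM spans.
# Spans are assumed to be plain dicts of attributes; prices are placeholders.
INPUT_PRICE_PER_1K = 0.005
OUTPUT_PRICE_PER_1K = 0.015

def cost_per_query(llm_spans: list[dict]) -> float:
    input_tokens = sum(s.get("gen_ai.usage.input_tokens", 0) for s in llm_spans)
    output_tokens = sum(s.get("gen_ai.usage.output_tokens", 0) for s in llm_spans)
    return (input_tokens * INPUT_PRICE_PER_1K + output_tokens * OUTPUT_PRICE_PER_1K) / 1000
```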
Attaching Span-Level Evals Inside the Trace Context
The pattern that closes the loop is span-level evaluation. The same fi.evals templates that score offline datasets also score live spans, and the score attaches to the span as an attribute. The Observe dashboard surfaces failing spans by score, by error rate, and by latency in one view.
```python
# Attach span-level evals inside the agent runtime.
# Env: FI_API_KEY, FI_SECRET_KEY
from fi.evals import evaluate
from fi.evals.otel import enable_auto_enrichment
from fi_instrumentation import register, FITracer
from fi_instrumentation.fi_types import ProjectType

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="multi-agent-prod",
)
tracer = FITracer(trace_provider.get_tracer(__name__))

# Auto-attach every evaluate() score, reason, and latency to the active span.
enable_auto_enrichment()

@tracer.agent
def grounded_answer(query: str, retrieved_docs: list[str]) -> str:
    answer = call_llm(query, retrieved_docs)
    # Score against the retrieved context. Score attaches to the active span.
    evaluate(
        "faithfulness",
        output=answer,
        context="\n".join(retrieved_docs),
        model="turing_flash",
    )
    return answer
```
Setting Up Alerts for Agent Quality Drift
Production systems need continuous monitoring, not one-time audits. Set up alerts for:
- Latency spikes: when p95 latency exceeds your SLA threshold.
- Error rate increases: when tool span failure rate rises above baseline.
- Quality score drops: when automated evaluation scores trend downward.
- Token cost anomalies: when cost per query jumps unexpectedly (often indicates agent loops).
The Observe module supports OTel-powered dashboards with configurable alerts for all of these signals. You can set thresholds on specific agents within a multi-agent chain, so you know exactly which agent is degrading.
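If you want to sanity-check a threshold before wiring it into a dashboard, the latency alert reduces to a simple check over recent root-span durations. A sketch, assuming the durations come from your trace backend's query API and the SLA value is an example:

```python
# Sketch of the p95 latency alert condition. Root-span durations (ms) are assumed
# to come from your trace backend's query API; the SLA value is an example.
import statistics

def p95_latency_breached(root_span_durations_ms: list[float], sla_ms: float = 5000.0) -> bool:
    p95 = statistics.quantiles(root_span_durations_ms, n=20)[18]  # 95th percentile cut point
    return p95 > sla_ms
```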
Gating Regressions at the Agent Command Center
The trace plus eval pattern is reactive. To make it preventive, route the agent through the Agent Command Center gateway. The gateway is a BYOK router with built-in guardrail scanners (PII redaction, prompt injection, toxicity, jailbreak, custom rules). The gateway can fail-closed on a quality threshold so that bad outputs do not reach the user. The same gateway emits traceAI spans, so the path from request, to gate, to model, to evaluation lives in one trace tree.
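The gateway's own configuration is product-specific, but the fail-closed pattern it implements looks roughly like this at the application level. A sketch only: the result-object access, threshold, and fallback text are assumptions, not the Agent Command Center API.

```python
# Application-level sketch of a fail-closed quality gate (not the Agent Command
# Center's configuration API). Names, fallback text, and result access are illustrative.
from fi.evals import evaluate

FAITHFULNESS_GATE = 0.85
FALLBACK = "I couldn't verify that answer against our records, so I'm escalating to a human agent."

def gated_response(answer: str, context: str) -> str:
    result = evaluate("faithfulness", output=answer, context=context, model="turing_flash")
    score = getattr(result, "score", None)     # result shape assumed; check the fi.evals docs
    if score is None or score < FAITHFULNESS_GATE:
        return FALLBACK                        # fail closed: never ship an unverified answer
    return answer
```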
Multi-Agent Architecture Tracing Best Practices: Seven Principles
After working with distributed tracing for AI agents across multiple production deployments, here are the practices that consistently make the biggest difference.
Instrument early, not after a production incident. Adding tracing after deployment is significantly harder than building it in from the start.
Name your spans descriptively. Use names like research_agent:web_search instead of generic tool_call. Clear span names save time during debugging.
Separate environments with project versions. Use distinct project names or version tags for dev, staging, and production traces so test data does not pollute production dashboards.
Trace agent state, not just inputs and outputs. If your agents maintain memory or state between steps, capture state transitions as span attributes; a minimal sketch follows this list of principles. This is critical for agent state management debugging.
Combine tracing with automated evaluation. Raw traces give you the “what.” Automated evals (faithfulness, instruction following, tool accuracy) give you the “how good.” Together they tell the full story.
Use consistent span attributes across frameworks. If you run agents on LangChain and CrewAI within the same system, ensure both emit spans with the same attribute schema. OpenTelemetry semantic conventions handle this when you use compliant instrumentation libraries.
Gate at the boundary. Run the same eval templates that ran in CI against live traces, and use the Agent Command Center gateway to block traffic that fails the gate. Without the gate, the eval only tells you about the regression after it shipped.
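For the state-tracing principle above, here is a minimal sketch that records a transition on the current span using the standard OTel API. The agent.state.* attribute keys are a naming suggestion, not part of any convention.

```python
# Sketch: record agent state transitions as attributes on the current span.
# The agent.state.* keys are a naming suggestion, not an OTel convention.
from opentelemetry import trace

def record_state_transition(old_state: str, new_state: str, reason: str) -> None:
    span = trace.get_current_span()
    span.set_attribute("agent.state.previous", old_state)
    span.set_attribute("agent.state.current", new_state)
    span.set_attribute("agent.state.transition_reason", reason)

# Example: the planner moves from gathering context to executing its plan.
record_state_transition("gathering_context", "executing_plan", "retrieved 3 relevant docs")
```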
How Future AGI Multi-Agent Observability Works: traceAI, Agent Compass, fi.evals Templates, and the Gateway
Future AGI provides a complete observability and evaluation layer built for multi-agent systems. The open source traceAI library (Apache 2.0) handles instrumentation across 15-plus frameworks (OpenAI, Anthropic, LangChain, CrewAI, DSPy, Pydantic AI, LlamaIndex, Google ADK, MCP, and more) with auto-instrumentation that requires zero changes to your agent code, plus the @tracer.agent and @tracer.tool decorators when you want explicit control.
The Agent Compass feature goes beyond traditional trace visualization. It automatically clusters errors, identifies root causes using a built-in error taxonomy, and suggests fixes. Instead of manually sifting through thousands of traces, you get grouped failure patterns with actionable diagnostics.
For evaluation, the fi.evals catalog ships 50-plus ready-to-use templates: task completion, faithfulness, faithfulness with citations, instruction following, tool use correctness, context relevance, toxicity, PII, brand tone, plus custom LLM judges via fi.evals.metrics.CustomLLMJudge. Cloud judges run on the turing_flash (roughly 1-2 seconds), turing_small (roughly 2-3 seconds), and turing_large (roughly 3-5 seconds) models. Templates run inside production traces, so you get real-time quality signals without managing a separate evaluation pipeline. See the cloud evals docs for the model selection guide.
For teams running multi-step agent workflow monitoring at scale, Observe tracks throughput, error rates, latency distributions, and cost per query across the entire agent fleet with customizable alert thresholds. Output gating runs at the Agent Command Center gateway, which is a BYOK router with built-in PII, prompt-injection, toxicity, and custom-rule scanners.
How Distributed Tracing and Span-Level Evaluation Together Keep Multi-Agent Systems Reliable in Production
Multi-agent observability is the difference between shipping agents that work in demos and agents that hold up in production. Without distributed tracing, you are debugging blind. Without span-level evaluation, you are flying without instruments. Without a gateway, your fix only lives in the next deploy, not at the boundary.
The takeaways. Instrument your agents from day one using OpenTelemetry-compatible tooling. Build span hierarchies that reflect your actual agent architecture. Attach evals to spans, not just to offline datasets. Route production traffic through a gateway that fails closed on the same thresholds your CI gates use. The tools exist. OpenTelemetry provides the standard, traceAI provides the instrumentation, fi.evals provides the eval catalog, and Future AGI ties trace, eval, gateway, and optimizer into one closed loop.