
How to Debug AI Agents in 2026: Traces, Spans, and Fix Recipes

Step-by-step playbook for debugging AI agents in 2026. Real tracing decorators, span waterfall view, error propagation, tool-call diffs, and Fix Recipes.


TL;DR

Step | What you do | Time
1. Instrument | Install traceAI; decorate your agent with @tracer.agent, tools with @tracer.tool, chains with @tracer.chain | 1 min
2. Span waterfall | Open the run in the Agent Command Center; see every LLM call, tool call, and retriever as nested spans | 30 sec
3. Error propagation | Click the failing span; see which parents inherited the error and which siblings succeeded | 1 min
4. Tool-call diff | Diff the failed run against a recent successful run for the same task | 1 min
5. Apply Fix Recipe | Read the auto-generated root cause and apply the suggested prompt or tool fix | 1 min

Debugging an AI agent in 2026 is a different exercise from debugging deterministic code. The system is non-deterministic, distributed across LLM calls, tool calls, and child agents, and fails silently when an answer is wrong but well-formed. The fast path is the same every time: capture every step as an OpenTelemetry span, visualize the run as a waterfall, follow error propagation, diff against a known-good run, and apply a fix.

This guide walks through that playbook with Future AGI traceAI (Apache 2.0) and the Agent Command Center. The same waterfall and error-propagation patterns translate to any OpenTelemetry-compatible backend; the value of the Agent Command Center is the failure taxonomy and Fix Recipes built on top of the trace.

Why Traditional Monitoring Tools Cannot Debug AI Agents

A traditional APM stack (Datadog, New Relic, Grafana) treats an agent as one HTTP call. You see total latency and a top-level error rate, but the LLM call inside the agent, the retriever, and the three tools the agent invoked are invisible. When the agent picks the wrong tool, the APM stack reports a successful 200 response.

Newer LLM observability platforms (LangSmith, Arize Phoenix, Langfuse, Datadog LLM Observability) close most of that gap. They expose the LLM call, the tool call, and the retriever as separate spans. The remaining gap is interpretation: thousands of spans become useful only when something clusters them by failure type, links symptoms to causes, and tells you which fix to try first. That last step is where the Agent Command Center sits on top of traceAI.

What changed since 2025

Three things shifted between mid-2025 and May 2026:

  1. OpenTelemetry GenAI semantic conventions matured. OTel GenAI defines stable attributes for LLM spans (model, prompt, response, tokens, finish reason). Every serious tracer now emits these, so traces are portable across backends; a minimal attribute sketch follows this list.
  2. Agent-native span types are first-class. Tools, retrievers, planners, and child-agent invocations have their own semantic types, not just generic internal spans.
  3. Tail-based sampling for failures became the default. Sampling 1% of happy-path traces but 100% of failures keeps cost down while preserving the debugging signal.
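
To make the first point concrete, here is a minimal sketch of those attributes on a hand-rolled LLM span, using the plain OpenTelemetry API. The gen_ai.* names follow the OTel GenAI semantic conventions; the span name, model, and values are illustrative.

from opentelemetry import trace

otel_tracer = trace.get_tracer(__name__)

with otel_tracer.start_as_current_span("openai.chat.completions.create") as span:
    # Attribute names from the OTel GenAI semantic conventions
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-5-2025-08-07")
    span.set_attribute("gen_ai.usage.input_tokens", 412)
    span.set_attribute("gen_ai.usage.output_tokens", 88)
    span.set_attribute("gen_ai.response.finish_reasons", ["stop"])
    # ...make the actual model call and record the response here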

Step 1: Instrument Your Agent with traceAI

traceAI is an open-source (Apache 2.0) tracer built on OpenTelemetry. It ships:

  • Auto-instrumentors for LangChain, LangGraph, OpenAI, Anthropic, LlamaIndex, CrewAI, AutoGen, DSPy
  • Manual instrumentation via FITracer with @tracer.agent, @tracer.tool, @tracer.chain decorators
  • OTLP gRPC and HTTP exporters
  • Compatibility with any OTel backend in addition to the Agent Command Center

Install

pip install traceAI-langchain ai-evaluation

Pick the auto-instrumentor that matches your stack. The list is in the traceAI README.

Configure credentials

import os

os.environ["FI_API_KEY"] = "your-fi-api-key"
os.environ["FI_SECRET_KEY"] = "your-fi-secret-key"

Both keys come from the Future AGI dashboard. Use role-based access control to scope keys per environment.

Register the project and instrument

from fi_instrumentation import register, FITracer
from fi_instrumentation.fi_types import ProjectType
from opentelemetry import trace
from traceai_langchain import LangChainInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="support-agent-prod",
)

# Auto-instrument LangChain across the whole process
LangChainInstrumentor().instrument(tracer_provider=trace_provider)

# Get a manual tracer for custom code paths
tracer = FITracer(trace.get_tracer(__name__))

Decorate your agent

For custom code paths the auto-instrumentor cannot see, use the three primary decorators. They map to OTel GenAI span types.

from fi_instrumentation import FITracer
from opentelemetry import trace

tracer = FITracer(trace.get_tracer(__name__))


@tracer.tool(name="retrieve_kb")
def retrieve_kb(query: str) -> str:
    # Real retriever goes here. The decorator captures input, output, latency.
    return lookup(query)


@tracer.chain(name="format_prompt")
def format_prompt(question: str, context: str) -> str:
    return f"Question: {question}\n\nContext: {context}\n\nAnswer:"


@tracer.agent(name="support_agent")
def support_agent(question: str) -> str:
    context = retrieve_kb(question)
    prompt = format_prompt(question, context)
    return call_llm(prompt)  # the LLM span is added automatically by the auto-instrumented client

Each call to support_agent now emits one agent span, with child spans for the retriever, the chain, and the LLM call (added automatically by the LangChain or OpenAI auto-instrumentor). Together they form the waterfall view in step 2.

Step 2: Open the Span Waterfall

Open the run in the Agent Command Center at /platform/monitor/command-center. The trace view renders the run as a nested waterfall:

support_agent                           1450 ms   ok
  retrieve_kb                             280 ms   ok
    vector_search                         150 ms   ok
  format_prompt                            12 ms   ok
  openai.chat.completions.create          820 ms   ok
    model: gpt-5-2025-08-07
    tokens_in: 412 / tokens_out: 88

Every row is a span. Click any row to see input, output, attributes, and any associated evaluations. This is the canonical view you keep open while debugging.
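
The vector_search row nested under retrieve_kb comes from an ordinary child span. If a sub-step is not covered by a decorator or an auto-instrumentor, you can open one with the plain OpenTelemetry API inside the decorated function; it nests under the current span, which is the standard OTel pattern. A minimal sketch, where vector_store and the retrieval.num_chunks attribute are illustrative stand-ins:

from opentelemetry import trace

otel_tracer = trace.get_tracer(__name__)


def vector_search(query: str) -> list[str]:
    # Opens a child span under whichever span is current
    # (here, the retrieve_kb tool span), producing the vector_search row above
    with otel_tracer.start_as_current_span("vector_search") as span:
        chunks = vector_store.search(query, top_k=5)  # hypothetical vector store client
        span.set_attribute("retrieval.num_chunks", len(chunks))
        return chunks


@tracer.tool(name="retrieve_kb")
def retrieve_kb(query: str) -> str:
    return "\n".join(vector_search(query))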

Step 3: Follow Error Propagation

When a tool call fails, the symptom usually does not surface at the tool span itself; what you see is a wrong answer at the top-level agent span. The error propagation view fixes that.

Click the failing span. The Agent Command Center marks every parent span that inherited the failure state and every sibling span that succeeded:

support_agent                          1390 ms   error    <-- root visible
  retrieve_kb                            240 ms   error    <-- error originated here
    vector_search                         60 ms   error    <-- tool returned empty
  format_prompt                           12 ms   ok       (ran with empty context)
  openai.chat.completions.create         990 ms   ok       (hallucinated an answer)

Now the causal chain is obvious. The retriever returned empty, the chain ran anyway, the LLM hallucinated. Without the propagation view you would have to read every span yourself.
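
The propagation view can only light up if the originating span records its failure. With the plain OpenTelemetry API that means recording the exception and setting the span status to ERROR; an empty-but-successful tool result is exactly the kind of silent failure worth flagging explicitly. A minimal sketch extending the vector_search helper from Step 2 (vector_store is still a hypothetical client):

from opentelemetry.trace import Status, StatusCode


def vector_search(query: str) -> list[str]:
    with otel_tracer.start_as_current_span("vector_search") as span:
        try:
            chunks = vector_store.search(query, top_k=5)  # hypothetical client
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
        if not chunks:
            # Empty retrieval is a silent failure: mark it so parent spans inherit the error
            span.set_status(Status(StatusCode.ERROR, "vector search returned no chunks"))
        return chunks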

Step 4: Diff the Tool Calls

Most agent regressions are caused by an input change: someone updated a prompt, the tool schema changed, or a retriever returned different chunks. The tool-call diff view aligns spans by name across two runs and highlights the field-level deltas.

Pick the failed run and a recent successful run for the same task. The diff shows:

  • The retriever input and output (chunks, scores)
  • The LLM prompt and the LLM response
  • Tool arguments and tool results
  • Model and decoding parameters

Most of the time, one row is highlighted in red. That row is the cause.
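
If you are not using the Command Center, the same comparison can be done by hand on two exported traces. A rough sketch of the idea, treating each run as a dict of span attributes keyed by span name (the span and field names here are illustrative):

def diff_runs(failed_spans: dict[str, dict], good_spans: dict[str, dict]) -> None:
    """Compare two runs span by span and print field-level deltas."""
    for name, failed_attrs in failed_spans.items():
        good_attrs = good_spans.get(name, {})
        for field in sorted(set(failed_attrs) | set(good_attrs)):
            if failed_attrs.get(field) != good_attrs.get(field):
                print(f"{name}.{field}:")
                print(f"  good:   {good_attrs.get(field)!r}")
                print(f"  failed: {failed_attrs.get(field)!r}")


# Usage with illustrative values: the empty retrieval shows up as the first delta
diff_runs(
    {"retrieve_kb": {"output.chunks": 0}, "llm": {"prompt": "...empty context..."}},
    {"retrieve_kb": {"output.chunks": 4}, "llm": {"prompt": "...full context..."}},
)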

Step 5: Read the Root Cause and Apply a Fix Recipe

For each cluster of failures, the Agent Command Center generates a developer-ready ticket:

  • Root cause: plain-language statement of what failed and why
  • Long-term recommendation: structural change to prevent recurrence
  • Immediate fix: a concrete prompt, tool-schema, or code edit
  • Evaluation: a script or eval template to verify the fix worked

A typical Fix Recipe for the example above reads:

The retriever returned empty for the 2FA reset question, but the chain still called the LLM with the empty context. The LLM hallucinated a generic answer. Fix the chain to short-circuit on empty retrieval (return a “no documentation found” response). Add a faithfulness eval to gate this path in CI.
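
Applied to the support_agent from Step 1, the immediate fix is a guard before the prompt is built; a minimal sketch, with the fallback message as an illustrative placeholder:

@tracer.agent(name="support_agent")
def support_agent(question: str) -> str:
    context = retrieve_kb(question)
    if not context.strip():
        # Short-circuit on empty retrieval instead of letting the LLM improvise
        return "I could not find documentation for this question. Escalating to a human agent."
    prompt = format_prompt(question, context)
    return call_llm(prompt)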

The eval part looks like:

from fi.evals import evaluate

result = evaluate(
    "faithfulness",
    output="<the agent's answer>",
    context="<the retrieved document>",
)
print(result.score, result.explanation)

The eval call uses string-template metrics on the Future AGI cloud. turing_flash runs in roughly 1 to 2 seconds, turing_small in 2 to 3 seconds, turing_large in 3 to 5 seconds, depending on the metric.

Best Practices for Production Debugging

  • Instrument every tool, not just the entry agent. A trace without tool spans hides the most common failure modes.
  • Use tail-based sampling. Keep 100% of failing runs and 1 to 5% of successful runs; a simplified in-process sketch follows this list.
  • Tag runs with user journey, release version, and feature flag. Clustering and diffs only work if the metadata is there.
  • Run an eval on every production run for the metrics you care about (faithfulness, instruction adherence, tool selection). Use evaluate with the string-template form for fast metrics and Evaluator with a CustomLLMJudge for the rest.
  • Review the failure feed daily on high-traffic agents. The Agent Command Center surfaces the highest-impact cluster first.
  • Share runs and Fix Recipes with the prompt-engineering and product teams. Most fixes are prompt changes, not code changes.
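
Tail-based sampling is usually done in an OpenTelemetry Collector, but a simplified in-process version makes the idea concrete: buffer finished spans per trace, then export failing traces always and successful ones at a small ratio. A rough sketch (single-threaded, assumes the root span ends last), not the traceAI implementation:

import random
from collections import defaultdict

from opentelemetry.sdk.trace import SpanProcessor
from opentelemetry.trace import StatusCode


class FailureBiasedProcessor(SpanProcessor):
    """Buffer spans per trace; export failing traces always, others at keep_ratio."""

    def __init__(self, exporter, keep_ratio: float = 0.05):
        self.exporter = exporter
        self.keep_ratio = keep_ratio
        self.buffer = defaultdict(list)

    def on_end(self, span):
        self.buffer[span.context.trace_id].append(span)
        if span.parent is None:  # root span ended: decide for the whole trace
            spans = self.buffer.pop(span.context.trace_id)
            failed = any(s.status.status_code is StatusCode.ERROR for s in spans)
            if failed or random.random() < self.keep_ratio:
                self.exporter.export(spans)

    def shutdown(self):
        self.exporter.shutdown()

In practice you would register a processor like this on your tracer provider in place of, or alongside, the default batch processor, and tune keep_ratio per environment.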

Frequently asked questions

What is the fastest way to debug an AI agent in 2026?
Use OpenTelemetry-compatible tracing with a tool that visualizes spans as a waterfall and clusters failures by root cause. The Future AGI stack does this with traceAI plus the Agent Command Center: install the SDK, decorate your agent and tools, and view runs at /platform/monitor/command-center. The waterfall and error-propagation views isolate the failing span so the next step is a focused fix rather than a manual trace hunt.
Why are AI agents harder to debug than regular software?
Three reasons. First, agents are non-deterministic; the same input can produce different outputs across runs. Second, agent behaviour is distributed across LLM calls, tool calls, retrievers, and child agents, so a stack trace alone hides the causal chain. Third, agents fail silently: a hallucinated answer, a wrong tool, or a context loss looks like a successful run unless something else flags it. You need traces, evals, and a failure taxonomy together.
What's the difference between logging and tracing for agents?
Logs capture discrete events with no structure across them. Traces capture causally linked spans with parent/child relationships, latency, and attributes. For an agent, tracing is essential because the question is rarely 'did this line run' and almost always 'why did the LLM pick that tool given that context.' A trace shows the prompt, the tool schema, the tool output, and the next LLM step as linked spans you can replay.
How does the @tracer decorator pattern work in traceAI?
traceAI exposes FITracer with three primary decorators. @tracer.agent marks the top-level agent function, @tracer.tool wraps individual tool functions, and @tracer.chain wraps composite steps such as retrieval-augmented prompts. Each decorator creates a span with semantic conventions (input, output, tool name, model). Auto-instrumentors for LangChain, OpenAI, Anthropic, LlamaIndex, and others add spans without code changes.
What is error propagation in agent tracing?
Error propagation is the visualization that shows how a single failure flows up the span tree. If a tool span returns an error, the parent agent span inherits the failure state, and any downstream chain or tool spans that depended on the failed output are highlighted. This makes it obvious whether the root cause is the failing tool or a missing handler higher up in the agent.
How do I diff a failed agent run against a successful one?
Open both runs in the Agent Command Center. The tool-call diff view aligns spans by name and shows field-level deltas for inputs, outputs, retrieved chunks, and tool arguments. This isolates the input that changed between runs, which is usually the cause when an agent regressed after a deploy or prompt edit.
Can I debug agents in production without slowing them down?
Yes. traceAI uses OpenTelemetry's async exporter and adds milliseconds of overhead per span. The decorators emit spans only when sampling is on. In production, set a low sampling rate for happy-path traces and a high rate for failures, errors, and tagged user journeys. The Agent Command Center supports tail-based sampling so 100% of failing runs are kept regardless of the sampling rate.
Which open-source license does traceAI ship under?
traceAI is released under the Apache 2.0 license. The full license text is on the GitHub repository at github.com/future-agi/traceAI. The companion ai-evaluation library is also Apache 2.0. Both can be used commercially without per-seat licensing, and both export traces in OpenTelemetry format so you can send them to any compatible backend in addition to the Agent Command Center.