What Is Agent Observability?

LLM observability specialized for branching, looping, multi-step agents, capturing graph topology, state diffs, tool calls, and trajectory eval scores.

Agent observability is LLM observability specialized for branching, looping, multi-step agents. It captures the agent’s call graph as a tree of spans linked by gen_ai.agent.graph.node_id and gen_ai.agent.graph.parent_node_id, exposes state diffs between nodes, records every tool-call span and handoff, and attaches trajectory-level eval scores such as TrajectoryScore, TaskCompletion, and ToolSelectionAccuracy. A flat span list buries the loop iteration count, the tool decision that triggered a sub-agent, and the memory read that returned stale state. Agent observability renders the trace as the actual graph the agent walked, with the failure point (wrong tool, infinite loop, stale memory, dropped handoff) visible and replayable.
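
To make the graph idea concrete, here is a minimal sketch of how a flat span list could be regrouped into a tree. It assumes each span is a plain dict carrying the two attributes named above and that every node_id has a single parent; the helper names build_graph and render are hypothetical and are not part of traceAI.

```python
from collections import defaultdict

NODE = "gen_ai.agent.graph.node_id"
PARENT = "gen_ai.agent.graph.parent_node_id"

def build_graph(spans):
    """Group flat spans into parent -> children edges plus a visit count per node."""
    children = defaultdict(list)
    visits = defaultdict(int)
    for span in spans:
        node, parent = span[NODE], span.get(PARENT)
        visits[node] += 1                      # repeated node_id = another iteration
        if node not in children[parent]:
            children[parent].append(node)
    return children, visits

def render(children, visits, parent=None, depth=0, seen=None):
    """Return indented lines, one per graph node, with its iteration count."""
    seen = set() if seen is None else seen
    lines = []
    for node in children.get(parent, []):
        if node in seen:                       # guard against malformed cycles
            continue
        seen.add(node)
        lines.append(f"{'  ' * depth}{node} (x{visits[node]})")
        lines.extend(render(children, visits, node, depth + 1, seen))
    return lines

# Illustrative spans in chronological order, as a flat list would show them.
spans = [
    {NODE: "planner", PARENT: None},
    {NODE: "executor", PARENT: "planner"},
    {NODE: "search_tool", PARENT: "executor"},
    {NODE: "executor", PARENT: "planner"},
    {NODE: "search_tool", PARENT: "executor"},
    {NODE: "critic", PARENT: "planner"},
]
children, visits = build_graph(spans)
tree_lines = render(children, visits)
print("\n".join(tree_lines))
```

The six chronological spans collapse to four graph nodes, and the repeat counts make the executor loop visible at a glance.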

In 2026 this is no longer a nice-to-have. With production agent runs routinely fanning out across 20–100 spans, multiple MCP tool servers, agent memory reads against vector databases, and A2A protocol handoffs to external agents, the only thing that makes a complex run debuggable is a graph view bound to eval scores. Generic APM tooling cannot do this; flame graphs were designed for linear request waterfalls and do not render branching loops or cross-process agent handoff edges. Engineers running 2026-era agentic AI systems without graph-aware observability spend more time reconstructing what happened than fixing what happened.

Why agent observability matters in production LLM and agent systems

A LangGraph run, a CrewAI crew, an OpenAI Agents SDK loop, an AutoGen team, a Google ADK orchestration, a Strands run, or a Pydantic AI agent is not a linear request. It branches: planner picks one of three actions; executor calls a tool, sometimes another agent; critic loops back to planner if the action failed. A single user request fans out into 15–100 spans across dozens of nodes, with state changing between every node.

Generic LLM observability shows you the spans in chronological order. That hides at least four things you need to debug:

  • Loop iteration: which node_id is on iteration 7 of an infinite loop, and what state changed between iterations 6 and 7?
  • Tool decision: which tool did the agent pick and why: what was the prompt it saw at the decision point, and what tool result drove the next decision?
  • Handoff edge: when sub-agent A handed off to sub-agent B (in-process or over A2A), what state was passed, and did the trace context survive the boundary?
  • Memory operation: was an agent memory read on a stale entry, and did the agent rely on it?
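
The loop-iteration question in the list above comes down to a state diff between two visits to the same node. The sketch below assumes agent state is a flat dict; state_diff, ABSENT, and the sample state fields are hypothetical names chosen for illustration.

```python
ABSENT = "<absent>"  # marker for a key that exists on only one side

def state_diff(before, after):
    """Return {key: (old, new)} for every key that was added, removed, or changed."""
    diff = {}
    for key in sorted(before.keys() | after.keys()):
        old = before.get(key, ABSENT)
        new = after.get(key, ABSENT)
        if old != new:
            diff[key] = (old, new)
    return diff

# Hypothetical planner state captured on iterations 6 and 7 of a loop.
iteration_6 = {"goal": "refund order", "attempt": 6, "last_tool": "read_account", "tool_result": None}
iteration_7 = {"goal": "refund order", "attempt": 7, "last_tool": "read_account", "tool_result": None}

diff = state_diff(iteration_6, iteration_7)
print(diff)
```

When only the attempt counter moves between iterations, the agent is repeating itself without learning anything new, which is the signature of a stuck loop.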

The pain falls hardest on agent engineers triaging post-deploy incidents. A user reports “the agent went in circles for two minutes.” In a flat span list, this is 24 mostly identical LLM spans. In an agent trace, this is a planner node that picked the same tool seven times because the tool’s response was always parsed as a failure, which is visible in 10 seconds. The same user reports “it gave me the wrong refund amount.” In a flat list, this is a final answer that looks right. In an agent trace, this is a read_account tool span at step 3 that returned a stale cache hit, propagated through three more reasoning steps, and produced a confident wrong answer.

For 2026 multi-agent stacks (A2A protocol, Model Context Protocol, and tool meshes that span cloud providers), the graph also crosses processes. An A2A handoff to a remote agent has to preserve trace context across the network hop, or the second agent’s spans are orphaned from the first agent’s trace and the incident becomes impossible to reconstruct.
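
As a rough illustration of what preserving trace context across the hop means, the sketch below writes and reads a W3C traceparent header by hand. The inject and extract helpers are hypothetical stand-ins; in practice an OpenTelemetry propagator would normally do this work, and the A2A transport details are not modeled here.

```python
import re
import secrets

# version 00, 32-hex trace id, 16-hex parent span id, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def inject(headers, trace_id, span_id, sampled=True):
    """Calling agent: attach the current trace context to outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers

def extract(headers):
    """Receiving agent: recover the context, or None if it is missing or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None  # all-zero ids are invalid in the W3C format
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }

# Agent A hands off to remote agent B.
trace_id = secrets.token_hex(16)     # 32 hex characters
handoff_span = secrets.token_hex(8)  # 16 hex characters
headers = inject({}, trace_id, handoff_span)

ctx = extract(headers)               # runs inside agent B
print(ctx)
```

If extract returns None on the receiving side, agent B has no parent to attach to and will start a fresh trace, which is exactly the orphaning described above.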

The metrics that distinguish agent observability from generic LLM observability are graph-topology metrics. Span count per trace was never an interesting number for chat completions; for agents it is the primary signal for runaway loops. Per-node fan-out, per-node iteration count, handoff depth, state-diff size between adjacent nodes, and cross-process integrity are all new dimensions that did not exist in a 2022 LLM dashboard but are now first-class debugging signals. A production-grade agent observability deployment exposes each of these as a queryable dashboard panel.
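
A few of these topology metrics can be computed from span parent links alone. The sketch below is a simplified illustration: it assumes each span is a dict with span_id and parent_span_id, uses tree depth as a stand-in for handoff depth, and topology_metrics is a hypothetical helper rather than a platform API.

```python
from collections import Counter

def topology_metrics(spans):
    """Compute span count, largest direct fan-out, and deepest nesting for one trace."""
    parent_of = {s["span_id"]: s["parent_span_id"] for s in spans}
    fan_out = Counter(p for p in parent_of.values() if p is not None)

    def depth(span_id):
        # Walk parent links up to the root; assumes the links contain no cycles.
        steps = 0
        while parent_of.get(span_id) is not None:
            span_id = parent_of[span_id]
            steps += 1
        return steps

    return {
        "span_count": len(spans),
        "max_fan_out": max(fan_out.values(), default=0),
        "max_depth": max((depth(s) for s in parent_of), default=0),
    }

spans = [
    {"span_id": "root", "parent_span_id": None},
    {"span_id": "a", "parent_span_id": "root"},
    {"span_id": "b", "parent_span_id": "root"},
    {"span_id": "c", "parent_span_id": "root"},
    {"span_id": "d", "parent_span_id": "a"},
]
metrics = topology_metrics(spans)
print(metrics)
```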

Agent observability vs LLM observability vs APM

The table below is the shortest way to explain why a 2026 agent stack needs a specialized observability layer, not a generic APM stack with an LLM plugin.

Dimension | APM (Datadog, New Relic) | LLM observability (LangSmith, basic OTel) | Agent observability (FutureAGI)
Primary unit | HTTP request span | LLM span (prompt + completion) | Agent run = graph of LLM + tool + handoff + memory spans
Topology | Linear request waterfall | Linear chain of LLM calls | Branching graph with loops, handoffs, retries
Tool calls | Generic outbound HTTP | Logged as text | gen_ai.tool.name + arguments + result + selection accuracy
State diffs | None | None | Input → output state per node, rendered as diff
Eval scores | None | Output-level scoring sometimes | Trajectory-level TaskCompletion, TrajectoryScore, ToolSelectionAccuracy
Loop detection | N/A | Manual | Built-in: spans-per-node alert
Cross-process trace context | W3C traceparent | Often broken | Propagated across MCP, A2A, sub-agent boundaries
Replay | Limited | Limited | Per-node replay via simulate-sdk

A generic APM stack will tell you the agent run took 18 seconds. LLM observability will tell you it made 24 LLM calls. Agent observability tells you the planner node on iteration 4 picked the wrong tool because the tool result at iteration 3 was parsed as null, and shows the diff between input and output state on that node.

The market matters here too. By May 2026 the gap between agent-aware observability and chain-only observability has widened. Tools that locked into a chain-trace model in 2023 (LangSmith’s original UI; basic OTel exporters) still render flat sequences, while agent-aware platforms (FutureAGI, AgentOps) render the graph. The functional difference is whether a senior engineer can localize a regression to a node in under two minutes. Anything longer and incident response stalls, and at 2026 agent fan-out levels, “longer” is the default unless the tooling is graph-aware.

How FutureAGI handles agent observability

FutureAGI’s approach is to treat the agent graph as a first-class data structure on top of OpenTelemetry. The instrumentation layer is traceAI with agent-aware integrations: traceAI-openai-agents, traceAI-langchain (LangGraph), traceAI-crewai, traceAI-autogen, traceAI-google-adk, traceAI-strands, traceAI-pydantic-ai, traceAI-smolagents, traceAI-haystack, traceAI-agno, traceAI-beeai, traceAI-dspy. Each captures node identity (gen_ai.agent.graph.node_id, gen_ai.agent.graph.parent_node_id, gen_ai.agent.name, gen_ai.agent.description), tool calls (gen_ai.tool.name, gen_ai.tool.call.id, gen_ai.tool.call.arguments, gen_ai.tool.call.result, gen_ai.tool.type), memory operations (fi.span.kind=memory.read|memory.write), and handoffs as parent-child span edges that survive process boundaries.

The platform renders the trace as the actual graph, not a flame graph, with loop edges, handoff arrows, tool-call leaves, memory operations, and state diffs between nodes. Click any node and the right pane shows the input state, the LLM call inside, the tool outputs, the memory operations, and the resulting output state. Click again and you can replay the node with modified state via the FutureAGI simulate surface. Replay produces a new trace in the same graph shape, so a fix can be tested without redeploying the agent.

The differentiator is trajectory-level evaluation bound to graph nodes. fi.evals.TrajectoryScore runs across the full agent run and writes its verdict back as gen_ai.evaluation.score.value on the root span. fi.evals.TaskCompletion returns end-to-end success. fi.evals.ToolSelectionAccuracy flags nodes where the agent picked the wrong tool. The same trace that shows the graph also shows where the trajectory broke. Filtering dashboards to “traces where TrajectoryScore < 0.6 in the last 24h” produces a review queue of failed agent runs, sorted by user impact and routed by cohort. Compared with LangSmith’s chain-trace view, which renders as a linear sequence, the FutureAGI view preserves the graph and binds eval scores back to each node, which means the engineer’s debugging loop runs in the same surface as the dashboard, not in three different tabs.
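
The review-queue filter described above can be approximated in a few lines once scores are exported. This is only a sketch over plain dicts: the field names trajectory_score, affected_users, and ended_at are assumptions made for the example, not the platform's schema, and affected_users is a stand-in for whatever "user impact" measure a team uses.

```python
from datetime import datetime, timedelta, timezone

def review_queue(traces, now, threshold=0.6, window=timedelta(hours=24)):
    """Return failing traces that ended inside the window, highest user impact first."""
    failing = [
        t for t in traces
        if t["trajectory_score"] < threshold
        and timedelta(0) <= now - t["ended_at"] <= window
    ]
    return sorted(failing, key=lambda t: t["affected_users"], reverse=True)

now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
traces = [
    {"trace_id": "t1", "trajectory_score": 0.41, "affected_users": 3, "ended_at": now - timedelta(hours=2)},
    {"trace_id": "t2", "trajectory_score": 0.82, "affected_users": 9, "ended_at": now - timedelta(hours=1)},
    {"trace_id": "t3", "trajectory_score": 0.55, "affected_users": 12, "ended_at": now - timedelta(hours=20)},
    {"trace_id": "t4", "trajectory_score": 0.10, "affected_users": 50, "ended_at": now - timedelta(hours=30)},
]
queue_ids = [t["trace_id"] for t in review_queue(traces, now)]
print(queue_ids)
```

Here t2 passes the score threshold and t4 falls outside the 24-hour window, so only t3 and t1 reach the queue.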

For pre-production debugging, the simulate-sdk runs Persona and Scenario test cases through the agent and produces the same trace structure. A failing scenario in CI shows the same graph view as a failing production run, so the regression engineer never context-switches between two trace formats. The same trace also powers agent-as-judge workflows: the judge sees the full graph, not just the final answer.

In a typical incident, an engineer filters to fi.span.kind=AGENT AND TrajectoryScore < 0.6 AND user.id=X, opens the failing trace, sees the planner→executor→planner→executor loop in the graph view, identifies that the tool always returns an empty list, and replays the executor node with a mocked tool response to confirm the fix. The fix lands as a routing-policy adjustment inside Agent Command Center (for instance, pinning the failing tool’s parser to a stronger model), and the next sampled trace confirms recovery.

In our 2026 evals at FutureAGI, the strongest single metric for catching agent regressions early is spans-per-node distribution. Healthy agent runs cluster tightly; runaway loops appear as long-tail outliers within minutes of a bad deploy. The second strongest metric is the handoff edge integrity check: any A2A or sub-agent boundary where the child span does not link back to the parent indicates a trace propagation bug that will hide future incidents. The same spans-per-node and trajectory-score signals correlate strongly with public agent-trajectory benchmarks, namely τ-bench retail (multi-turn customer-support with tool state) and GAIA’s three-tier multi-hop assistant set, so dashboard health and benchmark score move together when instrumentation is correct.
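
The handoff edge integrity check can be sketched as a scan for boundary spans whose parent never arrived in the trace. The boundary field and the handoff_integrity helper below are assumptions made for illustration; a real deployment would read whatever attribute its instrumentation uses to mark MCP and A2A hops.

```python
def handoff_integrity(spans, boundary_kinds=("a2a", "mcp")):
    """Return (share of boundary spans with a resolvable parent, orphaned span ids)."""
    known_ids = {s["span_id"] for s in spans}
    boundary = [s for s in spans if s.get("boundary") in boundary_kinds]
    orphans = [s["span_id"] for s in boundary if s["parent_span_id"] not in known_ids]
    resolved = len(boundary) - len(orphans)
    ratio = resolved / len(boundary) if boundary else 1.0
    return ratio, orphans

spans = [
    {"span_id": "root", "parent_span_id": None},
    {"span_id": "mcp-1", "parent_span_id": "root", "boundary": "mcp"},
    {"span_id": "a2a-1", "parent_span_id": "root", "boundary": "a2a"},
    {"span_id": "a2a-2", "parent_span_id": "lost-parent", "boundary": "a2a"},
    {"span_id": "mcp-2", "parent_span_id": "a2a-1", "boundary": "mcp"},
]
ratio, orphans = handoff_integrity(spans)
print(ratio, orphans)
```

Any ratio below 1.0 points at a specific boundary where context propagation should be inspected before the next incident depends on it.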

A worked example: localizing an agentic RAG regression

A retrieval-driven agent that answers internal documentation questions starts producing answers that “feel right” but cite the wrong section. Final-answer eval scores stay borderline; user thumbs-down rate rises. In a flat trace view this is hard to localize. In the FutureAGI graph view, an engineer filters to TaskCompletion < 0.7 AND tag.cohort=docs-qa, opens a failing trace, and sees the graph: the retriever node fired three times across the run, each with a different query reformulation, and only one returned the correct chunk. The agent’s planner picked the first reformulation’s results and ignored the third. ContextRelevance on the first retrieval is 0.42; on the third it is 0.91. The fix is a planner-side change: weight later reformulations higher when the earlier ones return low-relevance results. The change ships, the next sampled batch confirms recovery, and the regression eval cohort is updated to lock in the fix. Every step of that loop happens in one surface, with no jumping between agent observability dashboard, evaluate workbench, and a separate retrieval debug tool.
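
One possible shape for that planner-side change is sketched below. It is not the fix from the incident itself: pick_retrieval, the relevance floor of 0.6, the recency bonus of 0.1, and the middle score of 0.55 are all assumptions chosen to illustrate the weighting idea, with only 0.42 and 0.91 taken from the example above.

```python
def pick_retrieval(retrievals, relevance_floor=0.6, recency_bonus=0.1):
    """Choose which retrieval the planner should use.

    retrievals: list of dicts in the order the query reformulations ran.
    If the first attempt is relevant enough, keep it. Otherwise weight
    later reformulations higher and take the best weighted score.
    """
    if not retrievals:
        return None
    first = retrievals[0]
    if first["context_relevance"] >= relevance_floor:
        return first

    def weighted(item):
        position, retrieval = item
        return retrieval["context_relevance"] * (1 + recency_bonus * position)

    return max(enumerate(retrievals), key=weighted)[1]

retrievals = [
    {"query": "refund policy", "context_relevance": 0.42},
    {"query": "refund policy enterprise plan", "context_relevance": 0.55},
    {"query": "enterprise plan refund window section", "context_relevance": 0.91},
]
chosen = pick_retrieval(retrievals)
print(chosen["query"])
```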

The same pattern applies to retrieval-augmented generation pipelines that sit underneath an agent: when grounding fails, the agent observability graph view points at the retriever node, the eval scores quantify the drop, and the fix can target either the retriever, the reranker, or the planner that consumes them. Compared to a 2022 RAG debugging workflow that ended at “the retrieved chunks look fine, I don’t know,” the 2026 agent observability workflow ends at a specific node, a specific eval score, and a specific code-level change.

How to measure or detect agent observability quality

Wire these signals across the agent run, then make them dashboardable, alertable, and queryable:

  • Graph topology: gen_ai.agent.graph.node_id, gen_ai.agent.graph.parent_node_id, gen_ai.agent.id, gen_ai.agent.name, gen_ai.agent.description.
  • Span kind: fi.span.kind=AGENT for agent nodes, fi.span.kind=TOOL for tool calls, fi.span.kind=CHAIN for orchestration steps, fi.span.kind=memory.read|memory.write for memory operations.
  • Tool calls: gen_ai.tool.name, gen_ai.tool.call.id, gen_ai.tool.call.arguments, gen_ai.tool.call.result, gen_ai.tool.type.
  • Trajectory eval: fi.evals.TrajectoryScore returns 0–1 across the run; TaskCompletion returns end-to-end success; ToolSelectionAccuracy returns per-step correctness.
  • Faithfulness on intermediate reasoning: Faithfulness and Groundedness on nodes that cite retrieved context.
  • Loop detection: spans per node; alert if any single node_id appears more than N times in one trace, and pair the alert with agent loop detection and infinite-loop heuristics.
  • Trajectory length distribution: p99 spans per trace; tail anomalies flag runaway runs and stuck retries.
  • Cross-process integrity: percentage of MCP and A2A boundaries where the child span resolves to a parent.
  • Cost and latency per node: per-node cost and p99 latency. Agents amplify both, so a per-node breakdown localizes regressions faster than per-trace aggregates.

The trajectory evaluators can be run offline against a single captured trace:

from fi.evals import TrajectoryScore, ToolSelectionAccuracy, TaskCompletion

# agent_trace is assumed to exist already: a trace captured via traceAI-openai-agents.
trajectory_score = TrajectoryScore()
tool_acc = ToolSelectionAccuracy()
task = TaskCompletion()

# Score the whole run against the user's goal.
t = trajectory_score.evaluate(
    trace=agent_trace,
    goal="Find and email the Q3 sales report",
)
# Per-step tool choice and end-to-end success on the same trace.
tc = tool_acc.evaluate(trajectory=agent_trace)
te = task.evaluate(input="Find and email the Q3 sales report", trajectory=agent_trace)
print(t.score, tc.score, te.score)

A healthy agent observability deployment has a stable TrajectoryScore distribution release-over-release, spans-per-node p99 within configured caps, zero orphaned cross-process spans on the trace integrity dashboard, and per-cohort TaskCompletion floors that hold under model swaps. Those four properties together are the production-readiness contract. Pair them with Faithfulness and Groundedness for any agent that grounds answers in retrieved context, and with PII and PromptInjection for any agent that touches user data or external content.

To wire the trajectory evaluators online (scoring every production trace at ingest rather than during a nightly batch), bind them to the agent span emitted by traceAI:

from traceai_openai_agents import OpenAIAgentsInstrumentor
from fi.evals import TrajectoryScore, TaskCompletion, ToolSelectionAccuracy, PromptInjection

OpenAIAgentsInstrumentor().instrument()

online_evaluators = [
    TrajectoryScore(),
    TaskCompletion(),
    ToolSelectionAccuracy(),
    PromptInjection(),
]

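# Each agent run lands as a root span with `fi.span.kind=AGENT`.
# The platform fans the evaluator list out across its sub-tree and writes
# the verdict back as `gen_ai.evaluation.score.value` on the root.
# stream_agent_spans() is a placeholder for however your pipeline yields
# finished agent root spans; it is not defined in this snippet.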
for span in stream_agent_spans():
    for evaluator in online_evaluators:
        evaluator.evaluate_span(span, write_back=True)

Common mistakes

  • Rendering agent runs as flame graphs. Flame graphs are great for HTTP request waterfalls; they hide loop edges and handoff topology. Use a tree or graph view for agents.
  • Skipping gen_ai.agent.graph.node_id. Without it, you cannot tell two iterations of the same node apart from two different nodes. The graph view collapses and the loop is invisible.
  • No loop detection alert. Agents go infinite. A simple “more than 20 spans for the same node_id in one trace” alert catches most cases.
  • Trajectory evals only run offline. Run TrajectoryScore against a production sample; otherwise quality drift in production is invisible until users complain.
  • Lost context across A2A handoffs. Sub-agent calls over the network must propagate W3C traceparent, or the second agent starts a new trace and the graph breaks.
  • Treating tool spans as opaque. Record gen_ai.tool.call.arguments and gen_ai.tool.call.result; without them you cannot debug why the agent acted on a bad observation.
  • No state-diff capture. Two adjacent nodes that look identical may differ in one field of state. Capture the diff or you cannot reason about why the agent took a different action.
  • One dashboard for agents and chat completions. Agent runs have radically different cardinality and shape; force them into a chat-completion dashboard and you lose the per-node and per-edge signals.
  • No replay surface. Without per-node replay via simulate, debugging requires redeploying the agent with extra logging, a slow loop that masks the regression while users keep hitting it.
  • Tagging the trace only at the root. Cohort, user, feature, and release tags need to live on every span so dashboards can slice cleanly. Root-only tagging breaks per-node slicing.
  • No regression cohort for agent traces. Without a regression eval cohort sampled from production, you cannot tell whether a model swap or prompt change moved the trajectory distribution release-over-release.
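
The loop detection alert from the list above is small enough to sketch directly. The threshold of 20 comes from that bullet; the loop_alerts helper and the dict-shaped spans are assumptions for illustration, not a platform API.

```python
from collections import Counter

NODE = "gen_ai.agent.graph.node_id"

def loop_alerts(spans, max_spans_per_node=20):
    """Return {node_id: span_count} for nodes that exceed the cap within one trace."""
    counts = Counter(span[NODE] for span in spans)
    return {node: n for node, n in counts.items() if n > max_spans_per_node}

# One trace in which the planner ran 24 times and the executor 5 times.
trace_spans = [{NODE: "planner"}] * 24 + [{NODE: "executor"}] * 5
alerts = loop_alerts(trace_spans)
print(alerts)
```

The cap should sit above the healthy spans-per-node range for the agent in question, so 20 is a starting point rather than a universal value.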

Frequently Asked Questions

What is agent observability?

Agent observability is LLM observability specialized for multi-step, branching, looping agents. It captures the agent's call graph, state diffs between nodes, every tool call and handoff, and trajectory-level eval scores like TrajectoryScore, TaskCompletion, and ToolSelectionAccuracy.

How is agent observability different from LLM observability?

LLM observability is fine for linear chains. Agent observability adds graph topology (gen_ai.agent.graph.node_id, parent_node_id, handoff edges, loop counters) so a branching LangGraph or OpenAI Agents SDK run renders as the actual graph and not a flat span list.

How do you measure agent quality at runtime?

Attach fi.evals.TrajectoryScore, TaskCompletion, and ToolSelectionAccuracy to the agent trace; results land as gen_ai.evaluation.score.value span events. Filter dashboards to traces where TrajectoryScore < 0.6 to surface failed runs for review.