How is agent observability different from LLM observability?

LLM observability is fine for linear chains. Agent observability adds graph topology — gen_ai.agent.graph.node_id, parent_node_id, handoff edges, loop counters — so a branching LangGraph or OpenAI Agents SDK run renders as the actual graph and not a flat span list.

How do you measure agent quality at runtime?

Attach fi.evals.TrajectoryScore, GoalProgress, and ToolSelectionAccuracy to the agent trace; results land as gen_ai.evaluation.score.value span events. Filter dashboards to traces where TrajectoryScore < 0.6 to surface failed runs for review.

What Is Agent Observability? FutureAGI Guide (2026)

What Is Agent Observability?

Agent observability is LLM observability specialized for branching, looping, multi-step agents. It captures the agent’s call graph as a tree of spans linked by gen_ai.agent.graph.node_id and gen_ai.agent.graph.parent_node_id, exposes state diffs between nodes, records every tool call and handoff, and attaches trajectory-level eval scores like TrajectoryScore. A flat span list buries the loop iteration count and the tool decision that triggered a sub-agent. Agent observability renders the trace as the actual graph the agent walked, with the failure point — wrong tool, infinite loop, stale state — visible and replayable.
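As a rough illustration of how the two graph attributes turn a flat span list back into a graph, the sketch below folds spans into parent-to-child edges with visit counts. The span dictionaries and the build_agent_graph helper are hypothetical; only the two attribute names come from the text above, and a real traceAI export may be shaped differently.

```python
from collections import Counter

NODE_ID = "gen_ai.agent.graph.node_id"
PARENT_ID = "gen_ai.agent.graph.parent_node_id"


def build_agent_graph(spans):
    """Fold a flat, chronological span list into (parent, child) edge counts."""
    edges = Counter()
    for span in spans:
        attrs = span["attributes"]
        node = attrs.get(NODE_ID)
        if node is None:
            continue  # not an agent-graph span
        parent = attrs.get(PARENT_ID)  # None marks the entry node
        edges[(parent, node)] += 1
    return edges


# Hypothetical spans: a planner/executor pair where the planner is re-entered once.
spans = [
    {"attributes": {NODE_ID: "planner"}},
    {"attributes": {NODE_ID: "executor", PARENT_ID: "planner"}},
    {"attributes": {NODE_ID: "planner", PARENT_ID: "executor"}},
    {"attributes": {NODE_ID: "executor", PARENT_ID: "planner"}},
]

graph = build_agent_graph(spans)
for (parent, node), visits in graph.items():
    print(f"{parent or 'START'} -> {node} (x{visits})")
```

An edge whose count is above one, or an edge that points back to an earlier node, is the loop that a chronological span list hides.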

Why It Matters in Production LLM and Agent Systems

A LangGraph run, a CrewAI crew, an OpenAI Agents SDK loop, or a Google ADK orchestration is not a linear request. It branches: the planner picks one of three actions; the executor calls a tool, or sometimes another agent; the critic loops back to the planner if the action failed. A single user request can fan out into 15–50 spans spread across many nodes.

Generic LLM observability shows you the spans in chronological order. That hides three things you need to debug:

Loop iteration: which node_id is on iteration 7 of an infinite loop, and what state changed between iterations 6 and 7?
Tool decision: which tool did the agent pick and why — what was the prompt it saw at the decision point?
Handoff edge: when sub-agent A handed off to sub-agent B, what state was passed?

The pain falls hardest on agent engineers triaging post-deploy incidents. A user reports “the agent went in circles for two minutes.” In a flat span list, this is 24 mostly identical LLM spans. In an agent trace, this is a planner node that picked the same tool seven times because the tool’s response was always parsed as a failure — visible in 10 seconds.
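To make the "what changed between iterations 6 and 7" question concrete, here is a minimal sketch of a state diff between two snapshots of the same node. The diff_state helper and the snapshot fields (tool, results, attempt) are invented for illustration; it assumes state is captured as a flat dictionary, which may not match how a given framework stores it.

```python
def diff_state(before, after):
    """Compare two state snapshots and report added, removed, and changed keys."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots of the planner's state on two consecutive iterations.
iteration_6 = {"tool": "search_reports", "results": [], "attempt": 6}
iteration_7 = {"tool": "search_reports", "results": [], "attempt": 7}

delta = diff_state(iteration_6, iteration_7)
print(delta)
```

When only a counter changes between iterations, as in this example, the agent is repeating the same decision on the same inputs, which is the signature of the stuck loop described above.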

For 2026 multi-agent stacks built on the Agent2Agent (A2A) protocol and the Model Context Protocol (MCP), the graph also crosses processes. An A2A handoff to a remote agent has to preserve trace context across the network hop, or the second agent’s spans are orphaned from the first agent’s trace.
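The usual carrier for that context is the W3C Trace Context traceparent header. The sketch below builds and parses one by hand to show what has to survive the hop. The helper names and the headers dictionary are illustrative assumptions; in a real deployment an OpenTelemetry propagator would normally do this for you.

```python
import re
import secrets

# version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id, span_id, sampled=True):
    """Format a version-00 traceparent header value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return the trace context carried by a header, or None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None  # all-zero ids are invalid in Trace Context
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


# Agent A: attach the current trace and span ids to the outgoing handoff.
trace_id = secrets.token_hex(16)  # 32 hex characters
span_id = secrets.token_hex(8)    # 16 hex characters
outgoing_headers = {"traceparent": make_traceparent(trace_id, span_id)}

# Agent B: continue the same trace instead of starting a new one.
ctx = parse_traceparent(outgoing_headers["traceparent"])
print(ctx)
```

If agent B receives no header, or one that fails to parse, it has nothing to parent its spans to, and the handoff edge disappears from the graph.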

How FutureAGI Handles Agent Observability

FutureAGI’s approach is to treat the agent graph as a first-class data structure on top of OpenTelemetry. The instrumentation layer is traceAI with agent-aware integrations: traceAI-openai-agents, traceAI-langchain (LangGraph), traceAI-crewai, traceAI-google-adk, traceAI-strands, traceAI-autogen, traceAI-pydantic-ai. Each captures node identity (gen_ai.agent.graph.node_id, gen_ai.agent.graph.parent_node_id, gen_ai.agent.name), tool calls (gen_ai.tool.name, gen_ai.tool.call.arguments, gen_ai.tool.call.result), and handoffs as parent-child span edges.

The platform renders the trace as the actual graph — not a flame graph — with the loop edges, handoff arrows, and state diffs between nodes. Click any node and the right pane shows the input state, the LLM call inside, the tool outputs, and the resulting output state. Click again and you can replay the node with modified state via the FutureAGI simulation surface.

The differentiator is trajectory-level evaluation. fi.evals.TrajectoryScore runs across the full agent run and writes its verdict back as gen_ai.evaluation.score.value on the root span. fi.evals.GoalProgress measures partial credit per node. fi.evals.ToolSelectionAccuracy flags nodes where the agent picked the wrong tool. The same trace that shows the graph also shows where the trajectory broke. Filtering dashboards to “traces where TrajectoryScore < 0.6 in the last 24h” produces a review queue of failed agent runs, sorted by user impact.
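The review-queue filter can be pictured with a small sketch over exported trace summaries. This is not the FutureAGI dashboard query language: the record fields (trajectory_score, ended_at, affected_users) and the review_queue helper are assumptions made for the example, and "user impact" is approximated here by a hypothetical affected_users count.

```python
from datetime import datetime, timedelta, timezone


def review_queue(traces, now, threshold=0.6, window=timedelta(hours=24)):
    """Select recent low-scoring traces, highest assumed user impact first."""
    cutoff = now - window
    failed = [
        t for t in traces
        if t["trajectory_score"] < threshold and t["ended_at"] >= cutoff
    ]
    return sorted(failed, key=lambda t: t["affected_users"], reverse=True)


# Hypothetical trace summaries.
now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
traces = [
    {"trace_id": "a", "trajectory_score": 0.9, "ended_at": now - timedelta(hours=1), "affected_users": 40},
    {"trace_id": "b", "trajectory_score": 0.4, "ended_at": now - timedelta(hours=2), "affected_users": 3},
    {"trace_id": "c", "trajectory_score": 0.2, "ended_at": now - timedelta(hours=30), "affected_users": 90},
    {"trace_id": "d", "trajectory_score": 0.5, "ended_at": now - timedelta(hours=5), "affected_users": 12},
]

queue = review_queue(traces, now)
print([t["trace_id"] for t in queue])
```

Trace "a" passes the score threshold and trace "c" falls outside the 24-hour window, so only "d" and "b" reach the queue.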

For pre-production debugging, the FutureAGI simulate-sdk runs Persona and Scenario test cases through the agent and produces the same trace structure. A failing scenario in CI shows the same graph view as a failing production run, so the regression engineer never context-switches between two trace formats.

In a typical incident, an engineer filters to fi.span.kind=AGENT AND TrajectoryScore < 0.6 AND user.id=X, opens the failing trace, sees the planner→executor→planner→executor loop in the graph view, identifies that the tool always returns an empty list, and replays the executor node with a mocked tool response to confirm the fix.

How to Measure or Detect It

Wire these signals across the agent run:

Graph topology: gen_ai.agent.graph.node_id, gen_ai.agent.graph.parent_node_id, gen_ai.agent.id, gen_ai.agent.name, gen_ai.agent.description.
Span kind: fi.span.kind=AGENT for agent nodes, fi.span.kind=TOOL for tool calls, fi.span.kind=CHAIN for orchestration steps.
Tool calls: gen_ai.tool.name, gen_ai.tool.call.id, gen_ai.tool.call.arguments, gen_ai.tool.call.result, gen_ai.tool.type.
Trajectory eval: fi.evals.TrajectoryScore returns a score from 0 to 1 across the run; GoalProgress returns per-step credit; ToolSelectionAccuracy returns a boolean per tool decision.
Loop detection: spans per node — alert if any single node_id appears > N times in one trace.
Trajectory length distribution: p99 spans per trace; tail anomalies flag runaway runs.

from fi.evals import TrajectoryScore

# agent_trace is the trace of one finished agent run, captured via
# traceAI-openai-agents; it must be defined before this call.
trajectory_score = TrajectoryScore()
result = trajectory_score.evaluate(
    trace=agent_trace,
    goal="Find and email the Q3 sales report",
)
# Score for the whole run, plus the evaluator's stated reason.
print(result.score, result.reason)
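The loop-detection and trajectory-length signals from the list above can be sketched in plain Python, independent of any SDK. The span dictionaries, the looping_nodes and p99 helpers, and the threshold of 20 are assumptions for illustration; the percentile uses the simple nearest-rank method.

```python
import math
from collections import Counter

NODE_ID = "gen_ai.agent.graph.node_id"


def looping_nodes(spans, max_visits=20):
    """Return node ids that appear more than max_visits times in one trace."""
    visits = Counter(
        s["attributes"][NODE_ID] for s in spans if NODE_ID in s["attributes"]
    )
    return {node: n for node, n in visits.items() if n > max_visits}


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("p99 needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


# Hypothetical trace: the planner runs 25 times, the executor 3 times.
spans = [{"attributes": {NODE_ID: "planner"}} for _ in range(25)]
spans += [{"attributes": {NODE_ID: "executor"}} for _ in range(3)]
alerts = looping_nodes(spans)
print(alerts)

# Hypothetical spans-per-trace counts for 100 traces.
spans_per_trace = [20] * 98 + [45, 300]
tail = p99(spans_per_trace)
print(tail)
```

Traces whose span count sits far above the p99 value, like the 300-span trace in this sample, are the runaway runs worth opening first.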

Common Mistakes

Rendering agent runs as flame graphs. Flame graphs are great for HTTP request waterfalls; they hide loop edges and handoff topology. Use a tree or graph view for agents.
Skipping gen_ai.agent.graph.node_id. Without it, you cannot tell two iterations of the same node apart from two different nodes.
No loop detection alert. Agents go infinite. A simple “more than 20 spans for the same node_id in one trace” alert catches most cases.
Trajectory evals only run offline. Run TrajectoryScore against a production sample; otherwise quality drift in production is invisible.
Lost context across A2A handoffs. Sub-agent calls over the network must propagate W3C traceparent, or the second agent starts a new trace and the graph breaks.