What Is AI Observability?

Runtime visibility for AI systems using traces, spans, quality evals, token cost, latency, retrieval, and agent execution signals.

AI observability is the production observability discipline for AI systems: capturing structured traces, model inputs and outputs, retrievals, tool calls, evaluator scores, cost, and latency so engineers can explain behavior and fix failures. It shows up in production traces, not only offline tests, and covers LLM apps, RAG pipelines, voice agents, and multi-step agents. In FutureAGI, traceAI integrations such as traceAI-langchain emit OpenTelemetry spans with fields like llm.token_count.prompt and fi.span.kind, while evaluators such as Groundedness attach quality signals to the same run. The May 2026 short version: if your dashboard answers “what is p99 latency?” but cannot answer “which retrieved chunk caused this ungrounded answer?”, you have monitoring, not AI observability.

Why AI observability matters in production LLM and agent systems

AI systems fail through hidden intermediate decisions, not just thrown exceptions. A customer-support agent can return a fluent answer with HTTP 200 while the retriever pulled stale policy text, the model ignored a high-priority system instruction, a tool timed out twice, and token cost tripled because a retry loop expanded the context window. Classic application monitoring sees latency and status code. AI observability sees the chain of evidence.

Ignoring it creates three concrete failure modes:

  • Silent hallucination: an answer looks confident but is not grounded in retrieved context or tool output. A Groundedness evaluator on the trace would have caught it; an HTTP 200 did not.
  • Runaway cost: one user request fans into repeated model calls, judge calls, and tool retries. gen_ai.usage.input_tokens aggregated by trace exposes the cost; an averaged per-request metric hides it.
  • Unattributed regression: a prompt, retriever, model, or route changes, but the dashboard cannot connect the change to quality. Without gen_ai.request.model and a prompt version on every span, a model swap that broke the “EU enterprise” cohort can go undetected for a week.
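
The runaway-cost failure mode above can be sketched in a few lines. This is a minimal illustration, assuming spans have been exported as plain dictionaries; the span records, the `input_tokens_by_trace` and `runaway_traces` helpers, and the 3x-median rule are hypothetical and not part of traceAI. Only the gen_ai.usage.input_tokens attribute name comes from the text.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical exported LLM spans. Trace t3 is a retry loop whose context grows.
spans = [
    {"trace_id": "t1", "gen_ai.usage.input_tokens": 900},
    {"trace_id": "t2", "gen_ai.usage.input_tokens": 1100},
    {"trace_id": "t3", "gen_ai.usage.input_tokens": 1000},
    {"trace_id": "t3", "gen_ai.usage.input_tokens": 2000},
    {"trace_id": "t3", "gen_ai.usage.input_tokens": 4000},
    {"trace_id": "t3", "gen_ai.usage.input_tokens": 8000},
]


def input_tokens_by_trace(spans):
    """Sum gen_ai.usage.input_tokens over every span that shares a trace id."""
    totals = defaultdict(int)
    for span in spans:
        totals[span["trace_id"]] += span["gen_ai.usage.input_tokens"]
    return dict(totals)


def runaway_traces(totals, factor=3.0):
    """Return trace ids whose total exceeds `factor` times the median trace total."""
    baseline = median(totals.values())
    return sorted(tid for tid, n in totals.items() if n > factor * baseline)


totals = input_tokens_by_trace(spans)
per_call_average = mean(s["gen_ai.usage.input_tokens"] for s in spans)

print(f"average per call: {per_call_average:.0f} tokens")
print("totals by trace:", totals)
print("runaway traces:", runaway_traces(totals))
```

The per-call average looks unremarkable, while the per-trace total makes the one expensive request stand out.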

The pain lands on different teams at once. Developers need the prompt, retrieved chunks, tool arguments, and span tree to debug one failed answer. SREs need p99 latency, retry counts, token-cost-per-trace, and error cohorts to keep service margins. Compliance teams need redacted prompts, audit logs, and evaluator results tied to the exact production run for sectoral audits. Product teams need user-impact slices such as failed traces by workflow, account tier, or release version.

This is especially relevant in 2026-era agentic systems because one user turn can cross multiple model providers, MCP servers, vector databases, sub-agents via A2A, and a voice frontend on LiveKit or Pipecat. Without trace context, every downstream span becomes a detached clue. With AI observability, the full request becomes a replayable incident record. The agent benchmarks that frontier labs report on all measure trajectory quality: τ-bench retail/airline (multi-turn customer support, frontier 60-72%), SWE-Bench Verified (500 real GitHub issues, frontier 70-78%), GAIA Level 3 (Meta, 45-58%), and OSWorld (desktop-action agents, 35-42%). Trajectory quality only makes sense if you can observe the trajectory. A production system without trajectory-level traces can’t even compare itself to those benchmarks.

The other 2026 shift is regulatory. The EU AI Act high-risk regime requires automated logging “to enable post-market monitoring”; ISO/IEC 42001 and the NIST AI RMF Generative AI Profile both call out logging and traceability as core controls. None of those are satisfied by request-response logging; they need trace-level evidence with model, prompt version, retrieval payload, and policy decisions attached. AI observability is the engineering practice that produces that evidence.

How FutureAGI handles AI observability

FutureAGI’s approach is to make the production trace the shared object for debugging, evaluation, monitoring, and audit. The traceAI instrumentation layer emits OpenTelemetry-compatible spans from AI frameworks: traceAI-langchain, traceAI-langgraph, traceAI-openai, traceAI-openai-agents, traceAI-anthropic, traceAI-google-genai, traceAI-llamaindex, traceAI-crewai, traceAI-autogen, traceAI-mcp, traceAI-livekit, and ~30 more. A LangChain RAG app, for example, produces nested spans for the user request, retriever call, reranker, LLM generation, tool call, and final response, all under one trace ID.

The fields matter. llm.token_count.prompt and llm.token_count.completion show token growth by step. gen_ai.request.model records the model used for a span. gen_ai.usage.input_tokens and gen_ai.usage.output_tokens give billing-grade token counts. fi.span.kind distinguishes LLM, RETRIEVER, TOOL, AGENT, GUARDRAIL, and EVALUATOR spans. agent.trajectory.step makes a multi-step agent trace searchable by step number instead of forcing an engineer to read raw logs. For MCP-mediated calls, the trace context propagates across the server boundary so the parent trajectory stays intact.
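
To make the role of these attributes concrete, here is a minimal sketch that queries a trace by span kind and by trajectory step. It assumes spans exported as plain dictionaries; the span ids, the "model-x" name, the token counts, and the `select` and `prompt_tokens_by_step` helpers are hypothetical. Only the attribute names come from the text.

```python
# One hypothetical agent trace: a root AGENT span with four child steps.
spans = [
    {"span_id": "a", "parent_id": None,
     "attributes": {"fi.span.kind": "AGENT"}},
    {"span_id": "b", "parent_id": "a",
     "attributes": {"fi.span.kind": "RETRIEVER", "agent.trajectory.step": 1}},
    {"span_id": "c", "parent_id": "a",
     "attributes": {"fi.span.kind": "LLM", "agent.trajectory.step": 2,
                    "gen_ai.request.model": "model-x",
                    "llm.token_count.prompt": 1200,
                    "llm.token_count.completion": 150}},
    {"span_id": "d", "parent_id": "a",
     "attributes": {"fi.span.kind": "TOOL", "agent.trajectory.step": 3}},
    {"span_id": "e", "parent_id": "a",
     "attributes": {"fi.span.kind": "LLM", "agent.trajectory.step": 4,
                    "gen_ai.request.model": "model-x",
                    "llm.token_count.prompt": 2600,
                    "llm.token_count.completion": 300}},
]


def select(spans, where):
    """Return spans whose attributes match every key/value pair in `where`."""
    return [
        s for s in spans
        if all(s["attributes"].get(k) == v for k, v in where.items())
    ]


def prompt_tokens_by_step(spans):
    """List (trajectory step, prompt tokens) for LLM spans, ordered by step."""
    llm_spans = select(spans, {"fi.span.kind": "LLM"})
    pairs = [
        (s["attributes"]["agent.trajectory.step"],
         s["attributes"]["llm.token_count.prompt"])
        for s in llm_spans
    ]
    return sorted(pairs)


print([s["span_id"] for s in select(spans, {"fi.span.kind": "TOOL"})])
print(prompt_tokens_by_step(spans))
```

Because every span carries the same attribute vocabulary, finding the step where the prompt grew is a lookup rather than a log search.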

Quality signals attach to the same trace. A Groundedness evaluator scores whether an answer is supported by retrieved context. ToolSelectionAccuracy flags wrong tool decisions inside an agent trajectory. ProtectFlash runs as a fast prompt-injection check before sensitive tool execution. Faithfulness, ContextRelevance, ContextPrecision, and ContextRecall give RAG-specific signals. TaskCompletion and TrajectoryScore give agent-level signals. The engineer filters FutureAGI to traces where groundedness fell below threshold, opens the failing span, inspects the retriever payload, and either changes the retriever, adds a guardrail, or starts a regression eval.

A real example: a healthcare RAG support agent built on LangGraph + Anthropic + pgvector. traceAI-langgraph and traceAI-anthropic capture every node. After upgrading from Claude Sonnet 4.5 to Claude Opus 4.7, Groundedness average drops from 0.91 to 0.83 on the production trace stream. The team filters traces by gen_ai.request.model="claude-opus-4-7" and fi.span.kind=RETRIEVER, finds the retriever payload is unchanged but the model is generating longer answers that drift from the source, and pushes a prompt update that constrains generation length. The fix ships through the FutureAGI release gate with a per-cohort Groundedness threshold and the regression doesn’t recur.
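
The investigation in that example boils down to a group-by and a filter over trace records. The sketch below assumes each trace has been flattened to a dictionary with a model name and a Groundedness score; the records, the "old-model" and "new-model" placeholders, and both helper functions are hypothetical and do not represent the FutureAGI query API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flattened trace records with illustrative scores.
records = [
    {"trace_id": "a1", "gen_ai.request.model": "old-model", "groundedness": 0.92},
    {"trace_id": "a2", "gen_ai.request.model": "old-model", "groundedness": 0.90},
    {"trace_id": "a3", "gen_ai.request.model": "old-model", "groundedness": 0.91},
    {"trace_id": "b1", "gen_ai.request.model": "new-model", "groundedness": 0.85},
    {"trace_id": "b2", "gen_ai.request.model": "new-model", "groundedness": 0.81},
    {"trace_id": "b3", "gen_ai.request.model": "new-model", "groundedness": 0.83},
]


def mean_groundedness_by_model(records):
    """Average the groundedness score per gen_ai.request.model value."""
    scores = defaultdict(list)
    for r in records:
        scores[r["gen_ai.request.model"]].append(r["groundedness"])
    return {model: mean(vals) for model, vals in scores.items()}


def low_groundedness(records, model, threshold):
    """Trace ids for one model whose score fell below the threshold."""
    return [
        r["trace_id"] for r in records
        if r["gen_ai.request.model"] == model and r["groundedness"] < threshold
    ]


means = mean_groundedness_by_model(records)
drop = means["old-model"] - means["new-model"]

print({m: round(v, 2) for m, v in means.items()})
print("regressed:", drop > 0.05)  # 0.05 is an arbitrary example tolerance
print("traces to inspect:", low_groundedness(records, "new-model", 0.82))
```

The list of trace ids is the starting point for opening individual spans and comparing retriever payloads against the generated answers.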

Unlike LangSmith-style debugging that often starts from framework-specific traces, the traceAI path is OpenTelemetry-native. The same span can feed FutureAGI dashboards, an OTLP backend, and an on-call alert. That matters when the incident crosses a model gateway, a retriever service, and a background worker. Compared to Arize Phoenix and Langfuse, which both emit OpenTelemetry but treat evaluation as a separate workflow, FutureAGI attaches evaluator scores directly to the span, so the question “which production traces are ungrounded right now?” is one filter, not three joins.

How to measure AI observability

Measure AI observability by checking whether every important AI decision is represented as a span, attribute, or evaluator signal. The table maps surface to instrumentation.

| AI surface | What to trace | FutureAGI integration | Key attributes |
|---|---|---|---|
| LLM call (OpenAI / Anthropic / Google) | Prompt, completion, model, tokens, cost | traceAI-openai, traceAI-anthropic, traceAI-google-genai | gen_ai.request.model, llm.token_count.prompt |
| LLM gateway (LiteLLM / Portkey / Bedrock) | Route choice, fallback, cache hit, cost | traceAI-litellm, traceAI-portkey, traceAI-bedrock | gen_ai.request.model, gateway route id |
| Retrieval (LangChain / LlamaIndex / Haystack) | Query, retrieved chunks, ranking, latency | traceAI-langchain, traceAI-llamaindex, traceAI-haystack | fi.span.kind=RETRIEVER, retriever metadata |
| Vector store (Pinecone / Qdrant / pgvector / LanceDB) | Index, search params, top-k hits | traceAI-pinecone, traceAI-qdrant, traceAI-pgvector, traceAI-lancedb | vector store id, k, score |
| Agent (LangGraph / OpenAI Agents / CrewAI / AutoGen) | Trajectory, tools, handoffs, stop conditions | traceAI-langgraph, traceAI-openai-agents, traceAI-crewai, traceAI-autogen | agent.trajectory.step, fi.span.kind=AGENT |
| Tool call via MCP | Server, tool, request, response | traceAI-mcp | fi.span.kind=TOOL, MCP server attrs |
| Agent-to-agent delegation | Task id, child trajectory, contract | traceAI-a2a | A2A task id, parent trace id |
| Voice (LiveKit / Pipecat) | Audio in/out, ASR, TTS, turn-taking | traceAI-livekit, traceAI-pipecat | audio span, ASR transcript |
| Guardrail decision | Detector, action, policy version | Agent Command Center audit | fi.span.kind=GUARDRAIL |
| Evaluator score | Score, reason, evaluator class | fi.evals attach | fi.span.kind=EVALUATOR |
| Prompt management | Template version, variables | fi.prompt | prompt template id and version |

The signals to wire on every system:

  • Trace coverage: percentage of model, retriever, tool, guardrail, and agent steps with a parent trace id. Below 95% means you have observability gaps.
  • Span taxonomy: fi.span.kind populated for LLM, RETRIEVER, TOOL, AGENT, GUARDRAIL, and EVALUATOR spans. Missing kinds mean dashboards can’t pivot.
  • Token and cost signals: llm.token_count.prompt, llm.token_count.completion, gen_ai.usage.input_tokens, and token-cost-per-trace.
  • Quality signal: Groundedness returns a score or verdict for whether the response is supported by supplied context.
  • Operational signal: p99 latency, time-to-first-token, retry count, eval-fail-rate-by-cohort, and escalation-rate after low-score traces.
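  • Audit signal: model name, prompt version, route, guardrail decision, reviewer state on every gated request.

A minimal offline check with the Groundedness evaluator:

from fi.evals import Groundedness

# `answer` and `retrieved_context` come from your application:
# the generated response and the chunks that were handed to the model.
evaluator = Groundedness()
result = evaluator.evaluate(
    response=answer,
    context=retrieved_context,
)
# Keep the reason alongside the score; it carries the "why" at debug time.
print(result.score, result.reason)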

For online evaluation wired directly to traceAI spans, so that every production trace carries a Groundedness, ToolSelectionAccuracy, and TaskCompletion score you can pivot on in the tracing UI, attach evaluators to the span pipeline:

from fi.evals import Groundedness, ToolSelectionAccuracy, TaskCompletion
from traceai.langgraph import LangGraphInstrumentor

LangGraphInstrumentor().instrument()  # emits gen_ai.* + fi.span.kind=AGENT spans

for ev in [Groundedness(online=True),
           ToolSelectionAccuracy(online=True, node_filter=["tool"]),
           TaskCompletion(online=True)]:
    ev.attach_to_spans(
        attribute="fi.span.kind",
        sample_rate=0.2,
        emit_attribute=f"fi.eval.{ev.__class__.__name__.lower()}",
    )
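
The trace-coverage and span-taxonomy signals listed earlier can be checked mechanically. The following is a rough sketch over exported step records, not a traceAI feature: the records and both helper functions are hypothetical, while the six span kinds are the ones named in the text.

```python
# Span kinds named in the text; anything else is treated as nonconforming.
ALLOWED_KINDS = {"LLM", "RETRIEVER", "TOOL", "AGENT", "GUARDRAIL", "EVALUATOR"}

# Hypothetical step records. The third one lost its trace context and
# uses a team-specific kind instead of the shared taxonomy.
steps = [
    {"trace_id": "t1", "fi.span.kind": "LLM"},
    {"trace_id": "t1", "fi.span.kind": "RETRIEVER"},
    {"trace_id": None, "fi.span.kind": "tool_call"},
    {"trace_id": "t1", "fi.span.kind": "GUARDRAIL"},
]


def trace_coverage(steps):
    """Fraction of steps that carry a trace id (0.0 for an empty input)."""
    if not steps:
        return 0.0
    traced = sum(1 for s in steps if s.get("trace_id"))
    return traced / len(steps)


def nonconforming_kinds(steps):
    """Distinct fi.span.kind values that are missing or outside the taxonomy."""
    return sorted(
        {str(s.get("fi.span.kind")) for s in steps
         if s.get("fi.span.kind") not in ALLOWED_KINDS}
    )


coverage = trace_coverage(steps)
print(f"trace coverage: {coverage:.0%}")
print("below the 95% bar:", coverage < 0.95)
print("nonconforming kinds:", nonconforming_kinds(steps))
```

Running a check like this in CI or on a sample of production spans is one way to notice a gap before a dashboard silently splits.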

A practical readiness test for 2026: sample 100 failed or low-rated user turns and ask whether an engineer can identify the failing model, prompt version, retrieval payload, tool call, evaluator verdict, and owner in under five minutes. If yes, your AI observability is real. If no, you have logs, not observability.
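
One way to run that readiness test without a stopwatch is to check whether the sampled traces even contain the fields an engineer would need. This is a sketch under stated assumptions: apart from gen_ai.request.model, every field key below is a hypothetical name chosen for illustration, and the sample records are made up.

```python
# Fields an engineer needs to explain a failed turn.
# Only gen_ai.request.model is taken from the text; the other keys are hypothetical.
REQUIRED_FIELDS = (
    "gen_ai.request.model",
    "prompt.version",
    "retrieval.payload",
    "tool.calls",
    "eval.verdict",
    "owner",
)


def missing_fields(trace):
    """Required fields that are absent or None on a flattened trace record."""
    return [f for f in REQUIRED_FIELDS if trace.get(f) is None]


def readiness(traces):
    """Fraction of sampled traces that carry every required field."""
    if not traces:
        return 0.0
    complete = sum(1 for t in traces if not missing_fields(t))
    return complete / len(traces)


# Two hypothetical low-rated turns: one fully attributed, one not.
sample = [
    {"gen_ai.request.model": "model-x", "prompt.version": "v12",
     "retrieval.payload": ["chunk-1"], "tool.calls": [],
     "eval.verdict": "fail", "owner": "support-agents"},
    {"gen_ai.request.model": "model-x", "eval.verdict": "fail"},
]

print(missing_fields(sample[1]))
print(f"readiness: {readiness(sample):.0%}")
```

A field-completeness score is only a proxy for the five-minute test, but a trace that lacks these fields cannot pass it.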

LLM, ML, agent: where the boundaries are

LLM observability is the narrow case: model call traces with prompt, completion, model, and token signals. ML observability is the older discipline: drift, performance, data quality, and prediction distributions for classical models. Agent observability is the graph-level discipline for branching, looping, tool-using agents. AI observability is the umbrella: it covers all three plus the gateway, retrieval, and voice surfaces. The right mental model in 2026 is that all four are facets of the same trace tree, not separate systems. FutureAGI emits all four through traceAI because trying to debug a 2026 agent across four observability vendors is how teams burn three months on incident response that should have taken three hours.

What changed in observability from 2024 to 2026

Three things changed materially. First, OpenTelemetry became the default transport for AI traces: the OpenInference and gen_ai.* semantic conventions are stable, every major framework emits OTel spans, and vendor lock-in moved from “your trace format” to “your eval and UI layer.” FutureAGI’s traceAI is OTel-native by construction. Second, the trace surface widened from single-model calls to full agent trajectories, MCP-mediated tool calls, A2A delegations, and voice pipelines, meaning the trace is now distributed across services by default. Third, evaluation moved from offline-only to span-attached: the score lives on the trace, not in a separate notebook. The combined result is that “AI observability” in 2026 is a unified discipline with traces as the core artifact, evaluators as first-class signals, and the gateway audit log as the compliance overlay.

The pre-2024 pattern of “log the prompt, eyeball the output” doesn’t scale past one agent, and “use ML observability for everything” doesn’t work either because LLM call patterns (variable-length context, stochastic output, tool-augmented trajectories) violate the assumptions ML observability tools built on tabular features were optimized for. We’ve found that teams who try to bolt LLM observability onto an existing APM stack spend more time fighting the schema than fixing real bugs; teams who adopt AI-native OTel traces from day one ship faster.

Cost and latency observability

The 2026 production constraint that often eclipses correctness is cost-per-trace and tail latency. A model swap that improves Groundedness by 2% but doubles token cost is rarely shippable. AI observability has to surface cost-per-trace at the same fidelity as quality. gen_ai.usage.input_tokens plus gen_ai.usage.output_tokens plus gen_ai.request.model aggregated to trace level gives the per-request bill. Pair with p99 time-to-first-token and total latency, sliced by route in Agent Command Center, and you can spot the regression at deploy time instead of at the end-of-month invoice.
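
As a rough illustration of that aggregation, the sketch below turns token counts into a per-trace bill and computes p99 time-to-first-token per route. It assumes spans and traces exported as dictionaries; the price table holds placeholder numbers rather than real provider rates, and the "route" and "ttft_ms" keys and all function names are hypothetical.

```python
import math
from collections import defaultdict

# Placeholder prices in USD per million tokens: (input, output). Not real rates.
PRICE_PER_MTOK = {"model-small": (0.5, 1.5), "model-large": (5.0, 15.0)}

spans = [
    {"trace_id": "t1", "gen_ai.request.model": "model-small",
     "gen_ai.usage.input_tokens": 2000, "gen_ai.usage.output_tokens": 500},
    {"trace_id": "t2", "gen_ai.request.model": "model-large",
     "gen_ai.usage.input_tokens": 2000, "gen_ai.usage.output_tokens": 500},
    {"trace_id": "t2", "gen_ai.request.model": "model-large",
     "gen_ai.usage.input_tokens": 4000, "gen_ai.usage.output_tokens": 1000},
]

traces = [
    {"route": "chat", "ttft_ms": 120}, {"route": "chat", "ttft_ms": 150},
    {"route": "chat", "ttft_ms": 130}, {"route": "chat", "ttft_ms": 900},
    {"route": "chat", "ttft_ms": 140}, {"route": "voice", "ttft_ms": 80},
    {"route": "voice", "ttft_ms": 95},
]


def cost_per_trace(spans):
    """Sum the priced input and output tokens of every span in a trace."""
    costs = defaultdict(float)
    for s in spans:
        price_in, price_out = PRICE_PER_MTOK[s["gen_ai.request.model"]]
        costs[s["trace_id"]] += (
            s["gen_ai.usage.input_tokens"] * price_in
            + s["gen_ai.usage.output_tokens"] * price_out
        ) / 1_000_000
    return dict(costs)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def p99_ttft_by_route(traces):
    """Group time-to-first-token samples by route and take the p99 of each."""
    by_route = defaultdict(list)
    for t in traces:
        by_route[t["route"]].append(t["ttft_ms"])
    return {route: percentile(vals, 99) for route, vals in by_route.items()}


print({tid: round(c, 5) for tid, c in cost_per_trace(spans).items()})
print(p99_ttft_by_route(traces))
```

Comparing these two outputs before and after a deploy is the kind of check that catches a cost or tail-latency regression early.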

Common mistakes

  • Observing only the final model call. Retrieval, reranking, tools, guardrails, and agent handoffs often contain the real fault. Instrument every span boundary.
  • Logging prompts without span context. Raw prompts help less when they are detached from trace id, model, cost, latency, and parent span. Use traceAI to keep prompt + completion + metadata co-located.
  • Treating evals as offline-only. Regression datasets catch release bugs; span-attached scores catch provider drift, corpus drift, and route changes after deploy. Pair offline eval with sampled production eval.
  • Sampling away rare failures. Uniform sampling hides high-cost retries and safety failures. Sample by workflow, score, user impact, and anomaly signal, never uniformly.
  • Skipping redaction policy. AI traces may contain PII, secrets, or customer data; redact before storage while preserving tokens, timings, and verdicts. The PII evaluator at ingest is the simplest enforcement point.
  • No trace versioning on prompt changes. When the prompt template changes, the trace’s effective semantics change; without a prompt version on every span, historical comparisons mix incompatible runs.
  • Single-region storage for global traffic. Latency and data-residency constraints break when traces sit in one region. Use regional OTLP collectors with FutureAGI replication.
  • Confusing dashboards with observability. A dashboard is a precomputed view. Observability is the ability to ask new questions of recent traces, including “show me this exact incident trajectory by trace id.” If you cannot do the latter, the dashboards are decoration.
  • Inconsistent span kinds across teams. When team A emits fi.span.kind=TOOL and team B emits kind=tool_call, dashboards split and aggregation breaks. Standardize span kinds and trace attributes in a shared spec; use the OpenInference conventions as the baseline.
  • No retention policy for trace data. Trace volume grows linearly with traffic; storing every span at full resolution for two years is rarely affordable. Set a tiered retention policy (full fidelity for 7-30 days, sampled for 90-365 days, audit-grade traces for the regulatory retention window) and codify it in your storage backend.
  • Treating evaluator output as binary pass/fail. Most useful evaluators emit a score plus a reason string. Discarding the reason throws away the “why” that engineers need at debug time. Store both, surface both in the trace UI.
  • No cross-tenant trace isolation. When traces from multiple tenants land in the same store with the same attribute namespace, a single search query can leak data across tenants. Tag every span with a tenant id and enforce tenant-scoped queries.
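
The sampling mistake in the list above has a simple alternative: always keep traces that show a failure signal and sample only the healthy remainder. The sketch below is one possible policy, not a FutureAGI or OpenTelemetry sampler; the record keys, thresholds, and the `keep_trace` function are hypothetical.

```python
import hashlib


def keep_trace(trace, base_rate=0.05, score_floor=0.7, token_ceiling=20_000):
    """Decide whether to retain a finished trace.

    Traces with an error, a guardrail block, a low groundedness score, or
    unusually high token usage are always kept. Everything else is sampled
    at `base_rate`, using a hash of the trace id so the decision is
    repeatable across services.
    """
    if trace.get("error") or trace.get("guardrail_blocked"):
        return True
    score = trace.get("groundedness")
    if score is not None and score < score_floor:
        return True
    if trace.get("total_tokens", 0) > token_ceiling:
        return True
    digest = hashlib.sha256(trace["trace_id"].encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < base_rate


# Hypothetical finished traces.
healthy = {"trace_id": "t-healthy", "groundedness": 0.95, "total_tokens": 1500}
ungrounded = {"trace_id": "t-low", "groundedness": 0.40, "total_tokens": 1500}
expensive = {"trace_id": "t-cost", "groundedness": 0.95, "total_tokens": 60_000}

for t in (healthy, ungrounded, expensive):
    print(t["trace_id"], keep_trace(t))
```

Hashing the trace id instead of calling a random number generator means every service that sees the same trace reaches the same keep-or-drop decision.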

Frequently Asked Questions

What is AI observability?

AI observability is runtime visibility for AI systems across traces, spans, prompts, completions, retrievals, tool calls, token usage, latency, cost, and eval scores. It helps teams explain behavior and fix production failures.

How is AI observability different from LLM observability?

LLM observability focuses on model calls and model-adjacent signals. AI observability is wider: it also covers RAG, agents, voice stages, gateway decisions, model drift, and cross-service traces.

How do you measure AI observability?

Instrument with traceAI integrations such as traceAI-langchain, capture fields like llm.token_count.prompt and fi.span.kind, then attach evaluator scores such as Groundedness to sampled production traces.