What Is LLM Observability?

The runtime telemetry layer for LLM and agent systems, capturing structured traces, tokens, costs, eval scores, and agent topology.

LLM observability is the discipline of turning LLM and agent runtime behavior into structured, queryable signals. It extends pre-AI APM (latency, errors, status codes) with LLM-specific data: traces composed of spans for prompts, completions, retrievals, tool calls, and sub-agent dispatches; per-span token counts and cost; evaluator scores attached to live spans; drift signals across input and output distributions; and the agent graph itself. In 2026 the transport layer is OpenTelemetry with the GenAI semantic conventions, and FutureAGI’s traceAI emits these spans directly into any OTLP-compatible backend.

If you are reading this in May 2026, the textbook definition is not the interesting part. The interesting parts are: which signals matter now that agent traces span 10–50 nested calls, what the Model Context Protocol (MCP) changes for tool-call observability, why span-attached eval scores have replaced separate eval-and-trace tables, and how the LLM gateway became a first-class observability data plane. This page is an opinionated tour of all four.

Why LLM observability matters in production LLM and agent systems

Production LLM systems fail in ways APM cannot see. A support agent answers a question correctly and the response passes a string-match check, yet the trace shows that it retried a tool call eight times, hit a stale retriever three times, and burned $42 in judge tokens. Your dashboard still says zero errors. Three forces broke the old monitoring model:

  • Agents stopped being toys. A single user request inside a real agent stack now generates 10–50 spans across LLM calls, retrievers, tool invocations, and sub-agent dispatches. Without span-level structure, debugging is grep. Without graph topology, you see spans but lose the tree. A multi-agent system compounds the problem.
  • Cost stopped being a footnote. A reasoning model burning 40K output tokens at $15 per 1M tokens turns one user turn into 60 cents. Multiplied by retries and judge evals, one feature can cost more than the user’s monthly subscription. Cost attribution by tenant, route, and prompt version is the only way to defend a margin in 2026.
  • Quality became a runtime signal. Models drift when providers refresh weights. GPT-5.x, Claude Opus 4.7, Gemini 3.x, and Llama 4 have all shipped silent revisions that moved groundedness, hallucination rate, and refusal behavior. RAG quality drifts when the corpus changes. Latency alerts catch infra; eval-score alerts catch quality drift. You need both.
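
The cost arithmetic in the second bullet can be checked in a few lines. This is a minimal sketch: the function name and the retry count are illustrative assumptions, and only the 40K-token and $15-per-1M figures come from the text above.

```python
def turn_cost_usd(output_tokens: int, usd_per_million_output: float, attempts: int = 1) -> float:
    """Output-token cost of one user turn, multiplied by the number of attempts."""
    return output_tokens * usd_per_million_output / 1_000_000 * attempts


# Figures quoted in the text: 40K output tokens at $15 per 1M tokens.
single = turn_cost_usd(40_000, 15.0)

# Hypothetical: the same turn attempted three times (two retries).
with_retries = turn_cost_usd(40_000, 15.0, attempts=3)

print(f"${single:.2f} per turn, ${with_retries:.2f} with two retries")
```

Judge-model evals add their own token spend on top of this, which is why the per-turn figure is a floor rather than a total.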

ML engineers, SREs, and compliance leads all feel this. The symptom in logs is “everything is fine” while users churn or trust erodes. Multi-step pipelines compound the problem: a corrupt retrieval at step two silently poisons steps three through five.

The pre-AI APM model breaks at three places

Classic APM was built for HTTP request → response → status code. LLM observability breaks that contract in three places. First, a single request can spawn dozens of sub-operations; the unit of analysis is the trace tree, not the request. Second, “success” is not status code 200; it is whether the answer was grounded, the tool choice was right, and the user accepted the result, and you need an evaluator to know. Third, cost is variable per request and dominated by token spend; a Datadog dashboard with CPU and request rate cannot answer “did the May 12 prompt rollout double our token bill?”, whereas a span-attached gen_ai.cost.total aggregate can.
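
As a rough illustration of the third point, the rollout question reduces to comparing mean gen_ai.cost.total before and after a date. The sketch below uses plain dictionaries as stand-ins for exported spans; the "day" key, the sample values, and the helper name are assumptions, not a FutureAGI API.

```python
from datetime import date
from statistics import mean

# Hypothetical exported spans: one cost value per request, keyed by day.
spans = [
    {"day": date(2026, 5, 10), "gen_ai.cost.total": 0.010},
    {"day": date(2026, 5, 11), "gen_ai.cost.total": 0.012},
    {"day": date(2026, 5, 12), "gen_ai.cost.total": 0.021},
    {"day": date(2026, 5, 13), "gen_ai.cost.total": 0.023},
]
rollout = date(2026, 5, 12)


def mean_cost(rows, keep):
    """Mean gen_ai.cost.total over the rows whose day satisfies `keep`."""
    return mean(row["gen_ai.cost.total"] for row in rows if keep(row["day"]))


before = mean_cost(spans, lambda day: day < rollout)
after = mean_cost(spans, lambda day: day >= rollout)
ratio = after / before

print(f"mean cost before={before:.4f} after={after:.4f} ratio={ratio:.2f}x")
```

In practice the same comparison would be grouped by prompt.version rather than by date, so that overlapping rollouts do not blur the attribution.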

How FutureAGI does LLM observability

FutureAGI’s approach is to treat observability as one surface of a full reliability loop: trace, evaluate, simulate, route, and guard. The instrumentation layer is traceAI, an Apache 2.0 OpenTelemetry library with drop-in coverage of 50+ frameworks across Python, TypeScript, Java, and C#. Instrumenting a LangChain app takes two calls: register(project_name="prod-rag"), then LangChainInstrumentor().instrument(), and every chain step emits a span. Other integrations include traceAI-openai, traceAI-anthropic, traceAI-google-genai, traceAI-bedrock, traceAI-mcp, traceAI-crewai, traceAI-autogen, traceAI-langgraph, traceAI-openai-agents, traceAI-google-adk, and traceAI-livekit for voice agents.

Spans carry FutureAGI’s GenAI semantic conventions: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.total_tokens, gen_ai.server.time_to_first_token, gen_ai.cost.total, plus fi.span.kind to tag whether a span is an LLM, RETRIEVER, TOOL, AGENT, GUARDRAIL, or CHAIN. Spans are persisted in ClickHouse and rendered in the platform as agent graphs, not flat span lists.

Span-attached evals: the differentiator

The piece that separates 2026-grade observability from a glorified log viewer is span-attached evals: a HallucinationScore, Groundedness, AnswerRelevancy, or Faithfulness evaluator runs on every sampled span and writes its verdict back as gen_ai.evaluation.score.value with a reason. Filter the dashboard to “spans where citation grounding dropped below 0.7 in the last 24h” and that becomes your review queue. Unlike LangSmith, where evals are a separate dataset stitched by primary key, the score is part of the trace. Engineers alert on rolling-mean eval drops the same way they alert on p99 latency, and the same span powers both queries. We’ve found this single architectural choice, eval-on-span, to be the highest-leverage observability decision teams make in 2026.
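
The review-queue filter described above can be sketched over exported span records. This is an illustration only: the spans are plain dictionaries with the evaluator attributes flattened onto them, and the "end_time" and "span_id" keys and the function name are assumptions rather than the platform's query API.

```python
from datetime import datetime, timedelta, timezone


def review_queue(spans, evaluator, threshold, now, window=timedelta(hours=24)):
    """Return spans scored by `evaluator` below `threshold` inside the time window."""
    cutoff = now - window
    return [
        span
        for span in spans
        if span["gen_ai.evaluation.name"] == evaluator
        and span["end_time"] >= cutoff
        and span["gen_ai.evaluation.score.value"] < threshold
    ]


now = datetime(2026, 5, 20, 12, 0, tzinfo=timezone.utc)
spans = [
    {"span_id": "a1", "end_time": now - timedelta(hours=2),
     "gen_ai.evaluation.name": "Groundedness", "gen_ai.evaluation.score.value": 0.55},
    {"span_id": "b2", "end_time": now - timedelta(hours=3),
     "gen_ai.evaluation.name": "Groundedness", "gen_ai.evaluation.score.value": 0.91},
    {"span_id": "c3", "end_time": now - timedelta(hours=30),  # outside the 24h window
     "gen_ai.evaluation.name": "Groundedness", "gen_ai.evaluation.score.value": 0.40},
]

queue = review_queue(spans, "Groundedness", threshold=0.7, now=now)
print([span["span_id"] for span in queue])
```

Because the score lives on the span, the same record also carries latency and token counts, so no join against a separate eval table is needed.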

Agent Command Center as observability data plane

The Agent Command Center is the LLM gateway layer of FutureAGI. It proxies every provider call, applies pre-guardrails and post-guardrails, enforces a routing policy (round-robin, weighted, least-latency, or cost-optimized), runs a semantic cache, and supports model fallback plus traffic mirroring. Crucially, the gateway emits the same gen_ai.* semantic attributes that traceAI consumes, so observability has a complete view even when a request never reaches application code (a cache hit) or when it fails over to a backup model. In our 2026 evals, teams that deploy the gateway alongside traceAI gain ~30% additional incident-detection coverage versus teams that only instrument application code.
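
To make the four routing policies concrete, here is a toy selector. It is not the Agent Command Center implementation; the backend names, weights, latencies, and prices are invented for illustration.

```python
import itertools
import random

# Hypothetical backends; every value here is made up for the example.
BACKENDS = [
    {"model": "model-a", "weight": 3, "p50_latency_s": 0.9, "usd_per_1m_output": 15.0},
    {"model": "model-b", "weight": 1, "p50_latency_s": 0.4, "usd_per_1m_output": 30.0},
    {"model": "model-c", "weight": 1, "p50_latency_s": 1.5, "usd_per_1m_output": 2.0},
]
_round_robin = itertools.cycle(BACKENDS)


def choose(policy, rng=random):
    """Pick a backend according to one of the four policies named in the text."""
    if policy == "round-robin":
        return next(_round_robin)
    if policy == "weighted":
        weights = [backend["weight"] for backend in BACKENDS]
        return rng.choices(BACKENDS, weights=weights, k=1)[0]
    if policy == "least-latency":
        return min(BACKENDS, key=lambda backend: backend["p50_latency_s"])
    if policy == "cost-optimized":
        return min(BACKENDS, key=lambda backend: backend["usd_per_1m_output"])
    raise ValueError(f"unknown routing policy: {policy!r}")


print(choose("least-latency")["model"], choose("cost-optimized")["model"])
```

Whichever policy picks the backend, the observability point is that the chosen model should land on the span as gen_ai.request.model, so a routing change shows up as a distribution shift.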

Reference: the gen_ai.* attribute surface

Attribute | Where it comes from | What you alert on
gen_ai.request.model | every LLM span | distribution shift after a routing change
gen_ai.usage.input_tokens | LLM span | prompt cost; long-context regressions
gen_ai.usage.output_tokens | LLM span | reasoning-model cost; runaway generation
gen_ai.usage.total_tokens | LLM span | total-cost ceiling per cohort
gen_ai.server.time_to_first_token | streaming LLM span | UX SLA; provider weight refreshes
gen_ai.client.operation.duration | LLM/CHAIN span | end-to-end latency per route
gen_ai.cost.total | gateway-derived | margin-killing routes
gen_ai.evaluation.score.value | span-attached evaluator | rolling-mean drop alerts
gen_ai.evaluation.name | span event | which evaluator fired
fi.span.kind | traceAI-tagged | filter by LLM/RETRIEVER/TOOL/AGENT
tool.name | MCP/function-call span | tool-selection drift after rollouts
agent.trajectory.step | agent framework span | per-step failure attribution
prompt.version | application-tagged | A/B regression attribution

The dashboard signals that matter daily: eval-fail-rate-by-cohort, p99 TTFT, token-cost-per-trace, and per-tool error rate.
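
Two of these signals, p99 TTFT and token-cost-per-trace, are simple aggregates. The sketch below assumes exported spans as dictionaries with a "trace_id" key and uses the nearest-rank percentile definition; both are choices made for this example, and a real backend may interpolate differently.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def cost_per_trace(spans):
    """Sum gen_ai.cost.total over all spans that share a trace_id."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["trace_id"]] += span.get("gen_ai.cost.total", 0.0)
    return dict(totals)


# Hypothetical time-to-first-token samples in milliseconds: 1, 2, ..., 100.
ttft_ms = list(range(1, 101))
print("p99 TTFT:", percentile(ttft_ms, 99), "ms")

spans = [
    {"trace_id": "t1", "gen_ai.cost.total": 0.25},
    {"trace_id": "t1", "gen_ai.cost.total": 0.5},
    {"trace_id": "t2"},  # e.g. a retriever span with no model cost
]
print(cost_per_trace(spans))
```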

How to measure or detect LLM observability quality

Observability is itself a measurement layer; what matters is the signal density per span. Wire these:

  • Token usage: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.total_tokens (legacy traceAI also emits llm.token_count.prompt, llm.token_count.completion).
  • Latency: gen_ai.server.time_to_first_token for streaming TTFT, gen_ai.client.operation.duration for end-to-end span duration.
  • Cost: gen_ai.cost.total, gen_ai.cost.input, gen_ai.cost.output written from the gateway price table.
  • Span kind: fi.span.kind distinguishes LLM, RETRIEVER, TOOL, AGENT, CHAIN so you can filter.
  • Eval scores: gen_ai.evaluation.score.value and gen_ai.evaluation.name attached as span events from HallucinationScore, Groundedness, AnswerRelevancy, ToolSelectionAccuracy, TaskCompletion, or Faithfulness.
  • Safety: PromptInjection, ProtectFlash, BiasDetection, Toxicity, and PII evaluator scores tagged on spans for regulated workloads.
  • Agent topology: agent.trajectory.step, plus the parent span graph rendered as a tree.
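
Minimal traceAI setup for a LangChain app:

# Register a tracer provider for the project, then instrument LangChain with it.
from fi_instrumentation import register
from traceai_langchain import LangChainInstrumentor

trace_provider = register(project_name="prod-rag")
LangChainInstrumentor().instrument(tracer_provider=trace_provider)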

Cohort-filtered regression eval over a captured trace Dataset:

from fi.evals import Groundedness, HallucinationScore, TaskCompletion
from fi.datasets import Dataset

ds = Dataset.from_traces(project="prod-rag", cohort="billing", days=7)

eval_chain = [Groundedness(), HallucinationScore(), TaskCompletion()]

results = ds.evaluate(
    evaluators=eval_chain,
    sample_strategy="failure_biased",
    baseline_prompt_version="v1.4.0",
    candidate_prompt_version="v1.5.0",
)
results.summary(group_by=["prompt.version", "gen_ai.request.model"]).to_csv("regression.csv")

Coverage, density, and the 99% rule

A trace is only as good as its weakest span. The metric that matters is span completeness: of every production trace, what percentage of LLM, RETRIEVER, TOOL, and AGENT spans carry gen_ai.request.model, both token counts, a duration, a status, and fi.span.kind? Below 95% completeness, dashboards lie. Above 99%, post-mortems become tractable. The fastest wins in our 2026 evals come from teams that just audit completeness and patch the missing attributes: they do not need a new tool, they need their existing tool to report fully.
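
A completeness audit of this kind is short to write. The sketch below assumes spans exported as dictionaries; the "duration_ms" and "status" key names are placeholders, since the text names those fields but not their attribute keys.

```python
REQUIRED_KEYS = (
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "duration_ms",  # assumed key name for the span duration
    "status",       # assumed key name for the span status
    "fi.span.kind",
)


def is_complete(span):
    """A span is complete when every required attribute is present and not None."""
    return all(span.get(key) is not None for key in REQUIRED_KEYS)


def completeness(spans):
    """Fraction of the given spans that carry every required attribute."""
    if not spans:
        raise ValueError("no spans to audit")
    return sum(is_complete(span) for span in spans) / len(spans)


full = {
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.input_tokens": 812,
    "gen_ai.usage.output_tokens": 164,
    "duration_ms": 930,
    "status": "OK",
    "fi.span.kind": "LLM",
}
missing_tokens = {key: value for key, value in full.items() if key != "gen_ai.usage.output_tokens"}

score = completeness([full, full, full, missing_tokens])
print(f"span completeness: {score:.0%}")
```

Running the audit per span kind and per instrumented service usually points to the one integration that is dropping attributes.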

A useful frame for “what should this layer actually catch”: in 2026, public benchmarks paint the upper bound of model quality, but they say nothing about your prompt, your retriever, or your tenant. On RAGTruth’s 18K labeled chunks frontier models still fail groundedness on 5–8% of answers, and HaluEval’s 35K Q&A set shows GPT-4-class models hallucinating at ~16.4%. Observability is the only way to know your specific number, on your traffic, this week.

Drift detection: input, output, retrieval, evaluator

Four drift classes matter:

  • Input drift (user prompt distribution): easy to detect, lowest signal.
  • Output drift (response length, refusal rate, style): high signal, because provider refreshes show up here first.
  • Retrieval drift (a vector database reindex changes which chunks rank): silently destroys RAG quality.
  • Evaluator drift (the Groundedness rolling mean slides 0.05 points week-over-week): a leading indicator of user complaints.

FutureAGI tracks all four by default; a generic APM stack tracks none.
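
The evaluator-drift check is the simplest of the four to sketch: compare the mean score of two consecutive windows and flag a drop of 0.05 or more. The function name, the default threshold parameter, and the sample scores are assumptions for illustration; this is not how FutureAGI computes drift internally.

```python
from statistics import mean


def evaluator_drift(last_week, this_week, max_drop=0.05):
    """Return (drifted, drop), where drop is last week's mean minus this week's mean."""
    drop = mean(last_week) - mean(this_week)
    return drop >= max_drop, drop


# Hypothetical Groundedness scores sampled from two consecutive weeks.
previous = [0.90, 0.88, 0.92]
current = [0.84, 0.82, 0.86]

drifted, drop = evaluator_drift(previous, current)
print(f"drifted={drifted} drop={drop:.2f}")
```

A production check would also require a minimum sample size per window, since a mean over a handful of spans moves by 0.05 on noise alone.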

Observing agent and MCP-mediated workflows

In 2026 a non-trivial agentic product mounts 5–10 MCP servers: CRM, ticketing, internal knowledge base, write-capable account server, vector store. Every tools/call, resources/read, and prompts/get is a tool.name-tagged span; observability slices errors and latency by server, by tool, and by agent.trajectory.step. When the Confluence MCP server starts returning stale resources after a reindex, FutureAGI’s dashboard flags a drop in Faithfulness against the affected cohort and the trace view points to the exact resources/read span. That kind of cross-server view is impossible in a framework-locked tracer (LangSmith for LangChain only, Anthropic’s tracer for Claude only). The traceAI-mcp integration sees every MCP client equally (Claude Desktop, OpenAI Agents SDK, LangGraph, custom Strands agents) because it instruments the protocol, not the application.
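
Slicing errors by server or by tool is a group-by over span attributes. In this sketch the "mcp.server" key, the "status" values, and the sample spans are assumptions; only tool.name is an attribute named in the text.

```python
from collections import Counter


def error_rate_by(spans, key):
    """Error rate per distinct value of `key`, for example per tool or per server."""
    totals, errors = Counter(), Counter()
    for span in spans:
        group = span[key]
        totals[group] += 1
        if span["status"] == "ERROR":
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical MCP spans; "mcp.server" is an invented attribute for the example.
spans = [
    {"mcp.server": "confluence", "tool.name": "resources/read", "status": "ERROR"},
    {"mcp.server": "confluence", "tool.name": "resources/read", "status": "OK"},
    {"mcp.server": "crm", "tool.name": "tools/call", "status": "OK"},
    {"mcp.server": "crm", "tool.name": "tools/call", "status": "OK"},
]

print(error_rate_by(spans, "mcp.server"))
print(error_rate_by(spans, "tool.name"))
```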

Wiring observability into the dev → staging → prod loop

A team that treats observability as a production-only feature ships the wrong abstraction. The same traceAI instrumentation should run in development (to capture local agent runs while debugging), in simulation (so simulated Persona and Scenario tests produce real traces), in CI (so regression evals emit comparable spans), and in production. When the four environments share the same trace schema, an engineer can replay a production trace through staging and watch the same span tree light up. FutureAGI’s simulate-sdk is wired into traceAI for this reason: a LiveKitEngine voice simulation, a CloudEngine text simulation, and a production voice agent all emit the same span shape with ASRAccuracy, TTSAccuracy, and ConversationCoherence scores attached.

Comparison: FutureAGI, LangSmith, Langfuse, Helicone

LangSmith is LangChain-locked and treats evals as a separate dataset. Langfuse is open-source and trace-first; eval support is a bolt-on. Helicone is gateway-only and not designed for span-attached evaluator scoring. FutureAGI is purpose-built for the four-way intersection of trace + eval + simulate + route: every span carries an evaluator score, the gateway emits the same OTel attributes, and the evaluator suite (HallucinationScore, Groundedness, AnswerRelevancy, Faithfulness, ContextRelevance, ContextPrecision, ContextRecall, ToolSelectionAccuracy, TaskCompletion, TrajectoryScore, PromptInjection, ProtectFlash, BiasDetection, Toxicity, PII, CustomEvaluation) runs natively against trace data. The Arize and Datadog LLM offerings are closer to FutureAGI’s design point than LangSmith but neither ships a gateway, a simulate-sdk, or agent-opt optimizers in the same stack.

What changes when you add Agent2Agent (A2A) traffic

The Agent2Agent (A2A) protocol ships in 2026 production stacks for cross-vendor agent collaboration. Observability has to handle traces that cross organizational boundaries, with partial visibility into the remote agent. FutureAGI’s traceAI emits W3C trace context so a trace started in your stack continues into a partner’s compliant tracer; on the inbound side, traceAI-a2a instruments your agent as the remote endpoint and stitches the trace. This matters for any team building B2B agent integrations because joint-task TaskCompletion is no longer measurable from one side alone.
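
For readers unfamiliar with W3C Trace Context, the stitching relies on a traceparent header of the form version-traceid-parentid-flags. The sketch below builds and continues such a header by hand for version 00 only; in practice an OpenTelemetry propagator does this, and the function names here are illustrative.

```python
import re
import secrets

# Version 00 layout: 2 hex version, 32 hex trace id, 16 hex parent span id, 2 hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random trace id and span id, sampled flag in the last field."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(parent: str) -> str:
    """Continue an inbound trace: keep the trace id and flags, mint a new span id."""
    match = _TRACEPARENT.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError("all-zero trace or span ids are invalid")
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


outbound = new_traceparent()           # sent by your agent to the partner
inbound = child_traceparent(outbound)  # what the partner's tracer would continue with
print(outbound)
print(inbound)
```

The shared trace id is what lets both sides find the same trace later, even though each organization only stores its own spans.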

Cost observability: the second most-mature surface in 2026

Cost is now first-class. In 2026 a serious LLM workload spends six to seven figures monthly on tokens; a 10% routing-policy win compounds into real dollars. FutureAGI’s cost observability slices gen_ai.cost.total by route, tenant, model, prompt version, and agent role. The dashboard answers questions a finance team actually asks: which user cohort costs more than its margin allows; which prompt version doubled output token spend; which reasoning model burns more output tokens for marginal quality gain; what the semantic cache hit rate is by route. We’ve found that the highest-leverage cost wins come from cache hit-rate tuning (+30–50% savings) and routing-policy adjustment (cost-optimized route for non-critical traffic), both of which require observability at the gateway level, not just at application code.
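
The slicing described above amounts to summing gen_ai.cost.total over one or more dimensions and computing a hit rate. In this sketch the "tenant", "route", and "cache.hit" keys and all sample values are assumptions about how exported spans might be tagged.

```python
from collections import defaultdict


def cost_by(spans, *dimensions):
    """Total gen_ai.cost.total per combination of dimensions, most expensive first."""
    totals = defaultdict(float)
    for span in spans:
        key = tuple(span[dimension] for dimension in dimensions)
        totals[key] += span["gen_ai.cost.total"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def cache_hit_rate(spans):
    """Share of gateway spans served from the semantic cache."""
    if not spans:
        raise ValueError("no spans to measure")
    return sum(1 for span in spans if span["cache.hit"]) / len(spans)


# Hypothetical gateway spans; cache hits are modeled as zero-cost.
spans = [
    {"tenant": "acme", "route": "/chat", "gen_ai.cost.total": 0.5, "cache.hit": False},
    {"tenant": "acme", "route": "/search", "gen_ai.cost.total": 0.25, "cache.hit": False},
    {"tenant": "globex", "route": "/chat", "gen_ai.cost.total": 0.125, "cache.hit": False},
    {"tenant": "globex", "route": "/chat", "gen_ai.cost.total": 0.0, "cache.hit": True},
]

print(cost_by(spans, "tenant"))
print(cost_by(spans, "tenant", "route"))
print(f"cache hit rate: {cache_hit_rate(spans):.0%}")
```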

Privacy, redaction, and compliance in 2026

Prompts and completions are the most sensitive data a regulated company processes. Observability that stores them in plaintext is a liability. FutureAGI’s redaction layer runs pre-storage: PII and ProtectFlash evaluators scan every span, mask matched substrings, and tag the span with a redaction event. Regulated industries (healthcare, finance, legal) typically run a tighter mode where prompts are hashed and only the hash plus evaluator verdicts are retained, with the raw text held in a short-TTL store. Compliance reviewers query by prompt.hash and gen_ai.evaluation.score.value without ever seeing raw PII. This is one of the few places where building observability in-house is genuinely harder than buying it: the redaction primitives need to be evaluator-aware, not regex-driven, because PII in 2026 chat data is contextual (“my dog Buddy” is fine, “my SSN is…” is not), and only an evaluator can tell the difference reliably.
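
The hash-only retention mode can be sketched as follows. The text says only that prompts are hashed; using a keyed HMAC-SHA-256 rather than a bare hash is an assumption made here so that short prompts cannot be recovered by guessing. The record shape, function name, and key handling are likewise illustrative.

```python
import hashlib
import hmac


def to_retained_record(span: dict, key: bytes) -> dict:
    """Keep a keyed hash of the prompt plus the evaluator verdict; drop the raw text."""
    digest = hmac.new(key, span["prompt"].encode("utf-8"), hashlib.sha256).hexdigest()
    return {
        "prompt.hash": digest,
        "gen_ai.evaluation.name": span["gen_ai.evaluation.name"],
        "gen_ai.evaluation.score.value": span["gen_ai.evaluation.score.value"],
    }


# Hypothetical key; a real deployment would load it from a secret manager.
HASH_KEY = b"example-only-key"

span = {
    "prompt": "Please update the billing address on my account.",
    "gen_ai.evaluation.name": "PII",
    "gen_ai.evaluation.score.value": 0.12,
}
record = to_retained_record(span, HASH_KEY)
print(record)
```

Identical prompts map to the same prompt.hash under one key, which is what lets reviewers group and query records without access to the text.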

Common mistakes (May 2026 edition)

  • Treating it as logs with extra fields. Span structure, OTel attributes, and span-attached eval scores have to be modeled from day one; bolting them on later means re-instrumenting every call site.
  • Sampling uniformly at 1%. The 99th percentile is where the bug lives. Sample by user, by failure signal, and by cost outlier, not uniformly.
  • Not tagging prompt versions. Without a prompt.version id on the span, A/B rollouts and regression attribution become guesswork.
  • Conflating eval and observability. Offline eval datasets catch regressions before release; span-attached evals catch drift after release. Teams that have only one ship the other class of bugs.
  • Skipping redaction. Prompts and completions carry PII, secrets, and customer data. Pre-storage redaction with PII or ProtectFlash is non-negotiable for regulated workloads.
  • Building a custom OTel pipeline instead of using semantic conventions. Custom attribute names lock you into one backend. Standardize on gen_ai.* from day one.
  • Ignoring the gateway. Without the LLM gateway in the data path, your observability stops at application code and misses cache hits, fallbacks, and out-of-app provider calls.
  • Watching only the join of trace + eval. Multi-agent system traces also need a graph view; flat span lists hide handoff failures.
  • Treating Langfuse, LangSmith, or Helicone as ceiling. Those tools cover tracing; FutureAGI’s span-attached eval architecture is a different design point. Test the architecture, not the brand.
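
The sampling advice in the list above can be sketched as a keep-or-drop decision: always keep traces with a failure signal or a cost outlier, and sample the rest per user rather than per request. All field names and thresholds below are assumptions chosen for the example.

```python
import hashlib


def keep_trace(trace, base_rate=0.01, cost_threshold_usd=0.50, score_floor=0.7):
    """Keep failures, low eval scores, and cost outliers; otherwise sample by user."""
    if trace["error"]:
        return True
    if trace["min_eval_score"] < score_floor:
        return True
    if trace["cost_usd"] >= cost_threshold_usd:
        return True
    # Deterministic per-user bucket in [0, 10000): a sampled user keeps all of
    # their traces, so whole sessions stay debuggable.
    digest = hashlib.sha256(trace["user_id"].encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < int(base_rate * 10_000)


healthy = {"user_id": "u-123", "error": False, "min_eval_score": 0.95, "cost_usd": 0.02}
failed = dict(healthy, error=True)
pricey = dict(healthy, cost_usd=1.20)
ungrounded = dict(healthy, min_eval_score=0.4)

print([keep_trace(trace) for trace in (failed, pricey, ungrounded)])
```

The thresholds would normally be derived from the observed cost and score distributions rather than fixed constants.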

Frequently Asked Questions

What is LLM observability?

LLM observability is the runtime telemetry layer for LLM and agent systems: structured traces, span-level token and cost metadata, eval scores attached to live spans, drift detection, and agent graph topology, transported over OpenTelemetry.

How is LLM observability different from traditional APM?

Traditional APM captures HTTP latency, error rates, and exceptions. LLM observability adds token usage per call, prompt versions, span-attached eval scores, retrieval quality, tool-call decisions, and hallucination signals, none of which fit a status-code-and-latency model.

How do you measure LLM observability?

Instrument with traceAI to emit OpenTelemetry spans carrying gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.model, and gen_ai.server.time_to_first_token, then attach fi.evals scores like HallucinationScore as span events for production traces.