
What Is LLM Observability? The Ultimate 2026 Guide

LLM observability in 2026 is OpenTelemetry plus LLM-aware spans plus eval-as-span-attribute. The reference guide for ML engineers picking a stack.


A senior engineer pings you at 8 a.m. The Slack message reads: “the legal-research agent quoted the wrong statute again, and a customer noticed this time.” You open the dashboard. Request volume looks healthy. P95 latency is fine. No exceptions. The agent didn’t crash. It just lied. Generic APM saw nothing because nothing was wrong by its definition. This is the gap LLM observability fills, and the only architecture that closes it cleanly in 2026 is OpenTelemetry plus LLM-aware span attributes plus eval-as-span-attribute. Everything else is vendor lock-in waiting to happen.

This guide is the reference: what LLM observability is, what it isn’t, the OTel-native baseline, the LLM-aware span conventions, the eval-as-span pattern, production sampling and retention, and where the trace tree becomes the unit of work. Written for ML engineers and tech leads picking a stack they won’t have to rip out next year. Last updated May 20, 2026.

TL;DR

LLM observability is OpenTelemetry tracing with LLM-aware span attributes (OpenInference or OTel-GenAI) and evaluator scores attached to those spans. The trace tree is the unit. Tools that ship their own span format lock you in. The right stack is: OTel-native instrumentation, OpenInference / OTel-GenAI conventions on top, eval scores written back to spans as gen_ai.evaluation.* attributes, a cheap classifier plus an LLM-judge sampling policy, and a backend you can swap. Get those five right and the discipline works. Get the wire format wrong and every other choice locks tighter over time.

What LLM observability isn’t

Three terms keep landing in the same procurement deck and the buyer keeps treating them as one thing. Observability watches. Evaluation judges. Benchmarking ranks. The clean conceptual map lives in agent observability vs evaluation vs benchmarking; the operating shorthand for this post is: observability captures what the agent did, evaluation scores whether it was correct against a rubric on your data, benchmarking measures a model against a standardized public set like SWE-bench Verified or BFCL.

The reason this matters here is that “LLM observability” gets stretched to mean all three, and the stretched definition picks the wrong tool. An observability platform without eval scores on spans is a debugger. An eval platform without traces is a test runner. A benchmark score on a model card is not a production gate. Keep the workflows separate; keep the seams clean. The seam that matters most for this guide is eval-as-span-attribute, covered below.

Generic APM misses these failures for a structural reason: it is built for loud ones. Null pointers throw. HTTP 500s fire. Timeouts trip. LLM systems fail quietly. They produce confident, well-formed text that happens to be wrong. They call the right tool with subtly wrong arguments. They retrieve a chunk containing a poisoned instruction. They burn through your monthly token budget on one streaming response that won’t stop. None of these register in an APM. The deeper boundary between monitoring and observability for LLM systems lives in LLM monitoring vs LLM observability.

The OTel-native baseline

OpenTelemetry won. The wire protocol is OTLP. The instrumentation libraries are stable. The exporters are everywhere. Across 2024-2025, every serious LLM tracing tool either started OTel-native (Phoenix, Future AGI traceAI, OpenLLMetry) or bridged back to OTel because customers refused to ship a proprietary tracer (Langfuse, LangSmith, several others). ServiceNow’s March 2026 acquisition of Traceloop / OpenLLMetry was the signal: OTel is now the LLM observability substrate, not a “nice to have.”

The OTel-native baseline buys three things proprietary tracers can’t:

  1. Wire-format portability. Spans go into any OTLP collector. You can run a free local Phoenix, a hosted Future AGI cloud, a Datadog OTel intake, a Honeycomb pipeline, and a self-hosted Grafana Tempo cluster, and the wire format is the same. Backend becomes the swappable layer.
  2. Collector-side processing. Tail sampling, PII redaction, attribute filtering, eval execution, and routing all happen at the collector. The application stays a simple emitter; policy lives downstream where it belongs.
  3. Existing ecosystem. The OTel collector knows how to talk to every observability backend that exists. New backends launch with OTLP support on day one. Proprietary tracers had to ship 30 adapters; OTel ships one.

A practical baseline in Python (the pattern is the same in TypeScript, Java, and .NET):

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from fi_instrumentation.otel import SemanticConvention
from traceai_openai import OpenAIInstrumentor

# Create the tracer provider for this project and pick the attribute convention.
tracer_provider = register(
    project_name="legal-research-agent",
    project_type=ProjectType.OBSERVE,
    project_version_name="v2.4.1",
    semantic_convention=SemanticConvention.OPENINFERENCE,
)
# Patch the OpenAI client so each call emits a span through that provider.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Those two calls, register() and instrument(), are the whole instrumentation step. Every subsequent openai.chat.completions.create call produces an OTLP span carrying the full OpenInference attribute set, exported to whichever collector your environment points at. Future AGI’s traceAI is Apache 2.0; you can swap the exporter to send the same spans into Phoenix, Tempo, Honeycomb, or your own collector without changing application code. The OTel conventions guide (OpenInference vs OpenLLMetry vs OpenLIT) walks the differences between the three competing layers.

LLM-aware span attributes

OTLP carries the wire format. Semantic conventions name the attributes. The conventions that matter in 2026:

  • OpenInference (Arize): openinference.* namespace, the most production-tested LLM convention. Stable attribute names across LLM, CHAIN, RETRIEVER, TOOL, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR span kinds. The reference implementation is Phoenix; Future AGI traceAI emits OpenInference-shaped spans natively.
  • OTel-GenAI (OpenTelemetry GenAI SIG): gen_ai.* namespace, the long-run standard. Covers token semantics (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cache_read_tokens), cost rollups (gen_ai.cost.input, gen_ai.cost.output, gen_ai.cost.cache_write), model identity, and (in progress) evaluation attributes (gen_ai.evaluation.*). Still stabilizing.
  • OpenLLMetry (Traceloop): traceloop.* namespace, overlaps significantly with OpenInference. Now under ServiceNow.

A minimum span carries five attribute classes regardless of which convention you pick:

| Class | Example attributes | Why it matters |
| --- | --- | --- |
| Identity | span.kind, trace_id, parent_span_id, service.name, service.version | Trace tree topology |
| Model + prompt | model.name, model.version, prompt.template.name, prompt.template.version, messages.input | Reproducibility on regressions |
| Cost | usage.input_tokens, usage.output_tokens, cost.total, cost.cache_read | Bill attribution per call |
| Result | messages.output, latency_ms, status, error.class, retry_count | The thing that broke |
| Eval + policy | evaluation.score.value, evaluation.label, guardrail.name, guardrail.result | Why it broke and what blocked it |
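
One way to keep the five classes from silently eroding is a check, run in CI or at ingest, that each span carries at least one attribute from every class. The following is a minimal sketch, assuming spans arrive as flat dicts and using the unprefixed attribute names from the table; REQUIRED_CLASSES and missing_classes are illustrative names, not part of any convention or SDK.

```python
# Hypothetical helper: attribute names mirror the table above, not a published spec.
REQUIRED_CLASSES = {
    "identity": {"span.kind", "trace_id", "parent_span_id",
                 "service.name", "service.version"},
    "model_prompt": {"model.name", "model.version", "prompt.template.name",
                     "prompt.template.version", "messages.input"},
    "cost": {"usage.input_tokens", "usage.output_tokens",
             "cost.total", "cost.cache_read"},
    "result": {"messages.output", "latency_ms", "status",
               "error.class", "retry_count"},
    "eval_policy": {"evaluation.score.value", "evaluation.label",
                    "guardrail.name", "guardrail.result"},
}


def missing_classes(attributes: dict) -> list[str]:
    """Return the attribute classes that have no representative key on the span."""
    present = set(attributes)
    return [name for name, keys in REQUIRED_CLASSES.items() if not (keys & present)]


span = {
    "span.kind": "LLM",
    "trace_id": "t-1",
    "model.name": "example-model",
    "usage.input_tokens": 812,
    "messages.output": "...",
}
print(missing_classes(span))  # ['eval_policy']
```

A span that has not been scored yet fails only the eval class, which is the gap the eval-as-span pattern is meant to close.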

Pick OpenInference if you want production-tested today; pick OTel-GenAI if you want the long-run standard; emit both if your stack supports it. Future AGI traceAI ships all four conventions (FI native, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) on the same wire via a single semantic_convention= argument, so you can re-export the same trace into the convention the downstream tool prefers. The deeper convention comparison is in what is OpenInference; the trace anatomy reference is what does a good LLM trace look like.

Two things that aren’t span attributes but get confused for them. Logs are events, not spans: emit OTel logs alongside the trace, don’t stuff log lines into span attributes. Metrics are aggregates, not spans: emit OTel metrics for p99 latency and error rate rollups, but run the trace tree as the unit of debugging.

Eval-as-span-attribute

This is the pattern most tools get wrong. The eval is part of the trace. The score lives on the span.

When a judge scores a response, the result writes back to the same span the LLM call emitted, as gen_ai.evaluation.score.value, gen_ai.evaluation.score.label, and gen_ai.evaluation.explanation. A failing faithfulness score points directly at the retrieval span that caused it. A failed tool-call check points directly at the argument that broke. The trace tree carries the verdict and the evidence on the same node.

The wrong shape, common in 2024-era tools: eval lives in a separate workspace, runs asynchronously, scores arrive in a different UI 30 minutes later, joined to traces by trace_id if you’re lucky. Operationally that means two dashboards, two retention policies, two access paths, and the engineer who hits a regression has to context-switch to figure out what failed.

The right shape, what 2026 stacks ship:

from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalTag, EvalTagType, EvalSpanKind, EvalName,
)
from fi_instrumentation.otel import SemanticConvention

tracer_provider = register(
    project_name="legal-research-agent",
    project_version_name="v2.4.1",
    semantic_convention=SemanticConvention.OPENINFERENCE,
    # Each EvalTag binds one rubric to one span kind.
    eval_tags=[
        # Groundedness on every LLM span; mapping names the span attributes fed to the rubric.
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.GROUNDEDNESS,
            mapping={"input": "input.value", "output": "output.value"},
        ),
        # Task completion on the same LLM spans, using the same input and output attributes.
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.TASK_COMPLETION,
            mapping={"input": "input.value", "output": "output.value"},
        ),
    ],
)

Declare the evaluators at register() time. The collector runs them server-side post-export. Scores write back to the matching spans. Zero user-request latency. The eval becomes a property of the trace, not a parallel dataset.
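
What “scores write back to the matching spans” means in data terms can be shown without any SDK. The following is a minimal sketch, assuming spans are plain dicts keyed by span id and the judge is a stand-in function; attach_eval and toy_groundedness_judge are hypothetical names, and in the real platform the write-back happens server-side.

```python
from typing import Callable

Verdict = tuple[float, str, str]  # (score, label, explanation)


def toy_groundedness_judge(inp: str, out: str) -> Verdict:
    """Hypothetical stand-in for a judge: passes only if the output appears in the input."""
    if out.lower() in inp.lower():
        return 1.0, "pass", "output found in input"
    return 0.0, "fail", "output not supported by input"


def attach_eval(span: dict, judge: Callable[[str, str], Verdict]) -> dict:
    """Score one span and store the verdict on that same span's attributes."""
    attrs = span["attributes"]
    score, label, why = judge(attrs["input.value"], attrs["output.value"])
    attrs["gen_ai.evaluation.score.value"] = score
    attrs["gen_ai.evaluation.score.label"] = label
    attrs["gen_ai.evaluation.explanation"] = why
    return span


spans = {
    "s1": {"kind": "LLM", "attributes": {
        "input.value": "Section 12 covers data retention.",
        "output.value": "Section 14 covers data retention.",
    }},
}
for span in spans.values():
    if span["kind"] == "LLM":  # same span-kind filter the EvalTag declares
        attach_eval(span, toy_groundedness_judge)

print(spans["s1"]["attributes"]["gen_ai.evaluation.score.label"])  # fail
```

The verdict and the evidence end up on one record, so a query for failing scores returns the inputs and outputs that produced them without a join.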

Two operational patterns sit on top of eval-as-span:

  1. Offline eval runs the same rubrics against a versioned dataset on every PR. CI gate. Regressions fail the build.
  2. Online eval runs sampled rubrics against live spans. Drift detector. Score distribution shifts before users complain.

Same rubric definition, two cadences. The eval logic doesn’t fork. The longer playbook is in agent passes evals fails production.
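
The two cadences can share one rubric function and differ only in the harness around it. The following is a rough sketch under that assumption; groundedness here is a toy word-overlap rubric, and offline_gate and online_scores are hypothetical harness names rather than any library's API.

```python
import random


def groundedness(output: str, context: str) -> float:
    """Toy rubric: fraction of output words that also appear in the context."""
    words = output.lower().split()
    ctx = set(context.lower().split())
    return sum(w in ctx for w in words) / len(words) if words else 0.0


def offline_gate(dataset: list[dict], threshold: float) -> bool:
    """CI cadence: score every row; the mean must clear the threshold."""
    scores = [groundedness(r["output"], r["context"]) for r in dataset]
    return sum(scores) / len(scores) >= threshold


def online_scores(live_spans: list[dict], rate: float, rng: random.Random) -> list[float]:
    """Production cadence: score a sampled subset of live spans."""
    return [groundedness(s["output"], s["context"])
            for s in live_spans if rng.random() < rate]


rows = [{"output": "the statute is section 12",
         "context": "section 12 is the statute"}]
print(offline_gate(rows, threshold=0.8))  # True
```

Because both harnesses call the same groundedness function, a rubric change lands in the CI gate and the drift detector at once.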

Cost is the lever that makes this work. Frontier judges on every span can double your inference bill. Fine-tuned classifiers at sub-cent-per-call economics let you score 100 percent of traffic; reserve frontier judges for adjudication on borderline traces. Future AGI’s turing_flash runs guardrail-grade screening at 50-70 ms p95; full eval templates run at roughly 1-2 seconds. BYOK sits on top, so any LLM can back the evaluator at zero platform fee. These per-eval economics are what separate “we eval every span” from “we eval one percent and hope.”
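
The classifier-first, judge-on-borderline policy reduces to a small routing function. The following is a minimal sketch, assuming both scorers return a value in [0, 1]; the 0.3 and 0.7 band edges and the stub scorers are placeholders, not recommended values or real models.

```python
from typing import Callable


def two_stage_score(
    text: str,
    classifier: Callable[[str], float],
    judge: Callable[[str], float],
    low: float = 0.3,
    high: float = 0.7,
) -> tuple[float, str]:
    """Score with the cheap classifier; escalate only borderline results to the judge."""
    score = classifier(text)
    if low <= score <= high:
        return judge(text), "judge"
    return score, "classifier"


judge_calls = {"count": 0}


def stub_classifier(text: str) -> float:
    # Placeholder scorer: longer text scores higher.
    return min(len(text) / 100, 1.0)


def stub_judge(text: str) -> float:
    judge_calls["count"] += 1  # track how often the expensive path runs
    return 0.9


results = [two_stage_score(t, stub_classifier, stub_judge)
           for t in ("x" * 10, "x" * 50, "x" * 95)]
print([source for _, source in results], judge_calls["count"])
# ['classifier', 'judge', 'classifier'] 1
```

Only the middle input lands in the borderline band, so the expensive judge runs once for three scored responses.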

The trace tree as the unit

A request comes in. The agent plans. It retrieves four chunks. It calls a tool. The tool calls another tool. The model writes a response. A guardrail rewrites it. A judge scores it. That entire sequence is a tree of spans rooted at the request. The trace tree, not the chat log, is the unit of work for LLM observability.

Why this matters operationally:

  • Debugging. The first question on a regression is “where in the tree did this go wrong.” The trace tree answers it. A flat log of LLM calls doesn’t.
  • Eval attribution. A failing faithfulness score on the root response points at the retrieval span via parent-child links. The fix lives at the retrieval span, not the LLM call.
  • Cost attribution. Per-span cost sums up the tree. You can see that 70 percent of the cost lived in the retriever’s reranker step, not the final LLM call. The aggregate metric never tells you that.
  • Replay. A saved trace tree replays in pre-prod with the same prompt versions, the same tool stubs, the same judge model. The reproducibility unit is the tree.
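
Cost attribution over the tree is a recursive sum over parent-child links. The following is a small sketch, assuming flat span records with id, parent, and an integer cost in microdollars; the span data is invented to reproduce the 70 percent reranker case described above.

```python
from collections import defaultdict

# Illustrative span records, costs in integer microdollars.
spans = [
    {"id": "root", "parent": None, "kind": "AGENT", "cost_micro": 0},
    {"id": "retrieve", "parent": "root", "kind": "RETRIEVER", "cost_micro": 2_000},
    {"id": "rerank", "parent": "retrieve", "kind": "RERANKER", "cost_micro": 70_000},
    {"id": "answer", "parent": "root", "kind": "LLM", "cost_micro": 28_000},
]

# Index children by parent id so the tree can be walked top-down.
children = defaultdict(list)
for s in spans:
    children[s["parent"]].append(s)
by_id = {s["id"]: s for s in spans}


def subtree_cost(span: dict) -> int:
    """Own cost plus the cost of every descendant span."""
    return span["cost_micro"] + sum(subtree_cost(c) for c in children[span["id"]])


total = subtree_cost(by_id["root"])
share = subtree_cost(by_id["rerank"]) / total
print(f"rerank share of trace cost: {share:.0%}")  # rerank share of trace cost: 70%
```

The aggregate total alone would report 100,000 microdollars per request and hide that the final LLM call accounts for well under a third of it.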

The OpenInference span kinds cover the topology a modern agent traverses: LLM, CHAIN, RETRIEVER, TOOL, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR, plus A2A_CLIENT and A2A_SERVER for agent-to-agent protocol calls. Future AGI traceAI ships 14 span kinds; Phoenix ships around 8; Langfuse’s native format ships 5 and bridges the rest. Span-kind depth is the difference between a trace UI that renders a multi-agent topology correctly and one that flattens it into a wall of LLM calls.

The span-vs-trace decomposition is in LLM span vs trace; the broader trace anatomy in what is LLM tracing.

Production patterns: sampling, retention, redaction

Five operational choices decide whether the stack survives load: sampling, retention, redaction, throughput, and cost telemetry.

Tail-based sampling. Head-based sampling (drop a percentage of traces at emit time) is the wrong default for LLM systems because failure is rare and high-value. Use tail sampling at the OTel collector: retain 100 percent of errors, retain 100 percent of below-threshold eval scores, retain 100 percent of high-cost traces (above the 95th percentile), retain a fixed percentage (1-10 percent) of clean traces for trend data. Phoenix, Langfuse, Future AGI, and Datadog all support tail sampling on judge score or status. The collector sees the full trace before deciding to drop it.
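
The retention rules above amount to a decision function evaluated once the whole trace is available. The following is a sketch of that logic in plain Python, assuming each finished trace has been summarized into a dict; the field names and thresholds are illustrative, and in a real deployment the policy would live in the collector's tail-sampling configuration rather than in application code.

```python
import hashlib


def keep_trace(trace: dict, eval_threshold: float, cost_p95: float,
               clean_rate: float) -> bool:
    """Tail-sampling decision made after the full trace has been seen."""
    if trace["status"] == "error":
        return True  # keep every error
    if trace["min_eval_score"] < eval_threshold:
        return True  # keep every below-threshold eval
    if trace["cost"] > cost_p95:
        return True  # keep every high-cost trace
    # Deterministic sample of clean traces: hash the trace id into [0, 1).
    digest = hashlib.sha256(trace["trace_id"].encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < clean_rate


clean = {"trace_id": "abc123", "status": "ok", "min_eval_score": 0.92, "cost": 0.004}
failed = dict(clean, min_eval_score=0.41)
print(keep_trace(failed, eval_threshold=0.7, cost_p95=0.05, clean_rate=0.05))  # True
```

Hashing the trace id instead of drawing a random number keeps the decision stable if the same trace is evaluated by more than one collector instance.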

Retention tiering. Trace storage is the dominant cost line at scale. 100K requests per day, 30 spans per request, 1 KB per span lands at roughly 90 GB per month before payload compression. Three-tier the storage: hot ClickHouse for 14-30 days (fast queries, dashboards, alerts), warm columnar for 90 days (compliance + retrospective debugging), cold object storage for 1-7 years if regulated. Retention is the lever procurement actually argues about; quote a per-GB-month rate before signing.
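
The 90 GB figure follows directly from the three inputs, and the same arithmetic sizes each tier. The following is a short worked calculation, assuming decimal units (1 KB = 1,000 bytes), a 30-day month, and that each tier holds its full retention window; steady_state_gb is an illustrative helper name.

```python
requests_per_day = 100_000
spans_per_request = 30
bytes_per_span = 1_000  # 1 KB, decimal units, before compression

gb_per_day = requests_per_day * spans_per_request * bytes_per_span / 1e9
print(gb_per_day)  # 3.0


def steady_state_gb(retention_days: int) -> float:
    """Volume a tier holds once it is full: daily ingest times days retained."""
    return gb_per_day * retention_days


print(steady_state_gb(30))  # 90.0  (one month of ingest, or a 30-day hot tier)
print(steady_state_gb(90))  # 270.0 (a 90-day warm tier)
```

Multiplying each tier's steady-state volume by the quoted per-GB-month rate gives the number procurement will argue about.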

PII redaction. Redact at the collector, not at the application. The application emits the full span; a collector processor strips secrets and PII before export. This keeps instrumentation code clean and centralizes the redaction policy. Future AGI traceAI ships built-in PII redaction at the span-attribute layer so raw secrets never leave your network. Phoenix and Langfuse handle redaction via configurable processors; OpenLLMetry leaves it to your downstream collector. Verify before procurement.
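
A redaction processor is, at its core, a pass over string-valued span attributes before export. The following is a minimal sketch of that pass, assuming attributes are a flat dict; the two regular expressions are illustrative only (the sk- key shape is an assumed example format), and a production policy needs a reviewed and tested rule set.

```python
import re

# Illustrative patterns only, not a complete PII or secret policy.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "api_key": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
}


def redact_attributes(attributes: dict) -> dict:
    """Return a copy of the span attributes with matching substrings masked."""
    clean = {}
    for key, value in attributes.items():
        if isinstance(value, str):
            for name, pattern in PATTERNS.items():
                value = pattern.sub(f"[REDACTED:{name}]", value)
        clean[key] = value  # non-string values pass through unchanged
    return clean


span_attrs = {
    "input.value": "Email jane.doe@example.com the brief, key sk-abcdefghijklmnop1234",
    "usage.input_tokens": 42,
}
print(redact_attributes(span_attrs)["input.value"])
# Email [REDACTED:email] the brief, key [REDACTED:api_key]
```

Returning a copy rather than mutating in place lets the same span be checked against the unredacted original in tests.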

Sustained throughput. Most production agent stacks need 10K+ spans per second sustained. Langfuse on tuned ClickHouse, Phoenix via Arize AX cloud, Future AGI with ClickHouse trace storage, and Datadog all clear that bar in customer environments. Helicone scales to 1K+ requests per second on standard ClickHouse and higher with tuning. Run a load test against your real span schema; vendor numbers usually assume smaller payloads than production traffic carries. The longer treatment is in LLM tracing best practices.

Cost telemetry on the same trace. Per-span cost (cost.input, cost.output, cost.cache_read, cost.total) belongs on the span the model call lives on. Hierarchical budgets (org, team, user, key, tag) belong at the gateway. The Future AGI Agent Command Center gateway ships microdollar-accurate cost on every response (x-prism-cost, x-prism-cache, x-prism-fallback-used headers) and enforces five-level hierarchical budgets so a runaway loop rejects at the gateway, not after the bill arrives. The detailed cost attribution patterns live in AI agent cost optimization observability.
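
A five-level budget check can be sketched as an all-or-nothing charge against every level a request belongs to. The following is a simplified, single-threaded illustration, assuming integer microdollar costs; BudgetTree and try_charge are hypothetical names and do not describe the gateway's actual implementation.

```python
LEVELS = ("org", "team", "user", "key", "tag")


class BudgetTree:
    """Tracks spend in integer microdollars at each of five levels."""

    def __init__(self, limits: dict[tuple[str, str], int]):
        self.limits = limits  # (level, name) -> limit in microdollars
        self.spent: dict[tuple[str, str], int] = {}

    def try_charge(self, scope: dict[str, str], cost_micro: int) -> bool:
        """Charge every level, or reject without charging if any limit would be exceeded."""
        keys = [(level, scope[level]) for level in LEVELS]
        for k in keys:
            limit = self.limits.get(k)  # levels without a limit are unbounded
            if limit is not None and self.spent.get(k, 0) + cost_micro > limit:
                return False
        for k in keys:
            self.spent[k] = self.spent.get(k, 0) + cost_micro
        return True


budgets = BudgetTree({("org", "acme"): 1_000_000, ("user", "u1"): 50_000})
scope = {"org": "acme", "team": "legal", "user": "u1", "key": "k1", "tag": "research"}
print(budgets.try_charge(scope, 30_000))  # True
print(budgets.try_charge(scope, 30_000))  # False: user u1 would exceed 50,000
```

Checking all levels before charging any of them is what lets a runaway loop be rejected before the spend is recorded, not reconciled afterwards.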

Where Future AGI fits

Most production stacks end up running three or four products: one for traces, one for evals, one for the gateway, one for guardrails. Future AGI is the pick when those have to live on the same Apache 2.0 self-hostable plane with OpenInference-shaped traces as the unit.

  • traceAI (github.com/future-agi/traceAI): Apache 2.0 OTel-native, 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C#. Four pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) via a single register() argument. 14 OpenInference span kinds including A2A_CLIENT, A2A_SERVER, EVALUATOR, GUARDRAIL. Built-in PII redaction at the span-attribute layer.
  • ai-evaluation (github.com/future-agi/ai-evaluation): Apache 2.0, 50+ EvalTemplate classes (Groundedness, ContextAdherence, FactualAccuracy, TaskCompletion, EvaluateFunctionCalling, PromptInjection, plus 11 customer-agent-specific templates) as pytest CI scorers and span-attached online scorers. Same rubric runs offline and live. Lower per-eval cost than Galileo Luna-2 on classifier-backed rubrics. EvalTag API wires rubric to span at zero added inference latency.
  • Agent Command Center (docs.futureagi.com/docs/command-center): Apache 2.0 single Go binary gateway. 100+ providers, BYOK routing, five-level hierarchical budgets, exact and semantic caching, microdollar cost accounting. 18+ runtime guardrails inline as GUARDRAIL spans on the same trace tree.
  • Closed self-improving loop: eval-driven prompt optimization through agent-opt (six optimizers: RandomSearch, BayesianSearch, MetaPrompt, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) plus simulation through simulate-sdk plus the Error Feed (HDBSCAN clustering plus a Claude Sonnet 4.5 Judge writing immediate_fix). Other tools ship the parts; Future AGI ships the loop.

Free tier with 50 GB tracing, 2,000 AI credits, 100K gateway requests, 100K cache hits, 1M text simulation tokens, 60 voice minutes, unlimited datasets and prompts. Boost $250/mo, Scale $750/mo (HIPAA), Enterprise from $2,000/mo (SOC 2). SOC 2 Type II + HIPAA + GDPR + CCPA certified per futureagi.com/trust; ISO/IEC 27001 in active audit. The OSS landscape comparison is in best open-source LLM observability; the self-hosting topology in LLM observability self-hosting guide.

The verdict

LLM observability is not a dashboard. It is an architecture: OpenTelemetry on the wire, OpenInference or OTel-GenAI on the attributes, evaluator scores on the spans, a classifier-plus-LLM-judge sampling policy on the eval, and a backend you can swap on the storage. Get those five right and the rest of the discipline composes cleanly. Get the wire format wrong and every other choice locks tighter.

The 2026 question isn’t “which platform has the prettiest trace UI.” It’s “which architecture survives the next vendor pivot.” OTel-native with LLM-aware conventions does. Proprietary span formats with OTel bridges layered on don’t. The teams that picked OTel in 2024 are still running the same instrumentation today; the teams that picked a vendor format are migrating. If you’re starting greenfield, instrument once with OTel and OpenInference, attach evaluator scores to spans via EvalTag at register time, route through a BYOK gateway with cost on the trace, and pick a backend that ships error analysis and self-improving evaluators in the same product. The trace tree is the unit. Everything else builds on it.

Sources

OpenTelemetry GenAI semantic conventions · OpenInference conventions · Future AGI traceAI · Future AGI ai-evaluation · Agent Command Center docs · Arize Phoenix · OpenLLMetry / Traceloop

Agent observability vs evaluation vs benchmarking · Best LLM tracing tools in 2026 · What is OpenInference · LLM tracing best practices · What does a good LLM trace look like

Frequently asked questions

What is LLM observability in one sentence?
LLM observability is OpenTelemetry tracing with LLM-aware span attributes and evaluator scores attached to those spans, exported through OTLP into a backend you can swap. The unit of observability is the trace tree, not the request log. Every model call, retrieval, tool invocation, guardrail decision, and judge score becomes a span node with OpenInference or OTel-GenAI conventions on top of it. Stacks that ship their own span format lock you in next quarter, because attribute drift between proprietary schema and the OpenTelemetry GenAI working-group conventions becomes a real maintenance line.
Why does OpenTelemetry matter for LLM observability?
OTel is the wire protocol that won across infra in 2024-2025 and now carries LLM-specific conventions on top. OpenInference (Arize), OTel-GenAI (the OpenTelemetry GenAI SIG), and OpenLLMetry (Traceloop) all sit on OTLP, so the same span renders in Phoenix, Future AGI, Tempo, Honeycomb, or Datadog. ServiceNow's March 2026 acquisition of Traceloop/OpenLLMetry validated OTel as the LLM observability standard. Picking an OTel-native stack means you can swap collectors and backends without re-instrumenting. Picking a vendor-shaped wire format means you re-instrument when the vendor's interest diverges from yours.
What is eval-as-span-attribute?
Eval-as-span-attribute is the pattern where evaluator scores attach to the span as first-class attributes, not on a separate dashboard. The span carries gen_ai.evaluation.score.value, gen_ai.evaluation.score.label, and gen_ai.evaluation.explanation written by an automated judge, the same way an HTTP span carries http.response.status_code. The eval lives on the trace tree, so a failing faithfulness score points directly at the retrieval span that caused it. Without this, you have two dashboards that never join, and the loop from failing trace to retuned rubric stays open.
Which semantic convention should I pick: OpenInference, OTel-GenAI, or OpenLLMetry?
OpenInference is the most production-tested in 2026 (Arize ships it, Phoenix and Future AGI render it natively, attribute names are stable). OTel-GenAI is the long-run standard once the SIG finalizes its full spec, and most tools already emit the gen_ai.* baseline. OpenLLMetry's traceloop.* namespace overlaps with both. Future AGI traceAI ships all four conventions (FI native, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) on the same wire with a single register() argument, so the choice stays revisitable. If you're starting greenfield, instrument with OpenInference and re-export to gen_ai.* once the GenAI SIG ships v1.
How do I sample LLM traces without losing failure visibility?
Use tail-based sampling on the OTel collector. Retain 100 percent of failures, errors, high-cost spans, and below-threshold eval scores. Sample 1-10 percent of clean traces for trend data. Head-based percentage sampling alone drops the rare failure that caused the customer complaint, which is the exact trace you need. Phoenix, Langfuse, Future AGI, and Datadog all support tail sampling on judge score or status. Combine with retention tiering: hot ClickHouse for 14-30 days, warm columnar storage for 90 days, and cold object storage for 1-7 years if regulated, with raw payloads encrypted at the collector.
How does Future AGI cover this?
traceAI ships 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C# with four pluggable semantic conventions on the same wire and 14 OpenInference span kinds including A2A_CLIENT, A2A_SERVER, EVALUATOR, and GUARDRAIL. ai-evaluation attaches eval scores via the EvalTag API at register() time, so the collector runs them server-side post-export and writes scores back to the span without adding user-request latency. Agent Command Center fronts 100+ providers with cost attribution and 18+ runtime guardrails on the same trace stream. Apache 2.0 self-hostable, SOC 2 Type II + HIPAA + GDPR + CCPA certified.