Guides

LLM App Observability with OpenTelemetry: The 2026 Setup

OTel for LLM apps in 2026 = OTel-GenAI + OpenInference + eval-as-span-attribute. The three layers, the traceAI register pattern, span enrichment, and sampling.

·
Updated
·
12 min read
opentelemetry llm-observability openinference traceai otel-genai llm-tracing agent-observability 2026
Editorial cover image for LLM App Observability with OpenTelemetry 2026
Table of Contents

An LLM platform engineer pings you Tuesday morning. “The summary agent dropped a key clause again. We need to know why.” You open the trace. The model call landed clean. Latency was fine. The retrieval looked normal. Nothing in the OTel span tree tells you the chunk that should have carried the clause was demoted to rank 4 by the reranker and never made it into the context window. Your tracer captured the trip; it did not capture the cause. OTel for LLM apps in 2026 is OTel-GenAI semantic conventions plus OpenInference span kinds plus eval-as-span-attribute, on a tracer that ships all three natively. Anything less and you have built a logging system pretending to be observability.

This post is the setup playbook: the three layers, the register() pattern that wires them in one call, how to enrich spans with using_attributes and EvalTag, the sampling and retention patterns that survive production, and where traceAI fits.

The three layers

The instrumentation stack for an LLM app is not one thing. It is three concentric conventions on the same OTLP wire.

Layer 1: OTel base. The wire format (OTLP), span context propagation (W3C trace context), and the collector pipeline. This layer is solved. Every backend that matters speaks OTLP, every microservice already emits it, and the collector is where tail sampling, PII redaction, and routing live. No LLM-specific work happens here. Base OTel gives you a request log. That is all.

Layer 2: OTel-GenAI semantic conventions. The OpenTelemetry project’s gen_ai.* namespace, maintained by the GenAI SIG. Covers model identity (gen_ai.system, gen_ai.request.model, gen_ai.response.model), token usage (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cache_read_tokens), operation semantics (gen_ai.operation.name), and the evolving gen_ai.evaluation.* set for score attachment. The spec is still in Development, but the experimental contract is stable enough that mature instrumentation libraries pin against it. This is the long-run standard once v1 ships.

Layer 3: OpenInference span kinds. Arize’s parallel convention under openinference.*, with a longer track record and the piece OTel-GenAI does not yet enumerate: the 14 LLM-aware span kinds. LLM, CHAIN, RETRIEVER, TOOL, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR, CONVERSATION, VECTOR_DB, A2A_CLIENT, A2A_SERVER, UNKNOWN. These are what make the trace tree render as an agent topology instead of a flat list of HTTP calls. A RETRIEVER span carries top-k documents and similarity scores. A GUARDRAIL span carries the rule that fired and why. A2A span pairs carry propagated trace_id across agent boundaries so a multi-agent system stays in one tree.

The 2026 stack ships all three. Base OTel as the substrate, gen_ai.* for model semantics, openinference.* for span kinds and message capture. They live on the same span. The tracer flips the dominant namespace at register time, and the other attribute set rides along as compatibility metadata.

The mistake teams keep making is treating these as a choice. They are not. ServiceNow’s March 2026 acquisition of Traceloop/OpenLLMetry validated OTel as the substrate; the OpenInference span-kind taxonomy is what makes the substrate useful; OTel-GenAI is what closes the loop with the OpenTelemetry project itself. Ship all three. The longer treatment of the convention layer is in what is LLM observability.

Practical setup: the register pattern

The whole instrumentation step is one function call. Place it in your app’s startup module, before any LLM client is constructed.

from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType, EvalTag, EvalTagType, EvalSpanKind, EvalName, ModelChoices,
)
from fi_instrumentation.otel import SemanticConvention, Transport
from traceai_openai import OpenAIInstrumentor
from traceai_langchain import LangChainInstrumentor

trace_provider = register(
    project_name="checkout_assistant",
    project_type=ProjectType.OBSERVE,
    project_version_name="v2.4.0",
    semantic_convention=SemanticConvention.OPENINFERENCE,
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.GROUNDEDNESS,
            model=ModelChoices.TURING_LARGE,
            mapping={"input": "input.value", "output": "output.value"},
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.PROMPT_INJECTION,
            model=ModelChoices.TURING_FLASH,
            mapping={"input": "input.value"},
        ),
    ],
    metadata={"git_sha": "abc123"},
    batch=True,
    transport=Transport.HTTP,
)

OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
LangChainInstrumentor().instrument(tracer_provider=trace_provider)

Five lines worth reading carefully.

semantic_convention=SemanticConvention.OPENINFERENCE picks the dominant namespace. Flip to OTEL_GENAI if the backend prefers gen_ai.* rendering; flip to OPENLLMETRY if the existing collector pipeline is OpenLLMetry-shaped. The four-way switch is what lets you change the backend without re-instrumenting the call sites.

project_type=ProjectType.OBSERVE is the production mode. EXPERIMENT is the dataset-bound mode used for offline regression runs.

eval_tags=[...] declares server-side rubrics at register time. The hot path of the LLM call writes one attribute into the span and exits. The evaluator runs after the export, against the matching spans, on the FI collector. Scores write back to gen_ai.evaluation.* attributes on the same span. Zero added latency on the user request. This is the eval-as-span-attribute pattern in code.

batch=True queues exports through OTel’s BatchSpanProcessor. Spans buffer in memory and ship in batched OTLP requests rather than one network call per span.

Each Instrumentor().instrument() patches the underlying SDK. OpenAIInstrumentor covers openai.chat.completions.create and the streaming variants; LangChainInstrumentor covers LangChain runnables and emits the LangGraph topology attributes (langgraph.node.name, langgraph.node.type, conditional edge events) when the agent runs on LangGraph. The instrumentor catalog covers 46 Python frameworks, 39 TypeScript packages, 24 Java modules, and a C# core. The pattern is identical across languages.

To point traceAI at a non-FAGI collector, set FI_BASE_URL and FI_GRPC_URL to the collector address and choose the semantic convention the backend understands. The exporter is OTLP either way.

Span enrichment: using_attributes and EvalTag

The register call wires the tracer. Span enrichment is where the trace stops being a request log and starts being a debugger.

using_attributes

Most observability work fails on tenant attribution. Engineers look at a failing trace and cannot tell which session, which user, which feature flag, or which prompt template version produced it. The using_attributes context manager solves this in three lines.

from fi_instrumentation import using_attributes

with using_attributes(
    session_id="sess_a1b2c3",
    user_id="user_42",
    metadata={"prompt_template": "checkout_v3.2", "ab_bucket": "treatment"},
    tags=["checkout", "p1"],
):
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": query}],
    )

Every span produced inside the block carries session.id, user.id, the metadata bag, and the tag list. The attributes propagate down the trace tree, so a child RETRIEVER span emitted by LangChain inside the same context inherits the same session.id without re-wiring. The block is also the right place to attach prompt template versions, feature flags, A/B buckets, and cost centers. Anything you want to filter by in the backend belongs here.

Two patterns earn their keep in production.

Per-request envelope. Wrap the inbound request handler in a using_attributes block. Every span the request produces is now tagged with the request’s session, user, and routing metadata. This is the only attribution strategy that survives async work, retries, and tool-call fan-out.

Per-stage scoping. A retrieval-heavy pipeline has stages (preprocess, retrieve, rerank, generate, postprocess) and you want to slice latency and cost by stage. Nest a second using_attributes block per stage, setting a pipeline.stage attribute. The collector now renders the trace as a pipeline view, not a flat call list.

EvalTag

EvalTag is the declaration form of eval-as-span-attribute. Each tag specifies which span kind it scores (OBSERVATION_SPAN plus EvalSpanKind.LLM, RETRIEVER, TOOL, or AGENT), which built-in rubric to run (GROUNDEDNESS, CONTEXT_ADHERENCE, PROMPT_INJECTION, CONTENT_MODERATION, FACTUAL_ACCURACY, TASK_COMPLETION, plus 50-plus more), which judge model to use (TURING_LARGE for full eval at ~1-2 s, TURING_FLASH for fast screening at 50-70 ms p95), and how to map span attributes into the rubric’s input slots.

The tags ride along on the OTel Resource. The FI collector reads them, runs the matching evals server-side against ingested spans, and writes scores back to gen_ai.evaluation.<rubric>.score.value, gen_ai.evaluation.<rubric>.score.label, and gen_ai.evaluation.<rubric>.explanation on the original span. The next time you open that trace, the score is on it.

The architecture matters. Langfuse requires langfuse.score() SDK calls in the request path. Phoenix has no built-in eval engine and routes scoring through its parallel phoenix.evals surface. The server-side EvalTag pattern is the only one that keeps “every span has a score” practical at 10K-plus spans per second. Treatment of the rubric layer itself is in LLM-as-a-judge best practices.

Production patterns: sampling, retention, redaction

Three operational choices decide whether the stack survives load.

Tail-based sampling. Head-based percentage sampling is the wrong default for LLM workloads. Failure is rare. Failure is high-value. Dropping the trace that caused the customer complaint is the worst possible outcome of a sampling policy. Configure the OTel collector for tail-based sampling instead. Retain 100 percent of errors, 100 percent of spans with gen_ai.evaluation.*.score.value below the rubric’s threshold, 100 percent of gen_ai.cost.total above the 95th percentile, and 1 to 10 percent of clean traces for trend data. The collector sees the full trace before deciding to drop it; the policy decision is informed by what actually happened in the request.

Tail-sampling configuration sits in the collector YAML, not the application. The application emits everything; policy lives downstream. Phoenix, Langfuse, traceAI, and Datadog all support tail sampling on judge score or status.

Retention tiering. Trace storage is the dominant cost line at scale. A baseline of 100K requests per day, 30 spans per request, 1 KB per span before payload compression lands at roughly 90 GB per month. Real LLM workloads have larger payloads (full message arrays, retrieved chunks, tool results) and push the number higher. Three-tier the storage. Hot ClickHouse for 14 to 30 days drives the live dashboards and alerts. Warm columnar for 90 days covers compliance and retrospective debugging. Cold object storage for 1 to 7 years covers the regulated window. Compress payloads at the collector; encrypt at rest.

Retention is the lever procurement actually argues about. Quote per-GB-month rates before signing anything.

Redaction at the collector. PII and secrets should never leave the application network with raw payloads attached. The right architecture is: application emits the full span, a collector processor strips secrets and PII (regex patterns, named-entity classifiers, allow-listed attribute keys) before export, redacted spans flow to the backend. This keeps instrumentation code clean and centralizes policy.

traceAI ships built-in PII redaction at the span-attribute layer, so the network hop after the application never carries the raw secret in the first place. The collector-side processor is the second line of defense. Phoenix and Langfuse handle redaction via configurable processors; OpenLLMetry leaves it to the downstream collector. Verify before procurement.

Sustained throughput. Production agent stacks routinely sustain 10K-plus spans per second. Langfuse on tuned ClickHouse, Phoenix via Arize AX cloud, Future AGI with ClickHouse trace storage, and Datadog all clear that bar in customer environments. Run a load test against your real span schema; vendor numbers always understate payload size. The deeper treatment lives in LLM tracing best practices.

Cost on the same trace. Per-span cost (gen_ai.cost.input, gen_ai.cost.output, gen_ai.cost.cache_read, gen_ai.cost.total) belongs on the span the model call lives on. Hierarchical budgets (org, team, user, key, tag) belong at the gateway so runaway loops reject before the bill arrives. The pattern is covered end-to-end in AI agent cost optimization observability.

Where Future AGI fits

Most production stacks end up running three or four products: one for traces, one for evals, one for the gateway, one for guardrails. Future AGI is the pick when those have to live on the same Apache 2.0 self-hostable plane with OpenInference-shaped traces as the unit.

traceAI is the Apache 2.0 instrumentation surface. Fifty-plus AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C#. Four pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) on the same wire via the semantic_convention= argument. Fourteen OpenInference span kinds including A2A_CLIENT, A2A_SERVER, EVALUATOR, and GUARDRAIL. Built-in PII redaction at the span-attribute layer. Native LangGraph topology capture (langgraph.node.name, langgraph.node.type, conditional edge events, state diffs per node) so multi-node agents render as topologies, not flat span lists. The voice attribute surface (gen_ai.voice.*) and the computer-use surface (gen_ai.computer_use.*) cover what teams retrofit later at three times the cost; the OpenInference voice agent guide and Claude Code observability post walk those surfaces in depth.

ai-evaluation is the Apache 2.0 eval library on top of traceAI. Fifty-plus EvalTemplate classes as pytest CI scorers and span-attached online scorers. The same rubric runs offline against a versioned dataset on every PR and online against sampled live spans. The EvalTag API wires rubrics to spans at register time at zero added inference latency. Per-eval economics beat Galileo Luna-2 on the published classifier-backed rubrics; BYOK on top of the platform fee on judge calls.

Agent Command Center is the Apache 2.0 single Go-binary gateway. One hundred-plus providers, BYOK routing, five-level hierarchical budgets, exact plus semantic caching, microdollar cost accounting. Eighteen-plus runtime guardrails emit inline as GUARDRAIL spans on the same trace tree. The gateway emits OpenInference-compatible spans for every call that flows through https://gateway.futureagi.com/v1, so app-side traceAI and provider-side gateway spans join on trace_id for one tree per request. The headers (x-prism-cost, x-prism-latency-ms, x-prism-model-used, x-prism-fallback-used, x-prism-routing-strategy, x-prism-guardrail-triggered) are span attributes inline.

The closed self-improving loop is the differentiator. Spans flow into Observe projects. The Error Feed runs HDBSCAN soft-clustering across related failures, a Claude Sonnet 4.5 Judge writes an immediate_fix per cluster, and the fixes feed the platform’s self-improving evaluators so the rubric gets sharper as production traffic flows. agent-opt then optimizes prompts using six optimizers (RandomSearch, BayesianSearch with teacher-inferred few-shot and resumable Optuna sessions, MetaPrompt, ProTeGi, GEPAOptimizer, PromptWizardOptimizer), with EarlyStoppingConfig to keep runs bounded. Other tools ship the parts; Future AGI ships the loop.

Free tier with 50 GB tracing, 2,000 AI credits, 100K gateway requests, 100K cache hits, 1M text simulation tokens, 60 voice minutes, unlimited datasets and prompts. Boost $250/mo, Scale $750/mo (HIPAA), Enterprise from $2,000/mo (SOC 2). SOC 2 Type II + HIPAA + GDPR + CCPA certified per futureagi.com/trust; ISO/IEC 27001 in active audit.

Anti-patterns to avoid

Six patterns kill an OTel rollout faster than no observability at all.

  • Vendor-SDK-only instrumentation. Every backend swap becomes a re-instrumentation. The cost compounds every quarter.
  • One convention, no escape hatch. Coding against a single semantic convention without a pluggable layer means every backend change rewrites the call sites.
  • Manual with tracer.start_as_current_span(...) blocks on every LLM call. Works for a demo, breaks at scale. The framework instrumentor is what patches every code path the framework exposes.
  • No LangGraph topology capture. A multi-node agent traced as a flat span list hides which node failed, which conditional edge fired, and which state diff propagated the bad input. The topology attributes are not decoration; the LangGraph agent evaluation guide covers the full attribute surface.
  • No gateway-emitted spans. App-side traces miss provider-side latency, the routing decision, and the cost attribution. Gateway-emitted spans close that gap.
  • Treating evaluation as post-hoc. A trace without a score is a request log. Server-side EvalTag evaluation puts the score on the span at zero request latency. This is the only practical way to keep “every span has a score” true at production volume.

The verdict

OTel for LLM apps in 2026 is settled. Base OTel on the wire, OTel-GenAI for model semantics, OpenInference for span kinds, eval-as-span-attribute on the trace tree itself. The instrumentation is one register() call plus one-line instrumentors per framework. Span enrichment is using_attributes for tenant metadata and EvalTag for scores. Production patterns are tail-based sampling, three-tier retention, collector-side redaction, gateway-emitted cost.

The 2026 question is not “which tracer has the prettiest UI.” It is “which architecture survives the next vendor pivot.” OTel-native with all three convention layers does. Proprietary span formats with OTel bridges layered on do not. Pin the semantic conventions in your repo, attach eval scores to spans at register time, and pick a tracer that ships all three layers natively. The trace tree is the unit. Everything else builds on it.

Read the OTel instrumentation tool comparison next for the head-to-head of producers that ship spans worth ingesting, or the traceAI getting-started post for the install steps without the architecture context.

Sources

OpenTelemetry GenAI semantic conventions · OpenInference conventions · Future AGI traceAI · Future AGI ai-evaluation · Agent Command Center docs · W3C trace context · Arize Phoenix

What is LLM observability · Best LLM tracing tools in 2026 · LLM tracing best practices · Agent observability vs evaluation vs benchmarking · What does a good LLM trace look like

Frequently asked questions

What is the OTel stack for LLM apps in 2026?
Three layers. Base OTel handles the wire format (OTLP), span context propagation, and the collector pipeline. OTel-GenAI sits on top with the gen_ai.* attribute namespace for model identity, token usage, and cost. OpenInference sits beside it with the 14 LLM-aware span kinds (LLM, RETRIEVER, TOOL, AGENT, GUARDRAIL, EVALUATOR, A2A_CLIENT, A2A_SERVER, others) that OTel-GenAI does not yet enumerate. You ship all three together because OTel base alone is a request log, OTel-GenAI alone has no span-kind taxonomy, and OpenInference alone misses the OTel-project blessing. Pick a tracer that emits all three on the same wire and the stack becomes portable.
OTel-GenAI or OpenInference — which do I pick?
You do not pick. You emit both. OTel-GenAI is the OpenTelemetry project's gen_ai.* namespace covering model name, token counts, and operation semantics. OpenInference is Arize's parallel set under openinference.* covering the LLM-aware span kinds and full message capture. The two communities are converging and most serious tracers ship them on the same span. traceAI lets you flip the dominant convention at register() time via semantic_convention=OTEL_GENAI or OPENINFERENCE; the other attribute set rides along as compatibility metadata so any backend can render the trace.
What is eval-as-span-attribute?
The evaluator score writes back to the same span the LLM call emitted, as gen_ai.evaluation.score.value, gen_ai.evaluation.score.label, and gen_ai.evaluation.explanation. The eval becomes a property of the trace, not a parallel dataset. The hot path of the LLM call sees zero added latency because the evaluator runs server-side post-export. A failing groundedness score now points directly at the retrieval span that caused it. Stacks that keep evals in a separate dashboard force two retention policies, two access paths, and a context switch on every regression.
How do I enrich spans with tenant and session metadata?
Use the using_attributes context manager from fi_instrumentation. Wrap the LLM call in `with using_attributes({"session.id": ..., "user.id": ..., "tag.tags": [...]}):` and every span produced inside the block carries those attributes. The pattern works for any custom field you want sliceable in the backend: prompt template version, feature flag, cost center, A/B bucket. The attributes propagate down the trace tree, so a child retriever span inherits the same session.id without re-wiring.
How do I sample LLM traces without losing failures?
Tail-based sampling at the OTel collector. Retain 100 percent of errors, 100 percent of spans with eval scores below threshold, 100 percent of high-cost spans, and 1 to 10 percent of clean traces for trend data. The collector sees the full trace before deciding to drop it, which is the only sampling strategy that survives an LLM workload because failure is rare and high-value. Head-based percentage sampling drops the rare hallucination that caused the customer complaint. Pair with three-tier retention: hot ClickHouse for 14 to 30 days, warm columnar for 90 days, cold object storage for compliance windows.
Can I use traceAI with Jaeger, Tempo, Datadog, or Honeycomb?
Yes. traceAI is an OTLP producer. Point the exporter at any OTLP-compatible collector, pick the semantic convention the backend understands, and the spans flow. Future AGI is one supported backend; Phoenix, Tempo, Jaeger, Honeycomb, New Relic, and Datadog OTel intake all work. The application code does not change. This is the architectural payoff of OTel-or-bust: the SDK is stable and the backend is the variable.
How does Future AGI fit?
traceAI ships 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C# with four pluggable semantic conventions on the same wire (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) and 14 OpenInference span kinds. The EvalTag API attaches scores at register() time, so the collector runs them server-side and writes back to gen_ai.evaluation.* on the matching spans without adding user-request latency. Agent Command Center fronts 100+ providers with cost on every response and 18+ runtime guardrails on the same trace stream. Apache 2.0 self-hostable; SOC 2 Type II + HIPAA + GDPR + CCPA certified.
Related Articles
View all