
What is LLM Tracing? Spans, OTel GenAI, and Sampling in 2026

LLM tracing is structured spans for prompts, tools, retrievals, and sub-agents under OTel GenAI conventions. What it is and how to implement it in 2026.


A user asks your support agent a question. The reply is short, polite, and correct. The user is satisfied. Your APM dashboard shows zero errors, p95 latency at 1.4 seconds, and 200 status codes across the board. Now look at the trace. The agent retried its tool call eight times because the retriever returned a stale chunk. It called a guardrail twice. It hit the eval scorer late, after the response was already streamed. It burned $1.40 in judge tokens for a question whose cached answer cost $0.002 yesterday. None of that shows up in logs. None of it shows up in APM. It only shows up in a tree of spans that names every step the request took.

That tree is what LLM tracing produces. Pre-AI APM was built for stateless request/response systems where the unit of debugging is one HTTP call. LLM systems fail differently. They fail by being right but expensive. They fail by hallucinating with high confidence. They fail by drifting when a model provider quietly updates weights. They fail by burning loops in agent graphs that no exception ever raises. LLM tracing is the discipline of turning every step of an LLM-powered request into a structured, queryable span so those failures become debuggable.

TL;DR: What LLM tracing is

LLM tracing captures every operation inside an LLM-powered request as a structured span and arranges those spans into a tree under one trace id. Each span carries a start time, end time, status, parent span id, and an attribute bag with the prompt, the completion, the model name, the token counts, the temperature, the system instructions, and the tool definitions. The transport in 2026 is OpenTelemetry, with the OTel GenAI semantic conventions defining a standard gen_ai.* attribute namespace. The instrumentation libraries are OpenInference (around 31 Python packages, with JavaScript and Java coverage), traceAI (FutureAGI’s OTel-native framework across Python, TypeScript, Java, and C#), and vendor SDKs. The backend is your choice: open-source (Langfuse, Phoenix, FutureAGI), closed platforms (LangSmith, Braintrust), or APM-native (Datadog).

Why LLM tracing matters in 2026

Three things made tracing operational, not optional.

First, agents stopped being toys. A single user request inside a real agent stack now generates 10 to 50 spans across LLM calls, retriever queries, tool invocations, guardrail checks, and sub-agent dispatches. Without span-level structure, debugging is grep in a log file. With span structure but no trace tree, you see the spans in chronological order and have to mentally reconstruct which node was inside which loop iteration. Spans plus the parent-child tree is the minimum unit of useful agent debugging.

Second, cost stopped being a footnote. A reasoning model burning 40K output tokens at $15 per 1M tokens turns a single user turn into 60 cents. Multiply by retries, tool calls, judge evals, and a feature can cost more than the user’s monthly subscription. Token-level cost attribution per user, per prompt version, per route, per feature flag is now an operational requirement. APM dashboards do not carry token counts as a first-class metric.

Third, quality became a runtime signal, not a release-time one. Models drift when providers update weights. RAG quality drifts when the underlying corpus changes. Prompt rollouts have second-order effects you only see in production. The standard answer is span-attached eval scores: every production span carries a quality verdict from a heuristic check, an LLM-as-judge, schema validation, or citation grounding. Latency alerts catch infra. Eval score alerts catch quality drift. Both ride on the trace.

The transport caught up. The OpenTelemetry project defined GenAI semantic conventions that name gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.request.temperature, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.id, and gen_ai.response.finish_reasons as standard span attributes. The spec is in development status as of 2026, gated by OTEL_SEMCONV_STABILITY_OPT_IN, but the direction is settled.

Figure: one LLM-powered request as a five-span trace tree, with a top-level support_agent.run span containing nested chat_prompt, retriever.search, tool.call.lookup_order, and eval.groundedness spans.

The anatomy of an LLM trace

A trace is one user request from start to finish. A span is one operation inside that trace. The minimum a span carries:

  • Start and end timestamps. Microsecond precision in OTel.
  • Span id and parent span id. The parent id is what builds the tree.
  • Span name. Human-readable, like openai.chat.completion or agent.tool_call.
  • Status. OK, ERROR, or unset. Errors carry a stack trace.
  • Attribute bag. A map of typed key-value pairs.
  • Events. Discrete points in time inside the span; useful for streaming first-token markers and intermediate state.

The novelty is what goes into the attribute bag for an LLM span. None of this is in an http.request span.

Required gen_ai.* attributes

The OpenTelemetry GenAI specification names these as the canonical span attributes, all currently in development status:

  • gen_ai.operation.name: the operation type. Well-known values: chat, embeddings, retrieval, generate_content, execute_tool.
  • gen_ai.provider.name: openai, anthropic, aws.bedrock, azure.ai.inference, google.vertex_ai, etc.
  • gen_ai.request.model: the model id requested.
  • gen_ai.response.model: the model id actually used (sometimes different from the requested id, for example when a provider routes a deprecated id to a successor).
  • gen_ai.response.id: the provider’s completion id.
  • gen_ai.response.finish_reasons: why generation stopped.
  • gen_ai.response.time_to_first_chunk: streaming latency to first token.
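As an illustration, a helper that maps one chat response onto these canonical names. The request and response dict shapes are assumptions for the sketch, not any provider SDK's real types:

```python
def genai_span_attributes(request: dict, response: dict) -> dict:
    """Build the canonical gen_ai.* attribute bag for one chat span."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": request["provider"],
        "gen_ai.request.model": request["model"],
        # The served model can differ from the requested id, e.g. when a
        # provider routes a deprecated alias to a successor snapshot.
        "gen_ai.response.model": response["model"],
        "gen_ai.response.id": response["id"],
        "gen_ai.response.finish_reasons": response["finish_reasons"],
    }

attrs = genai_span_attributes(
    {"provider": "openai", "model": "gpt-4o"},
    {"model": "gpt-4o-2024-08-06", "id": "chatcmpl-abc123", "finish_reasons": ["stop"]},
)
```

Centralizing the mapping in one helper also gives you a single place to pin a conventions version when the spec renames an attribute.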

Request parameter attributes

  • gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.request.top_k
  • gen_ai.request.max_tokens, gen_ai.request.seed, gen_ai.request.stop_sequences
  • gen_ai.request.frequency_penalty, gen_ai.request.presence_penalty
  • gen_ai.request.choice.count, gen_ai.request.stream

Token usage attributes

  • gen_ai.usage.input_tokens: prompt tokens.
  • gen_ai.usage.output_tokens: completion tokens.
  • gen_ai.usage.cache_creation.input_tokens: tokens written to provider cache.
  • gen_ai.usage.cache_read.input_tokens: tokens served from provider cache.
  • gen_ai.usage.reasoning.output_tokens: reasoning tokens for models that expose them separately.

The cache and reasoning attributes matter operationally. A reasoning model that uses 30K reasoning tokens before producing 500 visible output tokens is priced and budgeted differently from a non-reasoning chat call. If your trace schema collapses these into a single token field, your cost dashboards will quietly under-attribute.
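A cost function that respects that distinction looks like the sketch below. The per-million-token prices are hypothetical, and it assumes the provider reports reasoning tokens separately from visible output tokens and bills them at the output rate:

```python
def span_cost_usd(usage: dict, price_per_mtok: dict) -> float:
    """Cost for one span, keeping cache and reasoning tokens separate."""
    cached = usage.get("cache_read_input_tokens", 0)
    reasoning = usage.get("reasoning_output_tokens", 0)  # assumed separate from output_tokens
    return (
        (usage["input_tokens"] - cached) * price_per_mtok["input"]
        + cached * price_per_mtok["cached_input"]             # cached reads are discounted
        + (usage["output_tokens"] + reasoning) * price_per_mtok["output"]
    ) / 1_000_000

cost = span_cost_usd(
    {"input_tokens": 1000, "cache_read_input_tokens": 400,
     "output_tokens": 500, "reasoning_output_tokens": 30_000},
    {"input": 3.0, "cached_input": 0.3, "output": 15.0},  # hypothetical $/1M rates
)
```

Collapse the cached and reasoning fields into a single token count and the same call books as 1,500 tokens instead of a call dominated by 30K reasoning tokens at the output rate.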

Opt-in content attributes

  • gen_ai.input.messages: the full prompt payload.
  • gen_ai.output.messages: the full completion payload.
  • gen_ai.system_instructions: system prompt.
  • gen_ai.tool.definitions: function/tool definitions passed to the model.
  • gen_ai.conversation.id: multi-turn conversation linkage.

These are opt-in because they carry PII. Pre-storage redaction is non-negotiable for regulated workloads. The opt-in flag is the spec’s acknowledgment that capturing the prompt is a compliance decision, not a technical one.
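A minimal pre-storage redaction hook can look like this. The regex is deliberately naive and regulated deployments use a dedicated PII detector, but the shape is the same: transform the payload before it ever becomes a span attribute.

```python
import re

# Illustrative pattern; a real deployment covers phone numbers, account
# ids, addresses, and uses a proper PII detection service.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_messages(messages: list[dict]) -> list[dict]:
    """Return a copy of the message list with email addresses masked."""
    return [
        {**m, "content": EMAIL.sub("[REDACTED_EMAIL]", m["content"])}
        for m in messages
    ]

safe = redact_messages([{"role": "user", "content": "my email is jane@example.com"}])
```

The hook runs in the instrumentation layer or the collector, never in the backend: once the raw prompt lands in storage, the compliance decision has already been made.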

Span types in an LLM trace

Different operations get different span types. The minimum useful set:

LLM call spans. One per call to a chat completion or text generation endpoint. Carry every gen_ai.* attribute above plus the prompt and completion.

Tool call spans. One per function or tool invocation by the model. Carry the tool name, arguments, return value, latency, and status. Nest inside the LLM span that decided to call the tool.

Retriever spans. One per vector search, BM25 search, or hybrid retrieval. Carry the query, the top-k chunks returned, similarity scores, and the index version. Critical for RAG debugging because retrieval misses are the most common source of hallucination.

Sub-agent spans. One per dispatch to a child agent. Nest the child agent’s full trace tree under the parent.

Guardrail spans. One per input or output validator. Carry the rule name, the verdict, and the modified payload if the guardrail rewrote the input or output.

Evaluator spans. One per online scorer (LLM-as-judge, schema check, citation grounder). Either nested inside the parent LLM span or linked via span event.

Custom spans. Anything your business logic does between LLM calls (preprocessing, postprocessing, business rule checks, persistence) gets its own span if you care about its latency or status.

How LLM tracing is implemented

Three integration points: instrumentation, transport, and backend.

Instrumentation libraries

You have three viable paths in 2026.

OpenInference. Arize maintains the OpenInference repository with around 31 Python instrumentation packages covering OpenAI, Anthropic, Bedrock, Groq, MistralAI, LangChain, LlamaIndex, DSPy, CrewAI, Agno, OpenAI Agents, AutoGen, and PydanticAI, plus 13 JavaScript and 4 Java packages including LangChain4j and Spring AI. It describes itself as complementary to the OpenTelemetry GenAI conventions, not a replacement. The instrumentations are OTLP-compatible and send to any OTel backend.

traceAI. FutureAGI maintains traceAI as an Apache 2.0 OTel-native instrumentation framework covering 35+ frameworks across Python, TypeScript, Java (including LangChain4j and Spring AI), and C# (a core library on NuGet). It follows the OpenTelemetry semantic conventions for GenAI, supports custom TracerProviders and OTLP exporters, and ships traces to any OTel-compatible backend (Datadog, Grafana, Jaeger, FutureAGI’s own platform). It is OTel done correctly for LLM workloads, not a vendor lock-in SDK.

Vendor SDKs. Most observability vendors ship their own SDK. Langfuse, LangSmith, Braintrust, Helicone, and Datadog all do. Some are OTel-native, some are proprietary with an OTel translation layer, some are proprietary with no OTel path. Read carefully before instrumenting your codebase against a non-OTel SDK; the switching cost compounds.

Transport

OTLP is the standard. HTTP and gRPC are both supported. The shape is identical; gRPC is faster and the default for service-to-service hops, HTTP is friendlier for browsers and locked-down networks. An OTel collector can sit in the middle to enrich, filter, or route spans across multiple backends. The collector is also where you implement tail-based sampling.

Backends

The backend is what stores, queries, and visualizes traces. Six categories worth naming:

  • OSS LLM-native backends: Langfuse (MIT core, ClickHouse storage), Phoenix (ELv2, OTLP-first), FutureAGI (Apache 2.0, ClickHouse storage with full OTel ingest).
  • Closed platforms: LangSmith, Braintrust. Strong UX, varying OTel posture.
  • APM-native: Datadog LLM Observability, New Relic AI Monitoring. LLM spans inside the APM dashboard. Pricier but unifies LLM and infra signals.
  • Generic OTel backends: Jaeger, Tempo, Grafana. Free or self-hosted, no LLM-specific UI but full OTel fidelity.
  • Cloud-native: AWS X-Ray, Google Cloud Trace, Azure Monitor. Workable, no LLM-specific surface.
  • Self-built: ClickHouse plus a UI. Reasonable when your platform team has ClickHouse expertise.

If your trace volume grows past 100M spans per month, the storage choice matters more than the UI. ClickHouse-backed systems handle this volume comfortably; row-store backends start to wobble at this scale.

Sampling decisions

Cost-driven sampling at 1% hides the long-tail failures you need traces to catch. Two patterns work in production:

Head-based sampling. Decide at trace creation time whether to keep the trace. Cheap because you do not buffer the full trace. The downside: you cannot decide based on outcome, because outcome is not known yet. If you sample at the head, sample by user id (every Nth user gets full traces) and by feature flag (always trace experiments at 100%), not uniformly.
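A deterministic head sampler along those lines hashes the user id so each user is consistently in or out of the sample. The function name and the 5% default are illustrative:

```python
import hashlib

def keep_at_head(user_id: str, experiment: bool = False, rate: float = 0.05) -> bool:
    """Head sampling: a given user is either always traced or never traced."""
    if experiment:                     # always trace experiments at 100%
        return True
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # stable value in [0, 1)
    return bucket < rate
```

Hashing instead of rolling a die per trace is the point: a sampled user's entire session arrives intact rather than as random gaps.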

Tail-based sampling. Buffer the full trace, decide at the end whether to keep it. Expensive because you need a buffer. The benefit: you can keep 100% of traces with errors, eval scores below threshold, p95+ latency, or anomalous token cost. This is the pattern that catches the failures uniform sampling buries.

A reasonable default for LLM workloads: tail-based sampling with these keep rules:

  1. Keep 100% of traces with status = ERROR.
  2. Keep 100% of traces with any eval score below threshold.
  3. Keep 100% of traces in the top 1% of token cost or latency.
  4. Keep 100% of traces tagged with experiment_id.
  5. Sample 5% of the remaining traffic uniformly for distribution analysis.

The 5% number is a starting point. If your eval scoring is online and adds non-trivial cost, drop the uniform sample. If your storage is cheap, raise it.
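The five keep rules compress into one decision function. This is a sketch of the policy only; in production it lives in the collector's tail-sampling layer, and the trace-summary fields here (status, eval scores, percentiles) are assumptions about what your buffering layer computes:

```python
import random

def keep_at_tail(summary: dict) -> bool:
    """Apply the five keep rules to a completed, buffered trace summary."""
    if summary["status"] == "ERROR":                                         # rule 1
        return True
    if any(s < summary["eval_threshold"] for s in summary["eval_scores"]):   # rule 2
        return True
    if summary["cost_percentile"] >= 99 or summary["latency_percentile"] >= 99:  # rule 3
        return True
    if summary.get("experiment_id"):                                         # rule 4
        return True
    return random.random() < 0.05                                            # rule 5
```

Note the rule order: the deterministic keeps run first, so the uniform 5% only applies to the boring remainder.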

Common mistakes when implementing LLM tracing

  • Treating tracing as logging with extra fields. It is not. The parent-child tree, the gen_ai.* attribute schema, and span events are first-class. Bolting them on later means re-instrumenting every call site.
  • Sampling too aggressively at the head. 1% uniform head sampling hides the failures the trace was meant to catch. Use tail sampling for production.
  • Not tagging prompt versions. If you cannot filter spans by prompt version id, you cannot compare A/B prompt rollouts, you cannot attribute regressions, and post-mortems become guesswork.
  • Forgetting redaction. gen_ai.input.messages and gen_ai.output.messages carry PII. Pre-storage redaction is non-negotiable for regulated workloads.
  • Using a proprietary SDK as the only path. A proprietary SDK on top of OTel is fine. A proprietary SDK instead of OTel is a switching cost waiting to be paid.
  • Flattening agent traces. A LangGraph or CrewAI run is a tree. A flat span list buries the loop and the tool decisions. Force tree-structured trace views.
  • No span-attached eval scores. Latency alerts catch infra. Eval score alerts catch quality drift. Without span-attached scores, you have a dataset of inputs and outputs and a separate dataset of eval verdicts that you stitch by primary key in SQL.
  • Ignoring cache and reasoning tokens. A trace schema that collapses gen_ai.usage.cache_read.input_tokens and gen_ai.usage.reasoning.output_tokens into a single token count under-attributes cost on reasoning models and over-attributes it on cached calls.

The future: where LLM tracing is heading

A few directions are settled, others are emerging.

OTel GenAI graduates from development. As of 2026 the spec is gated by OTEL_SEMCONV_STABILITY_OPT_IN. The opt-in flag will eventually flip to default-on. Tools that handle version pinning gracefully will look better than tools that silently drift across attribute renames.

Agent-aware UI becomes the default. A flat span list buries the loop. Tools that render runs as actual graphs and let you replay a single node with new state will pull ahead. The unit of debugging in an agent system is a node-in-graph with input state, output state, and tool calls, not a single LLM call.

Span-attached evals become standard. The shift is from “we run an eval suite at release” to “every production trace carries quality verdicts as it happens.” The CI gate, the on-call alert, and the monitoring dashboard all consume the same score stream.

Open instrumentation, vendor backend. This mirrors what happened in cloud-native. The win is open instrumentation at the SDK layer, with pluggable backends. OTel won the metrics and traces fight in cloud-native because instrumentation owners refused to maintain N parallel SDKs. The same logic is playing out for gen_ai.* attributes.

Span-level cost budgets. Rate limits and token budgets at the gateway layer are common. Span-level budgets (this user, this prompt, this feature gets at most $X per day at p99) are not yet table stakes but are appearing. The data is already in the trace; what is missing is the policy enforcement layer that reads the trace stream and shorts a request when its budget is exhausted.

Long-context retrieval traces become legible. Retrieving 200K tokens of context across multiple stages, with reranking, deduplication, and summarization, is a debugging nightmare without span-level structure. Tools that render the retrieval pipeline as a tree with similarity scores and token counts at each step will pull ahead in RAG-heavy workloads.

The throughline of all six: LLM tracing is becoming the substrate for production AI, the same way distributed tracing became the substrate for cloud-native services. If you cannot see the spans, attribute the cost, attach the scores, and replay the path, you are flying blind on a workload where being wrong is expensive.

How FutureAGI implements LLM tracing

FutureAGI is the production-grade LLM tracing platform built around the OTel-native span tree this post described. The full stack runs on one Apache 2.0 self-hostable plane:

  • traceAI - Apache 2.0 and OTel-based, auto-instrumenting 35+ frameworks across Python, TypeScript, Java (LangChain4j, Spring AI), and C#. LangChain, LlamaIndex, CrewAI, AutoGen, OpenAI Agents SDK, Claude Agent SDK, Pydantic AI, DSPy, Mastra, and Vercel AI SDK all emit the same OpenInference and OTel GenAI semantic conventions.
  • Trace storage - ClickHouse trace storage handles high-volume ingestion. The Agent Command Center renders the trace tree with prompt-version tagging, agent-graph topology, span-kind filtering, and per-cohort comparison.
  • Span-attached evals - 50+ first-party metrics (Groundedness, Tool Correctness, Task Completion, Hallucination, PII, Toxicity, Refusal Calibration) attach to live spans as they arrive. turing_flash runs guardrail screening at 50 to 70 ms p95 and full eval templates at about 1 to 2 seconds.
  • Optimization and gateway - six prompt-optimization algorithms consume failing trajectories, the Agent Command Center fronts 100+ providers with BYOK routing, and 18+ runtime guardrails (PII, prompt injection, jailbreak, tool-call enforcement) read the same trace stream that powers the dashboard.

Pricing starts free with a 50 GB tracing tier; Boost is $250 per month, Scale is $750 per month with HIPAA, and Enterprise from $2,000 per month with SOC 2 Type II.

Most teams adopting LLM tracing end up running three or four tools in production: one for traces, one for evals, one for the gateway, one for guardrails. FutureAGI is the recommended pick because the trace, eval, simulation, gateway, and guardrail surfaces all live on one self-hostable runtime; the loop closes without stitching.


Frequently asked questions

What is LLM tracing in plain terms?
LLM tracing is the practice of capturing every step inside an LLM-powered request as a structured span: the prompt, the model call, the tool invocation, the retriever query, the sub-agent dispatch, the guardrail check. Each span carries a parent id, a duration, a status, and an attribute bag with prompts, tokens, model name, and cost. The tree of spans is the trace. Without traces, debugging a failing agent is grep over a log file. With traces, you replay the exact path the request took.
How is LLM tracing different from regular distributed tracing?
It uses the same OpenTelemetry primitives (spans, traces, parent ids, attribute bags) but the attribute schema is different. A regular HTTP span carries http.method, http.status_code, and http.url. An LLM span carries gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.operation.name, the prompt, the completion, and tool definitions. The OTel GenAI semantic conventions formalize this attribute namespace. The plumbing is the same; the payload is what makes it LLM tracing.
What are OTel GenAI semantic conventions?
A set of standardized attribute names under the gen_ai.* namespace defined by the OpenTelemetry project. Canonical attributes include gen_ai.operation.name (chat, embeddings, execute_tool), gen_ai.provider.name, gen_ai.request.model, gen_ai.request.temperature, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.id, gen_ai.response.finish_reasons. As of 2026, the conventions are still in development status. Tools that handle version pinning gracefully via OTEL_SEMCONV_STABILITY_OPT_IN look better than tools that silently drift.
What is OpenInference and how does it relate to OTel GenAI?
OpenInference is a parallel set of conventions and instrumentations maintained by Arize, predating the OTel GenAI spec and complementary to it. The OpenInference repo ships around 31 Python instrumentation packages across LLM providers, frameworks, and agent platforms, plus 13 JavaScript packages and 4 Java packages. The instrumentations emit OTLP-compatible spans that any OTel backend can consume. In practice, most observability vendors accept either or both, and the attribute sets overlap heavily.
What is traceAI?
traceAI is FutureAGI's Apache 2.0 OTel-native instrumentation framework for 35+ frameworks across Python, TypeScript, Java, and C#. Concretely, packages cover Python, TypeScript, Java (including LangChain4j and Spring AI), and a C# core library on NuGet. It follows OpenTelemetry semantic conventions for GenAI, supports custom TracerProviders and OTLP exporters, and ships traces to any OTel-compatible backend including Datadog, Grafana, Jaeger, or FutureAGI's own platform. It is not a vendor lock-in SDK; it is OTel done correctly for LLM workloads.
How aggressive can I be with trace sampling for LLM workloads?
Less aggressive than you think. Cost-driven sampling at 1% hides the long-tail failures that matter; the p99 is where the bug lives. The pragmatic pattern is tail-based sampling that keeps 100% of traces with errors, with eval scores below threshold, with high token cost, or with long latency, plus a uniform low-rate sample of the rest for distribution analysis. If sampling at the head, sample by user id and by feature flag, not uniformly; you want consistent traces for any single user session.
Do I need traces if I already have logs?
Yes. Logs are unstructured per-event records. Traces give you the parent-child structure of an entire request, including which LLM call called which tool, what state was passed, which retriever query returned what, where the agent loop terminated. For a request that touches 10 to 50 spans across LLM calls, retriever queries, tool invocations, and sub-agent dispatches, logs require manual reconstruction of the call tree. Traces give you the tree natively.
What does an LLM tracing implementation cost in operational complexity?
At minimum: an instrumentation library on each service (traceAI, OpenInference, vendor SDK), an OTel collector or direct OTLP endpoint, a backend for storage and query (ClickHouse, S3, OTel-compatible vendor), and a UI for trace search and replay. The harder operational cost is schema discipline: deciding which attributes are mandatory, which are opt-in (gen_ai.input.messages, gen_ai.output.messages carry PII), and which are derived. A team that gets the schema right at week one saves a quarter of refactoring work later.