Span vs Trace in LLM Observability: What's the Difference in 2026?
A trace is one user request; a span is one operation inside that trace. OTel terminology, parent-child trees, and what makes a good LLM trace in 2026.
You debug a failing agent by reading 30 log lines in chronological order, mentally reconstructing which call was inside which loop, which retry came after which tool failure, and where the loop terminated. The reconstruction takes 20 minutes per failure, and half the time the conclusion is wrong. You debug the same failing agent by reading a span tree, and the answer arrives in 30 seconds. The difference is the data model. Spans have parent ids; logs do not. Logs are flat; traces are trees. The vocabulary matters because the abstraction matters: a trace is one user request, a span is one operation inside it, and the parent_span_id is what makes the abstraction work. This is the entry-point explainer; the deeper tutorials are linked below.
If you want depth, read these next:
- What is LLM Tracing? for spans, OTel GenAI, and sampling
- What is LLM Observability? for the full observability layer
- Best OTel Instrumentation Tools for LLMs for the SDK landscape
TL;DR: Span vs trace
A trace is one user request from start to finish, identified by a single trace_id. A span is one operation inside that trace: an LLM call, a tool call, a retriever query, a sub-agent dispatch, a guardrail check, an evaluator run. One trace contains many spans. The parent_span_id field on each span builds the tree: the root span has no parent; children point at their parent’s span_id. The OpenTelemetry data model is the same for HTTP services and for LLM applications. What changes is the attributes on each span; an LLM span carries gen_ai.* attributes, an HTTP span carries http.* attributes.
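To make the tree mechanics concrete, here is a minimal sketch of such a trace as plain Python dicts. The field names follow the OTel data model; the ids, span names, and model name are invented for illustration.

```python
# One trace, three spans. Every span shares the trace_id; the tree is
# rebuilt entirely from parent_span_id.
trace_spans = [
    {  # root span: no parent, names the user-visible operation
        "trace_id": "4bf92f35", "span_id": "a1", "parent_span_id": None,
        "name": "chat.respond",
    },
    {  # child: the LLM call that served the request
        "trace_id": "4bf92f35", "span_id": "b2", "parent_span_id": "a1",
        "name": "openai.chat.completion",
        "attributes": {"gen_ai.request.model": "gpt-4o"},
    },
    {  # grandchild: a tool call the model decided to make
        "trace_id": "4bf92f35", "span_id": "c3", "parent_span_id": "b2",
        "name": "agent.tool_call",
        "attributes": {"tool.name": "get_weather"},
    },
]
```

Reconstructing the tree is a group-by on trace_id plus a parent lookup on parent_span_id; that is the whole trick.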
Why the vocabulary matters
Three reasons.
First, debugging an agent run with 30 LLM calls, retriever queries, tool calls, and guardrail checks is intractable as a flat log file. You need the tree to see which tool call was inside which retry loop. The vocabulary is what makes the tooling possible: tools that render the tree, query by parent, replay a single node.
Second, cross-service propagation requires the abstraction. When a Python frontend calls a Go retriever calls a Rust embedding service, the trace_id ties the spans across service boundaries. Without the trace_id concept, distributed debugging is back to log correlation by timestamp.
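A minimal sketch of what propagation looks like in code, using the real opentelemetry-api propagation helpers. The service URL and the handler signature are hypothetical, and the callee is written in Python here even though in the example above it could be the Go retriever.

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("frontend")

# Caller side: inject the W3C traceparent header before the outbound call.
def call_retriever(query: str) -> dict:
    with tracer.start_as_current_span("retriever.client_call"):
        headers: dict[str, str] = {}
        inject(headers)  # writes traceparent/tracestate from the current context
        return requests.post("http://retriever:8080/search",  # hypothetical URL
                             json={"q": query}, headers=headers).json()

# Callee side: extract the context so new spans join the same trace
# instead of starting a fresh one.
def handle_search(request_headers: dict, query: str) -> None:
    ctx = extract(request_headers)
    with tracer.start_as_current_span("retriever.search", context=ctx) as span:
        span.set_attribute("rag.retriever.top_k", 5)
```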
Third, the eval and monitoring layer attaches scores at the span level. An eval score that says “Faithfulness 0.7 on this trace” is less actionable than “Faithfulness 0.7 on the generation span at depth 3, which received chunks from retriever span at depth 2 with similarity 0.81”. Span-level granularity is what makes monitoring precise.
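A hedged sketch of span-level score attachment: the faithfulness_judge scorer, the pretend outputs, and the eval.* attribute names are all invented for illustration, not part of any spec.

```python
from opentelemetry import trace

tracer = trace.get_tracer("evals")

def faithfulness_judge(answer: str, chunks: list[str]) -> float:
    """Hypothetical stand-in for an LLM-as-judge scorer."""
    return 0.7

with tracer.start_as_current_span("llm.generate") as gen_span:
    chunks = ["chunk one", "chunk two"]   # pretend retriever output
    answer = "a generated answer"          # pretend model output
    score = faithfulness_judge(answer, chunks)
    # The score lands on the exact span it describes, not on the whole trace.
    gen_span.set_attribute("eval.faithfulness.score", score)
    gen_span.set_attribute("eval.faithfulness.verdict",
                           "pass" if score >= 0.8 else "fail")
```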
The plumbing caught up in parallel. The OpenTelemetry data model formalized spans and traces in 2019 for cloud-native services. The OTel GenAI semantic conventions (in development as of 2026) extended the attribute namespace for LLM workloads under gen_ai.*. OpenTelemetry-aware observability tools, including FutureAGI, Langfuse, Arize Phoenix, Datadog, and the OpenTelemetry collector ecosystem, speak this language natively.

What a span carries
Every span carries the same minimum metadata regardless of what it represents. A code sketch after the lists below shows the fields in use.
Identity
- trace_id: ties the span to one user request.
- span_id: unique identifier of this span.
- parent_span_id: identifier of the span that owns this one; null for the root.
Timing
- start_time and end_time: Unix-nanosecond timestamps in OTel; actual clock precision depends on the SDK and runtime.
- Duration is end_time minus start_time.
Status
- OK, ERROR, or unset.
- Error spans can include exception events with an error message and stack trace when the instrumentation records them.
Name
- Human-readable, like openai.chat.completion, retriever.search, agent.tool_call, guardrail.check.
Attribute bag
- Typed key-value pairs.
- For LLM spans: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.operation.name.
- For tool spans: tool.name, tool.arguments, tool.return_value.
- For retriever spans: rag.retriever.top_k, rag.retriever.chunks, rag.retriever.index.version.
- Custom application attributes go here too: app.prompt.version, app.feature.flag, app.user_id_hash.
Events
- Discrete points in time inside the span.
- Useful for streaming first-token markers, intermediate state, and sub-events too small to warrant their own span.
Links
- References to other spans not in the parent chain.
- Useful for async patterns where one span starts work that another span finishes.
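Putting the pieces together, here is a minimal sketch of one LLM span using the public opentelemetry-api. Exporter setup is omitted, and the model name, token counts, and prompt version are invented; the API calls themselves are real.

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("my-llm-app")

with tracer.start_as_current_span("openai.chat.completion") as span:
    # Identity and timing (trace_id, span_id, parent_span_id, start/end)
    # are filled in automatically by the SDK.
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("app.prompt.version", "v12")   # custom attribute
    try:
        # ... the real model call would go here ...
        span.add_event("gen_ai.first_token")          # streaming marker as an event
        span.set_attribute("gen_ai.usage.input_tokens", 812)
        span.set_attribute("gen_ai.usage.output_tokens", 164)
        span.set_status(Status(StatusCode.OK))
    except Exception as exc:
        span.record_exception(exc)                    # exception event + stack trace
        span.set_status(Status(StatusCode.ERROR))
        raise
```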
What a trace contains
A trace is the collection of every span that shares a trace_id, plus the implicit tree structure built from parent_span_id.
The root span is the entry point: usually the user-visible operation like chat.respond or support_agent.run. Below it sit nested spans for each LLM call, retriever query, tool call, guardrail check, and sub-agent dispatch.
Tree depth depends on the application:
- A simple chat completion is a shallow trace: a root request span plus one LLM call span beneath it.
- A RAG agent is a 2-3 level tree: root, retriever, generation, optional grounding check.
- A multi-step agent with tool calls and sub-agents is a 4-7 level tree: root, planning LLM, tool call, sub-agent dispatch, sub-agent’s LLM, sub-agent’s tool call, evaluator.
Most observability backends render this tree natively as nested horizontal bars on a time ruler.
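In code, the tree falls out of nesting: each start_as_current_span opened inside another becomes its child via parent_span_id. A minimal sketch of the RAG shape above, with pretend outputs standing in for real calls:

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag-app")

def respond(question: str) -> str:
    with tracer.start_as_current_span("chat.respond"):              # root
        with tracer.start_as_current_span("retriever.search") as r: # depth 1
            r.set_attribute("rag.retriever.top_k", 2)
            chunks = ["chunk a", "chunk b"]                         # pretend results
        with tracer.start_as_current_span("llm.generate") as g:     # depth 1
            g.set_attribute("gen_ai.request.model", "gpt-4o")
            answer = f"an answer grounded in {len(chunks)} chunks"  # pretend output
        with tracer.start_as_current_span("guardrail.check"):       # depth 1
            pass                                                    # pretend verdict
        return answer
```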
Span types in an LLM trace
Different operations get different span types. The minimum useful set:
LLM call spans
One per call to a chat completion or text generation endpoint. Carry every gen_ai.* attribute plus the prompt and completion (opt-in). Nest under the application logic span that triggered the LLM call.
Tool call spans
One per function or tool invocation by the model. Carry the tool name, arguments, return value, latency, and status. Nest inside the LLM span that decided to call the tool.
Retriever spans
One per vector search, BM25 search, or hybrid retrieval. Carry the query, the top-k chunks returned, similarity scores, and the index version. For depth, see What is RAG Observability?
Sub-agent spans
One per dispatch to a child agent. The child agent’s full trace tree nests under the parent. The shape is a sub-tree, not a flat span.
Guardrail spans
One per input or output validator. Carry the rule name, the verdict, and the modified payload if the guardrail rewrote the input or output.
Evaluator spans
One per online scorer (LLM-as-judge, schema check, citation grounder). Either nested inside the parent LLM span or connected via a span link when the eval runs out-of-band.
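A sketch of the out-of-band pattern using OTel span links; the span names and score value are illustrative. The key constraint is that links are supplied when the span is created:

```python
from opentelemetry import trace
from opentelemetry.trace import Link

tracer = trace.get_tracer("evals")

with tracer.start_as_current_span("llm.generate") as gen_span:
    # Capture the context so a later, out-of-band worker can link back.
    gen_ctx = gen_span.get_span_context()

# Later, possibly in another process, with gen_ctx carried across:
with tracer.start_as_current_span("evaluator.faithfulness",
                                  links=[Link(gen_ctx)]) as eval_span:
    eval_span.set_attribute("eval.faithfulness.score", 0.7)
```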
Custom spans
Anything your business logic does between LLM calls (preprocessing, postprocessing, business rule checks, persistence) gets its own span if you care about its latency or status.
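One low-ceremony way to get there is a decorator. This helper is hypothetical, built on the real opentelemetry-api; the span name is an example:

```python
import functools
from opentelemetry import trace

tracer = trace.get_tracer("app")

def spanned(name: str):
    """Wrap any business-logic step in its own span, so latency and
    status come for free."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(name):
                return fn(*args, **kwargs)
        return wrapper
    return decorator

@spanned("postprocess.citations")
def attach_citations(answer: str, chunks: list[str]) -> str:
    return answer  # business logic goes here
```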
Common mistakes when modeling LLM spans and traces
- Treating spans as logs. Logs have timestamps and severity. Spans have parent_span_id, attribute bags, and structured status. Bolting span structure on top of log lines later means re-instrumenting every call site.
- Span-per-line-of-code. Too many spans crush the backend and make the tree unreadable. Pick a granularity (every LLM call, every tool call, every retriever call) and stick to it.
- Span-per-chunk in retrievers. Each retriever call is one span; chunks are an attribute on that span, not children. Span-per-chunk explodes trace size 10x. A sketch of the correct shape follows this list.
- Flat trace for agent runs. A LangGraph or CrewAI run is a tree. A flat span list buries the loop and the tool decisions. Force tree-structured trace views with parent_span_id discipline.
- No prompt version tag. Add app.prompt.version as a custom attribute on every LLM span. Without it, A/B prompt rollouts are not analyzable.
- No trace_id propagation across services. When a Python service calls a Go service, the trace_id has to pass in headers. Without propagation, the cross-service tree is broken.
- Forgetting redaction. gen_ai.input.messages and gen_ai.output.messages carry PII. Configure the SDK or collector to redact before storage. Treat retriever chunk_text the same way.
- Ignoring the root. Without a cleanly named root span (chat.respond, support_agent.run), traces are anonymous. Force a top-level span at every public entry point.
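To pin down the retriever mistake above, here is the correct granularity: one span per retrieval, all chunks serialized into a single attribute. The rag.* attribute names mirror the examples earlier and are illustrative, not spec-defined:

```python
import json
from opentelemetry import trace

tracer = trace.get_tracer("rag-app")

def search(query: str) -> list[dict]:
    with tracer.start_as_current_span("retriever.search") as span:
        chunks = [{"id": "c1", "score": 0.81},  # pretend search results
                  {"id": "c2", "score": 0.78}]
        span.set_attribute("rag.retriever.top_k", len(chunks))
        # One attribute for all chunks -- NOT one child span per chunk.
        span.set_attribute("rag.retriever.chunks", json.dumps(chunks))
        return chunks
```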
How OTel terminology compares to vendor terms
Different vendors use slightly different vocabulary on top of the OpenTelemetry data model.
| OTel term | Langfuse term | Phoenix term | LangSmith term |
|---|---|---|---|
| Trace | Trace | Trace | Run (top-level) |
| Span (LLM call) | Generation | Span (LLM) | Run (LLM) |
| Span (other op) | Span | Span | Run (chain, tool) |
| Parent span | Parent observation | Parent span | Parent run |
| Attribute | Metadata + I/O | Attribute | Inputs + outputs + metadata |
| Span event | Event | Span event | Event |
OTel is the lingua franca. Vendor-specific terms map cleanly back. The mapping matters when you migrate between vendors: an export from Langfuse is straightforward to import into Phoenix or FutureAGI because the underlying OTel model is the same.
The future: where trace and span semantics are heading
A few directions are settled; others are emerging.
OTel GenAI graduates from development. As of 2026 the spec is gated by OTEL_SEMCONV_STABILITY_OPT_IN. The opt-in flag will eventually flip to default-on. Tools that handle version pinning gracefully will look better than tools that silently drift across attribute renames.
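Pinning the opt-in explicitly is one way to avoid silent drift. A minimal sketch, assuming the gen_ai_latest_experimental opt-in value used by the OTel Python GenAI instrumentations at the time of writing; check your instrumentation's docs before relying on it:

```python
import os

# Pin the GenAI semconv opt-in explicitly rather than inheriting whatever
# the deployment environment happens to set. The value below is the one
# documented for the OTel Python GenAI instrumentations; verify against
# your instrumentation's docs, since opt-in values can change.
os.environ.setdefault("OTEL_SEMCONV_STABILITY_OPT_IN",
                      "gen_ai_latest_experimental")
```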
Convention convergence. OpenInference and OTel GenAI continue to converge. Most backends already accept both. Future versions will reduce duplication.
Agent-aware UI becomes the default. A flat span list buries the agent loop. Tools that render runs as actual graphs and let you replay a single span with new state will pull ahead.
Span-attached evals become standard. The shift is from “we run an eval suite at release” to “every production span carries quality verdicts as it happens.” The CI gate, the on-call alert, and the monitoring dashboard all consume the same score stream.
Span-level cost budgets. Per-user, per-prompt, per-feature budgets that read the live span stream and short-circuit a request when its budget is exhausted. The data is already in the trace; what is missing is the policy enforcement layer.
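Nothing like this is standardized yet; the following is a purely hypothetical sketch of what that enforcement layer could look like, consuming token counts from finished spans. The budget store and the admit/record helpers are invented for illustration; only the gen_ai.usage.* attribute names come from the conventions above.

```python
from collections import defaultdict

BUDGET_TOKENS = 50_000                    # hypothetical per-user budget
spent: dict[str, int] = defaultdict(int)  # in-memory store for the sketch

def admit(user_id: str) -> bool:
    """Check the budget before starting another LLM call."""
    return spent[user_id] < BUDGET_TOKENS

def record(user_id: str, span_attributes: dict) -> None:
    """Feed each finished LLM span from the live span stream into the budget."""
    spent[user_id] += (span_attributes.get("gen_ai.usage.input_tokens", 0)
                       + span_attributes.get("gen_ai.usage.output_tokens", 0))
```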
The throughline of all five: by 2026, span and trace are not internal vocabulary; they are the interface across instrumentation libraries, observability backends, eval platforms, and monitoring layers. Get the abstraction right and the rest of the LLM observability stack composes; get it wrong and every component pays the cost.
FAQ
The frequently asked questions at the end of this post cover the common points. For deeper coverage of any single topic, follow the related posts.
How to use this with FAGI
FutureAGI is the production-grade OTel-native LLM tracing stack. traceAI is Apache 2.0 OTel instrumentation across Python, TypeScript, Java, and C# (50+ integrations) that emits OTel-conformant spans with gen_ai.* attributes plus openinference.* / llm.* namespaces; the Agent Command Center ingests OTLP, persists in ClickHouse, and renders the trace tree with span-kind filtering, prompt-version tagging, and per-cohort comparison. Eval scores attach at the span level: turing_flash handles guardrail screening at 50 to 70 ms p95, and full eval templates run in about 1 to 2 seconds.
Because the model is OTel under the hood, exports from Langfuse, Phoenix, or LangSmith move into FutureAGI without re-instrumentation. The same plane carries 50+ eval metrics, persona-driven simulation, the BYOK gateway across 100+ providers, and 18+ guardrails on one self-hostable surface. Pricing starts free with a 50 GB tracing tier; Boost ($250/mo), Scale ($750/mo), and Enterprise ($2,000/mo with SOC 2 and HIPAA BAA) cover the maturity ladder.
Sources
- OpenTelemetry concepts: traces and spans
- OpenTelemetry GenAI semantic conventions
- OpenTelemetry GenAI span attributes
- OpenInference GitHub repo
- traceAI GitHub repo
- Langfuse data model docs
- Phoenix tracing docs
- LangSmith tracing docs
- FutureAGI pricing
- Datadog LLM Observability docs
Series cross-link
Read next: What is LLM Tracing?, What is LLM Observability?, Best LLM Tracing Tools in 2026, Best OTel Instrumentation Tools for LLMs
Frequently asked questions
What is the difference between a span and a trace in plain terms?
A trace is one user request from start to finish, identified by a single trace_id. A span is one operation inside that trace: an LLM call, a tool call, a retriever query. One trace contains many spans.
What does a span actually carry?
Identity (trace_id, span_id, parent_span_id), start and end timestamps, a status, a human-readable name, a typed attribute bag, and optionally events and links.
How do I read a trace as a tree?
Follow parent_span_id. The root span has no parent; every other span points at its parent's span_id. Most backends render the result as nested horizontal bars on a time ruler.
What is an observation in this terminology?
Observation is Langfuse's umbrella term for a node in the trace tree; a Langfuse generation is an LLM-call span. In OTel terms, an observation maps to a span.
What does a good LLM trace look like?
A cleanly named root span, one span per LLM call, tool call, retriever query, and guardrail check, gen_ai.* attributes on the LLM spans, a prompt-version tag, and eval scores attached at the span level.
How deep should an LLM trace go?
A simple chat completion is two levels; a RAG pipeline is 2-3; a multi-step agent with sub-agents is 4-7. One span per meaningful operation, never per line of code or per retrieved chunk.
How are span attributes different for LLM workloads?
The data model is identical; only the namespace changes. An LLM span carries gen_ai.* attributes (model, token usage, operation name) where an HTTP span carries http.* attributes.
What is the trace_id used for?
It ties every span of one user request together, including across service boundaries, where it must be propagated in request headers.