Observability

What Is a Trace (in LLM Observability)?

The full causal lineage of one request through an LLM or agent system, composed of nested spans sharing a trace_id.

A trace is the full causal record of one user request as it moves through an LLM or agent system. It is a tree of spans — each span a timed operation (an LLM call, a retrieval, a tool call, a guardrail check, a sub-agent dispatch) — sharing a single trace_id and linked by parent-span relationships. Each span carries an attribute bag (model, tokens, latency, prompt, completion). The trace, not the log line, is the unit of debugging in modern LLM observability: it shows you what the system did on one turn and where time, tokens, and quality went.
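
As a minimal sketch of that structure, here is how nested spans form the tree with the plain OpenTelemetry SDK; the span names and attribute values are invented for illustration, and each child span automatically shares its parent's trace_id:

from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

# One user turn = one trace. Nested "with" blocks build the parent-child tree;
# every span opened inside the root automatically shares its trace_id.
with tracer.start_as_current_span("handle_user_turn"):
    with tracer.start_as_current_span("retrieval") as retrieval:
        retrieval.set_attribute("retrieval.top_k", 5)          # illustrative attribute
    with tracer.start_as_current_span("llm_call") as llm:
        llm.set_attribute("gen_ai.request.model", "gpt-4o")    # the span's attribute bag
        llm.set_attribute("gen_ai.usage.input_tokens", 1850)
        llm.set_attribute("gen_ai.usage.output_tokens", 240)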

Why It Matters in Production LLM and Agent Systems

LLM systems are not request-response. A single user turn can fan out into a planner step, three retrievals against different stores, four tool calls (one of which kicks off a sub-agent), a critique pass, and a final completion — 15 to 50 spans for one request. Without a trace, this is invisible. You see “the bot answered slowly” in user feedback and 30 unrelated log lines in stdout. With a trace, you see the planner spent 4.1s waiting on a stale vector index, the retriever returned irrelevant chunks, and the tool call retried six times before succeeding.

The pain hits engineers at three layers. SREs cannot triage incidents without a trace tree — a flat log stream cannot tell you which step in a multi-step pipeline regressed. ML engineers cannot run regression evals if production traces are not captured with prompt and completion content; they have nothing to replay. Compliance leads cannot answer “which model saw this user’s PII” without a session.id-tagged trace.

In 2026-era agentic workloads — LangGraph branches, OpenAI Agents SDK handoffs, Google ADK orchestrations — the trace is also the graph. State diffs between nodes, the loop counter on a recursive call, and the tool decision that triggered a sub-agent all live on the trace. A flat span list buries this; a tree-rendered trace surfaces it.

How FutureAGI Handles Traces

FutureAGI captures traces through traceAI, its OpenTelemetry instrumentation library. A LangChain RAG app instrumented with traceAI-langchain emits one trace per chain invocation: a top-level CHAIN span, a RETRIEVER span for the vector lookup, an LLM span for the completion, and any TOOL spans the agent calls. Each span carries gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.cost.total, and fi.span.kind to distinguish what kind of work happened.
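
Assembled, one such trace might look roughly like the structure below; the span names, IDs, and numeric values are invented for illustration, while the span kinds and attribute keys are the ones described above:

# Roughly what one assembled trace looks like (IDs and values invented).
trace_tree = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "spans": [
        {"span_id": "a1", "parent_span_id": None, "fi.span.kind": "CHAIN",
         "name": "RetrievalQA.invoke", "duration_ms": 5230},
        {"span_id": "b2", "parent_span_id": "a1", "fi.span.kind": "RETRIEVER",
         "name": "vectorstore.similarity_search", "duration_ms": 4100},
        {"span_id": "c3", "parent_span_id": "a1", "fi.span.kind": "LLM",
         "name": "chat.completions",
         "gen_ai.request.model": "gpt-4o",
         "gen_ai.usage.input_tokens": 1850,
         "gen_ai.usage.output_tokens": 240,
         "gen_ai.cost.total": 0.012},
    ],
}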

The trace is rendered in the FutureAGI platform as a flame graph plus a graph view — pick the rendering that matches the question. Latency triage uses the flame graph; agent debugging uses the graph view, which preserves loop edges and handoff arrows that a flame graph collapses.

The differentiator is what attaches to a trace. session.id and user.id make traces filterable by user cohort; gen_ai.evaluation.score.value written by fi.evals.TrajectoryScore puts a quality verdict on the same trace as the latency data, so the SRE on call sees both signals in one place. Unlike LangSmith, where evals live in a separate tab, FutureAGI’s trace view is the eval view — quality drift and latency drift surface together.
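
One way to attach those identifiers from application code is to set them on the current span with the generic OpenTelemetry API; this sketch is not a FutureAGI-specific helper, and the function and argument names are placeholders:

from opentelemetry import trace

def handle_turn(user_id: str, session_id: str, message: str):
    # Tag the active span so traces are filterable by user cohort and
    # multi-turn conversations can be stitched together by session.
    span = trace.get_current_span()
    span.set_attribute("user.id", user_id)
    span.set_attribute("session.id", session_id)
    # ... planner / retrieval / LLM calls run under this span ...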

In a typical incident, an engineer filters to “all traces in the last hour where TrajectoryScore < 0.6 and gen_ai.usage.total_tokens > 20000,” opens any one of them, and sees the exact tool-call loop or stale-retriever path that caused the trajectory to fail.

How to Measure or Detect It

Traces are themselves the measurement primitive; here is what to track:

  • Trace-level attributes: trace_id, session.id, user.id, total trace duration, span count, error count.
  • Per-span attributes: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.client.operation.duration, fi.span.kind.
  • Trace-level eval: fi.evals.TrajectoryScore returns a 0–1 score across the whole agent trajectory; write it back as gen_ai.evaluation.score.value on the root span.
  • Coverage health: percentage of production requests producing a complete trace (target ≥ 99%); orphan-span rate < 1%.
  • Cost-per-trace: aggregated gen_ai.cost.total by trace_id, sliced by user cohort.
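
Capturing these starts with instrumentation; the minimal traceAI setup for a LangChain app looks like this:
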
from fi_instrumentation import register
from traceai_langchain import LangChainInstrumentor

trace_provider = register(project_name="prod-rag")
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
# every chain.invoke() now emits a trace
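
Coverage health and cost-per-trace can then be computed offline from exported spans. The following is a back-of-the-envelope sketch, assuming spans arrive as dicts whose keys mirror the attribute names above; it is not a FutureAGI API:

from collections import defaultdict

def trace_health(spans):
    # spans: list of dicts exported from the OTLP backend; key names assumed
    # to mirror the attribute names used in this article.
    by_trace = defaultdict(list)
    for s in spans:
        by_trace[s["trace_id"]].append(s)

    span_ids = {s["span_id"] for s in spans}
    orphans = [s for s in spans
               if s.get("parent_span_id") and s["parent_span_id"] not in span_ids]

    return {
        "orphan_span_rate": len(orphans) / max(len(spans), 1),         # target < 1%
        "cost_per_trace": {tid: sum(s.get("gen_ai.cost.total", 0.0) for s in group)
                           for tid, group in by_trace.items()},         # slice by cohort upstream
    }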

Common Mistakes

  • Treating a trace as a log line. A trace is a tree, not a string. Logs answer “what did the service say”; traces answer “what causal path produced this answer.”
  • Capturing spans without parent-child links. Spans without a parent_span_id become orphans detached from the trace and hide the call graph. Always propagate OTel context across async tasks and process boundaries (see the propagation sketch after this list).
  • Sampling traces uniformly at 1%. The 99th percentile is where the bug lives. Sample by user, by error, and by eval-fail signal — not uniformly.
  • Forgetting session.id and user.id. Without them, you cannot filter traces by user cohort or stitch a multi-turn conversation, which makes most debugging questions unanswerable.
  • No retention policy. Traces are bulky. Keep full content for 7–14 days, attribute-only for 90, and aggregate metrics indefinitely.
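
For the context-propagation mistake above, the usual fix with the OpenTelemetry Python API is to capture the caller's context before crossing a thread or task boundary and re-attach it inside the worker; the example below is a generic sketch with placeholder names, not tied to any specific framework:

from concurrent.futures import ThreadPoolExecutor
from opentelemetry import context, trace

tracer = trace.get_tracer("llm-app")

def run_tool(parent_ctx, tool_input):
    # Re-attach the caller's context so this span joins the same trace
    # instead of becoming an orphan with no usable parent link.
    token = context.attach(parent_ctx)
    try:
        with tracer.start_as_current_span("tool_call") as span:
            span.set_attribute("tool.input", tool_input)
            # ... actual tool work ...
    finally:
        context.detach(token)

with ThreadPoolExecutor() as pool:
    ctx = context.get_current()            # capture before crossing the boundary
    pool.submit(run_tool, ctx, "lookup X")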

Frequently Asked Questions

What is a trace?

A trace is the complete record of one request through a system — an ordered, parent-child tree of spans that share a single trace_id, capturing every LLM call, tool invocation, retrieval, and sub-agent step.

What is the difference between a trace and a span?

A trace is the whole request lineage; a span is one timed operation inside it. A trace contains many spans; every span belongs to exactly one trace. You filter dashboards by trace_id when you want to see the full tree, by span attributes when you want to see one step.

How do you capture LLM traces?

Instrument frameworks with traceAI (FutureAGI's OpenTelemetry library), which auto-emits spans for OpenAI, LangChain, CrewAI, and 50+ others. Spans share a trace_id via OTel context propagation and ship to any OTLP backend.