What Is LLM Tracing?
LLM tracing records LLM and agent requests as OpenTelemetry traces with model calls, spans, tokens, costs, tools, retrievals, and eval context.
LLM tracing is an LLM observability technique that records one model or agent request as a structured trace of spans. A trace preserves the prompt, model, retrievals, tool calls, guardrails, response, token usage, cost, latency, and errors for each step in the production run. Unlike a log line, it preserves parent-child causality across multi-step pipelines. That is the difference between knowing “this request was slow” and knowing “the retriever called the vector database twice, the second call hit a cold index, and the model spent 18K output tokens recovering.” FutureAGI uses traceAI integrations such as traceAI-langchain to emit OpenTelemetry spans with gen_ai.* attributes and to attach evaluator results to the same trace.
In 2026 a single agent turn against Claude Opus 4.7 or GPT-5.x can fan out into 30+ spans across LLM calls, retrievers, tools, MCP servers, and sub-agents. Flat logs cannot represent that. The trace tree is the unit of debugging now.
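To make the parent-child idea concrete, here is a minimal sketch of a trace as a tree of spans. The `Span` class and `render_tree` helper are hypothetical illustrations, not part of traceAI or OpenTelemetry; they only show why a tree keeps causality that a flat log loses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record (not the OpenTelemetry data model).
    span_id: str
    name: str
    kind: str  # e.g. "CHAIN", "RETRIEVER", "LLM", "TOOL"
    parent_id: Optional[str] = None
    duration_ms: float = 0.0
    attributes: dict = field(default_factory=dict)


def render_tree(spans):
    """Return one indented line per span, following parent-child links."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.kind} {span.name} ({span.duration_ms:.0f} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


spans = [
    Span("a", "support_turn", "CHAIN", None, 4200),
    Span("b", "vector_search", "RETRIEVER", "a", 900),
    Span("c", "answer", "LLM", "a", 3100, {"gen_ai.usage.output_tokens": 18000}),
]
tree = render_tree(spans)
print("\n".join(tree))
```

Each child line is indented under the span that caused it, which is the property a flat log cannot express.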
Why LLM tracing matters in production LLM and agent systems
The failure mode is not “the model errored.” It is usually a hidden chain of small decisions: a retriever returns stale context, the model spends 18K output tokens reasoning around it, a tool call retries twice, a fallback model answers with lower groundedness, and the user receives a plausible but wrong response. Without LLM tracing, those steps collapse into one slow API call or one support ticket.
Developers feel it first because they cannot reproduce a bad answer from logs alone. SREs feel it during incidents because p99 latency says “slow,” but not whether the delay came from a vector store, a tool timeout, or streaming decode. Product and compliance teams feel it later, when they need evidence of which model saw which user data and which guardrail or evaluator approved the final response.
The symptoms are concrete: orphan spans, missing token counts, one trace id split across services, rising token-cost-per-trace, user thumbs-down spikes after a prompt release, or eval failures clustered around one retriever. Agentic systems make this sharper in 2026 because a single user turn may include planning, retrieval, tool use, sub-agent handoff, critique, and fallback. Flat logs hide the causal path; a trace keeps the tree intact.
Why 2026 broke flat-log debugging
Three forces broke the old approach. First, agents routinely run reasoning models that emit 10–40K output tokens for one user turn, so a single span carries 100x the data of a classic HTTP request. Second, MCP and A2A push tool calls and inter-agent calls out of process, so spans cross network boundaries and need W3C trace context propagation. Third, LLM gateways sit in front of every model call to apply routing, fallback, and semantic caching; the same logical request can resolve from cache, fail over to a backup model, or split across two providers, and the trace has to stitch all three outcomes together. Flat application logs have no answer for this. Trace trees do.
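W3C trace context propagation works by passing a `traceparent` header across each network hop. The sketch below parses a version-00 header and builds the header a child call would send. It is a simplified illustration: it ignores `tracestate`, keeps only the sampled flag bit, and the helper names are hypothetical. In practice an OpenTelemetry propagator does this for you.

```python
import re
import secrets

# version-00 format: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid per the W3C Trace Context spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(ctx, span_id=None):
    """Build the header for an outgoing call: same trace id, new span id."""
    span_id = span_id or secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx, "b7ad6b7169203331")
```

Because the trace id survives the hop, a span created in a remote MCP server or peer agent attaches to the caller's tree instead of starting a new one.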
How FutureAGI traces LLM and agent requests
FutureAGI’s approach is to make the production trace the shared object for debugging, evaluation, and cost review. In a LangChain RAG workflow, traceAI-langchain creates a root CHAIN span, a RETRIEVER span for vector search, an LLM span for the answer, and optional TOOL, GUARDRAIL, or EVALUATOR spans. The exact fields that matter attach as OpenTelemetry attributes: fi.span.kind, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.server.time_to_first_token, gen_ai.cost.total, and agent.trajectory.step.
A real workflow looks like this: an engineer instruments the checkout-support agent with traceAI, samples production traces where response complaints rise, and filters for fi.span.kind = RETRIEVER plus high downstream gen_ai.usage.output_tokens. The trace shows that an outdated refund-policy chunk entered the prompt. The engineer adds a regression case to the golden dataset, runs HallucinationScore and Groundedness on the sampled output, sets an alert when trace-attached eval score drops below the release threshold, and routes risky requests through a post-guardrail before the next prompt rollout.
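The filter in that workflow can be sketched over exported span records. This is an illustration only: it assumes spans are available as plain dictionaries with a `trace_id` key (a hypothetical export shape), and it approximates "downstream" as "any LLM span in the same trace" rather than walking the tree.

```python
def suspicious_traces(spans, token_threshold=10_000):
    """Return ids of traces that contain a RETRIEVER span and heavy LLM output."""
    by_trace = {}
    for span in spans:
        by_trace.setdefault(span["trace_id"], []).append(span)

    flagged = []
    for trace_id, group in by_trace.items():
        has_retriever = any(s.get("fi.span.kind") == "RETRIEVER" for s in group)
        output_tokens = sum(
            s.get("gen_ai.usage.output_tokens", 0)
            for s in group
            if s.get("fi.span.kind") == "LLM"
        )
        if has_retriever and output_tokens >= token_threshold:
            flagged.append(trace_id)
    return flagged


spans = [
    {"trace_id": "t1", "fi.span.kind": "RETRIEVER"},
    {"trace_id": "t1", "fi.span.kind": "LLM", "gen_ai.usage.output_tokens": 18000},
    {"trace_id": "t2", "fi.span.kind": "RETRIEVER"},
    {"trace_id": "t2", "fi.span.kind": "LLM", "gen_ai.usage.output_tokens": 400},
    {"trace_id": "t3", "fi.span.kind": "LLM", "gen_ai.usage.output_tokens": 20000},
]
flagged = suspicious_traces(spans)
```

The 10,000-token threshold is an arbitrary placeholder; a real filter would more likely use a percentile of recent traffic.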
The traceAI integration matrix in 2026
traceAI ships drop-in instrumentation for 50+ frameworks across Python, TypeScript, Java, and C#. The map matters because in a 2026 agent stack the trace will usually cross at least three of these:
| Layer | traceAI integrations |
|---|---|
| Model providers | traceAI-openai, traceAI-anthropic, traceAI-google-genai, traceAI-bedrock, traceAI-vertexai, traceAI-mistral, traceAI-groq, traceAI-together, traceAI-cohere, traceAI-fireworks, traceAI-deepseek, traceAI-xai, traceAI-ollama, traceAI-vllm, traceAI-watsonx, traceAI-cerebras |
| Agent frameworks | traceAI-langchain, traceAI-langgraph, traceAI-crewai, traceAI-autogen, traceAI-openai-agents, traceAI-google-adk, traceAI-pydantic-ai, traceAI-strands, traceAI-haystack, traceAI-dspy, traceAI-smolagents, traceAI-agno, traceAI-mastra, traceAI-beeai, traceAI-claude-agent-sdk |
| Protocols | traceAI-mcp, traceAI-a2a |
| Vector stores | traceAI-pinecone, traceAI-weaviate, traceAI-chromadb, traceAI-qdrant, traceAI-milvus, traceAI-pgvector, traceAI-lancedb, traceAI-elasticsearch, traceAI-redis, traceAI-mongodb, traceAI-azure-search |
| Voice | traceAI-livekit, traceAI-pipecat |
| Gateway / orchestration | traceAI-litellm, traceAI-portkey, traceAI-instructor, traceAI-guardrails, traceAI-vercel |
| JVM | langchain4j, spring-ai, spring-boot-starter, semantic-kernel, plus Java SDKs for major providers |
The point of this matrix is not the list; it is that one trace can include spans from a traceAI-langchain planner, a traceAI-mcp tool call, a traceAI-openai model call, and a traceAI-pinecone retrieval, all stitched into a single tree by W3C trace context.
Eval-on-trace: the differentiator
Unlike a raw Jaeger trace, which shows timing without LLM quality verdicts, FutureAGI keeps eval results beside the span that produced the answer. HallucinationScore, Groundedness, AnswerRelevancy, Faithfulness, ContextRelevance, ContextPrecision, ContextRecall, ToolSelectionAccuracy, TaskCompletion, TrajectoryScore, PromptInjection, and ProtectFlash can all run as span events. The same trace answers four questions at once: where did latency come from, where did cost come from, where did the agent make the wrong tool choice, and where did answer quality break? In our 2026 evals, teams that wire span-attached evals close 60% of regressions in under an hour because the trace localizes the failure to a single span; teams that wire trace-only spend a day correlating logs and eval dashboards by primary key.
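A rough sketch of why span-attached evals localize a failure: if each span carries its evaluator results as events, finding the broken step is a scan rather than a join across dashboards. The record shape (`events` with `name` and `score`) is a hypothetical simplification, and the sketch assumes higher scores are better, which may not hold for every evaluator.

```python
def failing_spans(spans, thresholds):
    """Return (span name, evaluator name, score) for every score below its threshold."""
    failures = []
    for span in spans:
        for event in span.get("events", []):
            limit = thresholds.get(event["name"])
            if limit is not None and event["score"] < limit:
                failures.append((span["name"], event["name"], event["score"]))
    return failures


spans = [
    {"name": "retrieve", "events": [{"name": "ContextRelevance", "score": 0.9}]},
    {"name": "answer", "events": [{"name": "Groundedness", "score": 0.42}]},
    {"name": "format", "events": []},
]
thresholds = {"ContextRelevance": 0.6, "Groundedness": 0.7}
failures = failing_spans(spans, thresholds)
```

Here the scan points at the `answer` span alone, so the investigation starts at the step that produced the ungrounded output.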
The Agent Command Center in the trace path
The LLM gateway sits in the data path between application code and model providers. A request that hits the gateway emits a GATEWAY span carrying gen_ai.request.model, the route decision (round-robin, weighted, least-latency, cost-optimized, conditional), cache hit status, fallback events, pre-guardrail and post-guardrail verdicts, and timing. When the gateway routes around a degraded provider (say GPT-5.x is slow and traffic shifts to Claude Opus 4.7), the trace shows the original target, the fallback, and the resolved provider. Tracing without a gateway in the path misses cache hits and silent failovers; tracing with the gateway gives you a single end-to-end picture.
How to use LLM tracing in 2026 production debugging
Treat tracing quality as coverage plus signal density. Useful signals include:
- Trace coverage: percentage of production LLM requests with a complete trace tree. Target 99% or higher for critical paths.
- Span completeness: every LLM span has gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, duration, status, and fi.span.kind. Missing fields hide failure modes.
- Causality health: orphan-span rate, missing parent span ids, and traces split across service boundaries. W3C trace context propagation is non-negotiable in 2026.
- Latency shape: p99 trace duration, gen_ai.server.time_to_first_token, and slowest span kind by route.
- Cost density: token-cost-per-trace, output-token spikes, and cost grouped by prompt version or user cohort.
- Quality attachment: HallucinationScore flags unsupported claims; Groundedness checks context support; AnswerRelevancy checks that the answer addressed the request; ToolSelectionAccuracy checks agent tool choice. Attach sampled results to traces as span events.
- User-feedback proxy: thumbs-down rate, escalation rate, and refund/contact rate joined back to trace ids by user.id or session.id.
    # Register one tracer provider, then attach each traceAI instrumentor to it.
    from fi_instrumentation import register
    from traceai_langchain import LangChainInstrumentor
    from traceai_openai import OpenAIInstrumentor
    from traceai_mcp import MCPInstrumentor

    provider = register(project_name="prod-support-agent")

    # Framework, model, and protocol spans all flow through the same provider.
    LangChainInstrumentor().instrument(tracer_provider=provider)
    OpenAIInstrumentor().instrument(tracer_provider=provider)
    MCPInstrumentor().instrument(tracer_provider=provider)
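Two of the signals listed above, orphan-span rate and LLM span completeness, can be computed directly from the spans of one trace. The sketch below assumes a hypothetical export shape (dictionaries with `span_id`, `parent_id`, and `attributes`); duration and status checks are left out for brevity.

```python
REQUIRED_LLM_ATTRS = (
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "fi.span.kind",
)


def trace_health(spans):
    """Compute orphan-span rate and LLM span completeness for one trace."""
    known_ids = {s["span_id"] for s in spans}
    # An orphan names a parent that never arrived in this trace.
    orphans = [
        s for s in spans
        if s.get("parent_id") is not None and s["parent_id"] not in known_ids
    ]
    llm_spans = [s for s in spans if s["attributes"].get("fi.span.kind") == "LLM"]
    complete = [
        s for s in llm_spans
        if all(key in s["attributes"] for key in REQUIRED_LLM_ATTRS)
    ]
    return {
        "orphan_span_rate": len(orphans) / len(spans) if spans else 0.0,
        "llm_span_completeness": len(complete) / len(llm_spans) if llm_spans else 1.0,
    }


spans = [
    {"span_id": "root", "parent_id": None, "attributes": {"fi.span.kind": "CHAIN"}},
    {"span_id": "llm1", "parent_id": "root", "attributes": {
        "fi.span.kind": "LLM",
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 10,
        "gen_ai.usage.output_tokens": 5,
    }},
    {"span_id": "llm2", "parent_id": "root", "attributes": {
        "fi.span.kind": "LLM",
        "gen_ai.request.model": "example-model",
    }},
    {"span_id": "tool1", "parent_id": "missing", "attributes": {"fi.span.kind": "TOOL"}},
]
health = trace_health(spans)
```

In this example one of four spans is an orphan and one of two LLM spans is missing token counts, so both numbers would count against the trace.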
A trace is measurable when an engineer can open one failed user turn and answer: which span failed, which model ran, which context entered the prompt, how many tokens were spent, and which evaluator score crossed threshold. The empirical pressure on trace quality keeps rising: on τ-bench’s multi-turn customer-support set, frontier agents fail end-to-end task completion on 30–50% of trajectories, and on SWE-Bench Verified (500 expert-validated GitHub issues) the best 2026 coding agents resolve well under half the issues. Both are gaps that only show up clearly when you can read the failing trajectory at the span level.
The next snippet adds cohort and prompt-version tags to a live trace so that eval results and trace data are queryable together:
    from opentelemetry import trace
    from fi.evals import ToolSelectionAccuracy, TaskCompletion

    tracer = trace.get_tracer("prod-support-agent")
    tool_eval = ToolSelectionAccuracy()
    task_eval = TaskCompletion()

    def run_turn(user_q, expected_tool, trajectory, final_answer):
        with tracer.start_as_current_span("agent.turn") as span:
            # Tags that make the trace filterable by rollout, cohort, and user.
            span.set_attribute("prompt.version", "v2.3.1")
            span.set_attribute("cohort.name", "billing")
            span.set_attribute("user.id", trajectory.user_id)

            # Score the turn and store the results on the same span.
            t = tool_eval.evaluate(trajectory=trajectory.steps, expected_tool=expected_tool)
            c = task_eval.evaluate(trajectory=trajectory.steps, final_output=final_answer)
            span.set_attribute("gen_ai.evaluation.ToolSelectionAccuracy", t.value)
            span.set_attribute("gen_ai.evaluation.TaskCompletion", c.value)
            return final_answer
Sampling that catches the bugs that matter
Uniform 1% sampling is the wrong default. The 99th percentile is where the bugs live. The 2026 pattern: keep 100% of error spans, 100% of spans where any evaluator score is below threshold, 100% of cost outliers above the per-trace p99, and 1–5% uniform baseline. FutureAGI’s sampling configuration supports per-attribute rules so the same project can downsample healthy spans while keeping every failed eval.
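The keep-everything-interesting pattern can be sketched as a single decision function. This is not FutureAGI's sampling configuration; the trace summary fields (`has_error`, `eval_scores`, `cost_usd`) are hypothetical, and the function assumes higher eval scores are better.

```python
import random


def keep_trace(trace, eval_thresholds, cost_p99, baseline_rate=0.02, rng=random.random):
    """Return (keep, reason) for one finished trace."""
    if trace.get("has_error"):
        return True, "error"
    for name, score in trace.get("eval_scores", {}).items():
        limit = eval_thresholds.get(name)
        if limit is not None and score < limit:
            return True, "eval_failure"
    if trace.get("cost_usd", 0.0) > cost_p99:
        return True, "cost_outlier"
    # Healthy traces fall through to the uniform baseline sample.
    if rng() < baseline_rate:
        return True, "baseline"
    return False, "dropped"


never = lambda: 0.99   # deterministic stand-ins for the random draw
always = lambda: 0.0
decision = keep_trace({"eval_scores": {"Groundedness": 0.4}}, {"Groundedness": 0.7}, 1.0, rng=never)
```

The decision needs the whole trace (errors, eval scores, total cost), so it has to run after the trace completes, which makes this a tail-sampling rule rather than a head-sampling one.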
Multi-agent and MCP trace shapes
A multi-agent system trace looks like a forest of sub-trees, one per agent role, joined by handoff spans. Each agent’s sub-trace contains its own LLM, retriever, and tool spans. The cross-agent edge is the handoff span with source agent name, target agent name, and payload. agent.trajectory.step distinguishes per-agent step counts so a dashboard can answer “which agent role failed on this trajectory?” Compared to a single-framework tracer (LangSmith only sees LangChain, Anthropic’s tracer only sees Claude), traceAI sees every framework, every model, and every protocol, because it instruments the OpenTelemetry layer, not the application surface.
For an MCP-mediated trace, every tools/call, resources/read, and prompts/get produces a span tagged with tool.name, server identity, JSON-serialized arguments, observation, latency, and agent.trajectory.step. When a finance-operations agent connects MCP servers for invoices, contracts, and customer records, the engineer can filter the trace by tool.name to see exactly which MCP server changed the trajectory. The same trace shape works whether the client is Claude Desktop, OpenAI Agents SDK, LangGraph, or a custom Strands agent because traceAI instruments the protocol.
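As an illustration of filtering by tool.name, the sketch below groups MCP tool spans per tool and records where in the trajectory each call happened. It assumes spans are exported as dictionaries; the `latency_ms` key is a hypothetical name for the span duration.

```python
from collections import defaultdict


def summarize_tool_calls(spans):
    """Group tool spans by tool.name, in trajectory order."""
    summary = defaultdict(lambda: {"calls": 0, "total_latency_ms": 0.0, "steps": []})
    for span in sorted(spans, key=lambda s: s["agent.trajectory.step"]):
        entry = summary[span["tool.name"]]
        entry["calls"] += 1
        entry["total_latency_ms"] += span["latency_ms"]
        entry["steps"].append(span["agent.trajectory.step"])
    return dict(summary)


spans = [
    {"tool.name": "lookup_invoice", "agent.trajectory.step": 3, "latency_ms": 80},
    {"tool.name": "lookup_invoice", "agent.trajectory.step": 1, "latency_ms": 120},
    {"tool.name": "read_contract", "agent.trajectory.step": 2, "latency_ms": 300},
]
summary = summarize_tool_calls(spans)
```

A repeated tool at non-adjacent steps, as with `lookup_invoice` here, is often the first hint that the agent looped back after an unhelpful observation.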
Voice agent traces are their own dialect
A voice agent trace includes STT spans (audio → text), LLM spans (text reasoning), TTS spans (text → audio), and turn-taking spans. The traceAI-livekit and traceAI-pipecat integrations capture all four with ASRAccuracy, TTSAccuracy, ConversationCoherence, and CustomerAgentConversationQuality evaluators ready to attach. A simulated voice run through LiveKitEngine from the simulate-sdk emits the same span shape as production, so a regression caught in simulation is reproducible with an identical trace structure in production.
What “ready for an incident” looks like
A team is ready for an incident when, given a user complaint, an engineer can: open the platform, find the trace by user.id and time window, read the trace tree top-to-bottom in under 60 seconds, see model name, prompt version, all retrieved chunks, all tool calls, all evaluator verdicts, total cost, and per-span duration. If any of those pieces is missing, the trace is not production-ready. In our 2026 evals, the teams that hit that bar resolve customer-reported regressions 4–8x faster than teams whose trace data lives in three places.
From traces to datasets to regression tests
The fastest path from production to safer releases is trace → dataset → regression eval. FutureAGI lets engineers select a trace, mark it as a golden dataset candidate, attach the expected answer or expected tool sequence, and run the evaluator suite against it on every release candidate. Traces become the source of truth for what production actually looks like: not synthetic prompts, not benchmark questions, but real user turns. The 2026 pattern we’ve found most effective: sample 2–5% of production traces, route them through an LLM-as-a-judge for triage, promote the validated ones into versioned datasets, and gate every deploy against the resulting suite.
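The trace → dataset → regression loop can be sketched in two small functions. This is not the FutureAGI dataset API: the trace fields, `trace_to_case`, and `passes_gate` are hypothetical, and the exact-match check stands in for a real evaluator.

```python
import hashlib
import json


def trace_to_case(trace, expected_answer):
    """Turn one reviewed production trace into a regression case."""
    case = {
        "input": trace["input"],
        "context": trace["retrieved_chunks"],
        "expected_answer": expected_answer,
        "prompt_version": trace["prompt.version"],
        "source_trace_id": trace["trace_id"],
    }
    # A content hash gives a stable id, so re-promoting the same trace is idempotent.
    digest = hashlib.sha256(json.dumps(case, sort_keys=True).encode("utf-8")).hexdigest()
    case["case_id"] = digest[:12]
    return case


def passes_gate(cases, run_candidate, is_correct, min_pass_rate=1.0):
    """Run the release candidate on every case and compare the pass rate to the gate."""
    if not cases:
        return False  # an empty suite should not approve a deploy
    passed = sum(
        1 for c in cases
        if is_correct(run_candidate(c["input"], c["context"]), c["expected_answer"])
    )
    return passed / len(cases) >= min_pass_rate


trace = {
    "trace_id": "t-1",
    "input": "Can I get a refund after 45 days?",
    "retrieved_chunks": ["Refunds are available within 30 days."],
    "prompt.version": "v2.3.1",
}
case = trace_to_case(trace, "No, refunds are limited to 30 days.")
```

Keeping `source_trace_id` on the case means a failing regression test can always be traced back to the production turn that motivated it.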
Span attributes you should add beyond the OTel defaults
The gen_ai.* semantic conventions cover the model side. Real production traces need additional tags. Add prompt.version to attribute regressions to a specific rollout. Add feature.id to slice by product area. Add user.id and tenant.id for multi-tenant cost analysis. Add experiment.id when running an A/B test. Add cohort.name for downstream filters (refund, billing, healthcare). Add evaluator.score.threshold when a span fails an eval so the alert can include the breached value. These attributes are inexpensive (a dozen extra strings per span), but they turn a generic trace tree into a debuggable production object. FutureAGI exposes them as first-class fields in dashboards and supports custom attribute schemas per project.
Cost breakdowns per trace
In 2026 the most common cost-attribution question is “which user trace cost more than we charged?” The answer requires tracing every span with gen_ai.cost.total plus a stable user.id and feature.id. Once those three are in place, FutureAGI’s dashboards roll cost per trace by user, feature, prompt version, and model. We’ve found that a single chart (token-cost-per-successful-trace, sliced by prompt version) catches the majority of cost regressions in the same week they ship. Without per-trace cost attribution, a 30% token-spend increase looks like infrastructure noise; with it, the offending prompt version is obvious.
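A minimal sketch of the cost-per-successful-trace rollup, sliced by prompt version. It assumes span costs have already been summed into one gen_ai.cost.total per trace, that each trace carries a hypothetical `success` flag, and that the metric is defined as total spend (failed traces included) divided by the number of successful traces. Other definitions are possible.

```python
from collections import defaultdict


def cost_per_successful_trace(traces):
    """Total spend per prompt version divided by that version's successful traces."""
    spend = defaultdict(float)
    successes = defaultdict(int)
    for trace in traces:
        version = trace["prompt.version"]
        spend[version] += trace["gen_ai.cost.total"]  # failed traces still cost money
        if trace["success"]:
            successes[version] += 1
    return {
        version: (spend[version] / successes[version] if successes[version] else float("inf"))
        for version in spend
    }


traces = [
    {"prompt.version": "v2.3.0", "gen_ai.cost.total": 0.25, "success": True},
    {"prompt.version": "v2.3.0", "gen_ai.cost.total": 0.25, "success": True},
    {"prompt.version": "v2.3.1", "gen_ai.cost.total": 0.5, "success": True},
    {"prompt.version": "v2.3.1", "gen_ai.cost.total": 0.25, "success": False},
]
by_version = cost_per_successful_trace(traces)
```

In this toy data v2.3.1 costs three times as much per successful trace as v2.3.0, partly because one of its traces spent tokens and still failed.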
Cross-environment trace continuity
Tracing in 2026 spans more environments than production alone. The same traceAI instrumentation should run in local development, in CI regression evals, in simulate-sdk scenario runs, and in production. When all four emit traces in the same shape, an engineer can replay a failing production trace through staging with a one-line change. The simulate surface (Persona, Scenario, CloudEngine, LiveKitEngine) ships with traceAI integration so simulated runs produce comparable traces: failure modes caught in simulation reproduce against the same span structure in production. This is one of the practical reasons we built simulate, evaluate, and trace as one stack: when the work is split across three vendors, the friction of context-switching loses the trace shape and the workflow falls apart.
Common mistakes (May 2026 edition)
- Tracing only the final LLM call. Retrievals, tools, guardrails, and fallbacks are where many failures start; missing spans make the trace misleading.
- Losing context across async tools. If OpenTelemetry context is not propagated, child spans orphan and agent causality disappears. Use traceAI’s async-aware instrumentation, not a hand-rolled wrapper.
- Storing prompt text without redaction. Traces often contain PII, secrets, or customer data; redact before long-term retention using PII or ProtectFlash pre-storage.
- Sampling away failures. Uniform low sampling drops rare bad traces. Keep all errors, all eval failures, and all high-cost outliers; sample healthy traces.
- Using traces without evals. Timing explains slowness; trace-attached evaluators explain whether the answer was acceptable. Tracing alone tells you that a regression happened, not what regressed.
- Skipping prompt.version tags. A/B regression attribution becomes guesswork without them.
- Building a custom OTel pipeline instead of using semantic conventions. Standardize on gen_ai.*. Custom attribute names lock you to one backend.
- Ignoring W3C trace context across MCP and A2A boundaries. Tool calls and inter-agent calls cross network boundaries; without context propagation the trace fragments.
- Treating LangSmith’s trace tree as the ceiling. LangSmith covers LangChain well, but agents in 2026 cross frameworks; framework-locked tracing leaves blind spots.
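For the redaction mistake in the list above, here is a deliberately crude sketch of scrubbing span attributes before long-term storage. It is a regex stand-in, not the PII or ProtectFlash evaluators, and it only catches email addresses and card-like digit runs. The `input.value` attribute name in the example is hypothetical.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact_text(text):
    """Replace email addresses and card-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return CARD_RE.sub("[CARD]", text)


def redact_attributes(attributes):
    """Redact every string attribute; leave numeric attributes untouched."""
    return {
        key: redact_text(value) if isinstance(value, str) else value
        for key, value in attributes.items()
    }


redacted = redact_text("Refund card 4111 1111 1111 1111 for jane.doe@example.com")
```

Pattern matching like this misses names, addresses, and free-form secrets, so it is best treated as a floor under a dedicated detector rather than a replacement for one.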
Frequently Asked Questions
What is LLM tracing?
LLM tracing records an LLM or agent request as a structured trace of spans covering model calls, prompts, responses, retrievals, tools, tokens, cost, latency, and errors.
How is LLM tracing different from LLM observability?
LLM tracing is request-level lineage: what happened inside one run. LLM observability is broader and also includes dashboards, alerting, drift monitoring, evaluation trends, cost attribution, and production feedback.
How do you measure LLM tracing?
Instrument with traceAI integrations such as traceAI-langchain, then track gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, fi.span.kind, and trace-attached HallucinationScore results.