Best LLM Tracing Tools in 2026: 7 Span-Tree Platforms

FutureAGI traceAI, Phoenix, Langfuse, Helicone, Datadog, OpenLLMetry, and OpenLIT compared on span semantics, OTel adherence, and waterfall depth in 2026.

11 min read
llm-tracing opentelemetry openinference phoenix langfuse span-tree observability 2026
[Cover image: bold white "LLM TRACING TOOLS 2026" headline beside a wireframe span tree on a black starfield grid.]

LLM tracing in 2026 means OpenTelemetry-shaped span trees where every chain, retrieval, tool call, and judge score lives on a node and rolls up into a parent invocation. The seven tools below span instrumentation libraries, dedicated platforms, and APM-integrated paths. The differences that matter are span semantics (OpenInference adherence), waterfall depth, sampling policy, and how the platform handles 10K+ spans per second. This guide is the honest shortlist; see Best LLM Monitoring Tools, Best AI Agent Debugging Tools, and What is LLM Tracing for the related cuts.

TL;DR: Best LLM tracing tool per use case

| Use case | Best pick | Why (one phrase) | Pricing | OSS |
| --- | --- | --- | --- | --- |
| OTel-native tracing + span-attached evals + guardrails + gateway in one runtime | FutureAGI traceAI | Unified eval, observe, simulate, gate, optimize loop | Free + usage from $2/GB | Apache 2.0 (traceAI + platform) |
| OpenInference reference + auto-instrumentation | Arize Phoenix | OTel-first reference impl | Phoenix free, AX Pro $50/mo | Elastic License 2.0 |
| Self-hosted span tree at scale with prompts | Langfuse | ClickHouse-backed scale | Hobby free, Core $29/mo | MIT core |
| Gateway-attached request tracing | Helicone | Lowest friction from base URL change | Hobby free, Pro $79/mo | Apache 2.0 |
| LLM spans inside Datadog APM | Datadog LLM Observability | Unified APM and LLM | Custom; from $31/host/mo APM | Closed |
| Drop-in OTel instrumentation, vendor-agnostic | OpenLLMetry | One-line instrumentation | Free | Apache 2.0 |
| LLM + GPU + infra in one OTel collector | OpenLIT | Unified LLM and infra telemetry | Free | Apache 2.0 |

If you only read one row: pick FutureAGI traceAI when OTel-native tracing must share a runtime with span-attached evals, guardrails, gateway, and simulation; pick Phoenix for the canonical OpenInference reference; pick Langfuse for full-platform OSS span trees.

What an LLM trace actually contains

A working LLM trace covers five attribute classes. Anything less and you ship blind to a real class of regressions:

  1. Identity. OTel span kind, trace ID, span ID, parent ID, service name, environment.
  2. Model and prompt. Model name, model version, prompt template ID, prompt rendered, system prompt, user prompt.
  3. Cost. Prompt tokens, completion tokens, total tokens, total cost USD.
  4. Result. Response, response tokens, latency, status, error class.
  5. Eval. Per-judge score, judge name, threshold, pass/fail, judge cost.

OpenInference standardized these attributes in 2024-2025. Using the canonical names keeps traces portable across Phoenix, Langfuse, FutureAGI, and Datadog.
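
For reference, here is a minimal sketch of a single LLM span carrying all five classes, written against the OpenTelemetry Python SDK. The llm.* and openinference.* keys follow the published OpenInference conventions; the eval.* key is illustrative, since judge-score attribute names vary by platform.

```python
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# 1. Identity: trace ID, span ID, and parent ID come from the SDK itself.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-agent")

with tracer.start_as_current_span("openai.chat") as span:
    span.set_attribute("openinference.span.kind", "LLM")
    # 2. Model and prompt
    span.set_attribute("llm.model_name", "gpt-4o-mini")
    span.set_attribute("input.value", "Summarize the incident report.")
    # 3. Cost
    span.set_attribute("llm.token_count.prompt", 412)
    span.set_attribute("llm.token_count.completion", 88)
    span.set_attribute("llm.token_count.total", 500)
    # 4. Result
    span.set_attribute("output.value", "Three services degraded after the 09:40 deploy.")
    # 5. Eval (attribute name illustrative; platforms attach judge scores differently)
    span.set_attribute("eval.faithfulness.score", 0.92)
```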

The 7 LLM tracing tools compared

1. FutureAGI traceAI: The leading OTel-native tracing platform with span-attached evals

Apache 2.0 for traceAI and the platform. Self-hostable. Hosted cloud option.

FutureAGI is the leading LLM tracing platform when OTel-native span trees must share a runtime with eval scoring, guardrails, gateway routing, and simulation. traceAI is a one-line OTel instrumentation library that auto-instruments LangChain, LlamaIndex, OpenAI, Anthropic, Bedrock, and 35+ other frameworks across Python, TypeScript, Java, and C#. The FutureAGI platform attaches Turing eval model scores directly to the spans, runs the Agent Command Center at /platform/monitor/command-center for live span-attached guardrails, and supplies 50+ eval metrics, 18+ guardrails, a BYOK gateway across 100+ providers, and 6 prompt-optimization algorithms in the same plane.

Use case: Teams running RAG agents, voice agents, and copilots where the production failure must replay in pre-prod with the same scorer contract, and where tracing, eval, gating, and routing must live in one stack rather than five.

Pricing: Free plus usage from $2/GB storage, $10 per 1,000 AI credits, $5 per 100,000 gateway requests. Boost $250/mo, Scale $750/mo (HIPAA), Enterprise from $2,000/mo (SOC 2).

OSS status: Apache 2.0 for traceAI. Apache 2.0 for the platform repo. Stricter than Phoenix’s Elastic License 2.0 on commercial reuse.

Performance: turing_flash runs span-attached guardrail screening at 50-70ms p95 and full eval templates at roughly 1-2s.

Best for: Teams who want OpenInference-shaped span trees plus span-attached judge scores, guardrails, and gateway analytics in one runtime, hosted or self-hosted.

Worth flagging: Phoenix’s auto-instrumentation catalog is the longer-published reference; FutureAGI traceAI ships the same OpenInference conventions across 35+ frameworks and pairs them with the broader platform that Phoenix does not include.

2. Arize Phoenix: Best for OpenInference reference and auto-instrumentation

Source available. Self-hostable. Phoenix Cloud and Arize AX paths exist.

Use case: Teams that want OpenInference adherence and the canonical OTel reference implementation. Phoenix accepts traces over OTLP and auto-instruments LlamaIndex, LangChain, DSPy, Mastra, Vercel AI SDK, OpenAI, Bedrock, Anthropic, and 12+ others across Python, TypeScript, and Java.
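
A minimal sketch of pointing OpenAI spans at a local Phoenix instance, assuming the arize-phoenix-otel and openinference-instrumentation-openai packages; check the current docs for exact names before relying on them:

```python
# pip install arize-phoenix-otel openinference-instrumentation-openai openai
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Wire an OTel tracer provider to the Phoenix collector (localhost:6006 by default).
tracer_provider = register(project_name="rag-agent")

# Every OpenAI client call now emits an OpenInference-shaped LLM span.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```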

Pricing: Phoenix free for self-hosting. AX Free SaaS includes 25K spans/month, 1 GB ingestion, 15 days retention. AX Pro is $50/mo with 50K spans, 30 days retention.

OSS status: Elastic License 2.0. Source available, with restrictions on offering as a managed service.

Best for: Engineers who care about open instrumentation standards and want a path from local Phoenix into the broader Arize AX product without rewriting traces.

Worth flagging: Phoenix is not a gateway, not a guardrail product, and not a simulator. The ELv2 license matters for legal teams that follow OSI definitions strictly. See Phoenix Alternatives.

3. Langfuse: Best for self-hosted span trees at scale with prompts

Open source core. Self-hostable. Hosted cloud option.

Use case: Self-hosted production tracing with prompt versioning, dataset-driven evals, and human annotation. ClickHouse-backed span storage handles 10K+ spans/sec on tuned infrastructure.
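
A minimal sketch using the Langfuse Python SDK's observe decorator to build the span tree; note the import path differs between SDK generations (langfuse.decorators in v2, top-level langfuse in v3):

```python
# pip install langfuse  (v3-style import; set LANGFUSE_PUBLIC_KEY / SECRET_KEY / HOST)
from langfuse import observe

@observe()
def retrieve(query: str) -> list[str]:
    # Child span: the retrieval step.
    return ["chunk-1", "chunk-2"]

@observe()
def answer(query: str) -> str:
    # Root span: the invocation; retrieve() nests underneath it.
    chunks = retrieve(query)
    return f"Answer grounded in {len(chunks)} chunks"

answer("What changed in the Q3 release?")
```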

Pricing: Hobby free with 50K units per month, 30 days data access, 2 users. Core $29/mo with 100K units. Pro $199/mo. Enterprise $2,499/mo.

OSS status: MIT core. Enterprise directories handled separately.

Best for: Platform teams that want trace data in their own infrastructure with first-party prompts, datasets, and annotation queues.

Worth flagging: OTel ingestion uses Langfuse’s own schema layered over OTel; full OpenInference adherence requires translation. See Langfuse Alternatives.

4. Helicone: Best for gateway-attached request tracing

Apache 2.0. Self-hostable. Hosted cloud option.

Use case: Teams that want zero-code instrumentation by switching the OpenAI base URL to Helicone’s gateway. Every request becomes a span without any SDK change.
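
The integration really is a base URL swap plus an auth header; a minimal sketch with the OpenAI Python client (environment variable names assumed):

```python
import os
from openai import OpenAI

# Route every request through Helicone's gateway; each call becomes a logged trace.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
```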

Pricing: Hobby free with 10K logs/mo. Pro $79/mo with 100K logs. Team and Enterprise tiers add SSO and on-prem.

OSS status: Apache 2.0. 4K+ stars.

Best for: Teams that want the lowest friction from cold-start to production traces, especially when the LLM client lives in a language without a maintained native instrumentation library.

Worth flagging: Roadmap risk after the March 2026 Mintlify acquisition; the platform remains usable but new feature velocity slowed. Span depth is shallower than Phoenix or Langfuse for multi-step agents. See Helicone Alternatives.

5. Datadog LLM Observability: Best when Datadog is the standard

Closed platform. SaaS only.

Use case: Teams that already run Datadog APM and want LLM spans correlated with infra metrics, traces, and logs in one product.
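
A minimal sketch of the Datadog-SDK path, assuming the ddtrace LLM Observability module; decorator names and enable() parameters can shift between ddtrace versions:

```python
# pip install ddtrace  (assumes DD_API_KEY and DD_SITE are set in the environment)
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import workflow

LLMObs.enable(ml_app="support-copilot")

@workflow
def handle_ticket(text: str) -> str:
    # Spans from this function land in LLM Observability,
    # correlated with the host's existing APM traces.
    return "routed"
```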

Pricing: Custom; from $31/host/mo APM plus LLM Observability add-on. Per-span ingest and per-log indexing add up at scale.

OSS status: Closed.

Best for: Engineering organizations standardized on Datadog where infra correlation matters more than open instrumentation.

Worth flagging: Datadog at scale crosses into five-figure monthly contracts. Eval depth is shallower than dedicated LLM platforms; pair with Phoenix or FutureAGI for richer scoring. Datadog accepts OTel but the path of least resistance is the Datadog SDK.

6. OpenLLMetry (Traceloop): Best for vendor-agnostic OTel instrumentation

Apache 2.0. Library only.

Use case: Teams that already operate Tempo, Jaeger, Honeycomb, or Datadog and want LLM spans in their existing OTel backend. pip install traceloop-sdk and a Traceloop.init() call.
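
That two-line setup, sketched with the exporter pointed at an existing collector rather than Traceloop's hosted backend (endpoint value assumed):

```python
# pip install traceloop-sdk
from traceloop.sdk import Traceloop

# Emit LLM spans to your own OTLP endpoint (Tempo, Jaeger, Honeycomb, Datadog agent).
Traceloop.init(
    app_name="rag-agent",
    api_endpoint="http://localhost:4318",  # your OTLP HTTP collector
)
```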

Pricing: Free for the OSS library. Traceloop Pro $79/mo for the hosted backend.

OSS status: Apache 2.0. 4K+ stars.

Best for: Teams whose tracing standard is “OTel into our existing collector” and who do not want a separate platform.

Worth flagging: No standalone OSS dashboard. Pair with a backend that understands LLM-specific UI (Phoenix, Langfuse, FutureAGI).

7. OpenLIT: Best for LLM + GPU + infra in one OTel collector

Apache 2.0. Library + optional UI.

Use case: Teams that want LLM spans alongside GPU and infra metrics under a single OTel collector. OpenLIT ships NVIDIA exporters plus instrumentation for LLM frameworks and vector DBs.
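
A minimal sketch, assuming openlit's init entry point; the GPU flag name may differ between releases, so verify against the current docs:

```python
# pip install openlit
import openlit

# One collector for LLM spans and GPU/infra metrics alike.
openlit.init(
    otlp_endpoint="http://localhost:4318",
    collect_gpu_stats=True,  # flag name assumed; check the current OpenLIT docs
)
```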

Pricing: Free for the library and the optional ClickHouse-backed UI.

OSS status: Apache 2.0. 1.5K+ stars.

Best for: Teams running self-hosted GPU clusters who want unified LLM + infra telemetry without two collectors.

Worth flagging: Smaller community than Phoenix or Langfuse. Eval and prompt management are not first-class.

[Product showcase: span-tree waterfall (agent.run, router.classify, retriever.search, openai.chat, validator.run FAIL, formatter.compose), span attribute list (model, token counts, latency_ms, cost_usd, otel.span_kind), OpenInference SDK coverage grid, and trace search with filter expression.]

Decision framework: pick by constraint

  • OpenInference adherence non-negotiable: FutureAGI traceAI, Phoenix.
  • Self-hosting required: FutureAGI, Langfuse, Phoenix, Helicone.
  • Already on Datadog: FutureAGI traceAI as the OTel-native consolidator alongside Datadog LLM Observability for infra correlation.
  • Existing Tempo or Jaeger: FutureAGI traceAI, OpenLLMetry, OpenLIT emit into your collector.
  • GPU + LLM in one collector: OpenLIT.
  • Span-attached evals: FutureAGI, Phoenix (with Phoenix evals), Langfuse (with custom scorers).
  • Lowest friction to first trace: FutureAGI traceAI, OpenLLMetry, Helicone (gateway).
  • Prompt diff and judge dashboards in the trace UI: FutureAGI, Phoenix, Langfuse lead.

Common mistakes when picking a tracing tool

  • Confusing tracing with monitoring. Tracing captures the span tree. Monitoring watches the trends. Most teams need both. See Best LLM Monitoring Tools.
  • Picking on demo videos. Demos use clean span schemas with idealized attribute payloads. Run a load test on your real schema with your real payload sizes.
  • Skipping OpenInference adherence. A non-OpenInference tracer locks the team into one backend. Pick OTel + OpenInference when the long-term plan is portability.
  • Head-based sampling alone. A flat head-based sample rate drops the rare failure that caused the user complaint. Use tail-based sampling on errors, high-cost traces, and below-threshold eval scores; a sketch of that decision follows this list.
  • Ignoring storage cost. ClickHouse retention dominates the bill. 90 days at 10M traces is 200 GB to 2 TB depending on payload size.
  • Treating ELv2 as open source. Phoenix is source available, not OSI open source.
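
The tail-based decision from the sampling bullet, sketched as a backend-side filter over finished traces (field names and thresholds illustrative):

```python
import hashlib

def keep_trace(trace: dict) -> bool:
    """Tail-based sampling: decide only after the whole trace has finished."""
    if trace["status"] == "error":
        return True  # retain 100% of failures
    if trace["cost_usd"] > 0.50:
        return True  # retain high-cost traces
    if trace.get("min_eval_score", 1.0) < 0.7:
        return True  # retain anything a judge scored below threshold
    # Deterministic ~5% sample of healthy traffic, keyed on trace ID.
    digest = int(hashlib.sha1(trace["trace_id"].encode()).hexdigest(), 16)
    return digest % 100 < 5
```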

What changed in LLM tracing in 2026

| Date | Event | Why it matters |
| --- | --- | --- |
| May 2026 | Braintrust added Java auto-instrumentation | Java, Spring AI, and LangChain4j teams can trace with less manual code. |
| Mar 9, 2026 | FutureAGI shipped Agent Command Center and ClickHouse trace storage | High-volume trace analytics moved into the same plane as evals and gateway. |
| Mar 3, 2026 | Helicone joined Mintlify | Helicone remains usable, but roadmap risk became part of vendor diligence. |
| Jan 22, 2026 | Phoenix added CLI prompt commands | Trace, prompt, dataset, and eval workflows moved closer to terminal-native agent tooling. |
| 2025-2026 | OpenInference v1 conventions stabilized | Cross-platform span schema reduces vendor lock-in. |
| 2025 | OTel collector LLM-aware processors gained adoption | Tail-based sampling on judge scores became practical. |

How to actually evaluate this for production

  1. Run a domain reproduction. Instrument the same workload with two candidates. Compare span fidelity, eval-attribute coverage, and storage cost on your real payload.

  2. Test sampling policy. Verify that errors, high-cost traces, and below-threshold scores are retained at 100%. Drop the rest at the rate the budget allows.

  3. Cost-adjust. Real cost equals platform price plus span volume, payload size, retention days, judge score volume, and the SRE hours to operate the storage layer. A back-of-envelope sketch follows.
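
A back-of-envelope sketch for step 3, using the $2/GB figure from the pricing table above; every other number is an assumption to replace with your own measurements:

```python
# Illustrative inputs; substitute your measured values.
traces_retained = 10_000_000   # traces kept across a 90-day window
avg_payload_kb = 20            # 20 KB/trace -> 200 GB; 200 KB/trace -> 2 TB
judge_rate, judge_usd = 0.05, 0.001   # judge 5% of traces at $0.001 each

storage_gb = traces_retained * avg_payload_kb / 1_000_000
storage_usd = storage_gb * 2.0        # $2/GB usage pricing
judges_usd = traces_retained * judge_rate * judge_usd

print(f"{storage_gb:,.0f} GB stored -> ${storage_usd:,.0f} storage + ${judges_usd:,.0f} judge spend")
```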

How FutureAGI implements LLM tracing

FutureAGI is the production-grade LLM tracing platform built around the closed reliability loop that other tracing picks stitch together by hand. The full stack runs on one Apache 2.0 self-hostable plane:

  • Tracing: traceAI (Apache 2.0) ships the broadest cross-language coverage in 2026 across Python, TypeScript, Java (LangChain4j and Spring AI), and a C# core, with auto-instrumentation for 35+ frameworks and OpenInference-shaped spans into ClickHouse-backed storage.
  • Evals: 50+ first-party metrics attach as span attributes; BYOK lets any LLM serve as the judge at zero platform fee, and turing_flash runs the same rubrics at 50-70 ms p95.
  • Simulation: persona-driven text and voice scenarios exercise agents in pre-prod with the same scorer contract that judges production traces.
  • Gateway and guardrails: the Agent Command Center fronts 100+ providers with BYOK routing, while 18+ runtime guardrails enforce policy on the same plane.

Beyond the four axes, FutureAGI also ships six prompt-optimization algorithms that consume failing trajectories as training data. Pricing starts free with a 50 GB tracing tier; Boost is $250 per month, Scale is $750 per month with HIPAA, and Enterprise from $2,000 per month with SOC 2 Type II.

Most teams comparing tracing tools end up running three or four products in production: one for traces, one for evals, one for the gateway, one for guardrails. FutureAGI is the recommended pick because tracing, evals, simulation, gateway, and guardrails all live on one self-hostable runtime; the loop closes without stitching.

Read next: Best LLM Monitoring Tools, Best AI Agent Debugging Tools, Best AI Agent Observability Tools, What is LLM Tracing

Frequently asked questions

What are the best LLM tracing tools in 2026?
The shortlist is FutureAGI traceAI, Arize Phoenix, Langfuse, Helicone, Datadog LLM Observability, OpenLLMetry, and OpenLIT. FutureAGI traceAI and Phoenix lead on OpenTelemetry and OpenInference adherence; FutureAGI also pairs traces with span-attached evals, gateway, and guardrails on one runtime. Langfuse and Datadog lead on dashboard depth. Helicone leads on gateway-attached tracing. OpenLLMetry and OpenLIT are instrumentation libraries that emit spans into any OTel collector. The right pick depends on whether the constraint is unified runtime, OTel, dashboard, or gateway.
How is LLM tracing different from generic distributed tracing?
LLM traces add span attributes that classical APM does not understand: prompt and response payloads, token counts, model name, judge scores, retrieved chunks, and tool-call arguments. OpenInference standardized these conventions in 2024-2025. A generic tracer (Tempo, Jaeger, Datadog APM) shows the span tree but cannot render prompt diffs, judge score heatmaps, or chunk attribution without an LLM-aware UI on top.
Should I use OpenTelemetry-native tracing or a vendor-specific SDK?
OpenTelemetry-native is the safer long-term bet. The wire format is standardized, instrumentation libraries (OpenLLMetry, OpenLIT, FutureAGI traceAI) emit spans into any OTel-compatible backend, and you avoid vendor lock-in on the trace store. Vendor-specific SDKs (LangSmith, native Helicone) give richer first-party UX in their own platform but harder migration. Most 2026 teams pick OpenInference-native + a backend that understands the conventions.
What span attributes should every LLM trace carry?
At minimum: model name and version, prompt template ID, prompt rendered, response, prompt tokens, completion tokens, total cost, latency, OTel span kind LLM, and any retrieved chunk IDs. For agents add tool name, tool arguments, tool result, parent span ID. For RAG add retriever name, query, top-k, chunk scores. OpenInference defines the canonical attribute names; using them keeps traces portable across Phoenix, Langfuse, FutureAGI, and Datadog.
How do I sample LLM traces without losing failure visibility?
Sample 100% of failures, errors, and high-cost traces. Sample a fixed percentage (1-10%) of successful traces. Always retain spans that triggered an eval below threshold. Most platforms support tail-based sampling on the OTel collector or in the backend (Phoenix, Langfuse, FutureAGI). Avoid head-based percentage sampling alone; you will lose the rare failure that caused the user complaint.
How does pricing compare across LLM tracing tools in 2026?
Phoenix self-host is free; Arize AX Pro is $50 per month. Langfuse Hobby is free; Core is $29 per month flat. Helicone Hobby is free; Pro is $79 per month. Datadog LLM Observability has a Free plan up to 40K LLM spans per month and Pro from around $160 per month (billed annually) for 100K spans, with overage and retention add-ons priced separately. FutureAGI is free plus usage from $2/GB. OpenLLMetry and OpenLIT are free libraries; the cost is your downstream backend.
Which tool handles 10K+ spans per second sustained?
Langfuse on tuned ClickHouse, Phoenix via Arize AX cloud, FutureAGI with ClickHouse trace storage, and Datadog all run on scalable storage backends that teams have pushed past 10K spans per second in their own environments. Helicone scales to 1K+ requests/sec on standard ClickHouse, higher with tuning. OpenLLMetry and OpenLIT are bounded by your downstream collector and storage. Run a load test with your real span schema; vendor numbers understate payload size.
Can I use my existing APM (Datadog, Grafana) for LLM tracing?
Yes for the wire format, partially for the UX. Tempo, Jaeger, Datadog APM accept OTel spans; OpenLLMetry, OpenLIT, FutureAGI traceAI emit them. The gap is the LLM-specific UI: prompt diffing, eval-attached scores, chunk attribution, and judge dashboards do not render natively in classical APM. Most teams pair APM with a dedicated LLM tracing tool, or pick Datadog LLM Observability for unified APM + LLM.