Best LLM Tracing Tools in 2026: 6 Honest Picks
Best LLM tracing tools 2026 compared: Future AGI traceAI, Phoenix, Langfuse, OpenLLMetry, Helicone, Datadog. OTel discipline + auto-instrumentation.
Table of Contents
LLM tracing in 2026 is OTel-or-bust. The span tree, not the chat log, is the unit: every chain step, retrieval, tool call, guardrail decision, and judge score lives on a node with OpenInference-standard attributes and ships into a backend you can swap. Tools that invented their own span format last cycle are bridging back to OpenTelemetry; tools that started on OTel keep the wire format portable. The six picks below are the honest shortlist. What separates them is auto-instrumentation breadth, span enrichment depth, OTel discipline, and behavior at 10K+ spans/sec sustained. Last updated May 20, 2026.
TL;DR: best LLM tracing tool per use case
| Use case | Best pick | Why | Pricing | License |
|---|---|---|---|---|
| OpenInference-first tracing + evals + guardrails + gateway on one runtime | Future AGI traceAI | 50+ surfaces across 4 langs, span-attached evals, Agent Command Center | Free + usage | Apache 2.0 |
| OpenInference reference + auto-instrumentation reach | Arize Phoenix | OTLP-first, canonical attribute names | Phoenix free, AX Pro $50/mo | ELv2 |
| Self-hosted span tree with prompts and datasets | Langfuse | OSS platform breadth; OTel bridge over its own format | Hobby free, Core $29/mo | MIT core |
| Drop-in OTel instrumentation into your existing collector | OpenLLMetry | Vendor-agnostic library, one-line init | Free library | Apache 2.0 |
| Lowest-friction first trace via gateway base URL change | Helicone | Zero SDK change for OpenAI-compatible clients | Hobby free, Pro $79/mo | Apache 2.0 |
| LLM spans next to APM, logs, and infra | Datadog LLM Observability | Unified APM + LLM correlation | APM $31/host + add-on | Closed |
One-row summary: pick Future AGI traceAI when OpenInference-shaped traces have to share a runtime with evals, guardrails, and gateway; Phoenix when OpenInference adherence is the entire requirement; Datadog only when Datadog already runs everything else.
Why OTel-or-bust is the right frame for 2026
LLM tracing went through three eras. SDK era: every vendor shipped a proprietary tracer with a proprietary schema. Bridge era: vendors added OTel exporters but kept their own attribute names. OpenInference era, where we are now: attribute names are standardized, the OTel wire format is the common substrate, and the trace becomes portable across backends.
Three properties separate the tools worth shortlisting from the ones that lock you in next quarter.
- Auto-instrumentation breadth. How many LLM SDKs, agent frameworks, vector DBs, and reranker libraries does the tracer pick up without manual span creation? OpenAI, Anthropic, Gemini, Bedrock, Cohere, LangChain, LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Pydantic AI, DSPy, Mastra, LlamaIndex, Haystack, Pinecone, Weaviate, Qdrant, Chroma. The gap between the top two and the rest is large.
- Span enrichment depth. Does the span carry OpenInference span kinds, prompt template IDs, rendered prompts, token splits, chunk scores, judge scores, and policy verdicts, or just
modelandlatency? - OTel discipline. Does the tool emit OpenInference-shaped OTel spans natively, or ship its own format and translate at the edge? Native is portable; translation drops attributes when conventions move.
A tracer that satisfies (1) is a logger with a tree. (1) and (2) is LLM observability. (1) through (3) is what you want anchoring your stack in 2026.
What an LLM trace actually contains
A working LLM trace covers five attribute classes: identity (OTel span kind, trace + span + parent IDs, service, environment); model and prompt (model name and version, template ID, rendered prompt, message list, output schema); cost (prompt + completion + total tokens, total cost USD, price snapshot); result (response, latency, status, error class, retry count); and eval + policy (judge score, judge name, threshold, pass/fail, guardrail decision, policy ID). OpenInference standardized these names in 2024-2025; use them and your trace is portable, skip them and you pay for it on migration day.
The 6 LLM tracing tools, compared
1. Future AGI traceAI: best for OpenInference-first tracing with evals, guardrails, and gateway on one runtime
Apache 2.0 traceAI. Apache 2.0 platform. Self-hostable. Hosted cloud.
Quick take. traceAI is the broadest OpenInference-shaped instrumentation surface in 2026: 50+ AI surfaces across Python, TypeScript, Java (LangChain4j, Spring AI), and a C# core, with one-line register() setup and zero manual span creation. The platform attaches Turing eval scores directly to spans, runs the Agent Command Center gateway across 100+ providers with 18+ runtime guardrails on the same trace stream, and ships 50+ EvalTemplate classes plus 20+ local heuristic metrics.
Ideal for. Teams running RAG agents, voice agents, or copilots where a production failure must replay in pre-prod with the same scorer contract, and tracing, eval, guardrails, gateway, and prompt optimization need to live on one Apache 2.0 self-hostable plane.
Key strengths.
- 50+ AI surfaces auto-instrumented across 4 languages. 14 OpenInference span kinds (LLM, CHAIN, RETRIEVER, TOOL, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR, others); Phoenix ships 8, Langfuse 5.
- Span-attached evals: 50+ metrics as pytest CI scorers and online scorers. Lower per-eval cost than Galileo Luna-2 on the published rubrics.
- Built-in PII redaction at the span attribute layer before export; the trace never leaves your network with raw secrets in it.
- Agent Command Center gateway on the same plane: 100+ providers, BYOK routing, fallback, caching, 18+ runtime guardrails. Benchmarked at ~29k req/s, P99 21 ms with guardrails on, on
t3.xlarge. - SOC 2 Type II + HIPAA + GDPR + CCPA per futureagi.com/trust; ISO 27001 in active audit.
Honest limitations. More moving parts than a pure instrumentation library when all you want is a tracer. Self-host runs ClickHouse, Postgres, Redis, Temporal, and the gateway as real services; use hosted cloud if you don’t want to operate the data plane. Phoenix has a longer-published OpenInference reference repo even though traceAI ships the broader surface today.
Pricing. Free to start with generous limits; usage-based after that. Compliance add-ons (HIPAA BAA, SAML SSO + SCIM) layer per tier. Pricing is usage-based, not per-seat.
Expert verdict. Pick Future AGI traceAI when the trace has to anchor the broader reliability loop: span-attached evals, runtime guardrails, gateway routing, simulation, and prompt optimization on one self-hostable plane. The buying signal is teams that have already stitched a stack manually (Langfuse for traces, Braintrust for evals, a notebook for optimization, a separate gateway) and watched the handoffs lose fidelity between layers.
2. Arize Phoenix: best for OpenInference reference and auto-instrumentation reach
Source available under Elastic License 2.0. Self-hostable. Phoenix Cloud and Arize AX paths.
Quick take. Phoenix is the canonical OpenInference reference. Arize owns the spec, conventions land here first, and the auto-instrumentation packages are the longest-tested in production. OTLP-first ingestion, a clean local workbench (phoenix.launch_app() and you have a tracer), and a published path into Arize AX when scale or compliance demands it.
Ideal for. Engineers who care about open instrumentation standards, want a local workbench they can stand up in a notebook, and value the Arize AX upgrade path.
Key strengths.
- Canonical OpenInference reference; new attribute conventions ship here first.
- Auto-instrumentation across Python, TypeScript, and Java for LlamaIndex, LangChain, DSPy, Mastra, Vercel AI SDK, OpenAI Agents SDK, Bedrock, Anthropic, and 12+ others.
- Embedding-drift heritage from Arize’s pre-LLM observability era; retrieval-quality dashboards and chunk-level drift are mature.
phoenix.launch_app()boots a full tracer + UI in one command; the development inner loop is the best in the category.
Honest limitations. Not a gateway, not a guardrail product, not a simulator. Elastic License 2.0 is source available, not OSI open source — legal teams that follow OSI strictly will flag it. Trajectory metric library is smaller than Future AGI’s, and scoring lives in a parallel Phoenix eval surface rather than as a span-attached primitive.
Pricing. Phoenix free self-hosted. AX Free 25K spans/mo, 15 days retention. AX Pro $50/mo with 50K spans, 30 days. Enterprise custom with SOC 2, HIPAA, data residency.
Expert verdict. Pick Phoenix when OpenInference adherence and the Arize-AX upgrade path are the buying signals, and gateway, guardrails, simulation, and strict OSI open source live elsewhere. See Arize Phoenix Alternatives.
3. Langfuse: best for self-hosted span trees with prompts, datasets, and a mature UI
MIT core. Self-hostable. Hosted cloud option.
Quick take. Langfuse is the strongest OSS-first platform pick when self-hosted tracing with prompt versioning and dataset-driven evals is the requirement. Big community, mature self-hosted story, polished UI for traces, prompts, datasets, runs, and human annotation queues. The catch is the wire format: Langfuse ships its own span schema with an OTel ingestion bridge layered over it. Full OpenInference adherence requires translation.
Ideal for. Platform teams that operate the data plane themselves, want trace data in their own infrastructure, and pair Langfuse with external eval, guardrail, and gateway layers.
Key strengths.
- MIT core; mature architecture across Postgres, ClickHouse, Redis, queues. ClickHouse-backed span storage handles 10K+ spans/sec on tuned infrastructure.
- Prompt management with labels, environments, version diffs; datasets, runs, human annotation queues all first-class.
- Experiments CI/CD integration shipped May 2026.
- Big community, large GitHub footprint, broad integration coverage with LiteLLM proxy logging.
Honest limitations. OTel ingestion uses Langfuse’s own schema layered over OTel; full OpenInference adherence requires translation, and attribute drift between Langfuse and OpenInference is a real maintenance line. Trajectory metrics are not first-class: 5 span kinds vs Future AGI’s 14. Simulation, voice eval, prompt optimization, and runtime guardrails live in adjacent tools you wire in yourself. Enterprise directories ship under a separate commercial license outside the MIT core.
Pricing. Hobby free with 50K units/mo. Core $29/mo with 100K units, $8 per additional 100K. Pro $199/mo. Enterprise $2,499/mo. A “unit” covers a trace, observation, score, or eval, which is why production cost compounds faster than the flat price suggests.
Expert verdict. Pick Langfuse if OSS observability with prompts, datasets, and a mature UI is the entire requirement and you accept the Langfuse-shaped wire format as your substrate. Skip if OpenInference-native traces are non-negotiable or you need eval, guardrails, and gateway on the same runtime. See Langfuse Alternatives.
4. OpenLLMetry (Traceloop): best for drop-in OTel instrumentation into your existing collector
Apache 2.0 library. Optional Traceloop hosted backend.
Quick take. OpenLLMetry is the DIY pick. pip install traceloop-sdk, one Traceloop.init() call, and LLM-aware spans start flowing into whatever OTel collector you already operate (Tempo, Jaeger, Honeycomb, New Relic, Grafana, Datadog APM). No platform to learn, no second UI. The cost is that the library is just the emitter; the dashboard, eval, prompts, datasets, and annotation pieces are wherever you wire them.
Ideal for. Engineering teams whose tracing standard is already “OTel into our existing collector” and who don’t want a second observability platform. Common case: an org running Honeycomb or Grafana Tempo for service tracing that wants LLM spans in the same backend.
Key strengths.
- Apache 2.0, 4K+ GitHub stars, vendor-agnostic by design.
- One-line
Traceloop.init()instrumentation for OpenAI, Anthropic, Bedrock, Cohere, LangChain, LlamaIndex, Haystack, Pinecone, Weaviate, Chroma, and 30+ others. - Emits OpenTelemetry spans; the wire format is portable. No lock-in on the trace store.
Honest limitations. No first-party dashboard worth running in production. Pair with a backend that understands LLM-specific UI (Phoenix, Future AGI, Langfuse) or accept that classical APM will render the span tree without prompt diffs, judge heatmaps, or chunk attribution. Span attribute coverage is OpenInference-compatible but the framework catalog updates slower than traceAI’s.
Pricing. Free for the OSS library. Traceloop’s hosted backend (separate product) is $79/mo.
Expert verdict. Pick OpenLLMetry when the backend is already chosen and the requirement is “make my existing OTel stack LLM-aware”. Skip if you also need eval, guardrails, simulation, or a dashboard that renders prompt-shaped data out of the box.
5. Helicone: best for lowest-friction first trace via gateway base URL change
Apache 2.0. Self-hostable. Hosted cloud.
Quick take. Helicone is the gateway-first option. Change base_url on your OpenAI-compatible client to Helicone’s endpoint and every request becomes a logged span with no SDK swap. Zero instrumentation friction is the pitch, and for OpenAI-compatible workloads (most of them in 2026), it works.
Ideal for. Teams that want production traces today, especially when the LLM client lives in a language without a maintained native instrumentation library, or developer-velocity reasons rule out a code-level integration.
Key strengths.
- Lowest cold-start friction in the category: a
base_urlchange is the entire integration. - Sessions, request analytics, prompts, caching, and rate limits in one product.
- Apache 2.0, 4K+ GitHub stars, self-hostable. Good fit for serverless deployments where adding an SDK is awkward.
Honest limitations. Span depth is shallower than Phoenix or Langfuse for multi-step agents. Tool calls, retrievals, and sub-agent handoffs that don’t pass through the gateway aren’t traced unless you instrument them separately. Roadmap risk became part of vendor diligence after the March 2026 Mintlify acquisition; platform remains usable but feature velocity slowed. Native OpenInference adherence is partial.
Pricing. Hobby free with 10K logs/mo. Pro $79/mo with 100K logs. Team and Enterprise tiers add SSO and on-prem.
Expert verdict. Pick Helicone for a first trace this week on OpenAI-compatible workloads. Replace or pair with traceAI or OpenLLMetry once multi-step agents enter the picture. See Helicone Alternatives.
6. Datadog LLM Observability: best when Datadog is already the standard
Closed platform. SaaS with regional residency. APM-integrated.
Quick take. Datadog ships LLM Observability as an APM add-on. The pitch is one tool for everything: LLM spans correlated with APM traces, infrastructure metrics, logs, RUM, and security events. Right pick only when Datadog is already the system of record and unified APM-plus-LLM observability beats eval depth. Honeycomb and New Relic play the same role for teams on those backends; Datadog is the largest of the three on the LLM observability path.
Ideal for. Enterprise teams where Datadog is the system of record and SRE/platform engineers want LLM workloads on the same dashboard, alerting plane, and on-call rotation as service infrastructure.
Key strengths.
- LLM spans inside the same product as APM, logs, RUM, security, and infra metrics.
- Infrastructure correlation: LLM latency next to DB query latency next to downstream service latency.
- Mature enterprise security posture, regional residency, SOC 2, SRE workflows. Scales to high-volume span ingestion on Datadog’s existing backend.
Honest limitations. Eval surface is shallower than dedicated LLM platforms: no first-party simulator, fewer built-in metric primitives, no integrated guardrails. Cost scales fast with span volume; production contracts cross into five figures monthly without much effort. Path of least resistance is the Datadog SDK, not OTel, and the Datadog SDK is not OpenInference-shaped by default.
Pricing. APM at $31 per host per month with annual billing, plus LLM Observability add-on metered per ingested span and per indexed log. Expect contracts above $1,000/mo at modest scale; production teams enter five-figure monthly contracts quickly.
Expert verdict. Pick Datadog LLM Observability when Datadog is the standard and consolidation beats eval depth. Pair with Future AGI or Phoenix if eval and span-attached scoring become the bottleneck. Same logic for Honeycomb and New Relic teams: keep the backend, point traceAI or OpenLLMetry at it.

How the 6 tools score on the OTel-or-bust scorecard
| Capability | Future AGI | Phoenix | Langfuse | OpenLLMetry | Helicone | Datadog |
|---|---|---|---|---|---|---|
| Auto-instrumentation surfaces | 50+ / 4 langs | ~20 / 3 langs | ~15 / 2 langs | ~30 / 2 langs | Gateway-only | SDK + OTel ingest |
| OpenInference span kinds | 14 | 8 | 5 (own format) | OI-compatible | Partial | OTel + APM |
| Native OTel + OpenInference | Full | Full (reference) | OTel bridge over own | Full | Partial | OTel accepted, SDK preferred |
| Span-attached evals | Full (50+ metrics) | Partial | Partial | None | None | Partial |
| Built-in PII redaction | Yes | No (configurable) | No (configurable) | No | No | Partial |
| Self-host license | Apache 2.0 | ELv2 | MIT core | Apache 2.0 (lib) | Apache 2.0 | Closed |
| Gateway on same plane | Yes (100+) | No | No | No | Yes | No |
| Inline guardrails | Yes (18+) | No | No | No | No | No |
| Sustained 10K+ spans/sec | Yes | Yes (AX) | Yes (tuned) | Collector-bound | 1K+ rps | Yes |
Decision framework: pick by constraint
- OpenInference adherence non-negotiable: Future AGI traceAI, Phoenix, OpenLLMetry.
- Self-hosting under Apache 2.0: Future AGI, Helicone, OpenLLMetry. Under MIT or ELv2: Langfuse, Phoenix.
- Already on Honeycomb, New Relic, Tempo, or Jaeger: OpenLLMetry or Future AGI traceAI emit OpenInference-shaped spans into the collector you already run.
- Already on Datadog: Datadog LLM Observability for unified APM + LLM, or point traceAI / OpenLLMetry at the Datadog OTel intake when eval depth matters more than the native Datadog SDK ergonomics.
- Span-attached evals on the same trace: Future AGI (50+ metrics first-class), Phoenix (via Phoenix evals), Langfuse (via custom scorers).
- Lowest-friction first trace: Helicone (gateway base URL), then Future AGI traceAI or OpenLLMetry (one-line
register()/init()). - Gateway, guardrails, and tracing on the same runtime: Future AGI is the only pick here. Helicone is gateway-first without the guardrail and eval surface; Datadog is APM-first without a guardrail layer.
Common mistakes when picking a tracing tool
- Confusing tracing with monitoring. Tracing captures the span tree per request; monitoring watches trends across requests. Most teams need both.
- Picking on demo videos. Vendor demos use clean span schemas with idealized payloads. Load-test on your real schema with real payload sizes before signing anything.
- Head-based sampling alone. Drops the rare failure that caused the user complaint. Use tail-based sampling on errors, high-cost traces, and below-threshold eval scores.
- Ignoring storage cost. ClickHouse retention dominates the bill. 90 days at 10M traces is 200 GB to 2 TB depending on payload size.
- Treating ELv2 as open source. Phoenix is source available, not OSI open source. Verify with legal before procurement.
- Treating “OTel-compatible” and “OpenInference-shaped” as the same thing. Accepting OTel spans is table stakes; emitting OpenInference attributes is what makes the trace portable.
Recent LLM tracing updates
| Date | Event | Why it matters |
|---|---|---|
| May 2026 | Braintrust added Java auto-instrumentation | Java, Spring AI, LangChain4j teams can trace with less manual code. |
| Mar 9, 2026 | Future AGI shipped Agent Command Center and ClickHouse trace storage | High-volume trace analytics moved into the same plane as evals and gateway. |
| Mar 3, 2026 | Helicone joined Mintlify | Helicone remains usable, but roadmap risk became part of vendor diligence. |
| Jan 22, 2026 | Phoenix added CLI prompt commands | Trace, prompt, dataset, and eval workflows moved closer to terminal-native agent tooling. |
| 2025-2026 | OpenInference v1 conventions stabilized | Cross-platform span schema reduces vendor lock-in. |
How to evaluate this for production
- Run a domain reproduction. Export a slice of real traces (failures, long-tail prompts, tool calls, retrieval misses, hand-labeled outcomes) and instrument each candidate with your harness, OTel payload shape, prompt versions, and judge model. Don’t accept a demo dataset.
- Measure reliability under load. Track p50, p95, p99 ingestion, dropped spans, failed judge calls, retry count, query latency, and alert delay as concurrency rises. Sustained 10K+ spans/sec on your real payload is the table-stakes target.
- Cost-adjust. Real cost equals platform price plus span volume, payload size, retention days, judge tokens, retry rate, annotation hours, and SRE time. Self-hosted loses if infra bill plus on-call time exceeds SaaS overage; hosted loses if per-span pricing compounds at scale.
Where Future AGI fits
Most teams end up running three or four products in production: one for traces, one for evals, one for the gateway, one for guardrails. Future AGI is the pick when those live on the same Apache 2.0 self-hostable plane and the OpenInference-shaped trace is the unit. traceAI auto-instruments 50+ AI surfaces across 4 languages; ai-evaluation attaches 50+ EvalTemplates plus 20+ local metrics on the same span; the Agent Command Center gateway fronts 100+ providers with 18+ runtime guardrails (~29k req/s, P99 21 ms with guardrails on, t3.xlarge). SOC 2 Type II + HIPAA + GDPR + CCPA per futureagi.com/trust. Start free; usage-based after that. Pricing.
Sources
Future AGI · traceAI · ai-evaluation · Agent Command Center docs · Phoenix · Langfuse · OpenLLMetry · Helicone · Datadog LLM Observability docs · OpenInference conventions
Read next
Best LLM Monitoring Tools · Best AI Agent Debugging Tools · Best AI Agent Observability Tools · What is LLM Tracing · traceAI
Frequently asked questions
What are the best LLM tracing tools in 2026?
How is LLM tracing different from generic distributed tracing?
Should I use OpenTelemetry-native tracing or a vendor-specific SDK?
What span attributes should every LLM trace carry?
How do I sample LLM traces without losing failure visibility?
How does pricing compare across LLM tracing tools in 2026?
Which tool handles 10K+ spans per second sustained?
Can I use my existing APM (Datadog, Honeycomb, Grafana) for LLM tracing?
OpenInference is the OpenTelemetry-aligned semantic convention and instrumentation library for LLM applications, maintained by Arize. 2026 fit explained.
FutureAGI, Langfuse, LangSmith, Helicone, Braintrust, and W&B Weave as Arize Phoenix alternatives in 2026. Pricing, OSS license, OTel coverage, tradeoffs.
Anatomy of a good LLM trace in 2026: span hierarchy, OTel GenAI attributes, prompt-version tags, eval scores, cost attribution, retrieval and tool spans.