
Best LLM Tracing Tools in 2026: 6 Honest Picks

Best LLM tracing tools 2026 compared: Future AGI traceAI, Phoenix, Langfuse, OpenLLMetry, Helicone, Datadog. OTel discipline + auto-instrumentation.


LLM tracing in 2026 is OTel-or-bust. The span tree, not the chat log, is the unit: every chain step, retrieval, tool call, guardrail decision, and judge score lives on a node with OpenInference-standard attributes and ships into a backend you can swap. Tools that invented their own span format last cycle are bridging back to OpenTelemetry; tools that started on OTel keep the wire format portable. The six picks below are the honest shortlist. What separates them is auto-instrumentation breadth, span enrichment depth, OTel discipline, and behavior at 10K+ spans/sec sustained. Last updated May 20, 2026.

TL;DR: best LLM tracing tool per use case

| Use case | Best pick | Why | Pricing | License |
| --- | --- | --- | --- | --- |
| OpenInference-first tracing + evals + guardrails + gateway on one runtime | Future AGI traceAI | 50+ surfaces across 4 langs, span-attached evals, Agent Command Center | Free + usage | Apache 2.0 |
| OpenInference reference + auto-instrumentation reach | Arize Phoenix | OTLP-first, canonical attribute names | Phoenix free, AX Pro $50/mo | ELv2 |
| Self-hosted span tree with prompts and datasets | Langfuse | OSS platform breadth; OTel bridge over its own format | Hobby free, Core $29/mo | MIT core |
| Drop-in OTel instrumentation into your existing collector | OpenLLMetry | Vendor-agnostic library, one-line init | Free library | Apache 2.0 |
| Lowest-friction first trace via gateway base URL change | Helicone | Zero SDK change for OpenAI-compatible clients | Hobby free, Pro $79/mo | Apache 2.0 |
| LLM spans next to APM, logs, and infra | Datadog LLM Observability | Unified APM + LLM correlation | APM $31/host + add-on | Closed |

One-row summary: pick Future AGI traceAI when OpenInference-shaped traces have to share a runtime with evals, guardrails, and gateway; Phoenix when OpenInference adherence is the entire requirement; Datadog only when Datadog already runs everything else.

Why OTel-or-bust is the right frame for 2026

LLM tracing went through three eras. SDK era: every vendor shipped a proprietary tracer with a proprietary schema. Bridge era: vendors added OTel exporters but kept their own attribute names. OpenInference era, where we are now: attribute names are standardized, the OTel wire format is the common substrate, and the trace becomes portable across backends.

Three properties separate the tools worth shortlisting from the ones that lock you in next quarter.

  1. Auto-instrumentation breadth. How many LLM SDKs, agent frameworks, vector DBs, and reranker libraries does the tracer pick up without manual span creation? The surfaces to check: OpenAI, Anthropic, Gemini, Bedrock, Cohere, LangChain, LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Pydantic AI, DSPy, Mastra, LlamaIndex, Haystack, Pinecone, Weaviate, Qdrant, and Chroma. The gap between the top two and the rest is large.
  2. Span enrichment depth. Does the span carry OpenInference span kinds, prompt template IDs, rendered prompts, token splits, chunk scores, judge scores, and policy verdicts, or just model and latency?
  3. OTel discipline. Does the tool emit OpenInference-shaped OTel spans natively, or ship its own format and translate at the edge? Native is portable; translation drops attributes when conventions move.

A tracer that satisfies (1) is a logger with a tree. (1) and (2) is LLM observability. (1) through (3) is what you want anchoring your stack in 2026.
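To make property (3) concrete, here is a minimal sketch of what "translate at the edge" looks like and why it loses data when conventions move. The `vendor.*` attribute names and the `translate` helper are hypothetical, invented for illustration; the target names follow the OpenInference conventions as published, so check the current spec before relying on them.

```python
# Hypothetical vendor-to-OpenInference mapping; the vendor.* names are made up.
VENDOR_TO_OPENINFERENCE = {
    "vendor.model": "llm.model_name",
    "vendor.tokens.in": "llm.token_count.prompt",
    "vendor.tokens.out": "llm.token_count.completion",
}


def translate(vendor_attrs: dict, mapping: dict = VENDOR_TO_OPENINFERENCE):
    """Return (translated, dropped): mapped attributes and the keys with no mapping."""
    translated, dropped = {}, []
    for key, value in vendor_attrs.items():
        if key in mapping:
            translated[mapping[key]] = value
        else:
            dropped.append(key)  # lost unless someone updates the mapping
    return translated, sorted(dropped)


span = {
    "vendor.model": "example-model",
    "vendor.tokens.in": 812,
    "vendor.tokens.out": 96,
    "vendor.rerank.score": 0.71,  # added after the mapping was written
}
translated, dropped = translate(span)
print(translated)
print(dropped)
```

A tracer that emits the standard names natively has no mapping table to maintain, which is the portability argument in one picture.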

What an LLM trace actually contains

A working LLM trace covers five attribute classes: identity (OTel span kind, trace + span + parent IDs, service, environment); model and prompt (model name and version, template ID, rendered prompt, message list, output schema); cost (prompt + completion + total tokens, total cost USD, price snapshot); result (response, latency, status, error class, retry count); and eval + policy (judge score, judge name, threshold, pass/fail, guardrail decision, policy ID). OpenInference standardized these names in 2024-2025; use them and your trace is portable, skip them and you pay for it on migration day.
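The five attribute classes can be sketched as a single record that flattens into span attributes. This is an illustration, not any vendor's schema: the `LLMSpanRecord` class is hypothetical, the `openinference.span.kind`, `llm.*`, `input.value`, and `output.value` keys follow the OpenInference conventions as published, and every key prefixed `example.` is a placeholder for attributes whose canonical names you should look up in the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMSpanRecord:
    # identity (the IDs live on the span context, not in attributes)
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    span_kind: str  # e.g. "LLM", "RETRIEVER", "TOOL"
    # model and prompt
    model_name: str
    rendered_prompt: str
    # cost
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    # result
    response: str
    latency_ms: float
    status: str
    # eval + policy
    judge_name: Optional[str] = None
    judge_score: Optional[float] = None
    judge_threshold: Optional[float] = None

    def to_attributes(self) -> dict:
        attrs = {
            "openinference.span.kind": self.span_kind,
            "llm.model_name": self.model_name,
            "input.value": self.rendered_prompt,
            "output.value": self.response,
            "llm.token_count.prompt": self.prompt_tokens,
            "llm.token_count.completion": self.completion_tokens,
            "llm.token_count.total": self.prompt_tokens + self.completion_tokens,
            # Placeholder keys below; not confirmed OpenInference names.
            "example.cost_usd": self.cost_usd,
            "example.latency_ms": self.latency_ms,
            "example.status": self.status,
        }
        if self.judge_score is not None and self.judge_threshold is not None:
            attrs["example.eval.judge_name"] = self.judge_name
            attrs["example.eval.score"] = self.judge_score
            attrs["example.eval.passed"] = self.judge_score >= self.judge_threshold
        return attrs


record = LLMSpanRecord(
    trace_id="t1", span_id="s1", parent_id=None, span_kind="LLM",
    model_name="example-model", rendered_prompt="Summarize the ticket.",
    prompt_tokens=812, completion_tokens=96, cost_usd=0.0043,
    response="The customer reports a login loop.", latency_ms=930.0, status="ok",
    judge_name="faithfulness", judge_score=0.62, judge_threshold=0.70,
)
attrs = record.to_attributes()
print(attrs)
```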

The 6 LLM tracing tools, compared

1. Future AGI traceAI: best for OpenInference-first tracing with evals, guardrails, and gateway on one runtime

Apache 2.0 traceAI. Apache 2.0 platform. Self-hostable. Hosted cloud.

Quick take. traceAI is the broadest OpenInference-shaped instrumentation surface in 2026: 50+ AI surfaces across Python, TypeScript, Java (LangChain4j, Spring AI), and a C# core, with one-line register() setup and zero manual span creation. The platform attaches Turing eval scores directly to spans, runs the Agent Command Center gateway across 100+ providers with 18+ runtime guardrails on the same trace stream, and ships 50+ EvalTemplate classes plus 20+ local heuristic metrics.

Ideal for. Teams running RAG agents, voice agents, or copilots where a production failure must replay in pre-prod with the same scorer contract, and tracing, eval, guardrails, gateway, and prompt optimization need to live on one Apache 2.0 self-hostable plane.

Key strengths.

  • 50+ AI surfaces auto-instrumented across 4 languages. 14 OpenInference span kinds (LLM, CHAIN, RETRIEVER, TOOL, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR, others); Phoenix ships 8, Langfuse 5.
  • Span-attached evals: 50+ metrics as pytest CI scorers and online scorers. Lower per-eval cost than Galileo Luna-2 on the published rubrics.
  • Built-in PII redaction at the span attribute layer before export; the trace never leaves your network with raw secrets in it.
  • Agent Command Center gateway on the same plane: 100+ providers, BYOK routing, fallback, caching, 18+ runtime guardrails. Benchmarked at ~29k req/s, P99 21 ms with guardrails on, on t3.xlarge.
  • SOC 2 Type II + HIPAA + GDPR + CCPA per futureagi.com/trust; ISO 27001 in active audit.
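Redaction at the span attribute layer can be pictured with a short sketch. This is not traceAI's implementation; the two regex patterns and the `redact_attributes` helper are illustrative assumptions, and a production rule set would be broader and reviewed by security.

```python
import re

# Illustrative patterns only; real deployments need a reviewed, broader set.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
}


def redact_value(value):
    """Mask matches in string values; leave numbers, bools, etc. untouched."""
    if not isinstance(value, str):
        return value
    for label, pattern in PATTERNS.items():
        value = pattern.sub(f"[REDACTED_{label}]", value)
    return value


def redact_attributes(attrs: dict) -> dict:
    """Return a redacted copy of the span attributes, ready for export."""
    return {key: redact_value(value) for key, value in attrs.items()}


raw = {
    "input.value": "Contact jane@example.com, key sk-abcdef1234567890ABCD",
    "llm.token_count.prompt": 812,
}
clean = redact_attributes(raw)
print(clean)
```

The design point is ordering: the function runs before the exporter sees the span, so the raw value never reaches the wire.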

Honest limitations. More moving parts than a pure instrumentation library when all you want is a tracer. Self-host runs ClickHouse, Postgres, Redis, Temporal, and the gateway as real services; use hosted cloud if you don’t want to operate the data plane. Phoenix has a longer-published OpenInference reference repo even though traceAI ships the broader surface today.

Pricing. Free to start with generous limits, then usage-based rather than per-seat. Compliance add-ons (HIPAA BAA, SAML SSO + SCIM) layer per tier.

Expert verdict. Pick Future AGI traceAI when the trace has to anchor the broader reliability loop: span-attached evals, runtime guardrails, gateway routing, simulation, and prompt optimization on one self-hostable plane. The buying signal is teams that have already stitched a stack manually (Langfuse for traces, Braintrust for evals, a notebook for optimization, a separate gateway) and watched the handoffs lose fidelity between layers.

2. Arize Phoenix: best for OpenInference reference and auto-instrumentation reach

Source available under Elastic License 2.0. Self-hostable. Phoenix Cloud and Arize AX paths.

Quick take. Phoenix is the canonical OpenInference reference. Arize owns the spec, conventions land here first, and the auto-instrumentation packages are the longest-tested in production. OTLP-first ingestion, a clean local workbench (phoenix.launch_app() and you have a tracer), and a published path into Arize AX when scale or compliance demands it.

Ideal for. Engineers who care about open instrumentation standards, want a local workbench they can stand up in a notebook, and value the Arize AX upgrade path.

Key strengths.

  • Canonical OpenInference reference; new attribute conventions ship here first.
  • Auto-instrumentation across Python, TypeScript, and Java for LlamaIndex, LangChain, DSPy, Mastra, Vercel AI SDK, OpenAI Agents SDK, Bedrock, Anthropic, and 12+ others.
  • Embedding-drift heritage from Arize’s pre-LLM observability era; retrieval-quality dashboards and chunk-level drift are mature.
  • phoenix.launch_app() boots a full tracer + UI in one command; the development inner loop is the best in the category.

Honest limitations. Not a gateway, not a guardrail product, not a simulator. Elastic License 2.0 is source available, not OSI open source; legal teams that follow OSI strictly will flag it. Trajectory metric library is smaller than Future AGI’s, and scoring lives in a parallel Phoenix eval surface rather than as a span-attached primitive.

Pricing. Phoenix free self-hosted. AX Free 25K spans/mo, 15 days retention. AX Pro $50/mo with 50K spans, 30 days. Enterprise custom with SOC 2, HIPAA, data residency.

Expert verdict. Pick Phoenix when OpenInference adherence and the Arize-AX upgrade path are the buying signals, and gateway, guardrails, simulation, and strict OSI open source live elsewhere. See Arize Phoenix Alternatives.

3. Langfuse: best for self-hosted span trees with prompts, datasets, and a mature UI

MIT core. Self-hostable. Hosted cloud option.

Quick take. Langfuse is the strongest OSS-first platform pick when self-hosted tracing with prompt versioning and dataset-driven evals is the requirement. Big community, mature self-hosted story, polished UI for traces, prompts, datasets, runs, and human annotation queues. The catch is the wire format: Langfuse ships its own span schema with an OTel ingestion bridge layered over it. Full OpenInference adherence requires translation.

Ideal for. Platform teams that operate the data plane themselves, want trace data in their own infrastructure, and pair Langfuse with external eval, guardrail, and gateway layers.

Key strengths.

  • MIT core; mature architecture across Postgres, ClickHouse, Redis, queues. ClickHouse-backed span storage handles 10K+ spans/sec on tuned infrastructure.
  • Prompt management with labels, environments, version diffs; datasets, runs, human annotation queues all first-class.
  • Experiments CI/CD integration shipped May 2026.
  • Big community, large GitHub footprint, broad integration coverage with LiteLLM proxy logging.

Honest limitations. OTel ingestion uses Langfuse’s own schema layered over OTel; full OpenInference adherence requires translation, and attribute drift between Langfuse and OpenInference is an ongoing maintenance cost. Trajectory metrics are not first-class: 5 span kinds vs Future AGI’s 14. Simulation, voice eval, prompt optimization, and runtime guardrails live in adjacent tools you wire in yourself. Enterprise directories ship under a separate commercial license outside the MIT core.

Pricing. Hobby free with 50K units/mo. Core $29/mo with 100K units, $8 per additional 100K. Pro $199/mo. Enterprise $2,499/mo. A “unit” covers a trace, observation, score, or eval, which is why production cost compounds faster than the flat price suggests.
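A quick worked example shows how unit counting compounds. The prices come from the tier listed above; the per-trace observation and score counts are made-up inputs, and billing overage in whole 100K blocks is an assumption of this sketch rather than a confirmed billing rule.

```python
import math

CORE_BASE_USD = 29            # Core tier price quoted above
CORE_INCLUDED_UNITS = 100_000
OVERAGE_USD_PER_BLOCK = 8     # per additional 100K units
OVERAGE_BLOCK_UNITS = 100_000


def units_per_month(traces: int, observations_per_trace: int, scores_per_trace: int) -> int:
    # One unit for the trace itself, plus one per observation and per score.
    return traces * (1 + observations_per_trace + scores_per_trace)


def core_monthly_cost(units: int) -> int:
    extra = max(0, units - CORE_INCLUDED_UNITS)
    blocks = math.ceil(extra / OVERAGE_BLOCK_UNITS)  # assumption: whole blocks
    return CORE_BASE_USD + blocks * OVERAGE_USD_PER_BLOCK


units = units_per_month(traces=200_000, observations_per_trace=6, scores_per_trace=2)
cost = core_monthly_cost(units)
print(units, cost)
```

Under these assumptions, 200K traces a month turn into 1.8M units, and the $29 tier bills $165.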

Expert verdict. Pick Langfuse if OSS observability with prompts, datasets, and a mature UI is the entire requirement and you accept the Langfuse-shaped wire format as your substrate. Skip if OpenInference-native traces are non-negotiable or you need eval, guardrails, and gateway on the same runtime. See Langfuse Alternatives.

4. OpenLLMetry (Traceloop): best for drop-in OTel instrumentation into your existing collector

Apache 2.0 library. Optional Traceloop hosted backend.

Quick take. OpenLLMetry is the DIY pick. pip install traceloop-sdk, one Traceloop.init() call, and LLM-aware spans start flowing into whatever OTel collector you already operate (Tempo, Jaeger, Honeycomb, New Relic, Grafana, Datadog APM). No platform to learn, no second UI. The cost is that the library is just the emitter; the dashboard, eval, prompts, datasets, and annotation pieces are wherever you wire them.

Ideal for. Engineering teams whose tracing standard is already “OTel into our existing collector” and who don’t want a second observability platform. Common case: an org running Honeycomb or Grafana Tempo for service tracing that wants LLM spans in the same backend.

Key strengths.

  • Apache 2.0, 4K+ GitHub stars, vendor-agnostic by design.
  • One-line Traceloop.init() instrumentation for OpenAI, Anthropic, Bedrock, Cohere, LangChain, LlamaIndex, Haystack, Pinecone, Weaviate, Chroma, and 30+ others.
  • Emits OpenTelemetry spans; the wire format is portable. No lock-in on the trace store.

Honest limitations. No first-party dashboard worth running in production. Pair with a backend that understands LLM-specific UI (Phoenix, Future AGI, Langfuse) or accept that classical APM will render the span tree without prompt diffs, judge heatmaps, or chunk attribution. Span attribute coverage is OpenInference-compatible but the framework catalog updates slower than traceAI’s.

Pricing. Free for the OSS library. Traceloop’s hosted backend (separate product) is $79/mo.

Expert verdict. Pick OpenLLMetry when the backend is already chosen and the requirement is “make my existing OTel stack LLM-aware”. Skip if you also need eval, guardrails, simulation, or a dashboard that renders prompt-shaped data out of the box.

5. Helicone: best for lowest-friction first trace via gateway base URL change

Apache 2.0. Self-hostable. Hosted cloud.

Quick take. Helicone is the gateway-first option. Change base_url on your OpenAI-compatible client to Helicone’s endpoint and every request becomes a logged span with no SDK swap. Zero instrumentation friction is the pitch, and for OpenAI-compatible workloads (most of them in 2026), it works.
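The base URL swap looks roughly like this. The endpoint and the `Helicone-Auth` header are taken from Helicone's public documentation at the time of writing and should be treated as assumptions to verify; `helicone_client_kwargs` is a hypothetical helper, and the keys shown are placeholders.

```python
HELICONE_OPENAI_BASE_URL = "https://oai.helicone.ai/v1"  # assumed hosted endpoint


def helicone_client_kwargs(openai_api_key: str, helicone_api_key: str) -> dict:
    """Keyword arguments for an OpenAI-compatible client routed through the gateway."""
    return {
        "api_key": openai_api_key,
        "base_url": HELICONE_OPENAI_BASE_URL,
        "default_headers": {"Helicone-Auth": f"Bearer {helicone_api_key}"},
    }


kwargs = helicone_client_kwargs("sk-placeholder", "helicone-placeholder")
# With the openai package installed, the client would be built as:
#   from openai import OpenAI
#   client = OpenAI(**kwargs)
print(kwargs["base_url"])
```

Application code that calls the client stays unchanged, which is why calls that bypass the client (tools, retrievers) never show up in the trace.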

Ideal for. Teams that want production traces today, especially when the LLM client lives in a language without a maintained native instrumentation library, or developer-velocity reasons rule out a code-level integration.

Key strengths.

  • Lowest cold-start friction in the category: a base_url change is the entire integration.
  • Sessions, request analytics, prompts, caching, and rate limits in one product.
  • Apache 2.0, 4K+ GitHub stars, self-hostable. Good fit for serverless deployments where adding an SDK is awkward.

Honest limitations. Span depth is shallower than Phoenix or Langfuse for multi-step agents. Tool calls, retrievals, and sub-agent handoffs that don’t pass through the gateway aren’t traced unless you instrument them separately. Roadmap risk became part of vendor diligence after the March 2026 Mintlify acquisition; platform remains usable but feature velocity slowed. Native OpenInference adherence is partial.

Pricing. Hobby free with 10K logs/mo. Pro $79/mo with 100K logs. Team and Enterprise tiers add SSO and on-prem.

Expert verdict. Pick Helicone for a first trace this week on OpenAI-compatible workloads. Replace or pair with traceAI or OpenLLMetry once multi-step agents enter the picture. See Helicone Alternatives.

6. Datadog LLM Observability: best when Datadog is already the standard

Closed platform. SaaS with regional residency. APM-integrated.

Quick take. Datadog ships LLM Observability as an APM add-on. The pitch is one tool for everything: LLM spans correlated with APM traces, infrastructure metrics, logs, RUM, and security events. Right pick only when Datadog is already the system of record and unified APM-plus-LLM observability beats eval depth. Honeycomb and New Relic play the same role for teams on those backends; Datadog is the largest of the three on the LLM observability path.

Ideal for. Enterprise teams where Datadog is the system of record and SRE/platform engineers want LLM workloads on the same dashboard, alerting plane, and on-call rotation as service infrastructure.

Key strengths.

  • LLM spans inside the same product as APM, logs, RUM, security, and infra metrics.
  • Infrastructure correlation: LLM latency next to DB query latency next to downstream service latency.
  • Mature enterprise security posture, regional residency, SOC 2, SRE workflows. Scales to high-volume span ingestion on Datadog’s existing backend.

Honest limitations. Eval surface is shallower than dedicated LLM platforms: no first-party simulator, fewer built-in metric primitives, no integrated guardrails. Cost scales fast with span volume; production contracts cross into five figures monthly without much effort. Path of least resistance is the Datadog SDK, not OTel, and the Datadog SDK is not OpenInference-shaped by default.

Pricing. APM at $31 per host per month with annual billing, plus LLM Observability add-on metered per ingested span and per indexed log. Expect contracts above $1,000/mo at modest scale; production teams enter five-figure monthly contracts quickly.

Expert verdict. Pick Datadog LLM Observability when Datadog is the standard and consolidation beats eval depth. Pair with Future AGI or Phoenix if eval and span-attached scoring become the bottleneck. Same logic for Honeycomb and New Relic teams: keep the backend, point traceAI or OpenLLMetry at it.

Figure: Future AGI product showcase in four panels. Top-left: span tree waterfall with six rows (agent.run, router.classify, retriever.search, openai.chat, validator.run marked FAIL, formatter.compose), each with a start offset and duration bar. Top-right: span attributes as a key-value list (model, prompt_tokens, completion_tokens, latency_ms, cost_usd, otel.span_kind). Bottom-left: OpenInference SDK coverage grid with 16 cards across languages and frameworks. Bottom-right: trace search bar with a filter expression and four result rows showing trace_id, started_at, duration, and status.

How the 6 tools score on the OTel-or-bust scorecard

| Capability | Future AGI | Phoenix | Langfuse | OpenLLMetry | Helicone | Datadog |
| --- | --- | --- | --- | --- | --- | --- |
| Auto-instrumentation surfaces | 50+ / 4 langs | ~20 / 3 langs | ~15 / 2 langs | ~30 / 2 langs | Gateway-only | SDK + OTel ingest |
| OpenInference span kinds | 14 | 8 | 5 (own format) | OI-compatible | Partial | OTel + APM |
| Native OTel + OpenInference | Full | Full (reference) | OTel bridge over own | Full | Partial | OTel accepted, SDK preferred |
| Span-attached evals | Full (50+ metrics) | Partial | Partial | None | None | Partial |
| Built-in PII redaction | Yes | No (configurable) | No (configurable) | No | No | Partial |
| Self-host license | Apache 2.0 | ELv2 | MIT core | Apache 2.0 (lib) | Apache 2.0 | Closed |
| Gateway on same plane | Yes (100+) | No | No | No | Yes | No |
| Inline guardrails | Yes (18+) | No | No | No | No | No |
| Sustained 10K+ spans/sec | Yes | Yes (AX) | Yes (tuned) | Collector-bound | 1K+ rps | Yes |

Decision framework: pick by constraint

  • OpenInference adherence non-negotiable: Future AGI traceAI, Phoenix, OpenLLMetry.
  • Self-hosting under Apache 2.0: Future AGI, Helicone, OpenLLMetry. Under MIT or ELv2: Langfuse, Phoenix.
  • Already on Honeycomb, New Relic, Tempo, or Jaeger: OpenLLMetry or Future AGI traceAI emit OpenInference-shaped spans into the collector you already run.
  • Already on Datadog: Datadog LLM Observability for unified APM + LLM, or point traceAI / OpenLLMetry at the Datadog OTel intake when eval depth matters more than the native Datadog SDK ergonomics.
  • Span-attached evals on the same trace: Future AGI (50+ metrics first-class), Phoenix (via Phoenix evals), Langfuse (via custom scorers).
  • Lowest-friction first trace: Helicone (gateway base URL), then Future AGI traceAI or OpenLLMetry (one-line register() / init()).
  • Gateway, guardrails, and tracing on the same runtime: Future AGI is the only pick here. Helicone is gateway-first without the guardrail and eval surface; Datadog is APM-first without a guardrail layer.

Common mistakes when picking a tracing tool

  • Confusing tracing with monitoring. Tracing captures the span tree per request; monitoring watches trends across requests. Most teams need both.
  • Picking on demo videos. Vendor demos use clean span schemas with idealized payloads. Load-test on your real schema with real payload sizes before signing anything.
  • Head-based sampling alone. Drops the rare failure that caused the user complaint. Use tail-based sampling on errors, high-cost traces, and below-threshold eval scores.
  • Ignoring storage cost. ClickHouse retention dominates the bill. 10M traces retained for 90 days occupy 200 GB to 2 TB depending on payload size, which works out to roughly 20 KB to 200 KB per trace.
  • Treating ELv2 as open source. Phoenix is source available, not OSI open source. Verify with legal before procurement.
  • Treating “OTel-compatible” and “OpenInference-shaped” as the same thing. Accepting OTel spans is table stakes; emitting OpenInference attributes is what makes the trace portable.
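The storage figure in the list above is easy to re-derive for your own volume. The sketch below assumes uncompressed payloads and decimal units; the 20 KB and 200 KB per-trace sizes are back-calculated from the quoted range, not measured, and `retained_storage_gb` is a hypothetical helper.

```python
def retained_storage_gb(traces: int, avg_trace_kb: float) -> float:
    """Storage for the retained traces, in decimal GB (1 GB = 1,000,000 KB)."""
    return traces * avg_trace_kb / 1_000_000


low = retained_storage_gb(10_000_000, avg_trace_kb=20)
high = retained_storage_gb(10_000_000, avg_trace_kb=200)
print(low, high)
```

Swap in your own retained trace count and a payload size measured from real traffic; compression in the storage engine will pull the result down, so treat it as an upper bound.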

Recent LLM tracing updates

| Date | Event | Why it matters |
| --- | --- | --- |
| May 2026 | Braintrust added Java auto-instrumentation | Java, Spring AI, LangChain4j teams can trace with less manual code. |
| Mar 9, 2026 | Future AGI shipped Agent Command Center and ClickHouse trace storage | High-volume trace analytics moved into the same plane as evals and gateway. |
| Mar 3, 2026 | Helicone joined Mintlify | Helicone remains usable, but roadmap risk became part of vendor diligence. |
| Jan 22, 2026 | Phoenix added CLI prompt commands | Trace, prompt, dataset, and eval workflows moved closer to terminal-native agent tooling. |
| 2025-2026 | OpenInference v1 conventions stabilized | Cross-platform span schema reduces vendor lock-in. |

How to evaluate this for production

  1. Run a domain reproduction. Export a slice of real traces (failures, long-tail prompts, tool calls, retrieval misses, hand-labeled outcomes) and instrument each candidate with your harness, OTel payload shape, prompt versions, and judge model. Don’t accept a demo dataset.
  2. Measure reliability under load. Track p50, p95, p99 ingestion, dropped spans, failed judge calls, retry count, query latency, and alert delay as concurrency rises. Sustained 10K+ spans/sec on your real payload is the table-stakes target.
  3. Cost-adjust. Real cost equals platform price plus span volume, payload size, retention days, judge tokens, retry rate, annotation hours, and SRE time. Self-hosted loses if infra bill plus on-call time exceeds SaaS overage; hosted loses if per-span pricing compounds at scale.
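For step 2, a load-test run can be reduced to a handful of numbers. The sketch below uses the nearest-rank percentile method on synthetic latencies; `summarize` and its field names are hypothetical, and a real run would feed in measured ingestion latencies and span counts from the candidate backend.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms, spans_sent: int, spans_stored: int) -> dict:
    dropped = spans_sent - spans_stored
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "dropped_spans": dropped,
        "drop_rate": dropped / spans_sent,
    }


latencies = list(range(1, 101))  # synthetic: 1..100 ms
report = summarize(latencies, spans_sent=10_000, spans_stored=9_990)
print(report)
```

Run the same summary at each concurrency level and compare the rows; a backend that holds p50 but lets p99 or the drop rate climb is the one that fails in production.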

Where Future AGI fits

Most teams end up running three or four products in production: one for traces, one for evals, one for the gateway, one for guardrails. Future AGI is the pick when those live on the same Apache 2.0 self-hostable plane and the OpenInference-shaped trace is the unit. traceAI auto-instruments 50+ AI surfaces across 4 languages; ai-evaluation attaches 50+ EvalTemplates plus 20+ local metrics on the same span; the Agent Command Center gateway fronts 100+ providers with 18+ runtime guardrails (~29k req/s, P99 21 ms with guardrails on, t3.xlarge). SOC 2 Type II + HIPAA + GDPR + CCPA per futureagi.com/trust. Start free; usage-based after that.

Sources

Future AGI · traceAI · ai-evaluation · Agent Command Center docs · Phoenix · Langfuse · OpenLLMetry · Helicone · Datadog LLM Observability docs · OpenInference conventions

Best LLM Monitoring Tools · Best AI Agent Debugging Tools · Best AI Agent Observability Tools · What is LLM Tracing · traceAI

Frequently asked questions

What are the best LLM tracing tools in 2026?
The honest shortlist is Future AGI traceAI, Arize Phoenix, Langfuse, OpenLLMetry, Helicone, and Datadog LLM Observability. Future AGI traceAI and Phoenix lead on OpenTelemetry plus OpenInference discipline; Future AGI also pairs traces with span-attached evals, gateway, and guardrails on one runtime. Langfuse leads on the OSS platform story but ships its own span format with an OTel bridge. OpenLLMetry is the DIY OTel-into-your-collector pick. Helicone is the gateway-first option for the lowest-friction first trace. Datadog is the right pick only when Datadog is already the system of record. The right answer turns on instrumentation breadth, span enrichment depth, and whether the wire format will outlive the vendor.
How is LLM tracing different from generic distributed tracing?
LLM traces add span attributes a classical tracer does not understand: prompt and response payloads, token counts per call, model name and version, prompt template ID, retrieved chunks with scores, tool-call arguments and results, judge scores, and policy decisions. OpenInference standardized these conventions across 2024-2025 so the same span renders the same way in Phoenix, Future AGI, Langfuse with translation, and any OTel backend that knows the attribute names. A generic tracer (Tempo, Jaeger, vanilla Datadog APM) shows the span tree but can't render prompt diffs, judge heatmaps, or chunk attribution without an LLM-aware UI on top.
Should I use OpenTelemetry-native tracing or a vendor-specific SDK?
OpenTelemetry plus OpenInference is the safer long-term bet. The wire format is standardized, instrumentation libraries (Future AGI traceAI, OpenLLMetry, Phoenix's OpenInference packages) emit the same span shape into any OTel-compatible backend, and you keep the trace store swappable. Vendor-specific SDKs (LangSmith, native Helicone, Langfuse's pre-OTel SDKs) give richer first-party UX but harder migration. Most 2026 teams default to OpenInference-shaped spans plus a backend that understands the conventions, and treat the backend as the swappable layer.
What span attributes should every LLM trace carry?
At minimum the OpenInference required set: OpenInference span kind (LLM, CHAIN, RETRIEVER, TOOL, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR), model name and version, prompt template ID, rendered prompt, response, prompt tokens, completion tokens, total cost, latency, status, and error class. For agents add tool name, tool arguments, tool result, parent invocation. For RAG add retriever name, query, top-k, chunk scores. For eval-attached traces add judge name, score, threshold, pass/fail, judge cost. Using the canonical attribute names keeps traces portable across backends that understand the conventions, with no translation layer in the middle.
How do I sample LLM traces without losing failure visibility?
Sample 100% of failures, errors, and high-cost traces. Sample a fixed percentage (1-10%) of successful traces. Always retain spans where an attached eval came in below threshold. Use tail-based sampling on the OTel collector or in the backend so the decision sees the full trace before dropping it. Phoenix, Langfuse, Future AGI, and Datadog all support tail sampling on judge score or status. Avoid head-based percentage sampling alone; you will lose the rare failure that caused the user complaint, and that's the trace you most need.
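The policy above can be sketched as a single keep-or-drop function that runs once the trace is complete. The trace field names (`status`, `cost_usd`, `eval_score`, `eval_threshold`), the default rate, and the cost limit are all hypothetical; in practice this logic would live in your OTel collector's tail-sampling configuration or in the backend.

```python
import hashlib


def keep_trace(trace: dict, success_rate: float = 0.05, cost_limit_usd: float = 0.50) -> bool:
    """Tail-based decision: called after the whole trace has been assembled."""
    if trace.get("status") == "error":
        return True
    if trace.get("cost_usd", 0.0) >= cost_limit_usd:
        return True
    score, threshold = trace.get("eval_score"), trace.get("eval_threshold")
    if score is not None and threshold is not None and score < threshold:
        return True
    # Deterministic sample of the remaining successes, keyed on trace_id so
    # every replica makes the same decision for the same trace.
    digest = hashlib.sha256(trace["trace_id"].encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # in [0, 1)
    return bucket < success_rate


failed = {"trace_id": "a1", "status": "error"}
pricey = {"trace_id": "b2", "status": "ok", "cost_usd": 1.20}
low_score = {"trace_id": "c3", "status": "ok", "eval_score": 0.4, "eval_threshold": 0.7}
healthy = {"trace_id": "d4", "status": "ok", "cost_usd": 0.01}
print(keep_trace(failed), keep_trace(pricey), keep_trace(low_score))
```

Hashing the trace ID instead of calling a random number generator keeps the decision reproducible, which helps when several collector instances see spans from the same trace.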
How does pricing compare across LLM tracing tools in 2026?
Phoenix self-host is free; Arize AX Pro is $50/mo. Langfuse Hobby is free; Core is $29/mo flat with $8 per additional 100K units. Helicone Hobby is free; Pro is $79/mo. Datadog LLM Observability layers on top of $31/host/mo APM and meters per ingested span plus per indexed log, which crosses into five-figure monthly contracts at production scale. Future AGI is free with generous limits then usage-based. OpenLLMetry is a free library; the cost is your downstream backend. The real cost equation is subscription plus span volume, payload size, retention days, judge tokens, and the SRE hours to operate the storage layer.
Which tool handles 10K+ spans per second sustained?
Langfuse on tuned ClickHouse, Phoenix via Arize AX cloud, Future AGI with ClickHouse trace storage, and Datadog all sit on storage backends that production teams have pushed past 10K spans per second in their own environments. Helicone scales to 1K+ requests per second on standard ClickHouse and higher with tuning, but the gateway path makes it more sensitive to payload size. OpenLLMetry's ingestion ceiling is whatever your downstream collector and storage can absorb. Run a load test with your real span schema; vendor benchmarks tend to assume smaller payloads than production traffic carries.
Can I use my existing APM (Datadog, Honeycomb, Grafana) for LLM tracing?
Yes for the wire format, partially for the UX. Tempo, Jaeger, Honeycomb, New Relic, and Datadog APM all accept OTel spans, and Future AGI traceAI, OpenLLMetry, and Phoenix's OpenInference packages emit OpenInference-shaped spans into any of them. The gap is the LLM-specific UI: prompt diffing, eval-attached scores, chunk attribution, judge dashboards, and refusal calibration heatmaps don't render natively in classical APM. Most teams pair APM with a dedicated LLM tracing tool, or pick Datadog LLM Observability when one product for everything beats eval depth.