Best 5 Datadog LLM Observability Alternatives in 2026
Five Datadog LLM Observability alternatives scored on OpenInference support, bundle-free pricing, gateway-native routing, and what each replacement actually fixes when you re-point the OpenTelemetry exporter or replace the DD agent.
Table of Contents
Datadog LLM Observability shipped GA in October 2024 and has matured fast, the dashboards are clean, the trace viewer is recognizable to anyone who has used Datadog APM, and most engineering teams already have a Datadog contract on file. That last point is also why teams are leaving in 2026. LLM Observability is sold as a line item on top of APM, Logs, and Infrastructure, and the compounded bill is the first thing finance flags once LLM volume crosses from experiment to production. The other reasons are structural: features still maturing on a legacy APM platform, no native gateway or routing, OpenInference support that sits behind Datadog’s proprietary trace schema, and vendor lock-in via the DD agent.
This guide ranks five replacements worth migrating to, names what each fixes versus Datadog LLM Observability, and walks through the two migrations that always bite: re-pointing the OpenTelemetry exporter and replacing the DD agent on every workload that emits LLM traces today.
TL;DR: pick by exit reason
| Why you are leaving Datadog LLM Observability | Pick | Why |
|---|---|---|
| You want OpenInference-native traces feeding a closed loop of eval and optimization | Future AGI Agent Command Center | Apache 2.0 OSS stack plus a self-improving loop |
| You want a free, self-hosted OpenInference-native tracer | Arize Phoenix | The reference OpenInference implementation, runs locally |
| You want a hosted LLM-native observability suite with prompt management | Langfuse | OSS-core LLM observability with mature prompts and evals |
| You want lightweight per-request cost and session traces | Helicone | Drop-in proxy with friendly pricing curve below 10M req/mo |
| You want a high-throughput gateway with an integrated eval stack | Maxim Bifrost | Go-based gateway tuned for low-latency routing with Maxim eval |
Why people are leaving Datadog LLM Observability in 2026
Five exit drivers show up repeatedly across Hacker News threads on Datadog earnings, /r/dataengineering migration discussions, the OpenTelemetry community channels, and G2 reviews from the last two quarters.
1. Bundle pricing tied to APM compounds quickly
LLM Observability lists at $7 per million spans on the public pricing page, but that line item rarely shows up in isolation. The teams that pick it are already paying for APM hosts, Logs ingestion, and Infrastructure monitoring. As LLM volume scales, two things stack: ingested spans on LLM Observability itself, plus the APM, Logs, and custom-metrics charges from the application services that emit those spans. Reddit threads in March 2026 describe small teams whose Datadog bill grew from $3K/month to $11K/month over two quarters once a few LLM products went production, the LLM Observability line was a fraction of the delta, the rest was APM hosts and indexed logs that scaled with the agent workload. The pattern is structural, not a billing accident. When LLM volume is the only thing growing, the bundle works against you.
2. LLM features still maturing on a legacy APM platform
Datadog LLM Observability is built on top of APM primitives, services, spans, traces, resources, that pre-date the OpenInference span conventions by a decade. The pieces LLM teams actually want (token-by-token streaming traces, tool-call cost attribution, multi-step agent graphs, dataset-linked evaluations) are bolted onto the APM schema rather than first-class. Several surfaces remain in beta as of May 2026: the agent-graph view, the prompt-version comparison panel, and the eval-as-code workflow. Datadog ships fast, but the foundation is APM, not LLM-native.
3. No native gateway or routing
Datadog LLM Observability observes. It doesn’t route. There’s no virtual-key system, no provider-failover policy, no cost-aware routing, no prompt registry. Teams that want both observability and gateway behavior end up paying Datadog for traces plus another vendor (Portkey, Helicone, LiteLLM, Kong) for the gateway, then engineering the two surfaces to share session IDs and metadata. The duplication is the most common reason teams cite for re-architecting in 2026: “we’re paying twice for the same picture.”
4. Vendor lock-in via the Datadog agent
The recommended deployment pattern uses the Datadog agent, the same daemon that ships APM, Logs, and metrics. Convenient if you already run it on every host; a migration blocker if you don’t. Replacing the agent means redeploying every service whose Helm chart, ECS task definition, Kubernetes DaemonSet, or systemd unit references it. Datadog’s OpenTelemetry support has improved, the OTel collector can ship traces to Datadog without the proprietary agent. But documentation still leads with the agent path, and most production deployments are on the agent.
5. OpenInference support sits behind the proprietary schema
OpenInference is the de facto open standard for LLM span conventions, maintained by the OpenTelemetry community with major contributions from Arize and Future AGI. Datadog accepts OpenInference-formatted traces via the OTel collector path, but internally maps them onto Datadog’s proprietary llm.* attribute schema. The mapping is lossy in both directions. Teams that want their traces portable across Phoenix, Future AGI, Langfuse, and any future OpenInference-native tool find Datadog the most awkward stop on the path. If portability matters, you eventually leave.
What to look for in a Datadog LLM Observability replacement
The default “best LLM observability” axes are necessary but not sufficient for a Datadog exit. Score replacements on the seven that map to the surfaces you’re actually migrating off:
| Axis | What it measures |
|---|---|
| 1. OpenInference-native traces | Are spans stored in the open schema, not transcoded behind a proprietary one? |
| 2. Standalone pricing (no bundle) | Can you buy observability without buying APM, Logs, and Infra alongside? |
| 3. Native gateway or routing | Does the tool also route requests, or is it observation-only? |
| 4. Self-host posture | Can the stack run inside your VPC, fully air-gapped from the vendor? |
| 5. Eval + optimizer loop | Does the tool use its own trace data to improve prompts and routes? |
| 6. Agent-graph depth | Are multi-step agent flows first-class, or bolted onto APM trace views? |
| 7. Migration tooling | Are there published collector configs or importers for Datadog specifically? |
1. Future AGI Agent Command Center: Best for closing the loop
Verdict: Future AGI is the only entry in this list that fixes Datadog LLM Observability’s biggest weakness, traces sit in a dashboard and inform humans, but never the system itself. Agent Command Center captures the trace via OpenInference, scores it with the eval library, clusters failures, runs the optimizer, and pushes the updated prompt or route back into the gateway on the next request. The other four are observation layers of different shapes. FAGI is an observation layer wired to an optimizer, with the gateway in the same product.
What it fixes versus Datadog LLM Observability:
- OpenInference native, end to end.
traceAI(Apache 2.0) is one of the reference OpenInference implementations. Spans land in FAGI in the open schema without transcoding, so the same trace stream feeds Phoenix, Langfuse, or any other OpenInference-native consumer in parallel. No lossy round-trip through a proprietaryllm.*attribute schema. - Standalone pricing, no bundle. FAGI sells observability without requiring APM, Logs, or Infrastructure SKUs. Scale tier from $99/month with linear scaling above 5M traces and no add-on multipliers. Teams whose Datadog bill compounded because LLM volume dragged APM and Logs along with it see the chargeback flatten immediately.
- Native gateway, not a separate purchase. Agent Command Center is the gateway. Virtual keys, provider failover, cost-aware routing, and a prompt registry are first-class, not a second vendor stitched in over session IDs. The Future AGI Protect model family is the inline guardrail layer at ~67 ms p50 text and ~109 ms p50 image (arXiv 2510.13351). FAGI’s own fine-tuned Gemma 3n adapters across content moderation, bias detection, security/prompt-injection, and data privacy/PII, multi-modal across text/image/audio, a model family rather than a plugin chain.
- Closed-loop eval and optimization. Every captured trace is scored against task-completion, faithfulness, and tool-use rubrics by default via
ai-evaluation(Apache 2.0). Theagent-optlibrary (Apache 2.0) rewrites prompts automatically via ProTeGi, Bayesian, or GEPA, driven by those scores. Error Feed (FAGI’s “Sentry for AI agents”) sits alongside as the zero-config error monitor: auto-clusters related trace failures into named issues (50 traces → 1 issue), auto-writes the root cause from the span evidence plus a quick fix plus a long-term recommendation per issue, and tracks rising/steady/falling trend per issue so a regression surfaces like an exception rather than buried in a Datadog dashboard. Cost and quality sit in the same row; the optimizer treats them as a joint signal. - OSS instrumentation.
traceAI,ai-evaluation, andagent-optare all Apache 2.0. The hosted Command Center adds RBAC, failure-cluster views, the Protect guardrails layer, and AWS Marketplace procurement.
Migration from Datadog LLM Observability: OpenInference instrumentation drops in as a sidecar to existing Datadog agent setups, so you can dual-write traces for a shadow period. Once parity is validated, point the OpenTelemetry exporter at FAGI’s collector endpoint, redeploy, and decommission the agent. The Datadog importer reads exported APM traces in JSON, lifts the llm.* attributes back to OpenInference, and seeds historical dashboards. Timeline: seven to ten engineering days for a workload with under 50 services, including the shadow-traffic period.
Where it falls short:
-
agent-opt is opt-in, start with traceAI + ai-evaluation in week one and turn the optimizer on once eval baselines stabilize. The loop compounds value over weeks rather than at day one.
-
The flame-graph trace view is actively in development. Datadog APM’s flame graph carries a decade of polish; teams whose root-cause workflow lives in the flame graph every day should preview the FAGI trace view before standardizing.
Pricing: Free tier with 100K traces/month. Scale tier from $99/month with linear per-trace scaling above 5M (no add-on multipliers). Enterprise with SOC 2 Type II and AWS Marketplace.
Score: 7 of 7 axes.
2. Arize Phoenix: Best for OpenInference-native self-hosted exit
Verdict: Phoenix is the pick when the requirement is “this runs on our infrastructure, in the open schema, with source we can audit.” Apache 2.0, Python-native, and one of the two reference OpenInference implementations (Future AGI’s traceAI is the other). Trade-off: breadth. Phoenix is laser-focused on traces, datasets, and evals, with no gateway, no prompt registry, and no team-grade RBAC out of the box.
What it fixes versus Datadog LLM Observability:
- OpenInference is the storage schema, not a translation layer. Phoenix stores spans as OpenInference natively. Traces flowing in are portable to any other OpenInference consumer without transcoding.
- Free and self-hosted. Runs as a Docker container, a pip-installable Python package, or a Kubernetes deployment. No vendor in the loop, no per-trace billing, no agent footprint.
- Eval and dataset workflows are first-class. Phoenix’s experiment runner pairs prompts and datasets with evaluators, the workflow most LLM teams want and Datadog only partially supports.
Migration from Datadog LLM Observability: Replace the Datadog agent or OTel exporter with a Phoenix OTLP endpoint. Service code that uses OpenInference instrumentation (most modern LLM SDKs ship it via openinference-instrumentation-* packages) doesn’t change. Code that relies on ddtrace.llmobs needs to be replaced with the equivalent OpenInference instrumentor. You lose Datadog’s hosted multi-tenant dashboards and the APM-side correlation. Timeline: five to seven engineering days for a Python-heavy workload, longer if non-Python services need wrappers.
Where it falls short:
- No gateway, no routing, no virtual keys.
- No optimizer; traces feed evaluators, not a closed loop back into runtime behavior.
- RBAC, audit, and SSO live in Arize’s hosted product, not in OSS Phoenix.
- Self-host operations get harder above a few thousand traces per second; Arize sells the hosted scale path.
Pricing: Open source under Apache 2.0. Arize’s hosted product (Arize AX) is the upsell path with custom pricing.
Score: 4 of 7 axes (missing: gateway, optimizer, native RBAC).
3. Langfuse: Best for hosted LLM-native observability with prompts
Verdict: Langfuse is the pick when your reason for leaving is “Datadog’s LLM features are bolted onto APM and we want a tool whose core abstraction is the LLM trace.” MIT-licensed core with a hosted tier, mature prompt management, and an active eval module. Trade-off: the same one every observation-only tool in this list has, no gateway.
What it fixes versus Datadog LLM Observability:
- LLM-native data model. Traces, generations, observations, and scores are the core types, not retrofitted onto an APM schema. Multi-step agent flows show up as nested observations rather than a flat span list to mentally re-assemble.
- Mature prompt management. Langfuse’s prompt module versions prompts, links them to traces, and supports A/B comparisons. Datadog’s prompt-version comparison is still in beta as of May 2026.
- Standalone pricing. Hobby tier free for 50K observations/month; Pro from $59/month. No APM, Logs, or Infrastructure SKU to buy alongside.
- Self-host posture. The MIT-licensed core runs on Postgres and is the most common self-host story in the LLM-native observability category outside of Phoenix.
Migration from Datadog LLM Observability: OpenTelemetry collector with the Langfuse OTLP receiver, or direct instrumentation via the Langfuse SDK. Datadog-specific ddtrace.llmobs calls need to be rewritten. You lose APM-side correlation and the Datadog metrics surface. Timeline: five to seven engineering days for instrumentation cutover, plus a week if you adopt Langfuse Prompts as a prompt registry.
Where it falls short:
- No gateway. Pair Langfuse with LiteLLM, Helicone, or Future AGI’s gateway.
- No optimizer.
- Hosted EU and US regions split data residency; teams needing both regions plan accordingly.
Pricing: OSS core MIT-licensed. Hobby free up to 50K observations/month. Pro from $59/month. Enterprise custom.
Score: 5 of 7 axes (missing: gateway, optimizer).
4. Helicone: Best for lightweight hosted observability via proxy
Verdict: Helicone is the pick when your reason for leaving Datadog is pricing and your workload sits below 10M requests per month. Drop-in proxy with per-request cost telemetry, session traces, and a clean dashboard. One wrinkle: Helicone acquired Mintlify in March 2026, and parts of the docs surface have folded into Mintlify’s stack, the roadmap reflects the org change.
What it fixes versus Datadog LLM Observability:
- Friendlier pricing curve below 10M req/mo. Pro tier from $25/month scales more gently than the compound of Datadog’s LLM Observability + APM + Logs.
- Proxy-shaped instrumentation. Code change is a one-line
base_urlswap rather than agent deployment plus SDK wrapping. Services already pointed at the OpenAI or Anthropic SDK get instrumented for free. - Self-host option. Helicone’s open-source self-host (Apache 2.0) runs on Postgres + ClickHouse. The project’s own docs admit scale-out beyond a few hundred RPS gets non-trivial.
Migration from Datadog LLM Observability: Point the OpenAI or Anthropic SDK’s base_url at Helicone, set the auth header, and Helicone captures every request. Run in parallel with Datadog’s agent for a shadow week, validate cost and latency, then decommission the agent. Helicone’s Prompts product is less feature-rich than Langfuse or Future AGI’s, so many teams keep prompts in-repo as Jinja2 post-migration. You lose APM correlation and the Datadog metrics surface. Timeline: three to five engineering days without a prompt-registry replacement.
Where it falls short:
- No optimizer; traces inform humans, not the gateway.
- Routing intelligence is basic (round-robin and failover); cost-aware model routing requires upstream code.
- Self-host operations get harder above a few hundred RPS.
- The Mintlify acquisition is recent enough that some surfaces are still in flux.
Pricing: Free tier with 10K requests/month. Pro from $25/month. Enterprise custom.
Score: 4 of 7 axes (missing: optimizer, deep agent-graph view, prompt-version diffs).
5. Maxim Bifrost: Best for high-throughput gateway with eval
Verdict: Maxim’s Bifrost is the pick when the workload is high-concurrency and you want both a gateway and an eval pipeline from one vendor. Bifrost is written in Go, designed for low-latency routing, and benchmarks above Python-based proxies on RPS per node. The Maxim eval stack handles offline experiments and online scoring.
What it fixes versus Datadog LLM Observability:
- Gateway in the picture. Unlike Phoenix, Langfuse, and Helicone, Bifrost is a gateway-first product with throughput as the headline feature.
- Throughput per node. The Go runtime plus connection pooling gives Bifrost higher RPS per node than Python-based proxies on the same hardware. Maxim’s published benchmarks claim sub-millisecond overhead at p50; independent reproduction is ongoing.
- Tight integration with Maxim’s eval stack. Traces flow into Maxim’s eval pipeline without an OTel hop. If your team uses Maxim for offline evaluations, the gateway and the evals share data models.
- Self-host posture. Runs as a Go binary, container, helm chart, or static binary on a VM.
Migration from Datadog LLM Observability: OpenAI-compatible endpoint via Bifrost’s proxy. Replace the Datadog agent with the Bifrost binary and route traffic. Maxim’s eval pipeline accepts traces directly; configure the sink and redeploy. You lose APM-side correlation. Timeline: five to eight engineering days for the gateway cutover, plus another week if the team also adopts Maxim eval as the primary scoring surface.
Where it falls short:
- No optimizer in the closed-loop sense; eval scores feed back to humans, not to a runtime policy update.
- Younger than Phoenix, Langfuse, or Helicone in this category; the OSS ecosystem (Terraform providers, off-the-shelf Grafana dashboards) is thinner.
- Throughput is the headline; teams that picked Datadog for ops familiarity rather than gateway throughput won’t feel the upside.
Pricing: Bifrost is open source. Maxim’s hosted gateway pricing is custom, typically anchored to the eval product’s usage.
Score: 4 of 7 axes (missing: optimizer, mature OSS ecosystem, native prompt registry).
Capability matrix
| Axis | Future AGI | Arize Phoenix | Langfuse | Helicone | Maxim Bifrost |
|---|---|---|---|---|---|
| OpenInference-native traces | Yes (traceAI reference impl) | Yes (reference impl) | OTel + native SDK | Proxy-shaped, OTel sink available | OTel sink available |
| Standalone pricing (no bundle) | Yes | OSS, free | Yes | Yes | Yes |
| Native gateway or routing | Yes (Agent Command Center) | No | No | Proxy-shaped, basic routing | Yes (Bifrost) |
| Self-host posture | BYOC + OSS instrumentation | OSS, Apache 2.0 | OSS core, MIT | Apache 2.0 self-host | OSS Go binary |
| Eval + optimizer loop | Yes (ai-evaluation + agent-opt) | Eval only, no optimizer | Eval only, no optimizer | No | Eval via Maxim, no optimizer |
| Agent-graph depth | Native sessions + multi-step | Native | Native nested observations | Per-request | Per-request |
| Datadog migration tooling | Importer + dual-write recipes | Collector config templates | Collector config templates | base_url swap | Gateway swap |
Migration notes: what breaks when leaving Datadog LLM Observability
Three surfaces always need attention.
Re-pointing the OpenTelemetry exporter
If your services already emit OpenTelemetry spans to Datadog via the OTel collector (the cleaner of the two deployment paths), migration is a collector-config change. Swap the Datadog exporter for the destination’s OTLP receiver, update endpoint and auth, and redeploy. The OpenInference instrumentation in service code stays untouched. This is the five-minute path on paper, the half-day path in practice once you account for shadow-traffic validation and metric-name reconciliation.
If your services use the Datadog agent path with ddtrace.llmobs SDK calls, migration is heavier. The Datadog SDK’s LLM Observability API is proprietary; every call needs to be rewritten to the equivalent OpenInference instrumentor. Community packages (openinference-instrumentation-openai, openinference-instrumentation-anthropic, openinference-instrumentation-langchain, etc.) cover most SDKs out of the box. Custom spans need a one-time port.
Replacing the DD agent
The Datadog agent is convenient when you already run it for APM, Logs, and Infrastructure. It’s a migration blocker when you only run it for LLM Observability. Replacement looks different across patterns: in Kubernetes, the DaemonSet is removed and an OTel collector takes its place as a sidecar or DaemonSet; in ECS, the agent task definition is replaced; in plain VMs, the systemd unit is swapped. The redeployment ripples through every workload whose Helm chart or task definition references the agent. Plan a phased rollout, agent and collector side by side during validation, rather than a big-bang cutover.
Reconciling APM-side metrics and custom dashboards
The deepest hidden cost of leaving Datadog LLM Observability is the APM-side context you give up: services and endpoints map traces to upstream callers, custom dashboards mix LLM cost with database and HTTP latency, and the metric explorer slices by service, env, and version. The replacements in this list are LLM-native and won’t reproduce that view. Most teams take this as a feature, the goal of leaving is to stop paying for APM correlation. But the migration plan needs to name which Datadog dashboards aren’t coming with you and which OSS or hosted equivalents (Grafana with the destination’s data source, plus a custom panel for LLM cost) replace them.
Decision framework: Choose X if
Choose Future AGI if your reason for leaving is more than pricing or bundle fatigue, you also want trace data to drive prompt rewrites and routing-policy updates over time. Pick this when production agent workloads are becoming a significant line item and the OSS instrumentation (traceAI, ai-evaluation, agent-opt) plus the hosted Command Center, with the Protect guardrails layer at ~67 ms latency, justify the migration as a structural one rather than a SKU swap.
Choose Arize Phoenix if the requirement is “this runs on our hardware, in the open schema, free.” Pick this when you have engineering budget for a separate gateway and you value strict OpenInference fidelity above hosted polish.
Choose Langfuse if you want a hosted LLM-native observability suite with mature prompt management and you don’t need a gateway from the same vendor. Pick this when the Datadog complaint is “the data model is APM, not LLM,” not “the bill is too big.”
Choose Helicone if your reason for leaving is pricing and you’re well below 10M requests per month. Pick this for straightforward workloads where a proxy-shaped tool with per-request cost and session traces is enough.
Choose Maxim Bifrost if your reason for leaving includes “we want a single vendor for gateway and eval, and throughput per node matters.” Pick this when the proxy hop’s own latency budget shows up in your SLOs.
What we did not include
Three products show up in other 2026 Datadog LLM Observability alternatives listicles that we left out: New Relic AI Monitoring (capable, but the same APM-foundation critique applies, if you’re leaving Datadog for being APM-rooted, New Relic is a sideways move); Honeycomb for LLMs (excellent for high-cardinality trace exploration but the LLM-specific surface is thinner than this cohort’s as of May 2026); Grafana Cloud with Tempo + Loki (powerful self-build, but the LLM-native dashboards, prompt management, and eval workflow are entirely on you, closer to a foundation than a product).
Related reading
- Best 5 LangSmith Alternatives in 2026
- Best LLM Observability Tools in 2026
- What Is LLM Observability? The 2026 Definition
Sources
- Datadog LLM Observability product page, datadoghq.com/product/llm-observability
- Datadog LLM Observability pricing, datadoghq.com/pricing
- Datadog earnings transcripts and AI-related commentary, Q1 and Q4 2025
- Hacker News threads on Datadog LLM Observability and pricing, 2025 to 2026
- Reddit /r/dataengineering and /r/LLMDevs migration discussions, February-May 2026
- OpenInference specification, github.com/Arize-ai/openinference
- Arize Phoenix, github.com/Arize-ai/phoenix (Apache 2.0)
- Langfuse, github.com/langfuse/langfuse (MIT core)
- Helicone, github.com/Helicone/helicone (Apache 2.0 self-host)
- Helicone acquisition of Mintlify, March 2026, helicone.ai/blog
- Maxim Bifrost product page and benchmarks, getmaxim.ai/bifrost
- Future AGI Agent Command Center, futureagi.com/platform/monitor/command-center
- Future AGI traceAI, github.com/future-agi/traceAI (Apache 2.0)
- Future AGI ai-evaluation, github.com/future-agi/ai-evaluation (Apache 2.0)
- Future AGI agent-opt, github.com/future-agi/agent-opt (Apache 2.0)
- Future AGI Protect latency benchmark, arxiv.org/abs/2510.13351 (67 ms text, 109 ms image)
Frequently asked questions
Why are people moving off Datadog LLM Observability in 2026?
What is the closest like-for-like alternative?
Can I migrate without changing service code?
How do I replace the Datadog agent?
Is there an open-source Datadog LLM Observability alternative?
Which alternative is cheapest at scale?
How does Future AGI Agent Command Center compare to Datadog LLM Observability?
Five Pydantic AI alternatives scored on multi-agent depth, language reach, observability without Logfire, optimizer presence, and what each replacement actually fixes for teams who outgrew the type-system-first framework.
Future AGI vs LangSmith scored on tracing, evaluation, prompt management, deployment, security, and developer experience. Honest verdict, May 2026 pricing, where each one falls short, and why only one closes the loop.
Five Eyer AI alternatives scored on multi-language SDK coverage, self-host posture, gateway and optimizer reach, and what each replacement actually fixes for teams outgrowing AI-monitoring-only tooling.