Arize AI Alternatives in 2026: 5 Honest Picks
Honest 2026 comparison of the best Arize AI alternatives: Future AGI, Langfuse, LangSmith, Braintrust, Datadog. Pricing, gateway, eval depth, license.
Arize started as ML observability and added agent observability later. That history is the entire reason teams shopping alternatives in 2026 land on the same three complaints: AX pricing reads enterprise but customization stops short; the agent-era story is bolted on top of a prompt-shaped trace UI; and there is no native gateway, no inline guardrails, so the runtime control layer has to be stitched from a different vendor. The five alternatives below cover most procurement shortlists. What separates them is where each closes the gap you actually hit. Last updated May 20, 2026.
TL;DR: best Arize alternative per use case
| Use case | Best pick | Why | Pricing | License |
|---|---|---|---|---|
| Trajectory eval + tracing + gateway + guardrails on one Apache 2.0 plane | Future AGI | traceAI + eval-stack + Agent Command Center + Error Feed | Free + usage | Apache 2.0 |
| Self-hosted observability with prompts and datasets | Langfuse | Mature OSS traces, prompts, datasets, evals | Core $29/mo | MIT core |
| LangChain or LangGraph runtime | LangSmith | Native chain and graph trace semantics | Plus $39/seat/mo | Closed, MIT SDK |
| Closed-loop eval workbench with the best UI | Braintrust | Experiments, scorers, sandboxed agent evals | Pro $249/mo | Closed |
| Already on Datadog for everything else | Datadog LLM Observability | LLM spans next to APM and infra | APM $31/host + add-on | Closed |
One-row summary. Pick Future AGI when trajectory has to be the unit and the runtime control layer (gateway, guardrails) has to live on the same plane as traces and evals. Pick LangSmith when LangGraph is the runtime. Pick Datadog when the constraint is one tool for everything.
Why teams leave Arize in 2026
Arize owned ML observability before LLM tracing was a category. Phoenix is a real developer tool. AX is a credible enterprise product. If your company already runs Arize around tabular ML or drift, procurement is easier than buying a new platform. The agent era is a different problem.
Three patterns repeat in procurement conversations.
- Pricing reads enterprise; customization stops short. AX Pro at $50/mo looks fair until a team wants a custom judge model, an in-house grader replacing a Turing-grade scorer, non-Phoenix span attributes in the same UI, or a trajectory metric that does not ship in the box. Those asks route to a custom contract conversation. The platform is opinionated where it should be flexible.
- The agent-era story was bolted on. Tool Correctness, Plan Adherence, planner depth, recovery rate read as scorers grafted onto a prompt-shaped trace tree rather than first-class span attributes you can filter, alert, and gate against. The OpenInference reference lives at Arize, which is to its credit, but trajectory metrics live in the eval surface, not on the span.
- No native gateway, no inline guardrails. AX is observe + evaluate. Routing across providers, fallback, caching, runtime PII redaction, prompt injection scanning, tool permission enforcement, and budget gates live in a separate product if they live at all. Stitching a gateway from one vendor and a guardrail layer from another under an observability product from a third is where teams lose the loop.
The five alternatives below differ in which gap they close. Future AGI closes all three Arize gaps on one plane. Langfuse closes the OSS observability gap. LangSmith closes the LangChain ergonomics gap. Braintrust closes the structured-eval-UI gap. Datadog closes the one-tool-for-everything gap.
License and hosting posture across the alternatives
| Platform | License | Hosting |
|---|---|---|
| Future AGI | Apache 2.0 OSS | Self-host + managed cloud |
| Langfuse | MIT core, enterprise directories separate | Self-host + managed cloud |
| LangSmith | Closed platform, MIT SDK | Cloud + Enterprise self-host |
| Braintrust | Closed platform | Cloud primary; Enterprise self-host |
| Datadog LLM Observability | Closed platform | SaaS only |
| Arize Phoenix (for comparison) | Elastic License 2.0 (source-available) | Self-host only |
| Arize AX (for comparison) | Closed platform | SaaS + Enterprise self-host add-on |
Two notes worth pinning before procurement runs vendor diligence. First, Elastic License 2.0 is not OSI open source; it restricts hosted-service offerings and several legal teams now treat ELv2 the same way they treat BSL. If “open source” is a hard requirement, the only platform-scope Apache 2.0 pick here is Future AGI; Langfuse core is MIT with the enterprise directories on a separate license. Second, self-hostable and OSS are different things. Phoenix, Langfuse, and Future AGI all have self-hosted paths, but their licenses, enterprise gates, and operating footprints differ.
The 5 Arize alternatives, compared
1. Future AGI: trajectory-native eval, tracing, gateway, and guardrails on one Apache 2.0 plane
Open source. Self-hostable. Hosted cloud. Eval-stack package.
Quick take. Future AGI is the pick when the three gaps Arize leaves open all close on the same plane. The eval stack ships as a package: ai-evaluation is the code-first surface with 50+ EvalTemplate classes (Tool Correctness, Plan Adherence, Goal Adherence, Task Completion, Groundedness, Hallucination, PII, Toxicity, and the rest); traceAI carries the same rubric as span-attached scores on live traces; Agent Command Center is a Go binary under Apache 2.0 that fronts 100+ providers with 18+ built-in guardrail scanners on the same trace stream. Error Feed sits inside the eval stack: HDBSCAN clusters failing traces and a Sonnet 4.5 Judge writes the immediate fix, so a tool-call regression becomes a labeled dataset row instead of a Jira ticket.
Ideal for. Teams running RAG agents, voice agents, support automation, or copilots where a missed tool call in production should land as a failing test case before the next release, and where the runtime control layer (routing, caching, PII redaction, prompt injection scanning) has to live on the same plane as traces and evals.
Key strengths.
- traceAI auto-instruments 50+ AI surfaces across 4 languages, including LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Pydantic AI, DSPy, Mastra, Spring AI, LangChain4j, Semantic Kernel. 14 span kinds against Phoenix’s 8.
- Trajectory metrics as first-class span attributes. Tool Correctness, Plan Adherence, Goal Adherence, Task Completion ship as pytest CI scorers and online scorers; lower per-eval cost than Galileo Luna-2 at comparable accuracy on the published rubrics. BYOK so any LLM can be the judge at zero platform fee.
- Error Feed inside the eval stack. HDBSCAN soft-clustering over ClickHouse, a Sonnet 4.5 Judge with a 30-turn budget and 8 span tools, a 5-category 30-subtype failure taxonomy. The post-incident loop closes without manual export.
- Agent Command Center gateway on the same plane. 18+ built-in scanners (PII, prompt injection, content moderation, secret detection, hallucination, topic restriction, MCP security, tool permissions, custom expression rules, webhook BYOG) plus 15 third-party adapters (Lakera Guard, Presidio, Llama Guard, AWS Bedrock Guardrails, Azure Content Safety, Pangea, Aporia, Enkrypt AI, others). Benchmarked at ~29k req/s, P99 21 ms with guardrails on, on t3.xlarge.
- Compliance. SOC 2 Type II + HIPAA + GDPR + CCPA per futureagi.com/trust; ISO/IEC 27001 in active audit.
Honest limitations. More moving parts than LangSmith inside a LangChain app or a single-purpose tracer like Phoenix. ClickHouse, Postgres, Redis, Temporal, and the gateway are real services on self-host; the hosted cloud is the easier path. Native-adapter coverage is strongest on OpenAI, Anthropic, Gemini, Bedrock, Cohere, and Azure.
Pricing intelligence. Free to start with generous limits; usage-based after that. Compliance add-ons (HIPAA BAA, SAML SSO + SCIM) layer per tier. Pricing is usage-based rather than per-seat.
Verdict. Pick Future AGI when the trajectory has to be the unit, the post-incident loop has to close back into pre-prod tests without manual export, and the gateway plus guardrails have to enforce the same eval contract the trace tree carries. Buying signal: your team has already stitched Phoenix for traces, a notebook for optimization, a separate gateway, and a guardrail vendor, and watched the same incident class repeat because the handoffs lost fidelity.
2. Langfuse: best for self-hosted observability with prompts and datasets
Open-source core (MIT). Self-hostable. Hosted cloud option.
Quick take. The strongest OSS-first Arize alternative when the main job is self-hosted tracing with prompt versioning and dataset-driven evals. Active project, large community, mature self-hosting story. Where things get thin is everything outside that scope: trajectory metrics, simulation, gateway, guardrails, and prompt optimization live in adjacent tools.
Ideal for. Platform teams that operate the data plane, want trace data in their own infrastructure, and pair Langfuse with a CI eval framework.
Key strengths.
- MIT core; mature architecture across Postgres, ClickHouse, Redis, object storage, queues, workers.
- Prompt management with labels, environments, version diffs; datasets, runs, human annotation queues.
- OpenTelemetry ingestion; LiteLLM proxy logging; broad framework integrations.
- Experiments CI/CD integration shipped May 2026, closing a real release-gate gap for OSS-first teams.
Honest limitations. Trajectory metrics are not first-class. 5 span kinds against traceAI’s 14, and the trace UI is LLM-shaped, not trajectory-shaped. Simulation, voice eval, prompt optimization, and runtime guardrails are not built in. Enterprise directories ship under a separate commercial license outside MIT.
Pricing intelligence. Hobby free with 50K units/mo, 30 days access, 2 users. Core $29/mo with 100K units, $8 per additional 100K, 90 days access, unlimited users. Pro $199/mo. Enterprise $2,499/mo. A “unit” covers a trace, observation, score, or evaluation, which is why production cost compounds at agent scale: one user request can write a dozen units across router, retriever, tool, judge, and post-processor spans.
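To make the compounding concrete, here is a minimal sketch that turns the Core-tier figures quoted above into a monthly estimate. The function name is hypothetical, and the billing rule is an assumption: it rounds overage up to whole 100K-unit blocks, which the pricing text above does not confirm.

```python
import math

# Figures quoted in the pricing paragraph above (Langfuse Core tier).
CORE_BASE_USD = 29.0            # monthly base price
CORE_INCLUDED_UNITS = 100_000   # units included in the base price
OVERAGE_USD_PER_BLOCK = 8.0     # price per additional block of units
OVERAGE_BLOCK_UNITS = 100_000   # size of one overage block


def langfuse_core_monthly_cost(requests_per_month: int, units_per_request: int) -> float:
    """Estimate the Core-tier bill for a given request volume.

    Assumption (not confirmed by the source): overage is billed in whole
    100K-unit blocks, so a partial block is rounded up.
    """
    total_units = requests_per_month * units_per_request
    overage_units = max(0, total_units - CORE_INCLUDED_UNITS)
    overage_blocks = math.ceil(overage_units / OVERAGE_BLOCK_UNITS)
    return CORE_BASE_USD + overage_blocks * OVERAGE_USD_PER_BLOCK


if __name__ == "__main__":
    # A dozen units per request, as in the agent example above.
    for requests in (5_000, 100_000, 1_000_000):
        cost = langfuse_core_monthly_cost(requests, units_per_request=12)
        print(f"{requests:>9,} requests/mo -> ${cost:,.2f}/mo")
```

Under these assumptions, 100,000 requests a month at a dozen units each comes to $117, roughly four times the headline price, and a million requests comes to $981.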
Verdict. Pick Langfuse if OSS observability with prompts and datasets is the entire requirement and the team can pair it with external eval and guardrail layers. Skip if you need trajectory metrics on the span itself or a runtime gateway. See Langfuse Alternatives.
3. LangSmith: best for LangChain and LangGraph runtimes
Closed platform. MIT SDK. Cloud, hybrid, and Enterprise self-hosting.
Quick take. Lowest-friction first pick when LangGraph is the runtime. Native trace semantics for chains, graphs, retrievers, tools, and prompts; Playground replay, Fleet deployment, and Studio graph visualization in one product. Outside LangChain, the value drops fast.
Ideal for. LangChain v1 and LangGraph teams who want eval, deployment, and observability in the same mental model as the runtime.
Key strengths.
- LangGraph spans render as the actual graph, not a flat list.
- Playground replay, Prompt Hub, annotation queues, Fleet deployment, Studio graph visualization.
- Cloud, hybrid, and Enterprise (VPC) self-hosting, with LangSmith Self-Hosted v0.13 shipping more parity in January 2026 for data-residency requirements.
- Same-day support for new LangChain releases.
Honest limitations. Framework coupling cuts both ways. Custom agents, LiteLLM, direct provider SDKs, or non-LangChain orchestration see the value drop. Seat pricing makes cross-functional access expensive. No first-party simulation, no integrated gateway, no inline guardrails. Base trace overage at $2.50/1K and extended traces at $5.00/1K stack up at high volume.
Pricing intelligence. Developer free with 5,000 base traces/mo, 1 Fleet agent, 50 Fleet runs, 1 seat. Plus $39/seat/mo with 10,000 base traces, unlimited Fleet agents, 500 Fleet runs, up to 3 workspaces. Enterprise custom.
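The seat and overage figures can be combined into a rough Plus-tier estimate. The sketch below is illustrative only. It assumes the 10,000 included base traces are pooled across the organization rather than granted per seat, that overage is prorated per trace, and that every extended trace is billed at the extended rate; none of these rules is stated in the text above, and the function name is hypothetical.

```python
# Figures quoted in this section (LangSmith Plus tier).
PLUS_SEAT_USD = 39.0                 # per seat per month
PLUS_INCLUDED_BASE_TRACES = 10_000   # base traces included per month
BASE_OVERAGE_USD_PER_1K = 2.50       # base trace overage
EXTENDED_USD_PER_1K = 5.00           # extended trace rate


def langsmith_plus_monthly_cost(seats: int, base_traces: int, extended_traces: int = 0) -> dict:
    """Split an estimated Plus bill into seat cost and trace cost.

    Assumptions (not from the source): included traces are pooled across
    the organization, and overage is prorated rather than billed in blocks.
    """
    seat_cost = seats * PLUS_SEAT_USD
    base_overage = max(0, base_traces - PLUS_INCLUDED_BASE_TRACES)
    trace_cost = (
        base_overage / 1000 * BASE_OVERAGE_USD_PER_1K
        + extended_traces / 1000 * EXTENDED_USD_PER_1K
    )
    return {"seats": seat_cost, "traces": trace_cost, "total": seat_cost + trace_cost}


if __name__ == "__main__":
    # Ten seats and half a million base traces in a month.
    print(langsmith_plus_monthly_cost(seats=10, base_traces=500_000))
```

With ten seats and 500,000 base traces, this model puts seats at $390 and trace overage at $1,225, so volume rather than headcount drives the bill at that scale.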
Verdict. Pick LangSmith if LangChain is the runtime and framework-native ergonomics matter more than open-source control. Skip if the stack is heterogeneous or the gateway and guardrail problem is real. See LangSmith Alternatives.
4. Braintrust: best for closed-loop eval workbench with strong UI
Closed platform. Hosted cloud or Enterprise self-host.
Quick take. Best eval UI in the closed category. Experiments, datasets, scorers, prompt iteration, online scoring, and CI gating in one product, with sandboxed agent evaluation for tool-calling agents. Center of gravity is structured evals, not the full agent loop.
Ideal for. Teams that prefer to buy rather than build, want experiments and scorers in one polished UI, and don’t need OSS control or a native gateway.
Key strengths.
- Polished UI for experiments, scorers, datasets, and prompt iteration.
- Sandboxed agent evaluation with tool-call execution; agent-evals more developed than Langfuse or Phoenix.
- Online scoring and CI gates in the same product as offline experiments.
- May 2026 Java auto-instrumentation for Spring AI and LangChain4j.
Honest limitations. Closed platform; Enterprise-only self-host. No first-party voice simulator. Gateway is a developer convenience, not a runtime enforcement layer. Inline guardrails and prompt optimization are not first-class. Pro at $249/mo is the highest entry-tier outside enterprise; overage adds up at production scale.
Pricing intelligence. Starter $0 with 1 GB, 10K scores, 14 days retention, unlimited users. Pro $249/mo with 5 GB, 50K scores, 30 days. Overage on Starter $4/GB and $2.50/1K scores; on Pro $3/GB and $1.50/1K. Enterprise custom.
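Because both tiers carry overage rates, the cheaper tier depends on volume. The following sketch compares the two using the figures above. It assumes overage is prorated and ignores the retention difference (14 against 30 days); the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    base_usd: float
    included_gb: float
    included_scores: int
    overage_usd_per_gb: float
    overage_usd_per_1k_scores: float


# Figures quoted in the pricing paragraph above.
STARTER = Tier("Starter", 0.0, 1, 10_000, 4.0, 2.50)
PRO = Tier("Pro", 249.0, 5, 50_000, 3.0, 1.50)


def monthly_cost(tier: Tier, data_gb: float, scores: int) -> float:
    """Base price plus prorated overage (proration is an assumption)."""
    extra_gb = max(0.0, data_gb - tier.included_gb)
    extra_scores = max(0, scores - tier.included_scores)
    return (
        tier.base_usd
        + extra_gb * tier.overage_usd_per_gb
        + extra_scores / 1000 * tier.overage_usd_per_1k_scores
    )


def cheaper_tier(data_gb: float, scores: int) -> Tier:
    """Return whichever tier costs less at this volume."""
    return min((STARTER, PRO), key=lambda t: monthly_cost(t, data_gb, scores))


if __name__ == "__main__":
    for gb, scores in ((5, 50_000), (20, 200_000)):
        best = cheaper_tier(gb, scores)
        print(f"{gb} GB, {scores:,} scores -> {best.name} at ${monthly_cost(best, gb, scores):,.2f}")
```

At 5 GB and 50K scores, Starter with overage comes to $116 against Pro's $249. At 20 GB and 200K scores the order flips: $551 on Starter against $519 on Pro.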
Verdict. Pick Braintrust if structured evals with a polished UI is the dominant problem and gateway, guardrails, and simulation are off the requirement list. See Braintrust Alternatives.
5. Datadog LLM Observability: best when Datadog is already the standard
Closed platform. SaaS with regional residency. APM-integrated.
Quick take. Datadog ships LLM Observability as an APM add-on. The pitch is one tool for everything: LLM spans next to APM, infrastructure metrics, logs, and security, correlated with database queries, downstream service latency, and infrastructure events.
Ideal for. Enterprise teams where Datadog is the system of record and unified APM + LLM observability with shared dashboards and on-call rotations is the goal.
Key strengths.
- LLM spans inside the same product as APM, logs, RUM, security, and infra metrics.
- Infrastructure correlation: LLM latency next to DB latency next to downstream service latency.
- Mature enterprise security posture and SRE workflows.
- Scales to high-volume span ingestion on Datadog’s existing backend.
Honest limitations. Eval surface is shallower than dedicated LLM platforms. No first-party simulator, fewer built-in metric primitives, no integrated runtime guardrails. Cost scales fast with span volume; Datadog bills per ingested span plus per indexed log. Path of least resistance is the Datadog SDK, not OTel, which lowers portability.
Pricing intelligence. APM at $31 per host per month annual billing, plus LLM Observability add-on metered per ingested span and per indexed log. Expect contracts above $1,000/mo at modest scale; production teams enter five-figure monthly contracts quickly.
Verdict. Pick Datadog LLM Observability when Datadog is the system of record and one-tool consolidation beats eval depth. Pair with Future AGI or Braintrust if eval and trajectory scoring become the bottleneck. See Braintrust vs Datadog.

Capability coverage across the 5 alternatives
| Capability | Future AGI | Langfuse | LangSmith | Braintrust | Datadog |
|---|---|---|---|---|---|
| Span kinds (count) | 14 | 5 | LangChain-native | proprietary | OTel + APM |
| Trajectory metrics (Tool Correctness, Plan Adherence) | First-class span attribute | Manual scorer | Manual scorer | Manual scorer | Manual scorer |
| Span-attached evals | Full (50+ metrics) | Partial | Partial | Full | Partial |
| Voice + text simulation | Full | None | None | None | None |
| LLM gateway | Full (Agent Command Center, 100+ providers) | None | None | Partial | None |
| Inline guardrails | Full (18+ built-in scanners + 15 third-party adapters) | None | None | None | None |
| OTel + OpenInference | Full (traceAI, 50+ surfaces) | Partial | Partial | Partial | Full (OTel + APM) |
| Self-host license | Apache 2.0 | MIT core, enterprise separate | Enterprise-only | Enterprise-only | None |
For comparison, Phoenix ships 8 span kinds under Elastic License 2.0 and does not ship gateway or inline guardrails; AX adds product observability, RBAC, online evals, and Alyx on top.
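A "Manual scorer" cell in the table means the team writes the trajectory metric itself. The sketch below shows one way that could look. Everything in it is an assumption for illustration: the `Span` shape is hypothetical rather than any vendor's schema, and Tool Correctness is defined here as the share of expected tool calls that appear in order in the actual trajectory, which may differ from how a given platform defines it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical span shape; real platforms use their own schemas."""
    name: str
    kind: str  # e.g. "agent", "llm", "tool"
    children: list[Span] = field(default_factory=list)
    scores: dict[str, float] = field(default_factory=dict)


def tool_calls(span: Span) -> list[str]:
    """Collect tool names under a span in depth-first (execution) order."""
    calls: list[str] = []
    for child in span.children:
        if child.kind == "tool":
            calls.append(child.name)
        calls.extend(tool_calls(child))
    return calls


def tool_correctness(actual: list[str], expected: list[str]) -> float:
    """Share of expected calls matched in order (longest common subsequence)."""
    if not expected:
        return 1.0 if not actual else 0.0
    # Classic LCS dynamic program over the two call sequences.
    lcs = [[0] * (len(expected) + 1) for _ in range(len(actual) + 1)]
    for i, a in enumerate(actual, start=1):
        for j, e in enumerate(expected, start=1):
            lcs[i][j] = lcs[i - 1][j - 1] + 1 if a == e else max(lcs[i - 1][j], lcs[i][j - 1])
    return lcs[-1][-1] / len(expected)


if __name__ == "__main__":
    trace = Span("refund_agent", "agent", children=[
        Span("plan", "llm"),
        Span("lookup", "agent", children=[Span("search_orders", "tool")]),
        Span("issue_refund", "tool"),
    ])
    expected = ["search_orders", "check_policy", "issue_refund"]
    # Attach the score to the non-leaf agent span, not only the final answer.
    trace.scores["tool_correctness"] = tool_correctness(tool_calls(trace), expected)
    print(trace.scores)
```

The score lands on the root agent span, a non-leaf node. Whether a platform lets you filter and alert on that attribute is the difference the table row describes.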
Decision framework: choose X if
- Future AGI if the trajectory has to be the unit, eval has to live on the span, and the runtime control layer (gateway, guardrails) has to live on the same plane. Buying signal: your team has already stitched Phoenix for traces, a notebook for prompt optimization, a separate gateway, and an inline guardrail vendor, and the same incident class keeps repeating.
- Langfuse if OSS observability with prompts, datasets, and a mature self-hosting story is the entire requirement, and you can pair it with external eval and guardrail layers.
- LangSmith if LangChain or LangGraph is the runtime and framework-native ergonomics matter more than open-source control.
- Braintrust if structured evals with a polished UI is the dominant problem and you don’t need a native gateway, inline guardrails, or simulation.
- Datadog LLM Observability if Datadog is already the system of record and one-tool consolidation beats eval depth.
Common mistakes when picking an Arize alternative
- Collapsing Phoenix and AX into one product. Phoenix is source-available under Elastic License 2.0. AX is commercial. Price, license, and feature claims need separate rows in the evaluation sheet.
- Treating self-hostable as the same thing as OSS. Phoenix is ELv2. Langfuse core is MIT with enterprise directories on a separate license. Future AGI is Apache 2.0 across the stack. Check legal language before architecture review.
- Pricing only the subscription. Real cost is subscription plus trace volume, score volume, judge tokens, retries, storage retention, annotation labor, and the team running the stack. A “unit” in Langfuse compounds at agent scale; a span in Datadog compounds at trace density.
- Scoring final answers only. Multi-step agents fail through tool selection, retrieval misses, retries, state drift, loop behavior, and partial refusal. Require trace-level, session-level, and path-aware evaluation, and verify the evaluator can attach scores to non-leaf spans.
- Skipping the migration plan. Tracing is the easy half if OTel and OpenInference are already in place. Datasets, scorers, prompts, human review queues, and CI gates are the hard half.
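The "pricing only the subscription" mistake is easier to avoid with a line-item model. The sketch below is one possible shape for it. Every field name is hypothetical and every number in the example is a placeholder chosen for illustration, not a quote from any vendor.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    """All fields are placeholders to be replaced with your own figures."""
    subscription_usd: float
    traces: int
    ingest_usd_per_1k_traces: float
    judge_sampling_rate: float      # fraction of traces sent to a judge model
    judge_retry_rate: float         # extra judge calls caused by retries
    judge_tokens_per_call: int
    judge_usd_per_1m_tokens: float
    stored_gb: float
    storage_usd_per_gb_month: float
    annotation_hours: float
    annotation_usd_per_hour: float
    ops_engineer_fraction: float    # share of one engineer running the stack
    engineer_usd_per_month: float


def cost_breakdown(u: MonthlyUsage) -> dict:
    """Return each cost line separately so the subscription share is visible."""
    judge_calls = u.traces * u.judge_sampling_rate * (1 + u.judge_retry_rate)
    judge_tokens = judge_calls * u.judge_tokens_per_call
    return {
        "subscription": u.subscription_usd,
        "ingestion": u.traces / 1000 * u.ingest_usd_per_1k_traces,
        "judge_tokens": judge_tokens / 1_000_000 * u.judge_usd_per_1m_tokens,
        "storage": u.stored_gb * u.storage_usd_per_gb_month,
        "annotation": u.annotation_hours * u.annotation_usd_per_hour,
        "operations": u.ops_engineer_fraction * u.engineer_usd_per_month,
    }


if __name__ == "__main__":
    usage = MonthlyUsage(
        subscription_usd=249, traces=2_000_000, ingest_usd_per_1k_traces=0.5,
        judge_sampling_rate=0.1, judge_retry_rate=0.05,
        judge_tokens_per_call=1_500, judge_usd_per_1m_tokens=3.0,
        stored_gb=200, storage_usd_per_gb_month=0.1,
        annotation_hours=40, annotation_usd_per_hour=60,
        ops_engineer_fraction=0.25, engineer_usd_per_month=15_000,
    )
    lines = cost_breakdown(usage)
    total = sum(lines.values())
    for name, usd in lines.items():
        print(f"{name:<13} ${usd:>9,.2f}  ({usd / total:.0%})")
```

With these placeholder inputs the subscription is roughly 3% of the modeled total. The exact split will differ for every team, but the exercise shows which line to negotiate and which to engineer down.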
Recent eval platform updates
| Date | Event | Why it matters |
|---|---|---|
| May 5, 2026 | Phoenix added provider tools in Playground and Prompts | Phoenix can store and round-trip vendor-native tools such as web search, code execution, file search, computer use, and Gemini grounding. |
| Apr 13, 2026 | Arize AX shipped RBAC GA, plus Alyx improvements through April | AX moved deeper into enterprise control and agent-assisted workflows; Alyx still needs validation against your security and eval needs. |
| May 2026 | Langfuse shipped Experiments CI/CD integration | OSS-first teams can run experiments in GitHub Actions and catch quality regressions before release. |
| Mar 19, 2026 | LangSmith Agent Builder became Fleet | LangSmith is expanding from trace and eval workflows into managed agent operations for LangChain teams. |
| Mar 9, 2026 | Future AGI shipped Agent Command Center and ClickHouse trace storage | Gateway routing, guardrails, cost controls, and high-volume trace analytics moved into the same reliability loop as traces and evals. |
| Jan 16, 2026 | LangSmith Self-Hosted v0.13 shipped | Enterprise teams got more parity for self-managed LangSmith deployments, which matters for trace data residency. |
How to evaluate this for production
- Run a domain reproduction. Export a slice of real traces (failures, long-tail prompts, tool calls, retrieval misses, hand-labeled outcomes). Instrument each candidate with your OTel payload shape, prompt versions, and judge model.
- Measure reliability under load. Track p50, p95, p99 ingestion, dropped spans, failed judge calls, retry count, query latency, and alert delay as concurrency rises.
- Cost-adjust. Real cost is the platform price plus the costs that scale with trace volume, judge sampling rate, retry rate, storage retention, and annotation hours.
- Verify the runtime control layer. If gateway, guardrails, and provider routing live in a different vendor than traces and evals, model the handoff cost honestly. A failing trace that does not return to a release gate is a failing trace that ships again.
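For the load test in the list above, the summary statistics are simple to compute once you have per-span ingestion latencies and span counts from each candidate. This is a minimal sketch using the nearest-rank percentile definition; the function names and the input shape are assumptions, and how you collect the samples depends on each platform.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_load_run(latencies_ms: list, spans_sent: int, spans_stored: int) -> dict:
    """Summarize one load-test run at a fixed concurrency level."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "dropped_span_rate": (spans_sent - spans_stored) / spans_sent,
    }


if __name__ == "__main__":
    # Synthetic latencies purely to exercise the functions.
    fake_latencies = [float(ms) for ms in range(1, 101)]
    print(summarize_load_run(fake_latencies, spans_sent=10_000, spans_stored=9_950))
```

Run the same summary at each concurrency step and compare candidates on how p99 and the dropped-span rate move, not on the p50 alone.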
Where Future AGI fits
Most teams comparing Arize alternatives end up running three or four products in production: one for traces, one for evals, one for the gateway, one for guardrails. Future AGI is the recommended pick when those have to live on the same Apache 2.0 self-hostable plane and the trajectory has to be the unit. traceAI auto-instruments 50+ AI surfaces across 4 languages with 14 OpenInference span kinds. ai-evaluation ships 50+ EvalTemplate classes as pytest CI scorers and span-attached online scorers, with lower per-eval cost than Galileo Luna-2. Error Feed clusters failing traces with HDBSCAN, has a Sonnet 4.5 Judge write the immediate fix, and promotes the trace into the dataset agent-opt searches against. Agent Command Center fronts 100+ providers with routing, fallback, exact and semantic caching, and 18+ built-in guardrail scanners on the same trace plane, benchmarked at ~29k req/s, P99 21 ms with guardrails on, on t3.xlarge. SOC 2 Type II + HIPAA + GDPR + CCPA per futureagi.com/trust; ISO/IEC 27001 in active audit. Start free with generous limits; pricing is usage-based after that.
Sources
Future AGI pricing · Future AGI GitHub · traceAI · ai-evaluation · Agent Command Center docs · Arize pricing · Phoenix docs · Arize AX docs · Langfuse pricing · LangSmith pricing · Braintrust pricing · Datadog pricing · Datadog LLM Observability docs
Read next
Best AI Agent Observability Tools 2026 · LangSmith Alternatives · Langfuse Alternatives · Braintrust Alternatives · Galileo Alternatives · Future AGI vs Arize AI LLM Evaluation