LangSmith Alternatives in 2026: 6 Honest Picks for Production AI Teams
LangSmith alternatives in 2026 compared on cost at scale, LangChain coupling, missing eval, guardrail, and gateway layers. Six honest picks with pricing.
Table of Contents
Teams leave LangSmith for three reasons, in roughly this order. Cost at scale, once base traces at $2.50 per 1,000 and extended traces at $5.00 per 1,000 stack up against a five-seat eval team. LangChain coupling, when the production stack outgrows LangChain or absorbs LiteLLM, raw provider SDKs, CrewAI, AutoGen, or a custom controller. And missing layers, because LangSmith does not ship native eval-on-prod-traces, runtime guardrails, or a self-hostable LLM gateway. The right alternative depends on which of the three you hit first. This guide ranks six honest picks against those three axes. Updated May 20, 2026.
TL;DR: best LangSmith alternative per failure mode
| You are leaving LangSmith because of | Best pick | Why (one phrase) | Pricing | OSS |
|---|---|---|---|---|
| Missing layers (eval-on-prod, guardrails, gateway) | Future AGI | One Apache 2.0 plane across trace, eval, gateway, and guardrail | Free + usage | Apache 2.0 |
| Cost at scale with self-host requirement | Langfuse | Mature OSS traces, prompts, datasets at flat tiers | Hobby free, Core $29/mo | MIT core |
| LangChain coupling, OTel-purist team | Arize Phoenix | OpenInference reference, framework-agnostic by design | Free self-hosted, AX Pro $50/mo | ELv2 |
| Eval workflow is the dominant problem | Braintrust | Polished hosted UI for experiments, scorers, and CI gates | Starter free, Pro $249/mo | Closed |
| Cost at scale, gateway is the entry point | Helicone | Base URL swap, request analytics, caching, cost control | Hobby free, Pro $79/mo | Apache 2.0 |
| Datadog is already the system of record | Datadog LLM Observability | LLM spans next to APM and infra metrics | APM $31/host + add-on | Closed |
One-row summary. Pick Future AGI when the missing layers (eval-on-prod, guardrails, gateway) are the reason you are shopping. Pick Langfuse when self-hosted OSS observability is the hard requirement. Pick Datadog when one tool for everything beats eval depth.
What actually breaks in LangSmith at production scale
LangSmith is excellent inside LangChain and LangGraph. Trace semantics line up with the runtime, Prompt Hub and Fleet sit next to the agent code, and the self-hosted v0.13 release closed most of the Enterprise parity gap (IAM auth for external Postgres and Redis, mTLS for ClickHouse, KEDA autoscaling, Redis cluster support). So why do migration threads keep showing up?
Three patterns repeat. Read them as the operating model for the rest of this post.
1. Cost at scale. Plus is $39 per seat per month with 10K base traces included. Above that, base traces cost $2.50 per 1,000 and extended traces $5.00 per 1,000, with another $2.50 per 1,000 to upgrade base to extended. Fleet runs above 500/month bill at $0.05 each. A five-engineer team on 100K traces a month is past $400 before retention or extended traces, and the line tracks success linearly.
2. LangChain coupling. The value drops fast outside LangChain. LiteLLM apps, raw provider SDKs, OpenAI Agents SDK, CrewAI, AutoGen, Pydantic AI, DSPy, Mastra, Spring AI, LangChain4j, and custom controllers all need OTel-shaped trace ingestion. LangSmith accepts OTel but does not surface non-LangChain runs with the same fidelity. The “LangSmith is the runtime control plane” line stops being true as soon as the stack is mixed.
3. Missing layers. LangSmith ships excellent tracing, dev-loop evaluation, Prompt Hub, Fleet, and Studio. It does not ship a native eval-on-prod-traces loop where every live span carries a score, runtime guardrails at the gateway layer, or a self-hostable LLM gateway. Teams that need those layers stitch them: Braintrust for evals, a homegrown gateway, Lakera or Presidio for guardrails. The handoffs leak.
The six tools below each address at least one of the three. The honest comparison is where each fills a different gap.

The 6 LangSmith alternatives compared
1. Future AGI: best when missing layers are the reason you are leaving
Apache 2.0. Self-hostable. Hosted cloud. Eval stack + gateway on one plane.
Quick take. Future AGI is the pick when LangSmith’s missing layers are what pushed you out. The eval stack ships as a package: ai-evaluation is the code-first surface with 50+ EvalTemplate classes, traceAI carries the same rubric as a span-attached score on live traces, and the Agent Command Center gateway routes across 100+ providers with 18+ runtime guardrails on the same plane. Error Feed clusters failing traces with HDBSCAN plus a Sonnet 4.5 Judge that writes the immediate fix, so a tool-call regression becomes a labeled dataset row instead of a Jira ticket. The handoffs between trace, eval, gate, and gateway are versioned objects, not manual exports.
Ideal for. Teams that already stitch LangSmith plus a notebook plus a separate gateway and watch the same regression repeat because the production-to-pre-prod loop is manual. Strong fit for RAG agents, voice agents, support automation, and copilots across Python, TypeScript, Java, and C#.
Key strengths.
- traceAI auto-instruments 50+ AI surfaces across 4 languages: LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Pydantic AI, DSPy, Mastra, Spring AI, LangChain4j, Semantic Kernel. 14 OpenInference span kinds (TOOL, RETRIEVER, AGENT, EVALUATOR, GUARDRAIL, A2A_CLIENT, A2A_SERVER, VECTOR_DB, others).
- Eval-stack package. 50+ evaluators (Tool Correctness, Plan Adherence, Task Completion, Refusal Calibration, Hallucination, Groundedness, PII, Toxicity) ship as pytest CI scorers and online span-attached scorers reading the same rubric. Lower per-eval cost than Galileo Luna-2 at comparable accuracy on the published rubrics.
- Agent Command Center: OpenAI-compatible gateway in a single Go binary under Apache 2.0. 100+ providers, 18+ built-in guardrail scanners plus 15 third-party adapters, exact and semantic caching, MCP and A2A protocol support. Benchmarked at ~29k req/s with P99 21 ms on
t3.xlargewith guardrails on. - Error Feed closes the loop. HDBSCAN soft-clusters failing traces over ClickHouse; a Sonnet 4.5 Judge (30-turn budget, 8 span tools, 90% prompt-cache hit) writes the immediate fix; the trace promotes into the dataset that gates the next release.
- SOC 2 Type II, HIPAA, GDPR, CCPA certified per futureagi.com/trust. ISO 27001 in active audit.
Honest limitations. More moving parts than a single-purpose tracer. ClickHouse, Postgres, Redis, Temporal, and the gateway are real services on full self-host. Most teams use hosted cloud or start with the OSS trio (traceAI + ai-evaluation + agent-opt) plus one container for the gateway, then add the data plane when volume justifies it.
Pricing. Free to start with generous limits across tracing, gateway requests, AI credits, voice simulation, and 30-day retention. Usage-based after that, not per-seat. Compliance add-ons (HIPAA BAA, SAML SSO + SCIM) layer per tier. Pricing.
Expert verdict. Pick Future AGI when production failures need to close back into pre-prod tests automatically and the trajectory has to be the unit. The buying signal is teams paying for the same incident class repeating across releases. See Future AGI vs LangSmith for the head-to-head.

2. Langfuse: best when cost at scale and self-host are the constraints
MIT core. Self-hostable. Hosted cloud option.
Quick take. Langfuse is the strongest OSS-first LangSmith alternative when self-hosted traces, prompts, datasets, and evals are the entire requirement and you can pair it with external eval and guardrail layers. Its own LangSmith alternative page frames it around framework neutrality, OSS licensing, and pricing transparency. That frame holds up.
Ideal for. Platform teams who operate the data plane, want trace data in their own infrastructure, and have a CI eval framework already. Teams whose CTO ruled out closed SaaS for traces or whose data residency rules force on-prem.
Key strengths.
- MIT core; mature self-host across Postgres, ClickHouse, Redis or Valkey, object storage, queues, and workers.
- Prompt management with labels, environments, version diffs. Datasets, runs, and human annotation queues.
- OpenTelemetry ingestion, LiteLLM proxy logging, integrations with LangChain, LlamaIndex, and OpenAI SDK.
- Experiments CI/CD integration shipped May 2026 closes the gap on offline experiments and PR gates.
Honest limitations. Trajectory metrics are not first-class: 5 span kinds versus Future AGI’s 14, and the trace UI is LLM-shaped rather than trajectory-shaped. Simulation, voice eval, optimization, and runtime guardrails live in adjacent tools you wire up. Enterprise directories ship outside MIT. Self-host footprint expands once ClickHouse, Redis, and worker queues scale together; the saving versus LangSmith Plus shifts to your infra and SRE bill.
Pricing. Hobby free with 50K units, 2 users. Core $29/mo with 100K units, $8 per additional 100K, unlimited users. Pro $199/mo. Enterprise $2,499/mo. A “unit” covers a trace, observation, score, or evaluation, which is why production cost compounds with eval-heavy workflows.
Expert verdict. Pick Langfuse when OSS observability and prompt management under self-host is the dominant requirement and you have an eval harness elsewhere. Skip if you need trajectory-shaped scoring, voice simulation, or runtime guardrails on the same plane. See Langfuse Alternatives and Langfuse vs LangSmith.
3. Arize Phoenix: best when LangChain coupling is the reason you are leaving
Source available under ELv2. Self-hostable. Phoenix Cloud and Arize AX paths.
Quick take. Phoenix is the canonical OpenInference reference. Built by Arize, the team that owned ML observability for embedding drift before LLM observability was a category. The buying signal is a platform team that already thinks in OpenTelemetry, wants framework-neutral instrumentation by default, and is open to graduating into Arize AX later.
Ideal for. Engineers who care about open instrumentation standards, want a local Phoenix workbench for trace inspection and experiments, and need OpenInference attribute names to land in their backend untranslated.
Key strengths.
- OpenInference reference; canonical attribute names land in Phoenix first.
- Auto-instrumentation for LlamaIndex, LangChain, DSPy, Mastra, Vercel AI SDK, OpenAI Agents SDK, Bedrock, and Anthropic across Python, TypeScript, and Java.
- Embedding-drift heritage; retrieval-quality dashboards and chunk-level drift are stronger than most LLM-era platforms.
- Clean local workbench.
phoenix.launch_app()and you have a tracer running in seconds.
Honest limitations. Phoenix is not a gateway, not a guardrail product, and not a simulator. ELv2 is source available, not OSI open source, so do not call it OSS in a procurement review without a footnote. The trajectory metric library is smaller than Future AGI’s, and online scoring lives in the Phoenix eval surface rather than as a span-attached primitive that ships with the SDK.
Pricing. Phoenix free self-hosted, user-managed retention. AX Free 25K spans/mo, 1 GB, 15 days retention. AX Pro $50/mo with 50K spans, 10 GB, 30 days. AX Enterprise custom.
Expert verdict. Pick Phoenix when OpenInference adherence and the Arize AX upgrade path are the buying signals, and you can pair it with adjacent tooling for gateway, guardrails, and simulation. Skip when you need a unified plane. See Phoenix Alternatives.
4. Braintrust: best when the eval workflow is the dominant problem
Closed platform. Hosted cloud or Enterprise self-host.
Quick take. Braintrust has the best eval UI in the closed category. Experiments, datasets, scorers, prompt iteration, online scoring, and CI gating in one polished product, with sandboxed agent evaluation for tool-calling agents. The center of gravity is structured evals, not the full agent loop.
Ideal for. Teams that prefer to buy rather than build, want experiments and scorers in one polished UI, and have ruled out OSS as a procurement constraint. Strong fit if engineering and product both need a trace-to-dataset-to-scorer-to-gate path without infra ownership.
Key strengths.
- Polished UI for experiments, scorers, datasets, and prompt iteration; the closest hosted analog to LangSmith on the eval surface.
- Sandboxed agent evaluation with tool-call execution; agent eval surface more developed than Langfuse or Phoenix.
- Online scoring and CI gates in the same product as offline experiments.
- May 2026 added Java auto-instrumentation for Spring AI and LangChain4j, plus dataset snapshots, environments, cloud storage export, full-text search, and subqueries.
Honest limitations. Closed platform; self-host is Enterprise only. No first-party voice simulator. Gateway, guardrails, and prompt optimization are not first-class. Pro at $249/mo is the highest entry-tier outside enterprise contracts, and per-score overage stacks fast at production scale. Trajectory metrics beyond hand-composed scorers are not built in.
Pricing. Starter $0 with 1 GB, 10K scores, 14 days retention. Pro $249/mo with 5 GB, 50K scores, 30 days. Overage $4/GB and $2.50 per 1K scores on Starter; $3/GB and $1.50 per 1K on Pro. Enterprise custom.
Expert verdict. Pick Braintrust when structured evals with a polished UI is the dominant problem and gateway, guardrails, and simulation are off the requirement list. See Braintrust Alternatives and Future AGI vs Braintrust.
5. Helicone: best when the gateway is the entry point
Apache 2.0. Self-hostable. Hosted cloud option.
Quick take. Helicone is the fastest path to value when the production issue is provider routing, caching, p95 latency, cost attribution, user-level analytics, fallback behavior, or alerting on live LLM traffic. Swap the base URL and you have request analytics in minutes.
Ideal for. Teams with live traffic and no clean answer to which users, prompts, models, or endpoints drove a p99 spike. Strong first tool when direct provider SDK calls are spread across the codebase and adding SDK-level instrumentation is more work than swapping a URL.
Key strengths.
- Apache 2.0 self-hostable gateway plus observability stack.
- OpenAI-compatible gateway in front of 100+ models. Routing, fallback, retries, caching, rate limits.
- Request logging, sessions, user metrics, cost tracking, datasets, alerts, reports, HQL query language, eval scores, feedback, prompts.
- Cache hits and budgets are easy to read; SQL-like HQL is a strong audit surface.
Honest limitations. Helicone will not replace a deep eval platform on its own. It ships eval scores, datasets, and feedback, but the surface is shallower than Braintrust or Future AGI. On March 3, 2026, Helicone said it had joined Mintlify and that services would remain live in maintenance mode with security updates, new models, bug fixes, and performance fixes. Treat roadmap depth as part of vendor diligence.
Pricing. Hobby free with 10K requests, 1 GB storage, 1 seat. Pro $79/mo with unlimited seats, alerts, reports, HQL. Team $799/mo with 5 orgs, SOC 2, HIPAA. Enterprise custom.
Expert verdict. Pick Helicone when gateway-first observability and cost control are the immediate problem and you can layer evals elsewhere later. Skip when you need a unified eval plus trace plus guardrail plane. See Helicone Alternatives and Future AGI vs Helicone.
6. Datadog LLM Observability: best when Datadog is already the system of record
Closed platform. SaaS with regional residency. APM-integrated.
Quick take. Datadog ships LLM Observability as an APM add-on. The pitch is one tool for everything: LLM spans next to APM, infrastructure metrics, logs, RUM, and security, correlated with database queries and downstream service latency on a single rotation. The buying signal is unified observability, not eval depth.
Ideal for. Enterprise teams where Datadog is the system of record, SRE and on-call rotations live there, and consolidation beats specialization on the LLM surface.
Key strengths.
- LLM spans inside the same product as APM, logs, RUM, security, and infra metrics.
- Infrastructure correlation: LLM latency next to DB query latency next to downstream service latency in the same trace.
- Mature enterprise security posture, SSO, audit logging, regional residency, and SRE workflows.
- Scales to high-volume span ingestion on the existing Datadog backend.
Honest limitations. Eval surface is shallower than dedicated LLM platforms: no first-party simulator, fewer built-in metric primitives, no integrated guardrails. Cost scales fast with span volume; Datadog bills per ingested span plus per indexed log on top of APM. Vendor lock-in compounds; path of least resistance is the Datadog SDK rather than OTel. Most teams pair Datadog with Future AGI or Braintrust once eval and trajectory scoring become the bottleneck.
Pricing. APM at $31/host/mo annual, plus LLM Observability metered per ingested span and per indexed log. Expect contracts above $1,000/mo at modest scale; production teams enter five-figure monthly contracts quickly.
Expert verdict. Pick Datadog LLM Observability when Datadog is already the standard and one-tool consolidation beats eval depth. Pair with Future AGI or Braintrust if eval and trajectory scoring become the bottleneck. See Datadog LLM alternatives and Braintrust vs Datadog.
Cost at 100K traces a month, 5-seat eval team
| Platform | Monthly cost | How it adds up |
|---|---|---|
| Phoenix self-hosted | $0 + infra | One container plus an OTel collector; infra and SRE time on you |
| Langfuse Core | $59 | $29 base + $30 for the additional 100K units; flat tier |
| Helicone Pro | $79 | Flat fee plus modest gateway overage at this volume |
| Future AGI | $180 | Usage on the 50 GB tracing tier + AI credits |
| Braintrust Pro | $320 | $249 base + processed-data + score overage |
| LangSmith Plus | $420 | $39 × 5 seats + 90K base traces at $2.50 / 1K |
| Datadog LLM | $1,000+ | $31/host APM + per-ingested-span and per-indexed-log metering |
Real production cost is platform price plus trace volume, judge token spend, retry rate, storage retention, annotation labor, and infra time. Subscription is the small line.
Self-host operational footprint
| Platform | Footprint | Day-one services |
|---|---|---|
| Helicone | Lightweight | Gateway plus Postgres |
| Phoenix | Lightweight | One container plus an OTel collector |
| Future AGI | Lightweight | OSS trio pip install plus one gateway container; full plane adds ClickHouse, Postgres, Redis, Temporal |
| Langfuse | Moderate | Web + worker + Postgres + ClickHouse + Redis + object storage |
| Braintrust | Enterprise only | Self-host gated to Enterprise contracts |
| LangSmith v0.13 | Moderate | Multi-service deploy; Enterprise tier only |
| Datadog | None | SaaS only; regional residency, not self-host |
Decision framework: choose X if
- Future AGI if the gap is missing layers (eval-on-prod-traces, runtime guardrails, gateway) and the loop has to close on one Apache 2.0 plane. Buying signal: your team already stitches LangSmith plus a notebook plus a separate gateway, and the same incident class keeps repeating.
- Langfuse if the gap is cost at scale with a self-host requirement and you have an eval harness elsewhere. Buying signal: CTO ruled out closed SaaS for traces, and the eval team is fine running Python scorers in CI.
- Arize Phoenix if the gap is LangChain coupling and the team thinks in OpenTelemetry. Buying signal: the AX upgrade path matters, or your platform team treats instrumentation standards as the buying axis.
- Braintrust if the gap is the eval workflow specifically, and gateway plus guardrails are off the requirement list. Buying signal: product and engineering both need trace-to-dataset-to-scorer-to-gate in a polished UI without owning infra.
- Helicone if the gap is gateway, cost control, and request analytics, and you can layer evals later. Buying signal: live traffic, no clean answer to which users or prompts drove p99.
- Datadog LLM Observability if Datadog is already the system of record and one-tool consolidation beats eval depth. Buying signal: SRE and on-call rotations already live in Datadog.
Common mistakes when picking a LangSmith alternative
- Overstating the lock-in problem. LangSmith ingests non-LangChain traces over OTel. The real question is whether LangChain concepts should remain the center of evals, prompts, and deployment, not whether the trace pipe technically works.
- Treating OSS and self-hostable as the same. Future AGI, Langfuse, Phoenix, and Helicone all have self-hosted paths, but licenses differ. Phoenix is ELv2 (source available, not OSI open source). Langfuse enterprise directories live outside MIT. Verify license terms before procurement.
- Pricing only the subscription. Real cost is seats, trace volume, retention, score volume, judge tokens, gateway requests, cache hits, annotation hours, and the infra team running self-hosted services. The platform line is rarely the biggest one.
- Choosing by integration logos. Verify active maintenance for the exact framework versions you use. LangChain v1, OpenAI Responses, Claude tool use, OTel semantic conventions, and provider SDK changes all break traces quietly when adapters lag.
- Ignoring trajectory-shaped eval. Final-answer scoring misses tool selection, retries, retrieval misses, loop behavior, memory drift, and session handoffs. Require trace-level and session-level evaluation if your agent does more than one call.
- Migrating traces but leaving the datasets behind. Tracing is the easy half. Datasets, scorers, prompts, human review queues, CI gates, annotations, and the production-to-eval workflow that turns failures into regression tests are the hard half.
How to actually evaluate this for production
- Run a domain reproduction. Export a representative slice of real traces, including failures, long-tail prompts, tool calls, retrieval misses, safety edge cases, and hand-labeled outcomes. Instrument each candidate with your harness, your OTel payload shape, your prompt versions, and your judge model. Do not accept a demo dataset.
- Measure reliability under load. Track p50, p95, p99 ingestion, dropped spans, duplicate spans, failed judge calls, retry count, query latency, and alert delay as concurrency rises. Plot the curve before you commit.
- Cost-adjust. Real cost equals platform price plus trace volume, token volume, test-time compute, judge sampling rate, retry rate, storage retention, gateway calls, cache hits, and annotation hours. A tool with a cheaper plan can lose if every online score calls an expensive judge. A self-hosted tool can lose if the infra bill and on-call time exceed the SaaS overage.
Where Future AGI fits
Most teams comparing LangSmith alternatives end up running three or four products in production: one for traces, one for evals, one for the gateway, one for guardrails. Future AGI is the recommended pick when those have to live on one Apache 2.0 self-hostable plane and the trajectory has to be the unit. traceAI auto-instruments 50+ AI surfaces across 4 languages, ai-evaluation ships 50+ EvalTemplate classes as pytest CI and online span-attached scorers reading the same rubric, Error Feed clusters failing traces and writes the immediate fix, and the Agent Command Center gateway fronts 100+ providers with 18+ runtime guardrails. Start free; pricing is usage-based after that.
Sources
Future AGI pricing · Future AGI GitHub · traceAI · ai-evaluation · Agent Command Center docs · LangSmith pricing · LangSmith docs · LangSmith Self-Hosted v0.13 · Langfuse pricing · Langfuse self-hosting · Phoenix docs · Arize pricing · Braintrust pricing · Braintrust changelog · Helicone pricing · Helicone × Mintlify · Datadog pricing
Read next
Future AGI vs LangSmith · Langfuse vs LangSmith · Best AI Agent Observability Tools · Braintrust Alternatives · Best Datadog LLM Observability Alternatives
Frequently asked questions
Why do teams leave LangSmith in 2026?
What is the best LangSmith alternative in 2026?
Is LangSmith open source?
Can I self-host an alternative to LangSmith?
Which LangSmith alternative is best if I am already on LangChain or LangGraph?
How does LangSmith pricing compare to alternatives in 2026?
Is Helicone still a safe LangSmith alternative after the Mintlify acquisition?
Honest 2026 comparison of Langfuse alternatives: Future AGI, LangSmith, Phoenix, Braintrust, Helicone on eval depth, gateway, and the loop.
Honest 2026 comparison of the best Arize AI alternatives: Future AGI, Langfuse, LangSmith, Braintrust, Datadog. Pricing, gateway, eval depth, license.
Langfuse vs LangSmith 2026 head-to-head: license, framework neutrality, prompts, datasets, eval, self-host, the unified-stack axis.