Research

LangSmith Alternatives in 2026: 6 Honest Picks for Production AI Teams

LangSmith alternatives in 2026 compared on cost at scale, LangChain coupling, missing eval, guardrail, and gateway layers. Six honest picks with pricing.

·
Updated
·
18 min read
langsmith-alternatives llm-evaluation llm-observability open-source self-hosting ai-gateway agent-observability 2026
Editorial cover image on a pure black starfield background with faint white grid. Bold all-caps white headline LANGSMITH ALTERNATIVES 2026 fills the left half. The right half shows a wireframe pair of chain links breaking apart, suggesting unlocking from a vendor, drawn in pure white outlines with a soft white halo behind the breaking point.
Table of Contents

Teams leave LangSmith for three reasons, in roughly this order. Cost at scale, once base traces at $2.50 per 1,000 and extended traces at $5.00 per 1,000 stack up against a five-seat eval team. LangChain coupling, when the production stack outgrows LangChain or absorbs LiteLLM, raw provider SDKs, CrewAI, AutoGen, or a custom controller. And missing layers, because LangSmith does not ship native eval-on-prod-traces, runtime guardrails, or a self-hostable LLM gateway. The right alternative depends on which of the three you hit first. This guide ranks six honest picks against those three axes. Updated May 20, 2026.

TL;DR: best LangSmith alternative per failure mode

You are leaving LangSmith because ofBest pickWhy (one phrase)PricingOSS
Missing layers (eval-on-prod, guardrails, gateway)Future AGIOne Apache 2.0 plane across trace, eval, gateway, and guardrailFree + usageApache 2.0
Cost at scale with self-host requirementLangfuseMature OSS traces, prompts, datasets at flat tiersHobby free, Core $29/moMIT core
LangChain coupling, OTel-purist teamArize PhoenixOpenInference reference, framework-agnostic by designFree self-hosted, AX Pro $50/moELv2
Eval workflow is the dominant problemBraintrustPolished hosted UI for experiments, scorers, and CI gatesStarter free, Pro $249/moClosed
Cost at scale, gateway is the entry pointHeliconeBase URL swap, request analytics, caching, cost controlHobby free, Pro $79/moApache 2.0
Datadog is already the system of recordDatadog LLM ObservabilityLLM spans next to APM and infra metricsAPM $31/host + add-onClosed

One-row summary. Pick Future AGI when the missing layers (eval-on-prod, guardrails, gateway) are the reason you are shopping. Pick Langfuse when self-hosted OSS observability is the hard requirement. Pick Datadog when one tool for everything beats eval depth.

What actually breaks in LangSmith at production scale

LangSmith is excellent inside LangChain and LangGraph. Trace semantics line up with the runtime, Prompt Hub and Fleet sit next to the agent code, and the self-hosted v0.13 release closed most of the Enterprise parity gap (IAM auth for external Postgres and Redis, mTLS for ClickHouse, KEDA autoscaling, Redis cluster support). So why do migration threads keep showing up?

Three patterns repeat. Read them as the operating model for the rest of this post.

1. Cost at scale. Plus is $39 per seat per month with 10K base traces included. Above that, base traces cost $2.50 per 1,000 and extended traces $5.00 per 1,000, with another $2.50 per 1,000 to upgrade base to extended. Fleet runs above 500/month bill at $0.05 each. A five-engineer team on 100K traces a month is past $400 before retention or extended traces, and the line tracks success linearly.

2. LangChain coupling. The value drops fast outside LangChain. LiteLLM apps, raw provider SDKs, OpenAI Agents SDK, CrewAI, AutoGen, Pydantic AI, DSPy, Mastra, Spring AI, LangChain4j, and custom controllers all need OTel-shaped trace ingestion. LangSmith accepts OTel but does not surface non-LangChain runs with the same fidelity. The “LangSmith is the runtime control plane” line stops being true as soon as the stack is mixed.

3. Missing layers. LangSmith ships excellent tracing, dev-loop evaluation, Prompt Hub, Fleet, and Studio. It does not ship a native eval-on-prod-traces loop where every live span carries a score, runtime guardrails at the gateway layer, or a self-hostable LLM gateway. Teams that need those layers stitch them: Braintrust for evals, a homegrown gateway, Lakera or Presidio for guardrails. The handoffs leak.

The six tools below each address at least one of the three. The honest comparison is where each fills a different gap.

Two-line time-series chart showing the LangChain dependency decline from January 2024 through May 2026: a descending curve labeled LangChain-based agents drops from roughly 80 percent to 35 percent of new agents, while an ascending curve labeled framework-neutral (OTel / OpenInference) rises from roughly 20 percent to 65 percent, with a glowing crossover point marked Q3 2025 where the two curves intersect at 50 percent.

The 6 LangSmith alternatives compared

1. Future AGI: best when missing layers are the reason you are leaving

Apache 2.0. Self-hostable. Hosted cloud. Eval stack + gateway on one plane.

Quick take. Future AGI is the pick when LangSmith’s missing layers are what pushed you out. The eval stack ships as a package: ai-evaluation is the code-first surface with 50+ EvalTemplate classes, traceAI carries the same rubric as a span-attached score on live traces, and the Agent Command Center gateway routes across 100+ providers with 18+ runtime guardrails on the same plane. Error Feed clusters failing traces with HDBSCAN plus a Sonnet 4.5 Judge that writes the immediate fix, so a tool-call regression becomes a labeled dataset row instead of a Jira ticket. The handoffs between trace, eval, gate, and gateway are versioned objects, not manual exports.

Ideal for. Teams that already stitch LangSmith plus a notebook plus a separate gateway and watch the same regression repeat because the production-to-pre-prod loop is manual. Strong fit for RAG agents, voice agents, support automation, and copilots across Python, TypeScript, Java, and C#.

Key strengths.

  • traceAI auto-instruments 50+ AI surfaces across 4 languages: LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Pydantic AI, DSPy, Mastra, Spring AI, LangChain4j, Semantic Kernel. 14 OpenInference span kinds (TOOL, RETRIEVER, AGENT, EVALUATOR, GUARDRAIL, A2A_CLIENT, A2A_SERVER, VECTOR_DB, others).
  • Eval-stack package. 50+ evaluators (Tool Correctness, Plan Adherence, Task Completion, Refusal Calibration, Hallucination, Groundedness, PII, Toxicity) ship as pytest CI scorers and online span-attached scorers reading the same rubric. Lower per-eval cost than Galileo Luna-2 at comparable accuracy on the published rubrics.
  • Agent Command Center: OpenAI-compatible gateway in a single Go binary under Apache 2.0. 100+ providers, 18+ built-in guardrail scanners plus 15 third-party adapters, exact and semantic caching, MCP and A2A protocol support. Benchmarked at ~29k req/s with P99 21 ms on t3.xlarge with guardrails on.
  • Error Feed closes the loop. HDBSCAN soft-clusters failing traces over ClickHouse; a Sonnet 4.5 Judge (30-turn budget, 8 span tools, 90% prompt-cache hit) writes the immediate fix; the trace promotes into the dataset that gates the next release.
  • SOC 2 Type II, HIPAA, GDPR, CCPA certified per futureagi.com/trust. ISO 27001 in active audit.

Honest limitations. More moving parts than a single-purpose tracer. ClickHouse, Postgres, Redis, Temporal, and the gateway are real services on full self-host. Most teams use hosted cloud or start with the OSS trio (traceAI + ai-evaluation + agent-opt) plus one container for the gateway, then add the data plane when volume justifies it.

Pricing. Free to start with generous limits across tracing, gateway requests, AI credits, voice simulation, and 30-day retention. Usage-based after that, not per-seat. Compliance add-ons (HIPAA BAA, SAML SSO + SCIM) layer per tier. Pricing.

Expert verdict. Pick Future AGI when production failures need to close back into pre-prod tests automatically and the trajectory has to be the unit. The buying signal is teams paying for the same incident class repeating across releases. See Future AGI vs LangSmith for the head-to-head.

Future AGI four-panel dark product showcase that matches LangSmith's surfaces while staying framework-agnostic. Top-left: Tracing / Sessions with a latency time-series Dec 2025 to May 2026 (soft halo on the May spike) and a session row demo_session, Duration 84 s, Total Cost $0, Total Traces 2. Top-right: Versioned Prompts list with support-classifier-v3 (87% pass), agent-router-v5 (v5 · 91%, focal violet ring), grounded-rag-prompt (v12 · 88%), red-team-detector-v2 (v2 · 95%), each with an inline diff sparkline. Bottom-left: Datasets · evals pipeline showing Failing traces (32) → Dataset auto-built (212 rows) → Eval suite (8 metrics), plus a violet "Run eval" button and 212 / 212 scored progress. Bottom-right: Annotations human-review KPIs showing Total 1,010, Completed 612, Completion Rate 60.6% (focal halo), Avg per Day 87, with a 60.6% green progress bar and Completed / In Progress / Pending / Skipped legend.

2. Langfuse: best when cost at scale and self-host are the constraints

MIT core. Self-hostable. Hosted cloud option.

Quick take. Langfuse is the strongest OSS-first LangSmith alternative when self-hosted traces, prompts, datasets, and evals are the entire requirement and you can pair it with external eval and guardrail layers. Its own LangSmith alternative page frames it around framework neutrality, OSS licensing, and pricing transparency. That frame holds up.

Ideal for. Platform teams who operate the data plane, want trace data in their own infrastructure, and have a CI eval framework already. Teams whose CTO ruled out closed SaaS for traces or whose data residency rules force on-prem.

Key strengths.

  • MIT core; mature self-host across Postgres, ClickHouse, Redis or Valkey, object storage, queues, and workers.
  • Prompt management with labels, environments, version diffs. Datasets, runs, and human annotation queues.
  • OpenTelemetry ingestion, LiteLLM proxy logging, integrations with LangChain, LlamaIndex, and OpenAI SDK.
  • Experiments CI/CD integration shipped May 2026 closes the gap on offline experiments and PR gates.

Honest limitations. Trajectory metrics are not first-class: 5 span kinds versus Future AGI’s 14, and the trace UI is LLM-shaped rather than trajectory-shaped. Simulation, voice eval, optimization, and runtime guardrails live in adjacent tools you wire up. Enterprise directories ship outside MIT. Self-host footprint expands once ClickHouse, Redis, and worker queues scale together; the saving versus LangSmith Plus shifts to your infra and SRE bill.

Pricing. Hobby free with 50K units, 2 users. Core $29/mo with 100K units, $8 per additional 100K, unlimited users. Pro $199/mo. Enterprise $2,499/mo. A “unit” covers a trace, observation, score, or evaluation, which is why production cost compounds with eval-heavy workflows.

Expert verdict. Pick Langfuse when OSS observability and prompt management under self-host is the dominant requirement and you have an eval harness elsewhere. Skip if you need trajectory-shaped scoring, voice simulation, or runtime guardrails on the same plane. See Langfuse Alternatives and Langfuse vs LangSmith.

3. Arize Phoenix: best when LangChain coupling is the reason you are leaving

Source available under ELv2. Self-hostable. Phoenix Cloud and Arize AX paths.

Quick take. Phoenix is the canonical OpenInference reference. Built by Arize, the team that owned ML observability for embedding drift before LLM observability was a category. The buying signal is a platform team that already thinks in OpenTelemetry, wants framework-neutral instrumentation by default, and is open to graduating into Arize AX later.

Ideal for. Engineers who care about open instrumentation standards, want a local Phoenix workbench for trace inspection and experiments, and need OpenInference attribute names to land in their backend untranslated.

Key strengths.

  • OpenInference reference; canonical attribute names land in Phoenix first.
  • Auto-instrumentation for LlamaIndex, LangChain, DSPy, Mastra, Vercel AI SDK, OpenAI Agents SDK, Bedrock, and Anthropic across Python, TypeScript, and Java.
  • Embedding-drift heritage; retrieval-quality dashboards and chunk-level drift are stronger than most LLM-era platforms.
  • Clean local workbench. phoenix.launch_app() and you have a tracer running in seconds.

Honest limitations. Phoenix is not a gateway, not a guardrail product, and not a simulator. ELv2 is source available, not OSI open source, so do not call it OSS in a procurement review without a footnote. The trajectory metric library is smaller than Future AGI’s, and online scoring lives in the Phoenix eval surface rather than as a span-attached primitive that ships with the SDK.

Pricing. Phoenix free self-hosted, user-managed retention. AX Free 25K spans/mo, 1 GB, 15 days retention. AX Pro $50/mo with 50K spans, 10 GB, 30 days. AX Enterprise custom.

Expert verdict. Pick Phoenix when OpenInference adherence and the Arize AX upgrade path are the buying signals, and you can pair it with adjacent tooling for gateway, guardrails, and simulation. Skip when you need a unified plane. See Phoenix Alternatives.

4. Braintrust: best when the eval workflow is the dominant problem

Closed platform. Hosted cloud or Enterprise self-host.

Quick take. Braintrust has the best eval UI in the closed category. Experiments, datasets, scorers, prompt iteration, online scoring, and CI gating in one polished product, with sandboxed agent evaluation for tool-calling agents. The center of gravity is structured evals, not the full agent loop.

Ideal for. Teams that prefer to buy rather than build, want experiments and scorers in one polished UI, and have ruled out OSS as a procurement constraint. Strong fit if engineering and product both need a trace-to-dataset-to-scorer-to-gate path without infra ownership.

Key strengths.

  • Polished UI for experiments, scorers, datasets, and prompt iteration; the closest hosted analog to LangSmith on the eval surface.
  • Sandboxed agent evaluation with tool-call execution; agent eval surface more developed than Langfuse or Phoenix.
  • Online scoring and CI gates in the same product as offline experiments.
  • May 2026 added Java auto-instrumentation for Spring AI and LangChain4j, plus dataset snapshots, environments, cloud storage export, full-text search, and subqueries.

Honest limitations. Closed platform; self-host is Enterprise only. No first-party voice simulator. Gateway, guardrails, and prompt optimization are not first-class. Pro at $249/mo is the highest entry-tier outside enterprise contracts, and per-score overage stacks fast at production scale. Trajectory metrics beyond hand-composed scorers are not built in.

Pricing. Starter $0 with 1 GB, 10K scores, 14 days retention. Pro $249/mo with 5 GB, 50K scores, 30 days. Overage $4/GB and $2.50 per 1K scores on Starter; $3/GB and $1.50 per 1K on Pro. Enterprise custom.

Expert verdict. Pick Braintrust when structured evals with a polished UI is the dominant problem and gateway, guardrails, and simulation are off the requirement list. See Braintrust Alternatives and Future AGI vs Braintrust.

5. Helicone: best when the gateway is the entry point

Apache 2.0. Self-hostable. Hosted cloud option.

Quick take. Helicone is the fastest path to value when the production issue is provider routing, caching, p95 latency, cost attribution, user-level analytics, fallback behavior, or alerting on live LLM traffic. Swap the base URL and you have request analytics in minutes.

Ideal for. Teams with live traffic and no clean answer to which users, prompts, models, or endpoints drove a p99 spike. Strong first tool when direct provider SDK calls are spread across the codebase and adding SDK-level instrumentation is more work than swapping a URL.

Key strengths.

  • Apache 2.0 self-hostable gateway plus observability stack.
  • OpenAI-compatible gateway in front of 100+ models. Routing, fallback, retries, caching, rate limits.
  • Request logging, sessions, user metrics, cost tracking, datasets, alerts, reports, HQL query language, eval scores, feedback, prompts.
  • Cache hits and budgets are easy to read; SQL-like HQL is a strong audit surface.

Honest limitations. Helicone will not replace a deep eval platform on its own. It ships eval scores, datasets, and feedback, but the surface is shallower than Braintrust or Future AGI. On March 3, 2026, Helicone said it had joined Mintlify and that services would remain live in maintenance mode with security updates, new models, bug fixes, and performance fixes. Treat roadmap depth as part of vendor diligence.

Pricing. Hobby free with 10K requests, 1 GB storage, 1 seat. Pro $79/mo with unlimited seats, alerts, reports, HQL. Team $799/mo with 5 orgs, SOC 2, HIPAA. Enterprise custom.

Expert verdict. Pick Helicone when gateway-first observability and cost control are the immediate problem and you can layer evals elsewhere later. Skip when you need a unified eval plus trace plus guardrail plane. See Helicone Alternatives and Future AGI vs Helicone.

6. Datadog LLM Observability: best when Datadog is already the system of record

Closed platform. SaaS with regional residency. APM-integrated.

Quick take. Datadog ships LLM Observability as an APM add-on. The pitch is one tool for everything: LLM spans next to APM, infrastructure metrics, logs, RUM, and security, correlated with database queries and downstream service latency on a single rotation. The buying signal is unified observability, not eval depth.

Ideal for. Enterprise teams where Datadog is the system of record, SRE and on-call rotations live there, and consolidation beats specialization on the LLM surface.

Key strengths.

  • LLM spans inside the same product as APM, logs, RUM, security, and infra metrics.
  • Infrastructure correlation: LLM latency next to DB query latency next to downstream service latency in the same trace.
  • Mature enterprise security posture, SSO, audit logging, regional residency, and SRE workflows.
  • Scales to high-volume span ingestion on the existing Datadog backend.

Honest limitations. Eval surface is shallower than dedicated LLM platforms: no first-party simulator, fewer built-in metric primitives, no integrated guardrails. Cost scales fast with span volume; Datadog bills per ingested span plus per indexed log on top of APM. Vendor lock-in compounds; path of least resistance is the Datadog SDK rather than OTel. Most teams pair Datadog with Future AGI or Braintrust once eval and trajectory scoring become the bottleneck.

Pricing. APM at $31/host/mo annual, plus LLM Observability metered per ingested span and per indexed log. Expect contracts above $1,000/mo at modest scale; production teams enter five-figure monthly contracts quickly.

Expert verdict. Pick Datadog LLM Observability when Datadog is already the standard and one-tool consolidation beats eval depth. Pair with Future AGI or Braintrust if eval and trajectory scoring become the bottleneck. See Datadog LLM alternatives and Braintrust vs Datadog.

Cost at 100K traces a month, 5-seat eval team

PlatformMonthly costHow it adds up
Phoenix self-hosted$0 + infraOne container plus an OTel collector; infra and SRE time on you
Langfuse Core$59$29 base + $30 for the additional 100K units; flat tier
Helicone Pro$79Flat fee plus modest gateway overage at this volume
Future AGI$180Usage on the 50 GB tracing tier + AI credits
Braintrust Pro$320$249 base + processed-data + score overage
LangSmith Plus$420$39 × 5 seats + 90K base traces at $2.50 / 1K
Datadog LLM$1,000+$31/host APM + per-ingested-span and per-indexed-log metering

Real production cost is platform price plus trace volume, judge token spend, retry rate, storage retention, annotation labor, and infra time. Subscription is the small line.

Self-host operational footprint

PlatformFootprintDay-one services
HeliconeLightweightGateway plus Postgres
PhoenixLightweightOne container plus an OTel collector
Future AGILightweightOSS trio pip install plus one gateway container; full plane adds ClickHouse, Postgres, Redis, Temporal
LangfuseModerateWeb + worker + Postgres + ClickHouse + Redis + object storage
BraintrustEnterprise onlySelf-host gated to Enterprise contracts
LangSmith v0.13ModerateMulti-service deploy; Enterprise tier only
DatadogNoneSaaS only; regional residency, not self-host

Decision framework: choose X if

  • Future AGI if the gap is missing layers (eval-on-prod-traces, runtime guardrails, gateway) and the loop has to close on one Apache 2.0 plane. Buying signal: your team already stitches LangSmith plus a notebook plus a separate gateway, and the same incident class keeps repeating.
  • Langfuse if the gap is cost at scale with a self-host requirement and you have an eval harness elsewhere. Buying signal: CTO ruled out closed SaaS for traces, and the eval team is fine running Python scorers in CI.
  • Arize Phoenix if the gap is LangChain coupling and the team thinks in OpenTelemetry. Buying signal: the AX upgrade path matters, or your platform team treats instrumentation standards as the buying axis.
  • Braintrust if the gap is the eval workflow specifically, and gateway plus guardrails are off the requirement list. Buying signal: product and engineering both need trace-to-dataset-to-scorer-to-gate in a polished UI without owning infra.
  • Helicone if the gap is gateway, cost control, and request analytics, and you can layer evals later. Buying signal: live traffic, no clean answer to which users or prompts drove p99.
  • Datadog LLM Observability if Datadog is already the system of record and one-tool consolidation beats eval depth. Buying signal: SRE and on-call rotations already live in Datadog.

Common mistakes when picking a LangSmith alternative

  • Overstating the lock-in problem. LangSmith ingests non-LangChain traces over OTel. The real question is whether LangChain concepts should remain the center of evals, prompts, and deployment, not whether the trace pipe technically works.
  • Treating OSS and self-hostable as the same. Future AGI, Langfuse, Phoenix, and Helicone all have self-hosted paths, but licenses differ. Phoenix is ELv2 (source available, not OSI open source). Langfuse enterprise directories live outside MIT. Verify license terms before procurement.
  • Pricing only the subscription. Real cost is seats, trace volume, retention, score volume, judge tokens, gateway requests, cache hits, annotation hours, and the infra team running self-hosted services. The platform line is rarely the biggest one.
  • Choosing by integration logos. Verify active maintenance for the exact framework versions you use. LangChain v1, OpenAI Responses, Claude tool use, OTel semantic conventions, and provider SDK changes all break traces quietly when adapters lag.
  • Ignoring trajectory-shaped eval. Final-answer scoring misses tool selection, retries, retrieval misses, loop behavior, memory drift, and session handoffs. Require trace-level and session-level evaluation if your agent does more than one call.
  • Migrating traces but leaving the datasets behind. Tracing is the easy half. Datasets, scorers, prompts, human review queues, CI gates, annotations, and the production-to-eval workflow that turns failures into regression tests are the hard half.

How to actually evaluate this for production

  1. Run a domain reproduction. Export a representative slice of real traces, including failures, long-tail prompts, tool calls, retrieval misses, safety edge cases, and hand-labeled outcomes. Instrument each candidate with your harness, your OTel payload shape, your prompt versions, and your judge model. Do not accept a demo dataset.
  2. Measure reliability under load. Track p50, p95, p99 ingestion, dropped spans, duplicate spans, failed judge calls, retry count, query latency, and alert delay as concurrency rises. Plot the curve before you commit.
  3. Cost-adjust. Real cost equals platform price plus trace volume, token volume, test-time compute, judge sampling rate, retry rate, storage retention, gateway calls, cache hits, and annotation hours. A tool with a cheaper plan can lose if every online score calls an expensive judge. A self-hosted tool can lose if the infra bill and on-call time exceed the SaaS overage.

Where Future AGI fits

Most teams comparing LangSmith alternatives end up running three or four products in production: one for traces, one for evals, one for the gateway, one for guardrails. Future AGI is the recommended pick when those have to live on one Apache 2.0 self-hostable plane and the trajectory has to be the unit. traceAI auto-instruments 50+ AI surfaces across 4 languages, ai-evaluation ships 50+ EvalTemplate classes as pytest CI and online span-attached scorers reading the same rubric, Error Feed clusters failing traces and writes the immediate fix, and the Agent Command Center gateway fronts 100+ providers with 18+ runtime guardrails. Start free; pricing is usage-based after that.

Sources

Future AGI pricing · Future AGI GitHub · traceAI · ai-evaluation · Agent Command Center docs · LangSmith pricing · LangSmith docs · LangSmith Self-Hosted v0.13 · Langfuse pricing · Langfuse self-hosting · Phoenix docs · Arize pricing · Braintrust pricing · Braintrust changelog · Helicone pricing · Helicone × Mintlify · Datadog pricing

Future AGI vs LangSmith · Langfuse vs LangSmith · Best AI Agent Observability Tools · Braintrust Alternatives · Best Datadog LLM Observability Alternatives

Frequently asked questions

Why do teams leave LangSmith in 2026?
Three reasons, in roughly this order. First, cost at scale: Plus is $39 per seat per month, base traces are $2.50 per 1,000 and extended traces are $5.00 per 1,000, so a five-engineer team on 100K traces a month is already past $400 before retention upgrades. Second, LangChain coupling: the platform is excellent inside LangChain and LangGraph, but value drops fast for LiteLLM apps, raw provider SDKs, CrewAI, AutoGen, OpenAI Agents SDK, or custom orchestrators. Third, missing layers: LangSmith does not ship a native eval-on-prod-traces loop, runtime guardrails at the gateway, or an LLM gateway you can self-host. Teams that hit any of the three start shopping.
What is the best LangSmith alternative in 2026?
It depends which of the three reasons you hit first. If cost at scale is the issue, Langfuse or Helicone with their flat-fee tiers will halve your bill. If LangChain coupling is the issue, Future AGI's traceAI or Arize Phoenix give you framework-neutral OpenInference traces across 50+ AI surfaces. If missing layers is the issue, Future AGI is the single platform that ships eval-on-prod-traces, runtime guardrails, and an OpenAI-compatible gateway on one Apache 2.0 plane. For teams already on Datadog APM, Datadog LLM Observability keeps spans next to infra. Pick on the gap you actually have, not the demo.
Is LangSmith open source?
The LangSmith SDK is MIT licensed and the public client code is on GitHub, but the LangSmith platform backend is closed source. That distinction matters if procurement requires source-level control over the eval and trace backend, or if your security review insists on inspecting the data plane. Future AGI and Helicone are Apache 2.0 end to end. Langfuse core is mostly MIT, with enterprise directories under a separate commercial license. Arize Phoenix is source available under Elastic License 2.0, which does not meet the OSI open-source definition. Verify license terms before you cite OSS as a constraint.
Can I self-host an alternative to LangSmith?
Yes, but the operational footprint varies. Future AGI, Langfuse, Arize Phoenix, and Helicone all ship self-hostable distributions. LangSmith Self-Hosted v0.13 ships too, but only on the Enterprise tier. The real comparison is what you run on day one. Helicone is a gateway plus Postgres. Phoenix is a single container plus an OTel collector. Future AGI's OSS trio is one pip install plus a single container for the Agent Command Center gateway. Langfuse needs web plus worker plus Postgres plus ClickHouse plus Redis plus object storage; the footprint expands as your span volume rises.
Which LangSmith alternative is best if I am already on LangChain or LangGraph?
LangSmith is still the path of least resistance when LangChain or LangGraph is the runtime and you can absorb the seat plus trace pricing. The native graph trace semantics, Prompt Hub, Fleet deployment, and Studio visualization are tied to the framework in ways alternatives have to recreate. Future AGI's traceAI is the alternative with the closest LangGraph trace fidelity, plus eval scores on the span and an Apache 2.0 license. Langfuse handles LangChain traces well but treats the graph as a flat span list. Use an alternative when you need framework neutrality, source-level control, or a unified eval-and-gateway plane.
How does LangSmith pricing compare to alternatives in 2026?
LangSmith Developer is free with 5,000 base traces a month, one seat, and one Fleet agent. Plus is $39 per seat per month, 10,000 base traces included, then $2.50 per 1,000 base traces and $5.00 per 1,000 extended traces. Enterprise is custom. Future AGI starts free with usage-based tracing, gateway requests, AI credits, and voice simulation included. Langfuse Hobby is free, Core is $29 per month with 100,000 units, Pro is $199 per month. Braintrust Pro is $249 per month. Helicone Pro is $79 per month. Phoenix is free self-hosted and AX Pro is $50 per month. Datadog LLM Observability is metered per ingested span plus per indexed log on top of the $31 per host APM fee.
Is Helicone still a safe LangSmith alternative after the Mintlify acquisition?
Helicone remains useful for gateway-first observability, request analytics, caching, alerts, HQL queries, and model cost tracking. The acquisition by Mintlify on March 3, 2026 adds roadmap risk: Helicone described its own services as continuing in maintenance mode with security updates, new model coverage, bug fixes, and performance fixes, but no major new features. Treat it as a stable gateway plus log platform with predictable behavior, not as the place to bet on next-generation features like agent simulation, voice eval, or runtime guardrail evolution.
Related Articles
View all