
Best Open Source LLM Observability in 2026: 7 Stacks Ranked

Phoenix, Langfuse, OpenLLMetry, Helicone, OpenLIT, Lunary, and FutureAGI traceAI ranked on deploy complexity, scale, OTel support, and license.

[Cover image: bold headline "OSS LLM OBSERVABILITY 2026" over a wireframe stack of nested layers with an OSS badge]

Open-source LLM observability in 2026 is no longer one project. There are at least seven maintained options spanning instrumentation libraries, full self-hosted platforms, and hybrid hosted-or-self-host paths. The honest ranking depends on what the team actually needs: low-friction OTel instrumentation, a hosted dashboard, ClickHouse-backed scale, OpenInference adherence, or eval-attached spans. This guide ranks all seven against an objective rubric (deploy complexity, scale ceiling, OTel support, license, community size, eval depth) and is honest about where FutureAGI traceAI fits versus Phoenix or Langfuse.

TL;DR: Best OSS LLM observability stack per use case

| Use case | Best pick | Why (one phrase) | Self-host complexity | License |
| --- | --- | --- | --- | --- |
| OpenTelemetry instrumentation + span-attached evals + gateway in one Apache 2.0 stack | FutureAGI traceAI | Library plus full platform with judge scoring, simulation, and gateway | Low (library) or Medium (full platform) | Apache 2.0 (traceAI + platform) |
| Full platform with traces + prompts + datasets + evals | Langfuse | Mature features, large community | Medium (ClickHouse + Postgres + Redis) | MIT core |
| OpenTelemetry-native + canonical OpenInference | Arize Phoenix | OTLP-first reference implementation | Medium (Postgres + queue) | Elastic License 2.0 |
| Drop-in instrumentation for any OTel backend | OpenLLMetry | One-line instrumentation, vendor-agnostic | Low (library only) | Apache 2.0 |
| Gateway-first with sessions and cost analytics | Helicone | Lowest friction from base URL change | Medium (Supabase + ClickHouse) | Apache 2.0 |
| OTel + GPU + LLM telemetry in one library | OpenLIT | LLM and infra telemetry under one collector | Low (library + collector) | Apache 2.0 |
| Lightweight platform for small teams | Lunary | Simple deploy, dashboards out of the box | Low (Postgres only) | Apache 2.0 |

If you only read one row: pick FutureAGI traceAI when OTel instrumentation should also unlock span-attached evals, simulation, and a gateway in the same Apache 2.0 stack. Pick Langfuse for full-platform OSS depth without the gateway. Pick Phoenix when OpenInference is non-negotiable and the workbench is enough.

Scoring rubric

Each tool is ranked across six axes:

  1. Deploy complexity. Single binary versus ClickHouse + Postgres + queue + worker. Lower is better for small teams.
  2. Scale ceiling. Spans per second on commodity infrastructure before re-architecting.
  3. OTel support. Native OTLP ingestion, OpenInference conventions, OTel collector compatibility.
  4. License. OSI-approved Apache 2.0 / MIT versus source-available ELv2 versus core-OSS-with-paid-enterprise.
  5. Community size. GitHub stars, contributors, release cadence.
  6. Eval depth. Span-attached scores, judge libraries, dataset replay.

These are the axes that matter for procurement. Feature-checkbox lists miss the constraints that actually break in production.

The 7 OSS LLM observability tools, ranked

1. FutureAGI traceAI: Best for OpenInference instrumentation that closes into evals, gateway, and guardrails

Apache 2.0. Library + full platform.

Architecture: traceAI is an Apache 2.0 OTel instrumentation library that auto-instruments LangChain, LlamaIndex, OpenAI, Anthropic, Bedrock, and others, emitting OpenTelemetry GenAI semantic-convention spans across Python, TypeScript, Java, and C#. Pair it with any OTel backend, or with the FutureAGI platform for span-attached judge scores, simulation, the Agent Command Center gateway, and 18+ guardrails in one Apache 2.0 stack.
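For the library path, setup follows the usual OpenInference-style pattern: register a tracer provider, then attach a per-framework instrumentor. A minimal sketch follows; the package and helper names (fi_instrumentation, traceai_openai, register) are assumptions drawn from that pattern, so verify them against the traceAI repo before use.

```python
# Hypothetical sketch: traceAI library path. Package and helper names
# (fi_instrumentation.register, traceai_openai.OpenAIInstrumentor) are
# assumptions -- confirm against the traceAI repo before use.
import openai
from fi_instrumentation import register          # assumed setup helper
from traceai_openai import OpenAIInstrumentor    # assumed instrumentor package

# Register a tracer provider; spans can flow to the FutureAGI platform
# or to any OTLP endpoint you already operate.
trace_provider = register(project_name="checkout-agent")

# Auto-instrument the OpenAI SDK so each completion emits an
# OpenInference-shaped span into the trace tree.
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

client = openai.OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
```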

Deploy complexity: Low for the library; medium for the full platform (ClickHouse + Postgres + Redis + Temporal + Agent Command Center gateway). Documented Helm charts available.

Scale ceiling: Library throughput bounded by your OTel collector. Full platform: 10K+ spans/sec on tuned ClickHouse.

OTel support: Native OpenTelemetry GenAI semantic conventions. Multi-language coverage matches Phoenix; the platform layer adds eval-attached spans and gateway-emitted spans into the same trace tree.

License: Apache 2.0 for the platform repo; Apache 2.0 for traceAI. OSI-approved.

Eval depth: 50+ first-party judge scores attach as span attributes. The platform also adds simulation across text and voice, prompt optimization (6 algorithms), Agent Command Center gateway routing across 100+ providers with BYOK, and 18+ guardrails (PII redaction, prompt-injection blocking, jailbreak detection, tool-call enforcement).

Worth flagging: Phoenix and Langfuse have larger OSS communities today. The full-platform path is real ops work (ClickHouse, Temporal, Agent Command Center); use the hosted cloud if you do not want to operate the data plane.

2. Langfuse: Best for full-platform OSS depth without the gateway

MIT core. Self-hostable. Hosted cloud option.

Architecture: Langfuse runs ClickHouse for span storage, Postgres for metadata, Redis for queues, and a Node API. The hosted version uses the same architecture. The platform ships traces, sessions, prompts, datasets, scores, annotations, and a query-builder dashboard.

Deploy complexity: Medium. ClickHouse + Postgres + Redis + workers; documented Helm chart and Docker Compose.

Scale ceiling: 10K+ spans/sec on tuned infrastructure. ClickHouse handles the bulk; the API layer scales horizontally.

OTel support: OTel ingestion supported via the Langfuse /api/public/otel endpoint; uses Langfuse’s own schema layered over OTel.
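Wiring an existing OTel app into Langfuse is mostly environment configuration. A minimal sketch, assuming the documented Basic-auth scheme built from a project's public and secret keys (the key values and host here are placeholders):

```python
# Route an already-OTel-instrumented app into Langfuse. Keys and host
# are placeholders; the OTLP endpoint authenticates with Basic auth
# built from the project's public:secret key pair.
import base64
import os

LANGFUSE_HOST = "https://cloud.langfuse.com"  # or your self-hosted URL
auth = base64.b64encode(b"pk-lf-...:sk-lf-...").decode()

os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = f"{LANGFUSE_HOST}/api/public/otel"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {auth}"
# Any OTel SDK configured after this point exports spans to Langfuse
# without Langfuse-specific code in the application.
```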

License: MIT for the core. Enterprise directories (ee/) are licensed separately. 14K+ stars on GitHub.

Eval depth: First-party: dataset experiments, custom scorers, LLM-as-judge, human annotation queues. Experiments CI/CD integration shipped in 2026.

Worth flagging: “MIT core” needs an asterisk in procurement reviews. Some advanced features (RBAC, SSO, audit logs) live in the EE directories. Multi-region self-hosting is a real engineering project. No first-party gateway, simulation, or guardrail layer in the same product.

3. Arize Phoenix: Best for OpenInference and OpenTelemetry adherence

Source available. Self-hostable. Phoenix Cloud and Arize AX paths exist.

Architecture: Phoenix runs as a Python or container service with Postgres for storage. The reference implementation for OpenInference semantic conventions across Python, TypeScript, and Java instrumentation.

Deploy complexity: Medium. Postgres + Phoenix server + auto-instrumentors.

Scale ceiling: High on the hosted Arize AX path; mid-scale for self-hosted Phoenix without external storage tuning.

OTel support: OTLP-first. Auto-instruments LlamaIndex, LangChain, DSPy, Mastra, Vercel AI SDK, OpenAI, Bedrock, Anthropic, and 12+ others.
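A minimal sketch of the self-hosted path, assuming Phoenix's phoenix.otel.register helper and the openinference-instrumentation-openai package, with Phoenix running locally on its default port:

```python
# Sketch: emit OpenInference spans to a self-hosted Phoenix instance.
# Assumes the phoenix.otel.register helper and the
# openinference-instrumentation-openai package; defaults may differ.
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

tracer_provider = register(
    project_name="rag-pipeline",
    endpoint="http://localhost:6006/v1/traces",  # default local Phoenix (assumption)
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
# OpenAI SDK calls now appear as spans in the Phoenix trace view.
```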

License: Elastic License 2.0. Source available, with restrictions on offering as a managed service. NOT OSI-approved open source. 5K+ stars on GitHub.

Eval depth: First-party: 30+ OSS evaluators, dataset experiments, LLM-as-judge with structured outputs, batch eval pipelines.

Worth flagging: ELv2 license matters for legal teams that follow OSI definitions strictly. Phoenix is not a gateway, not a guardrail product, not a simulator.

4. OpenLLMetry (Traceloop): Best for one-line OTel instrumentation

Apache 2.0. Library only.

Architecture: OpenLLMetry is a set of OTel instrumentations for LLM frameworks (OpenAI, Anthropic, LangChain, LlamaIndex, Bedrock, Cohere, and 20+ others). Spans go to any OTel collector. Traceloop’s hosted backend is optional.

Deploy complexity: Low. pip install traceloop-sdk and a Traceloop.init() call.
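In practice that looks like the sketch below. Traceloop.init() is the documented entry point; the api_endpoint override for routing spans to your own collector is an assumption worth verifying against the current SDK docs.

```python
# pip install traceloop-sdk
# One-line instrumentation: Traceloop.init() patches supported SDKs.
import openai
from traceloop.sdk import Traceloop

# api_endpoint points spans at your own OTel collector rather than the
# hosted Traceloop backend (kwarg name per the SDK docs; verify).
Traceloop.init(app_name="support-bot", api_endpoint="http://localhost:4318")

client = openai.OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "hello"}],
)
# The call above is exported as an OTel span with GenAI attributes.
```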

Scale ceiling: Bounded by the OTel collector and storage you choose (Tempo, Jaeger, Honeycomb, Datadog).

OTel support: Native. Uses Traceloop’s semantic conventions, with OpenInference compatibility.

License: Apache 2.0. 4K+ stars on GitHub.

Eval depth: None at the library layer. The Traceloop hosted platform adds dashboards, prompts, and evals; the OSS instrumentation library is observability-only.

Worth flagging: No dashboard ships with the OSS edition. Pair it with Tempo, Jaeger, or a dedicated platform.

5. Helicone: Best for gateway-first telemetry

Apache 2.0. Self-hostable. Hosted cloud option.

Architecture: Helicone is a proxy gateway that captures every LLM request as a span. Self-hosted runs Supabase (Postgres + auth) plus ClickHouse for traces. Hosted version uses the same architecture.
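The base-URL swap looks like this sketch, following Helicone's documented proxy pattern (oai.helicone.ai host, Helicone-Auth header); the key is a placeholder, and self-hosted deployments substitute their own gateway URL.

```python
# Helicone gateway: change the base URL, keep the same OpenAI SDK calls.
# Host and header follow Helicone's documented proxy pattern; the key
# is a placeholder. Self-hosted deployments use their own gateway URL.
import os

import openai

client = openai.OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "hello"}],
)
# Each proxied request is logged with cost, latency, and session metadata.
```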

Deploy complexity: Medium. Supabase + ClickHouse + workers.

Scale ceiling: 1K+ requests/sec on standard ClickHouse, higher with tuning.

OTel support: Helicone has its own schema. OTel exporters exist but are secondary.

License: Apache 2.0. 4K+ stars.

Eval depth: Sessions, request analytics, prompts, scores. Eval surface is shallower than Langfuse or Phoenix.

Worth flagging: Roadmap risk after the March 2026 Mintlify acquisition; the platform remains usable but new feature velocity slowed. See Helicone Alternatives.

6. OpenLIT: Best for LLM + GPU + infra telemetry in one OTel collector

Apache 2.0. Library + optional UI.

Architecture: OpenLIT ships OTel instrumentation for LLM frameworks, vector DBs, GPU usage (via NVIDIA exporters), and infra. The optional ClickHouse-backed UI gives a unified view across LLM and infra spans.
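Setup is a single call. A minimal sketch, assuming openlit.init() and its OTLP-endpoint parameter as documented (verify the exact kwarg against the current release):

```python
# pip install openlit
# One call wires LLM, vector-DB, and GPU telemetry into OpenTelemetry.
import openlit

# Point exports at the collector you already run; parameter name per
# the OpenLIT docs (verify against the current release).
openlit.init(otlp_endpoint="http://127.0.0.1:4318")
# Supported SDKs are auto-instrumented from this point; GPU metrics
# need the optional GPU collector enabled.
```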

Deploy complexity: Low to medium. Library is one dependency; the optional UI adds ClickHouse.

Scale ceiling: Bounded by the ClickHouse cluster you operate.

OTel support: Native. Strong on the GPU and infra side, with LLM coverage that grew through 2025 and 2026.

License: Apache 2.0. 1.5K+ stars.

Eval depth: Light. Focus is on telemetry breadth, not eval depth.

Worth flagging: Smaller community than Langfuse or Phoenix. Eval and prompt management are not first-class.

7. Lunary: Best for a lightweight self-hosted platform

Apache 2.0. Self-hostable. Hosted cloud option.

Architecture: Lunary ships an LLM monitoring app with a Postgres-only backend (no ClickHouse, no Redis) and a Node + React UI. The simplest self-host of the seven.

Deploy complexity: Low. One Postgres, one container.

Scale ceiling: Mid-scale. Postgres-backed storage caps span throughput before re-architecting.

OTel support: SDK + custom HTTP API. OTel ingestion is supported but not the primary path.

License: Apache 2.0. 1.5K+ stars.

Eval depth: Built-in evaluators, dataset workflows, prompt versioning. Lighter than Langfuse but covers the basics.

Worth flagging: Smaller community and slower release cadence. Multi-region scaling not first-class.

[Product showcase: OTel ingestion chips for supported SDKs, span tree with a failed validator span, Apache 2.0 license badge, self-host KPI panel]

Decision framework: pick by constraint

  • OSI-approved license required: FutureAGI traceAI, OpenLLMetry, OpenLIT, Helicone, Lunary, Langfuse core. Phoenix is ELv2.
  • Lowest deploy footprint: OpenLLMetry, OpenLIT, FutureAGI traceAI library. Lunary for a lightweight platform.
  • Maximum scale ceiling: Langfuse with ClickHouse. FutureAGI platform on ClickHouse. Phoenix via Arize AX.
  • OpenInference adherence: Phoenix and FutureAGI traceAI lead.
  • Gateway-first telemetry: Helicone, with FutureAGI Agent Command Center as the closed-loop alternative.
  • GPU + LLM in one collector: OpenLIT.
  • Smallest team, simplest deploy: Lunary.
  • Already running Tempo or Jaeger: OpenLLMetry, OpenLIT, FutureAGI traceAI emit into your existing collector.

Common mistakes when picking an OSS observability tool

  • Confusing “OSS” with “self-hostable for free at scale”. Self-hosting at 10K spans/sec means an SRE who knows ClickHouse. The infra hours are real.
  • Picking on GitHub stars alone. Stars correlate with hype, not with maintenance. Check release cadence, contributor count, and issue close-rate.
  • Ignoring license clauses. Phoenix is ELv2 (no managed-service offerings). Langfuse has enterprise dirs outside MIT. Verify before legal review.
  • Skipping the eval requirement. Tracing without span-attached evals leaves a quality blind spot. Eval pass-rate trend is the leading indicator of regressions.
  • Treating an instrumentation library as a platform. OpenLLMetry, OpenLIT, and traceAI emit spans; they do not give you a dashboard. Pair them with one.
  • Forgetting OTel collector tuning. Without batching and queuing, OTel ingestion drops spans under load. Tune the collector before benchmarking.
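On that last point, a minimal collector tuning sketch, assuming a stock OTel collector with OTLP in and OTLP out; the numbers are illustrative starting points, not tuned recommendations:

```yaml
# Minimal OTel collector tuning sketch. Values are illustrative
# starting points, not tuned recommendations.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
  batch:
    send_batch_size: 2048
    timeout: 5s

exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      queue_size: 10000
    retry_on_failure:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```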

What changed in OSS LLM observability in 2026

| Date | Event | Why it matters |
| --- | --- | --- |
| May 2026 | Langfuse shipped Experiments CI/CD integration | OSS-first teams can gate experiments in GitHub Actions. |
| Mar 9, 2026 | FutureAGI shipped Agent Command Center and ClickHouse trace storage | High-volume trace analytics moved into the same loop as evals and gateway. |
| Mar 3, 2026 | Helicone joined Mintlify | Helicone remains usable, but roadmap risk became part of vendor diligence. |
| Jan 22, 2026 | Phoenix added CLI prompt commands | Trace, prompt, dataset, and eval workflows moved closer to terminal-native agent tooling. |
| 2025-2026 | OpenInference v1 conventions stabilized across Phoenix and traceAI | Cross-platform span schema reduces vendor lock-in. |
| 2025 | Lunary continued lightweight platform development | Smaller teams now have a maintained one-Postgres option. |

How to actually evaluate this for production

  1. Run a domain reproduction. Export a representative slice of real LLM traffic, instrument with each candidate, and compare span fidelity, eval coverage, and storage cost.

  2. Test the full loop. Simulate a regression, surface it via the platform, replay in pre-prod, push a fix through CI. Track time-to-resolve at each stage.

  3. Cost-adjust. Real cost equals platform price (zero for OSS) plus infra cost (compute, storage, retention) plus the SRE hours to operate ClickHouse, Postgres, and Redis at the throughput you need.
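As a trivial worked example of that formula (every input below is an illustrative placeholder, not a benchmark):

```python
# Self-hosting TCO sketch. Every input is an illustrative placeholder;
# substitute your own infra quotes and loaded engineering rates.
platform_license = 0     # OSS edition
infra_monthly = 1_800    # ClickHouse + Postgres + Redis nodes ($, assumed)
storage_monthly = 300    # span retention at your volume ($, assumed)
sre_hours = 20           # monthly hours operating the stack (assumed)
sre_rate = 120           # loaded $/hour (assumed)

monthly_tco = platform_license + infra_monthly + storage_monthly + sre_hours * sre_rate
print(f"Monthly TCO: ${monthly_tco:,}")  # Monthly TCO: $4,500
```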

How FutureAGI implements open-source LLM observability

FutureAGI is the production-grade open-source LLM observability platform built around the closed reliability loop that other OSS observability picks stitch together by hand. The full stack runs on one Apache 2.0 self-hostable plane:

  • Tracing: traceAI (Apache 2.0) auto-instruments 35+ frameworks across Python, TypeScript, Java (LangChain4j and Spring AI), and a C# core, with OpenInference-shaped spans flowing into ClickHouse-backed storage.
  • Evals: 50+ first-party metrics (Faithfulness, Hallucination, Tool Correctness, Task Completion, Plan Adherence) attach as span attributes; BYOK lets any LLM serve as the judge at zero platform fee, and turing_flash runs the same rubrics at 50 to 70 ms p95.
  • Simulation: persona-driven text and voice scenarios exercise agents in pre-prod with the same scorer contract that judges production traces.
  • Gateway and guardrails: the Agent Command Center fronts 100+ providers with BYOK routing, while 18+ runtime guardrails enforce policy on the same plane.

Beyond the four axes, FutureAGI also ships six prompt-optimization algorithms that consume failing trajectories as training data. Pricing starts free with a 50 GB tracing tier; Boost is $250 per month, Scale is $750 per month with HIPAA, and Enterprise from $2,000 per month with SOC 2 Type II. The license is genuinely Apache 2.0 across the stack rather than ELv2 (Phoenix) or MIT-with-enterprise-split (Langfuse).

Most teams comparing OSS observability picks end up running three or four tools in production: one for traces, one for evals, one for the gateway, one for guardrails. FutureAGI is the recommended pick because tracing, evals, simulation, gateway, and guardrails all live on one self-hostable runtime; the loop closes without stitching.

Read next: Best Self-Hosted LLM Observability, Best LLM Tracing Tools, Best LLM Monitoring Tools

Frequently asked questions

What are the best open-source LLM observability tools in 2026?
The shortlist by maintained scope and 2026 activity: Arize Phoenix, Langfuse, OpenLLMetry (Traceloop), Helicone, OpenLIT, Lunary, and FutureAGI traceAI. Phoenix and Langfuse lead on full-platform features. OpenLLMetry and OpenLIT lead on OTel-native instrumentation libraries. Helicone leads gateway-first telemetry. Lunary leads on lightweight project dashboards. FutureAGI traceAI is the newest entrant with OpenInference conventions and span-attached evals.
Which open-source LLM observability tool is OSI-approved?
FutureAGI traceAI is Apache 2.0 and the FutureAGI platform repo is Apache 2.0. Langfuse core is MIT, with enterprise directories handled separately. OpenLLMetry is Apache 2.0. OpenLIT is Apache 2.0. Helicone is Apache 2.0. Lunary is Apache 2.0. Arize Phoenix is source available under Elastic License 2.0, which is not OSI-approved open source. Verify license carefully when self-hosting and redistribution matter for legal review.
Which OSS observability tool has the lowest deployment complexity?
OpenLLMetry and OpenLIT are instrumentation-only libraries; deployment is just adding a Python or TypeScript dependency and pointing at any OTel collector. Lunary is the lightest full platform: one Postgres, one container, minimal services. Langfuse is heavier (ClickHouse, Postgres, Redis), and Phoenix needs Postgres plus a queue. FutureAGI traceAI's instrumentation library deploys like OpenLLMetry; the platform layer matches Langfuse for service count.
Which OSS observability tool scales best to 10K+ spans per second?
Langfuse with ClickHouse and Phoenix with the Arize AX cloud both handle 10K+ spans per second on tuned infrastructure. FutureAGI's hosted cloud uses ClickHouse trace storage for the same scale. OpenLIT and OpenLLMetry rely on the OTel collector and downstream storage you choose, so throughput is bounded by that storage. Helicone and Lunary are best suited to mid-scale workloads. Run a load test before committing.
Should I use OpenTelemetry directly or pick an OSS observability platform?
OpenTelemetry is the wire format. You still need a backend that understands LLM-specific conventions (OpenInference for spans, judge scores as attributes, prompt versions as resource tags). Most teams pair an OTel instrumentation library (OpenLLMetry, OpenLIT, traceAI) with a backend (Phoenix, Langfuse, FutureAGI). Going OTel-direct without LLM-aware tooling means rebuilding span tree views and eval-attached scoring yourself.
How does pricing compare across OSS LLM observability tools?
All seven OSS editions are free to self-host. Hosted versions vary: FutureAGI is free plus usage from $2/GB; Boost is $250/mo, Scale is $750/mo with HIPAA, Enterprise from $2,000/mo with SOC 2 Type II. Phoenix is free to self-host; Arize AX Pro is $50/mo. Langfuse Hobby is free, Core $29/mo. Helicone Hobby is free, Pro $79/mo. Lunary has a free tier, Pro $20/mo. OpenLLMetry's Traceloop platform is $79/mo for Pro. OpenLIT is community-only. Self-hosting cost equals infra plus the engineer-hours that operate it.
Which OSS observability tool supports OpenInference conventions?
FutureAGI traceAI emits OpenInference spans across Python, TypeScript, Java, and C# with auto-instrumentation for 35+ frameworks. Phoenix is the canonical OpenInference reference implementation. OpenLLMetry uses its own Traceloop conventions but interoperates with OpenInference via translation. OpenLIT supports a subset. Langfuse has an OpenInference compatibility layer. If OpenInference adherence is non-negotiable, FutureAGI traceAI and Phoenix lead.
Which OSS tool is best for adding LLM observability to an existing OTel stack?
FutureAGI traceAI, OpenLLMetry, and OpenLIT are instrumentation libraries that emit spans into any OTel collector you already operate (Tempo, Jaeger, Datadog, Grafana, Honeycomb). They do not require a separate platform. If you already run Datadog or Grafana, start with traceAI (broadest framework and language coverage) and only add a dedicated platform when LLM-specific dashboards and evals become a bottleneck.