Best Open Source LLM Observability in 2026: 7 Stacks Ranked
Phoenix, Langfuse, OpenLLMetry, Helicone, OpenLIT, Lunary, and FutureAGI traceAI ranked on deploy complexity, scale, OTel support, and license.
Open-source LLM observability in 2026 is no longer one project. There are at least seven maintained options spanning instrumentation libraries, full self-hosted platforms, and hybrid hosted-or-self-host paths. The honest ranking depends on what the team actually needs: low-friction OTel instrumentation, a hosted dashboard, ClickHouse-backed scale, OpenInference adherence, or eval-attached spans. This guide ranks all seven against an objective rubric (deploy complexity, scale ceiling, OTel support, license, community size, eval depth) and is honest about where FutureAGI traceAI fits versus Phoenix or Langfuse.
TL;DR: Best OSS LLM observability stack per use case
| Use case | Best pick | Why (one phrase) | Self-host complexity | License |
|---|---|---|---|---|
| OpenTelemetry instrumentation + span-attached evals + gateway in one Apache 2.0 stack | FutureAGI traceAI | Library plus full platform with judge scoring, simulation, and gateway | Low (library) or Medium (full platform) | Apache 2.0 (traceAI + platform) |
| Full platform with traces + prompts + datasets + evals | Langfuse | Mature features, large community | Medium (ClickHouse + Postgres + Redis) | MIT core |
| OpenTelemetry-native + canonical OpenInference | Arize Phoenix | OTLP-first reference implementation | Medium (Postgres + queue) | Elastic License 2.0 |
| Drop-in instrumentation for any OTel backend | OpenLLMetry | One-line instrumentation, vendor-agnostic | Low (library only) | Apache 2.0 |
| Gateway-first with sessions and cost analytics | Helicone | Lowest friction from base URL change | Medium (Supabase + ClickHouse) | Apache 2.0 |
| OTel + GPU + LLM telemetry in one library | OpenLIT | LLM and infra telemetry under one collector | Low (library + collector) | Apache 2.0 |
| Lightweight platform for small teams | Lunary | Simple deploy, dashboards out of the box | Low (Postgres only) | Apache 2.0 |
If you only read one row: pick FutureAGI traceAI when OTel instrumentation should also unlock span-attached evals, simulation, and a gateway in the same Apache 2.0 stack. Pick Langfuse for full-platform OSS depth without the gateway. Pick Phoenix when OpenInference is non-negotiable and the workbench is enough.
Scoring rubric
Each tool is ranked across six axes:
- Deploy complexity. Single binary versus ClickHouse + Postgres + queue + worker. Lower is better for small teams.
- Scale ceiling. Spans per second on commodity infrastructure before re-architecting.
- OTel support. Native OTLP ingestion, OpenInference conventions, OTel collector compatibility.
- License. OSI-approved Apache 2.0 / MIT versus source-available ELv2 versus core-OSS-with-paid-enterprise.
- Community size. GitHub stars, contributors, release cadence.
- Eval depth. Span-attached scores, judge libraries, dataset replay.
These are the axes that matter for procurement. Feature-checkbox lists miss the ones that actually break in production.
The 7 OSS LLM observability tools, ranked
1. FutureAGI traceAI: Best for OpenInference instrumentation that closes into evals, gateway, and guardrails
Apache 2.0. Library + full platform.
Architecture: traceAI is an Apache 2.0 OTel instrumentation library that auto-instruments LangChain, LlamaIndex, OpenAI, Anthropic, Bedrock, and others, emitting OpenTelemetry GenAI semantic-convention spans across Python, TypeScript, Java, and C#. Pair it with any OTel backend, or with the FutureAGI platform for span-attached judge scores, simulation, the Agent Command Center gateway, and 18+ guardrails in one Apache 2.0 stack.
Deploy complexity: Low for the library; medium for the full platform (ClickHouse + Postgres + Redis + Temporal + Agent Command Center gateway). Documented Helm charts available.
Scale ceiling: Library throughput bounded by your OTel collector. Full platform: 10K+ spans/sec on tuned ClickHouse.
OTel support: Native OpenTelemetry GenAI semantic conventions. Multi-language coverage matches Phoenix; the platform layer adds eval-attached spans and gateway-emitted spans into the same trace tree.
License: Apache 2.0 for the platform repo; Apache 2.0 for traceAI. OSI-approved.
Eval depth: 50+ first-party judge scores attach as span attributes. The platform also adds simulation across text and voice, prompt optimization (6 algorithms), Agent Command Center gateway routing across 100+ providers with BYOK, and 18+ guardrails (PII redaction, prompt-injection blocking, jailbreak detection, tool-call enforcement).
Worth flagging: Phoenix and Langfuse have larger OSS communities today. The full-platform path is real ops work (ClickHouse, Temporal, Agent Command Center); use the hosted cloud if you do not want to operate the data plane.
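To make "OpenTelemetry GenAI semantic-convention spans" concrete, here is a minimal sketch using the vanilla OpenTelemetry Python SDK. traceAI's auto-instrumentors emit spans shaped like this for you; the block below is illustrative rather than traceAI's own API, and the endpoint and attribute values are placeholders.

```python
# Minimal sketch: an LLM span with OTel GenAI semantic-convention attributes,
# exported over OTLP/HTTP to whatever backend you run (collector, Phoenix, FutureAGI, ...).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("demo")

# traceAI attaches attributes like these automatically around each provider call.
with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.usage.input_tokens", 812)
    span.set_attribute("gen_ai.usage.output_tokens", 143)
```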
2. Langfuse: Best for full-platform OSS depth without the gateway
MIT core. Self-hostable. Hosted cloud option.
Architecture: Langfuse runs ClickHouse for span storage, Postgres for metadata, Redis for queues, and a Node API. The hosted version uses the same architecture. The platform ships traces, sessions, prompts, datasets, scores, annotations, and a query-builder dashboard.
Deploy complexity: Medium. ClickHouse + Postgres + Redis + workers; documented Helm chart and Docker Compose.
Scale ceiling: 10K+ spans/sec on tuned infrastructure. ClickHouse handles the bulk; the API layer scales horizontally.
OTel support: OTel ingestion supported via the Langfuse /api/public/otel endpoint; uses Langfuse’s own schema layered over OTel.
License: MIT for the core. Enterprise directories (ee/) are licensed separately. 14K+ stars on GitHub.
Eval depth: First-party: dataset experiments, custom scorers, LLM-as-judge, human annotation queues. Recently added Experiments CI/CD integration in 2026.
Worth flagging: “MIT core” needs an asterisk in procurement reviews. Some advanced features (RBAC, SSO, audit logs) live in the EE directories. Multi-region self-hosting is a real engineering project. No first-party gateway, simulation, or guardrail layer in the same product.
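The /api/public/otel path mentioned above means any OTLP/HTTP exporter can feed Langfuse directly. A minimal sketch, assuming Langfuse's documented Basic-auth scheme over the project's public/secret key pair; verify the exact path and header against your Langfuse version.

```python
# Sketch: route OTLP/HTTP spans to a self-hosted Langfuse via its OTel endpoint.
import base64
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

LANGFUSE_HOST = "https://langfuse.internal.example.com"   # your deployment (placeholder)
auth = base64.b64encode(b"pk-lf-xxx:sk-lf-xxx").decode()  # public:secret key pair (placeholder)

exporter = OTLPSpanExporter(
    endpoint=f"{LANGFUSE_HOST}/api/public/otel/v1/traces",
    headers={"Authorization": f"Basic {auth}"},
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
```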
3. Arize Phoenix: Best for OpenInference and OpenTelemetry adherence
Source available. Self-hostable. Phoenix Cloud and Arize AX paths exist.
Architecture: Phoenix runs as a Python or container service with Postgres for storage. It is the reference implementation of the OpenInference semantic conventions, with instrumentation across Python, TypeScript, and Java.
Deploy complexity: Medium. Postgres + Phoenix server + auto-instrumentors.
Scale ceiling: High on the hosted Arize AX path; mid-scale for self-hosted Phoenix without external storage tuning.
OTel support: OTLP-first. Auto-instruments LlamaIndex, LangChain, DSPy, Mastra, Vercel AI SDK, OpenAI, Bedrock, Anthropic, and 12+ others.
License: Elastic License 2.0. Source available, with restrictions on offering as a managed service. NOT OSI-approved open source. 5K+ stars on GitHub.
Eval depth: First-party: 30+ OSS evaluators, dataset experiments, LLM-as-judge with structured outputs, batch eval pipelines.
Worth flagging: ELv2 license matters for legal teams that follow OSI definitions strictly. Phoenix is not a gateway, not a guardrail product, not a simulator.
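A minimal setup sketch of the OTLP-first path, based on the arize-phoenix-otel register helper and an OpenInference auto-instrumentor as documented at the time of writing; the project name and endpoint are placeholders, so check current Phoenix docs for exact entry points.

```python
# Sketch: send OpenInference spans from the OpenAI SDK to a self-hosted Phoenix server.
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register(
    project_name="checkout-agent",                      # placeholder
    endpoint="http://phoenix.internal:6006/v1/traces",  # Phoenix OTLP/HTTP endpoint
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
# Every OpenAI SDK call in this process now emits OpenInference-conformant spans.
```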
4. OpenLLMetry (Traceloop): Best for one-line OTel instrumentation
Apache 2.0. Library only.
Architecture: OpenLLMetry is a set of OTel instrumentations for LLM frameworks (OpenAI, Anthropic, LangChain, LlamaIndex, Bedrock, Cohere, and 20+ others). Spans go to any OTel collector. Traceloop’s hosted backend is optional.
Deploy complexity: Low. pip install traceloop-sdk and a Traceloop.init() call.
Scale ceiling: Bounded by the OTel collector and storage you choose (Tempo, Jaeger, Honeycomb, Datadog).
OTel support: Native. Uses Traceloop’s semantic conventions, with OpenInference compatibility.
License: Apache 2.0. 4K+ stars on GitHub.
Eval depth: None at the library layer. The Traceloop hosted platform adds dashboards, prompts, and evals; the OSS instrumentation library is observability-only.
Worth flagging: No standalone hosted dashboard in the OSS edition. Pair with Tempo, Jaeger, or a dedicated platform.
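What the one-line claim looks like in practice, sketched from the traceloop-sdk docs; the app name is a placeholder, and pointing TRACELOOP_BASE_URL at your own OTLP endpoint (Tempo, Jaeger, a collector) keeps everything off the hosted backend.

```python
# Sketch: OpenLLMetry init plus a workflow decorator; spans go to whatever OTLP
# backend TRACELOOP_BASE_URL points at, skipping the hosted Traceloop service.
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

Traceloop.init(app_name="support-bot")

@workflow(name="answer_ticket")
def answer_ticket(question: str) -> str:
    # Any OpenAI / Anthropic / LangChain call made in here is auto-instrumented
    # and nested under the "answer_ticket" workflow span.
    ...
```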
5. Helicone: Best for gateway-first telemetry
Apache 2.0. Self-hostable. Hosted cloud option.
Architecture: Helicone is a proxy gateway that captures every LLM request as a span. Self-hosted runs Supabase (Postgres + auth) plus ClickHouse for traces. Hosted version uses the same architecture.
Deploy complexity: Medium. Supabase + ClickHouse + workers.
Scale ceiling: 1K+ requests/sec on standard ClickHouse, higher with tuning.
OTel support: Helicone has its own schema. OTel exporters exist but are secondary.
License: Apache 2.0. 4K+ stars.
Eval depth: Sessions, request analytics, prompts, scores. Eval surface is shallower than Langfuse or Phoenix.
Worth flagging: Roadmap risk after the March 2026 Mintlify acquisition; the platform remains usable but new feature velocity slowed. See Helicone Alternatives.
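What "lowest friction from base URL change" looks like, sketched with the OpenAI Python SDK; the hosted proxy URL is shown, a self-hosted deployment substitutes its own gateway host, and the header name follows Helicone's docs.

```python
# Sketch: Helicone's gateway-first capture is a base-URL change plus one auth header.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # or http://your-helicone-gateway/v1 when self-hosting
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
# Every request through this client lands in Helicone with cost, latency, and session data.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
```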
6. OpenLIT: Best for LLM + GPU + infra telemetry in one OTel collector
Apache 2.0. Library + optional UI.
Architecture: OpenLIT ships OTel instrumentation for LLM frameworks, vector DBs, GPU usage (via NVIDIA exporters), and infra. The optional ClickHouse-backed UI gives a unified view across LLM and infra spans.
Deploy complexity: Low to medium. Library is one dependency; the optional UI adds ClickHouse.
Scale ceiling: Bounded by the ClickHouse cluster you operate.
OTel support: Native. Strong on the GPU and infra side, with LLM coverage that grew through 2025 and 2026.
License: Apache 2.0. 1.5K+ stars.
Eval depth: Light. Focus is on telemetry breadth, not eval depth.
Worth flagging: Smaller community than Langfuse or Phoenix. Eval and prompt management are not first-class.
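A sketch of the single-init pattern from OpenLIT's Python SDK docs; the keyword arguments beyond otlp_endpoint are assumptions worth confirming against the current release.

```python
# Sketch: one init call routes LLM, vector DB, and GPU telemetry through one OTel pipeline.
import openlit

openlit.init(
    otlp_endpoint="http://otel-collector.internal:4318",  # your collector or OpenLIT UI backend
    application_name="rag-api",                           # assumption: verify kwarg name
    collect_gpu_stats=True,                               # assumption: GPU flag name may differ
)
```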
7. Lunary: Best for a lightweight self-hosted platform
Apache 2.0. Self-hostable. Hosted cloud option.
Architecture: Lunary ships an LLM monitoring app with Postgres-only backend (no ClickHouse, no Redis) and a Node + React UI. The simplest self-host of the seven.
Deploy complexity: Low. One Postgres, one container.
Scale ceiling: Mid-scale. Postgres-backed storage caps span throughput before re-architecting.
OTel support: SDK + custom HTTP API. OTel ingestion is supported but not the primary path.
License: Apache 2.0. 1.5K+ stars.
Eval depth: Built-in evaluators, dataset workflows, prompt versioning. Lighter than Langfuse but covers the basics.
Worth flagging: Smaller community and slower release cadence. Multi-region scaling not first-class.
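A sketch of Lunary's wrap-the-client pattern, assuming the lunary.monitor helper its Python SDK documents plus env-var configuration (LUNARY_PUBLIC_KEY, and an API URL override for self-hosting); treat the exact names as assumptions if your version differs.

```python
# Sketch: wrap an OpenAI client so its requests are logged to a Lunary instance.
import lunary
from openai import OpenAI

client = OpenAI()
lunary.monitor(client)  # subsequent calls through this client are recorded by Lunary

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "hello"}],
)
```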

Decision framework: pick by constraint
- OSI-approved license required: FutureAGI traceAI, OpenLLMetry, OpenLIT, Helicone, Lunary, Langfuse core. Phoenix is ELv2.
- Lowest deploy footprint: OpenLLMetry, OpenLIT, FutureAGI traceAI library. Lunary for a lightweight platform.
- Maximum scale ceiling: Langfuse with ClickHouse. FutureAGI platform on ClickHouse. Phoenix via Arize AX.
- OpenInference adherence: Phoenix and FutureAGI traceAI lead.
- Gateway-first telemetry: Helicone, with FutureAGI Agent Command Center as the closed-loop alternative.
- GPU + LLM in one collector: OpenLIT.
- Smallest team, simplest deploy: Lunary.
- Already running Tempo or Jaeger: OpenLLMetry, OpenLIT, FutureAGI traceAI emit into your existing collector.
Common mistakes when picking an OSS observability tool
- Confusing “OSS” with “self-hostable for free at scale”. Self-hosting at 10K spans/sec means an SRE who knows ClickHouse. The infra hours are real.
- Picking on GitHub stars alone. Stars correlate with hype, not with maintenance. Check release cadence, contributor count, and issue close-rate.
- Ignoring license clauses. Phoenix is ELv2 (no managed-service offerings). Langfuse has enterprise dirs outside MIT. Verify before legal review.
- Skipping the eval requirement. Tracing without span-attached evals leaves a quality blind spot. Eval pass-rate trend is the leading indicator of regressions.
- Treating an instrumentation library as a platform. OpenLLMetry, OpenLIT, and traceAI emit spans; they do not ship a dashboard on their own. Pair them with one.
- Forgetting OTel collector tuning. Without batching and queuing, OTel ingestion drops spans under load. Tune the collector before benchmarking.
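The same dropped-span failure mode exists at the SDK layer before anything reaches the collector. A sketch of the Python BatchSpanProcessor knobs to raise before benchmarking any backend; collector-side batching and queuing live in the collector's own config and are not shown here.

```python
# Sketch: SDK-side batching. Under sustained load, an undersized queue silently
# drops spans before they ever leave the process.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector.internal:4318/v1/traces"),
        max_queue_size=8192,         # SDK default is 2048; raise before load-testing
        max_export_batch_size=1024,  # SDK default is 512
        schedule_delay_millis=2000,  # flush interval in ms
    )
)
```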
What changed in OSS LLM observability in 2026
| Date | Event | Why it matters |
|---|---|---|
| May 2026 | Langfuse shipped Experiments CI/CD integration | OSS-first teams can gate experiments in GitHub Actions. |
| Mar 9, 2026 | FutureAGI shipped Agent Command Center and ClickHouse trace storage | High-volume trace analytics moved into the same loop as evals and gateway. |
| Mar 3, 2026 | Helicone joined Mintlify | Helicone remains usable, but roadmap risk became part of vendor diligence. |
| Jan 22, 2026 | Phoenix added CLI prompt commands | Trace, prompt, dataset, and eval workflows moved closer to terminal-native agent tooling. |
| 2025-2026 | OpenInference v1 conventions stabilized across Phoenix and traceAI | Cross-platform span schema reduces vendor lock-in. |
| 2025 | Lunary continued lightweight platform development | Smaller teams now have a maintained one-Postgres option. |
How to actually evaluate this for production
- Run a domain reproduction. Export a representative slice of real LLM traffic, instrument with each candidate, and compare span fidelity, eval coverage, and storage cost.
- Test the full loop. Simulate a regression, surface it via the platform, replay in pre-prod, push a fix through CI. Track time-to-resolve at each stage.
- Cost-adjust. Real cost equals platform price (zero for OSS) plus infra cost (compute, storage, retention) plus the SRE hours to operate ClickHouse, Postgres, Redis at the throughput you need.
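A back-of-envelope sketch of that cost equation; every number below is an assumption to replace with your own quotes and throughput figures.

```python
# Hypothetical monthly TCO for a self-hosted ClickHouse-backed stack at ~10K spans/sec.
def monthly_tco(platform_fee, infra_compute, infra_storage, sre_hours, sre_hourly_rate):
    """Real cost = platform price + infra (compute + storage/retention) + operator time."""
    return platform_fee + infra_compute + infra_storage + sre_hours * sre_hourly_rate

oss_self_hosted = monthly_tco(
    platform_fee=0,      # OSS license
    infra_compute=1800,  # ClickHouse + Postgres + Redis nodes (assumed)
    infra_storage=400,   # 30-day span retention (assumed)
    sre_hours=25,        # ops time per month (assumed)
    sre_hourly_rate=120,
)
print(oss_self_hosted)  # 5200 -> "free" OSS is not free at scale
```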
How FutureAGI implements open-source LLM observability
FutureAGI is the production-grade open-source LLM observability platform built around the closed reliability loop that other OSS observability picks stitch together by hand. The full stack runs on one Apache 2.0 self-hostable plane:
- Tracing: traceAI (Apache 2.0) auto-instruments 35+ frameworks across Python, TypeScript, Java (LangChain4j and Spring AI), and a C# core, with OpenInference-shaped spans flowing into ClickHouse-backed storage.
- Evals: 50+ first-party metrics (Faithfulness, Hallucination, Tool Correctness, Task Completion, Plan Adherence) attach as span attributes; BYOK lets any LLM serve as the judge at zero platform fee, and turing_flash runs the same rubrics at 50 to 70 ms p95.
- Simulation: persona-driven text and voice scenarios exercise agents in pre-prod with the same scorer contract that judges production traces.
- Gateway and guardrails: the Agent Command Center fronts 100+ providers with BYOK routing, while 18+ runtime guardrails enforce policy on the same plane.
Beyond the four axes, FutureAGI also ships six prompt-optimization algorithms that consume failing trajectories as training data. Pricing starts free with a 50 GB tracing tier; Boost is $250 per month, Scale is $750 per month with HIPAA, and Enterprise starts at $2,000 per month with SOC 2 Type II. The license is genuinely Apache 2.0 across the stack rather than ELv2 (Phoenix) or MIT-with-enterprise-split (Langfuse).
Most teams comparing OSS observability picks end up running three or four tools in production: one for traces, one for evals, one for the gateway, one for guardrails. FutureAGI is the recommended pick because tracing, evals, simulation, gateway, and guardrails all live on one self-hostable runtime; the loop closes without stitching.
Sources
- Phoenix GitHub repo
- Langfuse GitHub repo
- OpenLLMetry GitHub repo
- Helicone GitHub repo
- OpenLIT GitHub repo
- Lunary GitHub repo
- FutureAGI traceAI GitHub repo
- OpenInference conventions
- Helicone Mintlify announcement
- Langfuse pricing
- Arize pricing
- FutureAGI pricing
Series cross-link
Read next: Best Self-Hosted LLM Observability, Best LLM Tracing Tools, Best LLM Monitoring Tools
Frequently asked questions
What are the best open-source LLM observability tools in 2026?
Which open-source LLM observability tool is OSI-approved?
Which OSS observability tool has the lowest deployment complexity?
Which OSS observability tool scales best to 10K+ spans per second?
Should I use OpenTelemetry directly or pick an OSS observability platform?
How does pricing compare across OSS LLM observability tools?
Which OSS observability tool supports OpenInference conventions?
Which OSS tool is best for adding LLM observability to an existing OTel stack?