
Best Open Source LLM Observability in 2026: 7 Stacks Ranked

Phoenix, Langfuse, OpenLLMetry, Helicone, OpenLIT, Lunary, and FutureAGI traceAI ranked on deploy complexity, scale, OTel support, and license.

[Cover image: bold headline "OSS LLM OBSERVABILITY 2026" over a wireframe stack of nested layers with an OSS badge]

Open-source LLM observability in 2026 is no longer one project. There are at least seven maintained options spanning instrumentation libraries, full self-hosted platforms, and hybrid hosted-or-self-host paths. The honest ranking depends on what the team actually needs: low-friction OTel instrumentation, a hosted dashboard, ClickHouse-backed scale, OpenInference adherence, or eval-attached spans. This guide ranks all seven against an objective rubric (deploy complexity, scale ceiling, OTel support, license, community size, eval depth) and is honest about where FutureAGI traceAI fits versus Phoenix or Langfuse.

TL;DR: Best OSS LLM observability stack per use case

| Use case | Best pick | Why (one phrase) | Self-host complexity | License |
| --- | --- | --- | --- | --- |
| OpenTelemetry instrumentation + span-attached evals + gateway in one Apache 2.0 stack | FutureAGI traceAI | Library plus full platform with judge scoring, simulation, and gateway | Low (library) or Medium (full platform) | Apache 2.0 (traceAI + platform) |
| Full platform with traces + prompts + datasets + evals | Langfuse | Mature features, large community | Medium (ClickHouse + Postgres + Redis) | MIT core |
| OpenTelemetry-native + canonical OpenInference | Arize Phoenix | OTLP-first reference implementation | Medium (Postgres + queue) | Elastic License 2.0 |
| Drop-in instrumentation for any OTel backend | OpenLLMetry | One-line instrumentation, vendor-agnostic | Low (library only) | Apache 2.0 |
| Gateway-first with sessions and cost analytics | Helicone | Lowest friction from base URL change | Medium (Supabase + ClickHouse) | Apache 2.0 |
| OTel + GPU + LLM telemetry in one library | OpenLIT | LLM and infra telemetry under one collector | Low (library + collector) | Apache 2.0 |
| Lightweight platform for small teams | Lunary | Simple deploy, dashboards out of the box | Low (Postgres only) | Apache 2.0 |

If you only read one row: pick FutureAGI traceAI when OTel instrumentation should also unlock span-attached evals, simulation, and a gateway in the same Apache 2.0 stack. Pick Langfuse for full-platform OSS depth without the gateway. Pick Phoenix when OpenInference is non-negotiable and the workbench is enough.

Scoring rubric

Each tool is ranked across six axes:

  1. Deploy complexity. Single binary versus ClickHouse + Postgres + queue + worker. Lower is better for small teams.
  2. Scale ceiling. Spans per second on commodity infrastructure before re-architecting.
  3. OTel support. Native OTLP ingestion, OpenInference conventions, OTel collector compatibility.
  4. License. OSI-approved Apache 2.0 / MIT versus source-available ELv2 versus core-OSS-with-paid-enterprise.
  5. Community size. GitHub stars, contributors, release cadence.
  6. Eval depth. Span-attached scores, judge libraries, dataset replay.

These are the axes that matter for procurement. Feature-checkbox lists miss the constraints that actually break in production.

The 7 OSS LLM observability tools, ranked

1. FutureAGI traceAI: Best for OpenInference instrumentation that closes into evals, gateway, and guardrails

Apache 2.0. Library + full platform.

Architecture: traceAI is an Apache 2.0 OTel instrumentation library that auto-instruments LangChain, LlamaIndex, OpenAI, Anthropic, Bedrock, and others, emitting OpenTelemetry GenAI semantic-convention spans across Python, TypeScript, Java, and C#. Pair it with any OTel backend, or with the FutureAGI platform for span-attached judge scores, simulation, the Agent Command Center gateway, and 18+ guardrails in one Apache 2.0 stack.
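For the library path, setup follows the usual OpenInference-style pattern: register a tracer provider, then attach a per-framework instrumentor. A minimal sketch follows; the package and helper names (fi_instrumentation, traceai_openai, register) are assumptions drawn from that pattern, so verify them against the traceAI repo before use.

```python
# Hypothetical sketch: traceAI library path. Package and helper names
# (fi_instrumentation.register, traceai_openai.OpenAIInstrumentor) are
# assumptions -- confirm against the traceAI repo before use.
import openai
from fi_instrumentation import register          # assumed setup helper
from traceai_openai import OpenAIInstrumentor    # assumed instrumentor package

# Register a tracer provider; spans can flow to the FutureAGI platform
# or to any OTLP endpoint you already operate.
trace_provider = register(project_name="checkout-agent")

# Auto-instrument the OpenAI SDK so each completion emits an
# OpenInference-shaped span into the trace tree.
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

client = openai.OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
```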

Deploy complexity: Low for the library; medium for the full platform (ClickHouse + Postgres + Redis + Temporal + Agent Command Center gateway). Documented Helm charts available.

Scale ceiling: Library throughput bounded by your OTel collector. Full platform: 10K+ spans/sec on tuned ClickHouse.

OTel support: Native OpenTelemetry GenAI semantic conventions. Multi-language coverage matches Phoenix; the platform layer adds eval-attached spans and gateway-emitted spans into the same trace tree.

License: Apache 2.0 for the platform repo; Apache 2.0 for traceAI. OSI-approved.

Eval depth: 50+ first-party judge scores attach as span attributes. The platform also adds simulation across text and voice, prompt optimization (6 algorithms), Agent Command Center gateway routing across 100+ providers with BYOK, and 18+ guardrails (PII redaction, prompt-injection blocking, jailbreak detection, tool-call enforcement).

Worth flagging: Phoenix and Langfuse have larger OSS communities today. The full-platform path is real ops work (ClickHouse, Temporal, Agent Command Center); use the hosted cloud if you do not want to operate the data plane.

2. Langfuse: Best for full-platform OSS depth without the gateway

MIT core. Self-hostable. Hosted cloud option.

Architecture: Langfuse runs ClickHouse for span storage, Postgres for metadata, Redis for queues, and a Node API. The hosted version uses the same architecture. The platform ships traces, sessions, prompts, datasets, scores, annotations, and a query-builder dashboard.

Deploy complexity: Medium. ClickHouse + Postgres + Redis + workers; documented Helm chart and Docker Compose.

Scale ceiling: 10K+ spans/sec on tuned infrastructure. ClickHouse handles the bulk; the API layer scales horizontally.

OTel support: OTel ingestion supported via the Langfuse /api/public/otel endpoint; uses Langfuse’s own schema layered over OTel.
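Wiring an existing OTel app into Langfuse is mostly environment configuration. A minimal sketch, assuming the documented Basic-auth scheme built from a project's public and secret keys (the key values and host here are placeholders):

```python
# Route an already-OTel-instrumented app into Langfuse. Keys and host
# are placeholders; the OTLP endpoint authenticates with Basic auth
# built from the project's public:secret key pair.
import base64
import os

LANGFUSE_HOST = "https://cloud.langfuse.com"  # or your self-hosted URL
auth = base64.b64encode(b"pk-lf-...:sk-lf-...").decode()

os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = f"{LANGFUSE_HOST}/api/public/otel"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {auth}"
# Any OTel SDK configured after this point exports spans to Langfuse
# without Langfuse-specific code in the application.
```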

License: MIT for the core. Enterprise directories (ee/) are licensed separately. 14K+ stars on GitHub.

Eval depth: First-party: dataset experiments, custom scorers, LLM-as-judge, human annotation queues. Experiments CI/CD integration shipped in 2026.

Worth flagging: “MIT core” needs an asterisk in procurement reviews. Some advanced features (RBAC, SSO, audit logs) live in the EE directories. Multi-region self-hosting is a real engineering project. No first-party gateway, simulation, or guardrail layer in the same product.

3. Arize Phoenix: Best for OpenInference and OpenTelemetry adherence

Source available. Self-hostable. Phoenix Cloud and Arize AX paths exist.

Architecture: Phoenix runs as a Python or container service with Postgres for storage. The reference implementation for OpenInference semantic conventions across Python, TypeScript, and Java instrumentation.

Deploy complexity: Medium. Postgres + Phoenix server + auto-instrumentors.

Scale ceiling: High on the hosted Arize AX path; mid-scale for self-hosted Phoenix without external storage tuning.

OTel support: OTLP-first. Auto-instruments LlamaIndex, LangChain, DSPy, Mastra, Vercel AI SDK, OpenAI, Bedrock, Anthropic, and 12+ others.
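A minimal sketch of the self-hosted path, assuming Phoenix's phoenix.otel.register helper and the openinference-instrumentation-openai package, with Phoenix running locally on its default port:

```python
# Sketch: emit OpenInference spans to a self-hosted Phoenix instance.
# Assumes the phoenix.otel.register helper and the
# openinference-instrumentation-openai package; defaults may differ.
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

tracer_provider = register(
    project_name="rag-pipeline",
    endpoint="http://localhost:6006/v1/traces",  # default local Phoenix (assumption)
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
# OpenAI SDK calls now appear as spans in the Phoenix trace view.
```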

License: Elastic License 2.0. Source available, with restrictions on offering as a managed service. NOT OSI-approved open source. 5K+ stars on GitHub.

Eval depth: First-party: 30+ OSS evaluators, dataset experiments, LLM-as-judge with structured outputs, batch eval pipelines.

Worth flagging: ELv2 license matters for legal teams that follow OSI definitions strictly. Phoenix is not a gateway, not a guardrail product, not a simulator.

4. OpenLLMetry (Traceloop): Best for one-line OTel instrumentation

Apache 2.0. Library only.

Architecture: OpenLLMetry is a set of OTel instrumentations for LLM frameworks (OpenAI, Anthropic, LangChain, LlamaIndex, Bedrock, Cohere, and 20+ others). Spans go to any OTel collector. Traceloop’s hosted backend is optional.

Deploy complexity: Low. pip install traceloop-sdk and a Traceloop.init() call.
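In practice that looks like the sketch below. Traceloop.init() is the documented entry point; the api_endpoint override for routing spans to your own collector is an assumption worth verifying against the current SDK docs.

```python
# pip install traceloop-sdk
# One-line instrumentation: Traceloop.init() patches supported SDKs.
import openai
from traceloop.sdk import Traceloop

# api_endpoint points spans at your own OTel collector rather than the
# hosted Traceloop backend (kwarg name per the SDK docs; verify).
Traceloop.init(app_name="support-bot", api_endpoint="http://localhost:4318")

client = openai.OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "hello"}],
)
# The call above is exported as an OTel span with GenAI attributes.
```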

Scale ceiling: Bounded by the OTel collector and storage you choose (Tempo, Jaeger, Honeycomb, Datadog).

OTel support: Native. Uses Traceloop’s semantic conventions, with OpenInference compatibility.

License: Apache 2.0. 4K+ stars on GitHub.

Eval depth: None at the library layer. The Traceloop hosted platform adds dashboards, prompts, and evals; the OSS instrumentation library is observability-only.

Worth flagging: No dashboard ships with the OSS edition. Pair it with Tempo, Jaeger, or a dedicated platform.

5. Helicone: Best for gateway-first telemetry

Apache 2.0. Self-hostable. Hosted cloud option.

Architecture: Helicone is a proxy gateway that captures every LLM request as a span. Self-hosted runs Supabase (Postgres + auth) plus ClickHouse for traces. Hosted version uses the same architecture.
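The base-URL swap looks like this sketch, following Helicone's documented proxy pattern (oai.helicone.ai host, Helicone-Auth header); the key is a placeholder, and self-hosted deployments substitute their own gateway URL.

```python
# Helicone gateway: change the base URL, keep the same OpenAI SDK calls.
# Host and header follow Helicone's documented proxy pattern; the key
# is a placeholder. Self-hosted deployments use their own gateway URL.
import os

import openai

client = openai.OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "hello"}],
)
# Each proxied request is logged with cost, latency, and session metadata.
```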

Deploy complexity: Medium. Supabase + ClickHouse + workers.

Scale ceiling: 1K+ requests/sec on standard ClickHouse, higher with tuning.

OTel support: Helicone has its own schema. OTel exporters exist but are secondary.

License: Apache 2.0. 4K+ stars.

Eval depth: Sessions, request analytics, prompts, scores. Eval surface is shallower than Langfuse or Phoenix.

Worth flagging: Roadmap risk after the March 2026 Mintlify acquisition; the platform remains usable but new feature velocity slowed. See Helicone Alternatives.

6. OpenLIT: Best for LLM + GPU + infra telemetry in one OTel collector

Apache 2.0. Library + optional UI.

Architecture: OpenLIT ships OTel instrumentation for LLM frameworks, vector DBs, GPU usage (via NVIDIA exporters), and infra. The optional ClickHouse-backed UI gives a unified view across LLM and infra spans.
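Setup is a single call. A minimal sketch, assuming openlit.init() and its OTLP-endpoint parameter as documented (verify the exact kwarg against the current release):

```python
# pip install openlit
# One call wires LLM, vector-DB, and GPU telemetry into OpenTelemetry.
import openlit

# Point exports at the collector you already run; parameter name per
# the OpenLIT docs (verify against the current release).
openlit.init(otlp_endpoint="http://127.0.0.1:4318")
# Supported SDKs are auto-instrumented from this point; GPU metrics
# need the optional GPU collector enabled.
```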

Deploy complexity: Low to medium. Library is one dependency; the optional UI adds ClickHouse.

Scale ceiling: Bounded by the ClickHouse cluster you operate.

OTel support: Native. Strong on the GPU and infra side, with LLM coverage that grew through 2025 and 2026.

License: Apache 2.0. 1.5K+ stars.

Eval depth: Light. Focus is on telemetry breadth, not eval depth.

Worth flagging: Smaller community than Langfuse or Phoenix. Eval and prompt management are not first-class.

7. Lunary: Best for a lightweight self-hosted platform

Apache 2.0. Self-hostable. Hosted cloud option.

Architecture: Lunary ships an LLM monitoring app with a Postgres-only backend (no ClickHouse, no Redis) and a Node + React UI. The simplest self-host of the seven.

Deploy complexity: Low. One Postgres, one container.

Scale ceiling: Mid-scale. Postgres-backed storage caps span throughput before re-architecting.

OTel support: SDK + custom HTTP API. OTel ingestion is supported but not the primary path.

License: Apache 2.0. 1.5K+ stars.

Eval depth: Built-in evaluators, dataset workflows, prompt versioning. Lighter than Langfuse but covers the basics.

Worth flagging: Smaller community and slower release cadence. Multi-region scaling not first-class.

[Product showcase: OTel ingestion chips for supported SDKs, span tree with a failed validator span, Apache 2.0 license badge, self-host KPI panel]

Decision framework: pick by constraint

  • OSI-approved license required: FutureAGI traceAI, OpenLLMetry, OpenLIT, Helicone, Lunary, Langfuse core. Phoenix is ELv2.
  • Lowest deploy footprint: OpenLLMetry, OpenLIT, FutureAGI traceAI library. Lunary for a lightweight platform.
  • Maximum scale ceiling: Langfuse with ClickHouse. FutureAGI platform on ClickHouse. Phoenix via Arize AX.
  • OpenInference adherence: Phoenix and FutureAGI traceAI lead.
  • Gateway-first telemetry: Helicone, with FutureAGI Agent Command Center as the closed-loop alternative.
  • GPU + LLM in one collector: OpenLIT.
  • Smallest team, simplest deploy: Lunary.
  • Already running Tempo or Jaeger: OpenLLMetry, OpenLIT, FutureAGI traceAI emit into your existing collector.

Common mistakes when picking an OSS observability tool

  • Confusing “OSS” with “self-hostable for free at scale”. Self-hosting at 10K spans/sec means an SRE who knows ClickHouse. The infra hours are real.
  • Picking on GitHub stars alone. Stars correlate with hype, not with maintenance. Check release cadence, contributor count, and issue close-rate.
  • Ignoring license clauses. Phoenix is ELv2 (no managed-service offerings). Langfuse has enterprise dirs outside MIT. Verify before legal review.
  • Skipping the eval requirement. Tracing without span-attached evals leaves a quality blind spot. Eval pass-rate trend is the leading indicator of regressions.
  • Treating an instrumentation library as a platform. OpenLLMetry, OpenLIT, and traceAI emit spans; they do not give you a dashboard. Pair them with one.
  • Forgetting OTel collector tuning. Without batching and queuing, OTel ingestion drops spans under load. Tune the collector before benchmarking.
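On that last point, a minimal collector tuning sketch, assuming a stock OTel collector with OTLP in and OTLP out; the numbers are illustrative starting points, not tuned recommendations:

```yaml
# Minimal OTel collector tuning sketch. Values are illustrative
# starting points, not tuned recommendations.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
  batch:
    send_batch_size: 2048
    timeout: 5s

exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      queue_size: 10000
    retry_on_failure:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```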

What changed in OSS LLM observability in 2026

| Date | Event | Why it matters |
| --- | --- | --- |
| May 2026 | Langfuse shipped Experiments CI/CD integration | OSS-first teams can gate experiments in GitHub Actions. |
| Mar 9, 2026 | FutureAGI shipped Agent Command Center and ClickHouse trace storage | High-volume trace analytics moved into the same loop as evals and gateway. |
| Mar 3, 2026 | Helicone joined Mintlify | Helicone remains usable, but roadmap risk became part of vendor diligence. |
| Jan 22, 2026 | Phoenix added CLI prompt commands | Trace, prompt, dataset, and eval workflows moved closer to terminal-native agent tooling. |
| 2025-2026 | OpenInference v1 conventions stabilized across Phoenix and traceAI | Cross-platform span schema reduces vendor lock-in. |
| 2025 | Lunary continued lightweight platform development | Smaller teams now have a maintained one-Postgres option. |

How to actually evaluate this for production

  1. Run a domain reproduction. Export a representative slice of real LLM traffic, instrument with each candidate, and compare span fidelity, eval coverage, and storage cost.

  2. Test the full loop. Simulate a regression, surface it via the platform, replay in pre-prod, push a fix through CI. Track time-to-resolve at each stage.

  3. Cost-adjust. Real cost equals platform price (zero for OSS) plus infra cost (compute, storage, retention) plus the SRE hours to operate ClickHouse, Postgres, and Redis at the throughput you need.
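As a trivial worked example of that formula (every input below is an illustrative placeholder, not a benchmark):

```python
# Self-hosting TCO sketch. Every input is an illustrative placeholder;
# substitute your own infra quotes and loaded engineering rates.
platform_license = 0     # OSS edition
infra_monthly = 1_800    # ClickHouse + Postgres + Redis nodes ($, assumed)
storage_monthly = 300    # span retention at your volume ($, assumed)
sre_hours = 20           # monthly hours operating the stack (assumed)
sre_rate = 120           # loaded $/hour (assumed)

monthly_tco = platform_license + infra_monthly + storage_monthly + sre_hours * sre_rate
print(f"Monthly TCO: ${monthly_tco:,}")  # Monthly TCO: $4,500
```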

How FutureAGI implements open-source LLM observability

FutureAGI is the production-grade open-source LLM observability platform built around the closed reliability loop that other OSS observability picks stitch together by hand. The full stack runs on one Apache 2.0 self-hostable plane:

  • Tracing: traceAI (Apache 2.0) auto-instruments 35+ frameworks across Python, TypeScript, Java (LangChain4j and Spring AI), and a C# core, with OpenInference-shaped spans flowing into ClickHouse-backed storage.
  • Evals: 50+ first-party metrics (Faithfulness, Hallucination, Tool Correctness, Task Completion, Plan Adherence) attach as span attributes; BYOK lets any LLM serve as the judge at zero platform fee, and turing_flash runs the same rubrics at 50 to 70 ms p95.
  • Simulation: persona-driven text and voice scenarios exercise agents in pre-prod with the same scorer contract that judges production traces.
  • Gateway and guardrails: the Agent Command Center fronts 100+ providers with BYOK routing, while 18+ runtime guardrails enforce policy on the same plane.

Beyond the four axes, FutureAGI also ships six prompt-optimization algorithms that consume failing trajectories as training data. Pricing starts free with a 50 GB tracing tier; Boost is $250 per month, Scale is $750 per month with HIPAA, and Enterprise from $2,000 per month with SOC 2 Type II. The license is genuinely Apache 2.0 across the stack rather than ELv2 (Phoenix) or MIT-with-enterprise-split (Langfuse).

Most teams comparing OSS observability picks end up running three or four tools in production: one for traces, one for evals, one for the gateway, one for guardrails. FutureAGI is the recommended pick because tracing, evals, simulation, gateway, and guardrails all live on one self-hostable runtime; the loop closes without stitching.

Read next: Best Self-Hosted LLM Observability, Best LLM Tracing Tools, Best LLM Monitoring Tools

Frequently asked questions

What are the best open-source LLM observability tools in 2026?
The shortlist by maintained scope and 2026 activity: Arize Phoenix, Langfuse, OpenLLMetry (Traceloop), Helicone, OpenLIT, Lunary, and FutureAGI traceAI. Phoenix and Langfuse lead on full-platform features. OpenLLMetry and OpenLIT lead on OTel-native instrumentation libraries. Helicone leads gateway-first telemetry. Lunary leads on lightweight project dashboards. FutureAGI traceAI is the newest entrant with OpenInference conventions and span-attached evals.
Which open-source LLM observability tool is OSI-approved?
FutureAGI traceAI is Apache 2.0 and the FutureAGI platform repo is Apache 2.0. Langfuse core is MIT, with enterprise directories handled separately. OpenLLMetry is Apache 2.0. OpenLIT is Apache 2.0. Helicone is Apache 2.0. Lunary is Apache 2.0. Arize Phoenix is source available under Elastic License 2.0, which is not OSI-approved open source. Verify license carefully when self-hosting and redistribution matter for legal review.
Which OSS observability tool has the lowest deployment complexity?
OpenLLMetry and OpenLIT are instrumentation-only libraries; deployment is just adding a Python or TypeScript dependency and pointing at any OTel collector. Lunary is the lightest full platform: one Postgres, one container, minimal services. Langfuse is heavier (ClickHouse, Postgres, Redis), and Phoenix needs Postgres plus a queue. FutureAGI traceAI's instrumentation library deploys like OpenLLMetry; the platform layer matches Langfuse for service count.
Which OSS observability tool scales best to 10K+ spans per second?
Langfuse with ClickHouse and Phoenix with the Arize AX cloud both handle 10K+ spans per second on tuned infrastructure. FutureAGI's hosted cloud uses ClickHouse trace storage for the same scale. OpenLIT and OpenLLMetry rely on the OTel collector and downstream storage you choose, so throughput is bounded by that storage. Helicone and Lunary are best suited to mid-scale workloads. Run a load test before committing.
Should I use OpenTelemetry directly or pick an OSS observability platform?
OpenTelemetry is the wire format. You still need a backend that understands LLM-specific conventions (OpenInference for spans, judge scores as attributes, prompt versions as resource tags). Most teams pair an OTel instrumentation library (OpenLLMetry, OpenLIT, traceAI) with a backend (Phoenix, Langfuse, FutureAGI). Going OTel-direct without LLM-aware tooling means rebuilding span tree views and eval-attached scoring yourself.
How does pricing compare across OSS LLM observability tools?
All seven OSS editions are free to self-host. Hosted versions vary: FutureAGI is free plus usage from $2/GB; Boost is $250/mo, Scale is $750/mo with HIPAA, Enterprise from $2,000/mo with SOC 2 Type II. Phoenix is free to self-host; Arize AX Pro is $50/mo. Langfuse Hobby is free, Core $29/mo. Helicone Hobby is free, Pro $79/mo. Lunary has a free tier, Pro $20/mo. OpenLLMetry's Traceloop platform is $79/mo for Pro. OpenLIT is community-only. Self-hosting cost equals infra plus the engineer-hours that operate it.
Which OSS observability tool supports OpenInference conventions?
FutureAGI traceAI emits OpenInference spans across Python, TypeScript, Java, and C# with auto-instrumentation for 35+ frameworks. Phoenix is the canonical OpenInference reference implementation. OpenLLMetry uses its own Traceloop conventions but interoperates with OpenInference via translation. OpenLIT supports a subset. Langfuse has an OpenInference compatibility layer. If OpenInference adherence is non-negotiable, FutureAGI traceAI and Phoenix lead.
Which OSS tool is best for adding LLM observability to an existing OTel stack?
FutureAGI traceAI, OpenLLMetry, and OpenLIT are instrumentation libraries that emit spans into any OTel collector you already operate (Tempo, Jaeger, Datadog, Grafana, Honeycomb). They do not require a separate platform. If you already run Datadog or Grafana, start with traceAI (broadest framework and language coverage) and only add a dedicated platform when LLM-specific dashboards and evals become a bottleneck.