
Best Self-Hosted LLM Observability in 2026: 7 Picks Ranked

Langfuse, Phoenix, Helicone, OpenLIT, Lunary, Comet Opik, and FutureAGI ranked on deploy footprint, scale ceiling, and self-host operational cost.


Self-hosted LLM observability matters when data sovereignty, cost-at-scale, or custom retention is a hard requirement. There are at least seven maintained options in 2026, ranging from one-container Postgres deploys to ClickHouse-backed platforms that handle 10K+ spans per second. The honest ranking depends on what the team can actually operate. This guide ranks all seven against an objective rubric (deploy footprint, scale ceiling, license, OTel support, eval depth, community size) and is candid about where each tool fits relative to the others.

TL;DR: Best self-hosted LLM observability per use case

| Use case | Best pick | Why (one phrase) | Deploy complexity | License |
| --- | --- | --- | --- | --- |
| Span-attached evals, gateway, simulation, and guardrails in one self-hosted plane | FutureAGI | Full Apache 2.0 stack with judge scoring on every span | Medium-High | Apache 2.0 |
| Full-platform OSS depth at 10K+ spans/sec | Langfuse | Mature ClickHouse-backed scale | Medium-High | MIT core |
| OpenTelemetry adherence + reference impl | Arize Phoenix | OTel-first with OpenInference | Medium | Elastic License 2.0 |
| Gateway-first telemetry with sessions | Helicone | Lowest friction from base URL change | Medium | Apache 2.0 |
| LLM + GPU + infra in one OTel stack | OpenLIT | Unified collector, GPU exporters | Low-Medium | Apache 2.0 |
| Smallest team, simplest deploy | Lunary | One Postgres, one container | Low | Apache 2.0 |
| Already running Comet for classical ML | Comet Opik | OSS LLM library + Comet platform | Medium | Apache 2.0 |

If you only read one row: pick FutureAGI when span-attached evals, the gateway, and guardrails must live in the same self-hosted plane. Pick Langfuse for full-platform OSS depth without the gateway. Pick Lunary when the constraint is one engineer maintaining the deploy.

What self-host actually requires

A working self-hosted LLM observability deploy at scale touches five surfaces:

  1. Span storage. ClickHouse for high throughput; Postgres for low-volume workloads. Most platforms above use ClickHouse for span tables.
  2. Metadata and RBAC store. Postgres for users, projects, prompts, datasets, scores.
  3. Queue / async ingestion. Redis or BullMQ to absorb burst traffic and defer expensive scoring.
  4. API + UI. Node, Python, or Go backend; React frontend.
  5. Eval workers. Async judge scoring runs on workers; it can be CPU-bound or GPU-bound depending on the judge model.

Anything below this leaves a gap. Postgres-only platforms cap at mid-scale. No-queue platforms drop spans under burst load. UI-only platforms without eval workers force a parallel CI pipeline.
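
To make the ingestion surface concrete, here is a minimal sketch of the kind of span every platform above ultimately receives: an OpenTelemetry GenAI-convention span pushed over OTLP. The collector endpoint, service name, and attribute values are illustrative assumptions, not a specific vendor's setup.

```python
# Minimal sketch: emit one OTel GenAI-convention span over OTLP/gRPC.
# Assumes an OTLP receiver (collector or platform) listening on localhost:4317.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "chat-api"}))
# BatchSpanProcessor is the "queue" surface in miniature: it buffers and batches.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-app")
with tracer.start_as_current_span("chat gpt-4o") as span:
    # Subset of the OTel GenAI semantic conventions; values are illustrative.
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.usage.input_tokens", 412)
    span.set_attribute("gen_ai.usage.output_tokens", 128)
```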

The 7 self-hosted platforms ranked

1. FutureAGI: Best for span-attached evals, gateway, simulation, and guardrails in one self-hosted plane

Apache 2.0. Self-hostable. Hosted cloud option.

Architecture: ClickHouse + Postgres + Redis + Temporal + Agent Command Center gateway. The full stack runs simulate, evaluate, observe, gate, optimize, and route in one runtime. traceAI is the Apache 2.0 OpenTelemetry instrumentation layer, emitting OpenTelemetry GenAI semantic-convention spans across Python, TypeScript, Java, and C#.

Deploy footprint: Documented Helm chart. Production deploys are heavier than Lunary's and comparable to Langfuse's, plus the Agent Command Center service and Temporal worker pool.

Scale ceiling: 10K+ spans/sec on tuned ClickHouse, matching Langfuse’s ceiling and adding gateway throughput on the same plane.

OTel support: Native OpenTelemetry GenAI semantic conventions. Multi-language coverage (Python, TypeScript, Java, C#).

License: Apache 2.0 for the platform repo; Apache 2.0 for traceAI. OSI-approved.

Eval depth: 50+ first-party judge metrics attach as span attributes via the Turing eval models. Beyond span scoring, the platform ships dataset experiments, simulation across text and voice, a prompt optimizer with six algorithms, and 18+ runtime guardrails (PII redaction, prompt-injection blocking, jailbreak detection, tool-call enforcement) via the Agent Command Center.

Worth flagging: More moving parts than Langfuse self-host (Temporal and the Agent Command Center are real services). The self-host community is newer than Phoenix's or Langfuse's, and Langfuse's community is genuinely larger today, but FutureAGI ships the gateway and guardrail layer that Langfuse defers to adjacent tools. Use the hosted cloud if you do not want to operate the data plane.

2. Langfuse: Best for full-platform OSS depth at scale

MIT core. Self-hostable. Hosted cloud option.

Architecture: ClickHouse for span storage, Postgres for metadata, Redis for queues, Node API. The hosted version uses the same architecture, which means lessons from langfuse.com translate directly to self-host operations.

Deploy footprint: Documented Helm chart on Kubernetes. Production deploys run 4-8 replicas of the API, a ClickHouse cluster of 2-4 nodes, a Postgres primary plus replica, and a Redis instance.

Scale ceiling: 10K+ spans/sec on tuned ClickHouse. Multi-region requires extra engineering.

OTel support: OTel ingestion via the /api/public/otel endpoint; uses Langfuse’s own schema layered over OTel.

License: MIT for the core; the enterprise directories (RBAC, SSO, audit) are licensed separately. 14K+ stars.

Eval depth: First-party datasets, custom scorers, LLM-as-judge, human annotation queues. Experiments CI/CD integration shipped in 2026.

Worth flagging: “MIT for non-enterprise” needs an asterisk in legal review. ClickHouse disk is the dominant cost at scale. JS heap OOM and BullMQ queue tuning are real concerns documented in the Langfuse FAQ. See Langfuse Alternatives.
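
To show what self-host adoption looks like from the application side, a hedged sketch of pointing the Langfuse Python SDK at a self-hosted instance. The host URL and keys are placeholders, and the import path differs between SDK major versions.

```python
# Hedged sketch: route the Langfuse Python SDK to a self-hosted deploy.
# Host URL and keys are placeholders; set them before the SDK is imported.
import os
os.environ["LANGFUSE_HOST"] = "https://langfuse.internal.example.com"
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."  # placeholder
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."  # placeholder

from langfuse import observe  # SDK v3; v2 uses `from langfuse.decorators import observe`

@observe()  # wraps the call in a trace that lands in your cluster
def answer(question: str) -> str:
    # your LLM call goes here
    return "..."
```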

3. Arize Phoenix: Best for OpenTelemetry adherence

Source available. Self-hostable. Phoenix Cloud and Arize AX paths exist.

Architecture: Phoenix runs as a Python or container service with Postgres for storage, plus a broad auto-instrumentation library implementing OpenInference conventions across Python, TypeScript, and Java.

Deploy footprint: Postgres + Phoenix server. Lighter than Langfuse for moderate workloads.

Scale ceiling: Mid-scale for self-hosted Phoenix without external storage tuning. The Arize AX cloud path uses different storage architecture for higher scale.

OTel support: OTLP-first. Reference implementation for OpenInference semantic conventions. Auto-instruments LlamaIndex, LangChain, DSPy, Mastra, Vercel AI SDK, and 12+ others.

License: Elastic License 2.0; not OSI-approved open source. 5K+ stars on GitHub.

Eval depth: 30+ first-party OSS evaluators, dataset experiments, and LLM-as-judge with structured outputs.

Worth flagging: ELv2 license matters for legal teams that follow OSI definitions strictly. Phoenix is not a gateway, not a guardrail product, not a simulator.
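
The OTel-first posture is visible in practice: register a tracer provider against a self-hosted Phoenix server, then instrument a framework in one line. A hedged sketch; the endpoint and project name are assumptions.

```python
# Hedged sketch: send OpenInference spans to a self-hosted Phoenix server.
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register(
    project_name="my-agent",  # assumption for illustration
    endpoint="http://phoenix.internal.example.com:6006/v1/traces",
)
# One line per framework; Phoenix ships similar instrumentors for many others.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
```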

4. Helicone: Best for gateway-first telemetry

Apache 2.0. Self-hostable. Hosted cloud option.

Architecture: Proxy gateway captures every LLM request as a span. Self-hosted runs Supabase (Postgres + auth) + ClickHouse for traces.

Deploy footprint: Supabase + ClickHouse + workers. Documented Docker Compose.

Scale ceiling: 1K+ requests/sec on standard ClickHouse, higher with tuning.

OTel support: Has its own schema. OTel exporters exist but are secondary.

License: Apache 2.0. 4K+ stars.

Eval depth: Sessions, request analytics, prompts, scores. The eval surface is shallower than Langfuse's or Phoenix's.

Worth flagging: Roadmap risk after the March 2026 Mintlify acquisition; the platform remains usable but new feature velocity slowed.
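
The gateway-first claim is easy to see in code: adoption is a base URL change plus an auth header on an existing OpenAI client. The self-hosted gateway URL below is an assumption for illustration.

```python
# Hedged sketch: Helicone's proxy pattern against a self-hosted gateway.
from openai import OpenAI

client = OpenAI(
    base_url="https://helicone.internal.example.com/v1",         # assumed self-host URL
    default_headers={"Helicone-Auth": "Bearer <helicone-key>"},  # placeholder key
)
# Every request through this client is now captured as a span by the proxy.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
```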

5. OpenLIT: Best for LLM + GPU + infra in one OTel collector

Apache 2.0. Library + optional UI.

Architecture: OTel instrumentation library covering LLM frameworks, vector DBs, GPU usage (via NVIDIA exporters), and infra. Optional ClickHouse-backed UI.

Deploy footprint: Library is one dependency. The optional UI adds ClickHouse.

Scale ceiling: Bounded by your ClickHouse cluster.

OTel support: Native. Strong on the GPU and infra side.

License: Apache 2.0. 1.5K+ stars.

Eval depth: Light. Focus is on telemetry breadth, not eval depth.

Worth flagging: Smaller community than Langfuse or Phoenix. Eval and prompt management are not first-class.
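
The one-dependency claim, sketched below. The collector URL is an assumption, and the GPU flag should be verified against current OpenLIT docs.

```python
# Hedged sketch: OpenLIT init pointing at your own OTLP collector.
import openlit

openlit.init(
    otlp_endpoint="http://otel-collector.internal.example.com:4318",  # assumption
    collect_gpu_stats=True,  # NVIDIA GPU metrics; verify the flag in current docs
)
# From here, supported LLM and vector-DB calls are auto-instrumented.
```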

6. Lunary: Best for the lightest possible self-host

Apache 2.0. Self-hostable. Hosted cloud option.

Architecture: Postgres-only backend (no ClickHouse, no Redis), Node + React UI.

Deploy footprint: One Postgres, one container. The simplest deploy of the seven.

Scale ceiling: Mid-scale. Postgres-backed storage caps span throughput; going higher means re-architecting onto columnar storage.

OTel support: SDK + custom HTTP API. OTel ingestion is supported but not the primary path.

License: Apache 2.0. 1.5K+ stars.

Eval depth: Built-in evaluators, dataset workflows, prompt versioning.

Worth flagging: Smaller community and slower release cadence. Multi-region scaling not first-class.
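
A hedged sketch of the wrap-the-client pattern. The self-host URL environment variable name is an assumption to verify against Lunary's docs.

```python
# Hedged sketch: Lunary monitoring an OpenAI client against a self-hosted backend.
import os
os.environ["LUNARY_API_URL"] = "https://lunary.internal.example.com"  # assumed var name

import lunary
from openai import OpenAI

client = OpenAI()
lunary.monitor(client)  # requests through this client are traced to your deploy
```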

7. Comet Opik: Best for teams already on Comet

Apache 2.0. Self-hostable. Hosted Comet platform.

Architecture: Opik is the OSS LLM observability library; the Comet platform handles experiments, dashboards, and team workflows. Self-host uses Postgres + ClickHouse.

Deploy footprint: Medium. ClickHouse + Postgres + workers; documented Helm chart.

Scale ceiling: Medium-high. Comet’s classical ML lineage gives operational maturity.

OTel support: Custom SDK with OTel compatibility.

License: Apache 2.0 for Opik; the broader Comet platform is closed source.

Eval depth: Datasets, scorers, traces, PII screening with a pytest-friendly Python SDK.

Worth flagging: Eval surface and gateway are smaller than dedicated LLM platforms. Opik is newer and less mature than the classic Comet platform.
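
For contrast with the gateway-style tools, a hedged sketch of the Opik SDK against a self-hosted instance; the traced function is illustrative.

```python
# Hedged sketch: Opik SDK pointed at a self-hosted instance.
import opik
from opik import track

opik.configure(use_local=True)  # targets a locally self-hosted Opik deployment

@track  # each call becomes a trace with inputs and outputs captured
def summarize(text: str) -> str:
    return text[:200]  # stand-in for a real LLM call
```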

[Product screenshot: FutureAGI self-host dashboard showing stack health checks (ClickHouse, Postgres, Redis, Temporal, API, web), ingest throughput peaking at 12.4K spans/sec with p99 latency KPIs, per-node CPU/RAM utilization, and retention/storage breakdown.]

Decision framework: pick by constraint

  • OSI-approved license required: Langfuse core, Helicone, OpenLIT, Lunary, Opik, FutureAGI. Phoenix is ELv2.
  • Lowest operational cost: Lunary, then OpenLIT (library-only).
  • Maximum scale ceiling: Langfuse and FutureAGI on tuned ClickHouse.
  • OpenTelemetry-first: Phoenix, OpenLIT, and FutureAGI traceAI; vendor-neutral instrumentation like OpenLLMetry pairs with any OTLP backend.
  • Gateway-first architecture: Helicone, FutureAGI Agent Command Center.
  • Multi-region self-host with documented Helm: Langfuse, FutureAGI, Phoenix.
  • Already running ClickHouse: Langfuse, Helicone, FutureAGI, Opik are the easiest fits.
  • Already running Comet for classical ML: Opik.

Common mistakes when picking a self-hosted platform

  • Underestimating ClickHouse operations. Schema migrations, disk planning, replica failover, and query tuning are real engineering. Without an SRE who knows ClickHouse, expect on-call pages.
  • Picking on demo videos. Demos run on tuned hardware with synthetic load. Run a load test on your real span schema before committing.
  • Ignoring OTel collector tuning. Without batching, OTel ingestion drops spans under load. Tune the collector before benchmarking.
  • Forgetting retention math. ClickHouse disk is the dominant cost. 90 days at 10M traces per month is roughly 200 GB to 2 TB depending on payload size; the worked math after this list shows one way to land inside that band.
  • Skipping the upgrade plan. Langfuse, Phoenix, and FutureAGI ship breaking changes between major versions. Plan upgrade windows.
  • Treating ELv2 as open source. Phoenix is source available, not OSI open source. Legal teams that follow OSI definitions strictly will block it.
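
The retention bullet deserves numbers. A back-of-envelope sketch with loudly assumed span counts and sizes; plug in your own measurements before trusting the output.

```python
# Worked retention math: all inputs are assumptions, not benchmarks.
traces_per_month = 10_000_000
spans_per_trace = 8        # assumption: typical agent trace depth
bytes_per_span = 2_000     # assumption: compressed span incl. payloads
retention_months = 3       # ~90 days

disk = traces_per_month * retention_months * spans_per_trace * bytes_per_span
print(f"{disk / 1e9:.0f} GB")  # -> 480 GB, inside the 200 GB - 2 TB band above
```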

What changed in self-host LLMOps in 2026

| Date | Event | Why it matters |
| --- | --- | --- |
| May 2026 | Langfuse shipped Experiments CI/CD integration | Self-hosted teams can gate experiments in GitHub Actions without leaving the cluster. |
| Mar 9, 2026 | FutureAGI shipped Agent Command Center and ClickHouse trace storage | Gateway, guardrails, and high-volume trace analytics moved into the same self-hosted plane. |
| Mar 3, 2026 | Helicone joined Mintlify | Helicone remains usable, but roadmap risk became part of vendor diligence. |
| Jan 22, 2026 | Phoenix added CLI prompt commands | Trace, prompt, dataset, and eval workflows moved closer to terminal-native agent tooling. |
| 2025-2026 | OpenInference v1 conventions stabilized | Cross-platform span schema reduces vendor lock-in for self-hosted teams. |
| 2025 | ClickHouse adoption broadened across LLM observability tools | Operations playbooks moved closer to standard SRE knowledge. |

How to actually evaluate this for production

  1. Run a domain reproduction. Stand up two candidate platforms side-by-side. Emit OTel spans into both; the fan-out sketch after this list shows the pattern. Compare span fidelity, eval coverage, query latency, and storage cost on your real workload for two weeks.

  2. Test the upgrade path. Run a major version upgrade in staging. Time the maintenance window. Verify backward-compat on the SDK and the OTel schema.

  3. Cost-adjust. Real cost equals infra (compute + storage + retention) plus the SRE hours to operate ClickHouse, Postgres, and Redis at the throughput you need. Add upgrade-window costs.
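
The side-by-side step is mechanical with OTel: register one exporter per candidate backend on the same tracer provider. Both endpoints below are assumptions; take each platform's OTLP path and auth headers from its docs.

```python
# Hedged sketch: fan one trace stream into two candidate backends over OTLP/HTTP.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
for endpoint in (
    "https://langfuse.internal.example.com/api/public/otel/v1/traces",  # assumed path
    "https://candidate-b.internal.example.com/v1/traces",               # assumed path
):
    # Add auth via OTLPSpanExporter(headers={...}) if a backend requires it.
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint)))
trace.set_tracer_provider(provider)
# Every span now lands in both platforms; compare fidelity on identical traffic.
```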

How FutureAGI implements self-hosted LLM observability

FutureAGI is the production-grade self-hostable LLM observability platform built around the closed reliability loop that other self-hosted picks stitch together by hand. The full stack runs on one Apache 2.0 self-hostable plane:

  • Tracing: traceAI (Apache 2.0) auto-instruments 35+ frameworks across Python, TypeScript, Java (LangChain4j and Spring AI), and a C# core, with OpenInference-shaped spans flowing into ClickHouse-backed storage that handles 10K+ spans per second on tuned infra.
  • Evals: 50+ first-party metrics attach as span attributes; BYOK lets any LLM serve as the judge at zero platform fee, and turing_flash runs the same rubrics at 50 to 70 ms p95.
  • Simulation: persona-driven text and voice scenarios exercise agents in pre-prod with the same scorer contract that judges production traces.
  • Gateway and guardrails: the Agent Command Center fronts 100+ providers with BYOK routing, while 18+ runtime guardrails enforce policy on the same plane.

Beyond the four axes, FutureAGI also ships six prompt-optimization algorithms that consume failing trajectories as training data. Pricing starts free with the OSS edition; Boost is $250 per month, Scale is $750 per month with HIPAA, and Enterprise starts at $2,000 per month with SOC 2 Type II. The license is Apache 2.0 across the stack, so commercial reuse and air-gapped deploys are unambiguous.

Most teams comparing self-hosted observability picks end up running three or four tools in production: one for traces, one for evals, one for the gateway, one for guardrails. FutureAGI is the recommended pick because tracing, evals, simulation, gateway, and guardrails all live on one self-hostable runtime; the loop closes without stitching.


Read next: Best Open Source LLM Observability, Best LLM Tracing Tools, Best LLMOps Platforms

Frequently asked questions

What are the best self-hosted LLM observability tools in 2026?
The shortlist is Langfuse, Arize Phoenix, Helicone, OpenLIT, Lunary, Comet Opik, and FutureAGI. Langfuse leads on full-platform OSS depth. Phoenix leads on OpenTelemetry adherence. Helicone leads on gateway-first architecture. OpenLIT and FutureAGI scale through OTel collectors. Lunary is the lightest deploy. Opik is the strongest fit when the team already runs Comet for classical ML.
Why self-host LLM observability instead of using SaaS?
Three reasons keep teams self-hosting: data sovereignty (regulated industries cannot send prompts and responses to a third-party cloud), cost at scale (cloud per-trace pricing crosses self-host TCO around 5M traces per month), and control (custom retention, custom RBAC, per-tenant data isolation). The trade-off is real engineering hours: ClickHouse, Postgres, Redis, queues, and upgrades are not zero-effort.
Which self-hosted LLM observability tool has the lowest deployment complexity?
Lunary is the lightest: one Postgres, one container. OpenLIT is light when used as a library plus an OTel collector. Helicone needs Supabase plus ClickHouse. Langfuse and FutureAGI need ClickHouse, Postgres, Redis, and workers; documented Helm charts simplify the deploy. Phoenix runs as a Python service with Postgres but auto-instrumentation adds complexity at scale.
What infrastructure does self-hosting Langfuse actually require?
Production Langfuse runs ClickHouse for span storage, Postgres for metadata, Redis for queues, and the Node API on Kubernetes. Documented Helm charts cover the install. ClickHouse disk is the dominant cost at scale; expect 200 GB to 2 TB per month per 10M traces depending on retention. Plan for an SRE who knows ClickHouse before committing to multi-region self-host.
Does self-hosting work on ARM architectures (Graviton, M-series)?
Mostly yes in 2026. Langfuse, Phoenix, Helicone, FutureAGI traceAI, and Lunary publish ARM container images. ClickHouse runs on ARM since v22. Postgres has had ARM images for years. Verify per-component before committing; some Java-based instrumentation libraries still ship x86-only Docker images that need cross-compilation.
How does self-hosting cost compare to SaaS pricing?
At low volume (under 1M traces per month) SaaS is cheaper than self-host TCO. At mid-scale (1M to 5M traces) it depends on retention; SaaS retention tiers often force expensive upgrades. At high scale (5M+ traces per month) self-host wins on raw cost but loses on engineer-hours. Most teams that self-host successfully have an existing ClickHouse or Postgres operations team.
Can I run multiple self-hosted observability tools side-by-side during migration?
Yes, and it is the recommended migration pattern. OpenTelemetry instrumentation libraries (OpenLLMetry, traceAI) emit spans into multiple backends simultaneously. Run Langfuse and FutureAGI side-by-side for two weeks; compare span fidelity, eval coverage, and storage cost on the same workload. Cut over only after the new platform reaches parity on the metrics your team cares about.
Which self-hosted tool handles 10K+ spans per second?
Langfuse, Phoenix (with Arize AX cloud architecture), and FutureAGI all handle 10K+ spans/sec on tuned ClickHouse. Helicone scales to 1K+ requests/sec on standard ClickHouse, higher with tuning. OpenLIT scales with the collector and storage you operate. Lunary is best for mid-scale workloads. Run a load test before committing; vendor benchmarks understate the load on your real schema.