
What Is LLM Observability? A 2026 Architecture Guide

Canonical 2026 LLM observability guide: OpenInference, OTel GenAI, three pillars (cost, latency, eval), five approaches, FAGI vs Phoenix vs Langfuse.

January 11, 2026


Originally published May 17, 2026. Updated Q2 2026.

A platform engineer at a Series B fintech shipped a customer-onboarding agent on a Wednesday: LangChain plus four tools plus retrieval plus an OpenAI fallback. The trace pane in the APM tool showed exactly two spans per request: one HTTP span into the agent endpoint, one HTTP span out to OpenAI. Forty-seven internal steps lived inside those two spans, invisible.

By Friday the agent had three incidents: a hallucinated routing number that nearly triggered a wire transfer, a tool call that retried itself in a loop for ninety seconds, and a token spike from a retrieval prompt that ran 14 times in one conversation. None were visible in the trace tool, because the trace tool didn’t know about LLMs.

This guide defines LLM observability, the discipline that fills that gap, anchored to OpenInference, OpenTelemetry GenAI semantic conventions, and OTLP, and surveys the 2026 landscape across FAGI traceAI, Arize Phoenix, Langfuse, Helicone, and Maxim Bifrost.

TL;DR: The 2026 LLM Observability Definition

LLM observability is a telemetry discipline that captures every prompt, completion, tool call, retrieval, and agent step in a generative AI system as a structured OpenInference or OpenTelemetry GenAI span, attaches token counts, dollar cost, latency, and evaluation scores as first-class span attributes, and exports the result through OTLP to a backend that supports span-tree visualisation, per-tenant cost attribution, drift detection on quality, and correlation between offline evaluation runs and production traces via span_id.

The category is anchored by two open specifications: OpenInference (Apache 2.0, originally from Arize, now adopted across the ecosystem) and OpenTelemetry GenAI semantic conventions (governed by the OpenTelemetry project). Most 2026 stacks emit OpenInference instrumentation, export via OTLP, and verify the spans against both specs.

The three pillars adapted to LLMs. Traces become span trees rooted at the user request with child spans for every prompt, retrieval, tool call, and sub-agent. Metrics become token counts, dollar cost, time-to-first-token, latency, and aggregate evaluation scores. Logs become structured span attributes plus optional raw prompt and completion text, gated by a redaction layer because logs carry the highest PII and PHI risk in the stack.
Evaluation is first-class telemetry. Hallucination probability, faithfulness, answer relevance, toxicity, and JSON Schema validity write to the span as attributes alongside latency.
Sampling is inverted from traditional APM. Long-tail errors are the most valuable signal, so tail-based sampling driven by eval scores replaces head-based sampling.
Five named approaches ship in 2026: OpenInference plus OTel native, vendor-proprietary, log-shipping, custom warehousing, and hybrid.
2026 landscape: Future AGI traceAI (Apache 2.0, OpenInference plus OTel native across 50+ AI surfaces in Python, TypeScript, Java, and C#, including a Spring Boot starter, Spring AI, LangChain4j, and Semantic Kernel, with a closed eval-gateway loop), Arize Phoenix (Apache 2.0, the canonical open-source notebook surface), Langfuse (MIT, the open-source product surface), Helicone (proxy-based log-shipping, joined Mintlify March 3, 2026), and Maxim Bifrost (Apache 2.0, the gateway-side trace emitter). Future AGI's Error Feed, the clustering and what-to-fix layer of the eval stack that feeds the self-improving evaluators, sits alongside traceAI and auto-clusters trace failures into named issues with zero config (50 traces → 1 issue, with a root cause, quick fix, and long-term recommendation auto-written per issue).

What Is LLM Observability?

LLM observability is the practice of capturing every step of a generative AI request - prompt construction, retrieval lookup, tool call, model completion, sub-agent step, evaluator pass - as a structured span with rich attributes, exporting those spans through the OpenTelemetry Protocol, and reading them in a backend that understands the LLM-specific shape of the data.

The definition matters because traditional application performance monitoring (APM) was built for HTTP and database workloads. An APM agent installed into a LangChain application emits roughly one or two HTTP spans per request. Everything that happens between - the sub-agent steps, the tool calls, the retrieval lookups, the reranker passes - lives inside those HTTP spans as opaque blob payloads. When the agent hallucinates or burns a thousand tokens on a retry loop, the APM agent shows the right total latency and the right HTTP status code, and nothing else.

The shared shape across the open specifications is unambiguous:

Every meaningful step is a span. Span kinds include CHAIN (an orchestrator step), LLM (a model call), RETRIEVER (a vector search), TOOL (a function call from the agent), AGENT (a sub-agent boundary), EMBEDDING (a vector generation), and RERANKER. OpenInference names these kinds; the OpenTelemetry GenAI semantic conventions describe comparable steps through gen_ai.operation.name values such as chat, embeddings, and execute_tool rather than reusing the same kind names.
Every span carries an attribute taxonomy. Model name, model version, temperature, top-p, max tokens, system prompt hash, prompt and completion tokens, dollar cost, latency, time-to-first-token, retrieval chunk IDs and scores, tool name, tool arguments, embedding vector, embedding model. The taxonomy is published; instrumentation libraries map their internal events into it.
Every trace exports through OTLP. The OpenTelemetry Protocol is the wire format; backends that read OTLP read the trace without translation. The same OpenInference instrumentation can target Future AGI’s traceAI backend, Arize Phoenix, Langfuse, Honeycomb, Grafana Tempo, or a self-hosted Jaeger interchangeably.
Evaluators write back to the span. Hallucination probability, faithfulness, answer relevance, toxicity, JSON Schema conformity, and PII presence run inline (or on a sampled subset) and the scores attach to the same span. The shared span_id between an offline evaluation run and a production trace is what makes regressions debuggable.
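The write-back step can be sketched with plain Python objects rather than a real tracing SDK. This is a minimal illustration, not any backend's API: the `Span` class, the `attach_eval_scores` helper, and the `eval.<evaluator>` attribute naming are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    span_id: str
    kind: str  # e.g. "LLM", "TOOL", "RETRIEVER"
    attributes: dict = field(default_factory=dict)


def attach_eval_scores(spans, eval_rows):
    """Write each evaluator score onto the span that shares its span_id.

    Returns the span_ids of eval rows that matched no span.
    """
    by_id = {span.span_id: span for span in spans}
    unmatched = []
    for row in eval_rows:
        span = by_id.get(row["span_id"])
        if span is None:
            unmatched.append(row["span_id"])
            continue
        # Hypothetical attribute naming: eval.<evaluator name>.
        span.attributes[f"eval.{row['evaluator']}"] = row["score"]
    return unmatched


spans = [
    Span("a1", "LLM", {"llm.model_name": "gpt-4.1"}),
    Span("b2", "TOOL", {"tool.name": "lookup_account"}),
]
eval_rows = [
    {"span_id": "a1", "evaluator": "faithfulness", "score": 0.62},
    {"span_id": "a1", "evaluator": "toxicity", "score": 0.01},
    {"span_id": "zz", "evaluator": "faithfulness", "score": 0.90},
]
unmatched = attach_eval_scores(spans, eval_rows)
```

Because the join key is the span_id alone, the same helper would work whether the rows come from an inline evaluator or from an offline run replayed later.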

The Three Pillars of LLM Observability

Traces, metrics, and logs remain the three pillars of observability in 2026, but each is adapted to the LLM workload. Reading them as the same three pillars from traditional APM is the mistake most teams make in their first three months of trying to observe an LLM application.

Traces: Span Trees Across Agent Steps

A production trace in a 2026 LLM application isn’t three or ten spans. It’s thirty to three hundred spans, organised as a span tree rooted at the user request.

A representative trace for a customer-support agent: the user message arrives as the root span (a CHAIN span representing the agent invocation). Child spans include an LLM span for the planner step, a TOOL span wrapping a retrieval call (with nested EMBEDDING and RETRIEVER spans carrying retrieval.top_k, retrieval.score, and retrieval.chunk_ids), another LLM span for answer synthesis, and a final LLM span for formatting. Each LLM span carries llm.model_name, llm.token_count.prompt, llm.token_count.completion, llm.token_count.total, and llm.latency_ms attributes; the root span carries the aggregate cost and latency.

The span tree is what makes the trace debuggable. Without the tree, a thousand-token retry loop is invisible because the request still returned 200 OK. With the tree, the retry loop shows up as the same TOOL span repeated 14 times under the same agent root, and the fix is one configuration change.
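As a rough illustration of why the tree matters, the sketch below rebuilds parent-child links from a flat list of spans, flags a tool that repeats under one parent, and rolls token counts up to the root. The dictionary layout, field names, and the `min_repeats` threshold are assumptions for the example, not a published schema.

```python
from collections import Counter, defaultdict


def index_children(spans):
    """Map each parent span_id to the list of its child spans."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)
    return children


def repeated_tool_calls(spans, min_repeats=3):
    """Count TOOL spans with the same name under the same parent."""
    counts = Counter(
        (s["parent_id"], s["name"]) for s in spans if s["kind"] == "TOOL"
    )
    return {key: n for key, n in counts.items() if n >= min_repeats}


def rolled_up_tokens(span, children):
    """Sum token counts for a span and everything beneath it."""
    return span.get("tokens", 0) + sum(
        rolled_up_tokens(child, children) for child in children[span["span_id"]]
    )


root = {"span_id": "root", "parent_id": None, "kind": "CHAIN", "name": "agent"}
spans = [
    root,
    {"span_id": "plan", "parent_id": "root", "kind": "LLM", "name": "planner", "tokens": 120},
    {"span_id": "answer", "parent_id": "root", "kind": "LLM", "name": "synthesis", "tokens": 300},
]
# A retry loop: the same tool invoked 14 times under one agent root.
spans += [
    {"span_id": f"tool-{i}", "parent_id": "root", "kind": "TOOL", "name": "search_docs"}
    for i in range(14)
]

children = index_children(spans)
loops = repeated_tool_calls(spans)
total = rolled_up_tokens(root, children)
```

Here `loops` maps `("root", "search_docs")` to 14, which is exactly the signal a flat request log with a 200 OK status would hide.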

Metrics: Token Cost, Latency, and Quality Scores

The metrics layer of LLM observability emits four families of numbers per request, per tenant, per model, and per prompt version:

Token counts. gen_ai.client.token.usage is the OpenTelemetry GenAI metric name. Prompt tokens, completion tokens, and total tokens are emitted per call; cached prompt tokens are emitted separately when the provider supports prefix caching. The metric is the basis for cost attribution and for catching prompt bloat (a system prompt that grows by 800 tokens in a deploy shows up here within minutes).
Dollar cost. Derived from token counts and the per-model price card, emitted as gen_ai.client.cost.usd (a Future AGI extension on top of the OpenTelemetry GenAI conventions). Cost per tenant, cost per prompt version, and cost per template are the three axes most production teams query against.
Latency. Total request latency and time-to-first-token are the two numbers the user actually feels. Time-to-first-token matters for streaming responses; total latency matters for non-streaming JSON responses and for evaluation pipelines. Both are emitted as histograms so percentile graphs render correctly in the backend.
Quality eval scores. Hallucination rate, faithfulness, answer relevance, toxicity, and JSON Schema validity emit as gauges or counters; the per-span attribute version of the same number lives on the trace, so the metric drill-through to the underlying trace is one click.
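The cost family is simple arithmetic over the token family. The sketch below shows one way to derive dollar cost and roll it up per tenant; the price card values, the model name, and the assumption that `prompt_tokens` excludes cached tokens are all hypothetical.

```python
from collections import defaultdict

# Hypothetical price card in USD per million tokens; real prices vary by provider.
PRICE_CARD = {
    "example-model": {"prompt": 2.00, "cached_prompt": 0.50, "completion": 8.00},
}


def call_cost_usd(model, prompt_tokens, completion_tokens, cached_prompt_tokens=0):
    """Cost of one call. Assumes prompt_tokens excludes cached prompt tokens."""
    price = PRICE_CARD[model]
    return (
        prompt_tokens * price["prompt"]
        + cached_prompt_tokens * price["cached_prompt"]
        + completion_tokens * price["completion"]
    ) / 1_000_000


def cost_by_tenant(calls):
    """Aggregate per-call cost along the tenant axis."""
    totals = defaultdict(float)
    for call in calls:
        totals[call["tenant"]] += call_cost_usd(
            call["model"],
            call["prompt_tokens"],
            call["completion_tokens"],
            call.get("cached_prompt_tokens", 0),
        )
    return dict(totals)


calls = [
    {"tenant": "acme-corp", "model": "example-model",
     "prompt_tokens": 1200, "completion_tokens": 300, "cached_prompt_tokens": 800},
    {"tenant": "acme-corp", "model": "example-model",
     "prompt_tokens": 1000, "completion_tokens": 0},
    {"tenant": "globex", "model": "example-model",
     "prompt_tokens": 500, "completion_tokens": 250},
]
totals = cost_by_tenant(calls)
```

Swapping the grouping key from `tenant` to a prompt version or template field gives the other two axes with no other change.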

The quality score is the biggest single difference from traditional APM. An APM dashboard shows latency p50, p95, p99, error rate, and request rate; an LLM observability dashboard shows all five of those plus token cost per request, hallucination rate per template, and faithfulness on the production traffic compared to the offline benchmark. The eval score is on the same dashboard as the latency, because in LLM applications a fast hallucination is worse than a slow correct answer.

Logs: Structured Span Attributes (And the Redaction Layer)

The third pillar collapses into the first two in LLM observability. The canonical 2026 pattern is to put the prompt and completion text as attributes on the LLM span and gate that attribute behind a redaction or sampling layer for PII, PHI, and secrets.

The redaction layer is the part most teams skip and most security reviewers catch. Raw prompt logging is the second-highest data-exposure surface in an LLM stack (after the model provider itself). The 2026 standard is to run a redaction scanner inline on the prompt before it lands as a span attribute, redact PII and PHI in place, and write the redacted version to the span. The original text is either dropped or stored in a separately-controlled retention bucket with a shorter TTL and a tighter access list.
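A minimal sketch of an inline redaction pass, assuming a regex-based scanner. The three patterns and the bracketed placeholder format are illustrative only; a production scanner would need far broader coverage (names, addresses, API keys, PHI) and would likely not rely on regular expressions alone.

```python
import re

# Illustrative patterns only; not a complete PII or PHI detector.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ROUTING_NUMBER": re.compile(r"\b\d{9}\b"),
}


def redact(text):
    """Replace matches in place and report how many of each type were found."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    return text, counts


prompt = "Email jane.doe@example.com, SSN 123-45-6789, routing 021000021."
redacted, counts = redact(prompt)
# Only `redacted` would be written to the span; the original is dropped
# or sent to a separately controlled store.
```

Recording the per-type counts next to the redacted text keeps an audit trail of what was removed without keeping the sensitive values themselves.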

Why LLM Observability Differs From Traditional Observability

The three-pillar adaptation above is the surface difference. Three structural differences underneath drive every other design choice.

Eval Scores Are First-Class Telemetry

Traditional observability treats correctness as a binary signalled by HTTP status code. A 200 OK is correct; a 5xx is incorrect. LLM observability has no such binary. A 200 OK that returns a hallucinated routing number is a production incident; a 200 OK that returns a correct answer is the goal. The signal that distinguishes them is the evaluation score, not the HTTP status code.

The implication is that the observability backend has to surface eval scores everywhere the traditional backend surfaces error rates. Dashboards show hallucination rate per template per hour. Alerts fire when faithfulness drops below a threshold. The trace explorer filters on eval score the same way it filters on HTTP status.
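The dashboard and alert described above reduce to a group-by over scored spans. A small sketch, assuming each span carries a `template`, a `start_time`, and a `hallucination_score`; those field names and both thresholds are made up for the example.

```python
from collections import defaultdict
from datetime import datetime


def hallucination_rate_by_template_hour(spans, score_threshold=0.5):
    """Fraction of spans per (template, hour) whose score exceeds the threshold."""
    buckets = defaultdict(lambda: [0, 0])  # key -> [flagged, total]
    for span in spans:
        hour = span["start_time"].replace(minute=0, second=0, microsecond=0)
        bucket = buckets[(span["template"], hour)]
        if span["hallucination_score"] > score_threshold:
            bucket[0] += 1
        bucket[1] += 1
    return {key: flagged / total for key, (flagged, total) in buckets.items()}


def firing_alerts(rates, max_rate=0.2):
    """Return the (template, hour) keys whose rate is above the alert threshold."""
    return sorted(key for key, rate in rates.items() if rate > max_rate)


spans = [
    {"template": "onboarding", "start_time": datetime(2026, 1, 11, 10, 5), "hallucination_score": 0.9},
    {"template": "onboarding", "start_time": datetime(2026, 1, 11, 10, 20), "hallucination_score": 0.1},
    {"template": "onboarding", "start_time": datetime(2026, 1, 11, 10, 40), "hallucination_score": 0.7},
    {"template": "faq", "start_time": datetime(2026, 1, 11, 10, 10), "hallucination_score": 0.05},
]
rates = hallucination_rate_by_template_hour(spans)
alerts = firing_alerts(rates)
```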

Future AGI traceAI runs evaluators inline as a first-class part of the trace pipeline; Arize Phoenix and Langfuse run them as a sidecar pass that writes scores back via API. Helicone surfaces user-feedback scores but doesn’t ship a built-in evaluator catalog. The depth of the catalog and the inline-versus-sidecar mode are the two procurement axes that matter most.

Span Attributes Are an Order of Magnitude Richer

A traditional HTTP span carries roughly ten attributes: method, URL, status code, sizes, peer, retries, error, route. An LLM span carries closer to fifty: model name, model version, temperature, top-p, max tokens, system prompt hash, prompt and completion tokens, cached tokens, dollar cost, latency, time-to-first-token, tool names, tool arguments, retrieval chunk IDs and scores, embedding model, eval scores per evaluator, guardrail decisions, tenant ID, prompt version, session ID, conversation ID, the redacted prompt and completion, the finish reason, and more.

The attribute taxonomy is what makes drill-through queries possible. “Show me every trace where faithfulness is below 0.7 on prompt version v3.2 for tenant acme-corp in the last seven days” is a one-line filter in a 2026 LLM observability backend. It isn’t expressible at all against generic HTTP spans.
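To make the drill-through concrete, here is the quoted filter written against an in-memory list of span records. The field names (`tenant_id`, `prompt_version`, `faithfulness`, `start_time`) are assumptions for the sketch; a real backend would expose the same idea through its own query language.

```python
from datetime import datetime, timedelta, timezone


def low_faithfulness_traces(spans, *, tenant, prompt_version, max_score, now, days=7):
    """Return trace IDs with a span matching every attribute filter."""
    cutoff = now - timedelta(days=days)
    return sorted({
        s["trace_id"]
        for s in spans
        if s["tenant_id"] == tenant
        and s["prompt_version"] == prompt_version
        and s["faithfulness"] < max_score
        and s["start_time"] >= cutoff
    })


now = datetime(2026, 1, 11, tzinfo=timezone.utc)
spans = [
    {"trace_id": "t1", "tenant_id": "acme-corp", "prompt_version": "v3.2",
     "faithfulness": 0.55, "start_time": now - timedelta(days=2)},
    {"trace_id": "t2", "tenant_id": "acme-corp", "prompt_version": "v3.2",
     "faithfulness": 0.91, "start_time": now - timedelta(days=1)},
    {"trace_id": "t3", "tenant_id": "acme-corp", "prompt_version": "v3.1",
     "faithfulness": 0.40, "start_time": now - timedelta(days=1)},
    {"trace_id": "t4", "tenant_id": "globex", "prompt_version": "v3.2",
     "faithfulness": 0.30, "start_time": now - timedelta(days=3)},
    {"trace_id": "t5", "tenant_id": "acme-corp", "prompt_version": "v3.2",
     "faithfulness": 0.20, "start_time": now - timedelta(days=10)},
]
matches = low_faithfulness_traces(
    spans, tenant="acme-corp", prompt_version="v3.2", max_score=0.7, now=now
)
```

None of the four filter fields exist on a generic HTTP span, which is why the same question has no answer in a traditional APM trace store.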

OpenInference publishes the canonical attribute namespaces (llm.*, retrieval.*, tool.*, embedding.*, chain.*); OpenTelemetry GenAI publishes the OTel-native equivalents (gen_ai.*). Most 2026 instrumentation emits both.

Sampling Is Inverted

Traditional APM uses head-based sampling: at request entry, decide whether to keep the trace (typically 1 to 10 percent) and drop the rest before any further work happens. This works for HTTP workloads because aggregate metrics are representative and the kept 1 percent is a useful drill-down sample.

LLM observability inverts the intuition. A hallucination on a single niche query is often more important than the latency distribution of the bulk of requests, because the niche query reveals a failure mode the bulk traffic doesn’t exercise. Head-based sampling drops the niche query 99 percent of the time, which is the wrong outcome.

The 2026 standard is tail-based sampling driven by evaluation scores and explicit signals. Keep every span where the hallucination evaluator fires above a threshold. Keep every span where token cost exceeds a per-request budget. Keep every span where the user fed back a thumbs-down. Keep every span where a guardrail blocked or rewrote the response. Sample the remainder at 1 to 5 percent for baseline graphs. Future AGI ships eval-driven tail sampling natively in the traceAI pipeline; most other backends require a custom OTel collector configuration.
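The keep rules above can be sketched as a single decision function that runs once a trace has completed and been scored. The field names, thresholds, and reason labels are illustrative assumptions; this is not the traceAI sampler or an OTel collector processor.

```python
import hashlib


def sampling_decision(trace, *, hallucination_threshold=0.5,
                      cost_budget_usd=0.05, baseline_rate=0.02):
    """Return (keep, reason) for a completed, scored trace."""
    if trace.get("hallucination_score", 0.0) > hallucination_threshold:
        return True, "hallucination"
    if trace.get("cost_usd", 0.0) > cost_budget_usd:
        return True, "cost"
    if trace.get("user_feedback") == "thumbs_down":
        return True, "feedback"
    if trace.get("guardrail_action") in {"blocked", "rewritten"}:
        return True, "guardrail"
    # Baseline: hash the trace_id into [0, 1) so the same trace always
    # gets the same decision, whichever worker evaluates it.
    digest = hashlib.sha256(trace["trace_id"].encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < baseline_rate:
        return True, "baseline"
    return False, "dropped"


flagged = sampling_decision({"trace_id": "t1", "hallucination_score": 0.8})
expensive = sampling_decision({"trace_id": "t2", "cost_usd": 0.40})
thumbs_down = sampling_decision({"trace_id": "t3", "user_feedback": "thumbs_down"})
blocked = sampling_decision({"trace_id": "t4", "guardrail_action": "blocked"})
```

Hashing the trace ID instead of calling a random number generator is a design choice: it keeps the baseline sample reproducible, so a trace is never kept by one collector replica and dropped by another.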

The Five Approaches to LLM Observability in 2026

Production teams in 2026 ship LLM observability under one of five approaches. The approach drives almost every downstream procurement decision (license, lock-in, depth of evals, sampling strategy, retention cost).

1. OpenInference Plus OpenTelemetry Native

The open-standards path. The application is instrumented with OpenInference libraries (one install per LLM client and orchestrator framework); the instrumentation emits OTLP-compatible spans; the spans export to any OTLP collector or directly to a backend that reads OTLP.

This is the path Future AGI traceAI, Arize Phoenix, and Langfuse all support natively. The strongest property is backend-portability: switching the backend is a configuration change, not a rewrite. The instrumentation library catalog is the deepest of any approach (OpenInference covers OpenAI, Anthropic, LangChain, LlamaIndex, LiteLLM, Bedrock, Vertex AI, AutoGen, CrewAI, Pydantic AI, OpenAI Agents SDK, and more out of the box).

2. Vendor-Proprietary

The closed-format path. The application uses a vendor SDK; the vendor SDK emits a proprietary wire format to a vendor backend; the format and instrumentation aren’t portable. The advantage is integration depth at the vendor level. The disadvantage is the lock-in and the procurement risk if the vendor pivots or gets acquired. In 2026, the proprietary path is most often a transitional choice for teams that adopted a tool early and haven’t yet migrated to OpenInference.

3. Log-Shipping

The lightweight path. Write every request and response to stdout or a managed log pipeline (Datadog Logs, AWS CloudWatch, Grafana Loki, or a vendor-managed equivalent like Helicone). Reconstruct the trace at query time by joining log lines on a request ID.

The advantage is operational simplicity. The disadvantage is that logs without span structure can’t answer “why did this trace fail?” - the parent-child relationship between the agent step and the underlying tool call isn’t in the log payload, the eval score isn’t joined to the log line, and cost attribution requires a downstream warehousing job. Helicone is the canonical proxy-based log-shipper. Helicone joined Mintlify on March 3, 2026 and is in maintenance mode, which compounds procurement risk for new adopters.
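A small sketch of the query-time join, assuming JSON log lines with hypothetical `request_id`, `ts`, and `event` fields. It shows both what the approach recovers (an ordered list of events per request) and what it cannot (which event was the parent of which).

```python
import json
from collections import defaultdict


def group_by_request(log_lines):
    """Join JSON log lines on request_id and order each group by timestamp."""
    grouped = defaultdict(list)
    for line in log_lines:
        record = json.loads(line)
        grouped[record["request_id"]].append(record)
    for records in grouped.values():
        records.sort(key=lambda r: r["ts"])
    return dict(grouped)


log_lines = [
    '{"request_id": "r1", "ts": 3, "event": "llm_response"}',
    '{"request_id": "r2", "ts": 1, "event": "llm_request"}',
    '{"request_id": "r1", "ts": 1, "event": "llm_request"}',
    '{"request_id": "r1", "ts": 2, "event": "tool_call"}',
]
by_request = group_by_request(log_lines)
events_r1 = [record["event"] for record in by_request["r1"]]
# The result is a flat timeline: nothing in the records says whether
# tool_call ran inside the LLM step or beside it.
```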

4. Custom Warehousing

The ad-hoc path. Write every prompt, completion, span, and eval score directly to a columnar analytical warehouse (ClickHouse, BigQuery, Snowflake, DuckDB) with a bespoke schema. Build dashboards with the team’s existing BI tool. Reconstruct the trace tree with SQL joins.

The advantage is total control over schema, retention, and query patterns; the warehouse can join LLM spans to revenue data and customer records in the same SQL query. The disadvantage is the upfront engineering cost and perpetual maintenance burden. A few large enterprises in regulated industries ship a custom warehouse as the system-of-record because the audit requirement demands it, with an OpenInference plus OTLP pipeline as the live operational view on top.
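Reconstructing the tree in SQL usually means a recursive self-join on the parent ID. The sketch below uses Python's built-in sqlite3 as a stand-in for the warehouse; the `spans` table layout is a made-up bespoke schema, and recursive query syntax differs between the warehouses named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spans ("
    "span_id TEXT PRIMARY KEY, parent_id TEXT, trace_id TEXT, kind TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?, ?, ?)",
    [
        ("s1", None, "t1", "CHAIN", "agent"),
        ("s2", "s1", "t1", "LLM", "planner"),
        ("s3", "s1", "t1", "TOOL", "search"),
        ("s4", "s3", "t1", "RETRIEVER", "vector_search"),
        ("s5", "s1", "t1", "LLM", "synthesis"),
    ],
)

# Walk from the root span down, tracking depth and a path used for ordering.
TREE_QUERY = """
WITH RECURSIVE tree(span_id, name, kind, depth, path) AS (
    SELECT span_id, name, kind, 0, span_id
    FROM spans
    WHERE trace_id = ? AND parent_id IS NULL
    UNION ALL
    SELECT s.span_id, s.name, s.kind, t.depth + 1, t.path || '/' || s.span_id
    FROM spans AS s
    JOIN tree AS t ON s.parent_id = t.span_id
)
SELECT depth, kind, name FROM tree ORDER BY path
"""
rows = conn.execute(TREE_QUERY, ("t1",)).fetchall()

# Render the tree with indentation proportional to depth.
rendered = "\n".join(f"{'  ' * depth}{kind} {name}" for depth, kind, name in rows)
```

This is the query the team owns forever on the custom path, which is part of the maintenance burden described above.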

5. Hybrid

The realistic production path. OpenInference instrumentation in the application, OTLP export to a vendor or open-source backend for the live operational view, and a custom warehouse sink for long-term retention and finance attribution. The OTel collector forks the OTLP stream: one branch to the backend, one to the warehouse.

Most production teams above a thousand requests per day end up here within twelve months. The first three months usually run on a single backend; months three through six add the warehouse sink because procurement, finance, or compliance asks for retention beyond what the backend offers.

A Buyer’s Guide for LLM Observability in 2026

Six criteria decide which backend wins a procurement evaluation.

License clarity. Apache 2.0 or MIT for the open-source backend, so the source is auditable and the deploy is portable. Vendor-managed plans should publish a clear license for the on-premise binary if one exists.
Open standards conformance. OpenInference instrumentation support and OTLP ingest are the two non-negotiables. A backend that requires a proprietary SDK is a procurement risk in 2026.
Evaluation depth. Count the built-in evaluators in the catalog (hallucination, faithfulness, answer relevance, toxicity, JSON Schema validity, PII detection, prompt injection, custom LLM-judge templates, custom code-based templates). Inline evaluation matters more than sidecar evaluation for tail sampling.
Agent and multi-step support. A 2026 trace has 30 to 300 spans, not 3 to 10. The span tree explorer has to render the multi-step structure without collapsing it; the trace search has to filter on parent-child relationships; cost attribution has to roll up across the tree.
Cost attribution at the span level. Per-span dollar cost, per-tenant cost, per-prompt-version cost, and per-template cost are the four axes finance and product teams need.
Eval-to-trace round trip. A regression debugged in an offline evaluation run should land in the live dashboard with one click via shared span_id. The backend that ships this loop natively (rather than as an integration task) is the one that pays for itself.

A practical evaluation runs the same test workload through three or four candidate backends in parallel for a week, measures the four metric families (token, cost, latency, eval) across all backends, and compares the trace explorer experience on 30 to 50 representative traces. The winner is rarely the one with the longest feature list; it’s the one whose trace tree renders the agent’s behaviour clearly enough that a new engineer can find the bug in five minutes.

Five Myths About LLM Observability

Myth 1: Request and response logging is observability. Logs without span structure, attribute taxonomy, and eval scores can’t answer “why did this trace fail?”. The parent-child relationship between agent step and tool call isn’t in a log line; the eval score isn’t joined to the log line; cost attribution requires a downstream warehousing job. Logging is a useful baseline; it isn’t observability.

Myth 2: A vendor SDK is the only path to deep instrumentation. OpenInference plus OTLP is the open-standards path and every serious backend supports it. The instrumentation library catalog is now richer than any single vendor SDK ships.

Myth 3: Traditional APM handles LLMs natively. APM agents emit HTTP spans, not GenAI spans. The Datadog APM agent installed into a LangChain application emits one or two HTTP spans per request and a hundred internal LLM steps live invisibly inside them. Newer GenAI tabs in some APM products read OTLP-style GenAI spans, but only when the application is instrumented with OpenInference or the OpenTelemetry GenAI SDK.

Myth 4: Head-based sampling is fine for LLMs. The hallucination on the niche query is the trace you most want; head-sampling drops it 99 percent of the time. Tail-based sampling driven by eval scores is the production-correct pattern.

Myth 5: Evaluation belongs in a separate offline pipeline. In 2026, evaluators run inline (or on a sampled subset) and write scores to the same span as the request. The shared span_id between offline runs and production traces is what makes regressions debuggable.

The 2026 LLM Observability Landscape

Five projects define the production-relevant set in 2026. Four are Apache 2.0 or MIT open source; the fifth, Helicone, is in maintenance mode after joining Mintlify.

Future AGI traceAI is Future AGI’s Apache 2.0 OpenInference plus OpenTelemetry-native instrumentation layer; it ships instrumentors for OpenAI, Anthropic, LangChain, LlamaIndex, LiteLLM, Bedrock, Vertex AI, AutoGen, CrewAI, OpenAI Agents SDK, Pydantic AI, and more, and emits OTLP-compatible spans to the Future AGI backend or to any OTLP-compatible target. The differentiator is the closed eval-gateway loop: traceAI feeds the same evaluation catalog that the Future AGI ai-evaluation, agent-optimization, and Agent Command Center gateway run on, so a regression debugged offline can be promoted to a gateway guardrail in one stack.
Arize Phoenix is the canonical open-source notebook-first eval and trace explorer. Apache 2.0, OpenInference native (Arize originated the OpenInference specification), the open ancestor of Arize’s commercial AX product. Phoenix is the go-to choice for teams that want a local-first trace explorer plus an evaluator runner; it’s widely used for prototype-stage observability and as a development companion.
Langfuse is MIT, the open-source product surface for trace, eval, prompt management, and dataset versioning. The product covers a broad surface area and ships a public cloud as well as a self-host. Langfuse’s prompt registry and dataset tooling are the strongest features against the rest of the field.
Helicone is the canonical log-shipping path through a proxy: it intercepts every LLM call and writes a structured log line per request. Helicone joined Mintlify on March 3, 2026 and is in maintenance mode; new adopters should weigh the maintenance status against the simplicity of the integration. The product remains usable for teams that want a one-line proxy installation.
Maxim Bifrost is Apache 2.0 in Go, primarily an AI gateway, with first-class OpenTelemetry GenAI trace emission on the gateway hop. Bifrost is the right pick when the observability and the gateway live on the same network hop, and when the team wants both surfaces in one binary; teams that want a deeper trace explorer with rich agent multi-step rendering typically pair Bifrost with a dedicated backend (Future AGI traceAI, Phoenix, or Langfuse) on the receiving end.

How We Think About LLM Observability at Future AGI

Future AGI ships the four surfaces - observability, evaluation, agent optimization, and the Agent Command Center gateway - on the same code path, so a span captured in production can be replayed against an offline evaluator, the regression debugged, the fix promoted, and a guardrail derived from the same eval enforced at the gateway. The closed loop is the point.

traceAI is Apache 2.0, OpenInference plus OpenTelemetry-native instrumentation. The catalog covers LLM clients (OpenAI, Anthropic, Cohere, Mistral, Groq), orchestration frameworks (LangChain, LlamaIndex, LiteLLM, Haystack, DSPy), agent frameworks (CrewAI, AutoGen, OpenAI Agents SDK, Pydantic AI, AG2), and cloud providers (Bedrock, Vertex AI, Azure OpenAI). Spans export via OTLP to the Future AGI backend or any OTLP collector.
ai-evaluation runs inline on traces or offline on benchmark datasets; the evaluator catalog covers hallucination, faithfulness, answer relevance, toxicity, PII presence, prompt injection, custom LLM-judge templates, and custom code-based templates. Eval scores write to the same span as the request.
agent-optimization closes the loop: it takes a failing trace plus an eval score, suggests a prompt or tool revision, and lets the team promote the revision to a new prompt version. The next production trace uses the new version; the eval pipeline re-scores it.
Agent Command Center is the gateway that enforces the eval-derived guardrail at the network hop. The Protect guardrail layer reports a 65 ms median time-to-label for text per the Future AGI Protect paper (arXiv 2510.13351), and the guardrail policy is derived from the same eval catalog that scored the trace upstream.

The self-improving loop is the point. A team that ships only traceAI gets the canonical OpenInference plus OTLP experience and competes with Phoenix and Langfuse. A team that ships the full stack gets the trace plus eval plus optimization plus gateway loop on one code path, where a regression in production drives a fix that lands in the gateway guardrail without leaving the same product. That’s a depth-of-integration claim, not a feature-list claim.

Wiring an Application for LLM Observability

The simplest possible starting point is two calls, register and instrument, plus credentials supplied through environment variables.

from fi_instrumentation import register
from traceai_openai import OpenAIInstrumentor

# Register the OTel SDK with traceAI defaults.
register(
    project_name="customer-support-agent",
    endpoint="https://app.futureagi.com/v1/traces",
)

# Instrument the OpenAI client; every call now emits OpenInference spans.
OpenAIInstrumentor().instrument()

# Existing OpenAI SDK code unchanged from here.
import openai
openai.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Explain the trace."}],
)

Every call through the OpenAI SDK now emits an OpenInference LLM span with llm.model_name, llm.token_count.prompt, llm.token_count.completion, llm.token_count.total, and llm.latency_ms attributes. Adding LangChain, LlamaIndex, or an agent framework is one more instrumentor call per library. Adding the evaluation pass is one more call against the Future AGI eval API; the score writes back to the same span via span_id.

The same pattern works against any OTLP-compatible backend; pointing the OTel endpoint at Phoenix, Langfuse, or a self-hosted collector swaps the backend without rewriting the instrumentation.

When Should You Adopt LLM Observability?

Adopt LLM observability when any of these conditions holds: more than one LLM call per user request (true the moment you ship an agent, a RAG pipeline, or a multi-step chain); more than one tenant whose token cost needs separate attribution; a regulated workload (HIPAA, SOC 2, NYDFS, EU AI Act Annex III) that demands per-request audit attributes; or a quality regression in production that the request log alone can’t explain.

Skip LLM observability when you have one LLM call per request, one tenant, no regulatory pressure, no quality complaints, and a deployment small enough that a tail -f on the application log is enough. That combination disappears within the first six months of production traffic for most teams.

For most production teams in 2026, the question isn’t whether to ship LLM observability but which of the five approaches to start with. The hybrid path is the median end state; most teams begin with OpenInference plus OTel native and add the warehouse sink in month three or four.


Frequently Asked Questions

What Is LLM Observability in Simple Terms?

LLM observability captures every prompt, completion, tool call, retrieval, and agent step in a generative AI system as a structured span, attaches token counts, dollar cost, latency, and evaluation scores to that span, and exports the result to a backend that renders the span tree, attributes cost per tenant, and ties a production trace to an offline evaluation score through a shared `span_id`. It is the evolution of APM for stacks where the most expensive and fragile component is a probabilistic model call, not a database query.

How Is LLM Observability Different From Traditional Observability?

Three differences matter. Evaluation scores are first-class telemetry alongside latency, because a 200 OK with a hallucination is a production incident the request log alone cannot flag. Span attributes are an order of magnitude richer than HTTP spans: model, temperature, system prompt hash, retrieved chunks, embedding vector, tool arguments, and per-call token counts live on the span. Sampling is inverted: traditional observability uses head-based sampling, keeping a random 1 to 10 percent of traces at request entry and dropping the rest, while LLM observability keeps long-tail errors aggressively because a single hallucination on a niche query is the signal you need.

What Are the Three Pillars of LLM Observability?

Traces, metrics, and logs adapted to LLMs. Traces are span trees rooted at the user request, with child spans for every prompt, retrieval, tool call, and sub-agent step; OpenInference defines the span kinds (CHAIN, LLM, RETRIEVER, TOOL, AGENT, EMBEDDING, RERANKER), and OpenTelemetry GenAI describes comparable operations in its own gen_ai.* vocabulary. Metrics are token counts, dollar cost, time-to-first-token, latency, and aggregate evaluation scores per request, per tenant, per model, per prompt version. Logs are raw prompt and completion text plus structured attributes on the same span, gated behind a redaction layer.

What Standards Should LLM Observability Anchor To in 2026?

Three. OpenInference (Apache 2.0, originally from Arize) defines span kinds, attribute namespaces, and instrumentation libraries. OpenTelemetry GenAI semantic conventions define OTel-native spans, metrics, and events for GenAI clients, MCP, and provider-specific conventions; toggle latest experimental conventions on with `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental`. OTLP itself is the third anchor, so traces remain backend-portable.

What Are the Five Approaches to LLM Observability?

OpenInference plus OpenTelemetry native (the open standards path through FAGI traceAI, Arize Phoenix, or Langfuse). Vendor-proprietary (closed SDK and backend). Log-shipping (write to stdout or a log pipeline like Helicone and reconstruct at query time). Custom warehousing (write spans to ClickHouse, BigQuery, or Snowflake with bespoke schema). Hybrid (OpenInference instrumentation, OTLP to a backend, plus a warehouse sink for retention) is the realistic production path most teams converge on.

How Does Evaluation Fit Into LLM Observability?

Evaluation is first-class telemetry, not a separate pipeline. Production-grade LLM observability runs evaluators inline on every span (or a sampled subset) and writes the score - hallucination probability, faithfulness, answer relevance, toxicity, JSON schema validity - as a span attribute. The same evaluator runs offline on benchmark datasets; the link between offline and production is the `span_id` shared between the evaluation run and the production trace.

Why Is Sampling Harder for LLM Observability?

Long-tail errors are the most valuable signal, and uniform head-based sampling discards most of them before anyone knows they were errors. The 2026 standard is tail-based sampling driven by evaluation scores: keep every span where the hallucination evaluator fires above a threshold, every span where token cost exceeds a per-request budget, and every span where the user left a thumbs-down. Sample the remainder at 1 to 5 percent. Future AGI ships eval-driven tail sampling natively; most other backends require a custom collector pipeline.
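
The keep rules above can be sketched as a single decision function. This is an illustration under stated assumptions, not any backend's implementation: the attribute names (`eval.hallucination`, `cost.usd`, `feedback.thumbs`), the threshold, and the budget are all hypothetical. It decides per span for brevity, whereas a collector would usually decide per trace.

```python
import random


def keep_span(span, *, hallucination_threshold=0.5, cost_budget_usd=0.25,
              base_rate=0.02, rng=random.random):
    """Return True if the span should be retained after the request completes."""
    attrs = span["attributes"]
    # Rule 1: the hallucination evaluator fired above the threshold.
    if attrs.get("eval.hallucination", 0.0) > hallucination_threshold:
        return True
    # Rule 2: the request blew through its token-cost budget.
    if attrs.get("cost.usd", 0.0) > cost_budget_usd:
        return True
    # Rule 3: the user gave explicit negative feedback.
    if attrs.get("feedback.thumbs") == "down":
        return True
    # Everything else: keep a small random share (1 to 5 percent in the text).
    return rng() < base_rate


flagged = {"attributes": {"eval.hallucination": 0.91, "cost.usd": 0.01}}
expensive = {"attributes": {"eval.hallucination": 0.02, "cost.usd": 1.40}}
disliked = {"attributes": {"feedback.thumbs": "down"}}
ordinary = {"attributes": {"eval.hallucination": 0.02, "cost.usd": 0.01}}
```

The decision needs evaluation scores and final cost, neither of which exists when the request starts. That is why it has to run at the tail rather than the head.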

Can LLM Observability Replace an APM?

Partially. The LLM observability stack handles the model, retrieval, and agent layers natively; traditional APM (Datadog APM, New Relic, Honeycomb) handles the HTTP, database, queue, and infrastructure layers. OTLP makes the two complementary: a single collector receives HTTP spans from the APM agent and GenAI spans from the OpenInference instrumentation, and as long as both share the same trace context, the backend can render them as one trace tree.
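
HTTP spans and GenAI spans end up in one tree because they carry the same trace ID, which is typically propagated with the W3C Trace Context `traceparent` header. The sketch below parses that header and derives a child context by hand, purely to show the mechanism; in practice the OpenTelemetry SDK does this for you. The function names are illustrative.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Split a traceparent header into trace ID, parent span ID, and sampled flag."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        raise ValueError(f"not a version-00 traceparent header: {header!r}")
    trace_id, parent_id, flags = match.groups()
    return {"trace_id": trace_id, "parent_id": parent_id,
            "sampled": bool(int(flags, 16) & 0x01)}


def child_traceparent(header):
    """Build the header a child span would send downstream: same trace, new span ID."""
    ctx = parse_traceparent(header)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx["sampled"] else "00"  # only the sampled bit is carried over
    return f"00-{ctx['trace_id']}-{new_span_id}-{flags}"


# Header the APM-instrumented HTTP layer might hand to the agent code.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

Because `outgoing` keeps the trace ID from `incoming`, an LLM span created under it attaches to the same trace as the HTTP span that received the request.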

What Is OpenInference and How Does It Relate to OpenTelemetry?

OpenInference is an Apache 2.0 instrumentation spec originally published by Arize, defining span kinds (CHAIN, LLM, RETRIEVER, TOOL, AGENT, EMBEDDING, RERANKER), attribute namespaces, and a library catalog (OpenAI, Anthropic, LangChain, LlamaIndex, LiteLLM, Bedrock, Vertex AI, AutoGen, CrewAI, Pydantic AI, OpenAI Agents SDK, and more). OpenTelemetry GenAI is the broader standard inside OpenTelemetry. The two converged in 2025-2026: OpenInference instrumentors emit OTLP-compatible spans, and GenAI semantic conventions borrow attribute names from OpenInference. Practically: instrument with OpenInference, export via OTLP.

What Should I Look for in an LLM Observability Buyer's Guide?

Six criteria. License clarity, open standards conformance (OpenInference plus OTLP), evaluation depth (count built-in evaluators in the catalog), agent and multi-step support (a 2026 trace has 30 to 300 spans), cost attribution at the span level (not just request level), and a clear path from a production span to an offline evaluation row via shared `span_id`.
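
To see why span-level cost attribution matters, consider this sketch, which prices each LLM span from its token counts and then rolls the cost up per tenant. The price table, the model name, and the `tenant.id` attribute are hypothetical; `llm.token_count.prompt` and `llm.token_count.completion` are OpenInference attribute names.

```python
from collections import defaultdict

# Hypothetical prices in USD per one million tokens; real prices vary by provider.
PRICES_PER_MILLION = {"example-model": {"prompt": 2.00, "completion": 8.00}}


def span_cost_usd(attrs):
    """Price one LLM span from its prompt and completion token counts."""
    price = PRICES_PER_MILLION[attrs["llm.model_name"]]
    return (attrs["llm.token_count.prompt"] * price["prompt"]
            + attrs["llm.token_count.completion"] * price["completion"]) / 1_000_000


def cost_by_tenant(llm_spans):
    """Aggregate span-level cost per tenant."""
    totals = defaultdict(float)
    for attrs in llm_spans:
        totals[attrs["tenant.id"]] += span_cost_usd(attrs)
    return dict(totals)


llm_spans = [
    {"tenant.id": "tenant-a", "llm.model_name": "example-model",
     "llm.token_count.prompt": 1000, "llm.token_count.completion": 500},
    {"tenant.id": "tenant-a", "llm.model_name": "example-model",
     "llm.token_count.prompt": 500, "llm.token_count.completion": 100},
    {"tenant.id": "tenant-b", "llm.model_name": "example-model",
     "llm.token_count.prompt": 2000, "llm.token_count.completion": 250},
]
totals = cost_by_tenant(llm_spans)
```

Because cost lives on each span, the same data answers "which tenant" and "which step inside the agent" without a second pipeline. A request-level total cannot answer the second question.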

Where Does FAGI traceAI Fit in the LLM Observability Landscape?

FAGI traceAI is Future AGI's Apache 2.0 instrumentation layer, native to both OpenInference and OpenTelemetry. It ships instrumentors across the LLM client, orchestration, and agent framework ecosystem and emits OTLP-compatible spans. The deeper fit is the closed loop: traceAI feeds the same eval catalog that the Future AGI evaluation surface, the agent optimization surface, and the Agent Command Center gateway run on. A span captured in production can be replayed against an offline evaluator, the regression debugged, the fix promoted, and a guardrail derived from the same eval enforced at the gateway, all in one stack.

How Does FAGI traceAI Differ From Arize Phoenix?

Both are Apache 2.0, both ship OpenInference instrumentation, both export OTLP, and the two projects share much of the underlying OpenInference catalog. Phoenix is an open-source notebook-first eval and trace explorer, with Arize's commercial AX product as the production surface. FAGI traceAI is the OpenInference-native instrumentation layer that feeds Future AGI's production evaluation, agent optimization, and Agent Command Center gateway in a single stack.

Does LLM Observability Slow Down My Application?

Synchronous instrumentation overhead is in the single-digit milliseconds for a well-tuned OpenInference exporter; the OpenTelemetry SDK is non-blocking by default and batches spans before OTLP export. The slow part is inline evaluation, which adds a model call. Most production stacks run evaluators on a sampled subset (5 to 20 percent) or asynchronously after the response returns. The Future AGI Protect model family runs inline at a median time-to-label of 65 ms for text and 107 ms for images, per the [Future AGI Protect paper (arXiv 2510.13351)](https://arxiv.org/abs/2510.13351). Protect consists of FAGI's own fine-tuned Gemma 3n adapters covering content moderation, bias detection, security and prompt injection, and data privacy and PII. It is a model family rather than a plugin chain, and those latencies serve as the reference bound for inline production evaluation in 2026.
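
The asynchronous option can be sketched with a thread pool: the handler returns the answer immediately, and the evaluator runs off the request path and records its score against the `span_id`. The evaluator here is a toy word-overlap function standing in for a real model call, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=2)
scores_by_span_id = {}  # stand-in for the backend's score store


def toy_relevance(question, answer):
    """Placeholder evaluator: share of question words that also appear in the answer."""
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / len(q_words) if q_words else 0.0


def evaluate_and_record(span_id, question, answer):
    """Run the evaluator and store the score keyed by span_id."""
    score = toy_relevance(question, answer)
    scores_by_span_id.setdefault(span_id, {})["answer_relevance"] = score
    return score


def handle_request(question, span_id):
    """Return the answer right away; evaluation continues in the background."""
    answer = "your routing number is on the account page"  # stand-in for a model call
    future = _executor.submit(evaluate_and_record, span_id, question, answer)
    return answer, future


answer, pending = handle_request("where is my routing number", "a1b2c3d4e5f60718")
score = pending.result(timeout=5)  # only the example waits; a real handler would not
```

The user-facing latency is unaffected by the evaluator. The trade-off is that the score arrives after the span has been exported, so the backend must accept late annotations keyed by `span_id`.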

What Are the Common Myths About LLM Observability?

Five. Logs are not observability without span structure, attribute taxonomy, and eval scores. A vendor SDK is not the only path; OpenInference plus OTLP is the open standards path. Traditional APM does not handle LLMs natively; the APM agent emits HTTP spans, not GenAI spans. Head-based sampling is wrong for LLMs because long-tail errors are the signal. Evaluation does not belong in a separate offline pipeline; in 2026, evaluators run inline and write scores to the same span as the request.

What Is the Simplest Way to Start With LLM Observability?

Install the OpenInference instrumentor for your LLM client, point the OTLP exporter at the backend of your choice, and verify the span tree renders with at least one LLM span carrying `llm.model_name` and `llm.token_count.total`, plus a span duration that reflects the model call's latency (OpenInference derives latency from the span's start and end timestamps rather than from a dedicated `llm.latency_ms` attribute). Add an evaluator pass (hallucination plus answer relevance is the minimal pair) and verify the score writes to the span. That gets you the canonical trace plus eval pattern in an afternoon.
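
The verification step can be automated with a small check over exported spans. The sketch below assumes spans have already been dumped to dictionaries with `attributes`, `start_time_ns`, and `end_time_ns` keys, which is a hypothetical export shape. It reads latency from the span's timestamps. `openinference.span.kind`, `llm.model_name`, and `llm.token_count.total` are OpenInference attribute names.

```python
REQUIRED_LLM_ATTRIBUTES = ("llm.model_name", "llm.token_count.total")


def find_valid_llm_spans(spans):
    """Return LLM spans that carry the required attributes and a positive duration."""
    valid = []
    for span in spans:
        attrs = span.get("attributes", {})
        if attrs.get("openinference.span.kind") != "LLM":
            continue
        has_attrs = all(key in attrs for key in REQUIRED_LLM_ATTRIBUTES)
        has_duration = span.get("end_time_ns", 0) > span.get("start_time_ns", 0)
        if has_attrs and has_duration:
            valid.append(span)
    return valid


# Hypothetical export of one trace.
exported = [
    {"name": "agent_run", "start_time_ns": 1_000, "end_time_ns": 9_000,
     "attributes": {"openinference.span.kind": "CHAIN"}},
    {"name": "chat_completion", "start_time_ns": 2_000, "end_time_ns": 7_500,
     "attributes": {"openinference.span.kind": "LLM",
                    "llm.model_name": "example-model",
                    "llm.token_count.total": 1243}},
    {"name": "broken_llm_span", "start_time_ns": 3_000, "end_time_ns": 3_500,
     "attributes": {"openinference.span.kind": "LLM"}},
]
valid_llm_spans = find_valid_llm_spans(exported)
```

A smoke test in CI can then assert that `valid_llm_spans` is non-empty for a known request, which catches instrumentation that silently stops emitting model or token attributes.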
