
LLM Observability and Monitoring in 2026: The Complete Field Guide

What LLM observability means in 2026: traces, spans, evals, span-attached scores. Compare top 5 platforms, see real traceAI code, and learn what to alert on.


TL;DR: LLM Observability in 2026

| Concept | What it means in 2026 | The primitive |
| --- | --- | --- |
| Trace | A directed acyclic graph of spans for one user request | OpenTelemetry span |
| Span | One unit of work (LLM call, tool call, retriever call, workflow step) | gen_ai.* attributes |
| Metric | Aggregate over many spans (p95 latency, cost per route) | Span attribute roll-up |
| Eval score | Quality number attached to a span | fi.evals.evaluate() + auto-enrichment |
| Alert | Notification when a metric breaches a threshold | Rolling window query |
| Versioning | Prompt and model variant tracking | Span attribute + dataset |

The 2026 maturity bar is that all six live on the same span data. One UI shows trace tree, eval score, prompt version, cost, latency, and alert status for any request.

Why LLM Observability Is Different from Traditional APM

Classic application performance monitoring assumes deterministic behavior: the same input produces the same output, errors are exceptions, latency follows a known distribution. LLM applications break all three assumptions.

  • Non-deterministic outputs. Two identical requests can produce different answers, with different quality, different cost, and different latency. You cannot debug by replaying a request and inspecting the result.
  • Opaque internals. The LLM is a black box; you only observe inputs and outputs, not the reasoning that produced the output. The “stack trace” for a wrong answer is the prompt, the retrieved context, the tool calls, and the model output, not a file and a line number.
  • Multi-component pipelines. A single user request invokes a retriever, a reranker, possibly a tool, possibly a sub-agent, and one or more LLM calls. Each can fail in its own way.
  • Quality is a continuous variable. A response is not “success” or “fail” but “0.83 grounded, 0.91 relevant, 0.04 toxic”. Observability has to surface those continuous quality dimensions, not just binary outcomes.

That is why 2026 LLM observability stacks add four primitives on top of the classic three pillars (logs, metrics, traces): span-attached evaluation scores, prompt and model versions, token and cost telemetry, and retrieval and tool-call structure.

What to Trace in an LLM Application

| Span type | What it captures | Why you trace it |
| --- | --- | --- |
| LLM call | Provider, model, prompt, completion, token usage, finish reason | Cost, quality, latency, version drift |
| Retriever call | Query, top-k chunks, similarity scores, retrieval latency | RAG groundedness, recall |
| Reranker call | Input chunks, reranked order, scores | Retrieval quality lift |
| Tool call | Tool name, arguments, result, success / failure | Agent reliability, argument correctness |
| Embedding call | Input text, model, vector dimensions, latency | Index health, cost |
| Workflow step | Step name, inputs, outputs, duration | Multi-step trace tree |
| Sub-agent call | Agent name, input, output, sub-spans | Multi-agent decomposition |
| Eval span | Metric name, score, reason, latency_ms | Span-attached quality scoring |

Every entry in this table should appear as an OpenTelemetry span with standard gen_ai.* attributes in 2026. Auto-instrumentation libraries (traceAI, OpenInference) produce LLM, retriever, embedding, and tool-call spans automatically for popular frameworks; custom workflow steps, sub-agent boundaries, and eval spans typically need a one-line manual span wrapper.
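
For the rows that auto-instrumentation does not cover, a thin wrapper around the OpenTelemetry API is usually enough. Here is a minimal sketch of such a wrapper; the workflow.step and output.value attribute names are illustrative, not part of any spec, so adapt them to whatever keys your backend expects.

from opentelemetry import trace

tracer = trace.get_tracer("rag_demo")

def traced_step(step_name, fn, *args, **kwargs):
    # Run fn inside its own span so custom workflow steps and sub-agent
    # boundaries show up in the same trace tree as auto-instrumented spans.
    with tracer.start_as_current_span(step_name) as span:
        span.set_attribute("workflow.step", step_name)          # illustrative key
        result = fn(*args, **kwargs)
        span.set_attribute("output.value", str(result)[:500])   # truncate large payloads
        return result

# Usage: wrap any plain Python function that is part of the pipeline.
summary = traced_step("summarise_ticket", lambda text: text[:200], "long ticket text ...")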

Top 5 LLM Observability Platforms in 2026

1. Future AGI: Unified observability + evaluation in one platform

Why #1. Future AGI is the only platform on this list that ships OpenTelemetry-native auto-instrumentation (traceAI), a unified eval API (fi.evals.evaluate() over 100+ Turing-cloud templates and 76+ local heuristics), and a span-enrichment hook that attaches every score to the span where it ran, in one stack. The result is one UI that shows trace tree, evaluation scores, prompt, retrieved context, cost, latency, and alerts for any request.

Capabilities:

  • traceAI auto-instrumentation for LangChain, LlamaIndex, OpenAI, Anthropic, Gemini, AWS Bedrock, Google ADK, CrewAI, AutoGen, and more.
  • 100+ cloud Turing evaluators (turing_flash 1 to 2s, turing_small 2 to 3s, turing_large 3 to 5s) plus 76+ local heuristics.
  • Span-attached evaluation via enable_auto_enrichment() (docs).
  • Agent Command Center for BYOK LLM routing with built-in observability (/platform/monitor/command-center).
  • Persona-driven multi-turn simulation through agent-simulate.
  • Bayesian and ProTeGi prompt optimization through agent-opt.

License. traceAI and ai-evaluation are both Apache 2.0.

Quick start:

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_langchain import LangChainInstrumentor
from fi.evals import evaluate
from fi.evals.otel import enable_auto_enrichment

tracer_provider = register(project_name="rag_demo", project_type=ProjectType.OBSERVE)
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
enable_auto_enrichment()

# Inside any LangChain chain or LangGraph node:
r = evaluate("groundedness", output=response, context=retrieved_context, model="turing_flash")
# Score, reason, latency_ms are now attributes on the active span

Best for. Teams that want one platform for tracing, evaluation, simulation, and prompt optimization.

2. Langfuse: Open-source observability with strong prompt versioning

What it is. Langfuse is an open-source LLM observability and prompt-management platform with strong trace UI, prompt versioning, dataset management, and a pluggable eval framework.

Strengths. MIT-licensed self-hostable; clean trace UI; first-class prompt versioning; OpenTelemetry support; integrates with LangChain, LlamaIndex, and many providers.

Considerations. Eval is more lightweight than Future AGI’s; you typically pair Langfuse with an external evaluator for hallucination, groundedness, and rubric metrics. Production deployment requires running Postgres and a worker; managed cloud available.

Best for. Teams that want self-hosted observability with strong prompt-versioning workflows and are willing to bring their own eval layer.

3. Arize Phoenix and Arize AX

What it is. Phoenix is Arize’s open-source LLM observability and eval toolkit (Apache 2.0); Arize AX is the commercial production-grade platform on top.

Strengths. Built on OpenInference, the predecessor and now sibling spec to OTel GenAI conventions; strong eval templates; Phoenix runs locally for dev, AX runs at production scale. Native integrations with LlamaIndex, LangChain, OpenAI, and most major frameworks.

Considerations. Two products (Phoenix for dev / OSS, AX for production) mean a split workflow if you want both. Phoenix's hosted UI is excellent for development, but you graduate to AX for production drift monitoring and dashboards.

Best for. Teams already on Arize or those who want a strong OSS + commercial pairing with a well-documented eval template library.

4. LangSmith

What it is. LangSmith is LangChain’s hosted observability and eval platform.

Strengths. Tightest possible integration with LangChain and LangGraph; production-grade trace UI; built-in dataset and experiment management; prompt hub and versioning. The natural choice if your stack is LangChain-first.

Considerations. Closed-source and hosted-only; less mature for non-LangChain stacks; pricing reflects LangChain’s premium positioning.

Best for. Teams that build heavily on LangChain or LangGraph and want the most integrated observability path with one vendor.

5. Braintrust

What it is. Braintrust is an eval-first observability platform with strong dataset and experiment-tracking workflows. Production trace viewing is increasingly first-class.

Strengths. Excellent eval experiment ergonomics; offline + online unified; tight workflow for running ad-hoc experiments against datasets.

Considerations. The product started eval-first, so production observability features are newer; pricing is enterprise-tilted.

Best for. Teams whose primary motion is offline eval with golden datasets and who want production tracing layered on top.

Comparison Table

| Platform | License (core) | Auto-instrumentation | Span-attached eval | Multi-agent traces | Prompt opt | Simulation |
| --- | --- | --- | --- | --- | --- | --- |
| Future AGI | Apache 2.0 (SDK) | Yes (traceAI) | Yes (auto-enrichment) | Yes | Yes (agent-opt) | Yes (agent-simulate) |
| Langfuse | MIT | Yes (OTel) | Via plugin | Yes | Via dataset | No |
| Phoenix / AX | Apache 2.0 / Closed | Yes (OpenInference) | Yes | Yes | No | No |
| LangSmith | Closed | Yes (LC native) | Yes | Yes | No | Limited |
| Braintrust | Closed | Yes | Yes | Yes | No | No |

Best Practices for LLM Observability in 2026

Instrument before you build. Add register() and the instrumentor call before instantiating any chain, workflow, or agent. Late instrumentation produces an incomplete span tree.

Use OpenTelemetry GenAI semantic conventions. Any instrumentation library worth using emits gen_ai.* attributes by default. If you write custom spans, use the same attribute names so dashboards work across backends.
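
For hand-written spans, that means setting the same gen_ai.* keys that auto-instrumentation emits. A minimal sketch with the plain OpenTelemetry API follows; the attribute names track the GenAI conventions, but check the current spec before relying on any specific key, and the values here are placeholders.

from opentelemetry import trace

tracer = trace.get_tracer("my_app")

# A hand-written LLM-call span that mirrors what auto-instrumentation emits.
with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
    span.set_attribute("gen_ai.request.temperature", 0.2)
    # ... call the provider here, then record usage from its response ...
    span.set_attribute("gen_ai.usage.input_tokens", 412)
    span.set_attribute("gen_ai.usage.output_tokens", 87)
    span.set_attribute("gen_ai.response.finish_reasons", ["stop"])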

Attach evaluation scores to spans, not to a separate database. Run evaluate() inside the workflow step that produced the output. One observability UI then shows trace + score + prompt + cost + latency together. Decouple to a separate eval store only if you have a specific reason.

Sample for online eval; don’t run on every request. Cloud LLM-judge evaluators add 1 to 5 seconds per score; small-model classifiers add 100 to 200ms. Sample 5 to 10 percent of production traffic for online eval, run 100 percent in dev, and run a deeper offline eval on a curated dataset in CI.
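
A sampling gate can be as small as a few lines. This sketch reuses the evaluate() call from the quick start above; the 5 percent rate and the maybe_evaluate function name are illustrative.

import random
from fi.evals import evaluate

ONLINE_EVAL_SAMPLE_RATE = 0.05  # 5 percent of production traffic; tune per route

def maybe_evaluate(response, retrieved_context):
    # Only a sampled fraction of requests pays the 1-5 s LLM-judge cost;
    # the rest skip online eval and rely on offline eval in CI.
    if random.random() < ONLINE_EVAL_SAMPLE_RATE:
        evaluate("groundedness", output=response, context=retrieved_context, model="turing_flash")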

Alert on five signals. Eval-score regression, p95 latency, cost per session, error rate, retrieval recall. All five should be span attribute roll-ups, not separate metrics pipelines.
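
A sketch of what such a roll-up can look like, assuming span records for one alert window have already been pulled from your backend into plain dicts; the keys and thresholds are illustrative, not a fixed schema.

import statistics

def check_window(spans):
    # spans: span records for the last alert window, already filtered by route.
    alerts = []
    latencies = [s["latency_ms"] for s in spans if "latency_ms" in s]
    scores = [s["eval.groundedness"] for s in spans if "eval.groundedness" in s]
    errors = sum(1 for s in spans if s.get("status") == "error")

    if len(latencies) >= 2 and statistics.quantiles(latencies, n=20)[18] > 8000:
        alerts.append("p95_latency")                      # p95 above 8 s
    if scores and statistics.fmean(scores) < 0.80:
        alerts.append("groundedness_regression")          # eval-score regression
    if spans and errors / len(spans) > 0.02:
        alerts.append("error_rate")                       # more than 2 percent errors
    return alerts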

Version prompts and models as span attributes. When a regression appears in eval scores, you should be able to filter by prompt.version and gen_ai.request.model to find the change that caused it.
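
Setting those attributes is one line each on the active span. This sketch uses the plain OpenTelemetry API; prompt.version is the key named above rather than an official convention, so use whatever key your dashboards filter on.

from opentelemetry import trace

# Record the prompt variant and requested model on the active span so a later
# eval-score regression can be filtered down to the change that caused it.
span = trace.get_current_span()
span.set_attribute("prompt.version", "support_answer_v14")   # illustrative version id
span.set_attribute("gen_ai.request.model", "gpt-4o-mini")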

Centralise traces. Many tools and frameworks emit traces; one observability UI should ingest all of them. Use OTLP export from every framework so traces land in one place.
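
If you are not using a helper like register() (which does the equivalent wiring for you), the plain OpenTelemetry SDK setup looks like this; the collector endpoint is a placeholder for your own backend.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Every instrumented framework in the process exports through the same OTLP
# endpoint, so all traces land in one backend.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://collector.example.com/v1/traces"))
)
trace.set_tracer_provider(provider)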

Common Pitfalls

  1. Logging instead of tracing. Plain logs of “user_id=X prompt=Y completion=Z” do not give you a trace tree, parent-child relationships, or span-level scoring. Use OpenTelemetry instrumentation.
  2. Running online evaluation on 100 percent of production traffic. Cloud LLM-judge eval is too expensive and too slow to run on every request. Sample.
  3. Forgetting to call enable_auto_enrichment(). Without it, evaluate() results do not attach to spans, and you end up with eval scores in one database and traces in another.
  4. Splitting one request across many traces. A single user request should produce one trace; if you see 50 traces per request, your context propagation is broken across async boundaries (see the sketch after this list).
  5. Trusting drift dashboards without a ground-truth dataset. Production traces tell you what is happening; only a curated eval dataset tells you whether it is right.
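
A common cause of pitfall 4 is handing work to a thread pool or background worker without carrying the trace context along. Below is a minimal sketch of explicit propagation with the OpenTelemetry context API; the span and function names are illustrative.

from concurrent.futures import ThreadPoolExecutor
from opentelemetry import context, trace

tracer = trace.get_tracer("rag_demo")

def run_in_worker(fn, *args):
    # Capture the caller's trace context and re-attach it inside the worker
    # thread, so the worker's spans stay children of the request's trace
    # instead of starting new traces.
    parent_ctx = context.get_current()

    def wrapped():
        token = context.attach(parent_ctx)
        try:
            with tracer.start_as_current_span("worker_step"):
                return fn(*args)
        finally:
            context.detach(token)

    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(wrapped).result()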

How to Get Started in One Hour

  1. pip install traceai-langchain ai-evaluation (or the equivalent for your framework).
  2. Set FI_API_KEY and FI_SECRET_KEY from app.futureagi.com.
  3. Add three lines of code (plus imports):
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_langchain import LangChainInstrumentor
from fi.evals.otel import enable_auto_enrichment

tracer_provider = register(project_name="my_app", project_type=ProjectType.OBSERVE)
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
enable_auto_enrichment()
  4. Open Observe in your Future AGI project. Run your app once. The trace tree appears with every step.
  5. Add evaluate("groundedness", output=..., context=...) inside the RAG step. The score attaches to the span automatically.

That gives you end-to-end LLM observability with span-attached evaluation in under an hour.

Conclusion

LLM observability in 2026 is the difference between an LLM application that you can debug, improve, and trust in production and one that runs blind. The 2026 maturity bar is OpenTelemetry-native auto-instrumentation, span-attached evaluation scores, multi-agent trace propagation, and one UI that shows trace + score + prompt + cost + latency together. Future AGI is the platform that ships all four in one stack with Apache 2.0 instrumentation and a unified eval API; Langfuse, Phoenix, LangSmith, and Braintrust each cover useful subsets.

If you are starting from scratch, instrument with traceAI and add evaluate() calls inside the spans where outputs are produced. If you already have logs and metrics but no trace tree, that is the first thing to add. From there, the rest of the observability story falls out of the data you are already capturing.

Get started with Future AGI | traceAI on GitHub | evaluate platform


Frequently asked questions

What is LLM observability in one sentence?
LLM observability is the practice of making the runtime behavior of an LLM application (prompts, retrieved context, tool calls, model outputs, latency, cost, evaluation scores) visible, queryable, and alertable so engineering teams can debug, improve, and trust the system in production. It extends classic three-pillar observability (logs, metrics, traces) with four AI-specific dimensions: evaluation scores attached to spans, prompt and model versions, token and cost telemetry, and retrieval and tool-call structure.
What is the difference between LLM observability and LLM monitoring?
Monitoring is alerting on known-bad signals like high error rate, elevated latency, or breached cost ceilings. Observability is the deeper capability to ask new questions about an unfamiliar failure mode, like which retrieval shard caused yesterday's hallucination spike or which tool call introduced a regression. Monitoring tells you something is wrong; observability tells you why. In practice the two run on the same data: traces and spans, evaluation scores, and metrics. Most platforms in 2026 ship both surfaces in one product.
Why is Future AGI ranked #1 in this 2026 list?
Future AGI is the only platform that ships a unified eval API (100+ Turing-cloud templates plus 76+ local heuristics through fi.evals.evaluate()), OpenTelemetry-native auto-instrumentation (traceAI for LangChain, LlamaIndex, OpenAI, Anthropic, Gemini, AWS Bedrock, Google ADK, CrewAI, AutoGen), and built-in persona-driven simulation plus Bayesian prompt optimization in one stack. The other platforms ship subsets: Langfuse leans observability + prompt management, LangSmith leans LangChain-native, Braintrust leans eval-experiment, Phoenix leans OSS tracing. Future AGI's loop (instrument, evaluate, simulate, optimize) closes the path from a production failure to a versioned prompt fix without leaving one product.
What are the OpenTelemetry GenAI semantic conventions and why do they matter?
They are the standard span attribute names for LLM telemetry (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.response.finish_reason, etc.) that stabilised in late 2025. They matter because they make traces vendor-portable; the same span emitted by traceAI is readable by Datadog, Honeycomb, Grafana Tempo, Phoenix, or any OpenTelemetry-compatible backend. Before the conventions stabilised, every vendor used proprietary attribute names and you were locked in. In 2026 your tracing layer should emit OTel GenAI by default.
What should you alert on for an LLM application in production?
Five core signals. First, evaluation-score regression: groundedness, hallucination rate, or task success dropping below threshold on a rolling window. Second, latency: p95 end-to-end inference and per-tool-call latency. Third, cost: cost per session, per user, per route, with alerts on per-route budget breaches. Fourth, error rate: timeouts, malformed responses, tool failures. Fifth, retrieval-quality regression: recall@k or context_relevance dropping on a known eval set. The 2026 maturity bar is that all five signals roll up from span attributes, so alerts use the same data as debugging traces.
How do you instrument an existing LangChain or LlamaIndex application?
Two lines per framework. For LangChain: pip install traceai-langchain ai-evaluation, then call register() and LangChainInstrumentor().instrument(tracer_provider=...) before instantiating chains. For LlamaIndex: pip install traceai-llama-index, then LlamaIndexInstrumentor().instrument(...). After that every chain step, retriever call, tool call, and LLM completion becomes an OpenTelemetry span automatically with parent-child relationships preserved. Add enable_auto_enrichment() once and any fi.evals.evaluate() call inside an active span attaches its score to that span.
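
As a sketch of the LlamaIndex path just described (the import module name is assumed to mirror the traceai-langchain package naming; check the traceAI docs for the exact path):

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_llama_index import LlamaIndexInstrumentor  # module name assumed to mirror traceai_langchain

tracer_provider = register(project_name="docs_qa", project_type=ProjectType.OBSERVE)
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)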
Does observability add meaningful latency to LLM production traffic?
Auto-instrumentation through OpenTelemetry adds microseconds per span in the hot path; the visible cost is the asynchronous batch export of spans to your observability backend, which runs out-of-band on a background thread. Evaluation scoring is the operation that can add meaningful latency if you run it synchronously in-line. The 2026 pattern is to sample 5 to 10 percent of production traffic for online eval and run the rest in a background queue or batch job, so the user-visible response path is not blocked on evaluation latency.
Can you self-host an LLM observability platform?
Yes. Phoenix (Apache 2.0), Langfuse (MIT), and traceAI's instrumentation libraries (Apache 2.0) are all self-hostable. Future AGI's traceAI ships open-source instrumentation that exports OTLP, so the spans can land in any self-hosted backend (Grafana Tempo, Jaeger, Phoenix, Honeycomb on-prem). The trade-off is that the managed Future AGI backend includes the Turing-cloud evaluators, span-attached enrichment, simulation, and prompt optimization features that take significant engineering effort to recreate. Most teams self-host instrumentation and use the managed backend for everything above the trace layer.