
What is Error Analysis for LLMs? Cluster, Label, Prioritize in 2026

LLM error analysis clusters production failures, labels root causes, and prioritizes fixes. The workflow, the embeddings, and the tools teams use in 2026.

June 17, 2025



A team’s support agent misses 9% of refund queries. Customer feedback is climbing. The on-call engineer reads twenty failed traces and gets a vague sense that “the agent is sometimes wrong.” Two weeks later the team has rewritten three prompts, added a new retrieval index, and tightened the JSON schema. The 9% miss rate is now 8.4%. Nobody can articulate why.

The team that does error analysis instead samples 500 failed traces, clusters them, and labels each cluster with a root cause. The clusters reveal that 41% of failures fall in one bucket: the agent calls the refund tool with the user’s account id instead of the order id, because the prompt instructs “look up the order” without specifying which id to pass. Two-line prompt fix. 41% of failures gone. The remaining clusters are smaller and budgeted against the quarterly plan.

This is what error analysis is for. The first team was guessing. The second team had a ranked list of root causes and acted on the largest. This piece walks through what LLM error analysis is, the five-step workflow, the failure clusters that recur across workloads, and how the practice integrates with eval-driven development and the rest of the LLMOps stack.

TL;DR: Cluster failures, label root causes, prioritize fixes

LLM error analysis is the workflow that turns failed traces into a prioritized list of root causes. You sample failures, embed the inputs and outputs, cluster the embeddings, open each cluster and label the root cause, count cluster size and severity, and pick the top 1-3 clusters to fix. The output is a ranked list, not a vague sense of unease. Each labeled cluster also becomes a candidate for the eval suite, closing the loop between production failures and the offline test set that prevents regressions.

If you only read one paragraph: a team that fixes the wrong cluster ships engineering work that does not move the user-visible quality bar. Error analysis names which cluster matters first.

Why error analysis matters in 2026

Three forces converged.

First, LLM failure modes are different from traditional software failures. A null-pointer exception is a single, traceable bug. An LLM hallucination is a probabilistic distribution of failures across many inputs, with overlapping root causes (prompt vagueness, retrieval miss, model bias, tool definition ambiguity). A team that treats hallucinations as bug-by-bug debugging burns out before reaching root cause.

Second, the production failure stream is the highest-quality source of new eval cases. Synthetic data fills gaps; production failures fill the gaps you did not know existed. Without a workflow that surfaces and labels production failures, the eval suite stays stuck on the cases the team thought of at launch.

Third, fixing the wrong thing has a high opportunity cost. Engineering time spent on the third-largest cluster while the largest cluster grows is engineering time the workload does not get back. Prioritization on actual cluster sizes beats prioritization on the engineer’s gut feel about what is broken.

Tools like traceAI capture LLM calls, prompts, completions, tokens, and tool calls as OTel-native spans. Prompt versions and eval scores attach via custom attributes from the platform layer. That tagging is what makes downstream clustering tractable; without it, you are clustering across a fragmented log file.

The five-step error analysis workflow

The shape mirrors classical ML error analysis. The novelty is the failure types and the embedding choices.

1. Sample failed traces

Define what counts as a failure. The objective filters that produce a clean failure set:

Eval score below threshold. If you have an online judge attaching scores to spans, traces with groundedness under 0.7 (or whatever your threshold is) are a clean filter.
Error status. Spans tagged with status=error (tool-call failure, schema validation failure, gateway timeout).
Low user feedback. Thumbs-down, complaint flag, escalation-to-human signals.
Agent loop signature. Traces with step count over a budget, repeated tool calls with same arguments, no-progress detection firing.
Cost outliers. The top 1% of traces by token cost; expensive traces often correlate with stuck agents.

Pull a tractable batch. 200-1000 traces is the sweet spot: large enough to cluster meaningfully, small enough to label by hand.
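
The filters above can be sketched as a small function over flattened trace records. This is a minimal illustration, not a platform API: the `Trace` fields, the step budget of 12, and the cost cutoff are assumptions made for the example; only the 0.7 groundedness threshold comes from the text.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    """Hypothetical flattened trace record; field names are assumptions."""
    trace_id: str
    groundedness: Optional[float] = None  # online judge score, if one is attached
    status: str = "ok"                    # "ok" or "error"
    thumbs_down: bool = False
    step_count: int = 0
    token_cost: float = 0.0


def failure_reasons(trace, score_threshold=0.7, step_budget=12, cost_cutoff=None):
    """Return the objective filters this trace trips (empty list = not a failure)."""
    reasons = []
    if trace.groundedness is not None and trace.groundedness < score_threshold:
        reasons.append("low_eval_score")
    if trace.status == "error":
        reasons.append("error_status")
    if trace.thumbs_down:
        reasons.append("negative_feedback")
    if trace.step_count > step_budget:
        reasons.append("step_budget_exceeded")
    if cost_cutoff is not None and trace.token_cost >= cost_cutoff:
        reasons.append("cost_outlier")
    return reasons


def sample_failures(traces, limit=500, seed=0, **thresholds):
    """Keep traces that trip at least one filter; subsample to a hand-labelable batch."""
    failed = [(t, r) for t in traces if (r := failure_reasons(t, **thresholds))]
    if len(failed) <= limit:
        return failed
    return random.Random(seed).sample(failed, limit)


traces = [
    Trace("t1", groundedness=0.55),
    Trace("t2", groundedness=0.92),
    Trace("t3", status="error", thumbs_down=True),
    Trace("t4", step_count=30),
]
batch = sample_failures(traces)
```

Keeping the tripped filters alongside each trace is a design choice: the reasons become useful metadata when you later read a cluster and want to know why its traces were sampled.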

2. Embed and cluster

Embed the traces. Three approaches:

Embed the user input. Good for finding distribution shift (a new query type the workload was not designed for) and input-side failure modes.
Embed the output. Good for finding output-side patterns (a recurring hallucination phrasing, a recurring refusal template, a recurring tool-call shape).
Embed input plus output. Best for mixed failure modes; expensive but covers more ground.

Cluster the embeddings. HDBSCAN is a sensible default: it does not need the cluster count up front, it leaves outliers unassigned as noise, and its default parameters are a reasonable starting point. K-means with silhouette-score-based k selection is the older alternative. Visualize with a 2D projection (UMAP or t-SNE). The visualization is for humans; the clusters are what you label.

Many observability platforms (Future AGI, Phoenix, Galileo, LangSmith) ship a built-in clustering view that handles this automatically. If you are doing it by hand, the embedding model can be small (text-embedding-3-small or a similar OSS model). The clusters are robust to embedding model choice within reason.
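
To make the k-means alternative concrete, here is a dependency-free sketch of silhouette-based k selection. It is a teaching stand-in under stated assumptions: toy 2-D points play the role of embeddings, the k-means is a tiny deterministic version written for this example, and a real pipeline would call a library implementation of HDBSCAN or k-means on real embedding vectors.

```python
import math


def kmeans(points, k, iters=20):
    """Tiny deterministic k-means: farthest-point init, then Lloyd iterations."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center is the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)  # mean intra-cluster distance
        b = min(sum(math.dist(p, q) for q in other) / len(other)  # nearest other cluster
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_k(points, k_range=range(2, 6)):
    """Pick the k whose clustering has the highest mean silhouette."""
    scored = {}
    for k in k_range:
        labels = kmeans(points, k)
        if len(set(labels)) > 1:
            scored[k] = silhouette(points, labels)
    return max(scored, key=scored.get)


# Toy "embeddings": three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (0, 11), (1, 10), (1, 11)]
chosen_k = best_k(points)
labels = kmeans(points, chosen_k)
```

On real embeddings the silhouette curve is usually much flatter than on toy data, which is one reason to treat the chosen k as a starting point for labeling rather than a final answer.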

3. Label clusters

Open each cluster. Read 5-10 traces from the center of the cluster (lowest distance to centroid). Name the root cause.

This is the slow step. A typical session covers 8-15 clusters in 2-3 hours of focused work. The label should be specific enough that a different engineer reading it later understands the failure mode. “Tool-call wrong” is too vague; “refund tool called with account_id instead of order_id” is specific.

A consistent labeling vocabulary helps. Build a taxonomy as you go. Common top-level labels: groundedness miss, refusal over-trigger, refusal under-trigger, tool-call argument mismatch, tool-call wrong tool, agent loop, retrieval miss, schema violation, hallucinated entity. Subtypes refine within these.
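
Picking the traces nearest the cluster center, as described at the start of this step, can be sketched in a few lines. The `(trace_id, embedding)` pair layout and the toy 2-D vectors are assumptions for illustration; with HDBSCAN, whose clusters need not be round, a library-provided exemplar or membership score may be a better ranking than centroid distance.

```python
import math


def central_traces(cluster, n=5):
    """cluster: list of (trace_id, embedding) pairs. Return the n ids nearest the centroid."""
    vectors = [vec for _, vec in cluster]
    centroid = [sum(dim) / len(vectors) for dim in zip(*vectors)]
    ranked = sorted(cluster, key=lambda item: math.dist(item[1], centroid))
    return [trace_id for trace_id, _ in ranked[:n]]


cluster = [
    ("t1", (0.0, 0.0)),
    ("t2", (1.0, 0.0)),
    ("t3", (0.0, 1.0)),
    ("t4", (1.0, 1.0)),
    ("t5", (0.5, 0.6)),
    ("t6", (4.0, 4.0)),  # sits at the edge of the cluster
]
to_read = central_traces(cluster, n=3)
```

Reading from the center outward means the first traces you open are the most typical members, so the label you write describes the bulk of the cluster rather than its fringe.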

4. Count and estimate impact

Two numbers per cluster:

Size. How many traces fall in this cluster. The base estimate of impact.
Severity. How bad each failure is for the user. A 1-5 scale works. A wrong refund amount is severity 5; a slightly verbose answer is severity 1.

Multiply for a priority score. Optionally divide by fix cost (a prompt tweak is cheap; rebuilding retrieval is expensive) to get an ROI-ranked list.
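
The scoring arithmetic is simple enough to sketch directly. The first cluster uses the article's opening example (41% of 500 sampled traces is 205); the other two clusters, the severity values, and the fix-cost unit of engineer-days are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    label: str
    size: int        # traces in the cluster
    severity: int    # 1 (cosmetic) to 5 (for example, a wrong refund)
    fix_cost: float  # rough engineer-days; an assumed unit

    @property
    def priority(self):
        """Size times severity."""
        return self.size * self.severity

    @property
    def roi(self):
        """Priority divided by fix cost."""
        return self.priority / self.fix_cost


clusters = [
    Cluster("refund tool called with account_id instead of order_id", 205, 5, 0.5),
    Cluster("retrieval miss on policy documents", 90, 4, 10.0),
    Cluster("overly verbose answers", 120, 1, 1.0),
]
by_priority = sorted(clusters, key=lambda c: c.priority, reverse=True)
by_roi = sorted(clusters, key=lambda c: c.roi, reverse=True)
```

The two rankings can disagree: here the retrieval cluster is second by priority but last by ROI because its fix is expensive, which is the kind of item the text suggests budgeting against the quarterly plan.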

5. Pick top clusters and fix

Rank the clusters. Pick the top 1-3 by priority. Each cluster maps to a stack layer:

Prompt layer. Vagueness in instructions; missing constraints; conflicting rules.
Tool layer. Tool description ambiguity; wrong tool granularity; missing required arguments.
Retrieval layer. Wrong index; wrong chunk size; missing facets; reranker tuning.
Model layer. A different model id; a different reasoning budget; a different temperature.
Evaluation layer. A rubric that should have caught this and did not.

Each fix should have an eval test that would have caught the failure. Promote a representative trace from the cluster into the eval suite, write the rubric, run the candidate fix against the test, verify pass.
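
As a sketch of what "an eval test that would have caught the failure" can look like for the refund example, the rubric below is a plain deterministic check. The tool name `issue_refund`, the dict shape of a tool call, and the id values are hypothetical; many rubrics in practice are judge-based rather than rule-based.

```python
def refund_uses_order_id(tool_calls):
    """Rubric: at least one refund call exists, and every one passes order_id, not account_id."""
    refund_calls = [c for c in tool_calls if c["name"] == "issue_refund"]
    return bool(refund_calls) and all(
        "order_id" in c["args"] and "account_id" not in c["args"] for c in refund_calls
    )


# Tool calls recorded in the production trace promoted from the cluster (the failure).
failing_trace = [{"name": "issue_refund", "args": {"account_id": "acct_881"}}]
# Tool calls produced by the candidate prompt on the same input.
candidate_run = [{"name": "issue_refund", "args": {"order_id": "ord_1042"}}]

caught_original_failure = not refund_uses_order_id(failing_trace)
candidate_passes = refund_uses_order_id(candidate_run)
```

Checking the rubric against the original failing trace first is deliberate: a test that does not fail on the known failure would not have caught it.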

Common LLM failure clusters

Six recur across workloads. Knowing the canonical signatures speeds up labeling.

Groundedness miss

The output is fluent and confident but unsupported by the provided context. Hallucinated facts, fabricated citations, made-up entity names. Signature in the trace: a retriever span returns chunks that do not contain the answer; the LLM call ignores this and produces an answer anyway.

Common root causes: retriever returns the wrong chunks (retrieval miss); chunks are present but the prompt does not instruct the model to ground in them; chunks are present and instructed but the model is too confident to refuse when context is insufficient.

Refusal over-trigger

The model refuses cases it should handle. The user asks a benign question; the safety layer (or the model’s own refusal calibration) returns a refusal template. Signature: refusal-template detection on the output; the input does not match a known refuse-pattern.

Common root causes: over-eager safety prompt; a guardrail with a too-aggressive pattern match; a model swap to a more conservatively tuned model.

Refusal under-trigger

The model handles cases it should refuse. The user asks a question that violates policy; the model answers anyway. Signature: a policy classifier flags the input; the output is not a refusal.

Common root causes: a missing safety rule in the prompt; a guardrail not wired on the output side; a model that has different refusal behavior than the calibration model.
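
The two refusal signatures above reduce to the same pair of signals read in opposite directions. A minimal sketch, assuming a hand-written pattern list (the phrasings are illustrative, not an exhaustive or recommended set) and a boolean from whatever policy classifier you already run on inputs:

```python
import re

# Illustrative phrasings only; a real list would be mined from your own outputs.
REFUSAL_PATTERNS = [
    r"\bI(?: am|['’]m) (?:sorry|unable)\b",
    r"\bI can(?:no|['’])t (?:help|assist) with\b",
    r"\bas an AI\b",
]
REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def is_refusal(output):
    """True if the output matches any known refusal template."""
    return REFUSAL_RE.search(output) is not None


def over_trigger(output, input_flagged_by_policy):
    """Refusal on an input that the policy classifier did not flag."""
    return is_refusal(output) and not input_flagged_by_policy


def under_trigger(output, input_flagged_by_policy):
    """Non-refusal on an input that the policy classifier flagged."""
    return input_flagged_by_policy and not is_refusal(output)
```

Pattern matching like this is only a cheap first-pass filter for building the failure set; the human label on the cluster still decides whether the refusal was actually wrong.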

Tool-call argument mismatch

The right tool is called with wrong arguments. Signature: tool span shows the tool name; the argument values are wrong (wrong id, wrong type, missing required field).

Common root causes: prompt is vague about which id to pass; tool description does not describe argument semantics; the model conflates two similar arguments.
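
Part of this signature can be checked mechanically before anyone reads a trace. The sketch below compares a tool call's arguments against a declared spec; the `TOOL_SPECS` registry, the tool name, and the argument names are hypothetical. It catches missing fields, wrong types, and unexpected arguments, but not a well-typed value that is semantically wrong.

```python
TOOL_SPECS = {  # hypothetical registry: tool name -> {argument name: expected type}
    "issue_refund": {"order_id": str, "amount_cents": int},
}


def argument_mismatches(tool_name, args, specs=TOOL_SPECS):
    """List the ways a tool call's arguments deviate from the declared spec."""
    spec = specs[tool_name]
    problems = []
    for name, expected in spec.items():
        if name not in args:
            problems.append(f"missing:{name}")
        elif not isinstance(args[name], expected):
            problems.append(f"wrong_type:{name}")
    problems += [f"unexpected:{name}" for name in args if name not in spec]
    return problems


bad_call = {"account_id": "acct_881", "amount_cents": "1999"}
good_call = {"order_id": "ord_1042", "amount_cents": 1999}
found = argument_mismatches("issue_refund", bad_call)
```

The mismatch strings double as sub-labels: a cluster dominated by `missing:order_id` plus `unexpected:account_id` points at the id confusion described in the opening example.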

Tool-call wrong tool

A different tool is called instead of the right one. Signature: tool span shows tool A when the input clearly required tool B.

Common root causes: tool descriptions are too similar; tool granularity is wrong (too many small tools when a coarser tool would suffice); the prompt does not clarify when each tool is appropriate.

Agent loop

The agent retries without progress. Signature: many spans, repeated tool calls with same arguments, no-progress detection or step-budget exhaustion.

Common root causes: termination heuristics missing; tool returns are too verbose for the agent to extract the answer; the planner produced a plan the executor cannot complete.
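
The loop signature can be detected from the ordered tool calls of a single trace. A minimal sketch, assuming each call is a dict with `name` and `args`; the step budget of 12 and the repeat limit of 3 are placeholder values, and the tool names are hypothetical.

```python
import json
from collections import Counter


def loop_signals(tool_calls, step_budget=12, repeat_limit=3):
    """tool_calls: ordered list of {"name": str, "args": dict} from one trace."""
    # Canonicalize arguments so key order does not hide an identical call.
    keys = [(c["name"], json.dumps(c["args"], sort_keys=True)) for c in tool_calls]
    signals = []
    if len(tool_calls) > step_budget:
        signals.append("step_budget_exceeded")
    if any(n >= repeat_limit for n in Counter(keys).values()):
        signals.append("repeated_identical_call")
    return signals


calls = [
    {"name": "search_orders", "args": {"q": "refund", "page": 1}},
    {"name": "search_orders", "args": {"q": "refund", "page": 1}},
    {"name": "search_orders", "args": {"page": 1, "q": "refund"}},  # same call, keys reordered
    {"name": "get_order", "args": {"order_id": "ord_1042"}},
]
signals = loop_signals(calls)
```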

How error analysis integrates with the rest of the LLMOps stack

Error analysis is not a standalone activity. It composes with four other primitives.

Tracing

Error analysis runs on traces. Without span-level structure, you cannot filter, embed, or cluster meaningfully. Every span should be tagged with its input, output, eval scores, prompt version, tool calls, and retriever queries. See What is LLM Tracing? for the schema discipline this depends on.

Eval-driven development

Each labeled cluster becomes a candidate for the eval suite. Promote 1-3 representative traces from the cluster, write the rubric that detects the failure, and the next prompt iteration must clear the test. See Eval-Driven Development.

Drift detection

Online evaluators on production spans tag scores and feed a drift detector that watches rolling-mean rubric scores. When a cluster grows large enough to move the rolling mean, the drift alert fires and triggers an error analysis session.
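
A rolling-mean drift check of the kind described here can be sketched with a fixed-size window. The class name, the baseline, the tolerance, and the window size are all assumptions for illustration; a production detector would likely add per-rubric windows and some smoothing against short bursts.

```python
from collections import deque


class RollingMeanDrift:
    """Fires when the rolling mean of a rubric score drops below baseline minus tolerance."""

    def __init__(self, baseline, tolerance=0.05, window=200):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # oldest score drops out automatically

    def observe(self, score):
        """Add one span score; return True if the alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # wait for a full window before alerting
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance


detector = RollingMeanDrift(baseline=0.9, tolerance=0.05, window=4)
alerts = [detector.observe(s) for s in [0.9, 0.9, 0.9, 0.9, 0.6]]
```

In this toy run the alert fires only on the last score, once the low value pulls the four-score mean under the 0.85 floor; that is the moment the text suggests scheduling an error analysis session.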

Prompt versioning

The fix is a new prompt version. The eval pass-rate vector reflects whether the cluster has been addressed. Per-version rolling scores in production confirm whether the fix held. See Prompt Versioning.

Tools that support error analysis in 2026

Five viable patterns:

Future AGI. Apache 2.0 stack with built-in failure clustering on traces, embedding-based grouping, and a UI that promotes labeled clusters into eval candidates.
Phoenix. OSS, OTLP-first, ships clustering and embeddings UI for trace exploration; integrates with the broader Arize ecosystem.
Galileo. Closed platform with first-class error analysis on agent traces; data error potential, drift detection, and clustering.
LangSmith. Closed platform, ships “annotation queues” for routing failed traces to labelers; integrates with LangSmith eval.
Manual pipeline. Python plus pandas plus an embedding API plus HDBSCAN plus a Jupyter notebook. Works for small-scale analysis; impractical past 10K traces per session.

The choice depends on where the traces already live, how much labeling automation you want, and whether the platform integrates with your eval suite for promotion of clusters into test cases.

Common mistakes when doing error analysis

Sampling without a filter. Sampling random traces (passes and failures mixed) wastes labeling time. Filter to failures first.
Reading individual traces without clustering. A single trace tells one story. The cluster tells the recurring story.
Vague labels. “Bad answer” is not a label. “Tool called with wrong id” is.
Inconsistent labels across analysts. Two engineers labeling the same cluster differently makes the priority math wrong. Maintain a shared taxonomy.
Not feeding clusters into the eval suite. A cluster you fix once will recur if no test prevents it.
Fixing the wrong cluster first. Engineering time on the third-largest cluster while the largest grows is engineering time wasted. Rank by impact, not by what looks interesting.
Skipping the cadence. Error analysis once at launch is not enough. Schedule a recurring session: weekly for fast-iterating workloads, monthly for steady-state.
Treating clustering output as final. HDBSCAN produces clusters; the human labels them. The cluster boundaries are suggestions, not gospel.

What changed in error analysis in 2026

A few inflection points reshaped the practice.

Auto-clustering on traces. Platforms ship clustering as a first-class view on traces, eliminating the manual embedding pipeline.
Failure-to-eval pipelines. Tools that promote labeled clusters into eval suites in one click closed the loop between production failures and offline tests.
Distilled judges as failure detectors. Cheap judges can score every span, surfacing the failure population that error analysis works on.
OTel GenAI conventions. Standard span attributes mean the clustering pipeline does not need to know which framework produced the trace.

The future: where error analysis is heading

LLM-assisted root cause labeling. The labeler reads 5 traces; the LLM proposes a label. The labeler accepts, edits, or rejects. This can materially reduce labeling time when analysts accept the suggested labels.

Cross-version cluster diffs. When prompt v18 ships, the system compares cluster distribution against prompt v17 and surfaces which clusters grew. Automatic regression-cluster detection.
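
One simple way to think about a cluster diff is as a comparison of each label's share of failures between two prompt versions. This is a hedged sketch of the idea, not how any particular platform implements it; the label lists and the 5-point growth threshold are made up for the example.

```python
from collections import Counter


def cluster_share(labels):
    """Map each cluster label to its fraction of the labeled failures."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def grown_clusters(old_labels, new_labels, min_increase=0.05):
    """Return {label: share increase} for clusters whose share of failures grew."""
    old, new = cluster_share(old_labels), cluster_share(new_labels)
    growth = {label: new[label] - old.get(label, 0.0) for label in new}
    return {label: round(delta, 3) for label, delta in growth.items() if delta >= min_increase}


v17 = ["tool_arg_mismatch"] * 6 + ["agent_loop"] * 2 + ["retrieval_miss"] * 2
v18 = (["tool_arg_mismatch"] * 2 + ["agent_loop"] * 5
       + ["retrieval_miss"] * 2 + ["schema_violation"] * 1)
regressions = grown_clusters(v17, v18)
```

Shares are relative to the failure set, so a cluster can grow in share simply because another one was fixed; comparing absolute failure rates per version alongside the shares avoids that misreading.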

Cluster-to-fix recommendations. Tools that read the cluster, the trace internals, and the prompt history and suggest specific prompt or retrieval fixes. Still early; check the recommendations before applying.

Per-tenant cluster surfacing. Enterprise customers ship their own failure modes. Per-tenant cluster views give the customer success team a workload-specific failure surface.

Adversarial cluster mining. Auto-generation of adversarial test cases targeted at the largest production clusters, fed back into the eval suite as proactive coverage.

The throughline: error analysis stops being a periodic manual exercise and becomes a continuous loop that connects production failures to offline tests to prompt iterations to deployment. Teams that close this loop ship faster and stay ahead of the failure modes that quietly accumulate.

How to use this with Future AGI

Future AGI is the production-grade error analysis stack. The platform ships built-in failure clustering on traces with embedding-based grouping, a UI that promotes labeled clusters into eval candidates, span-attached scoring that surfaces low-eval traces for clustering, annotation queues for cluster labeling, and dataset versioning so today’s clusters become tomorrow’s CI tests. turing_flash runs guardrail screening at 50 to 70 ms p95 across production traffic; full eval templates run at about 1 to 2 seconds for offline scoring of mined clusters.

The Agent Command Center is where cluster surfacing, per-tenant cluster views, and cluster-to-eval promotion live. The same plane carries 50+ eval metrics, persona-driven simulation that auto-generates adversarial cases targeted at production clusters, the BYOK gateway across 100+ providers, 18+ guardrails, and Apache 2.0 traceAI instrumentation on one self-hostable surface. Pricing starts free with a 50 GB tracing tier; Boost ($250/mo), Scale ($750/mo), and Enterprise ($2,000/mo with SOC 2 and HIPAA BAA) cover the maturity ladder.


Frequently asked questions

What is error analysis for LLMs in plain terms?

Error analysis is the workflow that turns a heap of LLM failures into a ranked list of root causes. You sample failed traces, cluster them by failure mode, label each cluster with a root cause, count the size of each cluster, and prioritize fixes by cluster size and severity. The pattern is the same shape as classical machine learning error analysis, adapted for the failure modes specific to LLMs: hallucination, refusal mis-calibration, tool-call wrongness, agent looping, retrieval misses.

Why does error analysis matter in 2026?

Three reasons. First, LLM workloads have failure modes that no log message names. A 9% groundedness regression has no stack trace; you find it by inspecting failed traces and noticing a pattern. Second, fixing the wrong thing wastes engineering time. A team that spends two weeks improving the retrieval pipeline when the actual failure cluster is tool-call argument mismatches ships zero quality improvement. Third, production failure clusters are the highest-value source of new eval cases.

What is the error analysis workflow?

Five steps. (1) Sample failed traces by an objective filter (eval score below threshold, error status, low user feedback). (2) Embed the inputs and outputs and cluster the embeddings. (3) Open each cluster, read 5-10 traces, label the root cause. (4) Count the cluster size and estimate the user impact. (5) Rank by impact and pick the top 1-3 clusters to fix. The output is a prioritized list, not a vague sense of unease.

How do I cluster LLM failures?

Three approaches. Embed the user input and cluster (good for input-side failure modes like distribution shift). Embed the output and cluster (good for output-side failure modes like hallucination patterns). Embed input plus output plus failure category and cluster (best for mixed failure modes). HDBSCAN or k-means with silhouette score for k selection works well. Many platforms ship a built-in clustering view that handles this automatically.

What are the most common LLM failure clusters?

Six recur across most workloads. Groundedness misses (the answer is fluent but unsupported by context). Refusal over-trigger (the model refuses cases it should handle). Refusal under-trigger (the model handles cases it should refuse). Tool-call wrongness (right tool, wrong arguments; or wrong tool entirely). Agent looping (the agent retries without progress). Retrieval misses (the retriever returns irrelevant chunks). Each has a characteristic signature in the trace that makes it diagnosable once you know the pattern.

How do I prioritize which cluster to fix first?

Three signals. Cluster size: how many traces fall in this cluster. Severity: how bad each failure is for the user (a wrong refund amount is worse than a slightly verbose answer). Cost: how expensive the cluster is to fix (a prompt tweak is cheap; rebuilding the retrieval pipeline is expensive). Multiply size by severity for the priority score, then divide by fix cost to rank by return on effort. Fix the high-impact, low-cost clusters first; budget the expensive fixes against quarterly objectives.

How does error analysis integrate with eval-driven development?

Tightly. Each labeled failure cluster becomes a candidate for the eval suite. You promote one or more representative failures from the cluster into the test set, write the rubric that detects the failure, and the next prompt iteration must clear the test before shipping. Error analysis is the front-end that produces eval cases; the eval suite is the back-end that prevents the regression from recurring.

What does error analysis cost in operational complexity?

At minimum: a way to filter failed traces, an embedding pipeline, a clustering library, and a UI that shows clusters with example traces. The harder cost is discipline: scheduling regular error-analysis sessions, labeling clusters consistently across analysts, and feeding the labeled clusters back into the eval suite. Tools that automate the clustering and the integration with the eval suite save the most time. Doing this manually past 10K production traces is impractical.