
Error Analysis for LLM Applications: 2026 Workflow Guide

A 2026 error analysis workflow for LLM apps. Cluster failure cases, label root causes, prioritize fixes. Concrete dataset, code, and rubrics that ship.

6 min read
error-analysis llm-evaluation llm-observability failure-analysis regression-testing best-practices comprehensive-guide 2026

A team running a customer-support agent at a fintech ships a prompt update on a Wednesday. By Monday, the escalation rate is up 14 percent and the customer success team is angry. The team’s eval dashboard says aggregate pass rate dropped from 91 percent to 87 percent; that’s all the data they have. Three days of debugging later, they find it: a small wording change in the system prompt caused the agent to refuse 4 out of 10 valid refund requests on accounts older than 24 months. The fix was one line; the cost was three days, because nobody had clustered the failures. Error analysis is the workflow that turns “we regressed” into “the >24-month-account refund cluster regressed by 18 percent on prompt v23 and you should look at the refusal logic.”

This guide walks through a practical seven-step error analysis workflow for production LLM teams, with code that works against a typical trace store and a reproducible structure for your own dataset.

TL;DR: Error analysis in seven steps

| Step | Output | Tools |
| --- | --- | --- |
| 1. Pull failed traces | List of trace_ids | trace store query |
| 2. Embed and cluster | Cluster id per trace | OpenAI embeddings + HDBSCAN |
| 3. Label root causes | Failure-mode tag per cluster | hand-label + LLM-judge assist |
| 4. Prioritize by impact | Top 3 to fix | spreadsheet |
| 5. Build regression set | Per-cluster dataset | jsonl |
| 6. CI gate per cluster | Per-cluster pass rate | pytest |
| 7. Re-run weekly | Fresh prioritization | scheduled job |

If you only read one row: the per-cluster pass rate is the unit of measurement that engineers actually want to see. Aggregate scores are lagging indicators.

Figure: Error cluster map. Failure-case bubbles plotted by root-cause category (retrieval miss, prompt drift, tool error, hallucination, refusal mismatch) against severity; bubble size encodes failure count, with the largest cluster (retrieval miss) highlighted.

Step 1: Pull failed traces

The signal sources for “failed”:

  • Judge below threshold. Span-attached judge score below the per-rubric threshold (groundedness, refusal calibration, tool-call accuracy).
  • User thumbs-down. Explicit feedback signal joined to the trace.
  • Escalation. User escalated to a human or the agent self-escalated.
  • Abandonment. Session ended before task completion.
  • Schema violation. Structured output didn’t validate.
  • Tool error. A tool call returned an error.
  • Refusal mismatch. The refusal-calibration judge flagged that the agent refused a valid request or accepted an invalid one.

Pull 500-2,000 failed traces per workflow per week. Below 500, clusters are noisy. Above 2,000, triage time outpaces the value.

from datetime import datetime, timedelta
import os
from clickhouse_connect import get_client  # or whichever store you use

client = get_client(host=os.environ["CLICKHOUSE_HOST"])

def pull_failed_traces(workflow: str, days: int = 7) -> list[dict]:
    since = datetime.utcnow() - timedelta(days=days)
    # Use parameterized queries; never f-string user input into SQL.
    sql = """
        SELECT trace_id, query, response, judge_score, user_feedback, escalated
        FROM spans
        WHERE workflow = %(workflow)s
          AND timestamp >= %(since)s
          AND (judge_score < 0.7 OR user_feedback = 'down' OR escalated = true)
        LIMIT 2000
    """
    result = client.query(sql, parameters={"workflow": workflow, "since": since})
    # result_rows are tuples; zip with column names so downstream steps get dicts.
    return [dict(zip(result.column_names, row)) for row in result.result_rows]

Step 2: Embed and cluster

Embed each failure case as a single string built from the user query, the agent response, and, if you have one, a trace summary. Cluster with HDBSCAN (it handles variable cluster sizes and gives you a noise bucket) or KMeans with a chosen K (5-15 is typical).

from openai import OpenAI
from sklearn.cluster import HDBSCAN
import numpy as np

oai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def cluster_failures(rows: list[dict]) -> dict:
    texts = [f"Query: {r['query']}\nResponse: {r['response']}" for r in rows]
    # Batch-embed
    emb_resp = oai.embeddings.create(model="text-embedding-3-small", input=texts)
    embeddings = np.array([e.embedding for e in emb_resp.data])
    # Cluster
    clusterer = HDBSCAN(min_cluster_size=20, metric="cosine")
    labels = clusterer.fit_predict(embeddings)
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(int(label), []).append(row)
    return clusters

HDBSCAN’s noise bucket (label = -1) is fine; investigate it last. Most production workloads land on 8-15 named clusters plus noise.
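
If you prefer a fixed K, the KMeans path mentioned above is a few lines. A minimal sketch, assuming you reuse the embeddings array computed in cluster_failures; the function name and the k=10 default are illustrative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def cluster_failures_kmeans(rows: list[dict], embeddings: np.ndarray, k: int = 10) -> dict:
    # Normalize first so plain euclidean KMeans behaves like cosine on unit vectors.
    labels = KMeans(n_clusters=k, n_init="auto", random_state=0).fit_predict(
        normalize(embeddings)
    )
    clusters: dict[int, list[dict]] = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(int(label), []).append(row)
    return clusters

KMeans has no noise bucket, so every trace lands in some cluster; sweep K across the 5-15 range and keep the value whose clusters read as coherent failure modes.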

Step 3: Label root causes

Hand-label each cluster with a failure-mode tag. Two reviewers, kappa above 0.6 minimum.
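
Checking that agreement is two lines with scikit-learn's cohen_kappa_score; the reviewer label lists below are illustrative:

from sklearn.metrics import cohen_kappa_score

# One tag per cluster from each reviewer, in the same cluster order (example data only).
reviewer_a = ["retrieval_miss", "tool_error", "hallucination", "refusal_mismatch"]
reviewer_b = ["retrieval_miss", "tool_error", "prompt_drift", "refusal_mismatch"]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"kappa = {kappa:.2f}")  # below 0.6, re-discuss the label definitions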

The 2026 failure-mode taxonomy:

| Tag | Description | Common cause |
| --- | --- | --- |
| retrieval_miss | RAG returned wrong chunks | Stale index, query rewrite bug |
| prompt_drift | Model behavior shifted on same prompt | Provider model update |
| tool_error | Tool call failed or had wrong args | Schema mismatch, expired token |
| hallucination | Factually wrong, not refused | Long context, weak grounding |
| refusal_mismatch | Refused valid or accepted invalid | Refusal threshold off |
| schema_violation | Output didn't validate | Free-form prompt, missing structured output |
| context_overflow | Input exceeded effective context | Long history, no summarization |
| persona_break | Broke character | Adversarial input, jailbreak |
| tone_mismatch | Wrong register or empathy | Prompt edits |
| latency_breach | Above p99 budget | Provider issue, retries |

LLM-as-judge can assist labeling at the cluster level; pull 5 representative samples per cluster, ask a frontier model to suggest the tag, then human-confirm.
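
A sketch of that assist step, assuming the taxonomy table above and the OpenAI client from Step 2; the helper name and the model choice are placeholders, and a human still confirms every suggested tag:

TAXONOMY = ["retrieval_miss", "prompt_drift", "tool_error", "hallucination",
            "refusal_mismatch", "schema_violation", "context_overflow",
            "persona_break", "tone_mismatch", "latency_breach"]

def suggest_cluster_tag(samples: list[dict]) -> str:
    # Show the judge 5 representative failures and ask for one tag from the taxonomy.
    examples = "\n\n".join(
        f"Query: {s['query']}\nResponse: {s['response']}" for s in samples[:5]
    )
    prompt = (
        "You are labeling a cluster of failed LLM traces. "
        f"Reply with exactly one tag from: {', '.join(TAXONOMY)}\n\n{examples}"
    )
    resp = oai.chat.completions.create(
        model="gpt-4o",  # any frontier model; the choice here is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()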

Step 4: Prioritize by impact

Frequency alone is misleading. A 12 percent failure on FAQ matters less than a 1 percent failure on refund. Get an impact weight from the business owner per intent class.

IMPACT_WEIGHTS = {
    "refund": 10,
    "onboarding": 7,
    "billing": 6,
    "support": 4,
    "faq": 1,
}

def prioritize(clusters: dict) -> list[dict]:
    ranked = []
    for cluster_id, rows in clusters.items():
        if cluster_id == -1:
            continue
        n = len(rows)
        # Impact = summed business weights, i.e. frequency x average impact weight.
        impact = sum(IMPACT_WEIGHTS.get(r.get("intent", "support"), 1) for r in rows)
        ranked.append({"cluster_id": cluster_id, "size": n, "impact": impact,
                       "tag": rows[0].get("tag", "unlabeled")})
    ranked.sort(key=lambda x: -x["impact"])
    return ranked

Fix the top 3 this sprint. Re-run after the fix.

Step 5: Build regression set per cluster

Each labeled cluster contributes 50-200 rows to its regression dataset. The dataset rows are the original failed traces plus a gold answer (corrected by a human reviewer or the post-fix expected output).

import json

def build_regression_set(clusters: dict, output_dir: str):
    os.makedirs(output_dir, exist_ok=True)
    for cluster_id, rows in clusters.items():
        if cluster_id == -1:
            continue
        tag = rows[0].get("tag", "unlabeled")
        with open(f"{output_dir}/{tag}_cluster_{cluster_id}.jsonl", "w") as f:
            for row in rows[:200]:
                f.write(json.dumps({
                    "query": row["query"],
                    "response_actual": row["response"],
                    "expected_pattern": row.get("gold_pattern", ""),
                    "tag": tag,
                }) + "\n")

Step 6: CI gate per cluster

Each cluster gets its own pass-rate gate.

# tests/test_clusters.py
import json
from glob import glob

import pytest

# Your agent entrypoint and output matcher; the module name here is illustrative.
from my_app import run_agent, matches_expected

CLUSTER_FILES = glob("data/clusters/*.jsonl")

# Per-cluster pass-rate gates, keyed by file path; anything not listed falls back to 0.7.
THRESHOLDS: dict[str, float] = {}

@pytest.mark.parametrize("cluster_file", CLUSTER_FILES)
def test_cluster_pass_rate(cluster_file):
    with open(cluster_file) as f:
        rows = [json.loads(line) for line in f]
    threshold = THRESHOLDS.get(cluster_file, 0.7)
    pass_count = 0
    for row in rows:
        actual = run_agent(row["query"])
        if matches_expected(actual, row["expected_pattern"]):
            pass_count += 1
    pass_rate = pass_count / len(rows)
    assert pass_rate >= threshold, \
        f"Regression in {cluster_file}: {pass_rate:.3f} < {threshold}"

A merge that regresses a known cluster blocks. The PR comment surfaces which cluster regressed.

Step 7: Re-run weekly

Schedule the full workflow weekly. Three things shift across runs:

  • Cluster shape. New failure modes appear; stale ones shrink.
  • Prioritization. As the top cluster shrinks (because you fixed it), the next one rises.
  • Regression set. Datasets grow; old rows still pass; new rows pressure the next prompt update.

A weekly cadence keeps error analysis a workflow, not a one-time investigation.
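
A minimal sketch of that weekly job, wiring the earlier functions together; the workflow name is illustrative, step 3 stays human, and scheduling (cron, an orchestrator, a CI schedule) is up to you:

def weekly_error_analysis(workflow: str = "support-agent"):
    rows = pull_failed_traces(workflow, days=7)          # step 1
    clusters = cluster_failures(rows)                    # step 2
    # Step 3 stays manual: review suggest_cluster_tag() output and confirm tags.
    ranked = prioritize(clusters)                        # step 4
    build_regression_set(clusters, "data/clusters")      # step 5
    for entry in ranked[:3]:                             # this sprint's top 3
        print(f"cluster {entry['cluster_id']} ({entry['tag']}): "
              f"{entry['size']} failures, impact {entry['impact']:.1f}")

if __name__ == "__main__":
    weekly_error_analysis()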

Common mistakes in error analysis

  • Eyeballing 20 traces. A real cluster needs at least 100 examples. Pattern recognition on 20 is noise.
  • Single-engineer triage. Without two reviewers, kappa is unverified.
  • Frequency-only prioritization. A 1 percent refund failure beats a 12 percent FAQ failure.
  • No closed loop. A fix without a regression test gets re-broken next quarter.
  • Stale cluster labels. Clusters drift; refresh labels every 4-6 weeks.
  • Hand-labeling everything. LLM-judge assists at the cluster summary level; the judge is the speed multiplier.
  • No business impact weights. Engineers cannot guess weights; ask the business owner.
  • Skipping the noise bucket. HDBSCAN’s noise often hides the most informative outliers.

Tools that support error analysis in 2026

  • FutureAGI. Apache 2.0. Trace store with cluster queries, dataset auto-build, judge calibration. ClickHouse-backed.
  • LangSmith. Closed platform. Datasets, clusters by user-feedback signal.
  • Langfuse. MIT core. Datasets v2, annotation queues, judge runs over clusters.
  • Phoenix. ELv2. OpenInference-aligned trace store, cluster analysis.
  • Custom on Postgres + scikit-learn. A 200-line Python script gets you 80 percent of the workflow if you do not want a platform.

Recent error analysis updates

| Date | Event | Why it matters |
| --- | --- | --- |
| Mar 2026 | FutureAGI shipped Agent Command Center and ClickHouse trace storage | Cluster queries across millions of spans became cheap. |
| Dec 2025 | DeepEval v3.9.x agent metrics | First-party metrics for tool-call, plan, and conversational failures aligned with cluster tags. |
| Jun 2025 | Galileo Luna 2 announcement | Smaller-language-model judges with lower latency and cost than frontier judges, available on Galileo Enterprise. |
| Jan 2024 | OpenAI text-embedding-3 family | text-embedding-3-small at $0.02 per 1M tokens (current) made large-scale failure clustering cheap; text-embedding-3-large is $0.13 per 1M. |
| 2023 | HDBSCAN added to scikit-learn 1.3 | HDBSCAN became a one-import default; pin scikit-learn>=1.3. |


Read next: Best LLM Monitoring Tools 2026, LLM Testing Playbook 2026, LLM Observability Platform Buyer’s Guide 2026

Frequently asked questions

What is error analysis for LLM applications?
Error analysis is the workflow of pulling failed traces, clustering them by failure mode, labeling root causes, and prioritizing fixes by frequency × business impact. It is what turns a vague 'the agent is failing' complaint into a concrete prioritized list of what to fix next. Without error analysis, teams react to the loudest user complaint instead of the highest-impact failure mode.
Why does error analysis matter in 2026?
Three reasons. First, eval scores tell you that quality regressed but not why. Error analysis turns aggregate scores into root causes. Second, distilled judges and span-attached scoring made it cheap to label every span, so the bottleneck shifted from data to triage. Third, agent products fail in more ways than chatbots; the failure-mode taxonomy is bigger and harder to keep in your head.
How do I cluster failure cases in 2026?
Three steps. First, pull failed traces (judge below threshold, user thumbs-down, escalation, abandonment). Second, embed each failure case (the user query, the agent response, the trace summary) and run a clustering algorithm (HDBSCAN, KMeans with chosen K, or LLM-driven topic discovery). Third, hand-label each cluster with a root cause and prioritize. The loop ends when 80 percent of failures map to known root causes.
What are the common failure modes for LLM apps?
Eight high-frequency modes in 2026. Retrieval miss (RAG returned wrong chunks). Prompt drift (model's behavior shifted on the same prompt). Tool error (tool call failed or had wrong arguments). Hallucination (factually wrong, not refused). Refusal mismatch (refused valid request or accepted invalid one). Schema violation (output didn't match the structured format). Context overflow (input exceeded model's effective context window). Persona break (agent broke character or revealed system prompt).
How do I prioritize fixes from error analysis?
Frequency × business impact. Cluster by failure mode, count occurrences, multiply by an impact weight (business-owner-supplied: refund failures 10x, onboarding failures 7x, FAQ failures 1x). Sort descending. Fix the top three. Re-run error analysis after the fix; the prioritization shifts as the highest-frequency cluster shrinks.
Can I run error analysis without a proprietary SDK?
Yes. The reference implementation uses OpenAI's SDK plus traceAI (Apache 2.0) for instrumentation. Embeddings, clustering (scikit-learn HDBSCAN), root-cause labeling (LLM-as-judge with a custom prompt), and prioritization are all plain Python. The trace store can be FutureAGI cloud, Phoenix self-host, or a Postgres table. The workflow is vendor-neutral end-to-end.
How does error analysis fit into CI?
Two integrations. First, the regression dataset auto-builds from labeled clusters; failed cases from cluster X become test cases for next week's eval. Second, the per-cluster pass rate is a CI gate; a regression on a known root-cause cluster blocks the merge. The eval suite shifts from 'global pass rate' to 'per-cluster pass rate', which is what the engineer fixing the bug actually wants to see.
What are common mistakes in error analysis?
Five. First, eyeballing 20 traces and declaring a pattern (the actual cluster shape needs at least 100 cases). Second, labeling clusters once and never refreshing (clusters drift). Third, prioritizing by absolute frequency without business impact (a 1 percent failure on refund > a 12 percent failure on FAQ). Fourth, no closed loop into the dataset (failures get fixed but the regression isn't blocked next time). Fifth, single-engineer triage (clusters need at least two reviewers for kappa above 0.6).