Guides

AI Agent Failure Modes in 2026: The 5-Category Taxonomy

AI agents fail in five categories, not fifty. The taxonomy production teams use to name incidents, route fixes, and close the loop.

·
Updated
·
13 min read
agent-evaluation agent-observability ai-agent-failure-modes error-feed 2026
Editorial cover image for AI Agent Failure Modes in 2026: 12 Named Modes, Real Incidents, and Detection Recipes
Table of Contents

An on-call page at 03:14. The customer-support agent that shipped Wednesday is quoting refund amounts off by an order of magnitude, contradicting itself across turns, and on one trace handing a user another customer’s order. You pull the failing conversations. By breakfast the channel is full of guesses: hallucination, prompt regression, retrieval drift, a vendor side effect, the new tool. Six engineers, six different fixes, no shared name for what just broke.

This is the part of the on-call cycle that does not compound. If your incident review does not name the failure category, you are not learning from it. You are patching symptoms and shipping vibes. AI agents fail in five categories, not fifty. Planning errors, tool errors, retrieval errors, reasoning errors, safety and policy violations. Every OWASP entry, every MITRE ATLAS technique, every long catalog you might bookmark collapses cleanly into one of those five. The five each have their own fix workflow. Conflate them and the patches stop compounding.

The opinion this post earns: the right unit of postmortem is the failure category, not the symptom. A named category routes the fix to the right surface — planner, tool layer, retriever, generator, output scanner — and the named subtype becomes a regression test in CI. This guide walks the five categories, the trace fingerprint that distinguishes each, the fix-loop that closes incidents back into evals, and the role Future AGI’s Error Feed plays as the layer that writes the category, the score, and the immediate_fix for you.

TL;DR: five categories, one fix-loop

CategoryTrace fingerprintWhere the fix lives
Planning errorsWrong tool sequence, infinite loop, missing stepPlanner prompt, plan-vs-execute eval, cycle detector
Tool errorsWrong tool, wrong args, no error handling, fabricated toolTool allowlist, schema guard, retry policy, MCP descriptor scan
Retrieval errorsWrong chunk, missing context, role-switch token in chunk, hallucinated sourceRetriever rubric, chunker re-eval, ingestion control
Reasoning errorsFactual error, math error, instruction ignored, hallucination on grounded contextGenerator rubric, judge, calibration set
Safety and policy violationsPII leak, jailbreak compliance, harmful output, refusal failureOutput scanner, refusal eval, adversarial pushback set

Five categories, thirty subtypes when you go one layer down. Map your incident to the row and you fix the right surface; conflate rows and you ship patches that age within the week.

Why incident review without a taxonomy is theater

Every “AI agent hallucinated again” postmortem closes with three things: a one-off prompt tweak, a vague action item to “add a guardrail,” and a Slack thread that will not be read again. Nothing compounds. The next deploy can regress the tweak silently because no eval was written. The guardrail action item never specifies which surface the guardrail belongs on, so it goes to whoever is least busy. Two months later the same failure recurs and the team cannot tell if it is a regression or a different mode masquerading as the same symptom.

A taxonomy fixes three problems at once. It routes the fix to the right surface (planner-side patches do not belong in an output scanner). It generates a regression test (the cluster becomes a labeled dataset entry the next PR has to clear). And it gives the team a shared vocabulary so the next on-call can read last month’s incident and understand it without re-deriving the diagnosis.

Five categories is the sweet spot. Three is too coarse: “input, model, output” collapses planner and tool errors into the same bucket and you lose the fix routing. Twelve is too granular and the review turns into a labeling argument instead of a patch. Five matches the natural failure surfaces in a 2026 agent’s span tree: plan, tool, retrieve, reason, respond.

Category 1: planning errors

The agent picked the wrong tool sequence, looped, skipped a required step, or kept executing after the plan should have terminated. The reasoning step inside the plan span is the one that broke; everything downstream inherits the corruption.

Subtypes that recur. Wrong-tool-selected (picked a tool the task did not need). Missing-step (skipped a required precondition like authentication or a check-balance call). Infinite-loop (same tool with similar args fires more than three times per trace). Premature-termination (returned to the user before the plan finished). Plan-vs-execute divergence (planner decided one sequence, executor ran a different one — the canonical fingerprint of indirect prompt injection at the plan layer).

Trace fingerprint. The plan span’s emitted tool sequence does not match the executed tool sequence. Or the executed sequence matches but the goal eval did not advance for N consecutive steps. Or agent_trace_steps_total climbs monotonically with no goal progress.

Where the fix lives. Planner-side. Tighter system prompt with explicit termination conditions. A cycle detector that fires when the same tool with similar arguments runs more than three times. A plan-vs-execute evaluator that scores the divergence and gates the next request. The November 2025 four-agent loop on dev.to burned $47K over eleven days because the planner kept passing checklist artifacts back and forth between two agents and no cycle detector noticed. Per-call rate limits did not save it. Planning-layer evals would have.

Category 2: tool errors

The plan was right, but a tool call returned junk, was invoked with the wrong arguments, or never had its error path handled. The agent then proceeded as if the call succeeded.

Subtypes that recur. Wrong-args (argument shape failed the tool’s schema; the agent retried with the same shape). Tool-hallucination (the agent invoked a function that does not exist in the runtime catalog). No-error-handling (the tool returned a 500 and the agent fabricated a reasonable-sounding response on top of the failure). MCP-tool-poisoning (the tool’s description carried hidden instructions that fired when the agent picked it). API-drift (the third-party endpoint changed schema, error codes, or rate limits, and the mock in CI still returns the old payload).

Trace fingerprint. A tool span returning a non-2xx status or schema-violation flag, followed by a generation span whose output references the failed call’s expected result. Or agent_tool_allowlist_violation_total non-zero. Or the tool’s descriptor hash drifted between catalog import and runtime invocation.

Where the fix lives. Tool layer. A per-key tool allowlist enforced at the gateway. An argument-shape evaluator (EvaluateFunctionCalling against the runtime schema). A hard error-path that fails closed rather than letting the generator fabricate over a missing result. MCP descriptor hash checks at import time, before the catalog ever sees the agent. The August 2025 MCPTox benchmark measured a 72.8% attack success rate against o1-mini across 45 MCP servers; the Replit DB deletion the same month was tool-misuse during a declared freeze. Both are tool-error category, different subtype, same fix routing.

Category 3: retrieval errors

The agent went to fetch context and came back with the wrong material. Whatever the generator does next inherits the bad inputs.

Subtypes that recur. Wrong-chunk (top-k surfaced material adjacent to the question but not on it). Missing-context (the right document was in the corpus but never retrieved). Hallucinated-source (the agent cited a document that does not exist in the index). Role-switch-token in chunk (the retrieved text contains system:, <|im_start|>, ### Instruction:, or the long tail of indirect-injection patterns). Stale-index (the retriever was evaluated at chunk size 800 against 12K docs; production now runs 38K docs through a re-embedded chunker and the same query lands on different chunks).

Trace fingerprint. Context relevance drops while groundedness stays stable. Or the top-k chunks contain pattern matches against the role-switch regex set. Or the retrieved document’s source URI does not resolve in the production index manifest. PoisonedRAG (USENIX Security 2025) showed five poisoned documents in a one-million-document corpus reach ~90% attack success on the target question — the trace fingerprint there is a retrieved chunk whose embedding clustered with attacker-controlled text after ingestion.

Where the fix lives. Retrieval layer. Split the eval suite by layer so retrieval rubrics (ContextRelevance, ChunkAttribution, ChunkUtilization) catch the drift before generation rubrics absorb it. Ingestion-side filters on role-switch tokens. Re-evaluation of the chunker and embedding model when the corpus crosses a 2x size threshold. Covered in depth in Evaluate RAG in CI/CD (2026).

Category 4: reasoning errors

Retrieval returned clean context, the plan was sound, the tools all worked, and the final answer is still wrong. The generator confabulated, did the math incorrectly, or ignored a system-prompt instruction.

Subtypes that recur. Factual-error (the answer contradicts the supplied context). Math-error (numeric reasoning produced a wrong total, percentage, or unit conversion). Instruction-not-followed (the system prompt said “respond in JSON” and the agent returned prose). Confabulation-on-grounded-context (the context was correct and the agent generated a plausible-sounding fabrication anyway). Multi-turn-incoherence (the agent contradicted its earlier turn within the same session).

Trace fingerprint. The four-dim trace score’s factual_grounding or instruction_adherence axis drops below 3 while retrieval rubrics held. Or a senior engineer reads ten traces, disagrees with the judge on six, and cannot articulate why. That usually means the rubric is grading a prompt version that no longer ships.

Where the fix lives. Generator rubric and judge calibration. The Moffatt v Air Canada tribunal decision (February 2024) is a reasoning error: the chatbot fabricated a bereavement-fare policy that did not exist in the supplied policy context. The patch is not a guardrail; it is a factual_grounding evaluator running on every response against the policy corpus, with a failed score gating the next request through the same template. Pair the rubric with a human-labelled calibration set drawn from production. Track judge-versus-human drift as its own metric.

Category 5: safety and policy violations

The agent produced an output that should have been refused, leaked something it should have masked, or complied with an instruction the system prompt forbade.

Subtypes that recur. PII-leak (the response surfaced an email, phone, SSN, or API key from context or memory). Refusal-failure (the agent complied with a jailbreak instead of declining). Harmful-output (the response generated content in a category the safety policy blocks). System-prompt-violation (the agent ignored an explicit instruction like “never recommend competitor products”). Cross-tenant-leak (the agent surfaced data from another customer’s session because memory or retrieval scoping failed). Output-injection (the response rendered as a Markdown image with an attacker-controlled src, auto-exfiltrating chat history on render — the June 2025 EchoLeak chain is the canonical case).

Trace fingerprint. The four-dim trace score’s privacy_and_safety axis drops below 3. An outbound URL in agent output does not match the render allowlist. An adversarial pushback eval flips the answer between turn N and N+1 with no new evidence — the Anthropic-OpenAI joint pushback evaluation (August 2025) documented frontier models doing this even on originally correct answers.

Where the fix lives. Output-side. Inline scanners on outbound responses (PII Detection, Data Leakage Prevention, Secret Detection, Custom Expression Rules for outbound URL allowlists). Adversarial pushback evaluators on the eval surface, scoring answer_flip_rate per template. Refusal contracts as their own rubric, versioned in the same PR as the system prompt that defines them.

The order matters: walk the trace front-to-back

The five categories compose. Real incidents stack two or three; naming a single mode in a postmortem misses the chain. The diagnostic discipline is to walk the trace front-to-back and stop at the first category that broke, because earlier errors corrupt everything downstream.

Plan first. If the planner picked a wrong tool sequence, the tool that ran inside the wrong plan is not the bug; the planner is. Fixing the tool layer here is a category mismatch — it produces a patch that does not generalize.

Tool second. If the plan was right and a tool returned junk, generation reasoning over the junk will look like a reasoning error in isolation. The fix is the tool’s error path, not a smarter generator.

Retrieval third. A wrong chunk makes a perfect generator produce a wrong answer that scores low on factual grounding. Fixing the generator does nothing; the retriever is the bug.

Reasoning fourth. Only after planning, tooling, and retrieval are clean is the residual a true reasoning error.

Safety and policy last. Many safety failures have an earlier-category root cause — a retrieval error surfacing PII, or a planning error invoking a tool the policy should have gated. The output-side scanner is the backstop, not the diagnosis. Naming the upstream category is what produces a fix that compounds.

The fix-loop: cluster, judge, promote, gate

A named taxonomy is the start. The loop that turns it into compounding quality is five steps.

  1. Cluster. Failing traces flow into a store with their span embeddings. HDBSCAN soft-clustering groups them into named issues. No engineer triages a flat list of 800 failures.
  2. Judge. A judge agent runs on each cluster, emits the 5-category 30-subtype classification, the 4-dim trace score, and a single immediate_fix string naming the change to ship today (rubric edit, prompt patch, tool-call guard, retrieval-filter tweak).
  3. Promote. The on-call engineer accepts the cluster, selects 3-10 representative traces, commits them into the offline eval set with route tags and rubric labels. Every promoted trace is a regression the next PR cannot break.
  4. Re-gate. The next CI run grades the new entries with the same rubric the production scorer used. The next PR touching that path has to clear them.
  5. Optimize. agent-opt searches the prompt space on the expanded set against the same rubric; the winning candidate has to clear the gate before it ships.

Cadence: weekly on active products, faster on volatile launches. Static eval sets older than a quarter rarely match production. The discipline is to promote back into the offline set on every closed cluster, so the offline set ratchets stronger off what production already broke.

How Future AGI ships the loop

Future AGI ships the eval stack as a package, and Error Feed sits inside it as the category-naming layer.

The mechanics. Failing traces flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering groups them into named issues at prob >= 0.4. Each cluster fires a Claude Sonnet 4.5 Judge agent on Bedrock for a 30-turn investigation across 8 span-tools (read_span, get_children, get_spans_by_type, search_spans, submit_finding, submit_scores, submit_summary), with a Claude Haiku “Chauffeur” summarising spans over 3000 characters. Prompt cache hit ratio sits around 90%, which keeps unit economics survivable.

What the Judge writes back. Per cluster, three things engineers actually read. A 5-category 30-subtype taxonomy classification. The 4-dim trace score: factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, each 1-5. A single immediate_fix string naming the change to ship today.

Where the fix goes. The fix feeds the Platform’s self-improving evaluators so the rubric ages with the product. The cluster becomes a candidate dataset entry the on-call engineer promotes into the offline set. Linear OAuth is wired today for one-click ticket creation; Slack, GitHub, Jira, and PagerDuty are on the roadmap. Every incident becomes a regression test the team never had to write by hand.

Why the loop closes. ai-evaluation (Apache 2.0) is the code-first surface: 50+ EvalTemplate classes, the real Evaluator API, four distributed runners. traceAI (Apache 2.0) carries the same rubric as a span-attached score on live traffic across 50+ AI surfaces in Python, TypeScript, Java, and C#. The Future AGI Platform is the operational layer where self-improving evaluators retune from thumbs feedback and classifier-backed evals run at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the stack as the clustering and what-to-fix layer.

Ready to name your next incident? Wire one EvalTemplate against your current dataset, attach the same template as an EvalTag on live traces via traceAI, and let Error Feed cluster the first failing batch. The category in the cluster name is the first row of your taxonomy filled in by the system, not by guesswork. Start with the ai-evaluation SDK quickstart.

Frequently asked questions

Why a five-category taxonomy instead of a long catalog?
Because every long catalog you actually run an incident review against collapses into the same five buckets. Planning errors, tool errors, retrieval errors, reasoning errors, safety and policy violations. OWASP's twenty entries, MITRE ATLAS techniques, the in-house spreadsheet a team built in February — they all roll up cleanly into one of those five. The point of a taxonomy in postmortem is not exhaustive coverage. It is route-the-fix. Five categories give you five different fix workflows. Fifty entries give you a labeling argument that delays the patch. The 5-category 30-subtype taxonomy Future AGI's Error Feed Judge writes back on every failing trace is built on this collapse: five top-level routes, thirty subtypes for diagnosis, one immediate_fix string per cluster.
How do I decide which category a failure belongs to?
Walk the trace in order. Plan span first: did the agent pick the right tool sequence, or did it pick a wrong tool, loop, or skip a step? That is a planning error. If the plan was right but a tool call returned junk, fabricated, used wrong args, or never error-handled, that is a tool error. If retrieval ran, look at the chunks: wrong chunk, missing chunk, hallucinated source, role-switch token in the chunk — retrieval error. If retrieval was clean and the final answer is still wrong, math busted, instruction ignored, or factually false against the supplied context, that is a reasoning error. If the response leaked PII, complied with a jailbreak, produced harmful content, or violated the system prompt's refusal contract, that is a safety and policy violation. The order matters because earlier-category errors corrupt everything downstream.
What does the 4-dim trace score map to in the taxonomy?
The four axes are the diagnostic over the five categories. Factual grounding drops when retrieval or reasoning broke. Privacy and safety drops when the response violated policy. Instruction adherence drops on planning or reasoning when the agent ignored the system prompt. Optimal plan execution drops when planning or tool errors produced a wrong sequence. The four-dim score is graded one to five per axis by the same judge on every trace; the named category in the 30-subtype taxonomy tells the engineer which fix workflow to open. The same vocabulary runs in the offline eval set and on live spans.
How does Error Feed write the immediate_fix string?
Failing traces flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering groups them into named issues at prob greater than 0.4, so noise points stay recoverable. Each cluster fires a Claude Sonnet 4.5 Judge agent on Bedrock for a 30-turn investigation across 8 span-tools (read_span, get_children, get_spans_by_type, search_spans, submit_finding, submit_scores, submit_summary, plus a Haiku Chauffeur for spans over 3000 characters), with prompt cache hit around 90 percent. The Judge emits the 5-category 30-subtype classification, the 4-dim trace score, and a single immediate_fix string that names the change to ship today: rubric edit, prompt patch, tool-call guard, retrieval-filter tweak. The fix feeds the Platform's self-improving evaluators and the cluster becomes a candidate dataset entry.
Where do prompt injection, jailbreaks, and MCP tool poisoning fit?
Prompt injection lives across two categories, depending on what fired. If the injection hijacked the plan and rerouted the agent to a different tool sequence, it surfaces as a planning error subtype. If the injection coaxed the response into leaking PII or bypassing a refusal, it surfaces as a safety and policy violation. Jailbreak completions are always safety-and-policy. MCP tool poisoning is a tool-error subtype with a supply-chain root cause: the descriptor itself shipped malicious instructions, so the agent picking the tool was an unsafe choice the catalog should have rejected. The point of the taxonomy is not to argue which label is correct for an attack class; it is to route the fix. Injection-via-plan-hijack goes to a planner-side guard; injection-via-leak goes to an output-side scanner.
How does the fix-loop close back into evals?
Five steps. Error Feed clusters failing traces into named issues. The Judge writes the 4-dim score, the taxonomy label, and the immediate_fix. The on-call engineer promotes three to ten representative traces from the cluster into the offline eval set with route tags and rubric labels. The next CI run grades the new entries with the same rubric the production scorer used. agent-opt searches the prompt space on the expanded set against that same rubric, and the winning candidate has to clear the gate before it ships. Cadence is weekly on active products, faster on volatile launches. Static eval sets older than a quarter rarely match production patterns.
Why is incident review without a taxonomy a problem?
Because the patches do not compound. When every postmortem closes with 'the agent hallucinated' or 'a guardrail failed,' two things happen. The fix-of-the-week is a one-off prompt tweak with no eval attached, so the next deploy can regress it silently. And the team cannot tell whether last quarter's incidents are the same failure mode recurring or different ones masquerading as the same symptom. A named category routes the fix to the right surface — planner, tool layer, retriever, generator, output scanner — and the named subtype becomes the regression test in CI. Without it, incident review is theater: it produces motion, not learning.
Related Articles
View all