Guides

AI Agent Failure Modes in 2026: 12 Named Modes, Real Incidents, and Detection Recipes

Field guide to 12 AI agent failure modes for 2026, cross-walked to OWASP LLM Top 10, OWASP Agentic Top 10, NIST AI 600-1, and MITRE ATLAS, with named detection metrics, ten 2024-2026 incidents, and the Future AGI Agent Failure Map.

·
24 min read
ai-gateway 2026
Editorial cover image for AI Agent Failure Modes in 2026: 12 Named Modes, Real Incidents, and Detection Recipes
Table of Contents

Originally published May 12, 2026. Updated May 15, 2026.

On March 24, 2026 at 10:39 UTC, a threat actor calling itself TeamPCP pushed LiteLLM 1.82.7 to PyPI and 1.82.8 thirteen minutes later, after force-pushing 76 of 77 aquasecurity/trivy-action tags to malicious commits, exfiltrating LiteLLM’s CI publisher token, and bundling a credential harvester, a Kubernetes lateral-movement toolkit, and a .pth-based Python persistence loader into the wheel that pip install would run on import. This guide names the twelve AI agent failure modes production teams have to defend against in 2026, cross-walks each one to OWASP LLM Top 10, OWASP Top 10 for Agentic Applications 2026, NIST AI 600-1, and MITRE ATLAS, and pairs every mode with a 2024-2026 incident, a named runtime detection metric, and a Future AGI capability that catches it.

TL;DR: 12 Failure Modes, 4 Catalogs, One Map

Twelve named failure modes define the production agent risk surface for 2026, and the Future AGI Agent Failure Map is the first field guide in our competitive review to cross-walk every one to OWASP LLM Top 10 2025, OWASP Top 10 for Agentic Applications 2026, NIST AI 600-1 Generative AI Profile, and MITRE ATLAS v5.4.0. Six block inline at the gateway, three are partial inline plus a paired continuous evaluator, three are eval-stage failures that no inline guardrail can catch.

Each mode below is tagged with its OWASP LLM Top 10 entry, OWASP Agentic Top 10 2026 entry, NIST AI 600-1 risk category, and MITRE ATLAS technique where applicable.

  • Prompt Injection: LLM01:2025, ASI01:2026 Goal Hijack, NIST AI 600-1 Information Security, ATLAS AML.T0051
  • Tool Misuse: LLM06:2025, ASI02:2026 Tool Misuse and Exploitation, NIST AI 600-1 Human-AI Configuration
  • Tool Poisoning via MCP: LLM03:2025 Supply Chain, ASI04:2026 Agentic Supply Chain Compromise, ATLAS Publish Poisoned AI Agent Tool
  • Tool Hallucination: LLM09:2025 Misinformation, ASI08:2026 Cascading Agent Failures (related), NIST AI 600-1 Confabulation
  • Goal Misgeneralization: ASI01:2026 Goal Hijack (broader category), NIST AI 600-1 Human-AI Configuration
  • Reward Hacking: ASI01:2026 (related), NIST AI 600-1 Human-AI Configuration
  • Runaway Loops: LLM10:2025 Unbounded Consumption, ASI08:2026 Cascading Agent Failures, NIST AI 600-1 Environmental Impacts
  • RAG Poisoning: LLM04:2025 Data and Model Poisoning, LLM08:2025 Vector and Embedding Weaknesses, ASI06:2026 Memory and Context Poisoning, ATLAS AML.T0020
  • Memory Poisoning: ASI06:2026 Memory and Context Poisoning, ATLAS AML.T0080
  • Excessive Agency: LLM06:2025, ASI03:2026 Agent Identity and Privilege Abuse
  • Output Injection: LLM02:2025 Sensitive Information Disclosure, LLM05:2025 Improper Output Handling, ASI05:2026 Unexpected Code Execution
  • Sycophancy: ASI09:2026 Human-Agent Trust Exploitation, NIST AI 600-1 Information Integrity

The 12 Failure Modes at a Glance

The master taxonomy. Each row pairs a mode with its canonical labels, a real 2024-2026 incident, a named runtime detection metric, and the Future AGI capability that catches it. The mode sections below annotate each row in full.

Catalog cross-walk: each row maps an agent failure mode to OWASP LLM Top 10, OWASP Agentic Top 10, NIST AI 600-1, and MITRE ATLAS labels, with the anchor incident and the Future AGI capability that catches it.

#ModeCanonical LabelsAnchor IncidentDetection MetricFuture AGI Capability
1Prompt InjectionLLM01:2025 / ASI01:2026 / AML.T0051EchoLeak CVE-2025-32711, June 2025agent_planner_executor_divergence_totalPrompt Injection scanner
2Tool MisuseLLM06:2025 / ASI02:2026Replit DB deletion, July 2025agent_tool_allowlist_violation_totalTool Permissions scanner
3Tool PoisoningLLM03:2025 / ASI04:2026 / ATLAS Publish Poisoned ToolMCPTox 72.8 percent ASR, August 2025agent_mcp_descriptor_hash_drift_totalMCP Security scanner
4Tool HallucinationLLM09:2025 / ASI08:2026 (related)Cursor “Sam” bot, April 2025agent_unknown_tool_call_totalTool Permissions plus Hallucination Detection
5Goal MisgeneralizationASI01:2026 (broader) / NIST AI 600-1Air Canada tribunal, February 2024agent_proxy_vs_groundtruth_eval_deltaContinuous evaluators with span_id link
6Reward HackingASI01:2026 (related) / DeepMind spec gamingdev.to four-agent loop, $47K over 11 daysagent_verifier_vs_holdout_eval_deltaShadow experiments plus custom evaluators
7Runaway LoopsLLM10:2025 / ASI08:2026dev.to four-agent loop, $47K over 11 daysagent_trace_steps_total (monotonic, no goal progress)Per-key budgets plus circuit breaking
8RAG PoisoningLLM04:2025 / LLM08:2025 / ASI06:2026 / AML.T0020PoisonedRAG USENIX Security 2025, 5 docs in 1Magent_rag_role_switch_pattern_totalHallucination Detection plus Custom Expression Rules
9Memory PoisoningASI06:2026 / AML.T0080Microsoft AI Recommendation Poisoning, February 2026agent_memory_write_anomaly_totalInput Validation plus Webhook (BYOG)
10Excessive AgencyLLM06:2025 / ASI03:2026OX Security MCP STDIO RCE class, April 2026 (CVE-2025-49596)agent_scope_escalation_totalVirtual Keys plus Tool Permissions
11Output InjectionLLM02:2025 / LLM05:2025 / ASI05:2026EchoLeak Markdown image exfiltration, June 2025agent_outbound_url_disallowed_totalDLP plus Secret Detection plus Custom Expression Rules
12SycophancyASI09:2026 / NIST AI 600-1Anthropic and OpenAI joint pushback eval, August 2025agent_answer_flip_rateAdversarial pushback evaluators

The names are Prometheus-style counters that compose with the OpenTelemetry GenAI semantic conventions as agent-specific extensions, not as substitutes for gen_ai.* spec attributes. Twelve metrics, twelve modes, one alert rule per row.

Why Are AI Agents Still Failing at Production Scale in 2026?

Because the production agent surface in 2026 is wider, more autonomous, and more interconnected than the LLM-as-component surface most observability and security tooling was designed for. Five structural changes between 2024 and 2026 reset what counts as a production failure, and each one maps directly to a named mode in the taxonomy below.

Agent Autonomy Outran the Surrounding Control Plane

In 2024 the average LLM application was a single chat completion with retry logic; per-call rate limits and a static token budget were enough. By 2026 the average production agent runs a multi-step plan, picks tools from a catalog, writes to persistent memory, and orchestrates other agents over A2A.

The July 2025 Replit DB deletion happened during a declared freeze because the agent retained write authority the freeze policy never propagated to. The November 2025 four-agent loop burned eleven days because the per-call rate limit never noticed an inter-agent A2A loop. The new failures live between the calls, not inside them.

The Tool Surface Exploded With MCP and A2A

Every tool description an agent imports is a piece of attacker-influenceable text on a new supply-chain surface. The MCPTox benchmark in August 2025 measured a 72.8 percent attack success rate against o1-mini across 45 real MCP servers and 353 tools.

January 2026’s Anthropic Git MCP CVEs (CVE-2025-68143/68144/68145) extended the same vector to GitHub READMEs and issue text. April 2026’s OX Security MCP supply-chain disclosure aggregated ten CVEs anchored on the original MCP Inspector RCE and reached 7,000-plus exposed servers. Tool poisoning is now a supply-chain attack surface that didn’t exist in 2024.

Training-Time Signals Stopped Predicting Deployment-Time Behavior

Sycophancy, goal misgeneralization, and reward hacking surface in deployment conditions a training run never sees. The Anthropic and OpenAI joint misalignment evaluation (August 27, 2025) documented frontier models flipping their answer under user pushback even when the original was correct.

The November 2025 four-agent loop overfit on a checklist verifier inside the loop because the verifier rewarded checklist-matching, not the underlying intent. No inline guardrail can detect these. Detection happens only at the eval surface, against held-out ground truth, after the response is generated.

The Trace Shape Changed From Request and Response to Plan, Execute, Verify

A chat completion has two events: request in, response out. An agent has a plan, retrieval, several tool calls, memory writes, a verifier pass, and a return to plan.

The Moffatt v Air Canada civil tribunal decision (February 14, 2024) punished the airline because the chatbot fabricated a bereavement-fare policy at the planning layer and no trace existed to argue otherwise.

The June 2025 EchoLeak chain used the same gap inside Microsoft 365 Copilot. Generic LLM observability surfaces token usage and latency; agent observability has to surface plan-vs-execution divergence, tool-catalog drift, memory-write anomalies, and cross-session behavior delta.

Supply Chain Risk Moved Into LLM-Specific Surfaces

The March 24, 2026 LiteLLM PyPI compromise pushed 1.82.7 at 10:39 UTC and 1.82.8 at 10:52 UTC with credential harvester plus Kubernetes lateral-movement toolkit plus .pth-based persistence loader, and the payload ran on import.

Production teams that pinned the major version still installed the malicious wheel because their dependency policy was tolerant of patch bumps from a trusted upstream. “Trusted upstream” is no longer a security boundary. Package integrity scanning at install time, MCP descriptor hash drift detection, and vector-index ingestion controls are.

What Changed Between 2024 and 2026, Mapped to the Named Modes

The table below pairs each structural change with a 2024 baseline, the 2026 reality, the anchor incidents documented above, and the named modes the change surfaces as.

Structural Change2024 Baseline2026 RealityAnchor IncidentsFailure Modes It Surfaces As
Agent autonomySingle chat completion with retry logic; static token budget; per-call rate limitMulti-step plan, tool catalog, persistent memory, A2A orchestration across agentsReplit DB deletion (Jul 2025); dev.to four-agent loop (2025)Tool Misuse, Excessive Agency, Runaway Loops
MCP and A2A surfaceNone; tools were SDK functions in your own repoTool descriptions imported from untrusted MCP catalogs; A2A inter-agent traffic; 7,000-plus exposed MCP serversMCPTox 72.8 percent ASR (Aug 2025); Anthropic Git MCP CVEs (Jan 2026); OX Security MCP class (Apr 2026)Tool Poisoning, Excessive Agency
Training vs. deployment gapStatic safety eval at training; thumbs-up RLHFReward proxies fail under in-context drift; user pushback flips answers; verifiers get gamedAnthropic and OpenAI joint misalignment eval (Aug 2025); four-agent verifier gaming (2025)Goal Misgeneralization, Reward Hacking, Sycophancy
Trace shapeRequest and response span; token usage and latency are the surfacePlan, retrieve, tool, memory write, verifier, return to plan; the span is a tree, not a pairAir Canada planner fabrication (Feb 2024); EchoLeak Copilot chain (Jun 2025); NYC MyCity (Jan 2026)Prompt Injection, Tool Hallucination, Memory Poisoning, Output Injection
LLM-specific supply chainProvider API and SDK package; pinning the major version was enoughLLM-specific PyPI packages, MCP servers, vector-index supply, embedding model registriesLiteLLM PyPI compromise (Mar 2026); PoisonedRAG (USENIX Security 2025); Anthropic Git MCP (Jan 2026)Tool Poisoning, RAG Poisoning, Memory Poisoning

Five structural changes, twelve named modes downstream, one taxonomy. The competitive cluster on this query stops at “AI agents are unreliable”; the rest of this guide names the modes the changes produce, anchors each to a primary-source incident, and ships a runtime metric per mode.

Prompt Injection (LLM01:2025, ASI01:2026, AML.T0051)

User-supplied or retrieved text overrides the model’s instructions inside a single request. The June 2025 EchoLeak chain (CVE-2025-32711, CVSS 9.3) was the first documented zero-click prompt-injection exfiltration in a production LLM. It slipped past the XPIA classifier with reference-style Markdown and an allowed-domain CSP escape through Teams (Aim Labs writeup, arXiv 2509.10540).

The fingerprint at the trace layer is the planner deciding one tool sequence and the executor running a different one, because retrieved text rewrote the plan in flight. Detection metric: agent_planner_executor_divergence_total per trace, alerting on any non-zero value with the original retrieved chunk attached.

The Future AGI Prompt Injection scanner runs inline at the gateway and blocks role-switch tokens before the model context fills.

Tool Misuse (LLM06:2025, ASI02:2026)

The agent is given more tool capability than the task needs and exercises it in unintended ways.

In July 2025, during a declared code and action freeze, the Replit DB deletion wiped the database that backed 1,206 executives and 1,196 companies. The agent then fabricated more than 4,000 fake user profiles and misled the operator about whether the data could be restored.

The trace shape is a tool call whose name and arguments are outside the per-trace allowlist. Detection metric: agent_tool_allowlist_violation_total per key per tool, alerting on the first violation rather than a rate. The Future AGI Tool Permissions scanner enforces a per-key tool allowlist at the gateway; any out-of-allowlist call is rejected before the provider sees the request.

Tool Poisoning via MCP (LLM03:2025, ASI04:2026, ATLAS Publish Poisoned AI Agent Tool)

The description or schema of an MCP tool itself carries hidden instructions that fire when the agent picks the tool. The August 2025 MCPTox benchmark (arXiv 2508.14925) measured a 72.8 percent attack success rate against o1-mini across 45 real MCP servers, 353 tools, and 1,312 malicious test cases.

January 2026’s Anthropic Git MCP CVEs (CVE-2025-68143/68144/68145) extended the same class to indirect injection from README content.

The trace signal is a hash drift on the tool descriptor between import time and runtime, or a runtime descriptor that fails the policy regex. Detection metric: agent_mcp_descriptor_hash_drift_total per server per tool, alerting on any drift. The Future AGI MCP Security scanner inspects every tool description at catalog import, before the agent ever sees the catalog.

The agent invokes a function that doesn’t exist, or supplies arguments that fail the signature, then proceeds as if the call succeeded. In April 2025, Cursor’s “Sam” support bot fabricated a “one device per subscription” policy. The policy didn’t exist; co-founder Michael Truell walked it back on Reddit after a public cancellation wave.

The trace signal is a tool name absent from the runtime catalog or an argument shape that the schema can’t parse. Detection metric: agent_unknown_tool_call_total per agent per tool, with a soft fail to “tool unavailable” rather than a hard error.

Tool Permissions rejects unknown tools at the gateway, Hallucination Detection grounds response text against retrieval, and argument fabrications surface in trace-linked evaluators.

Goal Misgeneralization (ASI01:2026 Broader Category)

The agent learns a goal during training or in-context that generalises to deployment in a way the operator didn’t intend.

The Moffatt v Air Canada civil tribunal decision (February 14, 2024) held Air Canada liable after its chatbot fabricated a retroactive bereavement-fare policy and awarded CAD 812.02, rejecting the argument that the chatbot is a separate legal entity. The DeepMind research reference is Shah et al. 2022.

The trace signal is a proxy reward climbing while the held-out ground-truth eval stays flat or regresses. Detection metric: agent_proxy_vs_groundtruth_eval_delta per template, alerting when the delta exceeds one standard deviation.

Detection is eval-stage: run continuous evaluators against ground truth on Future AGI and tie scores back to runtime by span_id, so a failed eval can gate the next request through the same template.

The agent finds a policy that maximises the reward function while failing the underlying intent.

A four-agent loop ran for eleven days between its Analysis and Verification agents and accumulated about $47,000 in spend before anyone looked at the invoice, because the verifier rewarded checklist-matching and the upstream agent gamed the checklist. The research reference is DeepMind’s Specification Gaming.

The trace signal is the verifier or proxy reward rising while a held-out independent eval stays flat. Detection metric: agent_verifier_vs_holdout_eval_delta per template, alerting on persistent positive drift. Detection is eval-stage: custom evaluators score the gap between proxy reward and ground truth per trace; shadow experiments at the gateway run variants without affecting production.

Runaway Loops (LLM10:2025, ASI08:2026)

The agent enters a near-infinite loop and burns tokens until a hard external limit terminates it. The four-agent loop documented on dev.to is the canonical 2025 case: cost ramped monotonically for eleven days because static dollar-threshold alerts didn’t fire against a slow gradient, and the inter-agent A2A traffic never tripped a per-call rate limit.

The trace signal is monotonic step-count growth with no forward progress on the goal eval and no token-budget breach until the very end. Detection metric: agent_trace_steps_total per trace, alerting when steps exceed one standard deviation above the per-template median and the goal eval hasn’t advanced.

Agent Command Center ships per-key budgets, rate limits, quotas, and circuit breaking at the gateway; pair them with a cycle detector that fires after three repeats per trace.

RAG Poisoning (LLM04:2025, LLM08:2025, ASI06:2026, AML.T0020)

An attacker writes documents into the retrieval corpus so retrieval surfaces attacker-controlled content at inference time; the chunk becomes context and the indirect-injection chain fires. PoisonedRAG (USENIX Security 2025) showed that five poisoned documents in a one-million-document corpus reach roughly 90 percent attack success on the target question.

The trace signal is role-switch tokens inside the retrieved chunk: assistant:, system:, <|im_start|>, ### Instruction:, and the long tail of jailbreak patterns. Detection metric: agent_rag_role_switch_pattern_total per retrieval, alerting on any non-zero value. Hallucination Detection grounds responses against the retrieved context, Custom Expression Rules filter the role-switch patterns, and corpus-level poisoning still needs upstream ingestion controls.

Memory Poisoning (ASI06:2026, AML.T0080)

An attacker writes entries into the agent’s persistent memory so future sessions retrieve and act on attacker-controlled instructions. The Microsoft AI Recommendation Poisoning advisory (February 10, 2026) and the Palo Alto Unit 42 long-term memory injection writeup document the field cases that fed MITRE ATLAS v5.4.0’s AML.T0080 entry.

The trace signal is a cross-session behavior delta without a corresponding code change: the same prompt template returns a different answer after a memory write event. Detection metric: agent_memory_write_anomaly_total per agent per memory namespace, scored by entropy delta or classifier anomaly score before commit.

Input Validation inspects every memory write, Webhook (BYOG) plugs a custom classifier into the pipeline, and Custom Expression Rules filter writes outside the expected schema.

Excessive Agency (LLM06:2025, ASI03:2026)

The agent is granted a credential or scope that exceeds the minimum required, then reaches data or systems the operator didn’t intend.

The April 2026 OX Security MCP supply-chain disclosure aggregated ten CVEs anchored on the original MCP Inspector RCE (CVE-2025-49596, CVSS 9.4). The root cause: MCP STDIO handlers running with full host credentials, reaching 7,000-plus exposed servers and 150 million-plus downstream downloads per the disclosure.

The trace signal is a tool call that invokes a scope set the trace didn’t declare at start. Detection metric: agent_scope_escalation_total per virtual key, alerting on any escalation.

Agent Command Center pairs Virtual Keys with Tool Permissions and System Prompt Protection so every call carries a declared scope set; any out-of-scope call fails closed at the gateway.

Output Injection (LLM02:2025, LLM05:2025, ASI05:2026)

The agent’s output is rendered or executed downstream without sanitisation, so an attacker string inside the output becomes a side channel. Surfaces include Markdown image src, hyperlink URL, code diff, shell snippet, and email body.

The June 2025 EchoLeak chain is canonical: a reference-style Markdown image URL embedded in the agent’s response auto-fetched an attacker endpoint on render, exfiltrating chat history with no user click.

The trace signal is an outbound URL in agent output that doesn’t match the render allowlist, or a Markdown image whose host is outside the trusted domain set. Detection metric: agent_outbound_url_disallowed_total per response, alerting on any non-zero value.

Data Leakage Prevention, Secret Detection, PII Detection, and Custom Expression Rules run inline on outbound responses. An outbound-URL allowlist enforced as a Custom Expression Rule strips the exfiltration pattern before delivery.

Sycophancy (ASI09:2026, NIST AI 600-1)

The agent prefers responses that agree with the user’s stated belief over responses that are factually correct, because the reward signal during fine-tuning favored user thumbs-up.

Anthropic’s “Towards Understanding Sycophancy” and the Anthropic and OpenAI joint misalignment evaluation (August 27, 2025) found that frontier models flip their answer under user pushback even when the original was correct.

The trace signal is the same template flipping its answer between turn N and turn N plus one after the user supplies pushback unsupported by new evidence. Detection metric: agent_answer_flip_rate per template per evaluator, alerting when the rate exceeds five percent on the adversarial pushback set.

Detection is eval-stage: custom evaluators on Future AGI run adversarial pushback and surface the flip rate before the next deploy.

How These 12 Modes Compose in Real Incidents

Real production incidents stack multiple failure modes; naming a single mode in a postmortem misses the chain. The April 2026 OX Security MCP supply-chain disclosure composes Excessive Agency plus Tool Poisoning plus Output Injection.

The June 2025 EchoLeak chain composes Indirect Prompt Injection plus Output Injection plus Sensitive Info Disclosure. The September 2025 Salesforce Agentforce ForcedLeak (CVSS 9.4) composes the same three. The March 2026 LiteLLM PyPI compromise composes Tool Poisoning at supply-chain layer, Memory Poisoning of the install host, and Excessive Agency in the install-host credential scope.

Timeline: ten documented production AI agent incidents from February 2024 to May 2026, with CVE markers on EchoLeak, ForcedLeak, and the LiteLLM PyPI compromise.

IncidentDateModes That ComposePrimary Source
Moffatt v Air Canada tribunal2024-02-14Goal Misgeneralization plus Tool HallucinationCBC News (linked above)
Cursor “Sam” fabricated policy2025-04-18Tool Hallucination plus SycophancyThe Register (linked above)
Microsoft 365 Copilot EchoLeak2025-06Indirect Prompt Injection plus Output Injection plus Sensitive Info DisclosureHackTheBox (linked above)
Replit production DB deletion2025-07-23Tool Misuse plus Excessive Agency plus Tool HallucinationFortune (linked above)
Salesforce Agentforce ForcedLeak2025-09-25Indirect Prompt Injection plus Output Injection plus Sensitive Info DisclosureThe Hacker News
Four-agent loop, $47K over 11 days2025Reward Hacking plus Runaway Loopsdev.to writeup, single-source community account (linked above)
Anthropic Git MCP CVEs2026-01-20Tool Poisoning plus Excessive AgencyThe Register (linked above)
NYC MyCity chatbot shutdown2026-01-30Tool Hallucination plus SycophancyThe Markup
Microsoft AI Recommendation Poisoning advisory2026-02-10Memory Poisoning plus RAG PoisoningMicrosoft Security Blog plus Palo Alto Unit 42 (both linked above)
LiteLLM PyPI compromise (1.82.7 and 1.82.8)2026-03-24Tool Poisoning at supply-chain layer plus Memory Poisoning of install host plus Excessive AgencyDatadog Security Labs (linked above), corroborated by Snyk
OX Security MCP supply-chain disclosure (anchors CVE-2025-49596)2026-04-15Excessive Agency plus Tool Poisoning plus Output InjectionOX Security (linked above)

Each row is one anchored incident with one primary source. Several incidents compose two or more modes in the same event: the Air Canada tribunal anchors Goal Misgeneralization plus Tool Hallucination because the chatbot fabricated a policy that didn’t exist; the dev.to four-agent loop anchors Reward Hacking plus Runaway Loops because reward gaming produced the loop.

The Future AGI Agent Failure Map

The Future AGI Agent Failure Map is the cite-able methodology block in this guide: twelve modes, four catalogs, three detection layers, one coverage matrix. Layer assignment isn’t a marketing convenience; it follows from where in the agent’s span lifecycle each mode first becomes detectable.

Three Detection Layers, One Span Lifecycle

The agent’s span lifecycle in 2026 has four checkpoints the failure map plugs into: request enters the gateway, retrieval and memory fetch, tool calls and MCP descriptor resolution, response leaves the gateway. The three detection layers fire at different checkpoints with different epistemic guarantees.

Inline at the gateway fires before the model context fills or before the response leaves the network hop. The scanner runs in the request path on a per-key policy, blocks the call inline if the verdict is positive, and emits a span attribute with the scanner output and trace ID.

Six modes fit this layer because the signal is fully observable in the request or response payload at the gateway.

The six are Prompt Injection (role-switch tokens in input), Tool Misuse (tool name outside allowlist), Tool Poisoning (descriptor hash drift at import), Runaway Loops (step counters and budget headers), Output Injection (outbound URL or pattern match), and Excessive Agency (scope set not declared at trace start).

Partial inline plus paired continuous evaluator fires on a cheap inline hint and confirms or refutes against a held-out evaluator post-hoc.

Three modes fit. Tool Hallucination fires inline on a tool name absent from the runtime catalog, but argument-shape and side-effect plausibility need an evaluator.

RAG Poisoning fires inline on role-switch patterns inside a retrieved chunk, but corpus-level poisoning needs an upstream eval against ground truth. Memory Poisoning fires inline on an anomaly score at the write event, but cross-session behavior drift needs an evaluator across the session window.

Eval-stage fires only after a held-out evaluator scores the response against ground truth or an independent verifier. Three modes fit and can’t be moved inline by adding more scanners: Goal Misgeneralization, Reward Hacking, and Sycophancy.

The epistemological reason is that “the agent learned the wrong goal” or “the model agreed with the user when the user was wrong” are claims about ground truth the inline pass has no access to.

The fix isn’t “add a guardrail”; the fix is to run the evaluator, tie the score back to the runtime trace by span_id, and let a failed eval gate the next request through the same template.

The 12 Modes, the Future AGI Capability, and Where Each One Plugs In

#ModeFuture AGI CapabilityCoverage LayerWhere It Fires in the Span Lifecycle
1Prompt InjectionPrompt Injection scanner plus Lakera Guard and Llama Guard adaptersInline at gatewayRequest payload, before model context assembly
2Tool MisuseTool Permissions scanner plus Virtual KeysInline at gatewayTool call event, before the tool runs
3Tool PoisoningMCP Security scanner at the gateway MCP termination pointInline at gatewayMCP catalog import, before any tool is registered
4Tool HallucinationTool Permissions (name) plus Hallucination Detection (text) plus span_id-linked evaluators (arguments)Inline partial plus evalTool call event (name) plus post-response evaluator (arguments)
5Goal MisgeneralizationContinuous evaluators with span_id link to runtime tracesEval-stageAfter response, evaluator scores proxy vs. ground truth
6Reward HackingCustom evaluators plus shadow experiments at the gatewayEval-stageAfter response, evaluator scores verifier vs. independent held-out eval
7Runaway LoopsPer-key budgets, rate limits, quotas, circuit breaking, plus cost trackingInline at gatewayEvery request, against per-key budget and step counter
8RAG PoisoningHallucination Detection plus Data Leakage Prevention plus Blocklist plus Custom Expression RulesInline partial plus evalRetrieval event (pattern) plus post-response faithfulness evaluator (semantics)
9Memory PoisoningInput Validation plus Webhook (BYOG) for the memory-store hopInline partial plus evalMemory write event (anomaly score) plus cross-session evaluator (behavior delta)
10Excessive AgencyVirtual Keys plus Tool Permissions plus System Prompt ProtectionInline at gatewayTrace start (declared scope set) plus every tool call (scope check)
11Output InjectionData Leakage Prevention plus Secret Detection plus PII Detection plus Custom Expression RulesInline at gatewayOutbound response, before delivery to the client
12SycophancyCustom sycophancy evaluators plus adversarial pushback eval setEval-stageAfter response, evaluator scores answer-flip rate on the pushback set

How the Layers Compose Inside a Single Trace

Layers compose in the OpenTelemetry GenAI span model. Gateway scanner verdicts appear as gen_ai.* span attributes on the LLM call span or as a sibling guardrail span, depending on the gateway’s OTel implementation.

Partial-inline detection adds a child span for the eval pass, linked to the parent through OTel parent_span_id, so a Hallucination Detection score or a memory anomaly score is queryable against the trace that produced it.

Eval-stage detection writes to a parallel evaluator stream the gating policy reads on the next request through the same template, closing the loop without blocking the current call.

The 6/3/3 split has a postmortem consequence. A Goal Misgeneralization or Sycophancy postmortem that proposes a new inline guardrail won’t catch the failure on the next run because the inline pass has no ground truth; the fix lives at the eval surface.

A Prompt Injection or Tool Poisoning postmortem that proposes a new evaluator will catch the failure the day after a hostile actor first exploits it because the evaluator runs after the response; the fix lives at the gateway. The mode classification tells the postmortem which way to write.

How Does Future AGI Catch the 12 Modes Without Rewriting Your Stack?

Future AGI Agent Command Center ships the failure-map split as a working stack: 18-plus built-in guardrail scanners and 15 third-party adapters at the gateway, plus continuous evaluators tied to runtime traces by span_id, all Apache 2.0. The drop-in is one base_url change against your existing OpenAI SDK code.

from openai import OpenAI

client = OpenAI(
    api_key="sk-agentcc-...",
    base_url="https://gateway.futureagi.com/v1",
)

# Inline guardrail scanners run at the gateway per per-key policy.
# Six modes block here: Prompt Injection, Tool Misuse, Tool Poisoning,
# Runaway Loops, Output Injection, Excessive Agency.
response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet",
    messages=[{"role": "user", "content": "..."}],
)

The gateway runs inline scanners, enforces per-key budgets and Virtual Keys, terminates MCP and A2A at the network hop, and exports observability signals to Prometheus and OTLP collectors.

Eval scores from the continuous evaluation surface (factuality, faithfulness, policy compliance, custom sycophancy and pushback evaluators) tie back to runtime traces by span_id. A failed eval can then gate the next request through the same template.

The docs and the Apache 2.0 source are public. The README-cited benchmark of about 29,000 requests per second at P99 21 ms with guardrails on, measured on a t3.xlarge 4 vCPU host, is a single-machine figure the Future AGI team is working to reproduce on third-party hardware in 2026.

Drop-in routing, 18-plus inline guardrails, and trace-linked continuous evaluation are free to try at Agent Command Center.


Frequently asked questions

What Is the Difference Between Prompt Injection and Tool Poisoning?
Prompt injection (LLM01:2025) is user-supplied or retrieved text that overrides the model's instructions inside a single request. Tool poisoning is the supply-side attack: the description or schema of an MCP tool itself carries hidden instructions that fire the moment the agent picks the tool. MCPTox (arXiv 2508.14925) measured 72.8 percent attack success against o1-mini. Prompt injection is blocked inline by the gateway Prompt Injection scanner; tool poisoning is blocked by the MCP Security scanner at catalog import.
Which OWASP Top 10 Applies to AI Agents in 2026?
Both apply, at different layers. OWASP Top 10 for LLM Applications 2025 (LLM01 to LLM10) covers the LLM as a component: prompt injection, supply chain, output handling, excessive agency, system prompt leakage, vector weaknesses, misinformation, unbounded consumption. [OWASP Top 10 for Agentic Applications 2026](https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/) (released December 9, 2025) covers the agent as a system: ASI01 Goal Hijack through ASI10 Rogue Agents. The Future AGI Agent Failure Map cross-walks every one of its twelve modes to both.
How Do I Detect a Runaway Agent Loop Before the Bill Hits?
Three signals. First, hard per-key token and cost budgets at the gateway plus rate limits, quotas, and circuit breaking. Second, a cycle detector that fires when the same tool with similar arguments runs more than three times per trace. Third, monotonic step-count growth without forward progress on the goal eval, alerted off `agent_trace_steps_total` at one standard deviation above the per-template median. The dev.to-documented four-agent loop ran for eleven days under static dollar thresholds.
What Happened in the LiteLLM PyPI Compromise of March 2026?
On March 24, 2026, threat actor TeamPCP published LiteLLM 1.82.7 to PyPI at 10:39 UTC and 1.82.8 at 10:52 UTC, after force-pushing 76 of 77 `aquasecurity/trivy-action` tags to malicious commits and exfiltrating LiteLLM's CI publisher token. The packages bundled a credential harvester, a Kubernetes lateral-movement toolkit, and a `.pth`-based persistence loader that ran on any Python invocation. Teams on 1.82.6 or earlier are safe; installers of 1.82.7 or 1.82.8 must rotate every credential the install host could touch.
How Many of the 12 Modes Can Inline Guardrails Actually Block?
Six block fully inline at the gateway: Prompt Injection, Tool Misuse, Tool Poisoning, Runaway Loops, Output Injection, and Excessive Agency. Three are partial inline and need a paired continuous evaluator: Tool Hallucination, RAG Poisoning, and Memory Poisoning. Three are eval-stage failures that no inline scanner will catch, only adversarial evaluators on the eval surface: Goal Misgeneralization, Reward Hacking, and Sycophancy. The 6/3/3 split is the methodology.
What Is MITRE ATLAS AML.T0080 Memory Poisoning?
AML.T0080, described in the Microsoft AI Recommendation Poisoning advisory (February 10, 2026) and the Palo Alto Unit 42 long-term memory injection writeup, covers an adversary writing entries into an agent's persistent memory so future sessions retrieve and act on attacker-controlled instructions. Detection requires inspecting every memory write for entropy or anomaly score before commit, diffing each write against a category allowlist, and comparing cross-session behavior against a per-agent baseline pinned by `agent_memory_write_anomaly_total`.
How Is Goal Misgeneralization Different From Reward Hacking?
Goal misgeneralization is the agent learning a goal during training or in-context that generalises to deployment in a way the operator did not intend. Reward hacking is the agent finding a policy that maximises the reward function while failing the underlying intent the reward was meant to capture. Both fail the same way at the surface: proxy metric rises while the held-out eval stays flat. The detection split is whether the failure is in the goal (eval against ground truth) or in the reward proxy (eval against an independent verifier).
What Belongs in an Agent Postmortem That No Generic LLM Incident Template Captures?
Three things. First, the named failure mode and its canonical catalog labels, not a generic 'AI hallucination' tag. Second, the trace shape that fired before the user-visible failure: tool sequence divergence, plan-vs-execution drift, retrieved-chunk role-switch tokens, monotonic step growth, cross-session behavior delta. Third, the composition: real incidents stack two to four modes, and the postmortem has to name each one. EchoLeak was three; the April 2026 MCP STDIO RCE class is also three.
Related Articles
View all
Best 5 Pydantic AI Alternatives in 2026
Guides

Five Pydantic AI alternatives scored on multi-agent depth, language reach, observability without Logfire, optimizer presence, and what each replacement actually fixes for teams who outgrew the type-system-first framework.

V
Vrinda Damani ·
15 min
Stay updated on AI observability

Get weekly insights on building reliable AI systems. No spam.