
The 2026 LLM Incident Response Playbook

LLM incidents need a different playbook than API incidents: six steps, four incident classes, and a postmortem that becomes a golden-set entry to close the loop.


11:07 pm. A Slack alert fires: the support agent’s factual_grounding score has dropped 11 points across the last 18 minutes. A second alert lands a moment later — Error Feed has clustered 47 failing traces into a new issue, and the Sonnet 4.5 Judge has written an immediate_fix that says the prompt template is concatenating an empty retrieval block. A Linear ticket is already on the queue. The on-call flips the gateway routing rule for the refund route to the previous prompt version, watches x-prism-fallback-used: true come back on the next 200 requests, then starts root cause. By midnight the postmortem outline is open and the golden set has a new entry.

This is what an LLM incident looks like when the playbook is wired. LLM incidents need a different playbook than API incidents. A broken endpoint maps to a single commit and a single rollback. A drifting rubric maps to a prompt edit or a quiet model-version push or a RAG re-index or an upstream tool change — any of which can produce the same symptom on eight percent of traffic before anyone notices. The runbook has to accommodate that shape.

The opinion this post earns: six steps, four incident classes, one loop closer. Detect, triage, contain, eval, fix, review. The four classes are hallucination, jailbreak, drift, and PII leak. The loop closer is the postmortem becoming a golden-set entry. Everything else is shape around those three primitives.

Why LLM incidents need their own playbook

Three things diverge from the classical SRE shape.

The failure is distributional. There’s no single broken line of code. The bug shows up on eight percent of inputs that share a structural feature no one wrote a test for. By the time the rolling mean shifts, bad output has shipped to several thousand users. The detection surface has to grade score distributions, not error logs.

Rollback isn’t clean. The model gave a wrong answer at 4:32 pm; the conversation thread still carries that answer. Retrieved chunks summarised by the bad version got cached downstream. A reverted prompt fixes new traffic but in-flight sessions still carry the bug.

RCA is a research question. The prompt was edited yesterday. The provider pushed a quiet version update overnight. The RAG corpus added 12k new documents this morning. Any one could produce the symptom. The runbook has to triage which before the fix lands.

If your incident process maps one broken commit to one rollback, it’ll miss the most common LLM failure shape.

TL;DR: six steps, four classes

| Step | Owner | Artifact | Gate |
|---|---|---|---|
| Detect | Monitor or on-call | Alert + Linear ticket with cluster | Page fires |
| Triage | On-call | Class label + severity (S1/S2/S3) | Class named |
| Contain | On-call | Gateway routing flip; cohort scoped | Route on known-safe fallback |
| Eval | Eval owner | Bug class run on golden set | Pass or named gap |
| Fix | Author of the change | Patch PR or config flip | Patch merged |
| Review | Incident commander | Blameless postmortem + golden-set entry | New entry committed |

| Class | What you see | Where the fix lives |
|---|---|---|
| Hallucination | Factual error against supplied context | Generator rubric, judge calibration |
| Jailbreak | Refusal contract bypassed | Output scanner, refusal eval, adversarial set |
| Drift | Rolling-mean rubric score dropped on a route | Eval gate threshold, prompt or model version |
| PII leak | Email, phone, SSN, API key, cross-tenant data surfaced | Output-side scanner, retrieval scoping, tenant guard |

If you only build three pieces: detection, containment, the review-to-golden-set loop. The others compound on top.

Step 1: detect

An LLM incident starts when the system tells you it broke, not when a customer tweets that it did. Three signals catch the four classes.

Per-rubric rolling-mean drift. Sampled production traces score against the same rubrics that gate CI. A 2-to-5-point drop sustained over 15 to 60 minutes on a per-route, per-prompt-version basis is the canonical threshold. Catches prompt regressions, model-version drift, and RAG shifts that hurt quality without changing error rates.
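A minimal sketch of the rolling-mean check, assuming sampled rubric scores for one route and prompt version arrive as (timestamp, score) pairs on a 0-to-100 scale. The `DriftDetector` class, its defaults, and the baseline value are illustrative assumptions, not part of any SDK.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DriftDetector:
    """Flags a drop of the rolling-mean rubric score below a fixed baseline."""
    baseline: float                 # expected mean score for this route (assumed 0-100)
    drop_threshold: float = 3.0     # points below baseline that count as drift
    window_s: int = 15 * 60         # rolling window length in seconds
    min_samples: int = 20           # do not alert on a handful of traces
    _samples: deque = field(default_factory=deque)

    def observe(self, ts: float, score: float) -> bool:
        """Add one sampled score; return True if the window mean has drifted."""
        self._samples.append((ts, score))
        # Evict samples that have fallen out of the rolling window.
        while self._samples and self._samples[0][0] < ts - self.window_s:
            self._samples.popleft()
        if len(self._samples) < self.min_samples:
            return False
        mean = sum(s for _, s in self._samples) / len(self._samples)
        return self.baseline - mean >= self.drop_threshold
```

One detector per (route, prompt version) pair keeps a regression on one route from being averaged away by healthy traffic on the others. The sketch compares a single window mean to the baseline; requiring several consecutive drifted windows before paging would be one way to model the "sustained" part of the threshold.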

Error Feed clusters. Failing traces cluster around shared structural features. Error Feed uses HDBSCAN soft-clustering over ClickHouse-stored embeddings to group failures into named issues at prob >= 0.4. The 5-category 30-subtype taxonomy (covered in AI agent failure modes (2026)) gives the alert a name a human can route on.
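To make the `prob >= 0.4` cut-off concrete, here is a small sketch of the assignment rule, assuming each failing trace already has a soft-clustering membership vector (one probability per candidate cluster). The function name and the input shape are hypothetical; how Error Feed computes or stores the vector is not shown here.

```python
def assign_issue(membership, min_prob=0.4):
    """Return the index of the most likely cluster, or None if below min_prob.

    `membership` is a sequence of per-cluster probabilities for one trace.
    A None result means the trace stays unclustered (noise).
    """
    if not membership:
        return None
    best = max(range(len(membership)), key=membership.__getitem__)
    return best if membership[best] >= min_prob else None
```

A trace with membership `[0.1, 0.7, 0.2]` joins cluster 1, while `[0.3, 0.35]` stays unassigned because no cluster clears the threshold.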

Customer escalation. A support ticket flagged with the model-output tag auto-opens an incident and pulls the underlying trace by session.id. The two signals above catch most things; this catches the rest.

The Linear ticket carries triage context. The Sonnet 4.5 Judge agent on Bedrock runs across the cluster with a 30-turn budget and 8 span tools, with a Haiku Chauffeur sub-agent summarising spans over 3000 characters and a ~90 percent prompt-cache hit ratio. Per cluster the Judge writes an immediate_fix proposal, evidence quotes, and a 4-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, each 1 to 5). The score lets the on-call sort 47 member traces by severity without reading all 47. Linear OAuth ships today; Slack, GitHub, Jira, PagerDuty on the roadmap. For depth, see error analysis for LLM applications.
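One way the severity sort could work, sketched under the assumption that each member trace carries its own copy of the four scores. `ScoredTrace` and `worst_first` are hypothetical names; only the four dimension names come from the text above.

```python
from dataclasses import dataclass

DIMENSIONS = (
    "factual_grounding",
    "privacy_and_safety",
    "instruction_adherence",
    "optimal_plan_execution",
)


@dataclass(frozen=True)
class ScoredTrace:
    trace_id: str
    scores: dict  # dimension name -> score in 1..5, where 1 is worst


def worst_first(traces):
    """Order traces so the lowest single-dimension score comes first.

    Ties break on the sum across dimensions, then on trace_id for stability.
    """
    def key(trace):
        values = [trace.scores[d] for d in DIMENSIONS]
        return (min(values), sum(values), trace.trace_id)

    return sorted(traces, key=key)
```

Sorting on the minimum rather than the average keeps a trace with one catastrophic dimension (for example privacy_and_safety at 1) ahead of a trace that is mediocre everywhere.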

Step 2: triage: name the class

The first call the on-call makes is which class. Naming it routes everything downstream; calling every page “the agent hallucinated again” is why patches stop compounding.

Hallucination. The factual_grounding axis dropped below 3 while retrieval rubrics held stable. The supplied context was correct; the generator confabulated. Usually S2; escalates to S1 on regulated routes (medical, financial, legal).

Jailbreak. The instruction_adherence or privacy_and_safety axis dropped. The refusal contract was bypassed — the model complied with an instruction the system prompt should have refused. Adversarial pushback or a fresh jailbreak template is the usual root. S1 if a regulated category breached; S2 if the refusal was soft.

Drift. The rolling-mean rubric score dropped across a route with no code deploy. Four candidate causes: prompt edit, quiet provider version push, RAG corpus change, downstream API shape change. Severity scales with the drop magnitude and affected user count.

PII leak. The privacy_and_safety axis hit 1 or 2 on at least one trace. Email, phone, SSN, API key, or cross-tenant data surfaced in a response. Always S1. Regulatory exposure (SOC 2, HIPAA, GDPR) starts the moment the leak shipped; containment must happen before RCA, not after.

Severity rubric on top of the class: S1 is safety, leak, or regulatory exposure (contain first, RCA second). S2 is quality regression with no safety impact (RCA first, then rollback). S3 is edge case in one tenant (defer; add to golden set). Teams without this taxonomy end up over-paging (everyone burns out) or under-paging (the leak ships for six hours). Best AI agent failure detection tools walks the broader signal space.
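The class and severity rules above can be sketched as a first-pass triage function. This is a simplification under stated assumptions: the `< 3` cut-off on instruction_adherence, the flat S2 for drift, and every name in the block are illustrative, and a human still makes the final call.

```python
from enum import Enum


class IncidentClass(Enum):
    HALLUCINATION = "hallucination"
    JAILBREAK = "jailbreak"
    DRIFT = "drift"
    PII_LEAK = "pii_leak"


def triage(scores, *, retrieval_stable, rolling_mean_dropped,
           code_deployed, regulated_route):
    """Return (class, severity) from Judge scores (1-5) and route context.

    The leak check runs first because a PII leak is always S1.
    Returns (None, None) when no rule matches and a human has to decide.
    """
    if scores["privacy_and_safety"] <= 2:
        return IncidentClass.PII_LEAK, "S1"
    if scores["instruction_adherence"] < 3:  # assumed cut-off
        return IncidentClass.JAILBREAK, ("S1" if regulated_route else "S2")
    if scores["factual_grounding"] < 3 and retrieval_stable:
        return IncidentClass.HALLUCINATION, ("S1" if regulated_route else "S2")
    if rolling_mean_dropped and not code_deployed:
        # Real severity scales with drop size and user count; S2 is a placeholder.
        return IncidentClass.DRIFT, "S2"
    return None, None
```

Encoding the order of checks is the useful part: it makes "contain the leak first" a property of the tooling rather than something the on-call has to remember at 11 pm.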

Step 3: contain

Containment is where the LLM playbook diverges most from classical SRE. Classical containment is “roll back the deploy.” LLM containment is “flip the gateway routing rule so the affected route falls through to a known-safe configuration.”

The Agent Command Center is the containment primitive. Six native adapters plus OpenAI-compatible presets and self-hosted backends. Six routing strategies (shadow, mirror, race, canary, fallback, primary) reconfigurable at runtime. Circuit breaker on [429, 500, 502, 503, 504]. Per-tenant routing rules to scope blast radius. Response headers (x-prism-routing-strategy, x-prism-fallback-used, x-prism-cost, x-prism-latency-ms, x-prism-model-used, x-prism-guardrail-triggered) confirm the flip on the wire.
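Confirming the flip on the wire can be as simple as counting headers. The sketch below assumes the on-call has collected the response headers of recent requests as plain mappings; `fallback_ratio` is a hypothetical helper, and only the `x-prism-fallback-used` header name and its `true` value come from the text.

```python
def fallback_ratio(responses):
    """Fraction of responses whose headers report that the fallback was used.

    `responses` is an iterable of header mappings. Header names are compared
    case-insensitively, since HTTP field names are case-insensitive.
    """
    total = used = 0
    for headers in responses:
        lowered = {name.lower(): value for name, value in headers.items()}
        total += 1
        flag = str(lowered.get("x-prism-fallback-used", "")).strip().lower()
        if flag == "true":
            used += 1
    return used / total if total else 0.0
```

A ratio near 1.0 on the affected route after the flip suggests containment took effect; anything lower points at a cohort the routing rule missed.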

Per class:

  • Hallucination. Flip the route to the previous prompt version on the previous model. Tighten RailType.OUTPUT with Groundedness + FactualAccuracy at AggregationStrategy.ANY.
  • Jailbreak. Tighten the inline scanner set. The 8 SDK Scanners (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner) run inline as a sub-10ms pre-filter. Add the bypass template to the adversarial pushback set in the same pass.
  • Drift. Flip to the last known-good (prompt_version, model_version) tuple. Buys time for RCA; doesn’t fix root cause.
  • PII leak. Highest urgency. Flip the inline output scanner to PII Detection + Data Leakage Prevention + Secret Detection at AggregationStrategy.ANY with strict thresholds. Per-tenant audit log on every request from the affected window. Notify legal in parallel.
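The flip itself, with per-tenant scoping, can be modelled as an override table in front of the default routes. This is a toy in-memory sketch, not the gateway's configuration format; `RouteConfig`, `RoutingTable`, and the version strings in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteConfig:
    prompt_version: str
    model_version: str


class RoutingTable:
    """Default routes plus runtime overrides that can be scoped to one tenant."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)   # route -> RouteConfig
        self._overrides = {}              # (route, tenant or None) -> RouteConfig

    def flip(self, route, safe_config, tenant=None):
        """Point a route (optionally for one tenant only) at a known-safe config."""
        self._overrides[(route, tenant)] = safe_config

    def revert(self, route, tenant=None):
        """Remove an override once the fix has been verified."""
        self._overrides.pop((route, tenant), None)

    def resolve(self, route, tenant):
        # A tenant-scoped override wins, then a route-wide override, then the default.
        for key in ((route, tenant), (route, None)):
            if key in self._overrides:
                return self._overrides[key]
        return self._defaults[route]
```

Calling `flip("refund", RouteConfig("v41", "model-a"), tenant="acme")` changes what `resolve("refund", "acme")` returns while every other tenant keeps the default, which is the blast-radius scoping described above. Keeping `revert` next to `flip` makes the postmortem action item "revert the routing flip" a one-line operation.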

Best AI gateways for compliance and audit trails covers the eval-driven rollout patterns that make runtime flips safe.

Step 4: eval: run the bug class on the golden set

The route is on a known-safe configuration; users are unblocked. The next question is sharper than “what broke” — did the golden set already contain this class, and did the eval gate catch it?

Three outcomes from running the cluster’s representative traces through the offline suite:

  1. Golden set has the class; rubric catches it. The regression slipped past CI because the bug lives in production code the golden set doesn’t exercise on this path. Fix in step 5; gate is fine.
  2. Golden set has the class; rubric misses it. The rubric is stale or too lenient. Fix must include a rubric tighten or threshold adjustment. Most common when the rubric is more than three months old.
  3. Golden set doesn’t have the class. The CI gate had nothing to catch. Fix must include a new EvalTemplate and a new golden-set entry. Most common in the first six months of a product.
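The three outcomes can be expressed as a small decision function. The sketch assumes golden-set entries are dicts with a `class` key and that a rubric is any callable returning a numeric score where lower is worse; the function name and the outcome labels are illustrative.

```python
def gate_outcome(golden_set, incident_class, rubric, pass_threshold, traces):
    """Classify why the eval gate did or did not stop this incident.

    Returns one of:
      "golden_set_gap": no entry of this class exists (outcome 3)
      "rubric_gap":     class exists, but the rubric passes a failing trace (outcome 2)
      "gate_ok":        class exists and the rubric fails every trace (outcome 1)
    """
    if not traces:
        raise ValueError("need at least one representative trace")
    if not any(entry["class"] == incident_class for entry in golden_set):
        return "golden_set_gap"
    # Every trace here is a known failure, so a healthy rubric scores all
    # of them below the pass threshold.
    caught = all(rubric(trace) < pass_threshold for trace in traces)
    return "gate_ok" if caught else "rubric_gap"
```

The returned label decides the fix scope in step 5: a patch only, a patch plus a rubric change, or a patch plus a new template and entry.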

The eval surface is the ai-evaluation SDK (Apache 2.0). Real Evaluator(fi_api_key=..., fi_secret_key=...).evaluate(...) API. 60+ EvalTemplate classes including Groundedness, ContextAdherence, ContextRelevance, FactualAccuracy, Toxicity, PromptInjection, DataPrivacyCompliance, AnswerRefusal, IsHarmfulAdvice, LLMFunctionCalling. 13 guardrail backends (9 open-weight plus 4 API). Four distributed runners (Celery, Ray, Temporal, Kubernetes). RailType.INPUT/OUTPUT/RETRIEVAL plus AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED.
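For readers new to aggregation strategies, the sketch below shows what ANY, ALL, MAJORITY, and WEIGHTED plausibly mean when several guardrails vote on one response. It is a standalone illustration with assumed semantics: the `Aggregation` enum and `should_block` function are hypothetical and deliberately do not use the SDK's classes, whose exact behavior may differ.

```python
from enum import Enum


class Aggregation(Enum):
    ANY = "any"
    ALL = "all"
    MAJORITY = "majority"
    WEIGHTED = "weighted"


def should_block(flags, strategy, weights=None, threshold=0.5):
    """Combine per-guardrail verdicts into one block/allow decision.

    `flags` maps guardrail name -> True if that guardrail detected a violation.
    For WEIGHTED, guardrails missing from `weights` default to weight 1.0.
    """
    if not flags:
        return False
    votes = list(flags.values())
    if strategy is Aggregation.ANY:
        return any(votes)
    if strategy is Aggregation.ALL:
        return all(votes)
    if strategy is Aggregation.MAJORITY:
        return sum(votes) * 2 > len(votes)
    if strategy is Aggregation.WEIGHTED:
        weights = weights or {}
        total = sum(weights.get(name, 1.0) for name in flags)
        hit = sum(weights.get(name, 1.0) for name, fired in flags.items() if fired)
        return total > 0 and hit / total >= threshold
    raise ValueError(f"unknown strategy: {strategy}")
```

Under these semantics ANY is the strictest setting, which is why the containment steps for hallucination and PII leaks reach for it: one guardrail firing is enough to block.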

Without this step, the fix is a vibe patch — a prompt tweak with no rubric attached, regressable on the next deploy. Your AI agent passes evals but still fails in production covers why the gap between offline pass and online drop is mathematical, not accidental.

Step 5: fix

The shape of the fix follows from the eval-step outcome.

  • Prompt rollback. The registry exposes a one-click revert; ships in seconds. Most common S2 hallucination and drift remediation.
  • Rubric tighten. A threshold gets stricter. Ships via config. The shape when the golden set had the class but the rubric was too lenient.
  • New EvalTemplate. A fresh rubric scores the missing dimension; the next CI run picks it up. The shape when the golden set didn’t have the class.
  • Retrieval re-index. The corpus is rebuilt with offending documents removed. Ships behind a shadow route so the new corpus is compared against the old before promotion.
  • Data fix. Poisoned document, malformed embedding, cross-tenant memory leak — fixed at the data layer, not the model layer.

If the fix is a prompt rewrite, agent-opt speeds the iteration loop. Six optimizers ship today: RandomSearchOptimizer, BayesianSearchOptimizer (Optuna-backed, resumable), MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer. The trace-stream-to-agent-opt connector is the active roadmap item, so failing traces have to be promoted to a dataset (the review step) before agent-opt picks them up.

Verification is offline plus online. The canary route gets the new configuration first; the same rubrics that gate offline run against canary traffic as span-attached scores via traceAI. The canary widens only after rolling-mean rubric scores match or beat baseline.
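The "match or beat baseline" rule for widening the canary can be sketched as follows, assuming per-rubric score samples are available for both the baseline and the canary cohort. The function name, the `min_samples` floor, and the `tolerance` knob are assumptions for illustration.

```python
from statistics import mean


def canary_may_widen(baseline_scores, canary_scores, *, min_samples=50, tolerance=0.0):
    """True only if every rubric's canary mean matches or beats its baseline mean.

    Both arguments map rubric name -> list of scores. A rubric with too few
    canary samples, or with no baseline samples, blocks the widen.
    """
    for rubric, base in baseline_scores.items():
        canary = canary_scores.get(rubric, [])
        if not base or len(canary) < min_samples:
            return False
        if mean(canary) < mean(base) - tolerance:
            return False
    return True
```

Failing closed on thin data is a deliberate choice in this sketch: a canary that has seen only a dozen requests has not earned a wider rollout, whatever its mean says.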

Step 6: review: the postmortem becomes a golden-set entry

This step is what makes the playbook compound. Without it, each incident produces a one-off fix and the team writes the same regression next month. With it, the golden set ratchets every quarter and the CI gate catches more.

Six sections, plus one LLM-specific seventh.

  1. Summary. What happened, when, who was affected, how it was contained. Class label and severity included.
  2. Timeline. Detect, triage, contain, eval, fix, verification, customer comms. Per-step timestamps. Trace id and Error Feed cluster id included.
  3. Root cause. Prompt diff, model version diff, corpus change, or tool API diff. The upstream change that produced the symptom.
  4. Customer impact. User count, severity, sessions affected. Pull from session.id filters on traceAI.
  5. Action items. Owner-and-date. Include the routing flip’s revert and the eval-gate update.
  6. Lessons. Blameless on the engineer who pushed the button; process focus on the playbook that let the regression ship.
  7. Eval-gate review and golden-set entry. Every postmortem produces at least one new golden-set entry. This is the loop closer.

The entry is concrete: a representative trace from the cluster, committed into the offline eval set with a route tag, the class label (hallucination / jailbreak / drift / PII leak), the rubric label, and a one-line description. The next CI run grades it. The next PR touching that path has to clear it.
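A golden-set entry with those fields might be stored as one JSON line per entry. The schema below is a sketch: the field names, the `GoldenSetEntry` class, and the JSONL storage choice are assumptions, not a format the tooling prescribes.

```python
import json
from dataclasses import asdict, dataclass

ALLOWED_CLASSES = {"hallucination", "jailbreak", "drift", "pii_leak"}


@dataclass(frozen=True)
class GoldenSetEntry:
    trace_id: str        # representative trace from the incident cluster
    route: str           # route tag
    incident_class: str  # one of ALLOWED_CLASSES
    rubric: str          # rubric label that should grade this entry
    description: str     # one-line description of the failure

    def __post_init__(self):
        if self.incident_class not in ALLOWED_CLASSES:
            raise ValueError(f"unknown incident class: {self.incident_class!r}")
        if "\n" in self.description:
            raise ValueError("description must be a single line")


def to_jsonl_line(entry):
    """Serialise an entry as one JSON line, ready to append to the eval set."""
    return json.dumps(asdict(entry), sort_keys=True)
```

Validating the class label at construction time keeps the four-class taxonomy consistent between the postmortem and the eval set, so the entry can later be filtered by class when step 4 asks whether the golden set already covers it.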

This is the diff between a learning team and a recurring-incident team. The golden set ratchets. The eval gate ratchets. The CI run on the next PR catches the failure class the incident surfaced. The 2026 LLM evaluation playbook covers the six layers the postmortem feeds.

A PII-leak worked example

A healthcare assistant returns another customer’s appointment details in an answer. The on-call gets paged.

Detect. Support escalates; Error Feed has already clustered two adjacent traces. The Judge’s immediate_fix says the retrieval layer is missing a tenant filter on the appointments index.

Triage. privacy_and_safety score on the representative trace is 1. Class: PII leak. Severity: S1 — cross-tenant data, regulated route.

Contain. Route flipped to a configuration with RailType.RETRIEVAL enforcing a tenant-filter check, plus PII Detection + Data Leakage Prevention at the output layer with AggregationStrategy.ANY. x-prism-routing-strategy: fallback confirms on the wire. Per-tenant audit log captures every request from the affected window. Legal notified.

Eval. The golden set does not contain a cross-tenant retrieval test for this route. The gate had nothing to catch. Fix scope expands to a new EvalTemplate plus a new golden-set entry.

Fix. Retrieval layer’s tenant-filter check patched. New EvalTemplate checks tenant isolation on the appointments route. Index rebuilt with explicit tenant tags. Canary runs two hours; rolling-mean privacy_and_safety returns to baseline.

Review. Postmortem published 48 hours later. Three golden-set entries committed: cross-tenant retrieval, PII-in-output regression, routing-rule sanity check. The next PR touching the appointments route has to clear all three. Lessons section names the gap in retrieval-layer rubric coverage as a process failure, not a person failure.
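The tenant-isolation check at the heart of this example can be sketched in a few lines, assuming each retrieved chunk is a dict carrying a `tenant_id` tag. The function and field names are hypothetical; the real check would live in the retrieval layer and in the new eval template.

```python
def cross_tenant_chunks(request_tenant, retrieved_chunks):
    """Return the chunks that do not belong to the requesting tenant.

    A chunk with no tenant tag counts as a violation (fail closed), because
    an untagged document cannot be proven to belong to the caller.
    """
    if request_tenant is None:
        raise ValueError("request_tenant is required")
    return [
        chunk for chunk in retrieved_chunks
        if chunk.get("tenant_id") != request_tenant
    ]
```

An empty result means the retrieval step respected isolation for that request. The same function can back both the runtime guard on the fallback route and the golden-set test that the next PR has to clear.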

Five anti-patterns the playbook closes

  • No containment primitive. Without a runtime routing flip, the team waits on the deploy pipeline while users see bad output. The diff between 90-second and 90-minute containment.
  • No per-rubric alerting. If the only detection surface is a generic error rate, the first signal is a customer tweet. Rolling-mean drift catches regressions that don’t trip error logs.
  • No incident class taxonomy. Without four named classes, every page becomes S1, the team burns out, and the genuine S1s lose urgency.
  • No eval-gated verification. Shipping the fix without a gate run means the team learns whether the fix works from users.
  • No review-to-golden-set loop. Without the postmortem producing a new entry, the gate stays static while the failure space grows.

How Future AGI wires the playbook

Future AGI ships the incident-response surfaces as a connected stack. Start with the SDK and traceAI for the detection and verification gates; graduate to the Platform for self-improving rubrics that retune from postmortem feedback.

  • Error Feed. HDBSCAN soft-clustering over ClickHouse-stored embeddings. Sonnet 4.5 Judge on Bedrock, 30-turn budget, 8 span tools, Haiku Chauffeur for large spans, ~90 percent prompt-cache hit. Per-cluster immediate_fix plus 4-D severity score. Linear OAuth today.
  • Agent Command Center. Six native adapters plus OpenAI-compatible presets and self-hosted backends. Six routing strategies, circuit breaker on [429, 500, 502, 503, 504], per-tenant routing, response headers on the wire. SOC 2 Type II, HIPAA, GDPR, CCPA.
  • traceAI (Apache 2.0). 50+ AI surfaces across Python / TypeScript / Java / C#. 14 span kinds. 62 built-in evals via EvalTag. Forensic span filtering by session.id, user.id, time range, error attributes.
  • ai-evaluation (Apache 2.0). 60+ EvalTemplate classes. 13 guardrail backends. 8 sub-10ms Scanners. Four distributed runners. RailType.INPUT/OUTPUT/RETRIEVAL. Multi-modal CustomLLMJudge via LiteLLM.
  • agent-opt. Six optimizers, shared EarlyStoppingConfig, resumable Optuna studies. Trace-stream connector on the roadmap.
  • Platform. Self-improving evaluators retune from thumbs feedback. In-product authoring agent turns natural-language descriptions into rubrics. Classifier-backed scoring below Galileo Luna-2.

The chain (Slack alert → Linear ticket → gateway routing flip → golden-set eval run → fix → new golden-set entry) is the runbook the on-call follows at 11 pm. The playbook’s job is to make every step take seconds instead of hours, and to make sure the next incident of the same class doesn’t ship through the gate at all.

Frequently asked questions

Why do LLM incidents need a different playbook than API incidents?
Three reasons. The failure is distributional, so the bug ships on eight percent of traffic before any error log fires. Rollback is dirty, because cached responses, downstream summaries, and retrieved chunks carry bad output forward even after a code revert. And root cause is itself a research question — a prompt edit, a quiet model-version push, a RAG re-index, and an upstream tool change can all produce the same symptom. A playbook that maps one broken commit to one rollback will miss the shape of the failure entirely.
What are the six steps of the LLM incident response playbook?
Detect, triage, contain, eval, fix, review. Detect is an Error Feed cluster spike, per-rubric drift, or a customer escalation. Triage names the class (hallucination, jailbreak, drift, or PII leak) and the severity. Contain flips the gateway route to a known-safe configuration. Eval runs the bug class on the golden set to see if the rubric caught it. Fix ships the patch (prompt rollback, rubric tighten, retrieval re-index). Review writes the postmortem and the postmortem becomes a new golden-set entry. The loop closer is review-to-golden-set; without it, the same bug ships again next month.
What are the four LLM-specific incident classes?
Hallucination (the model invented a fact that contradicts the supplied context). Jailbreak (the model complied with an instruction the system prompt should have refused). Drift (the rolling-mean rubric score dropped on a route without a code change). PII leak (the response surfaced an email, phone, SSN, API key, or cross-tenant data). Each class has a different detection surface, a different containment primitive, and a different fix. Naming the class first is what routes the rest of the playbook.
How fast can a gateway routing flip contain an LLM incident?
Seconds, if the gateway is wired right. The Agent Command Center reads routing rules from config that can be updated at runtime, so the affected route flips to a known-safe fallback (previous prompt version on a frontier model) without a code deploy. Response headers x-prism-routing-strategy and x-prism-fallback-used confirm the flip happened on the wire. The circuit breaker on 429/500/502/503/504 means provider-side outages contain themselves. Per-tenant routing rules let blast radius stay scoped to one customer.
What does a postmortem-becomes-golden-set entry actually look like?
A representative trace from the incident cluster, committed into the offline eval set with a route tag, a rubric label, and a one-line description of the failure class. The next CI run grades it with the same rubric the production scorer used. The next PR touching that path has to clear it. Every postmortem produces at least one new entry, so the golden set ratchets up every quarter and the CI gate catches more failure classes over time. Teams that skip the promote step re-ship the same bug on a six-week cycle.
What does Error Feed do during incident response?
Error Feed is the detection-and-RCA surface. HDBSCAN soft-clustering groups failing traces into named issues. A Claude Sonnet 4.5 Judge agent on Bedrock runs a 30-turn investigation across 8 span tools, with a Haiku Chauffeur sub-agent summarising large spans at a roughly 90 percent prompt-cache hit ratio. The Judge writes an immediate_fix, evidence quotes, and a 4-dimensional severity score per cluster. A Linear ticket lands on the on-call queue with the fix proposal and member traces attached. The on-call has triage context before they finish opening Slack.
What incident-response anti-patterns make LLM outages worse?
Five. No containment primitive (you wait on the deploy pipeline while users see bad output). No per-rubric alerting (you find out from a tweet). No incident class taxonomy (every page becomes S1, the team burns out). No eval-gated verification (you ship the fix and pray). No postmortem-to-golden-set loop (the same failure class ships again next month). The six-step playbook closes all five.