The 2026 LLM Incident Response Playbook
LLM incidents need a different playbook than API incidents: six steps, four incident classes, and a postmortem that becomes a golden-set entry to close the loop.
11:07 pm. A Slack alert fires: the support agent’s factual_grounding score has dropped 11 points across the last 18 minutes. A second alert lands a moment later — Error Feed has clustered 47 failing traces into a new issue, and the Sonnet 4.5 Judge has written an immediate_fix that says the prompt template is concatenating an empty retrieval block. A Linear ticket is already on the queue. The on-call flips the gateway routing rule for the refund route to the previous prompt version, watches x-prism-fallback-used: true come back on the next 200 requests, then starts root cause. By midnight the postmortem outline is open and the golden set has a new entry.
This is what an LLM incident looks like when the playbook is wired. LLM incidents need a different playbook than API incidents. A broken endpoint maps to a single commit and a single rollback. A drifting rubric maps to a prompt edit or a quiet model-version push or a RAG re-index or an upstream tool change — any of which can produce the same symptom on eight percent of traffic before anyone notices. The runbook has to accommodate that shape.
The opinion this post earns: six steps, four incident classes, one loop closer. Detect, triage, contain, eval, fix, review. The four classes are hallucination, jailbreak, drift, and PII leak. The loop closer is the postmortem becoming a golden-set entry. Everything else is shape around those three primitives.
Why LLM incidents need their own playbook
Three things diverge from the classical SRE shape.
The failure is distributional. There’s no single broken line of code. The bug shows up on eight percent of inputs that share a structural feature no one wrote a test for. By the time the rolling mean shifts, bad output has shipped to several thousand users. The detection surface has to grade score distributions, not error logs.
Rollback isn’t clean. The model gave a wrong answer at 4:32 pm; the conversation thread still carries that answer. Retrieved chunks summarised by the bad version got cached downstream. A reverted prompt fixes new traffic but in-flight sessions still carry the bug.
RCA is a research question. The prompt was edited yesterday. The provider pushed a quiet version update overnight. The RAG corpus added 12k new documents this morning. Any one could produce the symptom. The runbook has to triage which before the fix lands.
If your incident process maps one broken commit to one rollback, it’ll miss the most common LLM failure shape.
TL;DR: six steps, four classes
| Step | Owner | Artifact | Gate |
|---|---|---|---|
| Detect | Monitor or on-call | Alert + Linear ticket with cluster | Page fires |
| Triage | On-call | Class label + severity (S1/S2/S3) | Class named |
| Contain | On-call | Gateway routing flip; cohort scoped | Route on known-safe fallback |
| Eval | Eval owner | Bug class run on golden set | Pass or named gap |
| Fix | Author of the change | Patch PR or config flip | Patch merged |
| Review | Incident commander | Blameless postmortem + golden-set entry | New entry committed |

| Class | What you see | Where the fix lives |
|---|---|---|
| Hallucination | Factual error against supplied context | Generator rubric, judge calibration |
| Jailbreak | Refusal contract bypassed | Output scanner, refusal eval, adversarial set |
| Drift | Rolling-mean rubric score dropped on a route | Eval gate threshold, prompt or model version |
| PII leak | Email, phone, SSN, API key, cross-tenant data surfaced | Output-side scanner, retrieval scoping, tenant guard |
If you only build three pieces, build detection, containment, and the review-to-golden-set loop. The others compound on top.
Step 1: detect
An LLM incident starts when the system tells you it broke, not when a customer tweets that it did. Three signals catch the four classes.
Per-rubric rolling-mean drift. Sampled production traces score against the same rubrics that gate CI. A 2-to-5-point drop sustained over 15 to 60 minutes on a per-route, per-prompt-version basis is the canonical threshold. Catches prompt regressions, model-version drift, and RAG shifts that hurt quality without changing error rates.
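A minimal sketch of the rolling-mean check, assuming scores arrive one sampled trace at a time for a single (route, prompt version) pair. The class name, the window size, and the use of consecutive breaching windows as a stand-in for the 15-to-60-minute sustain period are illustrative assumptions, not part of any SDK.

```python
from collections import deque
from dataclasses import dataclass
from statistics import mean


@dataclass
class DriftAlert:
    route: str
    prompt_version: str
    baseline: float
    current: float


class RollingDriftDetector:
    """Flags a sustained drop in one rubric's rolling mean (hypothetical helper)."""

    def __init__(self, route, prompt_version, baseline,
                 drop_points=3.0, window=50, sustain=3):
        self.route = route
        self.prompt_version = prompt_version
        self.baseline = baseline        # expected score, e.g. from the last CI run
        self.drop_points = drop_points  # assumed threshold, in rubric points
        self.sustain = sustain          # consecutive breaching windows before alerting
        self.scores = deque(maxlen=window)
        self._breaches = 0

    def observe(self, score):
        """Add one sampled score; return a DriftAlert once the drop is sustained."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return None  # not enough samples for a stable mean yet
        current = mean(self.scores)
        if self.baseline - current >= self.drop_points:
            self._breaches += 1
        else:
            self._breaches = 0  # a recovery resets the sustain counter
        if self._breaches >= self.sustain:
            return DriftAlert(self.route, self.prompt_version, self.baseline, current)
        return None
```

Keeping one detector per route and prompt version means a regression on a single route is not averaged away by healthy traffic elsewhere.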
Error Feed clusters. Failing traces cluster around shared structural features. Error Feed uses HDBSCAN soft-clustering over ClickHouse-stored embeddings to group failures into named issues at prob >= 0.4. The 5-category 30-subtype taxonomy (covered in AI agent failure modes (2026)) gives the alert a name a human can route on.
Customer escalation. A support ticket flagged with the model-output tag auto-opens an incident and pulls the underlying trace by session.id. The two signals above catch most things; this catches the rest.
The Linear ticket carries triage context. The Sonnet 4.5 Judge agent on Bedrock runs across the cluster with a 30-turn budget and 8 span tools, with a Haiku Chauffeur sub-agent summarising spans over 3000 characters and a ~90 percent prompt-cache hit ratio. Per cluster the Judge writes an immediate_fix proposal, evidence quotes, and a 4-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, each 1 to 5). The score lets the on-call sort 47 member traces by severity without reading all 47. Linear OAuth ships today; Slack, GitHub, Jira, PagerDuty on the roadmap. For depth, see error analysis for LLM applications.
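To illustrate how the 4-dimensional score lets the on-call order a cluster without reading every trace, here is a small sketch. The `ScoredTrace` shape and the sort rule (lowest single-axis score first, total score as tiebreak) are assumptions for illustration; only the four axis names and the 1-to-5 scale come from the text above.

```python
from dataclasses import dataclass

AXES = (
    "factual_grounding",
    "privacy_and_safety",
    "instruction_adherence",
    "optimal_plan_execution",
)


@dataclass
class ScoredTrace:
    trace_id: str
    scores: dict  # axis name -> int in 1..5, lower is worse


def worst_axis(trace):
    """Return (score, axis) for the lowest-scoring axis of a trace."""
    return min((trace.scores[axis], axis) for axis in AXES)


def sort_by_severity(traces):
    """Most severe first: lowest single-axis score, then lowest total."""
    return sorted(
        traces,
        key=lambda t: (worst_axis(t)[0], sum(t.scores[a] for a in AXES)),
    )
```

The first trace in the sorted list is the natural candidate for the representative trace that triage reads first.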
Step 2: triage: name the class
The first call the on-call makes is which class. Naming it routes everything downstream; calling every page “the agent hallucinated again” is why patches stop compounding.
Hallucination. The factual_grounding axis dropped below 3 while retrieval rubrics held stable. The supplied context was correct; the generator confabulated. Usually S2; escalates to S1 on regulated routes (medical, financial, legal).
Jailbreak. The instruction_adherence or privacy_and_safety axis dropped. The refusal contract was bypassed — the model complied with an instruction the system prompt should have refused. Adversarial pushback or a fresh jailbreak template is the usual root. S1 if a regulated category breached; S2 if the refusal was soft.
Drift. The rolling-mean rubric score dropped across a route with no code deploy. Four candidate causes: prompt edit, quiet provider version push, RAG corpus change, downstream API shape change. Severity scales with the drop magnitude and affected user count.
PII leak. The privacy_and_safety axis hit 1 or 2 on at least one trace. Email, phone, SSN, API key, or cross-tenant data surfaced in a response. Always S1. Regulatory exposure (SOC 2, HIPAA, GDPR) starts the moment the leak shipped; containment must happen before RCA, not after.
Severity rubric on top of the class: S1 is safety, leak, or regulatory exposure (contain first, RCA second). S2 is quality regression with no safety impact (RCA first, then rollback). S3 is edge case in one tenant (defer; add to golden set). Teams without this taxonomy end up over-paging (everyone burns out) or under-paging (the leak ships for six hours). Best AI agent failure detection tools walks the broader signal space.
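The class-plus-severity rules above can be sketched as a small lookup. This is a simplification under stated assumptions: the function and flag names are hypothetical, and drift defaults to S2 here because the text ties its severity to drop magnitude and user count without giving thresholds.

```python
from enum import Enum


class IncidentClass(Enum):
    HALLUCINATION = "hallucination"
    JAILBREAK = "jailbreak"
    DRIFT = "drift"
    PII_LEAK = "pii_leak"


def severity(incident_class, regulated=False, single_tenant_edge_case=False):
    """Map a named class plus context flags to S1/S2/S3 (illustrative rules)."""
    if incident_class is IncidentClass.PII_LEAK:
        return "S1"  # a leak is always S1
    if regulated and incident_class in (
        IncidentClass.HALLUCINATION,
        IncidentClass.JAILBREAK,
    ):
        return "S1"  # regulated routes escalate
    if single_tenant_edge_case:
        return "S3"  # defer and add to the golden set
    return "S2"      # quality regression with no safety impact


def first_action(sev):
    """S1 contains before RCA; S2 does RCA first; S3 is deferred."""
    return {"S1": "contain", "S2": "rca", "S3": "defer"}[sev]
```

Encoding the rubric this way keeps the page routing consistent across on-call engineers, which is the point of naming the class first.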
Step 3: contain
Containment is where the LLM playbook diverges most from classical SRE. Classical containment is “roll back the deploy.” LLM containment is “flip the gateway routing rule so the affected route falls through to a known-safe configuration.”
The Agent Command Center is the containment primitive. Six native adapters plus OpenAI-compatible presets and self-hosted backends. Six routing strategies (shadow, mirror, race, canary, fallback, primary) reconfigurable at runtime. Circuit breaker on [429, 500, 502, 503, 504]. Per-tenant routing rules to scope blast radius. Response headers (x-prism-routing-strategy, x-prism-fallback-used, x-prism-cost, x-prism-latency-ms, x-prism-model-used, x-prism-guardrail-triggered) confirm the flip on the wire.
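Confirming the flip on the wire can be as simple as checking the response headers from recent requests. The sketch below assumes each sample is a plain dict of response headers collected by whatever HTTP client the on-call uses, and that the header value is the literal string `true`, as in the opening scenario. The function name and the 200-sample default are illustrative.

```python
def fallback_confirmed(header_samples, min_samples=200):
    """True once the last `min_samples` responses all report the fallback path.

    header_samples: list of dicts mapping response header names to values,
    oldest first. Header names are matched case-insensitively.
    """
    if len(header_samples) < min_samples:
        return False  # not enough traffic observed since the flip
    for headers in header_samples[-min_samples:]:
        lowered = {name.lower(): value for name, value in headers.items()}
        if lowered.get("x-prism-fallback-used", "").strip().lower() != "true":
            return False  # at least one request still hit the old path
    return True
```

A missing header counts as a failure, so a request that bypassed the gateway entirely cannot be mistaken for a confirmed flip.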
Per class:
- Hallucination. Flip the route to the previous prompt version on the previous model. Tighten RailType.OUTPUT with Groundedness + FactualAccuracy at AggregationStrategy.ANY.
- Jailbreak. Tighten the inline scanner set. The 8 SDK Scanners (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner) run inline as a sub-10ms pre-filter. Add the bypass template to the adversarial pushback set in the same pass.
- Drift. Flip to the last known-good (prompt_version, model_version) tuple. Buys time for RCA; doesn’t fix root cause.
- PII leak. Highest urgency. Flip the inline output scanner to PII Detection + Data Leakage Prevention + Secret Detection at AggregationStrategy.ANY with strict thresholds. Per-tenant audit log on every request from the affected window. Notify legal in parallel.
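As a rough illustration of what a strict output-side scan does during a PII-leak containment, the sketch below blocks a response if any pattern matches. It is not the SDK's scanner implementation: the pattern set, the `sk-` key shape, and the function names are assumptions, and real detectors cover far more formats than three regular expressions.

```python
import re

# Illustrative patterns only; production scanners use broader detection.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),  # hypothetical key format
}


def scan_output(text):
    """Return the names of every pattern that matches the response text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]


def should_block(text):
    """'Any' semantics: a single hit is enough to block the response."""
    return bool(scan_output(text))
```

Blocking on any single hit trades false positives for containment speed, which is the right trade while the route is in S1.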
Best AI gateways for compliance and audit trails covers the eval-driven rollout patterns that make runtime flips safe.
Step 4: eval: run the bug class on the golden set
The route is on a known-safe configuration; users are unblocked. The next question is sharper than “what broke” — did the golden set already contain this class, and did the eval gate catch it?
Three outcomes from running the cluster’s representative traces through the offline suite:
- Golden set has the class; rubric catches it. The regression slipped past CI because the bug lives in production code the golden set doesn’t exercise on this path. Fix in step 5; gate is fine.
- Golden set has the class; rubric misses it. The rubric is stale or too lenient. Fix must include a rubric tighten or threshold adjustment. Most common on incidents older than three months.
- Golden set doesn’t have the class. The CI gate had nothing to catch. Fix must include a new EvalTemplate and a new golden-set entry. Most common in the first six months of a product.
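The three outcomes above reduce to two yes/no questions, which the sketch below encodes. The enum, function, and the wording of the fix-scope items are hypothetical labels for illustration; the mapping itself follows the list.

```python
from enum import Enum


class EvalOutcome(Enum):
    CAUGHT = "golden set has the class; rubric catches it"
    RUBRIC_MISSED = "golden set has the class; rubric misses it"
    NOT_COVERED = "golden set does not have the class"


def classify_outcome(class_in_golden_set, rubric_flags_incident_trace):
    """Decide which of the three outcomes applies to this incident."""
    if not class_in_golden_set:
        return EvalOutcome.NOT_COVERED
    if rubric_flags_incident_trace:
        return EvalOutcome.CAUGHT
    return EvalOutcome.RUBRIC_MISSED


# What the fix in the next step has to include for each outcome.
FIX_SCOPE = {
    EvalOutcome.CAUGHT: ["patch the change"],
    EvalOutcome.RUBRIC_MISSED: ["patch the change", "tighten rubric or threshold"],
    EvalOutcome.NOT_COVERED: [
        "patch the change",
        "new EvalTemplate",
        "new golden-set entry",
    ],
}
```

Recording the outcome on the incident ticket makes it clear up front whether the fix PR is allowed to ship without an eval change.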
The eval surface is the ai-evaluation SDK (Apache 2.0). The entry point is Evaluator(fi_api_key=..., fi_secret_key=...).evaluate(...). It ships 60+ EvalTemplate classes, including Groundedness, ContextAdherence, ContextRelevance, FactualAccuracy, Toxicity, PromptInjection, DataPrivacyCompliance, AnswerRefusal, IsHarmfulAdvice, and LLMFunctionCalling; 13 guardrail backends (9 open-weight plus 4 API); four distributed runners (Celery, Ray, Temporal, Kubernetes); and RailType.INPUT/OUTPUT/RETRIEVAL plus AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED.
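The four aggregation strategies are easiest to reason about side by side. The sketch below is not the SDK's code: the `Aggregation` enum and `aggregate` function are stand-ins, and the semantics (especially the weighted rule and its 0.5 threshold) are assumptions inferred from the strategy names.

```python
from enum import Enum


class Aggregation(Enum):
    ANY = "any"
    ALL = "all"
    MAJORITY = "majority"
    WEIGHTED = "weighted"


def aggregate(flags, strategy, weights=None, threshold=0.5):
    """Combine per-check verdicts into one decision.

    flags: dict of check name -> True if that check flagged the output.
    Returns True when the combined rail should trigger.
    """
    if not flags:
        return False
    hits = sum(1 for hit in flags.values() if hit)
    if strategy is Aggregation.ANY:
        return hits >= 1
    if strategy is Aggregation.ALL:
        return hits == len(flags)
    if strategy is Aggregation.MAJORITY:
        return hits > len(flags) / 2
    if strategy is Aggregation.WEIGHTED:
        weights = weights or {name: 1.0 for name in flags}
        total = sum(weights[name] for name in flags)
        flagged = sum(weights[name] for name, hit in flags.items() if hit)
        return flagged / total >= threshold
    raise ValueError(f"unknown strategy: {strategy}")
```

Under these assumed semantics, the "any" rule is the strictest setting, which is consistent with it being the one the containment steps reach for.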
Without this step, the fix is a vibe patch — a prompt tweak with no rubric attached, regressable on the next deploy. Your AI agent passes evals but still fails in production covers why the gap between offline pass and online drop is mathematical, not accidental.
Step 5: fix
The shape of the fix follows from the eval-step outcome.
- Prompt rollback. The registry exposes a one-click revert; ships in seconds. Most common S2 hallucination and drift remediation.
- Rubric tighten. A threshold gets stricter. Ships via config. The shape when the golden set had the class but the rubric was too lenient.
- New EvalTemplate. A fresh rubric scores the missing dimension; the next CI run picks it up. The shape when the golden set didn’t have the class.
- Retrieval re-index. The corpus is rebuilt with offending documents removed. Ships behind a shadow route so the new corpus is compared against the old before promotion.
- Data fix. Poisoned document, malformed embedding, cross-tenant memory leak — fixed at the data layer, not the model layer.
If the fix is a prompt rewrite, agent-opt speeds the iteration loop. Six optimizers ship today: RandomSearchOptimizer, BayesianSearchOptimizer (Optuna-backed, resumable), MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer. The trace-stream-to-agent-opt connector is the active roadmap item, so failing traces have to be promoted to a dataset (the review step) before agent-opt picks them up.
Verification is offline plus online. The canary route gets the new configuration first; the same rubrics that gate offline run against canary traffic as span-attached scores via traceAI. The canary widens only after rolling-mean rubric scores match or beat baseline.
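The canary-widening rule can be sketched as a per-rubric comparison of means. The function name, the input shape, and the minimum sample count are assumptions; a production gate would likely also account for variance rather than comparing raw means.

```python
from statistics import mean


def canary_may_widen(baseline_scores, canary_scores, min_samples=100):
    """True only if every rubric's canary mean matches or beats its baseline mean.

    baseline_scores / canary_scores: dict of rubric name -> list of scores.
    """
    for rubric, baseline in baseline_scores.items():
        if not baseline:
            return False  # no baseline to compare against
        canary = canary_scores.get(rubric, [])
        if len(canary) < min_samples:
            return False  # not enough canary traffic scored yet
        if mean(canary) < mean(baseline):
            return False  # this rubric regressed on the canary
    return True
```

Requiring every rubric to hold, rather than an overall average, prevents a gain on one axis from hiding a regression on another.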
Step 6: review: the postmortem becomes a golden-set entry
This step is what makes the playbook compound. Without it, each incident produces a one-off fix and the team writes the same regression next month. With it, the golden set ratchets every quarter and the CI gate catches more.
Six sections, plus one LLM-specific seventh.
- Summary. What happened, when, who was affected, how it was contained. Class label and severity included.
- Timeline. Detect, triage, contain, eval, fix, verification, customer comms. Per-step timestamps. Trace id and Error Feed cluster id included.
- Root cause. Prompt diff, model version diff, corpus change, or tool API diff. The upstream change that produced the symptom.
- Customer impact. User count, severity, sessions affected. Pull from session.id filters on traceAI.
- Action items. Owner-and-date. Include the routing flip’s revert and the eval-gate update.
- Lessons. Blameless on the engineer who pushed the button; process focus on the playbook that let the regression ship.
- Eval-gate review and golden-set entry. Every postmortem produces at least one new golden-set entry. This is the loop closer.
The entry is concrete: a representative trace from the cluster, committed into the offline eval set with a route tag, the class label (hallucination / jailbreak / drift / PII leak), the rubric label, and a one-line description. The next CI run grades it. The next PR touching that path has to clear it.
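One way to make the entry concrete is a small validated record serialized as one JSON line in the offline eval set. The field names, the class spellings, and the JSONL storage format are assumptions for illustration; the fields themselves mirror the ones listed above.

```python
import json
from dataclasses import asdict, dataclass

CLASSES = {"hallucination", "jailbreak", "drift", "pii_leak"}


@dataclass(frozen=True)
class GoldenSetEntry:
    trace_id: str        # representative trace from the cluster
    route: str           # route tag
    incident_class: str  # one of CLASSES
    rubric: str          # rubric label that should grade this entry
    description: str     # one-line description

    def __post_init__(self):
        if self.incident_class not in CLASSES:
            raise ValueError(f"unknown incident class: {self.incident_class}")
        if "\n" in self.description:
            raise ValueError("description must be a single line")


def to_jsonl(entry):
    """Serialize one entry as a single JSON line with stable key order."""
    return json.dumps(asdict(entry), sort_keys=True)
```

Validating at construction time keeps malformed entries out of the golden set, so the next CI run never fails on the entry itself rather than on the regression it guards.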
This is the diff between a learning team and a recurring-incident team. The golden set ratchets. The eval gate ratchets. The CI run on the next PR catches the failure class the incident surfaced. The 2026 LLM evaluation playbook covers the six layers the postmortem feeds.
A PII-leak worked example
A healthcare assistant returns another customer’s appointment details in an answer. The on-call gets paged.
Detect. Support escalates; Error Feed has already clustered two adjacent traces. The Judge’s immediate_fix says the retrieval layer is missing a tenant filter on the appointments index.
Triage. privacy_and_safety score on the representative trace is 1. Class: PII leak. Severity: S1 — cross-tenant data, regulated route.
Contain. Route flipped to a configuration with RailType.RETRIEVAL enforcing a tenant-filter check, plus PII Detection + Data Leakage Prevention at the output layer with AggregationStrategy.ANY. x-prism-routing-strategy: fallback confirms on the wire. Per-tenant audit log captures every request from the affected window. Legal notified.
Eval. The golden set does not contain a cross-tenant retrieval test for this route. The gate had nothing to catch. Fix scope expands to a new EvalTemplate plus a new golden-set entry.
Fix. Retrieval layer’s tenant-filter check patched. New EvalTemplate checks tenant isolation on the appointments route. Index rebuilt with explicit tenant tags. Canary runs two hours; rolling-mean privacy_and_safety returns to baseline.
Review. Postmortem published 48 hours later. Three golden-set entries committed: cross-tenant retrieval, PII-in-output regression, routing-rule sanity check. The next PR touching the appointments route has to clear all three. Lessons section names the gap in retrieval-layer rubric coverage as a process failure, not a person failure.
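The cross-tenant retrieval entry from this example could be graded by a check along these lines. It is a sketch under assumptions: retrieved chunks are represented as dicts carrying an `id` and a `tenant_id`, and the function names are hypothetical rather than an existing EvalTemplate.

```python
def cross_tenant_chunks(requesting_tenant, retrieved_chunks):
    """Return ids of retrieved chunks that do not belong to the requesting tenant.

    Chunks with no tenant_id are treated as violations (fail closed).
    """
    return [
        chunk["id"]
        for chunk in retrieved_chunks
        if chunk.get("tenant_id") != requesting_tenant
    ]


def tenant_isolation_passes(requesting_tenant, retrieved_chunks):
    """Pass only when every retrieved chunk carries the requesting tenant's tag."""
    return not cross_tenant_chunks(requesting_tenant, retrieved_chunks)
```

Failing closed on untagged chunks matters here because the index rebuild in the fix step is what adds explicit tenant tags; anything left untagged is exactly the data the check should refuse.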
Five anti-patterns the playbook closes
- No containment primitive. Without a runtime routing flip, the team waits on the deploy pipeline while users see bad output. That is the difference between 90-second and 90-minute containment.
- No per-rubric alerting. If the only detection surface is a generic error rate, the first signal is a customer tweet. Rolling-mean drift catches regressions that don’t trip error logs.
- No incident class taxonomy. Without four named classes, every page becomes S1, the team burns out, and the genuine S1s lose urgency.
- No eval-gated verification. Shipping the fix without a gate run means the team learns whether the fix works from users.
- No review-to-golden-set loop. Without the postmortem producing a new entry, the gate stays static while the failure space grows.
How Future AGI wires the playbook
Future AGI ships the incident-response surfaces as a connected stack. Start with the SDK and traceAI for the detection and verification gates; graduate to the Platform for self-improving rubrics that retune from postmortem feedback.
- Error Feed. HDBSCAN soft-clustering over ClickHouse-stored embeddings. Sonnet 4.5 Judge on Bedrock, 30-turn budget, 8 span tools, Haiku Chauffeur for large spans, ~90 percent prompt-cache hit. Per-cluster immediate_fix plus 4-D severity score. Linear OAuth today.
- Agent Command Center. Six native adapters plus OpenAI-compatible presets and self-hosted backends. Six routing strategies, circuit breaker on [429, 500, 502, 503, 504], per-tenant routing, response headers on the wire. SOC 2 Type II, HIPAA, GDPR, CCPA.
- traceAI (Apache 2.0). 50+ AI surfaces across Python / TypeScript / Java / C#. 14 span kinds. 62 built-in evals via EvalTag. Forensic span filtering by session.id, user.id, time range, error attributes.
- ai-evaluation (Apache 2.0). 60+ EvalTemplate classes. 13 guardrail backends. 8 sub-10ms Scanners. Four distributed runners. RailType.INPUT/OUTPUT/RETRIEVAL. Multi-modal CustomLLMJudge via LiteLLM.
- agent-opt. Six optimizers, shared EarlyStoppingConfig, resumable Optuna studies. Trace-stream connector on the roadmap.
- Platform. Self-improving evaluators retune from thumbs feedback. In-product authoring agent turns natural-language descriptions into rubrics. Classifier-backed scoring below Galileo Luna-2.
The chain — Slack alert → Linear ticket → gateway routing flip → golden-set eval run → fix → new golden-set entry — is the runbook the on-call follows at 11 pm. The playbook’s job is to make every step take seconds instead of hours, and to make sure the next incident of the same class doesn’t ship through the gate at all.
Related reading
- The 2026 LLM Evaluation Playbook
- AI Agent Failure Modes in 2026: The 5-Category Taxonomy
- Your AI Agent Passes Evals But Still Fails in Production
- Error Analysis for LLM Applications
- Production LLM Monitoring Checklist
- Best AI Agent Failure Detection Tools (2026)
- What Does a Good LLM Trace Look Like (2026)
- Best AI Gateways for Compliance and Audit Trails (2026)