LLM Incident Response Playbook in 2026: Detection to Postmortem

LLM incident response in 2026: detection via eval drift, triage, rollback, customer comms, postmortem. The eval-gate-driven playbook from page to action items.

April 14, 2026

A team ships a refresh of the support agent at 4pm Tuesday. By 4:32pm the rubric drift alarm fires: faithfulness on the refund-bot route is down 9 points, refusal rate is up 18 points. The on-call engineer is paged. By 4:34pm the gateway has auto-reverted the cohort to the incumbent (per-cohort eval-gated rollback caught the regression). By 4:38pm the on-call has the trace id for a representative failure, the prompt diff, and the rubric drift chart. By 4:55pm the incident channel is open and customer comms is drafted. By 6pm the immediate fix is in review. By Friday the postmortem is published; two action items expand the eval gate’s golden dataset and tighten the per-rubric threshold.

This is what LLM incident response looks like when the playbook is wired. Detection is automated, rollback is automated, the postmortem is blameless, and the eval gate gets stronger after each incident. This guide is the production playbook from page through postmortem in 2026. It draws on Google’s SRE book, Atlassian’s Incident Management Handbook, and the SRE tradition more broadly, with LLM-specific extensions for eval drift, prompt rollback, and judge calibration.

TL;DR: The five LLM incident shapes

| Shape | Detection signal | Default rollback path |
| --- | --- | --- |
| Eval drift | Rolling-mean rubric score drops 2-5+ points | Prompt-version revert via registry |
| Cost spike | Tokens-per-success rises 30%+ | Gateway-level cost circuit breaker; prompt revert |
| Latency spike | p99 jumps 2-5x | Tool/upstream healthcheck; circuit breaker on slow tool |
| Guardrail regression | Refusal rate or PII leak rate shifts 5+ points | Per-cohort eval-gated rollback; guardrail config revert |
| Outage / provider event | Error rate jumps, traffic drops | Gateway failover to fallback provider |

If you only read one row: the unit of fast LLM incident response is per-cohort eval-gated rollback in the gateway. The playbook’s job is to handle the cases the auto-rollback does not catch.

Stage 1: Detection

Three independent signals catch the five incident shapes. Wire all three.

Online eval drift

Rolling-mean rubric scores per route, per prompt version, per user cohort. Alarm thresholds:

2-5 point drop sustained 15+ minutes -> investigate.
5+ point drop -> page.
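The two thresholds above can be sketched as a small rolling-mean monitor. This is a minimal illustration, not any vendor's implementation: the class name `RubricDriftMonitor`, the 0-100 point scale, and the window size are assumptions made for the example, and one instance is meant to cover a single route, prompt version, and cohort.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RubricDriftMonitor:
    """Hypothetical rolling-mean drift check for one (route, version, cohort) key."""
    baseline: float                 # expected rubric score in points (assumed 0-100 scale)
    window: int = 50                # number of recent scored traces in the rolling mean
    sustain_seconds: int = 15 * 60  # how long a 2-5 point drop must persist
    _scores: deque = field(default_factory=deque, repr=False)
    _drop_started_at: Optional[float] = None

    def observe(self, timestamp: float, score: float) -> str:
        """Record one judge score and return 'ok', 'investigate', or 'page'."""
        self._scores.append(score)
        if len(self._scores) > self.window:
            self._scores.popleft()
        drop = self.baseline - sum(self._scores) / len(self._scores)
        if drop < 2:
            self._drop_started_at = None  # recovered: reset the sustain timer
            return "ok"
        if self._drop_started_at is None:
            self._drop_started_at = timestamp
        if drop >= 5:
            return "page"  # large drops page immediately
        sustained = timestamp - self._drop_started_at >= self.sustain_seconds
        return "investigate" if sustained else "ok"
```

Keeping the sustain timer separate from the paging threshold means a brief 3-point dip stays quiet, while a 5+ point drop pages on the first window that shows it.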

Distilled judge models (FutureAGI turing_flash at 50-70ms p95 for guardrail screening, ~1-2s for full eval templates; Galileo Luna-2) keep the per-trace cost low enough to score 5-20% of production traffic continuously. For depth on online eval, see What is LLM Evaluation? and Production LLM Monitoring Checklist.

Cost and latency

Tokens-per-success, dollar-cost-per-success, latency p50/p95/p99 per route. Alarm thresholds:

30%+ rise in tokens-per-success -> investigate.
50%+ rise -> page.
p99 latency 2x baseline -> page.

The cost-and-latency signals catch incidents that eval drift alone misses (a successful-but-wasteful trajectory regression).
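As a rough sketch of how these thresholds might be evaluated, the helpers below compute tokens-per-success and a nearest-rank p99, then classify the result. The function names are hypothetical, and counting tokens from failed attempts toward the numerator is an assumption about how the metric is defined.

```python
def tokens_per_success(total_tokens: int, successes: int) -> float:
    """Tokens spent per successful task (assumes failed attempts still count as spend)."""
    if successes == 0:
        return float("inf")
    return total_tokens / successes


def p99(latencies_ms: list) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def classify_cost_latency(baseline_tps: float, current_tps: float,
                          baseline_p99_ms: float, current_p99_ms: float) -> str:
    """Apply the thresholds above: 30% rise investigates, 50% rise or 2x p99 pages."""
    rise = (current_tps - baseline_tps) / baseline_tps
    if rise >= 0.50 or current_p99_ms >= 2 * baseline_p99_ms:
        return "page"
    if rise >= 0.30:
        return "investigate"
    return "ok"
```

A route whose rubric scores hold steady but whose tokens-per-success climbs from 1,000 to 1,400 would come back as `"investigate"` here, which is the successful-but-wasteful case that eval drift alone does not see.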

Guardrail and refusal rate

Refusal rate (legitimate-vs-illegitimate split), PII leak rate, prompt-injection success rate. Alarm thresholds:

5+ point shift in either direction on refusal rate -> investigate.
Any non-zero PII leak rate sustained -> page.
Prompt-injection success above 0.5% -> page.

The guardrail signal catches the safety-class regressions that eval drift misses, because refusal calibration is hard for rubric judges to score.
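The guardrail thresholds can be expressed the same way. The sketch below is illustrative only: `GuardrailWindow` is a hypothetical record of rates for one alert window, and treating "sustained" as "non-zero in the two most recent windows" is an assumption, since the right definition depends on the window length a team chooses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailWindow:
    """Hypothetical per-window rates, each a fraction between 0.0 and 1.0."""
    refusal_rate: float
    pii_leak_rate: float
    injection_success_rate: float


def classify_guardrails(baseline: GuardrailWindow, recent: list) -> str:
    """Classify the newest window; `recent` is ordered oldest to newest."""
    latest = recent[-1]
    # Assumption: "sustained" means a non-zero leak rate in the last two windows.
    pii_sustained = len(recent) >= 2 and all(w.pii_leak_rate > 0 for w in recent[-2:])
    if pii_sustained or latest.injection_success_rate > 0.005:
        return "page"
    # Refusal shifts count in either direction, measured in percentage points.
    shift_points = abs(latest.refusal_rate - baseline.refusal_rate) * 100
    if shift_points >= 5:
        return "investigate"
    return "ok"
```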

For the broader observability surface, see Production LLM Monitoring Checklist, LLM Tracing Best Practices, and Best AI Agent Observability Tools in 2026.

Stage 2: Triage

The on-call engineer’s job in the first 5 minutes:

Confirm the alarm. Look at the dashboard. Is the rubric drift real or a single-point spike that has already recovered?
Identify the change. Was a prompt updated, a model swapped, a gateway config changed in the last 24 hours? Most LLM incidents are change-induced.
Pull a representative trace. The trace id surfaces the failure. Span attributes (prompt version, cohort, tool calls) tell the story.
Decide the rollback path. If per-cohort eval-gated rollback already handled it, no action is needed. Otherwise choose a prompt-version revert or a code revert, or wait and watch if the signal is borderline.

The 5-minute clock is generous; per-cohort rollback often closes the loop in 2-3 minutes. The on-call’s manual action is for the cases the auto-rollback did not handle.
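The "identify the change" and "decide the rollback path" steps lend themselves to a small helper. The sketch below assumes a hypothetical in-memory change log; the `Change` record and the mapping from change kind to rollback path are illustrative, and in particular routing every non-prompt change to a code revert is a simplification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Change:
    """Hypothetical change-log entry."""
    kind: str            # e.g. "prompt", "model", "gateway_config", "code"
    description: str
    deployed_at: datetime


def recent_changes(changes: list, now: datetime,
                   lookback: timedelta = timedelta(hours=24)) -> list:
    """Changes deployed inside the lookback window, newest first."""
    cutoff = now - lookback
    hits = [c for c in changes if cutoff <= c.deployed_at <= now]
    return sorted(hits, key=lambda c: c.deployed_at, reverse=True)


def rollback_path(suspect: Optional[Change], auto_reverted: bool,
                  signal_borderline: bool) -> str:
    """Pick a rollback path in the order of preference described in triage."""
    if auto_reverted:
        return "none: confirm the auto-revert fired"
    if signal_borderline:
        return "wait-and-see"
    if suspect is None:
        return "no recent change: check the upstream provider"
    if suspect.kind == "prompt":
        return "prompt-version revert"
    return "code revert"  # simplification: all non-prompt changes go here
```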

Stage 3: Rollback

Three rollback paths, in order of preference, plus provider failover for outages.

Per-cohort eval-gated rollback

If the change went out behind a canary, the gateway auto-reverts the cohort when the rubric monitor fires. No manual action. The on-call confirms the revert fired and moves to investigation.

This is the right default. For the deployment pattern, see LLM Deployment Best Practices in 2026.
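To make the mechanism concrete, here is a minimal in-memory sketch of a canary gate for one cohort. It is not a real gateway API: `CanaryGate`, the `min_samples` guard, and the 0-1 score scale are assumptions for illustration. A production gateway would also handle per-cohort routing, persistence, and alerting.

```python
from statistics import mean


class CanaryGate:
    """Hypothetical gate: serve a candidate version to one cohort, revert on a rubric miss."""

    def __init__(self, incumbent: str, candidate: str,
                 threshold: float, min_samples: int = 20):
        self.incumbent = incumbent
        self.candidate = candidate
        self.threshold = threshold      # minimum acceptable mean rubric score
        self.min_samples = min_samples  # avoid reverting on a handful of traces
        self.active = candidate
        self.scores = []

    def record_score(self, score: float) -> str:
        """Record one rubric score for canary traffic; return the version now active."""
        if self.active != self.candidate:
            return self.active  # already reverted; ignore late scores
        self.scores.append(score)
        enough = len(self.scores) >= self.min_samples
        if enough and mean(self.scores) < self.threshold:
            self.active = self.incumbent  # automatic revert, no human in the loop
        return self.active
```

The `min_samples` guard is the code-level version of the canary cohort size trade-off: too small and the gate reverts on noise, too large and the regression reaches more users before the revert fires.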

Prompt-version revert

The prompt registry exposes a one-click revert to the previous version. Takes seconds; no code deploy. Use when the regression is prompt-driven and the canary did not catch it (because the canary cohort was too small, the rubric did not cover the failure mode, or the rollout was 100% from the start).

LangSmith Prompt Hub, FutureAGI prompt versions, Braintrust prompts, and Helicone Prompts all support one-click revert.
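Each registry has its own API, so the sketch below is deliberately vendor-neutral: a hypothetical in-memory `PromptRegistry` that shows why a revert takes seconds. The revert only moves the active pointer back one version; nothing is redeployed and the bad version stays in history for the postmortem diff.

```python
class PromptRegistry:
    """Hypothetical in-memory registry; real registries expose their own APIs."""

    def __init__(self):
        self._versions = {}  # prompt name -> list of prompt texts (index = version - 1)
        self._active = {}    # prompt name -> active version number

    def publish(self, name: str, text: str) -> int:
        """Store a new version and make it active; returns the new version number."""
        self._versions.setdefault(name, []).append(text)
        self._active[name] = len(self._versions[name])
        return self._active[name]

    def active(self, name: str) -> tuple:
        """Return (version number, prompt text) currently served."""
        version = self._active[name]
        return version, self._versions[name][version - 1]

    def revert(self, name: str) -> int:
        """Point the active version at the previous one without deleting history."""
        version = self._active[name]
        if version <= 1:
            raise ValueError(f"{name} has no earlier version to revert to")
        self._active[name] = version - 1
        return self._active[name]
```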

Code revert

git revert <sha>, redeploy. Takes minutes. Use when the change is in code (gateway config, agent logic, tool implementation) rather than in prompts.

Provider failover

For provider outages, the gateway routes to a fallback provider with an equivalent model. The fallback path must have been load-tested in advance. A fallback that has not been load-tested is not a fallback; verify under load quarterly.
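The failover logic itself is small; the hard part is the load testing. A minimal sketch, assuming each provider is wrapped in a plain callable that takes a prompt and returns text (the function and exception names here are hypothetical, not a gateway API):

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(RuntimeError):
    """Raised when the primary and every fallback provider errored."""


def complete_with_failover(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try providers in order; return (provider name, completion) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider error triggers the next fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the completion lets the trace record which path served the request, which matters later when the postmortem asks how much traffic ran on the fallback.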

Figure: LLM incident response timeline, detect through postmortem. Five stops: t=0 detect, +5m triage, +15m contain, +1h resolve, +24h postmortem.

Stage 4: Customer comms

If the incident has user-visible impact, draft the status update. The pattern:

One channel, one drafter. Avoid five engineers writing five conflicting updates.
Acknowledge fast, detail later. “We are aware of an issue affecting refund queries; investigating; next update in 15 minutes.” Beats waiting an hour for a perfect message.
Avoid speculation. State what is known. “Latency on chat queries elevated since 4:30pm” not “We think the model provider is having issues.”
Set update cadence. Every 15-30 minutes during the incident, then a final all-clear.
Match the language to the audience. Engineering customers want the trace id and the rubric drift; consumer-facing messages need plain English.

For multi-tenant SaaS, split comms per tenant: only tenants in the affected cohort get the notification. For incidents that affect all tenants, the global status page is the channel.

Stage 5: Resolution

Resolution closes the loop. Three confirmations.

The signal is back to baseline. Rubric drift recovered, p99 latency back to baseline, refusal rate normal. Wait one full alert window (typically 30-60 minutes) after the rollback to confirm.
The fix is in review. A PR exists, an eval gate ran on it, the regression case is in the test suite for next time.
Customer comms closed. Final all-clear posted; affected users notified if the comms plan requires it.

The incident channel stays open until all three are confirmed.
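The three confirmations can be turned into a simple close-out check. The sketch below is one possible encoding, with a hypothetical `IncidentState` record and a 30-minute default alert window; it returns what is still outstanding rather than a bare yes or no, so the incident channel can show why it is still open.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentState:
    """Hypothetical snapshot of an incident's close-out status."""
    rollback_at: datetime
    signals_at_baseline: bool
    fix_pr_open: bool
    regression_case_added: bool
    all_clear_posted: bool


def outstanding_confirmations(state: IncidentState, now: datetime,
                              alert_window: timedelta = timedelta(minutes=30)) -> list:
    """Return the confirmations still missing; an empty list means the channel can close."""
    pending = []
    if not state.signals_at_baseline:
        pending.append("signals not back to baseline")
    elif now - state.rollback_at < alert_window:
        pending.append("alert window has not elapsed since rollback")
    if not (state.fix_pr_open and state.regression_case_added):
        pending.append("fix not in review with a regression case")
    if not state.all_clear_posted:
        pending.append("final all-clear not posted")
    return pending
```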

Stage 6: Postmortem

The postmortem is blameless. Focus is the process, not the engineer who pushed the button.

Six sections.

Summary

One paragraph: what happened, when, who was affected. 4-6 sentences.

Timeline

Per-step time stamps from detection to all-clear. Use UTC. Pin the trace id of the representative failure.

23:30 UTC (16:30 PDT) - prompt v18 rolled out at 100% (no canary)
23:32 UTC (16:32 PDT) - rubric drift alarm fires: faithfulness -9 points
23:34 UTC (16:34 PDT) - on-call paged
23:38 UTC (16:38 PDT) - prompt-version revert to v17 executed
23:43 UTC (16:43 PDT) - rubric drift recovers
00:15 UTC next day (17:15 PDT) - all-clear posted
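If the on-call notes were taken in local time, converting them to UTC for the published timeline is mechanical. A small sketch using only the standard library; the fixed UTC-7 offset for Pacific Daylight Time and the calendar date are assumptions made for the example, and `timeline_line` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the incident happened during Pacific Daylight Time (UTC-7).
PDT = timezone(timedelta(hours=-7), "PDT")


def timeline_line(local_time: datetime, event: str) -> str:
    """Format one timeline entry in UTC from a timezone-aware local timestamp."""
    utc = local_time.astimezone(timezone.utc)
    return f"{utc:%Y-%m-%d %H:%M} UTC - {event}"


# The date below is illustrative; the times come from the example timeline.
rollout = datetime(2026, 4, 14, 16, 30, tzinfo=PDT)
print(timeline_line(rollout, "prompt v18 rolled out at 100% (no canary)"))
# prints "2026-04-14 23:30 UTC - prompt v18 rolled out at 100% (no canary)"
```

Including the date matters once an incident crosses midnight UTC, as an all-clear at 17:15 PDT does.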

Root cause

Prompt diff, model version change, gateway config change, or upstream provider event. The trace id of a representative failure surfaces the actual span tree. Pin both the diff and the trace.

Why eval missed it

This is the LLM-specific section. Three common answers:

Dataset gap. The golden dataset under-represented the failure slice.
Judge calibration drift. The judge model was updated and its calibration shifted with it.
Threshold too loose. The rubric threshold did not catch a real regression.
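One way to tell judge calibration drift apart from the other two answers is to re-score a fixed, human-labelled calibration set and compare agreement before and after the judge changed. The sketch below assumes pass/fail human labels and a simple agreement rate; the function names and the 0.05 tolerance are illustrative choices, not a standard.

```python
def judge_agreement(judge_scores: list, human_labels: list, threshold: float) -> float:
    """Fraction of calibration examples where the judge's pass/fail matches the human label."""
    if not judge_scores or len(judge_scores) != len(human_labels):
        raise ValueError("need equal-length, non-empty score and label lists")
    matches = sum(
        (score >= threshold) == label
        for score, label in zip(judge_scores, human_labels)
    )
    return matches / len(human_labels)


def calibration_drifted(previous_agreement: float, current_agreement: float,
                        tolerance: float = 0.05) -> bool:
    """True when agreement fell by more than the tolerance since the last check."""
    return previous_agreement - current_agreement > tolerance
```

If agreement is unchanged, the judge is probably not the cause, and the investigation moves to the dataset gap or the threshold.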

Action items

Specific, owned, dated. Examples:

“Add 50 representative refund-bot failures to the golden dataset by EOW (owner: Alice).”
“Tighten faithfulness rubric threshold from 0.85 to 0.90 (owner: Bob).”
“Update canary minimum-cohort-size from 3% to 5% to reach significance faster (owner: Charlie).”
“Add eval-gated rollback to the support-bot rollout pattern by Q2 (owner: Dana).”

The action items must close the loop: expand the eval gate, add the regression-suite case, tighten the rollback threshold. A postmortem without action items is documentation.

Lessons

Framing for the team and the broader org. What did we learn? What does the next person who touches this surface need to know?

For the canonical SRE postmortem template, see Google’s SRE book postmortem chapter and Atlassian’s Incident Handbook.

Special case: LLM security incidents

Five triggers escalate an incident to security:

Prompt injection succeeded and exfiltrated data.
Guardrail regression let PII or sensitive content leak in outputs.
Compromised model provider or stolen API keys.
Tool-call exfiltration (the agent called a tool with unauthorised arguments).
Supply-chain compromise (an OSS dependency was malicious).

For the supply-chain class, see the worked example in LiteLLM Compromised Incident Response Migration Guide.

The security-class playbook differs from the eval-drift playbook in three ways:

Loop in security review immediately. Do not run the standard eval-drift triage in parallel.
Consider regulatory disclosure obligations. Depending on jurisdiction (GDPR Article 33, CCPA, HIPAA, PCI-DSS), notification timelines apply.
Preserve evidence. Trace logs, gateway request logs, and prompt-version snapshots may be needed for forensic review.

For the broader safety surface, see Top 5 AI Guardrailing Tools in 2025, Best AI Agent Guardrails Platforms in 2026, and Prompt Injection in 2025.

Prevention closes more incidents than response

Everything above is about the moment after an alarm fires. The cheaper move is keeping incidents from firing in the first place. Three layers convert detect-and-rollback into never-detect:

Eval-gated rollout at deploy time. Every prompt change, model swap, and gateway config update ships behind a canary cohort with a rubric threshold. The gateway auto-reverts when the threshold is missed. The on-call never sees the page because the same eval-gated rollback pattern from Stage 3 runs at the deploy boundary, not the incident boundary.
Inline guardrail screening on every request. PII leakage, prompt injection, jailbreak, and tool-call argument violations are caught at the request boundary, not in the postmortem. FutureAGI’s turing_flash runs the screening pass at 50-70 ms p95, so gating every span fits inside a normal user-facing latency budget. The classes that show up in 30% of historical incident timelines (PII leak, prompt injection, refusal regression) become enforcement, not detection.
Persona-driven simulation pre-prod. Synthetic adversaries replay the failure modes that produced last quarter’s incidents before the next release reaches users. The regression set grows from each postmortem’s action items, so the same incident class stops shipping twice.

FutureAGI’s Agent Command Center wires these three layers into one runtime alongside the detection signals from Stage 1: eval-gated routing across 100+ providers with BYOK, 18+ runtime guardrails, and persona-driven simulation share the same trace tree the on-call already uses for response. The incident playbook still exists for the cases that slip through; the count of cases that need it drops sharply once prevention is wired the same way response is.

Common mistakes in LLM incident response

Detection on a single signal. Eval drift catches some incidents and misses others. Wire three signals.
Manual rollback when auto-rollback was available. A 30-minute manual rollback is 28 minutes of avoidable user impact when per-cohort eval-gated rollback would have closed in 2 minutes.
Code revert when prompt revert works. Prompt revert is seconds; code revert is minutes. Use the right tool.
Customer comms from five different engineers. One channel, one drafter.
Speculation in status updates. “We think the model provider…” erodes trust. State what is known.
No representative trace pinned in the postmortem. Without it, the postmortem is theory; with it, the postmortem is forensic.
Action items that do not close the loop. “We will be more careful next time” is not an action item. “Add 50 cases to the regression suite by Friday” is.
No quarterly chaos drill. A rollback that has not been exercised is not a rollback. Trigger it deliberately.
Treating provider weight updates as out-of-scope. Providers update weights without notice; the rubric drift will fire even though no team change happened. The playbook applies; the action items shift toward upstream monitoring.
Not reviewing eval calibration after the fact. If the eval gate did not catch it, the dataset, the judge calibration, or the threshold is at fault; find out which.

What changed in LLM incident response in 2026

| Date | Event | Why it matters |
| --- | --- | --- |
| 2024 | Per-cohort eval-gated rollback patterns matured across major gateway platforms | Auto-rollback closed the loop for most prompt-driven incidents |
| 2025 | Distilled judges (Galileo Luna-2 introduced June 2025, FutureAGI turing_flash) reached production scale | Online eval at 5-20% traffic became cost-feasible |
| 2025 | OTel GenAI semantic conventions widely adopted | Trace ids portable across vendors; postmortems reference the same trace shape |
| 2025 | Several public LLM-supply-chain incidents (LiteLLM compromise) | Security-class incident response patterns codified |
| 2026 | Eval-drift, cost, and guardrail signals integrated into PagerDuty / Slack / Opsgenie | LLM incidents got the same paging plumbing as classical SRE |

How to actually run an LLM incident response program in 2026

Wire three independent detection signals. Eval drift, cost/latency, guardrail rate.
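Set thresholds and pages. 2-5 point rubric drop -> investigate, 5+ -> page; 30%+ cost rise -> investigate, 50%+ -> page; p99 latency 2x baseline -> page; 5+ point refusal-rate shift -> investigate; sustained PII leak or prompt-injection success above 0.5% -> page.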
Default to per-cohort eval-gated rollback. Gateway-level for prompt-driven changes.
Build the rollback runbook. Prompt revert, code revert, gateway config revert, provider failover.
Define the on-call rotation. On-call engineer, eval owner, customer comms drafter; release captain and security reviewer at scale.
Run quarterly chaos drills. Trigger a known-bad prompt; verify detection, rollback, and comms.
Standardise the postmortem template. Six sections; SRE-blameless tradition.
Close the loop. Every postmortem produces an action item that expands the eval gate, the regression suite, or the rollback pattern.

For the deployment context, see LLM Deployment Best Practices in 2026 and CI/CD for AI Agents Best Practices.

Frequently asked questions

What does an LLM incident look like in 2026?

Five common shapes. (1) Eval drift: rolling-mean rubric scores drop 5-15 points after a prompt change, model swap, or provider weight update. (2) Cost spike: tokens-per-success rises 30%+ on the same workload. (3) Latency spike: p99 jumps 2-5x because a tool call is hanging. (4) Guardrail regression: refusal rate flips from 4% to 27% or PII leak rate rises. (5) Outage: provider returns errors, gateway fails over (or doesn't), traffic drops. Each has a different detection signal, a different triage path, and a different fix. The playbook covers all five.

How do I detect an LLM incident before users complain?

Wire three independent signals. (1) Online eval drift: rolling-mean rubric scores per route, alarm on a 2-5 point drop sustained for 15+ minutes and page on a 5+ point drop. (2) Cost and latency: tokens-per-success and p99 latency, alarm on a 30%+ rise in tokens-per-success and page when p99 reaches 2x baseline. (3) Refusal and guardrail rate: alarm on a 5+ point shift in either direction. The three signals catch different incident classes; pages from any of them are first-class. For the underlying observability, see [Production LLM Monitoring Checklist](/blog/production-llm-monitoring-checklist-2026) and [LLM Tracing Best Practices](/blog/llm-tracing-best-practices-2026).

What is the right rollback procedure for an LLM incident?

Three rollback paths, in order of preference. (1) Per-cohort eval-gated rollback: if the change went out behind a canary, the gateway auto-reverts the cohort when the rubric monitor fires; no manual action needed. (2) Prompt-version rollback: the prompt registry exposes a one-click revert to the previous version; takes seconds, no code deploy. (3) Code rollback: git revert the deploy; takes minutes. Avoid waiting for a code rollback when a prompt rollback works; the latter is the right default for prompt-driven incidents.

Who should be on the LLM incident response rotation?

Three roles minimum. (1) On-call engineer: triages the page, executes the rollback, opens the incident channel. (2) Eval owner: interprets the rubric drift, decides whether the canary monitor was right, owns the calibration. (3) Customer comms: drafts and ships the customer-facing update if the incident has user-visible impact. For larger orgs, add a release captain (owns the rollout pipeline) and a security/safety reviewer (for guardrail-class incidents). For depth on the SRE pattern, see Google's [Incident Management chapter](https://sre.google/sre-book/managing-incidents/) of the SRE book.

How long should an LLM postmortem take?

From incident close to draft postmortem: 24-48 hours. From draft to published, blameless review complete: 5-7 days. The discipline matches classical SRE postmortem timing (Atlassian's [Incident Handbook](https://www.atlassian.com/incident-management/handbook), Google's [SRE postmortem template](https://sre.google/sre-book/postmortem-culture/)). The LLM-specific additions: include the trace id of a representative failure, include the rubric drift chart, include the prompt-version diff, include the eval-gate verdict (did it catch it; if not, why). The action items either expand the eval gate, add a regression-suite case, or change the rollout pattern.

What does a blameless LLM postmortem cover?

Six sections. (1) Summary: one paragraph; what happened, when, who was affected. (2) Timeline: detection, triage, rollback, resolution, customer comms; per-step time stamps. (3) Root cause: prompt diff, model version, gateway config, or upstream provider event; trace id of representative failure. (4) Why eval missed it: dataset gap, judge calibration drift, or threshold too loose. (5) Action items: expand the eval gate, add the failure to the regression suite, tighten the rollback threshold. (6) Lessons: framing for the team and broader org. Blameless means the focus is the process, not the engineer who pushed the button.

When should I treat an LLM incident as a security incident?

Five triggers. (1) Prompt injection succeeded and exfiltrated data. (2) Guardrail regression let PII leak in outputs. (3) Compromised model provider or stolen API keys. (4) Tool-call exfiltration (the agent called a tool with arguments it should not have). (5) Supply-chain compromise (an OSS dependency was malicious; see [the LiteLLM compromised incident response](/blog/litellm-compromised-incident-response-migration-guide) for a worked example). Loop in security review immediately, follow the incident-response plan rather than the eval-drift playbook, and consider regulatory disclosure obligations.

What does FutureAGI ship for LLM incident response?

FutureAGI's Apache 2.0 stack covers the three detection signals (eval drift, cost and latency, guardrail) with span-attached scorers (turing_flash judge at 50-70ms p95 for guardrail screening, ~1-2s for full eval templates) and gateway-level rollback. The [Agent Command Center](/platform/monitor/command-center) supports per-cohort rollback with eval-gated rollback as the default rollout pattern. The traceAI Apache 2.0 instrumentation library produces OTel-native spans so trace ids surface in the postmortem. The platform integrates with PagerDuty, Slack, and standard incident-management tools for paging.