Agent Rollout Strategies in 2026: The Four-Stage Gate
Agent rollout is a four-stage gate: shadow, canary, percentage, full. Each stage has a different eval question. Skipping one ships a production incident.
A prompt edit ships at 4:47 pm. The canary runs at 1 percent for forty minutes. Dashboards look fine. The team promotes to 25 percent. Twelve hours later, an enterprise tenant opens a ticket; the new prompt subtly broke a tool-call schema on a refund-flow sub-route the 30-example regression suite never hit. Mean Groundedness held at 0.91. Production Groundedness on the affected traffic sat at 0.62. Semantic cache keeps serving the bad answer for another forty minutes after the rollback. The gate fired green because it answered the wrong question.
Agent rollout isn’t a deploy. It’s a partial behavior change on a non-deterministic system, with blast radius across silent paths, signal that only resolves after hours of real traffic, and caches that outlive the rollback. The opinion this post earns: agent rollout is a four-stage gate (shadow, canary, percentage, full), and each stage answers a different eval question. Skipping one ships a production incident.
This is the playbook: what each stage proves, the eval question, the rubric math, the rollback triggers, and how the Agent Command Center routing layer makes the four stages a header change. The code samples are shaped against the ai-evaluation SDK and the gateway.
TL;DR: the four-stage gate
| Stage | Traffic to candidate | User impact | Eval question | Promotion gate |
|---|---|---|---|---|
| 1. Shadow | 100% mirrored | Zero | Does the candidate behave wildly differently on real traffic? | Per-rubric distribution within 1 point of production over 24-72h |
| 2. Canary | 1-5% live | Tier-stratified | Is the candidate at least as good with users in the loop? | Containment * (1 - False Resolution) within noise floor of baseline |
| 3. Percentage | 10, 25, 50% live | Broader | Are per-rubric deltas statistically significant on prod data? | Welch’s t-test p > 0.05 on each rubric vs 7-day baseline |
| 4. Full | 100% live | All users | Does the candidate hold the line under load with auto-rollback armed? | Guardrail trip rate, rubric rolling mean, and p99 latency hold for 48-72h |
Skipping a stage is the cheap-and-fast failure mode. The eval question changes at each stage; a green check at stage 1 doesn’t answer stage 2’s question.
Why agent rollouts break differently
Classical deploys ship code. Agent rollouts ship a behavior change, and break in three ways code deploys don’t.
Non-determinism makes single traces useless as proof. The candidate can answer correctly on a probe and wrong on the next identical probe. The promotion decision lives on scored distributions over real traffic, not curl examples.
One prompt edit fans out across silent paths. A change meant to tighten a refund-flow tool-call format also affects summarization, the help-desk fallback, and the rare disambiguation turn that triggers once per 500 requests. The regression only shows on the full distribution.
Rollback isn’t clean. Semantic caches store old responses keyed on prompt-hashes that don’t change on revert. Downstream stores snapshot the bad output and reuse it. If the rollback plan is “revert the deploy,” regressions ghost-serve for hours. The first agent incident most teams hit is “we rolled back but it didn’t fix.”
Each stage covers a failure the prior stage can’t see. For the eval foundation, see the 2026 LLM evaluation playbook.
Stage 1: Shadow, does the candidate behave reasonably on real traffic
Every production request is duplicated. Production answers the user; the candidate runs in parallel; the candidate’s output is scored offline and discarded. Zero user effect at full traffic coverage.
The eval question is bounded: does the candidate’s distribution on real traffic look reasonable against production’s? A 1.5-point Groundedness drop is a stop signal even though no user saw a single candidate response. A new cluster in the candidate’s Error Feed that doesn’t appear in production’s is also a stop signal. The candidate isn’t promoted because it scored well on a curated test set; it’s promoted because the rubric distribution on real traffic holds.
```python
# Configure shadow routing on Agent Command Center
import requests

user_input = "I was charged twice; can I get a refund?"  # placeholder user message

response = requests.post(
    "https://gateway.futureagi.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_AGENTCC_KEY",
        "x-agentcc-routing-strategy": "shadow",
        "x-agentcc-shadow-target": "candidate-v2-prompt",
    },
    json={
        "model": "production-agent-v1",
        "messages": [{"role": "user", "content": user_input}],
    },
)
# Response is from production-agent-v1.
# Candidate runs in parallel; its response is captured for scoring, not served.
print(response.headers["x-agentcc-routing-strategy"])  # "shadow"
```
Hold for 24 to 72 hours depending on volume. The cost trade is real: every request runs twice, so spend roughly doubles. Mirror is the cheaper variant when 5 to 10 percent sampling converges the rubric in the same window. Shadow’s other failure mode is treating it as a one-shot; keep it on after promotion so the team notices quality breaking before the next ramp.
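The shadow promotion gate can be sketched as a per-rubric comparison of means. This is a minimal illustration, not the platform's implementation: the function name `shadow_gate`, the rubric keys, and the 0-100 score scale are assumptions made for the example, and the 1-point tolerance mirrors the gate in the TL;DR table.

```python
from statistics import mean


def shadow_gate(production, candidate, tolerance=1.0):
    """Return (ok, drops): ok is False if any rubric mean fell by more than tolerance.

    production / candidate: dict of rubric name -> list of scores on the same
    scale (assumed 0-100 here). tolerance is the largest acceptable drop in points.
    """
    drops = {}
    for rubric, prod_scores in production.items():
        delta = mean(candidate[rubric]) - mean(prod_scores)
        if delta < -tolerance:
            drops[rubric] = round(delta, 3)
    return (not drops), drops


# Illustrative scores only; real inputs would be scored shadow traces.
production = {"groundedness": [92, 90, 91, 93], "task_completion": [88, 87, 89, 88]}
candidate = {"groundedness": [90, 89, 90, 89], "task_completion": [88, 88, 89, 87]}

ok, drops = shadow_gate(production, candidate)
print(ok, drops)
```

Here Groundedness drops by 2 points, so the gate stops the promotion even though TaskCompletion is flat. A mean comparison is the simplest version; comparing percentiles the same way would also catch tail shifts.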
Stage 2: Canary, Containment Rate times (1 − False Resolution Rate)
The candidate serves real traffic on 1 to 5 percent of users, stratified by tenant tier. Shadow proved the candidate doesn’t behave wildly differently on real inputs; canary proves it’s at least as good with users in the loop.
The eval question shifts from distribution shape to user outcomes. The gate metric is the product:
Containment Rate × (1 − False Resolution Rate)
Containment is the share of conversations resolved without escalation. False Resolution is the share of “resolved” conversations where the resolution was wrong (caught by a user complaint, a thumbs-down, an Error Feed cluster, or a second-touch metric). The product is the useful gate because the canary failure mode is a candidate that resolves more cases but resolves them worse. A 2-point Containment lift with a 4-point False Resolution lift is a regression the average rubric won’t catch.
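A short worked example makes the arithmetic concrete. The rates below are hypothetical and chosen to match the 2-point Containment lift and 4-point False Resolution lift described above; `canary_gate_metric` is a name invented for this sketch.

```python
def canary_gate_metric(containment_rate, false_resolution_rate):
    """Share of conversations that were contained AND resolved correctly."""
    return containment_rate * (1.0 - false_resolution_rate)


# Hypothetical rates: baseline contains 70% with 10% false resolutions,
# candidate contains 72% with 14% false resolutions.
baseline = canary_gate_metric(0.70, 0.10)
candidate = canary_gate_metric(0.72, 0.14)

print(f"baseline={baseline:.4f} candidate={candidate:.4f}")
print("regression" if candidate < baseline else "holds")
```

Under these assumed numbers the baseline works out to 0.63 and the candidate to about 0.619, so the candidate loses on the product even though its Containment went up.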
```python
# Canary at 5%, stratified by tenant tier
response = requests.post(
    "https://gateway.futureagi.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_AGENTCC_KEY",
        "x-agentcc-routing-strategy": "canary",
        "x-agentcc-canary-target": "agent-v2",
        "x-agentcc-canary-percent": "5",
        "x-agentcc-canary-stratify-tag": "tenant_tier",
    },
    json={"model": "agent-v1", "messages": [...]},
)
print(response.headers["x-agentcc-routing-strategy"])  # "canary"
```
Two canary failure modes show up in incident reports. Canary without a written rollback criterion: a canary that “looks fine” isn’t a promotion signal. Write the rubric and the trigger before traffic flips. Canary without stratification: routing 5 percent at random can put 100 percent of one enterprise tenant on the candidate. The five-level budget hierarchy (org, team, user, key, tag) is the routing primitive for stratified canary.
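One way to read stratification is that the canary share is taken inside each tenant and tier, so no single tenant lands wholly on the candidate. The sketch below uses deterministic hash bucketing to illustrate that idea. It is an assumption about the technique, not a description of how the gateway implements the stratify-tag header; `in_canary`, the salt, and the ID fields are hypothetical.

```python
import hashlib


def in_canary(tenant_tier, tenant_id, user_id, percent, salt="rollout-v2"):
    """Deterministically place about `percent`% of each tenant's users on the candidate.

    Hashing the user inside the tenant and tier keeps the share roughly even
    across strata, and the same user always lands in the same arm.
    """
    key = f"{salt}:{tenant_tier}:{tenant_id}:{user_id}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 10_000
    return bucket < percent * 100


# Share of one hypothetical enterprise tenant's users routed to the candidate.
users = [f"user-{i}" for i in range(20_000)]
hits = sum(in_canary("enterprise", "acme", u, percent=5) for u in users)
print(f"{hits / len(users):.2%} of acme users on the candidate")
```

Changing the salt per rollout reshuffles the buckets, so the same users are not always the first to see a candidate.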
Hold for 24 to 72 hours. The math is the same as any A/B test, with Containment * (1 - False Resolution) as the metric and the rubric noise floor as the minimum detectable effect. For foundations, see the A/B testing LLM prompts guide.
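The hold time follows from how many scored conversations each arm needs. A rough sketch using the standard two-sample normal approximation is below. It assumes the per-conversation gate metric is approximately normal with a known standard deviation; `samples_per_arm` and the example numbers are illustrative.

```python
import math
from statistics import NormalDist


def samples_per_arm(std_dev, min_detectable_effect, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    n = 2 * ((z_alpha + z_beta) * std_dev / min_detectable_effect) ** 2
    return math.ceil(n)


# Hypothetical inputs: metric std dev 0.2, rubric noise floor 0.02.
n = samples_per_arm(std_dev=0.2, min_detectable_effect=0.02)
print(n)
```

With those assumed inputs the answer is roughly 1,570 conversations per arm. At 1 percent canary traffic that count, divided by the route's hourly volume, gives a lower bound on the hold window.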
Stage 3: Percentage, statistical gate on production-data deltas
If canary holds, the candidate ramps through 10, 25, and 50 percent, with each step held for 12 to 24 hours. The eval question shifts: not “is the candidate reasonable” or “is the candidate at least as good,” but “is the delta statistically significant on the gating rubric?”
A green check should mean “this rollout did not introduce a statistically significant regression,” not “mean Groundedness sat above 0.85.” Floors catch catastrophic breaks; deltas catch slow regressions; both are required.
```python
import statistics
from scipy import stats


def rollout_gate(candidate, baseline, alpha=0.05, min_effect=0.05):
    """Fail only if the mean dropped, the change is significant, and the effect is real."""
    delta = statistics.mean(candidate) - statistics.mean(baseline)
    if delta >= 0:
        return True, f"no regression (delta=+{delta:.3f})"
    _, p = stats.ttest_ind(candidate, baseline, equal_var=False)
    if p >= alpha:
        return True, f"delta={delta:.3f}, p={p:.3f} (not significant)"
    if abs(delta) < min_effect:
        return True, f"delta={delta:.3f} below effect floor {min_effect}"
    return False, f"regression: delta={delta:.3f}, p={p:.3f}"
```
The baseline is a rolling 7-day production observation, not a frozen number. Models drift, prompts drift, traffic drifts; the gate drifts with them or it catches ordinary movement instead of regressions. Use a two-proportion z-test on pass-rate rubrics like citation validity, Welch’s t-test on continuous rubrics. For long-tail failures that hide in averages, gate on percentiles. The fi CLI ships pass_rate, avg_score, p50/p90/p95_score as native assertion metrics, so a regression pushing p95_score below a tail floor while leaving the mean intact fails on the percentile.
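For pass-rate rubrics, the two-proportion z-test mentioned above can be sketched with the standard library alone. This is a textbook pooled-variance version; the function name and the pass counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(pass_a, n_a, pass_b, n_b):
    """Return (z, two-sided p) for the null hypothesis that both arms share one pass rate."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # every sample passed, or every sample failed, in both arms
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical citation-validity counts: candidate 930/1000, baseline 960/1000.
z, p = two_proportion_z_test(pass_a=930, n_a=1000, pass_b=960, n_b=1000)
print(f"z={z:.2f}, p={p:.4f}")
```

A negative z with p below alpha means the candidate's pass rate is significantly lower than the baseline's, which is the failing case for this gate.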
The 25 percent step is where teams discover a rubric they forgot to gate on. A five-rubric starting set:
| Dimension | Threshold against baseline |
|---|---|
| Groundedness | Within 1 point (or higher) |
| TaskCompletion | Within 1 point (or higher) |
| AnswerRefusal | At or below production false-refusal rate |
| Toxicity / PromptInjection | No worse than production |
| New Error Feed cluster | Zero candidate-only clusters tolerated |
The “no new failure cluster” gate catches the worst-case regression that average scores hide.
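The cluster gate reduces to a set difference once both arms' failures carry comparable cluster labels. The sketch below assumes such labels exist and are stable across arms; the label strings are made up for illustration.

```python
def candidate_only_clusters(production_clusters, candidate_clusters):
    """Return cluster labels seen in the candidate's failures but never in production's."""
    return sorted(set(candidate_clusters) - set(production_clusters))


# Hypothetical cluster labels from each arm's failure feed.
production_clusters = ["timeout-on-search", "ambiguous-order-id"]
candidate_clusters = ["timeout-on-search", "refund-tool-schema-mismatch"]

new_clusters = candidate_only_clusters(production_clusters, candidate_clusters)
gate_passes = not new_clusters
print(new_clusters, gate_passes)
```

The gate is deliberately binary: one candidate-only cluster fails it, no matter how small the cluster is relative to total traffic.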
Stage 4: Full, 100 percent with auto-rollback armed
The candidate ramps to 100 percent. The previous version isn’t retired yet. Rollback decisions are too slow when they’re a meeting; the triggers go into the gateway config, and the flip happens on the next request, not after a Slack thread.
```yaml
# agentcc-rollback.yaml
rollback_triggers:
  - name: guardrail_trip_rate
    metric: x-agentcc-guardrail-triggered
    window: 15m
    threshold: 1.5x_baseline  # vs trailing 7-day
    action: rollback_immediate
  - name: rubric_regression
    metric: groundedness_rolling_mean
    window: 1h
    threshold: -0.5  # absolute point drop on 1-5
    significance: p<0.05
    action: rollback_immediate
  - name: p99_latency
    metric: x-agentcc-latency-ms
    aggregation: p99
    window: 10m
    threshold: 1.3x_baseline
    action: rollback_immediate
  - name: candidate_only_cluster
    metric: error_feed_cluster_id
    condition: not_in_trailing_window(7d)
    action: rollback_immediate
```
Any single trigger fires the rollback; the gateway flips to the prior version on the next request, and the response headers carry the new strategy so observability sees the flip without polling a control plane. Hold 48 to 72 hours under armed rollback before the previous version is retired. Cost guardrails are the missed fifth trigger: a race-shape rollout fanning out to three providers will quadruple the bill inside a week without a per-tag budget cap.
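The "any single trigger fires" rule can be illustrated for the two ratio-based triggers. This is a sketch of the decision logic only, not the gateway's evaluator; the metric names, the limits dictionary, and the sample values are assumptions that mirror the 1.5x and 1.3x thresholds above.

```python
# Ratio limits against the trailing baseline (assumed names, values from the config above).
RATIO_LIMITS = {"guardrail_trip_rate": 1.5, "p99_latency_ms": 1.3}


def fired_triggers(current, baseline, ratio_limits=RATIO_LIMITS):
    """Return the names of triggers whose current/baseline ratio exceeds its limit."""
    fired = []
    for name, limit in ratio_limits.items():
        base = baseline[name]
        if base > 0 and current[name] / base > limit:
            fired.append(name)
    return fired


def should_rollback(current, baseline):
    """Any single fired trigger is enough to roll back."""
    return bool(fired_triggers(current, baseline))


# Hypothetical window readings.
baseline = {"guardrail_trip_rate": 0.02, "p99_latency_ms": 2000}
current = {"guardrail_trip_rate": 0.04, "p99_latency_ms": 2400}

print(fired_triggers(current, baseline), should_rollback(current, baseline))
```

In this example latency is only 1.2x baseline, but the guardrail trip rate has doubled, so the rollback fires on that trigger alone.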
The rollback path is the often-skipped step. On rollback: invalidate the semantic cache namespace tagged with the candidate version, flush any downstream stores that snapshotted the bad output, and bump the rubric version so the production observer scores against the right floor. Otherwise regressions ghost-serve for hours after the revert. For the broader deploy story, see LLM deployment best practices.
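The cache step is easiest when entries are namespaced by agent version from the start. The in-memory class below is a stand-in used to show the idea; it is not the gateway's cache API, and `VersionedCache` and its methods are hypothetical.

```python
class VersionedCache:
    """In-memory stand-in for a semantic cache with one namespace per agent version."""

    def __init__(self):
        self._namespaces = {}  # version -> {prompt_hash: response}

    def put(self, version, prompt_hash, response):
        self._namespaces.setdefault(version, {})[prompt_hash] = response

    def get(self, version, prompt_hash):
        return self._namespaces.get(version, {}).get(prompt_hash)

    def invalidate(self, version):
        """Drop every entry cached under one version; return how many were removed."""
        return len(self._namespaces.pop(version, {}))


cache = VersionedCache()
cache.put("agent-v1", "hash-refund-q", "good answer")
cache.put("agent-v2", "hash-refund-q", "bad answer")  # same prompt hash, candidate output

removed = cache.invalidate("agent-v2")  # part of the rollback path
print(removed, cache.get("agent-v2", "hash-refund-q"), cache.get("agent-v1", "hash-refund-q"))
```

Because the prompt hash is identical across versions, a cache keyed on the hash alone has no way to separate the candidate's answers from production's. The version namespace is what makes the flush targeted.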
How Agent Command Center wires the four stages
The four stages are header changes on the same routing layer. Agent Command Center ships shadow, mirror, race, fallback, load-balanced, and budget-aware routing as first-class strategies in the gateway core. The gateway is a single 17 MB Go binary (Apache 2.0, OpenAI-compatible) running at ~29k req/s with P99 21 ms on a t3.xlarge with guardrails on. Self-host or use gateway.futureagi.com/v1.
Every response carries the headers the eval layer needs:
- x-agentcc-routing-strategy: which strategy fired (shadow, mirror, canary, race, fallback)
- x-agentcc-model-used: which provider returned the served response
- x-agentcc-fallback-used: whether a candidate fell back to a different provider
- x-agentcc-latency-ms: gateway-measured end-to-end latency
- x-agentcc-cost: gateway-measured per-request cost
- x-agentcc-guardrail-triggered: whether an input or output guardrail fired
traceAI spans pick the headers up as attributes so eval scores on the trace tree know which arm they belong to. The same rubric runs against both arms; spans tell the eval layer which baseline to compare against. For more on separating production and candidate traces, see LLM eval with shadow traffic and canary.
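To make the header-to-attribute mapping concrete, here is a small sketch of the kind of translation involved. It is only an illustration: traceAI does this itself according to the text above, and apart from agentcc.routing_strategy the attribute names below are an assumed naming convention.

```python
def span_attributes_from_headers(headers):
    """Map x-agentcc-* response headers to agentcc.* span attribute names."""
    prefix = "x-agentcc-"
    attrs = {}
    for name, value in headers.items():
        lowered = name.lower()  # HTTP header names are case-insensitive
        if lowered.startswith(prefix):
            attrs["agentcc." + lowered[len(prefix):].replace("-", "_")] = value
    return attrs


# Hypothetical response headers.
headers = {
    "Content-Type": "application/json",
    "x-agentcc-routing-strategy": "shadow",
    "X-AgentCC-Model-Used": "production-agent-v1",
}
attrs = span_attributes_from_headers(headers)
print(attrs)
```

With the strategy on the span, the eval layer can group scores by arm with a plain filter on that attribute.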
```python
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, TaskCompletion, AnswerRefusal,
    Toxicity, PromptInjection,
)
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env
rubric = [Groundedness(), TaskCompletion(), AnswerRefusal(),
          Toxicity(), PromptInjection()]

# production_traces / candidate_traces: lists of dicts with TestCase fields, loaded elsewhere
production = evaluator.evaluate(eval_templates=rubric,
                                inputs=[TestCase(**t) for t in production_traces])
candidate = evaluator.evaluate(eval_templates=rubric,
                               inputs=[TestCase(**t) for t in candidate_traces])
```
For a canary serving live traffic, the same rubric runs as an output guardrail. MAJORITY aggregation ships the response only if most rubrics agree it’s clean; ALL is stricter; WEIGHTED lets Groundedness count more than Toxicity. For rubrics that aren’t gameable, see the agent evaluation frameworks guide.
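The three aggregation modes differ only in how per-rubric verdicts combine. The sketch below shows one plausible reading of each mode. The function, the default weight of 1.0, and the 0.5 threshold are assumptions, not the guardrail's documented behavior.

```python
def aggregate(verdicts, mode="MAJORITY", weights=None, threshold=0.5):
    """Combine per-rubric verdicts (True means clean) into one ship/block decision."""
    if not verdicts:
        raise ValueError("no rubric verdicts to aggregate")
    if mode == "ALL":
        return all(verdicts.values())
    if mode == "MAJORITY":
        return sum(verdicts.values()) > len(verdicts) / 2
    if mode == "WEIGHTED":
        weights = weights or {}
        total = sum(weights.get(r, 1.0) for r in verdicts)
        clean = sum(weights.get(r, 1.0) for r, ok in verdicts.items() if ok)
        return clean / total > threshold
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical verdicts for one response: Groundedness fails, the others pass.
verdicts = {"groundedness": False, "task_completion": True, "toxicity": True}

ships_majority = aggregate(verdicts, "MAJORITY")
ships_all = aggregate(verdicts, "ALL")
ships_weighted = aggregate(verdicts, "WEIGHTED", weights={"groundedness": 3.0})
print(ships_majority, ships_all, ships_weighted)
```

The same response ships under MAJORITY and is blocked under ALL. It is also blocked under WEIGHTED once Groundedness carries three times the weight of the other rubrics.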
The hardest rollout signal is “the candidate broke a path production handles fine.” A 0.3-point average rubric drop can hide a cluster of catastrophic failures in one sub-route. Error Feed clusters candidate-side failures via HDBSCAN soft-clustering over ClickHouse-stored span embeddings; a Sonnet 4.5 Judge agent reads the failing trace and writes an immediate_fix plus a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each) per cluster. If a cluster appears in the candidate’s feed but not in production’s, the rollback fires. Linear OAuth ships today; Slack, GitHub, Jira, and PagerDuty are roadmap.
Picking the stage to start at
Not every rollout starts at stage 1. Four dimensions pick the entry point:
- Blast radius. A change whose worst-case regression would hit 50 percent of traffic starts at shadow. One that would hit 0.5 percent on a tier-3 tenant can go straight to canary. Blast radius is downstream impact times traffic share.
- Statistical-power need. Proving the candidate is better (customer-facing claim, compliance audit) wants the full percentage ramp. Proving it’s not worse can stop at canary plus shadow.
- Cost tolerance. Race doubles or triples the bill; full shadow doubles spend on the route; mirror at 5 percent costs 5 percent more; canary doesn’t increase cost. Pick by what the finance lead will sign.
- Tenant sensitivity. Enterprise and regulated workloads (HIPAA, GDPR, financial services) need per-tenant ramps. Free tier can ramp on aggregate traffic. Don’t mix.
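The cost bullet can be turned into a quick multiplier estimate. This sketch assumes the candidate costs the same per request as production and that a race bills every provider it fans out to; `spend_multiplier` and its parameters are illustrative names.

```python
def spend_multiplier(strategy, mirror_fraction=0.05, race_providers=2):
    """Rough inference-spend multiplier on a route, relative to production alone."""
    if strategy == "shadow":
        return 2.0                    # every request runs twice
    if strategy == "mirror":
        return 1.0 + mirror_fraction  # only the sampled share runs twice
    if strategy == "canary":
        return 1.0                    # traffic is split, not duplicated
    if strategy == "race":
        return float(race_providers)  # every provider in the race is billed
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("shadow", "mirror", "canary", "race"):
    print(name, spend_multiplier(name))
```

Multiplying the result by the route's current monthly spend gives the number to put in front of whoever signs off on the budget.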
The reference path: shadow for any new prompt, model, or graph change. If shadow is clean after 24 hours, move to a 1 to 5 percent canary stratified by tier. If the canary is clean for 48 to 72 hours, ramp 10 / 25 / 50 / 100 with the t-test gate. If the candidate is also moving providers, run the production-vs-candidate arm as a race underneath so latency stays inside contract bounds.
Anti-patterns from incident postmortems
- Skipping shadow. Shadow is the only stage with zero user effect at full coverage; skipping it means the first time the candidate touches a user is the first time the team finds out what its failure modes look like.
- Canary without a written rollback criterion. Watching dashboards is not a gate. Write the rubric and the trigger before traffic flips.
- Canary without stratification. Routing 5 percent at random can put 100 percent of one tier-1 tenant on the candidate. Use the stratify-tag header.
- Mean-only gating. Mean Groundedness at 0.91 while a sub-route ran at 0.62 is the canonical fail story. Gate on percentiles and Error Feed clusters, not just means.
- Race without cost guardrails. A race policy fanning out to three providers quadruples the bill inside a week without per-tag budgets.
- No cache invalidation on rollback. Cached candidate responses keep serving until TTL expires. The rollback path includes the cache flush, or the rollback didn’t roll back.
- Treating shadow as a one-shot. Distribution drifts; the rubric catches the drift only if shadow stays on.
What the eval stack adds on top
Routing decides which arm a request takes; the eval stack decides whether the candidate is good enough to promote. The same templates that gate the CI PR (Groundedness, TaskCompletion, AnswerRefusal, Toxicity, PromptInjection) also score the canary’s live traffic, so a regression slipping past CI surfaces on production scoring before the next ramp. For the CI side, see evaluate RAG applications in CI/CD.
Before a candidate prompt goes into shadow, six agent-opt optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) can search for variants that beat the incumbent on the rubric. The Platform retunes evaluators from thumbs feedback; classifier-backed evals run at lower per-eval cost than Galileo Luna-2 on rubrics with a clean classifier target. Honest framing: the trace-stream-to-dataset connector is roadmap, so today the loop runs on curated datasets the team promotes failing traces into. See automated optimization for agents.
Practical first move
If you’re running an agent in production and the rollout strategy is “deploy and watch dashboards”:
- Pick one route where prompt edits are frequent.
- Stand up traceAI so candidate traces carry an agentcc.routing_strategy attribute.
- Pick three rubrics that matter on the route (Groundedness, TaskCompletion, plus one route-specific).
- Turn shadow on at 100 percent with x-agentcc-routing-strategy: shadow for 24 hours.
- Write a one-page promotion rubric and rollback triggers.
- The next prompt change runs shadow first, then 1 to 5 percent canary stratified by tier, then 10 / 25 / 50 / 100 with the t-test gate, then full under armed auto-rollback.
Once it holds on one route, expand. The gateway headers and the five-level budget hierarchy are what make the rollout work across more than one team without becoming someone’s full-time job.
Ready to wire a four-stage agent rollout? Point your OpenAI SDK at https://gateway.futureagi.com/v1, set x-agentcc-routing-strategy: shadow on the first request, and let the rubric do the gating. Your next prompt edit has somewhere to land.
Related reading
- The 2026 LLM Evaluation Playbook
- Evaluate RAG Applications in CI/CD (2026)
- LLM Deployment Best Practices (2026)
- A/B Testing LLM Prompts Best Practices (2026)
- Agent Passes Evals, Fails Production (2026)
- Canary Model Rollouts with an AI Gateway (2026)
- LLM Eval with Shadow Traffic and Canary (2026)
- AI Agent Cost Optimization and Observability (2026)