
Agent Rollout Strategies in 2026: The Four-Stage Gate

Agent rollout is a four-stage gate: shadow, canary, percentage, full. Each stage has a different eval question. Skipping one ships a production incident.

A prompt edit ships at 4:47 pm. The canary runs at 1 percent for forty minutes. Dashboards look fine. The team promotes to 25 percent. Twelve hours later, an enterprise tenant opens a ticket: the new prompt subtly broke a tool-call schema on a refund-flow sub-route that the 30-example regression suite never hit. Mean Groundedness held at 0.91 while Groundedness on the affected traffic sat at 0.62. The semantic cache kept serving the bad answer for another forty minutes after the rollback. The gate fired green because it answered the wrong question.

Agent rollout isn’t a deploy. It’s a partial behavior change on a non-deterministic system, with blast radius across silent paths, signal that only resolves after hours of real traffic, and caches that outlive the rollback. The opinion this post earns: agent rollout is a four-stage gate (shadow, canary, percentage, full), and each stage answers a different eval question. Skipping one ships a production incident.

This is the playbook: what each stage proves, the eval question, the rubric math, the rollback triggers, and how the Agent Command Center routing layer makes the four stages a header change. The code examples are shaped against the ai-evaluation SDK and the gateway.

TL;DR: the four-stage gate

| Stage | Traffic to candidate | User impact | Eval question | Promotion gate |
|---|---|---|---|---|
| 1. Shadow | 100% mirrored | Zero | Does the candidate behave wildly differently on real traffic? | Per-rubric distribution within 1 point of production over 24-72h |
| 2. Canary | 1-5% live | Tier-stratified | Is the candidate at least as good with users in the loop? | Containment × (1 − False Resolution) within noise floor of baseline |
| 3. Percentage | 10, 25, 50% live | Broader | Are per-rubric deltas statistically significant on prod data? | Welch’s t-test p > 0.05 on each rubric vs 7-day baseline |
| 4. Full | 100% live | All users | Does the candidate hold the line under load with auto-rollback armed? | Guardrail trip rate, rubric rolling mean, and p99 latency hold for 48-72h |

Skipping a stage is the cheap-and-fast failure mode. The eval question changes at each stage; a green check at stage 1 doesn’t answer stage 2’s question.

Why agent rollouts break differently

Classical deploys ship code. Agent rollouts ship a behavior change, and break in three ways code deploys don’t.

Non-determinism makes single traces useless as proof. The candidate can answer correctly on a probe and wrong on the next identical probe. The promotion decision lives on scored distributions over real traffic, not curl examples.

One prompt edit fans out across silent paths. A change meant to tighten a refund-flow tool-call format also affects summarization, the help-desk fallback, and the rare disambiguation turn that triggers once per 500 requests. The regression only shows on the full distribution.

Rollback isn’t clean. Semantic caches store old responses keyed on prompt-hashes that don’t change on revert. Downstream stores snapshot the bad output and reuse it. If the rollback plan is “revert the deploy,” regressions ghost-serve for hours. The first agent incident most teams hit is “we rolled back but it didn’t fix.”

Each stage covers a failure the prior stage can’t see. For the eval foundation, see the 2026 LLM evaluation playbook.

Stage 1: Shadow, does the candidate behave reasonably on real traffic

Every production request is duplicated. Production answers the user; the candidate runs in parallel; the candidate’s output is scored offline and discarded. Zero user effect at full traffic coverage.

The eval question is bounded: does the candidate’s distribution on real traffic look reasonable against production’s? A 1.5-point Groundedness drop is a stop signal even though no user saw a single candidate response. A new cluster in the candidate’s Error Feed that doesn’t appear in production’s is also a stop signal. The candidate isn’t promoted because it scored well on a curated test set; it’s promoted because the rubric distribution on real traffic holds.

# Configure shadow routing on Agent Command Center
import requests

response = requests.post(
    "https://gateway.futureagi.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_AGENTCC_KEY",
        "x-agentcc-routing-strategy": "shadow",
        "x-agentcc-shadow-target": "candidate-v2-prompt",
    },
    json={
        "model": "production-agent-v1",
        "messages": [{"role": "user", "content": user_input}],
    },
)
# Response is from production-agent-v1.
# Candidate runs in parallel; its response is captured for scoring, not served.
print(response.headers["x-agentcc-routing-strategy"])  # "shadow"

Hold for 24 to 72 hours depending on volume. The cost trade is real: every request runs twice, so spend roughly doubles. Mirror is the cheaper variant when 5 to 10 percent sampling converges the rubric in the same window. Shadow’s other failure mode is treating it as a one-shot; keep it on after promotion so the team notices quality breaking before the next ramp.
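The shadow check described above can be sketched in a few lines. This is a minimal illustration, assuming rubric scores on a 1-5 scale have already been collected per arm as plain lists; the name `shadow_gate`, the dict layout, and the sample scores are hypothetical and not part of the ai-evaluation SDK.

```python
import statistics

def shadow_gate(production, candidate, tolerance=1.0):
    """Compare per-rubric means between the two arms.

    production / candidate: dict mapping rubric name -> list of scores.
    Returns (passed, failures), where failures maps rubric -> mean drop.
    Hypothetical helper for illustration only.
    """
    failures = {}
    for rubric, prod_scores in production.items():
        drop = statistics.mean(prod_scores) - statistics.mean(candidate[rubric])
        if drop > tolerance:  # candidate is worse by more than the tolerance
            failures[rubric] = round(drop, 3)
    return (not failures), failures

# Made-up scores: Groundedness drops 1.5 points, TaskCompletion holds.
production = {"groundedness": [4.5, 4.0, 4.5, 5.0],
              "task_completion": [4.0, 4.0, 4.5, 4.5]}
candidate = {"groundedness": [3.0, 2.5, 3.0, 3.5],
             "task_completion": [4.0, 4.5, 4.0, 4.5]}

passed, failures = shadow_gate(production, candidate)
print(passed, failures)
```

A mean comparison is the simplest possible form of the check; a fuller version would also diff the Error Feed clusters between arms.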

Stage 2: Canary, Containment Rate times (1 − False Resolution Rate)

The candidate serves real traffic on 1 to 5 percent of users, stratified by tenant tier. Shadow proved the candidate doesn’t behave wildly differently on real inputs; canary proves it’s at least as good with users in the loop.

The eval question shifts from distribution shape to user outcomes. The gate metric is the product:

Containment Rate × (1 − False Resolution Rate)

Containment is the share of conversations resolved without escalation. False Resolution is the share of “resolved” conversations where the resolution was wrong (caught by a user complaint, a thumbs-down, an Error Feed cluster, or a second-touch metric). The product is the useful gate because the canary failure mode is a candidate that resolves more cases but resolves them worse. A 2-point Containment lift with a 4-point False Resolution lift is a regression the average rubric won’t catch.
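To make the arithmetic concrete, here is a small worked example of the product metric. The baseline figures (70 percent Containment, 10 percent False Resolution) are invented for illustration; only the 2-point and 4-point lifts come from the text above.

```python
def gate_metric(containment, false_resolution):
    """Containment Rate x (1 - False Resolution Rate), both as fractions in [0, 1]."""
    return containment * (1.0 - false_resolution)

# Hypothetical baseline: 70% containment, 10% false resolution.
baseline = gate_metric(0.70, 0.10)
# Candidate: +2 points containment, +4 points false resolution.
candidate = gate_metric(0.72, 0.14)

print(f"baseline={baseline:.4f} candidate={candidate:.4f} "
      f"delta={candidate - baseline:+.4f}")
```

Under these assumed numbers the candidate contains more conversations yet correctly resolves fewer of them, so the product falls even though Containment alone rose.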

# Canary at 5%, stratified by tenant tier
response = requests.post(
    "https://gateway.futureagi.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_AGENTCC_KEY",
        "x-agentcc-routing-strategy": "canary",
        "x-agentcc-canary-target": "agent-v2",
        "x-agentcc-canary-percent": "5",
        "x-agentcc-canary-stratify-tag": "tenant_tier",
    },
    json={"model": "agent-v1", "messages": [...]},
)
print(response.headers["x-agentcc-routing-strategy"])  # "canary"

Two canary failure modes show up in incident reports. Canary without a written rollback criterion: a canary that “looks fine” isn’t a promotion signal. Write the rubric and the trigger before traffic flips. Canary without stratification: routing 5 percent at random can put 100 percent of one enterprise tenant on the candidate. The five-level budget hierarchy (org, team, user, key, tag) is the routing primitive for stratified canary.
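The stratification point can be illustrated with a local sketch. It assumes the sampling unit is a user and that each user carries a tier label; the function name, the salt, and the population below are hypothetical, and this is not how the gateway's stratify-tag header is implemented.

```python
import hashlib
import math
from collections import defaultdict

def stratified_canary(users, percent, salt="canary-v2"):
    """Pick `percent` of users inside each tier, deterministically.

    users: iterable of (user_id, tier) pairs with unique user_ids.
    Returns the set of user_ids routed to the candidate.
    """
    by_tier = defaultdict(list)
    for user_id, tier in users:
        by_tier[tier].append(user_id)

    selected = set()
    for ids in by_tier.values():
        # A salted hash gives a stable pseudo-random order per user.
        ranked = sorted(
            ids, key=lambda u: hashlib.sha256(f"{salt}:{u}".encode()).hexdigest()
        )
        quota = math.ceil(len(ids) * percent / 100)
        selected.update(ranked[:quota])
    return selected

users = ([(f"free-{i}", "free") for i in range(1000)]
         + [(f"pro-{i}", "pro") for i in range(100)]
         + [(f"ent-{i}", "enterprise") for i in range(20)])
canary = stratified_canary(users, percent=5)
print(len(canary))
```

Because the quota is applied per tier, no tier can end up over-represented by chance, and the same salt always yields the same assignment.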

Hold for 24 to 72 hours. The math is the same as any A/B test, with Containment * (1 - False Resolution) as the metric and the rubric noise floor as the minimum detectable effect. For foundations, see the A/B testing LLM prompts guide.
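As a rough guide to how much canary traffic the hold window needs, the standard two-proportion sample-size approximation can be applied. The sketch below treats the gate metric as a single proportion (the share of conversations that were contained and correctly resolved) and uses a two-sided alpha of 0.05 with 80 percent power; the function name and the example rates are assumptions.

```python
import math
from statistics import NormalDist

def samples_per_arm(p_baseline, min_detectable_drop, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for comparing two proportions."""
    p_candidate = p_baseline - min_detectable_drop
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_candidate * (1 - p_candidate)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / min_detectable_drop ** 2)

# Hypothetical baseline metric of 0.63; compare a 2-point and a 5-point noise floor.
print(samples_per_arm(0.63, 0.02))
print(samples_per_arm(0.63, 0.05))
```

The required sample grows with the inverse square of the minimum detectable effect, which is why low-volume routes need the longer end of the hold window.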

Stage 3: Percentage, statistical gate on production-data deltas

If canary holds, the candidate ramps 10, 25, 50, each step held 12 to 24 hours. The eval question shifts: not “is the candidate reasonable” or “is the candidate at least as good,” but “is the delta statistically significant on the gating rubric?”

A green check should mean “this rollout did not introduce a statistically significant regression,” not “mean Groundedness sat above 0.85.” Floors catch catastrophic breaks; deltas catch slow regressions; both are required.

import statistics
from scipy import stats

def rollout_gate(candidate, baseline, alpha=0.05, min_effect=0.05):
    """Fail only if the mean dropped, the change is significant, and the effect is real."""
    delta = statistics.mean(candidate) - statistics.mean(baseline)
    if delta >= 0:
        return True, f"no regression (delta=+{delta:.3f})"
    _, p = stats.ttest_ind(candidate, baseline, equal_var=False)
    if p >= alpha:
        return True, f"delta={delta:.3f}, p={p:.3f} (not significant)"
    if abs(delta) < min_effect:
        return True, f"delta={delta:.3f} below effect floor {min_effect}"
    return False, f"regression: delta={delta:.3f}, p={p:.3f}"

The baseline is a rolling 7-day production observation, not a frozen number. Models drift, prompts drift, traffic drifts; the gate drifts with them or it catches ordinary movement instead of regressions. Use a two-proportion z-test on pass-rate rubrics like citation validity, Welch’s t-test on continuous rubrics. For long-tail failures that hide in averages, gate on percentiles. The fi CLI ships pass_rate, avg_score, p50/p90/p95_score as native assertion metrics, so a regression pushing p95_score below a tail floor while leaving the mean intact fails on the percentile.
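For pass-rate rubrics, a two-proportion z-test can be written with the standard library alone. This is a minimal one-sided sketch (it only asks whether the candidate's pass rate is lower); the function name and the counts are illustrative, and it is not the fi CLI's implementation.

```python
import math
from statistics import NormalDist

def pass_rate_regression(cand_pass, cand_n, base_pass, base_n, alpha=0.05):
    """One-sided pooled two-proportion z-test.

    Returns (regressed, z, p_value); regressed is True when the candidate's
    pass rate is significantly lower than the baseline's.
    """
    p_cand = cand_pass / cand_n
    p_base = base_pass / base_n
    pooled = (cand_pass + base_pass) / (cand_n + base_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / cand_n + 1 / base_n))
    if se == 0:
        return False, 0.0, 1.0  # both arms all-pass or all-fail
    z = (p_cand - p_base) / se
    p_value = NormalDist().cdf(z)  # small when the candidate is lower
    return p_value < alpha, z, p_value

# Hypothetical counts: 850/1000 candidate passes vs 900/1000 baseline passes.
regressed, z, p_value = pass_rate_regression(850, 1000, 900, 1000)
print(regressed, round(z, 2))
```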

The 25 percent step is where teams discover a rubric they forgot to gate on. A five-rubric starting set:

| Dimension | Threshold against baseline |
|---|---|
| Groundedness | Within 1 point (or higher) |
| TaskCompletion | Within 1 point (or higher) |
| AnswerRefusal | At or below production false-refusal rate |
| Toxicity / PromptInjection | No worse than production |
| New Error Feed cluster | Zero candidate-only clusters tolerated |

The “no new failure cluster” gate catches the worst-case regression that average scores hide.

Stage 4: Full, 100 percent with auto-rollback armed

The candidate ramps to 100 percent. The previous version isn’t retired yet. Rollback decisions are too slow when they’re a meeting; the triggers go into the gateway config, and the flip happens on the next request, not after a Slack thread.

# agentcc-rollback.yaml
rollback_triggers:
  - name: guardrail_trip_rate
    metric: x-agentcc-guardrail-triggered
    window: 15m
    threshold: 1.5x_baseline    # vs trailing 7-day
    action: rollback_immediate

  - name: rubric_regression
    metric: groundedness_rolling_mean
    window: 1h
    threshold: -0.5             # absolute point drop on 1-5
    significance: p<0.05
    action: rollback_immediate

  - name: p99_latency
    metric: x-agentcc-latency-ms
    aggregation: p99
    window: 10m
    threshold: 1.3x_baseline
    action: rollback_immediate

  - name: candidate_only_cluster
    metric: error_feed_cluster_id
    condition: not_in_trailing_window(7d)
    action: rollback_immediate

Any single trigger fires the rollback; the gateway flips to the prior version on the next request, and the response headers carry the new strategy so observability sees the flip without polling a control plane. Hold 48 to 72 hours under armed rollback before the previous version is retired. Cost guardrails are the missed fifth trigger: a race-shape rollout fanning out to three providers can triple the bill inside a week without a per-tag budget cap.
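The trigger logic in the config can be mirrored in a short local check, which is handy for unit-testing thresholds before arming them. The sketch assumes window statistics have already been aggregated; `WindowStats`, `fired_triggers`, and the sample numbers are hypothetical, and the gateway's own evaluation may differ.

```python
from dataclasses import dataclass

@dataclass
class WindowStats:
    guardrail_trip_rate: float
    groundedness_mean: float      # 1-5 scale
    p99_latency_ms: float
    cluster_ids: frozenset

def fired_triggers(current, baseline, rubric_p_value):
    """Return the names of the rollback triggers that fire."""
    fired = []
    if current.guardrail_trip_rate > 1.5 * baseline.guardrail_trip_rate:
        fired.append("guardrail_trip_rate")
    drop = baseline.groundedness_mean - current.groundedness_mean
    if drop >= 0.5 and rubric_p_value < 0.05:
        fired.append("rubric_regression")
    if current.p99_latency_ms > 1.3 * baseline.p99_latency_ms:
        fired.append("p99_latency")
    if current.cluster_ids - baseline.cluster_ids:  # clusters the baseline never saw
        fired.append("candidate_only_cluster")
    return fired

baseline = WindowStats(0.02, 4.5, 800.0, frozenset({"c1", "c2"}))
current = WindowStats(0.025, 4.4, 1100.0, frozenset({"c1", "c9"}))
fired = fired_triggers(current, baseline, rubric_p_value=0.40)
print(fired)  # any non-empty list means roll back
```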

The rollback path is the often-skipped step. On rollback: invalidate the semantic cache namespace tagged with the candidate version, flush any downstream stores that snapshotted the bad output, and bump the rubric version so the production observer scores against the right floor. Otherwise regressions ghost-serve for hours after the revert. For the broader deploy story, see LLM deployment best practices.
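The ghost-serving problem is easy to reproduce with a toy cache. The sketch below assumes an exact-match cache keyed on a prompt hash, with each entry tagged by the agent version that produced it; real semantic caches match on embedding similarity, and the class and method names here are hypothetical.

```python
class VersionTaggedCache:
    """Toy cache: keyed on prompt hash, each entry tagged with its agent version."""

    def __init__(self):
        self._entries = {}  # prompt_hash -> (agent_version, response)

    def put(self, prompt_hash, agent_version, response):
        self._entries[prompt_hash] = (agent_version, response)

    def get(self, prompt_hash):
        entry = self._entries.get(prompt_hash)
        return entry[1] if entry else None

    def invalidate_version(self, agent_version):
        """Drop every entry written by one agent version; return the count."""
        stale = [k for k, (v, _) in self._entries.items() if v == agent_version]
        for key in stale:
            del self._entries[key]
        return len(stale)

cache = VersionTaggedCache()
cache.put("hash-refund", "agent-v2", "bad refund answer")
cache.put("hash-billing", "agent-v1", "good billing answer")

# After a routing rollback the key is unchanged, so the bad answer is still served.
ghost = cache.get("hash-refund")
removed = cache.invalidate_version("agent-v2")
after = cache.get("hash-refund")
print(ghost, removed, after)
```

Tagging entries with the producing version at write time is what makes a targeted flush possible; without the tag, the only safe option is clearing the whole cache.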

How Agent Command Center wires the four stages

The four stages are header changes on the same routing layer. Agent Command Center ships shadow, mirror, race, fallback, load-balanced, and budget-aware routing as first-class strategies in the gateway core. The gateway is a single 17 MB Go binary (Apache 2.0, OpenAI-compatible) running at ~29k req/s with P99 21 ms on a t3.xlarge with guardrails on. Self-host or use gateway.futureagi.com/v1.

Every response carries the headers the eval layer needs:

  • x-agentcc-routing-strategy: which strategy fired (shadow, mirror, canary, race, fallback)
  • x-agentcc-model-used: which provider returned the served response
  • x-agentcc-fallback-used: whether a candidate fell back to a different provider
  • x-agentcc-latency-ms: gateway-measured end-to-end latency
  • x-agentcc-cost: gateway-measured per-request cost
  • x-agentcc-guardrail-triggered: whether an input or output guardrail fired

traceAI spans pick the headers up as attributes so eval scores on the trace tree know which arm they belong to. The same rubric runs against both arms; spans tell the eval layer which baseline to compare against. For more on separating production and candidate traces, see LLM eval with shadow traffic and canary.
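One way to picture the header-to-attribute step is a small mapping function. The `agentcc.routing_strategy` attribute name appears later in this post; applying the same naming pattern to the other headers is an assumption of this sketch, and the function is not part of traceAI.

```python
def span_attributes(response_headers):
    """Map x-agentcc-* response headers to agentcc.* span attribute names."""
    prefix = "x-agentcc-"
    attrs = {}
    for name, value in response_headers.items():
        lowered = name.lower()  # HTTP header names are case-insensitive
        if lowered.startswith(prefix):
            key = "agentcc." + lowered[len(prefix):].replace("-", "_")
            attrs[key] = value
    return attrs

headers = {
    "X-AgentCC-Routing-Strategy": "canary",
    "x-agentcc-latency-ms": "412",
    "Content-Type": "application/json",
}
attrs = span_attributes(headers)
print(attrs)
```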

from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, TaskCompletion, AnswerRefusal,
    Toxicity, PromptInjection,
)
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env
rubric = [Groundedness(), TaskCompletion(), AnswerRefusal(),
          Toxicity(), PromptInjection()]

production = evaluator.evaluate(eval_templates=rubric,
    inputs=[TestCase(**t) for t in production_traces])
candidate  = evaluator.evaluate(eval_templates=rubric,
    inputs=[TestCase(**t) for t in candidate_traces])

For a canary serving live traffic, the same rubric runs as an output guardrail. MAJORITY aggregation ships the response only if most rubrics agree it’s clean; ALL is stricter; WEIGHTED lets Groundedness count more than Toxicity. For rubrics that aren’t gameable, see the agent evaluation frameworks guide.
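The three aggregation modes can be sketched over boolean rubric verdicts. This is an illustration of the idea only: the function, the 0.5 threshold for the weighted mode, and the weights are assumptions, not the SDK's guardrail implementation.

```python
def aggregate(verdicts, mode="MAJORITY", weights=None, threshold=0.5):
    """Decide whether to ship a response.

    verdicts: dict mapping rubric name -> True if that rubric judged it clean.
    """
    if not verdicts:
        raise ValueError("no rubric verdicts supplied")
    if mode == "ALL":
        return all(verdicts.values())
    if mode == "MAJORITY":
        return sum(verdicts.values()) > len(verdicts) / 2
    if mode == "WEIGHTED":
        weights = weights or {}
        total = sum(weights.get(name, 1.0) for name in verdicts)
        clean = sum(weights.get(name, 1.0) for name, ok in verdicts.items() if ok)
        return clean / total > threshold
    raise ValueError(f"unknown mode: {mode}")

verdicts = {"Groundedness": False, "TaskCompletion": True,
            "AnswerRefusal": True, "Toxicity": True, "PromptInjection": True}

print(aggregate(verdicts, "MAJORITY"))
print(aggregate(verdicts, "ALL"))
print(aggregate(verdicts, "WEIGHTED", weights={"Groundedness": 5.0}))
```

With a single Groundedness failure, the majority mode ships the response while the strict and Groundedness-weighted modes block it, which is the trade-off the paragraph above describes.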

The hardest rollout signal is “the candidate broke a path production handles fine.” A 0.3-point average rubric drop can hide a cluster of catastrophic failures in one sub-route. Error Feed clusters candidate-side failures via HDBSCAN soft-clustering over ClickHouse-stored span embeddings; a Sonnet 4.5 Judge agent reads the failing trace and writes an immediate_fix plus a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each) per cluster. If a cluster appears in the candidate’s feed but not in production’s, the rollback fires. Linear OAuth ships today; Slack, GitHub, Jira, and PagerDuty are roadmap.

Picking the stage to start at

Not every rollout starts at stage 1. Four dimensions pick the entry point:

  • Blast radius. A change that could regress 50 percent of traffic starts at shadow. A change that could only touch 0.5 percent of traffic on a tier-3 tenant can go straight to canary. Blast radius is downstream impact multiplied by traffic share.
  • Statistical-power need. Proving the candidate is better (customer-facing claim, compliance audit) wants the full percentage ramp. Proving it’s not worse can stop at canary plus shadow.
  • Cost tolerance. Race doubles or triples the bill; full shadow doubles spend on the route; mirror at 5 percent costs 5 percent more; canary doesn’t increase cost. Pick by what the finance lead will sign.
  • Tenant sensitivity. Enterprise and regulated workloads (HIPAA, GDPR, financial services) need per-tenant ramps. Free tier can ramp on aggregate traffic. Don’t mix.
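The cost figures in the list above reduce to simple multipliers. The sketch assumes the candidate costs the same per request as production and that a race calls every provider on every request; the function name is hypothetical.

```python
def cost_multiplier(strategy, sample_fraction=1.0, providers=1):
    """Approximate spend on a route relative to serving production alone."""
    if strategy == "canary":
        return 1.0                    # each request still runs exactly once
    if strategy in ("shadow", "mirror"):
        return 1.0 + sample_fraction  # duplicated share of traffic
    if strategy == "race":
        return float(providers)       # every provider answers every request
    raise ValueError(f"unknown strategy: {strategy}")

for name, kwargs in [("canary", {}),
                     ("shadow", {"sample_fraction": 1.0}),
                     ("mirror", {"sample_fraction": 0.05}),
                     ("race", {"providers": 3})]:
    print(name, cost_multiplier(name, **kwargs))
```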

The reference path: shadow for any new prompt, model, or graph change. If shadow is clean after 24 hours, move to a 1 to 5 percent canary stratified by tier. If the canary is clean for 24 to 72 hours, ramp 10 / 25 / 50 / 100 with the t-test gate. If the candidate is also moving providers, run the production-vs-candidate arm as a race underneath so latency stays inside contract bounds.

Anti-patterns from incident postmortems

  • Skipping shadow. Shadow is the only stage with zero user effect at full coverage; skipping it means the first time the candidate touches a user is the first time the team finds out what its failure modes look like.
  • Canary without a written rollback criterion. Watching dashboards is not a gate. Write the rubric and the trigger before traffic flips.
  • Canary without stratification. Routing 5 percent at random can put 100 percent of one tier-1 tenant on the candidate. Use the stratify-tag header.
  • Mean-only gating. Mean Groundedness at 0.91 while a sub-route ran at 0.62 is the canonical fail story. Gate on percentiles and Error Feed clusters, not just means.
  • Race without cost guardrails. A race policy fanning out to three providers can triple the bill inside a week without per-tag budgets.
  • No cache invalidation on rollback. Cached candidate responses keep serving until TTL expires. The rollback path includes the cache flush, or the rollback didn’t roll back.
  • Treating shadow as a one-shot. Distribution drifts; the rubric catches the drift only if shadow stays on.
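The mean-only gating anti-pattern above is easy to demonstrate with synthetic scores. The numbers below are invented to echo the opening incident (an overall mean near 0.91 with one sub-route at 0.62), and the choice of the 5th percentile as the tail metric is an assumption of this sketch; use whichever percentile metric your tooling exposes.

```python
import statistics

def percentile(scores, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(scores)
    rank = -(-q * len(ordered) // 100)  # ceiling of q * n / 100 in integer math
    return ordered[max(rank, 1) - 1]

# Synthetic data: the baseline is uniform; the candidate is slightly better on
# 90% of traffic and badly broken on a 10% sub-route.
baseline = [0.91] * 1000
candidate = [0.94] * 900 + [0.62] * 100

mean_delta = statistics.mean(candidate) - statistics.mean(baseline)
print(f"mean delta: {mean_delta:+.3f}")
print("p5 baseline:", percentile(baseline, 5), "p5 candidate:", percentile(candidate, 5))
```

The mean barely moves while the low percentile collapses, which is why a percentile or cluster gate has to sit alongside the mean.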

What the eval stack adds on top

Routing decides which arm a request takes; the eval stack decides whether the candidate is good enough to promote. The same templates that gate the CI PR (Groundedness, TaskCompletion, AnswerRefusal, Toxicity, PromptInjection) also score the canary’s live traffic, so a regression slipping past CI surfaces on production scoring before the next ramp. For the CI side, see evaluate RAG applications in CI/CD.

Before a candidate prompt goes into shadow, six agent-opt optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) can search for variants that beat the incumbent on the rubric. The Platform retunes evaluators from thumbs feedback; classifier-backed evals run at lower per-eval cost than Galileo Luna-2 on rubrics with a clean classifier target. Honest framing: the trace-stream-to-dataset connector is roadmap, so today the loop runs on curated datasets the team promotes failing traces into. See automated optimization for agents.

Practical first move

If you’re running an agent in production and the rollout strategy is “deploy and watch dashboards”:

  1. Pick one route where prompt edits are frequent.
  2. Stand up traceAI so candidate traces carry an agentcc.routing_strategy attribute.
  3. Pick three rubrics that matter on the route (Groundedness, TaskCompletion, plus one route-specific).
  4. Turn shadow on at 100 percent with x-agentcc-routing-strategy: shadow for 24 hours.
  5. Write a one-page promotion rubric and rollback triggers.
  6. The next prompt change runs shadow first, then 1 to 5 percent canary stratified by tier, then 10 / 25 / 50 / 100 with the t-test gate, then full under armed auto-rollback.

Once it holds on one route, expand. The gateway headers and the five-level budget hierarchy are what make the rollout work across more than one team without becoming someone’s full-time job.

Ready to wire a four-stage agent rollout? Point your OpenAI SDK at https://gateway.futureagi.com/v1, set x-agentcc-routing-strategy: shadow on the first request, and let the rubric do the gating. Your next prompt edit has somewhere to land.

Frequently asked questions

What is the four-stage agent rollout gate?
Shadow, canary, percentage, full. Shadow runs the candidate on real traffic with zero user effect; the question is whether the rubric scores converge against production. Canary serves 1 to 5 percent live, stratified by tenant tier; the question is whether Containment Rate times (1 − False Resolution Rate) holds against the trailing 24-hour production baseline. Percentage ramps 10, 25, 50; the question is whether per-rubric deltas are statistically significant on live traffic. Full carries 100 percent with auto-rollback armed; the question is whether the guardrail trip rate, the rubric rolling mean, and p99 latency hold for the lock-in window. Each stage answers a different eval question; skipping one ships a production incident.
How long should each rollout stage run?
Shadow runs 24 to 72 hours, long enough for the rubric distribution on candidate traffic to converge against production. Canary holds at 1 to 5 percent for 24 to 72 hours, long enough for per-rubric deltas to be distinguishable from the rubric noise floor and for the per-tenant Error Feed to show no candidate-only failure clusters. Percentage ramps 10, 25, 50 with 12 to 24 hours at each step and a Welch t-test gate against the trailing 7-day production baseline. Full waits 48 to 72 hours under armed auto-rollback before the previous version is retired. The gates compound: a regression that hides in shadow surfaces in canary because the live blast radius is non-zero.
What auto-rollback triggers should an agent rollout arm?
Four triggers, all written before traffic flips. Guardrail trip rate above 1.5 times the trailing 7-day production rate over a 15-minute window. Per-rubric rolling mean drops more than the noise floor (a 0.5-point drop on a 1-5 Groundedness rubric is meaningful; 0.05 is not) with p less than 0.05 on Welch's t-test. p99 latency exceeds 1.3 times the production baseline for 10 minutes. Error Feed surfaces a candidate-only failure cluster the trailing window has not seen. Any single trigger fires the rollback signal; the Agent Command Center routing flips back to the prior version and the response headers carry the new strategy on the next request.
Why do agent rollbacks ghost-serve regressions after the revert?
Semantic caches key on prompt-hash, not on which agent version generated the response. After a rollback, every cache hit re-serves the candidate's bad output until the TTL expires. The fix lives in the rollback path, not the deploy: invalidate the semantic cache namespace tagged with the candidate version, flush any downstream stores that snapshotted the bad output, and bump the rubric version so the production observer scores the post-rollback traffic against the right floor. Without invalidation, you ship two incidents: the regression and the ghost-served regression.
Shadow versus canary at 1 percent: which is the right first stage?
Shadow first, every time. Shadow has zero user effect at full traffic coverage, which means the candidate gets scored against the actual production distribution before a single user sees it. A 1 percent canary has a non-zero blast radius and lower statistical power than 100 percent shadow on the same route. The two have different jobs: shadow proves the candidate does not behave wildly differently on real traffic; canary proves the candidate is at least as good as production with users in the loop. Skipping shadow means the first time the candidate touches a user is the first time you find out what its failure modes look like.
What does Containment Rate times (1 − False Resolution Rate) measure in canary?
Containment Rate is the share of conversations resolved without escalation; False Resolution Rate is the share of resolved conversations where the resolution was wrong. Multiplying Containment Rate by (1 − False Resolution Rate) gives the useful metric, the share of conversations that were both contained and correctly resolved: a high Containment Rate with a high False Resolution Rate means the agent is confidently giving wrong answers. The canary gate fails when Containment × (1 − False Resolution) on the candidate drops more than the noise floor against the trailing 24-hour production baseline. The two-metric form is what stops a candidate that resolves more cases but resolves them worse from looking like a win on the average.
How does Agent Command Center support a four-stage rollout?
Agent Command Center is a 17 MB Go binary (Apache 2.0, OpenAI-compatible) that ships shadow, mirror, race, fallback, load-balanced, and budget-aware routing as first-class strategies. Routing is a header on the request: x-agentcc-routing-strategy switches stages without redeploying. The five-level budget hierarchy (org, team, user, key, tag) is the routing primitive for per-tenant percentage ramps. Response headers carry x-agentcc-routing-strategy, x-agentcc-model-used, x-agentcc-fallback-used, x-agentcc-latency-ms, x-agentcc-cost, and x-agentcc-guardrail-triggered so traceAI spans tag the right arm and the eval scoring layer attaches the rubric to the right tree. Auto-rollback wires on guardrail trip rate; Error Feed clusters candidate-only failures (HDBSCAN over ClickHouse plus a Sonnet 4.5 Judge writing immediate_fix) so the rollback signal carries the cluster, not just the average.