LLM Eval with Shadow Traffic and Canary Deployment in 2026
Shadow is not canary. Mirror routing with no user effect vs. percentage routing with rollback. Score-attached traffic, the Agent Command Center patterns, and the gotchas.
A team finishes a six-week migration to a newer reasoning model. Offline evals look good. Cost looks better. Routing flips at 10 am Wednesday for 100 percent of traffic. By noon, function-call accuracy on the billing agent is down nine points. By 2 pm, refusal rate on the support agent is up seven. On-call rolls back from the gateway dashboard. The post-mortem reads identically to the one from six months earlier: the offline eval set under-represented the real-traffic distribution, no per-cohort live monitor, no canary gate.
PR-time evals catch the obvious regressions on a held-out set. Offline A/B catches the rest of the in-distribution failures. Production traffic has a long tail no held-out set ever captures, and the only way to score the candidate against that tail is to route real traffic at it with a safety net. Shadow and canary are both eval gates on real traffic, and they answer different questions. Shadow mirrors with no user-visible effect; canary serves live with rollback-ready percentage routing. The team that confuses them either ships a quality regression (skipped shadow) or paralyzes on offline numbers that never converge (skipped canary).
This post disambiguates the two, the mirror variant between them, the race pattern beside them, the routing config that exposes them as a header change on Agent Command Center, the score-attachment plumbing that makes the comparison apples-to-apples, and the gotchas that turn a careful rollout into a slow-motion incident.
TL;DR: the four routing patterns, one promotion funnel
| Pattern | Who sees the candidate | Cost overhead | Eval question |
|---|---|---|---|
| Shadow | No one (production serves) | 1x (full duplication) | Does the candidate behave reasonably on the real distribution? |
| Mirror | No one (sampled subset) | N% (sample rate) | Same question, cost-bounded |
| Canary | A stratified user slice (live) | Slice size | Is the candidate at least as good with users in the loop? |
| Race | The user whose request won the race | Nx (parallel fan-out) | Can two candidates beat the latency SLO together? |
The promotion funnel runs shadow, then mirror, then canary, then full rollout, with race reserved for latency-bounded SLOs. The Agent Command Center gateway exposes all four behind the same routing-strategy header so a candidate moves through the funnel without code changes. For the broader four-stage rollout shape, see the agent rollout strategies playbook.
Shadow is not canary: the disambiguation that prevents the wrong incident
The two patterns share routing plumbing and a score-attached evaluator stack. They diverge on the eval question, on user-visibility, and on the failure mode each unlocks.
Shadow’s job is distribution. The candidate runs in parallel on real inputs; the user sees the production response. No blast radius, full traffic coverage available, the candidate’s rubric distribution scored against production’s. A 1.5-point Groundedness drop is a stop signal even though no user saw a single candidate response. A new failure cluster in the candidate’s Error Feed that does not appear in production’s is also a stop signal. Shadow proves the candidate does not behave wildly differently on the real distribution before a user sees it.
Canary’s job is outcomes. The candidate serves a percentage of real users live, scored on the same rubric as production, with auto-rollback armed. The blast radius is non-zero by design because user-visible signals (feedback rate, retry rate, escalation, thumbs-down) only exist when a human is in the loop. A candidate that wins Groundedness on shadow can lose retry rate on canary. The two are not redundant.
The confusion shows up in two failure modes. Skipped shadow: the first time the candidate touches a user is the first time the team finds out what its failure modes look like, the canonical 5 pm Friday incident. Skipped canary: the team waits forever for the offline rubric to “feel ready” instead of putting the candidate in front of users with a rollback path, and the prompt edit that should have shipped in a week stalls for six. Both stages run, in order, with the eval score attached to each.
For the case on why offline-only is incomplete, see Agent Passes Evals, Fails Production and the 2026 LLM evaluation playbook.
Shadow patterns: mirror routing, async eval, score attached on the span
Shadow’s mechanic is request duplication. Production handles the user; the gateway forks the same request to the candidate; both responses come back with a shared correlation ID; the ai-evaluation SDK scores both on the same template list. The user sees one response; the platform sees a scored pair.
# Shadow on Agent Command Center: production serves, candidate is scored offline.
import requests

response = requests.post(
    "https://gateway.futureagi.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_AGENTCC_KEY",
        "x-agentcc-routing-strategy": "shadow",
        "x-agentcc-shadow-target": "candidate-v2",
    },
    json={"model": "production-v1", "messages": [...]},
)

# Production response is served. Candidate response is captured on the
# correlation-id'd span; ai-evaluation subscribes to the shadow stream.
print(response.headers["x-agentcc-routing-strategy"])  # "shadow"
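The duplication mechanic itself can be sketched in a few lines. This is an illustrative model of what a shadowing gateway does, not Agent Command Center's implementation: `call_model`, `handle_with_shadow`, and the record layout are hypothetical names, and the model call is a stub.

```python
import asyncio
import uuid


async def call_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"{model} reply to: {prompt}"


async def handle_with_shadow(prompt: str, pairs: list) -> tuple:
    """Serve production; fork the same request to the candidate."""
    correlation_id = str(uuid.uuid4())

    async def run_candidate() -> None:
        record = {"correlation_id": correlation_id, "arm": "candidate"}
        try:
            record["text"] = await call_model("candidate-v2", prompt)
        except Exception as exc:  # candidate failures never reach the user
            record["error"] = repr(exc)
        pairs.append(record)

    # The candidate runs in the background; the user only waits on production.
    shadow_task = asyncio.create_task(run_candidate())
    served = await call_model("production-v1", prompt)
    pairs.append(
        {"correlation_id": correlation_id, "arm": "production", "text": served}
    )
    return served, shadow_task


async def main() -> tuple:
    pairs: list = []
    served, shadow_task = await handle_with_shadow("Where is my refund?", pairs)
    await shadow_task  # awaited here only so the demo can inspect the pair
    return served, pairs


served, pairs = asyncio.run(main())
print(served)
print(sorted(record["arm"] for record in pairs))
```

The two properties worth keeping from the sketch are that a candidate error is recorded rather than raised, and that both records carry the same correlation ID so the evaluator can score them as a pair.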
Scoring runs against both arms with the same rubric. Guardrails(rail_type=RailType.OUTPUT, aggregation=AggregationStrategy.MAJORITY) is the right default for multi-dimensional rubrics so a single noisy template does not flip the cohort.
from fi.evals import Evaluator, Guardrails
from fi.evals.types import RailType, AggregationStrategy
from fi.evals.templates import (
    Groundedness, ContextAdherence, TaskCompletion,
    LLMFunctionCalling, AnswerRefusal,
)
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

# One rubric, applied unchanged to both arms.
rubric = [
    Groundedness(threshold=0.85),
    ContextAdherence(threshold=0.90),
    TaskCompletion(threshold=0.80),
    LLMFunctionCalling(threshold=0.92),
    AnswerRefusal(),
]

# Majority aggregation so a single noisy template does not flip the cohort.
gate = Guardrails(
    rail_type=RailType.OUTPUT,
    aggregation=AggregationStrategy.MAJORITY,
    templates=rubric,
)

# production_traces / candidate_traces are the paired trace dicts captured
# from the shadow stream; loading them is not shown in this snippet.
prod = evaluator.evaluate(
    eval_templates=rubric,
    inputs=[TestCase(**t) for t in production_traces],
)
cand = evaluator.evaluate(
    eval_templates=rubric,
    inputs=[TestCase(**t) for t in candidate_traces],
)
Mirror is shadow at a fractional sample rate (x-agentcc-routing-strategy: mirror with x-agentcc-mirror-sample-rate: 0.10). Same plumbing, fewer pairs, smaller bill. Default to mirror at 10-25 percent for new candidates; step up to full shadow on safety-critical routes. A 10 percent mirror on a 1M-request-per-day route still produces 100K scored pairs in 24 hours, more than enough for most rubrics to converge.
The trap with shadow and mirror is storing two responses you never score. A team turns mirror on at 25 percent and ships; six weeks later storage cost is up 40 percent and no one has run the comparison. Wire the evaluator into the same pipeline that ingests the shadow trace, or do not run shadow. The ai-evaluation SDK ships a distributed runner (Celery, Ray, Temporal, Kubernetes) that scores shadowed traces as they arrive.
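Once both arms are scored, the comparison that actually gets run is a paired delta per template, joined on the correlation ID. The sketch below uses plain dictionaries rather than the SDK's result objects, so the data shape, `paired_deltas`, and `stop_signals` are assumptions for illustration; scores are taken to be on a 0-1 scale, which makes a 1.5-point drop 0.015.

```python
from statistics import mean


def paired_deltas(prod_scores: dict, cand_scores: dict) -> dict:
    """Mean (candidate - production) score per template over shared IDs.

    Both arguments are assumed to look like
    {correlation_id: {template_name: score}}.
    """
    per_template: dict = {}
    for cid in prod_scores.keys() & cand_scores.keys():
        for template, prod_value in prod_scores[cid].items():
            if template in cand_scores[cid]:
                delta = cand_scores[cid][template] - prod_value
                per_template.setdefault(template, []).append(delta)
    return {template: mean(values) for template, values in per_template.items()}


def stop_signals(deltas: dict, max_drop: float) -> list:
    """Templates whose mean delta fell by more than max_drop."""
    return sorted(t for t, d in deltas.items() if d < -max_drop)


prod_scores = {
    "req-1": {"Groundedness": 0.90, "TaskCompletion": 0.80},
    "req-2": {"Groundedness": 0.92, "TaskCompletion": 0.84},
    "req-3": {"Groundedness": 0.88, "TaskCompletion": 0.81},  # never shadowed
}
cand_scores = {
    "req-1": {"Groundedness": 0.86, "TaskCompletion": 0.82},
    "req-2": {"Groundedness": 0.90, "TaskCompletion": 0.84},
}

deltas = paired_deltas(prod_scores, cand_scores)
flagged = stop_signals(deltas, max_drop=0.015)
print(flagged)
```

Pairing on the correlation ID means unshadowed requests such as `req-3` drop out of the comparison instead of skewing one arm's mean.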
Canary patterns: percentage routing, stratified cohorts, auto-rollback in config
Canary is the first pattern where a real user sees the candidate. A small percentage of traffic, defined by a deterministic hash of the user ID and stratified by a tenant or feature tag, is served the candidate live. Both cohorts score on the same rubric. The gate is the delta against the trailing 7-day production baseline, not a frozen number from a curated set.
# Canary at 5%, stratified by tenant tier
import requests

response = requests.post(
    "https://gateway.futureagi.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_AGENTCC_KEY",
        "x-agentcc-routing-strategy": "canary",
        "x-agentcc-canary-target": "agent-v2",
        "x-agentcc-canary-percent": "5",
        "x-agentcc-canary-stratify-tag": "tenant_tier",
    },
    json={"model": "agent-v1", "messages": [...]},
)
print(response.headers["x-agentcc-routing-strategy"])  # "canary"
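For intuition, the deterministic, stratified assignment described above can be sketched as follows. How the gateway actually buckets users is not documented here, so `canary_bucket`, `pick_target`, the SHA-256 choice, and the salt are all illustrative assumptions.

```python
import hashlib


def canary_bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket in [0, 10000)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def pick_target(user_id: str, tenant_tier: str, percent: float,
                eligible_tiers: set,
                production: str = "agent-v1",
                candidate: str = "agent-v2") -> str:
    """Route to the candidate only inside eligible tiers, by stable hash."""
    if tenant_tier not in eligible_tiers:
        return production
    # Salting with the candidate name reshuffles the slice per rollout.
    in_slice = canary_bucket(user_id, salt=candidate) < percent * 100
    return candidate if in_slice else production


users = [f"user-{i}" for i in range(20_000)]
served_candidate = sum(
    pick_target(u, "internal", 5, {"internal"}) == "agent-v2" for u in users
)
share = served_candidate / len(users)
print(f"candidate share among internal users: {share:.3f}")
print(pick_target("user-42", "enterprise", 5, {"internal"}))
```

Hashing the user ID rather than sampling per request keeps each user on one arm for the whole canary, which is what makes retry rate and feedback rate attributable to a single model.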
Three properties of canary are non-negotiable.
Pre-registered rollback rubric. A canary without a written rollback criterion is a slow incident. The team launches at 5 percent, watches the dashboard for 20 minutes, walks away. When the failure mode shows up on a Friday afternoon spike, no one is watching the right metric. The four triggers, written before traffic flips:
# agentcc-rollback.yaml
rollback_triggers:
  - name: guardrail_trip_rate
    metric: x-agentcc-guardrail-triggered
    window: 15m
    threshold: 1.5x_baseline
    action: rollback_immediate
  - name: rubric_regression
    metric: groundedness_rolling_mean
    window: 1h
    threshold: -0.5
    significance: p<0.05
    action: rollback_immediate
  - name: p99_latency
    metric: x-agentcc-latency-ms
    aggregation: p99
    window: 10m
    threshold: 1.3x_baseline
    action: rollback_immediate
  - name: candidate_only_cluster
    metric: error_feed_cluster_id
    condition: not_in_trailing_window(7d)
    action: rollback_immediate
Any single trigger fires the revert on the next request. Median rollback latency in internal tests is ~35 seconds. At 33 RPS that still lets roughly 1,150 turns through (33 × 35 = 1,155) before the revert completes; at 1,000 RPS sub-second matters and the gateway has to flip on the next request, not after a Slack thread.
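The four triggers reduce to simple comparisons between a canary window and the production baseline. The sketch below mirrors the thresholds in the YAML but is not the gateway's evaluator: `Window` and `fired_triggers` are hypothetical, the significance test on the rubric trigger is omitted, and the rubric mean is assumed to be on a 0-100 scale so that -0.5 means half a point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    guardrail_trip_rate: float   # fraction of requests that tripped a guardrail
    groundedness_mean: float     # rolling rubric mean, assumed 0-100 scale
    p99_latency_ms: float
    cluster_ids: frozenset       # Error Feed cluster IDs seen in the window


def fired_triggers(baseline: Window, canary: Window) -> list:
    """Names of rollback triggers that fire; any one is enough to revert."""
    fired = []
    if canary.guardrail_trip_rate > 1.5 * baseline.guardrail_trip_rate:
        fired.append("guardrail_trip_rate")
    if canary.groundedness_mean - baseline.groundedness_mean <= -0.5:
        fired.append("rubric_regression")  # significance check omitted here
    if canary.p99_latency_ms > 1.3 * baseline.p99_latency_ms:
        fired.append("p99_latency")
    if canary.cluster_ids - baseline.cluster_ids:
        fired.append("candidate_only_cluster")
    return fired


def turns_before_revert(rps: float, rollback_seconds: float) -> float:
    """Requests that arrive while the rollback is still propagating."""
    return rps * rollback_seconds


baseline = Window(0.02, 91.0, 800.0, frozenset({"c1", "c2"}))
canary = Window(0.035, 90.8, 1100.0, frozenset({"c1", "c9"}))

fired = fired_triggers(baseline, canary)
print(fired)
print(turns_before_revert(33, 35))
```

In this made-up window the rubric trigger stays quiet (a 0.2-point dip) while the other three fire, which shows why the triggers are evaluated independently rather than folded into one composite score.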
Stratified, not uniform. Production traffic is rarely uniform. Whale users hit the highest-stakes routes (paying tier, enterprise SLAs, regulated workflows) and a uniform 5 percent canary can route the bulk of candidate volume to the users whose failure cost is the highest. The candidate looks fine on aggregate and catastrophic on the segment that pays the bill. The Agent Command Center’s five-level budget hierarchy (org, team, user, key, tag) scopes a canary to tag=internal, tag=staging, or tag=beta_cohort without rewriting the routing config. Start on the segment with the lowest failure cost, expand, and only roll to enterprise tenants after the rubric holds on the lower-stakes cohorts.
Hold for a full traffic cycle. A canary that runs for two hours on a Tuesday morning has not seen the weekend, the geography spread, or the time-of-day distribution. Hold for 24 to 72 hours; promote on partial signal and you will roll back on the next cycle. Compute the required sample size before launch with the same power formula as offline A/B (n ≈ 16 * p * (1 - p) / delta² at 80% power, 5% significance, plus a 1.5x buffer for real-traffic variance). For the deeper statistical case, see A/B testing LLM prompts.
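The power formula above is quick to compute before launch. This is a direct transcription of the rule of thumb in the text; `canary_sample_size` and `hours_to_fill` are hypothetical helper names, and the 0.90 pass rate, 2-point delta, and 33 RPS route are example inputs rather than recommendations.

```python
import math


def canary_sample_size(p: float, delta: float, buffer: float = 1.5) -> int:
    """Per-arm sample size: n ~= 16 * p * (1 - p) / delta^2, times a buffer.

    p is the baseline pass rate, delta the smallest difference worth
    detecting (80% power, 5% significance).
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if delta <= 0:
        raise ValueError("delta must be positive")
    return math.ceil(16 * p * (1 - p) / delta ** 2 * buffer)


def hours_to_fill(n_per_arm: int, route_rps: float, canary_percent: float) -> float:
    """Hours for the canary arm alone to collect n_per_arm requests."""
    canary_rps = route_rps * canary_percent / 100
    return n_per_arm / canary_rps / 3600


n = canary_sample_size(p=0.90, delta=0.02)   # roughly 5,400 per arm
hours = hours_to_fill(n, route_rps=33, canary_percent=5)
print(n, round(hours, 2))
```

On a busy route the sample fills in under an hour, which is exactly why the hold time should be set by the traffic cycle and not by the sample size alone.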
Score attachment: the only way the gates actually gate
The eval score has to land on the same OpenTelemetry span as the routing decision so the promotion runs on the delta, not on the dashboard vibe. Agent Command Center emits every routing decision as response headers that traceAI picks up as span attributes:
- x-agentcc-routing-strategy (shadow, mirror, canary, race, fallback)
- x-agentcc-model-used (which provider returned the served response)
- x-agentcc-fallback-used (whether a candidate fell back)
- x-agentcc-latency-ms, x-agentcc-cost (per-request latency, cost)
- x-agentcc-guardrail-triggered (whether an input/output guardrail fired)
The ai-evaluation SDK joins those attributes with the per-trace rubric scores so a single dashboard shows per-cohort cost, latency, rubric pass rate, and guardrail fire rate. Without score-attached traffic the comparison lives in someone’s head; with it, the gate fires from config when the delta breaks the threshold. The same rubric runs against both cohorts; the same correlation ID joins production to candidate; the same noise floor calibrates the threshold as the self-improving evaluators learn each route’s variance.
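The join itself is simple once the routing headers and the rubric score share a trace ID. The following sketch works on plain dictionaries and is not the SDK's join; the span layout, the string encoding of header values, and `cohort_summary` are assumptions made for illustration.

```python
from collections import defaultdict


def cohort_summary(spans: list, scores: dict, threshold: float) -> dict:
    """Per-model pass rate, mean cost, and guardrail rate over scored spans.

    spans: dicts with a trace_id plus the gateway headers as attributes.
    scores: {trace_id: rubric score}. Unscored spans are skipped.
    """
    totals = defaultdict(lambda: {"n": 0, "passed": 0, "cost": 0.0, "guardrail": 0})
    for span in spans:
        score = scores.get(span["trace_id"])
        if score is None:
            continue
        row = totals[span["x-agentcc-model-used"]]
        row["n"] += 1
        row["passed"] += score >= threshold
        row["cost"] += float(span["x-agentcc-cost"])
        row["guardrail"] += span["x-agentcc-guardrail-triggered"] == "true"
    return {
        model: {
            "n": row["n"],
            "pass_rate": row["passed"] / row["n"],
            "mean_cost": row["cost"] / row["n"],
            "guardrail_rate": row["guardrail"] / row["n"],
        }
        for model, row in totals.items()
    }


def span(trace_id: str, model: str, cost: str, guardrail: str) -> dict:
    return {"trace_id": trace_id, "x-agentcc-model-used": model,
            "x-agentcc-cost": cost, "x-agentcc-guardrail-triggered": guardrail}


spans = [
    span("t1", "agent-v1", "0.002", "false"),
    span("t2", "agent-v1", "0.002", "false"),
    span("t3", "agent-v2", "0.001", "false"),
    span("t4", "agent-v2", "0.001", "true"),
    span("t5", "agent-v2", "0.001", "false"),  # no score yet
]
scores = {"t1": 0.91, "t2": 0.88, "t3": 0.90, "t4": 0.70}

summary = cohort_summary(spans, scores, threshold=0.85)
print(summary)
```

Skipping unscored spans rather than counting them as failures keeps evaluator lag from looking like a regression in the candidate cohort.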
Racing: the latency-bounded pattern that sits beside the funnel
Race sends the same request to multiple candidates in parallel (x-agentcc-routing-strategy: race, x-agentcc-race-targets: candidate-a,candidate-b), responds from the fastest correct one, and scores all responses asynchronously. Race is the right tool when latency is a hard SLO (voice agents, real-time UX) and at least two candidates already meet the quality bar. The response carries x-agentcc-model-used: <winner> so the trace tells you which arm served.
Cost is the trade. Race fans out Nx inference for one served response; a three-way race on a 1M-request-per-day route is 3M candidate calls, and if two of the three are frontier models with thinking tokens the bill scales fast. The benefit is that each served latency becomes the minimum of N draws, so when the arms' latencies are roughly independent, p99 on long-tail-heavy distributions can shrink by 30 to 50 percent. For batch agents and human-in-the-loop chat where the user is happy with a two-second response time, race is wasted budget. Scope race with per-strategy cost guardrails through the same five-level budget hierarchy that stratifies the canary.
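A small simulation makes the latency-versus-cost trade concrete. The log-normal latency distribution and its parameters are assumptions chosen only to give a long tail, and the arms are drawn independently, which is the most favorable case for race; real providers share failure modes, so the measured gain is usually smaller than this idealized model suggests.

```python
import random


def p99(samples: list) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    return ordered[int(0.99 * (len(ordered) - 1))]


def simulate_served_latency(n_arms: int, n_requests: int, seed: int = 7) -> list:
    """Served latency per request when the fastest of n_arms wins."""
    rng = random.Random(seed)
    return [
        min(rng.lognormvariate(6.0, 0.8) for _ in range(n_arms))
        for _ in range(n_requests)
    ]


requests_per_day = 20_000
single_p99 = p99(simulate_served_latency(1, requests_per_day))
raced_p99 = p99(simulate_served_latency(3, requests_per_day))
calls_single = 1 * requests_per_day
calls_raced = 3 * requests_per_day

print(f"single arm: p99 {single_p99:.0f} ms, {calls_single} model calls")
print(f"3-way race: p99 {raced_p99:.0f} ms, {calls_raced} model calls")
```

The call count triples no matter how much the tail improves, so the comparison worth running on real traces is the measured p99 gain against that fixed cost multiplier.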
The Agent Command Center implementation: one binary, four strategies
Agent Command Center ships shadow, mirror, race, fallback, load-balanced, and budget-aware routing as first-class strategies, with canary built on tag-based scoping and the five-level budget hierarchy. Apache 2.0, single 17 MB Go binary, OpenAI-compatible, ~29k req/s with P99 21 ms on a t3.xlarge with guardrails on. Self-host or use gateway.futureagi.com/v1.
The loop has four moving parts:
- Routing as a header. x-agentcc-routing-strategy: shadow | mirror | canary | race switches stages without redeploying. The same gateway handles all four; the same response headers tag every trace.
- Score-attached traceAI spans. The ai-evaluation SDK (50+ pre-built EvalTemplate classes plus self-improving Platform evaluators plus unlimited custom evaluators authored by an in-product agent) scores both arms on the same rubric with a shared correlation ID. Classifier-backed evals run at lower per-eval cost than Galileo Luna-2 on rubrics with a clean classifier target.
- Auto-rollback in gateway config. The four triggers (guardrail trip rate, rubric rolling mean, p99 latency, candidate-only Error Feed cluster) wire into YAML before traffic flips. Any single trigger fires on the next request, no human in the rollback path.
- Error Feed plus agent-opt closes the loop. When the gate trips, Error Feed clusters candidate-only failures via HDBSCAN over ClickHouse-stored span embeddings; a Sonnet 4.5 Judge writes an immediate_fix plus a four-dimensional score per cluster. agent-opt (six optimizers: RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer, with EarlyStoppingConfig to bound compute) proposes a prompt rewrite. The team reviews the diff; the candidate restarts at 1 percent.
Honest framing. The full closed loop ships today, but the trace-stream-to-dataset connector that auto-builds optimization datasets from canary failure clusters is still on the roadmap, so today the loop runs on curated datasets the team promotes failing traces into. Linear OAuth is wired on the rollback signal; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
For the gateway-as-rollout-control deep dive across six vendors, see Best AI Gateways for Canary Model Rollouts in 2026.
Three anti-patterns from production incident reports
Shadow without offline scoring. Mirror is on at 25 percent, the pipeline stores two responses per request, six weeks later storage cost is up 40 percent and no one has run the comparison. The candidate could be 15 points worse and no one knows. Fix: every shadow or mirror trace must have an evaluator subscribed to it on ingest. If the scoring is not wired, do not store the shadow.
Canary without a written rollback condition. “Looks fine on the dashboard” is not a gate. The rollback condition has to be a rule the gateway evaluates on a rolling window, not a feeling. Wire the four triggers above into config before traffic flips. Pair the auto-rollback with an alert that carries the failure cluster so the on-call sees what broke, not just that something did.
Uniform canary on a power-law distribution. A random 5 percent canary on a tenant-tier-skewed mix routes the bulk of candidate volume to whale users whose failure cost is the highest. The aggregate looks fine; the segment that pays the bill is catastrophic. Stratify with the tag header. Start on tag=internal, expand to tag=beta_cohort, only then roll to enterprise. The canonical fail story is “mean Groundedness 0.91, sub-route 0.62”; the fix is gating on percentiles and Error Feed clusters, not means.
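The "fine on aggregate, catastrophic on a segment" failure is easy to reproduce with a per-route breakdown. The sketch below is illustrative only: the route names and scores are made up, and `segment_gate` and `low_percentile` are hypothetical helpers, not part of any SDK.

```python
from statistics import mean


def low_percentile(scores: list, q: float = 0.05) -> float:
    """Nearest-rank low percentile; sensitive to a bad tail, unlike the mean."""
    ordered = sorted(scores)
    return ordered[int(q * (len(ordered) - 1))]


def segment_gate(scores_by_route: dict, floor: float) -> tuple:
    """Return (overall mean, overall p5, routes whose mean is under floor)."""
    all_scores = [s for scores in scores_by_route.values() for s in scores]
    failing = {
        route: mean(scores)
        for route, scores in scores_by_route.items()
        if mean(scores) < floor
    }
    return mean(all_scores), low_percentile(all_scores), failing


# Made-up Groundedness scores: one high-volume route, one low-volume route.
scores_by_route = {
    "faq": [0.95] * 18,
    "billing_refund": [0.62] * 2,
}

overall, p5, failing = segment_gate(scores_by_route, floor=0.85)
print(f"overall mean {overall:.3f}, p5 {p5:.2f}, failing routes {failing}")
```

The overall mean clears a 0.85 floor comfortably while both the p5 and the per-route check expose the failing sub-route, which is the argument for gating on segments and percentiles.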
For the rollback discussion across rollout patterns, see LLM Deployment Best Practices and CI/CD for AI Agents.
The decision matrix that picks the pattern
| If your situation is… | Use |
|---|---|
| First test of a new candidate, low prior signal, safety-critical route | Shadow at 100% on a low-volume route |
| Candidate is expensive or the shadowed slice would double the bill | Mirror at 10-25% |
| Shadow or mirror shows stable lift; you need user-visible signal | Canary at 5% on a stratified tag |
| Hard latency SLO, two or more candidates pass quality bar | Race across the candidates |
| New tool-calling schema you have not validated end-to-end | Shadow first, then mirror, then canary |
| Cost optimization where the candidate is cheaper but unproven | Mirror at 25%, then canary at 10% on tag=internal |
| Multi-region rollout with regional compliance constraints | Canary by geography tag |
The step most teams skip is the mirror-between-shadow-and-canary one. Going shadow-then-canary leaves out the cost-bounded signal-gathering phase, which is where most of the candidate tuning happens. Mirror is cheap. Use it.
Closing: routing is the safety net
Offline evals are necessary and not sufficient. Shadow, mirror, canary, and race catch the long tail that only exists in production. The disambiguation matters: shadow proves the candidate behaves reasonably on the real distribution before a user sees it; canary proves it is at least as good with users in the loop; mirror is shadow at the budget you can afford; race is the latency tool that sits beside the funnel. The team that confuses them ships either the regression or the indefinite stall. The team that runs the funnel in order with score-attached traffic ships on the rubric, not the vibe.
Point your OpenAI SDK at https://gateway.futureagi.com/v1, set x-agentcc-routing-strategy: shadow on the first request, and let the rubric do the gating. The next prompt edit has somewhere to land.
Related reading
- Agent Rollout Strategies in 2026: The Four-Stage Gate
- Best 6 AI Gateways for Canary Model Rollouts in 2026
- A/B Testing LLM Prompts Best Practices in 2026
- The 2026 LLM Evaluation Playbook
- Agent Passes Evals, Fails Production in 2026
- The 12 Metrics for AI Conversation Monitoring in 2026
- Best AI Gateways for LLM Observability and Tracing in 2026
- AI Evaluation: Open Source LLM Evaluation Library