
LLM Eval with Shadow Traffic and Canary Deployment in 2026

Shadow is not canary. Mirror routing with no user effect vs. percentage routing with rollback. Score-attached traffic, the Agent Command Center patterns, and the gotchas.


A team finishes a six-week migration to a newer reasoning model. Offline evals look good. Cost looks better. Routing flips at 10 am Wednesday for 100 percent of traffic. By noon, function-call accuracy on the billing agent is down nine points. By 2 pm, refusal rate on the support agent is up seven. On-call rolls back from the gateway dashboard. The post-mortem reads the same as the one from six months earlier: the offline eval set under-represented the real-traffic distribution, there was no per-cohort live monitor, and there was no canary gate.

PR-time evals catch the obvious regressions on a held-out set. Offline A/B catches the rest of the in-distribution failures. Production traffic has a long tail no held-out set ever captures, and the only way to score the candidate against that tail is to route real traffic at it with a safety net. Shadow and canary are both eval gates on real traffic, and they answer different questions. Shadow mirrors with no user-visible effect; canary serves live with rollback-ready percentage routing. The team that confuses them either ships a quality regression (skipped shadow) or stalls on offline numbers that never converge (skipped canary).

This post disambiguates the two, the mirror variant between them, the race pattern beside them, the routing config that exposes them as a header change on Agent Command Center, the score-attachment plumbing that makes the comparison apples-to-apples, and the gotchas that turn a careful rollout into a slow-motion incident.

TL;DR: the four routing patterns, one promotion funnel

| Pattern | Who sees the candidate | Cost overhead | Eval question |
| --- | --- | --- | --- |
| Shadow | No one (production serves) | 1x (full duplication) | Does the candidate behave reasonably on the real distribution? |
| Mirror | No one (sampled subset) | N% (sample rate) | Same question, cost-bounded |
| Canary | A stratified user slice (live) | Slice size | Is the candidate at least as good with users in the loop? |
| Race | The user whose request won the race | Nx (parallel fan-out) | Can two candidates beat the latency SLO together? |

Shadow then mirror then canary then full rollout, with race reserved for latency-bounded SLOs. The Agent Command Center gateway exposes all four behind the same routing-strategy header so a candidate moves through the funnel without code changes. For the broader four-stage rollout shape, see the agent rollout strategies playbook.

Shadow is not canary: the disambiguation that prevents the wrong incident

The two patterns share routing plumbing and a score-attached evaluator stack. They diverge on the eval question, on user-visibility, and on the failure mode each unlocks.

Shadow’s job is distribution. The candidate runs in parallel on real inputs; the user sees the production response. No blast radius, full traffic coverage available, the candidate’s rubric distribution scored against production’s. A 1.5-point Groundedness drop is a stop signal even though no user saw a single candidate response. A new failure cluster in the candidate’s Error Feed that does not appear in production’s is also a stop signal. Shadow proves the candidate does not behave wildly differently on the real distribution before a user sees it.

Canary’s job is outcomes. The candidate serves a percentage of real users live, scored on the same rubric as production, with auto-rollback armed. The blast radius is non-zero by design because user-visible signals (feedback rate, retry rate, escalation, thumbs-down) only exist when a human is in the loop. A candidate that wins Groundedness on shadow can lose retry rate on canary. The two are not redundant.

The confusion shows up in two failure modes. Skipped shadow: the first time the candidate touches a user is the first time the team finds out what its failure modes look like, the canonical 5 pm Friday incident. Skipped canary: the team waits forever for the offline rubric to “feel ready” instead of putting the candidate in front of users with a rollback path, and the prompt edit that should have shipped in a week stalls for six. Both stages run, in order, with the eval score attached to each.

For the case on why offline-only is incomplete, see Agent Passes Evals, Fails Production and the 2026 LLM evaluation playbook.

Shadow patterns: mirror routing, async eval, score attached on the span

Shadow’s mechanic is request duplication. Production handles the user; the gateway forks the same request to the candidate; both responses come back with a shared correlation ID; the ai-evaluation SDK scores both on the same template list. The user sees one response; the platform sees a scored pair.

# Shadow on Agent Command Center: production serves, candidate is scored offline.
import requests

response = requests.post(
    "https://gateway.futureagi.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_AGENTCC_KEY",
        "x-agentcc-routing-strategy": "shadow",
        "x-agentcc-shadow-target": "candidate-v2",
    },
    # "messages": [...] is elided here; pass the real chat message list.
    json={"model": "production-v1", "messages": [...]},
)
# Production response is served. Candidate response is captured on the
# correlation-id'd span; ai-evaluation subscribes to the shadow stream.
print(response.headers["x-agentcc-routing-strategy"])  # "shadow"

Scoring runs against both arms with the same rubric. Guardrails(rail_type=RailType.OUTPUT, aggregation=AggregationStrategy.MAJORITY) is the right default for multi-dimensional rubrics so a single noisy template does not flip the cohort.

from fi.evals import Evaluator, Guardrails
from fi.evals.types import RailType, AggregationStrategy
from fi.evals.templates import (
    Groundedness, ContextAdherence, TaskCompletion,
    LLMFunctionCalling, AnswerRefusal,
)
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env
rubric = [Groundedness(threshold=0.85), ContextAdherence(threshold=0.90),
          TaskCompletion(threshold=0.80), LLMFunctionCalling(threshold=0.92),
          AnswerRefusal()]
gate = Guardrails(rail_type=RailType.OUTPUT,
                  aggregation=AggregationStrategy.MAJORITY,
                  templates=rubric)

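# production_traces / candidate_traces are not defined in this snippet: each is
# a list of keyword-argument dicts for TestCase, one per shadowed request.
prod = evaluator.evaluate(eval_templates=rubric,
    inputs=[TestCase(**t) for t in production_traces])
cand = evaluator.evaluate(eval_templates=rubric,
    inputs=[TestCase(**t) for t in candidate_traces])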
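
Once both arms are scored, the stop signal is the per-rubric delta between them. The following is a minimal sketch of that comparison, assuming the scores have already been exported as plain floats on a 0-1 scale keyed by template name. The `rubric_deltas` and `stop_signals` helpers, the sample data, and the 0.015 tolerance (one reading of the "1.5-point" drop mentioned earlier) are illustrative assumptions, not part of the ai-evaluation SDK.

```python
from statistics import mean

def rubric_deltas(prod_scores, cand_scores):
    """Candidate-minus-production mean score for each rubric both arms share."""
    return {
        name: mean(cand_scores[name]) - mean(prod_scores[name])
        for name in prod_scores
        if name in cand_scores
    }

def stop_signals(deltas, max_drop=0.015):
    """Rubrics whose mean dropped by more than max_drop (an assumed tolerance)."""
    return sorted(name for name, delta in deltas.items() if delta < -max_drop)

# Hypothetical scores, one float per shadowed request.
prod_scores = {
    "Groundedness": [0.92, 0.90, 0.94, 0.88],
    "TaskCompletion": [0.85, 0.80, 0.90, 0.85],
}
cand_scores = {
    "Groundedness": [0.88, 0.86, 0.90, 0.84],
    "TaskCompletion": [0.86, 0.82, 0.90, 0.86],
}

deltas = rubric_deltas(prod_scores, cand_scores)
print(stop_signals(deltas))
```

In this made-up data the candidate gains slightly on TaskCompletion and loses four points on Groundedness, so only Groundedness is flagged. A real gate would also want a significance check before acting on a delta this size.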

Mirror is shadow at a fractional sample rate (x-agentcc-routing-strategy: mirror with x-agentcc-mirror-sample-rate: 0.10). Same plumbing, fewer pairs, smaller bill. Default to mirror at 10-25 percent for new candidates; step up to full shadow on safety-critical routes. A 10 percent mirror on a 1M-request-per-day route still produces 100K scored pairs in 24 hours, more than enough for most rubrics to converge.
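
The pair-count arithmetic above is easy to parameterize when choosing a sample rate. This is a rough sketch that assumes traffic is spread evenly across the day, which real routes rarely are; the function names are hypothetical.

```python
def mirror_pairs_per_day(requests_per_day, sample_rate):
    """Expected number of scored production/candidate pairs per day."""
    return round(requests_per_day * sample_rate)

def hours_to_collect(target_pairs, requests_per_day, sample_rate):
    """Hours of mirroring needed to reach target_pairs, assuming even traffic."""
    pairs_per_hour = requests_per_day * sample_rate / 24
    return target_pairs / pairs_per_hour

print(mirror_pairs_per_day(1_000_000, 0.10))
print(round(hours_to_collect(10_000, 1_000_000, 0.10), 1))
```

The same two functions make the trade visible: a lower sample rate cuts the bill linearly and stretches the time to a usable sample by the same factor.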

The trap with shadow and mirror is storing two responses you never score. A team turns mirror on at 25 percent and ships; six weeks later storage cost is up 40 percent and no one has run the comparison. Wire the evaluator into the same pipeline that ingests the shadow trace, or do not run shadow. The ai-evaluation SDK ships a distributed runner (Celery, Ray, Temporal, Kubernetes) that scores shadowed traces as they arrive.

Canary patterns: percentage routing, stratified cohorts, auto-rollback in config

Canary is the first pattern where a real user sees the candidate. A small percentage of traffic, defined by a deterministic hash of the user ID and stratified by a tenant or feature tag, is served the candidate live. Both cohorts score on the same rubric. The gate is the delta against the trailing 7-day production baseline, not a frozen number from a curated set.

# Canary at 5%, stratified by tenant tier
import requests

response = requests.post(
    "https://gateway.futureagi.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_AGENTCC_KEY",
        "x-agentcc-routing-strategy": "canary",
        "x-agentcc-canary-target": "agent-v2",
        "x-agentcc-canary-percent": "5",
        "x-agentcc-canary-stratify-tag": "tenant_tier",
    },
    json={"model": "agent-v1", "messages": [...]},
)
print(response.headers["x-agentcc-routing-strategy"])  # "canary"
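
The deterministic-hash assignment described above can be sketched in a few lines. This is not the gateway's implementation, only one common way to get a stable, stratified slice: hash the user ID with a per-rollout salt, map it to a bucket in [0, 100), and serve the candidate only when the user's tag is eligible and the bucket falls under the canary percent. The salt, function names, and tag values are assumptions.

```python
import hashlib

def canary_bucket(user_id, salt="agent-v2-canary"):
    """Map a user ID to a stable bucket in [0, 100) using SHA-256."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def serves_candidate(user_id, tag_value, percent, eligible_tags):
    """True when the user is in an eligible stratum and under the percent cutoff."""
    if tag_value not in eligible_tags:
        return False
    return canary_bucket(user_id) < percent

# The same user always lands in the same arm, so a session never flip-flops.
first = serves_candidate("user-42", "internal", 5, {"internal"})
second = serves_candidate("user-42", "internal", 5, {"internal"})
print(first == second)

# Roughly 5 percent of a synthetic internal population gets the candidate.
served = sum(
    serves_candidate(f"user-{i}", "internal", 5, {"internal"})
    for i in range(10_000)
)
print(served)
```

Changing the salt per rollout reshuffles the buckets, so the same users are not always the first to see every candidate.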

Three properties of canary are non-negotiable.

Pre-registered rollback rubric. A canary without a written rollback criterion is a slow incident. The team launches at 5 percent, watches the dashboard for 20 minutes, walks away. When the failure mode shows up on a Friday afternoon spike, no one is watching the right metric. The four triggers, written before traffic flips:

# agentcc-rollback.yaml
rollback_triggers:
  - name: guardrail_trip_rate
    metric: x-agentcc-guardrail-triggered
    window: 15m
    threshold: 1.5x_baseline
    action: rollback_immediate
  - name: rubric_regression
    metric: groundedness_rolling_mean
    window: 1h
    threshold: -0.5
    significance: p<0.05
    action: rollback_immediate
  - name: p99_latency
    metric: x-agentcc-latency-ms
    aggregation: p99
    window: 10m
    threshold: 1.3x_baseline
    action: rollback_immediate
  - name: candidate_only_cluster
    metric: error_feed_cluster_id
    condition: not_in_trailing_window(7d)
    action: rollback_immediate

Any single trigger fires the revert on the next request. Median rollback latency in internal tests is ~35 seconds. At 33 RPS a 35-second rollback still serves roughly 1,150 turns (33 × 35 = 1,155); at 1,000 RPS sub-second matters and the gateway has to flip on the next request, not after a Slack thread.
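
To make the trigger logic concrete, here is a sketch of how the four rules could be evaluated over one rolling window. It is an illustration, not the gateway's evaluator: the dictionary layout and function names are hypothetical, the rubric check uses a Welch-style statistic with a normal approximation (reasonable only for large windows), and the drop threshold is passed in as `max_drop` because its unit depends on the scale your rubric scores use.

```python
from statistics import NormalDist, mean, quantiles, variance

def p99(samples):
    """99th percentile; quantiles(n=100) returns 99 cut points."""
    return quantiles(samples, n=100)[98]

def rubric_regressed(prod, cand, max_drop, alpha=0.05):
    """One-sided Welch-style check, normal approximation for the p-value."""
    delta = mean(cand) - mean(prod)
    se = (variance(prod) / len(prod) + variance(cand) / len(cand)) ** 0.5
    if se == 0:
        return delta < -max_drop
    p_value = NormalDist().cdf(delta / se)  # small when the candidate is lower
    return delta < -max_drop and p_value < alpha

def fired_triggers(window, baseline, max_drop):
    """Names of the rollback triggers that fire for this window."""
    fired = []
    if window["guardrail_trip_rate"] > 1.5 * baseline["guardrail_trip_rate"]:
        fired.append("guardrail_trip_rate")
    if rubric_regressed(baseline["groundedness"], window["groundedness"], max_drop):
        fired.append("rubric_regression")
    if p99(window["latencies_ms"]) > 1.3 * baseline["p99_latency_ms"]:
        fired.append("p99_latency")
    if set(window["error_clusters"]) - set(baseline["error_clusters_7d"]):
        fired.append("candidate_only_cluster")
    return fired

# Hypothetical baseline and two candidate windows.
baseline = {
    "guardrail_trip_rate": 0.02,
    "groundedness": [0.90, 0.92, 0.88, 0.91, 0.89] * 20,
    "p99_latency_ms": 1000,
    "error_clusters_7d": {"c1", "c2"},
}
healthy = {
    "guardrail_trip_rate": 0.02,
    "groundedness": [0.90, 0.92, 0.88, 0.91, 0.89] * 20,
    "latencies_ms": [500] * 200,
    "error_clusters": {"c1"},
}
regressed = {
    "guardrail_trip_rate": 0.04,
    "groundedness": [0.80, 0.82, 0.78, 0.81, 0.79] * 20,
    "latencies_ms": [500] * 198 + [2000] * 2,
    "error_clusters": {"c1", "c9"},
}

print(fired_triggers(healthy, baseline, max_drop=0.05))
print(fired_triggers(regressed, baseline, max_drop=0.05))
```

Because any non-empty result means rollback, the function returns the names rather than a boolean; the alert that accompanies the revert can then say which rule fired.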

Stratified, not uniform. Production traffic is rarely uniform. Whale users hit the highest-stakes routes (paying tier, enterprise SLAs, regulated workflows) and a uniform 5 percent canary can route the bulk of candidate volume to the users whose failure cost is the highest. The candidate looks fine on aggregate and catastrophic on the segment that pays the bill. The Agent Command Center’s five-level budget hierarchy (org, team, user, key, tag) scopes a canary to tag=internal, tag=staging, or tag=beta_cohort without rewriting the routing config. Start on the segment with the lowest failure cost, expand, and only roll to enterprise tenants after the rubric holds on the lower-stakes cohorts.

Hold for a full traffic cycle. A canary that runs for two hours on a Tuesday morning has not seen the weekend, the geography spread, or the time-of-day distribution. Hold for 24 to 72 hours; promote on partial signal and you will roll back on the next cycle. Compute the required sample size before launch with the same power formula as offline A/B: n ≈ 16 * p * (1 - p) / delta² at 80% power and 5% significance, where p is the baseline pass rate and delta is the smallest difference you need to detect, plus a 1.5x buffer for real-traffic variance. For the deeper statistical case, see A/B testing LLM prompts.
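
As a worked instance of that formula, the sketch below computes the sample size for a hypothetical rubric with a 90 percent baseline pass rate and a 2-point minimum detectable difference. The function name is an assumption, and the result is read as requests per arm, which is the usual interpretation of this rule of thumb.

```python
import math

def canary_sample_size(p, delta, buffer=1.5):
    """n from 16 * p * (1 - p) / delta**2, scaled by the real-traffic buffer."""
    base = 16 * p * (1 - p) / delta ** 2
    # Round first so floating-point noise does not bump the ceiling by one.
    return math.ceil(round(base * buffer, 6))

# Baseline pass rate 0.90, smallest difference worth detecting 0.02.
print(canary_sample_size(0.90, 0.02))              # 3,600 before the buffer, 5,400 after
print(canary_sample_size(0.90, 0.02, buffer=1.0))
```

Halving delta quadruples the requirement, which is why a small detectable difference on a low-traffic stratum can take days. The sample size is a floor, not a substitute for the 24 to 72 hour hold.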

Score attachment: the only way the gates actually gate

The eval score has to land on the same OpenTelemetry span as the routing decision so the promotion runs on the delta, not on the dashboard vibe. Agent Command Center emits every routing decision as response headers that traceAI picks up as span attributes:

  • x-agentcc-routing-strategy (shadow, mirror, canary, race, fallback)
  • x-agentcc-model-used (which provider returned the served response)
  • x-agentcc-fallback-used (whether a candidate fell back)
  • x-agentcc-latency-ms, x-agentcc-cost (per-request latency, cost)
  • x-agentcc-guardrail-triggered (whether an input/output guardrail fired)

The ai-evaluation SDK joins those attributes with the per-trace rubric scores so a single dashboard shows per-cohort cost, latency, rubric pass rate, and guardrail fire rate. Without score-attached traffic the comparison lives in someone’s head; with it, the gate fires from config when the delta breaks the threshold. The same rubric runs against both cohorts; the same correlation ID joins production to candidate; the same noise floor calibrates the threshold as the self-improving evaluators learn each route’s variance.
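
The join itself is simple once both sides carry the correlation ID. The sketch below is not the SDK's join; it assumes spans and scores have been exported as plain Python records, with field names that loosely mirror the response headers above and a hypothetical 0.85 pass threshold.

```python
from collections import defaultdict
from statistics import mean

def cohort_summary(spans, scores, pass_threshold=0.85):
    """Join spans to rubric scores on (correlation_id, model_used), then
    aggregate latency, pass rate, and guardrail rate per served model."""
    by_cohort = defaultdict(list)
    for span in spans:
        score = scores.get((span["correlation_id"], span["model_used"]))
        if score is None:
            continue  # unscored spans are skipped, not counted as passes
        by_cohort[span["model_used"]].append((span, score))
    return {
        cohort: {
            "n": len(rows),
            "mean_latency_ms": mean(s["latency_ms"] for s, _ in rows),
            "pass_rate": mean(1.0 if sc >= pass_threshold else 0.0 for _, sc in rows),
            "guardrail_rate": mean(1.0 if s["guardrail_triggered"] else 0.0 for s, _ in rows),
        }
        for cohort, rows in by_cohort.items()
    }

# Hypothetical shadow pairs: two requests, each answered by both arms.
spans = [
    {"correlation_id": "r1", "model_used": "production-v1", "latency_ms": 400, "guardrail_triggered": False},
    {"correlation_id": "r1", "model_used": "candidate-v2", "latency_ms": 600, "guardrail_triggered": False},
    {"correlation_id": "r2", "model_used": "production-v1", "latency_ms": 500, "guardrail_triggered": False},
    {"correlation_id": "r2", "model_used": "candidate-v2", "latency_ms": 700, "guardrail_triggered": True},
]
scores = {
    ("r1", "production-v1"): 0.90, ("r1", "candidate-v2"): 0.80,
    ("r2", "production-v1"): 0.88, ("r2", "candidate-v2"): 0.90,
}

summary = cohort_summary(spans, scores)
print(summary["production-v1"])
print(summary["candidate-v2"])
```

Keying scores on both the correlation ID and the model matters for shadow and mirror, where the two arms deliberately share one correlation ID.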

Racing: the latency-bounded pattern that sits beside the funnel

Race sends the same request to multiple candidates in parallel (x-agentcc-routing-strategy: race, x-agentcc-race-targets: candidate-a,candidate-b), responds from the fastest correct one, and scores all responses asynchronously. Race is the right tool when latency is a hard SLO (voice agents, real-time UX) and at least two candidates already meet the quality bar. The response carries x-agentcc-model-used: <winner> so the trace tells you which arm served.

Cost is the trade. Race fans out Nx inference for one served response; a three-way race on a 1M-request-per-day route is 3M candidate calls, and if two of the three are frontier models with thinking tokens the bill scales fast. The benefit is that p99 latency drops to the minimum of N independent draws, which on long-tail-heavy distributions can shrink p99 by 30 to 50 percent. For batch agents and human-in-the-loop chat where the user is happy with two-second response time, race is wasted budget. Scope race with per-strategy cost guardrails through the same five-level budget hierarchy that stratifies the canary.
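
The "minimum of N independent draws" claim has a closed form that helps size the trade. If the arms are independent and share one latency distribution (a simplifying assumption, as is treating every arm's response as acceptable), then P(min > t) = P(one arm > t)^N, so the race's p99 equals a much lower quantile of a single arm. The helper names below are hypothetical.

```python
def single_arm_quantile_for_race(n_arms, target=0.99):
    """Quantile of one arm's latency that equals the race's target quantile.

    With independent, identically distributed arms:
    P(min > t) = P(one arm > t) ** n_arms.
    """
    return 1 - (1 - target) ** (1 / n_arms)

def race_call_volume(requests_per_day, n_arms):
    """Total model calls per day when every request fans out to n_arms."""
    return requests_per_day * n_arms

for n in (1, 2, 3):
    print(n, round(single_arm_quantile_for_race(n), 4), race_call_volume(1_000_000, n))
```

Under these assumptions a three-way race's p99 sits at roughly the 78th percentile of a single arm. How many milliseconds that saves depends entirely on how heavy the tail is, while the call volume triples regardless.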

The Agent Command Center implementation: one binary, four strategies

Agent Command Center ships shadow, mirror, race, fallback, load-balanced, and budget-aware routing as first-class strategies, with canary built on tag-based scoping and the five-level budget hierarchy. Apache 2.0, single 17 MB Go binary, OpenAI-compatible, ~29k req/s with P99 21 ms on a t3.xlarge with guardrails on. Self-host or use gateway.futureagi.com/v1.

The loop has four moving parts:

  • Routing as a header. x-agentcc-routing-strategy: shadow | mirror | canary | race switches stages without redeploying. The same gateway handles all four; the same response headers tag every trace.
  • Score-attached traceAI spans. The ai-evaluation SDK (50+ pre-built EvalTemplate classes plus self-improving Platform evaluators plus unlimited custom evaluators authored by an in-product agent) scores both arms on the same rubric with a shared correlation ID. Classifier-backed evals run at lower per-eval cost than Galileo Luna-2 on rubrics with a clean classifier target.
  • Auto-rollback in gateway config. The four triggers (guardrail trip rate, rubric rolling mean, p99 latency, candidate-only Error Feed cluster) wire into YAML before traffic flips. Any single trigger fires on the next request, no human in the rollback path.
  • Error Feed plus agent-opt closes the loop. When the gate trips, Error Feed clusters candidate-only failures via HDBSCAN over ClickHouse-stored span embeddings; a Sonnet 4.5 Judge writes an immediate_fix plus a four-dimensional score per cluster. agent-opt (six optimizers: RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer, with EarlyStoppingConfig to bound compute) proposes a prompt rewrite. The team reviews the diff; the candidate restarts at 1 percent.

Honest framing. The full closed loop ships today, but the trace-stream-to-dataset connector that auto-builds optimization datasets from canary failure clusters is roadmap, so today the loop runs on curated datasets the team promotes failing traces into. Linear OAuth is wired on the rollback signal; Slack, GitHub, Jira, and PagerDuty are roadmap.

For the gateway-as-rollout-control deep dive across six vendors, see Best AI Gateways for Canary Model Rollouts in 2026.

Three anti-patterns from production incident reports

Shadow without offline scoring. Mirror is on at 25 percent, the pipeline stores two responses per request, six weeks later storage cost is up 40 percent and no one has run the comparison. The candidate could be 15 points worse and no one knows. Fix: every shadow or mirror trace must have an evaluator subscribed to it on ingest. If the scoring is not wired, do not store the shadow.

Canary without a written rollback condition. “Looks fine on the dashboard” is not a gate. The rollback condition has to be a rule the gateway evaluates on a rolling window, not a feeling. Wire the four triggers above into config before traffic flips. Pair the auto-rollback with an alert that carries the failure cluster so the on-call sees what broke, not just that something did.

Uniform canary on a power-law distribution. A random 5 percent canary on a tenant-tier-skewed mix routes the bulk of candidate volume to whale users whose failure cost is the highest. The aggregate looks fine; the segment that pays the bill is catastrophic. Stratify with the tag header. Start on tag=internal, expand to tag=beta_cohort, only then roll to enterprise. The canonical fail story is “mean Groundedness 0.91, sub-route 0.62”; the fix is gating on percentiles and Error Feed clusters, not means.
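
The "mean 0.91, sub-route 0.62" failure is easy to reproduce with a few lines of arithmetic. In this sketch the segment names, scores, and the 0.85 floor are made up; the point is only that the aggregate mean clears the floor while a low percentile and a per-segment check both fail.

```python
import math
from statistics import mean

def low_percentile(scores, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n)."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def failing_segments(scores_by_segment, floor):
    """Segments whose own mean score falls below the floor."""
    return sorted(seg for seg, s in scores_by_segment.items() if mean(s) < floor)

# Hypothetical canary scores: one small sub-route hides inside a healthy aggregate.
scores_by_segment = {
    "general_support": [0.92] * 29,
    "billing_disputes": [0.62],
}
all_scores = [s for seg in scores_by_segment.values() for s in seg]

print(round(mean(all_scores), 2))
print(low_percentile(all_scores, 1))
print(failing_segments(scores_by_segment, floor=0.85))
```

A gate on the aggregate mean alone would promote this candidate; a gate on the low percentile or on per-segment scores would not.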

For the rollback discussion across rollout patterns, see LLM Deployment Best Practices and CI/CD for AI Agents.

The decision matrix that picks the pattern

| If your situation is… | Use |
| --- | --- |
| First test of a new candidate, low prior signal, safety-critical route | Shadow at 100% on a low-volume route |
| Candidate is expensive or the shadowed slice would double the bill | Mirror at 10-25% |
| Shadow or mirror shows stable lift; you need user-visible signal | Canary at 5% on a stratified tag |
| Hard latency SLO, two or more candidates pass quality bar | Race across the candidates |
| New tool-calling schema you have not validated end-to-end | Shadow first, then mirror, then canary |
| Cost optimization where the candidate is cheaper but unproven | Mirror at 25%, then canary at 10% on tag=internal |
| Multi-region rollout with regional compliance constraints | Canary by geography tag |

The step most teams skip is the mirror-between-shadow-and-canary one. Going shadow-then-canary leaves out the cost-bounded signal-gathering phase, which is where most of the candidate tuning happens. Mirror is cheap. Use it.

Closing: routing is the safety net

Offline evals are necessary and not sufficient. Shadow, mirror, canary, and race catch the long tail that only exists in production. The disambiguation matters: shadow proves the candidate behaves reasonably on the real distribution before a user sees it; canary proves it is at least as good with users in the loop; mirror is shadow at the budget you can afford; race is the latency tool that sits beside the funnel. The team that confuses them ships either the regression or the indefinite stall. The team that runs the funnel in order with score-attached traffic ships on the rubric, not the vibe.

Point your OpenAI SDK at https://gateway.futureagi.com/v1, set x-agentcc-routing-strategy: shadow on the first request, and let the rubric do the gating. The next prompt edit has somewhere to land.

Frequently asked questions

What is the difference between shadow traffic and canary deployment?
Shadow mirrors a live request to a candidate model with zero user impact: production serves the user, the candidate response is captured and scored offline. Canary serves the candidate output live to a percentage slice of users, scored against the production cohort on the same rubric, with auto-rollback armed. Shadow proves the candidate does not behave wildly differently on the real distribution; canary proves the candidate is at least as good with users in the loop. They answer different eval questions and skipping either is a known incident pattern. The team that skips shadow ships a regression; the team that skips canary stalls on offline numbers that never converge.
What is mirror sampling and when do I use it instead of full shadow?
Mirror is shadow at a fractional sample rate. Instead of duplicating every request, mirror routes a percentage of production traffic to the candidate, scores it, and still responds from the production path. Use mirror when the candidate is expensive (frontier reasoning models with thinking tokens, multi-modal models) or when the shadowed slice is large enough that full duplication doubles the bill. A 10 percent mirror on a route doing 1 million requests per day produces 100,000 scored pairs in 24 hours, which is more than enough to compute lift on most rubrics with statistical confidence. Default to mirror at 10-25 percent for new candidates and only step up to full shadow when the candidate is cheap or the route is safety-critical.
How do I attach an eval score to shadow or canary traffic?
The eval score lands on the same OpenTelemetry span as the routing decision. Future AGI Agent Command Center emits x-agentcc-routing-strategy, x-agentcc-model-used, and x-agentcc-canary-percent on every response; traceAI picks those up as span attributes; the ai-evaluation SDK runs the same EvalTemplate list (Groundedness, ContextAdherence, TaskCompletion, LLMFunctionCalling, AnswerRefusal) against both arms with a shared correlation ID. Aggregation uses Guardrails with RailType.OUTPUT and AggregationStrategy.MAJORITY so a single noisy template does not flip the cohort. Without score-attached traffic you are gating on dashboard vibes; the promotion runs on the delta against the trailing 7-day production baseline, not on a frozen number from a curated test set.
How fast should auto-rollback fire when a canary regresses?
Fast enough that the regression does not compound. A service at 33 RPS will still serve roughly 1,150 turns on a 35-second rollback; at 1,000 RPS sub-second matters. The trigger menu is short and written into the gateway config before traffic flips: guardrail trip rate above 1.5 times the trailing 7-day baseline over a 15-minute window, a per-rubric rolling-mean drop larger than the noise floor with p less than 0.05 on Welch's t-test, p99 latency above 1.3 times baseline for 10 minutes, and any candidate-only Error Feed cluster the trailing window has not seen. Any single trigger fires the revert on the next request, not after a Slack thread.
Why does a 5 percent canary regress on enterprise customers when aggregate metrics look fine?
Most production traffic is power-law distributed; a uniform 5 percent canary can route the majority of candidate volume to whale users whose failure cost is the highest. The candidate looks clean on the aggregate and catastrophic on the segment that pays the bill. The fix is a stratified canary: tag-based scoping (tag=internal, tag=staging, tag=beta_cohort) instead of random sampling. The Agent Command Center's five-level budget hierarchy (org, team, user, key, tag) is the routing primitive for stratified canary. Start the canary on the segment with the lowest failure cost, expand to the next, and only roll to enterprise tenants after the rubric holds on the lower-stakes cohorts.
How does Future AGI score both production and candidate cohorts on the same rubric?
Agent Command Center captures both the production response and the candidate response from a shadow, mirror, or canary route as separate traces with a shared correlation ID. The ai-evaluation SDK scores both paths with the same EvalTemplate list so the comparison is apples-to-apples. The Error Feed clusters candidate-only failures via HDBSCAN over ClickHouse-stored span embeddings and a Sonnet 4.5 Judge writes an immediate_fix plus a four-dimensional score per cluster, which feeds the Platform's self-improving evaluators and agent-opt's prompt-rewrite proposals. Linear OAuth ships today on the rollback signal; Slack, GitHub, Jira, and PagerDuty are roadmap.