What Is Traffic Mirroring (LLM Gateway)?
An LLM-gateway feature that copies a share of production requests to a second model in parallel for offline comparison, without affecting the production response.
What Is Traffic Mirroring?
Traffic mirroring is an LLM-gateway feature that copies a configurable share of production requests to a second model or provider in parallel, captures the mirrored response for offline comparison — quality, latency, cost — but never returns it to the caller. The user sees the primary response. This is a shadow technique, distinct from a canary deployment (which actually serves the canary response to a fraction of users). Mirroring is the safe way to evaluate a new model on real production traffic before flipping the routing policy. FutureAGI’s Agent Command Center exposes this as the routing.mirror block.
Why it matters in production LLM/agent systems
Switching models is high-risk. A new model that benchmarks well on a curated dataset can still fail on production phrasing, edge cases, and tail prompts. Three traditional options for evaluating a new model are all bad:
- Offline benchmarks. The dataset doesn’t match production traffic distribution.
- Canary. Real users see real failures. If the new model breaks JSON mode, customers lose data.
- Migration “big bang”. Switch everyone, hope, roll back if angry tickets appear. Tickets are a noisy and slow signal.
Traffic mirroring sidesteps all three. The primary model still serves every user. The mirror captures the new model’s output on the same prompts in parallel, so engineers compare quality, latency, and cost on identical traffic without customer risk. After 1–7 days of captured data, the team has a real-traffic regression-eval input set and a defensible decision on whether to switch.
For agent systems where one task triggers many model calls and a quality drop on call #3 cascades through the whole trajectory, mirroring is often the only realistic way to evaluate a model swap. We’ve found that even a 5% sample over 72 hours surfaces tail-prompt regressions that a 5,000-row offline benchmark misses, especially around tool-use formatting and refusal behavior on adversarial inputs.
How FutureAGI handles it
FutureAGI’s Agent Command Center implements traffic mirroring in internal/routing/mirror.go. The configuration:
routing:
  mirror:
    enabled: true
    rules:
      - source_model: "gpt-4o"
        target_provider: "anthropic"
        target_model: "claude-sonnet-4"
        sample_rate: 0.1
        experiment_id: "claude-vs-gpt-2026q2"
      - source_model: "*"
        target_provider: "staging"
        sample_rate: 0.01
On every matching request, the gateway:
- Issues the primary call and returns the response to the caller (untouched).
- Asynchronously, with sample_rate probability, issues the mirrored call to target_provider/target_model.
- Captures both responses, latencies, token counts, and the request ID into a ShadowStore (internal/routing/shadow_store.go), tagged with the experiment_id.
The mirror runs in its own goroutine — production latency is unaffected. The captured pairs feed FutureAGI’s evaluation surface as a side-by-side dataset. Engineers run pairwise evals (AnswerRelevancy, Coherence, custom rubrics) over the shadow set and ship a quality-delta report. The same traceAI tracer emits agentcc.mirror.experiment_id, agentcc.mirror.target_model, and agentcc.mirror.captured, joining the rest of the trace tree. Compared with manually scripting offline replays, this is real production traffic with real-time capture — and unlike a canary, no user ever sees the new model’s output.
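The shipped implementation is Go (internal/routing/mirror.go); the sketch below shows the same fire-and-forget shape in Python, with call_model and an in-memory SHADOW_STORE standing in for the real provider client and ShadowStore. Treat it as an illustration of the pattern, not the actual code.

import asyncio
import random

async def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real provider call.
    await asyncio.sleep(0.01)
    return f"{model} response to: {prompt}"

SHADOW_STORE: list[dict] = []  # stand-in for ACC's ShadowStore

async def mirror_call(rule: dict, prompt: str, primary: str) -> None:
    try:
        mirrored = await call_model(rule["target_model"], prompt)
        SHADOW_STORE.append({
            "experiment_id": rule["experiment_id"],
            "prompt": prompt,
            "primary_response": primary,
            "mirror_response": mirrored,
        })
    except Exception:
        # A mirror-side failure must never surface to the caller.
        pass

async def handle_request(prompt: str, rule: dict) -> str:
    # The primary call always serves the user, untouched.
    primary = await call_model(rule["source_model"], prompt)
    if rule["enabled"] and random.random() < rule["sample_rate"]:
        # Fire-and-forget: the mirrored call never blocks the response.
        asyncio.create_task(mirror_call(rule, prompt, primary))
    return primary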
How to measure or detect it
Monitor a traffic-mirroring experiment with these metrics:
- Capture rate — actual captures ÷ (sample_rate × eligible requests). A 0.1 sample rate on 100K requests should produce ~10K captures; lower means the mirror is dropping calls.
- Mirror-side failure rate — independent of production. Tracks new-model reliability without affecting users.
- Quality delta — pairwise eval (e.g., AnswerRelevancy on (prompt, primary) vs. (prompt, mirror)). The headline metric.
- Latency delta — p50/p99 of mirror vs. primary, on identical prompts.
- Cost delta — token-cost-per-call on the mirror vs. primary.
- Mirror lag — time-to-capture for the mirrored call. High lag means the staging provider is overloaded.
The quality delta is the one that needs an actual eval; run it over the captured shadow set:

from fi.evals import AnswerRelevancy

# Score each mirrored response against its prompt
shadow = client.gateway.shadow_store.list(experiment_id="claude-vs-gpt-2026q2")
for pair in shadow:
    AnswerRelevancy().evaluate(input=pair.prompt, output=pair.mirror_response)
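The remaining deltas are simple aggregations over the same pairs. A sketch, assuming each capture carries latency and cost fields; the names primary_latency_ms, mirror_latency_ms, primary_cost_usd, and mirror_cost_usd are illustrative, not the documented ShadowStore schema:

import statistics

def mirror_deltas(pairs, sample_rate: float, eligible: int) -> dict:
    # Capture rate: actual captures / (sample_rate * eligible); ~1.0 is healthy.
    capture_rate = len(pairs) / (sample_rate * eligible)
    lat_primary = sorted(p.primary_latency_ms for p in pairs)
    lat_mirror = sorted(p.mirror_latency_ms for p in pairs)

    def p99(xs):
        return xs[int(0.99 * (len(xs) - 1))]

    return {
        "capture_rate": capture_rate,
        "latency_p50_delta_ms": statistics.median(lat_mirror) - statistics.median(lat_primary),
        "latency_p99_delta_ms": p99(lat_mirror) - p99(lat_primary),
        "cost_delta_usd": sum(p.mirror_cost_usd - p.primary_cost_usd for p in pairs),
    }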
Common mistakes
- Confusing traffic mirroring with canary deployment. Mirror = shadow-only; canary = real users see canary responses.
- Mirroring 100% of traffic to a small staging provider. The mirror saturates the staging quota and drops calls.
- Not setting an experiment_id. Without it, captured pairs from different experiments mix together and are unanalysable.
- Forgetting that mirrored calls cost money. A 10% mirror is a 10% cost increase on the mirrored model.
- Treating mirror success rate as a quality signal. Success ≠ quality; run an actual eval on the captured pairs.
Frequently Asked Questions
What is traffic mirroring in an LLM gateway?
Traffic mirroring copies a configurable share of production requests to a second model in parallel, captures the mirrored response for offline comparison, and never returns the mirrored output to the caller.
How is traffic mirroring different from a canary deployment?
A canary serves a fraction of real users the new model's responses, so failures affect customers. Traffic mirroring is shadow-only: the production response always comes from the primary; mirrored responses are captured offline.
How does FutureAGI implement traffic mirroring?
Agent Command Center exposes routing.mirror with rules per source_model, target_provider, target_model, and sample_rate. Mirrored calls run async, are captured to a ShadowStore, and never block the primary response.