What Is Weighted Routing (LLM Gateway)?
Weighted routing is an LLM gateway strategy that sends a configured percentage of requests to each eligible provider, model, region, or deployment. Instead of cycling equally, the router samples from targets such as 90% gpt-4o-mini and 10% claude-sonnet-4, then records the selected target in the gateway trace. It is used in production gateways for load balancing, canary rollouts, cost control, and provider migration. FutureAGI exposes it through Agent Command Center routing policies.
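The selection step can be sketched in a few lines; this is an illustrative sampler, not FutureAGI's implementation, and the target names are the examples from above:

```python
import random

random.seed(42)  # seeded only so the demo below is reproducible

def pick_target(targets: dict[str, float]) -> str:
    """Sample one target in proportion to its configured weight."""
    names = list(targets)
    weights = [targets[name] for name in names]
    # random.choices performs weighted sampling with replacement
    return random.choices(names, weights=weights, k=1)[0]

targets = {"gpt-4o-mini": 90, "claude-sonnet-4": 10}
counts = {name: 0 for name in targets}
for _ in range(100_000):
    counts[pick_target(targets)] += 1
# Over a large number of draws the empirical split converges toward 90/10;
# over any small window it will wobble, which matters for measurement later.
```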
Why it matters in production LLM/agent systems
Weighted routing fails quietly when configured weights are treated as truth instead of being checked against real traffic. A policy may say 80/20, but retries, provider errors, model fallback, and streaming disconnects can turn the effective user-visible split into 60/40. That breaks canary analysis, hides provider saturation, and makes finance teams chase cost changes that look random.
The pain lands across several teams:
- SREs see target distribution drift, rising 429s, and p99 latency spikes on one provider while the policy still looks healthy.
- Product engineers read false A/B results because one model handled harder requests or more retries than the other.
- Compliance owners lose confidence when region-specific traffic is handled by weights instead of hard conditional routing.
- End users feel inconsistent quality when multi-step agents switch models across planning, tool use, and final response generation.
Weighted routing is especially important in 2026-era agent pipelines because one user task may create dozens of LLM calls. A 10% canary at the request level can become a much larger exposure if the canary model is used inside every tool-planning step. The right question is not “did we set the weight?” It is “what share of successful, failed, retried, and costly traces did each target actually handle?”
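The exposure math is worth making concrete. Assuming each LLM call samples the route independently, the chance that a multi-step trace touches the canary at least once grows quickly with trace depth:

```python
def trace_exposure(per_call_weight: float, calls_per_trace: int) -> float:
    """Probability that at least one call in a trace hits the canary,
    assuming independent per-call weighted sampling."""
    return 1 - (1 - per_call_weight) ** calls_per_trace

# A 10% per-call canary inside a 12-step agent trace:
print(round(trace_exposure(0.10, 12), 3))  # → 0.718
```

So a "10% canary" exposes roughly 72% of user tasks when the agent makes a dozen calls per task, which is why exposure should be measured at the trace level, not the request level.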
How FutureAGI handles weighted routing
FutureAGI handles weighted routing in Agent Command Center through the gateway:routing-policies surface. A gateway policy can set strategy: "weighted" and define targets with explicit weight values, such as 90 for the current production model and 10 for a new deployment. The same policy can sit beside model fallback, pre-guardrail, post-guardrail, semantic-cache, and traffic-mirroring primitives, so rollout control is not scattered across application code.
A real workflow looks like this: an engineer creates payments-answering-v3 with two targets, openai:gpt-4o-mini at weight 95 and anthropic:claude-sonnet-4 at weight 5. The gateway evaluates the policy on every request, records the selected provider and model on the trace, and attaches request fields such as gen_ai.request.model, gen_ai.system, and llm.token_count.prompt. The engineer watches actual target share, eval-fail-rate-by-target, p99 latency, and token-cost-per-trace. If the 5% target shows higher timeout rate or cost, they set its weight to 0 and route failures through model fallback before trying the rollout again.
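A policy along these lines could be expressed roughly as follows; the field names are illustrative and mirror the example above, not the exact Agent Command Center schema:

```python
# Illustrative shape of a weighted routing policy with a fallback rule.
# Field names are assumptions for the sketch, not a documented format.
policy = {
    "name": "payments-answering-v3",
    "strategy": "weighted",
    "targets": [
        {"provider": "openai", "model": "gpt-4o-mini", "weight": 95},
        {"provider": "anthropic", "model": "claude-sonnet-4", "weight": 5},
    ],
    # Failures on the canary route to the production model instead of the user
    "fallback": {"on": ["timeout", "provider_error"], "to": "openai:gpt-4o-mini"},
}

# Weights here sum to 100, so each weight reads directly as a percentage
assert sum(t["weight"] for t in policy["targets"]) == 100
```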
FutureAGI’s approach is to treat a weight change as a versioned rollout event with trace evidence, not a hidden SDK setting. Unlike LiteLLM’s Python Router configuration, Agent Command Center keeps weighted routing in the same control plane as gateway observability, guardrails, and provider health. That makes a policy review answerable from traces: which policy ran, which target was selected, what it cost, and whether the response passed downstream quality checks.
How to measure or detect it
Measure weighted routing by comparing intended share to effective share, then slicing quality and cost by the selected target:
- Configured versus actual target share — compare the policy weights with completed requests per provider/model over the same time window.
- Effective share after retry and fallback — count the target that served the final response, not only the first target selected.
- p99 latency and timeout rate by target — detects a canary that is slower even when its output quality looks acceptable.
- Token-cost-per-trace — join `llm.token_count.prompt` and completion token counts to the selected route.
- Trace fields — filter by `gen_ai.request.model`, `gen_ai.system`, routing policy ID, and selected target.
- User feedback proxy — compare thumbs-down rate, escalation rate, and support reopen rate by route cohort.
A healthy weighted rollout should converge within tolerance over a statistically meaningful window. For a 5% canary, a one-minute sample can be noisy; a high-traffic hour is more useful. Alert when actual share differs from configured share by more than the rollout tolerance, or when p99 latency, error rate, or eval-fail-rate regresses for the canary target.
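A minimal drift check along these lines (metric plumbing omitted; function and variable names are illustrative) compares configured share against the target that actually served each completed request:

```python
from collections import Counter

def share_drift(configured: dict[str, float], served: list[str]) -> dict[str, float]:
    """Per-target difference between actual served share and configured share.
    `served` should record the target behind the FINAL response, so retries
    and fallbacks count against the target the user actually hit."""
    total_weight = sum(configured.values())
    counts = Counter(served)
    n = len(served)
    return {
        target: counts[target] / n - weight / total_weight
        for target, weight in configured.items()
    }

configured = {"gpt-4o-mini": 95, "claude-sonnet-4": 5}
# e.g. the serving target recorded on each trace over a high-traffic window
served = ["gpt-4o-mini"] * 920 + ["claude-sonnet-4"] * 80
drift = share_drift(configured, served)
alerts = [t for t, d in drift.items() if abs(d) > 0.02]  # 2% rollout tolerance
```

Here the canary served 8% instead of 5%, so both targets exceed the 2% tolerance and the rollout should be paged before anyone trusts the canary's quality numbers.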
Common mistakes
Engineers usually get weighted routing wrong at the measurement boundary, not in the routing formula:
- Reading configured weights as guaranteed per-minute percentages. Small cohorts vary; measure over enough traffic.
- Measuring selected provider before retry or fallback. The user-visible target may be different.
- Using weights for compliance routing. Use conditional routing for hard region, tenant, or data-residency rules.
- Raising a new model from 1% to 50% without p99 latency and eval-fail-rate gates.
- Running a canary without cohort tags, then trying to debug quality regressions from aggregate metrics.
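The promotion-gate mistake above can be avoided with an explicit check before any weight increase. A sketch, with illustrative metric names and thresholds rather than a fixed FutureAGI schema:

```python
def canary_gate(canary: dict, baseline: dict,
                max_p99_ratio: float = 1.2,
                max_fail_delta: float = 0.01) -> bool:
    """Allow a weight increase only if the canary clears latency and
    quality gates relative to the baseline target."""
    if canary["p99_latency_ms"] > baseline["p99_latency_ms"] * max_p99_ratio:
        return False  # canary is too slow, even if quality looks fine
    if canary["eval_fail_rate"] > baseline["eval_fail_rate"] + max_fail_delta:
        return False  # canary regresses on quality evals
    return True

baseline = {"p99_latency_ms": 1800, "eval_fail_rate": 0.02}
canary = {"p99_latency_ms": 2600, "eval_fail_rate": 0.02}
# A 44% slower p99 blocks the promotion even though eval quality is unchanged
```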
Frequently Asked Questions
What is weighted routing in an LLM gateway?
Weighted routing is an LLM gateway strategy that sends traffic to providers, models, regions, or deployments according to configured weights. It is commonly used for load distribution, canary rollouts, cost control, and provider migration.
How is weighted routing different from round-robin routing?
Round-robin routing cycles through targets evenly, while weighted routing gives each target a configured share such as 90/10 or 50/30/20. Weighted routing is better when targets differ in cost, capacity, trust, or rollout stage.
How do you measure weighted routing?
In FutureAGI, measure configured weight versus actual target share inside Agent Command Center traces and dashboards. Join route decisions to `gen_ai.request.model`, `gen_ai.system`, token cost, p99 latency, and error rate by target.