Evaluating LLM Routing Policies in 2026
Routing-policy eval is not model eval. The 2026 playbook: route correctness, cost-savings realized vs theory, quality preservation under substitution, and fallback correctness — instrumented end to end.
You ship a multi-model gateway. The config says: GPT-5 for the planner step, Haiku for the formatter, Sonnet for the responder, fallback to Bedrock on rate-limit, race the cheap tier for latency-sensitive routes. You run per-call evals on the model outputs. Every score looks healthy. Three weeks later, finance flags a 30 percent overspend on the head of the distribution, and CSAT on refunds is down 9 points. None of the model-level evals caught any of it.
This is the failure mode the routing-policy era ran into. Per-call eval tells you the answer that came out of the system. It does not tell you whether the system was set up to produce that answer. Routing-policy eval is not model eval. Evaluating the router breaks down into four separable questions: route correctness, cost-savings realized versus theory, quality preservation under substitution, and fallback correctness. Each needs its own rubric, its own dataset, and its own gate. Score the policy as its own artifact.
This post lays out the engineering pattern for that: the four axes, what each one catches, the shadow-route workflow that makes the comparison runnable on real traffic, and the Future AGI surfaces (gateway headers, traceAI span attributes, ai-evaluation templates, Error Feed clusters) that turn the workflow into an instrumented loop.
Why model-quality eval misses router failures
Model eval scores an output against a rubric. Routing-policy eval scores a decision against a pool. The two artifacts answer different questions and break in different places.
Three failure modes hide inside a healthy model-eval dashboard. The first is silent over-routing: the cheap tier is doing work the cheaper-still tier could have absorbed, and the only visible signal is a slowly creeping token bill nobody attributes. The second is silent under-routing: the planner step is firing a small model on a hard intent, the trajectory loops twice as often, and the per-call score is fine because the model eventually got there. The third is fallback rot: the primary has been stable for ten months, the fallback chain has drifted, and the first time it fires under real load is the first time anyone learns it’s broken.
None of these show up if your eval is a moving average of Groundedness across the whole router. Aggregate scores hide the cells where the policy is wrong. A 0.88 average TaskCompletion can be 0.94 on technical questions and 0.71 on refunds, and the refunds are exactly the lane the cost-aware router redirected to the cheap tier last quarter. The dashboard says the agent is fine. The policy is the bug.
The teams that close this gap treat the routing policy as a separate piece of code with its own tests. For the broader frame on why this split matters, see agent observability vs evaluation vs benchmarking and the 2026 LLM evaluation playbook.
Axis 1: route correctness — right model for the question
The first rubric asks a single question. Given the input and the available pool, was the chosen model the right pick.
This is the rubric almost no team writes, and it’s the one that produces the largest insight on first run. The pattern is a CustomLLMJudge with four arguments: the request, the model the router selected, the candidate pool with cost and latency tiers, and the budget envelope. The judge returns a label (correct, over-routed, under-routed, ambiguous) plus a one-sentence reason. Built-in templates such as Groundedness, ContextAdherence, TaskCompletion, EvaluateFunctionCalling, and AnswerRefusal score the output; this judge scores the decision.
from fi.evals import Evaluator
from fi.evals.templates import CustomLLMJudge

evaluator = Evaluator(fi_api_key=FI_KEY, fi_secret_key=FI_SECRET)

route_correctness = CustomLLMJudge(
    config={
        "grading_criteria": (
            "Given the input, the model the router picked, and the candidate "
            "pool with cost and latency tiers, was the routed model the right "
            "pick? Label: correct, over_routed, under_routed, ambiguous. "
            "Explain in one sentence. Consider safety alignment on adversarial "
            "inputs, schema fidelity on formatter steps, and reasoning depth "
            "on planner steps."
        ),
        "model": "gpt-5",
    }
)
First-run numbers between 8 and 22 percent wrong decisions on otherwise healthy routers are normal. The headline percentage matters less than where the wrong decisions cluster. Over-routing concentrated on cheap intents is a cost leak. Under-routing concentrated on high-stakes intents is a quality regression sitting on a fuse.
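To see where the wrong decisions cluster, the judge labels can be tallied per intent. The following is a minimal sketch in plain Python; the `(intent, label)` tuple shape and the `label_breakdown` helper are assumptions for illustration, not part of the ai-evaluation SDK.

```python
from collections import Counter, defaultdict

# Hypothetical judge output from one sweep: (intent, label) pairs.
results = [
    ("refunds", "under_routed"),
    ("refunds", "under_routed"),
    ("refunds", "correct"),
    ("faq", "over_routed"),
    ("faq", "correct"),
    ("technical", "correct"),
]


def label_breakdown(results):
    """Per intent: wrong-decision rate plus over/under-routing counts.

    'ambiguous' labels are left out of the denominator.
    """
    by_intent = defaultdict(Counter)
    for intent, label in results:
        by_intent[intent][label] += 1

    report = {}
    for intent, counts in by_intent.items():
        decided = sum(n for label, n in counts.items() if label != "ambiguous")
        wrong = counts["over_routed"] + counts["under_routed"]
        report[intent] = {
            "wrong_rate": wrong / decided if decided else 0.0,
            "over_routed": counts["over_routed"],
            "under_routed": counts["under_routed"],
        }
    return report


report = label_breakdown(results)
for intent, row in sorted(report.items()):
    print(intent, row)
```

Reading the report by intent rather than as one headline number is what separates a cost leak (over-routing on cheap intents) from a quality risk (under-routing on high-stakes ones).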
Stratify the golden set on the dimensions the router uses to decide: intent (support, sales, technical, general), length (short, medium, long), difficulty (easy, hard, adversarial), and cost tier (free, paid, enterprise). 200 to 1000 queries is enough to start; refresh weekly by promoting failing production traces. The set lives next to the rubric, versioned together.
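One way to build such a set is to bucket labelled production traces by the four routing dimensions and draw a fixed number from each bucket. This is a rough sketch: the field names (`intent`, `length`, `difficulty`, `tier`) and the `stratified_sample` helper are hypothetical and should be adapted to whatever your trace schema actually carries.

```python
import random
from collections import defaultdict

# Assumed trace fields; rename to match your own labelled traces.
STRATA_KEYS = ("intent", "length", "difficulty", "tier")


def stratified_sample(traces, per_cell, seed=0):
    """Draw up to `per_cell` traces from every (intent, length, difficulty, tier) cell."""
    rng = random.Random(seed)  # fixed seed keeps the golden set reproducible
    cells = defaultdict(list)
    for trace in traces:
        cells[tuple(trace[k] for k in STRATA_KEYS)].append(trace)

    sample = []
    for key in sorted(cells):
        bucket = cells[key]
        sample.extend(rng.sample(bucket, min(per_cell, len(bucket))))
    return sample


# Example: a crowded cell and a sparse one.
traces = (
    [{"intent": "support", "length": "short", "difficulty": "easy", "tier": "free", "id": i}
     for i in range(10)]
    + [{"intent": "sales", "length": "long", "difficulty": "adversarial", "tier": "enterprise", "id": 100 + i}
       for i in range(2)]
)
golden = stratified_sample(traces, per_cell=3)
```

Capping each cell keeps high-volume easy traffic from crowding out the rare hard cases, which are the ones most likely to expose a wrong routing decision.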
Axis 2: cost-savings realized versus theoretical
The routing config promises a number. The bill delivers a different one. The gap is the second axis.
Theoretical savings are easy to compute: “60 percent of traffic hits the cheap tier at one fifth the per-token cost, so save 48 percent on token spend.” Realized savings are what survives once retries, cascades, shadow traffic, and cache misses settle. The gap is wider than most teams budget for. A cheap-first cascade with a 50 percent advertised hit rate delivers a 30 to 35 percent realized saving once retries on the frontier model are counted. A shadow rollout running at 10 to 25 percent mirror traffic adds that fraction back as experimental cost. A semantic cache that benchmarks at 40 percent hits delivers 28 percent on a new corpus.
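The arithmetic behind that gap can be sketched with a simple cost model. The model below is an assumption, not necessarily how the figures above were derived: costs are expressed relative to the frontier model (frontier = 1.0), and in a cheap-first cascade every request pays for the cheap attempt while misses additionally pay the frontier price.

```python
def theoretical_savings(cheap_share, cheap_cost_ratio):
    """Savings if `cheap_share` of traffic is served once by the cheap tier.

    cheap_cost_ratio is cheap-tier cost relative to the frontier model.
    """
    return cheap_share * (1.0 - cheap_cost_ratio)


def cascade_realized_savings(hit_rate, cheap_cost_ratio):
    """Savings for a cheap-first cascade (assumed model).

    Every request pays the cheap attempt; the (1 - hit_rate) misses
    are retried on the frontier model at full price.
    """
    cost = cheap_cost_ratio + (1.0 - hit_rate) * 1.0
    return 1.0 - cost


# The brochure number from the text: 60% of traffic at one fifth the cost.
brochure = theoretical_savings(0.60, 0.20)          # about 0.48

# A 50% hit-rate cascade with a cheap tier at 15-20% of frontier cost.
realized_low = cascade_realized_savings(0.50, 0.20)   # about 0.30
realized_high = cascade_realized_savings(0.50, 0.15)  # about 0.35
```

Under these assumptions the cascade lands in the 30 to 35 percent range even before shadow traffic and cache misses are counted.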
The eval is the join between trace-attributed cost and the outcome event, computed weekly. Three quantities sit on the same span. The first is the dollar cost the gateway set on the response, surfaced as x-agentcc-cost and exported to the trace processor. The second is the outcome event (outcome.resolved=true for support, outcome.accepted=true for coding agents, outcome.booked=true for sales), written as a span attribute when the user signal lands. The third is the routing strategy and resolved model, from x-agentcc-routing-strategy and x-agentcc-model-used. Divide cost by resolved outcomes per route per week, and the savings number stops being fiction.
curl https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-agentcc-..." \
  -H "Content-Type: application/json" \
  -d '{"model":"router/cost-aware-v3","messages":[...]}' \
  -D headers.txt

# Response headers (set by the gateway before the body):
# x-agentcc-routing-strategy: cost-aware-v3
# x-agentcc-model-used: anthropic/claude-3-5-haiku
# x-agentcc-cost: 0.000018
# x-agentcc-latency-ms: 142
# x-agentcc-fallback-used: false
# x-agentcc-cache: miss
For the longer treatment on why cost-per-outcome is the only honest denominator, see AI agent cost optimization and observability. The point that lives here is narrower. The realized-savings number is a query against per-trace data. If the runtime cannot tell you the cost of a single trace, in dollars, at the span level, the savings claim is borrowed from a brochure.
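The weekly join itself is small once cost, outcome, and routing strategy sit on the same span. A minimal sketch over exported span records follows; the dictionary keys (`week`, `strategy`, `cost`, `resolved`) are hypothetical stand-ins for whatever your trace export calls them.

```python
from collections import defaultdict


def cost_per_outcome(spans):
    """Dollar cost per resolved outcome, keyed by (week, routing strategy).

    Returns None for a cell with spend but no resolved outcomes.
    """
    totals = defaultdict(lambda: {"cost": 0.0, "resolved": 0})
    for span in spans:
        cell = totals[(span["week"], span["strategy"])]
        cell["cost"] += span["cost"]
        cell["resolved"] += 1 if span["resolved"] else 0

    return {
        key: (cell["cost"] / cell["resolved"] if cell["resolved"] else None)
        for key, cell in totals.items()
    }


# Hypothetical exported spans: cost from x-agentcc-cost, outcome from the user signal.
spans = [
    {"week": "2026-W01", "strategy": "cost-aware-v3", "cost": 0.02, "resolved": True},
    {"week": "2026-W01", "strategy": "cost-aware-v3", "cost": 0.03, "resolved": False},
    {"week": "2026-W01", "strategy": "cost-aware-v3", "cost": 0.05, "resolved": True},
    {"week": "2026-W01", "strategy": "intent-classifier-v2", "cost": 0.04, "resolved": False},
]
weekly = cost_per_outcome(spans)
```

Unresolved traces still count toward the numerator, which is the point: a cheap route that resolves nothing is not cheap.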
Axis 3: quality preservation under substitution
Every model swap is a hypothesis. The hypothesis is: the cheaper model’s quality on this specific step is within an acceptable band of the more expensive one. The mistake teams make is shipping the swap and finding out later. The mistake is not the swap itself. The mistake is the missing experiment.
The pattern that survives a quarter is three rules.
Rule 1 — score the step, not the trajectory. The planner step’s rubric is “did it pick the right tool”; the formatter step’s rubric is “did it produce valid JSON against the schema”; the responder step’s rubric is faithfulness, helpfulness, and refusal correctness. Same rubric scores both the incumbent and the candidate. Per-step is the unit because a swap on the formatter shouldn’t be gated by a CSAT score that’s mostly responding to the responder.
Rule 2 — pre-commit the band. Write it down before the experiment starts. “Haiku is allowed on the formatter if its EvaluateFunctionCalling rubric score stays within 0.03 of Sonnet on a 500-trace shadow set, and within 0.05 on a 95th-percentile slice of hard cases.” The band is non-negotiable once the experiment runs. A swap that survives by widening the band post-hoc is a regression with a comfortable narrative.
Rule 3 — mirror, score, gate. The gateway mirrors a percentage of live traffic to the candidate model. The trace processor scores both responses with the same rubric. The dashboard shows the band continuously. When the band holds for the agreed window (typically one to three weeks at 10 to 25 percent mirror volume), promote. When it doesn’t, the swap dies on the bench and the line item never moved.
from fi.evals import Evaluator
from fi.evals.templates import EvaluateFunctionCalling, ContextAdherence

evaluator = Evaluator(fi_api_key=FI_KEY, fi_secret_key=FI_SECRET)

result = evaluator.evaluate(
    eval_templates=[EvaluateFunctionCalling(), ContextAdherence()],
    inputs=[
        {"input": planner_input, "output": incumbent_planner_output},
        {"input": planner_input, "output": candidate_planner_output},
    ],
)
# Two scores per step. Same rubric. Same trace.
# Promote candidate only when the band holds for the agreed window.
The honest tradeoff: mirror traffic is real traffic, real tokens, real dollars. You’re paying for the experiment. The discipline is treating mirror cost as the price of not regressing CSAT three weeks downstream. Cheap compared to the alternative. For the related pattern of evals that pass offline and fail in production, see when an agent passes evals and fails in production.
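The pre-committed band from Rule 2 can be written down as a check that runs against the shadow scores. The sketch below is one possible reading: it treats "within 0.03" as a one-sided limit on how far the candidate's mean may fall below the incumbent's, and the helper names and default thresholds are illustrative assumptions taken from the example band above.

```python
from statistics import mean


def band_holds(incumbent_scores, candidate_scores, max_drop):
    """True when the candidate's mean score is at most `max_drop` below the incumbent's."""
    return mean(incumbent_scores) - mean(candidate_scores) <= max_drop


def swap_allowed(full_set, hard_slice, full_band=0.03, hard_band=0.05):
    """Both bands must hold: the full shadow set and the hard-case slice.

    Each argument is an (incumbent_scores, candidate_scores) pair.
    """
    return (
        band_holds(full_set[0], full_set[1], full_band)
        and band_holds(hard_slice[0], hard_slice[1], hard_band)
    )


# Candidate holds on the full set but slips on hard cases: the swap is rejected.
full_set = ([0.90, 0.92], [0.89, 0.90])
hard_slice = ([0.80, 0.80], [0.70, 0.72])
decision = swap_allowed(full_set, hard_slice)
```

Keeping the thresholds as explicit arguments, committed before the experiment starts, makes a post-hoc widening of the band visible in version control.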
Axis 4: fallback correctness — does the rare path actually work
The fallback chain is the part of the routing policy that almost nobody tests after week one. The primary is stable for ten months, the fallback hasn’t fired in eight, and by the time it does fire the chain has drifted. A model joined the pool. A regional residency requirement changed. The ordering got stale. A provider deprecated an endpoint. The first production fallback under load is the first eval anyone runs on it.
The fix is to chaos-test on a sample of golden-set traffic, every sweep. Force a primary failure (rate limit, timeout, regional residency miss, provider outage), let the chain fire, score the result with the same per-route rubric you use on the primary. The target is fallback quality within 5 to 10 percent of primary quality. Wider than that, and the policy is promising a graceful degradation it cannot deliver.
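A chaos sweep of this kind can be sketched without any gateway at all, using stand-in handlers. Everything named below (`ProviderDown`, `call_with_fallback`, `fallback_within_tolerance`) is hypothetical; a real sweep would force the failure at the gateway and score the responses with the per-route rubric.

```python
class ProviderDown(Exception):
    """Simulated primary failure (rate limit, timeout, outage)."""


def call_with_fallback(chain, request, forced_failures=frozenset()):
    """Walk the chain in order; providers in `forced_failures` are treated as down."""
    for name, handler in chain:
        if name in forced_failures:
            continue  # chaos injection: pretend this provider failed
        try:
            return name, handler(request)
        except ProviderDown:
            continue
    raise RuntimeError("fallback chain exhausted")


def fallback_within_tolerance(primary_score, fallback_score, tolerance=0.10):
    """True when fallback quality is within `tolerance` (relative) of primary quality."""
    return fallback_score >= primary_score * (1.0 - tolerance)


# Stand-in handlers; real ones would call the gateway.
chain = [
    ("primary", lambda req: "primary:" + req),
    ("secondary", lambda req: "secondary:" + req),
]
served_by, answer = call_with_fallback(chain, "refund status", forced_failures={"primary"})
```

The tolerance check then runs on rubric scores for the forced-fallback answers versus the normal primary answers on the same golden-set sample.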
Three production events to instrument. The header x-agentcc-fallback-used flips to true on every call where the chain fired. The Prometheus counter agentcc_requests_total{status="fallback"} aggregates the rate. The trace span carries routing.fallback_used=true plus routing.decision_reason, so the post-hoc debug starts from the trace tree, not from grepping logs. The combined signal turns fallback events from invisible into queryable. For deeper coverage on the gateway side of this, see LLM failover and fallback for AI gateways and what is an LLM fallback strategy.
The cluster the Error Feed surfaces most often on first run: “Fallback chain skips Bedrock when Bedrock would have served the regional residency need.” Wrong ordering for a compliance constraint that nobody re-checked when the policy was last edited.
The shadow-route workflow
Shadow routing is the technique that makes all four axes runnable on real traffic without risk. Send a copy of the production request to a candidate router, score both answers with the same rubric, never show the candidate’s answer to the user. The Agent Command Center ships shadow and mirror modes alongside its routing strategies, plus race semantics for fastest-of-N and circuit-breaker behavior on the fallback path.
The workflow is five steps.
Step 1. Instrument every gateway call with the six response headers shown above. Lift them into traceAI span attributes so the eval can query them.
Step 2. Build the stratified golden set. 200 to 1000 queries, split across intent, length, difficulty, and cost tier. Pull from production traces, label them, keep the hard ones in. Refresh weekly.
Step 3. Run the production router and one or more candidate routers against the set. The candidates run in shadow mode: the gateway captures their response and routing metadata, never serves them to the user.
Step 4. Score per-route quality, route correctness, and Pareto position with one pass through the ai-evaluation SDK. Four distributed runners (Celery, Ray, Temporal, Kubernetes) finish the sweep in minutes.
Step 5. Cluster the failures. The Error Feed runs HDBSCAN soft-clustering over failing traces and surfaces named clusters. A Sonnet 4.5 Judge writes the immediate_fix per cluster. Linear is the only Error Feed integration wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap. The cluster signal feeds the next sweep’s golden set.
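from openai import OpenAI

# Assumes the gateway is called through the OpenAI SDK. with_raw_response
# exposes the HTTP headers; .parse() returns the usual completion object.
gateway = OpenAI(base_url="https://gateway.futureagi.com/v1", api_key="sk-agentcc-...")

# Production call: the answer the user sees
prod = gateway.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": user_input}],
    extra_headers={"x-agentcc-routing-strategy": "intent-classifier-v2"},
)
prod_route = prod.headers["x-agentcc-model-used"]
fallback_fired = prod.headers["x-agentcc-fallback-used"] == "true"
prod_answer = prod.parse().choices[0].message.content

# Shadow call: candidate router, response never shown
shadow = gateway.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": user_input}],
    extra_headers={
        "x-agentcc-routing-strategy": "cost-aware-budget-v3",
        "x-agentcc-shadow": "true",
    },
)
shadow_route = shadow.headers["x-agentcc-model-used"]
shadow_answer = shadow.parse().choices[0].message.content
# Score both answers with the same rubric: apples to apples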
Aggregate across the golden set, plot the two routers on the cost-quality and latency-quality planes, and promote the candidate only on Pareto improvement. Never on a single axis.
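The promotion rule can be made mechanical. Below is a minimal sketch of a Pareto-improvement check over aggregated sweep results; the `RouterStats` fields are assumptions about what the aggregate holds (mean rubric score, mean cost per request, and a latency figure).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterStats:
    quality: float     # mean rubric score, higher is better
    cost: float        # mean dollars per request, lower is better
    latency_ms: float  # e.g. p95 latency, lower is better


def pareto_improves(candidate, incumbent):
    """True when the candidate is no worse on every axis and strictly better on at least one."""
    no_worse = (
        candidate.quality >= incumbent.quality
        and candidate.cost <= incumbent.cost
        and candidate.latency_ms <= incumbent.latency_ms
    )
    strictly_better = (
        candidate.quality > incumbent.quality
        or candidate.cost < incumbent.cost
        or candidate.latency_ms < incumbent.latency_ms
    )
    return no_worse and strictly_better


incumbent = RouterStats(quality=0.88, cost=0.0040, latency_ms=900)
cheaper_same_quality = RouterStats(quality=0.88, cost=0.0031, latency_ms=880)
cheaper_but_worse = RouterStats(quality=0.81, cost=0.0020, latency_ms=700)

promote_a = pareto_improves(cheaper_same_quality, incumbent)
promote_b = pareto_improves(cheaper_but_worse, incumbent)
```

The second candidate wins on cost and latency but loses on quality, so it is a tradeoff rather than an improvement and stays on the bench under this rule.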
What to instrument on the traceAI side
The eval is only as good as the metadata it can see. The minimum span-attribute surface is five attributes per call.
from fi_instrumentation import register, ProjectType
from traceai_openai import OpenAIInstrumentor
from opentelemetry import trace

provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="prod-routing-eval",
)
OpenAIInstrumentor().instrument(tracer_provider=provider)

# prod, prod_route, and fallback_fired come from the gateway call;
# pool and reason_from_router are supplied by your router.
current = trace.get_current_span()
current.set_attribute("routing.strategy_id", prod.headers["x-agentcc-routing-strategy"])
current.set_attribute("routing.model_used", prod_route)
current.set_attribute("routing.candidate_pool", ",".join(pool))
current.set_attribute("routing.fallback_used", fallback_fired)
current.set_attribute("routing.decision_reason", reason_from_router)
Every trace now carries the routing decision, the alternatives that existed, and why this one was picked. The eval has something to look at, and the post-hoc debug doesn’t require grepping the gateway logs.
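Header values arrive as strings, so it can help to convert them to typed values before they land on the span. The helper below is a sketch: `routing_attributes` is a hypothetical name, and `routing.cost_usd` and `routing.latency_ms` are assumed attribute names that do not appear in the minimum surface above.

```python
def routing_attributes(headers):
    """Map x-agentcc-* response headers to typed span attributes.

    Header lookup is case-insensitive; missing headers are skipped.
    """
    h = {key.lower(): value for key, value in headers.items()}
    attrs = {}
    if "x-agentcc-routing-strategy" in h:
        attrs["routing.strategy_id"] = h["x-agentcc-routing-strategy"]
    if "x-agentcc-model-used" in h:
        attrs["routing.model_used"] = h["x-agentcc-model-used"]
    if "x-agentcc-fallback-used" in h:
        attrs["routing.fallback_used"] = (
            h["x-agentcc-fallback-used"].strip().lower() == "true"
        )
    if "x-agentcc-cost" in h:
        attrs["routing.cost_usd"] = float(h["x-agentcc-cost"])  # assumed name
    if "x-agentcc-latency-ms" in h:
        attrs["routing.latency_ms"] = float(h["x-agentcc-latency-ms"])  # assumed name
    return attrs


attrs = routing_attributes({
    "X-AgentCC-Model-Used": "anthropic/claude-3-5-haiku",
    "x-agentcc-cost": "0.000018",
    "x-agentcc-latency-ms": "142",
    "x-agentcc-fallback-used": "false",
})
```

Each resulting key and value can then be passed to `set_attribute` on the current span, so cost and latency are queryable as numbers rather than strings.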
Anti-patterns to watch for
Four patterns are common and all of them are quiet failures.
Routing without a per-route quality SLA. If you can’t say “the fast tier must clear 0.85 on TaskCompletion or it isn’t eligible for this intent,” the policy has no constraint. Cost wins by default, quality drifts.
No shadow-route eval. A/B tests work but expose users to the candidate, which is dangerous on high-stakes intents. Shadow is strictly safer for the comparison.
Single-axis routing. Cost alone produces “always cheapest.” Latency alone produces “always fastest.” Quality alone produces “always most expensive.” Real policies compose at least two axes and gate on Pareto position.
Untested fallback. The rare path is rare until it isn’t. Chaos-test the chain every sweep, or the first production fallback under load is the first eval anyone runs.
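The first anti-pattern has a simple structural remedy: make the quality floor an eligibility filter that runs before cost is considered. A minimal sketch follows, with a hypothetical pool layout and per-intent floors; the scores would come from your own per-route eval sweeps.

```python
def pick_model(pool, intent, quality_floor):
    """Cheapest model whose measured quality on `intent` clears the floor.

    Returns None when no model is eligible, so the caller must escalate
    instead of silently defaulting to the cheapest option.
    """
    floor = quality_floor[intent]
    eligible = [m for m in pool if m["quality"].get(intent, 0.0) >= floor]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m["cost"])["name"]


# Hypothetical pool: per-intent TaskCompletion scores and relative cost.
pool = [
    {"name": "fast-tier", "cost": 1.0, "quality": {"faq": 0.91, "refunds": 0.71}},
    {"name": "capable-tier", "cost": 5.0, "quality": {"faq": 0.93, "refunds": 0.90}},
]
quality_floor = {"faq": 0.85, "refunds": 0.85}

faq_choice = pick_model(pool, "faq", quality_floor)
refund_choice = pick_model(pool, "refunds", quality_floor)
```

With the floor in place, the fast tier still serves the intents where it clears the bar and is simply ineligible for the ones where it does not.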
How Future AGI ships the routing-eval loop
Future AGI ships routing-policy evaluation across four surfaces that compose into one workflow.
Agent Command Center is the gateway runtime. OpenAI-compatible drop-in via base_url="https://gateway.futureagi.com/v1"; existing OpenAI SDK code keeps working. Six native provider adapters (OpenAI, Anthropic, Gemini, Bedrock, Azure, Cohere) plus 100-plus more via OpenAI-compatible presets. Routing strategies cover weighted, least-latency, cost-optimized, adaptive, and race. Shadow and mirror modes ship as first-class config. Fallback chains, circuit breakers, and per-tenant budgets with five-level granularity (org, team, user, key, tag) gate the cost side of every experiment. Response headers expose x-agentcc-routing-strategy, x-agentcc-model-used, x-agentcc-fallback-used, x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-cache, and x-agentcc-provider on every call. Prometheus on /-/metrics. OTLP traces to any collector. Single Go binary, Apache 2.0, self-host or hit the cloud endpoint. ~29k req/s and P99 ≤ 21 ms with guardrails on, on t3.xlarge. SOC 2 Type II, HIPAA, GDPR, and CCPA certified for the regulated workloads where routing crosses compliance boundaries.
ai-evaluation is the rubric layer. 50+ pre-built EvalTemplate classes including Groundedness, ContextAdherence, TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, plus the CustomLLMJudge you build route-correctness rubrics on. Four distributed runners (Celery, Ray, Temporal, Kubernetes) so the sweep finishes in minutes rather than hours. Apache 2.0.
traceAI is the tracing layer underneath. 50+ AI surfaces across Python, TypeScript, Java, and C#. Auto-instrumentation for OpenAI, LangChain, Groq, Portkey, and Gemini. The routing.strategy_id, routing.candidate_pool, and routing.decision_reason span attributes carry routing metadata into the trace tree. PII redaction built in.
The Future AGI Platform is where the loop closes. Self-improving evaluators retune routing thresholds from production feedback, with lower per-eval cost than Galileo Luna-2 on the same workloads. The Error Feed runs HDBSCAN soft-clustering over failing traces and writes immediate-fix suggestions through a Sonnet 4.5 Judge. Linear is wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap. The typical first-sweep cluster surfaces something like “router picks the cheap model on customer-support refunds where the capable model would catch the policy violation”: a named pattern, with a fix, that a human can act on in the next standup.
Ready to evaluate your routing policy, not just your models? Point your OpenAI SDK at https://gateway.futureagi.com/v1, read the x-agentcc-* headers on the response, and run a shadow sweep through the ai-evaluation SDK. Start with the Agent Command Center quickstart and the traceAI integration guide.
Related reading
- AI Agent Cost Optimization and Observability in 2026
- Agent Observability vs Evaluation vs Benchmarking (2026)
- Your Agent Passes Evals and Fails in Production. Here’s Why. (2026)
- Best LLM Gateways (2026)
- Best AI Gateways for LLM Failover and Fallback (2026)
- What Is an LLM Fallback Strategy? (2026)
- Agent Evaluation Frameworks (2026)
- The 2026 LLM Evaluation Playbook
Frequently asked questions
Why is routing-policy evaluation different from model evaluation?
What is route-correctness and how do you score it?
What does cost-savings-realized mean compared to theoretical savings?
What is quality preservation under substitution and why does it matter?
How do you test fallback correctness?
What metadata does the gateway need to expose for routing eval to work?
How does Future AGI's Agent Command Center support routing-policy evaluation?