
Evaluating LLM Routing Policies in 2026

Routing-policy eval is not model eval. The 2026 playbook: route correctness, cost-savings realized vs theory, quality preservation under substitution, and fallback correctness — instrumented end to end.

12 min read
ai-gateway llm-routing llm-evaluation model-routing shadow-routing fallback 2026

You ship a multi-model gateway. The config says: GPT-5 for the planner step, Haiku for the formatter, Sonnet for the responder, fallback to Bedrock on rate-limit, race the cheap tier for latency-sensitive routes. You run per-call evals on the model outputs. Every score looks healthy. Three weeks later, finance flags a 30 percent overspend on the head of the distribution, and CSAT on refunds is down 9 points. None of the model-level evals caught any of it.

This is the failure mode the routing-policy era ran into. Per-call eval tells you the answer that came out of the system. It does not tell you whether the system was set up to produce that answer. Routing-policy eval is not model eval. Evaluating the router breaks down into four separable questions: route correctness, cost-savings realized versus theory, quality preservation under substitution, and fallback correctness. Each needs its own rubric, its own dataset, and its own gate. Score the policy as its own artifact.

This post is the engineering pattern for that. The four axes, what each one catches, the shadow-route workflow that makes the comparison runnable on real traffic, and the FAGI surfaces (gateway headers, traceAI span attributes, ai-evaluation templates, Error Feed clusters) that turn the workflow into an instrumented loop.

Why model-quality eval misses router failures

Model eval scores an output against a rubric. Routing-policy eval scores a decision against a pool. The two artifacts answer different questions and break in different places.

Three failure modes hide inside a healthy model-eval dashboard. The first is silent over-routing: the cheap tier is doing work the cheaper-still tier could have absorbed, and the only visible signal is a slowly creeping token bill nobody attributes. The second is silent under-routing: the planner step is firing a small model on a hard intent, the trajectory loops twice as often, and the per-call score is fine because the model eventually got there. The third is fallback rot: the primary has been stable for ten months, the fallback chain has drifted, and the first time it fires under real load is the first time anyone learns it’s broken.

None of these show up if your eval is a moving average of Groundedness across the whole router. Aggregate scores hide the cells where the policy is wrong. A 0.88 average TaskCompletion can be 0.94 on technical questions and 0.71 on refunds, and the refunds are exactly the lane the cost-aware router redirected to the cheap tier last quarter. The dashboard says the agent is fine. The policy is the bug.
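
To make the aggregation problem concrete, here is a minimal sketch that reproduces the numbers above. The traffic mix (17 technical traces, 6 refund traces) is an assumption chosen so the arithmetic lands on 0.88; the function name and row shape are illustrative, not part of any SDK.

```python
from collections import defaultdict


def mean_by_stratum(rows):
    """Return (overall_mean, {stratum: mean}) for a list of (stratum, score) rows."""
    buckets = defaultdict(list)
    for stratum, score in rows:
        buckets[stratum].append(score)
    per_stratum = {name: sum(scores) / len(scores) for name, scores in buckets.items()}
    overall = sum(score for _, score in rows) / len(rows)
    return overall, per_stratum


# Hypothetical traffic mix: 17 technical traces and 6 refund traces.
rows = [("technical", 0.94)] * 17 + [("refunds", 0.71)] * 6
overall, per_stratum = mean_by_stratum(rows)

print("overall:", round(overall, 2))
for name, value in sorted(per_stratum.items()):
    print(name, round(value, 2))
```

The overall mean looks healthy only because the technical lane outweighs the refunds lane. Reporting per stratum is what exposes the cell where the policy is wrong.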

The teams that close this gap treat the routing policy as a separate piece of code with its own tests. For the broader frame on why this split matters, see agent observability vs evaluation vs benchmarking and the 2026 LLM evaluation playbook.

Axis 1: route correctness — right model for the question

The first rubric asks a single question. Given the input and the available pool, was the chosen model the right pick.

This is the rubric almost no team writes, and it’s the one that produces the largest insight on first run. The pattern is a CustomLLMJudge that sees four inputs: the request, the model the router selected, the candidate pool with cost and latency tiers, and the budget envelope. The judge returns a label (correct, over-routed, under-routed, ambiguous) plus a one-sentence reason. Built-in templates such as Groundedness, ContextAdherence, TaskCompletion, EvaluateFunctionCalling, and AnswerRefusal score the output; this judge scores the decision.

from fi.evals import Evaluator
from fi.evals.templates import CustomLLMJudge

evaluator = Evaluator(fi_api_key=FI_KEY, fi_secret_key=FI_SECRET)

route_correctness = CustomLLMJudge(
    config={
        "grading_criteria": (
            "Given the input, the model the router picked, and the candidate "
            "pool with cost and latency tiers, was the routed model the right "
            "pick? Label: correct, over_routed, under_routed, ambiguous. "
            "Explain in one sentence. Consider safety alignment on adversarial "
            "inputs, schema fidelity on formatter steps, and reasoning depth "
            "on planner steps."
        ),
        "model": "gpt-5",
    }
)

On a first run, it is normal for 8 to 22 percent of decisions on an otherwise healthy router to be labeled wrong. The headline percentage matters less than where the wrong decisions cluster. Over-routing concentrated on cheap intents is a cost leak. Under-routing concentrated on high-stakes intents is a quality regression sitting on a fuse.
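
A small sketch of how the judge's labels might be tallied to find those clusters. The record shape (an `intent` field next to the judge's `label`) and the sample rows are assumptions for illustration; only the label names come from the rubric above.

```python
from collections import Counter


def wrong_decisions_by_intent(decisions):
    """Count over_routed and under_routed labels per (intent, label) pair."""
    counts = Counter()
    for decision in decisions:
        if decision["label"] in ("over_routed", "under_routed"):
            counts[(decision["intent"], decision["label"])] += 1
    return counts


# Hypothetical judge output for five golden-set queries.
decisions = [
    {"intent": "general", "label": "over_routed"},
    {"intent": "general", "label": "over_routed"},
    {"intent": "refunds", "label": "under_routed"},
    {"intent": "technical", "label": "correct"},
    {"intent": "sales", "label": "ambiguous"},
]

clusters = wrong_decisions_by_intent(decisions)
wrong_rate = sum(clusters.values()) / len(decisions)

# Largest clusters first: this ordering is the insight, not the headline rate.
for (intent, label), count in clusters.most_common():
    print(intent, label, count)
```

Ambiguous labels are left out of the wrong-decision count here; whether to review them separately is a policy choice.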

Stratify the golden set on the dimensions the router uses to decide: intent (support, sales, technical, general), length (short, medium, long), difficulty (easy, hard, adversarial), and cost tier (free, paid, enterprise). 200 to 1000 queries is enough to start; refresh weekly by promoting failing production traces. The set lives next to the rubric, versioned together.
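
One way to build such a set is to sample a fixed number of traces from every combination of the stratification fields, so that rare cells stay represented. This is a sketch under assumptions: the trace dictionaries, field names, and `per_cell` quota are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(traces, keys, per_cell, seed=0):
    """Pick up to per_cell traces from every combination of the key fields."""
    rng = random.Random(seed)  # fixed seed keeps the golden set reproducible
    cells = defaultdict(list)
    for trace in traces:
        cells[tuple(trace[k] for k in keys)].append(trace)
    sample = []
    for cell in sorted(cells):
        members = cells[cell]
        sample.extend(rng.sample(members, min(per_cell, len(members))))
    return sample


# Hypothetical production traces: a common cell, a rare cell, and a mid-sized one.
mix = [("support", "easy")] * 50 + [("support", "adversarial")] * 3 + [("sales", "hard")] * 10
traces = [
    {"id": i, "intent": intent, "difficulty": difficulty}
    for i, (intent, difficulty) in enumerate(mix)
]

golden = stratified_sample(traces, keys=("intent", "difficulty"), per_cell=5)
print(len(golden))
```

A plain random sample of the same size would mostly return easy support queries; the per-cell quota keeps the three adversarial traces in.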

Axis 2: cost-savings realized versus theoretical

The routing config promises a number. The bill delivers a different one. The gap is the second axis.

Theoretical savings are easy to compute: “60 percent of traffic hits the cheap tier at one fifth the per-token cost, so save 48 percent on token spend.” Realized savings are what survives once retries, cascades, shadow traffic, and cache misses settle. The gap is wider than most teams budget for. A cheap-first cascade with a 50 percent advertised hit rate delivers a 30 to 35 percent realized saving once retries on the frontier model are counted. A shadow rollout running at 10 to 25 percent mirror traffic adds that fraction back as experimental cost. A semantic cache that benchmarks at 40 percent hits delivers 28 percent on a new corpus.
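
The arithmetic behind both numbers can be sketched in a few lines. The cost model is an assumption: costs are expressed as a fraction of the frontier model's per-request cost, and in the cascade every request pays for the cheap call while each miss also pays for one frontier call.

```python
def blended_saving(cheap_share, cheap_cost_ratio):
    """Saving when cheap_share of traffic is served at cheap_cost_ratio of frontier cost."""
    return cheap_share * (1 - cheap_cost_ratio)


def cascade_saving(hit_rate, cheap_cost_ratio):
    """Saving for a cheap-first cascade.

    Every request pays the cheap call; the (1 - hit_rate) misses
    are retried on the frontier model at full cost.
    """
    cost = cheap_cost_ratio + (1 - hit_rate) * 1.0
    return 1 - cost


theoretical = blended_saving(cheap_share=0.60, cheap_cost_ratio=0.20)
realized = cascade_saving(hit_rate=0.50, cheap_cost_ratio=0.20)
print(f"theoretical {theoretical:.2f}, cascade {realized:.2f}")
```

Under these assumptions the 60 percent split gives the 48 percent theoretical figure, while a 50 percent cascade hit rate gives a 30 percent saving; a cheap tier at 0.15 of frontier cost would give 35 percent. Mirror traffic and cache misses come on top of this and are not modeled here.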

The eval is the join between trace-attributed cost and the outcome event, computed weekly. Three quantities sit on the same span. The first is the dollar cost the gateway set on the response, surfaced as x-agentcc-cost and exported to the trace processor. The second is the outcome event (outcome.resolved=true for support, outcome.accepted=true for coding agents, outcome.booked=true for sales), written as a span attribute when the user signal lands. The third is the routing strategy and resolved model, from x-agentcc-routing-strategy and x-agentcc-model-used. Divide cost by resolved outcomes per route per week, and the savings number stops being fiction.

curl https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-agentcc-..." \
  -H "Content-Type: application/json" \
  -d '{"model":"router/cost-aware-v3","messages":[...]}' \
  -D headers.txt

# Response headers (set by the gateway before the body):
# x-agentcc-routing-strategy: cost-aware-v3
# x-agentcc-model-used: anthropic/claude-3-5-haiku
# x-agentcc-cost: 0.000018
# x-agentcc-latency-ms: 142
# x-agentcc-fallback-used: false
# x-agentcc-cache: miss

For the longer treatment on why cost-per-outcome is the only honest denominator, see AI agent cost optimization and observability. The point that lives here is narrower. The realized-savings number is a query against per-trace data. If the runtime cannot tell you the cost of a single trace, in dollars, at the span level, the savings claim is borrowed from a brochure.
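
A minimal sketch of that query over exported spans. The span dictionaries and the `week` and `cost_usd` field names are assumptions about how the export might look; `routing.strategy_id` and `outcome.resolved` follow the attribute names used in this post.

```python
from collections import defaultdict


def cost_per_resolved(spans):
    """Total cost divided by resolved outcomes, per (week, routing strategy)."""
    cost = defaultdict(float)
    resolved = defaultdict(int)
    for span in spans:
        key = (span["week"], span["routing.strategy_id"])
        cost[key] += span["cost_usd"]
        if span.get("outcome.resolved"):
            resolved[key] += 1
    # None marks routes that spent money without resolving anything.
    return {key: (cost[key] / resolved[key] if resolved[key] else None) for key in cost}


# Hypothetical exported spans for one week.
spans = [
    {"week": "2026-W01", "routing.strategy_id": "cost-aware-v3", "cost_usd": 0.002, "outcome.resolved": True},
    {"week": "2026-W01", "routing.strategy_id": "cost-aware-v3", "cost_usd": 0.004, "outcome.resolved": False},
    {"week": "2026-W01", "routing.strategy_id": "cost-aware-v3", "cost_usd": 0.006, "outcome.resolved": True},
    {"week": "2026-W01", "routing.strategy_id": "intent-classifier-v2", "cost_usd": 0.010, "outcome.resolved": False},
]

report = cost_per_resolved(spans)
for key, value in sorted(report.items()):
    print(key, value)
```

The unresolved call still counts toward the numerator, which is the point: retries and failed attempts are part of what each resolved outcome costs.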

Axis 3: quality preservation under substitution

Every model swap is a hypothesis. The hypothesis is: the cheaper model’s quality on this specific step is within an acceptable band of the more expensive one. The mistake teams make is shipping the swap and finding out later. The mistake is not the swap itself. The mistake is the missing experiment.

The pattern that survives a quarter is three rules.

Rule 1 — score the step, not the trajectory. The planner step’s rubric is “did it pick the right tool”; the formatter step’s rubric is “did it produce valid JSON against the schema”; the responder step’s rubric is faithfulness, helpfulness, and refusal correctness. Same rubric scores both the incumbent and the candidate. Per-step is the unit because a swap on the formatter shouldn’t be gated by a CSAT score that’s mostly responding to the responder.

Rule 2 — pre-commit the band. Write it down before the experiment starts. “Haiku is allowed on the formatter if its EvaluateFunctionCalling rubric score stays within 0.03 of Sonnet on a 500-trace shadow set, and within 0.05 on a 95th-percentile slice of hard cases.” The band is non-negotiable once the experiment runs. A swap that survives by widening the band post-hoc is a regression with a comfortable narrative.

Rule 3 — mirror, score, gate. The gateway mirrors a percentage of live traffic to the candidate model. The trace processor scores both responses with the same rubric. The dashboard shows the band continuously. When the band holds for the agreed window (typically one to three weeks at 10 to 25 percent mirror volume), promote. When it doesn’t, the swap dies on the bench and the line item never moved.

from fi.evals import Evaluator
from fi.evals.templates import EvaluateFunctionCalling, ContextAdherence

evaluator = Evaluator(fi_api_key=FI_KEY, fi_secret_key=FI_SECRET)

result = evaluator.evaluate(
    eval_templates=[EvaluateFunctionCalling(), ContextAdherence()],
    inputs=[
        {"input": planner_input, "output": incumbent_planner_output},
        {"input": planner_input, "output": candidate_planner_output},
    ],
)
# Two scores per step. Same rubric. Same trace.
# Promote candidate only when the band holds for the agreed window.

The honest tradeoff: mirror traffic is real traffic, real tokens, real dollars. You’re paying for the experiment. The discipline is treating mirror cost as the price of not regressing CSAT three weeks downstream. Cheap compared to the alternative. For the related pattern of evals that pass offline and fail in production, see when an agent passes evals and fails in production.
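
The pre-committed band from Rule 2 can be encoded as a check that runs unchanged for the whole experiment. This is a sketch, not the SDK's gating mechanism: the slice names, the score lists, and the one-sided reading of "within" (a candidate that beats the incumbent always passes) are assumptions.

```python
from statistics import mean

# Committed before the experiment starts; never edited while it runs.
BANDS = {"all": 0.03, "hard": 0.05}


def band_holds(incumbent_scores, candidate_scores, band):
    """True when the candidate's mean score is at most `band` below the incumbent's."""
    gap = mean(incumbent_scores) - mean(candidate_scores)
    return gap <= band


def promote(scores_by_slice, bands=BANDS):
    """Promote only when every pre-committed slice holds its band."""
    return all(
        band_holds(scores["incumbent"], scores["candidate"], bands[name])
        for name, scores in scores_by_slice.items()
    )


# Hypothetical shadow-set scores: fine overall, regressed on the hard slice.
scores_by_slice = {
    "all": {"incumbent": [0.90, 0.92], "candidate": [0.89, 0.91]},
    "hard": {"incumbent": [0.80, 0.84], "candidate": [0.70, 0.78]},
}
print(promote(scores_by_slice))
```

Keeping the bands in a constant that is reviewed like code makes post-hoc widening visible in version control.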

Axis 4: fallback correctness — does the rare path actually work

The fallback chain is the part of the routing policy that almost nobody tests after week one. The primary is stable for ten months, the fallback hasn’t fired in eight, and by the time it does fire the chain has drifted. A model joined the pool. A regional residency requirement changed. The ordering got stale. A provider deprecated an endpoint. The first production fallback under load is the first eval anyone runs on it.

The fix is to chaos-test on a sample of golden-set traffic, every sweep. Force a primary failure (rate limit, timeout, regional residency miss, provider outage), let the chain fire, score the result with the same per-route rubric you use on the primary. The target is fallback quality within 5 to 10 percent of primary quality. Wider than that, and the policy is promising a graceful degradation it cannot deliver.
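
A toy harness shows the shape of such a test. Everything here is a stand-in: the providers are stub functions, the failure is injected by name, and the tolerance is read as a relative drop from the primary score, which is one possible interpretation of "within 5 to 10 percent".

```python
class PrimaryDown(Exception):
    """Injected to simulate a rate limit, timeout, or provider outage."""


def call_with_fallback(chain, prompt, force_fail=()):
    """Try each (name, fn) in order; providers named in force_fail fail on purpose."""
    for name, fn in chain:
        try:
            if name in force_fail:
                raise PrimaryDown(name)
            return name, fn(prompt)
        except PrimaryDown:
            continue  # move to the next provider in the chain
    raise RuntimeError("fallback chain exhausted")


def within_tolerance(primary_score, fallback_score, tolerance=0.10):
    """True when fallback quality is at most `tolerance` (relative) below primary."""
    return fallback_score >= primary_score * (1 - tolerance)


# Stub providers standing in for real model calls.
chain = [
    ("primary", lambda prompt: f"primary:{prompt}"),
    ("fallback", lambda prompt: f"fallback:{prompt}"),
]

served_by, answer = call_with_fallback(chain, "refund status?", force_fail={"primary"})
print(served_by, within_tolerance(primary_score=0.90, fallback_score=0.84))
```

In a real sweep the scores would come from the same per-route rubric applied to both the primary and the fallback answer, and an exhausted chain would itself be a failing result.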

Three production events to instrument. The header x-agentcc-fallback-used flips to true on every call where the chain fired. The Prometheus counter agentcc_requests_total{status="fallback"} aggregates the rate. The trace span carries routing.fallback_used=true plus routing.decision_reason, so the post-hoc debug starts from the trace tree, not from grepping logs. The combined signal turns fallback events from invisible into queryable. For deeper coverage on the gateway side of this, see LLM failover and fallback for AI gateways and what is an LLM fallback strategy.

The cluster the Error Feed surfaces most often on first run: “Fallback chain skips Bedrock when Bedrock would have served the regional residency need.” Wrong ordering for a compliance constraint that nobody re-checked when the policy was last edited.

The shadow-route workflow

Shadow routing is the technique that makes all four axes runnable on real traffic without risk. Send a copy of the production request to a candidate router, score both answers with the same rubric, never show the candidate’s answer to the user. The Agent Command Center ships shadow and mirror modes alongside its routing strategies, plus race semantics for fastest-of-N and circuit-breaker behavior on the fallback path.

The workflow is five steps.

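Step 1. Instrument every gateway call with the five routing headers (x-agentcc-routing-strategy, x-agentcc-model-used, x-agentcc-fallback-used, x-agentcc-cost, x-agentcc-latency-ms). Lift them into traceAI span attributes so the eval can query them.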

Step 2. Build the stratified golden set. 200 to 1000 queries, split across intent, length, difficulty, and cost tier. Pull from production traces, label them, keep the hard ones in. Refresh weekly.

Step 3. Run the production router and one or more candidate routers against the set. The candidates run in shadow mode: the gateway captures their response and routing metadata, never serves them to the user.

Step 4. Score per-route quality, route correctness, and Pareto position with one pass through the ai-evaluation SDK. Four distributed runners (Celery, Ray, Temporal, Kubernetes) finish the sweep in minutes.

Step 5. Cluster the failures. The Error Feed runs HDBSCAN soft-clustering over failing traces and surfaces named clusters. A Sonnet 4.5 Judge writes the immediate_fix per cluster. Linear is the only Error Feed integration wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap. The cluster signal feeds the next sweep’s golden set.

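import os
from openai import OpenAI

# Assumption: `gateway` is an OpenAI SDK client pointed at the gateway, and the
# key lives in an env var named AGENTCC_API_KEY (the variable name is illustrative).
gateway = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key=os.environ["AGENTCC_API_KEY"],
)

# Production call: the answer the user sees.
# with_raw_response exposes the HTTP headers; .parse() returns the usual completion.
prod = gateway.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": user_input}],
    extra_headers={"x-agentcc-routing-strategy": "intent-classifier-v2"},
)
prod_route = prod.headers["x-agentcc-model-used"]
fallback_fired = prod.headers["x-agentcc-fallback-used"] == "true"
prod_answer = prod.parse().choices[0].message.content

# Shadow call: candidate router, response never shown to the user.
shadow = gateway.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": user_input}],
    extra_headers={
        "x-agentcc-routing-strategy": "cost-aware-budget-v3",
        "x-agentcc-shadow": "true",
    },
)
shadow_route = shadow.headers["x-agentcc-model-used"]
shadow_answer = shadow.parse().choices[0].message.content

# Score both answers with the same rubric so the comparison is apples to apples.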

Aggregate across the golden set, plot the two routers on the cost-quality and latency-quality planes, and promote the candidate only on Pareto improvement. Never on a single axis.
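
The promotion rule can be written as a dominance check over the aggregated metrics. This is a minimal sketch with hypothetical metric names and numbers; a real gate might also allow quality to sit inside a pre-committed band rather than requiring it to be no worse.

```python
def pareto_improves(candidate, incumbent):
    """True when the candidate is no worse on every axis and strictly better on one.

    quality: higher is better. cost and latency_ms: lower is better.
    """
    no_worse = (
        candidate["quality"] >= incumbent["quality"]
        and candidate["cost"] <= incumbent["cost"]
        and candidate["latency_ms"] <= incumbent["latency_ms"]
    )
    strictly_better = (
        candidate["quality"] > incumbent["quality"]
        or candidate["cost"] < incumbent["cost"]
        or candidate["latency_ms"] < incumbent["latency_ms"]
    )
    return no_worse and strictly_better


# Hypothetical golden-set aggregates for two routers.
incumbent = {"quality": 0.90, "cost": 1.00, "latency_ms": 300}
cheaper_and_better = {"quality": 0.91, "cost": 0.70, "latency_ms": 280}
cheaper_but_worse = {"quality": 0.85, "cost": 0.50, "latency_ms": 250}

print(pareto_improves(cheaper_and_better, incumbent))
print(pareto_improves(cheaper_but_worse, incumbent))
```

The second candidate wins on cost and latency and still fails the gate, which is exactly the single-axis promotion the rule is meant to block.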

What to instrument on the traceAI side

The eval is only as good as the metadata it can see. The minimum span-attribute surface is five attributes per call.

from fi_instrumentation import register, ProjectType
from traceai_openai import OpenAIInstrumentor
from opentelemetry import trace

provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="prod-routing-eval",
)
OpenAIInstrumentor().instrument(tracer_provider=provider)

# Run this inside the request's active span. With no active span,
# get_current_span() returns a non-recording span and set_attribute() is a no-op.
# `prod`, `prod_route`, and `fallback_fired` come from the gateway call above;
# `pool` and `reason_from_router` are supplied by your routing layer.
current = trace.get_current_span()
current.set_attribute("routing.strategy_id", prod.headers["x-agentcc-routing-strategy"])
current.set_attribute("routing.model_used", prod_route)
current.set_attribute("routing.candidate_pool", ",".join(pool))
current.set_attribute("routing.fallback_used", fallback_fired)
current.set_attribute("routing.decision_reason", reason_from_router)

Every trace now carries the routing decision, the alternatives that existed, and why this one was picked. The eval has something to look at, and the post-hoc debug doesn’t require grepping the gateway logs.

Anti-patterns to watch for

Four patterns are common and all of them are quiet failures.

Routing without a per-route quality SLA. If you can’t say “the fast tier must clear 0.85 on TaskCompletion or it isn’t eligible for this intent,” the policy has no constraint. Cost wins by default, quality drifts.

No shadow-route eval. A/B tests work but expose users to the candidate, which is dangerous on high-stakes intents. Shadow is strictly safer for the comparison.

Single-axis routing. Cost alone produces “always cheapest.” Latency alone produces “always fastest.” Quality alone produces “always most expensive.” Real policies compose at least two axes and gate on Pareto position.

Untested fallback. The rare path is rare until it isn’t. Chaos-test the chain every sweep, or the first production fallback under load is the first eval anyone runs.
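
The first anti-pattern, routing without a per-route quality SLA, has a simple structural fix: filter the pool by a quality floor before optimizing for cost. The sketch below assumes measured per-intent scores and relative tier costs are already available; the tier names, scores, and function names are hypothetical.

```python
def eligible_tiers(intent, tier_scores, sla_floor):
    """Tiers whose measured score on this intent clears the quality floor."""
    return [tier for tier, score in tier_scores[intent].items() if score >= sla_floor]


def pick_tier(intent, tier_scores, tier_cost, sla_floor=0.85):
    """Cheapest tier among those that meet the SLA; fail loudly if none does."""
    candidates = eligible_tiers(intent, tier_scores, sla_floor)
    if not candidates:
        raise ValueError(f"no tier meets the SLA for intent {intent!r}")
    return min(candidates, key=lambda tier: tier_cost[tier])


# Hypothetical TaskCompletion scores per intent and relative costs per tier.
tier_scores = {
    "refunds": {"fast": 0.71, "capable": 0.93},
    "technical": {"fast": 0.94, "capable": 0.95},
}
tier_cost = {"fast": 1, "capable": 5}

print(pick_tier("refunds", tier_scores, tier_cost))
print(pick_tier("technical", tier_scores, tier_cost))
```

Quality acts as the constraint and cost as the objective, so the cheap tier wins only on intents where it has earned eligibility.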

How Future AGI ships the routing-eval loop

Future AGI ships routing-policy evaluation across four surfaces that compose into one workflow.

Agent Command Center is the gateway runtime. OpenAI-compatible drop-in via base_url="https://gateway.futureagi.com/v1"; existing OpenAI SDK code keeps working. Six native provider adapters (OpenAI, Anthropic, Gemini, Bedrock, Azure, Cohere) plus 100-plus more via OpenAI-compatible presets. Routing strategies cover weighted, least-latency, cost-optimized, adaptive, and race. Shadow and mirror modes ship as first-class config. Fallback chains, circuit breakers, and per-tenant budgets with five-level granularity (org, team, user, key, tag) gate the cost side of every experiment. Response headers expose x-agentcc-routing-strategy, x-agentcc-model-used, x-agentcc-fallback-used, x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-cache, and x-agentcc-provider on every call. Prometheus on /-/metrics. OTLP traces to any collector. Single Go binary, Apache 2.0, self-host or hit the cloud endpoint. ~29k req/s and P99 ≤ 21 ms with guardrails on, on t3.xlarge. SOC 2 Type II, HIPAA, GDPR, and CCPA certified for the regulated workloads where routing crosses compliance boundaries.

ai-evaluation is the rubric layer. 50+ pre-built EvalTemplate classes including Groundedness, ContextAdherence, TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, plus the CustomLLMJudge you build route-correctness rubrics on. Four distributed runners (Celery, Ray, Temporal, Kubernetes) so the sweep finishes in minutes rather than hours. Apache 2.0.

traceAI is the tracing layer underneath. 50+ AI surfaces across Python, TypeScript, Java, and C#. Auto-instrumentation for OpenAI, LangChain, Groq, Portkey, and Gemini. The routing.strategy_id, routing.candidate_pool, and routing.decision_reason span attributes carry routing metadata into the trace tree. PII redaction built in.

The Future AGI Platform is where the loop closes. Self-improving evaluators retune routing thresholds from production feedback, with lower per-eval cost than Galileo Luna-2 on the same workloads. The Error Feed runs HDBSCAN soft-clustering over failing traces and writes immediate-fix suggestions through a Sonnet 4.5 Judge. Linear is wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap. The typical first-sweep cluster surfaces something like “router picks the cheap model on customer-support refunds where the capable model would catch the policy violation”: a named pattern, with a fix, that a human can act on in the next standup.

Ready to evaluate your routing policy, not just your models? Point your OpenAI SDK at https://gateway.futureagi.com/v1, read the x-agentcc-* headers on the response, and run a shadow sweep through the ai-evaluation SDK. Start with the Agent Command Center quickstart and the traceAI integration guide.

Frequently asked questions

Why is routing-policy evaluation different from model evaluation?
Model eval asks whether a model is good. Routing-policy eval asks whether the policy made the right pick from a pool, on the inputs production actually sees. The two artifacts are different. A model can score 0.92 on a benchmark and still ruin a routed slice because the router fed it a distribution it was never tested against. Conversely, a model with mediocre benchmark scores can be the right pick for a narrow lane the router consistently sends it. The policy is a piece of code that maps inputs to models, and like any piece of code it has bugs, drifts, and needs tests. The four tests that matter: route correctness (right model for the question class), cost-savings realized versus theory, quality preservation under substitution, and fallback correctness. Each is a separate question with a separate rubric.
What is route-correctness and how do you score it?
Route-correctness asks whether the policy picked the right model from the pool for a given input, independent of how well that model answered. You score it by building a rubric judge that takes four arguments — the input, the model selected, the candidate pool with cost and latency tiers, and the budget envelope — and returns a label: correct, over-routed, under-routed, or ambiguous. The CustomLLMJudge template in the ai-evaluation SDK encodes this in roughly twenty lines. First-run numbers between 8 and 22 percent wrong decisions on otherwise healthy routers are common. The distribution of those wrong decisions matters more than the headline rate. Over-routing concentrated on a cheap intent is a cost leak. Under-routing concentrated on a high-stakes intent is a quality regression waiting to surface.
What does cost-savings-realized mean compared to theoretical savings?
Theoretical savings are what the routing config claims: 'send 60 percent of traffic to the cheap tier at one fifth the per-token cost, save 48 percent on token spend.' Realized savings are what shows up on the bill after retries, cascades, and shadow traffic settle. The gap is usually wider than teams expect. A cheap-first cascade with a 50 percent hit rate looks like a 50 percent saving in theory and a 30 to 35 percent saving in practice once retries on the frontier model are counted. A shadow rollout adds 10 to 25 percent mirror traffic that costs real tokens. A semantic cache that promises 40 percent hits delivers 28 percent on a new corpus. The eval is the join between trace-attributed cost and the outcome event, computed weekly. The denominator is cost-per-resolved-outcome, not cost-per-token. Without that denominator the savings number is fiction.
What is quality preservation under substitution and why does it matter?
When the policy swaps the frontier model out for a cheaper one on a step, the quality on that step has to stay inside a pre-committed band — or the swap shouldn't ship. The pattern is three rules. First, the rubric is per step (planner, tool-caller, formatter, responder), not per trajectory. Second, the band is explicit: 'Haiku is allowed on the formatter if its rubric score stays within 0.03 of Sonnet on a 500-trace shadow set.' Third, the gateway mirrors live traffic to the candidate for one to three weeks, the trace processor scores both, and the rollout flips only when the band holds. Without this discipline, a model swap looks fine in week one and turns into a 16-point CSAT drop by week four, and nobody connects the regression to the routing change because the eval set was static.
How do you test fallback correctness?
You chaos-test the fallback chain on a sample of golden-set traffic. Force a primary failure (rate limit, timeout, regional residency miss, provider outage), let the fallback fire, and score the result with the same rubric you use on the primary route. The target is fallback quality within 5 to 10 percent of primary quality. Wider than that, and the gateway is promising a graceful degradation it cannot deliver. The Agent Command Center sets x-agentcc-fallback-used on every response where the fallback path was taken, so production telemetry surfaces real-world fallback events. The trap most teams hit is testing fallback once at setup and never again. The primary stays healthy long enough that the fallback drifts — a new model joins the pool, a regional config changes, the chain ordering gets stale — and the first time it fires under real load is the first time anyone learns it doesn't work.
What metadata does the gateway need to expose for routing eval to work?
Five attributes per response, set as headers before the body returns: x-agentcc-routing-strategy (which policy fired), x-agentcc-model-used (the resolved model), x-agentcc-fallback-used (true when the fallback chain handled the request), x-agentcc-cost (dollar cost of the call), and x-agentcc-latency-ms (gateway-measured latency). traceAI lifts those into span attributes — routing.strategy_id, routing.model_used, routing.candidate_pool, routing.fallback_used, routing.decision_reason — so the eval has something to query. Prometheus surfaces agentcc_cost_total, agentcc_tokens_total, agentcc_cache_hits_total and misses, and agentcc_requests_total by provider and status. Without that metadata on every call, the eval is guessing. With it, the policy becomes an instrumented system that you can debug like any other piece of production code.
How does Future AGI's Agent Command Center support routing-policy evaluation?
The gateway ships routing-strategy headers, shadow and mirror modes, and the cost and latency response attributes that the eval reads from. Routing config lives in YAML so policy swaps are config changes, not redeploys. Six native provider adapters cover OpenAI, Anthropic, Gemini, Bedrock, Azure, and Cohere, with 100-plus more via OpenAI-compatible presets. Five-level budgets (org, team, user, key, tag) gate runaway spend on shadow experiments. Exact and semantic caches surface as separate span attributes so cost-per-outcome stays honest when a hit returns at single-digit milliseconds. The ai-evaluation SDK runs route-correctness, per-route quality, and Pareto-position scoring across four distributed runners. traceAI carries the routing metadata into the trace tree. The Error Feed clusters failing traces with HDBSCAN and writes immediate-fix suggestions through the Sonnet 4.5 Judge — the typical cluster is something like 'router picks the cheap model on customer-support refunds where the capable model would catch the policy violation.'