What Is Cost-Optimized Routing (LLM Gateway)?
An LLM-gateway strategy that selects the cheapest eligible provider or model while preserving required quality, latency, availability, and policy constraints.
What Is Cost-Optimized Routing?
Cost-optimized routing is an LLM-gateway routing strategy that sends each request to the lowest-cost eligible provider or model while honoring quality, latency, compliance, and availability constraints. It shows up in the gateway before the provider call, where a routing policy compares model prices, token estimates, user tier, cache state, and fallback health. In FutureAGI’s Agent Command Center, cost routing is configured through routing-policies. It does not always choose the cheapest model; it chooses the cheapest route that can still meet the task’s reliability threshold.
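As a sketch, the selection step is a constrained minimization over eligible targets. The model names, prices, scores, and the `pick_route` helper below are illustrative placeholders, not Agent Command Center APIs:

```python
from dataclasses import dataclass

@dataclass
class Target:
    model: str
    cost_per_1k_tokens: float  # blended prompt+completion price, illustrative
    quality_score: float       # offline eval score for this task class
    p99_latency_ms: float

def pick_route(targets, min_quality, max_latency_ms):
    """Cheapest target that still meets the quality and latency constraints."""
    eligible = [
        t for t in targets
        if t.quality_score >= min_quality and t.p99_latency_ms <= max_latency_ms
    ]
    if not eligible:
        raise RuntimeError("no eligible target; fall back or fail the request")
    return min(eligible, key=lambda t: t.cost_per_1k_tokens)

targets = [
    Target("small-model", 0.15, 0.78, 900),
    Target("mid-model", 0.60, 0.90, 1400),
    Target("premium-model", 3.00, 0.97, 2100),
]

route = pick_route(targets, min_quality=0.85, max_latency_ms=2000)
# Picks "mid-model": "small-model" is cheaper but fails the quality floor.
```

Note that the cheapest model is filtered out before price is ever compared, which is the "cheapest eligible route" distinction the definition draws.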
Why it matters in production LLM/agent systems
Over-spend is usually silent. The app works, invoices rise, and no trace explains why a low-risk support FAQ went to a premium model. The obvious failure mode is runaway cost, but the second-order failures are just as expensive: engineers add brittle model-selection logic inside application code, SREs lose provider-level cost attribution, and product teams cannot tell whether a downgrade broke answer quality or only changed latency.
The symptoms show up as skewed provider mix, token-cost-per-trace spikes, high cache-miss cost, and routes where p99 latency improves while answer quality falls. Finance feels it in margin. Platform teams feel it in quota planning. End users feel it when a cheap model gets a task it cannot handle and returns a shallow or malformed answer.
Cost routing matters more for 2026-era agent systems because one user action may trigger planning calls, tool-selection calls, RAG calls, summarization calls, and final response calls. A single bad route is small; 30 bad route decisions in one agent trajectory can turn a cent-level task into a dollar-level task. Good cost routing separates task classes: low-risk classification can use a small model, tool execution may require a model with strong function-calling, and regulated workflows may require a provider pinned by region or tenant policy.
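The trajectory arithmetic is worth making concrete. The per-call prices below are illustrative only; real costs depend on provider pricing and token counts:

```python
# Illustrative per-call costs in USD, not real provider prices.
PREMIUM_CALL = 0.030  # large model, long context
SMALL_CALL = 0.001    # small model for low-risk classification

calls_per_action = 30  # planning + tool-selection + RAG + summarization + final

all_premium = calls_per_action * PREMIUM_CALL
mixed = 25 * SMALL_CALL + 5 * PREMIUM_CALL  # premium only where the task needs it

print(f"all premium:  ${all_premium:.3f} per user action")
print(f"mixed routes: ${mixed:.3f} per user action")
```

Under these assumed prices, routing every call to the premium model costs roughly five times the mixed policy per user action, which is the cent-level-to-dollar-level jump described above.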
How FutureAGI handles cost-optimized routing
FutureAGI handles cost-optimized routing in Agent Command Center through the routing-policies surface (the gateway:routing-policies anchor). A production policy names the cost-optimized routing primitive, sets eligible targets, attaches constraints, and emits the selected provider and model into the trace. The policy sits beside model fallback, semantic-cache, pre-guardrail, post-guardrail, and traffic-mirroring, so an engineer can reason about cost, safety, and availability in one gateway workflow.
A real example: a support agent receives both billing questions and account-closure requests. The platform team creates one cost-optimized policy for the support_agent route. FAQ-style requests with metadata.risk = "low" can go to a small model if the semantic cache misses. Account-closure requests require a provider that supports strict JSON and stronger reasoning. If the cheap route fails a schema check or times out, the policy moves into model fallback instead of returning an invalid answer.
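That fallback behavior can be sketched as plain control flow. `route_with_fallback`, the target names, and `fake_call` are hypothetical stand-ins, not gateway APIs, and a `json.loads` check stands in for the real schema eval:

```python
import json

def route_with_fallback(prompt, targets, call_model):
    """Try targets cheapest-first; move on when a target's output fails checks."""
    for model in targets:
        try:
            raw = call_model(model, prompt)
            return model, json.loads(raw)  # strict-JSON check as a stand-in eval
        except (json.JSONDecodeError, TimeoutError):
            continue  # record the failure in the trace, then try the next target
    raise RuntimeError("all routes failed; surface an error, not an invalid answer")

# Hypothetical provider call: the cheap model returns malformed JSON here.
def fake_call(model, prompt):
    return "not json" if model == "cheap" else '{"action": "close_account"}'

model, answer = route_with_fallback("close my account", ["cheap", "premium"], fake_call)
# Falls through to "premium" because the cheap output fails json.loads.
```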
The trace records gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, the selected provider, the routing policy ID, and the cache outcome. Engineers then review a dashboard grouped by policy: cost per trace, fallback rate, p99 latency, and post-route eval pass rate. FutureAGI’s approach is to treat cost as a constrained routing objective, not a global preference for cheaper models. Unlike LiteLLM-style price lists that can be wired directly into app code, Agent Command Center keeps the route as an auditable policy object with trace evidence for each decision.
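As a minimal sketch of that per-policy roll-up, assuming illustrative per-token prices and the trace attributes named above:

```python
from collections import defaultdict

# Illustrative prices in $ per 1M (input, output) tokens; not real price lists.
PRICE = {"small-model": (0.10, 0.40), "premium-model": (3.00, 15.00)}

traces = [
    {"policy": "support_agent", "gen_ai.request.model": "small-model",
     "gen_ai.usage.input_tokens": 800, "gen_ai.usage.output_tokens": 200},
    {"policy": "support_agent", "gen_ai.request.model": "premium-model",
     "gen_ai.usage.input_tokens": 1200, "gen_ai.usage.output_tokens": 600},
]

cost_by_policy = defaultdict(float)
for t in traces:
    in_price, out_price = PRICE[t["gen_ai.request.model"]]
    cost_by_policy[t["policy"]] += (
        t["gen_ai.usage.input_tokens"] / 1e6 * in_price
        + t["gen_ai.usage.output_tokens"] / 1e6 * out_price
    )

print(dict(cost_by_policy))  # token cost per policy; divide by trace count for cost-per-trace
```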
How to measure or detect cost-optimized routing
Measure cost-optimized routing as a policy outcome, not as a single invoice number:
- Token-cost-per-trace — cost rolled up from prompt and completion tokens for every trace under a routing policy.
- Quality pass rate after route — percentage of routed outputs passing `AnswerRelevancy`, `Groundedness`, `JSONValidation`, or the task-specific eval.
- Fallback trigger rate — share of cheap-route attempts that move to model fallback; high rates mean the first target is underpowered.
- p99 latency by policy — confirms the cheapest route is not violating user-facing latency targets.
- Provider mix drift — actual percentage of calls by provider and model, compared with the intended policy.
- User-feedback proxy — thumbs-down rate or escalation rate segmented by selected model.
To catch quality regressions on a downgraded route, score sampled outputs with an eval such as Answer Relevancy:

```python
from fi.evals import AnswerRelevancy

# Score how well the routed response answers the original question.
score = AnswerRelevancy().evaluate(
    input=user_question,
    output=routed_response,
)
```
Use the eval score beside gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, not instead of them. A route that saves 60% on tokens but doubles escalation rate is not cost-optimized; it is cost-shifted.
Common mistakes
- Choosing the cheapest model globally. Cost routing must honor task class, latency SLO, region policy, and minimum quality score.
- Ignoring completion-token variance. A cheap model that writes twice as much can cost more than a concise premium model.
- Measuring spend only by provider invoice. You need token-cost-per-trace grouped by routing policy and selected model.
- Routing agents like chatbots. Planning, tool use, and final response calls often deserve different cost policies.
- Skipping eval checks after downgrade. Run `AnswerRelevancy`, `Groundedness`, or `JSONValidation` on sampled cheap-route outputs before increasing traffic.
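The completion-token-variance mistake checks out with simple arithmetic; the prices and token counts below are illustrative, not real provider rates:

```python
# $ per 1M output tokens (illustrative) and typical completion lengths observed.
cheap = {"out_price": 0.60, "avg_output_tokens": 900}    # verbose small model
premium = {"out_price": 1.50, "avg_output_tokens": 300}  # concise larger model

cheap_cost = cheap["avg_output_tokens"] / 1e6 * cheap["out_price"]
premium_cost = premium["avg_output_tokens"] / 1e6 * premium["out_price"]

print(f"cheap:   ${cheap_cost:.6f} per completion")
print(f"premium: ${premium_cost:.6f} per completion")
# Under these numbers, the "cheap" model costs more per completion
# once its verbosity is priced in.
```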
Frequently Asked Questions
What is cost-optimized routing?
Cost-optimized routing is an LLM-gateway strategy that selects the lowest-cost eligible model or provider for each request while preserving latency, quality, availability, and policy constraints.
How is cost-optimized routing different from least-latency routing?
Least-latency routing prefers the fastest healthy target. Cost-optimized routing prefers the cheapest target that still meets the request's reliability and latency thresholds.
How do you measure cost-optimized routing?
Measure token-cost-per-trace by routing policy, plus `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, fallback rate, p99 latency, and quality eval pass rate after the route decision.