
What Is LLM Routing? A 2026 Field Guide

A 2026 field guide to LLM routing: the plain-English definition, the five named strategies (round-robin, weighted, latency-aware, cost-aware, quality-aware), the architecture, the buyer's signals, and the myths.


Originally published May 17, 2026.

A platform team migrated a customer-support copilot from a single OpenAI key to a five-provider routing pool in March. The first version of the router shipped a flat round-robin policy across OpenAI, Anthropic, Bedrock, Vertex AI, and Mistral; within a week, the support quality score on hard prompts had dropped by 12 percent because a fifth of the traffic was hitting a cheap-tier model that couldn’t handle the workload, while the cost dashboard showed an unexpected spike on Claude 3 Opus that nobody had authorised.

The team didn’t have a router problem. They had a routing-policy problem, and a missing evaluation feedback loop.

This guide defines LLM routing, the canonical category inside the AI gateway that decides which upstream model serves each request, anchored to OpenTelemetry GenAI semantic conventions, OWASP LLM Top 10 2025, and the named routing strategies that have stabilised across production AI gateways in 2026.

TL;DR: The 2026 LLM Routing Definition

LLM routing is the behaviour inside an AI gateway that selects one upstream large language model from a pool of candidates on each request, on a policy that can read the request shape, live provider state (latency, error rate, cost per token), and the user or tenant context, and that records the selection as an OpenTelemetry GenAI span attribute so the trace remembers which provider served the request.

The category sits inside the AI gateway market that Gartner named in its Market Guide for AI Gateways (October 2025); routing is one of the canonical primitives every named AI gateway (Cloudflare, IBM, Kong, Microsoft Azure API Management, Databricks Unity, Future AGI Agent Command Center, Portkey, LiteLLM, Maxim Bifrost) ships at the same network hop as authentication, guardrails, caching, and telemetry.

  • Routing isn’t the same as load balancing. Layer 4 and Layer 7 load balancers distribute identical traffic across identical backends. LLM routers distribute non-identical traffic (a hard reasoning prompt versus a one-shot classification) across non-identical backends (GPT-4 versus Claude 3 Haiku versus Llama 3) on a policy that has to read the request.
  • Five named routing strategies have stabilised: round-robin, weighted distribution, latency-aware, cost-aware, and quality-aware. Production gateways compose them; the choice is rarely “which one” and almost always “which two or three for which workload.”
  • The OpenTelemetry GenAI span attributes for routing (gen_ai.request.model and gen_ai.response.model, plus custom routing-strategy and fallback metadata) turn a router from an opaque black box into an auditable trace where the route can be replayed.
  • The self-improving loop is the 2026 frontier: tie the routing decision to a held-out evaluation score via span_id so the policy updates from feedback rather than from a hand-edited config file.

What Is LLM Routing?

LLM routing is the layer inside an AI gateway that, on every incoming request, picks one upstream large language model from a configured pool of candidates and forwards the request through a provider adapter to that model. The router is a Layer 7 decision: it inspects the OpenAI-compatible request payload, reads live signals about the candidate pool (current latency, error rate, cost-per-token, capacity headroom), reads the user or tenant context (virtual key, product, compliance regime), and applies a routing policy that maps that input to a single upstream provider.

The five primary-source definitions of the AI gateway category (Cloudflare, IBM, Kong, Microsoft Azure API Management, and Databricks Unity) all name routing as a canonical primitive. The Gartner Market Guide for AI Gateways (October 2025) treats multi-provider routing as one of the four pillars (alongside guardrails, caching, and observability) that define the category.

On the inbound side, the client calls a single OpenAI-compatible endpoint and never knows which upstream served the response. On the outbound side, the gateway translates the call to the upstream provider’s native schema (Anthropic Messages, Bedrock InvokeModel, Vertex AI generateContent) and dispatches it. In between, the router runs the policy.

What the router isn’t: it isn’t a load balancer, it isn’t a service mesh, it isn’t the full AI gateway. It’s one block inside the gateway, sitting next to the cache check, the guardrail layer, and the cost meter. Procurement conversations frequently conflate “LLM router” (dispatch logic only: OpenRouter, MLflow router) with “AI gateway” (the full control plane); the model routing listicle ranks both shapes side-by-side.

The LLM Router Architecture

The architecture of a production LLM router has three blocks: the input adapter on the way in, the strategy engine in the middle, and the provider dispatch with fallback on the way out. Each block is a place where the router can emit a span attribute, fail open, or fail closed.

1. Input adapter. The router accepts the OpenAI-compatible request shape (chat completions, embeddings, image generation, audio) and extracts the routing-relevant fields: the model name (often used as a routing key with provider prefix, e.g. anthropic/claude-3-5-sonnet), the message payload, the temperature and max-tokens parameters, the user metadata, and any custom headers (x-fagi-route-policy, x-tenant-id, x-budget-cap).

2. Strategy engine. The strategy engine reads the configured routing policy for the virtual key (round-robin, weighted, latency-aware, cost-aware, quality-aware, or a composed policy) and applies it against three live data sources: the candidate pool (each provider with base cost, latency, quality score); the live provider state (rolling error rate, P95 latency, rate-limit headroom from the gateway’s own telemetry); and the eval feedback (held-out evaluation scores tied to recent traces by span_id). The engine emits a single routing decision (one provider name, a confidence score, and a fallback list), recorded as a custom attribute on the OpenTelemetry GenAI span.

3. Provider dispatch with fallback. The dispatch block translates the request to the upstream provider’s schema, swaps in the provider key (the client never sees it), and forwards. On a 429, 5xx, or guardrail block, dispatch falls back through the configured fallback chain sequentially; each fallback attempt emits a span attribute. The streaming response (SSE or WebSocket) flows back with a running cost and token counter.

The three blocks compose. A round-robin policy with no live signal and no eval feedback is a one-block router; a quality-aware policy with a self-improving loop and a 5-step fallback chain is the three-block production shape.
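The three blocks can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any gateway's actual implementation: `RoutingRequest`, `extract`, `choose`, and `dispatch` are hypothetical names, the policy in `choose` is deliberately trivial, and the upstream call is a stub passed in by the caller, with `RuntimeError` standing in for a 429 or 5xx.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutingRequest:
    model: str       # routing key, e.g. "anthropic/claude-3-5-sonnet"
    messages: list
    tenant_id: str


def extract(payload: dict, headers: dict) -> RoutingRequest:
    """Block 1: pull the routing-relevant fields out of an OpenAI-style payload."""
    return RoutingRequest(
        model=payload["model"],
        messages=payload.get("messages", []),
        tenant_id=headers.get("x-tenant-id", "default"),
    )


def choose(req: RoutingRequest, pool: list[str]) -> list[str]:
    """Block 2: return providers in preference order (primary first, then fallbacks).

    Hypothetical policy: honour the provider prefix of the model name when it
    is in the pool, and keep the remaining providers as the fallback list.
    """
    preferred = req.model.split("/", 1)[0]
    rest = [p for p in pool if p != preferred]
    return ([preferred] if preferred in pool else []) + rest


def dispatch(req: RoutingRequest, ordered: list[str],
             call: Callable[[str, RoutingRequest], str]):
    """Block 3: try each provider in order and record every attempt for the trace."""
    attempts = []
    for provider in ordered:
        try:
            result = call(provider, req)
            attempts.append((provider, "ok"))
            return result, attempts
        except RuntimeError as exc:  # stand-in for a 429 / 5xx from the upstream
            attempts.append((provider, str(exc)))
    raise RuntimeError(f"all providers failed: {attempts}")
```

The `attempts` list is where a real gateway would emit one span attribute per fallback step.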

The 5 Problems LLM Routing Solves

Production teams adopt LLM routing for one of five reasons, in roughly this order of frequency in 2026.

1. Single-provider blast radius

A single-provider outage takes the entire LLM-dependent feature down. The April 2026 multi-hour OpenAI degradation incident is the canonical recent example; teams without a router watched P99 latency flatline while teams with a router watched the failover chain pick up traffic on Anthropic and Bedrock. The router solves this by treating the second and third upstream providers as first-class fallback targets. The failover happens in milliseconds and emits a span attribute (for example, a gateway-defined gen_ai.fallback.reason = "upstream_5xx") that the SRE team can replay.

2. Cost variance across providers for the same workload

Token prices vary by an order of magnitude across providers and tiers for the same prompt: GPT-4-class models cost roughly 10x to 30x what mid-tier models cost per million tokens. Without routing, every request hits the developer’s default provider, and the bill scales linearly with traffic regardless of whether the workload actually needed the top tier. Cost-aware routing solves this by picking the cheapest provider that meets a quality floor on a per-request basis. Avengers Pro, RouteLLM, and IBM Research all show up to 85 percent savings while preserving GPT-4-level quality, but only when the quality floor evaluator is wired in.

3. Tail latency

P99 latency is a graveyard for chat copilots and voice agents. Tail-heavy upstream providers (cold-start latency, autoscaler lag, regional congestion) make user-facing latency unpredictable. Latency-aware routing reads a rolling P95 window per provider and routes to the lowest-latency candidate that meets the cost and quality floors; race-style routing recovers another tier of latency at the cost of doubled token spend on the discarded response. The voice-agent sub-800-millisecond budget is the workload where this matters most.

4. Quality drift across model versions

Model providers ship updates. The same prompt against gpt-4o in February 2026 and gpt-4o in May 2026 isn’t the same model; the named version moved underneath the API. Quality drift shows up as a slow degradation in customer-visible metrics that the engineering team can’t trace to a deploy because they didn’t deploy. Quality-aware routing solves this by scoring the response against a held-out eval set and tying the score back to the routing decision via span_id. When a model’s score drops, the router downweights that provider for similar future requests.

5. Tenant and workload heterogeneity

A multi-tenant SaaS product has heterogeneous workloads inside one application: enterprise tenants on strict HIPAA need PHI-redacted routing to Bedrock-only providers, free-tier tenants need cost-capped routing to the cheapest fallback, internal employees need quality-first routing for high-stakes prompts. Without a router, the application has to encode all of that in business logic; with a router, the policy lives in the gateway and updates from a configuration change rather than a deploy.
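As a rough sketch of what "the policy lives in the gateway" can look like, the table below maps a tenant tier to the providers it may use and the strategy applied. Everything here is hypothetical (the `TenantPolicy` type, the tier names, the provider sets); in a real gateway this would be configuration attached to a virtual key rather than code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPolicy:
    allowed_providers: frozenset
    strategy: str
    redact_phi: bool = False


# Hypothetical policy table keyed by tenant tier.
POLICIES = {
    "enterprise-hipaa": TenantPolicy(frozenset({"bedrock"}), "quality-aware", redact_phi=True),
    "free": TenantPolicy(frozenset({"mistral", "bedrock"}), "cost-aware"),
    "internal": TenantPolicy(frozenset({"openai", "anthropic", "bedrock"}), "quality-aware"),
}


def candidates_for(tier: str, pool: list[str]):
    """Restrict the shared pool to the providers this tenant tier may use."""
    policy = POLICIES[tier]
    allowed = [p for p in pool if p in policy.allowed_providers]
    return allowed, policy
```

Changing a tenant's routing then means editing one entry in the table, not redeploying the application.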

These five problems aren’t theoretical: every named production AI gateway in 2026 (Future AGI Agent Command Center, Portkey, LiteLLM, Maxim Bifrost, Cloudflare AI Gateway, Databricks Unity) cites at least three of them on its product page.

The 5 Named LLM Routing Strategies

Five routing strategies have stabilised across production AI gateways in 2026. They appear under slightly different names in different vendor docs (Bifrost’s “Adaptive” is a composed strategy of Cost-Optimized plus Least-Latency plus Error-Penalty; Portkey’s conditional routing is a wrapper around the same five primitives with metadata operators) but the underlying primitives are stable.

Round-robin

The simplest strategy. The router cycles through the candidate pool one provider at a time, ignoring request shape, live state, and eval feedback. Round-robin works in two narrow cases: batch workloads with uniform request difficulty, and load distribution across identical mid-tier models where the differentiator is per-provider rate-limit headroom, not quality.

Round-robin fails when the pool contains tiers (a cheap model and a premium model in the same pool) or when the workload has variance. The customer-support team in the opening anecdote tripped over this: round-robin on a five-provider mixed-tier pool sent 20 percent of hard prompts to a cheap tier model that couldn’t handle them.
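The arithmetic behind that failure is easy to reproduce. The sketch below cycles through a five-entry pool (the provider labels are placeholders) and shows that each entry receives exactly one request in five, whatever the prompt difficulty.

```python
from itertools import cycle


def round_robin(pool):
    """Return an iterator that yields providers in a fixed repeating order."""
    return cycle(pool)


pool = ["openai", "anthropic", "bedrock", "vertex", "mistral-small"]
picker = round_robin(pool)

# Route 1,000 requests; the policy never looks at the request itself.
picks = [next(picker) for _ in range(1000)]
cheap_share = picks.count("mistral-small") / len(picks)
print(cheap_share)  # 0.2
```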

Weighted distribution

The router splits traffic by a configured percentage across the candidate pool. The canonical use case is canary rollouts: 95 percent on the current production model, 5 percent on the candidate, with a quality-floor evaluator gating the increase. The second canonical use case is A/B testing: 50-50 split on two models, both responses scored on the same held-out eval set so the team has statistical evidence before promoting.

Weighted is the simplest strategy that admits a feedback loop. The 5 percent isn’t static: when the candidate’s eval score crosses a threshold, the percentage updates automatically. The canary-to-full-promotion cycle drops from weeks (with hand-edited configs) to days (with the loop wired in).
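A minimal sketch of a weighted split with an eval-gated promotion step follows. The function names, the 0.9 threshold, and the 5-point step are illustrative assumptions, not values taken from any gateway.

```python
import random


def pick_weighted(weights: dict, rng: random.Random) -> str:
    """Choose a model with probability proportional to its configured weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def promote(weights: dict, candidate: str, eval_score: float,
            threshold: float = 0.9, step: float = 0.05) -> dict:
    """Shift `step` of traffic to the candidate once its eval score clears the threshold."""
    if eval_score < threshold:
        return dict(weights)  # gate closed: leave the split unchanged
    new_share = min(1.0, weights[candidate] + step)
    others = [n for n in weights if n != candidate]
    old_total = sum(weights[n] for n in others)
    # Rescale the other models so the weights still sum to 1.
    updated = {
        n: (1.0 - new_share) * weights[n] / old_total if old_total else 0.0
        for n in others
    }
    updated[candidate] = new_share
    return updated


weights = {"prod": 0.95, "canary": 0.05}
weights_after = promote(weights, "canary", eval_score=0.93)  # canary moves to ~0.10
```

Calling `promote` after each scoring batch is what turns a static canary percentage into a loop.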

Latency-aware

The router maintains a rolling window of P95 latency per provider (typically 30 to 300 seconds), keyed by the request shape, and picks the lowest-latency candidate above the cost and quality floors. The signal source is the gateway’s own outbound request telemetry.

Latency-aware routing has two failure modes. A stale window leaves a recovering provider downweighted after its latency has normalised; a cold-started provider has no data and either gets routed nothing or everything. The fix is a hybrid policy: weighted distribution for new providers, latency-aware once a confidence threshold is met.
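Both failure modes show up clearly in code. The sketch below keeps a time-bounded sample window per provider, computes a nearest-rank P95, and refuses to rank a provider until it has enough samples. The class and function names, the 60-second window, and the 20-sample threshold are assumptions for illustration.

```python
import time
from collections import deque


class LatencyWindow:
    """Rolling window of (timestamp, latency_ms) samples for one provider."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self.samples = deque()  # oldest sample first

    def record(self, latency_ms: float, now: float = None) -> None:
        self.samples.append((time.monotonic() if now is None else now, latency_ms))

    def p95(self, now: float = None):
        now = time.monotonic() if now is None else now
        # Expire old samples so a recovered provider is not penalised forever.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        if not self.samples:
            return None
        ordered = sorted(lat for _, lat in self.samples)
        rank = (95 * len(ordered) + 99) // 100  # nearest-rank: ceil(0.95 * n)
        return ordered[rank - 1]


def pick_lowest_p95(windows: dict, min_samples: int = 20, now: float = None):
    """Return the warm provider with the lowest P95, or None if none is warm."""
    warm = {}
    for name, window in windows.items():
        value = window.p95(now)  # also prunes expired samples
        if value is not None and len(window.samples) >= min_samples:
            warm[name] = value
    return min(warm, key=warm.get) if warm else None
```

When `pick_lowest_p95` returns `None`, the caller would fall back to weighted distribution, which is the hybrid policy described above.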

Race routing is the latency-extreme variant: fire the request at the two lowest-latency providers in parallel, return the first response, discard the second. Cost doubles on the routed request; the latency benefit is real for sub-800-millisecond voice agents and high-tier RAG.
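Race routing maps naturally onto `asyncio`. The sketch below fires the request at every supplied caller, returns the first successful response, and cancels whatever is still in flight. The `race` function and the caller signature are hypothetical, and cancelling a request does not guarantee the upstream stops billing for tokens it has already generated.

```python
import asyncio


async def race(request, callers: dict):
    """Send the request to every caller in parallel; return (name, response) of the first success."""
    tasks = {asyncio.create_task(fn(request)): name for name, fn in callers.items()}
    pending = set(tasks)
    try:
        while pending:
            done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                if task.exception() is None:  # skip providers that errored out
                    return tasks[task], task.result()
        raise RuntimeError("all raced providers failed")
    finally:
        for task in pending:  # discard the slower responses
            task.cancel()
```

A provider that fails fast does not win the race; the loop keeps waiting for the first provider that actually answers.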

Cost-aware

The router picks the cheapest provider that meets a quality floor. The cost signal is per-token from the candidate pool’s base configuration; the quality floor is an evaluation threshold attached to the virtual key or the workload type.

Cost-aware routing without a quality floor is the single most common production-grade mistake. Avengers Pro from Shanghai AI Lab, RouteLLM from LMSYS, and IBM Research all publish 70 to 85 percent cost savings under the routed setup, but each conditions the savings on a quality floor. The right pattern is to attach a held-out eval set, compute a per-provider quality score, and let the router pick the cheapest provider whose score is above the floor.
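The pattern reduces to a filter followed by a minimum. In the sketch below the `Candidate` type, the prices, and the quality scores are made-up illustration values; the fallback to the highest-quality candidate when nothing clears the floor is one possible design choice, not a documented gateway behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    usd_per_mtok: float  # illustrative price per million tokens
    quality: float       # score on a held-out eval set, 0..1


def cheapest_above_floor(pool: list, floor: float) -> Candidate:
    """Pick the cheapest candidate whose eval score meets the quality floor."""
    eligible = [c for c in pool if c.quality >= floor]
    if not eligible:
        # Nothing clears the floor: fail towards quality rather than cost.
        return max(pool, key=lambda c: c.quality)
    return min(eligible, key=lambda c: c.usd_per_mtok)


pool = [
    Candidate("premium", 30.0, 0.95),
    Candidate("mid", 3.0, 0.88),
    Candidate("cheap", 0.5, 0.70),
]
```

With a floor of 0.85 the router picks `mid`; with no floor (0.0) it picks `cheap`, which is the silent-degradation case.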

Quality-aware

The router scores candidates on a held-out evaluation set tied to the workload and routes to the highest-scoring candidate that meets the cost and latency floors. Scoring is per-task (chat, summarisation, code generation, function calling, RAG answer quality) so the same provider gets different scores on different workloads.

Quality-aware is the most powerful strategy and the most expensive to operate. It needs a labeled eval set per workload (the binding constraint), a continuous scoring pipeline (every routed response gets scored, the score writes back by span_id), and a policy that updates from the score. The self-improving loop is the 2026 frontier here: the router emits a span on every request, the evaluation sidecar scores the response, the score updates the candidate’s quality window, and the next request routes on the updated window. The whole loop closes in seconds.
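The write-back by span_id can be sketched as a small class. This is a simplified, greedy illustration: `QualityRouter`, the window size, and the neutral prior of 0.5 are assumptions, and the cost and latency floors are omitted to keep the loop visible.

```python
from collections import deque


class QualityRouter:
    """Route to the provider with the best rolling eval score; learn from write-backs."""

    def __init__(self, providers, window: int = 50, prior: float = 0.5):
        self.scores = {p: deque(maxlen=window) for p in providers}
        self.pending = {}   # span_id -> provider that served the request
        self.prior = prior  # score assumed for a provider with no data yet

    def mean(self, provider: str) -> float:
        window = self.scores[provider]
        return sum(window) / len(window) if window else self.prior

    def route(self, span_id: str) -> str:
        provider = max(self.scores, key=self.mean)  # ties go to the first provider
        self.pending[span_id] = provider
        return provider

    def record_eval(self, span_id: str, score: float) -> None:
        """Called by the evaluation sidecar once the response has been scored."""
        provider = self.pending.pop(span_id, None)
        if provider is not None:
            self.scores[provider].append(score)
```

A low score recorded for one span changes where the very next request goes, which is the sense in which the loop closes in seconds.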

Implementation Patterns

LLM routing ships in three implementation patterns in 2026. The choice affects where the router runs, who operates it, and how the eval feedback loop closes.

1. Dedicated gateway service (the default)

The router runs inside a standalone AI gateway process or container in front of the upstream providers. Every application calls the gateway over HTTP on an OpenAI-compatible endpoint; the gateway runs the router, the cache, the guardrails, and the cost meter at the same network hop. This is the default in 2026 and the recommended starting point for teams with more than one application or tenant sharing the routing pool. Future AGI Agent Command Center, Portkey, Bifrost, Cloudflare AI Gateway, and Databricks Unity all ship in this shape; the Truefoundry control-plane-plus-gateway-plane reference architecture is the canonical example.

2. Sidecar deployment

The router runs as a sidecar container inside each application pod. The application calls localhost rather than the gateway endpoint; the sidecar handles the routing. This is the right pattern when per-pod isolation is required (NYDFS 500.11 segmentation, regulated multi-tenant SaaS) or when latency from a centralised gateway is unacceptable. The trade-off is per-pod state: the rolling latency window, eval feedback, and cost meter are local to the pod unless the sidecars share a Redis or a coordination service.

3. Library SDK (in-process)

The router runs in-process as a library inside the application. LiteLLM is the most common implementation (the package installed with pip install litellm exposes the same routing logic in-process that the standalone proxy uses). The application keeps the network-hop overhead at zero and gives up the centralised control plane. This is the right starting point for single-application, single-tenant, low-volume routing; promote to a network-hop gateway once the routing pool exceeds two providers, the application count exceeds two, or any regulatory pressure on the LLM path arrives.

Composing strategies inside a pattern

In all three patterns, the production router rarely runs a single strategy. The canonical 2026 compositions are:

  • Quality-aware on top, cost-aware in the middle, round-robin on the bottom. Score candidates; pick the cheapest above the quality floor; tie-break with round-robin.
  • Weighted canary for new models, latency-aware for the production pool. Canary candidates get 5 percent weighted traffic; the remaining 95 percent routes on rolling P95 latency.
  • Race for the latency-critical path, latency-aware for the rest. Voice agents and sub-second RAG use race; everything else uses latency-aware with a 60-second rolling window.

The composition is the policy. Teams that ship “we use quality-aware routing” without naming the composition rarely ship a working router; teams that ship “we run quality-aware on top of cost-aware with a weighted 5 percent canary on claude-3-opus-2026-04” rarely have routing incidents.
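The first composition in the list above (quality floor, then cheapest, then round-robin as the tie-break) can be written as one small function. The `make_composed_router` name and the pool layout are hypothetical, and the numbers are placeholders.

```python
from itertools import count


def make_composed_router(pool: list, quality_floor: float):
    """Build a picker: filter by quality, keep the cheapest, round-robin among ties."""
    counter = count()

    def pick():
        eligible = [c for c in pool if c["quality"] >= quality_floor]
        if not eligible:
            return None  # caller decides what to do when nothing clears the floor
        lowest = min(c["cost"] for c in eligible)
        tied = sorted(c["name"] for c in eligible if c["cost"] == lowest)
        return tied[next(counter) % len(tied)]

    return pick


pool = [
    {"name": "a", "quality": 0.90, "cost": 1.0},
    {"name": "b", "quality": 0.92, "cost": 1.0},
    {"name": "c", "quality": 0.60, "cost": 0.2},
    {"name": "d", "quality": 0.95, "cost": 5.0},
]
pick = make_composed_router(pool, quality_floor=0.85)
first_three = [pick(), pick(), pick()]  # alternates between the two tied providers
```

Because `pick` reads the pool on every call, a quality score updated by the eval loop takes effect on the next request.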

Buyer’s Guide: When to Adopt LLM Routing

LLM routing has a clear adopt-now / wait signal in 2026. Use the following two checklists to self-diagnose.

Adopt LLM routing today if

  • You depend on more than one LLM provider in the request path, for cost shopping, fallback, or quality routing. Two providers is the threshold; three is the production median.
  • Your monthly LLM spend has crossed $5,000 or your cost-per-request variance across providers is more than 3x. Below that threshold, operational overhead rarely pays back; above it, cost-aware routing pays back the engineering investment in weeks.
  • You operate in a regulated environment (HIPAA, NYDFS, PCI-DSS, EU AI Act Annex III, DORA) that demands centralised audit logging, PHI redaction, and per-request span attributes. The router is the natural enforcement point for OWASP LLM Top 10 2025 controls.
  • You run more than one tenant or product whose costs need separate attribution. Per-virtual-key budgets and per-tenant policies belong in the gateway, not in business logic.
  • You ship agent workloads with tool calls and MCP traffic. Agent inner loops are the most cost-sensitive routing surface in 2026: a single hallucinated retry can 10x the token spend, and quality-aware routing with eval feedback is the only stable answer.
  • You have a finance team asking why the LLM line item is a single opaque number. OpenTelemetry GenAI-native cost metrics per provider, per tenant, per model are the deliverable; the router is the source.

Wait if

  • You have one provider, one product, one tenant. A router has no work to do until you have more than one of any of those three.
  • Your latency budget can’t absorb the 10 to 50 millisecond router overhead. Sub-200-millisecond voice agents on a single regional provider sometimes have to stay single-sourced; the race pattern recovers the benefit if the budget is tight but not zero.
  • You have no observability surface yet: no OpenTelemetry collector, no Grafana, no Prometheus. Build observability first, promote to a router second.
  • You have no team to operate it. A self-hosted router on a single-product team often degrades into unowned shadow infrastructure. Either commit a named owner or start with a managed gateway (Cloudflare AI Gateway, Databricks Unity) and promote to self-hosted later.

Most teams that ship LLM-dependent features for more than six months end up in the adopt-now column on at least three signals; the wait-if column is where teams correctly recognise that the routing layer is premature for their stage.

Common LLM Routing Myths

Five myths show up repeatedly in 2026 conversations about LLM routing.

Myth 1: “Routing is the same as load balancing.” Load balancing distributes identical traffic across identical backends on health signals. LLM routing distributes non-identical traffic across non-identical backends on a policy that reads the request, the live provider state, and the user context. The analogy that gets people in trouble is the one that treats gpt-4-turbo and claude-3-haiku as interchangeable backends.

Myth 2: “Cost-aware routing always saves money.” Cost-aware routing without a quality floor degrades output silently because the cheapest model isn’t the right model on hard prompts. The published 70 to 85 percent savings from RouteLLM, Avengers Pro, and IBM Research all condition the savings on a quality floor; teams that skip the conditioning consistently report regressions.

Myth 3: “A router is a router; one is as good as another.” The named strategies and their composition rules are the differentiator, not the existence of routing. A gateway that ships only round-robin and a gateway that ships composable quality-aware routing with eval feedback are two different procurement decisions.

Myth 4: “Quality-aware routing needs human labels in the loop.” The dominant 2026 pattern is automated evaluation (LLM-as-judge, programmatic checks, golden-dataset scoring) running in a sidecar that scores every routed response and writes the score back to the router by span_id. Human labels are useful for eval set construction and periodic audit; the routing loop itself doesn’t need a human in line.

Myth 5: “If I have a gateway, I have routing solved.” The gateway is the network hop. The router is one block inside it. A gateway with five providers configured and a single round-robin policy is a router in name only; a gateway with composable strategies, live signal inputs, and the eval feedback loop wired in is the production shape. The difference shows up in the post-mortem after the first cost spike or quality regression.

The 2026 LLM Routing Landscape Snapshot

Seven AI gateways ship production-grade LLM routing in 2026. The Future AGI Gateway Scorecard listicle ranks them across seven dimensions; the summary below is the routing-strategy slice.

  • Future AGI Agent Command Center ships 15 routing and reliability strategies (including round-robin, weighted, least-latency, cost-optimized, adaptive, race, failover, retries with backoff, circuit breaking, model fallbacks, complexity-based routing, and provider lock) in one Apache 2.0 Go binary. It is an OpenAI-compatible drop-in across 20+ providers via six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure) plus OpenAI-compatible presets and self-hosted backends. The inline guardrail layer is the Future AGI Protect model family: FAGI’s own fine-tuned Gemma 3n adapters for content moderation, bias detection, security/prompt-injection, and data privacy/PII, at 65 ms text / 107 ms image median time-to-label per arXiv 2510.13351 (a model family, not a plugin chain). Tracing is traceAI, which is OpenInference-native and covers 50+ AI surfaces across Python, TypeScript, Java, and C# (including Spring Boot starter, Spring AI, LangChain4j, and Semantic Kernel). Error Feed, the clustering and what-to-fix layer of the eval stack that feeds the self-improving evaluators, auto-clusters related routing failures into named issues with zero config. The binary also ships OpenTelemetry GenAI-native cost metrics and MCP plus A2A.
  • Portkey ships conditional metadata routing with nine operators over a 4-tier budget hierarchy. Source-available core; commercial control plane consolidating under Palo Alto Networks following the April 30, 2026 acquisition close.
  • Maxim Bifrost ships a documented 4-factor Adaptive score (Error Penalty 50 percent, Latency 20 percent, Utilization 5 percent, Momentum) plus Race plus round-robin. Apache 2.0 Go binary; vendor-published overhead in the 11 microsecond range at 5,000 RPS.
  • LiteLLM ships simple-shuffle, least-busy, latency-based, cost-based, and rate-limit-aware variants. MIT; pin commit hashes after the March 24, 2026 PyPI compromise of 1.82.7 and 1.82.8.
  • OpenRouter is the provider-directory routing surface across 400+ models with one billing point; 5.5 percent platform fee. Strong for experimentation; weaker on guardrails and per-tenant policy.
  • Cloudflare AI Gateway ships edge routing tight to the Cloudflare stack; Universal Endpoint deprecated in 2026, use Dynamic Routing.
  • Databricks Unity AI Gateway ships routing as a first-class feature for MCP-routed agents inside the Databricks lakehouse.

The 2026 trust cohort matters in procurement: Helicone joined Mintlify on March 3, 2026 and is in maintenance mode; Portkey announced its Palo Alto Networks acquisition on April 30, 2026; LiteLLM’s PyPI compromise reshaped supply-chain hygiene expectations. Apache 2.0 single-binary alternatives with no pending acquisition collapse most of these procurement questions.

How Future AGI Thinks About LLM Routing

LLM routing is one half of the production AI loop. The other half is the evaluation layer that uses routing traces plus eval scores to update the routing policy itself. Future AGI ships both layers as one runtime so the router gets better at its job over time rather than just better at logging.

The shape of the loop:

  1. Every routed request emits an OpenTelemetry GenAI span via traceAI (Apache 2.0), carrying the routing-strategy attribute, the candidate pool, the per-candidate score, and the upstream provider that served the response.
  2. The ai-evaluation pipeline (Apache 2.0) scores the response against a held-out eval set tied to the workload (tool-use accuracy, code correctness, task completion, summarisation faithfulness, RAG groundedness) and writes the score back to the span by span_id.
  3. The agent-opt optimiser (Apache 2.0; ProTeGi, Bayesian, GEPA) reads the scored spans and emits a policy diff with math attached, of the shape: “for finance-squad, route turns under 8K input tokens to claude-3-haiku between 9am and 5pm; regression 0.4%; monthly saving $3,840.”
  4. Inline Protect guardrails, at 65 ms text / 107 ms image median time-to-label per arXiv 2510.13351, sit on the routed path so the guardrail check doesn’t eat the latency budget.
  5. The whole loop ships inside Agent Command Center: Apache 2.0 single Go binary, OpenAI-compatible drop-in across 20+ providers via six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure) plus OpenAI-compatible presets and self-hosted backends, 18+ built-in guardrail scanners, exact plus semantic caching, per-virtual-key budgets, MCP plus A2A, OpenTelemetry GenAI-native traces. Self-hostable in Docker, Kubernetes, AWS, GCP, Azure, or air-gapped.

A static router is a configuration file. A self-improving router is a control plane that responds to model drift, workload drift, and cost drift without a human in the configuration loop. The eval signal is the binding constraint; the runtime is where the loop closes.

Try Agent Command Center free. OpenAI-compatible routing across 20+ providers via six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure) plus OpenAI-compatible presets and self-hosted backends with composable strategies, 18+ PII and PHI guardrails, per-key budgets, MCP plus A2A, and OpenTelemetry GenAI-native traces in one Apache 2.0 Go binary at gateway.futureagi.com/v1.

Frequently asked questions

What Is LLM Routing in Simple Terms?
LLM routing is the layer inside an AI gateway that decides which upstream model serves each request. The client calls one OpenAI-compatible endpoint; the router reads the request, the live provider state, and the user context, then forwards the call to OpenAI, Anthropic, Bedrock, Vertex AI, or a self-hosted model. The selection is recorded on the OpenTelemetry GenAI span. Without a router, every application is permanently single-sourced to whichever provider key the developer pasted in first.
What Are the Five Named LLM Routing Strategies in 2026?
Round-robin distributes requests evenly across the pool. Weighted distribution splits traffic by a configured percentage (95 to 5 for canary rollouts, 50 to 50 for A/B tests). Latency-aware picks the provider with the lowest rolling P95 latency from a live signal window. Cost-aware picks the cheapest provider that meets a quality floor (the floor is the part the cost-only routers get wrong). Quality-aware scores candidates on a held-out evaluation set and routes to the highest scorer. In the self-improving loop, eval feedback updates the router policy on the next request rather than waiting for a hand-edited config.
How Is LLM Routing Different From a Load Balancer?
A Layer 4 or Layer 7 load balancer distributes identical traffic across identical backends on health signals. An LLM router distributes non-identical traffic across non-identical backends: GPT-4, Claude 3 Opus, and Llama 3 produce different outputs at different cost and latency profiles, and the router has to read the request shape and the user context to pick the right one. The router is closer to a content-based switch than a round-robin balancer.
Does LLM Routing Save Money?
Yes, when paired with a quality floor evaluator. Cost-aware routing picks the cheapest provider that meets the quality threshold; research from Shanghai AI Lab Avengers Pro, LMSYS RouteLLM, and IBM Research shows up to 85 percent savings while preserving GPT-4-level quality. Without a quality floor, cost routing degrades output silently because the cheapest model is not always the right model. The self-improving loop closes this by tying eval scores back to the router policy via `span_id`.
Is LLM Routing the Same as an AI Gateway?
No. The router is one component inside an AI gateway. A full AI gateway also runs authentication, input and output guardrails, exact and semantic caching, per-virtual-key budgets, OpenTelemetry GenAI telemetry, and MCP plus A2A protocol handling. A gateway that only routes (OpenRouter, MLflow router) is a router; a gateway with the full control plane is the production category.
How Does Quality-Aware Routing Work?
Quality-aware routing scores upstream models on a held-out evaluation set tied to your workload. Production gateways link the gateway trace to the evaluation result via `span_id`; on a low score the router downweights that provider for similar future requests. The self-improving loop runs continuously: every routed request emits a span, evals score the response, and the router policy updates from the feedback. The eval set is the binding constraint.
Can I Build LLM Routing in My Own Application?
Yes, and many teams do at the start. The library SDK pattern (LiteLLM as a Python import, OpenAI SDK plus a custom wrapper) puts routing logic in-process. The trade-off is that every application instance has to learn the same routes, telemetry is per-process rather than per-request hop, and you cannot enforce centralised guardrails or budgets. Promote to a network-hop gateway once you have more than one application sharing the route table or any regulatory pressure on the LLM path.
What Are the OpenTelemetry GenAI Span Attributes for Routing?
The [OpenTelemetry GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) define `gen_ai.system` (renamed `gen_ai.provider.name` in recent versions of the conventions), `gen_ai.request.model`, `gen_ai.response.model`, `gen_ai.usage.input_tokens`, and `gen_ai.usage.output_tokens`. The conventions standardise token-usage and operation-duration metrics; cost is typically derived from the token counts and a price table. Production gateways add custom attributes for the routing decision: the strategy used, the candidate pool, the per-candidate score, and whether the selection was a fallback after a 429 or 5xx.
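As an illustration, the attribute set for one routed request might be assembled like this. The `gen_ai.*` keys follow the conventions named in this answer; the `routing.*` keys and the function name are hypothetical, gateway-specific choices and are not part of any standard.

```python
def routing_span_attributes(provider, requested_model, served_model,
                            input_tokens, output_tokens,
                            strategy, candidates, fallback_reason=None):
    """Build a flat attribute dict for one routed request."""
    attrs = {
        # Names from the OpenTelemetry GenAI semantic conventions.
        "gen_ai.system": provider,
        "gen_ai.request.model": requested_model,
        "gen_ai.response.model": served_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical custom names for the routing decision.
        "routing.strategy": strategy,
        "routing.candidates": tuple(candidates),
        "routing.fallback": fallback_reason is not None,
    }
    if fallback_reason is not None:
        attrs["routing.fallback_reason"] = fallback_reason
    return attrs


attrs = routing_span_attributes(
    "anthropic", "gpt-4o", "claude-3-5-sonnet", 812, 164,
    strategy="cost-aware", candidates=["openai", "anthropic"],
    fallback_reason="upstream_5xx",
)
```

With the OpenTelemetry Python API, a dictionary like this could be passed to `span.set_attributes(...)` on the active span.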
When Should I Not Use LLM Routing?
Skip a router when you have one provider, one product, one tenant, no regulatory pressure on the LLM path, and a latency budget that cannot absorb the 10 to 50 millisecond router overhead. For most teams in 2026, that combination disappears within the first six months of production traffic. Voice agents on sub-800-millisecond budgets sometimes stay single-sourced; even there, the race pattern (parallel requests to two providers, first to respond wins) recovers the routing benefit.
What Are the Failure Modes of LLM Routing?
Five failure modes show up in production. Cost-only routing without a quality floor silently degrades output. Latency-only routing on a stale window flaps when the upstream recovers but the score has not caught up. Round-robin on a mixed-tier pool sends the cheapest-tier model the same share of traffic as every other provider, hard prompts included. Quality-aware routing on a stale eval set drifts. Sticky session routing on a hashed user ID concentrates a noisy tenant on one provider during a rate-limit event. The fix is the same in each case: tie the router to a live eval signal, not a static rule.
Is LLM Routing Open Source?
Yes. Future AGI Agent Command Center ships Apache 2.0 as a single Go binary. LiteLLM ships MIT (pin commit hashes after the March 2026 PyPI compromise of `1.82.7` and `1.82.8`). Portkey's gateway core is MIT. Maxim Bifrost is Apache 2.0 in Go. License clarity matters for procurement because three of the largest paid alternatives are in transition in 2026 (Helicone, Portkey, Keywords AI rebranded to Respan).
How Do I Evaluate LLM Routing Vendors?
Score them on the seven dimensions in the [Future AGI Gateway Scorecard](https://futureagi.com/blog/best-ai-gateways-model-routing/): provider breadth, routing strategy depth, latency overhead, guardrail depth, observability, deployment flexibility, and TCO including acquisition independence. For a routing procurement specifically, routing-strategy depth and observability carry the heaviest weight.
What Is the Self-Improving Loop in LLM Routing?
The self-improving loop ties the routing decision to a held-out evaluation score. Every routed request emits a span; the evaluation pipeline scores the response on the workload's quality dimensions; the score writes back to the candidate's rolling quality window by `span_id`; the next request routes on the updated window. The loop closes in seconds rather than days. Future AGI ships the loop as `traceAI` plus `ai-evaluation` plus `agent-opt` (all Apache 2.0) inside Agent Command Center.
Can LLM Routing Handle MCP and Agent Traffic?
Yes. Production AI gateways in 2026 route MCP tool calls and agent-to-agent (A2A) traffic the same way they route chat completions; the routing key is the model or the agent identity rather than the request shape. MCP routing has a security dimension after the April 2026 Anthropic MCP STDIO command-injection class of vulnerabilities; production gateways enforce a tool-allowlist filter, OAuth 2.1 transport, and Streamable HTTP. Future AGI Agent Command Center ships a dedicated MCP Security scanner among the 18+ built-in guardrails.
What Is the 2026 LLM Routing Landscape?
Seven gateways are the production-relevant set: Future AGI Agent Command Center (Apache 2.0, 15 strategies), Portkey (Palo Alto Networks acquisition, conditional metadata routing), Maxim Bifrost (Apache 2.0 Go, 4-factor adaptive), LiteLLM (MIT, post-March 2026 PyPI compromise), OpenRouter (provider directory, 5.5% fee), Cloudflare AI Gateway (edge, Dynamic Routing), and Databricks Unity (lakehouse-native MCP routing).