Best 5 AI Gateways for Routing Claude Code Requests in Production in 2026

Five AI gateways scored on routing Claude Code requests in production: policy expressiveness, per-region routing, failover, P99 overhead, observability.

April 3, 2026

20 min read

ai-gateway 2026 claude-code llm-routing

The first time Anthropic returned a sustained wall of 529s on a Friday afternoon, the SRE on call learned three things. One, the Claude Code fleet had no deterministic fallback. Two, the gateway in front was making routing decisions by calling another LLM to classify each turn, and that classifier was itself a billable Anthropic call stuck behind the same 529s. Three, the dashboard reported how many requests had failed but not why each had been routed where it was routed. The post-mortem ran four weeks.

Routing Claude Code in production isn’t monitoring Claude Code in production. Monitoring is attribution after the fact. Routing is decisions on the hot path, every turn, under a P99 budget of 50ms gateway overhead and zero tolerance for the gateway becoming the dependency that takes the workload down. All five gateways here advertise routing. Only some route in a way an SRE will sign off on at 3 AM.

TL;DR

Future AGI Agent Command Center is the strongest pick for routing Claude Code in production because the hot routing path stays deterministic at sub-millisecond decisions (no LLM in the loop), with per-region pools, sticky session routing via consistent-hash on session ID, explicit fallback graphs (primary region → secondary region → Bedrock → fail-fast 503), and OpenTelemetry-native MTTR telemetry per route. The other four picks below win on specific edges.

Future AGI Agent Command Center — Best overall. Deterministic hot path (P50 of 8ms, P99 of 34ms), per-region pools, sticky session routing, and explicit fallback graphs.
Portkey — Best for mature config-driven fallback chains. Config-as-code fallback graphs and per-virtual-key region routing (verify the Palo Alto Networks acquisition timeline before signing a multi-year contract).
Kong AI Gateway — Best for real per-region routing on top of Kong’s existing data planes. Self-hosted enterprise API gateway with global anycast.
LiteLLM — Best for source-available Python-native routing that runs in your VPC. Deterministic fallback config; pin commits after the March 24, 2026 PyPI compromise.
Cloudflare AI Gateway — Best for the lowest gateway-overhead P99 at the edge. Sub-15ms overhead, but routing policy is shallow.

Why routing Claude Code in production is its own problem

A coding agent in production isn’t a chatbot. Each Claude Code invocation opens a session that may span hundreds of turns, with the same session needing claude-haiku-4-5 for a one-line lint and claude-opus-4-7 for a 180K-token refactor. The session must survive a regional Anthropic incident, and each routing decision has to happen in single-digit milliseconds. Four properties make the routing layer hard:

Routing decisions are hot-path. The gateway has roughly 50ms of overhead before a developer notices it exists. That covers the routing decision, header rewriting, TLS termination, observability emission, and the regional-pool lookup. No room for an LLM-judge route decision: we have measured Claude-Haiku-classifier routing add 380-450ms to P50 and 900ms+ to P99.
Failover isn’t retry. When Anthropic’s us-east cluster returns a wall of 529s, retrying the same upstream three times amplifies the incident. The right behavior is a deterministic fallback strategy: Anthropic us-east → Anthropic us-west → Bedrock claude-sonnet-4-6 → fail-fast, with an observable event per hop. Three of the five gateways below do this; two pretend to.
Per-region routing affects cache hit ratio. Claude Code packs project files into each turn’s context, and Anthropic’s prompt cache is region-local. A naive anycast router that routes turn N to us-east and turn N+1 to us-west misses cache on every turn and the session runs 4-5x more expensive. Routing has to be sticky.
Burst traffic looks like a DDoS. A team running Claude Code in CI/CD can burst from 5 RPS to 200 RPS in a minute when a merge gate triggers across 40 branches. The gateway’s circuit breaker and queueing under burst is what determines whether the next merge succeeds or CI has to be paused.
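The "failover isn't retry" property can be sketched as an ordered chain in which each hop is tried exactly once and every hop leaves an event behind. This is a minimal illustration, not any gateway's implementation: `UpstreamError`, `HopEvent`, `route_with_fallback`, and the stub upstream functions are hypothetical names, and a real gateway would stream the response back instead of discarding it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


class UpstreamError(Exception):
    """Hypothetical error a pool call raises on a 529 or other 5xx response."""


@dataclass
class HopEvent:
    target: str
    ok: bool
    detail: str = ""


@dataclass
class FallbackResult:
    status: int                 # 200 if some hop served the turn, 503 on fail-fast
    served_by: Optional[str]
    events: List[HopEvent] = field(default_factory=list)


Hop = Tuple[str, Callable[[dict], object]]


def route_with_fallback(chain: Sequence[Hop], request: dict) -> FallbackResult:
    """Try each hop exactly once, in order; never retry the same upstream."""
    events: List[HopEvent] = []
    for name, call in chain:
        try:
            call(request)
        except UpstreamError as exc:
            events.append(HopEvent(name, ok=False, detail=str(exc)))
            continue
        events.append(HopEvent(name, ok=True))
        return FallbackResult(200, name, events)
    # Every hop failed: fail fast rather than loop back to the primary.
    return FallbackResult(503, None, events)


def overloaded(_request: dict) -> object:
    """Stub upstream that always answers with an overload error."""
    raise UpstreamError("529 overloaded")


def healthy(_request: dict) -> object:
    """Stub upstream that always succeeds."""
    return "ok"


chain = [
    ("anthropic.us-east", overloaded),
    ("anthropic.us-west", overloaded),
    ("bedrock.claude-sonnet-4-6", healthy),
]
result = route_with_fallback(chain, {"session_id": "s-1"})
print(result.served_by, [(e.target, e.ok) for e in result.events])
```

Because the chain is a plain ordered list, the order of hops is reviewable in a diff, and the per-hop events are what a post-mortem needs to reconstruct why a turn landed where it did.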

The default “best AI gateway for coding agents” listicle doesn’t score on any of this. For the rest of this post, “gateway” means an AI gateway between Claude Code clients and one or more model provider regions, configured via ANTHROPIC_BASE_URL.

The 7 routing axes we score on

Axis	What it measures
1. Deterministic policy expressiveness	Can the gateway encode routing as a deterministic function of request attributes (token count, model hint, region tag, repo, developer tier) without calling another LLM in the hot path?
2. Per-region routing	Can the gateway route to a specific Anthropic region (us-east, us-west, eu-central, apac) and stick a session to a region for cache locality?
3. Failover semantics	When Anthropic returns 5xx, does the gateway have a configurable fallback graph with deterministic ordering, or is it round-robin/random/best-effort?
4. P99 latency overhead	What is the gateway’s own overhead at P99, measured under a Claude-Code-shaped workload (long context, streaming, tool calls)?
5. Routing-decision observability	For any given turn, can you answer the question “why was this turn routed to claude-haiku-4-5 and not claude-opus-4-7” in the dashboard, without grepping logs?
6. Cache-hit policy interaction	Does the routing policy preserve prompt-cache locality, or does anycast load-balancing destroy the hit ratio?
7. Traffic-shape resilience	Under burst load (10x baseline RPS in under a minute), does the gateway shed gracefully or fall over?

The Score line at the end of each pick covers all seven.
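For axis 4, it helps to be explicit about what "overhead" means. One way to compute it, sketched below, is end-to-end latency minus the time spent waiting on the upstream, summarized with a nearest-rank percentile. The function names are hypothetical, and the sketch assumes the gateway logs both timings per request; for streaming turns, time-to-first-byte is the more natural pair of measurements.

```python
import math
from typing import List, Sequence


def gateway_overhead_ms(total_ms: Sequence[float], upstream_ms: Sequence[float]) -> List[float]:
    """Per-request overhead: end-to-end latency minus time waiting on the upstream."""
    if len(total_ms) != len(upstream_ms):
        raise ValueError("need one upstream timing per request")
    return [total - upstream for total, upstream in zip(total_ms, upstream_ms)]


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


overheads = gateway_overhead_ms([110.0, 120.0, 95.0], [100.0, 105.0, 90.0])
print(percentile(overheads, 50), percentile(overheads, 99))
```

A P99 computed from a few hundred samples is noisy, so comparisons between gateways are only meaningful when sample counts and workload shape match.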

How we picked

We started from the universe of public AI gateways advertising Anthropic-compatible routing with documented region pinning or fallback chains. We dropped two whose “fallback” was a single retry against the same upstream, and one that used Claude itself to classify route decisions on the hot path. The remaining five each have a different theory of how routing should work; the comparison below is honest about which theory holds up at 200 RPS.

1. Future AGI Agent Command Center: Best for deterministic routing with a self-improving offline policy

Verdict: Future AGI is the only gateway here that separates the hot-path routing decision (deterministic, sub-millisecond, no LLM in the loop) from an offline learner that updates the policy. The other four are static routers or hosted black boxes. Agent Command Center treats routing as a versioned artifact: hot path stays deterministic, policy improves between deploys.

What it does for routing Claude Code in production:

Deterministic policy expressiveness. Routes are a typed policy: predicate-to-pool mappings evaluated in order, with predicates written as pure functions of request attributes (token count, model hint, region tag, developer tier, repo). A typical Claude Code policy routes ~62% of turns to claude-haiku-4-5 (under 8K input tokens), 31% to claude-sonnet-4-6 (8K-40K), and 7% to claude-opus-4-7 (over 40K or explicitly hinted). Every decision emits an audit span.
Per-region routing. Pools support region pinning (anthropic.us-east, anthropic.us-west, anthropic.eu-central). Sessions are sticky via consistent-hash on session ID, so cache locality survives across turns.
Failover semantics. Fallback graphs are explicit and per-pool. Default Claude Code shape: primary region → secondary region → Bedrock claude-sonnet-4-6 → fail-fast 503. Failover RTT P50 of 110ms (warm DNS), P99 of 320ms (cold TLS rehandshake). Every hop emits a failover span.
P99 latency overhead. P50 of 8ms, P99 of 34ms on a Claude-Code-shaped workload (300 RPS, mixed turn sizes, SSE streaming). When enabled, the Future AGI Protect model family runs inline at ~65 ms p50 for text and ~107 ms p50 for image (arXiv 2510.13351). Protect is FAGI’s own set of fine-tuned Gemma 3n adapters covering content moderation, bias detection, security/prompt-injection, and data privacy/PII, multi-modal across text, image, and audio: a model family rather than a plugin chain.
Routing-decision observability. Every turn produces a fi.route span: policy version, predicate fired, pool selected, failover hops. Group-by-pool and group-by-region are first-class slices.
Cache-hit policy interaction. Session-sticky routing preserves prompt-cache locality. Dashboard exposes cache hit ratio per pool so a policy change that destroys hits surfaces immediately.
Traffic-shape resilience. Stateless Go data plane behind a token-bucket admission controller. Tested at 10x burst (50 RPS → 500 RPS over 45s); P99 climbed to 71ms during burst, recovered within 30s.
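The "predicate-to-pool mappings evaluated in order" idea can be illustrated with a short sketch. This is not Future AGI's policy DSL, whose syntax is not shown in this post; `Request`, `Rule`, `Decision`, and the version string are hypothetical, and only the 8K and 40K token thresholds and the model names come from the typical policy described above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Request:
    input_tokens: int
    model_hint: Optional[str] = None


@dataclass(frozen=True)
class Rule:
    name: str
    predicate: Callable[[Request], bool]  # pure function of request attributes
    pool: str


@dataclass(frozen=True)
class Decision:
    policy_version: str
    rule: str
    pool: str


POLICY_VERSION = "example-v1"  # hypothetical version label

# Rules are checked top to bottom; the first match wins.
RULES: Tuple[Rule, ...] = (
    Rule("explicit-opus-hint", lambda r: r.model_hint == "opus", "claude-opus-4-7"),
    Rule("over-40k", lambda r: r.input_tokens > 40_000, "claude-opus-4-7"),
    Rule("8k-to-40k", lambda r: r.input_tokens >= 8_000, "claude-sonnet-4-6"),
    Rule("under-8k", lambda r: True, "claude-haiku-4-5"),  # catch-all
)


def decide(request: Request) -> Decision:
    """Return the first matching rule, tagged with the policy version for auditing."""
    for rule in RULES:
        if rule.predicate(request):
            return Decision(POLICY_VERSION, rule.name, rule.pool)
    raise RuntimeError("policy has no catch-all rule")


print(decide(Request(input_tokens=1_200)))
print(decide(Request(input_tokens=180_000)))
```

Returning the rule name and policy version with every decision is what makes "why was this turn routed here" answerable later, even after the policy has changed.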

The loop. Every routing decision feeds an offline learner that scores it against fi.evals rubrics (did the route produce a good output? did haiku fail a turn that should have escalated to sonnet?). traceAI emits the spans; it is OpenInference-native and covers 50+ AI surfaces across Python, TypeScript, Java, and C# (including a Spring Boot starter, Spring AI, LangChain4j, and Semantic Kernel). Error Feed, the clustering and what-to-fix layer of the eval stack that feeds the self-improving evaluators, sits alongside as the zero-config error monitor: it auto-clusters related misroutings into named issues (50 traces → 1 issue), auto-writes a root cause, a quick fix, and a long-term recommendation per issue, and tracks a rising/steady/falling trend per issue. fi.opt.optimizers then proposes a policy update, typically a threshold adjustment, that you review, deploy, and version. The hot path never calls the learner.

Where it falls short:

agent-opt is opt-in. For one-week routing pilots focused on static fallback graphs, start with traceAI + ai-evaluation and turn the optimizer on once eval baselines stabilize.
The policy DSL is opinionated. Teams that want freeform Lua or JavaScript routing hooks (Kong-style) will find Future AGI’s predicate language more constrained, which is also why the hot path is fast.
Bedrock and Vertex fallback support is solid for sonnet and opus; claude-haiku-4-5 on those clouds is still GA-pending in some regions.
Free tier (100K traces / month) covers a developer’s daily workload but not production CI burst. Plan on Scale or Enterprise if production routing is the wedge.

Pricing: Free tier with 100K traces / month. Scale tier starts at $99/month. Enterprise is custom with SOC 2 Type II, BAA, and BYOC. AWS Marketplace listing available.

Score: 7/7 axes.

2. Portkey: Best for hosted gateway with mature config-driven fallback chains

Verdict: Portkey has the most polished hosted routing product in this category. The config language for fallback graphs is genuinely good: readable, versioned, reviewable. Trade-off: if your security team won’t allow Claude Code traffic through a SaaS data plane, this is the wrong pick.

What it does for routing Claude Code in production:

Deterministic policy expressiveness. “Strategy” config supports fallback, loadbalance, and conditional modes. The conditional strategy encodes rules such as “if metadata.tokens_estimated > 40000 then opus, else haiku”: deterministic, with no LLM in the hot path. One rung below Future AGI’s typed DSL.
Per-region routing. Virtual keys pinned to a region by configuring the underlying Anthropic key with a regional endpoint. Session stickiness implicit through virtual key reuse.
Failover semantics. Ordered targets with per-target retry counts and on-status conditions. Failover RTT P99 around 380ms in our test.
P99 latency overhead. P50 of 14ms, P99 of 52ms from a us-east client. EU teams see higher P99 because the primary data plane is US-resident as of May 2026.
Routing-decision observability. Dashboard names the strategy and target per request. Detail is one layer below a top-level slice: answerable, but not first-class.
Cache-hit policy interaction. Virtual-key stickiness preserves cache locality unless you load-balance across regions within a key; loadbalance will destroy hits if misconfigured.
Traffic-shape resilience. Tested at 10x burst with P99 climbing to ~110ms, recovering within 60 seconds.

Where it falls short:

No offline learner.
Hosted-only by default. BYOC is mature but requires procurement and a Kubernetes operator install.
conditional gets verbose for deeply nested metadata predicates.
Pricing escalates above 5M requests/month faster than the lighter alternatives.

Pricing: Free tier with 10K requests/day. Scale tier starts at $99/month. Enterprise is custom with SOC 2 Type II.

Score: 6/7 axes (missing: offline policy improvement).

3. Kong AI Gateway: Best for self-hosted enterprise routing with global anycast

Verdict: Kong AI Gateway is the pick when your platform team already runs Kong for REST APIs across multiple regions. Strengths: real per-region routing (because the data plane already runs in every region) and operator familiarity. Weaknesses: AI-specific primitives are newer, and routing observability is plugin-driven.

What it does for routing Claude Code in production:

Deterministic policy expressiveness. Kong route rules plus the AI Proxy plugin. Match on headers, JSON body fields, and developer tier. Token-count predicate routing typically requires a pre-route Lua plugin, which is expressive but heavier to author.
Per-region routing. Where Kong shines. The data plane already runs in every region your APIs run in, so routing Claude Code to the nearest Anthropic region is a matter of configuring the upstream pool with regional endpoints. Session stickiness through the consumer-affinity plugin.
Failover semantics. Upstream pools with health checks and ordered targets, the pattern Kong has shipped for years. Failover RTT P99 measured at 290ms.
P99 latency overhead. P50 of 11ms, P99 of 42ms (3 replicas, US-east). AI Proxy adds 3-6ms on the base proxy.
Routing-decision observability. OTel plugin + your own sink. Default Kong dashboard names the upstream served; AI-specific detail is plugin-driven.
Cache-hit policy interaction. Region-pinned upstreams preserve cache locality.
Traffic-shape resilience. Tested at 10x burst with P99 climbing to ~95ms, recovering within 25 seconds.

Where it falls short:

No offline learner.
AI Proxy primitives are newer than the rest of Kong; route_type was still adding features through Q1 2026.
Observability story requires wiring multiple plugins. Expect two weeks of platform-team time.
Out-of-the-box, failover graph is one level deep. Multi-hop failovers (region A → region B → Bedrock) require chained upstream definitions and careful health-check tuning.

Pricing: Kong is open source. Kong Konnect (managed) starts free. Enterprise plans for SLA, plugins, and support start around $1.5K/month.

Score: 5/7 axes (missing: native AI routing observability, offline learner).

4. LiteLLM: Best for Python-native self-hosted proxy with explicit failover lists

Verdict: LiteLLM is the pick when Claude Code traffic can’t leave the VPC and the security team wants to read every line of code that touches a prompt. Source-available, Python-native. Routing config is explicit and failover semantics are correct out of the box. Trade-off: it’s a Python process, so expect to tune Gunicorn workers and care about GIL behavior under burst.

What it does for routing Claude Code in production:

Deterministic policy expressiveness. YAML model_list entries plus routing strategies (simple-shuffle, least-busy, usage-based-routing, latency-based-routing). Token-count predicate routing typically goes through a Python pre-call hook.
Per-region routing. Per-model-list region pinning. Session stickiness through user_id and usage-based-routing. Easy to misconfigure: simple-shuffle destroys cache hits, usage-based-routing preserves them.
Failover semantics. Explicit fallback lists per model with num_retries and allowed_fails. Deterministic, well-documented. Failover RTT P99 measured at 340ms.
P99 latency overhead. P50 of 22ms, P99 of 78ms (4 Gunicorn workers, 2 CPU each, US-east). Higher because of Python serialization. If you need sub-50ms P99, LiteLLM is the wrong pick.
Routing-decision observability. Dashboard shows the model served and fallback hops. Deeper detail requires inspecting request metadata or wiring a traceAI sink behind LiteLLM.
Cache-hit policy interaction. Preserved with usage-based-routing + sticky user_id; broken with the wrong strategy.
Traffic-shape resilience. Python’s GIL is the limit. P99 climbed to ~145ms during a 10x burst, recovery 90 seconds.
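Because explicit fallback lists are easy to mistype as they grow, a small pre-deploy check can catch fallbacks that point at undefined deployments. The sketch below is written as plain Python dicts so it runs without LiteLLM installed; the dict shapes are modeled on LiteLLM's `model_list` entries and `fallbacks` mapping as I understand them and should be verified against the current docs. The deployment names, the Bedrock placeholder, and the `undefined_fallback_names` helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical deployment names. The Bedrock model ID is left as a placeholder
# on purpose; fill it in from your own account.
model_list: List[dict] = [
    {
        "model_name": "sonnet-us-east",
        "litellm_params": {"model": "anthropic/claude-sonnet-4-6"},
    },
    {
        "model_name": "sonnet-bedrock",
        "litellm_params": {"model": "bedrock/<your-bedrock-model-id>"},
    },
]

# Each mapping reads: if the key fails, try the listed names in order.
fallbacks: List[Dict[str, List[str]]] = [{"sonnet-us-east": ["sonnet-bedrock"]}]


def undefined_fallback_names(
    model_list: List[dict], fallbacks: List[Dict[str, List[str]]]
) -> List[str]:
    """Return every name used in fallbacks that has no model_list entry."""
    defined = {entry["model_name"] for entry in model_list}
    referenced = set()
    for mapping in fallbacks:
        for source, targets in mapping.items():
            referenced.add(source)
            referenced.update(targets)
    return sorted(referenced - defined)


print(undefined_fallback_names(model_list, fallbacks))
```

Running a check like this in CI against the rendered config is cheaper than discovering a dangling fallback during an incident.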

Where it falls short:

No offline learner.
Python proxy means higher P99 and longer burst recovery than Go-based or edge-based alternatives.
Dashboard is functional, not polished. Routing audit means SQL or an external sink.
YAML config grows long fast; large fleets end up with hundreds of model-list entries managed via Helm.

Pricing: Open source under MIT. LiteLLM Enterprise tier with SLA + SSO + audit starts around $250/month for small teams.

Score: 5/7 axes (missing: offline learner, sub-50ms P99 ceiling).

5. Cloudflare AI Gateway: Best for edge-routed gateway with the lowest P99 overhead

Verdict: Cloudflare AI Gateway is the pick when latency is the dominant constraint and you accept a shallower routing policy in exchange for sub-15ms gateway overhead at the edge. Cheapest to run, fastest to deploy. For raw speed and lightweight observability, genuinely good. For a typed policy DSL or offline learner, look elsewhere.

What it does for routing Claude Code in production:

Deterministic policy expressiveness. Rules in the Cloudflare dashboard plus Workers AI binding. Predicate routing on token count or developer tier requires writing a Worker in front; the gateway is closer to a smart proxy with caching and rate limits than a full policy engine. Sufficient for “primary plus one fallback”; not enough for a five-tier predicate ladder without a Worker.
Per-region routing. Anycast routes to the nearest PoP automatically. Downside for cache locality: the same session may hit different PoPs across turns unless you pin the upstream Anthropic endpoint.
Failover semantics. “Fallback providers” config, ordered list on configurable error codes. Failover RTT P99 of 220ms, the fastest of the five because the edge has the secondary’s connection state cached. One level deep without a Worker.
P99 latency overhead. P50 of 4ms, P99 of 13ms at the edge against a US-east client. Lowest by a wide margin.
Routing-decision observability. Lightweight. Dashboard shows request, upstream, fallback. Deeper detail requires Worker logging or an external sink.
Cache-hit policy interaction. Cloudflare’s response cache is excellent for fully cacheable requests (rare for Claude Code). Anthropic prompt-cache locality requires explicit region pinning; default anycast destroys hits.
Traffic-shape resilience. Tested at 10x burst with P99 climbing to 28ms, recovering within 10 seconds.

Where it falls short:

No offline learner.
Routing policy is shallow without a Worker on top.
Region pinning required for cache locality; default anycast destroys hits.
Claude Code integration documented mostly through community examples as of May 2026; first-party docs lean on OpenAI-compatible mode.
Routing-decision observability is hand-rolled.

Pricing: Free tier with 100K requests / month. Paid tiers integrated into Workers pricing, typically $5/month minimum plus per-request fees that scale linearly.

Score: 4.5/7 axes (missing: deep policy expressiveness, offline learner, native routing observability).

Capability matrix

Axis	Future AGI	Portkey	Kong AI Gateway	LiteLLM	Cloudflare AI Gateway
Deterministic policy DSL	Typed predicates	YAML conditional	Lua + plugin	YAML + Python hook	Worker-authored
Per-region routing	Native, sticky	Per virtual key	Native multi-region	Per model-list entry	Anycast (needs pinning)
Failover semantics	Multi-hop graph	Ordered list	Health-check pool	Explicit list	One-level + Worker
P99 latency overhead	34ms	52ms	42ms	78ms	13ms
Routing-decision span	First-class	Detail view	OTel plugin	Metadata only	Worker logs
Cache-locality safe	Sticky default	Sticky per key	Region-pinned	Strategy-dependent	Needs pinning
Burst resilience	71ms P99 at 10x	110ms P99 at 10x	95ms P99 at 10x	145ms P99 at 10x	28ms P99 at 10x
Offline policy learner	`fi.opt`	None	None	None	None

Decision framework: Choose X if

Choose Future AGI if you want the routing policy to keep improving without an SRE babysitting it. Deterministic hot path, offline learner, policy updates landing in change management. Strongest when Claude Code is the dominant production workload.

Choose Portkey if you want a hosted gateway with a polished config-driven fallback graph and your security posture allows a SaaS data plane.

Choose Kong AI Gateway if you already operate Kong for REST APIs across regions and the right answer is to extend the existing stack.

Choose LiteLLM if Claude Code traffic can’t leave the VPC and source-availability beats P99 latency. Tune Gunicorn before you tune the policy.

Choose Cloudflare AI Gateway if latency is the dominant constraint and the policy is simple (primary + one fallback), or you’re happy to author it in a Worker.

Common mistakes when wiring Claude Code routing in production

Mistake	What goes wrong	Fix
Using an LLM-judge on the hot path to classify route decisions	Adds 380-450ms to P50 and 900ms+ to P99, and creates a circular failure mode when the same provider is rate-limited	Move classification offline; ship deterministic predicates derived from the offline analysis
One global pool with no region pinning	Anycast load-balancer destroys Anthropic prompt-cache hits and sessions run several times more expensive	Pin sessions to a region; consistent-hash by session ID
Fallback configured as retry against the same upstream	A regional incident becomes an amplified incident as retries pile up	Configure ordered fallback graphs across regions; cap per-target retries at 1
Tagging routing decisions with no policy version	“Why was this routed here last Tuesday” becomes unanswerable after the policy changes	Emit policy-version on every routing-decision span; treat the policy as a versioned artifact
Setting a circuit breaker without testing under burst	Burst from CI triggers an avalanche of opens and the developer-facing fleet stalls	Synthetic-test the gateway at 10x baseline RPS before production rollout
Buffering streaming through a custom Worker on the edge	Claude Code’s progress UI freezes mid-turn and the developer experience regresses	Pass SSE through unbuffered; test from the actual CLI, not from curl
Sharing one virtual key across many developers in routing	Per-developer routing rules become unenforceable and observability collapses	Issue per-developer virtual keys that fan out to one underlying account
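The circuit-breaker row is easier to reason about with the state machine written out. Below is a minimal per-upstream breaker, offered as a sketch under stated assumptions: the class, its thresholds, and the injected clock are hypothetical, and the half-open state is simplified to let any request through after the cooldown instead of a single probe.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures, then allows traffic again after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a request may be sent to this upstream right now."""
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cooldown has elapsed, then let probes through.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        """Any success closes the breaker and clears the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count the failure; open (or re-open) once the threshold is reached."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


breaker = CircuitBreaker(failure_threshold=2, cooldown_s=5.0)
breaker.record_failure()
breaker.record_failure()
print(breaker.allow())
```

Injecting the clock makes the breaker testable without sleeping, which matters if the 10x burst test from the table is going to run in CI.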

How Future AGI closes the loop on routing Claude Code in production

The other four gateways treat routing as a static config: write the policy, deploy it, leave it. Future AGI treats routing as a versioned artifact whose hot-path execution is deterministic but whose policy gets updated by an offline learner between deploys. Six stages:

Trace. Every Claude Code turn produces a span tree via traceAI (Apache 2.0). The routing decision is a first-class span: policy version, predicate matched, pool selected, region targeted, fallback hops, final upstream, prompt-cache hit ratio.
Evaluate. ai-evaluation (Apache 2.0) scores every turn. FAGI ships 60+ EvalTemplate classes in the ai-evaluation SDK (task completion, faithfulness, code-correctness, tool-use accuracy, structured-output, hallucination, groundedness, instruction-following, agentic surfaces), unlimited custom evaluators authored end-to-end by an in-product eval-authoring agent that uses tool calling on your code, self-improving evaluators on the Future AGI Platform that learn from live production traces, and FAGI’s proprietary classifier model family, which runs continuous high-volume per-route scoring at very low cost-per-token (lower per-eval cost than Galileo Luna-2). Scores join routing-decision data so the loop can answer “did the route produce a good output?” The catalog is the floor, not the ceiling.
Cluster. Misroutings cluster by failure mode. Patterns surface: “turns over 12K input tokens routed to haiku regress on faithfulness,” or “turns with tool-call density above 4 regress on opus at us-west.” This is the core of evaluating LLM routing policies.
Optimize. fi.opt.optimizers proposes a policy update: a threshold adjustment or a new predicate. Six optimizers are available (RandomSearchOptimizer; BayesianSearchOptimizer, Optuna-backed with teacher-inferred few-shot templates and resumable studies; MetaPromptOptimizer; ProTeGi; GEPAOptimizer; PromptWizardOptimizer), all sharing an EarlyStoppingConfig (patience, min_delta, threshold, max_evaluations) and the same unified Evaluator over 60+ FAGI rubrics. The proposal is a diff against the current policy, reviewable like any code change.
Route. Hot path stays deterministic. The approved policy version becomes the next deploy; predicates evaluate in sub-millisecond time, no LLM in the loop.
Re-deploy and observe. The new policy is versioned, deployed, and observable. If the score regresses, rollback is automatic.
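The rollback step in the last stage can be pictured as a simple comparison of eval scores between policy versions. The post does not describe the platform's actual rollback criteria, so the sketch below is only an assumption: the function name, the `max_drop` tolerance, and the `min_samples` guard are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def should_roll_back(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    max_drop: float = 0.02,
    min_samples: int = 200,
) -> bool:
    """Roll back when the candidate's mean eval score trails the baseline by more than max_drop."""
    if len(candidate_scores) < min_samples:
        return False  # not enough evidence yet; keep observing
    return mean(baseline_scores) - mean(candidate_scores) > max_drop


baseline = [0.90] * 300   # eval scores under the previous policy version
candidate = [0.80] * 300  # eval scores under the newly deployed version
print(should_roll_back(baseline, candidate))
```

The minimum-sample guard keeps a handful of unlucky turns right after a deploy from triggering a rollback; a production check would more likely use a statistical test than a fixed tolerance.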

Net effect: a team starting with a hand-written policy typically sees misroutings drop 35-55% within six weeks, with no change in P99 because the learner runs offline. The hot path stays deterministic, and that is what makes this safe to run on call.

The three building blocks are open source under Apache 2.0: traceAI, ai-evaluation, and agent-opt (github.com/future-agi). The hosted Agent Command Center adds the failure-cluster view, live Protect guardrails (~65 ms text-only, per arXiv 2510.13351), RBAC, SOC 2 Type II, and AWS Marketplace for procurement.

What we did not include

Three gateways that show up in other 2026 listicles but didn’t fit here:

OpenRouter. Fantastic for model exploration but routing primitives are consumer-facing; per-region pinning and SRE-grade failover graphs aren’t first-class.
Helicone. Strong observability but routing is basic (round-robin + simple failover). Right pick for monitoring-first workloads (see our token-monitoring post).
TrueFoundry. Solid MLOps gateway but the routing-policy DSL was still maturing through Q1 2026; revisit in Q3.

Sources

Anthropic Claude Code documentation, claude.ai/docs/claude-code
Future AGI Agent Command Center, futureagi.com/platform/monitor/command-center
Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (65 ms text, 107 ms image)
Portkey AI gateway, portkey.ai
Kong AI Gateway, konghq.com/products/kong-ai-gateway
LiteLLM proxy, github.com/BerriAI/litellm
Cloudflare AI Gateway, developers.cloudflare.com/ai-gateway

Frequently asked questions

What is the lowest-latency AI gateway for Claude Code in production?

Cloudflare AI Gateway, P99 of 13ms at the edge. Trade-off: shallower routing policy.

Can the gateway make routing decisions by calling another LLM in the hot path?

It can, but it should not. Claude-Haiku-classifier routing adds 380-450ms to P50 and 900ms+ to P99. Worse, when the classifier and the served model share an upstream, a regional incident takes both down at once. Deterministic predicates with an offline learner is the production-safe pattern.

How do I keep Anthropic prompt-cache hits when routing across regions?

Use session-sticky routing. Hash the session ID into a consistent ring and pin all turns to one region. Future AGI does this by default; Portkey does it per virtual key; Kong via consumer affinity; LiteLLM via `usage-based-routing`; Cloudflare requires explicit upstream pinning.
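Hashing the session ID into a consistent ring can be sketched in a few lines. This is a generic consistent-hash ring with virtual nodes, not any particular gateway's implementation; `RegionRing` and the `vnodes` default are hypothetical, and the region names follow the pool names used earlier in this post.

```python
import bisect
import hashlib
from typing import List, Sequence, Tuple


def _hash(key: str) -> int:
    """Stable 64-bit hash; Python's built-in hash() is salted per process."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class RegionRing:
    """Consistent-hash ring mapping a session ID to one region."""

    def __init__(self, regions: Sequence[str], vnodes: int = 64) -> None:
        if not regions:
            raise ValueError("at least one region is required")
        # Each region owns several points on the ring to even out the load.
        self._points: List[Tuple[int, str]] = sorted(
            (_hash(f"{region}#{i}"), region) for region in regions for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def region_for(self, session_id: str) -> str:
        """Walk clockwise from the session's hash to the next region point."""
        idx = bisect.bisect_right(self._keys, _hash(session_id)) % len(self._keys)
        return self._points[idx][1]


ring = RegionRing(["anthropic.us-east", "anthropic.us-west", "anthropic.eu-central"])
print(ring.region_for("session-1234"))
```

The point of the ring over a plain modulo is what happens when a region is removed: only the sessions that were pinned to that region move, so every other session keeps its prompt-cache locality.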

What is a reasonable failover RTT P99 budget?

Under 400ms is acceptable. Over 1s usually means the failover is doing a cold TLS rehandshake, so warm the secondary pool. The five gateways here sit between 220ms (Cloudflare) and 380ms (Portkey).

How do I make the gateway observability story SRE-grade?

The routing decision has to be a first-class span (policy version, predicate matched, pool, region, fallback hops, prompt-cache hit ratio) that is queryable without grepping logs. Future AGI emits this natively; the other four require wiring.

Should we route Claude Code turns to non-Claude models as a fallback?

With care. Claude Code is tuned for Claude; non-Claude fallbacks often degrade tool-use. Safer: fail over to a different Anthropic region, then Bedrock claude-sonnet-4-6 (still Claude), and swap providers only as a last resort.

How is Future AGI Agent Command Center different from Portkey for routing?

Portkey's config is static: what you ship today is what you ship next month. Future AGI ships a versioned policy whose hot path is deterministic but whose contents get updated by an offline learner between deploys.
