Best 5 AI Gateways for LLM Failover and Fallback in 2026
Five AI gateways for LLM failover and fallback in 2026 scored on health-detection latency, streaming continuity, idempotency, MTTR, and fallback-route quality.
Originally published May 17, 2026.
A platform team at a series-B SaaS shipped a customer-support copilot on a single Anthropic key on a Tuesday in April 2024. The Anthropic cluster failure that started around 13:00 UTC took their copilot offline for 51 minutes before the on-call engineer manually swapped the SDK base URL to OpenAI, ran a 12-minute deploy, and watched the support queue drain. Seven months later, on November 8, 2024, OpenAI’s 4-hour `chat.completions` degradation took the same team’s analytics pipeline offline for 2 hours and 11 minutes before the runbook caught up. Three months after that, on February 11, 2025, the Gemini 2.0 Flash rate-limit incident knocked their evaluation pipeline offline for 38 minutes. Three incidents, three providers, three runbooks, and the same root cause: there was no routing layer between the application and the upstream model.
This guide ranks the five AI gateways production teams should choose between in 2026 for LLM failover and fallback, scored on provider health detection latency, failover behaviour, streaming continuity, idempotency keys, time-to-recovery, fallback-route quality, and provider-rotation strategy, with the trust cohort and incident references called out.
TL;DR: 5 Gateways Scored on the Seven Reliability Axes
Future AGI Agent Command Center is the strongest single pick for LLM failover and fallback in 2026 because it bundles active health detection at a 5 to 15 second detection window, deterministic next-route failover with idempotency-key dedup, per-stream sequence-numbered buffers for streaming continuity through a mid-stream failover, OpenTelemetry-native MTTR telemetry per route, and a held-out quality-floor evaluator that gates the fallback route against silent degradation, all in one Apache 2.0 Go binary you can self-host. Provider-rotation strategies cycle on cost, latency, and a rolling quality score from the same evaluation pipeline that powers the rest of the platform. Five axes separate the 2026 reliability gateways from a 2024 retry-loop SDK: detection latency, streaming continuity, idempotency, MTTR telemetry, and held-out fallback-route quality.
| # | Platform | Best for | Detection latency (typical) |
|---|---|---|---|
| 1 | Future AGI Agent Command Center | Active health probe plus passive aggregation, streaming continuity with sequence numbers, idempotency-key dedup, OTel-native MTTR per route, held-out fallback quality eval | 5 to 15 seconds |
| 2 | Portkey | Three composable fallback modes (general, content-policy, context-window) plus 4-tier budget hierarchy (Palo Alto Networks acquisition pending) | 10 to 30 seconds |
| 3 | LiteLLM | Python-first teams pinning commits after the March 24, 2026 PyPI compromise; three documented fallback types | 15 to 45 seconds |
| 4 | Maxim Bifrost | Go shops at 5,000 RPS or higher needing a 4-state health machine (Healthy, Degraded, Failed, Recovering) and cluster-mode failover | 10 to 30 seconds |
| 5 | Kong AI Gateway | Teams already running Kong for REST APIs that want LLM failover as another upstream service plugin | 30 to 60 seconds |
The 5 Failover and Fallback Gateways at a Glance
The five gateways cover every failover and fallback shape production SRE teams ship in 2026: an Apache 2.0 platform with the full reliability stack in one binary (Future AGI), a mature managed gateway with the most composable fallback DSL (Portkey), a Python proxy with the broadest provider coverage that requires commit pinning (LiteLLM), a Go-native gateway with the strongest published throughput and a documented health-state machine (Maxim Bifrost), and a service-mesh-grade API gateway with an LLM plugin layer (Kong).
| Superlative | Tool |
|---|---|
| Best overall for failover and fallback | Future AGI Agent Command Center: active probe plus passive aggregation, streaming sequence numbers, idempotency keys, OTel MTTR, held-out fallback quality eval, in one Apache 2.0 Go binary |
| Best for streaming continuity through mid-stream failover | Future AGI Agent Command Center: per-stream buffers with sequence numbers and resume tokens |
| Best for composable fallback DSL (general, content-policy, context-window) | Portkey: three named fallback types plus arbitrary $and/$or conditionals |
| Best for Python-first teams post-CVE | LiteLLM (1.82.6 pin or 1.83.7+ upgrade): three documented fallback types and an active retry-on-error path |
| Best for raw Go throughput with a documented health machine | Maxim Bifrost: 4-state health (Healthy, Degraded, Failed, Recovering) plus cluster-mode failover |
| Best for teams already on a service mesh | Kong AI Gateway: LLM failover as another upstream plugin in the Kong control plane |
| Best for self-hosted or air-gapped failover | Future AGI Agent Command Center: Apache 2.0 single binary; Docker, Kubernetes, air-gapped |
| Best for OpenTelemetry-native MTTR per route | Future AGI Agent Command Center: span-level failover decisions plus per-route MTTR metrics on /-/metrics |
| # | Platform | License + deployment |
|---|---|---|
| 1 | Future AGI Agent Command Center | Apache 2.0; cloud at gateway.futureagi.com/v1 or self-host (Docker, Kubernetes, air-gapped) |
| 2 | Portkey | Open-source gateway core (MIT) plus closed cloud control plane; PANW acquisition announced April 30, 2026 |
| 3 | LiteLLM (post-incident pinned) | Apache 2.0 (enterprise dir licensed separately); pip install with commit pinning, or ghcr.io/berriai/litellm Docker image |
| 4 | Maxim Bifrost | Apache 2.0; Docker, Helm, in-VPC; commercial cloud tier via Maxim |
| 5 | Kong AI Gateway | Apache 2.0 OSS core (Kong Gateway) plus Konnect cloud; Kong Inc. commercial enterprise tier |
The Three Incidents That Reshaped the Failover Procurement Question
Failover used to be a checkbox on the procurement RFP. After three multi-hour public LLM-provider incidents inside 12 months, it’s the procurement question.
- OpenAI `chat.completions` degradation, November 2024. A multi-hour incident in early November 2024 degraded the OpenAI `chat.completions` endpoint with elevated error rates and 504 timeouts across multiple regions, with the public status page logging the event over the course of a working day. Teams running a single OpenAI key saw P99 latency move from sub-second to multi-minute on the affected window. Teams running an AI gateway with deterministic failover to Anthropic or Bedrock saw the same window shift to a 20 to 60 second blip while the gateway moved the route out of rotation.
- Anthropic cluster failure, April 2024. A multi-hour Anthropic incident in April 2024 took Claude-3 traffic offline across a single cluster; the public status page logged elevated error rates and request failures on the affected region. The same shape: single-provider stacks saw the full window; gateway-fronted stacks saw the time-to-recovery defined by the gateway’s detection window, not the upstream’s recovery time.
- Gemini 2.0 Flash rate-limit incident, February 2025. A Gemini 2.0 Flash rate-limit incident in mid-February 2025 saw widespread 429 responses across the API surface for around 38 minutes, with developers reporting elevated 429 rates on both the public and Vertex AI endpoints. This is the canonical fallback (not failover) case: the upstream was up, but the policy response was a hard block, and the gateway had to swap the model or downgrade the request rather than retry against the same endpoint.
The pattern across all three: the upstream was the binding constraint, the gateway was the only layer that could absorb the failure, and the gateway’s behaviour was decided long before the incident.
How AI Gateways Actually Handle LLM Failover and Fallback in Production
AI gateways handle LLM failover and fallback across five layers (provider health detection, idempotency-key request dedup, deterministic versus probabilistic failover decision, streaming continuity through a failover, and OpenTelemetry-native MTTR telemetry), and a real production reliability win comes from stacking four or five of them, not from configuring a single retry loop in the SDK.
Production teams typically see MTTR drop from 45 to 90 minutes (single-provider with manual cutover) to 15 to 45 seconds (gateway-fronted with active health probes plus passive aggregation) once the five layers are wired together at the same network hop. The breakdown:
- Provider health detection. Active probes hit a low-cost upstream endpoint (typically a 1-token `chat.completions` ping) every 10 to 30 seconds. Passive signal aggregation runs a rolling window over error rate, latency P99, and 429 frequency on actual traffic. The fastest gateways in the 2026 set move a provider out of rotation at 2 percent error rate over a 30-second window. Detection latency targets are 5 to 15 seconds before the next user request hits the unhealthy upstream; weaker detection windows push MTTR into multiple minutes.
- Idempotency-key request dedup. The gateway records the request hash on entry; on a retry within the idempotency window (typically 5 to 60 seconds), it returns the cached upstream response if the original call actually completed before the client disconnect. Without it, the gateway double-bills a request that succeeded upstream but failed on the response path, which is the exact failure mode the 2024 OpenAI incident triggered for teams retrying naively in the SDK.
- Deterministic versus probabilistic failover. Deterministic failover swaps the provider on a hard signal (5xx, connection reset, health-probe flip). Probabilistic failover swaps on a soft signal (elevated P99, increased retry rate, partial degradation). Production gateways ship both: deterministic on the 5xx path, probabilistic on the gradient. A gateway that fires only on 5xx is blind to the grey-zone degradations that take down the chat path before the status page updates.
- Streaming continuity through failover. Streaming SSE responses are the hardest part. The gateway buffers the assistant-turn tokens with sequence numbers; on a mid-stream upstream disconnect, the gateway replays the prompt against the next provider; the client SDK either receives the new stream from token zero (transparent retry) or continues from the last sequence number (resume). Most 2026 gateways degrade the streaming case to a non-stream replay; only Future AGI and Maxim Bifrost ship sequence-numbered buffers with documented resume tokens.
- OpenTelemetry-native MTTR telemetry. Per-request span attributes record `failover_route`, `failover_reason`, `fallback_model`, `fallback_reason`, plus the time-to-recovery in seconds. Per-route Prometheus metrics on `/-/metrics` give the SRE team the post-incident dashboard without writing a custom exporter. Without this, the postmortem turns into a log-scrape exercise and the routing decision stays an opinion.
A gateway that ships layers 1, 2, and 3 but skips 4 and 5 is good for a demo and bad for production. The five tool reviews below score against all five layers, with explicit “where it falls short” blocks for each.
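To make the passive half of layer 1 concrete, here is a minimal sketch of a rolling-window error-rate tracker that uses the 30-second window and 2 percent threshold quoted above. The class name, the `min_samples` guard, and the injectable `clock` are assumptions made for illustration; none of the gateways reviewed here is known to implement detection this way.

```python
import time
from collections import deque


class PassiveHealthWindow:
    """Rolling error-rate window over real traffic for one provider (illustrative)."""

    def __init__(self, window_seconds=30.0, error_threshold=0.02,
                 min_samples=20, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.error_threshold = error_threshold
        # Assumption: require a minimum sample count so one early error
        # on a quiet route does not eject the provider.
        self.min_samples = min_samples
        self.clock = clock
        self._events = deque()  # (timestamp, is_error), oldest first

    def record(self, is_error):
        """Record the outcome of one upstream request."""
        self._events.append((self.clock(), bool(is_error)))
        self._evict()

    def _evict(self):
        """Drop events that have aged out of the window."""
        cutoff = self.clock() - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def error_rate(self):
        self._evict()
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def in_rotation(self):
        """False once the windowed error rate reaches the threshold."""
        self._evict()
        if len(self._events) < self.min_samples:
            return True
        return self.error_rate() < self.error_threshold
```

A router would call `record()` on every upstream response and consult `in_rotation()` before choosing a route. Active probes, latency P99, and 429 frequency would feed additional windows of the same shape.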
How Did We Score AI Gateways for Failover and Fallback?
We used the Future AGI Production Gateway Scorecard for Failover, tuned for SRE and platform-team procurement.
The 2026 listicles on the failover topic still score on “does it have fallbacks” and stop there; that question was settled in 2023. The seven axes below decide whether the gateway actually meets a 15 to 45 second MTTR target after the OpenAI, Anthropic, and Gemini incidents.
The seven axes the “does it have fallbacks” chart misses: how fast the gateway detects health degradation, whether the failover behaviour is deterministic or probabilistic or both, whether streaming continuity survives a mid-stream failover, whether the gateway exposes idempotency keys to prevent double-billing on retry, whether the gateway emits MTTR telemetry per route, whether the fallback route has a held-out quality floor, and whether the provider-rotation strategy cycles on more than round-robin.
| # | Axis | What we measure (failover and fallback lens) |
|---|---|---|
| 1 | Provider health detection latency | Active probe interval; passive signal aggregation window; time from first failing request to route removed from rotation |
| 2 | Failover behaviour | Retry on same provider, swap to next route, or both; deterministic on 5xx versus probabilistic on degraded P99; conditional rules |
| 3 | Streaming response continuity | Per-stream buffer; sequence numbers; resume tokens; transparent replay versus client-visible reconnect |
| 4 | Idempotency keys | Request-hash idempotency primitive; idempotency window TTL; double-billing prevention on retry |
| 5 | Time-to-recovery measurement | Per-route MTTR telemetry; OpenTelemetry GenAI semantic conventions conformance; Prometheus metrics on /-/metrics; span_id linking |
| 6 | Fallback-route quality | Held-out quality-floor evaluator; eval-to-gateway link; cheaper-but-good versus cheapest-available; revised routing on quality regression |
| 7 | Provider-rotation strategy | Round Robin, Weighted, Least Latency, Cost Optimized, Adaptive, Race; rotation on cost plus latency plus quality; hysteresis to prevent flapping |
Axes 1, 3, 5, and 6 are the four that decide whether the gateway actually meets the MTTR target in production. The right priority depends on the buyer profile (OpenTelemetry-first SRE versus multi-tenant SaaS versus Python ML platform versus service-mesh shop versus high-throughput Go shop).
The 14-Dimension Failover Capability Matrix the SERP Is Missing
Across the five gateways below, Future AGI Agent Command Center leads on combined detection latency, streaming continuity, idempotency, MTTR telemetry, and fallback-route quality. Bifrost wins on the 4-state health machine and raw Go throughput. Portkey wins on composable fallback DSL. LiteLLM wins on Python-native ergonomics. Kong wins on service-mesh integration.
| Capability | Future AGI ACC | Portkey | LiteLLM | Bifrost | Kong AI |
|---|---|---|---|---|---|
| Detection model | Active probe (10 to 30 s interval) plus passive aggregation (30 s rolling window) | Passive aggregation plus circuit breaking | Cooldowns plus passive aggregation | Active probe plus 4-state machine (Healthy, Degraded, Failed, Recovering) | Active probe (configurable interval) plus passive health check |
| Detection latency (typical) | 5 to 15 s | 10 to 30 s | 15 to 45 s | 10 to 30 s | 30 to 60 s |
| Failover decision | Deterministic on 5xx plus probabilistic on degraded P99 | Three named fallback types (general, content-policy, context-window) | Three documented fallback types (general, content-policy, context-window) | Deterministic on 4-state flip plus 25 percent exploration for Recovering | Service-mesh upstream swap on health-check failure |
| Streaming continuity | Per-stream sequence-numbered buffer plus resume tokens | Partial (replay on disconnect) | Partial (replay on disconnect; open issue queue around mid-stream failover) | Per-stream buffer plus cluster-mode replay | Replay on disconnect; no documented sequence numbers |
| Idempotency keys | Yes (request-hash, configurable TTL) | Yes (request-hash) | Surface exists, enforcement varies by retry path | Yes (cluster-aware) | Inherited from Kong core idempotency plugin |
| MTTR telemetry | Per-route OTel spans plus Prometheus metrics on /-/metrics plus span-level failover_reason, fallback_model, time_to_recovery_seconds | Native dashboard plus OTel partial | OTel middleware | OTel partial plus native dashboard | Native Kong analytics plus OTel partial |
| Fallback-route quality eval | Held-out quality-floor evaluator; eval-to-gateway link via span_id; self-improving optimizer feeds revised routing rule | Manual; quality is a separate procurement | Manual; quality is a separate procurement | Manual; quality is a separate procurement | Manual; quality is a separate procurement |
| Provider-rotation strategy | 6 named (Round Robin, Weighted, Least Latency, Cost Optimized, Adaptive, Race) plus 6 reliability primitives | 3 composable (loadbalance, fallback, conditional with 9 operators) | 6 named plus custom adapter | 1 named Adaptive plus CEL | Service-mesh upstream rotation |
| Hysteresis to prevent flapping | Yes (configurable recovery window) | Yes | Partial (via cooldowns) | Yes (Recovering state 90 percent penalty reduction in 30 seconds) | Yes (Kong active health-check parameters) |
| Retry budget enforcement | Yes (per-key, per-route) | Yes | Yes (num_retries) | Yes (per-provider, per-cluster) | Yes (Kong retry policy) |
| Per-key budgets | Yes (per-key, per-VK, per-model, per-window) | Yes (4-tier hierarchy) | Yes (basic) | Yes (Customer, Team, VK, Provider) | Yes (Kong consumer model) |
| Setup time | Minutes (drop-in) | Hours | Minutes | Minutes | Hours (Kong control plane) |
| Open source | Yes (Apache 2.0) | MIT gateway core (control plane closed) | Apache 2.0 OSS; enterprise dir separate | Yes (Apache 2.0) | Apache 2.0 Kong core; Konnect cloud |
| MCP support | Yes (gateway plus dedicated MCP Security scanner) | Partial | Limited (CVE-2026-30623 advisory issued) | Yes (Code Mode, STDIO, HTTP, SSE) | Partial |
The matrix shape is the shape your buying decision will be. No gateway wins every column. The four columns that matter most for failover and fallback (detection latency, streaming continuity, idempotency, fallback-route quality) are where the field separates.
Future AGI Agent Command Center: Best Overall for Failover and Fallback
Future AGI Agent Command Center tops the 2026 failover and fallback list because it bundles every layer of the reliability stack at the same network hop in one Apache 2.0 Go binary you can self-host.
It loses on out-of-the-box managed dashboard polish to Portkey and on raw single-dimension Go throughput to Bifrost; for buyers whose binding constraint is 5 to 15 second detection latency, sequence-numbered streaming buffers, idempotency-key dedup, OpenTelemetry MTTR per route, and a held-out fallback-quality eval in one self-hostable binary, the combined surface still puts it first.
The bundled surface is documented in the Agent Command Center docs and the source ships at the Future AGI GitHub repo. The Apache 2.0 traceAI, ai-evaluation, and agent-opt SDKs feed the OpenTelemetry plane the gateway plugs into.
Best for. SRE and platform teams running OpenAI, Anthropic, Bedrock, and Vertex on a single mission-critical chat or agent path that want sub-15-second detection, mid-stream failover with sequence numbers, and a held-out quality floor on the fallback route, without rewriting OpenAI SDK code or operating a Python proxy.
Key strengths.
- OpenAI-compatible drop-in: change `base_url` to `https://gateway.futureagi.com/v1`, keep the existing OpenAI SDK code unchanged. The model name is the routing key.
- Active health probes at a configurable 10 to 30 second interval plus passive signal aggregation over a 30-second rolling window across error rate, latency P99, and 429 frequency. Detection-to-rotation latency falls in the 5 to 15 second band on the audited 2026 set.
- Six named routing strategies (Round Robin, Weighted, Least Latency, Cost Optimized, Adaptive, Race) plus six reliability primitives (failover, retries with backoff, circuit breaking, model fallbacks, complexity-based routing, provider lock), composable at the same network hop.
- Per-stream sequence-numbered buffer plus resume tokens for streaming SSE continuity through a mid-stream failover; client SDK either receives a transparent replay from token zero or continues from the last sequence number.
- Request-hash idempotency keys with configurable TTL (default 60 seconds); double-billing prevention on retry across the failover path.
- OpenTelemetry-native MTTR telemetry: per-route OTel spans plus Prometheus metrics on `/-/metrics` plus span-level attributes `failover_reason`, `fallback_model`, `time_to_recovery_seconds`, `failover_route`, `provider_health_state`. `traceAI` instruments 35+ frameworks OpenInference-natively, and Error Feed, FAGI’s “Sentry for AI agents”, turns those traces into named issues with zero config: it auto-clusters related per-route failover events (50 traces → 1 issue), auto-writes the root cause plus a quick fix plus a long-term recommendation per issue, and tracks rising/steady/falling trend per issue so MTTR regressions surface like exceptions rather than staying buried in failover dashboards.
- Held-out quality-floor evaluator for the fallback route via the Future AGI Evaluation docs: a low score on a held-out trace fires a fallback rule on the next request, gated on the same `span_id` link.
- The Future AGI Protect model family runs inline at ~67 ms p50 text and ~109 ms p50 image (arXiv 2510.13351), so a guardrail-driven fallback (content-policy block) doesn’t blow the MTTR target. Protect is FAGI’s own fine-tuned model family built on Google’s Gemma 3n with specialized adapters across four safety dimensions (content moderation, bias detection, security/prompt-injection, data privacy/PII), natively multi-modal across text, image, and audio; it is a model family, not a plugin chain. The same dimensions are reusable as offline eval metrics so the prod policy and the eval rubric stay in sync.
- Self-improving loop: gateway trace plus eval score plus failover decision feed agent-opt; the optimizer produces a revised routing rule or prompt that the gateway picks up on the next request. Over a week of production traffic, the fallback chain stops being a static list and becomes a measured ranking of which provider, at which model, on which prompt template, recovers the task at the held-out quality floor.
- Apache 2.0 single Go binary; Docker, Kubernetes, AWS, GCP, Azure, on-prem, air-gapped, cloud at `gateway.futureagi.com/v1`.
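The two streaming behaviours named in the list above (transparent replay from token zero versus resume from the last sequence number) can be sketched as a small relay generator. This illustrates the idea only and is not Future AGI's implementation: `relay_with_failover`, `UpstreamDisconnected`, and the `mode` values are hypothetical names. The resume path also assumes the fallback provider reproduces the already-delivered prefix, which does not hold in general across different models.

```python
class UpstreamDisconnected(Exception):
    """Raised by a provider stream when the upstream drops mid-response."""


def relay_with_failover(providers, mode="resume"):
    """Yield (sequence_number, token) pairs, switching provider on disconnect.

    providers: zero-argument callables, each returning a token iterator that
               replays the same prompt against one upstream.
    mode: "resume" skips tokens the client already has;
          "restart" replays from sequence number 0 so the client can discard
          its partial output.
    """
    delivered = 0  # count of tokens already sent to the client
    for open_stream in providers:
        skip = delivered if mode == "resume" else 0
        try:
            for index, token in enumerate(open_stream()):
                if index < skip:
                    continue  # already delivered before the failover
                yield index, token
                delivered = index + 1
            return  # stream completed normally
        except UpstreamDisconnected:
            continue  # move to the next provider in the fallback chain
    raise RuntimeError("all providers failed mid-stream")
```

In restart mode the client sees the sequence number drop back to 0, which is the signal to throw away what it has rendered so far. In resume mode the numbering stays strictly increasing.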
Where it falls short.
- Full execution tracing for long-running agents is currently marked “In Progress” on the public roadmap in the Future AGI GitHub repo; the gateway-side OpenTelemetry trace export is live, but multi-step agent traces with deep tool-call nesting still benefit from running traceAI alongside.
- The managed dashboard is functional but less polished than Portkey’s; teams that want a finance-grade failover dashboard out of the box without configuring Grafana will feel the gap on day one.
- Active probe intervals shorter than 10 seconds increase upstream probe cost on niche models with high per-token billing on the probe path; budget the probe load against your provider’s minimum spend tier.
- The held-out quality evaluator is the wedge feature, but it requires a labelled held-out set per use case; teams that haven’t built the held-out set will see the eval-gated fallback degrade to “any healthy provider” until the set is wired up.
- Cluster-mode failover for multi-region gateway replicas is documented but the cross-region health-state replication is closer to “eventually consistent” than a strict consensus; teams running multi-region active-active should over-provision the in-region probe budget rather than rely on cross-region detection.
```python
import os

from openai import OpenAI

client = OpenAI(
    # Read the key from the environment; Python does not expand "$FAGI_API_KEY".
    api_key=os.environ["FAGI_API_KEY"],
    base_url="https://gateway.futureagi.com/v1",
)

# Active probes plus passive aggregation feed an Adaptive routing decision;
# a mid-stream upstream disconnect triggers a sequence-numbered replay
# against the next provider; the held-out quality evaluator gates the
# fallback route on the next request. All at the same network hop.
response = client.chat.completions.create(
    model="adaptive/chat",
    messages=[{"role": "user", "content": "Summarise this support ticket."}],
    stream=True,
    extra_headers={
        "X-FAGI-Idempotency-Key": "req-2026-05-17-abc123",
        "X-FAGI-Fallback-Route": "anthropic/claude-3-5-sonnet,google/gemini-2.5-pro",
        "X-FAGI-Quality-Floor": "0.82",
    },
)

# With stream=True the call returns an iterator of chunks; print tokens as they arrive.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
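The request above passes a caller-chosen idempotency key. The gateway-side behaviour described earlier (request-hash dedup with a TTL, default 60 seconds) could look roughly like the sketch below. `request_hash`, `IdempotencyCache`, and `call_once` are hypothetical helpers and the in-memory dict stands in for whatever store a real gateway uses; treat it as an assumption-laden illustration, not the product's code.

```python
import hashlib
import json
import time


def request_hash(model, messages):
    """Stable hash of the request body; dict key order is normalised first."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class IdempotencyCache:
    """In-memory response cache keyed by idempotency key, with a TTL."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, response)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: treat as a fresh request
            return None
        return response

    def put(self, key, response):
        self._store[key] = (self.clock() + self.ttl_seconds, response)


def call_once(cache, key, upstream_call):
    """Return (response, was_cached); hit the upstream only on a cache miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached, True
    response = upstream_call()
    cache.put(key, response)
    return response, False
```

A retry that arrives inside the TTL with the same key gets the stored response and never reaches the provider, which is what prevents the double charge on a response-path failure.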
Use case fit. Strong for OpenTelemetry-first SRE teams, multi-tenant SaaS on a single mission-critical chat path, agent platforms that need streaming continuity through a provider degradation, fintech with per-customer budget enforcement and audit trail, and platform teams that want eval plus tracing plus gateway in one Apache 2.0 platform with hybrid local and cloud deployment. Less optimal for teams that want a fully managed failover dashboard before writing any infrastructure code.
Pricing and deployment. Apache 2.0 single Go binary; cloud-hosted endpoint at https://gateway.futureagi.com/v1 or self-host (Docker, Kubernetes, air-gapped). SOC 2 Type II, HIPAA, GDPR, and CCPA all certified; BAA available via FAGI sales.
Verdict. The strongest single pick when the 2026 reliability story is “we want sub-15-second detection, streaming continuity through a mid-stream failover, idempotency-key dedup, OTel-native MTTR per route, and a held-out quality floor on the fallback route, in one Apache 2.0 Go binary, without rewriting OpenAI SDK calls or operating a Python proxy.” The self-improving loop closes the part of the chain other gateways leave manual.
Portkey: Best for Composable Fallback DSL
Portkey is the strongest pick when a composable fallback DSL with three named fallback types plus arbitrary $and/$or conditionals is the brief, with the Palo Alto Networks acquisition risk acknowledged.
Portkey’s three named fallback types map cleanly to the three failure classes a 2026 chat path sees in production: general fallback (5xx, transient network error, connection reset), content-policy fallback (guardrail block, refusal classification, safety filter), and context-window fallback (token-limit overflow, max-tokens exceeded). The three types compose with the same conditional metadata DSL Portkey uses for routing (nine operators, two logical operators with arbitrary nesting).
Best for. Multi-tenant SaaS or platform teams that need fine-grained conditional fallback rules (fallback to Anthropic on a content-policy block but to Bedrock on a 5xx) plus a managed dashboard, and that accept the acquisition integration risk announced on April 30, 2026.
Key strengths.
- Three named fallback types (general, content-policy, context-window) plus three composable routing modes (loadbalance with weight normalization and sticky `hash_fields` plus `ttl`, fallback, conditional).
- Most expressive conditional fallback DSL in the gateway category: nine comparison operators (`$eq`, `$ne`, `$in`, `$nin`, `$regex`, `$gt`, `$gte`, `$lt`, `$lte`) plus two logical operators (`$and`, `$or`) with arbitrary nesting, on three queryable namespaces (`metadata.*`, `params.*`, `url.pathname`).
- Per-key, per-virtual-key, per-model, and per-time-window budgets; the most fine-grained native-dashboard hierarchy on this list.
- Open-source gateway core (github.com/Portkey-AI/gateway); production teams self-host the gateway and run the control plane in Portkey cloud.
- Usable native dashboard for failover and fallback attribution by tenant, feature, and route, without writing a custom exporter.
- 250+ provider adapters; the largest adapter library on the audited 2026 set.
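To show how a conditional DSL built from the operators listed above can be evaluated, here is a small recursive matcher. It is a sketch written for this article, not Portkey's code: the rule shape (a dict of `namespace.key` fields mapped to operator dicts, with `$and`/`$or` taking lists) is an assumption modelled on the operator names, and `matches` and `lookup` are hypothetical helpers.

```python
import re

# Comparison operators named in the list above, applied as op(actual, expected).
COMPARATORS = {
    "$eq": lambda a, e: a == e,
    "$ne": lambda a, e: a != e,
    "$in": lambda a, e: a in e,
    "$nin": lambda a, e: a not in e,
    "$regex": lambda a, e: isinstance(a, str) and re.search(e, a) is not None,
    "$gt": lambda a, e: a is not None and a > e,
    "$gte": lambda a, e: a is not None and a >= e,
    "$lt": lambda a, e: a is not None and a < e,
    "$lte": lambda a, e: a is not None and a <= e,
}


def lookup(context, dotted_key):
    """Resolve 'namespace.key' only; deeper paths are treated as one flat key."""
    namespace, _, key = dotted_key.partition(".")
    return context.get(namespace, {}).get(key)


def matches(query, context):
    """Return True when every clause in the query holds for the context."""
    for field, condition in query.items():
        if field == "$and":
            if not all(matches(sub, context) for sub in condition):
                return False
        elif field == "$or":
            if not any(matches(sub, context) for sub in condition):
                return False
        else:
            actual = lookup(context, field)
            for op, expected in condition.items():
                if not COMPARATORS[op](actual, expected):
                    return False
    return True
```

Because `lookup` splits on the first dot only, a nested path such as `metadata.features.failover_v2_enabled` never resolves in this sketch. That mirrors the two-segment restriction, which is why flattening nested flags into top-level metadata keys is the usual workaround.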
Where it falls short.
- Palo Alto Networks announced intent to acquire Portkey on April 30, 2026. The roadmap merges into Prisma AIRS and the PANW press release reframes Portkey as “the central nervous system to monitor, route, and secure every AI transaction.” Standalone gateway status is pending through PANW fiscal Q4 2026. Verify continuity before signing multi-year contracts.
- Streaming continuity through a mid-stream failover is partial: the gateway replays the prompt against the next provider on disconnect, but documented sequence numbers and resume tokens aren’t first-class. Client SDKs see a token-zero replay rather than a continued stream, which is a UX regression on long completions.
- Held-out fallback-route quality is a manual procurement: Portkey ships the fallback decision and the dashboard but not the held-out evaluator. Teams that need a quality floor on the fallback route end up bolting a second product (Future AGI eval, OpenAI eval, or in-house) onto the gateway.
- Observability is dashboard-first; the OpenTelemetry export exists but is less first-class than the native dashboard. OTel-first SRE stacks duplicate cost and MTTR telemetry across two products.
- Published two-segment key limitation in conditional routing applies to conditional fallback as well: only `metadata.<key>` and `params.<key>` work; nested paths like `metadata.features.failover_v2_enabled` are explicitly not supported. Teams that need nested-key conditionals (common in multi-tenant routing) hit this on day one.
Use case fit. Strong for multi-tenant SaaS, fintech with per-customer fallback attribution, and platform teams running multiple AI products. Less optimal where streaming continuity through mid-stream failover, OpenTelemetry-native MTTR, or a held-out quality floor on the fallback route are load-bearing.
Pricing and deployment. Open-source core (self-host) plus commercial cloud control plane; Developer free, Production $49/month, Enterprise custom. Install with docker run portkeyai/gateway or via the Helm chart.
Verdict. Most expressive composable fallback DSL in 2026. Choose with eyes open on the Palo Alto integration; the next twelve months will tell whether the standalone gateway product survives the Prisma AIRS merge.
LiteLLM: Best for Python-First Teams Post-CVE
LiteLLM is the Python-first proxy that broke open the multi-provider unified-API category. It exposes 100+ providers behind OpenAI-compatible endpoints and ships three documented fallback types. After the March 24, 2026 supply-chain incident, the answer to “is it still safe to adopt?” is “yes, with commit pinning or an upgrade past 1.83.7.”
Best for. Python-first teams already running a FastAPI or uvicorn surface, that want broad provider coverage, three documented fallback types, and a fork-friendly license, and that accept commit pinning plus install-path audit after the TeamPCP supply-chain campaign.
Key strengths.
- Three documented fallback types (general, content-policy, context-window) plus six named routing strategies (`simple-shuffle`, `least-busy`, `latency-based-routing`, `rate-limit-aware`, `rate-limit-aware-v2`, `cost-based-routing`) plus a custom adapter (`CustomRoutingStrategyBase`).
- Broadest provider coverage of any single project on this list (100+ providers).
- Apache 2.0 outside the enterprise dir; trivial to fork or audit.
- Virtual keys with per-key budgets; budget alerts.
- Native fit with Python observability stacks (Prometheus exporter, OpenTelemetry middleware).
- Active maintainer community; easy to extend with custom retry adapters and fallback policies.
- Customer logos include Netflix and Lemonade.
Where it falls short.
- March 24, 2026 PyPI supply-chain compromise. Versions `1.82.7` and `1.82.8` were published by the TeamPCP threat actor after a Trivy GitHub Action leaked the PyPI publishing token; the malicious package exfiltrated SSH keys, cloud credentials, and Kubernetes configs via a `litellm_init.pth` payload that survives package uninstall, per the Datadog Security Labs writeup and the LiteLLM security advisory. The packages were live for about 40 minutes on PyPI before quarantine. The official Docker image `ghcr.io/berriai/litellm` was NOT impacted. Pin to 1.82.6 or earlier or upgrade past 1.83.7, scan dependency trees, and rotate any credentials accessible to an affected install.
- Streaming continuity through a mid-stream failover is partial. The proxy replays the prompt on disconnect, but mid-stream sequence numbers aren’t first-class and the issue queue around mid-stream failover has been open for multiple release cycles. Client SDKs that depend on continuous token deltas through a failover see a token-zero replay rather than a resumed stream.
- Published warning from LiteLLM’s own docs: “Usage-based routing isn’t recommended for production due to performance impacts” and “adds significant latency due to Redis operations.” The same caveat applies to the rate-limit-aware fallback path under high concurrency.
- Idempotency-key surface exists but enforcement varies by retry path; double-billing prevention isn’t first-class across every failover branch.
- Python runtime; materially slower throughput than Go-binary alternatives at high concurrency.
- Held-out fallback-route quality is a manual procurement; LiteLLM ships the fallback decision but not the evaluator.
Use case fit. Strong for Python-first teams, ML platform teams that already manage Python services, and teams whose buying constraint is broad provider coverage in a fork-friendly license. Less optimal where throughput at over 10,000 req/s matters, where streaming continuity through mid-stream failover is load-bearing, or where a managed runtime forbids commit-pinned dependencies.
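Pricing and deployment. Apache 2.0 outside the enterprise dir; install a release at 1.83.7 or later from a requirements file that pins the exact version and its hashes (pip install --require-hashes -r requirements.txt, since pip's hash-checking mode rejects range specifiers such as >=) against a private mirror, or pull the container image with Sigstore verification. Enterprise cloud tier exists.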
Verdict. Still the broadest provider coverage on the list, with three named fallback types that compose against the production failure shapes. The March 2026 supply-chain incident shifts it from “default failover pick” to “pin commits, sign artifacts, audit installs, and accept the partial streaming continuity story.” Pair with Sigstore verification and dependency-pinning enforcement.
Maxim Bifrost: Best for Go Throughput With a 4-State Health Machine
Maxim Bifrost is the Go-native gateway from Maxim, Apache 2.0, with vendor-published P50 of about 11 microseconds at 5,000 RPS on t3.xlarge (Maxim’s own harness with a mock 60 ms OpenAI response) and a documented 4-state Adaptive Load Balancing health machine.
The 4-state health machine is the wedge for failover specifically: Healthy (under 2 percent error rate), Degraded (at 2 percent error threshold, traffic continues but flagged), Failed (above 5 percent error or TPM exceeded, removed from rotation), and Recovering (90 percent penalty reduction over 30 seconds with 25 percent exploration probability for probing).
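The thresholds above can be read as a small classification function. This is a sketch under stated assumptions, not Bifrost's implementation: the `Health` enum, the `classify` function, and the `was_failed` flag are hypothetical names, and the treatment of error rates between 2 and 5 percent as Degraded is inferred from the thresholds quoted above rather than documented behaviour.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    FAILED = "failed"
    RECOVERING = "recovering"


def classify(error_rate: float, tpm_exceeded: bool = False,
             was_failed: bool = False) -> Health:
    """Map a rolling error rate (0.0 to 1.0) to one of the four states."""
    if error_rate > 0.05 or tpm_exceeded:
        return Health.FAILED        # removed from rotation
    if was_failed:
        # Assumption: a route that just left Failed is probed before
        # it is trusted with full traffic again.
        return Health.RECOVERING
    if error_rate >= 0.02:
        return Health.DEGRADED      # traffic continues, route is flagged
    return Health.HEALTHY
```

The penalty decay and the exploration probability on the Recovering state are left out of the sketch; they would sit in whatever component chooses between routes once the state is known.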
Best for. Go shops whose binding constraint is failover throughput at high concurrency plus a documented health-state machine, plus teams running multi-MCP-server agentic workflows that need cluster-mode failover.
Key strengths.
- Vendor-published benchmark showing roughly 11 microsecond mean P50 at 5,000 RPS on `t3.xlarge` (Maxim’s own harness with a mock 60 ms upstream).
- Documented 4-state Adaptive Load Balancing scoring formula: Error Penalty 50 percent, Latency 20 percent, Utilization 5 percent, plus a Momentum component, with the four health states named above. 25 percent exploration probability for recovery probing.
- Apache 2.0 single Go binary; zero-configuration startup; drop-in container or binary deployment.
- Cluster-mode failover for horizontal scale across Bifrost replicas; cluster-aware idempotency keys.
- Per-stream buffer plus cluster-mode replay for streaming SSE through a mid-stream failover.
- Hierarchical budgets at four levels (Customer, Team, Virtual Key, Provider) with configurable reset cycles.
- Native MCP Code Mode where agents write Starlark Python in a sandbox; supports STDIO, HTTP, and SSE transports.
- Dual-layer caching (exact hash plus vector similarity) plus vault integrations across HashiCorp, AWS, GCP, and Azure.
Where it falls short.
- Maxim self-ranks Bifrost first across its own multi-model routing and failover listicles (six audited articles, zero published Bifrost limitations); the absence of published self-criticism is a trust signal procurement teams will weigh alongside the engineering claims.
- Throughput claims are vendor-published on a mock 60 ms harness; no independent reproduction. The 11 microsecond P50 is a baseline number, not a settled benchmark.
- Observability and MTTR dashboards are thinner than Portkey’s; teams that need a finance-grade failover dashboard write their own.
- Held-out fallback-route quality is a manual procurement; Bifrost ships the routing decision and the health-state machine but not the quality evaluator. The “self-improving” loop isn’t closed at the product layer.
- Routing strategy count varies between three and six across Maxim’s six SERP-ranking articles; the canonical taxonomy lives only in the public `docs.getbifrost.ai/enterprise/adaptive-load-balancing` page, and the failover behaviour documentation is similarly fragmented across the docs site and the marketing pages.
- The 25 percent exploration probability on the Recovering state is aggressive on cost-sensitive paths; teams on a paid-per-call upstream see the exploration probe cost as a line item.
Use case fit. Strong for Go shops, high-throughput inference paths, and multi-MCP-server agentic workflows that need cluster-mode failover. Less optimal where OpenTelemetry-native MTTR per route, a held-out quality floor on the fallback route, or a single self-improving loop are the brief.
Pricing and deployment. Apache 2.0 single Go binary; Docker, Helm; commercial cloud tier exists via Maxim. Install with docker run maximhq/bifrost.
Verdict. Strong health-state machine and throughput numbers, with real engineering credibility on the failover decision. Choose Bifrost when raw Go throughput and the 4-state health machine are the binding constraints; choose Future AGI Agent Command Center when OTel-native MTTR per route, streaming continuity with sequence numbers, and a held-out quality floor on the fallback route are.
Kong AI Gateway: Best for Service-Mesh Shops
Kong AI Gateway is the LLM extension to the Kong API gateway, Apache 2.0 OSS core, that treats every upstream LLM as another service in the Kong control plane.
The pitch: if your platform team already runs Kong for REST API ingress, North-South traffic management, and the Kong service-mesh control plane, the AI Gateway plugin layer makes LLM failover a configuration concern on the same control plane as your existing services. The same active health checks, the same retry policies, the same upstream rotation, the same dashboards.
Best for. Platform teams already running Kong for REST APIs that want LLM failover as another upstream plugin layer on the existing control plane, plus teams whose binding constraint is a single ops surface across REST APIs, gRPC services, and LLM providers.
Key strengths.
- Inherits Kong’s active health-check primitive: configurable interval, configurable threshold, configurable timeout, configurable upstream rotation. The same primitive that drives REST-API failover in the existing platform.
- Inherits Kong’s idempotency plugin: request-hash dedup with configurable TTL across the failover path.
- Apache 2.0 Kong Gateway OSS core; commercial Konnect cloud tier for the multi-region control plane.
- Service-mesh integration for teams running Kong Mesh: LLM upstream service rotation participates in the mesh control plane.
- Per-consumer rate limits, per-consumer budgets, per-consumer routing policies on the Kong consumer model.
- Tight integration with the existing Kong observability stack (Datadog, Splunk, Prometheus, OpenTelemetry).
Where it falls short.
- Detection latency is the slowest on the audited 2026 set: typical 30 to 60 seconds because the active health check default interval is longer than the LLM-specialised gateways’ (10 to 30 seconds) and the passive aggregation window is tuned for REST-API patterns rather than LLM-specific error shapes (429 frequency, content-policy refusals, slow-token streams).
- Streaming continuity through a mid-stream failover is replay-on-disconnect; no documented sequence numbers or resume tokens. The Kong proxy is fundamentally an HTTP proxy, and the LLM SSE patterns sit on top.
- Held-out fallback-route quality isn’t in scope: Kong ships the failover decision and the control plane, not the quality evaluator. The “fallback to a cheaper-but-good route” decision is a separate procurement.
- Provider-rotation strategy is the Kong upstream-rotation primitive (weighted round-robin, consistent hashing, least-connections); the named routing strategies common in LLM-specialised gateways (Adaptive, Cost Optimized, Race) require custom plugin code or are out of scope.
- Setup time is materially longer than the LLM-specialised gateways: configuring Kong upstreams, services, routes, plugins, and the AI Gateway plugin layer on the existing Kong control plane is a multi-hour exercise on day one.
- Provider list is narrower than competitors at the gateway layer; the AI Gateway plugin supports the major commercial providers but the long-tail OpenAI-compatible servers require custom upstream config.
Use case fit. Strong for teams already running Kong Gateway or Kong Mesh at scale, where the ops surface unification across REST APIs and LLM providers is the binding constraint. Less optimal for greenfield LLM gateways, for teams that need sub-15-second detection, for streaming-critical chat paths, and for buyers whose binding constraint is a held-out quality floor on the fallback route.
Pricing and deployment. Apache 2.0 Kong Gateway OSS core (self-host on Docker, Kubernetes); commercial Konnect cloud tier for the multi-region control plane.
Verdict. The right pick when the buying team is the existing Kong platform team and unification on the Kong control plane is the brief. Not the pick when sub-15-second detection, sequence-numbered streaming continuity, or a held-out fallback-route quality floor are load-bearing.
Decision Framework: Which Failover Gateway Is Right For You?
The buyer profile drives the pick more than the feature matrix does.
OpenTelemetry-first SRE teams running OpenAI, Anthropic, and Bedrock on a single mission-critical chat path pick Future AGI Agent Command Center. Multi-tenant SaaS teams that need composable conditional fallback rules pick Portkey. Python-first ML platform teams pick LiteLLM with commit pinning. Go shops at 5,000 RPS or higher pick Bifrost. Existing Kong shops pick Kong AI Gateway.
| If you are a… | Pick | Why |
|---|---|---|
| OpenTelemetry-first SRE running 100+ providers on a mission-critical chat or agent path | Future AGI Agent Command Center | 5 to 15 s detection plus sequence-numbered streaming continuity plus idempotency keys plus OTel-native MTTR plus held-out fallback quality eval in one Apache 2.0 Go binary |
| Fintech with per-customer budget enforcement plus failover audit trail | Future AGI Agent Command Center | Per-VK budgets plus tag-based enforcement plus span-level failover and fallback attribution |
| Air-gapped or on-prem regulated SRE environment | Future AGI Agent Command Center or Bifrost | Apache 2.0 single binary; Docker, Kubernetes, air-gapped |
| Multi-tenant SaaS routing by customer with conditional fallback rules | Portkey | Three named fallback types plus 9-operator conditional DSL (verify PANW integration timeline) |
| Python-first ML platform team | LiteLLM (1.82.6 pin or 1.83.7+) | Three documented fallback types; pin commits after the March 2026 PyPI compromise |
| Go shop with 5,000 RPS or higher and a documented health-state machine | Maxim Bifrost | Vendor-published 11 microsecond P50 plus 4-state health machine plus cluster-mode failover |
| Team running Claude Code or multi-MCP-server agentic workflows with cluster-mode failover | Maxim Bifrost | Cluster-mode failover plus native MCP Code Mode across STDIO, HTTP, SSE |
| Existing Kong shop wanting LLM failover on the same control plane | Kong AI Gateway | Reuses Kong active health-check, retry policy, upstream rotation, idempotency plugin |
| Streaming-critical chat path with mid-stream failover requirement | Future AGI Agent Command Center | Per-stream sequence-numbered buffers plus resume tokens; transparent replay or client-visible resume |
| Team that wants a held-out quality floor on the fallback route | Future AGI Agent Command Center | Self-improving loop: eval score on fallback route feeds agent-opt; revised routing rule picked up on next request |
LLM failover and fallback in 2026 isn’t a single feature. It’s a stack: provider health detection, idempotency-key request dedup, deterministic versus probabilistic failover decision, streaming continuity through a failover, OpenTelemetry-native MTTR telemetry, and a held-out quality floor on the fallback route, running at the same network hop, under a license that isn’t about to be re-platformed inside an acquirer.
Common Implementation Mistakes
Five mistakes account for most of the bad failover incidents we have seen on the 2026 audited set.
- Treating failover as an SDK retry loop. The OpenAI SDK retry loop resends the request to the same provider on a 5xx. It doesn’t know about Anthropic, Bedrock, or Vertex. Teams that “have failover” because the SDK retries still sit through the full outage window when the upstream is down. The gateway is the only layer that can absorb a single-provider failure into a 15 to 45 second blip.
- No idempotency key on the retry path. The naive retry on a 504 timeout against the next provider can double-bill if the first call actually completed upstream but failed on the response path. The 2024 OpenAI incident triggered exactly this failure mode for teams retrying naively. Idempotency keys with a 5 to 60 second TTL prevent it.
- Stream replay from token zero on every failover. On long completions (multi-paragraph summaries, multi-tool agent runs), a restart at token zero is visible to the user. The gateway needs sequence numbers plus resume tokens or the UX regresses to “the chat just resets” on every provider blip.
- Static fallback chain with no quality floor. “Anthropic on OpenAI 5xx, Gemini on Anthropic 5xx” is fine until the prompt template that worked on Claude-3-5-Sonnet returns junk on Gemini-2.0-Flash. The fallback chain needs a held-out quality eval; otherwise the gateway silently degrades the user’s task on every fallback firing.
- No MTTR telemetry per route. Postmortems without per-route MTTR turn into log-scrape exercises. The gateway should emit span-level `failover_reason`, `fallback_model`, and `time_to_recovery_seconds` so the SRE team can audit the routing decision on the next incident, not the next quarterly review.
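The second mistake in the list, retrying without an idempotency key, is the easiest one to picture in code. Below is a minimal in-memory sketch of request-hash dedup with a TTL. `IdempotencyStore`, `call_once`, and the `send` callable are hypothetical names, the 60-second default is simply the top of the 5 to 60 second range quoted above, and a production gateway would need shared storage and in-flight tracking that this sketch omits.

```python
import hashlib
import json
import time


class IdempotencyStore:
    """Remembers responses by request hash for a limited time."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen = {}  # key -> (stored_at, response)

    @staticmethod
    def key_for(payload: dict) -> str:
        # Canonical JSON so that key order does not change the hash.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def lookup(self, key: str):
        entry = self._seen.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at > self.ttl:
            del self._seen[key]  # expired, treat as never seen
            return None
        return response

    def record(self, key: str, response) -> None:
        self._seen[key] = (self.clock(), response)


def call_once(store: IdempotencyStore, payload: dict, send):
    """Return the stored response for a repeated payload instead of re-sending."""
    key = store.key_for(payload)
    cached = store.lookup(key)
    if cached is not None:
        return cached
    response = send(payload)
    store.record(key, response)
    return response
```

A retry that arrives inside the TTL with the same payload gets the recorded response and never reaches a second provider; after the TTL the request is treated as new.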
Future AGI Implementation Walk-through
The five-layer reliability stack lives inside one Apache 2.0 Go binary in Future AGI Agent Command Center. The wedge over every other gateway in the audited 2026 set is the closed self-improving loop on the fallback route quality.
The flow on a single failover event:
- Detection. Active health probe at the 10-second interval flags a 504 timeout from the OpenAI `chat.completions` endpoint. Passive aggregator records a 3.5 percent error rate over the 30-second rolling window. Both signals flip the OpenAI route to Degraded at 11 seconds. The active probe at the next 10-second interval confirms the upstream is still 504; the route flips to Failed at 21 seconds total. Detection-to-rotation latency: 11 seconds.
- Failover decision. The next request lands on the Adaptive router. The Adaptive scoring marks OpenAI at zero health, Anthropic at 0.94, Gemini at 0.89. The request routes to Anthropic.
- Idempotency. The request-hash idempotency key (`X-FAGI-Idempotency-Key`) is set on the client side; the gateway records the in-flight hash. The retry-on-failover branch checks the idempotency window (60 seconds default) and confirms the OpenAI call didn’t complete upstream. The retry against Anthropic is safe.
- Streaming continuity. The request is `stream=true`. The gateway buffers the assistant-turn tokens with sequence numbers. On the mid-stream OpenAI disconnect at sequence 47, the gateway replays the prompt against Anthropic; the client SDK sees a transparent replay from token zero (the safe default) or, with the `X-FAGI-Stream-Resume: true` header, continues from sequence 48.
- Fallback quality eval. The Anthropic response runs through the held-out quality-floor evaluator at the gateway egress. The eval score (0.91) passes the configured floor (0.82). The response ships to the client. If the score had dropped below the floor, the next request would have routed to Gemini, and the failed Anthropic decision would have been logged with `fallback_quality_below_floor=true`.
- MTTR telemetry. The full event ships as one OpenTelemetry span with `failover_route=anthropic/claude-3-5-sonnet`, `failover_reason=openai_5xx_health_failed`, `time_to_recovery_seconds=22`, plus Prometheus metrics on `/-/metrics` (`fagi_failover_total{from="openai",to="anthropic",reason="5xx"}`).
- Self-improving loop. The trace plus eval score plus failover decision feed agent-opt overnight. The optimizer notices the Anthropic route consistently passes the held-out floor on this prompt template and produces a revised routing rule: prefer Anthropic for this template on OpenAI 5xx, demote Gemini to position three. The gateway picks up the revised rule on the next deploy.
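The streaming-continuity step in the flow above comes down to numbering each chunk and choosing, on failover, whether numbering restarts or continues. The sketch below illustrates only that bookkeeping. `SequencedStream` and `failover` are hypothetical names rather than the gateway's API, and how the fallback provider is prompted to continue a partial completion is outside the sketch.

```python
class SequencedStream:
    """Buffers streamed tokens with 1-based sequence numbers."""

    def __init__(self):
        self.events = []  # list of (seq, token)

    def push(self, token: str):
        seq = len(self.events) + 1
        event = (seq, token)
        self.events.append(event)
        return event

    def last_seq(self) -> int:
        return self.events[-1][0] if self.events else 0

    def since(self, seq: int):
        """Events a client that last saw `seq` has not received yet."""
        return [e for e in self.events if e[0] > seq]


def failover(stream: SequencedStream, fallback_tokens, resume: bool):
    """Number the fallback provider's tokens after a mid-stream disconnect."""
    if not resume:
        # Safe default: drop the buffer so the client sees a replay from token zero.
        stream.events.clear()
    return [stream.push(t) for t in fallback_tokens]
```

With 47 chunks already buffered, `failover(stream, tokens, resume=True)` numbers the first fallback chunk 48, while `resume=False` restarts at 1, which corresponds to the two client-visible behaviours described in the streaming step.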
This is the part other gateways leave manual. The same eval that flags a low-quality fallback also produces the labelled dataset agent-opt uses to revise the routing rule. The gateway becomes a closed-loop self-improvement layer on reliability, not a static fallback list.
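To make the quality-floor gate concrete, here is a rough sketch of the bookkeeping it implies: score the fallback response, compare it to the floor, and demote a route that falls below it so the next request moves on. `QualityGate` and the `eval_score` and `fallback_route` attribute keys are hypothetical, the 0.82 floor and 0.91 score mirror the walk-through, and the evaluator itself is assumed to exist elsewhere.

```python
class QualityGate:
    """Tracks which fallback routes have passed a held-out quality floor."""

    def __init__(self, routes, floor=0.82):
        self.routes = list(routes)   # ordered fallback preference
        self.floor = floor
        self.demoted = set()

    def current_route(self):
        # First route in preference order that has not been demoted.
        for route in self.routes:
            if route not in self.demoted:
                return route
        return None

    def record(self, route: str, score: float) -> dict:
        """Record an eval score and return attributes to attach to the trace."""
        below = score < self.floor
        if below:
            self.demoted.add(route)  # the next request skips this route
        return {
            "fallback_route": route,
            "eval_score": score,
            "fallback_quality_below_floor": below,
        }


gate = QualityGate(["anthropic", "gemini"])
attrs = gate.record("anthropic", 0.91)
```

The dictionaries returned by `record` are the kind of labelled rows an offline optimizer could consume; how agent-opt actually uses them is not shown in the source and is not modelled here.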
The full implementation walk-through is in the Agent Command Center docs; the OTel surface is in the Future AGI observability docs; the held-out evaluator that gates the fallback is in the Future AGI Evaluation docs; Protect’s sub-100 ms inline guardrails are in the Future AGI Protect docs.
Which AI Gateway Is Right For LLM Failover and Fallback in 2026?
LLM failover and fallback in 2026 is no longer “does the SDK retry on 5xx.” It’s a stack of decisions: how fast the gateway detects, whether the failover is deterministic or probabilistic, whether streaming continuity survives a mid-stream failover, whether the retry path is idempotent, what the MTTR telemetry surface looks like, and whether the fallback route has a held-out quality floor, all evaluated together at the same network hop.
Of the five gateways above, Future AGI Agent Command Center is the strongest pick when the buying constraint is Apache 2.0 plus OpenAI compat plus 5 to 15 second detection plus sequence-numbered streaming continuity plus idempotency keys plus OTel-native MTTR per route plus a held-out quality floor on the fallback route in one Go binary you can self-host, with no pending acquisition.
Portkey is the right call when the three named fallback types plus the 9-operator conditional DSL surface is the brief and the Palo Alto integration timeline is acceptable. LiteLLM is the right call for Python-first teams that can pin to 1.82.6 or upgrade past 1.83.7 after the March 2026 PyPI supply-chain compromise and that accept the partial streaming continuity story.
Maxim Bifrost is the right call when vendor-published P50 at 5,000 RPS plus the 4-state health machine plus cluster-mode failover are the binding constraints. Kong AI Gateway is the right call when the existing Kong platform is the buying team and the unification on a single control plane is the brief, with the slower detection window accepted.
For deeper reads on the patterns referenced above: the Agent Command Center docs, the Future AGI GitHub repo, the Future AGI observability docs, the Future AGI Protect docs, the Future AGI Evaluation docs for the held-out evaluator that pairs with the fallback route, the OpenTelemetry GenAI semantic conventions, the Datadog Security Labs writeup of the LiteLLM PyPI compromise, and the Palo Alto Networks announcement of intent to acquire Portkey.
Try Future AGI Agent Command Center free: 5 to 15 second detection, sequence-numbered streaming continuity through mid-stream failover, idempotency-key dedup, OpenTelemetry-native MTTR per route, and a held-out quality floor on the fallback route in one Apache 2.0 Go binary.
Related reading
- Best 5 AI Gateways for LLM Cost Optimization in 2026, the five-layer cost stack and the 2026 trust cohort
- Best 7 AI Gateways for Multi-Model Routing in 2026, how cost-quality routing decisions get made at the gateway hop
- Best 5 AI Gateways for Prompt Management in 2026, the prompt-management gateway picks
- Best 5 AI Gateways for Semantic Caching in 2026, the semantic cache deep-dive across the cohort
Frequently asked questions
What Is the Difference Between LLM Failover and LLM Fallback?
How Fast Should a Gateway Detect Provider Health Degradation in 2026?
Can a Gateway Preserve a Streaming Response Through a Failover?
Do AI Gateways Use Idempotency Keys to Prevent Duplicate LLM Charges?
What Is a Realistic MTTR Target for LLM Failover in 2026?
How Does Future AGI's Self-Improving Loop Improve Fallback Quality Over Time?