Best 5 AI Gateways for Rate Limiting LLM Calls in 2026
Five AI gateways for rate limiting LLM calls in 2026 scored on the seven-axis rate-limit rubric, provider-tier awareness, fair-share enforcement, and observability of rate-limit events.
Originally published May 17, 2026.
A platform team running a multi-tenant document copilot shipped a global 600 RPM cap on Friday and woke up to two incidents on Monday: a paying enterprise customer who burned the entire quota with one batch job, and three smaller customers who saw a wall of 429s and churned. The gateway never heard about Anthropic tier 3, never queued the burst, never enforced fair-share, and never emitted a rate-limit event the platform team could correlate. This guide compares the five AI gateways production platform teams should choose between in 2026 for rate limiting LLM calls, scored on the seven-axis rate-limit rubric, provider-tier awareness, and back-pressure signaling with primary sources.
TL;DR: 5 Gateways Scored on the Seven-Axis Rate-Limit Rubric and the 2026 Trust Cohort
Future AGI Agent Command Center is the strongest single pick for rate limiting LLM calls in 2026 because it bundles per-key, per-tenant, per-user, and per-feature limits with weighted fair-share scheduling, provider-tier awareness for Anthropic tier 1 to 4 RPM and OpenAI organization-level RPM and TPM, sliding-window plus token-bucket algorithms, queue-or-reject back-pressure with Retry-After and RateLimit-Remaining headers, and OpenTelemetry-native rate-limit event export in one Apache-2.0 Go binary you can self-host. A global request-per-minute counter no longer cuts it; the four axes that separate a 2026 rate-limiting gateway from a 2024 LLM proxy are granularity (who is being limited), algorithm correctness (avoid the fixed-window boundary spike), fair-share enforcement (Gini under 0.3 on healthy traffic), and observability of the rate-limit event itself.
| # | Platform | Best for | 2026 event you should know |
|---|---|---|---|
| 1 | Future AGI Agent Command Center | Per-key + per-tenant + per-user + per-feature + provider-tier-aware limits with weighted fair-share + OTel-native rate-limit events in one Apache-2.0 Go binary | Apache 2.0 single Go binary; Protect at roughly 65 ms (arXiv 2510.13351); no pending acquisition |
| 2 | Portkey | Managed dashboard with per-virtual-key rate limits + 4-tier budget hierarchy + native cost-and-rate dashboard | Palo Alto Networks announced intent to acquire on April 30, 2026; close expected PANW fiscal Q4 |
| 3 | Kong AI Gateway | Platform teams already running Kong who want enterprise-grade rate-limit plugins (Advanced, Graph, Local, Cluster, Redis) on the AI route | Kong 3.8+ ships the AI gateway plugins; enterprise tier for the Advanced limiter |
| 4 | LiteLLM (post-incident pinned) | Python-first teams pinning a commit or upgrading past 1.83.7 who want virtual keys with per-key RPM and TPM budgets | TeamPCP PyPI supply-chain compromise of versions 1.82.7 and 1.82.8 on March 24, 2026 |
| 5 | Cloudflare AI Gateway | Edge-first teams that want a global rate-limit on the CDN hop ahead of the upstream provider | Free tier with global per-account limits; less aware of upstream provider-tier ceilings |
The 5 Rate-Limiting Gateways at a Glance
The five cover every rate-limit shape platform teams actually ship in 2026: an Apache-2.0 open-source platform with the seven-axis rubric covered in one binary (Future AGI), a managed dashboard with acquisition pending (Portkey), an enterprise plugin gateway already running for REST traffic (Kong AI Gateway), a Python proxy under remediation (LiteLLM), and an edge limiter (Cloudflare AI Gateway).
| Superlative | Tool |
|---|---|
| Best overall for rate limiting | Future AGI Agent Command Center: per-key + per-tenant + per-user + per-feature + provider-tier-aware limits with weighted fair-share + OTel-native rate-limit events in one Apache-2.0 Go binary |
| Best for OpenAI-compat drop-in | Future AGI Agent Command Center: base_url swap against the existing OpenAI SDK; no SDK rewrite |
| Best for sub-5 ms rate-limit overhead | Future AGI Agent Command Center: in-memory token-bucket plus Redis sliding-window in parallel; rate-limit decision under 5 ms p99 |
| Best for native rate-limit dashboard | Portkey: per-virtual-key rate limits surfaced in the managed dashboard |
| Best for teams already running Kong | Kong AI Gateway: enterprise rate-limit plugins on the AI route |
| Best for Python-first teams post-incident | LiteLLM (1.82.6 pin or 1.83.7+ upgrade): virtual keys with per-key RPM and TPM |
| Best for edge global rate-limit | Cloudflare AI Gateway: account-level edge limit ahead of upstream provider |
| Best for self-hosted or air-gapped | Future AGI Agent Command Center: Apache 2.0; Docker, Kubernetes, air-gapped |
| # | Platform | Best for | License + deployment |
|---|---|---|---|
| 1 | Future AGI Agent Command Center | Per-key + per-tenant + per-user + per-feature limits with weighted fair-share + provider-tier awareness + OTel-native rate-limit events | Apache 2.0; cloud at gateway.futureagi.com/v1 or self-host (Docker, Kubernetes, air-gapped) |
| 2 | Portkey | Managed dashboard with per-virtual-key rate limits + 4-tier budget hierarchy | MIT (open-source gateway) + cloud control plane; PANW acquisition pending |
| 3 | Kong AI Gateway | Platform teams already running Kong who want enterprise rate-limit plugins on the AI route | Apache 2.0 core + commercial Enterprise tier; self-host or Kong Konnect cloud |
| 4 | LiteLLM (post-incident pinned) | Python-first teams pinning a known-good commit; virtual keys with RPM and TPM | MIT (the enterprise dir is licensed separately); pip install |
| 5 | Cloudflare AI Gateway | Edge limiter ahead of the upstream provider | Closed source; Cloudflare cloud only |
Helicone is intentionally not in the ranked list. As of March 3, 2026 it has been acquired by Mintlify and the public roadmap has shifted toward a documentation-platform-first stance. Teams already on Helicone should treat it as a planned migration, not a continued procurement.
How Did We Score AI Gateways for Rate Limiting LLM Calls?
We used the Future AGI Rate-Limit Scorecard, tuned for platform-engineering procurement. Most 2026 rate-limit listicles score on “does it have a rate limit” and stop there. They don’t score on whether the limit is per-tenant or per-key, whether the algorithm is sliding-window or fixed-window, whether the gateway is aware of the upstream provider tier, whether the back-pressure signal reaches the client, or whether the platform team can observe the rate-limit event.
The scorecard below defines the seven axes; the capability matrix that follows expands them into fourteen comparison rows, including the four axes that decide whether the gateway actually keeps tenants fair and abusers contained in production.
| # | Axis | What we measure (rate-limit lens) |
|---|---|---|
| 1 | Rate-limit granularity | Per-key, per-tenant, per-user, per-feature, per-model, per-route; combinable multi-axis enforcement |
| 2 | Algorithm | Token-bucket, leaky-bucket, sliding-window, fixed-window; series composition; algorithm correctness on the boundary |
| 3 | Burst handling | Queue depth (bounded), reject-after-queue, hybrid; queue-shed policy; soft-cap and hard-cap separation |
| 4 | Fair-share between tenants | Weighted fair queuing; tenant priority class; Gini coefficient observation; starvation guards |
| 5 | Provider-tier awareness | Anthropic tier 1 to 4 RPM ceilings; OpenAI organization-level RPM and TPM ceilings; dynamic headroom propagation to the limiter |
| 6 | Back-pressure signaling | 429, Retry-After, RateLimit-Remaining, RateLimit-Reset headers; streaming-aware back-pressure (SSE); client retry guidance |
| 7 | Observability of rate-limit events | OpenTelemetry span per decision; Prometheus counters by tenant and decision; queue-depth histogram; fairness Gini gauge |
Axes 1, 4, 5, and 7 are the four that decide whether the gateway actually does platform-engineering work in production. The right priority depends on the buyer profile (multi-tenant SaaS versus internal multi-feature platform versus regulated workload with contractual RPM SLAs).
The 14-Dimension Capability Matrix the Rate-Limit SERP Is Missing
Across the five gateways below, Future AGI Agent Command Center leads on combined granularity, algorithm correctness, fair-share, provider-tier awareness, and rate-limit observability. Kong AI Gateway wins on enterprise plugin maturity. Portkey wins on managed dashboard polish. LiteLLM wins on Python-native ergonomics. Cloudflare AI Gateway wins on edge global enforcement.
| Capability | Future AGI ACC | Portkey | Kong AI Gateway | LiteLLM | Cloudflare AI GW |
|---|---|---|---|---|---|
| Per-key limits | Yes | Yes | Yes | Yes | Yes (account-level) |
| Per-tenant limits | Yes (custom property) | Yes (virtual key tier) | Yes (consumer plugin) | Yes (custom property) | No (account only) |
| Per-user limits | Yes (header-based) | Yes (metadata) | Yes (consumer plugin) | Partial (manual) | No |
| Per-feature limits | Yes (route + tag) | Yes (route + tag) | Yes (route plugin) | Partial (tag) | No |
| Token-bucket | Yes | Yes | Yes (Advanced) | Yes | Yes |
| Sliding-window | Yes | Yes | Yes (Advanced) | Partial | Yes |
| Fixed-window | Yes | Yes | Yes | Yes | Yes |
| Burst queue depth | Configurable (bounded) | Configurable | Configurable | Manual | Edge-only |
| Weighted fair-share | Yes | Yes | Yes (Enterprise) | Manual | No |
| Provider-tier awareness | Yes (Anthropic + OpenAI) | Partial | Manual | Manual | No |
| 429 + Retry-After | Yes | Yes | Yes | Yes | Yes |
| RateLimit-Remaining headers | Yes (IETF RateLimit headers draft) | Yes | Yes (Advanced) | Partial | Yes |
| OTel rate-limit span | Yes | Partial | Yes (OTel plugin) | Partial | No |
| Prometheus rate-limit counters | Yes (/-/metrics) | Partial | Yes | Yes | No |
The shape of the matrix is the shape your buying decision will be: no gateway wins every row, and the five capabilities that matter most for rate limiting (per-tenant + per-user granularity, sliding-window correctness, weighted fair-share, provider-tier awareness, and OTel observability) are where the field separates.
How AI Gateways Actually Rate Limit LLM Calls in Production
AI gateways enforce rate limits across seven axes (granularity, algorithm, burst, fair-share, provider-tier awareness, back-pressure, observability), and a real production rate-limit win comes from composing four or five of them correctly, not from setting one number.
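As a minimal sketch of what composing axes means in practice, the snippet below checks one request against several token buckets (for example one per tenant and one per user) and admits it only when every bucket has capacity. The class name `TokenBucket`, the function `allow`, and the rates used are illustrative assumptions, not the API of any gateway reviewed here.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Token bucket: refills at `rate` tokens per second up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = 0.0
    updated: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start full so an idle caller can burst

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now


def allow(buckets: list[TokenBucket], now: float, cost: float = 1.0) -> bool:
    """Admit the request only if every axis has enough tokens.

    Tokens are deducted from all buckets or from none, so a rejection on
    one axis does not silently drain the others.
    """
    for bucket in buckets:
        bucket.refill(now)
    if any(bucket.tokens < cost for bucket in buckets):
        return False
    for bucket in buckets:
        bucket.tokens -= cost
    return True


# Hypothetical limits: 10 RPM for the tenant, 3 RPM for one user inside it.
tenant = TokenBucket(rate=10 / 60, capacity=10)
user = TokenBucket(rate=3 / 60, capacity=3)
decisions = [allow([tenant, user], now=0.0) for _ in range(4)]
```

In this sketch the fourth request is rejected by the per-user bucket even though the tenant still has headroom, which is the behavior a single global counter cannot express.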
Platform teams typically see three failure modes once they reach 5,000 RPM aggregate across tenants without a real rate-limiting gateway: the noisy-neighbor that consumes 80 percent of the upstream provider’s tier 3 RPM, the boundary-spike abuser that fires 2x the configured limit by straddling fixed-window edges, and the silent 429 storm where the upstream provider rate-limits the platform but the gateway doesn’t surface that back-pressure to the client. The breakdown:
- Rate-limit granularity. Per-key alone isn’t enough. Per-tenant catches the multi-key tenant that issues 200 keys to its sub-organizations; per-user catches the abusing end-user inside a paying tenant; per-feature catches the new experimental endpoint that bursts under launch traffic. Combine multi-axis: a request is allowed only if every axis allows. Most production teams need three to four axes minimum.
- Algorithm choice. Token-bucket fits LLM bursts (one user fires three requests then idles for a minute). Sliding-window fits contract enforcement (the tenant is contracted for 600 RPM, the count must be exact). Fixed-window is the wrong default: a user fires 600 requests at second 59 of one minute and 600 at second 0 of the next, doubling the contracted rate without violating the count. Production teams compose token-bucket per user with sliding-window per tenant; fixed-window is never the right answer for contractual SLAs.
- Burst handling. A bounded queue (typically 50 to 200 slots at the gateway) absorbs one-second bursts from paying tenants without 429-ing them, and the queue rejects the moment depth is exceeded. Queue-only is wrong (infinite memory growth under abuse); reject-only is wrong (paying tenants 429 on a normal burst). The right default for paying tenants is queue-then-reject with bounded depth.
- Fair-share between tenants. A global rate limit lets one tenant saturate the upstream provider tier at the expense of every other customer. Weighted fair queuing drains tokens in proportion to tenant priority weight, keeps the fairness Gini coefficient under 0.3 on healthy traffic, and emits a per-minute fairness metric. Without fair-share, a noisy neighbor is a churn event for every other tenant.
- Provider-tier awareness. Anthropic ships tier 1 (50 RPM Claude 3.5 Sonnet, 40,000 input TPM) through tier 4 (4,000 RPM, 400,000 input TPM) for the Messages API; OpenAI ships organization-level RPM and TPM that vary by model and usage tier. A 2026 rate-limiting gateway pulls those ceilings into the limiter at config time, propagates headroom changes when the upstream key crosses a tier, and never sends a request the upstream provider will 429.
- Back-pressure signaling. The gateway returns `429` with `Retry-After` (seconds), surfaces `RateLimit-Remaining` and `RateLimit-Reset` on every response (the IETF draft `RateLimit` header fields), and for SSE-streaming routes emits an in-stream back-pressure frame before terminating. Without these, the client retries blindly and amplifies the storm.
- Observability of rate-limit events. One OpenTelemetry span per rate-limit decision with attributes for tenant, user, key, algorithm, limit, current count, decision, and queue depth, plus Prometheus counters for allow, queue, and reject by tenant. The platform team builds three dashboards: top rejected tenants, queue-depth p99 by route, and fairness Gini by minute. Without this, the first signal of an abuser is a support ticket.
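The fixed-window boundary spike described in the algorithm bullet can be reproduced in a few lines. This is a sketch under stated assumptions: `FixedWindowLimiter` and `SlidingWindowLog` are illustrative in-memory classes, timestamps are passed in explicitly, and a production limiter would keep this state in a shared store rather than in process memory.

```python
from collections import deque


class FixedWindowLimiter:
    def __init__(self, limit: int, window: float = 60.0) -> None:
        self.limit, self.window = limit, window
        self.window_index = -1
        self.count = 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window)
        if index != self.window_index:  # new window: the counter resets to zero
            self.window_index, self.count = index, 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class SlidingWindowLog:
    def __init__(self, limit: int, window: float = 60.0) -> None:
        self.limit, self.window = limit, window
        self.stamps: deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Evict timestamps that have fallen out of the trailing window.
        while self.stamps and self.stamps[0] <= now - self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True


def admitted(limiter, times: list[float]) -> int:
    """Count how many of the timestamped requests the limiter lets through."""
    return sum(limiter.allow(t) for t in times)


# 600 requests at second 59, then 600 more at second 0 of the next minute.
burst = [59.0] * 600 + [60.0] * 600
fixed_admitted = admitted(FixedWindowLimiter(limit=600), burst)
sliding_admitted = admitted(SlidingWindowLog(limit=600), burst)
```

The fixed-window limiter admits all 1,200 requests within about one second, while the sliding-window log admits only the first 600. The log stores one timestamp per admitted request, so memory grows with the limit; sliding-window counters are a common approximation when that cost matters.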
A gateway that ships axes 1 and 2 but skips 4, 5, and 7 is good for a demo and bad for multi-tenant production. The five gateway reviews below are scored against all seven axes, with the most weight on the four (1, 4, 5, and 7) that decide whether the gateway actually keeps tenants fair and abusers contained.
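The fairness Gini coefficient named in the fair-share axis is straightforward to compute from per-tenant served-request counts. The sketch below uses the standard rank-weighted formula and assumes equal tenant weights; with weighted tenants, one reasonable approach (an assumption, not something the source specifies) is to divide each count by its weight first. The function name `gini` and the sample counts are illustrative.

```python
def gini(values: list[float]) -> float:
    """Gini coefficient of non-negative values; 0.0 means perfectly even."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Requests served per tenant in one minute (hypothetical numbers).
healthy = gini([120, 100, 90, 110])
noisy_neighbor = gini([480, 40, 40, 40])
```

With these sample counts the healthy minute lands well under the 0.3 threshold, and the noisy-neighbor minute lands between 0.3 and the 0.6 alerting level mentioned later in this guide.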
Future AGI Agent Command Center: Best Overall for Rate Limiting LLM Calls
Future AGI Agent Command Center tops the 2026 rate-limit list because it bundles every axis of the seven-axis rubric at the same network hop in one Apache-2.0 Go binary you can self-host.
It loses on enterprise plugin maturity to Kong AI Gateway and on managed dashboard polish to Portkey; for buyers whose binding constraint is per-tenant + per-user + per-feature granularity with weighted fair-share, provider-tier awareness for Anthropic and OpenAI, and rate-limit events landing in an existing OpenTelemetry stack, the combined surface still puts it first. The combined surface is documented in the Agent Command Center docs and the source ships at the Future AGI GitHub repo.
Best for. Platform-engineering teams already running OpenTelemetry that want OpenAI-compatible drop-in, fine-grained per-tenant + per-user + per-feature rate limits with weighted fair-share, provider-tier awareness, and rate-limit events emitted into their existing observability stack, without rewriting OpenAI SDK code or operating a Python proxy.
Key strengths.
- OpenAI-compatible drop-in: change `base_url` to `https://gateway.futureagi.com/v1`, keep the existing OpenAI SDK code unchanged; rate-limit policy attaches at the same hop without SDK changes.
- Multi-axis granularity at one hop: per-key, per-tenant (via custom property), per-user (via header), per-feature (via route + tag), per-model, and per-time-window limits, combinable in a single policy.
- Algorithm choice: token-bucket for burst tolerance, sliding-window for contract enforcement, leaky-bucket for upstream-throttled queues; compose in series at the same hop.
- Bounded burst queue: configurable depth (default 100; production-tuned to 50 to 200 based on tenant priority class); queue-then-reject with soft-cap and hard-cap separation.
- Weighted fair queuing across tenants: assign service weight per virtual key, drain tokens in proportion to weight when the upstream provider tier is saturated, emit fairness Gini gauge per minute. Healthy traffic sits under 0.3 Gini; production alerting fires at 0.6.
- Provider-tier awareness: configure the upstream Anthropic tier (1 to 4) and the OpenAI organization-level RPM and TPM at the gateway; the limiter never sends a request the upstream will 429.
- Back-pressure signaling: `429` with `Retry-After`, `RateLimit-Remaining`, and `RateLimit-Reset` headers (IETF draft `RateLimit` header fields); SSE-streaming-aware back-pressure on streaming routes.
- OpenTelemetry-native rate-limit event export: one span per decision with `tenant_id`, `user_id`, `key`, `algorithm`, `limit`, `current_count`, `decision`, `queue_depth`; Prometheus counters on `/-/metrics` for Grafana dashboards. `traceAI` instruments 50+ AI surfaces across Python, TypeScript, Java, and C# (including Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel) OpenInference-natively. Error Feed, the clustering and what-to-fix layer of the eval stack that feeds the self-improving evaluators, sits alongside as the zero-config error monitor: it auto-clusters related per-tenant rate-limit failures (50 traces → 1 issue), auto-writes the root cause plus a quick fix plus a long-term recommendation per issue, and tracks a rising, steady, or falling trend per issue so emerging fairness regressions surface like exceptions rather than staying buried in OTel rows.
- Rate-limit decision under 5 ms p99 on a well-provisioned Redis (in-memory token-bucket plus Redis-backed sliding-window in parallel). The Future AGI Protect model family runs on a separate hot path at ~65 ms p50 text and ~107 ms p50 image (arXiv 2510.13351) so the rate-limit check isn’t coupled to guardrail latency. Protect is FAGI’s own fine-tuned model family built on Google’s Gemma 3n with specialized adapters across four safety dimensions (content moderation, bias detection, security/prompt-injection, data privacy/PII), natively multi-modal across text, image, and audio; it is a model family, not a plugin chain.
- Self-improving loop: the platform team feeds 429 rejection rates, queue-depth p99, and fairness Gini back into agent-opt, which proposes tuning of per-tenant weights and burst-queue depths from the trace. No other gateway closes this rate-limit-tuning feedback loop in one product.
- Apache 2.0; single Go binary; Docker, Kubernetes, AWS, GCP, Azure, on-prem, air-gapped, cloud at `gateway.futureagi.com/v1`.
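To make the weighted fair-queuing bullet concrete, here is a minimal sketch of weighted max-min allocation: when the upstream ceiling is saturated, each tenant gets capacity in proportion to its weight, and anything a light tenant does not use is redistributed to the others. The function name, the tenant names, and the numbers are assumptions for illustration; this is not the gateway's actual scheduler, and it assumes every weight is positive.

```python
def weighted_fair_share(
    capacity: float,
    demand: dict[str, float],
    weight: dict[str, float],
) -> dict[str, float]:
    """Weighted max-min allocation of `capacity` across tenants."""
    allocation = {tenant: 0.0 for tenant in demand}
    active = {t for t, d in demand.items() if d > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        total_weight = sum(weight[t] for t in active)
        share = {t: remaining * weight[t] / total_weight for t in active}
        # Tenants whose leftover demand fits inside their share are fully served.
        satisfied = {t for t in active if demand[t] - allocation[t] <= share[t]}
        if not satisfied:
            # Everyone wants more than their share: hand out the shares and stop.
            for t in active:
                allocation[t] += share[t]
            break
        for t in satisfied:
            remaining -= demand[t] - allocation[t]
            allocation[t] = demand[t]
        active -= satisfied
    return allocation


# Hypothetical minute: 600 RPM upstream ceiling, one heavy tenant at weight 2.
plan = weighted_fair_share(
    capacity=600,
    demand={"enterprise": 900, "small_a": 100, "small_b": 50},
    weight={"enterprise": 2, "small_a": 1, "small_b": 1},
)
```

Here the two small tenants are served in full and the heavy tenant receives the remaining 450 RPM, instead of consuming the whole ceiling as it would under a single global counter.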
Where it falls short. The enterprise plugin library for non-AI REST traffic is intentionally out of scope; teams that need a single plane for REST and AI rate limits will run Kong AI Gateway alongside it. The managed rate-limit dashboard is functional but less polished than Portkey’s at first glance; teams that want a finance-grade native dashboard with zero infrastructure work will reach for Portkey first. Self-host operations cost is non-zero; teams that want a fully managed plane should choose the hosted endpoint at gateway.futureagi.com/v1 rather than self-host.
```python
import os

from openai import OpenAI

# Read the key from the environment; Python does not expand a literal "$FAGI_API_KEY".
client = OpenAI(
    api_key=os.environ["FAGI_API_KEY"],
    base_url="https://gateway.futureagi.com/v1",
)

# Existing OpenAI SDK code unchanged from here.
# Per-tenant (header), per-user (header), per-feature (route + tag), and
# provider-tier-aware limits all apply at the same network hop.
response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet",
    messages=[{"role": "user", "content": "Summarise this support ticket."}],
    extra_headers={
        "X-Tenant-Id": "acme-corp",
        "X-User-Id": "user-29384",
        "X-Feature": "support-copilot",
    },
)

# Response carries the back-pressure signals:
#   RateLimit-Remaining: 142
#   RateLimit-Reset: 27
# A 429 carries Retry-After in seconds.
```
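On the client side, those back-pressure signals only help if the caller honors them. The sketch below is transport-agnostic: `send` is a hypothetical callable returning `(status, headers, body)`, header lookup is assumed to use the exact key `Retry-After`, and only the delta-seconds form of the header is parsed. When the header is missing it falls back to capped exponential backoff with jitter.

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]  # (status, headers, body)


def call_with_backpressure(
    send: Callable[[], Response],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry on 429, waiting for the server's Retry-After when it is present."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts - 1:
            break  # success, a non-retryable status, or out of attempts
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            delay = float(retry_after)  # delta-seconds form of Retry-After
        else:
            # No usable header: capped exponential backoff plus a little jitter.
            delay = min(30.0, 2.0 ** attempt) + random.uniform(0.0, 0.5)
        sleep(delay)
    return status, headers, body
```

Passing `sleep` as a parameter keeps the function testable without real waiting; in production the default `time.sleep` applies.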
Use case fit. Strong for OpenTelemetry-first platform teams, multi-tenant SaaS with per-customer fair-share enforcement, fintech with contractual RPM SLAs, and platform teams that want trace plus eval plus gateway plus rate-limit-tuning feedback loop in one Apache-2.0 platform with hybrid local and cloud deployment. Less optimal for teams that want a fully managed rate-limit dashboard before writing any infrastructure code.
Pricing and deployment. Apache 2.0 single Go binary; cloud-hosted endpoint at https://gateway.futureagi.com/v1 or self-host (Docker, Kubernetes, air-gapped).
Verdict. The strongest single pick when the 2026 rate-limit story is “we want per-tenant + per-user + per-feature limits with weighted fair-share, provider-tier awareness for Anthropic and OpenAI, and rate-limit events in our existing OpenTelemetry stack, in one Apache-2.0 binary, without rewriting OpenAI SDK calls or operating a Python proxy.”
Portkey: Best for Managed Rate-Limit Dashboard Out of the Box
Portkey is the strongest pick when you want per-virtual-key rate limits plus a four-tier budget hierarchy plus a managed cost-and-rate dashboard out of the box. It’s what most production teams reach for when “we need rate-limit enforcement and a dashboard next week” is the brief, and it has the largest adapter library on the routed rate-limit path.
Best for. Multi-tenant SaaS or internal multi-product platforms that need fine-grained per-virtual-key rate limits, a four-tier budget hierarchy, and a usable rate-limit dashboard without writing a custom exporter.
Key strengths.
- Per-key, per-virtual-key, per-model, and per-time-window rate limits; the most fine-grained native-dashboard hierarchy on this list.
- Token-bucket plus sliding-window algorithms; configurable burst tolerance per virtual key.
- 429 with `Retry-After`, `RateLimit-Remaining`, and `RateLimit-Reset` headers; consistent back-pressure signaling.
- Large adapter library (250+ providers, including private OSS deployments); rate-limit policy works the same across providers.
- Usable native dashboard for rate-limit attribution by tenant, feature, and route, without writing a custom exporter.
- Open-source gateway core (github.com/Portkey-AI/gateway); production teams self-host the gateway and run the control plane in Portkey cloud.
Where it falls short. Palo Alto Networks announced intent to acquire Portkey on April 30, 2026; the press release says the deal is expected to close in PANW fiscal Q4 2026 and that Portkey will become the AI Gateway for Prisma AIRS. Verify standalone-product continuity before signing a multi-year contract. Observability is dashboard-first; the OpenTelemetry export of rate-limit events exists but is less first-class than the native dashboard, so OTel-first stacks end up duplicating telemetry. Provider-tier awareness is partial; the limiter doesn’t auto-detect Anthropic tier 3 versus tier 4 without manual configuration. The control plane is closed; check whether the open-source core covers air-gapped requirements.
Use case fit. Strong for multi-tenant SaaS, fintech with per-customer rate-limit attribution, and platform teams running multiple AI products. Less optimal for teams that want their rate-limit telemetry flowing into an existing OpenTelemetry collector and Grafana stack as a first-class output, or for teams that need the gateway to know about Anthropic tier ceilings out of the box.
Pricing and deployment. Open-source core (self-hosted) plus commercial cloud control plane; enterprise tiers.
Verdict. The most mature per-virtual-key rate-limit hierarchy with a managed dashboard in 2026. Choose with eyes open on the Palo Alto integration; the next twelve months will tell whether the standalone gateway product survives the merge.
Kong AI Gateway: Best for Platform Teams Already Running Kong
Kong AI Gateway is the AI-specific overlay on the Kong API gateway. Teams already running Kong for REST traffic get the same rate-limit plugins (rate-limiting, rate-limiting-advanced, graphql-rate-limiting-advanced) applied to the AI route, plus AI-specific plugins (ai-proxy, ai-rate-limiting-advanced, ai-prompt-template) for upstream provider abstraction.
Best for. Platform-engineering teams that already operate Kong for REST APIs and want the same rate-limit primitives on the AI route without operating a second gateway runtime.
Key strengths.
- Mature enterprise rate-limit plugins: Local (in-memory, single-instance), Cluster (counters shared through the Kong datastore), and Redis (shared via Redis) policy stores; pick the storage tier that fits your cluster topology.
- Per-consumer, per-credential, per-service, and per-route rate limits; combinable multi-axis enforcement via the Consumer entity.
- Sliding-window and fixed-window algorithms in the Advanced plugin; configurable window size and sync rate.
- 429 with `Retry-After`, `RateLimit-Remaining`, and `RateLimit-Reset` headers; consistent back-pressure signaling in line with the IETF draft `RateLimit` header fields.
- Native OpenTelemetry export via the `opentelemetry` plugin; rate-limit decisions surface as spans on the request trace.
- Kong Konnect managed cloud plus on-prem Kong Gateway Enterprise; the same plugin config works in both deployments.
Where it falls short. The AI-specific layer (ai-proxy, ai-rate-limiting-advanced) lags the rest of the Kong plugin ecosystem in cadence; new provider integrations land slower than at Portkey or LiteLLM. Provider-tier awareness (Anthropic tier 1 to 4 RPM, OpenAI organization-level RPM and TPM) is manual; the team configures the upstream ceiling as a static plugin parameter rather than pulling it from the provider. Weighted fair-share across tenants is an Enterprise tier feature in the Advanced plugin; the open-source Local and Cluster policy plugins ship hard-cap-only enforcement, no fair queuing. The control plane footprint is heavier than a single Go binary; for teams that don’t already run Kong, the operations cost isn’t justified by the AI rate-limit feature surface alone.
Use case fit. Strong for platform teams that already operate Kong for REST traffic, regulated environments where the existing Kong audit and compliance surface is a hard requirement, and enterprises with an existing Kong Konnect contract. Less optimal for teams that want a single Apache-2.0 Go binary or for teams where Anthropic tier awareness is a hard requirement.
Pricing and deployment. Apache 2.0 core (Kong Gateway OSS) plus commercial Enterprise tier (Advanced plugins, fair queuing, support); Kong Konnect cloud or self-host.
Verdict. The right pick when Kong is already the platform-engineering plane and the AI route shouldn’t be a second gateway runtime. Choose elsewhere when the AI-specific rate-limit features (Anthropic tier awareness, weighted fair-share at the AI layer) are the binding constraint.
LiteLLM: Best for Python-First Teams Post-Incident
LiteLLM is the Python-first proxy that broke open the multi-provider unified-API category. It exposes 20+ providers via six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure) plus OpenAI-compatible presets and self-hosted backends behind OpenAI-compatible endpoints, and it ships virtual keys with per-key RPM and TPM budgets that work as a rate-limit primitive. After the March 24, 2026 supply-chain incident it remains a viable pick, but only with commit pinning or an upgrade past 1.83.7.
Best for. Python-first teams that already deploy a FastAPI or uvicorn surface, want broad provider coverage, virtual keys with per-key RPM and TPM, and are willing to pin commit hashes (or upgrade past 1.83.7) and hold their own upstream provider DPA.
Key strengths.
- Virtual keys with per-key RPM and TPM budgets; budget alerts on threshold.
- 429 with `Retry-After` headers; consistent back-pressure on the OpenAI-compatible surface.
- Broadest provider coverage of any single project on this list: 20+ providers via six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure) plus OpenAI-compatible presets and self-hosted backends.
- MIT (the enterprise dir is licensed separately); trivial to fork or audit.
- Native fit with Python observability stacks (Prometheus exporter, OpenTelemetry middleware).
- Active maintainer community; easy to extend with custom rate-limit policies in Python.
Where it falls short. The headline issue is the March 24, 2026 PyPI supply-chain compromise. Versions 1.82.7 and 1.82.8 were published by an attacker who had taken over the maintainer’s PyPI token. The compromised package exfiltrated SSH keys, cloud credentials, and Kubernetes configs to an attacker-controlled endpoint, per the Datadog Security Labs writeup of the TeamPCP campaign. Pin commit hashes, scan for affected versions in the dependency tree, rotate any credentials touched by an affected install, and upgrade past 1.83.7. Beyond the incident, the Python runtime delivers materially slower rate-limit-decision throughput than Go-binary alternatives at high concurrency (10x to 30x slower per decision is the rough number for Python token-bucket against Go token-bucket on the same hardware). Sliding-window enforcement is partial; the default implementation is a per-key counter that approximates sliding-window rather than enforcing it precisely. Weighted fair-share is manual; the team writes the weighting policy in Python middleware. Provider-tier awareness (Anthropic tier 1 to 4 RPM) is manual configuration via virtual-key budget config.
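As a first pass at the "scan for affected versions" step, the sketch below checks the active Python environment for the two versions named above using the standard library's `importlib.metadata`. The helper names are illustrative, and this only covers the interpreter it runs in; lock files, container images, and other virtual environments need their own scan.

```python
from importlib import metadata
from typing import Optional

# Versions named in the incident description above.
COMPROMISED = {"1.82.7", "1.82.8"}


def installed_version(package: str = "litellm") -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_compromised(version: Optional[str]) -> bool:
    """True only when the version string matches a known-bad release."""
    return version is not None and version.strip() in COMPROMISED


def report(package: str = "litellm") -> str:
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed in this environment"
    if is_compromised(found):
        return f"{package} {found} is a compromised release: rotate credentials"
    return f"{package} {found} is not one of the known-compromised releases"
```

Calling `print(report())` in each deployment environment gives a quick signal, but it is not a substitute for pinning hashes and auditing the full dependency tree.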
Use case fit. Strong for Python-first teams, ML platform teams that already manage Python services, and teams whose buying constraint is broad provider coverage in a fork-friendly license. Less optimal where rate-limit-decision throughput at over 10,000 req/s matters or where you’re pinned to a managed runtime that doesn’t allow commit-pinned dependencies.
Pricing and deployment. MIT (enterprise dir licensed separately); pip install. Enterprise cloud tier exists.
Verdict. Still the broadest provider coverage on the list, but the March 2026 supply-chain incident shifts it from “default pick” to “pin commits or upgrade past 1.83.7 and audit.” Pair with Sigstore verification and dependency-pinning enforcement.
Cloudflare AI Gateway: Best for Edge Global Rate-Limit
Cloudflare AI Gateway is the edge-first entry on the list. It sits in front of upstream provider endpoints (OpenAI, Anthropic, Google, Replicate, plus Workers AI), enforces account-level rate limits at the CDN hop, and surfaces request analytics in the Cloudflare dashboard. It’s the gateway most often cited when “we want a global request cap ahead of the upstream provider with zero infrastructure” is the brief.
Best for. Teams that want a global request cap on the CDN hop ahead of the upstream provider, especially teams already running Cloudflare for their web traffic and willing to live within an account-level (not per-tenant) limit shape.
Key strengths.
- Drop-in OpenAI-compatible proxy; change the base URL to the AI Gateway hostname and traffic flows through the edge limiter.
- Account-level RPM and TPM caps with token-bucket and sliding-window options at the Cloudflare edge.
- 429 with `Retry-After` and CDN-native back-pressure signaling.
- Free tier with generous account-level limits; paid tier scales linearly.
- Native Cloudflare analytics dashboard with per-provider, per-model, and per-account request and token counts.
- Zero infrastructure to operate; runs entirely on Cloudflare’s global edge.
Where it falls short. Per-tenant, per-user, and per-feature granularity is absent at the gateway layer; the limiter is account-level and the platform team has to add tenancy enforcement somewhere else (typically in the application or at a second hop). Provider-tier awareness for Anthropic tier 1 to 4 and OpenAI organization-level RPM is absent; the limiter treats every upstream as a generic HTTP endpoint. Weighted fair-share across tenants is absent. OpenTelemetry export of rate-limit events isn’t first-class; the Cloudflare dashboard is the primary observability surface and exporting spans into an existing OTel collector is manual. Closed source; the platform team is betting on the Cloudflare team’s roadmap and uptime for AI-specific features.
Use case fit. Strong for teams already running Cloudflare for their web traffic, early-stage teams that want a global request cap with zero infrastructure, and teams whose threat model is volumetric abuse at the edge rather than per-tenant fair-share. Less optimal for multi-tenant SaaS where per-tenant fair-share is the binding constraint, or for teams that want provider-tier-aware enforcement against Anthropic tier 3 or OpenAI organization-level limits.
Pricing and deployment. Closed source; Cloudflare cloud only. Free tier with account-level limits; paid tier scales linearly.
Verdict. The lowest-friction way to put a global request cap ahead of the upstream provider. Not the right gateway when per-tenant rate-limit enforcement or provider-tier awareness is the brief.
The 2026 Gateway Migration and Trust Cohort
Every gateway listicle on the SERP is treating these as if they didn’t happen. They did, and they reshape the rate-limit procurement question for 2026.
- Helicone joining Mintlify (March 3, 2026). Helicone acquired by Mintlify; public roadmap shifts toward documentation-platform-first. Existing Helicone users should treat this as a planned migration window, not a continued procurement.
- LiteLLM PyPI supply-chain compromise (March 24, 2026). Versions 1.82.7 and 1.82.8 were compromised on PyPI; the malicious package exfiltrated SSH keys, cloud credentials, and Kubernetes configs to an attacker-controlled endpoint. Pin commit hashes, scan dependency trees, rotate any credentials accessible to an affected install, and upgrade past 1.83.7. Primary source: the Datadog Security Labs writeup.
- Anthropic MCP STDIO RCE class (April 2026). OX Security disclosed an STDIO transport class flaw affecting 7,000+ publicly accessible MCP servers and 150M+ downstream downloads, with multiple CVEs filed across downstream implementations. MCP gateways are now expected to enforce least-privilege tool access, OAuth 2.1, and Streamable HTTP transport. Rate limits on MCP tool calls per agent session became a default expectation overnight. Primary coverage: the Hacker News report on the Anthropic MCP design vulnerability.
- Portkey acquired by Palo Alto Networks (April 30, 2026). Acquisition announced; the gateway will become the AI Gateway for Prisma AIRS, with close expected in PANW fiscal Q4 2026. Standalone-product continuity is pending integration; verify roadmap before signing a multi-year contract. Primary source: the Palo Alto Networks press release.
The practical takeaway: for the next twelve months, license clarity, acquisition independence, and supply-chain pinning are part of the rate-limit procurement decision. A cheap gateway you have to migrate off in six months isn’t cheap, and a gateway whose package on PyPI was malware for 48 hours isn’t “still a default.”
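Supply-chain pinning can be backed by a cheap startup or CI guard. The sketch below, in Python, checks the installed package version against the two versions this article lists as compromised; the helper names and the idea of running it as a CI step are assumptions for illustration, and it does not replace hash pinning or a dependency scanner.

```python
from importlib import metadata

# Versions reported as compromised in the text above.
COMPROMISED_LITELLM = frozenset({"1.82.7", "1.82.8"})


def is_compromised(version, blocklist=COMPROMISED_LITELLM):
    """True when the given version string is on the blocklist."""
    return version.strip() in blocklist


def check_installed(package="litellm"):
    """Return (version, compromised) for an installed package.

    Returns (None, False) when the package is not installed.
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None, False
    return version, is_compromised(version)
```

A CI job could call `check_installed()` and fail the build when the second element is true.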
Decision Framework: Which Rate-Limiting Gateway Is Right for You in 2026?
The buyer profile drives the pick more than the feature matrix does. OpenTelemetry-first platform teams pick Future AGI Agent Command Center; multi-tenant SaaS teams that want a managed dashboard pick Portkey; teams already running Kong for REST pick Kong AI Gateway; Python-first ML-platform teams pick LiteLLM; edge-first teams already on Cloudflare pick Cloudflare AI Gateway.
Choose Future AGI Agent Command Center for rate limiting if:
- You need per-tenant + per-user + per-feature granularity with weighted fair-share
- Anthropic tier 1 to 4 RPM or OpenAI org-level RPM awareness is required
- You want rate-limit events as OpenTelemetry spans in your existing collector
- Self-improving loop on per-tenant weight tuning matters to you
- Apache 2.0 single Go binary is the deployment shape you want
Choose Portkey if:
- Managed dashboard out of the box is the binding requirement
- You can live with partial provider-tier awareness and partial OTel export
- The PANW acquisition timeline is acceptable for your contract horizon
Choose Kong AI Gateway if:
- Kong is already the platform-engineering plane for REST traffic
- Enterprise plugin maturity outweighs AI-specific feature cadence
- Weighted fair-share at Enterprise tier is acceptable budget-wise
Choose LiteLLM (pinned past 1.83.7) if:
- Python-first stack and broad provider coverage are the binding constraints
- You can hold commit pins and Sigstore-verified dependencies as a process
Choose Cloudflare AI Gateway if:
- Account-level edge limit is sufficient (no per-tenant requirement)
- Zero infrastructure to operate is the constraint
- Cloudflare is already the CDN for your web traffic
| If you are a… | Pick | Why |
|---|---|---|
| Platform team on OpenTelemetry with multi-tenant SaaS | Future AGI Agent Command Center | Per-tenant + per-user + per-feature granularity + weighted fair-share + provider-tier awareness + OTel-native rate-limit events |
| Fintech with contractual RPM SLAs and audit trail | Future AGI Agent Command Center | Sliding-window correctness + per-tenant fair-share + span-level rate-limit attribution |
| Air-gapped or on-prem regulated environment | Future AGI Agent Command Center | Apache 2.0 single Go binary; Docker, Kubernetes, air-gapped |
| Multi-tenant SaaS that wants a managed dashboard out of the box | Portkey | Most fine-grained per-virtual-key hierarchy + native dashboard (verify PANW integration) |
| Platform team already running Kong for REST | Kong AI Gateway | Same plugin model on the AI route; enterprise-grade Advanced plugins |
| Python-first ML platform team | LiteLLM (1.82.6 pin or 1.83.7+) | Broad provider coverage with per-key RPM and TPM; pin or upgrade past March 24 |
| Edge-first team already on Cloudflare | Cloudflare AI Gateway | Account-level edge limit ahead of upstream; zero infrastructure |
Common Rate-Limit Implementation Mistakes Platform Teams Make
Even when teams pick the right gateway, they ship the wrong limiter configuration. Five mistakes account for most of the production incidents we see.
- Shipping a single global RPM counter and calling it rate limiting. A global counter doesn’t catch the noisy-neighbor tenant, doesn’t enforce fair-share, and doesn’t surface per-tenant rejection rates. The Gini coefficient on inbound traffic climbs from a healthy 0.2 to over 0.7 within a week of launch, every other tenant feels the impact, and the first signal is a churn meeting. Fix: multi-axis enforcement with at least per-tenant and per-user buckets, weighted fair queuing, and per-tenant rejection counters on the dashboard.
- Using fixed-window when the SLA is contractual. Fixed-window lets a tenant burn 2x the contracted rate by straddling two windows. A tenant on a 600 RPM contract can send 600 requests at second 59 of window 1 and another 600 at second 0 of window 2, which is 1,200 requests in about two seconds with every one of them accepted. If the contract has a financial penalty for breach, fixed-window is a liability. Fix: sliding-window for contract enforcement, token-bucket for burst tolerance, compose in series.
- Reject-only burst policy on paying tenants. A paying tenant fires three requests in a second and the third gets a 429. Their client retries, the limiter rejects again, and the tenant opens a ticket about “your gateway dropped my request.” The right default for paying tenants is a bounded queue with depth 50 to 200 that absorbs one-second bursts and rejects only after the queue is full. Fix: queue-then-reject with bounded depth, configurable per virtual key, observability on queue-depth p99.
- Ignoring provider-tier ceilings. The gateway is configured for 5,000 aggregate RPM, but the upstream Anthropic key is tier 2 (1,000 RPM Claude 3.5 Sonnet). Anthropic 429s the platform, the gateway doesn’t propagate the back-pressure, and the platform spends a week debugging “intermittent failures.” Fix: configure the upstream provider tier at the gateway, propagate headroom changes when the key crosses a tier, never send a request the upstream will reject.
- No span-level observation of rate-limit events. The first signal of an abuser is a support ticket. The platform team has no dashboard that shows top-10 rejected tenants, queue-depth p99 by route, or fairness Gini by minute. By the time the abuser is identified, the data window has rolled. Fix: one OpenTelemetry span per rate-limit decision with attributes for tenant, user, decision, and queue depth; three platform dashboards; alert on Gini over 0.6 and queue-depth p99 over 100.
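The fixed-window boundary problem from the second mistake is easy to reproduce. The sketch below is illustrative Python, not any gateway's implementation: `FixedWindow` and `SlidingWindowLog` are hypothetical class names, and the sliding window is implemented as a simple timestamp log. It replays 600 requests at second 59 and 600 more at second 60 against a 600 RPM limit.

```python
from collections import deque


class FixedWindow:
    """Counter that resets at each window boundary."""

    def __init__(self, limit, window=60.0):
        self.limit, self.window = limit, window
        self.window_id, self.count = None, 0

    def allow(self, now):
        window_id = int(now // self.window)
        if window_id != self.window_id:
            # New window: the counter starts again from zero.
            self.window_id, self.count = window_id, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLog:
    """Keeps accepted timestamps and counts those in the trailing window."""

    def __init__(self, limit, window=60.0):
        self.limit, self.window = limit, window
        self.accepted = deque()

    def allow(self, now):
        # Drop timestamps that have aged out of the trailing window.
        while self.accepted and self.accepted[0] <= now - self.window:
            self.accepted.popleft()
        if len(self.accepted) < self.limit:
            self.accepted.append(now)
            return True
        return False


def accepted_in_boundary_burst(limiter, per_second=600):
    """Send per_second requests at t=59s and again at t=60s."""
    times = [59.0] * per_second + [60.0] * per_second
    return sum(limiter.allow(t) for t in times)


print(accepted_in_boundary_burst(FixedWindow(600)))       # 1200
print(accepted_in_boundary_burst(SlidingWindowLog(600)))  # 600
```

The timestamp log is exact but stores one entry per accepted request; production limiters often approximate it with weighted counters to bound memory.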
Future AGI Rate-Limit Implementation Walk-Through
A production-grade rate-limit policy at the gateway is four pieces composed together: multi-axis enforcement, algorithm choice per axis, provider-tier awareness, and observability. The Agent Command Center configures all four in one YAML at the gateway hop and pipes the resulting telemetry back into the optimizer.
```yaml
# gateway.yaml - rate-limit policy
rate_limits:
  # Per-tenant token-bucket for burst tolerance
  - axis: tenant
    header: X-Tenant-Id
    algorithm: token_bucket
    bucket_size: 200
    refill_rate: 100 # tokens per minute
    queue_depth: 100
    on_full: reject_with_retry_after
  # Per-tenant sliding-window for contract enforcement
  - axis: tenant
    header: X-Tenant-Id
    algorithm: sliding_window
    window: 60s
    limit_lookup: tenant_contract # pulled from control plane
    on_breach: reject_with_retry_after
  # Per-user token-bucket for abuse prevention inside a tenant
  - axis: user
    header: X-User-Id
    algorithm: token_bucket
    bucket_size: 30
    refill_rate: 15 # tokens per minute
    queue_depth: 0
    on_full: reject_with_retry_after
  # Per-feature sliding-window on the new experimental route
  - axis: route_tag
    tag: support-copilot-experimental
    algorithm: sliding_window
    window: 60s
    limit: 200
    on_breach: reject_with_retry_after

# Provider-tier awareness pulled into the limiter
provider_tiers:
  - upstream: anthropic
    tier: 3
    rpm: 2000
    input_tpm: 200000
    output_tpm: 50000
  - upstream: openai
    org_id: org-acme
    rpm: 10000
    tpm: 2000000

# Weighted fair-share across tenants when upstream is saturated
fair_share:
  algorithm: weighted_fair_queuing
  weights:
    enterprise: 4
    business: 2
    free: 1
  gini_alert_threshold: 0.6

# Observability
telemetry:
  otlp_endpoint: $OTLP_ENDPOINT
  prometheus_metrics: /-/metrics
  span_attributes:
    - tenant_id
    - user_id
    - key
    - algorithm
    - limit
    - current_count
    - decision
    - queue_depth
```
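To make the `bucket_size`, `refill_rate`, and `queue_depth` fields concrete, here is a minimal Python sketch of a token bucket with a bounded queue. It is an assumption about how such a policy could behave, not the Agent Command Center's implementation: the queue is modeled as token debt (the balance may go as low as minus `queue_depth`), and the class and method names are hypothetical.

```python
import math


class QueueingTokenBucket:
    """Token bucket whose balance may go negative down to -queue_depth."""

    def __init__(self, bucket_size, refill_per_minute, queue_depth, now=0.0):
        self.capacity = float(bucket_size)
        self.rate = refill_per_minute / 60.0  # tokens per second
        self.queue_depth = queue_depth
        self.tokens = float(bucket_size)
        self.last = now

    def _refill(self, now):
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def decide(self, now):
        """Return (decision, seconds): admit, queue (wait), or reject (retry after)."""
        self._refill(now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return "admit", 0
        if self.tokens - 1.0 >= -self.queue_depth:
            # Borrow a token; the request waits until the debt is refilled.
            self.tokens -= 1.0
            return "queue", math.ceil(-self.tokens / self.rate)
        # Queue is full: report when a queue slot will free up.
        wait = (1.0 - self.queue_depth - self.tokens) / self.rate
        return "reject", math.ceil(wait)


# Per-tenant values from the policy above: 200 burst, 100/min refill, queue of 100.
bucket = QueueingTokenBucket(bucket_size=200, refill_per_minute=100, queue_depth=100)
decisions = [bucket.decide(now=0.0)[0] for _ in range(301)]
print(decisions.count("admit"), decisions.count("queue"), decisions.count("reject"))
# 200 100 1
```

Setting `queue_depth` to 0, as in the per-user rule, turns the same class into a plain reject-only token bucket.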
Three production dashboards land for free once the telemetry flows:
- Top rejected tenants by hour. Sort by 429 count, descending. The top tenant in a healthy week sits under 1 percent of total rejections; over 10 percent flags an abuser or a tenant whose contract limit needs renegotiation.
- Queue-depth p99 by route. Healthy routes sit under 30. Over 80 flags the limiter as the bottleneck; either raise the bucket size or shed traffic upstream.
- Fairness Gini by minute. Healthy traffic sits under 0.3. Over 0.6 fires the alert and the platform team checks for a single tenant saturating the upstream tier.
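The fairness gauge in the third dashboard is a standard Gini coefficient over per-tenant request counts. A minimal Python sketch follows, assuming one count per tenant per minute; the function name and the sample counts are illustrative, and 0.6 is the alert threshold quoted above.

```python
def gini(counts):
    """Gini coefficient of non-negative per-tenant request counts.

    0.0 means every tenant sent the same volume; the maximum for n
    tenants is (n - 1) / n, reached when one tenant sends everything.
    """
    values = sorted(c for c in counts if c >= 0)
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum((i + 1) * v for i, v in enumerate(values))
    return (2.0 * weighted) / (n * total) - (n + 1) / n


GINI_ALERT_THRESHOLD = 0.6

even = [10, 10, 10, 10]          # four tenants with identical traffic
saturated = [0, 0, 0, 100]       # one tenant consuming everything
print(gini(even))                              # 0.0
print(gini(saturated))                         # 0.75
print(gini(saturated) > GINI_ALERT_THRESHOLD)  # True
```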
The closed loop is what no other gateway on this list ships. The Future AGI optimizer reads the 429 rate, queue-depth p99, and Gini gauge from the OpenTelemetry trace, proposes adjustments to per-tenant weights, bucket sizes, and queue depths, and surfaces them as a tuning candidate the platform team reviews and applies. The same trace surface that flags a fairness regression also produces the labeled dataset that agent-opt uses to revise the limiter. That makes the gateway a closed-loop self-improvement layer for rate limits, not a static config file.
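The per-tenant weights being tuned are the 4/2/1 values in the `fair_share` block. As a rough illustration of what those weights mean when the upstream is saturated, the Python sketch below splits a fixed capacity using weighted max-min fairness. This is a simplified stand-in for weighted fair queuing, which schedules individual requests rather than computing a one-shot split; the function name and the demand figures are hypothetical.

```python
def weighted_fair_share(capacity, demands, weights):
    """Split capacity in proportion to weight, never exceeding a tenant's demand.

    Share left unused by tenants that want less than their slice is
    redistributed among the remaining tenants by weight.
    """
    allocation = {tenant: 0.0 for tenant in demands}
    active = {tenant for tenant, demand in demands.items() if demand > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(weights[t] for t in active)
        share = {t: remaining * weights[t] / total_weight for t in active}
        satisfied = {t for t in active if demands[t] <= share[t]}
        if not satisfied:
            # Everyone wants more than their slice: hand out the slices.
            for t in active:
                allocation[t] = share[t]
            break
        for t in satisfied:
            allocation[t] = float(demands[t])
            remaining -= demands[t]
        active -= satisfied
    return allocation


weights = {"enterprise": 4, "business": 2, "free": 1}
demands = {"enterprise": 1000, "business": 100, "free": 500}  # requests per minute
print(weighted_fair_share(700, demands, weights))
# {'enterprise': 480.0, 'business': 100.0, 'free': 120.0}
```

In the example, the business tenant asks for less than its weighted slice, so the 100 RPM it leaves unused is split 4:1 between the enterprise and free tenants.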
For deeper reads on the underlying Protect performance numbers, see the arXiv writeup on the Future AGI inline-guardrail benchmark at 65 ms; the rate-limit hot path sits well below that on a separate decision plane.
Rate limiting LLM calls in 2026 isn’t a single counter. It’s a stack: per-tenant, per-user, and per-feature granularity; sliding-window, token-bucket, and leaky-bucket algorithms composed in series; weighted fair-share across tenants; provider-tier awareness for Anthropic and OpenAI; back-pressure signaling that reaches the client; and OpenTelemetry-native rate-limit event export feeding an optimizer that tunes the policy. All of it should run at the same network hop, under a license that isn’t about to be re-platformed inside an acquirer.
Future AGI Agent Command Center is the strongest single pick when the buying constraint is one self-hostable Apache-2.0 binary that covers every axis of the rate-limit rubric. Teams already on Portkey should weigh the Palo Alto integration timeline; Kong teams should validate the Advanced plugin tier; Python-first teams should pin LiteLLM commits or upgrade past 1.83.7; edge-first teams should compare Cloudflare AI Gateway’s account-level limit against per-tenant requirements before committing.
For deeper reads: the Agent Command Center docs, the Future AGI GitHub repo, the Future AGI observability docs, the Future AGI Protect docs, the Future AGI Evaluation docs, and the OpenTelemetry GenAI semantic conventions.
Try Future AGI Agent Command Center free: drop-in OpenAI-compatible routing, per-tenant + per-user + per-feature rate limits, weighted fair-share scheduling, Anthropic tier 1 to 4 and OpenAI organization-level RPM awareness, queue-then-reject back-pressure with Retry-After (RFC 9110) and RateLimit-Remaining headers, and OpenTelemetry-native rate-limit event export in one Apache-2.0 Go binary.
Related reading
- Best 5 AI Gateways for LLM Cost Optimization in 2026, the five-layer cost stack and the 2026 trust cohort
- Best 5 AI Gateways for LLM Failover and Fallback in 2026, fallback and failover gateway picks
- Best 7 AI Gateways for Multi-Model Routing in 2026, how cost-quality routing decisions get made at the gateway hop
- Best 5 AI Gateways for Prompt Management in 2026, the prompt-management gateway picks
Frequently asked questions
What Is the Difference Between Token-Bucket and Sliding-Window Rate Limiting for LLM Calls?
How Should an AI Gateway Handle Anthropic Tier 1 to 4 RPM and OpenAI Organization-Level Limits?
Should the Gateway Queue Excess Traffic or Reject It Outright?
How Do You Enforce Fair-Share Between Tenants at the Gateway?
What Is an Acceptable Latency Overhead for a Rate-Limit Check at the Gateway?
How Do You Observe Rate-Limit Events in a Way That Is Useful for the Platform Team?