Best 5 AI Gateways for Rate Limiting LLM Calls in 2026

Five AI gateways for rate limiting LLM calls in 2026 scored on the seven-axis rate-limit rubric, provider-tier awareness, fair-share enforcement, and observability of rate-limit events.

Originally published May 17, 2026.

A platform team running a multi-tenant document copilot shipped a global 600 RPM cap on Friday and woke up to two incidents on Monday: a paying enterprise customer who burned the entire quota with one batch job, and three smaller customers who saw a wall of 429s and churned. The gateway never heard about Anthropic tier 3, never queued the burst, never enforced fair-share, and never emitted a rate-limit event the platform team could correlate. This guide compares the five AI gateways production platform teams should choose between in 2026 for rate limiting LLM calls, scored on the seven-axis rate-limit rubric, provider-tier awareness, and back-pressure signaling with primary sources.

TL;DR: 5 Gateways Scored on the Seven-Axis Rate-Limit Rubric and the 2026 Trust Cohort

Future AGI Agent Command Center is the strongest single pick for rate limiting LLM calls in 2026. In one Apache-2.0 Go binary you can self-host, it bundles per-key, per-tenant, per-user, and per-feature limits with weighted fair-share scheduling; provider-tier awareness for Anthropic tier 1 to 4 RPM and OpenAI organization-level RPM and TPM; sliding-window plus token-bucket algorithms; queue-or-reject back-pressure with Retry-After and RateLimit-Remaining headers; and OpenTelemetry-native rate-limit event export. A global request-per-minute counter no longer cuts it; the four axes that separate a 2026 rate-limiting gateway from a 2024 LLM proxy are granularity (who is being limited), algorithm correctness (avoid the fixed-window boundary spike), fair-share enforcement (Gini under 0.3 on healthy traffic), and observability of the rate-limit event itself.

| # | Platform | Best for | 2026 event you should know |
|---|---|---|---|
| 1 | Future AGI Agent Command Center | Per-key + per-tenant + per-user + per-feature + provider-tier-aware limits with weighted fair-share + OTel-native rate-limit events in one Apache-2.0 Go binary | Apache 2.0 single Go binary; Protect at roughly 65 ms (arXiv 2510.13351); no pending acquisition |
| 2 | Portkey | Managed dashboard with per-virtual-key rate limits + 4-tier budget hierarchy + native cost-and-rate dashboard | Palo Alto Networks announced intent to acquire on April 30, 2026; close expected PANW fiscal Q4 |
| 3 | Kong AI Gateway | Platform teams already running Kong who want enterprise-grade rate-limit plugins (Advanced, Graph, Local, Cluster, Redis) on the AI route | Kong 3.8+ ships the AI gateway plugins; enterprise tier for the Advanced limiter |
| 4 | LiteLLM (post-incident pinned) | Python-first teams pinning a commit or upgrading past 1.83.7 who want virtual keys with per-key RPM and TPM budgets | TeamPCP PyPI supply-chain compromise of versions 1.82.7 and 1.82.8 on March 24, 2026 |
| 5 | Cloudflare AI Gateway | Edge-first teams that want a global rate-limit on the CDN hop ahead of the upstream provider | Free tier with global per-account limits; less aware of upstream provider-tier ceilings |

The 5 Rate-Limiting Gateways at a Glance

The five cover every rate-limit shape platform teams actually ship in 2026: an Apache-2.0 open-source platform with the seven-axis rubric covered in one binary (Future AGI), a managed dashboard with acquisition pending (Portkey), an enterprise plugin gateway already running for REST traffic (Kong AI Gateway), a Python proxy under remediation (LiteLLM), and an edge limiter (Cloudflare AI Gateway).

| Superlative | Tool |
|---|---|
| Best overall for rate limiting | Future AGI Agent Command Center: per-key + per-tenant + per-user + per-feature + provider-tier-aware limits with weighted fair-share + OTel-native rate-limit events in one Apache-2.0 Go binary |
| Best for OpenAI-compat drop-in | Future AGI Agent Command Center: base_url swap against the existing OpenAI SDK; no SDK rewrite |
| Best for sub-5 ms rate-limit overhead | Future AGI Agent Command Center: in-memory token-bucket plus Redis sliding-window in parallel; rate-limit decision under 5 ms p99 |
| Best for native rate-limit dashboard | Portkey: per-virtual-key rate limits surfaced in the managed dashboard |
| Best for teams already running Kong | Kong AI Gateway: enterprise rate-limit plugins on the AI route |
| Best for Python-first teams post-incident | LiteLLM (1.82.6 pin or 1.83.7+ upgrade): virtual keys with per-key RPM and TPM |
| Best for edge global rate-limit | Cloudflare AI Gateway: account-level edge limit ahead of upstream provider |
| Best for self-hosted or air-gapped | Future AGI Agent Command Center: Apache 2.0; Docker, Kubernetes, air-gapped |

| # | Platform | Best for | License + deployment |
|---|---|---|---|
| 1 | Future AGI Agent Command Center | Per-key + per-tenant + per-user + per-feature limits with weighted fair-share + provider-tier awareness + OTel-native rate-limit events | Apache 2.0; cloud at gateway.futureagi.com/v1 or self-host (Docker, Kubernetes, air-gapped) |
| 2 | Portkey | Managed dashboard with per-virtual-key rate limits + 4-tier budget hierarchy | MIT (open-source gateway) + cloud control plane; PANW acquisition pending |
| 3 | Kong AI Gateway | Platform teams already running Kong who want enterprise rate-limit plugins on the AI route | Apache 2.0 core + commercial Enterprise tier; self-host or Kong Konnect cloud |
| 4 | LiteLLM (post-incident pinned) | Python-first teams pinning a known-good commit; virtual keys with RPM and TPM | MIT (the enterprise dir is licensed separately); pip install |
| 5 | Cloudflare AI Gateway | Edge limiter ahead of the upstream provider | Closed source; Cloudflare cloud only |

Helicone is intentionally not in the ranked list. Mintlify acquired it on March 3, 2026, and the public roadmap has shifted toward a documentation-platform-first stance. Teams already on Helicone should treat it as a planned migration, not a continued procurement.

How Did We Score AI Gateways for Rate Limiting LLM Calls?

We used the Future AGI Rate-Limit Scorecard, tuned for platform-engineering procurement. Most 2026 rate-limit listicles score on “does it have a rate limit” and stop there. They don’t score on whether the limit is per-tenant or per-key, whether the algorithm is sliding-window or fixed-window, whether the gateway is aware of the upstream provider tier, whether the back-pressure signal reaches the client, or whether the platform team can observe the rate-limit event.

The scorecard below has seven axes, which the capability matrix later expands into fourteen comparison dimensions, including the four that decide whether the gateway actually keeps tenants fair and abusers contained in production.

| # | Axis | What we measure (rate-limit lens) |
|---|---|---|
| 1 | Rate-limit granularity | Per-key, per-tenant, per-user, per-feature, per-model, per-route; combinable multi-axis enforcement |
| 2 | Algorithm | Token-bucket, leaky-bucket, sliding-window, fixed-window; series composition; algorithm correctness on the boundary |
| 3 | Burst handling | Queue depth (bounded), reject-after-queue, hybrid; queue-shed policy; soft-cap and hard-cap separation |
| 4 | Fair-share between tenants | Weighted fair queuing; tenant priority class; Gini coefficient observation; starvation guards |
| 5 | Provider-tier awareness | Anthropic tier 1 to 4 RPM ceilings; OpenAI organization-level RPM and TPM ceilings; dynamic headroom propagation to the limiter |
| 6 | Back-pressure signaling | 429, Retry-After, RateLimit-Remaining, RateLimit-Reset headers; streaming-aware back-pressure (SSE); client retry guidance |
| 7 | Observability of rate-limit events | OpenTelemetry span per decision; Prometheus counters by tenant and decision; queue-depth histogram; fairness Gini gauge |

Axes 1, 4, 5, and 7 are the four that decide whether the gateway actually does platform-engineering work in production. The right priority depends on the buyer profile (multi-tenant SaaS versus internal multi-feature platform versus regulated workload with contractual RPM SLAs).

The 14-Dimension Capability Matrix the Rate-Limit SERP Is Missing

Across the five gateways below, Future AGI Agent Command Center leads on combined granularity, algorithm correctness, fair-share, provider-tier awareness, and rate-limit observability. Kong AI Gateway wins on enterprise plugin maturity. Portkey wins on managed dashboard polish. LiteLLM wins on Python-native ergonomics. Cloudflare AI Gateway wins on edge global enforcement.

| Capability | Future AGI ACC | Portkey | Kong AI Gateway | LiteLLM | Cloudflare AI GW |
|---|---|---|---|---|---|
| Per-key limits | Yes | Yes | Yes | Yes | Yes (account-level) |
| Per-tenant limits | Yes (custom property) | Yes (virtual key tier) | Yes (consumer plugin) | Yes (custom property) | No (account only) |
| Per-user limits | Yes (header-based) | Yes (metadata) | Yes (consumer plugin) | Partial (manual) | No |
| Per-feature limits | Yes (route + tag) | Yes (route + tag) | Yes (route plugin) | Partial (tag) | No |
| Token-bucket | Yes | Yes | Yes (Advanced) | Yes | Yes |
| Sliding-window | Yes | Yes | Yes (Advanced) | Partial | Yes |
| Fixed-window | Yes | Yes | Yes | Yes | Yes |
| Burst queue depth | Configurable (bounded) | Configurable | Configurable | Manual | Edge-only |
| Weighted fair-share | Yes | Yes | Yes (Enterprise) | Manual | No |
| Provider-tier awareness | Yes (Anthropic + OpenAI) | Partial | Manual | Manual | No |
| 429 + Retry-After | Yes | Yes | Yes | Yes | Yes |
| RateLimit-Remaining headers | Yes (IETF RateLimit headers draft) | Yes | Yes (Advanced) | Partial | Yes |
| OTel rate-limit span | Yes | Partial | Yes (OTel plugin) | Partial | No |
| Prometheus rate-limit counters | Yes (/-/metrics) | Partial | Yes | Yes | No |

The shape of the matrix is the shape your buying decision will be: no gateway wins every row, and the five capabilities that matter most for rate limiting (per-tenant + per-user granularity, sliding-window correctness, weighted fair-share, provider-tier awareness, and OTel observability) are where the field separates.

How AI Gateways Actually Rate Limit LLM Calls in Production

AI gateways enforce rate limits across seven axes (granularity, algorithm, burst, fair-share, provider-tier awareness, back-pressure, observability), and a real production rate-limit win comes from composing four or five of them correctly, not from setting one number.

Platform teams typically see three failure modes once they reach 5,000 RPM aggregate across tenants without a real rate-limiting gateway: the noisy-neighbor that consumes 80 percent of the upstream provider’s tier 3 RPM, the boundary-spike abuser that fires 2x the configured limit by straddling fixed-window edges, and the silent 429 storm where the upstream provider rate-limits the platform but the gateway doesn’t surface that back-pressure to the client. The breakdown:

  1. Rate-limit granularity. Per-key alone isn’t enough. Per-tenant catches the multi-key tenant that issues 200 keys to its sub-organizations; per-user catches the abusing end-user inside a paying tenant; per-feature catches the new experimental endpoint that bursts under launch traffic. Combine multi-axis: a request is allowed only if every axis allows. Most production teams need three to four axes minimum.
  2. Algorithm choice. Token-bucket fits LLM bursts (one user fires three requests then idles for a minute). Sliding-window fits contract enforcement (the tenant is contracted for 600 RPM, the count must be exact). Fixed-window is the wrong default: a user fires 600 requests at second 59 of one minute and 600 at second 0 of the next, doubling the contracted rate without violating the count. Production teams compose token-bucket per user with sliding-window per tenant; fixed-window is never the right answer for contractual SLAs.
  3. Burst handling. A bounded queue (typically 50 to 200 slots at the gateway) absorbs one-second bursts from paying tenants without 429-ing them, and the queue rejects the moment depth is exceeded. Queue-only is wrong (infinite memory growth under abuse); reject-only is wrong (paying tenants 429 on a normal burst). The right default for paying tenants is queue-then-reject with bounded depth.
  4. Fair-share between tenants. A global rate limit lets one tenant saturate the upstream provider tier at the expense of every other customer. Weighted fair queuing drains tokens in proportion to tenant priority weight, keeps the fairness Gini coefficient under 0.3 on healthy traffic, and emits a per-minute fairness metric. Without fair-share, a noisy neighbor is a churn event for every other tenant.
  5. Provider-tier awareness. Anthropic ships tier 1 (50 RPM and 40,000 input TPM for Claude 3.5 Sonnet) through tier 4 (4,000 RPM, 400,000 input TPM) for the Messages API; OpenAI ships organization-level RPM and TPM that vary by model and usage tier. A 2026 rate-limiting gateway pulls those ceilings into the limiter at config time, propagates headroom changes when the upstream key crosses a tier, and avoids sending requests the upstream provider will 429.
  6. Back-pressure signaling. The gateway returns 429 with Retry-After (seconds), surfaces RateLimit-Remaining and RateLimit-Reset on every response (header names from the IETF RateLimit header fields draft), and for SSE-streaming routes emits an in-stream back-pressure frame before terminating. Without these, the client retries blindly and amplifies the storm.
  7. Observability of rate-limit events. One OpenTelemetry span per rate-limit decision with attributes for tenant, user, key, algorithm, limit, current count, decision, and queue depth, plus Prometheus counters for allow, queue, and reject by tenant. The platform team builds three dashboards: top rejected tenants, queue-depth p99 by route, and fairness Gini by minute. Without this, the first signal of an abuser is a support ticket.
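
The boundary-spike claim in item 2 is easy to check with a few lines of code. The sketch below is a minimal illustration, not any gateway's implementation: `FixedWindowLimiter`, `SlidingWindowLogLimiter`, and `admitted_in_burst` are hypothetical names, timestamps are passed in explicitly so the run is deterministic, and the sliding window is the simple timestamp-log variant.

```python
from collections import deque


class FixedWindowLimiter:
    """Counts requests per aligned window; the count resets at each boundary."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self.window_index = None
        self.count = 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window_s)
        if index != self.window_index:
            # A new aligned window started: forget everything before it.
            self.window_index = index
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLogLimiter:
    """Keeps timestamps of admitted requests inside the trailing window."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self.admitted = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have left the trailing window.
        while self.admitted and self.admitted[0] <= now - self.window_s:
            self.admitted.popleft()
        if len(self.admitted) < self.limit:
            self.admitted.append(now)
            return True
        return False


def admitted_in_burst(limiter, limit: int = 600) -> int:
    """Fire `limit` requests at t=59s and `limit` more at t=60s; count admits."""
    times = [59.0] * limit + [60.0] * limit
    return sum(limiter.allow(t) for t in times)


fixed_total = admitted_in_burst(FixedWindowLimiter(600))
sliding_total = admitted_in_burst(SlidingWindowLogLimiter(600))
print(fixed_total, sliding_total)  # 1200 600
```

The fixed-window limiter admits both bursts because the counter resets at the minute boundary, while the sliding-window log still sees the first 600 requests one second later. The log variant stores one timestamp per admitted request, so high-volume deployments often trade exactness for a counter-based approximation.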

A gateway that ships axes 1 and 2 but skips 4, 5, and 7 is good for a demo and bad for multi-tenant production. The five gateway reviews below are scored against all seven axes, with extra weight on the four (1, 4, 5, and 7) that decide whether the gateway actually keeps tenants fair and abusers contained.
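
Item 3 of the breakdown (queue-then-reject with bounded depth) can be sketched in the same style. This is an assumption-laden illustration rather than a description of any product: `QueueThenRejectLimiter` is a hypothetical class, it pairs a token bucket with a FIFO queue, and the caller supplies the clock.

```python
from collections import deque


class QueueThenRejectLimiter:
    """Token bucket with a bounded FIFO queue in front of it (hypothetical sketch)."""

    def __init__(self, capacity: float, refill_per_s: float, max_queue_depth: int):
        self.capacity = float(capacity)          # burst allowance
        self.refill_per_s = float(refill_per_s)  # steady-state rate
        self.max_queue_depth = max_queue_depth   # hard bound on queued requests
        self.tokens = float(capacity)
        self.last_refill = 0.0
        self.queue = deque()

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last_refill = now

    def submit(self, request_id, now: float) -> str:
        """Return 'allow', 'queue', or 'reject' for one incoming request."""
        self._refill(now)
        # Only bypass the queue when nobody is waiting, to preserve FIFO order.
        if not self.queue and self.tokens >= 1.0:
            self.tokens -= 1.0
            return "allow"
        if len(self.queue) < self.max_queue_depth:
            self.queue.append(request_id)
            return "queue"
        return "reject"  # the caller maps this to a 429 with Retry-After

    def drain(self, now: float) -> list:
        """Release queued requests in arrival order as tokens refill."""
        self._refill(now)
        released = []
        while self.queue and self.tokens >= 1.0:
            self.tokens -= 1.0
            released.append(self.queue.popleft())
        return released


limiter = QueueThenRejectLimiter(capacity=2, refill_per_s=1.0, max_queue_depth=2)
decisions = [limiter.submit(i, now=0.0) for i in range(5)]
print(decisions)  # ['allow', 'allow', 'queue', 'queue', 'reject']
```

The bounded queue keeps memory fixed under abuse, and the reject path only triggers once the queue is full. A production version would also need a maximum wait time per queued request, which this sketch omits.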

Future AGI Agent Command Center: Best Overall for Rate Limiting LLM Calls

Future AGI Agent Command Center tops the 2026 rate-limit list because it bundles every axis of the seven-axis rubric at the same network hop in one Apache-2.0 Go binary you can self-host.

It loses on enterprise plugin maturity to Kong AI Gateway and on managed dashboard polish to Portkey; for buyers whose binding constraint is per-tenant + per-user + per-feature granularity with weighted fair-share, provider-tier awareness for Anthropic and OpenAI, and rate-limit events landing in an existing OpenTelemetry stack, the combined surface still puts it first. The combined surface is documented in the Agent Command Center docs and the source ships at the Future AGI GitHub repo.

Best for. Platform-engineering teams already running OpenTelemetry that want OpenAI-compatible drop-in, fine-grained per-tenant + per-user + per-feature rate limits with weighted fair-share, provider-tier awareness, and rate-limit events emitted into their existing observability stack, without rewriting OpenAI SDK code or operating a Python proxy.

Key strengths.

  • OpenAI-compatible drop-in: change base_url to https://gateway.futureagi.com/v1, keep the existing OpenAI SDK code unchanged; rate-limit policy attaches at the same hop without SDK changes.
  • Multi-axis granularity at one hop: per-key, per-tenant (via custom property), per-user (via header), per-feature (via route + tag), per-model, and per-time-window limits, combinable so that a request passes only when every axis allows it.
  • Algorithm choice: token-bucket for burst tolerance, sliding-window for contract enforcement, leaky-bucket for upstream-throttled queues; compose in series at the same hop.
  • Bounded burst queue: configurable depth (default 100; production-tuned to 50 to 200 based on tenant priority class); queue-then-reject with soft-cap and hard-cap separation.
  • Weighted fair queuing across tenants: assign service weight per virtual key, drain tokens in proportion to weight when the upstream provider tier is saturated, emit fairness Gini gauge per minute. Healthy traffic sits under 0.3 Gini; production alerting fires at 0.6.
  • Provider-tier awareness: configure the upstream Anthropic tier (1 to 4) and the OpenAI organization-level RPM and TPM at the gateway; the limiter never sends a request the upstream will 429.
  • Back-pressure signaling: 429 with Retry-After, RateLimit-Remaining, and RateLimit-Reset headers (IETF RateLimit header fields draft); SSE-streaming-aware back-pressure on streaming routes.
  • OpenTelemetry-native rate-limit event export: one span per decision with tenant_id, user_id, key, algorithm, limit, current_count, decision, and queue_depth; Prometheus counters on /-/metrics for Grafana dashboards. traceAI instruments 50+ AI surfaces across Python, TypeScript, Java, and C# (including the Spring Boot starter, Spring AI, LangChain4j, and Semantic Kernel) with native OpenInference support. Error Feed, the clustering and what-to-fix layer of the eval stack that feeds the self-improving evaluators, sits alongside as a zero-config error monitor. It auto-clusters related per-tenant rate-limit failures (50 traces → 1 issue), writes a root cause, a quick fix, and a long-term recommendation per issue, and tracks a rising, steady, or falling trend per issue, so emerging fairness regressions surface like exceptions rather than staying buried in OTel rows.
  • Rate-limit decision under 5 ms p99 on a well-provisioned Redis (in-memory token-bucket plus Redis-backed sliding-window in parallel). The Future AGI Protect model family runs on a separate hot path at ~65 ms p50 for text and ~107 ms p50 for image (arXiv 2510.13351), so the rate-limit check isn’t coupled to guardrail latency. Protect is Future AGI’s own fine-tuned model family built on Google’s Gemma 3n, with specialized adapters across four safety dimensions (content moderation, bias detection, security/prompt-injection, data privacy/PII). It is natively multi-modal across text, image, and audio: a model family, not a plugin chain.
  • Self-improving loop: the platform team feeds 429 rejection rates, queue-depth p99, and fairness Gini back into agent-opt, which proposes tuning of per-tenant weights and burst-queue depths from the trace. No other gateway closes this rate-limit-tuning feedback loop in one product.
  • Apache 2.0; single Go binary; Docker, Kubernetes, AWS, GCP, Azure, on-prem, air-gapped, cloud at gateway.futureagi.com/v1.
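
To make the weighted fair-share and Gini bullets concrete, here is a minimal sketch of the arithmetic. It is not the Agent Command Center scheduler: `weighted_fair_share` is a generic weighted max-min allocation, `gini` is the standard pairwise-difference formula, and which per-tenant quantity the Gini gauge is computed over is an assumption (the article does not specify it).

```python
def weighted_fair_share(capacity, demands, weights):
    """Weighted max-min allocation (sketch).

    Split capacity in proportion to weight, cap each tenant at its demand,
    and hand any unused share to tenants that still want more.
    """
    allocation = {tenant: 0.0 for tenant in demands}
    active = {t for t, d in demands.items() if d > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(weights[t] for t in active)
        satisfied = set()
        granted = 0.0
        for t in active:
            share = remaining * weights[t] / total_weight
            give = min(share, demands[t] - allocation[t])
            allocation[t] += give
            granted += give
            if allocation[t] >= demands[t] - 1e-9:
                satisfied.add(t)
        remaining -= granted
        if not satisfied:
            break  # every active tenant took its full share; capacity is spent
        active -= satisfied
    return allocation


def gini(values):
    """Gini coefficient: 0.0 is perfectly even; it approaches 1.0 as one value dominates."""
    values = list(values)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    pairwise = sum(abs(x - y) for x in values for y in values)
    return pairwise / (2 * n * total)


# One heavy tenant and two light ones sharing a 600 RPM upstream ceiling.
demands = {"big": 1000, "a": 100, "b": 100}
equal = {"big": 1, "a": 1, "b": 1}
alloc = weighted_fair_share(600, demands, equal)
print(alloc)                # {'big': 400.0, 'a': 100.0, 'b': 100.0}
print(gini([600, 0, 0]))    # one tenant takes everything: about 0.667
```

With equal weights the light tenants are fully served and the heavy tenant absorbs the leftover, which is the behavior a global counter cannot provide. The unfair case where one tenant takes the whole ceiling scores about 0.667 on this formula, above the 0.6 alert threshold quoted in the bullet above.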

Where it falls short. The enterprise plugin library for non-AI REST traffic is intentionally out of scope; teams that need a single plane for REST and AI rate limits will run Kong AI Gateway alongside it. The managed rate-limit dashboard is functional but less polished than Portkey’s at first glance; teams that want a finance-grade native dashboard with zero infrastructure work will reach for Portkey first. Self-host operations cost is non-zero; teams that want a fully managed plane should choose the hosted endpoint at gateway.futureagi.com/v1 rather than self-host.

import os

from openai import OpenAI

client = OpenAI(
    # Read the gateway key from the FAGI_API_KEY environment variable.
    api_key=os.environ["FAGI_API_KEY"],
    base_url="https://gateway.futureagi.com/v1",
)

# Existing OpenAI SDK code unchanged from here.
# Per-tenant (header), per-user (header), per-feature (route + tag), and
# provider-tier-aware limits all apply at the same network hop.
response = client.chat.completions.create(
    model="anthropic/claude-3-5-sonnet",
    messages=[{"role": "user", "content": "Summarise this support ticket."}],
    extra_headers={
        "X-Tenant-Id": "acme-corp",
        "X-User-Id": "user-29384",
        "X-Feature": "support-copilot",
    },
)

# Response carries the back-pressure signals as HTTP headers
# (readable via client.chat.completions.with_raw_response.create):
#   RateLimit-Remaining: 142
#   RateLimit-Reset: 27
# A 429 carries Retry-After in seconds.
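
On the client side, those headers only help if the retry logic reads them. The helper below is a minimal sketch of that step, independent of any SDK: `retry_delay_seconds` is a hypothetical function that takes a plain header mapping, honors Retry-After in either delta-seconds or HTTP-date form, and falls back to capped exponential backoff with jitter when the header is missing. The 0.5-second base and 60-second cap are illustrative assumptions.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(headers, attempt, now=None, cap_s=60.0):
    """Return how long to wait before retrying a 429 (hypothetical helper)."""
    value = headers.get("Retry-After") or headers.get("retry-after")
    if value is not None:
        value = value.strip()
        if value.isdigit():
            # Delta-seconds form, e.g. "Retry-After: 27".
            return float(value)
        try:
            # HTTP-date form, e.g. "Thu, 01 Jan 2026 00:00:30 GMT".
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return max(0.0, (when - now).total_seconds())
    # No usable header: capped exponential backoff with full jitter.
    base = min(cap_s, 0.5 * (2 ** attempt))
    return random.uniform(0.0, base)


print(retry_delay_seconds({"Retry-After": "27"}, attempt=0))  # 27.0
```

Preferring the server's value over a local guess is what keeps clients from amplifying a 429 storm; the jittered fallback spreads retries out when the gateway gives no guidance.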

Use case fit. Strong for OpenTelemetry-first platform teams, multi-tenant SaaS with per-customer fair-share enforcement, fintech with contractual RPM SLAs, and platform teams that want trace plus eval plus gateway plus rate-limit-tuning feedback loop in one Apache-2.0 platform with hybrid local and cloud deployment. Less optimal for teams that want a fully managed rate-limit dashboard before writing any infrastructure code.

Pricing and deployment. Apache 2.0 single Go binary; cloud-hosted endpoint at https://gateway.futureagi.com/v1 or self-host (Docker, Kubernetes, air-gapped).

Verdict. The strongest single pick when the 2026 rate-limit story is “we want per-tenant + per-user + per-feature limits with weighted fair-share, provider-tier awareness for Anthropic and OpenAI, and rate-limit events in our existing OpenTelemetry stack, in one Apache-2.0 binary, without rewriting OpenAI SDK calls or operating a Python proxy.”

Portkey: Best for Managed Rate-Limit Dashboard Out of the Box

Portkey is the strongest pick when you want per-virtual-key rate limits plus a four-tier budget hierarchy plus a managed cost-and-rate dashboard out of the box. It’s what most production teams reach for when “we need rate-limit enforcement and a dashboard next week” is the brief, and it has the largest adapter library on the routed rate-limit path.

Best for. Multi-tenant SaaS or internal multi-product platforms that need fine-grained per-virtual-key rate limits, a four-tier budget hierarchy, and a usable rate-limit dashboard without writing a custom exporter.

Key strengths.

  • Per-key, per-virtual-key, per-model, and per-time-window rate limits; the most fine-grained native-dashboard hierarchy on this list.
  • Token-bucket plus sliding-window algorithms; configurable burst tolerance per virtual key.
  • 429 with Retry-After, RateLimit-Remaining, and RateLimit-Reset headers; consistent back-pressure signaling.
  • Large adapter library (250+ providers, including private OSS deployments); rate-limit policy works the same across providers.
  • Usable native dashboard for rate-limit attribution by tenant, feature, and route, without writing a custom exporter.
  • Open-source gateway core (github.com/Portkey-AI/gateway); production teams self-host the gateway and run the control plane in Portkey cloud.

Where it falls short. Palo Alto Networks announced intent to acquire Portkey on April 30, 2026; the press release says the deal is expected to close in PANW fiscal Q4 2026 and that Portkey will become the AI Gateway for Prisma AIRS. Verify standalone-product continuity before signing a multi-year contract. Observability is dashboard-first; the OpenTelemetry export of rate-limit events exists but is less first-class than the native dashboard, so OTel-first stacks end up duplicating telemetry. Provider-tier awareness is partial; the limiter doesn’t auto-detect Anthropic tier 3 versus tier 4 without manual configuration. The control plane is closed; check whether the open-source core covers air-gapped requirements.

Use case fit. Strong for multi-tenant SaaS, fintech with per-customer rate-limit attribution, and platform teams running multiple AI products. Less optimal for teams that want their rate-limit telemetry flowing into an existing OpenTelemetry collector and Grafana stack as a first-class output, or for teams that need the gateway to know about Anthropic tier ceilings out of the box.

Pricing and deployment. Open-source core (self-hosted) plus commercial cloud control plane; enterprise tiers.

Verdict. The most mature per-virtual-key rate-limit hierarchy with a managed dashboard in 2026. Choose with eyes open on the Palo Alto integration; the next twelve months will tell whether the standalone gateway product survives the merge.

Kong AI Gateway: Best for Platform Teams Already Running Kong

Kong AI Gateway is the AI-specific overlay on the Kong API gateway. Teams already running Kong for REST traffic get the same rate-limit plugins (rate-limiting, rate-limiting-advanced, graphql-rate-limiting-advanced) applied to the AI route, plus AI-specific plugins (ai-proxy, ai-rate-limiting-advanced, ai-prompt-template) for upstream provider abstraction.

Best for. Platform-engineering teams that already operate Kong for REST APIs and want the same rate-limit primitives on the AI route without operating a second gateway runtime.

Key strengths.

  • Mature enterprise rate-limit plugins: Local (in-memory, single-instance), Cluster (counters shared via the Kong datastore), and Redis (shared via Redis) policy stores; pick the storage tier that fits your cluster topology.
  • Per-consumer, per-credential, per-service, and per-route rate limits; combinable multi-axis enforcement via the Consumer entity.
  • Sliding-window and fixed-window algorithms in the Advanced plugin; configurable window size and sync rate.
  • 429 with Retry-After, RateLimit-Remaining, and RateLimit-Reset headers; consistent back-pressure signaling following the IETF RateLimit header fields draft.
  • Native OpenTelemetry export via the opentelemetry plugin; rate-limit decisions surface as spans on the request trace.
  • Kong Konnect managed cloud plus on-prem Kong Gateway Enterprise; the same plugin config works in both deployments.

Where it falls short. The AI-specific layer (ai-proxy, ai-rate-limiting-advanced) lags the rest of the Kong plugin ecosystem in cadence; new provider integrations land slower than at Portkey or LiteLLM. Provider-tier awareness (Anthropic tier 1 to 4 RPM, OpenAI organization-level RPM and TPM) is manual; the team configures the upstream ceiling as a static plugin parameter rather than pulling it from the provider. Weighted fair-share across tenants is an Enterprise tier feature in the Advanced plugin; the open-source Local and Cluster policy plugins ship hard-cap-only enforcement, no fair queuing. The control plane footprint is heavier than a single Go binary; for teams that don’t already run Kong, the operations cost isn’t justified by the AI rate-limit feature surface alone.
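
When the upstream ceiling is a static parameter, the team owns the arithmetic that keeps tenant limits under it. The sketch below shows one way to do that and is not Kong configuration: `ProviderCeiling`, `ANTHROPIC_CEILINGS`, and `clamp_tenant_rpm` are hypothetical names, the tier figures are the ones quoted earlier in this article (tiers 2 and 3 are left out because the article does not state them), and the 90 percent headroom is an illustrative assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderCeiling:
    rpm: int
    input_tpm: int


# Figures quoted earlier in this article; confirm current values with the
# provider before relying on them. Tiers 2 and 3 are intentionally omitted.
ANTHROPIC_CEILINGS = {
    1: ProviderCeiling(rpm=50, input_tpm=40_000),
    4: ProviderCeiling(rpm=4_000, input_tpm=400_000),
}


def clamp_tenant_rpm(tenant_rpm, ceiling, headroom_pct=90):
    """Scale tenant RPM limits down proportionally when their sum would
    exceed the usable share of the upstream ceiling."""
    usable = ceiling.rpm * headroom_pct // 100
    requested = sum(tenant_rpm.values())
    if requested <= usable:
        return dict(tenant_rpm)
    # Integer arithmetic keeps the result deterministic and never rounds up.
    return {tenant: limit * usable // requested for tenant, limit in tenant_rpm.items()}


contracted = {"acme-corp": 3000, "beta-inc": 3000}
print(clamp_tenant_rpm(contracted, ANTHROPIC_CEILINGS[4]))
# {'acme-corp': 1800, 'beta-inc': 1800}
```

The limitation of the static approach is visible here: if the upstream key moves to a different tier, someone has to change the table and redeploy, which is the gap provider-tier awareness is meant to close.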

Use case fit. Strong for platform teams that already operate Kong for REST traffic, regulated environments where the existing Kong audit and compliance surface is a hard requirement, and enterprises with an existing Kong Konnect contract. Less optimal for teams that want a single Apache-2.0 Go binary or for teams where Anthropic tier awareness is a hard requirement.

Pricing and deployment. Apache 2.0 core (Kong Gateway OSS) plus commercial Enterprise tier (Advanced plugins, fair queuing, support); Kong Konnect cloud or self-host.

Verdict. The right pick when Kong is already the platform-engineering plane and the AI route shouldn’t be a second gateway runtime. Choose elsewhere when the AI-specific rate-limit features (Anthropic tier awareness, weighted fair-share at the AI layer) are the binding constraint.

LiteLLM: Best for Python-First Teams Post-Incident

LiteLLM is the Python-first proxy that broke open the multi-provider unified-API category. It exposes 20+ providers behind OpenAI-compatible endpoints via six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure) plus OpenAI-compatible presets and self-hosted backends, and it ships virtual keys with per-key RPM and TPM budgets that work as a rate-limit primitive. After the March 24, 2026 supply-chain incident, the answer to “is it still safe to adopt?” is “yes, with commit pinning or an upgrade past 1.83.7.”

Best for. Python-first teams that already deploy a FastAPI or uvicorn surface, want broad provider coverage, virtual keys with per-key RPM and TPM, and are willing to pin commit hashes (or upgrade past 1.83.7) and hold their own upstream provider DPA.

Key strengths.

  • Virtual keys with per-key RPM and TPM budgets; budget alerts on threshold.
  • 429 with Retry-After headers; consistent back-pressure on the OpenAI-compatible surface.
  • Broadest provider coverage of any single project on this list: 20+ providers via six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure), plus OpenAI-compatible presets and self-hosted backends.
  • MIT (the enterprise dir is licensed separately); trivial to fork or audit.
  • Native fit with Python observability stacks (Prometheus exporter, OpenTelemetry middleware).
  • Active maintainer community; easy to extend with custom rate-limit policies in Python.

Where it falls short. The headline issue is the March 24, 2026 PyPI supply-chain compromise: versions 1.82.7 and 1.82.8 were published by an attacker who had taken over the maintainer’s PyPI token. The compromised package exfiltrated SSH keys, cloud credentials, and Kubernetes configs to an attacker-controlled endpoint, per the Datadog Security Labs writeup of the TeamPCP campaign. Pin commit hashes, scan for affected versions in the dependency tree, rotate any credentials touched by an affected install, and upgrade past 1.83.7. Beyond the incident, the Python runtime delivers materially lower rate-limit-decision throughput than Go-binary alternatives at high concurrency (roughly 10x to 30x slower per decision for a Python token-bucket against a Go token-bucket on the same hardware). Sliding-window enforcement is partial; the default implementation is a per-key counter that approximates sliding-window rather than enforcing it precisely. Weighted fair-share is manual; the team writes the weighting policy in Python middleware. Provider-tier awareness (Anthropic tier 1 to 4 RPM) is manual configuration via virtual-key budget config.
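
The "scan for affected versions" step can start with a short script. The sketch below is a first-pass check under stated assumptions, not a complete audit: the helper names are hypothetical, it only inspects the current Python environment and `==` pins in requirements-style text, and it does not read lockfiles, container images, or transitive resolution. The two version strings come from the incident description above.

```python
from importlib import metadata

# Versions named in the incident description above.
COMPROMISED_LITELLM_VERSIONS = {"1.82.7", "1.82.8"}


def is_compromised(version: str) -> bool:
    return version.strip() in COMPROMISED_LITELLM_VERSIONS


def installed_litellm_version():
    """Return the litellm version installed in this environment, or None."""
    try:
        return metadata.version("litellm")
    except metadata.PackageNotFoundError:
        return None


def scan_requirements(lines):
    """Return requirement lines that pin litellm to a compromised version."""
    hits = []
    for line in lines:
        spec = line.split("#", 1)[0].strip().lower()  # drop trailing comments
        if "==" not in spec:
            continue
        name, version = spec.split("==", 1)
        name = name.split("[", 1)[0].strip()          # drop extras like [proxy]
        version = version.split(";", 1)[0].strip()    # drop environment markers
        if name == "litellm" and is_compromised(version):
            hits.append(line.strip())
    return hits


installed = installed_litellm_version()
if installed is not None and is_compromised(installed):
    print(f"litellm {installed} is a compromised release: rotate credentials")
```

A hit from either check means the credential-rotation step applies, since the malicious package ran at install or import time rather than waiting to be called.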

Use case fit. Strong for Python-first teams, ML platform teams that already manage Python services, and teams whose buying constraint is broad provider coverage in a fork-friendly license. Less optimal where rate-limit-decision throughput at over 10,000 req/s matters or where you’re pinned to a managed runtime that doesn’t allow commit-pinned dependencies.

Pricing and deployment. MIT (enterprise dir licensed separately); pip install. Enterprise cloud tier exists.

Verdict. Still the broadest provider coverage on the list, but the March 2026 supply-chain incident shifts it from “default pick” to “pin commits or upgrade past 1.83.7 and audit.” Pair with Sigstore verification and dependency-pinning enforcement.

Cloudflare AI Gateway: Best for Edge Global Rate-Limit

Cloudflare AI Gateway is the edge-first entry on the list. It sits in front of upstream provider endpoints (OpenAI, Anthropic, Google, Replicate, plus Workers AI), enforces account-level rate limits at the CDN hop, and surfaces request analytics in the Cloudflare dashboard. It’s the gateway most often cited when “we want a global request cap ahead of the upstream provider with zero infrastructure” is the brief.

Best for. Teams that want a global request cap on the CDN hop ahead of the upstream provider, especially teams already running Cloudflare for their web traffic and willing to live within an account-level (not per-tenant) limit shape.

Key strengths.

  • Drop-in OpenAI-compatible proxy; change the base URL to the AI Gateway hostname and traffic flows through the edge limiter.
  • Account-level RPM and TPM caps with token-bucket and sliding-window options at the Cloudflare edge.
  • 429 with Retry-After and CDN-native back-pressure signaling.
  • Free tier with generous account-level limits; paid tier scales linearly.
  • Native Cloudflare analytics dashboard with per-provider, per-model, and per-account request and token counts.
  • Zero infrastructure to operate; runs entirely on Cloudflare’s global edge.

Where it falls short. Per-tenant, per-user, and per-feature granularity is absent at the gateway layer; the limiter is account-level and the platform team has to add tenancy enforcement somewhere else (typically in the application or at a second hop). Provider-tier awareness for Anthropic tier 1 to 4 and OpenAI organization-level RPM is absent; the limiter treats every upstream as a generic HTTP endpoint. Weighted fair-share across tenants is absent. OpenTelemetry export of rate-limit events isn’t first-class; the Cloudflare dashboard is the primary observability surface and exporting spans into an existing OTel collector is manual. Closed source; the platform team is betting on the Cloudflare team’s roadmap and uptime for AI-specific features.
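
For teams that keep an account-level edge limiter and add tenancy enforcement in the application, the missing piece looks roughly like the sketch below. It is an in-process illustration with hypothetical names (`TokenBucket`, `MultiAxisLimiter`), not a Cloudflare feature: one token bucket per (axis, key) pair, a request passes only when every axis has a token, and the clock is passed in by the caller. It is not thread-safe and keeps state in one process, so a real deployment would move the buckets to a shared store.

```python
class TokenBucket:
    def __init__(self, capacity: float, refill_per_s: float, now: float = 0.0):
        self.capacity = float(capacity)
        self.refill_per_s = float(refill_per_s)
        self.tokens = float(capacity)
        self.updated = now

    def has_token(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated = now
        return self.tokens >= 1.0

    def take(self) -> None:
        self.tokens -= 1.0


class MultiAxisLimiter:
    """Allow a request only if every configured axis allows it (sketch)."""

    def __init__(self, limits):
        # limits maps axis name -> (bucket capacity, refill per second)
        self.limits = limits
        self.buckets = {}

    def allow(self, keys, now: float) -> bool:
        # keys maps axis name -> identifier, e.g. taken from the
        # X-Tenant-Id and X-User-Id request headers.
        selected = []
        for axis, key in keys.items():
            bucket_id = (axis, key)
            if bucket_id not in self.buckets:
                capacity, refill = self.limits[axis]
                self.buckets[bucket_id] = TokenBucket(capacity, refill, now)
            selected.append(self.buckets[bucket_id])
        # Check every axis before consuming, so a rejection on one axis
        # does not burn tokens on another.
        if all(bucket.has_token(now) for bucket in selected):
            for bucket in selected:
                bucket.take()
            return True
        return False


limiter = MultiAxisLimiter({"tenant": (3, 1.0), "user": (2, 1.0)})
user_1 = {"tenant": "acme-corp", "user": "user-1"}
results = [limiter.allow(user_1, now=0.0) for _ in range(3)]
print(results)  # [True, True, False]
```

The third request is rejected on the per-user axis even though the tenant still has budget, and because tokens are only taken after every axis passes, that rejection leaves the tenant bucket untouched for other users.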

Use case fit. Strong for teams already running Cloudflare for their web traffic, early-stage teams that want a global request cap with zero infrastructure, and teams whose threat model is volumetric abuse at the edge rather than per-tenant fair-share. Less optimal for multi-tenant SaaS where per-tenant fair-share is the binding constraint, or for teams that want provider-tier-aware enforcement against Anthropic tier 3 or OpenAI organization-level limits.

Pricing and deployment. Closed source; Cloudflare cloud only. Free tier with account-level limits; paid tier scales linearly.

Verdict. The lowest-friction way to put a global request cap ahead of the upstream provider. Not the right gateway when per-tenant rate-limit enforcement or provider-tier awareness is the brief.

The 2026 Gateway Migration and Trust Cohort

Every gateway listicle on the SERP is treating these as if they didn’t happen. They did, and they reshape the rate-limit procurement question for 2026.

  • Helicone joining Mintlify (March 3, 2026). Helicone acquired by Mintlify; public roadmap shifts toward documentation-platform-first. Existing Helicone users should treat this as a planned migration window, not a continued procurement.
  • LiteLLM PyPI supply-chain compromise (March 24, 2026). Versions 1.82.7 and 1.82.8 were compromised on PyPI; the malicious package exfiltrated SSH keys, cloud credentials, and Kubernetes configs to an attacker-controlled endpoint. Pin commit hashes, scan dependency trees, rotate any credentials accessible to an affected install, and upgrade past 1.83.7. Primary source: the Datadog Security Labs writeup.
  • Anthropic MCP STDIO RCE class (April 2026). OX Security disclosed an STDIO transport class flaw affecting 7,000+ publicly accessible MCP servers and 150M+ downstream downloads, with multiple CVEs filed across downstream implementations. MCP gateways are now expected to enforce least-privilege tool access, OAuth 2.1, and Streamable HTTP transport. Rate limits on MCP tool calls per agent session became a default expectation overnight. Primary coverage: the Hacker News report on the Anthropic MCP design vulnerability.
  • Portkey acquired by Palo Alto Networks (April 30, 2026). Acquisition announced; the gateway will become the AI Gateway for Prisma AIRS, with close expected in PANW fiscal Q4 2026. Standalone-product continuity is pending integration; verify roadmap before signing a multi-year contract. Primary source: the Palo Alto Networks press release.

The practical takeaway: for the next twelve months, license clarity, acquisition independence, and supply-chain pinning are part of the rate-limit procurement decision. A cheap gateway you have to migrate off in six months isn’t cheap, and a gateway whose package on PyPI was malware for 48 hours isn’t “still a default.”

Decision Framework: Which Rate-Limiting Gateway Is Right for You in 2026?

The buyer profile drives the pick more than the feature matrix does. OpenTelemetry-first platform teams pick Future AGI Agent Command Center; multi-tenant SaaS teams that want a managed dashboard pick Portkey; teams already running Kong for REST pick Kong AI Gateway; Python-first ML-platform teams pick LiteLLM; edge-first teams already on Cloudflare pick Cloudflare AI Gateway.

Choose Future AGI Agent Command Center for rate limiting if:
  - You need per-tenant + per-user + per-feature granularity with weighted fair-share
  - Anthropic tier 1 to 4 RPM or OpenAI org-level RPM awareness is required
  - You want rate-limit events as OpenTelemetry spans in your existing collector
  - Self-improving loop on per-tenant weight tuning matters to you
  - Apache 2.0 single Go binary is the deployment shape you want

Choose Portkey if:
  - Managed dashboard out of the box is the binding requirement
  - You can live with partial provider-tier awareness and partial OTel export
  - The PANW acquisition timeline is acceptable for your contract horizon

Choose Kong AI Gateway if:
  - Kong is already the platform-engineering plane for REST traffic
  - Enterprise plugin maturity outweighs AI-specific feature cadence
  - Weighted fair-share at Enterprise tier is acceptable budget-wise

Choose LiteLLM (pinned past 1.83.7) if:
  - Python-first stack and broad provider coverage are the binding constraints
  - You can hold commit pins and Sigstore-verified dependencies as a process

Choose Cloudflare AI Gateway if:
  - Account-level edge limit is sufficient (no per-tenant requirement)
  - Zero infrastructure to operate is the constraint
  - Cloudflare is already the CDN for your web traffic

| If you are a… | Pick | Why |
| --- | --- | --- |
| Platform team on OpenTelemetry with multi-tenant SaaS | Future AGI Agent Command Center | Per-tenant + per-user + per-feature granularity + weighted fair-share + provider-tier awareness + OTel-native rate-limit events |
| Fintech with contractual RPM SLAs and audit trail | Future AGI Agent Command Center | Sliding-window correctness + per-tenant fair-share + span-level rate-limit attribution |
| Air-gapped or on-prem regulated environment | Future AGI Agent Command Center | Apache 2.0 single Go binary; Docker, Kubernetes, air-gapped |
| Multi-tenant SaaS that wants a managed dashboard out of the box | Portkey | Most fine-grained per-virtual-key hierarchy + native dashboard (verify PANW integration) |
| Platform team already running Kong for REST | Kong AI Gateway | Same plugin model on the AI route; enterprise-grade Advanced plugins |
| Python-first ML platform team | LiteLLM (1.82.6 pin or 1.83.7+) | Broad provider coverage with per-key RPM and TPM; pin or upgrade past March 24 |
| Edge-first team already on Cloudflare | Cloudflare AI Gateway | Account-level edge limit ahead of upstream; zero infrastructure |

Common Rate-Limit Implementation Mistakes Platform Teams Make

Even when teams pick the right gateway, they ship the wrong limiter configuration. Five mistakes account for most of the production incidents we see.

  1. Shipping a single global RPM counter and calling it rate limiting. A global counter doesn’t catch the noisy-neighbor tenant, doesn’t enforce fair-share, and doesn’t surface per-tenant rejection rates. The Gini coefficient on inbound traffic can climb from a healthy 0.2 to over 0.7 within a week of launch; every other tenant feels the impact, and the first signal is a churn meeting. Fix: multi-axis enforcement with at least per-tenant and per-user buckets, weighted fair queuing, and per-tenant rejection counters on the dashboard.
  2. Using fixed-window when the SLA is contractual. Fixed-window lets a tenant burn 2x the contracted rate by straddling two windows. On a 600 RPM contract, a tenant can send 600 requests at second 59 of window 1 and another 600 at second 0 of window 2, which puts 1,200 requests through in about two seconds. If the contract has a financial penalty for breach, fixed-window is a liability. Fix: sliding-window for contract enforcement, token-bucket for burst tolerance, compose in series.
  3. Reject-only burst policy on paying tenants. A paying tenant fires three requests in a second and the third gets a 429. Their client retries, the limiter rejects again, and the tenant opens a ticket about “your gateway dropped my request.” The right default for paying tenants is a bounded queue with depth 50 to 200 that absorbs one-second bursts and rejects only after the queue is full. Fix: queue-then-reject with bounded depth, configurable per virtual key, observability on queue-depth p99.
  4. Ignoring provider-tier ceilings. The gateway is configured for 5,000 aggregate RPM, but the upstream Anthropic key is tier 2 (1,000 RPM Claude 3.5 Sonnet). Anthropic 429s the platform, the gateway doesn’t propagate the back-pressure, and the platform spends a week debugging “intermittent failures.” Fix: configure the upstream provider tier at the gateway, propagate headroom changes when the key crosses a tier, never send a request the upstream will reject.
  5. No span-level observation of rate-limit events. The first signal of an abuser is a support ticket. The platform team has no dashboard that shows top-10 rejected tenants, queue-depth p99 by route, or fairness Gini by minute. By the time the abuser is identified, the data window has rolled. Fix: one OpenTelemetry span per rate-limit decision with attributes for tenant, user, decision, and queue depth; three platform dashboards; alert on Gini over 0.6 and queue-depth p99 over 100.
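
The boundary spike in mistake 2 is easy to reproduce. The sketch below is a minimal illustration, not any gateway's implementation: both limiter classes and the `admitted_in_boundary_burst` helper are hypothetical names, and the sliding window is the simple timestamp-log variant. It replays 600 requests at second 59 and 600 more at second 60 against a 600 RPM limit.

```python
from collections import deque


class FixedWindowLimiter:
    """Counts requests per aligned window; the counter resets at each boundary."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self.window_index = None
        self.count = 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window_s)
        if index != self.window_index:
            # New aligned window: the previous count is forgotten entirely.
            self.window_index = index
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLogLimiter:
    """Keeps timestamps of admitted requests from the trailing window."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self.admitted = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the trailing window.
        while self.admitted and self.admitted[0] <= now - self.window_s:
            self.admitted.popleft()
        if len(self.admitted) < self.limit:
            self.admitted.append(now)
            return True
        return False


def admitted_in_boundary_burst(limiter) -> int:
    # 600 requests at t=59s (end of window 1), 600 more at t=60s (start of window 2).
    arrivals = [59.0] * 600 + [60.0] * 600
    return sum(limiter.allow(t) for t in arrivals)


fixed_total = admitted_in_boundary_burst(FixedWindowLimiter(limit=600))
sliding_total = admitted_in_boundary_burst(SlidingWindowLogLimiter(limit=600))
print(fixed_total, sliding_total)  # prints "1200 600"
```

The fixed-window limiter admits all 1,200 requests because its counter resets at the boundary; the sliding-window log admits only the contracted 600. A timestamp log costs memory proportional to the limit, so production limiters often use an approximated sliding-window counter instead.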

Future AGI Rate-Limit Implementation Walk-Through

A production-grade rate-limit policy at the gateway is four pieces composed together: multi-axis enforcement, algorithm choice per axis, provider-tier awareness, and observability. The Agent Command Center configures all four in one YAML at the gateway hop and pipes the resulting telemetry back into the optimizer.

# gateway.yaml - rate-limit policy
rate_limits:
  # Per-tenant token-bucket for burst tolerance
  - axis: tenant
    header: X-Tenant-Id
    algorithm: token_bucket
    bucket_size: 200
    refill_rate: 100  # tokens per minute
    queue_depth: 100
    on_full: reject_with_retry_after

  # Per-tenant sliding-window for contract enforcement
  - axis: tenant
    header: X-Tenant-Id
    algorithm: sliding_window
    window: 60s
    limit_lookup: tenant_contract  # pulled from control plane
    on_breach: reject_with_retry_after

  # Per-user token-bucket for abuse prevention inside a tenant
  - axis: user
    header: X-User-Id
    algorithm: token_bucket
    bucket_size: 30
    refill_rate: 15  # tokens per minute
    queue_depth: 0
    on_full: reject_with_retry_after

  # Per-feature sliding-window on the new experimental route
  - axis: route_tag
    tag: support-copilot-experimental
    algorithm: sliding_window
    window: 60s
    limit: 200
    on_breach: reject_with_retry_after

# Provider-tier awareness pulled into the limiter
provider_tiers:
  - upstream: anthropic
    tier: 3
    rpm: 2000
    input_tpm: 200000
    output_tpm: 50000
  - upstream: openai
    org_id: org-acme
    rpm: 10000
    tpm: 2000000

# Weighted fair-share across tenants when upstream is saturated
fair_share:
  algorithm: weighted_fair_queuing
  weights:
    enterprise: 4
    business: 2
    free: 1
  gini_alert_threshold: 0.6

# Observability
telemetry:
  otlp_endpoint: $OTLP_ENDPOINT
  prometheus_metrics: /-/metrics
  span_attributes:
    - tenant_id
    - user_id
    - key
    - algorithm
    - limit
    - current_count
    - decision
    - queue_depth

Three production dashboards land for free once the telemetry flows:

  • Top rejected tenants by hour. Sort by 429 count, descending. The top tenant in a healthy week sits under 1 percent of total rejections; over 10 percent flags an abuser or a tenant whose contract limit needs renegotiation.
  • Queue-depth p99 by route. Healthy routes sit under 30. Over 80 flags the limiter as the bottleneck; either raise the bucket size or shed traffic upstream.
  • Fairness Gini by minute. Healthy traffic sits under 0.3. Over 0.6 fires the alert and the platform team checks for a single tenant saturating the upstream tier.
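
The Gini gauge on the third dashboard can be computed from per-tenant request counts in each one-minute bucket. This is a minimal sketch, assuming the counts are already aggregated per tenant; the `gini` function and the sample numbers are illustrative and not taken from the Agent Command Center.

```python
def gini(counts):
    """Gini coefficient of non-negative per-tenant counts (0.0 means perfectly even)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula on ascending values, with ranks 1..n.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


even_minute = gini([100, 100, 100, 100])   # every tenant sends the same volume
hogged_minute = gini([0, 0, 0, 400])       # one tenant sends everything
typical_minute = gini([45, 50, 55, 60])    # mild, healthy variation

print(even_minute, hogged_minute)  # prints "0.0 0.75"
```

With n tenants the coefficient tops out at (n - 1) / n, so a four-tenant deployment cannot exceed 0.75; alert thresholds such as 0.6 are more meaningful once the tenant count is well above a handful.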

The closed loop is what no other gateway on this list ships. The Future AGI optimizer reads the 429 rate, queue-depth p99, and Gini gauge from the OpenTelemetry trace, proposes adjustments to per-tenant weights, bucket sizes, and queue depths, and surfaces them as a tuning candidate the platform team reviews and applies. The same trace surface that flags a fairness regression also produces the labeled dataset that agent-opt uses to revise the limiter. That makes the gateway a closed-loop self-improvement layer for rate limits, not a static config file.

For deeper reads on the underlying Protect performance numbers, see the arXiv writeup on the Future AGI inline-guardrail benchmark at 65 ms; the rate-limit hot path sits well below that on a separate decision plane.

Rate limiting LLM calls in 2026 isn’t a single counter. It’s a stack: per-tenant + per-user + per-feature granularity, sliding-window plus token-bucket plus leaky-bucket algorithms composed in series, weighted fair-share across tenants, provider-tier awareness for Anthropic and OpenAI, back-pressure signaling that reaches the client, and OpenTelemetry-native rate-limit event export feeding an optimizer that tunes the policy. All of it should run at the same network hop, in a product whose license and ownership aren’t about to change inside an acquirer.

Future AGI Agent Command Center is the strongest single pick when the buying constraint is one self-hostable Apache-2.0 binary that covers every axis of the rate-limit rubric. Teams already on Portkey should weigh the Palo Alto integration timeline; Kong teams should validate the Advanced plugin tier; Python-first teams should pin LiteLLM commits or upgrade past 1.83.7; edge-first teams should compare Cloudflare AI Gateway’s account-level limit against per-tenant requirements before committing.

For deeper reads: the Agent Command Center docs, the Future AGI GitHub repo, the Future AGI observability docs, the Future AGI Protect docs, the Future AGI Evaluation docs, and the OpenTelemetry GenAI semantic conventions.

Try Future AGI Agent Command Center free: drop-in OpenAI-compatible routing, per-tenant + per-user + per-feature rate limits, weighted fair-share scheduling, Anthropic tier 1 to 4 and OpenAI organization-level RPM awareness, queue-then-reject back-pressure with Retry-After and RateLimit-Remaining headers, and OpenTelemetry-native rate-limit event export in one Apache-2.0 Go binary.


Frequently asked questions

What Is the Difference Between Token-Bucket and Sliding-Window Rate Limiting for LLM Calls?
Token-bucket lets bursts through up to a fixed bucket size and refills at a constant rate, which fits LLM workloads where one user fires three requests in a second and then idles for a minute. Sliding-window counts every request in a rolling time interval and rejects the moment the count crosses the threshold, which fits abuse prevention and contract enforcement where the exact rate matters. Most production teams run them in series at the gateway: a token-bucket per user for burst tolerance and a sliding-window per tenant for hard contract limits. Fixed-window counters look simpler but create the well-known boundary spike where a user can fire 2x the limit by straddling two windows; do not ship fixed-window for any quota that has a contractual SLA.
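
To make the burst behavior concrete, here is a minimal token-bucket sketch. It assumes the per-user numbers from the walk-through YAML (bucket size 30, refill 15 tokens per minute) and an injected clock; the `TokenBucket` class is a hypothetical illustration, not the gateway's code.

```python
class TokenBucket:
    """Admits bursts up to bucket_size, then refills at a constant rate."""

    def __init__(self, bucket_size: float, refill_per_minute: float, now: float = 0.0):
        self.capacity = bucket_size
        self.refill_per_s = refill_per_minute / 60.0
        self.tokens = bucket_size  # start full so the first burst is absorbed
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Credit tokens for the time elapsed since the last check, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(bucket_size=30, refill_per_minute=15)

# A burst of 40 requests at t=0: the first 30 drain the bucket, the rest are refused.
burst_admitted = sum(bucket.allow(now=0.0) for _ in range(40))

# One minute later 15 tokens have been refilled, so 15 of the next 20 get through.
later_admitted = sum(bucket.allow(now=60.0) for _ in range(20))

print(burst_admitted, later_admitted)  # prints "30 15"
```

Passing `cost` greater than 1 lets the same bucket meter tokens-per-minute instead of requests-per-minute.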
How Should an AI Gateway Handle Anthropic Tier 1 to 4 RPM and OpenAI Organization-Level Limits?
The gateway needs to know which provider tier each upstream key sits in and surface the live headroom as a queue depth before the request leaves the hop. Anthropic publishes tier 1 (50 RPM on Claude 3.5 Sonnet) through tier 4 (4,000 RPM) for the Messages API; OpenAI ships organization-level RPM and TPM that vary by model and usage tier. A 2026 rate-limiting gateway treats those as the ceiling, then enforces a per-tenant fair-share below that ceiling, then surfaces a 429 with Retry-After to the client when the bucket empties. Future AGI Agent Command Center pulls provider-tier ceilings into the limiter at config time; LiteLLM exposes the same shape via virtual keys with manual tier-aware budget config; Cloudflare AI Gateway leans on its own edge limiter and is less aware of provider-side ceilings.
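
One simple way to enforce fair-share below the provider ceiling is to split the ceiling into static per-tenant allocations by weight. The sketch below assumes a 2,000 RPM upstream ceiling and 4/2/1 weights as in the walk-through YAML; `split_ceiling` and the tenant names are hypothetical.

```python
def split_ceiling(ceiling_rpm: int, tenant_weights: dict[str, int]) -> dict[str, int]:
    """Divide an upstream RPM ceiling across tenants in proportion to weight."""
    total_weight = sum(tenant_weights.values())
    if total_weight <= 0:
        raise ValueError("at least one tenant needs a positive weight")
    # Floor each share so the allocations can never sum past the upstream ceiling.
    return {
        tenant: ceiling_rpm * weight // total_weight
        for tenant, weight in tenant_weights.items()
    }


shares = split_ceiling(2000, {"acme": 4, "globex": 2, "initech": 1})
print(shares)  # prints "{'acme': 1142, 'globex': 571, 'initech': 285}"
```

A static split leaves capacity idle when a tenant is quiet, which is why it usually serves as a guaranteed floor, with a weighted scheduler redistributing unused headroom.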
Should the Gateway Queue Excess Traffic or Reject It Outright?
Both, on different axes. Queue inside a soft cap with a bounded depth (typically 50 to 200 requests at the gateway) so a one-second burst from a paying tenant does not turn into a 429, and reject the moment the queue depth or the hard cap is exceeded so an abusing key does not park infinite work in memory. The trick is back-pressure: emit Retry-After on rejection, emit RateLimit-Remaining and RateLimit-Reset on every response, and surface the queue depth as a gateway metric so the platform team can tune the soft cap before users feel it. Reject-only is the right default for free-tier endpoints; queue-then-reject is the right default for paying tenants.
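
The queue-then-reject decision can be sketched as a single function. This is a minimal illustration under stated assumptions: the caller tracks available tokens, `decide` and `Decision` are hypothetical names, and the header names follow the ones used in this article rather than any specific gateway's output.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Decision:
    action: str                   # "allow", "queue", or "reject"
    status: Optional[int] = None  # set only when the request is rejected
    headers: dict = field(default_factory=dict)


def decide(tokens_available: int, queue: deque, max_queue_depth: int,
           request_id: str, reset_s: int) -> Decision:
    """Allow while tokens remain, queue up to a bounded depth, then reject."""
    headers = {"RateLimit-Reset": str(reset_s)}
    if tokens_available > 0:
        headers["RateLimit-Remaining"] = str(tokens_available - 1)
        return Decision("allow", headers=headers)
    headers["RateLimit-Remaining"] = "0"
    if len(queue) < max_queue_depth:
        # Bounded queue: absorbs a short burst without holding unbounded work.
        queue.append(request_id)
        return Decision("queue", headers=headers)
    # Queue is full: tell the client when to come back.
    headers["Retry-After"] = str(reset_s)
    return Decision("reject", status=429, headers=headers)


queue = deque()
tokens = 2
actions = []
for i in range(7):
    decision = decide(tokens, queue, max_queue_depth=3,
                      request_id=f"req-{i}", reset_s=12)
    if decision.action == "allow":
        tokens -= 1
    actions.append(decision.action)

print(actions)
# prints "['allow', 'allow', 'queue', 'queue', 'queue', 'reject', 'reject']"
```

Setting `max_queue_depth=0` turns the same function into the reject-only policy suggested for free-tier endpoints.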
How Do You Enforce Fair-Share Between Tenants at the Gateway?
Weighted fair queuing at the gateway plus per-tenant token-bucket budgets. The naive setup is a global rate limit which a single abusing tenant can saturate at the expense of every other customer, producing a fairness Gini coefficient near 0.9 within seconds. Weighted fair queuing assigns a service weight to each tenant (typically proportional to contract tier), drains tokens in proportion to weight when the upstream provider tier is saturated, and emits a fairness metric per minute (Gini under 0.3 on healthy traffic; over 0.6 means one tenant is dominating). Future AGI Agent Command Center, Portkey, and Kong AI Gateway ship weighted fair-share natively; LiteLLM exposes the primitive but expects you to wire the weighting yourself.
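
The scheduling idea can be sketched with virtual finish times. This is a simplified weighted fair queue, assuming every request has unit cost and approximating virtual time by the finish tag of the last dequeued request; `WeightedFairQueue` is a hypothetical class, not any gateway's scheduler.

```python
import heapq
from collections import Counter
from itertools import count


class WeightedFairQueue:
    """Serves backlogged tenants in proportion to their weights."""

    def __init__(self, weights: dict):
        self.weights = weights
        self.virtual_time = 0.0
        self.last_finish = {}
        self.heap = []
        self.seq = count()  # tie-breaker keeps FIFO order for equal tags

    def enqueue(self, tenant: str, request: str, cost: float = 1.0) -> None:
        # A higher weight means a smaller increment, so tags grow more slowly.
        start = max(self.virtual_time, self.last_finish.get(tenant, 0.0))
        finish = start + cost / self.weights[tenant]
        self.last_finish[tenant] = finish
        heapq.heappush(self.heap, (finish, next(self.seq), tenant, request))

    def dequeue(self):
        # Always serve the request with the smallest virtual finish tag.
        finish, _, tenant, request = heapq.heappop(self.heap)
        self.virtual_time = finish
        return tenant, request


wfq = WeightedFairQueue({"enterprise": 4, "business": 2, "free": 1})

# Every tier is backlogged with 70 requests while the upstream is saturated.
for tier in ("enterprise", "business", "free"):
    for i in range(70):
        wfq.enqueue(tier, f"{tier}-{i}")

# The upstream only has room for 70 requests in this interval.
served = Counter(wfq.dequeue()[0] for _ in range(70))
print(dict(served))
```

With all three tiers backlogged, the 70 served slots split 40 / 20 / 10, matching the 4:2:1 weights. When a tier goes idle its share is picked up by the others, which a static per-tenant cap cannot do.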
What Is an Acceptable Latency Overhead for a Rate-Limit Check at the Gateway?
Under 5 ms p99 added to the request path is the 2026 production threshold. A token-bucket check against an in-memory store costs roughly 50 to 200 microseconds; a Redis-backed sliding-window check costs roughly 1 to 3 ms over a local network; a multi-tier check (per-user, per-tenant, per-feature) running in parallel costs 2 to 5 ms p99 on a well-provisioned Redis. Future AGI Agent Command Center publishes Protect at roughly 65 ms for guardrail scanning, and the rate-limit check sits well below that on a separate hot path. If your gateway adds 20+ ms just for the rate-limit decision, the limiter implementation is wrong; fix it before adding more axes.
How Do You Observe Rate-Limit Events in a Way That Is Useful for the Platform Team?
Emit one OpenTelemetry span per rate-limit decision with attributes for tenant_id, user_id, key, algorithm, limit, current_count, decision (allow, queue, reject), and queue_depth, and emit Prometheus counters for allow, queue, and reject by tenant. Then build three dashboards: top-10 rejected tenants by hour, queue-depth p99 by route, and fairness Gini by minute. Future AGI exports cost, rate-limit, and guardrail telemetry via the same OpenTelemetry trace so the platform team correlates a 429 spike with the deployment that caused it. Without span-level rate-limit telemetry, the first signal of an abuser is a support ticket from a paying customer.