What Is Rate Limiting?
A gateway policy that caps LLM requests, tokens, or spend per time window for users, tenants, keys, routes, or models.
Rate limiting in an LLM gateway is a traffic-control policy that caps requests, tokens, or cost over a time window for a user, tenant, API key, route, or model. It is a gateway reliability control, not an evaluation metric: it runs before provider calls and decides whether to pass, delay, retry, or reject traffic. FutureAGI exposes this as `gateway:rate-limiting` in Agent Command Center, where limits can protect provider quotas and stop runaway agent loops before they spend budget.
Why it matters in production LLM/agent systems
Provider quotas fail noisily, and customer traffic fails unevenly. Without gateway-side rate limiting, one tenant’s batch job can consume the shared OpenAI tokens-per-minute pool, trigger 429s for interactive users, and push agent retries into a cascading failure. The same pattern appears with tool-heavy agents: a planner that loops on a failing search tool can issue hundreds of model calls even though each call is syntactically valid.
The pain is split across teams. Developers see retries, duplicate work, and logs full of provider 429s. SREs see p99 latency climb because clients back off at different layers. Product teams see good users throttled because a single integration used the entire quota. Finance sees cost spikes after high-token prompts or long context windows.
Symptoms usually show up as:
- Sudden increases in 429s from a provider while gateway accept rate stays high.
- Token-per-minute usage pinned near quota for one tenant, route, or API key.
- Retry storms where each failed request generates two or three more attempts.
- p99 latency rising before error rate rises, because callers wait through backoff.
This matters more for 2026 agentic systems because model calls multiply across steps. A single user task can fan out into planning, retrieval, tool selection, reflection, and response generation, and each step counts against the same quota. Gateway limits turn that fanout into a controlled queue or a clear budget error instead of a provider outage.
How FutureAGI handles rate limiting
FutureAGI’s rate-limiting surface is `gateway:rate-limiting` inside Agent Command Center. The policy belongs in the gateway request path before the upstream provider call, close to routing, retry, model fallback, semantic-cache, pre-guardrail, and post-guardrail decisions. A practical policy scopes limits by tenant, user, API key, route, or model, then enforces both requests per minute and tokens per minute. For example, a route named `support-agent` might carry a tenant cap of 600 requests per minute, 250,000 tokens per minute, and a burst allowance of 50 requests.
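One common way to enforce a per-minute cap with a burst allowance is a token bucket. The sketch below is a generic illustration using the example numbers above, not FutureAGI's implementation; the `Bucket` and `TenantLimiter` names are hypothetical, and the choice to let the token bucket hold one full minute of quota is an assumption.

```python
import math


class Bucket:
    """Token bucket that refills continuously up to `capacity`."""

    def __init__(self, rate_per_min: float, capacity: float, now: float = 0.0):
        self.rate_per_sec = rate_per_min / 60.0
        self.capacity = float(capacity)
        self.level = float(capacity)  # start full
        self.updated_at = now

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated_at)
        self.level = min(self.capacity, self.level + elapsed * self.rate_per_sec)
        self.updated_at = max(self.updated_at, now)

    def seconds_until(self, amount: float) -> float:
        """Seconds until `amount` units are available (0.0 if available now)."""
        return max(0.0, (amount - self.level) / self.rate_per_sec)


class TenantLimiter:
    """Checks requests per minute (with burst) and tokens per minute together."""

    def __init__(self, rpm=600, burst=50, tpm=250_000, now=0.0):
        self.requests = Bucket(rpm, burst, now)
        # Assumption: the token bucket may hold up to one minute of quota.
        self.tokens = Bucket(tpm, tpm, now)

    def check(self, estimated_tokens: int, now: float) -> dict:
        self.requests.refill(now)
        self.tokens.refill(now)
        wait = max(self.requests.seconds_until(1),
                   self.tokens.seconds_until(estimated_tokens))
        if wait > 0:
            # Deny without consuming anything; tell the caller when to come back.
            return {"allowed": False, "retry_after_ms": math.ceil(wait * 1000)}
        self.requests.level -= 1
        self.tokens.level -= estimated_tokens
        return {"allowed": True, "retry_after_ms": 0}


limiter = TenantLimiter(now=0.0)
results = [limiter.check(estimated_tokens=100, now=0.0) for _ in range(51)]
print(sum(r["allowed"] for r in results))  # 50 pass on burst; the 51st is denied
```

A request whose token estimate exceeds the bucket capacity can never pass in this sketch, so a real policy would reject it up front rather than return a retry window.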
FutureAGI’s approach is to make the rate-limit decision observable, not just enforced. The gateway records the policy name, route, scope, decision, and retry window alongside traceAI fields such as `llm.token_count.prompt`, `llm.token_count.completion`, and `gen_ai.request.model`. Allowed calls continue into the configured routing-policy; denied calls return a structured 429 or `budget_exceeded` error that the agent planner can handle cleanly. Provider-side 429s are treated separately: if the tenant is within budget but the provider is saturated, the operator can use retry with backoff, model fallback, or least-latency routing to shift traffic.
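To make the shape of an observable decision concrete, here is a minimal sketch of a decision record and the error a denied call might receive. Only the three traceAI attribute names and the `budget_exceeded` label come from the text; every other field name, the `source` marker, the use of status 429 for both error kinds, and the sample values are assumptions for illustration.

```python
import math


def build_decision(policy, route, scope, allowed, retry_after_ms,
                   model, prompt_tokens, completion_tokens=0):
    """Record emitted for every rate-limit decision, allowed or denied."""
    return {
        "policy": policy,
        "route": route,
        "scope": scope,
        "decision": "allow" if allowed else "deny",
        "retry_after_ms": retry_after_ms,
        # traceAI attribute names as given in the text
        "gen_ai.request.model": model,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
    }


def denial_response(decision, reason="rate_limited"):
    """HTTP-style response for a denied call (hypothetical shape)."""
    # The standard Retry-After header is expressed in whole seconds.
    retry_after_s = math.ceil(decision["retry_after_ms"] / 1000)
    return {
        "status": 429,
        "headers": {"Retry-After": str(retry_after_s)},
        "body": {
            "error": reason,
            # Marks the rejection as gateway policy, not an upstream provider 429.
            "source": "gateway",
            "policy": decision["policy"],
            "scope": decision["scope"],
            "retry_after_ms": decision["retry_after_ms"],
        },
    }


decision = build_decision(
    policy="tenant-default", route="support-agent", scope="tenant:acme",
    allowed=False, retry_after_ms=1500, model="example-model", prompt_tokens=1200,
)
response = denial_response(decision, reason="budget_exceeded")
print(response["status"], response["headers"]["Retry-After"])
```

Carrying an explicit `source` field is one way to keep gateway rejections distinguishable from provider 429s in logs and in the planner's error handling.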
Unlike a LiteLLM-only client wrapper, this control sits at the shared Agent Command Center boundary, so services cannot bypass limits by switching SDKs. For agent systems, the same traces can feed StepEfficiency or TrajectoryScore regression evals to find planners that waste steps before they hit the gateway limit.
How to measure or detect it
Track rate limiting as a gateway control, a cost control, and an agent-loop signal:
- Limit-hit-rate - percentage of requests rejected or delayed by policy, grouped by tenant, route, model, and API key.
- Token-budget utilization - rolling use of `llm.token_count.prompt` plus `llm.token_count.completion` against the configured tokens-per-minute cap.
- Provider-429-after-gateway rate - provider 429s after the gateway allowed the request. This means internal limits are higher than provider quota or routing is uneven.
- Backoff wait p99 - how long callers wait because of `retry_after_ms`, queue delay, or retry strategy.
- Cost-per-trace p99 - catches prompts and agent loops that approach the limit without crossing it.
- StepEfficiency - a FutureAGI evaluator that checks whether an agent trajectory wastes steps before completion.
```python
from fi.evals import StepEfficiency

# agent_steps: the recorded agent trajectory to score, collected elsewhere
result = StepEfficiency().evaluate(trajectory=agent_steps)
print(result)
```
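The first three gateway metrics in the list can be derived from decision records. The following is a rough sketch under stated assumptions: each record is a plain dict, the `decision` and `provider_status` field names are hypothetical, and the records passed in are assumed to cover a single 60-second window. Only the two token-count attribute names come from the text.

```python
def limit_hit_rate(records):
    """Share of requests rejected or delayed by policy."""
    if not records:
        return 0.0
    limited = sum(1 for r in records if r["decision"] in ("deny", "delay"))
    return limited / len(records)


def token_budget_utilization(records, tpm_cap):
    """Tokens used by non-denied calls in the window, as a fraction of the cap."""
    used = sum(
        r["llm.token_count.prompt"] + r["llm.token_count.completion"]
        for r in records
        if r["decision"] != "deny"
    )
    return used / tpm_cap


def provider_429_after_gateway_rate(records):
    """Share of calls the gateway let through that the provider then rejected."""
    passed = [r for r in records if r["decision"] != "deny"]
    if not passed:
        return 0.0
    return sum(1 for r in passed if r.get("provider_status") == 429) / len(passed)


# Illustrative records only; not real traffic.
records = [
    {"decision": "allow", "llm.token_count.prompt": 1000,
     "llm.token_count.completion": 200, "provider_status": 200},
    {"decision": "allow", "llm.token_count.prompt": 500,
     "llm.token_count.completion": 100, "provider_status": 200},
    {"decision": "allow", "llm.token_count.prompt": 700,
     "llm.token_count.completion": 0, "provider_status": 429},
    {"decision": "deny", "llm.token_count.prompt": 900,
     "llm.token_count.completion": 0, "provider_status": None},
]

print(limit_hit_rate(records))
print(token_budget_utilization(records, tpm_cap=250_000))
print(provider_429_after_gateway_rate(records))
```

In practice these would be grouped by tenant, route, model, and API key before aggregation, as the list above suggests.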
Rate limit dimensions worth tracking in 2026
In our 2026 evals, the teams that survive provider quota changes treat rate limiting as a multi-dimensional policy, not a global RPM cap. The table below lists the dimensions we keep separate:
| Dimension | Why it matters | Typical caps to start |
|---|---|---|
| Requests per minute | Burst control | 600 RPM per tenant |
| Tokens per minute (prompt + completion) | Provider quota protection | 250K TPM per tenant |
| Cost per minute | Spend control across models | $0.50/min for support, $5/min for research |
| Concurrent requests | Memory and queue depth | 50 concurrent per route |
| Tool-call budget | Stops agent loops | 12 tool calls per agent trajectory |
| Daily budget | Long-horizon cost guard | $X per tenant per day |
| Cohort caps | Premium vs free differentiation | 5x premium |
Limit decisions multiply across 2026 model routes: Claude Opus 4.7 has a different TPM quota than Sonnet 4.6, GPT-5.1 differs from GPT-5-mini, and Gemini 3 Pro differs from Gemini 3 Flash. Unlike a LiteLLM-only client wrapper, FutureAGI’s Agent Command Center `gateway:rate-limiting` policy applies all seven dimensions at the shared boundary, so a runaway planner cannot escape the cap by switching SDKs. The same trace surface feeds StepEfficiency and TrajectoryScore regression checks, so engineers can find planners that waste steps before they ever hit the rate limit.
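Keeping the dimensions separate means a request is checked against each cap on its own, and the response can name the one that was hit. A minimal sketch, assuming a hypothetical `Limits` structure and usage snapshot; the caps are the starting values from the table, and the daily budget and cohort multiplier are left out because the table gives no concrete numbers for them.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Limits:
    """Starting caps from the table; field names are hypothetical."""
    requests_per_minute: int = 600
    tokens_per_minute: int = 250_000
    cost_per_minute_usd: float = 0.50
    concurrent_requests: int = 50
    tool_calls_per_trajectory: int = 12


def violated_dimensions(usage: dict, limits: Limits) -> list:
    """Names of every dimension whose current usage exceeds its cap."""
    return [name for name, cap in asdict(limits).items()
            if usage.get(name, 0) > cap]


# Hypothetical usage snapshot for one tenant on one route.
usage = {
    "requests_per_minute": 120,
    "tokens_per_minute": 260_000,
    "cost_per_minute_usd": 0.31,
    "concurrent_requests": 8,
    "tool_calls_per_trajectory": 13,
}

print(violated_dimensions(usage, Limits()))
```

Here the tenant is far below its RPM cap but over on tokens and tool calls, which is the case a single global RPM limit would miss.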
A useful anchor for setting the tool-call budget comes from two agent benchmarks: τ-bench (Sierra, multi-turn customer-support tool use, frontier ~50% pass on retail tasks) and GAIA (Meta, 3 difficulty levels, frontier ~55% on level 1). On both, well-behaved frontier agents complete typical tasks in 4-8 tool calls; trajectories that exceed 15 tool calls are usually loops or recovery storms rather than legitimate work. A per-trajectory cap of 12-15 catches >95% of runaway loops without clipping real multi-step tasks.
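A per-trajectory tool-call cap can be as simple as a counter that the agent loop charges before each tool invocation. The sketch below assumes a cap of 12; the `TrajectoryBudget` class and `ToolCallBudgetExceeded` exception are hypothetical names, not part of any FutureAGI SDK.

```python
class ToolCallBudgetExceeded(RuntimeError):
    """Raised when a trajectory tries to exceed its tool-call cap."""


class TrajectoryBudget:
    def __init__(self, max_tool_calls: int = 12):
        self.max_tool_calls = max_tool_calls
        self.used = 0

    def charge(self, tool_name: str) -> int:
        """Count one tool call; raise instead of letting the loop continue."""
        if self.used >= self.max_tool_calls:
            raise ToolCallBudgetExceeded(
                f"tool-call budget of {self.max_tool_calls} exhausted "
                f"before calling {tool_name!r}"
            )
        self.used += 1
        return self.max_tool_calls - self.used  # calls remaining


budget = TrajectoryBudget(max_tool_calls=12)
stopped_at = None
for step in range(1, 21):  # a planner stuck retrying the same failing tool
    try:
        budget.charge("search")
    except ToolCallBudgetExceeded:
        stopped_at = step
        break

print(stopped_at)  # the 13th attempt is the first one refused
```

Raising a distinct exception lets the planner return a clear budget error to the user rather than silently truncating the task.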
Common mistakes
- Limiting only requests per minute. Token-heavy prompts can stay under RPM while exhausting tokens per minute and budget.
- Sharing one global bucket across tenants. A batch customer can throttle premium interactive users with unrelated traffic.
- Retrying tenant-budget errors. If the tenant exceeded its cap, retries multiply load; return a clear budget response.
- Hiding provider 429s behind generic 500s. Operators lose the difference between gateway rejection and upstream quota failure.
- Setting limits without route labels. You cannot distinguish chat, embeddings, rerank, and agent planning traffic after the incident.
- Treating provider quota changes as a one-time event. In 2026, Claude Opus 4.7, GPT-5.1, and Gemini 3 Pro have all rolled out quota changes mid-quarter; the gateway is the place to absorb that drift, not application code.
- Ignoring the Agent Command Center retry budget. Retries multiply traffic; a poorly tuned retry strategy on a rate-limited route turns a soft throttle into a hard outage.
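Several of these mistakes come down to retry handling. The sketch below shows one possible client-side rule: never retry a tenant-budget error, cap the number of attempts, and use exponential backoff with full jitter for provider 429s and timeouts. The error-kind strings other than `budget_exceeded`, the function name, and all default values are assumptions for illustration.

```python
import random

# Hypothetical error kinds; only "budget_exceeded" is named in the text.
RETRYABLE = {"provider_429", "timeout"}


def next_delay_ms(error_kind, attempt, retry_after_ms=None,
                  base_ms=200, cap_ms=10_000, max_attempts=3,
                  rng=random.random):
    """Return the delay before the next attempt, or None to stop retrying.

    `attempt` is the number of attempts already made (0 before the first retry).
    """
    if error_kind not in RETRYABLE:
        return None  # e.g. budget_exceeded: retrying only multiplies load
    if attempt >= max_attempts:
        return None  # retry budget spent
    ceiling = min(cap_ms, base_ms * 2 ** attempt)
    delay = int(rng() * ceiling)  # full jitter spreads callers apart
    if retry_after_ms is not None:
        delay = max(delay, retry_after_ms)  # never retry earlier than asked
    return delay


print(next_delay_ms("budget_exceeded", attempt=0))
print(next_delay_ms("provider_429", attempt=1, rng=lambda: 0.5))
print(next_delay_ms("provider_429", attempt=1, retry_after_ms=1000, rng=lambda: 0.5))
```

Passing `rng` as a parameter keeps the jitter testable; production callers would leave the default.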
Frequently Asked Questions
What is rate limiting in an LLM gateway?
Rate limiting is a gateway policy that caps requests, tokens, or cost for a user, tenant, key, route, or model over a time window. It protects provider quotas and stops runaway agent traffic before it becomes an outage.
How is rate limiting different from retry strategy?
Rate limiting decides whether traffic is allowed, delayed, or rejected. Retry strategy decides how a caller reissues a request after a retryable error such as a provider 429 or timeout.
How do you measure rate limiting?
Use Agent Command Center's `gateway:rate-limiting` policy with traceAI attributes such as `llm.token_count.prompt`, `llm.token_count.completion`, and `gen_ai.request.model`. Track limit-hit-rate, token-budget utilization, and provider-429-after-gateway rate.