What Is Rate Limiting?
A gateway policy that caps LLM requests, tokens, or spend per time window for users, tenants, keys, routes, or models.
Rate limiting in an LLM gateway is a traffic-control policy that caps requests, tokens, or cost over a time window for a user, tenant, API key, route, or model. It is a gateway reliability control, not an evaluation metric: it runs before provider calls and decides whether to pass, delay, retry, or reject traffic. FutureAGI exposes this as gateway:rate-limiting in Agent Command Center, where limits can protect provider quotas and stop runaway agent loops before they spend budget.
Why it matters in production LLM/agent systems
Provider quotas fail noisily, and customer traffic fails unevenly. Without gateway-side rate limiting, one tenant’s batch job can consume the shared OpenAI tokens-per-minute pool, trigger 429s for interactive users, and push agent retries into a cascading failure. The same pattern appears with tool-heavy agents: a planner that loops on a failing search tool can issue hundreds of model calls even though each call is syntactically valid.
The pain is split across teams. Developers see retries, duplicate work, and logs full of provider 429s. SREs see p99 latency climb because clients back off at different layers. Product teams see good users throttled because a single integration used the entire quota. Finance sees cost spikes after high-token prompts or long context windows.
Symptoms usually show up as:
- Sudden increases in 429s from a provider while gateway accept rate stays high.
- Token-per-minute usage pinned near quota for one tenant, route, or API key.
- Retry storms where each failed request generates two or three more attempts (see the sketch after this list).
- p99 latency rising before error rate rises, because callers wait through backoff.
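The retry-storm symptom is straightforward to confirm from gateway request logs once attempts are grouped by originating request. A minimal sketch, assuming each log record carries a `trace_id` that ties retries back to the original call (the field name is illustrative, not a fixed schema):

```python
from collections import defaultdict

def retry_storm_ratio(request_log):
    """Return attempts per originating request; values near 2-3 suggest a retry storm."""
    attempts = defaultdict(int)
    for record in request_log:
        attempts[record["trace_id"]] += 1
    originals = len(attempts)
    return sum(attempts.values()) / originals if originals else 0.0

# Example: three user requests, one of which was retried four extra times.
log = [{"trace_id": "a"}, {"trace_id": "b"}] + [{"trace_id": "c"}] * 5
print(retry_storm_ratio(log))  # ~2.3 attempts per original request
```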
This matters more for 2026 agentic systems because rate limits multiply across steps. A single user task can fan out into planning, retrieval, tool selection, reflection, and response generation. Gateway limits turn that fanout into a controlled queue or a clear budget error instead of a provider outage.
How FutureAGI handles rate limiting
FutureAGI’s rate-limiting surface is gateway:rate-limiting inside Agent Command Center. The policy belongs in the gateway request path before the upstream provider call, close to routing, retry, model fallback, semantic-cache, pre-guardrail, and post-guardrail decisions. A practical policy scopes limits by tenant, user, API key, route, or model, then enforces both requests per minute and tokens per minute. For example, a support-agent route might carry a tenant cap of 600 requests per minute, 250,000 tokens per minute, and a burst allowance of 50 requests.
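Written out, that policy is a small structured object. The sketch below is only a mental model of the scoping described above; the field names are assumptions, not the actual gateway:rate-limiting configuration format:

```python
# Illustrative shape of a rate-limit policy; field names are assumptions,
# not the Agent Command Center configuration schema.
support_agent_policy = {
    "policy": "gateway:rate-limiting",
    "route": "support-agent",
    "scope": "tenant",                  # could also be user, api_key, or model
    "limits": {
        "requests_per_minute": 600,
        "tokens_per_minute": 250_000,
        "burst_requests": 50,
    },
    "on_exceeded": "reject",            # or "delay" to queue instead of returning 429
}
```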
FutureAGI’s approach is to make the rate-limit decision observable, not just enforced. The gateway records the policy name, route, scope, decision, and retry window alongside traceAI fields such as llm.token_count.prompt, llm.token_count.completion, and gen_ai.request.model. Allowed calls continue into the configured routing-policy; denied calls return a structured 429 or budget_exceeded error that the agent planner can handle cleanly. Provider-side 429s are treated separately: if the tenant is within budget but the provider is saturated, the operator can use retry with backoff, model fallback, or least-latency routing to shift traffic.
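From the caller's side, the useful property is that a tenant-budget rejection looks different from a provider 429. A minimal sketch of how a planner might branch on that, assuming a hypothetical response payload with a `code` field and a `retry_after_ms` hint (neither name is guaranteed by the gateway):

```python
import time

def call_gateway(request, send):
    """send(request) -> (status, body); names and payload shape are illustrative."""
    status, body = send(request)
    if status == 429 and body.get("code") == "budget_exceeded":
        # Tenant budget spent: do not retry, surface a clear budget error instead.
        raise RuntimeError(f"budget exceeded for route {body.get('route')}")
    if status == 429:
        # Within budget but the provider is saturated: wait out the hint and let
        # the gateway's retry, fallback, or least-latency routing shift traffic.
        time.sleep(body.get("retry_after_ms", 1000) / 1000)
        return send(request)
    return status, body
```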
Unlike a LiteLLM-only client wrapper, this control sits at the shared Agent Command Center boundary, so services cannot bypass limits by switching SDKs. For agent systems, the same traces can feed StepEfficiency or TrajectoryScore regression evals to find planners that waste steps before they hit the gateway limit.
How to measure or detect it
Track rate limiting as a gateway control, a cost control, and an agent-loop signal:
- Limit-hit-rate - percentage of requests rejected or delayed by policy, grouped by tenant, route, model, and API key.
- Token-budget utilization - rolling use of `llm.token_count.prompt` plus `llm.token_count.completion` against the configured tokens-per-minute cap.
- Provider-429-after-gateway rate - provider 429s after the gateway allowed the request. This means internal limits are higher than provider quota or routing is uneven.
- Backoff wait p99 - how long callers wait because of `retry_after_ms`, queue delay, or retry strategy.
- Cost-per-trace p99 - catches prompts and agent loops that approach the limit without crossing it.
- StepEfficiency - a FutureAGI evaluator that checks whether an agent trajectory wastes steps before completion.
```python
from fi.evals import StepEfficiency

# agent_steps: the recorded trajectory (ordered steps) for one agent run, pulled from traces.
result = StepEfficiency().evaluate(trajectory=agent_steps)
print(result)
```
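Limit-hit-rate and token-budget utilization fall out of the same trace attributes. A minimal sketch, assuming each trace record exposes the gateway decision plus `llm.token_count.prompt` and `llm.token_count.completion` as plain keys (the record shape is illustrative):

```python
def rate_limit_metrics(traces, tokens_per_minute_cap):
    """traces: one minute of trace records; key names are illustrative."""
    total = len(traces)
    limited = sum(1 for t in traces if t["decision"] in ("rejected", "delayed"))
    tokens = sum(
        t["llm.token_count.prompt"] + t["llm.token_count.completion"]
        for t in traces
        if t["decision"] == "allowed"
    )
    return {
        "limit_hit_rate": limited / total if total else 0.0,
        "token_budget_utilization": tokens / tokens_per_minute_cap,
    }

minute_of_traces = [
    {"decision": "allowed", "llm.token_count.prompt": 1200, "llm.token_count.completion": 300},
    {"decision": "rejected", "llm.token_count.prompt": 0, "llm.token_count.completion": 0},
]
print(rate_limit_metrics(minute_of_traces, tokens_per_minute_cap=250_000))
```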
Common mistakes
- Limiting only requests per minute. Token-heavy prompts can stay under RPM while exhausting tokens per minute and budget; see the dual-bucket sketch after this list.
- Sharing one global bucket across tenants. A batch customer can throttle premium interactive users with unrelated traffic.
- Retrying tenant-budget errors. If the tenant exceeded its cap, retries multiply load; return a clear budget response.
- Hiding provider 429s behind generic 500s. Operators lose the difference between gateway rejection and upstream quota failure.
- Setting limits without route labels. You cannot distinguish chat, embeddings, rerank, and agent planning traffic after the incident.
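The first mistake above is worth making concrete: a limiter has to spend from both a request budget and a token budget, or token-heavy prompts slip through. A minimal fixed-window sketch of that dual check, not the gateway's actual implementation:

```python
import time

class DualWindowLimiter:
    """Fixed one-minute window that enforces both requests/min and tokens/min."""

    def __init__(self, rpm, tpm):
        self.rpm, self.tpm = rpm, tpm
        self.window_start = time.monotonic()
        self.requests = 0
        self.tokens = 0

    def allow(self, estimated_tokens):
        now = time.monotonic()
        if now - self.window_start >= 60:          # roll over to a new minute
            self.window_start, self.requests, self.tokens = now, 0, 0
        if self.requests + 1 > self.rpm or self.tokens + estimated_tokens > self.tpm:
            return False                            # reject or delay this call
        self.requests += 1
        self.tokens += estimated_tokens
        return True

limiter = DualWindowLimiter(rpm=600, tpm=250_000)
print(limiter.allow(estimated_tokens=4_000))        # True until either budget is spent
```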
Frequently Asked Questions
What is rate limiting in an LLM gateway?
Rate limiting is a gateway policy that caps requests, tokens, or cost for a user, tenant, key, route, or model over a time window. It protects provider quotas and stops runaway agent traffic before it becomes an outage.
How is rate limiting different from retry strategy?
Rate limiting decides whether traffic is allowed, delayed, or rejected. Retry strategy decides how a caller reissues a request after a retryable error such as a provider 429 or timeout.
How do you measure rate limiting?
Use Agent Command Center's `gateway:rate-limiting` policy with traceAI attributes such as `llm.token_count.prompt`, `llm.token_count.completion`, and `gen_ai.request.model`. Track limit-hit-rate, token-budget utilization, and provider-429-after-gateway rate.