What Is Denial of Service in LLM Apps?
A security failure where abusive or runaway LLM traffic exhausts request, token, tool, or budget capacity and degrades legitimate users.
Denial of service in LLM apps is a security failure mode where abusive, automated, or runaway traffic exhausts model, gateway, tool, or budget capacity until legitimate requests slow down or fail. It shows up at the gateway and in production traces as request floods, token spikes, retry storms, queue growth, or agent loops. FutureAGI treats it as a gateway control problem: enforce gateway:rate-limit, observe token and latency signals, and trigger fallback or rejection before provider quotas collapse.
Why it matters in production LLM/agent systems
LLM denial of service turns model capacity into an attack surface. A cheap prompt can force expensive context expansion, tool fan-out, retries, or long generation. If the gateway treats every request as normal traffic, one attacker, faulty client, or looping agent can consume a shared token pool and make unrelated users wait behind it.
The named failure modes are token exhaustion, retry amplification, and agent loop saturation. Token exhaustion happens when long prompts or generated outputs pin tokens-per-minute limits. Retry amplification happens when a provider 429 or timeout causes clients and agents to reissue work at several layers. Agent loop saturation happens when a planner keeps calling search, retrieval, code, or browser tools without making progress.
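Retry amplification is the easiest of the three failure modes to reproduce in code. A minimal sketch of the mitigation, assuming a hypothetical `ThrottledError` for provider/gateway 429s: capped exponential backoff plus a shared retry budget, so throttling cannot be re-amplified at several layers.

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for a provider or gateway 429 (hypothetical exception type)."""

def call_with_bounded_retry(call, max_retries=3, base_delay=0.5,
                            retry_budget=None):
    """Retry a throttled call with capped exponential backoff and jitter.

    The shared retry_budget dict caps retries across callers, so a provider
    brownout cannot multiply into a retry storm at several layers at once.
    """
    if retry_budget is None:
        retry_budget = {"remaining": 10}
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ThrottledError:
            # Fail fast once per-call retries or the shared budget run out.
            if attempt == max_retries or retry_budget["remaining"] <= 0:
                raise
            retry_budget["remaining"] -= 1
            delay = min(base_delay * (2 ** attempt), 8.0)
            time.sleep(delay + random.uniform(0, delay / 2))
```

The key design choice is that a spent budget surfaces the 429 instead of queueing more work, which is exactly the "clear budget response" behavior described under common mistakes below.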
Developers feel it as flaky application behavior: requests pass locally but fail under tenant traffic, background jobs starve chat sessions, or one prompt shape burns 50x the normal tokens. SREs see queue depth, p99 latency, provider 429s, gateway 429s, retry counts, and cost-per-trace rise together. Security and compliance teams need evidence that abusive traffic was identified, scoped to a key or route, and blocked without dropping audit context.
This matters more for 2026 multi-step agents because availability is no longer one model call. A single user task can involve planning, retrieval, tool selection, reflection, fallback, and final answer generation. Each step can multiply load, so denial-of-service protection has to follow the trace, not just the first HTTP request.
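One way to make protection "follow the trace" is a per-trace budget that every planning, retrieval, tool, and generation step charges against. The sketch below is illustrative: the limits, field names, and exception type are assumptions, not FutureAGI API.

```python
class TraceBudget:
    """Per-trace ceiling on agent steps and tokens, so limits follow the
    whole trajectory rather than only the first HTTP request.
    Illustrative sketch; limits and exception type are assumptions."""

    def __init__(self, max_steps=25, max_tokens=200_000):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.steps = 0
        self.tokens = 0

    def charge(self, step_tokens):
        """Call once per planner/tool/model step with its token cost."""
        self.steps += 1
        self.tokens += step_tokens
        if self.steps > self.max_steps or self.tokens > self.max_tokens:
            raise RuntimeError(
                f"trace budget exceeded at step {self.steps}: "
                "abort, fall back, or degrade gracefully"
            )
```

Because the counter lives on the trace rather than the request, a looping planner hits the ceiling no matter how cheap each individual call looks.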
How FutureAGI handles denial of service in LLM apps
FutureAGI anchors denial-of-service handling in Agent Command Center through the gateway:rate-limit surface. The policy sits before provider calls and around routing decisions, so it can cap requests per minute, tokens per minute, concurrent in-flight calls, and budget by tenant, API key, user, route, or model. It also records the decision on the trace instead of hiding it behind a generic 500 or provider error.
A practical example is a support agent route named support-agent-v3. The team sets gateway:rate-limit to 120 requests per minute, 180,000 tokens per minute, and 20 concurrent tool-using traces per tenant. traceAI records llm.token_count.prompt, llm.token_count.completion, gen_ai.request.model, agent.trajectory.step, route name, policy name, and limit decision. When a bad integration starts sending repeated 80k-token prompts, Agent Command Center rejects excess calls with a structured 429 and retry_after_ms. If the route is within tenant budget but one provider is saturated, the gateway can use model fallback or a retry policy with bounded backoff.
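The per-tenant limits in the support-agent-v3 example can be sketched as a sliding-window limiter that enforces both a request ceiling and a token ceiling. The numbers come from the example above; the class itself is an illustrative sketch, not the FutureAGI gateway API.

```python
import time
from collections import defaultdict, deque

class TenantRateLimiter:
    """Sliding-window limiter mirroring a gateway:rate-limit policy of
    120 requests/min and 180,000 tokens/min per tenant (illustrative sketch)."""

    def __init__(self, rpm=120, tpm=180_000, window=60.0):
        self.rpm, self.tpm, self.window = rpm, tpm, window
        self.events = defaultdict(deque)  # tenant -> deque of (timestamp, tokens)

    def allow(self, tenant, tokens, now=None):
        now = time.monotonic() if now is None else now
        q = self.events[tenant]
        # Drop events that have aged out of the sliding window.
        while q and now - q[0][0] >= self.window:
            q.popleft()
        # Reject if either the request ceiling or the token ceiling would trip;
        # the caller should return a structured 429 with retry_after_ms.
        if len(q) >= self.rpm or sum(t for _, t in q) + tokens > self.tpm:
            return False
        q.append((now, tokens))
        return True
```

Checking tokens as well as requests is what catches the bad integration above: repeated 80k-token prompts exhaust the token ceiling long before the request ceiling.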
FutureAGI’s approach is to connect enforcement to root cause. Unlike provider-side quota dashboards from OpenAI or Anthropic, the FutureAGI trace shows which tenant, key, prompt version, agent step, and route created the load. The engineer’s next move is concrete: alert on limit-hit rate, split the route into interactive and batch policies, add the abusive prompt shape to a regression dataset, and run StepEfficiency on looping trajectories before shipping a new agent planner.
How to measure or detect it
Measure denial of service as capacity saturation plus user impact, not just request volume:
- `gateway:rate-limit` decisions - count allowed, delayed, rejected, and fallback-routed calls by tenant, route, key, and model.
- traceAI fields - inspect `llm.token_count.prompt`, `llm.token_count.completion`, `gen_ai.request.model`, `agent.trajectory.step`, queue time, and retry count.
- Dashboard signals - track limit-hit rate, provider-429-after-gateway rate, p99 latency, token-cost-per-trace, queue-depth p95, and concurrent in-flight traces.
- `StepEfficiency` evaluator - flags agent trajectories that waste steps before completion; use it as a loop-amplification signal, not as a complete DoS detector.
- User-feedback proxy - watch timeout complaints, degraded-answer reports, support escalations, and tenant-specific error spikes.
```python
from fi.evals import StepEfficiency

# Three identical search calls with no progress: a loop-amplification signal.
agent_steps = [{"tool": "search"}, {"tool": "search"}, {"tool": "search"}]

result = StepEfficiency().evaluate(trajectory=agent_steps)
print(result)
```
Alert when provider 429s rise after the gateway allowed traffic, when one tenant consumes a disproportionate token share, or when agent steps grow while task completion stays flat.
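The three alert conditions above can be expressed as checks over per-window aggregates. The metric dict layout and thresholds below are illustrative assumptions, not a traceAI schema.

```python
def dos_alerts(window, prev):
    """Evaluate the three DoS alert conditions over aggregated per-window
    metrics. Field names and thresholds are illustrative assumptions."""
    alerts = []
    # 1. Provider 429s rising after the gateway allowed traffic:
    #    gateway limits are looser than real provider capacity.
    if window["gateway_allowed"] > 0:
        rate = window["provider_429s"] / window["gateway_allowed"]
        if rate > 0.01:
            alerts.append("provider-429-after-gateway")
    # 2. One tenant consuming a disproportionate share of the token pool.
    total = sum(window["tokens_by_tenant"].values()) or 1
    for tenant, used in window["tokens_by_tenant"].items():
        if used / total > 0.5:
            alerts.append(f"tenant-token-hog:{tenant}")
    # 3. Agent steps growing while task completion stays flat: loop saturation.
    if (window["agent_steps"] > 1.5 * prev["agent_steps"]
            and window["completions"] <= prev["completions"]):
        alerts.append("agent-loop-amplification")
    return alerts
```

Routing each alert name to a distinct runbook keeps the response scoped: a tenant-token-hog alert tightens one tenant's policy instead of throttling the whole route.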
Common mistakes
- Limiting only requests per minute. A small request count can still exhaust tokens, context windows, provider quota, or tool concurrency.
- Retrying every 429. Tenant-budget errors should return a clear budget response; blind retries turn throttling into more load.
- Sharing one bucket across batch and interactive routes. Background evals can starve user-facing chat unless the gateway separates policies.
- Ignoring agent step fan-out. A single chat request can create dozens of tool and model calls after planning starts.
- Treating provider quotas as the control plane. Provider dashboards show saturation late; app-level limits need tenant, route, and prompt context.
Frequently Asked Questions
What is denial of service in LLM apps?
Denial of service in LLM apps happens when abusive, automated, or runaway traffic exhausts model, gateway, tool, or budget capacity. Legitimate users then see high latency, throttling, failed requests, or degraded agent behavior.
How is denial of service different from runaway cost?
Denial of service is an availability failure; runaway cost is a spend failure. The same agent loop or request flood can cause both, but DoS is measured by saturation, errors, and user impact.
How do you measure denial of service in LLM apps?
Use FutureAGI Agent Command Center `gateway:rate-limit` with traceAI token and latency fields, plus StepEfficiency for loop-driven floods. Track limit-hit rate, queue time, provider 429s, and token burn by route.