What Is Runaway Cost (LLM Apps)?
An LLM-application failure mode where token consumption grows unboundedly across many calls, typically from agent loops or unbounded recursion.
What Is Runaway Cost?
Runaway cost is an LLM-application failure mode where total token consumption grows unboundedly across many calls. The mechanism is usually an agent stuck in a loop on a failing tool, a recursive plan-execute cycle that never terminates, or a chat session accumulating context without summarisation. Each individual call may be legal — under the context window, schema-valid — but the aggregate spend is pathological. It is distinct from context overflow (a single-call structural limit). In 2026 agent stacks, runaway cost is the dominant cost-failure shape because long-loop agents make per-trace token spend extremely variable.
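The accumulation mechanism is easy to see with a toy model. The sketch below uses a flat tokens-per-turn count in place of a real tokenizer; the point is that when every turn is appended to the prompt, per-call tokens grow linearly while aggregate spend grows quadratically — each call fits, the sum does not.

```python
def simulate_accumulating_chat(turns: int, tokens_per_turn: int = 500) -> tuple[int, int]:
    """Return (largest_single_call, total_tokens) for a chat loop that
    resends its full history on every call. Toy model, not a tokenizer."""
    history = 0        # tokens carried in the prompt so far
    total = 0          # aggregate spend across all calls
    largest_call = 0
    for _ in range(turns):
        history += tokens_per_turn   # new turn appended to context
        total += history             # the whole history is resent
        largest_call = max(largest_call, history)
    return largest_call, total

big, total = simulate_accumulating_chat(turns=100)
# Biggest single call: 50_000 tokens (fits a long-context window).
# Aggregate spend: 2_525_000 tokens -- the runaway-cost shape.
```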
Why It Matters in Production LLM and Agent Systems
On 2026-04-22 a research-agent product burned $48,000 of GPT-4o spend in 14 hours from a single misbehaving customer session. Postmortem: the user asked the agent to “find every academic citation supporting claim X.” The retriever returned imperfect chunks. The agent’s planner kept retrying with broader queries, accumulating tool outputs into the next prompt, never terminating because the success criterion (“comprehensive citations”) was undefined. Each model call was valid. The trajectory ran for 11 hours of wall-clock time and 2,400+ steps. No per-session token-budget alert was wired. The team saw the spend on the next day’s billing email.
That is the runaway-cost shape. It hits the finance team (line-item that does not match expected unit economics), the agent platform engineer (no obvious failure in the trace — every span succeeded), the SRE (worker thread held for hours), and the product team (one user blew the monthly margin). Especially common in 2026 because long-context models make extended trajectories technically feasible — the constraints that used to terminate agent runs (context-overflow errors) no longer fire.
In multi-agent systems runaway cost amplifies. An agent that fans out to four sub-agents, each of which fans out to four more, runs 16 (4 x 4) leaf agents per planning step — 16x the token spend of a single agent. Without per-tenant rate-limiting and per-trace budget caps, a single recursive plan can exhaust a daily quota in minutes.
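The amplification is pure arithmetic. A quick sketch of the sub-agent call count for a given branching factor and depth (the numbers here are illustrative):

```python
def fanout_calls(branching: int, depth: int) -> int:
    """Total sub-agent calls spawned through `depth` levels (root excluded).
    Each level multiplies the previous one by the branching factor."""
    return sum(branching ** k for k in range(1, depth + 1))

# One agent fanning out to 4 sub-agents, each fanning out to 4 more:
# 4 at depth one + 16 at depth two = 20 calls from a single plan.
print(fanout_calls(branching=4, depth=2))  # 20
```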
How FutureAGI Handles Runaway Cost
FutureAGI’s approach is two-layer. At the runtime layer, the Agent Command Center exposes per-tenant rate-limiting policies — token-per-minute caps, requests-per-minute caps, and per-trace token-budget caps. When a session exceeds its budget, the gateway returns a structured budget_exceeded error to the agent’s planner, which can then terminate cleanly rather than loop indefinitely. At the observability layer, every LLM span emits llm.token_count.total, llm.token_count.prompt, and llm.token_count.completion via traceAI integrations (traceAI-openai, traceAI-langchain, traceAI-openai-agents), and the FutureAGI dashboard rolls those up into per-trace cost attribution.
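The clean-termination path can be sketched as follows. The structured-error fields and helper names below are illustrative assumptions, not the documented FutureAGI gateway contract; the pattern is what matters — catch the budget error, return partial work, never retry into the cap.

```python
# Sketch only: error fields and helper names are illustrative assumptions.
class BudgetExceeded(Exception):
    """Raised when the gateway returns a structured budget_exceeded error."""
    def __init__(self, tokens_used: int, limit: int):
        super().__init__(f"budget_exceeded: {tokens_used}/{limit} tokens")
        self.tokens_used = tokens_used
        self.limit = limit

def run_plan(gateway_step) -> str:
    """Drive a planning loop; stop cleanly when the gateway trips a cap."""
    steps_completed = 0
    while True:
        try:
            result = gateway_step()
        except BudgetExceeded:
            # Clean termination: surface partial work instead of retrying forever.
            return f"partial_after_{steps_completed}_steps"
        steps_completed += 1
        if result == "done":
            return "completed"

def make_failing_gateway(budget_steps: int):
    """Simulate a planner that never converges and a gateway that caps it."""
    calls = {"n": 0}
    def step():
        calls["n"] += 1
        if calls["n"] > budget_steps:
            raise BudgetExceeded(tokens_used=210_000, limit=200_000)
        return "retrying"
    return step

print(run_plan(make_failing_gateway(3)))  # partial_after_3_steps
```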
Concretely: the same research-agent team adds three policies. Per-tenant rate limit: 1M tokens/hour. Per-trace cap: 200K tokens. Step-count cap on the agent's planning loop: 50 steps. They wire fi.evals.StepEfficiency over completed runs to surface wasted work — when the same query gets repeated three times in a trajectory, the metric flags it. They wire fi.evals.TrajectoryScore to catch the broader "agent never converged" pattern. The dashboard plots cost-per-trace p99 by route; the runaway session would have alerted within five minutes instead of fourteen hours.
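What a repeated-query flag looks like in principle — a toy heuristic, not the fi.evals.StepEfficiency implementation, with a hypothetical step/trajectory shape:

```python
from collections import Counter

def flag_repeated_queries(trajectory: list[dict], threshold: int = 3) -> list[str]:
    """Return retriever queries issued `threshold`+ times in one trajectory.
    Step dicts here ({"tool": ..., "query": ...}) are a hypothetical shape."""
    counts = Counter(
        step["query"].strip().lower()
        for step in trajectory
        if step.get("tool") == "retriever"
    )
    return [query for query, n in counts.items() if n >= threshold]

trajectory = [
    {"tool": "retriever", "query": "citations for claim X"},
    {"tool": "retriever", "query": "citations for claim X"},
    {"tool": "retriever", "query": "academic sources claim X"},
    {"tool": "retriever", "query": "Citations for claim X"},
]
print(flag_repeated_queries(trajectory))  # ['citations for claim x']
```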
Unlike provider-side billing dashboards that aggregate across tenants and routes, FutureAGI attributes cost down to the individual trace, agent step, and tool call.
How to Measure or Detect It
Signals to wire up:
- OTel attributes llm.token_count.prompt, llm.token_count.completion, llm.token_count.total — the building blocks of cost attribution.
- Dashboard signal: cost-per-trace p99 by route — runaway traces are high outliers.
- Step-count metric per session — long planning loops are the canonical cause.
- fi.evals.StepEfficiency — flags wasted, repeated, or non-progressing steps.
- fi.evals.TrajectoryScore — catches non-converging trajectories.
- Rate-limit-trip rate — frequent trips of the per-tenant rate limit indicate systemic runaway.
```python
# Conceptual: per-trace budget gate, configured in the Agent Command Center
budget_policy = {
    "max_tokens_per_trace": 200_000,
    "max_steps_per_trace": 50,
    "rate_limit": {"tokens_per_minute": 100_000, "scope": "tenant"},
}
# When exceeded, the gateway returns budget_exceeded; the planner terminates.
```
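On the detection side, a minimal sketch of the per-trace rollup: sum llm.token_count.total per trace_id and flag high outliers against a percentile cut. The flat span dicts below are a simplification; real OTel spans carry these values as attributes.

```python
import statistics
from collections import defaultdict

def runaway_traces(spans: list[dict], quantile: float = 0.99) -> list[str]:
    """Flag trace_ids whose summed token spend exceeds the given percentile.
    Span dicts are simplified stand-ins for OTel spans with llm.* attributes."""
    per_trace = defaultdict(int)
    for span in spans:
        per_trace[span["trace_id"]] += span["llm.token_count.total"]
    if len(per_trace) < 2:
        return []  # not enough traces to compute a percentile cut
    cut = statistics.quantiles(per_trace.values(), n=100)[int(quantile * 100) - 1]
    return [trace_id for trace_id, tokens in per_trace.items() if tokens > cut]

spans = [{"trace_id": f"t{i}", "llm.token_count.total": 1_000} for i in range(100)]
spans.append({"trace_id": "hot", "llm.token_count.total": 10_000_000})
print(runaway_traces(spans))  # ['hot']
```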
Common Mistakes
- No per-trace token budget. A per-call limit is not enough; cap aggregate spend per session.
- Confusing runaway cost with context overflow. Different mechanisms, different fixes; do not collapse them.
- Trusting provider-side dashboards. They lag by hours and aggregate across tenants — useless for per-customer attribution.
- No step-count cap on agent loops. Every long-loop agent needs a hard step ceiling and a clean termination path.
- Charging customers post-hoc. Without inline budget enforcement, a single bad session rewrites your unit economics.
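The step-ceiling fix from the list above is small and framework-agnostic; the function and field names here are illustrative:

```python
def agent_loop(plan_step, max_steps: int = 50) -> dict:
    """Run plan_step until it signals done or the hard step ceiling is hit.
    plan_step receives the results so far and returns a dict with a "done" flag."""
    results = []
    for step in range(max_steps):
        outcome = plan_step(results)
        results.append(outcome)
        if outcome.get("done"):
            return {"status": "completed", "steps": step + 1, "results": results}
    # Ceiling hit: terminate cleanly with whatever was gathered, never loop on.
    return {"status": "step_cap_hit", "steps": max_steps, "results": results}

# A planner that never converges trips the ceiling instead of running forever:
print(agent_loop(lambda results: {"done": False}, max_steps=5)["status"])  # step_cap_hit
```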
Frequently Asked Questions
What is runaway cost in LLM apps?
Runaway cost is unbounded token consumption across many LLM calls — usually because an agent is stuck in a loop, a recursive tool call never terminates, or a chat session accumulates context without summarisation.
How is runaway cost different from context overflow?
Context overflow is a structural per-call failure where one request exceeds the model's context window. Runaway cost is cumulative across many calls — each call may fit, but the aggregate burns budget.
How do you prevent runaway cost?
Use the FutureAGI Agent Command Center rate-limiting policy with per-tenant token budgets, attribute cost via traceAI llm.token_count attributes, and run TrajectoryScore plus StepEfficiency to catch wasted steps.