Is model fallback the same as retry?

No. Retry re-issues the same request against the same model with backoff. Fallback switches to a different model — usually on a different provider. They are typically combined: retry first, then fall back.

How does FutureAGI implement model fallback?

Agent Command Center exposes a model_fallbacks map per model. On retryable errors, the gateway walks the chain (e.g., gpt-4o → claude-sonnet-4 → gemini-2.0-pro) until a provider succeeds.

What Is Model Fallback? LLM Gateway Definition (2026)

What Is Model Fallback?

Model fallback is an LLM-gateway behaviour that switches to a different model, usually on a different provider, when the primary model returns an error, is rate-limited, or times out. The gateway holds an ordered fallback chain per primary model and walks it on retryable failure. Fallback is distinct from retry: retry re-issues the same request against the same model with exponential backoff; fallback issues the request against a different model entirely. The two compose: most production systems retry first, then fall back. FutureAGI’s Agent Command Center exposes this as the model_fallbacks chain.

Why it matters in production LLM/agent systems

Provider failure modes are the single most common cause of LLM-app outages. The list is short and recurring:

Rate limit (429). Peak hours blow through TPM or RPM caps. Without fallback, every user sees an error.
Server error (5xx). OpenAI, Anthropic, and Bedrock all post regular incidents. Single-provider apps go down with them.
Timeout. A reasoning-heavy prompt occasionally exceeds the per-request timeout. Retrying the same model usually times out again; falling back to a faster model returns some answer.
Model deprecation. A scheduled deprecation date arrives mid-incident. A fallback chain to the next-gen model is graceful; without one, the app errors.

For agent systems running 10–50 model calls per task, the probability that at least one call fails on a single provider during a 5-minute window is non-trivial. Fallback chains turn provider availability from a per-provider risk into a (much smaller) per-chain risk. The product impact: a chatbot that answers — even with a slightly different style — beats a chatbot that errors.
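
As a rough worked example of that arithmetic: assuming each call fails independently with some per-call probability, the chance that a multi-call task hits at least one failure can be computed directly. The 1% failure rates and the 30-call task below are illustrative assumptions, not measured figures, and real provider outages are often correlated, so the chained number is optimistic.

```python
def p_any_failure(p_call: float, n_calls: int) -> float:
    """Probability that at least one of n independent calls fails."""
    return 1.0 - (1.0 - p_call) ** n_calls


def p_chain_failure(p_entries: list[float]) -> float:
    """Probability that every chain entry fails for one call (independence assumed)."""
    result = 1.0
    for p in p_entries:
        result *= p
    return result


# Assumed: 1% per-call failure rate on a single provider, 30 model calls per task.
single = p_any_failure(0.01, 30)

# Same task behind a three-entry chain, each entry assumed to fail 1% of the time.
chained = p_any_failure(p_chain_failure([0.01, 0.01, 0.01]), 30)

print(f"single provider: {single:.3f}")
print(f"three-entry chain: {chained:.6f}")
```

Under these assumptions roughly one task in four sees a failed call on a single provider, while the chained figure is several orders of magnitude smaller.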

We’ve found that the cost-of-failure profile differs sharply by surface. A consumer chatbot loses session retention; a B2B agent loses an in-flight task that may already have side effects (a tool call that ran, an email draft that posted). For tool-using agents, a missing fallback is doubly painful: the orchestrator may either retry the entire trajectory (doubling cost) or surface a half-completed task to the operator. In our 2026 evals, agent runs with a configured chain saw 38% lower task-abandonment rates during simulated provider outages, mostly because the agent could complete the planning step on a fallback model even when the primary’s reasoning endpoint was down.

How FutureAGI handles it

FutureAGI’s Agent Command Center implements model fallback in internal/routing/modelfallback.go as an ordered chain per primary model. Configuration:

routing:
  failover:
    enabled: true
    max_attempts: 3
    on_status_codes: [429, 500, 502, 503, 504]
    on_timeout: true
  retry:
    enabled: true
    max_retries: 2
    initial_delay: 500ms
    multiplier: 2.0
  model_fallbacks:
    gpt-4o:
      - claude-sonnet-4
      - gemini-2.0-pro
    claude-sonnet-4:
      - gpt-4o
      - gemini-2.0-pro
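
To make the retry settings concrete, here is a small sketch of the delay schedule they imply. It assumes plain exponential backoff with no jitter or cap; the source does not say whether the gateway adds either, and the helper name `backoff_delays` is hypothetical.

```python
def backoff_delays(initial_delay_s: float, multiplier: float, max_retries: int) -> list[float]:
    """Delay before retry k (0-based) is initial_delay_s * multiplier**k."""
    return [initial_delay_s * multiplier ** k for k in range(max_retries)]


# Values from the config above: initial_delay 500ms, multiplier 2.0, max_retries 2.
delays = backoff_delays(0.5, 2.0, 2)
print(delays)  # [0.5, 1.0]
```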

The control flow on a gpt-4o request that 429s:

Retry on gpt-4o with backoff (max_retries: 2).
If still failing, walk model_fallbacks["gpt-4o"] and try claude-sonnet-4.
If that fails, try gemini-2.0-pro.
If the chain exhausts, surface the error.
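
The four steps above can be sketched in Python. This is an illustration of the retry-then-fallback pattern, not the gateway's Go implementation: `ProviderError`, `call_with_fallback`, and the `send` callable are hypothetical names, timeouts are left out, and giving each fallback entry a single attempt is an assumption the source does not confirm.

```python
import time
from typing import Callable, Optional

RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class ProviderError(Exception):
    """Hypothetical error type carrying the provider's HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


def call_with_fallback(
    primary: str,
    fallbacks: dict[str, list[str]],
    send: Callable[[str], str],
    max_retries: int = 2,
    initial_delay_s: float = 0.5,
    multiplier: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Return (model_used, response); raise the last error if the chain exhausts."""
    last_error: Optional[ProviderError] = None

    # Step 1: one attempt on the primary plus up to max_retries retries with backoff.
    delay = initial_delay_s
    for attempt in range(max_retries + 1):
        try:
            return primary, send(primary)
        except ProviderError as err:
            if err.status not in RETRYABLE_STATUS:
                raise  # non-retryable errors surface immediately
            last_error = err
            if attempt < max_retries:
                sleep(delay)
                delay *= multiplier

    # Steps 2-3: walk the chain in order (assumed: one attempt per entry).
    for model in fallbacks.get(primary, []):
        try:
            return model, send(model)
        except ProviderError as err:
            if err.status not in RETRYABLE_STATUS:
                raise
            last_error = err

    # Step 4: the chain is exhausted, so surface the last error.
    assert last_error is not None
    raise last_error


def fake_send(model: str) -> str:
    """Stand-in provider call: gpt-4o is always rate-limited."""
    if model == "gpt-4o":
        raise ProviderError(429)
    return f"answer from {model}"


chain = {"gpt-4o": ["claude-sonnet-4", "gemini-2.0-pro"]}
model_used, response = call_with_fallback("gpt-4o", chain, fake_send, sleep=lambda s: None)
print(model_used)  # claude-sonnet-4
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.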

Each attempt is a separate OTel span carrying gen_ai.request.model, agentcc.routing.attempt, and agentcc.routing.fallback_reason. Per-provider circuit breakers (internal/routing/circuitbreaker.go) short-circuit chronically-failing providers so the chain doesn’t waste a full timeout on a known-down provider. Compared with LiteLLM’s fallback config — a flat list with no circuit awareness — FutureAGI’s chain is health-aware, attempt-bounded, and carries the chain’s full attempt history into the trace tree. The same trace surfaces in FutureAGI’s observability product, so a regression eval can filter on traces where fallback_reason != null to catch a quality drop introduced by chain entry #2.
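
For readers unfamiliar with the circuit-breaker pattern mentioned above, here is a minimal generic sketch. It is not the code in circuitbreaker.go: the class name, the consecutive-failure threshold, and the cooldown behaviour are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; allows a probe after `cooldown_s`."""

    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """True if the provider should be tried; False means skip it without waiting."""
        if self._opened_at is None:
            return True
        # Once the cooldown has elapsed, let a probe request through (half-open).
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open (or re-open after a failed probe) and restart the cooldown.
            self._opened_at = self._clock()


# Usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, cooldown_s=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
blocked = breaker.allow_request()   # circuit is open, provider is skipped
now[0] = 10.0
probing = breaker.allow_request()   # cooldown elapsed, one probe is allowed
print(blocked, probing)  # False True
```

A fallback walker would call `allow_request()` before each chain entry and move straight to the next entry when it returns False, which is how the chain avoids spending a full timeout on a provider already known to be down.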

How to measure or detect it

Track model fallback with these signals:

Fallback trigger rate: fraction of requests where attempt #1 failed. Healthy: under 1%; alert at 5%.
Per-chain-entry success rate: how often each chain entry actually succeeds. If entry #2 also fails 30% of the time, redesign the chain.
Fallback-quality delta: sample requests where fallback fired and run a Coherence or AnswerRelevancy eval on the fallback-served responses, then compare the scores with a baseline from primary-served responses to similar prompts (a failed primary attempt produces no response to score). If quality drops too far, the chain entry is wrong.
agentcc.routing.fallback_reason OTel attribute: joins fallback events to the cause (status code, timeout, circuit-open).
Chain-exhaustion rate: final failures after the whole chain ran.
agentcc.routing.attempt: per-attempt counter, useful for filtering traces by chain depth.
Per-tenant fallback budget: cost cap per tenant when chain entries cost more than the primary; alert when a tenant breaches it within a billing window.

from fi.evals import AnswerRelevancy

# Periodic eval on fallback-served traces.
# prompt and fallback_response are the input and output sampled from one such trace.
result = AnswerRelevancy().evaluate(input=prompt, output=fallback_response)
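
The rate metrics in the list above can be derived from per-request attempt records. The sketch below assumes a hypothetical record shape, an ordered list of (model, succeeded) pairs per request, standing in for whatever the trace export actually provides.

```python
from collections import defaultdict

# One entry per request: ordered (model, succeeded) attempts. Hypothetical sample data.
Attempts = list[tuple[str, bool]]


def fallback_metrics(requests: list[Attempts]) -> dict:
    """Compute trigger rate, exhaustion rate, and per-model success rate."""
    if not requests:
        raise ValueError("no requests to summarise")
    total = len(requests)
    # Trigger rate follows the definition above: attempt #1 failed.
    triggered = sum(1 for attempts in requests if not attempts[0][1])
    # A request exhausted the chain if its final attempt also failed.
    exhausted = sum(1 for attempts in requests if not attempts[-1][1])
    counts = defaultdict(lambda: [0, 0])  # model -> [successes, attempts]
    for attempts in requests:
        for model, ok in attempts:
            counts[model][0] += int(ok)
            counts[model][1] += 1
    return {
        "fallback_trigger_rate": triggered / total,
        "chain_exhaustion_rate": exhausted / total,
        "success_rate_by_model": {m: s / n for m, (s, n) in counts.items()},
    }


sample = [
    [("gpt-4o", True)],
    [("gpt-4o", True)],
    [("gpt-4o", False), ("claude-sonnet-4", True)],
    [("gpt-4o", False), ("claude-sonnet-4", False), ("gemini-2.0-pro", False)],
]
metrics = fallback_metrics(sample)
print(metrics["fallback_trigger_rate"], metrics["chain_exhaustion_rate"])  # 0.5 0.25
```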

Common mistakes

Conflating fallback with retry. Retry hits the same model; fallback hits a different one. They compose, but they aren’t the same.
Configuring fallback without circuit breakers. The chain wastes full timeouts on known-down providers.
Choosing a fallback model with very different formatting. JSON-mode users falling back to a non-JSON model corrupt downstream parsers.
Forgetting that tool definitions need to translate. Some chain entries don’t support function-calling; the request silently degrades.
Treating fallback as free. A fallback chain that fires often is a routing problem, not a reliability win — fix the upstream first.
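
One way to guard against the formatting mistake above is to validate a fallback response before handing it to a downstream parser. This is a minimal sketch using only the standard library; the function name and the required-keys check are illustrative assumptions, not part of any gateway API.

```python
import json


def is_valid_json_object(text: str, required_keys: set[str]) -> bool:
    """True if text parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


# A JSON-mode response passes; a chatty response from a non-JSON model does not.
strict = is_valid_json_object('{"intent": "refund", "confidence": 0.9}', {"intent"})
chatty = is_valid_json_object('Sure! Here is the JSON: {"intent": "refund"}', {"intent"})
print(strict, chatty)  # True False
```

A response that fails the check can be treated like any other failed attempt, so the caller moves on to the next chain entry or surfaces an error instead of passing malformed output downstream.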