Engineering

How to Reduce Claude Code Token Costs by Up to 90 Percent in 2026

A practitioner's guide to cutting Claude Code token spend with five stackable levers — native cache_control, MCP-tool compilation, semantic caching, model right-sizing, and context pruning — with worked math and an honest read on where the 90 percent claim holds.

·
13 min read
ai-gateway 2026 claude-code
Editorial cover image for How to Reduce Claude Code Token Costs by Up to 90 Percent in 2026
Table of Contents

The “up to 90 percent” claim shows up on every gateway homepage in 2026. Maxim’s Bifrost benchmark reports 92.8 percent input-token reduction across 508 tools on 16 MCP servers. Anthropic’s own cache_control documentation reports a 90 percent reduction on cache-hit input tokens. Both numbers are real, both are reproducible, and neither is the average a 30-engineer team will see on its month-end Anthropic bill. The honest read on “up to 90 percent” is that it lives on specific subsets of the workload (cache-hit-heavy turns, fully compiled MCP sessions) and that the team-level average is the weighted blend across all turns, which usually lands between 60 and 85 percent when the levers are stacked competently.

This post is the playbook for stacking them competently. Five levers, each named, each quantified, each with the config that wires it. A worked example that takes a team from $50,000 a month to a defensible number. A production checklist. A short gateway picks brief at the end. The voice is opinionated and the numbers are honest. The asterisk on “up to 90 percent” is the entire point.


The claim and the asterisk

Three numbers anchor the public discourse on Claude Code cost reduction:

  1. Anthropic’s cache_control returns cached tokens at one-tenth the input price (a 90 percent discount per token) once a prefix is cached. On a Claude Code session where the system prompt and the first 50K of project context are stable across 40 turns, the cache-hit-token share of the bill drops by 90 percent. That’s real per-token, and it’s what powers most of the “90 percent” claims you see, it’s Anthropic doing the discounting, not the gateway.

  2. Maxim’s Bifrost Code Mode benchmark reports a 92.8 percent reduction in input tokens on the system-prompt boundary when 508 MCP tools are compiled into a single execute_python(code) surface. The benchmark is reproducible on the Maxim harness; on a real heterogeneous fleet, expect 25 to 45 percent on the cleanly-compilable subset of tools.

  3. The “up to 90 percent” gateway claim is the marketing line on Maxim, Bifrost, Portkey, and a half-dozen others. It’s a true statement about the maximum a single lever can do on a favourable subset, and a misleading statement about the mean a team will see.

The asterisk: stacking all five levers below produces 60 to 85 percent reduction on the median Claude Code fleet, with the high end requiring heavy MCP usage, long sessions with stable prefixes, and a real held-out evaluation gate on routing decisions. The 90 percent number is the upper bound on cache-hit-heavy and MCP-heavy subsets, not the floor.

If a vendor pitches you “up to 90 percent” without naming which lever does what or what the average is on a workload that looks like yours, the number is a vibe, not a forecast. Demand the math.


Where Claude Code’s bill actually goes

Before stacking levers, a quick anatomy of where the tokens land. From the workload data we collected across 22 engineering teams in Q1 2026:

BucketTypical share of input tokensNotes
MCP tool descriptions in system prompt28-38%Grows linearly with registered servers
MCP tool responses serialised into history13-22%Same payload often appears 15+ times in one session
Project file context (Claude Code packs aggressively)22-31%Stable across turns — high cache potential
Conversation history (user messages, model responses)14-19%Variable, hard to cache
Anthropic system prompt and tool boilerplate4-7%Constant, fully cacheable

Two things follow from this table. First, MCP-related tokens are 41 to 58 percent of input spend, the single largest bucket. Second, project file context plus the Anthropic system prompt (combined 26 to 38 percent) is highly cacheable because it doesn’t change within a session.

The five levers below attack these buckets in order of bite.


The 5 stackable levers

Lever 1: Anthropic native cache_control (foundation, 90 percent per cached token)

This is the foundation. Everything else stacks on top.

Anthropic’s prompt cache returns cached tokens at one-tenth the price of fresh input tokens, a 90 percent per-token discount once a prefix is cached and a 25 percent surcharge on the first turn that writes the cache (the “cache write” rate). Claude Code already uses cache_control on the system prompt and tool definitions in current versions, but the savings depend on prefix stability, every change to the system prompt invalidates the cache.

The gateway’s job here isn’t to invent caching; the gateway’s job is to keep the prefix stable. That means:

  • Tool definitions ordered deterministically. Some gateways re-sort tool definitions per session, which silently invalidates the cache. The fix is a stable canonical order, typically alphabetical by tool ID, with version in the description body, not in a position-shifting metadata field.
  • System-prompt boilerplate frozen. Inject team-specific instructions in a separate cache block downstream of the Anthropic boilerplate, not interleaved with it.
  • Project context promoted to a cache block. The first 50K to 120K tokens of project file context that Claude Code packs at session start should be marked cache_control: ephemeral so it cools down at 5 minutes of inactivity rather than between turns.
# Gateway-side cache_control injection on the Anthropic forwarding path
from anthropic.types import TextBlockParam

def inject_cache_breakpoints(messages):
    """
    Mark the first 4 large stable chunks with cache_control.
    Anthropic allows up to 4 cache breakpoints per request.
    """
    # Block 1: Anthropic boilerplate + tool definitions (sorted, stable)
    # Block 2: Team system prompt
    # Block 3: Project context (packed by Claude Code)
    # Block 4: Conversation history up to N-1
    for block_id in stable_prefix_blocks(messages):
        messages[block_id]["cache_control"] = {"type": "ephemeral"}
    return messages

Expected reduction on the cached subset: 90 percent per token, as advertised. Expected reduction on the team-level bill: 35 to 55 percent. The gap is because not every input token is cacheable, the user’s new message, the model’s incremental reasoning, and most tool-response chunks are fresh on every turn.

Failure mode: cache thrashing. If a gateway plugin injects a timestamp or a session-specific UUID into the system prompt for “telemetry”, the cache invalidates every single turn and the savings vanish. Audit your gateway’s modifications to the system prompt monthly.

Lever 2: MCP-tool compilation (50 to 70 percent on tool tokens)

The second-largest lever, attacking the 28 to 38 percent of input spend that’s MCP tool descriptions.

Instead of advertising N tool definitions and round-tripping each invocation, the gateway compiles MCP tools into a Python module exposed as a single high-level tool, execute_python(code). The model writes Python that calls compiled functions; the gateway sandboxes execution and returns the result. Maxim’s Bifrost benchmark on this pattern reports 92.8 percent reduction at the system-prompt boundary. On a real Claude Code fleet, expect 50 to 70 percent on the tool-token subset, because only 55 to 70 percent of tools compile cleanly, tools with streaming responses, binary outputs, or interactive prompts stay in tool-mode.

# Gateway MCP routing rule — compile-mode with promotion gate
mcp:
  compile_mode:
    enabled: true
    candidates: auto
    promotion_gate:
      held_out_eval: task_completion+code_correctness
      min_score_ratio: 0.99
    sandbox:
      runtime: python-3.12
      timeout_seconds: 30
      memory_mb: 256
      no_network: true

The non-negotiable gate: held-out evaluation. Compile-mode promotes a tool only if it scores within 1 percent of tool-mode on task completion, tool-call accuracy, and code-correctness rubrics. Without that gate, the gateway saves tokens by degrading quality, which is the worst kind of cost reduction, it doesn’t show up on the Anthropic invoice but it shows up on the engineering team’s velocity numbers.

Expected reduction on the tool-token subset: 50 to 70 percent. Expected contribution to the team-level bill: 14 to 25 percent of total Claude Code spend. Stacks cleanly with lever 1 because cache_control still applies to the compressed tool surface.

Lever 3: Semantic caching of tool results (10 to 30 percent on tool-call subset)

The third lever attacks the 13 to 22 percent of input spend that’s MCP tool responses serialised back into conversation history.

A semantic cache at the gateway returns the previous payload for a tool call with the same name and a content-aware hash of the arguments. Per-tool TTLs are mandatory, a 60-minute TTL on git.diff poisons subsequent turns the moment the user commits; a 30-second TTL on linear.list_issues burns spend.

ToolDefault TTLRationale
filesystem.read300sFiles change less often than the window
git.diff90sDiffs change as the agent edits
git.log600sLog is append-only
linear.list_issues1800sIssues move slowly
postgres.query (read-only)60sConservative; data may be live
web.fetch60sPages change

Hit rate stabilises around 35 to 55 percent within a week of real usage. The token saving is smaller than the hit rate because tool responses are smaller on average than tool descriptions, expect 10 to 30 percent of the tool-call subset, or roughly 3 to 8 percent of total Claude Code spend. Compounds with lever 1 because cached tool responses become part of a cacheable prefix.

Lever 4: Model right-sizing (80 to 85 percent on routed turns)

The fourth lever attacks the cost asymmetry between Claude models. As of May 2026:

ModelInput $/1MOutput $/1MNotes
claude-opus-4-7$15$75Default for hard reasoning, long tool sequences
claude-sonnet-4-6$3$15Default for medium-complexity edits
claude-haiku-4-5$0.80$4Fast and cheap for simple turns

A Claude Code session has many simple turns, file reads, syntax fixes, formatting, single-line edits, that Haiku handles indistinguishably from Opus. Routing only those turns to Haiku saves 95 percent per-token on the routed subset. Routing some to Sonnet saves 80 percent. The whole-session saving depends on the routed share, which in our reference fleet was 38 to 52 percent of turns.

The routing rule that works in practice:

def pick_model(turn_context):
    """
    Heuristic routing for Claude Code turns. The gateway evaluates this
    per-turn; the optimizer tunes the thresholds over time.
    """
    input_tokens = count_tokens(turn_context.messages)
    has_tool_calls = any_tool_required(turn_context)
    complexity_score = turn_context.classifier.score()  # 0..1

    # Hard turns to Opus
    if complexity_score > 0.7 or input_tokens > 60_000:
        return "claude-opus-4-7"
    # Medium turns to Sonnet
    if complexity_score > 0.3 or has_tool_calls:
        return "claude-sonnet-4-6"
    # Simple turns to Haiku
    return "claude-haiku-4-5"

Expected reduction on routed turns: 80 to 95 percent. Expected contribution to the team-level bill: 20 to 35 percent.

The non-negotiable gate again: held-out evaluation. A quality regression on Haiku-routed turns shows up as silent code defects, not as a Claude Code error. fi.evals scoring per-turn against task_completion and code_correctness rubrics is what keeps the router honest. Without it, model right-sizing is just hoping.

Lever 5: Context window pruning (15 to 25 percent on long sessions)

The fifth lever attacks conversation history growth on long sessions.

Claude Code keeps the full conversation history in input context by default. By turn 40, the history alone is 80K to 200K tokens, most of which is content the current turn doesn’t need, old tool outputs that have been superseded, file reads that have been re-read with newer content, intermediate reasoning the model has since incorporated.

Pruning runs at the gateway: identify spans that are safely droppable (superseded tool outputs, re-read files with newer content, intermediate reasoning that has been summarised), replace them with a one-line marker, keep everything else. The model retains continuity through the summary markers; the input-token count drops 15 to 25 percent on sessions past turn 20.

This is the highest-risk lever, pruning the wrong span breaks coherence. Two safety rules:

  • Never prune the most recent N turns (default N=10). Recent context is where the active reasoning lives.
  • Never prune content the model explicitly references. The gateway parses model output for back-references (“as I noted earlier”) and pins those spans.

Pruning compounds the worst with lever 1 (cache_control), because every prune invalidates the cache prefix downstream of it. The fix is to prune in batches and only at session boundaries when possible, accepting that mid-session pruning costs one cache-write turn to amortise across the next N cached turns.


Worked example: $50K/month to $X/month

A 30-engineer team running Claude Code at $50,000 per month is the canonical example. Pre-gateway, the bill breaks down roughly as:

Monthly Claude Code bill: $50,000
  - Input tokens (uncached): ~$32,500 (65%)
  - Output tokens: ~$17,500 (35%)

Input-token decomposition:
  - MCP tool descriptions (sys prompt):     $9,750 (30% of input)
  - MCP tool responses in history:          $5,850 (18% of input)
  - Project context (Claude Code packs):    $8,775 (27% of input)
  - Conversation history:                   $5,200 (16% of input)
  - Anthropic boilerplate:                  $2,925 (9% of input)

Applying the levers in sequence (each lever applies after the previous):

Lever 1 — Anthropic cache_control on stable prefix
  Cacheable subset: tool defs ($9,750) + project ctx ($8,775) + boilerplate ($2,925)
                  = $21,450 of input spend
  Cache hit rate on that subset: ~85% steady-state
  Per-cached-token discount: 90%
  Savings on input: 0.85 × 0.90 × $21,450 = $16,409
  After lever 1: input $32,500 - $16,409 = $16,091

Lever 2 — MCP-tool compilation (Code Mode)
  Tool-description bill after lever 1: $9,750 × 0.15 (uncached share) = $1,463
  Compile-mode coverage: 60% of tools
  Reduction on compiled subset: 60%
  Savings on input: 0.60 × 0.60 × $9,750 = $3,510 (on the full pre-cache count)
                    More precisely on post-cache cost: 0.60 × 0.60 × $1,463 = $527
  After lever 2 on input: $16,091 - $527 = $15,564
  (Note: lever 2's headline savings come on cache-write turns and uncached subset;
   the per-token economics shift but the team-level dollar saving is smaller post-cache)

Lever 3 — Semantic cache on tool responses
  Tool-response bill after lever 1: $5,850 × 0.30 (cache hit only partial here) = $1,755
  Semantic cache hit rate: 38%
  Savings on input: 0.38 × $1,755 = $667
  After lever 3 on input: $15,564 - $667 = $14,897

Lever 4 — Model right-sizing (Haiku/Sonnet for easy turns)
  Routed turn share: 45% (steady-state after eval-gated tuning)
  Avg saving per routed turn (Opus → Sonnet/Haiku mix): 82%
  This applies to BOTH input and output tokens on routed turns
  Input savings: 0.45 × 0.82 × $14,897 = $5,497
  Output savings: 0.45 × 0.82 × $17,500 = $6,457
  After lever 4: input $9,400, output $11,043 → total $20,443

Lever 5 — Context window pruning on long sessions
  Long-session share: ~40% of input volume (turns past #20)
  Pruning reduction on that subset: 20%
  Cache write penalty: ~5% of saved tokens billed at write rate
  Net savings: 0.40 × 0.20 × $9,400 × 0.95 = $714
  After lever 5: input $8,686, output $11,043 → total $19,729

Final monthly bill: ~$19,700
Reduction vs $50,000 baseline: 60.6%

The 60.6 percent is the honest number for a heavy-MCP, long-session, eval-gated implementation of all five levers. The same math with the optimistic dials, 92 percent cache hit, 75 percent compile-mode coverage, 55 percent routed share, lands around 73 percent. The 90 percent ceiling is only reachable when MCP is the dominant cost (60+ percent of input) and the session profile is overwhelmingly stable-prefix (very high cache hit). Both conditions are common; their intersection is the upper-end workload.

Either of these numbers is a defensible answer to the CFO. “Up to 90 percent” without showing the math isn’t.


Production checklist

ConcernWhat to check
Cache stabilityNo timestamps or session UUIDs injected into the system prompt; cache-write rate trends down within a week
Cache breakpoint placementAll 4 Anthropic cache breakpoints used, at the deepest stable boundaries — boilerplate, tool defs, project context, history-up-to-N-1
Compile-mode gatefi.evals scoring compile-mode vs tool-mode on held-out tasks before promoting any tool
Per-tool TTLEach MCP tool has an explicit TTL; no global default beyond 5 minutes
Router quality gatePer-turn eval scores tracked by chosen model; auto-rollback if Haiku-routed code_correctness regresses >2 percent
Pruning safetyNever prune most recent 10 turns; never prune content the model back-references
Latency budgetGateway hop p95 under 25 ms on Anthropic calls, under 15 ms on cached MCP calls; Protect inline at ~67 ms text (arXiv 2510.13351)
Cost attributionEvery span tagged with developer, session, repo, model used, cache_hit flag, compile_mode flag
Rollback pathUnsetting ANTHROPIC_BASE_URL and reverting mcp.json reverts to direct Anthropic; rehearse quarterly
Cold-startFirst-request latency after deploy under 500 ms p95; warm the cache and classifier before swing rollout
Provider invoice reconciliationMonthly: Anthropic invoice vs gateway’s reported spend within 2 percent; investigate larger drifts

Gateway picks brief

A short orientation. The full head-to-head lives in the sibling Best 5 AI Gateways to Monitor Claude Code Token Usage in 2026 and Best MCP Gateway for Claude Code in 2026 posts.

Future AGI Agent Command Center. The only gateway here that wires all five levers into a self-improving loop. traceAI (Apache 2.0) captures per-tool and per-call spans; ai-evaluation (Apache 2.0) scores compile-mode promotion and router quality; agent-opt (Apache 2.0) tunes thresholds, TTLs, and routing policies session over session. Protect runs inline at ~67 ms text. The 60 to 85 percent reduction holds and tightens over months instead of plateauing after first setup. Hosted Agent Command Center plus Apache 2.0 building blocks.

Maxim Bifrost. Authors of the Code Mode pattern and the published 92.8 percent benchmark on 508 tools across 16 MCP servers. The most direct path to lever 2. Native semantic cache, OTel-native span export, Apache 2.0 Go binary. No optimizer, the levers stay where you set them. Pick for a one-shot benchmark or when compile-mode is the dominant lever.

Portkey. Polished hosted product. Virtual keys for per-developer attribution, mature semantic cache, the prettiest dashboard in the cohort. No compile-mode and no optimizer, so the ceiling on Portkey alone is 35 to 55 percent, strong on levers 1, 3, and 4, missing lever 2. Procurement signal: April 30, 2026 Palo Alto Networks acquisition merging the roadmap into Prisma AIRS.

Kong AI Gateway. Right pick if Kong is already the company’s API platform. Caching and routing through plugins, AI Proxy 3.6 supports MCP. No native compile-mode, wrapping MCP tools in Lua is meaningful engineering work. Plan two weeks of platform-team time for an MCP-grade dashboard.

LiteLLM. Source-available, Python-native, self-hostable. Strong on lever 4 (model routing) and lever 1 (cache_control passthrough), thinner on lever 2. The March 2026 PyPI supply-chain incident plus CVE-2026-30623 make version hygiene a permanent operational task; pin to 1.83.7-stable or later and audit weekly.


Sources

  • Anthropic prompt caching documentation, claude.ai/docs/build-with-claude/prompt-caching
  • Anthropic pricing, claude.ai/pricing (Opus 4-7, Sonnet 4-6, Haiku 4-5)
  • Maxim Bifrost Code Mode benchmark (92.8% reduction across 508 tools on 16 MCP servers), getmaxim.ai/bifrost/resources/code-mode
  • Model Context Protocol specification 2025-11-25, modelcontextprotocol.io/specification/2025-11-25
  • Future AGI Agent Command Center docs, docs.futureagi.com/docs/command-center
  • Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (67ms text, 109ms image)
  • Future AGI traceAI repo, github.com/future-agi/traceAI
  • Future AGI ai-evaluation repo, github.com/future-agi/ai-evaluation
  • Future AGI agent-opt repo, github.com/future-agi/agent-opt
  • Portkey AI gateway, portkey.ai
  • Palo Alto Networks Portkey acquisition (April 30, 2026), paloaltonetworks.com/company/press/2026/palo-alto-networks-to-acquire-portkey
  • Kong AI Gateway, konghq.com/products/kong-ai-gateway
  • LiteLLM PyPI supply-chain advisory (March 2026), docs.litellm.ai/blog/security-update-march-2026


Frequently asked questions

Is the 'up to 90 percent' claim a lie?
No, it is a true statement about the maximum a single lever (Anthropic native cache_control on cache-hit tokens, or Maxim's Code Mode on compiled-tool subsets) can do on a favourable subset. It is not the average a team will see on the month-end invoice. The team-level reduction from stacking all five levers competently is 60 to 85 percent. Treat 'up to 90 percent' as the ceiling on specific subsets, not the floor on the bill.
Which lever has the biggest impact?
On most Claude Code fleets, Anthropic native cache_control (lever 1) and model right-sizing (lever 4) are the two largest contributors — between them they typically account for two-thirds of the total savings. MCP-tool compilation (lever 2) is the largest contributor on heavy-MCP workloads with 6 or more servers registered. Semantic caching (lever 3) and context pruning (lever 5) are the long tail.
Do I need a gateway to use cache_control?
No. Anthropic's prompt cache works natively when Claude Code sends the right `cache_control` markers, which it already does in current versions. A gateway adds value by keeping the prefix stable across team-wide system-prompt changes, by auditing for cache-thrashing modifications, and by reporting cache-hit rates per developer and per repo. Without a gateway, the savings happen but the visibility does not.
Is compile-mode safe for production code?
Only with a held-out evaluation gate. Compile-mode executes model-written Python in a sandbox at the gateway; without `fi.evals` (or an equivalent) scoring compile-mode versus tool-mode on a held-out test set, the gateway promotes tools that silently regress code-correctness. The cost goes down, the quality goes down with it. The promotion gate is not optional infrastructure — it is the difference between cost reduction and quality regression dressed up as cost reduction.
How do I prove the savings to finance?
Three artifacts. (1) The Anthropic monthly invoice — falling. (2) The gateway's per-developer chargeback table — distribution of cost across the team, with cache-hit and compile-mode flags. (3) The `fi.evals` quality dashboard — proof that `code_correctness` and `task_completion` did not regress alongside the cost cut. Finance signs off on (1) plus (2). Engineering signs off on (3). Both signatures are required; either alone is fragile.
Related Articles
View all
Stay updated on AI observability

Get weekly insights on building reliable AI systems. No spam.