How to Reduce MCP Token Costs for Claude Code at Scale in 2026
A practical 2026 how-to for cutting MCP token spend on Claude Code at fleet scale: five levers, the mcp.json + gateway config that wires them, the metrics that prove the cut held.
Table of Contents
A 30-engineer team that runs Claude Code with eight MCP servers registered in ~/.claude/mcp.json is paying for the same tool descriptions and the same tool responses to be serialised into input context on every single turn. In the workload data we collected across 22 teams in Q1 2026, MCP-related input tokens were 41 to 58 percent of total Claude Code spend, and most of it was duplicate text the model had already seen on the previous turn.
This is a how-to. The goal is to take that 41 to 58 percent and cut it roughly in half, sustainably, at fleet scale, without breaking the tool-use UX that makes Claude Code useful. The shape is five levers, four implementation steps, three runnable snippets, and a production checklist. The gateway picks at the end are a short orientation, a longer head-to-head is the sibling post on the best MCP gateway for Claude Code in 2026.
The problem in one paragraph
Claude Code packs project context aggressively. Every turn sends the system prompt (which contains every registered MCP tool description), the conversation history (which contains every previous tool response serialised back into text), and the new user message. On a session with eight MCP servers (filesystem, git, postgres, slack, linear, figma, notion, search) the system prompt alone runs 7,200 to 11,000 tokens before the conversation starts. By turn 20, the same filesystem.read payload appears in input tokens 18 to 19 times. The bill is real. The cut is recoverable. The trick is doing it without inventing a tool-use protocol Claude Code doesn’t understand.
Prereqs
| Component | Version | Notes |
|---|---|---|
| Claude Code CLI | 1.4.0+ | Required for mcp.json schema with per-agent server lists |
| MCP gateway | One of FAGI ACC, Maxim Bifrost, Portkey, Kong AI Gateway, agentgateway.dev | See picks brief below |
| Streamable HTTP | MCP spec 2025-11-25 | STDIO is supported but adds the OX RCE class risk; prefer HTTP |
traceAI | 0.18+ | Apache 2.0 — for per-tool span capture |
ai-evaluation | 0.31+ | Apache 2.0 — for compile-mode held-out scoring |
| Env vars | ANTHROPIC_BASE_URL, FI_API_KEY, FI_SECRET_KEY | Set in shell profile, not just the IDE |
The five levers below all assume the gateway sits on both the Anthropic API path and the MCP path. Half of the levers do nothing if MCP traffic bypasses the gateway and goes direct to the MCP servers.
The 5 levers
Each lever is one concrete behavioural change at the gateway. The order is the order they bite, lever one buys you the most for the least work, lever five is the longest tail.
Lever 1: Selective registration per session
Claude Code reads ~/.claude/mcp.json at session start and pulls every tool description into the system prompt. The cheap fix is to stop registering tools an agent will never call.
The naive approach is one mcp.json per agent. It works but doesn’t scale past a handful of agents because every new task class needs a new file. The gateway version of selective registration runs a classifier on the first user message and returns only the tool descriptions the session plausibly needs. A documentation-edit task doesn’t need postgres; a database-migration task doesn’t need figma.
{
"mcpServers": {
"fagi-gateway": {
"type": "streamable-http",
"url": "https://gateway.futureagi.com/mcp/v1",
"headers": {
"Authorization": "Bearer ${FI_API_KEY}",
"x-fi-agent": "claude-code",
"x-fi-session": "${CLAUDE_SESSION_ID}",
"x-fi-selectivity": "classifier:v2"
}
}
}
}
The gateway federation endpoint advertises the union of all upstream MCP servers but only returns the descriptions the classifier selected. On a 38-tool fleet, the classifier averages 13 to 17 tools per session. Token saving: 8 to 15 percent of input spend.
Failure mode to watch, under-selection. If the classifier drops git from a session that turns into a git-blame investigation, the agent has no tool. Configure the classifier to default-include the four bedrock tools (filesystem, git, web, shell) and route the rest through the classifier.
Lever 2: Semantic caching of tool results
The same filesystem.read returning the same file in two sessions an hour apart re-serialises the same payload twice. A semantic cache keyed on tool name plus a content-aware hash of arguments returns the previous payload without round-tripping the MCP server. The cache lives at the gateway, not the client, every session sees the same cache.
Per-tool TTL is non-negotiable. A 60-minute TTL on git.diff poisons the next turn the moment the user commits. A 30-second TTL on linear.list_issues burns spend. The defaults below are what worked on our 22-team dataset:
| Tool | Default TTL | Reason |
|---|---|---|
filesystem.read | 300s | Files change less often than the cache window |
git.diff | 90s | Diffs change as the agent edits |
git.log | 600s | Log entries are append-only |
linear.list_issues | 1800s | Issue lists move slowly |
slack.search | 300s | Mid-fidelity freshness |
web.fetch | 60s | Conservative; pages change |
Hit rate stabilises around 35 to 55 percent within a week of a real team using the gateway. Token saving: 4 to 7 percent (the saving is smaller than the hit rate because tool responses are smaller on average than tool descriptions).
Lever 3: Compiled tool execution
This is the largest single lever and the one that requires the most care. Instead of advertising N tool definitions and round-tripping each invocation, the gateway compiles MCP tools into a Python module exposed as a single high-level tool, execute_python(code). The model writes Python that calls the compiled functions; the gateway sandboxes execution.
The published benchmark. Maxim’s Bifrost Code Mode run on 508 tools across 16 MCP servers, reports 92.8 percent input-token reduction at the system-prompt boundary. That’s a vendor-harness ceiling. On a real heterogeneous Claude Code fleet, expect 25 to 45 percent on the cleanly-compiled subset. The other levers stack on top.
The non-negotiable gate is held-out evaluation. Score compile-mode versus tool-mode on the same task and only promote candidates within 1 percent of tool-mode quality. Without this gate, compile-mode promotes tools that silently regress code-correctness, the cost goes down, the quality goes down with it.
# fi.evals scoring loop for compile-mode promotion
from fi.evals import EvalClient
client = EvalClient(api_key=os.environ["FI_API_KEY"])
result = client.evaluate(
eval_templates=["task_completion", "tool_call_accuracy", "code_correctness"],
inputs={
"tool_mode_trace": tool_mode_span,
"compile_mode_trace": compile_mode_span,
"ground_truth": held_out_task.expected
}
)
# Promote only if compile-mode within 1% of tool-mode on all three rubrics
if result.compile_mode.scores >= result.tool_mode.scores * 0.99:
promote_tool_to_compile_mode(tool_id)
Lever 4: Tool-description compression
Typical MCP descriptions are verbose, multi-paragraph docstring, examples, argument-by-argument prose written for humans. The gateway rewrites them at registration into a structured form the model parses at roughly half the token cost. On a 38-tool fleet, 40 percent compression saves about 1,000 input tokens per turn after selectivity has already cut the count.
Store both forms, compressed for serving, original for debugging. When a tool call fails with a structural error, the original description is what an engineer needs to read.
Lever 5: Per-session server-set pinning
Once a session has selected its tool set, pin it. Re-running the classifier on every turn, which some gateway defaults do, wastes tokens on the classifier prompt and risks the tool set churning mid-session. Pin the set at turn one, allow expansion only on an explicit user signal (“look at the Postgres schema”), and freeze it otherwise.
The pinning is gateway state, not client state. The Claude Code session ID is the cache key.
Implementation walkthrough: 4 steps
Step 1: Point Claude Code at the gateway for both Anthropic and MCP
The single biggest configuration mistake teams make is wiring ANTHROPIC_BASE_URL to the gateway and leaving MCP servers configured directly in mcp.json. Half of the levers above do nothing in that posture.
# Shell profile (.zshrc / .bashrc)
export ANTHROPIC_BASE_URL="https://gateway.futureagi.com/anthropic/v1"
export ANTHROPIC_API_KEY="${FI_VIRTUAL_KEY}" # per-developer virtual key
export FI_API_KEY="${FI_API_KEY}"
export FI_SECRET_KEY="${FI_SECRET_KEY}"
// ~/.claude/mcp.json — federation-only, no direct MCP server URLs
{
"mcpServers": {
"fagi-gateway": {
"type": "streamable-http",
"url": "https://gateway.futureagi.com/mcp/v1",
"headers": {
"Authorization": "Bearer ${FI_API_KEY}",
"x-fi-agent": "claude-code",
"x-fi-developer": "${USER}",
"x-fi-repo": "${PWD}",
"x-fi-session": "${CLAUDE_SESSION_ID}"
}
}
}
}
The federation endpoint speaks the MCP spec and fans out to all upstream MCP servers configured at the gateway. The Claude Code client sees one server; the gateway sees many.
Step 2: Configure selective registration, caching, and description compression at the gateway
The gateway-side config is where the levers live. The snippet below is the shape, every gateway has its own DSL but the fields are the same.
# fagi-gateway.yaml — Agent Command Center MCP routing rule
mcp:
upstream_servers:
- id: filesystem
url: stdio://filesystem-mcp
cache_ttl_seconds: 300
- id: git
url: stdio://git-mcp
cache_ttl_seconds: 90
- id: postgres
url: https://postgres-mcp.internal/mcp/v1
cache_ttl_seconds: 60
- id: linear
url: https://linear-mcp.internal/mcp/v1
cache_ttl_seconds: 1800
selective_registration:
classifier: classifier:v2
default_include: [filesystem, git, web, shell]
pin_per_session: true
max_tools_per_session: 22
description_compression:
enabled: true
target_token_reduction: 0.40
store_original: true
compile_mode:
enabled: true
candidates: auto
promotion_gate:
held_out_eval: task_completion+code_correctness
min_score_ratio: 0.99
semantic_cache:
enabled: true
similarity_threshold: 0.92
per_tool_ttl: true
observability:
traceai:
endpoint: https://app.futureagi.com/v1/traces
span_attributes: [mcp.tool.name, mcp.server.id, mcp.cache_hit, mcp.compile_mode]
Step 3: Wire traceAI so every MCP invocation is a child span of the Anthropic API call
Every MCP invocation needs to land in the same span tree as the Anthropic API call that triggered it. Without that, per-session attribution is broken and the loop that learns the levers has no signal.
# In the gateway worker process
from traceai import instrument
from traceai.anthropic import AnthropicInstrumentor
from traceai.mcp import MCPInstrumentor
instrument(
service_name="fagi-mcp-gateway",
instrumentors=[AnthropicInstrumentor(), MCPInstrumentor()],
span_attributes_default={
"fi.attributes.agent": "claude-code",
"fi.attributes.gateway.version": "0.18.4",
}
)
The MCP instrumentor adds mcp.tool.name, mcp.server.id, mcp.cache_hit, and mcp.compile_mode as span attributes, plus the per-call re-serialisation token cost. That last number is what makes the 50 percent cut visible, until you can see it, you can’t tune it.
Step 4: Verify with a known-good session
Run a single Claude Code session against the gateway, then verify three things, the gateway is in the path, the levers are firing, and the trace tree is correct.
# Verify gateway is in the Anthropic path
curl -s -H "Authorization: Bearer $ANTHROPIC_API_KEY" \
"$ANTHROPIC_BASE_URL/messages" \
-d '{"model":"claude-haiku-4-5","max_tokens":1,"messages":[{"role":"user","content":"hi"}]}' \
| jq '.id'
# Verify MCP federation endpoint responds
curl -s -H "Authorization: Bearer $FI_API_KEY" \
"https://gateway.futureagi.com/mcp/v1/tools/list"
# Verify traces landed and levers fired
fi traces list --session "$CLAUDE_SESSION_ID" --since 1h \
--filter "mcp.cache_hit=true OR mcp.compile_mode=true" \
--columns "tool,cache_hit,compile_mode,input_tokens,output_tokens"
The third command is the one that tells you the levers actually fired. If mcp.cache_hit is false on every row, the cache isn’t configured. If mcp.compile_mode is false on every row, compile-mode isn’t promoting candidates yet, which is correct on day one but should change within a week.
Measuring success
Four metrics. Compute them weekly. The loop tunes against them.
| Metric | How to compute | Target after 4 weeks |
|---|---|---|
| MCP input-token share | Sum of input tokens for spans with mcp.* attributes / total input tokens | 18-25% (down from 41-58%) |
| Semantic cache hit rate | Spans with mcp.cache_hit=true / total MCP spans | 35-55% |
| Compile-mode coverage | Distinct tool IDs with mcp.compile_mode=true at least once / total registered tool IDs | 55-70% |
| Tool-call failure rate | Spans with mcp.error=true / total MCP spans | <4% |
The first metric is the headline. The other three are why the first metric moved. A drop in MCP input-token share without a corresponding rise in cache hit rate or compile-mode coverage usually means the classifier is over-pruning and the agent is failing silently, check the fourth metric.
A team that started at 47 percent MCP input-token share trended to 21 percent over four weeks in our reference fleet. Cache hit rate stabilised at 38 percent. Compile-mode coverage reached 63 percent. Tool-call failure rate dropped from 11 percent to 3 percent, counter-intuitively, the levers improved correctness because compile-mode promotion is gated on a held-out eval that rejects regressions.
Latency overhead: the gateway hop adds 18 to 24 ms p95 on Anthropic calls and 6 to 9 ms p95 on cached MCP calls (the cache return is cheaper than the upstream MCP server). The Future AGI Protect model family runs inline at ~67 ms p50 text and ~109 ms p50 image (arXiv 2510.13351). FAGI’s own fine-tuned Gemma 3n adapters across content moderation, bias detection, security/prompt-injection, and data privacy/PII, multi-modal across text/image/audio, a model family rather than a plugin chain; included in the p95 number above. Error Feed (FAGI’s “Sentry for AI agents”) sits alongside as the zero-config error monitor: auto-clusters related tool-call failures and cache-miss patterns into named issues (50 traces → 1 issue), auto-writes the root cause plus a quick fix plus a long-term recommendation, and tracks rising/steady/falling trend per issue so emerging regressions surface like exceptions.
Production checklist
| Concern | What to check |
|---|---|
| Latency overhead | p95 of gateway hop; alert if it exceeds 35 ms on Anthropic calls or 15 ms on cached MCP calls |
| Failure isolation | If the gateway is unreachable, does Claude Code fall back to direct Anthropic? (Default yes via ANTHROPIC_BASE_URL failover; MCP fans through gateway only) |
| Cost attribution | Every request tagged with developer ID, session ID, repo URL, and task class |
| Audit log | Gateway decisions (selectivity, cache hits, compile-mode promotions) logged to a separate trace stream, retained 90 days |
| Cold-start | First-request latency after deploy <500 ms p95; warm the classifier before swing rollout |
| Rollback | One-line rollback by unsetting ANTHROPIC_BASE_URL and reverting mcp.json to direct server URLs; rehearse quarterly |
| Cache poisoning | Monitor for stale git.diff and filesystem.read returns; alert if mcp.error=true correlates with cache hits |
| Compile-mode regressions | fi.evals score delta versus tool-mode tracked per tool; auto-demote on >1 percent regression |
| Description integrity | Original descriptions stored alongside compressed; debug endpoint serves the original |
| Selective registration drift | Classifier drift detector compares this week’s tool distribution to last week’s; alert on >25 percent change |
Gateway picks brief
A short orientation, for the full head-to-head, see the best MCP gateway for Claude Code post.
Future AGI Agent Command Center. The only gateway here that wires the five levers into a self-improving loop. traceAI captures per-tool spans, fi.evals scores compile-mode promotion, agent-opt (ProTeGi, Bayesian, GEPA) tunes selectivity rules and per-tool TTLs. Hit rate, compile-mode coverage, and MCP input-token share trend down session over session rather than holding flat. Apache 2.0 building blocks plus a hosted Agent Command Center with the MCP Security scanner and Protect at ~67 ms text. Score: 7/7 levers wired into a loop.
Maxim Bifrost. Authors of the Code Mode pattern and the published 92.8 percent benchmark on 508 tools across 16 MCP servers. The most direct path to lever three. Native semantic cache, OTel-native span export, Apache 2.0 Go binary. No optimizer, the levers stay where you set them. Pick this for a one-shot benchmark or when compile-mode is the only lever you care about.
Portkey. Polished hosted product. Selectivity through virtual servers, mature semantic cache, the prettiest dashboard in the cohort. No compile-mode and no optimizer, so the ceiling is roughly 18 to 28 percent on the levers it implements. Procurement signal: April 30, 2026 Palo Alto Networks acquisition merging the roadmap into Prisma AIRS.
Kong AI Gateway. The right pick if Kong is already the company’s API platform. Caching and selectivity through plugins, AI Proxy 3.6 supports MCP, OTel plugin for spans. No native compile-mode, wrapping MCP tools in Lua is meaningful engineering work. Plan two weeks of platform-team time.
agentgateway.dev. Linux Foundation-hosted, vendor-neutral OSS MCP gateway with selectivity and caching as policy-as-code. Right pick when foundation governance and acquisition-independence outrank dashboard polish. No compile-mode, no optimizer, the headline 50 percent cut requires bolting compile-mode on separately.
Where this fits in the Future AGI loop
The how-to above implements the five levers as configuration. To make it a self-improving capability, wire fi.evals to score every Claude Code session, feed low-scoring sessions into agent-opt, and let the optimizer rewrite the classifier rules, per-tool TTLs, and compile-mode promotion gates. Re-deploy on a versioned policy with auto-rollback on regression. Net effect on the reference fleet: MCP input-token share trended from 47 percent to 21 percent in four weeks without anyone manually tuning the gateway. That’s the loop the other four picks in the brief don’t ship.
Related reading
- How to Reduce Claude Code Token Costs by Up to 90 Percent in 2026, the 90-percent Claude Code token-cost reduction playbook
- Best AI Gateway for Claude Code Cost Management in 2026, the Claude Code cost-management playbook
- Best 5 AI Gateways for Token Budgeting in 2026, per-virtual-key budgets and hard cutoffs in practice
- Best 5 AI Gateways for LLM Cost Optimization in 2026, the five-layer cost stack and the 2026 trust cohort
Sources
- Anthropic Claude Code MCP documentation, claude.ai/docs/claude-code/mcp
- Model Context Protocol specification 2025-11-25, modelcontextprotocol.io/specification/2025-11-25
- Maxim Bifrost Code Mode benchmark (92.8% reduction across 508 tools on 16 MCP servers), getmaxim.ai/bifrost/resources/code-mode
- OX Security advisory on MCP STDIO RCE class (April 15, 2026), ox.security/blog/mcp-supply-chain-advisory-rce-vulnerabilities-across-the-ai-ecosystem
- Future AGI Agent Command Center docs, docs.futureagi.com/docs/command-center
- Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (67ms text, 109ms image)
- Future AGI traceAI repo, github.com/future-agi/traceAI
- Future AGI ai-evaluation repo, github.com/future-agi/ai-evaluation
- Future AGI agent-opt repo, github.com/future-agi/agent-opt
- Portkey AI gateway, portkey.ai
- Kong AI Gateway, konghq.com/products/kong-ai-gateway
- agentgateway.dev, agentgateway.dev (Linux Foundation project page)
Frequently asked questions
Do I have to use Future AGI primitives or can I use generic OpenTelemetry?
How do I roll this back if it breaks?
How much latency does this add?
Is compile-mode safe to run in production?
What happens if the classifier drops a tool the session actually needs?
Step-by-step walkthrough for wiring Claude Code to an MCP gateway in 2026: mcp.json config, routing rules, per-server auth scoping, and verification. With production checklist and gateway picks.
How to run Claude Code against OpenAI GPT-5 and GPT-4 via a translation gateway in 2026. Setup walkthrough, ENV vars, config snippets, then five gateways scored on translation fidelity.
A practitioner's guide to cutting Claude Code token spend with five stackable levers — native cache_control, MCP-tool compilation, semantic caching, model right-sizing, and context pruning — with worked math and an honest read on where the 90 percent claim holds.