
How to Reduce MCP Token Costs for Claude Code at Scale in 2026

A practical 2026 how-to for cutting MCP token spend on Claude Code at fleet scale: five levers, the mcp.json + gateway config that wires them, the metrics that prove the cut held.


A 30-engineer team that runs Claude Code with eight MCP servers registered in ~/.claude/mcp.json is paying for the same tool descriptions and the same tool responses to be serialised into input context on every single turn. In the workload data we collected across 22 teams in Q1 2026, MCP-related input tokens were 41 to 58 percent of total Claude Code spend, and most of it was duplicate text the model had already seen on the previous turn.

This is a how-to. The goal is to take that 41 to 58 percent and cut it roughly in half, sustainably, at fleet scale, without breaking the tool-use UX that makes Claude Code useful. The shape is five levers, four implementation steps, runnable snippets, and a production checklist. The gateway picks at the end are a short orientation; the longer head-to-head is the sibling post on the best MCP gateway for Claude Code in 2026.


The problem in one paragraph

Claude Code packs project context aggressively. Every turn sends the system prompt (which contains every registered MCP tool description), the conversation history (which contains every previous tool response serialised back into text), and the new user message. On a session with eight MCP servers (filesystem, git, postgres, slack, linear, figma, notion, search) the system prompt alone runs 7,200 to 11,000 tokens before the conversation starts. By turn 20, the same filesystem.read payload appears in input tokens 18 to 19 times. The bill is real. The cut is recoverable. The trick is doing it without inventing a tool-use protocol Claude Code doesn’t understand.
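To make the accumulation concrete, here is a back-of-the-envelope sketch. The function names and the 1,500-token payload size are illustrative assumptions; only the 7,200 to 11,000 token system-prompt range and the 20-turn session come from the paragraph above.

```python
def cumulative_resend_tokens(system_prompt_tokens: int, turns: int) -> int:
    """Input tokens spent sending the same system prompt on every turn."""
    return system_prompt_tokens * turns


def payload_resend_tokens(payload_tokens: int, first_turn: int, last_turn: int) -> int:
    """Input tokens spent re-sending one tool response that entered history at first_turn.

    The payload is part of the input on every turn after the one that produced it.
    """
    resends = max(0, last_turn - first_turn)
    return payload_tokens * resends


low = cumulative_resend_tokens(7_200, turns=20)    # 144,000 tokens of tool descriptions
high = cumulative_resend_tokens(11_000, turns=20)  # 220,000 tokens of tool descriptions

# Hypothetical 1,500-token filesystem.read result produced on turn 1:
# it is re-sent on turns 2..20, i.e. 19 times.
one_payload = payload_resend_tokens(1_500, first_turn=1, last_turn=20)  # 28,500 tokens

print(low, high, one_payload)
```

Even before any tool is called, a 20-turn session under these assumptions spends six figures of input tokens on tool descriptions alone, which is why the first lever targets the system prompt.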


Prereqs

| Component | Version | Notes |
| --- | --- | --- |
| Claude Code CLI | 1.4.0+ | Required for mcp.json schema with per-agent server lists |
| MCP gateway | One of FAGI ACC, Maxim Bifrost, Portkey, Kong AI Gateway, agentgateway.dev | See picks brief below |
| Streamable HTTP | MCP spec 2025-11-25 | STDIO is supported but adds the OX RCE class risk; prefer HTTP |
| traceAI | 0.18+ | Apache 2.0; for per-tool span capture |
| ai-evaluation | 0.31+ | Apache 2.0; for compile-mode held-out scoring |
| Env vars | ANTHROPIC_BASE_URL, FI_API_KEY, FI_SECRET_KEY | Set in shell profile, not just the IDE |

The five levers below all assume the gateway sits on both the Anthropic API path and the MCP path. Half of the levers do nothing if MCP traffic bypasses the gateway and goes direct to the MCP servers.


The 5 levers

Each lever is one concrete behavioural change at the gateway. The order is the order they bite: lever one buys you the most for the least work, and lever five is the longest tail.

Lever 1: Selective registration per session

Claude Code reads ~/.claude/mcp.json at session start and pulls every tool description into the system prompt. The cheap fix is to stop registering tools an agent will never call.

The naive approach is one mcp.json per agent. It works but doesn’t scale past a handful of agents because every new task class needs a new file. The gateway version of selective registration runs a classifier on the first user message and returns only the tool descriptions the session plausibly needs. A documentation-edit task doesn’t need postgres; a database-migration task doesn’t need figma.

{
  "mcpServers": {
    "fagi-gateway": {
      "type": "streamable-http",
      "url": "https://gateway.futureagi.com/mcp/v1",
      "headers": {
        "Authorization": "Bearer ${FI_API_KEY}",
        "x-fi-agent": "claude-code",
        "x-fi-session": "${CLAUDE_SESSION_ID}",
        "x-fi-selectivity": "classifier:v2"
      }
    }
  }
}

The gateway federation endpoint advertises the union of all upstream MCP servers but only returns the descriptions the classifier selected. On a 38-tool fleet, the classifier averages 13 to 17 tools per session. Token saving: 8 to 15 percent of input spend.

Failure mode to watch: under-selection. If the classifier drops git from a session that turns into a git-blame investigation, the agent has no tool. Configure the gateway to default-include the four bedrock tools (filesystem, git, web, shell) and route the rest through the classifier.
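The default-include rule can be sketched in a few lines. This is a toy stand-in, not the gateway's classifier: the keyword table, the function name `select_servers`, and the `max_servers` cap are all hypothetical, and a real classifier would be a model rather than a lookup.

```python
BEDROCK_TOOLS = ("filesystem", "git", "web", "shell")

# Hypothetical keyword hints; illustration only.
KEYWORD_HINTS = {
    "postgres": ("migration", "schema", "sql", "database"),
    "figma": ("design", "mockup", "figma"),
    "linear": ("ticket", "issue", "linear"),
    "slack": ("slack", "thread", "channel"),
}


def select_servers(first_message: str, max_servers: int = 8) -> list[str]:
    """Return the bedrock servers plus any server whose hints appear in the message."""
    text = first_message.lower()
    selected = list(BEDROCK_TOOLS)  # always included, regardless of classifier output
    for server, hints in KEYWORD_HINTS.items():
        if any(hint in text for hint in hints):
            selected.append(server)
    # Bedrock servers come first, so the cap can never drop them.
    return selected[:max_servers]


print(select_servers("Write a database migration adding an index"))
# ['filesystem', 'git', 'web', 'shell', 'postgres']
print(select_servers("Fix the typo in the README"))
# ['filesystem', 'git', 'web', 'shell']
```

The design point is the ordering: the bedrock set is added before the classifier output, so no classifier mistake can leave the agent without git or the filesystem.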

Lever 2: Semantic caching of tool results

The same filesystem.read returning the same file in two sessions a few minutes apart re-serialises the same payload twice. A semantic cache keyed on tool name plus a content-aware hash of arguments returns the previous payload without round-tripping the MCP server. The cache lives at the gateway, not the client, so every session sees the same cache.

Per-tool TTL is non-negotiable. A 60-minute TTL on git.diff poisons the next turn the moment the user commits. A 30-second TTL on linear.list_issues burns spend. The defaults below are what worked on our 22-team dataset:

| Tool | Default TTL | Reason |
| --- | --- | --- |
| filesystem.read | 300s | Files change less often than the cache window |
| git.diff | 90s | Diffs change as the agent edits |
| git.log | 600s | Log entries are append-only |
| linear.list_issues | 1800s | Issue lists move slowly |
| slack.search | 300s | Mid-fidelity freshness |
| web.fetch | 60s | Conservative; pages change |

Hit rate stabilises around 35 to 55 percent within a week of a real team using the gateway. Token saving: 4 to 7 percent (the saving is smaller than the hit rate because tool responses are smaller on average than tool descriptions).
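A minimal sketch of the per-tool TTL idea follows. It is deliberately simplified: it does exact matching on canonicalised arguments rather than embedding similarity, so it illustrates the keying and expiry rules only. The class and function names are hypothetical; the TTL values are the defaults from the table above.

```python
import hashlib
import json
import time
from typing import Any, Callable, Optional

# Per-tool TTLs in seconds, taken from the defaults above.
TTL_SECONDS = {
    "filesystem.read": 300,
    "git.diff": 90,
    "git.log": 600,
    "linear.list_issues": 1800,
    "slack.search": 300,
    "web.fetch": 60,
}


def cache_key(tool: str, args: dict[str, Any]) -> str:
    """Tool name plus a hash of canonicalised arguments (key order is ignored)."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return f"{tool}:{hashlib.sha256(canonical.encode()).hexdigest()}"


class ToolResultCache:
    """Gateway-side cache shared by every session."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, tool: str, args: dict[str, Any]) -> Optional[Any]:
        entry = self._entries.get(cache_key(tool, args))
        if entry is None:
            return None
        expires_at, payload = entry
        if self._clock() >= expires_at:
            return None  # expired; the caller must hit the upstream MCP server
        return payload

    def put(self, tool: str, args: dict[str, Any], payload: Any) -> None:
        ttl = TTL_SECONDS.get(tool)
        if ttl is None:
            return  # tools without an explicit TTL are never cached
        self._entries[cache_key(tool, args)] = (self._clock() + ttl, payload)
```

Refusing to cache tools that have no explicit TTL is the conservative choice here: an unknown tool falls through to the upstream server instead of inheriting a default that may be wrong for it.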

Lever 3: Compiled tool execution

This is the largest single lever and the one that requires the most care. Instead of advertising N tool definitions and round-tripping each invocation, the gateway compiles MCP tools into a Python module exposed as a single high-level tool, execute_python(code). The model writes Python that calls the compiled functions; the gateway sandboxes execution.

The published benchmark: Maxim’s Bifrost Code Mode run, on 508 tools across 16 MCP servers, reports 92.8 percent input-token reduction at the system-prompt boundary. That’s a vendor-harness ceiling. On a real heterogeneous Claude Code fleet, expect 25 to 45 percent on the cleanly-compiled subset. The other levers stack on top.

The non-negotiable gate is held-out evaluation. Score compile-mode versus tool-mode on the same task and only promote candidates within 1 percent of tool-mode quality. Without this gate, compile-mode promotes tools that silently regress code-correctness: the cost goes down, and the quality goes down with it.

# fi.evals scoring loop for compile-mode promotion
# tool_mode_span, compile_mode_span, held_out_task, tool_id, and
# promote_tool_to_compile_mode are supplied by the surrounding gateway harness.
import os

from fi.evals import EvalClient

client = EvalClient(api_key=os.environ["FI_API_KEY"])
result = client.evaluate(
    eval_templates=["task_completion", "tool_call_accuracy", "code_correctness"],
    inputs={
        "tool_mode_trace": tool_mode_span,
        "compile_mode_trace": compile_mode_span,
        "ground_truth": held_out_task.expected
    }
)

# Promote only if compile-mode within 1% of tool-mode on all three rubrics
if result.compile_mode.scores >= result.tool_mode.scores * 0.99:
    promote_tool_to_compile_mode(tool_id)
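The "within 1 percent on all three rubrics" rule is easy to get subtly wrong if the scores are averaged before comparing. A per-rubric version, written as plain Python with no dependency on any eval library, might look like the sketch below. The function name `within_gate` and the dict-of-floats score shape are assumptions for illustration.

```python
def within_gate(
    tool_mode: dict[str, float],
    compile_mode: dict[str, float],
    min_ratio: float = 0.99,
) -> bool:
    """True only if compile-mode reaches min_ratio of tool-mode on every rubric."""
    if set(tool_mode) != set(compile_mode):
        raise ValueError("both modes must be scored on the same rubrics")
    return all(
        compile_mode[rubric] >= tool_mode[rubric] * min_ratio
        for rubric in tool_mode
    )


tool_scores = {"task_completion": 0.90, "tool_call_accuracy": 0.95, "code_correctness": 0.88}
good = {"task_completion": 0.90, "tool_call_accuracy": 0.95, "code_correctness": 0.88}
bad = {"task_completion": 0.93, "tool_call_accuracy": 0.96, "code_correctness": 0.80}

print(within_gate(tool_scores, good))  # True
print(within_gate(tool_scores, bad))   # False: code_correctness regressed
```

The second candidate improves on two rubrics and still fails, which is the point of gating per rubric: a gain in task completion cannot buy back a loss in code correctness.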

Lever 4: Tool-description compression

Typical MCP descriptions are verbose: multi-paragraph docstrings, examples, and argument-by-argument prose written for humans. The gateway rewrites them at registration into a structured form the model parses at roughly half the token cost. On a 38-tool fleet, 40 percent compression saves about 1,000 input tokens per turn after selectivity has already cut the count.

Store both forms: compressed for serving, original for debugging. When a tool call fails with a structural error, the original description is what an engineer needs to read.
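The store-both-forms contract can be sketched as a small record type. The `compress` function here is a placeholder that keeps only the first sentence; a real gateway would rewrite into a structured form. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolDescription:
    name: str
    original: str    # kept verbatim for debugging
    compressed: str  # served to the model


def compress(description: str) -> str:
    """Stand-in compressor: keep the first sentence and collapse whitespace."""
    first_sentence = description.strip().split(". ")[0].rstrip(".")
    return " ".join(first_sentence.split()) + "."


def register(name: str, description: str) -> ToolDescription:
    """Compress once at registration and keep the original alongside."""
    return ToolDescription(name=name, original=description, compressed=compress(description))


def describe(tool: ToolDescription, debug: bool = False) -> str:
    """Serve the compressed form by default and the original on the debug path."""
    return tool.original if debug else tool.compressed


tool = register(
    "filesystem.read",
    "Read a file from disk. Accepts a path argument. Example: read('/etc/hosts').",
)
print(describe(tool))              # Read a file from disk.
print(describe(tool, debug=True))  # the full original text
```

Making the record immutable means the original can never be overwritten by a later recompression, which is what keeps the debug path trustworthy.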

Lever 5: Per-session server-set pinning

Once a session has selected its tool set, pin it. Re-running the classifier on every turn, which some gateway defaults do, wastes tokens on the classifier prompt and risks the tool set churning mid-session. Pin the set at turn one, allow expansion only on an explicit user signal (“look at the Postgres schema”), and freeze it otherwise.

The pinning is gateway state, not client state. The Claude Code session ID is the cache key.
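As a rough sketch of pinning as gateway state: the first selection for a session wins, and the set only grows when the user message contains an explicit signal. The signal phrases and class name are hypothetical; a production gateway would likely detect expansion requests with something sturdier than substring matching.

```python
EXPANSION_SIGNALS = {
    # Hypothetical phrases that count as an explicit user request for a server.
    "postgres": ("postgres schema", "look at the postgres"),
    "figma": ("figma file", "open the design"),
}


class SessionPins:
    """Gateway-side state: session ID -> frozen set of server IDs."""

    def __init__(self) -> None:
        self._pins: dict[str, frozenset[str]] = {}

    def pin(self, session_id: str, servers: list[str]) -> frozenset[str]:
        # First call wins; later calls return the existing pin unchanged.
        return self._pins.setdefault(session_id, frozenset(servers))

    def maybe_expand(self, session_id: str, user_message: str) -> frozenset[str]:
        """Add servers only when the message carries an explicit signal."""
        current = self._pins[session_id]
        text = user_message.lower()
        extra = {
            server
            for server, phrases in EXPANSION_SIGNALS.items()
            if any(phrase in text for phrase in phrases)
        }
        self._pins[session_id] = current | extra
        return self._pins[session_id]


pins = SessionPins()
pins.pin("session-1", ["filesystem", "git"])
print(sorted(pins.maybe_expand("session-1", "fix the failing test")))
# ['filesystem', 'git']
print(sorted(pins.maybe_expand("session-1", "Look at the Postgres schema")))
# ['filesystem', 'git', 'postgres']
```

Note that the set only ever grows within a session. Shrinking it mid-session would risk removing a tool the conversation history already refers to.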


Implementation walkthrough: 4 steps

Step 1: Point Claude Code at the gateway for both Anthropic and MCP

The single biggest configuration mistake teams make is wiring ANTHROPIC_BASE_URL to the gateway and leaving MCP servers configured directly in mcp.json. Half of the levers above do nothing in that posture.

# Shell profile (.zshrc / .bashrc)
# FI_VIRTUAL_KEY, FI_API_KEY, and FI_SECRET_KEY must already be set before these lines run
export ANTHROPIC_BASE_URL="https://gateway.futureagi.com/anthropic/v1"
export ANTHROPIC_API_KEY="${FI_VIRTUAL_KEY}"  # per-developer virtual key
export FI_API_KEY="${FI_API_KEY}"             # re-export so child processes see it
export FI_SECRET_KEY="${FI_SECRET_KEY}"

~/.claude/mcp.json (federation-only, no direct MCP server URLs):

{
  "mcpServers": {
    "fagi-gateway": {
      "type": "streamable-http",
      "url": "https://gateway.futureagi.com/mcp/v1",
      "headers": {
        "Authorization": "Bearer ${FI_API_KEY}",
        "x-fi-agent": "claude-code",
        "x-fi-developer": "${USER}",
        "x-fi-repo": "${PWD}",
        "x-fi-session": "${CLAUDE_SESSION_ID}"
      }
    }
  }
}

The federation endpoint speaks the MCP spec and fans out to all upstream MCP servers configured at the gateway. The Claude Code client sees one server; the gateway sees many.

Step 2: Configure selective registration, caching, and description compression at the gateway

The gateway-side config is where the levers live. The snippet below is the shape; every gateway has its own DSL, but the fields are the same.

# fagi-gateway.yaml — Agent Command Center MCP routing rule
mcp:
  upstream_servers:
    - id: filesystem
      url: stdio://filesystem-mcp
      cache_ttl_seconds: 300
    - id: git
      url: stdio://git-mcp
      cache_ttl_seconds: 90
    - id: postgres
      url: https://postgres-mcp.internal/mcp/v1
      cache_ttl_seconds: 60
    - id: linear
      url: https://linear-mcp.internal/mcp/v1
      cache_ttl_seconds: 1800

  selective_registration:
    classifier: classifier:v2
    default_include: [filesystem, git, web, shell]
    pin_per_session: true
    max_tools_per_session: 22

  description_compression:
    enabled: true
    target_token_reduction: 0.40
    store_original: true

  compile_mode:
    enabled: true
    candidates: auto
    promotion_gate:
      held_out_eval: task_completion+code_correctness
      min_score_ratio: 0.99

  semantic_cache:
    enabled: true
    similarity_threshold: 0.92
    per_tool_ttl: true

  observability:
    traceai:
      endpoint: https://app.futureagi.com/v1/traces
      span_attributes: [mcp.tool.name, mcp.server.id, mcp.cache_hit, mcp.compile_mode]

Step 3: Wire traceAI so every MCP invocation is a child span of the Anthropic API call

Every MCP invocation needs to land in the same span tree as the Anthropic API call that triggered it. Without that, per-session attribution is broken and the loop that learns the levers has no signal.

# In the gateway worker process
from traceai import instrument
from traceai.anthropic import AnthropicInstrumentor
from traceai.mcp import MCPInstrumentor

instrument(
    service_name="fagi-mcp-gateway",
    instrumentors=[AnthropicInstrumentor(), MCPInstrumentor()],
    span_attributes_default={
        "fi.attributes.agent": "claude-code",
        "fi.attributes.gateway.version": "0.18.4",
    }
)

The MCP instrumentor adds mcp.tool.name, mcp.server.id, mcp.cache_hit, and mcp.compile_mode as span attributes, plus the per-call re-serialisation token cost. That last number is what makes the 50 percent cut visible; until you can see it, you can’t tune it.

Step 4: Verify with a known-good session

Run a single Claude Code session against the gateway, then verify three things: the gateway is in the path, the levers are firing, and the trace tree is correct.

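# Verify gateway is in the Anthropic path
# (JSON content type and anthropic-version are required by the Messages API)
curl -s -H "Authorization: Bearer $ANTHROPIC_API_KEY" \
     -H "content-type: application/json" \
     -H "anthropic-version: 2023-06-01" \
     "$ANTHROPIC_BASE_URL/messages" \
     -d '{"model":"claude-haiku-4-5","max_tokens":1,"messages":[{"role":"user","content":"hi"}]}' \
  | jq '.id'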

# Verify MCP federation endpoint responds
curl -s -H "Authorization: Bearer $FI_API_KEY" \
     "https://gateway.futureagi.com/mcp/v1/tools/list"

# Verify traces landed and levers fired
fi traces list --session "$CLAUDE_SESSION_ID" --since 1h \
  --filter "mcp.cache_hit=true OR mcp.compile_mode=true" \
  --columns "tool,cache_hit,compile_mode,input_tokens,output_tokens"

The third command is the one that tells you the levers actually fired. Because the filter only returns spans where a lever fired, no rows with mcp.cache_hit=true means the cache isn’t configured. No rows with mcp.compile_mode=true means compile-mode isn’t promoting candidates yet, which is correct on day one but should change within a week.


Measuring success

Four metrics. Compute them weekly. The loop tunes against them.

| Metric | How to compute | Target after 4 weeks |
| --- | --- | --- |
| MCP input-token share | Sum of input tokens for spans with mcp.* attributes / total input tokens | 18-25% (down from 41-58%) |
| Semantic cache hit rate | Spans with mcp.cache_hit=true / total MCP spans | 35-55% |
| Compile-mode coverage | Distinct tool IDs with mcp.compile_mode=true at least once / total registered tool IDs | 55-70% |
| Tool-call failure rate | Spans with mcp.error=true / total MCP spans | <4% |

The first metric is the headline. The other three are why the first metric moved. A drop in MCP input-token share without a corresponding rise in cache hit rate or compile-mode coverage usually means the classifier is over-pruning and the agent is failing silently; check the fourth metric.

A team that started at 47 percent MCP input-token share trended to 21 percent over four weeks in our reference fleet. Cache hit rate stabilised at 38 percent. Compile-mode coverage reached 63 percent. Tool-call failure rate dropped from 11 percent to 3 percent. Counter-intuitively, the levers improved correctness because compile-mode promotion is gated on a held-out eval that rejects regressions.
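For teams computing these from exported spans rather than a dashboard, the four ratios reduce to a few lines. The span shape below (a dict with `input_tokens` and an `attributes` map) is an assumption for illustration, as is using `mcp.tool.name` as the tool ID; adapt both to whatever your tracer actually exports.

```python
from typing import Any


def weekly_metrics(spans: list[dict[str, Any]], registered_tool_ids: set[str]) -> dict[str, float]:
    """Compute the four metrics from span dicts.

    Assumed span shape (hypothetical): {"input_tokens": int, "attributes": {...}},
    where MCP spans carry at least one attribute prefixed with "mcp.".
    """
    def is_mcp(span: dict[str, Any]) -> bool:
        return any(key.startswith("mcp.") for key in span["attributes"])

    def ratio(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else 0.0

    mcp_spans = [s for s in spans if is_mcp(s)]
    total_tokens = sum(s["input_tokens"] for s in spans)
    mcp_tokens = sum(s["input_tokens"] for s in mcp_spans)
    cache_hits = sum(1 for s in mcp_spans if s["attributes"].get("mcp.cache_hit"))
    errors = sum(1 for s in mcp_spans if s["attributes"].get("mcp.error"))
    # Distinct tools that ran in compile-mode at least once.
    compiled = {
        s["attributes"].get("mcp.tool.name")
        for s in mcp_spans
        if s["attributes"].get("mcp.compile_mode")
    }

    return {
        "mcp_input_token_share": ratio(mcp_tokens, total_tokens),
        "cache_hit_rate": ratio(cache_hits, len(mcp_spans)),
        "compile_mode_coverage": ratio(len(compiled & registered_tool_ids), len(registered_tool_ids)),
        "tool_call_failure_rate": ratio(errors, len(mcp_spans)),
    }
```

Coverage is computed against registered tool IDs, not observed ones, so a tool that is never called counts against coverage rather than silently dropping out of the denominator.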

Latency overhead: the gateway hop adds 18 to 24 ms p95 on Anthropic calls and 6 to 9 ms p95 on cached MCP calls (the cache return is cheaper than the upstream MCP server). When enabled inline, the Future AGI Protect model family adds ~67 ms p50 for text and ~109 ms p50 for image (arXiv 2510.13351) on top of the gateway hop. Protect is a model family rather than a plugin chain: FAGI’s own fine-tuned Gemma 3n adapters covering content moderation, bias detection, security/prompt-injection, and data privacy/PII, multi-modal across text, image, and audio. Error Feed (FAGI’s “Sentry for AI agents”) sits alongside as the zero-config error monitor: it auto-clusters related tool-call failures and cache-miss patterns into named issues (50 traces → 1 issue), auto-writes the root cause plus a quick fix plus a long-term recommendation, and tracks a rising/steady/falling trend per issue so emerging regressions surface like exceptions.


Production checklist

| Concern | What to check |
| --- | --- |
| Latency overhead | p95 of gateway hop; alert if it exceeds 35 ms on Anthropic calls or 15 ms on cached MCP calls |
| Failure isolation | If the gateway is unreachable, does Claude Code fall back to direct Anthropic? (Default yes via ANTHROPIC_BASE_URL failover; MCP fans through gateway only) |
| Cost attribution | Every request tagged with developer ID, session ID, repo URL, and task class |
| Audit log | Gateway decisions (selectivity, cache hits, compile-mode promotions) logged to a separate trace stream, retained 90 days |
| Cold-start | First-request latency after deploy <500 ms p95; warm the classifier before swing rollout |
| Rollback | One-line rollback by unsetting ANTHROPIC_BASE_URL and reverting mcp.json to direct server URLs; rehearse quarterly |
| Cache poisoning | Monitor for stale git.diff and filesystem.read returns; alert if mcp.error=true correlates with cache hits |
| Compile-mode regressions | fi.evals score delta versus tool-mode tracked per tool; auto-demote on >1 percent regression |
| Description integrity | Original descriptions stored alongside compressed; debug endpoint serves the original |
| Selective registration drift | Classifier drift detector compares this week’s tool distribution to last week’s; alert on >25 percent change |
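The checklist does not say how "change" is measured for the drift alert. One reasonable reading, sketched here as an assumption, is total variation distance between the two weeks' tool-selection distributions, with the 25 percent threshold applied to that distance. Function names are hypothetical.

```python
from collections import Counter


def selection_drift(last_week: Counter, this_week: Counter) -> float:
    """Total variation distance between two tool-selection distributions (0.0 to 1.0)."""
    last_total = sum(last_week.values())
    this_total = sum(this_week.values())
    if last_total == 0 or this_total == 0:
        raise ValueError("both weeks need at least one selection")
    tools = set(last_week) | set(this_week)
    # Counter returns 0 for tools missing from one of the weeks.
    return 0.5 * sum(
        abs(last_week[t] / last_total - this_week[t] / this_total) for t in tools
    )


def should_alert(last_week: Counter, this_week: Counter, threshold: float = 0.25) -> bool:
    """Alert when the distribution shifted by more than the threshold."""
    return selection_drift(last_week, this_week) > threshold


steady = Counter(git=50, postgres=50)
shifted = Counter(git=100)
print(selection_drift(steady, steady))   # 0.0
print(selection_drift(steady, shifted))  # 0.5
print(should_alert(steady, shifted))     # True
```

Normalising to distributions first means a busier week with the same tool mix does not trigger the alert; only a change in the mix does.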

Gateway picks brief

A short orientation; for the full head-to-head, see the best MCP gateway for Claude Code post.

Future AGI Agent Command Center. The only gateway here that wires the five levers into a self-improving loop. traceAI captures per-tool spans, fi.evals scores compile-mode promotion, agent-opt (ProTeGi, Bayesian, GEPA) tunes selectivity rules and per-tool TTLs. Cache hit rate and compile-mode coverage trend up, and MCP input-token share trends down, session over session rather than holding flat. Apache 2.0 building blocks plus a hosted Agent Command Center with the MCP Security scanner and Protect at ~67 ms text. Score: 5/5 levers wired into a loop.

Maxim Bifrost. Authors of the Code Mode pattern and the published 92.8 percent benchmark on 508 tools across 16 MCP servers. The most direct path to lever three. Native semantic cache, OTel-native span export, Apache 2.0 Go binary. No optimizer, so the levers stay where you set them. Pick this for a one-shot benchmark or when compile-mode is the only lever you care about.

Portkey. Polished hosted product. Selectivity through virtual servers, mature semantic cache, the prettiest dashboard in the cohort. No compile-mode and no optimizer, so the ceiling is roughly 18 to 28 percent on the levers it implements. Procurement signal: April 30, 2026 Palo Alto Networks acquisition merging the roadmap into Prisma AIRS.

Kong AI Gateway. The right pick if Kong is already the company’s API platform. Caching and selectivity through plugins, AI Proxy 3.6 supports MCP, OTel plugin for spans. No native compile-mode; wrapping MCP tools in Lua is meaningful engineering work. Plan two weeks of platform-team time.

agentgateway.dev. Linux Foundation-hosted, vendor-neutral OSS MCP gateway with selectivity and caching as policy-as-code. Right pick when foundation governance and acquisition-independence outrank dashboard polish. No compile-mode and no optimizer, so the headline 50 percent cut requires bolting compile-mode on separately.


Where this fits in the Future AGI loop

The how-to above implements the five levers as configuration. To make it a self-improving capability, wire fi.evals to score every Claude Code session, feed low-scoring sessions into agent-opt, and let the optimizer rewrite the classifier rules, per-tool TTLs, and compile-mode promotion gates. Re-deploy on a versioned policy with auto-rollback on regression. Net effect on the reference fleet: MCP input-token share trended from 47 percent to 21 percent in four weeks without anyone manually tuning the gateway. That’s the loop the other four picks in the brief don’t ship.



Sources

  • Anthropic Claude Code MCP documentation, claude.ai/docs/claude-code/mcp
  • Model Context Protocol specification 2025-11-25, modelcontextprotocol.io/specification/2025-11-25
  • Maxim Bifrost Code Mode benchmark (92.8% reduction across 508 tools on 16 MCP servers), getmaxim.ai/bifrost/resources/code-mode
  • OX Security advisory on MCP STDIO RCE class (April 15, 2026), ox.security/blog/mcp-supply-chain-advisory-rce-vulnerabilities-across-the-ai-ecosystem
  • Future AGI Agent Command Center docs, docs.futureagi.com/docs/command-center
  • Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (67ms text, 109ms image)
  • Future AGI traceAI repo, github.com/future-agi/traceAI
  • Future AGI ai-evaluation repo, github.com/future-agi/ai-evaluation
  • Future AGI agent-opt repo, github.com/future-agi/agent-opt
  • Portkey AI gateway, portkey.ai
  • Kong AI Gateway, konghq.com/products/kong-ai-gateway
  • agentgateway.dev, agentgateway.dev (Linux Foundation project page)

Frequently asked questions

Do I have to use Future AGI primitives or can I use generic OpenTelemetry?
The levers work with any OTel-compatible tracer; `traceAI` is Apache 2.0 and interoperates with the standard OTLP protocol. The loop (eval-cluster-optimize) needs `ai-evaluation` (Apache 2.0) and `agent-opt` (Apache 2.0). All three are usable independently. If you only want the one-shot 50 percent cut without the loop, any gateway in the brief plus a generic OTel sink works.

How do I roll this back if it breaks?
Unset `ANTHROPIC_BASE_URL` in the shell profile and revert `~/.claude/mcp.json` to direct MCP server URLs. The total rollback time is under 60 seconds. Rehearse it quarterly; gateway outages are not theoretical.

How much latency does this add?
18 to 24 ms p95 on Anthropic calls and 6 to 9 ms p95 on cached MCP calls. Protect inline guardrails add ~67 ms text and ~109 ms image when enabled ([arXiv 2510.13351](https://arxiv.org/abs/2510.13351)). On a session with 30 turns, the cumulative gateway-added latency is under one second.

Is compile-mode safe to run in production?
Only with a held-out evaluation gate. Compile-mode executes model-written Python in a sandbox at the gateway. Without `fi.evals` scoring compile-mode versus tool-mode on held-out tasks, the gateway promotes tools that silently regress code-correctness. The cost goes down, and the quality goes down with it. Configure the promotion gate before enabling compile-mode at all.

What happens if the classifier drops a tool the session actually needs?
The agent has no tool. The fix is twofold: default-include the four bedrock tools (`filesystem`, `git`, `web`, `shell`) regardless of classifier output, and allow mid-session expansion on an explicit user signal. The pinning rule freezes the set otherwise. Configure the classifier with a slightly looser threshold than feels comfortable for the first two weeks; the loop will tighten it.