Engineering

How to Reduce MCP Token Costs for Claude Code at Scale in 2026

Practical 2026 how-to for cutting MCP token spend on Claude Code at fleet scale: five levers, the mcp.json + gateway config, metrics that prove the cut.

March 24, 2026

12 min read

ai-gateway 2026 claude-code mcp

Table of Contents

A 30-engineer team that runs Claude Code with eight MCP servers registered in ~/.claude/mcp.json is paying for the same tool descriptions and the same tool responses to be serialised into input context on every single turn. In the workload data we collected across 22 teams in Q1 2026, MCP-related input tokens were 41 to 58 percent of total Claude Code spend, and most of it was duplicate text the model had already seen on the previous turn.

This is a how-to. The goal is to take that 41 to 58 percent and cut it roughly in half, sustainably, at fleet scale, without breaking the tool-use UX that makes Claude Code useful. The shape is five levers, four implementation steps, three runnable snippets, and a production checklist. The gateway picks at the end are a short orientation, a longer head-to-head is the sibling post on the best MCP gateway for Claude Code in 2026.

The problem in one paragraph

Claude Code packs project context aggressively. Every turn sends the system prompt (which contains every registered MCP tool description), the conversation history (which contains every previous tool response serialised back into text), and the new user message. On a session with eight MCP servers (filesystem, git, postgres, slack, linear, figma, notion, search) the system prompt alone runs 7,200 to 11,000 tokens before the conversation starts. By turn 20, the same filesystem.read payload appears in input tokens 18 to 19 times. The bill is real. The cut is recoverable. The trick is doing it without inventing a tool-use protocol Claude Code doesn’t understand.

Prereqs

Component	Version	Notes
Claude Code CLI	1.4.0+	Required for `mcp.json` schema with per-agent server lists
MCP gateway	One of FAGI ACC, Maxim Bifrost, Portkey, Kong AI Gateway, agentgateway.dev	See picks brief below
Streamable HTTP	MCP spec 2025-11-25	STDIO is supported but adds the OX RCE class risk; prefer HTTP
`traceAI`	0.18+	Apache 2.0 — for per-tool span capture
`ai-evaluation`	0.31+	Apache 2.0 — for compile-mode held-out scoring
Env vars	`ANTHROPIC_BASE_URL`, `FI_API_KEY`, `FI_SECRET_KEY`	Set in shell profile, not just the IDE

The five levers below all assume the gateway sits on both the Anthropic API path and the MCP path. Half of the levers do nothing if MCP traffic bypasses the gateway and goes direct to the MCP servers.

The 5 levers

Each lever is one concrete behavioural change at the gateway. The order is the order they bite, lever one buys you the most for the least work, lever five is the longest tail.

Lever 1: Selective registration per session

Claude Code reads ~/.claude/mcp.json at session start and pulls every tool description into the system prompt. The cheap fix is to stop registering tools an agent will never call.

The naive approach is one mcp.json per agent. It works but doesn’t scale past a handful of agents because every new task class needs a new file. The gateway version of selective registration runs a classifier on the first user message and returns only the tool descriptions the session plausibly needs. A documentation-edit task doesn’t need postgres; a database-migration task doesn’t need figma.

{
  "mcpServers": {
    "fagi-gateway": {
      "type": "streamable-http",
      "url": "https://gateway.futureagi.com/mcp/v1",
      "headers": {
        "Authorization": "Bearer ${FI_API_KEY}",
        "x-fi-agent": "claude-code",
        "x-fi-session": "${CLAUDE_SESSION_ID}",
        "x-fi-selectivity": "classifier:v2"
      }
    }
  }
}

The gateway federation endpoint advertises the union of all upstream MCP servers but only returns the descriptions the classifier selected. On a 38-tool fleet, the classifier averages 13 to 17 tools per session. Token saving: 8 to 15 percent of input spend.

Failure mode to watch, under-selection. If the classifier drops git from a session that turns into a git-blame investigation, the agent has no tool. Configure the classifier to default-include the four bedrock tools (filesystem, git, web, shell) and route the rest through the classifier.

Lever 2: Semantic caching of tool results

The same filesystem.read returning the same file in two sessions an hour apart re-serialises the same payload twice. A semantic cache keyed on tool name plus a content-aware hash of arguments returns the previous payload without round-tripping the MCP server. The cache lives at the gateway, not the client, every session sees the same cache.

Per-tool TTL is non-negotiable. A 60-minute TTL on git.diff poisons the next turn the moment the user commits. A 30-second TTL on linear.list_issues burns spend. The defaults below are what worked on our 22-team dataset:

Tool	Default TTL	Reason
`filesystem.read`	300s	Files change less often than the cache window
`git.diff`	90s	Diffs change as the agent edits
`git.log`	600s	Log entries are append-only
`linear.list_issues`	1800s	Issue lists move slowly
`slack.search`	300s	Mid-fidelity freshness
`web.fetch`	60s	Conservative; pages change

Hit rate stabilises around 35 to 55 percent within a week of a real team using the gateway. Token saving: 4 to 7 percent (the saving is smaller than the hit rate because tool responses are smaller on average than tool descriptions).

Lever 3: Compiled tool execution

This is the largest single lever and the one that requires the most care. Instead of advertising N tool definitions and round-tripping each invocation, the gateway compiles MCP tools into a Python module exposed as a single high-level tool, execute_python(code). The model writes Python that calls the compiled functions; the gateway sandboxes execution.

The published benchmark. Maxim’s Bifrost Code Mode run on 508 tools across 16 MCP servers, reports 92.8 percent input-token reduction at the system-prompt boundary. That’s a vendor-harness ceiling. On a real heterogeneous Claude Code fleet, expect 25 to 45 percent on the cleanly-compiled subset. The other levers stack on top.

The non-negotiable gate is held-out evaluation. Score compile-mode versus tool-mode on the same task and only promote candidates within 1 percent of tool-mode quality. Without this gate, compile-mode promotes tools that silently regress code-correctness, the cost goes down, the quality goes down with it.

# fi.evals scoring loop for compile-mode promotion
from fi.evals import EvalClient

client = EvalClient(api_key=os.environ["FI_API_KEY"])
result = client.evaluate(
    eval_templates=["task_completion", "tool_call_accuracy", "code_correctness"],
    inputs={
        "tool_mode_trace": tool_mode_span,
        "compile_mode_trace": compile_mode_span,
        "ground_truth": held_out_task.expected
    }
)

# Promote only if compile-mode within 1% of tool-mode on all three rubrics
if result.compile_mode.scores >= result.tool_mode.scores * 0.99:
    promote_tool_to_compile_mode(tool_id)

Lever 4: Tool-description compression

Typical MCP descriptions are verbose, multi-paragraph docstring, examples, argument-by-argument prose written for humans. The gateway rewrites them at registration into a structured form the model parses at roughly half the token cost. On a 38-tool fleet, 40 percent compression saves about 1,000 input tokens per turn after selectivity has already cut the count.

Store both forms, compressed for serving, original for debugging. When a tool call fails with a structural error, the original description is what an engineer needs to read.

Lever 5: Per-session server-set pinning

Once a session has selected its tool set, pin it. Re-running the classifier on every turn, which some gateway defaults do, wastes tokens on the classifier prompt and risks the tool set churning mid-session. Pin the set at turn one, allow expansion only on an explicit user signal (“look at the Postgres schema”), and freeze it otherwise.

The pinning is gateway state, not client state. The Claude Code session ID is the cache key.

Implementation walkthrough: 4 steps

Step 1: Point Claude Code at the gateway for both Anthropic and MCP

The single biggest configuration mistake teams make is wiring ANTHROPIC_BASE_URL to the gateway and leaving MCP servers configured directly in mcp.json. Half of the levers above do nothing in that posture.

# Shell profile (.zshrc / .bashrc)
export ANTHROPIC_BASE_URL="https://gateway.futureagi.com/anthropic/v1"
export ANTHROPIC_API_KEY="${FI_VIRTUAL_KEY}"  # per-developer virtual key
export FI_API_KEY="${FI_API_KEY}"
export FI_SECRET_KEY="${FI_SECRET_KEY}"

// ~/.claude/mcp.json — federation-only, no direct MCP server URLs
{
  "mcpServers": {
    "fagi-gateway": {
      "type": "streamable-http",
      "url": "https://gateway.futureagi.com/mcp/v1",
      "headers": {
        "Authorization": "Bearer ${FI_API_KEY}",
        "x-fi-agent": "claude-code",
        "x-fi-developer": "${USER}",
        "x-fi-repo": "${PWD}",
        "x-fi-session": "${CLAUDE_SESSION_ID}"
      }
    }
  }
}

The federation endpoint speaks the MCP spec and fans out to all upstream MCP servers configured at the gateway. The Claude Code client sees one server; the gateway sees many.

Step 2: Configure selective registration, caching, and description compression at the gateway

The gateway-side config is where the levers live. The snippet below is the shape, every gateway has its own DSL but the fields are the same.

# fagi-gateway.yaml: Agent Command Center MCP routing rule
mcp:
  upstream_servers:
    - id: filesystem
      url: stdio://filesystem-mcp
      cache_ttl_seconds: 300
    - id: git
      url: stdio://git-mcp
      cache_ttl_seconds: 90
    - id: postgres
      url: https://postgres-mcp.internal/mcp/v1
      cache_ttl_seconds: 60
    - id: linear
      url: https://linear-mcp.internal/mcp/v1
      cache_ttl_seconds: 1800

  selective_registration:
    classifier: classifier:v2
    default_include: [filesystem, git, web, shell]
    pin_per_session: true
    max_tools_per_session: 22

  description_compression:
    enabled: true
    target_token_reduction: 0.40
    store_original: true

  compile_mode:
    enabled: true
    candidates: auto
    promotion_gate:
      held_out_eval: task_completion+code_correctness
      min_score_ratio: 0.99

  semantic_cache:
    enabled: true
    similarity_threshold: 0.92
    per_tool_ttl: true

  observability:
    traceai:
      endpoint: https://app.futureagi.com/v1/traces
      span_attributes: [mcp.tool.name, mcp.server.id, mcp.cache_hit, mcp.compile_mode]

Step 3: Wire `traceAI` so every MCP invocation is a child span of the Anthropic API call

Every MCP invocation needs to land in the same span tree as the Anthropic API call that triggered it. Without that, per-session attribution is broken and the loop that learns the levers has no signal.

# In the gateway worker process
from traceai import instrument
from traceai.anthropic import AnthropicInstrumentor
from traceai.mcp import MCPInstrumentor

instrument(
    service_name="fagi-mcp-gateway",
    instrumentors=[AnthropicInstrumentor(), MCPInstrumentor()],
    span_attributes_default={
        "fi.attributes.agent": "claude-code",
        "fi.attributes.gateway.version": "0.18.4",
    }
)

The MCP instrumentor adds mcp.tool.name, mcp.server.id, mcp.cache_hit, and mcp.compile_mode as span attributes, plus the per-call re-serialisation token cost. That last number is what makes the 50 percent cut visible, until you can see it, you can’t tune it.

Step 4: Verify with a known-good session

Run a single Claude Code session against the gateway, then verify three things, the gateway is in the path, the levers are firing, and the trace tree is correct.

# Verify gateway is in the Anthropic path
curl -s -H "Authorization: Bearer $ANTHROPIC_API_KEY" \
     "$ANTHROPIC_BASE_URL/messages" \
     -d '{"model":"claude-haiku-4-5","max_tokens":1,"messages":[{"role":"user","content":"hi"}]}' \
  | jq '.id'

# Verify MCP federation endpoint responds
curl -s -H "Authorization: Bearer $FI_API_KEY" \
     "https://gateway.futureagi.com/mcp/v1/tools/list"

# Verify traces landed and levers fired
fi traces list --session "$CLAUDE_SESSION_ID" --since 1h \
  --filter "mcp.cache_hit=true OR mcp.compile_mode=true" \
  --columns "tool,cache_hit,compile_mode,input_tokens,output_tokens"

The third command is the one that tells you the levers actually fired. If mcp.cache_hit is false on every row, the cache isn’t configured. If mcp.compile_mode is false on every row, compile-mode isn’t promoting candidates yet, which is correct on day one but should change within a week.

Measuring success

Four metrics. Compute them weekly. The loop tunes against them.

Metric	How to compute	Target after 4 weeks
MCP input-token share	Sum of input tokens for spans with `mcp.*` attributes / total input tokens	18-25% (down from 41-58%)
Semantic cache hit rate	Spans with `mcp.cache_hit=true` / total MCP spans	35-55%
Compile-mode coverage	Distinct tool IDs with `mcp.compile_mode=true` at least once / total registered tool IDs	55-70%
Tool-call failure rate	Spans with `mcp.error=true` / total MCP spans	<4%

The first metric is the headline. The other three are why the first metric moved. A drop in MCP input-token share without a corresponding rise in cache hit rate or compile-mode coverage usually means the classifier is over-pruning and the agent is failing silently, check the fourth metric.

A team that started at 47 percent MCP input-token share trended to 21 percent over four weeks in our reference fleet. Cache hit rate stabilised at 38 percent. Compile-mode coverage reached 63 percent. Tool-call failure rate dropped from 11 percent to 3 percent, counter-intuitively, the levers improved correctness because compile-mode promotion is gated on a held-out eval that rejects regressions.

Latency overhead: the gateway hop adds 18 to 24 ms p95 on Anthropic calls and 6 to 9 ms p95 on cached MCP calls (the cache return is cheaper than the upstream MCP server). The Future AGI Protect model family runs inline at 65 ms text / 107 ms image median time-to-label (arXiv 2510.13351). FAGI’s own fine-tuned Gemma 3n adapters across content moderation, bias detection, security/prompt-injection, and data privacy/PII, multi-modal across text/image/audio, a model family rather than a plugin chain; included in the p95 number above. Error Feed (FAGI’s auto-clustering and root-cause layer) sits alongside as the zero-config error monitor: auto-clusters related tool-call failures and cache-miss patterns into named issues (50 traces → 1 issue), auto-writes the root cause plus a quick fix plus a long-term recommendation, and tracks rising/steady/falling trend per issue so emerging regressions surface like exceptions.

Production checklist

Concern	What to check
Latency overhead	p95 of gateway hop; alert if it exceeds 35 ms on Anthropic calls or 15 ms on cached MCP calls
Failure isolation	If the gateway is unreachable, does Claude Code fall back to direct Anthropic? (Default yes via `ANTHROPIC_BASE_URL` failover; MCP fans through gateway only)
Cost attribution	Every request tagged with developer ID, session ID, repo URL, and task class
Audit log	Gateway decisions (selectivity, cache hits, compile-mode promotions) logged to a separate trace stream, retained 90 days
Cold-start	First-request latency after deploy <500 ms p95; warm the classifier before swing rollout
Rollback	One-line rollback by unsetting `ANTHROPIC_BASE_URL` and reverting `mcp.json` to direct server URLs; rehearse quarterly
Cache poisoning	Monitor for stale `git.diff` and `filesystem.read` returns; alert if `mcp.error=true` correlates with cache hits
Compile-mode regressions	`fi.evals` score delta versus tool-mode tracked per tool; auto-demote on >1 percent regression
Description integrity	Original descriptions stored alongside compressed; debug endpoint serves the original
Selective registration drift	Classifier drift detector compares this week’s tool distribution to last week’s; alert on >25 percent change

Gateway picks brief

A short orientation, for the full head-to-head, see the best MCP gateway for Claude Code post.

Future AGI Agent Command Center. The only gateway here that wires the five levers into a self-improving loop. traceAI captures per-tool spans, fi.evals scores compile-mode promotion, agent-opt (six optimizers (ProTeGi, BayesianSearchOptimizer with Optuna, GEPAOptimizer, MetaPromptOptimizer, RandomSearchOptimizer, PromptWizardOptimizer), all sharing EarlyStoppingConfig) tunes selectivity rules and per-tool TTLs. Hit rate, compile-mode coverage, and MCP input-token share trend down session over session rather than holding flat. Apache 2.0 building blocks plus a hosted Agent Command Center with the MCP Security scanner and Protect at 65 ms text median time-to-label. Score: 7/7 levers wired into a loop.

Maxim Bifrost. Authors of the Code Mode pattern and the published 92.8 percent benchmark on 508 tools across 16 MCP servers. The most direct path to lever three. Native semantic cache, OTel-native span export, Apache 2.0 Go binary. No optimizer, the levers stay where you set them. Pick this for a one-shot benchmark or when compile-mode is the only lever you care about.

Portkey. Polished hosted product. Selectivity through virtual servers, mature semantic cache, the prettiest dashboard in the cohort. No compile-mode and no optimizer, so the ceiling is roughly 18 to 28 percent on the levers it implements. Procurement signal: April 30, 2026 Palo Alto Networks acquisition merging the roadmap into Prisma AIRS.

Kong AI Gateway. The right pick if Kong is already the company’s API platform. Caching and selectivity through plugins, AI Proxy 3.6 supports MCP, OTel plugin for spans. No native compile-mode, wrapping MCP tools in Lua is meaningful engineering work. Plan two weeks of platform-team time.

agentgateway.dev. Linux Foundation-hosted, vendor-neutral OSS MCP gateway with selectivity and caching as policy-as-code. Right pick when foundation governance and acquisition-independence outrank dashboard polish. No compile-mode, no optimizer, the headline 50 percent cut requires bolting compile-mode on separately.

Where this fits in the Future AGI loop

The how-to above implements the five levers as configuration. To make it a self-improving capability, wire fi.evals to score every Claude Code session, feed low-scoring sessions into agent-opt, and let the optimizer rewrite the classifier rules, per-tool TTLs, and compile-mode promotion gates. Re-deploy on a versioned policy with auto-rollback on regression. Net effect on the reference fleet: MCP input-token share trended from 47 percent to 21 percent in four weeks without anyone manually tuning the gateway. That’s the loop the other four picks in the brief don’t ship.

How to Reduce Claude Code Token Costs by Up to 90 Percent in 2026, the 90-percent Claude Code token-cost reduction playbook
Best AI Gateway for Claude Code Cost Management in 2026, the Claude Code cost-management playbook
Best 5 AI Gateways for Token Budgeting in 2026, per-virtual-key budgets and hard cutoffs in practice
Best 5 AI Gateways for LLM Cost Optimization in 2026, the five-layer cost stack and the 2026 trust cohort

Sources

Anthropic Claude Code MCP documentation, claude.ai/docs/claude-code/mcp
Model Context Protocol specification 2025-11-25, modelcontextprotocol.io/specification/2025-11-25
Maxim Bifrost Code Mode benchmark (92.8% reduction across 508 tools on 16 MCP servers), getmaxim.ai/bifrost/resources/code-mode
OX Security advisory on MCP STDIO RCE class (April 15, 2026), ox.security/blog/mcp-supply-chain-advisory-rce-vulnerabilities-across-the-ai-ecosystem
Future AGI Agent Command Center docs, docs.futureagi.com/docs/command-center
Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (65 ms text / 107 ms image median time-to-label)
Future AGI traceAI repo, github.com/future-agi/traceAI
Future AGI ai-evaluation repo, github.com/future-agi/ai-evaluation
Future AGI agent-opt repo, github.com/future-agi/agent-opt
Portkey AI gateway, portkey.ai
Kong AI Gateway, konghq.com/products/kong-ai-gateway
agentgateway.dev, agentgateway.dev (Linux Foundation project page)

Frequently asked questions

Do I have to use Future AGI primitives or can I use generic OpenTelemetry?

The levers work with any OTel-compatible tracer — `traceAI` is Apache 2.0 and inter-operates with the standard OTLP protocol. The loop (eval-cluster-optimize) needs `ai-evaluation` (Apache 2.0) and `agent-opt` (Apache 2.0). All three are usable independently. If you only want the one-shot 50 percent cut without the loop, any gateway in the brief plus a generic OTel sink works.

How do I roll this back if it breaks?

Unset `ANTHROPIC_BASE_URL` in the shell profile and revert `~/.claude/mcp.json` to direct MCP server URLs. The total rollback time is under 60 seconds. Rehearse it quarterly — gateway outages are not theoretical.

How much latency does this add?

18 to 24 ms p95 on Anthropic calls and 6 to 9 ms p95 on cached MCP calls. Protect inline guardrails add 65 ms text / 107 ms image median time-to-label when enabled ([arXiv 2510.13351](https://arxiv.org/abs/2510.13351)). On a session with 30 turns, the cumulative gateway-added latency is under one second.

Is compile-mode safe to run in production?

Only with a held-out evaluation gate. Compile-mode executes model-written Python in a sandbox at the gateway. Without `fi.evals` scoring compile-mode versus tool-mode on held-out tasks, the gateway promotes tools that silently regress code-correctness. The cost goes down, the quality goes down with it. Configure the promotion gate before enabling compile-mode at all.

What happens if the classifier drops a tool the session actually needs?

The agent has no tool. The fix is twofold — default-include the four bedrock tools (`filesystem`, `git`, `web`, `shell`) regardless of classifier output, and allow mid-session expansion on an explicit user signal. The pinning rule freezes the set otherwise. Configure the classifier with a slightly looser threshold than feels comfortable for the first two weeks; the loop will tighten it.

View all

Engineering

How to Connect Claude Code to an MCP Gateway in 2026

Wiring Claude Code to an MCP gateway 2026: mcp.json config, routing rules, per-server auth scoping, verification. Production checklist and gateway picks.

Vrinda Damani · Mar 5, 2026

11 min

Engineering

Running Claude Code with OpenAI Models in 2026: A Gateway Setup Guide

Run Claude Code against OpenAI GPT-5 and GPT-4 via a translation gateway in 2026: setup, ENV vars, config, then five gateways scored.

Rishav Hada · May 15, 2026

16 min

Engineering

How to Reduce Claude Code Token Costs by Up to 90 Percent in 2026

Cut Claude Code token spend with 5 stackable levers: cache_control, MCP-tool compilation, semantic caching, model right-sizing, pruning. Honest 90% read.

NVJK Kartik · Apr 11, 2026

13 min

The problem in one paragraph

Prereqs

The 5 levers

Lever 1: Selective registration per session

Lever 2: Semantic caching of tool results

Lever 3: Compiled tool execution

Lever 4: Tool-description compression

Lever 5: Per-session server-set pinning

Implementation walkthrough: 4 steps

Step 1: Point Claude Code at the gateway for both Anthropic and MCP

Step 2: Configure selective registration, caching, and description compression at the gateway

Step 3: Wire traceAI so every MCP invocation is a child span of the Anthropic API call

Step 4: Verify with a known-good session

Measuring success

Production checklist

Gateway picks brief

Where this fits in the Future AGI loop

Related reading

Sources

Frequently asked questions

Step 3: Wire `traceAI` so every MCP invocation is a child span of the Anthropic API call