Guides

How an MCP Gateway Cuts Token Costs in Claude Code and Codex CLI in 2026

A 2026 architecture essay on why MCP blows up coding-agent token bills in Claude Code and Codex CLI, and five mechanisms that compress cost.

April 13, 2026

14 min read

ai-gateway 2026 claude-code codex-cli mcp

Table of Contents

A 30-engineer team running both Claude Code and Codex CLI. Claude Code for long agentic edits, Codex CLI for refactors and CI scripting, pays for the same MCP tool descriptions to be tokenized across two completely different agent stacks. The Anthropic system prompt re-serializes the union from ~/.claude/mcp.json; Codex CLI does the same from ~/.codex/mcp.json; the agents share the underlying MCP servers but not the tokenization. Pull the trace across both and the docstring of filesystem.read, all 180 tokens, appears in input context tens of millions of times a week.

This essay is about why that happens, why it isn’t going away on its own, and the five mechanisms by which an MCP gateway in front of both agents compresses the bill. Not a how-to (the configuration walkthrough is here) and not a listicle (the head-to-head is here). It’s the architectural argument for why an MCP gateway is the load-bearing pattern of 2026 coding-agent infrastructure.

Thesis: MCP, as shipped in the 2025-11-25 specification and adopted by the two dominant CLIs, has token-amplification baked into its data plane. Not a bug, the consequence of a protocol that promotes tools as schema-rich text and round-trips every response as serialized JSON. A gateway is where to fix it because the gateway is the only point where the protocol can be rewritten without changing the agents.

The mechanics of the MCP token blow-up

To argue about how to cut MCP token costs you have to be precise about where they come from. Four surfaces.

Surface one: the system-prompt advertisement. When Claude Code or Codex CLI boots, it reads mcp.json, opens a transport, and calls tools/list. The union of tool descriptions is concatenated into the system prompt, on a moderately registered fleet (filesystem, git, postgres, slack, linear, figma, notion, search) that’s 7,200 to 11,000 tokens. Every turn pays for it again; APIs are stateless. Thirty turns in, the 9,000-token advertisement has been billed 30 times.

Surface two: the conversation re-serialization. Each MCP invocation produces a response. The agent serializes it into a content block (Anthropic) or function-call result (OpenAI), appends to the conversation, resends on the next turn. By turn 20 of a bug-fix loop, the same filesystem.read payload appears in input tokens 18 to 19 times.

Surface three: the schema overhead. MCP tools are advertised with full JSON Schema. A postgres.query with five optional parameters carries 400 to 700 tokens of schema the model rarely needs at sampling time. Schema rides along on every turn.

Surface four: the cross-agent duplication. When the same MCP server is registered in both Claude Code and Codex CLI (the default state for any team using both) the description is tokenized twice. Different tokenizers (Anthropic vs OpenAI tiktoken) produce slightly different counts, but the work is duplicated: one docstring, two system prompts, two billing meters. A team running both pays roughly 1.8x the MCP token bill of a single-agent team.

Add the four surfaces on a representative workload (12 engineers, both agents, eight MCP servers, 22 sessions a day, 30 turns each) and MCP input tokens land between 41 and 58 percent of total coding-agent spend. Single-agent teams sit closer to 41 percent; dual-agent teams closer to 58. Our 22-team Q1 2026 dataset median was 48 percent. That’s the bill the gateway is fighting.

Why the protocol does not fix this on its own

A reasonable counter-argument is that MCP will evolve. Technically correct, operationally irrelevant. The system-prompt advertisement is structural, models need to see what tools exist to call them, and compressing it requires either a vocabulary the model already knows (no frontier model has one) or a protocol shape that would break every agent in production. A protocol change that requires CLI changes is a multi-quarter rollout even if it ships clean.

The protocol is now load-bearing in production. The April 15, 2026 OX Security advisory on STDIO RCE pushed the industry toward HTTP transports and centralized execution at gateways. Before April 2026 an MCP gateway was a token-economics decision; after, it’s a security decision that also addresses token economics. The cost case got stronger because the alternative got weaker.

The five mechanisms

The 50 percent reduction figure is the sum of five distinct compression mechanisms a gateway can implement and the agent or protocol can’t. Numbers below are from the reference fleet, 22 teams running both agents against six to eight MCP servers in Q1 2026.

Mechanism one: tool selectivity

Stop sending tools the agent won’t call. A gateway does this two ways: static per-agent allowlists federated through a single endpoint, or dynamic per-session classification, the gateway reads the first user message and registers only the subset the task plausibly needs. Future AGI’s Agent Command Center implements both; Maxim Bifrost implements static well; agentgateway.dev expresses it as policy-as-code; Portkey’s “virtual server” is a static-allowlist surface with good UX.

On the reference fleet (38 tools registered, 14 used per session) the static allowlist alone recovers 8 to 12 percent of input tokens; the classifier adds 2 to 4 percent more. Compounded: 10 to 16 percent. The defensible posture: default-include the four bedrock tools (filesystem, git, web, shell) regardless of classifier output, expand on explicit user signal. Lowest engineering cost, lowest risk. Flip it first.

Mechanism two: semantic caching

A gateway-side cache keyed on tool name plus content-aware hash of arguments returns the previous payload without round-tripping the MCP server. The cache lives at the gateway, so every session. Claude Code, Codex CLI, any future MCP client, sees the same cache. The cross-agent payoff: a linear.list_issues call from Claude Code at 10:00 AM warms the cache that a Codex CLI session at 10:15 AM reads.

Per-tool TTL is non-negotiable. A 60-minute TTL on git.diff poisons the next turn the moment the developer commits. Reference defaults: 300s for filesystem.read, 90s for git.diff, 600s for git.log, 1800s for linear.list_issues, 60s for web.fetch.

Hit rate stabilizes at 35 to 55 percent within a week. Token saving from caching alone is 4 to 7 percent, smaller than the hit rate because responses are smaller than descriptions. Caching pairs with compilation: a cached response feeds the compiled-execution path with hot data.

Mechanism three: compiled execution

The largest single lever and the one that distinguishes a 2026 MCP gateway from a 2025 LLM proxy.

The pattern, authored by Maxim Bifrost and named “Code Mode,” replaces advertise-and-roundtrip with compilation. Instead of exposing N tool definitions, the gateway compiles MCP tools into a Python module exposed as one high-level tool, execute_python(code). The model writes Python; the gateway runs it in a sandbox; the result returns as a single content block. The advertisement collapses from N descriptions to one; the conversation collapses from N round-trips to one.

Maxim’s benchmark is 92.8 percent input-token reduction on 508 tools across 16 MCP servers, real, and a vendor-harness ceiling. On heterogeneous fleets, expect 25 to 45 percent on the cleanly-compiled subset; streaming, side-effecting, and human-in-the-loop tools resist compilation.

The non-negotiable gate is held-out evaluation. Promotion rule: a tool moves to compile-mode only when its compile-mode score is within 1 percent of tool-mode on three rubrics (task completion, tool-call accuracy, code correctness). Without the gate, compile-mode silently regresses quality.

Vendor philosophies diverge. Bifrost ships compile-mode as a raw capability; the team owns the eval gate. Future AGI’s Agent Command Center ships it with the eval gate built into the promotion path via fi.evals. Bifrost for teams that want to own the loop, Future AGI for teams that want it owned for them.

On the reference fleet, compile-mode coverage reaches 55 to 70 percent of registered tools within four weeks. The uncovered third are tools that shouldn’t compile. STDIO-bound legacy servers, unbounded streaming, interactive prompts.

Mechanism four: description compression

Most MCP descriptions are written for humans. Maintainers wrote multi-paragraph docstrings because they wanted humans to read them; the same text gets tokenized into the system prompt where humans never read it. The gateway rewrites at registration into a structured form the model parses at roughly half the token cost, short imperative sentences, type-annotated arguments, no examples. Mechanical and reversible, store both forms, original for debugging, compressed for serving. On the 38-tool reference fleet, 40 percent compression saves about 1,000 input tokens per turn after selectivity.

Aggressive compression can hurt model behavior; if the model can’t disambiguate git.diff from git.show you get tool-call confusion. Same gate as compile-mode, score compressed against original on held-out tasks, promote only when scores hold. Saving alone: 4 to 6 percent.

Mechanism five: server-set pinning per session

The smallest mechanism but the one that prevents the other four from quietly degrading.

Once a session has selected its tool set and cached its hot responses, pin the configuration. Re-running the classifier on every turn wastes tokens and risks the tool set churning mid-session. Worse, a churning set invalidates the prefix cache that Anthropic and OpenAI both use to discount repeated input tokens (40 to 90 percent off the cached prefix depending on provider). A gateway that destroys cache hits by re-classifying is paying full price for tokens the provider was ready to discount.

The rule: classifier runs once at session start, tool set is locked, expansion is allowed only on explicit user signal. The pin lives at the gateway, keyed on Claude Code session ID or Codex CLI conversation ID. Pinning saves 2 to 4 percent directly and preserves the prefix-cache discount that compounds on top of every other mechanism. Most often missing from naive implementations; the bug is forgetting to do it.

Stacking the five

Composition isn’t a simple sum; each mechanism shrinks the surface the next operates on. Conservatively bounded:

Selectivity: 10 to 16 percent of total input tokens recovered.
Caching: 4 to 7 percent on the post-selectivity surface.
Compiled execution: 25 to 35 percent on the cleanly-compiled subset (55 to 70 percent of remaining surface).
Description compression: 4 to 6 percent on what is left after compile-mode collapses the advertisement.
Pinning: 2 to 4 percent direct plus prefix-cache preservation that multiplies the other four.

Compose conservatively and the team is at 42 to 54 percent total reduction. The 50 percent headline that shows up in vendor materials is the midpoint of a real stack, not the upper bound of any single mechanism. Each mechanism can be enabled, evaluated, and rolled back on its own; the composition is the architecture.

A measurement framework that survives a quarter

The headline metric is MCP input-token share, input tokens for spans tagged with any mcp.* attribute, divided by total input tokens across both agents. Target after four weeks: 18 to 25 percent, down from the 41 to 58 percent starting band. This is what the CFO sees.

The diagnostic metrics, weekly: cache hit rate per tool (target 35 to 55 percent); compile-mode coverage as a fraction of registered tool IDs (target 55 to 70 percent); tool-call failure rate (target under 4 percent, the headline reduction is invalid if failures climbed). Future AGI’s traceAI tags spans with mcp.tool.name, mcp.server.id, mcp.cache_hit, mcp.compile_mode; Portkey’s MCP Hoot tooling produces a comparable dashboard; OTel-native gateways emit spans and leave dashboarding to the team.

The cross-agent measurement is most often missed. Sessions need tagging with agent.name so the same dashboard slices both. A team that only measures Claude Code is undercounting by close to half, because cross-agent duplication is the largest source of the original problem.

The audit log is the closing artifact. Every gateway decision, classification, cache hit, compile-mode promotion, description compression, lands in a trace stream retained 90 days. When an engineer reports “the gateway broke my workflow,” the audit log says whether the classifier dropped a tool, the cache returned stale, or compile-mode poisoned context.

Two case patterns

Pattern A: the bug-fix loop. A Claude Code session against a 200K-line repository loops on filesystem.read, git.log, git.blame, git.diff for 40 to 60 turns. Surfaces one through three compound; by turn 30, the same git.log payload appears in input context 28 times. Cache hit rate is highest of any class (60+ percent on filesystem.read), compile-mode coverage is excellent (pure functions), selectivity does little (the bedrock set is what the loop needs). Headline reduction: 48 to 54 percent.

Pattern B: the CI-integration scripting task. A Codex CLI session, “wire the deployment script to Slack and Linear”, touches filesystem, slack, linear, occasionally git, never postgres or figma. Selectivity is the biggest lever (most of the registered set is irrelevant); cache hit rate is moderate (15 to 25 percent); compile-mode is variable because side-effecting tools resist compilation. Headline reduction: 28 to 38 percent.

Vendor benchmarks feature Pattern A and elide Pattern B. A team’s real bill is a weighted average; mature platform teams measure the mix explicitly.

Vendor-agnostic implementation notes

The five mechanisms are architecture, not vendor choice. Every mature MCP gateway implements some subset; the differentiation is which subset and whether they compose into a feedback loop.

Maxim Bifrost is the implementer of compile-mode and the source of the 92.8 percent figure. Apache 2.0 Go binary, the most direct path to mechanism three; static allowlists handle one; caching native; compression as per-registration override; no optimizer. Pick when compile-mode is the lever that matters. Honesty caveat: Maxim’s own listicles don’t publish a limitations block, that absence is a signal.

Portkey’s MCP Hoot tooling is the most polished hosted MCP observability surface. Selectivity through virtual servers is mature; caching first-class; cache-hit telemetry best in the cohort. No compile-mode, no optimizer. The April 30, 2026 Palo Alto Networks acquisition (roadmap merging into Prisma AIRS) is a procurement signal.

Kong AI Gateway is the pick when Kong is already the API platform. One and two via existing plugins; three requires Lua engineering; four and five manual. agentgateway.dev is the Linux Foundation OSS option, one and two as policy-as-code, three not part of the project.

Future AGI Agent Command Center is one of multiple valid implementations. It wires all five mechanisms into a self-improving loop: traceAI (Apache 2.0, 50+ AI surfaces across Python, TypeScript, Java, and C# (including Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), OpenInference-native) instruments MCP and the underlying Anthropic/OpenAI calls as one span tree across both agents; ai-evaluation (Apache 2.0) scores compile-mode and description-compression promotion against held-out tasks; agent-opt (Apache 2.0; six optimizers (ProTeGi, BayesianSearchOptimizer with Optuna, GEPAOptimizer, MetaPromptOptimizer, RandomSearchOptimizer, PromptWizardOptimizer), all sharing EarlyStoppingConfig) tunes the selectivity classifier, per-tool TTLs, and compile-mode candidate lists session over session. Agent Command Center adds the failure-cluster view, the MCP Security scanner, RBAC, and the Future AGI Protect model family as the inline guardrail layer at 65 ms text / 107 ms image median time-to-label (arXiv 2510.13351). FAGI’s own fine-tuned Gemma 3n adapters across content moderation, bias detection, security/prompt-injection, and data privacy/PII, multi-modal across text/image/audio, a model family rather than a plugin chain. Error Feed sits alongside as FAGI’s part of the eval stack (the clustering and what-to-fix layer that feeds the self-improving evaluators via HDBSCAN clustering plus a Sonnet 4.5 Judge writing immediate_fix), auto-clustering related per-tool failures into named issues (50 traces → 1 issue) with auto-written root cause plus quick fix plus long-term recommendation per issue. The distinguisher is the “compiler discovery” pattern, the gateway itself learns which tools are promotable, which TTLs are stable, which classifications are accurate. That loop is the wedge; the rest of the architecture is reproducible on any gateway above with comparable engineering investment.

The mechanisms define what an MCP gateway is for; the gateways define how it ships.

Where the pattern still has rough edges

Latency: a gateway hop adds 18 to 24 ms p95 on the LLM path; cached MCP returns shave latency overall (6 to 9 ms p95 vs 60 to 200 ms direct). Coding agents tolerate it; voice agents less so.

Cache poisoning: stale git.diff after a commit, stale filesystem.read after an edit. Per-tool TTL plus audit log plus failure-rate metric is the defense.

Compile-mode regression: the eval gate works when held-out tasks reflect production; it fails when production drifts. Mitigation is continuous evaluation, fi.evals rescoring weekly, demotion of candidates that fall below threshold.

Multi-agent state: Claude Code’s session ID is stable within a session but doesn’t carry across CLI invocations; Codex CLI’s conversation ID has similar limits. The gateway carries its own session model, a stable hash of (developer, repo, agent), to make pinning work across restarts.

The uncomfortable one: none of the five are agent-aware. The gateway compresses the MCP surface; it can’t compress the agent’s choice to over-prompt, read three irrelevant files, or ask postgres four ways. That class of waste is what the broader self-improving loop addresses, a separate post.

Sources

Model Context Protocol specification 2025-11-25, modelcontextprotocol.io/specification/2025-11-25
Anthropic Claude Code MCP documentation, claude.ai/docs/claude-code/mcp
OpenAI Codex CLI MCP documentation, platform.openai.com/docs/codex-cli/mcp
Maxim Bifrost Code Mode benchmark (92.8% reduction across 508 tools on 16 MCP servers), getmaxim.ai/bifrost/resources/code-mode
OX Security advisory on MCP STDIO RCE class (April 15, 2026), ox.security/blog/mcp-supply-chain-advisory-rce-vulnerabilities-across-the-ai-ecosystem
Future AGI Agent Command Center docs, docs.futureagi.com/docs/command-center
Future AGI traceAI repo (Apache 2.0), github.com/future-agi/traceAI
Future AGI ai-evaluation repo (Apache 2.0), github.com/future-agi/ai-evaluation
Future AGI agent-opt repo (Apache 2.0), github.com/future-agi/agent-opt
Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (65 ms text / 107 ms image median time-to-label)
Portkey AI gateway and MCP Hoot tooling, portkey.ai
Palo Alto Networks Portkey acquisition (April 30, 2026), paloaltonetworks.com/company/press/2026/palo-alto-networks-to-acquire-portkey
Kong AI Gateway, konghq.com/products/kong-ai-gateway
agentgateway.dev, agentgateway.dev (Linux Foundation project page)

How to Reduce MCP Token Costs for Claude Code at Scale in 2026, the configuration walkthrough.
Best MCP Gateway for Claude Code to Cut Token Costs by 50 Percent in 2026, the head-to-head listicle.
Best 5 AI Gateways to Monitor Claude Code Token Usage in 2026, the monitoring pilot.
What Is an AI Gateway? The 2026 Definition, the upstream definition.

Frequently asked questions

Why is the cross-agent angle load-bearing?

Surface-four duplication is the largest single contributor to the 58 percent ceiling and the easiest to miss in a single-agent benchmark. A team running only Claude Code sees 41 to 48 percent MCP share; adding Codex CLI pushes them to 50 to 58 percent without any new tools registered. The gateway is the only place where both tokenizations can be addressed at once.

Which mechanism is most underrated?

Pinning. The smallest direct lever (2 to 4 percent) but the one that protects the prefix-cache discount providers offer. A gateway that re-classifies mid-session destroys a 40 to 90 percent provider-side discount that would have multiplied every other mechanism.

Is compile-mode safe enough for production?

Only with a held-out evaluation gate and an audit log. The model is writing Python and the gateway is executing it in a sandbox; without scoring compile-mode against tool-mode and rejecting regressions, the gateway will silently promote tools that degrade quality. Centralizing model-written execution in a hardened gateway sandbox is also safer than running it locally — but only if the sandbox is actually hardened.

Does the gateway need to sit on the LLM API path or just the MCP path?

Both. Selectivity, description compression, and pinning need the LLM API path to read the system prompt and inject the compressed advertisement. Caching and compile-mode need the MCP path. The 50 percent headline requires both legs.

How does this compare to retrieving tool descriptions on demand?

On-demand retrieval ('RAG-MCP' in working-group threads) addresses surface one only. The five mechanisms here are complementary, not alternatives.

What if the gateway is unreachable?

The agent falls back to direct MCP. Keep `ANTHROPIC_BASE_URL` or `OPENAI_BASE_URL` pointed at the gateway with a documented revert path, and keep upstream MCP server URLs in version control. Rollback is under 60 seconds.

Is `agent-opt` the only way to close the loop?

No. Any combination of OTel span capture, a scoring layer, and a policy update path closes the loop. `agent-opt` is one Apache-2.0 implementation; the pattern is the loop, not the toolkit.

View all

Guides

Best 5 AI Gateways for MCP Tool-Level Observability with Codex CLI in 2026

Five AI gateways scored for MCP tool-level observability with Codex CLI: per-tool latency, success rate, argument validation, MCP auth.

Vrinda Damani · Apr 22, 2026

17 min

Guides

Best MCP Gateway for Claude Code to Cut Token Costs by 50 Percent in 2026

MCP gateway in front of Claude Code cuts input-token spend 50% in 2026: compiled tools, semantic caching, registration, scored across 5 real gateways.

Rishav Hada · Apr 5, 2026

17 min

Guides

Best 5 MCP Gateways for Claude Code in 2026

Five MCP gateways for Claude Code in 2026, scored on per-tool latency, server auth, tool-description scanning, session correlation, post-STDIO-RCE.

Rishav Hada · Feb 13, 2026

18 min

The mechanics of the MCP token blow-up

Why the protocol does not fix this on its own

The five mechanisms

Mechanism one: tool selectivity

Mechanism two: semantic caching

Mechanism three: compiled execution

Mechanism four: description compression

Mechanism five: server-set pinning per session

Stacking the five

A measurement framework that survives a quarter

Two case patterns

Vendor-agnostic implementation notes

Where the pattern still has rough edges

Sources

Related reading

Frequently asked questions