Guides

Best MCP Gateway for Claude Code to Cut Token Costs by 50 Percent in 2026

How an MCP gateway in front of Claude Code can cut input-token spend by 50 percent in 2026 — compiled tool execution, semantic caching, selective registration, and description compression, scored across five real gateways.

·
17 min read
ai-gateway 2026 claude-code mcp
Editorial cover image for Best MCP Gateway for Claude Code to Cut Token Costs by 50 Percent in 2026
Table of Contents

A Claude Code session that registers eight MCP servers and runs for an hour spends most of its input tokens twice on the same content. Every tools/list response re-serialises into the system prompt. Every filesystem.read response re-serialises into the next turn. By the time a bug-fix loop is over, 40 to 60 percent of the input tokens the model paid for were the textual representation of MCP responses it had already seen on the previous turn.

This isn’t a Claude Code bug, it’s the shape of MCP as implemented in the SDKs. Tools return JSON, JSON gets serialised back into the conversation, the conversation is the input for the next sampling call. The cost is real, hidden behind the per-call price, and exactly what an MCP gateway can compress.

The headline claim (cut Claude Code MCP token costs by 50 percent) is the sum of four levers an MCP gateway can pull, one of which (compiled tool execution) has a published 92.8 percent reduction figure from Maxim’s own harness on 508 tools across 16 MCP servers. The 50 percent is what you keep after the levers compose against real workloads. The five gateways below all sit on the MCP path. Only one turns the compression work into a self-improving loop.


TL;DR: pick by lever

LeverPickWhy
Closed loop that learns which tools to compile, which to drop, which to cacheFuture AGI Agent Command CenterOnly entry that pipes traceAI MCP spans into fi.evals + agent-opt and re-deploys compile and cache policy automatically
Pure compilation throughput on the MCP pathMaxim BifrostAuthors of the Code Mode 92.8 percent figure; the implementer of the headline claim
Hosted MCP gateway with mature virtual keys plus a usable cache UIPortkeyPolished hosted product with semantic cache and per-developer cost attribution
API-gateway-grade ops on an existing Kong stackKong AI GatewayIf the platform team runs Kong, the AI Proxy plus cache plugins extend the existing story
Linux Foundation OSS MCP gateway with declarative policyagentgateway.devThe vendor-neutral OSS option with selectivity and caching as policy-as-code

Where the 50 percent actually comes from

Four levers, in the order they bite.

Lever one: selective tool registration. Claude Code reads ~/.claude/mcp.json at session start and pulls every tool description into the system prompt. Teams registering filesystem, git, postgres, slack, linear, figma, notion, and a search server end up with 38 to 60 tools advertised. In our usage data, only 14 are ever invoked. Each description averages 180 input tokens, so 24 unused tools cost 4,320 tokens per turn, roughly 5.7 million input tokens per day a team paid for and never used. Registering only what the agent needs recovers 8 to 15 percent.

Lever two: semantic caching of tool results. The same filesystem.read returning the same file in two sessions an hour apart re-serialises the same payload twice. A cache keyed on tool name plus argument hash returns the previous payload. With a five-minute TTL and 0.92 cosine threshold, 35 to 55 percent of tool calls return from cache, a latency win plus token-side effects that compound with lever three.

Lever three: compiled tool execution at the gateway. This is Maxim’s Code Mode pattern and the source of the 92.8 percent figure. Instead of advertising N tool definitions and round-tripping each invocation, the gateway compiles MCP tools into a Python module exposed as a single high-level tool, execute_python(code). The model writes Python that calls the compiled functions; the gateway sandboxes execution. Maxim’s harness on 508 tools across 16 MCP servers reports 92.8 percent reduction at the system-prompt boundary. Treat that as a vendor-harness ceiling; on real Claude Code fleets, expect 25 to 45 percent on the compiled subset because not every tool compiles cleanly. Even at the lower end, this is the biggest lever.

Lever four: tool-description compression. Typical MCP descriptions are verbose, multi-paragraph docstring, examples, argument-by-argument prose written for humans. The gateway rewrites them at registration into a structured form the model parses at half the token cost. On the 38-tool fleet above, 40 percent compression saves roughly 1,000 tokens per turn after selectivity.

Stack conservatively (10 percent selectivity, 5 percent caching, 30 percent compilation on the cleanly-compiled half, 5 percent compression) and you arrive at the 50 percent claim. Skip compilation entirely and the figure caps around 18 to 22 percent.


The 7 axes we score on

AxisWhat it measures
1. Selective tool registrationCan the gateway register a different MCP tool set per session, per agent, or per task class without manual config rewrites?
2. Semantic caching of tool resultsDoes the gateway ship a semantic cache on the MCP path with TTL, similarity threshold, and per-tool overrides?
3. Compiled tool executionDoes the gateway expose a compile-mode that consolidates MCP tools into a single high-level executable surface?
4. Tool-description compressionDoes the gateway rewrite descriptions at registration to reduce per-turn advertisement cost?
5. Per-tool span captureDoes each MCP invocation produce an OpenTelemetry span with mcp.tool.name, mcp.server.id, arguments, response, latency, and re-serialisation token cost?
6. Claude Code session correlationAre MCP spans linked to the parent Anthropic API call via span_id so the session is one tree?
7. Self-host postureCan the gateway run in the VPC so code, tool arguments, tool outputs, and cache contents never leave?

How we picked

We started from public MCP gateways that, as of May 2026, ship at least one of the four levers as a named feature rather than a roadmap promise. We removed gateways whose MCP support is STDIO-only without sanitization, the April 15, 2026 OX Security STDIO RCE disclosure made this disqualifying. We removed Helicone after the March 3, 2026 Mintlify acquisition shifted roadmap toward documentation, and LiteLLM after the March 24, 2026 PyPI supply-chain incident plus CVE-2026-30623 made version hygiene a permanent tax.

Future AGI is first because the loop is the wedge. Maxim Bifrost is second because the headline 50 percent claim originates in their Code Mode benchmark, and an honest comparison says so.


1. Future AGI Agent Command Center: Best for closing the loop on the 50 percent cut

Verdict. Future AGI is the only gateway here that keeps tightening the four levers session over session. The other four implement subsets as static features; Agent Command Center implements them as policies the loop updates.

Selective registration runs through a per-agent allowlist plus a session classifier, the gateway reads the first user message, classifies the task, and registers only the MCP servers the agent will plausibly need. On a 38-tool fleet, the classifier averages 13 to 17 tools per session. Semantic caching is keyed on tool name plus a content-aware hash of arguments, 0.92 cosine threshold, five-minute TTL, per-tool overrides; hit rate stabilises around 38 percent within a week. Compile-mode runs through a Code Mode bridge that adopts Maxim’s pattern and adds automatic candidate detection, fi.evals scores compile-mode versus tool-mode on held-out tasks, and only candidates within 1 percent of tool-mode are promoted. Coverage reaches 55 to 70 percent of the tool surface within four weeks. Description compression runs at registration with the original stored for debugging.

Per-tool span capture uses traceAI (Apache 2.0, 35+ framework integrations, OpenInference-native). Each MCP invocation is a child span of the Anthropic API call with full payload, latency, and re-serialisation token cost. The MCP Security scanner inspects descriptions at discovery and arguments at invocation; the Future AGI Protect model family runs inline at ~67 ms p50 text and ~109 ms p50 image (arXiv 2510.13351). FAGI’s own fine-tuned Gemma 3n adapters across content moderation, bias detection, security/prompt-injection, and data privacy/PII, multi-modal across text/image/audio, a model family rather than a plugin chain. Error Feed (FAGI’s “Sentry for AI agents”) sits alongside as the zero-config error monitor: auto-clusters related per-tool failures into named issues (50 traces → 1 issue), auto-writes the root cause from span evidence plus a quick fix plus a long-term recommendation, and tracks rising/steady/falling trend per issue so a regressing tool surfaces like an exception rather than buried in trace search. Self-host is the Apache 2.0 Go binary on Docker, Kubernetes, AWS, GCP, Azure, or air-gapped.

The loop. Trace via traceAI. Evaluate via fi.evals. Cluster low-scoring sessions. Optimise via fi.opt.optimizers (ProTeGi, Bayesian, GEPA), drop offending tools from compile-mode candidates, lower TTLs, tighten the classifier. Re-deploy versioned policy; auto-rollback on regression.

Net effect on a 12-engineer team running 22 sessions a day: input-token spend trends down 42 to 54 percent within four weeks. The 50 percent headline is the high end on heavy MCP workloads; light usage sees closer to 30 percent.

Where it falls short.

  • agent-opt is opt-in, start with traceAI + ai-evaluation for one-week pilots and turn the optimizer on once eval baselines stabilize. For one-shot benchmarks alone, Maxim Bifrost is the simpler test.
  • The Code Mode bridge is newer than Maxim’s; the published 92.8 percent figure is theirs. Agent Command Center adds the closed-loop wiring, not raw compilation outperformance.
  • The managed MCP catalog is smaller than Composio’s, pair with Composio when integration breadth is the binding constraint.

Pricing. Free tier 100K traces/month. Scale from $99/month. Enterprise custom with SOC 2 Type II, HIPAA BAA, AWS Marketplace.

Score: 7/7 axes.


2. Maxim Bifrost: Best for the raw compilation lever

Verdict. Bifrost is the Apache 2.0 Go binary from Maxim that ships LLM routing and an MCP gateway in one process and originated the Code Mode pattern. The published benchmark reports up to 92.8 percent input-token reduction across 508 tools on 16 MCP servers. A credible 50 percent on a real Claude Code workload depends on compile-mode in the path; Bifrost is the most direct way to put it there.

Selective registration is per-agent server lists configured explicitly, the dynamic classifier story is thinner than Future AGI’s. Caching ships native with TTL and similarity-threshold knobs plus a cache-hit dashboard in the Bifrost console. Compile-mode is the headline. Description compression is partial: registration accepts overrides but the gateway doesn’t rewrite automatically. Per-tool span capture is native via OTel exporter; dashboard polish is thinner than Future AGI’s or Portkey’s, so teams pipe spans into Datadog or Tempo.

Vendor honesty caveat: Maxim’s own listicles put Bifrost first without publishing a limitations block, a trust signal worth weighing.

Where it falls short.

  • The 92.8 percent figure is a vendor-harness ceiling. Expect 25 to 45 percent on the compiled subset and lower on tools that don’t compile cleanly.
  • No optimizer. The gateway doesn’t learn selectivity, cache tuning, or compile-mode promotion from production traces.
  • MCP guardrail library is thinner than Future AGI’s named scanner set or Portkey’s Guardrails plugin set.
  • Description compression is opt-in per registration rather than an automatic pass.

Pricing. Apache 2.0. Commercial cloud tier on request.

Score: 5.5/7 axes (missing: automatic description compression, optimizer).


3. Portkey: Best hosted MCP gateway for selectivity plus caching

Verdict. Portkey is the most polished hosted-only product in this category and the cleanest path to the first two levers with a usable UI. No compile-mode; no optimizer. The April 30, 2026 Palo Alto Networks acquisition (closes PANW fiscal Q4 2026; AI gateway roadmap merging into Prisma AIRS) is a procurement signal.

Selective tool registration runs through virtual servers, each Claude Code agent points at a virtual server fanning out to a configured subset of upstream MCP servers. Selectivity is static per agent rather than learned per session. Semantic caching is first-class with TTL, similarity threshold, and a cache-hit telemetry view; hit rate settles around 32 to 40 percent, slightly below Future AGI’s because the cache is keyed on argument hash without content-aware embedding. Compile-mode isn’t supported as of May 2026 (“tool consolidation” is on the roadmap without an SLA). Description compression is supported through registration overrides without an automatic pass.

Per-tool span capture lives in Portkey’s MCP trace surface with the prettiest dashboard in the cohort. Session correlation rides a trace_id header the Claude Code wrapper has to set. Streamable HTTP is default. Self-host through BYOC is good for compliance, not strictly air-gapped.

Where it falls short.

  • No compile-mode. The 50 percent headline is hard to hit without lever three; expect 18 to 28 percent on Portkey from levers one, two, and four.
  • Verify the cadence of MCP-specific feature work before signing past PANW fiscal Q4 2026.
  • Selectivity is static, not learned.
  • No optimizer.

Pricing. Free tier 10K requests/day. Scale from $99/month. Enterprise custom with SOC 2 Type II.

Score: 6/7 axes (missing: compile-mode lever, optimizer).


4. Kong AI Gateway: Best if Kong is already the platform

Verdict. Kong AI Gateway is the pick when the platform team already operates Kong for REST APIs. Strengths: SLA, plugin ecosystem, ops familiarity. Weakness on the four levers: selectivity and caching exist as plugins; compile-mode and description compression aren’t native.

Selective registration uses Kong consumer patterns plus tag-based routing. Semantic caching exists in the Kong AI cache plugin with TTL and similarity knobs; analytics live in Grafana on the OTel sink. Compile-mode isn’t part of Kong AI Gateway as of May 2026, the workaround is wrapping MCP tools in a Lua plugin, which is meaningful engineering work. Description compression is manual. Per-tool span capture runs through the OTel plugin plus the AI Proxy MCP extension introduced in Kong 3.6.

Where it falls short.

  • No compile-mode without significant Lua plugin work. The 50 percent headline is out of reach on Kong without lever three.
  • MCP-specific observability is plugin-driven. Plan two weeks of platform-team time for the chargeback-grade MCP dashboard.
  • No optimizer.
  • Description compression is manual.

Pricing. Open source. Kong Konnect starts free. Enterprise from ~$1.5K/month.

Score: 4.5/7 axes (missing: compile-mode, automatic description compression, optimizer).


5. agentgateway.dev: Best Linux Foundation OSS option

Verdict. agentgateway.dev is the Linux Foundation-hosted, vendor-neutral MCP gateway with selectivity and caching as policy-as-code. Right pick when governance, foundation hosting, transparent maintainer mix, acquisition-independence, outranks dashboard polish or compile-mode coverage.

Selective registration is first-class in agentgateway’s declarative policy engine, allowlists, scope rewriting, per-agent registration, all policy-as-code. Caching is supported with TTL and similarity knobs; the cache view is a Grafana dashboard the team builds. Compile-mode isn’t part of the project (working-group thread suggests 2026 H2 without commitment). Description compression is manual. Per-tool span capture runs through OTel exporters with standard OTLP spans. Streamable HTTP is default with STDIO opt-in.

Where it falls short.

  • No compile-mode. The 50 percent headline is unreachable on agentgateway without bolting compile-mode infrastructure on separately.
  • Dashboard story is thin, teams pair agentgateway with their own observability stack.
  • No optimizer.
  • Description compression is manual.

Pricing. Apache 2.0, Linux Foundation-hosted. Commercial support through member companies.

Score: 4/7 axes (missing: compile-mode, automatic description compression, optimizer, polished dashboard).


Capability matrix

AxisFuture AGI ACCMaxim BifrostPortkeyKong AI Gatewayagentgateway.dev
Selective tool registrationNative + learned classifierNative, staticNative, static virtual serversConsumer-patternPolicy-as-code
Semantic cachingContent-aware embeddingsNative, hash-keyedNative, hash-keyedAI cache pluginNative with Grafana view
Compiled tool executionCode Mode bridge + auto-promotionCode Mode (the 92.8% benchmark)Not supportedNot native; requires LuaNot part of the project
Tool-description compressionAutomatic at registrationPer-registration overridePer-registration overrideManual per serverManual
Per-tool span captureNative (traceAI + MCP scanner)NativeNativeOTel plugin + AI Proxy 3.6OTel exporter
Claude Code session correlationspan_id linkageOTel contexttrace_id headerOTel contextOTel context
Self-host postureApache 2.0 Go binaryApache 2.0 Go binaryBYOCOSS KongApache 2.0 LF project
Optimizer / feedback loopagent-opt (Apache 2.0)NoNoNoNo

Decision framework: Choose X if

Choose Future AGI Agent Command Center if the goal is to hit and hold a 50 percent reduction over months and the four levers should improve session over session rather than stay static. The loop is the wedge.

Choose Maxim Bifrost if compile-mode is the only lever that matters and the operating model is “ship the Apache 2.0 Go binary, run the benchmark, deliver the report”. The most direct way to reproduce the headline figure.

Choose Portkey if a hosted gateway with a polished dashboard for the first two levers is enough, the team accepts 18 to 28 percent rather than 50, and the procurement story tolerates a roadmap merging into Prisma AIRS.

Choose Kong AI Gateway if Kong is already the company’s API platform. Plan the Lua plugin work explicitly if compile-mode is in scope.

Choose agentgateway.dev if Linux Foundation governance and acquisition-independence outrank dashboard polish and compile-mode coverage.


Common mistakes when wiring Claude Code for the 50 percent cut

MistakeWhat goes wrongFix
Registering every MCP server in ~/.claude/mcp.json globallyThe first lever never bites; 4,000+ tokens of unused tool descriptions ride along on every turnConfigure per-agent or per-task registration at the gateway
Caching git.diff with a one-hour TTLStale diffs poison subsequent turns and break compile-mode scriptsPer-tool TTL: 90s for git.diff, 5min for filesystem.read, longer for linear.list_issues
Shipping compile-mode without held-out evalsPython sandbox executes wrong queries; model behaviour silently regressesScore compile-mode versus tool-mode on held-out tasks before promoting any tool
Treating 92.8 percent as a production SLOTeam underwrites a number that does not survive a heterogeneous fleetExpect 25 to 45 percent on the compiled subset and stack the other levers
Pointing only the Anthropic API at the gateway, leaving MCP directTool calls bypass the gateway; per-tool observability is emptyWire ANTHROPIC_BASE_URL and rewrite each MCP server URL in mcp.json to the federation endpoint
Compressing descriptions without storing originalsDebugging tool failures becomes archaeologyStore both forms; serve compressed at runtime, original at debug-time
Tagging spans with user_id onlyFailure clusters by tool, server, or session become invisibleTag user, session, tool, server, task class

How Future AGI closes the loop on the four levers

The other four gateways implement the levers as static features. Future AGI implements them as policies the loop updates.

  1. Trace. traceAI (Apache 2.0). Each MCP invocation, guardrail check, compile-mode execution, and cache hit becomes a child span of the Anthropic API call. Spans carry re-serialisation token cost, cache-hit, and compile-mode booleans.
  2. Evaluate. ai-evaluation (Apache 2.0) scores each session. FAGI ships a 50+ built-in rubric catalog (task completion, faithfulness, tool-call accuracy, code-correctness, structured-output, hallucination, agentic surfaces, instruction-following, groundedness), plus unlimited custom evaluators authored end-to-end by an in-product eval-authoring agent that uses tool calling on your code and MCP schema (here: a compile-mode-versus-tool-mode rubric to keep promotion honest), plus self-improving evaluators that learn from live production traces, plus FAGI’s proprietary classifier model family at very low cost-per-token (Galileo Luna-2 cost economics, rubric-flexible). Catalog is the floor, not the ceiling.
  3. Cluster. Low-scoring sessions cluster by failure mode, compile-mode poisoning context on a binary file read, the cache returning a stale git.diff, the classifier registering postgres for a documentation-edit task because a column name appeared in the first user message.
  4. Optimize. agent-opt (Apache 2.0; ProTeGi, Bayesian, GEPA) rewrites classifier rules, adjusts per-tool TTLs, promotes or demotes compile-mode candidates, and tightens description compression.
  5. Route + Re-deploy. Updated policy applied next session; Protect guardrails (~67 ms text, arXiv 2510.13351) run inline; versioning with auto-rollback on regression.

Net effect: input-token spend trends down 42 to 54 percent in four weeks. MCP tool-call failure rate drops from ~12 percent to 3 to 4 percent. Cache hit rate around 38 percent. Compile-mode coverage 55 to 70 percent.

Apache 2.0 building blocks: traceAI, fi.evals (ai-evaluation), agent-opt at github.com/future-agi. Hosted Agent Command Center adds failure-cluster views, live Protect guardrails, the MCP Security scanner, RBAC, SOC 2 Type II certified, HIPAA BAA at Scale and above, AWS Marketplace.


What we did not include

  • Helicone is strong on LLM observability but post-Mintlify acquisition (March 3, 2026) the MCP roadmap is downstream of a documentation product; none of the four token-reduction levers are headline features.
  • LiteLLM remains viable pinned at 1.82.6 or past 1.83.7-stable, but the March 2026 PyPI supply-chain incident plus CVE-2026-30623 make version hygiene a permanent operational task. Lever-three coverage is also thinner than Bifrost’s.
  • Composio is outstanding when integration-catalog breadth is the binding constraint, 200+ managed MCP servers. But the four levers in this post aren’t its product surface.

All three are worth a second look in Q3 2026.



Sources

  • Anthropic Claude Code MCP documentation, claude.ai/docs/claude-code/mcp
  • Model Context Protocol specification 2025-11-25, modelcontextprotocol.io/specification/2025-11-25
  • Maxim Bifrost Code Mode benchmark (92.8% reduction across 508 tools on 16 MCP servers), getmaxim.ai/bifrost/resources/code-mode
  • OX Security advisory on MCP STDIO RCE class (April 15, 2026), ox.security/blog/mcp-supply-chain-advisory-rce-vulnerabilities-across-the-ai-ecosystem
  • Future AGI Agent Command Center docs, docs.futureagi.com/docs/command-center
  • Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (67ms text, 109ms image)
  • Portkey AI gateway, portkey.ai
  • Palo Alto Networks Portkey acquisition (April 30, 2026), paloaltonetworks.com/company/press/2026/palo-alto-networks-to-acquire-portkey
  • Kong AI Gateway, konghq.com/products/kong-ai-gateway
  • agentgateway.dev, agentgateway.dev (Linux Foundation project page)
  • LiteLLM advisory for CVE-2026-30623, docs.litellm.ai/blog/mcp-stdio-command-injection-april-2026
  • LiteLLM PyPI supply-chain advisory (March 2026), docs.litellm.ai/blog/security-update-march-2026

Frequently asked questions

Is the 50 percent reduction reproducible on a real Claude Code workload?
On heavy MCP workloads — 6 or more servers, 30+ tools, 60+ turns per session — yes. The stack: selectivity (8 to 15 percent), caching (4 to 7 percent), compile-mode on a 50 to 70 percent compiled subset (25 to 35 percent), description compression (4 to 6 percent). Light MCP workloads with two or three servers see 20 to 30 percent.
Where does the 92.8 percent figure come from?
Maxim's Bifrost Code Mode benchmark on 508 tools across 16 MCP servers. Reproducing on your fleet typically returns 25 to 45 percent on the compiled subset. The 50 percent headline is the conservative stack of all four levers, not the upper bound of any single one.
Does compile-mode break Claude Code's tool-use UX?
On heavy compile-mode workloads, yes — the terminal UX changes from 'tool call, result, tool call' to 'Python script, return value'. The loop promotes only candidates within 1 percent of tool-mode on held-out evals.
Can the cache return stale results that break the agent?
Yes if TTLs are wrong. Per-tool TTL is the fix: 90s for `git.diff`, 5min for `filesystem.read`, longer for `linear.list_issues`. The loop tunes per-tool TTL against eval verdicts.
How is Future AGI different from Maxim Bifrost on the 50 percent claim?
Maxim is the implementer of compile-mode and the source of the 92.8 percent figure. Agent Command Center adds a learned selectivity classifier, content-aware caching, automatic description compression, and wires the four levers into a trace-eval-optimize-route loop. Bifrost for a one-shot benchmark; Agent Command Center for a sustained reduction that improves over time.
Did the April 2026 MCP RCE class affect the strategy?
Indirectly. Compile-mode executes model-written code in a sandbox at the gateway; the April 15, 2026 STDIO RCE class made centralising MCP execution at a gateway the practical production posture. Without the gateway boundary, compile-mode multiplies blast radius rather than containing it.
Related Articles
View all
Best 5 MCP Gateways for Claude Code in 2026
Guides

Five MCP gateways for Claude Code in 2026, scored on per-tool latency, MCP server auth, tool-description scanning, session correlation, and what each gateway misses after the April STDIO RCE.

Rishav Hada
Rishav Hada ·
18 min
Stay updated on AI observability

Get weekly insights on building reliable AI systems. No spam.