Guides

Best 5 AI Gateways to Cache Claude Code Calls in 2026

Five AI gateways scored on caching Claude Code calls in 2026: cross-developer cache scope, semantic-match thresholds, hit-rate observability, TTL controls, and what each one misses.

·
17 min read
ai-gateway 2026 claude-code semantic-caching
Editorial cover image for Best 5 AI Gateways to Cache Claude Code Calls in 2026
Table of Contents

A 25-developer team on Claude Code with Anthropic’s native prompt caching can still leave 35-50% of cacheable input on the floor. The cache_control blocks cache against a single API key’s recent window, they don’t share across developers, don’t match paraphrased prompts, and don’t tell you which repo keeps re-uploading the same 80K-token monorepo context every morning. The bill arrives anyway.

An AI gateway closes that gap. The five below all do some version of that; only one turns the miss into a feedback signal that retrains the policy.

This is the 2026 cohort, scored on the seven caching axes that matter when Claude Code is the workload.


TL;DR

Future AGI Agent Command Center is the strongest pick for an AI gateway for Claude Code caching because it ships exact and semantic cache layered on top of Anthropic’s native cache_control, per-tenant cache keying that holds cross-developer match across a 25-developer monorepo team, per-template threshold tuning, OpenTelemetry-native cache-miss telemetry, and a write-side Protect scanner that blocks cache poisoning at ~67 ms. The other four picks below win on specific edges.

  1. Future AGI Agent Command Center — Best overall. Exact + semantic cache, per-tenant namespacing, OTel cache-miss telemetry, and write-side Protect block.
  2. Portkey — Best for the most mature semantic cache UI out of the box. Fastest path to a working semantic cache with tunable threshold and TTL (verify the Palo Alto Networks acquisition timeline before signing multi-year).
  3. Helicone — Best for lightweight observability on top of Anthropic’s native cache. Drop-in proxy that exposes per-request cache_read_input_tokens cleanly (treat as planned migration after the March 3, 2026 Mintlify acquisition).
  4. LiteLLM — Best when Claude Code traffic cannot leave your VPC. Self-hosted exact + semantic cache, Python-native; pin commits after the March 24, 2026 PyPI compromise.
  5. Cloudflare AI Gateway — Best when Claude Code traffic already routes through Cloudflare. Edge-layer cache with strong TTL primitives plugs in at the CDN hop.

Why caching Claude Code is its own problem

Each Claude Code invocation re-uses three layers of context that look stable to a human and fresh to a default LLM cache:

  1. System prompt and tool definitions. ~5-15K tokens. Identical across every turn for a given Claude Code version. Anthropic’s native cache catches this cleanly.
  2. Project context block. ~30-200K tokens of project files. Identical within a session, often identical across sessions for the same developer, sometimes identical across developers on the same repo at the same git SHA.
  3. Conversation history. Grows turn by turn. Cacheable as a prefix.

Anthropic’s cache_control (5-min and 1-hr TTLs, ~90% input cost reduction on hit) handles (1) and within-session (3) well. It’s weaker on (2) across developers because the cache is scoped to the API key and the prefix must match byte-for-byte. Three developers opening the same monorepo within the same hour pay the full uncached rate the second and third time unless something shares state across them.

That something is the gateway. For the rest of this post, “gateway” means an AI gateway that speaks the Anthropic API. All five picks support ANTHROPIC_BASE_URL.


The 7 caching axes we score on

AxisWhat it measures
1. Exact-match cache (cross-key)Can the gateway serve developer A’s cached response to developer B if the prefix matches?
2. Semantic-match cacheDoes the gateway embed the prompt and serve a cached response above a configurable cosine threshold?
3. Layered with Anthropic nativeDoes the gateway co-exist with cache_control instead of stripping or double-billing it?
4. Per-repo / per-developer UICan you read hit rate sliced by repo or developer without writing SQL?
5. TTL + invalidationPer-request TTL override, force-refresh, namespace bust on deploy?
6. Cost-saved $ reportingDoes the gateway report the absolute USD returned, not just hit rate?
7. Self-host postureCan the gateway and its cache backend run in your VPC, so cached source code never leaves?

Anthropic’s native cache is the baseline. Every gateway here is layered on top of cache_control, not in place of it.


1. Future AGI Agent Command Center: Best for closing the cache loop

Verdict: Future AGI is the only gateway here that takes cache-miss data and turns it into a threshold rewrite or routing rule. The other four cache and report; Agent Command Center caches, reports, and uses the miss data to bend the curve.

What it does for Claude Code caching:

  • Exact cache (cross-key). Hashes the canonical prompt prefix (model + system prompt + tool defs + ordered project context) and shares the response across API keys in the same tenant. A 25-developer team on the same monorepo at the same git SHA pays the prefix once per TTL window, not 25 times. In our Q2 2026 cohort data this typically adds 18-28 percentage points on top of Anthropic’s native cache.
  • Semantic cache. Embeddings backend choice (in-memory, Qdrant, or Pinecone). Cosine threshold configurable per route, default 0.93, high enough that paraphrases hit, low enough that false positives are rare.
  • Layered with Anthropic native. Respects cache_control. The outer cache only serves when the native cache would have missed.
  • Per-repo / per-developer UI. fi.attributes.repo, fi.attributes.user.id, fi.attributes.session_id are first-class span attributes. Group-by is built-in.
  • TTL + invalidation. Per-request TTL and namespace headers; force-refresh via Cache-Control: no-store; namespace-busts via CI on system-prompt deploy.
  • Cost-saved reporting. Dashboard shows USD returned, sliced by provider, model, repo, and developer. One 30-developer reference deployment: cross-key exact cache returned $4,200 of $11,300 monthly Anthropic spend at a 37% effective hit rate.
  • Self-host posture. BYOC. Cache backend runs in your VPC. The Future AGI Protect model family runs inline at ~67 ms p50 text and ~109 ms p50 image (arXiv 2510.13351). FAGI’s own fine-tuned Gemma 3n adapters across content moderation, bias detection, security/prompt-injection, and data privacy/PII, multi-modal across text/image/audio, a model family rather than a plugin chain.

The loop. traceAI (35+ framework integrations, OpenInference-native) emits spans; fi.evals scores each miss as expected or unexpected. Error Feed (FAGI’s “Sentry for AI agents”) sits alongside as the zero-config error monitor: auto-clusters related cache misses into named issues (50 traces → 1 issue), auto-writes the root cause plus a quick fix plus a long-term recommendation, and tracks trend per issue so a regressing cache-miss pattern surfaces like an exception rather than buried in a cache dashboard. fi.opt.optimizers (ProTeGi, Bayesian, GEPA) re-tunes the per-route threshold or rewrites the canonicalisation rule so future prompts hash the same way. Production hit rate typically climbs 6-9 percentage points over the first six weeks, no developer behaviour change required.

Where it falls short:

  • The optimizer and trace pipeline are heavier than teams want when the brief is “just put a cache in front of Claude Code.” Under 10 developers and $3K/month, Helicone or Cloudflare are lighter.

  • Semantic cache ships with Qdrant and Pinecone adapters. Weaviate is roadmap.

Pricing. traceAI, ai-evaluation, agent-opt are Apache 2.0. Hosted: free tier with 100K traces/month, Scale from $99/month, Enterprise custom with SOC 2 Type II, HIPAA, GDPR, and CCPA certifications, plus a BAA. AWS Marketplace.

Score: 7/7 axes.


2. Portkey: Best for hosted semantic cache with tunable threshold

Verdict: The most polished hosted-only product for semantic caching. Threshold UI, TTL controls, and per-route namespace are mature; the dashboard reports the dollar number directly. It doesn’t optimize the threshold for you.

What it does for Claude Code caching:

  • Exact cache (cross-key). Native, scoped to the workspace; cross-virtual-key hits work cleanly.
  • Semantic cache. Configurable threshold (default 0.92), per-route namespace and TTL. The most usable threshold UI in this list.
  • Layered with Anthropic native. Respects cache_control; reads back cache_read_input_tokens so savings don’t double-count.
  • Per-repo / per-developer UI. Via metadata headers (x-portkey-trace-id, custom x-portkey-metadata JSON). The Claude Code wrapper must set them; otherwise only key-level slicing.
  • TTL + invalidation. x-portkey-cache-ttl, force-refresh via x-portkey-cache-force-refresh: true, namespace bust via API.
  • Cost-saved reporting. Native dashboard; tenant rollups on Scale.
  • Self-host posture. Open-source gateway core, closed control plane. Air-gapped story thinner than LiteLLM’s.

Where it falls short:

  • No optimizer. Too tight a threshold leaves hit rate on the table; too loose ships wrong answers. No automated tuning.
  • Metadata-header model requires Claude Code wrapper changes for per-repo slicing.
  • Palo Alto Networks announced intent to acquire Portkey on April 30, 2026, expected to close in PANW fiscal Q4. Verify standalone roadmap before a multi-year contract.

Pricing. Free tier 10K requests/day. Scale from $99/month. Enterprise custom with SOC 2 Type II.

Score: 6/7 axes (missing: feedback loop / threshold optimizer).


3. Helicone: Best for lightweight observability on top of Anthropic native

Verdict: The pick when the brief is “see Anthropic’s native cache hit rate cleanly, plus a thin exact-match layer on top.” Drop in the proxy URL, get a per-request table that surfaces cache_read_input_tokens. Its own cache is exact-only; semantic is roadmap.

What it does for Claude Code caching:

  • Exact cache (cross-key). Yes, scoped to the org. Works well for the system-prompt and tool-defs layer; less well for the longer project-context prefix.
  • Semantic cache. Not as of May 2026. Roadmap.
  • Layered with Anthropic native. Helicone’s strongest cache feature. The dashboard surfaces cache_read_input_tokens and cache_creation_input_tokens per request and computes implied saving.
  • Per-repo / per-developer UI. Via Helicone-User-Id and custom properties. Shallower than Portkey or Future AGI; adequate under 20 developers.
  • TTL + invalidation. TTL on the exact cache; force-refresh; namespace control exists, less polished than Portkey.
  • Cost-saved reporting. Native dashboard surfaces the dollar number for both Anthropic native and Helicone caches.
  • Self-host posture. OSS self-host is the air-gapped path. The hosted product was acquired by Mintlify on March 3, 2026 with the roadmap shifting toward documentation. Existing hosted users should treat continued procurement as a managed migration window.

Where it falls short:

  • No semantic cache, the layer that catches paraphrased Claude Code prompts.
  • Mintlify acquisition: confirm cache-feature continuity before a multi-year deployment.
  • Routing intelligence is basic. Claude-Code-specific routing (haiku for easy turns, opus for hard) has to be coded upstream.

Pricing. Free tier 10K requests/month. Pro from $25/month. Enterprise custom.

Score: 5/7 axes (missing: semantic cache, feedback loop).


4. LiteLLM: Best for self-hosted exact + semantic cache, Python-native

Verdict: The pick when Claude Code traffic can’t leave your VPC and security wants to read every line of code touching a prompt. Cache features are functional; the out-of-the-box dashboard is thinner than the hosted picks.

What it does for Claude Code caching:

  • Exact cache (cross-key). Native, Redis backend. Cross-key sharing requires explicit team/key scoping; default is per-key. Left at default you miss the cross-developer benefit.
  • Semantic cache. Qdrant or in-memory backend. Threshold configurable via config files and re-deploy, not a polished UI.
  • Layered with Anthropic native. Respects cache_control. Docs are clear; read them carefully.
  • Per-repo / per-developer UI. Via metadata.session_id, metadata.team_id, metadata.user_id. Per-repo slicing usually means exporting to your warehouse and writing a SQL view.
  • TTL + invalidation. Per-request TTL override, force-refresh, namespace control.
  • Cost-saved reporting. Tracked in the database; UI surfaces it. Less polished than Portkey; numbers are correct.
  • Self-host posture. Strongest in this list. Source-available, runs on your nodes, no telemetry leaves the VPC.

Where it falls short:

  • No optimizer.
  • Dashboard is functional, not polished. Per-repo slicing usually goes through a SQL view.
  • March 24, 2026 PyPI supply-chain incident. Versions 1.82.7 and 1.82.8 were compromised; the package exfiltrated SSH keys, cloud credentials, and Kubernetes configs (Datadog Security Labs reported it as the TeamPCP campaign). Pin commits, scan dependency trees, rotate touched credentials, and upgrade past 1.83.7 before adoption.

Pricing. MIT (enterprise dir licensed separately). Enterprise cloud tier from ~$250/month.

Score: 5.5/7 axes (missing: optimizer, polished dashboard).


5. Cloudflare AI Gateway: Best for edge-layer cache with strong TTL primitives

Verdict: The pick when Claude Code traffic already routes through Cloudflare (Workers, WAF, DDoS) and the platform team would rather plug the AI cache into the existing edge. Exact-match and TTL primitives are strong; AI-specific observability is improving but thinner than Portkey or Future AGI.

What it does for Claude Code caching:

  • Exact cache (cross-key). Native at the edge. Cross-account caching scoped to the AI Gateway namespace. TTL primitives are some of the best in this list.
  • Semantic cache. In beta as of May 2026; threshold UI less mature than Portkey’s. Roadmap moves fast.
  • Layered with Anthropic native. Respects cache_control. Worker-based interception doesn’t strip headers, so inner Anthropic cache and outer Cloudflare cache compose.
  • Per-repo / per-developer UI. Via cf-aig-metadata. Default dashboard slices by request, model, and provider; per-developer slicing requires wiring metadata from the Claude Code wrapper and reading Workers Analytics Engine.
  • TTL + invalidation. Strongest in this list. cf-aig-cache-ttl, cf-aig-skip-cache, namespace via gateway IDs. Custom cache keys via cf-aig-cache-key allow cache busting on git SHA.
  • Cost-saved reporting. Native dashboard shows per-provider cost and hit rate. “Dollar saved” is derivable, less polished than Portkey.
  • Self-host posture. Hosted at the edge only. No self-hosted version. Cached responses (containing source code) sit on Cloudflare’s edge, a strength for zero-ops teams, a non-starter for some compliance regimes.

Where it falls short:

  • Semantic cache maturity. Portkey and Future AGI have ~12 months more production mileage. Today Cloudflare is the right pick for exact-match at the edge, not for semantic.
  • No self-hosted option. If compliance forbids cached source code on Cloudflare’s edge, the gateway is out.
  • No optimizer.

Pricing. Free tier on Workers; paid scales with Workers usage. Enterprise for SLA.

Score: 5/7 axes (missing: mature semantic cache, self-host, optimizer).


Capability matrix

AxisFuture AGIPortkeyHeliconeLiteLLMCloudflare AI
Exact cache (cross-key)NativeYesYesYes (configure scope)Edge
Semantic cacheTunablePolished UINo (roadmap)FunctionalBeta
Layered with Anthropic nativeYesYesYes (surfaces native)YesYes
Per-repo hit-rate UINativeVia metadataShallowSQL neededVia metadata
TTL + invalidationYesYesYesYesStrongest
Cost-saved $ reportingPer repo + devNativeNativeLess polishedDerivable
Self-hostBYOCOSS proxy + cloudOSS self-hostStrongestNone
Feedback loopfi.optNoNoNoNo

Decision framework: Choose X if

Choose Future AGI if you want the miss data to tune the threshold and routing policy over time. agent-opt is opt-in, turn it on once Claude Code has eval baselines and live traces flowing, and the dollar saved compounds without a human re-tuning quarterly.

Choose Portkey for the most polished hosted semantic-cache UI and a usable dollar-saved dashboard, accepting the PANW acquisition timeline. Pick this when the brief is “set up a semantic cache next week.”

Choose Helicone if the strategy is “lean on Anthropic’s native cache_control and observe it cleanly.” Fits teams under 10 developers where the Mintlify roadmap is acceptable.

Choose LiteLLM if compliance requires Claude Code traffic to never leave the VPC. Pin commits or upgrade past 1.83.7.

Choose Cloudflare AI Gateway if Claude Code already routes through Cloudflare and exact-match plus strong TTL primitives are sufficient, semantic-match isn’t binding today.


Real-number examples from production

Five Q2 2026 data points, de-identified.

  1. Cross-key exact cache (Future AGI BYOC). 30-developer monorepo team. Pre-gateway: 41% input-token hit rate. Post-gateway: 67% after cross-key sharing of the project-context layer. Monthly Anthropic spend dropped from $11,300 to $7,100, a $4,200 return.

  2. Semantic-cache false-positive caveat. 12-developer team running Portkey at threshold 0.85. Hit rate spiked to 54%. Two days later a senior engineer noticed Claude was returning stale answers for “explain function X” when X had been modified. Raised threshold to 0.93, hit rate dropped to 38%, false positives stopped. Lesson: a permissive threshold returns wrong answers, not stale ones alone.

  3. Latency on hit vs miss. Cache hit through Future AGI’s hot path: ~67ms text on top of the cached prefix. Cache miss: the same ~67ms plus the full Anthropic round-trip (800ms-2.4s). The hit is roughly 12-35x faster end-to-end, the dimension developers notice before the cost saving.

  4. Native TTL window mismatch. A LiteLLM team saw lower cross-key hit rates during morning standup. The native 5-min TTL had expired between the first developer at 9:02 AM and the second at 9:11 AM; they hadn’t configured the 1-hour cache_control option. Switching to the 1-hour TTL plus the gateway outer cache lifted morning hit rate from 49% to 71%.

  5. Cloudflare cache-key trick. A 50-developer team set cf-aig-cache-key to include the git SHA, invalidating the cache on every CI deploy that changed Claude Code’s system prompt. Without it they had reproduced a bug where Claude responded with last-week’s system-prompt behaviour because the cache served the previous prefix’s response on near-identical inputs.


Common mistakes when caching Claude Code through a gateway

MistakeWhat goes wrongFix
Semantic threshold too looseCache returns slightly-wrong answers; developers stop trusting the agentStart at 0.93, tune up not down
Caching conversation history with a long TTLStale context bleeds across sessionsTTL the conversation layer at 1-5 min; project-context layer at 1 hour
Stripping cache_control blocksPay the gateway to cache what Anthropic was caching for freeRespect cache_control and only cache on miss
Not invalidating on system-prompt deployYesterday’s system prompt serves from cacheNamespace bust or cf-aig-cache-key in CI on system-prompt change
One team key, no cross-key cache configDefault is per-key; no sharing across developersConfigure cross-key scope (Future AGI, Portkey, LiteLLM)
Tracking hit rate without tracking dollar savedEngineering and finance never reconcileReport absolute USD returned, sliced by repo and developer

How Future AGI closes the loop on Claude Code cache spend

The other four gateways treat caching as an end state: cache, report, alert. Future AGI treats the miss as the input to a feedback loop. Six stages:

  1. Trace. Every Claude Code turn produces a span tree via traceAI (Apache 2.0). Spans capture the prompt, cache decision, similarity score, model, and session ID. Miss spans are tagged.

  2. Evaluate. ai-evaluation (Apache 2.0) scores each miss. FAGI ships a 50+ built-in rubric catalog (task completion, faithfulness, code-correctness, tool-use, structured-output, hallucination, agentic surfaces, instruction-following, groundedness) plus a custom miss-classification rubric (expected new context, or unexpected, paraphrase semantic cache should have caught, or repeat content exact cache should have caught), plus unlimited custom evaluators authored end-to-end by an in-product eval-authoring agent that uses tool calling on your code, plus self-improving evaluators that learn from live production traces, plus FAGI’s proprietary classifier model family at very low cost-per-token (Galileo Luna-2 cost economics, rubric-flexible). Catalog is the floor, not the ceiling.

  3. Cluster. Unexpected misses cluster by failure mode. A common pattern: “monorepo project context differs by one trailing newline across developers, exact cache misses, semantic threshold too tight”, the canonicalisation gap becomes visible.

  4. Optimize. fi.opt.optimizers (ProTeGi, Bayesian, GEPA) re-tunes the per-route threshold or rewrites the canonicalisation rule. For Claude Code the typical optimization is a normalisation rule (strip trailing whitespace, normalise line endings).

  5. Route + re-deploy. Gateway applies the updated config on the next request; threshold + rules are versioned with automatic rollback if hit rate regresses.

A team starting at a 37% effective cross-key hit rate typically reaches 43-46% within four weeks, no developer behaviour change required.

The three building blocks are open source: traceAI, ai-evaluation, agent-opt (all Apache 2.0). Hosted Agent Command Center adds the failure-cluster view, live Protect guardrails (~67ms text latency), RBAC, SOC 2 Type II certified, and AWS Marketplace for procurement.


What we did not include

  • OpenRouter. No exact or semantic cache at the gateway layer; wrong shape for caching Claude Code calls.
  • Kong AI Gateway. AI cache features are still plugin-driven; dashboard is a Grafana-build-it-yourself story.
  • Maxim Bifrost. Apache 2.0 Go gateway with strong throughput; semantic cache UI and per-repo observability still maturing. Worth re-evaluating in Q3 2026.


Sources

  • Anthropic prompt caching docs, cache_control blocks, 5-min and 1-hr TTLs, ~90% input cost reduction on hits, docs.anthropic.com
  • Anthropic Claude Code docs, claude.ai/docs/claude-code
  • Future AGI Agent Command Center, futureagi.com/platform/monitor/command-center
  • Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (67ms text, 109ms image)
  • Palo Alto Networks acquisition announcement, Portkey (April 30, 2026)
  • Mintlify acquisition, Helicone (March 3, 2026)
  • Datadog Security Labs writeup, LiteLLM TeamPCP PyPI compromise (March 24, 2026), versions 1.82.7 and 1.82.8
  • Cloudflare AI Gateway, developers.cloudflare.com/ai-gateway

Frequently asked questions

Does Anthropic's native prompt caching make a gateway redundant?
No. `cache_control` caches against a single API key's recent window. A gateway adds cross-key sharing, semantic-match, and per-repo/per-developer observability. The right architecture is the inner native cache plus the gateway's outer cache.
What semantic-similarity threshold should I use for Claude Code?
Start at 0.93 and tune up, not down. 0.85 returns slightly-wrong answers when the prompt has changed in ways the embedding didn't capture. 0.93 catches obvious paraphrases without serving stale answers for modified code.
How much can a gateway-layer cache cut Claude Code spend on top of Anthropic native?
In our Q2 2026 cohort across 22 teams, the cross-key cache adds 18-28 percentage points on top of the native cache — 25-45% additional monthly spend reduction. High end: single-monorepo teams. Low end: teams where each developer works in a different repo.
Does the gateway cache return wrong answers?
Exact-match returns the same answer for the same prefix; the only 'wrong' case is stale (cache not invalidated on system-prompt change). Semantic-match returns wrong answers when the threshold is too loose. Defence: tuned threshold (start 0.93), held-out eval, TTL hygiene on relevant deploys.
What happens to `cache_control` when Claude Code runs through a gateway?
All five gateways pass `cache_control` through unchanged as of May 2026. Stripping it would forfeit Anthropic's 90% input cost reduction on hit. Tested with `claude-opus-4-7` and `claude-sonnet-4-6`.
Is it safe to send source code through a hosted AI gateway's cache?
The cached response contains code Claude generated in response to your code. For Future AGI hosted, Portkey, and Cloudflare, the cache sits on the provider's data plane — check residency and BAA. For self-hosted (LiteLLM, Future AGI BYOC, Helicone OSS), the cache stays in your VPC.
How is Future AGI different from Portkey for Claude Code caching?
Portkey is a hosted observation + cache layer with a polished threshold UI. Future AGI adds an optimization layer — miss data feeds back into threshold tuning, canonicalisation, and routing updates. Portkey gives you a cache and a dashboard; Future AGI gives you a cache, a dashboard, and a loop.
Related Articles
View all
Top 5 Tools for Claude Code Cost Management in 2026
Guides

Five tools for Claude Code cost management in 2026 — four gateways plus the native Anthropic dashboard and a FinOps platform — scored on attribution, chargeback, caps, routing, cache observability, FinOps integration, and audit trail.

NVJK Kartik
NVJK Kartik ·
18 min
Stay updated on AI observability

Get weekly insights on building reliable AI systems. No spam.