What Is a Prompt Cache?

A prompt cache is an LLM-gateway cache that stores model responses keyed by the request prompt and returns them on a subsequent matching call. The default and simplest form is an exact prompt cache — also called an exact-cache — that hits on byte-for-byte identical prompts, typically using a SHA-256 hash of the canonicalised request as the key. A semantic cache is a related but distinct technique that hits on similar prompts via embeddings. Both forms cut cost and latency. FutureAGI’s Agent Command Center exposes both as the exact-cache and semantic-cache layers.
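
The exact-cache key described above can be sketched as follows. The field set and canonicalisation shown here are illustrative, not the gateway's exact scheme:

```python
import hashlib
import json

def exact_cache_key(model, messages, temperature, max_tokens, tools=None):
    """Build an exact-cache key: SHA-256 of the canonicalised request.

    Canonicalisation (sorted keys, no whitespace) ensures that two requests
    differing only in JSON field order hash to the same key.
    """
    canonical = json.dumps(
        {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "tools": tools or [],
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Note that any sampling parameter change (e.g. temperature) produces a different key, so an exact cache never serves a response generated under different settings.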

Why it matters in production LLM/agent systems

Production traffic repeats for three reasons:

  • System-prompt repetition. Every user message in a chatbot ships the same multi-thousand-token system prompt. Without caching, you pay for that prompt every turn.
  • Bot-to-bot churn. Health checks, smoke tests, and synthetic monitors hit the same prompts thousands of times per day.
  • User overlap. “What are your business hours?” gets asked tens of thousands of times by different users with identical phrasing.

For a chatbot doing 100K calls/day at $0.005/call, even a 15% exact-cache hit rate saves $75/day and shaves 800–3000ms off the cached responses. For agent systems where a planner repeatedly issues the same retrieval prompt or tool-call template, prompt caching can drop p99 latency on the hot path by 5–10×.
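
The savings arithmetic above, as a back-of-envelope sketch:

```python
# Back-of-envelope savings for the chatbot example above.
calls_per_day = 100_000
cost_per_call = 0.005   # USD per call
hit_rate = 0.15         # 15% exact-cache hit rate

daily_spend = calls_per_day * cost_per_call   # $500/day without caching
daily_savings = daily_spend * hit_rate        # $75/day served from cache
```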

The risk: caching creates staleness. A prompt whose answer depends on now() or on a freshly updated knowledge base must skip the cache, or application code must invalidate the entry. Production systems handle this with Cache-Control: no-store on time-sensitive calls and per-namespace TTLs.

We’ve found that the second-order failure is more subtle than staleness: caches mask regressions. A model rollout ships a worse system prompt, but cache hits keep serving the old, better answer for hours; the team then attributes the eventual quality drop to the wrong commit. In our 2026 evals, teams that ran a Groundedness sample on cached versus live traffic caught these regressions a full release cycle earlier than teams that only watched hit rate.

How FutureAGI handles it

FutureAGI’s Agent Command Center exposes prompt caching as two cooperating layers:

  1. Exact-cache (internal/cache/store.go) — an LRU + TTL store keyed by a hash of the canonicalised request (model, messages, temperature, max_tokens, tool definitions). Backends: in-memory, Redis, S3, GCS, Azure Blob, or local disk.
  2. Semantic-cache (internal/cache/semantic.go) — embedding-similarity lookup with a tunable cosine threshold. Backends: Pinecone, Qdrant, Weaviate, in-memory.
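
The semantic layer's threshold lookup can be sketched against a flat in-memory list of (embedding, response) pairs — a stand-in for the real vector backends:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_lookup(query_vec, entries, threshold=0.92):
    """Return the cached response whose prompt embedding is most similar
    to the query, provided the similarity clears the threshold; else None."""
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```

Raising the threshold trades hit rate for safety: at 0.92 only close rephrasings hit; lower values start matching prompts that merely share a topic.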

Configuration:

cache:
  enabled: true
  default_ttl: 5m
  max_entries: 10000
  semantic:
    enabled: true
    threshold: 0.92
    backend: "qdrant"
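
For intuition, a minimal in-memory analogue of the exact-cache layer — an LRU + TTL sketch, not the Go implementation in internal/cache/store.go:

```python
import time
from collections import OrderedDict

class ExactCache:
    """LRU + TTL store: entries past max_entries are evicted least-recently-
    used first, and entries older than ttl_seconds are treated as misses."""

    def __init__(self, max_entries=10_000, ttl_seconds=300):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._store = OrderedDict()  # key -> (expires_at, response)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if now >= expires_at:
            del self._store[key]      # expired: treat as miss
            return None
        self._store.move_to_end(key)  # refresh LRU position
        return response

    def put(self, key, response, now=None):
        now = time.monotonic() if now is None else now
        self._store[key] = (now + self.ttl, response)
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least-recently-used
```

The config above maps directly onto this sketch: default_ttl becomes ttl_seconds and max_entries bounds the LRU.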

Per-request headers override at runtime:

  • x-agentcc-cache-ttl: 10m — bump TTL for a hot route.
  • x-agentcc-cache-namespace: prod — isolate cache by tenant.
  • x-agentcc-cache-force-refresh: true — bypass read, refresh entry.
  • Cache-Control: no-store — skip caching entirely for this call.

The cache layer runs after pre-guardrail and before provider routing. Trace spans carry agentcc.cache.layer (exact or semantic), agentcc.cache.hit, and agentcc.cache.namespace, joining the rest of the traceAI tree. Compared with Anthropic’s prompt-caching feature — which caches at the provider layer per their API — FutureAGI’s prompt-cache works across providers and survives provider switches mid-conversation. Pair it with model_fallbacks and a request that failed over to Claude can still serve cached responses on the next identical call.

How to measure or detect it

Track prompt-cache health with:

  • Exact-cache hit rate — the simpler metric. Baseline: 5–15% for support bots, 50%+ for synthetic monitors.
  • Semantic-cache hit rate — separate counter. Baseline: 25–45% on natural-language traffic.
  • Total cost avoided — cost_usd_avoided = hits × avg_per_call_cost, surfaced on the cost dashboard.
  • TTL-eviction rate — entries leaving the cache before being hit. High eviction = TTL too short or working set too large.
  • Per-namespace hit-rate breakdown — surfaces tenants whose traffic shape kills the cache.
  • Cached-response eval-fail-rate — sampled Groundedness or AnswerRelevancy against cache hits to catch silent regressions after a prompt-template change.
  • agentcc.cache.layer distribution — ratio of exact vs semantic hits; helps tune the semantic threshold.

# Force-refresh a hot route (e.g. after a prompt-template change).
# Assumes an OpenAI-compatible client pointed at the gateway.
from openai import OpenAI

client = OpenAI(base_url="...", api_key="...")  # gateway endpoint

client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_headers={"x-agentcc-cache-force-refresh": "true"},
)
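
The counter-derived metrics above can be computed with a helper like this (counter names are illustrative, not the gateway's metric names):

```python
def cache_health(hits, misses, evictions_unhit, avg_per_call_cost):
    """Derive cache-health metrics from raw counters.

    evictions_unhit counts entries that expired or were LRU-evicted
    before ever being read -- the TTL-eviction rate numerator.
    """
    lookups = hits + misses
    ever_stored = evictions_unhit + hits
    return {
        "hit_rate": hits / lookups if lookups else 0.0,
        "cost_usd_avoided": hits * avg_per_call_cost,
        "ttl_eviction_rate": evictions_unhit / ever_stored if ever_stored else 0.0,
    }
```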

Common mistakes

  • Conflating prompt-cache with semantic-cache. Exact-match and embedding-similarity are different mechanisms with different failure modes.
  • Caching responses that include timestamps or fresh data without a short TTL or a Cache-Control: no-store header.
  • Caching tool-calling responses. A cache hit replays the recorded tool call — potentially with stale arguments — instead of letting the model decide.
  • Caching across tenants without namespacing — the simplest way to leak data between customers.
  • Reading hit rate as a quality signal. Hit rate measures cache coverage, not cache correctness; pair with sampled-eval on cached responses.

Frequently Asked Questions

What is a prompt cache?

A prompt cache is a gateway-level cache that stores LLM responses keyed by the request prompt. The exact form hits on identical prompts; the semantic form hits on similar prompts via embeddings.

Is a prompt cache the same as a semantic cache?

No. The default prompt cache is exact-match — it hits only on byte-identical prompts. A semantic cache uses embeddings and a similarity threshold to hit on rephrasings.

Where does prompt-cache run in FutureAGI?

Agent Command Center implements an exact-cache as a hash-keyed LRU/TTL store and a separate semantic-cache. Both run before the provider call and can be controlled via x-agentcc-cache-* headers.