What Is a Prompt Cache?
A prompt cache is an LLM-gateway cache that stores model responses keyed by the request prompt and returns them on a subsequent matching call. The default and simplest form is an exact prompt cache — also called an exact-cache — that hits on byte-for-byte identical prompts, typically using a SHA-256 hash of the canonicalised request as the key. A semantic cache is a related but distinct technique that hits on similar prompts via embeddings. Both forms cut cost and latency. FutureAGI’s Agent Command Center exposes both as the exact-cache and semantic-cache layers.
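The keying idea, as a minimal sketch in generic Python (an illustration of the approach, not the gateway's store.go implementation): canonicalise the request fields that influence the completion, then hash the result.

import hashlib
import json

def cache_key(model, messages, temperature, max_tokens, tools=None):
    # Canonicalise: keep only the fields that affect the output and serialise
    # them deterministically so byte-identical requests produce the same key.
    payload = json.dumps(
        {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "tools": tools or [],
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()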
Why it matters in production LLM/agent systems
Production traffic has three reasons to repeat:
- System-prompt repetition. Every user message in a chatbot ships the same multi-thousand-token system prompt. Without caching, you pay for that prompt every turn.
- Bot-to-bot churn. Health checks, smoke tests, and synthetic monitors hit the same prompts thousands of times per day.
- User overlap. “What are your business hours?” gets asked tens of thousands of times by different users with identical phrasing.
For a chatbot doing 100K calls/day at $0.005/call, even a 15% exact-cache hit rate saves $75/day and shaves 800–3000ms off the cached responses. For agent systems where a planner repeatedly issues the same retrieval prompt or tool-call template, prompt caching can drop p99 latency on the hot path by 5–10×.
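The arithmetic behind the chatbot figure, as a quick sketch (the traffic, per-call cost, and hit rate are the illustrative numbers above, not measurements):

calls_per_day = 100_000
cost_per_call_usd = 0.005
exact_hit_rate = 0.15

saved_per_day_usd = calls_per_day * exact_hit_rate * cost_per_call_usd
print(saved_per_day_usd)  # 75.0 USD of provider spend avoided per day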
The risk: caching creates staleness. A prompt that depends on now() or on a freshly updated knowledge base must skip the cache, or app code must invalidate it. Production systems handle this with Cache-Control: no-store headers and per-namespace TTLs.
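For example, a prompt whose answer depends on the current time can opt out per call. A sketch using the OpenAI SDK pointed at the gateway (the base URL and API key are illustrative placeholders):

from openai import OpenAI

# Both values are illustrative; point the SDK at your gateway deployment.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="gateway")

# Time-sensitive prompt: tell the gateway not to read from or write to the cache.
client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What time is it in UTC right now?"}],
    extra_headers={"Cache-Control": "no-store"},
)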
We’ve found that the second-order failure is more subtle than staleness: caches mask regressions. A model rollout ships a worse system prompt, but cache hits keep serving the old, better answer for hours; the team then attributes the eventual quality drop to the wrong commit. In our 2026 evals, teams that ran a Groundedness sample on cached versus live traffic caught these regressions a full release cycle earlier than teams that only watched hit rate.
How FutureAGI handles it
FutureAGI’s Agent Command Center exposes prompt caching as two cooperating layers:
- Exact-cache (internal/cache/store.go) — an LRU + TTL store keyed by a hash of the canonicalised request (model, messages, temperature, max_tokens, tool definitions). Backends: in-memory, Redis, S3, GCS, Azure Blob, or local disk.
- Semantic-cache (internal/cache/semantic.go) — embedding-similarity lookup with a tunable cosine threshold (sketched below). Backends: Pinecone, Qdrant, Weaviate, in-memory.
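A minimal sketch of the semantic-lookup idea, independent of any particular vector backend (the embed function and in-memory store are placeholders, not the gateway's semantic.go internals):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_lookup(prompt, store, embed, threshold=0.92):
    # store: list of (embedding, cached_response) pairs; embed: text -> vector.
    query = embed(prompt)
    best_score, best_response = 0.0, None
    for vector, response in store:
        score = cosine(query, vector)
        if score > best_score:
            best_score, best_response = score, response
    # Serve the cached response only if similarity clears the configured threshold.
    return best_response if best_score >= threshold else None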
Configuration:
cache:
  enabled: true
  default_ttl: 5m
  max_entries: 10000
  semantic:
    enabled: true
    threshold: 0.92
    backend: "qdrant"
Per-request headers override at runtime:
- x-agentcc-cache-ttl: 10m — bump TTL for a hot route.
- x-agentcc-cache-namespace: prod — isolate cache by tenant (example after this list).
- x-agentcc-cache-force-refresh: true — bypass read, refresh entry.
- Cache-Control: no-store — skip caching entirely for this call.
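For instance, a hot FAQ route can carry a longer TTL while keeping each tenant's entries isolated; a sketch reusing the gateway-pointed client from the earlier example (the tenant identifier is illustrative):

tenant_id = "acme"  # illustrative tenant identifier

client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What are your business hours?"}],
    extra_headers={
        "x-agentcc-cache-namespace": f"tenant-{tenant_id}",  # keep tenants' caches separate
        "x-agentcc-cache-ttl": "10m",  # hold entries longer on this hot route
    },
)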
The cache layer runs after pre-guardrail and before provider routing. Trace spans carry agentcc.cache.layer (exact or semantic), agentcc.cache.hit, and agentcc.cache.namespace, joining the rest of the traceAI tree. Compared with Anthropic’s prompt-caching feature — which caches at the provider layer per their API — FutureAGI’s prompt-cache works across providers and survives provider switches mid-conversation. Pair it with model_fallbacks, and a request that failed over to Claude can still serve cached responses on the next identical call.
How to measure or detect it
Track prompt-cache health with:
- Exact-cache hit rate — the simpler metric. Baseline: 5–15% for support bots, 50%+ for synthetic monitors.
- Semantic-cache hit rate — separate counter. Baseline: 25–45% on natural-language traffic.
- Total cost avoided — cost_usd_avoided = hits × avg_per_call_cost, surfaced on the cost dashboard (sketched after this list).
- TTL-eviction rate — entries leaving the cache before being hit. High eviction = TTL too short or working set too large.
- Per-namespace hit-rate breakdown — surfaces tenants whose traffic shape kills the cache.
- Cached-response eval-fail-rate — sampled Groundedness or AnswerRelevancy against cache hits to catch silent regressions after a prompt-template change.
- agentcc.cache.layer distribution — ratio of exact vs semantic hits; helps tune the semantic threshold.
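A back-of-the-envelope sketch that turns raw counters into the first few metrics above (the counter names, and the assumption that each miss writes one entry, are illustrative rather than the dashboard's schema):

def cache_health(exact_hits, semantic_hits, misses, ttl_evictions, avg_per_call_cost):
    total = exact_hits + semantic_hits + misses
    return {
        "exact_hit_rate": exact_hits / total,
        "semantic_hit_rate": semantic_hits / total,
        "cost_usd_avoided": (exact_hits + semantic_hits) * avg_per_call_cost,
        # Evictions relative to entries written; a high ratio suggests the TTL
        # is too short or the working set is larger than max_entries.
        "ttl_eviction_rate": ttl_evictions / max(misses, 1),
    }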
# Force-refresh a hot route (e.g. after a prompt-template change),
# using the same gateway-pointed OpenAI client as in the sketches above:
client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_headers={"x-agentcc-cache-force-refresh": "true"},
)
Common mistakes
- Conflating prompt-cache with semantic-cache. Exact-match and embedding-similarity are different mechanisms with different failure modes.
- Caching responses that include timestamps or fresh data without a short TTL or a Cache-Control: no-store header.
- Caching tool-calling responses. The cached call may issue a stale tool with stale arguments.
- Caching across tenants without namespacing — the simplest way to leak data between customers.
- Reading hit rate as a quality signal. Hit rate measures cache coverage, not cache correctness; pair with sampled-eval on cached responses.
Frequently Asked Questions
What is a prompt cache?
A prompt cache is a gateway-level cache that stores LLM responses keyed by the request prompt. The exact form hits on identical prompts; the semantic form hits on similar prompts via embeddings.
Is a prompt cache the same as a semantic cache?
No. The default prompt cache is exact-match — it hits only on byte-identical prompts. A semantic cache uses embeddings and a similarity threshold to hit on rephrasings.
Where does prompt-cache run in FutureAGI?
Agent Command Center implements an exact-cache as a hash-keyed LRU/TTL store and a separate semantic-cache. Both run before the provider call and can be controlled via x-agentcc-cache-* headers.