What Is a Prompt Cache?

A prompt cache is an LLM-gateway cache that stores model responses keyed by the request prompt and returns them on a subsequent matching call. The default and simplest form is an exact prompt cache — also called an exact-cache — that hits on byte-for-byte identical prompts, typically using a SHA-256 hash of the canonicalised request as the key. A semantic cache is a related but distinct technique that hits on similar prompts via embeddings. Both forms cut cost and latency. FutureAGI’s Agent Command Center exposes both as the exact-cache and semantic-cache layers.
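
The exact-cache key described above can be sketched as follows. The field set and canonicalisation shown here are illustrative, not the gateway's exact scheme:

```python
import hashlib
import json

def exact_cache_key(model, messages, temperature, max_tokens, tools=None):
    """Build an exact-cache key: SHA-256 of the canonicalised request.

    Canonicalisation (sorted keys, no whitespace) ensures that two requests
    differing only in JSON field order hash to the same key.
    """
    canonical = json.dumps(
        {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "tools": tools or [],
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Note that any sampling parameter change (e.g. temperature) produces a different key, so an exact cache never serves a response generated under different settings.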

Why it matters in production LLM/agent systems

Production traffic repeats for three reasons:

  • System-prompt repetition. Every user message in a chatbot ships the same multi-thousand-token system prompt. Without caching, you pay for that prompt every turn.
  • Bot-to-bot churn. Health checks, smoke tests, and synthetic monitors hit the same prompts thousands of times per day.
  • User overlap. “What are your business hours?” gets asked tens of thousands of times by different users with identical phrasing.

For a chatbot doing 100K calls/day at $0.005/call, even a 15% exact-cache hit rate saves $75/day and shaves 800–3000ms off the cached responses. For agent systems where a planner repeatedly issues the same retrieval prompt or tool-call template, prompt caching can drop p99 latency on the hot path by 5–10×.
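
The savings arithmetic above, as a back-of-envelope sketch:

```python
# Back-of-envelope savings for the chatbot example above.
calls_per_day = 100_000
cost_per_call = 0.005   # USD per call
hit_rate = 0.15         # 15% exact-cache hit rate

daily_spend = calls_per_day * cost_per_call   # $500/day without caching
daily_savings = daily_spend * hit_rate        # $75/day served from cache
```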

The risk: caching creates staleness. A prompt whose answer depends on now() or on a freshly updated knowledge base must skip the cache, or application code must invalidate the entry. Production systems handle this with Cache-Control: no-store on time-sensitive calls and per-namespace TTLs.

We’ve found that the second-order failure is more subtle than staleness: caches mask regressions. A model rollout ships a worse system prompt, but cache hits keep serving the old, better answer for hours; the team then attributes the eventual quality drop to the wrong commit. In our 2026 evals, teams that ran a Groundedness sample on cached versus live traffic caught these regressions a full release cycle earlier than teams that only watched hit rate.

How FutureAGI handles it

FutureAGI’s Agent Command Center exposes prompt caching as two cooperating layers:

  1. Exact-cache (internal/cache/store.go) — an LRU + TTL store keyed by a hash of the canonicalised request (model, messages, temperature, max_tokens, tool definitions). Backends: in-memory, Redis, S3, GCS, Azure Blob, or local disk.
  2. Semantic-cache (internal/cache/semantic.go) — embedding-similarity lookup with a tunable cosine threshold. Backends: Pinecone, Qdrant, Weaviate, in-memory.
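
The semantic layer's threshold lookup can be sketched against a flat in-memory list of (embedding, response) pairs — a stand-in for the real vector backends:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_lookup(query_vec, entries, threshold=0.92):
    """Return the cached response whose prompt embedding is most similar
    to the query, provided the similarity clears the threshold; else None."""
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```

Raising the threshold trades hit rate for safety: at 0.92 only close rephrasings hit; lower values start matching prompts that merely share a topic.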

Configuration:

cache:
  enabled: true
  default_ttl: 5m
  max_entries: 10000
  semantic:
    enabled: true
    threshold: 0.92
    backend: "qdrant"
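
For intuition, a minimal in-memory analogue of the exact-cache layer — an LRU + TTL sketch, not the Go implementation in internal/cache/store.go:

```python
import time
from collections import OrderedDict

class ExactCache:
    """LRU + TTL store: entries past max_entries are evicted least-recently-
    used first, and entries older than ttl_seconds are treated as misses."""

    def __init__(self, max_entries=10_000, ttl_seconds=300):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._store = OrderedDict()  # key -> (expires_at, response)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if now >= expires_at:
            del self._store[key]      # expired: treat as miss
            return None
        self._store.move_to_end(key)  # refresh LRU position
        return response

    def put(self, key, response, now=None):
        now = time.monotonic() if now is None else now
        self._store[key] = (now + self.ttl, response)
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least-recently-used
```

The config above maps directly onto this sketch: default_ttl becomes ttl_seconds and max_entries bounds the LRU.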

Per-request headers override at runtime:

  • x-agentcc-cache-ttl: 10m — bump TTL for a hot route.
  • x-agentcc-cache-namespace: prod — isolate cache by tenant.
  • x-agentcc-cache-force-refresh: true — bypass read, refresh entry.
  • Cache-Control: no-store — skip caching entirely for this call.

The cache layer runs after pre-guardrail and before provider routing. Trace spans carry agentcc.cache.layer (exact or semantic), agentcc.cache.hit, and agentcc.cache.namespace, joining the rest of the traceAI tree. Compared with Anthropic’s prompt-caching feature — which caches at the provider layer per their API — FutureAGI’s prompt-cache works across providers and survives provider switches mid-conversation. Pair it with model_fallbacks and a request that failed over to Claude can still serve cached responses on the next identical call.

How to measure or detect it

Track prompt-cache health with:

  • Exact-cache hit rate — the simpler metric. Baseline: 5–15% for support bots, 50%+ for synthetic monitors.
  • Semantic-cache hit rate — separate counter. Baseline: 25–45% on natural-language traffic.
  • Total cost avoided — cost_usd_avoided = hits × avg_per_call_cost, surfaced on the cost dashboard.
  • TTL-eviction rate — entries leaving the cache before being hit. High eviction = TTL too short or working set too large.
  • Per-namespace hit-rate breakdown — surfaces tenants whose traffic shape kills the cache.
  • Cached-response eval-fail-rate — sampled Groundedness or AnswerRelevancy against cache hits to catch silent regressions after a prompt-template change.
  • agentcc.cache.layer distribution — ratio of exact vs semantic hits; helps tune the semantic threshold.

# Force-refresh a hot route (e.g. after a prompt-template change).
# Assumes an OpenAI-compatible client pointed at the gateway.
from openai import OpenAI

client = OpenAI(base_url="...", api_key="...")  # gateway endpoint

client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_headers={"x-agentcc-cache-force-refresh": "true"},
)
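
The counter-derived metrics above can be computed with a helper like this (counter names are illustrative, not the gateway's metric names):

```python
def cache_health(hits, misses, evictions_unhit, avg_per_call_cost):
    """Derive cache-health metrics from raw counters.

    evictions_unhit counts entries that expired or were LRU-evicted
    before ever being read -- the TTL-eviction rate numerator.
    """
    lookups = hits + misses
    ever_stored = evictions_unhit + hits
    return {
        "hit_rate": hits / lookups if lookups else 0.0,
        "cost_usd_avoided": hits * avg_per_call_cost,
        "ttl_eviction_rate": evictions_unhit / ever_stored if ever_stored else 0.0,
    }
```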

Common mistakes

  • Conflating prompt-cache with semantic-cache. Exact-match and embedding-similarity are different mechanisms with different failure modes.
  • Caching responses that include timestamps or fresh data without a short TTL or a Cache-Control: no-store header.
  • Caching tool-calling responses. A cache hit replays the recorded tool call — potentially with stale arguments — instead of letting the model decide.
  • Caching across tenants without namespacing — the simplest way to leak data between customers.
  • Reading hit rate as a quality signal. Hit rate measures cache coverage, not cache correctness; pair with sampled-eval on cached responses.

Frequently Asked Questions

What is a prompt cache?

A prompt cache is a gateway-level cache that stores LLM responses keyed by the request prompt. The exact form hits on identical prompts; the semantic form hits on similar prompts via embeddings.

Is a prompt cache the same as a semantic cache?

No. The default prompt cache is exact-match — it hits only on byte-identical prompts. A semantic cache uses embeddings and a similarity threshold to hit on rephrasings.

Where does prompt-cache run in FutureAGI?

Agent Command Center implements an exact-cache as a hash-keyed LRU/TTL store and a separate semantic-cache. Both run before the provider call and can be controlled via x-agentcc-cache-* headers.