What Is a Semantic Cache?

A semantic cache is an LLM-gateway cache that returns a previously stored response when a new prompt is similar enough to a cached one — not just identical. The gateway embeds the new prompt, runs a cosine-similarity search against stored embeddings, and returns the cached response if similarity crosses a configured threshold (typically 0.85–0.95). It hits where an exact prompt cache would miss: rephrasings, casing changes, and minor punctuation differences. FutureAGI’s Agent Command Center ships a semantic-cache backed by Pinecone, Qdrant, Weaviate, or an in-memory store.
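A minimal sketch of the lookup path, assuming a placeholder embed() client and an in-memory store (the shipped implementation lives in internal/cache/semantic.go and also supports the vector backends above):

import numpy as np

THRESHOLD = 0.92  # tunable per route; see the config example below

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(prompt: str, store: list[tuple[np.ndarray, str]], embed) -> str | None:
    """Return a cached response if the best match crosses the threshold."""
    query = embed(prompt)  # embed() is a placeholder embedding client
    best_sim, best_resp = -1.0, None
    for vector, response in store:  # the real cache also keys the store per model
        sim = cosine(query, vector)
        if sim > best_sim:
            best_sim, best_resp = sim, response
    return best_resp if best_sim >= THRESHOLD else None  # a miss falls through to the provider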

Why it matters in production LLM/agent systems

Production traffic isn’t deterministic. A support bot sees “how do I reset my password” and “how can I reset my password please” and “password reset?” — all the same question, none with a byte-identical prompt. An exact-cache hits 0% on this traffic. A semantic-cache, properly tuned, hits 25–45%.

Three concrete impacts:

  • Cost — every cache hit is a saved provider call. For a support bot doing 100K calls/day at $0.005/call, a 35% hit rate saves $175/day (the arithmetic is sketched after this list).
  • Latency — cache hits return in 5–20ms instead of 800–3000ms. p99 latency drops sharply.
  • Reliability — when the upstream provider 5xx’s, recent semantic-cache hits keep the bot answering the easy questions.
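The cost line is plain arithmetic; a quick sketch with the numbers from the first bullet:

calls_per_day = 100_000
cost_per_call = 0.005  # USD per provider call
hit_rate = 0.35        # tuned semantic-cache hit rate

saved_per_day = calls_per_day * hit_rate * cost_per_call
print(f"${saved_per_day:,.0f}/day saved")  # $175/day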

The risk is the mirror image of the benefit: a threshold that is too loose returns the wrong answer. “How do I cancel my subscription?” and “how do I keep my subscription?” embed close together; a 0.80 threshold can collapse them. For agent systems running multi-step pipelines, a wrong cached answer at step 1 cascades through the whole trajectory. Threshold tuning is the operational discipline.

How FutureAGI handles it

FutureAGI’s Agent Command Center exposes the semantic-cache as a configurable layer in front of every chat-completion call. The implementation in internal/cache/semantic.go stores entries keyed by (model, vector) and runs a cosine-similarity scan at request time:

cache:
  enabled: true
  semantic:
    enabled: true
    threshold: 0.92          # default 0.85; raise for risk-averse routes
    dims: 256                # embedding dimensions
    max_entries: 50000
    backend: "qdrant"        # or pinecone, weaviate, memory
    namespace: "support-bot"
    ttl: 1h

The cache is keyed per model so a gpt-4o cache hit never returns to a claude-sonnet-4 request. Per-request headers — x-agentcc-cache-ttl, x-agentcc-cache-namespace, x-agentcc-cache-force-refresh — let app code override on the fly. Trace spans carry agentcc.cache.hit, agentcc.cache.similarity, and agentcc.cache.namespace, so a cache-hit-rate dashboard explains itself.
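A sketch of the per-request overrides, assuming the gateway speaks an OpenAI-compatible chat-completions API (the header names come from the list above; the URL, payload shape, and header value formats are assumptions):

import requests

resp = requests.post(
    "https://gateway.example.internal/v1/chat/completions",  # placeholder gateway URL
    headers={
        "x-agentcc-cache-namespace": "support-bot",
        "x-agentcc-cache-ttl": "30m",             # override the configured 1h TTL
        "x-agentcc-cache-force-refresh": "true",  # skip the lookup, repopulate on return
    },
    json={
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "how do I reset my password"}],
    },
)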

The product moat compared with a thin LiteLLM wrapper is the eval integration. Pre- and post-guardrails still run on cache hits — ProtectFlash filters injection attempts even when the answer is cached, and a HallucinationScore post-guardrail validates whether the cached response is still appropriate for the new prompt at the boundary similarity. FutureAGI’s evaluation surface tracks cached=true in trace metadata, so regression evals can check whether semantic-cache hits degrade quality at a given threshold.
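The boundary check can be sketched as a gate in front of the cached response. HallucinationScore here is assumed to follow the same evaluate(input=..., output=...) interface as the AnswerRelevancy snippet further down; the pass/fail reading of its return value is also an assumption:

from fi.evals import HallucinationScore  # interface assumed; see lead-in

BOUNDARY = 0.94  # hits below this similarity get a post-guardrail pass

def serve_cache_hit(new_prompt: str, cached_response: str, similarity: float):
    """Serve confident hits directly; re-validate boundary hits before serving."""
    if similarity >= BOUNDARY:
        return cached_response
    score = HallucinationScore().evaluate(input=new_prompt, output=cached_response)
    passed = bool(score)  # pass/fail mapping is an assumption
    return cached_response if passed else None  # None falls through to the provider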

We’ve found that the threshold conversation is rarely a one-shot tuning exercise. In our 2026 evals, the same semantic-cache that hit 38% safely on a tone-rephraser route returned wrong refunds at 0.88 on a billing route — same threshold, different content boundary. The fix is per-route thresholds plus a sampled Groundedness eval on cached responses, run weekly. Unlike generic Redis-backed semantic caches that ship with a single global threshold, Agent Command Center binds the threshold to the route, the model, and the namespace, so the operator can raise the bar on regulated routes without touching the chat-bot route.
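A sketch of the per-route binding, using the routes and values mentioned in this section (the resolution logic is illustrative, not the shipped implementation):

ROUTE_THRESHOLDS = {
    "tone-rephraser": 0.80,  # low-risk: a near-miss rephrasing is cheap
    "support-bot":    0.92,
    "billing":        0.96,  # refund and policy answers need near-exact matches
}
DEFAULT_THRESHOLD = 0.92

def threshold_for(route: str) -> float:
    """Regulated routes raise the bar without touching the chat-bot route."""
    return ROUTE_THRESHOLDS.get(route, DEFAULT_THRESHOLD)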

How to measure or detect it

Operate the semantic-cache against these signals:

  • Hit rate — exact-cache hits + semantic-cache hits, segmented by route. Healthy support-bot rates: 30–45%.
  • Threshold-crossing distribution — histogram of similarity scores for cache hits (a sketch for building it from trace spans follows the eval snippet below). If the bulk lives at 0.85–0.88, the threshold is borderline; if at 0.95+, you can lower it for more hits.
  • False-positive sample — sample 1% of cache hits and run a Groundedness or AnswerRelevancy evaluator on the (new prompt, cached response) pair.
  • Cost saved — cost_usd_avoided, derived from cache-hit count × per-call cost.
  • Cache backend latency — p99 of the vector-search call. Past 50ms, the cache hurts more than it helps.
For example, on one sampled pair (placeholder values):

from fi.evals import AnswerRelevancy

# Run periodically on a sample of cache hits to validate the threshold.
new_prompt = "how can I reset my password please"                # sampled live prompt
cached_response = "Go to Settings > Security > Reset password."  # response the cache served
score = AnswerRelevancy().evaluate(
    input=new_prompt,
    output=cached_response,
)
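The threshold-crossing histogram can be built directly from the trace attributes listed earlier; a minimal sketch, assuming spans are exported as plain dicts keyed by attribute name:

from collections import Counter

def similarity_histogram(spans: list[dict]) -> Counter:
    """Bucket agentcc.cache.similarity for cache hits into 0.01-wide bins."""
    buckets = Counter()
    for span in spans:
        if span.get("agentcc.cache.hit"):
            buckets[round(span["agentcc.cache.similarity"], 2)] += 1
    return buckets

A bulk of hits piling up just above the threshold is the signal to raise it, or to sample that band for a false-positive check.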

Common mistakes

  • Picking a single threshold and walking away. Tune per route — a billing-policy bot needs 0.95+; a tone-of-voice rephraser can run at 0.80.
  • Caching across models. A gpt-4o-mini answer is rarely safe to return for a gpt-4o request.
  • Caching at the user level instead of namespacing by tenant. Cross-tenant cache hits are a data-leakage incident waiting to happen.
  • Using semantic-cache for tool-calling responses. Cached tool calls execute stale parameters.
  • Not invalidating the cache after a prompt-template change. The cache keeps serving answers shaped by the old prompt for as long as the TTL allows (one common fix is sketched below).
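For the last item, one common fix (a sketch of a general technique, not Agent Command Center's documented behavior) is to fold a hash of the prompt template into the cache namespace, so a template change invalidates by construction:

import hashlib

def cache_namespace(base: str, prompt_template: str) -> str:
    """A new template version yields a new namespace, so stale entries never match."""
    version = hashlib.sha256(prompt_template.encode()).hexdigest()[:8]
    return f"{base}:{version}"

# cache_namespace("support-bot", TEMPLATE_V2) != cache_namespace("support-bot", TEMPLATE_V1)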

Frequently Asked Questions

What is a semantic cache?

A semantic cache is a gateway-level cache that returns a stored LLM response when a new prompt's embedding is similar enough — by cosine similarity — to a cached prompt's embedding.

How is a semantic cache different from a prompt cache?

An exact prompt cache (or exact-cache) hits only on byte-for-byte identical prompts. A semantic cache hits on rephrasings by comparing embeddings against a similarity threshold (typically 0.85–0.95).

How does FutureAGI implement semantic caching?

Agent Command Center's semantic-cache embeds the prompt, runs cosine similarity against stored vectors at a tunable threshold, and is backed by Pinecone, Qdrant, Weaviate, or in-memory vectors.