What Is Exact Caching (LLM Gateway)?
Gateway-level caching that returns a stored LLM response only when the incoming request exactly matches a canonicalized prior request.
What Is Exact Caching?
Exact caching is an LLM-gateway caching strategy that returns a saved model response only when the incoming request matches a previous request exactly. The gateway usually hashes a canonicalized request that includes the model, messages, tools, sampling settings, and tenant namespace. It shows up before provider routing in production traces, where a hit removes the model call, its token cost, and most of the latency. FutureAGI exposes it as the gateway:exact-cache surface in Agent Command Center.
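For intuition, the sketch below shows one way a canonicalized request could be hashed into an exact-cache key. The helper name, field layout, and canonicalization rules are illustrative assumptions, not the gateway's actual key schema.

```python
import hashlib
import json

def exact_cache_key(model, messages, tools, sampling, namespace):
    # Canonicalize with a stable key order and separators so that
    # identical requests always hash to the same key. (Illustrative sketch only.)
    canonical = json.dumps(
        {
            "model": model,
            "messages": messages,
            "tools": tools,
            "sampling": sampling,      # temperature, top_p, max_tokens, ...
            "namespace": namespace,    # tenant / region / retention boundary
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any field left out of this hash is a field on which two behaviorally different requests can collide.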
Why it matters in production LLM/agent systems
Exact caching has a narrow failure mode: either the cache key matches production reality, or it lies. If the key is too strict, repeated calls miss the cache and every smoke test, health check, prompt-template run, and repeated FAQ pays full provider latency and token cost again. If the key is too loose, a cached answer can escape its original context and become a stale answer, tenant leak, or wrong tool plan.
The pain lands differently by role. Developers see cache-hit-rate stuck near 0% even while logs show identical prompts. SREs see p99 latency and provider 429s rise during traffic spikes. Product teams see repeated “slow answer” feedback for questions that should be instant. Compliance teams worry when cache namespaces mix customer, region, or retention boundaries.
The common symptoms are easy to miss: cache_miss with no reason code, high token-cost-per-trace, identical prompt hashes split across models, or cache entries expiring before the second hit. In 2026-era agent pipelines, this matters beyond single chat calls. A planner may issue the same classification prompt on every step; a regression runner may replay the same 10,000 prompts after each model change; a tool-routing agent may repeat a no-op validation call. Exact caching turns those deterministic repeats into cheap control-plane events, but only if the request identity includes the fields that change behavior.
How FutureAGI handles exact-cache
In FutureAGI, exact caching lives on the Agent Command Center surface gateway:exact-cache. A common setup is a support FAQ route where deterministic prompts run through pre-guardrail, then exact-cache, then a routing policy: cost-optimized provider selection. A cache hit returns before provider routing; a miss continues to the selected model and writes the response back with the configured TTL.
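As a rough sketch of that route, assuming the key helper above, an in-memory store, and a provider callable selected by the routing policy (none of these names are the Agent Command Center API):

```python
import time

def run_exact_cache_route(request, cache, call_selected_provider, ttl_seconds=300):
    # Pre-guardrail is assumed to have already run on `request`.
    key = exact_cache_key(
        request["model"], request["messages"], request.get("tools"),
        request["sampling"], request["namespace"],
    )
    entry = cache.get(key)
    if entry is not None and entry["expires_at"] > time.time():
        return entry["response"]               # hit: return before provider routing

    response = call_selected_provider(request)  # miss: cost-optimized routing picks the model
    cache[key] = {"response": response, "expires_at": time.time() + ttl_seconds}
    return response
```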
The cache key should represent behavior, not just text. For a production chat completion, the key includes gen_ai.request.model, canonical messages, tool schemas, response format, sampling settings, cache namespace, and route ID. Trace fields such as llm.token_count.prompt and llm.token_count.completion let the engineer compute avoided token cost on hits, while cache-hit-rate by route shows whether the policy is doing useful work.
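A rough way to turn those trace fields into avoided cost on hits is sketched below; the per-token prices and the span field names are assumptions about how the trace data is exported, not a fixed schema.

```python
def avoided_cost_usd(hit_spans, prompt_price_per_1k, completion_price_per_1k):
    # Sum the tokens the provider call would have consumed on each cache hit.
    total = 0.0
    for span in hit_spans:
        total += span["llm.token_count.prompt"] / 1000 * prompt_price_per_1k
        total += span["llm.token_count.completion"] / 1000 * completion_price_per_1k
    return total
```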
FutureAGI’s approach is to pair exact-cache with trace evidence and eval sampling. If a billing FAQ has a 42% hit rate but rising support escalations, the engineer samples cached responses and runs AnswerRelevancy before raising TTL. Compared with Anthropic prompt caching, which is provider-side prompt-prefix reuse, Agent Command Center exact-cache is gateway-owned response reuse. That distinction matters: provider prompt caching can reduce input processing, while exact-cache can skip the provider call entirely when the full canonical request matches.
How to measure or detect it
Measure exact caching as a routing primitive, not just a storage feature:
- Exact-cache hit rate: hits divided by eligible requests, segmented by route, model, tenant, and prompt version.
- Miss reason: key mismatch, TTL expired, namespace mismatch, force refresh, no-store request, or unsafe response type.
- Cost avoided: cache hits multiplied by the model’s average prompt and completion cost, using `llm.token_count.prompt` and `llm.token_count.completion`.
- Latency delta: p50 and p99 latency for hits versus misses. A healthy exact-cache hit should avoid the provider round trip.
- Sampled correctness: run an evaluator on cached responses that affect customer-visible answers.
from fi.evals import AnswerRelevancy

score = AnswerRelevancy().evaluate(
    input=request_prompt,
    output=cached_response,
)
AnswerRelevancy measures whether the cached response still addresses the current prompt. Use it on a small sample of hits, not on every hit, so the evaluator does not erase the latency and cost savings.
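One way to keep that sampling cheap is a thin wrapper around the call above, reusing the AnswerRelevancy import from the snippet; the sampling rate and wrapper name are illustrative, not part of the fi.evals API.

```python
import random

HIT_SAMPLE_RATE = 0.02  # evaluate roughly 2% of hits; tune per route

def maybe_evaluate_hit(request_prompt, cached_response):
    # Skip most hits so evaluation cost stays far below the savings.
    if random.random() >= HIT_SAMPLE_RATE:
        return None
    return AnswerRelevancy().evaluate(
        input=request_prompt,
        output=cached_response,
    )
```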
Common mistakes
Most exact-cache failures come from cache keys that do not capture request identity correctly or from treating cache behavior as model quality:
- Hashing raw prompt text only. Tool schemas, temperature, response format, model, tenant, and namespace must be part of the key or collisions return incompatible responses.
- Caching requests with time, account balance, inventory, or retrieved context without TTL tags. The cache hit is fast, but stale by design (see the policy sketch after this list).
- Sharing cache namespaces across tenants. A perfect exact match can still expose private wording, internal policies, or account-specific content to another customer.
- Counting hits as quality. Exact-cache hit rate measures reuse; sample cached responses with evals before claiming the saved calls are safe.
- Using exact-cache where semantic-cache is needed. Rephrased natural-language traffic will miss, so cost savings stay low even under heavy repeated intent.
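The staleness and namespace mistakes above are usually fixed in per-route policy rather than in code paths. A hypothetical policy shape, not FutureAGI's actual configuration, might look like this:

```python
# Hypothetical per-route exact-cache policy: tenant-scoped namespaces,
# short or zero TTLs for volatile data, and no-store for routes that
# should never be cached.
CACHE_POLICIES = {
    "support-faq":     {"store": True,  "ttl_seconds": 3600, "namespace": "tenant:{tenant_id}"},
    "account-balance": {"store": False, "ttl_seconds": 0,    "namespace": "tenant:{tenant_id}"},
}
```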
Frequently Asked Questions
What is exact caching?
Exact caching is gateway-level response caching that returns a saved LLM output only when a later request matches the cached request exactly.
How is exact caching different from semantic cache?
Exact caching uses a deterministic cache key and hits only on byte-identical or canonical-identical requests. Semantic cache uses embeddings to hit on prompts that mean roughly the same thing.
How do you measure exact caching?
In FutureAGI, measure the `gateway:exact-cache` route with exact-cache hit rate, miss reason, TTL eviction, p99 latency, and token cost avoided from `llm.token_count.prompt` and `llm.token_count.completion`.