Evaluating Prompt Caching Quality in 2026
Prompt caching saves 50-90% on spend but ships two silent regressions: invalidation bugs and semantic-cache wrong-prompt hits. The eval that catches both.
Anthropic prompt caching cuts input cost by 90% on hit. OpenAI prompt caching saves 50%. Gemini context caching reduces context billing by a similar band. Gateway-side semantic caching saves another 30 to 50% on coding and support workloads. The dashboards celebrate hit rate. Finance celebrates the saved spend. Nobody asks whether the cached answer was actually correct for the new prompt.
That blind spot ships two classes of silent regression. Cache-invalidation bugs keep serving last week’s answer after the prompt prefix or source corpus changed. Semantic-cache wrong-prompt hits return a paraphrase-neighbor’s answer that looks 0.91 cosine-similar and means something different. Neither shows up in the hit-rate chart. Both show up in the post-mortem. The eval that prevents them is per-cache-key hit rate × hit quality × invalidation correctness — three numbers on the same dashboard. Without all three, your cache savings ship a quality regression.
This post is the playbook: where caching breaks in Anthropic, OpenAI, Gemini, and gateway semantic stacks, how to score each layer, the golden set, the rubrics, and how Future AGI’s eval stack closes the loop.
Why prompt caching needs eval at all
Caching looks free. It’s not. It’s a second model in your stack with its own precision and recall.
A cache that hits 70 percent of the time and is wrong on 5 percent of those is shipping 3.5 percent wrong answers fast and cheap, every day, with no alert. The visible line on the dashboard goes the right way. The line that matters (cost-per-correct-answer, not cost-per-call) goes the wrong way.
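The arithmetic is worth writing down once. A minimal sketch, assuming you already measure hit rate, error rate on hits, and error rate on misses per route; the function name `cache_economics` and the per-call cost used in the example are illustrative, not part of any library.

```python
def cache_economics(hit_rate, hit_error_rate, miss_error_rate,
                    cost_per_miss, cost_per_hit=0.0):
    """Return (cost_per_call, wrong_fraction, cost_per_correct_answer).

    All rates are fractions in [0, 1]; costs are in any one currency unit.
    """
    miss_rate = 1.0 - hit_rate
    cost_per_call = hit_rate * cost_per_hit + miss_rate * cost_per_miss
    # Wrong answers come from bad hits plus bad fresh generations.
    wrong_fraction = hit_rate * hit_error_rate + miss_rate * miss_error_rate
    correct_fraction = 1.0 - wrong_fraction
    if correct_fraction <= 0:
        raise ValueError("no correct answers: cost per correct answer is undefined")
    return cost_per_call, wrong_fraction, cost_per_call / correct_fraction


# The example from the text: 70 percent hit rate, 5 percent of hits wrong.
# The 0.01 cost per fresh call is a placeholder value.
cost, wrong, per_correct = cache_economics(0.70, 0.05, 0.0, cost_per_miss=0.01)
print(f"wrong answers: {wrong:.1%}")
```

Tracking `per_correct` next to `cost` per route is what makes a rising error rate on hits visible instead of hidden behind the savings.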
Cache-invalidation correctness is the second blind spot. When the system prompt ships a new version, when the RAG corpus gets reindexed, when a tool definition changes, every cached entry keyed against the old state is now serving the wrong answer. The cache doesn’t know. The user finds out.
Semantic caching adds a third failure mode exact caching doesn’t have. Two prompts can share 90 percent of their tokens and want different answers. “Maximum dose of ibuprofen for an adult” and “maximum dose of ibuprofen for a child” are 0.94 cosine-similar on most embedding models. The strings are nearly identical. The right answers differ by a factor of two. A semantic cache tuned to 0.85 happily serves one for the other.
These three failure modes are the agenda. The rest of the post is the eval.
The three failure modes, named
Cache-invalidation bugs (the silent staleness class)
The pattern: cache key omits something that should be part of it. A RAG cache keyed on the user query alone, with no corpus hash, keeps serving last week’s answer after the corpus updates. A system-prompt cache keyed only on the user turn keeps serving the old persona after the prompt ships a new version. A tool-spec cache that ignores the tool definition keeps returning calls to the deprecated signature.
The eval is mechanical. Change the upstream artifact (prompt version, corpus document, tool spec), replay the same user input, assert the served answer reflects the change. If it doesn’t, the cache key is wrong. Run it in CI on every prompt deploy, corpus rebuild, and tool-spec PR.
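The replay test can be sketched in a few lines. This is a toy under stated assumptions: a dict stands in for the cache, `fake_model` stands in for the LLM call, and the field names (`prompt_version`, `corpus_hash`, `tool_spec_hash`) are hypothetical labels for whatever your pipeline versions.

```python
import hashlib
import json


def cache_key(query, upstream, fields):
    """Hash the query plus the selected upstream fields into one key."""
    material = {"query": query, **{f: upstream[f] for f in fields}}
    return hashlib.sha256(json.dumps(material, sort_keys=True).encode()).hexdigest()


def serve(cache, query, upstream, fields, generate):
    """Return the cached answer for this key, generating it on a miss."""
    key = cache_key(query, upstream, fields)
    if key not in cache:
        cache[key] = generate(query, upstream)
    return cache[key]


def fake_model(query, upstream):
    # Stand-in for the real LLM call: the answer depends on upstream state.
    return f"{query} -> corpus {upstream['corpus_hash']}, prompt {upstream['prompt_version']}"


def replay_reflects_change(fields):
    """Mutate the corpus, replay the same query, report whether the answer moved."""
    cache = {}
    before = {"prompt_version": "v1", "corpus_hash": "aaa", "tool_spec_hash": "t1"}
    after = {**before, "corpus_hash": "bbb"}  # simulate a corpus rebuild
    first = serve(cache, "refund window?", before, fields, fake_model)
    second = serve(cache, "refund window?", after, fields, fake_model)
    return first != second


FULL_KEY = ("prompt_version", "corpus_hash", "tool_spec_hash")
print(replay_reflects_change(FULL_KEY))  # True: key includes upstream state
print(replay_reflects_change(()))        # False: key is the query alone, stale answer served
```

In CI the same assertion would run against the staging gateway rather than a dict; the shape of the check stays the same.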
Semantic-cache wrong-prompt hits (the threshold class)
The pattern: cosine threshold loose enough to pull a near-paraphrase that means something else. Threshold-tuning is a precision-recall problem and the tuning lives per route, not globally. Medical, legal, financial, and live-data routes need tight thresholds (0.93 and up) and accept lower hit rate. Internal FAQ bots and lint-fix copilots can run looser (0.82 to 0.85).
The eval is a labeled paraphrase set with same-intent and different-intent pairs. Sweep the threshold. Plot precision against recall. Pick the lowest threshold that holds precision above the route floor.
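A minimal sketch of the sweep, assuming each labeled pair has already been reduced to a similarity score and a same-intent flag. The similarity values below are made up for illustration, and `sweep` and `pick_threshold` are hypothetical helper names.

```python
def sweep(pairs, thresholds):
    """Compute (threshold, precision, recall) for each candidate threshold.

    pairs: list of (similarity, same_intent) tuples from the labeled set.
    A pair is "served" when its similarity clears the threshold; serving a
    different-intent pair is a wrong-prompt hit.
    """
    total_same = sum(1 for _, same in pairs if same)
    rows = []
    for t in thresholds:
        served = [same for sim, same in pairs if sim >= t]
        correct = sum(served)
        precision = correct / len(served) if served else 1.0
        recall = correct / total_same if total_same else 0.0
        rows.append((t, precision, recall))
    return rows


def pick_threshold(rows, precision_floor):
    """Lowest threshold whose precision holds the route floor, or None."""
    passing = [t for t, precision, _ in rows if precision >= precision_floor]
    return min(passing) if passing else None


# Illustrative labeled pairs: (similarity, same_intent).
pairs = [(0.97, True), (0.95, True), (0.94, False), (0.91, True),
         (0.88, True), (0.86, False), (0.83, False)]
rows = sweep(pairs, [0.82, 0.85, 0.90, 0.93, 0.95])
for t, precision, recall in rows:
    print(f"threshold={t:.2f} precision={precision:.2f} recall={recall:.2f}")
print(pick_threshold(rows, precision_floor=0.99))
```

Note that precision is not guaranteed to rise monotonically with the threshold: a single high-similarity trap pair (the 0.94 case here) can drag it down at a tighter setting, which is why the sweep is worth plotting rather than guessing.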
Cross-tenant cache leaks (the namespace class)
The pattern: shared namespace across tenants. Tenant A asks “summarize the Q3 earnings deck for ACME Corp.” The answer caches. Tenant B asks the same question on a different deck and hits ACME’s cached answer because tenant ID wasn’t part of the key. Ships silently until a customer notices a competitor’s data in their response.
The eval: two requests, two tenant contexts, same prompt. Assert the responses come from different cache entries. Run it on every gateway config change.
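As a sketch of that assertion, assuming a toy in-process cache: `NamespacedCache` and its `include_tenant` switch are hypothetical, and exist only to show the test passing on a partitioned key and failing on a shared one.

```python
class NamespacedCache:
    """Toy exact-match cache whose key optionally includes the tenant ID."""

    def __init__(self, include_tenant=True):
        self.include_tenant = include_tenant
        self._entries = {}

    def _key(self, tenant_id, prompt):
        return (tenant_id, prompt) if self.include_tenant else (None, prompt)

    def get_or_generate(self, tenant_id, prompt, generate):
        """Return (key, answer), generating and storing the answer on a miss."""
        key = self._key(tenant_id, prompt)
        if key not in self._entries:
            self._entries[key] = generate(tenant_id, prompt)
        return key, self._entries[key]


def tenants_are_isolated(cache):
    """Same prompt, two tenants: the entries and the answers must differ."""
    prompt = "summarize the Q3 earnings deck"
    # Stand-in for the model call; the answer depends on the tenant's data.
    generate = lambda tenant, _prompt: f"summary of the deck owned by {tenant}"
    key_a, answer_a = cache.get_or_generate("tenant-a", prompt, generate)
    key_b, answer_b = cache.get_or_generate("tenant-b", prompt, generate)
    return key_a != key_b and answer_a != answer_b


print(tenants_are_isolated(NamespacedCache(include_tenant=True)))   # True
print(tenants_are_isolated(NamespacedCache(include_tenant=False)))  # False: tenant B got tenant A's entry
```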
Where caching lives in the 2026 stack
Three layers, three eval surfaces.
Provider-side prefix caching (Anthropic, OpenAI). Anthropic prompt caching marks blocks with cache_control and bills cache-read tokens at roughly 10 percent of full input cost. OpenAI prompt caching automatically caches identical prefixes on prompts of 1,024 tokens or more at 50 percent off. Both are byte-exact. Both fail invisibly when the cacheable prefix isn’t at the start of the prompt. Eval question: are you structuring prompts so the stable prefix leads, and measuring cache_read_input_tokens (Anthropic) or cached_tokens (OpenAI)?
Gemini context caching. Closer to a stored-context primitive. Upload a long context once, reference it by handle on subsequent calls, pay a reduced rate per cached token. Eval question: when the underlying document changes, does the handle get invalidated?
Gateway-side semantic caching. Where semantic risk lives. The gateway embeds the prompt, searches a vector store for nearest neighbors, returns the stored answer when the top hit clears the threshold. The Agent Command Center ships exact and semantic side by side (in-memory or Redis for exact; Qdrant, Pinecone, or in-memory vectors for semantic) with defaults of cosine 0.85, 256-dim embeddings, 50K max entries, LRU eviction, streaming bypass. Eval question: is the threshold tuned per route, and is the namespace partitioned per tenant?
For the gateway layer specifically, see semantic caching at the gateway and the audio-caching latency analysis.
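For the provider-side layer, the measurement is a ratio over the usage block each response carries. A minimal sketch, assuming you have already converted the SDK's usage object to a plain dict; the helper names are hypothetical, while the field names are the ones the providers report.

```python
def anthropic_cached_fraction(usage):
    """Share of input tokens served from cache on an Anthropic response.

    Anthropic reports input_tokens separately from cache reads and cache
    writes, so total input is the sum of all three.
    """
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0


def openai_cached_fraction(usage):
    """Share of prompt tokens served from cache on an OpenAI response.

    OpenAI reports cached_tokens inside prompt_tokens_details, and
    prompt_tokens already includes them.
    """
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    total = usage.get("prompt_tokens") or 0
    return cached / total if total else 0.0


# Illustrative usage payloads, not real responses.
print(anthropic_cached_fraction(
    {"input_tokens": 50, "cache_creation_input_tokens": 150, "cache_read_input_tokens": 1800}
))
print(openai_cached_fraction(
    {"prompt_tokens": 2048, "prompt_tokens_details": {"cached_tokens": 1024}}
))
```

A fraction that stays near zero on a route with a long, stable system prompt usually means the stable prefix is not leading the prompt.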
Instrument the cache hop before you eval it
You can’t score what you can’t see. Stamp cache attributes on every span, the same way you stamp model name and latency.
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

# Register a tracer provider for this project.
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="prompt-caching-eval",
)

# Instrument the OpenAI client so every call emits a span through that provider.
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
Every LLM call now lands a span with fi.span.kind=LLM. The gateway cache hop adds three attributes you actually care about:
- cache.hit (bool)
- cache.semantic_similarity (float 0-1, present on semantic hits)
- cache.namespace (string, the tenant or route partition)
When the request goes through the Agent Command Center, response headers carry the same information back to your application: x-agentcc-cost reflects $0 on a cache hit, x-agentcc-latency-ms reflects single-digit-millisecond latency, x-agentcc-model-used reflects the model that originally generated the cached answer, and x-agentcc-cache is hit or miss. If you aren’t seeing those headers, the request bypassed the cache, which is a different problem.
Per-trace cache_hit is the foundation the rest of the eval sits on. Same span shape we use for evaluating RAG faithfulness and conversation completeness. The cache is just another scored span.
Build the golden set
Two hundred to five hundred labeled cases. Four buckets.
Exact-match repeats. Same prompt issued twice. The cache must hit on the second call. If it doesn’t, your prefix is unstable, your cache_control blocks are wrong, or the gateway isn’t keying what you think it is.
Same-intent near-paraphrases. “What’s the refund policy” and “how do I get a refund.” The semantic cache should hit and the answer should still be correct. Where threshold tuning earns its keep.
Different-intent paraphrase traps. “Maximum dose of ibuprofen for an adult” and “maximum dose of ibuprofen for a child.” Strings are 0.94 similar. Answers must not be shared. These catch loose thresholds before a user does.
Time-sensitive queries. “Current NVDA price.” “Latest CVE for Log4j.” “Q1 earnings figure.” The cache must not serve these from yesterday. A staleness rubric scores them.
Each case carries the prompt, the expected behavior (hit allowed yes/no, correct answer), and a tenant tag so isolation can be scored. Promote real production failures into this golden set every week, the same loop you use for any eval dataset.
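One way to hold those cases is a small record type with a check on hit behavior. This is a sketch under assumptions: the `Bucket`, `GoldenCase`, and `hit_violation` names are hypothetical, and the expected answers are placeholders you would fill from your own reviewed data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Bucket(Enum):
    EXACT_REPEAT = "exact_repeat"
    SAME_INTENT = "same_intent_paraphrase"
    DIFFERENT_INTENT = "different_intent_trap"
    TIME_SENSITIVE = "time_sensitive"


@dataclass(frozen=True)
class GoldenCase:
    bucket: Bucket
    prompt: str
    hit_allowed: bool      # may the cache serve this prompt at all?
    expected_answer: str   # reviewed reference answer
    tenant: str            # tag used by the isolation check


def hit_violation(case: GoldenCase, cache_hit: bool) -> Optional[str]:
    """Return a violation label for the observed hit behavior, or None."""
    if cache_hit and not case.hit_allowed:
        return "forbidden_hit"   # trap or time-sensitive case was served from cache
    if not cache_hit and case.bucket is Bucket.EXACT_REPEAT:
        return "missing_hit"     # identical repeat should have hit
    return None


trap = GoldenCase(Bucket.DIFFERENT_INTENT, "Maximum dose of ibuprofen for a child",
                  hit_allowed=False, expected_answer="<reviewed answer>", tenant="tenant-a")
repeat = GoldenCase(Bucket.EXACT_REPEAT, "What's the refund policy",
                    hit_allowed=True, expected_answer="<reviewed answer>", tenant="tenant-a")

print(hit_violation(trap, cache_hit=True))     # forbidden_hit
print(hit_violation(repeat, cache_hit=False))  # missing_hit
```

Answer correctness is scored separately by the rubrics; this check only covers whether the cache was allowed to answer in the first place.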
Score cache quality with the rubrics you already have
The Future AGI Platform runs the same EvalTemplate library across cached and fresh responses. Nothing in the rubric changes. What changes is the cache.hit slice on the span, which lets you split the score by hit vs miss and see whether cached responses are scoring lower.
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness,
    ContextAdherence,
    TaskCompletion,
    AnswerRefusal,
)
from fi.testcases import TestCase

evaluator = Evaluator(
    fi_api_key="...",
    fi_secret_key="...",
)

# cached_answer and retrieved_context come from your own pipeline:
# the response the cache served and the context it was generated against.
results = evaluator.evaluate(
    eval_templates=[
        Groundedness(),
        ContextAdherence(),
        TaskCompletion(),
        AnswerRefusal(),
    ],
    inputs=[
        TestCase(
            query="What's the refund window on enterprise plans?",
            response=cached_answer,
            context=retrieved_context,
        ),
    ],
)
If the cached answer scores below the route floor on Groundedness or TaskCompletion, the cache shipped a regression, and that regression has a trace attached. Compare two reports: “cache hit rate fell two points last week” versus “here are the 47 cache hits that scored below 0.75 on Groundedness, with the threshold that served each one and the fresh-LLM answer to compare against.” That’s the difference between dashboard and debug.
Three CustomLLMJudge rubrics that pay back fast
For direct comparison, write a CustomLLMJudge rubric. Three earn their keep on every caching deployment.
CacheQualityMatch. Given the cached answer and a fresh LLM call on the same prompt, judge whether they would be functionally equivalent for the end user. Disagreement above a threshold means the cache served stale or wrong content. Run it on a 1 to 5 percent sample of hits. That’s the staleness sampler.
CrossTenantIsolation. Given two tenants and the same prompt, judge whether the responses reflect the same underlying context. If they do, and the tenants should have different access scopes, that’s a leak. Run it as a gate on every gateway config change.
InvalidationCorrectness. Given a corpus mutation event and a post-mutation prompt, judge whether the served answer reflects the mutation. If it doesn’t, the invalidation key is wrong. Run it in CI on every prompt or corpus deploy.
These rubrics live alongside your other evaluators and run on the same workers the rest of the stack uses.
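The sampling half of the staleness sampler needs no judge at all. A minimal sketch, assuming trace IDs are strings and the judge is any callable that returns an agreement score between 0 and 1; `in_staleness_sample`, `review_hit`, and the `judge` signature are hypothetical, not the CustomLLMJudge API.

```python
import hashlib


def in_staleness_sample(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of cache hits by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")      # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def review_hit(cached_answer, fresh_answer, judge, max_disagreement):
    """Compare cached vs fresh output and decide what to do with the entry."""
    disagreement = 1.0 - judge(cached_answer, fresh_answer)
    return "evict" if disagreement > max_disagreement else "keep"


# Stand-in judge for the example: exact string match instead of an LLM rubric.
exact_match_judge = lambda cached, fresh: 1.0 if cached == fresh else 0.0

sampled = sum(in_staleness_sample(f"trace-{i}", 0.03) for i in range(10_000))
print(f"sampled {sampled} of 10000 hits")
print(review_hit("30 days", "14 days", exact_match_judge, max_disagreement=0.5))  # evict
```

Hashing the trace ID rather than calling a random number generator keeps the sample reproducible, so a replayed trace lands in the same bucket.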
The Agent Command Center cache layers
The Agent Command Center ships exact and semantic as native layers at the gateway. Exact-match L1 sits in memory or Redis. Semantic L2 sits in Qdrant, Pinecone, or in-memory vectors. Defaults: cosine 0.85, embedding dimension 256, max 50K entries, LRU eviction. Streaming routes bypass cache by design.
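To make the semantic layer concrete, here is a toy version of the lookup: cosine similarity against stored vectors, a threshold gate, and LRU eviction. It is a sketch, not the gateway's implementation; a linear scan stands in for the vector store, the caller supplies embeddings, and the `SemanticCache` class is hypothetical.

```python
import math
from collections import OrderedDict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold=0.85, max_entries=50_000):
        self.threshold = threshold
        self.max_entries = max_entries
        self._entries = OrderedDict()  # prompt -> (vector, answer), oldest first

    def lookup(self, vector):
        """Return (answer, similarity); answer is None when the best match misses the threshold."""
        best_key, best_sim = None, -1.0
        for key, (stored, _answer) in self._entries.items():
            sim = cosine(vector, stored)
            if sim > best_sim:
                best_key, best_sim = key, sim
        if best_key is not None and best_sim >= self.threshold:
            self._entries.move_to_end(best_key)  # mark as recently used
            return self._entries[best_key][1], best_sim
        return None, best_sim

    def store(self, prompt, vector, answer):
        self._entries[prompt] = (vector, answer)
        self._entries.move_to_end(prompt)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used


cache = SemanticCache(threshold=0.85, max_entries=2)
cache.store("prompt-a", [1.0, 0.0], "answer-a")
answer, similarity = cache.lookup([0.9, 0.1])
print(answer, round(similarity, 3))
```

The returned similarity is the value worth stamping on the span, since it is what the threshold sweep and the wrong-prompt-hit analysis both need.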
The caller controls the wrong-answer case with per-request override headers: x-agentcc-cache-force-refresh skips the cache, x-agentcc-cache-ttl overrides the lifetime, x-agentcc-cache-namespace partitions the lookup. The trace records the cache hit as a span attribute on the same span as the rest of the call, so cost-per-correct-answer counts a hit as a correct outcome instead of a missing one.
Two configuration moves matter most. Namespace by tenant. Tag-based primitives let you compose the namespace from tenant ID, route, and prompt version. Skip this and you ship the cross-tenant leak. Threshold per route. A single global cosine threshold is the wrong primitive. Medical and finance routes get 0.93 and up. Internal FAQ routes get 0.82 to 0.85. The route config is the right place; the global default is the trap.
Hit rates land 30 to 50 percent on shared coding-agent traffic, 30 to 60 percent on RAG with a stable corpus, above 50 percent on customer-support FAQ. Each hit returns in single-digit milliseconds at zero token cost.
The pattern that backfires: caching without invalidation on prompt or system-message changes. The prompt shipped a new version, the cached answer is now wrong, the user gets stale output for a week. Tie cache namespace to the prompt version. When the prompt ships, the namespace flips, the cache repopulates.
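Composing the namespace is a one-function job. A minimal sketch: the colon-joined, percent-encoded format and the `cache_namespace` helper are assumptions for illustration, and only the `x-agentcc-cache-namespace` header name comes from the gateway description above.

```python
from urllib.parse import quote


def cache_namespace(tenant_id: str, route: str, prompt_version: str) -> str:
    """Build a namespace that changes whenever tenant, route, or prompt version changes."""
    parts = (tenant_id, route, prompt_version)
    if not all(parts):
        # An empty part would silently merge partitions, so fail loudly instead.
        raise ValueError("tenant_id, route and prompt_version are all required")
    # Percent-encode each part so a ':' inside a value cannot blur the boundaries.
    return ":".join(quote(part, safe="") for part in parts)


def cache_headers(tenant_id: str, route: str, prompt_version: str) -> dict:
    """Per-request headers carrying the namespace to the gateway."""
    return {"x-agentcc-cache-namespace": cache_namespace(tenant_id, route, prompt_version)}


print(cache_headers("acme", "support", "v7"))
# {'x-agentcc-cache-namespace': 'acme:support:v7'}
```

Shipping prompt `v8` then changes the namespace for every request on that route, so old entries stop matching without an explicit purge.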
Production observability: per-trace cache_hit is the foundation
Once cache.hit is on every span, three queries earn their keep on day one.
Quality-by-hit-vs-miss split. Group spans by cache.hit, compute the mean rubric score per group. If the hit group scores materially worse than the miss group on the same route, your cache is shipping regressions. Alert on the delta, not the absolute score.
Staleness sampler. For 1 to 5 percent of hits, fire a fresh LLM call in parallel and score both with CacheQualityMatch. Disagreement above the route threshold evicts the entry and lowers the TTL. Cheap. Catches staleness within hours instead of weeks.
Invalidation gate in CI. On every prompt-version PR, replay a 50-case fixture against the staging gateway, assert the cache busts for any case whose prompt version moved. The assertion runs in seconds. The bug it catches is the “we shipped a new persona and 40 percent of traffic still got the old one for six days” class.
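The hit-vs-miss split is a group-by. A sketch over spans exported as plain dicts, assuming each carries a `route` field, the `cache.hit` attribute, and one rubric score; the `groundedness` key, the helper names, and the 0.03 alert delta are illustrative choices.

```python
from collections import defaultdict
from statistics import mean


def hit_miss_delta(spans, score_key="groundedness"):
    """Per route, return mean(miss score) - mean(hit score).

    A positive delta means cached responses score worse than fresh ones.
    Routes missing either group are skipped: there is nothing to compare.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for span in spans:
        groups[span["route"]][bool(span["cache.hit"])].append(span[score_key])
    return {
        route: mean(by_hit[False]) - mean(by_hit[True])
        for route, by_hit in groups.items()
        if by_hit[True] and by_hit[False]
    }


def routes_to_alert(deltas, max_delta):
    """Routes where the cache costs more quality than the allowed delta."""
    return sorted(route for route, delta in deltas.items() if delta > max_delta)


# Illustrative spans, not real traffic.
spans = [
    {"route": "faq", "cache.hit": True, "groundedness": 0.80},
    {"route": "faq", "cache.hit": True, "groundedness": 0.90},
    {"route": "faq", "cache.hit": False, "groundedness": 0.90},
    {"route": "faq", "cache.hit": False, "groundedness": 0.90},
    {"route": "medical", "cache.hit": False, "groundedness": 0.95},
]
print(routes_to_alert(hit_miss_delta(spans), max_delta=0.03))  # ['faq']
```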
Span-attached scores are the same pattern we use for agent observability and AI agent cost optimization. Cache is one more scored axis on the same trace tree.
Anti-patterns we see repeatedly
Tracking hit rate only. The most common mistake. Add quality, invalidation, and isolation next to it on the same dashboard.
One threshold for every route. A 0.85 threshold that’s fine for the Slack bot hands you a wrong-dosage incident on the medical assistant. Threshold belongs in route config, never global.
No cross-tenant isolation test. Ships silently. The first time you find out is when a customer notices a competitor’s data in a response.
No staleness sampler. Caches age. TTLs help, they don’t solve it. A periodic fresh-vs-cache comparison on a 1 to 5 percent sample is the cheapest staleness detector you can build.
Treating prefix caching and semantic caching as one thing. Anthropic prompt caching and OpenAI prompt caching are byte-deterministic. Gateway semantic caching is probabilistic at the embedding level. Different failure modes, different evals. Conflating them is how teams miss the threshold-tuning problem.
No cache-vs-fresh comparison. Without it, you can’t tell what the cache is costing in quality. CacheQualityMatch is a 50-line rubric. It pays for itself the first regression it catches.
How Future AGI ships the cache-quality loop
The eval stack runs the loop today. Honest about the seams:
- traceAI stamps cache.hit, cache.semantic_similarity, and cache.namespace on every span. Python, TypeScript, Java, C#. OpenTelemetry-native. The cost-per-correct-answer join lives in your trace store.
- The Future AGI Platform scores cached responses with the same Groundedness, ContextAdherence, TaskCompletion, and AnswerRefusal templates it uses on fresh ones. CustomLLMJudge rubrics for CacheQualityMatch, CrossTenantIsolation, and InvalidationCorrectness plug into the same runner.
- The Error Feed clusters failing cache traces with HDBSCAN over the trace store. A Sonnet 4.5 Judge writes the cluster name and an immediate_fix, opens a Linear issue (today; Slack, GitHub, Jira, PagerDuty on the roadmap). Typical clusters: “semantic threshold 0.85 serves wrong answer for medical-dosage paraphrase,” “cross-tenant leak when tenant tag missing from namespace.”
- agent-opt runs six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) over the eval dataset to tune thresholds, embedding choices, and prompt prefixes against scored outcomes.
- The Agent Command Center is the gateway. Exact and semantic cache native, tag-based namespacing, OTel and Prometheus observability, OpenAI-compatible drop-in via base_url="https://gateway.futureagi.com/v1". Single Go binary, Apache 2.0. Benchmarked at ~29k req/s with P99 21 ms on t3.xlarge with guardrails on.
Honest tradeoff: the live trace-to-optimizer connector that auto-promotes failing cache hits into the next optimizer sweep is on the roadmap, not shipping today. You can run the loop manually now (score, cluster, sweep) and the connector closes that gap when it ships.
Ready to score the cache the same way you score the model? Point your OpenAI SDK at https://gateway.futureagi.com/v1, register the traceAI tracer, and run the four EvalTemplate rubrics against your cached responses. Start with the Agent Command Center quickstart and the traceAI integration guide.
Related reading
- LLM evaluation playbook (2026) — the broader live-traffic eval picture this post slices into.
- Detecting hallucinations in generative AI — Groundedness on cached and fresh responses.
- Error analysis for LLM applications (2026) — the Error Feed clustering loop end to end.
- AI agent cost optimization with observability (2026) — cache savings inside the broader cost picture.
- Semantic caching at the AI gateway (2026) — the gateway layer.
The summary
Cache hit rate is necessary, not sufficient. The three numbers that matter together — hit rate × hit quality × invalidation correctness — describe whether a caching layer is actually working or just looking like it is. Anthropic prompt caching, OpenAI prompt caching, Gemini context caching, and gateway semantic caches all fit the same eval framework once you stop treating hit rate as the only number on the dashboard.
The cheap version: stamp cache.hit on every span, run your normal rubric on cached responses, alert when cached-response quality drops below the route floor. That alone catches most regressions. Cross-tenant isolation, staleness sampling, and per-route threshold sweeps layer on top.
Take one thing from this post: the cache is a model. It has a precision number, a recall number, a latency number, and a cost number. Treat it like every other component in your stack. Give it a rubric, score it on the same trace tree as the rest of your calls. The dashboards get less interesting and the incidents get rarer. That’s the trade you wanted.