Prompt Caching in 2026: How It Works, Pricing, and Where It Pays Off
How prompt caching works in 2026 on Anthropic, OpenAI, Gemini, and DeepSeek. Pricing, latency wins on prefix heavy prompts, gotchas, and observability.
What prompt caching actually is in 2026
Prompt caching is a provider side feature, not a client trick. When you send the same prompt prefix twice, the model server reuses the cached attention state for that prefix and only computes the new suffix. Cached tokens are billed at a steep discount (roughly 10 percent of the normal input price on Anthropic and DeepSeek, around 25 to 50 percent on OpenAI) and your time to first token drops sharply.
As of 2026 the major hosted LLM APIs that ship prompt caching include Anthropic (cache_control), OpenAI (automatic prompt caching), Google Gemini (context caching), and DeepSeek (automatic), plus self hosted runtimes like vLLM and SGLang.
TL;DR
| Provider | Trigger | Cached input price | Latency improvement | Best for |
|---|---|---|---|---|
| Anthropic | Explicit cache_control, 5 min or 1 hour TTL | ~10% of base input (reads) | 40-80% on prefix heavy prompts | Agents, RAG, stable system prompts |
| OpenAI | Automatic at 1024+ token prefix | ~25-50% of base input | 30-80% on long prompts | Long one off prompts |
| Gemini | Explicit cache with storage billing | Per minute or per hour storage | 30-60% on long shared docs | Long documents reused widely |
| DeepSeek | Automatic prefix match | ~10% of base input | 40-80% | Cost sensitive workloads |
| Self hosted (vLLM, SGLang) | Automatic prefix cache | No API charge | 50-90% throughput gain | Teams running own GPUs |
Use cache_control breakpoints to split the stable prefix from dynamic content. Watch the write penalty: on Anthropic, a 5 minute TTL cache write costs 1.25x the base input rate, so the write pays for itself after a single cache read, i.e. from the second request onward.
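A back-of-envelope sketch of that break even, using the 5 minute TTL multipliers; the token counts are illustrative:

# Cost of N requests sharing a cached prefix vs. no caching, in base-input-token units.
# Multipliers are Anthropic's 5 minute TTL numbers: 1.25x to write, 0.1x to read.
PREFIX_TOKENS = 5_000      # stable system prompt + tools
SUFFIX_TOKENS = 200        # per-request user message
WRITE_MULT, READ_MULT = 1.25, 0.10

def input_token_units(n_requests: int, cached: bool) -> float:
    """Total billed input, expressed in base input token units."""
    if not cached:
        return n_requests * (PREFIX_TOKENS + SUFFIX_TOKENS)
    first = WRITE_MULT * PREFIX_TOKENS + SUFFIX_TOKENS                      # cache write
    rest = (n_requests - 1) * (READ_MULT * PREFIX_TOKENS + SUFFIX_TOKENS)   # cache reads
    return first + rest

for n in (1, 2, 10):
    print(n, input_token_units(n, cached=False), input_token_units(n, cached=True))
# n=1 is more expensive with caching (the write penalty); n=2 already breaks even.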
How prompt caching works
The mechanism is simple. A transformer’s forward pass over a prompt produces a key value (KV) tensor for every layer at every token position. Normally the server discards these tensors after the response. With prompt caching, the server stores the KV tensors for a prefix and reuses them when a later request shares the same prefix.
You pay storage (sometimes), you skip compute, and you skip the network round trip on tokens already processed. That is the whole win.
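A toy sketch of the bookkeeping, not the attention math: the server keys stored KV state by token prefix, reuses the longest matching prefix, and runs a forward pass only on the remainder. Real servers store per-layer tensors at block granularity.

from typing import Dict, Tuple

kv_cache: Dict[Tuple[str, ...], str] = {}    # token prefix -> opaque "KV state"

def process(tokens: list) -> str:
    # Find the longest cached prefix of this request.
    hit_len = next((i for i in range(len(tokens), 0, -1)
                    if tuple(tokens[:i]) in kv_cache), 0)
    computed = len(tokens) - hit_len         # only the suffix needs a forward pass
    # Store every prefix of this prompt so future requests can reuse it.
    for i in range(1, len(tokens) + 1):
        kv_cache.setdefault(tuple(tokens[:i]), f"kv[:{i}]")
    return f"{hit_len} tokens from cache, {computed} computed"

print(process(["sys", "prompt", "q1"]))      # cache miss: everything computed
print(process(["sys", "prompt", "q2"]))      # shared 2-token prefix comes from cache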
Cache hit
A request whose prefix exactly matches a cached prefix. The server skips attention compute on the cached portion and only processes the new tail. Time to first token drops roughly in proportion to the share of tokens that were cached, and those tokens are billed at the discounted read rate instead of the full input rate.
Cache miss
A request that does not match any cached prefix. The server processes the prompt normally and (on Anthropic and Gemini) writes the new prefix into the cache. On Anthropic the write costs 1.25x base input for the 5 minute TTL or 2x for the 1 hour TTL. On OpenAI writes carry no surcharge, but caching only kicks in for prompts of 1024 tokens or more.
Partial match
Most production calls do not match exactly. They share a system prompt and few shot examples but differ on the user message. Anthropic’s explicit cache_control blocks let you mark the boundary so the system prompt portion still hits the cache. OpenAI matches the longest exact prefix automatically.
Provider deep dive
Anthropic
import anthropic

LONG_SYSTEM_PROMPT = "You are a senior support agent. ..."  # 2-10k tokens
user_input = "How do I reset my password?"

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_input}],
)
print(resp.usage.cache_read_input_tokens, resp.usage.cache_creation_input_tokens)
Two TTL options as of 2026: ephemeral with a 5 minute TTL at 1.25x write, or a 1 hour TTL at 2x write. Reads run at 10 percent of base input either way. Up to 4 cache breakpoints per request. Full reference: Anthropic prompt caching docs.
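A sketch of a two-breakpoint layout, reusing LONG_SYSTEM_PROMPT and user_input from the example above and assuming the documented ttl field on cache_control: tools and system prompt get separate breakpoints, so a later edit to the system prompt still leaves the cached tool prefix intact.

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    tools=[
        # ... other tool definitions ...
        {
            "name": "search_kb",
            "description": "Search the internal knowledge base.",
            "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
            # Breakpoint 1: caches all tool definitions up to and including this one.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Breakpoint 2: 1 hour TTL, billed at 2x base input on the write.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": user_input}],
)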
OpenAI
from openai import OpenAI

LONG_SYSTEM_PROMPT = "You are a senior support agent. ..."
user_input = "How do I reset my password?"

client = OpenAI()
resp = client.responses.create(
    model="gpt-5",
    input=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": user_input},
    ],
)
print(resp.usage.input_tokens_details.cached_tokens)
Automatic for prompts of 1024 tokens or more. No explicit control. Cached input is billed at a discount that varies by model (around 25 to 50 percent of base input on the gpt-5 family at the time of writing). Reference: OpenAI prompt caching guide.
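A small helper for spot checking how much of a prompt actually hit the automatic cache, reading the same usage fields as above; a sketch, not an official utility:

def cached_fraction(resp) -> float:
    """Fraction of input tokens served from OpenAI's automatic prompt cache."""
    usage = resp.usage
    if not usage.input_tokens:
        return 0.0
    return usage.input_tokens_details.cached_tokens / usage.input_tokens

# Below the 1024 token floor this stays at 0.0; on a warm, long prefix it should
# approach (prefix length / total input length).
print(f"cached fraction: {cached_fraction(resp):.0%}")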
Google Gemini
from google import genai
from google.genai import types

LONG_DOCUMENT = "..."  # large reusable document content
user_input = "Summarize the obligations in section 3."

client = genai.Client()
cache = client.caches.create(
    model="gemini-3.0-pro",
    config=types.CreateCachedContentConfig(
        contents=[LONG_DOCUMENT],
        ttl="3600s",
    ),
)
resp = client.models.generate_content(
    model="gemini-3.0-pro",
    contents=[user_input],
    config=types.GenerateContentConfig(cached_content=cache.name),
)
Explicit caches with storage billing per minute. Useful for long shared documents (legal contracts, research papers, large config) reused across many users. Reference: Gemini context caching.
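Because you pay storage for as long as the cache lives, the lifecycle is worth managing explicitly. A minimal sketch, assuming the caches.update and caches.delete methods exposed by the same SDK:

# Extend the cache while the document is still hot, then delete it when done
# so you stop paying storage.
client.caches.update(
    name=cache.name,
    config=types.UpdateCachedContentConfig(ttl="7200s"),  # keep for another 2 hours
)
client.caches.delete(name=cache.name)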
DeepSeek
Automatic, persistent across days, disk based. Cached input is about 10 percent of base. No explicit control, no TTL knob. Reference: DeepSeek context caching announcement.
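DeepSeek's endpoint is OpenAI compatible, and per-call hit and miss counts show up in the usage object. A minimal sketch, assuming the prompt_cache_hit_tokens and prompt_cache_miss_tokens fields described in the announcement:

from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="...")
resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "Long, stable system prompt ..."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
# Caching is automatic; these counters show how much of the prompt was reused.
print(resp.usage.prompt_cache_hit_tokens, resp.usage.prompt_cache_miss_tokens)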
Self hosted (vLLM, SGLang)
Both support automatic prefix KV cache. vLLM uses paged attention with prefix caching enabled by default in modern versions. SGLang exposes radix attention for tree shaped prefix sharing. No API price benefit (you run the GPUs) but throughput on shared prefix workloads can rise 2-5x. Reference: vLLM automatic prefix caching, SGLang radix attention.
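A minimal vLLM sketch; the model name is illustrative, and prefix caching is on by default in recent versions, so the flag is only shown for emphasis:

from vllm import LLM, SamplingParams

# Shared prefixes across these prompts are computed once and reused.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

system = "You are a senior support agent. ..." * 50   # long shared prefix
prompts = [
    system + "\nUser: How do I reset my password?",
    system + "\nUser: How do I change my plan?",
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))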
Where prompt caching pays off
Long system prompts
Agents typically carry a 2-10k token system prompt with persona, tool definitions, and policy. Cache it. On Anthropic, that single change can cut input cost by roughly 80 percent and shave 1-2 seconds off time to first token without changing models.
RAG with stable retrieval scaffolding
The system prompt, tool descriptions, and few shot examples are stable. The retrieved chunks change per query. Put a cache_control breakpoint at the boundary so the stable part hits cache and the dynamic part flows through normally.
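A sketch of that boundary on Anthropic, reusing the client from the earlier example; SCAFFOLD, retrieved_chunks, and user_question are placeholders:

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SCAFFOLD,   # persona + policy + few shot examples (stable)
            "cache_control": {"type": "ephemeral"},   # everything up to here is cached
        },
    ],
    messages=[
        {
            "role": "user",
            # Retrieved chunks change per query, so they sit after the breakpoint
            # and flow through as normal (uncached) input tokens.
            "content": f"Context:\n{retrieved_chunks}\n\nQuestion: {user_question}",
        }
    ],
)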
Multi turn conversations
Each turn shares the prior turns. Anthropic’s explicit cache makes long conversations cheap by caching everything up to the most recent user message.
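One way to do this on Anthropic, sketched below with placeholder variables and assuming text-only responses: keep the breakpoint on the newest block, so each turn's cache write becomes the next turn's read.

history = []   # list of Anthropic message dicts, grows every turn

def ask(question: str) -> str:
    history.append({"role": "user", "content": [{"type": "text", "text": question}]})
    # Move the breakpoint to the newest block: everything before it was already
    # cached on the previous turn, so this write is incremental.
    history[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=[{"type": "text", "text": LONG_SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}],
        messages=history,
    )
    # Drop the marker so only the latest block carries it on the next turn.
    del history[-1]["content"][-1]["cache_control"]
    history.append({"role": "assistant", "content": resp.content})
    return resp.content[0].text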
Few shot heavy classification
Few shot prompts with 20-50 examples are textbook caching wins. The examples are stable, the user input is short, the cache hit rate is near 100 percent.
Where prompt caching does not pay
- Short prompts (below OpenAI's 1024 token threshold; other providers have similar floors).
- Highly personalized prompts with no shared structure across users.
- Very low traffic endpoints where the cache TTL expires between requests.
- Workloads where you must delete data immediately and your provider’s TTL exceeds your retention SLA.
Common gotchas
| Gotcha | Symptom | Fix |
|---|---|---|
| Whitespace drift | Cache miss on what looks like the same prompt | Normalize whitespace and ordering before sending |
| Dynamic field in stable region | Hit rate near zero | Move user specific fields after the cache breakpoint |
| Tool definitions reordered | Hit rate degrades over time | Sort tools deterministically before sending |
| Image content changing per request | Cache invalid every call | Cache the text instructions, leave images uncached |
| Low traffic at long TTL | Most calls still hit write penalty | Either drop to short TTL or bypass caching for this route |
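The whitespace and tool ordering rows come down to byte exact prefixes: a single changed character or reordered item in the stable region is a miss. A small normalization sketch (function names are illustrative):

import re

def normalize_prompt(text: str) -> str:
    """Collapse whitespace drift so the same logical prompt is byte identical."""
    return re.sub(r"[ \t]+", " ", text.strip())

def normalize_tools(tools: list) -> list:
    """Keep tool ordering deterministic so the cached prefix stays identical."""
    return sorted(tools, key=lambda t: t["name"])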
Observing prompt caching with Future AGI
You cannot tune what you cannot see. Cache hit rate, miss reasons, and the cost delta per route belong in your observability stack alongside latency and error rate.
Future AGI traceAI (Apache 2.0) lets you instrument LLM calls with register and FITracer and set Anthropic or OpenAI cache fields as span attributes. Routes are then filterable by cache_read_input_tokens > 0 and you can chart hit rate over time per endpoint. The same trace also carries online evaluator scores (faithfulness, toxicity, task completion) so you can verify that caching did not change output quality.
import anthropic
from fi_instrumentation import register, FITracer

register(project_name="prod-prompt-cache")
tracer = FITracer(__name__)

anthropic_client = anthropic.Anthropic()
with tracer.start_as_current_span("rag-answer") as span:
    response = anthropic_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        system=[{"type": "text", "text": "system prompt", "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "user question"}],
    )
    span.set_attribute("cache_read_tokens", response.usage.cache_read_input_tokens)
    span.set_attribute("cache_write_tokens", response.usage.cache_creation_input_tokens)
For broader agent evaluation, faithfulness scoring, and regression suites on top, see the traceAI repository (Apache 2.0) and the ai-evaluation library (Apache 2.0).
Caching layers beyond prompt caching
Prompt caching is one layer. Two adjacent layers compose with it:
- Semantic caching catches paraphrases of past queries and returns a stored response without calling the model. Useful for FAQ workloads.
- Response caching catches exact repeats. Simpler than semantic, lower hit rate, near zero false positives.
Stack response cache → semantic cache → prompt cache → model call. Each layer catches what the layer above missed. Future AGI’s Agent Command Center surfaces hit rate, latency, and cost across all three when wired through traceAI.
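A naive in-memory sketch of that stack; embed, cosine, and call_model are whatever embedding model, similarity metric, and LLM client you already use, and the threshold is illustrative:

import hashlib
from typing import Callable, List, Tuple

def make_answerer(embed: Callable, cosine: Callable, call_model: Callable,
                  threshold: float = 0.92) -> Callable:
    """Layered lookup: exact response cache, then semantic cache, then the model."""
    response_cache: dict = {}                     # exact-match layer
    semantic_cache: List[Tuple[list, str]] = []   # (embedding, response) pairs

    def answer(query: str) -> str:
        key = hashlib.sha256(query.encode()).hexdigest()
        if key in response_cache:                       # 1. exact repeat
            return response_cache[key]
        emb = embed(query)
        for cached_emb, cached_resp in semantic_cache:  # 2. paraphrase of a past query
            if cosine(emb, cached_emb) > threshold:
                return cached_resp
        resp = call_model(query)                        # 3. prompt cache + model call
        response_cache[key] = resp
        semantic_cache.append((emb, resp))
        return resp

    return answer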
Related reads
- Master stimulus prompts in 2026: leading prompts, chain-stimulus, conditioning, prompt chaining, and CI-gated optimization with Future AGI Prompt Optimize.
- Build production LLM agents in 2026: task scoping, model selection (gpt-5, claude-opus-4.5), tools, evals, observability, and the orchestration-plus-eval loop.
- LLM vs GPT in 2026 explained: definitions, architecture, GPT-5 vs Claude vs Gemini vs Llama 4, when each wins, and how to evaluate any LLM or GPT model.
Frequently asked questions
What is prompt caching in LLM APIs?
How much does prompt caching save in 2026?
Anthropic vs OpenAI prompt caching: which is better?
Does prompt caching break determinism or change outputs?
How do I observe cache hit rate in production?
When should I not use prompt caching?
How does prompt caching interact with RAG?
Is prompt caching the same as semantic or response caching?