
Prompt Caching in 2026: How It Works, Pricing, and Where It Pays Off

How prompt caching works in 2026 on Anthropic, OpenAI, Gemini, and DeepSeek. Pricing, latency wins on prefix heavy prompts, gotchas, and observability.


What prompt caching actually is in 2026

Prompt caching is a provider side feature, not a client trick. When you send the same prompt prefix twice, the model server reuses the cached attention (KV) state for that prefix and only computes the new suffix. Cached tokens are billed at a steep discount (roughly 10 percent of the normal input price on Anthropic and DeepSeek, a model dependent 25 to 50 percent on OpenAI) and your time to first token drops sharply.

As of 2026 the major hosted LLM APIs that ship prompt caching include Anthropic (cache_control), OpenAI (automatic prompt caching), Google Gemini (context caching), and DeepSeek (automatic), plus self hosted runtimes like vLLM and SGLang.

TL;DR

| Provider | Trigger | Cached input price | Latency improvement | Best for |
| --- | --- | --- | --- | --- |
| Anthropic | Explicit cache_control, 5 min or 1 hour TTL | ~10% of base input (reads) | 40-80% on prefix heavy prompts | Agents, RAG, stable system prompts |
| OpenAI | Automatic at 1024+ token prefix | ~25-50% of base input | 30-80% on long prompts | Long one off prompts |
| Gemini | Explicit cache with storage billing | Per minute or per hour storage | 30-60% on long shared docs | Long documents reused widely |
| DeepSeek | Automatic prefix match | ~10% of base input | 40-80% | Cost sensitive workloads |
| Self hosted (vLLM, SGLang) | Automatic prefix cache | No API charge | 50-90% throughput gain | Teams running own GPUs |

Use cache_control breakpoints to split the stable prefix from dynamic content. Watch the write penalty: on Anthropic, a 5 minute TTL cache write costs 1.25x the base input rate, so you break even after a single cache hit (one write plus one read costs 1.35x, versus 2x for two uncached calls).
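A quick back of the envelope model makes the write penalty and break even concrete. The sketch below assumes Anthropic style multipliers (reads at 10 percent of base input, 5 minute TTL writes at 1.25x); swap in your own provider's rates.

# Rough cost model for one cached prefix, normalized so that sending the
# prefix once uncached costs 1.0. Multipliers are assumptions based on
# Anthropic's published pricing (5 minute TTL write = 1.25x, read = 0.10x).
BASE = 1.0
WRITE = 1.25 * BASE   # first request: prefix is processed and written to cache
READ = 0.10 * BASE    # later requests whose prefix hits the cache

def cached_cost(requests: int) -> float:
    """Total prefix cost for N requests when the first one writes the cache."""
    return WRITE + (requests - 1) * READ

def uncached_cost(requests: int) -> float:
    return requests * BASE

for n in (1, 2, 5, 20):
    print(n, round(cached_cost(n), 2), round(uncached_cost(n), 2))
# 1 request:   1.25 vs  1.00 -> caching loses (pure write penalty)
# 2 requests:  1.35 vs  2.00 -> already ahead after a single cache hit
# 20 requests: 3.15 vs 20.00 -> roughly 84% of the prefix cost gone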

How prompt caching works

The mechanism is simple. A transformer’s forward pass over a prompt produces key and value (KV) tensors for every layer at every token position. Normally the server discards these tensors after the response. With prompt caching, the server stores the KV tensors for a prefix and reuses them when a later request shares the same prefix.

You sometimes pay storage, you skip the prefill compute on every cached token, and the response starts sooner as a result. That is the whole win.

Cache hit

A request whose prefix exactly matches a cached prefix. The server skips attention compute on the cached portion and only processes the new tail. Time to first token drops roughly in proportion to the share of tokens that were cached, and that cached share is billed at the discounted cached input rate instead of the full price.

Cache miss

A request that does not match any cached prefix. The server processes the prompt normally and (on Anthropic and Gemini) writes the new prefix into the cache. On Anthropic the write costs 1.25x base input for a 5 minute TTL or 2x for a 1 hour TTL. On OpenAI writes are free but caching is automatic only above 1024 tokens.

Partial match

Most production calls do not match exactly. They share a system prompt and few shot examples but differ on the user message. Anthropic’s explicit cache_control blocks let you mark the boundary so the system prompt portion still hits the cache. OpenAI matches the longest exact prefix automatically.
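To build intuition for partial matches, here is a deliberately toy, client side illustration of longest exact prefix matching. Real providers do this server side on token boundaries, so treat it purely as a mental model, not something you would ship.

# Toy illustration only: providers match cached prefixes on token boundaries
# server side; this character-level version just shows the "longest exact
# prefix wins" behavior.
cached_prefixes = [
    "SYSTEM: You are a senior support agent. TOOLS: [...] EXAMPLES: [...]",
]

def longest_cached_prefix(prompt: str) -> int:
    """Return how many leading characters of prompt are covered by a cached prefix."""
    best = 0
    for prefix in cached_prefixes:
        if prompt.startswith(prefix):
            best = max(best, len(prefix))
    return best

prompt = cached_prefixes[0] + " USER: How do I reset my password?"
hit = longest_cached_prefix(prompt)
print(f"{hit} of {len(prompt)} characters served from cache")  # only the tail is recomputed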

Provider deep dive

Anthropic

import anthropic

LONG_SYSTEM_PROMPT = "You are a senior support agent. ..."  # 2-10k tokens
user_input = "How do I reset my password?"

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_input}],
)
print(resp.usage.cache_read_input_tokens, resp.usage.cache_creation_input_tokens)

Two TTL options as of 2026: ephemeral (5 minute) at 1.25x write, ephemeral (1 hour) at 2x write. Reads always run at 10 percent of base input. Up to 4 breakpoints per request. Full reference: Anthropic prompt caching docs.

OpenAI

from openai import OpenAI

LONG_SYSTEM_PROMPT = "You are a senior support agent. ..."
user_input = "How do I reset my password?"

client = OpenAI()
resp = client.responses.create(
    model="gpt-5",
    input=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": user_input},
    ],
)
print(resp.usage.input_tokens_details.cached_tokens)

Automatic for prompts over 1024 tokens. No explicit control. Cached input billed at a discount that varies by model (around 25 to 50 percent of base input on gpt-5 family at the time of writing). Reference: OpenAI prompt caching guide.

Google Gemini

from google import genai
from google.genai import types

LONG_DOCUMENT = "..."  # large reusable document content
user_input = "Summarize the obligations in section 3."

client = genai.Client()
cache = client.caches.create(
    model="gemini-3.0-pro",
    config=types.CreateCachedContentConfig(
        contents=[LONG_DOCUMENT],
        ttl="3600s",
    ),
)
resp = client.models.generate_content(
    model="gemini-3.0-pro",
    contents=[user_input],
    config=types.GenerateContentConfig(cached_content=cache.name),
)

Explicit caches with storage billing per minute. Useful for long shared documents (legal contracts, research papers, large config) reused across many users. Reference: Gemini context caching.
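Because Gemini bills cache storage by time, delete explicit caches once a batch of requests is done rather than waiting for the TTL to lapse. A minimal sketch, assuming the google-genai SDK's cache management calls and the cache object created above:

# List active caches, then delete the one created above to stop paying storage.
for c in client.caches.list():
    print(c.name, c.expire_time)

client.caches.delete(name=cache.name)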

DeepSeek

Automatic, persistent across days, disk based. Cached input is about 10 percent of base. No explicit control, no TTL knob. Reference: DeepSeek context caching announcement.

Self hosted (vLLM, SGLang)

Both support automatic prefix KV cache. vLLM uses paged attention with prefix caching enabled by default in modern versions. SGLang exposes radix attention for tree shaped prefix sharing. No API price benefit (you run the GPUs) but throughput on shared prefix workloads can rise 2-5x. Reference: vLLM automatic prefix caching, SGLang radix attention.
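For self hosted vLLM, prefix caching can also be switched on explicitly when you construct the engine. A minimal sketch, assuming a recent vLLM version (the flag is the default in modern releases) and an illustrative model name:

from vllm import LLM, SamplingParams

# enable_prefix_caching is on by default in recent vLLM versions; shown
# explicitly here for clarity. The model name is illustrative.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

SYSTEM = "You are a senior support agent. ...\n" * 200  # long shared prefix
questions = ["How do I reset my password?", "How do I close my account?"]

params = SamplingParams(max_tokens=256)
# Requests sharing the SYSTEM prefix reuse its KV cache inside the engine.
outputs = llm.generate([SYSTEM + "User: " + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text[:80])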

Where prompt caching pays off

Long system prompts

Agents typically carry a 2-10k token system prompt with persona, tool definitions, and policy. Cache it. On Anthropic that single change can cut input cost on those routes by roughly 80 percent and shave 1-2 seconds off time to first token, without changing models.

RAG with stable retrieval scaffolding

The system prompt, tool descriptions, and few shot examples are stable. The retrieved chunks change per query. Put a cache_control breakpoint at the boundary so the stable part hits cache and the dynamic part flows through normally.
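A sketch of that split on Anthropic: the stable scaffolding sits in a system block marked with cache_control, and the per query retrieved chunks ride along in the user message, uncached. The scaffold text and model name are placeholders.

import anthropic

client = anthropic.Anthropic()
RAG_SCAFFOLD = "You are a retrieval QA assistant. Tools: ... Few shot examples: ..."  # stable, 2-10k tokens

def answer(question: str, retrieved_chunks: list[str]):
    return client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": RAG_SCAFFOLD,
                # Breakpoint: everything up to and including this block is cached.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {
                "role": "user",
                # Retrieved chunks change per query, so they stay after the breakpoint.
                "content": "Context:\n" + "\n\n".join(retrieved_chunks) + "\n\nQuestion: " + question,
            }
        ],
    )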

Multi turn conversations

Each turn shares the prior turns. Anthropic’s explicit cache makes long conversations cheap by caching everything up to the most recent user message.
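A sketch of the same idea for conversations, assuming the Anthropic client from above: mark the last content block of the accumulated history with cache_control so prior turns are read from cache and only the newest user message is processed fresh.

history = [
    {"role": "user", "content": [{"type": "text", "text": "Earlier question ..."}]},
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "Earlier answer ...",
                # Everything up to and including this block is cached for the next turn.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": [{"type": "text", "text": "Follow-up question ..."}]},
]

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=512,
    system=[{"type": "text", "text": "You are a helpful assistant."}],
    messages=history,
)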

Few shot heavy classification

Few shot prompts with 20-50 examples are textbook caching wins. The examples are stable, the user input is short, the cache hit rate is near 100 percent.
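On OpenAI the same pattern needs no annotation: keep the example block byte identical and at the front of the prompt so the automatic prefix match covers it, and append only the short input at the end. A minimal sketch using the Responses API; the examples and labels are placeholders, and the block must clear the 1024 token floor to be cached.

from openai import OpenAI

client = OpenAI()

# Keep this block identical across calls; only the short tail below changes.
FEW_SHOT_EXAMPLES = "\n".join(
    f"Text: {example}\nLabel: {label}"
    for example, label in [("great product", "positive"), ("arrived broken", "negative")] * 25
)

def classify(text: str) -> str:
    resp = client.responses.create(
        model="gpt-5",
        input=[
            {"role": "system", "content": "Classify sentiment.\n" + FEW_SHOT_EXAMPLES},
            {"role": "user", "content": f"Text: {text}\nLabel:"},
        ],
    )
    return resp.output_text.strip()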

Where prompt caching does not pay

  • Short prompts (under OpenAI's 1024 token threshold, with a similar floor elsewhere).
  • Highly personalized prompts with no shared structure across users.
  • Very low traffic endpoints where the cache TTL expires between requests.
  • Workloads where you must delete data immediately and your provider’s TTL exceeds your retention SLA.

Common gotchas

| Gotcha | Symptom | Fix |
| --- | --- | --- |
| Whitespace drift | Cache miss on what looks like the same prompt | Normalize whitespace and ordering before sending |
| Dynamic field in stable region | Hit rate near zero | Move user specific fields after the cache breakpoint |
| Tool definitions reordered | Hit rate degrades over time | Sort tools deterministically before sending |
| Image content changing per request | Cache invalid every call | Cache the text instructions, leave images uncached |
| Low traffic at long TTL | Most calls still hit write penalty | Either drop to short TTL or bypass caching for this route |
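For the whitespace drift and tool ordering rows above, the fix is to make the stable region byte identical before every call. A minimal normalization sketch; the exact rules are illustrative, not prescriptive:

import json

def normalize_prompt(text: str) -> str:
    # Collapse runs of whitespace so incidental formatting changes don't break the prefix match.
    return " ".join(text.split())

def normalize_tools(tools: list[dict]) -> list[dict]:
    # Sort tool definitions by name and serialize keys deterministically.
    canonical = [json.loads(json.dumps(t, sort_keys=True)) for t in tools]
    return sorted(canonical, key=lambda t: t.get("name", ""))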

Observing prompt caching with Future AGI

You cannot tune what you cannot see. Cache hit rate, miss reasons, and the cost delta per route belong in your observability stack alongside latency and error rate.

Future AGI traceAI (Apache 2.0) lets you instrument LLM calls with register and FITracer and set Anthropic or OpenAI cache fields as span attributes. Routes are then filterable by cache_read_input_tokens > 0 and you can chart hit rate over time per endpoint. The same trace also carries online evaluator scores (faithfulness, toxicity, task completion) so you can verify that caching did not change output quality.

import anthropic
from fi_instrumentation import register, FITracer

register(project_name="prod-prompt-cache")
tracer = FITracer(__name__)
anthropic_client = anthropic.Anthropic()

with tracer.start_as_current_span("rag-answer") as span:
    response = anthropic_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        system=[{"type": "text", "text": "system prompt", "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "user question"}],
    )
    span.set_attribute("cache_read_tokens", response.usage.cache_read_input_tokens)
    span.set_attribute("cache_write_tokens", response.usage.cache_creation_input_tokens)

For broader agent evaluation, faithfulness scoring, and regression suites on top, see the traceAI repository (Apache 2.0) and the ai-evaluation library (Apache 2.0).

Caching layers beyond prompt caching

Prompt caching is one layer. Two adjacent layers compose with it:

  • Semantic caching catches paraphrases of past queries and returns a stored response without calling the model. Useful for FAQ workloads.
  • Response caching catches exact repeats. Simpler than semantic, lower hit rate, near zero false positives.

Stack response cache → semantic cache → prompt cache → model call. Each layer catches what the layer above missed. Future AGI’s Agent Command Center surfaces hit rate, latency, and cost across all three when wired through traceAI.
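A sketch of that lookup path in one function. The response_cache, semantic_cache, and call_model helpers are placeholders you would back with your own infrastructure (for example Redis plus a vector index); only the layering order is the point.

def answer(prompt: str) -> str:
    # Layer 1: response cache — exact repeat of a previous prompt.
    if (exact := response_cache.get(prompt)) is not None:
        return exact

    # Layer 2: semantic cache — close-enough paraphrase of a previous prompt.
    if (similar := semantic_cache.lookup(prompt, threshold=0.92)) is not None:
        return similar

    # Layer 3: the model call itself; the provider's prompt cache kicks in
    # automatically (or via cache_control) when the prompt shares a prefix
    # with a recent request.
    result = call_model(prompt)

    response_cache.set(prompt, result)
    semantic_cache.store(prompt, result)
    return result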

Frequently asked questions

What is prompt caching in LLM APIs?
Prompt caching is a provider side feature that lets an LLM API reuse the prefix of a prompt across requests so the model does not recompute attention over the same tokens twice. The cached prefix (system prompt, retrieved context, few shot examples) is stored on the provider side. Subsequent requests that share the same prefix pay a lower cached input price and return faster. It is supported on Anthropic, OpenAI, Google Gemini, DeepSeek, and most major hosted APIs as of 2026.
How much does prompt caching save in 2026?
Anthropic and DeepSeek price cache reads at roughly 10 percent of standard input; OpenAI prices cached input at a model dependent discount around 25 to 50 percent; Gemini bills cache storage by minute or hour rather than per cached token. Latency drops 40 to 80 percent on prompts where most tokens are cached. The catch: writes to the cache typically cost 1.25x the normal input rate on Anthropic for the 5 minute TTL.
Anthropic vs OpenAI prompt caching: which is better?
Anthropic uses explicit cache_control blocks and 5 minute or 1 hour TTLs, giving you fine grained control. OpenAI uses automatic caching above a 1024 token prefix with no TTL guarantee. Anthropic is better for predictable cache hit rate and shared agent systems. OpenAI is simpler for one off prompts above the token floor. Anthropic prices cache reads at roughly 10 percent of standard input, while OpenAI's cached input discount is model dependent, roughly 25 to 50 percent of base.
Does prompt caching break determinism or change outputs?
No, prompt caching does not intentionally change outputs. The cache stores intermediate KV states for the prompt prefix, not generated tokens. The model still samples normally, so outputs follow the same temperature and top_p as an uncached call. Bit exact reproducibility on hosted providers still depends on provider routing and hardware, so verify with your own regression suite.
How do I observe cache hit rate in production?
API responses include cache usage fields (cache_creation_input_tokens and cache_read_input_tokens on Anthropic; cached_tokens under input_tokens_details in OpenAI's Responses API or prompt_tokens_details in Chat Completions). Log them per request and chart by route. With Future AGI traceAI you can set those fields as span attributes inside your existing LLM spans, so hit rate, miss reason, and cost impact sit next to the rest of your trace.
When should I not use prompt caching?
Skip it when prompts are short and unique (no prefix to reuse), when context is highly personalized per user with no shared template, or when total volume is low enough that the cache write penalty exceeds savings. Also skip it when you need to delete a tenant's data immediately and your provider's cache TTL does not match your retention policy.
How does prompt caching interact with RAG?
Cache the system prompt and few shot examples (stable across requests). Do not cache the retrieved chunks themselves if they change per query. With Anthropic you place a cache_control breakpoint at the boundary between stable system instructions and dynamic retrieved context. This typically yields 60 to 80 percent token reduction on RAG endpoints with high traffic.
Is prompt caching the same as semantic or response caching?
No. Prompt caching reuses prompt prefix compute on the provider side. Semantic caching matches similar queries and returns a previously generated answer without calling the model at all. Response caching stores exact prompt to response pairs. The three layers compose: response cache catches exact repeats, semantic cache catches paraphrases, prompt cache catches everything else with a shared prefix.