Prompt Caching in 2026: How It Works, Pricing, and Where It Pays Off
How prompt caching works in 2026 on Anthropic, OpenAI, Gemini, and DeepSeek. Pricing, latency wins on prefix heavy prompts, gotchas, and observability.
What prompt caching actually is in 2026
Prompt caching is a provider side feature, not a client trick. When you send the same prompt prefix twice, the model server reuses the cached attention state for that prefix and only computes the new suffix. Cached tokens are billed at a steep discount (roughly 10 percent of the normal input price on Anthropic and DeepSeek, around 25 to 50 percent on OpenAI) and your time to first token drops sharply.
As of 2026 the major hosted LLM APIs that ship prompt caching include Anthropic (cache_control), OpenAI (automatic prompt caching), Google Gemini (context caching), and DeepSeek (automatic), plus self hosted runtimes like vLLM and SGLang.
TL;DR
| Provider | Trigger | Cached input price | Latency improvement | Best for |
|---|---|---|---|---|
| Anthropic | Explicit cache_control, 5 min or 1 hour TTL | ~10% of base input (reads) | 40-80% on prefix heavy prompts | Agents, RAG, stable system prompts |
| OpenAI | Automatic at 1024+ token prefix | ~25-50% of base input | 30-80% on long prompts | Long one off prompts |
| Gemini | Explicit cache with storage billing | Per minute or per hour storage | 30-60% on long shared docs | Long documents reused widely |
| DeepSeek | Automatic prefix match | ~10% of base input | 40-80% | Cost sensitive workloads |
| Self hosted (vLLM, SGLang) | Automatic prefix cache | No API charge | 50-90% throughput gain | Teams running own GPUs |
Use cache_control breakpoints to split the stable prefix from dynamic content. Watch the write penalty: on Anthropic, a 5 minute TTL cache write costs 1.25x the base input rate, so the write pays for itself after a single cache read, i.e. from the second request onward.
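A back-of-envelope sketch of that break even, using the 5 minute TTL multipliers; the token counts are illustrative:

# Cost of N requests sharing a cached prefix vs. no caching, in base-input-token units.
# Multipliers are Anthropic's 5 minute TTL numbers: 1.25x to write, 0.1x to read.
PREFIX_TOKENS = 5_000      # stable system prompt + tools
SUFFIX_TOKENS = 200        # per-request user message
WRITE_MULT, READ_MULT = 1.25, 0.10

def input_token_units(n_requests: int, cached: bool) -> float:
    """Total billed input, expressed in base input token units."""
    if not cached:
        return n_requests * (PREFIX_TOKENS + SUFFIX_TOKENS)
    first = WRITE_MULT * PREFIX_TOKENS + SUFFIX_TOKENS                      # cache write
    rest = (n_requests - 1) * (READ_MULT * PREFIX_TOKENS + SUFFIX_TOKENS)   # cache reads
    return first + rest

for n in (1, 2, 10):
    print(n, input_token_units(n, cached=False), input_token_units(n, cached=True))
# n=1 is more expensive with caching (the write penalty); n=2 already breaks even.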
How prompt caching works
The mechanism is simple. A transformer’s forward pass over a prompt produces a key value (KV) tensor for every layer at every token position. Normally the server discards these tensors after the response. With prompt caching, the server stores the KV tensors for a prefix and reuses them when a later request shares the same prefix.
You pay storage (sometimes), you skip compute, and you skip the network round trip on tokens already processed. That is the whole win.
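A toy sketch of the bookkeeping, not the attention math: the server keys stored KV state by token prefix, reuses the longest matching prefix, and runs a forward pass only on the remainder. Real servers store per-layer tensors at block granularity.

from typing import Dict, Tuple

kv_cache: Dict[Tuple[str, ...], str] = {}    # token prefix -> opaque "KV state"

def process(tokens: list) -> str:
    # Find the longest cached prefix of this request.
    hit_len = next((i for i in range(len(tokens), 0, -1)
                    if tuple(tokens[:i]) in kv_cache), 0)
    computed = len(tokens) - hit_len         # only the suffix needs a forward pass
    # Store every prefix of this prompt so future requests can reuse it.
    for i in range(1, len(tokens) + 1):
        kv_cache.setdefault(tuple(tokens[:i]), f"kv[:{i}]")
    return f"{hit_len} tokens from cache, {computed} computed"

print(process(["sys", "prompt", "q1"]))      # cache miss: everything computed
print(process(["sys", "prompt", "q2"]))      # shared 2-token prefix comes from cache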
Cache hit
A request whose prefix exactly matches a cached prefix. The server skips attention compute on the cached portion and only processes the new tail. Time to first token drops roughly in proportion to the share of tokens that were cached, and those tokens are billed at the discounted read rate instead of the full input rate.
Cache miss
A request that does not match any cached prefix. The server processes the prompt normally and (on Anthropic and Gemini) writes the new prefix into the cache. On Anthropic the write costs 1.25x base input for the 5 minute TTL or 2x for the 1 hour TTL. On OpenAI writes carry no surcharge, but caching only kicks in for prompts of 1024 tokens or more.
Partial match
Most production calls do not match exactly. They share a system prompt and few shot examples but differ on the user message. Anthropic’s explicit cache_control blocks let you mark the boundary so the system prompt portion still hits the cache. OpenAI matches the longest exact prefix automatically.
Provider deep dive
Anthropic
import anthropic

LONG_SYSTEM_PROMPT = "You are a senior support agent. ..."  # 2-10k tokens
user_input = "How do I reset my password?"

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_input}],
)
print(resp.usage.cache_read_input_tokens, resp.usage.cache_creation_input_tokens)
Two TTL options as of 2026: ephemeral with a 5 minute TTL at 1.25x write, or a 1 hour TTL at 2x write. Reads run at 10 percent of base input either way. Up to 4 cache breakpoints per request. Full reference: Anthropic prompt caching docs.
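A sketch of a two-breakpoint layout, reusing LONG_SYSTEM_PROMPT and user_input from the example above and assuming the documented ttl field on cache_control: tools and system prompt get separate breakpoints, so a later edit to the system prompt still leaves the cached tool prefix intact.

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    tools=[
        # ... other tool definitions ...
        {
            "name": "search_kb",
            "description": "Search the internal knowledge base.",
            "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
            # Breakpoint 1: caches all tool definitions up to and including this one.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Breakpoint 2: 1 hour TTL, billed at 2x base input on the write.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": user_input}],
)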
OpenAI
from openai import OpenAI

LONG_SYSTEM_PROMPT = "You are a senior support agent. ..."
user_input = "How do I reset my password?"

client = OpenAI()
resp = client.responses.create(
    model="gpt-5",
    input=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": user_input},
    ],
)
print(resp.usage.input_tokens_details.cached_tokens)
Automatic for prompts of 1024 tokens or more. No explicit control. Cached input is billed at a discount that varies by model (around 25 to 50 percent of base input on the gpt-5 family at the time of writing). Reference: OpenAI prompt caching guide.
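A small helper for spot checking how much of a prompt actually hit the automatic cache, reading the same usage fields as above; a sketch, not an official utility:

def cached_fraction(resp) -> float:
    """Fraction of input tokens served from OpenAI's automatic prompt cache."""
    usage = resp.usage
    if not usage.input_tokens:
        return 0.0
    return usage.input_tokens_details.cached_tokens / usage.input_tokens

# Below the 1024 token floor this stays at 0.0; on a warm, long prefix it should
# approach (prefix length / total input length).
print(f"cached fraction: {cached_fraction(resp):.0%}")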
Google Gemini
from google import genai
from google.genai import types

LONG_DOCUMENT = "..."  # large reusable document content
user_input = "Summarize the obligations in section 3."

client = genai.Client()
cache = client.caches.create(
    model="gemini-3.0-pro",
    config=types.CreateCachedContentConfig(
        contents=[LONG_DOCUMENT],
        ttl="3600s",
    ),
)
resp = client.models.generate_content(
    model="gemini-3.0-pro",
    contents=[user_input],
    config=types.GenerateContentConfig(cached_content=cache.name),
)
Explicit caches with storage billing per minute. Useful for long shared documents (legal contracts, research papers, large config) reused across many users. Reference: Gemini context caching.
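Because you pay storage for as long as the cache lives, the lifecycle is worth managing explicitly. A minimal sketch, assuming the caches.update and caches.delete methods exposed by the same SDK:

# Extend the cache while the document is still hot, then delete it when done
# so you stop paying storage.
client.caches.update(
    name=cache.name,
    config=types.UpdateCachedContentConfig(ttl="7200s"),  # keep for another 2 hours
)
client.caches.delete(name=cache.name)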
DeepSeek
Automatic, persistent across days, disk based. Cached input is about 10 percent of base. No explicit control, no TTL knob. Reference: DeepSeek context caching announcement.
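DeepSeek's endpoint is OpenAI compatible, and per-call hit and miss counts show up in the usage object. A minimal sketch, assuming the prompt_cache_hit_tokens and prompt_cache_miss_tokens fields described in the announcement:

from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="...")
resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "Long, stable system prompt ..."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
# Caching is automatic; these counters show how much of the prompt was reused.
print(resp.usage.prompt_cache_hit_tokens, resp.usage.prompt_cache_miss_tokens)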
Self hosted (vLLM, SGLang)
Both support automatic prefix KV cache. vLLM uses paged attention with prefix caching enabled by default in modern versions. SGLang exposes radix attention for tree shaped prefix sharing. No API price benefit (you run the GPUs) but throughput on shared prefix workloads can rise 2-5x. Reference: vLLM automatic prefix caching, SGLang radix attention.
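A minimal vLLM sketch; the model name is illustrative, and prefix caching is on by default in recent versions, so the flag is only shown for emphasis:

from vllm import LLM, SamplingParams

# Shared prefixes across these prompts are computed once and reused.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

system = "You are a senior support agent. ..." * 50   # long shared prefix
prompts = [
    system + "\nUser: How do I reset my password?",
    system + "\nUser: How do I change my plan?",
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))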
Where prompt caching pays off
Long system prompts
Agents typically carry a 2-10k token system prompt with persona, tool definitions, and policy. Cache it. On Anthropic, that single change can cut input cost by roughly 80 percent and shave 1-2 seconds off time to first token without changing models.
RAG with stable retrieval scaffolding
The system prompt, tool descriptions, and few shot examples are stable. The retrieved chunks change per query. Put a cache_control breakpoint at the boundary so the stable part hits cache and the dynamic part flows through normally.
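A sketch of that boundary on Anthropic, reusing the client from the earlier example; SCAFFOLD, retrieved_chunks, and user_question are placeholders:

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SCAFFOLD,   # persona + policy + few shot examples (stable)
            "cache_control": {"type": "ephemeral"},   # everything up to here is cached
        },
    ],
    messages=[
        {
            "role": "user",
            # Retrieved chunks change per query, so they sit after the breakpoint
            # and flow through as normal (uncached) input tokens.
            "content": f"Context:\n{retrieved_chunks}\n\nQuestion: {user_question}",
        }
    ],
)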
Multi turn conversations
Each turn shares the prior turns. Anthropic’s explicit cache makes long conversations cheap by caching everything up to the most recent user message.
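One way to do this on Anthropic, sketched below with placeholder variables and assuming text-only responses: keep the breakpoint on the newest block, so each turn's cache write becomes the next turn's read.

history = []   # list of Anthropic message dicts, grows every turn

def ask(question: str) -> str:
    history.append({"role": "user", "content": [{"type": "text", "text": question}]})
    # Move the breakpoint to the newest block: everything before it was already
    # cached on the previous turn, so this write is incremental.
    history[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=[{"type": "text", "text": LONG_SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}],
        messages=history,
    )
    # Drop the marker so only the latest block carries it on the next turn.
    del history[-1]["content"][-1]["cache_control"]
    history.append({"role": "assistant", "content": resp.content})
    return resp.content[0].text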
Few shot heavy classification
Few shot prompts with 20-50 examples are textbook caching wins. The examples are stable, the user input is short, the cache hit rate is near 100 percent.
Where prompt caching does not pay
- Short prompts (below OpenAI's 1024 token threshold; other providers have similar floors).
- Highly personalized prompts with no shared structure across users.
- Very low traffic endpoints where the cache TTL expires between requests.
- Workloads where you must delete data immediately and your provider’s TTL exceeds your retention SLA.
Common gotchas
| Gotcha | Symptom | Fix |
|---|---|---|
| Whitespace drift | Cache miss on what looks like the same prompt | Normalize whitespace and ordering before sending |
| Dynamic field in stable region | Hit rate near zero | Move user specific fields after the cache breakpoint |
| Tool definitions reordered | Hit rate degrades over time | Sort tools deterministically before sending |
| Image content changing per request | Cache invalid every call | Cache the text instructions, leave images uncached |
| Low traffic at long TTL | Most calls still hit write penalty | Either drop to short TTL or bypass caching for this route |
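The whitespace and tool ordering rows come down to byte exact prefixes: a single changed character or reordered item in the stable region is a miss. A small normalization sketch (function names are illustrative):

import re

def normalize_prompt(text: str) -> str:
    """Collapse whitespace drift so the same logical prompt is byte identical."""
    return re.sub(r"[ \t]+", " ", text.strip())

def normalize_tools(tools: list) -> list:
    """Keep tool ordering deterministic so the cached prefix stays identical."""
    return sorted(tools, key=lambda t: t["name"])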
Observing prompt caching with Future AGI
You cannot tune what you cannot see. Cache hit rate, miss reasons, and the cost delta per route belong in your observability stack alongside latency and error rate.
Future AGI traceAI (Apache 2.0) lets you instrument LLM calls with register and FITracer and set Anthropic or OpenAI cache fields as span attributes. Routes are then filterable by cache_read_input_tokens > 0 and you can chart hit rate over time per endpoint. The same trace also carries online evaluator scores (faithfulness, toxicity, task completion) so you can verify that caching did not change output quality.
import anthropic
from fi_instrumentation import register, FITracer

register(project_name="prod-prompt-cache")
tracer = FITracer(__name__)

anthropic_client = anthropic.Anthropic()
with tracer.start_as_current_span("rag-answer") as span:
    response = anthropic_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        system=[{"type": "text", "text": "system prompt", "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "user question"}],
    )
    span.set_attribute("cache_read_tokens", response.usage.cache_read_input_tokens)
    span.set_attribute("cache_write_tokens", response.usage.cache_creation_input_tokens)
For broader agent evaluation, faithfulness scoring, and regression suites on top, see the traceAI repository (Apache 2.0) and the ai-evaluation library (Apache 2.0).
Caching layers beyond prompt caching
Prompt caching is one layer. Two adjacent layers compose with it:
- Semantic caching catches paraphrases of past queries and returns a stored response without calling the model. Useful for FAQ workloads.
- Response caching catches exact repeats. Simpler than semantic, lower hit rate, near zero false positives.
Stack response cache → semantic cache → prompt cache → model call. Each layer catches what the layer above missed. Future AGI’s Agent Command Center surfaces hit rate, latency, and cost across all three when wired through traceAI.
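A naive in-memory sketch of that stack; embed, cosine, and call_model are whatever embedding model, similarity metric, and LLM client you already use, and the threshold is illustrative:

import hashlib
from typing import Callable, List, Tuple

def make_answerer(embed: Callable, cosine: Callable, call_model: Callable,
                  threshold: float = 0.92) -> Callable:
    """Layered lookup: exact response cache, then semantic cache, then the model."""
    response_cache: dict = {}                     # exact-match layer
    semantic_cache: List[Tuple[list, str]] = []   # (embedding, response) pairs

    def answer(query: str) -> str:
        key = hashlib.sha256(query.encode()).hexdigest()
        if key in response_cache:                       # 1. exact repeat
            return response_cache[key]
        emb = embed(query)
        for cached_emb, cached_resp in semantic_cache:  # 2. paraphrase of a past query
            if cosine(emb, cached_emb) > threshold:
                return cached_resp
        resp = call_model(query)                        # 3. prompt cache + model call
        response_cache[key] = resp
        semantic_cache.append((emb, resp))
        return resp

    return answer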
Related reads
- Master stimulus prompts in 2026: leading prompts, chain-stimulus, conditioning, prompt chaining, and CI-gated optimization with Future AGI Prompt Optimize.
- Build production LLM agents in 2026: task scoping, model selection (gpt-5, claude-opus-4.5), tools, evals, observability, and the orchestration-plus-eval loop.
- LLM vs GPT in 2026 explained: definitions, architecture, GPT-5 vs Claude vs Gemini vs Llama 4, when each wins, and how to evaluate any LLM or GPT model.
Frequently asked questions
What is prompt caching in LLM APIs?
How much does prompt caching save in 2026?
Anthropic vs OpenAI prompt caching: which is better?
Does prompt caching break determinism or change outputs?
How do I observe cache hit rate in production?
When should I not use prompt caching?
How does prompt caching interact with RAG?
Is prompt caching the same as semantic or response caching?