What Is Prompt Caching?

A technique for reusing stable prompt prefixes or cached LLM responses to reduce repeated input-processing cost and latency.

Prompt caching is a prompt-layer optimization that reuses stable prompt prefixes, cached input processing, or saved responses across repeated LLM calls. In production it shows up at the gateway, provider API, or trace layer, where llm.token_count.prompt, cache-hit status, and request identity explain whether a call paid full prompt cost. FutureAGI connects prompt caching to the Agent Command Center gateway:exact-cache route so engineers can reduce latency and spend without hiding stale, unsafe, or tenant-scoped outputs.

Why It Matters in Production LLM and Agent Systems

Repeated prompt work is one of the quietest forms of runaway cost. A customer-support agent may send the same 3,000-token system prompt on every turn, a regression runner may replay the same 20,000 eval prompts after each release, and a planner may issue the same classification instruction at every step. Without prompt caching, those repeats create higher p99 latency, avoidable token spend, and provider-rate-limit pressure before anyone sees a model-quality failure.

The failure mode is not only wasted money. A bad cache boundary can return a stale cached answer after a policy update, reuse account-specific context across tenants, or mask a prompt-template regression because the cached response looks successful. Developers feel this as confusing trace diffs: the prompt changed, but the output did not. SREs see low cache-hit-rate, high llm.token_count.prompt, and bursts of 429s. Product teams see slow answers on common requests. Compliance teams care when cached outputs bypass fresh guardrail checks or retention rules.

This matters more in 2026 agent pipelines than in single-turn chat. Multi-step agents repeat planner, router, summarizer, and validator prompts inside one user request. If caching is invisible, engineers cannot tell whether a fast path is safe reuse or an accidental stale response. If caching is measured, deterministic steps become cheap and auditable.

How FutureAGI Handles Prompt Caching

FutureAGI handles prompt caching as an observable workflow that connects prompt assets, gateway policy, and eval sampling. The specific anchor for this term is gateway:exact-cache, exposed through Agent Command Center. A typical route runs pre-guardrail checks, then an exact-cache lookup, then a routing policy that makes a cost-optimized provider choice on misses. Hits return from the gateway layer; misses continue to the model and write a TTL-scoped entry.
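
A minimal sketch of that order (every helper and parameter below is a hypothetical stand-in, not an Agent Command Center API) makes the hit and miss paths explicit:

def handle_route(request, cache, guardrail, pick_provider, make_key, ttl_seconds=900):
    # 1. Pre-guardrail checks run before any cache lookup.
    guardrail(request)

    # 2. Exact-cache lookup on the canonical request.
    key = make_key(request)
    cached = cache.get(key)
    if cached is not None:
        return cached  # hit: served from the gateway layer, no provider call

    # 3. Routing policy: cost-optimized provider choice on a miss.
    provider = pick_provider(request)
    response = provider.complete(request)

    # 4. A miss pays full prompt cost and writes a TTL-scoped entry.
    cache.set(key, response, ttl=ttl_seconds)
    return response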

The engineer should decide what kind of reuse is allowed. For a stable “classify support intent” prompt, the cache key can include the prompt template version, normalized user intent, model, route ID, response format, tenant namespace, and sampling settings. Trace data keeps llm.token_count.prompt, cache hit or miss reason, latency, route, and prompt version next to the output. If an output later fails AnswerRelevancy or Groundedness, the engineer can lower TTL, invalidate the namespace, add a regression eval, or force refresh that route.
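
A minimal sketch of that kind of key, with field names chosen for illustration rather than taken from a fixed FutureAGI schema, hashes the canonical request so any change in template version, tenant, or sampling settings lands in a different entry:

import hashlib
import json

def exact_cache_key(template_version, intent, model, route_id,
                    response_format, tenant, temperature, top_p):
    # Canonicalize the fields that define "the same request" for this route.
    canonical = json.dumps({
        "template_version": template_version,  # a prompt edit should miss, not inherit
        "intent": intent,                      # normalized user intent, not raw text
        "model": model,
        "route_id": route_id,
        "response_format": response_format,
        "tenant": tenant,                      # tenant namespace keeps reuse scoped
        "temperature": temperature,            # sampling settings
        "top_p": top_p,
    }, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()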

FutureAGI’s approach is to separate provider-side prompt-prefix reuse from gateway response reuse. Provider prompt caching, such as OpenAI or Anthropic prompt caching, can reduce the cost of processing repeated prefixes inside one provider. Agent Command Center exact-cache can skip the provider call when the canonical request is eligible and identical. Compared with a raw Redis cache in application code, the FutureAGI path keeps cache behavior tied to traces, routing policy, guardrails, and eval evidence.
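
For the provider-side half, here is a minimal sketch assuming the Anthropic Python SDK and its cache_control content block (check current provider docs for availability); the prompt variables and model ID are placeholders. It marks the stable prefix as cacheable but still makes the model call, which is the part gateway exact-cache can skip entirely:

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

stable_system_prompt = "You are a support-intent classifier..."  # repeated prefix
user_turn = "I was double charged for March."                    # changes per request

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute a current model ID
    max_tokens=256,
    # Provider prompt caching: reuse the processed form of this prefix across
    # calls; only the repeated prefix processing cost is reduced.
    system=[{
        "type": "text",
        "text": stable_system_prompt,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": user_turn}],
)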

How to Measure or Detect It

Measure prompt caching as a production control, not just a storage feature:

  • Cache hit rate by route: hits divided by eligible requests, segmented by prompt version, model, tenant, and cache namespace.
  • Token cost avoided: prompt and completion tokens skipped on hits, derived from llm.token_count.prompt, completion tokens, and route pricing.
  • Latency delta: p50 and p99 latency for hits versus misses. A useful hit should remove the provider round trip.
  • Miss reason: no entry, TTL expired, namespace mismatch, prompt-version mismatch, force refresh, or no-store.
  • Sampled correctness: run AnswerRelevancy or Groundedness on cached responses that reach users.

For example, a cached response that still reaches users can be re-scored before it keeps serving; the variable values here stand in for fields pulled from the trace:

from fi.evals import AnswerRelevancy

# current_prompt: the live request text; cached_response: the stored output
# the gateway returned for it.
result = AnswerRelevancy().evaluate(
    input=current_prompt,
    output=cached_response,
)

# A below-threshold score should trigger a refresh instead of continued reuse.
if result.score < 0.85:
    cache_action = "refresh"

The dashboard view should join cache-hit-rate, token-cost-per-trace, eval-fail-rate-by-cohort, and user-feedback proxies such as thumbs-down rate or escalation rate. A rising hit rate is only good when quality and safety hold.

Common Mistakes

Prompt caching breaks when engineers treat repeated text as repeated meaning. The key and the invalidation policy need to match the behavior being cached.

  • Caching prompts that depend on now(), inventory, account balance, or retrieved context without short TTLs or explicit invalidation.
  • Sharing cache namespaces across tenants. Even an exact text match can still carry private wording, policy exceptions, or customer-specific tool results.
  • Treating cache hit rate as quality. Hit rate measures reuse; sampled evals measure whether reused answers still fit the request.
  • Confusing provider prompt caching with gateway response caching. One reduces prompt processing; the other may skip the model call entirely.
  • Leaving prompt version out of the key. A prompt edit should not inherit responses from the previous template.

Frequently Asked Questions

What is prompt caching?

Prompt caching reuses stable prompt work or cached LLM responses so repeated requests avoid some input-processing cost, latency, or provider calls.

How is prompt caching different from a prompt cache?

Prompt caching usually refers to the general practice: provider-side reuse of prompt prefixes or gateway-level response caching. A prompt cache is the concrete gateway component that stores and returns matching responses.

How do you measure prompt caching?

FutureAGI measures prompt caching with the `gateway:exact-cache` route, cache hit rate, `llm.token_count.prompt`, latency p99, token-cost-per-trace, and sampled AnswerRelevancy checks.