Research

RAG vs CAG: Choosing Cache-Augmented Generation in 2026

RAG vs Cache-Augmented Generation in 2026: 7 axes for choosing, the hybrid router pattern most teams ship, how to eval both paths with traceAI and FAGI.

March 29, 2026

Updated May 19, 2026

14 min read

rag cag cache-augmented-generation prompt-caching long-context retrieval evaluation 2026

Table of Contents

The 2026 question is not whether retrieval matters. It is whether retrieval is the right hop for your workload at all. Long-context models with prompt-caching pricing now let teams stuff an entire reference corpus into a cached prompt and answer every query against that cache in under a second. That pattern has a name. Cache-Augmented Generation, or CAG, sits next to Retrieval-Augmented Generation as a peer architecture, and the choice between them is workload-specific. This guide is the decision framework: what CAG is, the seven axes for choosing, the hybrid pattern most production teams ship, and how to eval both paths so the Pareto comparison stays honest.

TL;DR: RAG vs CAG in one table

Dimension	RAG	CAG
What it does	Retrieves top-k chunks per query, generates from chunks	Loads full corpus into cached prompt, generates from cache
Latency	100-500ms retrieval hop + generation	No retrieval hop, sub-second on cache hit
Cost shape	Per-query embed and rerank + generation tokens	One-time cache write + cached-input discount on every read
Corpus size	Scales to billions of tokens	Limited by model context (400k to 2M tokens in 2026)
Freshness	Re-index on change, queries see new data immediately	Cache invalidates on change, write cost amortizes over reads
Citation surface	Per-chunk attribution natively	Whole-context attribution, chunk citations need extra work
Best fit	Large, fresh, per-tenant, citation-heavy workloads	Small, stable, high-volume, latency-tight workloads

If you only read one row: RAG retrieves a few chunks per query, CAG stuffs the whole corpus once and reuses it. The right architecture depends on corpus size, freshness, latency budget, cost, citation needs, query distribution, and tenant model. Most production teams arrive at a hybrid router.

Why CAG became a real option in 2026

Two unrelated capability curves crossed in 2025 and 2026, and CAG is the architecture sitting in their intersection.

The first curve is context length. Gemini 2.5 ships with a 2M-token context window. Claude Sonnet 4.5 ships with 1M tokens. GPT-5.1 ships with 400k tokens. A 500k-token corpus is roughly 1,500 to 2,000 pages of reference documentation, which covers most product knowledge bases, most internal policy corpora, and most B2B-SaaS support content. Two years ago that corpus needed a vector store. Today it fits in a single prompt.

The second curve is prompt-caching economics. Anthropic prompt caching charges roughly 10% of the standard input rate on cached reads after a one-time cache write at 1.25x the standard rate. OpenAI prompt caching charges 50% of the standard input rate on cached reads at no write premium. Either way, a stable prompt that gets queried thousands of times per day pays the corpus cost once and amortizes it over every read. The cached-input discount makes the “stuff the whole corpus” pattern economically viable in a way it was not when every query paid full token cost.

When both curves cross, CAG becomes a peer architecture to RAG instead of an experiment. The question is not whether you can do it. The question is whether you should, for which workloads, and how to eval it against the RAG baseline you already run.

What CAG actually is

CAG loads the entire reference corpus into a long-context model prompt once, caches that prompt, and serves every query against the cached context with no retrieval hop. The four moving parts:

Corpus packaging. The reference documents get concatenated into a single prompt, ordered so high-priority content sits near the start (where attention is stronger). Section headers, document IDs, and timestamps stay in the prompt so the model can produce structured citations even without chunk metadata.
Cache write. The first request through Anthropic or OpenAI’s prompt-cache mechanism pays the cache-write premium and stores the corpus prompt server-side. Cache TTL is typically 5 minutes for Anthropic and longer for OpenAI; production teams keep the cache warm with a low-rate keep-alive query.
Cache read. Every subsequent query attaches the user question to the cached corpus prefix. The model sees the full corpus plus the question, generates the answer, and the request is billed at the discounted cached-input rate for the corpus prefix and the standard rate only for the question and the response.
Eval and observability. Every CAG call still lands as a traceAI span. The span carries fi.span.kind=LLM, an llm.cache_hit=true attribute, the cached input token count, the response, and the eval scores attached at the same span level as a RAG generation span.

The architectural shift is that retrieval is no longer a hop. The corpus is always in context. The model is always grounded. The latency cost of “did the retriever find the doc” is zero because the doc was never absent. What you trade is corpus size (capped by context window), freshness (cache TTL plus re-warm cost), and citation granularity (no chunk IDs unless you bake them into the corpus prompt). For background on the chunking decisions you do not have to make in CAG, see the advanced RAG chunking guide.

The seven axes for choosing RAG vs CAG

A real production decision rarely picks RAG or CAG in the abstract. It picks per workload, against seven axes.

1. Corpus size

Under 500k tokens, CAG fits comfortably with headroom for the query and the response. From 500k to 1.5M tokens, CAG works on Gemini 2.5 and Claude Sonnet 4.5 but starts to feel attention-degraded on niche details. Above 1.5M tokens, RAG is the right answer because long-context recall degrades faster than RAG’s top-k precision on a well-tuned retriever. The mid-band is where hybrid earns its keep.

2. Freshness

CAG inherits the cache’s freshness window. If your corpus changes daily, the cache invalidates daily and the write cost stops amortizing. RAG wins because the vector store re-indexes incrementally and queries see new data within minutes. If your corpus changes quarterly, the cache write amortizes over millions of queries and CAG wins on both cost and latency. The pivot point is roughly: re-write the cache more than once per 10,000 reads and RAG is cheaper.

3. Latency budget

Real-time voice assistants, autocomplete-style inline AI, and low-friction chat UIs need under 500ms time-to-first-token. The retrieval hop in RAG adds 100-500ms even with a well-tuned retriever. CAG removes that hop. If your latency budget is tight, CAG wins by construction. If your budget is over 1s, RAG has headroom to retrieve, rerank, and still feel snappy.

4. Cost shape

High-volume queries against the same corpus pay the cache-write cost once and amortize over the read volume. The cached-input discount (10% for Anthropic, 50% for OpenAI) makes the per-query token cost lower than RAG’s embed-plus-rerank-plus-generation token cost once you cross roughly 10,000 daily queries against the same corpus. Below that, RAG is cheaper because you do not pay the cache-write premium. Long-tail query mixes against many corpora favor RAG for the same reason.

5. Citation auditability

Regulated industries often need per-claim chunk-level citations: “this statement is grounded in section 4.2 of document X retrieved at timestamp Y.” RAG produces this natively because chunks are first-class objects in the pipeline. CAG can fake it by baking document IDs and section headers into the corpus prompt and asking the model to cite by header, but the citation is generated text, not a deterministic chunk lookup. If a regulator wants deterministic auditability per claim, RAG is the safer answer.

6. Query distribution

A broad query distribution where most queries touch most of the corpus favors CAG because the model has the whole context anyway. A niche query distribution where each query touches a different rare document favors RAG because retrieval finds the right doc out of millions. Most production knowledge bases sit closer to broad than niche, which is one reason CAG keeps winning in pilots.

7. Tenant model

A shared corpus that every tenant queries is a CAG layup. One cache, every tenant benefits. A per-tenant corpus where Tenant A and Tenant B must never see each other’s documents kills CAG economically because each tenant needs its own cache write, and the write premium no longer amortizes. RAG keeps the per-tenant model simple. This is the axis where the most production rollouts decide for RAG.

How to eval RAG and CAG with the same rubrics

Both paths need to be evaluated against the same rubrics for the Pareto comparison to mean anything. FutureAGI’s eval surface is built so the same scores attach to both span shapes.

Span shape. RAG produces a fi.span.kind=RETRIEVER span (with the retrieved chunks as span attributes) followed by an fi.span.kind=LLM span for the generation. CAG produces a single fi.span.kind=LLM span with llm.cache_hit=true and the cached input token count as attributes. traceAI captures both shapes through the same SDK call and the traceAI open-source repo defines the schema.

Shared rubrics. Four metrics work on both paths and should run on every span:

Groundedness scores whether the response is anchored in the supplied context (retrieved chunks for RAG, cached corpus for CAG).
ContextAdherence scores whether the response stays inside the supplied context versus wandering into model prior knowledge.
Completeness scores whether the response covers the required information from the supplied context.
FactualAccuracy scores claim-level factual correctness against the supplied context.

These four belong on every RAG and every CAG trace, and the FutureAGI evals docs cover the prompts that ship out of the box.

RAG-only rubrics. ChunkAttribution and ChunkUtilization are RAG-specific because CAG has no chunk hop. Skip them on CAG spans rather than scoring against a synthetic chunk.

CAG-specific rubrics. Two failure modes are unique to CAG and earn custom rubrics via CustomLLMJudge:

CAGContextDriftDetection scores whether the model has drifted past the cached corpus after long turns or many follow-ups. Pair it with ContextAdherence for the highest signal.
RAGCitationIntegrity (run on RAG only) scores whether the response’s cited chunks match the chunks it actually used.

Pareto comparison. For each workload, build a labeled set of queries (50-500 is enough to start), run RAG and CAG against the same set, score both with the same rubrics, and compute a cost-quality-latency Pareto. Plot the four metrics against the gateway-measured x-prism-cost and x-prism-latency-ms headers per call. The winner on your workload is the path that dominates the others on the axes that matter for that workload. The RAG evaluation in CI guide covers how to wire the same scores into pre-merge gates.

The hybrid CAG plus RAG pattern

Most production teams arrive at a hybrid router rather than picking one path. The pattern:

CAG hot path. Queries that fall inside a similarity threshold for the cached corpus route to the CAG path and return in roughly 200ms. The threshold is tuned against a representative query set so that 70-90% of production traffic hits the cache.
RAG cold path. Queries that fall outside the threshold route to the RAG fallback, retrieve the long-tail document, and return in roughly 1s. This is where rare entities, low-frequency topics, and freshness-sensitive queries live.
Router classifier. A small embedding-based classifier or a lightweight LLM judge decides between the two paths on the query alone, before any LLM call fires. The router runs in under 30ms in practice.
Trace symmetry. traceAI captures both paths with the same span attributes. The router decision lands as a span attribute (router.path=cag or router.path=rag), and the eval suite scores both paths consistently. The Agent Command Center gateway is where most teams put the router because the gateway already sits in front of both paths.
Platform feedback loop. The self-improving evaluators on the FutureAGI Platform retune the router threshold from production feedback over time. As query distribution shifts, the threshold shifts with it.

The hybrid pattern wins when the query distribution is bimodal (a fat hot tail plus a long cold tail), which describes most production knowledge bases, most support chatbots, and most product-help workloads. It loses when the distribution is uniform (CAG wins outright if the corpus fits) or when every tenant needs its own corpus (RAG wins outright because the cache premium kills CAG).

The five-step decision workflow

A concrete sequence to run this decision against your workload.

Step 1: Characterize your workload on the seven axes. Write down the corpus size in tokens, the change rate, the latency budget, the daily query volume, the citation requirement, the query distribution shape, and the tenant model. This is a 20-minute exercise and it kills 50% of decisions before any code runs.

Step 2: Prototype both paths with the same eval suite. Run a RAG-only baseline and a CAG-only baseline against the same 100-500 query labeled set. Score both with Groundedness, ContextAdherence, Completeness, and FactualAccuracy. Capture the gateway cost and latency headers per call.

Step 3: Measure the cost-quality-latency Pareto. Plot the four eval scores against cost and latency for both paths. If one path dominates on every axis that matters for your workload, ship that path. If neither dominates, you have a hybrid candidate.

Step 4: Configure the router or ship the single path. For hybrid, tune the router similarity threshold so that 70-90% of representative queries route to CAG and the rest fall back to RAG. For single-path, ship the winner and move on.

Step 5: Monitor with Error Feed and retune over time. Production query distributions drift. Cache freshness windows drift. Model context behavior drifts on minor model updates. The eval suite has to keep running on a sample of production traces, and the router threshold has to retune as the distribution shifts.

How FutureAGI grounds the RAG-vs-CAG decision

The platform treats RAG and CAG as peer architectures rather than first-class RAG and an afterthought for CAG.

traceAI captures both span shapes. RAG’s RETRIEVER plus LLM span tree and CAG’s single LLM span with llm.cache_hit=true both flow through the same OpenTelemetry-native SDK. The traceAI repo ships under Apache 2.0 with span conventions for both.

Eval rubrics work on both. Groundedness, ContextAdherence, Completeness, FactualAccuracy are first-class metrics on both paths, and ChunkAttribution and ChunkUtilization apply on RAG only. Custom rubrics for CAG drift detection ship via CustomLLMJudge.

Agent Command Center serves both paths. The gateway sits in front of the model layer and routes between CAG and RAG paths per query. The headers x-prism-cost and x-prism-latency-ms are returned on every response so the cost-quality-latency Pareto is measurable per call instead of estimated. The gateway ships 4 semantic-cache backends (Redis, Memcached, in-memory, and Vector) and 6 exact-cache backends, all of which serve both RAG and CAG workloads transparently.

Error Feed clusters CAG and RAG failures separately. Failed traces from each path land in the same Error Feed and HDBSCAN soft clustering separates them by failure mode. CAG clusters surface patterns like “model forgot cached context after turn 8” and “cached corpus stale on time-sensitive query.” RAG clusters surface patterns like “retrieval missed relevant doc” and “reranker downranked correct chunk.” A Sonnet 4.5 Judge writes an immediate_fix per cluster and the self-improving evaluators on the Platform retune router thresholds and eval prompts from the feedback loop. Today the Error Feed integration to ticketing is Linear-only; other destinations are roadmap.

Six optimizers ship for prompt tuning today. Eval-driven optimization on RAG retrieval prompts and CAG router classifier prompts runs today via the optimizer surface in the Platform. The trace-stream-to-agent-opt connector that lets the optimizer pull live production traces directly is on the roadmap.

Anti-patterns to avoid

A short list of failure modes that show up repeatedly in CAG rollouts.

Assuming CAG always wins because long context is available. Long context makes CAG viable for stable corpora. It does not make CAG correct for freshness-sensitive, per-tenant, or citation-heavy workloads. Run the seven-axis check before assuming.

Ignoring per-tenant cache cost. Each tenant needing its own cached corpus turns the cache-write premium into a per-tenant tax that destroys the economic case. If the workload is per-tenant, RAG is almost always the right answer.

Skipping a freshness eval. A stale CAG cache returns confidently wrong answers on time-sensitive queries with no error signal. The freshness rubric has to run on every release and flag stale-cache regressions before they ship.

Skipping a router stability test. CAG-vs-RAG router thresholds drift in production as the query distribution shifts. Without a stability test, the router silently degrades, and the cost-quality-latency Pareto silently moves. Eval the drift, do not assume it.

Treating RAG as legacy. RAG is the right architecture for a large set of 2026 workloads. CAG is not a replacement, it is a peer. The teams that frame this as “RAG is dead, long context wins” ship the wrong architecture for half their workloads and back-port RAG within a quarter.

Where this lands

The RAG-vs-CAG question in 2026 is a workload-specific decision on seven axes: corpus size, freshness, latency budget, cost shape, citation auditability, query distribution, and tenant model. CAG wins where corpora are small, stable, broadly queried, and shared. RAG wins where corpora are large, fresh, per-tenant, or citation-heavy. Most production teams ship a hybrid router that uses CAG for the hot path and RAG for the cold path, eval both paths with the same rubrics, and let the Platform retune the router from production feedback. The architecture choice is decided by the eval Pareto on the workload, not the leaderboard rank on the context window.

FutureAGI treats both as peer architectures. traceAI captures both span shapes through the same SDK. Groundedness, ContextAdherence, Completeness, and FactualAccuracy score both consistently. Agent Command Center routes between them per query and reports the cost and latency per call. Error Feed clusters failures by path and the Platform’s self-improving evaluators retune router thresholds and eval prompts from the production loop. The open-source platform ships under Apache 2.0 if the eval and observability layer needs to live inside your VPC.

Start with the seven-axis check. Prototype both paths against the same eval suite. Ship the Pareto winner per workload. The right architecture for your workload is the one the eval data picks, not the one the model release notes pick.

Frequently asked questions

What is Cache-Augmented Generation (CAG)?

CAG loads the entire reference corpus into a long-context model prompt once, caches that prompt via Anthropic prompt caching or OpenAI prompt caching, and serves every query against the cached context with no retrieval step. The cache hit returns the full corpus to the model in sub-second time and is billed at a discounted cached-input rate. Where RAG retrieves a few chunks per query, CAG stuffs the whole corpus once and reuses it across thousands of queries.

When does CAG beat RAG in 2026?

CAG wins on small stable corpora (under 500k tokens), high query volume against the same corpus, tight latency budgets (under 500ms), and broad query distributions. RAG wins on large corpora, freshness-sensitive workloads, per-tenant corpora, regulator-driven per-claim citation requirements, and long-tail niche queries where retrieval matters. Most production teams end up with a hybrid router that uses CAG for hot-path queries and RAG for cold-path queries.

Does long context kill RAG?

No. Long context makes CAG viable for stable small-to-medium corpora, but RAG is still the right answer for large corpora, frequently changing corpora, per-tenant data, and workloads that need chunk-level citation auditability. Long context is a new option, not a replacement. The 2026 question is not RAG or long context, it is RAG, CAG, or hybrid, picked per workload on cost, latency, freshness, and citation needs.

How do I evaluate CAG with FutureAGI?

Use the same traceAI spans you use for RAG, with fi.span.kind=LLM and an llm.cache_hit attribute on the cached call. Run Groundedness, ContextAdherence, Completeness, and FactualAccuracy on the response. Pair Groundedness with ContextAdherence to detect when the model wanders past its cached context after long turns. Add a custom CAGContextDriftDetection rubric via CustomLLMJudge for the specific drift pattern your workload sees.

What is the hybrid CAG plus RAG pattern?

A router classifies each query against a similarity threshold tuned for the cached corpus. If the query falls inside the CAG-served class, the cached path returns in about 200ms. If it falls outside, the RAG path retrieves the long-tail document and returns in about 1s. traceAI captures both paths, the eval suite scores both consistently, and the Platform retunes the router threshold from production feedback over time. Most production teams arrive here rather than picking one path.

What are the anti-patterns to avoid with CAG?

Assuming CAG always wins because long context is available, which works only for stable corpora. Ignoring per-tenant cache cost, where CAG balloons when each tenant needs its own cached corpus. Skipping a freshness eval, where stale CAG ships outdated answers on time-sensitive queries. Skipping a router stability test, where CAG-vs-RAG router thresholds drift in production and silently degrade quality. Eval the drift, do not assume it.

Can the same eval rubrics score RAG and CAG?

Yes. Groundedness, ContextAdherence, Completeness, and FactualAccuracy work on both. ChunkAttribution and ChunkUtilization are RAG-specific because CAG has no chunk hop. For CAG-specific failure modes like cached-context drift after long turns or stale-cache time-sensitive answers, add custom rubrics via CustomLLMJudge so the eval suite has parity for both paths and the cost-quality-latency Pareto comparison is fair.

View all

Research

Best Rerankers for RAG in 2026: 7 Models Compared

Cohere Rerank 4, BGE Reranker v2-m3, Jina v2, ColBERT, Voyage rerank-2.5, mxbai, Qwen3 reranker compared on RAG-eval lift, latency, license, multilingual.

Vrinda Damani · Jul 29, 2025

12 min

Research

Best LLMs of June 2026: Top Closed-Source, Open-Weight, Multimodal, and Coding Picks

Best LLMs of June 2026 by use case: Claude Fable 5 for raw coding, GLM-5.2 for open-weight value, GPT-5.5 for agents, Gemini 3.1 Pro for long-context multimodal.

Vrinda Damani · Jul 5, 2026

24 min

Research

Best Voice AI Models in June 2026: STT, TTS, and Voice Agent Stack

Best Voice AI June 2026: Deepgram Nova-3 for streaming STT, Cartesia Sonic-3.5 for TTS, Retell for voice agents, plus latency budgets and cost at scale.

Vrinda Damani · Jul 5, 2026

22 min