Evaluating RAG Chunking Strategies in 2026
Chunking is a domain question, not an SDK choice. Fixed-size loses on legal. Semantic wins prose. Clause-level wins contracts. Late-interaction wins code and multilingual.
Table of Contents
You shipped a RAG pipeline last quarter. The LangChain tutorial said 512 tokens with 50-token overlap, so that’s what you used. The bot answers FAQ-style questions. It also returns the indemnity clause when asked about termination, hallucinates citations on the multilingual compliance docs, and grounds answers in chunks whose body contradicts the section header. The team meeting blames the embedding model. The trace says the chunker cut the supporting clause across two chunks and the retriever returned the wrong half.
The chunk boundary was baked into the index at ingest time, before any embedding, retriever, or reranker decision got made. From that point on, every downstream lever is constrained by what ingestion produced. Tuning the reranker after that is rearranging chairs on a chunk surface that already lost the recall.
The opinion this post earns: chunking strategy is a domain question, not an SDK choice. Fixed-size loses on legal and medical because it cuts clauses and clinical sections. Semantic wins on long-form prose where topic boundaries beat markup. Clause-level wins on contracts where the legal unit is the chunk. Late-interaction wins on multilingual and code-mixed corpora where exact-token signal flattens under pooled embeddings. Pick by document structure, not by what the tutorial showed.
This guide is the eval methodology. Four strategies, three rubrics that separate retrieval failure from generation failure, a domain decision matrix, and production patterns that keep the choice honest after deploy. Code shaped against LlamaIndex, LangChain, and the Future AGI ai-evaluation SDK.
TL;DR: the strategy-by-domain matrix
| Domain | Winning strategy | Why |
|---|---|---|
| Contracts, regulations, clinical notes, SOPs | Clause / section-level | The clause is the legal or procedural unit |
| Research papers, long prose, transcripts, books | Semantic | Topic boundaries beat markup |
| Marketing copy, FAQs, product docs | Fixed-512 with overlap | Uniform paragraphs, fast, queries are lookup |
| Code, code + comments, mixed languages | Late-interaction (ColBERT) | Token-level scoring preserves identifiers |
| Multilingual (Spanish-English, etc.) | Late-interaction (Jina-ColBERT-v2) | Per-token cross-lingual matching |
| Tables, structured PDFs | Structure-aware + clause-level | Row and section integrity |
Two non-negotiables across every domain: per-tenant filtering on every query (cross-tenant leaks are a configuration class, not a model class), and ChunkAttribution plus ChunkUtilization scored on every PR so the bisect between retrieval and generator failure collapses to a five-minute job.
Why one chunking strategy does not fit
The dominant pattern in RAG tutorials is a five-line RecursiveCharacterTextSplitter call with chunk_size=512 and chunk_overlap=50. Fine for the workload the tutorial author tested it on. Wrong starting point for almost every production workload that isn’t a uniform short-paragraph blog corpus.
The mismatch shows up three ways:
- The wrong chunk gets retrieved. A user asks about Section 12 of an MSA. Vector similarity surfaces a chunk whose back half is Section 9 and front half is Section 12. The generator grounds in the wrong half. Groundedness gives it a passing score. Counsel catches the bug in five seconds.
- The right chunks are retrieved but the model uses the wrong ones. Top-5 returns relevant chunks at ranks 1 and 4; the generator latches onto rank 1 and ignores rank 4. The answer omits a critical clause. Groundedness still passes.
- A citation points at a chunk whose body does not contain the quoted span. The generator hallucinated the quote and pinned it to a real section header. Structured output passed. Schema validation passed. The fabricated citation looks identical to a real one.
The first two are retrieval failures masquerading as generation failures. The third is a generation failure that needs a deterministic check, not an LLM judge. None of them surface if your eval suite only scores faithfulness on the final answer. Treat chunking as a hyperparameter you sweep per document type, not a global setting copied from the quickstart. Skipping the ablation costs 8 to 15 points of recall@5 in corpora we have measured with customers.
The four strategies that cover the design space
Six strategies is too many; one is too few. Four covers the design space well enough that adding more rarely changes which family wins for a given domain.
1. Fixed-size (the baseline you put in the sweep)
Splits by token count with optional overlap. The default in most tutorials. Predictable, fast, easy to reason about. Ignores semantic structure entirely, so it slices clauses, code blocks, tables, and clinical sections wherever the counter hits the limit. Wins on uniform short-paragraph corpora (marketing copy, product FAQs, internal wikis with disciplined formatting). Loses everywhere the document has structure the splitter ignores. Run three sizes (256, 512, 1024) for the size-sensitivity curve. Fixed-512 with 50-token overlap is the right baseline, not the right default.
2. Semantic chunking (the long-prose winner)
Sentence-level embedding plus clustering. Adjacent sentences with high embedding similarity group into one chunk; the boundary lands where the topic shifts. Start and end are semantically coherent. Wins on long-form prose where author-intended sections are coarser than the topic shifts (research papers, transcripts, books, narrative knowledge bases). Loses on documents without clean sentence breaks (raw HTML scrapes, OCR’d PDFs, code). Costs roughly 3x to 5x the ingest compute of fixed-size; cross-encoder boundary detection and cached-sentence variants cut the premium.
3. Clause-level chunking (the legal and medical winner)
Treat the clause, section, or procedural unit as the chunk. One legal object per retrieval surface. The chunk ID is the section path, not a hash. Cross-references attach as metadata; defined terms resolve against the definitions index at chunk time. Wins on contracts, regulations, SOPs, clinical notes, compliance docs, and any corpus where the domain’s unit of reasoning is a discrete document section. Lift over fixed-size on identifier-heavy legal queries is 8 to 15 points of recall@k in contract corpora we measured with customers; the contract review RAG guide covers the segmenter pattern. Loses on documents without a clean section hierarchy.
4. Late-interaction retrieval (the multilingual and code winner)
ColBERT and ColBERTv2 store per-token embeddings instead of one pooled vector per chunk. At query time the maximum similarity across all token pairs scores the chunk, which preserves exact-token signal pooled embeddings flatten away. Newer variants (Jina-ColBERT-v2 for multilingual, PLAID for compressed indexes) cut the storage premium that kept ColBERT off most teams’ shortlists. Wins on multilingual corpora, code corpora where identifier matches inside semantic paragraphs decide the answer, mixed code-and-prose docs, and very long chunks. On a Spanish-English insurance corpus, Jina-ColBERT-v2 lifted recall@5 by 11 points over a single-vector multilingual embedder; on English code with Russian comments the lift was 14 points. Costs storage: 4x to 10x a single-vector index, though PLAID brings it to 2x to 3x at recall parity.
The eval methodology: three rubrics that earn their keep
Most teams write one rubric (faithfulness on the final answer) and regress for three quarters because they can’t tell whether the chunker, retriever, prompt, or model moved. Three rubrics, split by layer, collapse the bisect to a five-minute job.
Recall@k on a golden retrieval set. For each query, did the retriever return the chunk containing the supporting span within the top k? Track recall@1, recall@5, recall@10, plus MRR and NDCG. Compute per-strategy and per-domain.
ChunkAttribution. Which retrieved chunks the answer actually used. A low score with high ContextRelevance means the retriever returned good chunks but the generator only used one of them.
ChunkUtilization. What fraction of retrieved-and-relevant content the answer covered. A low score with high ChunkAttribution means the generator picked the right chunk but ignored half its supporting span.
These three separate retrieval failures from generator failures. Add ContextRelevance, Groundedness, ContextAdherence, Completeness, and FactualAccuracy for the full RAG suite, plus a deterministic citation-validity check on every cited span. The PDF QA chatbot playbook covers rubric definitions and threshold-setting.
from fi.evals import Evaluator
from fi.evals.templates import (
ContextRelevance, ChunkAttribution, ChunkUtilization,
Groundedness, ContextAdherence, Completeness, FactualAccuracy,
)
from fi.testcases import TestCase
evaluator = Evaluator() # FI_API_KEY / FI_SECRET_KEY from env
RUBRICS = {
"context_relevance": (ContextRelevance(), 0.80),
"chunk_attribution": (ChunkAttribution(), 0.75),
"chunk_utilization": (ChunkUtilization(), 0.70),
"groundedness": (Groundedness(), 0.85),
"context_adherence": (ContextAdherence(), 0.85),
"completeness": (Completeness(), 0.75),
"factual_accuracy": (FactualAccuracy(), 0.85),
}
def test_chunking_strategy(strategy, golden_set):
failures = []
for ex in golden_set:
chunks = retrieve(strategy, ex.tenant_id, ex.question, k=10)
ctx = "\n\n".join(c["text"] for c in chunks)
answer = generate(ex.question, chunks)
tc = TestCase(input=ex.question, output=answer, context=ctx)
for name, (template, floor) in RUBRICS.items():
score = evaluator.evaluate(eval_templates=[template], inputs=[tc]) \
.eval_results[0].metrics[0].value
if score < floor:
failures.append((ex.id, ex.domain, name, score))
return failures
Three habits separate a working CI gate from theatre. Per-rubric thresholds per domain. Contracts need a tighter Groundedness floor than marketing FAQs because malpractice exposure lives in the indemnity wording. Scope by route. A PR touching the contract bot prompt does not rerun the SOP suite. Diff against a moving baseline. Models drift; the baseline drifts with them; the gate catches regressions relative to moving truth, not a frozen number from last quarter.
The golden set: 200 to 500 query-to-expected-chunk pairs sampled from production traces (not invented; if you write queries yourself you encode your assumptions about how the system gets used). Span-level labels so a chunk only counts as a hit if it contains the supporting text. Stratified across document types and refreshed monthly. The synthetic test data guide covers scaling without losing signal.
Domain-by-domain decision: how to pick per route
The point of the sweep is not the leaderboard. It is the per-domain breakdown that tells you which strategy ships per route. Three patterns recur:
- Legal and medical: clause or section-level. Contracts, MSAs, regulations, clinical notes, drug labels. The unit of reasoning is the clause or labelled section. Fixed-size cuts a contraindication across the previous section’s boundary and the retriever returns half a warning. Add late-interaction on top for drug-name and ICD-code queries where token-level matching beats pooled embeddings.
- Long-form prose: semantic. Research papers, transcripts, books, narrative knowledge bases. Topic structure is coarser than the markup; sentence-similarity clustering finds the boundary where the discussion shifts. The 3x to 5x ingest premium is the only reason not to use it.
- Code and multilingual: late-interaction. Source code, code + comments, mixed languages, Spanish-English corpora. Identifier matches inside semantic paragraphs decide the answer. Jina-ColBERT-v2 is the strongest open-source option as of May 2026; mE5 with late-interaction overlay is the runner-up.
Marketing and product docs (FAQs, blog posts, product pages) are the one place fixed-512 with 50-token overlap is the right default. Paragraphs are short, formatting is consistent, queries are mostly lookup. Move on.
Tables and structured PDFs need structure-aware parsing first, then route per element type: tables become their own chunks (embed on caption plus header row); prose sections route to semantic or clause-level. Parser quality dominates everything else; the PDF QA chatbot guide covers the parser comparison matrix.
Route by document type before chunking, not after. A single splitter applied uniformly is the source of most chunking-eval failures we see in production.
Production patterns: bridge the same rubrics to live traces
The CI gate catches regressions you can think of. Production catches everything else. Same rubric, two places: CI against the versioned golden set before deploy, production observation against live traces after.
traceAI (Apache 2.0) ships auto-instrumentation for LlamaIndex, LangChain, LangGraph, and Haystack that emits OpenTelemetry spans for every retriever, query engine, and LLM call without manual span creation. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY) ingest the same spans into Phoenix or Traceloop without re-instrumenting; 14 span kinds include a first-class RETRIEVER that carries chunk strategy and size attributes per-span.
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_llamaindex import LlamaIndexInstrumentor
trace_provider = register(
project_type=ProjectType.OBSERVE,
project_name="rag_chunking_prod",
)
LlamaIndexInstrumentor().instrument(tracer_provider=trace_provider)
Attach the same Evaluator rubrics as span-attached scorers via EvalTag; the verdict lives on the trace next to latency, model, and chunk strategy. Sample 5 to 10 percent of production traffic for LLM-as-judge rubrics; deterministic rubrics like citation validity run on 100 percent. Alarm on a 2 to 5 point drop in rolling-mean per rubric per route over 30 to 90 minutes. Drift between offline pass and online drop is a quality signal of its own; a widening gap on the contract route means production users are asking questions your golden set doesn’t cover.
Three production-only signals beyond CI:
- Refusal rate per route. A drift up on a document type means retrieval started missing or the refusal head over-calibrated on a recent prompt change.
- Cost per correct answer. Tag every gateway call with chunk strategy and read
cost / correct_answer_countper strategy. The ratio that matters is correct-answer per dollar, not recall per query. - Citation validity rate. Should sit near 1.0. Drift means the generator started fabricating quotes the validator catches. Cheapest rubric in the stack.
Closing the loop: failure clusters become dataset entries
The loop is what makes the playbook compound. Without it, every incident produces a one-off fix and the team writes the same regression twice.
Error Feed sits inside the Future AGI eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every trace failure into a named cluster. The interesting output of a chunking sweep is not the leaderboard; it is the failure clusters, which read like an engineering postmortem:
- “fixed-512 splits regulatory clauses across boundaries on MSA route”
- “semantic chunking drops compute-heavy docs that lack clear sentence breaks”
- “late-interaction recall lift evaporates on chunks shorter than 80 tokens”
- “clause-level chunking loses on schedule-only contracts where the segmenter misclassifies headers”
A Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools) reads the failing traces and writes the RCA, evidence quotes, and an immediate_fix. Fixes feed the Platform’s self-improving evaluators so the rubric ages with your product; representative traces promote into the golden set, and the next PR touching the offending route has to clear the new entries. Linear ships today; Slack, GitHub, Jira, and PagerDuty land on the roadmap.
Three deliberate tradeoffs
- Per-domain chunking costs build effort, not index size. Routing by document type to four chunkers requires a classifier on ingest and a router on retrieve. Both amortise across the corpus. The lift is 5 to 15 points of recall on the routes that win their right strategy; the alternative is a single splitter that loses on every route except one.
- Late-interaction costs storage. ColBERT is 4x to 10x the index size of a single-vector index; PLAID brings it to 2x to 3x at recall parity. The lift on multilingual and code-mixed corpora is 8 to 14 points. New deployments can start single-vector and add ColBERT for the routes where production traces show pooled embeddings losing.
- Sweeping costs eval budget. Four strategies times three sizes times 200 to 500 queries is more LLM-as-judge calls than scoring one strategy on the final answer. The payoff is debuggable per-route winners. Classifier-backed evals at lower per-eval cost than Galileo Luna-2 and four distributed runners collapse a multi-hour serial run to minutes.
How Future AGI ships chunking evaluation
Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.
- ai-evaluation SDK (Apache 2.0):
from fi.evals import Evaluator, thenevaluator.evaluate(eval_templates=[...], inputs=[TestCase(...)]). Seven RAG-specificEvalTemplateclasses (Groundedness,ContextAdherence,ContextRelevance,Completeness,ChunkAttribution,ChunkUtilization,FactualAccuracy) plus 60+ total;CustomLLMJudgewith Jinja2 grading for boundary-coherence rubrics; four distributed runners (Celery, Ray, Temporal, Kubernetes). - Future AGI Platform: self-improving evaluators tuned by thumbs up/down feedback; in-product authoring agent writes chunking-eval rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- traceAI (Apache 2.0): auto-instrumentation for LlamaIndex, LangChain, LangGraph, Haystack. 50+ AI surfaces across Python, TypeScript, Java, C#. 14 span kinds with a first-class
RETRIEVERthat carries chunk strategy and size attributes per-span. - Error Feed (inside the eval stack): HDBSCAN clustering plus Sonnet 4.5 Judge writes the
immediate_fix; fixes feed the Platform’s self-improving evaluators. - Agent Command Center: OpenAI-compatible gateway as a single Go binary (Apache 2.0). 100+ providers, 18+ built-in guardrail scanners plus 15 third-party adapters. Self-hosts in your VPC. SOC 2 Type II, HIPAA, GDPR, CCPA certified (ISO/IEC 27001 in active audit).
Ready to run your first chunking sweep? Wire ContextRelevance, ChunkAttribution, ChunkUtilization, and Groundedness into a pytest fixture this afternoon against the ai-evaluation SDK, stratify the golden set by document type, then add the traceAI instrumentor when production traces start asking questions the CI gate missed.
Related reading
- The 2026 LLM Evaluation Playbook
- How to Build (and Evaluate) a PDF QA Chatbot in 2026
- How to Build (and Evaluate) a Contract Review RAG Agent in 2026
- Advanced Chunking Techniques for RAG
- Best Rerankers for RAG (2026)
- Best Vector Databases for RAG (2026)
- Evaluating Vector Database Recall Quality (2026)
- Synthetic Test Data for LLM Evaluation (2026)
Frequently asked questions
Which chunking strategy should I use for RAG in 2026?
Why does fixed-size chunking fail on legal and medical documents?
What is late-interaction retrieval and when does it win?
How do I evaluate a chunking strategy properly?
Should chunk size be a hyperparameter?
How does Future AGI evaluate RAG chunking strategies?
Why is chunking the under-tested decision in RAG?
Contract review RAG in 2026: clause-level retrieval, citation enforcement, the eval suite in-house counsel will sign off, plus the LangGraph wiring to live OTel traces.
Customer support eval in 2026: escalation taxonomy first, clause-level retrieval, tool-call correctness on Zendesk and Intercom, paired Containment and False-Resolution rates.
The definitive 2026 reference: three generation patterns (persona, taxonomy-stratified, evolution), the filter that survives, calibration against real, and three use cases.