Guides

Evaluating RAG Chunking Strategies in 2026

Chunking is a domain question. Fixed-size loses on legal, semantic wins prose, clause-level wins contracts, late-interaction wins code.

May 14, 2026

Updated May 20, 2026

13 min read

rag chunking llm-evaluation retrieval late-interaction 2026

Table of Contents

You shipped a RAG pipeline last quarter. The LangChain tutorial said 512 tokens with 50-token overlap, so that’s what you used. The bot answers FAQ-style questions. It also returns the indemnity clause when asked about termination, hallucinates citations on the multilingual compliance docs, and grounds answers in chunks whose body contradicts the section header. The team meeting blames the embedding model. The trace says the chunker cut the supporting clause across two chunks and the retriever returned the wrong half.

The chunk boundary was baked into the index at ingest time, before any embedding, retriever, or reranker decision got made. From that point on, every downstream lever is constrained by what ingestion produced. Tuning the reranker after that is rearranging chairs on a chunk surface that already lost the recall.

The opinion this post earns: chunking strategy is a domain question, not an SDK choice. Fixed-size loses on legal and medical because it cuts clauses and clinical sections. Semantic wins on long-form prose where topic boundaries beat markup. Clause-level wins on contracts where the legal unit is the chunk. Late-interaction wins on multilingual and code-mixed corpora where exact-token signal flattens under pooled embeddings. Pick by document structure, not by what the tutorial showed.

This guide is the eval methodology. Four strategies, three rubrics that separate retrieval failure from generation failure, a domain decision matrix, and production patterns that keep the choice honest after deploy. Code shaped against LlamaIndex, LangChain, and the Future AGI ai-evaluation SDK.

TL;DR: the strategy-by-domain matrix

Domain	Winning strategy	Why
Contracts, regulations, clinical notes, SOPs	Clause / section-level	The clause is the legal or procedural unit
Research papers, long prose, transcripts, books	Semantic	Topic boundaries beat markup
Marketing copy, FAQs, product docs	Fixed-512 with overlap	Uniform paragraphs, fast, queries are lookup
Code, code + comments, mixed languages	Late-interaction (ColBERT)	Token-level scoring preserves identifiers
Multilingual (Spanish-English, etc.)	Late-interaction (Jina-ColBERT-v2)	Per-token cross-lingual matching
Tables, structured PDFs	Structure-aware + clause-level	Row and section integrity

Two non-negotiables across every domain: per-tenant filtering on every query (cross-tenant leaks are a configuration class, not a model class), and ChunkAttribution plus ChunkUtilization scored on every PR so the bisect between retrieval and generator failure collapses to a five-minute job.

Why one chunking strategy does not fit

The dominant pattern in RAG tutorials is a five-line RecursiveCharacterTextSplitter call with chunk_size=512 and chunk_overlap=50. Fine for the workload the tutorial author tested it on. Wrong starting point for almost every production workload that isn’t a uniform short-paragraph blog corpus.

The mismatch shows up three ways:

The wrong chunk gets retrieved. A user asks about Section 12 of an MSA. Vector similarity surfaces a chunk whose back half is Section 9 and front half is Section 12. The generator grounds in the wrong half. Groundedness gives it a passing score. Counsel catches the bug in five seconds.
The right chunks are retrieved but the model uses the wrong ones. Top-5 returns relevant chunks at ranks 1 and 4; the generator latches onto rank 1 and ignores rank 4. The answer omits a critical clause. Groundedness still passes.
A citation points at a chunk whose body does not contain the quoted span. The generator hallucinated the quote and pinned it to a real section header. Structured output passed. Schema validation passed. The fabricated citation looks identical to a real one.

The first two are retrieval failures masquerading as generation failures. The third is a generation failure that needs a deterministic check, not an LLM judge. None of them surface if your eval suite only scores faithfulness on the final answer. Treat chunking as a hyperparameter you sweep per document type, not a global setting copied from the quickstart. Skipping the ablation costs 8 to 15 points of recall@5 in corpora we have measured with customers.

The four strategies that cover the design space

Six strategies is too many; one is too few. Four covers the design space well enough that adding more rarely changes which family wins for a given domain.

1. Fixed-size (the baseline you put in the sweep)

Splits by token count with optional overlap. The default in most tutorials. Predictable, fast, easy to reason about. Ignores semantic structure entirely, so it slices clauses, code blocks, tables, and clinical sections wherever the counter hits the limit. Wins on uniform short-paragraph corpora (marketing copy, product FAQs, internal wikis with disciplined formatting). Loses everywhere the document has structure the splitter ignores. Run three sizes (256, 512, 1024) for the size-sensitivity curve. Fixed-512 with 50-token overlap is the right baseline, not the right default.

2. Semantic chunking (the long-prose winner)

Sentence-level embedding plus clustering. Adjacent sentences with high embedding similarity group into one chunk; the boundary lands where the topic shifts. Start and end are semantically coherent. Wins on long-form prose where author-intended sections are coarser than the topic shifts (research papers, transcripts, books, narrative knowledge bases). Loses on documents without clean sentence breaks (raw HTML scrapes, OCR’d PDFs, code). Costs roughly 3x to 5x the ingest compute of fixed-size; cross-encoder boundary detection and cached-sentence variants cut the premium.

3. Clause-level chunking (the legal and medical winner)

Treat the clause, section, or procedural unit as the chunk. One legal object per retrieval surface. The chunk ID is the section path, not a hash. Cross-references attach as metadata; defined terms resolve against the definitions index at chunk time. Wins on contracts, regulations, SOPs, clinical notes, compliance docs, and any corpus where the domain’s unit of reasoning is a discrete document section. Lift over fixed-size on identifier-heavy legal queries is 8 to 15 points of recall@k in contract corpora we measured with customers; the contract review RAG guide covers the segmenter pattern. Loses on documents without a clean section hierarchy.

4. Late-interaction retrieval (the multilingual and code winner)

ColBERT and ColBERTv2 store per-token embeddings instead of one pooled vector per chunk. At query time the maximum similarity across all token pairs scores the chunk, which preserves exact-token signal pooled embeddings flatten away. Newer variants (Jina-ColBERT-v2 for multilingual, PLAID for compressed indexes) cut the storage premium that kept ColBERT off most teams’ shortlists. Wins on multilingual corpora, code corpora where identifier matches inside semantic paragraphs decide the answer, mixed code-and-prose docs, and very long chunks. On a Spanish-English insurance corpus, Jina-ColBERT-v2 lifted recall@5 by 11 points over a single-vector multilingual embedder; on English code with Russian comments the lift was 14 points. Costs storage: 4x to 10x a single-vector index, though PLAID brings it to 2x to 3x at recall parity.

The eval methodology: three rubrics that earn their keep

Most teams write one rubric (faithfulness on the final answer) and regress for three quarters because they can’t tell whether the chunker, retriever, prompt, or model moved. Three rubrics, split by layer, collapse the bisect to a five-minute job.

Recall@k on a golden retrieval set. For each query, did the retriever return the chunk containing the supporting span within the top k? Track recall@1, recall@5, recall@10, plus MRR and NDCG. Compute per-strategy and per-domain.

ChunkAttribution. Which retrieved chunks the answer actually used. A low score with high ContextRelevance means the retriever returned good chunks but the generator only used one of them.

ChunkUtilization. What fraction of retrieved-and-relevant content the answer covered. A low score with high ChunkAttribution means the generator picked the right chunk but ignored half its supporting span.

These three separate retrieval failures from generator failures. Add ContextRelevance, Groundedness, ContextAdherence, Completeness, and FactualAccuracy for the full RAG suite, plus a deterministic citation-validity check on every cited span. The PDF QA chatbot playbook covers rubric definitions and threshold-setting.

from fi.evals import Evaluator
from fi.evals.templates import (
    ContextRelevance, ChunkAttribution, ChunkUtilization,
    Groundedness, ContextAdherence, Completeness, FactualAccuracy,
)
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

RUBRICS = {
    "context_relevance":  (ContextRelevance(),  0.80),
    "chunk_attribution":  (ChunkAttribution(),  0.75),
    "chunk_utilization":  (ChunkUtilization(),  0.70),
    "groundedness":       (Groundedness(),      0.85),
    "context_adherence":  (ContextAdherence(),  0.85),
    "completeness":       (Completeness(),      0.75),
    "factual_accuracy":   (FactualAccuracy(),   0.85),
}

def test_chunking_strategy(strategy, golden_set):
    failures = []
    for ex in golden_set:
        chunks = retrieve(strategy, ex.tenant_id, ex.question, k=10)
        ctx = "\n\n".join(c["text"] for c in chunks)
        answer = generate(ex.question, chunks)
        tc = TestCase(input=ex.question, output=answer, context=ctx)
        for name, (template, floor) in RUBRICS.items():
            score = evaluator.evaluate(eval_templates=[template], inputs=[tc]) \
                .eval_results[0].metrics[0].value
            if score < floor:
                failures.append((ex.id, ex.domain, name, score))
    return failures

Three habits separate a working CI gate from theatre. Per-rubric thresholds per domain. Contracts need a tighter Groundedness floor than marketing FAQs because malpractice exposure lives in the indemnity wording. Scope by route. A PR touching the contract bot prompt does not rerun the SOP suite. Diff against a moving baseline. Models drift; the baseline drifts with them; the gate catches regressions relative to moving truth, not a frozen number from last quarter.

The golden set: 200 to 500 query-to-expected-chunk pairs sampled from production traces (not invented; if you write queries yourself you encode your assumptions about how the system gets used). Span-level labels so a chunk only counts as a hit if it contains the supporting text. Stratified across document types and refreshed monthly. The synthetic test data guide covers scaling without losing signal.

Domain-by-domain decision: how to pick per route

The point of the sweep is not the leaderboard. It is the per-domain breakdown that tells you which strategy ships per route. Three patterns recur:

Legal and medical: clause or section-level. Contracts, MSAs, regulations, clinical notes, drug labels. The unit of reasoning is the clause or labelled section. Fixed-size cuts a contraindication across the previous section’s boundary and the retriever returns half a warning. Add late-interaction on top for drug-name and ICD-code queries where token-level matching beats pooled embeddings.
Long-form prose: semantic. Research papers, transcripts, books, narrative knowledge bases. Topic structure is coarser than the markup; sentence-similarity clustering finds the boundary where the discussion shifts. The 3x to 5x ingest premium is the only reason not to use it.
Code and multilingual: late-interaction. Source code, code + comments, mixed languages, Spanish-English corpora. Identifier matches inside semantic paragraphs decide the answer. Jina-ColBERT-v2 is the strongest open-source option as of May 2026; mE5 with late-interaction overlay is the runner-up.

Marketing and product docs (FAQs, blog posts, product pages) are the one place fixed-512 with 50-token overlap is the right default. Paragraphs are short, formatting is consistent, queries are mostly lookup. Move on.

Tables and structured PDFs need structure-aware parsing first, then route per element type: tables become their own chunks (embed on caption plus header row); prose sections route to semantic or clause-level. Parser quality dominates everything else; the PDF QA chatbot guide covers the parser comparison matrix.

Route by document type before chunking, not after. A single splitter applied uniformly is the source of most chunking-eval failures we see in production.

Production patterns: bridge the same rubrics to live traces

The CI gate catches regressions you can think of. Production catches everything else. Same rubric, two places: CI against the versioned golden set before deploy, production observation against live traces after.

traceAI (Apache 2.0) ships auto-instrumentation for LlamaIndex, LangChain, LangGraph, and Haystack that emits OpenTelemetry spans for every retriever, query engine, and LLM call without manual span creation. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY) ingest the same spans into Phoenix or Traceloop without re-instrumenting; 14 span kinds include a first-class RETRIEVER that carries chunk strategy and size attributes per-span.

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_llamaindex import LlamaIndexInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="rag_chunking_prod",
)
LlamaIndexInstrumentor().instrument(tracer_provider=trace_provider)

Attach the same Evaluator rubrics as span-attached scorers via EvalTag; the verdict lives on the trace next to latency, model, and chunk strategy. Sample 5 to 10 percent of production traffic for LLM-as-judge rubrics; deterministic rubrics like citation validity run on 100 percent. Alarm on a 2 to 5 point drop in rolling-mean per rubric per route over 30 to 90 minutes. Drift between offline pass and online drop is a quality signal of its own; a widening gap on the contract route means production users are asking questions your golden set doesn’t cover.

Three production-only signals beyond CI:

Refusal rate per route. A drift up on a document type means retrieval started missing or the refusal head over-calibrated on a recent prompt change.
Cost per correct answer. Tag every gateway call with chunk strategy and read cost / correct_answer_count per strategy. The ratio that matters is correct-answer per dollar, not recall per query.
Citation validity rate. Should sit near 1.0. Drift means the generator started fabricating quotes the validator catches. Cheapest rubric in the stack.

Closing the loop: failure clusters become dataset entries

The loop is what makes the playbook compound. Without it, every incident produces a one-off fix and the team writes the same regression twice.

Error Feed sits inside the Future AGI eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every trace failure into a named cluster. The interesting output of a chunking sweep is not the leaderboard; it is the failure clusters, which read like an engineering postmortem:

“fixed-512 splits regulatory clauses across boundaries on MSA route”
“semantic chunking drops compute-heavy docs that lack clear sentence breaks”
“late-interaction recall lift evaporates on chunks shorter than 80 tokens”
“clause-level chunking loses on schedule-only contracts where the segmenter misclassifies headers”

A Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools) reads the failing traces and writes the RCA, evidence quotes, and an immediate_fix. Fixes feed the Platform’s self-improving evaluators so the rubric ages with your product; representative traces promote into the golden set, and the next PR touching the offending route has to clear the new entries. Linear ships today; Slack, GitHub, Jira, and PagerDuty land on the roadmap.

Three deliberate tradeoffs

Per-domain chunking costs build effort, not index size. Routing by document type to four chunkers requires a classifier on ingest and a router on retrieve. Both amortise across the corpus. The lift is 5 to 15 points of recall on the routes that win their right strategy; the alternative is a single splitter that loses on every route except one.
Late-interaction costs storage. ColBERT is 4x to 10x the index size of a single-vector index; PLAID brings it to 2x to 3x at recall parity. The lift on multilingual and code-mixed corpora is 8 to 14 points. New deployments can start single-vector and add ColBERT for the routes where production traces show pooled embeddings losing.
Sweeping costs eval budget. Four strategies times three sizes times 200 to 500 queries is more LLM-as-judge calls than scoring one strategy on the final answer. The payoff is debuggable per-route winners. Classifier-backed evals at lower per-eval cost than Galileo Luna-2 and four distributed runners collapse a multi-hour serial run to minutes.

How Future AGI ships chunking evaluation

Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.

ai-evaluation SDK (Apache 2.0): from fi.evals import Evaluator, then evaluator.evaluate(eval_templates=[...], inputs=[TestCase(...)]). Seven RAG-specific EvalTemplate classes (Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy) plus 60+ total; CustomLLMJudge with Jinja2 grading for boundary-coherence rubrics; four distributed runners (Celery, Ray, Temporal, Kubernetes).
Future AGI Platform: self-improving evaluators tuned by thumbs up/down feedback; in-product authoring agent writes chunking-eval rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
traceAI (Apache 2.0): auto-instrumentation for LlamaIndex, LangChain, LangGraph, Haystack. 50+ AI surfaces across Python, TypeScript, Java, C#. 14 span kinds with a first-class RETRIEVER that carries chunk strategy and size attributes per-span.
Error Feed (inside the eval stack): HDBSCAN clustering plus Sonnet 4.5 Judge writes the immediate_fix; fixes feed the Platform’s self-improving evaluators.
Agent Command Center: OpenAI-compatible gateway as a single Go binary (Apache 2.0). 100+ providers, 18+ built-in guardrail scanners plus 15 third-party adapters. Self-hosts in your VPC. SOC 2 Type II, HIPAA, GDPR, CCPA certified (ISO/IEC 27001 in active audit).

Ready to run your first chunking sweep? Wire ContextRelevance, ChunkAttribution, ChunkUtilization, and Groundedness into a pytest fixture this afternoon against the ai-evaluation SDK, stratify the golden set by document type, then add the traceAI instrumentor when production traces start asking questions the CI gate missed.

Frequently asked questions

Which chunking strategy should I use for RAG in 2026?

The one that matches your document structure. Fixed-size at 512 tokens is fine for uniform short-paragraph prose (blog corpora, product FAQs, marketing copy) and nowhere else. Semantic chunking by sentence-similarity clustering wins on long-form prose where topic boundaries matter more than markup (research papers, transcripts, books). Clause-level chunking wins on contracts, regulations, SOPs, and any document where the legal or procedural unit is the chunk. Late-interaction retrieval (ColBERT-style token-level scoring) wins on multilingual corpora, code-and-prose mixed documents, and anything where exact-token matches and semantic neighbourhoods both matter on the same query. Pick by document shape, not by what the tutorial showed. Run a sweep on a 200 to 500 query golden set with chunk-attribution and chunk-utilization scoring and the right answer surfaces in one afternoon.

Why does fixed-size chunking fail on legal and medical documents?

Because the unit of meaning in a contract or a clinical note is not 512 tokens, it is the clause or the note section. A fixed-size splitter cuts an indemnification clause across two chunks, joins the back half of payment terms to the front of limitation of liability, and produces a retrieval surface where every plausible-looking chunk is actually two half-clauses welded together. The retriever still returns something. The generator still grounds. Generation rubrics still pass groundedness. The lawyer or clinician catches the bug in five seconds. Clause-level chunking restores the unit of analysis the domain reasons over, which is what makes the eval suite mean something.

What is late-interaction retrieval and when does it win?

Late-interaction retrieval (ColBERT and ColBERTv2, plus newer variants like Jina-ColBERT-v2 for multilingual) stores per-token embeddings instead of one pooled vector per chunk. At query time the maximum similarity across all token pairs scores the chunk, which preserves exact-token signal that pooled embeddings flatten away. It wins on three workloads. Multilingual corpora where pooled embeddings underperform per-language. Code-and-prose mixed documents where identifier-level matches matter inside semantic paragraphs. Very long chunks where pooled embeddings lose discriminative power. The cost is roughly 4x to 10x the storage of a single-vector index and slower retrieval. New variants (PLAID, Stanford-DSP) cut this materially. For Spanish-English insurance docs or English code with Russian comments, the recall lift is the only thing that matters.

How do I evaluate a chunking strategy properly?

Three rubrics on a stratified golden set. Recall@k on a 200 to 500 query-to-expected-chunk set sampled from production traces (not invented) and stratified by document type. Chunk-attribution: which retrieved chunks the answer actually used, flagging retrieval surfacing irrelevant context. Chunk-utilization: what fraction of retrieved-and-relevant content the answer covered, flagging the generator ignoring good chunks. Wire these into a CI fixture and report per-domain breakdowns so the moment the winner flips across legal and prose, you see it. A single aggregate number hides the fact that fixed-512 wins on marketing copy and loses on contracts. The point of the sweep is not the leaderboard; it is knowing which strategy ships per route.

Should chunk size be a hyperparameter?

Yes, and treat it as one. Sweep 256, 512, and 1024 tokens for fixed and recursive strategies. Cap semantic chunks at the embedding model context window minus headroom for header context that travels with the chunk. For clause-level chunking, size is whatever the clause is; only split when a clause exceeds the embedding window. For late-interaction, size matters less because per-token scoring preserves the signal larger pooled chunks lose. The biggest mistake teams make is copying the LangChain tutorial default (512, 50 overlap) and never re-running the ablation when the corpus shifts from blog posts to contracts. Treat chunk size and strategy together as one knob, not two.

How does Future AGI evaluate RAG chunking strategies?

The ai-evaluation SDK (Apache 2.0) ships the chunking-eval rubrics as named EvalTemplate classes: ContextRelevance, ChunkAttribution, ChunkUtilization, Groundedness, ContextAdherence, Completeness, and FactualAccuracy. CustomLLMJudge with Jinja2 grading criteria covers boundary-coherence and domain-specific rubrics. The Evaluator API runs the sweep across strategies with four distributed runners (Celery, Ray, Temporal, Kubernetes) so the eighteen-to-twenty-four-variant comparison finishes in minutes, not hours. traceAI's auto-instrumentation tags every retrieval span with chunk strategy and size attributes via FI semantic conventions, so production traces stay comparable across the sweep without manual span creation. Error Feed clusters failing traces with HDBSCAN soft-clustering over ClickHouse-stored span embeddings; a Sonnet 4.5 Judge agent writes the immediate_fix per cluster, which feeds the Platform's self-improving evaluators. The same rubric definition runs in CI before deploy and against live LlamaIndex or LangChain spans after.

Why is chunking the under-tested decision in RAG?

Because it gets decided once at ingest time and then quietly constrains every downstream stage. The embedding model, retriever, reranker, and generator can only work on the chunks ingestion gave them. Teams obsessively tune the prompt and the reranker, then never re-run an ablation on the splitter. The cost shows up as 8 to 15 points of recall@k left on the table, plus citation drift where claims point to chunks that no longer carry the supporting span. Treating chunking as a hyperparameter you sweep, not a setting copied from a tutorial, is the highest-leverage move most teams have not made.

View all

Guides

Evaluating AWS Bedrock Agents in 2026

Bedrock's built-in eval is dev-loop only. Score action-group correctness, KB retrieval quality, and guardrail precision/recall on every release.

Rishav Hada · May 19, 2026

11 min

Guides

LLM Eval Golden Set Design: A 2026 Engineering Guide

Build a four-bucket golden set (production sample, adversarial, edge cases, failure replays) so a CI eval gate actually proves something about production.

NVJK Kartik · May 16, 2026

12 min

Guides

The Definitive Guide to Synthetic Data Generation with LLMs (2026)

The 2026 reference: three generation patterns (persona, taxonomy-stratified, evolution), the filter that survives, calibration against real, use cases.

NVJK Kartik · May 9, 2026

12 min

TL;DR: the strategy-by-domain matrix

Why one chunking strategy does not fit

The four strategies that cover the design space

1. Fixed-size (the baseline you put in the sweep)

2. Semantic chunking (the long-prose winner)

3. Clause-level chunking (the legal and medical winner)

4. Late-interaction retrieval (the multilingual and code winner)

The eval methodology: three rubrics that earn their keep

Domain-by-domain decision: how to pick per route

Production patterns: bridge the same rubrics to live traces

Closing the loop: failure clusters become dataset entries

Three deliberate tradeoffs

How Future AGI ships chunking evaluation

Related reading

Frequently asked questions