Advanced RAG Chunking Techniques in 2026: Late Chunking, Semantic, Parent-Child, and How to Pick
Ranked RAG chunking strategies for 2026. Late chunking, semantic, hierarchical, parent-child, sliding window. Code, tradeoffs, and how to evaluate retrieval.
TL;DR
| Technique | Best for | Cost | Reference |
|---|---|---|---|
| Late chunking | Documents where meaning crosses chunk boundaries | Single embedding pass on full doc | arXiv 2409.04701 |
| Semantic chunking | Long-form natural-language sources | Embeds every sentence | LangChain |
| Parent-child | Mixing small-chunk recall with large-chunk context | Two granularities stored | LangChain ParentDocumentRetriever |
| Hierarchical (RAPTOR) | Multi-hop questions over long documents | Tree-building offline | arXiv 2401.18059 |
| Sentence-level with window | Preserves sentence units, adds cross-sentence context | spaCy or NLTK dependency | LangChain NLTKTextSplitter |
| Sliding window (recursive) | Default baseline, robust everywhere | None beyond a tokenizer | LangChain RecursiveCharacterTextSplitter |
| Fixed character | Quick prototyping only | None | LangChain CharacterTextSplitter |
This guide covers seven chunking strategies that matter for retrieval-augmented generation in May 2026. Each section has working code, the failure mode it addresses, and the cost it adds. The closing section covers how to evaluate which chunker actually wins on your corpus.
Why RAG Still Needs Chunking in 2026
RAG works by retrieving the most relevant passages from a vector index and feeding them to the LLM as context. The retrieval step embeds the user query, looks up the k most similar chunk vectors, and returns the underlying text. Chunking is the offline step that produces those chunk vectors in the first place.
Three reasons chunking still matters even though frontier models accept 200k to 1M tokens of context:
- Latency and cost scale with context length. A 1M-token query costs hundreds of times more than a 4k-token query at the same model. Production systems chunk to keep p95 latency and per-query cost in check.
- Long context is not perfect context. Frontier models still show degradation on information placed in the middle of very long prompts, a phenomenon documented across multiple papers from 2024 onward. Retrieval-and-rerank tends to land the right facts near the top of the prompt where the model attends best.
- Evaluation only works on small enough pieces. Faithfulness, groundedness, and citation evaluators all depend on a passage-level alignment between the answer and the retrieved chunk. Without chunks, you cannot ground the eval.
What changed since 2025
Three shifts define chunking in May 2026:
- Late chunking went mainstream. The Gunther et al. 2024 paper (arXiv 2409.04701) showed that embedding the full document first and pooling per-chunk vectors after produces measurably better retrieval on documents with cross-chunk references. Jina AI, Voyage, and Cohere now ship late-chunking-capable long-context embedders.
- Parent-child became the default in production stacks. Storing two granularities (small for retrieval, large for the LLM) is now built into LangChain and LlamaIndex.
- Evaluation moved from “did we retrieve the right chunk” to “did the answer cite the right span”. Modern eval suites trace the chunk through retrieval into the LLM answer and score on citation alignment.
The Working Text
All examples below use a short synthetic product-documentation passage as input. It is enough to show real chunk boundaries without overwhelming the page, and it is representative of the kind of corpus most production RAG systems actually index.
```python
text = """
The Acme Retrieval API accepts a query string and returns the top-k most similar passages from a configured index. By default, k is 10 and the similarity metric is cosine. The API supports filters on metadata, hybrid retrieval that combines dense and sparse scores, and a rerank step that uses a cross-encoder.

Acme's hybrid retrieval blends dense embeddings with BM25 scores. The dense weight defaults to 0.7 and the sparse weight to 0.3, but both are configurable per request. Customers report higher recall on long-tail queries when the sparse weight is increased to 0.5.

Reranking is optional. When enabled, the top 50 candidates from hybrid retrieval are passed through a cross-encoder and the top 10 are returned. Reranking adds about 200 ms of latency per request but typically improves precision at k by 8 to 12 points on Acme's internal benchmarks.
"""
```
1. Fixed-Size Character Chunking (Baseline)
The simplest possible chunker. Split every N characters. Useful for prototyping; almost never the right answer in production.
```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
    separator=" ",
)
chunks = splitter.split_text(text)
```
Tradeoffs:
- No semantic awareness; chunks can split mid-sentence.
- Fast and deterministic.
- Use as a prototyping baseline only.
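The failure mode is easy to reproduce without any library. A plain fixed-width slice (a stand-in for the splitter above, not LangChain's actual implementation) cuts wherever the character count runs out:

```python
def fixed_chunks(s, size):
    """Slice the string every `size` characters, ignoring word and sentence boundaries."""
    return [s[i:i + size] for i in range(0, len(s), size)]

sample = "By default, k is 10 and the similarity metric is cosine."
for c in fixed_chunks(sample, 20):
    print(repr(c))
# Chunks end mid-word and mid-clause: no boundary is respected.
```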
2. Recursive Character Chunking (Robust Baseline)
The default in most stacks. The splitter tries a list of separators in order (paragraph break, then newline, then sentence boundary, then space) until each chunk fits the size limit.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(text)
```
Tradeoffs:
- Respects natural document structure (headings, paragraphs).
- Cheap, deterministic, no embedding cost.
- Loses cross-paragraph context; meaning that spans a paragraph break is split apart.
- The right default for code, API references, and well-structured documents.
3. Sentence-Level Chunking with Overlap
Use a linguistic parser to split on sentence boundaries, then group N sentences per chunk with an overlap.
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
sentences = [sent.text.strip() for sent in doc.sents]

window = 3
stride = 2
chunks = []
for i in range(0, len(sentences), stride):
    window_sents = sentences[i : i + window]
    if window_sents:
        chunks.append(" ".join(window_sents))
```
Tradeoffs:
- Preserves semantic units at the sentence level.
- Adds a spaCy or NLTK dependency.
- Sliding window with stride preserves cross-sentence context at the cost of duplicate tokens.
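The same windowing works over any sentence list. A stdlib-only sketch, with a naive regex split standing in for spaCy (so it will mishandle abbreviations like "e.g."):

```python
import re

def sentence_windows(raw_text, window=3, stride=2):
    """Split on sentence-ending punctuation, then emit overlapping windows."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", raw_text.strip())
                 if s.strip()]
    return [" ".join(sentences[i:i + window])
            for i in range(0, len(sentences), stride)
            if sentences[i:i + window]]

sample = "Reranking is optional. It adds latency. It improves precision. Enable it per request."
print(sentence_windows(sample, window=2, stride=1))
```

With `window=2, stride=1`, every sentence after the first appears in two chunks; that duplication is the price of the cross-sentence context.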
4. Semantic Chunking (Embedding-Based Boundaries)
Embed every sentence, compute pairwise cosine similarity between consecutive sentences, and split where similarity drops below a threshold. Each chunk is then internally cohesive.
```python
import numpy as np
import spacy
from langchain_openai import OpenAIEmbeddings

def semantic_chunk(raw_text, threshold=0.75):
    nlp = spacy.load("en_core_web_sm")
    sentences = [sent.text.strip() for sent in nlp(raw_text).sents if sent.text.strip()]
    emb = OpenAIEmbeddings()
    vectors = np.array(emb.embed_documents(sentences))
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        prev = vectors[i - 1]
        curr = vectors[i]
        sim = float(prev @ curr / (np.linalg.norm(prev) * np.linalg.norm(curr)))
        if sim >= threshold:
            chunks[-1].append(sentences[i])
        else:
            chunks.append([sentences[i]])
    return [" ".join(c) for c in chunks]

chunks = semantic_chunk(text)
```
LangChain and LlamaIndex both ship semantic-chunker wrappers (SemanticChunker, SemanticSplitterNodeParser) that wrap the same logic.
Tradeoffs:
- Captures cross-sentence cohesion better than recursive chunking.
- Embedding cost on every sentence at index time.
- Threshold is a hyperparameter; tune per corpus.
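The boundary rule itself can be exercised offline with synthetic unit vectors in place of real embeddings, which makes threshold behaviour cheap to inspect before spending on API calls:

```python
import numpy as np

def boundaries(vectors, threshold=0.75):
    """Indices where consecutive-sentence cosine similarity drops below the threshold."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = np.sum(v[:-1] * v[1:], axis=1)  # cosine of each adjacent pair
    return [i + 1 for i, s in enumerate(sims) if s < threshold]

# Two cohesive sentences, a topic shift, then two more cohesive sentences.
vecs = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.05, 0.95]])
print(boundaries(vecs))  # a single split before sentence index 2
```

Sweeping `threshold` over a sample of your corpus and plotting the resulting chunk-count distribution is a quick way to pick a starting value.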
5. Parent-Child (Small-to-Big) Chunking
Store two granularities. Small “child” chunks (a sentence or two each) are embedded and indexed. Large “parent” chunks (a paragraph or section) are stored alongside. At retrieval time, search over child embeddings, but return the parent chunks to the LLM. This gives you small-chunk recall and large-chunk context in one step.
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

vectorstore = Chroma(
    collection_name="rag_demo",
    embedding_function=OpenAIEmbeddings(),
)
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    parent_splitter=parent_splitter,
    child_splitter=child_splitter,
)

# Index your source documents once so the retriever has something to return.
retriever.add_documents([Document(page_content=text)])

# At query time, the retriever embeds children but returns parent chunks.
results = retriever.invoke("How does Acme reranking work?")
```
LlamaIndex’s equivalent is the HierarchicalNodeParser + AutoMergingRetriever pair.
Tradeoffs:
- Best precision-recall balance for most production RAG.
- Two granularities in storage means more memory.
- Use this as your default starting point if you do not have a strong reason to pick something else.
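Under the hood the pattern is just a mapping from child chunk to parent text. A dependency-free sketch (hypothetical helper names, not LangChain internals) makes the two-store layout concrete:

```python
def build_parent_child(paragraphs, child_size=80):
    """Index small child chunks, each pointing back to its full parent paragraph."""
    child_index = []   # (child_text, parent_id) pairs; embed child_text in practice
    parent_store = {}  # parent_id -> full paragraph returned to the LLM
    for pid, para in enumerate(paragraphs):
        parent_store[pid] = para
        for i in range(0, len(para), child_size):
            child_index.append((para[i:i + child_size], pid))
    return child_index, parent_store

paragraphs = [
    "Reranking is optional. " * 5,
    "Hybrid retrieval blends dense and sparse scores. " * 3,
]
children, parents = build_parent_child(paragraphs)
# A hit on any child resolves to its parent paragraph:
best_child_text, parent_id = children[0]
print(parents[parent_id][:40])
```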
6. Late Chunking (arXiv 2409.04701)
The 2024 idea that flipped the order of operations. Instead of chunking first and embedding each chunk in isolation, you encode the whole document with a long-context embedder, then pool per-chunk vectors from the resulting token embeddings. Each chunk vector carries information from the entire document.
The pattern below is illustrative pseudocode; the model-specific token-to-character alignment is non-trivial and varies per embedder, so use the official reference implementation in production.
```python
# Illustrative only. Use the reference implementation linked below.
import numpy as np

def late_chunk_pseudocode(token_embeddings, char_to_token, chunk_spans):
    # token_embeddings: shape (num_tokens, dim) from a long-context embedder.
    # char_to_token: list of (char_start, char_end) per token from the model's tokenizer.
    # chunk_spans: list of (char_start, char_end) per intended chunk.
    chunk_vectors = []
    for start, end in chunk_spans:
        token_indices = [
            i for i, (ts, te) in enumerate(char_to_token)
            if ts >= start and te <= end
        ]
        if token_indices:
            chunk_vectors.append(np.mean(token_embeddings[token_indices], axis=0))
    return np.array(chunk_vectors)
```
A tested implementation lives in Jina AI’s late-chunking repository, which handles the model-specific token alignment for Jina embeddings v3 and a few other long-context models.
Tradeoffs:
- Sharp gains on documents where ideas cross chunk boundaries (legal docs, scientific papers, multi-section reports).
- Requires a long-context embedder (Jina v3, Voyage 3, Cohere Embed v4). Standard 512-token embedders cannot do it.
- Slightly more expensive per document at index time, but a single pass instead of one per chunk.
Reference: Gunther, Mohr, and Wang, “Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models,” September 2024 (arXiv 2409.04701).
7. Hierarchical Chunking (RAPTOR-Style)
Build a tree. Sentences or short passages are leaves. Cluster neighbouring leaves, summarize each cluster with an LLM, and use the summaries as the next level up. Repeat to the root. At query time, search both leaves and intermediate summaries; route the question to the right granularity.
```python
# Pseudocode sketch. Full implementations live in LlamaIndex's TreeIndex
# and the RAPTOR reference repo.
def build_tree(chunks, n_levels=3):
    level = chunks
    levels = [level]
    for _ in range(n_levels):
        clusters = cluster_by_embedding(level)
        summaries = [summarize_with_llm(c) for c in clusters]
        level = summaries
        levels.append(level)
    return levels
```
Reference: Sarthi et al., “RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval,” January 2024 (arXiv 2401.18059).
Tradeoffs:
- Strong on multi-hop questions that need both detail and summary-level context.
- Heavy offline cost: cluster + LLM-summarize at every level.
- Best for static, long-form corpora (books, scientific papers, regulatory filings).
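With the two expensive steps stubbed out (adjacent-pair grouping in place of the paper's embedding-based clustering, string concatenation in place of LLM summaries), the tree shape is easy to see:

```python
def cluster_adjacent(level, size=2):
    """Stub: group adjacent chunks in pairs; RAPTOR clusters by embedding."""
    return [level[i:i + size] for i in range(0, len(level), size)]

def summarize_stub(cluster):
    """Stub: join the cluster; RAPTOR calls an LLM to summarize it."""
    return " | ".join(cluster)

def build_tree(chunks, n_levels=3):
    levels = [chunks]
    level = chunks
    for _ in range(n_levels):
        if len(level) == 1:
            break  # reached the root
        level = [summarize_stub(c) for c in cluster_adjacent(level)]
        levels.append(level)
    return levels

leaves = ["s1", "s2", "s3", "s4"]
tree = build_tree(leaves)
print([len(lvl) for lvl in tree])  # 4 leaves -> 2 summaries -> 1 root
```

Every level is indexed for retrieval, so a detail question can hit a leaf while a summary question hits an upper node.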
How to Pick: A Decision Path
- Start with recursive character chunking with 300 to 500 token chunks and 10 to 20 percent overlap. Cheap, robust, hard to beat as a baseline.
- If retrieval recall is low, move to semantic chunking. Especially helpful on long-form prose.
- If you need both recall and large LLM context, move to parent-child. Best precision-recall balance for production.
- If your documents have cross-chunk meaning (legal, scientific, multi-section reports), move to late chunking.
- If you need multi-hop reasoning over long documents, add a RAPTOR-style hierarchical layer on top of any of the above.
Do not skip steps 1 to 3 just to chase late chunking or RAPTOR. The expensive options buy you marginal gains that you can only see if you have a working baseline to compare against.
How to Evaluate a Chunking Strategy
The wrong evaluation is “does retrieval recall@10 go up.” Recall on a vector search is necessary but not sufficient; the LLM still has to use the retrieved chunk correctly. The right evaluation runs the full RAG pipeline and scores the generated answer.
A 2026 evaluation stack:
- Recall@k on a labeled query-passage set. Standard IR metric.
- Mean reciprocal rank for ordering quality.
- Faithfulness / groundedness LLM-judge metric on the final answer. Does the answer cite the retrieved chunks correctly, and only the retrieved chunks?
- Latency at p50 and p95 for the full pipeline.
- Cost per query end to end.
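For a quick check before wiring up a full IR-eval library, the step-1 and step-2 metrics fit in a few lines of plain Python, given labeled gold passage ids per query:

```python
def recall_at_k(ranked_ids, gold_ids, k=10):
    """Fraction of gold passages that appear in the top-k retrieved ids."""
    hits = len(set(ranked_ids[:k]) & set(gold_ids))
    return hits / len(gold_ids)

def mrr(ranked_runs, gold_sets):
    """Mean reciprocal rank of the first relevant passage across queries."""
    total = 0.0
    for ranked, gold in zip(ranked_runs, gold_sets):
        for rank, pid in enumerate(ranked, start=1):
            if pid in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_runs)

runs = [["d3", "d1", "d9"], ["d7", "d2", "d5"]]
golds = [{"d1"}, {"d5"}]
print(recall_at_k(runs[0], golds[0], k=2))  # gold d1 is in the top 2
print(mrr(runs, golds))                     # (1/2 + 1/3) / 2
```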
Future AGI’s evaluation library (Apache 2.0) covers the LLM-judge metrics in step 3: faithfulness, groundedness, context_relevance, and a custom-rubric judge for anything else. For steps 1 and 2 (recall@k, mean reciprocal rank), use a standard IR-eval library against your labeled query-passage set; the eval library is not an IR retrieval evaluator. Pair the LLM-judge layer with traceAI (Apache 2.0) so every retrieval is captured as a span and you can diff across chunking strategies.
```python
from fi.evals import evaluate

answer = "<the LLM's answer>"
retrieved_chunks = "<concatenated text of the chunks the retriever returned>"

result = evaluate(
    "faithfulness",
    output=answer,
    context=retrieved_chunks,
)
print(result.score, result.explanation)
```
For deeper retrieval-specific metrics, define a custom rubric:
```python
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider
from fi.opt.base import Evaluator

provider = LiteLLMProvider(model="gpt-5-2025-08-07")
judge = CustomLLMJudge(
    name="chunk_completeness",
    prompt=(
        "Given a user question, a set of retrieved chunks, and the gold "
        "answer, score 0 to 1 whether the retrieved chunks contain "
        "every fact needed for the gold answer. Respond with JSON: "
        "{\"score\": float, \"reason\": string}."
    ),
    provider=provider,
)
evaluator = Evaluator(judge)
score = evaluator.evaluate(output="<the model's answer>")
print(score)
```
Hosted judges run on turing_flash (around 1 to 2 seconds), turing_small (2 to 3 seconds), or turing_large (3 to 5 seconds). Wire FI_API_KEY and FI_SECRET_KEY into your environment and open runs at /platform/monitor/command-center to compare chunking strategies side by side.
References
- P. Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” 2020. arXiv 2005.11401.
- Yepes, You, Milczek, Laverde, Li, “Financial Report Chunking for Effective Retrieval Augmented Generation,” 2024. arXiv 2402.05131.
- M. Gunther, I. Mohr, et al., “Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models,” 2024. arXiv 2409.04701.
- P. Sarthi et al., “RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval,” 2024. arXiv 2401.18059.
- D. Blei, A. Ng, M. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, 2003.