
Advanced RAG Chunking Techniques in 2026: Late Chunking, Semantic, Parent-Child, and How to Pick

Ranked RAG chunking strategies for 2026. Late chunking, semantic, hierarchical, parent-child, sliding window. Code, tradeoffs, and how to evaluate retrieval.


TL;DR

| Technique | Best for | Cost | Reference |
| --- | --- | --- | --- |
| Late chunking | Documents where meaning crosses chunk boundaries | Single embedding pass on full doc | arXiv 2409.04701 |
| Semantic chunking | Long-form natural-language sources | Embeds every sentence | LangChain |
| Parent-child | Mixing small-chunk recall with large-chunk context | Two granularities stored | LangChain ParentDocumentRetriever |
| Hierarchical (RAPTOR) | Multi-hop questions over long documents | Tree-building offline | arXiv 2401.18059 |
| Sentence-level with window | Preserves sentence units, adds cross-sentence context | spaCy or NLTK | LangChain NLTKTextSplitter |
| Sliding window (recursive) | Default baseline, robust everywhere | None beyond a tokenizer | LangChain RecursiveCharacterTextSplitter |
| Fixed character | Quick prototyping only | None | LangChain CharacterTextSplitter |

This guide covers seven chunking strategies that matter for retrieval-augmented generation in May 2026. Each section has working code, the failure mode it addresses, and the cost it adds. The closing section covers how to evaluate which chunker actually wins on your corpus.

Why RAG Still Needs Chunking in 2026

RAG works by retrieving the most relevant passages from a vector index and feeding them to the LLM as context. The retrieval step embeds the user query, looks up the k most similar chunk vectors, and returns the underlying text. Chunking is the offline step that produces those chunk vectors in the first place.
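
To make the mechanics concrete, the sketch below is a minimal, framework-free version of the online retrieval step. The embed callable and the chunk matrix are stand-ins for whatever embedding client and vector index you actually run; a production system would query a vector database rather than NumPy.

import numpy as np


def retrieve_top_k(query, chunk_texts, chunk_vectors, embed, k=10):
    # chunk_vectors: (num_chunks, dim) matrix built offline by chunking and
    # embedding the corpus. embed() stands in for your embedding client.
    q = np.asarray(embed(query))
    sims = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    top = np.argsort(-sims)[:k]
    return [chunk_texts[i] for i in top]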

Three reasons chunking still matters even though frontier models accept 200k to 1M tokens of context:

  1. Latency and cost scale with context length. A 1M-token query costs hundreds of times more than a 4k-token query on the same model. Production systems chunk to keep p95 latency and per-query cost in check.
  2. Long context is not perfect context. Frontier models still show degradation on information placed in the middle of very long prompts, a phenomenon documented across multiple papers from 2024 onward. Retrieval-and-rerank tends to land the right facts near the top of the prompt where the model attends best.
  3. Evaluation only works on small enough pieces. Faithfulness, groundedness, and citation evaluators all depend on a passage-level alignment between the answer and the retrieved chunk. Without chunks, you cannot ground the eval.

What changed since 2025

Three shifts define chunking in May 2026:

  1. Late chunking went mainstream. The Gunther et al. 2024 paper (arXiv 2409.04701) showed that embedding the full document first and pooling per-chunk vectors after produces measurably better retrieval on documents with cross-chunk references. Jina AI, Voyage, and Cohere now ship late-chunking-capable long-context embedders.
  2. Parent-child became the default in production stacks. Storing two granularities (small for retrieval, large for the LLM) is now built into LangChain and LlamaIndex.
  3. Evaluation moved from “did we retrieve the right chunk” to “did the answer cite the right span”. Modern eval suites trace the chunk through retrieval into the LLM answer and score on citation alignment.

The Working Text

All examples below use a short synthetic product-documentation passage as input. It is enough to show real chunk boundaries without overwhelming the page, and it is representative of the kind of corpus most production RAG systems actually index.

text = """
The Acme Retrieval API accepts a query string and returns the top-k most similar passages from a configured index. By default, k is 10 and the similarity metric is cosine. The API supports filters on metadata, hybrid retrieval that combines dense and sparse scores, and a rerank step that uses a cross-encoder.

Acme's hybrid retrieval blends dense embeddings with BM25 scores. The dense weight defaults to 0.7 and the sparse weight to 0.3, but both are configurable per request. Customers report higher recall on long-tail queries when the sparse weight is increased to 0.5.

Reranking is optional. When enabled, the top 50 candidates from hybrid retrieval are passed through a cross-encoder and the top 10 are returned. Reranking adds about 200 ms of latency per request but typically improves precision at k by 8 to 12 points on Acme's internal benchmarks.
"""

1. Fixed-Size Character Chunking (Baseline)

The simplest possible chunker. Split every N characters. Useful for prototyping; almost never the right answer in production.

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
    separator=" ",
)
chunks = splitter.split_text(text)

Tradeoffs:

  • No semantic awareness; chunks can split mid-sentence.
  • Fast and deterministic.
  • Use as a prototyping baseline only.

2. Recursive Character Chunking (Robust Baseline)

The default in most stacks. The splitter tries a list of separators in order (paragraph break, then newline, then sentence boundary, then space) until each chunk fits the size limit.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(text)

Tradeoffs:

  • Respects natural document structure (headings, paragraphs).
  • Cheap, deterministic, no embedding cost.
  • Loses cross-paragraph context; meaning that spans paragraphs gets split across chunks.
  • The right default for code, API references, and well-structured documents.

3. Sentence-Level Chunking with Overlap

Use a linguistic parser to split on sentence boundaries, then group N sentences per chunk with an overlap.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
sentences = [sent.text.strip() for sent in doc.sents]

window = 3
stride = 2

chunks = []
for i in range(0, len(sentences), stride):
    window_sents = sentences[i : i + window]
    if window_sents:
        chunks.append(" ".join(window_sents))

Tradeoffs:

  • Preserves semantic units at the sentence level.
  • Adds a spaCy or NLTK dependency.
  • Sliding window with stride preserves cross-sentence context at the cost of duplicate tokens.

4. Semantic Chunking (Embedding-Based Boundaries)

Embed every sentence, compute the cosine similarity between consecutive sentences, and split where similarity drops below a threshold. Each chunk is then internally cohesive.

import numpy as np
import spacy
from langchain_openai import OpenAIEmbeddings


def semantic_chunk(raw_text, threshold=0.75):
    nlp = spacy.load("en_core_web_sm")
    sentences = [sent.text.strip() for sent in nlp(raw_text).sents if sent.text.strip()]

    emb = OpenAIEmbeddings()
    vectors = np.array(emb.embed_documents(sentences))

    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        prev = vectors[i - 1]
        curr = vectors[i]
        sim = float(prev @ curr / (np.linalg.norm(prev) * np.linalg.norm(curr)))
        if sim >= threshold:
            chunks[-1].append(sentences[i])
        else:
            chunks.append([sentences[i]])
    return [" ".join(c) for c in chunks]


chunks = semantic_chunk(text)

LangChain and LlamaIndex both ship semantic-chunker wrappers (SemanticChunker, SemanticSplitterNodeParser) that wrap the same logic.
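
If you would rather not maintain the boundary logic yourself, the wrapper version is a few lines. A minimal sketch, assuming the SemanticChunker that currently ships in the langchain_experimental package:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Boundaries are placed where embedding similarity between consecutive
# sentences drops, mirroring the hand-rolled version above.
semantic_splitter = SemanticChunker(OpenAIEmbeddings())
chunks = semantic_splitter.split_text(text)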

Tradeoffs:

  • Captures cross-sentence cohesion better than recursive chunking.
  • Embedding cost on every sentence at index time.
  • Threshold is a hyperparameter; tune per corpus.

5. Parent-Child (Small-to-Big) Chunking

Store two granularities. Small “child” chunks (a sentence or two each) are embedded and indexed. Large “parent” chunks (a paragraph or section) are stored alongside. At retrieval time, search over child embeddings, but return the parent chunks to the LLM. This gives you small-chunk recall and large-chunk context in one step.

from langchain.retrievers import ParentDocumentRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

vectorstore = Chroma(
    collection_name="rag_demo",
    embedding_function=OpenAIEmbeddings(),
)
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    parent_splitter=parent_splitter,
    child_splitter=child_splitter,
)

# Index your source documents once so the retriever has something to return.
retriever.add_documents([Document(page_content=text)])

# At query time, the retriever embeds children but returns parent chunks.
results = retriever.invoke("How does Acme reranking work?")

LlamaIndex’s equivalent is the HierarchicalNodeParser + AutoMergingRetriever pair.

Tradeoffs:

  • Best precision-recall balance for most production RAG.
  • Two granularities in storage means more memory.
  • Use this as your default starting point if you do not have a strong reason to pick something else.

6. Late Chunking (arXiv 2409.04701)

The 2024 idea that flipped the order of operations. Instead of chunking first and embedding each chunk in isolation, you encode the whole document with a long-context embedder, then pool per-chunk vectors from the resulting token embeddings. Each chunk vector carries information from the entire document.

The pattern below is illustrative pseudocode; the model-specific token-to-character alignment is non-trivial and varies per embedder, so use the official reference implementation in production.

# Illustrative only. Use the reference implementation linked below.
import numpy as np


def late_chunk_pseudocode(token_embeddings, char_to_token, chunk_spans):
    # token_embeddings: shape (num_tokens, dim) from a long-context embedder.
    # char_to_token: list of (char_start, char_end) per token from the model's tokenizer.
    # chunk_spans: list of (char_start, char_end) per intended chunk.
    chunk_vectors = []
    for start, end in chunk_spans:
        token_indices = [
            i for i, (ts, te) in enumerate(char_to_token)
            if ts >= start and te <= end
        ]
        if token_indices:
            chunk_vectors.append(np.mean(token_embeddings[token_indices], axis=0))
    return np.array(chunk_vectors)

A tested implementation lives in Jina AI’s late-chunking repository, which handles the model-specific token alignment for Jina embeddings v3 and a few other long-context models.

Tradeoffs:

  • Sharp gains on documents where ideas cross chunk boundaries (legal docs, scientific papers, multi-section reports).
  • Requires a long-context embedder (Jina v3, Voyage 3, Cohere Embed v4). Standard 512-token embedders cannot do it.
  • Slightly more expensive per document at index time, but a single pass instead of one per chunk.

Reference: Gunther, Mohr, and Wang, “Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models,” September 2024 (arXiv 2409.04701).

7. Hierarchical Chunking (RAPTOR-Style)

Build a tree. Sentences or short passages are leaves. Cluster neighbouring leaves, summarize each cluster with an LLM, and use the summaries as the next level up. Repeat to the root. At query time, search both leaves and intermediate summaries; route the question to the right granularity.

# Pseudocode sketch. Full implementations live in LlamaIndex's TreeIndex
# and the RAPTOR reference repo.

def build_tree(chunks, n_levels=3):
    level = chunks
    levels = [level]
    for _ in range(n_levels):
        clusters = cluster_by_embedding(level)
        summaries = [summarize_with_llm(c) for c in clusters]
        level = summaries
        levels.append(level)
    return levels
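
The sketch leans on two helpers it never defines. One minimal, hedged way to fill them in is below: scikit-learn's KMeans as a stand-in for the clustering step (the RAPTOR paper itself uses Gaussian-mixture clustering over dimension-reduced embeddings) and a thin chat-model wrapper for the summaries. The model name and cluster count are illustrative assumptions, not part of the reference implementation.

import numpy as np
from sklearn.cluster import KMeans
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


def cluster_by_embedding(chunks, n_clusters=None):
    # Embed each chunk and group with KMeans; a simplified stand-in for
    # RAPTOR's clustering over reduced embeddings.
    vectors = np.array(OpenAIEmbeddings().embed_documents(chunks))
    k = n_clusters or max(1, len(chunks) // 4)
    labels = KMeans(n_clusters=k, n_init="auto").fit_predict(vectors)
    grouped = [[] for _ in range(k)]
    for chunk, label in zip(chunks, labels):
        grouped[label].append(chunk)
    return [g for g in grouped if g]


def summarize_with_llm(cluster):
    # One abstractive summary per cluster; prompt and model are illustrative.
    llm = ChatOpenAI(model="gpt-4o-mini")
    prompt = "Summarize the following passages in a few sentences:\n\n" + "\n\n".join(cluster)
    return llm.invoke(prompt).content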

Reference: Sarthi et al., “RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval,” January 2024 (arXiv 2401.18059).

Tradeoffs:

  • Strong on multi-hop questions that need both detail and summary-level context.
  • Heavy offline cost: cluster + LLM-summarize at every level.
  • Best for static, long-form corpora (books, scientific papers, regulatory filings).

How to Pick: A Decision Path

  1. Start with recursive character chunking with 300 to 500 token chunks and 10 to 20 percent overlap. Cheap, robust, hard to beat as a baseline.
  2. If retrieval recall is low, move to semantic chunking. Especially helpful on long-form prose.
  3. If you need both recall and large LLM context, move to parent-child. Best precision-recall balance for production.
  4. If your documents have cross-chunk meaning (legal, scientific, multi-section reports), move to late chunking.
  5. If you need multi-hop reasoning over long documents, add a RAPTOR-style hierarchical layer on top of any of the above.

Do not skip steps 1 to 3 just to chase late chunking or RAPTOR. The expensive options buy you marginal gains that you can only see if you have a working baseline to compare against.

How to Evaluate a Chunking Strategy

The wrong evaluation is “does retrieval recall@10 go up.” Recall on a vector search is necessary but not sufficient; the LLM still has to use the retrieved chunk correctly. The right evaluation runs the full RAG pipeline and scores the generated answer.

A 2026 evaluation stack:

  1. Recall@k on a labeled query-passage set. Standard IR metric.
  2. Mean reciprocal rank (MRR) for ordering quality; both IR metrics are sketched in plain Python after this list.
  3. Faithfulness / groundedness LLM-judge metric on the final answer. Does the answer cite the retrieved chunks correctly, and only the retrieved chunks?
  4. Latency at p50 and p95 for the full pipeline.
  5. Cost per query end to end.
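
Steps 1 and 2 need nothing heavier than a labeled set of query-to-relevant-passage pairs; both metrics fit in a few lines of plain Python. The function and variable names below are illustrative:

def recall_at_k(retrieved_ids, relevant_ids, k=10):
    # Fraction of the gold passages that appear in the top-k retrieved list.
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def mean_reciprocal_rank(all_retrieved, all_relevant):
    # Average of 1/rank of the first relevant passage across queries.
    total = 0.0
    for retrieved_ids, relevant_ids in zip(all_retrieved, all_relevant):
        rank = next(
            (i + 1 for i, rid in enumerate(retrieved_ids) if rid in relevant_ids),
            None,
        )
        total += 1.0 / rank if rank else 0.0
    return total / len(all_retrieved) if all_retrieved else 0.0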

Future AGI’s evaluation library (Apache 2.0) covers the LLM-judge metrics in step 3: faithfulness, groundedness, context_relevance, and a custom-rubric judge for anything else. For steps 1 and 2 (recall@k, mean reciprocal rank), use a standard IR-eval library (or the sketch above) against your labeled query-passage set; the eval library is not an IR retrieval evaluator. Pair the LLM-judge layer with traceAI (Apache 2.0) so every retrieval is captured as a span and you can diff across chunking strategies.

from fi.evals import evaluate

answer = "<the LLM's answer>"
retrieved_chunks = "<concatenated text of the chunks the retriever returned>"

result = evaluate(
    "faithfulness",
    output=answer,
    context=retrieved_chunks,
)

print(result.score, result.explanation)

For deeper retrieval-specific metrics, define a custom rubric:

from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider
from fi.opt.base import Evaluator

provider = LiteLLMProvider(model="gpt-5-2025-08-07")

judge = CustomLLMJudge(
    name="chunk_completeness",
    prompt=(
        "Given a user question, a set of retrieved chunks, and the gold "
        "answer, score 0 to 1 whether the retrieved chunks contain "
        "every fact needed for the gold answer. Respond with JSON: "
        "{\"score\": float, \"reason\": string}."
    ),
    provider=provider,
)

evaluator = Evaluator(judge)
score = evaluator.evaluate(output="<the model's answer>")
print(score)

Hosted judges run on turing_flash (around 1 to 2 seconds), turing_small (2 to 3 seconds), or turing_large (3 to 5 seconds). Wire FI_API_KEY and FI_SECRET_KEY into your environment and open runs at /platform/monitor/command-center to compare chunking strategies side by side.

References

  1. P. Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” 2020. arXiv 2005.11401.
  2. Yepes, You, Milczek, Laverde, Li, “Financial Report Chunking for Effective Retrieval Augmented Generation,” 2024. arXiv 2402.05131.
  3. M. Gunther, I. Mohr, et al., “Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models,” 2024. arXiv 2409.04701.
  4. P. Sarthi et al., “RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval,” 2024. arXiv 2401.18059.
  5. D. Blei, A. Ng, M. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, 2003.

Frequently asked questions

What is chunking in RAG and why does it matter?
Chunking is the process of splitting source documents into smaller passages that you embed and store in a vector index. RAG retrieval looks up the most similar chunks for a user query, then feeds them to the LLM as context. Chunk size and boundary placement directly drive recall, context dilution, and answer faithfulness. Bad chunking is the single most common cause of bad RAG.
What is late chunking and why is it different in 2026?
Late chunking is a technique introduced by Jina AI in late 2024 (Gunther et al., arXiv 2409.04701) and now widely adopted. Instead of chunking the document up front and embedding each chunk independently, you embed the entire document with a long-context embedder, then split the resulting token embeddings into per-chunk vectors via mean pooling. The chunk vectors carry full-document context, which sharply improves retrieval on documents where meaning crosses chunk boundaries.
What is parent-child chunking?
Parent-child (or 'small-to-big') chunking stores two granularities. Small child chunks are embedded and indexed for similarity search; larger parent chunks are returned to the LLM as the actual context. This gives you the recall of small-chunk retrieval and the context window of large-chunk reading. LangChain calls it ParentDocumentRetriever; LlamaIndex calls it HierarchicalNodeParser plus AutoMergingRetriever.
When should I use semantic chunking versus fixed-size chunking?
Use fixed-size or recursive character chunking as the default baseline. Move to semantic chunking when retrieval recall is low, especially on long-form documents where ideas span multiple paragraphs. Semantic chunking is more expensive (it embeds every sentence to compute boundaries) and benefits diminish on highly structured documents like API references, where recursive character splitting on headers is already near optimal.
What is hierarchical chunking?
Hierarchical chunking builds a tree of chunks at multiple granularities. Leaves are sentences or short passages. Higher levels are summaries of clusters of leaves. At query time, the retriever can walk the tree and pull either fine-grained leaves or coarse summaries depending on the query type. The RAPTOR paper (arXiv 2401.18059) is the reference implementation.
How big should a chunk be?
For text dense in semantic content, 200 to 500 tokens is a good starting point. For technical or code-heavy content, 500 to 1000 tokens works better because the unit of meaning is larger. For dialogue or chat logs, use turn boundaries rather than a token count. Always run an evaluation on your own corpus and queries before locking in a number; chunk size that wins on one corpus often loses on another.
Does chunking still matter if the LLM has a 1M token context window?
Yes. Long context helps but does not replace retrieval. Latency and cost scale with context length, and frontier models still show 'lost in the middle' degradation past a few hundred thousand tokens. Production systems in 2026 use chunking to keep latency, cost, and answer faithfulness under control, then put long-context models on top as a safety net for hard queries.
How do I evaluate which chunking strategy is best for my corpus?
Run the same retrieval queries across each chunking strategy and measure recall@k and mean reciprocal rank with a standard IR-eval library, then measure faithfulness of the generated answer with an LLM-judge eval. Future AGI's evaluation library covers the answer-level LLM-judge metrics (faithfulness, groundedness, context relevance), and traceAI captures each retrieval as a span so you can diff across strategies. The goal is to pick the chunker that maximizes faithfulness, not just retrieval recall.