Vector Chunking in 2026: Strategies, Chunk Sizes, and Retrieval Wins
TL;DR: vector chunking in 2026
| Strategy | When to use | Chunk size | Ingest cost | Retrieval gain vs fixed |
|---|---|---|---|---|
| Fixed-size | Uniform prose, blog, news | 256-512 tokens | Lowest | Baseline |
| Recursive character | Mixed structure, default in LangChain | 400-800 tokens | Low | Modest on most prose |
| Semantic | Short self-contained sections, FAQs | Variable, 200-1200 tokens | 5-15x fixed | Higher recall on cross-section queries (corpus-dependent) |
| Late chunking | Multi-hop, entity-resolution queries | 256-512 tokens | 2-4x fixed | Higher recall on entity queries (corpus-dependent) |
| Agentic (LLM-decided) | Contracts, technical docs, reports | LLM-chosen | 50-200x fixed | Highest on structured corpora (corpus-dependent) |
| Hierarchical | Long-doc QA with reranker | 200-400 child / 1500-3000 parent | 2x fixed | Higher recall on long-doc QA |
| Sparse + dense hybrid | Rare terms, codenames, IDs | 400-800 tokens | 2x fixed | Higher recall on rare-term queries |
If you read one row: fixed 400-600 token chunks with 15% overlap plus a cross-encoder reranker is the 2026 baseline. Move to semantic or late chunking only after you have measured a retrieval gap on a labeled set.
Why chunking matters most in 2026 RAG
A RAG pipeline has four moving parts: ingest (chunk and embed), index, retrieve, and generate. Bad chunking caps every metric below it. If the right facts are split across chunk boundaries, no reranker, no longer context window, and no better LLM can recover them. Engineering teams who debug “the LLM hallucinated” without inspecting chunks usually find the model never received the supporting evidence in the first place.
The other surfaces (vector DB choice, reranker model, generator model) all matter, but they have ceilings. Chunking has a floor: once it is wrong, every later step amplifies the error. That is why chunking is the first thing to evaluate when retrieval quality regresses.
What is vector chunking, precisely
Vector chunking is the step where a corpus is split into passages, each passage is embedded, and the resulting vectors are stored in a vector index for nearest-neighbor search at query time. Three parameters define a chunking strategy:
- Boundary rule. Where to cut: token count, character count, sentence, paragraph, semantic similarity drop, LLM-decided.
- Chunk size. How large each chunk is, usually expressed in tokens. Typical range 256 to 2048.
- Overlap. How many tokens of the previous chunk repeat at the start of the next. Typical range 10% to 20%.
Two derived choices ride on top: whether to store chunk metadata (page, section, document ID) alongside the vector, and whether to keep a parent-child relationship (hierarchical chunking).
The seven 2026 chunking strategies, explained
1. Fixed-size chunking
Split by token count, usually 256, 512, or 1024 tokens, with 10-20% overlap. This is the default in every RAG framework because it is fast, deterministic, and predictable. Fixed-size chunking works well when the corpus is uniform prose (news articles, blog posts, transcripts) and degrades when documents have strong internal structure (tables, code, contracts, hierarchical specs).
Use it when: prose corpus, single document type, ingest speed matters.
Skip it when: documents contain tables that would be split mid-row, code blocks, or hierarchical sections smaller than your chunk size.
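The overlap arithmetic is the only subtle part of fixed-size chunking. A minimal sketch (function and variable names are illustrative, and a real pipeline would operate on tokenizer output rather than integers):

```python
def fixed_size_chunks(tokens, size=512, overlap=64):
    """Split a token sequence into fixed-size chunks that share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # each chunk starts `step` tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk already reached the end of the document
    return chunks

tokens = list(range(1200))  # stand-in for a tokenized document
chunks = fixed_size_chunks(tokens, size=512, overlap=64)
# A 1200-token document yields three chunks; the tail of each chunk
# repeats as the head of the next, so boundary facts appear twice.
```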
2. Recursive character splitting
Split on a hierarchy of separators: try paragraphs first, then sentences, then characters, until each piece fits the target size. This is the LangChain default and respects soft document structure without requiring an embedding pass. It strictly improves on fixed-size splitting: same speed, slightly better boundaries.
Use it when: mixed-structure corpus, you want a “smart fixed” baseline.
Skip it when: you have heavy tables or code that need protected boundaries.
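The coarse-to-fine separator cascade can be sketched in a few lines of plain Python (this is an illustration of the idea, not LangChain's implementation; separator order and the character-based length check are assumptions):

```python
def recursive_split(text, max_len=400, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator present, recursing until pieces fit."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            pieces, buf = [], ""
            for part in text.split(sep):
                candidate = buf + sep + part if buf else part
                if len(candidate) <= max_len:
                    buf = candidate  # keep packing parts into the current piece
                else:
                    if buf:
                        pieces.append(buf)
                    buf = part
            if buf:
                pieces.append(buf)
            # Any piece still too large falls through to finer separators.
            out = []
            for p in pieces:
                out.extend(recursive_split(p, max_len, separators))
            return out
    # No separator left: fall back to a hard character split.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Paragraph boundaries win when they exist, which is why this beats pure fixed-size splitting on prose at essentially the same cost.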
3. Semantic chunking
Embed candidate sentence groups, compute a similarity score between adjacent groups, and cut where similarity drops below a threshold. The chunk size becomes variable, ranging from a single sentence to a full section. Semantic chunking captures topical boundaries that fixed chunking misses but adds an embedding pass at ingest time, which is 5-15x slower per document.
Use it when: corpora with strong topical shifts, FAQs, knowledge bases, short sections that should stay together.
Skip it when: ingest throughput matters, or fixed chunks already retrieve well.
Reference implementation: LlamaIndex SemanticSplitterNodeParser and LangChain SemanticChunker ship this strategy.
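The core mechanic, stripped of any framework, is a cut wherever adjacent-sentence similarity drops below a threshold. A toy sketch with hand-made 2-D vectors standing in for real embeddings:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embeddings, threshold=0.6):
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

# Toy vectors: the first two sentences are similar, the third is a topic shift.
sents = ["Cats purr.", "Cats sleep a lot.", "GPUs have many cores."]
vecs = [[1.0, 0.1], [0.9, 0.2], [0.05, 1.0]]
print(semantic_chunks(sents, vecs))
# -> ['Cats purr. Cats sleep a lot.', 'GPUs have many cores.']
```

The extra embedding pass at ingest is exactly the `cosine` calls here, run over every adjacent sentence pair in the corpus.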
4. Late chunking
Late chunking, introduced by Jina AI in 2024, embeds the full document with a long-context embedding model first, then pools per-chunk vectors from the same forward pass. The chunk vectors retain cross-chunk context because the embedding model has seen the entire document. This helps with multi-hop and entity-resolution queries where pronouns or references span chunk boundaries.
Use it when: multi-hop queries, entity-heavy corpora (legal, biomedical), pronoun and coreference matter.
Skip it when: documents are shorter than 2K tokens (no benefit), or your embedding model is not long-context.
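The distinctive step is the pooling: per-token embeddings come from one full-document forward pass, then each chunk's vector is pooled from its token span. A sketch of only the pooling step, with dummy token vectors standing in for real model output:

```python
def late_chunk(token_embeddings, boundaries):
    """Mean-pool a vector per chunk from a single full-document forward pass.

    token_embeddings: one vector per token, from a long-context embedding model.
    boundaries: (start, end) token index pairs, end-exclusive.
    """
    chunk_vectors = []
    for start, end in boundaries:
        span = token_embeddings[start:end]
        dim = len(span[0])
        pooled = [sum(tok[d] for tok in span) / len(span) for d in range(dim)]
        chunk_vectors.append(pooled)
    return chunk_vectors

# 1-D dummy embeddings for four tokens, pooled into two chunks of two tokens each.
toks = [[1.0], [3.0], [5.0], [7.0]]
print(late_chunk(toks, [(0, 2), (2, 4)]))  # -> [[2.0], [6.0]]
```

Because every token vector was computed with full-document attention, each pooled chunk vector carries cross-chunk context that per-chunk embedding cannot recover.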
5. Agentic chunking
Use an LLM to read each document and decide chunk boundaries based on content type. The model outputs span boundaries that respect tables, code blocks, section headers, and topical shifts. This is the most expensive strategy at ingest, costing one LLM call per document, but it is the highest-fidelity option for high-value corpora.
In 2026 most teams use a small, low-latency flash-class model for the chunking pass to keep cost under control. Cache the chunk plan so re-ingest does not re-pay the LLM cost.
Use it when: contracts, technical specs, regulatory filings, mixed-format reports.
Skip it when: corpus is uniform prose, ingest budget is tight, or you have not yet measured a retrieval gap that justifies the cost.
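Mechanically, the model returns a span plan and the pipeline slices the document against it. A sketch assuming a hypothetical JSON response schema (the prompt wording and span format are illustrative, not any vendor's API):

```python
import json

# Hypothetical instruction for the flash-class chunking model.
CHUNK_PROMPT = (
    "Split the numbered document into chunks. Respect tables, code blocks, "
    'and section headers. Reply with JSON: {"spans": [{"start": int, "end": int}]}'
)

def apply_chunk_plan(lines, plan_json):
    """Turn the model's span plan (end-exclusive line ranges) into text chunks."""
    plan = json.loads(plan_json)
    return ["\n".join(lines[s["start"]:s["end"]]) for s in plan["spans"]]

lines = ["# Pricing", "| tier | price |", "| pro | $20 |", "Notes follow."]
# A well-behaved plan keeps the table in one chunk instead of cutting mid-row.
plan = '{"spans": [{"start": 0, "end": 3}, {"start": 3, "end": 4}]}'
chunks = apply_chunk_plan(lines, plan)
```

Caching `plan` keyed by a document hash is what makes re-ingest cheap: the LLM call is paid once per document version.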
6. Hierarchical chunking
Store both small chunks (200-400 tokens) for retrieval and larger parent chunks (1500-3000 tokens) for generation. Retrieve at the child level for precision, expand to parents before sending to the generator for context. LlamaIndex calls this “auto-merging retrieval” and ships it as a node parser.
Use it when: long-document QA, tutorials, manuals where the answer is in a small section but needs surrounding context.
Skip it when: documents are short or self-contained.
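The data structure is just two chunkings of the same tokens plus a child-to-parent map. A minimal sketch (names are illustrative):

```python
def build_hierarchy(tokens, child_size=300, parent_size=1500):
    """Index children for precise retrieval; map each child to its parent for expansion."""
    parents = [tokens[i:i + parent_size] for i in range(0, len(tokens), parent_size)]
    children, child_to_parent = [], []
    for p_idx, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append(parent[j:j + child_size])
            child_to_parent.append(p_idx)
    return parents, children, child_to_parent

parents, children, child_to_parent = build_hierarchy(list(range(3000)))
# Retrieval scores against the small children; if child 7 wins,
# the generator receives its full parent chunk instead:
context_for_generator = parents[child_to_parent[7]]
```

Only the child vectors go into the index, so retrieval precision comes from small chunks while generation context comes from large ones.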
7. Sparse and hybrid chunking
Pair each dense chunk vector with a sparse representation (BM25 or SPLADE) and ingest both into a hybrid index. Hybrid retrieval handles rare terms, codenames, product IDs, and out-of-vocabulary entities that dense vectors miss. Most production vector DBs (Pinecone, Weaviate, Qdrant, Vespa) ship hybrid search in 2026.
Use it when: corpus contains rare terms, product codes, person names, or technical identifiers.
Skip it when: purely conversational prose with no rare-term queries.
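At query time the two result lists have to be merged; Reciprocal Rank Fusion is a common choice because it needs no score calibration between the dense and sparse sides. A minimal sketch:

```python
def reciprocal_rank_fusion(dense_ranking, sparse_ranking, k=60):
    """Fuse two rankings: each list contributes 1/(k + rank) per document."""
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c3", "c1", "c7"]   # nearest-neighbor order from dense vectors
sparse = ["c7", "c3", "c9"]  # BM25 order: a rare product ID boosts c7
print(reciprocal_rank_fusion(dense, sparse))
```

A chunk that ranks well in both lists (like `c3` and `c7` here) outscores a chunk that ranks well in only one, which is exactly the behavior you want for rare-term queries.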
Choosing chunk size in 2026
Three buckets cover most production stacks:
- 256-512 tokens. Chat-style RAG over general docs, FAQs, customer support. Pair with retrieve top-20 to top-30 and rerank to top-5.
- 800-1200 tokens. Technical and legal corpora where surrounding context is required for grounding. Pair with retrieve top-10 to top-20 and rerank to top-3.
- 1500-2048 tokens. Long-document QA with strong rerankers and long-context generators. Pair with retrieve top-5 to top-10 and minimal reranking.
Co-tune overlap with chunk size. Below 256 tokens, 20-30% overlap prevents fact-loss at boundaries. Above 1000 tokens, 10% overlap is usually enough because boundary effects are smaller.
Always measure on a labeled retrieval set before locking in a size. A chunk size that wins on benchmarks rarely wins on production traffic without tuning.
Indexing methods that pair with chunking
Chunking is only half of vector retrieval. The other half is the index. In 2026 the two dominant approaches are:
- HNSW (Hierarchical Navigable Small World). Graph-based ANN index. Default in pgvector, Qdrant, Weaviate, and Pinecone. Tunable with `m` (graph degree) and `ef_search` (query-time candidates). Good general-purpose choice for chunk counts from 10K to 100M.
- Product quantization (IVF-PQ, OPQ). Compresses vectors into a small code, trading recall for memory and speed. Default in FAISS for billion-scale indexes. Used when you need 1B+ chunks and can tolerate a recall hit.
Smaller indexes use HNSW. Billion-scale indexes use IVF-PQ or HNSW-PQ hybrids. The chunking strategy does not change the index choice, but chunk count does: smaller chunks mean more vectors, which pushes you toward PQ earlier.
Frameworks and libraries
Three open-source ecosystems handle most production chunking in 2026:
- LangChain ships `RecursiveCharacterTextSplitter`, `SemanticChunker`, and `MarkdownHeaderTextSplitter` out of the box.
- LlamaIndex ships `SentenceSplitter`, `SemanticSplitterNodeParser`, `HierarchicalNodeParser`, and auto-merging retrievers.
- Haystack ships `DocumentSplitter` with split-by-sentence, split-by-passage, and split-by-page modes.
Vector DBs that ship hybrid search and chunk-aware ingest in 2026 include Pinecone, Weaviate, Qdrant, Vespa, and Milvus.
How to evaluate a chunking change
Treat chunking like any other production change: ship behind a flag, evaluate on a labeled set, measure both retrieval and end-to-end metrics, and run a shadow comparison on a sample of live traffic before cutting over.
The retrieval-only metrics that move first when chunking changes:
- Context Recall. Did the retrieved set contain the chunks needed to answer?
- Context Precision. What fraction of the retrieved chunks were actually relevant?
- Mean Reciprocal Rank (MRR). How high did the first right chunk rank?
- Hit Rate at k. Did a right chunk appear in the top-k?
The end-to-end metrics that should move with them:
- Faithfulness. Response anchored in retrieved chunks; no hallucination.
- Answer Relevance. Response actually answers the question.
A change that lifts Context Recall but drops Faithfulness usually means chunks are now too large and the generator is hallucinating from irrelevant content. Both metrics must move together.
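The two rank-based retrieval metrics are cheap to compute yourself on a labeled set. A minimal sketch, assuming each query maps to a ranked list of retrieved chunk IDs and a set of relevant IDs:

```python
def hit_rate_at_k(retrieved, relevant, k=5):
    """Fraction of queries with at least one relevant chunk in the top-k."""
    hits = sum(
        any(doc in relevant[q] for doc in docs[:k])
        for q, docs in retrieved.items()
    )
    return hits / len(retrieved)

def mean_reciprocal_rank(retrieved, relevant):
    """Average of 1/rank of the first relevant chunk per query (0 if none found)."""
    total = 0.0
    for q, docs in retrieved.items():
        for rank, doc in enumerate(docs, start=1):
            if doc in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(retrieved)

retrieved = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
relevant = {"q1": {"b"}, "q2": {"z"}}
print(hit_rate_at_k(retrieved, relevant, k=2))       # q1 hits, q2 misses
print(mean_reciprocal_rank(retrieved, relevant))     # (1/2 + 1/3) / 2
```

Run both before and after a chunking change on the same labeled queries; a flat MRR with a rising Hit Rate at k usually means the reranker, not the chunker, is doing the work.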
Production patterns that consistently win
Four patterns recur across 2026 production RAG stacks that hit high retrieval quality:
- Small chunks plus reranker. 400-600 token chunks, retrieve top-30 to top-50, rerank to top-5 with a cross-encoder. Beats large chunks without rerank on most corpora.
- Hybrid sparse plus dense. Always pair dense vectors with sparse BM25 or SPLADE for queries with rare terms or named entities.
- Chunk metadata. Always store section, page, source URL, and ingest timestamp alongside the vector. Needed for citation, freshness filtering, and access control.
- Production replay. Sample failing production traces, replay them through a candidate chunking change, and compare retrieval and end-to-end scores before cutting over.
Common chunking failure modes
Four pitfalls cause most retrieval regressions:
- Mid-structure splits. Splitting a table mid-row, a code block mid-function, or a list mid-item destroys grounding. Use markdown-aware or LLM-decided chunking for structured docs.
- Missing metadata. Storing only the vector and the chunk text loses section, page, and source URL. Downstream features like citation, freshness filtering, and access control become impossible to add later.
- Over-large overlap. Above 30% overlap inflates index size, hurts precision, and creates duplicate retrievals that look like high recall but waste reranker budget.
- Single-query-set evaluation. Evaluating chunking on one query set that does not represent production traffic ships changes that win in eval and lose in prod. Always evaluate on production-shaped queries.
How Future AGI fits in: the retrieval-eval companion
Future AGI is not a vector database, a chunker, or an embedding model. The chunking decisions in this post are owned by your ingest pipeline (LangChain, LlamaIndex, Haystack) and your vector index (Pinecone, Weaviate, Qdrant, Vespa, Milvus).
Future AGI is the eval and observability companion that scores whether the chunking change you just shipped actually moved retrieval and end-to-end quality. The platform ships RAG-specific judges attached to traces: Context Recall, Context Precision, Faithfulness, Answer Relevance, and Chunk Attribution. The traceAI instrumentation library is Apache 2.0 and OpenTelemetry-compatible, so each chunk retrieval span carries its own scores in production. When teams swap a chunking strategy, the span-attached scores show whether Faithfulness held while Context Recall improved, which is the only signal that matters.
```python
from fi.evals import evaluate

retrieval_score = evaluate(
    "context_recall",
    output=generator_response,
    context=retrieved_chunks,
    ground_truth=labeled_answer,
)

grounding_score = evaluate(
    "faithfulness",
    output=generator_response,
    context=retrieved_chunks,
)
```
For BYOK gateway routing across embedding models and LLM generators during chunking experiments, the Agent Command Center sits in front of the providers and writes spans into the same trace stream as your retrieval and generation calls. That keeps the chunking change, the retrieval scores, and the end-to-end scores on one timeline.
Summary: chunking is the floor, evaluate every change
Vector chunking sets the floor on RAG quality in 2026. The seven strategies above (fixed, recursive, semantic, late, agentic, hierarchical, sparse-hybrid) cover the corpora you are likely to ship. The right answer is almost always a 400-600 token fixed or recursive baseline with reranking, then an upgrade to semantic, late, or agentic only after measuring a retrieval gap. Co-tune chunk size with reranker depth, always store metadata, and always evaluate both retrieval and end-to-end metrics together.
The unlock is not picking the trendiest strategy. The unlock is shipping every chunking change behind retrieval and Faithfulness scores so you know which trade-off you bought.
Frequently asked questions
What is vector chunking and why does it matter for RAG in 2026?
What is the right chunk size for RAG?
What is semantic chunking and when should I use it?
How does late chunking differ from regular chunking?
What is agentic chunking and is it production-ready?
How do I evaluate a chunking strategy?
How does chunking interact with rerankers?
What chunking pitfalls cost the most retrieval quality?