Vector Chunking in 2026: Strategies, Chunk Sizes, and Retrieval Wins

Vector chunking in 2026: fixed, semantic, late, hierarchical, agentic, and SPLADE-style sparse chunking compared with sizes, retrieval gains, and pitfalls.

TL;DR: vector chunking in 2026

| Strategy | When to use | Chunk size | Ingest cost | Retrieval gain vs fixed |
| --- | --- | --- | --- | --- |
| Fixed-size | Uniform prose, blog, news | 256-512 tokens | Lowest | Baseline |
| Recursive character | Mixed structure, default in LangChain | 400-800 tokens | Low | Modest on most prose |
| Semantic | Short self-contained sections, FAQs | Variable, 200-1200 tokens | 5-15x fixed | Higher recall on cross-section queries (corpus-dependent) |
| Late chunking | Multi-hop, entity-resolution queries | 256-512 tokens | 2-4x fixed | Higher recall on entity queries (corpus-dependent) |
| Agentic (LLM-decided) | Contracts, technical docs, reports | LLM-chosen | 50-200x fixed | Highest on structured corpora (corpus-dependent) |
| Hierarchical | Long-doc QA with reranker | 200-400 child / 1500-3000 parent | 2x fixed | Higher recall on long-doc QA |
| Sparse + dense hybrid | Rare terms, codenames, IDs | 400-800 tokens | 2x fixed | Higher recall on rare-term queries |

If you read one row: fixed 400-600 token chunks with 15% overlap plus a cross-encoder reranker is the 2026 baseline. Move to semantic or late chunking only after you have measured a retrieval gap on a labeled set.

Why chunking matters most in 2026 RAG

A RAG pipeline has four moving parts: ingest (chunk and embed), index, retrieve, and generate. Bad chunking caps every metric below it. If the right facts are split across chunk boundaries, no reranker, longer context window, or better LLM can recover them. Engineering teams that debug “the LLM hallucinated” without inspecting chunks usually find the model never received the supporting evidence in the first place.

The other surfaces (vector DB choice, reranker model, generator model) all matter, but they have ceilings. Chunking sets the floor: once it is wrong, every later step amplifies the error. That is why chunking is the first thing to evaluate when retrieval quality regresses.

What is vector chunking, precisely

Vector chunking is the step where a corpus is split into passages, each passage is embedded, and the resulting vectors are stored in a vector index for nearest-neighbor search at query time. Three parameters define a chunking strategy:

  1. Boundary rule. Where to cut: token count, character count, sentence, paragraph, semantic similarity drop, LLM-decided.
  2. Chunk size. How large each chunk is, usually expressed in tokens. Typical range 256 to 2048.
  3. Overlap. How many tokens of the previous chunk repeat at the start of the next. Typical range 10% to 20%.

Two derived choices ride on top: whether to store chunk metadata (page, section, document ID) alongside the vector, and whether to keep a parent-child relationship (hierarchical chunking).

The seven 2026 chunking strategies, explained

1. Fixed-size chunking

Split by token count, usually 256, 512, or 1024 tokens, with 10-20% overlap. This is the default in every RAG framework because it is fast, deterministic, and predictable. Fixed-size chunking works well when the corpus is uniform prose (news articles, blog posts, transcripts) and degrades when documents have strong internal structure (tables, code, contracts, hierarchical specs).

Use it when: prose corpus, single document type, ingest speed matters.

Skip it when: documents contain tables that would be split mid-row, code blocks, or hierarchical sections smaller than your chunk size.
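
As a point of reference, a fixed-size chunker is only a few lines. The sketch below assumes tiktoken for tokenization; chunk_size and overlap are the two knobs from the parameters above.

import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    # Tokenize once, then slide a window of chunk_size tokens,
    # advancing by chunk_size - overlap each step.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks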

2. Recursive character splitting

Split on a hierarchy of separators: try paragraphs first, then sentences, then characters, until each piece fits the target size. This is the LangChain default and respects soft document structure without requiring an embedding pass. It is a strict upgrade over fixed-size: same speed, slightly better boundaries.

Use it when: mixed-structure corpus, you want a “smart fixed” baseline.

Skip it when: you have heavy tables or code that need protected boundaries.
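
In LangChain this is a few lines; the sketch below uses the tiktoken-based constructor so sizes stay in tokens (API as of recent langchain-text-splitters releases; document_text is a placeholder).

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=600,      # target size in tokens
    chunk_overlap=90,    # ~15% overlap
    separators=["\n\n", "\n", ". ", " ", ""],  # paragraphs, then sentences, then words
)
chunks = splitter.split_text(document_text)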

3. Semantic chunking

Embed candidate sentence groups, compute a similarity score between adjacent groups, and cut where similarity drops below a threshold. The chunk size becomes variable, ranging from a single sentence to a full section. Semantic chunking captures topical boundaries that fixed chunking misses but adds an embedding pass at ingest time, which is 5-15x slower per document.

Use it when: corpora with strong topical shifts, FAQs, knowledge bases, short sections that should stay together.

Skip it when: ingest throughput matters, or fixed chunks already retrieve well.

Reference implementations: LlamaIndex SemanticSplitterNodeParser and LangChain SemanticChunker ship this strategy.
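
A hand-rolled sketch of the similarity-drop cut is below. For brevity it compares single adjacent sentences rather than sentence groups; embed_fn is a placeholder for any sentence-embedding call, and the threshold needs corpus tuning.

import numpy as np

def semantic_chunks(sentences: list[str], embed_fn, threshold: float = 0.75) -> list[str]:
    vecs = np.array([embed_fn(s) for s in sentences])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # unit-normalize for cosine
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(vecs[i - 1] @ vecs[i])   # cosine similarity of adjacent sentences
        if sim < threshold:                  # topical shift: cut here
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks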

4. Late chunking

Late chunking, introduced by Jina AI in 2024, embeds the full document with a long-context embedding model first, then pools per-chunk vectors from the same forward pass. The chunk vectors retain cross-chunk context because the embedding model has seen the entire document. This helps with multi-hop and entity-resolution queries where pronouns or references span chunk boundaries.

Use it when: multi-hop queries, entity-heavy corpora (legal, biomedical), pronoun and coreference matter.

Skip it when: documents are shorter than 2K tokens (no benefit), or your embedding model is not long-context.
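
A sketch of the pooling step, assuming a long-context embedding model exposed through Hugging Face transformers; the Jina v2 model name is one such option, and mapping chunk boundaries to token spans (via the tokenizer's offset mapping) is omitted for brevity.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-en")
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)

def late_chunk_vectors(text: str, spans: list[tuple[int, int]]) -> torch.Tensor:
    # One forward pass over the whole document, so every token state
    # carries full-document context before pooling.
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        states = model(**inputs).last_hidden_state[0]   # (seq_len, dim)
    # Mean-pool each (start, end) token span into one chunk vector.
    return torch.stack([states[s:e].mean(dim=0) for s, e in spans])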

5. Agentic chunking

Use an LLM to read each document and decide chunk boundaries based on content type. The model outputs span boundaries that respect tables, code blocks, section headers, and topical shifts. This is the most expensive strategy at ingest, costing one LLM call per document, but it is the highest-fidelity option for high-value corpora.

In 2026 most teams use a small, low-latency flash-class model for the chunking pass to keep cost under control. Cache the chunk plan so re-ingest does not re-pay the LLM cost.

Use it when: contracts, technical specs, regulatory filings, mixed-format reports.

Skip it when: corpus is uniform prose, ingest budget is tight, or you have not yet measured a retrieval gap that justifies the cost.
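
A sketch of the ingest pass, assuming the OpenAI client and a hypothetical boundary-offset prompt; the cache keyed by content hash is what makes re-ingest free.

import hashlib, json
from openai import OpenAI

client = OpenAI()
_plan_cache: dict[str, list[int]] = {}   # swap for a persistent store in production

def chunk_plan(document: str) -> list[int]:
    key = hashlib.sha256(document.encode()).hexdigest()
    if key not in _plan_cache:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",   # any cheap flash-class model
            messages=[{
                "role": "user",
                "content": "Return a JSON array of character offsets where this "
                           "document should be split. Never split tables, code "
                           "blocks, or lists mid-item.\n\n" + document,
            }],
        )
        # Assumes the model returns bare JSON; add validation in production.
        _plan_cache[key] = json.loads(resp.choices[0].message.content)
    return _plan_cache[key]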

6. Hierarchical chunking

Store both small chunks (200-400 tokens) for retrieval and larger parent chunks (1500-3000 tokens) for generation. Retrieve at the child level for precision, expand to parents before sending to the generator for context. LlamaIndex calls this “auto-merging retrieval” and ships it as a node parser.

Use it when: long-document QA, tutorials, manuals where the answer is in a small section but needs surrounding context.

Skip it when: documents are short or self-contained.
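
In LlamaIndex the parser is built in. A minimal sketch, assuming a recent llama-index-core release and a placeholder document_text; pair the leaf index with AutoMergingRetriever at query time.

from llama_index.core import Document
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 256],   # parent -> child sizes in tokens
)
nodes = parser.get_nodes_from_documents([Document(text=document_text)])
leaves = get_leaf_nodes(nodes)      # index the leaves; expand to parents for generation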

7. Sparse and hybrid chunking

Pair each dense chunk vector with a sparse representation (BM25 or SPLADE) and ingest both into a hybrid index. Hybrid retrieval handles rare terms, codenames, product IDs, and out-of-vocabulary entities that dense vectors miss. Most production vector DBs (Pinecone, Weaviate, Qdrant, Vespa) ship hybrid search in 2026.

Use it when: corpus contains rare terms, product codes, person names, or technical identifiers.

Skip it when: purely conversational prose with no rare-term queries.
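
Most hybrid indexes fuse the two rankings server-side; the sketch below shows the common reciprocal rank fusion for clarity.

def rrf_fuse(sparse_ids: list[str], dense_ids: list[str], k: int = 60) -> list[str]:
    # Each ranking contributes 1 / (k + rank); documents found by both lists rise.
    scores: dict[str, float] = {}
    for ranking in (sparse_ids, dense_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)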

Choosing chunk size in 2026

Three buckets cover most production stacks:

  • 256-512 tokens. Chat-style RAG over general docs, FAQs, customer support. Pair with retrieve top-20 to top-30 and rerank to top-5.
  • 800-1200 tokens. Technical and legal corpora where surrounding context is required for grounding. Pair with retrieve top-10 to top-20 and rerank to top-3.
  • 1500-2048 tokens. Long-document QA with strong rerankers and long-context generators. Pair with retrieve top-5 to top-10 and minimal reranking.

Co-tune overlap with chunk size. Below 256 tokens, 20-30% overlap prevents fact loss at boundaries. Above 1000 tokens, 10% overlap is usually enough because boundary effects are smaller.

Always measure on a labeled retrieval set before locking in a size. A chunk size that wins on benchmarks rarely wins on production traffic without tuning.

Indexing methods that pair with chunking

Chunking is only half of vector retrieval. The other half is the index. In 2026 the two dominant approaches are:

  1. HNSW (Hierarchical Navigable Small World). Graph-based ANN index. Default in pgvector, Qdrant, Weaviate, and Pinecone. Tunable with m (graph degree) and ef_search (query-time candidates). Good general-purpose choice for chunk counts from 10K to 100M.
  2. Product quantization (IVF-PQ, OPQ). Compresses vectors into a small code, trading recall for memory and speed. Default in FAISS for billion-scale indexes. Used when you need 1B+ chunks and can tolerate a recall hit.

Smaller indexes use HNSW. Billion-scale indexes use IVF-PQ or HNSW-PQ hybrids. The chunking strategy does not change the index choice, but chunk count does: smaller chunks mean more vectors, which pushes you toward PQ earlier.
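
A minimal HNSW build with FAISS showing the two knobs named above; chunk_vectors and query_vec are placeholder float32 arrays.

import faiss
import numpy as np

dim = 768
index = faiss.IndexHNSWFlat(dim, 32)         # m = 32 graph degree
index.hnsw.efConstruction = 200              # build-time candidate list
index.hnsw.efSearch = 64                     # query-time candidate list
index.add(chunk_vectors.astype(np.float32))  # one vector per chunk, shape (n, dim)
scores, ids = index.search(query_vec.reshape(1, -1).astype(np.float32), 20)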

Frameworks and libraries

Three open-source ecosystems handle most production chunking in 2026:

  • LangChain ships RecursiveCharacterTextSplitter, SemanticChunker, and MarkdownHeaderTextSplitter out of the box.
  • LlamaIndex ships SentenceSplitter, SemanticSplitterNodeParser, HierarchicalNodeParser, and auto-merging retrievers.
  • Haystack ships DocumentSplitter with split-by-sentence, split-by-passage, and split-by-page modes.

Vector DBs that ship hybrid search and chunk-aware ingest in 2026 include Pinecone, Weaviate, Qdrant, Vespa, and Milvus.

How to evaluate a chunking change

Treat chunking like any other production change: ship behind a flag, evaluate on a labeled set, measure both retrieval and end-to-end metrics, and run a shadow comparison on a sample of live traffic before cutting over.

The retrieval-only metrics that move first when chunking changes:

  • Context Recall. Was the right chunk in the retrieved set?
  • Context Precision. Did the retrieved chunks contain the answer?
  • Mean Reciprocal Rank (MRR). Where the right chunk ranked.
  • Hit Rate at k. Whether the right chunk appeared in the top-k.

The end-to-end metrics that should move with them:

  • Faithfulness. Response anchored in retrieved chunks; no hallucination.
  • Answer Relevance. Response actually answers the question.

A change that lifts Context Recall but drops Faithfulness usually means chunks are now too large and the generator is hallucinating from irrelevant content. Both metrics must move together.
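
Two of these, Hit Rate at k and MRR, take only a few lines to compute by hand. The sketch below assumes each labeled example pairs the retrieved chunk IDs with a single gold chunk ID; multi-gold sets need a set intersection instead.

def hit_rate_at_k(results: list[tuple[list[str], str]], k: int = 5) -> float:
    # Fraction of examples where the gold chunk appears in the top-k.
    return sum(gold in retrieved[:k] for retrieved, gold in results) / len(results)

def mrr(results: list[tuple[list[str], str]]) -> float:
    # Mean of 1 / rank of the gold chunk (0 when it was not retrieved).
    total = 0.0
    for retrieved, gold in results:
        if gold in retrieved:
            total += 1.0 / (retrieved.index(gold) + 1)
    return total / len(results)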

Production patterns that consistently win

Four patterns recur across 2026 production RAG stacks that hit high retrieval quality:

  1. Small chunks plus reranker. 400-600 token chunks, retrieve top-30 to top-50, rerank to top-5 with a cross-encoder (sketched after this list). Beats large chunks without rerank on most corpora.
  2. Hybrid sparse plus dense. Always pair dense vectors with sparse BM25 or SPLADE for queries with rare terms or named entities.
  3. Chunk metadata. Always store section, page, source URL, and ingest timestamp alongside the vector. Needed for citation, freshness filtering, and access control.
  4. Production replay. Sample failing production traces, replay them through a candidate chunking change, and compare retrieval and end-to-end scores before cutting over.
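
A sketch of pattern 1 using sentence-transformers' CrossEncoder; the model name is a common public reranker (swap in your own), and candidates come from the over-retrieval step.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Score every (query, chunk) pair jointly, then keep the top_k chunks.
    scores = reranker.predict([(query, c) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]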

Common chunking failure modes

Four pitfalls cause most retrieval regressions:

  1. Mid-structure splits. Splitting a table mid-row, a code block mid-function, or a list mid-item destroys grounding. Use markdown-aware or LLM-decided chunking for structured docs.
  2. Missing metadata. Storing only the vector and the chunk text loses section, page, and source URL. Citation, freshness, and access-control downstream features become impossible to add later.
  3. Over-large overlap. Above 30% overlap inflates index size, hurts precision, and creates duplicate retrievals that look like high recall but waste reranker budget.
  4. Single-query-set evaluation. Evaluating chunking on one query set that does not represent production traffic ships changes that win in eval and lose in prod. Always evaluate on production-shaped queries.

How Future AGI fits in: the retrieval-eval companion

Future AGI is not a vector database, a chunker, or an embedding model. The chunking decisions in this post are owned by your ingest pipeline (LangChain, LlamaIndex, Haystack) and your vector index (Pinecone, Weaviate, Qdrant, Vespa, Milvus).

Future AGI is the eval and observability companion that scores whether the chunking change you just shipped actually moved retrieval and end-to-end quality. The platform ships RAG-specific judges attached to traces: Context Recall, Context Precision, Faithfulness, Answer Relevance, and Chunk Attribution. The traceAI instrumentation library is Apache 2.0 and OpenTelemetry-compatible, so each chunk retrieval span carries its own scores in production. When teams swap a chunking strategy, the span-attached scores show whether Faithfulness held while Context Recall improved, which is the only signal that matters.

from fi.evals import evaluate

# Retrieval-side judge: did the retrieved chunks cover the labeled answer?
retrieval_score = evaluate(
    "context_recall",
    output=generator_response,
    context=retrieved_chunks,
    ground_truth=labeled_answer,
)

# End-to-end judge: is the response grounded in the retrieved chunks?
grounding_score = evaluate(
    "faithfulness",
    output=generator_response,
    context=retrieved_chunks,
)

For BYOK (bring-your-own-key) gateway routing across embedding models and LLM generators during chunking experiments, the Agent Command Center sits in front of the providers and writes spans into the same trace stream as your retrieval and generation calls. That keeps the chunking change, the retrieval scores, and the end-to-end scores on one timeline.

Summary: chunking is the floor, evaluate every change

Vector chunking sets the floor on RAG quality in 2026. The seven strategies above (fixed, recursive, semantic, late, agentic, hierarchical, sparse-hybrid) cover the corpora you are likely to ship. The right answer is almost always a 400-600 token fixed or recursive baseline with reranking, then an upgrade to semantic, late, or agentic only after measuring a retrieval gap. Co-tune chunk size with reranker depth, always store metadata, and always evaluate both retrieval and end-to-end metrics together.

The unlock is not picking the trendiest strategy. The unlock is shipping every chunking change behind retrieval and Faithfulness scores so you know which trade-off you bought.

Frequently asked questions

What is vector chunking and why does it matter for RAG in 2026?
Vector chunking is the step where you split a corpus into passages, embed each passage, and store the embeddings in a vector index for similarity search. In 2026 RAG stacks, chunking is the single biggest lever on retrieval quality: a fixed 512-token chunk on legal text retrieves differently from a semantic chunk that respects section breaks. Chunk size, overlap, and boundary strategy together determine Context Recall, Context Precision, and downstream Faithfulness. Bad chunking caps every metric below it.
What is the right chunk size for RAG?
There is no universal answer, but 2026 production stacks usually sit in three buckets: 256-512 tokens for chat-style RAG over general docs, 800-1200 tokens for technical and legal corpora where surrounding context is required for grounding, and 1500-2048 tokens for long-document QA with strong rerankers in front. Always co-tune chunk size with overlap (10-20% is typical) and with your reranker top-k. Measure Context Recall and Faithfulness on a labeled set before locking in a size.
What is semantic chunking and when should I use it?
Semantic chunking splits on meaning boundaries rather than character count: it embeds candidate sentence groups, computes a similarity score between adjacent groups, and cuts where similarity drops. Use it when fixed-size chunks split atomic facts in half, when sections are short and self-contained (FAQs, knowledge bases), and when downstream Faithfulness drops because retrieved chunks lose context. Skip it when the corpus is uniform prose where fixed chunks already retrieve well, since semantic chunking is roughly 5-15x slower at ingest time.
How does late chunking differ from regular chunking?
Late chunking, introduced by Jina AI in 2024, embeds the full document with a long-context embedding model first, then pools per-chunk embeddings from the same forward pass. This preserves cross-chunk context inside each chunk vector instead of treating chunks as independent documents. Late chunking helps when entities or pronouns span chunk boundaries, since the chunk embedding still carries the global context. It needs a long-context embedding model (8K-32K tokens) and is slower per document but often improves retrieval on multi-hop and entity-resolution queries.
What is agentic chunking and is it production-ready?
Agentic chunking uses an LLM to decide chunk boundaries based on document structure and content type. The model reads the document and outputs span boundaries that respect tables, code blocks, sections, and topical shifts. It is the most expensive strategy (one LLM call per document at ingest) but is increasingly used in 2026 for technical docs, contracts, and structured reports where naive splitting destroys retrieval. Production-ready if you cap it to high-value corpora and cache the chunk plan.
How do I evaluate a chunking strategy?
Run two evaluations side by side on a labeled retrieval set. First, retrieval-only metrics: Context Recall (was the right chunk retrieved), Context Precision (did the retrieved chunks contain the answer), Mean Reciprocal Rank, and Hit Rate at k. Second, end-to-end RAG metrics: Faithfulness (response anchored in retrieved chunks) and Answer Relevance. A chunking change that lifts Context Recall but drops Faithfulness usually means chunks are now too large and the generator is hallucinating from irrelevant content. Both metrics must move together.
How does chunking interact with rerankers?
Chunking and reranking are co-tuned. Smaller chunks let you over-retrieve (top-50, top-100) and rerank down to top-5, which usually beats large chunks without rerank. Larger chunks blunt the reranker because each candidate fuses relevant and irrelevant content, leaving less for the reranker to re-order. In 2026 production stacks, the canonical pattern is: 400-600 token chunks, retrieve top-30 to top-50, rerank to top-5 with a cross-encoder, then send to the generator. Tune the recall-precision trade-off by adjusting reranker top-k, not chunk size.
What chunking pitfalls cost the most retrieval quality?
Four pitfalls cause most regressions. First, splitting tables, code blocks, or lists mid-row, which destroys grounding. Second, ignoring document metadata (section headers, page numbers, source URL), which prevents downstream citation. Third, over-large overlap (above 30%), which inflates index size and hurts precision. Fourth, evaluating chunking changes on a single query set that does not represent production traffic. Always evaluate on production-shaped queries with retrieval and end-to-end metrics together.