What Is Chunking?
Chunking in RAG is the process of splitting source documents into smaller retrieval units before embedding, indexing, and context assembly. It is a RAG data-preparation and retrieval-design step that determines which evidence reaches the LLM in an eval pipeline or production trace. Good chunks preserve a complete idea, carry useful metadata, and fit the model’s context budget. Poor chunks cause missing evidence, noisy retrieval, weak citation support, and unused context; FutureAGI evaluates those failures with ChunkAttribution and ChunkUtilization.
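As a minimal illustration of the splitting step, the sketch below cuts text into fixed-size windows with overlap and attaches positional metadata. It is a generic example, not FutureAGI's splitter; chunk_text and its defaults are hypothetical, and production splitters usually respect headings, sentences, and token counts rather than raw character offsets.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[dict]:
    """Generic fixed-size splitter sketch: overlapping character windows,
    each tagged with a chunk ID and start offset as minimal metadata."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    for i, start in enumerate(range(0, len(text), chunk_size - overlap)):
        body = text[start : start + chunk_size]
        if body.strip():
            chunks.append({"chunk_id": i, "start": start, "text": body})
    return chunks
```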
Why Chunking Matters in Production LLM and Agent Systems
Chunking failures rarely announce themselves as errors. They surface as confident RAG answers that cite the wrong paragraph, miss the key exception, or combine two unrelated policies because a document was split at the wrong boundary. A retriever can report healthy top-k latency and still return fragments that are too small to answer the question or too large for the model to use. The downstream failure is usually a RAG hallucination, weak attribution, or stale-looking answer that actually came from damaged context assembly.
Developers feel this when a knowledge-base import passes ingestion tests but support tickets say “the source was there, yet the answer missed it.” SREs see higher token cost, lower cache efficiency, and eval failures after an embedding-model change or document reindex. Compliance teams care because bad chunks can separate a claim from the disclaimer or policy condition that makes it safe. Product teams see thumbs-down feedback clustered around long PDFs, tables, contracts, and troubleshooting guides.
The risk is sharper in 2026 agentic pipelines. A support agent may retrieve chunks, summarize them, call a billing tool, and write a final answer. If the first retrieval step loses the warranty condition, later steps can be technically valid and still act on incomplete evidence. Chunking therefore belongs in the reliability contract, alongside retrieval quality, grounding, and trace-level attribution.
How FutureAGI Handles Chunking
FutureAGI’s approach is to treat chunking as an observable retrieval-design decision, not a hidden preprocessing choice. In a typical workflow, an engineer imports product documentation into a RAG index with a specific splitter, overlap size, metadata schema, and document version. A LangChain or LlamaIndex service is instrumented with traceAI-langchain or traceAI-llamaindex, so each production trace keeps the query, retrieved chunk IDs, source document IDs, chunk text, generation output, and token fields such as llm.token_count.prompt.
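A registration sketch follows. It assumes the instrumentor pattern shown in FutureAGI's traceAI examples; the register signature and LangChainInstrumentor name should be verified against the installed fi-instrumentation and traceai-langchain versions.

```python
# Assumed API: confirm register() and LangChainInstrumentor against the
# current fi-instrumentation / traceai-langchain releases.
from fi_instrumentation import register
from traceai_langchain import LangChainInstrumentor

# Register a tracer provider for the project, then instrument LangChain so
# each chain run emits spans carrying the query, retrieved chunk IDs,
# chunk text, generation output, and llm.token_count.prompt.
trace_provider = register(project_name="support-rag")
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
```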
The evaluator anchor is explicit: ChunkAttribution checks whether answer claims can be tied back to retrieved chunks, while ChunkUtilization checks whether the retrieved chunks were actually used by the model. Teams usually run both with ContextRelevance. If relevance is low, the retriever or index is not finding the right evidence. If relevance is high but utilization is low, the chunks may be too verbose, duplicated, or poorly ordered. If utilization is high but attribution fails, the model may be overgeneralizing from partial text.
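That triage reads naturally as code. The sketch below is plain Python with an illustrative 0.5 threshold and a hypothetical diagnose_chunks helper; it simply encodes the three failure combinations described above.

```python
def diagnose_chunks(relevance: float, utilization: float,
                    attribution: float, threshold: float = 0.5) -> str:
    """Map ContextRelevance / ChunkUtilization / ChunkAttribution scores
    to the likely failure mode. The threshold is illustrative."""
    if relevance < threshold:
        return "retriever or index is not finding the right evidence"
    if utilization < threshold:
        return "chunks may be too verbose, duplicated, or poorly ordered"
    if attribution < threshold:
        return "model may be overgeneralizing from partial chunk text"
    return "chunk pipeline looks healthy for this trace"
```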
Unlike a one-off Ragas notebook score, FutureAGI connects those measurements to trace cohorts, index versions, and release gates. A practical response is concrete: set a minimum ChunkAttribution threshold for regulated workflows, compare fail rate by splitter version, inspect the lowest-scoring chunks, then rerun a regression eval before rolling the new index to all traffic.
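A minimal version of that release gate is sketched below. The trace records, the chunk_attribution field name, and the thresholds are illustrative assumptions, not FutureAGI APIs; the point is comparing fail rates across splitter versions before full rollout.

```python
def attribution_fail_rate(traces: list[dict], threshold: float = 0.5) -> float:
    """Fraction of traces whose ChunkAttribution score falls below threshold.
    Assumes each trace dict carries an illustrative 'chunk_attribution' field."""
    fails = sum(1 for t in traces if t["chunk_attribution"] < threshold)
    return fails / len(traces)

def gate_new_index(old: list[dict], new: list[dict],
                   max_regression: float = 0.02) -> bool:
    """Allow rollout only if the new splitter version does not regress the
    attribution fail rate by more than max_regression."""
    return attribution_fail_rate(new) <= attribution_fail_rate(old) + max_regression
```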
How to Measure or Detect Chunking
Use chunking metrics at both retrieval and answer time:
- ChunkAttribution: scores whether the final answer can be traced to the retrieved chunks instead of unsupported model prose.
- ChunkUtilization: scores whether the model used the retrieved chunks, which helps find oversized, noisy, or duplicated context.
- Trace fields: store chunk ID, document ID, chunk rank, splitter version, overlap size, retrieved text, response, and llm.token_count.prompt.
- Dashboard signals: monitor attribution fail rate by index version, context tokens per trace, p95 retrieval latency, and utilization by document type.
- User proxy: correlate low scores with thumbs-down rate, human escalation rate, and comments like “source did not answer this.”
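For example, scoring a single trace’s answer against its retrieved chunk with both evaluators: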
```python
from fi.evals import ChunkAttribution, ChunkUtilization

# One retrieved chunk and the model's final answer from the same trace.
contexts = ["Refunds are allowed within 30 days for annual plans."]
answer = "Annual-plan customers can request a refund within 30 days."

# Attribution: is the answer supported by the chunk? Utilization: was the
# chunk actually used? Both scores are read off the evaluation results.
attribution = ChunkAttribution().evaluate(response=answer, contexts=contexts)
utilization = ChunkUtilization().evaluate(response=answer, contexts=contexts)
print(attribution.score, utilization.score)
```
Common Mistakes
- Splitting by fixed tokens only. Token windows ignore headings, tables, and policy clauses, so the answerable unit can be split across chunks.
- Using maximum overlap everywhere. More overlap can raise storage cost, duplicate retrieval results, and crowd out distinct evidence from the context window.
- Ignoring metadata. Chunk text without document version, section title, product, locale, or timestamp is hard to audit when attribution fails.
- Optimizing retrieval recall alone. High recall with low ChunkUtilization means the model received evidence but could not use it effectively.
- Changing the splitter without regression evals. Reindexing can move answer boundaries; run chunk-level tests before shipping a new corpus version.
Frequently Asked Questions
What is chunking in RAG?
Chunking in RAG splits source documents into smaller retrieval units before embedding, indexing, and context assembly. Good chunks preserve enough meaning to answer a query without wasting the context window.
How is chunking different from chunk overlap?
Chunking defines the boundaries of each retrieval unit. Chunk overlap repeats text across adjacent chunks so answers near a boundary are less likely to lose needed context.
How do you measure chunking?
FutureAGI measures chunk quality with ChunkAttribution and ChunkUtilization. Teams compare those scores with retrieval rank, context relevance, and production feedback cohorts.