What Is Chunk Utilization?
A continuous RAG evaluation metric that scores how effectively a model integrates retrieved chunks into its response, returned as a 0-1 score.
Chunk utilization is a continuous RAG evaluation metric that scores how effectively a model integrates each retrieved chunk into its response. The evaluator extracts information units from the context — entities, key phrases, n-grams — checks how many appear in the output, and returns a numeric score where higher values indicate richer integration. Where chunk attribution gives a Pass/Fail signal on whether the chunks were used at all, chunk utilization quantifies the efficiency of that use. It runs on every RAG span in production and on offline evaluation datasets, surfacing models that technically use the context, but only barely.
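To make the mechanism concrete, here is a minimal sketch of the idea, not FutureAGI's implementation: treat lowercased bigrams as the information units and score the fraction that reappear in the response. The function name and the choice of unit are illustrative assumptions.

import re

def utilization_score(context_chunks, response, n=2):
    # Toy scorer: fraction of context n-grams (stand-ins for "information
    # units") that reappear in the response. Real evaluators also weight
    # entities and key phrases.
    def ngrams(text):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    context_units = set()
    for chunk in context_chunks:
        context_units |= ngrams(chunk)
    if not context_units:
        return 0.0
    # Higher overlap means the response draws more of its content from the chunks.
    return len(context_units & ngrams(response)) / len(context_units)

print(utilization_score(
    ["Paris is the capital and largest city of France."],
    "Paris is the capital of France.",
))  # 0.5 on this toy example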
Why It Matters in Production LLM and Agent Systems
A high-attribution, low-utilization response is a specific failure mode worth a name: surface-level grounding. The model paraphrases one sentence from the rank-1 chunk and ignores the four other chunks that contain the actual answer. Chunk attribution passes — at least one chunk was referenced. Faithfulness can also pass — the one referenced sentence is supported. Only chunk utilization shows that the model is leaving 80% of the relevant retrieved information on the table.
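A toy per-chunk check makes the failure mode visible (plain word overlap, not FutureAGI's method; the refund chunks and the 0.5 gate are invented for illustration):

def chunk_overlap(chunk, response):
    # Fraction of a chunk's distinct words that reappear in the response.
    c, r = set(chunk.lower().split()), set(response.lower().split())
    return len(c & r) / len(c)

chunks = [
    "The refund window is 30 days from delivery.",          # rank-1: echoed
    "Opened items are refunded at 80% of purchase price.",  # ignored
    "Refunds go back to the original payment method.",      # ignored
]
response = "The refund window is 30 days from delivery."

scores = [chunk_overlap(c, response) for c in chunks]
print(any(s > 0.5 for s in scores))         # attribution-style gate: passes
print(round(sum(scores) / len(scores), 2))  # utilization-style score: low (0.38)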
The pain is concentrated on retrieval and prompt engineers. An ML engineer ships a recall-optimized retriever that returns 10 chunks per query — utilization drops because the model can comfortably attend to only 2-3. A prompt engineer rewrites instructions to be more concise, and the response gets shorter at the cost of utilization, missing the secondary facts users were asking about. A team that swaps in a smaller model sees utilization fall sharply; the smaller model can attend to the chunks, but cannot synthesize across them.
In multi-step agentic-RAG flows, chunk utilization is the signal that informs context-window discipline. An agent loop that retrieves more aggressively at every step but utilizes less per step is wasting tokens and money for no quality gain — utilization is the metric that makes that obvious.
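A sketch of what that check could look like over an agent loop; the per-step stats and the 0.05 floor are assumptions:

def wasted_retrieval_steps(step_stats, min_gain=0.05):
    # step_stats: list of (tokens_retrieved, utilization) per agent step.
    # Flags steps that retrieved more than the previous step without a
    # matching rise in utilization: tokens spent for no quality gain.
    flagged = []
    for i in range(1, len(step_stats)):
        prev_tokens, prev_util = step_stats[i - 1]
        tokens, util = step_stats[i]
        if tokens > prev_tokens and util < prev_util + min_gain:
            flagged.append(i)
    return flagged

print(wasted_retrieval_steps([(800, 0.70), (1600, 0.48), (1600, 0.52)]))  # [1]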
How FutureAGI Handles Chunk Utilization
FutureAGI’s approach is to ship two utilization evaluators because the question can be asked at two granularities. fi.evals.ChunkUtilization is the cloud-template evaluator that returns an aggregate 0-1 score with a written reason — useful as a single trend metric on dashboards. fi.evals.ContextUtilization is the local-metric variant designed around recent research on context neglect: it weights entity overlap, phrase overlap, and n-gram overlap separately so you can diagnose why utilization is low — entities present but phrasing parametric, or phrasing matched but entities mangled.
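The decomposition idea, sketched; the weights and names are illustrative, not the library's internals:

def decomposed_utilization(entity, phrase, ngram, weights=(0.5, 0.3, 0.2)):
    # Each input is a 0-1 overlap score for one signal. Keeping the parts
    # inspectable is the point: a low aggregate can be traced to its cause.
    parts = {"entity": entity, "phrase": phrase, "ngram": ngram}
    aggregate = sum(w * s for w, s in zip(weights, parts.values()))
    return aggregate, parts

# Entities present but phrasing parametric: high entity, low phrase/n-gram.
score, parts = decomposed_utilization(0.9, 0.2, 0.15)
print(round(score, 2), parts)  # 0.54 {'entity': 0.9, 'phrase': 0.2, 'ngram': 0.15}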
Concretely: a research-assistant team running on traceAI-langchain instruments retrieval and answer spans and attaches ChunkUtilization to every answer span. The Agent Command Center plots the utilization distribution by retrieval-k value. When the team raises top-k from 5 to 10 chasing recall, mean utilization drops from 0.71 to 0.49 — they have more chunks, but the model uses each one less. They roll back to k=5 and instead invest in a stronger reranker, optimizing for a high-attribution, high-utilization regime rather than a high-recall, low-utilization one.
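The same k-versus-utilization comparison can be reproduced offline from exported span data; the field names here are assumptions:

from collections import defaultdict

def mean_utilization_by_k(spans):
    # spans: dicts with assumed "retrieval_k" and "utilization" fields.
    # Grouping by top-k makes over-retrieval show up as a falling mean.
    buckets = defaultdict(list)
    for span in spans:
        buckets[span["retrieval_k"]].append(span["utilization"])
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}

print(mean_utilization_by_k([
    {"retrieval_k": 5, "utilization": 0.71},
    {"retrieval_k": 5, "utilization": 0.69},
    {"retrieval_k": 10, "utilization": 0.49},
    {"retrieval_k": 10, "utilization": 0.51},
]))  # roughly {5: 0.70, 10: 0.50}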
Unlike Ragas' context utilization, which is a single combined score, FutureAGI splits the metric into entity, phrase, and n-gram weights — that decomposition is what tells you whether to fix the retrieval, the prompt, or the model.
How to Measure or Detect It
Chunk utilization is directly measurable. Wire up:
- fi.evals.ChunkUtilization — aggregate 0-1 utilization score with a written reason.
- fi.evals.ContextUtilization — local-metric variant with entity, phrase, and n-gram weight decomposition.
- fi.evals.ContextRelevanceToResponse — complementary signal that checks relevance from the response's side.
- OTel attributes retrieval.documents and llm.output — the inputs every utilization evaluator depends on (see the sketch after this list).
- Mean utilization by retrieval-k (dashboard) — the plot that exposes diminishing returns from over-retrieval.
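A minimal sketch of recording those two attributes with the stock opentelemetry-api package; the attribute names follow the list above, and the tracer and span names are arbitrary:

from opentelemetry import trace

tracer = trace.get_tracer("rag-demo")

with tracer.start_as_current_span("answer") as span:
    # The retrieved chunks and the model response, named as the evaluators expect.
    span.set_attribute("retrieval.documents",
                       ["Paris is the capital and largest city of France."])
    span.set_attribute("llm.output", "Paris is the capital of France.")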
Minimal Python:
from fi.evals import ChunkUtilization

# Instantiate the cloud-template evaluator.
evaluator = ChunkUtilization()

# Score how much of the retrieved context the response actually uses.
result = evaluator.evaluate(
    input="What is the capital of France?",
    output="According to the provided information, Paris is the capital of France.",
    context=[
        "Paris is the capital and largest city of France.",
        "France is a country in Western Europe.",
    ],
)

print(result.score, result.reason)  # 0-1 score plus a written explanation
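In an offline evaluation run, the same score can gate a release; the 0.6 floor is an illustrative assumption:

# Fail the eval run if utilization regresses below a chosen floor.
assert result.score >= 0.6, f"Utilization regression: {result.reason}"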
Common Mistakes
- Treating chunk utilization and chunk attribution as the same metric. Attribution is Pass/Fail (used or not). Utilization is continuous (how much was used). They expose different failure modes.
- Maximizing utilization at the cost of conciseness. A 1500-word response that quotes every chunk verbatim has high utilization and terrible user experience — pair with IsConcise or a length budget.
- Reading utilization without precision. A retriever that returns 10 chunks per query inflates the denominator on utilization regressions; correct for top-k before comparing (see the sketch after this list).
- Skipping utilization on multi-fact questions. This is exactly where surface-level grounding hides — single-fact questions are easy to fully utilize, multi-fact questions need utilization to expose missed integration.
- Using utilization to debug a hallucination. Hallucinations are a faithfulness or hallucination-score problem; utilization measures something orthogonal.
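One naive way to correct for top-k before comparing, as suggested in the list above; the scaling rule and baseline are assumptions, not a standard formula:

def k_adjusted_utilization(utilization, k, baseline_k=5):
    # Scale by k / baseline_k so runs with deeper retrieval are not
    # penalized purely for having more chunks in the denominator.
    return min(1.0, utilization * (k / baseline_k))

# A k=10 run at 0.35 raw carries a similar per-chunk load
# to a k=5 run at 0.70 after adjustment.
print(k_adjusted_utilization(0.35, 10), k_adjusted_utilization(0.70, 5))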
Frequently Asked Questions
What is chunk utilization?
Chunk utilization is a 0-1 RAG metric for how effectively a model integrates retrieved chunks into its response. Higher scores mean the response draws more entities, phrases, and structure from the context.
How is chunk utilization different from chunk attribution?
Attribution is binary — did the model reference the chunks at all. Utilization is continuous — how much of each chunk did the model actually use. Use attribution as a gate, utilization as an efficiency gauge.
How do you measure chunk utilization?
FutureAGI's fi.evals.ChunkUtilization takes the retrieved context and the response and returns a numeric score reflecting how thoroughly the model integrated the chunks. ContextUtilization is the entity-and-phrase-level local-metric variant.