What Is a Corpus?
A corpus is the governed source collection that an NLP or RAG system indexes before retrieval and generation. In a RAG pipeline, it sits upstream of chunking, embedding, search, context packing, and final answer generation, yet its scope and freshness shape every production trace. FutureAGI treats the corpus as a knowledge-base input worth evaluating, because poor corpus coverage produces irrelevant context, stale answers, and grounded-looking hallucinations.
Why It Matters in Production LLM and Agent Systems
Corpus failures create confident answers from missing or outdated evidence. If a refund policy is absent, duplicated, or left in an old folder, the retriever may return the nearest available document and the model may write a fluent but unsupported answer. The failure mode is not always “no retrieval.” More often it is silent hallucination downstream of a faulty retriever, stale context, or a cross-tenant document leak caused by weak corpus boundaries.
The pain spreads across teams. Developers chase prompt bugs when the source collection is the real defect. SREs see p99 retrieval latency rise after a bulk import or reindex. Compliance teams care about policy scope, PII retention, tenant isolation, and auditability of source documents. Product teams see the end-user symptom: thumbs-down events, support escalations, and citations that look plausible but point to the wrong source.
In 2026-era agentic pipelines, a corpus is not just a pile of PDFs behind a chatbot. Agents retrieve context for planning, tool choice, contract review, onboarding flows, and human handoff summaries. One missing clause can steer a multi-step agent toward the wrong action. Unlike Ragas faithfulness, which usually checks answer support after retrieval, corpus reliability starts earlier: the system must know which source collection was searched, which version was active, and which documents were eligible for the request.
How FutureAGI Handles Corpus Quality with KnowledgeBase
FutureAGI’s approach is to treat the corpus as a versioned production dependency, not an invisible folder passed to a vector database. The anchor surface is fi.kb.KnowledgeBase, where teams create, update, delete, and manage uploaded files for a knowledge base. In a RAG workflow, that knowledge base becomes the corpus that is chunked, embedded, retrieved, and evaluated.
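A minimal sketch of that surface is shown below; create_kb and update_kb are illustrative method names, not confirmed fi.kb signatures, so check the SDK reference before relying on them.

# Sketch only: the method names below are assumptions for illustration.
from fi.kb import KnowledgeBase

kb_client = KnowledgeBase()

# Hypothetical call: register a corpus and upload its source files.
billing_kb = kb_client.create_kb(
    name="billing-support-corpus",
    file_paths=["refund_policy.md", "security_terms.pdf"],
)

# Hypothetical call: swap in the current refund policy so stale and
# current versions never coexist in the same searchable corpus.
kb_client.update_kb(kb_id=billing_kb.kb_id, file_paths=["refund_policy_v2.md"])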
Consider a support agent that answers billing and security-policy questions. An engineer uploads Markdown runbooks, PDF terms, and ticket-resolution notes into a KnowledgeBase, then connects a LangChain retriever. With traceAI-langchain, each request can carry the user query, the retriever span, the returned chunks in retrieval.documents, the prompt-token cost via llm.token_count.prompt, and the final answer. FutureAGI then scores the same trace with ContextRelevance, ContextRecall, Groundedness, and ChunkAttribution.
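A sketch of that tracing setup; the module paths follow the traceAI-langchain pattern and the project name is a placeholder, so verify both against the FutureAGI docs:

# Instrument LangChain so every request emits a trace that carries
# retrieval.documents and llm.token_count.prompt for later evals.
from fi_instrumentation import register
from traceai_langchain import LangChainInstrumentor

trace_provider = register(project_name="billing-support-agent")
LangChainInstrumentor().instrument(tracer_provider=trace_provider)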
The next action depends on the failing signal. Low ContextRecall means the corpus may not contain the expected policy, or the document was indexed under the wrong metadata. Low ContextRelevance means retrieval found documents, but not the right ones. Strong retrieval with weak Groundedness points to generation or prompt constraints. A missing ChunkAttribution result means the answer cannot be tied back to a source chunk.
Engineers can set a threshold such as “block release when ContextRelevance drops below 0.75 for the billing corpus,” create a regression eval from failed traces, and route high-risk answers to human review or a fallback response in Agent Command Center.
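A plain-Python sketch of that gate; the threshold and corpus name are examples, and scores is assumed to hold recent ContextRelevance values for billing-corpus traces:

# Block release when mean ContextRelevance on the billing corpus
# drops below the agreed threshold.
def gate_release(scores: list[float], threshold: float = 0.75) -> bool:
    mean_score = sum(scores) / len(scores)
    if mean_score < threshold:
        print(f"blocked: billing-corpus ContextRelevance {mean_score:.2f} < {threshold}")
        return False
    return True

assert gate_release([0.9, 0.8, 0.85]) is True
assert gate_release([0.6, 0.7]) is False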
How to Measure or Detect It
Measure a corpus by coverage, freshness, retrieval quality, and downstream answer support:
- ContextRelevance scores whether retrieved corpus chunks actually match the user's query intent.
- ContextRecall checks whether expected evidence from the corpus appears in the retrieved context for labeled questions.
- Groundedness detects whether the answer is supported by the retrieved corpus text.
- ChunkAttribution checks whether answer claims can be tied to specific returned chunks.
- Trace signals include empty-context rate, retrieved-chunk count, retrieval.documents, llm.token_count.prompt, stale-document hits, p99 retrieval latency, and eval-fail rate by corpus version.
- User proxies include thumbs-down rate, escalation rate, missing-citation feedback, and repeated clarification requests after sourced answers.
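A minimal check over one trace, assuming retrieved_chunks already holds the chunk texts from retrieval.documents for that request: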
from fi.evals import ContextRelevance
result = ContextRelevance().evaluate(
input="Can I cancel after renewal?",
context=retrieved_chunks,
)
print(result.score, result.reason)
Track these signals by KnowledgeBase, document collection, retriever version, and release cohort. A global average can hide a corpus-specific regression that affects only one tenant, locale, or policy domain.
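A sketch of that breakdown, assuming eval results can be exported as one row per evaluated trace:

import pandas as pd

# Hypothetical export: one row per evaluated trace.
df = pd.DataFrame(
    [
        {"kb_name": "billing", "corpus_version": "v12", "score": 0.81},
        {"kb_name": "billing", "corpus_version": "v13", "score": 0.52},
    ]
)

# A healthy global mean can hide a regression in one corpus version.
print(df.groupby(["kb_name", "corpus_version"])["score"].agg(["mean", "count"]))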
Common Mistakes
Corpus quality work fails when teams treat source material as static infrastructure. The usual mistakes are practical and preventable:
- Indexing every document with no eligibility model. RAG needs tenant, domain, date, and policy filters before retrieval, not after generation (see the sketch after this list).
- Measuring answer quality without corpus version. A passing answer means little if the trace cannot name the source collection and document snapshot.
- Mixing stale and current policy pages. The retriever may return both, and the model may merge incompatible rules into one confident answer.
- Rechunking without regression evals. Chunk boundaries can change recall, attribution, and context cost even when document text stays identical.
- Treating corpus gaps as model hallucination. If the source never contained the fact, fix ingestion, coverage, or fallback behavior first.
Before blaming the LLM, ask whether the right document was eligible, indexed, retrieved, and attributed.
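For the first mistake above, a minimal sketch of an eligibility check applied before retrieval; the metadata fields are illustrative, not a required schema:

from datetime import date

# Illustrative eligibility model: a document is searchable only when
# tenant, policy domain, and effective dates all match the request.
def eligible(meta: dict, tenant: str, domain: str, today: date) -> bool:
    return (
        meta["tenant"] == tenant
        and meta["domain"] == domain
        and meta["effective_from"] <= today
        and (meta.get("effective_to") is None or today <= meta["effective_to"])
    )

docs = [
    {"tenant": "acme", "domain": "billing",
     "effective_from": date(2024, 1, 1), "effective_to": None},
    {"tenant": "acme", "domain": "billing",
     "effective_from": date(2022, 1, 1), "effective_to": date(2023, 12, 31)},
]
print([eligible(m, "acme", "billing", date.today()) for m in docs])  # [True, False]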
Frequently Asked Questions
What is a corpus in NLP/RAG?
A corpus is the source collection of documents, messages, transcripts, code, or records indexed before retrieval and generation. It defines what a RAG system can know at request time.
How is a corpus different from a dataset?
A corpus is the source material searched by a model or retriever. A dataset is a structured collection used for training, evaluation, labeling, or analysis, and it may be derived from a corpus.
How do you measure a corpus?
FutureAGI measures corpus quality through fi.kb.KnowledgeBase traces plus ContextRelevance, ContextRecall, Groundedness, and ChunkAttribution. Track empty-context rate, stale-document hits, and eval-fail-rate by corpus version.