How to Build (and Evaluate) a PDF QA Chatbot in 2026
A PDF QA chatbot is a retrieval problem, not a generation one. Parse, chunk, hybrid retrieve, cite, evaluate retrieval before generation, bridge to live OTel spans.
Table of Contents
You shipped a PDF QA bot last quarter. It answers the simple questions. It also confidently cites page 47 of a 30-page contract, returns the indemnity clause when asked about termination, and tells a user the document says something the document does not. The team meeting blames the model. The trace says the retriever surfaced the wrong chunk and the model dutifully grounded its answer in it.
This is the failure mode most PDF QA tutorials skip. A working index.query("what is the cap on liability?") returns something; without retrieval and generation evals running on every PR, you do not know whether the something is right, or whether the regression you just shipped is the chunker, the retriever, the prompt, or the model.
The opinion this post earns: retrieval-level evaluation is the right primitive for a PDF QA chatbot, not generation evaluation alone. Generation rubrics reward grounding to whatever context the retriever fed in. If the retriever surfaced the wrong chunk, a perfectly grounded answer is still wrong, and your generation scores will tell you it shipped fine. The retriever is the chatbot. The generator is how the retriever talks.
This guide walks the build end-to-end with the eval suite baked in from step one: parse, chunk, retrieve, generate, evaluate retrieval before generation, bridge to production, close the loop. Code shaped against LlamaIndex and the Future AGI ai-evaluation SDK.
TL;DR: the build and the eval suite
| Layer | Choice in 2026 | What you evaluate |
|---|---|---|
| Parser | Unstructured, LlamaParse, pdfplumber + OCR | Structure preservation (page, section, type) |
| Chunker | Semantic boundary, not fixed-character | Chunk size, boundary integrity |
| Embedder | bge-large, bge-m3, text-embedding-3-large | Recall@k on a held-out probe set |
| Vector store | pgvector, Qdrant, Weaviate, LanceDB | Per-tenant isolation, metadata filtering |
| Retrieval | Hybrid (vector + BM25) with RRF | ContextRelevance, ChunkAttribution, ChunkUtilization |
| Generator | Frontier model, structured output, citations | Groundedness, ContextAdherence, Completeness, FactualAccuracy |
| Citation check | Deterministic string match | Citation validity (no LLM judge) |
| Tracing | LlamaIndex via traceAI | Span-attached scores on live traffic |
| Closed loop | Error Feed auto-cluster + fix | Promoted-to-dataset weekly |
The build is the easy part. The eval discipline decides whether the bot is shippable.
Why most PDF QA bots ship broken
Three failure modes show up in production traces almost every time:
- The wrong chunk gets retrieved. The user asks about Section 12; vector similarity surfaces Section 9. The generator grounds its answer in Section 9. Generation rubrics give it a passing groundedness score.
- The right chunks are retrieved but the model uses the wrong ones. Top-5 returns the relevant chunks at ranks 1 and 4; the generator latches onto rank 1 and ignores rank 4. The answer omits a critical clause. Groundedness still passes.
- A citation points at a chunk that does not contain the quoted span. The generator hallucinated the quote. Structured output passed. Schema validation passed. The user is reading a fabricated citation that looks identical to a real one.
The first two are retrieval failures masquerading as generation failures. The third is a generation failure that needs a deterministic check, not an LLM judge. None of them surface if your eval suite only runs faithfulness on the final answer. The fix is to split the eval suite into a retrieval layer and a generation layer, gate both in CI, and bridge both to production OTel traces.
Step 1: parse with structure intact
PDFs lie about being simple text. A “text” PDF can carry multi-column layout, embedded tables, footnotes, sidebars, and figure captions. A “scanned” PDF is an image and needs OCR. Most parsing bugs in production trace to lost structure: a parser that flattens a two-column page into interleaved nonsense, or drops a table into a wall of cells.
Three reasonable picks in 2026:
- Unstructured. Open-source, category-aware extraction across text, tables, figures.
- LlamaParse. Hosted, optimised for complex layouts and tables.
- pdfplumber + Tesseract. Maximum control. Text via pdfplumber, OCR via Tesseract for scanned pages.
Whichever you pick, the parser output should preserve four fields on every element: page number, section heading, element type, reading order. These four fields are the difference between chunks the retriever can use and chunks that are noise. Twenty percent of the corpus showing up as image-only PDFs is common in regulated industries; OCR quality decides the floor of your retrieval scores.
Step 2: chunk by semantic boundary
Fixed-character chunking is the lazy default. Semantic chunking that respects document structure wins by 5 to 15 points on retrieval rubrics across the domains we have measured. The chunk is the unit of retrieval and attribution; if the boundary cuts a clause in half, the citation will too.
def semantic_chunks(elements, max_tokens: int = 500):
"""Group elements by section; split sections that exceed max_tokens."""
chunks = []
current = []
current_tokens = 0
current_section = None
for el in elements:
section_changed = el.section != current_section
size_exceeded = current_tokens + el.tokens > max_tokens
if section_changed or size_exceeded:
if current:
chunks.append({
"text": "\n\n".join(e.text for e in current),
"page": current[0].page,
"section": current_section,
"element_types": list({e.type for e in current}),
})
current = []
current_tokens = 0
current_section = el.section
current.append(el)
current_tokens += el.tokens
if current:
chunks.append({
"text": "\n\n".join(e.text for e in current),
"page": current[0].page,
"section": current_section,
"element_types": list({e.type for e in current}),
})
return chunks
Three rules decide whether the chunker earns its keep:
- Keep tables and figure captions separate. A table chunk should embed on the caption plus header row, not the full table flattened into a wall of cells.
- Carry header context. Prefix each chunk with its section heading; the embedding then encodes “this is from the Liability section,” not just the body text.
- Cap at the embedding model’s window minus headroom for the header prefix that travels with the chunk.
Evaluate the chunks themselves before you embed them. Distribution of chunk sizes, fraction that span section boundaries, fraction that include a table or caption — these are cheap to compute and they predict downstream retrieval rubric scores tightly. The advanced chunking guide covers chunk-level diagnostics in depth.
Step 3: hybrid retrieval and why pure vector loses
Pure vector search loses on identifier-heavy queries. Contract numbers, defined terms, regulatory clause IDs, statute references, drug names, part numbers — these queries dominate enterprise PDF traffic, and they are exactly where embeddings underperform exact-token matching. Pure BM25 loses the moment a user paraphrases. Hybrid retrieval — vector plus BM25 fused with reciprocal rank fusion — wins both.
def hybrid_search(conn, tenant_id: str, query: str, k: int = 10):
q_embed = embed(query)
with conn.cursor() as cur:
cur.execute(
"""
WITH vec AS (
SELECT id, 1 - (embedding <=> %s::vector) AS score, 'vec' AS source
FROM doc_chunks
WHERE tenant_id = %s
ORDER BY embedding <=> %s::vector
LIMIT %s
),
bm25 AS (
SELECT id, ts_rank_cd(tsv, plainto_tsquery(%s)) AS score, 'bm25' AS source
FROM doc_chunks
WHERE tenant_id = %s AND tsv @@ plainto_tsquery(%s)
ORDER BY score DESC
LIMIT %s
)
SELECT id, source, score FROM vec
UNION ALL
SELECT id, source, score FROM bm25
""",
(q_embed, tenant_id, q_embed, k * 2, query, tenant_id, query, k * 2),
)
return reciprocal_rank_fusion(cur.fetchall(), k=k)
RRF fuses the two ranked lists by summing 1 / (60 + rank) per result; 60 is the default from the original paper, tune per workload. Per-tenant namespaces (a tenant_id filter on every query) are non-negotiable for multi-customer deployments; cross-tenant leaks are a configuration class, not a model class. A query rewrite step (LlamaIndex’s HyDE or a one-shot rewrite prompt) lifts vector recall on under-specified queries by 3 to 8 points; score both the rewrite and the final answer.
Step 4: generate with citation enforcement
The generator is where hallucinations creep in even when retrieval did its job. Three controls compound:
- Structured output with per-claim citations.
- Deterministic citation validation against retrieval context.
- Refusal path when retrieval is empty or weak.
from pydantic import BaseModel
class Citation(BaseModel):
chunk_id: str
quoted_span: str
class Answer(BaseModel):
response: str
citations: list[Citation]
confidence: float # 0.0 to 1.0
PROMPT = """Answer the question using only the context below. Cite the
chunk_id and quote the exact span supporting each claim. If the context
does not answer the question, say so and set confidence below 0.3.
CONTEXT:
{context}
QUESTION: {question}"""
def generate(question: str, chunks: list[dict]) -> Answer:
context = "\n\n".join(f"[chunk_id={c['id']}]\n{c['text']}" for c in chunks)
resp = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": PROMPT.format(context=context, question=question)}],
response_format={"type": "json_schema", "json_schema": Answer.model_json_schema()},
)
answer = Answer.model_validate_json(resp.choices[0].message.content)
validate_citations(answer, chunks)
return answer
def validate_citations(answer: Answer, chunks: list[dict]):
chunk_map = {c["id"]: c["text"] for c in chunks}
for cite in answer.citations:
if cite.chunk_id not in chunk_map:
raise ValueError(f"unknown chunk_id: {cite.chunk_id}")
if cite.quoted_span not in chunk_map[cite.chunk_id]:
raise ValueError(f"quoted span not found: {cite.quoted_span!r}")
A failed citation validation should not ship to the user. Retry the generation with a stricter prompt that names the failed citation, or fall back to a refusal. Don’t paper over a fabricated quote.
Citation validity is the one rubric in this whole stack you do not need an LLM for. String match plus fuzzy tolerance is enough; the cheaper the check, the more often you can run it.
Step 5: the eval suite, split by layer
Most teams write three rubrics on the final answer and call it eval. Then they regress for three quarters because they cannot tell whether the regression is the retriever, the chunker, the prompt, or the model. Split the rubric set by layer and the bisect becomes trivial.
Retrieval rubrics — “did the retriever do its job?”:
- ContextRelevance. Are retrieved chunks relevant to the question?
- ChunkAttribution. Which retrieved chunks the answer actually used; flags retrieval surfacing irrelevant context.
- ChunkUtilization. What fraction of retrieved-and-relevant content the answer used; flags the generator ignoring good chunks.
Generation rubrics — “given the chunks we surfaced, did the generator do its job?”:
- Groundedness. Are claims supported by the retrieved context?
- ContextAdherence. Did the generator stay inside context, or inject world knowledge?
- Completeness. Did the answer cover what the question asked?
- FactualAccuracy. Are asserted facts correct against ground truth?
Deterministic floor: citation validity. Every cited span exists in retrieval context. String match, no LLM judge.
Wire all three families into a CI fixture against the Future AGI ai-evaluation SDK; RAG templates all read input, output, and context.
from fi.evals import Evaluator
from fi.evals.templates import (
Groundedness, ContextAdherence, ContextRelevance,
Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy,
)
from fi.testcases import TestCase
evaluator = Evaluator() # FI_API_KEY / FI_SECRET_KEY from env
RUBRICS = {
# retrieval
"context_relevance": (ContextRelevance(), 0.80),
"chunk_attribution": (ChunkAttribution(), 0.75),
"chunk_utilization": (ChunkUtilization(), 0.70),
# generation
"groundedness": (Groundedness(), 0.85),
"context_adherence": (ContextAdherence(), 0.85),
"completeness": (Completeness(), 0.75),
"factual_accuracy": (FactualAccuracy(), 0.85),
}
def test_pdf_qa(eval_dataset):
failures = []
for ex in eval_dataset:
chunks = hybrid_search(conn, ex.tenant_id, ex.question, k=10)
ctx = "\n\n".join(c["text"] for c in chunks)
answer = generate(ex.question, chunks)
tc = TestCase(input=ex.question, output=answer.response, context=ctx)
for name, (template, floor) in RUBRICS.items():
score = evaluator.evaluate(eval_templates=[template], inputs=[tc]) \
.eval_results[0].metrics[0].value
if score < floor:
failures.append((ex.id, name, score))
# Citation validity is deterministic; runs separately.
try:
validate_citations(answer, chunks)
except ValueError as e:
failures.append((ex.id, "citation_validity", str(e)))
assert not failures, f"{len(failures)} rubric failures: {failures[:5]}"
Three habits separate a working CI gate from theatre. Set per-rubric thresholds. A 2-point drop from the trailing 7-day baseline fails the PR; an absolute floor catches catastrophic regressions. Scope by route. A PR touching the contract bot prompt does not rerun the SOP bot suite. Diff against a moving baseline. Models drift; the baseline drifts with them; the gate catches regressions relative to the moving truth.
Dataset shape: 100 to 200 cases sampled from real user questions, hand-annotated where needed, covering factual lookup, multi-hop reasoning, table extraction, refusal scenarios, ambiguous queries (the right move is clarification), and edge cases (footnotes, scanned pages, captions). Grow weekly by promoting failing production traces. The synthetic test data approach covers scaling without losing signal.
Step 6: bridge the same rubrics to production
The CI gate catches the regressions you can think of. Production catches everything else. The same rubrics should run as span-attached scorers against live traces.
traceAI (Apache 2.0) ships a LlamaIndex instrumentor that emits OpenTelemetry spans for every retriever, query engine, and LLM call without manual span creation. Pluggable semantic conventions at register() time (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY) ingest into Phoenix or Traceloop without re-instrumenting; 14 span kinds include a first-class RETRIEVER, which is what makes per-retriever scoring practical.
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_llamaindex import LlamaIndexInstrumentor
trace_provider = register(
project_type=ProjectType.OBSERVE,
project_name="pdf_qa_prod",
)
LlamaIndexInstrumentor().instrument(tracer_provider=trace_provider)
Attach the same Evaluator rubrics as span-attached scorers via EvalTag; the verdict lives on the trace next to latency, model, and chunk IDs. Sample 5 to 10 percent of production traffic for online scoring; alarm on a 2 to 5 point sustained drop in rolling-mean per rubric per route. The instrumentor adds no measurable latency; rubric scoring runs out-of-band.
Drift between offline pass and online drop is a quality signal of its own. Track per-rubric delta between CI baseline and production rolling mean; the gap tells you how representative your dataset is.
Step 7: close the loop
The loop is what makes the playbook compound. Without it, every incident produces a one-off fix and the team writes the same regression twice.
Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every trace failure into a named issue. A Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, 8 span-tools including read_span, get_children, submit_finding) reads the failing trace, writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score. A Haiku sub-agent summarises spans over 3000 characters; prompt-cache hit ratio sits near 90 percent.
Two patterns close the loop:
- Fixes feed self-improving evaluators. The
immediate_fixfeeds back into the Platform’s self-improving evaluators so the rubric ages with your product instead of decaying. - Promote to dataset. From each named issue, an engineer promotes representative traces into the eval set. The next PR touching the offending path has to clear the new entries.
The dataset ratchets stronger; the CI gate catches more regressions every quarter. Teams whose eval scores trend up are the teams whose closed loop is the default, not the project.
Three deliberate tradeoffs
- Splitting retrieval and generation rubrics costs eval budget. Seven rubrics on every CI run is more LLM-as-judge calls than three on the final answer. The payoff is debuggable regressions. Future AGI’s classifier-backed evals run at lower per-eval cost than Galileo Luna-2, which makes weekly full-dataset reruns the default.
- Hybrid retrieval adds setup. A BM25 index alongside vector, RRF fusion, and per-store config. The lift is 5 to 10 points on identifier-heavy queries. New deployments can start with vector and add BM25 once production traces show the failure mode.
- Citation enforcement adds latency. Structured output plus per-claim validation costs roughly 10 to 20 percent latency over freeform. The payoff is no fabricated citations reach users. Worth it for compliance-sensitive domains; optional for casual workloads.
How Future AGI ships this
Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals. Graduate to the Platform when you want self-improving evals authored by an in-product agent.
- ai-evaluation SDK (Apache 2.0):
from fi.evals import Evaluator+evaluator.evaluate(eval_templates=[...], inputs=[TestCase(...)]). Seven RAG-specificEvalTemplateclasses (Groundedness,ContextAdherence,ContextRelevance,Completeness,ChunkAttribution,ChunkUtilization,FactualAccuracy) plus 60+ total. 13 guardrail backends (9 open-weight including LLAMAGUARD_3_8B, QWEN3GUARD_8B, GRANITE_GUARDIAN_8B). 8 sub-10ms Scanners. Four distributed runners. - Future AGI Platform: self-improving evaluators tuned by thumbs up/down feedback; in-product authoring agent generates rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- traceAI (Apache 2.0): LlamaIndex instrumentor via
LlamaIndexInstrumentor().instrument(...). 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), C#. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY). 14 span kinds includingRETRIEVER. 62 built-in evals viaEvalTag. - Error Feed (inside the eval stack): HDBSCAN clustering + Sonnet 4.5 Judge writes the
immediate_fix; fixes feed the Platform’s self-improving evaluators. - agent-opt (Apache 2.0): six optimisers (
RandomSearchOptimizer,BayesianSearchOptimizer,MetaPromptOptimizer,ProTeGi,GEPAOptimizer,PromptWizardOptimizer) over heuristics, LLM judge, and 60+ rubrics. - Agent Command Center: 17 MB Go binary self-hosts in your VPC. 20+ providers via six native adapters plus OpenAI-compatible presets. RBAC, SOC 2 Type II, HIPAA, GDPR, CCPA certified (ISO/IEC 27001 in active audit).
Ready to evaluate your first PDF QA bot? Wire ContextRelevance, ChunkAttribution, Groundedness, and citation validity into a pytest fixture this afternoon against the ai-evaluation SDK, then add the LlamaIndex traceAI instrumentor when production traces start asking questions the CI gate missed.
Related reading
Frequently asked questions
What's the right architecture for a PDF QA chatbot in 2026?
What chunking strategy works best for PDFs?
Should I use hybrid search (vector + BM25)?
How do I prevent hallucinated citations?
What eval set should I build for a PDF QA chatbot?
Why evaluate retrieval before generation?
How does Future AGI evaluate PDF QA chatbots?
RAG eval in CI/CD without the theatre: the cheap-fast-significant triangle, statistical gating, sharded parallelism, classifier cascades, production bridge.
Your eval set is a snapshot. Production is a river. The six drift modes that age every eval set the day it ships, and the trace-as-eval-surface loop that closes the gap.
Building an LLM eval framework is a one-week project and a one-year maintenance burden. The eight components, the honest cost map, and what to build vs buy.