
How to Build (and Evaluate) a PDF QA Chatbot in 2026

A PDF QA chatbot is a retrieval problem, not a generation one. Parse, chunk, hybrid retrieve, cite, evaluate retrieval before generation, bridge to live OTel spans.


You shipped a PDF QA bot last quarter. It answers the simple questions. It also confidently cites page 47 of a 30-page contract, returns the indemnity clause when asked about termination, and tells a user the document says something the document does not. The team meeting blames the model. The trace says the retriever surfaced the wrong chunk and the model dutifully grounded its answer in it.

This is the failure mode most PDF QA tutorials skip. A working index.query("what is the cap on liability?") returns something; without retrieval and generation evals running on every PR, you do not know whether the something is right, or whether the regression you just shipped is the chunker, the retriever, the prompt, or the model.

The opinion this post earns: retrieval-level evaluation is the right primitive for a PDF QA chatbot, not generation evaluation alone. Generation rubrics reward grounding to whatever context the retriever fed in. If the retriever surfaced the wrong chunk, a perfectly grounded answer is still wrong, and your generation scores will tell you it shipped fine. The retriever is the chatbot. The generator is how the retriever talks.

This guide walks the build end-to-end with the eval suite baked in from step one: parse, chunk, retrieve, generate, evaluate retrieval before generation, bridge to production, close the loop. The code is shaped against LlamaIndex and the Future AGI ai-evaluation SDK.

TL;DR: the build and the eval suite

| Layer | Choice in 2026 | What you evaluate |
|---|---|---|
| Parser | Unstructured, LlamaParse, pdfplumber + OCR | Structure preservation (page, section, type) |
| Chunker | Semantic boundary, not fixed-character | Chunk size, boundary integrity |
| Embedder | bge-large, bge-m3, text-embedding-3-large | Recall@k on a held-out probe set |
| Vector store | pgvector, Qdrant, Weaviate, LanceDB | Per-tenant isolation, metadata filtering |
| Retrieval | Hybrid (vector + BM25) with RRF | ContextRelevance, ChunkAttribution, ChunkUtilization |
| Generator | Frontier model, structured output, citations | Groundedness, ContextAdherence, Completeness, FactualAccuracy |
| Citation check | Deterministic string match | Citation validity (no LLM judge) |
| Tracing | LlamaIndex via traceAI | Span-attached scores on live traffic |
| Closed loop | Error Feed auto-cluster + fix | Promoted-to-dataset weekly |

The build is the easy part. The eval discipline decides whether the bot is shippable.

Why most PDF QA bots ship broken

Three failure modes show up in production traces almost every time:

  • The wrong chunk gets retrieved. The user asks about Section 12; vector similarity surfaces Section 9. The generator grounds its answer in Section 9. Generation rubrics give it a passing groundedness score.
  • The right chunks are retrieved but the model uses only some of them. Top-5 returns the relevant chunks at ranks 1 and 4; the generator latches onto rank 1 and ignores rank 4. The answer omits a critical clause. Groundedness still passes.
  • A citation points at a chunk that does not contain the quoted span. The generator hallucinated the quote. Structured output passed. Schema validation passed. The user is reading a fabricated citation that looks identical to a real one.

The first two are retrieval failures masquerading as generation failures. The third is a generation failure that needs a deterministic check, not an LLM judge. None of them surface if your eval suite only runs faithfulness on the final answer. The fix is to split the eval suite into a retrieval layer and a generation layer, gate both in CI, and bridge both to production OTel traces.

Step 1: parse with structure intact

PDFs lie about being simple text. A “text” PDF can carry multi-column layout, embedded tables, footnotes, sidebars, and figure captions. A “scanned” PDF is an image and needs OCR. Most parsing bugs in production trace to lost structure: a parser that flattens a two-column page into interleaved nonsense, or drops a table into a wall of cells.

Three reasonable picks in 2026:

  • Unstructured. Open-source, category-aware extraction across text, tables, figures.
  • LlamaParse. Hosted, optimised for complex layouts and tables.
  • pdfplumber + Tesseract. Maximum control. Text via pdfplumber, OCR via Tesseract for scanned pages.

Whichever you pick, the parser output should preserve four fields on every element: page number, section heading, element type, reading order. These four fields are the difference between chunks the retriever can use and chunks that are noise. Twenty percent of the corpus showing up as image-only PDFs is common in regulated industries; OCR quality decides the floor of your retrieval scores.
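
One way to hold the parser to those four fields is to normalise its output into a single record type before anything downstream touches it. The sketch below is illustrative only: the Element class, its field names, and the whitespace-based token count are assumptions, not the output schema of Unstructured, LlamaParse, or pdfplumber.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical normalised parser output; field names are assumptions."""
    text: str
    page: int      # 1-based page number
    section: str   # nearest section heading, "" if the parser found none
    type: str      # e.g. "text", "title", "table", "caption"
    order: int     # reading order within the page

    @property
    def tokens(self) -> int:
        # Crude whitespace count; swap in the embedder's tokenizer for real use.
        return len(self.text.split())


def in_reading_order(elements):
    """Sort elements by page, then by reading order within the page."""
    return sorted(elements, key=lambda e: (e.page, e.order))


def missing_structure(elements):
    """Return elements that lost their section heading or element type."""
    return [e for e in elements if not e.section or not e.type]
```

The fraction of elements returned by missing_structure is a cheap parser-level health number to track per document source.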

Step 2: chunk by semantic boundary

Fixed-character chunking is the lazy default. Semantic chunking that respects document structure wins by 5 to 15 points on retrieval rubrics across the domains we have measured. The chunk is the unit of retrieval and attribution; if the boundary cuts a clause in half, the citation will too.

def semantic_chunks(elements, max_tokens: int = 500):
    """Group elements by section; split sections that exceed max_tokens."""
    chunks = []
    current = []
    current_tokens = 0
    current_section = None

    for el in elements:
        section_changed = el.section != current_section
        size_exceeded = current_tokens + el.tokens > max_tokens
        if section_changed or size_exceeded:
            if current:
                chunks.append({
                    "text": "\n\n".join(e.text for e in current),
                    "page": current[0].page,
                    "section": current_section,
                    "element_types": list({e.type for e in current}),
                })
            current = []
            current_tokens = 0
            current_section = el.section
        current.append(el)
        current_tokens += el.tokens

    if current:
        chunks.append({
            "text": "\n\n".join(e.text for e in current),
            "page": current[0].page,
            "section": current_section,
            "element_types": list({e.type for e in current}),
        })
    return chunks

Three rules decide whether the chunker earns its keep:

  1. Keep tables and figure captions separate. A table chunk should embed on the caption plus header row, not the full table flattened into a wall of cells.
  2. Carry header context. Prefix each chunk with its section heading; the embedding then encodes “this is from the Liability section,” not just the body text.
  3. Cap at the embedding model’s window minus headroom for the header prefix that travels with the chunk.
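
Rules 2 and 3 can be sketched as two small helpers over the chunk dicts produced above. The "Section: " prefix format, the 512-token window, the 32-token headroom, and the whitespace token count are all assumptions chosen for illustration; use your embedder's real window and tokenizer.

```python
def with_header(chunk: dict) -> str:
    """Text to embed: the section heading (if any) followed by the chunk body."""
    section = chunk.get("section")
    if not section:
        return chunk["text"]
    return f"Section: {section}\n\n{chunk['text']}"


def over_budget(chunks, window: int = 512, headroom: int = 32):
    """Return chunks whose body leaves no room for the header prefix.

    Token counts are approximated by whitespace splitting (an assumption).
    """
    budget = window - headroom
    return [c for c in chunks if len(c["text"].split()) > budget]
```

Flagging over-budget chunks, as opposed to truncating them silently, keeps the decision about where to re-split in the chunker.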

Evaluate the chunks themselves before you embed them. Distribution of chunk sizes, fraction that span section boundaries, fraction that include a table or caption — these are cheap to compute and they predict downstream retrieval rubric scores tightly. The advanced chunking guide covers chunk-level diagnostics in depth.
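
A minimal sketch of such chunk-level diagnostics, assuming chunk dicts with the "text" and "element_types" keys used earlier. The element type names "table" and "caption" and the whitespace token count are assumptions. The section-spanning fraction is not computed here because these chunk dicts carry a single section label; measuring it would need a per-element section list.

```python
from statistics import median


def chunk_diagnostics(chunks, max_tokens: int = 500) -> dict:
    """Summarise chunk sizes and table handling before embedding."""
    n = len(chunks)
    if n == 0:
        return {"count": 0}

    sizes = [len(c["text"].split()) for c in chunks]  # whitespace tokens
    has_table = ["table" in c["element_types"] for c in chunks]
    # A table chunk mixed with anything other than its caption breaks rule 1.
    table_mixed = [
        t and bool(set(c["element_types"]) - {"table", "caption"})
        for t, c in zip(has_table, chunks)
    ]
    return {
        "count": n,
        "min_tokens": min(sizes),
        "median_tokens": median(sizes),
        "max_tokens": max(sizes),
        "frac_over_cap": sum(s > max_tokens for s in sizes) / n,
        "frac_with_table": sum(has_table) / n,
        "frac_table_mixed": sum(table_mixed) / n,
    }
```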

Step 3: hybrid retrieval and why pure vector loses

Pure vector search loses on identifier-heavy queries. Contract numbers, defined terms, regulatory clause IDs, statute references, drug names, part numbers — these queries dominate enterprise PDF traffic, and they are exactly where embeddings underperform exact-token matching. Pure BM25 loses the moment a user paraphrases. Hybrid retrieval — vector plus BM25 fused with reciprocal rank fusion — wins both.

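def hybrid_search(conn, tenant_id: str, query: str, k: int = 10):
    """Run the vector and lexical legs in one round trip, then fuse with RRF."""
    q_embed = embed(query)  # embed() is the embedding call from your own stack
    with conn.cursor() as cur:
        cur.execute(
            """
            WITH vec AS (
              -- <=> is pgvector cosine distance; 1 - distance gives similarity
              SELECT id, 1 - (embedding <=> %s::vector) AS score, 'vec' AS source
              FROM doc_chunks
              WHERE tenant_id = %s
              ORDER BY embedding <=> %s::vector
              LIMIT %s
            ),
            bm25 AS (
              -- ts_rank_cd is Postgres full-text ranking, standing in for BM25 here
              SELECT id, ts_rank_cd(tsv, plainto_tsquery(%s)) AS score, 'bm25' AS source
              FROM doc_chunks
              WHERE tenant_id = %s AND tsv @@ plainto_tsquery(%s)
              ORDER BY score DESC
              LIMIT %s
            )
            SELECT id, source, score FROM vec
            UNION ALL
            SELECT id, source, score FROM bm25
            """,
            (q_embed, tenant_id, q_embed, k * 2, query, tenant_id, query, k * 2),
        )
        # Rows are (id, source, score). k here is the number of fused results to
        # keep, not the RRF constant. Hydrate chunk text by id before generation.
        return reciprocal_rank_fusion(cur.fetchall(), k=k)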

RRF fuses the two ranked lists by summing 1 / (60 + rank) per result; 60 is the default from the original paper, and it is worth tuning per workload. Per-tenant namespaces (a tenant_id filter on every query) are non-negotiable for multi-customer deployments; cross-tenant leaks are a configuration bug, not a model bug. A query rewrite step (LlamaIndex’s HyDE or a one-shot rewrite prompt) lifts vector recall on under-specified queries by 3 to 8 points; score both the rewrite and the final answer.
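
The fusion helper called in the query above is not defined in this post. A minimal sketch, assuming rows shaped as (id, source, score) tuples and 1-based ranks, could look like the following. The rrf_k parameter name is an assumption, kept separate from k (the number of results to return) to avoid confusing the two.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rows, k: int = 10, rrf_k: int = 60):
    """Fuse per-source rankings; returns [(chunk_id, fused_score), ...]."""
    by_source = defaultdict(list)
    for chunk_id, source, score in rows:
        by_source[source].append((chunk_id, score))

    fused = defaultdict(float)
    for ranked in by_source.values():
        # UNION ALL does not guarantee order, so re-rank each source by score.
        ranked.sort(key=lambda r: r[1], reverse=True)
        for rank, (chunk_id, _) in enumerate(ranked, start=1):
            fused[chunk_id] += 1.0 / (rrf_k + rank)

    top = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return top[:k]
```

Only ranks enter the fused score, which is why the cosine similarity and the full-text rank never need to be put on a common scale. The helper returns ids and scores only, so the caller still has to load chunk text for the winning ids.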

Step 4: generate with citation enforcement

The generator is where hallucinations creep in even when retrieval did its job. Three controls compound:

  1. Structured output with per-claim citations.
  2. Deterministic citation validation against retrieval context.
  3. Refusal path when retrieval is empty or weak.

from pydantic import BaseModel

class Citation(BaseModel):
    chunk_id: str
    quoted_span: str

class Answer(BaseModel):
    response: str
    citations: list[Citation]
    confidence: float  # 0.0 to 1.0

PROMPT = """Answer the question using only the context below. Cite the
chunk_id and quote the exact span supporting each claim. If the context
does not answer the question, say so and set confidence below 0.3.

CONTEXT:
{context}

QUESTION: {question}"""

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate(question: str, chunks: list[dict]) -> Answer:
    context = "\n\n".join(f"[chunk_id={c['id']}]\n{c['text']}" for c in chunks)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(context=context, question=question)}],
        # json_schema must be an object carrying a name and the schema itself
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "answer", "schema": Answer.model_json_schema()},
        },
    )
    answer = Answer.model_validate_json(resp.choices[0].message.content)
    validate_citations(answer, chunks)  # raises ValueError on a fabricated citation
    return answer

def validate_citations(answer: Answer, chunks: list[dict]):
    chunk_map = {c["id"]: c["text"] for c in chunks}
    for cite in answer.citations:
        if cite.chunk_id not in chunk_map:
            raise ValueError(f"unknown chunk_id: {cite.chunk_id}")
        if cite.quoted_span not in chunk_map[cite.chunk_id]:
            raise ValueError(f"quoted span not found: {cite.quoted_span!r}")

A failed citation validation should not ship to the user. Retry the generation with a stricter prompt that names the failed citation, or fall back to a refusal. Don’t paper over a fabricated quote.
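
The retry-then-refuse path can be sketched as a thin wrapper. To keep the example self-contained, the generator and validator are passed in as callables. The three-argument generate_fn (question, chunks, feedback) is a hypothetical variant of the generate function above that can append feedback to its prompt; it is not an API from any library.

```python
def answer_with_retry(question, chunks, generate_fn, validate_fn, max_attempts: int = 2):
    """Return a validated answer, or None so the caller can send a refusal."""
    feedback = None
    for _ in range(max_attempts):
        answer = generate_fn(question, chunks, feedback)
        try:
            validate_fn(answer, chunks)
            return answer
        except ValueError as err:
            # Name the failed citation in the next prompt instead of retrying blind.
            feedback = (
                f"The previous attempt failed citation validation: {err}. "
                "Quote spans verbatim from the cited chunk."
            )
    return None
```

Returning None instead of the last invalid answer makes it hard to ship a fabricated quote by accident; the caller has to choose the refusal message explicitly.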

Citation validity is the one rubric in this whole stack you do not need an LLM for. String match plus fuzzy tolerance is enough; the cheaper the check, the more often you can run it.
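
The validator shown earlier uses an exact substring test, which can reject honest quotes that differ only in whitespace, ligatures, or quote style after PDF extraction. One possible shape for the fuzzy tolerance, using only the standard library, is sketched below. The normalisation steps and the 0.9 coverage threshold are assumptions to tune, not values from this post.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def _normalise(text: str) -> str:
    """Fold ligatures and non-breaking spaces, straighten quotes, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()


def span_supported(quoted_span: str, chunk_text: str, min_coverage: float = 0.9) -> bool:
    """True if the quote appears in the chunk, exactly or nearly so."""
    span, text = _normalise(quoted_span), _normalise(chunk_text)
    if not span:
        return False
    if span in text:
        return True
    # Fallback: the longest contiguous run shared with the chunk must cover
    # at least min_coverage of the quoted span.
    matcher = SequenceMatcher(None, span, text, autojunk=False)
    match = matcher.find_longest_match(0, len(span), 0, len(text))
    return match.size / len(span) >= min_coverage
```

The coverage fallback is a tradeoff: it forgives a stray trailing character, but a changed digit at the very end of a long quote could slip through. For compliance-sensitive corpora, keeping only the normalised exact match is the stricter choice.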

Step 5: the eval suite, split by layer

Most teams write three rubrics on the final answer and call it eval. Then they regress for three quarters because they cannot tell whether the regression is the retriever, the chunker, the prompt, or the model. Split the rubric set by layer and the bisect becomes trivial.

Retrieval rubrics — “did the retriever do its job?”:

  • ContextRelevance. Are retrieved chunks relevant to the question?
  • ChunkAttribution. Which retrieved chunks the answer actually used; flags retrieval surfacing irrelevant context.
  • ChunkUtilization. What fraction of retrieved-and-relevant content the answer used; flags the generator ignoring good chunks.

Generation rubrics — “given the chunks we surfaced, did the generator do its job?”:

  • Groundedness. Are claims supported by the retrieved context?
  • ContextAdherence. Did the generator stay inside context, or inject world knowledge?
  • Completeness. Did the answer cover what the question asked?
  • FactualAccuracy. Are asserted facts correct against ground truth?

Deterministic floor: citation validity. Every cited span exists in retrieval context. String match, no LLM judge.

Wire all three families into a CI fixture against the Future AGI ai-evaluation SDK; RAG templates all read input, output, and context.

from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextAdherence, ContextRelevance,
    Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy,
)
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

RUBRICS = {
    # retrieval
    "context_relevance": (ContextRelevance(), 0.80),
    "chunk_attribution": (ChunkAttribution(), 0.75),
    "chunk_utilization": (ChunkUtilization(), 0.70),
    # generation
    "groundedness": (Groundedness(), 0.85),
    "context_adherence": (ContextAdherence(), 0.85),
    "completeness": (Completeness(), 0.75),
    "factual_accuracy": (FactualAccuracy(), 0.85),
}

def test_pdf_qa(eval_dataset):
    failures = []
    for ex in eval_dataset:
        chunks = hybrid_search(conn, ex.tenant_id, ex.question, k=10)
        ctx = "\n\n".join(c["text"] for c in chunks)
        answer = generate(ex.question, chunks)
        tc = TestCase(input=ex.question, output=answer.response, context=ctx)

        for name, (template, floor) in RUBRICS.items():
            score = evaluator.evaluate(eval_templates=[template], inputs=[tc]) \
                .eval_results[0].metrics[0].value
            if score < floor:
                failures.append((ex.id, name, score))

        # Citation validity is deterministic; runs separately.
        try:
            validate_citations(answer, chunks)
        except ValueError as e:
            failures.append((ex.id, "citation_validity", str(e)))

    assert not failures, f"{len(failures)} rubric failures: {failures[:5]}"

Three habits separate a working CI gate from theatre:

  • Set per-rubric thresholds. A 2-point drop from the trailing 7-day baseline fails the PR; an absolute floor catches catastrophic regressions.
  • Scope by route. A PR touching the contract bot prompt does not rerun the SOP bot suite.
  • Diff against a moving baseline. Models drift; the baseline drifts with them; the gate catches regressions relative to the moving truth.
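
The fixture above only enforces absolute floors. A gate that also checks the drop against a trailing baseline could be sketched as follows, assuming scores on a 0 to 1 scale (so a 2-point drop is 0.02) and a baseline dict of per-rubric trailing means that you compute elsewhere. The function name and the failure tuple shape are illustrative.

```python
def gate(scores: dict, baseline: dict, floors: dict, max_drop: float = 0.02):
    """Return (rubric, reason, score, reference) for every rubric that fails."""
    failures = []
    for rubric, score in scores.items():
        floor = floors.get(rubric)
        if floor is not None and score < floor:
            failures.append((rubric, "below_floor", score, floor))
        base = baseline.get(rubric)
        if base is not None and base - score > max_drop:
            failures.append((rubric, "regressed", score, base))
    return failures
```

Keeping the two reasons separate in the output tells the PR author whether they tripped a catastrophic floor or a relative regression, and those usually call for different responses.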

Dataset shape: 100 to 200 cases sampled from real user questions, hand-annotated where needed, covering factual lookup, multi-hop reasoning, table extraction, refusal scenarios, ambiguous queries (the right move is clarification), and edge cases (footnotes, scanned pages, captions). Grow weekly by promoting failing production traces. The synthetic test data approach covers scaling without losing signal.
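
A case record that carries the expected source chunks makes a label-based retrieval check possible alongside the judged rubrics. The sketch below assumes each case stores a question, an expected answer, expected source chunk ids, and a tenant id; the class and function names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    question: str
    expected_answer: str
    expected_source_chunks: list = field(default_factory=list)
    tenant_id: str = ""


def recall_at_k(retrieved_ids, expected_ids, k: int = 10):
    """Fraction of expected chunks found in the top k retrieved ids.

    Returns None when no chunks are expected (for example refusal cases),
    so those cases can be excluded from the average instead of counted as 0.
    """
    expected = set(expected_ids)
    if not expected:
        return None
    hits = expected & set(retrieved_ids[:k])
    return len(hits) / len(expected)
```

Recall@k needs no judge model, so it can run on every commit even when the LLM-judged rubrics are scoped to changed routes.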

Step 6: bridge the same rubrics to production

The CI gate catches the regressions you can think of. Production catches everything else. The same rubrics should run as span-attached scorers against live traces.

traceAI (Apache 2.0) ships a LlamaIndex instrumentor that emits OpenTelemetry spans for every retriever, query engine, and LLM call without manual span creation. Semantic conventions are pluggable at register() time (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY), so the spans can be ingested by Phoenix or Traceloop without re-instrumenting. The 14 span kinds include a first-class RETRIEVER, which is what makes per-retriever scoring practical.

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_llamaindex import LlamaIndexInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="pdf_qa_prod",
)
LlamaIndexInstrumentor().instrument(tracer_provider=trace_provider)

Attach the same Evaluator rubrics as span-attached scorers via EvalTag; the verdict lives on the trace next to latency, model, and chunk IDs. Sample 5 to 10 percent of production traffic for online scoring; alarm on a 2 to 5 point sustained drop in rolling-mean per rubric per route. The instrumentor adds no measurable latency; rubric scoring runs out-of-band.
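
The sampling and alarm logic is independent of any vendor SDK and can be sketched with the standard library. Hashing the trace id gives a deterministic sample, so a given trace is either always scored or never scored. The class name, window size, minimum sample count, and the 0 to 1 score scale (a 3-point drop is 0.03) are assumptions for illustration.

```python
import hashlib
from collections import deque


def should_score(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select roughly sample_rate of traces by id hash."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


class RollingAlarm:
    """Fires when the rolling mean sits more than max_drop below the baseline."""

    def __init__(self, baseline: float, max_drop: float = 0.03,
                 window: int = 200, min_samples: int = 50):
        self.baseline = baseline
        self.max_drop = max_drop
        self.min_samples = min_samples
        self.scores = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False  # too few samples to call it sustained
        mean = sum(self.scores) / len(self.scores)
        return self.baseline - mean > self.max_drop
```

One RollingAlarm per rubric per route keeps a regression in the contract bot from being averaged away by healthy traffic elsewhere.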

Drift between offline pass and online drop is a quality signal of its own. Track per-rubric delta between CI baseline and production rolling mean; the gap tells you how representative your dataset is.

Step 7: close the loop

The loop is what makes the playbook compound. Without it, every incident produces a one-off fix and the team writes the same regression twice.

Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every trace failure into a named issue. A Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, 8 span-tools including read_span, get_children, submit_finding) reads the failing trace, writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score. A Haiku sub-agent summarises spans over 3000 characters; prompt-cache hit ratio sits near 90 percent.

Two patterns close the loop:

  • Fixes feed self-improving evaluators. The immediate_fix feeds back into the Platform’s self-improving evaluators so the rubric ages with your product instead of decaying.
  • Promote to dataset. From each named issue, an engineer promotes representative traces into the eval set. The next PR touching the offending path has to clear the new entries.

The dataset ratchets stronger; the CI gate catches more regressions every quarter. Teams whose eval scores trend up are the teams whose closed loop is the default, not the project.

Three deliberate tradeoffs

  • Splitting retrieval and generation rubrics costs eval budget. Seven rubrics on every CI run is more LLM-as-judge calls than three on the final answer. The payoff is debuggable regressions. Future AGI’s classifier-backed evals run at lower per-eval cost than Galileo Luna-2, which makes weekly full-dataset reruns the default.
  • Hybrid retrieval adds setup. A BM25 index alongside vector, RRF fusion, and per-store config. The lift is 5 to 10 points on identifier-heavy queries. New deployments can start with vector and add BM25 once production traces show the failure mode.
  • Citation enforcement adds latency. Structured output plus per-claim validation costs roughly 10 to 20 percent latency over freeform. The payoff is no fabricated citations reach users. Worth it for compliance-sensitive domains; optional for casual workloads.

How Future AGI ships this

Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals. Graduate to the Platform when you want self-improving evals authored by an in-product agent.

  • ai-evaluation SDK (Apache 2.0): from fi.evals import Evaluator + evaluator.evaluate(eval_templates=[...], inputs=[TestCase(...)]). Seven RAG-specific EvalTemplate classes (Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy) plus 60+ total. 13 guardrail backends (9 open-weight including LLAMAGUARD_3_8B, QWEN3GUARD_8B, GRANITE_GUARDIAN_8B). 8 sub-10ms Scanners. Four distributed runners.
  • Future AGI Platform: self-improving evaluators tuned by thumbs up/down feedback; in-product authoring agent generates rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
  • traceAI (Apache 2.0): LlamaIndex instrumentor via LlamaIndexInstrumentor().instrument(...). 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), C#. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY). 14 span kinds including RETRIEVER. 62 built-in evals via EvalTag.
  • Error Feed (inside the eval stack): HDBSCAN clustering + Sonnet 4.5 Judge writes the immediate_fix; fixes feed the Platform’s self-improving evaluators.
  • agent-opt (Apache 2.0): six optimisers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) over heuristics, LLM judge, and 60+ rubrics.
  • Agent Command Center: 17 MB Go binary self-hosts in your VPC. 20+ providers via six native adapters plus OpenAI-compatible presets. RBAC. SOC 2 Type II, HIPAA, GDPR, CCPA certified (ISO/IEC 27001 in active audit).

Ready to evaluate your first PDF QA bot? Wire ContextRelevance, ChunkAttribution, Groundedness, and citation validity into a pytest fixture this afternoon against the ai-evaluation SDK, then add the LlamaIndex traceAI instrumentor when production traces start asking questions the CI gate missed.

Frequently asked questions

What's the right architecture for a PDF QA chatbot in 2026?
Five components, one eval layer on top. A parser that keeps page numbers, section headings, table boundaries, and element types intact (Unstructured, LlamaParse, or pdfplumber plus OCR for scanned docs). A chunker that splits on semantic boundaries, not character count. An embedder from a current family (bge-large, bge-m3, or text-embedding-3-large). A vector store with per-tenant metadata filtering (pgvector, Qdrant, Weaviate, LanceDB). A generator with structured output and citation enforcement. The eval layer runs retrieval rubrics (context relevance, chunk attribution, chunk utilization), generation rubrics (groundedness, context adherence, completeness, factual accuracy), and a deterministic citation-validity check on every PR and on a sampled fraction of live traffic. Without the retrieval rubrics you cannot tell whether a wrong answer is the model or the retriever, which means you cannot fix it.
What chunking strategy works best for PDFs?
Semantic chunking that respects document structure. Group elements by section and paragraph, split sections that exceed the embedding model's window, keep tables and figure captions as their own chunks. Preserve metadata on every chunk: page number, section heading, element type, reading order. Fixed-character chunking is the lazy default and costs 5 to 15 points on retrieval rubrics across the domains we have measured. The chunk is the unit of retrieval and the unit of attribution; if the chunk boundary cuts a clause in half, the citation will too. Cap chunk size at the embedding model's context window minus headroom for header context that travels with the chunk.
Should I use hybrid search (vector + BM25)?
Yes for most PDF QA workloads. Vector retrieval handles paraphrased queries and semantic neighbourhoods. BM25 handles exact-token matches: contract numbers, defined terms, regulatory clause IDs, statute references, drug names, part numbers. Pure vector loses on the identifier-heavy queries that dominate enterprise PDFs. Pure BM25 loses the moment a user paraphrases. Fuse the two ranked lists with reciprocal rank fusion (one over k plus rank, k usually 60) or weighted scoring tuned on a held-out set. The wiring cost is one extra index and one extra query path; the retrieval-rubric lift is worth it.
How do I prevent hallucinated citations?
Three controls compound. First, constrain the generator to a structured-output schema that requires per-claim citations with chunk IDs and quoted spans. Second, validate every cited span against the retrieval context with a deterministic string or fuzzy match before returning to the user; a failed validation triggers retry-with-stricter-prompt or refusal, not a hand-back of an invalid answer. Third, score citation validity in CI as a deterministic floor rubric and alarm in production on fabricated-citation rate. All three are programmatic and cheap. The second catches what the structured-output schema lets through, like the model citing a chunk that does not contain the quoted span; the third tracks how often that happens, so drift shows up as a trend instead of a string of one-off retries.
What eval set should I build for a PDF QA chatbot?
Start at 100 to 200 cases sampled from real user questions where possible, hand-annotated when not. Cover six failure shapes: factual lookup (single chunk answers the question), multi-hop reasoning (the answer requires two or more chunks), table extraction (the answer lives in a table cell), refusal scenarios (the document does not contain the answer and the model must say so), ambiguous queries (the right move is to ask for clarification), and edge cases (footnotes, scanned pages, captions, multi-column layout). Store each case as (question, expected_answer, expected_source_chunks, tenant_id). Grow the set weekly by promoting failing production traces through Error Feed; the dataset that ratchets stronger is the one that catches the bugs your users actually file.
Why evaluate retrieval before generation?
Because generation rubrics measure the wrong thing if retrieval is broken. Groundedness rewards answers that stick to the retrieved context; if the retriever surfaced the wrong chunk, a grounded answer is still wrong. The same goes for context adherence and factual accuracy. Retrieval rubrics (context relevance, chunk attribution, chunk utilization) separate the two failure modes. A drop in context relevance with stable groundedness means the retriever regressed. A drop in groundedness with stable context relevance means the generator regressed. Run both layers in CI and you can debug the regression in one bisect instead of three days of guessing.
How does Future AGI evaluate PDF QA chatbots?
Future AGI ships the eval stack as a package. The ai-evaluation SDK (Apache 2.0) provides the seven RAG-specific EvalTemplate classes (Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy) alongside 60+ templates in total, the Evaluator API, 13 guardrail backends, and four distributed runners. The Future AGI Platform layers self-improving evaluators tuned by thumbs up/down feedback, an in-product authoring agent that writes PDF-specific rubrics (table extraction accuracy, citation format compliance) from natural-language descriptions, and classifier-backed evals at lower per-eval cost than Galileo Luna-2. The same rubric runs in CI and as a span-attached scorer against live LlamaIndex traces via traceAI (50+ AI surfaces across Python, TypeScript, Java, C#). Error Feed sits inside the eval stack: HDBSCAN clustering plus a Sonnet 4.5 Judge writes the fix, and those fixes feed back into the platform's self-improving evaluators.