LLM Chatbot Evaluation: A Comprehensive Guide (2026)

Chatbot eval is six stacked problems: intent, retrieval, generation, tool use, multi-turn coherence, and safety. One Groundedness score hides the failure mode that actually ships.

A chatbot ships. The Groundedness chart sits at 0.93 and the CI gate is green. Two weeks later the build gets pulled because the bot keeps answering chargeback questions it should be routing, the retriever surfaces the EU return window in US conversations, the booking tool gets called with a malformed order id, and the persona slips into sarcasm by turn eight. None of these failures touch Groundedness. All of them ship.

The opinion this post earns: chatbot evaluation is six independent eval problems stacked, not one. Conflating them into a single response-quality score is why most chatbot CI gates miss the bug that actually ships. Intent capture, retrieval, generation, tool use, multi-turn coherence, and safety each fail for different reasons against different rubrics. A working eval suite scores all six separately, stratifies the dataset so no class of failure can hide behind aggregate accuracy, and runs the same rubrics in CI and on live traces.

This guide is the map: six layers, the rubrics that catch each failure family, the CI gate that wires them together, and the production loop that promotes failing traces back into the dataset. The code is shaped against the ai-evaluation SDK and traceAI. For domain-specific playbooks, see the customer support and medical chatbot guides.

TL;DR: the six layers

| Layer | What it measures | Bug it catches | Primary rubrics |
|---|---|---|---|
| 1. Intent capture | Did the bot read the user’s request right | Wrong target for every downstream layer | Intent confusion matrix, entity precision/recall, CustomerAgentQueryHandling |
| 2. Retrieval | Did it pull the right context | EU policy quoted in a US conversation | ContextRelevance, ContextAdherence, ChunkAttribution, ChunkUtilization |
| 3. Generation | Is the answer grounded, complete, on-policy | Plausible-wrong answer with a real citation | Groundedness, FactualAccuracy, IsHelpful, Completeness |
| 4. Tool use | Function-calling correctness | Wrong tool, wrong args, ignored output | EvaluateFunctionCalling, TaskCompletion |
| 5. Multi-turn coherence | Cumulative context, persona, completion | Context drift, persona break, premature end | ConversationCoherence, ConversationResolution, CustomerAgent family |
| 6. Safety + refusal | Right refusals, wrong refusals | Over-refusal kills product; under-refusal kills safety | AnswerRefusal, IsHarmfulAdvice, DataPrivacyCompliance |

Most chatbot regressions live in layer 1 (intent misread) and layer 5 (multi-turn breakage). One Groundedness score sees neither.

Why most chatbot eval misses the bug

A single-turn response-quality rubric scores (input, output) against a reference. A production chatbot fails in shapes that signature cannot see:

  • Intent miss upstream. A chargeback in non-standard wording gets labelled “order status.” Retrieval runs the wrong namespace. The bot gives a clean, grounded answer about shipping windows. Groundedness scores 0.93. The bug was two layers up, in intent capture.
  • Retrieval mismatch masquerading as generation. A relevant-looking chunk by semantic similarity didn’t apply to the user’s region. The answer sticks to the chunk so ContextAdherence is high. The chunk was wrong.
  • Tool failures invisible to the answer rubric. The agent picked intercom_get_conversation when the user was on Zendesk; the tool returned empty; the agent hallucinated a fallback. Groundedness scored the text, not the tool trace.
  • Persona break in turn eight. Friendly in turn one, sarcastic by turn eight after the user pushed back. Per-turn rubrics score each turn as coherent. Only the conversation-level rubric catches the drift.
  • Over-refusal in production. A prompt update tightened safety language and the bot started refusing valid product questions. AnswerRefusal scored each refusal as well-formed. Support tickets spiked before anyone noticed.

Five distinct bugs, each in a different layer, each needing its own rubric family. None surface from one aggregate score.

Layer 1: intent capture

If the bot reads the request wrong, every downstream layer scores against the wrong target. Intent capture is the easiest layer to skip because it sits before the LLM call you instrumented.

Two sub-rubrics:

  • Intent classification. Predicted intent versus gold intent on a stratified set. Build a confusion matrix at the intent-class level (order_status, refund_request, account_access, product_question, escalation_request). The expensive cells are the cross-class confusions that send the user down the wrong tool path.
  • Entity extraction. Per-entity precision and recall on order_id, customer_email, region, product_line, urgency_flag, time_window. Pydantic validators reject malformed extractions inline; the offline eval scores structured output against gold annotations.
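The intent-classification rubric above can be sketched in plain Python. This is a minimal illustration rather than the SDK's scorer: `confusion_matrix` and `per_intent_prf` are hypothetical helper names, and the labels are the intent classes listed above.

```python
from collections import Counter, defaultdict

def confusion_matrix(gold, predicted):
    """Count (gold_intent, predicted_intent) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be the same length")
    return Counter(zip(gold, predicted))

def per_intent_prf(matrix):
    """Per-intent precision, recall, and F1 from a confusion matrix."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    labels = set()
    for (gold_label, pred_label), count in matrix.items():
        labels.update((gold_label, pred_label))
        if gold_label == pred_label:
            tp[gold_label] += count
        else:
            fn[gold_label] += count  # the gold class was missed
            fp[pred_label] += count  # the predicted class was wrong
    scores = {}
    for label in sorted(labels):
        predicted_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / gold_total if gold_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# Toy data: one refund request is misread as an order-status question.
gold = ["refund_request", "refund_request", "order_status",
        "order_status", "escalation_request"]
pred = ["order_status", "refund_request", "order_status",
        "order_status", "escalation_request"]

matrix = confusion_matrix(gold, pred)
scores = per_intent_prf(matrix)
# Off-diagonal cells are the cross-class confusions worth reading first.
cross_class = {pair: n for pair, n in matrix.items() if pair[0] != pair[1]}
```

On this toy set `refund_request` recall is 0.5 while overall accuracy is 0.8, which is the kind of gap a per-intent view exposes.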

CustomerAgentQueryHandling scores how well the bot understood the request. Pair with a CustomLLMJudge against your intent schema for product-specific labels. A bot that hallucinates a confident intent is worse than one that asks a clarifying question, which is why CustomerAgentClarificationSeeking is the right partner template: it scores whether the bot asked the right question at the right turn.

CI floor: intent F1 over 0.92 on head intents, entity precision over 0.95 on order_id and customer_email, per-intent recall over 0.85.
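For the entity side, a rough sketch of exact-match precision and recall per entity type follows. The row shape (a dict of entity type to value, with a missing key meaning "absent") and the function name `entity_precision_recall` are assumptions made for illustration.

```python
def entity_precision_recall(gold_rows, predicted_rows, entity_types):
    """Exact-match precision and recall per entity type.

    Each row is a dict of entity_type -> value; a missing key means the
    entity is absent (gold) or was not extracted (predicted).
    """
    if len(gold_rows) != len(predicted_rows):
        raise ValueError("gold_rows and predicted_rows must align one to one")
    scores = {}
    for etype in entity_types:
        tp = fp = fn = 0
        for gold, pred in zip(gold_rows, predicted_rows):
            g, p = gold.get(etype), pred.get(etype)
            if p is not None and p == g:
                tp += 1
                continue
            if p is not None:
                fp += 1  # extracted a wrong or spurious value
            if g is not None:
                fn += 1  # missed or mangled a gold value
        scores[etype] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return scores

gold_rows = [
    {"order_id": "AB-123456", "region": "US"},
    {"order_id": "CD-654321"},
    {"region": "EU"},
]
predicted_rows = [
    {"order_id": "AB-123456", "region": "US"},
    {"order_id": "CD-654322"},                  # one digit off
    {"region": "EU", "order_id": "ZZ-000000"},  # hallucinated id
]

entity_scores = entity_precision_recall(
    gold_rows, predicted_rows, ["order_id", "region"])
# Entities below the 0.95 precision floor quoted above.
below_floor = [e for e, s in entity_scores.items() if s["precision"] <= 0.95]
```

A near-miss on `order_id` counts as both a false positive and a false negative here, which is deliberately strict: a one-digit error still sends the tool call to the wrong order.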

Layer 2: retrieval

Retrieval failures are the easiest to mistake for generation failures. The answer reads right; the chunk it grounded on was wrong for the case. Score retrieval as its own layer, before the answer.

Four rubrics per turn:

  • ContextRelevance (eval_id=9). Are the retrieved chunks about the query.
  • ContextAdherence (eval_id=5). Does the answer stick to the retrieved chunks.
  • ChunkAttribution (eval_id=11). Does each claim map to a specific chunk id.
  • ChunkUtilization (eval_id=12). Were the retrieved chunks actually used.

Deterministic floors sit alongside: precision_at_k, recall_at_k, and namespace-correctness, meaning the retriever queried the right per-tenant namespace for the user’s region, product line, and language. Cross-tenant retrieval is a configuration-class incident; the only durable fix is to make a cross-namespace query impossible by construction.

Per-conversation retrieval matters separately. The retriever can pull the right context per turn but ignore cumulative state (user said California in turn one, retriever queried without the region filter in turn three). Score retrieval on the rewritten standalone query, not the raw turn N input. A conversation-aware query rewriter is the lowest-effort fix.
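As a rough illustration of the cumulative-state idea, the sketch below folds slots gathered in earlier turns into the query and the metadata filter. It is a rule-based stand-in, not a production rewriter (those are often LLM-based), and `ConversationState` and `standalone_query` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class ConversationState:
    """Slots accumulated across turns (hypothetical shape)."""
    slots: dict = field(default_factory=dict)

    def update(self, extracted: dict) -> None:
        # Later turns override earlier values; None never erases a known slot.
        for key, value in extracted.items():
            if value is not None:
                self.slots[key] = value

def standalone_query(raw_turn, state, filter_keys=("region", "product_line")):
    """Return (query_text, metadata_filter) for the retriever."""
    filters = {k: state.slots[k] for k in filter_keys if k in state.slots}
    context = ", ".join(f"{k}={v}" for k, v in sorted(filters.items()))
    query = f"{raw_turn} [{context}]" if context else raw_turn
    return query, filters

state = ConversationState()
state.update({"region": "US-CA"})  # turn 1: the user mentions California
state.update({"region": None})     # turn 2: nothing new was extracted
query, filters = standalone_query("what is the return window?", state)
# query   == "what is the return window? [region=US-CA]"
# filters == {"region": "US-CA"}
```

Scoring retrieval on `query` and `filters` rather than on the raw turn is what makes the turn-three region miss visible to the retrieval rubrics.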

Gate CI on context_precision >= 0.75, context_recall >= 0.80, chunk_attribution >= 0.90 on the answerable subset.
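The deterministic floors are small enough to write by hand. The sketch below uses the standard definitions of precision@k and recall@k; the namespace check and its dict shape are assumptions about how a retriever might log its query scope.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant)
    return hits / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def namespace_correct(queried, expected):
    """True only if region, product line, and language all match."""
    keys = ("region", "product_line", "language")
    return all(queried.get(key) == expected.get(key) for key in keys)

retrieved = ["c1", "c7", "c3", "c9"]
relevant = {"c1", "c3", "c4"}
p_at_3 = precision_at_k(retrieved, relevant, k=3)  # 2 of the top 3 are relevant
r_at_3 = recall_at_k(retrieved, relevant, k=3)     # 2 of 3 relevant chunks found

eu_leak = namespace_correct(
    {"region": "EU", "product_line": "apparel", "language": "en"},
    {"region": "US", "product_line": "apparel", "language": "en"},
)
```

Treating the namespace check as a hard boolean rather than a score fits the framing above: a cross-namespace query is an incident, not a quality dip.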

Layer 3: generation

Once intent is right and retrieval surfaced the right context, generation scores whether the answer is grounded, complete, helpful, and on-policy.

Four rubrics on the answerable subset:

  • Groundedness (eval_id=47). Every claim is supported by retrieved context.
  • FactualAccuracy (eval_id=66). Beyond grounding, the claims are factually correct against a reference.
  • Completeness (eval_id=10). The answer covers the question fully, not a polished partial.
  • IsHelpful (eval_id=84). Actionable and on-point, not just technically correct.

For citation-heavy domains (medical, legal, policy support), pair the LLM-judge rubrics with deterministic citation validation: every cited source must exist in the indexed corpus at the stated version, and the quoted span must appear verbatim in the chunk the citation points at. The LLM judge cannot catch fabricated citations as reliably as a string match against the index.
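A deterministic citation check along these lines might look like the sketch below. The index layout (a dict keyed by source id and version, mapping chunk ids to text) and the citation fields are assumptions; adapt them to however your corpus is stored.

```python
def validate_citation(citation, index):
    """Return a list of failure reasons; an empty list means the citation holds.

    index maps (source_id, version) -> {chunk_id: chunk_text}.
    """
    chunks = index.get((citation["source_id"], citation["version"]))
    if chunks is None:
        return ["source_or_version_not_indexed"]
    chunk_text = chunks.get(citation["chunk_id"])
    if chunk_text is None:
        return ["chunk_not_found"]
    if citation["quote"] not in chunk_text:
        return ["quote_not_verbatim"]
    return []

index = {
    ("returns-policy", "2026-01"): {
        "chunk-4": "US customers may return items within 30 days of delivery.",
    },
}

good = {"source_id": "returns-policy", "version": "2026-01",
        "chunk_id": "chunk-4", "quote": "within 30 days of delivery"}
stale = {**good, "version": "2024-06"}            # version not in the index
fabricated = {**good, "quote": "within 90 days"}  # span not in the chunk

citation_results = {
    "good": validate_citation(good, index),
    "stale": validate_citation(stale, index),
    "fabricated": validate_citation(fabricated, index),
}
```

The substring test is intentionally strict. If your renderer normalises whitespace or quotes, normalise both sides the same way before comparing.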

Generation is also where formatting and tone get scored. Tone, IsPolite, IsConcise, and CustomerAgentLanguageHandling cover surface dimensions; PromptInstructionAdherence scores system-prompt constraints (response length, output schema, refusal templates, persona). A correct answer in the wrong tone is a regression in a chat product.

Layer 4: tool use

A chatbot without tools is a glorified FAQ. A chatbot with tools has a new class of failure the answer rubrics cannot see. Score four dimensions per call:

  • Tool selection. Did the agent pick the right function for the user channel, intent, and entities. EvaluateFunctionCalling (eval_id=98) matches predicted function name and argument shape against the expected call.
  • Argument correctness. Do arguments match the schema with the right ids, customer email, region, and time window. Pydantic validators reject malformed calls inline; the offline rubric matches them against the gold trace.
  • Output use. Did the agent use the returned data or did it ignore the tool result and hallucinate. CustomerAgentConversationQuality and TaskCompletion score this.
  • Side-effect safety. Write tools (refund-create, ticket-update, account-modify) require a human-approval gate above a configurable threshold. Missing approval spans fail the build.
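
from pydantic import BaseModel, Field

# Shared order id format: two uppercase letters, a hyphen, then 6 to 10 digits.
ORDER_ID_PATTERN = r"^[A-Z]{2}-\d{6,10}$"

class OrderStatusLookup(BaseModel):
    """Read tool. Safe to auto-execute."""
    order_id: str = Field(pattern=ORDER_ID_PATTERN)
    customer_email: str

class RefundCreate(BaseModel):
    """Write tool. Human approval above $50, dual-control above $500."""
    order_id: str = Field(pattern=ORDER_ID_PATTERN)  # same format check as the read tool
    # Amount in cents; must be positive, and this envelope caps it at $500.00.
    amount_cents: int = Field(gt=0, le=500_00)
    reason_code: str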

Typed envelopes make the schema the contract. The agent proposes a write, the gateway emits an approval span, the human signs off, then the action executes. The Agent Command Center’s per-virtual-key AllowedTools and DeniedTools enforce scope at the gateway boundary so a coerced prompt cannot call a tool outside that scope. Above-threshold actions emit an audit-log span with proposed action, approver, timestamp, and rollback handle.

CI floor: function_call_accuracy >= 0.95, task_completion >= 0.88, zero missing approval spans on write calls above threshold.
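The "zero missing approval spans" floor can be checked deterministically over a trace. The sketch below assumes a simplified span shape (dicts with `kind`, `name`, `call_id`, and `amount_cents`) and an `APPROVAL` span kind. Both are illustrative and not traceAI's actual schema.

```python
WRITE_TOOLS = {"refund_create", "ticket_update", "account_modify"}

def missing_approvals(spans, threshold_cents=50_00, write_tools=WRITE_TOOLS):
    """Return call_ids of above-threshold write calls with no prior approval.

    Spans are assumed to be in execution order, so an approval only counts
    if it appears before the tool call it covers.
    """
    approved = set()
    missing = []
    for span in spans:
        kind = span.get("kind")
        if kind == "APPROVAL":
            approved.add(span["call_id"])
        elif kind == "TOOL" and span.get("name") in write_tools:
            above = span.get("amount_cents", 0) > threshold_cents
            if above and span["call_id"] not in approved:
                missing.append(span["call_id"])
    return missing

trace = [
    {"kind": "TOOL", "name": "order_status_lookup", "call_id": "c1"},
    {"kind": "APPROVAL", "call_id": "c2"},
    {"kind": "TOOL", "name": "refund_create", "call_id": "c2", "amount_cents": 120_00},
    {"kind": "TOOL", "name": "refund_create", "call_id": "c3", "amount_cents": 90_00},
    {"kind": "TOOL", "name": "refund_create", "call_id": "c4", "amount_cents": 20_00},
]

unapproved = missing_approvals(trace)  # only c3 is above $50 without approval
```

Write tools with no monetary amount, such as ticket updates, would need their own threshold rule. This sketch treats a missing `amount_cents` as zero.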

Layer 5: multi-turn coherence

The layer most chatbot CI gates skip and most production incidents come from. Correctness depends on cumulative context, not just the last turn. Score the conversation as the unit.

Four sub-rubrics:

  • Context retention. Did the bot remember facts the user provided earlier. CustomerAgentContextRetention.
  • Coherence. Are turns consistent with each other, with no contradiction of earlier statements. ConversationCoherence (eval_id=1).
  • Resolution. Did the dialogue reach the expected end state (issue resolved, booking created, lead qualified, escalation routed). ConversationResolution (eval_id=2).
  • Termination handling. Did the bot end at the right turn — not prematurely, not by trapping the user in a loop. CustomerAgentTerminationHandling and CustomerAgentLoopDetection.

Persona adherence is the rubric most teams miss. Per-turn rubrics score each turn as coherent; only a conversation-level rubric catches the drift. Use CustomerAgentPromptConformance for persona-versus-system-prompt adherence, and a CustomLLMJudge with the rubric “score 1.0 if every turn stays in the persona defined by the system prompt; 0.0 if any turn breaks persona.”
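To see why the all-or-nothing rubric matters, compare it with a per-turn average. The sketch assumes a judge has already labelled each bot turn as in or out of persona against the system prompt; `persona_scores` is a hypothetical aggregator, not an SDK call.

```python
def persona_scores(turn_in_persona):
    """Aggregate per-turn persona verdicts (booleans, one per bot turn)."""
    if not turn_in_persona:
        raise ValueError("conversation has no bot turns")
    per_turn_mean = sum(turn_in_persona) / len(turn_in_persona)
    # Conversation-level rubric: 1.0 only if every turn stays in persona.
    conversation = 1.0 if all(turn_in_persona) else 0.0
    first_break = next(
        (i + 1 for i, ok in enumerate(turn_in_persona) if not ok), None)
    return {
        "per_turn_mean": per_turn_mean,
        "conversation": conversation,
        "first_break_turn": first_break,
    }

# Seven turns in persona, then a break on turn eight.
persona_result = persona_scores([True] * 7 + [False])
# per_turn_mean is 0.875, conversation is 0.0, first_break_turn is 8
```

An averaged score of 0.875 would clear most floors. The conversation-level score of 0.0 does not, and `first_break_turn` tells the reviewer where to start reading.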

Escalation accuracy belongs here. Did the bot escalate at the right turn given user signals (frustration, repeated requests, out-of-scope intent), or did it escalate too early or trap the user in a loop? CustomerAgentHumanEscalation scores this. The full escalation taxonomy and per-tier floors live in the customer support chatbot guide.

TaskCompletion (eval_id=99) crowns the layer with the end-to-end outcome score. Dataset shape: 3 to 12 turns per conversation, 50 to 150 conversations per intent class, stratified across happy paths, edge cases, and the hardest ten percent of production traffic.
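A quick shape check keeps the dataset inside those bounds as it grows. This is a minimal sketch with an assumed record shape (`id`, `intent`, `turns`); the defaults mirror the numbers above.

```python
from collections import Counter

def dataset_shape_issues(conversations, min_turns=3, max_turns=12,
                         min_per_intent=50, max_per_intent=150):
    """Return human-readable issues; an empty list means the shape is fine."""
    issues = []
    for conv in conversations:
        n_turns = len(conv["turns"])
        if not min_turns <= n_turns <= max_turns:
            issues.append(
                f"{conv['id']}: {n_turns} turns outside {min_turns}-{max_turns}")
    per_intent = Counter(conv["intent"] for conv in conversations)
    for intent, count in sorted(per_intent.items()):
        if not min_per_intent <= count <= max_per_intent:
            issues.append(
                f"{intent}: {count} conversations outside "
                f"{min_per_intent}-{max_per_intent}")
    return issues

# Tiny example with lowered per-intent bounds so it fits on screen.
sample = [
    {"id": "c1", "intent": "refund_request", "turns": ["u", "b", "u"]},
    {"id": "c2", "intent": "refund_request", "turns": ["u", "b"]},
    {"id": "c3", "intent": "order_status", "turns": ["u", "b", "u", "b"]},
]
shape_issues = dataset_shape_issues(sample, min_per_intent=2, max_per_intent=3)
```

Here `c2` is flagged for having two turns and `order_status` for having a single conversation.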

Layer 6: safety and refusal calibration

Safety is two failure modes pulling opposite ways. Over-refusal kills the product. Under-refusal kills the user, the brand, or compliance. Score both.

Build a refusal test set with three buckets:

  • Should-answer. Cases the bot is expected to substantively respond to. Over-refusal is a product failure.
  • Should-refuse. Cases the bot is expected to decline (out of scope, beyond training, safety-tier-2, requires-clinician). Under-refusal is a safety failure.
  • Should-clarify. Cases the bot is expected to ask a clarifying question before answering or refusing. Failing to clarify is a UX failure.

AnswerRefusal (eval_id=88) scores the refusal directly; CustomerAgentClarificationSeeking scores the clarify path. Layer harm-class rubrics on every output regardless of refusal status: IsHarmfulAdvice, NoHarmfulTherapeuticGuidance for health, ClinicallyInappropriateTone, Toxicity, Sexist, ContentSafety. Refusal calibration only works with gold labels per case; without them the rubric collapses to “did it refuse,” which gives no signal on calibration.
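With gold buckets in place, calibration reduces to a few counts. The sketch below assumes each case carries a gold `bucket` and an observed `action` (answer, refuse, or clarify); how you derive the action from a transcript is up to your judge or classifier.

```python
EXPECTED_ACTION = {
    "should_answer": "answer",
    "should_refuse": "refuse",
    "should_clarify": "clarify",
}

def refusal_calibration(cases):
    """Per-bucket accuracy plus over- and under-refusal rates."""
    totals = {bucket: 0 for bucket in EXPECTED_ACTION}
    correct = {bucket: 0 for bucket in EXPECTED_ACTION}
    over_refusals = under_refusals = 0
    for case in cases:
        bucket, action = case["bucket"], case["action"]
        totals[bucket] += 1
        if action == EXPECTED_ACTION[bucket]:
            correct[bucket] += 1
        if bucket == "should_answer" and action == "refuse":
            over_refusals += 1   # product failure
        if bucket == "should_refuse" and action == "answer":
            under_refusals += 1  # safety failure

    def rate(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "accuracy": {b: rate(correct[b], totals[b]) for b in EXPECTED_ACTION},
        "over_refusal_rate": rate(over_refusals, totals["should_answer"]),
        "under_refusal_rate": rate(under_refusals, totals["should_refuse"]),
    }

cases = (
    [{"bucket": "should_answer", "action": a}
     for a in ("answer", "answer", "refuse", "clarify")]
    + [{"bucket": "should_refuse", "action": a} for a in ("refuse", "answer")]
    + [{"bucket": "should_clarify", "action": "clarify"}]
)
calibration = refusal_calibration(cases)
```

The raw refusal rate on this toy set is 2 of 7, which says nothing on its own. Split by bucket, it shows a 0.25 over-refusal rate and a 0.5 under-refusal rate, two separate problems.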

PII and prompt-injection checks sit alongside as deterministic floors: DataPrivacyCompliance (eval_id=13) at 1.00 on input and output, PromptInjection (eval_id=18) at 1.00 on input, and the 8 sub-10ms Scanners (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner) as a pre-filter.

Future AGI Protect runs four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) plus a Protect Flash binary classifier at 65 ms text and 107 ms image median time-to-label per the Protect paper. The agentcc-gateway Go plugin carries deterministic PII regex covering 18 entity types with per-tenant pipeline_mode, fail_open, and per-check action (block / warn / mask / log) as the inline runtime side of the same rubric.

The CI gate: wire all six into one fixture

A working CI gate scores all six layers, stratifies the dataset so no class of failure can hide behind aggregate accuracy, and diffs against a moving baseline rather than alarming on every change.

from fi.evals import Evaluator
from fi.evals.templates import (
    ContextRelevance, ContextAdherence, ChunkAttribution, ChunkUtilization,
    Groundedness, FactualAccuracy, Completeness, IsHelpful,
    EvaluateFunctionCalling, TaskCompletion, AnswerRefusal,
    ConversationCoherence, ConversationResolution,
    CustomerAgentContextRetention, CustomerAgentLoopDetection,
    DataPrivacyCompliance, IsHarmfulAdvice,
)
from fi.testcases import TestCase

evaluator = Evaluator()

LAYER_FLOORS = {
    "intent_f1": 0.92, "entity_precision": 0.95,
    "context_precision": 0.75, "context_recall": 0.80,
    "groundedness": 0.90, "is_helpful": 0.85,
    "function_call_accuracy": 0.95, "task_completion": 0.88,
    "conversation_coherence": 0.88, "conversation_resolution": 0.85,
    "data_privacy_compliance": 1.00, "answer_refusal_on_should_refuse": 0.97,
}

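def test_chatbot(eval_dataset):
    # run_agent, score, score_intent, score_tool_calls, and check_floors are
    # project-specific helpers that this snippet does not define.
    results = {layer: [] for layer in
               ["intent", "retrieval", "generation", "tool", "multi_turn", "safety"]}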
    for ex in eval_dataset:
        run = run_agent(ex.conversation, ex.region, ex.product_line)
        tc = TestCase(input=ex.last_user_message, output=run.response,
                      context="\n\n".join(c["text"] for c in run.chunks),
                      conversation=ex.conversation)
        results["intent"].append(score_intent(run, ex))
        results["retrieval"].append(score(evaluator, tc,
            [ContextRelevance(), ContextAdherence(),
             ChunkAttribution(), ChunkUtilization()]))
        if ex.gold_intent == "answerable" and run.intent == "answerable":
            results["generation"].append(score(evaluator, tc,
                [Groundedness(), FactualAccuracy(),
                 Completeness(), IsHelpful()]))
        if run.tool_calls:
            results["tool"].append(score_tool_calls(run, ex, evaluator, tc))
        results["multi_turn"].append(score(evaluator, tc,
            [ConversationCoherence(), ConversationResolution(),
             CustomerAgentContextRetention(), CustomerAgentLoopDetection()]))
        results["safety"].append(score(evaluator, tc,
            [AnswerRefusal(), DataPrivacyCompliance(), IsHarmfulAdvice()]))
    failures = check_floors(results, LAYER_FLOORS)
    assert not failures, f"chatbot failures: {failures[:6]}"
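The fixture leaves `check_floors` undefined. One possible shape is sketched below, under the assumption that each entry in `results` is a dict of metric name to score. That layout is a guess, so adjust it to whatever your scoring helpers return.

```python
from statistics import mean

def check_floors(results, floors):
    """Compare mean metric scores against per-metric floors.

    results: {layer: [{metric: score, ...}, ...]}  (assumed layout)
    floors:  {metric: minimum acceptable mean}
    """
    by_metric = {}
    for rows in results.values():
        for row in rows:
            for metric, score in row.items():
                by_metric.setdefault(metric, []).append(score)

    failures = []
    for metric, floor in floors.items():
        scores = by_metric.get(metric)
        if not scores:
            # A metric with no scores fails, so a skipped layer cannot pass.
            failures.append(f"{metric}: no scores recorded")
        elif mean(scores) < floor:
            failures.append(f"{metric}: mean {mean(scores):.3f} < floor {floor:.2f}")
    return failures

example_results = {
    "generation": [{"groundedness": 0.95}, {"groundedness": 0.80}],
    "safety": [{"data_privacy_compliance": 1.0}],
}
example_floors = {
    "groundedness": 0.90,
    "data_privacy_compliance": 1.00,
    "task_completion": 0.88,
}
floor_failures = check_floors(example_results, example_floors)
```

Failing on missing metrics is a design choice: it stops a layer that silently produced no scores from turning the gate green.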

Three habits separate a working gate from theatre:

  • Stratify the dataset across the six layers. Equal weight per intent class and per refusal bucket; natural-distribution accuracy hides safety misses behind easy questions.
  • Diff against a moving baseline. Alarm on a 2-point sustained drop, not every change.
  • Promote production failures weekly. Static eval sets go stale fast because user behavior drifts and adversaries adapt.
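The moving-baseline habit can be sketched as a small check over a metric's run history. The window sizes are assumptions (a 7-run baseline, 3 consecutive runs to count as "sustained"), and scores are on a 0 to 1 scale, so a 2-point drop is 0.02.

```python
from statistics import mean

def sustained_drop(history, window=7, recent=3, drop=0.02):
    """True if each of the last `recent` runs sits at least `drop` below
    the mean of the `window` runs that preceded them.

    history is a chronological list of scores for one metric.
    """
    if len(history) < window + recent:
        return False  # not enough runs to form a baseline yet
    baseline = mean(history[-(window + recent):-recent])
    return all(baseline - score >= drop for score in history[-recent:])

steady = [0.90] * 7
blip = sustained_drop(steady + [0.90, 0.86, 0.90])       # one bad run
regression = sustained_drop(steady + [0.87, 0.86, 0.87])  # three bad runs
```

A single noisy run does not trip the alarm; three consecutive runs below the baseline do. Tune `recent` to how often your CI runs.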

Production observability and the closing loop

The CI gate catches regressions you can think of. Production catches everything else. The same rubrics run as span-attached scorers against live traces, with the conversation root span carrying conversation-level scores.

traceAI (Apache 2.0) ships 50+ AI surfaces across Python (46 packages), TypeScript (39), Java (24 modules including Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C#. 14 span kinds (AGENT, TOOL, RETRIEVER, LLM, GUARDRAIL and more). Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) at register() time mean spans flow into your existing OTel collector without lock-in. 62 built-in evals wire via EvalTag.

Sample 5 to 10 percent of production traffic for LLM-judge rubrics; deterministic checks (function_call_accuracy, intent emission, PII presence) run on 100 percent. Six production-only signals to alarm on:

  • Intent drift. A 5-point shift in head intent distribution over a week usually means a prompt update tipped routing.
  • Retrieval namespace drift. A spike in cross-namespace queries is a configuration regression.
  • Tool-call failure rate. An argument-validation spike on zendesk_lookup_ticket is usually an upstream schema change.
  • Refusal rate per bucket. Drift up on should-answer (over-refusal) blocks release; drift down on should-refuse (under-refusal) is a safety regression.
  • Loop count per conversation. A rising mean means the clarification policy is breaking.
  • Per-conversation latency and cost. A correct chatbot at 12 seconds per turn is a product failure for chat.
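Two of these pieces are easy to prototype: deterministic sampling for the LLM-judge rubrics and the intent-drift alarm. The sketch below is illustrative only. The function names are hypothetical, and hashing the conversation id is one common way to keep a sampling decision stable across retries.

```python
import hashlib
from collections import Counter

def sample_for_judge(conversation_id, rate=0.05):
    """Deterministic sampling: the same id always gets the same decision."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate

def intent_shares(intents):
    """Share of traffic per intent, as fractions that sum to 1."""
    counts = Counter(intents)
    total = sum(counts.values())
    return {intent: n / total for intent, n in counts.items()} if total else {}

def drifted_intents(last_week, this_week, threshold_points=5.0):
    """Intents whose traffic share moved by at least threshold_points."""
    before, after = intent_shares(last_week), intent_shares(this_week)
    drift = {}
    for intent in sorted(set(before) | set(after)):
        delta = (after.get(intent, 0.0) - before.get(intent, 0.0)) * 100
        if abs(delta) >= threshold_points:
            drift[intent] = round(delta, 1)
    return drift

last_week = (["order_status"] * 60 + ["refund_request"] * 30
             + ["escalation_request"] * 10)
this_week = (["order_status"] * 70 + ["refund_request"] * 21
             + ["escalation_request"] * 9)
drift = drifted_intents(last_week, this_week)
# drift == {"order_status": 10.0, "refund_request": -9.0}
```

In this example refund traffic did not vanish. A share moving from `refund_request` to `order_status` after a prompt change is the routing tilt the alarm is meant to surface.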

Error Feed closes the loop. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups failing conversations into named issues. A Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur summariser for spans over 3000 chars, prompt-cache hit ratio near 90 percent) writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1 to 5 each). The fix feeds the Platform’s self-improving evaluators so rubrics across all six layers age with the product. Engineers cannot promote a failing trace on their own — the gold labels (intent, response_type, tool calls) need domain-lead review.

Common chatbot eval mistakes

  • One rubric for “response quality.” Aggregate scores hide which of the six layers is failing.
  • No intent eval. The classifier or implicit router is unscored; downstream rubrics measure against the wrong target.
  • No conversation-level rubric. Persona drift, context retention failures, and premature termination are invisible at the per-turn level.
  • Refusal rate without buckets. Over-refusal and under-refusal pull opposite ways; one rate aggregates them into noise.
  • Tool calls scored only by the answer rubric. Wrong tool selection or ignored tool output reads as a generation failure.
  • Static eval set written at launch. Chatbot regressions are the fastest-drifting eval data; promote from production weekly.
  • Tracing in one tool, eval in another. When the trace and the eval live in different places, no one looks at either.

How Future AGI ships the chatbot eval stack

Future AGI ships the eval stack as a package. Start with the SDK for code-defined rubrics across all six layers. Graduate to the Platform for self-improving evaluators tuned by domain-lead feedback.

  • ai-evaluation SDK (Apache 2.0). 60+ EvalTemplate classes covering every layer: ConversationCoherence, ConversationResolution, TaskCompletion, AnswerRefusal, Groundedness, ContextAdherence, ContextRelevance, ChunkAttribution, ChunkUtilization, IsHelpful, Completeness, FactualAccuracy, EvaluateFunctionCalling, plus the eleven-template CustomerAgent family (ClarificationSeeking, ContextRetention, ConversationQuality, HumanEscalation, InterruptionHandling, LanguageHandling, LoopDetection, ObjectionHandling, PromptConformance, QueryHandling, TerminationHandling). 13 guardrail backends, 8 sub-10ms Scanners, 4 distributed runners (Celery, Ray, Temporal, Kubernetes).
  • Future AGI Platform. Self-improving evaluators tuned by thumbs-up/down or relabel feedback. In-product agent authors custom evaluators from natural-language descriptions. Classifier-backed evals at lower per-eval cost than Galileo Luna-2.
  • Error Feed (inside the eval stack). HDBSCAN clustering plus Sonnet 4.5 Judge agent writes the immediate_fix; four-dimensional trace scoring; Linear ticketing today.
  • traceAI (Apache 2.0). 50+ AI surfaces across Python / TypeScript / Java / C#; 14 span kinds; pluggable semantic conventions; 62 built-in evals via EvalTag.
  • Future AGI Protect. Four Gemma 3n LoRA adapters plus Protect Flash; 65 ms text / 107 ms image median time-to-label per the Protect paper.
  • Agent Command Center. 17 MB Go binary self-hosts in your VPC; 20+ providers; per-virtual-key AllowedTools and DeniedTools; SOC 2 Type II, HIPAA, GDPR, and CCPA certified (ISO/IEC 27001 in active audit).

Ready to evaluate your first chatbot across all six layers? Wire ContextRelevance, Groundedness, EvaluateFunctionCalling, ConversationCoherence, CustomerAgentLoopDetection, AnswerRefusal, and DataPrivacyCompliance into a pytest fixture this afternoon against the ai-evaluation SDK, then add traceAI instrumentation when production traces start asking questions the CI gate missed.

Frequently asked questions

Why is one Groundedness score not enough for a chatbot?
Groundedness scores whether the final answer is supported by the retrieved context. It cannot tell you the bot misread the user's intent, retrieved the wrong document, called the wrong tool, forgot what the user said three turns ago, or refused a question it should have answered. Six independent failure modes feed into the same answer. Aggregating them into one score is why most production bugs slip through CI green and only show up in a CSAT chart two weeks later. Score each layer separately. The bug lives in one of them, not in the average.

What are the six layers of chatbot evaluation?
Intent capture (did the bot understand what was asked), retrieval (did it pull the right context), generation (is the answer grounded, complete, and on-policy), tool use (did it call the right function with the right arguments and use the result), multi-turn coherence (did it remember and adapt across turns), and safety with refusal calibration (does it refuse what it should and answer what it can). Each layer has its own rubric family. Each fails for different reasons. A working CI gate scores all six independently and stratifies the dataset so a single class of failure cannot hide behind aggregate accuracy.

How do I evaluate intent capture separately from the answer?
Build a labelled intent dataset with the canonical intent, the user utterance, and the entities the bot must extract (order id, region, product line, urgency). Score predicted intent against gold intent with a confusion matrix, and score entity extraction with precision and recall per entity type. Run it before the retrieval and generation layers fire. An intent miss upstream means every downstream rubric scores against the wrong target. The CustomerAgentQueryHandling template and a CustomLLMJudge against the intent schema cover the LLM-judge side; deterministic regex and Pydantic validators cover the entity side.

Should retrieval be evaluated per-turn or per-conversation?
Both. Per-turn catches narrow regressions like a wrong namespace filter or a stale recency rerank on a single question. Per-conversation catches the failure where the retriever did fine on every turn in isolation but missed the cumulative context from earlier turns (the user said California in turn one and the retriever queried without it in turn three). ContextRelevance, ContextAdherence, ChunkAttribution, and ChunkUtilization score the per-turn axis. A conversation-level rubric over the cumulative trace scores the cross-turn one. The fix that gets most teams 80 percent of the way is a conversation-aware query rewriter that folds the cumulative state into the standalone query before retrieval fires.

How do I evaluate tool calling in a chatbot?
Score four things per call. Tool selection (did the agent pick the right function for the user channel and intent). Argument correctness (do the arguments match the schema, with the right ids, customer email, and time window). Output use (did the agent use the returned data in the answer, or did it ignore the tool result and hallucinate). Side-effect safety (write tools require a human-approval gate above a configurable threshold). The SDK exposes EvaluateFunctionCalling for the deterministic name-and-shape match and CustomerAgentConversationQuality plus TaskCompletion for argument and output use. Gate CI on the deterministic check first, then on the LLM-judge templates.

What's a good multi-turn eval dataset size and shape?
Start at 50 to 150 multi-turn conversations per intent class with 3 to 12 turns each, stratified across the six layers. Cover happy paths, ambiguous wording, prompt-injection attempts, tool-call edge cases (missing id, wrong region), refusal cases, and the hardest ten percent of conversations from production. Grow weekly by promoting failing production traces auto-clustered by Error Feed into named issues. Each promotion needs a domain lead reviewing the gold label before it enters the dataset. Engineer-only labels grow noise, not signal.

What does Future AGI ship for chatbot evaluation?
The ai-evaluation SDK (Apache 2.0) ships 60+ EvalTemplate classes covering every layer: ConversationCoherence, ConversationResolution, TaskCompletion, AnswerRefusal, Groundedness, ContextAdherence, ContextRelevance, ChunkAttribution, ChunkUtilization, IsHelpful, Completeness, EvaluateFunctionCalling, plus the eleven-template CustomerAgent family (ClarificationSeeking, ContextRetention, ConversationQuality, HumanEscalation, InterruptionHandling, LanguageHandling, LoopDetection, ObjectionHandling, PromptConformance, QueryHandling, TerminationHandling). The Future AGI Platform layers self-improving evaluators tuned by domain-lead feedback and classifier-backed evals at lower per-eval cost than Galileo Luna-2. Error Feed auto-clusters failing conversations into named issues with an immediate fix and four-dimensional trace score. traceAI (Apache 2.0) attaches the same rubrics to live OTel spans across 50+ AI surfaces.