LLM Chatbot Evaluation: A Comprehensive Guide (2026)
Chatbot eval is six stacked problems: intent, retrieval, generation, tool use, multi-turn coherence, and safety. One Groundedness score hides the failure mode that actually ships.
A chatbot ships. The Groundedness chart sits at 0.93 and the CI gate is green. Two weeks later the build gets pulled because the bot keeps answering chargeback questions it should be routing, the retriever surfaces the EU return window in US conversations, the booking tool gets called with a malformed order id, and the persona slips into sarcasm by turn eight. None of these failures touch Groundedness. All of them ship.
The opinion this post earns: chatbot evaluation is six independent eval problems stacked, not one. Conflating them into a single response-quality score is why most chatbot CI gates miss the bug that actually ships. Intent capture, retrieval, generation, tool use, multi-turn coherence, and safety each fail for different reasons against different rubrics. A working eval suite scores all six separately, stratifies the dataset so no class of failure can hide behind aggregate accuracy, and runs the same rubrics in CI and on live traces.
This guide is the map — six layers, the rubrics that catch each failure family, the CI gate that wires them together, and the production loop that promotes failing traces back into the dataset. Code shaped against the ai-evaluation SDK and traceAI. For domain-specific playbooks see the customer support and medical chatbot guides.
TL;DR: the six layers
| Layer | What it measures | Bug it catches | Primary rubrics |
|---|---|---|---|
| 1. Intent capture | Did the bot read the user’s request right | Wrong target for every downstream layer | Intent confusion matrix, entity precision/recall, CustomerAgentQueryHandling |
| 2. Retrieval | Did it pull the right context | EU policy quoted in a US conversation | ContextRelevance, ContextAdherence, ChunkAttribution, ChunkUtilization |
| 3. Generation | Is the answer grounded, complete, on-policy | Plausible-wrong answer with a real citation | Groundedness, FactualAccuracy, IsHelpful, Completeness |
| 4. Tool use | Function-calling correctness | Wrong tool, wrong args, ignored output | EvaluateFunctionCalling, TaskCompletion |
| 5. Multi-turn coherence | Cumulative context, persona, completion | Context drift, persona break, premature end | ConversationCoherence, ConversationResolution, CustomerAgent family |
| 6. Safety + refusal | Right refusals, wrong refusals | Over-refusal kills product; under-refusal kills safety | AnswerRefusal, IsHarmfulAdvice, DataPrivacyCompliance |
Most chatbot regressions live in layer 1 (intent misread) and layer 5 (multi-turn breakage). One Groundedness score sees neither.
Why most chatbot eval misses the bug
A single-turn response-quality rubric scores (input, output) against a reference. A production chatbot fails in shapes that signature cannot see:
- Intent miss upstream. A chargeback in non-standard wording gets labelled “order status.” Retrieval runs the wrong namespace. The bot gives a clean, grounded answer about shipping windows. Groundedness scores 0.93. The bug was three layers up.
- Retrieval mismatch masquerading as generation. A relevant-looking chunk by semantic similarity didn’t apply to the user’s region. The answer sticks to the chunk so ContextAdherence is high. The chunk was wrong.
- Tool failures invisible to the answer rubric. The agent picked `intercom_get_conversation` when the user was on Zendesk; the tool returned empty; the agent hallucinated a fallback. Groundedness scored the text, not the tool trace.
- Persona break in turn eight. Friendly in turn one, sarcastic by turn eight after the user pushed back. Per-turn rubrics score each turn as coherent. Only the conversation-level rubric catches the drift.
- Over-refusal in production. A prompt update tightened safety language and the bot started refusing valid product questions. AnswerRefusal scored each refusal as well-formed. Support tickets spiked before anyone noticed.
Six distinct bugs, six rubric families. None surface from one aggregate score.
Layer 1: intent capture
If the bot reads the request wrong, every downstream layer scores against the wrong target. Intent capture is the easiest layer to skip because it sits before the LLM call you instrumented.
Two sub-rubrics:
- Intent classification. Predicted intent versus gold intent on a stratified set. Build a confusion matrix at the intent-class level (order_status, refund_request, account_access, product_question, escalation_request). The expensive cells are the cross-class confusions that send the user down the wrong tool path.
- Entity extraction. Per-entity precision and recall on order_id, customer_email, region, product_line, urgency_flag, time_window. Pydantic validators reject malformed extractions inline; the offline eval scores structured output against gold annotations.
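The intent side of these sub-rubrics reduces to counting label pairs. A minimal sketch, assuming gold and predicted intents arrive as parallel lists; the function names and toy labels are illustrative and not part of the ai-evaluation SDK:

```python
from collections import Counter

def confusion_matrix(gold, pred):
    """Count (gold_intent, predicted_intent) pairs."""
    return Counter(zip(gold, pred))

def per_intent_scores(gold, pred):
    """Return {intent: (precision, recall, f1)} for every intent seen."""
    cm = confusion_matrix(gold, pred)
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = cm[(label, label)]
        fp = sum(n for (g, p), n in cm.items() if p == label and g != label)
        fn = sum(n for (g, p), n in cm.items() if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores

# Toy labels: one refund_request is misread as order_status.
gold = ["order_status", "refund_request", "refund_request", "escalation_request"]
pred = ["order_status", "order_status", "refund_request", "escalation_request"]

scores = per_intent_scores(gold, pred)
# Off-diagonal cells are the cross-class confusions worth reading first.
off_diagonal = {pair: n for pair, n in confusion_matrix(gold, pred).items()
                if pair[0] != pair[1]}
```

Reading `off_diagonal` before the aggregate F1 keeps the expensive confusions visible even when the headline number looks healthy.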
CustomerAgentQueryHandling scores how well the bot understood the request. Pair with a CustomLLMJudge against your intent schema for product-specific labels. A bot that hallucinates a confident intent is worse than one that asks a clarifying question, which is why CustomerAgentClarificationSeeking is the right partner template: it scores whether the bot asked the right question at the right turn.
CI floor: intent F1 over 0.92 on head intents, entity precision over 0.95 on order_id and customer_email, per-intent recall over 0.85.
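The entity floors can be checked with a per-entity precision and recall pass. A sketch under stated assumptions: each example's extraction is a dict keyed by entity name, only exact value matches count as correct, and the helper names and sample ids are hypothetical:

```python
# Floors taken from the CI floor stated above.
ENTITY_PRECISION_FLOORS = {"order_id": 0.95, "customer_email": 0.95}

def entity_precision_recall(gold_rows, pred_rows, entity):
    """Precision and recall for one entity type across a dataset.

    A missing key means "not extracted". With no predictions (or no gold
    values) the corresponding score defaults to 1.0, which is a convention.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_rows, pred_rows):
        g, p = gold.get(entity), pred.get(entity)
        if p is not None and p == g:
            tp += 1
            continue
        if p is not None:
            fp += 1  # extracted a wrong or spurious value
        if g is not None:
            fn += 1  # missed, or got wrong, a value that was present
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

def entities_below_floor(gold_rows, pred_rows, floors=ENTITY_PRECISION_FLOORS):
    """Return {entity: precision} for every entity under its precision floor."""
    failing = {}
    for entity, floor in floors.items():
        precision, _ = entity_precision_recall(gold_rows, pred_rows, entity)
        if precision < floor:
            failing[entity] = precision
    return failing

# Toy data: the second order_id is extracted with the wrong digits.
gold_rows = [{"order_id": "AB-123456", "customer_email": "a@example.com"},
             {"order_id": "CD-654321"}]
pred_rows = [{"order_id": "AB-123456", "customer_email": "a@example.com"},
             {"order_id": "CD-000000"}]
failing = entities_below_floor(gold_rows, pred_rows)
```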
Layer 2: retrieval
Retrieval failures are the easiest to mistake for generation failures. The answer reads right; the chunk it grounded on was wrong for the case. Score retrieval as its own layer, before the answer.
Four rubrics per turn:
- `ContextRelevance` (eval_id=9). Are the retrieved chunks about the query.
- `ContextAdherence` (eval_id=5). Does the answer stick to the retrieved chunks.
- `ChunkAttribution` (eval_id=11). Does each claim map to a specific chunk id.
- `ChunkUtilization` (eval_id=12). Were the retrieved chunks actually used.
Deterministic floors sit alongside: precision_at_k, recall_at_k, and namespace correctness, meaning the retriever queried the right per-tenant namespace for the user’s region, product line, and language. Cross-tenant retrieval is a configuration-class incident; the only durable fix is to make it impossible for a query to cross namespaces.
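These deterministic floors need no judge model. A minimal sketch, assuming chunk ids are strings and that the retriever logs the filters it actually queried with; the chunk ids and the `namespace_correct` helper are hypothetical:

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k slots filled by a relevant chunk."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for cid in retrieved_ids[:k] if cid in relevant_ids)
    return hits / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant chunks that appear in the top k."""
    if not relevant_ids:
        return 1.0  # nothing to find, so nothing was missed
    found = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(found) / len(relevant_ids)

def namespace_correct(query_filters, expected):
    """True when every expected filter was applied with the expected value."""
    return all(query_filters.get(key) == value
               for key, value in expected.items())

retrieved = ["us-returns-v3", "eu-returns-v2", "us-shipping-v1"]
relevant = {"us-returns-v3", "us-refunds-v1"}

p3 = precision_at_k(retrieved, relevant, k=3)
r3 = recall_at_k(retrieved, relevant, k=3)
ok = namespace_correct({"region": "US", "language": "en"},
                       {"region": "US", "language": "en"})
leaked = namespace_correct({"region": "EU", "language": "en"},
                           {"region": "US", "language": "en"})
```

Here `precision_at_k` divides by `k` rather than by the number of chunks returned, so a retriever that returns fewer than `k` chunks is penalized; dividing by the returned count is an equally common variant.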
Per-conversation retrieval matters separately. The retriever can pull the right context per turn but ignore cumulative state (user said California in turn one, retriever queried without the region filter in turn three). Score retrieval on the rewritten standalone query, not the raw turn N input. A conversation-aware query rewriter is the lowest-effort fix.
Gate CI on context_precision >= 0.75, context_recall >= 0.80, chunk_attribution >= 0.90 on the answerable subset.
Layer 3: generation
Once intent is right and retrieval surfaced the right context, generation scores whether the answer is grounded, complete, helpful, and on-policy.
Four rubrics on the answerable subset:
- `Groundedness` (eval_id=47). Every claim is supported by retrieved context.
- `FactualAccuracy` (eval_id=66). Beyond grounding, the claims are factually correct against a reference.
- `Completeness` (eval_id=10). The answer covers the question fully, not a polished partial.
- `IsHelpful` (eval_id=84). Actionable and on-point, not just technically correct.
For citation-heavy domains (medical, legal, policy support), pair the LLM-judge rubrics with deterministic citation validation: every cited source must exist in the indexed corpus at the stated version, and the quoted span must appear verbatim in the chunk the citation points at. The LLM judge cannot catch fabricated citations as reliably as a string match against the index.
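The deterministic citation check is two lookups. A sketch, assuming the indexed corpus can be exposed as a mapping from (source id, version) to chunk text; the `Citation` shape, problem codes, and toy policy text are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    source_id: str
    version: str
    quoted_span: str

def validate_citation(citation, index):
    """Return a list of problems; an empty list means the citation checks out.

    `index` maps (source_id, version) -> chunk text.
    """
    chunk = index.get((citation.source_id, citation.version))
    if chunk is None:
        return ["source_not_in_index_at_version"]
    if citation.quoted_span not in chunk:
        return ["quoted_span_not_verbatim"]
    return []

# Toy index with a single policy chunk.
index = {("returns-policy-us", "2026-01"):
         "Items may be returned within 30 days of delivery."}

good = Citation("returns-policy-us", "2026-01", "within 30 days of delivery")
stale = Citation("returns-policy-us", "2025-06", "within 30 days of delivery")
paraphrased = Citation("returns-policy-us", "2026-01", "within thirty days")
```

Exact substring matching is deliberately strict: a paraphrased quote fails, which is the behaviour a verbatim-quote policy asks for. Whitespace or Unicode normalization before matching is a reasonable extension if the corpus needs it.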
Generation is also where formatting and tone get scored. Tone, IsPolite, IsConcise, and CustomerAgentLanguageHandling cover surface dimensions; PromptInstructionAdherence scores system-prompt constraints (response length, output schema, refusal templates, persona). A correct answer in the wrong tone is a regression in a chat product.
Layer 4: tool use
A chatbot without tools is a glorified FAQ. A chatbot with tools has a new class of failure the answer rubrics cannot see. Score four dimensions per call:
- Tool selection. Did the agent pick the right function for the user channel, intent, and entities. `EvaluateFunctionCalling` (eval_id=98) matches predicted function name and argument shape against the expected call.
- Argument correctness. Do arguments match the schema with the right ids, customer email, region, and time window. Pydantic validators reject malformed calls inline; the offline rubric matches them against the gold trace.
- Output use. Did the agent use the returned data or did it ignore the tool result and hallucinate. `CustomerAgentConversationQuality` and `TaskCompletion` score this.
- Side-effect safety. Write tools (refund-create, ticket-update, account-modify) require a human-approval gate above a configurable threshold. Missing approval spans fail the build.
from pydantic import BaseModel, Field


class OrderStatusLookup(BaseModel):
    """Read tool. Safe to auto-execute."""
    order_id: str = Field(pattern=r"^[A-Z]{2}-\d{6,10}$")
    customer_email: str


class RefundCreate(BaseModel):
    """Write tool. Human approval above $50, dual-control above $500."""
    order_id: str
    amount_cents: int = Field(le=500_00)  # schema cap: 50,000 cents = $500
    reason_code: str
Typed envelopes make the schema the contract. The agent proposes a write, the gateway emits an approval span, the human signs off, then the action executes. The Agent Command Center’s per-virtual-key AllowedTools and DeniedTools enforce scope at the gateway boundary so a coerced prompt cannot exceed the budget. Above-threshold actions emit an audit-log span with proposed action, approver, timestamp, and rollback handle.
CI floor: function_call_accuracy >= 0.95, task_completion >= 0.88, zero missing approval spans on write calls above threshold.
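The "zero missing approval spans" floor can be enforced with a pass over the trace. A sketch, assuming spans are flattened to dicts with `kind`, `tool`, `call_id`, and `amount_cents` keys; that span shape is hypothetical and does not reflect the traceAI schema, and the threshold mirrors the $50 figure in the `RefundCreate` docstring:

```python
WRITE_TOOLS = {"refund-create", "ticket-update", "account-modify"}
APPROVAL_THRESHOLD_CENTS = 50_00  # $50, an assumed configurable threshold

def missing_approvals(spans):
    """Return call ids of above-threshold write calls with no approval span."""
    approved = {s["call_id"] for s in spans if s["kind"] == "approval"}
    missing = []
    for s in spans:
        if s["kind"] != "tool_call" or s["tool"] not in WRITE_TOOLS:
            continue
        above = s.get("amount_cents", 0) > APPROVAL_THRESHOLD_CENTS
        if above and s["call_id"] not in approved:
            missing.append(s["call_id"])
    return missing

spans = [
    {"kind": "tool_call", "tool": "refund-create", "call_id": "c1", "amount_cents": 75_00},
    {"kind": "approval", "call_id": "c1"},
    {"kind": "tool_call", "tool": "refund-create", "call_id": "c2", "amount_cents": 120_00},
    {"kind": "tool_call", "tool": "refund-create", "call_id": "c3", "amount_cents": 20_00},
]
unapproved = missing_approvals(spans)
```

This version only checks that an approval exists for the call id. A stricter gate would also confirm the approval span precedes the tool execution in the trace.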
Layer 5: multi-turn coherence
The layer most chatbot CI gates skip and most production incidents come from. Correctness depends on cumulative context, not just the last turn. Score the conversation as the unit.
Four sub-rubrics:
- Context retention. Did the bot remember facts the user provided earlier. `CustomerAgentContextRetention`.
- Coherence. Are turns internally consistent; the bot avoids contradicting earlier statements. `ConversationCoherence` (eval_id=1).
- Resolution. Did the dialogue reach the expected end state (issue resolved, booking created, lead qualified, escalation routed). `ConversationResolution` (eval_id=2).
- Termination handling. Did the bot end at the right turn: not prematurely, and not by trapping the user in a loop. `CustomerAgentTerminationHandling` and `CustomerAgentLoopDetection`.
Persona adherence is the rubric most teams miss. Per-turn rubrics score each turn as coherent; only a conversation-level rubric catches the drift. Use CustomerAgentPromptConformance for persona-versus-system-prompt adherence, and a CustomLLMJudge with a rubric such as: “Score 1.0 if every turn stays in the persona defined by the system prompt; 0.0 if any turn breaks persona.”
Escalation accuracy belongs here. Did the bot escalate at the right turn given user signals (frustration, repeated requests, out-of-scope intent), trap the user in a loop, or escalate too early. CustomerAgentHumanEscalation ships this. The full escalation taxonomy and per-tier floors live in the customer support chatbot guide.
TaskCompletion (eval_id=99) crowns the layer with the end-to-end outcome score. Dataset shape: 3 to 12 turns per conversation, 50 to 150 conversations per intent class, stratified across happy paths, edge cases, and the hardest ten percent of production traffic.
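The dataset shape is easy to lint before any rubric runs. A sketch, assuming each conversation is a dict with an `intent` label and a `turns` list; the field names and message format are hypothetical, and the bounds are the ones stated above:

```python
from collections import defaultdict

MIN_TURNS, MAX_TURNS = 3, 12
MIN_CONVS, MAX_CONVS = 50, 150

def dataset_shape_problems(conversations):
    """Return human-readable problems; an empty list means the shape is fine."""
    problems = []
    per_intent = defaultdict(int)
    for i, conv in enumerate(conversations):
        per_intent[conv["intent"]] += 1
        n_turns = len(conv["turns"])
        if not MIN_TURNS <= n_turns <= MAX_TURNS:
            problems.append(f"conversation {i}: {n_turns} turns")
    for intent, count in sorted(per_intent.items()):
        if not MIN_CONVS <= count <= MAX_CONVS:
            problems.append(f"intent {intent}: {count} conversations")
    return problems

# 50 well-formed refund conversations, then one short order_status outlier.
dataset = [{"intent": "refund_request", "turns": ["turn"] * 4} for _ in range(50)]
clean = dataset_shape_problems(dataset)
dataset.append({"intent": "order_status", "turns": ["turn"] * 2})
flagged = dataset_shape_problems(dataset)
```

The lint covers turn counts and per-intent volume only. Stratification across happy paths, edge cases, and hard production traffic needs a per-conversation tag that this sketch does not model.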
Layer 6: safety and refusal calibration
Safety is two failure modes pulling opposite ways. Over-refusal kills the product. Under-refusal kills the user, the brand, or compliance. Score both.
Build a refusal test set with three buckets:
- Should-answer. Cases the bot is expected to substantively respond to. Over-refusal is a product failure.
- Should-refuse. Cases the bot is expected to decline (out of scope, beyond training, safety-tier-2, requires-clinician). Under-refusal is a safety failure.
- Should-clarify. Cases the bot is expected to ask a clarifying question before answering or refusing. Failing to clarify is a UX failure.
AnswerRefusal (eval_id=88) scores the refusal directly; CustomerAgentClarificationSeeking scores the clarify path. Layer harm-class rubrics on every output regardless of refusal status: IsHarmfulAdvice, NoHarmfulTherapeuticGuidance for health, ClinicallyInappropriateTone, Toxicity, Sexist, ContentSafety. Refusal calibration only works with gold labels per case; without them the rubric collapses to “did it refuse,” which gives no signal on calibration.
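With gold bucket labels in place, calibration is a pair of error rates rather than one refusal rate. A sketch, assuming each case records its gold bucket and the action the bot took; the bucket and action strings are hypothetical, and classifying the bot's action is left to a rubric such as AnswerRefusal:

```python
def refusal_calibration(cases):
    """Compute per-bucket error rates from gold-labelled cases.

    Each case is a dict with "bucket" (should_answer, should_refuse,
    should_clarify) and "action" (answer, refuse, clarify).
    A rate is None when its bucket has no cases.
    """
    def rate(bucket, wrong_actions):
        in_bucket = [c for c in cases if c["bucket"] == bucket]
        if not in_bucket:
            return None
        wrong = sum(1 for c in in_bucket if c["action"] in wrong_actions)
        return wrong / len(in_bucket)

    return {
        "over_refusal_rate": rate("should_answer", {"refuse"}),
        "under_refusal_rate": rate("should_refuse", {"answer"}),
        "missed_clarify_rate": rate("should_clarify", {"answer", "refuse"}),
    }

cases = [
    {"bucket": "should_answer", "action": "answer"},
    {"bucket": "should_answer", "action": "refuse"},   # over-refusal
    {"bucket": "should_refuse", "action": "refuse"},
    {"bucket": "should_refuse", "action": "refuse"},
    {"bucket": "should_clarify", "action": "answer"},  # skipped the clarifying question
]
report = refusal_calibration(cases)
```

Keeping the three rates separate is the point: a single blended refusal rate would average the over-refusal here against a clean should-refuse bucket and report nothing alarming.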
PII and prompt-injection sit alongside as deterministic floors: DataPrivacyCompliance (eval_id=13) at 1.00 on input and output, PromptInjection (eval_id=18) at 1.00 on input, and the 8 sub-10ms Scanners (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner) as a pre-filter.
Future AGI Protect runs four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) plus a Protect Flash binary classifier at 65 ms text and 107 ms image median time-to-label per the Protect paper. The agentcc-gateway Go plugin carries deterministic PII regex covering 18 entity types with per-tenant pipeline_mode, fail_open, and per-check action (block / warn / mask / log) as the inline runtime side of the same rubric.
The CI gate: wire all six into one fixture
A working CI gate scores all six layers, stratifies the dataset so no class of failure can hide behind aggregate accuracy, and diffs against a moving baseline rather than alarming on every change.
from fi.evals import Evaluator
from fi.evals.templates import (
    ContextRelevance, ContextAdherence, ChunkAttribution, ChunkUtilization,
    Groundedness, FactualAccuracy, Completeness, IsHelpful,
    EvaluateFunctionCalling, TaskCompletion, AnswerRefusal,
    ConversationCoherence, ConversationResolution,
    CustomerAgentContextRetention, CustomerAgentLoopDetection,
    DataPrivacyCompliance, IsHarmfulAdvice,
)
from fi.testcases import TestCase

evaluator = Evaluator()

LAYER_FLOORS = {
    "intent_f1": 0.92, "entity_precision": 0.95,
    "context_precision": 0.75, "context_recall": 0.80,
    "groundedness": 0.90, "is_helpful": 0.85,
    "function_call_accuracy": 0.95, "task_completion": 0.88,
    "conversation_coherence": 0.88, "conversation_resolution": 0.85,
    "data_privacy_compliance": 1.00, "answer_refusal_on_should_refuse": 0.97,
}


def test_chatbot(eval_dataset):
    # run_agent, score, score_intent, score_tool_calls, and check_floors are
    # project-specific helpers defined elsewhere in the test suite.
    results = {layer: [] for layer in
               ["intent", "retrieval", "generation", "tool", "multi_turn", "safety"]}
    for ex in eval_dataset:
        run = run_agent(ex.conversation, ex.region, ex.product_line)
        tc = TestCase(input=ex.last_user_message, output=run.response,
                      context="\n\n".join(c["text"] for c in run.chunks),
                      conversation=ex.conversation)
        results["intent"].append(score_intent(run, ex))
        results["retrieval"].append(score(evaluator, tc,
            [ContextRelevance(), ContextAdherence(),
             ChunkAttribution(), ChunkUtilization()]))
        # Generation rubrics run only on the answerable subset.
        if ex.gold_intent == "answerable" and run.intent == "answerable":
            results["generation"].append(score(evaluator, tc,
                [Groundedness(), FactualAccuracy(),
                 Completeness(), IsHelpful()]))
        # Tool rubrics run only when the agent actually called a tool.
        if run.tool_calls:
            results["tool"].append(score_tool_calls(run, ex, evaluator, tc))
        results["multi_turn"].append(score(evaluator, tc,
            [ConversationCoherence(), ConversationResolution(),
             CustomerAgentContextRetention(), CustomerAgentLoopDetection()]))
        results["safety"].append(score(evaluator, tc,
            [AnswerRefusal(), DataPrivacyCompliance(), IsHarmfulAdvice()]))
    # Compare aggregated scores against the floors once, after the whole dataset.
    failures = check_floors(results, LAYER_FLOORS)
    assert not failures, f"chatbot failures: {failures[:6]}"
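The fixture leans on a `check_floors` helper that the post does not show. One possible shape for it, as a sketch only: it assumes every entry in `results[layer]` is a dict of metric name to score, which is a guess about what the `score` helpers return rather than documented SDK behaviour:

```python
from statistics import mean

def check_floors(results, floors):
    """Return (metric, observed_mean, floor) for every metric under its floor.

    Assumes `results` maps layer -> list of {metric_name: score} dicts.
    A metric with a floor but no scores fails closed with observed=None.
    """
    by_metric = {}
    for rows in results.values():
        for row in rows:
            for metric, value in row.items():
                by_metric.setdefault(metric, []).append(value)

    failures = []
    for metric, floor in floors.items():
        values = by_metric.get(metric)
        if not values:
            failures.append((metric, None, floor))
            continue
        observed = round(mean(values), 4)
        if observed < floor:
            failures.append((metric, observed, floor))
    return failures

example_results = {
    "generation": [{"groundedness": 0.95}, {"groundedness": 0.80}],
    "safety": [{"data_privacy_compliance": 1.0}],
}
example_floors = {"groundedness": 0.90, "data_privacy_compliance": 1.00,
                  "is_helpful": 0.85}
example_failures = check_floors(example_results, example_floors)
```

Failing closed on an unscored metric is a design choice: a layer that silently stopped producing scores should block the build just as a low score would.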
Three habits separate a working gate from theatre:
- Stratify the dataset across the six layers. Equal weight per intent class and per refusal bucket; natural-distribution accuracy hides safety misses behind easy questions.
- Diff against a moving baseline. Alarm on a 2-point sustained drop, not every change.
- Promote production failures weekly. Static eval sets go stale fast because user behavior drifts and adversaries adapt.
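The moving-baseline habit can be sketched as a rolling comparison. This version assumes scores on a 0 to 1 scale (so a 2-point drop is 0.02), a baseline of the ten runs preceding the latest three, and "sustained" meaning all three latest runs sit below it; the window sizes are illustrative choices, not values from the post:

```python
from statistics import mean

def sustained_drop(history, baseline_window=10, sustain=3, drop=0.02):
    """True when the latest `sustain` runs all sit at least `drop` below baseline.

    The baseline is the mean of the `baseline_window` runs that precede
    those latest runs, so it moves forward as history grows.
    """
    if len(history) < baseline_window + sustain:
        return False  # not enough history to judge
    recent = history[-sustain:]
    baseline = mean(history[-(baseline_window + sustain):-sustain])
    return all(baseline - score >= drop for score in recent)

one_bad_run = [0.90] * 10 + [0.90, 0.87, 0.90]
real_regression = [0.90] * 10 + [0.87, 0.87, 0.86]

noise_alarm = sustained_drop(one_bad_run)
regression_alarm = sustained_drop(real_regression)
```

A single noisy run does not trip the alarm, while three consecutive runs three or more points under the baseline do.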
Production observability and the closing loop
The CI gate catches regressions you can think of. Production catches everything else. The same rubrics run as span-attached scorers against live traces, with the conversation root span carrying conversation-level scores.
traceAI (Apache 2.0) ships 50+ AI surfaces across Python (46 packages), TypeScript (39), Java (24 modules including Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C#. 14 span kinds (AGENT, TOOL, RETRIEVER, LLM, GUARDRAIL and more). Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) at register() time mean spans flow into your existing OTel collector without lock-in. 62 built-in evals wire via EvalTag.
Sample 5 to 10 percent of production traffic for LLM-judge rubrics; deterministic checks (function_call_accuracy, intent emission, PII presence) run on 100 percent. Six production-only signals to alarm on:
- Intent drift. A 5-point shift in head intent distribution over a week usually means a prompt update tipped routing.
- Retrieval namespace drift. A spike in cross-namespace queries is a configuration regression.
- Tool-call failure rate. An argument-validation spike on `zendesk_lookup_ticket` is usually an upstream schema change.
- Refusal rate per bucket. Drift up on should-answer (over-refusal) blocks release; drift down on should-refuse blocks safety.
- Loop count per conversation. A rising mean means the clarification policy is breaking.
- Per-conversation latency and cost. A correct chatbot at 12 seconds per turn is a product failure for chat.
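Two of the production checks above are deterministic enough to sketch: choosing which conversations receive LLM-judge rubrics, and flagging intent drift. Both functions are hypothetical helpers, not traceAI APIs. The sampler hashes the conversation id so every span in a conversation gets the same decision, and the drift check reads a "5-point shift" as a 0.05 change in an intent's share of weekly traffic:

```python
import hashlib
from collections import Counter

def sampled_for_judge(conversation_id, rate=0.05):
    """Deterministically select roughly `rate` of conversations."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate

def intent_shares(intents):
    """Map each intent to its share of the given traffic."""
    counts = Counter(intents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {intent: n / total for intent, n in counts.items()}

def intent_drift(last_week, this_week, threshold=0.05):
    """Return {intent: share_change} where the share moved by >= threshold."""
    before, after = intent_shares(last_week), intent_shares(this_week)
    drift = {}
    for intent in set(before) | set(after):
        change = round(after.get(intent, 0.0) - before.get(intent, 0.0), 4)
        if abs(change) >= threshold:
            drift[intent] = change
    return drift

last_week = ["order_status"] * 60 + ["refund_request"] * 40
this_week = ["order_status"] * 50 + ["refund_request"] * 50
drift = intent_drift(last_week, this_week)
```

Hash-based sampling avoids storing a sampling decision per conversation and keeps the choice stable across services that see the same id.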
Error Feed closes the loop. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups failing conversations into named issues. A Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur summariser for spans over 3000 chars, prompt-cache hit ratio near 90 percent) writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1 to 5 each). The fix feeds the Platform’s self-improving evaluators so rubrics across all six layers age with the product. Engineers cannot promote a failing trace on their own — the gold labels (intent, response_type, tool calls) need domain-lead review.
Common chatbot eval mistakes
- One rubric for “response quality.” Aggregate scores hide which of the six layers is failing.
- No intent eval. The classifier or implicit router is unscored; downstream rubrics measure against the wrong target.
- No conversation-level rubric. Persona drift, context retention failures, and premature termination are invisible at the per-turn level.
- Refusal rate without buckets. Over-refusal and under-refusal pull opposite ways; one rate aggregates them into noise.
- Tool calls scored only by the answer rubric. Wrong tool selection or ignored tool output reads as a generation failure.
- Static eval set written at launch. Chatbot regressions are the fastest-drifting eval data; promote from production weekly.
- Tracing in one tool, eval in another. When the trace and the eval live in different places, no one looks at either.
How Future AGI ships the chatbot eval stack
Future AGI ships the eval stack as a package. Start with the SDK for code-defined rubrics across all six layers. Graduate to the Platform for self-improving evaluators tuned by domain-lead feedback.
- ai-evaluation SDK (Apache 2.0). 60+ `EvalTemplate` classes covering every layer: `ConversationCoherence`, `ConversationResolution`, `TaskCompletion`, `AnswerRefusal`, `Groundedness`, `ContextAdherence`, `ContextRelevance`, `ChunkAttribution`, `ChunkUtilization`, `IsHelpful`, `Completeness`, `FactualAccuracy`, `EvaluateFunctionCalling`, plus the eleven-template CustomerAgent family (`ClarificationSeeking`, `ContextRetention`, `ConversationQuality`, `HumanEscalation`, `InterruptionHandling`, `LanguageHandling`, `LoopDetection`, `ObjectionHandling`, `PromptConformance`, `QueryHandling`, `TerminationHandling`). 13 guardrail backends, 8 sub-10ms Scanners, 4 distributed runners (Celery, Ray, Temporal, Kubernetes).
- Future AGI Platform. Self-improving evaluators tuned by thumbs-up/down or relabel feedback. In-product agent authors custom evaluators from natural-language descriptions. Classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- Error Feed (inside the eval stack). HDBSCAN clustering plus Sonnet 4.5 Judge agent writes the `immediate_fix`; four-dimensional trace scoring; Linear ticketing today.
- traceAI (Apache 2.0). 50+ AI surfaces across Python / TypeScript / Java / C#; 14 span kinds; pluggable semantic conventions; 62 built-in evals via `EvalTag`.
- Future AGI Protect. Four Gemma 3n LoRA adapters plus Protect Flash; 65 ms text / 107 ms image median time-to-label per the Protect paper.
- Agent Command Center. 17 MB Go binary self-hosts in your VPC; 20+ providers; per-virtual-key `AllowedTools` and `DeniedTools`; SOC 2 Type II, HIPAA, GDPR, and CCPA certified (ISO/IEC 27001 in active audit).
Ready to evaluate your first chatbot across all six layers? Wire ContextRelevance, Groundedness, EvaluateFunctionCalling, ConversationCoherence, CustomerAgentLoopDetection, AnswerRefusal, and DataPrivacyCompliance into a pytest fixture this afternoon against the ai-evaluation SDK, then add traceAI instrumentation when production traces start asking questions the CI gate missed.