Evaluating Multi-Turn Conversations: A Deep Dive (2026)
Multi-turn eval is not turn-by-turn eval averaged. The unit is the trajectory: coherence, context retention, intent drift, resolution. The trajectory-first playbook.
Table of Contents
A support agent passes every turn-level rubric. Faithfulness sits at 0.94. Refusal-correctness at 0.98. The first ten transcripts in production tell a different story: the agent quietly walks past an escalation threshold on turn four, drops persona on turn six after a user pushes back, and on turn nine ships a refund quote that contradicts the policy it cited on turn two. None of those failures register as a bad response. They register as conversations that score well per turn and fail as sessions.
Most posts on this would call it a context-window problem and pitch a longer model. That misses the failure mode. Individual turns were fine. The trajectory was not.
The opinion this post earns: multi-turn eval is not turn-by-turn eval averaged. The unit is the conversation trajectory: coherence across turns, context retention, intent drift, resolution. Scoring turn-by-turn and averaging hides the conversation-level failures that actually break trust. The mean of a transcript that contradicted itself on turn nine and one that did not look identical at the dashboard. The CSAT chart, on a one-week lag, tells you which one shipped.
This guide is the trajectory-first playbook: the four conversation-level metrics, how to build a test set out of real replays plus adversarial trajectories, how per-trajectory scoring runs in CI, and how the same rubrics attach to live session spans through traceAI and the ai-evaluation SDK. For the six-layer chatbot map, see the chatbot eval guide.
TL;DR: the trajectory metrics and the eval suite
| Metric | What it scores | Failure it catches | Primary rubric |
|---|---|---|---|
| Conversation Coherence | Internal consistency across turns | Turn 9 contradicts turn 2 | ConversationCoherence |
| Context Retention | Use of facts the user gave earlier | Agent re-asks region in turn 5 | CustomerAgentContextRetention |
| Intent Drift | Staying on the asked task | Off-topic walkback under pressure | CustomLLMJudge rubric |
| Conversation Resolution | Reaching the expected end state | Session ends without escalation context | ConversationResolution |
| Loop control | Premature loops or clarifying tangents | Three clarifies, no answer | CustomerAgentLoopDetection |
| Escalation handoff | Right turn, right queue, full context | Trigger fired turn 3, agent kept going | CustomerAgentHumanEscalation |
| Persona adherence | Holding character across the session | Sarcasm by turn 8 after pushback | CustomerAgentPromptConformance |
The discipline: score each trajectory end-to-end against all of these, gate CI on per-trajectory floors (not the mean), and run the same rubrics on live conversation root spans at 5 to 10 percent sampling.
Why averaging turn scores hides the bug
A single-turn rubric scores (input, output) against a reference. It cannot see the cumulative shape of the trajectory. Three failure modes show up in postmortem after postmortem when teams shipped on per-turn means:
- Contradicted itself across turns. Turn 2 cited the US 30-day refund window. Turn 9 quoted the EU 14-day window for the same case after the user pushed back. Per-turn Groundedness scored 0.93 on both because each answer was supported by some retrieved chunk. The conversation-level reading is a coherence break. The agent gave the user two incompatible policies in the same thread.
- Forgot what the user said. Turn 1: user mentioned California. Turn 3: agent asked for the user’s region. Turn 5: agent gave a generic answer ignoring the California-specific clause. Each turn passed isolated grounding. The trajectory lost the user’s first message inside its own clarifying loop.
- Walked off-topic and never came back. User asked about a chargeback. Five turns of clarifying questions later, the agent was negotiating a discount code. No per-turn rubric fires on this. The intent the user came in with was abandoned. The agent’s last response was perfectly grounded to the wrong task.
Per-turn scores stayed green on all three. The bugs only become visible when the rubric input is the transcript, not the turn. Long context windows do not fix this; they shift the failure pattern from “the model forgot” to “the model has the context and ignores it,” which is harder to catch because the input looks fine.
The shift is one sentence: stop scoring (input, output) pairs and start scoring transcript → expected end state. Everything below is what that costs to operationalize.
The four trajectory metrics
A useful taxonomy for multi-turn conversation evaluation collapses to four trajectory-level metrics, each with a different input shape and a different failure family. Persona, loop control, and escalation correctness layer on top.
Conversation Coherence
Are turns internally consistent across the session? The rubric input is the full transcript. It penalises contradictions between turns, restated questions the user already answered, and tool calls that ignore parameters set earlier in the conversation. ConversationCoherence (eval_id=1) ships this. The expensive failure mode it catches: an agent that grounds each turn against a fresh chunk and ends up giving the user three incompatible answers across one thread.
Context Retention
Did the agent use what the user told it? The rubric input is the transcript plus an optional structured list of facts the user provided (region, account tier, allergy, prior order id, urgency). It penalises re-asking facts the user already gave, ignoring them in a downstream tool call, and quietly forgetting them by the time the answer ships. CustomerAgentContextRetention covers this directly. The fix that gets most teams 80 percent of the way is a conversation-aware query rewriter that folds the cumulative state into the standalone retrieval query before the retriever fires.
Intent Drift
Did the agent stay on the task the user came in with? The rubric input is the user’s first message, the transcript, and the final agent response. It penalises sessions where the user asked one thing and the agent ended up doing another: slow drift under user pressure, off-topic walkbacks, and topic switches the agent failed to refuse. The SDK does not ship an IntentDrift template, but CustomLLMJudge with a four-line rubric (score 1.0 if the final response addresses the user's first-turn intent; 0.0 if the conversation drifted off the asked task; partial for related-but-not-equivalent drift) handles it. Pair with CustomerAgentQueryHandling for the intent-capture side.
Conversation Resolution
Did the dialogue reach the expected end state? The rubric input is the transcript, the system prompt, and the expected outcome (refund issued, booking confirmed, ticket escalated with summary attached, lead qualified, or correctly refused). ConversationResolution (eval_id=2) ships this. Pair with TaskCompletion (eval_id=99) on the answerable subset for the end-to-end outcome score. A trajectory that ended at “I will look into that” without a follow-up action is a resolution failure regardless of how coherent each turn read.
Three more rubrics sit alongside as trajectory-level signals. CustomerAgentLoopDetection catches premature loops and clarifying-tangent dead ends. CustomerAgentHumanEscalation scores whether the agent escalated at the right turn with the right context attached. CustomerAgentPromptConformance scores persona adherence across the full session: sarcasm by turn 8 is a regression even if each line was technically correct. Per-turn rubrics score each turn as coherent; only the trajectory rubric catches the drift.
Step 1: build the trajectory test set
A trajectory eval set has two sources. Mix them in fixed ratios; one without the other underfits or overfits.
Real production replays (50-60 percent of the set). Pull conversations where CSAT was low, a guardrail fired, the bot looped, the human inherited the case, or repeat contact happened within 72 hours. Lift them as gold trajectories with a domain-lead-reviewed expected_end_state. These are the failure shapes you have actually seen. If the agent ships, this is what it has to handle. Engineer-only labels grow noise; the gold end state needs a support lead, clinician, or product owner depending on domain.
Adversarial trajectories (40-50 percent of the set). These are the failures you have not seen yet, scripted with simulate-sdk. Cover five families. Persona pressure ladders: user pushes back across turns 4-7, agent must hold character. Crescendo jailbreaks: gradual context poisoning over many turns. Off-topic walkbacks: user slowly drags the agent off the task. Premature termination probes: user signals they are done before the task is resolved; agent should confirm, not auto-close. Missing entity recoveries: user gives a partial order id; agent should ask one targeted question, not three.
from fi.simulate import (
Persona, Scenario, TestRunner, OpenAIAgentWrapper,
AgentDefinition, LLMConfig,
)
agent_def = AgentDefinition(
name="multi-turn-support-agent",
llm_config=LLMConfig(model="gpt-4.1-mini", temperature=0.4),
system_prompt=SYSTEM_PROMPT,
)
persona_pressure = Persona(
persona={
"name": "frustrated_pro_user",
"communication_style": "challenging, pushes back across turns",
},
situation="User has a chargeback question and challenges the agent's "
"policy reading on turn 4 and again on turn 6.",
outcome="Agent holds policy, escalates to billing-sensitive queue with "
"full thread summary, never drops formal persona.",
)
scenario = Scenario(
name="billing_sensitive_persona_pressure",
description="Persona-pressure ladder on a billing-sensitive case.",
dataset=[persona_pressure],
)
wrapper = OpenAIAgentWrapper(agent_def)
runner = TestRunner(agent_wrapper=wrapper, scenarios=[scenario], max_turns=10)
report = runner.run()
Stratify the merged set across the four trajectory metrics and across the persona-by-scenario grid so a weakness in context retention cannot hide behind a high coherence mean. Start at 60-150 trajectories of 3-12 turns each: enough to score per-trajectory floors meaningfully, small enough to inspect every failure by hand for the first two weeks. Grow weekly by promoting failing production traces auto-clustered by the Error Feed, each one reviewed by a domain lead before the gold end state enters the dataset.
Step 2: per-trajectory scoring in CI
Most chatbot CI gates score per-turn and average. The gate stays green while the trajectory bugs ship. Invert the loop. The unit of CI is the trajectory; the floors are per-trajectory; the dataset is stratified so a single class of failure cannot drag everything down by one point and pass review.
from fi.evals import Evaluator
from fi.evals.templates import (
ConversationCoherence, ConversationResolution, TaskCompletion,
CustomerAgentContextRetention, CustomerAgentLoopDetection,
CustomerAgentHumanEscalation, CustomerAgentPromptConformance,
)
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase
evaluator = Evaluator()
intent_drift = CustomLLMJudge(
provider=LiteLLMProvider(),
config={
"name": "intent_drift",
"model": "gpt-4.1",
"grading_criteria": (
"Score 1.0 if the final agent response addresses the user's "
"first-turn intent. 0.0 if the conversation drifted off the "
"asked task. Partial for related-but-not-equivalent drift. "
"Provide a one-sentence reason citing the turn the drift "
"started, if any."
),
},
)
TRAJECTORY_FLOORS = {
"conversation_coherence": 0.88,
"conversation_resolution": 0.85,
"customer_agent_context_retention": 0.90,
"customer_agent_loop_detection": 0.92,
"customer_agent_human_escalation": 0.95,
"customer_agent_prompt_conformance": 0.90,
"intent_drift": 0.95, # zero tolerance on safety-tier trajectories
}
def test_trajectories(trajectory_dataset):
failed = []
for traj in trajectory_dataset:
transcript = "\n".join(
f"{t['role']}: {t['content']}" for t in traj.history
)
tc = TestCase(
input=transcript,
expected_output=traj.expected_end_state,
context=traj.system_prompt,
conversation=traj.history,
)
result = evaluator.evaluate(
eval_templates=[
ConversationCoherence(), ConversationResolution(),
CustomerAgentContextRetention(),
CustomerAgentLoopDetection(),
CustomerAgentHumanEscalation(),
CustomerAgentPromptConformance(),
],
inputs=[tc],
)
scores = {r.name: r.output for r in result.eval_results}
scores["intent_drift"] = intent_drift.evaluate([tc])[0].output
for metric, floor in TRAJECTORY_FLOORS.items():
if scores[metric] < floor:
failed.append((traj.id, metric, scores[metric], floor))
assert not failed, f"trajectory failures: {failed[:5]}"
Three habits separate a working trajectory gate from theatre. Score per trajectory, not per turn. The transcript is the input, the expected end state is the reference. Stratify the dataset across personas and scenarios. Equal weight per persona-by-scenario cell; natural-distribution accuracy hides persona-pressure failures behind a pile of happy paths. Diff against a moving baseline. Alarm on a 2-point sustained drop, not every change. Trajectory rubrics are noisier per-trajectory than per-turn scores, and over-alarming on the noise floor kills the habit of looking at the gate.
Per-trajectory scoring is more expensive than per-turn averaging because the LLM judge sees the whole transcript for every conversation. The discipline that earns it back is sampling. Per-trajectory rubrics gate the curated regression set in CI; deterministic checks (intent emission, tool-call schema, PII presence) still run on 100 percent of turn-level traffic.
Step 3: production session-level observability
The CI gate catches the trajectory bugs you can think of. Production catches everything else. The bridge that makes both possible is the same: every span carries session.id and user.id, so per-turn LLM, retrieval, and tool spans roll up into a conversation root span. Trajectory rubrics attach to that span, not the per-turn span tree.
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
ProjectType, EvalTag, EvalTagType, EvalSpanKind,
EvalName, ModelChoices,
)
from openinference.instrumentation.openai import OpenAIInstrumentor
trace_provider = register(
project_type=ProjectType.OBSERVE,
project_name="multi-turn-support-agent",
eval_tags=[
EvalTag(
type=EvalTagType.OBSERVATION_SPAN,
value=EvalSpanKind.CONVERSATION,
eval_name=EvalName.CONVERSATION_COHERENCE,
model=ModelChoices.TURING_LARGE,
mapping={"input": "input.value", "output": "output.value"},
),
EvalTag(
type=EvalTagType.OBSERVATION_SPAN,
value=EvalSpanKind.CONVERSATION,
eval_name=EvalName.CONVERSATION_RESOLUTION,
model=ModelChoices.TURING_LARGE,
mapping={"input": "input.value", "output": "output.value"},
),
],
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
EvalSpanKind.CONVERSATION is the trajectory-level surface. The rubric input is the rolled-up transcript, scored when the session closes. Sample at 5-10 percent of live traffic for the LLM-judge trajectory rubrics; the cost of running every trajectory metric on every conversation is prohibitive and unnecessary. Deterministic checks (intent emission, schema validity, PII presence) keep running on every turn at 100 percent.
Four production-only trajectory signals to alarm on. Coherence drift: a 3-point drop on the conversation coherence histogram over a week usually means a system-prompt update tipped the model into contradicting itself; pair with a prompt rollback. Context retention drift: rising mean of “agent re-asked” annotations means the conversation-aware query rewriter regressed or the retrieval namespace changed. Resolution rate per scenario class: a drop on the billing-sensitive class with stable coherence usually means the escalation handoff stopped including the staged summary the human queue needs. Intent drift rate: rising on long sessions usually means the agent’s clarifying policy is breaking; the conversation is being walked off-topic before the user gets an answer.
The Error Feed closes the loop. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups failing trajectories into named issues. A Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars, prompt-cache hit ratio near 90 percent) writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each). A typical cluster on a multi-turn agent reads: persona_drift_after_pushback / 47 sessions over 24h / agent drops formal persona on turn 5-7 after user challenges the recommendation; immediate fix: harden system prompt to restate reasoning rather than agree. The cluster summary feeds the Platform’s self-improving evaluators, so the same rubric that scored persona conformance gets sharper as drifted conversations are observed and labelled. Engineers cannot promote a failing trace to the dataset on their own; the gold end state needs domain-lead review.
How Future AGI fits multi-turn evaluation
Future AGI ships the eval stack as a package. Start with the SDK for code-defined trajectory rubrics. Graduate to the Platform for self-improving evaluators tuned by domain-lead feedback.
- ai-evaluation SDK (Apache 2.0). The trajectory templates ship as classes:
ConversationCoherence,ConversationResolution,TaskCompletion, plus the 11-template CustomerAgent family (CustomerAgentClarificationSeeking,CustomerAgentContextRetention,CustomerAgentConversationQuality,CustomerAgentHumanEscalation,CustomerAgentInterruptionHandling,CustomerAgentLanguageHandling,CustomerAgentLoopDetection,CustomerAgentObjectionHandling,CustomerAgentPromptConformance,CustomerAgentQueryHandling,CustomerAgentTerminationHandling).CustomLLMJudgehandles intent drift and any product-specific trajectory rubric in a few lines. 50+ evaluators total, 20+ heuristic metrics locally. - simulate-sdk.
PersonaplusScenarioplusTestRunnerdrives the adversarial side of the trajectory set acrossOpenAIAgentWrapper,LangChainAgentWrapper,GeminiAgentWrapper, andAnthropicAgentWrapper.ScenarioGeneratorauto-produces variants for the persona-pressure and crescendo families. Voice agents get the same shape viaTTSConfigandSTTConfig. - traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, and C#; 14 span kinds with first-class
CONVERSATION,AGENT, andTOOL; trajectory rubrics attach viaEvalTagwithEvalSpanKind.CONVERSATIONso the same rubric in CI runs on live sessions. - Future AGI Platform. Self-improving evaluators tuned by domain-lead thumbs feedback. In-product authoring agent writes trajectory rubrics from natural-language descriptions. Classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- Error Feed (inside the eval stack). HDBSCAN clustering plus a Sonnet 4.5 Judge writes the
immediate_fix; domain-lead-reviewed promotions feed the dataset and the self-improving evaluators. Linear OAuth wired today; Slack, GitHub, Jira, and PagerDuty on the roadmap. - Agent Command Center. OpenAI-compatible AI gateway in a single Go binary (Apache 2.0); 100+ providers; 18+ built-in guardrail scanners with
EvalSpanKind.CONVERSATIONcarrying trajectory scores back into the same trace tree. SOC 2 Type II, HIPAA, GDPR, and CCPA certified; ISO/IEC 27001 in active audit.
Three honest tradeoffs. Trajectory rubrics cost more per call than turn-level rubrics because the LLM judge reads the whole transcript; the discipline is to gate CI on the curated regression set and sample production at 5-10 percent. Custom intent-drift judges need calibration against a domain-lead-labelled hold-out before the floor is meaningful; a judge that fires on every off-topic turn is noise. Self-improving evaluators need a pinned hold-out and quarterly calibration review or the drift moves with the data and no one notices.
Ready to score your first trajectories? Wire ConversationCoherence, ConversationResolution, CustomerAgentContextRetention, CustomerAgentLoopDetection, CustomerAgentHumanEscalation, and a CustomLLMJudge for intent drift into a pytest fixture this afternoon against the ai-evaluation SDK, then attach the same rubrics to live conversation root spans via EvalSpanKind.CONVERSATION when production traces start asking questions the CI gate missed.
Related reading
- LLM Chatbot Evaluation: A Comprehensive Guide (2026)
- Customer Support Chatbot: Build and Evaluate (2026)
- Multi-Turn LLM Evaluation (2026)
- Single-Turn vs Multi-Turn Evaluation (2026)
- Simulated Multi-Turn Conversation Eval (2026)
- Multi-Turn Jailbreaking Defender (2026)
- Evaluating Tool-Calling Agents (2026)
- The 2026 LLM Evaluation Playbook
Frequently asked questions
Why is averaging turn-level scores wrong for multi-turn evaluation?
What are the four trajectory-level metrics that catch what averaging misses?
How do I build a multi-turn test set that catches real failures?
What does per-trajectory CI scoring look like?
How does production session-level tracing pair with trajectory eval?
What's the difference between simulation, replay, and live eval?
How does Future AGI fit the multi-turn eval workflow?
Temporal turns agent workflows into replayable state machines. The eval that matches: per-activity correctness, workflow outcome, retry budget enforcement, signal-handler correctness. Replay is the superpower.
Evaluating DSPy pipelines in 2026: why the compile metric isn't your production rubric, and how to eval the Signature instead of the program.
Long-context support is marketing. Long-context fidelity is what you eval: NIAH at every position, lost-in-middle on your docs, attention-budget cost.