Guides

Evaluating Multi-Turn Conversations: A Deep Dive (2026)

Multi-turn eval is not turn-by-turn eval averaged. The unit is the trajectory: coherence, context retention, intent drift, resolution. The trajectory-first playbook.

·
Updated
·
11 min read
multi-turn-evaluation conversational-ai agent-evaluation trajectory-eval llm-observability voice-agents 2026
Editorial cover image for Evaluating Multi-Turn Conversations: An Architectural Deep Dive
Table of Contents

A support agent passes every turn-level rubric. Faithfulness sits at 0.94. Refusal-correctness at 0.98. The first ten transcripts in production tell a different story: the agent quietly walks past an escalation threshold on turn four, drops persona on turn six after a user pushes back, and on turn nine ships a refund quote that contradicts the policy it cited on turn two. None of those failures register as a bad response. They register as conversations that score well per turn and fail as sessions.

Most posts on this would call it a context-window problem and pitch a longer model. That misses the failure mode. Individual turns were fine. The trajectory was not.

The opinion this post earns: multi-turn eval is not turn-by-turn eval averaged. The unit is the conversation trajectory: coherence across turns, context retention, intent drift, resolution. Scoring turn-by-turn and averaging hides the conversation-level failures that actually break trust. The mean of a transcript that contradicted itself on turn nine and one that did not look identical at the dashboard. The CSAT chart, on a one-week lag, tells you which one shipped.

This guide is the trajectory-first playbook: the four conversation-level metrics, how to build a test set out of real replays plus adversarial trajectories, how per-trajectory scoring runs in CI, and how the same rubrics attach to live session spans through traceAI and the ai-evaluation SDK. For the six-layer chatbot map, see the chatbot eval guide.

TL;DR: the trajectory metrics and the eval suite

MetricWhat it scoresFailure it catchesPrimary rubric
Conversation CoherenceInternal consistency across turnsTurn 9 contradicts turn 2ConversationCoherence
Context RetentionUse of facts the user gave earlierAgent re-asks region in turn 5CustomerAgentContextRetention
Intent DriftStaying on the asked taskOff-topic walkback under pressureCustomLLMJudge rubric
Conversation ResolutionReaching the expected end stateSession ends without escalation contextConversationResolution
Loop controlPremature loops or clarifying tangentsThree clarifies, no answerCustomerAgentLoopDetection
Escalation handoffRight turn, right queue, full contextTrigger fired turn 3, agent kept goingCustomerAgentHumanEscalation
Persona adherenceHolding character across the sessionSarcasm by turn 8 after pushbackCustomerAgentPromptConformance

The discipline: score each trajectory end-to-end against all of these, gate CI on per-trajectory floors (not the mean), and run the same rubrics on live conversation root spans at 5 to 10 percent sampling.

Why averaging turn scores hides the bug

A single-turn rubric scores (input, output) against a reference. It cannot see the cumulative shape of the trajectory. Three failure modes show up in postmortem after postmortem when teams shipped on per-turn means:

  • Contradicted itself across turns. Turn 2 cited the US 30-day refund window. Turn 9 quoted the EU 14-day window for the same case after the user pushed back. Per-turn Groundedness scored 0.93 on both because each answer was supported by some retrieved chunk. The conversation-level reading is a coherence break. The agent gave the user two incompatible policies in the same thread.
  • Forgot what the user said. Turn 1: user mentioned California. Turn 3: agent asked for the user’s region. Turn 5: agent gave a generic answer ignoring the California-specific clause. Each turn passed isolated grounding. The trajectory lost the user’s first message inside its own clarifying loop.
  • Walked off-topic and never came back. User asked about a chargeback. Five turns of clarifying questions later, the agent was negotiating a discount code. No per-turn rubric fires on this. The intent the user came in with was abandoned. The agent’s last response was perfectly grounded to the wrong task.

Per-turn scores stayed green on all three. The bugs only become visible when the rubric input is the transcript, not the turn. Long context windows do not fix this; they shift the failure pattern from “the model forgot” to “the model has the context and ignores it,” which is harder to catch because the input looks fine.

The shift is one sentence: stop scoring (input, output) pairs and start scoring transcript → expected end state. Everything below is what that costs to operationalize.

The four trajectory metrics

A useful taxonomy for multi-turn conversation evaluation collapses to four trajectory-level metrics, each with a different input shape and a different failure family. Persona, loop control, and escalation correctness layer on top.

Conversation Coherence

Are turns internally consistent across the session? The rubric input is the full transcript. It penalises contradictions between turns, restated questions the user already answered, and tool calls that ignore parameters set earlier in the conversation. ConversationCoherence (eval_id=1) ships this. The expensive failure mode it catches: an agent that grounds each turn against a fresh chunk and ends up giving the user three incompatible answers across one thread.

Context Retention

Did the agent use what the user told it? The rubric input is the transcript plus an optional structured list of facts the user provided (region, account tier, allergy, prior order id, urgency). It penalises re-asking facts the user already gave, ignoring them in a downstream tool call, and quietly forgetting them by the time the answer ships. CustomerAgentContextRetention covers this directly. The fix that gets most teams 80 percent of the way is a conversation-aware query rewriter that folds the cumulative state into the standalone retrieval query before the retriever fires.

Intent Drift

Did the agent stay on the task the user came in with? The rubric input is the user’s first message, the transcript, and the final agent response. It penalises sessions where the user asked one thing and the agent ended up doing another: slow drift under user pressure, off-topic walkbacks, and topic switches the agent failed to refuse. The SDK does not ship an IntentDrift template, but CustomLLMJudge with a four-line rubric (score 1.0 if the final response addresses the user's first-turn intent; 0.0 if the conversation drifted off the asked task; partial for related-but-not-equivalent drift) handles it. Pair with CustomerAgentQueryHandling for the intent-capture side.

Conversation Resolution

Did the dialogue reach the expected end state? The rubric input is the transcript, the system prompt, and the expected outcome (refund issued, booking confirmed, ticket escalated with summary attached, lead qualified, or correctly refused). ConversationResolution (eval_id=2) ships this. Pair with TaskCompletion (eval_id=99) on the answerable subset for the end-to-end outcome score. A trajectory that ended at “I will look into that” without a follow-up action is a resolution failure regardless of how coherent each turn read.

Three more rubrics sit alongside as trajectory-level signals. CustomerAgentLoopDetection catches premature loops and clarifying-tangent dead ends. CustomerAgentHumanEscalation scores whether the agent escalated at the right turn with the right context attached. CustomerAgentPromptConformance scores persona adherence across the full session: sarcasm by turn 8 is a regression even if each line was technically correct. Per-turn rubrics score each turn as coherent; only the trajectory rubric catches the drift.

Step 1: build the trajectory test set

A trajectory eval set has two sources. Mix them in fixed ratios; one without the other underfits or overfits.

Real production replays (50-60 percent of the set). Pull conversations where CSAT was low, a guardrail fired, the bot looped, the human inherited the case, or repeat contact happened within 72 hours. Lift them as gold trajectories with a domain-lead-reviewed expected_end_state. These are the failure shapes you have actually seen. If the agent ships, this is what it has to handle. Engineer-only labels grow noise; the gold end state needs a support lead, clinician, or product owner depending on domain.

Adversarial trajectories (40-50 percent of the set). These are the failures you have not seen yet, scripted with simulate-sdk. Cover five families. Persona pressure ladders: user pushes back across turns 4-7, agent must hold character. Crescendo jailbreaks: gradual context poisoning over many turns. Off-topic walkbacks: user slowly drags the agent off the task. Premature termination probes: user signals they are done before the task is resolved; agent should confirm, not auto-close. Missing entity recoveries: user gives a partial order id; agent should ask one targeted question, not three.

from fi.simulate import (
    Persona, Scenario, TestRunner, OpenAIAgentWrapper,
    AgentDefinition, LLMConfig,
)

agent_def = AgentDefinition(
    name="multi-turn-support-agent",
    llm_config=LLMConfig(model="gpt-4.1-mini", temperature=0.4),
    system_prompt=SYSTEM_PROMPT,
)

persona_pressure = Persona(
    persona={
        "name": "frustrated_pro_user",
        "communication_style": "challenging, pushes back across turns",
    },
    situation="User has a chargeback question and challenges the agent's "
              "policy reading on turn 4 and again on turn 6.",
    outcome="Agent holds policy, escalates to billing-sensitive queue with "
            "full thread summary, never drops formal persona.",
)

scenario = Scenario(
    name="billing_sensitive_persona_pressure",
    description="Persona-pressure ladder on a billing-sensitive case.",
    dataset=[persona_pressure],
)

wrapper = OpenAIAgentWrapper(agent_def)
runner = TestRunner(agent_wrapper=wrapper, scenarios=[scenario], max_turns=10)
report = runner.run()

Stratify the merged set across the four trajectory metrics and across the persona-by-scenario grid so a weakness in context retention cannot hide behind a high coherence mean. Start at 60-150 trajectories of 3-12 turns each: enough to score per-trajectory floors meaningfully, small enough to inspect every failure by hand for the first two weeks. Grow weekly by promoting failing production traces auto-clustered by the Error Feed, each one reviewed by a domain lead before the gold end state enters the dataset.

Step 2: per-trajectory scoring in CI

Most chatbot CI gates score per-turn and average. The gate stays green while the trajectory bugs ship. Invert the loop. The unit of CI is the trajectory; the floors are per-trajectory; the dataset is stratified so a single class of failure cannot drag everything down by one point and pass review.

from fi.evals import Evaluator
from fi.evals.templates import (
    ConversationCoherence, ConversationResolution, TaskCompletion,
    CustomerAgentContextRetention, CustomerAgentLoopDetection,
    CustomerAgentHumanEscalation, CustomerAgentPromptConformance,
)
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase

evaluator = Evaluator()

intent_drift = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "intent_drift",
        "model": "gpt-4.1",
        "grading_criteria": (
            "Score 1.0 if the final agent response addresses the user's "
            "first-turn intent. 0.0 if the conversation drifted off the "
            "asked task. Partial for related-but-not-equivalent drift. "
            "Provide a one-sentence reason citing the turn the drift "
            "started, if any."
        ),
    },
)

TRAJECTORY_FLOORS = {
    "conversation_coherence": 0.88,
    "conversation_resolution": 0.85,
    "customer_agent_context_retention": 0.90,
    "customer_agent_loop_detection": 0.92,
    "customer_agent_human_escalation": 0.95,
    "customer_agent_prompt_conformance": 0.90,
    "intent_drift": 0.95,  # zero tolerance on safety-tier trajectories
}

def test_trajectories(trajectory_dataset):
    failed = []
    for traj in trajectory_dataset:
        transcript = "\n".join(
            f"{t['role']}: {t['content']}" for t in traj.history
        )
        tc = TestCase(
            input=transcript,
            expected_output=traj.expected_end_state,
            context=traj.system_prompt,
            conversation=traj.history,
        )
        result = evaluator.evaluate(
            eval_templates=[
                ConversationCoherence(), ConversationResolution(),
                CustomerAgentContextRetention(),
                CustomerAgentLoopDetection(),
                CustomerAgentHumanEscalation(),
                CustomerAgentPromptConformance(),
            ],
            inputs=[tc],
        )
        scores = {r.name: r.output for r in result.eval_results}
        scores["intent_drift"] = intent_drift.evaluate([tc])[0].output
        for metric, floor in TRAJECTORY_FLOORS.items():
            if scores[metric] < floor:
                failed.append((traj.id, metric, scores[metric], floor))
    assert not failed, f"trajectory failures: {failed[:5]}"

Three habits separate a working trajectory gate from theatre. Score per trajectory, not per turn. The transcript is the input, the expected end state is the reference. Stratify the dataset across personas and scenarios. Equal weight per persona-by-scenario cell; natural-distribution accuracy hides persona-pressure failures behind a pile of happy paths. Diff against a moving baseline. Alarm on a 2-point sustained drop, not every change. Trajectory rubrics are noisier per-trajectory than per-turn scores, and over-alarming on the noise floor kills the habit of looking at the gate.

Per-trajectory scoring is more expensive than per-turn averaging because the LLM judge sees the whole transcript for every conversation. The discipline that earns it back is sampling. Per-trajectory rubrics gate the curated regression set in CI; deterministic checks (intent emission, tool-call schema, PII presence) still run on 100 percent of turn-level traffic.

Step 3: production session-level observability

The CI gate catches the trajectory bugs you can think of. Production catches everything else. The bridge that makes both possible is the same: every span carries session.id and user.id, so per-turn LLM, retrieval, and tool spans roll up into a conversation root span. Trajectory rubrics attach to that span, not the per-turn span tree.

from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType, EvalTag, EvalTagType, EvalSpanKind,
    EvalName, ModelChoices,
)
from openinference.instrumentation.openai import OpenAIInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="multi-turn-support-agent",
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.CONVERSATION,
            eval_name=EvalName.CONVERSATION_COHERENCE,
            model=ModelChoices.TURING_LARGE,
            mapping={"input": "input.value", "output": "output.value"},
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.CONVERSATION,
            eval_name=EvalName.CONVERSATION_RESOLUTION,
            model=ModelChoices.TURING_LARGE,
            mapping={"input": "input.value", "output": "output.value"},
        ),
    ],
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

EvalSpanKind.CONVERSATION is the trajectory-level surface. The rubric input is the rolled-up transcript, scored when the session closes. Sample at 5-10 percent of live traffic for the LLM-judge trajectory rubrics; the cost of running every trajectory metric on every conversation is prohibitive and unnecessary. Deterministic checks (intent emission, schema validity, PII presence) keep running on every turn at 100 percent.

Four production-only trajectory signals to alarm on. Coherence drift: a 3-point drop on the conversation coherence histogram over a week usually means a system-prompt update tipped the model into contradicting itself; pair with a prompt rollback. Context retention drift: rising mean of “agent re-asked” annotations means the conversation-aware query rewriter regressed or the retrieval namespace changed. Resolution rate per scenario class: a drop on the billing-sensitive class with stable coherence usually means the escalation handoff stopped including the staged summary the human queue needs. Intent drift rate: rising on long sessions usually means the agent’s clarifying policy is breaking; the conversation is being walked off-topic before the user gets an answer.

The Error Feed closes the loop. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups failing trajectories into named issues. A Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars, prompt-cache hit ratio near 90 percent) writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each). A typical cluster on a multi-turn agent reads: persona_drift_after_pushback / 47 sessions over 24h / agent drops formal persona on turn 5-7 after user challenges the recommendation; immediate fix: harden system prompt to restate reasoning rather than agree. The cluster summary feeds the Platform’s self-improving evaluators, so the same rubric that scored persona conformance gets sharper as drifted conversations are observed and labelled. Engineers cannot promote a failing trace to the dataset on their own; the gold end state needs domain-lead review.

How Future AGI fits multi-turn evaluation

Future AGI ships the eval stack as a package. Start with the SDK for code-defined trajectory rubrics. Graduate to the Platform for self-improving evaluators tuned by domain-lead feedback.

  • ai-evaluation SDK (Apache 2.0). The trajectory templates ship as classes: ConversationCoherence, ConversationResolution, TaskCompletion, plus the 11-template CustomerAgent family (CustomerAgentClarificationSeeking, CustomerAgentContextRetention, CustomerAgentConversationQuality, CustomerAgentHumanEscalation, CustomerAgentInterruptionHandling, CustomerAgentLanguageHandling, CustomerAgentLoopDetection, CustomerAgentObjectionHandling, CustomerAgentPromptConformance, CustomerAgentQueryHandling, CustomerAgentTerminationHandling). CustomLLMJudge handles intent drift and any product-specific trajectory rubric in a few lines. 50+ evaluators total, 20+ heuristic metrics locally.
  • simulate-sdk. Persona plus Scenario plus TestRunner drives the adversarial side of the trajectory set across OpenAIAgentWrapper, LangChainAgentWrapper, GeminiAgentWrapper, and AnthropicAgentWrapper. ScenarioGenerator auto-produces variants for the persona-pressure and crescendo families. Voice agents get the same shape via TTSConfig and STTConfig.
  • traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, and C#; 14 span kinds with first-class CONVERSATION, AGENT, and TOOL; trajectory rubrics attach via EvalTag with EvalSpanKind.CONVERSATION so the same rubric in CI runs on live sessions.
  • Future AGI Platform. Self-improving evaluators tuned by domain-lead thumbs feedback. In-product authoring agent writes trajectory rubrics from natural-language descriptions. Classifier-backed evals at lower per-eval cost than Galileo Luna-2.
  • Error Feed (inside the eval stack). HDBSCAN clustering plus a Sonnet 4.5 Judge writes the immediate_fix; domain-lead-reviewed promotions feed the dataset and the self-improving evaluators. Linear OAuth wired today; Slack, GitHub, Jira, and PagerDuty on the roadmap.
  • Agent Command Center. OpenAI-compatible AI gateway in a single Go binary (Apache 2.0); 100+ providers; 18+ built-in guardrail scanners with EvalSpanKind.CONVERSATION carrying trajectory scores back into the same trace tree. SOC 2 Type II, HIPAA, GDPR, and CCPA certified; ISO/IEC 27001 in active audit.

Three honest tradeoffs. Trajectory rubrics cost more per call than turn-level rubrics because the LLM judge reads the whole transcript; the discipline is to gate CI on the curated regression set and sample production at 5-10 percent. Custom intent-drift judges need calibration against a domain-lead-labelled hold-out before the floor is meaningful; a judge that fires on every off-topic turn is noise. Self-improving evaluators need a pinned hold-out and quarterly calibration review or the drift moves with the data and no one notices.

Ready to score your first trajectories? Wire ConversationCoherence, ConversationResolution, CustomerAgentContextRetention, CustomerAgentLoopDetection, CustomerAgentHumanEscalation, and a CustomLLMJudge for intent drift into a pytest fixture this afternoon against the ai-evaluation SDK, then attach the same rubrics to live conversation root spans via EvalSpanKind.CONVERSATION when production traces start asking questions the CI gate missed.

Frequently asked questions

Why is averaging turn-level scores wrong for multi-turn evaluation?
A turn-level rubric scores one input-output pair. The conversation-level failures most teams ship hide in the gaps between turns: the agent ignores a fact the user gave on turn one, drops persona on turn six after pushback, escalates four turns too late, or stitches an answer on turn nine that contradicts the policy it cited on turn two. Each individual turn looks coherent and grounded. The trajectory does not. Averaging hides the bugs because the per-turn floor passes and the mean stays high; the failure only surfaces in CSAT, repeat-contact, or a customer complaint two weeks later. The unit of multi-turn eval is the conversation trajectory, not the turn.
What are the four trajectory-level metrics that catch what averaging misses?
Conversation Coherence scores whether turns are internally consistent across the session — the agent does not contradict its earlier statements, restate questions the user already answered, or call tools with parameters that conflict with what it knew three turns ago. Context Retention scores whether the agent used facts the user provided earlier — region, account tier, prior order id, allergy — instead of re-asking or quietly forgetting. Intent Drift scores whether the agent stayed on the task it was asked to do, instead of being slowly walked off-topic by user pressure or its own clarifying tangents. Conversation Resolution scores whether the dialogue reached the expected end state (refund issued, booking confirmed, ticket escalated with full context). Future AGI ships ConversationCoherence and ConversationResolution as templates; the other two are built with CustomerAgentContextRetention and a CustomLLMJudge for intent drift.
How do I build a multi-turn test set that catches real failures?
Two sources, mixed in fixed ratios. Real production replays cover the failure shapes you have actually seen — pull conversations where CSAT was low, a guardrail fired, the bot looped, or a human override followed the bot's response, and lift them as gold trajectories with a domain-lead-reviewed expected end state. Adversarial trajectories cover the failures you have not seen yet — persona pressure ladders, crescendo jailbreaks, off-topic walkbacks, premature-termination probes, missing-entity recoveries. Future AGI's simulate-sdk uses Persona plus Scenario to script these. Stratify the set across the four trajectory metrics so a weakness in context retention cannot hide behind a high coherence score. Start at 60 to 150 trajectories of 3 to 12 turns each.
What does per-trajectory CI scoring look like?
Run the agent against each gold trajectory end-to-end. Score the entire transcript against ConversationCoherence, ConversationResolution, CustomerAgentContextRetention, CustomerAgentLoopDetection, CustomerAgentHumanEscalation, and a CustomLLMJudge for intent drift. Gate CI on per-trajectory floors (coherence >= 0.88, resolution >= 0.85, context retention >= 0.90, zero intent drift on safety-tier trajectories) rather than the mean across the set. Stratify the dataset across personas and scenario types so a single category of failure cannot drag everything down by one point and pass review. Diff against a moving baseline, not absolute thresholds, so a 2-point sustained drop blocks release.
How does production session-level tracing pair with trajectory eval?
Production catches what the test set has not seen yet. Every span carries session.id and user.id, so traceAI rolls per-turn LLM, retrieval, and tool spans into a conversation root span. The same trajectory rubrics that gate CI attach to that conversation span via EvalTag with EvalSpanKind.CONVERSATION, sampled at 5 to 10 percent of live sessions to keep judge cost bounded. Deterministic checks (intent emission, tool-call schema validity, PII presence) run on 100 percent. The session-level rollup is what makes context retention and persona drift visible at all — a per-turn span tree does not have the right input shape to score either.
What's the difference between simulation, replay, and live eval?
Simulation generates trajectories the agent has not seen by scripting Personas and Scenarios with simulate-sdk. It catches known failure modes (crescendo, persona pressure, off-topic walkback) before they reach production. Replay pulls real production conversations through the same evaluator pipeline that scored the simulator set; the same rubric definitions run in both. Live eval samples 5 to 10 percent of production sessions and runs the rubrics inline against the conversation root span, with the score written back to the trace. The discipline is to use the same rubric in all three so the scores are comparable; if simulation, replay, and live eval each use their own judge, the team will argue about which number is real instead of fixing trajectory bugs.
How does Future AGI fit the multi-turn eval workflow?
Future AGI ships the eval stack as a package. The ai-evaluation SDK (Apache 2.0) gives you the code-first surface: ConversationCoherence and ConversationResolution as trajectory templates, the 11-template CustomerAgent family (ClarificationSeeking, ContextRetention, ConversationQuality, HumanEscalation, InterruptionHandling, LanguageHandling, LoopDetection, ObjectionHandling, PromptConformance, QueryHandling, TerminationHandling), and CustomLLMJudge for intent drift and persona-specific rubrics. simulate-sdk drives the test set with Persona, Scenario, and TestRunner across OpenAI, LangChain, Gemini, and Anthropic wrappers. traceAI (Apache 2.0) groups spans into sessions and runs the same rubrics on live conversation roots via EvalTag with EvalSpanKind.CONVERSATION. Error Feed clusters failing trajectories with HDBSCAN, and a Sonnet 4.5 Judge writes the immediate fix. The Platform layers self-improving evaluators tuned by domain-lead feedback at lower per-eval cost than Galileo Luna-2.
Related Articles
View all