Guides

Simulated Multi-Turn Conversation Eval (2026)

Build a simulated multi-turn eval that catches real failures: the Persona-Scenario-Adversary triangle, FAGI simulate-sdk patterns, trajectory scoring.

March 1, 2026

Updated May 20, 2026

11 min read

multi-turn-evaluation simulation llm-evaluation agent-evaluation conversation-eval simulate-sdk 2026

Table of Contents

A team ships a refund agent. The single-turn eval set scores 94 percent. The replay set scores 89 percent. The team builds a simulated multi-turn harness because everyone says you should, runs a thousand conversations, and the dashboard reads green. Two weeks after launch, escalations triple. The post-mortem reads the same way every time: the simulator only ever ran cooperative personas, the conversations all followed a happy path, and nothing in the eval set looked like the real users who showed up on day three. That’s what the harness was supposed to catch.

The simulator was not broken. It was incomplete. A simulated multi-turn eval is a Persona-Scenario-Adversary triangle. Persona is WHO is talking. Scenario is WHAT they want. Adversary is HOW they push back. Skip any leg and your simulator runs happy paths only, which is exactly where production breaks. This post is the methodology: how to design each leg, how to score the resulting trajectories, and which FAGI surfaces map to each step.

TL;DR

Triangle leg	What it carries	FAGI surface
Persona	Goal, personality, hidden constraints, drop probability	`Persona(persona=, situation=, outcome=)` in simulate-sdk
Scenario	Intent, happy-path expectation, edge-case variants	`Scenario(name=, dataset=[personas])`
Adversary	Objection, retry, escalation, off-script, persona pressure	adversary-pattern persona templates (5 families)
Trajectory scoring	End-to-end rubric over the transcript	`ConversationCoherence`, `ConversationResolution`, CustomerAgent family
Realism check	Did the simulator stay in character?	`CustomLLMJudge` for `SimulatorRealism`
Trace + cluster	Span tree per conversation, failure clusters	traceAI `EvalSpanKind.CONVERSATION`, Error Feed

If you only build three pieces, build personas with hidden constraints, an objection adversary, and end-of-conversation trajectory scoring. Everything below is what keeps the loop honest over releases.

The triangle

Three things have to be true for a simulated conversation to find real bugs. The simulator has to play a believable user with their own goals and constraints. The conversation has to head somewhere: a refund, a booking, a policy clarification, an escalation. The user has to push back when the agent stalls, contradicts itself, or asks for the same fact twice. Persona, Scenario, Adversary. Three legs. None optional.

The trap most simulation harnesses fall into is collapsing the triangle into one leg. The “cooperative customer asking about refunds” persona is all three legs squashed into a single test case: persona is thin, scenario is implicit, adversary is absent. A thousand of these conversations do not produce coverage; they produce a thousand near-identical happy paths.

Each leg does a different job. Persona carries the lexical surface. The words a frustrated user types are not the words a curious newcomer types, and an agent that overfits to one register breaks on the other. Scenario carries the conversational shape: refund-with-fraud-flag is a different shape from refund-window-expired, and the agent’s tool sequence differs. Adversary carries the resilience surface. Does the agent hold position when the user pushes back, or does it cave to whatever the user just said? If you cannot tell which leg a given test case is exercising, the test case is not pulling its weight.

The corollary: the cross-product of the three legs is your conversation matrix. A 5-persona by 8-scenario by 3-adversary grid is 120 cells; at 3 seeds per cell, 360 conversations per release. That matrix is what makes pass-rate trends comparable across releases and what makes simulation a research artifact instead of a dashboard.

Persona generation patterns

A persona is not a one-liner. The personas that produce useful failures all carry four things: a goal, a personality, three to five hidden constraints, and behavior knobs.

PERSONAS = [
    {
        "id": "frustrated_escalator",
        "goal": "Refund a $200 duplicate charge from March 14.",
        "personality": (
            "Polite for two turns, then increasingly terse. "
            "Threatens to call corporate at turn five if unresolved."
        ),
        "constraints": [
            "Does not know the order ID; shares email only if asked twice",
            "Will not paste a card number under any circumstance",
            "Mentions a screenshot but cannot actually attach files",
        ],
        "frustration_threshold": 3,
        "persona_drop_prob": 0.05,
        "communication_style": "short sentences, occasional caps, no emoji",
    },
    {
        "id": "curious_newcomer",
        "goal": "Understand whether the product supports SAML SSO.",
        "personality": "Asks follow-up questions, reads docs links, low frustration ceiling.",
        "constraints": [
            "Does not know what SCIM is and will ask",
            "Shares company size only if asked",
        ],
        "frustration_threshold": 6,
        "persona_drop_prob": 0.02,
        "communication_style": "polite, two- to three-sentence turns, asks clarifying questions",
    },
]

Four patterns that earn their keep. Hidden constraints over disclosed facts. A persona that types its full order number on turn one tests nothing. A persona that knows its order number but only shares it after the agent asks twice tests memory and confirmation. Multi-shot examples beat persona descriptions. A four-line persona description drifts inside five turns; the same persona with three sample turns of realistic phrasing holds character. Behavior knobs make personas reusable. frustration_threshold and persona_drop_prob are dials, not constants. The same persona record across three threshold values gives you three trajectories without writing three personas. Mix model families across the boundary. Simulator on gpt-4o-mini against an agent on Claude Sonnet finds bugs neither family would find against itself; shared blind spots silently inflate pass rates.

Ten well-crafted personas beat fifty thin ones. Coverage scales through scenario and adversary multiplication, not through persona count. For the deeper pattern library on persona drift and style consistency, see evaluating LLM personas.

Scenario design

A scenario is a triple: intent, happy-path expected behavior, and edge-case variants. Pull intents from real production traces where you have them and from product documentation where you don’t. Every persona crosses every scenario it makes sense for, but not every combination. The curious-newcomer persona on a fraud-flag scenario is nonsense, and the test runner should skip it.

SCENARIOS = [
    {
        "id": "refund_duplicate_charge",
        "intent": "Refund a duplicate charge",
        "happy_path": (
            "Agent verifies email, looks up charge, confirms duplicate, "
            "issues refund, gives ETA."
        ),
        "edge_cases": [
            "Charge is not actually duplicate (similar amount, different merchant)",
            "Refund window has expired",
            "Account is flagged for fraud review",
        ],
        "compatible_personas": [
            "frustrated_escalator", "curious_newcomer", "hostile_user",
        ],
        "expected_tools": ["lookup_charge", "issue_refund"],
    },
]

Three habits separate working scenario design from theatre. Edge cases are first-class, not afterthoughts. Each named edge case is a separate test run with its own expected behavior. The duplicate-charge happy path and the duplicate-charge-with-fraud-flag edge case are different rows. Compatibility lists prune the grid. Without them the runner generates nonsense combinations that waste budget and pollute pass-rate aggregates. Expected tool sequences make trajectory scoring tractable. Knowing what the agent should have called gives the trajectory rubric something to compare against; without it ConversationResolution is scoring vibes.

The dataset grows two ways. Promote failing production conversations into new scenario rows, with a domain-lead-reviewed expected end state. That’s the loop described in the LLM evaluation playbook. Generate adversarial variants programmatically: the simulate-sdk ScenarioGenerator takes a topic and an AgentDefinition and produces persona variants for a given scenario, which is the fastest way to seed the matrix when you don’t have production traces yet.

Adversary patterns

The adversary leg is the most commonly skipped and the most expensive to skip. A simulator that agrees with whatever the agent says measures compliance, not resilience. Five patterns cover most of the failure space.

Objection ladders. The user pushes back on the first incorrect answer and escalates if the agent does not self-correct. The persona is scripted to disagree on turn three regardless of what the agent says, then to demand a citation on turn five, then to ask for a supervisor on turn seven. The agent that placates (“you’re absolutely right, let me reconsider”) fails. The agent that holds position with evidence passes. This pattern finds the bug nothing else finds: agents that have learned to agree with the user under any pressure.

Retry pressure. The user repeats the same request in three different framings across the conversation, sometimes with conflicting details. “What’s the refund window?” on turn two. “How long do I have to return this?” on turn five. “Can I still get my money back?” on turn eight. A grounded agent gives the same answer all three times. A drifting agent gives three different windows. Per-turn Groundedness does not catch this. The failure is the inconsistency across turns, not any single turn.

Escalation pressure. The persona signals impatience after N unhelpful turns and demands a manager or supervisor. The eval question isn’t whether the agent transfers. It’s whether the agent transfers at the right turn with the right context attached. Too early is a soft fail. Too late is a hard fail. No context summary attached is a hard fail. CustomerAgentHumanEscalation scores this directly.

Off-script drops. The persona randomly injects distracted-user behavior: switches topics mid-turn, asks an unrelated question, goes silent for a turn, pastes a wrong order number. The agent that loses thread is the agent that fails when real users multi-task. A small per-turn persona_drop_prob (2-5 percent) is enough; higher rates make the conversation noise rather than test signal.

Persona pressure. The user emotionally loads the conversation (frustrated, hostile, distressed) and the agent has to hold its own persona across the turns. Sarcasm by turn eight is a regression even if each individual line was technically correct. CustomerAgentPromptConformance scores adherence; the failure pattern is the agent matching the user’s register instead of holding its system-prompt voice. The full trajectory-eval treatment of this lives in evaluating multi-turn conversations.

Build at least one persona per family. The cost of one extra persona is negligible compared to the cost of one persona-pressure failure shipping to production. The single most common simulation bug across every team we have seen run this loop is the absence of an objection ladder. Start there if you build nothing else.

Trajectory scoring on simulated conversations

The output of a simulator run is a transcript. The unit of scoring is the trajectory, not the turn. Two passes per conversation: end-of-conversation rubrics for outcome and persona behavior, per-turn rubrics only when the trajectory failed and you need to find the exact turn that caused it.

from fi.evals import Evaluator
from fi.evals.templates import (
    ConversationCoherence, ConversationResolution, TaskCompletion,
    CustomerAgentContextRetention, CustomerAgentLoopDetection,
    CustomerAgentHumanEscalation, CustomerAgentPromptConformance,
)
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase

evaluator = Evaluator()

simulator_realism = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "simulator_realism",
        "model": "gpt-4.1",
        "grading_criteria": (
            "Score 1.0 if the user side of the transcript reads like a real "
            "user with the given persona. Penalize over-disclosure of hidden "
            "constraints, generic phrasing, breaking character, and ending "
            "every turn with a polite closer. Provide a one-sentence reason."
        ),
    },
)

def score_trajectory(persona, scenario, transcript):
    tc = TestCase(
        input=scenario["intent"],
        output=format_transcript(transcript),
        expected_output=scenario["happy_path"],
        context=persona["personality"],
    )
    result = evaluator.evaluate(
        eval_templates=[
            ConversationCoherence(), ConversationResolution(),
            CustomerAgentContextRetention(),
            CustomerAgentLoopDetection(),
            CustomerAgentHumanEscalation(),
            CustomerAgentPromptConformance(),
            TaskCompletion(),
        ],
        inputs=[tc],
    )
    scores = {r.name: r.output for r in result.eval_results}
    scores["simulator_realism"] = simulator_realism.evaluate([tc])[0].output
    return scores

Three habits that make the scoring useful. Score the simulator, not just the agent. SimulatorRealism runs against the user side of the transcript. If realism scores drop below 0.7, drop the conversation from the eval rollup; half your failure clusters would otherwise be simulator artifacts, not agent bugs. Floor per trajectory, not mean across set. A single hostile-persona failure that drags from 0.95 to 0.55 should block release; averaging it with 199 happy-path passes hides it inside a 0.93 mean. Stratify the rollup by adversary family. Pass rate on objection ladders is the metric that predicts production CSAT; pass rate on cooperative happy paths is the metric that predicts nothing.

Every trajectory carries a session.id so traceAI rolls turns into a conversation root span. Attach the trajectory rubrics with EvalSpanKind.CONVERSATION so the same rubrics that gate CI also run on production conversation roots at 5-10 percent sampling, same definitions, comparable scores. That is the pattern from agent observability versus evaluation versus benchmarking.

The FAGI simulate-sdk

The simulate-sdk encodes the triangle as code primitives so you do not maintain an orchestration layer yourself. Persona carries persona dict, situation, and outcome. Scenario carries a name and a list of personas. TestRunner orchestrates the loop. Wrappers (OpenAIAgentWrapper, LangChainAgentWrapper, GeminiAgentWrapper, AnthropicAgentWrapper) let you plug in the agent under test without rewriting your stack.

from fi.simulate import (
    Persona, Scenario, TestRunner, AgentDefinition, LLMConfig,
)

agent_def = AgentDefinition(
    name="refund-agent",
    url="wss://your-livekit-server.com",
    room_name="agent-room",
    system_prompt=SYSTEM_PROMPT,
    llm=LLMConfig(model="gpt-4.1", temperature=0.3),
)

objection_ladder = Persona(
    persona={
        "name": "frustrated_escalator",
        "communication_style": "short sentences, pushes back on first answer",
        "constraints": ["only shares email if asked twice", "no card number"],
    },
    situation=(
        "User has a duplicate $200 charge from March 14. Pushes back on the "
        "agent's first attempt and demands a supervisor at turn five if "
        "unresolved."
    ),
    outcome=(
        "Agent verifies email, looks up charge, confirms duplicate, issues "
        "refund with ETA, holds formal persona across pushback."
    ),
)

scenario = Scenario(
    name="refund_objection_ladder",
    description="Objection-ladder adversary on a duplicate-charge scenario.",
    dataset=[objection_ladder],
)

runner = TestRunner()
report = await runner.run_test(
    agent_definition=agent_def,
    scenario=scenario,
    max_seconds=180.0,
    record_audio=False,
)

Two non-obvious wins. ScenarioGenerator takes a topic and an AgentDefinition and auto-produces persona variants. That is the fastest way to seed the matrix when you do not have production traces yet, and the cheapest way to grow the adversary side once you have the first three personas working. evaluate_report runs the eval pipeline against the produced TestReport, attaching scores per template so the trace tree carries trajectory rubrics on the same span as the simulated conversation. For voice agents, the same Persona and Scenario shapes drive LiveKit-rooted runs with TTSConfig and STTConfig carrying the speech surface; the simulator becomes a voice user without rewriting the test case.

How Future AGI fits simulated multi-turn evaluation

Future AGI ships the eval stack as a package. Start with the SDK for code-defined trajectory rubrics and simulate-sdk for the persona-scenario-adversary grid. Graduate to the Platform for self-improving evaluators tuned by domain-lead feedback.

simulate-sdk. Persona plus Scenario plus TestRunner encodes the triangle. OpenAIAgentWrapper, LangChainAgentWrapper, GeminiAgentWrapper, and AnthropicAgentWrapper plug in the agent under test. ScenarioGenerator auto-produces persona variants from a topic. Voice agents get the same shape via TTSConfig and STTConfig. Runs in local LiveKit mode for voice or cloud mode against the FAGI backend.
ai-evaluation SDK (Apache 2.0). ConversationCoherence and ConversationResolution ship as trajectory templates. The 11-template CustomerAgent family covers context retention, loop detection, human escalation, prompt conformance, objection handling, query handling, and termination, every rubric the triangle needs. CustomLLMJudge handles SimulatorRealism, intent drift, and product-specific persona rubrics. 50+ evaluators across the SDK and 20+ heuristic metrics locally.
traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, and C#; 14 span kinds with first-class CONVERSATION, AGENT, and TOOL. Trajectory rubrics attach via EvalTag with EvalSpanKind.CONVERSATION, so the same rubric in CI runs on live conversation roots at sampling.
Future AGI Platform. Self-improving evaluators tuned by domain-lead thumbs feedback. In-product authoring agent writes persona and trajectory rubrics from natural-language descriptions. Classifier-backed evals at lower per-eval cost than Galileo Luna-2, which is what keeps the simulation loop affordable when you run it on every PR.
Error Feed (inside the eval stack). HDBSCAN soft-clustering over failing conversation embeddings, then a Sonnet 4.5 Judge writes the immediate_fix proposal: cluster description, representative transcripts, suspected root cause, and a one-line prompt or tool change to test. Domain-lead review gates promotion into the dataset.
Agent Command Center. OpenAI-compatible AI gateway in a single Go binary (Apache 2.0); 100+ providers; routing strategies let the simulator run on a cheap model while the agent under test runs on production routing, billed on separate cost headers. SOC 2 Type II, HIPAA, GDPR, and CCPA certified; ISO/IEC 27001 in active audit.

Three honest tradeoffs. Trajectory rubrics cost more per call than turn-level rubrics because the judge reads the whole transcript; the discipline is to gate CI on the curated regression set and sample production at 5-10 percent. SimulatorRealism judges need calibration against a domain-lead-labelled hold-out before the threshold is meaningful; a judge that fires on every off-script turn is noise. Self-improving evaluators need a pinned hold-out and quarterly calibration review or the drift moves with the data and no one notices.

What to build first

Build the smallest harness that exercises all three legs at once. Three personas (cooperative, frustrated escalator, ambiguous intent). One scenario with two edge cases. One adversary pattern (objection ladder) wired through the frustrated-escalator persona. Score with ConversationCoherence, ConversationResolution, CustomerAgentHumanEscalation, and a SimulatorRealism CustomLLMJudge. Run 50 conversations through simulate-sdk, read the trace tree, fix the top failure mode by hand, run 50 more.

That’s two weeks of work. By the end of it you have the persona-scenario-adversary matrix, the trajectory rubrics, and the cluster-driven triage loop. The harness ends up being the cheapest part of the eval program; the value is in what the hostile escalator on the duplicate-charge fraud-flag scenario finds before your real users hit it. Skip the adversary leg and the harness produces a dashboard. Build all three legs and it produces failure modes you can fix before they ship.

Frequently asked questions

What is a simulated multi-turn conversation eval, and what does it actually catch?

It is an offline harness where one LLM plays a scripted user persona across many turns against your real agent, and a second pipeline scores the resulting transcript. The point of building it is not volume — it is to surface failure modes your production logs cannot. Real logs are sparse on a new agent, biased toward early adopters, and impossible to replay legally in healthcare or finance. Simulation lets you script the hostile escalator, the ambiguous-intent newcomer, the off-script user, and the persona-pressure ladder, then score the resulting trajectory the same way you would score replays. A simulator that only runs cooperative personas is worth less than no simulator at all — it ships happy-path coverage and a false sense of safety.

What is the Persona-Scenario-Adversary triangle, and why does skipping a leg break the eval?

Persona is WHO is talking — goal, personality, hidden constraints, communication style. Scenario is WHAT they want — intent, happy-path expected behavior, edge-case variants the conversation can hit. Adversary is HOW they push back — objection patterns, retry pressure, escalation thresholds, off-script drops. Skip persona and every conversation is the same flat user; skip scenario and you cannot tell whether a regression is on intent or persona; skip adversary and the simulator agrees with everything the agent says and the agent learns to placate, not resolve. The triangle is the minimum design space. All three legs go into every test case before the runner fires.

How many adversary patterns should I script, and which ones matter most?

Five families cover most of the failure space. Objection ladders push back on the first incorrect answer and escalate if the agent does not self-correct. Retry pressure repeats the same request in slightly different framings to test whether the agent contradicts itself. Escalation pressure signals impatience after N unhelpful turns and demands a manager or supervisor. Off-script drops randomly inject distracted-user behavior — switching topics mid-turn, going silent, asking unrelated questions. Persona pressure walks the agent off its persona under emotional load. Start with one persona per family and grow as failure clusters emerge. The single most expensive omission is no objection ladder — that pattern finds bugs nothing else does.

How do I keep the simulator from looking like an obvious eval bot?

Give every persona a goal, a personality, and three to five hidden constraints so it does not over-disclose. Include multi-shot examples of realistic phrasing in the persona prompt — a persona without four or five sample turns drifts inside five turns. Vary the random seed and temperature so the same persona produces different conversation paths. Drop persona at a small per-turn probability to simulate distracted users. Run a SimulatorRealism custom judge after each conversation and prune personas that score below threshold. The simulator score is part of the eval surface — if you do not score the simulator, half your failure clusters end up being simulator artifacts and the team debugs the wrong bugs.

Which FAGI rubrics do I attach to simulated trajectories?

ConversationCoherence and ConversationResolution as the trajectory anchors, CustomerAgentContextRetention for memory, CustomerAgentLoopDetection for clarifying-tangent dead ends, CustomerAgentHumanEscalation for escalation correctness, and CustomerAgentPromptConformance for persona adherence. CustomLLMJudge instances handle intent drift, persona-specific behavior, and simulator realism. TaskCompletion (eval_id=99) closes the outcome loop on answerable trajectories. Score the entire transcript end-to-end against these rubrics; per-turn scoring is for debugging individual turns once the trajectory-level gate has fired.

What does the FAGI simulate-sdk give me that hand-rolling does not?

Persona, Scenario, and TestRunner give you the triangle as code primitives — Persona carries persona dict, situation, outcome; Scenario carries a name and a list of personas; TestRunner orchestrates the loop in either local LiveKit mode for voice agents or cloud mode against the FAGI backend. OpenAIAgentWrapper, LangChainAgentWrapper, GeminiAgentWrapper, and AnthropicAgentWrapper let you plug the agent under test into the runner without rewriting your stack. ScenarioGenerator auto-produces persona variants from a topic and an AgentDefinition. evaluate_report runs the eval pipeline against the produced TestReport. The hand-rolled version of this is 200 lines of orchestration you have to maintain yourself.

Can the simulator and the agent run on different model families?

Yes, and they should. If both sides run on the same model family the simulator and the agent share blind spots and the eval silently overstates pass rates. A common pairing is gpt-4o-mini for the simulator (cheap, fast, good persona stability) against your production model for the agent under test. The Agent Command Center gateway routes the simulator to the cheapest healthy provider and the agent to the production-quality routing strategy, so cost lands on separate headers and the simulation budget is measurable against the agent budget.

View all

Guides

Evaluating Pydantic AI Agents That Use MCP Tools (2026)

Evaluate Pydantic AI agents that call MCP tools in 2026: per-typed-output rubrics, tool-call argument fidelity, MCP security checks, dependency invariants.

Vrinda Damani · May 21, 2026

11 min

Guides

Evaluating AWS Bedrock Agents in 2026

Bedrock's built-in eval is dev-loop only. Score action-group correctness, KB retrieval quality, and guardrail precision/recall on every release.

Rishav Hada · May 19, 2026

11 min

Guides

LLM Eval Budget Allocation and Prioritization in 2026

Eval budget is four knobs: rubric coverage, dataset size, judge tier, refresh cadence. Priority order that maximizes signal per dollar, with a 90-day plan.

NVJK Kartik · May 19, 2026

12 min

TL;DR

The triangle

Persona generation patterns

Scenario design

Adversary patterns

Trajectory scoring on simulated conversations

The FAGI simulate-sdk

How Future AGI fits simulated multi-turn evaluation

What to build first

Related reading

Frequently asked questions