Simulated Multi-Turn Conversation Eval (2026)
Build a simulated multi-turn eval that catches real failures: the Persona-Scenario-Adversary triangle, FAGI simulate-sdk patterns, trajectory scoring.
A team ships a refund agent. The single-turn eval set scores 94 percent. The replay set scores 89 percent. The team builds a simulated multi-turn harness because everyone says you should, runs a thousand conversations, and the dashboard reads green. Two weeks after launch, escalations triple. The post-mortem reads the same way every time: the simulator only ever ran cooperative personas, the conversations all followed a happy path, and nothing in the eval set looked like the real users who showed up on day three. That’s what the harness was supposed to catch.
The simulator was not broken. It was incomplete. A simulated multi-turn eval is a Persona-Scenario-Adversary triangle. Persona is WHO is talking. Scenario is WHAT they want. Adversary is HOW they push back. Skip any leg and your simulator runs happy paths only, which is exactly where production breaks. This post is the methodology: how to design each leg, how to score the resulting trajectories, and which FAGI surfaces map to each step.
TL;DR
| Triangle leg | What it carries | FAGI surface |
|---|---|---|
| Persona | Goal, personality, hidden constraints, drop probability | Persona(persona=, situation=, outcome=) in simulate-sdk |
| Scenario | Intent, happy-path expectation, edge-case variants | Scenario(name=, dataset=[personas]) |
| Adversary | Objection, retry, escalation, off-script, persona pressure | adversary-pattern persona templates (5 families) |
| Trajectory scoring | End-to-end rubric over the transcript | ConversationCoherence, ConversationResolution, CustomerAgent family |
| Realism check | Did the simulator stay in character? | CustomLLMJudge for SimulatorRealism |
| Trace + cluster | Span tree per conversation, failure clusters | traceAI EvalSpanKind.CONVERSATION, Error Feed |
If you only build three pieces, build personas with hidden constraints, an objection adversary, and end-of-conversation trajectory scoring. Everything below is what keeps the loop honest over releases.
The triangle
Three things have to be true for a simulated conversation to find real bugs. The simulator has to play a believable user with their own goals and constraints. The conversation has to head somewhere: a refund, a booking, a policy clarification, an escalation. The user has to push back when the agent stalls, contradicts itself, or asks for the same fact twice. Persona, Scenario, Adversary. Three legs. None optional.
The trap most simulation harnesses fall into is collapsing the triangle into one leg. The “cooperative customer asking about refunds” persona is all three legs squashed into a single test case: persona is thin, scenario is implicit, adversary is absent. A thousand of these conversations do not produce coverage; they produce a thousand near-identical happy paths.
Each leg does a different job. Persona carries the lexical surface. The words a frustrated user types are not the words a curious newcomer types, and an agent that overfits to one register breaks on the other. Scenario carries the conversational shape: refund-with-fraud-flag is a different shape from refund-window-expired, and the agent’s tool sequence differs. Adversary carries the resilience surface. Does the agent hold position when the user pushes back, or does it cave to whatever the user just said? If you cannot tell which leg a given test case is exercising, the test case is not pulling its weight.
The corollary: the cross-product of the three legs is your conversation matrix. A 5-persona by 8-scenario by 3-adversary grid is 120 cells; at 3 seeds per cell, 360 conversations per release. That matrix is what makes pass-rate trends comparable across releases and what makes simulation a research artifact instead of a dashboard.
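As a minimal sketch of that matrix, the cross-product can be enumerated with the standard library. The leg labels below are hypothetical placeholders; only the counts (5 by 8 by 3, with 3 seeds per cell) come from the paragraph above.

```python
from itertools import product

# Hypothetical leg labels; only the counts mirror the text.
personas = [f"persona_{i}" for i in range(5)]
scenarios = [f"scenario_{i}" for i in range(8)]
adversaries = ["objection_ladder", "retry_pressure", "escalation_pressure"]
SEEDS_PER_CELL = 3


def build_matrix(personas, scenarios, adversaries, seeds_per_cell):
    """Return one run spec per (persona, scenario, adversary, seed)."""
    return [
        {"persona": p, "scenario": s, "adversary": a, "seed": seed}
        for p, s, a in product(personas, scenarios, adversaries)
        for seed in range(seeds_per_cell)
    ]


matrix = build_matrix(personas, scenarios, adversaries, SEEDS_PER_CELL)
cells = {(r["persona"], r["scenario"], r["adversary"]) for r in matrix}
print(len(cells), len(matrix))  # 120 360
```

Keeping the cell key stable across releases is what lets you compare the pass rate of the same cell from one release to the next.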
Persona generation patterns
A persona is not a one-liner. The personas that produce useful failures all carry four things: a goal, a personality, a handful of hidden constraints (two to five), and behavior knobs.
PERSONAS = [
    {
        "id": "frustrated_escalator",
        "goal": "Refund a $200 duplicate charge from March 14.",
        "personality": (
            "Polite for two turns, then increasingly terse. "
            "Threatens to call corporate at turn five if unresolved."
        ),
        "constraints": [
            "Does not know the order ID; shares email only if asked twice",
            "Will not paste a card number under any circumstance",
            "Mentions a screenshot but cannot actually attach files",
        ],
        "frustration_threshold": 3,
        "persona_drop_prob": 0.05,
        "communication_style": "short sentences, occasional caps, no emoji",
    },
    {
        "id": "curious_newcomer",
        "goal": "Understand whether the product supports SAML SSO.",
        "personality": "Asks follow-up questions, reads docs links, low frustration ceiling.",
        "constraints": [
            "Does not know what SCIM is and will ask",
            "Shares company size only if asked",
        ],
        "frustration_threshold": 6,
        "persona_drop_prob": 0.02,
        "communication_style": "polite, two- to three-sentence turns, asks clarifying questions",
    },
]
Four patterns that earn their keep:
- Hidden constraints over disclosed facts. A persona that types its full order number on turn one tests nothing. A persona that knows its order number but only shares it after the agent asks twice tests memory and confirmation.
- Multi-shot examples beat persona descriptions. A four-line persona description drifts inside five turns; the same persona with three sample turns of realistic phrasing holds character.
- Behavior knobs make personas reusable. frustration_threshold and persona_drop_prob are dials, not constants. The same persona record across three threshold values gives you three trajectories without writing three personas.
- Mix model families across the boundary. Simulator on gpt-4o-mini against an agent on Claude Sonnet finds bugs neither family would find against itself; shared blind spots silently inflate pass rates.
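The behavior-knob pattern can be sketched in plain Python. The with_knobs helper and the variant id format are hypothetical illustrations, not part of simulate-sdk; the persona fields are taken from the record above.

```python
from copy import deepcopy

base_persona = {
    "id": "frustrated_escalator",
    "goal": "Refund a $200 duplicate charge from March 14.",
    "frustration_threshold": 3,
    "persona_drop_prob": 0.05,
}


def with_knobs(persona, **knobs):
    """Copy a persona record and override behavior knobs; the base is untouched."""
    unknown = set(knobs) - set(persona)
    if unknown:
        raise KeyError(f"unknown knobs: {sorted(unknown)}")
    variant = deepcopy(persona)
    variant.update(knobs)
    # Hypothetical id scheme so each variant stays traceable in reports.
    suffix = "_".join(f"{k}={v}" for k, v in sorted(knobs.items()))
    variant["id"] = f'{persona["id"]}__{suffix}'
    return variant


# One persona record, three thresholds, three trajectories.
variants = [with_knobs(base_persona, frustration_threshold=t) for t in (2, 3, 5)]
for v in variants:
    print(v["id"], v["frustration_threshold"])
```

Rejecting unknown knob names is a deliberate choice here: a typo in a knob would otherwise produce a variant that silently behaves like the base persona.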
Ten well-crafted personas beat fifty thin ones. Coverage scales through scenario and adversary multiplication, not through persona count. For the deeper pattern library on persona drift and style consistency, see evaluating LLM personas.
Scenario design
A scenario is a triple: intent, happy-path expected behavior, and edge-case variants. Pull intents from real production traces where you have them and from product documentation where you don’t. Every persona crosses every scenario it makes sense for, but not every combination. The curious-newcomer persona on a fraud-flag scenario is nonsense, and the test runner should skip it.
SCENARIOS = [
    {
        "id": "refund_duplicate_charge",
        "intent": "Refund a duplicate charge",
        "happy_path": (
            "Agent verifies email, looks up charge, confirms duplicate, "
            "issues refund, gives ETA."
        ),
        "edge_cases": [
            "Charge is not actually duplicate (similar amount, different merchant)",
            "Refund window has expired",
            "Account is flagged for fraud review",
        ],
        "compatible_personas": [
            "frustrated_escalator", "curious_newcomer", "hostile_user",
        ],
        "expected_tools": ["lookup_charge", "issue_refund"],
    },
]
Three habits separate working scenario design from theatre:
- Edge cases are first-class, not afterthoughts. Each named edge case is a separate test run with its own expected behavior. The duplicate-charge happy path and the duplicate-charge-with-fraud-flag edge case are different rows.
- Compatibility lists prune the grid. Without them the runner generates nonsense combinations that waste budget and pollute pass-rate aggregates.
- Expected tool sequences make trajectory scoring tractable. Knowing what the agent should have called gives the trajectory rubric something to compare against; without it ConversationResolution is scoring vibes.
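A rough sketch of the first two habits: expand each scenario into one row per variant (the happy path plus each edge case) and skip personas that are not on the compatibility list. The second scenario and the short edge-case labels are hypothetical, added only to show the pruning.

```python
PERSONAS = [{"id": "frustrated_escalator"}, {"id": "curious_newcomer"}]

SCENARIOS = [
    {
        "id": "refund_duplicate_charge",
        "edge_cases": [
            "not_actually_duplicate",
            "refund_window_expired",
            "fraud_review_flag",
        ],
        "compatible_personas": [
            "frustrated_escalator", "curious_newcomer", "hostile_user",
        ],
    },
    {
        # Hypothetical scenario, present only to demonstrate pruning.
        "id": "account_fraud_review",
        "edge_cases": ["identity_check_fails"],
        "compatible_personas": ["frustrated_escalator"],
    },
]


def expand_rows(personas, scenarios):
    """Return (persona_id, scenario_id, variant) for every compatible combination."""
    rows = []
    for scenario in scenarios:
        allowed = set(scenario["compatible_personas"])
        variants = ["happy_path", *scenario["edge_cases"]]
        for persona in personas:
            if persona["id"] not in allowed:
                continue  # incompatible pairing: skip instead of wasting budget
            for variant in variants:
                rows.append((persona["id"], scenario["id"], variant))
    return rows


rows = expand_rows(PERSONAS, SCENARIOS)
print(len(rows))  # 10
```

Each edge case becomes its own row, so its pass rate is reported separately from the happy path.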
The dataset grows two ways. Promote failing production conversations into new scenario rows, with a domain-lead-reviewed expected end state. That’s the loop described in the LLM evaluation playbook. Generate adversarial variants programmatically: the simulate-sdk ScenarioGenerator takes a topic and an AgentDefinition and produces persona variants for a given scenario, which is the fastest way to seed the matrix when you don’t have production traces yet.
Adversary patterns
The adversary leg is the most commonly skipped and the most expensive to skip. A simulator that agrees with whatever the agent says measures compliance, not resilience. Five patterns cover most of the failure space.
Objection ladders. The user pushes back on the first incorrect answer and escalates if the agent does not self-correct. The persona is scripted to disagree on turn three regardless of what the agent says, then to demand a citation on turn five, then to ask for a supervisor on turn seven. The agent that placates (“you’re absolutely right, let me reconsider”) fails. The agent that holds position with evidence passes. This pattern finds the bug nothing else finds: agents that have learned to agree with the user under any pressure.
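One way to script the ladder is a turn-indexed table. The sketch below is an interpretation rather than a simulate-sdk feature: the first rung fires unconditionally on turn three, and the later rungs fire only if the agent has not self-corrected. The move names are hypothetical labels that a simulator prompt would have to translate into actual user text.

```python
# Turn numbers come from the pattern above; move names are hypothetical labels.
OBJECTION_LADDER = {
    3: "disagree",
    5: "demand_citation",
    7: "ask_for_supervisor",
}
UNCONDITIONAL_TURNS = {3}  # the first rung fires regardless of what the agent said


def adversary_move(turn, agent_self_corrected=False):
    """Return the scripted move for this user turn, or None for an ordinary turn."""
    move = OBJECTION_LADDER.get(turn)
    if move is None:
        return None
    if turn in UNCONDITIONAL_TURNS:
        return move
    # Later rungs are escalations: they stop once the agent has self-corrected.
    return None if agent_self_corrected else move


for turn in range(1, 9):
    print(turn, adversary_move(turn))
```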
Retry pressure. The user repeats the same request in three different framings across the conversation, sometimes with conflicting details. “What’s the refund window?” on turn two. “How long do I have to return this?” on turn five. “Can I still get my money back?” on turn eight. A grounded agent gives the same answer all three times. A drifting agent gives three different windows. Per-turn Groundedness does not catch this. The failure is the inconsistency across turns, not any single turn.
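A cross-turn consistency check for this pattern can be sketched with a regular expression. This is a deliberately crude stand-in for a judge: it assumes the refund window is stated as a number of days, and the sample agent replies are invented for illustration.

```python
import re

# Matches "30 days", "30-day", "30 day"; assumes windows are stated in days.
WINDOW_RE = re.compile(r"(\d+)[\s-]*days?", re.IGNORECASE)


def stated_windows(agent_turns):
    """Collect every day-count the agent states, in conversation order."""
    return [int(m) for turn in agent_turns for m in WINDOW_RE.findall(turn)]


def is_consistent(agent_turns):
    """True when the agent never states two different windows.

    Note: this also passes vacuously if the agent never states a window.
    """
    return len(set(stated_windows(agent_turns))) <= 1


grounded = [
    "Our refund window is 30 days from delivery.",
    "You have 30 days to return this.",
    "Yes, you are still inside the 30-day window.",
]
drifting = [
    "Our refund window is 30 days from delivery.",
    "You have 14 days to return this.",
    "Yes, refunds are available within 60 days.",
]
print(is_consistent(grounded), is_consistent(drifting))  # True False
```

The check runs over the whole transcript, which is the point: each drifting reply could pass a per-turn rubric on its own.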
Escalation pressure. The persona signals impatience after N unhelpful turns and demands a manager or supervisor. The eval question isn’t whether the agent transfers. It’s whether the agent transfers at the right turn with the right context attached. Too early is a soft fail. Too late is a hard fail. No context summary attached is a hard fail. CustomerAgentHumanEscalation scores this directly.
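The soft-fail and hard-fail rules above can be written down as a small decision function. This is a sketch, not how CustomerAgentHumanEscalation works internally; the acceptable turn window is a hypothetical per-scenario parameter, and treating "never transferred" as too late is an assumption.

```python
def escalation_verdict(transfer_turn, has_context_summary, earliest_ok, latest_ok):
    """Classify a hand-off as 'pass', 'soft_fail', or 'hard_fail'.

    transfer_turn is the turn on which the agent transferred, or None if it
    never did. earliest_ok and latest_ok bound the acceptable turn window
    (hypothetical, set per scenario).
    """
    if transfer_turn is None or transfer_turn > latest_ok:
        return "hard_fail"  # too late (assumption: never transferring counts as too late)
    if not has_context_summary:
        return "hard_fail"  # transferred without a context summary
    if transfer_turn < earliest_ok:
        return "soft_fail"  # too early
    return "pass"


print(escalation_verdict(5, True, earliest_ok=4, latest_ok=7))   # pass
print(escalation_verdict(2, True, earliest_ok=4, latest_ok=7))   # soft_fail
print(escalation_verdict(9, True, earliest_ok=4, latest_ok=7))   # hard_fail
print(escalation_verdict(5, False, earliest_ok=4, latest_ok=7))  # hard_fail
```

Hard failures are checked first, so an early transfer with no context summary is reported as a hard fail, not a soft one.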
Off-script drops. The persona randomly injects distracted-user behavior: switches topics mid-turn, asks an unrelated question, goes silent for a turn, pastes a wrong order number. The agent that loses thread is the agent that fails when real users multi-task. A small per-turn persona_drop_prob (2-5 percent) is enough; higher rates make the conversation noise rather than test signal.
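A per-turn drop can be sketched as a seeded coin flip, so a failing conversation can be replayed with the same off-script turns. The move labels are hypothetical names for the behaviors listed above, and plan_user_turns is an illustrative helper, not a simulate-sdk function.

```python
import random

# Hypothetical labels for the distracted-user behaviors described above.
OFF_SCRIPT_MOVES = [
    "switch_topic",
    "unrelated_question",
    "go_silent",
    "wrong_order_number",
]


def plan_user_turns(n_turns, persona_drop_prob, seed):
    """Decide, per user turn, whether the persona stays on script."""
    rng = random.Random(seed)  # seeded so the same plan can be replayed
    plan = []
    for _ in range(n_turns):
        if rng.random() < persona_drop_prob:
            plan.append(rng.choice(OFF_SCRIPT_MOVES))
        else:
            plan.append("on_script")
    return plan


plan = plan_user_turns(n_turns=10, persona_drop_prob=0.05, seed=7)
print(plan)
```

Storing the seed alongside the transcript makes an off-script failure reproducible instead of a one-off.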
Persona pressure. The user emotionally loads the conversation (frustrated, hostile, distressed) and the agent has to hold its own persona across the turns. Sarcasm by turn eight is a regression even if each individual line was technically correct. CustomerAgentPromptConformance scores adherence; the failure pattern is the agent matching the user’s register instead of holding its system-prompt voice. The full trajectory-eval treatment of this lives in evaluating multi-turn conversations.
Build at least one persona per family. The cost of one extra persona is negligible compared to the cost of one persona-pressure failure shipping to production. The single most common simulation bug across every team we have seen run this loop is the absence of an objection ladder. Start there if you build nothing else.
Trajectory scoring on simulated conversations
The output of a simulator run is a transcript. The unit of scoring is the trajectory, not the turn. Two passes per conversation: end-of-conversation rubrics for outcome and persona behavior, per-turn rubrics only when the trajectory failed and you need to find the exact turn that caused it.
from fi.evals import Evaluator
from fi.evals.templates import (
    ConversationCoherence, ConversationResolution, TaskCompletion,
    CustomerAgentContextRetention, CustomerAgentLoopDetection,
    CustomerAgentHumanEscalation, CustomerAgentPromptConformance,
)
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase

evaluator = Evaluator()

# LLM judge that scores the simulated user, not the agent under test.
simulator_realism = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "simulator_realism",
        "model": "gpt-4.1",
        "grading_criteria": (
            "Score 1.0 if the user side of the transcript reads like a real "
            "user with the given persona. Penalize over-disclosure of hidden "
            "constraints, generic phrasing, breaking character, and ending "
            "every turn with a polite closer. Provide a one-sentence reason."
        ),
    },
)


def format_transcript(transcript):
    # Helper not shown in the original listing. Assumes each turn is a dict
    # with "role" and "content" keys; adapt to your transcript shape.
    return "\n".join(f'{turn["role"]}: {turn["content"]}' for turn in transcript)


def score_trajectory(persona, scenario, transcript):
    # One test case per conversation: the whole transcript is the output.
    tc = TestCase(
        input=scenario["intent"],
        output=format_transcript(transcript),
        expected_output=scenario["happy_path"],
        context=persona["personality"],
    )
    # End-of-conversation rubrics, scored over the full trajectory.
    result = evaluator.evaluate(
        eval_templates=[
            ConversationCoherence(), ConversationResolution(),
            CustomerAgentContextRetention(),
            CustomerAgentLoopDetection(),
            CustomerAgentHumanEscalation(),
            CustomerAgentPromptConformance(),
            TaskCompletion(),
        ],
        inputs=[tc],
    )
    scores = {r.name: r.output for r in result.eval_results}
    # Realism is scored separately so unrealistic runs can be filtered out.
    scores["simulator_realism"] = simulator_realism.evaluate([tc])[0].output
    return scores
Three habits make the scoring useful:
- Score the simulator, not just the agent. SimulatorRealism runs against the user side of the transcript. If a conversation's realism score drops below 0.7, drop it from the eval rollup; half your failure clusters would otherwise be simulator artifacts, not agent bugs.
- Floor per trajectory, not mean across set. A single hostile-persona failure that drops from 0.95 to 0.55 should block release; averaged with 199 happy-path passes at 0.95, it moves the mean only from 0.95 to about 0.948.
- Stratify the rollup by adversary family. Pass rate on objection ladders is the metric that predicts production CSAT; pass rate on cooperative happy paths is the metric that predicts nothing.
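The three habits can be combined in a small rollup function. This is a sketch under stated assumptions: each result is a dict with adversary, score, and simulator_realism keys, the 0.7 realism cutoff comes from the text, and the 0.7 per-trajectory floor is a hypothetical release threshold.

```python
from collections import defaultdict
from statistics import mean

REALISM_MIN = 0.7  # realism cutoff quoted in the text
SCORE_FLOOR = 0.7  # hypothetical per-trajectory release floor


def rollup(results, realism_min=REALISM_MIN, score_floor=SCORE_FLOOR):
    """Filter unrealistic runs, gate on the floor, and stratify by adversary."""
    kept = [r for r in results if r["simulator_realism"] >= realism_min]
    if not kept:
        raise ValueError("no trajectories passed the realism filter")
    by_family = defaultdict(list)
    for r in kept:
        by_family[r["adversary"]].append(r["score"])
    scores = [r["score"] for r in kept]
    return {
        "dropped_for_realism": len(results) - len(kept),
        "mean": mean(scores),
        "floor": min(scores),
        "release_blocked": min(scores) < score_floor,
        "pass_rate_by_family": {
            family: sum(s >= score_floor for s in fam) / len(fam)
            for family, fam in by_family.items()
        },
    }


# Illustrative data: 199 cooperative passes, one failing objection ladder,
# and one run whose simulator broke character.
results = (
    [{"adversary": "cooperative", "score": 0.95, "simulator_realism": 0.9}] * 199
    + [{"adversary": "objection_ladder", "score": 0.55, "simulator_realism": 0.9}]
    + [{"adversary": "off_script", "score": 0.20, "simulator_realism": 0.5}]
)
summary = rollup(results)
print(summary["release_blocked"], summary["pass_rate_by_family"])
```

On this data the mean stays high while the floor and the per-family pass rate both expose the objection-ladder failure, which is why the gate reads the floor.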
Every trajectory carries a session.id so traceAI rolls turns into a conversation root span. Attach the trajectory rubrics with EvalSpanKind.CONVERSATION so the same rubrics that gate CI also run on production conversation roots at 5-10 percent sampling, same definitions, comparable scores. That is the pattern from agent observability versus evaluation versus benchmarking.
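Sampling by session.id is easiest to keep consistent when the decision is a deterministic function of the id, so every turn in a conversation gets the same answer. The sketch below uses a hash bucket from the standard library; it is an illustration of the idea, not how traceAI implements sampling, and should_score is a hypothetical name.

```python
import hashlib


def should_score(session_id, rate):
    """Deterministically pick a fraction `rate` of sessions by hashing the id."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


session_ids = [f"session-{i}" for i in range(10_000)]
sampled = [s for s in session_ids if should_score(s, rate=0.05)]
print(len(sampled) / len(session_ids))
```

Because the decision depends only on the id, a conversation is either scored in full or not at all, and raising the rate from 5 to 10 percent keeps every previously sampled session in the sample.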
The FAGI simulate-sdk
The simulate-sdk encodes the triangle as code primitives so you do not maintain an orchestration layer yourself. Persona carries a persona dict, a situation, and an outcome. Scenario carries a name and a list of personas. TestRunner orchestrates the loop. Wrappers (OpenAIAgentWrapper, LangChainAgentWrapper, GeminiAgentWrapper, AnthropicAgentWrapper) let you plug in the agent under test without rewriting your stack.
import asyncio

from fi.simulate import (
    Persona, Scenario, TestRunner, AgentDefinition, LLMConfig,
)

# SYSTEM_PROMPT is assumed to be defined elsewhere (the agent's production prompt).
agent_def = AgentDefinition(
    name="refund-agent",
    url="wss://your-livekit-server.com",
    room_name="agent-room",
    system_prompt=SYSTEM_PROMPT,
    llm=LLMConfig(model="gpt-4.1", temperature=0.3),
)

# Persona leg plus adversary leg: the pushback is scripted into the situation.
objection_ladder = Persona(
    persona={
        "name": "frustrated_escalator",
        "communication_style": "short sentences, pushes back on first answer",
        "constraints": ["only shares email if asked twice", "no card number"],
    },
    situation=(
        "User has a duplicate $200 charge from March 14. Pushes back on the "
        "agent's first attempt and demands a supervisor at turn five if "
        "unresolved."
    ),
    outcome=(
        "Agent verifies email, looks up charge, confirms duplicate, issues "
        "refund with ETA, holds formal persona across pushback."
    ),
)

# Scenario leg: a named test with the personas that run against it.
scenario = Scenario(
    name="refund_objection_ladder",
    description="Objection-ladder adversary on a duplicate-charge scenario.",
    dataset=[objection_ladder],
)


async def main():
    # run_test is awaited, so outside a notebook it has to run inside a coroutine.
    runner = TestRunner()
    return await runner.run_test(
        agent_definition=agent_def,
        scenario=scenario,
        max_seconds=180.0,
        record_audio=False,
    )


report = asyncio.run(main())
Two non-obvious wins. ScenarioGenerator, which takes a topic and an AgentDefinition and auto-produces persona variants, is more than a way to seed the matrix before you have production traces; it is also the cheapest way to grow the adversary side once you have the first three personas working. evaluate_report runs the eval pipeline against the produced TestReport, attaching scores per template so the trace tree carries trajectory rubrics on the same span as the simulated conversation. For voice agents, the same Persona and Scenario shapes drive LiveKit-rooted runs with TTSConfig and STTConfig carrying the speech surface; the simulator becomes a voice user without rewriting the test case.
How Future AGI fits simulated multi-turn evaluation
Future AGI ships the eval stack as a package. Start with the SDK for code-defined trajectory rubrics and simulate-sdk for the persona-scenario-adversary grid. Graduate to the Platform for self-improving evaluators tuned by domain-lead feedback.
- simulate-sdk. Persona plus Scenario plus TestRunner encodes the triangle. OpenAIAgentWrapper, LangChainAgentWrapper, GeminiAgentWrapper, and AnthropicAgentWrapper plug in the agent under test. ScenarioGenerator auto-produces persona variants from a topic. Voice agents get the same shape via TTSConfig and STTConfig. Runs in local LiveKit mode for voice or cloud mode against the FAGI backend.
- ai-evaluation SDK (Apache 2.0). ConversationCoherence and ConversationResolution ship as trajectory templates. The 11-template CustomerAgent family covers context retention, loop detection, human escalation, prompt conformance, objection handling, query handling, and termination, every rubric the triangle needs. CustomLLMJudge handles SimulatorRealism, intent drift, and product-specific persona rubrics. 50+ evaluators across the SDK and 20+ heuristic metrics locally.
- traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, and C#; 14 span kinds with first-class CONVERSATION, AGENT, and TOOL. Trajectory rubrics attach via EvalTag with EvalSpanKind.CONVERSATION, so the same rubric in CI runs on live conversation roots at sampling.
- Future AGI Platform. Self-improving evaluators tuned by domain-lead thumbs feedback. In-product authoring agent writes persona and trajectory rubrics from natural-language descriptions. Classifier-backed evals at lower per-eval cost than Galileo Luna-2, which is what keeps the simulation loop affordable when you run it on every PR.
- Error Feed (inside the eval stack). HDBSCAN soft-clustering over failing conversation embeddings, then a Sonnet 4.5 Judge writes the immediate_fix proposal: cluster description, representative transcripts, suspected root cause, and a one-line prompt or tool change to test. Domain-lead review gates promotion into the dataset.
- Agent Command Center. OpenAI-compatible AI gateway in a single Go binary (Apache 2.0); 100+ providers; routing strategies let the simulator run on a cheap model while the agent under test runs on production routing, billed on separate cost headers. SOC 2 Type II, HIPAA, GDPR, and CCPA certified; ISO/IEC 27001 in active audit.
Three honest tradeoffs:
- Trajectory rubrics cost more per call than turn-level rubrics because the judge reads the whole transcript; the discipline is to gate CI on the curated regression set and sample production at 5-10 percent.
- SimulatorRealism judges need calibration against a domain-lead-labelled hold-out before the threshold is meaningful; a judge that fires on every off-script turn is noise.
- Self-improving evaluators need a pinned hold-out and quarterly calibration review or the drift moves with the data and no one notices.
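Calibrating a realism threshold against a labelled hold-out can be as simple as measuring agreement at a few candidate cutoffs. The sketch below uses invented judge scores and labels purely for illustration; agreement and best_threshold are hypothetical helpers, and a real hold-out would need far more than six transcripts.

```python
def agreement(scores, labels, threshold):
    """Fraction of hold-out transcripts where judge and human label agree."""
    hits = sum((s >= threshold) == label for s, label in zip(scores, labels))
    return hits / len(labels)


def best_threshold(scores, labels, candidates):
    """Pick the candidate cutoff with the highest agreement (first wins on ties)."""
    return max(candidates, key=lambda t: agreement(scores, labels, t))


# Invented hold-out: judge scores and domain-lead labels (True = realistic).
judge_scores = [0.90, 0.80, 0.75, 0.65, 0.60, 0.40]
human_labels = [True, True, True, False, False, False]

candidates = (0.5, 0.6, 0.7, 0.8)
for t in candidates:
    print(t, round(agreement(judge_scores, human_labels, t), 2))
print(best_threshold(judge_scores, human_labels, candidates))  # 0.7
```

Pinning the hold-out and rerunning this check on a schedule is what makes threshold drift visible.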
What to build first
Build the smallest harness that exercises all three legs at once. Three personas (cooperative, frustrated escalator, ambiguous intent). One scenario with two edge cases. One adversary pattern (objection ladder) wired through the frustrated-escalator persona. Score with ConversationCoherence, ConversationResolution, CustomerAgentHumanEscalation, and a SimulatorRealism CustomLLMJudge. Run 50 conversations through simulate-sdk, read the trace tree, fix the top failure mode by hand, run 50 more.
That’s two weeks of work. By the end of it you have the persona-scenario-adversary matrix, the trajectory rubrics, and the cluster-driven triage loop. The harness ends up being the cheapest part of the eval program; the value is in what the hostile escalator on the duplicate-charge fraud-flag scenario finds before your real users hit it. Skip the adversary leg and the harness produces a dashboard. Build all three legs and it produces failure modes you can fix before they ship.
Related reading
- Evaluating Multi-Turn Conversations: A Deep Dive (2026)
- Multi-Turn LLM Evaluation (2026)
- Single-Turn vs Multi-Turn Evaluation (2026)
- Multi-Turn Jailbreaking Defender (2026)
- Evaluating LLM Personas and Style (2026)
- The 2026 LLM Evaluation Playbook
- Agent Observability vs Evaluation vs Benchmarking (2026)