What Is a Synthetic Scenario?
A generated test situation or cohort used to simulate and evaluate AI behavior before production exposure.
A synthetic scenario is an artificial but realistic test situation used to evaluate an LLM, agent, or voice workflow before production users encounter the same path. It is a data-layer artifact for simulation and eval pipelines, usually containing personas, goals, constraints, expected outcomes, and sometimes tool routes. FutureAGI maps this concept to simulate:Scenario, where engineers run repeatable scenario cohorts, capture traces, and score results with evaluators such as TaskCompletion or ToolSelectionAccuracy.
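To make the data-layer framing concrete, a single scenario entry might look like the sketch below. The field names are illustrative only, not the exact simulate:Scenario schema.

# Illustrative shape of one synthetic scenario entry; field names are
# examples, not the exact simulate:Scenario schema.
chargeback_scenario_row = {
    "persona": {
        "profile": "high-value account holder, frustrated, non-native English speaker",
        "expertise": "low",
    },
    "goal": "dispute a chargeback on a flagged transaction",
    "constraints": [
        "agent must verify identity before discussing transaction details",
        "agent must not promise a refund outcome",
    ],
    "expected_outcome": {
        "tool_route": ["verify_identity", "open_chargeback_case"],
        "escalate_if": "transaction amount exceeds policy threshold",
    },
}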
Why Synthetic Scenarios Matter in Production LLM and Agent Systems
Production traffic rarely gives teams the failure cases they need before launch. A support agent may have thousands of refund conversations but no chargeback disputes from a high-value account. A RAG assistant may answer common policy questions while failing a narrow escalation rule. A tool-calling agent may pass single-turn demos and still choose the wrong tool after a confused user changes intent in turn four. Synthetic scenarios close that coverage gap by creating controlled cases before the product collects real incidents.
Ignoring them leads to hidden failure modes: silent hallucinations after missing context, unsafe tool use under ambiguous instructions, and false confidence from happy-path eval sets. Developers feel it as unreproducible bugs. SREs see a spike in eval-fail-rate-by-cohort or p99 latency only after a release. Compliance teams lack evidence that privacy, refusal, or protected-user paths were tested. End users feel inconsistent answers, unnecessary escalations, or actions taken without enough context.
The symptoms show up in logs and metrics as repeated prompt shapes, missing scenario labels, untested tool routes, and a large gap between offline eval pass rate and production thumbs-down rate. In 2026 multi-step pipelines, a scenario matters more than a single prompt because the failure can depend on persona, memory, retrieval, planner state, tool output, and final response together.
How FutureAGI Handles Synthetic Scenarios
FutureAGI’s approach is to treat a synthetic scenario as executable eval data, not a static prompt list. The specific anchor is simulate:Scenario, exposed by the simulate-sdk as Scenario: a named scenario containing a list of Persona test cases. Each persona carries fields such as user profile, situation, and desired outcome. A scenario can also be loaded from a CSV or JSON dataset through Scenario.load_dataset, which lets teams version scenarios beside model, prompt, and tool changes.
A realistic workflow: a fintech team is about to release a dispute-resolution agent with a new chargeback tool. The engineer creates a Scenario named chargeback_edge_cases_2026_05 with personas for angry users, low-context users, policy-aware users, multilingual users, and users with possible PII in the request. The scenario runs through the agent callback and produces a TestReport with transcripts, trace links, and optional eval scores.
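A minimal sketch of that cohort definition follows. It assumes the simulate-sdk exposes Scenario and a Persona-style test case with fields similar to those described above; the Persona import path and constructor arguments are assumptions, not verified signatures.

# Illustrative cohort definition; the Persona import path and constructor
# fields are assumptions, not verified simulate-sdk signatures.
from fi.simulate.simulation import Scenario, Persona

scenario = Scenario(
    name="chargeback_edge_cases_2026_05",
    personas=[
        Persona(
            profile="angry high-value account holder",
            situation="disputing a chargeback after a declined refund",
            desired_outcome="chargeback case opened through the correct tool route",
        ),
        Persona(
            profile="low-context first-time user",
            situation="asks about a charge without giving account details",
            desired_outcome="agent asks for verification before acting",
        ),
    ],
)

# The same cohort can instead be loaded from a versioned dataset file,
# as shown in the measurement example later on this page.
# scenario = Scenario.load_dataset("chargeback_edge_cases_2026_05.json")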
FutureAGI then scores the run with TaskCompletion for whether the dispute task finished, ToolSelectionAccuracy for whether the chargeback tool was chosen only when appropriate, and Groundedness for whether the final answer stayed supported by retrieved policy context. Trace fields such as agent.trajectory.step help isolate whether the failure came from planning, retrieval, tool selection, or final generation. Unlike Ragas-style synthetic question generation, which often focuses on single-turn RAG questions, this workflow tests multi-turn agent trajectories. The engineer’s next action is concrete: block a release, add a regression scenario, send ambiguous cases to annotation, or widen the scenario cohort after production traces reveal an uncovered intent.
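That release decision can be expressed as a small gate over the TestReport produced by the run. The sketch below is illustrative: the per-persona result and score fields are assumptions about the report shape, not guaranteed simulate-sdk attributes.

def release_gate(report, fail_threshold: float = 0.10) -> bool:
    """Return True if the scenario run is clean enough to ship.

    Assumes report.results is a list of per-persona results, each with a
    .scores dict keyed by evaluator name; these field names are
    illustrative, not guaranteed by the simulate-sdk.
    """
    failed = [
        r for r in report.results
        if r.scores.get("TaskCompletion", 0.0) < 1.0
        or r.scores.get("ToolSelectionAccuracy", 0.0) < 1.0
    ]
    fail_rate = len(failed) / max(len(report.results), 1)
    # On failure, inspect agent.trajectory.step on the failing traces to see
    # whether the problem is planning, retrieval, tool selection, or generation.
    return fail_rate <= fail_threshold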
How to Measure or Detect a Synthetic Scenario
Measure a synthetic scenario by the failure signal it creates, not by how many rows it contains:
- Coverage matrix: percent of target intents, locales, risk levels, tool routes, and user expertise levels represented inside the Scenario.
- Duplicate and near-duplicate rate: repeated personas or situations lower scenario value and can inflate apparent pass rates.
- TaskCompletion score: evaluates whether the agent completed the assigned task for each simulated persona.
- ToolSelectionAccuracy score: checks whether the selected tool matches the expected tool path for the scenario.
- Trace evidence: use agent.trajectory.step, model spans, tool spans, and eval-fail-rate-by-cohort to locate the failing step.
- User-feedback proxy: compare scenario failures with production thumbs-down rate, escalation rate, and reviewed incident tags.
from fi.simulate.simulation import Scenario
from fi.evals import TaskCompletion, ToolSelectionAccuracy

# Load the versioned scenario cohort and run it against the agent,
# scoring each persona for task completion and tool selection.
scenario = Scenario.load_dataset("chargeback_edge_cases_2026_05.json")
report = scenario.run(agent=dispute_agent, evaluators=[TaskCompletion(), ToolSelectionAccuracy()])
print(report.summary())
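The coverage and duplicate-rate signals listed above can be computed on the cohort rows before any run. The sketch below assumes the JSON dataset is a flat list of persona records with intent, persona, and situation fields; that structure is illustrative, not a simulate-sdk requirement.

import json
from collections import Counter

# Load the raw cohort rows; assumes a list of persona/situation records.
with open("chargeback_edge_cases_2026_05.json") as f:
    rows = json.load(f)

# Coverage matrix (one axis shown): how many target intents are represented.
target_intents = {"dispute_chargeback", "refund_status", "escalate_to_human"}
covered_intents = {row.get("intent") for row in rows} & target_intents
print(f"Intent coverage: {len(covered_intents)}/{len(target_intents)}")

# Duplicate rate: identical (persona, situation) pairs inflate pass rates.
keys = [
    json.dumps({"persona": row.get("persona"), "situation": row.get("situation")}, sort_keys=True)
    for row in rows
]
dupes = sum(count - 1 for count in Counter(keys).values() if count > 1)
print(f"Duplicate rate: {dupes / max(len(rows), 1):.0%}")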
Common Mistakes
Synthetic scenarios become weak when teams optimize for breadth without executable evidence.
- Writing scenarios as prose only. If the scenario cannot run against an agent and produce a score, it is documentation, not eval data.
- Testing only one persona per scenario. Multi-step failures often depend on user expertise, emotion, locale, or missing context.
- Leaving expected outcomes vague. “Resolve the issue” is not enough; specify the tool path, refusal rule, escalation condition, or final answer constraint (see the sketch after this list).
- Mixing tuning and final eval scenarios. Once a scenario guides prompt changes, keep a separate holdout scenario for release decisions.
- Ignoring production drift. Refresh scenarios when tool availability, policy text, customer segments, or traffic mix changes in 2026.
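For the vague-outcome mistake above, the difference between a prose goal and an executable expected outcome might look like this; the field names are illustrative, not a fixed simulate-sdk schema.

# A vague outcome cannot be scored; a structured one can.
vague_expected_outcome = "resolve the issue"

specific_expected_outcome = {
    "tool_route": ["verify_identity", "open_chargeback_case"],
    "refusal_rule": "do not discuss transaction details before verification",
    "escalation_condition": "amount above policy threshold or repeated dispute",
    "final_answer_must_include": ["case id", "expected resolution window"],
}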
Frequently Asked Questions
What is a synthetic scenario?
A synthetic scenario is an artificial but realistic test situation for evaluating an LLM, agent, or voice workflow. It usually includes personas, goals, constraints, expected outcomes, and traceable simulation results.
How is a synthetic scenario different from a synthetic persona?
A synthetic persona describes one generated user profile. A synthetic scenario defines the test situation around one or more personas, including the goal, context, constraints, and expected outcome.
How do you measure synthetic scenario quality?
FutureAGI measures scenario quality with simulate-sdk reports and evaluator signals such as TaskCompletion, ToolSelectionAccuracy, and Groundedness. Teams also track coverage, duplicate rate, and eval-fail-rate-by-cohort.