Simulate a Voice AI Agent in 2026: A Hands-On Guide to Persona-Driven Voice Testing with fi.simulate
A voice AI team ships a new prompt for a healthcare scheduling agent. The change improves the agent’s response to “I need to reschedule my appointment” from 87 to 93 percent task completion on the team’s internal eval. Two weeks after rollout, a regional support team in Texas reports a flood of mis-scheduled callbacks. The new prompt also broke the agent’s handling of Southern-accented English on the word “reschedule.” The QA pool never had a Southern-accented tester in the rotation. This post is the 2026 fix: programmatic voice agent simulation in CI, with hundreds to low thousands of accent, noise, and interruption scenarios per release, sized to your CI budget.
TL;DR: voice AI simulation in 2026
| Question | Answer |
|---|---|
| What is voice agent simulation? | Scripted multi-turn runs against your voice agent with synthetic personas and scenarios. Reduces reliance on the human QA pool. |
| Top tool in 2026 | Future AGI Simulate (fi.simulate.TestRunner). Provider-agnostic, ties to fi.evals scoring and traceAI tracing. |
| What to test first | Accents, background noise, interruptions, emotional state, long-tail intents, adversarial inputs. |
| How many scenarios | 600 to 2,000 per release, spread across 6 to 10 categories. |
| CI gate | Block merges if any category regresses past threshold (typical 90 percent task completion, 95 percent safety). |
| License | Future AGI ai-evaluation and traceAI are Apache 2.0. |
If you only read one row: Future AGI Simulate replaces the manual voice QA pool with a programmatic TestRunner that runs hundreds to low thousands of scenarios per release in CI and scores transcripts against the same fi.evals rubric you use in production.
Watch the webinar: voice AI testing in the AI-agent era
In this session, Nikhil and Rishav walk through the Voice Agent Simulator: how it replaces the slow human QA loop, how to design a scenario library, and how Agent Compass pinpoints whether a failure was in STT, LLM, or TTS.
What voice AI simulation actually does
A voice agent simulator does four things.
- Drives the agent through scripted multi-turn conversations. A scenario is a sequence of user turns plus expected behaviors at each step.
- Generates synthetic input. TTS-rendered voice (with accent, noise, and emotion controls) sent into the agent’s audio pipeline, or text injected for text-first systems.
- Captures the full transcript and audio. Every prompt, retrieval, tool call, LLM response, and synthesized audio segment lands as a trace.
- Scores the run. Rubric-driven evaluators run against the transcript and audio, producing per-turn and per-scenario metrics.
The output is the same shape as a unit-test report: pass/fail per scenario, aggregated by category, gated by threshold. The difference from traditional testing is that the assertion is a learned judge, not a hand-coded predicate.
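That report shape can be sketched as a small aggregation step. This is an illustrative sketch only; the `ScenarioResult` fields are assumptions, not the fi.simulate output schema.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    """Illustrative shape only; field names are assumptions, not the fi.simulate schema."""
    scenario: str
    category: str
    passed: bool

def pass_rates_by_category(results: list[ScenarioResult]) -> dict[str, float]:
    """Aggregate per-scenario pass/fail into per-category pass rates."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for r in results:
        totals[r.category][0] += int(r.passed)
        totals[r.category][1] += 1
    return {cat: passed / total for cat, (passed, total) in totals.items()}

results = [
    ScenarioResult("accented_english_en_in", "accents", True),
    ScenarioResult("accented_english_en_us_south", "accents", False),
    ScenarioResult("background_noise_cafe", "noise", True),
]
print(pass_rates_by_category(results))  # {'accents': 0.5, 'noise': 1.0}
```

From here, gating is a straight comparison of each category's rate against its threshold.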
The 2026 voice AI simulation landscape
| Tool | What it does | Stack scope | Notes |
|---|---|---|---|
| Future AGI Simulate | TestRunner-based persona and scenario simulation. Scores via fi.evals. Traces via traceAI. | Provider-agnostic (Vapi, Retell, LiveKit, ElevenLabs, custom) | Broadest fit when you want one stack for simulation, eval, and tracing. Apache 2.0 libraries. |
| Coval | Hosted voice agent simulation and call replay. | Provider-agnostic | Specialist alternative for hosted call simulation. |
| Vapi Evals | In-platform replay and basic transcript scoring. | Vapi-only | Useful inside Vapi; does not extend to cross-provider stacks. |
| Retell Test Calls | In-platform scripted test calls. | Retell-only | Tied to Retell hosting; limited scenario primitives. |
| LangSmith | Trace inspection and dataset-driven eval. | Text-first | Strong for LangChain-native text agents; voice is not the focus. |
| Custom Twilio + Deepgram | DIY scenario harness on top of provider SDKs. | Custom stacks | Maximum flexibility, maximum maintenance overhead. |
Note: Future AGI competes directly in the voice agent simulation space; evaluation and testing for voice agents is its core scope. It is not a TTS, STT, or telephony provider, so the comparison covers the simulation and evaluation surface, not the runtime infrastructure.
How to set up a voice AI agent simulation in 6 steps
Step 1: Define your scenario library by category
Start with six categories. Use your existing call logs to weight the categories by failure frequency.
- Accents and dialects. At least four per primary user geography.
- Background noise. Cafe, car, baby crying, music, second speaker.
- Interruptions. Mid-turn cutoff, double-talk, latency-induced talk-over.
- Emotional state. Impatient, distressed, calm-but-confused, sarcastic.
- Long-tail intents. Ambiguous queries, off-topic detours, code-switching.
- Adversarial inputs. Voice-channel prompt injection, multi-step social engineering, off-policy requests.
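The weighting step can be sketched as a simple proportional allocation: take failure counts from your call logs and spread a fixed scenario budget across categories. The category names and counts below are illustrative, not prescriptive.

```python
def allocate_scenarios(failure_counts: dict[str, int], budget: int) -> dict[str, int]:
    """Split a scenario budget across categories in proportion to observed failures."""
    total = sum(failure_counts.values())
    alloc = {cat: round(budget * n / total) for cat, n in failure_counts.items()}
    # Push any rounding drift onto the largest category so the budget is exact.
    largest = max(alloc, key=alloc.get)
    alloc[largest] += budget - sum(alloc.values())
    return alloc

failure_counts = {  # failures observed in last month's call logs (illustrative numbers)
    "accents": 120,
    "background_noise": 80,
    "interruptions": 60,
    "emotional_state": 40,
    "long_tail_intents": 60,
    "adversarial": 40,
}
print(allocate_scenarios(failure_counts, budget=600))
# {'accents': 180, 'background_noise': 120, 'interruptions': 90,
#  'emotional_state': 60, 'long_tail_intents': 90, 'adversarial': 60}
```

Re-run the allocation whenever the call-log failure distribution shifts, so the suite keeps tracking where your agent actually breaks.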
Step 2: Author personas and scenarios as code
Personas are reusable user templates (impatient buyer, confused first-time caller). Scenarios compose a persona with a starting dataset and an expected outcome. Treat both as code in source control so they version with your agent.
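A minimal sketch of that pattern using plain dataclasses; the field names are assumptions for illustration, not the fi.simulate schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    """Reusable user template; fields are illustrative, not the fi.simulate schema."""
    name: str
    accent: str
    emotional_state: str
    speaking_rate: float = 1.0

@dataclass(frozen=True)
class Scenario:
    """A persona plus a starting utterance and the behavior the agent must exhibit."""
    persona: Persona
    opening_utterance: str
    expected_outcome: str
    category: str

IMPATIENT_BUYER = Persona("impatient_buyer", accent="en-US", emotional_state="impatient")

RESCHEDULE_SOUTHERN = Scenario(
    persona=Persona("southern_caller", accent="en-US-SOUTH", emotional_state="calm"),
    opening_utterance="I need to reschedule my appointment",
    expected_outcome="appointment moved; new time confirmed back to the caller",
    category="accents",
)
```

Because both types live in the repo, a prompt change and the scenarios that guard it land in the same pull request and get reviewed together.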
Step 3: Wire your agent function into TestRunner
fi.simulate.TestRunner accepts any callable with the agent(prompt: AgentInput) -> AgentResponse signature. Wrap your production agent endpoint, your Vapi assistant, your Retell call handler, or your custom Twilio + Deepgram + OpenAI + ElevenLabs stack behind that signature.
```python
from fi.simulate import TestRunner, AgentInput, AgentResponse

def my_voice_agent(prompt: AgentInput) -> AgentResponse:
    # Call your production voice agent (Vapi, Retell, LiveKit, custom stack).
    # Return the agent's response with any captured audio URL.
    return AgentResponse(content="...", audio_url="...")

runner = TestRunner(
    agent=my_voice_agent,
    scenarios=[
        "accented_english",
        "background_noise_cafe",
        "interruption_mid_turn",
        "code_switching_en_es",
        "impatient_buyer",
        "voice_prompt_injection",
    ],
)
runner.run()
```
Set FI_API_KEY and FI_SECRET_KEY before the call. The runner streams per-scenario results to the Future AGI dashboard and emits OpenTelemetry spans through traceAI for any backend you already use.
Step 4: Score the transcripts and audio with fi.evals
Each scenario produces a transcript plus audio. Score with the rubric templates that match your task: task_completion for closed-task flows, faithfulness if the agent grounds answers in retrieved context, instruction_following for tone and policy compliance, safety for adversarial scenarios.
```python
from fi.evals import evaluate

result = evaluate(
    "task_completion",
    output=transcript_text,
    context=scenario_brief,
)
score = result.score
```
For audio-aware evaluation, use the Turing evaluator family in fi.evals cloud evals. Latency tiers: turing_flash runs at roughly 1 to 2 seconds per call, turing_small at 2 to 3 seconds, turing_large at 3 to 5 seconds. Pick the smallest judge that hits your quality bar.
Step 5: Diagnose root cause with Agent Compass
When a scenario fails, the question is whether the breakage was in STT (the agent misheard), in the LLM (the agent reasoned wrong), or in TTS (the agent answered well but rendered audio that the user could not parse). Agent Compass clusters failure modes and points at the stage that produced the error so you can swap providers based on root cause rather than guesswork.
Step 6: Gate CI on the simulation suite
Run the suite on every pull request and on a nightly schedule. Block merges that regress any category past threshold. A workable default set of thresholds:
- Task completion: 90 percent
- Safety on adversarial scenarios: 95 percent
- Refusal correctness on out-of-policy requests: 92 percent
- Tone compliance: 85 percent
- Average p95 latency: under your product SLA (commonly 1.5 to 2 seconds for chat-flow voice agents)
The CI gate is the difference between a simulator that finds problems and a simulator that prevents them from shipping.
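The gate itself can be a short script at the end of the CI job. This is a sketch: the thresholds mirror the defaults above, and the per-category pass rates are assumed to come from your simulation run's summary (hardcoded here for illustration).

```python
import sys

THRESHOLDS = {  # category -> minimum pass rate, mirroring the defaults above
    "task_completion": 0.90,
    "adversarial_safety": 0.95,
    "refusal_correctness": 0.92,
    "tone_compliance": 0.85,
}

def gate(pass_rates: dict[str, float]) -> list[str]:
    """Return the categories that regressed past threshold (empty list = merge allowed)."""
    return [
        cat for cat, minimum in THRESHOLDS.items()
        if pass_rates.get(cat, 0.0) < minimum
    ]

# In CI, load these from the simulation run's summary; hardcoded here for illustration.
pass_rates = {
    "task_completion": 0.93,
    "adversarial_safety": 0.97,
    "refusal_correctness": 0.93,
    "tone_compliance": 0.88,
}
failed = gate(pass_rates)
if failed:
    print(f"Blocking merge; regressed categories: {failed}")
    sys.exit(1)
print("All category thresholds met.")
```

A nonzero exit code fails the CI step, which is all most CI systems need to block the merge.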
A worked example: scheduling agent across six scenario categories
A healthcare scheduling agent runs through 600 scenarios on a release candidate.
| Category | Scenarios | Pass rate | Verdict |
|---|---|---|---|
| Accents (en-IN, en-AU, en-ZA, en-US-SOUTH) | 100 | 0.88 | Below 0.90 threshold. Block. |
| Background noise (cafe, car, baby, music) | 100 | 0.91 | Pass. |
| Interruptions (mid-turn, double-talk) | 100 | 0.94 | Pass. |
| Emotional state | 100 | 0.93 | Pass. |
| Long-tail intents (code-switching, ambiguous) | 100 | 0.85 | Below 0.90 threshold. Block. |
| Adversarial | 100 | 0.97 | Pass. |
The CI gate blocks the merge. Two categories regressed: accents (specifically en-US-SOUTH on the word “reschedule”) and long-tail intents (specifically en-es code-switching). Agent Compass pinpoints both as STT failures rather than LLM failures. The fix is an STT-side change (a recognition prompt or model swap), not the LLM prompt the engineer edited. The team ships a corrected build the next day.
This is the 2026 release loop: programmatic simulation gates the merge, root-cause diagnosis tells you what to fix, and the next iteration runs the same suite without manual setup.
Best practices
- Treat scenarios as code. Version them in the same repo as your agent.
- Weight categories by failure frequency from your real call logs.
- Pair a small fast judge (turing_flash) for CI with a slower deep judge for the nightly run.
- Sample 1 to 5 percent of failures for human review every week so the rubric stays calibrated.
- Use a different model for the judge than for the agent under test to avoid judge collapse.
- Re-run the full suite after every model swap, prompt change, or provider change, even if the diff looks small.
Closing: stop testing on your customers
The teams that ship reliable voice AI in 2026 do not test on their customers. They run hundreds to low thousands of scenarios in CI on every change, gate merges on category-level pass rates, and rely on simulation, evaluation, and tracing to surface the regression class before the rollout. Future AGI Simulate is the integrated stack for that loop: TestRunner for the runs, fi.evals for the scoring, traceAI for the OpenTelemetry tracing, and Agent Compass for root-cause analysis. Start with the free tier, wire one production scenario into TestRunner, and run the loop end to end this week.