Accent and Dialect Testing for Voice AI Agents: A 2026 Methodology
Accent testing is not WER on Common Voice. This 2026 methodology catches proper-noun, code-switching, filler, and dialect-phrasing failures that production benchmarks miss.
Accent testing is not WER on Common Voice. The failures that ship to production live in the words a public benchmark never saw: proper nouns the model has no prior on, code-switches between two languages mid-utterance, conversational fillers that crash a poorly tuned VAD, and dialect-specific phrasing that carries locale-specific meaning. A 5% WER on a curated read-speech corpus can sit alongside a 32% end-task failure rate on Tamil-substrate English callers in your live stack, and you’d never know from the benchmark. This methodology shows the four failure modes standard accent benchmarks miss, the persona-driven generation pattern that catches them, and how to score end-task success instead of transcript edit distance.
TL;DR: the methodology in six steps
- Pull your accent distribution from live traffic. Weight by share, not by guesswork.
- Define accent personas with simulate-sdk. Accent string, age, speed, noise, multilingual toggle.
- Generate scenarios per intent. Auto-generate against the five intents that drive your call volume.
- Run the matrix. Persona times scenario times rows. Capture audio, transcripts, and span trees.
- Score end-task success first. conversation_resolution and task_completion are the launch gate. ASR/STT_accuracy and audio_quality are diagnostics.
- Tag production with accent_class. Feed live spans into the same evaluator pool. Cluster regressions in Error Feed.
The rest of the post is the why behind each step, the four failure modes that drive the design, and a worked plan for a healthcare voice agent launching across five Indian English substrates.
Why WER on Common Voice misses production
Mozilla Common Voice is a great research dataset and a terrible production benchmark. Three reasons.
It’s read speech. Speakers read scripted sentences in a quiet environment. Production callers don’t read. They think out loud, stumble, restart, fill silence with “um, the thing is, like”. The acoustic and prosodic profile of read speech is unlike spontaneous conversational speech across every accent.
Curated vocabulary. Common Voice prompts are designed to balance phonemes, not to look like real call traffic. Your domain has drug brand names, account numbers, street addresses, regional businesses, slang. The benchmark never saw any of it.
Wrong layer. WER is edit distance on the transcript. A 92% accurate transcript that drops “re-” from “reschedule” produces a 100% wrong end task. The number you actually care about is whether the appointment got booked, the prescription got refilled, the refund got issued. WER tells you almost nothing about that.
The fix isn’t to throw out ASR scoring. It’s to demote WER to a diagnostic and promote end-task success to the launch gate.
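To make the "wrong layer" point concrete, here is a minimal sketch in plain Python. The `word_error_rate` function is a standard word-level edit distance; `toy_intent` is a hypothetical keyword router standing in for the real LLM, and the two sentences are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def toy_intent(transcript: str) -> str:
    """Hypothetical keyword router; a stand-in for the downstream LLM."""
    words = transcript.lower().split()
    if "reschedule" in words:
        return "reschedule_appointment"
    if "schedule" in words:
        return "new_appointment"
    return "unknown"


spoken = "i would like to reschedule my appointment with the doctor next tuesday"
heard = "i would like to schedule my appointment with the doctor next tuesday"

wer = word_error_rate(spoken, heard)
print(f"WER: {wer:.3f}")
print(toy_intent(spoken), "->", toy_intent(heard))
```

One substituted word out of twelve keeps the transcript roughly 92% accurate, yet the routed intent flips from a reschedule to a new booking.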
The four failure modes nobody tests for
Production accent failures cluster on four modes that read-speech WER doesn’t catch.
1. Proper-noun failures. The single biggest cause of accent-driven end-task failures we’ve seen. The ASR model has no language-model prior on “Trelegy”, “Mounjaro”, “Sengkang Way”, “Andheri East”, or your CFO’s name. Accent shifts the phonetic profile of the proper noun just enough to push the recognition into the wrong bucket. Public WER scores look fine because the curated corpus didn’t contain the proper noun. Production fails because the proper noun is the slot the call hinges on.
2. Code-switching failures. Speakers in India, Singapore, parts of Africa and Europe switch between languages mid-sentence as a normal speech pattern. “I want to reschedule, theek hai, but I need to confirm the doctor”. A monolingual STT model resets context at the switch, drops half the utterance, or transcribes the non-English fragment as gibberish. The downstream LLM then operates on a corrupted transcript and routes to the wrong intent. WER on a monolingual corpus can’t even see this mode.
3. Conversational filler failures. “Um, you know, the thing is, basically, I was trying to”. Real callers fill silence. Filler crashes a poorly tuned VAD into an early end-of-turn. The agent interrupts mid-thought, the caller backs up and restarts, the conversation spirals. Read-speech corpora strip these out. Production calls are full of them, and they disproportionately affect accents that carry distinctive prosodic patterns into English.
4. Dialect-phrasing failures. “Do the needful”, “reckon”, “pavement repair”, “cot death”, “pram”. The ASR transcribes correctly. The LLM misinterprets because its training distribution skews to one regional sense over another. “Pavement repair” from a UK caller is sidewalk repair, not road resurfacing. A US-trained model defaults to the wrong sense and hands off to the wrong department. This is an LLM failure, not an ASR failure, and accent-WER benchmarks don’t measure it at all.
Each mode lives on a different layer. Proper nouns are an ASR + language-model issue. Code-switching is an STT-architecture issue. Fillers are a VAD + turn-taking issue. Dialect phrasing is an LLM-interpretation issue. A testing harness that scores only transcript accuracy misses three of the four.
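The filler mode is the easiest of the four to reproduce in isolation. The sketch below is a toy silence-based endpointer, not any vendor's VAD: the segment format, the durations, and both thresholds are assumptions chosen for illustration.

```python
from typing import List, Tuple

Segment = Tuple[str, int]  # ("speech" | "silence", duration in ms)


def end_of_turn_ms(segments: List[Segment], silence_threshold_ms: int) -> int:
    """Return the time at which a silence-based endpointer would close the turn."""
    elapsed = 0
    for kind, duration in segments:
        if kind == "silence" and duration >= silence_threshold_ms:
            # The endpointer fires as soon as the pause reaches the threshold.
            return elapsed + silence_threshold_ms
        elapsed += duration
    return elapsed  # no pause was long enough; the turn ends with the audio


# Hypothetical caller: "Um, the thing is ... (pause) ... I need to reschedule."
utterance = [
    ("speech", 900),    # "um, the thing is"
    ("silence", 600),   # thinking pause
    ("speech", 1400),   # "I need to reschedule"
    ("silence", 1200),  # genuine end of turn
]

early = end_of_turn_ms(utterance, silence_threshold_ms=500)
patient = end_of_turn_ms(utterance, silence_threshold_ms=700)
print(early, patient)
```

With the 500 ms threshold the turn closes inside the thinking pause, before the caller states the intent. Raising the threshold by 200 ms lets the full utterance through, at the cost of a slower response after every turn.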
Persona-driven generation: the simulate-sdk pattern
The right substitute for read-speech corpora is persona-driven generation against your live stack. simulate-sdk ships a Persona plus Scenario primitive that does this directly.
A Persona carries accent, age range, communication speed, background noise, and a multilingual toggle. The TTS layer renders the persona’s audio with the specified accent and prosodic profile. A Scenario carries the intent path and the goals the persona is trying to accomplish. The TestRunner drives the persona through the scenario against an AgentWrapper that wraps your actual agent (OpenAI, LangChain, Gemini, or Anthropic-backed). The run produces a TestReport with per-call transcripts, audio, and a span tree.
```python
# pip install simulate-sdk ai-evaluation
from fi.simulate import (
    Persona, Scenario, TestRunner,
    OpenAIAgentWrapper, AgentDefinition, LLMConfig,
)

# 12 personas weighted by your traffic logs, not by guesswork
personas = [
    Persona(name="tamil_substrate_f_28", traits={
        "accent": "South Indian Tamil-influenced English",
        "age_range": "25-32", "speed": "fast",
        "background_noise": "moderate", "multilingual": True,
    }),
    Persona(name="glasgow_m_45", traits={
        "accent": "Glasgow Scottish",
        "age_range": "40-50", "speed": "normal",
        "background_noise": "quiet",
    }),
    Persona(name="punjabi_substrate_m_52", traits={
        "accent": "Punjabi-influenced English with Hindi code-switching",
        "age_range": "50-60", "speed": "slow",
        "background_noise": "loud", "multilingual": True,
    }),
    # ... 9 more personas covering the rest of your accent distribution
]

scenarios = [
    Scenario(
        description="Caller wants to reschedule an existing appointment. "
                    "Some callers have the booking ID; some have only a partial date. "
                    "Some speakers code-switch between English and a regional language.",
        goals=["confirm caller identity", "locate booking", "offer next slot"],
    ),
    # ... 4 more scenarios for the top intents
]

agent = OpenAIAgentWrapper(AgentDefinition(
    name="reschedule-bot",
    llm_config=LLMConfig(model="gpt-4o", temperature=0.4),
    system_prompt="You are a reschedule assistant for a healthcare clinic.",
))

runner = TestRunner(agent_wrapper=agent, personas=personas, scenarios=scenarios)
report = runner.run()  # returns TestReport with pass_rate, per-call traces
```
Twelve personas times five scenarios at 100 rows per pair gives 6,000 fresh calls against the actual agent. Each call carries a TTS-rendered utterance with the accent prosody, gets transcribed by your real STT, goes through your real LLM, returns a real response. The fidelity is end-to-end, not a transcript proxy. Persona-driven generation also regenerates when intents shift; a static recording corpus rots the moment your call distribution changes.
End-task scoring vs transcript scoring
Six rubrics from ai-evaluation cover the accent surface across the four failure modes. They sit on different layers and answer different questions.
| Rubric | Layer | Question it answers | Use it as |
|---|---|---|---|
| conversation_resolution | end-task | Did the agent resolve the caller’s intent? | Launch gate |
| task_completion | end-task | Did the agent finish the specific task (booking, refill, refund)? | Launch gate |
| ASR/STT_accuracy | ASR | Did the transcript match the audio? | Diagnostic |
| audio_quality | audio | Was the audio intelligible to begin with? | Diagnostic |
| translation_accuracy | LLM | Did the agent’s paraphrase preserve meaning? | Code-switching diagnostic |
| cultural_sensitivity | LLM | Was the response appropriate for the locale? | Dialect-phrasing diagnostic |
The two end-task rubrics gate the launch. The remaining four explain why the end-task gate failed when it did. Don’t invert the order. A bot that hits 91% ASR/STT_accuracy but 64% conversation_resolution on Tamil-substrate English is still shipping a broken product. The transcript was fine; the agent still failed the call.
```python
from fi.evals import (
    Evaluator, ConversationResolution, TaskCompletion,
    ASRAccuracy, AudioQualityEvaluator,
    TranslationAccuracy, CulturalSensitivity,
)
from fi.testcases import MLLMTestCase, MLLMAudio, ConversationalTestCase, LLMTestCase

audio = MLLMAudio(url="recordings/tamil_substrate_reschedule_017.wav")
asr_case = MLLMTestCase(
    input=audio,
    query="Score STT transcript accuracy against the audio",
    expected_response="I need to reschedule my appointment for next Tuesday.",
)

conv = ConversationalTestCase(messages=[
    LLMTestCase(query="I need to reschedule my appointment.",
                response="I can help. What's the booking ID?"),
    LLMTestCase(query="I don't have it with me, sorry.",
                response="No problem. Can I take your name and phone number?"),
])

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[
        ConversationResolution(), TaskCompletion(),
        ASRAccuracy(), AudioQualityEvaluator(),
        TranslationAccuracy(), CulturalSensitivity(),
    ],
    inputs=[asr_case, conv],
)
```
The dashboard slices results by accent_class and by intent. A failure on Tamil-substrate English in the reschedule intent surfaces as one cell in a grid, not as a single global score. That’s the granularity you need to act on.
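If you want the same grid outside the dashboard, the slicing is a few lines of plain Python. This is a sketch under assumptions: the record fields (`accent_class`, `intent`, `resolved`) are illustrative and not the SDK's export schema, and the 0.85 gate is just an example threshold.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Cell = Tuple[str, str]  # (accent_class, intent)

# Hypothetical per-call records; field names are illustrative only.
calls = [
    {"accent_class": "tamil_substrate_english", "intent": "reschedule", "resolved": False},
    {"accent_class": "tamil_substrate_english", "intent": "reschedule", "resolved": True},
    {"accent_class": "tamil_substrate_english", "intent": "refill", "resolved": True},
    {"accent_class": "glasgow_scottish", "intent": "reschedule", "resolved": True},
    {"accent_class": "glasgow_scottish", "intent": "reschedule", "resolved": True},
]


def resolution_grid(records: Iterable[dict]) -> Dict[Cell, float]:
    """Resolution rate per (accent_class, intent) cell."""
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for rec in records:
        cell = (rec["accent_class"], rec["intent"])
        totals[cell] += 1
        resolved[cell] += int(rec["resolved"])
    return {cell: resolved[cell] / totals[cell] for cell in totals}


def failing_cells(grid: Dict[Cell, float], gate: float) -> List[Cell]:
    """Cells whose resolution rate falls below the launch gate."""
    return sorted(cell for cell, rate in grid.items() if rate < gate)


grid = resolution_grid(calls)
overall = sum(c["resolved"] for c in calls) / len(calls)
print(f"overall: {overall:.2f}")
print(failing_cells(grid, gate=0.85))
```

The global number hides where the failures sit; the per-cell view names the one accent and intent pair that needs work.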
Coverage discipline: which accents, which scenarios
Coverage is where most accent suites go wrong. Two failure modes: testing every accent in the world at low fidelity, or testing three accents at high fidelity while ignoring 40% of your traffic.
The discipline is to weight the persona matrix by your live traffic distribution, not by a generic English-accent list. Pull a month of production calls, tag a sample by accent class, get the distribution. If 38% of your traffic is Tamil-substrate English and 7% is Glasgow Scottish, the persona matrix should reflect those weights. A flat matrix that gives each accent equal sampling buys you nothing for the accents that drive your revenue.
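One way to turn a traffic distribution into a call budget is largest-remainder rounding, sketched below. The function name and the accent labels are hypothetical; the 38% and 7% shares echo the example above, and the other two shares are made up to fill out the distribution.

```python
from typing import Dict


def allocate_calls(traffic_share: Dict[str, float], total_calls: int) -> Dict[str, int]:
    """Split a call budget across accents in proportion to traffic share.

    Uses largest-remainder rounding so the allocations sum to total_calls.
    """
    total_share = sum(traffic_share.values())
    if total_share <= 0:
        raise ValueError("traffic shares must sum to a positive number")
    exact = {a: total_calls * s / total_share for a, s in traffic_share.items()}
    alloc = {a: int(v) for a, v in exact.items()}  # floor of each exact quota
    leftover = total_calls - sum(alloc.values())
    # Hand the remaining calls to the accents with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda a: exact[a] - alloc[a], reverse=True)
    for accent in by_remainder[:leftover]:
        alloc[accent] += 1
    return alloc


# Shares in percent; only the 38 and 7 come from the text.
shares = {
    "tamil_substrate_english": 38,
    "hindi_substrate_english": 30,
    "us_standard": 25,
    "glasgow_scottish": 7,
}
plan = allocate_calls(shares, total_calls=6000)
print(plan)
```

Because the weights are normalized inside the function, raw call counts from your logs work just as well as percentages.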
For an enterprise English deployment, we keep returning to this minimum set, then re-weight to match traffic:
| Accent | Typical traffic-weighted share |
|---|---|
| US Standard | 15-25% |
| US Southern | 5-10% |
| UK RP | 4-10% |
| UK Scottish, Northern | 3-10% |
| Canadian, Australian English | 5-14% |
| Indian English (Hindi substrate) | 10-25% |
| Indian English (Tamil substrate) | 5-15% |
| Indian English (Bengali, Punjabi) | 6-18% |
| Singapore / Malaysian English | 1-5% |
For each persona attach three orthogonal variants: communication speed (fast, normal, slow), background noise (quiet, moderate, loud), and one stress condition (telephony codec, low SNR, simultaneous speech). The arithmetic lands in the low thousands of calls, which the simulator runs in hours.
Re-weight monthly. Production traffic shifts; the matrix that was right in January is wrong by May.
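The variant expansion is a Cartesian product. The sketch below reads "three orthogonal variants" as a full crossing of all three axes, which is one interpretation; the accent labels and the calls-per-variant figure are assumptions, not values from the methodology.

```python
from itertools import product

# Variant axes named in the text.
SPEEDS = ["fast", "normal", "slow"]
NOISE_LEVELS = ["quiet", "moderate", "loud"]
STRESS_CONDITIONS = ["telephony_codec", "low_snr", "simultaneous_speech"]

# Hypothetical labels standing in for the nine rows of the accent table.
ACCENTS = [
    "us_standard", "us_southern", "uk_rp", "uk_scottish_northern",
    "canadian_australian", "indian_hindi_substrate", "indian_tamil_substrate",
    "indian_bengali_punjabi", "singapore_malaysian",
]

CALLS_PER_VARIANT = 10  # assumption: set this from your own budget


def build_variants(accents):
    """Cross every accent with every speed, noise level, and stress condition."""
    return [
        {"accent": a, "speed": s, "background_noise": n, "stress": c}
        for a, s, n, c in product(accents, SPEEDS, NOISE_LEVELS, STRESS_CONDITIONS)
    ]


variants = build_variants(ACCENTS)
total_calls = len(variants) * CALLS_PER_VARIANT
print(len(variants), total_calls)
```

Under these assumptions the nine accents expand to 243 variants and 2,430 calls. Sampling one stress condition per persona instead of crossing all three cuts that to a third.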
Production observability: the accent_class span attribute
Pre-launch testing catches the obvious failures. Production catches the residual. The same evaluator pool that scored the launch matrix should score live traffic, with accent_class as a span attribute on every call.
```python
from fi_instrumentation import using_attributes

# Set at the start of each call, inferred from caller metadata or
# a lightweight accent-classifier span run in parallel with STT
with using_attributes({
    "accent_class": "tamil_substrate_english",
    "telephony_codec": "g711",
    "intent": "reschedule",
}):
    response = voice_agent.handle_call(audio_stream)
```
traceAI auto-instruments OpenAI, LangChain, Groq, Gemini, and the rest of the wrappers. The spans land in the same backend the launch suite wrote to. The dashboard slices conversation_resolution by accent_class and by intent, which means a regression on Tamil-substrate English shows up as a cell turning red on a single dashboard, not as a customer-support escalation three weeks later.
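The "cell turning red" check can be approximated offline by comparing live resolution rates with the launch baseline. This is a sketch, not the dashboard's logic: the function name, the 5-point drop tolerance, and all the rates below are hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, str]  # (accent_class, intent)


def regressed_cells(
    baseline: Dict[Cell, float],
    live: Dict[Cell, float],
    max_drop: float = 0.05,
) -> List[Tuple[Cell, float]]:
    """Cells whose live resolution rate fell more than max_drop below baseline."""
    flagged = []
    for cell, base_rate in baseline.items():
        if cell not in live:
            continue  # no live traffic for this cell yet
        drop = base_rate - live[cell]
        if drop > max_drop:
            flagged.append((cell, round(drop, 4)))
    # Largest regressions first, so the worst cell tops the work queue.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical numbers for illustration only.
baseline = {
    ("tamil_substrate_english", "reschedule"): 0.88,
    ("tamil_substrate_english", "refill"): 0.90,
    ("glasgow_scottish", "reschedule"): 0.91,
}
live = {
    ("tamil_substrate_english", "reschedule"): 0.74,
    ("tamil_substrate_english", "refill"): 0.89,
    ("glasgow_scottish", "reschedule"): 0.92,
}

for cell, drop in regressed_cells(baseline, live):
    print(cell, f"-{drop:.2f}")
```

A fixed tolerance is the simplest possible rule. Cells with little live traffic will be noisy, so in practice you would likely want a minimum sample size before flagging.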
Failures cluster predictably in Error Feed. The Sonnet 4.5 Judge agent writes the immediate_fix field for each cluster. The clusters we see most often:
| Cluster | Layer | Typical immediate_fix |
|---|---|---|
| STT vocabulary gap on proper nouns | ASR | Add phonetic variants to the custom vocabulary list |
| Phonetic substitution on a specific consonant | ASR | Few-shot examples for the substituted form |
| Code-switch context reset | STT | Swap to a multilingual STT model for affected locales |
| VAD early-cut on conversational fillers | turn-taking | Raise VAD silence threshold by 150-250ms for affected personas |
| Dialect-phrase misinterpretation | LLM | System-prompt patch with regional context |
| Telephony codec degradation | network | Provider-side codec selection or codec-robust STT model |
Each cluster carries a trend signal (rising, steady, falling). The cluster view is the weekly work queue.
A worked plan: healthcare voice agent across five Indian substrates
A worked accent-test plan for a healthcare voice agent launching across five Indian English substrates. Pre-launch budget: five days of engineer time.
Day 1: persona matrix. Define 12 personas across Hindi, Tamil, Bengali, Punjabi, and Marathi substrates. Two age brackets per substrate. Three communication speeds. Multilingual toggle on where code-switching is common.
Day 2: scenario generation. Generate 100 rows per persona-scenario pair across five intents from production traffic: appointment booking, prescription refill, lab-result inquiry, billing question, coverage check. Verify the branch distribution matches actual traffic, not the LLM’s guess.
Day 3: test execution. Five scenarios times 12 personas times 100 rows is 6,000 simulated calls. Run time roughly 14 hours of parallel simulation against the live agent.
Day 4: triage. Error Feed clusters the failures. Top three from a real run:
- Tamil-substrate prescription names mistranscribed. 38% failure rate on Tamil-substrate refill calls. STT vocabulary missed common Tamil-pronounced drug brand names. Fix: 14 brand names with phonetic variants. Estimated conversation_resolution lift: +9 points.
- Bengali-substrate “next week” heard as “next time”. 24% failure rate on Bengali-substrate reschedule calls. Phonetic substitution on /w/. Fix: few-shot examples that disambiguate both forms. Lift: +5 points.
- Code-switch context reset. 17% failure rate on mixed-language coverage-check calls. STT model resets context at the language switch. Fix: multilingual STT for India-region calls. Lift: +6 points.
Day 5: re-test and ship. Re-run the same 6,000 simulated calls against the patched agent. conversation_resolution lifts from 71% to 88%. The 85% launch gate is cleared. From this point the matrix runs nightly against every release candidate.
The matrix is the regression suite. The launch is the first execution, not the last.
Where FAGI fits in the accent workflow
Four product surfaces carry the loop without glue code.
simulate-sdk generates the persona-driven calls. Persona carries accent, age, speed, noise, multilingual toggle. Scenario carries intent path and goals. TestRunner drives the matrix against your wrapped agent (OpenAI, LangChain, Gemini, Anthropic) and returns a TestReport.
ai-evaluation scores every call. The Apache 2.0 SDK ships conversation_resolution, task_completion, ASR/STT_accuracy, audio_quality, translation_accuracy, and cultural_sensitivity as built-in templates. MLLMAudio accepts seven audio formats so curated production recordings drop in without transcoding. Error Localization pinpoints the failing turn so triage is one click, not one playback.
traceAI extends the same scoring to production. The OpenTelemetry-based SDK auto-instruments the wrappers and emits spans with PII redaction. Tag accent_class as a span attribute; dashboards slice live conversation_resolution by accent the same way the launch matrix did.
Agent Command Center runs Protect on the voice path. Four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) plus Protect Flash score transcribed speech at 65ms median time-to-label per arXiv 2510.13351. Prompt-injection attempts that come in through voice channels (yes, they do) get caught before they reach the LLM, and PII in transcripts gets masked before it lands in logs. The gateway is a single Go binary, Apache 2.0, benchmarked at ~29k req/s and P99 ≤ 21ms with guardrails on, on t3.xlarge.
The reason the loop closes inside a sprint: the four surfaces share datasets, evaluators, and a trend line. Personas feed the launch matrix. The launch matrix produces a baseline. The baseline becomes the regression suite. Production spans feed the same evaluator pool. Error Feed clusters write immediate_fix back into the work queue.
Honest tradeoffs
Two calls worth naming.
Persona-driven generation is only as good as the persona library and the TTS engine. The accent prosody is rendered, not recorded. For accents where the TTS layer doesn’t carry enough fidelity (some sub-regional dialects), pair the suite with a smaller recording dataset for the long tail. ElevenLabs and Cartesia voice cloning fill most of the gap; the residual is real.
End-task scoring requires you to define the end task precisely. Vague task definitions make conversation_resolution an LLM-judge whim. Spend the time to write tight goal statements per scenario. The investment pays back the first time the suite catches a regression that WER would have missed.
Related reading
- The Voice AI Evaluation Infrastructure: Developer’s Guide: the rubric architecture this methodology sits on top of.
- WER for Voice Agents: Beyond the 2026 Baseline: why WER is a diagnostic, not a gate.
- Multilingual Voice AI Testing: the cross-language extension of this matrix.
- Voice Agent ASR Failure Modes: deeper taxonomy of the ASR layer.
Sources and references
- ai-evaluation source: github.com/future-agi/ai-evaluation (templates: ASRAccuracy, ConversationResolution, TaskCompletion, AudioQualityEvaluator, TranslationAccuracy, CulturalSensitivity)
- simulate-sdk source: docs.futureagi.com/docs/simulation
- traceAI source: github.com/future-agi/traceAI
- Agent Command Center docs: docs.futureagi.com/docs/command-center
- Future AGI trust page: futureagi.com/trust
- Protect model family: arxiv.org/abs/2510.13351