Accent and Dialect Testing for Voice AI Agents: A 2026 Methodology

Accent testing is not WER on Common Voice. This 2026 methodology catches the proper-noun, code-switching, filler, and dialect failures that standard benchmarks miss.

February 12, 2026

Updated May 20, 2026

Accent testing is not WER on Common Voice. The failures that ship to production live in the words a public benchmark never saw: proper nouns the model has no prior on, code-switches between two languages mid-utterance, conversational fillers that crash a poorly tuned VAD, and dialect-specific phrasing that carries locale-specific meaning. A 5% WER on a curated read-speech corpus can sit alongside a 32% end-task failure rate on Tamil-substrate English callers in your live stack, and you’d never know from the benchmark. This methodology shows the four failure modes standard accent benchmarks miss, the persona-driven generation pattern that catches them, and how to score end-task success instead of transcript edit distance.

TL;DR: the methodology in six steps

1. Pull your accent distribution from live traffic. Weight by share, not by guesswork.
2. Define accent personas with simulate-sdk. Accent string, age, speed, noise, multilingual toggle.
3. Generate scenarios per intent. Auto-generate against the five intents that drive your call volume.
4. Run the matrix. Persona times scenario times rows. Capture audio, transcripts, and span trees.
5. Score end-task success first. conversation_resolution and task_completion are the launch gate. ASR/STT_accuracy and audio_quality are diagnostics.
6. Tag production with accent_class. Feed live spans into the same evaluator pool. Cluster regressions in Error Feed.

The rest of the post is the why behind each step, the four failure modes that drive the design, and a worked plan for a healthcare voice agent launching across five Indian English substrates.

Why WER on Common Voice misses production

Mozilla Common Voice is a great research dataset and a terrible production benchmark. Three reasons.

It’s read speech. Speakers read scripted sentences in a quiet environment. Production callers don’t read. They think out loud, stumble, restart, and fill silence with “um, the thing is, like”. In every accent, the acoustic and prosodic profile of read speech differs from that of spontaneous conversational speech.

Curated vocabulary. Common Voice prompts are designed to balance phonemes, not to look like real call traffic. Your domain has drug brand names, account numbers, street addresses, regional businesses, slang. The benchmark never saw any of it.

Wrong layer. WER is edit distance on the transcript. A 92% accurate transcript that drops “re-” from “reschedule” produces a 100% wrong end task. The number you actually care about is whether the appointment got booked, the prescription got refilled, the refund got issued. WER tells you almost nothing about that.
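The gap between transcript score and end task is easy to reproduce. The sketch below computes word-level WER with a plain Levenshtein distance and pairs it with a deliberately naive keyword intent lookup. The sentences, the `naive_intent` helper, and the intent labels are illustrative assumptions, not output from any real STT or agent.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def naive_intent(transcript: str) -> str:
    """Hypothetical keyword router, only here to show the downstream effect."""
    words = transcript.lower().split()
    if "reschedule" in words:
        return "reschedule"
    if "schedule" in words:
        return "new_booking"
    return "unknown"


reference = "i need to reschedule my appointment for next tuesday"
hypothesis = "i need to schedule my appointment for next tuesday"  # dropped "re-"

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")  # one substitution in nine words
print("intent from reference: ", naive_intent(reference))
print("intent from hypothesis:", naive_intent(hypothesis))
```

One wrong word out of nine leaves the transcript looking healthy, yet the two transcripts route to different intents. That is the case a transcript-level gate cannot see.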

The fix isn’t to throw out ASR scoring. It’s to demote WER to a diagnostic and promote end-task success to the launch gate. The end-to-end voice AI evaluation methodology covers how to score that gate.

The four failure modes nobody tests for

Production accent failures cluster on four modes that read-speech WER doesn’t catch.

1. Proper-noun failures. The single biggest cause of accent-driven end-task failures we’ve seen. The ASR model has no language-model prior on “Trelegy”, “Mounjaro”, “Sengkang Way”, “Andheri East”, or your CFO’s name. Accent shifts the phonetic profile of the proper noun just enough to push the recognition into the wrong bucket. Public WER scores look fine because the curated corpus didn’t contain the proper noun. Production fails because the proper noun is the slot the call hinges on.

2. Code-switching failures. Speakers in India, Singapore, parts of Africa and Europe switch between languages mid-sentence as a normal speech pattern. “I want to reschedule, theek hai, but I need to confirm the doctor”. A monolingual STT model resets context at the switch, drops half the utterance, or transcribes the non-English fragment as gibberish. The downstream LLM then operates on a corrupted transcript and routes to the wrong intent. WER on a monolingual corpus can’t even see this mode.

3. Conversational filler failures. “Um, you know, the thing is, basically, I was trying to”. Real callers fill silence. Filler crashes a poorly tuned VAD into an early end-of-turn. The agent interrupts mid-thought, the caller backs up and restarts, the conversation spirals. Read-speech corpora strip these out. Production calls are full of them, and they disproportionately affect accents that carry distinctive prosodic patterns into English.

4. Dialect-phrasing failures. “Do the needful”, “reckon”, “pavement repair”, “cot death”, “pram”. The ASR transcribes correctly. The LLM misinterprets because its training distribution skews to one regional sense over another. “Pavement repair” from a UK caller is sidewalk repair, not road resurfacing. A US-trained model defaults to the wrong sense and hands off to the wrong department. This is an LLM failure, not an ASR failure, and accent-WER benchmarks don’t measure it at all.

Each mode lives on a different layer. Proper nouns are an ASR + language-model issue. Code-switching is an STT-architecture issue. Fillers are a VAD + turn-taking issue. Dialect phrasing is an LLM-interpretation issue. A testing harness that scores only transcript accuracy misses three of the four.
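Of the four, the filler mode is the simplest to reproduce in isolation. The sketch below models a silence-threshold end-of-turn detector over a hand-written timeline of speech and pause segments. The segment durations and both thresholds are made-up illustration values, and real VADs use more signals than a single silence timer.

```python
from typing import List, Optional, Tuple

Segment = Tuple[str, int]  # (kind, duration in ms); kind is "speech" or "silence"


def end_of_turn_ms(segments: List[Segment], silence_threshold_ms: int) -> Optional[int]:
    """Offset at which a pure silence-threshold detector declares end of turn.

    Returns None if no pause is long enough to trigger it.
    """
    elapsed = 0
    for kind, duration in segments:
        if kind == "silence" and duration >= silence_threshold_ms:
            # The detector fires as soon as the pause reaches the threshold.
            return elapsed + silence_threshold_ms
        elapsed += duration
    return None


# "I was trying to ... um ... the thing is ... reschedule my appointment."
utterance: List[Segment] = [
    ("speech", 900),
    ("silence", 620),   # thinking pause before the filler
    ("speech", 400),    # "um, the thing is"
    ("silence", 550),
    ("speech", 2100),   # the actual request
    ("silence", 1200),  # the caller has really finished
]

tight = end_of_turn_ms(utterance, silence_threshold_ms=500)
relaxed = end_of_turn_ms(utterance, silence_threshold_ms=700)
print("500 ms threshold fires at", tight, "ms")    # inside the first pause
print("700 ms threshold fires at", relaxed, "ms")  # after the request
```

With the tighter threshold the agent takes the turn before the caller has said what they want. Raising it by 200 ms lets the same utterance complete, at the cost of a slower response on every turn.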

Persona-driven generation: the simulate-sdk pattern

The right substitute for read-speech corpora is persona-driven generation against your live stack. simulate-sdk ships a Persona plus Scenario primitive that does this directly.

A Persona carries accent, age range, communication speed, background noise, and a multilingual toggle. The TTS layer renders the persona’s audio with the specified accent and prosodic profile. A Scenario carries the intent path and the goals the persona is trying to accomplish. The TestRunner drives the persona through the scenario against an AgentWrapper that wraps your actual agent (OpenAI, LangChain, Gemini, or Anthropic-backed). The run produces a TestReport with per-call transcripts, audio, and a span tree.

# pip install simulate-sdk ai-evaluation
from fi.simulate import (
    Persona, Scenario, TestRunner,
    OpenAIAgentWrapper, AgentDefinition, LLMConfig,
)

# 12 personas weighted by your traffic logs, not by guesswork
personas = [
    Persona(name="tamil_substrate_f_28", traits={
        "accent": "South Indian Tamil-influenced English",
        "age_range": "25-32", "speed": "fast",
        "background_noise": "moderate", "multilingual": True,
    }),
    Persona(name="glasgow_m_45", traits={
        "accent": "Glasgow Scottish",
        "age_range": "40-50", "speed": "normal",
        "background_noise": "quiet",
    }),
    Persona(name="punjabi_substrate_m_52", traits={
        "accent": "Punjabi-influenced English with Hindi code-switching",
        "age_range": "50-60", "speed": "slow",
        "background_noise": "loud", "multilingual": True,
    }),
    # ... 9 more personas covering the rest of your accent distribution
]

scenarios = [
    Scenario(
        description="Caller wants to reschedule an existing appointment. "
                    "Some callers have the booking ID; some have only a partial date. "
                    "Some speakers code-switch between English and a regional language.",
        goals=["confirm caller identity", "locate booking", "offer next slot"],
    ),
    # ... 4 more scenarios for the top intents
]

agent = OpenAIAgentWrapper(AgentDefinition(
    name="reschedule-bot",
    llm_config=LLMConfig(model="gpt-4o", temperature=0.4),
    system_prompt="You are a reschedule assistant for a healthcare clinic.",
))

runner = TestRunner(agent_wrapper=agent, personas=personas, scenarios=scenarios)
report = runner.run()  # returns TestReport with pass_rate, per-call traces

Twelve personas times five scenarios at 100 rows per pair gives 6,000 fresh calls against the actual agent. Each call carries a TTS-rendered utterance with the accent prosody, gets transcribed by your real STT, goes through your real LLM, returns a real response. The fidelity is end-to-end, not a transcript proxy. Persona-driven generation also regenerates when intents shift; a static recording corpus rots the moment your call distribution changes.

End-task scoring vs transcript scoring

Six rubrics from ai-evaluation cover the accent surface across the four failure modes. They sit on different layers and answer different questions.

Rubric	Layer	Question it answers	Use it as
`conversation_resolution`	end-task	Did the agent resolve the caller’s intent?	Launch gate
`task_completion`	end-task	Did the agent finish the specific task (booking, refill, refund)?	Launch gate
`ASR/STT_accuracy`	ASR	Did the transcript match the audio?	Diagnostic
`audio_quality`	audio	Was the audio intelligible to begin with?	Diagnostic
`translation_accuracy`	LLM	Did the agent’s paraphrase preserve meaning?	Code-switching diagnostic
`cultural_sensitivity`	LLM	Was the response appropriate for the locale?	Dialect-phrasing diagnostic

The two end-task rubrics gate the launch. The remaining four explain why the end-task gate failed when it did. Don’t invert the order. A bot that hits 91% ASR/STT_accuracy but 64% conversation_resolution on Tamil-substrate English is still shipping a broken product. The transcript was fine; the agent still failed the call.

from fi.evals import (
    Evaluator, ConversationResolution, TaskCompletion,
    ASRAccuracy, AudioQualityEvaluator,
    TranslationAccuracy, CulturalSensitivity,
)
from fi.testcases import MLLMTestCase, MLLMAudio, ConversationalTestCase, LLMTestCase

audio = MLLMAudio(url="recordings/tamil_substrate_reschedule_017.wav")
asr_case = MLLMTestCase(
    input=audio,
    query="Score STT transcript accuracy against the audio",
    expected_response="I need to reschedule my appointment for next Tuesday.",
)
conv = ConversationalTestCase(messages=[
    LLMTestCase(query="I need to reschedule my appointment.",
                response="I can help. What's the booking ID?"),
    LLMTestCase(query="I don't have it with me, sorry.",
                response="No problem. Can I take your name and phone number?"),
])

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[
        ConversationResolution(), TaskCompletion(),
        ASRAccuracy(), AudioQualityEvaluator(),
        TranslationAccuracy(), CulturalSensitivity(),
    ],
    inputs=[asr_case, conv],
)

The dashboard slices results by accent_class and by intent. A failure on Tamil-substrate English in the reschedule intent surfaces as one cell in a grid, not as a single global score. That’s the granularity you need to act on.
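If you want the same grid outside the dashboard, the aggregation is small. The sketch below groups per-call results by (accent_class, intent) and lists the cells under a gate. The record shape, the `resolved` flag, the sample counts, and the 0.85 threshold are all assumptions for illustration. They are not the SDK's result schema.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[str, str]  # (accent_class, intent)
LAUNCH_GATE = 0.85      # assumed gate; set your own


def make_calls(accent: str, intent: str, resolved: int, total: int) -> List[dict]:
    """Build synthetic per-call records for the example."""
    return [
        {"accent_class": accent, "intent": intent, "resolved": i < resolved}
        for i in range(total)
    ]


def resolution_grid(calls: List[dict]) -> Dict[Cell, float]:
    """Resolution rate per (accent_class, intent) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [resolved, total]
    for call in calls:
        cell = (call["accent_class"], call["intent"])
        counts[cell][0] += 1 if call["resolved"] else 0
        counts[cell][1] += 1
    return {cell: resolved / total for cell, (resolved, total) in counts.items()}


def failing_cells(grid: Dict[Cell, float], gate: float = LAUNCH_GATE) -> List[Cell]:
    """Cells whose resolution rate is below the gate, sorted for stable output."""
    return sorted(cell for cell, rate in grid.items() if rate < gate)


calls = (
    make_calls("tamil_substrate_english", "reschedule", resolved=6, total=10)
    + make_calls("tamil_substrate_english", "refill", resolved=9, total=10)
    + make_calls("glasgow_scottish", "reschedule", resolved=9, total=10)
)
grid = resolution_grid(calls)
failing = failing_cells(grid)
print(failing)
```

The global rate across these 30 calls is 80%, which hides the fact that one cell sits at 60% while the other two clear the gate.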

Coverage discipline: which accents, which scenarios

Coverage is where most accent suites go wrong. Two failure modes: testing every accent in the world at low fidelity, or testing three accents at high fidelity while ignoring 40% of your traffic.

The discipline is to weight the persona matrix by your live traffic distribution, not by a generic English-accent list. Pull a month of production calls, tag a sample by accent class, get the distribution. If 38% of your traffic is Tamil-substrate English and 7% is Glasgow Scottish, the persona matrix should reflect those weights. A flat matrix that gives each accent equal sampling buys you nothing for the accents that drive your revenue.
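A minimal sketch of that weighting, assuming you already have integer percentage shares per accent class. It splits a fixed call budget proportionally and hands out rounding leftovers by largest remainder so the total always matches. Only the 38% and 7% figures come from the example above; the other two shares and the 6,000-call budget are placeholders.

```python
from typing import Dict


def allocate_calls(traffic_share: Dict[str, int], total_calls: int) -> Dict[str, int]:
    """Split total_calls across accents in proportion to integer traffic shares."""
    total_share = sum(traffic_share.values())
    if total_share <= 0:
        raise ValueError("traffic shares must sum to a positive number")
    # Integer quotient and remainder per accent avoid float rounding drift.
    quotas = {
        accent: divmod(total_calls * share, total_share)
        for accent, share in traffic_share.items()
    }
    allocation = {accent: quotient for accent, (quotient, _) in quotas.items()}
    leftover = total_calls - sum(allocation.values())
    # Largest-remainder rule: the accents closest to the next whole call get it.
    by_remainder = sorted(quotas, key=lambda accent: quotas[accent][1], reverse=True)
    for accent in by_remainder[:leftover]:
        allocation[accent] += 1
    return allocation


shares = {
    "tamil_substrate_english": 38,
    "hindi_substrate_english": 30,  # placeholder share
    "us_standard": 25,              # placeholder share
    "glasgow_scottish": 7,
}
plan = allocate_calls(shares, total_calls=6000)
print(plan)
```

In practice you may want a floor per accent so that a 1% class still gets enough rows to produce a stable rate. That floor is a judgment call and is left out here.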

For an enterprise English deployment, we keep returning to this minimum set, then re-weight to match traffic:

Accent	Typical traffic-weighted share
US Standard	15-25%
US Southern	5-10%
UK RP	4-10%
UK Scottish, Northern	3-10%
Canadian, Australian English	5-14%
Indian English (Hindi substrate)	10-25%
Indian English (Tamil substrate)	5-15%
Indian English (Bengali, Punjabi)	6-18%
Singapore / Malaysian English	1-5%

For each persona attach three orthogonal variants: communication speed (fast, normal, slow), background noise (quiet, moderate, loud), and one stress condition (telephony codec, low SNR, simultaneous speech). The arithmetic lands in the low thousands of calls, which the simulator runs in hours.
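The arithmetic can be checked by enumerating the matrix. The sketch below crosses the nine accent rows with speed and noise, rotates one stress condition onto each variant so that stress is not tied to a single noise level, and multiplies by five intents. The identifier names, the rotation scheme, and the five rows per cell are assumptions, so treat the total as an order-of-magnitude check rather than a prescription.

```python
from collections import Counter
from itertools import product

ACCENTS = [
    "us_standard", "us_southern", "uk_rp", "uk_scottish_northern",
    "canadian_australian", "indian_hindi", "indian_tamil",
    "indian_bengali_punjabi", "singapore_malaysian",
]
SPEEDS = ["fast", "normal", "slow"]
NOISE = ["quiet", "moderate", "loud"]
STRESS = ["telephony_codec", "low_snr", "simultaneous_speech"]
INTENTS = ["booking", "refill", "lab_result", "billing", "coverage_check"]
ROWS_PER_CELL = 5  # assumed; raise it for the accents that carry most traffic

variants = []
for i, (accent, speed, noise) in enumerate(product(ACCENTS, SPEEDS, NOISE)):
    # Shift the rotation once per (accent, speed) group so each noise level
    # is paired with every stress condition across the matrix.
    stress = STRESS[(i + i // len(NOISE)) % len(STRESS)]
    variants.append({
        "accent": accent,
        "speed": speed,
        "background_noise": noise,
        "stress": stress,
    })

stress_counts = Counter(v["stress"] for v in variants)
total_calls = len(variants) * len(INTENTS) * ROWS_PER_CELL
print(len(variants), "persona variants,", total_calls, "calls")
```

Crossing all three stress conditions with every variant instead of rotating one would triple the matrix. The rotation keeps coverage of each condition even without paying for the full product.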

Re-weight monthly. Production traffic shifts; the matrix that was right in January is wrong by May.

Production observability: the accent_class span attribute

Pre-launch testing catches the obvious failures. Production catches the residual. The same evaluator pool that scored the launch matrix should score live traffic, with accent_class as a span attribute on every call.

from fi_instrumentation import using_attributes

# Set at the start of each call, inferred from caller metadata or
# a lightweight accent-classifier span run in parallel with STT
with using_attributes({
    "accent_class": "tamil_substrate_english",
    "telephony_codec": "g711",
    "intent": "reschedule",
}):
    response = voice_agent.handle_call(audio_stream)

traceAI auto-instruments OpenAI, LangChain, Groq, Gemini, and the rest of the wrappers. The spans land in the same backend the launch suite wrote to. The dashboard slices conversation_resolution by accent_class and by intent, which means a regression on Tamil-substrate English shows up as a cell turning red on a single dashboard, not as a customer-support escalation three weeks later.

Failures cluster predictably in Error Feed. The Sonnet 4.5 Judge agent writes the immediate_fix field for each cluster. The clusters we see most often:

Cluster	Layer	Typical immediate_fix
STT vocabulary gap on proper nouns	ASR	Add phonetic variants to the custom vocabulary list
Phonetic substitution on a specific consonant	ASR	Few-shot examples for the substituted form
Code-switch context reset	STT	Swap to a multilingual STT model for affected locales
VAD early-cut on conversational fillers	turn-taking	Raise VAD silence threshold by 150-250 ms for affected personas
Dialect-phrase misinterpretation	LLM	System-prompt patch with regional context
Telephony codec degradation	network	Provider-side codec selection or codec-robust STT model

Each cluster carries a trend signal (rising, steady, falling). The cluster view is the weekly work queue.
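How Error Feed derives the trend signal is not described here. If you want a comparable signal on your own cluster counts, one simple rule is to compare the latest week against the mean of the weeks before it. The sketch below does that. The 10% tolerance band and the sample counts are assumptions.

```python
from typing import Sequence


def trend(weekly_counts: Sequence[int], tolerance: float = 0.10) -> str:
    """Label a cluster rising, steady, or falling.

    Compares the most recent week to the mean of all earlier weeks and
    treats changes inside the tolerance band as steady.
    """
    if len(weekly_counts) < 2:
        raise ValueError("need at least two weeks of counts")
    *history, latest = weekly_counts
    baseline = sum(history) / len(history)
    if latest > baseline * (1 + tolerance):
        return "rising"
    if latest < baseline * (1 - tolerance):
        return "falling"
    return "steady"


# Hypothetical weekly failure counts per cluster.
clusters = {
    "stt_vocabulary_gap_proper_nouns": [10, 12, 20],
    "code_switch_context_reset": [20, 18, 9],
    "vad_early_cut_on_fillers": [10, 10, 10],
}
signals = {name: trend(counts) for name, counts in clusters.items()}
print(signals)
```

Raw counts move with call volume, so on a real queue it is safer to feed this failures per thousand calls for the affected accent class.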

A worked plan: healthcare voice agent across five Indian substrates

This plan covers a healthcare voice agent launching across five Indian English substrates. Pre-launch budget: five days of engineer time.

Day 1: persona matrix. Define 12 personas across Hindi, Tamil, Bengali, Punjabi, and Marathi substrates. Two age brackets per substrate. Three communication speeds. Multilingual toggle on where code-switching is common.

Day 2: scenario generation. Generate 100 rows per persona for each of the five intents that dominate production traffic: appointment booking, prescription refill, lab-result inquiry, billing question, coverage check. Verify the branch distribution matches actual traffic, not the LLM’s guess.

Day 3: test execution. Five scenarios times 12 personas times 100 rows is 6,000 simulated calls. Run time roughly 14 hours of parallel simulation against the live agent.

Day 4: triage. Error Feed clusters the failures. Top three from a real run:

1. Tamil-substrate prescription names mistranscribed. 38% failure rate on Tamil-substrate refill calls. STT vocabulary missed common Tamil-pronounced drug brand names. Fix: 14 brand names with phonetic variants. Estimated conversation_resolution lift: +9 points.
2. Bengali-substrate “next week” heard as “next time”. 24% failure rate on Bengali-substrate reschedule calls. Phonetic substitution on /w/. Fix: few-shot examples that disambiguate both forms. Lift: +5 points.
3. Code-switch context reset. 17% failure rate on mixed-language coverage-check calls. STT model resets context at the language switch. Fix: multilingual STT for India-region calls. Lift: +6 points.

Day 5: re-test and ship. Re-run the same 6,000 calls against the patched agent. conversation_resolution lifts from 71% to 88%. The 85% launch gate is cleared. From this point the matrix runs nightly against every release candidate.

The matrix is the regression suite. The launch is the first execution, not the last.

Where FAGI fits in the accent workflow

Four product surfaces carry the loop without glue code.

simulate-sdk generates the persona-driven calls. Persona carries accent, age, speed, noise, multilingual toggle. Scenario carries intent path and goals. TestRunner drives the matrix against your wrapped agent (OpenAI, LangChain, Gemini, Anthropic) and returns a TestReport.

ai-evaluation scores every call. The Apache 2.0 SDK ships conversation_resolution, task_completion, ASR/STT_accuracy, audio_quality, translation_accuracy, and cultural_sensitivity as built-in templates. MLLMAudio accepts seven audio formats so curated production recordings drop in without transcoding. Error Localization pinpoints the failing turn so triage is one click, not one playback.

traceAI extends the same scoring to production. The OpenTelemetry-based SDK auto-instruments the wrappers and emits spans with PII redaction. Tag accent_class as a span attribute; dashboards slice live conversation_resolution by accent the same way the launch matrix did.

Agent Command Center runs Protect on the voice path. Four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) plus Protect Flash score transcribed speech at 65ms median time-to-label per arXiv 2510.13351. Prompt-injection attempts that come in through voice channels (yes, they do) get caught before they reach the LLM, and PII in transcripts gets masked before it lands in logs. The gateway is a single Go binary, Apache 2.0, benchmarked at ~29k req/s and P99 ≤ 21ms with guardrails on, on t3.xlarge.

The reason the loop closes inside a sprint: the four surfaces share datasets, evaluators, and a trend line. Personas feed the launch matrix. The launch matrix produces a baseline. The baseline becomes the regression suite. Production spans feed the same evaluator pool. Error Feed clusters write immediate_fix back into the work queue.

Honest tradeoffs

Two calls worth naming.

Persona-driven generation is only as good as the persona library and the TTS engine. The accent prosody is rendered, not recorded. For accents where the TTS layer doesn’t carry enough fidelity (some sub-regional dialects), pair the suite with a smaller recording dataset for the long tail. ElevenLabs and Cartesia voice cloning fill most of the gap; the residual is real.

End-task scoring requires you to define the end task precisely. Vague task definitions make conversation_resolution an LLM-judge whim. Spend the time to write tight goal statements per scenario. The investment pays back the first time the suite catches a regression that WER would have missed.

Related reading

The Voice AI Evaluation Infrastructure: Developer’s Guide: the rubric architecture this methodology sits on top of.
WER for Voice Agents: Beyond the 2026 Baseline: why WER is a diagnostic, not a gate.
Multilingual Voice AI Testing: the cross-language extension of this matrix.
Voice Agent ASR Failure Modes: deeper taxonomy of the ASR layer.

Sources and references

ai-evaluation source: github.com/future-agi/ai-evaluation (templates: ASRAccuracy, ConversationResolution, TaskCompletion, AudioQualityEvaluator, TranslationAccuracy, CulturalSensitivity)
simulate-sdk source: docs.futureagi.com/docs/simulation
traceAI source: github.com/future-agi/traceAI
Agent Command Center docs: docs.futureagi.com/docs/command-center
Future AGI trust page: futureagi.com/trust
Protect model family: arxiv.org/abs/2510.13351

Frequently asked questions

Why does WER on Common Voice fail as an accent benchmark?

WER on Common Voice tells you how the ASR model performs on read speech of curated sentences. Production voice agents don't see read speech. They see callers stumbling through proper nouns, switching between two languages mid-sentence, filling gaps with um, like, you know, and using dialect-specific phrasing the benchmark never saw. A 5% WER on Common Voice can sit alongside a 32% intent-failure rate on Tamil-accented callers in production. The right substitute is end-task scoring on accent-targeted scenarios that look like your actual call distribution, not a generic read-speech corpus. Score conversation_resolution and task_completion, not transcript edit distance.

Which four failure modes do standard accent benchmarks miss?

Four production failures hide outside read-speech WER. First, proper nouns: drug brand names, street names, account holders, regional businesses. The benchmark never saw the word, so the score looks fine while the agent silently fails. Second, code-switching: callers shifting from English to Hindi mid-sentence, common across India, Singapore, and parts of Africa. Third, conversational fillers: 'um, the thing is, like, basically' pad the way speakers actually talk and crash a poorly tuned VAD. Fourth, dialect-specific phrasing like 'do the needful', 'reckon', 'pavement repair' that carries locale-specific meaning. None of the four shows up in WER on a public corpus.

How does persona-driven generation differ from accent recording datasets?

Recording datasets are static, expensive to expand, and they capture read speech or scripted dialog. Persona-driven generation produces fresh utterances against your specific intents using accent-conditioned voices. With simulate-sdk you define a Persona (accent, age, communication speed, background noise) plus a Scenario (the intent path and expected goals). The TestRunner drives the persona through the scenario against your live agent and produces a TestReport with per-turn results. The library scales: 12 personas times 5 intents at 100 rows per pair is 6,000 fresh calls against your actual stack.

Why score end-task success instead of transcript accuracy?

Transcript accuracy is necessary but not sufficient. An ASR layer can produce a 92% accurate transcript that still drops the one syllable that mattered, like the 're-' that turns 'reschedule' into 'schedule'. End-task scoring asks the question the customer cares about: did the agent book the appointment, refill the prescription, route to the right department, complete the refund? Use conversation_resolution and task_completion as the launch gate. Use ASR/STT_accuracy and audio_quality as diagnostics to explain why the end task failed, not as the gate itself. The two kinds of metric live on different layers and answer different questions.

What coverage discipline keeps an accent test suite useful?

Start with the accent distribution your traffic logs show, not a generic list. If 38% of your traffic is Tamil-substrate English and 7% is Glasgow Scottish, weight the persona matrix accordingly. Pair each persona with the top five intents that drive your call volume. Add three orthogonal variants per persona: communication speed (fast, normal, slow), background noise (quiet, moderate, loud), and one stress condition (telephony codec, low signal, simultaneous speech). The result is a launch matrix sized in the low thousands of calls, not the millions. Re-weight monthly as production traffic shifts.

How do you surface accent failures in production after launch?

Tag every production span with the inferred accent class as a span attribute. traceAI's auto-instrumentation feeds the spans into the same evaluator pool that ran the launch matrix. The Error Feed clusters failures into named issues (STT vocabulary gap, phonetic substitution, code-switch break, telephony-codec degradation) with root cause and immediate fix written by the Sonnet 4.5 Judge agent. The dashboard slices conversation_resolution by accent_class so a regression on a single accent surfaces inside hours, not weeks. The launch matrix becomes the regression suite for every release.

Where does the FAGI stack fit in the accent workflow?

simulate-sdk generates persona-driven calls with accent-conditioned voices and runs them through your actual agent. ai-evaluation scores each call with the conversation_resolution, task_completion, ASR/STT_accuracy, audio_quality, translation_accuracy, and cultural_sensitivity rubrics from the Apache 2.0 SDK. traceAI feeds production spans into the same scoring layer with accent_class as a span attribute. The Agent Command Center runs Protect on the voice path to catch prompt injection in transcribed speech and PII before it lands in logs. The four surfaces share datasets, evaluators, and a trend line, which is what makes the loop closeable inside a single sprint.
