Why WER Isn't Enough for Voice Agents: 2026 Beyond-WER Metrics
WER measures word accuracy but misses what voice agents break on. Intent preservation, entity F1, timing, and task-completion correlation are the 2026 metrics that matter.
Word Error Rate is the metric every ASR vendor leads its scorecard with. It’s the metric every voice agent team inherits when they pick an STT provider. And it’s the metric that quietly fails them once the agent goes live. WER counts edits between a transcript and a reference. It treats every word as equally important. A voice agent doesn’t. Missing a customer name breaks the call. Missing a filler word doesn’t. The 2026 voice stack needs metrics that score what the agent actually breaks on.
TL;DR (when WER fails and what to add)
WER is fine for ASR vendor comparison. It’s the wrong primary metric for a voice agent. The agent metrics that actually predict production behavior are intent preservation, entity F1, semantic similarity, and downstream task-completion correlation. Run the first three on every turn alongside WER, and run the fourth per scenario. Use ai-evaluation to author custom rubrics for each.
| Metric | What it measures | When WER misses it |
|---|---|---|
| WER | Edit distance to gold transcript | Treats name and filler word as equal weight |
| Intent preservation | Does the transcript route to the same agent action | Errors land on filler words, intent stays |
| Entity F1 | Named-entity recall plus precision | Entity errors hide inside acceptable overall WER |
| Semantic similarity | Embedding-distance to reference meaning | Paraphrased speech reads as high WER but same meaning |
| Task-completion correlation | End-to-end agent outcome delta | WER drops 2% but agent breaks on a new entity class |
Why WER became the default
WER comes out of 1970s speech recognition research. The reference is a hand-aligned transcript. The hypothesis is the ASR output. The score is the sum of substitutions, insertions, and deletions divided by the reference length. The number is easy to interpret. The benchmark suites (LibriSpeech, CommonVoice, SwitchBoard, TED-LIUM) all report it. The ASR vendor scorecards all report it. The papers all report it.
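The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not any vendor's scorer: the `wer` function name is hypothetical, and the text normalization (lowercasing and whitespace splitting) is an assumption. Production scorers usually apply richer normalization for punctuation and numerals.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    # Assumption: lowercase + whitespace tokenization is enough for this sketch.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a six-word reference.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, the score can exceed 1.0 when the hypothesis contains many inserted words.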
For ASR research, WER is the right metric. The task is transcription. Every word in the reference is in scope. The model that scores best on WER is the model that transcribes best.
For voice agents in 2026, the task is not transcription. The task is to drive the next agent action correctly. Transcription is the means. The agent’s behavior is the end. WER scores the means and ignores the end.
The clean fix is the Hybrid Norm (Anthropic’s 2026 eval guidance): pair the verifiable reward (WER itself, entity F1, exact-match on numbers and dates) with rubric-based LLM judges that score whether the resulting transcript preserved intent and entity meaning. WER stays in the pipeline as the deterministic floor. The rubric layer catches what WER misses.
What WER misses
Four error classes hide inside an acceptable WER number.
1. Entity errors
Voice agents are entity-heavy. A retail support agent needs the order number. A banking agent needs the dollar amount. A clinical scribe needs the drug name. A scheduling agent needs the date and the patient name.
WER weights every word equally. A 4% WER transcript can have a missed dollar amount (catastrophic) and three correct filler words (irrelevant). Another 4% WER transcript can have three missed filler words (irrelevant) and a correct dollar amount (everything). They score identically on WER. They are different transcripts for an agent.
Entity F1 fixes this. Pull the named entities out of the reference. Pull them out of the hypothesis. Score recall and precision on the entities only. Take their harmonic mean. The number is the entity F1. A catastrophic entity miss can no longer hide behind an acceptable overall score.
2. Intent-preserving substitutions
The customer says “I’d like to cancel my subscription”. The ASR returns “I’d like to cancel my prescription”. One word substitution. WER reports one error in six words, the same score a misheard filler word would get. Intent classification routes the call to the pharmacy team. The customer gets a wrong-team transfer.
WER counts the edit. It doesn’t know that the substitution swapped the agent’s downstream action. Intent preservation asks the downstream router whether the hypothesis routes to the same node as the reference. Yes or no. The error class that swaps intent is now visible.
3. Paraphrased speech
The customer says “yeah I want to do that”. The reference transcript says “yes I would like to proceed with that”. WER reports a 62.5% error rate: three substitutions and two deletions against an eight-word reference. The meaning is identical. Any downstream LLM acts on either equally.
Semantic similarity uses a sentence embedding model to measure the embedding distance between hypothesis and reference. The paraphrase-class errors that inflate WER without changing meaning are no longer false positives.
4. Multi-turn cascade
A single-turn WER number ignores what happens across the call. A 2% WER transcript that mis-transcribes the customer’s account number on turn 3 produces an agent that confidently retrieves the wrong account state for the rest of the call. The remaining nine turns are all wrong because of one early error. WER reports a flat 2%. The call was a total failure.
Downstream task-completion correlation runs the agent twice. Once with the live ASR. Once with the reference transcript. The task-completion rubric runs on both. The delta is the metric. The cascade-class failures are now visible.
The four beyond-WER metrics
Each metric has a different purpose. Run all four. Each isolates a different error class.
Intent preservation
The downstream agent has a router. The router takes a turn and produces an intent label (or a no-op routing). Intent preservation is a binary score per turn: did the hypothesis transcript route to the same intent label as the reference transcript?
For an agent with 30 intents, you have 30 intent classes plus a no-intent class. You score on the confusion matrix. The aggregate accuracy is the intent-preservation rate. You can also break it down per intent class. Some intents are robust to ASR error (a generic FAQ). Some are fragile (cancellation, refund, escalation). The per-intent breakdown surfaces where ASR upgrades help and where they hurt.
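The aggregate and per-intent scoring described above can be sketched as follows. Everything here is illustrative: `intent_preservation` and `toy_router` are hypothetical names, and the keyword router is a stand-in for whatever router the production agent uses (typically an LLM prompt or a classifier).

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def intent_preservation(
    pairs: List[Tuple[str, str]],
    router: Callable[[str], str],
) -> Tuple[float, Dict[str, float]]:
    """Score (reference, hypothesis) pairs; return overall and per-intent rates."""
    if not pairs:
        raise ValueError("need at least one (reference, hypothesis) pair")
    preserved = 0
    per_intent = defaultdict(lambda: [0, 0])  # reference intent -> [preserved, total]
    for reference, hypothesis in pairs:
        ref_intent = router(reference)
        same = ref_intent == router(hypothesis)
        preserved += same
        per_intent[ref_intent][0] += same
        per_intent[ref_intent][1] += 1
    rates = {intent: hit / total for intent, (hit, total) in per_intent.items()}
    return preserved / len(pairs), rates


def toy_router(text: str) -> str:
    # Hypothetical keyword router, used only to make the sketch runnable.
    lowered = text.lower()
    if "cancel" in lowered and "subscription" in lowered:
        return "cancel_subscription"
    if "prescription" in lowered:
        return "pharmacy"
    if "refund" in lowered:
        return "refund"
    return "no_intent"


if __name__ == "__main__":
    pairs = [
        ("I'd like to cancel my subscription", "I'd like to cancel my prescription"),
        ("um cancel my subscription", "cancel my subscription"),
        ("I want a refund please", "I want a refund please uh"),
    ]
    overall, by_intent = intent_preservation(pairs, toy_router)
    print(overall, by_intent)
```

Keying the breakdown on the reference intent shows which intents lose calls to ASR error. A full confusion matrix would also record which intent each lost call landed on.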
Entity F1
Define your entity taxonomy first. For a banking agent: account number, dollar amount, date, payee name, transaction reference. For a clinical scribe: drug name, dosage, route, frequency, ICD-10 code. For a scheduling agent: date, time, location, attendee name.
Extract entities from the reference using a regex-plus-NER pipeline. Extract from the hypothesis the same way. Compute recall (entities in reference that appear correctly in hypothesis) and precision (entities in hypothesis that appear correctly in reference). F1 is the harmonic mean.
Track entity F1 per entity class. A 0.92 entity F1 overall can hide a 0.55 entity F1 on drug names because they’re a small fraction of total entities. The per-class breakdown saves the patient.
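A minimal sketch of the per-class breakdown, assuming entity extraction has already happened upstream and each entity arrives as a `(type, value)` tuple. The function names and the exact-match comparison are assumptions for illustration; a real pipeline would normalize values (for example "$250" versus "250.00") before matching.

```python
from collections import Counter
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (entity type, normalized value)


def precision_recall_f1(ref: List[Entity], hyp: List[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over two entity multisets."""
    ref_counts, hyp_counts = Counter(ref), Counter(hyp)
    true_positives = sum((ref_counts & hyp_counts).values())
    precision = true_positives / len(hyp) if hyp else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def entity_f1_by_type(ref: List[Entity], hyp: List[Entity]) -> Dict[str, float]:
    """F1 per entity type, plus an 'overall' entry across all types."""
    types = {t for t, _ in ref} | {t for t, _ in hyp}
    scores = {
        t: precision_recall_f1(
            [e for e in ref if e[0] == t],
            [e for e in hyp if e[0] == t],
        )[2]
        for t in types
    }
    scores["overall"] = precision_recall_f1(ref, hyp)[2]
    return scores


if __name__ == "__main__":
    reference = [
        ("dollar_amount", "250.00"),
        ("date", "2026-03-14"),
        ("payee_name", "Dana Whitfield"),
        ("account_number", "8392"),
    ]
    hypothesis = [
        ("dollar_amount", "215.00"),  # misheard amount
        ("date", "2026-03-14"),
        ("payee_name", "Dana Whitfield"),
        ("account_number", "8392"),
    ]
    print(entity_f1_by_type(reference, hypothesis))
```

In this toy turn the overall F1 is 0.75 while the `dollar_amount` class scores 0.0, which is the kind of gap the per-class view exists to expose.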
Semantic similarity
Pick a sentence embedding model. Run it on the reference and the hypothesis. Cosine similarity in the embedding space is the score. Higher is more semantically aligned.
The score corrects for paraphrase, filler-word variation, and dialect-level word swaps. It does not reliably catch entity errors: entity differences usually show up in the embedding, but inconsistently. Pair semantic similarity with entity F1 so the two metrics cover different error classes.
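The scoring step itself is small. The sketch below keeps the embedding model abstract: `embed` is a hypothetical callable you would supply (any sentence embedding model that returns a fixed-length vector), and the demo uses hand-written toy vectors rather than real embeddings. The 0.85 review threshold is the one used in the rubric description later in this article.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def semantic_similarity(
    reference: str,
    hypothesis: str,
    embed: Callable[[str], Sequence[float]],
) -> float:
    """Embed both transcripts with the supplied model and compare them."""
    return cosine_similarity(embed(reference), embed(hypothesis))


def needs_review(score: float, threshold: float = 0.85) -> bool:
    """Flag turns whose similarity falls below the review threshold."""
    return score < threshold


if __name__ == "__main__":
    # Toy vectors standing in for real sentence embeddings.
    toy_vectors = {
        "yes I would like to proceed with that": [3.0, 4.0],
        "yeah I want to do that": [4.0, 3.0],
    }
    score = semantic_similarity(
        "yes I would like to proceed with that",
        "yeah I want to do that",
        embed=toy_vectors.__getitem__,
    )
    print(score, needs_review(score))
```

Passing the embedding function in as a parameter keeps the scorer independent of any one model, so the same code can be re-run when the embedding model changes.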
Downstream task-completion correlation
Run the agent twice on the same scenario. Once with the live ASR. Once with the reference transcript pulled from a human annotation. Score task_completion on both. The score difference is the metric.
For a strong ASR, the delta is near zero. For an ASR with a hidden weakness on an entity class, the delta is high. The delta is the leading indicator that an ASR upgrade is silently breaking the agent before the WER number catches up.
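The delta computation can be sketched as below, assuming each scenario has already been scored twice and the scores are available as plain dictionaries keyed by scenario ID. The function names, the sign convention (reference score minus live score), and the `alert_threshold` value are all assumptions for illustration.

```python
from statistics import mean
from typing import Dict, List, Union


def completion_deltas(
    reference_scores: Dict[str, float],
    live_scores: Dict[str, float],
) -> Dict[str, float]:
    """Per-scenario delta: score with reference transcript minus score with live ASR."""
    mismatched = set(reference_scores) ^ set(live_scores)
    if mismatched:
        raise ValueError(f"scenarios scored in only one run: {sorted(mismatched)}")
    return {sid: reference_scores[sid] - live_scores[sid] for sid in reference_scores}


def summarize(
    deltas: Dict[str, float],
    alert_threshold: float = 0.2,  # hypothetical cut-off for flagging a scenario
) -> Dict[str, Union[float, List[str]]]:
    """Mean delta across scenarios plus the scenarios that exceed the threshold."""
    return {
        "mean_delta": mean(deltas.values()),
        "flagged": sorted(sid for sid, d in deltas.items() if d >= alert_threshold),
    }


if __name__ == "__main__":
    with_reference = {"cancel_flow": 1.0, "refund_flow": 1.0, "balance_check": 1.0}
    with_live_asr = {"cancel_flow": 1.0, "refund_flow": 0.0, "balance_check": 1.0}
    print(summarize(completion_deltas(with_reference, with_live_asr)))
```

A positive delta means the agent did better on the human transcript than on the live ASR output, so the flagged scenarios are the ones where transcription errors changed the outcome.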
Authoring the rubric set in ai-evaluation
ai-evaluation ships 70+ built-in eval templates. The audio leg uses audio_transcription for WER-class scoring. The four beyond-WER metrics are a mix of built-ins and custom evaluators. The full pass looks like this.
Step 1: load the test cases
MLLMAudio wraps the audio file. Seven formats are supported: .mp3, .wav, .ogg, .m4a, .aac, .flac, .wma. Local path or URL. Auto-base64 encoding when the underlying model needs it.
```python
from fi.testcases import MLLMTestCase, MLLMAudio

audio = MLLMAudio(url="path/to/turn_audio.wav")
test_case = MLLMTestCase(
    input=audio,
    query="Score this voice agent turn",
)
```
Step 2: configure the evaluator
Evaluator is the entry point. It accepts a list of eval templates and runs them in parallel. Built-in templates are imported directly. Custom evaluators are authored in plain English in the in-product agent and pulled in by name.
```python
from fi.evals import Evaluator, AudioTranscription, ConversationResolution

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
```
Step 3: author the custom rubrics
Intent preservation, entity F1, and semantic similarity are custom evaluators authored with Future AGI’s (FAGI’s) in-product agent. The authoring UX takes a plain-English description and produces a runnable evaluator with config, prompts, and scoring logic.
A minimal intent-preservation rubric description reads:
“Given a reference transcript and a hypothesis transcript, route both through the agent router (the same prompt as production). Compare the two intent labels. Score 1 if they match, 0 if they don’t. Return the matched label and the mismatched pair when applicable. Include the agent router prompt as a parameter.”
The agent produces a runnable evaluator with the router prompt parameterized, a label-matching scoring function, and the metadata schema for the audit trail. The evaluator runs against any test case the SDK supports.
The entity-F1 rubric description specifies the entity taxonomy:
“Extract entities of types {account_number, dollar_amount, date, payee_name, transaction_reference} from both reference and hypothesis. Use the entity-extraction prompt I’m providing. Compute precision, recall, and F1 per type and overall. Return the per-type breakdown.”
The semantic-similarity rubric description specifies the embedding strategy:
“Embed reference and hypothesis with the configured sentence embedding model. Return cosine similarity. Flag turns with similarity below 0.85 for human review.”
Step 4: run the pass
```python
from fi.evals import Evaluator

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[
        "audio_transcription",
        "intent_preservation_v1",
        "entity_f1_banking_v1",
        "semantic_similarity_v1",
    ],
    inputs=[test_case],
)
```
The Evaluator runs all four scorers in parallel against the same input. The result object carries per-evaluator scores, reasoning, and metadata. The audit trail stores the rubric version, the input, the score, and the reasoning for every turn.
Step 5: downstream correlation in a scenario
The fourth metric (downstream task-completion correlation) is a scenario-level score, not a turn-level one. Run the same scenario twice. Once with the live ASR. Once with the reference transcript. Score task_completion on both. The delta is the correlation indicator.
FAGI Simulate auto-generates branching scenarios (20, 50, or 100 rows) from an agent definition. Run both versions in CI. The delta surfaces which ASR error classes break the agent.
```python
from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals import Evaluator, TaskCompletion

conv = ConversationalTestCase(messages=[
    LLMTestCase(query="I need to cancel my subscription", response="..."),
    LLMTestCase(query="My account number is 8392", response="..."),
])
ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[TaskCompletion()],
    inputs=[conv],
)
```
Run the same ConversationalTestCase with the reference-transcript message text. The two task_completion scores subtracted give the per-scenario delta. Average across scenarios gives the correlation indicator.
A reference scoring pipeline
For a production voice agent, the eval pipeline looks like this.
- Every live turn is captured by traceAI with separate spans for ASR, LLM, tool calls, and TTS. The ASR span carries the hypothesis transcript and confidence.
- A sampled subset of turns (say, 1 in 50 for high-volume traffic, 1 in 5 during launch) goes to human transcription for the reference.
- The hypothesis plus reference go through the four-rubric pass: audio_transcription, intent preservation, entity F1, semantic similarity.
- The scenario-level downstream task-completion correlation runs in CI nightly against the canonical scenario suite.
- Results land in the Agent Definition dashboard with per-rubric trends, per-entity-class breakdowns, and per-intent-class confusion matrices.
The single pass produces five views of the same call. Each view surfaces a different error class. The team that owns the agent sees which class is regressing the moment it does.
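The sampling step in the pipeline above (for example 1 in 50 on steady traffic, 1 in 5 during launch) can be made deterministic by hashing the turn identifier instead of drawing random numbers. This is a sketch under assumptions: the `sampled_for_transcription` helper and the `turn-…` ID format are hypothetical, and nothing here reflects how traceAI itself selects turns.

```python
import hashlib


def sampled_for_transcription(turn_id: str, one_in: int) -> bool:
    """Return True for roughly one in `one_in` turn IDs, stably across reruns."""
    if one_in < 1:
        raise ValueError("one_in must be a positive integer")
    # Hash the ID so the decision depends only on the turn, not on process state.
    digest = hashlib.sha256(turn_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % one_in
    return bucket == 0


if __name__ == "__main__":
    turn_ids = [f"turn-{i}" for i in range(10_000)]
    steady_state = [t for t in turn_ids if sampled_for_transcription(t, one_in=50)]
    launch = [t for t in turn_ids if sampled_for_transcription(t, one_in=5)]
    print(len(steady_state), len(launch))
```

Hash-based selection means a given turn is either always in the sample or never in it, so re-running the pipeline does not send new turns to human transcription.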
Common findings when teams switch from WER-only
The first month of beyond-WER scoring usually surfaces three patterns. They show up across every vertical we’ve worked with.
Finding 1: entity F1 lags WER
A team upgrades from one ASR vendor to another. WER drops from 8% to 5%. Entity F1 stays flat or drops 2 points. The new model handles general speech better and handles entity speech worse. The agent’s tool calls regress. The WER number says the upgrade was a win. The entity F1 says it was a loss. The team rolls back the ASR change.
Finding 2: intent preservation tracks training data more than WER
Two ASRs with similar WER on the public benchmarks can have very different intent-preservation scores on a domain-specific agent. The ASR trained on more in-domain data preserves intent better even when WER is slightly worse. The lesson is to score on the domain, not on the benchmark.
Finding 3: semantic similarity catches paraphrase-friendly customers
Some customer cohorts (older customers, customers in dialect-rich regions, customers in second-language English) talk in ways that inflate WER without changing meaning. Semantic similarity surfaces this segment. The team learns the apparent WER regression is a customer-mix artifact, not a model regression.
How Future AGI supports the beyond-WER scoring loop
The FAGI stack maps the full loop end to end.
ai-evaluation for scoring
70+ built-in rubrics including audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion, caption_hallucination, translation_accuracy, cultural_sensitivity. Apache 2.0. Custom evaluators authored by an in-product agent for the four beyond-WER rubrics described above. Evaluator runs the full pass in one call.
traceAI for span capture
30+ documented integrations across Python and TypeScript. OpenInference-compatible spans. Apache 2.0. Per-turn ASR span with hypothesis and confidence is the input to the eval pipeline. Dedicated traceAI-pipecat and traceai-livekit packages for the two open-source voice frameworks.
MLLMAudio for the audio leg
Seven audio formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma). Local path or URL. Auto-base64. The audio file plus the hypothesis plus the reference are the three inputs to the four-rubric pass.
Simulation for scenario-level correlation
The full simulation surface: 18 pre-built personas plus unlimited custom-authored (configure name, description, gender, age range, location, personality traits, communication style, accent, conversation speed, background noise, multilingual coverage, custom properties, free-form behavioral instructions), Workflow Builder auto-generated branching scenarios (20/50/100 rows with branch visibility), a 4-step Run Tests wizard (config to scenarios to eval to execute), Error Localization that pinpoints the exact failing turn, a programmatic eval API for configure plus re-run as part of CI, custom voices imported from ElevenLabs and Cartesia in Run Prompt, Indian phone number simulation, and a Show Reasoning column for eval debug.
The production-derived loop
audio_transcription plus the four beyond-WER rubrics run on every captured production call. Error Feed clusters the regressions into named issues. Pick a cluster (for example “named-entity misspelling on last names”), promote the offending spans into a dataset, re-run the rubric set with new candidate prompts or new STT providers, and only then close the loop with a deliberate prompt or vendor change. Production calls → Error Feed → dataset → re-eval is the closed loop that takes you beyond raw WER without auto-rewriting prompts on your behalf.
Native voice observability for Vapi, Retell, LiveKit
No SDK required. Provider API key plus Assistant ID configures auto call log capture, separate assistant and customer audio download, auto transcripts, and the full 70+ rubric eval engine on every call. The four beyond-WER rubrics run continuously on ingest once they’re authored.
Agent Command Center for governance
RBAC, SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 certified, AWS Marketplace, multi-region hosted, BYOC self-host, 15+ provider routing. Per-team dashboards segment the rubric trends by intent class, agent version, and customer cohort.
Future AGI Protect for inline checks on transcript content
Gemma 3n foundation with LoRA-trained adapters per safety dimension per arXiv 2510.13351. Multi-modal across text, image, and audio. Sub-100ms inline. ProtectFlash is the single-call binary classifier path when you need the fastest verdict. The pii and data_privacy_compliance rubrics are detection signals (not by themselves a guarantee of PHI redaction) that flag PII echo turns the four-rubric beyond-WER pass doesn’t cover.
Error Feed for cluster analysis
Auto-clusters trace failures into named issues. Auto-writes root cause plus quick fix plus long-term recommendation. The beyond-WER score regressions cluster as named issues with the failing rubric, the failing intent class, and the failing entity type identified.
Where the beyond-WER framework falls short
Reference transcripts are still required. Every metric in this framework needs a human-annotated reference for the leading sample of turns. There is no free way to score intent preservation against a hypothetical reference. The mitigation is to sample. 1 in 50 on production, 1 in 5 during launch, 100% on the canonical golden conversation set. The sampling regime keeps the human-transcription cost bounded.
Custom rubrics take a one-time setup. The intent-preservation, entity-F1, and semantic-similarity rubrics each need an authoring pass to encode the agent’s specific intent taxonomy and entity taxonomy. The in-product agent in ai-evaluation reduces this to a plain-English description. The first version ships in a day. Iteration on the rubric versions happens release over release.
Downstream correlation is a scenario-level metric. The first three rubrics run per turn. Downstream task-completion correlation runs per scenario. The two cadences don’t share an axis. The mitigation is to track them in two dashboards. Per-turn rubrics on the live-traffic dashboard. Scenario correlation on the CI dashboard. The reviewer needs both views to act on regressions.
Related reading
- Real-Time STT vs Offline STT in 2026: the upstream model choice that the beyond-WER metrics evaluate.
- Voice AI Evaluation Infrastructure: Developer’s Guide: the broader eval stack that the four-rubric pass plugs into.
- How to Build RAG-Powered Voice AI Agents in 2026: the retrieval layer that the beyond-WER framework also extends to.
- Voice AI for Healthcare and Clinical Workflows in 2026: the vertical where entity F1 is the highest-stakes beyond-WER metric.
Sources and references
- arXiv 2510.13351, Future AGI Protect model family (arxiv.org/abs/2510.13351)
- arXiv 2507.19457, GEPA Genetic-Pareto prompt optimizer (arxiv.org/abs/2507.19457)
- arXiv 2505.09666, Meta-Prompt bilevel optimization (arxiv.org/abs/2505.09666)
- arXiv 2311.09569, Random Search baseline (arxiv.org/abs/2311.09569)
- LibriSpeech, CommonVoice, SwitchBoard benchmark suite documentation
- NIST speech recognition evaluation literature on WER computation
- Future AGI trust page (futureagi.com/trust)
- ai-evaluation repository (github.com/future-agi/ai-evaluation)
- traceAI repository (github.com/future-agi/traceAI)
Frequently asked questions
What is WER and why is it the default voice AI metric?
Which beyond-WER metrics should I track for a voice agent?
How does intent preservation differ from WER?
Can I run all four metrics in a single eval pass?
Why does named-entity recall matter more than overall accuracy?
How do I correlate transcript quality with end-to-end task completion?
Does Future AGI replace WER scoring?