Why WER Isn't Enough for Voice Agents: 2026 Beyond-WER Metrics

WER measures word accuracy but misses what voice agents break on. Intent preservation, entity F1, semantic similarity, and task-completion correlation are the 2026 metrics that matter.

Word Error Rate is the metric every ASR vendor puts on its scorecard. It’s the metric every voice agent team inherits when they pick an STT provider. And it’s the metric that quietly fails them once the agent goes live. WER counts edits between a transcript and a reference. It treats every word as equally important. A voice agent doesn’t. Missing a customer name breaks the call. Missing a filler word doesn’t. The 2026 voice stack needs metrics that score what the agent actually breaks on.

TL;DR (when WER fails and what to add)

WER is fine for ASR vendor comparison. It’s the wrong primary metric for a voice agent. The agent metrics that actually predict production behavior are intent preservation, entity F1, semantic similarity, and downstream task-completion correlation. Run the first three on every turn alongside WER, and run the fourth on every scenario. Use Future AGI’s ai-evaluation to author the custom rubrics.

| Metric | What it measures | When WER misses it |
| --- | --- | --- |
| WER | Edit distance to gold transcript | Treats name and filler word as equal weight |
| Intent preservation | Does the transcript route to the same agent action | Errors land on filler words, intent stays |
| Entity F1 | Named-entity recall plus precision | Entity errors hide inside acceptable overall WER |
| Semantic similarity | Embedding-distance to reference meaning | Paraphrased speech reads as high WER but same meaning |
| Task-completion correlation | End-to-end agent outcome delta | WER drops 2% but agent breaks on a new entity class |

Why WER became the default

WER comes out of 1970s speech recognition research. The reference is a human-produced transcript. The hypothesis is the ASR output. The score is the sum of substitutions, insertions, and deletions in the minimum-edit alignment, divided by the number of words in the reference. The number is easy to interpret. The benchmark suites (LibriSpeech, Common Voice, Switchboard, TED-LIUM) all report it. The ASR vendor scorecards all report it. The papers all report it.
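As a minimal sketch of that calculation, the function below computes a word-level edit distance and divides by the reference length. The lowercase-and-split tokenization and the example sentences are assumptions made for illustration; production scorers usually also normalize punctuation, numerals, and contractions before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the reference words seen so far
    # (before the current one) into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical banking turn: one substituted word out of five.
wer = word_error_rate(
    "transfer fifty dollars to savings",
    "transfer fifteen dollars to savings",
)
print(f"{wer:.2f}")  # 0.20
```

The single edit here lands on the dollar amount, yet the score would be the same 0.20 if it had landed on "to".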

For ASR research, WER is the right metric. The task is transcription. Every word in the reference is in scope. The model that scores best on WER is the model that transcribes best.

For voice agents in 2026, the task is not transcription. The task is to drive the next agent action correctly. Transcription is the means. The agent’s behavior is the end. WER scores the means and ignores the end.

The clean fix is the Hybrid Norm (Anthropic’s 2026 eval guidance): pair the verifiable reward (WER itself, entity F1, exact-match on numbers and dates) with rubric-based LLM judges that score whether the resulting transcript preserved intent and entity meaning. WER stays in the pipeline as the deterministic floor. The rubric layer catches what WER misses.

What WER misses

Four error classes hide inside an acceptable WER number.

1. Entity errors

Voice agents are entity-heavy. A retail support agent needs the order number. A banking agent needs the dollar amount. A clinical scribe needs the drug name. A scheduling agent needs the date and the patient name.

WER weights every word equally. A 4% WER transcript can have a missed dollar amount (catastrophic) and three correct filler words (irrelevant). Another 4% WER transcript can have three missed filler words (irrelevant) and a correct dollar amount (everything). They score identically on WER. They are different transcripts for an agent.

Entity F1 fixes this. Pull the named entities out of the reference. Pull them out of the hypothesis. Score recall and precision on the entities only. Take the harmonic mean of the two. That number is the entity F1. Catastrophic entity misses no longer hide inside an aggregate score.

2. Intent-preserving substitutions

The customer says “I’d like to cancel my subscription”. The ASR returns “I’d like to cancel my prescription”. One word substitution. WER reports a tiny error rate. Intent classification routes the call to the pharmacy team. The customer gets a wrong-team transfer.

WER counts the edit. It doesn’t know that the substitution swapped the agent’s downstream action. Intent preservation asks the downstream router whether the hypothesis routes to the same node as the reference. Yes or no. The error class that swaps intent is now visible.

3. Paraphrased speech

The customer says “yeah I want to do that”. The reference transcript says “yes I would like to proceed with that”. Scored against that reference, WER is 62.5%: five edits (three substitutions and two deletions) against eight reference words. The meaning is identical. Any downstream LLM acts on either equally.

Semantic similarity uses a sentence embedding model to measure the embedding distance between hypothesis and reference. The paraphrase-class errors that inflate WER without changing meaning are no longer false positives.

4. Multi-turn cascade

A single-turn WER number ignores what happens across the call. A 2% WER transcript that mis-transcribes the customer’s account number on turn 3 produces an agent that confidently retrieves the wrong account state for the rest of the call. The remaining nine turns are all wrong because of one early error. WER reports a flat 2%. The call was a total failure.

Downstream task-completion correlation runs the agent twice. Once with the live ASR. Once with the reference transcript. The task-completion rubric runs on both. The delta is the metric. The cascade-class failures are now visible.

The four beyond-WER metrics

Each metric has a different purpose. Run all four. Each isolates a different error class.

Intent preservation

The downstream agent has a router. The router takes a turn and produces an intent label (or a no-op routing). Intent preservation is a binary score per turn. Did the hypothesis transcript route to the same intent label as the reference transcript.

For an agent with 30 intents, you have 30 intent classes plus a no-intent class. You score on the confusion matrix. The aggregate accuracy is the intent-preservation rate. You can also break it down per intent class. Some intents are robust to ASR error (a generic FAQ). Some are fragile (cancellation, refund, escalation). The per-intent breakdown surfaces where ASR upgrades help and where they hurt.
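The scoring described above can be sketched in a few lines. The keyword router below is a hypothetical stand-in (a real evaluation would call the production router), and the intent labels and sample turns are invented for illustration.

```python
from collections import defaultdict
from typing import Callable


def toy_router(transcript: str) -> str:
    """Hypothetical keyword router standing in for the production router."""
    text = transcript.lower()
    if "cancel" in text and "subscription" in text:
        return "cancel_subscription"
    if "prescription" in text:
        return "pharmacy"
    if "refund" in text:
        return "refund"
    return "no_intent"


def intent_preservation(pairs, route: Callable[[str], str]):
    """Score (reference, hypothesis) turn pairs.

    Returns the overall preservation rate and a per-intent breakdown
    keyed by the intent the reference transcript routes to.
    """
    if not pairs:
        raise ValueError("need at least one turn pair")
    matched = defaultdict(int)
    total = defaultdict(int)
    for reference, hypothesis in pairs:
        expected = route(reference)
        total[expected] += 1
        matched[expected] += int(route(hypothesis) == expected)
    overall = sum(matched.values()) / sum(total.values())
    per_intent = {intent: matched[intent] / total[intent] for intent in total}
    return overall, per_intent


turns = [
    ("I'd like to cancel my subscription", "I'd like to cancel my prescription"),
    ("I'd like to cancel my subscription", "I like to cancel my subscription"),
    ("um I want a refund please", "I want a refund please"),
    ("what are your opening hours", "what are you opening hours"),
]
overall, per_intent = intent_preservation(turns, toy_router)
print(overall)     # 0.75
print(per_intent)  # cancel_subscription is the fragile class in this sample
```

Three of the four hypotheses contain a transcription error, but only one changes the route, and the per-intent breakdown shows which class absorbed it.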

Entity F1

Define your entity taxonomy first. For a banking agent: account number, dollar amount, date, payee name, transaction reference. For a clinical scribe: drug name, dosage, route, frequency, ICD-10 code. For a scheduling agent: date, time, location, attendee name.

Extract entities from the reference using a regex-plus-NER pipeline. Extract from the hypothesis the same way. Compute recall (entities in reference that appear correctly in hypothesis) and precision (entities in hypothesis that appear correctly in reference). F1 is the harmonic mean.

Track entity F1 per entity class. A 0.92 entity F1 overall can hide a 0.55 entity F1 on drug names because they’re a small fraction of total entities. The per-class breakdown saves the patient.
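A rough sketch of per-class and overall entity F1 follows. The two regex patterns and the example sentences are assumptions for illustration; they cover only part of a banking taxonomy and skip the NER half of the pipeline entirely.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only; a production extractor
# would combine regexes like these with an NER model.
ENTITY_PATTERNS = {
    "dollar_amount": re.compile(r"\$\d+(?:\.\d{2})?"),
    "account_number": re.compile(r"(?<!\$)\b\d{4,}\b"),
}


def extract(text: str) -> dict[str, Counter]:
    """Count the entities of each class found in the text."""
    return {name: Counter(p.findall(text)) for name, p in ENTITY_PATTERNS.items()}


def f1_score(ref: Counter, hyp: Counter) -> float:
    """Harmonic mean of precision and recall over two entity multisets.

    Returns 0.0 when nothing matches, including when a class is absent
    from both transcripts; a real pipeline might skip such classes instead.
    """
    true_positives = sum((ref & hyp).values())  # multiset intersection
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(hyp.values())
    recall = true_positives / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def entity_f1(reference: str, hypothesis: str) -> tuple[dict[str, float], float]:
    ref, hyp = extract(reference), extract(hypothesis)
    per_class = {name: f1_score(ref[name], hyp[name]) for name in ENTITY_PATTERNS}
    # Micro-average: pool every class, keeping the class name in the key so
    # an amount can never match an account number with the same digits.
    ref_all = Counter({(name, v): n for name in ref for v, n in ref[name].items()})
    hyp_all = Counter({(name, v): n for name in hyp for v, n in hyp[name].items()})
    return per_class, f1_score(ref_all, hyp_all)


per_class, overall = entity_f1(
    "send $250 to account 83921",
    "send $215 to account 83921",
)
print(per_class)
print(overall)
```

In this example the account number survives and the dollar amount does not, so the pooled score sits between the two class scores and only the per-class view shows that one class failed completely.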

Semantic similarity

Pick a sentence embedding model. Run it on the reference and the hypothesis. Cosine similarity in the embedding space is the score. Higher is more semantically aligned.

The score corrects for paraphrase, filler-word variation, and dialect-level word swaps. It does not reliably catch entity errors: entity differences usually show up in the embedding, but inconsistently. Pair semantic similarity with entity F1 so the two metrics cover different error classes.
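The scoring step itself is small once the embeddings exist. The sketch below assumes the vectors come from whichever sentence embedding model you pick; the three-dimensional vectors are made-up placeholders, and the 0.85 review cutoff matches the threshold used in the rubric description later in this article.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def needs_review(ref_vec: Sequence[float], hyp_vec: Sequence[float],
                 threshold: float = 0.85) -> bool:
    """Flag a turn whose hypothesis drifts too far from the reference."""
    return cosine_similarity(ref_vec, hyp_vec) < threshold


# Placeholder vectors; real ones come from an embedding model and
# typically have hundreds of dimensions.
reference_vec = [0.9, 0.1, 0.4]
paraphrase_vec = [0.8, 0.2, 0.5]   # same meaning, different words
off_topic_vec = [0.1, 0.9, 0.2]    # different meaning

print(needs_review(reference_vec, paraphrase_vec))  # False
print(needs_review(reference_vec, off_topic_vec))   # True
```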

Downstream task-completion correlation

Run the agent twice on the same scenario. Once with the live ASR. Once with the reference transcript pulled from a human annotation. Score task_completion on both. The score difference is the metric.

For a strong ASR, the delta is near zero. For an ASR with a hidden weakness on an entity class, the delta is high. The delta is the leading indicator that an ASR upgrade is silently breaking the agent before the WER number catches up.

Authoring the rubric set in ai-evaluation

ai-evaluation ships 70+ built-in eval templates. The audio leg uses audio_transcription for WER-class scoring. The four beyond-WER metrics are a mix of built-ins and custom evaluators. The full pass looks like this.

Step 1: load the test cases

MLLMAudio wraps the audio file. Seven formats are supported: .mp3, .wav, .ogg, .m4a, .aac, .flac, .wma. Local path or URL. Auto-base64 encoding when the underlying model needs it.

from fi.testcases import MLLMTestCase, MLLMAudio

audio = MLLMAudio(url="path/to/turn_audio.wav")
test_case = MLLMTestCase(
    input=audio,
    query="Score this voice agent turn",
)

Step 2: configure the evaluator

Evaluator is the entry point. It accepts a list of eval templates and runs them in parallel. Built-in templates are imported directly. Custom evaluators are authored in plain English in the in-product agent and pulled in by name.

from fi.evals import Evaluator, AudioTranscription, ConversationResolution

ev = Evaluator(fi_api_key="...", fi_secret_key="...")

Step 3: author the custom rubrics

Intent preservation, entity F1, and semantic similarity are custom evaluators authored with Future AGI’s (FAGI’s) in-product agent. The authoring UX takes a plain-English description and produces a runnable evaluator with config, prompts, and scoring logic.

A minimal intent-preservation rubric description reads:

“Given a reference transcript and a hypothesis transcript, route both through the agent router (the same prompt as production). Compare the two intent labels. Score 1 if they match, 0 if they don’t. Return the matched label and the mismatched pair when applicable. Include the agent router prompt as a parameter.”

The agent produces a runnable evaluator with the router prompt parameterized, a label-matching scoring function, and the metadata schema for the audit trail. The evaluator runs against any test case the SDK supports.

The entity-F1 rubric description specifies the entity taxonomy:

“Extract entities of types {account_number, dollar_amount, date, payee_name, transaction_reference} from both reference and hypothesis. Use the entity-extraction prompt I’m providing. Compute precision, recall, and F1 per type and overall. Return the per-type breakdown.”

The semantic-similarity rubric description specifies the embedding strategy:

“Embed reference and hypothesis with the configured sentence embedding model. Return cosine similarity. Flag turns with similarity below 0.85 for human review.”

Step 4: run the pass

from fi.evals import Evaluator

ev = Evaluator(fi_api_key="...", fi_secret_key="...")

result = ev.evaluate(
    eval_templates=[
        "audio_transcription",
        "intent_preservation_v1",
        "entity_f1_banking_v1",
        "semantic_similarity_v1",
    ],
    inputs=[test_case],
)

The Evaluator runs all four scorers in parallel against the same input. The result object carries per-evaluator scores, reasoning, and metadata. The audit trail stores the rubric version, the input, the score, and the reasoning for every turn.

Step 5: downstream correlation in a scenario

The fourth metric (downstream task-completion correlation) is a scenario-level score, not a turn-level one. Run the same scenario twice. Once with the live ASR. Once with the reference transcript. Score task_completion on both. The delta is the correlation indicator.

FAGI Simulate auto-generates branching scenarios (20, 50, or 100 rows) from an agent definition. Run both versions in CI. The delta surfaces which ASR error classes break the agent.

from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals import Evaluator, TaskCompletion

conv = ConversationalTestCase(messages=[
    LLMTestCase(query="I need to cancel my subscription", response="..."),
    LLMTestCase(query="My account number is 8392", response="..."),
])

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[TaskCompletion()],
    inputs=[conv],
)

Run the same ConversationalTestCase with the reference-transcript message text. The two task_completion scores subtracted give the per-scenario delta. Average across scenarios gives the correlation indicator.
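The arithmetic on top of the two runs is plain Python. In the sketch below, the scenario ids and scores are invented, scores are assumed to lie in [0, 1], and the sign convention (reference run minus live-ASR run, so a positive delta means the ASR cost the agent something) is an assumption rather than anything the SDK prescribes.

```python
from statistics import mean


def completion_deltas(reference_scores: dict[str, float],
                      live_scores: dict[str, float]) -> dict[str, float]:
    """Per-scenario delta: reference-run score minus live-ASR-run score."""
    if reference_scores.keys() != live_scores.keys():
        raise ValueError("both runs must cover the same scenarios")
    return {sid: reference_scores[sid] - live_scores[sid] for sid in reference_scores}


# Hypothetical task_completion scores keyed by scenario id.
reference_run = {"cancel_sub": 1.0, "refund": 1.0, "update_address": 1.0}
live_asr_run = {"cancel_sub": 1.0, "refund": 0.0, "update_address": 1.0}

deltas = completion_deltas(reference_run, live_asr_run)
indicator = mean(deltas.values())       # average across scenarios
worst = max(deltas, key=deltas.get)     # scenario the ASR hurts most

print(deltas)
print(round(indicator, 3))
print(worst)
```

Sorting scenarios by delta, rather than looking only at the average, is what points the team at the specific scenario where transcription errors change the outcome.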

A reference scoring pipeline

For a production voice agent, the eval pipeline looks like this.

  1. Every live turn is captured by traceAI with separate spans for ASR, LLM, tool calls, and TTS. The ASR span carries the hypothesis transcript and confidence.
  2. A sampled subset of turns (say, 1 in 50 for high-volume traffic, 1 in 5 during launch) goes to human transcription for the reference.
  3. The hypothesis plus reference go through the four-rubric pass: audio_transcription, intent preservation, entity F1, semantic similarity.
  4. The scenario-level downstream task-completion correlation runs in CI nightly against the canonical scenario suite.
  5. Results land in the Agent Definition dashboard with per-rubric trends, per-entity-class breakdowns, and per-intent-class confusion matrices.
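The sampling in step 2 can be sketched as a deterministic hash of the turn id, so a given turn is always either in or out of the reference sample no matter which worker sees it. The id format and the helper name are hypothetical; only the 1-in-50 and 1-in-5 rates come from the pipeline above.

```python
import hashlib


def sampled_for_reference(turn_id: str, one_in: int) -> bool:
    """Deterministically select roughly 1 in `one_in` turns for human transcription."""
    if one_in < 1:
        raise ValueError("one_in must be a positive integer")
    digest = hashlib.sha256(turn_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % one_in
    return bucket == 0


# Hypothetical turn ids: 1,000 calls with 10 turns each.
turn_ids = [f"call-{c}-turn-{t}" for c in range(1000) for t in range(10)]

steady_state = [t for t in turn_ids if sampled_for_reference(t, 50)]
launch = [t for t in turn_ids if sampled_for_reference(t, 5)]

print(len(steady_state), len(launch))
```

Hashing instead of calling a random number generator keeps the sample stable across retries and replays, which matters when the same trace is processed more than once.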

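Together, the per-turn pass and the nightly scenario run produce five views of the same agent: WER, intent preservation, entity F1, semantic similarity, and task-completion delta. Each view surfaces a different error class. The team that owns the agent sees which class is regressing the moment it does.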

Common findings when teams switch from WER-only

The first month of beyond-WER scoring usually surfaces three patterns. They show up across every vertical we’ve worked with.

Finding 1: entity F1 lags WER

A team upgrades from one ASR vendor to another. WER drops from 8% to 5%. Entity F1 stays flat or drops 2 points. The new model handles general speech better and handles entity speech worse. The agent’s tool calls regress. The WER number says the upgrade was a win. The entity F1 says it was a loss. The team rolls back the ASR change.
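One way to keep this pattern from shipping is a release gate that looks at both numbers. The sketch below is an assumed policy, not a prescribed one: the class name, the zero-tolerance default, and the entity F1 values are illustrative, and only the 8% and 5% WER figures come from the scenario above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsrScores:
    wer: float        # lower is better
    entity_f1: float  # higher is better


def accept_upgrade(current: AsrScores, candidate: AsrScores,
                   max_f1_drop: float = 0.0) -> bool:
    """Accept a new ASR only if WER improves and entity F1 holds."""
    wer_improved = candidate.wer < current.wer
    f1_held = candidate.entity_f1 >= current.entity_f1 - max_f1_drop
    return wer_improved and f1_held


current = AsrScores(wer=0.08, entity_f1=0.91)    # entity F1 values are hypothetical
candidate = AsrScores(wer=0.05, entity_f1=0.89)  # WER better, entity F1 down 2 points

print(accept_upgrade(current, candidate))  # False
```

A team that tolerates small entity regressions can loosen `max_f1_drop`, but making the tolerance an explicit parameter forces that trade-off to be a decision rather than an accident.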

Finding 2: intent preservation tracks training data more than WER

Two ASRs with similar WER on the public benchmarks can have very different intent-preservation scores on a domain-specific agent. The ASR trained on more in-domain data preserves intent better even when WER is slightly worse. The lesson is to score on the domain, not on the benchmark.

Finding 3: semantic similarity catches paraphrase-friendly customers

Some customer cohorts (older customers, customers in dialect-rich regions, customers in second-language English) talk in ways that inflate WER without changing meaning. Semantic similarity surfaces this segment. The team learns the apparent WER regression is a customer-mix artifact, not a model regression.

How Future AGI supports the beyond-WER scoring loop

The FAGI stack maps the full loop end to end.

ai-evaluation for scoring

70+ built-in rubrics including audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion, caption_hallucination, translation_accuracy, cultural_sensitivity. Apache 2.0. Custom evaluators authored by an in-product agent for the four beyond-WER rubrics described above. Evaluator runs the full pass in one call.

traceAI for span capture

30+ documented integrations across Python and TypeScript. OpenInference-compatible spans. Apache 2.0. Per-turn ASR span with hypothesis and confidence is the input to the eval pipeline. Dedicated traceAI-pipecat and traceai-livekit packages for the two open-source voice frameworks.

MLLMAudio for the audio leg

Seven audio formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma). Local path or URL. Auto-base64. The audio file plus the hypothesis plus the reference are the three inputs to the four-rubric pass.

Simulation for scenario-level correlation

The full simulation surface starts with 18 pre-built personas plus unlimited custom-authored ones. Custom personas configure name, description, gender, age range, location, personality traits, communication style, accent, conversation speed, background noise, multilingual coverage, custom properties, and free-form behavioral instructions. Workflow Builder auto-generates branching scenarios (20/50/100 rows with branch visibility). A 4-step Run Tests wizard walks from config to scenarios to eval to execute. Error Localization pinpoints the exact failing turn. A programmatic eval API supports configure plus re-run as part of CI. Run Prompt accepts custom voices imported from ElevenLabs and Cartesia. Indian phone number simulation and a Show Reasoning column for eval debug round out the surface.

The production-derived loop

audio_transcription plus the four beyond-WER rubrics run on every captured production call. Error Feed clusters the regressions into named issues. Pick a cluster (for example “named-entity misspelling on last names”), promote the offending spans into a dataset, re-run the rubric set with new candidate prompts or new STT providers, and only then close the loop with a deliberate prompt or vendor change. Production calls → Error Feed → dataset → re-eval is the closed loop that takes you beyond raw WER without auto-rewriting prompts on your behalf.

Native voice observability for Vapi, Retell, LiveKit

No SDK required. Provider API key plus Assistant ID configures auto call log capture, separate assistant and customer audio download, auto transcripts, and the full 70+ rubric eval engine on every call. The four beyond-WER rubrics run continuously on ingest once they’re authored.

Agent Command Center for governance

RBAC, SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 certified, AWS Marketplace, multi-region hosted, BYOC self-host, 15+ provider routing. Per-team dashboards segment the rubric trends by intent class, agent version, and customer cohort.

Future AGI Protect for inline checks on transcript content

Gemma 3n foundation with LoRA-trained adapters per safety dimension per arXiv 2510.13351. Multi-modal across text, image, and audio. Sub-100ms inline. ProtectFlash is the single-call binary classifier path when you need the fastest verdict. The pii and data_privacy_compliance rubrics are detection signals (not by themselves a guarantee of PHI redaction) that flag PII echo turns the four-rubric beyond-WER pass doesn’t cover.

Error Feed for cluster analysis

Auto-clusters trace failures into named issues. Auto-writes root cause plus quick fix plus long-term recommendation. The beyond-WER score regressions cluster as named issues with the failing rubric, the failing intent class, and the failing entity type identified.

Where the beyond-WER framework falls short

Reference transcripts are still required. Every metric in this framework needs a human-annotated reference for each sampled turn. There is no free way to score intent preservation without one. The mitigation is to sample: 1 in 50 on production, 1 in 5 during launch, 100% on the canonical golden conversation set. The sampling regime keeps the human-transcription cost bounded.

Custom rubrics take a one-time setup. The intent-preservation, entity-F1, and semantic-similarity rubrics each need an authoring pass to encode the agent’s specific intent taxonomy and entity taxonomy. The in-product agent in ai-evaluation reduces this to a plain-English description. The first version ships in a day. Iteration on the rubric versions happens release over release.

Downstream correlation is a scenario-level metric. The first three rubrics run per turn. Downstream task-completion correlation runs per scenario. The two cadences don’t share an axis. The mitigation is to track them in two dashboards. Per-turn rubrics on the live-traffic dashboard. Scenario correlation on the CI dashboard. The reviewer needs both views to act on regressions.

Frequently asked questions

What is WER and why is it the default voice AI metric?
Word Error Rate is the word-level edit distance between a transcript and a human reference, normalized by reference length. It dates to 1970s speech recognition research. WER is the default because it's cheap to compute, easy to compare across ASR vendors, and aligned with the academic benchmark stack (LibriSpeech, Common Voice, Switchboard). It treats every word as equally important. That assumption is what breaks for voice agents in 2026, where missing a customer name or a dollar amount is catastrophic and missing a filler word is fine.
Which beyond-WER metrics should I track for a voice agent?
Four metrics carry most of the signal. Intent preservation: does the transcript produce the same agent action as the gold transcript. Entity F1: did named entities (numbers, names, dates, drug names, account identifiers) land correctly. Semantic similarity: do the transcript and reference encode the same meaning. Downstream task-completion correlation: does the transcript drive the agent to the right outcome. The first three are runnable on a per-turn basis. The fourth runs against a synthetic or replayed multi-turn scenario.
How does intent preservation differ from WER?
WER counts edits. Intent preservation asks whether the downstream classifier or LLM still routes to the same intent given the transcript. A 12% WER transcript can have 99% intent preservation when the errors fall on filler words. A 4% WER transcript can have 70% intent preservation when one error swaps a critical entity. Future AGI's ai-evaluation supports custom intent-preservation rubrics authored from your intent taxonomy. The Evaluator class runs it on every turn alongside the audio_transcription rubric.
Can I run all four metrics in a single eval pass?
Yes, for the per-turn metrics. ai-evaluation's Evaluator accepts a list of eval templates and runs them in parallel against the same input. Pair the built-in audio_transcription rubric with custom intent-preservation, entity-F1, and semantic-similarity rubrics. One API call returns scores for all four. Downstream task-completion correlation is a scenario-level metric, so it runs as a separate pass that scores task_completion on the live-ASR run and on the reference run. The MLLMAudio test case wraps any of seven audio formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma) for the audio leg.
Why does named-entity recall matter more than overall accuracy?
In a banking, healthcare, or scheduling voice agent, the high-value words are entities. A customer name, a dollar amount, a date, a drug name, a routing number. Missing a single entity changes the agent's tool call. Missing the same number of filler words is invisible. Named-entity recall plus precision (entity F1) isolates the high-value words and scores them on their own. Tracking entity F1 alongside WER reveals when an ASR upgrade improves overall accuracy but degrades on the entity classes that matter.
How do I correlate transcript quality with end-to-end task completion?
Run the same scenario twice. Once with the live ASR transcript driving the agent, once with a gold reference transcript driving the agent. Score task_completion on both runs. The task_completion delta between the two runs is the metric that matters. Future AGI Simulate auto-generates branching scenarios (20, 50, or 100 rows) from an agent definition. Run both versions in CI. The delta surfaces which ASR error classes break the agent.
Does Future AGI replace WER scoring?
No. WER is still the right baseline for cross-vendor ASR comparison. Future AGI's ai-evaluation adds the layer that WER doesn't cover: intent preservation, entity-F1, semantic similarity, downstream correlation, and per-error-class breakdown. The audio_transcription rubric ships WER-class scoring in the same API that runs the beyond-WER rubrics. You keep WER for the ASR vendor scorecard. You add the beyond-WER rubrics for the agent scorecard.