What Is Word Error Rate?
Word error rate is the ratio of ASR substitutions, insertions, and deletions to the number of words in the reference transcript.
Word error rate (WER) is a voice AI evaluation metric that measures transcript errors from automatic speech recognition by comparing ASR output with a reference transcript. It counts substitutions, insertions, and deletions, then divides by the number of reference words. In production, WER shows up in the eval pipeline and voice traces before the LLM or agent receives text. FutureAGI connects WER to ASRAccuracy, voice simulation results, and downstream task success so teams know whether bad answers began as bad hearing.
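To make the arithmetic concrete, here is a minimal sketch of the computation: word-level edit distance over whitespace tokens, with uniform edit costs and a non-empty reference assumed. The wer helper is illustrative, not a FutureAGI API.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete every remaining reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("cancel line two", "cancel line too"))  # one substitution in three words: ~0.33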
Why Word Error Rate Matters in Production Voice Agents
High WER turns a correct voice agent into a confident agent reading the wrong request. A caller says “cancel line two” and the ASR transcript becomes “cancel line too”; a billing agent may call the wrong tool with a plausible parameter. That failure often appears as silent intent drift, not as an ASR crash. The LLM can produce a fluent, grounded answer to text the user never said.
The pain splits across teams. Developers see tool-call arguments that look valid but fail business checks. SREs see p99 time-to-first-audio improve while resolution rate drops, because a faster ASR model lost accuracy on noisy calls. Product owners see escalations rise for accents, phone codecs, or field names. Compliance teams lose confidence in call summaries because the transcript has already altered consent language or account identifiers.
The logs usually show clustered symptoms: higher WER by accent, locale, channel, or background-noise cohort; falling transcription confidence; more clarification turns; more “user repeated themselves” events; more human handoffs after ASR spans. In 2026's multi-step voice pipelines, every later stage amplifies WER: a bad transcript can poison retrieval, choose the wrong tool, trigger the wrong policy, or write a false record into agent memory. Unlike a standalone Whisper benchmark, production WER must be read beside call outcome, task completion, latency, and cohort mix.
How FutureAGI Handles Word Error Rate
FutureAGI handles WER as an ASR-boundary signal, not as a general “call quality” label. The concrete FAGI surface is ASRAccuracy (eval:ASRAccuracy), a speech-to-text accuracy evaluator from fi.evals. In a dataset or nightly regression eval, the engineer pairs audio_path, reference_transcript, asr_transcript, locale, codec, and noise_condition columns with a WER metric computed from substitution, insertion, and deletion counts. ASRAccuracy gives the FutureAGI eval layer a consistent ASR-accuracy score; the raw WER counters explain why the score moved.
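As a sketch of that dataset shape and the per-cohort rollup (the pandas code is illustrative, not a FutureAGI API), note that corpus-level WER sums the counters before dividing, so long calls are not averaged away by short ones:

import pandas as pd

# One row per scored call; cohort columns mirror the eval schema above,
# with the raw WER counters already computed per call.
rows = pd.DataFrame([
    {"locale": "en-US", "codec": "opus", "noise_condition": "quiet",
     "substitutions": 1, "insertions": 0, "deletions": 0, "reference_words": 42},
    {"locale": "en-US", "codec": "pcmu", "noise_condition": "street",
     "substitutions": 4, "insertions": 2, "deletions": 1, "reference_words": 38},
])

cohorts = rows.groupby(["locale", "codec", "noise_condition"]).sum(numeric_only=True)
cohorts["wer"] = (
    cohorts["substitutions"] + cohorts["insertions"] + cohorts["deletions"]
) / cohorts["reference_words"]
print(cohorts["wer"])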
FutureAGI’s approach is to connect that ASR score to the rest of the voice-agent run. With LiveKitEngine, a team can simulate calls from persona and scenario sets, capture the transcript, and attach ASRAccuracy to each call result. With traceAI-livekit, the same call can emit ASR, LLM, and TTS spans, so WER is read next to time-to-first-audio, clarification turns, and final TaskCompletion.
A realistic workflow: a healthcare intake agent ships a new ASR provider. Nightly simulations replay 2,000 consent and medication scenarios. WER stays under 5% overall, but jumps to 14% for noisy mobile calls with drug names. The engineer sets an alert on eval-fail-rate-by-cohort, keeps the old provider for that route, and adds a regression eval before the next release.
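The alert itself reduces to a threshold check over per-cohort WER. A minimal sketch, with cohort names and budget values taken from the scenario above:

# Per-cohort WER from the nightly run; values mirror the scenario above.
cohort_wer = {
    "quiet_landline": 0.031,
    "noisy_mobile_drug_names": 0.14,
}
WER_BUDGET = 0.05  # release gate: block rollout when any cohort exceeds this

regressions = {cohort: score for cohort, score in cohort_wer.items() if score > WER_BUDGET}
if regressions:
    # In production this would page or fail a CI gate; printing keeps the sketch runnable.
    print(f"WER regression in cohorts: {regressions}")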
How to Measure or Detect Word Error Rate
Measure WER against a trusted reference transcript. Normalize punctuation and casing only when that matches your product risk; for account numbers, drug names, and compliance scripts, normalize less aggressively so that scoring stays strict.
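A toy sketch of why loose normalization is risky (the normalizer below is illustrative): stripping punctuation makes a mangled account number and dosage score as error-free.

import re

def loose_normalize(text: str) -> str:
    # Lowercase and drop punctuation: fine for conversational scoring,
    # but it erases exactly the distinctions that matter for IDs and dosages.
    return re.sub(r"[^\w\s]", "", text.lower())

reference = "Account 4417-22, take 0.5 mg."
asr_output = "account 441722 take 05 mg"

# After loose normalization the two strings match, so WER reports zero errors,
# even though the account separator and the decimal point were lost.
print(loose_normalize(reference) == loose_normalize(asr_output))  # True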
- WER formula: (substitutions + insertions + deletions) / reference_words; lower is better.
- ASRAccuracy: FutureAGI speech-to-text evaluator; use it to score ASR output against a reference transcript.
- Cohort slices: track WER by accent, locale, device, codec, microphone, noise condition, and ASR provider version.
- Trace signals: pair WER with traceAI-livekit ASR spans, time-to-first-audio, transcription confidence, and retry or clarification counts.
- Outcome proxy: alert when WER rises alongside task-failure rate, escalation rate, or user repeat rate.
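In code, scoring one ASR transcript against its human reference with ASRAccuracy looks like this (the example transcripts are placeholders for real eval data):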
from fi.evals import ASRAccuracy

# Placeholder transcripts; in a real eval these come from the dataset columns.
asr_transcript = "cancel line too"
human_transcript = "cancel line two"

evaluator = ASRAccuracy()
result = evaluator.evaluate(
    prediction=asr_transcript,   # ASR output under test
    reference=human_transcript,  # trusted reference transcript
)
print(result.score)
Common Mistakes with Word Error Rate
The common pattern is confusing a neat benchmark with production speech behavior. Watch for these mistakes before choosing thresholds or switching ASR providers:
- Reporting only average WER. A 4% global number can hide a 20% failure on noisy mobile calls or one language cohort.
- Normalizing away product-critical tokens. Lowercasing and stripping punctuation can mask account IDs, dosages, prices, and dates that must survive transcription.
- Treating ASR accuracy as call success. A low WER transcript can still drive a failed tool call, refusal, or wrong summary.
- Comparing providers on clean clips only. Live calls include barge-ins, packet loss, music, hold prompts, and overlapping speech.
- Ignoring insertions. Inserted words are often worse than deletions because they create facts the user never said.
Frequently Asked Questions
What is word error rate (WER)?
Word error rate (WER) is a transcription-error metric for automatic speech recognition. It counts substitutions, insertions, and deletions against a reference transcript, then divides by reference word count.
How is WER different from ASR accuracy?
WER is an error ratio, so lower is better. ASR accuracy expresses how much of the speech-to-text output was correct, so higher is better; many teams report word accuracy as 1 - WER, so an 8% WER reads as roughly 92% accuracy.
How do you measure word error rate?
Use FutureAGI's ASRAccuracy evaluator to score ASR output against a reference transcript, then slice WER counters by cohort. Pair the result with traceAI-livekit spans and call-level task outcome.