What Is WER?

WER (Word Error Rate) is the canonical accuracy metric for automatic speech recognition: the share of words in a generated transcript that differ from a reference, computed as (substitutions + insertions + deletions) divided by reference word count. Lower is better. Production English ASR on clean conversational audio typically sits around 0.05–0.12; degraded audio, accented speech, and out-of-domain vocabulary push it higher. WER is the headline number every voice-agent platform reports, and it is what FutureAGI tracks via the ASRAccuracy evaluator on every transcript ingested through traceAI-livekit or traceAI-pipecat.
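
The formula is easy to verify by hand with a standard word-level Levenshtein alignment. A minimal, self-contained sketch (illustrative only, not FutureAGI's implementation):

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + sub,  # substitution or match
            )
    return dp[-1][-1] / len(ref)  # assumes a non-empty reference

# One substitution ("account" -> "discount") over 5 reference words -> 0.20
print(wer("please confirm the account balance",
          "please confirm the discount balance"))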

Why It Matters in Production LLM and Agent Systems

Every downstream voice-agent metric (intent classification, sentiment, resolution scoring, summary fidelity) inherits errors from the transcript. A 0.20 WER means roughly one word in five is wrong, so the LLM consuming that transcript is reading garbled input. Fix nothing else and just halve WER from 0.18 to 0.09, and intent-classification F1 typically lifts by 8–12 points and resolution scoring by 5–8 points. WER is upstream of everything.

The pain shows up in production along several axes. A platform owner sees voice-agent resolution rate drop after a model swap and finds that WER regressed silently: the new ASR was 1.5× faster but 2× less accurate. A product lead fields complaints about the agent “mishearing” account numbers and discovers a numeric-token regression specific to over-the-phone audio. A compliance lead audits a sample and finds that the LLM summarizer invented details to fill gaps where the ASR had dropped words that were clearly audible in the recording.

By 2026, voice-agent stacks built on LiveKit, Pipecat, OpenAI Realtime, and AssemblyAI all expose WER as a first-class metric. The right pattern is to track WER per language, per accent, per route, and per provider — never just one global number — and to alert on per-segment regressions before downstream scores notice.

How FutureAGI Handles Word Error Rate

FutureAGI’s approach is to make WER a continuous metric tied to the same trace as the agent’s reasoning. The pattern: voice sessions are instrumented with traceAI-livekit, which captures the audio path, the ASR transcript, and any reference transcript when one is available (e.g., from the agent’s own typed-out confirmation). The ASRAccuracy evaluator runs against samples that have reference text and computes WER per session, aggregated by language, accent, route, and ASR provider.
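
A sketch of that per-cohort loop, using the ASRAccuracy call shown later in this section; the session records and metadata fields here are illustrative, not a fixed traceAI schema:

from collections import defaultdict
from fi.evals import ASRAccuracy

# Illustrative sampled sessions; in practice these come from traceAI-livekit
# traces that carry both the session audio and a reference transcript.
sessions = [
    {"audio": "/sessions/call-001.wav", "reference": "transfer two hundred dollars",
     "language": "en", "provider": "provider_a"},
    {"audio": "/sessions/call-002.wav", "reference": "quiero revisar mi saldo",
     "language": "es", "provider": "provider_b"},
]

asr = ASRAccuracy()
cohort_wer = defaultdict(list)
for s in sessions:
    result = asr.evaluate(audio_path=s["audio"], reference_text=s["reference"])
    cohort_wer[(s["language"], s["provider"])].append(result.score)

# Mean WER per (language, provider): the per-cohort slicing recommended above.
for cohort, scores in sorted(cohort_wer.items()):
    print(cohort, round(sum(scores) / len(scores), 3))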

A concrete example: a banking voice-agent fleet runs on LiveKit with two ASR providers behind a routing policy — Provider A for English, Provider B for Spanish. FutureAGI tracks ASRAccuracy per route. Over a 24-hour window, English WER sits at 0.07 (healthy), Spanish at 0.16 (concerning), and Spanish-on-mobile-network at 0.23 (failing). The team adds noise suppression preprocessing for the Spanish-mobile cohort and re-evaluates via the simulate SDK’s LiveKitEngine against a fixed audio test bank. WER drops to 0.11. The fix is shipped behind Agent Command Center’s traffic-mirroring to compare new vs old in production before full cutover. The metric flowed end-to-end from problem detection to fix verification without leaving FutureAGI.

For pre-deploy testing, the simulate SDK runs voice scenarios through the agent stack, captures audio and reference, and reports WER as part of the TestReport output. That makes WER a pre-merge regression check, not just a production dashboard.
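
In CI, that gate can be as simple as a pytest check over a fixed audio bank. This sketch calls the ASRAccuracy evaluator directly rather than the full simulate SDK flow; the paths and the 0.10 budget are illustrative, not prescribed values:

from fi.evals import ASRAccuracy

# Fixed audio test bank: (audio file, known-good reference transcript).
TEST_BANK = [
    ("tests/audio/en_clean_001.wav", "please confirm the account balance"),
    ("tests/audio/es_mobile_001.wav", "quiero revisar mi saldo"),
]
WER_BUDGET = 0.10  # illustrative threshold; tune per cohort

def test_wer_regression():
    asr = ASRAccuracy()
    for audio_path, reference in TEST_BANK:
        result = asr.evaluate(audio_path=audio_path, reference_text=reference)
        assert result.score <= WER_BUDGET, (
            f"WER regression on {audio_path}: {result.score:.2f} > {WER_BUDGET}"
        )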

How to Measure or Detect It

WER measurement needs reference transcripts and per-segment slicing:

  • ASRAccuracy — primary evaluator; returns WER per session plus substitution/insertion/deletion breakdown.
  • Per-language, per-accent WER (dashboard signal): the cohort breakdown that surfaces real degradation.
  • AudioQualityEvaluator — upstream signal; degraded input audio reliably inflates WER.
  • CaptionHallucination — checks for ASR-invented words (insertions); important when models rephrase rather than transcribe.
  • Numeric-token error rate — accuracy on numbers (account IDs, dates, dollar amounts) is often worse than aggregate WER and matters more; see the numeric-token sketch after the code example below.
  • Reference-set staleness — WER drift can be a reference-set issue, not an ASR issue; refresh quarterly.
A minimal call against a single session:

from fi.evals import ASRAccuracy

# Score one session's transcript against its human-verified reference.
asr = ASRAccuracy()
result = asr.evaluate(
    audio_path="/sessions/call-12345.wav",   # recorded session audio
    reference_text=ground_truth_transcript,  # reference transcript string
)
print(result.score, result.reason)  # WER plus error breakdown
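
The numeric-token check from the list above can run alongside aggregate WER. A rough, position-insensitive sketch; the regex and matching rules are assumptions to tune for your formats:

import re

def numeric_token_error_rate(reference: str, hypothesis: str) -> float:
    """Share of numeric tokens in the reference missing from the hypothesis.

    Position-insensitive sketch; real account-ID checks usually need
    alignment plus digit normalization ("twenty" vs "20").
    """
    ref_nums = re.findall(r"\d+", reference)
    hyp_nums = re.findall(r"\d+", hypothesis)
    if not ref_nums:
        return 0.0
    missing = [n for n in ref_nums if n not in hyp_nums]
    return len(missing) / len(ref_nums)

print(numeric_token_error_rate(
    "account 48213 balance 1500 dollars",
    "account 48218 balance 1500 dollars",
))  # 1 of 2 numeric tokens wrong -> 0.5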

Common Mistakes

  • Reporting one global WER. Languages, accents, and audio paths degrade differently; report per cohort.
  • Ignoring punctuation and casing in WER calculation. Standard WER strips both; if your downstream cares about punctuation, use a custom metric (normalization sketch after this list).
  • No numeric-token accuracy. Aggregate WER can be 0.07 while account-number accuracy is 0.65; track separately.
  • Using ASR confidence as a proxy for WER. They correlate weakly; only reference-text comparison gives true WER.
  • Pre-deploy WER but no production WER. Lab WER does not match field WER; sample production with reference text continuously.
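
For the punctuation-and-casing pitfall above, the usual preprocessing applied before scoring looks roughly like this:

import string

def normalize_for_wer(text: str) -> str:
    # Standard WER preprocessing: lowercase, strip punctuation, collapse
    # whitespace. If punctuation matters downstream, score it separately.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(normalize_for_wer("Please, confirm: the Account balance."))
# -> "please confirm the account balance"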

Frequently Asked Questions

What is Word Error Rate (WER)?

Word Error Rate (WER) is the canonical ASR accuracy metric: the share of words in a transcript that differ from a reference, computed as (substitutions + insertions + deletions) divided by reference word count.

How is WER different from CER?

WER counts errors at the word level; CER (Character Error Rate) counts at the character level. CER is more granular and useful for languages without clear word boundaries; WER is the standard for English-language ASR.

How does FutureAGI measure WER?

FutureAGI's ASRAccuracy evaluator computes WER per session and aggregates by language, accent, and route. The evaluator runs offline on labeled samples and online on production transcripts paired with reference text.