What Is a WER Score?
The numeric output of WER computation: a ratio between 0 and 1 (lower is better) for a single transcript or aggregated cohort.
A Word Error Rate (WER) score is the numeric output of computing WER on a single transcript or aggregated cohort: a ratio that is typically between 0 and 1 (or expressed as 0% to 100%) where lower is better; insertion-heavy transcripts can push it above 1, though production scores rarely get there. It is the value rendered on dashboards, asserted against in regression tests, and returned by FutureAGI’s ASRAccuracy evaluator per session. A WER score is interpretable only in context — what language, what audio path, what domain, what reference set — because thresholds for “good” vary widely. A 0.10 score is excellent for noisy multilingual telephone audio and concerning for studio-quality English dictation.
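Concretely, the score comes from alignment counts: substitutions (S), deletions (D), and insertions (I) over the N words of the reference. A minimal sketch of the arithmetic (the function and counts are illustrative, not part of any SDK):

def wer_score(S: int, D: int, I: int, N: int) -> float:
    # Word Error Rate: total edit errors divided by reference word count
    return (S + D + I) / N

# 2 substitutions, 1 deletion, 0 insertions against a 30-word reference
print(wer_score(2, 1, 0, 30))  # 0.1, i.e. 10%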
Why It Matters in Production LLM and Agent Systems
The WER score is the first metric a voice-agent team looks at after deployment. It is also the most commonly misinterpreted, because the same score means different things in different contexts. A team that sees 0.08 on its English production cohort and feels safe may simultaneously be running 0.22 on Spanish-mobile without realizing it — the global average looks fine while a real cohort is failing.
Score interpretation matters because it drives decisions. A 0.05 score (5% errors) feels great until you realize that on a 30-word account-balance request, that’s about 1.5 wrong words on average, and if those errors land on numeric tokens the customer’s balance is read incorrectly. A 0.15 score on a clinical transcript hides the question of which words are wrong: medication names being substituted is a different category of risk than filler words being skipped. The score itself is necessary but not sufficient.
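The back-of-envelope arithmetic is worth keeping at hand; a throwaway sketch (the utterance lengths are illustrative):

wer = 0.05
for words in (10, 30, 100):
    # expected wrong words at this WER, assuming errors spread evenly
    print(f"{words}-word utterance -> ~{wer * words:.1f} expected wrong words")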
By 2026, mature voice-agent teams treat the WER score as one slice of a multi-metric panel. Alongside it sit substitution-only WER (errors that change meaning), insertion-only WER (ASR hallucinations), numeric-token accuracy, and named-entity error rate. The single score remains the headline number on the dashboard; the breakdown is what triggers triage.
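Outside the FutureAGI SDK, the open-source jiwer package exposes the same split; a sketch with invented transcripts (assuming jiwer 3.x, where process_words returns per-error-type counts):

import jiwer  # pip install jiwer

reference = "transfer three hundred dollars to checking"
hypothesis = "transfer three hundred dollars two checking"

out = jiwer.process_words(reference, hypothesis)
print(out.wer)            # total WER score: 1 error / 6 words ~= 0.167
print(out.substitutions)  # meaning-changing swaps ("to" -> "two")
print(out.insertions)     # ASR hallucinations
print(out.deletions)      # dropped words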
How FutureAGI Handles WER Score
FutureAGI’s ASRAccuracy evaluator returns a structured score per session: total WER, substitution rate, insertion rate, deletion rate, and an optional named-entity-only WER when entity extraction is enabled. Sessions are tagged with language, accent (when detectable), audio path, and ASR provider via traceAI-livekit or traceAI-pipecat. Aggregated WER scores roll up to dashboard cohorts; per-session scores are queryable for triage.
A concrete example: a healthcare voice-intake agent shows mean WER 0.09 across all sessions. The dashboard slice by language=Spanish shows 0.18, and slicing further by accent=Caribbean-Spanish shows 0.27. Drilling into a specific session, the substitution rate is 0.21 with most errors on medication names — the pretrained ASR did not see Caribbean-Spanish brand names in training. The team adds a domain-adaptation layer with a custom medication vocabulary and re-runs ASRAccuracy via the simulate SDK’s LiveKitEngine. The cohort WER drops to 0.11. The fix ships behind Agent Command Center’s weighted-routing to ramp gradually with monitored WER.
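That kind of slicing can be prototyped on exported per-session scores with a few lines of pandas; a sketch over a hypothetical export with language, accent, and wer columns (values invented):

import pandas as pd

# one row per session, exported from per-session WER scores
sessions = pd.DataFrame({
    "language": ["en", "en", "es", "es"],
    "accent":   ["us", "us", "mx", "caribbean"],
    "wer":      [0.07, 0.08, 0.12, 0.27],
})

print(sessions["wer"].mean())  # global mean looks fine
print(sessions.groupby(["language", "accent"])["wer"].mean())  # the cohort does not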
For threshold-based gating, teams define a release-gate WER (e.g., must stay under 0.12 on the production cohort) and use FutureAGI’s regression-eval workflow: every release runs the same audio test bank, the WER score is computed, and the deploy is blocked if the score crosses threshold.
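A minimal gate sketch, reusing the ASRAccuracy call shape from the example below (the test-bank paths, reference strings, and 0.12 threshold are illustrative):

from fi.evals import ASRAccuracy

GATE = 0.12  # release-gate threshold for this cohort
test_bank = [
    ("/bank/call_001.wav", "reference transcript one"),
    ("/bank/call_002.wav", "reference transcript two"),
]

asr = ASRAccuracy()
scores = [asr.evaluate(audio_path=path, reference_text=ref).score
          for path, ref in test_bank]
mean_wer = sum(scores) / len(scores)
if mean_wer > GATE:
    raise SystemExit(f"release blocked: cohort WER {mean_wer:.3f} exceeds {GATE}")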
How to Measure or Detect It
WER score reporting needs structure beyond a single number:
- Total WER score (ASRAccuracy output) — the headline metric.
- Substitution / insertion / deletion split — meaning-changing errors vs ASR hallucinations vs dropped words.
- Per-cohort WER (dashboard signal) — language, accent, audio path, ASR provider.
- Numeric-token accuracy — separate score for digits, dates, currency; usually worse than aggregate WER (a sketch follows the code example below).
- Named-entity-only WER — when entity extraction is wired, score errors on named entities specifically.
- Score trend — week-over-week WER change is the regression signal.
- Pre-deploy WER vs production WER — the gap is the lab-to-field calibration error.
from fi.evals import ASRAccuracy

asr = ASRAccuracy()
result = asr.evaluate(
    audio_path="/sessions/abc.wav",  # session audio to score
    reference_text=ground_truth,     # human-verified reference transcript, defined upstream
)
print(result.score)     # total WER score
print(result.metadata)  # per-error-type breakdown
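Numeric-token accuracy from the list above deserves its own check, since an acceptable aggregate score can still garble digits; a deliberately crude regex sketch (the pattern and transcripts are illustrative):

import re

NUM = re.compile(r"\d[\d.,:/-]*")  # digits, dates, currency amounts

ref = "your balance is 1,540.25 due on 03/14"
hyp = "your balance is 1,540.25 due on 03/04"

ref_nums = NUM.findall(ref)
hyp_nums = NUM.findall(hyp)
matches = sum(r == h for r, h in zip(ref_nums, hyp_nums))
print(matches / max(len(ref_nums), 1))  # 0.5: one of two numeric tokens wrong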
Common Mistakes
- Treating a single WER score as the metric. Use it as the headline; trigger triage from the breakdown.
- No reference set. Without ground-truth transcripts, the score is uncomputable; sample for reference labels.
- Comparing scores across vendors with different normalization. Some vendors strip punctuation and casing differently; align normalization, or the comparisons are meaningless (see the sketch after this list).
- One threshold for all cohorts. Different languages, accents, and routes need different gate thresholds.
- Reporting WER as a coarse percentage. Report the ratio to at least three decimal places (0.082, not 8%); cohort comparisons need the precision.
- Comparing live WER to lab WER without sampling parity. Lab audio is usually cleaner; without matched cohort sampling, the deltas are illusory.
- Confusing aggregated cohort WER with per-session WER. Release-gate thresholds belong on cohorts; per-session triage belongs on individual scores.
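For the normalization pitfall above, the safest pattern is to apply one normalizer to both sides before scoring; a sketch with a hand-rolled normalizer and jiwer for the final ratio:

import re
import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)       # strip punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

ref = "Transfer $300 to checking."
hyp = "transfer $300 to checking"

# identical normalization on both sides, regardless of vendor defaults
print(jiwer.wer(normalize(ref), normalize(hyp)))  # 0.0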
Frequently Asked Questions
What is a WER score?
A WER score is the numeric output of computing Word Error Rate on a transcript or cohort: a ratio between 0 and 1 where lower is better. Production English ASR typically scores 0.05-0.12.
What is a good WER score?
Good WER scores depend on context. Clean studio English: under 0.05. Conversational telephone English: 0.07-0.12. Accented or multilingual: 0.10-0.20. The right benchmark is your own reference set, not a vendor-published number.
How does FutureAGI report WER scores?
FutureAGI's ASRAccuracy evaluator returns a per-session WER score plus a substitution/insertion/deletion breakdown. Scores are aggregated by language, accent, route, and ASR provider in the dashboard, and can be set as release-gate thresholds.