What Is Automatic Speech Recognition?

Automatic speech recognition converts spoken audio into text so downstream voice agents, captions, search, or analytics can process the user's words.

Automatic speech recognition (ASR) is the voice AI component that converts spoken audio into text for downstream systems. In production LLM and agent pipelines, ASR sits before the reasoning model: it turns a caller’s words into the transcript that a voice agent, RAG workflow, or support automation uses. As of May 2026 the ASR landscape has shifted under the field’s feet. Whisper v3-large and Deepgram Nova-3 are now table stakes, OpenAI’s GPT-4o-transcribe / GPT-4o-mini-transcribe ship sub-300ms streaming with multilingual context conditioning, Gemini 3 Pro added native audio in/out, and AssemblyAI Universal-2 leads on enterprise diarization. Teams measure ASR inside eval pipelines and production traces because one wrong word can send a perfect LLM down the wrong path. FutureAGI evaluates ASR with ASRAccuracy against reference transcripts and pairs it with CaptionHallucination checks for inserted-word risk.

Why ASR matters in production LLM and agent systems

ASR errors are upstream failures disguised as model bugs. If a call-center agent hears “cancel my renewal” as “can sell my renewal,” the LLM may select the wrong policy, retrieve the wrong document, and call the wrong billing tool. If the transcriber drops “not” in a medical-intake flow, the compliance record and the follow-up advice can both become unsafe. The model may reason well; it is reasoning over a corrupted transcript. We’ve found in our 2026 evals that 30-40% of voice-agent “model bugs” reported by product teams are actually ASR substitution errors that the LLM faithfully obeyed.

The pain is shared across teams. Developers see prompt regressions that only reproduce on noisy mobile calls. SREs see ASR-stage latency p99 climb while the chat model looks healthy. Product teams see repeated clarification turns, abandonment, and lower task completion rates. Compliance teams see missing consent phrases or malformed records. The symptoms usually appear as rising word error rate, substitution spikes on named entities, insertion errors, low ASR confidence, caption hallucinations, and downstream tool-call mismatches.

ASR is especially important in 2026-era agentic voice pipelines because the transcript is no longer just a caption: it is the planner input, the retrieval query, the function argument, the CRM note, and the audit artifact. Unlike a raw Whisper, Deepgram, or AssemblyAI benchmark on clean LibriSpeech audio, production ASR reliability depends on channel, accent, interruption, background noise, streaming latency, and whether the downstream agent can still finish the user’s task when the transcript is partially wrong. A 4% WER headline number means little when the 4% errors all land on medication names, account IDs, or refusal triggers.
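To make the point about errors landing on entities concrete, here is a minimal sketch of an entity error rate. The function name, the substring-matching rule, and the sample strings are illustrative assumptions, not a FutureAGI API; a production version would normalize numbers and units before matching.

```python
def entity_error_rate(reference_entities, hypothesis_text):
    """Fraction of reference entities that do not appear in the hypothesis.

    Assumption: case-insensitive substring matching is good enough for a
    sketch; it ignores normalization such as "20 mg" vs "twenty milligrams".
    """
    if not reference_entities:
        return 0.0
    hyp = hypothesis_text.lower()
    missed = [e for e in reference_entities if e.lower() not in hyp]
    return len(missed) / len(reference_entities)


# Hypothetical call: the drug name is mis-heard, the dose and ID survive.
reference_entities = ["metformin", "500 mg", "A-104"]
hypothesis = "please refill my met forming 500 mg for account a-104"
rate = entity_error_rate(reference_entities, hypothesis)
print(f"entity error rate: {rate:.2f}")
```

Most of the words in this transcript are right, yet one of the three entities the tool call depends on is gone, which is the gap a headline WER does not show.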

The architectural shift in 2026 is end-to-end speech-to-speech models. GPT-4o-realtime and Gemini 3 audio can skip an explicit text transcript and reason directly over audio, which collapses the ASR stage into the LLM. That sounds like it ends ASR evaluation; it does the opposite. Without an explicit transcript, you lose the cheapest debugging artifact you had, and the hallucination modes move into the LLM’s audio understanding where they are harder to attribute. Production-grade teams still emit a transcript alongside the speech-to-speech path so ASRAccuracy, audit, and CRM-record workflows keep working. End-to-end models also do not free you from the cohort problem: accent, noise, and entity error rates still vary by user segment and still need evaluator-level dashboards.

The 2026 ASR provider landscape

This is the table senior engineers benchmarking voice agents should keep on hand. Numbers are publicly reported figures or community measurements as of Q2 2026; verify against your own audio.

| Provider / Model | Streaming latency (p50) | Reported WER (clean / accented) | Notable 2026 traits | Common failure mode |
|---|---|---|---|---|
| OpenAI GPT-4o-transcribe | ~280ms | 2.7% / 6-9% | Context-aware via system prompt; supports diarization hints | Over-corrects to “plausible” words on novel vocabulary |
| OpenAI GPT-4o-mini-transcribe | ~180ms | 3.4% / 8-11% | Cheap streaming; weak on heavy accents | Insertion hallucinations on silence |
| Deepgram Nova-3 | ~220ms | 2.9% / 5-7% | Self-hostable; best entity formatting | Aggressive number normalization breaks tool args |
| AssemblyAI Universal-2 | ~310ms | 2.8% / 5-8% | Strong diarization, sentiment, redaction | PII redaction can leak into downstream context |
| Google Gemini 3 audio | ~250ms | 3.1% / 6-10% | Multimodal-native, multilingual context | Diarization weaker than dedicated ASR |
| Anthropic Claude voice (preview) | ~300ms | 3.0% / 7-10% | Strong with code-switching English / Spanish | Limited language coverage outside top 10 |
| Whisper v3-large (self-hosted) | 400-800ms batched | 3.2% / 7-12% | Free, offline; well-known hallucination on silence | Insertion of training-set phrases (“thank you for watching”) |
| Whisper v3-turbo | ~250ms | 3.5% / 8-12% | Faster, slightly lower accuracy | Same silence-hallucination class |

Pick by route, not by leaderboard. A medication-intake call should optimize entity accuracy, not p50 latency. A retail order-status loop wants sub-300ms streaming and is tolerant of a 1-point WER hit. We’ve found that teams running per-route provider routing (GPT-4o-transcribe for medical, Nova-3 for entity-heavy retail, Whisper v3 self-hosted for cost-sensitive batch) outperform single-provider stacks on both p99 latency and end-to-end task completion.
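Per-route routing can be sketched as a small lookup table with a fallback. The primary providers follow the pairing described above; the route names, provider identifier strings, fallback choices, and the `pick_provider` helper are all hypothetical and stand in for whatever your gateway actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutePolicy:
    primary: str   # provider to use when it is healthy
    fallback: str  # assumed second choice; not taken from the article


# Hypothetical route table; identifiers are illustrative labels only.
ROUTES = {
    "medical-intake": RoutePolicy("gpt-4o-transcribe", "nova-3"),
    "retail-order-status": RoutePolicy("nova-3", "gpt-4o-mini-transcribe"),
    "batch-archive": RoutePolicy("whisper-v3-self-hosted", "nova-3"),
}


def pick_provider(route, healthy):
    """Return the primary provider for a route, or its fallback if unhealthy."""
    policy = ROUTES[route]
    if policy.primary in healthy:
        return policy.primary
    if policy.fallback in healthy:
        return policy.fallback
    raise RuntimeError(f"no healthy ASR provider for route {route!r}")


print(pick_provider("medical-intake", {"gpt-4o-transcribe", "nova-3"}))
print(pick_provider("medical-intake", {"nova-3"}))
```

Keeping the table as data rather than branching logic means the same structure can be read by the eval harness and by the runtime router.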

How FutureAGI handles automatic speech recognition

FutureAGI’s approach is to treat ASR as an evaluable boundary inside the voice-agent trace, not as an invisible provider detail. The specific surface for this term is the ASRAccuracy evaluator: it scores speech-to-text output against a reference transcript and feeds the result into release gates, dashboards, and regression cohorts. The row-level record stores the audio file path, ASR provider, provider transcript, reference transcript, language, accent or channel tags, WER, entity-error-rate, ASR-stage latency, and ASRAccuracy score.

A concrete 2026 workflow: a healthcare scheduling team runs a nightly LiveKitEngine simulation over 2,000 Persona cases with accents, noisy rooms, interruptions, code-switching, and medication names. The voice runtime is instrumented with traceAI-livekit, so each call keeps the ASR stage, transcript, audio artifact, and per-stage latency close to the LLM and tool use spans. FutureAGI runs ASRAccuracy on every simulated call and CaptionHallucination on rows where the transcript includes words not present in the spoken reference. Engineers also attach Groundedness and AnswerRelevancy to the downstream LLM turn so a single dashboard shows whether a transcript regression translated into an answer regression.

The engineer does not stop at a global average. They set a threshold such as ASRAccuracy >= 0.96 and WER <= 4% for medication-name calls, then slice failures by provider, accent cohort, microphone, packet-loss bucket, and noise condition. If a new ASR model drops Indian-English medication calls from 97% to 91%, the release is blocked. The next action is a per-route provider fallback through Agent Command Center, a narrower regression eval on medication-name datasets, or a prompt / tool-schema fix if the downstream agent should have asked for clarification.
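The cohort gate described above could look roughly like the following. This is a plain-Python sketch, not FutureAGI's gate implementation: the `gate` function, the cohort key format, and the `max_drop` tolerance are assumptions, while the 0.96 floor and the 97% to 91% regression come from the example in the text.

```python
def gate(baseline, candidate, floor=0.96, max_drop=0.02):
    """Return {cohort: reason} for every cohort that should block a release.

    baseline / candidate map a cohort label to its mean accuracy score (0-1).
    max_drop is an assumed tolerance for regressions that stay above the floor.
    """
    blocked = {}
    for cohort, base_score in baseline.items():
        cand_score = candidate.get(cohort)
        if cand_score is None:
            blocked[cohort] = "missing from candidate run"
        elif cand_score < floor:
            blocked[cohort] = f"below floor {floor:.2f}: {cand_score:.2f}"
        elif base_score - cand_score > max_drop:
            blocked[cohort] = f"dropped {base_score - cand_score:.2f} vs baseline"
    return blocked


baseline = {"indian-english/medication": 0.97, "us-general/medication": 0.98}
candidate = {"indian-english/medication": 0.91, "us-general/medication": 0.98}

failures = gate(baseline, candidate)
for cohort, reason in failures.items():
    print(f"BLOCK {cohort}: {reason}")
```

A non-empty result is what fails the CI job; the reasons are per cohort so the release note names the affected users rather than a global average.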

Wiring ASR into release gates and runtime fallback

Unlike a static Deepgram WER report or a Whisper leaderboard score, FutureAGI’s pattern keeps the audio, the transcript, the evaluator score, the LLM turn, and the task completion signal on the same row. A release-gate diff says “Indian-English medication calls regressed 6 points on ASRAccuracy and 4 points on TaskCompletion,” not “average WER moved 0.3 points.” That is the difference between a release decision and a rumor. Agent Command Center then expresses the policy: route Indian-English medical traffic to the previous provider until the new one passes, mirror 5% of production traffic for monitoring, and emit a latency alert if the fallback path crosses the voice loop’s p99 budget. The eval-time and runtime configurations share one schema, which is the only way to keep CI numbers and production numbers in sync.
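One way to implement the "mirror 5% of production traffic" part of such a policy is deterministic hash bucketing, so the same call is always either mirrored or not. This is a generic sketch under that assumption; `should_mirror` and the call ID format are hypothetical and do not describe how Agent Command Center samples traffic.

```python
import hashlib


def should_mirror(call_id, percent=5.0):
    """Deterministically select roughly `percent` of calls for mirroring."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto 10,000 buckets (0.01% each).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < int(percent * 100)


# Hypothetical call IDs; the share should land close to 5%.
mirrored = sum(should_mirror(f"call-{i}") for i in range(20_000))
print(f"mirrored {mirrored} of 20000 calls")
```

Hashing the call ID instead of drawing a random number keeps retries and multi-stage spans of the same call in the same bucket.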

Simulating real-world audio with simulate-sdk

Detection without simulation under-tests the system. The simulate-sdk Persona library covers accent, age, gender, code-switching, hearing-impairment, soft-speaker, and noisy-environment variants; Scenario chains those personas through realistic flows: appointment reschedule, refund dispute, medication refill, intake triage. Each simulated call produces an audio artifact and a trace that runs through the same ASRAccuracy and CaptionHallucination evaluators used in CI. We’ve found that adding three persona axes engineers commonly ignore (packet-loss simulation, on-hold music, and barge-in events) typically uncovers a 2-4 point WER gap that clean studio benchmarks miss.
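For intuition about the packet-loss axis, the sketch below zeroes out random 20 ms frames in a PCM sample list. It is a standalone approximation, not the simulate-sdk implementation: `drop_frames`, its defaults, and the zero-fill behavior are assumptions, and real codecs usually apply loss concealment rather than inserting silence.

```python
import random


def drop_frames(samples, sample_rate=16_000, frame_ms=20, loss_rate=0.05, seed=0):
    """Return (degraded_samples, dropped_frame_count).

    Each frame is independently replaced with zeros with probability loss_rate.
    A fixed seed keeps the degradation reproducible across eval runs.
    """
    rng = random.Random(seed)
    frame_len = sample_rate * frame_ms // 1000
    out = list(samples)
    dropped = 0
    for start in range(0, len(out), frame_len):
        if rng.random() < loss_rate:
            end = min(start + frame_len, len(out))
            out[start:end] = [0] * (end - start)
            dropped += 1
    return out, dropped


# One second of placeholder audio (constant non-zero samples).
one_second = [1] * 16_000
degraded, dropped = drop_frames(one_second, loss_rate=0.10)
print(f"dropped {dropped} of 50 frames")
```

Running the same reference calls through several `loss_rate` values gives the packet-loss bucket used when slicing evaluator scores.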

Production ASR + LLM coupling

The interesting failures are joint, not isolated. Pair ASR with the LLM turn that consumed its output and the tool use step that came next, then evaluate the trajectory as one unit. The TrajectoryScore evaluator scores whether the voice agent’s full sequence of stages (ASR, intent classification, retrieval, planner, tool call, TTS) succeeded at the user goal, regardless of any single stage’s WER. A 97% transcript that fed a wrong tool call is a worse outcome than a 92% transcript whose downstream planner asked for clarification. We instrument the same trace fields the rest of the agent uses (gen_ai.request.model, agent.trajectory.step, llm.token_count.prompt) alongside the audio span so root cause is one query away.

Benchmarking ASR like an LLM

Frontier ASR providers publish numbers the way LLM labs publish benchmark numbers: clean LibriSpeech WER, CHiME WER, Common Voice WER. Treat these the way you treat MMLU-Pro (roughly 12K questions, a harder MMLU successor) or GPQA Diamond (198 expert-validated questions where frontier models still cluster in the 70-80% range) in 2026: useful for tier-filtering, not for choosing a provider. The release-gate question is whether the provider’s transcripts let your agent close tickets on your traffic. Build a domain ASR golden dataset of 500-2,000 labeled production calls per route, refresh it monthly, and score every candidate provider against it. Public WER tells you whether a provider is plausible; the golden dataset tells you whether it ships. Compared with running a Hugging Face leaderboard query and picking the top model, the golden-dataset approach catches accent and entity regressions that public clean-audio scores never surface.

How to measure or detect ASR issues

Measure ASR at the transcript boundary and at the downstream task boundary. A transcript that scores well on WER but still produces a broken tool call, for example through a mis-normalized number, is still an ASR failure; a noisy transcript whose mistakes the LLM safely resolves is not.

  • ASRAccuracy: FutureAGI evaluator for speech-to-text accuracy against a reference; the main release-gate score, returned as a 0-1 number with a reason.
  • WER: substitution, insertion, and deletion errors divided by reference words; slice by cohort, not only globally. A 3% global WER hiding 12% on Hindi-accented medication calls is a release blocker, not a footnote.
  • Entity error rate: miss rate for names, addresses, product IDs, medication names, dates, dosages, and amounts. Often the most decisive signal for tool-using agents.
  • CaptionHallucination: flags inserted words that were never spoken, the Whisper-class failure that produces “thank you for watching” in silent segments and is dangerous in medical and compliance flows.
  • Trace signals: ASR-stage latency p95/p99, low-confidence segments, repeated clarification turns, barge-in events, and empty-transcript spans alongside the LLM turn that consumed them.
  • User-feedback proxy: abandonment rate, escalation rate, “agent misunderstood me” labels, repeat-call rate within 24 hours, and failed task completion after an ASR miss.
  • Downstream coupling: pair ASRAccuracy with Groundedness and ToolSelectionAccuracy on the same row so a transcript error that propagates into a wrong tool call is one diagnostic, not three.
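
A minimal single-call check:

```python
from fi.evals import ASRAccuracy

# Score one recorded call against its reference transcript.
asr = ASRAccuracy()
result = asr.evaluate(
    audio_path="calls/123.wav",
    ground_truth="I need to change my delivery address",
)
# score is the 0-1 accuracy value; reason explains it.
print(result.score, result.reason)
```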
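For reference, the WER definition in the list above (substitutions, insertions, and deletions divided by reference words) can be computed with a word-level edit distance. This is a dependency-free sketch, not the ASRAccuracy implementation: `wer_counts` is a hypothetical helper, and it only lowercases and splits on whitespace, so punctuation and number formatting are not normalized.

```python
def wer_counts(reference, hypothesis):
    """Return substitutions, insertions, deletions, and WER for two strings."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    # Walk back through the table to attribute each edit to a type.
    subs = ins = dels = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs, i, j = subs + 1, i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ins, j = ins + 1, j - 1
        else:
            dels, i = dels + 1, i - 1

    return {
        "substitutions": subs,
        "insertions": ins,
        "deletions": dels,
        "wer": (subs + ins + dels) / len(ref),
    }


# The call-center example from earlier in the article.
result_wer = wer_counts("cancel my renewal", "can sell my renewal")
print(result_wer)
```

Keeping the three error types separate matters because insertions point toward hallucination on silence, while deletions such as a dropped "not" are the ones that flip meaning.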

For a release gate, wire ASRAccuracy to the voice trace and run cohort-filtered regression over a stored Dataset of accent, microphone, and noise variants. The same evaluator stack runs in CI and in production via the LiveKitEngine simulation, which is the only way to keep the two numbers aligned across releases:

```python
from fi.evals import ASRAccuracy, CaptionHallucination, TaskCompletion, Dataset
from fi.simulate import LiveKitEngine, Persona, Scenario

# Nightly regression on a labeled production cohort
ds = Dataset.load("voice-medication-intake-golden")
report = ds.evaluate(
    evaluators=[ASRAccuracy(), CaptionHallucination(), TaskCompletion()],
    cohort_by=["accent", "microphone", "noise_bucket"],
)
print(report.fail_rate_by_cohort)

# Same evaluators inside a live simulation
engine = LiveKitEngine(
    personas=[Persona("indian-english-elderly"), Persona("us-southern-noisy")],
    scenarios=[Scenario("medication-refill")],
    evaluators=[ASRAccuracy(), CaptionHallucination(), TaskCompletion()],
)
engine.run(n=500)
```

For batch evaluation, run ASRAccuracy across a labeled audio dataset and store per-row scores alongside cohort tags so the dashboard can express “accuracy by accent × provider × noise condition.” Treat anything below 0.96 on a regulated route as a release-gate failure, not a trend line.

Operationally, the highest-signal cohort splits for production voice agents are: accent (geographic and L1-shaped), speaker age (children and elderly speakers fail differently), microphone class (handset, headset, laptop array, conference room), codec (Opus, G.711, MP3 transcoding), noise floor, packet-loss bucket, and call duration. A single dashboard with all seven axes catches regressions a global average will hide for weeks. The same dashboard should overlay downstream TaskCompletion so an ASR cohort whose WER is fine but whose tool-call accuracy is not surfaces as a data drift signal rather than an “agent bug.”
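The way a global average hides a cohort can be shown with a few lines of aggregation. The rows, field names, and `wer_by_cohort` helper below are made-up illustrations, chosen so the totals mirror the "3% global, 12% on one cohort" pattern; errors are pooled over reference words rather than averaged per call.

```python
from collections import defaultdict


def wer_by_cohort(rows, keys):
    """Pool errors and reference words per cohort, then divide."""
    totals = defaultdict(lambda: [0, 0])  # cohort -> [errors, reference words]
    for row in rows:
        cohort = tuple(row[k] for k in keys)
        totals[cohort][0] += row["errors"]
        totals[cohort][1] += row["ref_words"]
    return {cohort: e / n for cohort, (e, n) in totals.items() if n}


# Hypothetical per-cohort totals from an eval run.
rows = [
    {"accent": "us-general", "codec": "opus", "errors": 20, "ref_words": 1000},
    {"accent": "us-general", "codec": "g711", "errors": 25, "ref_words": 1000},
    {"accent": "hindi", "codec": "g711", "errors": 24, "ref_words": 200},
]

global_wer = sum(r["errors"] for r in rows) / sum(r["ref_words"] for r in rows)
by_accent = wer_by_cohort(rows, ["accent"])

print(f"global WER: {global_wer:.1%}")
for cohort, wer in by_accent.items():
    print(f"{cohort}: {wer:.1%}")
```

Passing more keys, for example `["accent", "codec"]`, produces the cross-axis slices a dashboard would chart.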

Treat thresholds as route-scoped, not global. A regulated medication-intake route deserves a 0.97 floor on ASRAccuracy and a hard ceiling on entity-error-rate; a casual order-status route can run at 0.92 in exchange for a 100ms lower p50. The CI artifact, the runtime alert, and the audit log should read off the same threshold table, which is the only sustainable way to avoid CI-green / production-red gaps over many releases.

Common mistakes

Most ASR incidents come from measuring clean examples while production audio carries messy user behavior and provider-specific edge cases.

  • Optimizing aggregate WER only. Slice by accent, microphone, channel, codec, noise, language, packet-loss, and call type. Average WER hides the users most likely to churn.
  • Treating ASR confidence as truth. Confidence is a model-internal calibration; verify against labeled transcripts and downstream task failure, not the provider’s own score.
  • Testing only on studio clips. Production audio has barge-ins, crosstalk, packet loss, music on hold, low-volume speakers, and code-switching. LibriSpeech tells you nothing about a Mumbai call-center recording.
  • Ignoring entity normalization. “Fifteen” vs “fifty,” “A-104” vs “8104,” and “twenty milligrams” vs “20 mg” can each break a tool call. Provider-specific normalization rules matter more than headline WER.
  • Evaluating ASR apart from the agent. A small transcription error can be harmless in chat and fatal when it fills a payment form. Always pair ASRAccuracy with downstream TaskCompletion.
  • Trusting silence handling on Whisper-class models. Whisper v3 still hallucinates training-set phrases over silence; either pre-filter silence or use CaptionHallucination as a runtime check.
  • One ASR provider for all routes. Per-route routing routinely beats a single provider; the gateway should treat ASR like an LLM with model fallback and traffic-mirroring.
  • Skipping a transcript with speech-to-speech models. GPT-4o-realtime and Gemini 3 audio tempt teams to drop transcripts entirely; you lose debugging, audit, and ASRAccuracy scoring in one move. Emit the transcript anyway, even when the LLM does not need it.
  • No coupling with downstream metrics. ASR is a means, not an end. Always join ASRAccuracy with TaskCompletion and evaluator outputs on the LLM turn so the dashboard tells you which transcript errors actually broke the agent.
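The silence pre-filter mentioned in the list above can be approximated with a simple energy gate. This is a crude sketch under stated assumptions: `speech_frames`, the 320-sample frame (20 ms at 16 kHz), and the RMS threshold for 16-bit PCM are illustrative choices, and a production system would more likely use a dedicated voice activity detector.

```python
import math


def speech_frames(samples, frame_len=320, rms_threshold=500.0):
    """Return (start_index, frame) pairs whose RMS energy meets the threshold."""
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= rms_threshold:
            kept.append((start, frame))
    return kept


# Two silent frames followed by one loud synthetic frame.
silence = [0] * 640
tone = [1000, -1000] * 160
kept = speech_frames(silence + tone)
print([start for start, _ in kept])
```

Only the frames that pass the gate are sent to the transcriber, which removes the silent stretches where phrase insertion tends to happen.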

Frequently Asked Questions

What is automatic speech recognition (ASR)?

Automatic speech recognition converts spoken audio into text for downstream voice AI systems, including voice agents, captions, search, and analytics. It is the first reliability boundary in a spoken LLM workflow because every later model step depends on the transcript.

How is ASR different from transcription accuracy?

ASR is the system or model that turns speech into text. Transcription accuracy is the measurement of how close that ASR output is to a reference transcript.

How do you measure ASR?

Use FutureAGI `ASRAccuracy` against reference transcripts, then track word error rate, insertion and deletion errors, and failure slices by accent, channel, device, and noise condition.