7 Best STT Providers for Voice AI Agents in 2026 (Tested + Ranked)

Ranked STT providers for voice AI 2026: WER, real-time latency, accent and jargon handling, the rubric that scores them all on your production audio.

February 12, 2026

Updated May 19, 2026

15 min read

If you’re building a voice AI agent in 2026, the speech-to-text provider you pick determines the ceiling on every other quality metric. A 12% word error rate cascades into intent classification failures, retrieval misses, and bad agent responses no LLM can recover from. The market consolidated around a handful of strong providers this year, and the differences between them are sharper than the marketing pages suggest. This piece ranks the seven that actually ship to production, with the WER ranges, latency floors, and accent profiles we see on real audio.

TL;DR: pick by exit reason

If your bottleneck is…	Pick	Why
Background noise and call-center audio	Deepgram Nova-3	Lowest real WER on telephony codecs and noisy environments
Multilingual + speaker diarization	AssemblyAI Universal-3 Pro	Broad language coverage, built-in diarization, strong English nuance
Open-weight self-host for compliance	Whisper large-v3	1.5B parameters, runs on a single GPU, MIT-licensed with a large open-source ecosystem
Sub-100ms first-partial latency	Cartesia Ink-Whisper	Streaming Whisper-class accuracy under the latency floor
European languages or regulated domain	Speechmatics Ursa	Domain models for medical, legal, broadcast; strong accent depth
Already on Azure / GCP / AWS	Google STT v2 (cheapest hosted), Azure Speech, AWS Transcribe	Commodity STT, broader coverage, lower per-axis accuracy
Score any of them on your audio	Future AGI ai-evaluation `audio_transcription` rubric	Apache 2.0 eval over WER, semantic similarity, entity preservation

The top five are detailed below. The hyperscalers (Azure, Google, AWS) get an honorable-mention section because they ship to production but rarely win an axis.

How we ranked them

Ranking STT providers on a single number is a category mistake, and WER alone is not enough for voice agents. We score on six axes and weight them by use case:

Word error rate on real audio. Not LibriSpeech. Real production audio with telephony codecs, background noise, and accents. Realistic ranges are 4-7% WER on clean audio, 8-14% on conversational audio, and 12-22% on noisy audio.
Real-time first-partial latency. The time from speech onset to the first partial transcript event. Streaming voice agents need this under 200ms.
Real-time time-to-final. The time from end-of-speech to the final transcript event. 200-400ms across the leaders.
Language coverage. Both raw language count and per-language quality.
Accent and dialect robustness. Indian English, Scottish, Singlish, Spanglish, Hinglish. Public benchmarks rarely cover these.
Domain vocabulary handling. Medical, legal, financial. Either custom-vocabulary boosting or fine-tunable.

Pricing is a tiebreaker, not a ranking axis. The cost gap between providers (roughly 2x at scale) is much smaller than the cost of a 5% WER swing once you account for the downstream impact on intent classification and agent quality.
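
A minimal sketch of that weighting, assuming each axis has already been normalized to a 0-1 score where higher is better. The axis keys, the `Candidate` class, and the example weights are illustrative and not taken from any provider or SDK; price enters only as a tiebreaker, as described above.

```python
from dataclasses import dataclass

# The six axes from the list above. Scores are assumed to be pre-normalized
# to [0, 1] with higher meaning better (so a low WER maps to a high score).
AXES = ("wer", "first_partial", "time_to_final",
        "languages", "accents", "domain_vocab")


@dataclass(frozen=True)
class Candidate:
    name: str              # hypothetical provider label
    scores: dict           # axis name -> normalized score in [0, 1]
    price_per_min: float   # used only to break ties


def weighted_score(candidate, weights):
    """Weighted mean of the per-axis scores."""
    total = sum(weights[a] for a in AXES)
    return sum(weights[a] * candidate.scores[a] for a in AXES) / total


def rank(candidates, weights):
    """Best score first; price decides only when scores match to 2 decimals."""
    return sorted(
        candidates,
        key=lambda c: (-round(weighted_score(c, weights), 2), c.price_per_min),
    )


# Illustrative weights for a call-center profile (made-up numbers).
CALL_CENTER_WEIGHTS = {
    "wer": 0.40, "first_partial": 0.20, "time_to_final": 0.10,
    "languages": 0.05, "accents": 0.15, "domain_vocab": 0.10,
}
```

Swapping the weight table is how the same measurements produce a different ranking for, say, a multilingual deployment versus a latency-critical one.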

Now the ranked picks.

1. Deepgram Nova-3

Deepgram Nova-3 is the WER leader on hard audio in 2026: call-center audio with the G.711 codec, background noise, agent and customer talking over each other, accents, and product jargon. On the kind of audio that breaks every other provider, Nova-3 holds a measured WER of 6-9% where competitors sit at 11-15% on the same files.

WER ranges (real production audio):

Clean read speech: 3-5%
Conversational telephony: 6-10%
Noisy call-center: 9-14%

Latency:

First-partial: 90-130ms P50, 180-220ms P95
Final after end-of-speech: 250-380ms P50
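
P50 and P95 figures like these are percentiles over many utterances, so they are only comparable when computed the same way. A minimal sketch, assuming you log a speech-onset timestamp and a first-partial timestamp (in seconds) per utterance; the function names and the nearest-rank percentile method are illustrative choices, not any provider's methodology.

```python
import math


def first_partial_latencies_ms(events):
    """events: iterable of (speech_onset_s, first_partial_s), one per utterance."""
    return [(partial - onset) * 1000.0 for onset, partial in events]


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(events):
    """Return the P50 and P95 first-partial latency in milliseconds."""
    latencies = first_partial_latencies_ms(events)
    return {"p50": percentile(latencies, 50), "p95": percentile(latencies, 95)}
```

Run the same summary per provider on the same utterances; P95 is usually the number that decides whether the agent feels conversational under load.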

Language coverage. Nova-3 supports 45+ languages, with the strongest performance in English, Spanish, French, German, Portuguese, and Japanese. It is not the broadest coverage, but per-language quality is consistently the strongest among the well-covered languages.

Accent robustness. Strong on Indian English, Filipino English, and African English. The Nova-3 training corpus includes diverse English accents at much higher proportions than the previous generation.

Domain vocabulary. Keyword boosting via the keywords parameter on the streaming API. Custom-vocabulary upload for batch mode. No full fine-tune surface on the hosted product, which keeps you dependent on Deepgram for vocabulary drift over time.

Best for. Production voice agents in call-center, support, sales, and any audio environment with background noise or accent diversity. The default pick for a US-based voice agent at scale.

Where it doesn’t win. Language coverage trails AssemblyAI and Whisper. Speaker diarization is functional but not as polished as AssemblyAI’s. Self-hosting is not an option, so compliance-regulated deployments hit a wall.

2. AssemblyAI Universal-3 Pro

AssemblyAI’s current lineup spans Universal-3 Pro, Universal-3 Pro Streaming, Universal-2, Universal-Streaming, and Whisper-Streaming. Universal-3 Pro Streaming is the production streaming flagship: broad multilingual coverage, speaker diarization in both streaming and batch, and strong English nuance covering disfluency handling, code-switching, slang, and sarcasm-adjacent prosody.

WER ranges (real production audio):

Clean read speech: 4-6%
Conversational English: 7-11%
Noisy environments: 11-16%

Latency:

First-partial: 100-150ms P50, 200-260ms P95
Final after end-of-speech: 280-420ms P50

Language coverage. Broad multilingual coverage with continuous improvement on the long tail (confirm exact language count for your tier in AssemblyAI’s pricing/docs page). Hindi, Tamil, Vietnamese, Indonesian, and Arabic variants are all production-grade. Code-switching is handled natively in the multilingual variant.

Accent robustness. Strong on all major English accents. Indian English performance is on par with Deepgram Nova-3.

Domain vocabulary. Custom-vocabulary endpoint with weighted terms. The custom-vocabulary surface is the most polished among hosted providers. Topic detection, summarization, sentiment, and PII redaction all ship as add-on layers on the batch product.

Best for. Multilingual deployments, applications that need speaker diarization (interview transcripts, podcast indexing, multi-party call analytics), and content-moderation pipelines where the built-in PII redaction matters.

Where it doesn’t win. WER on hard English audio with heavy background noise trails Deepgram. Real-time latency is 10-20ms behind Deepgram on the first-partial path. Cost per minute on streaming is higher.

3. OpenAI Whisper (large-v3, turbo)

Whisper is the open-weight baseline. The 1.5B-parameter large-v3 and 800M-parameter turbo variants ship under the MIT license and run on a single GPU. The ecosystem around Whisper is the largest in STT: whisper.cpp for CPU inference, faster-whisper for GPU optimization, WhisperX for word-level alignment, and a steady stream of fine-tuned domain variants on Hugging Face.

WER ranges (real production audio):

Clean read speech: 3-5%
Conversational English: 6-10%
Noisy environments: 10-16%

Latency. Whisper is fundamentally a batch model. Real-time inference exists via streaming wrappers (Whisper-Streaming, WhisperLive) but the underlying architecture is not causal. First-partial latency from streaming wrappers sits at 300-600ms, which is too slow for sub-500ms voice agents. Batch throughput on an H100 is 30-50x real-time for large-v3, 60-100x for turbo.

Language coverage. 99 languages, with the most balanced per-language quality of any STT model. The multilingual training corpus is what makes Whisper the strongest code-switching model in production.

Accent robustness. The most accent-robust model in 2026. Whisper’s training data is the most diverse of any STT corpus, and the accent gains show in published comparisons.

Domain vocabulary. Fine-tunable. Teams with 10-50 hours of labeled domain audio can fine-tune Whisper to outperform any hosted provider on that domain. The cost of fine-tuning amortizes quickly at scale.

Best for. Self-hosted deployments for compliance reasons. Multilingual batch transcription. Code-switching-heavy audio. Domain-specific fine-tuning where the volume justifies the labeling cost.

Where it doesn’t win. Real-time streaming. Whisper was not designed for low-latency partials and the streaming wrappers carry that constraint. Self-hosted operation also carries operational overhead that hosted providers absorb for you.

4. Cartesia Ink-Whisper

Cartesia Ink-Whisper is positioned as a streaming Whisper-class ASR option that preserves Whisper-class accuracy while targeting sub-100ms first-partial latency. The model is a distilled, causalized Whisper architecture, and production telemetry is now strong enough to consider it for serious deployments.

WER ranges (real production audio):

Clean read speech: 4-6%
Conversational English: 7-11%
Noisy environments: 11-15%

Latency:

First-partial: 70-95ms P50, 130-170ms P95
Final after end-of-speech: 180-280ms P50

Language coverage. Smaller than the established providers. English is production-grade. Spanish, French, German, Portuguese are usable. Long-tail languages are not yet covered at production quality.

Accent robustness. Inherits Whisper’s accent diversity strength. Indian English and other major accent variants perform on par with Whisper large-v3.

Domain vocabulary. Keyword boosting is available. Fine-tuning is not yet a documented option on the hosted product. The narrow vocabulary surface is the main gap relative to AssemblyAI and Deepgram.

Best for. Latency-critical voice agents where the audio profile is well-controlled. Real-time captioning. Voice search where the user has a typing alternative as a fallback. Use cases where shaving 30-50ms off first-partial latency is the difference between conversational and clunky.

Where it doesn’t win. Language coverage. Domain vocabulary depth. Battle-tested production telemetry: Ink-Whisper is newer than the established providers and the failure-mode catalog is still being built out.

5. Speechmatics Ursa

Speechmatics has the deepest accent and dialect coverage among the hosted providers. The Ursa model is offered in Standard and Enhanced tiers with custom vocabularies and customer-trained custom models, and covers 55+ languages with the strongest European-language quality of any provider.

WER ranges (real production audio):

Clean read speech: 4-7%
Conversational English: 8-12%
Noisy environments: 12-18%

Latency:

First-partial: 110-160ms P50, 210-280ms P95
Final after end-of-speech: 290-450ms P50

Language coverage. 55+ languages with the strongest European-language quality of any provider. Welsh, Catalan, Basque, Maltese, and Estonian are all production-grade where most providers ignore them.

Accent robustness. The accent leader. Scottish English, Welsh English, South African English, Australian English, New Zealand English all perform within 1-2% WER of standard English. No other provider matches that depth.

Domain vocabulary. Standard and Enhanced tiers, plus custom vocabularies and customer-trained custom models for verticals such as medical, legal, broadcast, and finance. Custom-vocabulary boosting layers on top of the base or custom model.

Best for. European deployments. Regulated industries (medical, legal) where the domain model matters more than the latency. Applications with heavy regional dialect exposure.

Where it doesn’t win. Pure latency. Cost per minute is among the highest. Streaming partials arrive 15-30ms after Deepgram and AssemblyAI on average.

Honorable mentions: the hyperscalers

The Azure, Google, and AWS STT services rarely win a ranked axis, but they ship to production at scale because of cloud lock-in, billing simplicity, and breadth of coverage.

Azure Speech. 100+ languages. Custom Speech for domain fine-tuning. Strong integration with the rest of Azure AI Foundry. WER on hard audio trails Deepgram and AssemblyAI by 2-4 percentage points. Real-time first-partial latency 180-250ms P50. Pick this if you’re already on Azure with no other constraint.

Google Cloud Speech-to-Text v2. 125 languages, the broadest coverage of any hosted provider. Domain-adapted models for telephony, video, and medical. WER on conversational audio sits in the 10-14% range, behind the specialists. Real-time first-partial latency 150-220ms P50. Pick this for cost-sensitive multilingual deployments.

AWS Transcribe. Competitive hosted option, though Google STT v2 currently undercuts Tier 1 per-minute pricing. Custom vocabulary and custom-language models. Speaker diarization built in. WER on hard audio is the weakest of the hyperscalers. Real-time first-partial latency 200-300ms P50. Pick this when AWS billing simplification is the deciding factor.

None of these are wrong choices. They’re just rarely the right choice when a specialist is available at comparable cost.

How to actually score providers on your audio

Public WER benchmarks lie to you. The numbers a provider quotes were measured on LibriSpeech (clean read speech) or CommonVoice (crowdsourced clean audio), neither of which looks like your production traffic. The only WER that matters is WER on your audio.
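
For reference, WER is the word-level edit distance between the hypothesis and the ground truth, divided by the number of reference words. A minimal sketch, assuming only lowercasing and whitespace tokenization; production scoring usually adds text normalization for punctuation, numerals, and contractions, which this deliberately leaves out.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A jargon split such as "metformin" transcribed as "met forming" counts as one substitution plus one insertion, which is why a single mangled drug name can move utterance-level WER so far.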

The pattern that works:

Sample 500-1000 utterances from your production traffic. Stratify by audio profile (clean, conversational, noisy), by caller cohort (native English, Indian English, Hispanic English, multilingual), and by call type (opening, middle, closing).
Generate ground-truth transcripts. Either by hand or by running Whisper large-v3 on the audio and manually correcting. The labeling effort is real but it’s a one-time cost amortized over every provider evaluation.
Run every candidate provider through the same 500-1000 utterances. Same audio in, transcripts out.
Score with the audio_transcription rubric. The audio_transcription rubric in Future AGI ai-evaluation computes WER-class transcript quality against ground truth. Pair it with custom rubrics for named-entity preservation, numeric preservation, and jargon recognition if you want per-axis attribution. The same rubric runs across all providers, giving you a comparable score per provider on every dimension.
Cluster failures by audio profile. Background-noise word drops cluster differently from accent drift, which clusters differently from jargon substitution. The cluster structure tells you which provider wins for each cohort of your audio.

The output is a decision matrix grounded in your data. Whichever provider wins on your audio is the right pick, independent of what won on the leaderboards.
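
One way to assemble that matrix, sketched under the assumption that each scored utterance has already been reduced to a word-error count and a reference word count. The row layout and function names are hypothetical. WER is pooled per cohort (total errors over total reference words) so that long utterances are not under-weighted.

```python
from collections import defaultdict


def decision_matrix(rows):
    """rows: iterable of (provider, cohort, word_errors, reference_words).

    Returns {cohort: {provider: pooled WER}}.
    """
    totals = defaultdict(lambda: [0, 0])
    for provider, cohort, errors, ref_words in rows:
        totals[(cohort, provider)][0] += errors
        totals[(cohort, provider)][1] += ref_words

    matrix = defaultdict(dict)
    for (cohort, provider), (errors, ref_words) in totals.items():
        if ref_words:  # skip cohorts with no reference words
            matrix[cohort][provider] = errors / ref_words
    return dict(matrix)


def winners(matrix):
    """Lowest pooled WER per cohort."""
    return {cohort: min(scores, key=scores.get) for cohort, scores in matrix.items()}
```

If different providers win different cohorts, that is a signal to route by audio profile or to weight the cohorts by their share of production traffic before picking one.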

For coverage your production traffic doesn’t yet exercise, Future AGI Simulation can generate STT stress sets: 18 pre-built personas plus unlimited custom, each with accent, age range, gender, location, communication style, conversation speed, background noise, and multilingual controls. Auto-generate branching scenarios (20/50/100 rows) and pipe the rendered audio through the same audio_transcription rubric.

Future AGI for STT eval at scale

ai-evaluation ships 70+ built-in eval templates in the Apache 2.0 SDK. The audio_transcription rubric scores STT output against ground truth on WER-class transcript quality. The audio_quality rubric does the same for TTS output. The translation_accuracy and cultural_sensitivity rubrics extend the eval to multilingual voice. Pair them with conversation_coherence and conversation_resolution for multi-turn voice flows. Custom evaluators authored by an in-product agent close the gap when a built-in rubric doesn’t quite fit. The MLLMAudio test case accepts 7 audio formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma) directly from URL or local path.

```python
from fi.testcases import MLLMTestCase, MLLMAudio
from fi.evals import Evaluator, AudioTranscriptionEvaluator

audio = MLLMAudio(url="path/to/call_recording.wav")
test_case = MLLMTestCase(
    input=audio,
    response="hypothesis transcript from your STT provider",
    expected_response="ground-truth transcript from your labeler",
    query="Score the transcript quality",
)

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[AudioTranscriptionEvaluator()],
    inputs=[test_case],
)
```

traceAI ships 30+ documented integrations across Python and TypeScript with OpenInference-compatible spans, including dedicated traceAI-pipecat and traceai-livekit packages. Instrumented spans capture STT provider, model, first-partial latency, final latency, and confidence as provider/custom span attributes. For Vapi and Retell, use FAGI’s native dashboard-driven voice observability (provider API key plus Assistant ID) instead of an SDK instrumentor. traceAI is Apache 2.0 licensed.

Error Feed auto-clusters STT failures into named issues with auto-written root cause, quick fix, and long-term recommendation. Background-noise word drops, accent-specific WER drift, jargon substitution, and codec degradation each become their own named cluster instead of drowning in raw spans. Zero-config: ingest spans and clusters emerge.

The MLLMAudio batch eval pattern runs the same audio_transcription rubric across hundreds of utterances in a single job. Plug your labeled set in, plug each candidate provider’s transcripts in, and the eval job returns the per-provider score sheet.

Future AGI Protect is built on Gemma 3n with LoRA-trained category-specific adapters per arXiv 2510.13351. Multi-modal across text, image, and audio. Rule-based Protect covers the 4 documented safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance); ProtectFlash is the single-call binary classifier that gives you the sub-100ms inline path when you can’t afford rule-based scan time. Inline guardrails scan transcripts for PII, prompt injection, and policy violations without breaking the voice latency budget.

Agent Command Center hosts the stack with RBAC, SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 all certified per the trust page. AWS Marketplace, multi-region hosted, BYOC for regulated workloads. Native voice observability for Vapi, Retell, and LiveKit via provider API key plus Assistant ID, no SDK required. Auto-captured call recordings, separate assistant and customer audio tracks, auto-transcripts, and the full eval engine on every call.

Where each provider genuinely wins

Calibrated honesty: every provider on this list owns at least one axis where they’re the right answer. The wedge per provider:

Deepgram Nova-3 wins WER on hard call-center audio with background noise, accents, and telephony codecs.
AssemblyAI Universal-3 Pro wins multilingual coverage, speaker diarization polish, and English-nuance handling.
Whisper large-v3 wins open-weight self-hostability, accent diversity, and code-switching.
Cartesia Ink-Whisper wins streaming latency with Whisper-class accuracy.
Speechmatics Ursa wins European-language coverage and regulated-industry domain models.
Azure Speech wins for shops already on Azure with no other constraint.
Google Speech wins for breadth of language coverage at commodity pricing.
AWS Transcribe wins when AWS billing simplification is the deciding factor and accuracy is secondary.

Future AGI is not in this comparison because Future AGI is the eval and observability layer that scores all of them. We ship the rubrics, the telemetry, and the cluster surface that lets you pick the right STT for your audio profile rather than guessing from a public benchmark.

Two deliberate tradeoffs on the FAGI STT stack

Async eval gating is explicit. agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) inside the Dataset UI and the Python library. Pick an optimizer, point at a dataset and an evaluator, run. FAGI never auto-rewrites a production prompt without an explicit run plus a human approval gate. The loop is deliberate by design.

Native voice observability ships for Vapi, Retell, and LiveKit out of the box; everything else flows through Enable Others mode via the traceAI SDK (dedicated traceAI-pipecat and traceai-livekit packages plus 30+ documented integrations) or a webhook. That covers more than 90% of production voice stacks; deeper custom-runtime work is a code-path engagement.

A worked recommendation

For most production voice agent teams in 2026, the pick is:

Primary real-time STT. Deepgram Nova-3 if your audio is US-English with call-center characteristics. AssemblyAI Universal-3 Pro if multilingual coverage or speaker diarization matters more than the last 1% of WER on noisy English.
Offline / batch STT. Same provider in batch mode (Deepgram or AssemblyAI batch is 40-60% of the cost of real-time on the same audio). Whisper large-v3 if you need to self-host for compliance.
Latency-critical use case. Cartesia Ink-Whisper for sub-100ms first-partial in clean conditions.
European or regulated domain. Speechmatics Ursa.
Eval and observability layer. Future AGI: traceAI for span capture, ai-evaluation for the audio_transcription rubric, Error Feed for failure-mode clustering, Agent Command Center for hosted dashboards.

The stack is provider-agnostic. The eval layer is what tells you when to swap providers. The right answer in six months will be different from the right answer today, and the only way to keep up is to run the rubrics continuously on your live audio.

Sources and references

Future AGI Protect: arXiv 2510.13351
GEPA Genetic-Pareto optimizer: arXiv 2507.19457
Meta-Prompt bilevel optimization: arXiv 2505.09666
Random Search baseline: arXiv 2311.09569
OpenInference span specification: github.com/Arize-ai/openinference
Future AGI trust and compliance: futureagi.com/trust
Deepgram Nova-3: deepgram.com vendor documentation
AssemblyAI Universal-3 Pro: assemblyai.com vendor documentation
OpenAI Whisper repository: openai-whisper GitHub
Speechmatics Ursa: speechmatics.com vendor documentation
Cartesia Ink-Whisper: cartesia.ai vendor documentation
WER computation reference: NIST SCLITE documentation

Frequently asked questions

What's the most accurate STT provider for voice AI agents in 2026?

Deepgram Nova-3 leads on word error rate for hard call-center audio with background noise, accents, and telephony codecs, with real WER ranges of 3-5% on clean read speech and 6-10% on conversational telephony. AssemblyAI Universal-3 Pro ties or wins on English nuance, code-switching, and speaker diarization. Whisper large-v3 is the open-weight baseline at 1.5B parameters with strong multilingual coverage. The right answer depends on your audio profile, so the only ranking that matters is the one you produce by scoring providers on your own labeled holdout set.

Which STT provider has the lowest real-time latency?

Cartesia Ink-Whisper is the latency leader in 2026, with first-partial transcripts arriving in under 100ms P50 in clean conditions while keeping Whisper-class accuracy. Deepgram Nova-3 sits at 90-130ms P50 first-partial, AssemblyAI Universal-3 Pro at 100-150ms, and Speechmatics Ursa at 110-160ms. Time-to-final, measured from end-of-speech to the final transcript event, sits in the 200-400ms range across the leaders. Pick on the audio profile and language coverage first, then on latency.

Should I use Deepgram, AssemblyAI, or Whisper for a production voice agent?

Deepgram Nova-3 for call-center audio with background noise and challenging accents. AssemblyAI Universal-3 Pro for multilingual deployments, speaker diarization, and English nuance like sarcasm and disfluency. Whisper large-v3 when you need to self-host for compliance, code-switching is heavy, or your audio is well-controlled. Many production stacks use Deepgram or AssemblyAI for the real-time path and Whisper for the offline analytics pass. Score all three on a 500-1000 utterance labeled sample of your production audio before committing.

How do I evaluate STT providers on my own audio?

Build a labeled holdout set of 500-1000 utterances representative of your production audio, with ground-truth transcripts. Run each candidate STT through the audio_transcription rubric in Future AGI ai-evaluation, which scores WER-class transcript quality against ground truth, and pair it with custom rubrics for named-entity preservation, numeric preservation, and jargon recognition. Cluster failures by audio profile (background noise, accent, jargon, cross-talk) using Error Feed. Pick the provider that wins on your audio, not the one that wins on a public leaderboard.

What about Azure, Google, and AWS for STT in 2026?

All three are commodity options. Google Cloud Speech-to-Text v2 covers 125 languages, the broadest coverage of any hosted STT, with the lowest Tier 1 per-minute pricing of the hyperscalers; WER trails Deepgram and AssemblyAI on hard audio. Azure Speech is the natural pick for shops on Azure with no other constraint. AWS Transcribe is competitive on price but trails Google on Tier 1 per-minute pricing and lags on accent robustness and real-time partial latency. Pick the hyperscaler STT when cloud lock-in is already a sunk cost, otherwise pick a specialist.

Why is real WER higher than the marketing benchmark numbers?

Public WER benchmarks usually run on academic test sets like LibriSpeech and CommonVoice, which are clean read speech. Real production audio includes telephony codecs (G.711, G.729, low-bitrate Opus), background noise, accents, cross-talk, disfluencies, and domain jargon. Real WER on production audio is typically 2-3x the marketing number. Plan for 4-7% on clean audio, 8-14% on conversational audio, and 12-22% on noisy audio. Score on your own data.

How does Future AGI fit into the STT stack?

Future AGI is the eval and observability layer that scores any STT provider you pick. traceAI captures STT provider, model, mode, and per-stage latency as span attributes via the dedicated traceAI-pipecat and traceai-livekit packages. ai-evaluation runs the audio_transcription rubric on a labeled holdout or on live production traffic. Error Feed auto-clusters mistranscriptions into named issues like background noise, accent drift, and jargon substitution. Future AGI Protect runs sub-100ms inline so safety scanning on transcripts fits inside the voice budget.
