
Best Speech-to-Text APIs in 2026: Benchmarks, Pricing, and a Developer Decision Guide for Choosing the Right STT Provider

Best STT APIs in May 2026: Deepgram Nova-3 + Flux, AssemblyAI Universal-2, Whisper, ElevenLabs Scribe v2 with WER, latency, and pricing compared.

stt voice-ai ai-evaluations deepgram assemblyai whisper elevenlabs speech-to-text 2026

Updated May 14, 2026. Deepgram shipped Flux for voice agents, NVIDIA Canary Qwen 2.5B took the Open ASR Leaderboard top spot, and ElevenLabs Scribe v2 Realtime hit 150ms across 30 languages. Here is the current state, the right pick by use case, and how to evaluate before production.


TL;DR: Best STT API by use case in May 2026

| Use case | Best pick | Why | Listed price |
|---|---|---|---|
| Voice agents (lowest E2S latency) | Deepgram Flux + Nova-3 | Sub-300ms streaming, fastest end-of-speech detection | $0.0077/min streaming |
| Lowest WER (open source) | NVIDIA Canary Qwen 2.5B | 5.63% WER, top of Open ASR Leaderboard, NVIDIA Open Model License | Free (GPU bill) |
| Lowest WER (hosted API) | Deepgram Nova-3 (batch) | 5.26% WER on real-world test set across 9 audio domains | $0.0043/min batch |
| Multilingual real-time | ElevenLabs Scribe v2 Realtime | 93.5% FLEURS across 30 languages at ~150ms | $0.22 to $0.48/hour |
| Transcript intelligence (sentiment, topics) | AssemblyAI Universal-2 + Slam-1 | Built-in sentiment, topic, entity, content moderation | ~$0.37/hour |
| Open-source self-host | OpenAI Whisper Large-v3 | 57+ languages, free, runs on any GPU host or laptop | Free (compute) |
| Independent benchmark accuracy | OpenAI GPT-4o Transcribe | ~8.9% WER on independent benchmarks (Artificial Analysis) | $6 per 1K min |
| On-prem enterprise | Speechmatics Enhanced | Full on-prem, 55+ languages, regulated industries | Custom |
| AWS-native | Amazon Transcribe | IAM, S3, Lambda, Amazon Connect, HIPAA medical | $0.024/min |
| GCP-native + widest languages | Google Cloud Speech-to-Text | 125+ languages with Chirp 3, medical and phone variants | $16 per 1K min |
| Azure + Custom Speech | Microsoft Azure Speech | Fine-tune on proprietary vocabulary, 100+ languages | $1/hour standard |

If you only read one row: Deepgram for voice agents, Whisper or NVIDIA NeMo for self-host, ElevenLabs Scribe v2 for multilingual real-time, AssemblyAI for transcript intelligence. Everything else is a tradeoff on those four.

Future AGI is not an STT vendor. It is the recommended evaluation, simulation, and observability companion. We come back to this at the end.

Why the gap between STT marketing claims and real production performance can break your voice AI

Speech-to-text (STT) is no longer just a transcription tool. It is the front door of every voice agent, real-time captioning system, and conversational AI product shipping today. If your STT layer drops words, adds latency, or chokes on accents, everything downstream breaks. The LLM gets garbage input. The user hears a delayed, confused response. And your product loses trust.

The problem? Many providers market best-in-class accuracy and low-latency streaming. Those numbers usually come from clean studio audio with native English speakers. Real production audio has background noise, international accents, and cellular compression. The gap between marketing claims and actual performance can be massive.

This guide cuts through the noise. We compare 10 leading speech-to-text providers using independent benchmark data, real pricing, and practical criteria that matter when you are building for production. Whether you are wiring up a voice agent over WebSocket or processing thousands of hours of call center recordings, this breakdown will help you pick the right STT API for your stack.

What Makes a Speech-to-Text API Provider Best: Latency, Accuracy, Language Coverage, Pricing, and Developer Experience

There is no single “best” STT provider. The right choice depends entirely on what you are building. A voice agent that needs sub-200ms streaming latency has different requirements than a medical transcription pipeline that prioritizes domain-specific accuracy.

Here is what actually matters when you are evaluating providers:

  • Latency profile: Does your app need real-time streaming via WebSocket, or is batch processing acceptable?
  • Accuracy on your data: A 5% WER on LibriSpeech benchmarks means nothing if your users have heavy accents and noisy environments.
  • Language and accent coverage: Supporting 100+ languages on paper is different from delivering low WER across all of them.
  • Pricing structure: What you actually pay per audio hour once you factor in your real volume, plus extras like diarization and redaction that quietly add up on the invoice.
  • Developer experience: How good the SDKs are, how deep the docs go, and how quickly you can get from a fresh API key to a working integration without fighting the tooling.

The vendor that wins on one axis often loses on another. Deepgram leads on voice-agent latency and end-of-speech detection. NVIDIA Canary Qwen 2.5B (open source) holds the top spot on the Hugging Face Open ASR Leaderboard at 5.63% WER. Deepgram Nova-3 leads hosted WER at 5.26% on its own real-world test set. ElevenLabs Scribe v2 Realtime leads multilingual real-time. AssemblyAI leads transcript intelligence. Open-source models like Whisper give you full control but require infrastructure. Your job is to find the right tradeoff for your specific use case.

Six Metrics That Actually Determine Production STT Performance

Before diving into provider comparisons, you need to know which numbers to trust and which to ignore. Here are the six metrics that separate production-grade STT from demo-grade STT.

Word Error Rate: How Dataset Choice Makes the Same Provider Look Great or Terrible Depending on Your Audio

WER counts substitutions, insertions, and deletions against a ground-truth transcript. Lower is better. But context matters enormously. A provider reporting 5% WER on clean LibriSpeech audio might hit 15-20% on noisy call center recordings. Always ask: what dataset was used, what normalization was applied, and does it match your production audio profile?
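To make the metric concrete, here is a minimal WER implementation as word-level edit distance. This is a sketch for intuition only; real evaluations normalize case, punctuation, and numbers first, so raw scores from this function will run higher than published benchmarks.

```python
# Minimal WER: edit distance over word tokens divided by reference length.
# Illustrative only; production evals normalize text before scoring.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One inserted word against a 4-word reference -> 25% WER.
print(wer("refill the prescription tomorrow",
          "refill the prescription for tomorrow"))  # 0.25
```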

Streaming Latency and Time to First Partial: Why Sub-300ms STT Is Essential for Voice Agent Round-Trip Budgets

For real-time voice agents, this is the most critical metric. It measures how quickly you receive the first transcribed word after audio is sent over a WebSocket connection. The target for voice-to-voice interactions is under 800ms total round-trip (STT + LLM + TTS combined), so your STT budget is roughly 150-300ms. Anything above 500ms makes conversations feel sluggish.
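The budget arithmetic is worth making explicit. A sketch, where the 800ms target comes from the paragraph above and the LLM/TTS component numbers are made-up measurements for illustration:

```python
# Round-trip budget check for a voice agent turn. The 800ms target is
# the voice-to-voice goal discussed above; component latencies below
# are illustrative placeholders.

ROUND_TRIP_BUDGET_MS = 800

def stt_budget_remaining(llm_ms: float, tts_ms: float) -> float:
    """Milliseconds left for STT after LLM and TTS are accounted for."""
    return ROUND_TRIP_BUDGET_MS - llm_ms - tts_ms

# 350ms LLM time-to-first-token + 200ms TTS time-to-first-audio leaves
# 250ms for STT first-partial latency: inside the 150-300ms window.
print(stt_budget_remaining(350, 200))  # 250
```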

Accuracy Under Real Conditions: How Noisy Audio and Accented Speech Reveal the True Gap Between Providers

Throw your messiest audio at it: recordings with background noise, speakers who aren’t native English, and jargon specific to your industry. A 95% accurate system on clean English becomes 85% on accented speech with background noise. Independent benchmarks from Artificial Analysis and the Hugging Face Open ASR Leaderboard give you a more honest picture than vendor-reported numbers.

Cost Per Audio Hour: How Volume, Diarization Fees, and Add-Ons Turn Small Rate Differences into Large Annual Costs

You can pay as little as $0.0043 per minute with Deepgram Nova-3 batch ($0.0077 per minute streaming) or as much as $0.016 per minute on Google Cloud’s Standard tier. At 5,000 audio hours per month (300,000 minutes), Deepgram batch runs about $1,290 per month and Google Standard runs about $4,800 per month, a gap of roughly $42,000 per year. And that is before you tack on charges for diarization, redaction, or jumping to a premium model tier.
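The arithmetic behind those figures, reproduced so you can swap in your own volume and rates:

```python
# Back-of-envelope reproduction of the cost gap above:
# 5,000 audio hours per month at the listed per-minute rates.
minutes_per_month = 5_000 * 60  # 300,000 minutes

deepgram_monthly = minutes_per_month * 0.0043  # Nova-3 batch rate
google_monthly = minutes_per_month * 0.016     # Google Cloud Standard rate

print(round(deepgram_monthly, 2))                          # 1290.0
print(round(google_monthly, 2))                            # 4800.0
print(round((google_monthly - deepgram_monthly) * 12, 2))  # 42120.0 per year
```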

Noise and Accent Handling: Why Studio Benchmark Accuracy Does Not Predict Real-World User Experience

Your users are not recording in a studio. They are on speakerphone in a car, calling from a loud office, or speaking with a regional accent. The provider that handles this diversity best will save you the most headaches in production.

Streaming Protocol Support: How WebSocket Implementation Quality Affects End-of-Speech Detection and LLM Handoff Speed

WebSocket-based streaming is the standard for real-time STT. Check whether the provider supports persistent connections, handles reconnection gracefully, and provides interim (partial) results alongside final transcripts. This matters a lot for voice agent STT architecture where you need to detect end-of-speech and start LLM processing as fast as possible.
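The interim-versus-final distinction drives the client logic. A sketch of the consumer side; the message shape (`is_final` and `text` fields) is illustrative, since Deepgram, AssemblyAI, and ElevenLabs each use their own field names:

```python
# Consuming a streaming STT feed: interim results only update the
# display, final results are committed and handed off to the LLM.

def consume(messages):
    finals = []
    interim = ""
    for msg in messages:
        if msg["is_final"]:
            finals.append(msg["text"])  # commit segment, trigger LLM handoff
            interim = ""
        else:
            interim = msg["text"]       # interims supersede each other
    return " ".join(finals), interim

stream = [
    {"is_final": False, "text": "refill my"},
    {"is_final": False, "text": "refill my pre"},
    {"is_final": True,  "text": "refill my prescription"},
    {"is_final": False, "text": "for next"},
]
print(consume(stream))  # ('refill my prescription', 'for next')
```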

Top 10 Speech-to-Text Providers in 2026: Features, Benchmarks, Best Fit, and Pricing Compared

Figure 1: Top Speech-to-Text Providers 2026

Deepgram Nova-3 and Flux: How 5.26 Percent WER and Sub-300ms Latency Lead for Voice Agent Use Cases

Deepgram builds speech recognition models from scratch, purpose-built for speed. Nova-3 cut word error rate by 54% against the closest competitor when Deepgram tested it on their own benchmark set covering nine different audio domains. The Flux model is a separate beast, built specifically to detect when a speaker stops talking as fast as possible, which is exactly what you need inside voice agent pipelines.

Key Features

  • Nova-3: 5.26% WER (batch) and 6.84% WER (streaming) on real-world datasets spanning medical, finance, and call center audio.
  • Flux model: Built from the ground up for voice agents, with the quickest end-of-speech detection you will find right now. Handles real-time multilingual transcription across 36+ languages and can deal with code-switching when speakers jump between languages mid-sentence.
  • WebSocket streaming, keyterm prompting, speaker diarization, and real-time PII redaction for up to 50 entity types.

Best Fit: Real-time voice agents, conversational AI, and high-volume streaming where latency is the top priority.

Pricing: $0.0043/min batch, $0.0077/min streaming (PAYG). Growth plan from $0.0036/min batch. $200 free credits to start.
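For a feel of the integration surface, here is a sketch that builds (but does not send) a Deepgram batch request. The `v1/listen` endpoint and the `model`, `diarize`, and `smart_format` query parameters reflect Deepgram's documented API at the time of writing; confirm against current docs before shipping.

```python
# Build a Deepgram batch transcription request (not executed here).
import urllib.parse

def build_deepgram_request(api_key: str, model: str = "nova-3"):
    params = urllib.parse.urlencode({
        "model": model,
        "diarize": "true",       # speaker diarization
        "smart_format": "true",  # punctuation and formatting
    })
    url = f"https://api.deepgram.com/v1/listen?{params}"
    headers = {"Authorization": f"Token {api_key}",
               "Content-Type": "audio/wav"}
    return url, headers

url, headers = build_deepgram_request("DG_KEY")
print(url)
# POST the raw audio bytes to `url` with `headers` using any HTTP client.
```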

AssemblyAI Universal-2 and Slam-1: How Built-In Sentiment Analysis and Content Moderation Go Beyond Transcription

AssemblyAI focuses heavily on accuracy and post-transcription intelligence. Their Universal-2 model hits around 14.5% WER on challenging mixed datasets and includes built-in speech intelligence features like sentiment analysis and PII detection. They also released Slam-1, a speech-language model, in late 2025.

Key Features

  • Universal-2 streaming model with 30% fewer hallucinations than Whisper Large-v3.
  • Built-in speech intelligence: sentiment analysis, topic detection, entity recognition, and content moderation.
  • Slam-1 is their speech-language model that goes beyond basic transcription and actually understands what is happening in the audio.
  • Covers 99+ languages total, with live multilingual streaming currently working across six of them.

Best Fit: Works well when you need more than just a raw transcript. Think meeting analytics where you want sentiment and topics pulled out automatically, compliance workflows that flag specific language, or media pipelines that need structured data from audio.

Pricing: Starts around $0.37/hour. Prices drop if you commit to higher volumes.
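A sketch of requesting those intelligence features through the `assemblyai` Python SDK (`pip install assemblyai`). Parameter names follow the SDK's `TranscriptionConfig`; treat this as a starting point and check the current SDK reference before relying on it.

```python
# Transcription plus speech intelligence via the AssemblyAI SDK.

def transcribe_with_intelligence(audio_url: str, api_key: str):
    import assemblyai as aai  # lazy import: only needed at call time

    aai.settings.api_key = api_key
    config = aai.TranscriptionConfig(
        sentiment_analysis=True,
        entity_detection=True,
        iab_categories=True,  # topic detection
    )
    transcript = aai.Transcriber().transcribe(audio_url, config)
    return transcript.text, transcript.sentiment_analysis

# transcribe_with_intelligence("https://example.com/call.mp3", "AAI_KEY")
```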

ElevenLabs Scribe v2 Realtime: How 150ms Latency and 93.5 Percent FLEURS Accuracy Lead in Multilingual Voice AI

ElevenLabs jumped into the STT space in early 2025 with Scribe v1, then rolled out Scribe v2 and the Realtime variant pretty quickly after that. On the FLEURS benchmark, Scribe v2 Realtime hits 93.5% accuracy across 30 languages while keeping latency under 150ms. That combination of speed and multilingual accuracy is hard to find anywhere else right now.

Key Features

  • Scribe v2 Realtime: 150ms latency with predictive next-word transcription across 90+ languages.
  • Up to 32-speaker diarization, audio event tagging (laughter, applause), and keyterm prompting.
  • Automatic language detection and mid-conversation language switching.
  • Covers SOC 2, HIPAA, and GDPR requirements, and you can keep data within the EU if your compliance team needs that.

Best Fit: Makes the most sense if you are already running ElevenLabs for text-to-speech and want one vendor handling both sides of the voice pipeline. Also a strong pick when multilingual accuracy is a hard requirement.

Pricing: Scribe v1 and v2 run between $0.22/hour on enterprise plans and $0.40/hour on smaller tiers. The Realtime version sits between $0.39/hour and $0.48/hour depending on your plan.

Google Cloud Speech-to-Text Chirp 3: How 125 Plus Language Coverage Serves Multilingual and GCP-Native Stacks

Google offers the widest language coverage in the STT market with 125+ languages across models like Chirp 2, Chirp 3, and specialized variants for medical and phone call audio. The sheer breadth makes it a default choice for multilingual products, though independent benchmarks show it often trails specialized providers in real-time accuracy.

Key Features

  • 125+ languages and locales with Chirp model family.
  • They offer specialized models tuned for medical audio, phone calls, and separate variants for short and long recordings.
  • There is also a built-in accuracy evaluation tool right in the console, so you can upload your own audio with ground-truth transcripts and get WER numbers without writing any code.
  • Plugs straight into the rest of GCP, including Vertex AI and other Google Cloud services.

Best Fit: Good choice if you are building a multilingual product, your infrastructure already lives on GCP, or your team has sunk real investment into Google Cloud tooling.

Pricing: Standard rate is $16.00 per 1,000 minutes. If you push past 2 million minutes a month, that drops to $4.00 per 1,000 minutes.

OpenAI Whisper and GPT-4o Transcribe: How 8.9 Percent WER and Open-Source Flexibility Serve Batch and Self-Hosted Needs

OpenAI gives you two options here. You can self-host the open-source Whisper models on your own infrastructure, or you can call the proprietary GPT-4o Transcribe API. GPT-4o Transcribe has posted some impressive benchmark numbers and actually came out with the lowest WER in a few independent tests. The tradeoff is price, since it costs noticeably more than most other providers.

Key Features

  • GPT-4o Transcribe: high accuracy with approximately 8.9% WER on Artificial Analysis benchmarks.
  • Whisper Large-v3: open-source, self-hostable, strong multilingual performance.
  • Good noise handling and accent coverage across 57+ languages.
  • Whisper available through third-party inference providers (Groq, Fireworks, Replicate) for faster/cheaper processing.

Best Fit: Solid pick for batch transcription jobs, self-hosted setups where you want full control over the model, or any situation where getting the words right matters more than getting them fast.

Pricing: GPT-4o Transcribe costs $6.00 per 1,000 minutes. Whisper via third-party hosts: varies ($0.50-$3.00/1,000 minutes depending on provider).
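Self-hosting Whisper is a few lines. A sketch using the open-source `openai-whisper` package (`pip install openai-whisper`; requires ffmpeg), with the API as documented in the package README; Large-v3 wants a GPU with roughly 10GB of VRAM to run comfortably.

```python
# Local Whisper transcription with the openai-whisper package.

def transcribe_local(audio_path: str, model_name: str = "large-v3") -> str:
    import whisper  # lazy import: only needed when actually transcribing

    model = whisper.load_model(model_name)   # downloads weights on first run
    result = model.transcribe(audio_path)
    return result["text"]

# transcribe_local("call-001.wav")
```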

Speechmatics Enhanced: How On-Premise Deployment and Accent Handling Serve Regulated Enterprise Environments

Speechmatics has been around in the enterprise STT space for a while, and they offer on-premise deployment if your data cannot leave your own servers. The Enhanced model squeezes out the best possible accuracy, while the Default model trades a bit of that for faster processing. They recently added their own TTS product too, so you can now run both speech-to-text and text-to-speech through a single vendor.

Key Features

  • Enhanced and Default models with strong accent and dialect handling across 55+ languages.
  • You can deploy on-premise or in a private cloud, which matters a lot if you work in a regulated industry where data cannot leave certain boundaries.
  • Supports real-time streaming with word-level timestamps and speaker diarization baked in. They carry the enterprise compliance certifications and data residency controls that procurement teams in finance and healthcare actually ask for.

Best Fit: Works best for enterprise teams dealing with strict data residency rules, regulated sectors like finance or healthcare, and anyone who needs speech-to-text running entirely on their own infrastructure.

Pricing: No public price list. You will need to talk to their sales team for a volume quote.

Amazon Transcribe: How Native AWS Integration and HIPAA-Eligible Medical Models Serve AWS-Native Architectures

If your whole stack already runs on AWS, Amazon Transcribe is the natural choice. It handles 100+ languages, has dedicated models for medical transcription and call center analytics, and hooks directly into S3, Lambda, and Amazon Connect without any extra glue code.

Key Features

  • Handles 100+ languages and can automatically figure out which language is being spoken.
  • Has specialized models for medical transcription that qualify as HIPAA eligible, plus a separate Call Analytics model for contact center use cases.
  • Connects natively to Lambda, S3, Amazon Connect, and SageMaker, so there is no extra wiring needed if you are already on AWS.
  • Custom vocabulary and custom language model support.

Best Fit: AWS-native architectures, call center analytics via Amazon Connect, and healthcare applications needing HIPAA-eligible transcription.

Pricing: $0.024/min streaming, $0.024/min batch. Volume discounts at higher tiers.
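Because Transcribe reads from S3 directly, a batch job is one boto3 call. A sketch with API names per the AWS Transcribe service; the bucket, job name, and speaker count are placeholders:

```python
# Start an Amazon Transcribe batch job with boto3.

def start_transcription(job_name: str, s3_uri: str, region: str = "us-east-1"):
    import boto3  # lazy import; requires AWS credentials configured

    client = boto3.client("transcribe", region_name=region)
    return client.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": s3_uri},
        MediaFormat="wav",
        LanguageCode="en-US",
        Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 2},
    )

# start_transcription("call-001", "s3://my-bucket/call-001.wav")
```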

Microsoft Azure Custom Speech: How Fine-Tuning on Proprietary Vocabulary Serves Enterprise Microsoft Ecosystems

Azure Speech Services stands out with its Custom Speech capability, letting you fine-tune models on your own data. This is a big advantage for domain-specific deployments where standard models underperform on specialized vocabulary.

Key Features

  • Custom Speech: fine-tune recognition models with your own audio and text data.
  • Real-time and batch transcription across 100+ languages.
  • Deep Microsoft ecosystem integration (Teams, Dynamics, Power Platform).
  • On-device deployment via Speech SDK for edge scenarios.

Best Fit: Enterprise Microsoft shops, products needing custom-trained models on proprietary vocabulary, and edge/on-device deployments.

Pricing: Standard: $1.00/hour. Custom models: $1.40/hour. Free tier: 5 hours/month.

NVIDIA NeMo Canary and Parakeet: How 5.63 Percent WER and 2000x Real-Time Speed Lead Open-Source STT in 2026

NVIDIA’s open-source ASR models are sitting at the top of the Hugging Face Open ASR Leaderboard right now. Canary Qwen 2.5B holds the number one spot with 5.63% WER, and it pairs a FastConformer encoder with a Qwen3-1.7B LLM decoder under the hood. If speed is what you care about, the Parakeet TDT models process audio at close to 2,000x real-time, which makes them the fastest open-source option out there by a wide margin.

Key Features

  • Canary Qwen 2.5B: lowest WER on Open ASR Leaderboard (5.63%), dual transcription + LLM analysis mode.
  • Parakeet TDT 1.1B: extreme inference speed (2,000x real-time), ideal for live captioning.
  • Trained on 65,000+ hours of diverse audio data.
  • Requires NVIDIA NeMo toolkit. Parakeet TDT ships under Apache 2.0. Canary Qwen 2.5B ships under a permissive NVIDIA Open Model License with terms similar to Apache 2.0 for most production use; verify license terms for your specific commercial use case before adoption.

Best Fit: Teams with GPU infrastructure who want full control, self-hosted batch processing at scale, and research or prototyping on the latest open architectures.

Pricing: Free (open source). Infrastructure costs depend on your GPU setup.
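Running Canary through NeMo looks roughly like the following sketch (`pip install "nemo_toolkit[asr]"`). The Hugging Face model id is an assumption based on the release naming, and the `transcribe()` signature should be verified against current NeMo docs before use.

```python
# Self-hosted batch transcription with NVIDIA NeMo (sketch).

def transcribe_canary(paths: list):
    import nemo.collections.asr as nemo_asr  # lazy import; GPU expected in practice

    # Model id is an assumption; verify on Hugging Face before use.
    model = nemo_asr.models.ASRModel.from_pretrained("nvidia/canary-qwen-2.5b")
    return model.transcribe(paths)

# transcribe_canary(["call-001.wav"])
```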

Gladia Solaria-1: How Focused Developer Ergonomics and Competitive Pricing Serve Startups and Mid-Size Teams

Gladia has gained attention with their Solaria-1 model, which appears on the Artificial Analysis leaderboard alongside major providers. They position themselves as an STT-first API company with a focus on developer experience and straightforward pricing.

Key Features

  • Solaria-1 model with competitive WER on independent benchmarks.
  • Real-time and batch transcription with speaker diarization.
  • Simple REST API with clean documentation and quick onboarding.
  • Audio intelligence features including summarization and sentiment analysis.

Best Fit: Startups and mid-size teams looking for a focused STT API with good developer ergonomics and competitive pricing.

Pricing: Usage-based pricing. Check gladia.io for current rates.

Comprehensive Speech-to-Text Provider Comparison: WER, Latency, Languages, and Pricing Side by Side

| Provider | Top Model | WER (Approx.) | Streaming Latency | Languages | Price (per 1K min) |
|---|---|---|---|---|---|
| Deepgram | Nova-3 / Flux | 5.26% (batch) | Sub-300ms | 36+ | $4.30 (PAYG) |
| AssemblyAI | Universal-2 / Slam-1 | ~14.5%* | ~760ms delta | 99+ | ~$6.20 |
| ElevenLabs | Scribe v2 RT | 3.3% (EN) | ~150ms | 90+ | ~$6.50-$8.00 (RT) |
| Google Cloud | Chirp 3 | ~11.6%* | Variable | 125+ | $16.00 (std) |
| OpenAI | GPT-4o Transcribe | ~8.9% | Not real-time optimized | 57+ | $6.00 |
| Speechmatics | Enhanced | Competitive | ~730ms delta | 55+ | Custom |
| Amazon | Transcribe Std | Moderate | Moderate | 100+ | $14.40 |
| Azure Speech | Custom Speech | Variable | Moderate | 100+ | ~$16.67 (std) |
| NVIDIA NeMo | Canary Qwen 2.5B | 5.63% | Self-hosted | EN (primary) | Free (OSS) |
| Gladia | Solaria-1 | Competitive | Real-time capable | 100+ | Usage-based |

Table 1: Speech-to-Text Provider Comparison

* WER figures vary significantly by dataset and methodology. Numbers marked with * are from mixed/real-world benchmarks. Always test with your own audio.

| Use Case | Recommended Provider(s) | Why |
|---|---|---|
| Voice Agents (real-time) | Deepgram Flux, ElevenLabs Scribe v2 RT | Lowest end-of-speech latency, WebSocket streaming |
| Call Center Analytics | AssemblyAI, Amazon Transcribe | Built-in intelligence, compliance features |
| Multilingual Products | Google Cloud, ElevenLabs Scribe | Broadest language coverage |
| Self-Hosted / On-Prem | NVIDIA NeMo, Whisper, Speechmatics | Full data control, no vendor lock-in |
| Medical Transcription | Deepgram Nova-3 Medical, Azure Custom | Domain-specific models, HIPAA compliance |
| High-Volume Batch | Deepgram (batch), OpenAI Whisper | Best price-performance at scale |
Table 2: Speech-to-Text provider use cases

How to Choose an STT Provider: A Four-Step Decision Framework for Production Voice AI Teams

Picking an STT provider is not about finding the best benchmarks. It is about matching the right provider to your constraints. Here is a practical framework you can walk through before committing.

Step 1: Determine Whether Your Application Needs Real-Time Streaming or Batch Processing

Start with a simple question: does your app actually need real-time streaming, or is batch processing good enough? If you are building a voice agent or live captioning system, streaming is not optional. You need WebSocket support that delivers the first partial transcript in under 300ms. But if you are transcribing recorded calls or podcast episodes after the fact, batch mode will cost you less and usually gives you better accuracy too.

Step 2: Test with Your Own Audio Samples Including Edge Cases Before Committing to Any Provider

Never trust vendor benchmarks alone. Collect 50-100 representative audio samples from your production environment. Include edge cases: noisy backgrounds, accented speakers, domain-specific terminology. Run these through 2-3 providers and compare WER, latency, and formatting quality side by side.
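A minimal scoring harness for such a bake-off might look like this. It uses a character-level similarity ratio from the standard library as a stand-in metric; swap in a proper WER implementation for real evaluations.

```python
# Side-by-side scoring for a provider bake-off (stand-in metric).
import difflib

def score(transcripts: dict, ground_truth: str) -> dict:
    return {
        vendor: round(difflib.SequenceMatcher(
            None, text.lower(), ground_truth.lower()).ratio(), 3)
        for vendor, text in transcripts.items()
    }

results = score(
    {"vendor_a": "refill the prescription tomorrow",
     "vendor_b": "refill the subscription tomorrow"},
    "refill the prescription tomorrow",
)
print(results)  # vendor_a scores 1.0; vendor_b scores lower
```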

Step 3: Calculate Total Cost Including Diarization, Redaction, Infrastructure, and Engineering Hours

Do not just look at the per-minute rate and call it a day. You need to account for diarization fees, redaction add-ons, the infrastructure bill if you are self-hosting, and how many engineering hours it takes to get everything wired up. Sometimes paying a bit more for an API that ships with solid SDKs and clear documentation saves your team weeks of work compared to a cheaper option with rough tooling.

Step 4: Build Multi-Provider Failover Architecture and Continuous Monitoring into Your STT Stack

Production voice systems should not depend on a single STT provider. Build your architecture to support switching providers. Route a percentage of traffic to an alternative, monitor accuracy and latency continuously, and have a failover path if your primary provider degrades. This is where observability becomes critical.
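The routing decision itself can be simple. A sketch of threshold-based failover; the thresholds and provider names are illustrative, and real systems would compute the metrics from rolling production telemetry:

```python
# Threshold-based STT failover: route away from the primary provider
# when its rolling health metrics breach the limits below.

THRESHOLDS = {"p95_latency_ms": 500, "error_rate": 0.02}

def pick_provider(metrics: dict, primary: str, fallback: str) -> str:
    m = metrics[primary]
    if (m["p95_latency_ms"] > THRESHOLDS["p95_latency_ms"]
            or m["error_rate"] > THRESHOLDS["error_rate"]):
        return fallback
    return primary

metrics = {
    "deepgram":   {"p95_latency_ms": 620, "error_rate": 0.010},  # latency breach
    "assemblyai": {"p95_latency_ms": 300, "error_rate": 0.005},  # healthy
}
print(pick_provider(metrics, primary="deepgram", fallback="assemblyai"))
# -> assemblyai
```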

How Future AGI helps you ship and maintain STT in production (companion, not an STT vendor)

Future AGI does not sell a speech-to-text model. The vendor list above is the right place to look for the STT engine itself. What Future AGI provides is the evaluation, simulation, and observability layer that pairs with whichever STT provider you pick.

Choosing an STT provider is step one. Keeping it working in production is the harder problem. Models update without warning. Latency spikes during peak hours. Accuracy drifts on certain accent groups. These are the issues that vendor dashboards do not surface.

Future AGI is built for exactly that gap. Here is how it fits into the STT workflow:

  • A/B test your STT layer. Future AGI Simulate lets you compare STT providers (Deepgram, AssemblyAI, ElevenLabs, Whisper, NVIDIA NeMo, others) side by side on identical audio, measuring transcription accuracy, time-to-final latency, diarization quality, punctuation, and domain-term accuracy in a controlled environment.
  • Audio-level evaluation. Most testing tools only look at transcripts. Future AGI evaluates the audio pipeline end-to-end, catching latency spikes, diarization errors, transcription drift on domain terms, and accuracy regressions that text-only diff tools miss.
  • Production observability with traceAI. traceAI is Apache 2.0, OpenTelemetry-based, and supports Python, TypeScript, Java, and C#. Span-level instrumentation of every STT, LLM, and TTS call. Detect P95 latency regressions, WER drift, and provider-level anomalies before users notice.
  • Simulate at scale. Run thousands of synthetic conversations with diverse accents, background noise profiles, and edge cases before a real user touches the system. Future AGI supports 50+ languages with customizable personas.
  • Continuous regression protection. When STT providers update their models, Future AGI reruns your evaluation suite and flags degradation. The common failure shape: a sub-second P95 latency that quietly drifts above one second after a model update, surfacing in your eval suite before the customer dashboard.
```python
# Compare STT vendors on identical audio with the FAGI eval SDK.
# Run a domain reproduction before flipping any production traffic.
# Requires FI_API_KEY and FI_SECRET_KEY already set in your environment.
# Snippet shows the eval surface only. For span-level tracing, also add:
#   from fi_instrumentation import register, FITracer

from fi.opt.base import Evaluator
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

# Replace with your own loader. Each sample needs: audio bytes + ground_truth string.
samples = [
    {"id": "call-001", "audio": b"<bytes>", "ground_truth": "Hello, this is a test."},
    {"id": "call-002", "audio": b"<bytes>", "ground_truth": "Refill on May 27."},
]

def transcribe(vendor: str, audio_bytes: bytes) -> str:
    # Stand-in: call Deepgram, AssemblyAI, ElevenLabs, Whisper, etc.
    return "..."

provider = LiteLLMProvider()

judge_config = {
    "name": "stt_quality_judge",
    "grading_criteria": (
        "Compare the transcript to the ground truth. Score 0-5 on: "
        "(1) word accuracy, (2) punctuation, (3) speaker labels, "
        "(4) domain terminology."
    ),
}
judge = CustomLLMJudge(provider, config=judge_config)
evaluator = Evaluator(metric=judge)

for sample in samples:
    for vendor in ["deepgram", "assemblyai", "elevenlabs", "whisper"]:
        transcript = transcribe(vendor, sample["audio"])
        score = evaluator.evaluate(
            {"transcript": transcript, "ground_truth": sample["ground_truth"]}
        )
        print(vendor, sample["id"], score)
```
The judge model sets the speed-fidelity tradeoff: turing_flash returns in roughly 1 to 2 seconds, while turing_small (2 to 3 seconds) and turing_large (3 to 5 seconds) deliver higher-fidelity scoring for safety-critical workloads.

The bottom line: Future AGI does not replace your STT provider. It makes sure whichever provider you choose keeps performing the way you expect, every day, at scale.

How to pick the right STT API for production in 2026

The STT market in 2026 is more competitive than it has ever been. Deepgram leads on latency and cost-efficiency. ElevenLabs Scribe v2 delivers remarkable multilingual accuracy. AssemblyAI bundles rich transcript intelligence. Open-source models from NVIDIA are closing the accuracy gap fast. The hyperscalers (Google, AWS, Azure) remain strong choices for teams already in those ecosystems.

But your production performance will not match any benchmark. Your audio is unique. Your users are unique. The only way to find the right provider is to test with your own data, monitor continuously, and build the infrastructure to switch when things change. Start with 2 to 3 providers, run controlled tests with Future AGI Simulate, and let real data drive the decision.


Frequently asked questions

What is the best speech-to-text API in May 2026?
There is no single best STT provider. Pick by use case. Deepgram Nova-3 + Flux leads voice-agent latency and end-of-speech detection. NVIDIA Canary Qwen 2.5B leads Open ASR Leaderboard WER at 5.63%. OpenAI GPT-4o Transcribe leads accuracy on independent benchmarks at roughly 8.9% WER. ElevenLabs Scribe v2 Realtime leads multilingual real-time accuracy at 93.5% on FLEURS across 30 languages. AssemblyAI leads transcript intelligence (sentiment, topic, entity). Google, AWS, and Azure are the right picks when ecosystem fit dominates the decision.
Which STT API has the lowest latency for voice agents?
Deepgram Flux is purpose-built for voice agents and posts the lowest end-of-speech detection latency in May 2026. ElevenLabs Scribe v2 Realtime hits roughly 150ms first-partial latency across 90+ languages. AssemblyAI streaming sits around 760ms time-to-final on mixed datasets. For voice-to-voice round-trip budgets under 800ms, your STT budget is 150 to 300ms, which makes Deepgram and ElevenLabs the two front-runners and rules out batch-mode providers like the base GPT-4o Transcribe.
Which STT API has the lowest WER in 2026?
WER results vary heavily by dataset, normalization, and audio domain, so absolute leaders depend on the benchmark. On the Hugging Face Open ASR Leaderboard, NVIDIA Canary Qwen 2.5B holds the top spot at 5.63% WER. Deepgram reports 5.26% WER for Nova-3 on its own real-world test set spanning medical, finance, and call-center audio. On Artificial Analysis independent benchmarks, OpenAI GPT-4o Transcribe posts roughly 8.9% WER. ElevenLabs Scribe v2 reports roughly 3.3% WER on internal English-only evaluations, which is a different test distribution than Open ASR Leaderboard. Always run a domain reproduction on your own audio.
What does an STT API cost in May 2026?
Listed prices range from free open source to roughly $16 per 1,000 minutes. Deepgram Nova-3 at $0.0043 per minute batch and $0.0077 per minute streaming. AssemblyAI starts at roughly $0.37 per hour. ElevenLabs Scribe runs $0.22 to $0.48 per hour depending on plan. OpenAI GPT-4o Transcribe at $6 per 1,000 minutes. Google Cloud Standard at $16 per 1,000 minutes with discounts at high volume. AWS Transcribe at $0.024 per minute. Azure Standard at $1 per hour. NVIDIA NeMo is free open source plus your GPU bill.
Does Future AGI provide speech-to-text?
No. Future AGI does not sell a speech-to-text model. Future AGI is the recommended evaluation, simulation, and observability layer that pairs with the STT vendor you pick. The platform runs A/B tests across Deepgram, AssemblyAI, ElevenLabs, Whisper, NVIDIA NeMo, and others on identical audio, scores WER and latency side by side, and instruments production traces with traceAI (Apache 2.0). Pick an STT vendor for the speech model, pair Future AGI for evaluation and observability.
Which STT API supports on-premise deployment?
Three categories of providers in May 2026 ship on-premise. Speechmatics offers full on-premise and private-cloud deployment with data residency controls. NVIDIA NeMo (Canary Qwen, Parakeet TDT) is fully self-hostable as open source on your GPU infrastructure. OpenAI Whisper is open source, self-hostable, and available via third-party inference hosts (Groq, Fireworks, Replicate). Deepgram supports VPC and self-hosted runtime for the Enterprise tier. The cloud-native hyperscalers (Google, AWS, Azure) do not offer true on-prem, but Azure ships Speech containers for Kubernetes that approximate it.
How should I evaluate an STT API before production?
Run a domain reproduction. Collect 50 to 200 representative audio samples from your production environment, including noisy backgrounds, accented speakers, and domain terminology. Run each sample through your 2 to 3 candidate providers and measure WER, time-to-final, time-to-first-partial, diarization accuracy, and pronunciation on domain-specific terms. Score with Future AGI's audio-level evaluators, instrument with traceAI for span-level visibility, and stress-test with Future AGI Simulate. Gate production on your domain numbers, not vendor benchmarks.
Whisper versus Deepgram: which one should I use?
Pick Deepgram if latency, end-of-speech detection, or hosted reliability matters. Nova-3 + Flux is the May 2026 voice-agent default. Pick Whisper if cost, self-hosting, or full control over the model matters. Whisper Large-v3 plus a third-party host (Groq, Fireworks, Replicate) or your own GPU is cheaper at very high volume, especially for batch transcription. For multilingual production, Whisper covers 57+ languages but Scribe v2 Realtime is more accurate on 30 languages. For most voice agents in 2026 the right answer is Deepgram in the hot path with Whisper as a batch fallback.