Real-Time STT vs Offline STT: A 2026 Decision Guide for Voice AI
Real-time STT vs offline STT in 2026: latency, WER, cost, accent robustness, and the eval rubric that scores both at scale. A decision matrix for voice AI.
The single biggest infrastructure decision in any voice AI stack is what speech-to-text engine you use, and whether you use one or two. Real-time STT and offline STT are not the same product. They share model families and sometimes share APIs, but they have different latency budgets, different WER profiles, different cost structures, and different failure modes. This guide is the 2026 decision matrix: when to pick which, how to evaluate both systematically, and how to instrument both for production.
TL;DR: when to pick which
| | Real-time STT | Offline STT |
|---|---|---|
| Latency floor | 100-300ms first-partial | 0.1-0.3x real-time for completed file |
| WER (clean audio) | 4-7% | 2-4% |
| WER (hard audio) | 8-15% | 5-10% |
| Cost per minute | $0.008-0.020 | $0.003-0.010 |
| Use cases | Live voice agents, IVR, captions | Call analytics, transcripts, search |
| Best fit providers | Deepgram Nova-3, AssemblyAI Universal-1, Cartesia Ink-Whisper | Whisper large-v3, Deepgram Nova-3 batch, AssemblyAI offline |
Most production voice agent stacks use both. Real-time STT for the live conversation. Offline STT for the post-call analytics. The two layers serve different consumers and don’t share a hot path.
What “real-time” actually means in STT
Real-time STT (streaming) processes audio as it arrives. The audio stream is split into chunks (typically 100ms each), and each chunk is fed to a model that produces a per-chunk transcript update. The provider exposes two callback types: partial and final. Partials arrive every 200-500ms while the user is speaking. The final arrives 400-1000ms after the user stops speaking.
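As a rough illustration of the chunking step, the sketch below splits raw audio into 100ms chunks before it would be sent to a streaming endpoint. The audio format (16 kHz, mono, 16-bit PCM) and the `chunk_pcm` helper are assumptions made for this example; providers accept different formats and chunk sizes.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz mono audio
BYTES_PER_SAMPLE = 2     # assumption: 16-bit PCM
CHUNK_MS = 100           # the typical chunk size mentioned above


def chunk_pcm(audio: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield fixed-duration chunks of raw PCM audio; the last chunk may be shorter."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


# One second of silence splits into ten chunks of 3,200 bytes each.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
chunks = list(chunk_pcm(one_second))
print(len(chunks), len(chunks[0]))
```

In a live pipeline each chunk would be written to the provider's streaming connection as soon as it is captured, and partial and final transcripts would arrive on the callbacks described above.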
The model architecture matters. Streaming STT models are typically:
- Causal (the model only looks at past audio, not future audio). This is what enables low-latency partial transcripts.
- Trained for streaming with specific objectives like CTC plus attention, or transducer-based decoders.
- Smaller than offline counterparts. Streaming models are typically 60-300M parameters; offline models can be 1.5B+ parameters.
The constraints of streaming (no future context, lower parameter count, partial output) cost some accuracy. The trade is acceptable for voice agents because the user is still talking and corrections can happen at the final transcript step.
What “offline” means in STT
Offline STT (batch) processes a completed audio file. The audio is loaded in full and the model produces a single high-accuracy transcript.
Three properties make offline STT more accurate:
Bidirectional context. The model sees the full utterance when it predicts each word. A sentence ending that disambiguates an earlier word is available to the model. Streaming models can’t do this.
Larger models. Whisper large-v3 is 1.5B parameters. Whisper turbo is 800M. Streaming models are 60-300M. The larger models capture more linguistic patterns.
Longer chunks. Offline models can process 30-second chunks with full context. Streaming models work in 100-500ms windows. The longer chunk lets the model use prosodic cues across the utterance.
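The difference between causal and bidirectional context can be pictured as a visibility mask over audio frames. The sketch below is a simplification for illustration only: the function names are hypothetical, and real streaming models often allow a small look-ahead window rather than a strictly causal mask.

```python
from typing import List


def causal_mask(n_frames: int) -> List[List[bool]]:
    """Frame i may use frame j only if j is not in the future (j <= i)."""
    return [[j <= i for j in range(n_frames)] for i in range(n_frames)]


def bidirectional_mask(n_frames: int) -> List[List[bool]]:
    """Every frame may use every other frame, past or future."""
    return [[True] * n_frames for _ in range(n_frames)]


def show(mask: List[List[bool]]) -> str:
    """Render a mask with 'x' for visible frames and '.' for hidden ones."""
    return "\n".join("".join("x" if cell else "." for cell in row) for row in mask)


print("streaming (causal):")
print(show(causal_mask(4)))
print("offline (bidirectional):")
print(show(bidirectional_mask(4)))
```

The first frame of a streaming model sees only itself, while the first frame of an offline model sees the whole utterance. That is why a sentence ending can correct an earlier word only in the offline pass.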
The cost of offline STT is latency. A 60-second call returns in 6-20 seconds depending on the provider. For real-time voice this is unusable. For post-call analytics it’s fine.
The decision matrix
Six axes determine which engine you pick.
Axis 1: latency budget
- Sub-500ms voice agent. Real-time STT, full stop. The latency budget allows no slack.
- 5-30 second response window (e.g., dictation that completes when the user pauses, voice search). Real-time STT for the partial signal during input. Offline STT can score the final transcript if the user is willing to wait 1-2 seconds.
- Post-call analytics. Offline STT. Latency irrelevant; accuracy paramount.
- Live captions. Real-time STT with low-latency partial output. WER tradeoffs accepted.
Axis 2: accuracy requirement
- Voice agent intent classification. Real-time STT is sufficient. WER of 5-10% is workable because the LLM downstream is forgiving of partial errors.
- Legal transcript or medical dictation. Offline STT. Accuracy is the product; the user is paying for low WER.
- Customer call quality scoring. Offline STT for analytics, real-time STT for live alerts during the call.
- Voice search query. Real-time STT with re-rank from offline pass. The two-pass approach trades some latency for accuracy gain.
Axis 3: cost
Real-time STT is more expensive per minute. Three reasons: streaming requires persistent connections, the per-token cost is higher for partial decoding, and many providers charge a real-time premium.
Typical 2026 pricing per audio minute:
| Provider | Real-time | Offline (batch) |
|---|---|---|
| Deepgram Nova-3 | $0.0070 | $0.0043 |
| AssemblyAI Universal-1 | $0.0150 | $0.0050 |
| OpenAI Whisper API | N/A (no streaming) | $0.0060 |
| Speechmatics Ursa | $0.0125 | $0.0090 |
| Cartesia Ink-Whisper | $0.0080 | N/A |
| Self-hosted Whisper | $0.001-0.003 (compute only) | $0.0005-0.002 (compute only) |
The cost gap matters at scale. At 1M minutes per month, Deepgram costs $7K for real-time and $4.3K for batch, a gap of $2.7K/month. Running both on the same audio (real-time for the live call, batch for offline analytics) costs $11.3K/month.
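The arithmetic above is easy to reproduce for your own volume. The sketch below uses the per-minute prices from the table; the dictionary keys and the `monthly_cost` helper are names made up for this example, and the prices should be replaced with your own contract rates.

```python
# Per-minute prices copied from the table above (USD).
PRICE_PER_MIN = {
    ("deepgram-nova-3", "realtime"): 0.0070,
    ("deepgram-nova-3", "batch"): 0.0043,
    ("assemblyai-universal-1", "realtime"): 0.0150,
    ("assemblyai-universal-1", "batch"): 0.0050,
}


def monthly_cost(provider: str, mode: str, minutes: int) -> float:
    """Monthly STT spend in USD for one provider and mode."""
    return PRICE_PER_MIN[(provider, mode)] * minutes


minutes = 1_000_000
realtime = monthly_cost("deepgram-nova-3", "realtime", minutes)
batch = monthly_cost("deepgram-nova-3", "batch", minutes)

print(f"real-time: ${realtime:,.0f}")
print(f"batch:     ${batch:,.0f}")
print(f"two-pass:  ${realtime + batch:,.0f}")  # both engines on the same audio
```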
Axis 4: language coverage
- English-first deployment. Most providers are strong. Pick on latency and cost.
- Multilingual deployment. AssemblyAI Universal-1 covers 99 languages. Whisper large-v3 covers 99. Deepgram Nova-3 covers 36. Google Cloud Speech covers 125.
- Code-switching (Spanglish, Hinglish). Whisper handles code-switching better than most streaming providers. For real-time agents in mixed-language markets, test code-switching specifically.
- Low-resource languages. Whisper variants and AssemblyAI cover more long-tail languages. Test the specific language; provider coverage doesn’t equal provider quality on every language.
Axis 5: accent robustness
- American English speakers. Every major provider works. Pick on latency.
- Indian English speakers. Deepgram Nova-3 and AssemblyAI Universal-1 do well. Whisper large-v3 has the edge on accent diversity. Older streaming models (early Whisper variants, older Deepgram models) struggle more.
- Heavy regional dialects (Scottish, Cajun, Singlish). Test on labeled data. No provider guarantees this; benchmarks rarely include strong dialect coverage. Whisper variants tend to lead in published comparisons.
Axis 6: domain vocabulary
- General conversation. Any provider.
- Medical, legal, financial vocabulary. AssemblyAI offers a custom-vocabulary endpoint. Speechmatics has industry models. Deepgram supports keyword boosting. Self-hosted Whisper variants can be fine-tuned on domain audio for the strongest accuracy. The cost of fine-tuning offsets the per-call savings at high enough volume.
Provider deep-dive: real-time STT in 2026
Five providers dominate real-time STT in 2026. The differences are in WER on hard audio, language coverage, and latency.
Deepgram Nova-3
The leader on WER for hard audio (background noise, accents, jargon). First-partial latency 90-130ms P50, 180-220ms P95. 36 languages supported. Strong on telephony codecs. Custom-vocabulary boosting. Best for support and call-center deployments where the audio profile is challenging.
AssemblyAI Universal-1
The leader on language coverage (99 languages). First-partial latency 100-150ms P50, 200-260ms P95. Strong English WER. Speaker diarization in both real-time and offline modes. Best for multilingual deployments and use cases that need speaker labels.
Speechmatics Ursa
Strong European-language coverage. First-partial latency 110-160ms P50, 210-280ms P95. Domain models for medical, legal, broadcast. Best for European deployments and regulated industries.
Cartesia Ink-Whisper
The latency leader. First-partial latency below 100ms P50 in early benchmarks. WER competitive with Deepgram on clean audio. Newer entrant with less production telemetry; smaller language coverage than the established providers. Best for latency-critical use cases where the audio profile is well-controlled.
Google Cloud Speech-to-Text v2
The commodity option. 125 languages. First-partial latency 150-220ms P50, 320-450ms P95. Higher WER than Deepgram or AssemblyAI but the broadest language coverage. Best for cost-sensitive deployments or applications already on Google Cloud.
Provider deep-dive: offline STT in 2026
Four providers dominate offline STT in 2026.
OpenAI Whisper (large-v3, turbo)
The open-weight baseline. 1.5B parameter large-v3, 800M parameter turbo. Strong WER across 99 languages. Can be self-hosted on a single GPU. The largest community of variants (Whisper.cpp for CPU inference, Faster-Whisper for GPU optimization). Best for teams that want to self-host or need offline batch processing without API costs.
Deepgram Nova-3 (batch mode)
The same Nova-3 model running in batch mode at roughly 60% of the real-time per-minute price ($0.0043 vs $0.0070 in the pricing table above). Strong telephony audio handling. Speaker diarization, summarization, sentiment built into the batch pipeline. Best for call-center analytics teams that want a hosted batch service.
AssemblyAI offline
Speaker diarization leader. Custom-vocabulary endpoint. Topic detection, summarization, sentiment, redaction (HIPAA-friendly) built in. Best for call analytics and content-moderation pipelines.
NVIDIA Parakeet
English-only model from NVIDIA. Open-source. Fastest open-source English STT at 100-200x real-time on a single H100. Best for English-heavy batch workloads and teams with GPU capacity to self-host.
The eval rubric that scores both: audio_transcription
Whichever provider mix you pick, you need a way to score them systematically. The audio_transcription rubric in ai-evaluation is purpose-built for this.
The rubric computes:
- Word Error Rate (WER). The standard metric. Edit distance between hypothesis and reference, normalized by reference length.
- Semantic similarity. Embeddings-based score that catches paraphrases the LLM downstream would consider equivalent.
- Named-entity preservation. Did the names, places, account numbers survive?
- Numeric preservation. Did the numbers survive? (Phone numbers, dollar amounts, dates.)
- Jargon recognition. Did the domain terms survive?
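For intuition about the first metric, here is a minimal word-level WER sketch: Levenshtein edit distance over word tokens, divided by the reference length. It is not the rubric's implementation. Normalization here is just lowercasing and whitespace splitting, which is an assumption for this example; production scorers usually also normalize punctuation and number formats.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "please transfer fifty dollars to my savings account"
hypothesis = "please transfer fifteen dollars to savings account"
print(wer(reference, hypothesis))  # one substitution + one deletion over 8 words
```

Note that the single substitution here ("fifty" to "fifteen") changes the meaning of the request while barely moving WER, which is why the rubric also tracks numeric and named-entity preservation.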
The rubric works on both real-time and offline STT output. The pattern:
```python
from fi.testcases import MLLMTestCase, MLLMAudio
from fi.evals import Evaluator

# Recording to score, paired with the STT hypothesis and the labeled reference.
audio = MLLMAudio(url="path/to/call_recording.wav")
test_case = MLLMTestCase(
    input=audio,
    response="hypothesis transcript from your STT provider",
    expected_response="ground-truth transcript from your labeler",
    query="Score the transcript quality",
)

# Run the audio_transcription rubric on the test case.
ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=["audio_transcription"],
    inputs=[test_case],
)
```
Run the rubric on a labeled holdout set of 500-1000 utterances representative of your production audio. Run it again on every provider you’re evaluating. The output is a comparable STT-quality score per provider, with WER-class accuracy when ground-truth transcripts are available.
The audio_quality rubric scores output-audio quality on the TTS side. Together the two rubrics cover both ends of the voice pipeline.
Real-time STT plus offline STT: the two-pass pattern
The strongest production pattern uses both engines. Real-time STT drives the live conversation. Offline STT scores the call after it ends.
The flow:
- Live call. Real-time STT streams partials to the LLM. The agent responds. The call audio is recorded (separate assistant and customer tracks where possible).
- Post-call. The recorded audio is sent to offline STT. The high-accuracy transcript is stored.
- Eval pass. The offline transcript is compared to the real-time transcript via the audio_transcription rubric. Differences are surfaced as potential mis-transcriptions in the live path.
- Cluster analysis. The Error Feed clusters mis-transcription patterns into named issues: background noise causing word drops, accent causing entity mis-recognition, jargon causing technical-term substitution.
- Re-tune. Custom-vocabulary boosting, accent-specific models, or codec changes are applied based on the cluster patterns.
The two-pass pattern is what lets a voice team improve STT continuously. The real-time path is constrained by latency; the offline path provides the ground-truth signal that drives the improvement cycle.
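The comparison step in that flow can be approximated locally with a word-level diff. The sketch below is not the audio_transcription rubric; it aligns the two transcripts with Python's `difflib` and flags differences that touch digits as high-importance. The `transcript_drift` name and the digit-only heuristic are assumptions for illustration; a real pipeline would likely also use entity recognition.

```python
import difflib
import re
from typing import Dict, List


def transcript_drift(realtime: str, offline: str) -> List[Dict[str, object]]:
    """List the word spans where the real-time and offline transcripts disagree."""
    rt_words = realtime.lower().split()
    off_words = offline.lower().split()
    matcher = difflib.SequenceMatcher(a=rt_words, b=off_words, autojunk=False)

    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        rt_span, off_span = rt_words[i1:i2], off_words[j1:j2]
        # Assumed heuristic: any digit in the differing span marks it high-importance.
        high = any(re.search(r"\d", word) for word in rt_span + off_span)
        diffs.append({
            "realtime": " ".join(rt_span),
            "offline": " ".join(off_span),
            "high_importance": high,
        })
    return diffs


live = "my account number is 4417 and um i want to cancel"
post_call = "my account number is 4471 and i want to cancel"
for diff in transcript_drift(live, post_call):
    print(diff)
```

In this example the account-number mismatch is flagged as high-importance, while the dropped filler word is not. Only the former would be worth surfacing to an operations team.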
A worked decision: sales voice agent with post-call analytics
A US-based sales voice agent doing 200K calls per month. Each call is 4 minutes average. Total: 800K minutes/month of audio.
Real-time STT pick. Deepgram Nova-3 streaming. First-partial latency 110ms P50 fits the sub-500ms voice budget. WER 4.5% on the call-center audio profile after custom-vocabulary tuning for sales-specific terms. Cost: $0.007/minute x 800K = $5,600/month.
Offline STT pick. Deepgram Nova-3 batch. Same model, same vocabulary, about 60% of the per-minute cost. WER 3.8% on the same audio (gains from bidirectional context). Cost: $0.0043/minute x 800K = $3,440/month.
Why both. Real-time drives the live conversation. Batch produces the analytics transcript that feeds CSAT scoring, intent classification, agent coaching, and the eval rubrics that surface mis-transcriptions in the real-time path.
Total STT cost. $9,040/month. Roughly 1% of the typical revenue per call at the scale that justifies a voice agent.
Telemetry. traceAI captures STT provider, confidence per partial, first-partial latency, final latency as span attributes. The dedicated traceAI-pipecat and traceai-livekit packages instrument the voice frameworks. OpenInference-compatible. Apache 2.0.
Evaluation. The audio_transcription rubric runs nightly on a 1000-utterance sample comparing real-time output to offline output. Mis-transcriptions cluster in Error Feed into named issues. The team re-tunes custom vocabulary monthly based on the clusters.
The resulting voice stack is a real-time path with 4.5% WER, a batch path with 3.8% WER, and a feedback loop that surfaces the drift each pattern produces. The team can A/B test new providers (Cartesia Ink-Whisper, AssemblyAI Universal-1) on the same labeled set and ship the winner without guesswork.
Calibrated comparison: where each engine wins
In both real-time and offline STT, different providers lead on different axes. An honest comparison:
Real-time STT leaders:
- Deepgram Nova-3 leads WER on hard call-center audio with accents and background noise.
- AssemblyAI Universal-1 leads language coverage and built-in speaker diarization.
- Cartesia Ink-Whisper leads pure latency, with first-partial sub-100ms in clean conditions.
- Speechmatics Ursa leads European language coverage and regulated-industry domain models.
Offline STT leaders:
- Whisper large-v3 leads on open-weight accuracy and self-hostability.
- Deepgram batch leads on cost per minute for hosted batch.
- AssemblyAI offline leads on built-in speaker diarization and content-moderation features.
- NVIDIA Parakeet leads on English-only throughput for self-hosted deployments.
Future AGI is not in this comparison because Future AGI is the eval layer that scores all of them. We ship the rubrics, the telemetry, and the cluster surface that lets you pick the right STT for your audio profile rather than guessing from a public benchmark.
Telemetry per STT call
Every STT call should produce a typed span with these attributes.
| Attribute | Type | Notes |
|---|---|---|
| stt_provider | string | deepgram, assemblyai, whisper, etc. |
| stt_model | string | nova-3, universal-1, large-v3, etc. |
| stt_mode | string | streaming or batch |
| audio_duration_ms | int | Length of audio submitted |
| first_partial_latency_ms | int | Streaming only |
| final_latency_ms | int | Time from end-of-audio to final |
| final_confidence | float | Provider-reported confidence |
| transcript_length_chars | int | Length of final output |
| language_detected | string | If language detection ran |
| speaker_count | int | If diarization ran |
| wer_vs_groundtruth | float | If ground truth is available |
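One way to keep these attributes typed before they reach a tracing SDK is a small record class. The sketch below uses only the standard library. The `SttSpan` class and its validation rules are assumptions for this example and not part of traceAI; only the field names come from the table.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class SttSpan:
    """Typed record of one STT call, mirroring the attribute table."""
    stt_provider: str
    stt_model: str
    stt_mode: str  # "streaming" or "batch"
    audio_duration_ms: int
    final_latency_ms: int
    final_confidence: float
    transcript_length_chars: int
    first_partial_latency_ms: Optional[int] = None  # streaming only
    language_detected: Optional[str] = None         # if language detection ran
    speaker_count: Optional[int] = None             # if diarization ran
    wer_vs_groundtruth: Optional[float] = None      # if ground truth is available

    def __post_init__(self) -> None:
        if self.stt_mode not in ("streaming", "batch"):
            raise ValueError(f"unknown stt_mode: {self.stt_mode!r}")
        if self.stt_mode == "batch" and self.first_partial_latency_ms is not None:
            raise ValueError("first_partial_latency_ms applies to streaming only")

    def to_attributes(self) -> Dict[str, object]:
        """Flatten to span attributes, dropping fields that were not recorded."""
        return {key: value for key, value in asdict(self).items() if value is not None}


span = SttSpan(
    stt_provider="deepgram", stt_model="nova-3", stt_mode="streaming",
    audio_duration_ms=4200, final_latency_ms=610, final_confidence=0.93,
    transcript_length_chars=58, first_partial_latency_ms=115,
)
print(span.to_attributes())
```

The resulting dictionary can be passed to whatever span API your tracing library exposes.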
The span data feeds the observability stack. Plot first-partial latency P95, final latency P95, and WER (where ground truth exists) weekly. Investigate latency regressions above 30ms or WER regressions above 1%.
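A weekly latency check along these lines could look like the sketch below. It computes P95 with the nearest-rank method and compares two weeks against the 30ms threshold. The function names and sample data are hypothetical; an observability backend would normally compute the percentile for you.

```python
from typing import Sequence


def p95(values: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def latency_regressed(last_week: Sequence[float],
                      this_week: Sequence[float],
                      threshold_ms: float = 30.0) -> bool:
    """True when P95 latency grew by more than the threshold week over week."""
    return p95(this_week) - p95(last_week) > threshold_ms


# Hypothetical first-partial latencies in milliseconds.
last_week = [200] * 19 + [210]
this_week = [200] * 18 + [240, 260]
print(p95(last_week), p95(this_week), latency_regressed(last_week, this_week))
```

The same comparison works for WER by swapping the latency samples for per-utterance or per-day WER values and adjusting the threshold.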
Handling code-switching and multilingual audio
Multilingual audio is a real failure mode in 2026 voice agents. Spanglish, Hinglish, Singlish, and other code-switched dialects break STT engines that assume one language per utterance.
Three patterns work.
Pattern 1: language detection per chunk. Some providers (AssemblyAI, Whisper) detect language per chunk and switch decoders. The accuracy gains are real but latency suffers (the chunk has to be classified before transcription).
Pattern 2: multilingual model. Whisper large-v3, AssemblyAI Universal-1 are explicitly trained on multilingual data and handle code-switching natively. The accuracy is better than single-language models forced to handle the second language.
Pattern 3: per-locale routing. Detect the caller’s locale at session start (via phone number, account record, or a brief language-detection turn). Route to a per-locale STT configuration. Avoid mid-call language switching.
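Per-locale routing can be as simple as a lookup at session start. The sketch below maps a caller's phone-number country prefix to a locale and then to an STT configuration. The mapping tables, the configuration keys, and the default locale are all hypothetical; a production router would more likely use the account record, since a country code alone is a weak signal for language.

```python
from typing import Dict

# Hypothetical mapping from calling-code prefix to locale.
LOCALE_BY_PREFIX: Dict[str, str] = {
    "+1": "en-US",
    "+44": "en-GB",
    "+52": "es-MX",
    "+91": "hi-IN",
}

# Hypothetical per-locale STT settings; real option names depend on the provider.
STT_CONFIG_BY_LOCALE: Dict[str, Dict[str, str]] = {
    "en-US": {"language": "en"},
    "en-GB": {"language": "en"},
    "es-MX": {"language": "es"},
    "hi-IN": {"language": "hi"},
}

DEFAULT_LOCALE = "en-US"  # assumption: fall back to the primary market


def locale_for(phone_number: str) -> str:
    """Pick the locale whose calling-code prefix is the longest match."""
    for prefix in sorted(LOCALE_BY_PREFIX, key=len, reverse=True):
        if phone_number.startswith(prefix):
            return LOCALE_BY_PREFIX[prefix]
    return DEFAULT_LOCALE


def stt_config_for(phone_number: str) -> Dict[str, str]:
    """Resolve the STT configuration once, at session start."""
    return STT_CONFIG_BY_LOCALE[locale_for(phone_number)]


print(locale_for("+525512345678"), stt_config_for("+525512345678"))
```

Resolving the configuration once per session is what keeps this pattern off the hot path: no per-chunk language classification is needed during the call.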
The translation_accuracy rubric and cultural_sensitivity rubric in ai-evaluation score multilingual output systematically. Run them on a labeled multilingual holdout set when picking a multilingual STT provider.
Common failure modes and fixes
Six failure modes show up in production STT pipelines.
1. Background-noise word drops. Noisy audio causes words to be dropped or replaced with the closest acoustic match. Fix: provider with better noise-robust training data (Deepgram Nova-3 leads here), or pre-process audio with noise suppression (RNNoise, NVIDIA Maxine).
2. Accent drift. STT trained primarily on American English degrades on Indian English, Scottish English, etc. WER on accent cohorts can be 2-3x baseline. Fix: pick a multilingual provider with accent-diverse training, or fine-tune Whisper on accent audio if you have the volume to justify it.
3. Jargon substitution. Domain terms (medical, legal, technical) get replaced with phonetically similar but wrong words. Fix: custom-vocabulary boosting (Deepgram, AssemblyAI, Speechmatics all support it), or self-hosted fine-tuning.
4. Speaker cross-talk. Two speakers talking simultaneously confuses the model. Fix: speaker diarization (AssemblyAI is the leader), separate channel audio per speaker (most VoIP supports this).
5. Codec artifacts. Low-bitrate codecs introduce noise that confuses STT. G.711 is fine; G.729 and low-bitrate Opus are worse. Fix: prefer higher-bitrate codecs end to end.
6. Real-time/offline drift. Real-time and offline transcripts of the same audio differ in 15-30% of words. The differences are usually low-importance (filler words, punctuation) but sometimes high-importance (entity names, numbers). Fix: the two-pass pattern with the audio_transcription rubric scoring drift; surface high-importance drift to the operations team.
Future AGI on STT evaluation
traceAI captures STT provider, model, mode, latency, confidence, and transcript length as span attributes. 30+ documented integrations across Python and TypeScript including dedicated traceAI-pipecat and traceai-livekit packages for the voice frameworks. OpenInference-compatible. Apache 2.0.
ai-evaluation ships 70+ built-in eval templates including audio_transcription for STT scoring, audio_quality for TTS scoring, translation_accuracy and cultural_sensitivity for multilingual. Custom evaluators authored by an in-product agent. The MLLMAudio test case accepts 7 audio formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma) directly from URL or local path. Per-route eval gating keeps async eval off the critical voice path. Programmatic eval API for configure plus re-run. Apache 2.0.
Future AGI Protect runs sub-100ms inline on Gemma 3n with LoRA-trained adapters per arXiv 2510.13351. Multi-modal across text, image, and audio. ProtectFlash for the single-call binary path. Inline guardrails scan transcripts for PII, prompt injection, and policy violations without breaking the voice budget.
Error Feed auto-clusters STT failures into named issues with auto-written root cause and quick fix. Background-noise word drops, accent-specific WER, jargon substitution, codec drift each become their own named cluster instead of drowning in raw spans.
Agent Command Center hosts the stack with RBAC, SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page. AWS Marketplace, multi-region hosted, BYOC for regulated workloads. Native voice observability for Vapi, Retell, and LiveKit via provider API key plus Assistant ID, no SDK required. Auto-captured call recordings download separate assistant and customer audio tracks.
The simulation surface for STT testing is Simulate: 18 pre-built personas plus unlimited custom with controls for accent, background noise, and multilingual. Generate hundreds of test calls per scenario via the auto-generated branching scenarios in Workflow Builder. The four-step Run Tests wizard scores each call with audio_transcription. Error Localization pinpoints the failing turn.
Where this falls short
Ground-truth labeling is still labor-intensive. Scoring WER requires reference transcripts. Generating reference transcripts at scale requires labeled audio. The labeling cost is real. We surface the eval rubric and the cluster surface but the reference set still has to come from somewhere. The audio_transcription rubric is reference-based WER-class scoring; for unlabeled production traffic, lean on sampled labeling, custom evaluators, or Error Feed clustering to surface failure patterns.
Provider benchmarks are noisy. Public WER benchmarks usually run on academic test sets (LibriSpeech, CommonVoice) that don’t reflect production audio. The only number that matters is WER on your audio. Run the rubric on your labeled set; ignore the leaderboards.
Real-time and batch APIs diverge over time. Providers sometimes update one model and not the other. The drift between real-time and offline output can grow over months. The two-pass pattern with the audio_transcription rubric surfaces drift as it happens; the alert threshold is up to you.
Related reading
- Sub-500ms Voice AI: The Complete Latency Budget Guide for 2026
- How to Measure Voice AI Latency: The Complete 2026 Guide
- Voice AI Barge-In and Turn-Taking: A 2026 Implementation Guide
- How to Implement Voice AI Observability in 2026
Sources and references
- Future AGI Protect: arXiv 2510.13351
- OpenInference span specification: github.com/Arize-ai/openinference
- Future AGI trust and compliance: futureagi.com/trust
- Deepgram Nova-3 documentation: deepgram.com vendor docs
- AssemblyAI Universal-1 documentation: assemblyai.com vendor docs
- OpenAI Whisper repository: openai-whisper GitHub
- Speechmatics Ursa documentation: speechmatics.com vendor docs
- Cartesia Ink-Whisper announcement: cartesia.ai vendor docs
- WER computation reference: NIST SCLITE documentation