Real-Time STT vs Offline STT: A 2026 Decision Guide for Voice AI

Real-time STT vs offline STT in 2026: latency, WER, cost, accent robustness, and the eval rubric that scores both at scale. A decision matrix for voice AI.

The single biggest infrastructure decision in any voice AI stack is what speech-to-text engine you use, and whether you use one or two. Real-time STT and offline STT are not the same product. They share model families and sometimes share APIs, but they have different latency budgets, different WER profiles, different cost structures, and different failure modes. This guide is the 2026 decision matrix: when to pick which, how to evaluate both systematically, and how to instrument both for production.

TL;DR: when to pick which

| | Real-time STT | Offline STT |
| --- | --- | --- |
| Latency floor | 100-300ms first-partial | 0.1-0.3x real-time for completed file |
| WER (clean audio) | 4-7% | 2-4% |
| WER (hard audio) | 8-15% | 5-10% |
| Cost per minute | $0.008-0.020 | $0.003-0.010 |
| Use cases | Live voice agents, IVR, captions | Call analytics, transcripts, search |
| Best fit providers | Deepgram Nova-3, AssemblyAI Universal-1, Cartesia Ink-Whisper | Whisper large-v3, Deepgram Nova-3 batch, AssemblyAI offline |

Most production voice agent stacks use both. Real-time STT for the live conversation. Offline STT for the post-call analytics. The two layers serve different consumers and don’t share a hot path.

What “real-time” actually means in STT

Real-time STT (streaming) processes audio as it arrives. The audio stream is chunked (typically 100ms chunks) and each chunk is fed to a model that produces a per-chunk transcript update. The provider exposes two callback types: partial and final. Partials arrive every 200-500ms while the user is speaking. Final arrives 400-1000ms after the user stops speaking.
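As a rough illustration of the chunking step, the sketch below splits raw PCM audio into 100ms chunks before they would be sent over a streaming connection. The 16 kHz, 16-bit mono format and the `chunk_pcm` helper are assumptions for the example, not any provider's API.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # assumed: 16 kHz mono audio
BYTES_PER_SAMPLE = 2     # assumed: 16-bit PCM
CHUNK_MS = 100           # chunk duration quoted in the text


def chunk_pcm(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield fixed-duration chunks of raw PCM; the last chunk may be shorter."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# One second of silence splits into ten chunks of 3,200 bytes each.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
chunks = list(chunk_pcm(one_second))
```

In a real client each chunk would be written to the provider's streaming socket, with partial and final transcripts arriving on separate callbacks.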

The model architecture matters. Streaming STT models are typically:

  • Causal (the model only looks at past audio, not future audio). This is what enables low-latency partial transcripts.
  • Trained for streaming with specific objectives like CTC plus attention, or transducer-based decoders.
  • Smaller than offline counterparts. Streaming models are typically 60-300M parameters; offline models can be 1.5B+ parameters.

The constraints of streaming (no future context, lower parameter count, partial output) cost some accuracy. The trade is acceptable for voice agents because the user is still talking and corrections can happen at the final transcript step.

What “offline” means in STT

Offline STT (batch) processes a completed audio file. The audio is loaded in full and the model produces a single high-accuracy transcript.

Three properties make offline STT more accurate:

Bidirectional context. The model sees the full utterance when it predicts each word. A sentence ending that disambiguates an earlier word is available to the model. Streaming models can’t do this.

Larger models. Whisper large-v3 is 1.5B parameters. Whisper turbo is 800M. Streaming models are 60-300M. The larger models capture more linguistic patterns.

Longer chunks. Offline models can process 30-second chunks with full context. Streaming models work in 100-500ms windows. The longer chunk lets the model use prosodic cues across the utterance.

The cost of offline STT is latency. A 60-second call returns in 6-20 seconds depending on the provider. For real-time voice this is unusable. For post-call analytics it’s fine.
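The turnaround figure follows directly from the real-time factor. A minimal sketch, assuming the 0.1-0.3x range from the summary table (the function name is hypothetical):

```python
def batch_turnaround_s(audio_s: float, rtf_low: float = 0.1, rtf_high: float = 0.3) -> tuple[float, float]:
    """Estimated (best, worst) processing time in seconds for a completed file."""
    return (audio_s * rtf_low, audio_s * rtf_high)


# A 60-second call: about 6 s at 0.1x and about 18 s at 0.3x.
best, worst = batch_turnaround_s(60)
```

The upper bound lands at about 18 seconds, which the text rounds to 20; queueing and upload time on a hosted service would add to either number.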

The decision matrix

Six axes determine which engine you pick.

Axis 1: latency budget

  • Sub-500ms voice agent. Real-time STT, full stop. The latency budget allows no slack.
  • 5-30 second response window (e.g., dictation that completes when the user pauses, voice search). Real-time STT for the partial signal during input. Offline STT can score the final transcript if the user is willing to wait 1-2 seconds.
  • Post-call analytics. Offline STT. Latency irrelevant; accuracy paramount.
  • Live captions. Real-time STT with low-latency partial output. WER tradeoffs accepted.

Axis 2: accuracy requirement

  • Voice agent intent classification. Real-time STT is sufficient. WER of 5-10% is workable because the LLM downstream is forgiving of partial errors.
  • Legal transcript or medical dictation. Offline STT. Accuracy is the product; the user is paying for low WER.
  • Customer call quality scoring. Offline STT for analytics, real-time STT for live alerts during the call.
  • Voice search query. Real-time STT with re-rank from offline pass. The two-pass approach trades some latency for accuracy gain.

Axis 3: cost

Real-time STT is more expensive per minute. Three reasons: streaming requires persistent connections, the per-token cost is higher for partial decoding, and many providers charge a real-time premium.

Typical 2026 pricing per audio minute:

| Provider | Real-time | Offline (batch) |
| --- | --- | --- |
| Deepgram Nova-3 | $0.0070 | $0.0043 |
| AssemblyAI Universal-1 | $0.0150 | $0.0050 |
| OpenAI Whisper API | N/A (no streaming) | $0.0060 |
| Speechmatics Ursa | $0.0125 | $0.0090 |
| Cartesia Ink-Whisper | $0.0080 | N/A |
| Self-hosted Whisper | $0.001-0.003 (compute only) | $0.0005-0.002 (compute only) |

The cost gap matters at scale. A voice agent doing 1M minutes per month pays $7K on Deepgram real-time and $4.3K on Deepgram batch. The $2.7K/month difference is the real-time premium. Running offline analytics on the same audio is an additional batch pass, so a stack that uses both engines pays $11.3K/month.
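The arithmetic is simple enough to keep in a script and re-run as prices change. This sketch uses the Deepgram per-minute prices from the table; the dictionary keys and function name are illustrative.

```python
# Per-minute prices from the pricing table (USD).
PRICE_PER_MIN = {
    "deepgram_realtime": 0.0070,
    "deepgram_batch": 0.0043,
}


def monthly_cost(minutes: int, price_per_min: float) -> float:
    """Monthly spend in USD, rounded to cents."""
    return round(minutes * price_per_min, 2)


minutes = 1_000_000
realtime = monthly_cost(minutes, PRICE_PER_MIN["deepgram_realtime"])
batch = monthly_cost(minutes, PRICE_PER_MIN["deepgram_batch"])
premium = round(realtime - batch, 2)     # real-time premium over batch
both_passes = round(realtime + batch, 2) # running both engines on the same audio
```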

Axis 4: language coverage

  • English-first deployment. Most providers are strong. Pick on latency and cost.
  • Multilingual deployment. AssemblyAI Universal-1 covers 99 languages. Whisper large-v3 covers 99. Deepgram Nova-3 covers 36. Google Cloud Speech covers 125.
  • Code-switching (Spanglish, Hinglish). Whisper handles code-switching better than most streaming providers. For real-time agents in mixed-language markets, test code-switching specifically.
  • Low-resource languages. Whisper variants and AssemblyAI cover more long-tail languages. Test the specific language; provider coverage doesn’t equal provider quality on every language.

Axis 5: accent robustness

  • American English speakers. Every major provider works. Pick on latency.
  • Indian English speakers. Deepgram Nova-3 and AssemblyAI Universal-1 do well. Whisper large-v3 has the edge on accent diversity. Older streaming models (early Whisper variants, older Deepgram models) struggle more.
  • Heavy regional dialects (Scottish, Cajun, Singlish). Test on labeled data. No provider guarantees this; benchmarks rarely include strong dialect coverage. Whisper variants tend to lead in published comparisons.

Axis 6: domain vocabulary

  • General conversation. Any provider.
  • Medical, legal, financial vocabulary. AssemblyAI offers a custom-vocabulary endpoint. Speechmatics has industry models. Deepgram supports keyword boosting. Self-hosted Whisper variants can be fine-tuned on domain audio for the strongest accuracy. The cost of fine-tuning offsets the per-call savings at high enough volume.

Provider deep-dive: real-time STT in 2026

Five providers dominate real-time STT in 2026. The differences are in WER on hard audio, language coverage, and latency.

Deepgram Nova-3

The leader on WER for hard audio (background noise, accents, jargon). First-partial latency 90-130ms P50, 180-220ms P95. 36 languages supported. Strong on telephony codecs. Custom-vocabulary boosting. Best for support and call-center deployments where the audio profile is challenging.

AssemblyAI Universal-1

The leader on language coverage (99 languages). First-partial latency 100-150ms P50, 200-260ms P95. Strong English WER. Speaker diarization in both real-time and offline modes. Best for multilingual deployments and use cases that need speaker labels.

Speechmatics Ursa

Strong European-language coverage. First-partial latency 110-160ms P50, 210-280ms P95. Domain models for medical, legal, broadcast. Best for European deployments and regulated industries.

Cartesia Ink-Whisper

The latency leader. First-partial latency below 100ms P50 in early benchmarks. WER competitive with Deepgram on clean audio. Newer entrant with less production telemetry; smaller language coverage than the established providers. Best for latency-critical use cases where the audio profile is well-controlled.

Google Cloud Speech-to-Text v2

The commodity option. 125 languages. First-partial latency 150-220ms P50, 320-450ms P95. Higher WER than Deepgram or AssemblyAI but the broadest language coverage. Best for cost-sensitive deployments or applications already on Google Cloud.

Provider deep-dive: offline STT in 2026

Four providers dominate offline STT in 2026.

OpenAI Whisper (large-v3, turbo)

The open-weight baseline. 1.5B parameter large-v3, 800M parameter turbo. Strong WER across 99 languages. Can be self-hosted on a single GPU. The largest community of variants (Whisper.cpp for CPU inference, Faster-Whisper for GPU optimization). Best for teams that want to self-host or need offline batch processing without API costs.

Deepgram Nova-3 (batch mode)

The same Nova-3 model running in batch mode at roughly 60% of the real-time per-minute price ($0.0043 versus $0.0070 in the pricing table above). Strong telephony audio handling. Speaker diarization, summarization, sentiment built into the batch pipeline. Best for call-center analytics teams that want a hosted batch service.

AssemblyAI offline

Speaker diarization leader. Custom-vocabulary endpoint. Topic detection, summarization, sentiment, redaction (HIPAA-friendly) built in. Best for call analytics and content-moderation pipelines.

NVIDIA Parakeet

English-only model from NVIDIA. Open-source. Fastest open-source English STT at 100-200x real-time on a single H100. Best for English-heavy batch workloads and teams with GPU capacity to self-host.

The eval rubric that scores both: audio_transcription

Whichever provider mix you pick, you need a way to score them systematically. The audio_transcription rubric in ai-evaluation is purpose-built for this.

The rubric computes:

  • Word Error Rate (WER). The standard metric. Edit distance between hypothesis and reference, normalized by reference length.
  • Semantic similarity. Embeddings-based score that catches paraphrases the LLM downstream would consider equivalent.
  • Named-entity preservation. Did the names, places, account numbers survive?
  • Numeric preservation. Did the numbers survive? (Phone numbers, dollar amounts, dates.)
  • Jargon recognition. Did the domain terms survive?
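For reference, the WER definition in the first bullet can be written as a word-level edit distance. This is a minimal sketch of the standard metric, not the rubric's own implementation; it lowercases and splits on whitespace, and a production scorer would normally apply fuller text normalization first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
score = wer("please transfer fifty dollars", "please transfer fifteen dollars")
```

Note that this single substitution changes a dollar amount, which is why the rubric tracks numeric and entity preservation separately from raw WER.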

The rubric works on both real-time and offline STT output. The pattern:

```python
from fi.testcases import MLLMTestCase, MLLMAudio
from fi.evals import Evaluator

# Recorded call audio (URL or local path).
audio = MLLMAudio(url="path/to/call_recording.wav")

# Pair the STT hypothesis with the labeled reference transcript.
test_case = MLLMTestCase(
    input=audio,
    response="hypothesis transcript from your STT provider",
    expected_response="ground-truth transcript from your labeler",
    query="Score the transcript quality",
)

# Supply your own API credentials in place of the placeholders.
ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=["audio_transcription"],
    inputs=[test_case],
)
```

Run the rubric on a labeled holdout set of 500-1000 utterances representative of your production audio. Run it again on every provider you’re evaluating. The output is a comparable STT-quality score per provider, with WER-class accuracy when ground-truth transcripts are available.

The audio_quality rubric scores output-audio quality on the TTS side. Together the two rubrics cover both ends of the voice pipeline.

Real-time STT plus offline STT: the two-pass pattern

The strongest production pattern uses both engines. Real-time STT drives the live conversation. Offline STT scores the call after it ends.

The flow:

  1. Live call. Real-time STT streams partials to the LLM. The agent responds. The call audio is recorded (separate assistant and customer tracks where possible).
  2. Post-call. The recorded audio is sent to offline STT. The high-accuracy transcript is stored.
  3. Eval pass. The offline transcript is compared to the real-time transcript via the audio_transcription rubric. Differences are surfaced as potential mis-transcriptions in the live path.
  4. Cluster analysis. The Error Feed clusters mis-transcription patterns into named issues: background noise causing word drops, accent causing entity mis-recognition, jargon causing technical-term substitution.
  5. Re-tune. Custom-vocabulary boosting, accent-specific models, or codec changes are applied based on the cluster patterns.

The two-pass pattern is what lets a voice team improve STT continuously. The real-time path is constrained by latency; the offline path provides the ground-truth signal that drives the improvement cycle.
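Step 3 of the flow can be approximated locally before any rubric runs. The sketch below aligns the two transcripts with the standard library's `difflib.SequenceMatcher` and returns the spans that differ; the function name and sample sentences are made up for illustration.

```python
from difflib import SequenceMatcher


def transcript_diffs(realtime: str, offline: str) -> list[tuple[str, str]]:
    """Return (realtime_span, offline_span) pairs where the transcripts disagree."""
    rt_words = realtime.lower().split()
    off_words = offline.lower().split()
    matcher = SequenceMatcher(a=rt_words, b=off_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(rt_words[i1:i2]), " ".join(off_words[j1:j2])))
    return diffs


diffs = transcript_diffs(
    "send it to jon smith at fifteen main street",
    "send it to john smith at fifty main street",
)
```

Each pair is a candidate mis-transcription in the live path; the offline side is treated as the better guess, not as guaranteed ground truth.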

A worked decision: sales voice agent with post-call analytics

A US-based sales voice agent doing 200K calls per month. Each call is 4 minutes average. Total: 800K minutes/month of audio.

Real-time STT pick. Deepgram Nova-3 streaming. First-partial latency 110ms P50 fits the sub-500ms voice budget. WER 4.5% on the call-center audio profile after custom-vocabulary tuning for sales-specific terms. Cost: $0.007/minute x 800K = $5,600/month.

Offline STT pick. Deepgram Nova-3 batch. Same model, same vocabulary, about 60% of the per-minute cost. WER 3.8% on the same audio (gains from bidirectional context). Cost: $0.0043/minute x 800K = $3,440/month.

Why both. Real-time drives the live conversation. Batch produces the analytics transcript that feeds CSAT scoring, intent classification, agent coaching, and the eval rubrics that surface mis-transcriptions in the real-time path.

Total STT cost. $9,040/month, or about $0.045 per call across 200K calls. Roughly 1% of the typical revenue per call at the scale that justifies a voice agent.

Telemetry. traceAI captures STT provider, confidence per partial, first-partial latency, final latency as span attributes. The dedicated traceAI-pipecat and traceai-livekit packages instrument the voice frameworks. OpenInference-compatible. Apache 2.0.

Evaluation. The audio_transcription rubric runs nightly on a 1000-utterance sample comparing real-time output to offline output. Mis-transcriptions cluster in Error Feed into named issues. The team re-tunes custom vocabulary monthly based on the clusters.

The resulting voice stack is a real-time path with 4.5% WER, a batch path with 3.8% WER, and a feedback loop that surfaces the drift between them. The team can A/B test new providers (Cartesia Ink-Whisper, AssemblyAI Universal-1) on the same labeled set and ship the winner without guesswork.

Calibrated comparison: where each engine wins

Within both real-time and offline STT, different providers lead on different axes. An honest comparison:

Real-time STT leaders:

  • Deepgram Nova-3 leads WER on hard call-center audio with accents and background noise.
  • AssemblyAI Universal-1 leads language coverage and built-in speaker diarization.
  • Cartesia Ink-Whisper leads pure latency, with first-partial sub-100ms in clean conditions.
  • Speechmatics Ursa leads European language coverage and regulated-industry domain models.

Offline STT leaders:

  • Whisper large-v3 leads on open-weight accuracy and self-hostability.
  • Deepgram batch leads on cost per minute for hosted batch.
  • AssemblyAI offline leads on built-in speaker diarization and content-moderation features.
  • NVIDIA Parakeet leads on English-only throughput for self-hosted deployments.

Future AGI is not in this comparison because Future AGI is the eval layer that scores all of them. We ship the rubrics, the telemetry, and the cluster surface that lets you pick the right STT for your audio profile rather than guessing from a public benchmark.

Telemetry per STT call

Every STT call should produce a typed span with these attributes.

| Attribute | Type | Notes |
| --- | --- | --- |
| stt_provider | string | deepgram, assemblyai, whisper, etc. |
| stt_model | string | nova-3, universal-1, large-v3, etc. |
| stt_mode | string | streaming or batch |
| audio_duration_ms | int | Length of audio submitted |
| first_partial_latency_ms | int | Streaming only |
| final_latency_ms | int | Time from end-of-audio to final |
| final_confidence | float | Provider-reported confidence |
| transcript_length_chars | int | Length of final output |
| language_detected | string | If language detection ran |
| speaker_count | int | If diarization ran |
| wer_vs_groundtruth | float | If ground truth is available |

The span data feeds the observability stack. Plot first-partial latency P95, final latency P95, and WER (where ground truth exists) weekly. Investigate week-over-week regressions of more than 30ms in either latency P95 or more than 1% in WER.
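One way to keep these attributes typed in application code is a small dataclass plus a percentile check. This is a sketch under assumptions: the `STTSpan` class and helper names are hypothetical and are not part of traceAI, and the percentile uses the nearest-rank method.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class STTSpan:
    """Typed record mirroring the span attributes listed above."""
    stt_provider: str
    stt_model: str
    stt_mode: str  # "streaming" or "batch"
    audio_duration_ms: int
    final_latency_ms: int
    final_confidence: float
    transcript_length_chars: int
    first_partial_latency_ms: Optional[int] = None  # streaming only
    language_detected: Optional[str] = None
    speaker_count: Optional[int] = None
    wer_vs_groundtruth: Optional[float] = None


def p95(values: Sequence[int]) -> int:
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def latency_regressed(last_week: Sequence[int], this_week: Sequence[int],
                      threshold_ms: int = 30) -> bool:
    """True when P95 latency grew by more than the threshold."""
    return p95(this_week) - p95(last_week) > threshold_ms


span = STTSpan("deepgram", "nova-3", "streaming", 4200, 610, 0.93, 58,
               first_partial_latency_ms=115)
regressed = latency_regressed([100] * 20, [100] * 18 + [140] * 2)
```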

Handling code-switching and multilingual audio

Multilingual audio is a real failure mode in 2026 voice agents. Spanglish, Hinglish, Singlish, and other code-switched dialects break STT engines that assume one language per utterance.

Three patterns work.

Pattern 1: language detection per chunk. Some providers (AssemblyAI, Whisper) detect language per chunk and switch decoders. The accuracy gains are real but latency suffers (the chunk has to be classified before transcription).

Pattern 2: multilingual model. Whisper large-v3 and AssemblyAI Universal-1 are explicitly trained on multilingual data and handle code-switching natively. The accuracy is better than that of single-language models forced to handle the second language.

Pattern 3: per-locale routing. Detect the caller’s locale at session start (via phone number, account record, or a brief language-detection turn). Route to a per-locale STT configuration. Avoid mid-call language switching.
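A minimal sketch of per-locale routing keyed on the caller's phone prefix follows. The routing table, provider names, and language codes are placeholders chosen for the example; a real deployment would also consult the account record or a language-detection turn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class STTConfig:
    provider: str
    language: str


# Hypothetical routing table: E.164 country prefix -> STT configuration.
ROUTES = {
    "+1": STTConfig("provider_a", "en-US"),
    "+52": STTConfig("provider_b", "es-MX"),
    "+91": STTConfig("provider_b", "en-IN"),
}
DEFAULT_ROUTE = STTConfig("provider_b", "multi")  # assumed multilingual fallback


def route_for_caller(phone_e164: str) -> STTConfig:
    """Pick the configuration whose prefix matches; the longest prefix wins."""
    for prefix in sorted(ROUTES, key=len, reverse=True):
        if phone_e164.startswith(prefix):
            return ROUTES[prefix]
    return DEFAULT_ROUTE


config = route_for_caller("+525512345678")
```

The route is resolved once at session start, which is what keeps mid-call language switching off the hot path.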

The translation_accuracy rubric and cultural_sensitivity rubric in ai-evaluation score multilingual output systematically. Run them on a labeled multilingual holdout set when picking a multilingual STT provider.

Common failure modes and fixes

Six failure modes show up in production STT pipelines.

1. Background-noise word drops. Noisy audio causes words to be dropped or replaced with the closest acoustic match. Fix: provider with better noise-robust training data (Deepgram Nova-3 leads here), or pre-process audio with noise suppression (RNNoise, NVIDIA Maxine).

2. Accent drift. STT trained primarily on American English degrades on Indian English, Scottish English, etc. WER on accent cohorts can be 2-3x baseline. Fix: pick a multilingual provider with accent-diverse training, or fine-tune Whisper on accent audio if you have the volume to justify it.

3. Jargon substitution. Domain terms (medical, legal, technical) get replaced with phonetically similar but wrong words. Fix: custom-vocabulary boosting (Deepgram, AssemblyAI, Speechmatics all support it), or self-hosted fine-tuning.

4. Speaker cross-talk. Two speakers talking simultaneously confuse the model. Fix: speaker diarization (AssemblyAI is the leader), or separate-channel audio per speaker (most VoIP supports this).

5. Codec artifacts. Low-bitrate codecs introduce noise that confuses STT. G.711 is fine; G.729 and low-bitrate Opus are worse. Fix: prefer higher-bitrate codecs end to end.

6. Real-time/offline drift. Real-time and offline transcripts of the same audio differ in 15-30% of words. The differences are usually low-importance (filler words, punctuation) but sometimes high-importance (entity names, numbers). Fix: the two-pass pattern with the audio_transcription rubric scoring drift; surface high-importance drift to the operations team.
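Accent drift (failure mode 2) is easiest to see when WER is broken out by cohort instead of averaged over all traffic. The sketch below pools pre-computed error counts per cohort and flags cohorts at or above 2x the baseline; the cohort labels, counts, and function names are invented for illustration.

```python
from collections import defaultdict
from typing import Iterable


def wer_by_cohort(rows: Iterable[tuple[str, int, int]]) -> dict[str, float]:
    """rows are (cohort, word_errors, reference_words); returns pooled WER per cohort."""
    errors: dict[str, int] = defaultdict(int)
    words: dict[str, int] = defaultdict(int)
    for cohort, word_errors, reference_words in rows:
        errors[cohort] += word_errors
        words[cohort] += reference_words
    return {c: errors[c] / words[c] for c in words if words[c] > 0}


def cohorts_over_baseline(wers: dict[str, float], baseline: str,
                          ratio: float = 2.0) -> list[str]:
    """Cohorts whose WER is at least `ratio` times the baseline cohort's WER."""
    base = wers[baseline]
    return sorted(c for c, w in wers.items() if c != baseline and w >= ratio * base)


rows = [
    ("american_english", 40, 1000),
    ("indian_english", 95, 1000),
    ("scottish_english", 60, 500),
    ("australian_english", 25, 500),
]
flagged = cohorts_over_baseline(wer_by_cohort(rows), baseline="american_english")
```

Pooling errors and reference words before dividing avoids giving short utterances the same weight as long ones.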

Future AGI on STT evaluation

traceAI captures STT provider, model, mode, latency, confidence, and transcript length as span attributes. 30+ documented integrations across Python and TypeScript including dedicated traceAI-pipecat and traceai-livekit packages for the voice frameworks. OpenInference-compatible. Apache 2.0.

ai-evaluation ships 70+ built-in eval templates including audio_transcription for STT scoring, audio_quality for TTS scoring, translation_accuracy and cultural_sensitivity for multilingual. Custom evaluators authored by an in-product agent. The MLLMAudio test case accepts 7 audio formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma) directly from URL or local path. Per-route eval gating keeps async eval off the critical voice path. Programmatic eval API for configure plus re-run. Apache 2.0.

Future AGI Protect runs sub-100ms inline on Gemma 3n with LoRA-trained adapters per arXiv 2510.13351. Multi-modal across text, image, and audio. ProtectFlash for the single-call binary path. Inline guardrails scan transcripts for PII, prompt injection, and policy violations without breaking the voice budget.

Error Feed auto-clusters STT failures into named issues with auto-written root cause and quick fix. Background-noise word drops, accent-specific WER, jargon substitution, codec drift each become their own named cluster instead of drowning in raw spans.

Agent Command Center hosts the stack with RBAC, SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page. AWS Marketplace, multi-region hosted, BYOC for regulated workloads. Native voice observability for Vapi, Retell, and LiveKit via provider API key plus Assistant ID, no SDK required. Auto-captured call recordings download separate assistant and customer audio tracks.

The simulation surface for STT testing is Simulate: 18 pre-built personas plus unlimited custom with controls for accent, background noise, and multilingual. Generate hundreds of test calls per scenario via the auto-generated branching scenarios in Workflow Builder. The four-step Run Tests wizard scores each call with audio_transcription. Error Localization pinpoints the failing turn.

Where this falls short

Ground-truth labeling is still labor-intensive. Scoring WER requires reference transcripts. Generating reference transcripts at scale requires labeled audio. The labeling cost is real. We surface the eval rubric and the cluster surface but the reference set still has to come from somewhere. The audio_transcription rubric is reference-based WER-class scoring; for unlabeled production traffic, lean on sampled labeling, custom evaluators, or Error Feed clustering to surface failure patterns.

Provider benchmarks are noisy. Public WER benchmarks usually run on academic test sets (LibriSpeech, CommonVoice) that don’t reflect production audio. The only number that matters is WER on your audio. Run the rubric on your labeled set; ignore the leaderboards.

Real-time and batch APIs diverge over time. Providers sometimes update one model and not the other. The drift between real-time and offline output can grow over months. The two-pass pattern with the audio_transcription rubric surfaces drift as it happens; the alert threshold is up to you.

Sources and references

  • Future AGI Protect: arXiv 2510.13351
  • OpenInference span specification: github.com/Arize-ai/openinference
  • Future AGI trust and compliance: futureagi.com/trust
  • Deepgram Nova-3 documentation: deepgram.com vendor docs
  • AssemblyAI Universal-1 documentation: assemblyai.com vendor docs
  • OpenAI Whisper repository: openai-whisper GitHub
  • Speechmatics Ursa documentation: speechmatics.com vendor docs
  • Cartesia Ink-Whisper announcement: cartesia.ai vendor docs
  • WER computation reference: NIST SCLITE documentation

Frequently asked questions

What's the difference between real-time STT and offline STT?
Real-time STT (streaming) processes audio as it arrives and produces partial transcripts within 100-300ms. It's the foundation of live voice agents. Offline STT (batch) processes a completed audio file and produces a high-accuracy final transcript with seconds of latency. It's the foundation of call analytics, transcription services, and meeting summaries. The two are different products with different latency budgets, WER profiles, and cost structures, even when they share a model family.
When should you pick real-time STT versus offline STT?
Pick real-time when latency is the constraint: voice agents, live captions, IVR, voice assistants. The WER is slightly worse and the cost is higher but the experience requires it. Pick offline when accuracy is the constraint: post-call analytics, podcast transcription, legal evidence, medical dictation. Offline WER is typically 15-30% lower in relative terms and cost is 30-50% lower; the accuracy gain comes from bidirectional context and longer chunks. Many voice agent stacks use both: real-time STT for the live conversation, offline STT for the post-call analytics layer.
Which providers lead in real-time STT in 2026?
Deepgram Nova-3 has the WER edge on hard audio with background noise and accents, with first-partial latency around 90-130ms at P50. AssemblyAI Universal-1 covers 99 languages with similar latency. Speechmatics Ursa is strong on European languages. Cartesia Ink-Whisper hit the market with sub-100ms first-partial latency. Google Cloud Speech and Azure Speech are commodity options. OpenAI's real-time API ships Whisper as a streaming variant. Pick the combination that matches your latency budget, language coverage, and audio profile.
Which providers lead in offline STT in 2026?
OpenAI Whisper (large-v3) is the open-weight baseline with strong accuracy across 99 languages. Deepgram Nova-3 offers a batch mode at lower cost per minute. AssemblyAI's offline mode is the leader on speaker diarization. NVIDIA Parakeet is the leader for English-only at low cost. For medical and legal domains, AssemblyAI and Speechmatics offer specialized models with industry-specific vocabulary handling. Open-source Whisper variants (Whisper.cpp, Faster-Whisper) let teams self-host for compliance reasons.
How do you evaluate STT accuracy systematically?
The audio_transcription rubric in Future AGI ai-evaluation scores ASR output against a ground-truth transcript at scale. It provides WER-class scoring against the ground truth; pair it with custom evaluators or Error Feed clustering for named-entity, jargon, and accent-specific diagnostics. Run it on a labeled holdout set of 500-1000 utterances representative of your production audio. Run it again on every provider you're evaluating. The audio_transcription rubric ships in the Apache 2.0 SDK as one of the 70+ built-in eval templates.
What's the latency impact of streaming versus batch STT?
Real-time STT produces a first-partial transcript in 100-300ms and a final in 400-1000ms after end of speech. Offline STT produces a final transcript in roughly 0.1-0.3x real-time, so a 60-second call returns in 6-20 seconds. The streaming overhead (chunking, intermediate state, WebSocket maintenance) is what costs the 100-200ms latency floor. The batch model can use bidirectional context (the model sees the whole utterance before producing tokens) which is where the WER gain comes from.
How does Future AGI help with STT evaluation in production?
traceAI captures STT provider, confidence score, first-partial latency, and final latency as span attributes via the dedicated traceAI-pipecat and traceai-livekit packages. ai-evaluation runs audio_transcription on every call or on a sampled fraction, scoring against ground truth where available and using semantic checks where it's not. Error Feed auto-clusters mistranscriptions into named failure modes: background noise, accent, jargon, cross-talk. Future AGI Protect runs sub-100ms inline so safety scanning on transcripts fits inside the voice budget.