
Best Voice AI Models in April 2026: STT, TTS, and Voice Agent Stack

Best Voice AI April 2026: compare OpenAI Realtime API, Deepgram, Cartesia, ElevenLabs, Vapi, and Retell for STT, TTS, latency, and voice agents.

Voice AI stack diagram for April 2026 with the OpenAI Realtime API + gpt-realtime-1.5 launch on April 23 highlighted as the month's structural change. Deepgram Nova-3 + Flux for STT, Cartesia Sonic 3 for TTS, Vapi and Retell AI for orchestration.

April 2026 was the month voice AI hit production maturity. OpenAI shipped the Realtime API with gpt-realtime-1.5 on April 23, opening up a new architecture (speech directly to the LLM, no STT step). The rest of the stack (Cartesia, Deepgram, ElevenLabs, Vapi, Retell) settled into clear category leaders. Hitting sub-700ms end-to-end on a managed platform is now an engineering pick, not a research problem.


TL;DR: Best voice AI per layer, April 2026

| Layer | Best pick | Why | Pricing |
|---|---|---|---|
| Streaming STT | Deepgram Nova-3 | 6.84% WER, sub-300ms streaming | $0.0077/min |
| STT with intelligence | AssemblyAI Universal-2 | Streaming + summarization, entity, sentiment | ~$0.0025/min base |
| Highest batch accuracy | Google Cloud Chirp | 11.6% WER, 125+ languages | varies |
| Open-source STT | Whisper Large V3 | 7.4% WER avg, 99+ languages | self-host |
| Turn-taking detection | Deepgram Flux | Purpose-built end-of-turn | bundled |
| TTS for sub-300ms agents | Cartesia Sonic Turbo | 40ms TTFA | premium |
| TTS quality and cloning | ElevenLabs v3 Multilingual | 32+ languages, voice cloning | premium |
| Native audio endpoint (NEW) | OpenAI Realtime API + gpt-realtime-1.5 (April 23) | Speech-to-speech, no STT step | $32/$64 per M audio tokens |
| Voice agent platform default | Retell AI | ~600-780ms E2E, $0.07/min, HIPAA | $0.07/min |
| Voice agent at scale | Vapi | 300M+ cumulative calls, 99.9% uptime | varies |

If you only read one row: OpenAI’s Realtime API + gpt-realtime-1.5 (April 23) opens a second architecture for voice agents: speech-to-speech, no STT step. The classic STT-LLM-TTS stack (Deepgram + GPT-5.5 + Cartesia, orchestrated by Retell or Vapi) remains the default for long-form reliability. Pick the Realtime API if prosody preservation matters more than per-minute price (audio bills 6-12x text rates).

The story of voice AI in April 2026

April 2026 had two stories, one big and one quiet.

The big story: OpenAI shipped the Realtime API with gpt-realtime-1.5 on April 23. The Realtime API accepts speech as direct input, not via a separate STT step. The result: prosody, emotion, accent context, and background-noise tolerance that the STT-then-LLM pipeline discards. Press coverage often called this “GPT-5.5 native audio,” but per OpenAI’s official model and pricing pages the audio path is gpt-realtime-1.5, not GPT-5.5 itself (which remains a text and image input model). This is a different architecture, not just a new model. For applications where audio context is the signal (mental health, accessibility, accent-sensitive products), the Realtime API path is a meaningful upgrade. For applications where transcript accuracy on long conversations is the bottleneck, the classic STT-LLM-TTS pipeline remains stronger.

The quiet story: The classic voice-agent stack matured. Deepgram Nova-3 plus Flux is the production default for streaming STT with end-of-turn detection. Cartesia Sonic Turbo at 40ms TTFA makes sub-300ms end-to-end agents reachable. ElevenLabs v3 leads voice quality and cloning. Vapi has handled 300M+ cumulative calls at 99.9% uptime (99.99% enterprise). Retell AI sits at roughly 620ms measured end-to-end with HIPAA included. The components compose. Custom voice-agent stacks no longer need 3-6 months of engineering to reach platform quality.

Best speech-to-text models in April 2026

Deepgram Nova-3. The streaming STT default

The production default for streaming STT in April 2026. Deepgram’s Nova-3 announcement reports 6.84% median streaming WER across 81 hours of audio in 9 domains, with batch dropping that to 5.26% (a 1.58-point gap, the smallest among major providers offering both modes).

Specs: 6.84% streaming WER (vendor); ~18.3% on the independent Artificial Analysis AA-WER index; sub-300ms streaming latency; 30+ streaming languages; $0.0077/min streaming, $0.0043/min batch.

Production reality. Customer logos publicly disclosed include NASA (ISS-to-Mission-Control transcription), Spotify, Twilio, Citi. Vapi runs on Deepgram for STT. Production complaints cluster around WER spikes when users speak over background music and modest accuracy degradation on heavy-accent audio without keyterm prompting.

Best for: Production voice agents where end-to-end latency is binding. Real-time captioning. Live conversational AI.

Skip if: You need batch accuracy across rare languages (use Google Cloud Chirp). You need bundled speech intelligence (use AssemblyAI Universal-2).

AssemblyAI Universal-2. Best for streaming STT with bundled intelligence

Universal-2’s pitch is the bundled intelligence layer: summarization, entity detection, sentiment, and PII redaction ship in the same API call without per-feature surcharges. AssemblyAI’s Universal-2 research blog reports 14.5% WER on its mixed-content benchmark (CommonVoice + Fleurs + VoxPopuli + 60 hours in-house call-center, podcast, broadcast, and webinar audio).

Specs: 14.5% WER on AssemblyAI’s challenging benchmark; 5.65-6.7% on cleaner segments via Artificial Analysis; 300-600ms streaming latency (consistently slower than Deepgram in third-party tests); $0.37/hr batch async, $0.47/hr streaming on standard plans; 99+ languages.

Production reality. Customer logos: WSJ, NBC Universal, Spotify advertising, CallRail (+23% accuracy lift after the Universal-2 upgrade), Veed, Descript, Podchaser. Universal-Streaming runs at meaningfully higher latency than Deepgram in independent tests; the value is the depth on every transcript feature, not the latency floor.

Best for: Voice agents where post-call analytics or entity extraction is part of the product. Customer support call intelligence. Compliance-heavy verticals needing inline PII redaction.

Skip if: Pure transcript accuracy is the goal (Deepgram is faster). End-to-end latency under 500ms is binding.

Whisper Large V3. Best open-source self-hosted STT

OpenAI’s open-weight Whisper Large V3 is the self-hosted default for privacy-sensitive workloads. The model card reports that its sequential long-form decoding algorithm beats chunked decoding by roughly 0.5 points of WER on batch workloads.

Specs: 7.4% WER average across mixed benchmarks; 1.55B parameters; 99+ languages; 10GB VRAM minimum; no built-in diarization; no native streaming.

Production failure modes worth knowing. Whisper has well-documented hallucination patterns: it hallucinates “Subscribe to my channel” on silence (YouTube training-data bleed), can enter 20-minute repetition loops on long audio, and hallucination rates run roughly 5-6x higher on non-English languages. Deepgram’s audit reports v3 hallucinates 4x more often than v2 on hour-long audio.
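If you serve long audio through Whisper, a cheap transcript-level guard catches the repetition-loop failure mode before it reaches users. A minimal sketch; the n-gram size and threshold are illustrative, not tuned values:

```python
def ngram_diversity(transcript: str, n: int = 5) -> float:
    """Ratio of distinct n-grams to total n-grams over word tokens.

    Healthy prose scores near 1.0; a repetition loop that decodes the
    same phrase over and over scores near 0.0.
    """
    words = transcript.split()
    if len(words) < n:
        return 1.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

def looks_looped(transcript: str, threshold: float = 0.1) -> bool:
    # Threshold is a guess; calibrate it on your own long-audio corpus.
    return ngram_diversity(transcript) < threshold

assert looks_looped("thank you for watching " * 50)
assert not looks_looped("the agent confirmed the time and read back the street address")
```

Flag, re-chunk, and re-transcribe anything that trips the check rather than shipping the transcript downstream.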

Pricing. OpenAI hosts Whisper at $0.006/min, but most teams self-host on Together / Replicate / Fal.ai / their own GPUs. Self-hosted breaks even vs Deepgram around 200K minutes/month given GPU rental costs.
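The break-even claim is easy to sanity-check against your own volumes. A sketch using the $0.0077/min Deepgram streaming rate and ~$0.001/min GPU rental figures from this post, plus an assumed ~$1,340/mo fixed overhead for a reserved GPU and ops time (that fixed figure is a hypothetical, chosen only to illustrate the crossover):

```python
DEEPGRAM_STREAMING_PER_MIN = 0.0077  # Deepgram's published streaming rate
GPU_RENTAL_PER_MIN = 0.001           # rough self-host GPU figure from this post
FIXED_SELF_HOST_MONTHLY = 1340.0     # assumed: reserved GPU + ops overhead

def deepgram_cost(minutes: float) -> float:
    return minutes * DEEPGRAM_STREAMING_PER_MIN

def self_host_cost(minutes: float) -> float:
    return minutes * GPU_RENTAL_PER_MIN + FIXED_SELF_HOST_MONTHLY

# With these assumptions the curves cross right around 200K min/month:
assert abs(self_host_cost(200_000) - deepgram_cost(200_000)) < 5
assert self_host_cost(100_000) > deepgram_cost(100_000)
```

Swap in your actual GPU and ops numbers; the crossover point moves linearly with the fixed cost.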

Best for: Self-hosted deployments above 500K minutes per month. Privacy-sensitive workloads where audio cannot leave your infrastructure. Edge inference.

Skip if: You need streaming under 300ms (Whisper has none). You serve hour-long audio (the repetition-loop failure mode is real). You need production support and SLA.

Deepgram Flux. Best for end-of-turn detection

The model-integrated end-of-turn detection layer that turns generic STT into voice-agent-ready STT. Generic STT APIs do not signal when the user has actually finished speaking, which is the difference between a voice agent that talks over the user and one that does not. Flux detects end-of-turn in under 400ms and pairs natively with Nova-3. Multilingual GA shipped April 29, 2026 across 10 languages with mid-call switching.

Partner integrations: Vapi, LiveKit Agents, Pipecat, Cloudflare Workers AI, Jambonz.

Best for: Any production voice agent. Worth more than 1-2 points of WER accuracy in user-experience terms.

Skip if: Your product is transcription-only (no agent turn-taking). You self-host STT (Flux is Deepgram-only).
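To see why a purpose-built end-of-turn model earns its keep, compare it with the naive baseline it replaces: a fixed silence timer over VAD frames. The sketch below (frame size and threshold are illustrative) fires on any pause long enough to cross the timer, including mid-sentence hesitations, which is exactly the talk-over failure Flux is built to avoid:

```python
def naive_end_of_turn(vad_frames, frame_ms=20, silence_ms=700):
    """Return the frame index where a fixed silence threshold would
    declare end-of-turn, or None. vad_frames: True = speech in frame."""
    needed = silence_ms // frame_ms
    run = 0
    for i, speaking in enumerate(vad_frames):
        run = 0 if speaking else run + 1
        if run >= needed:
            return i
    return None

# An 800ms mid-sentence pause trips the naive detector even though the
# speaker resumes afterward -- the agent would start talking over them.
pause = [True] * 10 + [False] * 40 + [True] * 10
assert naive_end_of_turn(pause) is not None
```

Raising the threshold trades talk-over for sluggish responses; a model-based detector like Flux conditions on content rather than silence alone.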

Best text-to-speech models in April 2026

Cartesia Sonic Turbo. The TTS latency leader at 40ms TTFA

The latency leader. 40ms TTFA is what makes sub-300ms end-to-end voice agents reachable.

Specs: 40ms TTFA on Turbo, 90ms on standard Sonic 3; 15+ languages; 5-second voice cloning sample; streaming output.

Production reality. The 40ms TTFA claim is the structural unlock for sub-300ms voice agents, but worth a caveat: the only public load test of Sonic Turbo’s 40ms claim under traffic comes from Together AI co-marketing material, not an independent third-party benchmark. Run your own load test if you intend to depend on the 40ms floor at concurrency. Cartesia also publishes a hosted runtime (Cartesia Line) optimized to keep Sonic Turbo on the hot path.

Best for: Sub-500ms round-trip turn-latency voice agents. Real-time interactive applications. Telephony where dropped frames or jitter are unacceptable.

Skip if: Voice quality and cloning fidelity matter more than latency (use ElevenLabs v3). You’re locked into the ElevenLabs ecosystem for branded voices.

ElevenLabs v3 Multilingual. Best for voice quality and cloning

The voice-quality and cloning leader. ElevenLabs v3 is built for expressiveness, character voices, and audiobook-style read-aloud, not real-time voice agents. For real-time, use Flash v2.5.

Specs: 32+ languages; best-in-category voice cloning; strong multi-language emotion. v3 is not real-time-optimized; ElevenLabs explicitly notes v3 should be paired with a different model for low-latency agents. For sub-100ms latency use Flash v2.5 at ~75ms model inference.

Pricing reality. ElevenLabs ships character-based pricing with well-documented gotchas: 2x credit charge on premium voices, failed generations consume credits, generations under 5 characters bill at 5-character minimum. Production budgets typically run 1.4-1.7x list price after retries and edge cases.
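Those gotchas are mechanical enough to model in a budget spreadsheet. A hedged sketch: the 5-character minimum and 2x premium-voice multiplier come from the list above, while the retry rate is an assumption you should replace with your own failure logs:

```python
def estimated_credits(text: str, premium_voice: bool = False,
                      retry_rate: float = 0.25) -> float:
    """Estimate billed characters for one generation under the rules above."""
    chars = max(len(text), 5)          # sub-5-character requests bill at 5
    if premium_voice:
        chars *= 2                     # premium voices bill double
    return chars * (1 + retry_rate)    # failed generations still burn credits

assert estimated_credits("hi") == 6.25                         # 5 * 1.25
assert estimated_credits("hello", premium_voice=True) == 12.5  # 5 * 2 * 1.25
```

Running this over a sample of real agent turns is the fastest way to see whether your budget lands nearer the 1.4x or the 1.7x end of the range.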

Best for: Creator content, audiobook generation, branded voice products, character voice consistency, high-quality multi-language consumer voice.

Skip if: Sub-100ms TTFA is required (use Cartesia or Flash v2.5). Predictable per-minute pricing matters (use Cartesia or Sonic).

OpenAI gpt-4o-mini-tts and gpt-realtime. Best for LLM-controlled instructable voice

OpenAI ships two distinct voice paths in April 2026 that are easy to confuse. gpt-4o-mini-tts is the instructable-voice TTS: text in, audio out, with natural-language voice character control (“speak with calm authority”, “sound urgent”). gpt-realtime / gpt-realtime-1.5 is the speech-to-speech native-audio endpoint accessed via the Realtime API.

Specs. gpt-4o-mini-tts: ~200ms TTFA; 50+ languages; natural-language voice character control. gpt-realtime-1.5: covered in the Native Audio section below; $32/M input, $64/M output.

Production verdict. Public benchmarks on gpt-realtime show task completion drops from ~0.70 (clean text both sides) to ~0.50 (raw audio both sides) on free-form long conversations because ASR drift compounds across turns. For long-form voice agents, the pipeline (Deepgram + GPT-5.5 + ElevenLabs Flash) is often more reliable despite the higher latency floor. For short structured turns under 30 seconds where prosody matters, gpt-realtime wins.

Best for: Voice agents where the LLM should control voice character based on conversation context. OpenAI ecosystem teams who want one vendor across LLM and audio. Short structured prosody-sensitive turns (mental-health, accessibility).

Skip if: Sub-100ms latency is required (use Cartesia + a fast LLM). Free-form long conversations are the workload (ASR drift compounds; classic pipeline is more bounded).

Native audio architectures (the structural shift in April)

OpenAI Realtime API + gpt-realtime-1.5. First production-grade speech-to-speech endpoint

Worth flagging the naming. OpenAI’s official model and pricing pages document the audio path as the Realtime API powered by gpt-realtime-1.5, not as “GPT-5.5 native audio.” GPT-5.5 itself is listed as a text/image-input → text-output model. Press coverage in April collapsed both into “GPT-5.5 native audio,” which conflates two separate endpoints.

Two voice agent architectures April 2026: classic STT-LLM-TTS pipeline versus the OpenAI Realtime API native speech-to-speech endpoint.

Architecture difference from STT-LLM-TTS:

  • STT pipeline: speech → text → LLM → text → TTS (loses prosody, emotion, accent on the speech-to-text hop; reintroduces it on the TTS hop)
  • Realtime API: speech → speech directly through one model (preserves audio context end-to-end, no transcript intermediate)

Pricing. gpt-realtime-1.5 audio is $32/M input and $64/M output per OpenAI’s pricing page. That is roughly 6x the text-only GPT-5.5 rate ($5/$30) on input and ~2x on output. Audio token-counting also differs from text token-counting; budget accordingly.

Production verdict. Public benchmarks on gpt-realtime show task completion drops from ~0.70 (clean text both sides) to ~0.50 (raw audio both sides) when free-form long conversations are involved. ASR drift compounds across turns. For free-form long conversations, the classic Cartesia + Nova-3 + fast-LLM pipeline is often more reliable despite the higher latency floor.

Best for: Mental-health support, accessibility products, accent-sensitive flows, applications where prosody is the signal (sarcasm, hesitation, emphasis). Short structured turns under 30 seconds.

Skip if: Transcript accuracy on long conversations is what matters (use Deepgram + LLM). You need predictable per-turn latency at scale (the classic pipeline is more bounded). Cost matters at volume (audio rates are ~6-12x text rates).

STT comparison at a glance

| Provider | Streaming WER (vendor) | AA-WER (independent) | Streaming latency | Price |
|---|---|---|---|---|
| Deepgram Nova-3 | 6.84% | ~18.3% | sub-300ms | $0.0077/min streaming, $0.0043/min batch |
| AssemblyAI Universal-2 | 14.5% | 5.65-6.7% (cleaner segments) | 300-600ms | $0.47/hr streaming |
| Google Cloud Chirp 3 | 4-7% (clean studio) | 10-15% | 300-600ms | $0.016/min |
| Whisper Large V3 | n/a (batch only) | 5-8% (clean) | n/a | $0.006/min hosted |
| ElevenLabs Scribe v2 Realtime | 6.5% (vendor, FLEURS) | not yet ranked | 150ms | bundled credits |

The vendor-vs-AA-WER gap is the key trust signal. Vendors publish numbers on benchmarks they may have trained on; Artificial Analysis runs an independent suite on AgentTalk, VoxPopuli-Cleaned, and Earnings22-Cleaned. Run a domain reproduction before committing.
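A domain reproduction needs nothing more than your own reference transcripts and a standard WER implementation (Levenshtein edit distance over word tokens):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

assert wer("the cat sat on the mat", "the cat sat on the mat") == 0.0
assert abs(wer("the cat sat on the mat", "the cat sat on a mat") - 1 / 6) < 1e-9
```

Note that production pipelines normalize casing, punctuation, and number formatting before scoring; the raw function above will overstate WER on unnormalized text.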

Best voice agent platforms in April 2026

Voice agent platform end-to-end latency comparison April 2026: ElevenLabs Conversational, Vapi sub-500ms, Deepgram Voice Agent sub-400ms with Flux, Retell AI 600-780ms.

Retell AI. Most-teams default

About 620ms measured end-to-end. $0.07/min. HIPAA included. No platform fee. Both no-code builder and developer SDK.

Best for: Most production voice agents where sub-700ms is acceptable, HIPAA matters, managed platform reduces engineering load.

Skip if: Sub-500ms required (use Vapi or roll your own). Self-hosted required (use Deepgram Voice Agent).

Vapi. The scale pick

300M+ cumulative calls, 99.9% uptime (99.99% enterprise), sub-500ms average latency, multi-channel (voice + SMS + chat).

Best for: Production voice agents at millions-of-calls-per-month scale.

Skip if: You are below 100K calls/month (Retell is more cost-effective).

Deepgram Voice Agent. Best self-hosted voice agent stack

Bundled stack with self-hosted option. Sub-400ms with Flux.

Best for: Self-hosted compliance, bundled pricing without LLM pass-through surprises.

Skip if: You want managed-service simplicity.

ElevenLabs Conversational. Best voice-quality-first agent platform

Voice-quality-first agent platform with ElevenLabs v3 TTS bundled.

Best for: Voice products where the voice is the brand, character voice agents, premium consumer products.

Skip if: Latency is binding (use Cartesia + Vapi/Retell).

Bland AI. Best for outbound phone agents at scale

Outbound phone volume. Norm agent builder. Tiered per-minute pricing per bland.ai/pricing: Start tier is 10 concurrent / 100 calls/day, Build is 50 / 2,000, Scale is 100 / 5,000, Enterprise is unlimited. HIPAA is included on Build and above (no separate add-on).

Best for: Outbound phone agents, sales operations, structured outbound calls.

Open-source agent platform alternatives

If you do not want a managed agent platform, four frameworks and runtimes let you compose the STT-LLM-TTS pipeline yourself with realistic production latency:

  • Pipecat (Daily). End-to-end latency 800-950ms across community reports. Strong adapter ecosystem.
  • LiveKit Agents. ~750-900ms E2E. The same LiveKit voice infra FutureAGI’s agent-simulate SDK uses.
  • Daily Bots. Hosted Pipecat runtime. Per-minute compute + provider passthrough.
  • Cartesia Line. Cartesia’s runtime, optimized for Sonic Turbo’s 40ms TTFA on the hot path.

HIPAA tier matrix

The clearest pricing differentiator across managed platforms. Production teams in regulated verticals hit a $1,000/mo HIPAA wall on Vapi or ElevenLabs, while Retell and Bland include it on standard plans:

| Platform | HIPAA included? | Where it costs more |
|---|---|---|
| Retell AI | Yes, standard plan, self-serve BAA | No upcharge |
| Bland AI | Yes, on Build/Scale | Plan-tier upgrade |
| Deepgram Voice Agent | Enterprise contract only | Sales-led BAA |
| Vapi | $1,000/mo add-on | Largest gotcha in the category |
| ElevenLabs Conversational | Enterprise + Zero Retention Mode | Reportedly $1,000/mo + Enterprise contract |

End-to-end latency budget in April 2026

| Component | Sub-300ms pick | Sub-700ms pick |
|---|---|---|
| STT | Deepgram Nova-3 (sub-300ms) | Deepgram Nova-3 |
| LLM | GPT-5 mini or Gemini 3.1 Flash | GPT-5.5, Claude Opus 4.7, etc. |
| TTS | Cartesia Sonic Turbo (40ms TTFA) | ElevenLabs Flash v2.5 (~75ms) |
| Orchestration | Tight platform-native | Standard platform |
| Total | ~300ms achievable | ~700ms achievable |

The 40ms Cartesia Sonic Turbo TTFA is what makes sub-300ms reachable. Without it, the rest of the stack does not have enough budget.
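A latency budget is just arithmetic, which makes it worth writing down per component. The split below is illustrative; the only sourced numbers are Sonic Turbo's 40ms TTFA and Nova-3's sub-300ms streaming envelope, and the LLM and transport shares are assumptions. The point is how little headroom survives once TTS takes more than ~40ms:

```python
def fits_budget(components: dict, budget_ms: int) -> bool:
    """True if the per-turn component latencies sum inside the budget."""
    return sum(components.values()) <= budget_ms

sub_300 = {
    "stt_streaming_partial": 110,  # assumed share of Nova-3's sub-300ms envelope
    "llm_first_token": 130,        # assumed: mini-tier model, short prompt
    "tts_ttfa": 40,                # Cartesia Sonic Turbo's published TTFA
    "transport_and_glue": 20,      # assumed network + orchestration overhead
}
assert fits_budget(sub_300, 300)

# Swap in a ~100ms TTS and the otherwise identical stack blows the budget:
assert not fits_budget({**sub_300, "tts_ttfa": 105}, 300)
```

Measure each component under your own load before trusting any split like this; vendor TTFA figures are usually single-stream numbers.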

Cost at scale: what 100K minutes/month actually costs

Per-minute list price hides 5-10x cost variance once you account for LLM passthrough, character-pricing markups for TTS, retry rate, and HIPAA add-ons. April 2026 specifically introduces a new cost vector with the Realtime API: gpt-realtime-1.5 audio bills at $32/M input, $64/M output, which is 6-12x text rates.

| Stack | Component pricing | Estimated monthly cost (100K min) |
|---|---|---|
| Retell managed | $0.07/min flat | $7,000 |
| Vapi BYO + Deepgram + GPT-5.5 + ElevenLabs Flash | $0.05/min platform + $0.0043/min STT + ~$0.02/min text LLM + ~$0.04/min TTS (1.5x retry buffer) | $11,500-14,000 |
| OpenAI Realtime API alone | gpt-realtime-1.5 audio at $32/$64 per M tokens, ~3K tokens per min | ~$15,000-25,000 audio-only |
| Bland AI Build tier | Tier-bundled | $6,000-9,000 |
| Self-hosted (Whisper + Claude API + Cartesia + Pipecat) | GPU rental ~$0.001/min + Claude ~$0.025/min + Cartesia ~$0.03/min | $6,000-7,500 + on-call cost |

The honest framing: under 100K min/month, Retell’s flat $0.07/min usually wins on total cost of ownership. Above 1M min/month, BYO economics flip. The Realtime API + gpt-realtime-1.5 is the highest-cost path on this list. Use it when prosody matters more than per-minute price (mental health, accessibility), not as a default.
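Realtime API budgeting hinges on one assumption: audio tokens per minute. A sketch using the $32/$64 per-million rates cited in this post and two token-rate scenarios (the per-minute splits are assumptions, not OpenAI guidance):

```python
def realtime_audio_cost(minutes: float,
                        in_tokens_per_min: float,
                        out_tokens_per_min: float,
                        in_rate_per_m: float = 32.0,
                        out_rate_per_m: float = 64.0) -> float:
    """Monthly audio-token spend at the cited $/million-token rates."""
    per_min = (in_tokens_per_min * in_rate_per_m +
               out_tokens_per_min * out_rate_per_m) / 1_000_000
    return minutes * per_min

# Two assumed workloads at 100K minutes/month:
listen_heavy = realtime_audio_cost(100_000, 1_500, 1_500)  # ~1.5K tokens each way
chatty = realtime_audio_cost(100_000, 3_000, 3_000)        # ~3K tokens each way
assert round(listen_heavy) == 14_400
assert round(chatty) == 28_800
```

Those two scenarios roughly bracket the $15,000-25,000 estimate in the table above; instrument real sessions to pin down your own tokens-per-minute figure before committing.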

ElevenLabs character-pricing reality has well-documented gotchas: 2x credit charge on premium voices, failed generations consume credits, generations under 5 chars bill at 5-char minimum. Production budgets typically run 1.4-1.7x list. Build the buffer in.

Decision framework

Choose Retell AI for managed default. Sub-700ms, HIPAA, $0.07/min.

Choose Vapi for scale. Millions of calls per month, multi-channel, 99.99% SLA.

Choose Deepgram Voice Agent for self-hosted. Bundled pricing, Flux turn-taking.

Choose ElevenLabs Conversational for voice-quality-first products.

Choose the OpenAI Realtime API + gpt-realtime-1.5 for prosody-sensitive applications. Mental health, accessibility, emotion-aware agents. Audio rates ($32/$64 per M tokens) are 6-12x text rates, so reserve for short turns where prosody is the signal.

Roll your own (Cartesia + Deepgram + LLM) for sub-300ms targets.

Common mistakes

  1. Picking TTS by quality and ignoring latency. Sub-300ms requires Cartesia.
  2. Skipping turn-taking detection. Generic STT does not handle end-of-turn. Worth more than WER.
  3. Ignoring the LLM as a latency cost. A 300ms STT plus 1500ms LLM is not a 300ms agent.
  4. Using STT-then-LLM when audio context is the signal. April 2026 introduced native speech-to-speech via the Realtime API + gpt-realtime-1.5; pick it when prosody matters.
  5. Building from scratch without a realistic timeline. Custom voice stacks take 3-6 months to reach platform quality.

How Future AGI fits

Future AGI provides reliability infrastructure for voice agents. Simulation generates voice scenarios (accents, background noise, interruptions). Eval models score voice outputs across groundedness, hallucination, accent handling, and tool-call accuracy. Guardrails block bad outputs at the gateway in under 100ms (voice latency budgets demand that speed). Open-source, self-hostable.


See also: Best LLMs of April 2026 for the LLM brain in your voice agent. Next voice post: Best Voice AI of May 2026. Previous: Best Voice AI of March 2026.

Frequently asked questions

What changed in voice AI during April 2026?
The single biggest April 2026 voice change was OpenAI shipping the Realtime API with gpt-realtime-1.5 on April 23. The Realtime API accepts speech as direct input rather than requiring a separate STT step, which preserves prosody, emotion, and accent disambiguation that gets lost in STT-then-LLM pipelines. Press coverage often called this GPT-5.5 native audio, but per OpenAI's official model and pricing pages the audio path is gpt-realtime-1.5 at $32 per million input tokens and $64 per million output tokens, while GPT-5.5 itself is a text and image input model. The other components (Cartesia Sonic 3, Deepgram Nova-3 plus Flux, ElevenLabs v3, Vapi, Retell AI) were stable from March.
What is the best speech-to-text model in April 2026?
Deepgram Nova-3 is the production default at [6.84% WER](https://deepgram.com/learn/introducing-nova-3-speech-to-text-api) and sub-300ms streaming latency for $0.0077 per minute streaming ($0.0043/min batch). AssemblyAI Universal-2 ships streaming STT with summarization, entity, and sentiment intelligence bundled. Google Cloud Chirp leads batch accuracy at 11.6% WER across 125+ languages. Whisper Large V3 is the open-source default at 7.4% WER average. ElevenLabs Scribe is the accent and education pick at 3.5% WER with diarization. Deepgram Flux is the turn-taking detection layer that, paired with Nova-3, fixes a category-wide gap in voice agents.
What is the best text-to-speech model for voice agents in April 2026?
Cartesia Sonic Turbo is the latency leader at 40ms time-to-first-audio. ElevenLabs v3 Multilingual leads voice quality and cloning. OpenAI gpt-4o-mini-tts (instructable TTS) and gpt-realtime (speech-to-speech) lead LLM-controlled voice character. Hume Octave leads emotion. PlayHT leads conversational long-form. Sub-300ms end-to-end agents need Cartesia. If voice quality comes first, choose ElevenLabs.
Should I use the OpenAI Realtime API instead of STT plus LLM in April 2026?
Native audio endpoints preserve prosody, emotion, and accent context that STT discards. For applications where audio context is the signal (mental health, accessibility, accent-sensitive products), gpt-realtime-1.5 via the Realtime API is a meaningful upgrade. For applications where transcript accuracy on long conversations is the bottleneck (call summarization, structured intent extraction), STT plus LLM remains stronger. Public benchmarks show gpt-realtime task completion drops from roughly 0.70 on clean text both sides to 0.50 on raw audio both sides because ASR drift compounds across turns. Audio also bills at 6 to 12 times text token rates. Pick the Realtime API for short prosody-sensitive turns, the classic pipeline for long-form reliability.
What is the best voice agent platform in April 2026?
Retell AI is the most-teams default. [About 620ms measured end-to-end latency](https://www.retellai.com/resources/ai-voice-agent-latency-face-off-2025), $0.07 per minute, HIPAA included. Vapi is the pick at scale (300M+ cumulative calls, 99.9% uptime). ElevenLabs Conversational wins on voice quality. Deepgram Voice Agent wins on bundled pricing and self-hosted option. Bland AI is purpose-built for outbound phone volume.