7 Best TTS Providers for Voice AI Agents in 2026 (Tested + Ranked)
Tested and ranked: the 7 best TTS providers for voice AI agents in 2026, with real per-character pricing, streaming TTFA latency, voice cloning, and SSML support.
Picking a TTS provider in 2026 is no longer a quality contest. It’s a four-axis decision: streaming time-to-first-audio, voice realism, voice cloning, and cost per character. The right pick depends on which axis your voice agent lives or dies on. This post tests seven providers head to head, ranks the top five for the use cases they actually win, and shows how to wire the eval layer so you catch regressions the moment a voice or model update ships.
TL;DR: pick by exit reason
| You exit because… | Pick this provider | Why |
|---|---|---|
| You need the lowest streaming latency for a sub-500ms voice budget | Cartesia Sonic 3.5 | Sub-90ms TTFA over WebSocket on Sonic 3.5 |
| Voice realism and cloning are the pitch | ElevenLabs | Category benchmark for quality, Professional Voice Cloning |
| You want low latency without the cloning premium | Deepgram Aura-2 | Sub-300ms TTFA, natural prosody, ~$0.030 per 1K characters |
| You need cheap utility voices at scale (IVR, status alerts) | Rime AI | Mist v3 / Arcana v3 models, low-latency utility voices |
| You’re already on the OpenAI stack and want decent default voices | OpenAI gpt-4o-mini-tts | Token-priced default, six built-in voices |
| You need deep SSML and viseme output for lip sync | Azure Neural TTS | Most complete SSML, lexicon files, viseme stream |
| You want a generalist with character voices for entertainment | PlayHT | Strong character voice library, Play 3.0-mini |
Below are the top five, ranked with the numbers we tested against. Honorable mentions follow the ranking.
How we tested
We used a dated benchmark corpus that mixes short utility responses, medium conversational replies, and long-form responses. We measured streaming TTFA over WebSocket where supported, per-character pricing on the published tier, voice cloning fidelity against a fixed reference clip, and SSML coverage against the W3C SSML 1.1 spec. Output audio was scored with Future AGI’s audio_quality rubric, and measured TTFA is reported only where we have run data.
The numbers below are published pricing as of 2026-03-01 and measured TTFA from a US-East egress point against the provider’s nearest region. Your numbers will shift slightly with region and call volume tier.
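To reproduce a TTFA measurement on your own network path, the core of the harness is small: start a clock when the request goes out and stop it at the first non-empty audio chunk. The sketch below assumes the provider client exposes audio as an iterable of byte chunks; `fake_tts_stream` is a hypothetical stand-in, not any vendor’s API.

```python
import time
from typing import Iterable, Iterator


def measure_ttfa_ms(chunks: Iterable[bytes], started_at: float) -> float:
    """Return milliseconds from started_at until the first non-empty audio chunk."""
    for chunk in chunks:
        if chunk:  # ignore empty keep-alive frames
            return (time.perf_counter() - started_at) * 1000.0
    raise RuntimeError("stream ended before any audio arrived")


def fake_tts_stream(first_chunk_delay_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider stream; swap in a real client."""
    time.sleep(first_chunk_delay_s)  # simulates network plus synthesis delay
    yield b"\x00" * 320
    yield b"\x00" * 320


start = time.perf_counter()
ttfa = measure_ttfa_ms(fake_tts_stream(0.05), start)
print(f"TTFA: {ttfa:.0f} ms")
```

Run it many times per provider and compare percentiles, not a single sample; one slow handshake can skew an average.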
1. Cartesia Sonic — lowest streaming TTFA
Use case fit: real-time voice agents with a sub-500ms response budget. Outbound dialers. Live conversational agents where any TTFA over 250ms breaks the illusion of natural turn-taking.
Streaming TTFA: sub-90ms over WebSocket on Sonic 3.5 (sonic-3.5-2026-05-04). This is the number that matters more than any other on this list for live voice. Cartesia ships a custom State Space Model architecture instead of the transformer stack most TTS providers use, and the latency advantage is direct and reproducible. Sonic 2 is deprecated; migrate before the June 1, 2026 cutover.
Pricing: Cartesia ships credit-based plans (Creator, Scale, Enterprise) rather than a flat per-1M-character rate; confirm current plan allowances on the Cartesia pricing page. At production volume Cartesia generally lands several times cheaper than ElevenLabs Pro overage for utility voice.
Voice library: about 100 voices on the public catalog across 42 languages on Sonic 3.5. Smaller than ElevenLabs, broader language coverage than Deepgram Aura-2.
Voice cloning: Instant Voice Cloning from a short reference clip. Quality is good for narration use, not as expressive as ElevenLabs Professional Voice Cloning for character-driven agents.
SSML: breaks, emphasis, pronunciation hints. Not as deep as Azure but covers the common voice agent needs.
Streaming protocols: WebSocket native (this is where the latency wins land), plus HTTP streaming and REST for batch.
Where Cartesia wins: Cartesia and ElevenLabs Flash v2.5 are the two production-grade TTS providers hitting sub-100ms TTFA consistently. Cartesia’s WebSocket-native protocol is the deeper of the two for streaming voice agents. If your call routing pipeline is sub-500ms total budget (Vapi or Retell or LiveKit) and you need TTS to leave at least 400ms for STT plus LLM plus tool calls, Cartesia gives you headroom.
Where it loses: voice realism on long-form expressive content (audiobook narration, character voices) lags ElevenLabs. The catalog is also thinner for niche languages.
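The headroom argument is plain subtraction, and it helps to see it per provider. The sketch below uses the TTFA figures quoted in this post (the upper end where a range is given) against a 500ms total budget; the dictionary keys are illustrative labels, not API model IDs.

```python
TOTAL_BUDGET_MS = 500

# TTFA figures quoted in this post, in milliseconds (upper end of each range).
tts_ttfa_ms = {
    "cartesia-sonic-3.5": 90,
    "elevenlabs-flash-v2.5": 75,
    "deepgram-aura-2": 250,
    "openai-gpt-4o-mini-tts": 500,
}


def headroom_ms(ttfa_ms: int, total_ms: int = TOTAL_BUDGET_MS) -> int:
    """Milliseconds left for STT, LLM, and tool calls after TTS takes its share."""
    return total_ms - ttfa_ms


for name, ttfa in tts_ttfa_ms.items():
    print(f"{name}: {headroom_ms(ttfa)} ms left for STT + LLM + tools")
```

Replace the quoted figures with your own measured TTFA before drawing conclusions.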
2. ElevenLabs — voice quality and cloning category benchmark
Use case fit: branded consumer apps where voice is part of the product identity. Audiobook and podcast generation. Character-driven agents (game NPCs, immersive AI companions). Long-form expressive content where prosody nuance matters more than 100ms of latency.
Streaming TTFA: ~75ms on Flash v2.5 (real-time path), 200 to 300ms over WebSocket on Turbo v2.5, 500 to 800ms on Multilingual v2. Flash v2.5 is the right pick for live voice agents, Turbo v2.5 for short-form responses where expressive prosody matters, Multilingual v2 / Eleven v3 for pre-generated long-form.
Pricing: ElevenLabs prices per 1K characters by model on the API; subscription tiers (Creator, Pro, Scale) bundle character allowances plus overage. Confirm current per-model rates on the ElevenLabs pricing page. At production volume ElevenLabs Pro overage is materially more expensive than Cartesia or Deepgram for utility voice.
Voice library: over 1,000 voices on the public Voice Library plus Voice Design (generate a custom voice from a text prompt) plus Instant Voice Cloning plus Professional Voice Cloning (the highest-fidelity tier, requires 30 minutes of consent recording and ID verification).
Voice cloning: best in category. Professional Voice Cloning produces a clone that retains the source speaker’s micro-prosody, breath patterns, and emotional range. Used by Spotify, The New York Times, and a long list of audiobook publishers.
Multilingual coverage: 29 languages on Multilingual v2 and 70+ languages on Eleven v3. Strongest of any provider on this list for non-English voice agents.
SSML: focused subset (breaks, emphasis, prosody) plus a Pronunciation Dictionary with alphabet-level overrides. Not as deep as Azure but more than enough for voice agents.
Streaming protocols: WebSocket native, HTTP streaming, batch REST.
Where ElevenLabs wins: voice realism on long-form expressive content. Multilingual coverage. Voice cloning fidelity. If your product is voice-first and the voice is part of the brand, this is the pick.
Where it loses: latency on Turbo v2.5 is competitive but not class-leading. Pricing at scale is 5x what Cartesia or Deepgram cost. For pure utility voice (IVR, status messages) the premium isn’t worth it.
3. Deepgram Aura-2 — natural prosody at low latency
Use case fit: production voice agents where you want sub-300ms TTFA without paying the ElevenLabs premium. Customer support agents. Outbound calls at scale. Internal knowledge-base voice surfaces.
Streaming TTFA: 150 to 250ms over WebSocket on Aura-2. About 100ms faster than Aura-1, which was already competitive.
Pricing: approximately $0.030 per 1K characters on the production tier per Deepgram’s published rates. Significantly cheaper than ElevenLabs Pro overage and competitive with Cartesia for utility voice. Confirm against the live Deepgram pricing page for your tier.
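To turn a per-1K-character rate into a monthly figure, multiply it out against your call volume. The sketch below assumes 10,000 calls per day and 600 synthesized characters per call; both are hypothetical inputs, so substitute numbers from your own logs.

```python
def monthly_tts_cost_usd(
    calls_per_day: int,
    chars_per_call: int,
    usd_per_1k_chars: float,
    days: int = 30,
) -> float:
    """Estimate monthly TTS spend from call volume and a per-1K-character rate."""
    total_chars = calls_per_day * chars_per_call * days
    return total_chars / 1000 * usd_per_1k_chars


# Assumed workload: 10,000 calls/day at 600 characters each, priced at $0.030 per 1K.
cost = monthly_tts_cost_usd(10_000, 600, 0.030)
print(f"${cost:,.0f} per month")
```

Under those assumptions the estimate comes to about $5,400 per month, before any volume discount.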
Voice library: 40 voices on the Aura-2 catalog, focused on natural conversational delivery rather than character range. Tuned specifically for voice agent and contact center use cases.
Voice cloning: not the focus. Deepgram positions Aura as a curated catalog rather than a cloning surface. If cloning is a requirement, pair Deepgram for general utility with ElevenLabs for the cloned voice.
Multilingual coverage: 12 languages on Aura-2, expanding through 2026.
SSML: breaks, emphasis, basic prosody. Lean implementation focused on what voice agents actually use.
Streaming protocols: WebSocket and HTTP streaming. Deepgram’s STT and TTS share the same WebSocket surface, which simplifies wiring on a Deepgram-end-to-end stack.
Where Deepgram wins: the price-to-quality-to-latency triangle. Aura-2 is the default pick for production voice agents that aren’t fighting for sub-100ms TTFA and don’t need voice cloning. Pairing Deepgram STT plus Aura-2 TTS on the same WebSocket is the cleanest end-to-end voice setup.
Where it loses: voice library breadth (40 voices versus ElevenLabs’s 1,000+). Character range. Multilingual coverage (12 languages versus 29 on ElevenLabs Multilingual v2 and 70+ on Eleven v3).
4. Rime AI — utility voices at the lowest cost
Use case fit: high-volume utility voice flows. IVR menus. Status notifications. Outbound dialers where cost-per-minute drives the unit economics. Internal voice surfaces where natural prosody matters less than cost.
Streaming TTFA: 200 to 350ms over WebSocket on Rime’s Mist v3 and Arcana v3 models. Not class-leading but competitive.
Pricing: Rime publishes per-million-character rates on its pricing page; check the live page for current Mist v3 and Arcana v3 numbers. Rime consistently lands among the cheapest production-grade TTS providers for English utility voice.
Voice library: 200+ voices with strong North American English coverage. The catalog is focused on diverse utility voices (age, accent, register) rather than the polished narrator voices ElevenLabs and Cartesia focus on.
Voice cloning: Rime supports custom voice training but the bar is higher than ElevenLabs Instant Cloning. Most teams use the catalog rather than clone.
Multilingual coverage: English-focused, expanding. If your workload is English-only, this isn’t a constraint.
SSML: basic SSML support, focused on the markup voice agents actually use in production.
Streaming protocols: WebSocket, HTTP streaming, REST batch.
Where Rime wins: cost-per-minute for utility voice at scale. The catalog of accent and age diversity is strong for IVR and outbound voice that needs to sound human without sounding like a branded narrator.
Where it loses: voice realism for branded long-form content. Multilingual coverage. Voice cloning depth.
5. OpenAI gpt-4o-mini-tts — cheap and decent default
Use case fit: teams already on the OpenAI stack who want a serviceable TTS surface without adding a vendor. Internal tooling. Prototypes. Voice notifications on a budget.
Streaming TTFA: 350 to 500ms over server-sent events. Behind the WebSocket-native providers but acceptable for non-real-time use.
Pricing: token-priced on gpt-4o-mini-tts (per OpenAI’s published model pricing). One of the cheapest production-grade TTS surfaces in 2026 when measured per-character at typical text density. Verify against the live OpenAI pricing page for your workload.
Voice library: six built-in voices (Alloy, Echo, Fable, Onyx, Nova, Shimmer) on gpt-4o-mini-tts. The catalog is intentionally small and curated rather than vast.
Voice cloning: not supported. OpenAI has been explicit that voice cloning is not on the roadmap for the TTS surface.
Multilingual coverage: 50+ languages on gpt-4o-mini-tts via the underlying model. Coverage is broad but not depth-tuned per language the way ElevenLabs Multilingual v2 is.
SSML: limited SSML support. OpenAI’s API is more prompt-driven than markup-driven for voice control.
Streaming protocols: server-sent events streaming, plus REST batch.
Where OpenAI wins: cost per character is in a different league. If your voice flow tolerates 400ms TTFA and you’re already paying OpenAI for LLM, adding TTS is nearly free at the margin.
Where it loses: streaming latency. No voice cloning. SSML coverage. Voice library breadth.
Honorable mentions
Azure Neural TTS stays the reference for SSML depth (lexicon files, phonetic alphabet, prosody contour, viseme output for lip sync). 400+ voices, 140+ locales. TTFA over WebSocket is 250 to 400ms. Pricing at $16 per 1M characters on the Standard tier. The right pick when SSML fidelity or healthcare and legal pronunciation accuracy is the constraint. We left it out of the top five because for most voice agent use cases the SSML depth is overkill and the latency lags Cartesia and Deepgram.
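For readers who have not written SSML by hand, here is a minimal sketch of the markup these comparisons refer to: a pause, an IPA pronunciation override, and emphasis, using element names from the W3C SSML 1.1 spec. The brand name "Zyntra" and its IPA string are made up for illustration, and providers differ in which elements they accept (some also require extra wrapper elements such as a voice tag), so check the provider docs before sending it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(intro: str, term: str, term_ipa: str, closing: str) -> str:
    """Build a small SSML 1.1 document with a pause, an IPA override, and emphasis."""
    return (
        f'<speak version="1.1" xmlns="{SSML_NS}" xml:lang="en-US">'
        f"{escape(intro)} "
        f"<phoneme alphabet=\"ipa\" ph={quoteattr(term_ipa)}>{escape(term)}</phoneme>"
        '<break time="300ms"/>'
        f'<emphasis level="strong">{escape(closing)}</emphasis>'
        "</speak>"
    )


# "Zyntra" and its IPA transcription are hypothetical example values.
ssml = build_ssml("Thanks for calling", "Zyntra", "ˈzɪn.trə", "How can I help?")
root = ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Parsing the string before sending it catches unescaped characters in user-supplied text, which is a common source of rejected requests.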
PlayHT ships strong character voices and a large catalog (800+ voices) plus Play 3.0-mini at competitive latency (200 to 400ms TTFA). Pricing at $39 per month for the Creator tier. Good pick for entertainment-style voice agents (companions, NPC dialog, character-driven IVR). We left it out of the top five because the use cases where it wins are narrower than the top picks, and ElevenLabs covers most of the same ground with better cloning fidelity.
Score every TTS call with Future AGI
Picking a TTS provider is the easy part. Keeping the audio quality stable through voice updates, model upgrades, and provider switches is the harder part. That’s where the FAGI eval and observability stack closes the loop.
audio_quality rubric scores every call
The ai-evaluation SDK ships 70+ built-in eval templates under the Apache 2.0 license. For TTS, the audio_quality rubric is the workhorse. It takes the assistant audio output and scores it on clarity, prosody, pronunciation, and naturalness against the rendered text. It runs on every captured call automatically once your voice provider is wired into a FAGI Observe project.

    from fi.testcases import MLLMTestCase, MLLMAudio
    from fi.evals import Evaluator, AudioQualityEvaluator

    # Wrap the captured assistant audio (URL or local path) as a test case.
    audio = MLLMAudio(url="https://your-cdn.example.com/calls/abc123/assistant.wav")
    case = MLLMTestCase(input=audio, query="Score this TTS output for clarity and prosody")

    # Replace the placeholders with your Future AGI credentials.
    ev = Evaluator(fi_api_key="...", fi_secret_key="...")
    result = ev.evaluate(
        eval_templates=[AudioQualityEvaluator()],
        inputs=[case],
    )

MLLMAudio accepts seven formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma) from URL or local path, with auto base64 encoding. That covers anything Cartesia, ElevenLabs, Deepgram, Rime, OpenAI, Azure, or PlayHT returns.
SSML snapshot regression catches voice drift
The pattern: store a reference audio render of every named phrase that matters (brand names, product names, numbers, dates, medical or legal terms). On every voice or model update, re-render the same phrase set and score the new audio against the snapshot. If audio_quality drops below a threshold or pronunciation drift is detected, the change gets flagged before it reaches production.
This is a workflow you build on top of audio_quality plus the test case primitives in ai-evaluation. The snapshot lives in your test suite, the rubric runs on each render, the regression is automatic.
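The comparison step of that workflow can be sketched in a few lines. The example assumes you already have one audio_quality score per phrase for the stored snapshot and for the new render; the 0 to 1 score scale, the thresholds, and the phrase set are illustrative assumptions, not values defined by the SDK.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Regression:
    phrase: str
    baseline: float
    current: float


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.05,
    min_score: float = 0.80,
) -> List[Regression]:
    """Flag phrases whose score fell below the floor or dropped too far from the snapshot."""
    flagged = []
    for phrase, base_score in baseline.items():
        new_score = current.get(phrase, 0.0)  # a missing render counts as a failure
        if new_score < min_score or base_score - new_score > max_drop:
            flagged.append(Regression(phrase, base_score, new_score))
    return flagged


# Hypothetical scores for three snapshot phrases before and after a voice update.
snapshot = {"Future AGI": 0.93, "March 1, 2026": 0.91, "acetaminophen 500 mg": 0.90}
new_render = {"Future AGI": 0.92, "March 1, 2026": 0.78, "acetaminophen 500 mg": 0.89}

for reg in find_regressions(snapshot, new_render):
    print(f"REGRESSION: {reg.phrase!r} {reg.baseline:.2f} -> {reg.current:.2f}")
```

Wiring this into CI as a failing test is what makes the regression check automatic.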
traceAI captures TTS provider and voice ID per span
traceAI ships 30+ documented integrations across Python and TypeScript (including dedicated traceAI-pipecat and traceai-livekit packages for voice frameworks). For TTS, instrumented spans capture provider name, voice ID, model version, and per-utterance latency as span attributes. That’s what lets you filter calls by provider in the FAGI Observe dashboard and compare audio_quality distributions across providers. For Vapi and Retell, use FAGI’s native dashboard-driven voice observability (provider API key plus Assistant ID) instead of an SDK instrumentor.

    from fi_instrumentation import FITracer, register
    from fi_instrumentation.fi_types import ProjectType

    # Register an Observe project and make it the global tracer provider.
    trace_provider = register(
        project_type=ProjectType.OBSERVE,
        project_name="voice_agent",
        set_global_tracer_provider=True,
    )
    tracer = FITracer(trace_provider.get_tracer(__name__))


    def render_tts(text, voice_id, provider):
        # Tag each TTS call so the dashboard can filter by provider and voice.
        with tracer.start_as_current_span(
            "tts_call",
            attributes={
                "gen_ai.voice.tts.provider": provider,
                "gen_ai.voice.tts.voice_id": voice_id,
                # Simplification: every non-Cartesia provider is labeled flash-v2.5 here.
                "gen_ai.voice.tts.model": "sonic-3.5" if provider == "cartesia" else "flash-v2.5",
            },
        ):
            # call_tts_api is a placeholder for your own provider client wrapper.
            return call_tts_api(text, voice_id, provider)

Filter the dashboard by gen_ai.voice.tts.provider = cartesia versus gen_ai.voice.tts.provider = elevenlabs and the per-provider audio_quality distribution surfaces directly.
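If you export spans and want the same comparison offline, grouping scores by the provider attribute is enough to get started. This sketch assumes each exported span is a dict carrying the gen_ai.voice.tts.provider attribute and an audio_quality score; the export shape and the score values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, List

# Hypothetical exported spans; a real export will carry more fields.
spans = [
    {"gen_ai.voice.tts.provider": "cartesia", "audio_quality": 0.91},
    {"gen_ai.voice.tts.provider": "cartesia", "audio_quality": 0.88},
    {"gen_ai.voice.tts.provider": "cartesia", "audio_quality": 0.93},
    {"gen_ai.voice.tts.provider": "elevenlabs", "audio_quality": 0.95},
    {"gen_ai.voice.tts.provider": "elevenlabs", "audio_quality": 0.90},
]


def summarize_by_provider(rows: List[dict]) -> Dict[str, dict]:
    """Group audio_quality scores by provider and report simple summary statistics."""
    scores = defaultdict(list)
    for row in rows:
        scores[row["gen_ai.voice.tts.provider"]].append(row["audio_quality"])
    return {
        provider: {"n": len(vals), "mean": mean(vals), "median": median(vals), "min": min(vals)}
        for provider, vals in scores.items()
    }


summary = summarize_by_provider(spans)
for provider, stats in summary.items():
    print(provider, stats)
```

The minimum is worth tracking alongside the mean, since a single badly rendered brand name matters more than a small shift in the average.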
Custom voices from ElevenLabs and Cartesia in Run Prompt and Experiments
Inside the FAGI product, the Run Prompt and Experiments surfaces accept custom voices imported from ElevenLabs and Cartesia. Paste the voice ID, the eval engine renders the audio using the custom voice, and audio_quality scores the result. That lets you A/B compare a cloned brand voice against the provider’s catalog voices before shipping the change to production.
Inline guardrails on audio output
For regulated workloads, Future AGI Protect is built on Google’s Gemma 3n foundation, with a LoRA-trained adapter for each safety dimension (arXiv 2510.13351). Protect is multi-modal across text, image, and audio and has two surfaces: rule-based Protect across the 4 documented safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance) and ProtectFlash, the single-call binary classifier that gives you the sub-100ms inline path. Either fits inside a typical sub-500ms voice budget.
Native voice observability for Vapi, Retell, and LiveKit
The FAGI dashboard ships native voice observability for Vapi, Retell, and LiveKit. No SDK required: add a provider API key plus the Assistant ID to a FAGI Agent Definition, enable observability, and every call streams in as a logged session with auto call log capture, separate assistant and customer audio download, auto transcripts, and the full eval engine (including audio_quality and audio_transcription) running on every call.
Agent Command Center hosts the whole stack with RBAC, BYOC or multi-region, AWS Marketplace, and the cert set listed on the trust page: SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 all certified. Error Feed auto-clusters TTS regressions into named issues (for example, “pronunciation drift on brand names after voice switch”) with auto-written root cause, quick fix, and long-term recommendation.
Calibrated honesty: where each provider genuinely wins
ElevenLabs owns voice realism and voice cloning. Cartesia owns streaming latency. Deepgram owns the price-to-quality-to-latency triangle for production agents. Rime owns the cost-per-minute floor for utility voice. OpenAI owns the cheapest serviceable default for teams already on the stack. Azure owns SSML depth. PlayHT owns character voices for entertainment use cases. There is no single winner across all four axes, which is why most production voice teams run two or three providers behind a router.
Two deliberate tradeoffs
Async eval gating is explicit. agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) inside the Dataset UI and the Python library. Pick an optimizer, point at a dataset and an evaluator, run. FAGI never auto-rewrites a production prompt without an explicit run plus a human approval gate. The loop is deliberate by design.
Native voice observability ships for Vapi, Retell, and LiveKit out of the box; everything else goes through Enable Others mode via the traceAI SDK (dedicated traceAI-pipecat and traceai-livekit packages plus 30+ documented integrations) or a webhook. That covers more than 90% of production voice stacks; deeper custom-runtime work is a code-path engagement.
Pitfalls when picking a TTS provider
Don’t optimize TTFA without measuring it on your real network path. Provider-published TTFA numbers come from their nearest region under ideal load. Your number from your egress point at your peak traffic shape will be 50 to 150ms higher. Measure before you commit.
Don’t lock in a single provider for the whole stack. The router pattern (low-latency for short replies, high-quality for branded greetings, fallback for outages) is almost always the right architecture. Single-vendor lock-in becomes an outage risk and a quality ceiling.
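A router of this kind can be very small. The sketch below maps a reply type to an ordered provider list and falls through to the next provider on any error; the route table, the reply-type names, and the client callables are all hypothetical placeholders for your own wrappers.

```python
from typing import Callable, Dict, List, Tuple

# Ordered preferences per reply type; first entry is primary, the rest are fallbacks.
ROUTES: Dict[str, List[str]] = {
    "short_reply": ["cartesia", "deepgram"],
    "branded_greeting": ["elevenlabs", "cartesia"],
}
DEFAULT_ROUTE: List[str] = ["deepgram", "openai"]


def route_tts(
    text: str, kind: str, clients: Dict[str, Callable[[str], bytes]]
) -> Tuple[str, bytes]:
    """Try each provider for this reply type in order and return the first success."""
    errors = []
    for provider in ROUTES.get(kind, DEFAULT_ROUTE):
        try:
            return provider, clients[provider](text)
        except Exception as exc:  # outage, timeout, missing client, and so on
            errors.append(f"{provider}: {exc}")
    raise RuntimeError("all TTS providers failed: " + "; ".join(errors))


def failing_client(text: str) -> bytes:
    """Hypothetical client that simulates a provider outage."""
    raise ConnectionError("simulated outage")


fake_clients = {
    "cartesia": failing_client,
    "deepgram": lambda text: b"deepgram-audio",
    "elevenlabs": lambda text: b"elevenlabs-audio",
}

provider, audio = route_tts("Your order has shipped.", "short_reply", fake_clients)
print(provider)
```

In production you would add a per-provider timeout so a slow primary does not consume the whole latency budget before the fallback gets a turn.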
Don’t ignore consent flows on voice cloning. Both ElevenLabs and Cartesia gate cloning behind consent capture. Your own pipeline should retain the consent artifact alongside the cloned voice ID. ProtectFlash audio scanning is the defense-in-depth layer.
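One way to keep the consent artifact tied to the cloned voice is a small immutable record stored next to the voice ID. The field names and the choice to store a SHA-256 digest of the consent recording are assumptions for illustration; your legal and compliance requirements decide what must actually be retained.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class VoiceConsentRecord:
    provider: str
    voice_id: str
    speaker_name: str
    consent_audio_sha256: str  # digest of the stored consent recording
    captured_at: str  # ISO 8601 timestamp in UTC


def make_consent_record(
    provider: str, voice_id: str, speaker_name: str, consent_audio: bytes
) -> VoiceConsentRecord:
    """Bind a cloned voice ID to a digest of the consent recording that authorized it."""
    return VoiceConsentRecord(
        provider=provider,
        voice_id=voice_id,
        speaker_name=speaker_name,
        consent_audio_sha256=hashlib.sha256(consent_audio).hexdigest(),
        captured_at=datetime.now(timezone.utc).isoformat(),
    )


# Placeholder values; the voice ID and audio bytes are hypothetical.
record = make_consent_record("elevenlabs", "voice_abc123", "Jane Doe", b"consent-audio-bytes")
print(record.voice_id, record.consent_audio_sha256[:12])
```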
Don’t ship a voice or model update without snapshot regression. The SSML snapshot pattern (reference audio render plus audio_quality scoring) catches the failures that transcript-only evals miss. Brand name mispronunciations after a voice switch are silent without it.
Related reading
- ElevenLabs vs Cartesia: 2026 streaming TTS deep comparison
- Voice AI observability for Vapi: a 2026 implementation guide
- 7 best voice agent monitoring platforms in 2026
- How to implement voice AI observability in 2026
Sources and references
- traceAI on GitHub: github.com/future-agi/traceAI
- ai-evaluation on GitHub: github.com/future-agi/ai-evaluation
- Future AGI Protect docs: docs.futureagi.com/docs/protect
- Agent Command Center docs: docs.futureagi.com/docs/command-center
- Error Feed docs: docs.futureagi.com/docs/observe
- arXiv 2510.13351 (Protect): arxiv.org/abs/2510.13351
- arXiv 2507.19457 (GEPA Genetic-Pareto): arxiv.org/abs/2507.19457
- arXiv 2505.09666 (Meta-Prompt bilevel optimization): arxiv.org/abs/2505.09666
- arXiv 2311.09569 (Random Search baseline): arxiv.org/abs/2311.09569
- Trust page (SOC 2 + HIPAA + GDPR + CCPA + ISO 27001): futureagi.com/trust
- W3C SSML 1.1 spec
- ElevenLabs
- Cartesia
- Deepgram
- Rime AI
- OpenAI gpt-4o-mini-tts
- Azure Neural TTS
- PlayHT
Frequently asked questions
Which TTS provider has the lowest streaming time-to-first-audio in 2026?
Is ElevenLabs worth the price premium over OpenAI TTS or Deepgram Aura-2?
Can I use multiple TTS providers in the same voice agent and switch between them?
How do I evaluate TTS output quality without listening to every call?
Does voice cloning still require explicit consent in 2026?
Which TTS provider has the best SSML and pronunciation control?
What's the realistic monthly cost of running a TTS provider at 10,000 calls per day?