What Is Text-to-Speech?

A voice AI component that converts written text into spoken audio using a speech synthesis model.

Text-to-speech (TTS) is a voice AI component that converts written text into spoken audio using a speech synthesis model. In production voice-agent pipelines, TTS is the final audio surface after the LLM chooses what to say, so failures appear as mispronunciations, skipped words, unnatural prosody, long time-to-first-audio, or audio that does not match the intended response. The 2026 generation of speech models (ElevenLabs v3, OpenAI’s gpt-4o-tts, Cartesia Sonic 2, PlayHT 3.0, and the on-device Apple voices) has narrowed the quality gap on English studio reads, but live, latency-bound, multilingual production traffic still exposes the same failure modes. FutureAGI evaluates TTS with TTSAccuracy, audio-quality signals, and simulation traces tied to traceAI spans before release.

Why text-to-speech matters in production LLM and agent systems

Bad TTS turns a correct LLM response into a failed user interaction. The model may choose the right refund policy, but the speech layer can skip “not”, pronounce a medication name incorrectly, add awkward pauses, or take 1.8 seconds before the first syllable. Users experience that as confusion or mistrust, not as a separate infrastructure problem, which is why TTS is now considered a first-class LLM evaluation surface rather than a media-rendering step.

The pain shows up differently by role. Product teams see call abandonment, repeat questions, and lower satisfaction on specific voices or locales. SREs see p99 time-to-first-audio drift, elevated retry rates, or codec errors after a provider change. Compliance teams worry about auditable phrasing when a regulated disclosure is spoken differently from the approved text. Developers debug traces where the LLM output is correct but the audio artifact fails a manual review.

TTS matters more in 2026-era agentic voice systems because it sits after many upstream decisions. A sales agent may retrieve account context, call a CRM tool, generate a response with Claude Sonnet 4.6 or GPT-5.1, and then speak it during a live call. If the speech layer drops an amount, misreads an acronym, or changes intonation on a consent question, the entire multi-step agent trajectory is blamed. Unlike chat, voice has no easy visual correction; the user hears the mistake once, reacts immediately, and often interrupts the next turn, which is why the LiveKit and Pipecat patterns now ship interrupt-aware streaming TTS by default.

How FutureAGI handles text-to-speech

FutureAGI’s approach is to treat TTS as an evaluation surface that ships with voice-agent observability, not as a media-rendering step. The anchor for this glossary term is eval:TTSAccuracy, exposed through the TTSAccuracy evaluator. Teams attach it to generated speech artifacts so a release can fail when the spoken audio does not match the intended response text closely enough for the task. For raw audio integrity, AudioQualityEvaluator is tracked beside it, because faithful words still fail if the waveform clips, contains long silence, or degrades under a noisy channel.

A real workflow looks like this: a healthcare scheduling agent generates the sentence “Your appointment is on May 17 at 3:40 p.m.” and sends it to a TTS provider. FutureAGI stores the intended text, audio path, voice ID, locale, and time-to-first-audio in the evaluation run. TTSAccuracy scores the spoken-output fidelity; AudioQualityEvaluator catches clipping or silence; the simulate-sdk LiveKitEngine replays the call through a Scenario containing accents, background noise, and interruption behavior. We’ve found in our 2026 voice evals that 30–40% of TTS regressions only appear under telecom-grade codecs (G.711 µ-law, Opus 16 kHz), not in 48 kHz studio playback, which is why the Scenario ships with codec presets.
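
To make the codec point concrete, the sketch below pushes studio-rate samples through a rough telephone-style degradation before they would be scored. It is an illustration under stated assumptions, not FutureAGI's codec preset: the function names are hypothetical, the companding uses the continuous µ-law formula rather than G.711's segmented encoding, and the decimation skips the anti-alias filter a real resampler applies.

```python
import math

MU = 255  # companding constant used by the mu-law variant of G.711


def mulaw_encode(sample: float) -> int:
    """Compress one sample in [-1.0, 1.0] to an 8-bit code (0..255)."""
    sample = max(-1.0, min(1.0, sample))
    compressed = math.copysign(
        math.log1p(MU * abs(sample)) / math.log1p(MU), sample
    )
    return int(round((compressed + 1.0) / 2.0 * 255))


def mulaw_decode(code: int) -> float:
    """Expand an 8-bit code back to a sample in [-1.0, 1.0]."""
    compressed = code / 255 * 2.0 - 1.0
    return math.copysign(
        math.expm1(abs(compressed) * math.log1p(MU)) / MU, compressed
    )


def telephone_round_trip(samples, src_rate=48_000, dst_rate=8_000):
    """Keep every (src_rate // dst_rate)-th sample, then apply mu-law.

    Naive decimation: no anti-alias filter is applied, so this only
    approximates what a production resampler plus codec would do.
    """
    step = src_rate // dst_rate
    return [mulaw_decode(mulaw_encode(s)) for s in samples[::step]]
```

In practice the degraded samples would be written back to an audio file and evaluated alongside the studio-rate original, so that a score gap between the two points at codec sensitivity rather than at the voice itself.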

The engineer then sets a threshold by cohort, not only a global average. If English-US passes but Indian-English medical terms regress, the run opens an alert, blocks the provider rollout, or routes that locale through a fallback voice in the voice stack. Unlike Vocode’s provider abstraction, which gives you swap-ability but not measurement, FutureAGI connects TTS quality to the production trace, the generated text, the user prompt, and the user cohort. The platform pairs traceAI-livekit spans with simulation results so teams can compare quality failures against latency spikes instead of treating them as separate dashboards. While there is no canonical public TTS leaderboard with the discrimination of MMLU-Pro, the closest adjacent anchors are τ-bench (Sierra’s multi-turn customer-support benchmark; frontier 55–70% in May 2026) for voice-agent task shape and the AgentHarm voice-specific probes for safety regressions in spoken disclosures. Both are useful for calibrating internal TTS thresholds against publicly known difficulty levels.
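
The per-cohort threshold idea can be sketched in a few lines of plain Python. Everything here is illustrative: the cohort labels, the threshold values, and the `failing_cohorts` helper are assumptions for the example, not part of the FutureAGI SDK.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-cohort minimums; tune these to your own risk tolerance.
THRESHOLDS = {"en-US": 0.95, "en-IN/medical": 0.97}
DEFAULT_THRESHOLD = 0.95


def failing_cohorts(results, thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Return {cohort: mean_score} for every cohort below its threshold.

    `results` is an iterable of (cohort, score) pairs, scores in [0, 1].
    """
    by_cohort = defaultdict(list)
    for cohort, score in results:
        by_cohort[cohort].append(score)
    return {
        cohort: mean(scores)
        for cohort, scores in by_cohort.items()
        if mean(scores) < thresholds.get(cohort, default)
    }


results = [
    ("en-US", 0.99), ("en-US", 0.98),
    ("en-IN/medical", 0.91), ("en-IN/medical", 0.95),
]
blocked = failing_cohorts(results)  # only the en-IN/medical cohort fails
```

The global mean of these four scores still clears 0.95, which is exactly the situation where a single aggregate threshold would let the regression through.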

2026 TTS provider tradeoffs

Provider | Strength | Watch-out | Best fit
--- | --- | --- | ---
ElevenLabs v3 | Expressive prosody, voice cloning, 30+ languages | TTFB jumps under concurrency; cost-per-char highest | Branded consumer agents, IVR
OpenAI gpt-4o-tts | Tight LLM-TTS coupling, low TTFB | Limited voice library, English-leaning | RAG bots co-deployed with GPT-5.x
Cartesia Sonic 2 | Sub-100 ms TTFB, on-device option | Fewer accents, smaller voice catalog | Low-latency phone agents
PlayHT 3.0 | Cheap streaming, multilingual | Pronunciation drift on rare named entities | High-volume outbound calling
Apple on-device | Zero network round-trip, free | Quality below cloud peers | Mobile assistants with privacy needs

How to measure or detect TTS quality

Measure TTS at the audio boundary and at the conversation boundary:

  • TTSAccuracy: scores whether the synthesized audio reflects the intended text closely enough for the use case.
  • AudioQualityEvaluator: checks raw audio integrity issues such as clipping, silence, distortion, or noisy output.
  • Time-to-first-audio: user-perceived start latency; alert on p95 or p99 regressions by provider, voice, and locale.
  • Pronunciation error rate: review named entities, product names, medication names, amounts, and acronyms separately from generic words.
  • Text-audio mismatch rate: transcribe a sample of the final audio and compare the transcript against the generated text using word error rate to catch skipped negations, numbers, and compliance phrases.
  • User feedback proxy: track barge-ins, repeated “what?”, call abandonment, and thumbs-down events after TTS-heavy turns.
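
As a rough illustration of the text-audio mismatch metric above, word error rate can be computed as a word-level edit distance between the intended text and a transcript of the synthesized audio. This is a minimal sketch: the transcript is assumed to come from whatever ASR system you already run, and the `tokenize` and `word_error_rate` helpers are hypothetical names, not a library API.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens, keeping digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


intended = "Your refund is not eligible for processing"
transcript = "your refund is eligible for processing"
wer = word_error_rate(intended, transcript)  # one dropped word out of seven
```

Because the transcript itself comes from a speech recognizer, some of the measured error belongs to ASR rather than TTS; comparing against a baseline voice on the same text helps separate the two.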

Minimal Python:

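from fi.evals import TTSAccuracy, AudioQualityEvaluator

tts = TTSAccuracy()              # spoken-output fidelity against the intended text
audio = AudioQualityEvaluator()  # raw audio integrity: clipping, silence, distortion

# voice_dataset is assumed to be an iterable of rows that expose
# expected_text, audio_path, and an attach_scores() method.
for row in voice_dataset:
    a = tts.evaluate(input_text=row.expected_text, audio_path=row.audio_path)
    q = audio.evaluate(audio_path=row.audio_path)
    row.attach_scores(tts_accuracy=a.score, audio_quality=q.score)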

Use the score as a release gate, not only a monitoring graph. A practical setup runs TTSAccuracy on nightly simulated calls, samples production calls by risk cohort, and compares every provider or voice change against the last approved baseline.
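
One way to picture the baseline comparison is a small gate function that fails a release when accuracy drops or p95 time-to-first-audio rises beyond a tolerance. The sketch below is an assumption-laden illustration: the run dictionaries, the tolerance defaults, and the `release_gate` name are hypothetical, and the scores are taken as already computed by an evaluator.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def release_gate(baseline, candidate, max_score_drop=0.01,
                 max_p95_increase_ms=100):
    """Compare a candidate run with the last approved baseline.

    Each run is a dict with "scores" (floats in [0, 1]) and "ttfa_ms"
    (time-to-first-audio samples in milliseconds). Returns a list of
    failure reasons; an empty list means the gate passes.
    """
    failures = []
    base_score = sum(baseline["scores"]) / len(baseline["scores"])
    cand_score = sum(candidate["scores"]) / len(candidate["scores"])
    if base_score - cand_score > max_score_drop:
        failures.append(f"accuracy dropped {base_score - cand_score:.3f}")
    base_p95 = percentile(baseline["ttfa_ms"], 95)
    cand_p95 = percentile(candidate["ttfa_ms"], 95)
    if cand_p95 - base_p95 > max_p95_increase_ms:
        failures.append(
            f"p95 time-to-first-audio rose {cand_p95 - base_p95} ms"
        )
    return failures
```

Returning reasons instead of a bare boolean keeps the gate usable in CI logs, where the reviewer needs to know whether a voice change failed on fidelity, latency, or both.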

Common mistakes

These mistakes make TTS look healthy in dashboards while users hear failures:

  • Testing only provider demos. Studio previews miss runtime failures from streaming, telecom codecs, noisy input, and concurrent load.
  • Averaging across locales. A 97% aggregate score can hide one accent, language, or medical-term cohort failing badly.
  • Ignoring numbers and negations. “Five” versus “fifteen” or dropped “not” carries more risk than a harmless filler-word mismatch.
  • Measuring latency without audio quality. Fast first audio is not enough if the voice clips, mumbles, or omits required disclosure text.
  • Changing voices without regression evals. A new voice can alter pronunciation, pacing, and compliance phrasing even when the text prompt is unchanged.
  • Skipping codec testing. A model that sounds great at 48 kHz can mangle phonemes after G.711 downsampling. Run evals over the same codec pipeline production traffic uses.
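
The numbers-and-negations mistake above lends itself to a targeted check that runs beside any aggregate score. The sketch below flags critical tokens that appear in the intended text but are missing from a transcript of the audio. The negation list is a deliberately small illustration, the helper names are hypothetical, and a real check would first normalize spoken numbers ("fifteen") to digits, which this example does not do.

```python
import re
from collections import Counter

# Illustrative, intentionally incomplete set of meaning-flipping words.
NEGATIONS = {"not", "no", "never", "cannot", "don't", "won't", "can't"}


def critical_tokens(text: str) -> list[str]:
    """Extract negations and number-like tokens in order of appearance."""
    tokens = re.findall(r"[a-z']+|\d+(?:[.:,]\d+)*", text.lower())
    return [t for t in tokens if t in NEGATIONS or t[0].isdigit()]


def missing_critical_tokens(intended: str, transcript: str) -> list[str]:
    """Return critical tokens present in `intended` but absent in `transcript`.

    Counter subtraction keeps multiplicity, so a token spoken once when
    the text contains it twice is still reported.
    """
    missing = Counter(critical_tokens(intended)) - Counter(
        critical_tokens(transcript)
    )
    return sorted(missing.elements())


flagged = missing_critical_tokens(
    "Do not take more than 15 mg",
    "do take more than 50 mg",
)  # ['15', 'not']
```

Any non-empty result is worth treating as a hard failure for regulated or safety-relevant turns, regardless of how high the overall accuracy score is.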

Frequently Asked Questions

What is text-to-speech?

Text-to-speech is a voice AI component that converts written text into spoken audio. In production systems, it is the final user-facing step where wording, pronunciation, prosody, latency, and audio quality all affect caller trust.

How is text-to-speech different from automatic speech recognition?

Automatic speech recognition turns spoken audio into text, while text-to-speech turns text back into spoken audio. A voice agent usually needs both, with ASR at the input boundary and TTS at the output boundary.

How do you measure text-to-speech?

FutureAGI measures TTS with the TTSAccuracy evaluator, audio-quality signals, and simulation traces from voice sessions. Teams track spoken-output fidelity, time-to-first-audio, pronunciation errors, and user feedback by cohort.