
How to Evaluate TTS Quality for Voice AI in 2026: SSML + MOS + Rubrics

Evaluate TTS quality for voice AI in 2026 with audio_quality rubrics, MOS scoring, SSML snapshot regression, and A/B provider comparison via Future AGI.


TTS quality is the part of a voice stack that fails silently. Transcripts read fine, eval rubrics on the LLM look green, and the customer still hangs up because the assistant said the product name wrong or the prosody flattened after a voice update. This guide walks through a six-step evaluation loop that catches those failures before they ship, using rubric-based scoring, SSML snapshot regression, and provider A/B comparison.

Step preview

  1. Collect TTS output samples across the providers and voices you ship.
  2. Score every sample with audio_quality plus custom rubrics for prosody, pronunciation, naturalness, and brand fit.
  3. Freeze canonical SSML plus audio as golden pairs and re-score on every voice or model change.
  4. A/B compare TTS providers on the same corpus.
  5. Set thresholds and alert on regression.
  6. Run the same eval loop on live production audio captured by native voice observability.

The rest of the post fills in each step with code you can copy.

Why TTS deserves its own eval surface

Most voice eval guides stop at three rubrics: did the agent complete the task, did the conversation resolve, did the LLM say something sensible. Those are correct rubrics. They miss the TTS leg entirely.

The failure modes that show up after a voice or provider switch:

  • Brand name mispronunciation. “Kazoom” becomes “kuh-zoom”. The transcript reads correct, the audio is wrong.
  • Prosody flattening. Vendor pushes a voice model update. The voice sounds slightly more robotic. Every customer hears it. No transcript-based eval flags it.
  • SSML breakage. A new TTS engine version interprets <break time="200ms"/> differently. Pauses shift, timing breaks.
  • Number reading drift. “8842” rendered as “eighty-eight forty-two” suddenly becomes “eight thousand eight hundred forty-two”. Both are valid; only one matches your style guide.
  • Locale drift. Same voice, different accents start leaking through on phrases the model was not trained on.

None of these show up in transcript-only observability. They all need audio scoring. The pattern in this guide is: rubric-scored audio on every sample, golden SSML snapshots for regression, threshold-based alerts on drift.

Step 1: Collect TTS output samples

You need a canonical corpus that covers your real utterance distribution plus your edge cases. We recommend three buckets.

Bucket A: high-volume utterances. The top 50 phrases your assistant says, by frequency. Pulled from production transcripts over the last 30 days. These are the phrases customers hear most.
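A minimal sketch of how Bucket A could be assembled, assuming you can export assistant turns as plain strings. The names `normalize` and `top_utterances` and the sample data are illustrative, not part of any SDK, and the normalization (lowercasing, collapsing whitespace) is one reasonable choice among several.

```python
from collections import Counter

def normalize(utterance: str) -> str:
    # Lowercase and collapse whitespace so trivial variants count as one phrase.
    return " ".join(utterance.lower().split())

def top_utterances(utterances, n=50):
    """Return the n most frequent assistant utterances with their counts."""
    counts = Counter(normalize(u) for u in utterances if u.strip())
    return counts.most_common(n)

# Hypothetical export of assistant turns from the last 30 days.
assistant_turns = [
    "How can I help today?",
    "how can I help  today?",
    "Your claim has been filed.",
    "How can I help today?",
]
bucket_a = top_utterances(assistant_turns, n=2)
```

If your assistant fills templates with names or amounts, you may want to mask those slots before counting so that one template is not split across many low-frequency strings.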

Bucket B: brand-critical utterances. Brand names, product names, partner names, account numbers, dollar amounts, dates, addresses. Anything where mispronunciation costs trust.

Bucket C: SSML-marked utterances. Phrases that use <break>, <emphasis>, <prosody>, <phoneme>, or <say-as> to control rendering. These break first when providers update their engines.
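One way to route samples into Bucket C automatically is to parse the SSML and look for the control tags listed above. This is a sketch using only the Python standard library; `ssml_control_tags` is a hypothetical helper name, and the check assumes the SSML is well-formed XML (the parser raises an error otherwise, which is itself a useful pre-render check).

```python
import xml.etree.ElementTree as ET

CONTROL_TAGS = {"break", "emphasis", "prosody", "phoneme", "say-as"}

def ssml_control_tags(ssml: str) -> set:
    """Return the rendering-control tags used in an SSML string.

    Raises xml.etree.ElementTree.ParseError if the markup is not well-formed.
    """
    root = ET.fromstring(ssml)
    # Drop any XML namespace prefix of the form "{namespace-uri}tag".
    found = {el.tag.rsplit("}", 1)[-1] for el in root.iter()}
    return found & CONTROL_TAGS

ssml = (
    '<speak>Welcome to <phoneme alphabet="ipa" ph="kəˈzum">Kazoom</phoneme> '
    'Insurance. <break time="200ms"/> How can I help today?</speak>'
)
tags = ssml_control_tags(ssml)
is_bucket_c = bool(tags)
```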

For each bucket, render the audio across every TTS provider and voice you ship. ElevenLabs and Cartesia are the two voice catalogs we recommend most often, and both are wired into Future AGI’s Run Prompt and Experiments surfaces directly if you want to skip the manual rendering step.

Store the resulting audio with metadata:

sample = {
    "sample_id": "bucket_b_007",
    "bucket": "brand_critical",
    "input_text": "Welcome to Kazoom Insurance. How can I help today?",
    "ssml": '<speak>Welcome to <phoneme alphabet="ipa" ph="kəˈzum">Kazoom</phoneme> Insurance. <break time="200ms"/> How can I help today?</speak>',
    "provider": "elevenlabs",
    "voice_id": "rachel",
    "audio_url": "s3://your-bucket/tts-eval/bucket_b_007_elevenlabs_rachel.wav",
    "rendered_at": "2026-03-10T14:22:00Z",
}

A few hundred samples per provider, refreshed quarterly, is the right ongoing budget for most teams. Larger catalogs only help if you have correspondingly larger production volume.

Step 2: Score samples with audio_quality plus custom rubrics

The ai-evaluation SDK ships 70+ built-in eval templates under Apache 2.0. For TTS, five rubrics carry the load: one built-in plus four custom ones authored against your style guide.

Rubric        | Type     | What it scores
------------- | -------- | --------------------------------------------------------------
audio_quality | Built-in | TTS clarity, prosody, intelligibility on the rendered audio
pronunciation | Custom   | Whether brand names and product terms render correctly
prosody       | Custom   | Sentence-level intonation, emphasis, pacing match to style guide
naturalness   | Custom   | Human-like cadence versus robotic flatness
brand_fit     | Custom   | Voice matches the persona and tone the brand specifies

The built-in audio_quality rubric is the spine. Run it on every sample. The four custom rubrics fill in the dimensions specific to your brand.

Here is the end-to-end scoring loop:

from fi.testcases import MLLMTestCase, MLLMAudio
from fi.evals import Evaluator, AudioQualityEvaluator

ev = Evaluator(
    fi_api_key="your-future-agi-api-key",
    fi_secret_key="your-future-agi-secret-key",
)

# audio_quality is the built-in rubric
audio_quality = AudioQualityEvaluator()

def score_sample(sample):
    audio = MLLMAudio(url=sample["audio_url"])
    case = MLLMTestCase(
        input=audio,
        query=f"Score this TTS audio rendered from: {sample['input_text']}",
    )
    return ev.evaluate(
        eval_templates=[audio_quality],
        inputs=[case],
    )

Author your four custom rubrics (pronunciation, prosody, naturalness, brand_fit) in the Future AGI product using the in-product agent, then attach them alongside the built-in audio_quality through the evaluator workflow configured for your project.

MLLMAudio accepts seven formats out of the box: .mp3, .wav, .ogg, .m4a, .aac, .flac, .wma. URLs or local paths, with auto base64 encoding. The same loader handles whatever ElevenLabs, Cartesia, or any other provider returns.
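Before handing a batch of rendered files to the scorer, a cheap pre-flight check on the file extension can catch provider outputs in an unexpected container. The sketch below uses only the standard library; `has_supported_extension` is a hypothetical helper, it looks at the name only (not the audio content), and it is not a description of how MLLMAudio validates input. It assumes URLs or POSIX-style paths.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# The seven extensions listed above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".ogg", ".m4a", ".aac", ".flac", ".wma"}

def has_supported_extension(location: str) -> bool:
    """True if a URL or POSIX-style path ends in one of the listed extensions."""
    # urlparse puts plain relative paths in .path as well, and drops query strings.
    path = urlparse(location).path
    return PurePosixPath(path).suffix.lower() in SUPPORTED_EXTENSIONS

ok = has_supported_extension(
    "s3://your-bucket/tts-eval/bucket_b_007_elevenlabs_rachel.wav"
)
```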

Why audio_quality, not just transcript-based rubrics

Transcript-based rubrics score what the model said. TTS rubrics score how it said it. The two are independent failure surfaces. A correct transcript with bad TTS still loses customers. The audio_quality rubric reads the actual rendered audio. It does not depend on a transcript at all.

This matters for the failure modes we listed earlier. Pronunciation drift, prosody flattening, SSML breakage. All silent on the transcript side. All visible to a rubric that ingests the audio.

Where the in-product custom evaluator authoring sits

Author the custom pronunciation, prosody, naturalness, and brand_fit rubrics in the Future AGI product. Describe the rubric in natural language; the in-product agent drafts the LLM-as-judge prompt; you review and accept; and it lands as a reusable evaluator that you attach through the configured evaluator workflow for your project. Most teams write the first rubric in the dashboard to get a working baseline before iterating.

Step 3: SSML snapshot regression

Once you have a working corpus and rubric set, freeze a subset as golden pairs. Each golden is a tuple of:

  • Canonical input text
  • Canonical SSML markup
  • Rendered audio file (the “golden audio”)
  • Baseline rubric scores at the time of freeze
  • Provider, voice, and version metadata
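The tuple above could be represented as an immutable record, so that nothing in the pipeline rewrites a baseline by accident. This is a sketch, not the SDK's data model: `GoldenPair`, `freeze_golden`, and the `voice_version` field are hypothetical names, and the code snippets elsewhere in this guide use plain dicts instead.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping

@dataclass(frozen=True)
class GoldenPair:
    sample_id: str
    input_text: str
    ssml: str
    audio_url: str
    provider: str
    voice_id: str
    voice_version: str  # hypothetical: whatever version label your provider exposes
    baseline_scores: Mapping[str, float]

def freeze_golden(sample: dict, scores: dict, voice_version: str) -> GoldenPair:
    """Copy a scored sample into an immutable golden record."""
    return GoldenPair(
        sample_id=sample["sample_id"],
        input_text=sample["input_text"],
        ssml=sample["ssml"],
        audio_url=sample["audio_url"],
        provider=sample["provider"],
        voice_id=sample["voice_id"],
        voice_version=voice_version,
        # Read-only view over a private copy, so later edits to `scores` cannot leak in.
        baseline_scores=MappingProxyType(dict(scores)),
    )

sample = {
    "sample_id": "bucket_b_007",
    "input_text": "Welcome to Kazoom Insurance. How can I help today?",
    "ssml": "<speak>Welcome to Kazoom Insurance.</speak>",
    "audio_url": "s3://your-bucket/tts-eval/bucket_b_007_elevenlabs_rachel.wav",
    "provider": "elevenlabs",
    "voice_id": "rachel",
}
scores = {"audio_quality": 0.91, "pronunciation": 0.95}  # illustrative values
golden = freeze_golden(sample, scores, voice_version="2026-03-10")
scores["audio_quality"] = 0.10  # does not affect the frozen baseline
```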

You re-run the eval on the golden corpus on a schedule (we recommend daily for high-volume voice products) and on every voice or provider change. The pattern catches the silent failure where a TTS vendor pushes a voice model update and your scores drift.

def regression_run(golden_corpus, current_provider, current_voice):
    # render_tts is a helper you supply: it renders the text plus SSML with the
    # given provider and voice, stores the audio, and returns its URL.
    rubrics = ["audio_quality", "pronunciation", "prosody", "naturalness", "brand_fit"]
    drift_events = []
    for golden in golden_corpus:
        # Re-render with current provider plus voice
        current_audio_url = render_tts(
            text=golden["input_text"],
            ssml=golden["ssml"],
            provider=current_provider,
            voice_id=current_voice,
        )
        # Score the new audio. This assumes score_sample has all five rubrics
        # attached and that the result exposes a per-rubric `scores` mapping.
        result = score_sample({**golden, "audio_url": current_audio_url})
        # Compare to the frozen baseline
        for rubric in rubrics:
            baseline = golden["baseline_scores"][rubric]
            current = result.scores[rubric]
            delta = current - baseline
            # Flag a drop of more than 0.1 or any score under the 0.7 floor
            if delta < -0.1 or current < 0.7:
                drift_events.append({
                    "sample_id": golden["sample_id"],
                    "rubric": rubric,
                    "baseline": baseline,
                    "current": current,
                    "delta": delta,
                    "audio_url": current_audio_url,
                })
    return drift_events

The thresholds in that snippet are the defaults we recommend on a 0 to 1 score scale: a drop of more than 0.1 against the frozen baseline triggers a drift event, and any absolute score below 0.7 triggers one regardless of baseline. Tighten both once your per-voice sample volume stabilizes.
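Pulling the two checks into a small pure function makes them easy to unit-test and to tighten later. A sketch, assuming scores on a 0 to 1 scale; `drift_reason` and its return labels are hypothetical names.

```python
def drift_reason(baseline, current, max_drop=0.1, floor=0.7):
    """Return why a score counts as drift, or None if it passes both checks."""
    if current < floor:
        return "below_floor"      # absolute floor, regardless of baseline
    if current - baseline < -max_drop:
        return "relative_drop"    # dropped by more than max_drop
    return None

reasons = [
    drift_reason(0.92, 0.80),  # large drop, still above the floor
    drift_reason(0.75, 0.69),  # small drop, but under the floor
    drift_reason(0.90, 0.85),  # within tolerance
]
```

Scores that land exactly on a threshold are subject to floating-point rounding, so pick test values that sit clearly on one side.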

What goes in the golden corpus

Not every sample belongs in the golden corpus. The right inclusion criteria:

  • All Bucket B (brand-critical) samples. Pronunciation drift on these is the highest-cost failure.
  • A random 20 percent of Bucket A (high-volume) samples, refreshed quarterly. Stratify by length and locale.
  • All Bucket C (SSML-marked) samples. SSML interpretation changes are the most common regression cause.

That gets most teams to a 150 to 300 sample golden corpus per voice. Manageable to render and score on a daily schedule, large enough to catch real drift.
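The inclusion criteria above can be sketched as a selection function. The bucket labels `high_volume` and `ssml_marked`, the `locale` field, and the word-count bands are assumptions made for illustration; only `brand_critical` appears in the sample record earlier in this guide.

```python
import random
from collections import defaultdict

def length_band(text):
    # Arbitrary illustrative bands; tune to your utterance distribution.
    words = len(text.split())
    return "short" if words <= 8 else "medium" if words <= 20 else "long"

def select_golden(samples, fraction=0.2, seed=0):
    """All brand-critical and SSML samples, plus a stratified slice of high-volume ones."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    golden = [s for s in samples if s["bucket"] in ("brand_critical", "ssml_marked")]
    strata = defaultdict(list)
    for s in samples:
        if s["bucket"] == "high_volume":
            strata[(s["locale"], length_band(s["input_text"]))].append(s)
    for group in strata.values():
        # At least one sample per stratum, so small strata are never dropped.
        k = max(1, round(len(group) * fraction))
        golden.extend(rng.sample(group, k))
    return golden

corpus = (
    [{"sample_id": f"a_{i}", "bucket": "high_volume", "locale": "en-US",
      "input_text": "Thanks for calling."} for i in range(10)]
    + [{"sample_id": f"b_{i}", "bucket": "brand_critical", "locale": "en-US",
        "input_text": "Welcome to Kazoom Insurance."} for i in range(2)]
    + [{"sample_id": "c_0", "bucket": "ssml_marked", "locale": "en-US",
        "input_text": "One moment please."}]
)
golden = select_golden(corpus)
```

Because every stratum contributes at least one sample, the high-volume share can end up slightly above 20 percent when there are many small strata.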

Freezing the baseline

The baseline scores are the rubric outputs at the moment you decide a voice is production-ready. You run the corpus, you listen to a stratified sample with a human panel, you confirm the voice ships, you freeze that exact set of scores as the baseline. From that moment forward, every regression run compares to the frozen numbers, not to a rolling average.

The reason matters. Rolling baselines drift with the regression itself. Frozen baselines anchor against a known-good point. If a regression is real and persistent, a rolling average normalizes it away after a week. A frozen baseline keeps screaming until you intervene.
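A small simulation makes the difference concrete. The numbers are synthetic: ten healthy days at 0.90 followed by a persistent regression to 0.75, checked against a frozen baseline and against a 7-day rolling mean with the same 0.1 drop threshold. The function name is illustrative.

```python
from statistics import mean

def alert_days(scores, frozen_baseline, window=7, max_drop=0.1):
    """Return (days alerted vs frozen baseline, days alerted vs rolling mean)."""
    frozen_alerts, rolling_alerts = [], []
    for day, score in enumerate(scores):
        if frozen_baseline - score > max_drop:
            frozen_alerts.append(day)
        history = scores[max(0, day - window):day]  # the previous `window` days
        if history and mean(history) - score > max_drop:
            rolling_alerts.append(day)
    return frozen_alerts, rolling_alerts

daily_scores = [0.90] * 10 + [0.75] * 14
frozen_alerts, rolling_alerts = alert_days(daily_scores, frozen_baseline=0.90)
```

In this run the frozen baseline alerts on every day from day 10 onward, while the rolling baseline alerts on days 10 through 12 only and then goes quiet as the regressed scores fill the window.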

Step 4: A/B compare TTS providers

The same corpus and rubric set powers a fair provider comparison. The setup:

  1. Render the entire corpus across Provider A and Provider B with matched voices (closest persona match).
  2. Score every sample with the five rubrics.
  3. Aggregate by mean score per rubric, and by failure rate at a 0.7 threshold per rubric.
  4. Run a paired human listening test on a stratified subset (we recommend 30 samples) to validate the rubric ranking matches a human panel.

from statistics import mean

def ab_compare(corpus, provider_a, voice_a, provider_b, voice_b):
    rubrics = ["audio_quality", "pronunciation", "prosody", "naturalness", "brand_fit"]
    results = {provider_a: {r: [] for r in rubrics}, provider_b: {r: [] for r in rubrics}}

    for sample in corpus:
        for provider, voice in [(provider_a, voice_a), (provider_b, voice_b)]:
            audio_url = render_tts(
                text=sample["input_text"],
                ssml=sample["ssml"],
                provider=provider,
                voice_id=voice,
            )
            scored = score_sample({**sample, "audio_url": audio_url})
            for r in rubrics:
                results[provider][r].append(scored.scores[r])

    summary = {}
    for provider in [provider_a, provider_b]:
        summary[provider] = {
            r: {
                "mean": mean(results[provider][r]),
                "failure_rate": sum(1 for s in results[provider][r] if s < 0.7) / len(results[provider][r]),
            }
            for r in rubrics
        }
    return summary

The output is a per-rubric mean and failure rate per provider. That gives you defensible numbers to share with stakeholders when you pick the production TTS vendor.
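For the paired listening test in the setup list, one simple check is how often the rubric prefers the same provider as the human panel on each pair. This is a sketch under assumed field names (`score_a`, `score_b`, `human_choice`); the scores and choices below are made up for illustration.

```python
def preference_agreement(pairs):
    """Share of pairs where the higher rubric score matches the human choice.

    Pairs with identical rubric scores are skipped because the rubric
    expresses no preference. Returns None if no pair is decidable.
    """
    agree = 0
    decided = 0
    for p in pairs:
        if p["score_a"] == p["score_b"]:
            continue
        rubric_choice = "a" if p["score_a"] > p["score_b"] else "b"
        decided += 1
        agree += rubric_choice == p["human_choice"]
    return agree / decided if decided else None

listening_pairs = [
    {"score_a": 0.90, "score_b": 0.80, "human_choice": "a"},
    {"score_a": 0.70, "score_b": 0.85, "human_choice": "b"},
    {"score_a": 0.80, "score_b": 0.75, "human_choice": "b"},
    {"score_a": 0.80, "score_b": 0.80, "human_choice": "a"},  # tie, skipped
]
agreement = preference_agreement(listening_pairs)
```

With around 30 pairs the agreement rate is a coarse signal, so treat a low value as a prompt to listen to the disagreeing samples rather than as a verdict on its own.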

Calibrated honesty on provider strengths

In our experience running this loop across customer voice stacks, the pattern is consistent. ElevenLabs ships the highest voice quality and the most realistic cloning. Cartesia ships the lowest-latency streaming TTS in the Sonic family. Neither is universally the right pick. The eval loop in this section is what produces the answer for your specific brand, your specific utterance distribution, and your specific latency budget.

Step 5: Set thresholds and alert on regression

The thresholds you set during Step 3 (0.1 relative drop, 0.7 absolute floor) are the alerting trigger. Wire them into Error Feed so drift events cluster as named issues with auto-written root cause, supporting evidence, a quick fix, and a long-term recommendation.

The Error Feed pattern for TTS works as follows. A drift event with rubric=pronunciation and delta=-0.15 lands as a span attribute on the regression run. Error Feed’s clustering layer detects a pattern across recent runs and writes the issue. Typical issue names that emerge:

  • “Pronunciation drift on brand name Kazoom after voice model v4.7”
  • “SSML break-time interpretation changed after a Cartesia Sonic model update”
  • “Prosody flattening on questions for ElevenLabs voice rachel after platform refresh”

You do not write those names. The clustering layer writes them, attaches the offending audio samples, and proposes the fix.

Hooking the alerts to engineering

The named issues route to your existing incident channel via the standard Error Feed integrations. The supporting evidence (the offending audio files, the score deltas, the SSML inputs) all attach to the issue automatically, so the on-call engineer can replay the failure without leaving the dashboard.

Step 6: Run the same loop on live production audio

The eval suite you built in steps 1 through 5 runs against any audio source. That includes live production calls captured by native voice observability.

The setup: add a Vapi, Retell AI, or LiveKit Agent Definition in the Agent Command Center. Every call captures separate assistant audio and customer audio. The same audio_quality rubric you attached to your eval project attaches to the production project. It runs on every captured assistant audio leg.

The metric you watch is the running mean of audio_quality per voice per locale per day. Drift against the rolling baseline triggers the same alert path. The golden corpus regression catches drift in your test environment; the production scoring catches drift in real customer conversations.
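If you export scored calls for your own analysis, the per voice, per locale, per day mean could be computed as below. This is a sketch over an assumed record shape (`voice_id`, `locale`, an ISO 8601 `timestamp`, and an `audio_quality` score); it is not how the dashboard computes the metric.

```python
from collections import defaultdict
from statistics import mean

def daily_means(scored_calls):
    """Mean audio_quality keyed by (voice_id, locale, date)."""
    groups = defaultdict(list)
    for call in scored_calls:
        day = call["timestamp"][:10]  # date part of an ISO 8601 timestamp
        groups[(call["voice_id"], call["locale"], day)].append(call["audio_quality"])
    return {key: mean(values) for key, values in groups.items()}

calls = [
    {"voice_id": "rachel", "locale": "en-US",
     "timestamp": "2026-03-10T14:22:00Z", "audio_quality": 0.9},
    {"voice_id": "rachel", "locale": "en-US",
     "timestamp": "2026-03-10T18:05:00Z", "audio_quality": 0.8},
    {"voice_id": "rachel", "locale": "en-GB",
     "timestamp": "2026-03-10T09:00:00Z", "audio_quality": 0.6},
]
means = daily_means(calls)
```

Keeping locale in the key is what lets the low en-GB score stand out here instead of being averaged into the en-US calls.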

Future AGI integration: the full TTS eval stack

Putting the pieces together, the TTS eval surface inside Future AGI looks like this:

+------------------------+       +------------------------+
| TTS providers          |       | Production voice calls |
| - ElevenLabs           |       | (Vapi / Retell /       |
| - Cartesia             |       |  LiveKit)              |
| - Native integrations  |       +-----------+------------+
+-----------+------------+                   |
            |                                | assistant.wav
            v                                v
+----------------------------------------------+
| ai-evaluation                                |
| - audio_quality                              |
| - pronunciation / prosody / naturalness /    |
|   brand_fit custom rubrics                   |
| - MLLMAudio (.mp3 .wav .ogg .m4a .aac .flac  |
|   .wma)                                      |
+----------------------+-----------------------+
                       |
                       v
+----------------------------------------------+
| Future AGI Observe                           |
| - Golden SSML snapshot baselines             |
| - Daily regression runs                      |
| - A/B provider comparison reports            |
| - Error Feed: clustered TTS regressions      |
|   with named issues + quick fixes            |
+----------------------------------------------+

ai-evaluation ships the rubrics under Apache 2.0. traceAI instruments the TTS provider call as a span when you want per-utterance latency and provider attribution alongside the quality scores. Future AGI Protect is built on Google’s Gemma 3n with LoRA-trained adapters per safety dimension (per arXiv 2510.13351). Rule-based Protect runs across the 4 documented safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance); ProtectFlash is the single-call binary classifier that gives you the sub-100ms inline path for outbound audio. Agent Command Center hosts the whole stack with SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certifications per futureagi.com/trust.

For multilingual TTS eval, the same loop runs across locales. translation_accuracy and cultural_sensitivity attach as additional rubrics. Future AGI Simulation generates input text in the target language across 18 pre-built personas with accent, age, gender, location, communication style, conversation speed, and background noise controls.

Where this approach falls short (calibrated)

Human MOS panels still beat rubrics on new voice selection. For picking a new production voice, run a real human panel on a small set of voices. Rubrics catch drift; humans catch the deeper aesthetic question of whether the voice fits your brand at all. Our advice: human MOS for voice selection, rubrics for everything after.
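For the human panel itself, MOS is conventionally the arithmetic mean of listener ratings on a 1 to 5 scale. A minimal sketch, assuming a flat list of ratings for one voice and a normal-approximation 95 percent interval, which is a simplification for small panels; the function name and the ratings are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def mos_with_interval(ratings, z=1.96):
    """Return (MOS, lower bound, upper bound) for 1-to-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1 to 5 scale")
    mos = mean(ratings)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return mos, mos - half_width, mos + half_width

panel_ratings = [4, 5, 4, 3, 4, 5, 4, 4]  # made-up ratings for one voice
mos, low, high = mos_with_interval(panel_ratings)
```

When two candidate voices have overlapping intervals, the panel is probably too small to separate them and a paired comparison on the same utterances is usually more informative.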

Pronunciation rubrics need style guide investment upfront. The custom pronunciation rubric is only as good as the IPA notation in your brand style guide. Brands without a documented pronunciation guide spend the first two weeks writing one. That work is reusable across providers, but it is not zero.

Streaming TTS introduces partial-audio scoring complexity. If you stream TTS audio chunk-by-chunk (Cartesia Sonic family is the common case), most teams score at the utterance boundary by concatenating chunks before running audio_quality. Per-chunk scoring is workable but not the default pattern; design your harness around utterance-boundary scoring unless you have a specific reason to score partials.
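Utterance-boundary scoring needs the chunks joined into one file first. The sketch below uses Python's standard `wave` module and assumes every chunk is a complete WAV file with the same channel count, sample width, and sample rate. Many streaming APIs emit raw PCM frames instead, in which case you would append the bytes and write a single header. `concat_wav_chunks` and `silence_chunk` are hypothetical helpers.

```python
import io
import wave

def concat_wav_chunks(chunks):
    """Join WAV-encoded chunks (bytes) into one WAV file, returned as bytes."""
    out = io.BytesIO()
    writer = None
    expected = None
    for chunk in chunks:
        with wave.open(io.BytesIO(chunk), "rb") as reader:
            fmt = (reader.getnchannels(), reader.getsampwidth(), reader.getframerate())
            if writer is None:
                expected = fmt
                writer = wave.open(out, "wb")
                writer.setnchannels(fmt[0])
                writer.setsampwidth(fmt[1])
                writer.setframerate(fmt[2])
            elif fmt != expected:
                raise ValueError("chunk audio format does not match the first chunk")
            writer.writeframes(reader.readframes(reader.getnframes()))
    if writer is None:
        raise ValueError("no chunks to concatenate")
    writer.close()  # finalizes the header; the BytesIO buffer stays open
    return out.getvalue()

def silence_chunk(n_frames, rate=16000):
    """Build a mono 16-bit WAV chunk of silence, used here as stand-in audio."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()

utterance = concat_wav_chunks([silence_chunk(160), silence_chunk(320)])
```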

Common pitfalls when building a TTS eval suite

Do not score the customer audio leg with TTS rubrics. The customer audio is human voice. Run audio_transcription on it for STT scoring instead. The TTS rubrics target the assistant audio leg.

Do not let the golden corpus go stale. Refresh Bucket A samples quarterly against your current production traffic distribution. Frozen baselines are correct; frozen corpora are not. Brand-critical and SSML samples stay stable longer.

Do not skip the human listening validation step. A rubric that disagrees with a human panel is a broken rubric. Burn a small budget on a paired listening test every quarter to confirm the rubric ranking still matches your team’s ear.

Do not run the eval on a tiny corpus. Under 50 samples per voice, statistical confidence on the score deltas is too low to alert on. Get to at least 150 samples per voice before turning on the regression alerts.
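To see why corpus size matters, compare the standard error of the mean score delta at two sizes. The deltas below are synthetic and chosen to have the same spread at both sizes; `standard_error` and `alerts_enabled` are illustrative names, and the 150 floor is the figure from the text.

```python
from math import sqrt
from statistics import stdev

MIN_SAMPLES_FOR_ALERTS = 150  # the floor suggested above

def standard_error(deltas):
    """Standard error of the mean score delta for one voice."""
    return stdev(deltas) / sqrt(len(deltas))

def alerts_enabled(deltas, minimum=MIN_SAMPLES_FOR_ALERTS):
    """Gate regression alerts until the voice has enough scored samples."""
    return len(deltas) >= minimum

small = [0.1, -0.1] * 25   # 50 samples
large = [0.1, -0.1] * 75   # 150 samples
se_small = standard_error(small)
se_large = standard_error(large)
```

With this synthetic spread the standard error falls from roughly 0.014 at 50 samples to roughly 0.008 at 150, which is the kind of margin that decides whether a 0.05 shift in the mean is distinguishable from noise.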

Do not forget locale stratification. A single mean score across locales hides drift inside one locale. Slice your dashboards by voice plus locale, alert on each combination separately.

When you have outgrown this setup

Once the six-step loop is running cleanly, the natural next move is to feed the eval results back into prompt optimization for the upstream LLM that generates the text being rendered. agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) inside the Dataset UI and as a Python library; either surface can take rubric scores as the objective signal and tune the prompts that produce TTS-friendly output (shorter sentences, brand-correct phrasings, deliberate SSML hints).

For the production monitoring playbook end-to-end, see how to monitor AI voice agents in production. For brand voice and cloning safety, see voice cloning safety and brand voice management.


Frequently asked questions

What is the single most useful rubric for TTS quality?
Start with audio_quality from the ai-evaluation SDK. It scores TTS output across clarity, prosody, and intelligibility on the rendered assistant audio, which is the failure surface most teams miss when they read transcripts only. Pair it with a brand-fit custom rubric authored against your style guide, and a pronunciation rubric for the specific brand names and product terms your assistant says aloud. Three rubrics on every generated sample get you to a defensible TTS quality baseline.
How does SSML snapshot regression actually work?
You freeze a canonical SSML input plus the rendered audio output as a golden pair. On every provider, voice, or model change you re-render the same SSML and score the new audio against the golden using audio_quality plus a similarity check on prosody and pronunciation. Any score drop past a configured threshold opens a regression. The pattern catches the silent failures that happen when a TTS vendor pushes a voice model update and your brand voice subtly shifts under you.
Do I need human raters to compute MOS scores?
Not for ongoing regression. Human MOS panels are still the gold standard for new voice selection, but they do not scale to every commit. The practical pattern in 2026 is rubric-based scoring on every build, periodic human MOS panels for new voices or major releases, and an inline audio_quality rubric on a sampled subset of production calls. Rubrics are useful for continuous regression checks; keep periodic human MOS panels to calibrate whether the rubric still matches listener judgment.
How do I A/B compare two TTS providers fairly?
Use the same set of input texts plus SSML across both providers, render the audio, attach an MLLMTestCase per audio sample, and run audio_quality plus custom pronunciation, prosody, and brand_fit rubrics on each. Aggregate by mean score and by failure rate at a configured threshold. Run a paired listening test on a subset to validate the rubric ranking matches a human panel. The whole loop runs inside the ai-evaluation SDK with MLLMAudio accepting all seven common audio formats.
What threshold should I set before alerting on a TTS regression?
Start with a 0.1 point drop on a 0 to 1 normalized audio_quality score against the rolling 7-day baseline, scoped to a specific voice plus locale. Tighten to 0.05 once your samples per voice cross a stable volume. Add a hard floor at 0.7 absolute regardless of baseline so a slow degradation does not avoid the alert. Wire the threshold check into Error Feed so the failure clusters with other audio regressions and gets a named issue with a quick fix recommendation.
Does this approach scale to multilingual TTS evaluation?
Yes. The ai-evaluation SDK ships translation_accuracy and cultural_sensitivity as built-in multilingual rubrics. Pair them with audio_quality on the rendered audio, and use Future AGI Simulation personas with the multilingual toggle to generate test inputs in the target language. Many popular languages are supported with accent and locale controls on the persona side, so you can produce a fair audio dataset across languages without flying in human voice actors for every locale.
Can I run the TTS eval suite on a live production call instead of a pre-launch sample?
Yes. The native voice observability layer captures the assistant audio leg separately from the customer audio leg on every Vapi, Retell AI, or LiveKit call. The same audio_quality rubric you run pre-launch attaches to the project and runs on every captured call. Errors cluster in Error Feed with auto-written root cause. The eval suite you build in this guide is one configuration; it runs identically on golden samples, simulation outputs, and production calls.