What Is Voice Agent Evaluation?

The practice of scoring a voice agent across ASR, audio quality, turn handling, latency, reasoning, and task completion.

Voice agent evaluation is the practice of measuring whether a real-time AI voice agent understands speech, chooses the right action, and produces clear spoken output across a live conversation. It is a voice-AI evaluation discipline that appears in audio simulation, offline eval pipelines, and production call traces. As of May 2026, the production stack for a voice agent is typically either a real-time speech-to-speech model (GPT-5.x Realtime, Gemini 3.x Live, Claude Sonnet 4.6 voice) or a chained pipeline of ASR (Deepgram Nova-3, Whisper v3, Azure Speech, AssemblyAI Universal-2), an LLM, and TTS (ElevenLabs, Cartesia Sonic, OpenAI Realtime TTS). Teams score ASR accuracy, audio quality, turn handling, latency, task completion, and transcript-grounded reasoning so that regressions in the ASR-LLM-TTS loop are caught before callers hear them. FutureAGI anchors this with the ASRAccuracy, AudioQualityEvaluator, and TaskCompletion evaluators and with LiveKitEngine simulations.

Why voice agent evaluation matters in production LLM and agent systems

Voice failures compound across the pipeline in a way text agents do not have to deal with. A small transcription error can change an account number; the LLM can reason over the wrong text; a tool call can update the wrong record; and the caller can hear a response that is hard to follow because of stress on the wrong syllable, an over-aggressive endpoint, or a barge-in that cut the user off mid-sentence. The named 2026 failure modes are transcript drift, premature turn cutoff, false barge-in, audio-quality degradation, hallucinated tool calls grounded in mis-heard speech, and false task completion where the agent says “I’ve cancelled your subscription” but the underlying tool call failed.

The pain lands on different owners. Developers see flaky call scenarios that pass in text-only runs but fail with speech: the same prompt that worked when typed fails when the user mumbles two relevant digits. SREs see p99 time-to-first-audio climb after a provider change, often by 100-300 ms, which kills perceived responsiveness even when the model itself is correct. Product teams see lower conversion or a higher hang-up rate on one accent, locale, or noisy-channel cohort that aggregate dashboards completely hide. Compliance teams lose auditability when the production artifact is only a cleaned transcript, not the audio, transcript, model reasoning, tool use trace, and final spoken reply linked through one trace.

Voice agents in 2026 are multi-step systems, not single ASR calls. A real call may include speech recognition (or a single multimodal speech-to-speech model), turn detection, context retrieval, tool calling, policy and PII checks, model fallback when latency spikes, TTS, and a barge-in handler. Logs often show repeated user corrections, long silence windows, barge-in spikes that the agent did not honour, low transcription confidence, or conversations marked “resolved” that later reopen because the action did not actually fire. Unlike transcript-only QA in Vapi dashboards, raw LiveKit logs, or Retell’s per-turn metrics, voice agent evaluation has to score audio, timing, transcript, model reasoning, tool outcome, and final audio together; anything less under-counts the failure rate.

The honest 2026 framing: text-mode RAG/agent evals do not transfer cleanly to voice. A voice agent that scores 90% on transcript-grounded Groundedness can still produce 30% caller frustration if its ASRAccuracy drops 8 points on accented speech, its time_to_first_audio exceeds 1.2 seconds, or its turn-detection model cuts off elderly callers mid-sentence. Voice evaluation is not text evaluation plus audio; it is a different scorecard.

How FutureAGI handles voice agent evaluation

FutureAGI’s approach is to treat a voice call as an evaluable run with audio, transcript, agent trajectory, and output audio attached as first-class evidence. In a pre-production workflow, an engineer defines customer personas and call goals, then runs those scenarios through the simulate-sdk LiveKitEngine, which captures both transcripts and the raw audio for every turn of every simulated call. That distinction matters: most competing voice-eval tools score the transcript and throw the audio away, which means audio-quality regressions are invisible until customers report them.

The evaluation layer then attaches specific FutureAGI evaluators. ASRAccuracy scores the speech-to-text boundary, ideally with WER/CER split out, sliced by accent, language, channel, and noise condition. AudioQualityEvaluator scores the audio-quality surface (clipping, dropout, level, sibilance, and TTS naturalness) so raw audio issues do not hide behind a cleaned transcript. Teams add TaskCompletion for whether the call goal was actually achieved (not just whether the agent said it was), ToolSelectionAccuracy for agent actions, Groundedness for transcript-grounded reasoning over retrieved context, and CaptionHallucination when generated captions or transcripts may include words that were not spoken. For end-to-end agent behaviour, TrajectoryScore evaluates the agent trajectory across all turns.
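
To make the "sliced by accent" idea concrete, here is a minimal sketch of word error rate (WER) computed per cohort in plain Python. It is not how ASRAccuracy is implemented; the function names, the cohort labels, and the sample transcripts are all illustrative assumptions.

```python
from collections import defaultdict


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_cohort(calls: list[dict]) -> dict[str, float]:
    """Corpus-level WER per cohort: total word errors / total reference words."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for call in calls:
        ref = call["reference"].lower().split()
        hyp = call["hypothesis"].lower().split()
        errors[call["cohort"]] += edit_distance(ref, hyp)
        words[call["cohort"]] += len(ref)
    return {c: errors[c] / words[c] for c in words if words[c] > 0}


# Hypothetical sample data: same utterance, two cohorts.
calls = [
    {"cohort": "us_english",
     "reference": "my account number is four two seven",
     "hypothesis": "my account number is four two seven"},
    {"cohort": "indian_english",
     "reference": "my account number is four two seven",
     "hypothesis": "my account number is for two eleven"},
]
print(wer_by_cohort(calls))
```

In this made-up sample the second cohort has two substituted words out of seven, and both substitutions land on digits of the account number, which is the kind of error a pooled average would dilute.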

The exact fields to preserve per call are the audio path (input), the ASR transcript, the expected transcript when available, turn events with timestamps, the tool use trace, the final text the agent intended to say, and the final audio path the caller actually heard. Every one of those is a span attribute via traceAI, specifically traceAI-livekit for LiveKit-based stacks, traceAI-pipecat for Pipecat, and the model SDK instrumentor for the speech-to-speech or ASR-LLM-TTS components.
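
One way to keep that field list honest is to model it as a record type and reject calls with missing evidence before they reach the evaluators. The sketch below is an assumption-laden illustration: the class names, field names, and event labels are hypothetical and are not the traceAI span schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TurnEvent:
    kind: str           # hypothetical labels, e.g. "user_speech_end", "barge_in"
    timestamp_ms: int


@dataclass
class VoiceCallRecord:
    input_audio_path: str        # audio the agent received
    asr_transcript: str          # what the ASR layer produced
    intended_reply_text: str     # final text the agent meant to say
    output_audio_path: str       # audio the caller actually heard
    expected_transcript: Optional[str] = None  # reference, when available
    turn_events: list[TurnEvent] = field(default_factory=list)
    tool_trace: list[dict] = field(default_factory=list)

    def missing_evidence(self) -> list[str]:
        """Names of required fields that are empty on this record."""
        required = {
            "input_audio_path": self.input_audio_path,
            "asr_transcript": self.asr_transcript,
            "intended_reply_text": self.intended_reply_text,
            "output_audio_path": self.output_audio_path,
            "turn_events": self.turn_events,
        }
        return [name for name, value in required.items() if not value]


# A record that kept the transcript but lost the output audio and turn events.
record = VoiceCallRecord(
    input_audio_path="calls/0001_in.wav",
    asr_transcript="i want to reschedule",
    intended_reply_text="Sure, which day works for you?",
    output_audio_path="",
)
print(record.missing_evidence())
```

The expected transcript and tool trace are left optional here because a reference transcript is not always available and not every call invokes a tool; a stricter pipeline could require them per scenario.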

A concrete 2026 workflow: a healthcare scheduling voice agent is simulated nightly against 2,000 LiveKit calls covering accents (US English, UK English, Indian English, Spanish-accented English, AAVE), background noise (quiet, traffic, restaurant, hold music), barge-in patterns (impatient interrupt, polite handoff, no interrupt), and call types (book new, reschedule, cancel, ask about insurance, ask about policy). FutureAGI records the captured audio and transcript, runs ASRAccuracy, AudioQualityEvaluator, TaskCompletion, ToolSelectionAccuracy, and TrajectoryScore, then blocks release if ASR score drops for a Spanish-accent cohort, if task completion falls on rescheduling calls, if time-to-first-audio p99 exceeds 900ms, or if any high-severity category (insurance question with stale policy) produces a confident incorrect answer. The engineer inspects failed traces in the tracing view, fixes the ASR provider route or turn-detection threshold or prompt, and reruns the regression suite before rollout.
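
The scenario axes in that workflow form a matrix, and a short sketch shows how the nightly call set could be derived from it. This assumes the 2,000 calls are sampled from the full cross product of the four axes; the identifiers and the sampling strategy are illustrative, not part of the simulate-sdk.

```python
import random
from itertools import product

# Axis values taken from the workflow description; the snake_case labels are assumptions.
ACCENTS = ["us_english", "uk_english", "indian_english", "spanish_accented_english", "aave"]
NOISE = ["quiet", "traffic", "restaurant", "hold_music"]
BARGE_IN = ["impatient_interrupt", "polite_handoff", "no_interrupt"]
CALL_TYPES = ["book_new", "reschedule", "cancel", "ask_insurance", "ask_policy"]


def build_scenarios() -> list[dict]:
    """Every combination of accent, noise, barge-in pattern, and call type."""
    return [
        {
            "id": f"{accent}|{noise}|{barge}|{call_type}",
            "accent": accent,
            "noise": noise,
            "barge_in": barge,
            "call_type": call_type,
        }
        for accent, noise, barge, call_type in product(ACCENTS, NOISE, BARGE_IN, CALL_TYPES)
    ]


scenarios = build_scenarios()

# Sample the nightly run with a fixed seed so reruns hit the same calls.
rng = random.Random(2026)
nightly_calls = rng.choices(scenarios, k=2000)

print(len(scenarios), len(nightly_calls))
```

Keeping the axis values on each scenario record is what later allows a release gate to slice results by cohort rather than by run.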

The named comparison: LiveKit Agents ships an AgentEvaluator with basic transcript metrics; Vapi exposes per-turn dashboards; Retell logs raw calls. None of these connect a voice call’s audio, transcript, tool outcome, and eval verdicts into a single linked record with offline simulation and release-gate enforcement. FutureAGI’s LiveKitEngine plus the evaluator suite is built specifically to close that loop, with all evidence preserved for audit.

The voice-agent evaluator stack in 2026

A senior engineer building a voice agent in 2026 needs a layered scorecard, not a single average. The table below maps each failure surface to the evaluator and the signals worth alerting on:

Failure surface | Evaluator | Key signal | Release-gate guidance
--- | --- | --- | ---
Speech-to-text accuracy | ASRAccuracy | WER, CER, sliced by accent / language / noise | Block on >2pp drop per cohort
Audio integrity | AudioQualityEvaluator | Clipping, dropout, SNR, TTS naturalness | Block on any clipping >0.5% of calls
Turn handling | LiveKitEngine turn events | Premature endpoint rate, missed barge-in rate, time-to-yield | Alert on premature endpoint >3%
Latency | traceAI voice spans | time_to_first_audio, tts_first_audio_ms, end-to-end turn latency | Block on p99 >1.2s for interactive
Transcript-grounded reasoning | Groundedness, AnswerRelevancy | Faithfulness to retrieved policy / context | Block on Groundedness <0.85
Tool selection | ToolSelectionAccuracy | Right tool for the right user intent | Block on safety-critical drops
Task completion | TaskCompletion | Goal achieved, not just claimed achieved | Block on regression >2pp
Trajectory across turns | TrajectoryScore | Multi-turn coherence, repair behaviour | Track per scenario family
Caption / transcript hallucination | CaptionHallucination | Words in transcript not in audio | Block on any high-severity match
PII leakage | PII evaluator | PII in transcript or TTS not redacted | Block on any high-severity leak

Averages pooled across cohorts hide most of these failures. Per-cohort thresholding is the difference between a voice agent that passes eval and a voice agent that ships.
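
A small worked example shows how a pooled average can stay flat while one cohort regresses badly. The scores, call counts, and cohort names below are invented for illustration; only the 2pp threshold comes from the table above.

```python
def weighted_mean(scores: dict[str, float], counts: dict[str, int]) -> float:
    """Pooled score across cohorts, weighted by number of calls."""
    total = sum(counts.values())
    return sum(scores[c] * counts[c] for c in counts) / total


def cohort_drops(baseline: dict[str, float], candidate: dict[str, float],
                 max_drop_pp: float = 2.0) -> list[str]:
    """Cohorts whose score fell by more than max_drop_pp percentage points."""
    return sorted(
        c for c in baseline
        if c in candidate and (baseline[c] - candidate[c]) * 100 > max_drop_pp
    )


# Hypothetical ASR accuracy per cohort for the shipped build and a candidate build.
counts = {"us_english": 800, "uk_english": 150, "spanish_accented": 50}
baseline = {"us_english": 0.950, "uk_english": 0.94, "spanish_accented": 0.90}
candidate = {"us_english": 0.955, "uk_english": 0.94, "spanish_accented": 0.82}

print(round(weighted_mean(baseline, counts), 3))   # pooled score, shipped build
print(round(weighted_mean(candidate, counts), 3))  # pooled score, candidate build
print(cohort_drops(baseline, candidate))           # cohorts that breach the gate
```

With these numbers the pooled score is the same for both builds, because a half-point gain on the largest cohort offsets an eight-point loss on the smallest one. Only the per-cohort comparison surfaces the regression.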

How to use voice agent evaluation in CI

The 2026 pattern is to run voice eval as part of CI, not as a one-off pre-launch QA. The shape of a useful CI job:

  1. Generate scenarios: define personas (calm caller, frustrated caller, elderly caller, accented caller, noisy environment) and goals (book appointment, get refund, check balance, escalate to human).
  2. Run through LiveKitEngine: the simulate sandbox replays scenarios against the candidate voice agent build and captures audio and transcripts for every turn.
  3. Score with the evaluator suite: ASRAccuracy, AudioQualityEvaluator, Groundedness, TaskCompletion, ToolSelectionAccuracy, TrajectoryScore, plus a CustomEvaluation for product-specific rules (greeting tone, regulatory disclosure presence).
  4. Diff against baseline: compare every metric to the last shipped build per cohort.
  5. Block on regression: fail the build if any cohort regresses beyond threshold, with a link to the failing trace in the agent command center.
  6. Promote learnings to the golden dataset: every production failure that escapes the gate becomes a new simulation scenario for next time.
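
Steps 4 and 5 can be sketched as a small gate function that a CI job runs after scoring. The rule table, metric names, and data shapes below are hypothetical; the limits loosely follow the thresholds table earlier in this article and should be tuned per workflow.

```python
# Hypothetical gate rules: relative rules compare to the baseline build,
# absolute rules apply to the candidate build on its own.
RULES = {
    "asr_accuracy": {"kind": "max_drop_pp", "limit": 2.0},
    "task_completion": {"kind": "max_drop_pp", "limit": 2.0},
    "groundedness": {"kind": "min_value", "limit": 0.85},
    "ttfa_p99_s": {"kind": "max_value", "limit": 1.2},
}


def gate(baseline: dict, candidate: dict) -> list[tuple[str, str]]:
    """Return (cohort, metric) pairs that breach a rule."""
    failures = []
    for cohort, metrics in candidate.items():
        for name, value in metrics.items():
            rule = RULES.get(name)
            if rule is None:
                continue
            if rule["kind"] == "max_drop_pp":
                base = baseline.get(cohort, {}).get(name)
                if base is not None and (base - value) * 100 > rule["limit"]:
                    failures.append((cohort, name))
            elif rule["kind"] == "min_value" and value < rule["limit"]:
                failures.append((cohort, name))
            elif rule["kind"] == "max_value" and value > rule["limit"]:
                failures.append((cohort, name))
    return failures


# Invented per-cohort metrics for the last shipped build and the candidate.
baseline = {
    "us_english": {"asr_accuracy": 0.95},
    "spanish_accented": {"asr_accuracy": 0.91, "task_completion": 0.88},
}
candidate = {
    "us_english": {"asr_accuracy": 0.95, "groundedness": 0.90, "ttfa_p99_s": 0.8},
    "spanish_accented": {"asr_accuracy": 0.86, "task_completion": 0.88,
                         "groundedness": 0.90, "ttfa_p99_s": 1.35},
}

failures = gate(baseline, candidate)
exit_code = 1 if failures else 0  # a CI job would pass this to sys.exit()
print(failures, exit_code)
```

Returning the breached (cohort, metric) pairs rather than a bare boolean makes it straightforward to attach the failing traces to the build report.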

How to measure or detect voice agent evaluation

Measure voice agent quality as a layered scorecard, not a single average:

  • ASRAccuracy: returns a score for speech-to-text accuracy, ideally sliced by accent, language, channel, and noise condition.
  • AudioQualityEvaluator: returns an audio-quality score so raw audio issues do not hide behind a cleaned transcript.
  • TaskCompletion: returns a verdict on whether the agent actually accomplished the user’s goal, separate from whether the agent claimed it did.
  • ToolSelectionAccuracy: scores the agent’s tool choices across the call.
  • TrajectoryScore: scores the multi-turn agent path, including planning, repair, and escalation behaviour.
  • LiveKitEngine captures: keep audio path, transcript, turn events, and scenario metadata for each simulated call.
  • Dashboard signals: p99 time-to-first-audio, average silence duration, barge-in rate, eval-fail-rate-by-cohort, and task-completion rate.
  • User proxies: hang-up rate, transfer-to-human rate, repeated corrections, thumbs-down rate, and reopened tickets after “resolved” calls.
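
For the latency signal in the list above, a minimal sketch of p99 time-to-first-audio from per-turn timestamps follows. It assumes each turn records when the user stopped speaking and when the agent's audio started; the field names are hypothetical, and the nearest-rank percentile is one of several reasonable definitions.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_to_first_audio_ms(turns: list[dict]) -> list[int]:
    """Gap between the end of user speech and the start of agent audio, per turn."""
    return [t["agent_audio_start_ms"] - t["user_speech_end_ms"] for t in turns]


# Invented turn timestamps for one call.
turns = [
    {"user_speech_end_ms": 1000, "agent_audio_start_ms": 1300},
    {"user_speech_end_ms": 5000, "agent_audio_start_ms": 5420},
    {"user_speech_end_ms": 9000, "agent_audio_start_ms": 9380},
    {"user_speech_end_ms": 14000, "agent_audio_start_ms": 15500},
    {"user_speech_end_ms": 20000, "agent_audio_start_ms": 20350},
]

ttfa = time_to_first_audio_ms(turns)
print(percentile(ttfa, 50), percentile(ttfa, 99))
```

The median here looks healthy while the p99 is dominated by a single slow turn, which is why the tail percentile rather than the mean belongs on the dashboard.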

Minimal Python:

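from fi.evals import ASRAccuracy, AudioQualityEvaluator, TaskCompletion

# Illustrative placeholder inputs for one recorded call; replace with real data.
# The shape expected for tool_outcomes is an assumption here.
call_audio = "calls/example_call.wav"
reference_transcript = "I need to reschedule my appointment to Friday."
call_transcript = "I need to reschedule my appointment to Friday."
call_goal = "Reschedule the caller's appointment"
call_tool_trace = [{"tool": "reschedule_appointment", "status": "success"}]

# One evaluator per scorecard layer.
asr = ASRAccuracy()
audio = AudioQualityEvaluator()
task = TaskCompletion()

# Score the speech-to-text boundary, the raw audio, and the call outcome.
asr_score = asr.evaluate(audio_path=call_audio, ground_truth=reference_transcript)
audio_score = audio.evaluate(audio_path=call_audio)
task_score = task.evaluate(
    transcript=call_transcript,
    goal=call_goal,
    tool_outcomes=call_tool_trace,
)
print(asr_score.score, audio_score.score, task_score.score)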

Set thresholds per workflow. A sales qualification bot, a medical scheduler, and a banking authentication agent should not share one ASR or audio-quality cutoff, because the cost of a missed digit varies by domain.

In our 2026 evals we have seen the pattern repeatedly: teams that evaluate only the transcript miss roughly 25% of caller-impacting failures, because the failure is in the audio (clipping, low TTS volume, robotic prosody, dropouts) or in the timing (long silences, premature endpoint, missed barge-in) rather than in the words. Audio and timing are not optional metrics for a voice agent; they are co-equal with transcript correctness.

Public benchmarks worth grounding against in 2026: τ-bench (Sierra, multi-turn customer-support trajectories) lets you compare task-completion rate on the same scenarios used for top-line agent comparisons (frontier sits at 55-70%), and BFCL v3 (Berkeley Function Calling Leaderboard) gives a public reference for tool-selection accuracy against which the voice agent’s ToolSelectionAccuracy results can be benchmarked per cohort.

Multi-turn vs single-turn voice eval

A 2026-specific concern: most voice eval that exists in open-source today is single-turn. It scores ASR accuracy on a clip and TTS naturalness on a clip and calls that a voice evaluation. A real production voice agent runs across 5-15 turns with state across all of them, and the failures that cost money are usually multi-turn: the agent confirms a wrong account number on turn 3 because it mis-heard turn 1; the agent says it cancelled a subscription on turn 7 when the tool call from turn 6 silently failed; the agent loses context on turn 4 after a barge-in cut it off on turn 3. Single-turn evaluation cannot see any of those. The LiveKitEngine simulation captures the entire multi-turn trajectory; TrajectoryScore evaluates whether the agent’s path made sense from turn 1 to turn 15; per-turn metrics roll up into a call-level verdict that a release gate can consume. That is the level of evaluation a 2026 voice agent needs.
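
The turn 6 / turn 7 example can be expressed as a cross-turn check that rolls up into a call-level verdict. This sketch assumes an upstream step has already extracted which action each agent utterance claims; the field names and the "success" status value are hypothetical, and this is not how TaskCompletion or TrajectoryScore work internally.

```python
def false_completions(agent_claims: list[dict], tool_calls: list[dict]) -> list[dict]:
    """Claims with no successful matching tool call at or before the claim's turn."""
    flagged = []
    for claim in agent_claims:
        confirmed = any(
            call["tool"] == claim["action"]
            and call["status"] == "success"
            and call["turn"] <= claim["turn"]
            for call in tool_calls
        )
        if not confirmed:
            flagged.append(claim)
    return flagged


def call_verdict(agent_claims: list[dict], tool_calls: list[dict]) -> str:
    """Call-level roll-up: any unconfirmed claim fails the whole call."""
    return "fail" if false_completions(agent_claims, tool_calls) else "pass"


# Invented trace: the tool call on turn 6 errored, but the agent claims success on turn 7.
tool_calls = [{"turn": 6, "tool": "cancel_subscription", "status": "error"}]
agent_claims = [{"turn": 7, "action": "cancel_subscription",
                 "text": "I've cancelled your subscription"}]

print(call_verdict(agent_claims, tool_calls))
```

A single-turn evaluator looking only at turn 7 would see a fluent, plausible sentence; the failure only becomes visible when the claim is joined to the tool trace from the earlier turn.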

Voice eval for speech-to-speech models

A 2026 architectural shift worth calling out: a growing number of production voice agents use a single speech-to-speech model (GPT-5.x Realtime, Gemini 3.x Live) rather than an ASR-LLM-TTS chain. That changes evaluation in two ways. First, there is no intermediate transcript to score directly: the model takes audio in and emits audio out, so ASRAccuracy and AudioQualityEvaluator operate on the input and output audio respectively, but there is no transcript field in between to anchor on. Second, the model can produce prosody, intonation, and emotional cues that pure-text eval cannot capture. The LiveKitEngine simulation still applies, the audio is still captured at every turn, and TaskCompletion, ToolSelectionAccuracy, and TrajectoryScore still ground on what the agent did rather than what it said, but the eval scorecard for speech-to-speech models shifts more weight onto the audio side and less onto intermediate transcript checks.

Common mistakes

Most teams under-measure voice agents because the transcript looks easier to score than the call. That is exactly the wrong shortcut.

  • Scoring only cleaned transcripts. You miss clipping, long silence, ASR substitutions, and TTS issues that the caller actually experienced. Score the audio too.
  • Using one accent cohort. Aggregate ASR scores hide regressions for dialect, language, microphone, and background-noise slices. Always slice.
  • Ignoring turn events. Premature endpointing and missed barge-in can fail a call even when every generated sentence is correct. The turn layer is its own evaluation surface.
  • Treating latency as separate from quality. A correct answer delivered after a long silence often behaves like a failure. Add time_to_first_audio to the release gate, not just transcript metrics.
  • Skipping scenario metadata. Without persona, channel, goal, and expected outcome, failures cannot be routed to the right owner or reproduced in simulation.
  • Self-reported task completion. Trusting “the agent said it did X” as a success signal is how silent failure modes ship. Always cross-check with the tool trace.
  • Ignoring caption hallucination. Generated captions or transcripts can include words that were never spoken, especially under noise. Use CaptionHallucination on the transcript-vs-audio comparison.
  • Throwing away the audio. If you only keep the transcript, you cannot reproduce the failure, debug TTS regressions, or satisfy a compliance audit. Always store the audio with the trace.
  • Running eval only pre-launch. A voice agent regresses every time the ASR provider ships a model, the TTS voice is changed, or the LLM provider tweaks a streaming behaviour. Continuous eval through LiveKitEngine is the only way to catch it.
  • No PII handling. Voice calls leak names, account numbers, and addresses constantly. Run the PII evaluator on transcripts and TTS output and redact before storage.
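
For the turn-events point in the list above, here is a rough sketch of how missed barge-in rate and time-to-yield could be derived from timestamps. The event shape and the 500 ms yield limit are assumptions made for illustration, not values taken from LiveKitEngine.

```python
from typing import Optional


def turn_handling_rates(interruptions: list[dict], max_yield_ms: int = 500) -> dict:
    """Summarise how the agent reacted when the user spoke over it.

    Each interruption has 'barge_in_ms' (user starts talking over the agent) and
    'agent_stop_ms' (agent audio stops, or None if it never yielded).
    """
    missed = 0
    yields: list[int] = []
    for event in interruptions:
        stop: Optional[int] = event["agent_stop_ms"]
        if stop is None or stop - event["barge_in_ms"] > max_yield_ms:
            missed += 1  # never yielded, or yielded too slowly to count
        if stop is not None:
            yields.append(stop - event["barge_in_ms"])
    total = len(interruptions)
    return {
        "missed_barge_in_rate": missed / total if total else 0.0,
        "mean_time_to_yield_ms": sum(yields) / len(yields) if yields else None,
    }


# Invented interruptions from one simulated call.
interruptions = [
    {"barge_in_ms": 1000, "agent_stop_ms": 1200},
    {"barge_in_ms": 5000, "agent_stop_ms": 6100},
    {"barge_in_ms": 9000, "agent_stop_ms": None},
    {"barge_in_ms": 12000, "agent_stop_ms": 12300},
]

print(turn_handling_rates(interruptions))
```

None of these numbers depend on the words spoken, which is the point: a call can fail at the turn layer even when every generated sentence is correct.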

Frequently Asked Questions

What is voice agent evaluation?

Voice agent evaluation measures whether a real-time AI voice agent understands speech, handles turns, completes the user goal, and returns clear spoken output across a live conversation.

How is voice agent evaluation different from voice agent testing?

Voice agent testing runs scenarios to expose failures. Voice agent evaluation adds repeatable scores, thresholds, and release gates for ASR accuracy, audio quality, latency, turn handling, and task completion.

How do you measure voice agent evaluation?

FutureAGI uses evaluator classes such as ASRAccuracy and AudioQualityEvaluator, plus LiveKitEngine simulations that capture audio and transcripts. Teams threshold these scores by scenario, cohort, and release.