What Is Voice Agent Quality Index?

A composite voice AI reliability score across audio quality, ASR accuracy, latency, turn handling, task completion, and escalation signals.

Voice Agent Quality Index (VAQI) is a composite voice-AI reliability metric that summarizes whether a voice agent understands speech, keeps turns natural, completes the task, and speaks back clearly. It appears in eval pipelines, LiveKit simulations, and production call traces after ASR, LLM reasoning, tool use, and TTS have all produced artifacts. A useful VAQI combines audio quality, transcription accuracy, time-to-first-audio, turn handling, task completion, and user escalation signals. FutureAGI anchors the audio-quality portion with the AudioQualityEvaluator (eval:AudioQualityEvaluator).

Why Voice Agent Quality Index Matters in Production LLM and Agent Systems

A voice agent can fail even when each isolated component looks acceptable. ASR can produce a readable transcript, the LLM can choose a plausible next action, and TTS can generate audio, yet the caller still experiences long silence, clipped first words, poor barge-in handling, or a resolved label on an unresolved task. VAQI matters because it forces teams to score the call as a user-facing system, not as disconnected ASR, LLM, and TTS benchmarks.

Ignoring VAQI usually creates two named failure modes: transcript-driven false success and latency-driven abandonment. In the first, a cleaned transcript hides audio distortion, missed digits, or turn cutoff, so the agent marks the task complete over the wrong state. In the second, every answer is textually correct but time-to-first-audio or silence duration makes users interrupt, repeat themselves, or hang up.

The pain is visible across teams. Developers see scenario flakes that do not reproduce in text tests. SREs see p99 audio latency, ASR retries, and provider-specific quality regressions. Product teams see lower containment, higher transfer-to-human rate, and repeated “are you still there” turns. Compliance teams worry when consent, billing, or medical disclosures are present in logs but not intelligible in the original call.

In 2026-era voice agents, the index is especially useful because pipelines are multi-step: LiveKit or telephony capture, voice activity detection, ASR, routing, retrieval, tool calls, policy checks, LLM response generation, and TTS. A single average score is not enough, but a weighted VAQI gives release managers one gate while preserving drill-down metrics for engineers.

How FutureAGI Handles Voice Agent Quality Index

FutureAGI treats VAQI as a composed scorecard, not a single magic evaluator. The audio-quality anchor is eval:AudioQualityEvaluator, which maps to the AudioQualityEvaluator cloud-template evaluator for the audio_quality check. In a typical workflow, a team runs voice support scenarios through the simulate-sdk LiveKitEngine, captures caller audio, agent audio, transcripts, turn events, scenario IDs, and tool traces, then attaches evaluator scores to the same run.

The VAQI calculation might weight six signals: audio_quality from AudioQualityEvaluator, transcript correctness from ASRAccuracy, spoken-output correctness from TTSAccuracy, task outcome from TaskCompletion, p99 time-to-first-audio, and escalation or abandonment rate by cohort. The exact weights should differ by domain. A banking authentication agent may weight ASR and task correctness higher than prosody. A sales qualification agent may weight latency and interruption handling more heavily because dead air kills conversion.
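A weighted aggregation of the six signals can be sketched as below. The weights and signal names are illustrative assumptions for a banking-style agent, not a FutureAGI-defined formula; each signal is assumed to be normalized to [0, 1] before combining.

```python
# Illustrative weighted VAQI aggregation. Weights and signal names are
# assumptions, not a fixed FutureAGI formula.

BANKING_WEIGHTS = {
    "audio_quality": 0.15,
    "asr_accuracy": 0.30,     # weighted up: misheard digits are costly
    "tts_accuracy": 0.10,
    "task_completion": 0.30,
    "latency": 0.10,          # 1.0 = fast p99 time-to-first-audio
    "retention": 0.05,        # 1.0 = no escalation or abandonment
}

def vaqi(signals: dict[str, float], weights: dict[str, float]) -> float:
    # Normalize by total weight so partial weight sets still score in [0, 1].
    total = sum(weights.values())
    return sum(weights[k] * signals[k] for k in weights) / total

signals = {
    "audio_quality": 0.92, "asr_accuracy": 0.88, "tts_accuracy": 0.95,
    "task_completion": 0.81, "latency": 0.74, "retention": 0.90,
}
print(round(vaqi(signals, BANKING_WEIGHTS), 3))  # 0.859
```

A sales agent would swap in a weight set that boosts latency and turn handling, per the domain differences described above.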

FutureAGI’s approach is to keep the aggregate score explainable. If VAQI drops from 0.91 to 0.82 after a TTS provider change, engineers should see whether the loss came from audio_quality, TTS match, barge-in rate, or task completion. The next action is concrete: roll back the provider route, raise an alert on the affected locale, add failed calls to a regression dataset, and rerun the same LiveKitEngine scenarios before release.
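Explainability of a drop like 0.91 to 0.82 amounts to decomposing the delta per weighted component. The component names and weights below are illustrative assumptions, not a fixed FutureAGI formula.

```python
# Sketch: attribute a VAQI change to its components. Weights and component
# names are illustrative assumptions.

WEIGHTS = {
    "audio_quality": 0.4,
    "tts_match": 0.3,
    "barge_in": 0.1,
    "task_completion": 0.2,
}

def attribute_change(before: dict, after: dict) -> dict:
    # Each entry is that component's weighted contribution to the total delta.
    return {k: round(WEIGHTS[k] * (after[k] - before[k]), 3) for k in WEIGHTS}

before = {"audio_quality": 0.95, "tts_match": 0.92, "barge_in": 0.85, "task_completion": 0.88}
after  = {"audio_quality": 0.80, "tts_match": 0.78, "barge_in": 0.84, "task_completion": 0.88}
print(attribute_change(before, after))
```

Here the breakdown shows audio_quality and tts_match carrying almost all of the loss, which points directly at the TTS provider route rather than at task logic.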

Unlike Vapi-style transcript QA, VAQI should not treat the transcript as the full truth. Transcript-only review can miss clipped disclosures, robotic speech, background-noise sensitivity, or turn timing that makes a correct answer unusable. FutureAGI connects the audio artifact, transcript, eval score, and trace ID so a failing index can be debugged at the exact call turn.
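Linking the audio artifact, transcript, eval score, and trace ID can be modeled as one record per call turn. This is a hypothetical schema for illustration, not a FutureAGI data model.

```python
from dataclasses import dataclass

# Hypothetical per-turn record tying together the artifacts needed to debug
# a failing index at the exact call turn. Not a FutureAGI schema.

@dataclass(frozen=True)
class TurnRecord:
    trace_id: str       # links back to the production call trace
    turn_index: int
    audio_uri: str      # raw call audio for this turn
    transcript: str     # ASR output for the same span
    eval_scores: dict   # e.g. {"audio_quality": 0.72}

rec = TurnRecord(
    trace_id="tr-91f3",
    turn_index=4,
    audio_uri="s3://calls/tr-91f3/turn-4.wav",
    transcript="my card ending in four two one one",
    eval_scores={"audio_quality": 0.72},
)
print(rec.trace_id, rec.eval_scores["audio_quality"])
```

With this shape, a low audio_quality score resolves to a specific audio file and transcript span instead of an opaque call-level number.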

How to Measure or Detect Voice Agent Quality Index

Measure VAQI as a weighted, cohort-aware scorecard. Start with component scores, then combine them only after each signal is independently useful:

  • AudioQualityEvaluator - scores the audio_quality component for captured or generated speech artifacts.
  • ASRAccuracy - measures whether spoken input became the expected transcript, sliced by accent, locale, device, and noise condition.
  • TTSAccuracy - checks whether generated speech matches the intended text and remains usable for the target voice persona.
  • Trace and dashboard signals - p50 and p99 time-to-first-audio, silence duration, barge-in rate, turn cutoff rate, eval-fail-rate-by-cohort, and task-completion rate.
  • User-feedback proxies - transfer-to-human rate, repeat-request rate, hang-up rate, thumbs-down rate, and reopened tickets after a “resolved” call.
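The cohort slicing mentioned above (eval-fail-rate-by-cohort) can be computed with a simple grouped fail rate. Field names here are assumptions about a call-metrics table, not a specific API.

```python
from collections import defaultdict

# Illustrative eval-fail-rate-by-cohort over per-call metrics. The field
# names ("locale", "asr_accuracy") are assumptions for the example.

def fail_rate_by_cohort(calls: list[dict], key: str, threshold: float = 0.8) -> dict:
    totals, fails = defaultdict(int), defaultdict(int)
    for c in calls:
        cohort = c[key]
        totals[cohort] += 1
        if c["asr_accuracy"] < threshold:
            fails[cohort] += 1
    return {k: fails[k] / totals[k] for k in totals}

calls = [
    {"locale": "en-US", "asr_accuracy": 0.93},
    {"locale": "en-US", "asr_accuracy": 0.88},
    {"locale": "en-IN", "asr_accuracy": 0.71},
    {"locale": "en-IN", "asr_accuracy": 0.83},
]
print(fail_rate_by_cohort(calls, "locale"))  # en-US: 0.0, en-IN: 0.5
```

The same grouping works for device, codec, or noise-condition keys; the point is that a cohort-level fail rate surfaces regressions a global average hides.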

Minimal evaluator pattern:

# Verify the constructor and evaluate() parameters against your installed
# fi SDK version; this mirrors the cloud-template audio_quality check.
from fi.evals import AudioQualityEvaluator

audio_eval = AudioQualityEvaluator()
result = audio_eval.evaluate(audio_path="calls/support-turn.wav")
print(result.score, result.reason)

Do not ship one global VAQI threshold. Set a release gate per workflow, language, channel, and risk tier. A 0.86 score may be acceptable for low-stakes restaurant bookings and unacceptable for medication instructions.
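Per-workflow gating can be as simple as a lookup keyed by workflow and locale. The thresholds below are illustrative assumptions that mirror the booking-versus-medication contrast above.

```python
# Sketch of per-workflow, per-locale release gates. Thresholds are
# illustrative assumptions, not recommended values.

GATES = {
    ("restaurant_booking", "en-US"): 0.86,       # low-stakes workflow
    ("medication_instructions", "en-US"): 0.95,  # high-risk workflow
}
DEFAULT_GATE = 0.90  # conservative fallback for unlisted workflows

def passes_gate(workflow: str, locale: str, vaqi_score: float) -> bool:
    return vaqi_score >= GATES.get((workflow, locale), DEFAULT_GATE)

print(passes_gate("restaurant_booking", "en-US", 0.86))       # True
print(passes_gate("medication_instructions", "en-US", 0.86))  # False
```

A real gate table would also key on channel and risk tier, but the lookup-with-conservative-default pattern stays the same.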

Common Mistakes

Most VAQI mistakes come from turning a useful index into a vague average.

  • Averaging away the root cause. A good aggregate can hide ASR failures for one accent, codec, or noisy mobile cohort.
  • Treating transcript quality as call quality. A readable transcript can hide clipped audio, awkward pauses, poor interruption handling, or unclear TTS.
  • Using equal weights everywhere. Billing, healthcare, support, and sales calls have different risk and latency tolerance.
  • Ignoring negative user actions. Escalations, hang-ups, and repeat prompts often reveal failures that evaluator scores miss.
  • Failing to version the formula. A changed VAQI weight without a new version makes release comparisons misleading.
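The last point, versioning the formula, can be enforced by pairing every score with the formula version that produced it. The class and weights below are hypothetical, not a FutureAGI API.

```python
from dataclasses import dataclass

# Hypothetical versioned VAQI formula so release comparisons stay honest.
# Names and weights are illustrative assumptions.

@dataclass(frozen=True)
class VaqiFormula:
    version: str
    weights: dict

    def score(self, signals: dict) -> float:
        total = sum(self.weights.values())
        return sum(w * signals[k] for k, w in self.weights.items()) / total

v1 = VaqiFormula("vaqi-1.0", {"audio_quality": 0.5, "task_completion": 0.5})
v2 = VaqiFormula("vaqi-1.1", {"audio_quality": 0.3, "task_completion": 0.7})

signals = {"audio_quality": 0.9, "task_completion": 0.7}
# Same call, different formula versions: log the version with every score,
# and never compare scores across versions.
print(v1.version, round(v1.score(signals), 3))
print(v2.version, round(v2.score(signals), 3))
```

Recording the version alongside the score makes a weight change show up as a new series in dashboards instead of a silent shift in the old one.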

Frequently Asked Questions

What is Voice Agent Quality Index (VAQI)?

Voice Agent Quality Index (VAQI) is a composite score for whether a voice agent understands speech, handles turns, completes the user goal, and responds with clear audio.

How is VAQI different from audio quality?

Audio quality measures whether speech is clear and intelligible. VAQI is broader: it combines audio quality with ASR accuracy, latency, turn handling, task completion, and escalation signals.

How do you measure VAQI?

Use FutureAGI's AudioQualityEvaluator for the audio-quality component, then combine it with ASRAccuracy, TTSAccuracy, TaskCompletion, LiveKitEngine simulation outputs, and production trace metrics.