What Is Voice Agent Quality Index?

A composite 0-100 voice-agent quality score that combines ASR accuracy, latency, audio quality, turn timing, and conversation resolution.

Voice Agent Quality Index (VAQI) is a composite voice-AI metric that rolls multiple per-call signals into a single 0-100 score. The signals usually include ASR accuracy, transcript faithfulness, time-to-first-audio, turn-timing accuracy, conversation resolution, and audio quality. It is a voice-AI release-readiness number that product, SRE, and reliability teams use to compare agent variants, call cohorts, or providers in one place. FutureAGI builds VAQI-style scores by aggregating standard evaluators with AggregatedMetric, then wiring the bundle through LiveKitEngine simulations and traceAI:livekit production traces.
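The rolled-up shape is just a weighted sum of normalized per-call component scores. A minimal illustration in plain Python — the component names, scores, and weights below are hypothetical, not the FutureAGI API:

```python
# Illustrative only: a composite 0-100 score as a weighted sum of
# per-call component scores (each already normalized to 0-100).
# Component names and weights here are hypothetical examples.
def composite_vaqi(components: dict, weights: dict) -> float:
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[name] * components[name] for name in weights)

call = {"asr": 92.0, "latency": 80.0, "audio": 88.0,
        "turn_timing": 75.0, "resolution": 90.0}
w = {"asr": 0.30, "latency": 0.10, "audio": 0.15,
     "turn_timing": 0.10, "resolution": 0.35}

print(round(composite_vaqi(call, w), 1))  # → 87.8
```

The assertion on the weights is the important design choice: a composite whose weights silently drift away from 1.0 stops being comparable across releases.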

Why VAQI Matters in Production Voice Agent Systems

A voice agent has at least five failure surfaces: bad transcripts, bad turn timing, bad latency, bad TTS audio, and bad resolution. Tracking each metric separately is the right granularity for engineers but confusing for product owners and execs. A composite score is how teams answer “is the agent better today than last week?” without a 30-row table.

The risks are also real. A composite score that hides a regression on one component is dangerous. If a new TTS provider improves audio quality but drops ASR accuracy on noisy callers, a careless VAQI definition can show “no change” while real users hang up. Engineers see this as a benchmark win that does not match support tickets; SREs see no change in dashboards while p99 latency creeps up; product owners trust the green number too long.
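The offsetting effect is easy to reproduce with plain arithmetic. With hypothetical numbers and a two-component composite for brevity:

```python
# Hypothetical numbers: a new TTS provider lifts audio quality while
# ASR accuracy drops on noisy callers. With offsetting weighted moves,
# the composite barely budges even though one component regressed badly.
weights = {"asr": 0.5, "audio": 0.5}

before = {"asr": 90.0, "audio": 80.0}
after  = {"asr": 80.0, "audio": 90.0}  # ASR regressed 10 points

def score(components):
    return sum(weights[k] * components[k] for k in weights)

print(score(before), score(after))  # → 85.0 85.0: the regression is invisible
```

This is why the per-component scores have to be stored alongside the aggregate: the headline number alone cannot distinguish "nothing changed" from "two things changed in opposite directions."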

In 2026 agentic voice stacks, VAQI also has to absorb tool-use accuracy and multi-step reasoning. A pure transcript metric does not capture that the agent picked the wrong API. A useful VAQI weights resolution and tool accuracy alongside acoustic and latency signals. FutureAGI treats VAQI as a reporting layer above per-call evaluators, never as a replacement for them.

How FutureAGI Handles VAQI

FutureAGI’s approach is to keep the components transparent and reproducible. We compose VAQI-style scores from named evaluators, store every component, and never throw away the per-call evidence that produced the headline number.

A real example: a team defines VAQI as a weighted sum of ASRAccuracy (30%), AudioQualityEvaluator (15%), ConversationResolution (35%), TTSAccuracy (10%), and a latency penalty derived from time-to-first-audio (10%). They wrap the bundle with AggregatedMetric and attach it to a Dataset via Dataset.add_evaluation. LiveKitEngine runs 2,000 simulated calls across noise cohorts, accents, and devices; the evaluation store records each component plus the aggregate. In production, traceAI:livekit instruments the same evaluators on live calls; the gateway can route premium customers through a stricter VAQI threshold using a routing policy.
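That weighted definition can be sketched in plain Python. This is not the AggregatedMetric API, and the latency-penalty curve is an assumption (full marks at or below 300 ms time-to-first-audio, zero at or above 1500 ms); only the weights come from the example above:

```python
# Sketch of the weighted VAQI definition described above.
# Assumption: latency penalty maps time-to-first-audio linearly
# from 100 (<= 300 ms) down to 0 (>= 1500 ms).
def latency_score(ttfa_ms: float) -> float:
    if ttfa_ms <= 300:
        return 100.0
    if ttfa_ms >= 1500:
        return 0.0
    return 100.0 * (1500 - ttfa_ms) / 1200

def vaqi(asr, audio, resolution, tts, ttfa_ms):
    return (0.30 * asr          # ASRAccuracy
            + 0.15 * audio      # AudioQualityEvaluator
            + 0.35 * resolution # ConversationResolution
            + 0.10 * tts        # TTSAccuracy
            + 0.10 * latency_score(ttfa_ms))

print(round(vaqi(asr=94, audio=88, resolution=90, tts=92, ttfa_ms=600), 1))  # → 89.6
```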

Unlike a black-box vendor “voice quality score”, FutureAGI’s VAQI is auditable. Engineers can see why a number dropped, slice by cohort, and run a regression-eval to confirm a fix before rollout. The LiveKitEngine TestReport keeps the audio paths, transcripts, and per-component scores for every call, so a bad VAQI is always reproducible.

How to Measure or Detect It

Build VAQI from named, transparent components:

  • ASRAccuracy for transcript fidelity vs. ground-truth or simulated reference.
  • AudioQualityEvaluator for clipping, noise, and silence.
  • ConversationResolution for end-to-end task success.
  • TTSAccuracy for output speech fidelity.
  • Latency penalty computed from time-to-first-audio and end-to-end latency span fields.
  • AggregatedMetric to combine them with explicit weights and per-call scores.

Minimal aggregator shape:

```python
from fi.evals import AggregatedMetric, ASRAccuracy, AudioQualityEvaluator

# Combine two component evaluators with explicit weights.
vaqi = AggregatedMetric(
    metrics=[ASRAccuracy(), AudioQualityEvaluator()],
    weights=[0.6, 0.4],  # weights should sum to 1.0
)

# ref, transcript, and path are placeholders for the reference text,
# the agent's transcript, and the call's audio file.
result = vaqi.evaluate(input=ref, output=transcript, audio_path=path)
print(result.score)
```

That snippet shows two of the five components. Add resolution, TTS, and latency penalties to match a production VAQI definition.

Common Mistakes

Avoid these traps when defining and tracking VAQI:

  • Hiding regressions. Without per-component visibility, a composite score can mask one component dropping.
  • Equal weights for unequal signals. Latency and resolution rarely deserve the same weight as turn-timing nuance.
  • Ignoring confidence intervals. A 1-point VAQI delta on 100 calls is noise.
  • Comparing VAQI across products. A definition tuned for support is not directly comparable to outbound sales.
  • Treating VAQI as an SLA. It is a north-star metric; SLAs should still be defined on the underlying evaluator components.
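The confidence-interval point above can be checked with a quick standard-error calculation. Assuming (hypothetically) that per-call VAQI scores spread with a standard deviation of about 10 points:

```python
import math

# Two-sample comparison of mean VAQI between variants, n calls per arm.
# A delta smaller than ~2 standard errors of the difference is noise.
# sd = 10.0 is a hypothetical per-call standard deviation.
def min_detectable_delta(sd: float, n: int, z: float = 1.96) -> float:
    return z * sd * math.sqrt(2.0 / n)

print(round(min_detectable_delta(sd=10.0, n=100), 2))   # → 2.77: 1 point is noise
print(round(min_detectable_delta(sd=10.0, n=2000), 2))  # → 0.62: 1 point is signal
```

This is also the argument for running thousands of simulated calls per cohort, as in the LiveKitEngine example above, rather than eyeballing a handful.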

Frequently Asked Questions

What is VAQI?

Voice Agent Quality Index (VAQI) is a composite 0-100 voice-AI quality score. It aggregates ASR accuracy, transcript faithfulness, latency, turn-timing, conversation resolution, and audio quality so teams can compare variants and call cohorts in one number.

How is VAQI different from ASR accuracy?

ASR accuracy measures only the transcript. VAQI rolls up ASR alongside latency, turn timing, audio quality, and call outcomes, so a variant with great transcripts but slow responses does not get a perfect score.

How do you compute a VAQI in FutureAGI?

Use `AggregatedMetric` to combine `ASRAccuracy`, `AudioQualityEvaluator`, `ConversationResolution`, and latency span fields with weights. Run the bundle through `Dataset.add_evaluation` over `LiveKitEngine` or `traceAI:livekit` traces.