What Is Voice Activity Detection?
VAD classifies audio frames as speech or non-speech so downstream voice-agent stages know when to listen, transcribe, and respond.
Voice activity detection (VAD) is the process of classifying each audio frame as speech or non-speech in real time. It is a voice-AI reliability signal that sits before ASR, endpointing, turn detection, tool calls, and LLM reasoning in a production voice agent. In FutureAGI workflows, VAD shows up as a turn-boundary risk: bad start or end events inflate latency, cut off user intent, or trigger the agent on keyboard noise. We treat it as a traceable decision, not a black-box preprocessor, and tie it to call-level evidence such as transcripts, latency, and resolution.
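For a concrete picture of the per-frame decision, here is a minimal sketch using the open-source `webrtcvad` package (one of the detectors discussed below): it labels each 30 ms frame of 16 kHz mono 16-bit PCM as speech or non-speech. The helper name is ours, for illustration.

```python
import webrtcvad

vad = webrtcvad.Vad(2)   # aggressiveness 0 (permissive) to 3 (strict)
SAMPLE_RATE = 16000      # webrtcvad accepts 8/16/32/48 kHz mono 16-bit PCM
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per sample

def speech_flags(pcm: bytes) -> list[bool]:
    """Label each complete 30 ms frame as speech (True) or non-speech."""
    return [
        vad.is_speech(pcm[i:i + FRAME_BYTES], SAMPLE_RATE)
        for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)
    ]
```

Everything downstream — endpointing, turn detection, ASR gating — consumes a stream of flags like this one.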
Why VAD Matters in Production LLM and Agent Systems
VAD failures cascade because every downstream voice stage trusts the speech boundary. If VAD misses the first 400 ms of a caller saying “do not cancel”, ASR transcribes “cancel”, and the LLM happily calls the wrong tool with a clean trace and clean logs. If VAD fires on background noise, the agent interrupts itself, starts ASR on empty audio, or spends tokens answering a question nobody asked.
The failure modes are concrete. False negatives create cut-off utterances, missing intents, and clarification loops. False positives create phantom turns, dead air, and higher time-to-first-audio. Developers see the symptom as a flaky transcript or wrong tool plan; SREs see rising p95 endpointing delay, ASR retries, and extra audio frames sent to a paid speech provider. Product teams see hangups, longer handle time, and lower completion on noisy mobile callers. Compliance teams lose evidence when a regulated user statement never reaches the transcript.
In 2026-era voice pipelines, VAD is not a small preprocessor. A turn passes through media capture, VAD, ASR, retrieval, an LLM planner, tool calls, TTS, and transport. One boundary mistake can make the whole agentic path look unreliable when the model, prompt, and tools are working.
How FutureAGI Handles VAD
FutureAGI’s approach is to treat VAD as a turn-boundary risk and connect it to transcript, latency, and call-outcome evidence. The product inventory does not list a dedicated VAD evaluator, so a practical FutureAGI workflow uses nearby surfaces: LiveKitEngine for voice simulations, traceAI:livekit and traceAI:pipecat for production call traces, ASRAccuracy for transcript impact, and AudioQualityEvaluator for the audio conditions that confuse speech detection.
A real example: a support agent is tested against callers who pause, cough, talk over the agent, and call from noisy cars. LiveKitEngine runs a Scenario set and stores transcripts, audio paths, optional eval scores, and TestReport artifacts. Engineers inspect turn boundaries against the audio timeline and slice failures by noise cohort, device, locale, and provider. If ASRAccuracy drops only when VAD starts late, the ASR model is fine; the boundary detector or endpointing timeout is the suspect.
Unlike Silero VAD or WebRTC VAD alone, which produce local speech/non-speech decisions, FutureAGI ties the boundary to downstream reliability: time-to-first-audio, missed-utterance rate, interruption rate, and task completion. The next action is operational: teams can alert on a rising false-start cohort, block a regression, tune endpointing thresholds, or replay affected calls before changing ASR, TTS, or routing.
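A rough sketch of how two of those signals might be computed from call records; the record fields (`speech_end_s`, `response_start_s`, `user_spoke`, `turn_detected`) are hypothetical names for illustration, not a FutureAGI or traceAI schema:

```python
import statistics

def endpointing_delay_p95(calls: list[dict]) -> float:
    """Seconds from actual speech end to agent response start, ~95th percentile."""
    delays = [c["response_start_s"] - c["speech_end_s"] for c in calls]
    return statistics.quantiles(delays, n=20)[-1]  # last cut point ~= p95

def missed_utterance_rate(calls: list[dict]) -> float:
    """Fraction of calls where the user spoke but no turn was detected."""
    missed = sum(1 for c in calls if c["user_spoke"] and not c["turn_detected"])
    return missed / len(calls)
```

Slicing these by cohort (noisy mobile callers, a specific locale or provider) is usually more actionable than the global average.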
How to Measure or Detect VAD
Measure VAD with labeled audio when you can, then verify its downstream effect in traces:
- Segment precision and recall comparing predicted speech windows with human-labeled boundaries; track false starts and missed speech separately (see the sketch after this list).
- Endpointing delay from actual speech end to agent response start; high delay creates dead air.
- Cut-off rate: count turns where the first or last word is missing in ASR output.
- ASRAccuracy: returns transcript accuracy and is a strong downstream proxy for VAD quality.
- AudioQualityEvaluator: scores audio issues (clipping, noise, silence) that often explain VAD regressions.
- Dashboard signals: p99 time-to-first-audio, ASR retry rate, barge-in rate, missed-utterance rate, escalation rate.
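A minimal sketch of the first metric, assuming speech windows arrive as `(start_s, end_s)` pairs: rasterize predicted and labeled windows to 10 ms frames and count overlap, keeping false starts and missed speech as separate error counts:

```python
def to_frames(windows, total_s, hop_s=0.01):
    """Rasterize (start_s, end_s) speech windows to per-frame booleans."""
    n = int(round(total_s / hop_s))
    frames = [False] * n
    for start, end in windows:
        lo = int(round(start / hop_s))
        hi = min(int(round(end / hop_s)), n)
        frames[lo:hi] = [True] * (hi - lo)
    return frames

def vad_precision_recall(predicted, labeled, total_s):
    pred = to_frames(predicted, total_s)
    ref = to_frames(labeled, total_s)
    tp = sum(p and r for p, r in zip(pred, ref))
    fp = sum(p and not r for p, r in zip(pred, ref))  # false starts / phantom speech
    fn = sum(r and not p for p, r in zip(pred, ref))  # missed speech
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: VAD starts 400 ms late on an utterance labeled 0.8-3.1 s.
print(vad_precision_recall([(1.2, 3.1)], [(0.8, 3.1)], total_s=5.0))
```

Reporting `fp` and `fn` separately matters: false starts and missed speech cause different product failures and need different thresholds.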
Minimal downstream eval:
```python
from fi.evals import ASRAccuracy

# Score the ASR output against what the caller actually said.
asr = ASRAccuracy()
result = asr.evaluate(
    input="do not cancel my card",  # reference utterance
    output="cancel my card",        # transcript after a late VAD start
)
print(result.score)
```
That score does not measure VAD directly. It captures the transcript damage a VAD cutoff causes, which routes engineers back to the audio boundary.
Common Mistakes
These mistakes make VAD look healthy in offline clips while live agents still interrupt or miss users:
- Equating VAD with endpointing. VAD labels frames; endpointing decides the turn is done (a sketch of the difference follows this list).
- Tuning only on clean studio audio. Lab clips miss car noise, hold music, speakerphone echo, and overlapping speakers.
- Averaging false starts and missed speech. They cause different product failures and need different thresholds.
- Letting silence thresholds hide latency. A conservative threshold cuts interruptions but adds dead-air seconds.
- Scoring only final transcripts. A clean transcript can still hide bad turn timing that breaks barge-in and tool flow.
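To make the first mistake concrete, here is a minimal sketch of endpointing as a timing policy layered on per-frame VAD output (flags like those from the earlier `speech_flags` sketch); the function and threshold names are illustrative:

```python
# Hypothetical sketch: endpointing as a silence timeout on top of frame-level
# VAD. `vad_frames` holds one boolean per audio frame (True = speech).
def endpoint(vad_frames: list[bool], frame_ms: int = 30,
             silence_timeout_ms: int = 600) -> int | None:
    """Return the frame index where the turn ended, or None if still open."""
    needed = silence_timeout_ms // frame_ms  # consecutive silent frames required
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i - needed + 1  # the turn ended where the silence began
    return None
```

The same sketch shows the latency trade-off behind the silence-threshold mistake: raising `silence_timeout_ms` cuts mid-sentence interruptions but adds that many milliseconds of dead air to every turn.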
Frequently Asked Questions
What is voice activity detection (VAD)?
VAD is the per-frame decision about whether audio contains human speech, silence, or noise. Voice agents use it to decide when to send audio to ASR, when a user turn begins, and when downstream tool calls and LLM reasoning can run.
How is VAD different from endpointing?
VAD is the low-level speech/non-speech classification of each frame. Endpointing layers timing rules, silence thresholds, and conversation context on top of VAD to decide when a user turn is actually finished.
How do you measure VAD quality in FutureAGI?
Run `LiveKitEngine` simulations across noisy and clean cohorts, then check `ASRAccuracy`, `AudioQualityEvaluator`, time-to-first-audio, missed-utterance rate, and barge-in rate. Compare predicted speech windows against labeled audio when ground truth exists.