What Is Voice Activity Detection?
Voice activity detection classifies audio frames as speech or non-speech so voice agents know when to listen and respond.
What Is Voice Activity Detection?
Voice activity detection (VAD) is the process of deciding which audio frames contain human speech and which are silence, noise, or non-speech sound. It is a voice-AI reliability signal that shows up before ASR, turn detection, endpointing, and downstream LLM tool calls in production traces. FutureAGI treats VAD as a boundary-quality problem: bad speech start or end events can inflate latency, cut off user intent, or trigger the agent at the wrong time.
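Pictured as code, VAD is just a per-frame classifier. The energy threshold below is an illustrative assumption, not how production detectors work; real detectors use learned or spectral features, but they emit the same output shape, one speech/non-speech decision per frame:

```python
import numpy as np

def energy_vad(frames: np.ndarray, threshold_db: float = -40.0) -> np.ndarray:
    """Toy frame classifier: `frames` is (n_frames, samples_per_frame)
    float audio in [-1, 1]. Marks a frame as speech when its RMS energy
    clears a fixed dBFS threshold. Production detectors use learned or
    spectral features, but the output shape is the same."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    return 20 * np.log10(rms) > threshold_db  # boolean speech mask, one entry per frame
```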
Why Voice Activity Detection Matters in Production LLM and Agent Systems
VAD errors cascade because every later voice-agent stage trusts the speech boundary. If VAD misses the first 400 ms of a caller saying “do not cancel,” ASR may transcribe “cancel” and the LLM can call the wrong tool with a clean trace. If VAD fires on keyboard noise, the agent may interrupt, start ASR on empty audio, or spend tokens answering a question nobody asked.
The failure modes are concrete: false negatives create cut-off utterances, missing intents, and repeated clarification loops; false positives create phantom turns, dead-air responses, and higher time-to-first-audio. Developers often see the symptom as a bad transcript or a flaky agent plan. SREs see rising p95 endpointing delay, ASR retries, jitter-sensitive call failures, or extra audio frames sent to a speech provider. Product teams see barge-in, hangups, longer handle time, and lower completion for noisy mobile callers. Compliance teams lose evidence when a regulated user statement never reaches the transcript.
In 2026-era voice pipelines, VAD is not a small preprocessor. A single turn may pass through LiveKit or Pipecat media capture, VAD, ASR, retrieval, an LLM planner, tool calls, TTS, and transport. One boundary mistake can make the whole agentic path look unreliable even when the model, prompt, and tools are working.
How FutureAGI Handles Voice Activity Detection
FutureAGI’s approach is to treat VAD as a traceable turn-boundary risk, then connect it to transcript, latency, and call-outcome evidence. The product inventory does not list a dedicated VAD evaluator, so a practical FutureAGI workflow uses nearby surfaces: LiveKitEngine for voice simulations, traceAI:livekit or traceAI:pipecat for production call traces, ASRAccuracy for transcript impact, and AudioQualityEvaluator for audio conditions that can confuse speech detection.
A real example: a support agent is tested on callers who pause, cough, talk over the agent, and speak from noisy cars. LiveKitEngine runs the Scenario set and stores the transcript, audio path, optional eval scores, and TestReport/TestCaseResult evidence. Engineers inspect turn boundaries against the audio timeline, then slice failures by noise cohort, device, locale, and provider. If ASRAccuracy drops only when VAD starts late, the ASR model may be fine; the boundary detector or endpointing timeout is the suspect.
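A sketch of that slicing step, assuming a hypothetical per-turn trace export with cohort, VAD start delay, and a cut-off flag (the field names are illustrative, not a FutureAGI schema):

```python
import pandas as pd

# Hypothetical trace export: one row per turn, with fields we assume
# the team logs from its voice traces.
turns = pd.DataFrame([
    {"cohort": "car_noise", "vad_start_delay_ms": 420, "first_word_missing": True},
    {"cohort": "quiet",     "vad_start_delay_ms": 35,  "first_word_missing": False},
    {"cohort": "car_noise", "vad_start_delay_ms": 510, "first_word_missing": True},
])

# Slice cut-off rate and late-start delay by noise cohort: if only the
# noisy cohort regresses, suspect the boundary detector, not the ASR model.
print(turns.groupby("cohort").agg(
    cutoff_rate=("first_word_missing", "mean"),
    p95_start_delay=("vad_start_delay_ms", lambda s: s.quantile(0.95)),
))
```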
Unlike Silero VAD or WebRTC VAD alone, which produce local speech/non-speech decisions, FutureAGI ties the boundary to downstream reliability: p99 time-to-first-audio, missed-utterance rate, interruption rate, and task completion. The next action is operational. Teams can alert on a rising false-start cohort, block a voice-agent regression, tune endpointing thresholds, or replay affected calls before changing ASR, TTS, or LLM routing.
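For contrast, this is what the local decision looks like on its own, using Silero VAD's published torch.hub interface (the file name and sample rate are placeholders):

```python
import torch

# Load the Silero VAD model and its helper utilities via torch.hub.
model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("call.wav", sampling_rate=16000)
# Each entry is {'start': sample, 'end': sample}: a local speech window
# with no knowledge of turns, latency budgets, or downstream outcomes.
speech_windows = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_windows)
```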
How to Measure or Detect Voice Activity Detection Failures
Measure VAD with labeled audio when you can, then verify its downstream effect in traces:
- Segment precision and recall: compare predicted speech windows with human-labeled speech boundaries; false starts and missed speech need separate counts (see the scorer sketch after this list).
- Endpointing delay: measure the time from actual speech end to agent response start; high delay creates dead air.
- Cut-off rate: count turns where the first or last word is missing in ASR output.
- ASRAccuracy: returns speech-to-text accuracy; use it as a downstream proxy when boundary mistakes damage transcripts.
- AudioQualityEvaluator: scores audio-quality issues such as silence, clipping, or noise that often explain VAD regressions.
- Dashboard signals: p99 time-to-first-audio, ASR retry rate, barge-in rate, missed-utterance rate, and escalation rate by channel.
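A minimal scorer for the first bullet, assuming predicted and labeled speech windows arrive as (start_sec, end_sec) pairs that do not overlap within each list; the helper is hypothetical, not a FutureAGI API:

```python
def overlap(a, b):
    # Length of the intersection of two (start_sec, end_sec) windows.
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def segment_precision_recall(predicted, labeled):
    """Hypothetical scorer: `predicted` and `labeled` are lists of
    (start_sec, end_sec) speech windows, non-overlapping within each list."""
    pred_total = sum(end - start for start, end in predicted)
    label_total = sum(end - start for start, end in labeled)
    matched = sum(overlap(p, l) for p in predicted for l in labeled)
    precision = matched / pred_total if pred_total else 0.0  # false starts lower this
    recall = matched / label_total if label_total else 0.0   # missed speech lowers this
    return precision, recall
```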
Minimal downstream eval shape:
```python
from fi.evals import ASRAccuracy

asr = ASRAccuracy()
result = asr.evaluate(
    input="do not cancel my card",  # what the caller actually said
    output="cancel my card",        # what ASR produced after a late VAD start
)
print(result.score)
```
That snippet does not score VAD directly. It shows the transcript failure a VAD cutoff can create, which should send the engineer back to the audio boundary.
Common Mistakes
These mistakes make VAD look healthy in offline clips while the live agent still interrupts or misses users:
- Equating VAD with endpointing. VAD labels speech frames; endpointing decides the user turn is complete.
- Tuning only on clean audio. Quiet lab clips miss car noise, hold music, speakerphone echo, and overlapping speech.
- Averaging false starts and missed speech. They create different product failures and need different thresholds.
- Letting silence thresholds hide latency. A conservative threshold can reduce interruptions while adding a full second of dead air (see the endpointer sketch after this list).
- Scoring only final transcripts. The transcript can look acceptable while turn timing still breaks barge-in, tool timing, or escalation.
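To make the first and last mistakes above concrete, here is a toy silence-timeout endpointer that consumes per-frame VAD decisions; it is an illustrative sketch, not any product's implementation. Raising silence_ms reduces false end-of-turn cuts but adds exactly that much dead air to every response:

```python
class SilenceEndpointer:
    """Toy endpointer: declares end-of-turn after `silence_ms` of
    consecutive non-speech frames following speech. VAD supplies the
    per-frame labels; the endpointer adds the timing rule."""

    def __init__(self, silence_ms: int = 800, frame_ms: int = 30):
        self.frames_needed = max(1, silence_ms // frame_ms)
        self.quiet_frames = 0
        self.in_turn = False

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD frame decision; returns True when a turn ends."""
        if is_speech:
            self.in_turn = True
            self.quiet_frames = 0
            return False
        if not self.in_turn:
            return False
        self.quiet_frames += 1
        if self.quiet_frames >= self.frames_needed:
            # Larger silence_ms: fewer interruptions, more dead air.
            self.in_turn = False
            self.quiet_frames = 0
            return True
        return False
```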
Frequently Asked Questions
What is voice activity detection (VAD)?
Voice activity detection classifies audio frames as speech, silence, noise, or other non-speech sound. In voice agents, it controls when audio should be sent to ASR, when a user turn starts, and when downstream reasoning can begin.
How is VAD different from endpointing?
VAD is the frame-level speech/non-speech decision. Endpointing uses VAD plus timing rules, turn context, and sometimes model signals to decide that the speaker has finished a turn.
How do you measure voice activity detection?
In FutureAGI, measure VAD indirectly through simulated voice traces, turn-boundary errors, ASRAccuracy, AudioQualityEvaluator, time-to-first-audio, barge-in rate, and missed-utterance rate. A dedicated VAD score should compare predicted speech segments with labeled audio boundaries.