What Is IVR?
A telephony pattern that answers calls with an automated menu — touch-tone or speech-driven — to route, authenticate, or self-serve before reaching a human.
IVR (Interactive Voice Response) is the contact-center pattern that answers a phone call with an automated system — historically a touch-tone menu, increasingly a speech-driven voice agent — to route, authenticate, or self-serve before reaching a human. In 2026 stacks, the menu is largely gone: the IVR is a streaming voice agent built from ASR, an LLM that handles intent and dialogue, MCP tool calls for account lookups, and TTS for synthesized replies. FutureAGI evaluates these IVRs with ASRAccuracy, AudioQualityEvaluator, ConversationCoherence, TaskCompletion, and the LiveKitEngine simulation surface for Persona-driven regression tests.
Why It Matters in Production LLM and Agent Systems
A bad IVR experience drives callers straight to “press 0 for an agent” — or out the door. Each ASR error, each missed turn boundary, each menu dead-end raises Average Handle Time (AHT) and erodes self-service containment. The gap between a 92% accurate IVR and a 96% accurate IVR is the difference between healthy unit economics and a contact center bleeding agent hours.
The pain spans roles. Operations leads see AHT and escalation rates climb when an IVR change ships. Engineering owns latency budgets, ASR confidence thresholds, and barge-in handling. Compliance is asked to prove a specific call was authenticated correctly and cannot, because the trace doesn’t tie recording, transcript, intent, and outcome together. Product owners watch CSAT degrade for callers with non-mainstream accents. End users — older callers, non-native speakers — silently leave for human channels.
In 2026 IVR stacks, the surface keeps widening: LiveKit-based call paths, Deepgram or Cartesia ASR, OpenAI Agents SDK or LangGraph for dialogue, MCP tool calls for CRM lookups, ElevenLabs or Cartesia TTS on the way back. Each step is a place latency, accuracy, and inclusivity can collapse. Aggregate metrics hide which step broke; trace-anchored evaluation does not.
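The per-stage latency concern above can be made operational with a simple budget check. A minimal sketch, assuming a streaming turn measured stage by stage; the stage names mirror the pipeline described, and the millisecond figures are illustrative assumptions, not vendor numbers:

```python
# Hypothetical per-stage latency budget for one streaming IVR turn.
# The millisecond figures are illustrative, not provider guarantees.
STAGE_BUDGET_MS = {
    "asr_final_transcript": 300,   # streaming ASR (e.g. Deepgram, Cartesia)
    "llm_first_token": 500,        # dialogue model time-to-first-token
    "tool_call": 400,              # MCP lookup against the CRM
    "tts_first_audio": 250,        # synthesis start (e.g. ElevenLabs)
}

def over_budget(measured_ms: dict[str, float]) -> list[str]:
    """Return the stages whose measured latency exceeds their budget."""
    return [stage for stage, ms in measured_ms.items()
            if ms > STAGE_BUDGET_MS.get(stage, float("inf"))]

turn = {"asr_final_transcript": 280, "llm_first_token": 720,
        "tool_call": 390, "tts_first_audio": 240}
print(over_budget(turn))  # only the LLM stage blew its budget
```

Flagging the single offending stage is the point: an aggregate turn latency of 1,630 ms looks plausible until you see one stage ate 44% of it.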
How FutureAGI Handles IVR
FutureAGI’s approach treats an IVR as a multi-step trajectory and evaluates every stage. At the audio stage, AudioQualityEvaluator scores input recordings. At the ASR stage, ASRAccuracy and WordErrorRate score transcription against ground truth, sliced by user.language and accent. At the dialogue stage, ConversationCoherence, CustomerAgentInterruptionHandling, and CustomerAgentLanguageHandling score multi-turn dialogue quality. At the goal stage, TaskCompletion and ConversationResolution score whether the caller’s objective was met. At the simulate stage, the LiveKitEngine runs synthetic callers (Persona × Scenario) before any production traffic — so a model swap, prompt revision, or ASR provider change is regressed against the same scenarios every release.
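One way to keep that stage-by-stage view honest is a map from pipeline stage to the evaluators named above, so a failing score points at a stage rather than at "the IVR." The evaluator names come from the text; the flat score dict and the 0.8 threshold are assumptions for this sketch, not FutureAGI API:

```python
# Illustrative stage -> evaluator mapping for trace-anchored triage.
STAGE_EVALS = {
    "audio":    ["AudioQualityEvaluator"],
    "asr":      ["ASRAccuracy", "WordErrorRate"],
    "dialogue": ["ConversationCoherence",
                 "CustomerAgentInterruptionHandling",
                 "CustomerAgentLanguageHandling"],
    "goal":     ["TaskCompletion", "ConversationResolution"],
}

def failing_stages(scores: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Given per-evaluator scores, report which pipeline stages broke."""
    return sorted({stage
                   for stage, evals in STAGE_EVALS.items()
                   for name in evals
                   if scores.get(name, 1.0) < threshold})

scores = {"ASRAccuracy": 0.74, "TaskCompletion": 0.61,
          "ConversationCoherence": 0.91}
print(failing_stages(scores))  # ['asr', 'goal']
```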
Concretely: a telco voice IVR running on the LiveKitEngine is migrating from a deterministic menu to an LLM-driven agent. The team builds 300 Scenario runs covering balance lookups, plan changes, and outage queries, with three Persona variants (clear speaker, accented speaker, slow speaker). They run ASRAccuracy, TaskCompletion, and ConversationCoherence per scenario. The dashboard shows TaskCompletion at 88% on clear speakers but 61% on accented speakers — a recognizer-side failure correlated with ASRAccuracy dropping below 0.80. The team retunes the ASR provider for the affected accents, re-runs the same 300 scenarios as a regression eval, and gates the deploy on per-persona pass rate. IVR becomes a regressionable, simulate-first property, not a Friday-night production gamble.
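The per-persona deploy gate in that walkthrough can be sketched in a few lines. The `(persona, passed)` record shape and the 0.85 gate threshold are illustrative assumptions, not FutureAGI defaults:

```python
from collections import defaultdict

def persona_pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Aggregate (persona, passed) scenario outcomes into pass rates."""
    totals, passes = defaultdict(int), defaultdict(int)
    for persona, passed in results:
        totals[persona] += 1
        passes[persona] += passed
    return {p: passes[p] / totals[p] for p in totals}

def gate_deploy(results: list[tuple[str, bool]], min_rate: float = 0.85) -> bool:
    """Block the deploy if ANY persona falls below the pass-rate bar."""
    return all(rate >= min_rate for rate in persona_pass_rates(results).values())

# Mirrors the walkthrough: 88% on clear speakers, 61% on accented speakers.
results = ([("clear", True)] * 88 + [("clear", False)] * 12 +
           [("accented", True)] * 61 + [("accented", False)] * 39)
print(gate_deploy(results))  # False: the accented persona blocks the deploy
```

Gating on the worst persona, not the average, is what keeps the accented-speaker regression from shipping.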
How to Measure or Detect It
Evaluate IVR at multiple resolutions — global containment hides which stage broke:
- ASRAccuracy / WordErrorRate: transcription accuracy vs. ground truth.
- AudioQualityEvaluator: input audio quality (noise, clipping, sample-rate sanity).
- ConversationCoherence: multi-turn dialogue quality.
- CustomerAgentInterruptionHandling: barge-in handling and turn-taking.
- TaskCompletion / ConversationResolution: caller-goal completion rates.
- Containment rate (operational signal): fraction of calls resolved without escalation to a human.
- Time-to-first-audio: caller-perceived responsiveness from input end to first synthesized output.
Minimal Python (the evaluator classes are those named above; `transcript`, `ground_truth`, `goal`, and `trace_spans` are placeholders for your own call data):

```python
from fi.evals import ASRAccuracy, TaskCompletion, ConversationCoherence

asr = ASRAccuracy()
task = TaskCompletion()
coherence = ConversationCoherence()

# transcript / ground_truth / goal / trace_spans come from your traced calls
asr_result = asr.evaluate(output=transcript, expected=ground_truth)
task_result = task.evaluate(input=goal, trajectory=trace_spans)
print(asr_result.score, task_result.score)
```
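For reference, the word error rate behind these transcription metrics is conventionally the word-level edit distance divided by the reference length. A minimal textbook implementation, not the FutureAGI `WordErrorRate` evaluator itself:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("pay my bill please", "pay my bills"))  # 2 edits / 4 words = 0.5
```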
Common Mistakes
- Reporting only containment. Containment is a business metric, not a quality metric; pair it with per-step ASRAccuracy, TaskCompletion, and dialogue scores.
- Skipping accent and language slicing. A 5% global WER often hides a 20% WER on one accent, exactly the cohort that escalates.
- Treating “press 0 for agent” as a feature. A graceful escalation is good; an IVR designed around making escalation hard is bad and shows up in CSAT.
- No simulate-first regression. Production-only IVR evaluation lets bad model changes hit real callers; use LiveKitEngine with Persona × Scenario first.
- Reusing menu-IVR metrics for conversational IVR. Menu IVRs care about node-completion rates; conversational IVRs need goal-level evals (TaskCompletion).
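The accent-slicing mistake above is cheap to avoid: compute the same WER globally and per cohort and compare. A sketch assuming per-call records with hypothetical `accent` and `wer` fields:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-call records; field names are assumptions for the sketch.
calls = [
    {"accent": "us-general", "wer": 0.04},
    {"accent": "us-general", "wer": 0.06},
    {"accent": "scottish",   "wer": 0.22},
    {"accent": "scottish",   "wer": 0.18},
]

global_wer = mean(c["wer"] for c in calls)

by_accent = defaultdict(list)
for c in calls:
    by_accent[c["accent"]].append(c["wer"])

print(round(global_wer, 3))            # 0.125 looks tolerable...
for accent, vals in by_accent.items():
    print(accent, round(mean(vals), 3))  # ...until the per-accent slice shows 0.2
```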
Frequently Asked Questions
What is IVR?
IVR (Interactive Voice Response) is the contact-center pattern that answers a phone call with an automated menu — historically touch-tone, increasingly speech-driven — to route, authenticate, or self-serve before reaching a human.
How is modern IVR different from old IVR?
Old IVR was rigid touch-tone menus. Modern IVR is a streaming voice agent: ASR, LLM intent recognition, MCP tool calls for account lookups, and TTS synthesis — handling natural language rather than menu codes.
How do you evaluate an IVR system?
FutureAGI uses `ASRAccuracy` and `WordErrorRate` for transcription, `ConversationCoherence` for dialogue, `TaskCompletion` for goal completion, and the `LiveKitEngine` for simulate-first regression testing.