Evaluating Voice AI Agents in 2026: End-Task, Pipeline-Stage, Conversation-Coherence
Voice agent eval is end-task scoring plus pipeline-stage attribution plus conversation-coherence scoring. WER is for the ASR component, not the agent. The agent’s job is task completion at sub-second turn latency with conversational stability across N turns, and a 5% WER on Common Voice can sit alongside a 32% task-failure rate in your live stack without contradiction, because the two numbers measure different things on different layers. This post is the production eval methodology: how to score the task as the launch gate, how to attribute failures back to ASR, LLM, TTS, or turn-taking, and how to score the conversation across turns instead of one turn at a time. Built for engineers running Pipecat, LiveKit, VAPI, Bland, Retell, or any real-time audio pipeline with ASR, an LLM brain, TTS, and a turn-taking layer.
The thesis
Eval the pipeline; score the task. Three scoring layers, one gate, two diagnostics.
- End-task scoring is the launch gate. conversation_resolution and task_completion answer the question the caller cares about: did the agent book the appointment, refill the prescription, route the call to the right department, finish the refund? This is the only metric that matters for a pass-fail decision.
- Pipeline-stage attribution is the diagnostic. When end-task fails, the failure happened at ASR, LLM, TTS, or the turn-taking layer. Each stage has its own per-stage rubric and its own latency budget. Stage-level scores explain end-task scores; they don’t replace them.
- Conversation-coherence scoring is the second diagnostic. Single-turn rubrics miss the failures that compound across a call: persona drift, context loss, recovery from interruption, end-of-call cleanup. Score per turn for telemetry; aggregate per call before you make any decision.
Two ways teams get this wrong. Some score only WER and assume a 95% transcript means a 95% agent. They ship a broken product. Others score only end-task and watch the dashboard flatten without any signal on where to look. They debug by playing back recordings one at a time. The fix is to keep all three layers live and route them at the right decision point.
Why WER is for the ASR, not the agent
Word Error Rate is edit distance between a transcript and a reference. It’s a fine number for the ASR component. It’s the wrong gate for the agent for four reasons.
The ASR is one stage of four. A 92% accurate transcript that drops the “re-” from “reschedule” hands the LLM the wrong intent and produces a 100% wrong end task. Conversely, a transcript that misses a filler word but preserves the slot value scores worse on WER while the agent still books the right appointment. WER moves in directions that don’t track agent success.
Curated corpora are not your traffic. Mozilla Common Voice is read speech, balanced phonemes, quiet recording, no telephony codec. Production callers think out loud, stumble through proper nouns, switch languages mid-utterance, and arrive over a G.711 codec. Public WER on the curated corpus tells you nothing about how your ASR handles “Mounjaro” pronounced by a Tamil-substrate caller in a moving vehicle.
LLM and TTS failures aren’t ASR failures. A monolingual STT model that resets context on a code-switch is an STT-architecture failure. A US-trained LLM that defaults “pavement repair” to road resurfacing instead of sidewalk repair is an LLM-interpretation failure. A TTS that pronounces a confirmation number like a phone number is a TTS failure. WER measures none of them.
Latency is invisible to WER. A perfect transcript that arrives 1.4 seconds late breaks the call. Callers start talking over the agent, barge-in fires on the response, the conversation spirals. WER on the final transcript looks great. The call still failed.
The fix isn’t to throw out WER. WER is the right number for the ASR component, kept as a per-stage diagnostic and stratified by accent group and background-noise bucket. It just isn’t the agent’s gate.
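To make the "WER moves in directions that don’t track agent success" point concrete, here is a minimal sketch of word-level WER in plain Python. The two utterance pairs are invented for illustration; they are not drawn from any corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Intent-breaking error: one substituted word, low WER, wrong task.
intent_break = word_error_rate(
    "can you please reschedule my appointment for next tuesday",
    "can you please schedule my appointment for next tuesday",
)
# Harmless error: two dropped fillers, higher WER, slot value intact.
filler_drop = word_error_rate(
    "uh can you um reschedule it",
    "can you reschedule it",
)
print(f"intent-breaking WER: {intent_break:.3f}")
print(f"filler-drop WER:     {filler_drop:.3f}")
```

The transcript that flips the intent scores better on WER than the one that only loses fillers, which is why the number belongs on the ASR stage and not on the gate.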
End-task scoring: the launch gate
The first eval that runs on every call is end-task. The whole point of the agent is whether the call resolved. Two rubrics from the ai-evaluation SDK cover the ground.
- conversation_resolution. Did the agent resolve the caller’s intent? Multi-turn, scored against the goal list of the scenario.
- task_completion. Did the agent finish the specific task (booking, refill, refund, transfer)? Partial completions and unnecessary escalations both count against the score.
Both are launch gates. A bot that posts 91% on ASR/STT_accuracy and 64% on conversation_resolution is still shipping a broken product. The transcript was fine; the agent failed the call. Don’t invert the priority.
from fi.evals import Evaluator
from fi.evals.templates import ConversationResolution, TaskCompletion

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
result = evaluator.evaluate(
    eval_templates=[ConversationResolution(), TaskCompletion()],
    inputs=[conversation_test_case],
)
Define the end task precisely in the scenario goal list. Vague task statements make conversation_resolution an LLM-judge whim. Tight goal statements (“confirm caller identity; locate booking; offer next available slot; send confirmation SMS”) make the gate stable.
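One way to keep the priority from inverting is to encode the gate so that stage scores cannot influence it. The sketch below is plain Python, independent of the SDK; the 0.85 thresholds and the SuiteScores container are hypothetical and should come from your own baseline.

```python
from dataclasses import dataclass

# Hypothetical thresholds; pick values from your own baseline runs.
END_TASK_THRESHOLDS = {"conversation_resolution": 0.85, "task_completion": 0.85}


@dataclass
class SuiteScores:
    """Mean scores for one eval run, keyed by rubric name (0.0 to 1.0)."""
    scores: dict


def launch_gate(suite: SuiteScores) -> tuple[bool, list[str]]:
    """Pass only if every end-task rubric clears its threshold.

    Stage-level scores such as ASR/STT_accuracy are ignored on purpose:
    they explain a failure but never override the gate.
    """
    failures = [
        name for name, minimum in END_TASK_THRESHOLDS.items()
        if suite.scores.get(name, 0.0) < minimum
    ]
    return (not failures, failures)


run = SuiteScores(scores={
    "ASR/STT_accuracy": 0.91,
    "conversation_resolution": 0.64,
    "task_completion": 0.88,
})
passed, failed_rubrics = launch_gate(run)
print(passed, failed_rubrics)
```

A missing end-task score counts as a failure here, so a run that silently skipped a rubric cannot pass by omission.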
Pipeline-stage attribution: which stage failed
When the end-task gate fails, the next question is which stage caused it. Each pipeline stage has its own per-stage rubric and its own latency budget. The stages map cleanly to the audio path.
| Stage | What it does | Per-stage rubric | Latency budget |
|---|---|---|---|
| ASR | Audio in, transcript out | ASRAccuracy (eval_name: ASR/STT_accuracy), audio_quality | 50-150 ms first-hypothesis |
| LLM brain | Transcript in, text response out | TaskCompletion, Groundedness, PromptAdherence | 300-500 ms first-token |
| TTS | Text in, audio out | TTSAccuracy, audio_quality on the synthesized output | 100-200 ms first-byte |
| Turn-taking | VAD, barge-in, end-of-turn | CustomerAgentInterruptionHandling, CustomerAgentTerminationHandling | 50-150 ms decision latency |
Stage attribution lets one failed call answer one question instead of four. A call that fails conversation_resolution with ASR/STT_accuracy at 0.62 on a Tamil-substrate proper noun is an ASR vocabulary gap. A call that fails conversation_resolution with ASR/STT_accuracy at 0.94 and Groundedness at 0.41 is an LLM grounding failure. Same end-task miss, different fix.
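The two examples above can be sketched as a small attribution helper. This is an illustration, not the SDK's Error Localization; the rubric-to-stage mapping follows the table, and the minimum scores are assumed values.

```python
# Hypothetical minimum scores per stage rubric; tune against your own baseline.
STAGE_RUBRICS = {
    "ASR": {"ASR/STT_accuracy": 0.85},
    "LLM": {"Groundedness": 0.70, "PromptAdherence": 0.70},
    "TTS": {"TTSAccuracy": 0.85},
    "turn-taking": {"CustomerAgentInterruptionHandling": 0.70},
}


def attribute_failure(call_scores: dict) -> list[str]:
    """Return stages whose rubric fell below its minimum, worst shortfall first."""
    shortfalls = {}
    for stage, rubrics in STAGE_RUBRICS.items():
        for rubric, minimum in rubrics.items():
            score = call_scores.get(rubric)
            if score is not None and score < minimum:
                gap = minimum - score
                shortfalls[stage] = max(gap, shortfalls.get(stage, 0.0))
    return sorted(shortfalls, key=shortfalls.get, reverse=True)


vocab_gap = attribute_failure({"ASR/STT_accuracy": 0.62, "Groundedness": 0.90})
grounding = attribute_failure({"ASR/STT_accuracy": 0.94, "Groundedness": 0.41})
print(vocab_gap, grounding)
```

Run it only on calls that already failed the end-task gate; on passing calls a low stage score is a trend to watch, not a bug to file.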
Latency attribution is the second axis. Route the LLM brain through the Future AGI gateway and every response returns x-prism-latency-ms, x-prism-model-used, x-prism-fallback-used, and x-prism-routing-strategy. The LLM portion of the per-turn budget is one header lookup. ASR and TTS latencies come from your voice framework’s own spans, which traceAI captures with gen_ai.voice.* attributes.
import litellm

litellm.api_base = "https://gateway.futureagi.com/v1"
response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "system", "content": SYSTEM_PROMPT}, *turns],
    extra_headers={"x-prism-routing-strategy": "lowest-latency"},
)
llm_latency_ms = int(
    response._hidden_params["additional_headers"]["x-prism-latency-ms"]
)
Sum the three stage latencies into per-turn end-to-end. Alert when the rolling p95 crosses one second. Without gateway headers, voice agents degrade silently as upstream model latency drifts; the first sign of trouble is a customer complaint three weeks later.
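A minimal sketch of that rolling alert, assuming you already have per-turn ASR, LLM, and TTS latencies in milliseconds. The TurnLatencyMonitor class, its window size, and the nearest-rank percentile are choices made for this example, not part of any SDK.

```python
import math
from collections import deque


class TurnLatencyMonitor:
    """Rolling p95 over the most recent turns (nearest-rank percentile)."""

    def __init__(self, window: int = 200, alert_ms: float = 1000.0):
        self.samples = deque(maxlen=window)  # oldest turns fall off automatically
        self.alert_ms = alert_ms

    def record(self, asr_ms: float, llm_ms: float, tts_ms: float) -> float:
        """Store one turn's end-to-end latency and return it."""
        total = asr_ms + llm_ms + tts_ms
        self.samples.append(total)
        return total

    def p95(self) -> float:
        if not self.samples:
            raise ValueError("no turns recorded yet")
        ordered = sorted(self.samples)
        rank = math.ceil(95 * len(ordered) / 100)  # 1-based nearest rank
        return ordered[rank - 1]

    def should_alert(self) -> bool:
        return self.p95() > self.alert_ms


monitor = TurnLatencyMonitor(window=20)
for _ in range(18):
    monitor.record(asr_ms=120, llm_ms=400, tts_ms=160)   # 680 ms turns
for _ in range(2):
    monitor.record(asr_ms=120, llm_ms=1100, tts_ms=160)  # 1380 ms turns
print(monitor.p95(), monitor.should_alert())
```

Two slow turns in twenty are enough to push the p95 past one second, while the mean of the same window would still sit at 750 ms.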
Turn-taking latency: the sub-second p95 target
Turn latency is the difference between a usable voice agent and an unusable one. Three latency bands matter, none of them aspirational.
- Under 800 ms end-to-end, the call feels human. Callers wait for the agent.
- Between 800 ms and 1.5 seconds, callers start talking over the agent. Barge-in fires. The conversation gets choppy.
- Above 1.5 seconds, the call breaks. Users hang up or escalate.
The p95 target for production is under one second; under 800 ms is the goal. ASR is hard to push below 100 ms first-hypothesis. TTS first-byte is hard to push below 150 ms with a high-quality voice. The LLM portion is where 300-700 ms of variance lives, which is why routing the brain through a low-latency gateway moves the budget more than any TTS optimization. The gateway’s x-prism-latency-ms header is the signal you act on.
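The bands and the per-stage budgets can be checked with a few lines of arithmetic. The sketch below copies the ranges from the stage table and treats the stages as strictly sequential, which is a simplifying assumption; streaming pipelines overlap stages and can come in lower.

```python
# Per-stage budgets in ms as (low, high), copied from the stage table.
STAGE_BUDGET_MS = {
    "ASR first-hypothesis": (50, 150),
    "LLM first-token": (300, 500),
    "TTS first-byte": (100, 200),
    "turn-taking decision": (50, 150),
}

best_case = sum(low for low, _ in STAGE_BUDGET_MS.values())
worst_case = sum(high for _, high in STAGE_BUDGET_MS.values())


def latency_band(end_to_end_ms: float) -> str:
    """Map an end-to-end turn latency onto the three bands described above."""
    if end_to_end_ms < 800:
        return "human"    # callers wait for the agent
    if end_to_end_ms <= 1500:
        return "choppy"   # callers talk over the agent, barge-in fires
    return "broken"       # callers hang up or escalate


print(best_case, latency_band(best_case))
print(worst_case, latency_band(worst_case))
```

With every stage at the top of its budget the turn already lands at one second, so there is no headroom left for LLM variance.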
Turn-taking correctness is a separate failure mode. The agent has to decide when the user’s turn ended, when to barge in, and when to wait through a filler. Three rubrics from the customer-agent template family cover the surface.
- CustomerAgentInterruptionHandling. Score barge-in correctness. Penalize false positives on background noise (cough, dog, traffic) and false negatives on actual user interruptions.
- CustomerAgentTerminationHandling. Score end-of-call cleanup. Did the agent close the call appropriately, log the outcome, send the follow-up?
- CustomerAgentLoopDetection. Score whether the agent recognized it was stuck in a loop and escalated or recovered instead of grinding forward.
False positives on barge-in destroy calls; false negatives destroy trust. Treat turn-taking as an eval axis, not a UX setting.
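If you have labeled calls, the two barge-in error rates are simple to compute alongside the rubric score. This is a sketch under the assumption that each candidate interruption has a ground-truth label; the BargeInEvent type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BargeInEvent:
    user_interrupted: bool  # ground truth from a labeled call
    barge_in_fired: bool    # what the turn-taking layer actually did


def barge_in_error_rates(events: list) -> dict:
    """False-positive rate on noise and false-negative rate on real interruptions."""
    noise = [e for e in events if not e.user_interrupted]
    real = [e for e in events if e.user_interrupted]
    false_positives = sum(e.barge_in_fired for e in noise)
    false_negatives = sum(not e.barge_in_fired for e in real)
    return {
        "false_positive_rate": false_positives / len(noise) if noise else 0.0,
        "false_negative_rate": false_negatives / len(real) if real else 0.0,
    }


events = (
    [BargeInEvent(False, True)] + [BargeInEvent(False, False)] * 3   # coughs, dogs, traffic
    + [BargeInEvent(True, True)] * 4 + [BargeInEvent(True, False)]   # real interruptions
)
rates = barge_in_error_rates(events)
print(rates)
```

Tracking the two rates separately matters because tuning the VAD threshold usually trades one for the other.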
Conversation coherence across N turns
Persona drift, context loss, and recovery from interruption all happen at the call level, not the turn level. Score per turn for telemetry; aggregate per call before any decision. The drift you care about typically shows up around turn six, which is invisible to any single-turn rubric.
Three rubrics carry the coherence layer.
- ConversationCoherence. Does the agent’s sequence of turns hang together as a single coherent thread? Penalizes contradictions, dropped threads, and shifts in tone or persona.
- CustomerAgentContextRetention. Did the agent retain facts from earlier turns? The caller said their name in turn one; did the agent still know it in turn five?
- CustomerAgentConversationQuality. Aggregate call-quality scoring across turns, weighted by failure severity.
The pattern we see most often: a call starts as “Maya from billing support” and ends as a generic assistant. The system prompt got crowded out by retrieved context by turn seven, the persona dropped, and the customer noticed even if the answer was technically correct. The call-level rubric catches this; the turn-level rubric never will.
from fi.evals.templates import (
    ConversationCoherence,
    CustomerAgentContextRetention,
    CustomerAgentInterruptionHandling,
    CustomerAgentTerminationHandling,
)

coherence_score = evaluator.evaluate(
    eval_templates=[
        ConversationCoherence(),
        CustomerAgentContextRetention(),
        CustomerAgentInterruptionHandling(),
        CustomerAgentTerminationHandling(),
    ],
    inputs=[conversation_test_case],
)
Roll up per-call. Alert on calls below threshold. Track per-turn for the trend line that shows you when drift starts.
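The roll-up and the drift trend line can be sketched in plain Python. The per-turn scores below are invented, and the mean aggregate, the 0.7 threshold, and the two-turn window are assumptions for illustration; the severity-weighted aggregation in CustomerAgentConversationQuality is not reproduced here.

```python
from statistics import mean
from typing import Optional


def call_score(turn_scores: list) -> float:
    """Aggregate per-turn coherence scores into one per-call number."""
    return mean(turn_scores)


def drift_onset(turn_scores: list, threshold: float = 0.7, window: int = 2) -> Optional[int]:
    """First 1-based turn of the first trailing window whose mean drops below threshold."""
    for end in range(window, len(turn_scores) + 1):
        if mean(turn_scores[end - window:end]) < threshold:
            return end - window + 1
    return None


# Invented per-turn scores for one call: stable early, degrading late.
turns = [0.95, 0.94, 0.92, 0.93, 0.90, 0.62, 0.58, 0.55]
print(f"call score:  {call_score(turns):.3f}")
print(f"drift onset: turn {drift_onset(turns)}")
```

In this example the plain mean stays near 0.80 even though the last three turns collapsed, so a per-call mean on its own can hide late drift; the onset turn is what tells you where to look.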
Persona-driven test generation with simulate-sdk
A static recording corpus rots the moment your call distribution shifts. Persona-driven generation runs against your live stack, regenerates when intents change, and produces end-to-end calls instead of transcript proxies. simulate-sdk ships the primitive directly.
A Persona carries voice traits (accent, age range, speed, background noise, multilingual toggle). A Scenario carries the intent path and the goal list. TestRunner drives the persona through the scenario against an AgentWrapper that wraps your actual agent (OpenAI, LangChain, Gemini, or Anthropic-backed). The run produces a TestReport with per-call transcripts, audio, and a span tree.
from fi.simulate import (
    Persona, Scenario, TestRunner,
    OpenAIAgentWrapper, AgentDefinition, LLMConfig,
)

personas = [
    Persona(name="tamil_substrate_f_28", traits={
        "accent": "South Indian Tamil-influenced English",
        "speed": "fast", "background_noise": "moderate",
        "multilingual": True,
    }),
    Persona(name="us_southern_m_60", traits={
        "accent": "US Southern", "speed": "slow",
        "background_noise": "quiet",
    }),
    # ... weighted by your actual traffic distribution
]

scenarios = [
    Scenario(
        description="Caller wants to reschedule an existing appointment.",
        goals=["confirm caller identity", "locate booking", "offer next slot"],
    ),
    # ... top intents from production traffic
]

agent = OpenAIAgentWrapper(AgentDefinition(
    name="reschedule-bot",
    llm_config=LLMConfig(model="gpt-4o", temperature=0.4),
    system_prompt="You are a reschedule assistant for a healthcare clinic.",
))

runner = TestRunner(agent_wrapper=agent, personas=personas, scenarios=scenarios)
report = runner.run()
print(f"Pass rate: {report.pass_rate:.0%}")
Twelve personas times five scenarios at 100 rows per pair is 6,000 fresh calls. Run time is roughly 14 hours of parallel simulation against the live agent. Each call carries TTS-rendered audio with the accent prosody, gets transcribed by your real STT, hits your real LLM, and returns through your real TTS. End-to-end, not a proxy. For accent-coverage strategy in detail, see Accent and Dialect Testing for Voice AI.
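The matrix size is worth computing before you launch a run, since it drives both cost and wall-clock time. A small sketch with placeholder persona and intent names (none of these identifiers come from simulate-sdk):

```python
from itertools import product

# Placeholder names; substitute your real personas and top intents.
persona_names = [f"persona_{i:02d}" for i in range(12)]
intent_names = [f"intent_{i}" for i in range(5)]
ROWS_PER_PAIR = 100

pairs = list(product(persona_names, intent_names))
total_calls = len(pairs) * ROWS_PER_PAIR
print(f"{len(pairs)} persona-intent pairs, {total_calls:,} calls")
```

Adding one persona costs 500 more calls at this setting, so grow the matrix along the axis where production failures actually cluster.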
Production observability with traceAI
The same evaluator pool that scored the launch matrix scores live traffic. traceAI auto-instruments OpenAI, LangChain, Groq, Gemini, Portkey, and the voice-framework-specific instrumentors. Spans land in the same backend the launch suite wrote to.
For voice agents, tag the spans with a gen_ai.voice.* namespace so the platform renders them with the audio-specific UI. A typical per-turn span carries:
gen_ai.voice.asr.provider = "whisper"
gen_ai.voice.asr.transcript = "can you check the one for next tuesday"
gen_ai.voice.asr.confidence = 0.87
gen_ai.voice.asr.audio_duration_ms = 2140
gen_ai.voice.tts.provider = "elevenlabs"
gen_ai.voice.tts.text_source = "your appointment for tuesday is confirmed"
gen_ai.voice.tts.audio_duration_ms = 1840
gen_ai.voice.turn.event = "end_of_turn"
gen_ai.voice.turn.barge_in_fired = false
gen_ai.voice.accent_class = "tamil_substrate_english"
fi.span.kind = "AGENT"
The dashboard slices conversation_resolution by accent_class and by intent. A regression on Tamil-substrate English in the reschedule intent shows up as one cell turning red, not as a customer-support escalation three weeks later.
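The slicing itself is a group-by over per-call records. The sketch below works on plain dictionaries exported from your trace backend; the record layout and the 0.8 threshold are assumptions, and the "intent" key in particular is hypothetical.

```python
from collections import defaultdict


def resolution_by_slice(calls: list) -> dict:
    """Mean conversation_resolution per (accent_class, intent) cell."""
    buckets = defaultdict(list)
    for call in calls:
        cell = (call["accent_class"], call["intent"])
        buckets[cell].append(call["conversation_resolution"])
    return {cell: sum(scores) / len(scores) for cell, scores in buckets.items()}


def red_cells(table: dict, minimum: float = 0.8) -> list:
    """Cells whose mean score fell below the minimum, sorted for stable output."""
    return sorted(cell for cell, score in table.items() if score < minimum)


calls = [
    {"accent_class": "tamil_substrate_english", "intent": "reschedule", "conversation_resolution": 0.5},
    {"accent_class": "tamil_substrate_english", "intent": "reschedule", "conversation_resolution": 0.7},
    {"accent_class": "tamil_substrate_english", "intent": "refill", "conversation_resolution": 0.9},
    {"accent_class": "us_southern", "intent": "reschedule", "conversation_resolution": 0.9},
    {"accent_class": "us_southern", "intent": "reschedule", "conversation_resolution": 1.0},
]
table = resolution_by_slice(calls)
print(red_cells(table))
```

Only the Tamil-substrate reschedule cell turns red here, even though the overall mean across all five calls is 0.8 and would pass an unsliced check.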
Failures cluster in Error Feed. HDBSCAN soft-clustering groups failed traces into recurring patterns; a Sonnet 4.5 Judge agent writes the immediate_fix field per cluster on a 30-turn budget. The clusters we see most often on voice deployments:
| Cluster | Pipeline stage | Typical immediate_fix |
|---|---|---|
| ASR drops disfluencies, breaks intent on turn two | ASR | Tune VAD silence threshold; add filler tokens to language model |
| Proper-noun mistranscription on a slot value | ASR | Add phonetic variants to custom vocabulary |
| Code-switch context reset | ASR | Swap to a multilingual STT for affected locales |
| Barge-in on background noise | turn-taking | Raise VAD noise floor for affected personas |
| Persona drift after turn six | LLM | System-prompt rewrite; reduce retrieved-context window |
| PII not redacted from call recording | compliance | Run SecretsScanner and RegexScanner before cold storage |
| LLM p95 latency drift past 700 ms | LLM | Switch routing strategy to lowest-latency; check fallback rate |
Each cluster carries a trend signal (rising, steady, falling). The cluster view is the weekly work queue. Linear is the only ticketing destination wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
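How Error Feed derives its trend signal is not documented here, so the following is only an illustration of what a rising, steady, or falling label can mean: a week-over-week comparison of failure counts per cluster with a tolerance band. The function name, the 10% tolerance, and the counts are all assumptions.

```python
def trend_signal(weekly_counts: list, tolerance: float = 0.10) -> str:
    """Label a cluster by comparing the latest week with the one before it."""
    if len(weekly_counts) < 2:
        return "steady"  # not enough history to call a direction
    previous, latest = weekly_counts[-2], weekly_counts[-1]
    if previous == 0:
        return "rising" if latest > 0 else "steady"
    change = (latest - previous) / previous
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "falling"
    return "steady"


# Invented weekly failure counts per cluster.
clusters = {
    "proper-noun mistranscription": [14, 15, 22],
    "barge-in on background noise": [30, 31, 30],
    "persona drift after turn six": [18, 12, 7],
}
for name, counts in clusters.items():
    print(f"{name}: {trend_signal(counts)}")
```

Sorting the weekly queue by rising clusters first keeps attention on regressions instead of on the largest but stable bucket.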
Where the FAGI stack fits
Four product surfaces carry the voice eval loop without glue code.
simulate-sdk generates the persona-driven calls. Persona carries voice traits; Scenario carries the goal list; TestRunner drives the matrix against your wrapped agent (OpenAI, LangChain, Gemini, Anthropic) and returns a TestReport. The matrix scales: twelve personas, five intents, 100 rows per pair, 6,000 calls run in hours.
ai-evaluation scores every call. The Apache 2.0 SDK ships conversation_resolution, task_completion, ConversationCoherence, ASR/STT_accuracy, TTSAccuracy, audio_quality, plus the customer-agent template family (CustomerAgentInterruptionHandling, CustomerAgentTerminationHandling, CustomerAgentContextRetention, CustomerAgentLoopDetection, CustomerAgentLanguageHandling) as built-in templates. Error Localization pinpoints the failing turn so triage is one click, not one playback. Pricing runs lower per-eval than Galileo Luna-2 on equivalent throughput.
traceAI extends the same scoring to production. The OpenTelemetry-based SDK auto-instruments the wrappers, emits spans with PII redaction, and ships the gen_ai.voice.* attribute namespace unique to traceAI. Dashboards slice live conversation_resolution by accent and by intent the same way the launch matrix did.
Agent Command Center runs Protect on the voice path and gives you the latency telemetry to hold a sub-second p95. Four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) plus Protect Flash score transcribed speech at 65 ms text and 107 ms image median time-to-label per arXiv 2510.13351. Prompt-injection attempts that come in through voice channels get caught before they reach the LLM; PII in transcripts gets masked before it lands in logs. The gateway is a single Go binary, Apache 2.0, benchmarked at 29k req/s and P99 21 ms with guardrails on, on t3.xlarge. Every response carries x-prism-latency-ms so per-turn p95 tracking is automatic.
The reason the loop closes inside a sprint: the four surfaces share datasets, evaluators, and a trend line. Personas feed the launch matrix. The launch matrix produces a baseline. The baseline becomes the regression suite. Production spans feed the same evaluator pool. Error Feed clusters write immediate_fix back into the work queue. Compliance: SOC 2 Type II, HIPAA, GDPR, CCPA certified per futureagi.com/trust; ISO/IEC 27001 in active audit.
Honest tradeoffs
Three calls worth naming.
Persona-driven generation is only as good as the TTS engine. Accent prosody is rendered, not recorded. For accents where the TTS layer doesn’t carry enough fidelity (some sub-regional dialects), pair the suite with a small recording dataset for the long tail. Voice cloning from ElevenLabs or Cartesia fills most of the gap; the residual is real.
End-task scoring requires precise goal statements. Vague goals make conversation_resolution an LLM-judge whim. Spend time on tight goal statements per scenario. The investment pays back the first time the suite catches a regression that WER would have missed.
Linear is the only ticketing destination wired today. Slack, GitHub, Jira, and PagerDuty are on the roadmap. If your incident workflow lives elsewhere, you’ll bridge through webhooks until those land.
Related reading
- The Voice AI Evaluation Infrastructure: Developer’s Guide: the rubric architecture this methodology sits on top of.
- Accent and Dialect Testing for Voice AI: persona matrix design for accent coverage.
- WER for Voice Agents: Beyond the 2026 Baseline: why WER is a diagnostic, not a gate.
- How to Measure Voice AI Latency: per-stage latency methodology.
Sources and references
- ai-evaluation source: github.com/future-agi/ai-evaluation (templates: ConversationResolution, TaskCompletion, ConversationCoherence, ASRAccuracy, TTSAccuracy, AudioQualityEvaluator, CustomerAgentInterruptionHandling, CustomerAgentTerminationHandling, CustomerAgentContextRetention, CustomerAgentLoopDetection, CustomerAgentLanguageHandling, CustomerAgentConversationQuality)
- simulate-sdk source: docs.futureagi.com/docs/simulation
- traceAI source: github.com/future-agi/traceAI
- Agent Command Center docs: docs.futureagi.com/docs/command-center
- Future AGI trust page: futureagi.com/trust
- Protect model family: arxiv.org/abs/2510.13351