Smart Voice AI Integration in 2026: Vapi, Retell, LiveKit, Pipecat, and How to Evaluate Voice Agents in Production
Voice AI integration in 2026: Vapi, Retell, LiveKit Agents, Pipecat code patterns plus traceAI instrumentation and FAGI audio evals for production.
Updated May 14, 2026. Vapi, Retell AI, LiveKit Agents, and Pipecat now cover most production voice agent stacks. The interesting work has moved from picking a framework to evaluating and instrumenting voice agents in production. Below: the four frameworks compared, the right pick by use case, and the instrumentation plus evaluator code we run on every voice agent we ship.

TL;DR: Voice AI integration in May 2026
| Need | Best pick | Why |
|---|---|---|
| Fastest path to production support agent | Vapi | Hosted, routes STT/LLM/TTS, dashboard, phone numbers, webhooks |
| Enterprise voice agent with SOC2 | Retell AI | Workflow builder, SOC2, structured flows, agent IDE |
| Open source WebRTC voice stack | LiveKit Agents | Apache 2.0, full transport control, agents framework |
| Open source Python pipeline | Pipecat | BSD, fine-grained frame pipelines, large adapter list |
| Lowest STT end-of-speech latency | Deepgram Flux | Purpose-built for voice agents, sub-300ms end-of-speech |
| Lowest TTS TTFA | Cartesia Sonic 4 | ~40ms TTFA, State Space Model architecture |
| Eval, simulation, observability | Future AGI | Audio evaluators, fi.simulate TestRunner, traceAI Apache 2.0 |
If you only read one row: Vapi for hosted speed, Retell for enterprise, LiveKit + Pipecat when you want full control, plus Future AGI as the evaluation and observability companion for every framework on the list.
Why voice agents need their own evaluation layer
Most LLM evaluation tooling assumes a text in, text out interaction. Voice changes the contract in four ways.
- Inputs are audio, not text. A WER of 6% on the STT side means roughly one wrong word in every 17. That single wrong word can flip an order ID, a phone number, or a yes into a no (a quick check follows this list).
- Latency is cumulative across at least three hops. STT, LLM, and TTS each add latency, and any one of them spiking past its budget breaks the call. A 95th percentile latency on the LLM alone is not enough.
- The conversation is full duplex. Barge-in, interruption handling, and turn detection have no equivalent in text chat. A perfectly faithful response that arrives 2 seconds late feels worse than a slightly wrong response that arrives in 400ms.
- The recording is the only evidence. When a voice agent fails, the audio is the trace. Without span level traces tied to the recording, root cause analysis turns into manually listening to calls.
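To make the first point concrete, here is a turn-level check using the open source jiwer package (an assumption for illustration; any WER implementation works):

```python
# A single turn scored with jiwer (assumed here; any WER library works).
import jiwer

reference  = "hi yes i would like to move my appointment to friday the fifteenth at two pm please"
hypothesis = "hi yes i would like to move my appointment to friday the fiftieth at two pm please"

print(f"WER: {jiwer.wer(reference, hypothesis):.0%}")  # WER: 6%
# One substitution in seventeen words: the aggregate metric looks fine,
# but the booking date the agent acts on is now wrong.
```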
Smart voice AI in 2026 means three things sit together: a real framework, real audio level evaluators, and real instrumentation tied back to the recording. The rest of this guide walks through that stack.
The 4 voice agent frameworks in May 2026
Vapi
Vapi is the fastest path to production for a hosted voice agent. The platform routes the STT, LLM, and TTS hops through a single API, ships phone numbers, supports tool calls and webhooks, and exposes a dashboard for call review.
The May 2026 routing list spans Deepgram, AssemblyAI, ElevenLabs, Cartesia, Hume, OpenAI Realtime, and the major LLM providers. The killer feature is provider swap: you can switch STT or TTS providers without rewriting the agent.
Best for: support agents, outbound calling, IVR replacement, appointment booking. See the Vapi docs for the current API surface.
Retell AI
Retell AI focuses on enterprise voice agents with a structured workflow builder, SOC2 compliance, and an agent IDE. The conversation flow is modeled explicitly with nodes, transitions, and conditions, which makes regression testing easier than a free-form prompt.
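To see why explicit structure regression-tests well, here is a conceptual node-and-transition sketch in plain Python. It mirrors the idea of a workflow builder; it is not Retell's actual schema.

```python
# Conceptual node/transition flow, NOT Retell's actual schema.
booking_flow = {
    "greet":        {"prompt": "Hi, how can I help?",      "on": {"book": "collect_date"}},
    "collect_date": {"prompt": "What day works for you?",  "on": {"date_given": "confirm"}},
    "confirm":      {"prompt": "Booking {date}. Correct?", "on": {"yes": "done", "no": "collect_date"}},
}

def next_node(flow: dict, current: str, intent: str) -> str:
    """Deterministic transition lookup, trivially assertable in a regression test."""
    return flow[current]["on"].get(intent, current)

assert next_node(booking_flow, "confirm", "no") == "collect_date"
```

Because every transition is data rather than free-form prompt behavior, a test suite can assert the whole flow without running a single LLM call.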
The platform is more opinionated than Vapi: less provider choice, more structure around the agent. Teams that want a clear path to compliance and audit usually pick Retell first.
Best for: regulated industries, scheduled outbound campaigns, structured intake flows. See the Retell docs for the API.
LiveKit Agents
LiveKit Agents is the open source pick when you want full control of the transport layer. The framework runs on top of LiveKit’s WebRTC stack (Apache 2.0) and ships a Python agents framework with adapters for the major STT, LLM, and TTS providers.
LiveKit ships room recording, server-side mixing, and SIP integration, which is why it shows up in production stacks that need both voice agents and human-in-the-loop handoff in the same session.
Best for: open source stacks, custom WebRTC use cases, voice plus human handoff. See LiveKit Agents and the Apache 2.0 LICENSE.
Pipecat
Pipecat (from Daily) is an open source Python framework for real-time voice and multimodal agents. The pipeline is a chain of frame processors covering VAD, STT, LLM, TTS, and tool use. Where LiveKit owns the WebRTC layer, Pipecat owns the pipeline orchestration.
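The core idea, stripped of the framework, is that each stage consumes a stream of frames and yields a new one, and the pipeline is their composition. A framework-free sketch (Pipecat's real processors are classes with richer frame types; see the repo for the actual API):

```python
from typing import Callable, Iterable

# Each stage consumes frames and yields frames; the pipeline composes stages.
Stage = Callable[[Iterable], Iterable]

def run_pipeline(stages: list, frames: Iterable) -> Iterable:
    for stage in stages:
        frames = stage(frames)
    return frames

def upper_stage(frames):  # stand-in for a real VAD/STT/LLM/TTS processor
    for frame in frames:
        yield frame.upper()

print(list(run_pipeline([upper_stage], ["hello caller"])))  # ['HELLO CALLER']
```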
Pipecat ships a long list of adapters (Daily, LiveKit, Twilio, Deepgram, AssemblyAI, ElevenLabs, Cartesia, OpenAI, Anthropic, Gemini), which makes it the right pick when you want fine-grained Python control over the agent loop.
Best for: research, custom pipelines, multimodal experiments. See Pipecat on GitHub and the BSD LICENSE.
Voice agent latency budget in 2026
A natural voice-to-voice round trip targets roughly 800ms, with 600 to 1,000ms as the acceptable operating range. The cumulative budget breaks down roughly as follows:
| Hop | Budget | May 2026 picks |
|---|---|---|
| End-of-speech detection (STT) | 150 to 300ms | Deepgram Flux, ElevenLabs Scribe v2 Realtime |
| LLM first token | 300 to 500ms | Groq-hosted Llama 4.x, gpt-5-2025-08-07 streaming, claude-opus-4-7 streaming |
| TTS time-to-first-audio | 40 to 200ms | Cartesia Sonic 4 (~40ms), ElevenLabs Flash v2.5 (~75ms), Deepgram Aura-2 (sub-200ms) |
| Network + barge-in | 50 to 100ms | LiveKit / Daily transport |
| Total | ~600 to 1,000ms | |
When the budget breaks, it usually breaks at the LLM hop. The fix is usually one of three: switch to a faster model, prefetch the first sentence with speculative decoding, or move some of the agent logic to a lighter classifier.
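A minimal budget check against the table above might look like this. The hop names match the traceAI spans shown in the next section; the thresholds are illustrative, not prescriptive:

```python
# Hypothetical per-hop budgets in ms, mirroring the table above.
HOP_BUDGETS_MS = {
    "voice.stt": 300,   # end-of-speech detection
    "voice.llm": 500,   # LLM first token
    "voice.tts": 200,   # time-to-first-audio
    "transport": 100,   # network + barge-in
}

def over_budget_hops(measured_ms: dict) -> list:
    """Return every hop whose measured latency exceeded its budget."""
    return [
        hop for hop, budget in HOP_BUDGETS_MS.items()
        if measured_ms.get(hop, 0) > budget
    ]

# A healthy STT and TTS cannot rescue a spiking LLM hop:
print(over_budget_hops({"voice.stt": 240, "voice.llm": 1900, "voice.tts": 80}))
# ['voice.llm']
```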
Future AGI’s audio level evaluators run continuously over production calls and flag spans where any one hop exceeds its budget. See the latency tracking patterns in our AI agent cost optimization observability guide.
Instrumenting voice agents with traceAI
traceAI is Future AGI’s open source (Apache 2.0) OpenTelemetry instrumentation. The same pattern wraps Vapi, Retell, LiveKit, and Pipecat.
```python
from fi_instrumentation import register, FITracer
from fi_instrumentation.fi_types import ProjectType

# Register once at agent boot
tracer_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="voice-agent-prod",
)
tracer = FITracer(tracer_provider.get_tracer(__name__))

async def handle_voice_turn(audio_bytes, session_id):
    with tracer.start_as_current_span("voice.turn") as span:
        span.set_attribute("session.id", session_id)
        with tracer.start_as_current_span("voice.stt"):
            transcript = await stt_client.transcribe(audio_bytes)
        with tracer.start_as_current_span("voice.llm") as llm_span:
            llm_span.set_attribute("input.value", transcript)
            response = await llm.generate(transcript)
            llm_span.set_attribute("output.value", response)
        with tracer.start_as_current_span("voice.tts"):
            audio_out = await tts_client.synthesize(response)
        return audio_out
```
The FI_API_KEY and FI_SECRET_KEY environment variables authenticate trace export. Every span shows up in the Future AGI dashboard tied to the recording, with per-span latency, cost, and any evaluators that ran on the output.
For Vapi the same idea applies on the webhook handler. For LiveKit Agents, wrap the agent’s on_turn callback. For Pipecat, wrap the frame processor process_frame method.
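As an example of the webhook case, a handler can reuse the traced turn function from above. The route and payload fields here are hypothetical, not the actual Vapi webhook contract:

```python
# Illustrative webhook wrapper; route and payload fields are hypothetical.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/voice/webhook")
async def voice_webhook(request: Request):
    payload = await request.json()
    with tracer.start_as_current_span("voice.webhook") as span:
        span.set_attribute("session.id", payload.get("call_id", "unknown"))
        # Delegate to the traced turn handler defined earlier
        audio_out = await handle_voice_turn(payload["audio"], payload.get("call_id"))
    return {"status": "ok", "audio_bytes": len(audio_out)}
```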
Voice agent evaluators that actually predict production
Six evaluators run continuously on every call in a well-instrumented voice agent: the four text-level checks below, plus audio-level WER and audio quality scoring covered afterward.
```python
from fi.evals import evaluate

# 1. Faithfulness of the LLM response to retrieval context
faith_result = evaluate(
    "faithfulness",
    output=llm_response,
    context=retrieved_context,
)

# 2. Task completion at end of call
completion = evaluate(
    "task_completion",
    output=full_transcript,
    expected="appointment booked",
)

# 3. Toxicity on every assistant turn
toxicity = evaluate(
    "toxicity",
    output=llm_response,
)

# 4. Hallucination check against retrieved facts
hallucination = evaluate(
    "hallucination",
    output=llm_response,
    context=knowledge_base_chunk,
)
```
For audio-level scoring (transcription accuracy, audio quality, pronunciation), pair the cloud audio evaluators documented at docs.futureagi.com with the same evaluate() call shape against the audio URL or the transcript reference.
The cloud evaluators run in different latency tiers: turing_flash at roughly 1 to 2 seconds, turing_small at roughly 2 to 3 seconds, and turing_large at roughly 3 to 5 seconds (see Future AGI cloud evals docs). For inline use in the voice hot path, run turing_flash evaluators asynchronously per turn and use the result to flag the next turn or end the call. For deeper analysis, run turing_large evaluators offline on the full transcript after the call ends.
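One way to keep turing_flash out of the hot path is to offload the evaluate() call and collect results between turns. A sketch, assuming evaluate() is a blocking call as in the snippets above:

```python
import asyncio
from fi.evals import evaluate

# Offload the blocking evaluate() call so the voice hot path never waits on it.
async def score_turn(llm_response: str, context: str):
    return await asyncio.to_thread(
        evaluate, "faithfulness", output=llm_response, context=context
    )

async def on_assistant_turn(llm_response, context, pending_evals: list):
    # Schedule the eval; inspect pending_evals before the next turn or at call end
    pending_evals.append(asyncio.create_task(score_turn(llm_response, context)))
```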
For a deeper look at evaluator selection see our custom LLM eval metrics best practices and deterministic LLM evaluation metrics guides.
Pre-production simulation with fi.simulate
Voice agents fail in production on edge cases that look obvious in hindsight: a caller speaking a digit at a time, a barge-in halfway through a sentence, an out-of-vocabulary product name. Manual testing covers a few of these. Simulation covers thousands.
```python
from fi.simulate import TestRunner, AgentInput, AgentResponse

def my_voice_agent(payload: AgentInput) -> AgentResponse:
    transcript = stt_client.transcribe(payload.audio)
    response_text = llm.generate(transcript)
    # Audio is synthesized for realism; the runner scores the returned text
    audio_out = tts_client.synthesize(response_text)
    return AgentResponse(text=response_text)

runner = TestRunner(
    agent=my_voice_agent,
    personas=["impatient_caller", "domain_expert", "adversarial_caller"],
    scenarios=[
        "caller speaks the date as next tuesday",
        "caller barges in halfway through the confirmation",
        "caller spells the email address letter by letter",
        "background noise from a busy cafe",
        "caller is on a poor cellular connection",
    ],
)
report = runner.run(n_turns_per_scenario=10)
print(report.summary())
```
The runner spins up AI test callers that hold full conversations against the agent, scores the transcripts on the listed evaluators, and surfaces failure modes ranked by frequency. See the Future AGI simulation docs for the full contract.
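In CI, the report becomes the release gate. The report fields below are illustrative; the simulation docs define the real contract:

```python
# Hypothetical CI gate on the simulation report; field names are illustrative.
SCORE_THRESHOLD = 0.85

failing = [s for s in report.scenarios if s.score < SCORE_THRESHOLD]
for scenario in failing:
    print(f"FAIL {scenario.name}: score={scenario.score:.2f}")
if failing:
    raise SystemExit(1)  # block the release until the scenarios pass
```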
Architecture pattern: how the four hops fit together
Every production voice agent in May 2026 looks roughly like this:
- Transport (WebRTC or SIP). LiveKit, Daily, Twilio, or Vapi/Retell hosted transport.
- VAD + end-of-speech detection. Deepgram Flux, Silero VAD, or the framework default.
- STT. Deepgram Nova-3, AssemblyAI Universal-2, ElevenLabs Scribe v2 Realtime, Whisper, or NVIDIA Parakeet TDT (Apache 2.0).
- LLM. gpt-5-2025-08-07, claude-opus-4-7, Gemini 3 Pro, or self-hosted Llama 4.x via Groq, vLLM, or TGI.
- TTS. Cartesia Sonic 4, ElevenLabs Flash v2.5, Deepgram Aura-2, Hume Octave 2, or self-hosted Kokoro / Piper.
- Observability. traceAI (Apache 2.0) spans tied to the recording, with audio level evaluators running continuously and surfacing in the Future AGI dashboard via the Agent Command Center at /platform/monitor/command-center.
- Pre-production. fi.simulate scenarios, evaluator gates, and regression tests in CI.
The frameworks above (Vapi, Retell, LiveKit Agents, Pipecat) cover layers 1 through 5. Future AGI covers layers 6 and 7.
Production failure modes worth instrumenting
Six failure modes account for most voice agent incidents in production.
- STT word swap on numbers and names. A “fifteen” becomes “fifty,” an “Aaron” becomes “Erin.” Run WER plus a domain-specific evaluator on every call.
- LLM tool-call drift. The model calls a tool with the wrong argument because the transcript was slightly wrong. Trace the tool call arguments as span attributes.
- TTS pronunciation failure on proper nouns. Names, addresses, and product names get pronounced wrong. Add a domain pronunciation evaluator with a small reference set.
- Long-tail latency spikes on the LLM hop. P95 is fine, P99 is 4 seconds. Voice users do not see averages, they hear the worst call of the day.
- Barge-in races. The caller interrupts and the TTS keeps streaming. Trace the barge-in event and measure the gap to TTS cut.
- Silent failures with no audio at all. A provider hiccup means the caller hears dead air. Heartbeat the TTS output and alert when more than 2 seconds of silence happens mid-call (a minimal sketch follows this list).
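The dead-air check from the last item can be a plain heartbeat timer. A minimal sketch, with names chosen for illustration:

```python
import time

# Reset on every TTS audio chunk; alert once silence crosses the threshold.
SILENCE_ALERT_SECONDS = 2.0

class TtsHeartbeat:
    def __init__(self):
        self.last_audio_at = time.monotonic()

    def on_audio_chunk(self):
        self.last_audio_at = time.monotonic()

    def dead_air(self) -> bool:
        """True once the caller has heard silence past the threshold."""
        return (time.monotonic() - self.last_audio_at) > SILENCE_ALERT_SECONDS
```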
These are the six failure modes teams should alert on once traceAI spans and fi.evals evaluators are wired in. For the broader observability pattern see our best AI agent observability tools and LLM tracing tools guides.
The Future AGI voice stack in one diagram
- Frameworks (Vapi, Retell, LiveKit, Pipecat) call STT (Deepgram, AssemblyAI, ElevenLabs, Whisper), then the LLM (gpt-5-2025-08-07, claude-opus-4-7, Gemini 3 Pro, Llama 4.x), then TTS (Cartesia, ElevenLabs, Hume, Deepgram Aura-2).
- Every hop is wrapped by traceAI (Apache 2.0) with spans for STT, LLM, tool calls, and TTS.
- Every span ships to the Future AGI dashboard at /platform/monitor/command-center with audio level evaluators (faithfulness, toxicity, WER, audio quality) attached.
- Pre-production runs fi.simulate TestRunner scenarios, gated by evaluator thresholds.
- Production runs continuous evaluators (turing_flash on every turn, turing_large offline on the recording).
Closing: pick a framework, then add the evaluation layer
The May 2026 voice AI build is no longer about “build everything yourself.” Vapi, Retell, LiveKit Agents, and Pipecat cover the transport, STT, LLM, and TTS hops. The actual production work is in the evaluation and observability layer above them.
Future AGI is not a voice framework. It is the recommended evaluation, simulation, and observability companion. Wire traceAI into the framework you pick, run fi.evals audio evaluators continuously on every call, run fi.simulate scenarios in CI before every release, and watch the dashboard at /platform/monitor/command-center.
Book a Future AGI demo to see voice agent evaluation and observability in action.
Frequently asked questions
- What is the right voice AI framework in May 2026? Vapi for the fastest hosted path, Retell AI for enterprise compliance, LiveKit Agents or Pipecat when you want open source control.
- What latency budget should a voice agent target? Roughly 800ms voice-to-voice, with 600 to 1,000ms as the acceptable range across STT, LLM, TTS, and transport.
- How do you evaluate a voice agent beyond Word Error Rate? Run faithfulness, task completion, toxicity, and hallucination evaluators on the transcript, plus audio-level evaluators for transcription accuracy, audio quality, and pronunciation.
- How do you instrument a Vapi or Retell voice agent for observability? Wrap the webhook or turn handler in traceAI spans so every STT, LLM, tool call, and TTS hop ships to the dashboard tied to the recording.
- Can I run voice agent simulation before going live? Yes. fi.simulate's TestRunner drives AI test callers through personas and scenarios and ranks failure modes by frequency.
- Does Future AGI sell a voice AI model or framework? No. It is the evaluation, simulation, and observability layer that sits on top of whichever framework you pick.
- Which voice stack should I pick if I want fully open source? LiveKit Agents or Pipecat for the framework, Whisper or NVIDIA Parakeet TDT for STT, self-hosted Llama 4.x for the LLM, Kokoro or Piper for TTS, and traceAI (Apache 2.0) for instrumentation.
- Is Vapi or Retell the better hosted voice agent platform? Vapi for provider flexibility and speed to production; Retell for SOC2 compliance and structured workflows.
Related reading
- Simulate voice AI agents in 2026 with fi.simulate.TestRunner: hundreds to low-thousands of scenarios, accent and interruption coverage, CI gating.
- Future AGI vs Deepchecks in 2026: LLM evaluation, observability, prompt optimization, tabular and CV validation, pricing, G2 ratings, and when to pick each.
- Real-time LLM monitoring in 2026: Future AGI, Langfuse, Phoenix, Helicone, OpenLIT, Datadog, and New Relic ranked on latency, eval depth, and OTel support.