Smart Voice AI Integration in 2026: Vapi, Retell, LiveKit, Pipecat, and How to Evaluate Voice Agents in Production
Voice AI integration in 2026: Vapi, Retell, LiveKit Agents, Pipecat code patterns plus traceAI instrumentation and FAGI audio evals for production.
Updated May 14, 2026. Vapi, Retell AI, LiveKit Agents, and Pipecat now cover most production voice agent stacks. The interesting work has moved from picking a framework to evaluating and instrumenting voice agents in production. Below: the four frameworks compared, the right pick by use case, and the instrumentation plus evaluator code we run on every voice agent we ship.

TL;DR: Voice AI integration in May 2026
| Need | Best pick | Why |
|---|---|---|
| Fastest path to production support agent | Vapi | Hosted, routes STT/LLM/TTS, dashboard, phone numbers, webhooks |
| Enterprise voice agent with SOC2 | Retell AI | Workflow builder, SOC2, structured flows, agent IDE |
| Open source WebRTC voice stack | LiveKit Agents | Apache 2.0, full transport control, agents framework |
| Open source Python pipeline | Pipecat | BSD, fine-grained frame pipelines, large adapter list |
| Lowest STT end-of-speech latency | Deepgram Flux | Purpose-built for voice agents, sub-300ms end-of-speech |
| Lowest TTS TTFA | Cartesia Sonic 4 | ~40ms TTFA, State Space Model architecture |
| Eval, simulation, observability | Future AGI | Audio evaluators, fi.simulate TestRunner, traceAI Apache 2.0 |
If you only read one row: Vapi for hosted speed, Retell for enterprise, LiveKit + Pipecat when you want full control, plus Future AGI as the evaluation and observability companion for every framework on the list.
Why voice agents need their own evaluation layer
Most LLM evaluation tooling assumes a text in, text out interaction. Voice changes the contract in four ways.
- Inputs are audio, not text. A WER of 6% on the STT side means roughly one wrong word in every 17. That single wrong word can flip an order ID, a phone number, or a yes into a no (a quick check follows this list).
- Latency is cumulative across at least three hops. STT, LLM, and TTS each add latency, and any one of them spiking past its budget breaks the call. A 95th percentile latency on the LLM alone is not enough.
- The conversation is full duplex. Barge-in, interruption handling, and turn detection have no equivalent in text chat. A perfectly faithful response that arrives 2 seconds late feels worse than a slightly wrong response that arrives in 400ms.
- The recording is the only evidence. When a voice agent fails, the audio is the trace. Without span level traces tied to the recording, root cause analysis turns into manually listening to calls.
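To make the first point concrete, here is a turn-level check using the open source jiwer package (an assumption for illustration; any WER implementation works):

```python
# A single turn scored with jiwer (assumed here; any WER library works).
import jiwer

reference  = "hi yes i would like to move my appointment to friday the fifteenth at two pm please"
hypothesis = "hi yes i would like to move my appointment to friday the fiftieth at two pm please"

print(f"WER: {jiwer.wer(reference, hypothesis):.0%}")  # WER: 6%
# One substitution in seventeen words: the aggregate metric looks fine,
# but the booking date the agent acts on is now wrong.
```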
Smart voice AI in 2026 means three things sit together: a real framework, real audio level evaluators, and real instrumentation tied back to the recording. The rest of this guide walks through that stack.
The 4 voice agent frameworks in May 2026
Vapi
Vapi is the fastest path to production for a hosted voice agent. The platform routes the STT, LLM, and TTS hops through a single API, ships phone numbers, supports tool calls and webhooks, and exposes a dashboard for call review.
The May 2026 routing list spans Deepgram, AssemblyAI, ElevenLabs, Cartesia, Hume, OpenAI Realtime, and the major LLM providers. The killer feature is provider swap: you can switch STT or TTS providers without rewriting the agent.
Best for: support agents, outbound calling, IVR replacement, appointment booking. See the Vapi docs for the current API surface.
Retell AI
Retell AI focuses on enterprise voice agents with a structured workflow builder, SOC2 compliance, and an agent IDE. The conversation flow is modeled explicitly with nodes, transitions, and conditions, which makes regression testing easier than a free-form prompt.
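To see why explicit structure regression-tests well, here is a conceptual node-and-transition sketch in plain Python. It mirrors the idea of a workflow builder; it is not Retell's actual schema.

```python
# Conceptual node/transition flow, NOT Retell's actual schema.
booking_flow = {
    "greet":        {"prompt": "Hi, how can I help?",      "on": {"book": "collect_date"}},
    "collect_date": {"prompt": "What day works for you?",  "on": {"date_given": "confirm"}},
    "confirm":      {"prompt": "Booking {date}. Correct?", "on": {"yes": "done", "no": "collect_date"}},
}

def next_node(flow: dict, current: str, intent: str) -> str:
    """Deterministic transition lookup, trivially assertable in a regression test."""
    return flow[current]["on"].get(intent, current)

assert next_node(booking_flow, "confirm", "no") == "collect_date"
```

Because every transition is data rather than free-form prompt behavior, a test suite can assert the whole flow without running a single LLM call.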
The platform is more opinionated than Vapi: less provider choice, more structure around the agent. Teams that want a clear path to compliance and audit usually pick Retell first.
Best for: regulated industries, scheduled outbound campaigns, structured intake flows. See the Retell docs for the API.
LiveKit Agents
LiveKit Agents is the open source pick when you want full control of the transport layer. The framework runs on top of LiveKit’s WebRTC stack (Apache 2.0) and ships a Python agents framework with adapters for the major STT, LLM, and TTS providers.
LiveKit ships room recording, server-side mixing, and SIP integration, which is why it shows up in production stacks that need both voice agents and human-in-the-loop handoff in the same session.
Best for: open source stacks, custom WebRTC use cases, voice plus human handoff. See LiveKit Agents and the Apache 2.0 LICENSE.
Pipecat
Pipecat (from Daily) is an open source Python framework for real-time voice and multimodal agents. The pipeline is a chain of frame processors covering VAD, STT, LLM, TTS, and tool use. Where LiveKit owns the WebRTC layer, Pipecat owns the pipeline orchestration.
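The core idea, stripped of the framework, is that each stage consumes a stream of frames and yields a new one, and the pipeline is their composition. A framework-free sketch (Pipecat's real processors are classes with richer frame types; see the repo for the actual API):

```python
from typing import Callable, Iterable

# Each stage consumes frames and yields frames; the pipeline composes stages.
Stage = Callable[[Iterable], Iterable]

def run_pipeline(stages: list, frames: Iterable) -> Iterable:
    for stage in stages:
        frames = stage(frames)
    return frames

def upper_stage(frames):  # stand-in for a real VAD/STT/LLM/TTS processor
    for frame in frames:
        yield frame.upper()

print(list(run_pipeline([upper_stage], ["hello caller"])))  # ['HELLO CALLER']
```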
Pipecat ships a long list of adapters (Daily, LiveKit, Twilio, Deepgram, AssemblyAI, ElevenLabs, Cartesia, OpenAI, Anthropic, Gemini), which makes it the right pick when you want fine-grained Python control over the agent loop.
Best for: research, custom pipelines, multimodal experiments. See Pipecat on GitHub and the BSD LICENSE.
Voice agent latency budget in 2026
A natural voice-to-voice round trip targets roughly 800ms, with 600 to 1,000ms as the acceptable operating range. The cumulative budget breaks down roughly as follows:
| Hop | Budget | May 2026 picks |
|---|---|---|
| End-of-speech detection (STT) | 150 to 300ms | Deepgram Flux, ElevenLabs Scribe v2 Realtime |
| LLM first token | 300 to 500ms | Groq-hosted Llama 4.x, gpt-5-2025-08-07 streaming, claude-opus-4-7 streaming |
| TTS time-to-first-audio | 40 to 200ms | Cartesia Sonic 4 (~40ms), ElevenLabs Flash v2.5 (~75ms), Deepgram Aura-2 (sub-200ms) |
| Network + barge-in | 50 to 100ms | LiveKit / Daily transport |
| Total | ~600 to 1,000ms | |
When the budget breaks, it usually breaks at the LLM hop. The fix is usually one of three: switch to a faster model, prefetch the first sentence with speculative decoding, or move some of the agent logic to a lighter classifier.
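A minimal budget check against the table above might look like this. The hop names match the traceAI spans shown in the next section; the thresholds are illustrative, not prescriptive:

```python
# Hypothetical per-hop budgets in ms, mirroring the table above.
HOP_BUDGETS_MS = {
    "voice.stt": 300,   # end-of-speech detection
    "voice.llm": 500,   # LLM first token
    "voice.tts": 200,   # time-to-first-audio
    "transport": 100,   # network + barge-in
}

def over_budget_hops(measured_ms: dict) -> list:
    """Return every hop whose measured latency exceeded its budget."""
    return [
        hop for hop, budget in HOP_BUDGETS_MS.items()
        if measured_ms.get(hop, 0) > budget
    ]

# A healthy STT and TTS cannot rescue a spiking LLM hop:
print(over_budget_hops({"voice.stt": 240, "voice.llm": 1900, "voice.tts": 80}))
# ['voice.llm']
```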
Future AGI’s audio level evaluators run continuously over production calls and flag spans where any one hop exceeds its budget. See the latency tracking patterns in our AI agent cost optimization observability guide.
Instrumenting voice agents with traceAI
traceAI is Future AGI’s open source (Apache 2.0) OpenTelemetry instrumentation. The same pattern wraps Vapi, Retell, LiveKit, and Pipecat.
```python
from fi_instrumentation import register, FITracer
from fi_instrumentation.fi_types import ProjectType

# Register once at agent boot
tracer_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="voice-agent-prod",
)
tracer = FITracer(tracer_provider.get_tracer(__name__))

async def handle_voice_turn(audio_bytes, session_id):
    with tracer.start_as_current_span("voice.turn") as span:
        span.set_attribute("session.id", session_id)
        with tracer.start_as_current_span("voice.stt"):
            transcript = await stt_client.transcribe(audio_bytes)
        with tracer.start_as_current_span("voice.llm") as llm_span:
            llm_span.set_attribute("input.value", transcript)
            response = await llm.generate(transcript)
            llm_span.set_attribute("output.value", response)
        with tracer.start_as_current_span("voice.tts"):
            audio_out = await tts_client.synthesize(response)
        return audio_out
```
The FI_API_KEY and FI_SECRET_KEY environment variables authenticate trace export. Every span shows up in the Future AGI dashboard tied to the recording, with per-span latency, cost, and any evaluators that ran on the output.
For Vapi the same idea applies on the webhook handler. For LiveKit Agents, wrap the agent’s on_turn callback. For Pipecat, wrap the frame processor process_frame method.
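As an example of the webhook case, a handler can reuse the traced turn function from above. The route and payload fields here are hypothetical, not the actual Vapi webhook contract:

```python
# Illustrative webhook wrapper; route and payload fields are hypothetical.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/voice/webhook")
async def voice_webhook(request: Request):
    payload = await request.json()
    with tracer.start_as_current_span("voice.webhook") as span:
        span.set_attribute("session.id", payload.get("call_id", "unknown"))
        # Delegate to the traced turn handler defined earlier
        audio_out = await handle_voice_turn(payload["audio"], payload.get("call_id"))
    return {"status": "ok", "audio_bytes": len(audio_out)}
```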
Voice agent evaluators that actually predict production
Six evaluators run continuously on every call in a well-instrumented voice agent: the four text-level checks below, plus audio-level WER and audio quality scoring covered afterward.
```python
from fi.evals import evaluate

# 1. Faithfulness of the LLM response to retrieval context
faith_result = evaluate(
    "faithfulness",
    output=llm_response,
    context=retrieved_context,
)

# 2. Task completion at end of call
completion = evaluate(
    "task_completion",
    output=full_transcript,
    expected="appointment booked",
)

# 3. Toxicity on every assistant turn
toxicity = evaluate(
    "toxicity",
    output=llm_response,
)

# 4. Hallucination check against retrieved facts
hallucination = evaluate(
    "hallucination",
    output=llm_response,
    context=knowledge_base_chunk,
)
```
For audio-level scoring (transcription accuracy, audio quality, pronunciation), pair the cloud audio evaluators documented at docs.futureagi.com with the same evaluate() call shape against the audio URL or the transcript reference.
The cloud evaluators run in different latency tiers: turing_flash at roughly 1 to 2 seconds, turing_small at roughly 2 to 3 seconds, and turing_large at roughly 3 to 5 seconds (see Future AGI cloud evals docs). For inline use in the voice hot path, run turing_flash evaluators asynchronously per turn and use the result to flag the next turn or end the call. For deeper analysis, run turing_large evaluators offline on the full transcript after the call ends.
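One way to keep turing_flash out of the hot path is to offload the evaluate() call and collect results between turns. A sketch, assuming evaluate() is a blocking call as in the snippets above:

```python
import asyncio
from fi.evals import evaluate

# Offload the blocking evaluate() call so the voice hot path never waits on it.
async def score_turn(llm_response: str, context: str):
    return await asyncio.to_thread(
        evaluate, "faithfulness", output=llm_response, context=context
    )

async def on_assistant_turn(llm_response, context, pending_evals: list):
    # Schedule the eval; inspect pending_evals before the next turn or at call end
    pending_evals.append(asyncio.create_task(score_turn(llm_response, context)))
```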
For a deeper look at evaluator selection see our custom LLM eval metrics best practices and deterministic LLM evaluation metrics guides.
Pre-production simulation with fi.simulate
Voice agents fail in production on edge cases that look obvious in hindsight: a caller speaking a digit at a time, a barge-in halfway through a sentence, an out-of-vocabulary product name. Manual testing covers a few of these. Simulation covers thousands.
```python
from fi.simulate import TestRunner, AgentInput, AgentResponse

def my_voice_agent(payload: AgentInput) -> AgentResponse:
    transcript = stt_client.transcribe(payload.audio)
    response_text = llm.generate(transcript)
    # Audio is synthesized for realism; the runner scores the returned text
    audio_out = tts_client.synthesize(response_text)
    return AgentResponse(text=response_text)

runner = TestRunner(
    agent=my_voice_agent,
    personas=["impatient_caller", "domain_expert", "adversarial_caller"],
    scenarios=[
        "caller speaks the date as next tuesday",
        "caller barges in halfway through the confirmation",
        "caller spells the email address letter by letter",
        "background noise from a busy cafe",
        "caller is on a poor cellular connection",
    ],
)
report = runner.run(n_turns_per_scenario=10)
print(report.summary())
```
The runner spins up AI test callers that hold full conversations against the agent, scores the transcripts on the listed evaluators, and surfaces failure modes ranked by frequency. See the Future AGI simulation docs for the full contract.
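In CI, the report becomes the release gate. The report fields below are illustrative; the simulation docs define the real contract:

```python
# Hypothetical CI gate on the simulation report; field names are illustrative.
SCORE_THRESHOLD = 0.85

failing = [s for s in report.scenarios if s.score < SCORE_THRESHOLD]
for scenario in failing:
    print(f"FAIL {scenario.name}: score={scenario.score:.2f}")
if failing:
    raise SystemExit(1)  # block the release until the scenarios pass
```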
Architecture pattern: how the four hops fit together
Every production voice agent in May 2026 looks roughly like this:
- Transport (WebRTC or SIP). LiveKit, Daily, Twilio, or Vapi/Retell hosted transport.
- VAD + end-of-speech detection. Deepgram Flux, Silero VAD, or the framework default.
- STT. Deepgram Nova-3, AssemblyAI Universal-2, ElevenLabs Scribe v2 Realtime, Whisper, or NVIDIA Parakeet TDT (Apache 2.0).
- LLM. gpt-5-2025-08-07, claude-opus-4-7, Gemini 3 Pro, or self-hosted Llama 4.x via Groq, vLLM, or TGI.
- TTS. Cartesia Sonic 4, ElevenLabs Flash v2.5, Deepgram Aura-2, Hume Octave 2, or self-hosted Kokoro / Piper.
- Observability. traceAI (Apache 2.0) spans tied to the recording, with audio level evaluators running continuously and surfacing in the Future AGI dashboard via the Agent Command Center at /platform/monitor/command-center.
- Pre-production. fi.simulate scenarios, evaluator gates, and regression tests in CI.
The frameworks above (Vapi, Retell, LiveKit Agents, Pipecat) cover layers 1 through 5. Future AGI covers layers 6 and 7.
Production failure modes worth instrumenting
Six failure modes account for most voice agent incidents in production.
- STT word swap on numbers and names. A “fifteen” becomes “fifty,” an “Aaron” becomes “Erin.” Run WER plus a domain-specific evaluator on every call.
- LLM tool-call drift. The model calls a tool with the wrong argument because the transcript was slightly wrong. Trace the tool call arguments as span attributes.
- TTS pronunciation failure on proper nouns. Names, addresses, and product names get pronounced wrong. Add a domain pronunciation evaluator with a small reference set.
- Long-tail latency spikes on the LLM hop. P95 is fine, P99 is 4 seconds. Voice users do not see averages, they hear the worst call of the day.
- Barge-in races. The caller interrupts and the TTS keeps streaming. Trace the barge-in event and measure the gap to TTS cut.
- Silent failures with no audio at all. A provider hiccup means the caller hears dead air. Heartbeat the TTS output and alert when more than 2 seconds of silence happens mid-call (a minimal sketch follows this list).
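The dead-air check from the last item can be a plain heartbeat timer. A minimal sketch, with names chosen for illustration:

```python
import time

# Reset on every TTS audio chunk; alert once silence crosses the threshold.
SILENCE_ALERT_SECONDS = 2.0

class TtsHeartbeat:
    def __init__(self):
        self.last_audio_at = time.monotonic()

    def on_audio_chunk(self):
        self.last_audio_at = time.monotonic()

    def dead_air(self) -> bool:
        """True once the caller has heard silence past the threshold."""
        return (time.monotonic() - self.last_audio_at) > SILENCE_ALERT_SECONDS
```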
These are the six failure modes teams should alert on once traceAI spans and fi.evals evaluators are wired in. For the broader observability pattern see our best AI agent observability tools and LLM tracing tools guides.
The Future AGI voice stack in one diagram
- Frameworks (Vapi, Retell, LiveKit, Pipecat) call STT (Deepgram, AssemblyAI, ElevenLabs, Whisper), then the LLM (gpt-5-2025-08-07, claude-opus-4-7, Gemini 3 Pro, Llama 4.x), then TTS (Cartesia, ElevenLabs, Hume, Deepgram Aura-2).
- Every hop is wrapped by traceAI (Apache 2.0) with spans for STT, LLM, tool calls, and TTS.
- Every span ships to the Future AGI dashboard at /platform/monitor/command-center with audio level evaluators (faithfulness, toxicity, WER, audio quality) attached.
- Pre-production runs fi.simulate TestRunner scenarios, gated by evaluator thresholds.
- Production runs continuous evaluators (turing_flash on every turn, turing_large offline on the recording).
Closing: pick a framework, then add the evaluation layer
The May 2026 voice AI build is no longer about “build everything yourself.” Vapi, Retell, LiveKit Agents, and Pipecat cover the transport, STT, LLM, and TTS hops. The actual production work is in the evaluation and observability layer above them.
Future AGI is not a voice framework. It is the recommended evaluation, simulation, and observability companion. Wire traceAI into the framework you pick, run fi.evals audio evaluators continuously on every call, run fi.simulate scenarios in CI before every release, and watch the dashboard at /platform/monitor/command-center.
Book a Future AGI demo to see voice agent evaluation and observability in action.
Frequently asked questions
- What is the right voice AI framework in May 2026? Vapi for the fastest hosted path, Retell AI for enterprise compliance, LiveKit Agents or Pipecat when you want open source control.
- What latency budget should a voice agent target? Roughly 800ms voice-to-voice, with 600 to 1,000ms as the acceptable range across STT, LLM, TTS, and transport.
- How do you evaluate a voice agent beyond Word Error Rate? Run faithfulness, task completion, toxicity, and hallucination evaluators on the transcript, plus audio-level evaluators for transcription accuracy, audio quality, and pronunciation.
- How do you instrument a Vapi or Retell voice agent for observability? Wrap the webhook or turn handler in traceAI spans so every STT, LLM, tool call, and TTS hop ships to the dashboard tied to the recording.
- Can I run voice agent simulation before going live? Yes. fi.simulate's TestRunner drives AI test callers through personas and scenarios and ranks failure modes by frequency.
- Does Future AGI sell a voice AI model or framework? No. It is the evaluation, simulation, and observability layer that sits on top of whichever framework you pick.
- Which voice stack should I pick if I want fully open source? LiveKit Agents or Pipecat for the framework, Whisper or NVIDIA Parakeet TDT for STT, self-hosted Llama 4.x for the LLM, Kokoro or Piper for TTS, and traceAI (Apache 2.0) for instrumentation.
- Is Vapi or Retell the better hosted voice agent platform? Vapi for provider flexibility and speed to production; Retell for SOC2 compliance and structured workflows.
Related reading
- Simulate voice AI agents in 2026 with fi.simulate.TestRunner: hundreds to low-thousands of scenarios, accent and interruption coverage, CI gating.
- Future AGI vs Deepchecks in 2026: LLM evaluation, observability, prompt optimization, tabular and CV validation, pricing, G2 ratings, and when to pick each.
- Real-time LLM monitoring in 2026: Future AGI, Langfuse, Phoenix, Helicone, OpenLIT, Datadog, and New Relic ranked on latency, eval depth, and OTel support.