Voice AI Infrastructure Stack: A 2026 Pick-by-Stage Decision Guide

Pick-by-stage guide to the 2026 voice AI stack: telephony, orchestration, STT, LLM, TTS, and eval/observe. Real pricing, real picks, and three full compositions.

March 19, 2026

Updated May 19, 2026

18 min read

If you’re shipping a voice AI agent in 2026, the stack you compose matters more than any single provider on it. Six layers (telephony, orchestration, speech-to-text, LLM, text-to-speech, and evaluation plus observability) each have three to five real contenders, and the right pick depends on your audio profile, latency budget, compliance posture, and call volume. This guide ranks the picks per layer, gives you the real per-minute economics, and shows three full compositions that ship to production today.

TL;DR: pick by layer

Layer	Default pick	When to swap
Telephony	Twilio	Telnyx for cost, Plivo for value, Vonage for EU/APAC enterprise
Orchestration	Vapi	Retell AI for lowest latency, ElevenLabs Agents for TTS quality, LiveKit for open-source, Pipecat for Python-native
STT	Deepgram Nova-3	AssemblyAI for multilingual, Cartesia Ink-Whisper for sub-100ms, Speechmatics Ursa for EU, Whisper for self-host
LLM	OpenAI gpt-4.1-mini	Claude Sonnet 4.5 for long context, Gemini 2.0 Flash for multimodal, Groq for latency, Bedrock for BYOC compliance
TTS	ElevenLabs	Cartesia Sonic 3.5 for lowest TTFA, Deepgram Aura-2 for cost + prosody, Rime AI for utility voices, OpenAI gpt-4o-mini-tts for cheapest
Eval + Observe	Future AGI	Run any of the 70+ built-in rubrics on every call. Native voice obs for Vapi/Retell/LiveKit. SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 certified

The picks below go layer by layer. Three end-to-end compositions follow, with per-call economics.

Layer 1: Telephony

Telephony is the pipe. It carries the PSTN call into your stack and the agent audio back out. The decisions here are: cost per minute, SIP support for BYO trunks, geographic coverage, and bring-your-own number flow.

Twilio (default). Programmable Voice covers 100+ countries with the widest DID inventory in the industry. SIP trunking is mature inbound and outbound. BYO number via SIP refer or hosted number port. Pricing sits at $0.014 per minute inbound and $0.022 per minute outbound on US toll-free. Best for broadest geographic coverage with minimal surprises. Per-minute cost is the highest of the four.
Telnyx (cost + engineering). Runs its own private global network instead of leasing PSTN access, which shows up as 20-40% lower per-minute pricing on US traffic and better SIP latency end to end. Inbound starts at $0.0035 per minute on US local DIDs, outbound at $0.0070 per minute. SIP trunking is a first-class product. Best for SIP-heavy stacks where the saving over Twilio materially changes unit economics. Geographic coverage is narrower than Twilio.
Plivo (value). Prices below Telnyx on most outbound corridors while keeping a reliable carrier-grade network. Inbound starts at $0.0075 per minute on US local DIDs, outbound at $0.0050 per minute. Best for high-volume outbound voice agents where per-minute cost dominates. Documentation depth and SDK breadth trail Twilio.
Vonage (EU/APAC enterprise). Carries the largest legacy contact-center footprint in Europe and APAC. SIP coverage is deep in EMEA, with regional DIDs in countries Twilio still routes through partners. Per-minute pricing comparable to Twilio on most corridors, lower on intra-EU. Best for EU-headquartered voice agents needing regional DIDs in Germany, France, Italy, Spain, Nordics, and APAC deployments with a Vonage interconnect. Developer ergonomics are the weakest of the four.

How to pick telephony. US-only and developer-led: Telnyx wins on cost and ergonomics. Global with broadest coverage: Twilio wins on reliability. High-volume outbound: Plivo wins on price. EU enterprise with procurement preference: Vonage wins on regional depth. Lock the carrier last; orchestration usually integrates with all four.
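To make the carrier comparison concrete, here is a minimal sketch that turns the per-minute rates quoted above into monthly spend. The function name and the 1000-calls-per-day, 3-minute profile are illustrative assumptions. The Twilio figures are the toll-free rates quoted above and the Telnyx and Plivo figures are US local DID rates, so this is not a like-for-like quote. Vonage is left out because no rate is listed for it.

```python
# Per-minute list prices quoted in this section (USD).
# Assumption: rates are flat, with no volume discounts or number rental fees.
RATES = {
    "twilio": {"inbound": 0.014, "outbound": 0.022},
    "telnyx": {"inbound": 0.0035, "outbound": 0.0070},
    "plivo": {"inbound": 0.0075, "outbound": 0.0050},
}


def monthly_telephony_cost(carrier, direction, calls_per_day, minutes_per_call, days=30):
    """Return monthly carrier spend in USD for one traffic direction."""
    rate = RATES[carrier][direction]
    return rate * minutes_per_call * calls_per_day * days


if __name__ == "__main__":
    for carrier in RATES:
        cost = monthly_telephony_cost(carrier, "inbound", calls_per_day=1000, minutes_per_call=3)
        print(f"{carrier:7s} inbound: ${cost:,.2f}/month")
```

Swap in your own call volume and the rates from your contract before drawing conclusions.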

Layer 2: Orchestration

Orchestration is the glue. It receives the telephony media stream, routes audio to STT, sends transcripts to an LLM, pipes responses to TTS, and streams synthesized audio back. The five contenders in 2026 trade off open-source vs hosted, latency, and Python-native vs JavaScript-native.

Vapi (community + BYO). The largest open-source-aligned community in voice agents and the most extensible at the BYO-model layer. Pick your own STT, LLM, and TTS; Vapi wires them together with a configurable pipeline that exposes turn detection, interruption handling, tool calls, and webhooks at every stage. Public-tier pricing lists $0.05 per minute of platform overhead on top of underlying provider costs. Best for teams wanting maximum BYO flexibility without operating orchestration themselves.
Retell AI (lowest hosted latency). Ships an opinionated stack with native coupling between LLM and TTS that shaves 100-200ms off end-to-end latency vs more flexible orchestrations. Per-minute pricing bundles orchestration plus hosted LLM and TTS. Best for latency-critical voice agents where 500ms vs 700ms round-trip is the difference between conversational and clunky. BYO surface narrower than Vapi.
ElevenLabs Agents (voice quality). Conversational AI built on top of ElevenLabs’ high-fidelity TTS, with an integrated runtime emphasizing voice quality, natural prosody, and brand voice cloning. Subscription-based with per-minute usage. Best for brand-led voice (consumer, hospitality, media) where the voice is the product. Pricing is the highest of the hosted options.
LiveKit Agents (open-source WebRTC). Apache-licensed framework for voice agents on top of LiveKit Cloud or self-hosted infra. BYO model end to end. LiveKit handles WebRTC media transport, turn detection, and agent orchestration. Cloud charges per participant minute; self-host is free in license cost. Best for teams that want full control over the orchestration runtime and WebRTC-first deployments where PSTN is secondary.
Pipecat (Python-native). Apache-licensed Python framework from Daily. Pipeline composes processors (STT, LLM, TTS, VAD, tool execution) into a graph with explicit interruption handling and turn detection. Native integrations cover Deepgram, AssemblyAI, OpenAI, Anthropic, ElevenLabs, Cartesia, Daily, LiveKit, and more. Best for Python-first teams that already operate ML services in Python.

How to pick orchestration. Self-host with full control: LiveKit or Pipecat. Hosted with BYO flexibility: Vapi. Hosted with lowest latency: Retell. Brand-voice-led: ElevenLabs Agents. The eval layer is independent in every case.
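The loop every orchestrator in this section implements (media stream in, STT, LLM, TTS, audio out) can be sketched with stand-in stages. Everything below is hypothetical: the `fake_*` functions are placeholders for real providers, and the sketch omits turn detection, interruption handling, and per-stage streaming, which is where real frameworks spend most of their complexity.

```python
import asyncio
from typing import AsyncIterator, List


async def caller_audio() -> AsyncIterator[bytes]:
    """Stand-in for the telephony media stream (one chunk per utterance)."""
    for utterance in ("I want to reschedule", "Tuesday at 3pm"):
        yield utterance.encode("utf-8")


async def fake_stt(audio_chunks: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stand-in STT: emits one final transcript per audio chunk."""
    async for chunk in audio_chunks:
        yield chunk.decode("utf-8")


async def fake_llm(transcript: str) -> str:
    """Stand-in LLM: returns a canned reply for the transcript."""
    return f"You said: {transcript}"


async def fake_tts(text: str) -> bytes:
    """Stand-in TTS: 'synthesizes' audio by encoding the text."""
    return text.encode("utf-8")


async def run_pipeline(audio_in: AsyncIterator[bytes]) -> List[bytes]:
    """Route audio through STT -> LLM -> TTS and collect the agent audio."""
    audio_out = []
    async for transcript in fake_stt(audio_in):
        reply = await fake_llm(transcript)
        audio_out.append(await fake_tts(reply))
    return audio_out


if __name__ == "__main__":
    for frame in asyncio.run(run_pipeline(caller_audio())):
        print(frame.decode("utf-8"))
```

Because each stage sits behind its own function, swapping a provider changes one stage and leaves the rest alone. That is the property the BYO-oriented orchestrators expose through configuration.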

Layer 3: Speech-to-text (STT)

STT determines the ceiling on every downstream quality metric. A 12% word error rate cascades into intent miss, retrieval miss, and bad agent responses no LLM can recover from. The picks below mirror the STT-specific guide compressed for stack composition.

Deepgram Nova-3. WER leader on hard call-center audio (telephony codecs, background noise, cross-talk, accent diversity). First-partial 90-130ms P50. Real production WER 6-10% on conversational telephony. Streaming starts at $0.0043 per minute.
AssemblyAI Universal-3 Pro. Broad multilingual coverage, polished speaker diarization, and the strongest English-nuance handling. WER 7-11% on conversational English. First-partial 100-150ms P50. Streaming starts at $0.0042 per minute.
Whisper large-v3 (self-host). The open-weight baseline. 1.5B parameters under MIT license. Most accent-robust model in 2026 and strongest code-switching model in production. Batch model; real-time wrappers carry 300-600ms first-partial latency. Free in license; you pay for compute.
Cartesia Ink-Whisper. Streaming Whisper-class ASR with sub-100ms first-partial in clean conditions. Distilled causalized Whisper that trades long-tail language coverage for latency. 70-95ms P50 first-partial.
Speechmatics Ursa. Accent and dialect leader. 55+ languages with strongest European quality and deepest accent coverage (Scottish, Welsh, South African, Australian, NZ English all within 1-2% WER of standard English). Custom medical, legal, broadcast, and finance models.

How to pick STT. Score on your audio. Public benchmarks lie. Sample 500-1000 utterances from your production traffic, generate ground-truth transcripts, and run every candidate through the audio_transcription rubric in Future AGI ai-evaluation. The winner on your audio is the right pick.
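For a quick sanity check before wiring up a full eval run, word error rate can be computed directly from ground-truth and candidate transcripts. The sketch below is a plain word-level edit-distance WER, not the audio_transcription rubric. The lowercase-and-split normalization is a simplifying assumption, and production scoring usually also normalizes punctuation and numerals.

```python
from typing import Iterable, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum substitutions, insertions, and deletions to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr[j] = min(substitution, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors += word_edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("no reference words to score against")
    return errors / ref_words


if __name__ == "__main__":
    sample = [
        ("book me for tuesday at three", "book me for thursday at three"),
        ("cancel my appointment", "cancel appointment"),
    ]
    print(f"WER: {corpus_wer(sample):.3f}")
```

Run the same function over each candidate provider's output on the same sampled utterances, so the comparison is on your audio and not on a public benchmark.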

Layer 4: LLM

The LLM is the brain. Voice agents need a model that handles streaming, partial inputs, and tool calls while keeping time-to-first-token inside a 500-1500ms budget.

OpenAI (default). GPT-4.1-mini and gpt-4o-mini hit the right cost-latency-quality point for most voice agents. Streaming mature, tool calls reliable, time-to-first-token 350-600ms P50. GPT-4.1-mini is roughly $0.40 per million input tokens and $1.60 per million output tokens. At 200 input + 80 output tokens per turn with 10 turns, LLM cost lands around $0.0021 per call.
Anthropic Claude (long context). Claude Sonnet 4.5 handles 200K context, which matters when your agent passes long tool outputs (CRM history, knowledge base chunks) back into every turn. Tool calling is among the most reliable. Sonnet 4.5 is roughly $3 per million input and $15 per million output tokens.
Gemini 2.0 (multimodal). Gemini 2.0 Flash handles native multimodal inputs (audio in, text out), collapsing STT and LLM into a single call. Lowest-cost tier among frontier models.
Groq (latency). LPU inference runs Llama 3.1 70B, Llama 3.3, and Mixtral at time-to-first-token under 200ms. Right pick when latency dominates and the task is conversational rather than reasoning-heavy.
Together / Fireworks (cost). Hosted Llama, Qwen, DeepSeek, and Mistral at lower per-token cost than frontier closed models. Fireworks ships fine-tuning; Together ships dedicated endpoints.
AWS Bedrock (BYOC compliance). Hosts Claude, Llama, Mistral, and Titan inside your AWS account with VPC peering, KMS-encrypted prompts, and the AWS audit boundary. Pick this when HIPAA BAA or federal audit boundary matters more than lowest latency.

How to pick the LLM. Match the LLM to the task profile. Conversational with short context: GPT-4.1-mini or Llama 3.3 on Groq. Long-context retrieval per turn: Claude Sonnet 4.5. Multimodal: Gemini 2.0 Flash. Regulated workload that needs to stay in AWS: Bedrock.
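The per-call LLM figure quoted for GPT-4.1-mini above follows from simple token arithmetic, sketched below. The function name is illustrative. The model assumes a flat 200 input and 80 output tokens on every turn, so it ignores the way conversation history grows the input on later turns; treat the result as a lower bound.

```python
def llm_cost_per_call(
    input_price_per_million: float,
    output_price_per_million: float,
    input_tokens_per_turn: int,
    output_tokens_per_turn: int,
    turns: int,
) -> float:
    """Return LLM cost in USD for one call under a flat per-turn token count."""
    input_cost = input_tokens_per_turn * turns * input_price_per_million / 1_000_000
    output_cost = output_tokens_per_turn * turns * output_price_per_million / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # Prices are the approximate per-million-token figures quoted in this section.
    mini = llm_cost_per_call(0.40, 1.60, 200, 80, turns=10)
    sonnet = llm_cost_per_call(3.00, 15.00, 200, 80, turns=10)
    print(f"GPT-4.1-mini:      ${mini:.4f} per call")
    print(f"Claude Sonnet 4.5: ${sonnet:.4f} per call")
```

If your agent re-sends tool outputs or full history on every turn, raise `input_tokens_per_turn` accordingly; the input side is what grows fastest on long-context workloads.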

Layer 5: Text-to-speech (TTS)

TTS is the most cost-variable layer in the stack. ElevenLabs and OpenAI gpt-4o-mini-tts differ by roughly 5x on per-minute cost for the same generated audio length.

ElevenLabs (quality + cloning). v2.5 Turbo and v3 deliver the most natural prosody of any commercial TTS, with the deepest voice library and the most realistic voice cloning. Streaming TTFA 150-300ms P50 on Turbo. Production voice agents at scale typically land at $0.045-$0.070 per minute of generated audio. Best for brand-led voice and cloning use cases.
Cartesia Sonic 3.5 (lowest streaming TTFA). Sub-100ms streaming time-to-first-audio in clean conditions with a small but growing voice library. Designed around the voice-agent use case where every TTS millisecond comes out of perceived response time. Best for latency-critical agents.
Deepgram Aura-2 (cost + prosody). Natural-sounding voices at roughly a third of the per-minute cost of ElevenLabs Turbo or less: $0.0150 per minute of generated audio against the $0.045-$0.070 quoted above. Streaming TTFA 150-250ms P50. Best for high-volume support where ElevenLabs is more than you need.
Rime AI (utility voices). Conversational, regional, and accent-diverse voices that sound less polished than ElevenLabs but more authentic for collections, sales outreach, and research. Best for outbound voice where regional authenticity drives engagement.
OpenAI gpt-4o-mini-tts (cheapest). Roughly 20% of ElevenLabs Turbo pricing with usable quality. Streaming TTFA competitive. Smaller voice library. Best for cost-led deployments, internal tools, and prototypes.

How to pick TTS. Quality + cloning: ElevenLabs. Lowest TTFA: Cartesia Sonic 3.5. Cost + prosody: Deepgram Aura-2. Regional voices: Rime AI. Cheapest viable: OpenAI gpt-4o-mini-tts. Score samples on the audio_quality rubric in Future AGI ai-evaluation before committing.
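Since TTS time-to-first-audio stacks on top of STT and LLM latency, a rough additive budget helps when choosing between providers. The sketch below sums the P50 ranges quoted in this guide for one example combination. The function names and the 800ms budget are illustrative assumptions. The model leaves out network hops, telephony transport, and end-of-turn detection, and a sum of P50s is only an approximation of the P50 of the end-to-end latency.

```python
# P50 latency ranges in milliseconds, as quoted in this guide.
STAGES_MS = {
    "stt_first_partial": (90, 130),   # Deepgram Nova-3
    "llm_first_token": (350, 600),    # OpenAI GPT-4.1-mini
    "tts_first_audio": (150, 300),    # ElevenLabs Turbo
}


def response_latency_range(stages):
    """Return (best_case_ms, worst_case_ms) by summing each stage's range."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def fits_budget(stages, budget_ms):
    """True only if even the slow end of every range stays inside the budget."""
    return response_latency_range(stages)[1] <= budget_ms


if __name__ == "__main__":
    low, high = response_latency_range(STAGES_MS)
    print(f"Model pipeline latency: {low}-{high} ms")
    print("Fits 800 ms budget:", fits_budget(STAGES_MS, 800))
```

Replacing the `tts_first_audio` entry with another provider's range shows how much of the budget a TTS swap buys back.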

Layer 6: Evaluation and observability

The eval and observability layer is where Future AGI leads. The other five layers are independent components you compose; the eval layer is what tells you when to swap any of them.

Future AGI

traceAI instruments every layer with OpenInference-compatible spans across 30+ documented integrations in Python and TypeScript, including dedicated traceAI-pipecat and traceai-livekit packages for the voice frameworks above. Span attributes capture STT provider plus model plus first-partial latency, LLM provider plus model plus time-to-first-token, TTS provider plus voice ID plus time-to-first-audio, and tool call metadata. Apache 2.0.

ai-evaluation ships 70+ built-in eval templates in the Apache 2.0 SDK. The voice-specific rubrics that matter for stack evaluation:

audio_transcription scores STT output against ground truth.
audio_quality scores TTS output for naturalness and prosody.
conversation_coherence scores multi-turn coherence.
conversation_resolution scores whether the agent resolved the user’s request.
task_completion scores whether the agent finished the intended task.
translation_accuracy and cultural_sensitivity cover multilingual.
caption_hallucination catches multimodal hallucination.

# Audio-input rubrics (audio_transcription, audio_quality) consume MLLMAudio
from fi.testcases import MLLMTestCase, MLLMAudio
from fi.evals import Evaluator, AudioTranscriptionEvaluator, AudioQualityEvaluator

audio = MLLMAudio(url="path/to/call_recording.wav")
audio_case = MLLMTestCase(
    input=audio,
    query="Score STT quality and TTS audio quality on this call",
)

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
audio_result = ev.evaluate(
    eval_templates=[AudioTranscriptionEvaluator(), AudioQualityEvaluator()],
    inputs=[audio_case],
)

# Conversation rubrics (conversation_coherence, conversation_resolution) consume transcripts
from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals import ConversationCoherence, ConversationResolution

conv = ConversationalTestCase(messages=[
    LLMTestCase(query="I want to reschedule", response="Sure. What day works best?"),
    LLMTestCase(query="Tuesday at 3pm", response="Booked you for Tuesday at 3pm."),
])

conv_result = ev.evaluate(
    eval_templates=[ConversationCoherence(), ConversationResolution()],
    inputs=[conv],
)

Native voice observability ships for Vapi, Retell AI, and LiveKit. It is dashboard-driven, with no SDK required: add the provider API key plus Assistant ID to a Future AGI Agent Definition, and automatic call log capture starts immediately. Every call gets separate assistant and customer audio downloads, an auto-transcript, and the full eval engine run on the recording.

Error Feed auto-clusters trace failures into named issues with auto-written root cause, quick fix, and long-term recommendation. Background-noise word drops, accent-specific WER drift, LLM hallucination, TTS prosody degradation, tool-call argument errors each become their own named cluster instead of drowning in raw spans.

Future AGI Protect is built on Gemma 3n with LoRA-trained category-specific adapters per arXiv 2510.13351. Multi-modal across text, image, and audio. Rule-based Protect covers the 4 documented safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance). ProtectFlash is the single-call binary classifier for the sub-100ms inline path; rule-based Protect is better suited to async or per-route policy checks where richer per-rule attribution outweighs scan time.

Agent Command Center hosts the stack with RBAC, availability on AWS Marketplace, multi-region hosting, BYOC for regulated workloads, and 15+ providers exposed via the router surface. SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 are all listed as certified on the trust page.

Per-call cost economics at 1000 calls per day

A realistic stack lands somewhere in the low hundreds of dollars per 1000 calls of 3-minute average length, with the spread driven by orchestration overhead and TTS choice. The table below breaks the math down for two representative compositions; treat the numbers as a starting model, not a quote.

Layer	Cost-led stack	Quality-led stack
Telephony (Plivo inbound vs Twilio inbound, 3 min)	$0.0225	$0.0420
Orchestration (Vapi platform fee, 3 min)	$0.1500	$0.1500
STT (Deepgram Nova-3 streaming, 3 min)	$0.0129	$0.0129
LLM (gpt-4.1-mini vs Claude Sonnet 4.5, 10 turns at 200 input + 80 output tokens)	$0.0021	$0.0180
TTS (gpt-4o-mini-tts vs ElevenLabs Turbo, 90 sec audio)	$0.0135	$0.0675
Eval + Observe (Future AGI Pro tier amortized)	$0.0050	$0.0050
Total per call	$0.206	$0.295
Per 1000 calls	$206	$295
Per 30K calls (1000/day x 30 days)	$6,180	$8,850

The TTS layer is the largest variable between the two stacks. At the rates quoted above, swapping ElevenLabs Turbo for Deepgram Aura-2 on 90 seconds of generated audio takes the TTS line from $0.0675 to $0.0225, a saving of about $0.045 per call. Swapping Vapi orchestration for self-hosted LiveKit removes the $0.15 platform fee at the cost of operating WebRTC infrastructure. Eval and observability overhead should be modeled per deployment; the value lies in catching provider, prompt, and tool regressions early rather than after they affect production calls.
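The per-call arithmetic can be kept as a small model so that a layer swap is a one-line change. The sketch below reproduces the cost-led column. The class and function names are illustrative, the $0.009 per minute TTS rate is derived from the table's $0.0135 for 90 seconds, and the self-hosted variant sets the platform fee to zero without adding any infrastructure cost, which a real comparison would need.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StackRates:
    telephony_per_min: float
    orchestration_per_min: float
    stt_per_min: float
    llm_per_call: float
    tts_per_audio_min: float
    eval_per_call: float


def cost_per_call(rates: StackRates, call_minutes: float, tts_audio_minutes: float) -> float:
    """Return total USD cost for one call of the given length."""
    per_minute = rates.telephony_per_min + rates.orchestration_per_min + rates.stt_per_min
    return (
        per_minute * call_minutes
        + rates.llm_per_call
        + rates.tts_per_audio_min * tts_audio_minutes
        + rates.eval_per_call
    )


# Cost-led stack: Plivo inbound, Vapi, Deepgram Nova-3, gpt-4.1-mini, gpt-4o-mini-tts.
COST_LED = StackRates(0.0075, 0.05, 0.0043, 0.0021, 0.009, 0.005)

if __name__ == "__main__":
    hosted = cost_per_call(COST_LED, call_minutes=3, tts_audio_minutes=1.5)
    # Assumption: self-hosting removes the platform fee; infra cost is NOT modeled.
    self_hosted = cost_per_call(replace(COST_LED, orchestration_per_min=0.0), 3, 1.5)
    print(f"Hosted orchestration: ${hosted:.3f} per call, ${hosted * 30_000:,.0f} per 30K calls")
    print(f"Self-hosted (no infra cost modeled): ${self_hosted:.3f} per call")
```

Plug in your measured call length and agent talk time; the 3-minute call and 90 seconds of generated audio are the table's assumptions, not universal constants.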

Three end-to-end compositions

The stack picks compose in three realistic shapes. Pick the composition that matches your operating posture and audio profile.

Composition 1: Consumer voice agent

A consumer voice agent (smart assistant, in-app voice, brand-led companion) prioritizes voice quality and conversational naturalness. Latency matters but the user has tolerance for 600-800ms round-trip if the agent sounds human.

Telephony. Twilio if PSTN matters. Often WebRTC-only via LiveKit if the surface is in-app voice.
Orchestration. ElevenLabs Agents for the bundled TTS-led experience, or Vapi if you want BYO flexibility.
STT. AssemblyAI Universal-3 Pro for the English-nuance handling.
LLM. OpenAI GPT-4.1-mini for cost, or Claude Sonnet 4.5 if long context per turn matters.
TTS. ElevenLabs v2.5 Turbo or v3 for quality. Voice cloning if the brand needs a consistent persona.
Eval + Observe. Future AGI traceAI for span capture, audio_quality plus conversation_coherence rubrics on every call, Error Feed for drop-off clusters.

Per-call cost lands around $0.25-$0.30 for 3-minute calls. The voice quality and brand consistency justify the premium.

Composition 2: Enterprise SaaS voice agent

An enterprise SaaS voice agent (sales caller, support deflection, appointment booking) prioritizes per-call unit economics and tool-call reliability. The audio profile is mixed (clean office audio plus mobile callers with background noise).

Telephony. Telnyx for cost on US traffic. Twilio for global coverage with no surprises.
Orchestration. Vapi for BYO flexibility, or Retell AI if latency is the deciding axis.
STT. Deepgram Nova-3 for the WER lead on conversational telephony.
LLM. OpenAI gpt-4.1-mini for cost-quality balance.
TTS. Deepgram Aura-2 for natural prosody at a fraction of the ElevenLabs cost. Cartesia Sonic 3.5 if sub-500ms round-trip is mandatory.
Eval + Observe. Future AGI native voice observability for Vapi or Retell with provider API key plus Assistant ID. audio_transcription, conversation_coherence, conversation_resolution, and task_completion on every call.

Per-call cost lands around $0.20-$0.24 for 3-minute calls. The eval layer pays for itself the first time it catches a tool-call regression in production.

Composition 3: Regulated healthcare voice agent

A healthcare voice agent (clinical intake, refill scheduling, patient navigation) prioritizes compliance, audit trail, and STT accuracy on medical vocabulary. The audio profile includes background noise from clinical environments and a wide accent distribution from patients.

Telephony. Twilio with HIPAA BAA. Vonage if the deployment is EU-based and needs regional DIDs.
Orchestration. LiveKit Agents or Pipecat self-hosted in your VPC with full audit control.
STT. Speechmatics Ursa for the medical custom model and accent depth. AssemblyAI Universal-3 Pro is the second pick if speaker diarization matters more than accent depth.
LLM. AWS Bedrock with Claude Sonnet 4.5 inside your AWS account for the HIPAA BAA boundary and the long-context handling for patient history.
TTS. Deepgram Aura-2 for cost and natural prosody. ElevenLabs is acceptable but the cost adds up at clinic-scale call volume.
Eval + Observe. Future AGI BYOC deployment with traceAI for span capture across the self-hosted orchestration. audio_transcription plus a custom medical-vocabulary rubric (authored by the in-product agent from labeled clinical transcripts). task_completion on intake. ProtectFlash inline on every LLM response for PII and policy violations. Agent Command Center with SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 all certified.

Per-call cost lands around $0.25-$0.32 for 4-minute average healthcare calls. The compliance posture and the audit trail are the deciding axes, not the dollar.

When to swap layers

The eval layer tells you when each other layer is the wrong pick. The signals to watch:

STT. Background-noise word drops or accent-specific WER drift cluster up in Error Feed. The audio_transcription rubric drops 2-3 percentage points on a cohort of your audio. Time to A/B a different STT provider on that cohort.
LLM. Hallucination or tool-call argument errors cluster up. The task_completion or conversation_resolution rubric drops. Time to swap LLM or tune the prompt with one of agent-opt’s six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard). Optimization runs can be driven from the UI (Dataset view) or from the Python SDK.
TTS. audio_quality scores drift downward. Drop-off rate climbs in the first 10 seconds of calls. Time to A/B a different TTS provider or voice on the affected cohort.
Orchestration. End-to-end latency climbs past your budget despite individual layer latencies looking healthy. Tool-call reliability drops. Time to inspect orchestration spans and reconsider the framework.
Telephony. Connection failures or media stream degradation cluster up. Per-minute cost trends up against benchmark. Time to renegotiate or swap carrier.

Every signal is grounded in trace data and eval scores, not in vendor marketing pages. The stack stays composable because the eval layer is the unifying surface.
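A swap signal such as "the rubric drops 2-3 percentage points on a cohort" reduces to comparing per-cohort mean scores across two time windows. The sketch below is a generic illustration, not Error Feed's clustering. The function name, the cohort labels, and the 0-1 score scale are assumptions, and a production check would also want a minimum sample size per cohort.

```python
from statistics import mean
from typing import Dict, List


def cohorts_to_review(
    baseline: Dict[str, List[float]],
    current: Dict[str, List[float]],
    drop_threshold: float = 0.02,
) -> Dict[str, float]:
    """Return cohorts whose mean score fell by at least drop_threshold.

    Scores are assumed to be on a 0-1 scale, so 0.02 is two percentage points.
    """
    flagged = {}
    for cohort, base_scores in baseline.items():
        new_scores = current.get(cohort)
        if not base_scores or not new_scores:
            continue  # nothing to compare for this cohort
        drop = mean(base_scores) - mean(new_scores)
        if drop >= drop_threshold:
            flagged[cohort] = round(drop, 4)
    return flagged


if __name__ == "__main__":
    last_week = {"us_mobile": [0.92, 0.94], "uk_landline": [0.90, 0.90]}
    this_week = {"us_mobile": [0.88, 0.90], "uk_landline": [0.90, 0.89]}
    print(cohorts_to_review(last_week, this_week))
```

A flagged cohort is the set of calls to replay against an alternative provider, which keeps the A/B scoped to the audio that actually regressed.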

Three deliberate tradeoffs for the full-stack platform layer

These are deployment-posture and process choices baked into the platform, not feature gaps.

Federal procurement runs via BYOC self-host. Air-gapped BYOC is the path for federal and sovereign-data workloads. Cloud customers get SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page; ISO 42001 is in progress. The BYOC path puts the same software in the customer’s VPC with customer-owned IAM, KMS, and audit boundary.

Async eval gating is explicit. agent-opt runs against trace data only when a team triggers an explicit optimization run, picks one of the six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard), and approves the candidate prompt. FAGI never auto-rewrites prompts in production without a human approval gate. The loop is closed, but the gate is intentional.

Native voice observability first, traceAI for everything else. Native call-log capture ships for Vapi, Retell, and LiveKit out of the box (provider API key plus Assistant ID, no SDK code). Other voice runtimes (Pipecat, custom orchestration, OpenAI Agents SDK) wire in via the 30+ documented traceAI instrumentors. Best-in-class for the platform layer: eval (70+ built-in templates plus custom evaluator authoring), observe (native plus traceAI), simulation (18 pre-built personas plus unlimited custom plus Workflow Builder plus Error Localization), guardrails (Protect rule-based plus ProtectFlash sub-100ms), and optimization (six optimizers via UI and SDK).

A worked recommendation

For most production voice agent teams in 2026 standing up the full stack from scratch, the pick is:

Telephony. Twilio if you need broad coverage with no surprises. Telnyx if US-only and you want to save on per-minute cost.
Orchestration. Vapi for the BYO flexibility and community. Retell AI if latency is the top axis.
STT. Deepgram Nova-3. AssemblyAI Universal-3 Pro if multilingual or diarization matters.
LLM. OpenAI GPT-4.1-mini as the default. Claude Sonnet 4.5 for long context.
TTS. Deepgram Aura-2 for cost and prosody. ElevenLabs Turbo if voice quality is the deciding axis.
Eval + Observe + Simulation + Guardrail + Optimization (the platform layer). Future AGI. traceAI handles span capture (30+ documented integrations, OpenInference-compatible). ai-evaluation provides the 70+ built-in eval templates, including the voice rubrics, plus custom evaluator authoring. Error Feed clusters failure modes. Native voice observability covers Vapi, Retell, and LiveKit (provider API key plus Assistant ID, no SDK). Simulate adds the 18 pre-built personas, unlimited custom personas, Workflow Builder (Conversation, End Call, Transfer Call nodes), and Error Localization. Future AGI Protect supplies rule-based checks across the 4 documented safety dimensions, with ProtectFlash for the sub-100ms inline path. agent-opt offers six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) via UI and SDK. All of it runs under Agent Command Center for hosted deployment.

The stack is provider-agnostic at every layer except the eval layer. The eval layer is what tells you when to swap any of the others. The right composition six months from now will be different from the right composition today, and the only way to keep up is to score every layer continuously on your live audio.

Sources and references

Future AGI Protect: arXiv 2510.13351
GEPA optimizer: arXiv 2507.19457
Meta-Prompt optimizer: arXiv 2505.09666
Random Search baseline: arXiv 2311.09569
OpenInference span specification: github.com/Arize-ai/openinference
Future AGI trust and compliance: futureagi.com/trust
Twilio Programmable Voice: twilio.com vendor documentation
Telnyx Voice API: telnyx.com vendor documentation
Plivo Voice API: plivo.com vendor documentation
Vonage Voice API: vonage.com vendor documentation
Vapi: vapi.ai vendor documentation
Retell AI: retellai.com vendor documentation
ElevenLabs Conversational AI: elevenlabs.io vendor documentation
LiveKit Agents: livekit.io vendor documentation
Pipecat: pipecat.ai vendor documentation
Deepgram Nova-3 and Aura-2: deepgram.com vendor documentation
AssemblyAI Universal-3 Pro: assemblyai.com vendor documentation
OpenAI Whisper repository: openai-whisper GitHub
Speechmatics Ursa: speechmatics.com vendor documentation
Cartesia Sonic and Ink-Whisper: cartesia.ai vendor documentation
Rime AI: rime.ai vendor documentation

Frequently asked questions

What does a complete voice AI infrastructure stack look like in 2026?

Six layers: telephony (Twilio, Telnyx, Plivo, Vonage), orchestration (Vapi, Retell AI, ElevenLabs Agents, LiveKit, Pipecat), speech-to-text (Deepgram Nova-3, AssemblyAI Universal-3 Pro, Whisper large-v3, Cartesia Ink-Whisper, Speechmatics Ursa), LLM (OpenAI, Anthropic, Gemini 2.0, Groq, Together, Fireworks, Bedrock), text-to-speech (ElevenLabs, Cartesia Sonic 3.5, Deepgram Aura-2, Rime AI, OpenAI gpt-4o-mini-tts), and evaluation plus observability (Future AGI traceAI, ai-evaluation, Error Feed, Future AGI Protect). The picks per layer are independent. The eval layer is what tells you when to swap any of the others.

How much does a 1000-call-per-day voice AI stack cost in 2026?

Cost varies by provider mix, average call length, orchestration overhead, and routing choices. As a rough order of magnitude on 3-minute calls, telephony lands around $0.012-$0.020 per minute, STT around $0.0036-$0.0080 per minute, LLM around $0.0030-$0.0150 per call, and TTS around $0.0090-$0.0450 per minute of generated audio, with orchestration overhead on top. The worked compositions in the cost table above land near $0.20-$0.30 per call. The TTS layer is the largest variable: ElevenLabs is roughly 5x the cost of OpenAI gpt-4o-mini-tts. Model the stack per minute with your real call duration and verify against current vendor pricing before committing.

Which orchestration platform should I pick for a voice agent in 2026?

Vapi for the largest open-source community and bring-your-own-model flexibility. Retell AI for the lowest-latency hosted stack with native LLM and TTS coupling. ElevenLabs Agents for TTS-led conversational quality. LiveKit Agents for open-source WebRTC and self-hosted orchestration. Pipecat for a Python-native pipeline you want to fully own. Match the orchestration to your latency budget, your existing stack, and your appetite for self-hosting. All five compose well with Future AGI for evaluation and observability.

Can I swap providers in one layer without rewriting the others?

Mostly yes. STT, LLM, and TTS are independent layers in every serious 2026 orchestration. Vapi, Retell AI, LiveKit, and Pipecat all let you swap STT, LLM, and TTS providers via configuration. Telephony is a slightly heavier swap because phone numbers and SIP trunks are tied to the carrier. Future AGI traceAI captures provider, model, and per-stage latency as span attributes so you can A/B providers behind the same orchestration with no rewrites.
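As an illustration of a configuration-level swap, the sketch below models a pipeline as three provider labels and routes a fixed share of calls to a candidate that differs only in TTS. The `PipelineConfig` class, the provider strings, and `route_call` are hypothetical and do not correspond to any orchestrator's actual config schema.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineConfig:
    stt: str
    llm: str
    tts: str


BASELINE = PipelineConfig(stt="deepgram-nova-3", llm="gpt-4.1-mini", tts="elevenlabs-turbo")
# The candidate changes one layer and inherits the other two unchanged.
CANDIDATE = replace(BASELINE, tts="deepgram-aura-2")


def route_call(call_id: str, candidate_share: float = 0.1) -> PipelineConfig:
    """Deterministically assign a call to baseline or candidate by hashing its ID."""
    bucket = int(hashlib.sha256(call_id.encode("utf-8")).hexdigest(), 16) % 100
    return CANDIDATE if bucket < candidate_share * 100 else BASELINE


if __name__ == "__main__":
    for call_id in ("call-001", "call-002", "call-003"):
        print(call_id, route_call(call_id, candidate_share=0.5).tts)
```

Hashing the call ID keeps the assignment stable across retries, so eval scores for the two arms can be compared on disjoint, reproducible sets of calls.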

Where does Future AGI fit in the voice stack?

Future AGI is the evaluation and observability layer that scores any combination of the other five layers. traceAI captures OpenInference spans across the orchestration, STT, LLM, and TTS calls with dedicated traceAI-pipecat and traceai-livekit packages. ai-evaluation runs audio_transcription and audio_quality on audio inputs, plus conversation_coherence and conversation_resolution on captured transcripts. Error Feed clusters failures into named issues. Native voice observability for Vapi, Retell, and LiveKit needs only a provider API key plus Assistant ID, no SDK changes.

How do I evaluate a voice stack end-to-end on real production audio?

Sample 500-1000 real calls and run them through Future AGI ai-evaluation. The audio_transcription rubric scores STT quality. conversation_coherence and conversation_resolution score the multi-turn flow. audio_quality scores TTS output. task_completion scores whether the agent finished the user's intent. For coverage your production traffic does not yet exercise, Future AGI Simulation generates branching scenarios with 18 pre-built personas plus custom personas controlling accent, age range, gender, location, background noise, and multilingual settings.

What is the difference between Twilio, Telnyx, Plivo, and Vonage for voice AI telephony?

Twilio is the default with the widest geographic coverage, strongest SIP and BYO number support, and the highest per-minute price. Telnyx is the engineering-led alternative with lower per-minute rates, a competitive SIP trunk, and good developer ergonomics. Plivo is the value pick with strong reliability at the lowest per-minute pricing in most regions. Vonage is the enterprise pick with deep regional coverage in Europe and APAC and a large legacy contact-center footprint. All four expose programmable voice APIs that the major orchestration platforms integrate with directly.
