Voice AI Infrastructure Stack: A 2026 Pick-by-Stage Decision Guide
Pick-by-stage guide to the 2026 voice AI stack: telephony, orchestration, STT, LLM, TTS, and eval/observe. Real pricing, real picks, and three full compositions.
If you’re shipping a voice AI agent in 2026, the stack you compose matters more than any single provider on it. Six layers (telephony, orchestration, speech-to-text, LLM, text-to-speech, and evaluation plus observability) each have three to five real contenders, and the right pick depends on your audio profile, latency budget, compliance posture, and call volume. This guide ranks the picks per layer, gives you the real per-minute economics, and shows three full compositions that ship to production today.
TL;DR: pick by layer
| Layer | Default pick | When to swap |
|---|---|---|
| Telephony | Twilio | Telnyx for cost, Plivo for value, Vonage for EU/APAC enterprise |
| Orchestration | Vapi | Retell AI for lowest latency, ElevenLabs Agents for TTS quality, LiveKit for open-source, Pipecat for Python-native |
| STT | Deepgram Nova-3 | AssemblyAI for multilingual, Cartesia Ink-Whisper for sub-100ms, Speechmatics Ursa for EU, Whisper for self-host |
| LLM | OpenAI gpt-4.1-mini | Claude Sonnet 4.5 for long context, Gemini 2.0 Flash for multimodal, Groq for latency, Bedrock for BYOC compliance |
| TTS | ElevenLabs | Cartesia Sonic 3.5 for lowest TTFA, Deepgram Aura-2 for cost + prosody, Rime AI for utility voices, OpenAI gpt-4o-mini-tts for cheapest |
| Eval + Observe | Future AGI | Run any of the 70+ built-in rubrics on every call. Native voice obs for Vapi/Retell/LiveKit. SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 certified |
The picks below go layer by layer. Three end-to-end compositions follow, with per-call economics.
Layer 1: Telephony
Telephony is the pipe. It carries the PSTN call into your stack and the agent audio back out. The decisions here are: cost per minute, SIP support for BYO trunks, geographic coverage, and bring-your-own number flow.
- Twilio (default). Programmable Voice covers 100+ countries with the widest DID inventory in the industry. SIP trunking is mature inbound and outbound. BYO number via SIP refer or hosted number port. Pricing sits at $0.014 per minute inbound and $0.022 per minute outbound on US toll-free. Best for broadest geographic coverage with minimal surprises. Per-minute cost is the highest of the four.
- Telnyx (cost + engineering). Runs its own private global network instead of leasing PSTN access, which shows up as 20-40% lower per-minute pricing on US traffic and better SIP latency end to end. Inbound starts at $0.0035 per minute on US local DIDs, outbound at $0.0070 per minute. SIP trunking is a first-class product. Best for SIP-heavy stacks where the saving over Twilio materially changes unit economics. Geographic coverage is narrower than Twilio.
- Plivo (value). Prices below Telnyx on most outbound corridors while keeping a reliable carrier-grade network. Inbound starts at $0.0075 per minute on US local DIDs, outbound at $0.0050 per minute. Best for high-volume outbound voice agents where per-minute cost dominates. Documentation depth and SDK breadth trail Twilio.
- Vonage (EU/APAC enterprise). Carries the largest legacy contact-center footprint in Europe and APAC. SIP coverage is deep in EMEA, with regional DIDs in countries Twilio still routes through partners. Per-minute pricing comparable to Twilio on most corridors, lower on intra-EU. Best for EU-headquartered voice agents needing regional DIDs in Germany, France, Italy, Spain, Nordics, and APAC deployments with a Vonage interconnect. Developer ergonomics are the weakest of the four.
How to pick telephony. US-only and developer-led: Telnyx wins on cost and ergonomics. Global with broadest coverage: Twilio wins on reliability. High-volume outbound: Plivo wins on price. EU enterprise with procurement preference: Vonage wins on regional depth. Lock the carrier last; orchestration usually integrates with all four.
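To make the carrier comparison concrete, the per-minute figures quoted above can be dropped into a small calculator. This is a minimal sketch: the `RATES` table and the helper names are illustrative, Vonage is left out because no per-minute figure is quoted for it, and the Twilio numbers are the toll-free rates while Telnyx and Plivo are local-DID rates, so the comparison is indicative rather than like-for-like.

```python
# Per-minute US rates (USD) as quoted in the carrier list above.
# Assumption: Twilio figures are toll-free, Telnyx and Plivo are local DIDs.
RATES = {
    "Twilio": {"inbound": 0.014, "outbound": 0.022},
    "Telnyx": {"inbound": 0.0035, "outbound": 0.0070},
    "Plivo": {"inbound": 0.0075, "outbound": 0.0050},
}


def call_cost(carrier: str, direction: str, minutes: float) -> float:
    """Telephony cost in USD for one call of the given length."""
    return RATES[carrier][direction] * minutes


def cheapest(direction: str) -> str:
    """Carrier with the lowest quoted per-minute rate for a direction."""
    return min(RATES, key=lambda name: RATES[name][direction])


if __name__ == "__main__":
    for name in RATES:
        inbound = call_cost(name, "inbound", 3)
        outbound = call_cost(name, "outbound", 3)
        print(f"{name}: inbound ${inbound:.4f}, outbound ${outbound:.4f} per 3-minute call")
    print("Cheapest inbound:", cheapest("inbound"))
    print("Cheapest outbound:", cheapest("outbound"))
```

On these quoted rates Telnyx is cheapest inbound and Plivo is cheapest outbound, which matches the guidance above. Replace the table with your negotiated rates before drawing conclusions.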
Layer 2: Orchestration
Orchestration is the glue. It receives the telephony media stream, routes audio to STT, sends transcripts to an LLM, pipes responses to TTS, and streams synthesized audio back. The five contenders in 2026 trade off open-source vs hosted, latency, and Python-native vs JavaScript-native.
- Vapi (community + BYO). The largest open-source-aligned community in voice agents and the most extensible at the BYO-model layer. Pick your own STT, LLM, and TTS; Vapi wires them together with a configurable pipeline that exposes turn detection, interruption handling, tool calls, and webhooks at every stage. Public-tier pricing lists $0.05 per minute of platform overhead on top of underlying provider costs. Best for teams wanting maximum BYO flexibility without operating orchestration themselves.
- Retell AI (lowest hosted latency). Ships an opinionated stack with native coupling between LLM and TTS that shaves 100-200ms off end-to-end latency vs more flexible orchestrations. Per-minute pricing bundles orchestration plus hosted LLM and TTS. Best for latency-critical voice agents where 500ms vs 700ms round-trip is the difference between conversational and clunky. BYO surface narrower than Vapi.
- ElevenLabs Agents (voice quality). Conversational AI built on top of ElevenLabs’ high-fidelity TTS, with an integrated runtime emphasizing voice quality, natural prosody, and brand voice cloning. Subscription-based with per-minute usage. Best for brand-led voice (consumer, hospitality, media) where the voice is the product. Pricing is the highest of the hosted options.
- LiveKit Agents (open-source WebRTC). Apache-licensed framework for voice agents on top of LiveKit Cloud or self-hosted infra. BYO model end to end. LiveKit handles WebRTC media transport, turn detection, and agent orchestration. Cloud charges per participant minute; self-host is free in license cost. Best for teams that want full control over the orchestration runtime and WebRTC-first deployments where PSTN is secondary.
- Pipecat (Python-native). Apache-licensed Python framework from Daily. Pipeline composes processors (STT, LLM, TTS, VAD, tool execution) into a graph with explicit interruption handling and turn detection. Native integrations cover Deepgram, AssemblyAI, OpenAI, Anthropic, ElevenLabs, Cartesia, Daily, LiveKit, and more. Best for Python-first teams that already operate ML services in Python.
How to pick orchestration. Self-host with full control: LiveKit or Pipecat. Hosted with BYO flexibility: Vapi. Hosted with lowest latency: Retell. Brand-voice-led: ElevenLabs Agents. The eval layer is independent in every case.
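The decision rule above is simple enough to encode as a lookup, which can be handy in an internal architecture checklist. The sketch below is illustrative only: the `Priority` enum, the `PICKS` mapping, and the `python_first` flag are hypothetical names, not part of any vendor SDK.

```python
from enum import Enum


class Priority(Enum):
    SELF_HOST = "self-host with full control"
    HOSTED_BYO = "hosted with BYO flexibility"
    HOSTED_LOW_LATENCY = "hosted with lowest latency"
    BRAND_VOICE = "brand-voice-led"


# Mapping taken directly from the "How to pick orchestration" rule above.
PICKS = {
    Priority.SELF_HOST: ("LiveKit Agents", "Pipecat"),
    Priority.HOSTED_BYO: ("Vapi",),
    Priority.HOSTED_LOW_LATENCY: ("Retell AI",),
    Priority.BRAND_VOICE: ("ElevenLabs Agents",),
}


def pick_orchestration(priority: Priority, python_first: bool = False) -> str:
    """Return the suggested orchestration layer for a single top priority."""
    candidates = PICKS[priority]
    # Pipecat is the Python-native option, so prefer it for Python-first teams
    # whenever it is among the candidates.
    if python_first and "Pipecat" in candidates:
        return "Pipecat"
    return candidates[0]


if __name__ == "__main__":
    print(pick_orchestration(Priority.SELF_HOST, python_first=True))
    print(pick_orchestration(Priority.HOSTED_LOW_LATENCY))
```

A single-priority lookup is deliberately crude. Real selection usually weighs several axes at once, so treat the output as a starting shortlist.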
Layer 3: Speech-to-text (STT)
STT determines the ceiling on every downstream quality metric. A 12% word error rate cascades into intent miss, retrieval miss, and bad agent responses no LLM can recover from. The picks below condense the STT-specific guide for stack composition.
- Deepgram Nova-3. WER leader on hard call-center audio (telephony codecs, background noise, cross-talk, accent diversity). First-partial 90-130ms P50. Real production WER 6-10% on conversational telephony. Streaming starts at $0.0043 per minute.
- AssemblyAI Universal-3 Pro. Broad multilingual coverage, polished speaker diarization, and the strongest English-nuance handling. WER 7-11% on conversational English. First-partial 100-150ms P50. Streaming starts at $0.0042 per minute.
- Whisper large-v3 (self-host). The open-weight baseline. 1.5B parameters under MIT license. Most accent-robust model in 2026 and strongest code-switching model in production. Batch model; real-time wrappers carry 300-600ms first-partial latency. Free in license; you pay for compute.
- Cartesia Ink-Whisper. Streaming Whisper-class ASR with sub-100ms first-partial in clean conditions. Distilled causalized Whisper that trades long-tail language coverage for latency. 70-95ms P50 first-partial.
- Speechmatics Ursa. Accent and dialect leader. 55+ languages with strongest European quality and deepest accent coverage (Scottish, Welsh, South African, Australian, NZ English all within 1-2% WER of standard English). Custom medical, legal, broadcast, and finance models.
How to pick STT. Score on your audio. Public benchmarks lie. Sample 500-1000 utterances from your production traffic, generate ground-truth transcripts, and run every candidate through the audio_transcription rubric in Future AGI ai-evaluation. The winner on your audio is the right pick.
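If you want a quick sanity check before wiring up a hosted rubric, word error rate can be computed locally from your ground-truth transcripts. The sketch below is a plain word-level Levenshtein WER, not the audio_transcription rubric itself. The function names and the sample transcripts are made up for illustration, and it does no text normalization beyond lowercasing.

```python
def word_edit_distance(ref_words: list[str], hyp_words: list[str]) -> int:
    """Minimum substitutions + insertions + deletions to turn hyp into ref."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) pairs: total edits / total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        total_edits += word_edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("need at least one non-empty reference transcript")
    return total_edits / total_words


if __name__ == "__main__":
    references = ["i want to reschedule my appointment", "tuesday at three pm"]
    # Hypothetical outputs from two candidate STT providers on the same audio.
    candidates = {
        "provider_a": ["i want to schedule my appointment", "tuesday at three"],
        "provider_b": ["i want to reschedule my appointment", "tuesday at tree pm"],
    }
    scores = {
        name: corpus_wer(list(zip(references, hyps)))
        for name, hyps in candidates.items()
    }
    for name, wer in sorted(scores.items(), key=lambda item: item[1]):
        print(f"{name}: WER {wer:.1%}")
```

Pooling edits across the whole sample, rather than averaging per-utterance WER, keeps short utterances from dominating the score.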
Layer 4: LLM
The LLM is the brain. Voice agents need a model that handles streaming, partial inputs, and tool calls while keeping time-to-first-token inside a 500-1500ms budget.
- OpenAI (default). GPT-4.1-mini and gpt-4o-mini hit the right cost-latency-quality point for most voice agents. Streaming mature, tool calls reliable, time-to-first-token 350-600ms P50. GPT-4.1-mini is roughly $0.40 per million input tokens and $1.60 per million output tokens. At 200 input + 80 output tokens per turn with 10 turns, LLM cost lands around $0.0021 per call.
- Anthropic Claude (long context). Claude Sonnet 4.5 handles 200K context, which matters when your agent passes long tool outputs (CRM history, knowledge base chunks) back into every turn. Tool calling is among the most reliable. Sonnet 4.5 is roughly $3 per million input and $15 per million output tokens.
- Gemini 2.0 (multimodal). Gemini 2.0 Flash handles native multimodal inputs (audio in, text out), collapsing STT and LLM into a single call. Lowest-cost tier among frontier models.
- Groq (latency). LPU inference runs Llama 3.1 70B, Llama 3.3, and Mixtral at time-to-first-token under 200ms. Right pick when latency dominates and the task is conversational rather than reasoning-heavy.
- Together / Fireworks (cost). Hosted Llama, Qwen, DeepSeek, and Mistral at lower per-token cost than frontier closed models. Fireworks ships fine-tuning; Together ships dedicated endpoints.
- AWS Bedrock (BYOC compliance). Hosts Claude, Llama, Mistral, and Titan inside your AWS account with VPC peering, KMS-encrypted prompts, and the AWS audit boundary. Pick this when HIPAA BAA or federal audit boundary matters more than lowest latency.
How to pick the LLM. Match the LLM to the task profile. Conversational with short context: GPT-4.1-mini or Llama 3.3 on Groq. Long-context retrieval per turn: Claude Sonnet 4.5. Multimodal: Gemini 2.0 Flash. Regulated workload that needs to stay in AWS: Bedrock.
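The per-call LLM figure quoted for GPT-4.1-mini is straightforward arithmetic, and it is worth being able to rerun it with your own token counts. The sketch below uses the per-million-token prices quoted above. The dictionary keys are plain labels rather than API model IDs, and it assumes a flat token count per turn, which understates cost if conversation history is resent and grows on every turn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Prices as quoted in the LLM section above (labels, not API model IDs).
PRICES = {
    "gpt-4.1-mini": TokenPrice(0.40, 1.60),
    "claude-sonnet-4.5": TokenPrice(3.00, 15.00),
}


def llm_cost_per_call(
    price: TokenPrice,
    input_tokens_per_turn: int = 200,
    output_tokens_per_turn: int = 80,
    turns: int = 10,
) -> float:
    """USD cost of one call, assuming a flat token count on every turn."""
    input_cost = input_tokens_per_turn * turns * price.input_per_million / 1_000_000
    output_cost = output_tokens_per_turn * turns * price.output_per_million / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    for label, price in PRICES.items():
        print(f"{label}: ${llm_cost_per_call(price):.4f} per call")
```

With the defaults, the GPT-4.1-mini line works out to 2,000 input tokens ($0.0008) plus 800 output tokens ($0.00128), which rounds to the $0.0021 quoted above.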
Layer 5: Text-to-speech (TTS)
TTS is the most cost-variable layer in the stack. ElevenLabs and OpenAI gpt-4o-mini-tts differ by roughly 5x on per-minute cost for the same generated audio length.
- ElevenLabs (quality + cloning). v2.5 Turbo and v3 deliver the most natural prosody of any commercial TTS, with the deepest voice library and the most realistic voice cloning. Streaming TTFA 150-300ms P50 on Turbo. Production voice agents at scale typically land at $0.045-$0.070 per minute of generated audio. Best for brand-led voice and cloning use cases.
- Cartesia Sonic 3.5 (lowest streaming TTFA). Sub-100ms streaming time-to-first-audio in clean conditions with a small but growing voice library. Designed around the voice-agent use case where every TTS millisecond comes out of perceived response time. Best for latency-critical agents.
- Deepgram Aura-2 (cost + prosody). Natural-sounding voices at a fraction of the per-minute cost of ElevenLabs Turbo: $0.0150 per minute of generated audio against the $0.045-$0.070 range above. Streaming TTFA 150-250ms P50. Best for high-volume support where ElevenLabs is more than you need.
- Rime AI (utility voices). Conversational, regional, and accent-diverse voices that sound less polished than ElevenLabs but more authentic for collections, sales outreach, and research. Best for outbound voice where regional authenticity drives engagement.
- OpenAI gpt-4o-mini-tts (cheapest). Roughly 20% of ElevenLabs Turbo pricing with usable quality. Streaming TTFA competitive. Smaller voice library. Best for cost-led deployments, internal tools, and prototypes.
How to pick TTS. Quality + cloning: ElevenLabs. Lowest TTFA: Cartesia Sonic 3.5. Cost + prosody: Deepgram Aura-2. Regional voices: Rime AI. Cheapest viable: OpenAI gpt-4o-mini-tts. Score samples on the audio_quality rubric in Future AGI ai-evaluation before committing.
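With STT, LLM, and TTS chosen, the three latency figures quoted in the layers above can be added up as a first-pass response-time model. The sketch below sums the P50 ranges for the default picks. The stage labels and helper names are illustrative, and the sum is a rough planning number only: it ignores telephony transport, turn detection, and network hops, and adding P50s does not give the P50 of the whole pipeline.

```python
from typing import NamedTuple


class LatencyRange(NamedTuple):
    low_ms: int
    high_ms: int


# P50 ranges quoted in the STT, LLM, and TTS sections for the default picks.
DEFAULT_STACK = {
    "stt_first_partial (Deepgram Nova-3)": LatencyRange(90, 130),
    "llm_time_to_first_token (OpenAI)": LatencyRange(350, 600),
    "tts_time_to_first_audio (ElevenLabs Turbo)": LatencyRange(150, 300),
}


def model_latency(stages: dict[str, LatencyRange]) -> LatencyRange:
    """Sum the per-stage ranges into a rough end-to-end range."""
    return LatencyRange(
        low_ms=sum(stage.low_ms for stage in stages.values()),
        high_ms=sum(stage.high_ms for stage in stages.values()),
    )


def fits_budget(total: LatencyRange, budget_ms: int) -> bool:
    """True only if even the slow end of the range is inside the budget."""
    return total.high_ms <= budget_ms


if __name__ == "__main__":
    total = model_latency(DEFAULT_STACK)
    print(f"Modeled model-side latency: {total.low_ms}-{total.high_ms}ms")
    print("Fits an 800ms budget:", fits_budget(total, 800))
```

For the default picks the modeled range is 590-1030ms, so the LLM stage is the first place to look when the budget is tight. Measure real spans before acting on a model like this.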
Layer 6: Evaluation and observability
The eval and observability layer is where Future AGI leads. The other five layers are independent components you compose; the eval layer is what tells you when to swap any of them.
Future AGI
traceAI instruments every layer with OpenInference-compatible spans across 30+ documented integrations in Python and TypeScript, including dedicated traceAI-pipecat and traceai-livekit packages for the voice frameworks above. Span attributes capture STT provider plus model plus first-partial latency, LLM provider plus model plus time-to-first-token, TTS provider plus voice ID plus time-to-first-audio, and tool call metadata. Apache 2.0.
ai-evaluation ships 70+ built-in eval templates in the Apache 2.0 SDK. The voice-specific rubrics that matter for stack evaluation:
- audio_transcription scores STT output against ground truth.
- audio_quality scores TTS output for naturalness and prosody.
- conversation_coherence scores multi-turn coherence.
- conversation_resolution scores whether the agent resolved the user’s request.
- task_completion scores whether the agent finished the intended task.
- translation_accuracy and cultural_sensitivity cover multilingual.
- caption_hallucination catches multimodal hallucination.
```python
# Audio-input rubrics (audio_transcription, audio_quality) consume MLLMAudio
from fi.testcases import MLLMTestCase, MLLMAudio
from fi.evals import Evaluator, AudioTranscriptionEvaluator, AudioQualityEvaluator

audio = MLLMAudio(url="path/to/call_recording.wav")
audio_case = MLLMTestCase(
    input=audio,
    query="Score STT quality and TTS audio quality on this call",
)

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
audio_result = ev.evaluate(
    eval_templates=[AudioTranscriptionEvaluator(), AudioQualityEvaluator()],
    inputs=[audio_case],
)

# Conversation rubrics (conversation_coherence, conversation_resolution) consume transcripts
from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals import ConversationCoherence, ConversationResolution

conv = ConversationalTestCase(messages=[
    LLMTestCase(query="I want to reschedule", response="Sure. What day works best?"),
    LLMTestCase(query="Tuesday at 3pm", response="Booked you for Tuesday at 3pm."),
])
conv_result = ev.evaluate(
    eval_templates=[ConversationCoherence(), ConversationResolution()],
    inputs=[conv],
)
```
Native voice observability ships for Vapi, Retell AI, and LiveKit. It is dashboard-driven, with no SDK required. Add the provider API key plus Assistant ID to a Future AGI Agent Definition, and automatic call log capture starts immediately. Every call gets separate assistant and customer audio downloads, an auto-transcript, and the full eval engine on every recording.
Error Feed auto-clusters trace failures into named issues with auto-written root cause, quick fix, and long-term recommendation. Background-noise word drops, accent-specific WER drift, LLM hallucination, TTS prosody degradation, tool-call argument errors each become their own named cluster instead of drowning in raw spans.
Future AGI Protect is built on Gemma 3n with LoRA-trained category-specific adapters per arXiv 2510.13351. Multi-modal across text, image, and audio. Rule-based Protect covers the 4 documented safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance). ProtectFlash is the single-call binary classifier for the sub-100ms inline path; rule-based Protect is better suited to async or per-route policy checks where richer per-rule attribution outweighs scan time.
Agent Command Center hosts the stack with RBAC, AWS Marketplace availability, multi-region hosting, BYOC for regulated workloads, 15+ providers exposed via the router surface, and SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certifications per the trust page.
Per-call cost economics at 1000 calls per day
A realistic stack lands somewhere in the low hundreds of dollars per 1000 calls of 3-minute average length, with the spread driven by orchestration overhead and TTS choice. The table below breaks the math down for two representative compositions; treat the numbers as a starting model, not a quote.
| Layer | Cost-led stack | Quality-led stack |
|---|---|---|
| Telephony (Plivo inbound vs Twilio inbound, 3 min) | $0.0225 | $0.0420 |
| Orchestration (Vapi platform fee, 3 min) | $0.1500 | $0.1500 |
| STT (Deepgram Nova-3 streaming, 3 min) | $0.0129 | $0.0129 |
| LLM (gpt-4.1-mini vs Claude Sonnet 4.5, 10 turns) | $0.0021 | $0.0180 |
| TTS (gpt-4o-mini-tts vs ElevenLabs Turbo, 90 sec audio) | $0.0135 | $0.0675 |
| Eval + Observe (Future AGI Pro tier amortized) | $0.0050 | $0.0050 |
| Total per call | $0.206 | $0.295 |
| Per 1000 calls | $206 | $295 |
| Per 30K calls (1000/day x 30 days) | $6,180 | $8,850 |
The TTS layer is the largest variable between the two stacks. Swapping ElevenLabs Turbo for Deepgram Aura-2 cuts the 90-second TTS line from $0.0675 to about $0.0225 at the $0.0150 per-minute rate quoted above, a saving of roughly $0.045 per call. Swapping Vapi orchestration for self-hosted LiveKit removes the $0.15 platform fee at the cost of operating WebRTC infrastructure. Eval and observability overhead should be modeled per deployment; the value lies in catching provider, prompt, and tool regressions early rather than after they affect production calls.
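The cost-led column can be rebuilt from the per-minute rates quoted in the layer sections, which makes it easy to rerun the model with your own call length and volume. This is a minimal sketch: the layer keys and function names are illustrative, and the LLM, TTS, and eval lines are taken as flat per-call figures from the table rather than re-derived.

```python
CALL_MINUTES = 3

# Cost-led stack, USD per call. Per-minute rates come from the layer sections;
# the LLM, TTS, and eval lines are the flat per-call figures from the table.
COST_LED_STACK = {
    "telephony": 0.0075 * CALL_MINUTES,     # Plivo inbound
    "orchestration": 0.05 * CALL_MINUTES,   # Vapi platform fee
    "stt": 0.0043 * CALL_MINUTES,           # Deepgram Nova-3 streaming
    "llm": 0.0021,                          # gpt-4.1-mini, 10 turns
    "tts": 0.0135,                          # gpt-4o-mini-tts, 90 sec of audio
    "eval_observe": 0.0050,                 # amortized platform cost
}


def per_call_total(layers: dict[str, float]) -> float:
    """Sum of all layer costs for a single call."""
    return sum(layers.values())


def largest_layer(layers: dict[str, float]) -> str:
    """Name of the layer that contributes the most to per-call cost."""
    return max(layers, key=layers.get)


if __name__ == "__main__":
    total = per_call_total(COST_LED_STACK)
    print(f"Per call:       ${total:.3f}")
    print(f"Per 1000 calls: ${total * 1000:,.0f}")
    print(f"Per 30K calls:  ${total * 30_000:,.0f}")
    top = largest_layer(COST_LED_STACK)
    print(f"Largest line:   {top} ({COST_LED_STACK[top] / total:.0%} of total)")
```

Running it reproduces the $0.206, $206, and $6,180 figures for the cost-led stack. It also shows that the orchestration platform fee is the largest single line, so the hosted-versus-self-hosted decision moves the total more than any one provider swap.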
Three end-to-end compositions
The stack picks compose in three realistic shapes. Pick the composition that matches your operating posture and audio profile.
Composition 1: Consumer voice agent
A consumer voice agent (smart assistant, in-app voice, brand-led companion) prioritizes voice quality and conversational naturalness. Latency matters but the user has tolerance for 600-800ms round-trip if the agent sounds human.
- Telephony. Twilio if PSTN matters. Often WebRTC-only via LiveKit if the surface is in-app voice.
- Orchestration. ElevenLabs Agents for the bundled TTS-led experience, or Vapi if you want BYO flexibility.
- STT. AssemblyAI Universal-3 Pro for the English-nuance handling.
- LLM. OpenAI GPT-4.1-mini for cost, or Claude Sonnet 4.5 if long context per turn matters.
- TTS. ElevenLabs v2.5 Turbo or v3 for quality. Voice cloning if the brand needs a consistent persona.
- Eval + Observe. Future AGI traceAI for span capture, audio_quality plus conversation_coherence rubrics on every call, Error Feed for drop-off clusters.
Per-call cost lands around $0.25-$0.30 for 3-minute calls. The voice quality and brand consistency justify the premium.
Composition 2: Enterprise SaaS voice agent
An enterprise SaaS voice agent (sales caller, support deflection, appointment booking) prioritizes per-call unit economics and tool-call reliability. The audio profile is mixed (clean office audio plus mobile callers with background noise).
- Telephony. Telnyx for cost on US traffic. Twilio for global coverage with no surprises.
- Orchestration. Vapi for BYO flexibility, or Retell AI if latency is the deciding axis.
- STT. Deepgram Nova-3 for the WER lead on conversational telephony.
- LLM. OpenAI gpt-4.1-mini for cost-quality balance.
- TTS. Deepgram Aura-2 for natural prosody at a fraction of the ElevenLabs cost. Cartesia Sonic 3.5 if sub-500ms round-trip is mandatory.
- Eval + Observe. Future AGI native voice observability for Vapi or Retell with provider API key plus Assistant ID. audio_transcription, conversation_coherence, conversation_resolution, and task_completion on every call.
Per-call cost lands around $0.20-$0.24 for 3-minute calls. The eval layer pays for itself the first time it catches a tool-call regression in production.
Composition 3: Regulated healthcare voice agent
A healthcare voice agent (clinical intake, refill scheduling, patient navigation) prioritizes compliance, audit trail, and STT accuracy on medical vocabulary. The audio profile includes background noise from clinical environments and a wide accent distribution from patients.
- Telephony. Twilio with HIPAA BAA. Vonage if the deployment is EU-based and needs regional DIDs.
- Orchestration. LiveKit Agents or Pipecat self-hosted in your VPC with full audit control.
- STT. Speechmatics Ursa for the medical custom model and accent depth. AssemblyAI Universal-3 Pro is the second pick if speaker diarization matters more than accent depth.
- LLM. AWS Bedrock with Claude Sonnet 4.5 inside your AWS account for the HIPAA BAA boundary and the long-context handling for patient history.
- TTS. Deepgram Aura-2 for cost and natural prosody. ElevenLabs is acceptable but the cost adds up at clinic-scale call volume.
- Eval + Observe. Future AGI BYOC deployment with traceAI for span capture across the self-hosted orchestration. audio_transcription plus a custom medical-vocabulary rubric (authored by the in-product agent from labeled clinical transcripts). task_completion on intake. ProtectFlash inline on every LLM response for PII and policy violations. Agent Command Center with SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 all certified.
Per-call cost lands around $0.25-$0.32 for 4-minute average healthcare calls. The compliance posture and the audit trail are the deciding axes, not the dollar.
When to swap layers
The eval layer tells you when each other layer is the wrong pick. The signals to watch:
- STT. Background-noise word drops or accent-specific WER drift cluster up in Error Feed. The audio_transcription rubric drops 2-3 percentage points on a cohort of your audio. Time to A/B a different STT provider on that cohort.
- LLM. Hallucination or tool-call argument errors cluster up. The task_completion or conversation_resolution rubric drops. Time to swap LLM or tune the prompt with one of agent-opt’s six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard). Both UI-driven (Dataset view) and SDK-driven via Python.
- TTS. audio_quality scores drift downward. Drop-off rate climbs in the first 10 seconds of calls. Time to A/B a different TTS provider or voice on the affected cohort.
- Orchestration. End-to-end latency climbs past your budget despite individual layer latencies looking healthy. Tool-call reliability drops. Time to inspect orchestration spans and reconsider the framework.
- Telephony. Connection failures or media stream degradation cluster up. Per-minute cost trends up against benchmark. Time to renegotiate or swap carrier.
Every signal is grounded in trace data and eval scores, not in vendor marketing pages. The stack stays composable because the eval layer is the unifying surface.
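The cohort-level rubric drop described in the signals above can be checked with a few lines once scores are exported. The sketch below is a hypothetical illustration: it assumes rubric scores on a 0-1 scale grouped by cohort name, and the function names, cohort labels, sample scores, and the 2-point default threshold are assumptions for the example rather than anything the platform prescribes.

```python
from statistics import mean


def drop_in_points(baseline: list[float], current: list[float]) -> float:
    """Drop in mean rubric score, in percentage points (scores on a 0-1 scale)."""
    return (mean(baseline) - mean(current)) * 100


def cohorts_to_ab_test(
    baseline_by_cohort: dict[str, list[float]],
    current_by_cohort: dict[str, list[float]],
    threshold_points: float = 2.0,
) -> list[str]:
    """Cohorts whose mean score fell by at least the threshold."""
    flagged = []
    for cohort, baseline in baseline_by_cohort.items():
        current = current_by_cohort.get(cohort)
        if not current:
            continue  # no recent data for this cohort, nothing to compare
        if drop_in_points(baseline, current) >= threshold_points:
            flagged.append(cohort)
    return flagged


if __name__ == "__main__":
    # Made-up audio_transcription scores for two audio cohorts.
    baseline = {
        "us_mobile": [0.93, 0.91, 0.92, 0.94],
        "uk_landline": [0.95, 0.94, 0.96],
    }
    current = {
        "us_mobile": [0.90, 0.89, 0.91, 0.90],
        "uk_landline": [0.95, 0.93, 0.96],
    }
    print("A/B a different provider on:", cohorts_to_ab_test(baseline, current))
```

A mean comparison like this is only a trigger for a closer look. With small cohorts, confirm the drop on a larger sample before swapping a provider.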
Three deliberate tradeoffs for the full-stack platform layer
These are deployment-posture and process choices baked into the platform, not feature gaps.
Federal procurement runs via BYOC self-host. Air-gapped BYOC is the path for federal and sovereign-data workloads. Cloud customers get SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page; ISO 42001 is in progress. The BYOC path puts the same software in the customer’s VPC with customer-owned IAM, KMS, and audit boundary.
Async eval gating is explicit. agent-opt runs against trace data only when a team triggers an explicit optimization run, picks one of the six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard), and approves the candidate prompt. Future AGI never auto-rewrites prompts in production without a human approval gate. The loop is closed, but the gate is intentional.
Native voice obs and Enable Others. Native call-log capture ships for Vapi, Retell, and LiveKit out of the box (provider API key plus Assistant ID, no SDK code). Other voice runtimes (Pipecat, custom orchestration, OpenAI Agents SDK) wire in via the 30+ documented traceAI instrumentors. Best-in-class for the platform layer: eval (70+ built-in templates plus custom evaluator authoring), observe (native plus traceAI), simulation (18 pre-built personas plus unlimited custom plus Workflow Builder plus Error Localization), guardrails (Protect rule-based plus ProtectFlash sub-100ms), and optimization (six optimizers via UI and SDK).
A worked recommendation
For most production voice agent teams in 2026 standing up the full stack from scratch, the pick is:
- Telephony. Twilio if you need broad coverage with no surprises. Telnyx if US-only and you want to save on per-minute cost.
- Orchestration. Vapi for the BYO flexibility and community. Retell AI if latency is the top axis.
- STT. Deepgram Nova-3. AssemblyAI Universal-3 Pro if multilingual or diarization matters.
- LLM. OpenAI GPT-4.1-mini as the default. Claude Sonnet 4.5 for long context.
- TTS. Deepgram Aura-2 for cost and prosody. ElevenLabs Turbo if voice quality is the deciding axis.
- Eval + Observe + Simulation + Guardrail + Optimization (the platform layer). Future AGI: traceAI for span capture (30+ documented integrations, OpenInference-compatible), ai-evaluation for the 70+ built-in voice rubrics plus custom evaluator authoring, Error Feed for failure-mode clustering, native voice observability for Vapi/Retell/LiveKit (provider API key plus Assistant ID, no SDK), Simulate with the 18 pre-built personas plus unlimited custom plus Workflow Builder (Conversation, End Call, Transfer Call nodes) plus Error Localization, Future AGI Protect rule-based plus ProtectFlash for sub-100ms inline guardrails across the 4 documented safety dimensions, agent-opt with six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) via UI and SDK, all under Agent Command Center for hosted deployment.
The stack is provider-agnostic at every layer except the eval layer. The eval layer is what tells you when to swap any of the others. The right composition six months from now will be different from the right composition today, and the only way to keep up is to score every layer continuously on your live audio.
Related reading
- 7 Best STT Providers for Voice AI Agents in 2026
- Sub-500ms Voice AI: The Complete Latency Budget Guide for 2026
- Voice Agent Logging and Analytics Architecture in 2026
- How to Implement Voice AI Observability in 2026
Sources and references
- Future AGI Protect: arXiv 2510.13351
- GEPA optimizer: arXiv 2507.19457
- Meta-Prompt optimizer: arXiv 2505.09666
- Random Search baseline: arXiv 2311.09569
- OpenInference span specification: github.com/Arize-ai/openinference
- Future AGI trust and compliance: futureagi.com/trust
- Twilio Programmable Voice: twilio.com vendor documentation
- Telnyx Voice API: telnyx.com vendor documentation
- Plivo Voice API: plivo.com vendor documentation
- Vonage Voice API: vonage.com vendor documentation
- Vapi: vapi.ai vendor documentation
- Retell AI: retellai.com vendor documentation
- ElevenLabs Conversational AI: elevenlabs.io vendor documentation
- LiveKit Agents: livekit.io vendor documentation
- Pipecat: pipecat.ai vendor documentation
- Deepgram Nova-3 and Aura-2: deepgram.com vendor documentation
- AssemblyAI Universal-3 Pro: assemblyai.com vendor documentation
- OpenAI Whisper repository: openai-whisper GitHub
- Speechmatics Ursa: speechmatics.com vendor documentation
- Cartesia Sonic and Ink-Whisper: cartesia.ai vendor documentation
- Rime AI: rime.ai vendor documentation