Guides

Cascaded Voice AI vs Speech-to-Speech: The 2026 Architecture Decision

Cascaded voice AI vs speech-to-speech in 2026: latency, eval depth, debug cost, model flexibility, and the architecture decision every voice team faces.

·
Updated
·
17 min read
voice-ai 2026 architecture speech-to-speech comparison
Editorial cover image for Cascaded Voice AI vs Speech-to-Speech: The 2026 Architecture Decision
Table of Contents

Every voice AI team in 2026 faces the same architecture call. Do you build a cascaded pipeline of STT plus LLM plus TTS, each running as its own service with its own span and its own vendor knob? Or do you adopt a speech-to-speech model that takes audio in and produces audio out in a single inference call? The trade is real and the answers don’t generalize. This guide is the decision matrix: latency, eval depth, debug cost, model flexibility, tool calling, and multi-modal ergonomics, with concrete provider names and the eval rubrics that score both paths.

TL;DR: capability snapshot

Cascaded (STT + LLM + TTS)Speech-to-Speech (S2S)
TTFA (warm)600-1200ms typical300-500ms typical
Eval surfacePer-stage spans, full rubric coverageAudio in + audio out; transcript via sampled pass
Debug costLow. Each fault has its own spanHigher. Black-box single call
Vendor flexibilityHot-swap STT, LLM, TTS independentlyLocked to one S2S provider
Tool callingMature (inherits LLM provider surface)Working, newer, edge cases still rough
Prosody / emotionLimited (TTS reconstructs from text)Tighter (model has full audio context)
Multi-modal in/outAdd a vision LLM in the middleAudio + text + image native in same call
Best fitProduction support, sales, healthcare, schedulingLive conversational demos, copilots, language practice

Most 2026 production voice agents still run cascaded. S2S is growing fast for short conversational use cases where latency dominates and the tool surface is shallow.

One axis worth calling out before the deep dive: backchanneling (the listener’s “uh-huh / mhm” signals) and dedicated turn-taking models are easier in S2S because the model holds the full audio context. Cascaded stacks need a separate turn-taking layer (Pipecat’s SmartTurnAnalyzer, LiveKit’s TurnDetector, Vapi’s endpointing) running alongside the LLM. Both paths can ship clean barge-in; the conversational naturalness ceiling is currently higher on S2S.

What “cascaded” means in 2026

Cascaded voice AI is a pipeline of three or four discrete models that pass intermediate text between them. The canonical flow:

  1. STT (speech-to-text). The audio stream is chunked, sent to a real-time STT provider (Deepgram Nova-3, AssemblyAI Universal-1, Cartesia Ink-Whisper), and a partial-then-final transcript comes back over WebSocket.
  2. LLM. The transcript is appended to a chat history and fed to an LLM (OpenAI, Anthropic, Google, Groq). Streaming tokens come back.
  3. TTS (text-to-speech). The LLM token stream is buffered into utterance chunks and sent to a TTS provider (ElevenLabs, Cartesia Sonic, OpenAI TTS, Rime, PlayHT). Audio comes back, streamed to the user.
  4. Tool calling. When the LLM emits a tool-call token, the orchestrator pauses TTS, runs the tool, appends the result, and resumes generation.

The orchestration glue is what platforms like Vapi, Retell AI, ElevenLabs Agents, LiveKit Agents, and Pipecat provide. Each platform has opinions about the partial-stream handoff, barge-in detection, and end-of-turn classification, but the underlying architecture is the same three-stage cascade.

The benefit is that every stage is a separable component. You can pick the STT vendor that wins on your audio profile, the LLM that wins on your reasoning task, and the TTS that wins on your brand voice. You see exactly where the call broke. You get a span tree with stage-attributed latency.

The cost is the latency floor. STT first-partial costs you 100-300ms. LLM time-to-first-token costs you 200-600ms. TTS time-to-first-audio costs you 75-300ms. Add network round-trips and orchestrator overhead. TTFA in the 600-1200ms range is typical, sub-500ms is achievable but requires aggressive partial-stream handoff and predictive TTS pre-roll.

What “speech-to-speech” means in 2026

Speech-to-speech is a single multi-modal model that ingests an audio stream and emits an audio stream directly. There is no intermediate text transcript on the hot path. The model holds prosody, pacing, emotion, and content in the same latent space.

Three S2S systems are production-grade in 2026:

  • OpenAI Realtime API (gpt-realtime). WebRTC and WebSocket transport. Function calling supported. Voice options include shimmer, alloy, echo, and others. Multi-modal in the same call across audio, text, and image.
  • Google Gemini Live API. Multi-modal native (audio, video/image, and text input). Tool calling supported. Strong on long-context conversation. Available via the Gemini API and Vertex AI surface.
  • Moshi (Kyutai). Open-weight and self-hostable; teams choosing it should plan to own production hardening, deployment, and evaluation.

The benefit of S2S is two-fold. First, latency. A single inference path skips the STT and TTS round-trips. TTFA in the 300-500ms range is typical on warm WebRTC connections. Second, prosody. The model hears how the user spoke (loud, fast, frustrated, whispered) and matches output prosody accordingly. Cascaded stacks reconstruct prosody from text, which is a strictly weaker signal.

The cost of S2S is the eval surface. There is no per-stage transcript on the hot path, so when a call fails, you cannot easily say which sub-component broke. You score the input audio and the output audio. You sample-decode the conversation into text for retrospective eval. The debug surface is smaller. Tool calling is newer and the edge cases (mid-turn barge-in during a tool call, structured output validation, multi-tool chains) are still being refined relative to the LLM tool-calling surface that’s been hardened over years.

The decision matrix: six axes

Six axes determine which architecture you pick.

Axis 1: latency

S2S wins. Cascaded loses. The numbers:

StageCascaded budgetS2S budget
STT first-partial100-300msnot applicable
LLM TTFT200-600msmerged into single model TTFA
TTS TTFA75-300msnot applicable
Network + orchestration50-200ms50-150ms
Total TTFA600-1200ms typical300-500ms typical

The cascaded budget can be compressed by streaming partials directly into the LLM (no end-of-utterance wait), running predictive TTS pre-roll on the first 4-8 tokens, and co-locating STT/LLM/TTS in the same region. Teams that ship sub-500ms cascaded voice (covered in the sub-500ms voice AI guide) usually pair Cartesia Ink-Whisper, Groq llama-3.3-70b, and Cartesia Sonic in the same region. It works but it requires every stage to cooperate.

S2S avoids the STT-to-LLM-to-TTS handoff, so it can reduce TTFA, but teams should benchmark both paths on their own region, providers, and turn-taking settings. The gap is real and the gap is durable for as long as cascaded stacks pay the inter-stage handoff cost.

Axis 2: eval depth

Cascaded wins. S2S loses, with caveats.

Cascaded gives you a span per stage. Each span has its own typed attributes and its own rubric:

  • STT span. Attributes: provider, model, first-partial latency, final latency, confidence, transcript text. Rubric: audio_transcription for WER-class scoring against ground truth.
  • LLM span. Attributes: provider, model, TTFT, tokens, tool calls, response text. Rubric: task_completion, conversation_coherence, conversation_resolution, domain-specific custom evaluators.
  • TTS span. Attributes: provider, voice ID, TTFA, audio duration, output audio URL. Rubric: audio_quality for prosody, clarity, pronunciation.
  • Tool spans. Attributes: tool name, arguments, response, latency. Rubric: argument validity, response correctness via custom evaluators.

S2S gives you one span. Input audio in, output audio out. You can run audio_transcription on a sampled transcript of the input to get an ASR view. You can run audio_quality on the output. You can run conversation_resolution on the multi-turn output. You can run any of the 70+ built-in rubrics in ai-evaluation on either side. What you cannot do is say “the LLM was right but the TTS broke” because there is no LLM stage to evaluate independently.

For most production use cases, the per-stage eval surface is the higher-value asset. You ship faster because you debug faster.

Axis 3: debug cost

Cascaded wins.

When a cascaded call fails, the span tree tells you which stage to investigate. STT confidence dropped to 0.4? Look at the audio. LLM emitted a refund the policy doesn’t support? Look at the prompt. TTS pronounced “Kaiser” as “Kai-ser”? Look at the TTS provider and the SSML.

When an S2S call fails, you have audio in and audio out. You score both. You may or may not catch the failure mode in the eval. You probably need to run a separate ASR pass to understand what the user actually said, and a separate eval pass to score what the model actually heard. The debug loop is slower.

S2S debug tooling will catch up. In 2026 it has not.

Axis 4: model swap flexibility

Cascaded wins decisively.

In a cascaded pipeline you can swap any stage independently. Use Deepgram for STT this quarter, AssemblyAI next quarter. Use Anthropic Sonnet for the LLM today, Groq llama-3.3-70b for cost optimization next month. Use ElevenLabs for the brand voice, Cartesia for the latency-critical path. The orchestration layer (Vapi, Retell, LiveKit, Pipecat) is built around this assumption.

In an S2S deployment you pick one provider and you live with their full stack. Moving from OpenAI Realtime to Gemini Live is not a code change, it is an architecture change. You re-evaluate the entire conversational experience because the model’s audio characteristics are different from the input handling all the way through the output prosody.

If your team is risk-averse to vendor lock-in, cascaded is the answer.

Axis 5: tool calling

Cascaded wins, but the gap is closing.

Cascaded tool calling inherits the LLM provider’s full surface. OpenAI function calling is mature. Anthropic tool use is mature. Both have multi-tool chains, strict JSON schema validation, parallel tool calls, and well-documented edge cases.

S2S tool calling is real but newer. Function calling is supported in OpenAI Realtime and Gemini Live. The edge cases are rougher. Partial tool arguments during a barge-in. Mid-turn tool retries. Structured output validation when the model is also producing audio. The patterns are stabilizing but they are not as battle-tested as the cascaded equivalent.

For agents that need 5+ tools with strict schema enforcement (booking, payments, healthcare intake, structured form filling), cascaded is the safer 2026 default. For agents with 1-3 tools and looser argument validation (search, lookup, simple state queries), S2S is fine.

Axis 6: multi-modal ergonomics

S2S wins.

If you need audio, text, and image in the same conversational turn (a user shows the agent a photo of their broken router and asks how to fix it), S2S handles it in a single call. GPT-4o Realtime and Gemini Live both accept image inputs alongside audio.

In a cascaded pipeline you bolt a vision LLM into the LLM stage, which means the image has to travel through a separate code path, get described in text, and feed back into the conversation context. It works, but the integration cost is real and the latency is worse than the S2S equivalent.

For voice agents that need to see what the user sees, S2S is the cleaner path.

Provider landscape in 2026

Cascaded orchestrators. Common cascaded orchestrators include Vapi, Retell AI, ElevenLabs Agents, LiveKit Agents, and Pipecat. Vapi focuses on phone-capable voice agents; Retell exposes per-call latency breakdowns across ASR, LLM, TTS, and S2S paths; LiveKit and Pipecat are open-source framework options. ElevenLabs Conversational AI (Agents) is built around the ElevenLabs TTS family with native voice cloning. LiveKit Agents and Pipecat expose dedicated traceAI integrations via traceai-livekit and traceAI-pipecat.

Speech-to-speech models. Notable S2S options include OpenAI Realtime, Gemini Live, and Moshi. OpenAI documents realtime text/audio with image input; Gemini Live documents low-latency voice/video interactions and tool use; Moshi is open-weight/self-hostable.

Two worked decisions

Cascaded pick: healthcare scheduling voice agent

A US-based healthcare scheduling voice agent. 50K calls per month, 5 minutes average. HIPAA-bound PHI. Six tools (appointment lookup, slot search, patient verification, insurance check, confirmation, escalation). Indian English, Mexican Spanish, and Caribbean English accents.

Architecture pick: cascaded. Six tools with strict argument schemas need the mature LLM tool-calling surface. Per-stage rubrics (audio_transcription for STT on accent cohorts, task_completion for the LLM, audio_quality for TTS pronunciation of medication names) give the debug signal regulated workloads require. AssemblyAI Universal-1 for STT, Anthropic Sonnet for the LLM, ElevenLabs for TTS. Each stage swappable. Agent Command Center hosts the eval and observability layer with RBAC; SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page. traceAI instruments via the traceai-livekit package. Future AGI Protect runs sub-100ms inline per arXiv 2510.13351 on the transcript stream scanning for PII leaks, prompt injection, and tone violations. ProtectFlash runs the single-call binary path on latency-sensitive turns. A well-tuned cascaded stack can target sub-second TTFA. Each major stage can be scored with a targeted rubric.

S2S pick: language-practice voice copilot

A consumer language-practice app. 200K minutes per day, 3 minutes per session. The job is conversational practice with gentle grammar correction. Latency dominates UX. Tool calling is minimal (one vocabulary lookup). Conversation prosody is the product because the app teaches pronunciation.

Architecture pick: speech-to-speech. TTFA below 500ms is the experience differentiator and S2S clears it without aggressive tuning. The model hears the learner’s pronunciation and matches output prosody for correction. One tool with loose argument validation is well within the 2026 S2S surface. traceAI instruments the voice stack through packages like traceai-livekit and traceAI-pipecat; MLLMAudio is the ai-evaluation test-case wrapper for scoring recorded audio, not a telemetry capture API. MLLMAudio accepts 7 formats: .mp3, .wav, .ogg, .m4a, .aac, .flac, .wma. The same ai-evaluation rubrics run on the audio: audio_transcription on a sampled transcript pass for input recognition, audio_quality for output prosody, conversation_coherence for multi-turn flow. Future AGI Protect runs on the output audio (multi-modal Protect catches NSFW or violent synthesized speech before it reaches a learner). The target is sub-500ms TTFA, validated on the team’s own provider and region. The debug surface is smaller but the use case tolerates it.

Hybrid: when both architectures coexist

A pragmatic 2026 pattern is to run both architectures in the same product, routed by use case. A SaaS support product might run cascaded for the main agent (tool-rich, eval-deep) and S2S for a voice-copilot demo on the marketing site (latency-critical, tool-light). The shared infra is the eval and observability layer. traceAI handles both span shapes; 30+ documented integrations cover the cascaded orchestrators (traceAI-pipecat, traceai-livekit) and the LLM providers that power both architectures. ai-evaluation ships the same 70+ built-in rubrics for both. The eval engine doesn’t care whether the audio came from a cascaded TTS or an S2S model. You compare cascaded performance to S2S performance on the same rubrics, with the same labeled holdouts, and ship the architecture that wins on your audio profile.

Code: scoring either architecture in ai-evaluation

The same MLLMTestCase pattern works for both cascaded and S2S outputs:

from fi.testcases import MLLMTestCase, MLLMAudio
from fi.evals import Evaluator, AudioTranscriptionEvaluator

# Works for both cascaded TTS output and S2S model output
audio = MLLMAudio(url="path/to/output_audio.wav")
test_case = MLLMTestCase(
    input=audio,
    response="hypothesis transcript from ASR or cascaded STT span",
    expected_response="ground-truth transcript",
    query="Score the transcript quality",
)

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[AudioTranscriptionEvaluator()],
    inputs=[test_case],
)

For multi-turn conversation scoring (works on both cascaded LLM output and S2S audio decoded to text):

from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals import Evaluator, ConversationCoherence, ConversationResolution

conv = ConversationalTestCase(messages=[
    LLMTestCase(query="Hi, I need to reschedule my appointment", response="..."),
    LLMTestCase(query="It's Tuesday at 3pm", response="..."),
])

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[ConversationCoherence(), ConversationResolution()],
    inputs=[conv],
)

For cascaded telemetry instrumentation via LiveKit Agents:

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_livekit import enable_http_attribute_mapping

register(
    project_name="LiveKit Agent",
    project_type=ProjectType.OBSERVE,
    set_global_tracer_provider=True,
)
enable_http_attribute_mapping()

For inline guardrails on either architecture:

from fi.evals import Protect

p = Protect()
out = p.protect(
    inputs=test_case,
    protect_rules=[
        {"metric": "content_moderation"},
        {"metric": "security"},
        {"metric": "data_privacy_compliance"},
    ],
)
# For latency-critical paths, the single-call binary surface:
out = p.protect(inputs=test_case)

The eval and telemetry surface is identical across cascaded and S2S. Pick the architecture that wins on your use case; the instrumentation layer is the same.

Calibrated comparison: where each architecture wins

Honest summary of the 2026 state of the art:

Cascaded wins:

  • Per-stage debug. The span tree tells you which stage broke.
  • Vendor flexibility. Swap STT, LLM, or TTS independently.
  • Tool calling depth. Inherits the LLM provider’s mature tool surface.
  • Eval depth. Six rubrics on six spans is strictly more signal than two rubrics on one span.
  • Production maturity. Most 2026 production voice agents are cascaded.

S2S wins:

  • Latency. 300-500ms TTFA is the typical S2S floor; cascaded fights to clear 500ms.
  • Prosody. The model has the full audio context end to end; no reconstruction from text.
  • Multi-modal in/out. Audio, text, and image in the same call with no plumbing.
  • Conversational naturalness on short turns. The model preserves pacing, hesitation, and emotion.

Future AGI is not in this comparison because Future AGI is the eval and observability layer that supports both. We ship the rubrics, the telemetry, and the cluster surface that lets you pick the right architecture for your audio profile rather than guessing from a public benchmark.

Span attributes that matter for either architecture

For cascaded systems, capture provider/model/latency per STT, LLM, TTS, and tool span. For S2S systems, capture input/output audio, transcript if available, tool events, and end-to-end latency. Keep the eval schema shared even when telemetry shape differs.

Future AGI on the cascaded vs S2S choice

traceAI captures cascaded voice pipelines through supported framework instrumentors; S2S observability should be described as supported through the instrumented orchestrator or native voice observability, depending on provider. 30+ documented integrations including dedicated traceAI-pipecat and traceai-livekit packages for the voice orchestrators. OpenInference-compatible. Apache 2.0.

ai-evaluation ships 70+ built-in eval templates including audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion, and caption_hallucination for multi-modal output. Multilingual rubrics include translation_accuracy and cultural_sensitivity. MLLMAudio accepts 7 audio formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma) directly from URL or local path. ConversationalTestCase covers multi-turn evaluation for either architecture. Custom evaluators authored by an in-product agent. Apache 2.0.

Future AGI Protect is the inline guardrail layer. Gemma 3n foundation with LoRA-trained category-specific adapters per arXiv 2510.13351. Multi-modal across text, image, and audio. ProtectFlash is the single-call binary harmful/not-harmful path for latency-sensitive guardrails. Sub-100ms inline scanning fits inside either architecture’s voice budget. Four documented safety dimensions: Content Moderation, Bias Detection, Security, Data Privacy Compliance.

agent-opt ships six optimizers for voice-agent prompts: Bayesian Search, Meta-Prompt (per arXiv 2505.09666), ProTeGi (Prompt optimization with Textual Gradients), GEPA Genetic-Pareto (per arXiv 2507.19457), Random Search baseline (per arXiv 2311.09569), and PromptWizard. Both UI-driven (inside the Dataset view, point a run at a dataset, pick an evaluator, pick an optimizer) and SDK-driven via the agent-opt Python library expose the same six. Move from observing failures in cascaded or S2S traces to testing prompt revisions against labeled scenarios drawn from the same eval set used for calls.

Error Feed auto-clusters voice call failures into named issues with auto-written root cause and quick fix. STT mis-transcriptions, LLM hallucinations, TTS pronunciation drift, and S2S prosody anomalies each surface as their own cluster instead of drowning in raw spans.

Agent Command Center hosts the stack with RBAC, SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page. AWS Marketplace, multi-region hosted, BYOC for regulated workloads. Native voice observability for Vapi, Retell, and LiveKit via provider API key plus Assistant ID, no SDK required. Auto-captured call logs include transcripts and downloadable assistant/customer recordings where supported by the configured provider.

For simulation: Simulate ships 18 pre-built personas plus unlimited custom with controls for accent, background noise, and multilingual. Auto-generated branching scenarios let you specify 20, 50, or 100 row counts; the platform generates conversation paths, personas, situations, and outcomes. The four-step Run Tests wizard scores each call with the eval rubric set you pick. Error Localization pinpoints the failing turn in Simulate.

Either architecture works with Future AGI. The instrumentation surface is the same. The eval rubric set is the same. The compliance posture is the same.

Architecture-level realities the eval layer absorbs

These are real properties of the two architectures, not gaps in the observability surface. FAGI’s eval and trace stack covers both shapes identically.

S2S tool-calling edge cases are still rough at the model layer. Mid-turn barge-in during a tool call, parallel tool execution, and strict JSON schema validation in S2S are workable but newer than the cascaded equivalent. For agents that need 5+ tools with strict schema enforcement, cascaded is the safer 2026 default. FAGI captures both shapes through dedicated voice instrumentors (traceAI-pipecat, traceai-livekit) and scores tool-calling correctness on either architecture with the evaluate_function_calling template.

Cascaded latency floors are hard to push below 500ms. Co-locating STT, LLM, and TTS in the same region, streaming partials directly to the LLM, and predictive TTS pre-roll on the first tokens can clear sub-500ms, but every stage has to cooperate. S2S clears it without effort. FAGI’s per-stage span attributes (gen_ai.voice.latency.transcriber_avg_ms, gen_ai.voice.latency.voice_avg_ms, gen_ai.voice.latency.turn_avg_ms, gen_ai.voice.latency.ttfb_ms) attribute every millisecond to the right component on cascaded; for S2S the same trace surface captures the single audio-in-audio-out span with its tool-call children.

Eval ground truth for S2S requires a sampled transcript pass. The hot path has no text. To run WER-class scoring you decode a sample of audio to transcript and run the rubric. The two-pass pattern (real-time S2S, post-call sampled transcript eval) is the right answer; FAGI’s eval engine accepts the sampled transcript or the raw audio (MLLMAudio wraps the audio test case for direct scoring) and runs the same audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion templates on either architecture.

Sources and references

  • Future AGI Protect: arXiv 2510.13351
  • GEPA optimizer: arXiv 2507.19457
  • Meta-Prompt optimizer: arXiv 2505.09666
  • Random Search baseline: arXiv 2311.09569
  • OpenInference span specification: github.com/Arize-ai/openinference
  • Future AGI trust and compliance: futureagi.com/trust
  • OpenAI Realtime API documentation: openai.com vendor docs
  • Google Gemini Live API documentation: ai.google.dev vendor docs
  • Moshi (Kyutai) open-weight S2S model: kyutai.org vendor docs
  • Vapi, Retell AI, ElevenLabs Agents, LiveKit Agents, Pipecat: respective vendor docs
  • Deepgram Nova-3, AssemblyAI Universal-1, Cartesia Ink-Whisper: respective vendor docs

Frequently asked questions

What's the difference between cascaded voice AI and speech-to-speech?
Cascaded voice AI is a pipeline of separate models: STT transcribes audio, an LLM reasons over text, TTS speaks the reply. Each stage produces its own span, latency, and eval surface. Speech-to-speech (S2S) is a single multi-modal model that ingests audio and emits audio directly, with no intermediate text transcript. Cascaded gives you per-stage observability and the ability to swap any vendor. S2S gives you lower latency and tighter prosody because the model has the full audio context end to end.
Which is faster: cascaded or speech-to-speech voice AI?
Speech-to-speech wins on raw time-to-first-audio. A cascaded pipeline accumulates STT latency (100-300ms first-partial), LLM time-to-first-token (200-600ms), TTS time-to-first-audio (75-300ms), plus network and orchestration overhead, often landing between 600ms and 1200ms TTFA. S2S models like OpenAI Realtime and Gemini Live run a single inference path and typically hit 300-500ms TTFA on warm connections. The latency win is real, but it shrinks as cascaded providers tune partial-stream handoff and TTS pre-roll.
When should you pick cascaded over speech-to-speech in 2026?
Pick cascaded when you need per-stage debug, vendor flexibility, mature tool calling, or strict eval coverage. Most production voice agents (support, sales, healthcare intake, scheduling) still run cascaded because the failure modes are easier to localize. STT mistranscribed a name. The LLM hallucinated a refund policy. TTS pronounced the medication wrong. Each fault has its own span, its own rubric, and its own fix. S2S black-boxes all of that into a single audio-in-audio-out call.
When should you pick speech-to-speech over cascaded?
Pick S2S when latency dominates the experience and the use case is conversational rather than transactional. Live-conversation demos, voice copilots, language-practice apps, and short Q&A bots are good fits. The model preserves prosody, emotion, and pacing across turns because it never round-trips to text. Trade-offs: tool calling is newer and the edge cases are still being refined, you can't hot-swap the STT or TTS vendor, and debugging a failure means scoring the input audio and output audio rather than inspecting a span tree.
How do you evaluate a speech-to-speech model when there's no intermediate transcript?
Score the input audio and the output audio directly. In Future AGI ai-evaluation, the audio_transcription rubric scores any ASR view of the input or output against ground truth. The audio_quality rubric scores output prosody, clarity, and pronunciation. The conversation_resolution and conversation_coherence rubrics score multi-turn outcomes. For S2S, the eval surface is the audio itself plus a sampled transcript pass; for cascaded you also get per-span LLM and tool eval. MLLMAudio accepts seven audio formats including .mp3, .wav, .ogg, .m4a, .aac, .flac, and .wma.
Does tool calling work in speech-to-speech models?
Yes, but the tooling is newer than in cascaded stacks. OpenAI Realtime and Gemini Live both support function calling, and the patterns are converging on the standard JSON schema for tool definitions. The maturity gap shows up in edge cases: partial tool arguments during a barge-in, mid-turn tool retries, and structured output validation. Cascaded stacks inherit the LLM provider's full tool-calling surface (which is now several years old). For agents that need 5+ tools with strict schema validation, cascaded is the safer 2026 default.
How does Future AGI cover both architectures?
traceAI captures per-stage spans for cascaded stacks (STT, LLM, TTS, tool calls), with dedicated traceAI-pipecat and traceai-livekit packages plus OpenInference-compatible spans for the underlying LLM provider. For speech-to-speech, the same trace surface captures the single audio-in-audio-out call with its tool-call children. The eval engine runs the same 70+ built-in rubrics on either architecture: audio_transcription scores input or output transcripts, audio_quality scores synthesized speech, conversation_resolution scores multi-turn outcomes. Future AGI Protect runs sub-100ms inline on either pipeline.
Related Articles
View all