Multi-Agent Voice Systems in 2026: State Transitions, Hand-offs, Eval Boundaries
How to architect multi-agent voice systems in 2026: state transitions, hand-off prompt design, per-agent vs end-to-end evals, latency budgets, failure attribution.
Most voice agent stacks start as one big prompt and one big LLM call per turn. The day a single prompt has to handle billing, technical support, and retention with three different tones and three different tool sets, the prompt cracks. Multi-agent voice systems split that workload. The architecture is harder to debug, the failure modes are subtler, and the eval boundary moves from “did the agent answer well” to “did the chain hand off well.” This guide walks through the patterns, the state-preservation problem, the eval boundary, and how to instrument the whole thing.
TL;DR step preview
- Pick the pattern that fits your call flow: triage to specialist, supervisor plus workers, parallel sub-agents, or sequential pipeline.
- Preserve three layers of state across every hand-off: full transcript, structured slots, and a hand-off context note. Skip any one and the customer repeats themselves.
- Score evals at two layers. Per-agent task_completion tells you each stage’s job got done. End-to-end conversation_resolution and conversation_coherence tell you the call worked as a whole.
- Allocate latency budgets per agent and verify in production. A 1-second per-turn budget might split 200ms triage + 600ms specialist + 200ms closer; parallel sub-agents share the same clock.
- Attribute failures by walking back through per-agent scores until the first stage drops. The transition turns are the highest-leverage debug points.
- Instrument the whole chain as one trace tree: parent span = session, child spans = each agent, sub-spans = turns. Error Feed clusters production failures by which agent caused them.
The rest of the guide unpacks each pattern, the state-preservation contract, the eval boundary, and the observability layer that closes the loop.
Why multi-agent voice systems exist
A single voice agent works when one prompt can handle the full call. The moment the call branches across domains (billing rules, technical troubleshooting, retention offers, scheduling), a monolithic prompt starts dropping branches. You see it in the eval scores first: task_completion holds on simple intents and collapses on the complex ones. You see it in customer transcripts second: long pauses, hedged responses, the agent answering the wrong question.
The fix is decomposition. Split the call across specialized agents. Each agent owns a narrower job: one prompt, one tool set, one tone, one set of eval rubrics. The trade is real. Hand-offs add latency, can lose state, and require eval at the seam, not just inside each stage. Done well, the multi-agent stack can outperform a monolithic prompt on complex routed calls. Done poorly, the customer feels every hand-off and the engineering team can’t tell which agent dropped the ball.
The architectural concerns from this point on are the same whether your runtime is LiveKit, Pipecat, Vapi, or a custom orchestration layer. The runtimes differ in syntax; the eval boundary, state contract, and observability needs are the same.
Pattern 1: Triage to specialist
The most common multi-agent voice pattern. A front-line triage agent answers the call, identifies intent in a few turns, then hands off to a specialist (billing, technical, retention, scheduling) who completes the resolution.
       Caller
          |
          v
+-------------------+     intent + slots + tone
|   Triage agent    | --------------------------------+
| (intent routing)  |                                 |
+-------------------+                                 v
                                           +----------+----------+
                                           | Billing specialist  |
                                           |         or          |
                                           |      Technical      |
                                           |         or          |
                                           |      Retention      |
                                           +----------+----------+
                                                      |
                                                      v
                                                Resolution +
                                              optional closing
The triage agent’s job is narrow: greet the caller, identify the intent in 1-3 turns, extract structured slots (account number, issue category, urgency), and hand off. It should not attempt resolution. The temptation to “answer the easy ones in triage” leaks complexity back into the triage prompt and you’re back to a monolith.
The specialist’s job is deep: full tool access for its domain, longer context window, domain-specific eval rubrics. The specialist trusts the triage agent’s classification but verifies one critical slot (account number, customer identity) before taking any action.
Failure mode: triage misclassifies intent. The specialist either says “I can’t help with that, let me transfer you back” (correct but costly) or attempts resolution outside its domain (incorrect and worse). Score triage with task_completion on classification accuracy. Score the specialist with task_completion on resolution. Score the whole call with conversation_resolution.
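The triage contract described above can be sketched in a few lines. This is a minimal illustration, not any runtime's API: `TriageResult`, `SPECIALISTS`, `route`, and `verify_critical_slot` are hypothetical names, and the 0.7 confidence floor is an assumed value you would tune against your own routing-accuracy evals.

```python
from dataclasses import dataclass, field

# Hypothetical specialist registry; the names are illustrative only.
SPECIALISTS = {"billing", "technical", "retention", "scheduling"}

@dataclass
class TriageResult:
    intent: str                                 # classified intent, e.g. "billing"
    confidence: float                           # classifier confidence in [0, 1]
    slots: dict = field(default_factory=dict)   # structured slots extracted so far

def route(result: TriageResult, min_confidence: float = 0.7) -> str:
    """Return the specialist to hand off to, or 'triage' to ask one more question."""
    if result.intent not in SPECIALISTS:
        return "triage"   # unknown intent: keep clarifying, never resolve in triage
    if result.confidence < min_confidence:
        return "triage"   # a clarifying turn is cheaper than a misroute
    return result.intent

def verify_critical_slot(slots: dict, key: str = "account_number") -> bool:
    """Specialist-side check: the critical slot must be present and non-empty."""
    value = slots.get(key)
    return value is not None and bool(str(value).strip())
```

Keeping `route` free of any resolution logic is the point of the sketch: the only outcomes are "hand off" or "ask again", which keeps the triage prompt narrow.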
Pattern 2: Supervisor and workers
A supervisor agent decomposes the call goal into sub-tasks, dispatches them to N specialist workers, and aggregates the results before responding to the caller. Useful when the caller’s request fans out (e.g., “schedule three appointments at three different clinics and confirm insurance for each”).
       Caller
          |
          v
+-------------------+
| Supervisor agent  |---+----+----+
|   (decompose +    |   |    |    |
|     dispatch)     |   v    v    v
+-------------------+   W1   W2   W3   (workers run concurrently or sequentially)
          ^             |    |    |
          |             +----+----+
          |                  |
          +------------------+   (aggregate)
          |
          v
+-------------------+
|     Response      |
+-------------------+
The supervisor maintains the caller-facing conversation. Workers don’t talk to the caller; they execute their sub-task and return a structured result. The supervisor decides whether to surface partial progress (“got the first appointment, working on the second”) or wait for the whole bundle.
Failure mode: workers return inconsistent results (one books for Tuesday, one for Wednesday, the supervisor doesn’t notice). Score each worker with task_completion on its sub-task. Score the supervisor with a custom eval rubric for aggregation correctness. The end-to-end conversation_resolution covers whether the caller’s actual goal was met.
This pattern fits text agents better than voice in 2026 because voice is latency-sensitive and dispatching N workers in serial blows the per-turn budget. Use the parallel variant if you go this route on voice.
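A minimal sketch of the parallel variant, using only the standard library. The worker `book_appointment` is a stand-in that simulates a tool call, and `find_mismatches` is a hypothetical aggregation check; neither comes from a specific framework.

```python
import asyncio

async def book_appointment(clinic: str, day: str) -> dict:
    """Stand-in worker; a real one would call a scheduling tool."""
    await asyncio.sleep(0.01)  # simulated tool latency
    return {"clinic": clinic, "day": day, "status": "booked"}

def find_mismatches(requests: list, results: list) -> list:
    """Return clinics whose result does not match what the supervisor asked for."""
    return [
        clinic
        for (clinic, wanted_day), result in zip(requests, results)
        if result["day"] != wanted_day or result["status"] != "booked"
    ]

async def supervise(requests: list) -> dict:
    """Dispatch every worker concurrently, then verify before answering the caller."""
    # gather() returns results in the same order as the requests.
    results = await asyncio.gather(*(book_appointment(c, d) for c, d in requests))
    mismatches = find_mismatches(requests, results)
    return {"results": list(results), "consistent": not mismatches, "mismatches": mismatches}

requests = [("North", "Tue"), ("South", "Tue"), ("East", "Tue")]
summary = asyncio.run(supervise(requests))
```

The consistency check runs before the supervisor speaks, which is where the "one books Tuesday, one books Wednesday" failure would otherwise slip through.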
Pattern 3: Parallel sub-agents
Multiple agents work concurrently on the same input. The classic voice example: transcription, sentiment, and summarization all run on the streaming audio in parallel. The orchestration layer aggregates the outputs and feeds them to the main response-generating agent.
                    +---> Transcription agent ---+
                    |                            |
Caller turn --------+---> Sentiment agent -------+---> Aggregator ---> Main agent ---> Response
                    |                            |
                    +---> Summarization agent ---+
Latency in this pattern is bounded by the slowest sub-agent, not the sum. A 200ms transcription + 150ms sentiment + 180ms summarization runs in roughly 200ms wall clock, not 530ms. That’s the whole point.
Failure mode: one sub-agent crashes or stalls and the aggregator either blocks waiting or proceeds without the missing input. Decide explicitly per sub-agent whether it’s required for the main agent to respond. Sentiment is usually nice-to-have (proceed without it). Transcription is usually required (block or fail the turn). Score each sub-agent independently. Score the aggregator’s behavior under partial input with a custom rubric.
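The required-versus-optional decision can be made explicit in the aggregator. The sketch below assumes a simple deadline per turn; `run_sub_agent` and `fan_out` are hypothetical names, and the delays are simulated rather than measured.

```python
import asyncio

async def run_sub_agent(name: str, delay_s: float) -> str:
    """Stand-in sub-agent; the delay simulates processing time."""
    await asyncio.sleep(delay_s)
    return f"{name}-output"

async def fan_out(sub_agents: dict, required: set, timeout_s: float) -> dict:
    """Run sub-agents concurrently; raise only if a required one misses the deadline."""
    tasks = {
        name: asyncio.create_task(run_sub_agent(name, delay_s))
        for name, delay_s in sub_agents.items()
    }
    done, pending = await asyncio.wait(tasks.values(), timeout=timeout_s)
    for task in pending:
        task.cancel()  # stalled sub-agents must not leak past the turn
    await asyncio.gather(*pending, return_exceptions=True)
    outputs = {name: task.result() for name, task in tasks.items() if task in done}
    missing = required - set(outputs)
    if missing:
        raise TimeoutError(f"required sub-agents missed the deadline: {sorted(missing)}")
    return outputs

# Sentiment is optional and too slow here, so the turn proceeds without it.
outputs = asyncio.run(
    fan_out({"transcription": 0.02, "sentiment": 0.5}, {"transcription"}, timeout_s=0.2)
)
```

Here the turn proceeds with transcription only. If transcription itself missed the deadline, `fan_out` would raise and the orchestration layer could fail or retry the turn.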
Pattern 4: Sequential pipeline
Agent A processes, hands off state to B, which hands off to C. Each stage transforms the input for the next. Useful when the stages are genuinely sequential (extract intent first, then look up policy, then generate response in policy-compliant tone).
Caller turn --> Intent extractor --> Policy looker-upper --> Response generator --> TTS
                (cheap LLM call)     (tool span: vector DB)  (richer LLM call)
Latency adds up. A 100ms + 150ms + 400ms pipeline runs in 650ms wall clock. The advantage is that each stage uses the right model for the job: a cheap small model for intent, a tool call for policy lookup, a stronger model for response. Total cost can be lower than one big call to a frontier model.
Failure mode: a stage in the middle returns an unexpected shape and breaks the next stage. Score each stage independently. The transition between stages is the failure-prone seam, not the stages themselves.
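One way to guard the seam is to check each stage's output shape before the next stage runs. This is a sketch under simple assumptions: stages exchange plain dicts, and `run_pipeline`, `StageContractError`, and the three stand-in stages are hypothetical.

```python
from typing import Callable

class StageContractError(Exception):
    """Raised when a stage returns a shape the next stage cannot consume."""

def run_pipeline(turn: dict, stages: list) -> dict:
    """Run (name, fn, required_keys) stages in order, validating each output."""
    state = dict(turn)
    for name, stage_fn, required_keys in stages:
        output = stage_fn(state)
        if not isinstance(output, dict):
            raise StageContractError(f"{name} returned {type(output).__name__}, expected dict")
        missing = set(required_keys) - set(output)
        if missing:
            # Fail at the seam and name the stage, instead of crashing one stage later.
            raise StageContractError(f"{name} output missing {sorted(missing)}")
        state.update(output)
    return state

# Stand-in stages; real ones would be an LLM call, a vector lookup, and a second LLM call.
def extract_intent(state: dict) -> dict:
    return {"intent": "refund_policy"}

def look_up_policy(state: dict) -> dict:
    return {"policy": f"policy-for-{state['intent']}"}

def generate_response(state: dict) -> dict:
    return {"response": f"Per {state['policy']}, here is what we can do."}

final = run_pipeline({"text": "Can I get a refund?"}, [
    ("intent_extractor", extract_intent, {"intent"}),
    ("policy_lookup", look_up_policy, {"policy"}),
    ("response_generator", generate_response, {"response"}),
])
```

Because the error names the stage that broke the contract, it can be recorded against that stage's span rather than surfacing as a generic failure downstream.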
State preservation across hand-offs
The single biggest determinant of multi-agent voice quality. Three layers move with the customer.
Layer 1: Conversation history
The full transcript up to the hand-off point. The specialist needs to see what the customer actually said, not a summary. Summaries lose tone, lose hedged commitments, lose the exact phrasing that the customer used to describe the issue. Pass the raw turns; let the specialist’s prompt decide how much to attend to.
For long calls, the practical limit is context window. A 30-turn call may not fit comfortably in a specialist’s prompt. The compression strategy: keep the last N turns verbatim, plus a structured summary of earlier turns. Don’t lose the verbatim recent context.
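The compression strategy can be sketched as a small helper. `compress_history` and `naive_summary` are hypothetical names; the placeholder summarizer only counts turns, where a real system would more likely call an LLM.

```python
def compress_history(turns: list, keep_last: int, summarize) -> dict:
    """Keep the last `keep_last` turns verbatim and summarize everything earlier."""
    if keep_last <= 0:
        raise ValueError("keep_last must be positive")
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return {
        "summary": summarize(older) if older else "",
        "recent_turns": recent,  # passed through untouched
    }

def naive_summary(turns: list) -> str:
    """Placeholder summarizer; swap in an LLM call or structured extraction."""
    return f"{len(turns)} earlier turns; first customer message: {turns[0]['text']!r}"

# A 30-turn call compressed to a summary plus the last 8 turns verbatim.
turns = [{"role": "customer", "text": f"message {i}"} for i in range(30)]
packed = compress_history(turns, keep_last=8, summarize=naive_summary)
```

Short calls pass through unchanged: when there are fewer turns than `keep_last`, the summary is empty and every turn stays verbatim.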
Layer 2: Structured slot state
Extracted entities: customer_id, account_type, issue_category, urgency, prior_eval_scores. Triage’s job is to fill these slots, the specialist’s job is to trust them (with a verification step on the critical ones).
Schema discipline matters. Define the slot shape upfront. If triage extracts customer_id: "12345" and the specialist expects customerId: 12345, you have a contract bug. The fix is a shared schema definition in code that both agents read.
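A shared schema can be as small as one dataclass that both agents import. The sketch below is an assumption about shape, not a prescribed format: `HandoffSlots` and `from_raw` are hypothetical, and normalizing every value to a string is one possible policy for avoiding the string-versus-int mismatch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HandoffSlots:
    """Single source of truth for slot names and types; imported by both agents."""
    customer_id: str
    account_type: str
    issue_category: str
    urgency: str

    @classmethod
    def from_raw(cls, raw: dict) -> "HandoffSlots":
        expected = set(cls.__dataclass_fields__)
        unknown, missing = set(raw) - expected, expected - set(raw)
        if unknown or missing:
            raise ValueError(
                f"slot contract violation: unknown={sorted(unknown)} missing={sorted(missing)}"
            )
        # Normalize to strings so "12345" and 12345 cannot diverge between agents.
        return cls(**{key: str(value) for key, value in raw.items()})

slots = HandoffSlots.from_raw(
    {"customer_id": 12345, "account_type": "business",
     "issue_category": "billing", "urgency": "high"}
)
```

A payload keyed `customerId` is rejected at the hand-off with a message naming the offending key, rather than surfacing later as a confused specialist turn.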
Layer 3: Hand-off context note
A short instruction the triage agent writes for the specialist. Format varies, but the content is consistent:
- Why the transfer is happening (intent classification + confidence).
- Customer’s tone and emotional state.
- Anything triage promised the customer (e.g., “I told them a specialist would get them a refund”).
- Anything triage tried that didn’t work (failed tool calls, ambiguous responses).
The hand-off note is what makes the specialist feel like a colleague who got a real warm transfer instead of a stranger who got a ticket dropped on their desk. Skip it and the specialist starts cold; include it and the specialist hits the ground running.
In LiveKit’s Agents framework, the hand-off carries explicit state via the transfer event. In Pipecat, the processor pattern lets you compose state-passing between specialist processors. In the OpenAI Agents SDK, each agent declares its targets in a handoffs list, and the handoff() helper can attach a structured input payload to the transfer. In Vapi, the squad routing passes state via the dispatch context. Whichever runtime you use, structure the three layers explicitly.
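Independent of runtime, the three layers can travel as one explicit payload. The following is a sketch with hypothetical names (`HandoffNote`, `HandoffPayload`, `missing_layers`); the fields mirror the hand-off note contents listed above and are not any framework's schema.

```python
from dataclasses import dataclass, field

@dataclass
class HandoffNote:
    reason: str                     # why the transfer is happening
    intent_confidence: float        # triage's confidence in its classification
    customer_tone: str              # e.g. "frustrated", "calm"
    promises: list = field(default_factory=list)         # what triage told the customer
    failed_attempts: list = field(default_factory=list)  # what triage tried that failed

@dataclass
class HandoffPayload:
    transcript: list                # layer 1: raw turns
    slots: dict                     # layer 2: structured slots
    note: HandoffNote               # layer 3: context note

    def missing_layers(self) -> list:
        """Name any empty layer so an incomplete hand-off can be rejected early."""
        missing = []
        if not self.transcript:
            missing.append("transcript")
        if not self.slots:
            missing.append("slots")
        if not self.note.reason.strip():
            missing.append("note")
        return missing

payload = HandoffPayload(
    transcript=[{"role": "customer", "text": "I was charged twice this month"}],
    slots={"customer_id": "12345", "issue_category": "billing"},
    note=HandoffNote(
        reason="duplicate charge, billing intent",
        intent_confidence=0.93,
        customer_tone="frustrated",
        promises=["a specialist will review the duplicate charge"],
    ),
)
```

Calling `missing_layers()` at the transfer point gives the orchestration layer a cheap gate: an empty list means all three layers are present.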
The eval boundary: per-agent versus end-to-end
A monolithic voice agent has one eval boundary: the call. A multi-agent voice system has N+1 boundaries: one per agent, plus the end-to-end call.
Per-agent eval
Each agent gets scored on its own job. The natural rubric is task_completion configured per stage:
- Triage task_completion: did the triage agent correctly identify intent, fill required slots, and hand off?
- Specialist task_completion: did the specialist resolve the issue within its domain?
- Closer task_completion: did the closing agent confirm next steps and end the call cleanly?
Per-agent eval tells you exactly which stage dropped the ball. Without it, you know the call failed but you can’t attribute blame to a stage.
End-to-end eval
The whole call gets scored as one trace. The rubrics that matter at this layer:
- conversation_resolution: did the call achieve the customer’s actual goal?
- conversation_coherence: did the conversation hang together as one continuous interaction, despite the agent transitions?
- Custom CSAT proxy rubrics: tone consistency across hand-offs, customer-repeat-rate (did the customer have to repeat themselves after a transfer?).
End-to-end eval is what the business cares about. The customer doesn’t know there were three agents; they know whether their problem got solved. Score both layers. They tell you different things.
Multi-eval-template pattern
For a multi-agent call, you attach multiple eval rubrics to the same trace:
from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals import Evaluator, ConversationCoherence, ConversationResolution

# One test case carries the whole dialog, triage and specialist turns included.
conv = ConversationalTestCase(messages=[
    LLMTestCase(query="Hi, I need help with my bill", response="I can route you. What's the issue?"),
    LLMTestCase(query="I was charged twice this month", response="Got it. Transferring to billing."),
    LLMTestCase(query="(billing specialist) Let me pull up your account", response="I see the duplicate charge. Refunding now."),
    LLMTestCase(query="Thanks", response="You're welcome. Anything else?"),
])

# Credentials are elided here; supply your own keys.
ev = Evaluator(fi_api_key=..., fi_secret_key=...)

# Attach both end-to-end rubrics to the same conversation.
result = ev.evaluate(
    eval_templates=[ConversationCoherence(), ConversationResolution()],
    inputs=[conv],
)
The same ConversationalTestCase carries the whole multi-agent dialog. Multiple eval templates score it at the end-to-end layer. Per-agent scoring happens on each stage’s individual turns, attached to the per-agent spans.
Latency budget allocation
Voice is latency-sensitive. A multi-agent call adds hand-off overhead on top of each stage’s own latency. Budgeting matters.
Sum-the-stages pattern
For sequential patterns (triage → specialist → closer), the per-turn budget sums:
Triage classification turn:            200ms
Hand-off plus state pass:               50ms
Specialist response generation:        600ms
Closing turn:                          150ms
--------------------------------------------
Total per-call critical-path budget:  1000ms (1s)
Each agent owns its slice. If the specialist’s first response routinely runs 800ms, you have 200ms left for everything else, which is rarely enough.
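The budget table translates directly into a per-stage check. This is a minimal sketch: `BUDGET_MS` reuses the numbers from the table, `over_budget` is a hypothetical helper, and the measured values are invented for illustration.

```python
# Per-stage slices from the budget table above, in milliseconds.
BUDGET_MS = {"triage": 200, "handoff": 50, "specialist": 600, "closer": 150}

def over_budget(measured_ms: dict, budget_ms: dict = BUDGET_MS) -> dict:
    """Return {stage: overrun_ms} for each stage that exceeded its slice."""
    return {
        stage: measured_ms[stage] - allowed
        for stage, allowed in budget_ms.items()
        if measured_ms.get(stage, 0) > allowed
    }

# Illustrative measurements: the specialist runs 800ms against a 600ms slice.
measured = {"triage": 180, "handoff": 40, "specialist": 800, "closer": 150}
overruns = over_budget(measured)
total_ms = sum(measured.values())
```

With these numbers only the specialist is flagged, with a 200ms overrun, and the chain totals 1170ms against the 1000ms budget. Checking per stage rather than only the total tells you which agent to optimize.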
Parallel-bound pattern
For parallel sub-agent patterns, the budget is bounded by the slowest sub-agent, not the sum:
Transcription:  200ms --+
Sentiment:      150ms --+--> Aggregator: 30ms --> Main agent: 400ms  = 630ms total
Summarization:  180ms --+
The slowest sub-agent (transcription at 200ms) dominates. The faster ones run inside its window.
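The arithmetic behind the diagram is a max over the parallel branch plus a sum over the sequential tail. A small sketch, with `critical_path_ms` as a hypothetical helper and the figures taken from the example above:

```python
def critical_path_ms(parallel_ms: dict, sequential_ms: list) -> int:
    """Wall-clock estimate: slowest parallel sub-agent plus the sequential stages after it."""
    return max(parallel_ms.values()) + sum(sequential_ms)

parallel = {"transcription": 200, "sentiment": 150, "summarization": 180}
tail = [30, 400]  # aggregator, then main agent

wall_clock = critical_path_ms(parallel, tail)      # max(200, 150, 180) + 30 + 400
if_serial = sum(parallel.values()) + sum(tail)     # what the same work costs in series
```

`wall_clock` comes to 630ms, while running the same sub-agents in series would cost 960ms. The 330ms difference is what the parallel pattern buys.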
Instrumenting the budget
traceAI captures per-agent span attributes: latency per stage, parent-child relationships, transition timestamps. Walk the trace tree after a call to see exactly where the budget went. The single most useful visualization for multi-agent voice is a Gantt-style timeline of agent spans, with the customer’s audio waveform on top: you immediately see which agent overran and how much dead air the customer experienced.
Failure attribution: who dropped the ball
When conversation_resolution comes back low on a multi-agent call, the question is “which agent failed?” Walk the per-agent task_completion scores from start to end. The first agent that dropped below threshold is your root cause.
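The walk-back is a linear scan over per-agent scores in call order. A minimal sketch, assuming scores are normalized to [0, 1]; `first_failing_agent` and the 0.7 threshold are illustrative choices, not platform defaults.

```python
def first_failing_agent(scores: list, threshold: float = 0.7):
    """Return the first (agent, score) below threshold in call order, or None."""
    for agent, score in scores:
        if score < threshold:
            return agent, score
    return None

# Per-agent task_completion scores, ordered as the call progressed.
call_scores = [("triage", 0.95), ("specialist", 0.40), ("closer", 0.30)]
root_cause = first_failing_agent(call_scores)
```

Here the closer also scores low, but the specialist failed first, so the closer's score is treated as a downstream symptom rather than a separate root cause.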
Common attribution patterns:
- Triage misclassified. Triage task_completion is high (it did fill the slots) but the wrong specialist got picked. Custom eval rubric for routing accuracy catches this.
- Specialist had insufficient context. Triage task_completion looks fine but the specialist’s first turn shows confusion. Hand-off state contract was incomplete: the specialist didn’t get something it needed.
- Specialist resolved the wrong issue. Triage routed correctly, the specialist’s task_completion is high on the wrong job. The slot schema let through an ambiguous intent.
- Closer skipped confirmation. Specialist resolved the issue but the closing agent ended the call without recap. The customer leaves unsure what happens next.
Error Feed in FAGI’s Observe surface clusters these patterns into named issues across calls. Instead of triaging each failed call individually, you triage “specialist had insufficient context on billing transfers” as one cluster with supporting call evidence.
Cost attribution per agent
Each agent makes LLM calls, tool calls, and provider calls. Cost should attribute to the right stage.
Capture per-agent token counts as span attributes:
- llm.token_count.prompt
- llm.token_count.completion
- llm.model_name
- llm.provider
Sum across stages for total call cost. Group by agent for cost-per-stage breakdown. The natural insight: if specialist turns are materially more expensive than triage turns and triage routing accuracy drifts, you pay the specialist’s cost on the mis-routed share too. Cost attribution makes the economic argument for better triage accuracy explicit.
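Grouping cost by agent is a small aggregation over span attributes. In this sketch the `agent` key, the model names, and the per-1K-token prices are all hypothetical placeholders; substitute your own span schema and your providers' actual rates.

```python
from collections import defaultdict

# Hypothetical USD prices per 1K tokens; replace with real provider rates.
PRICE_PER_1K = {
    "small-model": {"prompt": 0.0005, "completion": 0.0015},
    "large-model": {"prompt": 0.005, "completion": 0.015},
}

def cost_by_agent(spans: list) -> dict:
    """Sum LLM cost per agent from span attributes."""
    totals = defaultdict(float)
    for span in spans:
        price = PRICE_PER_1K[span["llm.model_name"]]
        totals[span["agent"]] += (
            span["llm.token_count.prompt"] / 1000 * price["prompt"]
            + span["llm.token_count.completion"] / 1000 * price["completion"]
        )
    return dict(totals)

spans = [
    {"agent": "triage", "llm.model_name": "small-model",
     "llm.token_count.prompt": 1000, "llm.token_count.completion": 200},
    {"agent": "specialist", "llm.model_name": "large-model",
     "llm.token_count.prompt": 4000, "llm.token_count.completion": 600},
]
costs = cost_by_agent(spans)
call_total = sum(costs.values())
```

Under these assumed prices the specialist turn costs many times the triage turn, which is the gap a mis-routed call pays for.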
Tag-based attribution at the trace level (agent_version, customer_id, vertical) lets you slice cost by any business dimension on top of the per-agent breakdown.
Where multi-agent voice runtimes stand in 2026
Not every voice framework has equal depth on multi-agent. The state of the field as of mid-2026:
- LiveKit Agents. The cleanest open-source multi-agent primitives. Hand-off is a first-class event. State passes via structured context. Native support for parallel sub-agents in the same session. The community is large enough that patterns are well-documented.
- Pipecat. Processor-pattern composition lets you swap specialist processors mid-call. Strong on sequential pipelines, good on triage-to-specialist. Multi-agent feels slightly more code-heavy than LiveKit but ships more flexibility.
- Vapi. Simpler abstractions. Squad-style routing fits straightforward triage-to-specialist flows out of the box. Less flexibility for supervisor-and-workers or custom orchestration patterns. The trade is faster time-to-first-call.
- OpenAI Agents SDK. The strongest for cascade-pattern multi-agent in code. Hand-offs are explicit and structured, declared per agent through its handoffs list. Voice integration is via the realtime API; multi-agent voice typically pairs the Agents SDK for orchestration with a separate voice runtime for the audio loop.
The patterns in this guide apply to all four. The implementation details differ.
How Future AGI fits
Multi-agent voice systems put more weight on observability than monolithic voice agents do. The chain has more places to fail and the failure modes are subtler. FAGI’s stack is wired for this:
- traceAI captures the whole transfer chain as one trace tree. Parent span is the session. Child spans represent each agent. Sub-spans break down per turn within an agent. The traceAI instrumentors are OpenInference-compatible (Apache 2.0). The traceai-livekit and traceAI-pipecat packages cover the dominant open-source multi-agent runtimes; traceai-openai-agents covers the OpenAI Agents SDK.
- conversation_resolution on end-to-end outcome. Score the whole multi-agent call as one trace. The ai-evaluation SDK ships 70+ built-in eval templates including conversation_coherence, conversation_resolution, task_completion, audio_transcription, audio_quality. Apache 2.0.
- task_completion per agent. Configure the rubric per stage so each agent gets its own score against its own job description. Per-agent failure becomes visible immediately.
- ConversationalTestCase carries the multi-agent dialog. The same test-case shape handles a single-agent conversation or a multi-agent transfer chain. Multi-eval-template pattern attaches multiple rubrics to one call.
- Error Feed clusters failures by which agent caused them. Error Feed reads the captured spans and groups related failures into named issues with auto-written root cause, supporting evidence, quick fix, and long-term recommendation. Zero-config the moment spans flow in.
- Workflow Builder Transfer Call Node tests the hand-off explicitly. The visual graph ships Conversation, End Call, and Transfer Call nodes; drag-and-drop the transfer point, configure the slot payload, and the scenario exercises the exact hand-off the agent will use in production. Auto-generate 20, 50, or 100 branching scenarios that exercise the triage-to-specialist hand-off. Verify the specialist gets all needed context. Verify the customer doesn’t have to repeat themselves.
- evaluate_function_calling scores tool-calling correctness across the transfer. The rubric checks that the right tool was called with the right arguments at the right turn. For multi-agent flows the most common silent failure is the triage agent invoking the wrong specialist tool or passing the wrong slot payload; evaluate_function_calling catches both.
- Persona library: 18 pre-built plus unlimited custom-authored. Custom personas configure name, description, gender, age range (18-25 / 25-32 / 32-40 / 40-50 / 50-60 / 60+), location (US / Canada / UK / Australia / India), personality traits, communication style, accent, conversation speed, background noise, multilingual (many popular languages), plus custom properties and free-form behavioral instructions. The library grows as the team adds new edge cases.
- Error Localization across agent transitions. When a scenario or production call fails, the system highlights the exact turn (and the exact transfer event) where the chain broke. For triage-to-specialist hand-offs the failing turn is almost always inside or immediately after the transfer; Error Localization makes it visible without scanning the whole trace.
- Programmatic eval API for configure + re-run. Wire the same per-agent and end-to-end rubrics into CI so every multi-agent change runs the regression suite before merge.
- Native voice observability for Vapi, Retell, and LiveKit. Add provider API key + Assistant ID. Auto call log capture starts immediately. Separate assistant + customer audio download per call. Auto transcripts. The same eval engine runs on every call.
- Future AGI Protect model family runs sub-100ms inline per arXiv 2510.13351. Gemma 3n foundation with LoRA-trained adapters per safety dimension. Multi-modal across text, image, and audio. ProtectFlash adds a single-call binary classifier path for the lowest-latency surface.
- Agent Command Center for hosted, multi-region, or BYOC self-host. RBAC, AWS Marketplace, 15+ providers. SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per futureagi.com/trust.
- agent-opt ships six optimizers: Bayesian Search, Meta-Prompt (per arXiv 2505.09666), ProTeGi (Prompt optimization with Textual Gradients), GEPA Genetic-Pareto (per arXiv 2507.19457), Random Search baseline (per arXiv 2311.09569), and PromptWizard. Both UI-driven (Dataset view: pick a dataset, an evaluator, and an optimizer) and SDK-driven via Python expose the same six. Multi-agent prompts are exactly the kind of brittle artifact that benefits from explicit optimization against trace data.
The Agent Definition that wires the observability also drives simulation, optimization, and inline guardrails. One platform, one trust posture, one bill.
Two deliberate tradeoffs
These are deployment-posture and process choices baked into the platform, not feature gaps.
Async eval gating is explicit. agent-opt runs against trace data only when a team triggers an explicit optimization run, picks an optimizer (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard), and approves the candidate prompt. FAGI never auto-rewrites a multi-agent prompt without a human approval gate. The loop is closed, but the gate is intentional.
Native voice observability ships for Vapi, Retell, and LiveKit out of the box. Add provider API key + Assistant ID and call logs flow into the dashboard with no SDK code. Other multi-agent runtimes (Pipecat, OpenAI Agents SDK, custom orchestration) wire in via the traceAI-pipecat and traceai-openai-agents packages and the broader surface of 30+ traceAI instrumentors. The two paths cover 90%+ of production multi-agent voice stacks.
Common pitfalls
Hand-off note skipped. The specialist starts cold and the customer feels the seam. Always pass the three-layer state (history, slots, context note).
Per-agent eval skipped. You score end-to-end and discover the call failed, but you can’t attribute the failure. Wire per-agent task_completion from day one.
Latency budget unmodeled. Each agent thinks its 600ms is fine in isolation; the chain adds up to 2 seconds and the customer hangs up. Allocate budget per stage and verify in production.
Triage tries to resolve. The triage agent’s prompt grows to handle “easy” cases and you slip back toward a monolith. Keep triage narrow.
Schema drift between agents. Triage emits customer_id: "12345" (string), specialist expects customerId: 12345 (int). Shared schema in code, both agents read from it.
Cost ignored. Specialist turns typically cost materially more than triage turns. Whenever routing accuracy drifts, you pay specialist cost on the mis-routed share. Track per-agent cost; the economic case for better triage writes itself.
Closing the loop
The architecture is more complex than a monolithic voice agent, but the operational loop is the same shape: observe, evaluate, cluster, optimize.
Observe. traceAI captures the whole multi-agent chain as one trace tree.
Evaluate. Per-agent task_completion plus end-to-end conversation_resolution and conversation_coherence.
Cluster. Error Feed groups related failures into named issues attributed to specific agents in the chain.
Optimize. agent-opt runs one of the six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) against the failing agent’s trace data and corrected examples. The cluster shrinks, the per-agent score climbs, the end-to-end resolution rate follows.
The multi-agent payoff is on the hard calls. Monolithic agents handle the easy ones; multi-agent handles the ones that branch. Wire the observability to match the architecture and the failure modes become tractable.
Related reading
- Agent Architecture Patterns in 2026: A practical taxonomy
- Logging and Analytics Architecture for Voice Agents in 2026
- Three-Layer Voice Testing in 2026
- Voice Agent Conversation Monitoring in 2026
Sources and references
- traceAI on GitHub: github.com/future-agi/traceAI
- ai-evaluation on GitHub: github.com/future-agi/ai-evaluation
- Error Feed docs: docs.futureagi.com/docs/observe
- Future AGI Protect docs: docs.futureagi.com/docs/protect
- Agent Command Center docs: docs.futureagi.com/docs/command-center
- arXiv 2510.13351 (Protect): arxiv.org/abs/2510.13351
- arXiv 2507.19457 (GEPA): arxiv.org/abs/2507.19457
- arXiv 2505.09666 (Meta-Prompt): arxiv.org/abs/2505.09666
- arXiv 2311.09569 (Random Search baseline): arxiv.org/abs/2311.09569
- Trust page: futureagi.com/trust
- OpenInference spec: github.com/Arize-ai/openinference
- OpenTelemetry: opentelemetry.io
- LiveKit Agents: docs.livekit.io/agents
- Pipecat: github.com/pipecat-ai/pipecat
- OpenAI Agents SDK: github.com/openai/openai-agents-python