Multi-Agent Voice Systems in 2026: State Transitions, Hand-offs, Eval Boundaries

How to architect multi-agent voice systems in 2026: state transitions, hand-off prompt design, per-agent vs end-to-end evals, latency budgets, failure attribution.

Most voice agent stacks start as one big prompt and one big LLM call per turn. The day a single prompt has to handle billing, technical support, and retention with three different tones and three different tool sets, the prompt cracks. Multi-agent voice systems split that workload. The architecture is harder to debug, the failure modes are subtler, and the eval boundary moves from “did the agent answer well” to “did the chain hand off well.” This guide walks through the patterns, the state-preservation problem, the eval boundary, and how to instrument the whole thing.

TL;DR step preview

  1. Pick the pattern that fits your call flow: triage to specialist, supervisor plus workers, parallel sub-agents, or sequential pipeline.
  2. Preserve three layers of state across every hand-off: full transcript, structured slots, and a hand-off context note. Skip any one and the customer repeats themselves.
  3. Score evals at two layers. Per-agent task_completion tells you each stage’s job got done. End-to-end conversation_resolution and conversation_coherence tell you the call worked as a whole.
  4. Allocate latency budgets per agent and verify in production. A 1-second per-turn budget might split 200ms triage + 600ms specialist + 200ms closer; parallel sub-agents share the same clock.
  5. Attribute failures by walking back through per-agent scores until the first stage drops. The transition turns are the highest-leverage debug points.
  6. Instrument the whole chain as one trace tree: parent span = session, child spans = each agent, sub-spans = turns. Error Feed clusters production failures by which agent caused them.

The rest of the guide unpacks each pattern, the state-preservation contract, the eval boundary, and the observability layer that closes the loop.

Why multi-agent voice systems exist

A single voice agent works when one prompt can handle the full call. The moment the call branches across domains (billing rules, technical troubleshooting, retention offers, scheduling) a monolithic prompt starts dropping branches. You see it in the eval scores first: task_completion holds on simple intents and collapses on the complex ones. You see it in customer transcripts second: long pauses, hedged responses, the agent answering the wrong question.

The fix is decomposition. Split the call across specialized agents. Each agent owns a narrower job: one prompt, one tool set, one tone, one set of eval rubrics. The trade is real: hand-offs add latency, can lose state, and require eval at the seam, not just inside each stage. Done well, the multi-agent stack can outperform a monolithic prompt on complex routed calls. Done poorly, the customer feels every hand-off and the engineering team can’t tell which agent dropped the ball.

The architectural concerns from this point on are the same whether your runtime is LiveKit, Pipecat, Vapi, or a custom orchestration layer. The runtimes differ in syntax; the eval boundary, state contract, and observability needs are the same.

Pattern 1: Triage to specialist

The most common multi-agent voice pattern. A front-line triage agent answers the call, identifies intent in a few turns, then hands off to a specialist (billing, technical, retention, scheduling) who completes the resolution.

Caller
  |
  v
+-------------------+        intent + slots + tone
| Triage agent      | --------------------------------+
| (intent routing)  |                                 |
+-------------------+                                 v
                                          +----------+----------+
                                          |  Billing specialist |
                                          |  or                 |
                                          |  Technical          |
                                          |  or                 |
                                          |  Retention          |
                                          +----------+----------+
                                                     |
                                                     v
                                              Resolution +
                                              optional closing

The triage agent’s job is narrow: greet the caller, identify the intent in 1-3 turns, extract structured slots (account number, issue category, urgency), and hand off. It should not attempt resolution. The temptation to “answer the easy ones in triage” leaks complexity back into the triage prompt and you’re back to a monolith.

The specialist’s job is deep: full tool access for its domain, longer context window, domain-specific eval rubrics. The specialist trusts the triage agent’s classification but verifies one critical slot (account number, customer identity) before taking any action.

Failure mode: triage misclassifies intent. The specialist either says “I can’t help with that, let me transfer you back” (correct but costly) or attempts resolution outside its domain (incorrect and worse). Score triage with task_completion on classification accuracy. Score the specialist with task_completion on resolution. Score the whole call with conversation_resolution.
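The routing decision and the specialist-side guard described above can be sketched in a few lines. This is a minimal illustration, not any runtime's API: the `SPECIALISTS` registry, the `TriageResult` shape, and the 0.7 confidence floor are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical specialist registry; the intent labels are illustrative.
SPECIALISTS = {
    "billing": "billing_specialist",
    "technical": "technical_specialist",
    "retention": "retention_specialist",
}

@dataclass
class TriageResult:
    intent: str                      # label produced by the triage agent
    confidence: float                # classifier confidence in [0, 1]
    slots: dict = field(default_factory=dict)

def route(result: TriageResult, min_confidence: float = 0.7) -> str:
    """Pick a specialist, or keep the caller in triage when unsure."""
    if result.intent not in SPECIALISTS or result.confidence < min_confidence:
        return "triage"              # ask a clarifying question instead
    return SPECIALISTS[result.intent]

def specialist_accepts(specialist: str, observed_intent: str) -> bool:
    """Specialist-side guard: refuse work outside its own domain."""
    return SPECIALISTS.get(observed_intent) == specialist
```

Keeping the caller in triage on low confidence costs one extra turn, which is usually cheaper than a transfer back from the wrong specialist.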

Pattern 2: Supervisor and workers

A supervisor agent decomposes the call goal into sub-tasks, dispatches them to N specialist workers, and aggregates the results before responding to the caller. Useful when the caller’s request fans out (e.g., “schedule three appointments at three different clinics and confirm insurance for each”).

Caller
  |
  v
+-------------------+
| Supervisor agent  |---+----+----+
| (decompose +      |   |    |    |
|  dispatch)        |   v    v    v
+-------------------+  W1   W2   W3   (workers run concurrently or sequentially)
       ^                |    |    |
       |                +----+----+
       |                     |
       +---------------------+  (aggregate)
       |
       v
+-------------------+
| Response          |
+-------------------+

The supervisor maintains the caller-facing conversation. Workers don’t talk to the caller; they execute their sub-task and return a structured result. The supervisor decides whether to surface partial progress (“got the first appointment, working on the second”) or wait for the whole bundle.

Failure mode: workers return inconsistent results (one books for Tuesday, one for Wednesday, the supervisor doesn’t notice). Score each worker with task_completion on its sub-task. Score the supervisor with a custom eval rubric for aggregation correctness. The end-to-end conversation_resolution covers whether the caller’s actual goal was met.

This pattern fits text agents better than voice in 2026 because voice is latency-sensitive and dispatching N workers in serial blows the per-turn budget. Use the parallel variant if you go this route on voice.
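A rough sketch of the parallel variant, using only the standard library. The `book_appointment` worker is a stand-in (a real one would call a scheduling tool), and the aggregation check is one possible way to catch the "Tuesday versus Wednesday" inconsistency: compare each worker's result against what the supervisor asked for.

```python
import asyncio

async def book_appointment(clinic: str, day: str) -> dict:
    """Stand-in worker; a real one would call a scheduling tool."""
    await asyncio.sleep(0.01)        # simulated tool latency
    return {"clinic": clinic, "day": day, "status": "booked"}

async def supervise(requests: list[tuple[str, str]]) -> dict:
    # Dispatch all workers concurrently so latency is bounded by the slowest.
    results = await asyncio.gather(
        *(book_appointment(clinic, day) for clinic, day in requests)
    )
    # Aggregation check: every worker must return what was asked of it.
    mismatches = [
        r for (clinic, day), r in zip(requests, results)
        if r["clinic"] != clinic or r["day"] != day or r["status"] != "booked"
    ]
    return {"results": list(results), "consistent": not mismatches}

summary = asyncio.run(
    supervise([("North", "Tue"), ("South", "Tue"), ("East", "Wed")])
)
```

`asyncio.gather` returns results in the order the workers were submitted, which is what makes the `zip` against the original requests safe.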

Pattern 3: Parallel sub-agents

Multiple agents work concurrently on the same input. The classic voice example: transcription, sentiment, and summarization all run on the streaming audio in parallel. The orchestration layer aggregates the outputs and feeds them to the main response-generating agent.

                     +---> Transcription agent ---+
                     |                            |
Caller turn  --------+---> Sentiment agent -------+---> Aggregator ---> Main agent ---> Response
                     |                            |
                     +---> Summarization agent ---+

Latency in this pattern is bounded by the slowest sub-agent, not the sum. A 200ms transcription + 150ms sentiment + 180ms summarization runs in roughly 200ms wall clock, not 530ms. That’s the whole point.

Failure mode: one sub-agent crashes or stalls and the aggregator either blocks waiting or proceeds without the missing input. Decide explicitly per sub-agent whether it’s required for the main agent to respond. Sentiment is usually nice-to-have (proceed without it). Transcription is usually required (block or fail the turn). Score each sub-agent independently. Score the aggregator’s behavior under partial input with a custom rubric.
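One way to make the required-versus-optional decision explicit is to declare it next to each sub-agent and enforce it with a timeout. The sketch below assumes two stand-in sub-agents and a hypothetical `SUB_AGENTS` table; the 200ms timeout is illustrative.

```python
import asyncio

async def transcribe(audio: bytes) -> str:
    await asyncio.sleep(0.01)        # simulated fast sub-agent
    return "i was charged twice"

async def sentiment(audio: bytes) -> str:
    await asyncio.sleep(1.0)         # simulated stalled sub-agent
    return "frustrated"

# name -> (coroutine function, required for the main agent to respond?)
SUB_AGENTS = {"transcript": (transcribe, True), "sentiment": (sentiment, False)}

async def fan_out(audio: bytes, timeout_s: float = 0.2) -> dict:
    async def run_one(name, fn, required):
        try:
            return name, await asyncio.wait_for(fn(audio), timeout_s)
        except asyncio.TimeoutError:
            if required:
                raise RuntimeError(f"required sub-agent '{name}' timed out")
            return name, None        # optional input: proceed without it

    pairs = await asyncio.gather(
        *(run_one(name, fn, req) for name, (fn, req) in SUB_AGENTS.items())
    )
    return dict(pairs)

outputs = asyncio.run(fan_out(b"\x00"))
```

Here the stalled sentiment agent yields `None` and the turn proceeds; a stalled transcription agent would fail the turn instead of blocking it indefinitely.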

Pattern 4: Sequential pipeline

Agent A processes, hands off state to B, which hands off to C. Each stage transforms the input for the next. Useful when the stages are genuinely sequential (extract intent first, then look up policy, then generate response in policy-compliant tone).

Caller turn  -->  Intent extractor  -->  Policy looker-upper  -->  Response generator  -->  TTS
                  (cheap LLM call)      (tool span: vector DB)    (richer LLM call)

Latency adds up. A 100ms + 150ms + 400ms pipeline runs in 650ms wall clock. The advantage is that each stage uses the right model for the job: a cheap small model for intent, a tool call for policy lookup, a stronger model for response. Total cost can be lower than one big call to a frontier model.

Failure mode: a stage in the middle returns an unexpected shape and breaks the next stage. Score each stage independently. The transition between stages is the failure-prone seam, not the stages themselves.
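Since the seam is where things break, it helps to check each stage's output shape before the next stage sees it. A minimal sketch, with toy stages and an illustrative policy table standing in for the real LLM and vector-DB calls:

```python
from typing import Callable

def extract_intent(utterance: str) -> dict:
    return {"intent": "refund" if "refund" in utterance.lower() else "other"}

def lookup_policy(state: dict) -> dict:
    policies = {"refund": "Refunds allowed within 30 days."}  # illustrative data
    return {**state, "policy": policies.get(state["intent"], "No policy found.")}

def generate_response(state: dict) -> dict:
    return {**state, "response": f"Per our policy: {state['policy']}"}

# Each stage declares the keys it must emit; the seam is checked, not assumed.
PIPELINE: list[tuple[Callable, set]] = [
    (extract_intent, {"intent"}),
    (lookup_policy, {"intent", "policy"}),
    (generate_response, {"intent", "policy", "response"}),
]

def run_pipeline(utterance: str) -> dict:
    state = utterance
    for stage, required_keys in PIPELINE:
        state = stage(state)
        missing = required_keys - set(state)
        if missing:
            raise ValueError(f"{stage.__name__} omitted {sorted(missing)}")
    return state
```

The error names the stage that produced the bad shape, so the failure is attributed at the seam rather than surfacing as a confusing exception one stage later.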

State preservation across hand-offs

State preservation is the single biggest determinant of multi-agent voice quality. Three layers move with the customer.

Layer 1: Conversation history

The full transcript up to the hand-off point. The specialist needs to see what the customer actually said, not a summary. Summaries lose tone, lose hedged commitments, lose the exact phrasing that the customer used to describe the issue. Pass the raw turns; let the specialist’s prompt decide how much to attend to.

For long calls, the practical limit is context window. A 30-turn call may not fit comfortably in a specialist’s prompt. The compression strategy: keep the last N turns verbatim, plus a structured summary of earlier turns. Don’t lose the verbatim recent context.
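The compression strategy is simple enough to sketch directly. The `summarize` argument is a placeholder: in practice it would be an LLM call that produces the structured summary, and the default here only counts turns so the example stays self-contained.

```python
def compress_history(turns: list[dict], keep_last: int = 8,
                     summarize=lambda old: f"{len(old)} earlier turns omitted") -> dict:
    """Keep the most recent turns verbatim; summarize everything before them."""
    if keep_last <= 0:
        raise ValueError("keep_last must be positive")
    if len(turns) <= keep_last:
        return {"summary": None, "recent": list(turns)}
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return {"summary": summarize(older), "recent": recent}
```

Short calls pass through untouched, so the specialist only ever sees a summary when the transcript would not otherwise fit.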

Layer 2: Structured slot state

Extracted entities: customer_id, account_type, issue_category, urgency, prior_eval_scores. Triage’s job is to fill these slots; the specialist’s job is to trust them (with a verification step on the critical ones).

Schema discipline matters. Define the slot shape upfront. If triage extracts customer_id: "12345" and the specialist expects customerId: 12345, you have a contract bug. The fix is a shared schema definition in code that both agents read.
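A shared schema can be as small as one dataclass that both agents import. The sketch below is one possible shape, covering four of the slots listed above; treating every field as a string is an assumption made to keep the example short.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Slots:
    customer_id: str          # always a string, even when it looks numeric
    account_type: str
    issue_category: str
    urgency: str

    def __post_init__(self):
        # Reject wrong types at construction time, not three turns later.
        for name, value in asdict(self).items():
            if not isinstance(value, str):
                raise TypeError(f"{name} must be str, got {type(value).__name__}")

def parse_slots(payload: dict) -> Slots:
    """Specialist-side entry point: unknown or missing keys fail loudly."""
    return Slots(**payload)
```

A payload keyed `customerId` raises a `TypeError` at the hand-off instead of silently producing a specialist with no customer ID.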

Layer 3: Hand-off context note

A short instruction the triage agent writes for the specialist. Format varies, but the content is consistent:

  • Why the transfer is happening (intent classification + confidence).
  • Customer’s tone and emotional state.
  • Anything triage promised the customer (e.g., “I told them a specialist would get them a refund”).
  • Anything triage tried that didn’t work (failed tool calls, ambiguous responses).

The hand-off note is what makes the specialist feel like a colleague who got a real warm transfer instead of a stranger who got a ticket dropped on their desk. Skip it and the specialist starts cold; include it and the specialist hits the ground running.

In LiveKit’s Agents framework, the hand-off carries explicit state via the transfer event. In Pipecat, the processor pattern lets you compose state-passing between specialist processors. In the OpenAI Agents SDK, handoff() accepts a structured context payload. In Vapi, the squad routing passes state via the dispatch context. Whichever runtime, structure the three layers explicitly.
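Independent of runtime, the three layers can be modeled as one payload object. This is a runtime-agnostic sketch, not the hand-off type of any framework named above; `HandoffNote`, `HandoffPayload`, and `render_note` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class HandoffNote:
    reason: str                                   # why the transfer is happening
    confidence: float                             # triage confidence in the intent
    tone: str                                     # customer's emotional state
    promises: list[str] = field(default_factory=list)
    failed_attempts: list[str] = field(default_factory=list)

@dataclass
class HandoffPayload:
    transcript: list[dict]                        # layer 1: raw turns
    slots: dict                                   # layer 2: structured slots
    note: HandoffNote                             # layer 3: context note

def render_note(note: HandoffNote) -> str:
    """Turn the note into text for the top of the specialist's prompt."""
    lines = [
        f"Transfer reason: {note.reason} (confidence {note.confidence:.2f})",
        f"Customer tone: {note.tone}",
    ]
    lines += [f"Promised: {p}" for p in note.promises]
    lines += [f"Already tried: {a}" for a in note.failed_attempts]
    return "\n".join(lines)
```

Making the note a typed object rather than free text means a missing field is a construction error, not a cold start for the specialist.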

The eval boundary: per-agent versus end-to-end

A monolithic voice agent has one eval boundary: the call. A multi-agent voice system has N+1 boundaries: one per agent, plus the end-to-end call.

Per-agent eval

Each agent gets scored on its own job. The natural rubric is task_completion configured per stage:

  • Triage task_completion: did the triage agent correctly identify intent, fill required slots, and hand off?
  • Specialist task_completion: did the specialist resolve the issue within its domain?
  • Closer task_completion: did the closing agent confirm next steps and end the call cleanly?

Per-agent eval tells you exactly which stage dropped the ball. Without it, you know the call failed but you can’t attribute blame to a stage.

End-to-end eval

The whole call gets scored as one trace. The rubrics that matter at this layer:

  • conversation_resolution: did the call achieve the customer’s actual goal?
  • conversation_coherence: did the conversation hang together as one continuous interaction, despite the agent transitions?
  • Custom CSAT proxy rubrics: tone consistency across hand-offs, customer-repeat-rate (did the customer have to repeat themselves after a transfer?).

End-to-end eval is what the business cares about. The customer doesn’t know there were three agents; they know whether their problem got solved. Score both layers. They tell you different things.
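The customer-repeat-rate rubric can start as a cheap string heuristic before it becomes an LLM-judged eval. The sketch below is an assumption about how such a check might look: it flags slot values the caller stated before the transfer and had to state again afterwards. The turn shape (`role`, `text`) is hypothetical.

```python
def repeated_after_transfer(turns: list[dict], transfer_index: int,
                            slot_values: list[str]) -> list[str]:
    """Return slot values the caller stated before AND again after the transfer."""
    def said(turn_slice, value):
        return any(t["role"] == "caller" and value.lower() in t["text"].lower()
                   for t in turn_slice)

    before, after = turns[:transfer_index], turns[transfer_index:]
    return [v for v in slot_values if said(before, v) and said(after, v)]
```

Substring matching misses paraphrases ("the same number I gave you"), so treat a non-empty result as a strong signal and an empty one as inconclusive.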

Multi-eval-template pattern

For a multi-agent call, you attach multiple eval rubrics to the same trace:

# Assumes the ai-evaluation SDK, which provides the fi.* modules.
from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals import Evaluator, ConversationCoherence, ConversationResolution

# One LLMTestCase per exchange: query is the caller, response is the agent.
# The transfer happens between the second and third exchanges.
conv = ConversationalTestCase(messages=[
    LLMTestCase(query="Hi, I need help with my bill", response="I can route you. What's the issue?"),
    LLMTestCase(query="I was charged twice this month", response="Got it. Transferring to billing."),
    LLMTestCase(query="Okay, thanks.", response="(billing specialist) Let me pull up your account. I see the duplicate charge. Refunding now."),
    LLMTestCase(query="Thanks", response="You're welcome. Anything else?"),
])

# The ... values are placeholders; supply your own credentials.
ev = Evaluator(fi_api_key=..., fi_secret_key=...)

# Both end-to-end rubrics score the same conversation in one call.
result = ev.evaluate(
    eval_templates=[ConversationCoherence(), ConversationResolution()],
    inputs=[conv],
)

The same ConversationalTestCase carries the whole multi-agent dialog. Multiple eval templates score it at the end-to-end layer. Per-agent scoring happens on each stage’s individual turns, attached to the per-agent spans.

Latency budget allocation

Voice is latency-sensitive. A multi-agent call adds hand-off overhead on top of each stage’s own latency. Budgeting matters.

Sum-the-stages pattern

For sequential patterns (triage → specialist → closer), the per-turn budget sums:

Triage classification turn:         200ms
Hand-off plus state pass:            50ms
Specialist response generation:     600ms
Closing turn:                       150ms
-----------------------------------------
Total per-turn critical-path budget: 1000ms (1s)

Each agent owns its slice. If the specialist’s first response routinely runs 800ms, you have 200ms left for everything else, and the other three slices above are budgeted at 400ms combined.
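The arithmetic is worth encoding so an overrun shows up as a number rather than a feeling. A small sketch using the slices from the budget table above; the function names are made up for the example.

```python
# Per-stage slices from the budget table above, in milliseconds.
BUDGET_MS = {"triage": 200, "handoff": 50, "specialist": 600, "closer": 150}
TOTAL_BUDGET_MS = 1000

def overruns(measured_ms: dict) -> dict:
    """Map each stage that exceeded its slice to the excess in ms."""
    return {stage: measured_ms[stage] - limit
            for stage, limit in BUDGET_MS.items()
            if measured_ms.get(stage, 0) > limit}

def squeeze(stage: str, measured: int) -> tuple[int, int]:
    """Return (ms left after `stage`, ms the other stages are budgeted for)."""
    left = TOTAL_BUDGET_MS - measured
    needed = sum(ms for name, ms in BUDGET_MS.items() if name != stage)
    return left, needed
```

With an 800ms specialist, `squeeze("specialist", 800)` shows 200ms remaining against 400ms of budgeted work in the other stages.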

Parallel-bound pattern

For parallel sub-agent patterns, the budget is bounded by the slowest sub-agent, not the sum:

Transcription:    200ms  --+
Sentiment:        150ms  --+--> Aggregator: 30ms --> Main agent: 400ms = 630ms total
Summarization:    180ms  --+

The slowest sub-agent (transcription at 200ms) dominates. The faster ones run inside its window.
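The same numbers as a worked calculation, contrasting the parallel critical path with what the turn would cost if the sub-agents ran one after another. The function names are illustrative.

```python
def parallel_turn_latency(sub_agents_ms: dict, aggregator_ms: int, main_ms: int) -> int:
    """Critical path when sub-agents run concurrently: slowest one wins."""
    return max(sub_agents_ms.values()) + aggregator_ms + main_ms

def serial_turn_latency(sub_agents_ms: dict, aggregator_ms: int, main_ms: int) -> int:
    """Same work run back to back, for comparison."""
    return sum(sub_agents_ms.values()) + aggregator_ms + main_ms

subs = {"transcription": 200, "sentiment": 150, "summarization": 180}
parallel_ms = parallel_turn_latency(subs, aggregator_ms=30, main_ms=400)
serial_ms = serial_turn_latency(subs, aggregator_ms=30, main_ms=400)
saved_ms = serial_ms - parallel_ms
```

The gap between the two figures is the latency the parallel pattern buys back on every turn.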

Instrumenting the budget

traceAI captures per-agent span attributes: latency per stage, parent-child relationships, transition timestamps. Walk the trace tree after a call to see exactly where the budget went. The single most useful visualization for multi-agent voice is a Gantt-style timeline of agent spans, with the customer’s audio waveform on top: you immediately see which agent overran and how much dead air the customer experienced.
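If you export span timings from your tracer, a text version of that timeline takes only a few lines. This sketch does not use the traceAI API; `Span` is a hypothetical record holding a name plus start and end offsets in milliseconds, and `dead_air_ms` is one possible definition of dead air (time covered by no span).

```python
from dataclasses import dataclass

@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int

def gantt(spans: list[Span], ms_per_char: int = 50) -> str:
    """Render spans as one text row each, offset by start time."""
    width = max(len(s.name) for s in spans)
    rows = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        pad = " " * (s.start_ms // ms_per_char)
        bar = "#" * max(1, (s.end_ms - s.start_ms) // ms_per_char)
        rows.append(f"{s.name:<{width}} |{pad}{bar}")
    return "\n".join(rows)

def dead_air_ms(spans: list[Span]) -> int:
    """Total time between spans that no agent covers."""
    ordered = sorted(spans, key=lambda s: s.start_ms)
    gaps, cursor = 0, ordered[0].end_ms
    for s in ordered[1:]:
        gaps += max(0, s.start_ms - cursor)
        cursor = max(cursor, s.end_ms)
    return gaps
```

Overlapping spans (parallel sub-agents) do not count as gaps because the cursor only advances to the latest end time seen so far.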

Failure attribution: who dropped the ball

When conversation_resolution comes back low on a multi-agent call, the question is “which agent failed?” Walk the per-agent task_completion scores from start to end. The first agent that dropped below threshold is your root cause.
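The walk itself is a short loop. A minimal sketch, assuming you already have per-agent scores in call order; the 0.7 threshold is illustrative and would normally be set per stage.

```python
def first_failing_stage(stage_scores: list[tuple[str, float]],
                        threshold: float = 0.7):
    """Walk stages in call order; return the first one scoring below threshold."""
    for stage, score in stage_scores:
        if score < threshold:
            return stage
    return None  # every stage passed: look at the seams between stages instead
```

A `None` result on a call that still failed end to end points at the hand-off rather than at any single agent, which is exactly the case the patterns below cover.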

Common attribution patterns:

  • Triage misclassified. Triage task_completion is high (it did fill the slots) but the wrong specialist got picked. Custom eval rubric for routing accuracy catches this.
  • Specialist had insufficient context. Triage task_completion looks fine but the specialist’s first turn shows confusion. Hand-off state contract was incomplete: the specialist didn’t get something it needed.
  • Specialist resolved the wrong issue. Triage routed correctly, the specialist’s task_completion is high on the wrong job. The slot schema let through an ambiguous intent.
  • Closer skipped confirmation. Specialist resolved the issue but the closing agent ended the call without recap. The customer leaves unsure what happens next.

Error Feed in Future AGI’s (FAGI’s) Observe surface clusters these patterns into named issues across calls. Instead of triaging each failed call individually, you triage “specialist had insufficient context on billing transfers” as one cluster with supporting call evidence.

Cost attribution per agent

Each agent makes LLM calls, tool calls, and provider calls. Cost should attribute to the right stage.

Capture per-agent token counts as span attributes:

  • llm.token_count.prompt
  • llm.token_count.completion
  • llm.model_name
  • llm.provider

Sum across stages for total call cost. Group by agent for cost-per-stage breakdown. The natural insight: if specialist turns are materially more expensive than triage turns and triage routing accuracy drifts, you pay the specialist’s cost on the mis-routed share too. Cost attribution makes the economic argument for better triage accuracy explicit.
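Grouping by agent is a small aggregation over the span attributes listed above. In this sketch the prices, the model names, and the `agent` attribute that labels each span are all hypothetical; substitute your provider's real rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens: (prompt, completion).
PRICES = {"small-model": (0.10, 0.40), "large-model": (2.00, 8.00)}

def cost_by_agent(spans: list[dict]) -> dict:
    """Sum LLM cost per agent from token-count span attributes."""
    totals = defaultdict(float)
    for s in spans:
        prompt_price, completion_price = PRICES[s["llm.model_name"]]
        totals[s["agent"]] += (
            s["llm.token_count.prompt"] * prompt_price
            + s["llm.token_count.completion"] * completion_price
        ) / 1_000_000
    return dict(totals)
```

Multiplying the specialist's per-call figure by the mis-routed share of calls gives the direct cost of a drift in triage accuracy.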

Tag-based attribution at the trace level (agent_version, customer_id, vertical) lets you slice cost by any business dimension on top of the per-agent breakdown.

Where multi-agent voice runtimes stand in 2026

Not every voice framework has equal depth on multi-agent. The state of the field as of mid-2026:

  • LiveKit Agents. The cleanest open-source multi-agent primitives. Hand-off is a first-class event. State passes via structured context. Native support for parallel sub-agents in the same session. The community is large enough that patterns are well-documented.
  • Pipecat. Processor-pattern composition lets you swap specialist processors mid-call. Strong on sequential pipelines, good on triage-to-specialist. Multi-agent feels slightly more code-heavy than LiveKit but ships more flexibility.
  • Vapi. Simpler abstractions. Squad-style routing fits straightforward triage-to-specialist flows out of the box. Less flexibility for supervisor-and-workers or custom orchestration patterns. The trade is faster time-to-first-call.
  • OpenAI Agents SDK. The strongest for cascade-pattern multi-agent in code. handoff() is explicit and structured. Voice integration is via the realtime API; multi-agent voice typically pairs the Agents SDK for orchestration with a separate voice runtime for the audio loop.

The patterns in this guide apply to all four. The implementation details differ.

How Future AGI fits

Multi-agent voice systems put more weight on observability than monolithic voice agents do. The chain has more places to fail and the failure modes are subtler. FAGI’s stack is wired for this:

  • traceAI captures the whole transfer chain as one trace tree. Parent span is the session. Child spans represent each agent. Sub-spans break down per turn within an agent. The traceAI instrumentors are OpenInference-compatible (Apache 2.0). The traceai-livekit and traceAI-pipecat packages cover the dominant open-source multi-agent runtimes; traceai-openai-agents covers the OpenAI Agents SDK.
  • conversation_resolution on end-to-end outcome. Score the whole multi-agent call as one trace. The ai-evaluation SDK ships 70+ built-in eval templates including conversation_coherence, conversation_resolution, task_completion, audio_transcription, audio_quality. Apache 2.0.
  • task_completion per agent. Configure the rubric per stage so each agent gets its own score against its own job description. Per-agent failure becomes visible immediately.
  • ConversationalTestCase carries the multi-agent dialog. The same test-case shape handles a single-agent conversation or a multi-agent transfer chain. Multi-eval-template pattern attaches multiple rubrics to one call.
  • Error Feed clusters failures by which agent caused them. Error Feed reads the captured spans and groups related failures into named issues with auto-written root cause, supporting evidence, quick fix, and long-term recommendation. Zero-config the moment spans flow in.
  • Workflow Builder Transfer Call Node tests the hand-off explicitly. The visual graph ships Conversation, End Call, and Transfer Call nodes; drag-and-drop the transfer point, configure the slot payload, and the scenario exercises the exact hand-off the agent will use in production. Auto-generate 20, 50, or 100 branching scenarios that exercise the triage-to-specialist hand-off. Verify the specialist gets all needed context. Verify the customer doesn’t have to repeat themselves.
  • evaluate_function_calling scores tool-calling correctness across the transfer. The rubric checks that the right tool was called with the right arguments at the right turn. For multi-agent flows the most common silent failure is the triage agent invoking the wrong specialist tool or passing the wrong slot payload; evaluate_function_calling catches both.
  • Persona library: 18 pre-built plus unlimited custom-authored. Custom personas configure name, description, gender, age range (18-25 / 25-32 / 32-40 / 40-50 / 50-60 / 60+), location (US / Canada / UK / Australia / India), personality traits, communication style, accent, conversation speed, background noise, multilingual (many popular languages), plus custom properties and free-form behavioral instructions. The library grows as the team adds new edge cases.
  • Error Localization across agent transitions. When a scenario or production call fails, the system highlights the exact turn (and the exact transfer event) where the chain broke. For triage-to-specialist hand-offs the failing turn is almost always inside or immediately after the transfer; Error Localization makes it visible without scanning the whole trace.
  • Programmatic eval API for configure + re-run. Wire the same per-agent and end-to-end rubrics into CI so every multi-agent change runs the regression suite before merge.
  • Native voice observability for Vapi, Retell, and LiveKit. Add provider API key + Assistant ID. Auto call log capture starts immediately. Separate assistant + customer audio download per call. Auto transcripts. The same eval engine runs on every call.
  • Future AGI Protect model family runs sub-100ms inline per arXiv 2510.13351. Gemma 3n foundation with LoRA-trained adapters per safety dimension. Multi-modal across text, image, and audio. ProtectFlash adds a single-call binary classifier path for the lowest-latency surface.
  • Agent Command Center for hosted, multi-region, or BYOC self-host. RBAC, AWS Marketplace, 15+ providers. SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per futureagi.com/trust.
  • agent-opt ships six optimizers: Bayesian Search, Meta-Prompt (per arXiv 2505.09666), ProTeGi (Prompt optimization with Textual Gradients), GEPA Genetic-Pareto (per arXiv 2507.19457), Random Search baseline (per arXiv 2311.09569), and PromptWizard. Both UI-driven (Dataset view: pick a dataset, an evaluator, and an optimizer) and SDK-driven via Python expose the same six. Multi-agent prompts are exactly the kind of brittle artifact that benefits from explicit optimization against trace data.

The Agent Definition that wires the observability also drives simulation, optimization, and inline guardrails. One platform, one trust posture, one bill.

Two deliberate tradeoffs

These are deployment-posture and process choices baked into the platform, not feature gaps.

Async eval gating is explicit. agent-opt runs against trace data only when a team triggers an explicit optimization run, picks an optimizer (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard), and approves the candidate prompt. FAGI never auto-rewrites a multi-agent prompt without a human approval gate. The loop is closed, but the gate is intentional.

Native voice observability ships for Vapi, Retell, and LiveKit out of the box. Add provider API key + Assistant ID and call logs flow into the dashboard with no SDK code. Other multi-agent runtimes (Pipecat, OpenAI Agents SDK, custom orchestration) wire in via traceAI-pipecat, traceai-openai-agents, and the broader surface of 30+ traceAI instrumentors. The two paths cover 90%+ of production multi-agent voice stacks.

Common pitfalls

Hand-off note skipped. The specialist starts cold and the customer feels the seam. Always pass the three-layer state (history, slots, context note).

Per-agent eval skipped. You score end-to-end and discover the call failed, but you can’t attribute the failure. Wire per-agent task_completion from day one.

Latency budget unmodeled. Each agent thinks its 600ms is fine in isolation; the chain adds up to 2 seconds and the customer hangs up. Allocate budget per stage and verify in production.

Triage tries to resolve. The triage agent’s prompt grows to handle “easy” cases and you slip back toward a monolith. Keep triage narrow.

Schema drift between agents. Triage emits customer_id: "12345" (string), specialist expects customerId: 12345 (int). Keep a shared schema in code that both agents read from.

Cost ignored. Specialist turns are typically materially more expensive than triage turns. Whenever routing accuracy drifts, you pay specialist cost on the mis-routed share. Track per-agent cost; the economic case for better triage writes itself.

Closing the loop

The architecture is more complex than a monolithic voice agent, but the operational loop is the same shape: observe, evaluate, cluster, optimize.

Observe. traceAI captures the whole multi-agent chain as one trace tree.

Evaluate. Per-agent task_completion plus end-to-end conversation_resolution and conversation_coherence.

Cluster. Error Feed groups related failures into named issues attributed to specific agents in the chain.

Optimize. agent-opt runs one of the six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) against the failing agent’s trace data and corrected examples. The cluster shrinks, the per-agent score climbs, the end-to-end resolution rate follows.

The multi-agent payoff is on the hard calls. Monolithic agents handle the easy ones; multi-agent handles the ones that branch. Wire the observability to match the architecture and the failure modes become tractable.

Frequently asked questions

What is a multi-agent voice system, and when do you actually need one?
A multi-agent voice system splits a single call across two or more specialized agents that hand off state mid-conversation. The triage-to-specialist pattern routes the caller from a front-line agent to a billing, technical, or retention specialist once intent is clear. The supervisor-and-workers pattern decomposes a task across N specialists and aggregates. You need multi-agent when one agent's prompt grows past the point of reliable behavior, when domain expertise differs sharply between branches (medical triage versus appointment booking), or when latency budgets demand parallel work (transcription plus sentiment plus summarization concurrently). For single-intent flows, stay monolithic. The hand-off cost only pays back when the routing reduces failure rate or latency by more than the hand-off overhead.
How do you preserve state across an agent hand-off?
Three layers move with the customer. Conversation history (full transcript so the specialist sees what triage already collected). Structured slot state (extracted entities like customer_id, account_type, issue_category, eval scores from earlier turns). Hand-off context note (a short instruction the triage agent writes to the specialist about why the transfer is happening, what the customer's tone is, and what's been promised). Skip any of the three and the customer repeats themselves, which is the single most reliable predictor of CSAT failure on multi-agent voice calls. In LiveKit and Pipecat the hand-off mechanism is a structured event with all three fields; in OpenAI Agents SDK it's a handoff() call that carries explicit context.
How do you evaluate a multi-agent voice call?
Two layers, not one. Per-agent eval (did triage correctly identify intent, did the specialist resolve the issue, did the closing agent confirm next steps) using `task_completion` on each stage. End-to-end eval (did the whole call achieve its goal, would the customer come back) using `conversation_resolution` and `conversation_coherence` on the full multi-turn trace. Without per-agent scoring you know the call failed but not where. Without end-to-end scoring you know each agent did its job but the customer still hung up frustrated. FAGI's Error Feed clusters failures by which agent in the chain caused them.
What's the right latency budget split for a multi-agent voice call?
Sum to your overall budget. A typical 1-second per-turn budget on a transfer flow allocates roughly 200ms for the triage agent's classification turn, 600ms for the specialist's first response (the heaviest LLM call), and 200ms for closing or hand-back. Parallel sub-agents share the same wall clock; transcription, sentiment, and summarization run concurrently and the slowest wins. Sequential pipelines add up, with each stage spending real budget. traceAI captures per-agent span attributes so you can see exactly where the budget went and which agent is overrunning.
How does Future AGI handle multi-agent voice observability?
The whole transfer chain lands as one trace tree. Parent span is the session. Child spans represent each agent (triage, specialist, closer). Sub-spans break down per turn within an agent (STT, LLM, TTS, tool calls). The same Agent Definition that captured the call ingests eval scores at both layers: per-agent via `task_completion` and end-to-end via `conversation_resolution`. Error Feed clusters trace failures into named issues. Workflow Builder scenarios specifically test transfer paths to verify the specialist gets full context. Error Localization in Simulate pinpoints the exact failing transition turn.
Which open-source frameworks have the cleanest multi-agent voice primitives?
LiveKit and Pipecat lead on multi-agent in OSS. LiveKit's Agents framework supports agent hand-off as a first-class event with state passing. Pipecat's processor pattern composes pipelines that can swap specialists mid-call. Vapi's abstractions are simpler, with squad-style routing that fits straightforward triage flows. OpenAI Agents SDK is the strongest for cascade-pattern multi-agent in code, with handoff() and structured tool-style transfers. FAGI's traceAI-pipecat and traceai-livekit packages instrument both frameworks; the OpenAI Agents instrumentor (traceai-openai-agents) covers the SDK directly.
How do you attribute a failure to the right agent in the chain?
Trace structure plus per-stage eval scoring does most of the work. Each agent's span carries its own `task_completion` score. When the end-to-end `conversation_resolution` is low, walk back through the per-agent scores until you find the first one that dropped. The transition turns (where one agent hands off to the next) are the highest-leverage debug points. Error Localization in Simulate pinpoints the exact failing turn under simulation. Error Feed clusters production failures by which agent in the chain caused them, so you stop triaging individual calls and start triaging failure patterns.