Voice AI Observability for Retell AI: A 2026 Implementation Guide
Wire Retell AI observability the FAGI way: native dashboard via Assistant ID, optional traceAI SDK, eval engine on every call with audio + transcript.
Retell AI hosts the call. Your job is to know what happened in every conversation, score it against the rubrics that matter, and route the failures into something the on-call rotation can actually act on. This guide walks through the two paths Future AGI ships for Retell observability: a no-SDK dashboard path that wires through the Retell API, and an optional code-driven traceAI path for teams that want LLM-level span depth on top.
Step preview
- Wire your Retell assistant into a Future AGI Agent Definition via Retell API key + Assistant ID. Call log capture starts immediately.
- Verify the auto transcript, separate assistant and customer audio downloads, and the session timeline in the FAGI Observe project.
- Attach the named voice rubrics: audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion.
- Optionally install traceAI for the LLM provider that runs your prompt logic, so LLM and tool spans land in the same trace tree.
- Turn on Error Feed for auto-clustered failures and Future AGI Protect for inline guardrails.
The rest of this guide fills in each step with the code and the config.
Why Retell specifically
Retell AI ships the lowest-latency hosted voice stack in 2026. The wedge is the native coupling between the LLM and the streaming TTS at the runtime level. Most voice orchestration frameworks pipe through a separate TTS leg with its own buffering; Retell removes that hop by streaming LLM tokens straight into the TTS. The result is end-to-end turn latency at the low end of the category, which translates into conversations that feel less mechanical.
What Retell does not ship is an observability and eval layer that matches what production teams need. The Retell dashboard surfaces call logs, transcripts, and basic metrics. It does not score every call against multi-turn rubrics, it does not auto-cluster failures into named issues with root cause analysis, and it does not run inline guardrails on the LLM response. That gap is where FAGI sits.
The pattern we recommend: keep Retell as your call runtime, add FAGI as your observability and eval layer. The two compose cleanly because the native integration is API-driven and reads the call data Retell already exposes.
Step 1: Wire the Retell assistant into a FAGI Agent Definition
This is the dashboard path. No code.
Create the Agent Definition
In the FAGI console, open the Observe product and create a new project. Inside the project, create an Agent Definition. The form asks for:
- Agent name: free-text, what shows up in the call log table.
- Provider: pick Retell AI from the supported provider list. The natively supported providers are Vapi, Retell AI, and LiveKit.
- Provider API key: paste the Retell API key from your Retell account settings.
- Assistant ID: paste the Assistant ID from the Retell console.
- Observability toggle: enable.
Save the agent. FAGI handshakes with Retell to verify the API key + Assistant ID. If the handshake fails, the dashboard surfaces the error inline with a remediation hint.
What lands in the dashboard after save
The next call Retell handles for that assistant lands in FAGI within a few minutes. The Call Log table populates with the row. The row carries:
- Two separate audio files: one for assistant audio, one for customer audio. Both downloadable.
- The auto transcript: turn-by-turn alternating speaker rows with timestamps.
- The session timeline: call as a root span, turn boundaries rendered as child events.
- The tags panel: whatever metadata you passed through Retell’s call API.
This is the surface that needs zero code. You can stop here.
Tagging for KPI attribution
The Agent Definition supports custom attributes that ride into every captured call as tags. Set these on the Retell side via the call metadata API:
- customer_id: filter axis for per-account analysis.
- vertical: e.g. support, outbound_sales, appointment_booking.
- agent_version: lets you A/B compare prompt revisions.
- campaign_id: for outbound, links the call to the campaign.
- intent: top-level intent class.
These flow through the Retell API into the FAGI session and become filter axes in the Observe dashboard. They also become the cluster keys Error Feed uses when grouping failures.
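As a minimal sketch of the tagging step, the helper below checks that the attribution tags are present before they are attached to a call request. The function name `build_call_metadata`, the `REQUIRED_TAGS` and `OPTIONAL_TAGS` sets, and the shape of the returned dict are illustrative assumptions, not part of the Retell or FAGI SDKs; check Retell's call API reference for the exact metadata field.

```python
# Hypothetical helper: validates KPI attribution tags before a call is created.
# Tag names come from the list above; everything else is an assumption.
REQUIRED_TAGS = {"customer_id", "vertical", "agent_version", "intent"}
OPTIONAL_TAGS = {"campaign_id"}  # only meaningful for outbound calls


def build_call_metadata(**tags: str) -> dict[str, str]:
    """Return a metadata dict, raising if a tag is missing or unrecognized."""
    missing = REQUIRED_TAGS - set(tags)
    if missing:
        raise ValueError(f"missing required tags: {sorted(missing)}")
    unknown = set(tags) - REQUIRED_TAGS - OPTIONAL_TAGS
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    # Stringify values so they travel as simple filter keys.
    return {key: str(value) for key, value in tags.items()}


metadata = build_call_metadata(
    customer_id="cust_123",
    vertical="support",
    agent_version="4.2",
    intent="cancellation",
)
```

Failing fast on a missing tag is a design choice: a call that reaches the dashboard without `agent_version` cannot be attributed to a prompt revision after the fact.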
Step 2: Verify the call surface
Place a test call. Open the Call Log row in the FAGI dashboard. You should see four panels:
Audio panel: two players, one labelled Assistant, one labelled Customer. Each has its own waveform and a download button. The separation lets you debug a barge-in failure (interruption timing in the customer audio) or a TTS regression (clarity in the assistant audio) without listening to both legs mixed.
Transcript panel: turn-by-turn rows with speaker tags and timestamps. Hover a row to see the STT confidence Retell exposes for that turn.
Session timeline: horizontal trace tree rendering the call as the root. Without traceAI wired in yet, the timeline shows turn boundaries only. After step 4, LLM calls, tool invocations, and retrievals appear as nested child spans.
Tags panel: shows whatever metadata Retell passed through the call API.
If any of these panels is missing, the most common cause is an API key without the right scope. Re-check the Retell console permissions and re-save the Agent Definition.
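If you export call records for a scripted smoke test, a check like the following can flag which panel is missing. This is a sketch only: the record shape and the field names (`assistant_audio_url`, `customer_audio_url`, `transcript`, `timeline`, `tags`) are hypothetical and do not describe a documented FAGI export schema.

```python
# Hypothetical record shape; field names are illustrative assumptions.
REQUIRED_FIELDS = (
    "assistant_audio_url",
    "customer_audio_url",
    "transcript",
    "timeline",
)


def missing_panels(call_record: dict) -> list[str]:
    """Return the names of panels that are absent or empty in a call record."""
    missing = [name for name in REQUIRED_FIELDS if not call_record.get(name)]
    # Tags may legitimately be empty, so only a missing key counts as a gap.
    if "tags" not in call_record:
        missing.append("tags")
    return missing


record = {
    "assistant_audio_url": "https://example.com/assistant.wav",
    "customer_audio_url": "",
    "transcript": [{"speaker": "customer", "text": "Hi"}],
    "timeline": [{"event": "turn_start"}],
    "tags": {},
}
print(missing_panels(record))
```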
Step 3: Attach the named voice rubrics
The ai-evaluation SDK ships 70+ built-in eval templates under the Apache 2.0 license. Five of them carry most of the load on a Retell workload.
| Rubric | What it scores |
|---|---|
| audio_transcription | ASR drift on customer audio against the rendered transcript |
| audio_quality | TTS clarity and prosody on the assistant audio |
| conversation_coherence | Multi-turn coherence across the whole call |
| conversation_resolution | Did the call resolve the customer’s stated goal |
| task_completion | Did the agent complete its tool calls and workflow |
In the dashboard, open the project’s Evals tab. Add the five built-ins. They run on every captured call going forward. Past calls require an explicit backfill, which is one click.
If you’d rather keep the eval config in code, the pattern looks like this:
from fi.testcases import MLLMTestCase, MLLMAudio, ConversationalTestCase, LLMTestCase
from fi.evals import (
    Evaluator,
    AudioTranscriptionEvaluator,
    AudioQualityEvaluator,
    ConversationCoherence,
    ConversationResolution,
    TaskCompletion,
)

ev = Evaluator(
    fi_api_key="your-future-agi-api-key",
    fi_secret_key="your-future-agi-secret-key",
)

# Audio test case for the assistant leg of one captured call.
assistant_audio = MLLMAudio(url="https://fagi.example.com/calls/retell-abc123/assistant.wav")
audio_case = MLLMTestCase(input=assistant_audio, query="Score the assistant TTS leg")

# Multi-turn test case built from the transcript, one LLMTestCase per exchange.
conv = ConversationalTestCase(messages=[
    LLMTestCase(query="I want to cancel my subscription", response="I can help with that. Can I ask why?"),
    LLMTestCase(query="It's too expensive", response="Got it. We have a 30 percent discount option I can apply. Want to hear about it?"),
])

# Run the five voice rubrics against both test cases.
result = ev.evaluate(
    eval_templates=[
        AudioTranscriptionEvaluator(),
        AudioQualityEvaluator(),
        ConversationCoherence(),
        ConversationResolution(),
        TaskCompletion(),
    ],
    inputs=[audio_case, conv],
)
MLLMAudio accepts seven formats: .mp3, .wav, .ogg, .m4a, .aac, .flac, .wma. It takes URLs or local paths and base64-encodes the audio automatically. That covers whatever Retell’s call API exposes.
Why these five rubrics for Retell
audio_transcription matters more on Retell than on most stacks because Retell’s STT path is tightly coupled to its LLM streaming. When ASR drifts on accented or noisy audio, the LLM ingests the wrong text and the downstream eval scores look fine while the customer experience is broken. This rubric scores the actual audio against the rendered transcript and flags the drift.
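To make "ASR drift" concrete, here is a small sketch that computes word error rate (WER) between a human-checked reference transcript and the transcript the pipeline rendered. The source does not say how audio_transcription scores drift internally, so WER is swapped in here purely as a familiar illustration; the function name and the example sentences are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One misheard word out of three changes the intent the LLM sees.
drift = word_error_rate("cancel my subscription", "cancel my prescription")
```

A single substituted word yields a modest WER but can flip the intent entirely, which is why a transcript-only eval can look healthy while the call goes wrong.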
audio_quality catches TTS regressions on the assistant leg. Retell’s TTS coupling is a strength when it’s working and a single point of failure when it isn’t. If voice clarity drops after a provider switch or a model update, this rubric surfaces it before the customer complaints do.
conversation_coherence and conversation_resolution are the multi-turn rubrics. They run on the full transcript via ConversationalTestCase. Coherence catches the assistant contradicting itself across turns; resolution catches calls where the assistant sounded fine but didn’t actually solve the problem.
task_completion is the agent-side success rubric: did the assistant complete the tool calls and the workflow it was supposed to complete? This pairs with business-side metrics like FCR and containment rate.
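For teams that want to track the business-side metrics next to the rubric scores, the sketch below computes containment rate and first contact resolution (FCR) from a list of call outcomes. The `CallOutcome` fields are hypothetical, and the definitions used (containment as the share of calls not transferred to a human, FCR as the share of first contacts that were resolved) are one common convention; your team may define them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallOutcome:
    # Hypothetical fields; map them from your own call tags or CRM data.
    resolved: bool
    transferred_to_human: bool
    is_repeat_contact: bool


def containment_rate(calls: list[CallOutcome]) -> float:
    """Share of calls the assistant handled without a human transfer."""
    if not calls:
        raise ValueError("no calls to score")
    return sum(not c.transferred_to_human for c in calls) / len(calls)


def first_contact_resolution(calls: list[CallOutcome]) -> float:
    """Share of first contacts (non-repeat calls) that ended resolved."""
    first_contacts = [c for c in calls if not c.is_repeat_contact]
    if not first_contacts:
        raise ValueError("no first-contact calls to score")
    return sum(c.resolved for c in first_contacts) / len(first_contacts)


calls = [
    CallOutcome(resolved=True, transferred_to_human=False, is_repeat_contact=False),
    CallOutcome(resolved=False, transferred_to_human=True, is_repeat_contact=False),
    CallOutcome(resolved=True, transferred_to_human=False, is_repeat_contact=True),
    CallOutcome(resolved=True, transferred_to_human=False, is_repeat_contact=False),
]
```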
Step 4: Add traceAI for richer LLM-level spans (optional)
The dashboard path gives you call-level visibility. If you want turn-level depth on the LLM side, you wire traceAI inside the service that hosts your LLM logic. Retell calls into your LLM endpoint for each turn, and that endpoint is where the instrumentation attaches.
traceAI ships 30+ documented integrations across Python + TypeScript, OpenInference-compatible, Apache 2.0.
import os
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="retell_support_agent",
    set_global_tracer_provider=True,
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
For Anthropic, swap OpenAIInstrumentor for AnthropicInstrumentor from traceai_anthropic. For LiteLLM-routed LLM calls, traceai_litellm.LiteLLMInstrumentor.
Joining traceAI spans to the Retell call
The Retell call ID is the join key. Pass it into your LLM endpoint as a header or metadata field, and write it on the root span:
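from fi_instrumentation import FITracer
from openai import OpenAI

# trace_provider comes from the register() call in the previous snippet.
tracer = FITracer(trace_provider.get_tracer(__name__))
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def handle_retell_turn(retell_call_id, customer_id, agent_version, messages):
    # Root span for one Retell turn; conversation_id carries the join key.
    with tracer.start_as_current_span(
        "retell_turn",
        attributes={
            "conversation_id": retell_call_id,
            "customer_id": customer_id,
            "agent_version": agent_version,
            "channel": "voice",
            "provider": "retell",
        },
    ):
        return openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )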
LLM spans now land in the same FAGI project as the Retell call sessions. The dashboard renders them under the same root in the trace tree, so you can drill from a call row into the exact LLM turn that produced a low conversation_coherence score.
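How the call ID reaches your endpoint depends on how you deploy it. As a hedged sketch, the helper below looks for the ID in a request header first and in the request body second. The header name `x-retell-call-id` and the body field `call_id` are hypothetical placeholders chosen for illustration, not documented Retell names; substitute whatever your integration actually sends.

```python
from typing import Mapping, Optional

# Hypothetical names; replace with the header or field your deployment uses.
CALL_ID_HEADER = "x-retell-call-id"
CALL_ID_FIELD = "call_id"


def resolve_call_id(
    headers: Mapping[str, str], payload: Mapping[str, object]
) -> Optional[str]:
    """Return the call ID from the header if present, else from the body."""
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    header_value = lowered.get(CALL_ID_HEADER, "").strip()
    if header_value:
        return header_value
    body_value = payload.get(CALL_ID_FIELD)
    if isinstance(body_value, str) and body_value:
        return body_value
    return None  # caller decides whether to reject or trace without a join key


call_id = resolve_call_id({"X-Retell-Call-Id": " abc123 "}, {})
```

Returning `None` rather than raising leaves it to the endpoint to decide whether an untraceable turn should still be served.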
When to skip the SDK path
If your team is happy with call-level observability, skip step 4. The dashboard path covers most production voice debugging on Retell. Add traceAI only when:
- You need tool call arguments visible at span level.
- You’re A/B testing prompt revisions and need turn-level eval differentials.
- You’re integrating with a RAG retrieval layer and need retrieval spans on the trace tree.
- You need to debug LLM latency at per-LLM-call granularity.
For most support and outbound use cases, the dashboard path is enough.
Step 5: Turn on Error Feed and inline Protect
This is where the loop closes.
Error Feed auto-clusters Retell failures
Error Feed is the zero-config error monitoring layer in the FAGI Observe product. It detects errors across five categories: factual grounding failures, tool crashes, broken workflows, safety violations, and reasoning gaps. It auto-clusters them into named issues with auto-written root cause, supporting span evidence, a quick fix to ship today, and a long-term recommendation.
For Retell workloads, the common clusters look like this:
- “TTS prosody drift after voice ID change” clusters audio quality regressions and points at the voice ID switch.
- “STT confidence drop on Indian English” clusters mistranscriptions, points at the accent group, suggests a per-accent threshold tweak.
- “Tool argument schema mismatch in book_appointment” clusters failed tool calls, points at the prompt drift, suggests a patch.
- “Resolution drop on cancellation intent after agent_version 4.2” clusters resolution failures by intent and agent version, points at the prompt revision that caused the regression.
You don’t write these names. The clustering layer writes them.
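To build intuition for how call tags act as grouping keys, the sketch below counts failed calls by `intent` and `agent_version`. This is a plain group-by for illustration only; it is not Error Feed's clustering algorithm, and the record shape is an assumption.

```python
from collections import Counter


def count_failures_by(
    failed_calls: list[dict], keys: tuple[str, ...] = ("intent", "agent_version")
) -> list[tuple[tuple[str, ...], int]]:
    """Count failed calls per tag combination, most frequent first."""
    counts = Counter(
        tuple(call.get("tags", {}).get(key, "unknown") for key in keys)
        for call in failed_calls
    )
    return counts.most_common()


failed_calls = [
    {"tags": {"intent": "cancellation", "agent_version": "4.2"}},
    {"tags": {"intent": "cancellation", "agent_version": "4.2"}},
    {"tags": {"intent": "billing", "agent_version": "4.1"}},
]
top_group, top_count = count_failures_by(failed_calls)[0]
```

A count like this only works if the tags were passed on every call, which is one reason the metadata step earlier in the guide matters.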
Inline guardrails via Future AGI Protect
If your Retell assistant runs in a regulated workflow, inline content moderation matters. The Future AGI Protect model family runs sub-100ms inline. Foundation is Gemma 3n with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance), multi-modal across text, image, and audio.
The integration sits inside your LLM endpoint, between the LLM response and Retell’s TTS leg:
from fi.evals import Protect

p = Protect()

def safe_reply(user_text, agent_text):
    out = p.protect(
        inputs={"input": user_text, "output": agent_text},
        protect_rules=[
            {"metric": "content_moderation"},
            {"metric": "security"},
            {"metric": "data_privacy_compliance"},
        ],
    )
    if out.blocked:
        return "I'm sorry, I can't help with that. Let me transfer you to a human agent."
    return agent_text
For the fastest path:
out = p.protect(
    inputs={"input": user_text, "output": agent_text},
)
ProtectFlash returns a single harmful or not-harmful verdict in one call. The verdict lands on the FAGI span, so the trust team can review denied responses in Error Feed.
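Because the guardrail sits on the hot path before TTS, it can help to measure what it costs per turn. The following sketch wraps any guard callable with a timer and flags calls that exceed a latency budget. The 100 ms default mirrors the sub-100ms figure above; `guarded_reply`, `GuardResult`, and the keyword-based stub checker are hypothetical stand-ins and do not call the Protect API.

```python
import time
from typing import Callable, NamedTuple

FALLBACK = "I'm sorry, I can't help with that. Let me transfer you to a human agent."


class GuardResult(NamedTuple):
    reply: str
    blocked: bool
    elapsed_ms: float
    over_budget: bool


def guarded_reply(
    agent_text: str,
    is_harmful: Callable[[str], bool],
    budget_ms: float = 100.0,
) -> GuardResult:
    """Run a guard check on the LLM reply and record how long it took."""
    start = time.perf_counter()
    blocked = is_harmful(agent_text)
    elapsed_ms = (time.perf_counter() - start) * 1000
    reply = FALLBACK if blocked else agent_text
    return GuardResult(reply, blocked, elapsed_ms, elapsed_ms > budget_ms)


# Stub checker for the example; a real deployment would call the guard model.
def stub_checker(text: str) -> bool:
    return "social security number" in text.lower()


result = guarded_reply("Your appointment is booked for Tuesday.", stub_checker)
```

Logging `over_budget` alongside the verdict lets you spot turns where the guard, not the LLM, dominated the response time.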
A full reference architecture
Putting it all together, a production Retell observability stack on FAGI looks like this:
+-------------------+        +---------------------+        +-------------------+
| Retell AI runtime | -----> | Your LLM Endpoint   | -----> | LLM Provider      |
| (LLM + TTS coupl) |        | + traceAI instr     |        | (OpenAI, Anthropic|
|                   |        | + Protect inline    |        |  etc.)            |
+-------------------+        +----------+----------+        +-------------------+
          |                             |
          |                             | OpenInference spans
          |                             v
          |              +----------------------------+
          |              | FAGI Observe project       |
          +------------> | - native Retell integration|
        call log + audio | - 70+ built-in rubrics     |
        + transcript     | - Error Feed clustering    |
                         | - inline Protect verdicts  |
                         +----------------------------+
                                        |
                                        v
                         +----------------------------+
                         | Agent Command Center       |
                         | - RBAC, BYOC, multi-region |
                         | - SOC 2 + HIPAA + GDPR     |
                         |   + CCPA + ISO 27001       |
                         +----------------------------+
Retell owns the call runtime. Your LLM endpoint is where traceAI and inline Protect attach. The FAGI Observe project receives both surfaces (call-level from Retell, span-level from the endpoint) and joins them under one session. Agent Command Center hosts the whole stack with RBAC, multi-region or BYOC, and the cert set listed on futureagi.com/trust: SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001.
Calibrated honesty: where Retell genuinely wins
Retell AI is the lowest-latency hosted voice stack in 2026. That wedge matters in three concrete ways:
End-to-end turn latency. The native LLM and TTS coupling removes a buffering hop most orchestration vendors keep. For consumer-facing voice where every 100 ms of latency trades for lower conversion or higher abandonment, this is a real difference.
Operational simplicity. Retell hosts the whole stack. You don’t run STT, LLM, and TTS as separate services; Retell binds them together. For teams that want to ship voice without operating several moving pieces, that’s a clean trade.
Telephony depth. Retell’s telephony integration is mature, with support for the common providers and a clean inbound and outbound setup. For high-volume calling workloads, that matters.
What Retell does not ship is the deep observability, eval, clustering, and inline guardrail layer that production teams need on top. That’s the gap FAGI fills. Retell runs the call. FAGI watches it, scores it, clusters the failures, and guards the LLM output.
Two deliberate tradeoffs
Async eval gating is explicit. FAGI never auto-rewrites prompts in production without an explicit run plus a human approval gate. The Dataset UI ships UI-driven optimization across all six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard); the agent-opt Python library exposes the same six for programmatic control. Either way, the loop stays explicit: point the run at a dataset, pick an evaluator, pick the optimizer, then promote a candidate by hand.
Native voice observability ships for Vapi, Retell, and LiveKit; everything else routes through traceAI or Enable Others. The provider-API-key dashboard path covers the three runtimes most production teams pick. Other runtimes (Synthflow, Bland, Pipecat, custom RTP) land through the traceAI SDK (from fi_instrumentation import register plus from fi_instrumentation.fi_types import ProjectType) or via Enable Others mode with mobile-number simulation. The dashboard surface keeps changing with every release: multi-step Agent Definition UX, Prompt Workbench Revamp, redesigned Run Test performance metrics, Show Reasoning column in Simulate, sticky filters in Observe, scenario generation with branch visibility, and Error Localization that pinpoints the failing turn.
Common pitfalls when wiring Retell observability
Don’t paste the test-mode API key into the Agent Definition. Retell separates test and production keys. The call log capture path needs the production key. The dashboard surfaces a clear error when the key scope is wrong, but it’s the most common first-time mistake.
Don’t skip metadata tagging on the Retell side. If you don’t pass customer_id, vertical, agent_version, and intent through Retell’s call API, you lose the KPI attribution layer on the FAGI side. Set them once when you configure the assistant, and every call carries them.
Don’t try to instrument inside Retell’s runtime. You can only instrument LLM calls you control. If you’re using Retell’s hosted LLM option, traceAI can’t wrap it. The fix is to run your own LLM endpoint that Retell calls into, which is the default pattern for teams that want prompt versioning or provider failover.
Don’t run all five rubrics on every call from day one. Start with conversation_resolution and task_completion. They give you the highest-signal failure modes for the lowest cost. Add audio_transcription and audio_quality once you’ve seen a TTS or STT regression. Add conversation_coherence once you have enough multi-turn data to make the score stable.
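The staged rollout described above can be written down as a small rule so the decision is explicit rather than remembered. This is a sketch under assumptions: the function name and the `min_multi_turn_calls` default of 500 are placeholders, since the guide does not specify how much multi-turn data makes the coherence score stable.

```python
def rubrics_for_stage(
    seen_audio_regression: bool,
    multi_turn_calls: int,
    min_multi_turn_calls: int = 500,  # placeholder threshold, tune for your traffic
) -> list[str]:
    """Pick which rubrics to run given how far the rollout has progressed."""
    # Always start with the two highest-signal rubrics.
    rubrics = ["conversation_resolution", "task_completion"]
    # Audio rubrics come in once a TTS or STT regression has been observed.
    if seen_audio_regression:
        rubrics += ["audio_transcription", "audio_quality"]
    # Coherence waits for enough multi-turn data to give a stable score.
    if multi_turn_calls >= min_multi_turn_calls:
        rubrics.append("conversation_coherence")
    return rubrics


day_one = rubrics_for_stage(seen_audio_regression=False, multi_turn_calls=40)
```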
Don’t disable Error Feed because the first day looks empty. It needs a few days of traffic to populate the named issue list. Once volume crosses a threshold, the clusters start surfacing.
Don’t rely on transcript-only debugging. Retell’s tight coupling between STT, LLM, and TTS means failures can hide in the audio layer while the transcript looks fine. Always run audio_transcription and audio_quality once you’re past the first week of traffic.
When you’ve outgrown this setup
Once the dashboard path, eval rubrics, Error Feed, and inline Protect are running cleanly, the next move is to add simulation. FAGI’s simulation product ships 18 pre-built personas plus unlimited custom-authored personas. Custom personas configure name, gender, age range (18-25 / 25-32 / 32-40 / 40-50 / 50-60 / 60+), location (US / Canada / UK / Australia / India), personality traits, communication style, accent, conversation speed, background noise, multilingual toggle, custom properties, and free-form behavioral instructions. The Workflow Builder (Conversation Node, End Call Node, Transfer Call Node) auto-generates branching scenarios (20, 50, or 100 rows) with branch visibility; Dataset scenarios accept CSV, JSON, and Excel uploads or synthetic generation; script-based runs cover deterministic regression. The 4-step Run Tests wizard runs the suite against your Retell assistant, Error Localization pinpoints the exact failing turn, and the Show Reasoning column surfaces eval rationale per scenario. The Tool Calling eval and programmatic eval API cover CI integration; custom voices from ElevenLabs and Cartesia plug into Run Prompt and Experiments; Indian phone number simulation ships as a configurable region.
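If you keep persona definitions in version control before entering them in the UI, a small validated structure can catch typos early. The sketch below encodes the age ranges, locations, and scenario row counts listed above; the `Persona` class, its field names, and its defaults are hypothetical and do not reflect FAGI's actual persona schema.

```python
from dataclasses import dataclass

# Allowed values copied from the options listed in the text above.
AGE_RANGES = ("18-25", "25-32", "32-40", "40-50", "50-60", "60+")
LOCATIONS = ("US", "Canada", "UK", "Australia", "India")
SCENARIO_ROW_COUNTS = (20, 50, 100)


@dataclass(frozen=True)
class Persona:
    # Hypothetical field names for a locally stored persona definition.
    name: str
    age_range: str
    location: str
    communication_style: str = "neutral"
    background_noise: bool = False
    behavioral_instructions: str = ""

    def __post_init__(self) -> None:
        if self.age_range not in AGE_RANGES:
            raise ValueError(f"age_range must be one of {AGE_RANGES}")
        if self.location not in LOCATIONS:
            raise ValueError(f"location must be one of {LOCATIONS}")


impatient_caller = Persona(
    name="Impatient caller",
    age_range="32-40",
    location="India",
    communication_style="terse",
    background_noise=True,
    behavioral_instructions="Interrupts the agent when answers run long.",
)
```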
The same Agent Definition you wired in step 1 plugs into Simulate. The same eval rubrics run on simulated calls. The same Error Feed clusters scenario failures alongside production failures. That’s the unified surface.
The other natural extension is closing the loop into optimization. The Dataset UI ships all six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard); the agent-opt Python library exposes the same six for CI-driven runs. Both read the same trace data the dashboard renders and propose prompt revisions against live failure patterns. The pattern is explicit by design. Point a run at a dataset, pick an evaluator and optimizer, review candidates, promote by hand. Turn it on after the first month, once your eval baselines stabilize and the failure modes are well understood.
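The "promote by hand" rule can be pictured as a gate that refuses to promote a candidate prompt without a named approver. This is a conceptual sketch only: `PromptCandidate`, `promote`, and the baseline comparison are hypothetical and are not the agent-opt library's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptCandidate:
    prompt: str
    eval_score: float
    approved_by: Optional[str] = None  # set only after a human review


def promote(candidate: PromptCandidate, baseline_score: float) -> str:
    """Return the prompt to deploy, or raise if the gate is not satisfied."""
    if candidate.approved_by is None:
        raise PermissionError("candidate needs a human approver before promotion")
    if candidate.eval_score <= baseline_score:
        raise ValueError("candidate does not beat the baseline eval score")
    return candidate.prompt


candidate = PromptCandidate(prompt="You are a concise support agent.", eval_score=0.82)
candidate.approved_by = "reviewer@example.com"
deployed_prompt = promote(candidate, baseline_score=0.78)
```

Keeping the approval check ahead of the score check means an unreviewed candidate is rejected even when its score looks good.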
For a deeper walkthrough of the simulation side, see the voice agent scenario guide. For the broader production monitoring playbook, see how to monitor AI voice agents in production.
Related reading
- Voice AI Observability for Vapi: A 2026 Implementation Guide
- How to monitor AI voice agents in production: a 2026 playbook
- 7 best voice agent monitoring platforms in 2026
- How to implement voice AI observability in 2026
Sources and references
- traceAI on GitHub: github.com/future-agi/traceAI
- ai-evaluation on GitHub: github.com/future-agi/ai-evaluation
- Error Feed docs: docs.futureagi.com/docs/observe
- Future AGI Protect docs: docs.futureagi.com/docs/protect
- Agent Command Center docs: docs.futureagi.com/docs/command-center
- arXiv 2510.13351 (Protect): arxiv.org/abs/2510.13351
- arXiv 2507.19457 (GEPA): arxiv.org/abs/2507.19457
- Trust page (SOC 2 + HIPAA + GDPR + CCPA + ISO 27001): futureagi.com/trust
- OpenInference spec: github.com/Arize-ai/openinference
- Retell AI
Frequently asked questions
Does Retell AI need an SDK to integrate with Future AGI?
Which eval rubrics should I run on Retell calls?
Why does Retell often hit lower latency than other voice stacks?
Will Retell's call recordings download separately for assistant and customer?
How does Error Feed handle Retell call failures?
What latency does inline Future AGI Protect add on top of Retell's stack?
Can I use FAGI's Retell observability alongside other voice providers in the same project?