Voice AI Observability for Retell AI: A 2026 Implementation Guide

Wire Retell AI observability the FAGI way: native dashboard via Assistant ID, optional traceAI SDK, eval engine on every call with audio + transcript.


Retell AI hosts the call. Your job is to know what happened in every conversation, score it against the rubrics that matter, and route the failures into something the on-call rotation can actually act on. This guide walks through the two paths Future AGI ships for Retell observability: a no-SDK dashboard path that wires through the Retell API, and an optional code-driven traceAI path for teams that want LLM-level span depth on top.

Step preview

  1. Wire your Retell assistant into a Future AGI Agent Definition via Retell API key + Assistant ID. Call log capture starts immediately.
  2. Verify the auto transcript, separate assistant and customer audio downloads, and the session timeline in the FAGI Observe project.
  3. Attach the named voice rubrics: audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion.
  4. Optionally install traceAI for the LLM provider that runs your prompt logic, so LLM and tool spans land in the same trace tree.
  5. Turn on Error Feed for auto-clustered failures and Future AGI Protect for inline guardrails.

The rest of this guide fills in each step with the code and the config.

Why Retell specifically

Retell AI ships the lowest-latency hosted voice stack in 2026. The wedge is the native coupling between the LLM and the streaming TTS at the runtime level. Most voice orchestration frameworks pipe through a separate TTS leg with its own buffering; Retell removes that hop by streaming LLM tokens straight into the TTS. The result is end-to-end turn latency at the low end of the category, which translates into conversations that feel less mechanical.

What Retell does not ship is an observability and eval layer that matches what production teams need. The Retell dashboard surfaces call logs, transcripts, and basic metrics. It does not score every call against multi-turn rubrics, it does not auto-cluster failures into named issues with root cause analysis, and it does not run inline guardrails on the LLM response. That gap is where FAGI sits.

The pattern we recommend: keep Retell as your call runtime, add FAGI as your observability and eval layer. The two compose cleanly because the native integration is API-driven and reads the call data Retell already exposes.

Step 1: Wire the Retell assistant into a FAGI Agent Definition

This is the dashboard path. No code.

Create the Agent Definition

In the FAGI console, open the Observe product and create a new project. Inside the project, create an Agent Definition. The form asks for:

  • Agent name: free-text, what shows up in the call log table.
  • Provider: pick Retell AI from the supported provider list. The natively supported providers are Vapi, Retell AI, and LiveKit.
  • Provider API key: paste the Retell API key from your Retell account settings.
  • Assistant ID: paste the Assistant ID from the Retell console.
  • Observability toggle: enable.

Save the agent. FAGI handshakes with Retell to verify the API key + Assistant ID. If the handshake fails, the dashboard surfaces the error inline with a remediation hint.

What lands in the dashboard after save

The next call Retell handles for that assistant lands in FAGI within a few minutes. The Call Log table populates with the row. The row carries:

  • Two separate audio files: one for assistant audio, one for customer audio. Both downloadable.
  • The auto transcript: turn-by-turn alternating speaker rows with timestamps.
  • The session timeline: call as a root span, turn boundaries rendered as child events.
  • The tags panel: whatever metadata you passed through Retell’s call API.

This is the surface that needs zero code. You can stop here.

Tagging for KPI attribution

The Agent Definition supports custom attributes that ride into every captured call as tags. Set these on the Retell side via the call metadata API:

  • customer_id: filter axis for per-account analysis.
  • vertical: e.g. support, outbound_sales, appointment_booking.
  • agent_version: lets you A/B compare prompt revisions.
  • campaign_id: for outbound, links the call to the campaign.
  • intent: top-level intent class.

These flow through the Retell API into the FAGI session and become filter axes in the Observe dashboard. They also become the cluster keys Error Feed uses when grouping failures.
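
A minimal sketch of the tag payload, assuming you assemble it in Python before handing it to Retell as call metadata. Only the five tag names come from the list above; the helper name build_call_tags, the required-versus-optional split, and the sample values are illustrative assumptions, not part of either product.

```python
# Hypothetical helper: assembles the tag payload described above.
# The tag names come from this guide; the function and its validation
# rules are illustrative assumptions.
REQUIRED_TAGS = ("customer_id", "vertical", "agent_version", "intent")
OPTIONAL_TAGS = ("campaign_id",)  # only meaningful for outbound calls


def build_call_tags(**tags: str) -> dict[str, str]:
    """Return a flat string-to-string dict, or raise if a tag is missing or unknown."""
    allowed = set(REQUIRED_TAGS) | set(OPTIONAL_TAGS)
    unknown = sorted(set(tags) - allowed)
    if unknown:
        raise ValueError(f"unknown tags: {unknown}")
    missing = [name for name in REQUIRED_TAGS if not tags.get(name)]
    if missing:
        raise ValueError(f"missing required tags: {missing}")
    # Keep values as plain strings so they behave predictably as filter axes.
    return {name: str(value) for name, value in tags.items()}


tags = build_call_tags(
    customer_id="acct_1042",
    vertical="support",
    agent_version="4.2",
    intent="cancellation",
)
```

Rejecting unknown keys is a design choice here: a typo such as agent_verison would otherwise create a silent second filter axis.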

Step 2: Verify the call surface

Place a test call. Open the Call Log row in the FAGI dashboard. You should see four panels:

Audio panel: two players, one labelled Assistant, one labelled Customer. Each has its own waveform and a download button. The separation lets you debug a barge-in failure (interruption timing in the customer audio) or a TTS regression (clarity in the assistant audio) without listening to both legs mixed.

Transcript panel: turn-by-turn rows with speaker tags and timestamps. Hover a row to see the STT confidence Retell exposes for that turn.

Session timeline: horizontal trace tree rendering the call as the root. Without traceAI wired in yet, the timeline shows turn boundaries only. After step 4, LLM calls, tool invocations, and retrievals appear as nested child spans.

Tags panel: shows whatever metadata Retell passed through the call API.

If any of these panels is missing, the most common cause is an API key without the right scope. Re-check the Retell console permissions and re-save the Agent Definition.

Step 3: Attach the named voice rubrics

The ai-evaluation SDK ships 70+ built-in eval templates under the Apache 2.0 license. Five of them carry most of the load on a Retell workload.

Rubric                    What it scores
audio_transcription       ASR drift on customer audio against the rendered transcript
audio_quality             TTS clarity and prosody on the assistant audio
conversation_coherence    Multi-turn coherence across the whole call
conversation_resolution   Did the call resolve the customer’s stated goal
task_completion           Did the agent complete its tool calls and workflow

In the dashboard, open the project’s Evals tab. Add the five built-ins. They run on every captured call going forward. Past calls require an explicit backfill, which is one click.

If you’d rather keep the eval config in code, the pattern looks like this:

from fi.testcases import MLLMTestCase, MLLMAudio, ConversationalTestCase, LLMTestCase
from fi.evals import (
    Evaluator,
    AudioTranscriptionEvaluator,
    AudioQualityEvaluator,
    ConversationCoherence,
    ConversationResolution,
    TaskCompletion,
)

ev = Evaluator(
    fi_api_key="your-future-agi-api-key",
    fi_secret_key="your-future-agi-secret-key",
)

assistant_audio = MLLMAudio(url="https://fagi.example.com/calls/retell-abc123/assistant.wav")
audio_case = MLLMTestCase(input=assistant_audio, query="Score the assistant TTS leg")

conv = ConversationalTestCase(messages=[
    LLMTestCase(query="I want to cancel my subscription", response="I can help with that. Can I ask why?"),
    LLMTestCase(query="It's too expensive", response="Got it. We have a 30 percent discount option I can apply. Want to hear about it?"),
])

result = ev.evaluate(
    eval_templates=[
        AudioTranscriptionEvaluator(),
        AudioQualityEvaluator(),
        ConversationCoherence(),
        ConversationResolution(),
        TaskCompletion(),
    ],
    inputs=[audio_case, conv],
)

MLLMAudio accepts seven formats: .mp3, .wav, .ogg, .m4a, .aac, .flac, .wma. It takes URLs or local paths and base64-encodes them automatically. That covers whatever Retell’s call API exposes.
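
If you build audio test cases in bulk, a cheap pre-check on the file extension saves a failed eval run. The sketch below uses only the standard library and the seven extensions listed above; is_supported_audio is a hypothetical helper, not an ai-evaluation function.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# The seven extensions listed above.
SUPPORTED_AUDIO = {".mp3", ".wav", ".ogg", ".m4a", ".aac", ".flac", ".wma"}


def is_supported_audio(location: str) -> bool:
    """True if a URL or local path ends in one of the supported extensions."""
    # urlparse separates out query strings and fragments (for example
    # signed-URL tokens); a plain relative path comes back unchanged.
    path = urlparse(location).path
    return PurePosixPath(path).suffix.lower() in SUPPORTED_AUDIO
```

The check looks at the extension only. It does not inspect the file contents, so a mislabeled file still gets through.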

Why these five rubrics for Retell

audio_transcription matters more on Retell than on most stacks because Retell’s STT path is tightly coupled to its LLM streaming. When ASR drifts on accented or noisy audio, the LLM ingests the wrong text and the downstream eval scores look fine while the customer experience is broken. This rubric scores the actual audio against the rendered transcript and flags the drift.

audio_quality catches TTS regressions on the assistant leg. Retell’s TTS coupling is a strength when it’s working and a single point of failure when it isn’t. If voice clarity drops after a provider switch or a model update, this rubric surfaces it before the customer complaints do.

conversation_coherence and conversation_resolution are the multi-turn rubrics. They run on the full transcript via ConversationalTestCase. Coherence catches the assistant contradicting itself across turns; resolution catches calls where the assistant sounded fine but didn’t actually solve the problem.

task_completion is the agent-side success rubric. It asks whether the assistant completed the tool calls and the workflow it was supposed to complete. This pairs with business-side metrics like FCR and containment rate.

Step 4: Add traceAI for richer LLM-level spans (optional)

The dashboard path gives you call-level visibility. If you want turn-level depth on the LLM side, you wire traceAI inside the service that hosts your LLM logic. Retell calls into your LLM endpoint for each turn, and that endpoint is where the instrumentation attaches.

traceAI ships 30+ documented integrations across Python and TypeScript. It is OpenInference-compatible and licensed under Apache 2.0.

import os

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="retell_support_agent",
    set_global_tracer_provider=True,
)

OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

For Anthropic, swap OpenAIInstrumentor for AnthropicInstrumentor from traceai_anthropic. For LiteLLM-routed LLM calls, use traceai_litellm.LiteLLMInstrumentor.

Joining traceAI spans to the Retell call

The Retell call ID is the join key. Pass it into your LLM endpoint as a header or metadata field, and write it on the root span:

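from fi_instrumentation import FITracer
from openai import OpenAI

# trace_provider comes from the register() call in the previous snippet.
tracer = FITracer(trace_provider.get_tracer(__name__))
# The OpenAI client reads OPENAI_API_KEY from the environment.
openai_client = OpenAI()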

def handle_retell_turn(retell_call_id, customer_id, agent_version, messages):
    with tracer.start_as_current_span(
        "retell_turn",
        attributes={
            "conversation_id": retell_call_id,
            "customer_id": customer_id,
            "agent_version": agent_version,
            "channel": "voice",
            "provider": "retell",
        },
    ):
        return openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )

LLM spans now land in the same FAGI project as the Retell call sessions. The dashboard renders them under the same root in the trace tree, so you can drill from a call row into the exact LLM turn that produced a low conversation_coherence score.
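
How the call ID reaches your endpoint depends on your own request contract. As a sketch under that assumption, the helper below checks a header first and a metadata field second. The header name x-retell-call-id and the metadata key call_id are hypothetical placeholders, so substitute whatever your endpoint actually receives.

```python
from typing import Mapping, Optional

# Hypothetical names: use whatever header and metadata key your own
# endpoint contract defines.
CALL_ID_HEADER = "x-retell-call-id"
CALL_ID_METADATA_KEY = "call_id"


def resolve_call_id(
    headers: Mapping[str, str],
    metadata: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the Retell call ID from the header first, then from metadata."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    call_id = lowered.get(CALL_ID_HEADER) or (metadata or {}).get(CALL_ID_METADATA_KEY)
    if not call_id:
        # Without the join key the span has nothing to attach to.
        raise ValueError("no Retell call ID on the request")
    return call_id
```

Raising on a missing ID is deliberate in this sketch: an orphaned span is harder to notice than a failed request during setup. Loosen it if you would rather trace without the join.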

When to skip the SDK path

If your team is happy with call-level observability, skip step 4. The dashboard path covers most production voice debugging on Retell. Add traceAI only when:

  • You need tool call arguments visible at span level.
  • You’re A/B testing prompt revisions and need turn-level eval differentials.
  • You’re integrating with a RAG retrieval layer and need retrieval spans on the trace tree.
  • You need to debug latency for each individual LLM call, not just for the phone call as a whole.

For most support and outbound use cases, the dashboard path is enough.

Step 5: Turn on Error Feed and inline Protect

This is where the loop closes.

Error Feed auto-clusters Retell failures

Error Feed is the zero-config error monitoring layer in the FAGI Observe product. It detects errors across five categories: factual grounding failures, tool crashes, broken workflows, safety violations, and reasoning gaps. It auto-clusters them into named issues with auto-written root cause, supporting span evidence, a quick fix to ship today, and a long-term recommendation.

For Retell workloads, the common clusters look like this:

  • “TTS prosody drift after voice ID change” clusters audio quality regressions and points at the voice ID switch.
  • “STT confidence drop on Indian English” clusters mistranscriptions, points at the accent group, suggests a per-accent threshold tweak.
  • “Tool argument schema mismatch in book_appointment” clusters failed tool calls, points at the prompt drift, suggests a patch.
  • “Resolution drop on cancellation intent after agent_version 4.2” clusters resolution failures by intent and agent version, points at the prompt revision that caused the regression.

You don’t write these names. The clustering layer writes them.
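
To see why the tags from step 1 matter here, consider a toy grouping of failed calls by rubric, intent, and agent version. This is not how Error Feed clusters failures; it is a minimal sketch with hypothetical records, showing only that consistent tags turn scattered failures into countable groups.

```python
from collections import Counter

# Hypothetical failed-call records carrying the tags set on the Retell side.
failed_calls = [
    {"intent": "cancellation", "agent_version": "4.2", "rubric": "conversation_resolution"},
    {"intent": "cancellation", "agent_version": "4.2", "rubric": "conversation_resolution"},
    {"intent": "booking", "agent_version": "4.2", "rubric": "task_completion"},
    {"intent": "cancellation", "agent_version": "4.1", "rubric": "conversation_resolution"},
]


def group_failures(calls, keys=("rubric", "intent", "agent_version")):
    """Count failures per combination of tag values, largest group first."""
    counts = Counter(tuple(call[key] for key in keys) for call in calls)
    return counts.most_common()


groups = group_failures(failed_calls)
# The largest group is the resolution failures on cancellation intent at 4.2.
```

If agent_version or intent were missing from half the calls, the same failures would split across groups and the regression would be harder to spot.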

Inline guardrails via Future AGI Protect

If your Retell assistant runs in a regulated workflow, inline content moderation matters. The Future AGI Protect model family runs sub-100ms inline. The foundation is Gemma 3n with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance), and it is multi-modal across text, image, and audio.

The integration sits inside your LLM endpoint, between the LLM response and Retell’s TTS leg:

from fi.evals import Protect

p = Protect()

def safe_reply(user_text, agent_text):
    out = p.protect(
        inputs={"input": user_text, "output": agent_text},
        protect_rules=[
            {"metric": "content_moderation"},
            {"metric": "security"},
            {"metric": "data_privacy_compliance"},
        ],
    )
    if out.blocked:
        return "I'm sorry, I can't help with that. Let me transfer you to a human agent."
    return agent_text

For the fastest path:

out = p.protect(
    inputs={"input": user_text, "output": agent_text},
    
)

ProtectFlash returns a single harmful or not-harmful verdict in one call. The verdict lands on the FAGI span, so the trust team can review denied responses in Error Feed.
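
Because the guardrail sits on the critical path before TTS, it helps to enforce the latency budget in code rather than trust it. The sketch below wraps any blocking check in a timeout and fails closed. It is provider-agnostic: is_blocked is a hypothetical callable you would back with your own guardrail call, and the 100 ms default mirrors the sub-100ms figure above rather than any measured number.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

FALLBACK = "I'm sorry, I can't help with that. Let me transfer you to a human agent."

# One shared pool; creating a pool per turn would add latency of its own.
_pool = ThreadPoolExecutor(max_workers=4)


def guarded_reply(
    agent_text: str,
    is_blocked: Callable[[str], bool],
    budget_s: float = 0.1,
) -> str:
    """Return agent_text only if the check passes within the budget."""
    future = _pool.submit(is_blocked, agent_text)
    try:
        blocked = future.result(timeout=budget_s)
    except FutureTimeout:
        # Fail closed: a verdict that arrives late is treated as a block.
        return FALLBACK
    return FALLBACK if blocked else agent_text


# Stand-in checks for illustration; swap in a call to your real guardrail.
fast_ok = guarded_reply("Your refund is on its way.", lambda text: False)
slow = guarded_reply(
    "Your refund is on its way.",
    lambda text: time.sleep(0.3) or False,  # simulates a slow verdict
    budget_s=0.05,
)
```

Failing closed suits a regulated workflow. If a dropped reply is worse for you than an unchecked one, return agent_text in the timeout branch instead.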

A full reference architecture

Putting it all together, a production Retell observability stack on FAGI looks like this:

+-------------------+        +---------------------+        +-------------------+
| Retell AI runtime | -----> | Your LLM Endpoint   | -----> | LLM Provider      |
| (LLM + TTS coupl) |        | + traceAI instr     |        | (OpenAI, Anthropic|
|                   |        | + Protect inline    |        | etc.)             |
+-------------------+        +----------+----------+        +-------------------+
        |                              |
        |                              | OpenInference spans
        |                              v
        |              +----------------------------+
        |              | FAGI Observe project       |
        +------------> | - native Retell integration|
   call log + audio    | - 70+ built-in rubrics     |
   + transcript        | - Error Feed clustering    |
                       | - inline Protect verdicts  |
                       +----------------------------+
                                     |
                                     v
                       +----------------------------+
                       | Agent Command Center       |
                       | - RBAC, BYOC, multi-region |
                       | - SOC 2 + HIPAA + GDPR     |
                       |   + CCPA + ISO 27001       |
                       +----------------------------+

Retell owns the call runtime. Your LLM endpoint is where traceAI and inline Protect attach. The FAGI Observe project receives both surfaces (call-level from Retell, span-level from the endpoint) and joins them under one session. Agent Command Center hosts the whole stack with RBAC, multi-region or BYOC, and the cert set listed on futureagi.com/trust: SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001.

Calibrated honesty: where Retell genuinely wins

Retell AI is the lowest-latency hosted voice stack in 2026. That wedge matters in three concrete ways:

End-to-end turn latency. The native LLM and TTS coupling removes a buffering hop most orchestration vendors keep. For consumer-facing voice, where every 100 ms of added latency costs conversion or raises abandonment, this is a real difference.

Operational simplicity. Retell hosts the whole stack. You don’t run STT, LLM, and TTS as separate services; Retell binds them together. For teams that want to ship voice without operating four moving pieces, that’s a clean trade.

Telephony depth. Retell’s telephony integration is mature, with support for the common providers and a clean inbound and outbound setup. For high-volume calling workloads, that matters.

What Retell does not ship is the deep observability, eval, clustering, and inline guardrail layer that production teams need on top. That’s the gap FAGI fills. Retell runs the call. FAGI watches it, scores it, clusters the failures, and guards the LLM output.

Two deliberate tradeoffs

Async eval gating is explicit. FAGI never auto-rewrites prompts in production without an explicit run plus a human approval gate. The Dataset UI ships UI-driven optimization across all six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard); the agent-opt Python library exposes the same six for programmatic control. Either way, the loop stays explicit: point the run at a dataset, pick an evaluator, pick the optimizer, then promote a candidate by hand.

Native voice observability ships for Vapi, Retell, and LiveKit; everything else routes through traceAI or Enable Others. The provider-API-key dashboard path covers the three runtimes most production teams pick. The remaining 10 percent (Synthflow, Bland, Pipecat, custom RTP) lands through the traceAI SDK (from fi_instrumentation import register plus from fi_instrumentation.fi_types import ProjectType) or via Enable Others mode with mobile-number simulation. Active iteration on the dashboard surface keeps shipping every release: multi-step Agent Definition UX, Prompt Workbench Revamp, redesigned Run Test performance metrics, Show Reasoning column in Simulate, sticky filters in Observe, scenario generation with branch visibility, and Error Localization that pinpoints the failing turn.

Common pitfalls when wiring Retell observability

Don’t paste the test-mode API key into the Agent Definition. Retell separates test and production keys. The call log capture path needs the production key. The dashboard surfaces a clear error when the key scope is wrong, but it’s the most common first-time mistake.

Don’t skip metadata tagging on the Retell side. If you don’t pass customer_id, vertical, agent_version, and intent through Retell’s call API, you lose the KPI attribution layer on the FAGI side. Set them once when you configure the assistant, and every call carries them.

Don’t try to instrument inside Retell’s runtime. You can only instrument LLM calls you control. If you’re using Retell’s hosted LLM option, traceAI can’t wrap it. The fix is to run your own LLM endpoint that Retell calls into, which is the default pattern for teams that want prompt versioning or provider failover.

Don’t run all five rubrics on every call from day one. Start with conversation_resolution and task_completion. They give you the highest-signal failure modes for the lowest cost. Add audio_transcription and audio_quality once you’ve seen a TTS or STT regression. Add conversation_coherence once you have enough multi-turn data to make the score stable.

Don’t disable Error Feed because the first day looks empty. It needs a few days of traffic to populate the named issue list. Once volume crosses a threshold, the clusters start surfacing.

Don’t rely on transcript-only debugging. Retell’s tight coupling between STT, LLM, and TTS means failures can hide in the audio layer while the transcript looks fine. Always run audio_transcription and audio_quality once you’re past the first week of traffic.
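
The rubric rollout described in the pitfalls above can be written down as a small rule. This is a sketch, not a FAGI feature: rubrics_for is a hypothetical helper, the seven-day cutoff reads "past the first week" literally, and stable_multi_turn is a judgment you supply once coherence scores stop swinging.

```python
def rubrics_for(
    days_of_traffic: int,
    seen_audio_regression: bool,
    stable_multi_turn: bool,
) -> list[str]:
    """Pick which rubrics to attach, following the staged rollout above."""
    rubrics = ["conversation_resolution", "task_completion"]  # day-one pair
    # Audio rubrics: after a TTS or STT regression, or once past the first week.
    if seen_audio_regression or days_of_traffic > 7:
        rubrics += ["audio_transcription", "audio_quality"]
    # Coherence waits until there is enough multi-turn data for a stable score.
    if stable_multi_turn:
        rubrics.append("conversation_coherence")
    return rubrics
```

For example, rubrics_for(2, False, False) keeps only the day-one pair, while rubrics_for(10, False, True) attaches all five.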

When you’ve outgrown this setup

Once the dashboard path, eval rubrics, Error Feed, and inline Protect are running cleanly, the next move is to add simulation. FAGI’s simulation product ships 18 pre-built personas plus unlimited custom-authored personas. Custom personas configure name, gender, age range (18-25 / 25-32 / 32-40 / 40-50 / 50-60 / 60+), location (US / Canada / UK / Australia / India), personality traits, communication style, accent, conversation speed, background noise, multilingual toggle, custom properties, and free-form behavioral instructions. The Workflow Builder (Conversation Node, End Call Node, Transfer Call Node) auto-generates branching scenarios (20, 50, or 100 rows) with branch visibility; Dataset scenarios accept CSV, JSON, and Excel uploads or synthetic generation; script-based runs cover deterministic regression. The 4-step Run Tests wizard runs the suite against your Retell assistant, Error Localization pinpoints the exact failing turn, and the Show Reasoning column surfaces eval rationale per scenario. The Tool Calling eval and programmatic eval API cover CI integration; custom voices from ElevenLabs and Cartesia plug into Run Prompt and Experiments; Indian phone number simulation ships as a configurable region.

The same Agent Definition you wired in step 1 plugs into Simulate. The same eval rubrics run on simulated calls. The same Error Feed clusters scenario failures alongside production failures. That’s the unified surface.

The other natural extension is closing the loop into optimization. The Dataset UI ships all six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard); the agent-opt Python library exposes the same six for CI-driven runs. Both read the same trace data the dashboard renders and propose prompt revisions against live failure patterns. The pattern is explicit by design. Point a run at a dataset, pick an evaluator and optimizer, review candidates, promote by hand. Turn it on after the first month, once your eval baselines stabilize and the failure modes are well understood.

For a deeper walkthrough of the simulation side, see the voice agent scenario guide. For the broader production monitoring playbook, see how to monitor AI voice agents in production.

Frequently asked questions

Does Retell AI need an SDK to integrate with Future AGI?
No, not for the dashboard path. Add your Retell API key and Assistant ID to a Future AGI Agent Definition, enable observability, and every call captured by Retell streams into the FAGI Observe project with separate assistant and customer audio downloads, an auto transcript, and the eval engine. The SDK path through traceAI is optional and only worth wiring when you want richer LLM-level spans on top of the call-level visibility the native integration ships.
Which eval rubrics should I run on Retell calls?
Five built-ins from ai-evaluation give you the highest signal on a Retell workload: audio_transcription for ASR drift on customer audio, audio_quality for TTS regressions on the assistant leg, conversation_coherence for multi-turn flow, conversation_resolution for whether the call resolved the customer's goal, and task_completion for agent-side success. All five are Apache 2.0 built-ins.
Why does Retell often hit lower latency than other voice stacks?
Retell's hosted stack tightly couples the LLM with TTS streaming at the runtime level, which removes a hop most other orchestration vendors keep. That coupling is the genuine wedge: end-to-end turn latency at the low end of the category. What it doesn't ship is a deep observability and eval layer on top. FAGI adds that layer through the native voice integration, so you keep Retell's latency profile and gain the eval engine and Error Feed clustering on top.
Will Retell's call recordings download separately for assistant and customer?
Yes. FAGI's native voice observability stores assistant audio and customer audio as separate downloadable files on every captured Retell call. That separation matters for debugging barge-in failures (interruption timing lives in the customer leg) and TTS regressions (clarity and prosody live in the assistant leg). Both files attach to the same FAGI session along with the auto transcript, eval scores, and any traceAI spans you've wired in.
How does Error Feed handle Retell call failures?
Error Feed auto-clusters failures across five categories: factual grounding failures, tool crashes, broken workflows, safety violations, and reasoning gaps. For Retell specifically, the common clusters surface as TTS prosody drift after a voice switch, tool argument mismatches in the agent's function calls, STT confidence drops on noisy audio, and resolution failures clustered by intent. Each named issue ships with auto-written root cause, supporting span evidence, a quick fix, and a long-term recommendation.
What latency does inline Future AGI Protect add on top of Retell's stack?
Future AGI Protect is the sub-100ms inline guardrail path. The Protect model family is built on Gemma 3n with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance), multi-modal across text, image, and audio. ProtectFlash adds a single-call binary classifier path. Either fits inside the sub-500 ms voice budget Retell teams already target, so you can run guardrails on the critical path before the LLM response hits the TTS leg.
Can I use FAGI's Retell observability alongside other voice providers in the same project?
Yes. The Agent Definition is per-agent, not per-project. You can run a Retell agent, a Vapi agent, and a LiveKit agent in the same FAGI Observe project, each with its own provider API key and Assistant ID. The dashboard renders them in one Call Log table with a provider filter. Cross-provider clusters in Error Feed surface failure modes that are common across stacks (STT accent drift, TTS pronunciation regressions, tool schema mismatches) separately from provider-specific clusters.