
How to Implement Voice AI Observability in 2026: Vapi, Retell, LiveKit, Pipecat

Implement voice AI observability in 2026 for Vapi, Retell, LiveKit, and Pipecat agents. Real traceAI code, latency SLOs, audio metrics, and live eval scoring.

How to Implement Voice AI Observability in 2026: Full Guide

Voice AI agents in 2026 run on stacks like LiveKit Agents, Pipecat, Vapi, and Retell, with OpenAI Realtime, Gemini Live, or Anthropic Claude in the loop. Each layer can break independently, and standard APM tools see only the network. This guide shows how to wire end-to-end observability for production voice agents using Future AGI’s Apache 2.0 traceAI instrumentors and the fi.evals evaluation catalog, with runnable code and the SLO thresholds we use day to day.

TL;DR

Question | Short answer
What changed in 2026 | OpenAI Realtime API, LiveKit Agents 1.0, and Pipecat 0.6 made low-latency voice mainstream; observability is now mandatory for production.
Top voice observability stack | Future AGI traceAI for instrumentation + fi.evals for scoring; provider dashboards (LiveKit, Vapi, Retell) as a secondary view.
Required spans per turn | One conversation root span containing STT, LLM, tool calls, and TTS as child spans.
Latency SLO starting point | P95 end-to-end turn latency under 800ms for a single tool-free turn; tune on your data.
Quality metrics | WER, intent confidence, task completion, groundedness.
Audio metrics | MOS, jitter, packet loss, barge-in failure rate.
Future AGI integration | register(project_type=ProjectType.OBSERVE, ...) + evaluate(eval_templates="task_completion", ...)

Why traditional APM tools miss what actually breaks

A voice agent has three failure surfaces that text agents do not:

  1. Audio quality: jitter, packet loss, and codec switching degrade the user-perceived experience even when the LLM is healthy.
  2. Turn-taking: barge-in failure, late end-of-turn detection, and duplicated responses ruin natural flow.
  3. ASR drift: a small WER bump silently corrupts the LLM context, which then misclassifies intent and calls the wrong tool.

Standard HTTP-trace APM does not see any of those. Voice observability needs (a) one conversation root span per call, (b) child spans for STT, LLM, tool, and TTS, and (c) an eval score attached to each turn that captures both transcript quality and audio quality.

The four metric families that matter

Latency

  • TTFB: delay between user end-of-speech and first audio packet back.
  • End-to-end turn latency: full STT + LLM + TTS round-trip.
  • TTS processing lag: delta between text generation and audio rendering.
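
All three numbers fall out of per-turn timestamps. A minimal sketch of the arithmetic, assuming your framework exposes events for end-of-speech, LLM completion, and audio playback; the field names below are illustrative, not part of any framework API.

from dataclasses import dataclass

@dataclass
class TurnTimestamps:
    # All values from time.monotonic(); field names are placeholders for your framework's events.
    user_end_of_speech: float   # VAD detected the user stopped speaking
    llm_text_complete: float    # last LLM token received
    first_audio_packet: float   # first TTS audio chunk sent back to the user
    last_audio_packet: float    # final TTS audio chunk sent

def latency_metrics(t: TurnTimestamps) -> dict:
    return {
        # TTFB: user end-of-speech -> first audio packet back
        "ttfb_ms": (t.first_audio_packet - t.user_end_of_speech) * 1000,
        # End-to-end turn latency: full STT + LLM + TTS round-trip
        "turn_latency_ms": (t.last_audio_packet - t.user_end_of_speech) * 1000,
        # TTS processing lag: text ready -> audio starts rendering
        "tts_lag_ms": (t.first_audio_packet - t.llm_text_complete) * 1000,
    }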

Quality

  • WER: ASR transcription error rate against ground truth.
  • Intent classification confidence: model probability for the chosen intent.
  • Task completion: percentage of conversations where the user goal was met.
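
WER is the word-level edit distance between a reference transcript and the ASR output, divided by the reference length. A self-contained sketch; in production you would more likely reach for a library such as jiwer.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# word_error_rate("book a flight to new york", "book a fight to york")  # -> 0.333...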

Audio

  • MOS: estimated speech-output quality 1-5.
  • Jitter and packet loss: network signals that cause robotic playback.
  • Barge-in failure rate: how often the agent ignores a user interrupt.
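
Barge-in failure rate can be computed from per-turn interrupt events. A sketch, assuming each turn record carries an interrupted flag and the delay before playback actually stopped; both field names are placeholders for whatever your framework reports.

def barge_in_failure_rate(turns: list[dict], max_stop_ms: float = 300) -> float:
    """Share of user interrupts where the agent kept talking past max_stop_ms.

    Each turn dict is assumed to carry 'interrupted' (bool) and
    'stop_delay_ms' (float, or None if playback never stopped).
    """
    interrupts = [t for t in turns if t.get("interrupted")]
    if not interrupts:
        return 0.0
    failures = [
        t for t in interrupts
        if t.get("stop_delay_ms") is None or t["stop_delay_ms"] > max_stop_ms
    ]
    return len(failures) / len(interrupts)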

Business

  • AHT: average handle time.
  • FCR: first-contact resolution rate.
  • Escalation rate: handoffs to human agents, split into planned vs failure-driven.

Setting alert thresholds that matter

A dashboard that lights up red for every blip is a dashboard people mute.

Alert | What it tracks | Why it matters
Average latency | Mean turn time | Useful for trending, hides spikes
P95 turn latency | Worst 5% of users | Catches the tail that hangs up
Spike duration | How long P95 stays elevated | Distinguishes blip from outage
Anomaly delta | Deviation from learned baseline | Catches drift in metrics that fluctuate with traffic

Static thresholds are right for SLAs and hard limits (“server returned 5xx”). Anomaly detection is right for metrics that move with traffic (turn latency at peak hours, WER under new accent mix).
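
For the traffic-dependent metrics, a rolling-baseline z-score is the simplest form of anomaly detection. A sketch standing in for whatever your alert manager provides; the window size and z-limit are starting points, not recommendations.

from collections import deque
import statistics

class RollingAnomalyDetector:
    """Flags values more than z_limit standard deviations from a rolling baseline."""

    def __init__(self, window: int = 500, z_limit: float = 3.0):
        self.values = deque(maxlen=window)
        self.z_limit = z_limit

    def observe(self, value: float) -> bool:
        """Returns True if value is anomalous relative to the current baseline."""
        anomalous = False
        if len(self.values) >= 30:  # wait for a minimal baseline before flagging
            mean = statistics.fmean(self.values)
            stdev = statistics.pstdev(self.values) or 1e-9
            anomalous = abs(value - mean) / stdev > self.z_limit
        self.values.append(value)
        return anomalous

# detector = RollingAnomalyDetector()
# if detector.observe(turn_latency_ms):
#     emit_low_priority_signal("turn latency anomaly")  # illustrative sink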

How to set up voice observability with Future AGI

Step 1: Install the instrumentors

The traceAI monorepo on github.com/future-agi/traceAI ships instrumentors for the LLM providers used inside voice frameworks (OpenAI, Anthropic, LiteLLM, Vertex AI). Install the package that matches your provider plus ai-evaluation.

pip install traceAI-openai ai-evaluation fi-instrumentation

Step 2: Register the tracer

Register a tracer at the start of your voice service so every span streams to the Future AGI dashboard.

import os

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="voice_support_agent",
)

OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

For Anthropic, swap in traceai_anthropic.AnthropicInstrumentor. For LiteLLM-backed stacks (LiveKit Agents, Pipecat with LiteLLM), use traceai_litellm.LiteLLMInstrumentor. All instrumentors live in the same Apache 2.0 repo.
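
For example, a LiteLLM-backed stack registers the same way; this mirrors the OpenAI snippet above and assumes the trace_provider from Step 2.

from traceai_litellm import LiteLLMInstrumentor

# Same pattern as OpenAIInstrumentor: attach the instrumentor to the registered provider.
LiteLLMInstrumentor().instrument(tracer_provider=trace_provider)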

Step 3: Wrap each conversation in a session span

Use the tracer provider returned by register() to create one root span per conversation. Every turn becomes a child span automatically.

from fi_instrumentation import FITracer

tracer = FITracer(trace_provider.get_tracer(__name__))

def handle_call(call_id, user_phone):
    with tracer.start_as_current_span(
        "voice_conversation",
        attributes={
            "conversation_id": call_id,
            "user_phone": user_phone,
            "channel": "voice",
        },
    ) as conv_span:
        run_voice_loop(call_id)

FITracer is exposed by fi_instrumentation and wraps the standard OpenTelemetry tracer with Future AGI-specific attributes.
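
Inside run_voice_loop, each exchange can open its own child span so every turn nests under the conversation root automatically. A sketch, where call_is_active, transcribe_next_utterance, generate_reply, and speak are placeholders for your framework's own hooks.

def run_voice_loop(call_id):
    # Because the conversation span from handle_call() is the current span,
    # each turn span created here nests under it automatically.
    turn_id = 0
    while call_is_active(call_id):  # placeholder for your framework's loop condition
        turn_id += 1
        with tracer.start_as_current_span(
            "voice_turn",
            attributes={"conversation_id": call_id, "turn_id": turn_id},
        ) as turn_span:
            user_text = transcribe_next_utterance(call_id)  # STT, illustrative helper
            agent_text = generate_reply(user_text)          # instrumented LLM call
            speak(agent_text)                               # TTS, illustrative helper
            turn_span.set_attribute("user_text", user_text)
            turn_span.set_attribute("agent_text", agent_text)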

Step 4: Score every turn with fi.evals

After each turn completes, score the agent’s answer against the user’s intent. The example below uses the task_completion cloud template plus a CustomLLMJudge for tone.

import os

from fi.evals import evaluate
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider
from fi.opt.base import Evaluator

def score_turn(user_text, agent_text, expected_outcome):
    task_score = evaluate(
        eval_templates="task_completion",
        inputs={
            "input": user_text,
            "output": agent_text,
            "expected_output": expected_outcome,
        },
        model_name="turing_flash",
    )

    judge = CustomLLMJudge(
        name="voice_tone_judge",
        grading_criteria=(
            "Score 1 if the answer is concise, warm, and "
            "avoids hold-music phrases like 'please wait a moment'. "
            "Otherwise score 0."
        ),
        provider=LiteLLMProvider(
            model=os.getenv("JUDGE_MODEL", "gpt-4o-mini"),
        ),
    )

    tone = Evaluator(metric=judge).evaluate(
        output=agent_text,
        context="brand voice v3",
    )

    return {
        "task_completion": task_score.eval_results[0].metrics[0].value,
        "tone": tone,
    }

turing_flash returns in roughly 1-2 seconds, turing_small in 2-3 seconds, and turing_large in 3-5 seconds (docs.futureagi.com). Pick a tier that fits the in-call latency budget; many teams run scoring asynchronously after the turn so the latency is invisible to the user.
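
One way to keep scoring off the hot path is a small worker pool that runs score_turn after the turn has already been spoken. A sketch; record_turn_scores is a placeholder for wherever you persist results, and scores are attached by ID because the turn span may have ended before the judge responds.

from concurrent.futures import ThreadPoolExecutor

# A small worker pool keeps judge calls off the audio hot path.
eval_pool = ThreadPoolExecutor(max_workers=4)

def score_turn_async(conversation_id, turn_id, user_text, agent_text, expected_outcome):
    """Fire-and-forget scoring; results are recorded against the turn IDs."""
    def _run():
        scores = score_turn(user_text, agent_text, expected_outcome)
        # Illustrative sink: a database, log line, or metrics backend keyed by turn.
        record_turn_scores(conversation_id, turn_id, scores)
    eval_pool.submit(_run)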

Step 5: Define SLO-based alerts

Convert metrics into SLOs in the Future AGI dashboard or your own alert manager. Reasonable starting points:

  • P95 end-to-end turn latency under 800ms (tune on your traffic).
  • Intent confidence median above 0.85.
  • Task completion rate above 90%.
  • Barge-in failure rate under 2%.

Page only on sustained breaches. Surface single-window spikes as low-priority signals to avoid alert fatigue.
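
The sustained-breach rule is easy to express in code if your alert manager does not support it natively. A minimal sketch using consecutive evaluation windows:

class SustainedBreachAlert:
    """Pages only when the SLO is breached for `required` consecutive windows;
    a single bad window surfaces as a low-priority signal instead."""

    def __init__(self, threshold_ms: float = 800, required: int = 3):
        self.threshold_ms = threshold_ms
        self.required = required
        self.consecutive = 0

    def check(self, p95_turn_latency_ms: float) -> str:
        if p95_turn_latency_ms > self.threshold_ms:
            self.consecutive += 1
            return "page" if self.consecutive >= self.required else "low_priority"
        self.consecutive = 0
        return "ok"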

Tracing across STT, LLM, and TTS

Voice spans must connect across components. The most useful traces in production are:

  • Conversation root: one span per call, carries conversation_id, channel, customer ID.
  • Turn span: one per user-agent exchange, carries turn_id, end-of-turn timestamp, audio quality at start of turn.
  • STT span: provider, model, transcript, WER (if a reference is available), confidence.
  • LLM span: model, prompt tokens, completion tokens, tool calls, latency.
  • Tool span: tool name, arguments, response, latency.
  • TTS span: voice ID, output duration, MOS estimate, packet-loss flag.

traceAI emits the LLM and provider spans automatically. For STT and TTS, wrap the framework’s events in a manual span using tracer.start_as_current_span("stt", ...).
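
A sketch of such a manual STT span; the stt_client call and provider name are illustrative, not part of traceAI.

def transcribe_with_span(audio_chunk, conversation_id, turn_id):
    with tracer.start_as_current_span(
        "stt",
        attributes={
            "conversation_id": conversation_id,
            "turn_id": turn_id,
            "stt.provider": "your-stt-engine",  # whatever engine the framework uses
        },
    ) as span:
        result = stt_client.transcribe(audio_chunk)  # illustrative provider call
        span.set_attribute("stt.transcript", result.text)
        span.set_attribute("stt.confidence", result.confidence)
        return result.text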

Voice frameworks and how observability lands

Framework | License | Where traceAI plugs in
LiveKit Agents | Apache 2.0 (github.com/livekit/agents) | Instrument the LLM provider; wrap turn events in a session span
Pipecat | BSD 2-Clause (github.com/pipecat-ai/pipecat) | Instrument the LLM provider; use Pipecat’s frame processor hooks to emit STT/TTS spans
Vapi | Proprietary cloud (vapi.ai) | Forward Vapi webhook events into the Future AGI dashboard via the OpenTelemetry collector
Retell | Proprietary cloud (retellai.com) | Same pattern as Vapi: webhook bridge plus LLM-side instrumentation

The voice framework is the orchestration layer. The LLM provider is the reasoning layer. Future AGI sits underneath both as the observability and evaluation layer.
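
For the hosted platforms, the webhook bridge can be as small as one endpoint that converts call events into spans. A sketch using FastAPI and the tracer from Step 3; the payload fields are placeholders, so map them from Vapi's or Retell's actual webhook schema.

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/voice-webhook")
async def voice_webhook(request: Request):
    # Payload field names below are placeholders, not the providers' real schema.
    event = await request.json()
    with tracer.start_as_current_span(
        "voice_turn",
        attributes={
            "conversation_id": event.get("call_id", "unknown"),
            "channel": "voice",
            "platform": event.get("provider", "unknown"),
        },
    ) as span:
        span.set_attribute("transcript.user", event.get("user_transcript", ""))
        span.set_attribute("transcript.agent", event.get("agent_response", ""))
    return {"ok": True}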

Closing the loop with pre-launch simulation

Production traces are great for catching what already happened. To catch problems before launch, use a synthetic conversation runner. fi.simulate exposes TestRunner and AgentInput/AgentResponse types that drive a target agent through scripted turns and score each turn with the same fi.evals templates used in production.

from fi.simulate import TestRunner, AgentInput, AgentResponse

def my_agent(turn: AgentInput) -> AgentResponse:
    # call the real voice agent; return the text response
    return AgentResponse(output="Sure, I can help you book a flight.")

runner = TestRunner(
    agent=my_agent,
    scenarios=[
        "User wants to book a flight from SFO to JFK on Friday.",
        "User asks for a refund on a missed flight.",
    ],
    eval_templates=["task_completion", "groundedness"],
)

runner.run()

Replace my_agent with the entry point of your LiveKit, Pipecat, Vapi, or Retell pipeline. The runner replays each scenario and scores the agent against the same templates that production traffic is scored against.

Common failure modes the integration catches

  • Silent ASR drift: WER score on the production set drops; alert fires.
  • TTS lag: TTS span latency climbs while LLM latency stays flat; isolated to the TTS provider.
  • Barge-in regression: barge-in failure rate spikes after a Pipecat upgrade; rollback validated by tracing.
  • Tool hallucination: tool_call_accuracy template flags arguments that do not match the schema.
  • Drifted system prompt: task_completion score drops 4 points after a prompt edit; revert.

Wrap-up

Production voice AI in 2026 is not just about LLM quality; it is audio quality, turn quality, and tool quality together. Wire traceAI into the LLM provider, group every turn under a conversation_id, score each turn with fi.evals, and set SLO alerts on P95 turn latency, MOS, and intent confidence. The integration is a few lines of Python, the SDK is Apache 2.0, and the same instrumentation works whether you ship on LiveKit Agents, Pipecat, Vapi, or Retell.

For deeper reading see the voice AI platforms comparison, the AI agent cost and observability guide, and the agent debugging tools roundup.

Frequently asked questions

Why does voice AI need different observability than text chat?
Voice adds three failure surfaces that text chat does not have: audio quality (jitter, packet loss, MOS), turn-taking (barge-in failures, end-of-turn detection), and ASR errors that silently corrupt downstream LLM context. Standard APM tools see only HTTP latency and miss those. You need spans that cover STT, LLM, and TTS as one conversation trace, plus eval scores tied to that trace.
What metrics matter most in production voice AI?
Track four families: latency (TTFB, end-to-end turn latency, TTS lag), quality (WER, intent confidence, task completion), audio (MOS, jitter, packet loss, barge-in failure rate), and business (AHT, FCR, escalation rate). Set alerts on P95 latency, not average. Use anomaly detection for metrics that fluctuate with traffic patterns.
How do I instrument a Vapi, Retell, LiveKit, or Pipecat agent with Future AGI?
Install `traceAI-openai` (or the relevant provider instrumentor) plus `ai-evaluation`, then call `register(project_type=ProjectType.OBSERVE, project_name=...)`. LLM and tool spans land automatically. For STT and TTS, wrap the framework's events with a manual span that carries the shared `conversation_id`. The traceAI repo on github.com/future-agi/traceAI is Apache 2.0.
What is performance drift in a voice agent?
Performance drift is the gradual loss of accuracy as production traffic shifts away from the training distribution. In voice this usually means new slang, regional accents, background noise patterns, or a silent ASR model update. The drift compounds for weeks before customers complain. The fix is continuous evaluation against a fixed scoring rubric so a 4-point drop in intent accuracy fires an alert in real time.
Average latency vs P95 vs P99: what should I alert on?
Alert on P95 or P99 turn latency for production voice. Average latency hides the tail users who experience 3-second pauses, and those tail users are the ones who hang up. A reasonable starting SLO for natural conversational flow is P95 end-to-end turn latency under 800ms for a single tool-free turn, but tune it on your own data and load.
How do I evaluate audio quality automatically?
Use a Mean Opinion Score estimator on the synthesized audio, monitor jitter and packet loss on the network layer, and track the barge-in failure rate. Future AGI ships multimodal eval templates that score speech-output quality. Pair them with the LLM-level evaluation so improving transcript accuracy does not hide a regression in TTS quality.
Can I use this with Vapi, Retell, or my own LiveKit stack?
Yes. The traceAI instrumentors wrap the underlying provider SDKs (OpenAI, Anthropic, Mistral, etc.), so any voice framework built on those providers gets spans automatically. For LiveKit Agents, Pipecat, Vapi, and Retell add a hook that attaches `conversation_id` to the trace context at the start of each session. The `fi.evals` API is provider-agnostic.
Is the SDK open source?
Yes. traceAI is Apache 2.0 (github.com/future-agi/traceAI/blob/main/LICENSE). The `ai-evaluation` SDK that exposes `fi.evals` is Apache 2.0 (github.com/future-agi/ai-evaluation/blob/main/LICENSE). The Future AGI cloud platform, the judge models (`turing_flash`, `turing_small`, `turing_large`), and the dashboards are the proprietary layer.