Guides

How to Build RAG-Powered Voice AI Agents in 2026

Build streaming RAG-powered voice agents in 2026. Parallel retrieval, grounded LLM with citations, faithfulness eval, and traceAI instrumented spans.

April 23, 2026

Updated May 19, 2026

13 min read

voice-ai 2026 rag retrieval voice-agents

Voice AI agents in 2026 are no longer just LLM core plus telephony. The interesting deployments wire in a knowledge base: the receptionist who can answer policy questions from the company handbook, the support agent who reads the latest pricing tier off the docs, the medical triage agent who grounds advice in a clinical protocol. The shape is RAG-powered voice. The hard part is doing retrieval inside a streaming audio budget without breaking the conversational flow. This guide walks through the architecture, runnable code, and the eval rubrics that catch retrieval drift before it embarrasses the agent.

TL;DR: the five-step build

Stream ASR with partial transcripts. Trigger retrieval the moment a stable partial clears 70 to 80% confidence.
Run retrieval in parallel with LLM TTFT. Vector search + reranker on the partial while the LLM warms up.
Ground the LLM with per-claim citation markers. TTS strips markers, trace keeps them.
Score faithfulness, context-relevance, citation correctness on every call with ai-evaluation.
Optimize retrieval prompts against live trace data with agent-opt once eval baselines stabilize.

The pattern works on top of any voice runtime (Vapi, Retell, ElevenLabs Agents, Bland, LiveKit Agents, Pipecat). The retrieval and eval layer is vendor-neutral by design.

Why voice RAG is harder than text RAG

Three pressures collide:

Latency budget. Conversational voice wants sub-800ms first response. A text RAG pipeline can take 2 to 3 seconds without anyone noticing. A voice RAG pipeline at 2 seconds feels broken. Retrieval has to fit inside the LLM’s TTFT window, which means parallel execution and partial-transcript triggering.

Streaming inputs. Text RAG sees the full query before kicking off retrieval. Voice RAG has to decide when the partial transcript is stable enough to retrieve on. Wait too long and retrieval is in the critical path; trigger too early and retrieval runs on a half-formed query.

Short conversational outputs. Text RAG can return a paragraph with bullet citations. Voice RAG returns one or two sentences. Citation tracking has to happen at the claim level, not the document level, because there’s only enough room for one or two claims per turn. That makes faithfulness scoring harder.

The architecture below handles all three.

The architecture

[caller audio]
     │
     ▼
[Streaming ASR: Deepgram / AssemblyAI]
     │
     ├─► partial transcript (confidence ≥ 0.75)
     │        │
     │        ▼
     │   [Query rewriter LLM] ──► rewritten query
     │        │
     │        ▼
     │   [Vector search: Pinecone / Qdrant / pgvector]
     │        │
     │        ▼
     │   [Reranker: Cohere Rerank / bge-reranker]
     │        │
     │        ▼
     │   [top-3 chunks]
     │
     └─► final transcript ──► [Grounded LLM with chunks + citation markers]
                                     │
                                     ▼
                              [Strip markers]
                                     │
                                     ▼
                              [Streaming TTS: Cartesia / ElevenLabs]
                                     │
                                     ▼
                              [caller audio out]

Retrieval starts on the partial transcript while the LLM is still spinning up. By the time the final transcript lands and the LLM call kicks off, the retrieval results are already in the prompt budget. The grounded LLM emits per-claim citation markers; the TTS layer strips them; the trace keeps them.

Step 1: Stream ASR with partial transcripts

Use a streaming ASR provider that emits partials with confidence scores. Deepgram, AssemblyAI, and Whisper-streaming all work.

# pip install deepgram-sdk traceAI-openai ai-evaluation fi-instrumentation
import os
import asyncio
from deepgram import DeepgramClient, LiveOptions
from fi_instrumentation import register, FITracer
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="voice_rag_agent",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
tracer = FITracer(trace_provider.get_tracer(__name__))

PARTIAL_THRESHOLD = 0.75

class VoiceRAGSession:
    def __init__(self, conversation_id: str):
        self.conversation_id = conversation_id
        self.retrieval_task = None
        self.retrieved_chunks = None

    async def on_partial(self, transcript: str, confidence: float):
        if confidence < PARTIAL_THRESHOLD or self.retrieval_task is not None:
            return
        # Kick off retrieval in parallel with LLM warm-up
        self.retrieval_task = asyncio.create_task(
            self.retrieve(transcript)
        )

    async def retrieve(self, partial_query: str):
        with tracer.start_as_current_span(
            "retrieval",
            attributes={
                "conversation_id": self.conversation_id,
                "partial_query": partial_query,
                "retrieval_trigger": "partial_transcript",
            },
        ):
            rewritten = await self.rewrite_query(partial_query)
            chunks = await self.vector_search(rewritten, top_k=10)
            top = await self.rerank(rewritten, chunks, top_n=3)
            self.retrieved_chunks = top
            return top

The retrieve span carries the partial query, the rewritten query, the chunk IDs, and the latency per sub-stage. Every retrieval call lands in the trace alongside the LLM and TTS spans.

Step 2: Parallel retrieval and LLM warm-up

The trick that makes voice RAG feel conversational is running retrieval and LLM warm-up in parallel.

    async def on_final_transcript(self, final_transcript: str):
        # If retrieval already kicked off on a partial, await it; otherwise run now
        if self.retrieval_task is None:
            self.retrieval_task = asyncio.create_task(self.retrieve(final_transcript))

        chunks = await self.retrieval_task
        return await self.grounded_generate(final_transcript, chunks)

    async def grounded_generate(self, query: str, chunks):
        prompt = self.build_grounded_prompt(query, chunks)
        with tracer.start_as_current_span(
            "grounded_llm",
            attributes={
                "conversation_id": self.conversation_id,
                "chunk_ids": [c["id"] for c in chunks],
                "query": query,
            },
        ):
            return await self.stream_llm(prompt)

The vector search plus reranker budget is typically 200 to 300ms total. The LLM TTFT for a small model on a warm endpoint lands in 300 to 500ms. Running them in parallel keeps the user-perceived latency in the 500 to 800ms range.

Step 3: Ground the LLM with per-claim citation markers

The grounded LLM prompt instructs the model to emit per-claim citation markers tied to retrieval chunk IDs. The TTS layer strips them before speaking. The trace keeps them for eval.

GROUNDED_VOICE_PROMPT = """You are a concise voice agent. Answer the user's question using ONLY the retrieved context below.

Rules:
- Answer in 1-2 short sentences suitable for spoken delivery.
- For every factual claim, append a citation marker in the form [chunk_id].
- If the context does not contain the answer, say so plainly.
- Do not invent facts beyond the retrieved context.

Retrieved context:
{chunks_with_ids}

User question: {query}

Spoken answer with citations:"""

def build_grounded_prompt(query: str, chunks):
    chunks_text = "\n".join(
        f"[{c['id']}] {c['text']}" for c in chunks
    )
    return GROUNDED_VOICE_PROMPT.format(
        chunks_with_ids=chunks_text,
        query=query,
    )

CITATION_PATTERN = re.compile(r"\[(?P<chunk_id>chunk_\d+)\]")

def strip_citations_for_tts(grounded_response: str):
    return CITATION_PATTERN.sub("", grounded_response).strip()

The spoken response goes to TTS; the structured response (with markers intact) goes into the eval pipeline.

Step 4: Score every call on the three rubrics

ai-evaluation ships four named rubrics that handle voice RAG out of the box: groundedness, chunk_attribution, chunk_utilization, and context_relevance. For audio-input scoring, use MLLMAudio test cases. They accept .mp3, .wav, .ogg, .m4a, .aac, .flac, .wma (URL or local path, auto-base64).

from fi.evals import evaluate

def score_voice_rag_call(query, grounded_response_with_markers, retrieved_chunks):
    chunks_text = [c["text"] for c in retrieved_chunks]
    chunks_by_id = {c["id"]: c["text"] for c in retrieved_chunks}

    # 1. Groundedness: do the LLM's claims match the retrieved chunks
    groundedness = evaluate(
        eval_templates="groundedness",
        inputs={
            "output": grounded_response_with_markers,
            "context": "\n".join(chunks_text),
        },
    )

    # 2. Context relevance: are the retrieved chunks relevant to the query
    context_rel = evaluate(
        eval_templates="context_relevance",
        inputs={
            "input": query,
            "context": "\n".join(chunks_text),
        },
    )

    # 3. Chunk attribution: do markers point to chunks that actually support the claims
    chunk_attr = evaluate(
        eval_templates="chunk_attribution",
        inputs={
            "output": grounded_response_with_markers,
            "context_by_id": chunks_by_id,
        },
    )

    return {
        "groundedness": groundedness.eval_results[0].metrics[0].value,
        "context_relevance": context_rel.eval_results[0].metrics[0].value,
        "chunk_attribution": chunk_attr.eval_results[0].metrics[0].value,
    }

Run the scoring asynchronously after the turn so the latency stays out of the critical path. ai-evaluation ships 70+ built-in eval templates including the voice-specific audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion, plus the RAG quintet (groundedness 47, context_relevance 9, context_adherence 5, chunk_attribution 11, chunk_utilization 12), plus unlimited custom evaluators authored by an in-product agent. In-house classifier models are tuned for the LLM-as-judge cost/latency tradeoff. Programmatic eval API for configure + re-run. Native voice observability for Vapi/Retell/LiveKit also brings call logs, transcripts, and separate assistant/customer audio downloads into the same eval flow. Apache 2.0.

Step 5: Optimize retrieval prompts against live trace data

The eval scores plus traces feed agent-opt. The optimizer reads the failing traces, identifies queries where context-relevance scored low, and proposes variations to the query rewriter and reranker prompts.

# Conceptual: run on a schedule against the production trace store
from fi.opt import GEPAOptimizer

optimizer = GEPAOptimizer(
    target_prompt="query_rewriter",
    objective="context_relevance",
    trace_source="voice_rag_agent",
    sample_size=500,
)

candidate_prompts = optimizer.propose(num_variants=8)
for variant in candidate_prompts:
    score = optimizer.evaluate(variant)
    print(variant.name, score)

best = optimizer.select_best()
optimizer.promote(best, rollout_percentage=10)

FAGI never auto-rewrites prompts without an explicit run and a human approval gate. Most teams turn agent-opt on once they have a baseline of 1,000 to 10,000 scored calls. agent-opt ships six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) exposed via the Dataset UI and the SDK; pick the one that matches your search budget.

Latency budget math

The numbers that matter:

Stage	Target p50	Hard ceiling
ASR partial confidence ≥ 0.75	200ms after first audio	400ms
Query rewrite	80ms	150ms
Vector search top-k	60ms	120ms
Reranker top-3	80ms	150ms
LLM TTFT	400ms	700ms
TTS first audio	200ms	350ms
End-to-end first audio	600-800ms	1.2s

The retrieval pipeline (rewrite + search + rerank) lands in 200-300ms. Running it in parallel with LLM warm-up means it doesn’t add to the user-perceived latency. The Future AGI Protect model family runs sub-100ms inline per arXiv 2510.13351, which fits inside the budget for inline PII redaction and prompt-injection blocking. ProtectFlash gives a single-call binary classifier path.

Common failure modes the eval catches

The three rubrics catch the failure modes that text RAG eval misses:

Faithfulness drops, context-relevance high. The retrieval pulled the right chunks but the LLM hallucinated on top of them. Fix: tighten the grounded prompt.
Context-relevance drops, faithfulness high. Retrieval pulled the wrong chunks but the LLM caught it and refused. Fix: improve the query rewriter or the chunk strategy.
Citation correctness drops, faithfulness and context-relevance high. The LLM is making true claims but attributing them to the wrong chunk. Fix: few-shot the citation format.
All three drop on a specific intent. Retrieval coverage gap. Fix: re-index that part of the knowledge base or add structured retrieval (BM25 + vector hybrid).

Error Feed auto-clusters retrieval failures into named issues with root cause, quick fix, and long-term recommendation. Zero-config the moment traces hit an Observe project.

The Future AGI stack on voice RAG

Five products doing five jobs:

traceAI + native voice observability: 30+ documented integrations across Python + TypeScript (including traceAI-pipecat and traceai-livekit), OpenInference-compat, Apache 2.0. For Vapi/Retell/LiveKit, no SDK is needed. Every retrieval, LLM, and TTS span lands in the trace. Vector hits, reranker outputs, chunk IDs per turn all captured as span attributes.
ai-evaluation: 70+ built-in eval templates including groundedness, chunk_attribution, chunk_utilization, context_relevance, audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion. Unlimited custom evaluators authored by an in-product agent that calibrate from human review feedback. In-house classifier models tuned for the LLM-as-judge cost/latency tradeoff. Programmatic eval API. Apache 2.0.
agent-opt: six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) exposed via Dataset UI and SDK against live trace data. Tunes query rewriter and reranker prompts on explicit run with human approval.
Future AGI Protect: Gemma 3n foundation with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance). Multi-modal text, image, audio. Sub-100ms inline. ProtectFlash single-call binary classifier path. PII and PHI scrubbing on the retrieval query before it hits the vector store.
Agent Command Center: RBAC, SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 certified, AWS Marketplace, multi-region hosted, 15+ provider routing.
Error Feed: the clustering and what-to-fix layer over your traces and evals. Zero-config auto-clusters retrieval failures into named issues with root cause, quick fix, and long-term recommendation.

The closed loop (trace, eval, cluster, optimize, re-deploy) is the differentiator. agent-opt reads the failing clusters from Error Feed and proposes prompt variants whose expected context-relevance is higher.

Two deliberate tradeoffs

Async eval gating is explicit. agent-opt requires an explicit run plus a human approval gate before any prompt rewrite ships. FAGI never auto-rewrites the retrieval prompt in production without human approval. Intentional design.

Native voice obs ships for Vapi, Retell, and LiveKit out of the box. Enable Others mode covers the rest via traceAI SDK or webhook, which covers 90%+ of production stacks. The dashboards are actively iterated every release. Recent shipped work includes multi-step Agent Definition UX, Prompt Workbench Revamp, redesigned Run Test performance metrics, Show Reasoning column in Simulate, sticky filters in Observe, and Error Localization that pinpoints the failing turn.

Query rewriting for voice

The query rewriter is the highest-impact part of voice RAG. A good rewriter turns a noisy partial transcript (“uh what about, like, the return policy thing”) into a clean retrieval query (“return policy”). A bad rewriter passes the noise through and retrieval picks up the wrong chunks.

The rewriter prompt needs three things:

Conversation context. The rewriter sees the previous two turns of the conversation so it can resolve pronouns and references. “What about that one?” needs the previous turn’s mention of a specific product to make sense.

Intent classification. The rewriter tags the rewritten query with the intent (FAQ, transactional, complaint). The intent tag conditions the retrieval index used; FAQ intent queries hit a smaller, more curated knowledge base than open-ended ones.

Conservative rewriting. The rewriter should not invent details the caller didn’t say. If the caller’s partial transcript is genuinely ambiguous, the rewriter should pass the partial through and let the agent ask a clarifying question on the next turn.

REWRITER_PROMPT = """You rewrite a noisy partial voice transcript into a clean retrieval query.

Rules:
- Resolve pronouns from the previous two turns.
- Strip filler words and self-corrections.
- Do not invent details the user did not say.
- Output the rewritten query plus an intent tag.

Previous turns:
{previous_turns}

Partial transcript:
{partial}

Output JSON: {{"query": "...", "intent": "faq|transactional|complaint|other"}}"""

The rewriter is the natural target for agent-opt. Pick one of the six optimizers (Bayesian Search, Meta-Prompt arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto arXiv 2507.19457, Random Search arXiv 2311.09569, PromptWizard) against the live trace data, optimize for context-relevance score, and the rewriter prompt improves on each explicit optimization run with human approval.

Handling the “I don’t know” case

The single most damaging failure mode for a RAG-powered voice agent is hallucinating an answer when the retrieval didn’t actually land anything relevant. The fix is a hard-coded “I don’t know” path keyed off the reranker’s top score.

If the top reranked chunk’s relevance score is below a threshold (typically 0.4 to 0.5 depending on the reranker), the grounded LLM should refuse to answer and offer to take a message or escalate. The threshold is tuned per intent: FAQ intents tolerate a lower threshold because the chunks are dense; transactional intents need a higher threshold because the cost of a wrong answer is higher.

ai-evaluation’s faithfulness rubric catches the cases where the agent answered despite low relevance. Error Feed clusters these into a named issue with a quick fix: raise the threshold for the affected intent.

Multimodal retrieval

For some workloads the knowledge base isn’t pure text. Product images, technical diagrams, audio clips, and video transcripts all carry signal that text retrieval misses. The Future AGI Protect model family is multi-modal across text, image, and audio in a single model family, which lets you redact PII across modalities before retrieval.

The retrieval layer itself can run multimodal embeddings (CLIP, OpenCLIP for image, or vision-capable embedding APIs) so the agent retrieves the right product image when the caller describes a visual problem (“the red light on the front panel”). The grounded LLM then references the image content in its spoken answer.

This is still a less-common pattern in 2026 but it’s the direction RAG-powered voice is heading. The eval rubrics (faithfulness, context-relevance, citation correctness) work the same way; the only change is that the “context” now contains image embeddings plus text.

A note on OpenAI Realtime and Gemini Live

Speech-to-speech APIs from OpenAI and Google accept tool calls, which is the hook for RAG. The tradeoff is that the tool-call round-trip adds latency that traditional ASR-LLM-TTS pipelines can hide via parallel retrieval.

For RAG-heavy workloads many teams run a hybrid: streaming ASR (Deepgram or AssemblyAI), parallel retrieval triggered on partial transcripts, grounded LLM call (GPT-4o or Claude 3.7 or Gemini 2.0 in standard non-speech mode), streaming TTS (Cartesia or ElevenLabs). The hop count is higher but the retrieval budget is predictable.

For lighter RAG (one or two well-formed knowledge-base lookups per call) the speech-to-speech API plus a tool call works. The eval rubrics are the same either way.

Chunking strategy for voice answers

Voice answers are short. The chunking strategy that works for a text RAG dashboard does not work for a voice RAG agent that speaks one or two sentences at a time. Three things change:

Smaller chunks. 200 to 400 tokens per chunk is the sweet spot for voice. Bigger chunks waste prompt budget on context the LLM won’t use; smaller chunks force the reranker to do more work per call.

Sentence-level metadata. Every chunk carries the source document title, section heading, and last-updated date as metadata. The grounded LLM uses this to disambiguate when two chunks contradict each other (“the older policy says X but the newer policy says Y”).

Pre-summarized chunks for common intents. For high-volume intents (return policy, business hours, pricing tier), pre-write a one-sentence summary chunk and put it at the top of the index for that intent. The agent quotes the summary directly; the citation marker still points to the underlying full chunk for audit.

The chunking choices show up in the context-relevance and faithfulness scores. Bad chunking usually means context-relevance is fine (you retrieved the right document) but faithfulness drifts (the LLM had to summarize a 1,500-token chunk into one sentence and lost a detail).

Hybrid retrieval: vector + BM25

Pure vector retrieval misses some queries that BM25 catches and vice versa. For voice the difference shows up most on specific product names, SKU numbers, named entities, and proper nouns. A vector search on “the EX-720 model” might pull back chunks about similar models; a BM25 search lands the exact product page. The choice of vector database for the RAG layer matters less here than the round-trip and reranker budget.

The pattern we recommend is parallel hybrid retrieval. Vector and BM25 both run on the same partial transcript. The reranker combines the two result sets and prunes to top-3. The reranker is what decides which retrieval path won for this query.

async def hybrid_retrieve(query, top_k=20, top_n=3):
    vector_task = asyncio.create_task(vector_search(query, top_k=top_k))
    bm25_task = asyncio.create_task(bm25_search(query, top_k=top_k))
    vector_hits, bm25_hits = await asyncio.gather(vector_task, bm25_task)
    merged = dedup_by_chunk_id(vector_hits + bm25_hits)
    return await rerank(query, merged, top_n=top_n)

The latency cost is the slower of the two retrievals, not the sum. Both run in parallel. For most production setups the budget stays under 300ms even with the dedup and rerank steps.

Caching for voice RAG

Voice agents talk to similar callers asking similar questions. A request-level cache that maps (rewritten query, intent tag) to (top-3 chunks) cuts retrieval cost dramatically on FAQ-heavy workloads.

The cache key wants to be semantic, not literal. Two callers asking “what are your hours” and “when do you open” should hit the same cache entry. The pattern: hash the rewritten query plus the intent tag with a sentence-transformer embedding, look up the cache by embedding distance with a low-similarity threshold.

The eval rubrics keep working on cached responses because the cache is on the retrieval side, not on the LLM side. The LLM still gets fresh context-grounded generation per turn; the chunks just came from cache instead of from the vector store.

Example metrics to track after a trace-eval-optimize loop

A mid-market SaaS company shipped a RAG-powered voice support agent across 12 knowledge-base sources. After the four-week trace-eval-cluster-optimize loop:

End-to-end p50 latency: 720ms (down from 1.1s baseline).
Context-relevance: 0.91 (up from 0.74).
Faithfulness: 0.94 (up from 0.82).
Citation correctness: 0.89 (up from 0.61).
First-call resolution: 87% (up from 71%).

The biggest single lift came from running agent-opt against the query rewriter prompt for two weeks. The optimizer found a variant that re-anchored ambiguous pronouns from earlier turns, which lifted context-relevance by 11 points and pulled faithfulness up by 8 points downstream.

How to Implement Voice AI Observability in 2026: the underlying instrumentation pattern for any voice stack.
Voice AI Evaluation Infrastructure: Developer’s Guide: the rubric library and judge model tradeoffs.
Advanced Chunking Techniques for RAG: the upstream retrieval quality work.
Agentic RAG Systems: the broader agentic retrieval surface.

Sources and references

arXiv 2510.13351: Future AGI Protect model family (arxiv.org/abs/2510.13351)
OpenInference specification: OpenTelemetry GenAI semantic conventions
Future AGI trust page: futureagi.com/trust
traceAI repository: github.com/future-agi/traceAI
ai-evaluation repository: github.com/future-agi/ai-evaluation
agent-opt repository: github.com/future-agi/agent-opt
Error Feed docs: docs.futureagi.com/docs/observe

Frequently asked questions

Why is RAG harder in voice than in text?

Three reasons. Latency: retrieval adds 200 to 800ms to the critical path, which can blow the sub-800ms first-response budget. Streaming: text RAG can wait for the full query; voice RAG has to start retrieval on a partial transcript. Grounding: voice answers are short and conversational, which means citation tracking has to happen at the claim level, not the document level. The fix is parallel retrieval, partial-transcript triggering, and per-claim citation markers.

What's the recommended architecture for streaming voice RAG?

ASR streams partial transcripts. The moment a partial clears a confidence threshold (70 to 80%), retrieval kicks off in parallel with LLM warm-up. Vector search returns top-k; a reranker prunes to top-3. The grounded LLM emits the response with per-claim citation markers tied to chunk IDs. The TTS layer strips markers for the spoken response. The trace keeps them. ai-evaluation scores `groundedness`, `context_relevance`, `context_adherence`, `chunk_attribution`, and `chunk_utilization` on every call.

What vector database should I use for voice RAG?

Any sub-100ms p95 vector store works: Pinecone, Weaviate, Qdrant, pgvector with HNSW, Chroma. The constraint isn't the vector store; it's the round-trip plus reranker. Most production voice deployments use a fast vector store plus a small reranker (Cohere Rerank, bge-reranker) to keep the retrieval budget under 300ms total. The reranker is what makes context-relevance scores stay high.

How do I evaluate retrieval for a voice agent?

Four named rubrics ship as built-in templates in ai-evaluation: `groundedness` for whether the response is grounded in retrieved evidence, `chunk_attribution` for per-chunk source attribution, `chunk_utilization` for whether retrieved chunks are actually used, and `context_relevance` for whether the retrieval pulled the right chunks. For voice-specific audio input, `MLLMAudio` test cases accept 7 formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma). Run them on every call so retrieval drift surfaces in real time.

Can I use OpenAI Realtime or Gemini Live with RAG?

Yes with caveats. OpenAI Realtime and Gemini Live both accept tool calls, which is the hook for retrieval. The tradeoff is that the speech-to-speech path adds latency overhead to tool calls that traditional ASR-LLM-TTS pipelines don't have. For RAG-heavy workloads many teams run a hybrid: speech-to-text via streaming ASR, RAG against the partial transcript, grounded LLM call, streaming TTS. The hop count is higher but the retrieval budget is more predictable.

How do I optimize retrieval prompts for a voice agent?

Use agent-opt, which ships six prompt optimizers — Bayesian Search, Meta-Prompt (arXiv 2505.09666), ProTeGi, GEPA Genetic-Pareto (arXiv 2507.19457), Random Search (arXiv 2311.09569), and PromptWizard — exposed via both Dataset UI (select dataset, evaluator, optimizer, review candidate prompts in the dashboard) and SDK. The optimizer reads the production traces, identifies queries where context-relevance or groundedness scored low, and proposes variations to the query rewriter and reranker prompts. Each variant is evaluated against the production distribution. FAGI never auto-rewrites prompts without an explicit run and a human approval gate.

What about privacy and PII in retrieval?

The Future AGI Protect model family can run inline Data Privacy and Prompt Injection checks before queries hit the vector store; cite arXiv 2510.13351 for the Gemma 3n + LoRA adapter architecture. ProtectFlash gives a single-call binary classifier path. For regulated workloads, use Protect Data Privacy/PII checks plus HIPAA-certified hosted deployment or BYOC where required. Self-hosted means the queries never leave your VPC; hosted means SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified posture per the trust page.

View all

Guides

Evaluating RAG Chunking Strategies in 2026

Chunking is a domain question. Fixed-size loses on legal, semantic wins prose, clause-level wins contracts, late-interaction wins code.

NVJK Kartik · May 14, 2026

13 min

Guides

5 Best AI Answering Services in 2026 (Tested + Ranked)

Top 5 AI answering services in 2026 ranked on setup speed, integrations, and reliability. Honest tradeoffs plus 2 honorable mentions for SMB owners.

Vrinda Damani · Apr 23, 2026

11 min

Guides

IVR Modernization: Migrate Legacy IVR to AI Voice Agents in 2026

A step-by-step IVR modernization playbook for 2026: audit legacy flows, pick a runtime, simulate, deploy, observe. Migrate DTMF menus to AI voice agents.

NVJK Kartik · Mar 26, 2026

13 min

TL;DR: the five-step build

Why voice RAG is harder than text RAG

The architecture

Step 1: Stream ASR with partial transcripts

Step 2: Parallel retrieval and LLM warm-up

Step 3: Ground the LLM with per-claim citation markers

Step 4: Score every call on the three rubrics

Step 5: Optimize retrieval prompts against live trace data

Latency budget math

Common failure modes the eval catches

The Future AGI stack on voice RAG

Two deliberate tradeoffs

Query rewriting for voice

Handling the “I don’t know” case

Multimodal retrieval

A note on OpenAI Realtime and Gemini Live

Chunking strategy for voice answers

Hybrid retrieval: vector + BM25

Caching for voice RAG

Example metrics to track after a trace-eval-optimize loop

Related reading

Sources and references

Frequently asked questions