What Is Caption Hallucination?

A voice AI failure where captions or transcripts include words, entities, or claims not supported by the source audio.

Caption hallucination is a voice AI failure where a captioning or automatic speech recognition system invents text that was not spoken in the audio. It is a voice-evaluation issue that appears in transcript pipelines, production call traces, and agent workflows that treat captions as evidence. FutureAGI measures it with CaptionHallucination so teams can catch unsupported words, names, numbers, or claims before they affect retrieval, tool calls, compliance logs, or the final spoken answer.

Why Caption Hallucination Matters in Production Voice Agents

Caption hallucination turns an audio problem into a trust problem. A caller says “I do not authorize the transfer,” the transcript inserts “I authorize the transfer,” and the agent now reasons over false evidence. Typical failure modes include unsupported transcript insertion, false entity capture, audit-trail corruption, and downstream tool misfire. These are worse than ordinary ASR substitutions because the transcript can look fluent and carry high confidence.

The pain lands across the team. Developers chase prompt bugs even though the caption layer created the bad input. SREs see longer calls, higher retry counts, and p99 time-to-first-audio drift as the agent asks unnecessary clarifying questions. Compliance teams lose a reliable record of what the user actually said. Product teams see call abandonment when customers repeat themselves or hear the agent act on words they never spoke.

The symptoms show up in logs as impossible entities, captions containing policy or account terms absent from the raw audio, low agreement between transcript variants, and task failures clustered in noisy channels or overlapping speech. In a 2026 voice stack, a single call can pass through ASR, turn detection, retrieval, tool calling, policy checks, and TTS. Once hallucinated caption text enters that chain, every later component can look correct while operating on invented input.
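
One cheap detector for the low-agreement symptom is to run two transcript variants, for example two ASR providers or two decoding passes, and flag words that appear in only one of them. This is an illustrative sketch, not a FutureAGI API:

from difflib import SequenceMatcher

def disputed_words(variant_a: str, variant_b: str) -> list[str]:
    # Words present in one variant but absent from the aligned span of the
    # other are hallucination candidates worth routing to deeper review.
    a, b = variant_a.lower().split(), variant_b.lower().split()
    flagged = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
        if op in ("replace", "delete"):
            flagged.extend(a[i1:i2])  # in variant A, unsupported by B
        if op in ("replace", "insert"):
            flagged.extend(b[j1:j2])  # in variant B, unsupported by A
    return flagged

print(disputed_words(
    "i do not authorize the transfer",
    "i authorize the transfer",
))  # ['do', 'not']: the consent phrase itself is in dispute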

How FutureAGI Handles Caption Hallucination

FutureAGI implements this check as the CaptionHallucination evaluator, listed in the FAGI inventory as the caption_hallucination cloud eval template. In a voice workflow, the term shows up at the boundary between source audio, generated captions, normalized transcripts, and the agent trace. Engineers attach the raw audio path, generated caption text, optional reference transcript, final agent response, and call outcome to the same eval run.
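
The exact payload shape depends on the pipeline, but as an illustration, the evidence for one call turn can be bundled into a single record like this. The field names here are hypothetical, not a fixed FutureAGI schema:

from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class CaptionEvalRecord:
    # Everything needed to tie a caption back to its evidence.
    audio_path: str                       # raw source audio for the turn
    caption: str                          # generated caption under test
    reference_transcript: Optional[str]   # ground truth, when available
    agent_response: str                   # what the agent said or did next
    call_outcome: str                     # e.g. "completed", "escalated"

record = CaptionEvalRecord(
    audio_path="calls/payment-change.wav",
    caption="The caller authorized a payment change.",
    reference_transcript=None,
    agent_response="Your payment date has been moved.",
    call_outcome="completed",
)
print(asdict(record))  # one row per call turn in the eval run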

A practical example is a banking voice agent that verifies a caller before changing a payment date. The team runs nightly LiveKitEngine simulations with accents, barge-in, background speech, and low-bandwidth audio. FutureAGI scores each generated caption with CaptionHallucination, then pairs the result with ASRAccuracy, TaskCompletion, and the traceAI livekit integration so the dashboard can show whether unsupported caption text changed a tool argument or final answer.

FutureAGI treats caption hallucination as an evidence mismatch, not just a transcript-quality average. Unlike a standalone WER report from an ASR benchmark, this approach links each hallucinated phrase to the exact call turn and downstream decision. If the hallucination rate crosses a threshold for noisy mobile calls, the engineer can block the ASR provider rollout, add cohort-specific regression scenarios, route high-risk calls to human review, or require a clarification step before tool execution.
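
A release gate on top of those scores can be a simple per-cohort rate check. This sketch assumes each evaluated call carries a cohort label and a boolean hallucination verdict; both field names are hypothetical:

from collections import defaultdict

# Maximum tolerated share of calls with unsupported caption text, per cohort.
THRESHOLDS = {"noisy_mobile": 0.02, "clean_desktop": 0.01}

def blocked_cohorts(results: list[dict]) -> list[str]:
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["cohort"]] += 1
        hits[r["cohort"]] += r["hallucinated"]
    # Cohorts whose hallucination rate breaks the budget block the rollout.
    return [c for c in totals
            if hits[c] / totals[c] > THRESHOLDS.get(c, 0.01)]

print(blocked_cohorts([
    {"cohort": "noisy_mobile", "hallucinated": True},
    {"cohort": "noisy_mobile", "hallucinated": False},
]))  # ['noisy_mobile']: hold the ASR provider rollout for this cohort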

How to Measure or Detect Caption Hallucination

Measure caption hallucination with evidence checks and downstream impact, not only transcript similarity:

  • CaptionHallucination: FutureAGI evaluator for detecting caption text unsupported by the source audio or reference transcript.
  • ASRAccuracy: companion evaluator that scores broader speech-to-text fidelity, useful for separating insertions from substitutions or deletions.
  • Word insertion rate: count added words per reference word, sliced by language, accent, device, codec, and background-noise cohort (a small alignment sketch follows the evaluator example below).
  • Transcription confidence mismatch: high confidence on unsupported phrases is a stronger release blocker than low-confidence noise.
  • Trace correlation: join hallucination outcomes to TaskCompletion, tool arguments, escalation rate, and p99 time-to-first-audio.
  • Human review proxy: sample failed calls where the caption includes account IDs, dollar amounts, medical terms, consent language, or names.
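
A minimal run of the evaluator on a single call:
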
from fi.evals import CaptionHallucination

# Score one generated caption against its source audio.
caption_eval = CaptionHallucination()
result = caption_eval.evaluate(
    audio_path="calls/payment-change.wav",
    caption="The caller authorized a payment change."
)
print(result.score)  # hallucination score for this caption
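
The word insertion rate from the checklist above can be approximated with a standard token alignment against a reference transcript. This is an illustrative sketch, not a FutureAGI API:

from difflib import SequenceMatcher

def word_insertion_rate(reference: str, caption: str) -> float:
    # Inserted words are caption tokens that align to nothing in the
    # reference ('insert' opcodes); substitutions are left to
    # ASRAccuracy-style metrics.
    ref, hyp = reference.lower().split(), caption.lower().split()
    inserted = sum(
        j2 - j1
        for op, i1, i2, j1, j2 in SequenceMatcher(a=ref, b=hyp).get_opcodes()
        if op == "insert"
    )
    return inserted / max(len(ref), 1)  # insertions per reference word

rate = word_insertion_rate(
    "please move my payment date",
    "please move my payment date to friday",
)
print(rate)  # 0.4: two inserted words over five reference words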

Set thresholds by workflow. A meeting-summary caption can tolerate minor filler-word drift; a financial, healthcare, or identity-verification agent should fail release on a single unsupported consent phrase.

Common Mistakes

Most caption hallucination bugs survive because teams measure clean transcripts instead of evidence.

  • Treating WER as enough. WER can miss rare inserted entities that carry the whole business risk.
  • Scoring only normalized transcripts. Normalization may remove the hallucinated phrase before evaluation sees it.
  • Trusting confidence blindly. A hallucinated caption can be fluent, timestamped, and high confidence.
  • Reviewing only failed calls. Successful-looking calls may contain false transcript evidence that silently changed a tool argument.
  • Ignoring overlap and barge-in. Cross-talk is a common source of invented captions in live voice agents.

Frequently Asked Questions

What is caption hallucination?

Caption hallucination is a voice AI failure where a captioning or ASR system invents words, entities, or claims that were not spoken in the audio. FutureAGI evaluates it with CaptionHallucination and voice trace evidence.

How is caption hallucination different from ASR accuracy?

ASR accuracy measures overall transcript fidelity. Caption hallucination focuses on unsupported insertions: transcript content that appears confident but has no audio evidence.

How do you measure caption hallucination?

Use FutureAGI's CaptionHallucination evaluator on audio, generated captions, and reference transcripts when available. Pair it with ASRAccuracy, transcription confidence, and failed-call review by cohort.