Voice AI for Insurance: Claims Intake, Underwriting Calls, and Compliance in 2026

How to deploy voice AI across insurance workflows in 2026. FNOL intake, claims status, underwriting Q&A, dispatch, renewals, fraud triage, compliance.

April 9, 2026

Updated May 19, 2026

Insurance is the third-largest voice AI vertical after support and banking, and it’s the one where empathy and compliance collide on every call. A First Notice of Loss (FNOL) caller has just had an accident. An underwriting follow-up touches non-public personal data. A renewals outbound call has to clear state-by-state telemarketing rules. The runtime is the easy part. The hard part is the eval, guardrail, and audit layer that lets the deployment ship under fifty different state insurance departments at once.

TL;DR (the six production insurance workflows)

FNOL intake. Auto, home, and health initial-loss capture. Empathy-first, structured-schema output, longest single-call duration.
Claims status. Where is my claim, what documents do you need, when does the check arrive. Highest volume, highest containment.
Underwriting Q&A. Supplemental application questions, follow-up clarifications, document collection prompts.
Agent dispatch. Triage and schedule in-person adjusters for inspections and appraisals.
Renewals and cross-sell. Outbound check-in with existing customers, coverage review, complementary product offer.
Fraud triage. Flag suspicious language patterns and timing anomalies, route to SIU without accusing.

The runtime layer is Vapi or Retell for orchestration, Deepgram Nova-3 for STT, ElevenLabs or Cartesia for empathetic TTS, and Twilio or Telnyx for telephony. The eval, observability, simulation, and guardrail layer is Future AGI. The dedicated section below explains how that lands.

Why insurance is harder than generic support

Legacy IVR still dominates insurance phone trees. Voice AI takes the triage and routing layer first, not the adjudication layer. Most carriers run voice AI alongside human-agent fallback rather than as full replacement. Five constraints make insurance voice fundamentally different from generic customer support.

First, the caller is often in distress. FNOL calls land within minutes of an accident, a break-in, or a hospital admission. The agent’s first turn sets the tone for the whole claim relationship. A dismissive line, a robotic cadence, or an inappropriate offer to upsell on a fresh accident call shows up in NPS scores six months later.

Second, every state writes its own consumer-protection rules. California’s claims-handling regulation differs from Texas’s, which differs from New York’s. The carrier already maintains the per-state matrix. The voice agent has to honor it without growing fifty parallel prompt graphs.

Third, recording consent is a patchwork. One-party consent states (most of them) allow recording once a single party to the call consents, so a carrier-side disclosure is enough. Two-party (all-party) consent states (including California, Florida, Illinois, Maryland, Massachusetts, Montana, Nevada, New Hampshire, Pennsylvania, Washington) require explicit caller consent before recording starts. The disclosure prompt has to fire before any captured audio reaches storage.

Fourth, fraud is omnipresent and asymmetric. A real claim wrongly flagged is a customer-experience event. A real fraud cleanly captured saves money. Voice AI is the wedge for triage, never for accusation. The script forbids the agent from naming fraud, changing tone toward the caller, or denying a claim on the call.

Fifth, the audit trail is examined by state market-conduct examiners on a multi-year cycle. The carrier has to reconstruct any call on demand. The trace has to survive the call, the QA cycle, and the next exam four years out.

Workflow 1: First Notice of Loss intake

FNOL is the longest single-call workflow on this list and the most consequential. The agent has minutes to capture a structured loss report from a caller whose adrenaline is still elevated.

Design pattern

The agent opens with an empathy-first greeting (acknowledge the situation, confirm caller safety), discloses recording per state, and confirms basic identity (policy number plus one secondary factor). Then the structured capture begins:

Date and time of loss
Location of loss (address or coordinates)
Type of loss (collision, theft, fire, water, liability, injury)
Parties involved (names, contact, role)
Vehicles or property affected (year, make, model, VIN, address, photos pending)
Damages and estimated severity
Injuries (yes or no, severity, hospital if known)

The agent reads back the captured schema before closing, gives the claim number, sets next-step expectations (adjuster contact within X hours per state requirement), and offers a callback if the caller remembers more.

Eval rubrics

task_completion: did the agent capture all seven FNOL fields.
conversation_resolution: the caller hung up understanding their claim number and next step.
is_polite and tone: empathy in the opening and through stressful turns.
pii: no echoing of full SSN, full account, or full card numbers.
data_privacy_compliance: no disclosure of policy details to a non-policy caller.
is_compliant: the state-specific recording disclosure fired.

Custom evaluators that earn their keep

Two custom evaluators do most of the heavy lifting on FNOL.

claim-info-completeness scores whether the agent captured all required FNOL fields. The carrier supplies the schema (the seven fields above plus any line-of-business additions). The evaluator runs against the trace and returns a per-field score plus an aggregate. Coverage gaps become a fix-prompt signal in the next sprint.
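
The per-field score plus aggregate described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Future AGI evaluator itself: the field names, the `claim_info_completeness` function, and the rule that blank strings count as missing are all assumptions made for the example.

```python
from typing import Any, Mapping

# Hypothetical field names mirroring the seven FNOL fields in the design pattern.
REQUIRED_FNOL_FIELDS = (
    "loss_datetime",
    "loss_location",
    "loss_type",
    "parties",
    "affected_property",
    "damages",
    "injuries",
)


def _is_captured(value: Any) -> bool:
    """Treat None, blank strings, and empty collections as not captured."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) > 0
    return True  # e.g. injuries=False is a valid captured answer


def claim_info_completeness(captured: Mapping[str, Any],
                            required=REQUIRED_FNOL_FIELDS) -> dict:
    """Return a per-field score (1.0 or 0.0), the aggregate, and the gaps."""
    per_field = {f: 1.0 if _is_captured(captured.get(f)) else 0.0 for f in required}
    aggregate = sum(per_field.values()) / len(required)
    missing = [f for f, score in per_field.items() if score == 0.0]
    return {"per_field": per_field, "aggregate": aggregate, "missing": missing}
```

The `missing` list is the part a prompt-iteration sprint would consume: it names which capture loop the agent skipped rather than only reporting a lower score.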

empathy-tone scores whether the agent acknowledged the caller’s situation appropriately. The carrier supplies the tone-of-voice guide (the same one the live agents are trained on). The evaluator runs against the first three turns and against any turn where the caller signals frustration. Carriers with strong NPS programs use this evaluator to align voice AI with the brand’s emotional standard. Future AGI’s ai-evaluation supports custom evaluators authored by an in-product agent: describe the policy in plain English, the agent produces a runnable evaluator.

Workflow 2: claims status

Highest volume on the list. The caller wants to know where their claim is, what documents are missing, when the adjuster will visit, and when the check is coming. Bounded scope, structured output, the workflow most likely to deflect successfully.

Design pattern

The agent verifies identity (policy number plus secondary factor), retrieves the claim state from the claims system, and reads back the status (assigned, inspection scheduled, estimate pending, payment issued, closed). For document gaps, the agent lists what is missing and offers to text or email a secure upload link. For inspection scheduling, the agent confirms the scheduled date and time. For payment status, the agent confirms the issued date and method but never reads a full check or account number.
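
One way to enforce the "never reads a full check or account number" rule is to mask identifiers before they reach the response template. The sketch below is illustrative only; `mask_identifier` and `payment_status_line` are hypothetical helper names, and the last-four convention is an assumption rather than a stated carrier policy.

```python
def mask_identifier(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Keep only the last `visible` alphanumeric characters of an identifier."""
    chars = [c for c in value if c.isalnum()]
    if len(chars) <= visible:
        # Too short to reveal anything safely: mask everything.
        return mask_char * len(chars)
    return mask_char * (len(chars) - visible) + "".join(chars[-visible:])


def payment_status_line(issued_date: str, method: str, account_number: str) -> str:
    """Build the spoken status line without exposing the full account number."""
    tail = mask_identifier(account_number)[-4:]
    return (
        f"Your payment was issued on {issued_date} by {method} "
        f"to the account ending in {tail}."
    )
```

Masking at the template layer means the full number never enters the text sent to TTS, so the pii rubric acts as a second check rather than the only one.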

Eval rubrics

task_completion: the caller got the status they needed.
conversation_resolution: the call closed without transfer.
is_helpful and is_polite: tone and CSAT proxies.
pii: no full-PII echo.
is_factually_consistent: the status read matches the claims-system state.

Measure claims-status containment against the carrier’s IVR baseline, first through a controlled pilot and then through post-launch tuning. Once the targeted intent classes stabilize, savings are denominated in adjuster minutes plus call-center seat hours.

Workflow 3: underwriting Q&A

Underwriting calls split between supplemental application questions (driver history details, prior loss disclosures, property features the digital form did not capture) and follow-up clarifications during the underwriting review. Voice AI fits the supplemental and follow-up layer; binding stays with licensed humans.

Design pattern

The conversation is templated by product line (personal auto, homeowners, term life, small commercial). Each section has a prompt, an expected schema, and a validation step. Disclosure language for fair credit reporting and consumer notice runs verbatim at the appropriate points. The structured output feeds the underwriting system.
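
A verbatim-disclosure check is simple to approximate outside any eval platform. The sketch below assumes the agent's turns are available as plain strings and that case and punctuation differences from ASR or TTS text normalization should be tolerated; `disclosure_read_verbatim` is a hypothetical name, and the check is a stand-in for, not a copy of, the prompt_adherence rubric.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower().replace("’", "'")
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def disclosure_read_verbatim(canonical: str, agent_turns: list[str]) -> bool:
    """True if the canonical disclosure appears word-for-word in the agent's speech."""
    needle = _normalize(canonical)
    if not needle:
        return False
    haystack = _normalize(" ".join(agent_turns))
    # Pad with spaces so the match respects word boundaries.
    return f" {needle} " in f" {haystack} "
```

A paraphrase that preserves the meaning still fails this check, which is the point: required disclosure language is scored on wording, not intent.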

Eval rubrics

task_completion: the supplemental questionnaire covered all required sections.
is_compliant: the disclosure-required script was followed.
prompt_adherence: the agent did not paraphrase the disclosure language.
is_factually_consistent: the summary-back of captured data matches what the customer said.

Underwriting Q&A is where most carriers introduce custom evaluators on top of the built-ins. The custom evals encode the carrier’s specific licensure boundary (preliminary qualification language, no binding commitment, no rate guarantees, no agent-style sales pitch). Future AGI’s in-product agent authors these from a plain-English policy description.

Workflow 4: agent dispatch

Caller needs an in-person adjuster, body shop appointment, or independent medical exam. The voice agent triages the request and schedules the dispatch.

Design pattern

The agent confirms the claim is active, retrieves the territory and available adjusters or vendors, offers two or three slots, captures the caller’s slot preference and contact details, and writes the scheduled appointment back to the dispatch system. For high-value or complex losses, the agent routes to a human dispatcher rather than self-scheduling.
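
The routing decision in this pattern can be sketched as a small pure function. Everything named here is hypothetical: the `DispatchRequest` shape, the dollar threshold, and the choice to hand inactive claims to a human are assumptions for illustration, and a carrier would substitute its own rules.

```python
from dataclasses import dataclass

# Hypothetical threshold; the carrier defines what counts as high-value.
HUMAN_REVIEW_THRESHOLD_USD = 25_000


@dataclass
class DispatchRequest:
    claim_active: bool
    estimated_loss_usd: float
    is_complex: bool
    available_slots: list[str]


def plan_dispatch(req: DispatchRequest, max_offers: int = 3) -> dict:
    """Decide whether the agent self-schedules or hands off to a human dispatcher."""
    if not req.claim_active:
        return {"action": "route_to_human", "reason": "claim_not_active"}
    if req.is_complex or req.estimated_loss_usd >= HUMAN_REVIEW_THRESHOLD_USD:
        return {"action": "route_to_human", "reason": "high_value_or_complex"}
    if not req.available_slots:
        return {"action": "route_to_human", "reason": "no_slots"}
    # Offer at most two or three slots, per the design pattern.
    return {"action": "offer_slots", "slots": req.available_slots[:max_offers]}
```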

Eval rubrics

task_completion: the appointment is in the dispatch system.
conversation_resolution: the caller has a confirmed date and time.
is_helpful: tone matched the workflow.
is_compliant: the agent did not promise anything beyond scheduling.

Workflow 5: renewals and cross-sell outbound

Outbound check-in with existing customers thirty to sixty days before renewal. The agent reviews coverage, asks about life changes that affect coverage (new car, new driver, new property, new dependent), and offers a complementary product where appropriate.

Design pattern

The agent identifies itself and the carrier, discloses the call purpose, confirms the customer wants to continue, runs the renewal check-in script, and routes any binding question to a licensed human. Outbound has to clear state-level telemarketing rules and the carrier’s internal do-not-call list.
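
A pre-dial gate for outbound renewals might look like the following sketch. It assumes US ten-digit numbers, an internal do-not-call set that already holds normalized numbers, and a calling window expressed in the customer's local hour; the window values and the `may_dial` name are placeholders, since the real limits come from the carrier's per-state matrix.

```python
import re
from datetime import datetime


def normalize_phone(raw: str) -> str:
    """Strip formatting and a leading US country code."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def may_dial(raw_number: str, dnc_list: set[str], local_time: datetime,
             window: tuple[int, int] = (9, 20)) -> tuple[bool, str]:
    """Return (allowed, reason). `window` is a hypothetical local-hour range."""
    number = normalize_phone(raw_number)
    if len(number) != 10:
        return (False, "invalid_number")
    if number in dnc_list:
        return (False, "on_internal_dnc")
    start, end = window
    if not (start <= local_time.hour < end):
        return (False, "outside_calling_window")
    return (True, "ok")
```

Running the check at dial time rather than at list-build time guards against a stale list sync, since each call is tested against whatever do-not-call set is current.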

Eval rubrics

is_polite: tone in an unsolicited outbound context.
task_completion: renewal check-in completed.
is_compliant: telemarketing disclosure fired.
prompt_adherence: the agent did not stray into binding language or rate quotes.
tone: no high-pressure sales cadence.

Renewals is where calibrated honesty matters most. Voice AI in 2026 handles the check-in. A licensed agent handles the bind. The script enforces the handoff.

Workflow 6: fraud triage

The hardest design problem on the list because the agent has to detect without accusing.

Design pattern

The agent runs the FNOL or claims-status workflow normally. Behind the scenes, every turn scores against a fraud-language evaluator and against timing-anomaly heuristics (pause-on-factual-recall, rehearsed phrasing, inconsistency with prior policy data, reference to coverage limits the caller should not know). The score never changes the caller experience. High-score calls drop into a special investigations unit queue post-call. The agent never says the word fraud, never accuses, never alters tone toward the caller.
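
The separation between scoring and caller experience can be made structural. In the sketch below, per-turn scores are assumed to come from the carrier's own fraud-language evaluator (not shown), the threshold is a made-up value, and taking the maximum turn score as the call score is one possible aggregation among several.

```python
from dataclasses import dataclass, field

SIU_THRESHOLD = 0.7  # hypothetical; set by the special investigations unit


@dataclass
class CallFraudState:
    turn_scores: list[float] = field(default_factory=list)

    def record_turn(self, score: float) -> None:
        """Store the score. Returns nothing, so the dialog layer has no signal to act on."""
        self.turn_scores.append(score)

    def call_score(self) -> float:
        """Aggregate per-turn scores; max is an assumption, not a prescribed rule."""
        return max(self.turn_scores, default=0.0)


def post_call_route(state: CallFraudState, threshold: float = SIU_THRESHOLD) -> str:
    """Choose the review queue after the call has ended."""
    return "siu_queue" if state.call_score() >= threshold else "standard_queue"
```

Because `record_turn` returns `None` and routing happens only after hang-up, there is no code path by which a high score could alter the agent's wording or tone mid-call.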

Eval rubrics

prompt_adherence: the agent stayed inside the script (no fraud accusation).
tone: tone consistent across high-fraud-score calls and low-fraud-score calls.
Custom fraud-language evaluator: per-turn score against the carrier’s known-pattern list.

Why a custom evaluator owns fraud triage

The carrier’s special investigations unit maintains the known-pattern list. The list is proprietary and updates as fraud rings rotate language. The custom evaluator is authored from that list and versioned (or updated) whenever SIU policy changes. Built-in rubrics catch generic harmful intent. Fraud detection needs the carrier-specific signal.

The voice stack for insurance

The runtime field narrows quickly when you stack carrier requirements against vendor capabilities.

Runtime: Vapi or Retell

For high-volume claims status, Retell’s hosted pipeline lands first-response p50 around 600ms on US-East. SOC 2 Type II ships as standard. For carriers routing across multiple LLMs (cheap for status checks, premium for FNOL narrative capture), Vapi’s BYO routing wins. Native SIP, broad telephony integration, OpenInference tracing via traceAI.

STT: Deepgram Nova-3

Vehicle, property, and medical terms appear constantly in insurance audio. Deepgram Nova-3 handles Toyota RAV4, 2018 Forester XT, posterior cruciate ligament, and kitchen subfloor with a low word error rate (WER), and it stays robust across the typical FNOL acoustic envelope (a totaled car at roadside, a flooded basement, a busy ER lobby).

TTS: ElevenLabs or Cartesia

Empathetic delivery is a design surface, not a side effect. ElevenLabs ships custom cloned voices with consistent identity across languages. Cartesia ships the lowest-latency streaming TTS on the market today (Sonic family). Both work for FNOL; pick on latency budget and brand-voice fit.

Telephony: Twilio or Telnyx

Both are SOC 2 Type II. Both support SIP. Telnyx leads on direct-route economics. Twilio leads on integration breadth. State-by-state caller ID treatment varies; confirm with the telephony vendor before launching outbound in new states.

Eval, observability, simulation, guardrails: Future AGI

Covered in the next section.

How Future AGI fits the insurance voice stack

Future AGI is the eval, observability, simulation, and guardrail layer that sits underneath the orchestration, STT, TTS, and telephony runtimes. Its products map cleanly to the insurance workload.

traceAI for distributed tracing

30+ documented integrations across Python and TypeScript, OpenInference-compatible spans, Apache 2.0. Every insurance call becomes a trace with ASR span, retrieval span (policy lookup, claims-system retrieval, vendor lookup), LLM span, tool spans (FNOL submit, document request, dispatch schedule, status read), TTS span, and conversation ID linking the whole thing. Tag each trace with claim_id, policy_id, state, and line_of_business for downstream CRM linking and per-tenant audit.

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_pipecat import enable_http_attribute_mapping

register(
    project_type=ProjectType.OBSERVE,
    project_name="Insurance Voice Agent",
    set_global_tracer_provider=True,
)
enable_http_attribute_mapping()

For Vapi or Retell deployments, native voice observability removes the need for the SDK entirely.

Native voice observability for Vapi, Retell, and LiveKit

Add the provider API key plus Assistant ID to a Future AGI (FAGI) Agent Definition. Every captured insurance call can get separate assistant and customer audio downloads, an auto transcript, and selected evals from the same engine. No SDK required for the native dashboard path. “Enable Others” mode covers any voice provider via mobile-number simulation.

ai-evaluation for scoring

70+ built-in rubrics including pii, data_privacy_compliance, is_compliant, task_completion, conversation_resolution, is_polite, is_helpful, tone, prompt_adherence, is_factually_consistent. All Apache 2.0. Custom evaluators (claim-info-completeness, empathy-tone, fraud-language) authored by an in-product agent from carrier policy.

Multi-turn and audio test infrastructure

from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals import Evaluator, ConversationCoherence, ConversationResolution

# Two-turn FNOL opening: each LLMTestCase pairs a caller utterance with the agent's reply.
conv = ConversationalTestCase(messages=[
    LLMTestCase(query="I just had an accident", response="I'm sorry to hear that. Are you safe right now?"),
    LLMTestCase(query="Yes, but my car is totaled", response="Thank you for confirming. Let me take down the details when you're ready."),
])

# The ... placeholders stand in for your own API key and secret key.
ev = Evaluator(fi_api_key=..., fi_secret_key=...)
# Score the whole conversation for coherence and resolution.
result = ev.evaluate(
    eval_templates=[ConversationCoherence(), ConversationResolution()],
    inputs=[conv],
)

For audio eval, MLLMAudio accepts seven formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma) from local file or URL.

from fi.testcases import MLLMTestCase, MLLMAudio

audio = MLLMAudio(url="https://carrier-storage/calls/fnol_2026_04_09_142233.wav")
test_case = MLLMTestCase(input=audio, query="Score this FNOL call for empathy and claim-info-completeness")

Simulation for pre-launch testing

18 pre-built personas plus unlimited custom-authored ones. Custom personas configure name, description, gender, age range (18-25 / 25-32 / 32-40 / 40-50 / 50-60 / 60+), location (US / Canada / UK / Australia / India), personality traits, communication style, accent, conversation speed, background noise (highway, hospital lobby, kitchen with running water), multilingual support (many popular languages), plus custom properties and free-form behavioral instructions. Anxiety level matters for FNOL after a serious loss; fluency matters for cross-state and multilingual calls.

Workflow Builder ships Conversation, End Call, and Transfer Call nodes with branch visibility, and it auto-generates branching scenarios (20, 50, or 100 rows) from an insurance agent definition. A 4-step Run Tests wizard (config → scenarios → eval → execute) drives execution. Error Localization pinpoints the failing turn when a scenario fails. A programmatic eval API lets you configure and re-run suites as part of CI.

The evaluate_function_calling template scores claim systems integration (right tool, right arguments at the right turn), which is critical when FNOL routes to a policy lookup, a vehicle VIN lookup, or a coverage-status check.

Future AGI Protect for inline guardrails

Gemma 3n foundation with LoRA-trained adapters per safety dimension per arXiv 2510.13351. Multi-modal across text, image, and audio. Two surfaces: rule-based Protect across the 4 documented safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance) and ProtectFlash (single-call binary classifier). Sub-100ms inline.

from fi.evals import Protect

# Reuses the test_case built in the audio example above.
p = Protect()
out = p.protect(inputs=test_case)

Run ProtectFlash for the fast harmful/not-harmful guardrail path on the critical voice budget. Keep fraud-language scoring in a custom evaluator or SIU-specific rubric, and run rule-based Protect on every Nth turn (or async) for richer per-rule attribution in the trace.
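
The cadence described above (a fast check on every turn, a richer check on every Nth turn) can be sketched independently of any guardrail SDK. In this illustration `fast_check` and `rich_check` are caller-supplied stand-ins for the two guardrail surfaces; their signatures, and the `guard_turn` helper itself, are assumptions and not the Protect API.

```python
from typing import Callable


def guard_turn(turn_index: int, text: str,
               fast_check: Callable[[str], bool],
               rich_check: Callable[[str], dict],
               every_n: int = 5) -> dict:
    """Run the fast binary check every turn and the richer check every Nth turn."""
    result = {"turn": turn_index, "blocked": fast_check(text), "rule_detail": None}
    if every_n > 0 and turn_index % every_n == 0:
        # The richer per-rule output is attached for trace attribution only.
        result["rule_detail"] = rich_check(text)
    return result
```

Keeping the richer check off most turns protects the latency budget, while the sampled `rule_detail` still gives the trace per-rule context when a reviewer needs it.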

Error Feed for failure clustering

Auto-clusters trace failures into named issues. Auto-writes root cause plus quick fix plus long-term recommendation. For an insurance agent, that means 40 failed FNOL submissions caused by a missing vehicle-VIN field are clustered into one issue. The exam response is one document per cluster, not one per call.

Agent Command Center for hosting and governance

RBAC, SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 certified, AWS Marketplace, multi-region hosted, BYOC self-host, 15+ LLM providers and 100+ models in Agent Command Center routing. HIPAA BAA available on eligible plans for health insurance lines where PHI is in scope (group health, supplemental health, Medicare Advantage, federal flood with health-data tie-ins). Per-team RBAC and per-tenant attribution tags (state, line of business, product) so regulatory reviews segment cleanly.

State-by-state compliance: the layered model

The carrier already maintains a per-state compliance matrix for human agents. The voice AI deployment inherits that matrix rather than re-deriving it.

Recording consent

The agent reads the recording disclosure before any captured audio reaches storage. The disclosure text varies by state. Two-party consent states require explicit caller assent (the agent waits for a yes before continuing). One-party consent states require carrier disclosure only. The disclosure runs verbatim from the canonical text, scored every call via prompt_adherence.
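
The gating logic can be expressed as a small state check that the recording controller consults before persisting audio. This is a sketch under stated assumptions: the state set simply mirrors the two-party list given earlier in this guide, the function names are hypothetical, and the carrier's own compliance matrix remains the source of truth.

```python
# Mirrors the two-party consent states enumerated earlier in this guide;
# in production this would be loaded from the carrier's compliance matrix.
TWO_PARTY_CONSENT_STATES = frozenset(
    {"CA", "FL", "IL", "MD", "MA", "MT", "NV", "NH", "PA", "WA"}
)


def consent_mode(state: str) -> str:
    """Classify a state code as 'two_party' or 'one_party'."""
    return "two_party" if state.upper() in TWO_PARTY_CONSENT_STATES else "one_party"


def may_persist_audio(state: str, disclosure_played: bool, caller_assented: bool) -> bool:
    """Allow storage only after the disclosure, and after assent where required."""
    if not disclosure_played:
        return False
    if consent_mode(state) == "two_party":
        return caller_assented
    return True
```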

NAIC market conduct

The National Association of Insurance Commissioners publishes model rules on fair claims handling. State insurance departments adopt the model rules with local variations. The voice agent honors timely contact, fair settlement, and no-misrepresentation rules across every workflow. Custom evaluators authored from the carrier’s market-conduct playbook score these per call.

GLBA and state privacy

Non-public personal information receives GLBA-class protection. Many states layer additional privacy rules on top (California CPRA, Virginia VCDPA, Colorado CPA, Connecticut CTDPA, Utah UCPA). The pii and data_privacy_compliance rubrics catch echo events. Future AGI Protect redacts PII inline before downstream services receive the payload.

Anti-fraud statutes

Every state requires carriers to investigate and report suspected fraud. None permits the carrier to accuse the caller mid-call. The fraud-triage workflow above implements both halves: detect and route on the back end, never accuse on the front end. The prompt_adherence rubric scores every call against the anti-accusation script.

CCPA and GDPR

Carriers selling in California fall under CCPA. Carriers with EU customers fall under GDPR. Both require data-subject-access response capability. The audit trail (trace, eval history, Protect log) doubles as the response artifact for a DSAR.

Common failure modes in insurance voice deployments

The failure patterns repeat across insurance voice rollouts. Knowing them in advance shortens the cutover.

State-specific disclosure miss. The two-party consent prompt fires after the recording started, or fails to fire on a call routed across a state border. Mitigation: server-side recording control tied to the disclosure-confirmation event, plus prompt_adherence regression every release.
FNOL field omission. The caller goes off-script and the agent forgets to circle back to the missed field. Mitigation: claim-info-completeness custom evaluator on every FNOL trace; a missed-field rate above 5% drives a prompt-iteration sprint.
Empathy drift on long calls. The agent’s tone resets to neutral after a long structured-capture block, missing a callback to the caller’s distress signal. Mitigation: tone plus empathy-tone custom evaluator on every turn after turn 8.
PII echo-back. The agent reads back a full SSN or policy number to confirm capture. Mitigation: prompt-level rule forbidding full-PII echo, plus pii rubric flagging any echo turn, plus inline ProtectFlash blocking on the critical path.
Fraud-tone leakage. The agent’s tone shifts on calls with a high fraud-language score, alerting the caller. Mitigation: tone variance check across fraud-score buckets, plus simulation rehearsal under the SIU pattern library.
Cross-state script confusion. The agent applies California’s prompt to a Texas caller because caller ID routed to the wrong region. Mitigation: per-state tag on every trace, plus a regression suite that exercises every state explicitly.
Renewals telemarketing slip. Outbound renewals dial a number on the carrier’s own do-not-call list because of a stale list sync. Mitigation: pre-dial DNC check at the orchestrator layer, plus daily Error Feed cluster review during the first 60 days.

Each of these has a clean mitigation in the FAGI stack. The simulation suite catches the predictable ones pre-launch; the observability stack catches the long tail post-launch.

Designing the exam-ready dashboard

State market-conduct examiners ask the same questions every cycle. The dashboard that pre-empts those questions saves weeks of exam-prep time.

Top-of-dashboard KPIs

Three KPIs go above the fold:

Compliance violation rate per 10,000 calls. Below 1 per 10,000 is healthy. Above 5 per 10,000 means the rubric set or the prompt is drifting.
PII echo rate per 10,000 calls. A direct measure of the redaction layer plus the prompt discipline. Above 1 per 10,000 is a fix-now signal.
FNOL claim-info-completeness rate. Above 95% on the seven core fields is healthy. Below 90% means the agent is missing capture loops.
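
The three thresholds translate directly into a few lines of arithmetic. The sketch below uses hypothetical function names and adds a "watch" label for the bands the thresholds above leave unspecified (1 to 5 violations per 10,000, and 90% to 95% completeness); that middle label is an assumption, not part of the KPI definitions.

```python
def rate_per_10k(events: int, calls: int) -> float:
    """Normalize an event count to a rate per 10,000 calls."""
    if calls <= 0:
        raise ValueError("calls must be positive")
    return events / calls * 10_000


def compliance_status(violations: int, calls: int) -> str:
    rate = rate_per_10k(violations, calls)
    if rate < 1:
        return "healthy"
    if rate > 5:
        return "drifting"
    return "watch"  # assumed label for the unspecified middle band


def pii_echo_status(echoes: int, calls: int) -> str:
    return "fix_now" if rate_per_10k(echoes, calls) > 1 else "ok"


def completeness_status(complete_calls: int, fnol_calls: int) -> str:
    if fnol_calls <= 0:
        raise ValueError("fnol_calls must be positive")
    pct = complete_calls / fnol_calls * 100
    if pct > 95:
        return "healthy"
    if pct < 90:
        return "missing_capture_loops"
    return "watch"  # assumed label for the unspecified middle band
```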

Per-intent breakdowns

Every intent class gets its own row: FNOL, claims status, underwriting Q&A, dispatch, renewals, fraud-triage flag rate. Each row shows call volume, task_completion rate, conversation_resolution rate, compliance violation count, and average handle time.

Per-state and per-line-of-business breakdowns

The same metrics segment by state and by line of business. Compliance violation hotspots correlate with state-specific disclosure regressions. Auto, home, health, life, and small commercial each carry their own rubric set and their own pass-rate target. The dashboard surfaces both segments side by side so leadership can compare voice readiness at a glance and so the examiner sees state-level posture before they ask.

Compliance audit pattern

The audit pattern for an insurance voice deployment uses three artifacts per call.

Artifact 1: the trace

Every span. ASR provider, model, confidence. LLM provider, model, prompt version, response. Tool calls and their outcomes. TTS provider, voice, latency. Conversation ID linking all of it. Tags for claim_id, policy_id, state, line_of_business, agent_version. traceAI emits OpenInference-compatible spans that any OTel-compatible audit pipeline can consume.

Artifact 2: the eval history

Every score on every rubric for the call. The rubric set is configured per workflow (FNOL has one set, claims status has another, renewals has another). The score plus reasoning plus eval template ID plus version is stored per turn.

Artifact 3: the Protect log

Every guardrail check on every turn. The check, the score, the action taken (allow, block, redact, escalate). For ProtectFlash, the single binary outcome. For rule-based Protect, the per-rule output.

The three artifacts together reconstruct the call to an examiner’s satisfaction. The Error Feed clustering surfaces the patterns across calls (which intent class has the highest PII-echo rate, which state has the highest disclosure-miss rate, which time-of-day correlates with elevated fraud-triage flag rate).
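
A per-call audit record that bundles the three artifacts might be modeled as below. The class and field names are hypothetical and chosen to mirror the descriptions above; this is a data-shape sketch, not the schema any platform actually stores.

```python
import json
from dataclasses import asdict, dataclass

REQUIRED_TAGS = ("claim_id", "policy_id", "state", "line_of_business", "agent_version")
ALLOWED_ACTIONS = {"allow", "block", "redact", "escalate"}


@dataclass
class EvalScore:
    rubric: str
    turn: int
    score: float
    reasoning: str
    template_version: str


@dataclass
class GuardrailCheck:
    turn: int
    check: str
    score: float
    action: str

    def __post_init__(self) -> None:
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown guardrail action: {self.action}")


@dataclass
class CallAuditRecord:
    conversation_id: str
    tags: dict
    spans: list[dict]                 # Artifact 1: the trace
    eval_history: list[EvalScore]     # Artifact 2: the eval history
    protect_log: list[GuardrailCheck]  # Artifact 3: the guardrail log

    def missing_tags(self) -> list[str]:
        """List required tags absent from this record, in a stable order."""
        return [t for t in REQUIRED_TAGS if t not in self.tags]

    def to_json(self) -> str:
        """Serialize the whole record for long-term retention."""
        return json.dumps(asdict(self), sort_keys=True)
```

Checking `missing_tags()` at write time is cheaper than discovering during an exam that a subset of calls cannot be segmented by state or line of business.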

Three deliberate tradeoffs

These are deployment-posture and process choices baked into the platform, not feature gaps.

Federal-program lines and sovereign workloads run via BYOC self-host. Cloud customers run on SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certifications per the trust page; ISO 42001 is in progress. HIPAA BAA available on eligible plans where PHI is in scope (group health, supplemental health, Medicare Advantage). Carriers with federal-program lines (federal flood, federal crop) deploy in their VPC via air-gapped BYOC. Same software, customer-owned audit boundary, customer-managed KMS.

Async eval gating is explicit. agent-opt ships six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard), available both UI-driven (Dataset view) and SDK-driven via Python. Runs require an explicit trigger plus a human approval gate. FAGI never auto-rewrites a state-disclosed prompt without consent. Custom evaluators calibrate from human review feedback rather than auto-rewriting, which is the right shape for SIU-specific fraud-language rubrics and carrier-authored market-conduct rubrics.

Native voice obs and Enable Others. Native call-log capture ships for Vapi, Retell, and LiveKit out of the box (provider API key plus Assistant ID, no SDK code). Other voice runtimes (Pipecat, custom orchestration) wire in via the 30+ documented traceAI instrumentors. The evaluate_function_calling template runs on either path to score the claim systems integration; the pii, data_privacy_compliance, and prompt_adherence rubrics plus Protect Data Privacy run identically.

A 30-day controlled pilot for FNOL

The cleanest first deployment is FNOL on a single state, single line of business, with human-agent shadow on every call. The 30-day pilot below assumes a 200,000-call/month carrier with FNOL representing roughly 10% of inbound.

Day	Phase	Activities
0	Scope	Pick state, pick line (auto FNOL is the typical first slice). Lock the FNOL schema, the recording disclosure, the empathy-tone rubric.
1-3	Compliance	DPAs signed with every vendor in the path. State disclosure language sign-off. PII redaction validation.
4-7	Agent build	Conversational design, structured-capture schema, custom evaluators authored (claim-info-completeness, empathy-tone).
8-10	Persona library	30-60 personas covering caller demographics, anxiety levels, background-noise conditions (roadside, hospital, garage).
11-15	Simulation	Auto-generate 100-row scenarios, run 10,000+ synthetic FNOL calls. Score with the insurance rubric set.
16-18	Pre-launch	Compliance officer review of sampled transcripts. Disclosure-language regression suite. SIU sign-off on fraud-triage script.
19	Soft launch	1% of FNOL traffic to AI path with human shadow on every call.
20-23	Ramp	5% to 10% of FNOL traffic. Daily Error Feed cluster review.
24-27	Measure	Compare FCR, AHT, CSAT proxy, escalation rate against human baseline.
28-30	Decide	Expand if outcomes match. Hold or roll back if any compliance metric drifts.

The cadence compresses for low-risk workflows (claims status) and lengthens for high-risk ones (fraud triage, complex commercial FNOL). The constraint that doesn’t bend is the compliance review gate before any ramp.
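
The traffic volumes implied by the pilot assumptions are worth working out before staffing the human shadow. The sketch below uses only the figures stated above (200,000 calls per month, FNOL at roughly 10%) plus two simplifying assumptions of its own: a 30-day month and evenly distributed call volume.

```python
MONTHLY_CALLS = 200_000   # from the pilot assumptions
FNOL_SHARE = 0.10         # FNOL as a share of inbound, from the pilot assumptions
DAYS_PER_MONTH = 30       # simplifying assumption


def ai_path_calls_per_day(ramp_fraction: float) -> float:
    """Average daily FNOL calls routed to the AI path at a given ramp fraction."""
    fnol_per_day = MONTHLY_CALLS * FNOL_SHARE / DAYS_PER_MONTH
    return fnol_per_day * ramp_fraction


if __name__ == "__main__":
    for label, fraction in (("soft launch", 0.01), ("ramp low", 0.05), ("ramp high", 0.10)):
        print(f"{label}: about {ai_path_calls_per_day(fraction):.0f} calls per day")
```

Under these assumptions the soft launch sends roughly 7 FNOL calls a day to the AI path, and the ramp sends roughly 33 to 67, which is a useful sanity check on whether shadowing every call is realistic at each stage.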

Voice AI for Banking and Financial Services in 2026: the parallel playbook for GLBA-regulated voice deployments.
Voice AI for Healthcare and Clinical Workflows in 2026: the HIPAA-regulated voice playbook.
IVR Modernization: Migrate Legacy IVR to AI Voice Agents in 2026: the cutover playbook for legacy insurance phone trees.
Voice AI Evaluation Infrastructure: Developer’s Guide: eval rubrics that score insurance voice workloads.

Sources and references

arXiv 2510.13351, Future AGI Protect model family (arxiv.org/abs/2510.13351)
arXiv 2507.19457, GEPA Genetic-Pareto prompt optimizer (arxiv.org/abs/2507.19457)
arXiv 2505.09666, Meta-Prompt bilevel optimization (arxiv.org/abs/2505.09666)
arXiv 2311.09569, Random Search baseline (arxiv.org/abs/2311.09569)
Gramm-Leach-Bliley Act, Safeguards Rule
NAIC Unfair Claims Settlement Practices Model Act
State-by-state recording-consent statutes (one-party and two-party)
CCPA, GDPR, and state privacy statutes (VCDPA, CPA, CTDPA, UCPA)
Future AGI trust page (futureagi.com/trust)
traceAI repository (github.com/future-agi/traceAI)
ai-evaluation repository (github.com/future-agi/ai-evaluation)
Vapi, Retell AI, Deepgram, ElevenLabs, Cartesia, Twilio, Telnyx: vendor documentation and SOC 2 attestation pages (referenced in plain text per editorial policy)

Frequently asked questions

Which insurance workflows are realistic for voice AI in 2026?

Six workflows are production-ready in 2026. First Notice of Loss (FNOL) intake across auto, home, and health. Claims status and document checks. Underwriting Q&A and supplemental clarifications. Agent dispatch and adjuster scheduling. Renewals and cross-sell outbound. Fraud triage on suspicious language patterns. Voice AI handles triage and structured capture; adjudication and binding stay with licensed humans. The runtime is Vapi or Retell, Deepgram for STT, ElevenLabs or Cartesia for TTS, and Future AGI for eval, observability, and guardrails.

What compliance posture does an insurance voice agent need?

Five compliance surfaces. State insurance department rules on claims handling (each state has its own consumer-protection statute). Recording consent under one-party versus two-party state laws. NAIC market conduct guidance on fair claims handling. GLBA and state privacy rules for non-public personal information. Anti-fraud statutes that require escalation but forbid accusation. Future AGI is SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page. The Protect model family redacts PII sub-100ms inline before payloads reach non-compliant services.

How should FNOL voice agents handle a stressed caller?

Empathy is a design surface, not a soft skill. The agent acknowledges the situation in the first turn (accident, damage, loss), confirms safety before asking for facts, and uses an empathetic TTS voice tuned for slow pacing. Capture the seven core FNOL fields in a structured schema: date and time, location, type of loss, parties involved, vehicles or property, damages, injuries. Future AGI's `tone` and `is_polite` score every turn for empathetic delivery. A custom empathy-tone evaluator authored from the carrier's tone-of-voice guide catches drift the built-ins miss.

How does voice AI detect fraud without accusing the caller?

Fraud triage flags language patterns and timing anomalies, then routes the call to a special investigations unit (SIU). The agent never says "fraud," never accuses, and never alters its tone toward the caller. Pattern signals include inconsistencies between the FNOL narrative and prior policy data, rehearsed phrasing, hesitation on factual recall, and references to coverage limits the caller should not know. Future AGI's `prompt_adherence` keeps the agent inside the script. A custom fraud-language evaluator scores the call for known signals; high-score calls drop into a human-only queue without changing the caller experience.
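The routing rule described here, where the score changes the queue but never the caller experience, might look like the sketch below. Everything in it is an assumption for illustration: the threshold value, the queue names, and the `route_call` function are hypothetical, and the fraud score is taken as an input from whatever custom evaluator the carrier builds.

```python
from dataclasses import dataclass

# Hypothetical cutoff; a carrier would tune this against labeled calls.
SIU_THRESHOLD = 0.7

# The same closing line is used for every caller, whatever the score.
NEUTRAL_MESSAGE = "Thank you. A claims specialist will follow up with next steps."


@dataclass(frozen=True)
class RoutingDecision:
    queue: str            # internal destination, never spoken to the caller
    caller_message: str   # what the agent actually says


def route_call(fraud_score: float) -> RoutingDecision:
    """Pick an internal queue from the score; keep the spoken message constant."""
    queue = "siu_review" if fraud_score >= SIU_THRESHOLD else "standard_claims"
    return RoutingDecision(queue=queue, caller_message=NEUTRAL_MESSAGE)
```

Keeping `caller_message` independent of `fraud_score` is the design point: the score only decides which human queue receives the call.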

Which voice runtime is best for insurance deployments?

Vapi and Retell handle most insurance workloads. Retell wins on hosted latency for high-volume claims intake. Vapi wins on BYO model flexibility when routing between cheap LLMs for status checks and premium models for FNOL narrative capture. Deepgram Nova-3 handles vehicle, property, and medical terms with low WER on noisy audio (a totaled car, a flooded basement, a busy ER lobby). ElevenLabs and Cartesia both work for empathetic TTS. Confirm SOC 2 Type II attestation and a signed DPA with every vendor in the call path.

What KPIs prove the insurance voice deployment worked?

Containment rate per intent class, average handle time, first-call resolution, customer satisfaction proxy, escalation rate, claim-info-completeness score on FNOL traces, and compliance violation rate (PII echo, disclosure miss, fraud-accusation block). Map these to ai-evaluation rubrics: `task_completion`, `conversation_resolution`, `is_polite`, `is_compliant`, `pii`, `tone`. Compare against the legacy IVR baseline over a controlled pilot window before declaring success.
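Two of these KPIs reduce to simple arithmetic over call records. The sketch below assumes a hypothetical record shape (a dict with `intent`, `escalated`, and `violations` keys); it is not an export format from any vendor named in this guide.

```python
from collections import defaultdict


def containment_by_intent(calls):
    """Share of calls per intent that finished without human escalation."""
    totals = defaultdict(int)
    contained = defaultdict(int)
    for call in calls:
        totals[call["intent"]] += 1
        if not call["escalated"]:
            contained[call["intent"]] += 1
    return {intent: contained[intent] / totals[intent] for intent in totals}


def violation_rate(calls):
    """Share of calls with at least one logged compliance violation."""
    calls = list(calls)
    if not calls:
        return 0.0
    return sum(1 for call in calls if call["violations"]) / len(calls)


# Illustrative data only, not real results.
sample_calls = [
    {"intent": "claims_status", "escalated": False, "violations": []},
    {"intent": "claims_status", "escalated": False, "violations": []},
    {"intent": "claims_status", "escalated": True, "violations": ["pii_echo"]},
    {"intent": "fnol", "escalated": True, "violations": []},
]
```

Running the same two functions over the legacy IVR baseline and the pilot window gives a like-for-like comparison per intent class.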

How do I audit an insurance voice agent for regulators?

Three artifacts per call. The trace history for every call (traceAI), the eval history for every call (ai-evaluation), and the Protect log for every guardrail check. Future AGI consolidates all three under Agent Command Center with RBAC, per-tenant attribution tags (state, line of business, product), and configurable retention. The Error Feed clusters violations into named issues, so a state market-conduct exam returns one document per cluster rather than one per call.
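One way to sanity-check the three-artifact rule before an exam is to join the exports by call ID and flag any call that is missing a piece. The sketch assumes each artifact type has already been exported into a dict keyed by call ID; that shape and the function name `build_audit_bundles` are hypothetical and do not describe how Agent Command Center stores or returns data.

```python
def build_audit_bundles(traces, evals, protect_logs):
    """Join three per-call exports by call ID and list any missing artifact."""
    call_ids = set(traces) | set(evals) | set(protect_logs)
    bundles = {}
    for call_id in sorted(call_ids):
        bundle = {
            "trace": traces.get(call_id),
            "eval_history": evals.get(call_id),
            "protect_log": protect_logs.get(call_id),
        }
        # Computed before the key is added, so it only inspects the artifacts.
        bundle["missing"] = [name for name, value in bundle.items() if value is None]
        bundles[call_id] = bundle
    return bundles


def incomplete_calls(bundles):
    """Call IDs that lack at least one of the three artifacts."""
    return [call_id for call_id, bundle in bundles.items() if bundle["missing"]]
```

A call that shows up in `incomplete_calls` is a gap to close before the exam, not after the request arrives.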

