Medical and Healthcare STT in 2026: Accent, Jargon, HIPAA

Ship clinical-grade STT in 2026: medical jargon coverage, patient accent and dialect robustness, HIPAA and BAA across audio and transcripts.

February 12, 2026

Updated May 19, 2026

A clinical scribe, a triage agent, a patient-callback flow, an after-visit summary. Every voice AI workload in healthcare runs through speech-to-text. The STT layer is the most common failure point in a healthcare deployment and the most regulated one. Three constraints fire at once: medical jargon coverage, patient accent and dialect spread, and HIPAA compliance across the audio plus transcript pipeline. This is the 2026 playbook for shipping STT that handles all three.

TL;DR (the medical STT trifecta)

Three forces decide whether a healthcare STT deployment ships, and a deployment has to clear all three simultaneously. A pipeline that nails two of three and fails one is not a clinical pipeline.

Constraint	What it demands	What breaks if you miss it
Medical jargon	Drug names, dosages, routes, anatomy, ICD-10, CPT codes transcribed correctly	Wrong medication entry, missed allergy, billing code drift
Accent + dialect	Patient speech across the full demographic mix scored evenly	Specific cohorts get worse care; bias-class complaints
HIPAA + BAA	Every vendor in the audio plus transcript path under a BAA, PHI encryption + redaction + audit	Regulatory filing, HHS Office for Civil Rights (OCR) investigation, breach notification

The runtime layer is a healthcare-tuned STT (Deepgram, Amazon Transcribe Medical, AssemblyAI, or a custom Whisper fine-tune). The eval, redaction, observability, and audit layer is Future AGI. The dedicated section below explains the full mapping.

Why generic STT fails in healthcare

The training distribution of a general-purpose ASR model overweights public-domain audiobook speech, podcast speech, broadcast news, and TED-style talks. Three demographic and lexical gaps follow.

Gap 1: clinical jargon

The drug name corpus alone has 20,000+ unique active ingredients and tens of thousands more branded variants. A patient saying “I take metformin” trips a model that has seen “I take medicine” a thousand times more often. The substitution rate on rare drug names is consistently the worst error class in generic ASR. Anatomy terms, procedure names, lab-test names, and ICD codes carry the same problem. The vocabulary is large, the per-token training frequency is low, and a single substitution can be clinically dangerous.

Gap 2: patient demographic spread

A clinical population is older on average than a podcast-listener population. Speech rate is slower. Pauses are longer. Dysarthria, dental-prosthetic articulation, post-stroke speech, and tracheostomy-modified speech all appear at rates higher than in the training mix. The accent and dialect spread is also wider. A healthcare workload has to be evaluated across the demographic mix that actually uses the service, not the mix that produced the benchmark numbers.

Gap 3: PHI everywhere

The transcript contains protected health information by design. The patient’s name, date of birth, address, phone number, medical record number, diagnosis, medication, lab result, and the clinician’s name and provider identifier all appear in the audio stream. The audio file is PHI. The transcript is PHI. The eval signal derived from the transcript is PHI. The redacted variant remains PHI unless the redaction is irreversible and removes every identifier category listed in the Safe Harbor method under 45 CFR 164.514(b)(2).

A generic STT vendor without a BAA cannot legally receive the audio. A generic LLM downstream without a BAA cannot legally receive the transcript. A generic analytics tool downstream without a BAA cannot legally receive the eval scores. Every node in the path needs a BAA.

Provider options for healthcare STT in 2026

Four options dominate the realistic field.

Option 1: Deepgram Nova-3 Medical

Deepgram’s Nova-3 family with the Medical model variant is a strong hosted option for general clinical speech. WER on Deepgram’s published medical benchmark is competitive with the best academic results. Benchmark on your own audio to confirm. The streaming API supports real-time transcription with sub-300ms first-partial latency. The BAA ships on the enterprise tier with a documented PHI handling posture. Audio retention is configurable. The integration surface is mature.

Wins on: strong accuracy on general clinical English, low streaming latency, broad keyword-boosting capability for institution-specific jargon.

Trades off on: the standard model needs keyword boost lists for institution-specific drug names and procedure codes; the cost ramp on high-volume real-time streaming is steep.

Option 2: Amazon Transcribe Medical

Amazon Transcribe Medical is the option for teams already deep in the AWS ecosystem. It is a HIPAA-eligible AWS service under the AWS BAA. It handles clinical conversation transcription with structured medical entity extraction. It integrates cleanly with Bedrock for downstream LLM work and with Comprehend Medical for entity extraction.

Wins on: tight integration with the AWS HIPAA-eligible service stack; structured medical entity extraction in the same response; predictable governance under the AWS BAA.

Trades off on: WER on general clinical conversation often trails Deepgram on published third-party benchmarks; streaming has higher first-partial latency. Benchmark directly on your own audio.

Option 3: AssemblyAI Medical Mode

AssemblyAI’s medical-mode and Universal-3 streaming models ship strong on both streaming and async APIs. The async API is the better fit for chart-note dictation and after-visit summary generation where the full audio is available. The BAA is available on the healthcare tier. The async API is a strong option for entity F1 on extended dictation. Benchmark on your own audio to confirm.

Wins on: strong async accuracy on long-form dictation, broad set of post-processing options (speaker diarization, redaction).

Trades off on: streaming latency may trail Deepgram for the most real-time use cases. Measure on your audio.

Option 4: Custom Whisper fine-tune

This is the right option for teams with a clinical reference corpus, the engineering bench to host the model, and a BAA-covered cloud (Azure OpenAI’s Whisper variant, AWS-hosted Whisper on HIPAA-eligible compute, or self-hosted on HIPAA-covered private infrastructure). Whisper-large-v3 fine-tuned on 100-500 hours of in-domain clinical audio can outperform generic Whisper by 8-15 WER points on the in-domain test set and sometimes overtakes the hosted vendors on the institution’s specific jargon mix. Benchmark on your own audio to confirm.

Wins on: full control of the model, the lowest unit cost at high volume, the best fit when you have an unusual lexicon (a niche specialty, a non-English clinical service, a research workload).

Trades off on: the engineering and ops cost is real. The fine-tune cycle, the eval cycle, the model rotation cycle, and the security review for the hosting choice all need owners. Hosted vendors absorb this work.

Evaluating medical STT quality

WER is the baseline. WER alone is not the bar. The four beyond-WER metrics from the voice-agent stack apply directly to clinical STT, with a twist: the entity taxonomy is the clinical one.

audio_transcription

The WER-class baseline. Pair every test run with this rubric for the cross-vendor scorecard.
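For reference, WER is the word-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below is a minimal standalone implementation for sanity-checking vendor numbers. It is not the audio_transcription rubric itself, and the lowercase-and-split normalization is an assumption; production scoring usually needs a clinical text normalizer first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the patient takes metformin daily"
hypothesis = "the patient takes met forming daily"
print(word_error_rate(reference, hypothesis))  # one substitution + one insertion over 5 words
```

Note that a split drug name costs two errors out of five words here, the same as two harmless filler-word errors would. That is the reason WER is only the baseline.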

Clinical entity F1

The custom rubric. Define the entity taxonomy: drug name, dosage, route, frequency, anatomy term, lab name, lab value, ICD-10 code, CPT code, provider name, facility name. Extract entities from the reference. Extract from the hypothesis. Score per-type and overall F1.

Per-type breakdown is non-negotiable. A 0.92 overall entity F1 hides a 0.55 drug-name F1 because drug names are a small fraction of total entities. Track per-class trend lines.
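The per-type scoring can be sketched in a few lines. This is an illustrative implementation, not the ai-evaluation rubric: it assumes entities have already been extracted as (type, value) pairs with normalized values, and it uses exact-match counting. The function name and the sample entities are hypothetical.

```python
from collections import Counter, defaultdict


def per_type_f1(reference, hypothesis):
    """Return {entity_type: F1} plus a micro-averaged "overall" F1.

    Both inputs are lists of (entity_type, value) pairs. Matching is exact
    on the pair, counted as a multiset so repeated mentions are respected.
    """
    ref_counts = Counter(reference)
    hyp_counts = Counter(hypothesis)
    true_pos = defaultdict(int)
    ref_total = defaultdict(int)
    hyp_total = defaultdict(int)

    for (etype, value), n in ref_counts.items():
        ref_total[etype] += n
        true_pos[etype] += min(n, hyp_counts[(etype, value)])
    for (etype, _value), n in hyp_counts.items():
        hyp_total[etype] += n

    def f1(tp, n_ref, n_hyp):
        # F1 = 2TP / (2TP + FP + FN) = 2TP / (n_ref + n_hyp)
        return 0.0 if tp == 0 else 2 * tp / (n_ref + n_hyp)

    scores = {
        etype: f1(true_pos[etype], ref_total[etype], hyp_total[etype])
        for etype in set(ref_total) | set(hyp_total)
    }
    scores["overall"] = f1(
        sum(true_pos.values()), sum(ref_total.values()), sum(hyp_total.values())
    )
    return scores


ref_entities = [("drug", "metformin"), ("dosage", "500 mg"),
                ("frequency", "twice daily"), ("anatomy", "chest")]
hyp_entities = [("drug", "metropolis"), ("dosage", "500 mg"),
                ("frequency", "twice daily"), ("anatomy", "chest")]
scores = per_type_f1(ref_entities, hyp_entities)
print(scores["overall"], scores["drug"])  # overall looks healthy, drug does not
```

In this toy turn the overall F1 is 0.75 while the drug-name F1 is 0.0, which is the masking effect described above in miniature.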

Intent preservation

For triage and routing flows, intent preservation is the agent-relevant score. Did the hypothesis route to the same triage category as the reference? Build the rubric against the institution’s triage taxonomy.
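A minimal sketch of the check, assuming you can call the same routing function on both the reference transcript and the ASR hypothesis. The toy_router below is a hypothetical keyword stand-in for the institution's real triage classifier; only the comparison logic is the point.

```python
def intent_preservation(pairs, route):
    """Score how often the ASR hypothesis routes like the reference.

    pairs: iterable of (reference_text, hypothesis_text)
    route: callable mapping a transcript to a triage category
    Returns (preservation_rate, list_of_mismatches).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (reference, hypothesis) pair")
    mismatches = []
    for ref_text, hyp_text in pairs:
        ref_intent, hyp_intent = route(ref_text), route(hyp_text)
        if ref_intent != hyp_intent:
            mismatches.append((ref_text, hyp_text, ref_intent, hyp_intent))
    return 1 - len(mismatches) / len(pairs), mismatches


def toy_router(text):
    """Hypothetical keyword router, for illustration only."""
    text = text.lower()
    if "chest pain" in text:
        return "urgent"
    if "refill" in text:
        return "pharmacy"
    return "routine"


turns = [
    ("i have chest pain", "i have chess pain"),  # one-word ASR error
    ("i need a refill", "i need a refill"),
]
rate, missed = intent_preservation(turns, toy_router)
print(rate, missed[0][2], "->", missed[0][3])
```

The first turn has a WER of 25% and a routing outcome that flips from urgent to routine. The mismatch list is what a reviewer reads, not the rate.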

Semantic similarity

For patient-described symptoms, paraphrase is heavy. “It hurts when I breathe deep” and “I have pain on deep inspiration” are semantically identical and WER-divergent. Embedding-based similarity catches this.

task_completion and conversation_resolution

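For multi-turn clinical agents (triage, callback, scheduling), these built-in rubrics score whether the call resolved correctly. Run each scenario twice, once on the live ASR output and once on the reference transcript. The gap between the two scores is the downstream correlation indicator: it shows how much ASR error costs at the task level.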

Clinical-safety rubrics

Two rubrics are clinical-safety-specific and ship as built-ins in ai-evaluation, listed here with the transcription baseline they run alongside:

no_harmful_therapeutic_guidance: the agent did not provide medical advice it isn’t licensed to provide.
clinically_inappropriate_tone: the tone did not slide into dismissive, defensive, or sales-toned territory.
audio_transcription: WER-class scoring on the transcript leg.

Plus the data and PHI rubrics:

pii: PHI categories flagged on every turn.
data_privacy_compliance: policy-class violations flagged across the call.

Run the full set on every clinical turn during launch. Sample on high-volume production.
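One way to implement the launch-versus-production split is deterministic sampling keyed on the call ID, so the same call is always in or out of the sample across reruns. This is a sketch under assumptions: the function name, the call-ID format, and the 10% rate are illustrative, not part of any SDK.

```python
import hashlib


def should_evaluate(call_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether a call gets the full rubric set.

    sample_rate=1.0 scores every call (launch); lower it for
    high-volume production traffic.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform over [0, 2**64)
    return bucket < sample_rate * 2**64


# Launch: every call is scored.
assert should_evaluate("call-001", 1.0)

# Production: roughly one call in ten, stable per call ID.
sampled = [cid for cid in (f"call-{i}" for i in range(10000))
           if should_evaluate(cid, 0.1)]
print(len(sampled))
```

Hash-based sampling avoids a shared random seed and lets any service in the path make the same decision independently.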

Pipeline pattern: BAA-covered end to end

The HIPAA-correct pipeline has every node under a signed BAA. The pattern below is the most common 2026 deployment; the HIPAA-compliant voice AI build guide covers the agent layer above it.

Node 1: telephony or capture

The audio capture vendor (the carrier, the WebRTC service, the EHR-integrated capture surface) must operate under a BAA. Twilio Programmable Voice, Telnyx, Agora, and AWS-hosted carriers all offer BAA-eligible tiers. Default tiers usually do not. Verify before signal capture starts.

Node 2: STT

The STT vendor needs a BAA on the tier and region you’re using. Deepgram Nova-3 Medical, Amazon Transcribe Medical (under the AWS BAA), and AssemblyAI Medical Mode all qualify. Self-hosted Whisper inside your HIPAA-covered VPC needs no external BAA but needs the internal security review for the model-hosting infrastructure.

Node 3: PHI redaction

The redactor runs immediately after STT, before the transcript reaches any non-BAA-covered downstream consumer. Future AGI Protect handles this leg. The model family runs on Gemma 3n with LoRA-trained adapters per arXiv 2510.13351, multi-modal across text and audio. The redactor runs sub-100ms inline. The audio leg catches PHI in the audio stream before transcription. The text leg redacts categories that survive transcription.

from fi.evals import Protect, Evaluator, PII

# test_case is the turn under review, for example the MLLMTestCase built in
# "Run the medical STT eval pass" under Code patterns.
p = Protect()
out = p.protect(
    inputs=test_case,
    protect_rules=[
        {"metric": "data_privacy_compliance"},
    ],
)

# PII detection runs as a separate Evaluator template, not as a Protect rule.
ev = Evaluator(fi_api_key="...", fi_secret_key="...")
pii_result = ev.evaluate(eval_templates=[PII()], inputs=[test_case])

For the single-call critical path:

out = p.protect(inputs=test_case)

ProtectFlash is the binary classifier mode. Use it on the inline critical path where rule-based scan time is tight. Use rule-based Protect on every Nth turn for richer per-rule signal.

Node 4: LLM

The clinical LLM (the scribe model, the triage model, the summarization model) needs BAA-covered hosting. The realistic choices are Azure OpenAI on the BAA-eligible tier, AWS Bedrock with HIPAA-eligible model variants (Claude, Llama, Titan), or a self-hosted open-weight model on HIPAA-covered compute. The redacted transcript is the input. The LLM output is PHI again the moment it includes patient-specific reasoning.

Node 5: storage and audit

The transcript, the audio, the eval scores, and the Protect log all live in HIPAA-covered storage with documented retention. Encryption at rest, encryption in transit, role-based access, access logging, periodic access review. Agent Command Center runs this layer with RBAC, multi-region hosted, BYOC self-host, and per-tenant attribution tags.

Node 6: downstream consumers

Every downstream tool (analytics, BI, observability, EHR integration) that receives any PHI-derived signal needs a BAA. The redactor catches anything that bypassed BAA-covered nodes. Plan the downstream node-by-node before launch.
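The node-by-node plan can be checked mechanically before launch. The sketch below assumes a hand-maintained list of consumers and planned data routes; the class names, tool names, and the deidentified flag are hypothetical, and whether a payload truly counts as de-identified is a compliance decision the code cannot make.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consumer:
    name: str
    has_baa: bool


@dataclass(frozen=True)
class Route:
    consumer: str       # name of the receiving tool
    payload: str        # e.g. "transcript", "eval_scores"
    deidentified: bool  # True only if compliance has signed off on it


def blocked_routes(consumers, routes):
    """Return routes that would send PHI-derived data to a tool without a BAA."""
    baa = {c.name: c.has_baa for c in consumers}
    blocked = []
    for route in routes:
        # Consumers missing from the inventory are treated as uncovered,
        # so unknown tools fail closed.
        if not baa.get(route.consumer, False) and not route.deidentified:
            blocked.append(route)
    return blocked


consumers = [Consumer("ehr", has_baa=True), Consumer("bi_dashboard", has_baa=False)]
planned = [
    Route("ehr", "transcript", deidentified=False),
    Route("bi_dashboard", "eval_scores", deidentified=False),
    Route("bi_dashboard", "call_volume", deidentified=True),
    Route("unknown_tool", "transcript", deidentified=False),
]
for route in blocked_routes(consumers, planned):
    print(f"BLOCKED: {route.payload} -> {route.consumer}")
```

Run as a CI check, this turns a missing BAA into a failed build instead of a finding in the security review.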

Compliance posture

Five surfaces decide whether the deployment clears the security review.

Surface 1: BAA inventory

Every vendor in the call path has a signed BAA on file. The inventory is a living document. Carrier, STT, redactor, LLM, observability, eval engine, storage, downstream analytics. Each gets a row with the vendor name, the BAA effective date, the BAA scope, and the renewal date. Missing rows are blockers.
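The inventory is easy to keep as structured data with an automated blocker check. This is a minimal sketch: the role names follow the list above, while the row fields, vendor names, and dates are placeholders invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

REQUIRED_ROLES = (
    "carrier", "stt", "redactor", "llm",
    "observability", "eval_engine", "storage", "analytics",
)


@dataclass
class BaaRow:
    role: str
    vendor: str
    effective: date
    scope: str
    renewal: date


def inventory_blockers(rows, today):
    """List every required role with a missing, not-yet-effective, or lapsed BAA."""
    by_role = {row.role: row for row in rows}
    blockers = []
    for role in REQUIRED_ROLES:
        row = by_role.get(role)
        if row is None:
            blockers.append(f"{role}: no BAA on file")
        elif row.effective > today:
            blockers.append(f"{role}: BAA with {row.vendor} not yet effective")
        elif row.renewal <= today:
            blockers.append(f"{role}: BAA with {row.vendor} past renewal date")
    return blockers


# Placeholder vendors and dates, for illustration only.
rows = [
    BaaRow("carrier", "ExampleCarrier", date(2026, 1, 15), "voice + recordings", date(2027, 1, 15)),
    BaaRow("stt", "ExampleSTT", date(2025, 5, 1), "streaming API", date(2026, 5, 1)),
]
for blocker in inventory_blockers(rows, today=date(2026, 6, 1)):
    print(blocker)
```

With only two rows filled in, the check reports the lapsed STT agreement plus the six roles that have no row at all.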

Surface 2: PHI flow map

A diagram showing every system that touches PHI, the direction of flow, the encryption posture, and the retention policy. Required for the HIPAA risk analysis. The diagram is reviewed quarterly during launch and annually after.

Surface 3: certifications

Future AGI is SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 certified per the trust page. The certifications cover the eval, redaction, observability, simulation, and command-center layers of the stack. Record them alongside the vendor’s row in your own BAA inventory.

Surface 4: access logging

Every read of PHI is logged with the actor identity, the resource, the timestamp, and the action. Logs are retained per the institution’s retention policy (typically 6 years for HIPAA records). Access reviews happen quarterly. The Agent Command Center RBAC layer surfaces these logs.

Surface 5: breach response

A documented incident-response process. Who is paged on a suspected PHI exposure. The Breach Notification Rule timeline: notice without unreasonable delay and no later than 60 calendar days after discovery. The OCR reporting protocol. The customer-notification protocol. The corrective-action protocol. Tabletop the process annually.
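The 60-day outer bound is simple date arithmetic, and it is worth wiring into the incident tracker so the clock is never computed by hand. A minimal sketch with hypothetical function names; the 60 calendar days are counted from discovery, and this is only the outer limit, since notice is also required without unreasonable delay and contracts or state law may set shorter windows.

```python
from datetime import date, timedelta

NOTIFICATION_WINDOW_DAYS = 60  # outer bound, counted from the discovery date


def notification_deadline(discovered_on: date) -> date:
    """Latest permissible notification date for a breach discovered on the given day."""
    return discovered_on + timedelta(days=NOTIFICATION_WINDOW_DAYS)


def days_remaining(discovered_on: date, today: date) -> int:
    """Days left before the outer deadline; negative means it has passed."""
    return (notification_deadline(discovered_on) - today).days


discovered = date(2026, 3, 1)
print(notification_deadline(discovered))             # 2026-04-30
print(days_remaining(discovered, date(2026, 4, 20)))  # 10
```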

Accent and dialect coverage

The training-distribution gap on patient demographics is closable with disciplined evaluation. Three tactics together cover the field.

Tactic 1: pick an accent-broad model

Deepgram Nova-3 with the healthcare variant has the broadest published demographic coverage of the hosted options. AssemblyAI’s healthcare model ships similar coverage on the async side. Whisper-large-v3 fine-tuned on accent-diverse clinical audio is the option that asks the most of your engineering bench. Avoid generic ASR models tuned on broadcast English; the demographic gap is widest there.

Tactic 2: simulate across the demographic mix

FAGI Simulate ships 18 pre-built personas plus unlimited custom. Persona controls include gender, age range (18-25 through 60+), location (US, Canada, UK, Australia, India), accent, communication style, conversation speed, background noise, and multilingual toggle across many popular languages. Build a persona library that mirrors the institution’s actual patient demographic. Run the agent against the full library before launch.

Workflow Builder auto-generates branching scenarios (20, 50, or 100 rows) from an agent definition. Error Localization pinpoints the failing turn when a scenario fails on a specific persona class. The programmatic eval API automates configure plus re-run as part of CI.

Tactic 3: score per cohort

Aggregate WER hides per-cohort regressions. Score audio_transcription plus the four beyond-WER rubrics segmented by accent class, age class, dialect class, and background-noise class. The per-cohort dashboards surface bias-class regressions the moment they ship. Per-cohort dashboards also surface where additional training data would help most.
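The segmentation itself is a small aggregation. The sketch below assumes each scored utterance arrives as (cohort, word_errors, reference_words); the cohort labels, the sample counts, and the 10-point gap threshold are illustrative assumptions, not recommended values.

```python
from collections import defaultdict


def cohort_wer(results):
    """Aggregate WER per cohort from (cohort, word_errors, reference_words) rows."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for cohort, n_errors, n_words in results:
        errors[cohort] += n_errors
        words[cohort] += n_words
    return {c: errors[c] / words[c] for c in words if words[c] > 0}


def flag_gaps(per_cohort, max_gap):
    """Return cohorts whose WER exceeds the best cohort's WER by more than max_gap."""
    best = min(per_cohort.values())
    return sorted(c for c, wer in per_cohort.items() if wer - best > max_gap)


# Illustrative numbers only.
results = [
    ("us_general", 5, 100),
    ("us_general", 3, 100),
    ("spanish_accented", 18, 100),
    ("age_60_plus", 9, 100),
]
per_cohort = cohort_wer(results)
aggregate = sum(r[1] for r in results) / sum(r[2] for r in results)
print(f"aggregate WER: {aggregate:.4f}")
print(flag_gaps(per_cohort, max_gap=0.10))
```

In this toy data the aggregate WER is under 9% while one cohort sits at 18%. The same grouping applies unchanged to entity F1 or any other per-utterance score.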

Code patterns

Run the medical STT eval pass

from fi.testcases import MLLMTestCase, MLLMAudio
from fi.evals import Evaluator, AudioTranscriptionEvaluator, PII, DataPrivacyCompliance

audio = MLLMAudio(url="path/to/clinical_audio.wav")
test_case = MLLMTestCase(
    input=audio,
    query="Score this clinical conversation turn",
)

ev = Evaluator(fi_api_key="...", fi_secret_key="...")

result = ev.evaluate(
    eval_templates=[
        AudioTranscriptionEvaluator(),
        PII(),
        DataPrivacyCompliance(),
        "clinical_entity_f1_v1",
        "no_harmful_therapeutic_guidance_v1",
        "clinically_inappropriate_tone_v1",
    ],
    inputs=[test_case],
)

MLLMAudio accepts seven formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma) from a local path or URL with auto-base64 encoding.

Multi-turn clinical conversation eval

from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals import Evaluator, ConversationCoherence, ConversationResolution, TaskCompletion

conv = ConversationalTestCase(messages=[
    LLMTestCase(query="I've been feeling chest tightness when I climb stairs", response="..."),
    LLMTestCase(query="Started about three weeks ago", response="..."),
    LLMTestCase(query="No, I haven't had a heart issue before", response="..."),
])

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[
        ConversationCoherence(),
        ConversationResolution(),
        TaskCompletion(),
    ],
    inputs=[conv],
)

Instrument the voice agent

For Pipecat-based clinical agents:

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_pipecat import enable_http_attribute_mapping

register(
    project_type=ProjectType.OBSERVE,
    project_name="Clinical Voice Agent",
    set_global_tracer_provider=True,
)
enable_http_attribute_mapping()

For LiveKit-based clinical agents:

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_livekit import enable_http_attribute_mapping

register(
    project_name="LiveKit Clinical Agent",
    project_type=ProjectType.OBSERVE,
    set_global_tracer_provider=True,
)
enable_http_attribute_mapping()

The traceAI-pipecat and traceai-livekit packages ship as dedicated pip integrations. OpenInference-compatible spans capture STT provider, LLM provider, tool calls, and TTS provider per turn.

How Future AGI fits the medical STT stack

Future AGI is the eval, redaction, observability, simulation, and audit layer underneath any STT plus LLM plus TTS choice. The mapping is concrete.

traceAI for distributed tracing

30+ documented integrations across Python and TypeScript. OpenInference-compatible spans. Apache 2.0. Every clinical call becomes a trace with the ASR span (provider, confidence, hypothesis transcript), retrieval span (EHR lookup, drug-interaction check), LLM span (model, prompt version, response), tool spans (charting tool, order-entry tool, refill tool), TTS span, and conversation ID linking the whole thing. Dedicated traceAI-pipecat and traceai-livekit packages cover the open-source voice frameworks.

ai-evaluation for scoring

70+ built-in rubrics including audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion, pii, data_privacy_compliance, no_harmful_therapeutic_guidance, clinically_inappropriate_tone, translation_accuracy, cultural_sensitivity. Apache 2.0. Custom evaluators authored by an in-product agent for institution-specific rubrics like clinical entity F1 with the institution’s drug, lab, and procedure taxonomy.

Native voice observability for Vapi, Retell, LiveKit

Add the provider API key plus Assistant ID to a FAGI Agent Definition. Every clinical call gets separate clinician and patient audio download, auto transcript, and the full eval engine. No SDK required. “Enable Others” mode covers any voice provider via mobile-number simulation. Indian phone number support for international clinical workloads.

Simulation for pre-launch and CI

18 pre-built personas plus unlimited custom. Per-persona accent, age range, location, communication style, conversation speed, background noise, and multilingual controls. Workflow Builder auto-generates branching scenarios (20, 50, or 100 rows) from a clinical agent definition. 4-step Run Tests wizard. Error Localization pinpoints the failing turn. Programmatic eval API for configure plus re-run as part of CI.

Future AGI Protect for inline PHI redaction

Gemma 3n foundation with LoRA-trained adapters per arXiv 2510.13351. Multi-modal across text, image, and audio (no preprocessing pipeline required). Two surfaces: rule-based Protect across the 4 documented safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance) and ProtectFlash (single-call binary classifier). Sub-100ms inline. The audio leg catches PHI in the raw audio before it lands in a transcript that any non-BAA tool can see.

Error Feed for failure clustering

Auto-clusters trace failures into named issues. Auto-writes root cause plus quick fix plus long-term recommendation. For a clinical agent, a cluster of “drug name misread on Spanish-accented patient” becomes one named issue with a quick-fix suggestion (add the drug to the keyword boost list) and a long-term recommendation (fine-tune the ASR on Spanish-accented clinical audio).

Agent Command Center for hosting and governance

RBAC, SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 certified, AWS Marketplace, multi-region hosted, BYOC self-host, 15+ provider routing. Per-team RBAC and per-tenant attribution tags so the eval scores segment by clinic, by service line, by specialty.

agent-opt for prompt tuning

agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) exposed through the Dataset UI and the Python library. For clinical workloads, pick the optimizer per workload (GEPA for the triage prompt, ProTeGi for the after-visit summary prompt, Bayesian Search for the patient-callback prompt) and tune against the eval scores the rubric set produces. The optimizer runs on the dataset; the candidate prompts and final scores surface in the dashboard before any change reaches production.

Common failure modes

The patterns repeat across clinical STT deployments.

Drug name substitution on accented speech. A Spanish-accented patient saying “metformin” trips a model trained on a thinner Spanish-accent slice. The hypothesis returns “metform in” or “metformin and” or “Metropolis”. The drug-name entity F1 surfaces it. The fix is a keyword boost list per institution-common drug name plus a per-cohort eval pass.
Dosage parsed as separate words. “Twenty five milligrams” parses as “twenty” plus “five” plus “milligrams” with no link. The downstream LLM has to reassemble. The mitigation is a normalization pass on the transcript before the LLM sees it plus a custom dosage-entity rubric.
PHI echo in the redacted transcript. The redactor catches the obvious patterns (full names, MRN, DOB) and misses a less obvious one (the patient’s home street). The pii and data_privacy_compliance rubrics catch the residual rate. The Protect rule set is tunable per institution.
ICD-10 hallucination. The LLM produces an ICD code that doesn’t exist or that doesn’t fit the symptom description. The fix is a structured-output schema with ICD validation plus a custom rubric scoring code presence against the validation database.
Background-noise regression on rural deployments. The accuracy drops on calls from rural clinics with noisier audio. The mitigation is per-noise-level evals plus a noise-resilient model choice (Deepgram Nova-3 holds up best in our internal comparison).
Tone drift on patient frustration. The agent’s tone slides toward defensive or dismissive when a patient is frustrated. The clinically_inappropriate_tone rubric catches it. The fix is prompt iteration plus a custom rubric against the institution’s tone guidelines.
Audit trail gap on tool failures. A tool call (refill request, scheduling lookup) silently fails and the agent confirms anyway. The traceAI capture surfaces the gap. The fix is a confirmation turn after every transactional tool call plus an is_factually_consistent rubric on the summary-back.

Each failure has a clean mitigation in the FAGI stack. The simulation suite catches the predictable ones pre-launch. The observability stack catches the long tail post-launch.
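As one concrete mitigation, the dosage normalization pass can be sketched in plain Python. This is a minimal illustration under assumptions: it handles spelled-out English numbers up to the hundreds followed by a small, hypothetical unit table, and it ignores hyphenated forms, decimals, and ranges. A production normalizer needs clinical review of its unit list and edge cases.

```python
ONES = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
NUMBER_WORDS = set(ONES) | set(TENS) | {"hundred"}
# Hypothetical unit table; extend per institution.
DOSE_UNITS = {"milligram": "mg", "milligrams": "mg", "microgram": "mcg",
              "micrograms": "mcg", "gram": "g", "grams": "g",
              "milliliter": "mL", "milliliters": "mL"}


def words_to_int(words):
    """Convert number words such as ['two', 'hundred', 'fifty'] to an int."""
    value = 0
    for word in words:
        if word == "hundred":
            value = max(value, 1) * 100
        else:
            value += ONES.get(word, 0) + TENS.get(word, 0)
    return value


def normalize_dosages(text):
    """Rewrite '<number words> <dose unit>' as '<digits> <abbreviation>'."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tokens[j].lower() in NUMBER_WORDS:
            j += 1
        unit = tokens[j].lower().rstrip(".,;") if j < len(tokens) else ""
        if j > i and unit in DOSE_UNITS:
            value = words_to_int([t.lower() for t in tokens[i:j]])
            trailing = tokens[j][len(unit):]  # keep punctuation after the unit
            out.append(f"{value} {DOSE_UNITS[unit]}{trailing}")
            i = j + 1
        else:
            # Number words not followed by a dose unit are left untouched.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(normalize_dosages("take twenty five milligrams of metoprolol twice daily"))
# take 25 mg of metoprolol twice daily
```

Running the pass before the LLM sees the transcript means the dosage-entity rubric scores one linked value instead of three loose tokens.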

A reference 16-week clinical STT deployment

Week	Phase	Activities
1-2	Scope	Pick workflow (scribe, triage, callback, summary). Map PHI flow. Identify BAA-required vendors.
3	Compliance	BAA execution with every vendor in path. HIPAA risk analysis. PHI flow diagram review.
4	STT selection	Benchmark Deepgram Nova-3 Medical, Amazon Transcribe Medical, AssemblyAI Medical Mode, and a Whisper fine-tune on the institution’s audio corpus. Score on WER, clinical entity F1, and per-cohort variance.
5-6	Agent build	Conversational design, clinical taxonomy mapping, structured capture schema. Clinical safety prompt sign-off.
7	Persona library	50-100 personas mirroring the institution’s patient demographic. Accent, age, dialect, background-noise coverage.
8-9	Simulation	Auto-generate scenarios. Run 20,000-50,000 synthetic clinical conversations. Score with the full clinical rubric set.
10	Pre-launch	Clinical officer review of sampled transcripts. PHI flow audit. Disclosure-language regression suite.
11	Soft launch	5% of call volume to AI path with human shadow on every call.
12	Ramp	25% to 50%. Daily Error Feed cluster review. Per-cohort dashboard review.
13	Ramp	75% to 100%. Live regulator-ready audit trail.
14-15	Tune	Prompt iteration on flagged clusters. Keyword boost list iteration. Per-cohort tuning.
16	Steady state	Baseline established. Weekly cohort review. Monthly compliance review. Quarterly access audit.

The cadence stretches for higher-risk workflows (clinical decision support adjacencies) and compresses for lower-risk ones (after-visit summary generation). The compliance review gate at week 10 doesn’t bend.

Three deliberate tradeoffs

Federal procurement runs via BYOC self-host. FedRAMP doesn’t appear on the FAGI trust page yet. Federal health agencies and the VA with federal posture requirements deploy in their VPC via air-gapped BYOC. Same software, customer-owned audit boundary. The platform layer carries the full cert stack: SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 all certified per the trust page; ISO 42001 (AI management) is in progress. HIPAA BAA is available on eligible plans; the audio leg of Protect plus the pii and data_privacy_compliance rule-based scans cover PHI safeguarding before transcripts reach non-BAA-covered downstream consumers.

Async eval gating is explicit. agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) inside the Dataset UI and the Python library. Pick an optimizer, point at a dataset and an evaluator, run. FAGI never auto-rewrites a clinical prompt without an explicit run plus a human approval gate, which is exactly the property a clinical team wants on a regulated surface.

Native voice obs ships for Vapi, Retell, and LiveKit out of the box; everything else flows through Enable Others mode via the traceAI SDK (dedicated traceAI-pipecat and traceai-livekit packages plus 30+ documented integrations) or a webhook. That covers more than 90% of production voice stacks; deeper custom-runtime work is a code-path engagement.

Related reading

Voice AI for Healthcare and Clinical Workflows in 2026: the parallel playbook for the agent layer above the STT.
Medical Chatbot: Build and Evaluate in 2026: the text-channel sibling to this voice-channel playbook.
Real-Time STT vs Offline STT in 2026: the streaming-vs-async tradeoff that decides the deployment topology.
Why WER Isn’t Enough for Voice Agents: 2026 Beyond-WER Metrics: the deep dive on the four metrics this playbook references.

Sources and references

arXiv 2510.13351, Future AGI Protect model family (arxiv.org/abs/2510.13351)
arXiv 2507.19457, GEPA Genetic-Pareto prompt optimizer (arxiv.org/abs/2507.19457)
arXiv 2505.09666, Meta-Prompt bilevel optimization (arxiv.org/abs/2505.09666)
arXiv 2311.09569, Random Search baseline (arxiv.org/abs/2311.09569)
HIPAA Security Rule, Privacy Rule, and Breach Notification Rule (45 CFR Parts 160 and 164)
HHS Office for Civil Rights breach reporting guidance
45 CFR 164.514 Safe Harbor de-identification standard
Future AGI trust page (futureagi.com/trust)
ai-evaluation repository (github.com/future-agi/ai-evaluation)
traceAI repository (github.com/future-agi/traceAI)
Deepgram, Amazon Transcribe Medical, AssemblyAI: vendor documentation and BAA terms

Frequently asked questions

Why is general-purpose STT not enough for medical use cases?

Three reasons. Medical jargon (drug names, anatomy, diagnoses, procedure codes) is heavily underrepresented in the training data of generic ASR models. Patient accent and dialect spread is wider than the LibriSpeech-style benchmark mix. HIPAA requires a BAA-covered audio and transcript pipeline, which most generic ASR vendors don't offer on their default tier. A clinical-grade STT system has to handle the trifecta of jargon, accent, and HIPAA simultaneously.

Which STT providers are realistic for healthcare in 2026?

Four options carry most of the field. Deepgram Nova-3 Medical is a strong hosted option for general clinical speech with a BAA on the enterprise tier. Amazon Transcribe Medical is the option for teams already deep in the AWS ecosystem with HIPAA-eligible AWS services. AssemblyAI Medical Mode ships strong on the streaming and async APIs with healthcare-tuned models. A custom fine-tune on Whisper (large or medium), self-hosted on HIPAA-covered compute or running on a BAA-covered speech service, is the option for teams with the engineering bench and clinical reference corpus.

How do I evaluate STT quality on medical audio?

WER alone is not enough. Pair WER with entity F1 on the medical entity classes (drug names, dosages, routes, frequencies, anatomy terms, ICD-10 codes). Add intent preservation for the agent action triggered by the turn. Add semantic similarity for paraphrase-heavy patient speech. Future AGI's ai-evaluation ships audio_transcription for WER-class scoring. Custom evaluators authored by an in-product agent cover the entity-F1 and intent-preservation rubrics with the clinical entity taxonomy you define.

What does HIPAA compliance require across the STT pipeline?

HIPAA Security Rule plus the Privacy Rule plus the Breach Notification Rule. Practically: a signed Business Associate Agreement with every vendor in the audio and transcript path, encryption in transit and at rest, access logging with retention, and a documented incident-response process. The audio is PHI. The transcript is PHI. The eval scores derived from the transcript are PHI. The redacted version of the transcript may also be PHI depending on the redaction discipline. Future AGI is HIPAA, SOC 2 Type II, GDPR, CCPA, and ISO 27001 certified per the trust page.

How do I redact PHI from clinical transcripts at inference time?

Run a PHI-aware redactor before any non-BAA-covered downstream consumer (analytics, telemetry, third-party tools). Future AGI Protect runs on Gemma 3n with LoRA-trained adapters per arXiv 2510.13351, multi-modal across audio and text, sub-100ms inline. The pii rubric and data_privacy_compliance rubric target PHI categories directly. The audio leg of Protect catches PHI in the audio stream before it ever lands in a transcript a non-BAA tool can see.

How do I handle accent and dialect spread in patient speech?

Three tactics. First, pick an STT model trained on a diverse speaker corpus (Deepgram Nova-3, AssemblyAI's healthcare model, or a Whisper variant fine-tuned on accent-diverse clinical audio). Second, simulate accent and dialect coverage pre-launch. Future AGI Simulate ships 18 pre-built personas plus unlimited custom with accent, location, age, and background-noise controls. Third, score per-cohort. Run audio_transcription and the four beyond-WER rubrics segmented by accent and dialect group so regressions don't hide inside the aggregate.

Can I use ChatGPT or general LLMs as the medical scribe?

Not as a generic OpenAI-tier deployment. A clinical scribe handling PHI needs a BAA with the LLM vendor. Azure OpenAI with the BAA-eligible tier, AWS Bedrock with HIPAA-eligible models, or a self-hosted open-weight model on a HIPAA-covered cloud are the realistic options. The eval, redaction, and audit layer sits underneath whichever LLM you pick. Future AGI's stack runs across all of these without re-instrumenting the agent.
