Medical and Healthcare STT in 2026: Accent, Jargon, HIPAA
How to ship clinical-grade STT in 2026. Medical jargon coverage, patient accent and dialect robustness, HIPAA and BAA across the audio plus transcript pipeline.
A clinical scribe, a triage agent, a patient-callback flow, an after-visit summary. Every voice AI workload in healthcare runs through speech-to-text. The STT layer is the most common failure point in a healthcare deployment and the most regulated one. Three constraints fire at once: medical jargon coverage, patient accent and dialect spread, and HIPAA compliance across the audio plus transcript pipeline. This is the 2026 playbook for shipping STT that handles all three.
TL;DR (the medical STT trifecta)
Three constraints decide whether a healthcare STT deployment ships, and a deployment has to clear all three at once. A pipeline that nails two and fails the third is not a clinical pipeline.
| Constraint | What it demands | What breaks if you miss |
|---|---|---|
| Medical jargon | Drug names, dosages, routes, anatomy, ICD-10, CPT codes transcribed correctly | Wrong medication entry, missed allergy, billing code drift |
| Accent + dialect | Patient speech across the full demographic mix scored evenly | Specific cohorts get worse care; bias-class complaints |
| HIPAA + BAA | Every vendor in the audio plus transcript path under a BAA, PHI encryption + redaction + audit | Regulatory filing, OCR investigation, breach notification |
The runtime layer is a healthcare-tuned STT (Deepgram, Amazon Transcribe Medical, AssemblyAI, or a custom Whisper fine-tune). The eval, redaction, observability, and audit layer is Future AGI. The dedicated section below explains the full mapping.
Why generic STT fails in healthcare
The training distribution of a general-purpose ASR model overweights public-domain audiobook speech, podcast speech, broadcast news, and TED-style talks. Three gaps follow: one lexical, one demographic, one regulatory.
Gap 1: clinical jargon
The drug name corpus alone has 20,000+ unique active ingredients and tens of thousands more branded variants. A patient saying “I take metformin” trips a model that has seen “I take medicine” a thousand times more often. The substitution rate on rare drug names is consistently the worst error class in generic ASR. Anatomy terms, procedure names, lab-test names, and ICD codes carry the same problem. The vocabulary is large, the per-token training frequency is low, and a single substitution can be clinically dangerous.
Gap 2: patient demographic spread
A clinical population is older on average than a podcast-listener population. Speech rate is slower. Pauses are longer. Dysarthria, dental-prosthetic articulation, post-stroke speech, and tracheostomy-modified speech all appear at rates higher than in the training mix. The accent and dialect spread is also wider. A healthcare workload has to be evaluated across the demographic mix that actually uses the service, not the mix that produced the benchmark numbers.
Gap 3: PHI everywhere
The transcript contains protected health information by design. The patient’s name, date of birth, address, phone number, medical record number, diagnosis, medication, lab result, and the clinician’s name and provider identifier all appear in the audio stream. The audio file is PHI. The transcript is PHI. The eval signal derived from the transcript is PHI. The redacted variant is still PHI unless the redaction is irreversible and every identifier category listed in the Safe Harbor method under 45 CFR 164.514(b)(2) has been removed.
A generic STT vendor without a BAA cannot legally receive the audio. A generic LLM downstream without a BAA cannot legally receive the transcript. A generic analytics tool downstream without a BAA cannot legally receive the eval scores. Every node in the path needs a BAA.
Provider options for healthcare STT in 2026
Four options dominate the realistic field.
Option 1: Deepgram Nova-3 Medical
Deepgram’s Nova-3 family with the Medical model variant is a strong hosted option for general clinical speech. WER on Deepgram’s published medical benchmark is competitive with the best academic results. Benchmark on your own audio to confirm. The streaming API supports real-time transcription with sub-300ms first-partial latency. The BAA ships on the enterprise tier with a documented PHI handling posture. Audio retention is configurable. The integration surface is mature.
Wins on: strong accuracy on general clinical English, low streaming latency, broad keyword-boosting capability for institution-specific jargon.
Trades off on: the standard model needs keyword boost lists for institution-specific drug names and procedure codes; the cost ramp on high-volume real-time streaming is steep.
Option 2: Amazon Transcribe Medical
Amazon Transcribe Medical is the option for teams already deep in the AWS ecosystem. It is a HIPAA-eligible AWS service under the AWS BAA. It handles clinical conversation transcription with structured medical entity extraction. It integrates cleanly with Bedrock for downstream LLM work and with Comprehend Medical for entity extraction.
Wins on: tight integration with the AWS HIPAA-eligible service stack; structured medical entity extraction in the same response; predictable governance under the AWS BAA.
Trades off on: WER on general clinical conversation often trails Deepgram on published third-party benchmarks; streaming has higher first-partial latency. Benchmark directly on your own audio.
Option 3: AssemblyAI Medical Mode
AssemblyAI’s medical-mode and Universal-3 streaming models ship strong on both streaming and async APIs. The async API is the better fit for chart-note dictation and after-visit summary generation where the full audio is available. The BAA is available on the healthcare tier. The async API is a strong option for entity F1 on extended dictation. Benchmark on your own audio to confirm.
Wins on: strong async accuracy on long-form dictation, broad set of post-processing options (speaker diarization, redaction).
Trades off on: streaming latency may trail Deepgram for the most real-time use cases. Measure on your audio.
Option 4: Custom Whisper fine-tune
The right option for teams with a clinical reference corpus, the engineering bench to host the model, and a BAA-covered cloud (Azure OpenAI’s Whisper variant, AWS-hosted Whisper on HIPAA-eligible compute, or self-hosted on HIPAA-covered private infrastructure). Whisper-large-v3 fine-tuned on 100-500 hours of in-domain clinical audio outperforms generic Whisper by 8-15 WER points on the in-domain test set and sometimes overtakes the hosted vendors on the institution’s specific jargon mix.
Wins on: full control of the model, the lowest unit cost at high volume, the best fit when you have an unusual lexicon (a niche specialty, a non-English clinical service, a research workload).
Trades off on: the engineering and ops cost is real. The fine-tune cycle, the eval cycle, the model rotation cycle, and the security review for the hosting choice all need owners. Hosted vendors absorb this work.
Evaluating medical STT quality
WER is the baseline. WER alone is not the bar. The four beyond-WER metrics from the voice-agent stack apply directly to clinical STT, with a twist: the entity taxonomy is the clinical one.
audio_transcription
The WER-class baseline. Pair every test run with this rubric for the cross-vendor scorecard.
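For reference, the WER behind this baseline is the word-level edit distance divided by the reference length. The sketch below is a minimal standalone implementation, not the ai-evaluation rubric itself; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the ref prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "patient takes metformin twice daily",
    "patient takes metropolis twice daily",
)
print(f"WER = {wer:.2f}")  # one substitution over five reference words
```

A single substituted drug name costs only 0.20 WER here, which is why the entity-level rubrics that follow matter more than the headline number.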
Clinical entity F1
The custom rubric. Define the entity taxonomy: drug name, dosage, route, frequency, anatomy term, lab name, lab value, ICD-10 code, CPT code, provider name, facility name. Extract entities from the reference. Extract from the hypothesis. Score per-type and overall F1.
Per-type breakdown is non-negotiable. A 0.92 overall entity F1 hides a 0.55 drug-name F1 because drug names are a small fraction of total entities. Track per-class trend lines.
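A minimal sketch of the per-type scoring, assuming entities have already been extracted and normalized into `(entity_type, value)` pairs. The extraction step, the `entity_f1_by_type` name, and the sample entities are illustrative assumptions, not part of any SDK.

```python
from collections import Counter, defaultdict

Entity = tuple[str, str]  # (entity_type, normalized_value)


def entity_f1_by_type(reference: list[Entity], hypothesis: list[Entity]) -> dict[str, float]:
    """Return F1 per entity type plus a micro-averaged 'overall' score."""
    ref_counts, hyp_counts = Counter(reference), Counter(hypothesis)
    tp, ref_total, hyp_total = defaultdict(int), defaultdict(int), defaultdict(int)
    for (etype, value), n in ref_counts.items():
        ref_total[etype] += n
        # A hypothesis entity counts as a true positive only on exact match.
        tp[etype] += min(n, hyp_counts[(etype, value)])
    for (etype, _value), n in hyp_counts.items():
        hyp_total[etype] += n

    def f1(true_pos: int, n_ref: int, n_hyp: int) -> float:
        # F1 = 2TP / (2TP + FP + FN) = 2TP / (n_ref + n_hyp)
        return 2 * true_pos / (n_ref + n_hyp) if (n_ref + n_hyp) else 0.0

    types = set(ref_total) | set(hyp_total)
    scores = {t: f1(tp[t], ref_total[t], hyp_total[t]) for t in types}
    scores["overall"] = f1(sum(tp.values()), sum(ref_total.values()), sum(hyp_total.values()))
    return scores


reference = [("drug", "metformin"), ("drug", "lisinopril"),
             ("dosage", "500 mg"), ("frequency", "twice daily")]
hypothesis = [("drug", "metropolis"), ("drug", "lisinopril"),
              ("dosage", "500 mg"), ("frequency", "twice daily")]
scores = entity_f1_by_type(reference, hypothesis)
for name in sorted(scores):
    print(f"{name}: {scores[name]:.2f}")
```

On this toy sample the overall score is 0.75 while the drug-name score is 0.50, the same masking effect described above at smaller scale.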
Intent preservation
For triage and routing flows, intent preservation is the agent-relevant score. Did the hypothesis route to the same triage category as the reference. Build the rubric against the institution’s triage taxonomy.
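One way to express this check is sketched below. The `toy_route` keyword router and its three categories are hypothetical stand-ins for the institution's real triage classifier; only the comparison logic is the point.

```python
from typing import Callable, Iterable


def intent_preservation(pairs: Iterable[tuple[str, str]],
                        route: Callable[[str], str]) -> float:
    """Fraction of (reference, hypothesis) pairs that route to the same category."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no transcript pairs to score")
    preserved = sum(route(ref) == route(hyp) for ref, hyp in pairs)
    return preserved / len(pairs)


def toy_route(transcript: str) -> str:
    """Hypothetical keyword router; a real deployment would call its triage model."""
    text = transcript.lower()
    if "chest pain" in text:
        return "urgent"
    if "refill" in text:
        return "pharmacy"
    return "routine"


pairs = [
    ("I have chest pain when I walk", "I have chest pain when I walk"),
    ("I need a refill on my metformin", "I need a refill on my met forming"),
    ("I have chest pain since morning", "I have chess pain since morning"),
]
score = intent_preservation(pairs, toy_route)
print(f"intent preservation = {score:.2f}")
```

The second pair has a transcription error that leaves the route unchanged; the third has a one-word error that moves an urgent call into the routine queue. WER treats both the same, and this score does not.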
Semantic similarity
For patient-described symptoms, paraphrase is heavy. “It hurts when I breathe deep” and “I have pain on deep inspiration” are semantically identical and WER-divergent. Embedding-based similarity catches this.
task_completion and conversation_resolution
For multi-turn clinical agents (triage, callback, scheduling), these built-in rubrics score whether the call resolved correctly. Run the scenario twice (live ASR vs reference transcript) for the downstream correlation indicator.
Clinical-safety rubrics
Three rubrics are clinical-safety-specific and ship as built-ins in ai-evaluation:
- no_harmful_therapeutic_guidance: the agent did not provide medical advice it isn’t licensed to provide.
- clinically_inappropriate_tone: the tone did not slide into dismissive, defensive, or sales-toned territory.
- audio_transcription: WER-class scoring on the transcript leg.
Plus the data and PHI rubrics:
- pii: PHI categories flagged on every turn.
- data_privacy_compliance: policy-class violations flagged across the call.
Run the full set on every clinical turn during launch. Sample on high-volume production.
Pipeline pattern: BAA-covered end to end
The HIPAA-correct pipeline has every node under a signed BAA. The pattern below is the most common 2026 deployment.
Node 1: telephony or capture
The audio capture vendor (the carrier, the WebRTC service, the EHR-integrated capture surface) must operate under a BAA. Twilio Programmable Voice, Telnyx, Agora, and AWS-hosted carriers all offer BAA-eligible tiers. Default tiers usually do not. Verify before signal capture starts.
Node 2: STT
The STT vendor needs a BAA on the tier and region you’re using. Deepgram Nova-3 Medical, Amazon Transcribe Medical (under the AWS BAA), and AssemblyAI Medical Mode all qualify. Self-hosted Whisper inside your HIPAA-covered VPC needs no external BAA but needs the internal security review for the model-hosting infrastructure.
Node 3: PHI redaction
The redactor runs immediately after STT, before the transcript reaches any non-BAA-covered downstream consumer. Future AGI Protect handles this leg. The model family runs on Gemma 3n with LoRA-trained adapters per arXiv 2510.13351, multi-modal across text and audio. The redactor runs sub-100ms inline. The audio leg catches PHI in the audio stream before transcription. The text leg redacts categories that survive transcription.
```python
from fi.evals import Protect, Evaluator, PII

p = Protect()
out = p.protect(
    inputs=test_case,
    protect_rules=[
        {"metric": "data_privacy_compliance"},
    ],
)

# PII detection runs as a separate Evaluator template, not as a Protect rule.
ev = Evaluator(fi_api_key=..., fi_secret_key=...)
pii_result = ev.evaluate(eval_templates=[PII()], inputs=[test_case])
```

For the single-call critical path:

```python
out = p.protect(inputs=test_case)
```
ProtectFlash is the binary classifier mode. Use it on the inline critical path where rule-based scan time is tight. Use rule-based Protect on every Nth turn for richer per-rule signal.
Node 4: LLM
The clinical LLM (the scribe model, the triage model, the summarization model) needs BAA-covered hosting: Azure OpenAI on the BAA-eligible tier, AWS Bedrock with HIPAA-eligible model variants (Claude, Llama, Titan), or a self-hosted open-weight model on HIPAA-covered compute. The redacted transcript is the input. The LLM output is PHI again the moment it includes patient-specific reasoning.
Node 5: storage and audit
The transcript, the audio, the eval scores, and the Protect log all live in HIPAA-covered storage with documented retention. Encryption at rest, encryption in transit, role-based access, access logging, periodic access review. Agent Command Center runs this layer with RBAC, multi-region hosted, BYOC self-host, and per-tenant attribution tags.
Node 6: downstream consumers
Every downstream tool (analytics, BI, observability, EHR integration) that receives any PHI-derived signal needs a BAA. The redactor catches anything that bypassed BAA-covered nodes. Plan the downstream node-by-node before launch.
Compliance posture
Five surfaces decide whether the deployment clears the security review.
Surface 1: BAA inventory
Every vendor in the call path has a signed BAA on file. The inventory is a living document. Carrier, STT, redactor, LLM, observability, eval engine, storage, downstream analytics. Each gets a row with the vendor name, the BAA effective date, the BAA scope, and the renewal date. Missing rows are blockers.
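The inventory can be kept as structured data so that missing or lapsed rows fail a pre-launch check. The sketch below is one possible shape; the `BaaRow` fields mirror the columns named above, while the role names, placeholder vendor names, and the `baa_blockers` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

REQUIRED_ROLES = ("carrier", "stt", "redactor", "llm",
                  "observability", "eval_engine", "storage", "analytics")


@dataclass(frozen=True)
class BaaRow:
    role: str
    vendor: str
    effective: Optional[date]  # None means the BAA is not signed yet
    renewal: Optional[date]
    scope: str = ""


def baa_blockers(rows: list[BaaRow], today: date) -> list[str]:
    """List every required role whose BAA is missing, not in effect, or lapsed."""
    by_role = {row.role: row for row in rows}
    blockers = []
    for role in REQUIRED_ROLES:
        row = by_role.get(role)
        if row is None:
            blockers.append(f"{role}: no inventory row")
        elif row.effective is None or row.effective > today:
            blockers.append(f"{role}: BAA not in effect")
        elif row.renewal is not None and row.renewal <= today:
            blockers.append(f"{role}: BAA past renewal date")
    return blockers


# Placeholder rows: analytics is missing and the STT BAA lapsed in January.
rows = [BaaRow(role, f"example-{role}-vendor", date(2025, 6, 1), date(2026, 6, 1))
        for role in REQUIRED_ROLES if role not in ("stt", "analytics")]
rows.append(BaaRow("stt", "example-stt-vendor", date(2025, 1, 31), date(2026, 1, 31)))

blockers = baa_blockers(rows, today=date(2026, 3, 1))
for line in blockers:
    print(line)
```

Running a check like this in CI keeps the inventory a living document rather than a spreadsheet that drifts.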
Surface 2: PHI flow map
A diagram showing every system that touches PHI, the direction of flow, the encryption posture, and the retention policy. Required for the HIPAA risk analysis. The diagram is reviewed quarterly during launch and annually after.
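The same map can be held as an edge list and linted alongside the diagram. This is a sketch under stated assumptions: the `Flow` record, the node names, and the two checks (target under a BAA, encryption in transit) are illustrative and do not cover the full risk analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    source: str
    target: str
    carries_phi: bool
    encrypted_in_transit: bool


def flow_violations(flows: list[Flow], baa_covered: set[str]) -> list[str]:
    """Flag PHI-carrying edges that reach a non-BAA node or travel unencrypted."""
    problems = []
    for flow in flows:
        if not flow.carries_phi:
            continue
        edge = f"{flow.source} -> {flow.target}"
        if flow.target not in baa_covered:
            problems.append(f"{edge}: target has no BAA")
        if not flow.encrypted_in_transit:
            problems.append(f"{edge}: not encrypted in transit")
    return problems


flows = [
    Flow("carrier", "stt", carries_phi=True, encrypted_in_transit=True),
    Flow("stt", "redactor", carries_phi=True, encrypted_in_transit=True),
    Flow("redactor", "llm", carries_phi=True, encrypted_in_transit=True),
    Flow("llm", "bi_dashboard", carries_phi=True, encrypted_in_transit=False),
]
violations = flow_violations(flows, baa_covered={"carrier", "stt", "redactor", "llm", "storage"})
for line in violations:
    print(line)
```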
Surface 3: certifications
Future AGI is SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 certified per the trust page. The certifications cover the eval, redaction, observability, simulation, and command-center layers of the stack. The customer records these certifications in their own BAA inventory.
Surface 4: access logging
Every read of PHI is logged with the actor identity, the resource, the timestamp, and the action. Logs are retained per the institution’s retention policy (typically 6 years for HIPAA records). Access reviews happen quarterly. The Agent Command Center RBAC layer surfaces these logs.
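A log entry with those four fields can be as simple as one JSON line per read. The sketch below is illustrative only; the `PhiAccessEvent` shape and the `record_access` helper are assumptions and do not describe the Agent Command Center log format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class PhiAccessEvent:
    actor: str      # authenticated user or service identity
    resource: str   # what was read, e.g. a transcript identifier
    action: str     # read, export, delete, ...
    timestamp: str  # ISO 8601, always UTC


def record_access(actor: str, resource: str, action: str,
                  now: Optional[datetime] = None) -> str:
    """Serialize one access event as a JSON line ready for append-only storage."""
    moment = now or datetime.now(timezone.utc)
    event = PhiAccessEvent(actor, resource, action, moment.isoformat())
    return json.dumps(asdict(event), sort_keys=True)


line = record_access(
    actor="dr.example@clinic.test",
    resource="transcript/0001",
    action="read",
    now=datetime(2026, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(line)
```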
Surface 5: breach response
A documented incident-response process. Who is paged on a suspected PHI exposure. The 60-day breach notification timeline under the Breach Notification Rule. The OCR reporting protocol. The customer-notification protocol. The corrective-action protocol. Tabletop the process annually.
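The 60-day window is an outer bound counted from the date of discovery; the rule also requires notification without unreasonable delay, so the runbook should not plan to use the whole window. A small sketch of the date arithmetic, with hypothetical helper names:

```python
from datetime import date, timedelta

NOTIFICATION_WINDOW_DAYS = 60  # outer bound; notify earlier whenever possible


def notification_deadline(discovered_on: date) -> date:
    """Latest permissible notification date, counted from discovery."""
    return discovered_on + timedelta(days=NOTIFICATION_WINDOW_DAYS)


def days_remaining(discovered_on: date, today: date) -> int:
    """Calendar days left before the outer deadline (negative once missed)."""
    return (notification_deadline(discovered_on) - today).days


discovered = date(2026, 1, 15)
deadline = notification_deadline(discovered)
remaining = days_remaining(discovered, today=date(2026, 2, 1))
print(deadline.isoformat(), remaining)
```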
Accent and dialect coverage
The training-distribution gap on patient demographics is closable with disciplined evaluation. Three tactics together cover the field.
Tactic 1: pick an accent-broad model
Deepgram Nova-3 Medical has the broadest published demographic coverage of the hosted options. AssemblyAI’s medical mode ships similar coverage on the async side. A Whisper-large-v3 fine-tune on accent-diverse clinical audio is the option for teams with the engineering bench to run it. Avoid generic ASR models tuned on broadcast English; the demographic gap is widest there.
Tactic 2: simulate across the demographic mix
FAGI Simulate ships 18 pre-built personas plus unlimited custom. Persona controls include gender, age range (18-25 through 60+), location (US, Canada, UK, Australia, India), accent, communication style, conversation speed, background noise, and multilingual toggle across many popular languages. Build a persona library that mirrors the institution’s actual patient demographic. Run the agent against the full library before launch.
Workflow Builder auto-generates branching scenarios (20, 50, or 100 rows) from an agent definition. Error Localization pinpoints the failing turn when a scenario fails on a specific persona class. The programmatic eval API automates configure plus re-run as part of CI.
Tactic 3: score per cohort
Aggregate WER hides per-cohort regressions. Score audio_transcription plus the four beyond-WER rubrics segmented by accent class, age class, dialect class, and background-noise class. The per-cohort dashboards surface bias-class regressions the moment they ship. Per-cohort dashboards also surface where additional training data would help most.
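A minimal sketch of the segmentation, assuming each scored turn is a dict carrying its cohort labels and a metric value. The field names, the sample numbers, and the `max_gap` threshold are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def score_by_cohort(results: list[dict], cohort_key: str, metric: str = "wer") -> dict[str, float]:
    """Average one metric per cohort value (accent, age band, noise level, ...)."""
    buckets = defaultdict(list)
    for row in results:
        buckets[row[cohort_key]].append(row[metric])
    return {cohort: mean(values) for cohort, values in buckets.items()}


def flag_gaps(per_cohort: dict[str, float], max_gap: float) -> list[str]:
    """Cohorts whose error rate exceeds the best cohort by more than max_gap."""
    best = min(per_cohort.values())
    return sorted(c for c, value in per_cohort.items() if value - best > max_gap)


results = [
    {"accent": "us", "wer": 0.08}, {"accent": "us", "wer": 0.10},
    {"accent": "uk", "wer": 0.09}, {"accent": "uk", "wer": 0.11},
    {"accent": "india", "wer": 0.14}, {"accent": "india", "wer": 0.18},
]
aggregate = mean(row["wer"] for row in results)
per_accent = score_by_cohort(results, "accent")
flagged = flag_gaps(per_accent, max_gap=0.05)
print(f"aggregate WER = {aggregate:.3f}")
print(per_accent, flagged)
```

The aggregate sits near 0.12 and looks acceptable, while one cohort runs at 0.16. The same grouping applies unchanged to entity F1 or intent preservation by swapping the `metric` argument.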
Code patterns
Run the medical STT eval pass
```python
from fi.testcases import MLLMTestCase, MLLMAudio
from fi.evals import Evaluator, AudioTranscriptionEvaluator, PII, DataPrivacyCompliance

audio = MLLMAudio(url="path/to/clinical_audio.wav")
test_case = MLLMTestCase(
    input=audio,
    query="Score this clinical conversation turn",
)

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[
        AudioTranscriptionEvaluator(),
        PII(),
        DataPrivacyCompliance(),
        "clinical_entity_f1_v1",
        "no_harmful_therapeutic_guidance_v1",
        "clinically_inappropriate_tone_v1",
    ],
    inputs=[test_case],
)
```
MLLMAudio accepts seven formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma) from a local path or URL with auto-base64 encoding.
Multi-turn clinical conversation eval
```python
from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals import Evaluator, ConversationCoherence, ConversationResolution, TaskCompletion

conv = ConversationalTestCase(messages=[
    LLMTestCase(query="I've been feeling chest tightness when I climb stairs", response="..."),
    LLMTestCase(query="Started about three weeks ago", response="..."),
    LLMTestCase(query="No, I haven't had a heart issue before", response="..."),
])

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[
        ConversationCoherence(),
        ConversationResolution(),
        TaskCompletion(),
    ],
    inputs=[conv],
)
```
Instrument the voice agent
For Pipecat-based clinical agents:
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_pipecat import enable_http_attribute_mapping

register(
    project_type=ProjectType.OBSERVE,
    project_name="Clinical Voice Agent",
    set_global_tracer_provider=True,
)
enable_http_attribute_mapping()
```
For LiveKit-based clinical agents:
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_livekit import enable_http_attribute_mapping

register(
    project_name="LiveKit Clinical Agent",
    project_type=ProjectType.OBSERVE,
    set_global_tracer_provider=True,
)
enable_http_attribute_mapping()
```
The traceAI-pipecat and traceai-livekit packages ship as dedicated pip integrations. OpenInference-compatible spans capture STT provider, LLM provider, tool calls, and TTS provider per turn.
How Future AGI fits the medical STT stack
Future AGI is the eval, redaction, observability, simulation, and audit layer underneath any STT plus LLM plus TTS choice. The mapping is concrete.
traceAI for distributed tracing
30+ documented integrations across Python and TypeScript. OpenInference-compatible spans. Apache 2.0. Every clinical call becomes a trace with the ASR span (provider, confidence, hypothesis transcript), retrieval span (EHR lookup, drug-interaction check), LLM span (model, prompt version, response), tool spans (charting tool, order-entry tool, refill tool), TTS span, and conversation ID linking the whole thing. Dedicated traceAI-pipecat and traceai-livekit packages cover the open-source voice frameworks.
ai-evaluation for scoring
70+ built-in rubrics including audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion, pii, data_privacy_compliance, no_harmful_therapeutic_guidance, clinically_inappropriate_tone, translation_accuracy, cultural_sensitivity. Apache 2.0. Custom evaluators authored by an in-product agent for institution-specific rubrics like clinical entity F1 with the institution’s drug, lab, and procedure taxonomy.
Native voice observability for Vapi, Retell, LiveKit
Add the provider API key plus Assistant ID to a FAGI Agent Definition. Every clinical call gets separate clinician and patient audio download, auto transcript, and the full eval engine. No SDK required. “Enable Others” mode covers any voice provider via mobile-number simulation. Indian phone number support for international clinical workloads.
Simulation for pre-launch and CI
18 pre-built personas plus unlimited custom. Per-persona accent, age range, location, communication style, conversation speed, background noise, and multilingual controls. Workflow Builder auto-generates branching scenarios (20, 50, or 100 rows) from a clinical agent definition. 4-step Run Tests wizard. Error Localization pinpoints the failing turn. Programmatic eval API for configure plus re-run as part of CI.
Future AGI Protect for inline PHI redaction
Gemma 3n foundation with LoRA-trained adapters per arXiv 2510.13351. Multi-modal across text, image, and audio (no preprocessing pipeline required). Two surfaces: rule-based Protect across the 4 documented safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance) and ProtectFlash (single-call binary classifier). Sub-100ms inline. The audio leg catches PHI in the raw audio before it lands in a transcript that any non-BAA tool can see.
Error Feed for failure clustering
Auto-clusters trace failures into named issues. Auto-writes root cause plus quick fix plus long-term recommendation. For a clinical agent, a cluster of “drug name misread on Spanish-accented patient” becomes one named issue with a quick-fix suggestion (add the drug to the keyword boost list) and a long-term recommendation (fine-tune the ASR on Spanish-accented clinical audio).
Agent Command Center for hosting and governance
RBAC, SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 certified, AWS Marketplace, multi-region hosted, BYOC self-host, 15+ provider routing. Per-team RBAC and per-tenant attribution tags so the eval scores segment by clinic, by service line, by specialty.
agent-opt for prompt tuning
agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) exposed through the Dataset UI and the Python library. For clinical workloads, pick the optimizer per workload (GEPA for the triage prompt, ProTeGi for the after-visit summary prompt, Bayesian Search for the patient-callback prompt) and tune against the eval scores the rubric set produces. The optimizer runs on the dataset; the candidate prompts and final scores surface in the dashboard before any change reaches production.
Common failure modes
The patterns repeat across clinical STT deployments.
- Drug name substitution on accented speech. A Spanish-accented patient saying “metformin” trips a model trained on a thinner Spanish-accent slice. The hypothesis returns “metform in” or “metformin and” or “Metropolis”. The drug-name entity F1 surfaces it. The fix is a keyword boost list per institution-common drug name plus a per-cohort eval pass.
- Dosage parsed as separate words. “Twenty five milligrams” parses as “twenty” plus “five” plus “milligrams” with no link. The downstream LLM has to reassemble. The mitigation is a normalization pass on the transcript before the LLM sees it plus a custom dosage-entity rubric.
- PHI echo in the redacted transcript. The redactor catches the obvious patterns (full names, MRN, DOB) and misses a less obvious one (the patient’s home street). The pii and data_privacy_compliance rubrics catch the residual rate. The Protect rule set is tunable per institution.
- ICD-10 hallucination. The LLM produces an ICD code that doesn’t exist or that doesn’t fit the symptom description. The fix is a structured-output schema with ICD validation plus a custom rubric scoring code presence against the validation database.
- Background-noise regression on rural deployments. The accuracy drops on calls from rural clinics with noisier audio. The mitigation is per-noise-level evals plus a noise-resilient model choice (Deepgram Nova-3 holds up best in our internal comparison).
- Tone drift on patient frustration. The agent’s tone slides toward defensive or dismissive when a patient is frustrated. The clinically_inappropriate_tone rubric catches it. The fix is prompt iteration plus a custom rubric against the institution’s tone guidelines.
- Audit trail gap on tool failures. A tool call (refill request, scheduling lookup) silently fails and the agent confirms anyway. The traceAI capture surfaces the gap. The fix is a confirmation turn after every transactional tool call plus an is_factually_consistent rubric on the summary-back.
Each failure has a clean mitigation in the FAGI stack. The simulation suite catches the predictable ones pre-launch. The observability stack catches the long tail post-launch.
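As one concrete instance, the dosage normalization pass from the list above can be sketched in plain Python. This is a deliberately small illustration: it handles unhyphenated spoken integers below one thousand followed by a handful of unit words, and the unit table and function names are assumptions rather than a clinical-grade normalizer.

```python
_SMALL = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
_TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
_DOSE_UNITS = {"milligram": "mg", "milligrams": "mg",
               "microgram": "mcg", "micrograms": "mcg",
               "gram": "g", "grams": "g",
               "milliliter": "mL", "milliliters": "mL"}


def _is_number_word(token: str) -> bool:
    word = token.lower()
    return word in _SMALL or word in _TENS or word == "hundred"


def _words_to_int(words: list[str]) -> int:
    total = 0
    for word in words:
        if word == "hundred":
            total = max(total, 1) * 100  # "two hundred" -> 200, bare "hundred" -> 100
        else:
            total += _SMALL.get(word, 0) + _TENS.get(word, 0)
    return total


def normalize_dosages(transcript: str) -> str:
    """Rewrite '<number words> <unit word>' spans as '<digits> <abbreviation>'."""
    tokens = transcript.split()
    out, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and _is_number_word(tokens[j]):
            j += 1
        core = tokens[j].rstrip(".,;") if j < len(tokens) else ""
        unit = _DOSE_UNITS.get(core.lower())
        if j > i and unit:
            value = _words_to_int([t.lower() for t in tokens[i:j]])
            trailing = tokens[j][len(core):]  # keep punctuation after the unit
            out.append(f"{value} {unit}{trailing}")
            i = j + 1
        else:
            out.append(tokens[i])  # leave non-dosage text untouched
            i += 1
    return " ".join(out)


print(normalize_dosages("Take twenty five milligrams of metoprolol twice daily"))
```

Number words that are not followed by a unit word are left alone, so "two tablets" passes through unchanged. Decimals, ranges, and hyphenated forms such as "twenty-five" are out of scope for this sketch and would need their own rules and their own dosage-entity tests.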
A reference 16-week clinical STT deployment
| Week | Phase | Activities |
|---|---|---|
| 1-2 | Scope | Pick workflow (scribe, triage, callback, summary). Map PHI flow. Identify BAA-required vendors. |
| 3 | Compliance | BAA execution with every vendor in path. HIPAA risk analysis. PHI flow diagram review. |
| 4 | STT selection | Benchmark Deepgram Nova-3 Medical, Amazon Transcribe Medical, AssemblyAI Medical Mode, and a Whisper fine-tune on the institution’s audio corpus. Score on WER, clinical entity F1, and per-cohort variance. |
| 5-6 | Agent build | Conversational design, clinical taxonomy mapping, structured capture schema. Clinical safety prompt sign-off. |
| 7 | Persona library | 50-100 personas mirroring the institution’s patient demographic. Accent, age, dialect, background-noise coverage. |
| 8-9 | Simulation | Auto-generate scenarios. Run 20,000-50,000 synthetic clinical conversations. Score with the full clinical rubric set. |
| 10 | Pre-launch | Clinical officer review of sampled transcripts. PHI flow audit. Disclosure-language regression suite. |
| 11 | Soft launch | 5% of call volume to AI path with human shadow on every call. |
| 12 | Ramp | 25% to 50%. Daily Error Feed cluster review. Per-cohort dashboard review. |
| 13 | Ramp | 75% to 100%. Live regulator-ready audit trail. |
| 14-15 | Tune | Prompt iteration on flagged clusters. Keyword boost list iteration. Per-cohort tuning. |
| 16 | Steady state | Baseline established. Weekly cohort review. Monthly compliance review. Quarterly access audit. |
The cadence stretches for higher-risk workflows (clinical decision support adjacencies) and compresses for lower-risk ones (after-visit summary generation). The compliance review gate at week 10 doesn’t bend.
Three deliberate tradeoffs
Federal procurement runs via BYOC self-host. FedRAMP doesn’t appear on the FAGI trust page yet. Federal health agencies and the VA with federal posture requirements deploy in their VPC via air-gapped BYOC. Same software, customer-owned audit boundary. The platform layer carries the full cert stack: SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 all certified per the trust page; ISO 42001 (AI management) is in progress. HIPAA BAA is available on eligible plans; the audio leg of Protect plus the pii and data_privacy_compliance rule-based scans cover PHI safeguarding before transcripts reach non-BAA-covered downstream consumers.
Async eval gating is explicit. agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) inside the Dataset UI and the Python library. Pick an optimizer, point at a dataset and an evaluator, run. FAGI never auto-rewrites a clinical prompt without an explicit run plus a human approval gate, which is exactly the property a clinical team wants on a regulated surface.
Native voice observability ships for Vapi, Retell, and LiveKit out of the box; everything else flows through Enable Others mode via the traceAI SDK (dedicated traceAI-pipecat and traceai-livekit packages plus 30+ documented integrations) or a webhook. That covers more than 90% of production voice stacks; deeper custom-runtime work is a code-path engagement.
Related reading
- Voice AI for Healthcare and Clinical Workflows in 2026: the parallel playbook for the agent layer above the STT.
- Medical Chatbot: Build and Evaluate in 2026: the text-channel sibling to this voice-channel playbook.
- Real-Time STT vs Offline STT in 2026: the streaming-vs-async tradeoff that decides the deployment topology.
- Why WER Isn’t Enough for Voice Agents: 2026 Beyond-WER Metrics: the deep dive on the four metrics this playbook references.
Sources and references
- arXiv 2510.13351, Future AGI Protect model family (arxiv.org/abs/2510.13351)
- arXiv 2507.19457, GEPA Genetic-Pareto prompt optimizer (arxiv.org/abs/2507.19457)
- arXiv 2505.09666, Meta-Prompt bilevel optimization (arxiv.org/abs/2505.09666)
- arXiv 2311.09569, Random Search baseline (arxiv.org/abs/2311.09569)
- HIPAA Security Rule, Privacy Rule, and Breach Notification Rule (45 CFR Parts 160 and 164)
- HHS Office for Civil Rights breach reporting guidance
- 45 CFR 164.514 Safe Harbor de-identification standard
- Future AGI trust page (futureagi.com/trust)
- ai-evaluation repository (github.com/future-agi/ai-evaluation)
- traceAI repository (github.com/future-agi/traceAI)
- Deepgram, Amazon Transcribe Medical, AssemblyAI: vendor documentation and BAA terms (referenced in plain text per editorial policy)