Voice Agent Regression Testing in CI/CD: A 2026 Engineering Guide
Wire voice agent regression tests into GitHub Actions and GitLab CI in 2026: golden conversations, three-layer testing, deploy gates, drift detection, and FAGI evals.
A voice agent that ships without a CI regression gate is one prompt edit away from a production incident. The fix is not heroic manual testing. It is a programmatic eval pipeline wired into the same GitHub Actions or GitLab CI workflow that builds and deploys the agent. Pass rates become deploy gates. Drift scores become release blockers. The same Three-Layer framework Coval publishes runs against every pull request, every release candidate, and every model upgrade. This guide is the engineering implementation: golden conversations, scenario auto-generation, eval API calls, .yml snippets, and the deploy gates that catch regressions before customers do.
TL;DR: voice regression in CI/CD at a glance
| Stage | What runs | When | Deploy gate |
|---|---|---|---|
| Layer 1: Regression | 50-200 golden conversations | Every PR | task_completion 95% pass |
| Layer 2: Adversarial | 50-100 scenarios per red-team archetype | Release candidate | policy_preservation 90% pass |
| Layer 3: Production-derived | 500-2,000 sampled real calls | Per release plus weekly | under 5% drift on rubric package |
| Latency gate | P95 turn latency on golden set | Every PR | budget-specific (often under 1.2s) |
| Audio quality gate | audio_quality on TTS output | Release candidate | drift under 0.05 from baseline |
The pyramid runs at three cadences. Regression is fast and gated tight. Adversarial is medium and gated on policy. Production-derived is slow and gated on drift. All three layers share an Agent Definition, a persona library, a scenario authoring surface, and the programmatic eval API.
Why voice regression cannot live outside CI
Text agent regression can almost survive on manual review. Voice agent regression cannot. The reason is the failure surface: turn-taking, latency, audio quality, prosody, ASR confidence, multi-turn coherence, tool-call timing, and policy preservation all interact. A change that improves one usually nudges another. Without an automated test pipeline, the engineer shipping the change is guessing.
Three failure classes show up only when regression lives in CI:
Prompt edits that change unrelated behavior. A small tweak to the qualification prompt subtly shifts how the agent handles billing disputes. Manual testing misses it. CI catches it on turn 7 of the billing golden conversation.
Model upgrades that improve some flows and degrade others. A move from one LLM version to the next nearly always wins on some intents and loses on others. The aggregate metric looks flat. The per-intent drift shows the new model gave up 8% on refund handling while gaining 4% on technical questions. CI surfaces the per-intent breakdown.
Latency creep that hides under feature velocity. A new tool call here, a longer system prompt there, a richer retrieval payload everywhere. Each change adds 50 to 100 milliseconds. After three sprints the agent is 300ms slower per turn, and the CSAT regression shows up two weeks after the user starts feeling it. CI catches the latency creep on the first PR that pushes P95 over budget.
The argument for CI is not that it is more thorough than a senior engineer with a Friday afternoon. The argument is that the agent ships every week, the senior engineer cannot review every PR, and the failures CI catches are the ones the engineer cannot anticipate from the diff.
The golden conversation suite
Layer 1 is the regression layer. It runs on every PR. The artifact is the golden conversation suite.
What a golden conversation contains
Each golden conversation is a structured object with seven fields:
- Persona. The simulated caller. Pulled from the 18 pre-built personas or a custom-authored one.
- Initial utterance. The first thing the persona says.
- Turn-by-turn expected flow. A short script describing what the agent should accomplish at each turn, not exact wording.
- Tool-call expectations. Which tools the agent is expected to call, in what order, with what arguments.
- Expected outcome. Resolved, escalated, refused, or transferred.
- Rubric package. Which evals score this conversation. Always includes task_completion and conversation_resolution; often adds is_polite, is_helpful, is_concise.
- Tags. Intent, priority, owner. Used for CI grouping and ownership routing on failure.
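For teams that mirror the suite in their repo, the seven fields could be modeled roughly as below. This is a minimal sketch, not the platform's schema: the class names, the `problems` check, the persona name, and the tag values are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    RESOLVED = "resolved"
    ESCALATED = "escalated"
    REFUSED = "refused"
    TRANSFERRED = "transferred"


@dataclass
class ToolCallExpectation:
    name: str                                      # tool the agent should call
    arguments: dict = field(default_factory=dict)  # expected arguments, if any


@dataclass
class GoldenConversation:
    persona: str
    initial_utterance: str
    expected_flow: list[str]               # what each turn should accomplish, not exact wording
    tool_calls: list[ToolCallExpectation]  # in expected call order
    expected_outcome: Outcome
    rubrics: list[str]
    tags: dict[str, str]

    def problems(self) -> list[str]:
        """Return a list of authoring problems; an empty list means the entry is well-formed."""
        found = []
        missing = {"task_completion", "conversation_resolution"} - set(self.rubrics)
        if missing:
            found.append(f"missing required rubrics: {sorted(missing)}")
        for key in ("intent", "priority", "owner"):
            if key not in self.tags:
                found.append(f"missing tag: {key}")
        if not self.expected_flow:
            found.append("expected_flow is empty")
        return found


# Hypothetical example entry for a billing refund flow.
example = GoldenConversation(
    persona="frustrated_repeat_caller",
    initial_utterance="Can you process a refund without the receipt? I lost it.",
    expected_flow=[
        "Acknowledge the refund request",
        "Explain that a receipt or supervisor approval is required",
        "Escalate to a supervisor",
    ],
    tool_calls=[ToolCallExpectation("escalate_to_supervisor")],
    expected_outcome=Outcome.ESCALATED,
    rubrics=["task_completion", "conversation_resolution", "is_polite"],
    tags={"intent": "billing_refund", "priority": "p1", "owner": "support-platform"},
)
```

Running `problems()` over every entry in a pre-commit hook or an early CI step is one way to keep malformed conversations out of the suite before any eval time is spent on them.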
Curating the suite
50 to 200 conversations is the right range. Below 50, coverage is thin. Above 200, the test cycle slows past the developer feedback loop.
The shape that works: a base set of 30 to 50 covering the must-pass happy paths for each intent. Add 20 to 50 covering known edge cases. Add 10 to 30 covering known regression-prone areas (recent bug fixes, recent model upgrades). The total lands in the 60 to 130 range for most agents.
Three curation rules keep the set healthy:
Add a golden conversation for every shipped feature. When the agent gains a new capability, the PR adds the corresponding golden conversation. The set grows with the agent.
Replace stale conversations. When product behavior changes (new pricing, new account tiers, new policies), update the golden conversations that depend on the changed behavior. Stale conversations test the wrong thing and produce false positives that train the team to ignore the CI signal.
Curate, do not accumulate. Resist the urge to add every edge case. Each new conversation costs CI time. Keep the suite at the level where total CI wall-clock is under 15 minutes for the PR layer.
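A back-of-envelope check helps when deciding whether a new conversation fits the budget. The sketch below assumes conversations run in fixed-size parallel batches; the average duration and concurrency figures in the usage note are illustrative placeholders, not measurements.

```python
import math


def estimated_wall_clock_minutes(n_conversations, avg_seconds_per_conversation, concurrency):
    """Estimate suite wall-clock time when conversations run in parallel batches."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    batches = math.ceil(n_conversations / concurrency)
    return batches * avg_seconds_per_conversation / 60


def max_suite_size(budget_minutes, avg_seconds_per_conversation, concurrency):
    """Largest suite that still fits inside the wall-clock budget."""
    full_batches = int(budget_minutes * 60 // avg_seconds_per_conversation)
    return full_batches * concurrency
```

With a hypothetical 90-second average conversation and 20 parallel simulations, `estimated_wall_clock_minutes(120, 90, 20)` gives 9.0 minutes and `max_suite_size(15, 90, 20)` gives 200 conversations. Measure both inputs on your own stack before relying on the estimate.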
Authoring the suite in Workflow Builder
Workflow Builder is the suite authoring surface. Each golden conversation is a workflow with the persona, the initial utterance, and the turn-by-turn flow. Teams that manage eval suites as code should export or mirror the suite definition in their repo; otherwise the platform configuration is the source of truth.
The auto-generate option in Workflow Builder accelerates suite construction. Specify the agent definition, the scenario description, and a row count (20, 50, or 100). The system generates conversation paths, personas, situations, and outcomes. For the regression layer, most teams use auto-generate to bootstrap the suite, then hand-curate the conversations that survive. Branch visibility shows the coverage across each generated branch so the team can prune redundant paths.
Three-Layer Testing in CI
Coval publishes a Three-Layer framework for voice AI testing: regression, adversarial, and production-derived testing. It is the cleanest way to think about voice regression in CI. Each layer catches a different failure class and attaches to a different CI stage.
Layer 1: regression on golden conversations (every PR)
The deploy gate for routine code changes. 50 to 200 golden conversations re-run on every PR against the agent under test. Pass rate target: 95% on task_completion and conversation_resolution. Below 95%, the PR is blocked.
The wall-clock budget is tight. Most teams aim for under 15 minutes total. That constrains the suite size and the number of rubrics scored per conversation. The standard regression package is usually task_completion, conversation_resolution, conversation_coherence, is_polite, is_helpful, and is_concise; multilingual suites add translation_accuracy and cultural_sensitivity.
Layer 2: adversarial on red-team personas (release candidate)
The deploy gate for release candidates. Eight to twelve red-team archetypes, 50 to 100 scenarios per archetype, auto-generated through Workflow Builder. Pass rate target: 90% on a custom policy-boundary evaluator authored in product, with built-in PromptInjection and DataPrivacyCompliance checks where relevant. The wall-clock budget is 1 to 3 hours, which fits in a nightly or pre-release pipeline.
Standard archetypes: angry customer, confused elderly, prompt injector, social engineer, policy-edge caller, compliance-conscious, repeated-question, hostile prankster. Each archetype probes a different policy boundary. The persona library covers six of these out of the box; the other two are easy to author as custom personas.
Layer 3: production-derived on sampled real calls (per release, weekly)
The deploy gate for major model upgrades and prompt overhauls. 500 to 2,000 sampled real calls from production, replayed through the new agent version, scored on per-turn drift. The gate is under 5% drift on the rubric package. The wall-clock budget is 2 to 6 hours, so this layer lives in a release pipeline rather than the PR pipeline.
Sampled calls come from traceAI logs. Random sampling across all production traffic is the default; stratified sampling by intent or failure-focused sampling on flagged calls adds confidence on major upgrades.
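Stratified sampling by intent can be sketched as follows. This is an illustration under assumptions, not the traceAI API: each call is assumed to be a dict carrying an `intent` field, and the allocation is proportional with a floor of one call per intent.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(calls, count, key="intent", seed=0):
    """Sample `count` calls, proportionally per stratum, keeping at least one per stratum."""
    strata = defaultdict(list)
    for call in calls:
        strata[call[key]].append(call)
    if count >= len(calls):
        return list(calls)
    if count < len(strata):
        raise ValueError("count must be at least the number of strata")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible across reruns
    total = len(calls)
    quotas = {k: max(1, count * len(group) // total) for k, group in strata.items()}
    # The one-per-stratum floor can overshoot; take the excess back from the largest strata.
    while sum(quotas.values()) > count:
        largest = max(quotas, key=quotas.get)
        quotas[largest] -= 1
    sample, leftovers = [], []
    for k in sorted(strata):
        group = strata[k][:]
        rng.shuffle(group)
        sample.extend(group[:quotas[k]])
        leftovers.extend(group[quotas[k]:])
    # Integer division can undershoot; top up at random from the unsampled calls.
    rng.shuffle(leftovers)
    sample.extend(leftovers[:count - len(sample)])
    return sample


# Hypothetical traffic mix: 80 billing, 15 refund, 5 technical calls.
calls = (
    [{"id": i, "intent": "billing"} for i in range(80)]
    + [{"id": 100 + i, "intent": "refund"} for i in range(15)]
    + [{"id": 200 + i, "intent": "technical"} for i in range(5)]
)
sample = stratified_sample(calls, 20)
mix = Counter(call["intent"] for call in sample)
```

The floor of one call per intent is the point of stratifying: a purely random 20-call sample from this mix could easily contain no technical calls at all.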
The Three-Layer pattern is the testing strategy. The next two sections wire it into CI.
The programmatic eval API
The programmatic eval API is what turns a test suite into a CI gate. Configure scenarios, attach rubrics, execute the run, and read the result. The same API powers all three layers.
Configuring a run
Layer 1 regression on the golden set, called from a CI step:
from fi.evals import Evaluator
from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals.templates import (
    TaskCompletion,
    ConversationResolution,
    IsPolite,
    IsHelpful,
    IsConcise,
)
ev = Evaluator(fi_api_key="...", fi_secret_key="...")
# Pull the golden conversations from the workflow runner.
# Each golden conversation has been executed against the PR branch agent.
golden_runs = load_golden_runs(branch="pr-1284", suite="support_v3")
conversations = [
    ConversationalTestCase(messages=[
        LLMTestCase(query=t.user, response=t.agent)
        for t in run.turns
    ])
    for run in golden_runs
]
results = ev.evaluate(
    eval_templates=[
        TaskCompletion(),
        ConversationResolution(),
        IsPolite(),
        IsHelpful(),
        IsConcise(),
    ],
    inputs=conversations,
)
pass_rate = sum(r.passed for r in results) / len(results)
if pass_rate < 0.95:
    raise SystemExit(f"Regression gate failed. Pass rate {pass_rate:.2%} below 95%.")
Audio-aware evals
Voice-specific rubrics score the actual audio rather than the transcript. MLLMAudio accepts seven formats: .mp3, .wav, .ogg, .m4a, .aac, .flac, .wma. The audio loads from URL or local path with auto-base64 encoding.
from fi.testcases import MLLMTestCase, MLLMAudio
from fi.evals import Evaluator
from fi.evals.templates import AudioQualityEvaluator, AudioTranscriptionEvaluator
audio = MLLMAudio(url="https://recordings.example.com/call_42.wav")
test_case = MLLMTestCase(input=audio, query="Score TTS quality on this call")
ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[AudioQualityEvaluator(), AudioTranscriptionEvaluator()],
    inputs=[test_case],
)
audio_quality gates TTS drift on the new voice or new model. audio_transcription gates ASR quality, useful on production-derived sampling where the original audio reflects real caller acoustics.
Re-running a saved eval configuration
The eval API supports configure-then-re-run patterns. The Configure and Re-run Evaluations endpoint (shipped late 2025) lets the team save a suite plus rubric configuration once, then re-execute it from CI with a single call. The configuration lives in the FAGI platform; CI only needs the run trigger.
# Pseudocode: trigger the saved evaluation via the Configure and Re-run Evaluations API, then poll the run result.
results = trigger_saved_evaluation(
    evaluation_id="regression_support_v3",
    target_agent_definition_id="support_agent_pr_branch",
)
The pattern keeps CI files minimal. The full rubric package, sampling logic, and pass-rate thresholds live alongside the suite in the FAGI platform. CI just calls the trigger and reads the result.
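The polling half of that pattern can be sketched without assuming anything about the endpoint itself. In the sketch below, `fetch_status` is a hypothetical callable you would implement against the real API, and the `state` values (`running`, `completed`, `failed`) are placeholders rather than documented response fields.

```python
import time


def wait_for_run(fetch_status, timeout_s=900, interval_s=10,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_status()` until the run finishes or the timeout elapses.

    `fetch_status` must return a dict with a "state" key. The sleep and clock
    functions are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status["state"] in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"eval run still {status['state']} after {timeout_s}s")
        sleep(interval_s)
```

Keeping `timeout_s` a little below the CI job timeout lets the step fail with a readable message instead of being killed by the runner.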
GitHub Actions workflow
Wire all three layers into a single GitHub Actions workflow with three jobs at three cadences.
name: voice-agent-eval
on:
  pull_request:
    branches: [main]
  workflow_dispatch:
  schedule:
    - cron: "0 6 * * *" # nightly adversarial run
jobs:
  regression:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install fi
        run: pip install futureagi
      - name: Deploy PR branch agent
        run: ./scripts/deploy_pr_agent.sh "${{ github.head_ref }}"
      - name: Run Layer 1 regression
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: python ci/run_regression.py --branch "${{ github.head_ref }}"
      - name: Check P95 latency gate
        run: python ci/check_latency.py --budget-ms 1200
  adversarial:
    if: github.event_name == 'schedule' || github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    timeout-minutes: 180
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install fi
        run: pip install futureagi
      - name: Run Layer 2 adversarial
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: python ci/run_adversarial.py
  production_derived:
    if: github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    timeout-minutes: 360
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install fi
        run: pip install futureagi
      - name: Sample production calls
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
        run: python ci/sample_production.py --window-days 14 --count 1000
      - name: Run Layer 3 production-derived
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: python ci/run_production_derived.py
      - name: Check drift threshold
        run: python ci/check_drift.py --threshold 0.05
The three jobs run at three triggers. PRs run regression only. The nightly cron runs adversarial. Major releases trigger production-derived manually via workflow_dispatch. As written, a manual dispatch also starts the adversarial job, because both jobs accept the workflow_dispatch event. The split keeps the PR feedback loop under 20 minutes while letting the slower layers run on their own cadence.
GitLab CI variant
The same pattern in GitLab CI uses three jobs with rules: to gate when each runs.
stages:
  - regression
  - adversarial
  - production_derived
variables:
  PYTHON_VERSION: "3.12"
before_script:
  - pip install futureagi
regression:
  stage: regression
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  timeout: 20m
  script:
    - ./scripts/deploy_pr_agent.sh "$CI_COMMIT_REF_NAME"
    - python ci/run_regression.py --branch "$CI_COMMIT_REF_NAME"
    - python ci/check_latency.py --budget-ms 1200
adversarial:
  stage: adversarial
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
    - if: $CI_COMMIT_TAG =~ /^v.*-rc.*$/
  timeout: 3h
  script:
    - python ci/run_adversarial.py
production_derived:
  stage: production_derived
  rules:
    - if: $CI_COMMIT_TAG =~ /^v[0-9]+\.[0-9]+\.[0-9]+$/
    - if: $CI_PIPELINE_SOURCE == "schedule"
      when: manual
  timeout: 6h
  script:
    - python ci/sample_production.py --window-days 14 --count 1000
    - python ci/run_production_derived.py
    - python ci/check_drift.py --threshold 0.05
Two patterns are worth calling out. The regression stage uses the merge request trigger so the gate runs before the merge, not after. The production-derived stage uses tag-based release rules so the slow layer runs automatically only on versioned releases rather than every commit; on scheduled pipelines it appears as a manual job.
The deploy gates
Five gates carry the deploy decision. Each maps to a specific rubric or metric.
Gate 1: P95 turn latency
P95 turn latency on the golden conversation suite. The budget depends on the agent type and is set from your production voice SLO; real-time voice stacks often track both time-to-first-audio and full turn latency rather than using a universal threshold. The gate measures the regression-test runs themselves, so it reflects realistic per-turn latency rather than a microbenchmark.
import statistics
latencies_ms = [t.duration_ms for run in golden_runs for t in run.turns]
p95 = statistics.quantiles(latencies_ms, n=20)[18] # 95th percentile
if p95 > 1200:
    raise SystemExit(f"Latency gate failed. P95 {p95}ms above 1200ms.")
Gate 2: task_completion pass rate
The task_completion rubric scores whether the agent finished the requested job. Pass rate target on the golden suite: 95%. The single most important gate because it captures the customer-facing outcome rather than incidental quality.
Gate 3: conversation_resolution pass rate
The conversation_resolution rubric scores whether the multi-turn conversation closed cleanly. Pass rate target: 95%. Catches the conversation that ends in a dead-end loop even when the task was technically completed.
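Gates 2 and 3 are easier to debug when each rubric is scored separately rather than folded into one aggregate pass rate. The sketch below assumes eval results have been flattened into dicts with `rubric` and `passed` keys; that shape is an assumption for illustration, not the SDK's return type.

```python
from collections import defaultdict


def pass_rates(results):
    """Compute the pass rate per rubric from flat {"rubric", "passed"} records."""
    totals, passes = defaultdict(int), defaultdict(int)
    for record in results:
        totals[record["rubric"]] += 1
        passes[record["rubric"]] += bool(record["passed"])
    return {rubric: passes[rubric] / totals[rubric] for rubric in totals}


def failing_gates(rates, thresholds):
    """Return the gated rubrics whose pass rate is below threshold.

    A gated rubric with no results counts as 0.0, so a silently skipped
    rubric fails the gate instead of passing by omission.
    """
    return {
        rubric: rates.get(rubric, 0.0)
        for rubric, minimum in thresholds.items()
        if rates.get(rubric, 0.0) < minimum
    }


# Illustrative data: 19 of 20 pass task_completion, 18 of 20 pass conversation_resolution.
results = (
    [{"rubric": "task_completion", "passed": i != 0} for i in range(20)]
    + [{"rubric": "conversation_resolution", "passed": i > 1} for i in range(20)]
)
rates = pass_rates(results)
failures = failing_gates(rates, {"task_completion": 0.95, "conversation_resolution": 0.95})
```

Here task_completion sits exactly at 95% and clears the gate, while conversation_resolution at 90% blocks the PR.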
Gate 4: audio_quality drift
The audio_quality rubric scores TTS output on rendered audio. Compute the rubric score on the new model or new voice, compare against the baseline from the prior release. Drift target: under 0.05 absolute. Catches the silent TTS regression that no transcript-based rubric surfaces. Especially important when the team upgrades the TTS provider or switches voices in Run Prompt to a different ElevenLabs or Cartesia voice.
Gate 5: drift on the rubric package (Layer 3)
For the production-derived layer, the gate is per-rubric drift on the rubric package. For each rubric in the package, compute the score delta between the new and baseline runs across the sampled calls. The release is blocked if any single rubric drifts more than 0.05.
import statistics
# `new` and `baseline` map each rubric name to the list of per-call scores from that run.
def drift(new_scores, baseline_scores):
    return abs(statistics.mean(new_scores) - statistics.mean(baseline_scores))
for rubric in ["task_completion", "conversation_resolution", "is_concise", "is_polite", "is_helpful"]:
    d = drift(new[rubric], baseline[rubric])
    if d > 0.05:
        raise SystemExit(f"Drift gate failed on {rubric}. Drift {d:.3f} above 0.05.")
The five gates compose. A release that clears all five is ready for production. A release that fails any gate either gets blocked or canaried with a smaller cohort while the team iterates.
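One way to compose the gates is to evaluate all five and report every failure together, rather than exiting at the first one. The sketch below is a hypothetical helper, not part of any SDK, and the metric values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str


def at_least(name, value, minimum):
    """Gate that passes when the metric is at or above its minimum (pass rates)."""
    return GateResult(name, value >= minimum, f"{value:g} (min {minimum:g})")


def at_most(name, value, maximum):
    """Gate that passes when the metric is at or below its maximum (latency, drift)."""
    return GateResult(name, value <= maximum, f"{value:g} (max {maximum:g})")


def summarize(results):
    """Return a printable report and a process exit code (0 only if every gate passed)."""
    lines = [f"{'PASS' if r.passed else 'FAIL'}  {r.name}: {r.detail}" for r in results]
    return "\n".join(lines), (0 if all(r.passed for r in results) else 1)


# Illustrative values only.
results = [
    at_most("p95 turn latency (ms)", 1080, 1200),
    at_least("task_completion pass rate", 0.96, 0.95),
    at_least("conversation_resolution pass rate", 0.96, 0.95),
    at_most("audio_quality drift", 0.02, 0.05),
    at_most("rubric drift (max)", 0.07, 0.05),
]
report, exit_code = summarize(results)
```

A CI step would print `report` and finish with `sys.exit(exit_code)`, so the log shows all five verdicts even when only one gate blocks the release.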
Drift detection across model versions
The drift signal is the most useful artifact a CI eval pipeline produces. Every major model upgrade, every system prompt rewrite, every TTS provider switch is a candidate for drift detection.
How drift detection runs
- Pin the baseline. Snapshot the rubric scores on the current model on the current scenario set. The baseline is the reference point.
- Re-run the same scenarios. Execute the same scenarios against the new model. Same personas, same initial utterances, same rubric package, same scoring engine.
- Compute per-rubric drift. For each rubric, compute the absolute delta between the new and baseline mean scores.
- Surface the regression. Per-rubric drift above 0.05 is the standard release blocker.
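The four steps can be sketched with a baseline pinned to a JSON file that is stored alongside the release artifacts. The file layout and function names below are assumptions for illustration; scores are assumed to be per-call floats keyed by rubric name.

```python
import json
import statistics
from pathlib import Path


def pin_baseline(scores_by_rubric, path):
    """Step 1: snapshot the mean score per rubric for the current model."""
    means = {rubric: statistics.mean(scores) for rubric, scores in scores_by_rubric.items()}
    Path(path).write_text(json.dumps(means, indent=2, sort_keys=True))
    return means


def drift_against_baseline(new_scores_by_rubric, path):
    """Steps 2-3: signed drift (new mean minus baseline mean) for each pinned rubric.

    Raises KeyError if the new run is missing a rubric that the baseline pinned.
    """
    baseline = json.loads(Path(path).read_text())
    return {
        rubric: statistics.mean(new_scores_by_rubric[rubric]) - baseline_mean
        for rubric, baseline_mean in baseline.items()
    }


def blockers(drift, threshold=0.05):
    """Step 4: rubrics whose absolute drift exceeds the release threshold."""
    return sorted(rubric for rubric, delta in drift.items() if abs(delta) > threshold)
```

Keeping the sign in `drift_against_baseline` and taking the absolute value only in `blockers` means the report can still show whether a rubric moved up or down.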
What drift detection catches
Three regression classes show up only in drift detection.
Silent rubric regression. The aggregate pass rate looks flat. Per-rubric breakdown shows is_concise dropped 8 points. The new model is more verbose on average; customers will notice over a quarter.
Per-intent regression. Aggregate drift looks fine. Stratify by intent tag: refund handling dropped 12 points while technical support gained 4 points. The new model is worse at exactly the intent the previous model was strong at.
Tool-call regression. The new model is more confident about calling tools. The task_completion rate looks fine on the golden suite. On production-derived testing, the new model calls the wrong tool 6% of the time on calls the prior model handled correctly. The drift on tool-call correctness rubrics catches it.
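The per-intent case is the one most easily hidden by averaging, so it is worth a small sketch. The row shape (an `intent` tag plus a `scores` dict per call) is an assumption, and the numbers are illustrative values chosen to mirror the refund and technical support example above.

```python
from collections import defaultdict
from statistics import mean


def mean_by_intent(rows, rubric):
    """Mean rubric score per intent tag."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["intent"]].append(row["scores"][rubric])
    return {intent: mean(values) for intent, values in buckets.items()}


def drift_by_intent(baseline_rows, new_rows, rubric):
    """Signed drift per intent, for intents present in both runs."""
    base = mean_by_intent(baseline_rows, rubric)
    new = mean_by_intent(new_rows, rubric)
    return {intent: new[intent] - base[intent] for intent in base if intent in new}


def row(intent, score):
    return {"intent": intent, "scores": {"task_completion": score}}


# Illustrative scores: refund handling drops 12 points, technical support gains 4.
baseline_rows = [row("refund", 0.90), row("refund", 0.90), row("technical", 0.80), row("technical", 0.80)]
new_rows = [row("refund", 0.78), row("refund", 0.78), row("technical", 0.84), row("technical", 0.84)]

per_intent = drift_by_intent(baseline_rows, new_rows, "task_completion")
aggregate = (
    mean(r["scores"]["task_completion"] for r in new_rows)
    - mean(r["scores"]["task_completion"] for r in baseline_rows)
)
```

In this example the aggregate drift is about -0.04 and would clear a 0.05 gate, while the refund intent alone has drifted by about -0.12. Applying the threshold per intent as well as per rubric is what surfaces it.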
Example drift report
An illustrative drift report from a model upgrade. The team upgrades the underlying LLM, runs the same 1,000-call production-derived suite against the new model, and compares.
| Rubric | Baseline | New model | Drift | Verdict |
|---|---|---|---|---|
| task_completion | 0.93 | 0.92 | -0.01 | Pass |
| conversation_resolution | 0.89 | 0.87 | -0.02 | Pass |
| is_polite | 0.94 | 0.93 | -0.01 | Pass |
| is_helpful | 0.87 | 0.89 | +0.02 | Pass |
| is_concise | 0.91 | 0.84 | -0.07 | FAIL |
| audio_quality | 0.88 | 0.86 | -0.02 | Pass |
The is_concise regression is the release blocker. In this example, Error Localization points to the failing turns and the team confirms the new model is producing longer responses. The fix is a system-prompt brevity constraint. Re-run after the prompt fix: is_concise drift drops to -0.03. Release proceeds.
The drift gate caught a regression that would have shipped to production. The drift gate paid for itself.
Error Localization in CI failures
When the regression test fails on a 12-turn conversation, the engineer needs more than “this conversation failed.” Error Localization pinpoints the exact turn responsible.
The output of a localized failure looks like this:
Failure: conversation golden_billing_refund_v7
Failing turn: 7 of 12
Caller: "Can you process a refund without the receipt? I lost it."
Agent: "Sure, I'll process that refund now. Confirming $89.99 back to your card."
Failing rubric: task_completion
Reason: Agent processed a refund without policy-required receipt or supervisor approval.
Tool calls: refund_process(amount=89.99, account_id=...) executed without prior verify_receipt() or escalate_to_supervisor().
Suggested fix: Add receipt verification step before refund_process tool call.
The CI log surfaces the turn, the input, the agent’s response, the failing rubric, the reason, the tool calls executed, and a suggested fix. The engineer fixes the system prompt or the tool-call sequence directly, instead of replaying the conversation by hand to figure out which turn went wrong.
Error Localization shipped in late 2025 in FAGI’s Simulate product. In CI it reduces manual replay work by pointing engineers to the failing turn, input, response, and rubric.
Three deliberate tradeoffs
Golden-set curation is human work. New conversations added on every shipped feature, stale conversations replaced when product behavior changes, redundant conversations pruned. Plan for 10 to 15% of testing effort on suite maintenance. The compounding payoff is real: six months in, the suite is the highest-signal artifact the team owns about the agent’s intended behavior, and Error Feed feeds new candidate conversations from production clusters automatically.
Adversarial coverage extends from production clusters. New attack patterns become new adversarial personas through the Error Feed cluster surface; failing patterns from production flow into the simulation library as candidate scenarios. The translation step is human-approved by design (regulated workloads want a reviewer to accept every new red-team archetype). A new attack pattern in production lands as a new persona in the next release candidate.
Production-derived testing activates with production traffic. Pre-launch agents bootstrap on Layer 1 plus Layer 2 alone; Layer 3 activates once traffic accumulates (typically 30 to 60 days post-launch). The Vapi / Retell / LiveKit native voice observability captures calls from day one, so Layer 3 is ready the moment the trace volume crosses the sampling threshold. The waiting period does not block the launch.
How FAGI ships Three-Layer Testing built-in
Coval publishes a Three-Layer framework for voice AI testing (regression on golden conversations, adversarial on red-team personas, production-derived on sampled real calls). The pattern is well-known in voice-AI QA. FAGI’s Workflow Builder ships Three-Layer Testing as the default flow inside a unified eval plus observability plus simulation plus guardrail platform, with the simulation library sharing data with the trace store, eval engine, and Error Feed in the same project.
The full FAGI simulation surface that powers the CI loop: 18 pre-built personas plus unlimited custom-authored (gender, age range across 18-25 / 25-32 / 32-40 / 40-50 / 50-60 / 60+, location across US / Canada / UK / Australia / India, personality traits, communication style, accent, conversation speed, background noise, multilingual across many popular languages, custom properties, free-form behavioral instructions). Visual Workflow Builder with drag-and-drop graph (Conversation / End Call / Transfer Call nodes). Auto-generate scenarios at 20, 50, or 100 rows with branch visibility. Dataset scenarios via CSV / JSON / Excel upload plus synthetic generation. 4-step Run Tests wizard (config → scenarios → eval → execute). Error Localization that pinpoints the exact failing turn. Tool Calling eval. Programmatic eval API for configure plus re-run. Custom voices from ElevenLabs and Cartesia in Run Prompt. Indian phone number simulation. Show Reasoning column in Simulate.
The simulation library doesn’t round-trip data to a separate observability or eval product. Production calls flow into the simulation library natively via Vapi / Retell / LiveKit dashboard integration; failures cluster in Error Feed and surface as candidate scenarios automatically. That’s the closed loop the Three-Layer pattern needs to stay continuous across releases.
Future AGI on voice regression in CI/CD
Simulate is the test execution surface for all three layers. 18 pre-built personas plus custom-persona authoring with controls for gender, age, location, accent, communication style, background noise, and multilingual. Workflow Builder auto-generates branching scenarios at 20, 50, or 100 rows with branch visibility for coverage. The 4-step Run Tests wizard executes the test matrix. Error Localization pinpoints the failing turn. The programmatic eval API (Configure and Re-run Evaluations) wires the whole thing into GitHub Actions or GitLab CI.
ai-evaluation ships 70+ built-in eval templates. The CI deploy-gate package: task_completion, conversation_resolution, audio_quality, audio_transcription, is_polite, is_helpful, is_concise. Multi-turn dialogs use ConversationalTestCase. Audio rubrics use MLLMAudio with seven supported formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma). Custom voice evaluators are authored in product by the in-product evaluator-authoring agent that calibrates from human review feedback. Apache 2.0. When the failing set is prompt-sensitive, agent-opt runs 6 optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) against the failed scenarios from the Dataset UI or the Python SDK to propose prompt candidates for human review.
traceAI is the source for Layer 3 production-derived testing. Captures instrumented production interactions as OpenInference-compatible spans. For Vapi, Retell, and LiveKit, native voice observability adds no-SDK call logs, transcripts, and separate assistant/customer recordings through provider API key plus Assistant ID. 30+ documented integrations across Python and TypeScript including dedicated traceAI-pipecat and traceai-livekit packages. Apache 2.0.
Future AGI Protect reduces the adversarial-layer failure rate by handling prompt injection and policy violation at the safety layer. Runs sub-100ms inline on Gemma 3n foundation with LoRA-trained adapters per safety dimension per arXiv 2510.13351. Multi-modal across text, image, and audio. ProtectFlash for single-call binary classification when the rule-based scan time is too expensive.
Error Feed auto-clusters failures from all three layers into named issues with auto-written root cause, quick fix, and long-term recommendation. The same cluster surface spans regression failures, adversarial failures, and production-derived drift.
Agent Command Center hosts the whole CI eval pipeline with RBAC, SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page. AWS Marketplace, multi-region hosted, BYOC for regulated workloads.
The pipeline is the workflow. The six products above are the implementation.
A worked CI rollout: 30-day plan
A 30-day rollout for a customer support voice agent moving from manual testing to CI-gated regression.
Week 1: foundation.
- Author the Agent Definition. Name, behavior, capabilities, constraints, tool list.
- Build the persona library: 8 custom personas plus 6 pre-built. Cover the routine intents.
- Author 60 golden conversations through Workflow Builder. Use auto-generate to bootstrap, then hand-curate.
Week 2: Layer 1 in PR CI.
- Write the GitHub Actions workflow with the regression job.
- Attach the 5-rubric package: task_completion, conversation_resolution, is_polite, is_helpful, is_concise.
- Set the gate at 95% pass rate on task_completion and conversation_resolution.
- First baseline run on the main branch: 89%. Engineers fix the failures over three days. Re-run: 96%. Gate clears.
- P95 latency gate at 1.2 seconds. First baseline: 1.45 seconds. Engineers tune the system prompt and remove a redundant tool call. Re-run: 1.08 seconds. Gate clears.
Week 3: Layer 2 in nightly cron.
- Author 8 adversarial archetype scenarios. Auto-generate 50 rows per archetype.
- Custom policy_preservation rubric authored by the in-product agent.
- First adversarial run: prompt-injector at 73%, social-engineer at 81%. Below gate.
- Engineer adds a Protect ruleset for prompt injection plus tightens the auth flow in the system prompt. Re-run: prompt-injector at 94%, social-engineer at 92%. Gate clears.
Week 4: Layer 3 wiring.
- traceAI has been capturing every production call since Week 1.
- Author the production-derived sampling script: random sample 1,000 calls from the prior 14 days, stratified by intent tag.
- Replay through the current production agent. Pin the baseline.
- First drift run with no model change shows under 1% drift across all rubrics. The pipeline is healthy.
- Schedule the production-derived job in workflow_dispatch for tagged releases.
By Day 30, the team has a complete three-layer CI eval pipeline. Layer 1 runs on every PR with a 15-minute feedback loop. Layer 2 runs nightly with a 2-hour wall-clock. Layer 3 runs per tagged release with a 4-hour wall-clock. The team has shipped 8 PRs in 30 days. CI caught regressions on 3 of them. None of the regressions reached production.
The CI eval pipeline pays for itself on the first production incident it prevents.
Related reading
- Three-Layer Voice AI Testing: Regression, Adversarial, Production-Derived
- Voice Agent Simulation: A 2026 Engineering Guide
- Evaluating Voice AI Agents in 2026
- How to Implement Voice AI Observability in 2026
Sources and references
- Future AGI Protect: arXiv 2510.13351
- OpenInference span specification: github.com/Arize-ai/openinference
- Future AGI trust and compliance: futureagi.com/trust
- Future AGI Simulate documentation: docs.futureagi.com/docs/simulate
- ai-evaluation repository: github.com/future-agi/ai-evaluation
- traceAI repository: github.com/future-agi/traceAI
- Coval Three-Layer Testing pattern: coval.dev product documentation
- GitHub Actions documentation: docs.github.com/en/actions
- GitLab CI/CD reference: docs.gitlab.com/ee/ci