Voice Agent Regression Testing in CI/CD: A 2026 Engineering Guide

Wire voice agent regression tests into GitHub Actions and GitLab CI in 2026: golden conversations, three-layer testing, deploy gates, drift detection, and FAGI evals.

A voice agent that ships without a CI regression gate is one prompt edit away from a production incident. The fix is not heroic manual testing. It is a programmatic eval pipeline wired into the same GitHub Actions or GitLab CI workflow that builds and deploys the agent. Pass rates become deploy gates. Drift scores become release blockers. The same Three-Layer framework Coval publishes runs against every pull request, every release candidate, and every model upgrade. This guide is the engineering implementation: golden conversations, scenario auto-generation, eval API calls, .yml snippets, and the deploy gates that catch regressions before customers do.

TL;DR: voice regression in CI/CD at a glance

| Stage | What runs | When | Deploy gate |
| --- | --- | --- | --- |
| Layer 1: Regression | 50-200 golden conversations | Every PR | task_completion 95% pass |
| Layer 2: Adversarial | 50-100 scenarios per red-team archetype | Release candidate | policy_preservation 90% pass |
| Layer 3: Production-derived | 500-2,000 sampled real calls | Per release plus weekly | under 5% drift on rubric package |
| Latency gate | P95 turn latency on golden set | Every PR | budget-specific (often under 1.2s) |
| Audio quality gate | audio_quality on TTS output | Release candidate | drift under 0.05 from baseline |

The three layers run at three cadences. Regression is fast and gated tight. Adversarial is medium and gated on policy. Production-derived is slow and gated on drift. All three layers share an Agent Definition, a persona library, a scenario authoring surface, and the programmatic eval API.

Why voice regression cannot live outside CI

Text agent regression can almost survive on manual review. Voice agent regression cannot. The reason is the failure surface: turn-taking, latency, audio quality, prosody, ASR confidence, multi-turn coherence, tool-call timing, and policy preservation all interact. A change that improves one usually nudges another. Without an automated test pipeline, the engineer shipping the change is guessing.

Three failure classes show up only when regression lives in CI:

Prompt edits that change unrelated behavior. A small tweak to the qualification prompt subtly shifts how the agent handles billing disputes. Manual testing misses it. CI catches it on turn 7 of the billing golden conversation.

Model upgrades that improve some flows and degrade others. A move from one LLM version to the next nearly always wins on some intents and loses on others. The aggregate metric looks flat. The per-intent drift shows the new model gave up 8% on refund handling while gaining 4% on technical questions. CI surfaces the per-intent breakdown.

Latency creep that hides under feature velocity. A new tool call here, a longer system prompt there, a richer retrieval payload everywhere. Each change adds 50 to 100 milliseconds. After three sprints the agent is 300ms slower per turn, and the CSAT regression shows up two weeks after the user starts feeling it. CI catches the latency creep on the first PR that pushes P95 over budget.

The argument for CI is not that it is more thorough than a senior engineer with a Friday afternoon. The argument is that the agent ships every week, the senior engineer cannot review every PR, and the failures CI catches are the ones the engineer cannot anticipate from the diff.

The golden conversation suite

Layer 1 is the regression layer. It runs on every PR. The artifact is the golden conversation suite.

What a golden conversation contains

Each golden conversation is a structured object with seven fields:

  • Persona. The simulated caller. Pulled from the 18 pre-built personas or a custom-authored one.
  • Initial utterance. The first thing the persona says.
  • Turn-by-turn expected flow. A short script describing what the agent should accomplish at each turn, not exact wording.
  • Tool-call expectations. Which tools the agent is expected to call, in what order, with what arguments.
  • Expected outcome. Resolved, escalated, refused, or transferred.
  • Rubric package. Which evals score this conversation. Always includes task_completion and conversation_resolution; often adds is_polite, is_helpful, is_concise.
  • Tags. Intent, priority, owner. Used for CI grouping and ownership routing on failure.
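For teams that mirror the suite in their repo, the seven fields above can be sketched as a small typed record. This is a minimal illustration, not the Workflow Builder schema: the class names, field names, persona name, and the `lookup_account` tool are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Literal

Outcome = Literal["resolved", "escalated", "refused", "transferred"]


@dataclass(frozen=True)
class ToolCallExpectation:
    """One expected tool call, listed in the order it should occur."""
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass(frozen=True)
class GoldenConversation:
    """Hypothetical in-repo mirror of one golden conversation."""
    persona: str
    initial_utterance: str
    expected_flow: tuple[str, ...]  # what each turn should accomplish, not exact wording
    tool_calls: tuple[ToolCallExpectation, ...]
    expected_outcome: Outcome
    rubrics: tuple[str, ...]
    tags: dict[str, str]  # intent, priority, owner

    def __post_init__(self) -> None:
        # The rubric package always includes these two evals.
        required = {"task_completion", "conversation_resolution"}
        missing = required - set(self.rubrics)
        if missing:
            raise ValueError(f"rubric package is missing {sorted(missing)}")


# Illustrative instance; every value here is made up for the example.
refund = GoldenConversation(
    persona="frustrated_repeat_caller",
    initial_utterance="I was charged twice for my subscription.",
    expected_flow=("verify identity", "locate duplicate charge", "issue refund"),
    tool_calls=(ToolCallExpectation("lookup_account"), ToolCallExpectation("refund_process")),
    expected_outcome="resolved",
    rubrics=("task_completion", "conversation_resolution", "is_polite"),
    tags={"intent": "billing", "priority": "p0", "owner": "payments-team"},
)
```

Validating the rubric package at construction time means a malformed conversation fails when the suite loads rather than midway through a CI run.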

Curating the suite

50 to 200 conversations is the right range. Below 50, coverage is thin. Above 200, the test cycle slows past the developer feedback loop.

The shape that works: a base set of 30 to 50 covering the must-pass happy paths for each intent. Add 20 to 50 covering known edge cases. Add 10 to 30 covering known regression-prone areas (recent bug fixes, recent model upgrades). The total lands in the 60 to 130 range for most agents.

Three curation rules keep the set healthy:

Add a golden conversation for every shipped feature. When the agent gains a new capability, the PR adds the corresponding golden conversation. The set grows with the agent.

Replace stale conversations. When product behavior changes (new pricing, new account tiers, new policies), update the golden conversations that depend on the changed behavior. Stale conversations test the wrong thing and produce false positives that train the team to ignore the CI signal.

Curate, do not accumulate. Resist the urge to add every edge case. Each new conversation costs CI time. Keep the suite at the level where total CI wall-clock is under 15 minutes for the PR layer.

Authoring the suite in Workflow Builder

Workflow Builder is the suite authoring surface. Each golden conversation is a workflow with the persona, the initial utterance, and the turn-by-turn flow. Teams that manage eval suites as code should export or mirror the suite definition in their repo; otherwise the platform configuration is the source of truth.

The auto-generate option in Workflow Builder accelerates suite construction. Specify the agent definition, the scenario description, and a row count (20, 50, or 100). The system generates conversation paths, personas, situations, and outcomes. For the regression layer, most teams use auto-generate to bootstrap the suite, then hand-curate the conversations that survive. Branch visibility shows the coverage across each generated branch so the team can prune redundant paths.

Three-Layer Testing in CI

Coval publishes a Three-Layer framework for voice AI testing: regression, adversarial, and production-derived testing. It is the cleanest way to think about voice regression in CI. Each layer catches a different failure class and attaches to a different CI stage.

Layer 1: regression on golden conversations (every PR)

The deploy gate for routine code changes. 50 to 200 golden conversations re-run on every PR against the agent under test. Pass rate target: 95% on task_completion and conversation_resolution. Below 95%, the PR is blocked.

The wall-clock budget is tight. Most teams aim for under 15 minutes total. That constrains the suite size and the number of rubrics scored per conversation. The standard regression package is usually task_completion, conversation_resolution, conversation_coherence, is_polite, is_helpful, and is_concise; multilingual suites add translation_accuracy and cultural_sensitivity.

Layer 2: adversarial on red-team personas (release candidate)

The deploy gate for release candidates. Eight to twelve red-team archetypes, 50 to 100 scenarios per archetype, auto-generated through Workflow Builder. Pass rate target: 90% on a custom policy-boundary evaluator authored in product, with built-in PromptInjection and DataPrivacyCompliance checks where relevant. The wall-clock budget is 1 to 3 hours, which fits in a nightly or pre-release pipeline.

Standard archetypes: angry customer, confused elderly, prompt injector, social engineer, policy-edge caller, compliance-conscious, repeated-question, hostile prankster. Each archetype probes a different policy boundary. The persona library covers six of these out of the box; the other two are straightforward to author as custom personas.

Layer 3: production-derived on sampled real calls (per release, weekly)

The deploy gate for major model upgrades and prompt overhauls. 500 to 2,000 sampled real calls from production, replayed through the new agent version, scored on per-turn drift. The gate is under 5% drift on the rubric package. The wall-clock budget is 2 to 6 hours, so this layer lives in a release pipeline rather than the PR pipeline.

Sampled calls come from traceAI logs. Random sampling across all production traffic is the default; stratified sampling by intent or failure-focused sampling on flagged calls adds confidence on major upgrades.
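Stratified sampling by intent can be sketched in a few lines. This is an illustration under stated assumptions, not the traceAI API: calls are plain dicts with an `intent` key, and the function name, seed handling, and proportional allocation rule are all hypothetical choices.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(calls, total, key="intent", seed=0):
    """Sample about `total` calls, allocating per stratum in proportion to traffic.

    Every stratum gets at least one call so rare intents stay represented.
    Rounding means the result can differ from `total` by a few calls.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible across CI runs
    strata = defaultdict(list)
    for call in calls:
        strata[call[key]].append(call)

    sample = []
    for _, members in sorted(strata.items()):
        share = round(total * len(members) / len(calls))
        k = min(len(members), max(1, share))
        sample.extend(rng.sample(members, k))
    return sample


# Illustrative traffic: 700 billing, 200 refund, 100 cancel calls.
def _intent(i):
    return "billing" if i % 10 < 7 else ("refund" if i % 10 < 9 else "cancel")


calls = [{"id": i, "intent": _intent(i)} for i in range(1000)]
sample = stratified_sample(calls, total=100)
counts = Counter(call["intent"] for call in sample)
```

Proportional allocation keeps the replayed set close to the real traffic mix, while the one-call floor guards against a rare intent dropping out of the sample entirely.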

The Three-Layer pattern is the testing strategy. The next two sections wire it into CI.

The programmatic eval API

The programmatic eval API is what turns a test suite into a CI gate. Configure scenarios, attach rubrics, execute the run, and read the result. The same API powers all three layers.

Configuring a run

Layer 1 regression on the golden set, called from a CI step:

from fi.evals import Evaluator
from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals.templates import (
    TaskCompletion,
    ConversationResolution,
    IsPolite,
    IsHelpful,
    IsConcise,
)

ev = Evaluator(fi_api_key="...", fi_secret_key="...")

# Pull the golden conversations from the workflow runner.
# Each golden conversation has been executed against the PR branch agent.
golden_runs = load_golden_runs(branch="pr-1284", suite="support_v3")

conversations = [
    ConversationalTestCase(messages=[
        LLMTestCase(query=t.user, response=t.agent)
        for t in run.turns
    ])
    for run in golden_runs
]

results = ev.evaluate(
    eval_templates=[
        TaskCompletion(),
        ConversationResolution(),
        IsPolite(),
        IsHelpful(),
        IsConcise(),
    ],
    inputs=conversations,
)

pass_rate = sum(r.passed for r in results) / len(results)
if pass_rate < 0.95:
    raise SystemExit(f"Regression gate failed. Pass rate {pass_rate:.2%} below 95%.")

Audio-aware evals

Voice-specific rubrics score the actual audio rather than the transcript. MLLMAudio accepts seven formats: .mp3, .wav, .ogg, .m4a, .aac, .flac, .wma. The audio loads from URL or local path with auto-base64 encoding.

from fi.testcases import MLLMTestCase, MLLMAudio
from fi.evals import Evaluator
from fi.evals.templates import AudioQualityEvaluator, AudioTranscriptionEvaluator

audio = MLLMAudio(url="https://recordings.example.com/call_42.wav")
test_case = MLLMTestCase(input=audio, query="Score TTS quality on this call")

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[AudioQualityEvaluator(), AudioTranscriptionEvaluator()],
    inputs=[test_case],
)

audio_quality gates TTS drift on the new voice or new model. audio_transcription gates ASR quality, useful on production-derived sampling where the original audio reflects real caller acoustics.

Re-running a saved eval configuration

The eval API supports configure-then-re-run patterns. The Configure and Re-run Evaluations endpoint (shipped late 2025) lets the team save a suite plus rubric configuration once, then re-execute it from CI with a single call. The configuration lives in the FAGI platform; CI only needs the run trigger.

# Pseudocode: trigger the saved evaluation via the Configure and Re-run Evaluations API, then poll the run result.
results = trigger_saved_evaluation(
    evaluation_id="regression_support_v3",
    target_agent_definition_id="support_agent_pr_branch",
)

The pattern keeps CI files minimal. The full rubric package, sampling logic, and pass-rate thresholds live alongside the suite in the FAGI platform. CI just calls the trigger and reads the result.
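The "trigger, then poll" half of that pattern is generic enough to sketch without assuming anything about the endpoint itself. In the sketch below, `fetch_status` is a hypothetical callable you would implement against the run-status API, and the state names and the `pass_rate` field are assumptions, not documented response fields.

```python
import time


def wait_for_run(fetch_status, timeout_s=900.0, interval_s=10.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_status()` until it reports a terminal state or the timeout passes.

    `fetch_status` is a hypothetical callable returning a dict such as
    {"state": "running"} or {"state": "completed", "pass_rate": 0.97}.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status["state"] in {"completed", "failed"}:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"evaluation still {status['state']} after {timeout_s}s")
        sleep(interval_s)


# Illustrative run with a canned sequence of statuses and no real sleeping.
_states = iter([
    {"state": "running"},
    {"state": "running"},
    {"state": "completed", "pass_rate": 0.97},
])
result = wait_for_run(lambda: next(_states), sleep=lambda _seconds: None)
```

Injecting `sleep` and `clock` keeps the loop testable without waiting, and an explicit timeout lets the CI job fail with a clear message instead of hitting the runner's own limit.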

GitHub Actions workflow

Wire all three layers into a single GitHub Actions workflow with three jobs at three cadences.

name: voice-agent-eval
on:
  pull_request:
    branches: [main]
  workflow_dispatch:
  schedule:
    - cron: "0 6 * * *"  # nightly adversarial run

jobs:
  regression:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install fi
        run: pip install futureagi
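      - name: Deploy PR branch agent
        env:
          # Pass the branch name through env rather than inline expansion;
          # head_ref is attacker-controlled and inline use allows shell injection.
          HEAD_REF: ${{ github.head_ref }}
        run: ./scripts/deploy_pr_agent.sh "$HEAD_REF"
      - name: Run Layer 1 regression
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
          HEAD_REF: ${{ github.head_ref }}
        run: python ci/run_regression.py --branch "$HEAD_REF"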
      - name: Check P95 latency gate
        run: python ci/check_latency.py --budget-ms 1200

  adversarial:
    if: github.event_name == 'schedule' || github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    timeout-minutes: 180
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install fi
        run: pip install futureagi
      - name: Run Layer 2 adversarial
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: python ci/run_adversarial.py

  production_derived:
    if: github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    timeout-minutes: 360
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install fi
        run: pip install futureagi
      - name: Sample production calls
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: python ci/sample_production.py --window-days 14 --count 1000
      - name: Run Layer 3 production-derived
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: python ci/run_production_derived.py
      - name: Check drift threshold
        run: python ci/check_drift.py --threshold 0.05

The three jobs run on three triggers. PRs run regression only. The nightly cron runs adversarial. A manual workflow_dispatch run starts production-derived for major releases; as written, the same manual run also starts the adversarial job, which fits its role as the release-candidate gate. The split keeps the PR feedback loop under 20 minutes while letting the slower layers run on their own cadence.

GitLab CI variant

The same pattern in GitLab CI uses three jobs with rules: to gate when each runs.

stages:
  - regression
  - adversarial
  - production_derived

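variables:
  PYTHON_VERSION: "3.12"

# Run every job in a Python image so pip is available; the tag follows PYTHON_VERSION.
image: python:${PYTHON_VERSION}

before_script:
  - pip install futureagi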

regression:
  stage: regression
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  timeout: 20m
  script:
    - ./scripts/deploy_pr_agent.sh "$CI_COMMIT_REF_NAME"
    - python ci/run_regression.py --branch "$CI_COMMIT_REF_NAME"
    - python ci/check_latency.py --budget-ms 1200

adversarial:
  stage: adversarial
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
    - if: $CI_COMMIT_TAG =~ /^v.*-rc.*$/
  timeout: 3h
  script:
    - python ci/run_adversarial.py

production_derived:
  stage: production_derived
  rules:
    - if: $CI_COMMIT_TAG =~ /^v[0-9]+\.[0-9]+\.[0-9]+$/
    - if: $CI_PIPELINE_SOURCE == "schedule"
      when: manual
  timeout: 6h
  script:
    - python ci/sample_production.py --window-days 14 --count 1000
    - python ci/run_production_derived.py
    - python ci/check_drift.py --threshold 0.05

Two patterns are worth calling out. The regression stage uses the merge request trigger so the gate runs before the merge, not after. The production-derived stage uses tag-based release rules so the slow layer only runs on actual versioned releases rather than every commit.

The deploy gates

Five gates carry the deploy decision. Each maps to a specific rubric or metric.

Gate 1: P95 turn latency

P95 turn latency on the golden conversation suite. The budget depends on the agent type and is set from your production voice SLO; real-time voice stacks often track both time-to-first-audio and full turn latency rather than using a universal threshold. The gate measures the regression-test runs themselves, so it reflects realistic per-turn latency rather than a microbenchmark.

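import statistics

# One latency sample per agent turn across every golden run.
latencies_ms = [t.duration_ms for run in golden_runs for t in run.turns]
# quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
p95 = statistics.quantiles(latencies_ms, n=20)[18]

if p95 > 1200:
    raise SystemExit(f"Latency gate failed. P95 {p95:.0f}ms above 1200ms.")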

Gate 2: task_completion pass rate

The task_completion rubric scores whether the agent finished the requested job. Pass rate target on the golden suite: 95%. The single most important gate because it captures the customer-facing outcome rather than incidental quality.

Gate 3: conversation_resolution pass rate

The conversation_resolution rubric scores whether the multi-turn conversation closed cleanly. Pass rate target: 95%. Catches the conversation that ends in a dead-end loop even when the task was technically completed.
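Gates 2 and 3 apply to specific rubrics rather than to an overall pass rate, so the check needs a per-rubric breakdown. The sketch below assumes verdicts arrive as plain `(conversation_id, rubric, passed)` tuples; that shape and the function names are illustrative, not the eval SDK's result format.

```python
from collections import defaultdict

# Thresholds from the two gates above; other rubrics are scored but not gated here.
GATED_RUBRICS = {"task_completion": 0.95, "conversation_resolution": 0.95}


def pass_rates(verdicts):
    """verdicts: iterable of (conversation_id, rubric, passed) tuples."""
    totals, passes = defaultdict(int), defaultdict(int)
    for _, rubric, passed in verdicts:
        totals[rubric] += 1
        passes[rubric] += bool(passed)
    return {rubric: passes[rubric] / totals[rubric] for rubric in totals}


def failing_gates(verdicts, thresholds=GATED_RUBRICS):
    """Return {rubric: pass_rate} for every gated rubric below its threshold."""
    rates = pass_rates(verdicts)
    # A gated rubric with no verdicts counts as failing rather than silently passing.
    return {r: rates.get(r, 0.0) for r, t in thresholds.items() if rates.get(r, 0.0) < t}


# Illustrative run over 20 conversations: task_completion passes 19 of 20,
# conversation_resolution passes 18 of 20, is_polite passes 10 of 20 (not gated).
verdicts = []
for i in range(20):
    verdicts.append((f"golden_{i}", "task_completion", i != 0))
    verdicts.append((f"golden_{i}", "conversation_resolution", i > 1))
    verdicts.append((f"golden_{i}", "is_polite", i % 2 == 0))

failing = failing_gates(verdicts)
```

Here only `conversation_resolution` is reported, at a 90% pass rate; `task_completion` sits exactly on the 95% threshold and clears it.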

Gate 4: audio_quality drift

The audio_quality rubric scores TTS output on rendered audio. Compute the rubric score on the new model or new voice, compare against the baseline from the prior release. Drift target: under 0.05 absolute. Catches the silent TTS regression that no transcript-based rubric surfaces. Especially important when the team upgrades the TTS provider or switches voices in Run Prompt to a different ElevenLabs or Cartesia voice.

Gate 5: drift on the rubric package (Layer 3)

For the production-derived layer, the gate is per-rubric drift on the rubric package. For each rubric in the package, compute the score delta between the new and baseline runs across the sampled calls. The release is blocked if any single rubric drifts more than 0.05.

import statistics

# `new` and `baseline` map each rubric name to its list of per-call scores.
def drift(new_scores, baseline_scores):
    return abs(statistics.mean(new_scores) - statistics.mean(baseline_scores))

for rubric in ["task_completion", "conversation_resolution", "is_concise", "is_polite", "is_helpful"]:
    d = drift(new[rubric], baseline[rubric])
    if d > 0.05:
        raise SystemExit(f"Drift gate failed on {rubric}. Drift {d:.3f} above 0.05.")

The five gates compose. A release that clears all five is ready for production. A release that fails any gate either gets blocked or canaried with a smaller cohort while the team iterates.
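One way to compose the gates is to evaluate all of them and report every failure in a single CI run, rather than exiting on the first. The sketch below is an assumption about how a team might structure that: the `metrics` dict shape, the gate names, and the default thresholds (taken from the figures used in this guide) are illustrative.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str


def evaluate_gates(metrics, latency_budget_ms=1200, min_pass_rate=0.95, max_drift=0.05):
    """Evaluate every gate and return all results, so one run reports every failure."""
    results = [
        GateResult(
            "p95_latency",
            metrics["p95_latency_ms"] <= latency_budget_ms,
            f"{metrics['p95_latency_ms']:.0f}ms vs budget {latency_budget_ms}ms",
        )
    ]
    for rubric in ("task_completion", "conversation_resolution"):
        rate = metrics["pass_rate"][rubric]
        results.append(GateResult(rubric, rate >= min_pass_rate, f"pass rate {rate:.2%}"))
    # Drift covers audio_quality and the Layer 3 rubric package alike.
    for rubric, delta in metrics["drift"].items():
        results.append(GateResult(f"drift:{rubric}", abs(delta) <= max_drift, f"drift {delta:+.3f}"))
    return results


# Illustrative metrics for one release candidate.
metrics = {
    "p95_latency_ms": 1080.0,
    "pass_rate": {"task_completion": 0.96, "conversation_resolution": 0.97},
    "drift": {"audio_quality": -0.02, "is_concise": -0.07},
}
results = evaluate_gates(metrics)
failures = [r.name for r in results if not r.passed]
for r in results:
    print(("PASS" if r.passed else "FAIL"), r.name, r.detail)
```

A CI script would exit non-zero when `failures` is non-empty; collecting the results first means the engineer sees the full picture in one log.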

Drift detection across model versions

The drift signal is the most useful artifact a CI eval pipeline produces. Every major model upgrade, every system prompt rewrite, every TTS provider switch is a candidate for drift detection.

How drift detection runs

  1. Pin the baseline. Snapshot the rubric scores on the current model on the current scenario set. The baseline is the reference point.
  2. Re-run the same scenarios. Execute the same scenarios against the new model. Same personas, same initial utterances, same rubric package, same scoring engine.
  3. Compute per-rubric drift. For each rubric, compute the absolute delta between the new and baseline mean scores.
  4. Surface the regression. Per-rubric drift above 0.05 is the standard release blocker.
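The four steps above can be sketched with a JSON snapshot as the pinned baseline. This is a minimal illustration, assuming scores are available as a dict of rubric name to per-call scores; the file name and function names are hypothetical.

```python
import json
import statistics
import tempfile
from pathlib import Path


def pin_baseline(scores_by_rubric, path):
    """Snapshot the mean score per rubric so later runs compare against a fixed reference."""
    baseline = {rubric: statistics.mean(scores) for rubric, scores in scores_by_rubric.items()}
    Path(path).write_text(json.dumps(baseline, indent=2, sort_keys=True))
    return baseline


def compare_to_baseline(scores_by_rubric, path, threshold=0.05):
    """Return {rubric: signed_delta} for rubrics whose absolute drift exceeds the threshold."""
    baseline = json.loads(Path(path).read_text())
    blockers = {}
    for rubric, reference in baseline.items():
        delta = statistics.mean(scores_by_rubric[rubric]) - reference
        if abs(delta) > threshold:
            blockers[rubric] = round(delta, 3)
    return blockers


# Illustrative scores: two calls per rubric, before and after a model change.
current = {"task_completion": [0.93, 0.93], "is_concise": [0.90, 0.92]}
candidate = {"task_completion": [0.92, 0.92], "is_concise": [0.83, 0.85]}

with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "baseline.json"
    pin_baseline(current, baseline_path)
    blockers = compare_to_baseline(candidate, baseline_path)
```

Keeping the signed delta in the report shows the direction of the change, while the gate itself still compares the absolute value against the threshold.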

What drift detection catches

Three regression classes show up only in drift detection.

Silent rubric regression. The aggregate pass rate looks flat. Per-rubric breakdown shows is_concise dropped 8 points. The new model is more verbose on average; customers will notice over a quarter.

Per-intent regression. Aggregate drift looks fine. Stratify by intent tag: refund handling drifted minus 12 points while technical support drifted plus 4 points. The new model is worse at exactly the intent the previous model was strong at.

Tool-call regression. The new model is more confident about calling tools. The task_completion rate looks fine on the golden suite. On production-derived testing, the new model calls the wrong tool 6% of the time on calls the prior model handled correctly. The drift on tool-call correctness rubrics catches it.
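The per-intent case is easy to reproduce with a few numbers. The sketch below assumes each scored call is reduced to an `(intent, score)` pair; the function name and the sample scores are made up to mirror the refund and technical-support example above.

```python
import statistics
from collections import defaultdict


def drift_by_intent(baseline_rows, new_rows):
    """Each row is (intent, score). Returns {intent: new_mean - baseline_mean}."""
    def means(rows):
        grouped = defaultdict(list)
        for intent, score in rows:
            grouped[intent].append(score)
        return {intent: statistics.mean(scores) for intent, scores in grouped.items()}

    base, new = means(baseline_rows), means(new_rows)
    return {intent: new[intent] - base[intent] for intent in base if intent in new}


# Illustrative data: one refund call for every three technical calls.
baseline_rows = [("refund", 0.90)] + [("technical", 0.80)] * 3
new_rows = [("refund", 0.78)] + [("technical", 0.84)] * 3

aggregate_drift = (
    statistics.mean(score for _, score in new_rows)
    - statistics.mean(score for _, score in baseline_rows)
)
per_intent = drift_by_intent(baseline_rows, new_rows)
```

With this traffic mix the aggregate drift is effectively zero, while the stratified view shows refund handling down 12 points and technical support up 4.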

Example drift report

An illustrative drift report from a model upgrade. The team upgrades the underlying LLM, runs the same 1,000-call production-derived suite against the new model, and compares.

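| Rubric | Baseline | New model | Drift | Verdict |
| --- | --- | --- | --- | --- |
| task_completion | 0.93 | 0.92 | -0.01 | Pass |
| conversation_resolution | 0.89 | 0.87 | -0.02 | Pass |
| is_polite | 0.94 | 0.93 | -0.01 | Pass |
| is_helpful | 0.87 | 0.89 | +0.02 | Pass |
| is_concise | 0.91 | 0.84 | -0.07 | FAIL |
| audio_quality | 0.88 | 0.86 | -0.02 | Pass |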

The is_concise regression is the release blocker. In this example, Error Localization points to the failing turns and the team confirms the new model is producing longer responses. The fix is a system-prompt brevity constraint. Re-run after the prompt fix: is_concise drift drops to -0.03. Release proceeds.

The drift gate caught a regression that would have shipped to production. The drift gate paid for itself.

Error Localization in CI failures

When the regression test fails on a 12-turn conversation, the engineer needs more than “this conversation failed.” Error Localization pinpoints the exact turn responsible.

The output of a localized failure looks like this:

Failure: conversation golden_billing_refund_v7
Failing turn: 7 of 12
Caller: "Can you process a refund without the receipt? I lost it."
Agent: "Sure, I'll process that refund now. Confirming $89.99 back to your card."
Failing rubric: task_completion
Reason: Agent processed a refund without policy-required receipt or supervisor approval.
Tool calls: refund_process(amount=89.99, account_id=...) executed without prior verify_receipt() or escalate_to_supervisor().
Suggested fix: Add receipt verification step before refund_process tool call.

The CI log surfaces the turn, the input, the agent’s response, the failing rubric, the reason, the tool calls executed, and a suggested fix. The engineer fixes the system prompt or the tool-call sequence directly, instead of replaying the conversation by hand to figure out which turn went wrong.
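To make the idea concrete, the sketch below reduces per-turn verdicts to a first-failing-turn report in the same shape as the log above. It is an illustration of the concept only, not how Error Localization is implemented; the `TurnVerdict` type and both functions are hypothetical, and it assumes turns arrive in conversation order with rubric failures already attributed to a turn.

```python
from dataclasses import dataclass


@dataclass
class TurnVerdict:
    index: int  # 1-based turn number
    caller: str
    agent: str
    failed_rubrics: tuple[str, ...] = ()
    reason: str = ""


def first_failing_turn(turns):
    """Return the earliest turn with any failed rubric, or None if all turns passed."""
    return next((t for t in turns if t.failed_rubrics), None)


def format_failure(conversation_id, turns):
    """Render a CI log entry for the first failing turn of one conversation."""
    turn = first_failing_turn(turns)
    if turn is None:
        return f"Pass: conversation {conversation_id}"
    return "\n".join([
        f"Failure: conversation {conversation_id}",
        f"Failing turn: {turn.index} of {len(turns)}",
        f'Caller: "{turn.caller}"',
        f'Agent: "{turn.agent}"',
        f"Failing rubric: {', '.join(turn.failed_rubrics)}",
        f"Reason: {turn.reason}",
    ])


# Illustrative 12-turn conversation where only turn 7 fails.
turns = [TurnVerdict(i, f"caller turn {i}", f"agent turn {i}") for i in range(1, 13)]
turns[6] = TurnVerdict(
    7,
    "Can you process a refund without the receipt? I lost it.",
    "Sure, I'll process that refund now.",
    failed_rubrics=("task_completion",),
    reason="Refund processed without receipt verification.",
)
report = format_failure("golden_billing_refund_v7", turns)
```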

Error Localization shipped in late 2025 in FAGI’s Simulate product. In CI it reduces manual replay work by pointing engineers to the failing turn, input, response, and rubric.

Three deliberate tradeoffs

Golden-set curation is human work. New conversations added on every shipped feature, stale conversations replaced when product behavior changes, redundant conversations pruned. Plan for 10 to 15% of testing effort on suite maintenance. The compounding payoff is real: six months in, the suite is the highest-signal artifact the team owns about the agent’s intended behavior, and Error Feed feeds new candidate conversations from production clusters automatically.

Adversarial coverage extends from production clusters. New attack patterns become new adversarial personas through the Error Feed cluster surface; failing patterns from production flow into the simulation library as candidate scenarios. The translation step is human-approved by design (regulated workloads want a reviewer accept on every new red-team archetype). A new attack pattern in production lands as a new persona in the next release candidate.

Production-derived testing activates with production traffic. Pre-launch agents bootstrap on Layer 1 plus Layer 2 alone; Layer 3 activates once traffic accumulates (typically 30 to 60 days post-launch). The Vapi / Retell / LiveKit native voice observability captures calls from day one, so Layer 3 is ready the moment the trace volume crosses the sampling threshold. The waiting period does not block the launch.

How FAGI ships Three-Layer Testing built-in

Coval publishes a Three-Layer framework for voice AI testing (regression on golden conversations, adversarial on red-team personas, production-derived on sampled real calls). The pattern is well-known in voice-AI QA. FAGI’s Workflow Builder ships Three-Layer Testing as the default flow inside a unified eval plus observability plus simulation plus guardrail platform, with the simulation library sharing data with the trace store, eval engine, and Error Feed in the same project.

The full FAGI simulation surface that powers the CI loop:

  • 18 pre-built personas plus unlimited custom-authored ones (gender, age range across 18-25 / 25-32 / 32-40 / 40-50 / 50-60 / 60+, location across US / Canada / UK / Australia / India, personality traits, communication style, accent, conversation speed, background noise, multilingual across many popular languages, custom properties, free-form behavioral instructions).
  • Visual Workflow Builder with drag-and-drop graph (Conversation / End Call / Transfer Call nodes).
  • Auto-generate scenarios at 20, 50, or 100 rows with branch visibility.
  • Dataset scenarios via CSV / JSON / Excel upload plus synthetic generation.
  • 4-step Run Tests wizard (config → scenarios → eval → execute).
  • Error Localization that pinpoints the exact failing turn.
  • Tool Calling eval.
  • Programmatic eval API for configure plus re-run.
  • Custom voices from ElevenLabs and Cartesia in Run Prompt.
  • Indian phone number simulation.
  • Show Reasoning column in Simulate.

The simulation library doesn’t round-trip data to a separate observability or eval product. Production calls flow into the simulation library natively via Vapi / Retell / LiveKit dashboard integration; failures cluster in Error Feed and surface as candidate scenarios automatically. That’s the closed loop the Three-Layer pattern needs to stay continuous across releases.

Future AGI on voice regression in CI/CD

Simulate is the test execution surface for all three layers. 18 pre-built personas plus custom-persona authoring with controls for gender, age, location, accent, communication style, background noise, and multilingual. Workflow Builder auto-generates branching scenarios at 20, 50, or 100 rows with branch visibility for coverage. The 4-step Run Tests wizard executes the test matrix. Error Localization pinpoints the failing turn. The programmatic eval API (Configure and Re-run Evaluations) wires the whole thing into GitHub Actions or GitLab CI.

ai-evaluation ships 70+ built-in eval templates and is licensed under Apache 2.0. The CI deploy-gate package: task_completion, conversation_resolution, audio_quality, audio_transcription, is_polite, is_helpful, is_concise. Multi-turn dialogs use ConversationalTestCase. Audio rubrics use MLLMAudio with seven supported formats (.mp3, .wav, .ogg, .m4a, .aac, .flac, .wma). Custom voice evaluators are authored in product by the evaluator-authoring agent, which calibrates from human review feedback. When the failing set is prompt-sensitive, agent-opt runs 6 optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) against the failed scenarios from the Dataset UI or the Python SDK to propose prompt candidates for human review.

traceAI is the source for Layer 3 production-derived testing. Captures instrumented production interactions as OpenInference-compatible spans. For Vapi, Retell, and LiveKit, native voice observability adds no-SDK call logs, transcripts, and separate assistant/customer recordings through provider API key plus Assistant ID. 30+ documented integrations across Python and TypeScript including dedicated traceAI-pipecat and traceai-livekit packages. Apache 2.0.

Future AGI Protect reduces the adversarial-layer failure rate by handling prompt injection and policy violation at the safety layer. Runs sub-100ms inline on Gemma 3n foundation with LoRA-trained adapters per safety dimension per arXiv 2510.13351. Multi-modal across text, image, and audio. ProtectFlash for single-call binary classification when the rule-based scan time is too expensive.

Error Feed auto-clusters failures from all three layers into named issues with auto-written root cause, quick fix, and long-term recommendation. The same cluster surface spans regression failures, adversarial failures, and production-derived drift.

Agent Command Center hosts the whole CI eval pipeline with RBAC and, per the trust page, is SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified. It is available through AWS Marketplace, multi-region hosted, with BYOC for regulated workloads.

The pipeline is the workflow. The products above are the implementation.

A worked CI rollout: 30-day plan

A 30-day rollout for a customer support voice agent moving from manual testing to CI-gated regression.

Week 1: foundation.

  • Author the Agent Definition. Name, behavior, capabilities, constraints, tool list.
  • Build the persona library: 8 custom personas plus 6 pre-built. Cover the routine intents.
  • Author 60 golden conversations through Workflow Builder. Use auto-generate to bootstrap, then hand-curate.

Week 2: Layer 1 in PR CI.

  • Write the GitHub Actions workflow with the regression job.
  • Attach the 5-rubric package: task_completion, conversation_resolution, is_polite, is_helpful, is_concise.
  • Set the gate at 95% pass rate on task_completion and conversation_resolution.
  • First baseline run on the main branch: 89%. Engineers fix the failures over three days. Re-run: 96%. Gate clears.
  • P95 latency gate at 1.2 seconds. First baseline: 1.45 seconds. Engineers tune the system prompt and remove a redundant tool call. Re-run: 1.08 seconds. Gate clears.

Week 3: Layer 2 in nightly cron.

  • Author 8 adversarial archetype scenarios. Auto-generate 50 rows per archetype.
  • Custom policy_preservation rubric authored by the in-product agent.
  • First adversarial run: prompt-injector at 73%, social-engineer at 81%. Below gate.
  • Engineer adds a Protect ruleset for prompt injection plus tightens the auth flow in the system prompt. Re-run: prompt-injector at 94%, social-engineer at 92%. Gate clears.

Week 4: Layer 3 wiring.

  • traceAI has been capturing every production call since Week 1.
  • Author the production-derived sampling script: random sample 1,000 calls from the prior 14 days, stratified by intent tag.
  • Replay through the current production agent. Pin the baseline.
  • First drift run with no model change shows under 1% drift across all rubrics. The pipeline is healthy.
  • Expose the production-derived job through workflow_dispatch so it can be triggered for tagged releases.

By Day 30, the team has a complete three-layer CI eval pipeline. Layer 1 runs on every PR with a 15-minute feedback loop. Layer 2 runs nightly with a 2-hour wall-clock. Layer 3 runs per tagged release with a 4-hour wall-clock. The team has shipped 8 PRs in 30 days. CI caught regressions on 3 of them. None of the regressions reached production.

The CI eval pipeline pays for itself on the first production incident it prevents.

Frequently asked questions

What is voice agent regression testing in CI/CD?
Voice agent regression testing in CI/CD is the practice of running a curated set of multi-turn voice conversations on every pull request and release candidate, then gating the deploy on rubric pass rates. The pattern is built from three layers: regression on golden conversations (50-200 hand-curated must-pass dialogues), adversarial on red-team personas, and production-derived on sampled real calls. Each layer attaches to a different CI stage. Tests run via a programmatic eval API. Pass rates and drift scores become deploy gates that block bad releases before they reach production.
How many golden conversations should I curate?
50 to 200 hand-curated multi-turn dialogues is the sweet spot. Fewer than 50 misses coverage on common intents. More than 200 slows the test cycle past the CI feedback loop, which kills developer adherence. Each golden conversation pairs a persona, an initial utterance, a turn-by-turn expected flow, an expected outcome, and a rubric package. The set is a living artifact: add a golden conversation for every shipped capability, replace stale conversations when product behavior changes, and resist the urge to accumulate every test case the team can think of.
Which evals should gate the deploy?
Three rubrics and one latency metric carry most of the deploy-gate weight. `task_completion` confirms the agent finished the requested job. `conversation_resolution` confirms the multi-turn conversation closed cleanly. `audio_quality` catches TTS drift on the new model or prompt. P95 turn latency catches the silent latency regression that no rubric surfaces. Below 95% on `task_completion` and `conversation_resolution`, or above your latency budget on P95, the PR or release candidate is blocked. The gates are conservative on purpose.
What is the Coval three-layer pattern and how does FAGI implement it?
Coval publishes a Three-Layer framework for voice AI testing: regression on golden conversations, adversarial on red-team personas, and production-derived on sampled real calls. Coval announced a $3.3M round to bring simulation-style evaluation to AI voice and chat agents, and the framework has become widely referenced shorthand. FAGI ships the same three layers built in, inside a broader eval plus observability platform. The 18 pre-built personas plus custom-persona authoring cover regression and adversarial. Workflow Builder auto-generates branching scenarios (20, 50, or 100 rows per layer). The programmatic eval API wires the whole thing into CI. The same rubric engine scores all three layers.
How does drift detection work between model versions?
Drift detection re-runs the same eval set against the new model or prompt and compares scores against the prior baseline. The signal is per-rubric delta: if `is_concise` was 0.91 on the prior model and is 0.86 on the new model, that's a meaningful regression even if outcomes look the same. Drift detection runs on the production-derived layer (sampled real calls), so the signal reflects actual production traffic distribution. Below 5% drift on the rubric package is the standard release gate. Above 5%, the release is blocked or canaried with a smaller cohort while engineers iterate.
How does Error Localization help in CI failures?
Error Localization pinpoints the exact turn responsible for a failure rather than reporting a binary pass/fail on the full conversation. When the regression test fails on a 12-turn conversation, Error Localization says 'turn 7 produced a non-compliant refund offer' rather than 'conversation failed'. CI logs the specific turn, the input, the agent's response, and the rubric that fired. Engineers fix the turn directly instead of replaying the conversation manually. Error Localization shipped in late 2025 as part of FAGI's Simulate product.
Can the eval run on real audio rather than transcripts?
Yes. The `MLLMAudio` test case accepts seven audio formats: .mp3, .wav, .ogg, .m4a, .aac, .flac, and .wma. It loads from URL or local path with auto-base64 encoding for inline transport. Audio-specific rubrics like `audio_transcription` score STT quality directly on the audio, and `audio_quality` scores TTS output quality on the rendered audio. Most teams run a mixed pipeline: transcript-based rubrics for semantic checks and audio-based rubrics for ASR and TTS quality.