Sep 16 – Sep 29, 2025 (2025 W40)

Voice Observability for Vapi, Retell, and ElevenLabs; Eval Groups in Experiments; Simulate via SDK

Observability ships for three voice platforms at once. Evaluation groups integrate with experiments and optimization. Call Simulation can now be triggered from the SDK.

Monitor Evaluate SDK Simulate Platform

3 Voice platforms supported

60-70% Cost reduction via SDK simulation

What's in this digest

Monitor (New): Voice Observability for Vapi, Retell, and ElevenLabs
Evaluate (New): Eval groups in experiments and optimization
SDK (New): Simulate via SDK
Simulate (Improved): Selective test rerun in Simulate
Evaluate (Improved): Default eval groups
Simulate (Improved): Advanced simulation management
Platform (Improved): Workbench revamp
SDK (Improved): traceAI session support
Platform (Fixed): Agent definition design changes

Voice Observability for Vapi, Retell, and ElevenLabs

Voice platforms like Vapi, Retell, and ElevenLabs have made building voice agents radically easier. But observability for voice agents has lagged far behind what text agents get. You could build a sophisticated voice agent and have almost no visibility into how it actually performs in production.

Voice observability now spans all three platforms at once.

What’s new

Three voice platforms supported at once. Vapi, Retell, and ElevenLabs each get a dedicated adapter that pulls call data into Future AGI as traces (the end-to-end records of how your agent handled each request).
Call-level metrics (duration, latency, turn count, interruption frequency, silence gaps) on every call.
Utterance-level analysis. Each agent response in the transcript is a span (an individual step inside a trace) you can run evaluations against.
Native to the existing trace surface. Voice calls become traces in the same Observe view that holds your text agents, with the same filtering, alerts, and evaluations.
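
To make the call-level metrics above concrete, here is a minimal sketch of how duration, turn count, interruption frequency, and silence gaps could be derived from a timestamped transcript. The `Utterance` shape, the `call_metrics` helper, and the one-second silence threshold are assumptions for illustration, not the adapters' actual schema or definitions.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str  # "agent" or "user" (hypothetical labels)
    start: float  # seconds from the start of the call
    end: float


def call_metrics(utterances, silence_threshold=1.0):
    """Derive call-level metrics from a list of utterances."""
    ordered = sorted(utterances, key=lambda u: u.start)
    if not ordered:
        return {"duration": 0.0, "turn_count": 0, "interruptions": 0, "silence_gaps": 0}

    duration = max(u.end for u in ordered) - ordered[0].start
    turn_count = 1
    interruptions = 0
    silence_gaps = 0
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.speaker != prev.speaker:
            turn_count += 1
            # The new speaker started before the previous one finished.
            if cur.start < prev.end:
                interruptions += 1
        # Nobody spoke for at least `silence_threshold` seconds.
        if cur.start - prev.end >= silence_threshold:
            silence_gaps += 1

    return {
        "duration": duration,
        "turn_count": turn_count,
        "interruptions": interruptions,
        "silence_gaps": silence_gaps,
    }


example = [
    Utterance("agent", 0.0, 2.5),
    Utterance("user", 2.0, 4.0),   # overlaps the agent: one interruption
    Utterance("agent", 6.0, 8.0),  # two seconds of silence before this turn
]
metrics = call_metrics(example)
print(metrics)
```

Latency is left out of the sketch because its definition (for example, time from the end of a user turn to the start of the agent's reply) depends on how each provider reports timing.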

Why it matters

Voice observability used to mean either a platform-specific dashboard per voice provider or no observability at all. All three providers now sit inside one unified view, the same view your text agents already use.

Who it’s for

Teams building voice agents on Vapi, Retell, or ElevenLabs who need production observability at the same level they expect for text. It also serves quality assurance (QA) and MLOps teams that run voice agents in production and need to correlate voice behavior with the rest of the agent stack.

Eval Groups in Experiments and Optimization

Evaluation groups (sets of related evaluations run together) are now wired into experiments and agent optimization, so you can optimize against the group’s combined score directly. Full create / read / update / delete support is in place.

What’s new

Run eval groups inside experiments. Select a group as the evaluation configuration for an experiment (a run of your prompt or model against a dataset of test cases), and every test case gets scored against every evaluation in the group.
Optimize against group-level scores. Agent and prompt optimization workflows can target a group’s aggregate score instead of a single metric.
Full create/read/update/delete on groups. Manage your evaluation library programmatically.
History tracking on every group edit.
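
As a rough illustration of scoring every test case against every evaluation in a group, the sketch below runs two toy evaluators and rolls them up into one number. The evaluator names, the test-case shape, and the unweighted mean used as the aggregate are all assumptions; the changelog does not say how the platform combines scores.

```python
from statistics import mean


def score_with_group(test_cases, eval_group):
    """Score every test case against every evaluation in the group."""
    rows = []
    for case in test_cases:
        scores = {name: evaluate(case) for name, evaluate in eval_group.items()}
        rows.append({
            "case": case["id"],
            "scores": scores,
            # Assumption: the per-case aggregate is an unweighted mean.
            "aggregate": mean(list(scores.values())),
        })
    return rows


def group_score(rows):
    """Collapse per-case aggregates into one number an optimizer could target."""
    return mean([row["aggregate"] for row in rows])


# Hypothetical evaluators: each maps a test case to a score in [0, 1].
eval_group = {
    "non_empty": lambda case: 1.0 if case["output"].strip() else 0.0,
    "concise": lambda case: 1.0 if len(case["output"].split()) <= 5 else 0.0,
}

test_cases = [
    {"id": "t1", "output": "Your refund is on its way."},  # six words, not concise
    {"id": "t2", "output": "Done."},
]

rows = score_with_group(test_cases, eval_group)
overall = group_score(rows)
print(rows, overall)
```

Targeting `overall` rather than any single entry in `scores` is the difference between optimizing for the group and optimizing for one metric.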

Why it matters

Single metrics tell you about one quality dimension. Groups of metrics tell you whether your agent is actually good at its job. With groups integrated into experiments and optimization, you can optimize your agent for the behavior you actually care about instead of for one proxy score.

Who it’s for

ML and AI engineers building agent evaluation suites, and product teams optimizing prompts or agents against multi-dimensional quality criteria rather than a single score.

Simulate via SDK

Call Simulation, previously available only from the dashboard, is now programmable.

What’s new

Trigger simulation runs from code. Your SDK code can spin up a full simulation against your Vapi- or Retell-backed voice agents on demand.
CI/CD-ready. Gate deployments on voice quality metrics. Run nightly regression suites against production agents.
SDK handles the plumbing. Connection management, audio streaming, and result collection are all abstracted.
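
One way to picture the deployment gate described above is a small script that turns simulation results into an exit code. This is only a sketch: `fake_run_simulation` is a hypothetical stand-in for the real SDK call (whose name and signature are not given here), and the result shape and 95% threshold are assumptions.

```python
def gate(results, min_pass_rate=0.95):
    """Return (ok, pass_rate) for a list of per-scenario results."""
    if not results:
        # An empty run should never let a deployment through.
        return False, 0.0
    passed = sum(1 for r in results if r["passed"])
    rate = passed / len(results)
    return rate >= min_pass_rate, rate


def fake_run_simulation():
    # Hypothetical stand-in for the real SDK call, which is not shown here.
    # It pretends 18 of 20 scenarios passed.
    return [{"scenario_id": i, "passed": i >= 2} for i in range(20)]


def main(results):
    ok, rate = gate(results)
    print(f"pass rate: {rate:.0%} ({'ok' if ok else 'below threshold'})")
    return 0 if ok else 1


# In a CI script this value would be handed to sys.exit() to fail the job.
exit_code = main(fake_run_simulation())
```

A nightly regression job could reuse the same `gate` function with a different threshold.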

Why it matters

Dashboard simulation is great for exploratory work. SDK simulation is what turns voice testing into a real part of the CI/CD pipeline.

Who it’s for

Developers building voice agents on Vapi or Retell, and QA teams automating regression testing in continuous integration pipelines.

Simulation Management Improvements

Selective test rerun. When a 100-scenario suite completes with 8 failures, rerun only those 8, not all 100. Useful when debugging whether a specific fix resolved specific failures.
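
The idea behind a selective rerun can be sketched in a few lines: pick out the failed scenario IDs, rerun only those, and overlay the new results on the original run. The result shape (`scenario_id`, `status`) and both helper names are assumptions for illustration, and the rerun itself is simulated here rather than executed.

```python
def failed_scenarios(results):
    """Return the IDs of scenarios whose latest status is 'failed'."""
    return [r["scenario_id"] for r in results if r["status"] == "failed"]


def merge_rerun(original, rerun):
    """Overlay rerun results onto the original run, keyed by scenario ID."""
    latest = {r["scenario_id"]: r for r in original}
    latest.update({r["scenario_id"]: r for r in rerun})
    return list(latest.values())


# A 100-scenario suite in which every 13th scenario failed (8 failures).
results = [
    {"scenario_id": i, "status": "failed" if i % 13 == 0 else "passed"}
    for i in range(100)
]
to_rerun = failed_scenarios(results)

# Simulated rerun: the fix resolved everything except scenario 91.
rerun_results = [
    {"scenario_id": sid, "status": "failed" if sid == 91 else "passed"}
    for sid in to_rerun
]
merged = merge_rerun(results, rerun_results)
print(len(to_rerun), failed_scenarios(merged))
```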

Auto-refresh and stop controls. The simulation dashboard auto-refreshes as test cases complete. The stop-simulation control lets you halt a running suite when early results already show what you need.

Visual workflow trace. The execution path of each scenario renders as a graph, so you see which branches were tested and which weren’t.
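
The "which branches were tested" question reduces to edge coverage on the workflow graph. A minimal sketch, assuming the workflow is an adjacency mapping and each scenario's execution path is an ordered list of node names (both shapes and the node names are hypothetical):

```python
def edge_coverage(graph, paths):
    """Split the workflow's edges into (tested, untested) sets."""
    all_edges = {(src, dst) for src, dsts in graph.items() for dst in dsts}
    visited = {edge for path in paths for edge in zip(path, path[1:])}
    return all_edges & visited, all_edges - visited


# Hypothetical support-agent workflow with one branch after verification.
graph = {
    "greet": ["verify"],
    "verify": ["refund", "escalate"],
    "refund": ["close"],
    "escalate": ["close"],
    "close": [],
}

# Only the refund branch was exercised by the scenarios that ran.
paths = [["greet", "verify", "refund", "close"]]

tested, untested = edge_coverage(graph, paths)
print(sorted(untested))
```

Here the escalation branch shows up as untested, which is the gap the rendered graph makes visible at a glance.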

Platform and SDK

Default eval groups. Pre-built groups for retrieval-augmented generation (RAG), computer vision, and conversational AI use cases, so new users have a working starting point.

Workbench revamp. A slide-out code drawer puts the prompt code alongside the run, and the new header keeps the run controls in view without scrolling. Iterating on a prompt stays a one-screen activity.

Agent definition design. The configuration page is restructured so prompts, tools, and policies stay grouped as the agent grows, which keeps the definition easy to navigate once it has more than a handful of nodes.

traceAI session support. The session.id attribute is now built into every traceAI instrumentor, so session-level grouping works automatically without custom instrumentation.
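
To show what session-level grouping amounts to, the sketch below buckets spans by their `session.id` attribute. Plain dictionaries stand in for exported spans, and the span names and `group_by_session` helper are hypothetical; only the `session.id` attribute key comes from the entry above.

```python
from collections import defaultdict


def group_by_session(spans):
    """Bucket span names by the value of their session.id attribute."""
    sessions = defaultdict(list)
    for span in spans:
        # Spans without the attribute land under the key None.
        session_id = span["attributes"].get("session.id")
        sessions[session_id].append(span["name"])
    return dict(sessions)


# Hypothetical exported spans, simplified to name plus attributes.
spans = [
    {"name": "llm.call", "attributes": {"session.id": "s-1"}},
    {"name": "tool.lookup", "attributes": {"session.id": "s-1"}},
    {"name": "llm.call", "attributes": {"session.id": "s-2"}},
]

sessions = group_by_session(spans)
print(sessions)
```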
