
Apr 1 – Apr 14, 2025 (2025 W16)

Prototype V2 and Audio Evaluations

A rebuilt prompt engineering environment with a built-in knowledge base, evaluations that run on the actual audio of voice calls, and a guided walkthrough for new users.

Platform Evaluate Monitor

6 new eval templates

2x faster dataset loading

What's in this digest

Platform (New): Prototype V2 with knowledge base UI

Evaluate (New): Audio evaluations

Monitor (New): Error Localization in Prototype

Evaluate (Improved): Compare datasets with diff view; first-time user walkthrough onboarding

Monitor (Improved): Quick filters for annotations

Evaluate (Fixed): Audio cell renderer in datasets

Prototype V2 — Prompts, Knowledge Base, and Datasets in One Place


Prototype is where you iterate on prompts against sample inputs before pushing them into production. The original was a scratchpad. V2 is the full workflow: write the prompt, pull in the context it needs, iterate, then push the good examples into your evaluation dataset without leaving the tab.

What’s new

Built-in knowledge base. Attach reference docs, few-shot examples (sample input/output pairs that guide the model), and context files to a prompting session. No more copy-pasting from a separate tool or keeping ten tabs open.
In-app tutorial video. A 3-minute walkthrough plays directly inside Prototype. Skippable for returning users.
One-click push to datasets. Found a prompt-response pair worth keeping? Send it straight to an evaluation dataset (the test cases your automated evaluations run against) without leaving Prototype.

Why it matters

V2 closes the gap between prompt iteration and systematic evaluation: a good example found while iterating goes straight into your evaluation dataset, with no copy-paste step in between.

Who it’s for

Prompt engineers and AI practitioners iterating on prompts. Especially useful for teams collaborating on prompts, where the person writing a prompt and the person curating the evaluation dataset aren’t always the same.


Audio Evaluations — Run Evaluations on the Actual Audio

Voice agents used to be evaluated the same way chat agents are: transcribe the call, then run text evaluations on the transcript. That approach drops the parts of a voice interaction that often decide whether it worked: pauses, interruptions, pacing.

What’s new

Audio data support across the platform. File uploads, format conversion, and cloud storage all handle audio natively, and multipart form-data handling now extends to audio files.
Three audio evaluators at launch. Transcription (what was said), description (what the audio contains — speakers, tone, background), and quality assessment (clarity, noise, artifacts).
Pairs with the new audio cell renderer. Listen to any sample inline in the dataset view while you’re reviewing results.
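
For teams scripting uploads, here is a rough Python sketch of what a multipart form-data body carrying an audio file looks like. It only builds the body locally and sends nothing. The helper `build_multipart_audio`, the field name `file`, and the filename are hypothetical illustrations, not the platform's documented API; check the docs for the real endpoint and field names.

```python
import uuid


def build_multipart_audio(filename: str, audio_bytes: bytes,
                          content_type: str = "audio/wav",
                          field_name: str = "file") -> tuple[bytes, str]:
    """Build a multipart/form-data body carrying one audio file.

    Returns (body, content_type_header). The default field name "file"
    is an assumption made for this sketch.
    """
    boundary = uuid.uuid4().hex
    # Part header: names the form field, the filename, and the media type.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Closing delimiter marks the end of the multipart body.
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + audio_bytes + tail, f"multipart/form-data; boundary={boundary}"


# Placeholder bytes stand in for real audio data.
body, header = build_multipart_audio("call_001.wav", b"RIFF....WAVEfmt ")
```

The audio bytes travel unmodified between the part header and the closing boundary, which is why the same upload path can carry binary audio as well as text files.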

Why it matters

Voice-agent teams can now attach evaluations to the audio itself, not a transcript. What the agent said, what the audio contained, and how clear the audio was can each be scored independently.

Who it’s for

Teams shipping voice agents into production. Quality assurance (QA) and developer teams building on voice platforms like Vapi or Retell — anyone who needs their test suite to reflect how the agent actually sounds on a call, not just what the transcript says.

Improvements

Datasets

Diff view between versions. Put two dataset versions side by side. Added rows are green, removed rows are red, and modified cells show what changed inline. Useful when you’re iterating on a test suite and need to see what shifted between runs.
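
To make the three categories concrete, here is a minimal Python sketch of a version diff. It assumes each row has a stable ID shared across versions, which is an assumption of this example and not a statement about how the platform matches rows; `diff_versions` and the sample data are hypothetical.

```python
from typing import Any

Row = dict[str, Any]


def diff_versions(old: dict[str, Row], new: dict[str, Row]) -> dict[str, Any]:
    """Compare two dataset versions keyed by a (hypothetical) stable row ID."""
    added = sorted(new.keys() - old.keys())      # rows only in the new version
    removed = sorted(old.keys() - new.keys())    # rows only in the old version
    modified: dict[str, dict[str, tuple[Any, Any]]] = {}
    for row_id in old.keys() & new.keys():
        before, after = old[row_id], new[row_id]
        # Record (old value, new value) for every cell that differs.
        changes = {
            col: (before.get(col), after.get(col))
            for col in before.keys() | after.keys()
            if before.get(col) != after.get(col)
        }
        if changes:
            modified[row_id] = changes
    return {"added": added, "removed": removed, "modified": modified}


v1 = {"r1": {"prompt": "Hi", "expected": "Hello"},
      "r2": {"prompt": "Bye", "expected": "Goodbye"}}
v2 = {"r1": {"prompt": "Hi", "expected": "Hello there"},
      "r3": {"prompt": "Thanks", "expected": "You're welcome"}}
result = diff_versions(v1, v2)
```

In this example `r3` is added, `r2` is removed, and `r1` is modified in its `expected` cell only, which mirrors the green, red, and inline-change states of the diff view.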

Full-text search. Search across every column — prompts, expected outputs, metadata. Fast enough to use on datasets with thousands of rows.
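
As an illustration of searching every column at once, here is a naive Python sketch: a case-insensitive substring scan over all values in a row. The function `search_rows` and the sample rows are hypothetical, and a linear scan like this is only a stand-in; it says nothing about how the platform's search is implemented.

```python
def search_rows(rows: list[dict[str, object]], query: str) -> list[dict[str, object]]:
    """Return rows where any column contains the query, case-insensitively."""
    needle = query.casefold()
    return [
        row for row in rows
        # str() lets the scan cover non-string cells such as numbers.
        if any(needle in str(value).casefold() for value in row.values())
    ]


rows = [
    {"prompt": "Reset my password", "expected": "Send reset link", "metadata": "auth"},
    {"prompt": "Cancel my order", "expected": "Confirm cancellation", "metadata": "billing"},
]
matches = search_rows(rows, "RESET")
```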

Audio cell renderer. Play audio samples inline in the dataset table. No more downloading the file to hear it. Pairs with the new audio evaluations.

Onboarding

Gmail signup. One click, no password required.

First-time walkthrough. New accounts now get a guided tour of the platform’s main areas: tracing (recording what your agent did at each step), evaluation, and prototyping. The aim is a first evaluation run in under five minutes.

Reviewing results

Quick filters for annotations. Filter the annotation panel (where reviewers label agent outputs) by label, reviewer, or status without leaving it.
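
The filter logic can be pictured as combining whichever of the three criteria are set. A minimal Python sketch follows, where the `Annotation` fields, the `filter_annotations` helper, and the sample values are all hypothetical and not the platform's data model:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Annotation:
    output_id: str
    label: str
    reviewer: str
    status: str


def filter_annotations(
    annotations: Iterable[Annotation],
    label: Optional[str] = None,
    reviewer: Optional[str] = None,
    status: Optional[str] = None,
) -> list[Annotation]:
    """Keep annotations matching every filter that is set; None means 'any'."""
    wanted = {"label": label, "reviewer": reviewer, "status": status}
    active = {field: value for field, value in wanted.items() if value is not None}
    return [
        a for a in annotations
        if all(getattr(a, field) == value for field, value in active.items())
    ]


annotations = [
    Annotation("out-1", "hallucination", "dana", "open"),
    Annotation("out-2", "correct", "dana", "resolved"),
    Annotation("out-3", "hallucination", "lee", "resolved"),
]
open_for_dana = filter_annotations(annotations, reviewer="dana", status="open")
```

Leaving a filter unset matches everything for that field, so calling the function with no filters returns the full list.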
