Prototype V2 and Audio Evaluations
A rebuilt prompt engineering environment with a built-in knowledge base, evaluations that run on the actual audio of voice calls, and a guided walkthrough for new users.
Prototype V2 — Prompts, Knowledge Base, and Datasets in One Place

Prototype is where you iterate on prompts against sample inputs before pushing them into production. The original was a scratchpad. V2 is the full workflow: write the prompt, pull in the context it needs, iterate, then push the good examples into your evaluation dataset without leaving the tab.
What’s new
- Built-in knowledge base. Attach reference docs, few-shot examples (sample input/output pairs that guide the model), and context files to a prompting session. No more copy-pasting from a separate tool or keeping ten tabs open.
- In-app tutorial video. A 3-minute walkthrough plays directly inside Prototype. Skippable for returning users.
- One-click push to datasets. Found a prompt-response pair worth keeping? Send it straight to an evaluation dataset (the test cases your automated evaluations run against) without leaving Prototype.
Why it matters
V2 closes the gap between prompt iteration and systematic evaluation: the examples you find while iterating become test cases in the same session, with no copy-paste step in between.
Who it’s for
Prompt engineers and AI practitioners iterating on prompts. Especially useful for teams collaborating on prompts, where the person writing a prompt and the person curating the evaluation dataset aren’t always the same.
Audio Evaluations — Run Evaluations on the Actual Audio
Voice agents used to be evaluated the same way chat agents are: transcribe the call, then run text evaluations on the transcript. A transcript drops the parts of a voice interaction that often decide whether it worked: pauses, interruptions, pacing.
What’s new
- Audio data support across the platform. File uploads, format conversion, and cloud storage all handle audio natively, and multipart form-data uploads now accept audio files.
- Three audio evaluators at launch. Transcription (what was said), description (what the audio contains — speakers, tone, background), and quality assessment (clarity, noise, artifacts).
- Pairs with the new audio cell renderer. Listen to any sample inline in the dataset view while you’re reviewing results.
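For readers unfamiliar with multipart form-data, here is a minimal sketch of what an audio upload body looks like on the wire. It uses only the Python standard library and sends nothing over the network. The field name `file`, the filename `call.wav`, and both helper functions are illustrative assumptions, not the platform's actual upload API.

```python
import io
import math
import struct
import uuid
import wave


def make_test_tone(seconds: float = 0.1, rate: int = 16000, freq: float = 440.0) -> bytes:
    """Return a short mono 16-bit WAV file as bytes (a stand-in for a call recording)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(rate)
        n = int(seconds * rate)
        frames = b"".join(
            struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * i / rate)))
            for i in range(n)
        )
        w.writeframes(frames)
    return buf.getvalue()


def encode_multipart(field: str, filename: str, payload: bytes, content_type: str):
    """Encode one file as a multipart/form-data body.

    Returns (body, value for the request's Content-Type header).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


# Hypothetical field name and filename, chosen only for illustration.
wav_bytes = make_test_tone()
body, content_type_header = encode_multipart("file", "call.wav", wav_bytes, "audio/wav")
```

The audio bytes travel unmodified between the part headers and the closing boundary, which is why the same upload path can carry audio as well as text files.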
Why it matters
Voice-agent teams can now attach evaluations to the audio itself rather than to a transcript. What the agent said, what the audio contained, and how clear the audio was can each be scored independently.
Who it’s for
Teams shipping voice agents into production. Quality assurance (QA) and developer teams building on voice platforms like Vapi or Retell — anyone who needs their test suite to reflect how the agent actually sounds on a call, not just what the transcript says.
Improvements
Datasets
Diff view between versions. Put two dataset versions side by side. Added rows are green, removed rows are red, and modified cells show what changed inline. Useful when you’re iterating on a test suite and need to see what shifted between runs.
Full-text search. Search across every column — prompts, expected outputs, metadata. Fast enough to use on datasets with thousands of rows.
Audio cell renderer. Play audio samples inline in the dataset table. No more downloading the file to hear it. Pairs with the new audio evaluations.
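To make the diff view's three categories concrete, here is a minimal sketch of a row-level diff between two dataset versions. It assumes each row is a dictionary with a stable `id` column to match rows across versions; that key, the function name, and the sample rows are hypothetical, and the platform's own diff logic may work differently.

```python
from typing import Any

Row = dict[str, Any]


def diff_versions(old: list[Row], new: list[Row], key: str = "id") -> dict[str, Any]:
    """Classify rows as added, removed, or modified between two versions."""
    old_by_key = {r[key]: r for r in old}
    new_by_key = {r[key]: r for r in new}

    added = [new_by_key[k] for k in new_by_key if k not in old_by_key]
    removed = [old_by_key[k] for k in old_by_key if k not in new_by_key]

    modified: dict[Any, dict[str, tuple]] = {}
    for k in old_by_key.keys() & new_by_key.keys():
        before, after = old_by_key[k], new_by_key[k]
        # Record (old value, new value) for every cell that changed.
        changes = {
            col: (before.get(col), after.get(col))
            for col in sorted(before.keys() | after.keys())
            if before.get(col) != after.get(col)
        }
        if changes:
            modified[k] = changes

    return {"added": added, "removed": removed, "modified": modified}


# Illustrative sample data, not taken from a real dataset.
v1 = [
    {"id": 1, "prompt": "hi", "expected": "hello"},
    {"id": 2, "prompt": "bye", "expected": "goodbye"},
]
v2 = [
    {"id": 1, "prompt": "hi", "expected": "hello there"},
    {"id": 3, "prompt": "thanks", "expected": "welcome"},
]
result = diff_versions(v1, v2)
```

Here row 3 lands in `added`, row 2 in `removed`, and row 1 in `modified` with only its `expected` cell listed, which corresponds to the green, red, and inline-change display described above.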
Onboarding
Gmail signup. One click, no password required.
First-time walkthrough. New accounts now get a guided tour of the platform’s main areas: tracing (recording what your agent did at each step), evaluation, and prototyping. The aim is a first evaluation run in under five minutes.
Reviewing results
Quick filters for annotations. Filter the annotation panel (where reviewers label agent outputs) by label, reviewer, or status without leaving it.
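The three filters combine the way you would expect: an annotation stays visible only if it matches every filter that is set. A minimal sketch of that logic follows; the `Annotation` fields and the sample values are assumptions for illustration and do not reflect the platform's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    output_id: str
    label: str
    reviewer: str
    status: str


def filter_annotations(
    items: list[Annotation],
    label: Optional[str] = None,
    reviewer: Optional[str] = None,
    status: Optional[str] = None,
) -> list[Annotation]:
    """Keep annotations matching every filter that is set; None means 'any'."""
    wanted = {"label": label, "reviewer": reviewer, "status": status}
    return [
        a
        for a in items
        if all(value is None or getattr(a, name) == value for name, value in wanted.items())
    ]


# Hypothetical sample annotations.
annotations = [
    Annotation("out-1", "correct", "dana", "done"),
    Annotation("out-2", "incorrect", "dana", "pending"),
    Annotation("out-3", "incorrect", "lee", "done"),
]
dana_pending = filter_annotations(annotations, reviewer="dana", status="pending")
```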