What's in this digest
Prototype V2 — Built for Serious Prompt Engineering
The original Prototype was a starting point. V2 is a leap forward. We rebuilt the entire experience from the ground up to support the workflows that power users actually need: rapid prompt iteration backed by a structured knowledge base.
The new Prototype V2 introduces an integrated knowledge base UI that lets you attach reference documents, few-shot examples, and context files directly to your prototyping sessions. No more switching tabs or copy-pasting from external docs. Everything your prompt needs lives in one place.
We also shipped a tutorial video embedded directly in the Prototype interface. New users can watch a 3-minute walkthrough and start building immediately, while experienced users can skip straight to their workspace.
The biggest workflow improvement: you can now add prompt-response pairs directly to datasets from within Prototype. Found a great example during iteration? Push it to your eval dataset in one click. This closes the loop between experimentation and systematic evaluation.
Audio Evaluations — Conversational Completeness
Voice agents are everywhere, and until now, evaluating them meant transcribing audio and running text-based evals. That misses everything that makes audio interactions unique: tone, pacing, interruptions, and conversational flow.
Our new audio evaluation suite introduces conversational completeness metrics designed specifically for spoken interactions. Measure whether your voice agent actually addressed the user’s full request, handled multi-turn dialogue naturally, and maintained coherent conversation structure.
The audio eval pipeline works natively with audio files — no transcription step required. Upload audio samples to your datasets and run evaluations directly against the waveform data.
Dataset Improvements
Datasets got three quality-of-life upgrades that make managing evaluation data faster: a diff view, full-text search, and inline audio playback.
Diff view for dataset comparison lets you place two dataset versions side by side and see exactly what changed. Added rows are highlighted in green, removed rows in red, and modified cells show inline diffs. This is essential when you are iterating on your test suite and need to understand how changes propagate.
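To make the added, removed, and modified categories concrete, here is a minimal sketch of a row-level diff between two dataset versions. It is an illustration only, not the platform's implementation: the function name `diff_datasets`, the list-of-dicts row shape, and the `id` key column are all assumptions made for the example.

```python
from typing import Any

Row = dict[str, Any]


def diff_datasets(old: list[Row], new: list[Row], key: str = "id") -> dict[str, Any]:
    """Compare two dataset versions, matching rows on a shared key column.

    Hypothetical helper: assumes every row carries a unique value under `key`.
    """
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}

    # Rows present only in the new version (shown in green in a diff view).
    added = [row for k, row in new_by_key.items() if k not in old_by_key]
    # Rows present only in the old version (shown in red).
    removed = [row for k, row in old_by_key.items() if k not in new_by_key]

    # Rows present in both versions: record each changed cell as (before, after).
    modified: dict[Any, dict[str, tuple[Any, Any]]] = {}
    for k in old_by_key.keys() & new_by_key.keys():
        before, after = old_by_key[k], new_by_key[k]
        changed = {
            col: (before.get(col), after.get(col))
            for col in sorted(before.keys() | after.keys())
            if before.get(col) != after.get(col)
        }
        if changed:
            modified[k] = changed

    return {"added": added, "removed": removed, "modified": modified}
```

Matching on a key column rather than on row position means that reordering rows does not show up as a change, which is usually what you want when reviewing edits to a test suite.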
Full-text search across dataset rows means you can find specific test cases instantly, even in datasets with thousands of entries. Search works across all columns including prompts, expected outputs, and metadata fields.
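As a rough picture of what searching across all columns means, the sketch below runs a case-insensitive substring match over every value in a row, including nested metadata. The names `search_rows` and `iter_text` and the row layout are hypothetical, and a linear scan like this is only a stand-in: a production search over thousands of rows would more likely use an index.

```python
from typing import Any, Iterator


def iter_text(value: Any) -> Iterator[str]:
    """Yield every leaf value in a row as text, descending into nested metadata."""
    if isinstance(value, dict):
        for item in value.values():
            yield from iter_text(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_text(item)
    elif value is not None:
        yield str(value)


def search_rows(rows: list[dict[str, Any]], query: str) -> list[dict[str, Any]]:
    """Return rows where any column value contains the query, ignoring case."""
    needle = query.casefold()
    return [
        row
        for row in rows
        if any(needle in text.casefold() for text in iter_text(row))
    ]
```

Note that this sketch matches on cell values only, so a query equal to a column name does not match every row.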
The new audio cell renderer brings inline playback to dataset tables. Click play on any audio cell to hear the sample without leaving the dataset view. Combined with the new audio evaluations, this creates a complete workflow for voice agent testing.
Onboarding and Platform Updates
New users now get a guided walkthrough on first login. The onboarding flow covers the three pillars of the platform — tracing, evaluation, and prototyping — and gets users to their first evaluation run in under five minutes.
Signing up with a Gmail account removes another friction point. One click, no password, and you are in.
For teams already deep in their evaluation workflows, quick filters for annotations let you slice annotation data by label, reviewer, and status directly from the panel. And the new run insight views give you visual summaries of evaluation runs with pass/fail distributions and score trends at a glance.
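For readers who want to reproduce these slices on exported data, here is a small sketch of both ideas: filtering annotations by label, reviewer, and status, and computing a pass/fail distribution for a run. The record fields (`label`, `reviewer`, `status`, `score`), the function names, and the 0.5 pass threshold are assumptions for illustration and do not describe the platform's API or scoring rules.

```python
from collections import Counter
from typing import Any, Optional


def filter_annotations(
    annotations: list[dict[str, Any]],
    label: Optional[str] = None,
    reviewer: Optional[str] = None,
    status: Optional[str] = None,
) -> list[dict[str, Any]]:
    """Keep annotations matching every filter that was supplied."""
    criteria = {"label": label, "reviewer": reviewer, "status": status}
    # Filters left as None are treated as "no constraint".
    active = {field: value for field, value in criteria.items() if value is not None}
    return [
        ann
        for ann in annotations
        if all(ann.get(field) == value for field, value in active.items())
    ]


def pass_fail_distribution(
    results: list[dict[str, Any]], threshold: float = 0.5
) -> dict[str, int]:
    """Count how many results score at or above the (assumed) pass threshold."""
    counts = Counter(
        "pass" if result["score"] >= threshold else "fail" for result in results
    )
    return {"pass": counts["pass"], "fail": counts["fail"]}
```

Combining the filters with a logical AND mirrors how stacked quick filters typically narrow a list; leaving every filter unset returns all annotations.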