Diff View in Experiments, Audio Across the Platform, and Run-Insight Views
Compare two experiment runs side by side, play audio inline in traces and datasets, and see every evaluation run at a glance with insight summaries.
Diff View in Experiments — See Exactly What Each Change Moved

Iterating on prompts, models, or agent configurations is only useful if you can measure the difference between two runs. Until now, comparing experiments meant exporting results and lining them up in a spreadsheet. Diff view puts that comparison directly inside the experiment page.
What’s new
- Side-by-side experiment comparison. Place two experiment configurations (each a run of your prompt, model, or agent against a dataset of test cases) next to each other with one click.
- Per-test-case highlights. Output differences, score deltas, and configuration changes are marked inline for each test case.
- Score summaries at the top. See aggregate score movement before drilling into individual test cases.
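To make the two levels of comparison concrete, here is a minimal Python sketch of the idea behind a diff: per-test-case score deltas plus an aggregate summary. It is an illustration only, not the platform's API. The `diff_runs` function, the dictionary layout, and the sample data are all hypothetical.

```python
from statistics import mean

def diff_runs(baseline, candidate):
    """Compare two runs keyed by test-case id (hypothetical data layout).

    Each run maps test_case_id -> {"output": str, "score": float}.
    Only test cases present in both runs are compared.
    """
    shared_ids = sorted(baseline.keys() & candidate.keys())
    rows = []
    for case_id in shared_ids:
        b, c = baseline[case_id], candidate[case_id]
        rows.append({
            "case_id": case_id,
            "output_changed": b["output"] != c["output"],
            "score_delta": round(c["score"] - b["score"], 4),
        })
    # Aggregate view: what a summary row at the top might contain.
    summary = {
        "cases_compared": len(rows),
        "outputs_changed": sum(r["output_changed"] for r in rows),
        "mean_score_delta": round(mean(r["score_delta"] for r in rows), 4) if rows else 0.0,
    }
    return summary, rows

# Made-up sample data for two runs against the same test set.
baseline = {
    "t1": {"output": "Paris", "score": 1.0},
    "t2": {"output": "Lyon", "score": 0.0},
}
candidate = {
    "t1": {"output": "Paris", "score": 1.0},
    "t2": {"output": "Marseille", "score": 0.5},
}

summary, rows = diff_runs(baseline, candidate)
print(summary)
for row in rows:
    print(row)
```

The summary is computed from the same rows that feed the per-case listing, so the aggregate figure and the row-level deltas cannot disagree.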
Why it matters
Prompt tuning and model switching both produce small per-call differences that add up to a meaningful aggregate change. Diff view surfaces both — the aggregate story and the per-row specifics — in one screen.
Who it’s for
Prompt engineers and AI practitioners iterating on prompts, and ML/AI engineers benchmarking different model configurations against the same test set.
Audio Support Across Observe and Datasets
Voice agent teams have been asking for one thing repeatedly: the ability to listen to audio without leaving the platform. This release ships it.
What’s new
- Inline audio playback in trace views. Any trace span that contains audio renders with a player inline — click play, no download needed.
- Waveform previews in dataset tables. Audio cells in datasets show a waveform preview, and the full audio plays inline.
- Audio works like any other data. Audio spans support the same filters, actions, and workflows as text spans anywhere in the platform.
Why it matters
Debugging voice agents used to mean downloading WAV files and playing them in a separate app. Audio-in-platform removes that friction — investigation and playback happen in the same screen.
Who it’s for
Teams shipping voice agents into production, and quality assurance (QA) teams reviewing voice agent output either in traces or in evaluation datasets.
Additional Improvements
Run insight views. Every evaluation run now opens to a summary: pass/fail distribution, score trends across runs, and outliers called out automatically. Less clicking, more at-a-glance reading.
Rate-limit error UX and alerts. Clearer error messages when an API call hits a rate limit, plus in-app alerts so rate-limit failures don’t go unnoticed in the logs.
Knowledge base tutorial video. The Prototype V2 knowledge base (shipped in w16) now has an embedded 3-minute walkthrough video in the UI for new users.
Compare run evaluation revamp. Refined layout for comparing evaluation runs with cleaner score display and easier navigation.
Prototype Observe filters. The same filter vocabulary used in Observe now works on Prototype runs — consistent behavior across the two surfaces.
Synthetic data UI polish. The synthetic data generator (where you generate test cases for evaluations) picks up better progress indicators and batch controls for large jobs.