Iterating on prompts, models, or agent configurations is only useful if you can measure the difference between two runs. Until now, comparing experiments meant exporting results and lining them up in a spreadsheet. Diff view puts that comparison directly inside the experiment page.

What’s new

Side-by-side experiment comparison. Place two experiment configurations (each a run of your prompt, model, or agent against a dataset of test cases) next to each other with one click.
Per-test-case highlights. Output differences, score deltas, and configuration changes are marked inline for each test case.
Score summaries at the top. See aggregate score movement before drilling into individual test cases.

Why it matters

Prompt tuning and model switching both produce small per-call differences that add up to a meaningful aggregate change. Diff view shows both levels on one screen: the aggregate score movement and the per-test-case specifics behind it.
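To make the relationship between per-test-case deltas and the aggregate concrete, here is a minimal sketch of the same comparison done by hand. It is not the platform's API: the function name `diff_runs`, the `{test_case_id: score}` input shape, and the sample scores are all assumptions made for illustration.

```python
from statistics import mean


def diff_runs(baseline, candidate):
    """Compare two runs given as {test_case_id: score} mappings (hypothetical shape).

    Only test cases present in both runs are compared.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no test cases")

    # Per-test-case view: one row per shared case, with its score delta.
    rows = [
        {
            "id": case_id,
            "baseline": baseline[case_id],
            "candidate": candidate[case_id],
            "delta": candidate[case_id] - baseline[case_id],
        }
        for case_id in shared
    ]

    # Aggregate view: mean scores plus counts of improved and regressed cases.
    summary = {
        "baseline_mean": mean(r["baseline"] for r in rows),
        "candidate_mean": mean(r["candidate"] for r in rows),
        "improved": sum(1 for r in rows if r["delta"] > 0),
        "regressed": sum(1 for r in rows if r["delta"] < 0),
    }
    return summary, rows


# Illustrative scores only; not real experiment results.
baseline_run = {"case-1": 0.5, "case-2": 0.75, "case-3": 1.0}
candidate_run = {"case-1": 0.75, "case-2": 0.75, "case-3": 0.5}

summary, rows = diff_runs(baseline_run, candidate_run)
print(summary)
for row in rows:
    print(row["id"], row["delta"])
```

In this made-up example one case improves and one regresses, which is the kind of detail an aggregate number alone would hide.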

Who it’s for

Prompt engineers and AI practitioners iterating on prompts, and ML/AI engineers benchmarking different model configurations against the same test set.

Read the docs →

Audio Support Across Observe and Datasets

Voice agent teams have repeatedly asked for one thing: the ability to listen to audio without leaving the platform. This release ships it.

What’s new

Inline audio playback in trace views. Any trace span that contains audio renders with a player inline — click play, no download needed.
Waveform previews in dataset tables. Audio cells in datasets show a waveform preview, and the full audio plays inline.
Audio works like any other data. Wherever the platform shows a text span, an audio span supports the same filters, actions, and workflows.

Why it matters

Debugging voice agents used to mean downloading WAV files and playing them in a separate app. In-platform audio removes that friction: investigation and playback happen on the same screen.

Who it’s for

Teams shipping voice agents into production, and quality assurance (QA) teams reviewing voice agent output either in traces or in evaluation datasets.

Read the docs →

Additional Improvements

Run insight views. Every evaluation run now opens to a summary: pass/fail distribution, score trends across runs, and outliers called out automatically. Less clicking, more at-a-glance reading.

Rate-limit error UX and alerts. Clearer error messages when an API call hits a rate limit, plus in-app alerts so rate-limit problems don’t stay buried in the logs.

Knowledge base tutorial video. The Prototype V2 knowledge base (shipped in w16) now has an embedded 3-minute walkthrough video in the UI for new users.

Compare run evaluation revamp. Refined layout for comparing evaluation runs with cleaner score display and easier navigation.

Prototype Observe filters. The same filter vocabulary used in Observe now works on Prototype runs — consistent behavior across the two surfaces.

Synthetic data UI polish. The synthetic data generator (where you generate test cases for evaluations) picks up better progress indicators and batch controls for large jobs.
