Apr 14 – Apr 18, 2025 (2025 W16)

Prototype V2 and Audio Evaluations

A rebuilt Prototype experience with knowledge base UI, plus first-class audio evaluations and a smoother onboarding flow for new users.

Platform Evaluate Monitor

6 new eval templates

2x faster dataset loading

What's in this digest

Platform: Prototype V2 with knowledge base UI (New)

Evaluate: Audio evaluations (New)

Evaluate: Compare datasets with diff view (Improved)

Evaluate: Search in datasets (Improved)

Platform: Gmail signup option (Improved)

Platform: First-time user walkthrough onboarding (Improved)

Monitor: Quick filters for annotations (Improved)

Evaluate: Run insight views for evals (Improved)

Platform: Add to dataset from Prototype (Improved)

Evaluate: Audio cell renderer in datasets (Fixed)

Prototype V2 — Built for Serious Prompt Engineering

The original Prototype was a starting point. For V2, we rebuilt the entire experience from the ground up around the workflow that power users actually need: rapid prompt iteration backed by a structured knowledge base.

The new Prototype V2 introduces an integrated knowledge base UI that lets you attach reference documents, few-shot examples, and context files directly to your prototyping sessions. No more switching tabs or copy-pasting from external docs. Everything your prompt needs lives in one place.

We also shipped a tutorial video embedded directly in the Prototype interface. New users can watch a 3-minute walkthrough and start building immediately, while experienced users can skip straight to their workspace.

The biggest workflow improvement: you can now add prompt-response pairs directly to datasets from within Prototype. Found a great example during iteration? Push it to your eval dataset in one click. This closes the loop between experimentation and systematic evaluation.

Audio Evaluations — Conversational Completeness

Voice agents are everywhere, and until now, evaluating them meant transcribing audio and running text-based evals. A transcript drops what makes audio interactions unique: tone, pacing, interruptions, and conversational flow.

Our new audio evaluation suite introduces conversational completeness metrics designed specifically for spoken interactions. Measure whether your voice agent actually addressed the user’s full request, handled multi-turn dialogue naturally, and maintained coherent conversation structure.

The audio eval pipeline works natively with audio files — no transcription step required. Upload audio samples to your datasets and run evaluations directly against the waveform data.

Dataset Improvements

Datasets got a series of quality-of-life upgrades that make managing evaluation data significantly faster.

Diff view for dataset comparison lets you place two dataset versions side by side and see exactly what changed. Added rows are highlighted in green, removed rows in red, and modified cells show inline diffs. This is essential when you are iterating on your test suite and need to understand how changes propagate.
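To make the three change types concrete, here is a minimal sketch of a row-level diff between two dataset versions. It is an illustration only, not the platform's implementation: the `diff_datasets` function, the `id` key column, and the sample rows are all hypothetical, and it assumes every row carries a stable unique identifier.

```python
from typing import Any

Row = dict[str, Any]


def diff_datasets(old: list[Row], new: list[Row], key: str = "id") -> dict[str, Any]:
    """Compare two dataset versions, matching rows on a unique key column.

    Hypothetical helper for illustration; assumes `key` is present and
    unique in every row of both versions.
    """
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}

    # Rows present only in the new version (shown in green in a diff view).
    added = [row for k, row in new_by_key.items() if k not in old_by_key]
    # Rows present only in the old version (shown in red in a diff view).
    removed = [row for k, row in old_by_key.items() if k not in new_by_key]

    # Rows present in both versions: record each changed cell as (before, after).
    modified: dict[Any, dict[str, tuple[Any, Any]]] = {}
    for k in old_by_key.keys() & new_by_key.keys():
        before, after = old_by_key[k], new_by_key[k]
        changed = {
            col: (before.get(col), after.get(col))
            for col in before.keys() | after.keys()
            if before.get(col) != after.get(col)
        }
        if changed:
            modified[k] = changed

    return {"added": added, "removed": removed, "modified": modified}


# Made-up sample data: row 1 is edited, row 2 is removed, row 3 is added.
v1 = [
    {"id": 1, "prompt": "Greet the user", "expected": "Hello!"},
    {"id": 2, "prompt": "Say goodbye", "expected": "Bye."},
]
v2 = [
    {"id": 1, "prompt": "Greet the user", "expected": "Hi there!"},
    {"id": 3, "prompt": "Ask for a name", "expected": "What is your name?"},
]
result = diff_datasets(v1, v2)
```

Matching on a key rather than on row position keeps the diff stable when rows are reordered, which is why the sketch assumes an identifier column.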

Full-text search across dataset rows means you can find specific test cases instantly, even in datasets with thousands of entries. Search works across all columns including prompts, expected outputs, and metadata fields.
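As a rough picture of what searching across all columns means, the sketch below does a case-insensitive substring match over every cell of every row. It is an assumption-laden illustration: `search_rows` and the sample rows are hypothetical, and a production search over thousands of entries would more likely use an index than a linear scan.

```python
from typing import Any


def search_rows(rows: list[dict[str, Any]], query: str) -> list[dict[str, Any]]:
    """Return rows where any column contains the query, ignoring case.

    Hypothetical helper for illustration. Non-string cells (such as
    metadata dicts) are matched against their string representation.
    """
    needle = query.casefold()
    return [
        row
        for row in rows
        if any(needle in str(value).casefold() for value in row.values())
    ]


# Made-up sample rows with prompt, expected output, and metadata columns.
rows = [
    {"prompt": "Book a flight to Paris", "expected": "Flight booked", "metadata": {"tag": "travel"}},
    {"prompt": "Cancel my order", "expected": "Order cancelled", "metadata": {"tag": "support"}},
]
travel_matches = search_rows(rows, "TRAVEL")  # matches via the metadata column
order_matches = search_rows(rows, "order")    # matches via prompt and expected
```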

The new audio cell renderer brings inline playback to dataset tables. Click play on any audio cell to hear the sample without leaving the dataset view. Combined with the new audio evaluations, this creates a complete workflow for voice agent testing.

Onboarding and Platform Updates

New users now get a guided walkthrough on first login. The onboarding flow covers the three pillars of the platform — tracing, evaluation, and prototyping — and gets users to their first evaluation run in under five minutes.

Gmail signup removes another friction point. One click, no password, and you are in.

For teams already deep in their evaluation workflows, quick filters for annotations let you slice annotation data by label, reviewer, and status directly from the panel. And the new run insight views give you visual summaries of evaluation runs with pass/fail distributions and score trends at a glance.
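The two ideas in this paragraph can be sketched in a few lines: filtering annotations on any combination of label, reviewer, and status, and reducing a run to a pass/fail distribution. Everything named here (`Annotation`, `quick_filter`, `pass_fail_distribution`, and the sample data) is hypothetical and only illustrates the behavior described, not the product's data model or API.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """Hypothetical annotation record with the three filterable fields."""
    label: str
    reviewer: str
    status: str


def quick_filter(
    annotations: list[Annotation],
    *,
    label: Optional[str] = None,
    reviewer: Optional[str] = None,
    status: Optional[str] = None,
) -> list[Annotation]:
    """Keep annotations that match every filter that was actually supplied."""
    wanted = {"label": label, "reviewer": reviewer, "status": status}
    active = {field: value for field, value in wanted.items() if value is not None}
    return [
        a for a in annotations
        if all(getattr(a, field) == value for field, value in active.items())
    ]


def pass_fail_distribution(outcomes: list[bool]) -> dict[str, float]:
    """Return the fraction of passing and failing results in a run."""
    if not outcomes:
        return {"pass": 0.0, "fail": 0.0}
    counts = Counter(outcomes)
    total = len(outcomes)
    return {"pass": counts[True] / total, "fail": counts[False] / total}


# Made-up sample data.
annotations = [
    Annotation("hallucination", "ana", "open"),
    Annotation("hallucination", "ben", "resolved"),
    Annotation("tone", "ana", "open"),
]
open_by_ana = quick_filter(annotations, reviewer="ana", status="open")
distribution = pass_fail_distribution([True, True, False, True])
```

Leaving a filter unset means "do not filter on this field", so the same function covers single-field and combined slices.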
