Feb 9 – Feb 13, 2026 (2026 W6)

Simulate from Prompt Workbench

Launch simulations directly from the Prompt Workbench and annotate voice calls with structured human feedback.

Simulate Evaluate Monitor Platform Agents

5 label types for voice annotations

2x faster file processing

What's in this digest

Simulate: Simulate using Prompt Workbench (New)

Evaluate: Human annotations for voice calls (New)

Monitor: Agent health monitoring for voice agents (New)

Evaluate: Multi-image support in evaluations (Improved)

Platform: Reasoning model support (Improved)

Simulate: WebSocket simulation grid updates (Improved)

Simulate: Image and audio output support in workbench (Improved)

Platform: Azure endpoint type selector (Improved)

Agents: Workflow execution management in Agent Playground (Improved)

Platform: Read-write database split (Improved)

Platform: Polars-based file processing (Improved)

Simulate: Faster simulation results and evaluations dashboard (Improved)

Simulate Directly from the Prompt Workbench

The most common workflow in Future AGI looks like this: you write a prompt, test it manually, realize you need broader coverage, switch to Simulation, re-configure everything, and run your tests. That context switch is now gone.

Simulate from Prompt Workbench lets you add and configure simulations without ever leaving the workbench. Write your prompt, define your test variables, and launch a simulation run from the same interface. Results stream back in real time through new WebSocket-powered grid updates, so you no longer need to refresh the page to check progress. When a simulation surfaces an issue, you are already in the right place to iterate on the prompt and rerun.
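To make the streaming behavior concrete, here is a minimal sketch of how a client might fold incremental updates into a results grid as they arrive. The message shape (`row_id`, `column`, `value`) and the function name `apply_grid_update` are hypothetical and chosen for illustration; the changelog does not document the actual wire format.

```python
import json


def apply_grid_update(grid: dict, raw_message: str) -> dict:
    """Merge one streamed cell update into the in-memory results grid.

    The message fields used here are assumptions, not the platform's format.
    """
    message = json.loads(raw_message)
    row = grid.setdefault(message["row_id"], {})
    row[message["column"]] = message["value"]  # later updates overwrite earlier ones
    return grid


# Messages as they might arrive over the socket, in order.
incoming = [
    '{"row_id": "case-1", "column": "status", "value": "running"}',
    '{"row_id": "case-1", "column": "status", "value": "passed"}',
    '{"row_id": "case-2", "column": "status", "value": "running"}',
]

grid = {}
for raw in incoming:
    apply_grid_update(grid, raw)
```

Because each message carries only the cell that changed, the grid can be redrawn incrementally instead of being reloaded in full.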

This is not a simplified version of Simulation bolted onto the workbench. It is the full simulation engine, accessible from where you are already working: you can configure datasets, evaluation criteria, model parameters, and concurrency settings. The only difference is that you never leave the context of the prompt you are refining.

Human Annotations for Voice Calls

Voice agents present a unique evaluation challenge. Automated metrics catch some failures, but nuance in tone, pacing, and conversational appropriateness still requires human judgment. The new voice annotation system brings structured human feedback to voice agent transcripts with a purpose-built review interface.

Five label types cover the dimensions that matter most for voice quality: correctness, helpfulness, safety, tone, and a custom label you define for your domain. Multiple reviewers can annotate the same transcript independently, and the platform aggregates their feedback with inter-annotator agreement metrics. This gives you a clear signal on where your voice agent excels and where it needs work, grounded in human assessment rather than proxy metrics.
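The changelog does not say which inter-annotator agreement metric the platform reports. As an illustration only, Cohen's kappa is one common choice for two reviewers: it compares how often they agree against how often they would agree by chance. The sketch below assumes each reviewer assigns one categorical label per transcript; the function name and sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same items, in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty set of items")

    n = len(labels_a)
    # Fraction of items on which the two reviewers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each reviewer's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both reviewers used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical "tone" labels from two reviewers on four transcripts.
reviewer_1 = ["pass", "pass", "fail", "fail"]
reviewer_2 = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(reviewer_1, reviewer_2)  # observed 0.75, chance 0.5, kappa 0.5
```

A kappa near 1 means the reviewers agree far more often than chance would predict, while a value near 0 suggests the label definition needs tightening before the scores can be trusted.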

Agent Compass Now Monitors Voice Agents

Agent Compass, the real-time health monitoring system introduced for text-based agents, now extends to voice. Track call duration distributions, response latency percentiles, interruption rates, and conversation completion metrics. Set thresholds and receive alerts when your voice agent’s behavior drifts outside acceptable bounds.
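As a rough illustration of threshold-based alerting on latency percentiles, the sketch below computes nearest-rank percentiles over a window of response latencies and reports any that exceed a configured limit. This is not Agent Compass's implementation; the function names, the threshold format, and the sample data are all assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def find_breaches(latencies_ms, thresholds_ms):
    """Return (percentile, observed, limit) for every threshold that is exceeded."""
    breaches = []
    for pct, limit in thresholds_ms.items():
        observed = percentile(latencies_ms, pct)
        if observed > limit:
            breaches.append((pct, observed, limit))
    return breaches


# Hypothetical response latencies (ms) from a recent window of calls.
window = [100, 120, 130, 150, 200, 220, 250, 300, 900, 1500]
# Hypothetical limits: p50 under 250 ms, p95 under 1000 ms.
breaches = find_breaches(window, {50: 250, 95: 1000})
```

In this sample the median stays within its limit, but the p95 does not, which is the kind of tail degradation an average would hide.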

For teams operating voice agents at scale, this is the difference between discovering a degradation from customer complaints and catching it in the first five minutes.

Performance Infrastructure

Two backend changes improve performance across the platform. The read-write database split routes read queries separately from write operations, eliminating the contention that previously caused slowdowns during heavy evaluation runs. Teams running large-scale simulations will notice faster dashboard loads and more responsive search.
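The general idea behind a read-write split can be sketched as a small router that sends read statements to replicas and everything else to the primary. The class name, the keyword list, and the round-robin policy below are illustrative assumptions; the changelog does not describe how the platform routes queries.

```python
from itertools import cycle

# Statements treated as reads in this sketch (an assumption, not exhaustive).
READ_KEYWORDS = {"select", "show", "explain"}


class ReadWriteRouter:
    """Pick a database target for a SQL statement based on its leading keyword."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Rotate across replicas; with none configured, reads fall back to the primary.
        self._replicas = cycle(replicas) if replicas else None

    def route(self, sql):
        words = sql.split()
        keyword = words[0].lower() if words else ""
        if keyword in READ_KEYWORDS and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter("primary-db", ["replica-1", "replica-2"])
targets = [
    router.route("SELECT * FROM runs"),          # read, first replica
    router.route("select count(*) FROM evals"),  # read, second replica
    router.route("INSERT INTO runs VALUES (1)"), # write, primary
]
```

A production router also has to account for replication lag and for reads inside write transactions; this sketch ignores both.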

File processing for datasets has been rebuilt on Polars, replacing the previous pandas-based pipeline. CSV, Excel, and JSON imports now process at roughly twice the previous speed, with significantly lower memory consumption. For teams importing datasets with hundreds of thousands of rows, this turns a minutes-long wait into a background task that finishes before you switch tabs.

Multimodal Expansion

This release continues the push toward comprehensive multimodal support. Evaluations now handle multi-image inputs, letting you score outputs that reference or generate multiple images in a single response. The Prompt Workbench renders both image and audio outputs inline, so multimodal prompt iteration no longer requires external preview tools. Azure users get a dedicated endpoint type selector that correctly formats API requests for Azure-hosted models, resolving a friction point reported by several enterprise teams.

Reasoning model support brings chain-of-thought visibility to traces and evaluations. When your model produces intermediate reasoning steps, they appear as distinct spans in the trace view, making it straightforward to audit the logic path that led to any particular output.
