Home / Changelog

Changelog

Weekly digests of everything we ship. New features, improvements, and fixes to the Future AGI platform.

May 11, 2026

W20 May 11 – May 15, 2026

Self-Host in One Command, Jinja2 Prompts, and Polish Across Evals and Observability

Self-host Future AGI in one command with pre-built images and a Windows installer. Prompts now support Jinja2 alongside Mustache. Plus reliable access to voice recordings, annotation and eval scores exported alongside traces into datasets, a Request Explorer in the AI gateway, nested JSON access in eval variables, faster session lists, and a long list of polish across Observe and evals.

Platform New

Self-Host Future AGI in One Command

Self-hosting Future AGI is one script: clone the repo, run `bin/install`, log in. The script pulls pre-built service images from Docker Hub, brings the stack up in about 30 seconds, and works on macOS, Linux, and Windows (through Git Bash, WSL, or a new native PowerShell installer). The first account is created from the command line, with no email server required.

Platform New

Prompts now support Jinja2 templates alongside Mustache

Use `{% if %}` for conditionals, `{% for %}` for loops, and Jinja2 filters to transform values inline. The existing Mustache `{{ variable }}` syntax keeps working, and a Template Format dropdown picks between the two per prompt.

~30s self-host pull-to-up

5 pre-built service images

Mustache + Jinja2 prompt templating

14 improvements across evals, observe, and voice

Monitor Improved

Reliable voice evals, no dependency on provider URLs

Future AGI now stores its own copy of every recording the moment the call arrives. Evals on voice traces run against audio you control, with no dependency on the original provider URL, so traces, replays, and eval reruns stay valid as long as you keep the trace. Storing your own copy is on by default for every voice provider and can be switched off per provider.

Evaluate Improved

Export annotations and eval scores alongside traces to datasets

Pick which annotation scores and eval results to export alongside traces when sending them to a dataset. The selected metrics arrive as dedicated columns in the dataset, so it's immediately ready for experiments with baseline scores, training data, or downstream exports — no re-running the evals.

API Improved

Filter and export AI gateway request logs

A new Request Explorer in the AI gateway logs filters every request by model, status, or metadata and exports the result. Useful for cost audits, error investigations, and compliance pulls.

Evaluate Improved

Reach into nested JSON fields from eval and prompt variables

Eval and run-prompt variables accept dotted paths into nested JSON. A single column like `payload` can feed multiple variables (`payload.user`, `payload.request.id`) without flattening the data upstream first.

Monitor Improved

Filter Observe by trace ID or span ID, plus Select all view columns

Paste a trace ID or span ID into the Observe filter bar to jump straight to that trace or span. The view-columns picker has a new Select all toggle.

Monitor Improved

Faster session list on high-volume traces

The session list now loads in seconds on workspaces ingesting millions of traces. No more waiting at a spinner to pick up a session for review.

Evaluate Improved

API columns: smoother edits, visible progress

Edits to an API column's URL, method, headers, params, or body now sync across devices and browsers. Requests run in batches, so rows fill in as each batch completes and the column's progress is visible while it runs. Param and header keys with underscores (like `api_key`) display exactly as you typed them.

Evaluate Improved

Cleaner eval columns

Inside eval column groups, Result now appears before Reason. Each eval column displays the template's current default version instead of defaulting to V1. Eval scores of exactly 0 render distinctly, and column averages refresh after you rearrange columns.

Evaluate Improved

Add traces to annotation queues from any source

Traces sent via our SDK or OpenTelemetry (OTLP) can be queued for annotation from the trace list. Dataset rows added to queues no longer get stuck loading, and CSV exports include annotator notes.

Monitor Improved

Trace attributes: long values expand, rows are easier to scan

Long string values in the trace attributes panel are click-to-expand instead of clipped. Dividers between rows make it clearer where one attribute ends and the next begins.

Monitor Fixed

Error Feed comes to voice and simulation projects

Error clusters from evals are now available on voice and simulation projects, visible across the summary, trend charts, and the trace panel. Clicking a voice trace opens the voice call panel.

Monitor Improved

Voice analytics: consistent units and a tighter default view

Latency, silence, and Time to First Word display in milliseconds across the call analytics panel and the call-logs table for at-a-glance comparison. Per-role talk-time percentages and Talk Ratio sit in the call detail panel for every transcript format, with Talk Ratio hidden by default to keep the view tight.

Platform Fixed

Workspace invites work for existing org members

Inviting an existing org member to a new workspace now sends the email and adds the workspace to their list straight away.

Platform Fixed

j and k row-navigation shortcuts no longer hijack text input

The j and k keys (used to move between rows in eval and task detail panels) now yield to focused text inputs, so the letters land in your comment instead of moving the row selection.

Read full digest

Apr 27, 2026

W18 Apr 27 – May 1, 2026

Evals Revamp, Experiment V2, Observe Revamp, and Error Feed

130+ ready-made evals that reach outside the prompt: pull live web data and call your own tools as part of scoring. Experiments now run end-to-end on agents, not just prompts. Observe gets plain-English filtering and saved views. The Error Feed clusters failures by root cause at a fraction of cost and pushes any cluster to Linear in one click.

Evaluate New

Evals Revamp: Score Against the Tools and Data Your Agent Actually Uses

Evals now reach outside the prompt: pull live web data, query your databases, and call the same tools your agent runs against in production. 130+ ready-made evals, composite scoring in one pass, full version history, and a unified picker that opens everywhere.

Evaluate New

Experiment V2: From Prompts to Agents

Experiments now run end-to-end on agents, not just prompts. Compare prompts, agents, or stacked model variants in one run, view per-node output for each agent, score against any dataset column, and edit, stop, or rerun mid-flight.

Monitor New

Observe Revamp: Plain-English Filters, Custom Views, and Imagine-with-Falcon Charts

Observe is rebuilt around how you actually debug. Filter traces in plain English, save and switch between custom views in a click, search inline across logs and previews, turn any trace into an interactive chart with Falcon, and run evals or push to datasets without leaving the trace list.

Monitor New

Error Feed: Cluster, Triage, and Push to Linear in One Click

The Error Feed clusters failing traces by root cause, updates about 5 seconds after a trace completes, and runs at roughly 10x lower cost than its previous version. Push a cluster to Linear with one click, run on-demand Deep Analysis when you need root cause, and read failing-vs-working trace evidence side by side.

130+ ready-made evals

~5s Error Feed update lag

10x lower-cost Error Feed

1-click push to Linear

Platform New

Custom pricing plans with billing dashboard v2

Sales can configure enterprise deals with custom pricing terms, and finance can generate invoices from the admin surface. The billing dashboard is rebuilt around the new model so cost breakdowns, usage trends, and per-key spend match the actual contract.

Platform Improved

Standardised search bar across list pages

Every list page in the platform now uses the same search bar, so the keyboard shortcut, the search syntax, and the result behaviour stay the same whether you're searching traces, datasets, evals, or experiments. No re-learning per surface.

Evaluate Improved

Annotator picker: search and add at scale

Search and add annotators directly from the picker, even on workspaces with a large user list. Infinite scroll loads results as you go, and server-side search returns matches instantly.

Platform Improved

Custom dashboards polish

Round of polish on Custom Dashboards: time range syncs to the widget editor, widgets resize cleanly, description fields render instantly, empty state on metric removal, filter Add button visibility, and dashboard description position.

API Improved

Agent definition API revamp

Agent definition endpoints now share one consistent payload shape across create, read, update, and delete (instead of variations per operation), so SDK and integration code can drop the per-endpoint marshalling logic and use a single type.

API Improved

Simulate analytics API

New analytics API for simulation data: pull aggregate metrics, breakdowns, and trends for any simulation run.

Monitor Improved

All Attributes JSON viewer search filter

The All Attributes JSON viewer (in the trace detail drawer) gets a search box that filters the JSON tree as you type. Useful when a trace has hundreds of attributes and you only want to see the ones matching a specific key or value.

Read full digest

Apr 13, 2026

W16 Apr 13 – Apr 17, 2026

Voice AI Production-to-Simulation, Annotation Queue Assignment, and API Docs Improvements

Turn any live voice call into a simulation test case, manually assign annotation queue items, and navigate API docs with full context on a single page.

Simulate New

Voice AI: Production to Simulation in One Click

Take any production voice call and turn it directly into a simulation test case. Rerun it against different prompt versions, prompt chains, or agent definitions and compare behaviour on real inputs.

API New

API Reference: Everything You Need on One Page

Each endpoint now renders cURL example, expected response, and full parameter details side by side. No tab switching, no scrolling between sections when you are writing integration code.

Evaluate New

Annotation Queue: Faster Review, Stronger Agreement

Multi-assignment, prefetching, and manual annotator assignment for annotation queues. Reviewer approval is now optional and can be toggled per workflow.

1-click production call to simulation

days → seconds regression to reproducible test

Simulate Improved

Voice Metrics in Call Lists

New voice metrics are available as columns in the call list view, so you can filter and triage calls by latency, interruptions, or turn duration without opening each one.

Read full digest

Mar 30, 2026

W14 Mar 30 – Apr 3, 2026

Falcon AI and 4x Faster Frontend

Meet Falcon AI, a context-aware assistant embedded in the platform that debugs traces, scaffolds simulations, and drafts evaluations from what you are looking at. Plus a 4x faster frontend across every page, ClickHouse Replicated MergeTree migration, and voice simulation moves to LiveKit.

Platform New

Falcon AI

Context-aware AI assistant embedded in the platform for trace debugging, simulation creation, evaluation building, and dataset construction.

4x faster page loads

page-aware Falcon AI

high-availability ClickHouse trace storage

Platform New

Frontend speed improvements

Four targeted optimizations delivering 4x faster page loads across the platform.

Simulate Improved

Voice simulation moves to LiveKit

Voice simulation now runs on LiveKit instead of Vapi, with lower latency and more natural conversational dynamics during simulated calls.

Monitor Improved

ClickHouse Replicated MergeTree migration

Trace storage upgraded to Replicated MergeTree for high availability and automatic failover.

Platform Improved

Prompt generation and improvement

AI-assisted prompt generation from task descriptions and one-click prompt improvement suggestions.

Read full digest

Mar 16, 2026

W12 Mar 16 – Mar 20, 2026

Custom Dashboards, MCP Server, 2FA with Passkeys, and Annotation Queues

A drag-and-drop dashboard builder, a Model Context Protocol (MCP) server that puts Future AGI inside your IDE, two-factor authentication with passkeys + recovery codes, full Annotation Queue workflows, and a rebuilt Agent Command Center.

Platform New

Custom Dashboards

Drag-and-drop dashboard builder for tracking agent performance across evaluation scores, system metrics, cost, and experiment progress.

SDK New

MCP Server

Connect Future AGI to Cursor, Claude Code, VS Code, Claude Desktop, and Windsurf. Your AI coding assistant can run evaluations, query datasets, and pull traces without leaving the editor.

5 IDE integrations (MCP)

2,373+ LLM models priced out of the box

8 external platform integrations

Platform New

Two-factor authentication with passkeys

Full 2FA rollout: TOTP (authenticator app codes), WebAuthn passkeys (biometric or hardware-key sign-in), and recovery codes for account recovery.

Platform New

Role-based access control

RBAC at both organization and workspace levels. Four roles: Owner (billing), Admin (workspaces and integrations), Member (create and modify), and Viewer (read-only).

Evaluate New

Annotation Queues

Full annotation queue workflow with per-user progress tracking, inline scoring, and unified scores across traces, sessions, datasets, and simulation outputs.

Platform New

Agent Command Center: rebuilt architecture

The Agent Command Center gateway ships with a rebuilt architecture: consistent state across replicas, comprehensive security hardening, and tighter correctness guarantees.

Platform New

Integrations hub

Connect Langfuse, Datadog, PostHog, PagerDuty, Mixpanel, S3, Azure Blob, and Google Cloud Storage from a single configuration page in under a minute per integration.

Platform Improved

Pricing for 2,373+ models

Supported-model pricing grows from roughly 200 to over 2,373, so every major provider model now has accurate cost attribution out of the box.

Platform Improved

Falcon AI: skills UI and settings redesign

Falcon AI gets a redesigned skills and settings surface, plus a working stop-generation control for cancelling long responses cleanly.

Platform Improved

Chat history popover and feedback affordances

Chat history popover with thumbs up/down feedback and tool-call persistence across sessions.

Evaluate Improved

Task description field for optimizers

Every optimizer type now accepts a task description field (optimization objective) as input.

Platform Improved

Multi-breakdown analytics

Analytics now support breakdown by multiple dimensions at once. Slice the same widget by model, agent version, and time period together.

Monitor Improved

Central evaluation metrics rollups

Evaluation metrics now feed into one analytics rollup pipeline shared across surfaces, so a metric's value matches everywhere it appears (dashboards, the eval explorer, API). Aggregations on large eval datasets stay fast because the rollup is precomputed.

SDK Improved

Unified tracing across every traceAI language

traceAI now emits a single standardized span shape and attribute set across all four supported languages (Python, TypeScript, Java, C#), so traces from any part of your stack share one consistent shape end to end. A bundled migration utility upgrades historical traces in place.

Platform New

Enterprise multi-org architecture

Production-ready multi-org deployments with per-organization data boundaries, workspace-aware enforcement at every platform endpoint, and end-to-end test coverage across the multi-org lifecycle.

Simulate Improved

Outbound calls support

Backend coverage for outbound voice calls in simulation extended: more provider-side call states tracked, more failure modes surfaced cleanly. Builds on the outbound calling flow already in Simulate.

SDK Improved

futureagi-mcp-server tools

futureagi-mcp-server exposes the platform's tool surface to your IDE assistant: evaluations, datasets, prompts, experiments, simulations, tracing, agents, annotations, optimization, usage, and more. Your AI editor calls Future AGI directly without context-switching to the dashboard.

Read full digest

Mar 2, 2026

W10 Mar 2 – Mar 6, 2026

Agent Command Center, Agent Playground, and ClickHouse Migration

Agent Command Center: multi-provider routing, guardrails, fallbacks, and per-key cost controls for every LLM call. Agent Playground: a visual graph builder for multi-step agents with typed node connections, version management, and live execution control. Plus a ClickHouse migration that transforms trace query performance.

Platform Guard New

Agent Command Center

Multi-provider routing, API key management, inline guardrails, automatic fallbacks, per-key cost tracking, and real-time analytics for every LLM call.

Agents New

Agent Playground

Visual graph builder for multi-step agents: two node types (LLM prompt + Agent), typed node connections, global variables, draft/publish workflow, version management, workflow execution control, and a programmatic graph API.

15+ LLM providers

6 load balancing strategies

2 data regions

Monitor New

ClickHouse migration for trace storage

Trace queries that used to take seconds now return in milliseconds, even on workspaces with millions of spans. Aggregations and filters across long time ranges feel interactive instead of batch.

Evaluate New

Annotation queue

Create annotation queues for traces, sessions, datasets, and simulation outputs to organize human review workflows.

Platform New

Multi-region support

Deploy and store data in US or EU regions for compliance with data residency requirements.

Evaluate Improved

Dataset and simulation analytics

One API surface returns both dataset quality metrics and simulation result trends, so external dashboards or BI tools can pull both without stitching together separate endpoints.

Guard Improved

Real-time updates for key revocation

Instant API key revocation across all replicas via real-time updates, closing the window between revocation and enforcement.

API Improved

Per-key authentication for Agent Command Center

Per-key auth and attribution for Agent Command Center: each team or workload runs through its own API key, with request validation, usage attribution, and downstream tracking all flowing from a single per-key identity. Drop-in for any existing OpenAI-compatible client.

Read full digest

Feb 16, 2026

W8 Feb 16 – Feb 20, 2026

ai-evaluation 1.0, Deep Space Theme, Multi-Language SDKs, and Multimodal Workbench

The ai-evaluation SDK hits 1.0 with a unified evaluate API, multimodal LLM judge, and 72+ metrics. Deep Space brings a redesigned dark mode. traceAI ships C# and Java SDKs plus 31 new TypeScript instrumentor packages. And the Prompt Workbench goes multimodal with WebSocket streaming.

Evaluate New

ai-evaluation v1.0.0

The evaluation SDK's stable release: unified evaluate API, multimodal LLM judge, auto-generated grading criteria, vector-store feedback loops, 72+ local metrics, OpenTelemetry integration, streaming evaluation.

72+ ai-evaluation metrics

4 SDK languages (Py, TS, C#, Java)

31 new TypeScript instrumentors

Platform New

Deep Space dark mode migration

Comprehensive monochrome theme across the entire platform, with every component, every surface, and every interaction state rethought for visual consistency and reduced eye strain.

SDK New

traceAI C# SDK

Full traceAI instrumentation for C# applications and .NET environments. ASP.NET services, Azure Functions, and standalone apps get automatic instrumentation for LLM calls, tool invocations, and agent workflows.

SDK New

traceAI Java SDK

Java SDK with 25 instrumentation modules covering Spring Boot, Micronaut, Quarkus, Apache HttpClient, OkHttp, JDBC drivers, and major Java-based LLM client libraries.

SDK New

31 new TypeScript instrumentor packages

Large expansion of TypeScript instrumentation covering frameworks, databases, HTTP clients, and more, the long tail that TypeScript agent developers rely on.

Platform New

Multimodal Prompt Workbench with WebSocket streaming

The Prompt Workbench goes multimodal end-to-end, with WebSocket-streamed responses that render in real time.

Simulate New

Graph-aware dataset scenario generation

Dataset-based scenarios now generate with branch-context. The LLM call that generates each scenario understands which branch of the graph it's in, producing scenarios that stay consistent with surrounding branches.

Simulate Improved

WebSocket simulation grid

Simulation grid updates stream to the UI via WebSocket instead of polling.

Monitor Improved

Langfuse integration

Configure Langfuse directly from the platform UI, with Langfuse-compatible endpoints for Vapi-routed traffic.

Monitor Improved

Voice observability annotations

Annotation workflow now extends into voice observability. Add annotations directly on voice traces with dedicated filters.

Simulate Improved

Provider-agnostic voice simulation runs

Voice simulation runs are now provider-agnostic, with LiveKit signal monitoring giving precise call-state tracking across providers.

SDK Improved

traceAI Python SDK update

New framework support and end-to-end tests for the Python traceAI SDK.

Platform Improved

Per-tab workspace context

Each browser tab keeps its own workspace context: production traces in one tab, staging simulations in another, no cross-tab interference.

Evaluate Improved

Function parameters in evaluations

Pass function parameters directly to evaluation metrics for dynamic scoring configurations.

Evaluate Improved

Reasoning parameters support in prompts

Configure reasoning-specific parameters (reasoning effort, max thinking tokens, whether to surface the reasoning trace) when working with chain-of-thought models in the Prompt Workbench. The same prompt can be tuned for fast and cheap or slow and thorough without rewriting it.

Monitor Improved

has_eval filter for traces and spans

Filter traces and spans in Observe by whether they have evaluations attached.

Monitor Improved

Workspace-scoped error analysis (Feed display)

Workspace-aware error analysis: the Error Feed surfaces failures from the active workspace, keeping triage focused on the project you're debugging. Cleaner empty states for users still being onboarded to projects.

Monitor Improved

Optimized trace session queries

Filtering and sorting on trace sessions now happen in the database layer instead of in app code, so session views load quickly even on workspaces with millions of spans. Most visible when filtering long lists by date or status.

Read full digest

Feb 2, 2026

W6 Feb 2 – Feb 6, 2026

Simulate from Prompt Workbench, Voice Annotations, and Agent Health for Voice Agents

Launch simulations without leaving the Prompt Workbench, annotate voice calls with structured human feedback, and extend Agent Compass health monitoring to voice agents.

Simulate New

Simulate from Prompt Workbench

Add and configure simulations directly from the Prompt Workbench. Full simulation engine, accessible from where you're already working.

5 label types for voice annotations

2x faster file processing

Evaluate New

Human annotations for voice calls

Structured feedback for voice transcripts, with five label types and support for multiple reviewers on the same call.

Monitor New

Agent health monitoring for voice agents

Agent Compass now extends to voice agents, with call duration distributions, response latency percentiles, interruption rates, and conversation completion metrics in real time.

Simulate Improved

Voice simulation revamp

Voice simulation runs against a rebuilt runtime that cuts end-to-end run time on multi-scenario suites. The API surface was consolidated to fewer endpoints with consistent payload shapes, so a test harness wiring against it doesn't have to handle one-off response formats per call type.

Evaluate Improved

Multi-image support in evaluations and datasets

Evaluations accept and score multi-image inputs; datasets accept multiple images per row.

Platform Improved

Reasoning model support

First-class support for reasoning models. Chain-of-thought steps appear as distinct spans (the individual steps inside a trace) in the trace view.

Simulate Improved

WebSocket simulation grid updates

Simulation results stream to the grid in real time, with no manual refreshes.

Platform Improved

Image and audio output rendering in Prompt Workbench

The Prompt Workbench renders image and audio outputs inline for multimodal prompt iteration.

Platform Improved

Azure endpoint type selector

Select Azure-specific endpoint types when configuring custom models, with proper API format handling for Azure-hosted deployments.

Platform Improved

Read traffic and write traffic separated

Dashboard loads and search no longer slow down during heavy evaluation runs. Query traffic and write operations run on independent paths.

Platform Improved

2x faster dataset imports

CSV, Excel, and JSON imports now run roughly 2x faster with significantly lower memory consumption.

Simulate Improved

Faster simulation results and evaluations dashboard

Simulation results and the evaluations dashboard load noticeably faster: queries pushed down to the database layer, page-by-page rendering replacing full result-set loads. Most visible when scrolling through hundreds of test runs.

Evaluate Improved

Function evaluations in test evaluations

Function-type evaluations (deterministic Python or JavaScript checks you author yourself, not LLM-judged) now run inside test evaluation workflows. Useful when pass/fail is logic, not opinion: schema validation, exact-match comparisons, custom string parsing.

API Improved

Simulate API changes for run tables and optimization

Simulate run tables and optimisation endpoints have cleaner request shapes and more consistent error responses, so existing API consumers can drop one-off conditionals around handling individual endpoints.

Read full digest

Jan 19, 2026

W4 Jan 19 – Jan 23, 2026

Baseline Chat Comparison, Fix My Agent Polish, and OpenTelemetry Instrumentation

Baseline chat comparison wires production conversations into simulation as the fastest path from production failure to reproducible test. Plus Fix My Agent polish, OpenTelemetry instrumentation, and image-output support across datasets and the Prompt Workbench.

1-click production trace to simulation baseline

4 Agent framework wrappers

Simulate New

Baseline chat comparison from Observe to Simulation

Compare real production conversations against simulated outputs to identify drift and regressions. Fastest path from "something went wrong in production" to "here's a reproducible test case."

Agents Improved

Fix My Agent: final polish

The drawer is restructured so you reach the suggested fix faster, call-selection bugs that affected long simulation runs are fixed, and a restore-with-conflicts flow keeps your local edits intact when Fix My Agent's suggestion lands on top of changes you've already made.

Platform Improved

OpenTelemetry instrumentation in the platform

Future AGI now emits OpenTelemetry (OTEL) traces for its own operations, with Sentry integration across long-running workflows.

Agents Improved

Agent Prompt Optimiser: resume support

Long-running optimiser jobs now survive restarts and pick up from where they left off.

Evaluate Improved

Dataset optimisation with direct evaluation

Run evaluations directly from datasets and manage trial items without leaving the dataset view.

Evaluate Improved

Image output support in datasets and Prompt Workbench

Datasets and the Prompt Workbench render image outputs natively, useful for document analysis, chart generation, and visual QA workflows.

Evaluate Improved

Multiple image upload in datasets

Upload many images at once when building datasets. No more one-by-one file selection.

Simulate Improved

Bulk delete and bulk rerun test executions

Select dozens of test executions and delete or rerun them in a single action. What used to require API scripting is now two clicks.

Agents Improved

Output type selection in Playground results

Set the expected response format on a Playground run so multimodal output renders correctly instead of defaulting to a raw string preview.

Simulate Improved

Chat inputs for simulation analysis agent

The simulation analysis agent now accepts chat-formatted inputs, so chat-based simulation runs can be triaged by the same diagnostic agent that already handles voice runs. Useful when reviewing mixed simulation suites.

SDK Improved

simulate-sdk v0.1.2

Cloud mode for offloading simulation execution, agent wrappers for OpenAI, LangChain, Gemini, and Anthropic, and tool-call capture built into every simulation run.

Read full digest

Jan 5, 2026

W2 Jan 5 – Jan 9, 2026

Chat Simulation via Observe, Pre-Built Evaluation Groups, and Fix My Agent for Chat

Launch chat simulations directly from real production conversations, pick from 10 ready-to-use evaluation groups with no configuration, and get Fix My Agent diagnostics for chat agents.

Simulate New

Chat Simulation via Observe

Browse real customer conversations in Observe (the view of your live production traces), find the one you want to test against, launch a chat simulation from it with one click.

10 Pre-built evaluation groups

1-click Simulate from Observe

Evaluate New

Pre-built evaluation groups for simulations

10 ready-to-use evaluation groups covering common quality dimensions: attach one to a simulation and run, no configuration required.

Agents New

Fix My Agent support for chat agents

The Fix My Agent diagnostic engine now works on chat-based agents with text-specific failure-mode detection.

Simulate New

Chat Simulation V1 launch polish

The remaining launch pieces for Chat Simulation V1 land together: the persona section, chat logs with inline traces and attributes, the evaluation mapping flow, and the status/UI polish that ties them into one product. Everything needed to run Chat Sim end to end is now in place.

Agents Improved

Agent Prompt Optimiser launch on platform

The Agent Prompt Optimiser is now accessible directly from the platform UI. Pick a strategy, choose target calls, run optimisation without writing API code.

Evaluate Improved

Domain-level metrics, insights, and human comparison summary

Simulation runs now roll up metrics by domain (support, sales, onboarding, whatever you've tagged) and surface a side-by-side summary that compares the agent's behaviour against human-handled conversations. Useful for spotting where automation is on par and where it lags.

Evaluate Improved

Agent prompt conformance evaluation

New evaluation metric that measures how closely an agent follows its prompt instructions.

Platform Improved

Additional LLM models

Latest model releases from the major providers (OpenAI, Anthropic, Google, and others) are wired into evaluation and prompt surfaces, so you can run scoring or generation against new models without waiting for a separate Future AGI release.

Platform Improved

Unified chat message roles

Message roles (user, assistant, system, tool) and the way the dashboard labels them now match across every chat surface. A conversation captured in Observe reads identically inside Simulate or the Workbench, so you can copy a transcript across tools without manual remapping.

Simulate Improved

Call analytics drawer with transcripts and manual graph creation

Call analytics drawer gains transcripts view and a manual graph-creation flow.

Platform Improved

Dynamic model parameter updates from API

Model parameters automatically refresh based on provider API capabilities, so new model variants and parameter changes from providers flow through without manual updates.

Platform Improved

Audio content validation for audio models

Automatic validation of audio content format and quality before submission to audio models. Catches bad inputs before they consume compute.

API Improved

Manage replay sessions through the API

Programmatic management of replay sessions, useful for wiring replay regression checks into CI/CD pipelines.

Simulate Improved

Streamlined persona management in scenarios

Assign and swap personas inside scenarios with a simplified inline interface; edit persona info directly from scenario columns.

Simulate Improved

Complete simulation status visibility

Stage-level progress indicators on every simulation run (scenario generation, conversation execution, evaluation scoring), visible in real time.

Platform Improved

API key management: delete from dashboard

Delete API keys directly from the dashboard to revoke access.

Evaluate Improved

Edit Run Experiment

Experiment configurations can now be edited mid-run (swap a model variant, adjust a scoring threshold) without restarting. Configuration history is tracked per data point so the audit trail stays clean.

Read full digest

Dec 22, 2025

W52 Dec 22 – Dec 26, 2025

Chat Simulation V1, Agent Prompt Optimiser, and Reliability Upgrades

Simulation for text-based chat agents, a six-strategy automated prompt optimiser, selective optimisation against specific calls, and reliability upgrades that keep long simulation and optimisation jobs running through restarts.

Simulate New

Chat Simulation V1

Full simulation engine for chat-based agents with persona-driven conversations, scenario generation, and in-drawer analytics.

6 Optimisation strategies (agent-opt)

200+ Conversation turns per simulation

Agents New

Agent Prompt Optimiser

Automated prompt optimisation runs against your evaluation data: six strategies wired into the platform with a results UI.

Agents New

Optimize My Agent V3: targeted optimisation

Select specific calls to optimise against, instead of optimising across the whole dataset. Direct the optimiser at the failure modes you actually care about.

Simulate New

Create scenario from Observe

Convert any real production session in Observe (the view of your live production traces) into a reusable simulation scenario with one click.

Simulate Improved

Replay sessions from real traces

Re-run historical production conversations through your current agent configuration: regression tests built from production traffic.

Platform Improved

Dot notation for JSON column variables

Reference nested JSON fields with dot notation (e.g., `user.profile.language`) in prompt templates and experiment configurations.

Platform Improved

Document format preview in Dataset and Experiment

Inline preview for documents referenced in datasets and experiment results. No download needed.

Simulate Improved

Instruction input in scenario creation

Describe what you want to test in natural language and the system generates scenarios accordingly.

Evaluate Improved

Evals filtering in dataset summary

Filter evaluation results inside the dataset summary view for faster drill-down.

Monitor Improved

Agent-centric metrics and call log improvements

Per-agent metrics now appear on the dashboard, so agent quality is comparable directly across agents instead of aggregated. Call log capture broadened to record more of what the agent does inside a call.

Evaluate Improved

Edit synthetic data configuration

Modify synthetic-data generation configuration after a run has started.

Read full digest

Dec 8, 2025

W50 Dec 8 – Dec 12, 2025

Fix My Agent, Persona Management Suite, and JSON Input/Output in Sessions

Context-aware debugging that tells you why a simulation failed and how to fix it, full lifecycle management for simulation personas, structured JSON rendering in session views, and the backend for the upcoming Agent Prompt Optimiser.

Agents New

Fix My Agent

Context-aware debugging that reads the full execution context of a failed simulation and generates ranked, actionable suggestions across two classes: agent-level (prompt, context, tool usage) and infrastructure-level (provider timeouts, rate limits, integration misconfiguration).

ranked Fix My Agent suggestions

3x Faster issue resolution (reported)

Simulate New

Persona management suite

Full lifecycle management for simulation personas: view across workspace, duplicate as starting points, edit inline, and delete.

Platform New

JSON input/output in session view

Session views render structured JSON input and output natively: collapsible trees, syntax highlighting, Markdown rendering in table cells.

Agents Improved

Agent Prompt Optimiser groundwork

Backend infrastructure for the Agent Prompt Optimiser is now in place: evaluation hookups, the optimisation job runner, and the result schema. The user-facing optimiser builds on top of this foundation.

Evaluate Improved

Edit experiment configuration after starting

Modify experiment parameters mid-run without restarting. The system tracks which configuration was active for each data point so result integrity is preserved.

Platform Improved

JSON dot notation in Run Prompts and Experiments

Reference nested JSON fields using dot notation (e.g., `user.profile.preferences.language`) in prompt templates and experiment configs.

Simulate Improved

Custom voices for ElevenLabs and Cartesia

Custom voice support for ElevenLabs and Cartesia in Run Prompt and Experiments.

Platform Improved

Rate limits update for custom subscription tier

Rate limits raised on the CUSTOM subscription tier so high-volume customers no longer hit the previous ceiling on bursty workloads.

Platform Improved

Enhanced audio player with lazy loading

Rebuilt audio player with lazy loading. Faster page loads on views with many session recordings.

Agents Improved

Fetch agent definition from providers

Import agent configurations directly from Vapi and Retell with one click. No manual recreation.

Monitor Improved

Polling and loading state in error localization

Error-localization analysis runs asynchronously. The view now polls for the result and shows a loading state in the meantime, so the page no longer appears to hang while the analysis completes in the background.

Simulate Improved

Real-time loading states for calls

Live progress indicators and estimated time remaining for ongoing calls.

Platform Improved

Workspace issues view

Dedicated view that surfaces issues affecting the whole workspace (not just a single project), so admins can spot platform-level problems without drilling into each project individually.

Read full digest

Nov 24, 2025

W48 Nov 24 – Nov 28, 2025

Multi-Branch Scenarios, Custom Background Noises, and Critical-Issue Feed in Simulate

Scenarios that branch into multiple conversation paths, ambient noise profiles that push simulations closer to production reality, and a new critical-issue feed that surfaces the most important simulation failures.

Simulate New

Multi-branch scenario generation

AI-generated scenarios that branch into multiple paths: intent forks, escalation paths, and digression branches. Navigate the tree and run simulations against specific branches.

10+ Background noise profiles

3 Scenario branch types

Simulate New

Custom background noises in simulation

Add realistic ambient noise profiles to voice simulations (coffee shop chatter, traffic, typing, crowds, office HVAC) so tests match production conditions.

Agents New

Enable Others option for agent definition

Define agents against custom voice or LLM providers beyond the standard Vapi and Retell integrations. Open configuration for teams with heterogeneous stacks.

Simulate New

Critical-issue feed in Simulate

A dedicated feed surfaces the most important simulation failures across runs: the issues worth fixing first.

Evaluate Improved

Eval explanation summary for simulations

Every simulation evaluation now includes a human-readable explanation: not just a score, but why the score landed there.

Simulate Improved

Call analytics revamp

Redesigned call analytics and call detail views with clearer metrics, latency breakdowns, and per-call cost reporting.

Simulate Improved

Latency metrics in Simulate

Per-step latency tracking for simulation calls. See how long each phase took.

Simulate Improved

Persona polish: more accents and language sorting

Southern accent and additional accents added to personas. Languages and accents now sorted alphabetically in the persona selector.

Platform Improved

Prompt WebSocket streaming

Real-time prompt execution with streaming responses via WebSocket. No more wait-for-completion in the Workbench.

Evaluate Improved

Edit evaluation variable remapping

Remap evaluation variables after creation without rebuilding the entire evaluation configuration.

Monitor Improved

Observe enhancements

Sticky filters, pagination improvements, inline metadata display, and updated pricing logic in Observe (the view of your live production traces).

Monitor Improved

New columns in Observe

Additional columns in the Observe table surface agent and call metadata directly without drilling in.

Evaluate Improved

Scenario column support in evals and run test

Scenario columns are now available inside evaluation runs and run-test results, providing richer test data without leaving the evaluation view.

Simulate Fixed

Simulated assistant call ending fixes

Simulated assistant calls now end reliably when the agent finishes its final turn or hits a configured timeout, instead of leaving the session hanging.

Read full digest

Nov 10, 2025

W46 Nov 10 – Nov 14, 2025

Simulation Call Observability, Retell and Outbound Calls in Simulate, Tool Evaluation

Logs, latency, and cost on every simulation call. Retell-backed agents, outbound calling, and tool-level verification all land in Simulate. Plus personas editable after creation and a rebuilt Run Prompt and Experiment workflow.

Simulate New

Logs, latency, and cost breakdown on simulation calls

Every simulation call now shows structured logs, per-call latency metrics, and cost attribution in one view.

3 simulation observability layers

2 voice providers in Simulate

Simulate New

Retell agents in Simulate

Native Retell support inside the simulation loop. Retell joins Vapi as a supported voice provider for end-to-end test runs.

Simulate New

Outbound calling in Simulate

Simulate outbound voice flows where your agent places the call: appointment reminders, sales outreach, proactive notifications, payment collection.

Evaluate New

Tool evaluation in Simulate

Verify every tool and function call your voice agent makes during a simulation. Did it call the right tool? Were the parameters right? Did it handle the response?

Simulate New

Run Prompt and Experiment revamp

Contextual provider selection and streamlined configuration for prompt runs and experiments. Audio output supported from both workflows.

Simulate Improved

Personas: full CRUD and edit-after-creation

Create personas, edit them after creation, and manage the full persona library. Pairs with the persona system.

Simulate Improved

Reasoning column in Simulate

View the reasoning trace behind each simulation decision directly in the results table.

Simulate Improved

Custom voices in Run Prompt and Experiments

Use custom voices from ElevenLabs and Cartesia in prompt runs and experiment workflows.

Monitor Improved

Expanded evaluation attributes in voice observability

New evaluation dimensions for voice quality, latency consistency, and naturalness in voice agent monitoring.

Simulate Improved

Error localization in Simulate

Errors in simulation runs are now pinpointed to the exact step and provider that caused them.

Evaluate Improved

Edit evaluations within experiment page

Modify evaluation configurations inline without leaving the experiment view.

API Improved

Configure and re-run evaluations via API

Programmatically configure evaluation parameters and trigger re-runs through the API.

Platform Improved

Session history enhancements

Session history now renders full transcripts instead of truncating long ones, and supports more locales. A session shows everything your users actually said.

Monitor Improved

Observe homepage revamp

Observe landing page now leads with recent traces and active alerts, so the most common starting points (jump to a trace, check an alert) take fewer clicks. Initial load is noticeably faster on high-volume workspaces.

Read full digest

Oct 27, 2025

W44 Oct 27 – Oct 31, 2025

Credit Usage Revamp, Multi-Language Agents, and New TTS Providers

Workspace-level credit attribution, a 3-step guided agent builder with multi-language support, a rebuilt Prompt Workbench with commit-style version history, and four new text-to-speech providers.

Platform New

Credit usage summary redesign

Workspace-level credit attribution with per-feature breakdowns and historical trend lines. Forecast AI spend with real granularity.

4 New TTS providers

15+ Languages supported

Agents New

New agent definition UX

A 3-step guided flow for building agents: identity and behavior, then tools and integrations, then a preview sandbox before deploying.

Platform New

Prompt Workbench revamp

Commit-based version history comes to the Prompt Workbench. Think git, but for prompts.

Agents New

Multi-language support in agent definition

Agents can be defined to operate natively in 15+ languages with locale-aware behavior that goes beyond simple translation.

Simulate Improved

Add columns to scenarios via AI and manual input

Enrich simulation scenarios with custom data columns. Generate them with AI, or enter them manually.

Simulate Improved

Enhanced language and accent support in simulation

Broader dialect and accent coverage for realistic multi-language voice simulations.

Simulate Improved

Simulate metrics revamp

Redesigned metrics dashboard with real-time pass/fail rates and drill-down into individual test cases.

SDK New

ai-evaluation v0.2.2

First-class LLM-as-a-Judge and new heuristic metrics: JSON schema validation, similarity, string matching, aggregation.

Platform Improved

Call analytics integration

Unified analytics for voice agent calls with cost, duration, and quality breakdowns in one view.

Simulate Improved

Detailed voice provider logs

Full request and response logs for every voice provider interaction during simulation.

Simulate New

New TTS model integrations

Four new text-to-speech (TTS) providers (Cartesia, Hume, Neuphonics, and LMNT) now available in simulation.

Read full digest

Oct 13, 2025

W42 Oct 13 – Oct 17, 2025

ai-evaluation SDK v0.1.5, Personas, and Run-Prompt Enhancements

The ai-evaluation SDK launches with 50+ evaluation templates. Pre-built and custom personas come to simulation, with dataset-derived personas from real call transcripts. Plus voice output in Run Prompt and Experiment, ingestion-time cost tracking, and more.

SDK New

ai-evaluation v0.1.5 launch

Initial release of the ai-evaluation SDK with 50+ evaluation templates covering faithfulness, relevance, safety, and domain-specific metrics.

50+ Eval templates in ai-evaluation

3 Persona sources

Simulate New

Pre-built and custom personas

Pre-built caller personas, custom persona definitions, and dataset-derived personas generated from your real call transcripts.

Evaluate Improved

Provider transcript as evaluation attribute

The voice provider's native transcript is now available as an evaluation input, useful for comparing automatic speech recognition (ASR) accuracy and response quality side by side.

Platform Improved

Enhanced onboarding flow

First-run setup branches by user role, so each new user lands on the steps that match how they will use the platform instead of a single generic walkthrough. Time from sign-up to first evaluation drops accordingly.

Simulate Improved

Voice output in Run Prompt and Run Experiment

Generate and evaluate spoken responses directly from the prompt playground and experiment workflows.

Simulate Improved

Manual, AI-generated, and dataset-sourced scenario rows

Three ways to add scenario rows to a simulation: type them in, generate them with AI, or import from an existing dataset.

Evaluate Improved

Run evaluations on completed test runs

Apply new evaluation criteria to previously completed simulation runs, with no need to re-execute the calls.

Agents Improved

Agent definition version selection in simulation

Select a specific agent definition version when running a simulation, so regression testing and A/B comparisons become straightforward.

SDK Improved

ai-evaluation v0.2.1

Batch evaluation support and bias detection added to the ai-evaluation SDK.

SDK Improved

traceAI OpenAI Agents SDK support

Native instrumentation for OpenAI's Agents SDK that captures tool calls, agent handoffs, and multi-agent orchestration as traces.

Platform Fixed

Updated pricing calculation in Observe

Trace cost is calculated as spans land in Observe, not after a post-processing pass. The cost figures on the dashboard and on the underlying trace stay in sync, so usage reporting matches what the trace view shows.

Read full digest

Sep 29, 2025

W40 Sep 29 – Oct 3, 2025

Voice Observability for Vapi, Retell, and ElevenLabs; Eval Groups in Experiments; Simulate via SDK

Observability ships for three voice platforms at once. Evaluation groups integrate with experiments and optimization. Call Simulation is now triggerable from the SDK.

Monitor New

Voice Observability for Vapi, Retell, and ElevenLabs

Observability ships for three voice platforms at once: Vapi, Retell, and ElevenLabs. Call metrics, transcript analysis, and utterance-level tracking on the same trace and evaluation surface you already use for text agents.

3 Voice platforms supported

60-70% Cost reduction via SDK simulation

Evaluate New

Eval groups in experiments and optimization

Evaluation groups now integrate with experiments and agent optimization. Run a whole group of evaluations together and optimize against group-level scores.

SDK New

Simulate via SDK

Trigger Call Simulation programmatically through the SDK against Vapi- or Retell-backed voice agents. Plug simulation into CI/CD pipelines, scheduled jobs, or custom orchestration.

Simulate Improved

Selective test rerun in Simulate

Rerun only the specific failed or flagged test cases in a simulation suite, without re-executing the entire suite.

Evaluate Improved

Default eval groups

Pre-built evaluation groups for common use cases: retrieval-augmented generation (RAG), computer vision, conversational AI.

Simulate Improved

Advanced simulation management

Auto-refresh on the simulation dashboard, stop-simulation control for running suites, and a visual workflow trace of the execution path for each scenario.

Platform Improved

Workbench revamp

Slide-out code drawer in the Workbench keeps prompt code visible while you iterate, so you don't tab-switch to inspect what you're running. The header is restructured to keep run controls reachable without scrolling.

SDK Improved

traceAI session support

Native `session.id` attribute support in traceAI, with automatic session grouping across all instrumented frameworks and no custom middleware.

Platform Fixed

Agent definition design changes

Agent configuration page redesigned for agents with many prompts and tools. Sections stay grouped and reachable as the configuration grows past a handful of nodes.

Read full digest

Sep 15, 2025

W38 Sep 15 – Sep 19, 2025

Automated Scenario Builder, Agent Definition Versioning, and Simplified Session Tracking

Upload a standard operating procedure or call transcript and get test scenarios with edge cases generated automatically. Commit-style version control for agent definitions. Session-level observability from a single span attribute.

Simulate New

Automated scenario and workflow builder

Upload standard operating procedures (SOPs), call transcripts, or product documentation. The system generates simulation scenarios with edge cases and branching conversation flows.

10x Faster scenario generation

3 Audio download formats

Agents New

Agent definition versioning

Commit-style version control for agent definitions. Every change gets a commit message, consolidated test reports show how each version performed, and rollback is one click.

Monitor Improved

Simplified session tracking

Add a single `session.id` attribute to spans and Future AGI groups related traces into a session view automatically.

Evaluate Improved

Advanced evaluation group management

Full create, read, update, delete (CRUD) operations for evaluation groups. Organize and manage evaluation suites programmatically.

Simulate Improved

Multi-channel audio player for simulation calls

Listen to simulation calls with separate channels for agent and caller audio — focus on one side or hear both together.

Simulate Improved

Flexible call recording downloads

Download simulation call recordings in three audio formats, for offline analysis, compliance archiving, and sharing.

Platform Improved

Prompt collaboration features

Collaborative editing and commenting on prompts. Multiple team members can work on the same prompt, leave feedback, and track revisions.

Platform Improved

Dedicated background worker pool for trace ingestion

Dedicated background worker pool for trace ingestion. Separates trace processing from other background jobs and eliminates resource contention during high-volume ingestion.

Platform Improved

Optimized trace ingestion pipeline

End-to-end optimization of the trace ingestion pipeline. Faster time from trace emission to dashboard availability.

Platform Fixed

Annotation and prompt import fixes in datasets

Resolved issues with importing annotations and prompt data into datasets for smoother data pipeline operations.

Read full digest

Sep 1, 2025

W36 Sep 1 – Sep 5, 2025

Agent Compass, Annotation Quality Dashboard, and Enterprise Multi-Workspace Security

Zero-config performance insights on your agent traces, statistical dashboards for annotator agreement, and enterprise-grade multi-workspace isolation with audit logging.

Agents New

Agent Compass

Automatic, zero-configuration performance insights on your agent traces. Detects issues at the trace level with no evaluation setup required.

0 Config required for Agent Compass

5 Eval group templates

Evaluate New

Annotation quality dashboard

Inter-annotator agreement metrics (including Cohen's kappa) to measure how consistent your human reviewers are.

Platform New

Enterprise multi-workspace security

Multiple independent workspaces with full data isolation, per-workspace role-based access control (RBAC), and cross-workspace audit logging.

Monitor Improved

Feed insights with error clusters

Observability feed now groups related errors into clusters and surfaces trend lines showing whether issues are getting better or worse.

Platform Improved

Intelligent onboarding navigation

Guided onboarding flow that adapts to your role and use case.

Simulate Improved

Voice agent testing and analytics improvements

Dashboard metrics and scenario column views for voice simulation. Call success rate, average call duration, and scenario coverage in one place.

Platform Improved

Prompt library organization

Folder-based prompt architecture with templates for structuring large prompt libraries across teams and projects.

Evaluate Improved

Evaluation grouping API

API support for evaluation grouping — organize related evaluations programmatically, run evaluation groups as test suites in CI/CD.

Platform Fixed

Enhanced plans and pricing experience

Redesigned pricing page with clearer plan comparisons and streamlined upgrade flows.

Read full digest

Aug 18, 2025

W34 Aug 18 – Aug 22, 2025

Summary Dashboards, Alerts Revamp, Prompt SDK, and Workspaces RBAC

Redesigned summary dashboards with new chart types and side-by-side comparison, a rebuilt alerts system, Prompt SDK upgrades for production use, and role-based workspace access control.

Monitor New

Summary screen revamp

Rebuilt summary dashboards. Spider, bar, and pie chart visualizations, plus side-by-side comparison between any two runs, prompt versions, or time periods.

4 Role levels

3 Chart types

Monitor New

Alerts revamp with Slack and email

Rebuilt alerts system with Slack and email notification channels, customizable thresholds, and alert grouping to reduce notification noise.

SDK New

Prompt SDK upgrades

Caching, A/B testing, and multi-environment deployment in the Prompt SDK — production-grade prompt management.

Platform New

Workspaces RBAC

Role-based access control with Owner, Admin, Member, and Viewer roles. Granular permissions on evaluations, simulations, and production traces.

Platform Improved

AWS Marketplace integration

Purchase and manage Future AGI subscriptions through AWS Marketplace with consolidated AWS billing.

SDK Improved

Error localizer via SDK

Synchronous and asynchronous standalone error localization through the SDK to pinpoint failures in agent execution chains.

Evaluate Improved

Critical issue detection on datasets

Automatic detection of critical issues in datasets with actionable advice for data quality problems.

Monitor Improved

Prompt metrics in Observe

Track trace performance per prompt version inside Observe — measure the real-world impact of a prompt change.

SDK Fixed

traceAI optional dependencies cleanup

Reduced install bloat by making framework-specific dependencies optional, cutting package size significantly.

Read full digest

Aug 4, 2025

W32 Aug 4 – Aug 8, 2025

Document Columns, Function Evaluations, and Async Evals via SDK

Upload documents directly into datasets with built-in OCR, write deterministic function evaluations for objective checks, and run evaluations asynchronously from the SDK.

Platform New

Document column support in datasets

Upload TXT, DOC, DOCX, PDF, and scanned documents directly into datasets. Built-in OCR handles scans and images.

5 Document types supported

50+ Eval templates

Platform Improved

Edit synthetic data after generation

Modify AI-generated synthetic data before committing it to your datasets.

Monitor Improved

User tab in Dashboard and Observe

Per-user views in Dashboard and Observe surface aggregate metrics across sessions and traces.

Platform Improved

Configure labels per prompt version

Tag each prompt version with custom labels to track experiments, A/B tests, and rollout stages.

Monitor Improved

Video support in Observe

Capture and replay video outputs from multimodal agents directly in the Observe trace view.

Platform Improved

OCR support for document processing

Optical character recognition extracts text from scanned documents and images — legacy paper-based workflows become testable without manual transcription.

Evaluate Improved

Comparison summary

Compare evaluation results and prompt summaries across two datasets side by side to measure improvement or catch regressions.

SDK Improved

Bulk annotation and user feedback via API/SDK

Submit thousands of annotations and feedback entries in a single call. For high-throughput labeling workflows and integrations with existing annotation tools.

SDK Improved

traceAI v0.1.10 with LLM prompt template labels

traceAI automatically labels LLM spans with prompt template identifiers, so you can filter traces by prompt version.

SDK Improved

traceAI Pipecat integration

Native Pipecat instrumentation for tracing voice and multimodal AI pipelines built on Pipecat.

SDK Improved

traceAI LlamaIndex TypeScript

TypeScript instrumentation for LlamaIndex, bringing tracing to the RAG framework in Node.js environments.

Monitor Fixed

Timestamp column in trace/spans

Precise timestamp columns in trace and span views for accurate timing analysis and debugging.

Evaluate Fixed

JSON view for evals log

Inspect raw evaluation log data in structured JSON format for debugging and integration purposes.

Read full digest

Jul 21, 2025

W30 Jul 21 – Jul 25, 2025

Voice Simulation and the Evals Playground

AI-conducted phone calls test your voice agents end-to-end. Plus an interactive sandbox for testing evaluations in real time — with inline scoring on any span in a trace.

Simulate New

Call Simulation

AI-powered simulator agents place real phone calls to your voice agents. Scenario-driven conversations with real turn-taking, interruptions, and the unpredictable patterns real users bring.

sub-second Voice latency

60-70% Cost reduction vs manual QA

Evaluate New

Evals Playground

An interactive sandbox for testing and refining evaluations before you commit. Paste a prompt-response pair, pick criteria and a judge model, see the score in real time. Also available inline inside the trace view.

Simulate Improved

Simulator agent form and agent definition dropdowns

Configure simulation agents through a form interface — select target agent, define simulator persona, choose scenarios, launch. No YAML files.

Simulate Improved

Add scenarios from datasets

Import test scenarios directly from your existing datasets. Turn a collection of real customer transcripts into repeatable simulation test cases with a few clicks.

Platform Improved

Mixpanel analytics integration

Full Mixpanel integration across the platform. Track feature usage, evaluation runs, and simulation sessions to understand adoption and workflow bottlenecks.

SDK Improved

traceAI TypeScript Vercel instrumentor

Automatic observability for serverless AI deployments on Vercel. Wrap your handler, every LLM call, tool invocation, and response is captured as a span.

Evaluate Improved

CRUD on custom evaluations

Full create, read, update, and delete operations for custom evaluations — manage your evaluation library as it grows.

Evaluate Improved

Add feedback to evaluations

Attach human feedback directly to evaluation results. Build labeled datasets and improve evaluation accuracy over time.

Platform Fixed

Refresh token cycle for session management

Automatic token refresh cycle ensures uninterrupted simulation sessions without manual re-authentication.

Platform Fixed

Span name display in traces

Span names now appear directly in the trace view, making it faster to navigate complex agent execution trees.

Read full digest

Jul 7, 2025

W28 Jul 7 – Jul 11, 2025

System Metrics in Observe, Multimodal Bedrock Tracing, and Eval Playground Upgrades

Infrastructure metrics alongside agent traces in Observe, image tracing for AWS Bedrock, and standalone mode + feedback loops in the Eval Playground.

Monitor New

System metrics in Observe graph

System metrics are now selectable directly from the Observe graph dropdown — render them in the primary graph or the compare graph alongside agent traces.

4 chart types

25+ instrumentors

Monitor New

Multimodal support for AWS Bedrock

Trace image inputs and outputs for AWS Bedrock multimodal models (Amazon Titan Image Generator, Anthropic Claude on Bedrock).

Evaluate Improved

Eval Playground improvements

Standalone evaluation mode, feedback collection on playground results, and clearer scoring visualization.

Evaluate Improved

Multi-line graphs in evaluations

Plot multiple evaluation metrics on a single chart — overlay hallucination rate against relevance score, or compare accuracy across models on the same axes.

Platform Improved

API key management revamp

Redesigned API key management with a cleaner UI, bulk operations, and clearer permission displays.

SDK Improved

Google GenAI instrumentor

Automatic tracing for Google's Generative AI SDK with support for Gemini models and function calling.

Evaluate Improved

Langfuse evaluations integration

Route Langfuse evaluation data into Future AGI for unified analysis alongside native evaluations.

SDK Improved

traceAI v0.1.11

Google GenAI instrumentor and multimodal AWS Bedrock support in the core SDK.

Monitor Fixed

Annotation notes

Add free-form text notes to annotations for richer context on human feedback.

Platform Fixed

Draft prompts

Save work-in-progress prompts as drafts without publishing them to the shared prompt library.

Read full digest

Jun 23, 2025

W26 Jun 23 – Jun 27, 2025

Alerts and Monitors, gRPC Trace Ingestion, and the Observe Graph

Real-time alerts with Slack and email, gRPC trace transport with 60% less latency, and a visual graph of agent execution inside Observe.

Monitor New

Alerts and monitors

Set thresholds on any evaluation metric, trace property, or system measurement — get notified in Slack or email when the threshold is crossed.

60% less latency with gRPC

2 notification channels

Platform New

gRPC support for trace ingestion

High-performance gRPC transport for trace data, with 60% lower latency than HTTP/JSON.

Monitor New

Observe graph visualization

An interactive directed graph view of agent execution. Each node is a span, each edge is the flow of execution.

Platform Improved

Developer keys

API key management with scoped permissions — read-only, write, or admin keys, each with its own usage metrics and one-click rotation.

Platform Improved

Model serving infrastructure

New internal infrastructure for serving evaluation and guardrail models. Autoscales based on demand, optimizes for low-latency inference.

SDK Improved

traceAI gRPC transport support

Python and TypeScript SDKs now support gRPC as a trace transport alongside HTTP.

Evaluate Improved

Eval tab revamp

Redesigned evaluation results tab with better data visualization, more flexible filtering, and one-click CSV and JSON export.

Evaluate Fixed

Prompt eval updates

Improved prompt evaluation workflow with batch execution and result comparison across prompt versions.

Read full digest

Jun 9, 2025

W24 Jun 9 – Jun 13, 2025

Breaking Bad UI Redesign, Custom Model Endpoints, and Observe Enhancements

A redesigned platform UI with new navigation, a new component library, and consistent interaction patterns. Azure OpenAI and self-hosted models as evaluation judges. New filters and provider logos in Observe.

Platform New

Breaking Bad — platform update

A platform-wide release: model-choice in evaluations, enhanced error localization, extended alerts, and a redesigned UI across every surface.

3 custom judge endpoint types

3x faster dataset loading

Evaluate New

Custom model dropdown with Azure and custom endpoints

Use Azure OpenAI deployments, OpenAI-compatible endpoints, or self-hosted models as your evaluation judge — critical for teams with data residency, cost, or fine-tuned-judge requirements.

Monitor Improved

Attribute filters in Observe

Filter traces by custom attributes, metadata, and span properties in the Observe view.

Monitor Improved

Sentry error monitoring integration

When Sentry captures an exception in your app, the corresponding trace links automatically. Full-stack debugging context from application error to agent execution path.

Evaluate Improved

Image and audio support in evaluation log table

View image and audio outputs directly in evaluation log tables with inline rendering.

Platform Improved

Faster dataset loading

3x performance improvement for dataset loading through pagination and lazy rendering.

Evaluate Improved

Evaluation feedback in Observe

Attach feedback directly to evaluation results viewed from the Observe surface.

SDK Improved

traceAI Google ADK support

Automatic instrumentation for Google's Agent Development Kit (ADK) with full span capture and tool-call tracing.

SDK Improved

traceAI TypeScript: new evaluation support

Expanded evaluation capabilities in the TypeScript SDK, with new metric types and batch submission.

Evaluate Fixed

Eval template validation

Automatic validation of evaluation templates before execution to catch configuration errors early.

Monitor Fixed

Provider logos for tracing

Visual provider identification in trace views with logos for OpenAI, Anthropic, Google, and other LLM providers.

Read full digest

May 26, 2025

W22 May 26 – May 30, 2025

Protect Flash, TypeScript SDK v0.1.0, and Custom Evaluations in Observe

A speed-optimized guardrails path with a binary harmful/not-harmful decision. The first official TypeScript SDK. Configurable custom evaluations that run directly on production traces.

Guard New

Protect Flash

A speed-optimized binary content moderation path for Protect — harmful / not-harmful classification when full multi-category protection is overkill.

binary Protect Flash classification

1st TypeScript SDK

Evaluate New

Custom evaluations in Observe

Run configurable custom evaluations directly on production traces from the Observe view — no exports or separate evaluation runs.

SDK New

TypeScript @traceai/fi-core v0.1.0

The first official TypeScript SDK. Full tracing and evaluation for Node.js, Deno, and edge runtimes like Cloudflare Workers.

Platform Improved

API-based pricing for evaluations and error localizer

Pay-per-use pricing for evaluation runs and error localization, replacing fixed-tier limits. Start small, scale without hitting plan boundaries.

Platform Improved

Stop streaming for long-running prompts

Cancel in-progress LLM generations mid-stream to save time and tokens on runaway outputs. The partial output is preserved so you can still analyze what went wrong.

Evaluate Improved

Evaluations in the prompt Workbench

Run evaluations directly inside the Workbench to score prompt outputs without leaving the editor.

Platform Improved

Add to dataset from Prototype and Observe

Push prompt-response pairs from Prototype directly into any evaluation dataset, and add traces from Observe into datasets with one click. Closes the loop between iteration and systematic evaluation.

Evaluate Improved

Import saved prompts in datasets

Pull prompts from your saved prompt library directly into dataset rows for evaluation.

Evaluate Fixed

Feedback enhancement system

Structured feedback collection on evaluation results to continuously improve eval accuracy.

Read full digest

May 12, 2025

W20 May 12 – May 16, 2025

Workbench V2, Custom Evaluations Revamp, and SDK Updates

A rebuilt Workbench for prompt engineering, a redesigned custom evaluation builder with judge-model selection, and three traceAI SDK packages with audio, image, and multimodal support.

Platform New

Workbench V2

A rebuilt prompt engineering environment. New multi-section prompt editor, resizable playground layout, prompt cards to organize your prompt library, and inline editing in the test case grid.

3 new SDK versions

12 prompt templates

Evaluate New

Custom evaluations revamp with model dropdown

Redesigned custom evaluation builder. Pick which LLM serves as the judge — and compare how different judges score the same data.

Monitor Improved

Annotations revamp with add/compare flow

Cleaner annotation workflows. The add flow is a single panel instead of a multi-step modal. Compare places two annotated traces side by side with differences highlighted.

Evaluate Improved

Sheet UI revamp for datasets

Refined spreadsheet interface with arrow-key cell navigation, keyboard shortcuts, and better scroll performance on datasets with thousands of rows.

Evaluate Improved

Import saved prompts into datasets

Pull prompts from your saved prompt library directly into dataset rows for evaluation.

SDK Improved

traceAI core v0.1.4

Audio evaluation support and prototype evaluation validation in the core SDK.

SDK Improved

traceAI OpenAI v0.1.3

The OpenAI instrumentor (library that captures OpenAI API calls automatically) now covers audio generation and image generation models.

SDK Improved

traceAI LangChain v0.1.4

The LangChain instrumentor adds image extraction from multimodal chains and support for OpenAI's Computer Use Agent.

Evaluate Fixed

Column configure dropdown in compare view

Choose which columns to display in comparison views to focus on the metrics that matter.

Evaluate Fixed

Delete dataset functionality

Clean up old or unused datasets with a delete option and confirmation safeguard.

Read full digest

Apr 28, 2025

W18 Apr 28 – May 2, 2025

Diff View in Experiments, Audio Across the Platform, and Run-Insight Views

Compare two experiment runs side by side, play audio inline in traces and datasets, and see every evaluation run at a glance with insight summaries.

Evaluate New

Diff view in experiments

Compare two experiment runs side by side. Output differences, score deltas, and configuration changes highlighted inline.

2 experiment configurations compared

3-min KB tutorial walkthrough

Platform New

Audio support across Observe and datasets

Audio is now built in across the platform — play clips inside any trace, view waveforms in dataset tables.

Evaluate Improved

Run insight views

Every evaluation run now opens to a summary view: pass/fail distribution, score trends across runs, and outliers called out automatically.

Platform Improved

Rate-limit error UX and alerts

Clearer error messages when API rate limits are hit, plus in-app alerts when limit issues are detected.

Platform Improved

Knowledge base tutorial video

An embedded 3-minute walkthrough video inside the Knowledge Base section — pairs with the Prototype V2 knowledge base UI shipped in w16.

Evaluate Improved

Compare run eval revamp

Refined comparison view for evaluation runs with cleaner score display and easier navigation between runs.

Platform Improved

Prototype Observe filters

Apply the same filter vocabulary used in Observe to Prototype runs — consistency across the two surfaces.

Evaluate Fixed

Synthetic data UI improvements

Cleaner interface for synthetic data generation with better progress indicators and batch controls.

Read full digest

Apr 14, 2025

W16 Apr 14 – Apr 18, 2025

Prototype V2 and Audio Evaluations

A rebuilt prompt engineering environment with a built-in knowledge base, evaluations that run on the actual audio of voice calls, and a guided walkthrough for new users.

Platform New

Prototype V2 with knowledge base UI

A rebuilt prompt engineering environment. Attach reference docs and few-shot examples next to your prompt, watch a 3-minute tutorial in-app, and iterate without switching tabs.

6 new eval templates

2x faster dataset loading

Evaluate New

Changelog

Subscribe to changelog

Self-Host in One Command, Jinja2 Prompts, and Polish Across Evals and Observability

Self-Host Future AGI in One Command

Prompts now support Jinja2 templates alongside Mustache

Reliable voice evals, no dependency on provider URLs

Export annotations and eval scores alongside traces to datasets

Filter and export AI gateway request logs

Reach into nested JSON fields from eval and prompt variables

Filter Observe by trace ID or span ID, plus Select all view columns

Faster session list on high-volume traces

API columns: smoother edits, visible progress

Cleaner eval columns

Add traces to annotation queues from any source

Trace attributes: long values expand, rows are easier to scan

Error Feed comes to voice and simulation projects

Voice analytics: consistent units and a tighter default view

Workspace invites work for existing org members

j and k row-navigation shortcuts no longer hijack text input

Evals Revamp, Experiment V2, Observe Revamp, and Error Feed

Evals Revamp: Score Against the Tools and Data Your Agent Actually Uses

Experiment V2: From Prompts to Agents

Observe Revamp: Plain-English Filters, Custom Views, and Imagine-with-Falcon Charts

Error Feed: Cluster, Triage, and Push to Linear in One Click

Custom pricing plans with billing dashboard v2

Standardised search bar across list pages

Annotator picker: search and add at scale

Custom dashboards polish

Agent definition API revamp

Simulate analytics API

All Attributes JSON viewer search filter

Voice AI Production-to-Simulation, Annotation Queue Assignment, and API Docs Improvements

Voice AI: Production to Simulation in One Click

API Reference: Everything You Need on One Page

Annotation Queue: Faster Review, Stronger Agreement

Voice Metrics in Call Lists

Falcon AI and 4x Faster Frontend

Falcon AI

Frontend speed improvements

Voice simulation moves to LiveKit

ClickHouse Replicated MergeTree migration

Prompt generation and improvement

Custom Dashboards, MCP Server, 2FA with Passkeys, and Annotation Queues

Custom Dashboards

MCP Server

Two-factor authentication with passkeys

Role-based access control

Annotation Queues

Agent Command Center: rebuilt architecture

Integrations hub

Pricing for 2,373+ models

Falcon AI: skills UI and settings redesign

Chat history popover and feedback affordances

Task description field for optimizers

Multi-breakdown analytics

Central evaluation metrics rollups

Unified tracing across every traceAI language

Enterprise multi-org architecture

Outbound calls support

futureagi-mcp-server tools

Agent Command Center, Agent Playground, and ClickHouse Migration

Agent Command Center

Agent Playground

ClickHouse migration for trace storage

Annotation queue

Multi-region support

Dataset and simulation analytics

Real-time updates for key revocation

Per-key authentication for Agent Command Center

ai-evaluation 1.0, Deep Space Theme, Multi-Language SDKs, and Multimodal Workbench

ai-evaluation v1.0.0

Deep Space dark mode migration

traceAI C# SDK

traceAI Java SDK

31 new TypeScript instrumentor packages

Multimodal Prompt Workbench with WebSocket streaming

Graph-aware dataset scenario generation

WebSocket simulation grid

Langfuse integration

Voice observability annotations