Changelog

Weekly digests of everything we ship. New features, improvements, and fixes to the Future AGI platform.

Jun 8, 2026

W24 May 26 – Jun 8, 2026

Perplexity ships Sonar and gpt-5.1 to self-hosted Future AGI, plus polish across Evals, Observe, and Platform

Straight from Perplexity. The team sent a PR into the open-source Future AGI repo with first-party support for every current Perplexity model on self-host. Plus a new customer-agent task-completion eval, faster audio eval runs when there's no feedback to retrieve, filter and column fixes across Observe, and a pricing and usage page that renders correctly.

API New

Perplexity contributes Sonar and gpt-5.1 to self-hosted Future AGI

The agentic eval Azure callback no longer crashes when `invocation_params` comes back as `None`. Useful when Azure returns a short-form response that omits the params block.

May 25, 2026

W22 May 12 – May 25, 2026

Evals on Traces and Sessions, Configurable Eval Context, and Polish Across Evals, Observe, and Simulate

Every eval type (LLM-as-Judge, Code, and Agent) can now score at every level: spans, traces, and sessions. Eval setup also gets simpler: turn on context injection and skip variable mapping entirely. The eval reads the context on its own. Also new: two conversation evals, Dead Air Detection (a Code eval at zero LLM cost) and Conversation Hallucination. Plus eval inputs up to 200K characters, partial inputs as warnings, custom dotted paths in mapping, span-level fields and API columns, and polish across Observe and Simulate.

Evaluate New

Score traces and sessions with any eval

You can now run any eval (LLM-as-Judge, Code, or Agent) at three levels: on one step the agent took (a span), on a full run (a trace), or on a whole conversation (a session). Composite evals work at all three too. A composite eval is one score that combines several evals. Inside a composite, each eval can have its own settings, like a minimum word count or a regex pattern. In Tasks, you can also type custom paths to map deeper fields, not just pick from the dropdown. And if some mapped variables are missing, the eval runs with a warning instead of failing. It still errors only if all of them are missing. Built-in system evals stay strict.
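
The missing-variable rule described above can be sketched in a few lines of Python. This is an illustration of the stated behaviour, not the platform's implementation: `resolve_inputs`, `MissingInputsError`, and the convention that `None` marks an unresolved field are all assumptions made for the example.

```python
import warnings


class MissingInputsError(ValueError):
    """Raised when none of the mapped variables could be resolved (hypothetical)."""


def resolve_inputs(mapped: dict) -> dict:
    """Keep the variables that resolved; warn on partial gaps, fail on total gaps.

    `mapped` maps variable names to resolved values, with None meaning the
    field was not found. That shape is an assumption made for this sketch.
    """
    present = {name: value for name, value in mapped.items() if value is not None}
    missing = sorted(name for name, value in mapped.items() if value is None)

    if mapped and not present:
        # Every mapped variable is missing: the eval cannot run at all.
        raise MissingInputsError(f"all mapped variables are missing: {missing}")
    if missing:
        # Some variables are missing: run anyway, but surface a warning.
        warnings.warn(f"running with missing variables: {missing}")
    return present
```

Calling `resolve_inputs({"input": "hi", "context": None})` returns `{"input": "hi"}` and emits one warning, while a mapping whose values are all `None` raises `MissingInputsError`.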

Evaluate New

Skip variable mapping with context injection

You can now skip variable mapping. Turn on context injection and the eval reads the context on its own. Just pick what you want the eval to see: a dataset row, a span, a trace, a session, or a voice call. This works for every eval type (LLM-as-Judge, Code, and Agent). Inside a Task, the right context is picked for you based on the row type. For large traces or sessions, the eval looks at just the spans it needs, not the whole thing.

Starting a simulation against a scenario that hadn't finished generating used to give you zero connected calls and a permanent `Pending`. Run New Simulation now waits until every selected scenario is fully ready before it activates. The Edit Agent form used to leave you guessing whether the agent picked up provider changes; it now auto-syncs with the provider the same way Create Agent does, with a clear syncing indicator and an error notification on failure. And when every call in a run errored out, the eval bar used to disappear entirely; it now shows zero so the run still reports a meaningful result.

API Improved

AI gateway: Bedrock structured output and streaming failover

If you're calling Bedrock through the gateway, you can now use the same structured-output format you use for OpenAI and Gemini. The gateway translates it for Bedrock automatically, so you don't need provider-specific request shaping. And streaming chat with provider failover handles a common silent failure: when the upstream stream returns nothing, the gateway switches to the fallback model before the client sees a broken response.
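
The empty-stream failover described above comes down to peeking at the first chunk before anything reaches the client. The sketch below shows that idea only. `stream_with_failover` and its two callables are hypothetical names, and the gateway's real logic is not documented here.

```python
from typing import Callable, Iterable, Iterator


def stream_with_failover(
    primary: Callable[[], Iterable[str]],
    fallback: Callable[[], Iterable[str]],
) -> Iterator[str]:
    """Yield chunks from `primary`, or from `fallback` if primary yields nothing."""
    chunks = iter(primary())
    try:
        first = next(chunks)  # Peek before anything is sent to the client.
    except StopIteration:
        # Empty upstream stream: switch models before the client sees anything.
        yield from fallback()
        return
    yield first
    yield from chunks
```

This sketch only covers a stream that produces zero chunks. Once the first chunk has been forwarded, it stays with the primary stream.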

May 11, 2026

W20 Apr 28 – May 11, 2026

Self-Host in One Command, Jinja2 Prompts, and Polish Across Evals and Observability

Self-host Future AGI in one command with pre-built images and a Windows installer. Prompts add Jinja2 alongside Mustache. Plus Request Explorer and polish.

Platform New

Self-Host Future AGI in One Command

Self-hosting Future AGI is one script: clone the repo, run `bin/install`, log in. The script pulls pre-built service images from Docker Hub, brings the stack up in about 30 seconds, and works on macOS, Linux, and Windows (through Git Bash, WSL, or a new native PowerShell installer). The first account is created from the command line, with no email server required.

Platform New

Prompts now support Jinja2 templates alongside Mustache

Use `{% if %}` for conditionals, `{% for %}` for loops, and Jinja2 filters to transform values inline. The existing Mustache `{{ variable }}` syntax keeps working, and a Template Format dropdown picks between the two per prompt.
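
To get a feel for the syntax, here is a small sketch that renders a Jinja2 template locally with the open-source `jinja2` Python package. The variable names (`user`, `topics`) and the `render_prompt` helper are made up for illustration, and it is an assumption that the platform renders templates the same way the library does.

```python
from jinja2 import Template

# A prompt template mixing a conditional, a loop, and two built-in filters.
PROMPT = (
    "Hello {{ user.name | title }}."
    "{% if user.premium %} Thanks for being a premium member.{% endif %}"
    " Topics: {% for t in topics %}{{ t | upper }}"
    "{% if not loop.last %}, {% endif %}{% endfor %}"
)


def render_prompt(user: dict, topics: list) -> str:
    """Render the prompt locally with the jinja2 library (outside the platform)."""
    return Template(PROMPT).render(user=user, topics=topics)
```

With `user={"name": "ada lovelace", "premium": True}` and `topics=["billing", "refunds"]`, this renders `Hello Ada Lovelace. Thanks for being a premium member. Topics: BILLING, REFUNDS`.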

The j and k keys (used to move between rows in eval and task detail panels) now yield to focused text inputs, so the letters land in your comment instead of moving the row selection.

Apr 27, 2026

W18 Apr 14 – Apr 27, 2026

Evals Revamp, Experiment V2, Observe Revamp, and Error Feed

130+ ready-made evals that pull live web data and call your tools for scoring. Agent-level experiments. Plain-English Observe filters. Clustered errors.

Evaluate New

Evals Revamp: Score Against the Tools and Data Your Agent Actually Uses

Evals now reach outside the prompt: pull live web data, query your databases, and call the same tools your agent runs against in production. 130+ ready-made evals, composite scoring in one pass, full version history, and a unified picker that opens everywhere.

Evaluate New

Experiment V2: From Prompts to Agents

Experiments now run end-to-end on agents, not just prompts. Compare prompts, agents, or stacked model variants in one run, view per-node output for each agent, score against any dataset column, and edit, stop, or rerun mid-flight.

Monitor New

Observe Revamp: Plain-English Filters, Custom Views, and Imagine-with-Falcon Charts

Observe is rebuilt around how you actually debug. Filter traces in plain English, save and switch between custom views in a click, search inline across logs and previews, turn any trace into an interactive chart with Falcon, and run evals or push to datasets without leaving the trace list.

Monitor New

Error Feed: Cluster, Triage, and Push to Linear in One Click

The Error Feed clusters failing traces by root cause, updates about 5 seconds after a trace completes, and runs at roughly 10x lower cost than its previous version. Push a cluster to Linear with one click, run on-demand Deep Analysis when you need root cause, and read failing-vs-working trace evidence side by side.

130+ ready-made evals

~5s Error Feed update lag

10x lower-cost Error Feed

1-click push to Linear

Platform New

Custom pricing plans with billing dashboard v2

Sales can configure enterprise deals with custom pricing terms, and finance can generate invoices from the admin surface. The billing dashboard is rebuilt around the new model so cost breakdowns, usage trends, and per-key spend match the actual contract.

Platform Improved

Standardised search bar across list pages

Every list page in the platform now uses the same search bar, so the keyboard shortcut, the search syntax, and the result behaviour stay the same whether you're searching traces, datasets, evals, or experiments. No re-learning per surface.

Evaluate Improved

Annotator picker: search and add at scale

Search and add annotators directly from the picker, even on workspaces with a large user list. Infinite scroll loads results as you go, and server-side search returns matches instantly.

Platform Improved

Custom dashboards polish

A round of polish on Custom Dashboards: the time range syncs to the widget editor, widgets resize cleanly, and description fields render instantly, with fixes to the empty state on metric removal, the visibility of the filter Add button, and the position of the dashboard description.

API Improved

Agent definition API revamp

Agent definition endpoints now share one consistent payload shape across create, read, update, and delete (instead of variations per operation), so SDK and integration code can drop the per-endpoint marshalling logic and use a single type.

API Improved

Simulate analytics API

New analytics API for simulation data: pull aggregate metrics, breakdowns, and trends for any simulation run.

Monitor Improved

All Attributes JSON viewer search filter

The All Attributes JSON viewer (in the trace detail drawer) gets a search box that filters the JSON tree as you type. Useful when a trace has hundreds of attributes and you only want to see the ones matching a specific key or value.

Apr 13, 2026

W16 Mar 31 – Apr 13, 2026

Voice AI Production-to-Simulation, Annotation Queue Assignment, and API Docs Improvements

Turn any live voice call into a simulation test case, manually assign annotation queue items, and navigate API docs with full context on a single page.

Simulate New

Voice AI: Production to Simulation in One Click

Take any production voice call and turn it directly into a simulation test case. Rerun it against different prompt versions, prompt chains, or agent definitions and compare behaviour on real inputs.

API New

API Reference: Everything You Need on One Page

Each endpoint now renders a cURL example, the expected response, and full parameter details side by side. No tab switching, no scrolling between sections when you are writing integration code.

Evaluate New

Annotation Queue: Faster Review, Stronger Agreement

Multi-assignment, prefetching, and manual annotator assignment for annotation queues. Reviewer approval is now optional and can be toggled per workflow.

1-click production call to simulation

days → seconds regression to reproducible test

Simulate Improved

Voice Metrics in Call Lists

New voice metrics are available as columns in the call list view, so you can filter and triage calls by latency, interruptions, or turn duration without opening each one.

Mar 30, 2026

W14 Mar 17 – Mar 30, 2026

Falcon AI and 4x Faster Frontend

Falcon AI: context-aware assistant that debugs traces, scaffolds simulations, drafts evals. Plus 4x faster frontend, ClickHouse Replicated, LiveKit voice.

Platform New

Falcon AI

Context-aware AI assistant embedded in the platform for trace debugging, simulation creation, evaluation building, and dataset construction.

4x faster page loads

page-aware Falcon AI

high-availability ClickHouse trace storage

Platform New

Frontend speed improvements

Four targeted optimizations delivering 4x faster page loads across the platform.

Simulate Improved

Voice simulation moves to LiveKit

Voice simulation now runs on LiveKit instead of Vapi, with lower latency and more natural conversational dynamics during simulated calls.

Monitor Improved

ClickHouse Replicated MergeTree migration

Trace storage upgraded to Replicated MergeTree for high availability and automatic failover.

Platform Improved

Prompt generation and improvement

AI-assisted prompt generation from task descriptions and one-click prompt improvement suggestions.

Mar 16, 2026

W12 Mar 3 – Mar 16, 2026

Custom Dashboards, MCP Server, 2FA with Passkeys, and Annotation Queues

Drag-and-drop dashboard builder, an MCP server that puts Future AGI in your IDE, 2FA with passkeys, full Annotation Queue workflows, and a rebuilt ACC.

Platform New

Custom Dashboards

Drag-and-drop dashboard builder for tracking agent performance across evaluation scores, system metrics, cost, and experiment progress.

SDK New

MCP Server

Backend coverage for outbound voice calls in simulation extended: more provider-side call states tracked, more failure modes surfaced cleanly. Builds on the outbound calling flow already in Simulate.

SDK Improved

futureagi-mcp-server tools

futureagi-mcp-server exposes the platform's tool surface to your IDE assistant: evaluations, datasets, prompts, experiments, simulations, tracing, agents, annotations, optimization, usage, and more. Your AI editor calls Future AGI directly without context-switching to the dashboard.

Mar 2, 2026

W10 Feb 17 – Mar 2, 2026

Agent Command Center, Agent Playground, and ClickHouse Migration

Agent Command Center: routing, guardrails, fallbacks, per-key cost controls. Agent Playground: visual multi-step graph builder. Plus ClickHouse migration.

Platform Guard New

Agent Command Center

Multi-provider routing, API key management, inline guardrails, automatic fallbacks, per-key cost tracking, and real-time analytics for every LLM call.

Agents New

Agent Playground

Visual graph builder for multi-step agents: two node types (LLM prompt + Agent), typed node connections, global variables, draft/publish workflow, version management, workflow execution control, and a programmatic graph API.

15+ LLM providers

6 load balancing strategies

2 data regions

Monitor New

ClickHouse migration for trace storage

Trace queries that used to take seconds now return in milliseconds, even on workspaces with millions of spans. Aggregations and filters across long time ranges feel interactive instead of batch.

Evaluate New

Annotation queue

Create annotation queues for traces, sessions, datasets, and simulation outputs to organize human review workflows.

Platform New

Multi-region support

Deploy and store data in US or EU regions for compliance with data residency requirements.

Evaluate Improved

Dataset and simulation analytics

One API surface returns both dataset quality metrics and simulation result trends, so external dashboards or BI tools can pull both without stitching together separate endpoints.

Guard Improved

Real-time updates for key revocation

Instant API key revocation across all replicas via real-time updates, closing the window between revocation and enforcement.

API Improved

Per-key authentication for Agent Command Center

Per-key auth and attribution for Agent Command Center: each team or workload runs through its own API key, with request validation, usage attribution, and downstream tracking all flowing from a single per-key identity. Drop-in for any existing OpenAI-compatible client.

Feb 16, 2026

W8 Feb 3 – Feb 16, 2026

ai-evaluation 1.0, Deep Space Theme, Multi-Language SDKs, and Multimodal Workbench

ai-evaluation SDK 1.0: unified evaluate API, multimodal judge, 72+ metrics. Deep Space dark mode. C#/Java SDKs + 31 TS instrumentors. Multimodal Workbench.

Evaluate New

ai-evaluation v1.0.0

The evaluation SDK's stable release: unified evaluate API, multimodal LLM judge, auto-generated grading criteria, vector-store feedback loops, 72+ local metrics, OpenTelemetry integration, streaming evaluation.

72+ ai-evaluation metrics

4 SDK languages (Py, TS, C#, Java)

31 new TypeScript instrumentors

Platform New

Deep Space dark mode migration

Comprehensive monochrome theme across the entire platform, with every component, every surface, and every interaction state rethought for visual consistency and reduced eye strain.

SDK New

traceAI C# SDK

Full traceAI instrumentation for C# applications and .NET environments. ASP.NET services, Azure Functions, and standalone apps get automatic instrumentation for LLM calls, tool invocations, and agent workflows.

SDK New

traceAI Java SDK

Java SDK with 25 instrumentation modules covering Spring Boot, Micronaut, Quarkus, Apache HttpClient, OkHttp, JDBC drivers, and major Java-based LLM client libraries.

SDK New

31 new TypeScript instrumentor packages

Large expansion of TypeScript instrumentation covering frameworks, databases, HTTP clients, and more, the long tail that TypeScript agent developers rely on.

Platform New

Multimodal Prompt Workbench with WebSocket streaming

The Prompt Workbench goes multimodal end-to-end, with WebSocket-streamed responses that render in real time.

Simulate New

Graph-aware dataset scenario generation

Dataset-based scenarios now generate with branch-context. The LLM call that generates each scenario understands which branch of the graph it's in, producing scenarios that stay consistent with surrounding branches.

Simulate Improved

WebSocket simulation grid

Simulation grid updates stream to the UI via WebSocket instead of polling.

Monitor Improved

Langfuse integration

Configure Langfuse directly from the platform UI, with Langfuse-compatible endpoints for Vapi-routed traffic.

Monitor Improved

Voice observability annotations

Annotation workflow now extends into voice observability. Add annotations directly on voice traces with dedicated filters.

Simulate Improved

Provider-agnostic voice simulation runs

Voice simulation runs are now provider-agnostic, with LiveKit signal monitoring giving precise call-state tracking across providers.

SDK Improved

traceAI Python SDK update

New framework support and end-to-end tests for the Python traceAI SDK.

Platform Improved

Per-tab workspace context

Each browser tab keeps its own workspace context: production traces in one tab, staging simulations in another, no cross-tab interference.

Evaluate Improved

Function parameters in evaluations

Pass function parameters directly to evaluation metrics for dynamic scoring configurations.

Evaluate Improved

Reasoning parameters support in prompts

Configure reasoning-specific parameters (reasoning effort, max thinking tokens, whether to surface the reasoning trace) when working with chain-of-thought models in the Prompt Workbench. The same prompt can be tuned for fast and cheap or slow and thorough without rewriting it.

Monitor Improved

has_eval filter for traces and spans

Filter traces and spans in Observe by whether they have evaluations attached.

Monitor Improved

Workspace-scoped error analysis (Feed display)

Workspace-aware error analysis: the Error Feed surfaces failures from the active workspace, keeping triage focused on the project you're debugging. Cleaner empty states for users still being onboarded to projects.

Monitor Improved

Optimized trace session queries

Filtering and sorting on trace sessions now happen in the database layer instead of in app code, so session views load quickly even on workspaces with millions of spans. Most visible when filtering long lists by date or status.

Feb 2, 2026

W6 Jan 20 – Feb 2, 2026

Simulate from Prompt Workbench, Voice Annotations, and Agent Health for Voice Agents

Launch simulations without leaving the Prompt Workbench, annotate voice calls with structured human feedback, and extend Agent Compass health to voice.

Simulate New

Simulate from Prompt Workbench

Evaluate Improved

Function evaluations in test evaluations

Function-type evaluations (deterministic Python or JavaScript checks you author yourself, not LLM-judged) now run inside test evaluation workflows. Useful when pass/fail is logic, not opinion: schema validation, exact-match comparisons, custom string parsing.
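
As a rough idea of what such a check can look like, the sketch below combines schema validation with an exact-match comparison. The function name, its arguments, and the `order_id`/`status` fields are hypothetical. The signature the platform expects for function evaluations is not specified here.

```python
import json


def eval_order_response(output: str, expected_status: str) -> bool:
    """Pass only if `output` is JSON with the required keys and the expected status."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return False  # Not valid JSON: fail deterministically.
    if not isinstance(payload, dict):
        return False
    # Schema check: required keys must be present and be strings.
    if not isinstance(payload.get("order_id"), str):
        return False
    if not isinstance(payload.get("status"), str):
        return False
    # Exact-match comparison on the parsed field.
    return payload["status"] == expected_status
```

The same input always produces the same verdict, which is the point of a function-type check: no model call, no sampling, no judge prompt to tune.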

API Improved

Simulate API changes for run tables and optimization

Simulate run tables and optimization endpoints have cleaner request shapes and more consistent error responses, so existing API consumers can drop the one-off conditionals they keep for individual endpoints.

Jan 19, 2026

W4 Jan 6 – Jan 19, 2026

Baseline Chat Comparison, Fix My Agent Polish, and OpenTelemetry Instrumentation

Jan 5, 2026

W2 Dec 23 – Jan 5, 2026

Chat Simulation via Observe, Pre-Built Evaluation Groups, and Fix My Agent for Chat

Launch chat simulations directly from real production conversations, 10 ready-to-use evaluation groups with no configuration, and Fix My Agent for chat.

Simulate New

Chat Simulation via Observe

Evaluate Improved

Edit Run Experiment

Experiment configurations can now be edited mid-run (swap a model variant, adjust a scoring threshold) without restarting. Configuration history is tracked per data point so the audit trail stays clean.

Dec 22, 2025

W52 Dec 9 – Dec 22, 2025

Chat Simulation V1, Agent Prompt Optimiser, and Reliability Upgrades

Simulation for text chat agents, a six-strategy automated prompt optimiser, selective optimisation against specific calls, and resilience on restarts.

Simulate New

Chat Simulation V1

Full simulation engine for chat-based agents with persona-driven conversations, scenario generation, and in-drawer analytics.

6 Optimisation strategies (agent-opt)

200+ Conversation turns per simulation

Agents New

Agent Prompt Optimiser

Automated prompt optimisation runs against your evaluation data: six strategies wired into the platform with a results UI.

Agents New

Optimize My Agent V3: targeted optimisation

Select specific calls to optimise against, instead of optimising across the whole dataset. Direct the optimiser at the failure modes you actually care about.

Simulate New

Create scenario from Observe

Convert any real production session in Observe (the view of your live production traces) into a reusable simulation scenario with one click.

Simulate Improved

Replay sessions from real traces

Re-run historical production conversations through your current agent configuration: regression tests built from production traffic.

Platform Improved

Dot notation for JSON column variables

Reference nested JSON fields with dot notation (e.g., `user.profile.language`) in prompt templates and experiment configurations.
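
For intuition, resolving a dotted path against a JSON column is a walk through nested objects, one key per segment. The helper below is a minimal sketch of that lookup. `resolve_path` is a hypothetical name, and how the platform treats missing keys or list indices is not covered here.

```python
from typing import Any


def resolve_path(record: dict, path: str, default: Any = None) -> Any:
    """Follow a dotted path such as 'user.profile.language' through nested dicts."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default  # Path breaks here: fall back to the default.
        current = current[key]
    return current
```

Given the row `{"user": {"profile": {"language": "fr"}}}`, `resolve_path(row, "user.profile.language")` returns `"fr"`, and a path that does not exist returns the default.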

Platform Improved

Document format preview in Dataset and Experiment

Inline preview for documents referenced in datasets and experiment results. No download needed.

Simulate Improved

Instruction input in scenario creation

Describe what you want to test in natural language and the system generates scenarios accordingly.

Evaluate Improved

Evals filtering in dataset summary

Filter evaluation results inside the dataset summary view for faster drill-down.

Monitor Improved

Agent-centric metrics and call log improvements

Per-agent metrics now appear on the dashboard, so agent quality is comparable directly across agents instead of aggregated. Call log capture broadened to record more of what the agent does inside a call.

Evaluate Improved

Edit synthetic data configuration

Modify synthetic-data generation configuration after a run has started.

Dec 8, 2025

W50 Nov 25 – Dec 8, 2025

Fix My Agent, Persona Management Suite, and JSON Input/Output in Sessions

Context-aware debugging that explains why a simulation failed and how to fix it, full lifecycle for personas, structured JSON sessions, Optimiser backend.

Agents New

Fix My Agent

Platform Improved

Workspace issues view

Dedicated view that surfaces issues affecting the whole workspace (not just a single project), so admins can spot platform-level problems without drilling into each project individually.

Nov 24, 2025

W48 Nov 11 – Nov 24, 2025

Multi-Branch Scenarios, Custom Background Noises, and Critical-Issue Feed in Simulate

Scenarios that branch into multiple conversation paths, ambient noise profiles that push simulations closer to production, and a critical-issue feed.

Simulate New

Multi-branch scenario generation

Evaluate Improved

Scenario column support in evals and run test

Scenario columns are now available inside evaluation runs and run-test results, providing richer test data without leaving the evaluation view.

Simulate Fixed

Simulated assistant call ending fixes

Simulated assistant calls now end reliably when the agent finishes its final turn or hits a configured timeout, instead of leaving the session hanging.

Nov 10, 2025

W46 Oct 28 – Nov 10, 2025

Simulation Call Observability, Retell and Outbound Calls in Simulate, Tool Evaluation

Logs, latency, cost on every simulation call. Retell agents, outbound calling, tool-level verification in Simulate. Plus editable personas and Run Prompt.

Simulate New

Logs, latency, and cost breakdown on simulation calls

Platform Improved

Session history enhancements

Session history now renders full transcripts instead of truncating long ones, and supports more locales. A session shows everything your users actually said.

Monitor Improved

Observe homepage revamp

Observe landing page now leads with recent traces and active alerts, so the most common starting points (jump to a trace, check an alert) take fewer clicks. Initial load is noticeably faster on high-volume workspaces.

Oct 27, 2025

W44 Oct 14 – Oct 27, 2025

Credit Usage Revamp, Multi-Language Agents, and New TTS Providers

Workspace credit attribution, a 3-step guided agent builder with multi-language, rebuilt Prompt Workbench with commit history, and 4 new TTS providers.

Platform New

Credit usage summary redesign

Simulate Improved

Detailed voice provider logs

Full request and response logs for every voice provider interaction during simulation.

Simulate New

New TTS model integrations

Four new text-to-speech (TTS) providers (Cartesia, Hume, Neuphonics, and LMNT) are now available in simulation.

Oct 13, 2025

W42 Sep 30 – Oct 13, 2025

ai-evaluation SDK v0.1.5, Personas, and Run-Prompt Enhancements

ai-evaluation SDK launches with 50+ templates. Pre-built and custom personas come to simulation, with dataset-derived personas from real call transcripts.

SDK New

ai-evaluation v0.1.5 launch

Initial release of the ai-evaluation SDK with 50+ evaluation templates covering faithfulness, relevance, safety, and domain-specific metrics.

50+ Eval templates in ai-evaluation

3 Persona sources

Simulate New

Pre-built and custom personas

Pre-built caller personas, custom persona definitions, and dataset-derived personas generated from your real call transcripts.

Evaluate Improved

Provider transcript as evaluation attribute

The voice provider's native transcript is now available as an evaluation input, useful for comparing automatic speech recognition (ASR) accuracy and response quality side by side.

Platform Improved

Enhanced onboarding flow

First-run setup branches by user role, so each new user lands on the steps that match how they will use the platform instead of a single generic walkthrough. Time from sign-up to first evaluation drops accordingly.

Simulate Improved

Voice output in Run Prompt and Run Experiment

Generate and evaluate spoken responses directly from the prompt playground and experiment workflows.

Simulate Improved

Manual, AI-generated, and dataset-sourced scenario rows

Three ways to add scenario rows to a simulation: type them in, generate them with AI, or import from an existing dataset.

Evaluate Improved

Run evaluations on completed test runs

Apply new evaluation criteria to previously completed simulation runs, with no need to re-execute the calls.

Agents Improved

Agent definition version selection in simulation

Select a specific agent definition version when running a simulation, so regression testing and A/B comparisons become straightforward.

SDK Improved

ai-evaluation v0.2.1

Batch evaluation support and bias detection added to the ai-evaluation SDK.

SDK Improved

traceAI OpenAI Agents SDK support

Native instrumentation for OpenAI's Agents SDK that captures tool calls, agent handoffs, and multi-agent orchestration as traces.

Platform Fixed

Updated pricing calculation in Observe

Trace cost is calculated as spans land in Observe, not after a post-processing pass. The cost figures on the dashboard and on the underlying trace stay in sync, so usage reporting matches what the trace view shows.

Sep 29, 2025

W40 Sep 16 – Sep 29, 2025

Voice Observability for Vapi, Retell, and ElevenLabs; Eval Groups in Experiments; Simulate via SDK

Observability ships for three voice platforms at once. Evaluation groups integrate with experiments and optimization. Call Simulation now SDK-triggerable.

Monitor New

Voice Observability for Vapi, Retell, and ElevenLabs

Observability ships for three voice platforms at once: Vapi, Retell, and ElevenLabs. Call metrics, transcript analysis, and utterance-level tracking on the same trace and evaluation surface you already use for text agents.

3 Voice platforms supported

60-70% Cost reduction via SDK simulation

Evaluate New

Eval groups in experiments and optimization

Evaluation groups now integrate with experiments and agent optimization. Run a whole group of evaluations together and optimize against group-level scores.

SDK New

Simulate via SDK

Trigger Call Simulation programmatically through the SDK against Vapi- or Retell-backed voice agents. Plug simulation into CI/CD pipelines, scheduled jobs, or custom orchestration.

Simulate Improved

Selective test rerun in Simulate

Rerun only the specific failed or flagged test cases in a simulation suite, without re-executing the entire suite.

Evaluate Improved

Default eval groups

Pre-built evaluation groups for common use cases: retrieval-augmented generation (RAG), computer vision, conversational AI.

Simulate Improved

Advanced simulation management

Auto-refresh on the simulation dashboard, stop-simulation control for running suites, and a visual workflow trace of the execution path for each scenario.

Platform Improved

Workbench revamp

Slide-out code drawer in the Workbench keeps prompt code visible while you iterate, so you don't tab-switch to inspect what you're running. The header is restructured to keep run controls reachable without scrolling.

SDK Improved

traceAI session support

Native `session.id` attribute support in traceAI, with automatic session grouping across all instrumented frameworks and no custom middleware.
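
Conceptually, session grouping means bucketing spans by the value of their `session.id` attribute. The sketch below shows that bucketing on plain dictionaries. The span shape (`name` plus an `attributes` dict) and the `group_spans_by_session` helper are assumptions for illustration, not traceAI's data model or API.

```python
from collections import defaultdict


def group_spans_by_session(spans: list) -> dict:
    """Group span names by their `session.id` attribute, skipping spans without one."""
    sessions = defaultdict(list)
    for span in spans:
        session_id = span.get("attributes", {}).get("session.id")
        if session_id is not None:
            sessions[session_id].append(span["name"])
    return dict(sessions)
```

Spans that share a `session.id` land in the same bucket regardless of which framework emitted them, which is why no custom middleware is needed once the attribute is set.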

Platform Fixed

Agent definition design changes

Agent configuration page redesigned for agents with many prompts and tools. Sections stay grouped and reachable as the configuration grows past a handful of nodes.

Sep 15, 2025

W38 Sep 2 – Sep 15, 2025

Automated Scenario Builder, Agent Definition Versioning, and Simplified Session Tracking

Upload an SOP or call transcript to auto-generate test scenarios with edge cases. Commit-style version control for agents. Session observability per span.

Simulate New

Automated scenario and workflow builder

Platform Fixed

Annotation and prompt import fixes in datasets

Resolved issues with importing annotations and prompt data into datasets for smoother data pipeline operations.

Sep 1, 2025

W36 Aug 19 – Sep 1, 2025

Agent Compass, Annotation Quality Dashboard, and Enterprise Multi-Workspace Security

Zero-config performance insights on agent traces, statistical dashboards for annotator agreement, and multi-workspace isolation with audit logging.

Agents New

Agent Compass

Automatic, zero-configuration performance insights on your agent traces. Detects issues at the trace level with no evaluation setup required.

0 Config required for Agent Compass

5 Eval group templates

Evaluate New

Annotation quality dashboard

Inter-annotator agreement metrics (including Cohen's kappa) to measure how consistent your human reviewers are.
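
For reference, Cohen's kappa compares the agreement two annotators actually reach with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The function below is a minimal standalone sketch of that formula, not the dashboard's code. `cohens_kappa` is a name chosen for this example.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # p_o: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from each annotator's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both used one identical label throughout;
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, labels `["pass", "pass", "fail", "fail"]` against `["pass", "fail", "fail", "fail"]` give an observed agreement of 0.75, a chance agreement of 0.5, and a kappa of 0.5.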

Platform New

Enterprise multi-workspace security

Multiple independent workspaces with full data isolation, per-workspace role-based access control (RBAC), and cross-workspace audit logging.

Monitor Improved

Feed insights with error clusters

Observability feed now groups related errors into clusters and surfaces trend lines showing whether issues are getting better or worse.

Platform Improved

Intelligent onboarding navigation

Guided onboarding flow that adapts to your role and use case.

Simulate Improved

Voice agent testing and analytics improvements

Dashboard metrics and scenario column views for voice simulation. Call success rate, average call duration, and scenario coverage in one place.

Platform Improved

Prompt library organization

Folder-based prompt architecture with templates for structuring large prompt libraries across teams and projects.

Evaluate Improved

Evaluation grouping API

API support for evaluation grouping: organize related evaluations programmatically and run evaluation groups as test suites in CI/CD.

Platform Fixed

Enhanced plans and pricing experience

Redesigned pricing page with clearer plan comparisons and streamlined upgrade flows.

Read full digest

Aug 18, 2025

W34 Aug 5 – Aug 18, 2025

Summary Dashboards, Alerts Revamp, Prompt SDK, and Workspaces RBAC

Redesigned summary dashboards with new chart types and side-by-side compare, a rebuilt alerts system, Prompt SDK upgrades, and RBAC workspace access.

Monitor New

Summary screen revamp

Rebuilt summary dashboards. Spider, bar, and pie chart visualizations, plus side-by-side comparison between any two runs, prompt versions, or time periods.

4 Role levels

3 Chart types

Monitor New

Alerts revamp with Slack and email

Rebuilt alerts system with Slack and email notification channels, customizable thresholds, and alert grouping to reduce notification noise.

SDK New

Prompt SDK upgrades

The Prompt SDK adds caching, A/B testing, and multi-environment deployment for production-grade prompt management.

Platform New

Workspaces RBAC

Role-based access control with Owner, Admin, Member, and Viewer roles. Granular permissions on evaluations, simulations, and production traces.

Platform Improved

AWS Marketplace integration

Purchase and manage Future AGI subscriptions through AWS Marketplace with consolidated AWS billing.

SDK Improved

Error localizer via SDK

Synchronous and asynchronous standalone error localization through the SDK to pinpoint failures in agent execution chains.

Evaluate Improved

Critical issue detection on datasets

Automatic detection of critical issues in datasets with actionable advice for data quality problems.

Monitor Improved

Prompt metrics in Observe

Track trace performance per prompt version inside Observe to measure the real-world impact of a prompt change.

SDK Fixed

traceAI optional dependencies cleanup

Reduced install bloat by making framework-specific dependencies optional, cutting package size significantly.

Read full digest

Aug 4, 2025

W32 Jul 22 – Aug 4, 2025

Document Columns, Function Evaluations, and Async Evals via SDK

Upload documents into datasets with built-in OCR, write deterministic function evals for objective checks, and run evaluations async from the SDK.
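To make "deterministic function eval" concrete, the sketch below shows the general idea: a plain function that returns the same verdict for the same input, with no LLM judge involved. The function name, its parameters, and the result shape are hypothetical and are not the platform's eval interface.

```python
import json

def has_required_json_keys(output: str, required_keys: set) -> dict:
    """Objective check: is the output a JSON object with every required key?"""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {"passed": False, "reason": "output is not valid JSON"}
    if not isinstance(parsed, dict):
        return {"passed": False, "reason": "output is not a JSON object"}
    # Sort the missing keys so the reason string is stable across runs.
    missing = sorted(required_keys - set(parsed))
    if missing:
        return {"passed": False, "reason": "missing keys: " + ", ".join(missing)}
    return {"passed": True, "reason": "all required keys present"}

# Hypothetical model outputs scored by the check.
good = has_required_json_keys('{"intent": "refund", "order_id": 42}', {"intent", "order_id"})
bad = has_required_json_keys('{"intent": "refund"}', {"intent", "order_id"})
```

Because the check is pure Python, it is cheap to run on every row and gives the same answer on every rerun.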

Platform New

Document column support in datasets

Evaluate Fixed

JSON view for evals log

Inspect raw evaluation log data in structured JSON format for debugging and integration purposes.

Read full digest

Jul 21, 2025

W30 Jul 8 – Jul 21, 2025

Voice Simulation and the Evals Playground

AI-conducted phone calls test voice agents end-to-end. Plus an interactive sandbox for testing evaluations in real time, with inline scoring on any trace span.

Simulate New

Call Simulation

AI-powered simulator agents place real phone calls to your voice agents. Scenario-driven conversations with real turn-taking, interruptions, and the unpredictable patterns real users bring.

sub-second Voice latency

60-70% Cost reduction vs manual QA

Evaluate New

Evals Playground

An interactive sandbox for testing and refining evaluations before you commit. Paste a prompt-response pair, pick criteria and a judge model, see the score in real time. Also available inline inside the trace view.

Simulate Improved

Simulator agent form and agent definition dropdowns

Configure simulation agents through a form interface: select the target agent, define the simulator persona, choose scenarios, and launch. No YAML files.

Simulate Improved

Add scenarios from datasets

Import test scenarios directly from your existing datasets. Turn a collection of real customer transcripts into repeatable simulation test cases with a few clicks.

Platform Improved

Mixpanel analytics integration

Full Mixpanel integration across the platform. Track feature usage, evaluation runs, and simulation sessions to understand adoption and workflow bottlenecks.

SDK Improved

traceAI TypeScript Vercel instrumentor

Automatic observability for serverless AI deployments on Vercel. Wrap your handler, and every LLM call, tool invocation, and response is captured as a span.

Evaluate Improved

CRUD on custom evaluations

Full create, read, update, and delete operations for custom evaluations, so you can manage your evaluation library as it grows.

Evaluate Improved

Add feedback to evaluations

Attach human feedback directly to evaluation results. Build labeled datasets and improve evaluation accuracy over time.

Platform Fixed

Refresh token cycle for session management

Automatic token refresh cycle ensures uninterrupted simulation sessions without manual re-authentication.

Platform Fixed

Span name display in traces

Span names now appear directly in the trace view, making it faster to navigate complex agent execution trees.

Read full digest

Jul 7, 2025

W28 Jun 24 – Jul 7, 2025

System Metrics in Observe, Multimodal Bedrock Tracing, and Eval Playground Upgrades

Infrastructure metrics alongside agent traces in Observe, image tracing for AWS Bedrock, and standalone mode + feedback loops in the Eval Playground.

Monitor New

System metrics in Observe graph

Platform Fixed

Draft prompts

Save work-in-progress prompts as drafts without publishing them to the shared prompt library.

Read full digest

Jun 23, 2025

W26 Jun 10 – Jun 23, 2025

Alerts and Monitors, gRPC Trace Ingestion, and the Observe Graph

Real-time alerts with Slack and email, gRPC trace transport with 60% less latency, and a visual graph of agent execution inside Observe.

Monitor New

Alerts and monitors

Set thresholds on any evaluation metric, trace property, or system measurement, and get notified in Slack or email when a threshold is crossed.
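The threshold idea can be sketched in a few lines. Everything here is an illustrative assumption (the `ThresholdRule` shape, the metric names, and the channel labels); it is not the platform's alert configuration format.

```python
import operator
from dataclasses import dataclass

# Maps a rule direction to the comparison that counts as "crossed".
_COMPARATORS = {"above": operator.gt, "below": operator.lt}

@dataclass(frozen=True)
class ThresholdRule:
    metric: str       # hypothetical metric name
    direction: str    # "above" or "below"
    limit: float
    channels: tuple   # e.g. ("slack", "email")

def crossed_rules(rules, latest_values):
    """Return the rules whose metric has crossed its limit."""
    fired = []
    for rule in rules:
        value = latest_values.get(rule.metric)
        # Metrics with no reading yet are skipped rather than treated as crossed.
        if value is not None and _COMPARATORS[rule.direction](value, rule.limit):
            fired.append(rule)
    return fired

rules = [
    ThresholdRule("hallucination_rate", "above", 0.05, ("slack",)),
    ThresholdRule("task_completion", "below", 0.90, ("slack", "email")),
]
latest = {"hallucination_rate": 0.08, "task_completion": 0.95}
fired = crossed_rules(rules, latest)
```

In this example only the hallucination-rate rule fires, so only its channels would be notified.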

60% less latency with gRPC

2 notification channels

Platform New

gRPC support for trace ingestion

High-performance gRPC transport for trace data, with 60% lower latency than HTTP/JSON.

Monitor New

Observe graph visualization

An interactive directed graph view of agent execution. Each node is a span, and each edge shows the flow of execution between spans.
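One way to picture the node-and-edge structure is to derive it from parent-child links between spans. The sketch below assumes each span carries a `span_id` and a `parent_id`. Both field names are hypothetical, and the Observe graph may derive its edges differently.

```python
from collections import defaultdict

def build_execution_graph(spans):
    """Return ({parent_span_id: [child_span_id, ...]}, [root_span_id, ...])."""
    edges = defaultdict(list)
    roots = []
    for span in spans:
        parent = span.get("parent_id")
        if parent is None:
            roots.append(span["span_id"])  # no parent: execution starts here
        else:
            edges[parent].append(span["span_id"])
    return dict(edges), roots

# Hypothetical spans from a single agent run.
spans = [
    {"span_id": "agent.run", "parent_id": None},
    {"span_id": "llm.plan", "parent_id": "agent.run"},
    {"span_id": "tool.search", "parent_id": "agent.run"},
    {"span_id": "llm.answer", "parent_id": "tool.search"},
]
edges, roots = build_execution_graph(spans)
```

The resulting adjacency map is enough to feed most graph layout libraries.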

Platform Improved

Developer keys

API key management with scoped permissions: read-only, write, or admin keys, each with its own usage metrics and one-click rotation.

Platform Improved

Model serving infrastructure

New internal infrastructure for serving evaluation and guardrail models. It autoscales based on demand and is optimized for low-latency inference.

SDK Improved

traceAI gRPC transport support

Python and TypeScript SDKs now support gRPC as a trace transport alongside HTTP.

Evaluate Improved

Eval tab revamp

Redesigned evaluation results tab with better data visualization, more flexible filtering, and one-click CSV and JSON export.

Evaluate Fixed

Prompt eval updates

Improved prompt evaluation workflow with batch execution and result comparison across prompt versions.

Read full digest

Jun 9, 2025

W24 May 27 – Jun 9, 2025

Breaking Bad UI Redesign, Custom Model Endpoints, and Observe Enhancements

Platform UI redesign: new navigation, component library, consistent patterns. Azure OpenAI and self-hosted judges. New Observe filters and provider logos.

Platform New

Breaking Bad — platform update

A platform-wide release: model-choice in evaluations, enhanced error localization, extended alerts, and a redesigned UI across every surface.

3 custom judge endpoint types

3x faster dataset loading

Evaluate New

Custom model dropdown with Azure and custom endpoints

Use Azure OpenAI deployments, OpenAI-compatible endpoints, or self-hosted models as your evaluation judge. This matters for teams with data residency, cost, or fine-tuned-judge requirements.

Monitor Improved

Attribute filters in Observe

Filter traces by custom attributes, metadata, and span properties in the Observe view.

Monitor Improved

Sentry error monitoring integration

When Sentry captures an exception in your app, the corresponding trace links automatically. Full-stack debugging context from application error to agent execution path.

Evaluate Improved

Image and audio support in evaluation log table

View image and audio outputs directly in evaluation log tables with inline rendering.

Platform Improved

Faster dataset loading

3x performance improvement for dataset loading through pagination and lazy rendering.
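The general pattern behind pagination with lazy loading is to fetch one page at a time and only when the consumer asks for more rows. This sketch assumes a hypothetical `fetch_page(offset, limit)` callable and is not the platform's dataset API.

```python
from typing import Callable, Iterator, List

def iter_rows(fetch_page: Callable[[int, int], List[dict]],
              page_size: int = 100) -> Iterator[dict]:
    """Yield rows lazily, requesting the next page only when it is needed."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            return  # a short (or empty) page means there is no more data
        offset += page_size

# Stand-in for a real backend: 250 rows held in memory.
ROWS = [{"id": i} for i in range(250)]
calls = []

def fake_fetch_page(offset, limit):
    calls.append(offset)  # record which pages were actually requested
    return ROWS[offset:offset + limit]

first_three = []
for row in iter_rows(fake_fetch_page, page_size=100):
    first_three.append(row["id"])
    if len(first_three) == 3:
        break  # stop early: later pages are never fetched
```

Stopping after three rows triggers only the first page request, which is where the saving comes from when a table shows just its first screen of rows.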

Evaluate Improved

Evaluation feedback in Observe

Attach feedback directly to evaluation results viewed from the Observe surface.

SDK Improved

traceAI Google ADK support

Automatic instrumentation for Google's Agent Development Kit (ADK) with full span capture and tool-call tracing.

SDK Improved

traceAI TypeScript: new evaluation support

Expanded evaluation capabilities in the TypeScript SDK, with new metric types and batch submission.

Evaluate Fixed

Eval template validation

Automatic validation of evaluation templates before execution to catch configuration errors early.

Monitor Fixed

Provider logos for tracing

Visual provider identification in trace views with logos for OpenAI, Anthropic, Google, and other LLM providers.

Read full digest

May 26, 2025

W22 May 13 – May 26, 2025

Protect Flash, TypeScript SDK v0.1.0, and Custom Evaluations in Observe

Speed-optimized guardrails path with binary harmful/not-harmful decision, the first official TypeScript SDK, and configurable evals on production traces.

Guard New

Protect Flash

Read full digest

May 12, 2025

W20 Apr 29 – May 12, 2025

Workbench V2, Custom Evaluations Revamp, and SDK Updates

Rebuilt Workbench for prompt engineering, redesigned eval builder with judge-model selection, and three traceAI SDKs: audio, image, multimodal.

Platform New

Workbench V2

Evaluate Fixed

Delete dataset functionality

Clean up old or unused datasets with a delete option and confirmation safeguard.

Read full digest

Apr 28, 2025

W18 Apr 15 – Apr 28, 2025

Diff View in Experiments, Audio Across the Platform, and Run-Insight Views

Compare two experiment runs side by side, play audio inline in traces and datasets, and see every evaluation run at a glance with insight summaries.

Evaluate New

Diff view in experiments

Compare two experiment runs side by side. Output differences, score deltas, and configuration changes highlighted inline.

2 experiment configurations compared

3-min KB tutorial walkthrough

Platform New

Audio support across Observe and datasets

Audio is now built in across the platform: play clips inside any trace and view waveforms in dataset tables.

Evaluate Improved

Run insight views

Every evaluation run now opens to a summary view: pass/fail distribution, score trends across runs, and outliers called out automatically.
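A summary like this can be approximated from raw results. The sketch below counts passes and fails and flags rows whose score sits more than two standard deviations from the mean. The z-score rule, the cutoff, and the row shape are assumptions chosen for illustration, since the method the platform uses to detect outliers is not stated.

```python
from collections import Counter
from statistics import mean, pstdev

def summarize_run(results, z_cutoff=2.0):
    """Pass/fail counts plus ids of rows with unusually high or low scores."""
    if not results:
        return {"pass": 0, "fail": 0, "outliers": []}
    counts = Counter("pass" if r["passed"] else "fail" for r in results)
    scores = [r["score"] for r in results]
    mu, sigma = mean(scores), pstdev(scores)
    outliers = [
        r["id"] for r in results
        # sigma == 0 means every score is identical, so nothing is an outlier
        if sigma > 0 and abs(r["score"] - mu) / sigma > z_cutoff
    ]
    return {"pass": counts["pass"], "fail": counts["fail"], "outliers": outliers}

# Hypothetical run: nine strong rows and one clear failure.
results = [{"id": f"row-{i}", "score": 0.9, "passed": True} for i in range(9)]
results.append({"id": "row-9", "score": 0.1, "passed": False})
summary = summarize_run(results)
```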

Platform Improved

Rate-limit error UX and alerts

Clearer error messages when API rate limits are hit, plus in-app alerts when limit issues are detected.

Platform Improved

Knowledge base tutorial video

An embedded 3-minute walkthrough video inside the Knowledge Base section. It pairs with the Prototype V2 knowledge base UI shipped in W16.

Evaluate Improved

Compare run eval revamp

Refined comparison view for evaluation runs with cleaner score display and easier navigation between runs.

Platform Improved

Prototype Observe filters

Apply the same filter vocabulary used in Observe to Prototype runs, for consistency across the two surfaces.

Evaluate Fixed

Synthetic data UI improvements

Cleaner interface for synthetic data generation with better progress indicators and batch controls.

Read full digest

Apr 14, 2025

W16 Apr 1 – Apr 14, 2025

Prototype V2 and Audio Evaluations

Rebuilt prompt engineering environment with built-in knowledge base, evaluations that run on the actual audio of voice calls, plus a guided walkthrough.

Platform New

Prototype V2 with knowledge base UI

A rebuilt prompt engineering environment. Attach reference docs and few-shot examples next to your prompt, watch a 3-minute tutorial in-app, and iterate without switching tabs.

6 new eval templates

2x faster dataset loading

Evaluate New

Perplexity ships Sonar and gpt-5.1 to self-hosted Future AGI, plus polish across Evals, Observe, and Platform

Perplexity contributes Sonar and gpt-5.1 to self-hosted Future AGI

New built-in eval: customer-agent task completion

Regex PII detection: default to all five PII types

Eval type persists correctly when creating LLM and Code evals

Eval config auto-populates from the baseline column

Eval workflow: a batch of refinements

Eval picker: stale data prevented, mapping disabled while columns load

Custom eval URLs: open on default version, ?v= survives, task errors expandable

Audio eval runs are ~3.5x faster when there's no feedback to retrieve

Clear error when an eval can't reach a media URL

Dataset evals with template variables in instructions no longer crash

semantic_list_contains handles numeric expected values

Static few-shot examples reach the LLM eval

Voice call mapping reaches deeper into the raw log

PDF filter chip in the eval picker is populated

Confirmation dialog before deleting an eval template

Pass and Fail chips on trace eval drawer

Multi-choice eval output type flows through the playground and picker

Error localiser: Show more works and composite evals are hidden

Fix-with-Falcon hides on label-field eval rows that already pass

Eval usage rollups: date range and session-target row exclusion

Data injection on system evals checks your variable mappings

Long eval and label names are readable on hover

Trace Name and Span Name filter dropdowns populate suggestions

Span-attached annotations appear in the Annotations filter category

Text filters are case-insensitive across Observe

Trace ID and Span ID filters render as single-select

Cleared filters stay cleared after a page refresh

Annotation filters: operator dropdown, chip label, and task filter

Add and edit filter chips on Sessions, Users, and User Traces

Column order persists across auto-refresh on every Observe grid

Consistent grid theme across Traces, Spans, Sessions, and Users

Call Logs grid: autosize works and Call ID column respects the width you set

Hover tooltip on long span-attribute column keys in Live Preview

Single loading spinner on the Users page

Save View buttons visible, full tab name preserved

Agent Graph and Agent Path: real fullscreen, disabled on voice projects

Tasks list chip hover and popover stay usable

Eval tasks list refreshes while rows are pending or running

Voice projects: call recording on the Error Feed cluster overview

agent_talk_percentage is filterable on voice traces

Faster Users tab and session detail loads

Reliable trace ingestion: custom user IDs and span PK retries

Voice eval fan-out: longer scheduling window

Pricing and usage page: correct rounding, units, and free-tier savings

Faster, more reliable signup with reCAPTCHA

Commit action surfaced on the prompt workbench

Better text contrast across Agent, Persona, Scenarios, Test Detail, Eval Picker, and Error Feed

Dataset media fetches no longer hang

ffmpeg calls time out on malformed audio

Agentic eval Azure callback handles missing invocation params

Evals on Traces and Sessions, Configurable Eval Context, and Polish Across Evals, Observe, and Simulate

Score traces and sessions with any eval

Skip variable mapping with context injection

Two new conversation evals: Dead Air Detection and Conversation Hallucination

Eval inputs accept up to 200K characters

Consistent scoring across built-in numeric evals

Eval mapping reaches span-level fields and API columns

Code eval parameters: numbers in, required fields enforced

Deleting an eval task removes all related results

Eval versioning: V1 on publish, full-field snapshots, draft separation

Eval picker: configuration persists, source-aware naming, local preview filter

Built-in evals work out of the box on open-source installs

Cleaner eval cells across the platform

Eval tasks: detail-page edits reflect in the list automatically

Evals UI: faster reruns, accurate version banner, cleaner sliders and animations

Trace ID and Span ID filters on the Tasks page; consistent operators across Observe

More discoverable column resize, persistent custom columns across Observe

Trace list: complete user data, filter chips, tag input, and overlay timing

Voice projects: usage visibility, trace-level evals, surfaced errors

Simulate: safer runs and better failure handling

AI gateway: Bedrock structured output and streaming failover

Self-Host in One Command, Jinja2 Prompts, and Polish Across Evals and Observability

Self-Host Future AGI in One Command

Prompts now support Jinja2 templates alongside Mustache

Reliable voice evals, no dependency on provider URLs

Export annotations and eval scores alongside traces to datasets