
Jan 6 – Jan 19, 2026 (2026 W4)

Baseline Chat Comparison, Fix My Agent Polish, and OpenTelemetry Instrumentation

Baseline chat comparison wires production conversations into simulation as the fastest path to reproducible tests. Also in this release: Fix My Agent polish, OpenTelemetry (OTel) instrumentation, and image outputs.

Simulate Agents Platform Evaluate SDK

1-click production trace to simulation baseline

4 Agent framework wrappers

What's in this digest

Simulate New

Baseline chat comparison from Observe to Simulation

Agents Improved

Fix My Agent: final polish

Platform Improved

OpenTelemetry instrumentation in the platform

Agents Improved

Agent Prompt Optimiser: resume support

Evaluate Improved

Dataset optimisation with direct evaluation

Evaluate Improved

Image output support in datasets and Prompt Workbench

Evaluate Improved

Multiple image upload in datasets

Simulate Improved

Bulk delete and bulk rerun test executions

Agents Improved

Output type selection in Playground results

Simulate Improved

Chat inputs for simulation analysis agent

SDK Improved

simulate-sdk v0.1.2

Baseline Chat Comparison: Observe to Simulation

Baseline chat comparison bridges Observe and Simulate. Take a real production conversation captured in Observe (the view of your live production traces), feed it into Simulation as a baseline, and compare the simulated output against what actually happened.

Why it matters

Fastest path from “something went wrong in production” to “here’s a reproducible test case that catches it.” The baseline is the real transcript; the simulation output is what your current configuration would have produced. Differences between the two surface regressions and drift.
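To illustrate the idea of comparing a baseline against a simulated run, here is a minimal sketch that diffs two transcripts turn by turn using Python's standard `difflib`. The transcripts, the `changed_turns` and `similarity` helpers, and the turn format are all assumptions made for illustration; this is not how the platform computes its comparison.

```python
import difflib

# Hypothetical data: a baseline transcript captured in production and the
# transcript a simulation produced for the same user turns.
baseline = [
    "user: I want to cancel my order",
    "agent: Sure, I can help. What is your order number?",
    "user: 1234",
    "agent: Order 1234 is cancelled.",
]
simulated = [
    "user: I want to cancel my order",
    "agent: Sure, I can help. What is your order number?",
    "user: 1234",
    "agent: I cannot cancel orders. Please contact support.",
]


def changed_turns(baseline_turns, simulated_turns):
    """Return (tag, baseline_slice, simulated_slice) for each differing span."""
    matcher = difflib.SequenceMatcher(
        a=baseline_turns, b=simulated_turns, autojunk=False
    )
    changes = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical turns are not interesting for regression triage
        changes.append(
            (tag, baseline_turns[a_start:a_end], simulated_turns[b_start:b_end])
        )
    return changes


def similarity(baseline_turns, simulated_turns):
    """Ratio in [0, 1]; 1.0 means the transcripts match turn for turn."""
    return difflib.SequenceMatcher(
        a=baseline_turns, b=simulated_turns, autojunk=False
    ).ratio()


for tag, before, after in changed_turns(baseline, simulated):
    print(tag, before, "->", after)
```

An exact-match diff like this only flags turns whose text changed; a real comparison would likely need semantic scoring, since two differently worded replies can be equally correct.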

Who it’s for

Quality assurance (QA) teams building regression tests from production failures, and product teams doing before/after comparisons when they update a prompt or model.


Fix My Agent: Final Polish

Fix My Agent’s final-release polish lands. The drawer is restructured so you reach the suggested fix faster, and call-selection bugs that affected long simulation runs are fixed. A restore-with-conflicts flow keeps your local edits intact when you apply suggestions on top of changes you’ve already made. Chat-simulation integration is wired up end to end.

OpenTelemetry Instrumentation and Workflow Observability

The core platform now emits OpenTelemetry (OTel) traces for its own operations, so Future AGI’s observability surface extends inward to Future AGI itself. Sentry integration has been added across long-running workflows. The Agent Prompt Optimiser also gains resume support, so long-running optimisation jobs survive restarts and pick up where they left off.
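As a rough picture of what resume support for a long-running job can look like, the sketch below checkpoints progress to disk after every step and picks up from the last completed step on the next run. The function name `run_with_resume`, the checkpoint layout, and the scoring callback are hypothetical; the Agent Prompt Optimiser's actual mechanism is not described in this entry.

```python
import json
import os


def run_with_resume(candidates, score, checkpoint_path):
    """Score each candidate once, persisting progress after every step.

    Hypothetical sketch: `candidates`, `score`, and the checkpoint layout
    are illustrative and not the Agent Prompt Optimiser's real format.
    """
    state = {"next_index": 0, "scores": []}
    if os.path.exists(checkpoint_path):
        # A previous run was interrupted: continue from where it stopped.
        with open(checkpoint_path) as fh:
            state = json.load(fh)

    for i in range(state["next_index"], len(candidates)):
        state["scores"].append(score(candidates[i]))
        state["next_index"] = i + 1
        # Write to a temp file, then rename, so a crash never leaves a
        # half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(state, fh)
        os.replace(tmp_path, checkpoint_path)

    # Pick the candidate with the highest score.
    best = max(range(len(candidates)), key=lambda i: state["scores"][i])
    return candidates[best], state["scores"]
```

Calling `run_with_resume` again with the same `checkpoint_path` after an interruption skips the candidates that were already scored, which is the property that matters when each step is expensive.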

Dataset and Evaluation Improvements

Run evaluations directly from datasets. Select a dataset, choose your evaluation criteria, run. No navigation away.

Trial items. Test a small subset before committing to a full evaluation run. Saves time and compute when iterating on scoring rubrics.
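The trial-run idea can be sketched in a few lines: draw a small, reproducible sample of rows, score it with a draft rubric, and only then commit to the full dataset. The helpers `trial_subset` and `mean_score`, the row shape, and the rubric are assumptions for illustration, not platform APIs.

```python
import random


def trial_subset(rows, size, seed=0):
    """Pick a reproducible sample of rows for a trial evaluation run."""
    if size >= len(rows):
        return list(rows)
    # A seeded generator keeps the sample stable across rubric iterations,
    # so score changes reflect the rubric and not a different sample.
    return random.Random(seed).sample(rows, size)


def mean_score(rows, rubric):
    """Average a per-row rubric score over the given rows."""
    return sum(rubric(row) for row in rows) / len(rows)


# Hypothetical dataset rows and a toy rubric that rewards non-empty output.
rows = [{"id": i, "output": "x" * i} for i in range(100)]
rubric = lambda row: 1.0 if row["output"] else 0.0

trial = trial_subset(rows, size=5, seed=42)
print(mean_score(trial, rubric))
```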

Image outputs in datasets and Prompt Workbench. Datasets now store and display image outputs alongside text; the Prompt Workbench renders images inline when your model returns visual content.

Multi-image upload. Select a batch, upload, done.

Bulk delete and bulk rerun test executions. Select many, act once.

Output type selection in Playground results. Set the expected response format on a Playground run so multimodal output renders correctly instead of defaulting to a raw string preview.

Chat inputs for simulation analysis agent. The simulation analysis agent now accepts chat-formatted inputs, so chat-based simulation runs can be triaged by the same diagnostic agent that handles voice runs.

simulate-sdk v0.1.2: Agent Wrappers and Cloud Mode

Cloud mode. Offload simulation execution to Future AGI infrastructure. There is no local compute to provision, and CI pipelines stay fast.

Agent wrappers for four frameworks. Wrap existing agent code with a single function call and gain simulation, evaluation, and observability across OpenAI, LangChain, Gemini, and Anthropic.

Tool calls captured by default. Every invocation, its arguments, and its return value land in the simulation trace automatically.
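To make the tool-call capture idea concrete, here is a minimal sketch of a wrapper that records each invocation, its arguments, and its return value. `Trace`, `capture_tool`, and `lookup_order` are hypothetical names invented for this example; they are not part of simulate-sdk, whose actual wrapper API is not shown in this entry.

```python
import functools


class Trace:
    """Hypothetical in-memory trace that collects tool-call records."""

    def __init__(self):
        self.tool_calls = []


def capture_tool(trace, fn):
    """Wrap `fn` so every call is recorded on `trace` before returning."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"tool": fn.__name__, "args": args, "kwargs": kwargs}
        try:
            result = fn(*args, **kwargs)
        except Exception as exc:
            # Failed calls are recorded too, then re-raised unchanged.
            record["error"] = repr(exc)
            trace.tool_calls.append(record)
            raise
        record["result"] = result
        trace.tool_calls.append(record)
        return result

    return wrapper


# Hypothetical tool an agent might call.
def lookup_order(order_id, verbose=False):
    return {"id": order_id, "status": "shipped"}


trace = Trace()
wrapped_lookup = capture_tool(trace, lookup_order)
wrapped_lookup("1234", verbose=True)
print(trace.tool_calls)
```

Wrapping at the function boundary keeps the agent code unchanged, which is the same design goal the single-call wrappers describe.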
