Baseline Chat Comparison, Fix My Agent Polish, and OpenTelemetry Instrumentation
Baseline chat comparison wires production conversations into simulation as the fastest path from production failure to reproducible test. Plus Fix My Agent polish, OpenTelemetry instrumentation, and image-output support across datasets and the Prompt Workbench.
What's in this digest
Baseline Chat Comparison: Observe to Simulation
Baseline chat comparison bridges Observe and Simulate. Take a real production conversation captured in Observe (the view of your live production traces), feed it into Simulation as a baseline, and compare the simulated output against what actually happened.
Why it matters
Fastest path from “something went wrong in production” to “here’s a reproducible test case that catches it.” The baseline is the real transcript; the simulation output is what your current configuration would have produced. Differences between the two surface regressions and drift.
Who it’s for
Quality assurance (QA) teams building regression tests from production failures, and product teams doing before/after comparisons when they update a prompt or model.
Fix My Agent: Final Polish
Fix My Agent’s final-release polish lands. The drawer is restructured so you reach the suggested fix faster, call-selection bugs that affected long simulation runs are fixed, a restore-with-conflicts flow keeps your local edits intact when applying suggestions on top of changes you’ve already made, and chat-simulation integration is wired up end to end.
OpenTelemetry Instrumentation and Workflow Observability
The core platform now emits OpenTelemetry (OTEL) traces for its own operations. Future AGI’s observability surface now extends inward to Future AGI itself. Sentry integration has been added across long-running workflows. And the Agent Prompt Optimiser gains resume support, so long-running optimisation jobs survive restarts and pick up where they left off.
Dataset and Evaluation Improvements
Run evaluations directly from datasets. Select a dataset, choose your evaluation criteria, run. No navigation away.
Trial items. Test a small subset before committing to a full evaluation run. Saves time and compute when iterating on scoring rubrics.
Image outputs in datasets and Prompt Workbench. Datasets now store and display image outputs alongside text; the Prompt Workbench renders images inline when your model returns visual content.
Multi-image upload. Select a batch, upload, done.
Bulk delete and bulk rerun test executions. Select many, act once.
Output type selection in Playground results. Set the expected response format on a Playground run so multimodal output renders correctly instead of defaulting to a raw string preview.
Chat inputs for simulation analysis agent. The simulation analysis agent now accepts chat-formatted inputs, so chat-based simulation runs can be triaged by the same diagnostic agent that handles voice runs.
simulate-sdk v0.1.2: Agent Wrappers and Cloud Mode
- Cloud mode. Offload simulation execution to Future AGI infrastructure. No local compute to provision, CI pipelines stay fast.
- Agent wrappers for every major framework. Wrap existing agent code with a single function call and gain simulation, evaluation, and observability across OpenAI, LangChain, Gemini, and Anthropic.
- Tool calls captured by default. Every invocation, its arguments, and its return value land in the simulation trace automatically.