Breaking Bad UI Redesign, Custom Model Endpoints, and Observe Enhancements
A redesigned platform UI with new navigation, a new component library, and consistent interaction patterns. Azure OpenAI and self-hosted models as evaluation judges. New filters and provider logos in Observe.
What's in this digest
Breaking Bad — A Platform-Wide Update

Called “Breaking Bad” internally because almost every surface of the platform changed at once. This release pairs a platform-wide UI redesign with deeper improvements to evaluations, error localization, and alerts.
What’s new
- Model choice in evaluation configurations. Every evaluation can now specify which LLM evaluates it.
- Enhanced error localization. Builds on the Error Localization launched in Prototype (w16) — handles more failure shapes and integrates more cleanly into the trace view.
- Extended alert monitoring. Log tracking, new threshold methods, and log-fetching — groundwork for the alerts system that lands formally in w26 and gets revamped in w34.
- Evaluation config management. Manage evaluation configurations through the API.
- A platform-wide UI redesign. Covers evaluations, experiments, Observe, Prototype, and shared components.
Why it matters
Breaking Bad consolidates several in-flight investments — model choice in evals, error-localizer maturity, alerts groundwork — under one release, paired with a large UI pass so existing users and new users land on the same consistent surface.
Who it’s for
Everyone using Future AGI — existing users see the UI update, and teams that depend on evaluation configurations or alert monitoring pick up the deeper improvements.
Custom Model Endpoints — Any Model as Your Judge
Not every team uses the default OpenAI or Anthropic models as their evaluation judge. The new custom model dropdown lets you configure any model endpoint as the judge for an evaluation.
What’s new
- Azure OpenAI deployments with your own API keys and endpoints.
- OpenAI-compatible endpoints for any provider that follows the OpenAI chat completions format.
- Self-hosted models running on your own infrastructure.
Why it matters
Teams with data residency requirements, custom fine-tuned judge models, or cost constraints now have a clean, supported path to use the models they need as judges — without workarounds or forked tooling.
Who it’s for
Enterprise teams with data residency or procurement requirements, teams running fine-tuned judge models, and teams using self-hosted models in production.
Observe Enhancements
Attribute filters in Observe. Filter traces by custom attributes and metadata. Tag traces with environment, user segment, feature flag, or any custom property, and filter on those dimensions.
Provider logos in tracing. Every span (an individual step inside a trace — an LLM call, a tool invocation, a retrieval) in a trace view now shows the logo of the LLM provider it called — OpenAI, Anthropic, Google, Cohere, and others — so you identify the model behind each step at a glance.
Sentry integration. When Sentry captures an exception in your application, the corresponding trace links automatically — full-stack debugging context from application error to agent execution path.
Evaluation feedback in Observe. Attach feedback directly to evaluation results from the Observe surface.
Platform and Evaluation Improvements
Image and audio support in evaluation log table. Image and audio outputs render inline in evaluation log tables — no more downloading files to inspect them.
Faster dataset loading. Dataset loading gets a 3x performance improvement through pagination and lazy rendering — noticeable on datasets with thousands of rows.
Evaluation template validation. Automatic validation of evaluation templates before execution catches configuration errors early instead of at run time.
traceAI Google ADK support. Automatic instrumentation for Google’s Agent Development Kit (ADK) — spans for every agent step, tool call, and model invocation are captured automatically.
TypeScript SDK: expanded evaluations. New metric types and batch submission support for running evaluations from TypeScript.