Home / Changelog / 2025 Week 24

Jun 9 – Jun 13, 2025 2025 W24

Breaking Bad UI Redesign, Custom Model Endpoints, and Observe Enhancements

A redesigned platform UI with new navigation, a new component library, and consistent interaction patterns. Azure OpenAI and self-hosted models as evaluation judges. New filters and provider logos in Observe.

Platform Evaluate Monitor SDK

3 custom judge endpoint types

3x faster dataset loading

What's in this digest

Platform New

Breaking Bad — platform update

Evaluate New

Custom model dropdown with Azure and custom endpoints

Monitor Improved

Attribute filters in Observe

Monitor Improved

Sentry error monitoring integration

Evaluate Improved

Image and audio support in evaluation log table

Platform Improved

Faster dataset loading

Evaluate Improved

Evaluation feedback in Observe

SDK Improved

traceAI Google ADK support

SDK Improved

traceAI TypeScript: new evaluation support

Evaluate Fixed

Eval template validation

Monitor Fixed

Provider logos for tracing

Breaking Bad — A Platform-Wide Update

Called “Breaking Bad” internally because almost every surface of the platform changed at once. This release pairs a platform-wide UI redesign with deeper improvements to evaluations, error localization, and alerts.

What’s new

Model choice in evaluation configurations. Every evaluation can now specify which LLM evaluates it.
Enhanced error localization. Builds on the Error Localization launched in Prototype (w16) — handles more failure shapes and integrates more cleanly into the trace view.
Extended alert monitoring. Log tracking, new threshold methods, and log-fetching — groundwork for the alerts system that lands formally in w26 and gets revamped in w34.
Evaluation config management. Manage evaluation configurations through the API.
A platform-wide UI redesign. Covers evaluations, experiments, Observe, Prototype, and shared components.

Why it matters

Breaking Bad consolidates several in-flight investments — model choice in evals, error-localizer maturity, alerts groundwork — under one release, paired with a large UI pass so existing users and new users land on the same consistent surface.

Who it’s for

Everyone using Future AGI — existing users see the UI update, and teams that depend on evaluation configurations or alert monitoring pick up the deeper improvements.

Custom Model Endpoints — Any Model as Your Judge

Not every team uses the default OpenAI or Anthropic models as their evaluation judge. The new custom model dropdown lets you configure any model endpoint as the judge for an evaluation.

What’s new

Azure OpenAI deployments with your own API keys and endpoints.
OpenAI-compatible endpoints for any provider that follows the OpenAI chat completions format.
Self-hosted models running on your own infrastructure.

Why it matters

Teams with data residency requirements, custom fine-tuned judge models, or cost constraints now have a clean, supported path to use the models they need as judges — without workarounds or forked tooling.

Who it’s for

Enterprise teams with data residency or procurement requirements, teams running fine-tuned judge models, and teams using self-hosted models in production.

Read the docs →

Observe Enhancements

Attribute filters in Observe. Filter traces by custom attributes and metadata. Tag traces with environment, user segment, feature flag, or any custom property, and filter on those dimensions.

Provider logos in tracing. Every span (an individual step inside a trace — an LLM call, a tool invocation, a retrieval) in a trace view now shows the logo of the LLM provider it called — OpenAI, Anthropic, Google, Cohere, and others — so you identify the model behind each step at a glance.

Sentry integration. When Sentry captures an exception in your application, the corresponding trace links automatically — full-stack debugging context from application error to agent execution path.

Evaluation feedback in Observe. Attach feedback directly to evaluation results from the Observe surface.

Platform and Evaluation Improvements

Image and audio support in evaluation log table. Image and audio outputs render inline in evaluation log tables — no more downloading files to inspect them.

Faster dataset loading. Dataset loading gets a 3x performance improvement through pagination and lazy rendering — noticeable on datasets with thousands of rows.

Evaluation template validation. Automatic validation of evaluation templates before execution catches configuration errors early instead of at run time.

traceAI Google ADK support. Automatic instrumentation for Google’s Agent Development Kit (ADK) — spans for every agent step, tool call, and model invocation are captured automatically.

TypeScript SDK: expanded evaluations. New metric types and batch submission support for running evaluations from TypeScript.

Older

Protect Flash, TypeScript SDK v0.1.0, and Custom Evaluations in Observe

Newer

Alerts and Monitors, gRPC Trace Ingestion, and the Observe Graph

All changelog entries