Home / Changelog / 2025 Week 24

Jun 9 – Jun 13, 2025 2025 W24

Eval Playground and Inline Evaluations

An interactive sandbox for testing evaluations in real time, inline eval scoring on traces, and broad provider support with Google ADK and custom model endpoints.

Evaluate Monitor Platform SDK

5 new eval types

3x faster dataset loading

What's in this digest

Evaluate Evals Playground New

Evaluate Inline evaluations for tracing New

Evaluate Experiment page redesign Improved

Monitor Sentry error monitoring integration Improved

Evaluate Custom model dropdown with Azure and custom endpoints Improved

Evaluate Image and audio support in eval log table Improved

Platform Faster dataset loading Improved

Evaluate Eval template validation Fixed

Monitor Attribute filters in Observe Improved

Monitor Provider logos for tracing Fixed

SDK traceAI Google ADK support Improved

SDK traceAI new eval support in TypeScript Improved

Evals Playground — Test Before You Deploy

Building a good evaluation is an iterative process. You define criteria, test it against examples, realize the scoring rubric is too strict or too lenient, adjust, and test again. Until now, that cycle required creating a dataset, running a full evaluation, waiting for results, and repeating.

The Evals Playground eliminates that overhead entirely. It is an interactive sandbox where you can test evaluations with sample inputs and see scores in real time. Paste a prompt-response pair, select your evaluation criteria, choose a judge model, and get instant feedback on how the eval scores that example.

The Playground supports all evaluation types: built-in metrics, custom LLM-as-judge evaluations, and composite scoring rules. You can toggle between different judge models to see how scoring varies, adjust your rubric criteria and re-score instantly, and save configurations that work well as reusable templates.

This is particularly valuable for teams building custom evaluations for the first time. Instead of guessing at rubric wording and discovering problems after a full eval run, you can refine your evaluation interactively until it scores the way you expect.

Inline Evaluations — Score Any Span on Demand

Sometimes you are investigating a specific trace and want to know how a particular response scores against your evaluation criteria. The new inline evaluations feature lets you do exactly that.

Click any span in a trace view, select “Evaluate,” and choose from your evaluation templates. The evaluation runs on that single span and displays the result inline — right next to the trace data. No need to export the data, create a dataset entry, or run a batch evaluation.

This is a game-changer for debugging. When you find a suspicious response in your traces, you can immediately quantify how it scores on factual accuracy, relevance, safety, or any custom metric. It transforms the trace view from a passive observation tool into an active analysis environment.

Custom Model Endpoints

Not every team uses the default OpenAI or Anthropic models for evaluation judging. The new custom model dropdown lets you configure any model endpoint as your evaluation judge:

Azure OpenAI deployments with your own API keys and endpoints
Custom API endpoints that follow the OpenAI-compatible chat completions format
Self-hosted models running on your own infrastructure

This is critical for teams with data residency requirements, custom fine-tuned judge models, or cost constraints that make specific model deployments more practical.

Experiment Page Redesign

The experiment view received a substantial rebuild. The new layout puts comparison front and center with side-by-side result panels, sortable metric columns, and a run history timeline that lets you navigate between experiment iterations.

Metric sorting lets you quickly find the best and worst performing configurations. Click any metric column header to sort, and the rows reorder to surface the top performers. Combined with the diff view shipped in a previous release, you get a complete picture of how each experiment run compares.

Monitoring and SDK Updates

Sentry integration connects your application error monitoring to your agent traces. When Sentry captures an exception, the corresponding trace is linked automatically, giving you full-stack debugging context from application error to agent execution path.

Attribute filters in Observe let you filter traces by custom attributes and metadata. Tag your traces with environment, user segment, feature flag, or any custom property, and filter on those dimensions in the Observe view.

Provider logos add visual clarity to trace views. Each span now displays the logo of the LLM provider it called — OpenAI, Anthropic, Google, Cohere, and others — making it easy to identify the model behind each step at a glance.

On the SDK side, Google ADK support brings automatic instrumentation to Google’s Agent Development Kit, and the TypeScript SDK gained expanded evaluation capabilities with new metric types and batch submission support.

Older

Breaking Bad -- A Complete UI Overhaul

Newer

Alerts, gRPC, and the Observe Graph

All changelog entries

Mastering AI Agent Evaluation

The Agentic RAG Playbook

Platform

Audience

LEARN

DEVELOPERS

Featured

Mastering AI Agent Evaluation

The Agentic RAG Playbook

Eval Playground and Inline Evaluations

What's in this digest

Evals Playground — Test Before You Deploy

Inline Evaluations — Score Any Span on Demand

Custom Model Endpoints

Experiment Page Redesign

Monitoring and SDK Updates

Mastering AI Agent Evaluation

The Agentic RAG Playbook

Eval Playground and Inline Evaluations

What's in this digest

Evals Playground — Test Before You Deploy

Inline Evaluations — Score Any Span on Demand

Custom Model Endpoints

Experiment Page Redesign

Monitoring and SDK Updates

FutureAGI AI Assistant