
May 26 – May 30, 2025 (2025 W22)

Breaking Bad -- A Complete UI Overhaul

A comprehensive redesign of the entire platform UI, the first TypeScript SDK, and flash-speed guardrails with Protect Flash.

Areas: Platform, Evaluate, SDK, Guard

Highlights: first TypeScript SDK; 100% of the UI redesigned

What's in this digest

Platform: Breaking Bad UI redesign (New)
Evaluate: Custom evals in Observe (New)
SDK: TypeScript @traceai/fi-core v0.1.0 (New)
Guard: Protect Flash implementation (New)
Platform: API-based pricing for evals and error localizer (Improved)
Platform: Stop streaming for long-running prompts (Improved)
Evaluate: Evaluations in prompt workbench (Improved)
Evaluate: Feedback enhancement system (Fixed)

Breaking Bad — Every Pixel Reconsidered

We called this release “Breaking Bad” internally because we broke everything to make it better. This is not a theme update or a color palette swap. It is a ground-up redesign of every surface in the platform.

The navigation structure is completely rethought. The old sidebar with nested menus has been replaced by a streamlined top-level navigation with contextual sidebars that appear only when relevant. Moving between Observe, Evaluate, and Prototype is now a single click, and your workspace state is preserved when you switch contexts.

The component library behind the redesign is entirely new. Every table, modal, dropdown, form field, and button follows a consistent design language. Tables load faster with virtualized rendering. Modals are lighter and dismiss more naturally. Forms validate inline instead of on submit.

The result is a platform that feels like a single, cohesive product rather than a collection of features built at different times. Every interaction is faster, every layout is more intentional, and every screen uses space more efficiently.

Custom Evals in Observe

Until now, evaluating production traces meant exporting data and running evaluations separately. Custom evals in Observe collapses that workflow into a single action. Select traces from your Observe view, configure an evaluation with your custom criteria and judge model, and run it in place. Results appear as columns alongside your trace data.

This is especially useful for production monitoring. Set up custom evals to flag hallucinations, check policy compliance, or score response quality, all running directly on your live trace data. Combined with the alert system, custom evals let you build fully automated quality-monitoring pipelines.
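To make the "results appear as columns" idea concrete, here is a minimal sketch of the select, configure, and run-in-place flow. Every name in it (Trace, CustomEval, runEvalInPlace, and the toy regex judge) is a hypothetical stand-in chosen for illustration; it is not the platform's API, and a real judge would be a model call rather than a pattern match.

```typescript
// Hypothetical shapes for illustration only; not the platform's actual API.
interface Trace {
  id: string;
  input: string;
  output: string;
}

interface CustomEval {
  name: string; // becomes the column name next to the trace data
  criteria: string; // natural-language rubric handed to the judge
  judge: (criteria: string, trace: Trace) => number; // stand-in for a judge model
}

type TraceRow = Trace & { scores: Record<string, number> };

// Run one eval over the selected traces and attach each score as a new column.
function runEvalInPlace(traces: Trace[], evaluation: CustomEval): TraceRow[] {
  return traces.map((trace) => ({
    ...trace,
    scores: { [evaluation.name]: evaluation.judge(evaluation.criteria, trace) },
  }));
}

// Toy judge: scores 0 when the output mentions a refund, 1 otherwise.
const policyCheck: CustomEval = {
  name: "policy_compliance",
  criteria: "The response must not promise a refund.",
  judge: (_criteria, trace) => (/refund/i.test(trace.output) ? 0 : 1),
};

const rows = runEvalInPlace(
  [
    { id: "t1", input: "Can I return this?", output: "Yes, we will refund you today." },
    { id: "t2", input: "Where is my order?", output: "It ships tomorrow." },
  ],
  policyCheck,
);
```

Each row keeps its original trace fields and gains a scores entry keyed by the eval name, which is the shape an alert rule could then watch.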

TypeScript SDK — @traceai/fi-core v0.1.0

This is a milestone release. @traceai/fi-core v0.1.0 is our first official TypeScript SDK, bringing full tracing and evaluation capabilities to the Node.js and Deno ecosystems.

The TypeScript SDK mirrors the Python SDK’s API surface, so teams working across both languages get a consistent developer experience. Key capabilities include:

Automatic instrumentation for OpenAI, Anthropic, and other LLM providers
Manual span creation with the trace and span decorators
Evaluation submission for running evals programmatically
Type-safe configuration with full TypeScript definitions
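For readers new to tracing, the sketch below illustrates what manual span creation means in general: wrapping a function so each call is recorded with its name, parent, and duration. It is a self-contained, in-memory toy and deliberately does not use @traceai/fi-core; withSpan, SpanRecord, and the other names are hypothetical, and the SDK's own trace and span decorators may look quite different. It also handles synchronous functions only.

```typescript
// Illustrative only: a tiny in-memory tracer, NOT the @traceai/fi-core API.
interface SpanRecord {
  name: string;
  parent: string | null;
  durationMs: number;
  error?: string;
}

const finished: SpanRecord[] = []; // spans in the order they completed
const active: string[] = []; // stack of currently open span names

// Wrap a synchronous function so each call is recorded as a span
// nested under whichever span is open at call time.
function withSpan<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => R,
): (...args: A) => R {
  return (...args: A): R => {
    const parent: string | null = active[active.length - 1] ?? null;
    active.push(name);
    const start = Date.now();
    let error: string | undefined;
    try {
      return fn(...args);
    } catch (err) {
      error = err instanceof Error ? err.message : String(err);
      throw err;
    } finally {
      active.pop();
      finished.push({
        name,
        parent,
        durationMs: Date.now() - start,
        ...(error !== undefined ? { error } : {}),
      });
    }
  };
}

// Usage: a retrieval step nested inside an answer step.
const retrieve = withSpan("retrieve", (query: string) => [`doc for ${query}`]);
const answer = withSpan("answer", (query: string) => {
  const docs = retrieve(query);
  return `Answer based on ${docs.length} document(s)`;
});

const result = answer("pricing");
```

The inner span finishes first, so finished holds the retrieve span (with parent "answer") followed by the answer span (with no parent).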

Installation is a single command: npm install @traceai/fi-core. The SDK ships with zero native dependencies and works in Node.js 18+, Deno, and edge runtimes like Cloudflare Workers.

Protect Flash — Guardrails at the Speed of Inference

Production guardrails have a fundamental tension: they need to be thorough enough to catch harmful outputs but fast enough to not destroy your latency budget. Protect Flash resolves that tension.

Protect Flash is an optimized guardrails engine that screens LLM outputs in under 50 milliseconds. It runs a lightweight classification model trained on our evaluation dataset to detect common failure modes: hallucinations, PII leakage, off-topic responses, and policy violations.

The architecture is designed for inline deployment. Protect Flash sits between your LLM provider and your application, screening every response before it reaches the user. Because screening completes in under 50 ms, it adds little overhead to the request path while providing real-time protection.
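The inline pattern described above can be sketched as a wrapper that generates a response, screens it, and only then returns it to the caller. This is a minimal illustration under stated assumptions: Verdict, Screen, guardedCompletion, and the regex-based toyScreen are hypothetical names, and the changelog does not describe the real Protect Flash interface or its classifier.

```typescript
// Hypothetical types; the real Protect Flash interface is not described here.
type Verdict = { allowed: true } | { allowed: false; reason: string };
type Screen = (text: string) => Promise<Verdict>;

interface GuardedResponse {
  text: string;
  blocked: boolean;
  reason?: string;
  screenMs: number; // time spent in the screening step only
}

// Generate a response, screen it, and return a fallback if it is blocked.
async function guardedCompletion(
  generate: () => Promise<string>,
  screen: Screen,
  fallback: string,
): Promise<GuardedResponse> {
  const raw = await generate();
  const start = Date.now();
  const verdict = await screen(raw);
  const screenMs = Date.now() - start;
  if (verdict.allowed) {
    return { text: raw, blocked: false, screenMs };
  }
  return { text: fallback, blocked: true, reason: verdict.reason, screenMs };
}

// Toy screen standing in for the classifier: blocks anything that
// looks like an email address.
const toyScreen: Screen = async (text) =>
  /[\w.+-]+@[\w-]+\.[\w.]+/.test(text)
    ? { allowed: false, reason: "pii_leakage" }
    : { allowed: true };
```

Recording screenMs separately from generation time makes it easy to check the guard against a latency budget in your own environment.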

Platform and Pricing Updates

API-based pricing for evaluations and the error localizer replaces the old fixed-tier limits. You now pay per evaluation run and per error localization request, with volume discounts at scale. This means you can start small and scale without hitting arbitrary plan boundaries.

Stop streaming lets you cancel long-running LLM generations mid-stream. If you can see the output going off the rails, hit stop and save both time and tokens. The partial output is preserved so you can still analyze what went wrong.
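If you stream generations in your own code, the same behavior (cancel mid-stream, keep the partial output) can be approximated with the standard AbortController. The sketch below is an assumption-laden illustration, not how the platform implements it: fakeStream stands in for a real streaming client, and consume and StreamResult are hypothetical names.

```typescript
// Stand-in for a streaming LLM response; a real client would yield network chunks.
async function* fakeStream(tokens: string[]): AsyncGenerator<string> {
  for (const token of tokens) {
    await Promise.resolve(); // yield to the event loop between tokens
    yield token;
  }
}

interface StreamResult {
  text: string; // everything received before the stream ended or was stopped
  stopped: boolean;
}

// Accumulate tokens until the stream ends or the signal is aborted.
// Returning from the loop also closes the underlying generator.
async function consume(
  stream: AsyncIterable<string>,
  signal: AbortSignal,
  onToken?: (token: string) => void,
): Promise<StreamResult> {
  let text = "";
  for await (const token of stream) {
    if (signal.aborted) {
      return { text, stopped: true };
    }
    text += token;
    onToken?.(token);
  }
  return { text, stopped: false };
}

// Usage: simulate the user hitting stop once the output goes off the rails.
const controller = new AbortController();
const run = consume(
  fakeStream(["The ", "answer ", "is ", "off ", "the ", "rails"]),
  controller.signal,
  (token) => {
    if (token.trim() === "off") controller.abort();
  },
);
```

The abort is checked before each new token is appended, so the token that triggered the stop is kept and everything after it is dropped.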

Evaluations in the prompt workbench bring scoring directly into the editing workflow. Run an eval on your prompt’s output without leaving the editor, see the scores, adjust your prompt, and run again. This tight iteration loop is what makes prompt engineering productive.
