Annotations Flow and Error Localization

Find the root cause of agent failures instantly with error localization in Observe, plus a complete annotations flow for human feedback on traces.

Monitor Evaluate Platform

70+ annotation filters

5x rate limit increase

What's in this digest

Monitor Error localization in Observe New

Monitor Annotations flow for trace view New

Evaluate Updated dataset layout with sheet view Improved

Evaluate Diff view in experiments Improved

Platform Audio support across Observe and datasets Improved

Platform Higher rate limits Improved

Evaluate Synthetic data UI improvements Fixed

Error Localization — Find the Root Cause Instantly

When an agent fails in production, the hardest part is not knowing that it failed — it is figuring out why. A typical agent trace might have dozens of spans: LLM calls, tool invocations, retrieval steps, and branching logic. Somewhere in that chain, something went wrong. Finding it manually means clicking through every span, reading every output, and reconstructing the execution path in your head.

Error localization eliminates that work entirely. When a trace contains a failure, our system automatically analyzes the execution graph and pinpoints the exact span where the problem originated. It distinguishes between the root cause and downstream consequences, so you are not distracted by cascading errors that are just symptoms of the original failure.

The error localization engine examines several signals:

Exception propagation paths through the trace tree
Output anomalies where a span’s output deviates significantly from expected patterns
Latency spikes that indicate timeouts or resource contention
Tool call failures with error codes and retry patterns

The result is a highlighted span in your trace view with a clear explanation of what went wrong and why. What used to take hours of debugging now takes seconds.

Annotations Flow — Human Feedback on Traces

Automated evaluations catch systematic issues, but human review catches the subtle ones. The new annotations flow brings a complete human feedback workflow directly into the trace view.

Reviewers can now attach multi-label annotations to any span in a trace. Mark a response as “hallucinated,” “off-topic,” “tone mismatch,” or any custom label your team defines. Each annotation captures the reviewer, timestamp, and optional notes for context.

The workflow supports team collaboration with reviewer assignment and approval states. Route traces to specific team members for review, track which traces have been reviewed, and filter your annotation queue by status. With over 70 filter combinations available, you can slice your annotation data by label, reviewer, date range, model, and more.

This creates a powerful feedback loop: human annotations feed back into your evaluation datasets, improving your automated evals over time. The more your team reviews, the smarter your evaluation suite becomes.

Dataset and Experiment Upgrades

The dataset interface got a major layout refresh with the new sheet view. If you have ever wished your dataset editor felt more like a spreadsheet, this is it. Resizable columns, frozen headers, inline editing, and bulk operations make managing large datasets dramatically more efficient.

Diff view in experiments extends the comparison capabilities we shipped for datasets into the experiment workflow. Run two experiment configurations and see exactly how their outputs differ — highlighted inline with score comparisons and configuration deltas. This is invaluable when you are tuning prompts or testing model switches and need to understand the precise impact of each change.

Audio Support and Platform Improvements

Audio support is now native across the entire platform. Trace views render audio spans with inline playback, and dataset tables display audio cells with waveform previews. If you are building voice agents, you no longer need to leave the platform to hear what your agent actually said.

On the infrastructure side, we pushed rate limits up by 5x across all API endpoints. Teams running high-throughput production workloads were hitting limits during peak evaluation runs, and this increase gives substantial headroom for even the largest deployments. The synthetic data generation UI also received polish with better progress indicators and batch controls for large generation jobs.

Older

Prototype V2 and Audio Evaluations

Newer

Workbench V2 -- The Prompt Engineering Revolution

All changelog entries