Most observability tools ask you to define what you’re looking for before you can find it. Set up evaluations (tests that score agent outputs against criteria you define), configure thresholds, write detection rules. Agent Compass inverts that.

What’s new

Zero-config insights. Point Agent Compass at your traces (the end-to-end records of how your agent handled each request) and it surfaces performance issues automatically.
Three detection categories. Latency anomalies (specific tool calls or LLM invocations taking longer than expected), quality regressions (agent responses deviating from established patterns), and error clusters (similar failures repeating across sessions).
Trace-level attribution. Every issue points to the specific span (individual step inside a trace) responsible, so you skip straight from “something’s wrong” to “this tool call is the cause.”

Why it matters

For teams just starting with agent monitoring, Agent Compass provides usable signal on day one — before they’ve defined a single evaluation. For teams with mature evaluation setups, it catches issues that fall outside the specific checks they’ve written.

Who it’s for

Teams new to agent observability who want signal without spending a week on setup, and mature MLOps teams looking for issues that their existing evaluations don’t catch.

Annotation Quality Dashboard — Statistics for Your Human Reviewers

Human evaluation is only as good as the humans doing it. When multiple annotators label the same data differently, which labels do you trust? The annotation quality dashboard answers that question with statistical rigor.

What’s new

Cohen’s kappa scores measure inter-annotator agreement beyond what chance alone would predict.
Annotator-level breakdowns show which team members are consistent and which may need calibration.
Category-level analysis reveals which evaluation criteria are subjective and may need clearer rubrics.

Why it matters

If your annotators disagree on what counts as a hallucination, the evaluation scores built from their labels are noise. The dashboard makes disagreement visible — so you can address it (calibration, clearer rubric, better training) instead of building on top of inconsistent data.

Who it’s for

Teams running human evaluation as ground truth — quality assurance (QA) teams running manual reviews, compliance and audit reviewers, and data teams building labeled evaluation datasets.

Read the docs →

Enterprise Multi-Workspace Security

Larger organizations need more than role-based access control (RBAC) within a single workspace. The enterprise security framework adds workspace isolation, per-workspace policies, and cross-workspace audit logging.

What’s new

Independent workspaces. Each workspace operates as its own security boundary — data, traces, evaluations, and configurations are fully isolated across workspaces.
Per-workspace RBAC. Different role assignments per workspace, so a Member in one workspace can be an Admin in another.
Cross-workspace audit logging. Administrators see a consolidated audit trail across every workspace they manage.
Central administration. Manage multiple workspaces from one console while maintaining strict separation between teams, projects, or business units.

Why it matters

Combined with the RBAC framework from the previous release, this gives enterprise security teams the governance model they need to approve Future AGI for production use — including in regulated industries where isolation between teams is a compliance requirement.

Who it’s for

Enterprise security and procurement teams reviewing Future AGI, and platform administrators supporting many internal teams using AI tooling in parallel.

Read the docs →

Observability and Voice Testing Improvements

Feed insights with error clusters. Instead of scrolling through individual error traces, the feed groups related errors into clusters and surfaces trend lines: is this issue growing, steady, or shrinking? “This tool call has failed 47 times in the last hour with a timeout error” is one cluster, not 47 separate line items.

Voice agent testing dashboards. New metrics and scenario column views for simulation. Track call success rates, average call duration, and scenario coverage in a single dashboard. Each simulation scenario becomes a column in your test matrix, making it easy to verify that every conversation path has been tested.

Platform Improvements

Folder-based prompt organization. Organize large prompt libraries by project, team, or feature area. Templates standardize prompt structure across the organization.

Intelligent onboarding navigation. The onboarding flow adapts to your role and use case on first login.

Plans and pricing redesign. Clearer plan comparisons and streamlined upgrade flow on the pricing page.

Evaluation grouping API. Organize related evaluations programmatically. Run evaluation groups as test suites in continuous integration (CI/CD) pipelines.

Older

Summary Dashboards, Alerts Revamp, Prompt SDK, and Workspaces RBAC

Newer

Automated Scenario Builder, Agent Definition Versioning, and Simplified Session Tracking

All changelog entries