Evals Revamp, Experiment V2, Observe Revamp, and Error Feed
130+ ready-made evals that reach outside the prompt: pull live web data and call your own tools as part of scoring. Experiments now run end-to-end on agents, not just prompts. Observe gets plain-English filtering and saved views. The Error Feed clusters failures by root cause at a fraction of the cost and pushes any cluster to Linear in one click.
Evals Revamp: Score Against the Tools and Data Your Agent Actually Uses
Evals used to be sealed off from the systems your agent actually runs against. Whatever you typed into the prompt and the criteria field was the entire world the eval got to see. No way to ask Grammarly whether the tone was right, no way to run a code eval against your test suite, no way to check a claim against today’s news, no way to verify a record exists in your database.
Evals reach outside the prompt
Each eval can pull in external context as part of scoring:
- Live web data. Fact-check a generated claim against the current state of the web. Useful when the answer depends on something that changes (prices, news, availability, regulations).
- Your databases and APIs. An eval can query your own systems to verify that what the model said matches what’s actually true: does the order ID exist, is the user in the cohort, does the policy permit this answer (a sketch follows this list).
- Tools you already use. Plug in Grammarly or Hemingway for tone and clarity, ESLint and Pytest for generated code, vector-DB lookups and internal knowledge bases for retrieval evals, translation and schema validators for structure. Each tool runs as part of the eval, not as a separate pipeline.
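To make the database case concrete, here is a minimal sketch of what such a check could look like as a code eval. The endpoint, ID format, and `evaluate` entry point are placeholders, not a prescribed contract; the part that matters is that the eval does a real lookup and returns a 0-1 score with a reason.

```python
import re
import requests

# Hypothetical internal endpoint; substitute your own orders API.
ORDERS_API = "https://internal.example.com/api/orders"

def evaluate(model_output: str) -> dict:
    """Score 1.0 only if every order ID the model mentions exists in our system."""
    order_ids = re.findall(r"ORD-\d+", model_output)  # assumed ID format
    if not order_ids:
        return {"score": 0.0, "reason": "No order ID found in the output."}

    missing = [
        oid for oid in order_ids
        if requests.get(f"{ORDERS_API}/{oid}", timeout=5).status_code != 200
    ]
    if missing:
        return {"score": 0.0, "reason": "Order IDs not found: " + ", ".join(missing)}
    return {"score": 1.0, "reason": "All referenced order IDs exist."}
```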
Upgraded eval experience
The eval surface is rebuilt around how you actually work with evals. Every eval shows its type, output type, 30-day pass-rate sparkline, error rate, and last-updated time at a glance, with filters, search, and bulk actions on top.
Three eval types you can author
- LLM-as-Judge. Rich-text instructions with template variables ({{ground_truth}}, {{model_output}}), model selector, optional internet access, tags, and description.
- Code evals. Write Python or JavaScript directly in the browser. Your code runs against the inputs and returns a 0-1 score with a reason (sketched after this list).
- Agent evals. Real agents with multi-step reasoning and MCP tool access. This is the layer that calls Grammarly, ESLint, your database, the web, and so on.
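As a rough illustration of the code-eval shape, here is a Python sketch that gives partial credit for structural validity. The required keys and the `evaluate` entry point are assumptions for illustration; the contract to note is the 0-1 score plus a reason.

```python
import json

REQUIRED_KEYS = {"title", "summary", "tags"}  # assumed output schema, for illustration only

def evaluate(model_output: str) -> dict:
    """Give partial credit for each required top-level key present in valid JSON output."""
    try:
        payload = json.loads(model_output)
    except json.JSONDecodeError:
        return {"score": 0.0, "reason": "Output is not valid JSON."}
    if not isinstance(payload, dict):
        return {"score": 0.0, "reason": "Output is valid JSON but not an object."}

    present = REQUIRED_KEYS & payload.keys()
    missing = sorted(REQUIRED_KEYS - present)
    score = len(present) / len(REQUIRED_KEYS)
    reason = "All required keys present." if not missing else "Missing keys: " + ", ".join(missing)
    return {"score": score, "reason": reason}
```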
Creating evals with Falcon
Describe what you want to evaluate in plain English (e.g. “evaluate SEO blog quality”) and Falcon scaffolds the eval for you: the instructions, variable mapping, and a starting configuration. Pick the eval type (Agent, LLM-as-Judge, or Code) and Falcon adapts the scaffolding to fit. Going from a blank slate to a runnable eval takes one prompt instead of authoring every field by hand.
Summary
Control how detailed or brief the evaluation output is. Pick from four built-in templates: None for the raw scoring output, Short for key points only, Long for a full-context summary, Concise for a compact essential-insights summary. Or create your own custom template that matches your team’s reporting format. Useful when one eval needs to feed both a tight dashboard cell and a deeper review surface.
Ground truth
Upload annotated reference data per eval to calibrate the eval against human judgment. The eval scores against the ground truth dataset, so you can spot drift between what the eval picks up and what your reviewers would mark, and tune the eval until the two agree.
Feedback
Reviewers can submit per-result feedback with an improvement note. Feedback history is captured per eval (who, when, source, action) and feeds back into the calibration loop alongside the ground truth set above.
130+ ready-made evals
A starter library of 130+ evals (a mix of code-based and LLM-based) ships with the revamp: coverage for hallucination, faithfulness, tone, code quality, structure, safety, retrieval relevance, and dozens of domain-specific dimensions. Take any one as-is, or fork it as a starting point for your own.
Composite evals: score everything in one pass
Bundle multiple evals into one composite unit:
- Per-child versioning. Each imported eval is pinned to a specific version.
- Weighted aggregation. Assign weights and pick an aggregation function (weighted average, min, max, or mean); the arithmetic is sketched after this list.
- Aggregate or per-child results. Click any composite cell to see the breakdown across children.
- One pass, one score. All criteria run in a single eval pass instead of N separate runs you stitch together.
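The aggregation itself is plain arithmetic. A sketch of a weighted-average composite over per-child 0-1 scores (the child names and weights below are made up):

```python
def composite_score(child_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-child 0-1 scores, normalised by total weight."""
    total_weight = sum(weights.values())
    return sum(child_scores[name] * weights[name] for name in child_scores) / total_weight

children = {"faithfulness": 0.9, "tone": 0.6, "structure": 1.0}
weights = {"faithfulness": 0.5, "tone": 0.2, "structure": 0.3}
print(composite_score(children, weights))  # ≈ 0.87
```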
Full version history
Every save creates a new version with the prompt, config, model, and metadata snapshotted. The Versions tab lists history, lets you set a default version, restore an earlier one, and compare two side by side. Iteration doesn’t lose prior work.
Standardized scoring
Every eval scores on a normalized 0-1 scale with one of three output shapes:
- Pass / Fail. Binary outcome.
- Scoring. Predefined categories where each maps to a numeric score between 0 and 1, with a configurable pass threshold.
- Choices. Predefined categories with display labels (multi-select optional); the selected categories produce a score in the 0-1 range.
Same scoring shape whether the eval runs from the playground, in an experiment, on a dataset, or against a live trace, so scores are directly comparable across every surface.
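For the Scoring shape, the category-to-score mapping and pass threshold reduce to something like the following (the categories, values, and 0.7 threshold are illustrative, not defaults):

```python
# Illustrative category-to-score mapping for a "Scoring" output shape.
CATEGORY_SCORES = {"excellent": 1.0, "acceptable": 0.7, "weak": 0.3, "unusable": 0.0}
PASS_THRESHOLD = 0.7  # configurable per eval

def resolve(category: str) -> dict:
    """Map a selected category to its numeric score and a pass/fail outcome."""
    score = CATEGORY_SCORES[category]
    return {"score": score, "passed": score >= PASS_THRESHOLD}

print(resolve("acceptable"))  # {'score': 0.7, 'passed': True}
```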
Testing playground: four modes
The right panel of the eval detail page is a full testing surface:
- Custom. Manual input fields per variable, run and see score + reason.
- Tracing. Pick a project, navigate trace by trace, map variables to span fields, run.
- Simulation. Point at a simulation run.
- Dataset. Point at a dataset and a row.
Plus a Map Variables toggle: switch between explicit variable mapping and “Pass Context” (send the whole trace / session context without mapping).
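In practice the difference is a lookup table versus no table at all. A sketch of what the two modes amount to, with made-up field paths:

```python
# Explicit mapping: each eval variable points at a specific trace/span field.
# The paths here are made up for illustration.
variable_mapping = {
    "model_output": "span.output.text",
    "ground_truth": "dataset.expected_response",
}

# Pass Context: skip the mapping and hand the eval the whole trace/session payload.
pass_context = True
```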
Why it matters
The evaluation question that actually matters in production is “does this output hold up against the real world?”, not “does it pass a few criteria the prompt author thought of?”. Letting evals reach into the web, your databases, and the tools you already trust closes that gap. Evaluation moves from a closed-loop check to something connected to the systems your agent operates in.
Who it’s for
ML and AI engineers building evaluation suites, data scientists authoring custom rubrics, quality assurance (QA) teams managing evaluation criteria, and anyone whose evaluation quality has been bottlenecked by what the prompt alone can verify.
Experiment V2: From Prompts to Agents
Experiments used to run only on prompts. You could line up two prompt versions against the same dataset and see which scored higher. Experiment V2 makes them run end-to-end on agents: the agent executes per row, you can run multiple variants in one go, and you can edit, stop, or rerun pieces mid-flight.
Experiments now run on agents
The agent executes per row, not just a single prompt. You see the output of every node (retrieval step, tool call, sub-agent, final response), so when an experiment regresses, you can tell where in the agent it regressed, not just that the final answer changed.
Multi-variant in a single run
One experiment can compare many things side by side instead of running a separate experiment per variant:
- Prompts. Two or more prompt versions against the same dataset.
- Agents. Two or more agents against the same dataset.
- Stacked model variants. Different LLM, TTS, STT, or image models per step in the same agent, all evaluated together.
LLM, TTS, STT, and Image are all first-class. One experiment surface for every modality.
Score against any column in your dataset
Use any dataset column as the baseline. Score against an expected_response column, a label column, a similarity reference, or whatever else is in your dataset. Built-in baseline comparison without a separate setup step.
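Conceptually, baseline scoring just pairs each row’s output with the chosen column before the eval runs. A toy sketch with an exact-match eval and an assumed expected_response column:

```python
def exact_match(output: str, expected: str) -> float:
    """Toy baseline eval: 1.0 on an exact (case-insensitive) match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

dataset = [
    {"input": "Capital of France?", "expected_response": "Paris", "model_output": "Paris"},
    {"input": "Capital of Japan?", "expected_response": "Tokyo", "model_output": "Kyoto"},
]

scores = [exact_match(row["model_output"], row["expected_response"]) for row in dataset]
print(sum(scores) / len(scores))  # 0.5
```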
Edit, stop, and rerun mid-flight
- Edit a running experiment. Change evals, swap variables, adjust configuration without rebuilding from scratch.
- Stop control. Kill a misconfigured run as soon as you notice instead of waiting for it to finish.
- Rerun specific columns. Re-execute just the eval column or the prompt column that changed; everything else stays.
- Real-time progress. Watch the run populate row by row instead of refreshing for status.
Summary view in the experiment list
Each experiment in the list shows status, model count, and eval count up front, so you can scan a list of runs and pick the one to compare against without opening each.
Manage evaluations inside the experiment
The Manage Experiment Evals drawer gives full create / read / update / delete on the eval set inside a running or saved experiment, with no need to rebuild the experiment to change what gets scored.
Why it matters
Real production agents are graphs of prompts, models, retrieval steps, and tool calls. A single-prompt evaluation wasn’t enough to tell you whether the agent improved. Experiment V2 evaluates the agent as a whole, with per-node visibility, and lets you do it across modalities and variants in one run, closer to how you actually decide what to ship.
Who it’s for
ML and AI engineers running experiments on agents and stacked model variants, product teams comparing prompts before rolling out a change, and quality assurance (QA) teams scoring agent runs against baseline datasets.
Observe Revamp: Plain-English Filters, Custom Views, and Imagine-with-Falcon Charts
You used to spend more time configuring trace views than actually reading them. The revamp flips that: the time between “something looks off” and “I know what broke” drops from minutes to seconds.
Filter using plain English
Type what you want to see in plain English. No query language, no operator cheat sheet. “Show me traces from the last hour where eval score dropped below 0.7” or “errors from the production project where latency was over 3 seconds”. The filter materialises as chips you can edit, save, or clear.
Saved views you can switch in a click
The top of Observe is a tab bar: three fixed tabs (Trace, Sessions, Users) plus any number of custom view tabs you create. Each tab captures everything: filters, columns, sort, display options. Stop rebuilding the same view.
- Tab bar interactions. Drag-to-reorder, right-click for rename / duplicate / delete, overflow into a +N dropdown, keyboard shortcuts 1-9.
- Personal or project visibility. Pin a private view for your own debugging or share one with the project so the whole team starts from the same baseline.
- URL synced. Every tab is shareable.
Inline search across logs, previews, and attributes
Find what you’re looking for without opening every trace. Search runs across log content, span previews, and attribute keys / values, scoped to whatever filter is active.
Imagine with Falcon: turn any trace into an interactive chart
Hand Falcon a trace (or a filtered slice) and it builds an interactive chart from the underlying data: latency distributions, error patterns, token-count outliers, eval-score drift over time. Patterns surface from the data itself instead of you eyeballing rows. Every chart is interactive. Click a point to drop into the underlying traces.
Add evals across traces, spans, and sessions
Run evals from the trace homepage on any selection: a single trace, a multi-select batch, individual spans, or whole sessions. From the same selection you can also push to a dataset, send to an annotation queue, replay, tag, or delete. No drilling into individual traces, no separate workflow.
Move through traces faster
Move between traces with keyboard shortcuts, never leaving the keyboard. Logs, attributes, and annotations each live in a dedicated pane you can flip through quickly.
Three views above the trace list: Graph, Agent Graph, Agent Path
Switch between three top-of-page views for the same set of traces:
- Graph View. Latency and traffic over time, so anomalies and traffic spikes show up at a glance.
- Agent Graph. The agent rendered as a connected flow (Start → nodes → End), showing which paths through the agent ran and where they branched in production traffic.
- Agent Path. The linear sequence of nodes each trace traversed, useful when you want to scan the route a request actually took end to end.
Trace list redesigned
The trace table is rebuilt with polished cell renderers for every column: latency bar, cost, tokens (prompt / completion / total), status chips, tags, model, eval-score badges, timestamps. Drag-to-reorder columns, a column picker grouped by category, custom columns from any span attribute, and quick filters for errors, non-annotated traces, and eval pass/fail toggles. Agent logs can be grouped by trace, span, user, or session, and a compare graph supports dual-graph rendering. Server-side infinite scroll backs every interaction.
Trace detail rebuilt
Click a trace and the detail drawer opens with a Gantt-style span tree:
- Span tree with Gantt timeline. Hierarchical tree on the left, color-coded timeline bars on the right, expand / collapse subtrees, hover for duration / tokens / cost.
- Span detail panel. Preview, Evals, and Annotations tabs scoped to the selected span.
- Span notes. Attach free-form notes for triage handoffs and review threads.
- Show or hide span metrics. Toggle latency, tokens, cost, evals, annotations, and events on or off from the trace detail’s display panel, so the span tree stays as compact or as detailed as the moment calls for.
Why it matters
Every layer of friction between “something looks off” and “here’s the broken span” is time your agent spends doing the wrong thing in production. The revamp removes most of those layers: plain-English filtering instead of query syntax, saved views instead of rebuilt ones, charts that surface patterns instead of waiting for you to spot them.
Who it’s for
MLOps and platform engineering teams running agents in production, and quality assurance (QA) teams triaging live traffic.
Error Feed: Cluster, Triage, and Push to Linear in One Click
Agent failures used to take minutes to surface and longer to triage across projects. The Error Feed now updates about 5 seconds after a trace completes, with failures auto-grouped into a single triageable view, at roughly 10x lower cost than the previous Error Feed.
Fast and low cost to leave open
- ~5-second update lag between a failing trace landing and the cluster it belongs to updating in the feed.
- Roughly 10x lower cost to run than the previous Error Feed, with the trace sampling rate fully configurable so you can dial cost up or down to match your budget.
- Deep Analysis runs only when you ask. Per-cluster root-cause investigation runs on-demand. The feed itself stays lightweight; the heavy analysis happens when you decide a cluster deserves it.
Triage signal you actually want
- Worsening clusters get an escalation badge. A cluster that’s getting more frequent or more severe stands out even in a busy feed, so you don’t have to remember last week’s baseline.
- Each cluster traces back to the deploy or commit that introduced the failure. When a regression appears, you see which release broke it, not just that something broke.
- Side-by-side failing vs working trace evidence. For each cluster, the feed shows a representative failing trace next to a representative working trace, verbatim and not LLM-summarised. You debug from the actual traces, not from a model’s interpretation of them.
One-click push to Linear
Spotted an error? Push the cluster to Linear in one click. The issue arrives pre-filled with title, description, and a link back to the cluster. No copy-pasting, no cross-referencing later. More tracker integrations are on the way.
What you still get from the original feed
- Errors clustered by root cause. Not 47 copies of the same failure.
- Frequency, first seen, last seen on every cluster. Sort by whichever matters.
- One click into underlying traces with the right filter pre-applied.
- Route to an annotation queue or a teammate for follow-up.
Why it matters
Error triage breaks when the loop is slow or expensive. A 5-second update lag means failures land in front of you before the next one happens. A 10x cost cut means you don’t have to choose between visibility and budget. Deep Analysis on-demand means the lightweight feed stays lightweight, and the expensive root-cause work happens only when it earns its cost.
Who it’s for
Teams that need to ship a fix to a frustrated customer before the next escalation, support and engineering working together to triage live failures, MLOps and platform engineers who own agent reliability in production, and QA leads who want a ranked feed of what is breaking right now (and what is breaking more often than yesterday) instead of digging through every failed trace by hand.
Custom Pricing Plans + Billing Dashboard v2
Enterprise deals require custom pricing.
- Custom pricing plans with admin invoice generation.
- Billing dashboard v2 rebuilt around the new pricing model: clearer cost breakdowns, usage trends, and per-key spend visibility.
Improvements
Standardised search bar across list pages. Every list page in the platform now uses the same search bar, so the keyboard shortcut, search syntax, and result behaviour stay the same whether you’re searching traces, datasets, evals, or experiments.
Annotator picker: search and add at scale. Search and add annotators directly from the picker, even on workspaces with a large user list. Infinite scroll loads results as you go, and server-side search returns matches instantly.
Custom dashboards polish. The time range now syncs to the widget editor, widgets resize cleanly, and description fields render instantly, alongside fixes for the empty state on metric removal, the filter Add button’s visibility, and the dashboard description’s position.
Agent definition API revamp. Agent definition endpoints now share one consistent payload shape across create, read, update, and delete, so SDK and integration code can drop per-endpoint marshalling and use a single type.
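The upshot for SDK code is that one type can describe the payload for every endpoint. A sketch of what that could look like client-side; the field names, host, and path are hypothetical, not the actual schema:

```python
from dataclasses import dataclass, asdict
import requests

BASE_URL = "https://api.example.com"  # placeholder host

@dataclass
class AgentDefinition:
    """One payload shape reused across create, read, and update calls."""
    name: str
    model: str
    tools: list[str]
    system_prompt: str

def create_agent(definition: AgentDefinition) -> dict:
    # The same asdict(definition) body works for update calls too,
    # so no per-endpoint marshalling is needed.
    resp = requests.post(f"{BASE_URL}/agents", json=asdict(definition), timeout=30)
    resp.raise_for_status()
    return resp.json()
```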
Simulate analytics API. New analytics API for simulation data: pull aggregate metrics, breakdowns, and trends for any simulation run.
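A rough sketch of pulling aggregates from such an API over HTTP; the path, parameters, and response shape here are placeholders, not the documented contract:

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder host
API_KEY = "YOUR_API_KEY"

# Hypothetical call: fetch aggregate metrics for one simulation run.
resp = requests.get(
    f"{BASE_URL}/v1/simulations/sim_123/analytics",
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"metrics": "pass_rate,latency_p95", "group_by": "scenario"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```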
All Attributes JSON viewer search filter. The All Attributes JSON viewer (in the trace detail drawer) gets a search box that filters the JSON tree as you type. Useful when a trace has hundreds of attributes and you only want to see the ones matching a specific key or value.