Evals on Traces and Sessions, Configurable Eval Context, and Polish Across Evals, Observe, and Simulate
Every eval type (LLM-as-Judge, Code, and Agent) can now score at every level: spans, traces, and sessions. Eval setup also gets simpler: turn on context injection and skip variable mapping entirely. The eval reads the context on its own. Plus two new conversation evals: Dead Air Detection (a Code eval at zero LLM cost) and Conversation Hallucination. Plus eval inputs up to 200K characters, partial inputs as warnings, custom dotted paths in mapping, span-level fields and API columns, and polish across Observe and Simulate.
What's in this digest
Score Traces and Sessions With Any Eval
Evals can score at three levels. A span is a single step the agent takes (one LLM call or one tool call). A trace is the full sequence of steps for one user request. A session is a multi-turn conversation made up of several traces. The right level depends on what you’re checking. With this release, every eval type (LLM-as-Judge, Code, and Agent) scores at all three, including composite evals (a single score combining multiple evals). Three smaller upgrades land alongside it. Each eval inside a composite gets its own configurable parameters. The variable mapping picker accepts custom dotted paths. And partial inputs warn instead of failing the run.
What’s new
- Trace and session coverage across every eval type. When you set up an eval task, the level you score at can be a span, a trace, or a session, no matter the eval type. Composite evals (which run several evals together as one score) score at all three levels too, with each individual eval’s result alongside the aggregate and summary. (PR #491)
- Per-eval parameters in composite evals. Evals like Word Count in Range and Regex PII Detection need their own parameters (minimum word count, regex patterns, thresholds). The composite eval panel now has an inline editor that shows one field per eval, so you can configure each one without leaving the composite. (PR #548)
- Custom dotted paths in variable mapping. The variable mapping picker on eval task setup accepts custom dotted paths in addition to the dropdown suggestions, across every eval type. Type
attributes.llm.token_count.completionor any path that isn’t pre-listed, and it resolves the same way. (PR #490) - Partial inputs are a warning, not an error. Custom evals (LLM-as-Judge, Code, Agent) run with a yellow warning when at least one mapped variable has a value. Only the all-empty case still errors. The warning is consistent across every place evals run: datasets, the playground, evals on live traces (span, trace, or session level), the SDK, and MCP tools. Built-in system evals stay strict; they still require every mapped variable to be present. (PR #545)
Why it matters
Evaluation across the agent’s full surface (every step, every run, every conversation) is now consistent across every eval type. The ask we hear most often (“score this trace on safety, factuality, and tone, then aggregate”) is now a one-step setup. Per-eval parameters unblock composite evals that combine Word Count in Range (200-600 words) with Regex PII Detection (SSN, email) without dropping into code. Custom dotted paths mean any nested OpenTelemetry attribute is mappable without waiting for a platform update. And partial-input warnings turn a hard fail into a soft signal, so a half-filled dataset doesn’t break the batch.
Who it’s for
ML and AI engineers building scorecards that span an agent run, applied AI teams running quality gates on traces and sessions, and anyone whose evals need to reach beyond a single span.
Skip Variable Mapping with Context Injection
You can now skip variable mapping in eval setup. Turn on context injection and the eval reads the context on its own, so you save the time and effort of mapping every variable. Just pick what you want the eval to see: a dataset row, a span, a trace, a session, or a voice call. This works for every eval type (LLM-as-Judge, Code, and Agent). For large traces or sessions, the eval looks at just the spans it needs, not the whole thing.
What’s new
- Skip variable mapping. When context injection is on, you don’t have to map variables at all. The eval reads the context directly. Less manual work for every eval setup. (PR #440)
- Five context surfaces, individually selectable. Toggle dataset rows, span metadata, trace structure, session structure, or voice calls (transcript, summary, scenario, timings, and call metadata) independently.
- Pick the level yourself, or let the eval pick for you. Choose the context level explicitly, or let the eval auto-select based on the project’s setup.
- Auto pre-selection inside Tasks. When you add an eval inside a Task, the right context type is pre-selected from the row type: Spans seeds span context, Traces seeds trace context, Sessions seeds session context, Voice Calls seeds call context.
- On-demand navigation through traces and sessions. For large traces or sessions, the eval looks at the underlying spans it needs instead of pasting the whole structure into the prompt.
Why it matters
Variable mapping is the slowest, most tedious step in eval setup. Context injection removes it: turn it on, pick what the eval should see, done. The eval handles the rest. And for large traces or sessions, looking at just the spans the eval needs keeps token costs down and signal high, instead of dumping the whole structure into a giant prompt.
Who it’s for
ML and AI engineers who want to set up evals fast without per-variable mapping, applied AI teams running production evals across mixed surfaces, and anyone where token cost or eval precision matters at scale.
Improvements
Two new conversation evals: Dead Air Detection and Conversation Hallucination. Dead Air Detection flags conversations with too much silence. It’s a Code eval, so it runs at zero LLM cost (signal analysis on the audio, no model call). You can tune the share of silent audio, the longest single silent gap, and a silence threshold. Conversation Hallucination scores the whole conversation, not one turn at a time. It catches invented facts and unsupported claims that build up across the full exchange, which a single-turn hallucination eval would miss. Add either to any voice or conversation project.
Eval inputs accept up to 200K characters. Long inputs no longer get clipped. LLM-as-Judge and Agent evals now accept up to 200K characters of input (was 15K). Fits long agent transcripts, large retrieved contexts, and multi-document grading.
Consistent scoring across built-in numeric evals. Built-in scoring evals (accuracy, balanced accuracy, Cohen’s kappa, F-beta, Matthews correlation, precision, and more) now normalise inputs the same way every time: case is ignored, whitespace is handled consistently, and comparing two empty values returns a perfect-match score. Several system eval rule prompts are also refined.
Eval mapping reaches span-level fields and API columns. You can now map eval variables to span-level fields like latency, cost, and tokens. Use paths like spans.0.latency_ms. The API column also accepts the same dotted-path syntax in URL, body, params, and headers. You can send a JSON column as the request body directly (no wrapping needed). And numeric array positions like items.0.role work everywhere, including LLM-as-Judge evals.
Code eval parameters: numbers in, required fields enforced. Numeric parameters on code evals (latency limits, percentage thresholds, scoring weights) accept text values from form fields, like 200 or 0.75. The values get validated against min and max bounds. Required parameters keep the Save button disabled until they’re filled. Any parameters you add to your custom evaluate function appear in the Variable Mapping panel as soon as you type them.
Deleting an eval task removes all related results. Deleting an eval task now removes its results from eval columns, charts, monitors, and filters in one step. No leftover rows or stale entries.
Eval versioning: V1 on publish, full-field snapshots, draft separation. V1 is created the first time you publish an eval, not when you save a draft. Restoring an earlier version brings back every field on the eval (output type, pass threshold, choice scores, error localisation, and tags). When you run an eval, you can change a few settings just for that one run (for example, whether the eval has internet access, or what context to inject) without saving the change to the eval. And draft evals stay in the editor; they join the eval picker only after you publish them.
Eval picker: configuration persists, source-aware naming, local preview filter. Every option you set (model, criteria, run settings, mapping) is saved and restored when you reopen the eval. Auto-generated names include the source (dataset, experiment, workbench, etc.) and a precise timestamp. The right pane has a local preview filter that shows the level you’re scoring at (span, trace, session, or voice call). The Save & Add Evaluation button becomes clickable for code evals as soon as mapping is complete.
Built-in evals work out of the box on open-source installs. Open-source installs ship every built-in eval as LLM-as-Judge, ready to run on first boot (no extra setup). On system evals, you can change the model used to judge, but the instructions stay read-only.
Cleaner eval cells across the platform. When an eval errors, the cell now shows a red ‘Error’ chip instead of looking blank. Voice and call-log grids hide Reason columns until an eval has filled them in. And eval result values display consistently across formats (number, category, choice list, or score-with-reason).
Eval tasks: detail-page edits reflect in the list automatically. Updating, pausing, resuming, or renaming an eval task from the detail page refreshes the task list. Changes show up without a hard refresh.
Evals UI: faster reruns, accurate version banner, cleaner sliders and animations. Editing an eval inside a test execution triggers a rerun. The Viewing version X banner reflects the current version state in real time. The Save Version button activates when you’ve made a change. Required-field asterisks use a consistent style across every field. Pass-threshold sliders display clean whole-number percentages. The eval picker search shows a spinner while a search runs. The in-progress step in optimisation timelines animates with progress. The Evals login carousel uses up-to-date screenshots.
Trace ID and Span ID filters on the Tasks page; consistent operators across Observe. Trace ID and Span ID filters work on the Tasks page the same way they do on the Tracing page. Filter operators on span attributes are consistent across every Observe page. Saved views with ID filters reopen with the right operator selected.
More discoverable column resize, persistent custom columns across Observe. Observe tables show a clear column resize handle at rest, brighter on hover, with a wider grab area. Custom columns added on any Observe tab (Traces, Spans, Sessions, Users, Voice Calls) persist across refresh, tab switch, and switching saved views.
Trace list: complete user data, filter chips, tag input, and overlay timing. The User column on the trace list shows the trace’s end user. Span-attribute filter chips on the Tasks page display the complete operator and value. The enter-arrow icon on the tag input adds the tag on click. Categorical eval scores display consistently across new and historical runs. The No traces found overlay is suppressed while data is loading or filters are changing. The property picker handles duplicate property names across categories cleanly.
Voice projects: usage visibility, trace-level evals, surfaced errors. Voice provider traffic shows up in usage summaries and billing reports. Trace-level evals (evals that score a trace rather than one step) show up on the trace detail page. Errored evals on voice traces show a red Error chip, kept out of score averages. Customer call logs from the voice provider appear inside the simulation call detail page. Eval tasks run end-to-end on voice calls.
Simulate: safer runs and better failure handling. Starting a simulation against a scenario that hadn’t finished generating used to give you zero connected calls and a permanent Pending. Run New Simulation now waits until every selected scenario is fully ready before it activates. The Edit Agent form used to leave you guessing whether the agent picked up provider changes; it now auto-syncs with the provider the same way Create Agent does, with a clear syncing indicator and an error notification on failure. And when every call in a run errored out, the eval bar used to disappear entirely; it now shows zero so the run still reports a meaningful result.
AI gateway: Bedrock structured output and streaming failover. If you’re calling Bedrock through the gateway, you can now use the same structured-output format you use for OpenAI and Gemini. The gateway translates it for Bedrock automatically, so you don’t need provider-specific request shaping. And streaming chat with provider failover handles a common silent failure: when the upstream stream returns nothing, the gateway switches to the fallback model before the client sees a broken response.