Perplexity ships Sonar and gpt-5.1 to self-hosted Future AGI, plus polish across Evals, Observe, and Platform
Straight from Perplexity. The team sent a PR into the open-source Future AGI repo with first-party support for every current Perplexity model on self-host. Plus a new customer-agent task-completion eval, faster audio eval runs when there's no feedback to retrieve, filter and column fixes across Observe, and a pricing and usage page that renders correctly.
Perplexity contributes Sonar and gpt-5.1 to self-hosted Future AGI
Five Sonar variants are now in the eval model picker and the AI gateway: `sonar` (fast, search-augmented chat), `sonar-pro` (the new default with 200K context), `sonar-reasoning` and `sonar-reasoning-pro` (chain-of-thought over live web results), and `sonar-deep-research` (long-form, multi-source investigations). Perplexity's Agent API adds `gpt-5.1` for agent workflows with built-in web search and tool use. Add your Perplexity API key as `PERPLEXITY_API_KEY` to enable them on self-host.
New built-in eval: customer-agent task completion
Score whether your agent actually finished the job. A new built-in eval, `customer_agent_task_completion`, reads the full conversation and returns Pass or Fail with a reason on whether a customer-facing agent completed the task the user asked for. Useful as a top-line outcome metric on voice and chat support conversations, alongside the evals you run on each individual turn.
Regex PII detection: default to all five PII types
If you leave `detect_types` empty on the Regex PII Detection eval, it now scans for all five built-in types (emails, phone numbers, SSNs, credit cards, IP addresses) instead of returning nothing. You can still narrow the list per eval.
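The new fallback can be sketched roughly as below. This is an illustration, not the eval's actual code: the function name `detect_pii`, the type keys, and the deliberately simple regex patterns are all assumptions made for the example. Only the `detect_types` parameter and the five categories come from the note above.

```python
import re

# Hypothetical, simplified patterns for the five built-in types.
# The patterns the real eval uses are not part of this changelog.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def detect_pii(text, detect_types=None):
    """Return {type: [matches]} for each requested type that has a match."""
    # An empty or missing list means "scan for every built-in type".
    types = list(detect_types) if detect_types else list(PII_PATTERNS)
    found = {}
    for pii_type in types:
        matches = PII_PATTERNS[pii_type].findall(text)
        if matches:
            found[pii_type] = matches
    return found
```

Calling `detect_pii(text, [])` and `detect_pii(text)` scan all five types, while `detect_pii(text, ["email"])` narrows the scan to one.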
Eval type persists correctly when creating LLM and Code evals
Creating a new LLM-as-Judge or Code eval now writes the right eval type to the row, instead of defaulting to a placeholder. Reopening the eval picks up the correct type. Templates and saved evals behave the same.
Eval config auto-populates from the baseline column
When you add an eval inside an experiment and pick a baseline column, the eval settings (model, criteria, output type) now auto-fill from that baseline. Faster setup when you're rerunning an existing eval on a new run.
Eval workflow: a batch of refinements
Variables in your prompt highlight when you connect them to data in the test panel. Turning data injection on or off is now a clean toggle. Once you pick how the eval scores (pass/fail, numeric, choices), that choice is locked, but the specific labels, score numbers, and thresholds stay editable. When you upload expected answers as ground-truth data, you can watch the processing progress in real time. `Pass Rate` is now called `Task Completion Rate` across all stats and charts.
Eval picker: stale data prevented, mapping disabled while columns load
The eval picker no longer flashes stale data when you reopen it after switching projects. Variable mapping rows are disabled while the source columns are still loading, so you can't map to a column that hasn't arrived yet.
Custom eval URLs: open on default version, ?v= survives, task errors expandable
Opening a custom eval URL without a version parameter now lands on the default version. A `?v=` parameter in the URL is preserved across navigation inside the eval. When an eval task fails, the error panel has a Show more toggle that expands the full message.
Audio eval runs are ~3.5x faster when there's no feedback to retrieve
Evals that use feedback-based retrieval now skip the lookup when there are no feedback rows yet. On audio-heavy projects with no feedback corpus, a single eval run is roughly 3.5x faster (about a 70% drop in time).
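The two figures describe the same measurement. A quick check of the arithmetic, where the helper name `time_reduction` exists only for this example:

```python
def time_reduction(speedup):
    """Fraction of wall-clock time saved when a run is `speedup` times faster."""
    # A run that is N times faster takes 1/N of the original time.
    return 1.0 - 1.0 / speedup


print(f"{time_reduction(3.5):.0%}")  # prints 71%
```

A 3.5x speedup works out to roughly 71% less time per run, which the note rounds to about 70%.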
Clear error when an eval can't reach a media URL
When an eval fails because it can't fetch an image or audio file from the URL you mapped, the error message now says so and points at the URL, instead of returning a generic backend error. Easier to fix mapping mistakes and broken links.
Dataset evals with template variables in instructions no longer crash
Dataset evals whose instructions contain `{{variable}}` placeholders no longer crash trying to resolve those placeholders as IDs. The instructions stay untouched and the eval runs cleanly.
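One way to picture the fix is a resolver that only rewrites placeholders it recognises as IDs and passes everything else through. The sketch below is an assumption about the shape of the problem, not the shipped code: `resolve_placeholders`, the `columns_by_id` lookup, and the short ID `3f2a` are all hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def resolve_placeholders(text, columns_by_id):
    """Swap {{<column id>}} for its column name; leave everything else alone."""
    def swap(match):
        key = match.group(1)
        if key in columns_by_id:
            return "{{" + columns_by_id[key] + "}}"
        # Plain template variables are returned verbatim instead of
        # triggering a failed ID lookup.
        return match.group(0)

    return PLACEHOLDER.sub(swap, text)
```

With `columns_by_id = {"3f2a": "question"}`, the text `Answer {{3f2a}} using {{variable}}` becomes `Answer {{question}} using {{variable}}`, and instructions that contain only template variables come back unchanged.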
semantic_list_contains handles numeric expected values
The `semantic_list_contains` comparator no longer crashes when the expected value is a number (for example, an ID or a quantity). Numbers are coerced to strings before comparison.
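The coercion step can be sketched as follows. The helper names are hypothetical, and the case-insensitive substring check is only a stand-in for the real semantic match, which this changelog does not describe.

```python
def normalize_expected(expected):
    """Coerce every expected item to a string so numbers compare like text."""
    items = expected if isinstance(expected, (list, tuple)) else [expected]
    return [str(item) for item in items]


def list_contains(response, expected):
    """True when every expected item appears in the response."""
    # Stand-in for the semantic comparison: case-insensitive substring match.
    haystack = response.lower()
    return all(item.lower() in haystack for item in normalize_expected(expected))
```

For example, `list_contains("Order 4521 ships 3 units", [4521, "units"])` now compares `"4521"` and `"units"` as text instead of failing on the integer.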
Static few-shot examples reach the LLM eval
Few-shot example rows stored on a dataset now expand into the prompt for the LLM eval, instead of being passed as metadata only. Few-shot prompting on dataset-backed evals works as documented.
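As a rough picture of what "expand into the prompt" means, the sketch below renders stored rows as worked examples ahead of the real input. The function name, the `input` and `output` row keys, and the prompt layout are assumptions for illustration. The actual template is not shown in this changelog.

```python
def build_prompt(instructions, few_shot_rows, query):
    """Render stored few-shot rows as worked examples ahead of the real input."""
    parts = [instructions]
    for row in few_shot_rows:
        # Each stored row becomes a visible example in the prompt text,
        # rather than travelling alongside it as metadata.
        parts.append(f"Input: {row['input']}\nOutput: {row['output']}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)
```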
Voice call mapping reaches deeper into the raw log
When voice eval mapping can't resolve a field in the structured trace data, it now falls back to the raw call log. More voice rows map cleanly without manual intervention.
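The fallback order can be sketched like this. The dotted-path lookup, the function names, and the example field names are assumptions. The real mapping logic may resolve fields differently.

```python
def get_path(data, dotted_path):
    """Walk a dotted path through nested dicts; return None if a key is missing."""
    current = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def resolve_field(dotted_path, structured, raw_log):
    """Prefer the structured trace data, then fall back to the raw call log."""
    value = get_path(structured, dotted_path)
    # The raw log is consulted only when the structured data has no value.
    return value if value is not None else get_path(raw_log, dotted_path)
```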
PDF filter chip in the eval picker is populated
Evals that accept PDF input are now tagged so the PDF filter chip in the eval picker actually lists them. The chip previously returned no results; it now lists every PDF-capable eval.
Confirmation dialog before deleting an eval template
Deleting an eval template now asks for confirmation, so a stray click on the row action no longer deletes the template.
Pass and Fail chips on trace eval drawer
When an eval returns a Pass or Fail result, the trace eval drawer shows a Pass or Fail chip instead of 100% or 0%. Makes the outcome obvious at a glance.
Multi-choice eval output type flows through the playground and picker
If an eval is configured as multi-choice, that flag is now propagated through the playground run and the eval picker, so the run is scored as multi-choice end to end.
Error localiser: Show more works and composite evals are hidden
The Show more toggle on the error localiser now expands the full reason for each highlighted span. The error localiser panel is hidden for composite evals, where it doesn't apply.
Fix-with-Falcon hides on label-field eval rows that already pass
On evals where the result is stored in the label field, the Fix-with-Falcon button no longer shows up on rows that already pass. Cleans up the action menu on passing rows.
Eval usage rollups: date range and session-target row exclusion
The `/eval-task/get_usage/` endpoint accepts a date range so you can scope usage to any window. Rows targeting sessions are now excluded from the per-trace usage rollup, so the count matches what you see in the UI.
Data injection on system evals checks your variable mappings
When you turn on data injection for a system eval, every required variable (like the question and expected answer) now needs to be mapped before you can save. The eval is ready to run as soon as you create it.
Long eval and label names are readable on hover
Truncated names in the Evals list and truncated labels in the Eval Usage detail panel now show the full text in a hover tooltip. The Evals list grid also fills its container properly instead of clipping.
Trace Name and Span Name filter dropdowns populate suggestions
The Trace Name and Span Name filter pickers on Observe now show the list of names from the current project as suggestions, instead of an empty dropdown.
Span-attached annotations appear in the Annotations filter category
Annotations applied at the span level (not just the trace level) are now included in the metrics endpoint that powers the filter picker. The Annotations category in the filter dropdown is populated for span-only score rows.
Text filters are case-insensitive across Observe
Filters on text fields (Trace Name, Span Name, model, provider, span attribute strings) now match regardless of case. Searching for `gpt-4o` finds rows tagged `GPT-4O`.
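The matching rule is equivalent to normalising both sides before comparing. The sketch below shows it in plain Python. The `matches` helper and its `op` values are hypothetical, and the backend may well do this in the query layer instead.

```python
def matches(value, query, op="equals"):
    """Case-insensitive text match; `op` is 'equals' or 'contains'."""
    # casefold() is a more aggressive lower() intended for caseless matching.
    left, right = value.casefold(), query.casefold()
    return right in left if op == "contains" else left == right


rows = [{"model": "GPT-4O"}, {"model": "claude-3"}]
hits = [row for row in rows if matches(row["model"], "gpt-4o")]
```

Here `hits` contains only the `GPT-4O` row.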
Trace ID and Span ID filters render as single-select
The Trace ID and Span ID filters now render as a single-select control (a radio, no chips, no +N count) since a span belongs to one trace at a time. Reloading and then editing a saved filter with an ID also returns the right rows instead of zero.
Cleared filters stay cleared after a page refresh
Clearing a filter on Observe now sticks. The filter no longer reappears when you refresh the page or navigate away and come back.
Annotation filters: operator dropdown, chip label, and task filter
Categorical and thumbs-up annotation filters now show the right operator dropdown after the page hydrates. The chip label matches the selected operator. The Tasks filter returns rows on thumbs annotations instead of returning empty.
Add and edit filter chips on Sessions, Users, and User Traces
The plus icon to add a filter chip, and click-to-edit on existing chips, now work on the Sessions, Users, and User Traces tabs. Previously these actions only worked on the Traces and Spans tabs.
Column order persists across auto-refresh on every Observe grid
When you reorder columns by dragging on Traces, Spans, Sessions, or Call Logs, the new order survives the next auto-refresh and stays in sync with the display panel.
Consistent grid theme across Traces, Spans, Sessions, and Users
Header text colour, row hover state, and cell padding are now identical across all four Observe grids (Traces, Spans, Sessions, Users), instead of each grid having a slightly different look.
Call Logs grid: autosize works and Call ID column respects the width you set
On the Call Logs grid (used by voice projects), the autosize columns action now fits column widths to content instead of being a no-op. Widening the Call ID column also shows more of the UUID, instead of clamping at 130 pixels and truncating in the middle of the value.
Hover tooltip on long span-attribute column keys in Live Preview
Long span-attribute paths in the Task Live Preview now show the full key as a tooltip on hover, so you can read the path without resizing the column.
Single loading spinner on the Users page
The Users page no longer shows two stacked loading spinners on initial mount. One spinner appears while data loads, then the rows render.
Save View buttons visible, full tab name preserved
The Save View button is now visible against the dark theme on every Observe tab. Long view names truncate visually with an ellipsis instead of being cut down to a shorter stored string, so the full name stays in state.
Agent Graph and Agent Path: real fullscreen, disabled on voice projects
The fullscreen button on the Agent Graph and Agent Path views now opens a real fullscreen view of the graph, and dead zoom controls on Agent Path are removed. On voice projects, where there's no agent graph to render, both tabs are now disabled with a tooltip explaining why, instead of opening to a black screen.
Tasks list chip hover and popover stay usable
Cell chips on the Tasks list now keep their hover colour while the cursor moves onto the popover, so you can read the popover without it closing the moment you move toward it.
Eval tasks list refreshes while rows are pending or running
The eval tasks list now polls while any row is pending or running, so progress updates appear without a manual refresh. Relative-time cells (`2 minutes ago`) also refresh on each poll instead of going stale.
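The polling rule ("keep refreshing while anything is pending or running") can be sketched as a small loop. This is a Python illustration of the idea, not the UI code: the function name, the interval, the poll cap, and the `completed` status are assumptions. Only `pending` and `running` come from the note above.

```python
import time

ACTIVE_STATUSES = {"pending", "running"}


def poll_until_settled(fetch_rows, interval_s=2.0, max_polls=100, sleep=time.sleep):
    """Re-fetch the task list until no row is pending or running."""
    rows = fetch_rows()
    polls = 1
    while polls < max_polls and any(r["status"] in ACTIVE_STATUSES for r in rows):
        sleep(interval_s)
        rows = fetch_rows()  # each poll returns fresh rows, including timestamps
        polls += 1
    return rows, polls
```

Polling stops as soon as every row has settled, so an idle list costs a single fetch.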
Voice projects: call recording on the Error Feed cluster overview
On voice projects, the Error Feed cluster Overview tab now shows the call recording in place of the agent flow diagram, so you can listen to the call that produced the error.
agent_talk_percentage is filterable on voice traces
`agent_talk_percentage` is now a first-class system metric for voice projects. The filter dropdown matches what you see in the grid (`40%` as a single metric), instead of showing `0.4` as a custom attribute.
Faster Users tab and session detail loads
The Users tab now loads instead of timing out on workspaces with high session volume, and session detail no longer returns a 400 on slow queries. Both pages reach the data through a faster path on the backend.
Reliable trace ingestion: custom user IDs and span PK retries
Two ingestion fixes: traces with custom user IDs from the SDK no longer fail to ingest (the `CUSTOM` enum value was restored), and bulk span writes now tolerate primary-key collisions during retry races instead of dropping the whole batch.
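To illustrate the second fix, the sketch below uses SQLite's `INSERT OR IGNORE` as a stand-in for whatever conflict handling the real span store uses. The table layout and function name are assumptions made for the example.

```python
import sqlite3


def bulk_write_spans(conn, spans):
    """Insert (id, name) rows, skipping any whose primary key already exists."""
    before = conn.total_changes
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR IGNORE INTO spans (id, name) VALUES (?, ?)", spans
        )
    return conn.total_changes - before  # number of rows actually written


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (id TEXT PRIMARY KEY, name TEXT)")
first = bulk_write_spans(conn, [("a", "llm"), ("b", "tool")])
# A retry that overlaps the first batch no longer fails as a whole:
# the duplicate "b" is skipped and the new "c" is still written.
retry = bulk_write_spans(conn, [("b", "tool"), ("c", "retriever")])
```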
Voice eval fan-out: longer scheduling window
The scheduling window for voice eval fan-out is now twelve hours, so large voice runs don't lose rows when individual evals queue for a long time before starting.
Pricing and usage page: correct rounding, units, and free-tier savings
The pricing and usage page now rounds small numbers correctly, shows the right units on per-call costs, and surfaces free-tier savings as a separate line. Sub-1 usage and sub-cent costs render legibly. The legend is visible and the period caption matches the selected window.
Faster, more reliable signup with reCAPTCHA
The signup form now waits for reCAPTCHA to be ready before allowing submit, so fast typists no longer hit a `reCAPTCHA not ready` error. The page also preconnects to the reCAPTCHA host, so the script loads sooner.
Commit action surfaced on the prompt workbench
The Commit button on the prompt workbench is now on the main toolbar, not buried in the More menu. Saved queries are invalidated right after a commit, so the new version shows up immediately in the version list.
Better text contrast across Agent, Persona, Scenarios, Test Detail, Eval Picker, and Error Feed
Secondary text colour has been adjusted across the Agent builder, Persona, Scenarios, Test Detail, Eval Picker, and Error Feed pages so labels and descriptions read clearly against the dark theme.
Dataset media fetches no longer hang
Image and audio fetches on dataset rows now have a request timeout, so a slow or unreachable URL fails fast instead of hanging the eval indefinitely.
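A minimal sketch of a fetch with a timeout, using only the standard library. The function name, the `MediaFetchError` class, and the 10-second default are assumptions. The timeout the platform actually uses is not stated here.

```python
import urllib.request


class MediaFetchError(Exception):
    """Raised when a mapped media URL cannot be fetched in time."""


def fetch_media(url, timeout_s=10.0):
    """Download `url`, failing fast if the server is slow or unreachable."""
    try:
        # The timeout applies to each blocking socket operation
        # (connect, read), not to the download as a whole.
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.read()
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        raise MediaFetchError(f"could not fetch media from {url}: {exc}") from exc
```

Wrapping the failure in one exception type that names the URL keeps the cause visible to whoever mapped the column.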
ffmpeg calls time out on malformed audio
ffmpeg subprocess calls inside voice and storage paths now have a timeout, so a malformed audio file can't hang a worker indefinitely.
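The same idea for subprocesses, sketched with `subprocess.run`. The wrapper name and the 60-second default are assumptions. The timeout values used in the voice and storage paths are not given in this note.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=60.0):
    """Run `cmd`; return its exit code, or None if it exceeded `timeout_s`."""
    try:
        completed = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so no process is left behind after a timeout.
        return None
    return completed.returncode
```

A call such as `run_with_timeout(["ffmpeg", "-i", "in.wav", "out.mp3"], timeout_s=120)` (file names are placeholders) then returns `None` instead of blocking the worker when ffmpeg stalls on a bad file.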
Agentic eval Azure callback handles missing invocation params
The agentic eval Azure callback no longer crashes when `invocation_params` comes back as `None`. Useful when Azure returns a short-form response that omits the params block.
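The defensive pattern involved is small. In the sketch below, the function name and the `model` key are hypothetical. Only `invocation_params` comes from the note above.

```python
def extract_model_name(callback_kwargs):
    """Read the model from invocation params, tolerating a missing or None block."""
    # `or {}` covers both an absent key and an explicit None value, so the
    # lookup below never runs against None.
    params = callback_kwargs.get("invocation_params") or {}
    return params.get("model", "unknown")
```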