2026 W20

Self-Host in One Command, Jinja2 Prompts, and Polish Across Evals and Observability

Self-host Future AGI in one command with pre-built images and a Windows installer. Prompts now support Jinja2 alongside Mustache. Plus reliable access to voice recordings, annotation and eval scores exported alongside traces into datasets, a Request Explorer in the AI gateway, nested JSON access in eval variables, faster session lists, and a long list of polish across Observe and evals.

Platform Monitor Evaluate API
~30s self-host pull-to-up
5 pre-built service images
Mustache + Jinja2 prompt templating
14 improvements across evals, observe, and voice

Self-Host Future AGI in One Command

Self-hosting Future AGI is one script. Clone the repo, run bin/install, log in. The script pulls pre-built service images from Docker Hub and brings the entire stack up in about 30 seconds. The same script works on macOS, Linux, and Windows (through Git Bash, WSL, or a new native PowerShell installer). You’ll need Docker, Docker Compose, and at least 8 GB of RAM.

What’s new

  • Pre-built service images on Docker Hub. bin/install pulls five pre-built images instead of building locally, so the stack boots in about 30 seconds. (PR #357, PR #281)
  • Native Windows PowerShell installer. A new PowerShell installer brings parity with the Unix installer on Windows, alongside the existing Git Bash and WSL paths. (PR #373)
  • Light or full install profile. Light runs the minimum to use the app (about 12 containers). Full adds analytics services so Observe and Trace populate (about 22 containers). Default to light; switch to full when you need the analytics views. (PR #253)
  • First account from the command line. A new create_user command sets up your first login without an email server configured. (PR #201)

Why it matters

Teams evaluating self-host options judge the product on the first install. One script and about 30 seconds means a fresh clone reaches a working dashboard before most evaluation calendars even start. Account creation from the command line means the install ends at “logged in,” not “now configure SMTP.”

Who it’s for

Platform and infrastructure teams running self-host evaluations, engineers shipping the open-source distribution behind a VPC or in an air-gapped environment, and Windows developers running native or through WSL.

Read the docs →

Prompts Now Support Jinja2 Templates Alongside Mustache

Most prompt editors only let you fill in variables: {{ name }} becomes the value of name, and that’s it. Real prompts often need more than that. Include a safety reminder only when the input mentions a regulated topic. List every tool the agent has access to, one per line. Pick a different set of few-shot examples based on the task. Jinja2 brings that kind of logic into the prompt itself. (PR #163)

What’s new

  • Conditional branches. Use {% if %} / {% elif %} / {% else %} to include or skip parts of the prompt based on a variable.
  • Loops. Use {% for %} to iterate over arrays (cart items, search results, agent observations) and emit one block per element.
  • Filters. Transform values inline: uppercase, length, default values, JSON formatting, and more.
  • Mustache still works. The existing {{ variable }} syntax keeps working. A new Template Format dropdown picks between the two per prompt.
  • Available everywhere prompts are written. Prompt workbench, run prompt view, and Agent Playground node forms. Variable extraction for the inputs panel works correctly in both modes.
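The features above can be sketched with a short Jinja2 template rendered from Python. This is an illustrative example only: the variable names (`regulated`, `tools`, `name`) are made up for the sketch, not part of the platform's schema.

```python
from jinja2 import Template

# One template exercising an {% if %} branch, a {% for %} loop,
# and two filters (upper, length). Variable names are illustrative.
prompt = Template(
    "{% if regulated %}Reminder: follow compliance policy.\n{% endif %}"
    "Available tools:\n"
    "{% for tool in tools %}- {{ tool }}\n{% endfor %}"
    "Hello, {{ name | upper }}. You have {{ tools | length }} tools."
)

print(prompt.render(regulated=True, tools=["search", "calculator"], name="ada"))
# Reminder: follow compliance policy.
# Available tools:
# - search
# - calculator
# Hello, ADA. You have 2 tools.
```

Flipping `regulated` to `False` drops the compliance line entirely, which is exactly the kind of branching that previously required a second copy of the prompt.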

Why it matters

Prompts that need conditional logic used to require duplicating the prompt for each variation, or pre-processing the template in code. With Jinja2, that logic lives where the prompt lives, so whoever writes the prompt can change it without filing a code change.

Who it’s for

AI and prompt engineers whose prompts have outgrown plain variable substitution, applied AI teams building agents that need conditional flow inside the prompt, and anyone porting templates from a Python codebase that already uses Jinja2.

Read the docs →

Improvements

Reliable voice evals, no dependency on provider URLs. Future AGI now stores its own copy of every recording the moment the call arrives. Evals on voice traces run against audio you control, with no dependency on the original provider URL, so traces, replays, and eval reruns stay valid as long as you keep the trace. Storing your own copy is on by default for every voice provider and can be switched off per provider.

Export annotations and eval scores alongside traces to datasets. Pick which annotation scores and eval results to export alongside traces when sending them to a dataset. The selected metrics arrive as dedicated columns in the dataset, so it’s immediately ready for experiments with baseline scores, training data, or downstream exports — no re-running the evals.

Filter and export AI gateway request logs. A new Request Explorer in the AI gateway logs lets you filter every request by model, status, or metadata and export the result. Useful for cost audits, error investigations, and compliance pulls.

Reach into nested JSON fields from eval and prompt variables. Eval and run-prompt variables accept dotted paths. A single column like payload can feed payload.user, payload.request.id, and so on, with no flattening upstream.
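One way to picture dotted-path resolution, as a sketch only: the function and sample data below are hypothetical, not the platform's implementation.

```python
def resolve(path: str, row: dict):
    """Walk a dotted path like 'payload.request.id' through nested dicts.
    Illustrative sketch; the platform's actual resolution rules may differ."""
    value = row
    for key in path.split("."):
        value = value[key]
    return value

# A single 'payload' column can feed several variables:
row = {"payload": {"user": "ada", "request": {"id": "req-42"}}}
print(resolve("payload.user", row))        # ada
print(resolve("payload.request.id", row))  # req-42
```

The point is that the nesting stays in the column; nothing upstream has to flatten `payload` into separate `payload_user` and `payload_request_id` columns.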

Filter Observe by trace ID or span ID. Paste a trace ID or span ID into the Observe filter bar to jump straight to that trace or span. The view-columns picker also has a new Select all toggle.

Faster session list on high-volume traces. The session list loads in seconds on workspaces ingesting millions of traces. No more waiting at a spinner to pick up a session for review.

API columns: smoother edits, visible progress. Edits to an API column’s URL, method, headers, params, or body now sync across devices and browsers. Requests run in batches, so rows fill in as each batch completes and the column’s progress is visible while it runs. Param and header keys with underscores (like api_key) display exactly as you typed them.

Cleaner eval columns. Inside eval column groups, Result now appears before Reason. Each eval column displays the template’s current default version instead of defaulting to V1. Eval scores of exactly 0 render distinctly, and column averages refresh after you rearrange columns.

Add traces to annotation queues from any source. Traces sent via our SDK or OpenTelemetry (OTLP) can be queued for annotation from the trace list. Dataset rows added to queues no longer get stuck loading. CSV exports include annotator notes.

Trace attributes: long values expand, rows are easier to scan. Long string values in the trace attributes panel are click-to-expand instead of clipped. Dividers between rows make it clearer where one attribute ends and the next begins.

Error Feed comes to voice and simulation projects. Error clusters from evals are now available on voice and simulation projects, visible across the summary, trend charts, and the trace panel. Clicking a voice trace opens the voice call panel.

Voice analytics: consistent units and a tighter default view. Latency, silence, and Time to First Word display in milliseconds across the call analytics panel and the call-logs table for at-a-glance comparison. Per-role talk-time percentages and Talk Ratio sit in the call detail panel for every transcript format, with Talk Ratio hidden by default to keep the view tight.

Workspace invites work for existing org members. Inviting an existing org member to a new workspace sends the email and adds the workspace to their list straight away.

j and k row-navigation shortcuts no longer hijack text input. The j and k keys (used to move between rows in eval and task detail panels) now yield to focused text inputs, so the letters land in your comment instead of moving the row selection.