
Feb 3 – Feb 16, 2026 · 2026 W8

ai-evaluation 1.0, Deep Space Theme, Multi-Language SDKs, and Multimodal Workbench

ai-evaluation SDK 1.0: unified evaluate API, multimodal judge, 72+ metrics. Deep Space dark mode. C#/Java SDKs + 31 TS instrumentors. Multimodal Workbench.

Evaluate Platform SDK Simulate Monitor

72+ ai-evaluation metrics

4 SDK languages (Py, TS, C#, Java)

31 new TypeScript instrumentors

What's in this digest

Evaluate (New): ai-evaluation v1.0.0
Platform (New): Deep Space dark mode migration
SDK (New): 31 new TypeScript instrumentor packages
Platform (New): Multimodal Prompt Workbench with WebSocket streaming
Simulate (New): Graph-aware dataset scenario generation
Simulate (Improved): WebSocket simulation grid
Monitor (Improved): Langfuse integration
Monitor (Improved): Voice observability annotations
Simulate (Improved): Provider-agnostic voice simulation runs
SDK (Improved): traceAI Python SDK update
Platform (Improved): Per-tab workspace context
Evaluate (Improved): Function parameters in evaluations
Evaluate (Improved): Reasoning parameters support in prompts
Monitor (Improved): has_eval filter for traces and spans
Monitor (Improved): Workspace-scoped error analysis (Feed display)
Monitor (Improved): Optimized trace session queries

ai-evaluation v1.0.0: The Evaluation SDK, Stable

After months of iteration, the ai-evaluation SDK reaches 1.0.

What’s new

Unified evaluate API. One function call handles everything: select metrics, point to your data, get results. It works for local metrics, LLM-as-judge evaluations (where one LLM scores the outputs of another), and custom scoring functions through the same interface.
Multimodal LLM judge. Pass images, audio, and text to the judge and it scores the output against criteria you define, or criteria the SDK auto-generates from your dataset.
Auto-generated grading criteria. The SDK analyses your dataset, identifies quality dimensions that matter, and produces rubrics you can review and refine.
Vector-store-backed feedback loops. Every human annotation, every corrected score, every edge case gets stored and used to improve future evaluations. Your evaluation pipeline learns from domain expertise over time.
72+ local metrics. Factuality, relevance, coherence, safety, toxicity, bias, and dozens of domain-specific dimensions, out of the box.
OpenTelemetry integration. Every evaluation run produces OTEL spans that flow into your existing observability stack.
Streaming evaluation. Score outputs as they generate, catching issues before the full response completes.
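
To make the "same interface" idea concrete, here is a minimal, self-contained sketch. It does not use the ai-evaluation SDK: the evaluate function, the Metric type, and the metric names are all hypothetical, and the judge is a plain stand-in function rather than a real LLM call. It only illustrates how a local metric, a judge, and a custom scorer can share one call signature.

```python
from typing import Callable, Dict, List

# Hypothetical type: a metric maps one record to a score in [0, 1].
Metric = Callable[[Dict[str, str]], float]


def exact_match(record: Dict[str, str]) -> float:
    """Local metric: 1.0 when the output equals the expected answer."""
    return 1.0 if record["output"].strip() == record["expected"].strip() else 0.0


def stub_judge(record: Dict[str, str]) -> float:
    """Stand-in for an LLM judge; a real judge would call a model here."""
    return 1.0 if record["expected"].lower() in record["output"].lower() else 0.0


def max_length(limit: int) -> Metric:
    """Custom scorer factory: passes outputs no longer than `limit` characters."""
    return lambda record: 1.0 if len(record["output"]) <= limit else 0.0


def evaluate(data: List[Dict[str, str]], metrics: Dict[str, Metric]) -> Dict[str, float]:
    """Run every metric over every record and return the mean score per metric."""
    if not data:
        return {name: 0.0 for name in metrics}
    return {
        name: sum(metric(record) for record in data) / len(data)
        for name, metric in metrics.items()
    }


data = [
    {"output": "Paris", "expected": "Paris"},
    {"output": "The capital is Paris.", "expected": "Paris"},
]
results = evaluate(
    data,
    {"exact_match": exact_match, "judge": stub_judge, "short": max_length(10)},
)
print(results)
```

In this sketch every scorer is just a callable, so adding a new kind of metric does not change the evaluate call itself.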

Why it matters

Evaluation has been a stitching exercise: scoring functions here, aggregation there, reporting somewhere else. A unified API with auto-generated criteria and a feedback system that learns from corrections turns evaluation from a project into a library call.

Who it’s for

ML and AI engineers building systematic evaluation suites, data scientists developing custom rubrics, and any team integrating evaluations into production code paths.

Deep Space: A New Visual Identity

The entire Future AGI platform has been redesigned around a monochrome dark-mode theme.

What’s new

Comprehensive coverage. Every component, every surface, every interaction state redesigned, not just background color swaps.
Hierarchy through depth. Subtle luminance variations establish hierarchy instead of relying on color.
Accessibility-conscious. Critical information surfaces through contrast and positioning rather than color coding, so the interface does not depend on color perception.

Why it matters

AI development involves long debugging sessions. A theme that reduces eye strain and uses space more intentionally matters more the longer you stare at it.

Who it’s for

Everyone using Future AGI. Particularly useful for teams running long debugging sessions in low-light environments.

traceAI Goes Multi-Language

traceAI becomes a truly polyglot instrumentation platform with C#, Java, and expanded TypeScript support.

What’s new

C# SDK. Full traceAI support for .NET. ASP.NET services, Azure Functions, and standalone apps get automatic instrumentation for LLM calls, tool invocations, and agent workflows.
Java SDK. 25 instrumentation modules covering Spring Boot, Micronaut, Quarkus, Apache HttpClient, OkHttp, JDBC drivers, and the major Java-based LLM client libraries.
31 new TypeScript instrumentor packages. Covers the long tail of frameworks, databases, HTTP clients, and libraries that TypeScript agent developers rely on.
Python SDK update. New framework support and comprehensive end-to-end tests.

Why it matters

Enterprise teams on JVM or .NET no longer have to maintain a Python sidecar to get tracing. traceAI now provides native instrumentation across four programming languages: Python, TypeScript, Java, and C#.

Who it’s for

Developers integrating traceAI into non-Python stacks, and enterprise teams running JVM-based or .NET-based agent services.

Multimodal Prompt Workbench + Graph-Aware Scenario Generation

Multimodal Prompt Workbench with WebSocket streaming. The Workbench handles multimodal inputs and outputs (text, image, audio) natively. Responses stream via WebSocket so long outputs progress visibly instead of sitting behind a loading spinner.

Graph-aware dataset scenario generation. The LLM call that generates each dataset-based scenario understands which branch of the scenario graph it’s in, so scenarios stay consistent with surrounding branches instead of drifting off-topic.
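
One way to picture branch awareness is to pass the path from the graph root to the current node into the generation prompt. The sketch below is an assumption about how that could look, not the platform's implementation: the PARENTS graph, the node names, and the build_prompt helper are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical scenario graph: each node maps to its parent (None for the root).
PARENTS: Dict[str, Optional[str]] = {
    "greeting": None,
    "billing_question": "greeting",
    "refund_request": "billing_question",
    "tech_support": "greeting",
}


def branch_path(node: str) -> List[str]:
    """Return the nodes from the root down to `node`."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENTS[current]
    return list(reversed(path))


def build_prompt(node: str, dataset_row: str) -> str:
    """Compose a generation prompt that names the branch the scenario belongs to."""
    path = " > ".join(branch_path(node))
    return (
        f"Scenario branch: {path}\n"
        f"Dataset row: {dataset_row}\n"
        "Write a scenario that stays consistent with this branch."
    )


prompt = build_prompt("refund_request", "Customer was charged twice in March")
print(prompt)
```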

Simulation, Monitoring, and API

WebSocket simulation grid. Simulation results stream to the grid via WebSocket instead of polling.

Langfuse integration. Configure Langfuse directly from the platform UI. Langfuse-compatible endpoints for Vapi-routed traffic route Langfuse evaluation data into Future AGI for unified analysis.

Voice observability annotations. The annotation workflow now extends into voice observability, so you can annotate voice traces with dedicated filters.

Provider-agnostic voice simulation runs. Voice simulation runs are now provider-agnostic, with LiveKit signal monitoring giving precise call-state tracking across providers.

has_eval filter for traces and spans. Filter traces and spans by whether they have evaluations attached.
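
The semantics of such a filter can be sketched in a few lines. This is an illustration over in-memory records, not the platform's query API: the record shape, the "evals" key, and the filter_by_has_eval helper are hypothetical, and only the has_eval name comes from the entry above.

```python
from typing import Dict, List, Optional

# Hypothetical trace records; "evals" lists the evaluations attached to each one.
traces: List[Dict] = [
    {"id": "t1", "evals": [{"metric": "toxicity", "score": 0.02}]},
    {"id": "t2", "evals": []},
    {"id": "t3", "evals": [{"metric": "relevance", "score": 0.91}]},
]


def filter_by_has_eval(records: List[Dict], has_eval: Optional[bool]) -> List[Dict]:
    """Keep records whose evaluation presence matches `has_eval`; None disables the filter."""
    if has_eval is None:
        return list(records)
    return [r for r in records if bool(r["evals"]) == has_eval]


evaluated = [t["id"] for t in filter_by_has_eval(traces, True)]
unevaluated = [t["id"] for t in filter_by_has_eval(traces, False)]
print(evaluated, unevaluated)
```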

Workspace-scoped error analysis. The Error Feed surfaces failures from the active workspace, keeping triage focused on the project you’re debugging. Empty states are also cleaner for users who are still being onboarded to projects.

Optimized trace session queries. Filtering and sorting now happen in the database, not in app code, so session views load quickly even on workspaces with millions of spans.
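
The difference between the two approaches can be shown with SQLite from the Python standard library. The schema, column names, and latency threshold below are hypothetical and chosen only for illustration; the entry above does not say which database or query shape the platform uses.

```python
import sqlite3

# Hypothetical schema: one row per span, grouped into sessions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (session_id TEXT, started_at INTEGER, latency_ms INTEGER)")
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?)",
    [("s1", 100, 40), ("s1", 110, 900), ("s2", 200, 30), ("s3", 300, 1500)],
)
conn.execute("CREATE INDEX idx_spans_session ON spans (session_id)")

# In-app approach: fetch every row, then aggregate, filter, and sort in Python.
rows = conn.execute("SELECT session_id, latency_ms FROM spans").fetchall()
worst = {}
for session_id, latency in rows:
    worst[session_id] = max(worst.get(session_id, 0), latency)
in_app = sorted(
    ((sid, lat) for sid, lat in worst.items() if lat >= 500),
    key=lambda pair: -pair[1],
)

# In-database approach: the engine aggregates, filters, and sorts,
# and only the matching result rows cross into application code.
in_db = conn.execute(
    """
    SELECT session_id, MAX(latency_ms) AS worst_latency
    FROM spans
    GROUP BY session_id
    HAVING MAX(latency_ms) >= ?
    ORDER BY worst_latency DESC
    LIMIT 50
    """,
    (500,),
).fetchall()

print(in_db)
```

Both paths return the same rows here; the second one avoids transferring every span to the application, which is where the savings come from on large workspaces.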

Newer model support across providers. The model-routing layer picks up newly released models from each provider as they ship, instead of waiting for a Future AGI release. Provider-side bug fixes flow through the same path.

Agent type added to scenario API. Scenario API responses now include the agent_type field, so consumers can branch on whether a scenario was authored for a voice agent or a text agent without making a separate request.
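
A consumer might branch on the field roughly as follows. Only the agent_type field name comes from the entry above; the response body, the "voice" and "text" values, and the runner names are assumptions made for this sketch.

```python
import json

# Hypothetical response body; only the agent_type field name comes from the changelog.
# The "voice" and "text" values and the other fields are assumptions.
response_body = '{"id": "scn_123", "name": "Refund call", "agent_type": "voice"}'


def pick_runner(scenario: dict) -> str:
    """Choose a runner from agent_type without making a second request."""
    agent_type = scenario.get("agent_type")
    if agent_type == "voice":
        return "voice_simulation"
    if agent_type == "text":
        return "chat_simulation"
    raise ValueError(f"Unexpected agent_type: {agent_type!r}")


scenario = json.loads(response_body)
runner = pick_runner(scenario)
print(runner)
```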

Per-tab workspace context. Switching workspaces in one tab no longer affects others. Each tab keeps its own workspace context.

Function parameters in evaluations. Pass function parameters directly to evaluation metrics for dynamic scoring configurations.
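
As a rough illustration of parameterised scoring, the sketch below binds parameters to a metric with functools.partial from the standard library. The keyword_coverage metric and its parameters are hypothetical and are not part of the ai-evaluation SDK.

```python
from functools import partial


def keyword_coverage(output: str, keywords: list, min_hits: int = 1) -> float:
    """Hypothetical metric: 1.0 when at least `min_hits` keywords appear in the output."""
    hits = sum(1 for k in keywords if k.lower() in output.lower())
    return 1.0 if hits >= min_hits else 0.0


# Bind the parameters once, then reuse the configured metric as a one-argument scorer.
strict = partial(keyword_coverage, keywords=["refund", "invoice"], min_hits=2)
lenient = partial(keyword_coverage, keywords=["refund", "invoice"], min_hits=1)

output = "Your refund has been issued."
print(strict(output), lenient(output))
```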

Reasoning parameters in prompts. Configure reasoning-specific parameters (reasoning effort, max thinking tokens, whether to surface the reasoning trace) when working with chain-of-thought models. The same prompt can be tuned for fast and cheap or slow and thorough without rewriting it.
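
The "same prompt, two profiles" idea might look like this in code. Every field name and value here is a hypothetical stand-in for the three settings mentioned above; the platform's actual parameter names are not given in this entry.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReasoningConfig:
    # All field names and values are hypothetical stand-ins for the settings above.
    effort: str                # e.g. "low" or "high"
    max_thinking_tokens: int   # upper bound on reasoning tokens
    show_reasoning: bool       # whether to surface the reasoning trace


PROMPT = "Summarise the incident report in three bullet points."

FAST = ReasoningConfig(effort="low", max_thinking_tokens=256, show_reasoning=False)
THOROUGH = ReasoningConfig(effort="high", max_thinking_tokens=8192, show_reasoning=True)


def build_request(prompt: str, reasoning: ReasoningConfig) -> dict:
    """Pair an unchanged prompt with a reasoning profile."""
    if reasoning.max_thinking_tokens <= 0:
        raise ValueError("max_thinking_tokens must be positive")
    return {"prompt": prompt, "reasoning": asdict(reasoning)}


fast_request = build_request(PROMPT, FAST)
thorough_request = build_request(PROMPT, THOROUGH)
print(fast_request["reasoning"], thorough_request["reasoning"])
```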

Auto-refresh preference persists per user. Auto-refresh stays where you left it across reloads. No more re-toggling on every page visit.
