ai-evaluation 1.0, Deep Space Theme, Multi-Language SDKs, and Multimodal Workbench
The ai-evaluation SDK hits 1.0 with a unified evaluate API, multimodal LLM judge, and 72+ metrics. Deep Space brings a redesigned dark mode. traceAI ships C# and Java SDKs plus 31 new TypeScript instrumentor packages. And the Prompt Workbench goes multimodal with WebSocket streaming.
ai-evaluation v1.0.0: The Evaluation SDK, Stable

After months of iteration, the ai-evaluation SDK reaches 1.0.
What’s new
- Unified evaluate API. One function call handles everything: select metrics, point to your data, get results. Works for local metrics, LLM-as-judge evaluations (where one LLM scores the outputs of another), and custom scoring functions through the same interface.
- Multimodal LLM judge. Pass images, audio, and text to the judge and it scores the output against criteria you define, or criteria the SDK auto-generates from your dataset.
- Auto-generated grading criteria. The SDK analyses your dataset, identifies quality dimensions that matter, and produces rubrics you can review and refine.
- Vector-store-backed feedback loops. Every human annotation, every corrected score, every edge case gets stored and used to improve future evaluations. Your evaluation pipeline learns from domain expertise over time.
- 72+ local metrics. Factuality, relevance, coherence, safety, toxicity, bias, and dozens of domain-specific dimensions, out of the box.
- OpenTelemetry integration. Every evaluation run produces OTEL spans that flow into your existing observability stack.
- Streaming evaluation. Score outputs as they generate, catching issues before the full response completes.
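To make the "one function call" idea concrete, here is a minimal, self-contained sketch of the pattern: built-in metrics are looked up by name, custom scoring functions are passed as callables, and both go through the same entry point. This is not the ai-evaluation SDK's actual API. The `evaluate` signature, the `REGISTRY`, the `EvalResult` type, and the two toy metrics are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Union

# Hypothetical types for illustration; none of these names come from the SDK.
Row = Dict[str, str]
Metric = Callable[[Row], float]


def exact_match(row: Row) -> float:
    """Toy local metric: 1.0 when the output equals the expected answer."""
    return 1.0 if row["output"].strip() == row["expected"].strip() else 0.0


def length_ratio(row: Row) -> float:
    """Toy local metric: output length relative to expected length, capped at 1.0."""
    expected_len = max(len(row["expected"]), 1)
    return min(len(row["output"]) / expected_len, 1.0)


# Built-in metrics addressable by name (an assumed design, not the SDK's registry).
REGISTRY: Dict[str, Metric] = {
    "exact_match": exact_match,
    "length_ratio": length_ratio,
}


@dataclass
class EvalResult:
    per_row: List[Dict[str, float]]  # one score dict per input row
    mean: Dict[str, float]           # average score per metric


def evaluate(data: Iterable[Row], metrics: List[Union[str, Metric]]) -> EvalResult:
    """Resolve metric names or callables, score every row, and average per metric."""
    resolved: Dict[str, Metric] = {}
    for metric in metrics:
        if isinstance(metric, str):
            resolved[metric] = REGISTRY[metric]
        else:
            resolved[metric.__name__] = metric

    rows = list(data)
    per_row = [{name: fn(row) for name, fn in resolved.items()} for row in rows]
    mean = {
        name: (sum(scores[name] for scores in per_row) / len(per_row)) if per_row else 0.0
        for name in resolved
    }
    return EvalResult(per_row=per_row, mean=mean)


def mentions_paris(row: Row) -> float:
    """Custom scoring function supplied by the caller."""
    return 1.0 if "Paris" in row["output"] else 0.0


dataset = [
    {"output": "Paris", "expected": "Paris"},
    {"output": "Lyon", "expected": "Paris"},
]
result = evaluate(dataset, ["exact_match", mentions_paris])
print(result.mean)  # {'exact_match': 0.5, 'mentions_paris': 0.5}
```

An LLM-as-judge metric would fit the same shape: a callable that sends the row to a judge model and returns a score. Check the SDK documentation for the real function names and arguments.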
Why it matters
Evaluation has been a stitching exercise: scoring functions here, aggregation there, reporting somewhere else. A unified API with auto-generated criteria and a feedback system that learns from corrections turns evaluation from a project into a library call.
Who it’s for
ML and AI engineers building systematic evaluation suites, data scientists developing custom rubrics, and any team integrating evaluations into production code paths.
Deep Space: A New Visual Identity
The entire Future AGI platform has been redesigned around a monochrome dark-mode theme.
What’s new
- Comprehensive coverage. Every component, every surface, every interaction state redesigned, not just background color swaps.
- Hierarchy through depth. Subtle luminance variations establish hierarchy instead of relying on color.
- Accessibility-conscious. Critical information surfaces through contrast and positioning rather than color coding, making the interface more accessible.
Why it matters
AI development involves long debugging sessions. A theme that reduces eye strain and uses space more intentionally matters more the longer you stare at it.
Who it’s for
Everyone using Future AGI. Particularly useful for teams running long debugging sessions in low-light environments.
traceAI Goes Multi-Language
traceAI becomes a truly polyglot instrumentation platform with C#, Java, and expanded TypeScript support.
What’s new
- C# SDK. Full traceAI support for .NET. ASP.NET services, Azure Functions, and standalone apps get automatic instrumentation for LLM calls, tool invocations, and agent workflows.
- Java SDK: 25 instrumentation modules. Spring Boot, Micronaut, Quarkus, Apache HttpClient, OkHttp, JDBC drivers, and the major Java-based LLM client libraries.
- 31 new TypeScript instrumentor packages. Covers the long tail of frameworks, databases, HTTP clients, and libraries that TypeScript agent developers rely on.
- Python SDK update. New framework support and comprehensive end-to-end tests.
Why it matters
Enterprise teams on JVM or .NET no longer have to maintain a Python sidecar to get tracing. traceAI now provides native instrumentation across four programming languages: Python, TypeScript, Java, and C#.
Who it’s for
Developers integrating traceAI into non-Python stacks, and enterprise teams running JVM-based or .NET-based agent services.
Multimodal Prompt Workbench + Graph-Aware Scenario Generation
Multimodal Prompt Workbench with WebSocket streaming. The Workbench handles multimodal inputs and outputs (text, image, audio) natively. Responses stream via WebSocket so long outputs progress visibly instead of sitting behind a loading spinner.
Graph-aware dataset scenario generation. The LLM call that generates each dataset-based scenario understands which branch of the scenario graph it’s in, so scenarios stay consistent with surrounding branches instead of drifting off-topic.
Simulation, Monitoring, and API
WebSocket simulation grid. Simulation results stream to the grid via WebSocket instead of polling.
Langfuse integration. Configure Langfuse directly from the platform UI. Langfuse-compatible endpoints for Vapi-routed traffic route Langfuse evaluation data into Future AGI for unified analysis.
Voice observability annotations. Annotation workflow now extends into voice observability. Annotate voice traces with dedicated filters.
Provider-agnostic voice simulation runs. Voice simulation runs no longer depend on a specific provider, and LiveKit signal monitoring gives precise call-state tracking across providers.
has_eval filter for traces and spans. Filter traces and spans by whether they have evaluations attached.
Workspace-scoped error analysis. The Error Feed surfaces failures from the active workspace, keeping triage focused on the project you’re debugging. Users still being onboarded to projects also see cleaner empty states.
Optimized trace session queries. Filtering and sorting now happen in the database, not in app code, so session views load quickly even on workspaces with millions of spans.
Newer model support across providers. The model-routing layer picks up newly released models from each provider as they ship, instead of waiting for a Future AGI release. Provider-side bug fixes flow through the same path.
Agent type added to scenario API. Scenario API responses now include the agent_type field, so consumers can branch on whether a scenario was authored for a voice agent or a text agent without making a separate request.
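As a rough illustration of the agent_type item above, a consumer might branch on the field like this. Only the `agent_type` field name comes from these notes. The string values `"voice"` and `"text"`, the rest of the payload, and the runner names are assumptions for the sake of the sketch.

```python
import json
from typing import Any, Dict


def pick_runner(scenario: Dict[str, Any]) -> str:
    """Choose a (hypothetical) runner name from the scenario's agent_type field."""
    agent_type = scenario.get("agent_type")
    # Assumed values: the notes mention voice and text agents, but the exact
    # strings returned by the API are not stated there.
    if agent_type == "voice":
        return "voice_runner"
    if agent_type == "text":
        return "text_runner"
    raise ValueError(f"unexpected or missing agent_type: {agent_type!r}")


# Hypothetical response body; every field except agent_type is invented.
payload = json.loads('{"id": "scn_1", "name": "Refund request", "agent_type": "voice"}')
print(pick_runner(payload))  # voice_runner
```

Raising on an unknown value, rather than silently defaulting, keeps the consumer from mis-routing scenarios if further agent types are added later.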
Platform Refinements
Per-tab workspace context. Switching workspaces in one tab no longer affects others. Each tab keeps its own workspace context.
Function parameters in evaluations. Pass function parameters directly to evaluation metrics for dynamic scoring configurations.
Reasoning parameters in prompts. Configure reasoning-specific parameters (reasoning effort, max thinking tokens, whether to surface the reasoning trace) when working with chain-of-thought models. The same prompt can be tuned for fast and cheap or slow and thorough without rewriting it.
Auto-refresh preference persists per user. Auto-refresh stays where you left it across reloads. No more re-toggling on every page visit.
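For the reasoning parameters item above, the "same prompt, different tuning" idea can be sketched as a small config object attached to an unchanged prompt. The class, field names, allowed effort levels, and token budgets below are all hypothetical. They mirror the three settings named in the notes (reasoning effort, max thinking tokens, whether to surface the reasoning trace) but are not the platform's actual schema.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ReasoningConfig:
    """Hypothetical container for reasoning-specific prompt parameters."""
    effort: str               # assumed levels: "low", "medium", "high"
    max_thinking_tokens: int  # upper bound on tokens spent reasoning
    show_reasoning: bool      # whether to surface the reasoning trace

    def __post_init__(self) -> None:
        if self.effort not in {"low", "medium", "high"}:
            raise ValueError(f"unknown effort level: {self.effort!r}")
        if self.max_thinking_tokens <= 0:
            raise ValueError("max_thinking_tokens must be positive")


# Two illustrative presets; the numbers are placeholders, not recommendations.
FAST = ReasoningConfig(effort="low", max_thinking_tokens=512, show_reasoning=False)
THOROUGH = ReasoningConfig(effort="high", max_thinking_tokens=8192, show_reasoning=True)


def build_request(prompt: str, reasoning: ReasoningConfig) -> Dict[str, Any]:
    """Pair an unchanged prompt with a reasoning config in one request dict."""
    return {"prompt": prompt, "reasoning": asdict(reasoning)}


prompt = "Summarise the incident report in three bullet points."
cheap = build_request(prompt, FAST)
careful = build_request(prompt, THOROUGH)
print(cheap["prompt"] == careful["prompt"])  # True
```

Keeping the reasoning settings separate from the prompt text is what lets one prompt run in either mode without edits.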