Document Intelligence and Async Evaluations
Process documents natively in your datasets, run evaluations asynchronously via SDK, and compare prompt performance across experiments.
What's in this digest
Document Column Support and OCR
AI agents increasingly work with documents — contracts, invoices, support tickets, medical records. Testing these agents requires datasets that contain actual documents, not just extracted text. Starting today, datasets in Future AGI natively support document columns.
Upload TXT, DOC, DOCX, and PDF files directly into your dataset rows. Each document is indexed, searchable, and available as input to your evaluation pipelines. For scanned documents and images, built-in OCR extracts text automatically, so even legacy paper-based workflows can be tested without manual transcription.
This is not just file storage. Document columns integrate with the full evaluation pipeline. Run faithfulness checks against source documents. Verify that your agent’s summary matches the original PDF. Test whether your RAG system retrieves the right sections from a 200-page contract. Five document types are supported at launch, with more coming.
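To make the faithfulness idea concrete, here is a minimal sketch of a deterministic check against a source document. It does not use the Future AGI SDK (whose API is not shown here); the document text is stubbed in, standing in for the extracted or OCR'd content of a document column.

```python
# Minimal sketch of a deterministic faithfulness-style check: what fraction
# of the summary's content words appear in the source document text?
# In practice the source text would come from a dataset's document column.

def faithfulness_terms(summary: str, source_text: str) -> float:
    """Fraction of content words in the summary found in the source."""
    stop = {"the", "a", "an", "of", "and", "to", "in", "is", "was"}
    words = [w.strip(".,").lower() for w in summary.split()]
    content = [w for w in words if w and w not in stop]
    if not content:
        return 1.0
    src = source_text.lower()
    hits = sum(1 for w in content if w in src)
    return hits / len(content)

source = "The contract runs for 24 months and renews automatically."
good = "The contract lasts 24 months and renews automatically."
```

A real faithfulness evaluation would be far more sophisticated, but even a crude overlap check like this catches summaries that introduce terms with no basis in the source.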
Function Evaluations
Sometimes you need evaluation logic that goes beyond LLM-as-judge. Function evaluations let you define custom evaluation functions that execute deterministic checks against agent outputs. Verify JSON schema compliance, check numerical accuracy, validate that specific fields are present, or implement any business logic that your quality bar demands.
Function evals run alongside your existing LLM-based evaluations, giving you a hybrid approach: use AI judgment for subjective quality and deterministic functions for objective correctness.
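As an illustration of the kind of logic a function evaluation can encode, here is a self-contained sketch of a deterministic check that an agent's output is valid JSON and carries a set of required fields. The function signature and the score/reason result shape are illustrative, not the SDK's actual interface.

```python
import json

# Sketch of a function evaluation: a deterministic check that an agent's
# output parses as JSON and contains every field the downstream system
# needs. The `required` set and the result shape are illustrative.

def required_fields_eval(output: str, required: set) -> dict:
    """Return a pass/fail score plus an explanation, eval-result style."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return {"score": 0.0, "reason": "output is not valid JSON"}
    missing = sorted(required - payload.keys())
    if missing:
        return {"score": 0.0, "reason": f"missing fields: {missing}"}
    return {"score": 1.0, "reason": "all required fields present"}
```

Because the check is deterministic, the same output always produces the same score, which makes it a reliable complement to LLM-as-judge evaluations.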
Async Evaluations via SDK
Production systems cannot afford to block on evaluation calls. The new async evaluation capability in the SDK lets you fire evaluation requests and continue processing without waiting for results. Evaluations execute in the background and results are available through callbacks, polling, or webhooks.
This unlocks real-time evaluation in high-throughput environments. Run evaluations on every agent response in production without adding latency to your user-facing pipeline.
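The fire-and-forget pattern this enables can be sketched with plain asyncio. Here `run_eval` is a stub standing in for a real SDK evaluation call; the point is the scheduling: the user-facing reply returns immediately while the evaluation runs in the background.

```python
import asyncio

# Illustrative sketch of fire-and-forget evaluation. `run_eval` is a stub
# simulating a network round trip, not a real SDK call.

async def run_eval(response_id: str) -> dict:
    await asyncio.sleep(0.01)  # simulated evaluation latency
    return {"response_id": response_id, "score": 1.0}

async def handle_request(response_id: str, pending: set) -> str:
    # Schedule the evaluation but do not await it: the reply returns
    # without waiting for the eval result.
    task = asyncio.create_task(run_eval(response_id))
    pending.add(task)
    task.add_done_callback(pending.discard)  # self-cleaning bookkeeping
    return f"reply for {response_id}"

async def main():
    pending = set()
    replies = [await handle_request(f"r{i}", pending) for i in range(3)]
    results = await asyncio.gather(*pending)  # drain before shutdown
    return replies, results
```

Keeping a strong reference to each task (the `pending` set) matters: without it, a background task can be garbage-collected before it finishes.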
Comparison Summary
Iterating on prompts and models is only valuable if you can measure the difference. The new comparison summary feature lets you place two datasets side-by-side and see exactly how evaluation scores, prompt performance, and quality metrics changed between them. Spot regressions instantly. Confirm improvements with data.
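The arithmetic behind a comparison like this is simple to sketch. Given per-metric averages from two runs, compute the delta and flag regressions; the metric names and tolerance below are illustrative, not the feature's actual output format.

```python
# Minimal sketch of a side-by-side run comparison: per-metric deltas
# between a baseline and a candidate, with regressions flagged. Metric
# names and the tolerance are illustrative.

def compare_runs(baseline: dict, candidate: dict, tol: float = 0.0) -> dict:
    out = {}
    for metric in baseline.keys() & candidate.keys():
        delta = candidate[metric] - baseline[metric]
        out[metric] = {"delta": round(delta, 3), "regressed": delta < -tol}
    return out

v1 = {"faithfulness": 0.82, "completeness": 0.74}
v2 = {"faithfulness": 0.88, "completeness": 0.69}
summary = compare_runs(v1, v2)
```

Here faithfulness improved while completeness regressed, which is exactly the kind of mixed result a side-by-side view makes visible at a glance.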
SDK and Instrumentation Updates
This release brings a wave of SDK improvements. traceAI v0.1.10 now automatically labels LLM spans with prompt template identifiers, making it trivial to filter traces by which prompt version generated them. A new Pipecat integration brings tracing to voice and multimodal AI pipelines, complementing last release’s Call Simulation launch. And TypeScript developers using LlamaIndex now get first-class instrumentation support.
Bulk annotation and feedback via the API and SDK round out the data pipeline story. Import thousands of human labels in a single call, connect your existing annotation tools, and keep your evaluation datasets fresh with real human judgment at scale.
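The batching pattern behind a bulk import can be sketched without the SDK. In this illustration `post_batch` is a hypothetical callable standing in for the real annotation endpoint; the chunking logic is the point.

```python
# Sketch of batching human labels for bulk import. `post_batch` stands in
# for a real annotation endpoint; one call per batch, not per label.

def chunk(items: list, size: int) -> list:
    """Split a list into consecutive sublists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def bulk_import(labels: list, post_batch, batch_size: int = 500) -> int:
    """Send labels in batches; return the total number sent."""
    sent = 0
    for batch in chunk(labels, batch_size):
        post_batch(batch)
        sent += len(batch)
    return sent
```

Batching like this turns thousands of per-label round trips into a handful of calls, which is what makes importing labels from existing annotation tools practical at scale.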
Additional Improvements
The new user tab in Dashboard and Observe surfaces per-user metrics across sessions, helping teams understand how individual end-users experience their AI agents. Synthetic data is now editable after generation, so you can refine AI-generated test cases before they enter your evaluation pipeline. And prompt versions can carry custom labels for tracking experiments and rollout stages.