Document Intelligence and Async Evaluations
Process documents natively in your datasets, run evaluations asynchronously via SDK, and compare prompt performance across experiments.
What's in this digest
Document Column Support and OCR
AI agents increasingly work with documents — contracts, invoices, support tickets, medical records. Testing these agents requires datasets that contain actual documents, not just extracted text. Starting today, datasets in Future AGI natively support document columns.
Upload TXT, DOC, DOCX, and PDF files directly into your dataset rows. Each document is indexed, searchable, and available as input to your evaluation pipelines. For scanned documents and images, built-in OCR extracts text automatically, so even legacy paper-based workflows can be tested without manual transcription.
This is not just file storage. Document columns integrate with the full evaluation pipeline. Run faithfulness checks against source documents. Verify that your agent’s summary matches the original PDF. Test whether your RAG system retrieves the right sections from a 200-page contract. Five document types are supported at launch, with more coming.
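To make the faithfulness idea concrete, here is a minimal sketch of a deterministic check against a source document. It does not use the Future AGI SDK (whose API is not shown here); the document text is stubbed in, standing in for the extracted or OCR'd content of a document column.

```python
# Minimal sketch of a deterministic faithfulness-style check: what fraction
# of the summary's content words appear in the source document text?
# In practice the source text would come from a dataset's document column.

def faithfulness_terms(summary: str, source_text: str) -> float:
    """Fraction of content words in the summary found in the source."""
    stop = {"the", "a", "an", "of", "and", "to", "in", "is", "was"}
    words = [w.strip(".,").lower() for w in summary.split()]
    content = [w for w in words if w and w not in stop]
    if not content:
        return 1.0
    src = source_text.lower()
    hits = sum(1 for w in content if w in src)
    return hits / len(content)

source = "The contract runs for 24 months and renews automatically."
good = "The contract lasts 24 months and renews automatically."
```

A real faithfulness evaluation would be far more sophisticated, but even a crude overlap check like this catches summaries that introduce terms with no basis in the source.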
Function Evaluations
Sometimes you need evaluation logic that goes beyond LLM-as-judge. Function evaluations let you define custom evaluation functions that execute deterministic checks against agent outputs. Verify JSON schema compliance, check numerical accuracy, validate that specific fields are present, or implement any business logic that your quality bar demands.
Function evals run alongside your existing LLM-based evaluations, giving you a hybrid approach: use AI judgment for subjective quality and deterministic functions for objective correctness.
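As an illustration of the kind of logic a function evaluation can encode, here is a self-contained sketch of a deterministic check that an agent's output is valid JSON and carries a set of required fields. The function signature and the score/reason result shape are illustrative, not the SDK's actual interface.

```python
import json

# Sketch of a function evaluation: a deterministic check that an agent's
# output parses as JSON and contains every field the downstream system
# needs. The `required` set and the result shape are illustrative.

def required_fields_eval(output: str, required: set) -> dict:
    """Return a pass/fail score plus an explanation, eval-result style."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return {"score": 0.0, "reason": "output is not valid JSON"}
    missing = sorted(required - payload.keys())
    if missing:
        return {"score": 0.0, "reason": f"missing fields: {missing}"}
    return {"score": 1.0, "reason": "all required fields present"}
```

Because the check is deterministic, the same output always produces the same score, which makes it a reliable complement to LLM-as-judge evaluations.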
Async Evaluations via SDK
Production systems cannot afford to block on evaluation calls. The new async evaluation capability in the SDK lets you fire evaluation requests and continue processing without waiting for results. Evaluations execute in the background and results are available through callbacks, polling, or webhooks.
This unlocks real-time evaluation in high-throughput environments. Run evaluations on every agent response in production without adding latency to your user-facing pipeline.
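The fire-and-forget pattern this enables can be sketched with plain asyncio. Here `run_eval` is a stub standing in for a real SDK evaluation call; the point is the scheduling: the user-facing reply returns immediately while the evaluation runs in the background.

```python
import asyncio

# Illustrative sketch of fire-and-forget evaluation. `run_eval` is a stub
# simulating a network round trip, not a real SDK call.

async def run_eval(response_id: str) -> dict:
    await asyncio.sleep(0.01)  # simulated evaluation latency
    return {"response_id": response_id, "score": 1.0}

async def handle_request(response_id: str, pending: set) -> str:
    # Schedule the evaluation but do not await it: the reply returns
    # without waiting for the eval result.
    task = asyncio.create_task(run_eval(response_id))
    pending.add(task)
    task.add_done_callback(pending.discard)  # self-cleaning bookkeeping
    return f"reply for {response_id}"

async def main():
    pending = set()
    replies = [await handle_request(f"r{i}", pending) for i in range(3)]
    results = await asyncio.gather(*pending)  # drain before shutdown
    return replies, results
```

Keeping a strong reference to each task (the `pending` set) matters: without it, a background task can be garbage-collected before it finishes.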
Comparison Summary
Iterating on prompts and models is only valuable if you can measure the difference. The new comparison summary feature lets you place two datasets side-by-side and see exactly how evaluation scores, prompt performance, and quality metrics changed between them. Spot regressions instantly. Confirm improvements with data.
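The arithmetic behind a comparison like this is simple to sketch. Given per-metric averages from two runs, compute the delta and flag regressions; the metric names and tolerance below are illustrative, not the feature's actual output format.

```python
# Minimal sketch of a side-by-side run comparison: per-metric deltas
# between a baseline and a candidate, with regressions flagged. Metric
# names and the tolerance are illustrative.

def compare_runs(baseline: dict, candidate: dict, tol: float = 0.0) -> dict:
    out = {}
    for metric in baseline.keys() & candidate.keys():
        delta = candidate[metric] - baseline[metric]
        out[metric] = {"delta": round(delta, 3), "regressed": delta < -tol}
    return out

v1 = {"faithfulness": 0.82, "completeness": 0.74}
v2 = {"faithfulness": 0.88, "completeness": 0.69}
summary = compare_runs(v1, v2)
```

Here faithfulness improved while completeness regressed, which is exactly the kind of mixed result a side-by-side view makes visible at a glance.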
SDK and Instrumentation Updates
This release brings a wave of SDK improvements. traceAI v0.1.10 now automatically labels LLM spans with prompt template identifiers, making it trivial to filter traces by which prompt version generated them. A new Pipecat integration brings tracing to voice and multimodal AI pipelines, complementing last release’s Call Simulation launch. And TypeScript developers using LlamaIndex now get first-class instrumentation support.
Bulk annotation and feedback via the API and SDK round out the data pipeline story. Import thousands of human labels in a single call, connect your existing annotation tools, and keep your evaluation datasets fresh with real human judgment at scale.
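The batching pattern behind a bulk import can be sketched without the SDK. In this illustration `post_batch` is a hypothetical callable standing in for the real annotation endpoint; the chunking logic is the point.

```python
# Sketch of batching human labels for bulk import. `post_batch` stands in
# for a real annotation endpoint; one call per batch, not per label.

def chunk(items: list, size: int) -> list:
    """Split a list into consecutive sublists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def bulk_import(labels: list, post_batch, batch_size: int = 500) -> int:
    """Send labels in batches; return the total number sent."""
    sent = 0
    for batch in chunk(labels, batch_size):
        post_batch(batch)
        sent += len(batch)
    return sent
```

Batching like this turns thousands of per-label round trips into a handful of calls, which is what makes importing labels from existing annotation tools practical at scale.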
Additional Improvements
The new user tab in Dashboard and Observe surfaces per-user metrics across sessions, helping teams understand how individual end-users experience their AI agents. Synthetic data is now editable after generation, so you can refine AI-generated test cases before they enter your evaluation pipeline. And prompt versions can carry custom labels for tracking experiments and rollout stages.