AI agents increasingly work with documents such as contracts, invoices, support tickets, and medical records. Testing those agents requires datasets that contain the actual documents, not just extracted text. Starting with this release, datasets in Future AGI support document columns natively.

What’s new

Upload documents directly. TXT, DOC, DOCX, and PDF files go straight into dataset rows.
Built-in OCR. Scanned documents and images are processed with optical character recognition (OCR) automatically — text is extracted so legacy paper-based workflows can be tested without manual transcription.
Indexed and searchable. Documents in dataset rows are indexed for search and available as input to the full evaluation pipeline.
Four document types at launch: TXT, DOC, DOCX, and PDF. More coming.

Why it matters

Document columns aren’t just file storage. They integrate with evaluations: run faithfulness checks against source documents, verify an agent’s summary matches the original PDF, or test whether your retrieval-augmented generation (RAG) system pulls the right sections from a 200-page contract.

Who it’s for

Teams building agents that process documents — contracts, invoices, medical records, support tickets — and RAG-system teams whose tests currently rely on manually extracted text rather than the real source files.

Function Evaluations — Deterministic Checks

Sometimes you need evaluation logic that doesn’t involve an LLM. Function evaluations let you define custom evaluation functions that execute deterministic checks against agent outputs.

What’s new

Custom deterministic logic. Verify JSON schema compliance, check numerical accuracy, validate that specific fields are present, or implement any business-logic check your quality bar requires.
Runs alongside LLM evaluations. LLM judgment for subjective quality, deterministic functions for objective correctness — in the same evaluation run.
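
The kind of deterministic check described above can be sketched in plain Python. This is a minimal illustration, not the Future AGI SDK: the function name `check_invoice_output`, the required fields, the `amount` key, and the rounding tolerance are all assumptions made for the example, and how a function is registered with the platform is not shown here.

```python
import json
import math

# Hypothetical required fields for an invoice-extraction agent (assumption).
REQUIRED_FIELDS = ("invoice_id", "line_items", "total")


def check_invoice_output(raw_output: str) -> dict:
    """Deterministic pass/fail check on an agent's JSON output."""
    # 1. The output must parse as a JSON object.
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return {"passed": False, "reason": f"malformed JSON: {exc.msg}"}
    if not isinstance(data, dict):
        return {"passed": False, "reason": "top-level value is not an object"}

    # 2. Every required field must be present.
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        return {"passed": False, "reason": f"missing fields: {missing}"}

    # 3. The stated total must equal the sum of the line items.
    try:
        expected = sum(float(item["amount"]) for item in data["line_items"])
        stated = float(data["total"])
    except (KeyError, TypeError, ValueError):
        return {"passed": False, "reason": "non-numeric or malformed amounts"}
    if not math.isclose(stated, expected, abs_tol=0.005):
        return {"passed": False, "reason": f"total {stated} != sum {expected}"}

    return {"passed": True, "reason": "ok"}
```

The same input always produces the same verdict, which is the property that makes a check like this suitable for gating a deployment.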

Why it matters

LLM-as-judge (where one LLM scores the outputs of another) is great for open-ended quality assessment but inconsistent for things that should be binary pass/fail: malformed JSON, a missing required field, a wrong total. Function evaluations give you deterministic outcomes where you need them.

Who it’s for

ML and AI engineers building evaluation suites where part of the quality bar is programmatic, and quality assurance (QA) teams gating deployments in continuous integration pipelines where deterministic pass/fail is required.

Async Evaluations via SDK

Production systems can’t block on evaluation calls. The new async capability in the SDK lets you fire evaluation requests and continue processing without waiting for results. Evaluations execute in the background; results come back via callback, polling, or webhook.

What’s new

Non-blocking evaluation submission from any SDK language (Python, TypeScript).
Three result delivery modes. Callback, polling, or webhook — pick what fits your pipeline.
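
The fire-and-continue pattern with callback delivery can be sketched using only Python's standard `asyncio` library. Everything named here is an assumption for illustration: `run_evaluation` is a hypothetical stand-in for a remote evaluation call, and `handle_request` and `on_result` are placeholder names. The actual SDK calls may look quite different.

```python
import asyncio

results = []  # filled in by the callback as evaluations finish


async def run_evaluation(response_text: str) -> dict:
    """Hypothetical stand-in for a remote evaluation call (assumption)."""
    await asyncio.sleep(0.05)  # simulates network and scoring latency
    return {"response": response_text, "score": 1.0 if response_text else 0.0}


def on_result(task: asyncio.Task) -> None:
    """Callback invoked when the background evaluation finishes."""
    results.append(task.result())


async def handle_request(prompt: str) -> tuple[str, asyncio.Task]:
    response = f"echo: {prompt}"  # placeholder for the real agent call
    # Fire the evaluation and return immediately; do not await it here.
    task = asyncio.create_task(run_evaluation(response))
    task.add_done_callback(on_result)
    return response, task


async def main() -> str:
    response, task = await handle_request("hello")
    # The user-facing response is ready before the evaluation completes.
    assert not task.done()
    await task  # only so this demo exits cleanly
    return response
```

Running `asyncio.run(main())` returns the response and leaves one entry in `results`. The sketch returns the task to the caller because `asyncio` holds only weak references to tasks, so a background task with no other reference can be garbage-collected before it finishes. Polling would replace the callback with periodic checks of `task.done()`.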

Why it matters

You can now run evaluations on every agent response in production without adding latency to the user-facing path.

Who it’s for

Developers integrating evaluations into production code paths, and MLOps teams running high-volume evaluations in live systems.

Comparison Summary

Iterating on prompts and models is only useful if you can measure the difference. The new comparison summary lets you place two datasets side by side and see exactly how evaluation scores, prompt performance, and quality metrics changed between them. Spot regressions instantly and confirm improvements with data.
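
To make the idea of a score comparison concrete, here is a rough sketch of the underlying arithmetic: per-metric averages for two runs and the difference between them. The function `compare_runs` and its input shape (metric name mapped to a list of scores) are assumptions for illustration, not the product's implementation, and the sketch assumes higher scores are better.

```python
from statistics import mean


def compare_runs(
    baseline: dict[str, list[float]],
    candidate: dict[str, list[float]],
) -> dict[str, dict[str, float]]:
    """Per-metric mean scores and their difference for two evaluation runs."""
    summary = {}
    # Only metrics present in both runs can be compared.
    for metric in sorted(baseline.keys() & candidate.keys()):
        base_avg = mean(baseline[metric])
        cand_avg = mean(candidate[metric])
        summary[metric] = {
            "baseline": base_avg,
            "candidate": cand_avg,
            # Negative delta means a regression, assuming higher is better.
            "delta": cand_avg - base_avg,
        }
    return summary
```

Metrics that appear in only one of the two runs are left out of the summary rather than treated as zero, so a missing metric is never mistaken for a regression.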

SDK and Instrumentation Updates

traceAI v0.1.10 with prompt template labels. LLM spans (individual steps inside a trace) are now automatically labeled with prompt template identifiers — filter traces by which prompt version generated them.

traceAI Pipecat integration. Native instrumentation for voice and multimodal AI pipelines built on Pipecat.

traceAI LlamaIndex TypeScript. TypeScript instrumentation for LlamaIndex, bringing RAG tracing into Node.js environments.

Bulk annotation and feedback via API/SDK. Import thousands of human labels in a single call. Useful when connecting existing annotation tools or seeding an evaluation dataset at scale.

Additional Improvements

User tab in Dashboard and Observe. Per-user metrics across sessions and traces.

Edit synthetic data after generation. Refine AI-generated test cases before they enter your evaluation pipeline.

Labels per prompt version. Tag each prompt version to track experiments, A/B tests, and rollout stages.

Video support in Observe. Capture and replay video outputs from multimodal agents inside the trace view.

Timestamp column in traces and spans. Precise timestamps for timing analysis.

JSON view for evaluation log. Raw evaluation log data in structured JSON.
