At Future AGI, we are committed to building the next generation of evaluation-first AI systems. April was a significant month for us, packed with exciting new features, vibrant community events, and serious engineering wins. Let’s dive into everything we shipped, celebrated, and discovered this month.
✅ Product Updates
Launched Compare Data - A New Standard for LLM Comparison
Comparing model outputs across different experiments has always been a tedious and manual task for AI engineers. Without standardized tools, teams are forced to rely on spreadsheets, screenshots, and subjective assessments to determine which model or prompt performed better. This approach not only slows down iteration cycles but also introduces inconsistencies and biases in model selection.
Future AGI's Compare Data is designed to make LLM comparisons structured, visual, and lightning-fast, enabling:
Side-by-side output comparisons across models and prompts
Prompt-level breakdowns and behavior diagnostics
Faster iteration cycles with clearer decisions
Visual summaries that surface patterns without the noise
Users can zoom out for high-level summaries across datasets or zoom in to perform detailed prompt-level comparisons. This structured comparison eliminates subjectivity, provides granular visibility into model behavior shifts, and enables faster, data-backed decision-making.
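To make the idea concrete, here is a minimal sketch, not the Compare Data product itself, of lining up two models' outputs on the same prompts so they can be reviewed side by side instead of across spreadsheets and screenshots (the model names and outputs below are illustrative):

```python
# A minimal sketch, not the Compare Data product: align two runs on the same
# prompts so every row shows both candidates for the same input.
import pandas as pd

run_a = pd.DataFrame(
    {"prompt": ["Summarise the refund policy", "Draft a greeting"],
     "output": ["Refunds within 14 days.", "Hello! How can I help?"]}
)
run_b = pd.DataFrame(
    {"prompt": ["Summarise the refund policy", "Draft a greeting"],
     "output": ["Refunds are issued within 14 business days.", "Hi there, how may I assist you today?"]}
)

# Join on the prompt so each row pairs the two models' answers side by side.
comparison = run_a.merge(run_b, on="prompt", suffixes=("_model_a", "_model_b"))
print(comparison.to_string(index=False))
```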
To compare your LLM models against the best in the world, click here!
Launched Knowledge Base Integration - for Reliable Synthetic Data
Traditional synthetic data generation often lacks grounding in real organizational context, leading to hallucinated outputs that are unusable in high-stakes environments like finance, healthcare, and legal. Organizations building evaluation sets or fine-tuning models need a way to create synthetic data that reflects their real-world knowledge and domain-specific language.
We introduced the Knowledge Base-powered Synthetic Data Generation feature to directly solve this gap.
With this capability:
Users can upload their own documents, such as PDFs, SOPs, product manuals, and internal guidelines, to build a custom knowledge base.
Synthetic data is generated with every datapoint anchored to the uploaded knowledge, ensuring factual precision.
The system adapts to the organization's specific language, structure, and tone, avoiding generic, hallucinated outputs.
By maintaining a ~90% content overlap with the original documents, the generated datasets become high-fidelity and immediately usable for creating evaluation datasets or fine-tuning models.
This gives organizations complete control over synthetic data generation while ensuring regulatory compliance and relevance.
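As an illustrative sketch (not Future AGI's internal metric), one way to sanity-check how closely a generated datapoint stays anchored to its source passage is a simple token-overlap score; the passage and Q&A pair below are made up for the example:

```python
# A minimal sketch, not the platform's internal metric: a rough token-overlap
# score between a generated datapoint and the source passage it should be
# grounded in.
import re

def token_overlap(generated: str, source: str) -> float:
    """Fraction of tokens in the generated text that also appear in the source."""
    gen_tokens = re.findall(r"\w+", generated.lower())
    src_tokens = set(re.findall(r"\w+", source.lower()))
    if not gen_tokens:
        return 0.0
    return sum(t in src_tokens for t in gen_tokens) / len(gen_tokens)

source_passage = "Refunds are processed within 14 business days of receiving the returned item."
synthetic_qa = "How long do refunds take? Refunds are processed within 14 business days."
print(f"overlap: {token_overlap(synthetic_qa, source_passage):.0%}")
```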
To learn how Future AGI creates accurate synthetic datasets, read our documentation here!
Launched Audio Evaluations - Powering the Multimodal Stack
Evaluating audio data and outputs has been a major challenge due to the lack of consistent tools, high manual review costs, and unreliable subjective assessments. As audio LLMs become central to customer interactions, from IVR systems to AI-powered support calls, ensuring high-quality audio at scale is now essential.
To address this, we launched state-of-the-art Audio Evaluations, a comprehensive set of metrics for automated, objective, and scalable evaluation of audio outputs.
Here’s how it works:
Users can import audio datasets via CSV/JSON uploads, Hugging Face datasets, or SDK scripts.
Our system provides pre-built evaluation metrics tailored for audio.
Evaluations can be run at scale, with support for testing on over 5,000 audio datapoints in a batch.
Error localization highlights exactly where an audio output fails, enabling targeted feedback and improvements.
The platform not only accelerates development cycles by providing instant evaluation reports but also helps fine-tune audio LLMs for domain-specific needs like multilingual IVR conversations and customer support call analysis.
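As a rough illustration of the CSV/JSON import path, here is a minimal sketch with hypothetical column names; it only shapes a local manifest into upload-ready batches, and the actual upload call lives in the Future AGI SDK (see the documentation linked below):

```python
# A minimal sketch, assuming a hypothetical manifest layout: one row per call
# recording plus its transcript, chunked into JSON batches for upload.
import pandas as pd

manifest = pd.DataFrame(
    {
        "audio_path": ["calls/0001.wav", "calls/0002.wav"],
        "transcript": ["Hi, I'd like to reset my password.", "My invoice looks wrong."],
        "language": ["en", "en"],
    }
)

# Batch evaluations can cover thousands of datapoints, so chunking the
# manifest keeps each upload manageable.
batch_size = 5000
for start in range(0, len(manifest), batch_size):
    batch = manifest.iloc[start : start + batch_size]
    batch.to_json(f"audio_batch_{start // batch_size}.json", orient="records")
```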
To see how you can evaluate your audio using LLMs, read our documentation here!
Future AGI integrated with OpenAI Agents SDK
We’re excited to share that our platform has been officially recognized by OpenAI and is now listed in the OpenAI Agents SDK documentation as a provider for tracing and evaluations.
With the OpenAI Agents SDK still in its early stages, we’re proud to offer essential tools that make observability, evaluation, and tracing more accessible and reliable for developers.
If you’re exploring OpenAI Agents, we invite you to check out our resources and see how we can help you build faster, smarter, and more safely. You can find all the details here.
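If you want a feel for the SDK itself, here is a minimal sketch using the openai-agents package (an OpenAI API key is assumed); the trace-processor registration shown in the comments is a placeholder, not the real integration call, so check the linked docs for the actual entry point:

```python
# A minimal sketch, assuming the openai-agents package ("agents" module) and
# an OPENAI_API_KEY in the environment.
from agents import Agent, Runner

# Hypothetical: a third-party tracing processor would typically be registered
# once at startup so every agent run is exported for observability/evaluation.
# add_trace_processor(my_future_agi_processor)  # placeholder, not the real API

agent = Agent(
    name="support-bot",
    instructions="Answer billing questions concisely.",
)
result = Runner.run_sync(agent, "Why was I charged twice this month?")
print(result.final_output)
```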

🌐 Other Updates
Webinar on "Evaluating AI with Confidence"
Too often, teams focus on building and fine-tuning models first and only test for issues like hallucinations, incomplete responses, or reliability gaps just before, or sometimes after, launch. By then, fixes are slower, costlier, and riskier.
In this session, we dove deep into Future AGI’s evaluation workflow, covering multi-modal evaluations, custom metrics, feedback loops, and error localization, and showed how it empowers AI teams to catch issues early, improve model reliability, and build with confidence.
Perfect for anyone looking to make AI development faster, sharper, and more aligned.
Watch the webinar: https://futureagi.com/blogs/evaluating-ai-with-confidence

Register now for our upcoming Webinar: "Modern AI Engineering: Strategies That Scale"
Sandeep Kaipu, Engineering Leader @ Broadcom, will share actionable strategies for building scalable infrastructure for your modern GenAI stack.
Data & Eval Driven Development: A hands-on session at the AI User Conference, SF
One of April’s biggest highlights was the AI User Conference, a major global event attended by AI professionals from across industries and countries. Our Founder, Nikhil, led a hands-on workshop on making AI agents truly customer-ready using a data- and evaluation-driven development approach.
A key takeaway from the event was the growing recognition that powerful AI alone isn’t enough; what matters is how reliably it performs in real-world use. The conversations underscored a rising demand for evaluation and observability as core pillars in building trustworthy, user-centric AI systems. For us, it reaffirmed our mission to make transparency and continuous assessment foundational to every AI deployment.

Closing Thoughts
Every launch and every conversation points to one truth: AI needs more discipline, trust, and care.
At Future AGI, we’re staying curious, moving fast, and staying true to our mission: Helping teams build AI that works — reliably, safely, and at scale.
For more updates, join the conversation in our Slack Community.
Your partner in building Trustworthy AI!
