Self-improving evals.
Not just LLM-as-judge.
A unified evaluation agent that runs local heuristics first - free and sub-second - then augments with an LLM only when confidence is low. Submit corrections when the judge is wrong, and it learns. One evaluate() call for faithfulness, RAG, safety, agents, streaming, and more.
An eval agent that learns -
not a static metric library
Most eval frameworks send every check through an LLM - expensive and slow. Our hybrid approach runs local heuristics first (NLI models, string matching, embedding similarity, regex, schema validation) at zero cost, in under a second. Only when confidence is low does the system augment with an LLM (Gemini, GPT, Claude, or our fine-tuned Turing models). You get the accuracy of LLM-as-judge at a fraction of the cost.
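The local-first routing above can be sketched in a few lines of plain Python. Everything here is illustrative - the token-overlap heuristic, the confidence rule, and the `llm_judge` callback stand in for the real NLI models and LLM backends, and none of these names are the product's actual API:

```python
def local_heuristic(answer: str, source: str) -> tuple[float, float]:
    """Toy faithfulness check: fraction of answer tokens found in the
    source, plus a confidence based on how decisive the score is."""
    tokens = answer.lower().split()
    if not tokens:
        return 0.0, 1.0
    support = sum(t in source.lower() for t in tokens) / len(tokens)
    # Confident when the score is clearly high or clearly low.
    confidence = abs(support - 0.5) * 2
    return support, confidence

def evaluate(answer: str, source: str, llm_judge=None, threshold: float = 0.6):
    """Run the free local check first; escalate to the LLM judge only
    when the heuristic is unsure (confidence below threshold)."""
    score, confidence = local_heuristic(answer, source)
    if confidence >= threshold or llm_judge is None:
        return {"score": score, "judge": "local"}
    return {"score": llm_judge(answer, source), "judge": "llm"}

result = evaluate("the sky is blue", "the sky is blue today")
```

Clear-cut cases never touch the LLM; only ambiguous ones pay the API cost.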
See the hybrid architecture

When the judge gets a case wrong, submit a correction. It's stored in a vector database and retrieved as a few-shot example for future evaluations on similar inputs. Your eval accuracy improves with every correction - the system learns your domain's edge cases over time. No retraining, no prompt engineering. Just correct and move on.
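A minimal sketch of the correction loop, with a bag-of-words cosine similarity standing in for the real vector database and embeddings (the class and method names here are hypothetical):

```python
from collections import Counter
from math import sqrt

class CorrectionStore:
    """Toy stand-in for a vector store of human corrections."""
    def __init__(self):
        self.corrections = []  # (input_text, corrected_label) pairs

    def add(self, input_text: str, corrected_label: str):
        self.corrections.append((input_text, corrected_label))

    @staticmethod
    def _similarity(a: str, b: str) -> float:
        va, vb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(va[t] * vb[t] for t in va)
        norm = (sqrt(sum(v * v for v in va.values()))
                * sqrt(sum(v * v for v in vb.values())))
        return dot / norm if norm else 0.0

    def few_shot(self, query: str, k: int = 2):
        """Return the k most similar past corrections, ready to prepend
        to the judge prompt as few-shot examples."""
        ranked = sorted(self.corrections,
                        key=lambda c: self._similarity(query, c[0]),
                        reverse=True)
        return ranked[:k]

store = CorrectionStore()
store.add("pt c/o SOB on exertion", "faithful")
store.add("invoice total mismatch", "unfaithful")
examples = store.few_shot("pt reports SOB at rest", k=1)
```

New inputs that resemble a past mistake automatically pick up the human's corrected label as context.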
How feedback works

String checks, NLI-based faithfulness, RAG retrieval quality, LLM-as-judge, agent trajectory scoring, function call validation, guardrails, streaming safety, audio, image - all through a single evaluate() API. Batch multiple metrics in one call. Run locally or in cloud. Export pipelines as YAML for CI/CD. One framework replaces Ragas + DeepEval + guardrail libraries + custom scripts.
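The "batch multiple metrics in one call" idea can be illustrated with a small dispatcher. The metric names and regex checks below are invented for the sketch, not the SDK's built-in metric catalog:

```python
import re

# Registry of local metric checkers; each returns a score in [0, 1].
METRICS = {
    "contains_citation": lambda out, **kw: float(bool(re.search(r"\[\d+\]", out))),
    "max_length": lambda out, limit=500, **kw: float(len(out) <= limit),
    "no_pii_email": lambda out, **kw: float(not re.search(r"\b\S+@\S+\.\S+\b", out)),
}

def evaluate(output: str, metrics: list[str], **params) -> dict:
    """One call batches several checks and returns per-metric scores."""
    return {name: METRICS[name](output, **params) for name in metrics}

result = evaluate("See source [1] for details.",
                  ["contains_citation", "max_length", "no_pii_email"])
```

One entry point, many metrics: adding a check means registering a function, not adopting another library.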
Explore the SDK

Define an eval config once and attach it to dataset runs, simulation tests, experiments, playground sessions, or production traces in Observe. Historic batch for retroactive scoring, continuous for real-time. Configurable sampling rates, span limits, and attribute-based filters. Results always comparable across every context.
See eval configs

One framework for every
AI evaluation pattern
Catch hallucinations without LLM cost
Local NLI models detect contradictions and unsupported claims at zero API cost. Faithfulness, claim support, factual consistency, and contradiction detection run locally in under a second. Augment with an LLM only for ambiguous paraphrases the heuristic can't resolve.
Build domain-specific evals that learn
Medical shorthand, legal jargon, internal terminology - generic LLM judges misclassify these constantly. Submit corrections when the judge is wrong. ChromaDB stores them and surfaces similar corrections as few-shot examples. Your eval adapts to your domain without prompt engineering.
Gate deployments in CI/CD
Export eval configs as YAML. Run in GitHub Actions, GitLab CI, or any pipeline. If hallucination rate exceeds your threshold or safety score drops, the deployment fails. Same config runs in staging and production - no threshold drift.
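A deployment gate of this shape is a few lines in any CI runner. The threshold names and structure below are hypothetical, mimicking what an exported YAML config might contain:

```python
# Hypothetical thresholds, as they might appear in an exported YAML config.
THRESHOLDS = {
    "hallucination_rate": ("max", 0.05),  # fail if the rate exceeds 5%
    "safety_score": ("min", 0.90),        # fail if safety drops below 0.90
}

def gate(scores: dict) -> list[str]:
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = scores[name]
        if (kind == "max" and value > limit) or (kind == "min" and value < limit):
            failures.append(f"{name}={value} violates {kind} {limit}")
    return failures

failed = gate({"hallucination_rate": 0.08, "safety_score": 0.95})
# A CI step would then exit non-zero when `failed` is non-empty,
# which marks the job (and the deployment) as failed.
```

Because the same config drives staging and production, the thresholds cannot drift apart between environments.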
Evaluate RAG, agents, and function calls
Context relevance, chunk attribution, recall for RAG. Task completion, tool selection accuracy, trajectory scoring for agents. Function name match and parameter validation for tool use. One framework covers every AI pattern - not separate libraries for each.
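Function-call validation, for instance, reduces to checking a model's tool call against a declared schema. This sketch uses an invented `TOOL_SCHEMAS` registry and return shape, assuming a simple required/optional parameter model:

```python
TOOL_SCHEMAS = {
    "get_weather": {"required": {"city"}, "optional": {"units"}},
}

def validate_call(name: str, args: dict) -> dict:
    """Check a tool call: the function must exist, required parameters
    must be present, and no unknown parameters are allowed."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return {"valid": False, "reason": f"unknown function {name!r}"}
    allowed = schema["required"] | schema["optional"]
    missing = schema["required"] - args.keys()
    unknown = args.keys() - allowed
    if missing:
        return {"valid": False, "reason": f"missing {sorted(missing)}"}
    if unknown:
        return {"valid": False, "reason": f"unknown {sorted(unknown)}"}
    return {"valid": True, "reason": "ok"}
```

Hallucinated function names and malformed arguments are caught deterministically, with no LLM in the loop.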
Kill toxic streams mid-generation
StreamingEvaluator monitors LLM output token-by-token. When toxicity, PII, or safety violations spike above your threshold, the stream is cut immediately - not after the full response is generated. Early-stop policies let you define exactly when to intervene.
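The early-stop policy can be pictured as a generator that wraps the token stream. The `toxicity` stub below stands in for a real local safety model, and the marker string is illustrative:

```python
def toxicity(token: str) -> float:
    """Stub scorer: a real deployment would use a local safety model."""
    return 1.0 if token.lower().strip(".,!") in {"idiot", "stupid"} else 0.0

def guarded_stream(tokens, threshold: float = 0.5):
    """Yield tokens until a token's safety score crosses the threshold,
    then cut the stream mid-generation instead of waiting for the end."""
    for token in tokens:
        if toxicity(token) >= threshold:
            yield "[stream terminated by safety policy]"
            return  # cut immediately; remaining tokens are never emitted
        yield token

out = list(guarded_stream("you are a stupid example".split()))
```

The violating token never reaches the user, and generation stops at that point rather than after the full response.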
Monitor production with zero config
Attach eval configs to Observe projects. Continuous run type scores new spans as they arrive with configurable sampling. Quality scores attach to traces via OpenTelemetry - search for bad responses in Jaeger, Datadog, or Grafana. Alerts fire when scores cross thresholds.
Local first, augment when needed,
improve with every correction
Pick metrics or describe your app
Choose from 50+ local metrics and LLM-augmented evaluators, or use AutoEval - describe your app in natural language and get a pre-configured pipeline. Faithfulness, RAG quality, safety, agent trajectory, function calling, and more. Export as YAML for CI/CD.
Run locally, augment when needed
Heuristics run first at zero cost. Set augment=True to refine with an LLM only when confidence is low. Pass a feedback store and previous corrections become few-shot examples - your eval improves with every correction.
Attach to any context, compare everything
Same eval config attaches to datasets, simulations, experiments, or production traces. Results include scores, pass/fail, and explanations per row. Compare across runs, track trends, gate deployments. Feed results into Prompt Optimizer and Fix My Agent.
Powering teams from
prototype to production
From ambitious startups to global enterprises, teams trust Future AGI to ship AI agents confidently.