Ragas vs Future AGI in 2026: A Head-to-Head for ML Engineers Running RAG
Ragas vs Future AGI compared honestly on RAG metric coverage, cost economics, CI fit, runtime guardrails, observability, and the closed loop. Where each one wins, where they tie, and how to compose.
Ragas and Future AGI share a shortlist because their RAG metric vocabularies overlap. They got there from opposite directions. Ragas is a RAG-specific evaluation library; Future AGI is an eval-stack package — the RAG metrics plus an OpenTelemetry tracer, an optimizer, a gateway, and inline guardrails on one bill. Pick Ragas when RAG eval is the whole job and the rest of the stack is already chosen. Pick Future AGI when RAG sits inside an agent runtime, per-eval cost matters at production volume, or the gate, tracer, optimizer, and guardrails need to live on the same span tree.
The frame: library vs eval-stack package
The feature lists overlap on RAG rubrics because every paper since 2023 uses the same names. Below the rubric layer, they diverge.
Ragas is a focused RAG evaluation library — one pip install, a small dependency tree, seven canonical metrics (Faithfulness, Answer Relevancy, Context Precision, Context Recall, Context Entity Recall, Noise Sensitivity, Multimodal Faithfulness), a TestsetGenerator for synthetic question-answer-context triples, and thin adapter packages for LangSmith, Langfuse, and OpenTelemetry. The design center is a notebook-friendly API with naming that matches the academic literature and a deliberate refusal to grow outside the RAG box.
Future AGI is an eval-stack package — three Apache 2.0 building blocks plus a hosted control plane:
- ai-evaluation ships 60-plus cloud EvalTemplate classes and 72 local metrics covering RAG, agent trajectory, function calling, structured output, hallucination, code security, and customer-agent surfaces. The unified evaluate() API routes across LocalEngine (heuristics, NLI, no API call), TuringEngine (in-house classifiers), and LLMEngine (any LiteLLM model).
- traceAI is OpenTelemetry-native with OpenInference spans across 50-plus framework integrations in Python, TypeScript, Java (Spring Boot, Spring AI, LangChain4j), and C#.
- agent-opt ships six optimizers (ProTeGi, GEPA, Meta-Prompt, PromptWizard, Bayesian, Random Search) that consume a labelled dataset and propose the next prompt revision.
On top sits the Future AGI Platform with self-improving evaluators, an in-product rubric-authoring agent, and the Agent Command Center (20-plus providers, 5-level budgets, 33 guardrail scanners). Error Feed clusters failing production traces via HDBSCAN with a Sonnet 4.5 Judge writing immediate_fix per cluster.
Ragas is a metric library. Future AGI is a runtime that observes, scores, optimizes, and guards.
TL;DR: capability snapshot
| Capability | Ragas | Future AGI |
|---|---|---|
| Core identity | RAG-specific eval library | Eval-stack package (eval + trace + optimize + gateway + guardrails) |
| License | Apache 2.0 single Python package | traceAI, ai-evaluation, agent-opt Apache 2.0; hosted Platform closed |
| RAG metric coverage | Faithfulness, Answer Relevancy, Context Precision, Context Recall, Context Entity Recall, Noise Sensitivity, Multimodal Faithfulness | Same conceptual space under different names plus chunk-level rubrics, NDCG, MRR, P@k, R@k |
| Cost economics | LLM-judge default per metric per row | Heuristic / NLI / classifier / LLM-judge cascade with augment=True |
| Agent + non-RAG evals | Out of scope | Agent trajectory (7), function calling (4), structured output (11), code security (6), 11 customer-agent rubrics |
| Distributed runners | Single-process | Celery, Ray, Temporal, Kubernetes + ResilientBackend wrapper |
| Observability integration | LangSmith, Langfuse, OTel adapters | traceAI OpenInference spans in Python, TS, Java, C# |
| Runtime guardrails | Not in scope | Protect at 65 ms text / 107 ms image; 13 backends; 8 sub-10 ms Scanners |
| Gateway | Not in scope | Agent Command Center: 20-plus providers, 5-level budgets, 33 scanners |
| Closed loop | Not in scope | Error Feed clustering + agent-opt rewrites |
| Best fit | Pure-RAG, notebook-shaped, single-process | RAG inside an agent runtime, production scale, regulated workflow |
One-line verdict: Ragas wins on focus, naming, and footprint for pure-RAG. Future AGI wins on operational surface, classifier-backed cost economics, and the closed loop. The two compose; the choice is whether RAG stands alone or sits inside something larger.
Where Ragas wins
Ragas wins three concrete fights, and the wins are real.
Canonical naming the literature speaks. Every RAG paper since 2023 uses Faithfulness, Answer Relevancy, Context Precision, and Context Recall. Future AGI ships the same conceptual rubrics under different names (Groundedness, Completeness, ContextAdherence, ChunkAttribution) for principled reasons — chunk-level granularity, hallucination-detection clarity — but if a stakeholder demands faithfulness in the dashboard column header because that is what the spec said, Ragas is the cleaner pick. Naming is a real cost.
Smallest possible dependency footprint. pip install ragas and the tree is small enough to read. No NLI extra, no distributed-runners extra, no guardrail-models extra. The ai-evaluation library is also pip-installable and explicitly modular, but the surface (60-plus EvalTemplate classes, 72 local metrics, four engines, four distributed backends) is a steeper learning curve on day one. For a notebook that needs to ship a score by Friday, Ragas is faster to internalize.
TestsetGenerator is a first-class surface. Synthetic test-set generation from a corpus is the canonical Ragas trick: feed it documents, get back question-answer-context triples. Future AGI ships dataset utilities, but Ragas’s TestsetGenerator is the polished one — the literature cites it, the docs walk it end-to-end, and the heuristics for question-type distribution (single-hop, multi-hop, conditional) are tuned. If synthetic bootstrapping is the wedge use case, Ragas earns the install.
The Ragas bet: if RAG is the whole job, canonical names are non-negotiable, and the rest of the stack already exists, this is the cleanest library in the OSS category. The honest gap is scope.
Where Future AGI wins
Future AGI wins on operational surface — the parts of the stack downstream of “what is this metric and how do I score one row.”
The classifier cascade collapses per-eval cost at scale. Ragas runs an LLM judge for every canonical metric by default. With gpt-4o-mini as the judge, per-eval cost is small at low volumes and material at 100K-plus evals per day. Future AGI’s three-tier model — local heuristics and NLI for the cheap path, the in-house Turing classifier family for the middle, and an LLM judge for the long tail — shifts the dominant cost line from per-eval tokens to GPU time on the NLI and classifier inference. At production volume the gap is an order of magnitude.
    from fi.evals import evaluate

    # Cheap-first cascade: local NLI runs free, judge fires only on uncertain rows
    score = evaluate(
        eval_name="faithfulness",
        output="The RPO is 15 minutes.",
        context="Primary DB RPO: 15 minutes. RTO: 30 minutes.",
        augment=True,
        model="gemini/gemini-2.5-flash",
    )
The augment=True flag wires the cascade: the local heuristic runs first, and the LLM judge fires only on rows where the local score is too uncertain to stand on its own, with the local reasoning passed in as a prior.
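The control flow behind a cheap-first cascade can be sketched in plain Python. This is an illustration of the pattern only, not the ai-evaluation implementation: `token_overlap_score`, `cascade`, and the `low`/`high` band are hypothetical names and values chosen for the example, and the token-overlap heuristic stands in for a real NLI model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    score: float
    used_judge: bool


def token_overlap_score(output: str, context: str) -> float:
    """Hypothetical cheap scorer: share of output tokens that appear in the context."""
    def clean(token: str) -> str:
        return token.strip(".,:;").lower()

    out_tokens = [clean(t) for t in output.split() if clean(t)]
    ctx_tokens = {clean(t) for t in context.split()}
    if not out_tokens:
        return 0.0
    return sum(t in ctx_tokens for t in out_tokens) / len(out_tokens)


def cascade(
    output: str,
    context: str,
    judge: Callable[[str, str, float], float],
    low: float = 0.3,   # assumed band edges, not library defaults
    high: float = 0.8,
) -> CascadeResult:
    """Run the cheap scorer first; call the judge only inside the uncertain band."""
    local = token_overlap_score(output, context)
    if local <= low or local >= high:
        # Confident either way: keep the free local score.
        return CascadeResult(score=local, used_judge=False)
    # Uncertain: escalate, handing the local score to the judge as a prior.
    return CascadeResult(score=judge(output, context, local), used_judge=True)
```

Any callable with the `(output, context, prior)` shape can serve as the judge, so the expensive model is swappable and is only billed for the rows the local tier cannot settle.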
Agent and non-RAG evals are first-class. Ragas is RAG-only. The moment RAG sits inside an agent — tool calls, multi-step planning, structured output, code execution — you need a second library. ai-evaluation covers agent trajectory (task_completion, step_efficiency, tool_selection_accuracy, trajectory_score, goal_progress, action_safety, reasoning_quality), four function-calling rubrics, and 11 structured-output rubrics in the same library. One install, one engine call, one span attribute schema.
Runtime guardrails on the request path. Ragas does not ship a runtime layer. Future AGI ships Protect (four Gemma 3n LoRA adapters at 65 ms text and 107 ms image median time-to-label, served via vLLM with hot-swappable endpoints), 13 guardrail backends (including LLAMAGUARD_3_8B, QWEN3GUARD_8B, GRANITE_GUARDIAN_8B, WILDGUARD_7B, and SHIELDGEMMA_2B), 8 sub-10 ms Scanners (including JailbreakScanner, SecretsScanner, MaliciousURLScanner, and InvisibleCharScanner), and four aggregation strategies (ANY, ALL, MAJORITY, WEIGHTED). Policy violations are caught synchronously before the model call returns.
    from fi.evals import Guardrails
    from fi.evals.guardrails import (
        GuardrailModel, RailType, AggregationStrategy,
        JailbreakScanner, SecretsScanner,
    )

    guard = Guardrails(
        models=[
            GuardrailModel(name="qwen3guard-4b", weight=0.6),
            GuardrailModel(name="llamaguard-3-1b", weight=0.4),
        ],
        scanners=[JailbreakScanner(), SecretsScanner()],
        rail_type=RailType.INPUT,
        aggregation=AggregationStrategy.WEIGHTED,
    )
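To make the four aggregation strategies concrete, here is one plausible reading of how per-detector verdicts could be combined. The semantics are an assumption for illustration: the `Aggregation` enum, `should_block`, and the 0.5 weighted threshold are hypothetical and are not taken from the library.

```python
from enum import Enum
from typing import Optional, Sequence


class Aggregation(Enum):
    ANY = "any"
    ALL = "all"
    MAJORITY = "majority"
    WEIGHTED = "weighted"


def should_block(
    verdicts: Sequence[bool],
    strategy: Aggregation,
    weights: Optional[Sequence[float]] = None,
    threshold: float = 0.5,  # assumed cut-off for the weighted vote
) -> bool:
    """Combine per-detector flags (True means 'flagged') into one block decision."""
    if not verdicts:
        return False
    if strategy is Aggregation.ANY:
        return any(verdicts)
    if strategy is Aggregation.ALL:
        return all(verdicts)
    if strategy is Aggregation.MAJORITY:
        return sum(verdicts) > len(verdicts) / 2
    if strategy is Aggregation.WEIGHTED:
        if weights is None or len(weights) != len(verdicts):
            raise ValueError("WEIGHTED needs one weight per verdict")
        flagged = sum(w for v, w in zip(verdicts, weights) if v)
        return flagged / sum(weights) >= threshold
    raise ValueError(f"unknown strategy: {strategy}")
```

With weights of 0.6 and 0.4, a flag from the heavier model alone is enough to block under WEIGHTED, while a flag from the lighter model alone is not. That asymmetry is the reason to weight detectors rather than count them.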
Distributed runners ship out of the box. Ragas’s evaluate() is single-process. Future AGI’s FrameworkEvaluator ships four backends — Celery, Ray, Temporal with durable retries, and a Kubernetes Job backend creating one Job per task — plus a ResilientBackend wrapper that composes circuit-breaker, rate-limit, retry, degradation, and health-check configs around any backend. For a 10M-row nightly eval that has to finish before standup, the difference is whether the eval finishes.
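The resilience wrapper idea (retry with backoff plus a circuit breaker around any callable) can be sketched as follows. This is a simplified stand-in, not the ResilientBackend API: `ResilientCall`, `CircuitOpenError`, and every parameter name are hypothetical, and rate limiting, degradation, health checks, and half-open recovery are left out.

```python
import time
from typing import Any, Callable


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the wrapped function."""


class ResilientCall:
    def __init__(
        self,
        fn: Callable[..., Any],
        max_retries: int = 3,
        failure_threshold: int = 5,
        base_delay: float = 0.1,
        sleep: Callable[[float], None] = time.sleep,  # injectable for tests
    ) -> None:
        self.fn = fn
        self.max_retries = max_retries
        self.failure_threshold = failure_threshold
        self.base_delay = base_delay
        self.sleep = sleep
        self.consecutive_failures = 0

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        if self.consecutive_failures >= self.failure_threshold:
            raise CircuitOpenError("circuit open: too many consecutive failures")
        for attempt in range(self.max_retries + 1):
            try:
                result = self.fn(*args, **kwargs)
            except Exception:
                self.consecutive_failures += 1
                out_of_retries = attempt == self.max_retries
                breaker_tripped = self.consecutive_failures >= self.failure_threshold
                if out_of_retries or breaker_tripped:
                    raise
                # Exponential backoff: base, 2x base, 4x base, ...
                self.sleep(self.base_delay * 2 ** attempt)
            else:
                self.consecutive_failures = 0
                return result
```

Once the breaker trips, later calls fail fast instead of queueing more work against a backend that is already down, which matters most when a nightly run fans out across many workers.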
The closed loop runs inside the stack. Ragas evaluates; what happens with the score is downstream. Future AGI’s Error Feed sits inside the eval stack: HDBSCAN clustering groups failing traces into named issues, a Sonnet 4.5 Judge on Bedrock reads each cluster and writes an immediate_fix, and the labelled dataset feeds agent-opt which proposes the next prompt revision. The Agent Command Center applies the update on the next request and rolls back if the score regresses. No fine-tuning loop, no RLHF — prompt-tuning by retrieved few-shots (vector search over past corrections in ChromaFeedbackStore) plus threshold sweeps (ThresholdCalibrator over [0.3, 0.9] × 13 steps). Mechanics visible in python/fi/evals/feedback/.
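A threshold sweep over [0.3, 0.9] in 13 steps works out to a grid with 0.05 spacing. The sketch below shows one way such a sweep could pick a pass/fail cut-off from labelled rows. It assumes F1 as the selection criterion and that a score at or above the threshold counts as a positive; `threshold_grid` and `sweep_thresholds` are hypothetical helpers, not the ThresholdCalibrator interface.

```python
from typing import List, Sequence, Tuple


def threshold_grid(low: float = 0.3, high: float = 0.9, steps: int = 13) -> List[float]:
    """Evenly spaced thresholds from low to high inclusive (0.05 apart by default)."""
    return [round(low + i * (high - low) / (steps - 1), 4) for i in range(steps)]


def sweep_thresholds(
    scores: Sequence[float],
    labels: Sequence[bool],
    low: float = 0.3,
    high: float = 0.9,
    steps: int = 13,
) -> Tuple[float, float]:
    """Return (threshold, f1) for the lowest threshold that maximizes F1."""
    best_threshold, best_f1 = low, -1.0
    for t in threshold_grid(low, high, steps):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1
```

Swapping F1 for precision at a fixed recall, or for a cost-weighted error, changes only the inner scoring lines. The grid and the loop stay the same.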
Where they tie: RAG metric coverage
On the canonical RAG rubric layer, this comparison ties harder than either vendor admits. Ragas’s seven cover the conceptual ground; Future AGI covers the same ground with finer granularity:
| Ragas | Future AGI |
|---|---|
| Faithfulness | Groundedness (47), FactualAccuracy (66), local faithfulness, claim_support, factual_consistency |
| Answer Relevancy | Completeness (10), ContextRelevance (9), local answer_relevancy |
| Context Precision | ContextAdherence (5), ChunkAttribution (11), local context_precision |
| Context Recall | ChunkUtilization (12), local context_recall, recall_at_k, ndcg, mrr |
| Context Entity Recall | local context_entity_recall |
| Noise Sensitivity | local noise_sensitivity |
| Multimodal Faithfulness | OCREvaluation, ImageInstructionAdherence, CaptionHallucination |
The local NLI backbone in ai-evaluation is the same DeBERTa entailment family that drives Ragas’s deterministic checks. The cloud-template versions route to the Turing classifier instead of an LLM judge by default — the cost economics point — but the rubric is conceptually the same. Scores on identical (question, contexts, answer) rows will not match prompt-for-prompt across the two libraries because judge prompts and thresholds differ, but the failure modes they catch are the same set.
For a research paper, Ragas naming reads cleaner. For production RAG with by-layer regression debugging, Future AGI’s chunk-level rubrics catch failures the canonical seven smooth over. The choice on this axis is naming preference and cost model, not coverage.
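The rank-based retrieval metrics named in the table (P@k, R@k, MRR, NDCG) have standard textbook definitions, sketched here with binary relevance. These are the generic formulas, not the ai-evaluation implementations, and the function names are chosen for the example.

```python
import math
from typing import List, Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        return 0.0
    return sum(d in relevant for d in ranked_ids[:k]) / k


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(d in relevant for d in ranked_ids[:k]) / len(relevant)


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Sequence[str]], relevants: List[Set[str]]) -> float:
    """MRR: the reciprocal rank averaged over queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in zip(runs, relevants)) / len(runs)


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

These metrics need gold relevance labels per query but no model call, which is why they sit at the free end of the cost spectrum.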
Where each one falls short
Ragas: four honest limits.
- RAG-only by design. Agent trajectory, function-calling correctness, structured-output validation, code-security rubrics — all out of scope. A team whose RAG sits inside an agent reaches for a second library inside a quarter.
- LLM-judge default with no cascade. Every canonical metric calls the judge on every row. No heuristic or NLI fallback, no augment=True equivalent.
- No runtime layer. No gateway, no router, no inline guardrails. Production safety enforcement is a separate buy.
- Single-process by default. No Celery, Ray, Temporal, or Kubernetes backends.
Future AGI: three deliberate tradeoffs.
- Bigger surface than a notebook needs. For a 200-row benchmark on a laptop, Ragas’s seven rubrics are easier to remember than 60-plus EvalTemplate classes plus 72 local metrics. The unified evaluate() API and the fi init CLI scaffold make the surface tractable, but the first-day learning curve is steeper.
- Linear is the only ticket sink in Error Feed today. Slack, GitHub, Jira, and PagerDuty are on the roadmap.
- Trace-stream ingestion into agent-opt is roadmap. The optimizer consumes a labelled dataset today; the direct traceAI-to-agent-opt connector is on the roadmap. The eval-driven path works now.
The decision framework
| Pick | Choose this if | Avoid if |
|---|---|---|
| Ragas | Pure-RAG workload in a single-process Python notebook; canonical naming (faithfulness, answer_relevancy) is a stakeholder requirement; dependency budget is tight; the rest of your stack (gateway, guardrails, observability) is already chosen | RAG sits inside an agent; you need runtime guardrails or a gateway in the same stack; per-eval cost matters at production volume; you need distributed runners for the nightly eval |
| Future AGI | RAG lives inside an agent runtime; you need agent + function-calling + structured-output evals next to Faithfulness; per-eval cost matters at scale; you need runtime guardrails at the request boundary; you run Java or Spring Boot; you want the optimizer and gateway downstream on one bill | RAG is the entire requirement and you’re happy operating four control planes; you specifically need the literature-canonical metric names in dashboard headers for procurement reasons |
| Both, composed | You want the canonical Ragas names in CI and the Future AGI operational surface in production; you’re disciplined about pinning judge prompts so offline and online scores stay comparable | Score drift between two judge implementations is a deal-breaker for your team; one source of truth is non-negotiable |
The hybrid pattern: Ragas in CI, Future AGI in production
The two libraries do not duplicate instrumentation. The common 2026 pattern composes them at different layers.
In CI. Keep Ragas for the canonical metric names the research team already knows. Run the seven rubrics on a small golden set during every pull request and block the merge if faithfulness regresses past a threshold. The reproducibility argument is real: a year-old golden set scored with a pinned Ragas version produces a number that matches the literature, which matters when a stakeholder asks “what is our faithfulness score” and expects the academic answer.
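A merge gate of this kind reduces to comparing the golden-set mean against a stored baseline. The sketch below assumes per-row faithfulness scores are already computed by whichever library runs in CI; `regression_gate` and the 0.02 tolerance are hypothetical choices for illustration.

```python
from statistics import mean
from typing import Sequence, Tuple


def regression_gate(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    max_drop: float = 0.02,  # assumed tolerance; tune per metric and golden-set size
) -> Tuple[int, float]:
    """Return (exit_code, drop): exit code 1 blocks the merge, 0 lets it through."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("both score lists must be non-empty")
    drop = mean(baseline_scores) - mean(candidate_scores)
    return (1 if drop > max_drop else 0), drop
```

In a pipeline, the first element would be handed to `sys.exit` so the CI job fails when the pull request regresses. On a small golden set a fixed tolerance is noisy, so a significance test on the two samples is a reasonable refinement.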
In production. Drop traceAI into the application code to capture OpenInference spans across LangChain, LlamaIndex, OpenAI Agents SDK, CrewAI, Pydantic AI, AutoGen, Mastra, and the rest of the 50-plus surfaces. Use ai-evaluation for agent trajectory, function-calling, structured-output, and cost-efficient continuous RAG evaluation against the Turing classifier family — the cascade collapses per-row cost so you can score 100 percent of traffic on the cheap rubrics and sample the LLM-judge ones. Wire agent-opt to the captured dataset. Run Agent Command Center as the routing and guardrail layer with Protect inline.
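Scoring every request on the cheap rubrics while sampling the LLM-judge ones needs a sampling rule that is stable per trace. One common approach, sketched here as an assumption rather than as Future AGI's mechanism, is to hash the trace ID into a bucket; `in_judge_sample` is a hypothetical helper.

```python
import hashlib


def in_judge_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces for the expensive rubrics."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return bucket < rate
```

Because the decision depends only on the trace ID, every service that sees the same trace makes the same call, and re-running the evaluation later selects the same rows.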
The seam: the Ragas score on the CI golden set and the Future AGI score on production traces will not match prompt-for-prompt. Pin both judge prompts if score-comparison across CI and prod is a hard requirement; otherwise treat the two as different signals at different layers and report them separately.
When the comparison ends at Future AGI
The Ragas-or-Future-AGI framing only works when RAG is the entire job. If runtime guardrails, gateway routing, agent trajectory evals, distributed runners, or a tied optimization loop are also on the requirement list, the comparison ends here.
Future AGI ships the eval stack as a package. traceAI runs OpenTelemetry-and-OpenInference tracing across Python, TypeScript, Java (Spring Boot, Spring AI, LangChain4j), and C#. ai-evaluation runs the unified evaluate() API across local heuristics, NLI, the Turing classifier family, and any LiteLLM model — with augment=True wiring the cascade. agent-opt closes the loop with six optimizers. The Agent Command Center sits in the request path with 20-plus providers, 5-level budgets, 33 guardrail scanners, and response headers exposing routing decision, cost, latency, fallback, and guardrail trigger.
The practical difference: in Ragas, an eval score is a number you log somewhere. In Future AGI, that same score lands as a span attribute on the trace tree the gateway and guardrails wrote into via enrich_span_with_evaluation, so eval.<metric>.score, eval.<metric>.reason, and eval.<metric>.latency_ms ride along with gateway.routing_strategy and guardrail.triggered on one span. One source of truth, one attribute schema, one self-host plane. SOC 2 Type II, HIPAA, GDPR, and CCPA are certified; ISO/IEC 27001 is in active audit.
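The attribute schema described above can be illustrated without any tracing dependency. The helper below only builds the key/value pairs. It does not call enrich_span_with_evaluation, whose signature is not shown here, and `eval_span_attributes` is a hypothetical name.

```python
from typing import Dict, Union

AttributeValue = Union[str, float]


def eval_span_attributes(
    metric: str, score: float, reason: str, latency_ms: float
) -> Dict[str, AttributeValue]:
    """Build eval.<metric>.* attributes in the shape described in the text."""
    prefix = f"eval.{metric}"
    return {
        f"{prefix}.score": float(score),
        f"{prefix}.reason": reason,
        f"{prefix}.latency_ms": float(latency_ms),
    }
```

With the OpenTelemetry Python API, a dictionary like this can be attached to the active span through `Span.set_attributes`, which puts the eval fields next to whatever gateway and guardrail attributes are already on that span.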
Ragas wins the RAG-library slice. Future AGI wins when the operational surface is the constraint — and for most teams running customer-facing RAG in 2026, the operational surface is the constraint.
Common mistakes when comparing Ragas and Future AGI
- Treating metric coverage as the only axis. Ragas and Future AGI tie on RAG rubric coverage. The real axes are cost economics at production volume, runtime guardrails, distributed runners, and the closed loop.
- Confusing library vs runtime. The fair Ragas comparison is ai-evaluation alone; the full Future AGI comparison is the eval-stack package.
- Underestimating the LLM-judge bill. At 100K-plus evals per day with gpt-4o-mini as the judge, the monthly cost is real. Model it before committing to a judge-default workflow.
- Assuming scores match across CI and prod. If Ragas runs in CI and Future AGI runs in production, the judge prompts and thresholds differ. Pin both sides or report the two signals separately.
- Forgetting the runtime layer. Ragas observes; it does not enforce. PII redaction, injection blocking, and content moderation at the request hop are a separate buy.
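Modelling the judge bill, as the list above recommends, is a few lines of arithmetic. Every input in this sketch is a placeholder to be replaced with your own traffic and your provider's current pricing; `monthly_judge_cost` and `judge_fraction` are hypothetical names, and none of the numbers in the usage note are measured results.

```python
def monthly_judge_cost(
    evals_per_day: int,
    metrics_per_eval: int,
    tokens_per_call: int,
    usd_per_million_tokens: float,
    judge_fraction: float = 1.0,  # share of rows that actually reach the LLM judge
    days: int = 30,
) -> float:
    """Estimated monthly spend on LLM-judge tokens, in USD."""
    judge_calls = evals_per_day * metrics_per_eval * judge_fraction * days
    return judge_calls * tokens_per_call * usd_per_million_tokens / 1_000_000
```

As a purely illustrative run, 100,000 evals per day, 4 metrics, 1,500 tokens per judge call, and a placeholder price of 0.50 USD per million tokens come to 9,000 USD per month with a judge on every row. If a cascade sends only 10 percent of rows to the judge, the token line falls to 900 USD, before counting whatever the local and classifier tiers cost to run.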
Sources
- Ragas documentation
- Ragas repository
- Future AGI ai-evaluation (Apache 2.0)
- Future AGI traceAI (Apache 2.0)
- Future AGI agent-opt (Apache 2.0)
- Future AGI Protect latency
- Agent Command Center docs