Ragas vs Future AGI in 2026: A Head-to-Head for ML Engineers Running RAG

Ragas vs Future AGI compared honestly on RAG metric coverage, cost economics, CI fit, runtime guardrails, observability, and the closed loop. Where each one wins, where they tie, and how to compose.

Ragas and Future AGI share a shortlist because their RAG metric vocabularies overlap. They got there from opposite directions. Ragas is a RAG-specific evaluation library; Future AGI is an eval-stack package — the RAG metrics plus an OpenTelemetry tracer, an optimizer, a gateway, and inline guardrails on one bill. Pick Ragas when RAG eval is the whole job and the rest of the stack is already chosen. Pick Future AGI when RAG sits inside an agent runtime, per-eval cost matters at production volume, or the gate, tracer, optimizer, and guardrails need to live on the same span tree.

The frame: library vs eval-stack package

The feature lists overlap on RAG rubrics because every paper since 2023 uses the same names. Below the rubric layer, they diverge.

Ragas is a focused RAG evaluation library — one pip install, a small dependency tree, seven canonical metrics (Faithfulness, Answer Relevancy, Context Precision, Context Recall, Context Entity Recall, Noise Sensitivity, Multimodal Faithfulness), a TestsetGenerator for synthetic question-answer-context triples, and thin adapter packages for LangSmith, Langfuse, and OpenTelemetry. The design center is a notebook-friendly API with naming that matches the academic literature and a deliberate refusal to grow outside the RAG box.

Future AGI is an eval-stack package — three Apache 2.0 building blocks plus a hosted control plane:

  • ai-evaluation ships 60-plus cloud EvalTemplate classes and 72 local metrics covering RAG, agent trajectory, function calling, structured output, hallucination, code security, and customer-agent surfaces. The unified evaluate() API routes across LocalEngine (heuristics, NLI, no API call), TuringEngine (in-house classifiers), and LLMEngine (any LiteLLM model).
  • traceAI is OpenTelemetry-native with OpenInference spans across 50-plus framework integrations in Python, TypeScript, Java (Spring Boot, Spring AI, LangChain4j), and C#.
  • agent-opt ships six optimizers (ProTeGi, GEPA, Meta-Prompt, PromptWizard, Bayesian, Random Search) that consume a labelled dataset and propose the next prompt revision.

On top sits the Future AGI Platform with self-improving evaluators, an in-product rubric-authoring agent, and the Agent Command Center (20-plus providers, 5-level budgets, 33 guardrail scanners). Error Feed clusters failing production traces via HDBSCAN, and a Sonnet 4.5 judge writes an immediate_fix for each cluster.

Ragas is a metric library. Future AGI is a runtime that observes, scores, optimizes, and guards.

TL;DR: capability snapshot

| Capability | Ragas | Future AGI |
| --- | --- | --- |
| Core identity | RAG-specific eval library | Eval-stack package (eval + trace + optimize + gateway + guardrails) |
| License | Apache 2.0 single Python package | traceAI, ai-evaluation, agent-opt Apache 2.0; hosted Platform closed |
| RAG metric coverage | Faithfulness, Answer Relevancy, Context Precision, Context Recall, Context Entity Recall, Noise Sensitivity, Multimodal Faithfulness | Same conceptual space under different names plus chunk-level rubrics, NDCG, MRR, P@k, R@k |
| Cost economics | LLM-judge default per metric per row | Heuristic / NLI / classifier / LLM-judge cascade with augment=True |
| Agent + non-RAG evals | Out of scope | Agent trajectory (7), function calling (4), structured output (11), code security (6), 11 customer-agent rubrics |
| Distributed runners | Single-process | Celery, Ray, Temporal, Kubernetes + ResilientBackend wrapper |
| Observability integration | LangSmith, Langfuse, OTel adapters | traceAI OpenInference spans in Python, TS, Java, C# |
| Runtime guardrails | Not in scope | Protect at 65 ms text / 107 ms image; 13 backends; 8 sub-10 ms Scanners |
| Gateway | Not in scope | Agent Command Center: 20-plus providers, 5-level budgets, 33 scanners |
| Closed loop | Not in scope | Error Feed clustering + agent-opt rewrites |
| Best fit | Pure-RAG, notebook-shaped, single-process | RAG inside an agent runtime, production scale, regulated workflow |

One-line verdict: Ragas wins on focus, naming, and footprint for pure-RAG. Future AGI wins on operational surface, classifier-backed cost economics, and the closed loop. The two compose; the choice is whether RAG stands alone or sits inside something larger.

Where Ragas wins

Ragas wins three concrete fights, and the wins are real.

Canonical naming the literature speaks. Every RAG paper since 2023 uses Faithfulness, Answer Relevancy, Context Precision, and Context Recall. Future AGI ships the same conceptual rubrics under different names (Groundedness, Completeness, ContextAdherence, ChunkAttribution) for principled reasons — chunk-level granularity, hallucination-detection clarity — but if a stakeholder demands faithfulness in the dashboard column header because that is what the spec said, Ragas is the cleaner pick. Naming is a real cost.

Smallest possible dependency footprint. pip install ragas and the tree is small enough to read. No NLI extra, no distributed-runners extra, no guardrail-models extra. The ai-evaluation library is also pip-installable and explicitly modular, but the surface (60-plus EvalTemplate classes, 72 local metrics, four engines, four distributed backends) is a steeper learning curve on day one. For a notebook that needs to ship a score by Friday, Ragas is faster to internalize.

TestsetGenerator is a first-class surface. Synthetic test-set generation from a corpus is the canonical Ragas trick: feed it documents, get back question-answer-context triples. Future AGI ships dataset utilities, but Ragas’s TestsetGenerator is the polished one — the literature cites it, the docs walk it end-to-end, and the heuristics for question-type distribution (single-hop, multi-hop, conditional) are tuned. If synthetic bootstrapping is the wedge use case, Ragas earns the install.

The Ragas bet: if RAG is the whole job, canonical names are non-negotiable, and the rest of the stack already exists, this is the cleanest library in the OSS category. The honest gap is scope.

Where Future AGI wins

Future AGI wins on operational surface — the parts of the stack downstream of “what is this metric and how do I score one row.”

The classifier cascade collapses per-eval cost at scale. Ragas runs an LLM judge for every canonical metric by default. With gpt-4o-mini as the judge, per-eval cost is small at low volumes and material at 100K-plus evals per day. Future AGI’s three-tier model — local heuristics and NLI for the cheap path, the in-house Turing classifier family for the middle, and an LLM judge for the long tail — shifts the dominant cost line from per-eval tokens to GPU time on the NLI and classifier inference. At production volume the gap is an order of magnitude.

```python
from fi.evals import evaluate

# Cheap-first cascade: local NLI runs free, judge fires only on uncertain rows
score = evaluate(
    eval_name="faithfulness",
    output="The RPO is 15 minutes.",
    context="Primary DB RPO: 15 minutes. RTO: 30 minutes.",
    augment=True,
    model="gemini/gemini-2.5-flash",
)
```

The augment=True flag wires the cascade: the local heuristic runs first, and the LLM judge fires only when the local result is uncertain, that is, when its confidence falls below a threshold. The local reasoning is passed to the judge as a prior.
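The cheap-first control flow can be sketched in plain Python. This is an illustration of the idea only, not the ai-evaluation implementation: `cascade`, `token_overlap`, the `llm_judge` callable, and the 0.25 to 0.75 uncertainty band are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    score: float
    source: str  # "local" when the cheap scorer decided, "judge" when escalated


def cascade(
    output: str,
    context: str,
    local_score: Callable[[str, str], float],
    llm_judge: Callable[[str, str, float], float],
    uncertain_band: tuple[float, float] = (0.25, 0.75),  # hypothetical band
) -> CascadeResult:
    """Run the cheap scorer first; escalate only when its score is ambiguous."""
    cheap = local_score(output, context)
    low, high = uncertain_band
    if cheap <= low or cheap >= high:
        # Confident either way: keep the free local score, skip the API call.
        return CascadeResult(cheap, "local")
    # Ambiguous: pay for the judge and hand it the local score as a prior.
    return CascadeResult(llm_judge(output, context, cheap), "judge")


def token_overlap(output: str, context: str) -> float:
    """Toy local scorer: fraction of output tokens that also appear in the context."""
    out_tokens = output.lower().split()
    ctx_tokens = set(context.lower().split())
    if not out_tokens:
        return 0.0
    return sum(t in ctx_tokens for t in out_tokens) / len(out_tokens)
```

The judge bill then scales with the share of rows that land inside the band rather than with total traffic, which is why the width of the band is the main cost lever in a design like this.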

Agent and non-RAG evals are first-class. Ragas is RAG-only. The moment RAG sits inside an agent — tool calls, multi-step planning, structured output, code execution — you need a second library. ai-evaluation covers agent trajectory (task_completion, step_efficiency, tool_selection_accuracy, trajectory_score, goal_progress, action_safety, reasoning_quality), four function-calling rubrics, and 11 structured-output rubrics in the same library. One install, one engine call, one span attribute schema.

Runtime guardrails on the request path. Ragas does not ship a runtime layer. Future AGI ships Protect (four Gemma 3n LoRA adapters at 65 ms text and 107 ms image median time-to-label, served via vLLM with hot-swappable endpoints), 13 guardrail backends (LLAMAGUARD_3_8B, QWEN3GUARD_8B, GRANITE_GUARDIAN_8B, WILDGUARD_7B, and SHIELDGEMMA_2B among them), 8 sub-10 ms Scanners (including JailbreakScanner, SecretsScanner, MaliciousURLScanner, and InvisibleCharScanner), and four aggregation strategies (ANY, ALL, MAJORITY, WEIGHTED). Policy violations are caught synchronously before the model call returns.

```python
from fi.evals import Guardrails
from fi.evals.guardrails import (
    GuardrailModel, RailType, AggregationStrategy,
    JailbreakScanner, SecretsScanner,
)

# Two guard models vote with weights 0.6 / 0.4; two fast scanners run alongside.
guard = Guardrails(
    models=[
        GuardrailModel(name="qwen3guard-4b", weight=0.6),
        GuardrailModel(name="llamaguard-3-1b", weight=0.4),
    ],
    scanners=[JailbreakScanner(), SecretsScanner()],
    rail_type=RailType.INPUT,                  # screen the request before the model call
    aggregation=AggregationStrategy.WEIGHTED,  # combine model verdicts by weight
)
```
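How the four aggregation strategies might combine per-model verdicts can be sketched as follows. The semantics are assumptions based on the usual meaning of the names (ANY blocks if one model flags, WEIGHTED blocks when the flagged share of total weight reaches a threshold), not a description of the library's internals. `Strategy`, `Verdict`, `aggregate`, and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    ANY = "any"
    ALL = "all"
    MAJORITY = "majority"
    WEIGHTED = "weighted"


@dataclass
class Verdict:
    model: str
    flagged: bool
    weight: float = 1.0


def aggregate(verdicts: list[Verdict], strategy: Strategy, threshold: float = 0.5) -> bool:
    """Return True when the combined verdict is 'block'."""
    if not verdicts:
        return False
    flags = [v.flagged for v in verdicts]
    if strategy is Strategy.ANY:
        return any(flags)
    if strategy is Strategy.ALL:
        return all(flags)
    if strategy is Strategy.MAJORITY:
        return sum(flags) > len(flags) / 2  # strict majority; a tie does not block
    # WEIGHTED: block when the flagged share of total weight reaches the threshold.
    total = sum(v.weight for v in verdicts)
    flagged_weight = sum(v.weight for v in verdicts if v.flagged)
    return total > 0 and flagged_weight / total >= threshold
```

With weights of 0.6 and 0.4, as in the configuration above, a flag from the heavier model alone would block under this sketch's 0.5 threshold, while a flag from the lighter model alone would not.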

Distributed runners ship out of the box. Ragas’s evaluate() is single-process. Future AGI’s FrameworkEvaluator ships four backends — Celery, Ray, Temporal with durable retries, and a Kubernetes Job backend creating one Job per task — plus a ResilientBackend wrapper that composes circuit-breaker, rate-limit, retry, degradation, and health-check configs around any backend. For a 10M-row nightly eval that has to finish before standup, the difference is whether the eval finishes.
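To make the resilience idea concrete, here is a minimal sketch of two of the behaviors the paragraph names, retry with backoff and a circuit breaker, wrapped around an arbitrary task. It is not the ResilientBackend API: `ResilientCall`, `CircuitOpenError`, and every parameter name are hypothetical, and the breaker here has no half-open recovery.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are rejected without running."""


class ResilientCall:
    """Retry with exponential backoff behind a consecutive-failure circuit breaker."""

    def __init__(self, max_retries: int = 3, base_delay: float = 0.1,
                 failure_threshold: int = 5,
                 sleep: Callable[[float], None] = time.sleep):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.failure_threshold = failure_threshold
        self.sleep = sleep  # injectable so tests do not actually wait
        self.consecutive_failures = 0

    def run(self, task: Callable[[], T]) -> T:
        if self.consecutive_failures >= self.failure_threshold:
            raise CircuitOpenError("too many consecutive failures; backend marked unhealthy")
        for attempt in range(self.max_retries + 1):
            try:
                result = task()
            except Exception:
                self.consecutive_failures += 1
                out_of_retries = attempt == self.max_retries
                breaker_tripped = self.consecutive_failures >= self.failure_threshold
                if out_of_retries or breaker_tripped:
                    raise
                self.sleep(self.base_delay * 2 ** attempt)  # 0.1 s, 0.2 s, 0.4 s, ...
            else:
                self.consecutive_failures = 0  # any success closes the breaker
                return result
        raise AssertionError("unreachable: the loop always returns or raises")
```

On a long nightly run, the breaker is what keeps one unhealthy worker pool from burning the whole retry budget row by row.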

The closed loop runs inside the stack. Ragas evaluates; what happens with the score is downstream. Future AGI’s Error Feed sits inside the eval stack: HDBSCAN clustering groups failing traces into named issues, a Sonnet 4.5 Judge on Bedrock reads each cluster and writes an immediate_fix, and the labelled dataset feeds agent-opt which proposes the next prompt revision. The Agent Command Center applies the update on the next request and rolls back if the score regresses. No fine-tuning loop, no RLHF — prompt-tuning by retrieved few-shots (vector search over past corrections in ChromaFeedbackStore) plus threshold sweeps (ThresholdCalibrator over [0.3, 0.9] × 13 steps). Mechanics visible in python/fi/evals/feedback/.
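A threshold sweep of the kind described can be sketched in a few lines. Assuming both endpoints are included, 13 steps over [0.3, 0.9] gives a spacing of 0.05. The selection criterion used here (best F1 against human pass/fail labels) is an assumption, as are the function names; this is not the ThresholdCalibrator source.

```python
def threshold_grid(low: float = 0.3, high: float = 0.9, steps: int = 13) -> list[float]:
    """Evenly spaced thresholds with both endpoints included: 0.30, 0.35, ..., 0.90."""
    width = (high - low) / (steps - 1)
    return [round(low + i * width, 4) for i in range(steps)]


def f1_at(threshold: float, scores: list[float], labels: list[bool]) -> float:
    """F1 of the rule 'score >= threshold means pass' against human pass/fail labels."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep_thresholds(scores: list[float], labels: list[bool]) -> tuple[float, float]:
    """Return (best_threshold, best_f1); ties resolve to the lowest threshold."""
    candidates = ((t, f1_at(t, scores, labels)) for t in threshold_grid())
    return max(candidates, key=lambda pair: pair[1])
```

The labels would come from the human corrections collected in the feedback store; the sweep only decides where to draw the pass/fail line on an evaluator's existing scores.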

Where they tie: RAG metric coverage

On the canonical RAG rubric layer, this comparison ties harder than either vendor admits. Ragas’s seven cover the conceptual ground; Future AGI covers the same ground with finer granularity:

| Ragas | Future AGI (cloud template eval_id in parentheses) |
| --- | --- |
| Faithfulness | Groundedness (47), FactualAccuracy (66), local faithfulness, claim_support, factual_consistency |
| Answer Relevancy | Completeness (10), ContextRelevance (9), local answer_relevancy |
| Context Precision | ContextAdherence (5), ChunkAttribution (11), local context_precision |
| Context Recall | ChunkUtilization (12), local context_recall, recall_at_k, ndcg, mrr |
| Context Entity Recall | local context_entity_recall |
| Noise Sensitivity | local noise_sensitivity |
| Multimodal Faithfulness | OCREvaluation, ImageInstructionAdherence, CaptionHallucination |

The local NLI backbone in ai-evaluation is the same DeBERTa entailment family that drives Ragas’s deterministic checks. The cloud-template versions route to the Turing classifier instead of an LLM judge by default — the cost economics point — but the rubric is conceptually the same. Scores on identical (question, contexts, answer) rows will not match prompt-for-prompt across the two libraries because judge prompts and thresholds differ, but the failure modes they catch are the same set.
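The retrieval-ranking metrics named in the mapping above (recall_at_k, ndcg, mrr, and P@k) have standard textbook definitions. The sketch below implements them with binary relevance over chunk IDs. It is a reference illustration, not either library's code, and the function names are my own.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: list[tuple[list[str], set[str]]]) -> float:
    """MRR: the reciprocal rank averaged over (ranking, relevant set) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of this ranking divided by DCG of an ideal one."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Unlike the judge-backed rubrics, these need only a ranked list and a set of gold chunk IDs, so they cost nothing per row and are deterministic across runs.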

For a research paper, Ragas naming reads cleaner. For production RAG with by-layer regression debugging, Future AGI’s chunk-level rubrics catch failures the canonical seven smooth over. The choice on this axis is naming preference and cost model, not coverage.

Where each one falls short

Ragas: four honest limits.

  • RAG-only by design. Agent trajectory, function-calling correctness, structured-output validation, code-security rubrics — all out of scope. A team whose RAG sits inside an agent reaches for a second library inside a quarter.
  • LLM-judge default with no cascade. Every canonical metric calls the judge on every row by default. There is no built-in cheap-first fallback to a heuristic or NLI check, and no augment=True equivalent.
  • No runtime layer. No gateway, no router, no inline guardrails. Production safety enforcement is a separate buy.
  • Single-process by default. No Celery, Ray, Temporal, or Kubernetes backends.

Future AGI: three deliberate tradeoffs.

  • Bigger surface than a notebook needs. For a 200-row benchmark on a laptop, Ragas’s seven rubrics are easier to remember than 60-plus EvalTemplate classes plus 72 local metrics. The unified evaluate() API and the fi init CLI scaffold make the surface tractable, but the first-day learning curve is steeper.
  • Linear is the only ticket sink in Error Feed today. Slack, GitHub, Jira, and PagerDuty are on the roadmap.
  • Trace-stream ingestion into agent-opt is roadmap. The optimizer consumes a labelled dataset today; the direct traceAI-to-agent-opt connector is on the roadmap. The eval-driven path works now.

The decision framework

| Pick | Choose this if | Avoid if |
| --- | --- | --- |
| Ragas | Pure-RAG workload in a single-process Python notebook; canonical naming (faithfulness, answer_relevancy) is a stakeholder requirement; dependency budget is tight; the rest of your stack (gateway, guardrails, observability) is already chosen | RAG sits inside an agent; you need runtime guardrails or a gateway in the same stack; per-eval cost matters at production volume; you need distributed runners for the nightly eval |
| Future AGI | RAG lives inside an agent runtime; you need agent + function-calling + structured-output evals next to Faithfulness; per-eval cost matters at scale; you need runtime guardrails at the request boundary; you run Java or Spring Boot; you want the optimizer and gateway downstream on one bill | RAG is the entire requirement and you’re happy operating four control planes; you specifically need literature-canonical naming in dashboard headers for procurement reasons |
| Both, composed | You want the canonical Ragas names in CI and the Future AGI operational surface in production; you’re disciplined about pinning judge prompts so offline and online scores stay comparable | Score drift between two judge implementations is a deal-breaker for your team; one source of truth is non-negotiable |

The hybrid pattern: Ragas in CI, Future AGI in production

The two libraries do not duplicate instrumentation. The common 2026 pattern composes them at different layers.

In CI. Keep Ragas for the canonical metric names the research team already knows. Run the seven rubrics on a small golden set during every pull request and block the merge if faithfulness regresses past a threshold. The reproducibility argument is real: a year-old golden set scored with a pinned Ragas version produces a number that matches the literature, which matters when a stakeholder asks “what is our faithfulness score” and expects the academic answer.
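The merge gate itself is small once the scores exist. Below is a minimal sketch of the comparison step, assuming the pinned Ragas run has already written per-metric scores for the golden set. The function names, metric keys, and 0.02 tolerance are hypothetical, and how the scores get loaded is left out.

```python
import sys


def regressions(baseline: dict[str, float], current: dict[str, float],
                tolerance: float = 0.02) -> dict[str, float]:
    """Map each metric that dropped by more than `tolerance` to the size of its drop."""
    drops: dict[str, float] = {}
    for metric, base in baseline.items():
        if metric not in current:
            drops[metric] = base  # a metric that vanished counts as a full regression
            continue
        drop = base - current[metric]
        if drop > tolerance:
            drops[metric] = round(drop, 4)
    return drops


def gate(baseline: dict[str, float], current: dict[str, float],
         tolerance: float = 0.02) -> int:
    """Exit code for CI: 0 lets the merge through, 1 blocks it."""
    failed = regressions(baseline, current, tolerance)
    for metric, drop in failed.items():
        print(f"FAIL {metric}: dropped {drop} (tolerance {tolerance})", file=sys.stderr)
    return 1 if failed else 0
```

In a pipeline this would end with `sys.exit(gate(baseline, current))`. Comparing against a stored baseline rather than an absolute floor keeps the gate meaningful as the golden set evolves.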

In production. Drop traceAI into the application code to capture OpenInference spans across LangChain, LlamaIndex, OpenAI Agents SDK, CrewAI, Pydantic AI, AutoGen, Mastra, and the rest of the 50-plus surfaces. Use ai-evaluation for agent trajectory, function-calling, structured-output, and cost-efficient continuous RAG evaluation against the Turing classifier family — the cascade collapses per-row cost so you can score 100 percent of traffic on the cheap rubrics and sample the LLM-judge ones. Wire agent-opt to the captured dataset. Run Agent Command Center as the routing and guardrail layer with Protect inline.

The seam: the Ragas score on the CI golden set and the Future AGI score on production traces will not match prompt-for-prompt. Pin both judge prompts if score-comparison across CI and prod is a hard requirement; otherwise treat the two as different signals at different layers and report them separately.

When the comparison ends at Future AGI

The Ragas-or-Future-AGI framing only works when RAG is the entire job. If runtime guardrails, gateway routing, agent trajectory evals, distributed runners, or a tied optimization loop are also on the requirement list, the comparison ends here.

Future AGI ships the eval stack as a package. traceAI runs OpenTelemetry-and-OpenInference tracing across Python, TypeScript, Java (Spring Boot, Spring AI, LangChain4j), and C#. ai-evaluation runs the unified evaluate() API across local heuristics, NLI, the Turing classifier family, and any LiteLLM model — with augment=True wiring the cascade. agent-opt closes the loop with six optimizers. The Agent Command Center sits in the request path with 20-plus providers, 5-level budgets, 33 guardrail scanners, and response headers exposing routing decision, cost, latency, fallback, and guardrail trigger.

The practical difference: in Ragas, an eval score is a number you log somewhere. In Future AGI, enrich_span_with_evaluation writes that same score as a span attribute on the trace tree the gateway and guardrails already write to, so eval.<metric>.score, eval.<metric>.reason, and eval.<metric>.latency_ms ride along with gateway.routing_strategy and guardrail.triggered on one span. One source of truth, one attribute schema, one self-host plane. Future AGI is certified for SOC 2 Type II, HIPAA, GDPR, and CCPA; ISO/IEC 27001 is in active audit.
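The attribute schema is easy to picture as a flat dictionary. The sketch below only builds the eval.<metric>.* keys named above and merges them with gateway and guardrail keys. It does not call enrich_span_with_evaluation, whose signature is not shown here; `eval_span_attributes` and the example values (including "cost_optimized") are hypothetical.

```python
def eval_span_attributes(metric: str, score: float, reason: str,
                         latency_ms: float) -> dict[str, object]:
    """Flatten one eval result into eval.<metric>.* keys for a span."""
    prefix = f"eval.{metric}"
    return {
        f"{prefix}.score": float(score),
        f"{prefix}.reason": reason,
        f"{prefix}.latency_ms": float(latency_ms),
    }


# Hypothetical values already written on the same span by the gateway and guardrails.
span_attributes: dict[str, object] = {
    "gateway.routing_strategy": "cost_optimized",
    "guardrail.triggered": False,
}

# The eval result joins the same flat namespace, so one query can filter on both.
span_attributes.update(
    eval_span_attributes("faithfulness", 0.93, "All claims supported by context.", 41.0)
)
```

With the OpenTelemetry Python API, a dictionary like this can be attached to the active span through `span.set_attributes(...)`. The payoff is that a trace query such as "guardrail triggered and faithfulness below 0.7" needs no join across systems.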

Ragas wins the RAG-library slice. Future AGI wins when the operational surface is the constraint — and for most teams running customer-facing RAG in 2026, the operational surface is the constraint.

Common mistakes when comparing Ragas and Future AGI

  • Treating metric coverage as the only axis. Ragas and Future AGI tie on RAG rubric coverage. The real axes are cost economics at production volume, runtime guardrails, distributed runners, and the closed loop.
  • Confusing library vs runtime. The fair Ragas comparison is ai-evaluation alone; the full Future AGI comparison is the eval-stack package.
  • Underestimating the LLM-judge bill. At 100K-plus evals per day with gpt-4o-mini as the judge, the monthly cost is real. Model it before committing to a judge-default workflow.
  • Assuming scores match across CI and prod. If Ragas runs in CI and Future AGI runs in production, the judge prompts and thresholds differ. Pin both sides or report the two signals separately.
  • Forgetting the runtime layer. Ragas observes; it does not enforce. PII redaction, injection blocking, and content moderation at the request hop are a separate buy.
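For the judge-bill point above, a back-of-envelope model is usually enough to see the order of magnitude. In the sketch below every number is a placeholder, not a measurement: token counts per call, prices per million tokens, and the fraction of rows that reach the judge all need to come from your own traffic and your provider's current price sheet.

```python
def monthly_judge_cost(
    evals_per_day: int,
    metrics_per_eval: int,
    input_tokens: int,          # prompt tokens per judge call (placeholder)
    output_tokens: int,         # completion tokens per judge call (placeholder)
    usd_per_1m_input: float,    # placeholder price; check the provider's sheet
    usd_per_1m_output: float,   # placeholder price; check the provider's sheet
    judge_fraction: float = 1.0,  # share of rows that actually reach the LLM judge
    days: int = 30,
) -> float:
    """Estimated monthly spend in USD on LLM-judge calls."""
    calls = evals_per_day * metrics_per_eval * judge_fraction * days
    usd_per_call = (input_tokens * usd_per_1m_input
                    + output_tokens * usd_per_1m_output) / 1_000_000
    return calls * usd_per_call


# Judge on every row versus a cascade that escalates 10% of rows (placeholder inputs).
judge_everything = monthly_judge_cost(100_000, 4, 1_500, 200, 0.15, 0.60)
with_cascade = monthly_judge_cost(100_000, 4, 1_500, 200, 0.15, 0.60, judge_fraction=0.1)
```

Under these placeholder inputs the judge-on-every-row figure comes to roughly 4,140 USD per month and the 10 percent escalation figure to roughly 414 USD. The second number excludes the compute for the local and classifier tiers, which has to be modeled separately.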

Frequently asked questions

Should I pick Ragas or Future AGI in 2026?
Pick Ragas when RAG eval is the whole job, the seven canonical rubrics (Faithfulness, Answer Relevancy, Context Precision, Context Recall, Context Entity Recall, Noise Sensitivity, Multimodal Faithfulness) are the rubrics your team already speaks, and you want the smallest possible dependency surface in a single-process Python notebook. Pick Future AGI when RAG sits inside a larger agent runtime and you need agent trajectory evals, runtime guardrails on the request path, classifier-backed cost economics at production volume, distributed runners, or the eval-to-optimizer-to-gateway loop on one bill. The honest read is that Ragas is a RAG-specific library and Future AGI is an eval-stack package; the choice is upstream of features.
Are Ragas and Future AGI both open source?
Yes. Ragas is Apache 2.0 in a single pip-installable Python package with a small dependency tree. Future AGI's three building blocks (traceAI, ai-evaluation, agent-opt) are Apache 2.0 as well; the hosted Future AGI Platform and the Agent Command Center are the closed-source control plane on top. Both projects accept community contributions. Ragas is the smaller library surface for a narrow RAG-only use case; ai-evaluation is modular with explicit extras ([nli], [celery], [ray], [temporal]) so you can install just the local metrics in a notebook or pull in the distributed runners when the workload grows.
Does Future AGI cover the canonical Ragas RAG metrics?
Yes, under different names. Faithfulness maps to Groundedness (eval_id 47) and FactualAccuracy (66). Answer Relevancy maps to Completeness (10) and ContextRelevance (9). Context Precision and Context Recall map to ContextAdherence (5) plus the chunk-level rubrics ChunkAttribution (11) and ChunkUtilization (12). Context Entity Recall, Noise Sensitivity, and the rag_score composite ship as local NLI-backed metrics. The DeBERTa entailment backbone behind the local faithfulness and claim_support metrics is the same family that powers Ragas's deterministic checks.
How does cost-per-eval compare?
Ragas runs an LLM judge for every canonical metric by default; per-eval cost scales with the judge model (typically gpt-4o-mini or gpt-4o) and gets material at 100K-plus evals per day. Future AGI ships a three-tier cascade: heuristic and NLI-backed local metrics run sub-millisecond with no API call, the in-house Turing classifier family scores cloud templates at lower per-eval cost than Galileo Luna-2, and the LLM judge is reserved for the tail. The augment=True flag runs the local heuristic first and only escalates to the judge on uncertain cases.
Can I run Ragas and Future AGI together?
Yes, this is the common 2026 pattern. Keep Ragas in CI for the canonical RAG metric names that your team already understands and that match the academic literature. Drop traceAI into the application code to capture OpenInference spans across 50-plus framework surfaces. Use ai-evaluation for agent trajectory, structured-output, and tool-calling evals that Ragas does not cover. Wire agent-opt to the captured dataset for prompt optimization. Run Agent Command Center as the routing and guardrail layer. The two projects do not duplicate instrumentation; they sit at different layers of the stack.
When is Ragas the better choice over Future AGI?
Three cases. First, a Python team prototyping a pure-RAG pipeline that wants the canonical naming familiar from the academic literature. Second, a research workload running a benchmark sweep in a single-process notebook where distributed runners and runtime guardrails would be overhead. Third, a project on a tight dependency budget that wants the smallest possible library footprint with no platform tie-in. For these workloads Ragas is genuinely cleaner; the breadth of ai-evaluation is overkill and the cloud templates are unused.
Does Ragas have an LLM gateway or runtime guardrails?
No. Ragas is an evaluation library, not a runtime layer. There is no inline guardrail at the request boundary, no router or fallback, no PII scanner on the production path. Future AGI ships the Agent Command Center gateway (20-plus providers, 5-level budgets, 33 guardrail scanners) plus Protect (four Gemma 3n LoRA adapters at 65 ms text and 107 ms image median time-to-label per arxiv.org/abs/2510.13351) so policy violations are caught synchronously before the model call returns. If you need either of those in the same stack as your eval, the comparison ends at Future AGI.