Evaluating LlamaIndex RAG Applications in 2026

LlamaIndex RAG eval is not generic RAG eval. Four layers, four rubrics: retriever, query-pipeline, agent tool calls, and the traceAI bridge to production.

April 17, 2026

Updated May 20, 2026

12 min read

llamaindex rag llm-evaluation llm-observability traceAI 2026

LlamaIndex makes RAG easy to build and hard to debug. Four lines of Python wire a VectorStoreIndex to a RetrieverQueryEngine, and the first hundred queries pass. Then a user asks something the router was never tested on, a sub-question engine retrieves the wrong namespace, the tree_summarize synthesizer drops the citation that mattered, and the answer ships back confident and wrong. Generic RAG eval reports a faithfulness drop. It does not tell you which primitive broke.

LlamaIndex is not one RAG pattern. It is a composition kit, and every primitive emits its own span shape and its own failure mode. The opinion this post earns: LlamaIndex eval is not generic RAG eval. The right rubric set has four layers (retrieval-stage, query-pipeline, agent-mode, and the traceAI bridge to production) and the built-in Evaluator rubrics ship the floor, not the ceiling.

This guide walks the four layers, the Future AGI templates that map to each, the traceAI instrumentation that connects offline rubrics to live spans, and the closed loop that turns production failures back into eval cases. The code is shaped against the ai-evaluation SDK and uses the exact EvalTemplate IDs the SDK ships.

TL;DR: the four-layer LlamaIndex rubric set

Layer	LlamaIndex surface	Rubric set
Retrieval-stage	`VectorStoreIndex`, `KeywordTableIndex`, `BM25Retriever`, custom retrievers	Recall@k, MRR, `ContextRelevance`, `ChunkAttribution`, `ChunkUtilization`
Query-pipeline	`SubQuestionQueryEngine`, `RouterQueryEngine`, `RecursiveRetriever`	Router accuracy (`CustomLLMJudge`), decomposition quality (`CustomLLMJudge`), per-branch `ContextRelevance` + `Groundedness`, merged-answer `Completeness`
Synthesizer	`compact`, `refine`, `tree_summarize` `ResponseSynthesizer` modes	`Groundedness`, `ContextAdherence`, `Completeness`, `AnswerRefusal`, citation validity (deterministic)
Agent-mode	`OpenAIAgent`, `ReActAgent`, `AgentRunner`, tool-using flows	`EvaluateFunctionCalling`, `TaskCompletion`, per-step `Groundedness`
Bridge	traceAI `LlamaIndexInstrumentor`	Same rubrics as `EvalTag` span-attached scorers; `RETRIEVER`, `LLM`, `CHAIN`, `AGENT` span kinds

The built-in llama_index.core.evaluation.FaithfulnessEvaluator and RelevancyEvaluator ship a useful floor for prototyping. They do not separate retrieval from synthesis, they do not understand router decisions, and they do not score tool calls. The four-layer set above does, and it runs in CI and as production guardrails on the same Evaluator API.

Why generic RAG eval misses LlamaIndex-specific failure modes

A flat retrieve-then-synthesize pipeline has two surfaces, and the canonical rubric split covered in our RAG evaluation metrics deep dive handles it cleanly. LlamaIndex is rarely flat. The interesting apps compose, and four failure modes appear that single-number quality scores cannot see.

A RouterQueryEngine over three sub-engines makes a decision before retrieval runs. If the router picks the vector engine when the query needs the SQL engine, ContextRelevance is still high (the chunks are relevant to whatever the vector engine retrieved), Groundedness is high (the synthesizer grounded faithfully), and the user got the wrong answer because the wrong engine ran.

A SubQuestionQueryEngine decomposes a complex query into pieces, runs each through its own engine, and synthesizes. The classic failure is dropping a constraint: the user asked for X under condition Y, the decomposition asked about X and about Y separately, and the final synthesis lost the conjunction. Per-sub-question rubrics pass. The final answer is wrong.

A tree_summarize synthesizer behaves nothing like compact or refine. Long contexts that look fine on compact lose citations on tree_summarize. A metric that scores only the final answer cannot tell you which mode degraded after Tuesday’s deploy.

An OpenAIAgent or ReActAgent calls tools in sequence. A wrong tool call halfway through produces a downstream answer that is grounded (in the wrong evidence) and complete (the wrong task is finished). Generic RAG eval has no concept of tool-call correctness.

Each failure becomes a five-minute bisect when the rubric set is tagged by LlamaIndex primitive.

Layer 1: retrieval-stage eval

Retrieval-stage rubrics score the chunks the retriever surfaced, independent of what the synthesizer wrote. They are the upstream signal. If retrieval regressed, every generation rubric will follow, and the bisect should start here.
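
The "start upstream" rule can be sketched as a small helper that walks the layers in pipeline order and reports the first one whose score fell. This is a minimal sketch: the layer labels, the score dictionaries, and the 0.05 tolerance are illustrative assumptions, not SDK identifiers or recommended values.

```python
from typing import Optional

# Pipeline order, upstream first. These labels are illustrative, not SDK names.
LAYER_ORDER = ["retrieval", "query_pipeline", "synthesizer", "agent"]


def first_regressed_layer(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> Optional[str]:
    """Return the most upstream layer whose score dropped by more than `tolerance`."""
    for layer in LAYER_ORDER:
        if layer not in baseline or layer not in current:
            continue  # layer was not scored in one of the two runs
        if baseline[layer] - current[layer] > tolerance:
            return layer
    return None
```

If both the retrieval and synthesizer scores fell, the helper reports retrieval, which is where the bisect should begin.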

Three Future AGI templates do the work. ContextRelevance (eval_id 9) scores whether each retrieved chunk is relevant to the query, catching the classic “asked about Section 12, got Section 9” failure when vector similarity surfaces a lexical neighbour with the wrong meaning. ChunkAttribution (eval_id 11) scores whether each cited chunk actually supports the cited claim, catching fabricated citations and citation misalignment. ChunkUtilization (eval_id 12) scores how many of the retrieved chunks the synthesizer actually used, surfacing over-fetch where top-10 retrieval feeds the synthesizer eight chunks it ignores.

Pair these with the IR-style rubrics on a labelled probe set: Recall@k, MRR, nDCG. The local python/fi/evals/metrics/rag/retrieval/ package ships recall_at_k, precision_at_k, mrr, ndcg, context_recall, context_precision, and context_entity_recall as deterministic metrics that run with no API call.

from fi.evals import Evaluator
from fi.evals.templates import ContextRelevance, ChunkAttribution, ChunkUtilization
from fi.testcases import TestCase

ev = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

def score_retrieval(query: str, response, k: int = 10):
    retrieved = [n.node.get_content() for n in response.source_nodes[:k]]
    tc = TestCase(input=query, output=str(response), context=retrieved)
    return ev.evaluate(
        eval_templates=[ContextRelevance(), ChunkAttribution(), ChunkUtilization()],
        inputs=[tc],
    ).eval_results[0]
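
For the IR-style rubrics, it helps to see how little code the deterministic metrics need. The functions below are standalone textbook versions written for this post, assuming binary relevance labels and string document IDs; they are not the SDK's implementations, whose signatures may differ.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs) -> float:
    """MRR over an iterable of (ranked_ids, relevant_ids) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of this ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Feed these the node IDs from `response.source_nodes` and the labelled relevant IDs from the probe set, one query at a time.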

Two operating rules separate a working retrieval gate from theatre. Score per retriever, not per app. A VectorStoreIndex and a BM25Retriever have different failure shapes; tagging the score by retriever lets a regression point at the retriever that moved. Set a per-rubric floor. ChunkUtilization below 30 percent says you are over-fetching; add a reranker before generation or drop similarity_top_k. The best rerankers for RAG (2026) walks the reranker side of the same decision.
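
Both rules fit in a few lines once scores are keyed by retriever. The sketch below assumes scores arrive as plain floats between 0 and 1 in a nested dictionary; the rubric keys and the 0.70 floor for context relevance are placeholders, and only the 0.30 ChunkUtilization floor comes from the rule above.

```python
# Illustrative floors. Only the 0.30 chunk_utilization value is taken from the text;
# the 0.70 context_relevance floor is a placeholder to tune per application.
FLOORS = {"context_relevance": 0.70, "chunk_utilization": 0.30}


def floor_violations(scores_by_retriever, floors=FLOORS):
    """Return (retriever, rubric, score, floor) for every score below its floor."""
    violations = []
    for retriever, scores in sorted(scores_by_retriever.items()):
        for rubric, floor in floors.items():
            score = scores.get(rubric)
            if score is not None and score < floor:
                violations.append((retriever, rubric, score, floor))
    return violations


# Hypothetical scores for two retrievers in the same app.
scores = {
    "vector": {"context_relevance": 0.82, "chunk_utilization": 0.22},
    "bm25": {"context_relevance": 0.75, "chunk_utilization": 0.41},
}
violations = floor_violations(scores)
```

Here only the vector retriever trips the ChunkUtilization floor, so the over-fetch fix targets that retriever rather than the whole app.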

Layer 2: query-pipeline eval

LlamaIndex’s compositional engines fail in ways the per-stage retrieval and generation rubrics cannot see. Two failure modes dominate: routing errors and decomposition errors. Neither has a built-in template, and both write cleanly as CustomLLMJudge rubrics that run through the same Evaluator pipeline.

Router accuracy is the rubric that tells a router bug apart from a downstream retriever bug. Given the user query, the candidate sub-engine names with their descriptions, and the routed-to choice, did the router pick the best engine? Score it separately from answer quality; a perfectly answered query routed to the wrong engine is luck, and the next harder query will fail.

from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge

router_rubric = CustomLLMJudge(
    name="router_accuracy",
    rubric="""You are scoring a LlamaIndex RouterQueryEngine decision.

QUERY: {query}

CANDIDATES:
{candidates}

ROUTER PICKED: {selected}

Score 1.0 if the choice is clearly correct, 0.5 if defensible,
0.0 if a different sub-engine would have been clearly better.
Reason in one sentence.""",
    model="gpt-4.1",
)
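
The rubric's three placeholders need values from the router's decision. A minimal sketch of building them follows, assuming the placeholders are filled by ordinary string formatting; how `CustomLLMJudge` actually substitutes them is not shown here, and the engine names and descriptions are hypothetical.

```python
def build_router_inputs(query: str, candidates: dict[str, str], selected: str) -> dict[str, str]:
    """Build values for the {query}, {candidates}, and {selected} placeholders."""
    if selected not in candidates:
        raise ValueError(f"router picked unknown engine: {selected!r}")
    listing = "\n".join(f"- {name}: {desc}" for name, desc in candidates.items())
    return {"query": query, "candidates": listing, "selected": selected}


# Hypothetical engines, for illustration only.
inputs = build_router_inputs(
    query="What was Q3 revenue by region?",
    candidates={
        "vector_engine": "Semantic search over product documentation",
        "sql_engine": "Structured queries over the finance warehouse",
    },
    selected="vector_engine",
)

# Preview of the decision block the judge would read.
preview = "QUERY: {query}\n\nCANDIDATES:\n{candidates}\n\nROUTER PICKED: {selected}".format(**inputs)
```

Including each candidate's description matters: the judge can only call a routing decision wrong if it sees what the alternatives were for.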

Sub-question decomposition quality is the second rubric. Given the original query and the sub-questions the SubQuestionQueryEngine generated, did the decomposition cover the original query without leaving constraints behind? The most common failure: the original asked for X under condition Y, the decomposition asked about X and Y separately, and the merged synthesis lost the conjunction. A rubric that scores decomposition fidelity catches it before the synthesis covers for it.

For a SubQuestionQueryEngine, score every sub-question on ContextRelevance and Groundedness independently, then score the final synthesis on Completeness against the union of all sub-answers. A drop in per-sub-question scores with stable final-answer scores tells you a specific branch regressed. A drop in final-answer Completeness with stable per-sub-question scores tells you the merge step is dropping content. Same data, two different bugs.
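
The "same data, two different bugs" reading can be written down as a small classifier. This sketch assumes one aggregated score per branch and one final Completeness score per run, all floats between 0 and 1; the 0.05 tolerance and the verdict labels are illustrative.

```python
def diagnose_subquestion_regression(
    baseline_branch, current_branch, baseline_final, current_final, tolerance=0.05
):
    """Classify a regression from per-branch scores and final-answer Completeness.

    Branch arguments are dicts mapping a branch name to one score.
    Returns (verdict, regressed_branch_names).
    """
    regressed = sorted(
        name
        for name, base in baseline_branch.items()
        if base - current_branch.get(name, base) > tolerance
    )
    final_dropped = baseline_final - current_final > tolerance
    if regressed and not final_dropped:
        return "branch_regressed", regressed
    if final_dropped and not regressed:
        return "merge_step_dropping_content", []
    if regressed and final_dropped:
        return "branch_and_final_regressed", regressed
    return "stable", []
```

The third verdict covers the case the paragraph leaves implicit: when both signals drop, fix the named branches first and re-run before blaming the merge step.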

For RecursiveRetriever flows, score each recursion level with ContextRelevance and the final synthesis with Completeness against the union of recursively retrieved context. Track recursion depth as a span attribute and alarm on excessive depth; a runaway recursion pays token cost at every level.

Layer 3: agent-mode eval

OpenAIAgent, ReActAgent, and the tool-using flows under AgentRunner add two failure modes RAG rubrics cannot see. The agent calls the wrong tool, the downstream answer is grounded in the wrong evidence, and Groundedness passes. The agent calls every tool correctly but loops, refuses, or quits early, and per-step rubrics pass while the user gets nothing. Two SDK templates exist exactly for this.

EvaluateFunctionCalling (eval_id 98, exported as LLMFunctionCalling) scores whether each tool call was the right call with the right arguments at that step of the trajectory. TaskCompletion (eval_id 99) scores whether the agent finished the user’s task end-to-end. Both treat the agent trajectory as the unit of evaluation, not a single LLM turn.

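from fi.evals import Evaluator
from fi.evals.templates import EvaluateFunctionCalling, TaskCompletion, Groundedness
from fi.testcases import TestCase

ev = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

def score_agent(query: str, agent_response):
    # extract_tool_spans is not defined in this snippet; it is assumed to be your
    # own helper that yields one object per tool call, exposing .tool_call and
    # .tool_args.
    trajectory = []
    for span in extract_tool_spans(agent_response):
        trajectory.append(TestCase(
            input=query,
            output=span.tool_call,
            context=span.tool_args,
        ))
    final = TestCase(input=query, output=str(agent_response))

    # Per-step rubric: was each tool call the right call with the right arguments?
    tool_scores = ev.evaluate(
        eval_templates=[EvaluateFunctionCalling()],
        inputs=trajectory,
    )
    # Trajectory-level rubrics, scored once on the final answer.
    task_score = ev.evaluate(
        eval_templates=[TaskCompletion(), Groundedness()],
        inputs=[final],
    )
    return tool_scores, task_score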

The pairing matters. EvaluateFunctionCalling catches the agent picking the wrong tool. TaskCompletion catches the agent picking every right tool and still failing the user. Run both. Score per agent turn in CI, then score live trajectories as EvalTag span-attached scorers via traceAI. For the broader agent-evaluation pattern outside LlamaIndex specifically, the agent evaluation frameworks 2026 post covers tool-trajectory rubrics across frameworks.
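
Reading the two rubrics together can be sketched as a small triage function. It assumes the per-step tool-call scores and the task-completion score have already been reduced to floats between 0 and 1; the 0.5 threshold and the verdict labels are illustrative, not SDK output.

```python
def classify_agent_run(tool_scores, task_score, threshold=0.5):
    """Label one trajectory from per-step tool-call scores and a task-completion score.

    Scores are assumed to be floats in [0, 1]; the 0.5 threshold is illustrative.
    """
    bad_steps = [i for i, score in enumerate(tool_scores) if score < threshold]
    task_ok = task_score >= threshold
    if bad_steps:
        # A wrong call taints everything downstream, so report the earliest bad step.
        return {"verdict": "wrong_tool_call", "first_bad_step": bad_steps[0]}
    if not task_ok:
        return {"verdict": "tools_ok_task_failed", "first_bad_step": None}
    return {"verdict": "pass", "first_bad_step": None}
```

A run where every tool call passes but the task fails is the loop, refusal, or early-quit case, and it only shows up because both rubrics were scored.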

For multi-document agent flows (a document agent per source with an orchestrator on top), evaluate each document agent on the base RAG rubric set, then add a CustomLLMJudge rubric for the orchestrator’s routing and merging decisions. The same pattern that works for RouterQueryEngine works here.

Layer 4: the traceAI bridge to production

CI catches the regressions you can think of. Production catches the rest. The same four-layer rubric set should run as span-attached scorers against live LlamaIndex traces, and that requires a tracer that understands LlamaIndex’s primitive boundaries.

traceAI (Apache 2.0) ships a LlamaIndexInstrumentor that hooks LlamaIndex’s dispatcher and emits OpenTelemetry spans for every QueryEngine, Retriever, ResponseSynthesizer, and agent turn. Each span carries fi.span.kind set to RETRIEVER, LLM, CHAIN, or AGENT, the rendered prompt, the retrieved nodes with similarity scores, and the response. No manual span creation.

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_llamaindex import LlamaIndexInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="llamaindex_rag_prod",
)
LlamaIndexInstrumentor().instrument(tracer_provider=trace_provider)

After that one call, every QueryEngine.query() produces a span tree that maps directly to the LlamaIndex composition. A RouterQueryEngine produces a router-decision span with the selected engine in metadata and the chosen sub-engine’s full span tree underneath. A SubQuestionQueryEngine produces one child span per sub-question with its own retriever and LLM children, plus a final synthesis LLM span. An OpenAIAgent produces one child span per tool call (fi.span.kind=AGENT) and a final LLM span.

The pluggable semantic conventions matter. register() accepts a semantic_conventions argument with FI, OTEL_GENAI, OPENINFERENCE, or OPENLLMETRY; the same instrumentation ingests into Phoenix or Traceloop without re-instrumenting. For the OpenTelemetry plumbing end-to-end, our guide on instrumenting your AI agent with traceAI covers the broader pattern.

Attach the same Evaluator rubrics as span-attached scorers via EvalTag; the verdict lives on the trace next to latency, model, and chunk IDs. Sample 5 to 10 percent of production traffic for LLM-judge rubrics. Run citation validity and ChunkUtilization on 100 percent because they are cheap. Alarm on a 2 to 5 point sustained drop in rolling-mean per rubric per route over 30 to 90 minutes.
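
The sampling and alarm rules are simple enough to sketch without any tracing library. The version below makes several assumptions: trace IDs are strings, scores sit on a 0-to-1 scale (so a 3-point drop is 0.03), and the window is counted in observations rather than minutes. A production version would key one alarm per rubric per route and use a time-based window.

```python
import hashlib
from collections import deque


def should_judge(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically sample a fraction of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


class RollingDropAlarm:
    """Fire when the rolling mean stays `drop` or more below baseline for `sustain` checks."""

    def __init__(self, baseline, window=200, drop=0.03, sustain=3):
        self.baseline = baseline
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically
        self.drop = drop
        self.sustain = sustain
        self._breaches = 0

    def observe(self, score: float) -> bool:
        """Record one score and return True if the alarm should fire."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        if self.baseline - mean >= self.drop:
            self._breaches += 1
        else:
            self._breaches = 0  # a recovery resets the sustained-drop counter
        return self._breaches >= self.sustain
```

Hash-based sampling keeps the decision stable per trace, so a retried request is not judged twice on one attempt and skipped on the next.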

Drift between offline pass and online drop is itself a quality signal. Track per-rubric delta between CI baseline and production rolling mean; the gap tells you how representative your eval set is.
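
The delta itself is a one-liner per rubric. A minimal sketch, assuming both sides are dictionaries of mean scores keyed by rubric name; the rubric names and numbers in the usage line are made up.

```python
def eval_set_drift(ci_baseline, prod_rolling_mean):
    """Per-rubric gap: CI baseline minus production rolling mean, for shared rubrics."""
    shared = ci_baseline.keys() & prod_rolling_mean.keys()
    return {
        rubric: round(ci_baseline[rubric] - prod_rolling_mean[rubric], 4)
        for rubric in sorted(shared)
    }


# Hypothetical numbers: groundedness looks far better offline than in production.
drift = eval_set_drift(
    {"groundedness": 0.92, "context_relevance": 0.88},
    {"groundedness": 0.81, "context_relevance": 0.87, "completeness": 0.70},
)
```

A large positive gap on one rubric suggests the eval set under-samples the queries that break that rubric in production.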

Guardrails for safety-critical retrieval

Some LlamaIndex apps retrieve over sensitive data (medical records, legal contracts, financial filings) and the same rubrics that score offline have to gate at request time. The Guardrails API takes the same templates as the offline Evaluator and runs them as a request-time gate with a configurable aggregation strategy.

from fi.evals import Guardrails
from fi.evals.types import RailType, AggregationStrategy
from fi.evals.templates import Groundedness, ContextRelevance

retrieval_rail = Guardrails(
    rail_type=RailType.RETRIEVAL,
    eval_templates=[Groundedness(), ContextRelevance()],
    aggregation=AggregationStrategy.MAJORITY,
)

verdict = retrieval_rail.check(
    query=user_query,
    context=retrieved_nodes,
    response=draft_answer,
)
if not verdict.passed:
    answer = fallback_response(user_query)

AggregationStrategy.MAJORITY requires more than half the templates to pass. AggregationStrategy.ALL is the strict path for high-stakes domains. Thirteen guardrail backends sit behind the API (nine open-weight, four API). For the broader guardrails landscape, AI compliance guardrails for enterprise LLMs covers the model choices.
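
The two strategies are easy to compare in plain Python. This sketch mirrors only the semantics described above ("more than half" versus "all") and is not the SDK's implementation; the string strategy names stand in for the enum members.

```python
def aggregate(passed: list[bool], strategy: str) -> bool:
    """Combine per-template pass/fail results under the described semantics."""
    if not passed:
        raise ValueError("at least one template result is required")
    if strategy == "ALL":
        return all(passed)
    if strategy == "MAJORITY":
        return sum(passed) > len(passed) / 2  # strictly more than half
    raise ValueError(f"unknown strategy: {strategy}")
```

One consequence worth noticing: with exactly two templates, as in the rail above, "more than half" means both must pass, so MAJORITY and ALL behave identically. The strategies only diverge from three templates upward.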

Closing the loop with Error Feed

The loop is what makes a LlamaIndex eval system compound. Without it, every incident produces a one-off fix and the team writes the same regression twice next quarter.

Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every failing trace into a named issue. A Claude Sonnet 4.5 Judge agent on Bedrock then reads the failing trace and writes the root-cause analysis (RCA), evidence quotes, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, each scored 1 to 5). The Judge runs with a 30-turn budget, eight span-tools, a Haiku Chauffeur for spans over 3000 characters, and a prompt-cache hit ratio near 90 percent.

For LlamaIndex apps the cluster names look like:

“Router picks the vector engine when the query needs the SQL engine”
“Tree-summarize synthesizer drops the key citation on long contexts”
“Sub-question decomposition loses a conjunction constraint”
“Recursive retriever runs past depth 4 on entity-graph queries”
“Compact synthesizer truncates the last retrieved node on token-budget cap”

Each named cluster ships as a Linear issue today (Slack, GitHub, Jira, and PagerDuty are on the roadmap). Two patterns close the loop. The immediate_fix feeds the Platform’s self-improving evaluators so the rubric ages with your product. Representative traces from each cluster promote into the eval set under engineer sign-off; the next PR touching the offending primitive has to clear the new entries. The dataset ratchets stronger every week; the four-layer CI gate catches more LlamaIndex-specific regressions every quarter. For the full closed-loop pattern across CI and prod, evaluate RAG applications in CI/CD walks the regression-gate side of the same workflow.

Three deliberate tradeoffs

Four layers cost more eval budget than one. Scoring retrieval, query-pipeline, synthesizer, and agent rubrics on every CI run is more LLM-judge calls than a single faithfulness score on the final answer. The payoff is debuggable regressions and a bisect that takes minutes. Future AGI’s classifier-backed evals on the Platform run at lower per-eval cost than Galileo Luna-2, which makes weekly full-dataset reruns the default rather than the exception.
CustomLLMJudge rubrics for router and decomposition add maintenance. That is two extra rubrics to keep calibrated as the router prompt and the decomposition prompt evolve. The payoff is the only signal that separates a routing bug from a retriever bug. New deployments without a router can skip this layer; the moment you add a RouterQueryEngine, write the rubric the same week.
AggregationStrategy.ALL guardrails cost latency. Strict aggregation runs every template before the answer ships, which adds 100 to 400 ms depending on rubric mix. Worth it for compliance-sensitive paths. Optional for casual workloads where MAJORITY is the right operating point.

How Future AGI ships this

Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals on the four LlamaIndex layers. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent against your live traces.

ai-evaluation SDK (Apache 2.0): from fi.evals import Evaluator. RAG layer: ContextRelevance (9), ChunkAttribution (11), ChunkUtilization (12), Groundedness (47), ContextAdherence (5), Completeness (10), AnswerRefusal (88), FactualAccuracy (66). Agent layer: EvaluateFunctionCalling (98), TaskCompletion (99). Plus 50+ total templates, CustomLLMJudge for router and decomposition rubrics, 13 guardrail backends (9 open-weight, 4 API), 8 sub-10ms Scanners, four distributed runners (Celery, Ray, Temporal, Kubernetes).
traceAI (Apache 2.0): LlamaIndexInstrumentor().instrument(...) covers every major query engine, retriever, synthesizer, and agent. 50+ AI surfaces across Python, TypeScript, Java, C#. Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY). 14 span kinds including first-class RETRIEVER, RERANKER, EMBEDDING, AGENT. 62 built-in evals via EvalTag.
Future AGI Platform: self-improving evaluators tuned by thumbs-up and thumbs-down feedback; in-product authoring agent generates LlamaIndex-specific rubrics (router accuracy, decomposition fidelity, synthesizer-mode appropriateness) from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
Error Feed (inside the eval stack): HDBSCAN clustering on ClickHouse-stored span embeddings groups failing traces; Sonnet 4.5 Judge writes the immediate_fix; representative traces promote into the eval set. Linear OAuth wired today; Slack, GitHub, Jira, PagerDuty on the roadmap.
agent-opt (Apache 2.0): six optimisers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) for tuning synthesizer, router, and decomposition prompts against the rubric set above.
Agent Command Center: 17 MB Go binary self-hosts in your VPC for the LLM calls underneath every LlamaIndex stage. 100+ providers, 18+ built-in guardrail scanners, exact + semantic caching, SOC 2 Type II, HIPAA, GDPR, CCPA certified (ISO/IEC 27001 in active audit).

Ready to evaluate your first LlamaIndex app? Wire ContextRelevance, Groundedness, ChunkAttribution, and EvaluateFunctionCalling into a pytest fixture this afternoon against the ai-evaluation SDK, then add LlamaIndexInstrumentor when production traces start asking questions the CI gate missed. Build it once, refresh the golden set weekly, and the four-layer eval loop pays for itself the first time it catches the regression that would have shipped.

Frequently asked questions

Why does LlamaIndex RAG need a different evaluation approach from generic RAG eval?

Because LlamaIndex composes RAG out of distinct primitives that each fail in their own way. A flat `VectorStoreIndex` plus `RetrieverQueryEngine` behaves nothing like a `SubQuestionQueryEngine` over multiple indices, and a `RouterQueryEngine` adds a routing decision on top of everything. Generic RAG eval reports one faithfulness score on the final answer and tells you nothing about which stage broke. A regression in a Tuesday deploy could be the chunker, the embedder, a router prompt, a sub-question decomposition prompt, a synthesizer mode change, or an agent tool-call edit. LlamaIndex-aware eval splits the rubric set by primitive so each score points back at the span that produced it. That is the difference between a five-minute bisect and three days of swapping models.

What are the four eval layers for a LlamaIndex application?

Retrieval-stage eval scores the chunks the index surfaced against the query: Recall@k, MRR, ContextRelevance, ChunkAttribution, ChunkUtilization. Query-pipeline eval scores compositional engines: sub-question decomposition quality on `SubQuestionQueryEngine`, router accuracy on `RouterQueryEngine`, and merge correctness when multiple indices feed one synthesis. Agent-mode eval scores `OpenAIAgent`, `ReActAgent`, and other tool-using flows: `EvaluateFunctionCalling` for tool-call correctness, `TaskCompletion` for end-to-end task success. The fourth layer is the bridge: traceAI emits OpenTelemetry spans for every LlamaIndex primitive and the same rubrics run as span-attached scorers on live traffic via `EvalTag`. Skip any layer and you have a blind spot a user will eventually file a ticket about.

Which Future AGI templates map to which LlamaIndex stages?

Retrieval stage: `ContextRelevance` (eval_id 9), `ChunkAttribution` (11), `ChunkUtilization` (12). Synthesizer stage: `Groundedness` (47), `ContextAdherence` (5), `Completeness` (10), `AnswerRefusal` (88). Cross-cutting: `FactualAccuracy` (66), citation validity as a deterministic string check. Agent stage: `EvaluateFunctionCalling` (98) for tool-call correctness, `TaskCompletion` (99) for whether the agent finished the user's task. Router accuracy and sub-question decomposition quality do not have built-in templates because they are LlamaIndex-specific; write them with `CustomLLMJudge` and run them through the same `Evaluator` pipeline.
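
Citation validity is the one rubric in that list with no template behind it, so here is one possible shape for the check. It assumes the synthesizer cites chunks with 1-based bracketed markers such as [2]; apps with a different citation format would need a different pattern.

```python
import re

# Matches numeric markers like [1] or [12]. The format is an assumption.
CITATION = re.compile(r"\[(\d+)\]")


def citation_validity(answer: str, num_chunks: int) -> dict:
    """Check that every [n] marker in the answer points at a retrieved chunk (1-based)."""
    cited = [int(m) for m in CITATION.findall(answer)]
    invalid = sorted({n for n in cited if n < 1 or n > num_chunks})
    return {
        "has_citations": bool(cited),
        "invalid": invalid,
        "valid": bool(cited) and not invalid,
    }
```

The check only proves a marker points at a chunk that exists. Whether that chunk supports the claim is the job of `ChunkAttribution`.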

How does traceAI instrument LlamaIndex?

One line: `LlamaIndexInstrumentor().instrument(tracer_provider=trace_provider)`. The instrumentor hooks LlamaIndex's dispatcher and emits OpenTelemetry spans for every `QueryEngine`, `Retriever`, and `ResponseSynthesizer` call, with `fi.span.kind` set to `RETRIEVER`, `LLM`, `CHAIN`, or `AGENT` as appropriate. Spans carry the query, retrieved nodes with similarity scores, the synthesizer prompt, the model call, and the final response. The same instrumentor covers `RetrieverQueryEngine`, `SubQuestionQueryEngine`, `RouterQueryEngine`, and the OpenAI agent variants. Pluggable semantic conventions at `register()` time (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) mean the same spans ingest into Phoenix or Traceloop without re-instrumenting.

How big should the LlamaIndex golden set be?

Start at 200 to 500 query and expected-answer pairs per query engine, sampled from production traces once instrumentation is live. For a `RouterQueryEngine` that fans to three sub-engines, budget 150 pairs per branch plus 50 pairs that explicitly probe the routing decision (queries near the decision boundary, queries that look like one engine but need another). For `SubQuestionQueryEngine`, include 50 to 100 deliberately conjunctive queries (X under condition Y, multi-hop joins) because the most common decomposition failure is dropping a constraint. Cover happy-path queries, the hardest 10 percent of historical failures, and at least three edge cases unique to the LlamaIndex primitive in question. Promote failing production traces weekly through Error Feed.
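
As a quick check on the router budget, the arithmetic can be written out. The function name is made up for illustration; the defaults are the per-branch and routing-probe figures from the answer above.

```python
def golden_set_budget(num_branches: int, per_branch: int = 150, routing_probes: int = 50) -> int:
    """Total query/answer pairs for a router that fans out to `num_branches` engines."""
    return num_branches * per_branch + routing_probes


three_engine_total = golden_set_budget(3)  # 3 * 150 + 50
```

For three sub-engines that comes to 500 pairs, the top of the 200 to 500 starting range.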

How do I evaluate router and sub-question query engines specifically?

Router accuracy is a `CustomLLMJudge` rubric that takes the user query, the candidate sub-engine names with descriptions, and the routed-to choice, then scores whether the router picked the best engine. Score it separately from downstream answer quality; otherwise a perfectly answered query routed to the wrong engine masks a bug that will fail the next harder query. Sub-question decomposition is two rubrics: one that scores whether the decomposition covered the original query without dropping a constraint, and one that runs `ContextRelevance` and `Groundedness` per sub-question with the final synthesis scored against the union of all sub-answers with `Completeness`. Both rubrics drop into the same `Evaluator` call as the standard templates.

What does the agent-mode eval look like for LlamaIndex agents?

Two rubrics on top of the standard RAG set. `EvaluateFunctionCalling` (eval_id 98, also exported as `LLMFunctionCalling`) scores whether each tool call was the right call with the right arguments at that step of the trajectory. `TaskCompletion` (eval_id 99) scores whether the agent finished the user's task end-to-end. Together they catch the two agent-specific failure modes generic RAG eval cannot see: an agent that calls the wrong tool then synthesizes a confident answer (`EvaluateFunctionCalling` flags it, `Groundedness` does not), and an agent that calls every tool correctly but loops, refuses, or quits early (`TaskCompletion` flags it, per-step rubrics do not). Run both per agent turn in CI, then score live trajectories via `EvalTag` once traceAI emits the spans.

What does Future AGI ship for LlamaIndex eval today, and what is roadmap?

Shipping today: traceAI's `LlamaIndexInstrumentor` (Apache 2.0) covering every major query engine and retriever; the `ai-evaluation` SDK (Apache 2.0) with the eight RAG-relevant templates plus `EvaluateFunctionCalling` and `TaskCompletion` for agent flows, 50+ total templates across the catalog, 13 guardrail backends (9 open-weight including LLAMAGUARD_3_8B, QWEN3GUARD_8B, GRANITE_GUARDIAN_8B; 4 API), eight sub-10ms Scanners, four distributed runners (Celery, Ray, Temporal, Kubernetes); `CustomLLMJudge` for router accuracy and decomposition rubrics; Error Feed clustering inside the eval stack via HDBSCAN on ClickHouse-stored span embeddings plus a Sonnet 4.5 Judge that writes the immediate_fix; Linear OAuth wired as the Error Feed sink. Roadmap: Slack, GitHub, Jira, and PagerDuty integrations for Error Feed; the trace-stream-to-agent-opt connector that auto-promotes high-signal production traces into optimization datasets is in flight.
