Evaluating LlamaIndex RAG Applications in 2026
LlamaIndex RAG eval is not generic RAG eval. Four layers, four rubrics: retriever, query-pipeline, agent tool calls, and the traceAI bridge to production.
LlamaIndex makes RAG easy to build and hard to debug. Four lines of Python wire a VectorStoreIndex to a RetrieverQueryEngine, and the first hundred queries pass. Then a user asks something the router was never tested on, a sub-question engine retrieves the wrong namespace, the tree_summarize synthesizer drops the citation that mattered, and the answer ships back confident and wrong. Generic RAG eval reports a faithfulness drop. It does not tell you which primitive broke.
LlamaIndex is not one RAG pattern. It is a composition kit, and every primitive emits its own span shape and its own failure mode. The opinion this post earns: LlamaIndex eval is not generic RAG eval. The right rubric set has four layers (retrieval-stage, query-pipeline, agent-mode, and the traceAI bridge to production) and the built-in Evaluator rubrics ship the floor, not the ceiling.
This guide walks the four layers, the Future AGI templates that map to each, the traceAI instrumentation that connects offline rubrics to live spans, and the closed loop that turns production failures back into eval cases. The code is written against the ai-evaluation SDK and uses the EvalTemplate IDs the SDK ships.
TL;DR: the four-layer LlamaIndex rubric set
| Layer | LlamaIndex surface | Rubric set |
|---|---|---|
| Retrieval-stage | VectorStoreIndex, KeywordTableIndex, BM25Retriever, custom retrievers | Recall@k, MRR, ContextRelevance, ChunkAttribution, ChunkUtilization |
| Query-pipeline | SubQuestionQueryEngine, RouterQueryEngine, RecursiveRetriever | Router accuracy (CustomLLMJudge), decomposition quality (CustomLLMJudge), per-branch ContextRelevance + Groundedness, merged-answer Completeness |
| Synthesizer | compact, refine, tree_summarize ResponseSynthesizer modes | Groundedness, ContextAdherence, Completeness, AnswerRefusal, citation validity (deterministic) |
| Agent-mode | OpenAIAgent, ReActAgent, AgentRunner, tool-using flows | EvaluateFunctionCalling, TaskCompletion, per-step Groundedness |
| Bridge | traceAI LlamaIndexInstrumentor | Same rubrics as EvalTag span-attached scorers; RETRIEVER, LLM, CHAIN, AGENT span kinds |
The built-in llama_index.core.evaluation.FaithfulnessEvaluator and RelevancyEvaluator ship a useful floor for prototyping. They do not separate retrieval from synthesis, they do not understand router decisions, and they do not score tool calls. The four-layer set above does, and it runs in CI and as production guardrails on the same Evaluator API.
Why generic RAG eval misses LlamaIndex-specific failure modes
A flat retrieve-then-synthesize pipeline has two surfaces, and the canonical rubric split covered in our RAG evaluation metrics deep dive handles it cleanly. LlamaIndex is rarely flat. The interesting apps compose, and four failure modes appear that single-number quality scores cannot see.
A RouterQueryEngine over three sub-engines makes a decision before retrieval runs. If the router picks the vector engine when the query needs the SQL engine, ContextRelevance is still high (the chunks are relevant to whatever the vector engine retrieved), Groundedness is high (the synthesizer grounded faithfully), and the user got the wrong answer because the wrong engine ran.
A SubQuestionQueryEngine decomposes a complex query into pieces, runs each through its own engine, and synthesizes. The classic failure is dropping a constraint: the user asked for X under condition Y, the decomposition asked about X and about Y separately, and the final synthesis lost the conjunction. Per-sub-question rubrics pass. The final answer is wrong.
A tree_summarize synthesizer behaves nothing like compact or refine. Long contexts that look fine on compact lose citations on tree_summarize. A metric that scores only the final answer cannot tell you which mode degraded after Tuesday’s deploy.
An OpenAIAgent or ReActAgent calls tools in sequence. A wrong tool call halfway through produces a downstream answer that is grounded (in the wrong evidence) and complete (the wrong task is finished). Generic RAG eval has no concept of tool-call correctness.
Each failure becomes a five-minute bisect when the rubric set is tagged by LlamaIndex primitive.
Layer 1: retrieval-stage eval
Retrieval-stage rubrics score the chunks the retriever surfaced, independent of what the synthesizer wrote. They are the upstream signal. If retrieval regressed, every generation rubric will follow, and the bisect should start here.
Three Future AGI templates do the work. ContextRelevance (eval_id 9) scores whether each retrieved chunk is relevant to the query, catching the classic “asked about Section 12, got Section 9” failure when vector similarity surfaces a lexical neighbour with the wrong meaning. ChunkAttribution (eval_id 11) scores whether each cited chunk actually supports the cited claim, catching fabricated citations and citation misalignment. ChunkUtilization (eval_id 12) scores how many of the retrieved chunks the synthesizer actually used, surfacing over-fetch where top-10 retrieval feeds the synthesizer eight chunks it ignores.
Pair these with the IR-style rubrics on a labelled probe set: Recall@k, MRR, nDCG. The local python/fi/evals/metrics/rag/retrieval/ package ships recall_at_k, precision_at_k, mrr, ndcg, context_recall, context_precision, and context_entity_recall as deterministic metrics that run with no API call.
```python
from fi.evals import Evaluator
from fi.evals.templates import ContextRelevance, ChunkAttribution, ChunkUtilization
from fi.testcases import TestCase

ev = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

def score_retrieval(query: str, response, k: int = 10):
    # Score only the top-k chunks the retriever surfaced for this response.
    retrieved = [n.node.get_content() for n in response.source_nodes[:k]]
    tc = TestCase(input=query, output=str(response), context=retrieved)
    return ev.evaluate(
        eval_templates=[ContextRelevance(), ChunkAttribution(), ChunkUtilization()],
        inputs=[tc],
    ).eval_results[0]
```
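The IR-style rubrics that pair with these templates are simple enough to sketch in plain Python. The block below is an illustration of recall@k, reciprocal rank, and binary-relevance nDCG over node IDs, not the SDK's implementation; the function names and the assumption that each query has a labelled set of relevant node IDs (with no duplicate IDs in the ranking) are ours.

```python
import math
from typing import Iterable, Sequence, Set, Tuple

def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labelled relevant IDs found in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant ID, or 0.0 if none was retrieved."""
    for rank, node_id in enumerate(retrieved, start=1):
        if node_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR over (retrieved, relevant) pairs, one pair per probe query."""
    ranks = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs]
    return sum(ranks) / len(ranks) if ranks else 0.0

def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary gains: 1 for a relevant ID, 0 otherwise."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, node_id in enumerate(retrieved[:k], start=1)
        if node_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Because these metrics need no judge call, they can run on every commit against the labelled probe set while the LLM-judged templates run on a slower cadence.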
Two operating rules separate a working retrieval gate from theatre. Score per retriever, not per app. A VectorStoreIndex and a BM25Retriever have different failure shapes; tagging the score by retriever lets a regression point at the retriever that moved. Set a per-rubric floor. ChunkUtilization below 30 percent says you are over-fetching; add a reranker before generation or drop similarity_top_k. The best rerankers for RAG (2026) walks the reranker side of the same decision.
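Both rules can be expressed as a small gate over tagged scores. The sketch below assumes scores arrive as (retriever, rubric, score) tuples on a 0 to 1 scale; the tuple shape, the function name, and the 0.70 ContextRelevance floor are hypothetical, while the 0.30 ChunkUtilization floor mirrors the 30 percent figure above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical floors on a 0-1 scale; tune them per application.
FLOORS = {"context_relevance": 0.70, "chunk_utilization": 0.30}

def gate_by_retriever(rows, floors=FLOORS):
    """Return (retriever, rubric, mean_score) for every pair below its floor.

    rows: iterable of (retriever_name, rubric_name, score) tuples.
    """
    grouped = defaultdict(list)
    for retriever, rubric, score in rows:
        grouped[(retriever, rubric)].append(score)

    failures = []
    for (retriever, rubric), scores in sorted(grouped.items()):
        floor = floors.get(rubric)
        if floor is not None and mean(scores) < floor:
            failures.append((retriever, rubric, round(mean(scores), 3)))
    return failures
```

An empty list means the gate passes. A non-empty list names the retriever and rubric that moved, which is the bisect starting point.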
Layer 2: query-pipeline eval
LlamaIndex’s compositional engines fail in ways the per-stage retrieval and generation rubrics cannot see. Two failure modes dominate: routing errors and decomposition errors. Neither has a built-in template, and both write cleanly as CustomLLMJudge rubrics that run through the same Evaluator pipeline.
Router accuracy is the rubric that tells a router bug apart from a downstream retriever bug. Given the user query, the candidate sub-engine names with their descriptions, and the routed-to choice, did the router pick the best engine. Score it separately from answer quality; a perfectly answered query routed to the wrong engine is luck, and the next harder query will fail.
```python
from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge

router_rubric = CustomLLMJudge(
    name="router_accuracy",
    rubric="""You are scoring a LlamaIndex RouterQueryEngine decision.
QUERY: {query}
CANDIDATES:
{candidates}
ROUTER PICKED: {selected}
Score 1.0 if the choice is clearly correct, 0.5 if defensible,
0.0 if a different sub-engine would have been clearly better.
Reason in one sentence.""",
    model="gpt-4.1",
)
```
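The rubric takes three placeholders, so something has to turn a router decision into those fields. A minimal sketch of that step follows, assuming the sub-engine names and descriptions are available as a plain dict; the helper name and the dict shape are hypothetical, and how CustomLLMJudge consumes the fields is not shown here.

```python
# Mirrors the placeholders used in the router rubric; illustration only.
ROUTER_PROMPT = (
    "QUERY: {query}\n"
    "CANDIDATES:\n"
    "{candidates}\n"
    "ROUTER PICKED: {selected}"
)

def build_router_judge_inputs(query: str, engines: dict, selected: str) -> dict:
    """Build the query / candidates / selected fields for the router rubric.

    engines maps each sub-engine name to the description the router saw.
    """
    if selected not in engines:
        raise ValueError(f"router picked unknown engine: {selected!r}")
    candidates = "\n".join(f"- {name}: {desc}" for name, desc in engines.items())
    return {"query": query, "candidates": candidates, "selected": selected}

inputs = build_router_judge_inputs(
    "What was Q3 revenue by region?",
    {"sql": "structured sales tables", "vector": "unstructured policy docs"},
    "sql",
)
rendered = ROUTER_PROMPT.format(**inputs)
```

Passing the judge the same descriptions the router saw matters: the rubric should score the decision the router could have made, not one that depends on information it never had.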
Sub-question decomposition quality is the second rubric. Given the original query and the sub-questions the SubQuestionQueryEngine generated, did the decomposition cover the original query without leaving constraints behind. The most common failure: the original asked for X under condition Y, the decomposition asked about X and Y separately, and the merged synthesis lost the conjunction. A rubric that scores decomposition fidelity catches it before the synthesis covers for it.
For a SubQuestionQueryEngine, score every sub-question on ContextRelevance and Groundedness independently, then score the final synthesis on Completeness against the union of all sub-answers. A drop in per-sub-question scores with stable final-answer scores tells you a specific branch regressed. A drop in final-answer Completeness with stable per-sub-question scores tells you the merge step is dropping content. Same data, two different bugs.
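The two-bug distinction can be made mechanical. The sketch below compares a baseline run with a current run and labels the regression; the dict shape, the branch keys (one per sub-engine), and the 0.05 tolerance are assumptions for illustration.

```python
def diagnose_subquestion_regression(baseline: dict, current: dict, tolerance: float = 0.05):
    """Label a regression as a branch problem or a merge problem.

    Both arguments look like:
        {"branches": {branch_name: score}, "final_completeness": score}
    with scores on a 0-1 scale.
    """
    regressed = sorted(
        name
        for name, score in current["branches"].items()
        if name in baseline["branches"]
        and baseline["branches"][name] - score > tolerance
    )
    final_dropped = (
        baseline["final_completeness"] - current["final_completeness"] > tolerance
    )

    if regressed and not final_dropped:
        return ("branch_regression", regressed)
    if final_dropped and not regressed:
        return ("merge_dropping_content", [])
    if regressed and final_dropped:
        return ("branch_and_final", regressed)
    return ("stable", [])
```

The first label points at a specific sub-engine's retriever or prompt; the second points at the synthesis step that merges the sub-answers.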
For RecursiveRetriever flows, score each recursion level with ContextRelevance and the final synthesis with Completeness against the union of recursively retrieved context. Track recursion depth as a span attribute and alarm on excessive depth; a recursion that never terminates pays token cost on every level.
Layer 3: agent-mode eval
OpenAIAgent, ReActAgent, and the tool-using flows under AgentRunner add two failure modes RAG rubrics cannot see. The agent calls the wrong tool, the downstream answer is grounded in the wrong evidence, and Groundedness passes. The agent calls every tool correctly but loops, refuses, or quits early, and per-step rubrics pass while the user gets nothing. Two SDK templates exist exactly for this.
EvaluateFunctionCalling (eval_id 98, exported as LLMFunctionCalling) scores whether each tool call was the right call with the right arguments at that step of the trajectory. TaskCompletion (eval_id 99) scores whether the agent finished the user’s task end-to-end. Both treat the agent trajectory as the unit of evaluation, not a single LLM turn.
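```python
from fi.evals import Evaluator
from fi.evals.templates import EvaluateFunctionCalling, TaskCompletion, Groundedness
from fi.testcases import TestCase

ev = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

def score_agent(query: str, agent_response):
    # extract_tool_spans is an application-side helper (not shown here) that
    # yields one record per tool call, exposing .tool_call and .tool_args.
    trajectory = []
    for span in extract_tool_spans(agent_response):
        trajectory.append(TestCase(
            input=query,
            output=span.tool_call,
            context=span.tool_args,
        ))
    final = TestCase(input=query, output=str(agent_response))

    # Per-step score: was each tool call the right call with the right arguments?
    tool_scores = ev.evaluate(
        eval_templates=[EvaluateFunctionCalling()],
        inputs=trajectory,
    )
    # Trajectory-level score: did the agent finish the user's task?
    task_score = ev.evaluate(
        eval_templates=[TaskCompletion(), Groundedness()],
        inputs=[final],
    )
    return tool_scores, task_score
```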
The pairing matters. EvaluateFunctionCalling catches the agent picking the wrong tool. TaskCompletion catches the agent picking every right tool and still failing the user. Run both. Score per agent turn in CI, then score live trajectories as EvalTag span-attached scorers via traceAI. For the broader agent-evaluation pattern outside LlamaIndex specifically, the agent evaluation frameworks 2026 post covers tool-trajectory rubrics across frameworks.
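One way to read the two scores together is a small classifier over a scored trajectory. This is a sketch under assumptions: scores are plain floats on a 0 to 1 scale already extracted from the evaluator results, and the function name and both 0.5 thresholds are hypothetical.

```python
from typing import List, Sequence, Tuple

def classify_agent_run(
    tool_step_scores: Sequence[float],
    task_score: float,
    step_floor: float = 0.5,
    task_floor: float = 0.5,
) -> Tuple[str, List[int]]:
    """Separate 'wrong tool call' from 'right tools, task still failed'."""
    # Indices of the trajectory steps whose tool call scored below the floor.
    bad_steps = [i for i, score in enumerate(tool_step_scores) if score < step_floor]
    if bad_steps:
        return ("wrong_tool_call", bad_steps)
    if task_score < task_floor:
        return ("task_incomplete", [])
    return ("ok", [])
```

Returning the failing step indices keeps the verdict actionable: the first bad index is where the trajectory went off course.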
For multi-document agent flows (a document agent per source with an orchestrator on top), evaluate each document agent on the base RAG rubric set, then add a CustomLLMJudge rubric for the orchestrator’s routing and merging decisions. The same pattern that works for RouterQueryEngine works here.
Layer 4: the traceAI bridge to production
CI catches the regressions you can think of. Production catches the rest. The same four-layer rubric set should run as span-attached scorers against live LlamaIndex traces, and that requires a tracer that understands LlamaIndex’s primitive boundaries.
traceAI (Apache 2.0) ships a LlamaIndexInstrumentor that hooks LlamaIndex’s dispatcher and emits OpenTelemetry spans for every QueryEngine, Retriever, ResponseSynthesizer, and agent turn. Each span carries fi.span.kind set to RETRIEVER, LLM, CHAIN, or AGENT, the rendered prompt, the retrieved nodes with similarity scores, and the response. No manual span creation.
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_llamaindex import LlamaIndexInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="llamaindex_rag_prod",
)
LlamaIndexInstrumentor().instrument(tracer_provider=trace_provider)
```
After that one call, every QueryEngine.query() produces a span tree that maps directly to the LlamaIndex composition. A RouterQueryEngine produces a router-decision span with the selected engine in metadata and the chosen sub-engine’s full span tree underneath. A SubQuestionQueryEngine produces one child span per sub-question with its own retriever and LLM children, plus a final synthesis LLM span. An OpenAIAgent produces one child span per tool call (fi.span.kind=AGENT) and a final LLM span.
The pluggable semantic conventions matter. register() accepts a semantic_conventions argument with FI, OTEL_GENAI, OPENINFERENCE, or OPENLLMETRY; the same instrumentation ingests into Phoenix or Traceloop without re-instrumenting. For the OpenTelemetry plumbing end-to-end, our guide on instrumenting your AI agent with traceAI covers the broader pattern.
Attach the same Evaluator rubrics as span-attached scorers via EvalTag; the verdict lives on the trace next to latency, model, and chunk IDs. Sample 5 to 10 percent of production traffic for LLM-judge rubrics. Run citation validity and ChunkUtilization on 100 percent because they are cheap. Alarm on a 2 to 5 point sustained drop in rolling-mean per rubric per route over 30 to 90 minutes.
Drift between offline pass and online drop is itself a quality signal. Track per-rubric delta between CI baseline and production rolling mean; the gap tells you how representative your eval set is.
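The sampling, alarming, and drift-tracking rules can be sketched without any tracing backend. The block below is an illustration only: the names are hypothetical, scores are assumed to be on a 0 to 100 scale so that "points" are literal, and the alarm uses a count-based window as a stand-in for the 30 to 90 minute time window described above.

```python
import hashlib
from collections import deque

def should_judge(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically sample a fraction of traces for LLM-judge rubrics."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    # Integer comparison avoids float rounding at the edges of the range.
    return bucket < int(sample_rate * 2**64)

class RollingDropAlarm:
    """Fire when the rolling mean sits drop_points below the baseline."""

    def __init__(self, baseline: float, window: int = 200, drop_points: float = 2.0):
        self.baseline = baseline
        self.drop_points = drop_points
        self.scores = deque(maxlen=window)

    def add(self, score: float) -> bool:
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data for a stable mean yet
        rolling_mean = sum(self.scores) / len(self.scores)
        return self.baseline - rolling_mean >= self.drop_points

def offline_online_gap(ci_baseline: dict, prod_mean: dict) -> dict:
    """Per-rubric delta between the CI baseline and the production rolling mean."""
    shared = sorted(ci_baseline.keys() & prod_mean.keys())
    return {rubric: round(ci_baseline[rubric] - prod_mean[rubric], 4) for rubric in shared}
```

Hashing the trace ID rather than calling a random generator keeps the sample stable across retries and replays, so the same trace is always either judged or skipped. Keep one alarm instance per rubric per route so a drop points at a single surface.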
Guardrails for safety-critical retrieval
Some LlamaIndex apps retrieve over sensitive data (medical records, legal contracts, financial filings) and the same rubrics that score offline have to gate at request time. The Guardrails API takes the same templates as the offline Evaluator and runs them as a request-time gate with a configurable aggregation strategy.
```python
from fi.evals import Guardrails
from fi.evals.types import RailType, AggregationStrategy
from fi.evals.templates import Groundedness, ContextRelevance

retrieval_rail = Guardrails(
    rail_type=RailType.RETRIEVAL,
    eval_templates=[Groundedness(), ContextRelevance()],
    aggregation=AggregationStrategy.MAJORITY,
)

# user_query, retrieved_nodes, draft_answer, and fallback_response
# come from the surrounding application code.
verdict = retrieval_rail.check(
    query=user_query,
    context=retrieved_nodes,
    response=draft_answer,
)
# Ship the draft only when the rail passes; otherwise fall back.
answer = draft_answer if verdict.passed else fallback_response(user_query)
```
AggregationStrategy.MAJORITY requires more than half the templates to pass. AggregationStrategy.ALL is the strict path for high-stakes domains. Thirteen guardrail backends sit behind the API (nine open-weight, four API). For the broader guardrails landscape, AI compliance guardrails for enterprise LLMs covers the model choices.
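The difference between the two strategies is easy to see in a few lines. The sketch below uses a local stand-in enum rather than the SDK's AggregationStrategy, and treats each template's verdict as a plain boolean; both are simplifications for illustration.

```python
from enum import Enum
from typing import Sequence

class Strategy(Enum):
    """Local stand-in for illustration, not the SDK's AggregationStrategy."""
    MAJORITY = "majority"
    ALL = "all"

def aggregate(passes: Sequence[bool], strategy: Strategy) -> bool:
    """Combine per-template pass/fail verdicts into one gate decision."""
    if not passes:
        return False  # no verdicts means nothing vouched for the answer
    if strategy is Strategy.ALL:
        return all(passes)
    # MAJORITY: strictly more than half of the templates must pass.
    return sum(passes) > len(passes) / 2
```

One consequence of "more than half": with exactly two templates, MAJORITY behaves like ALL, because one pass out of two is not a majority. The strategies only diverge from three templates upward.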
Closing the loop with Error Feed
The loop is what makes a LlamaIndex eval system compound. Without it, every incident produces a one-off fix and the team writes the same regression twice next quarter.
Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every failing trace into a named issue. A Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, eight span-tools, Haiku Chauffeur for spans over 3000 characters, prompt-cache hit ratio near 90 percent) reads the failing trace, writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, each 1-to-5).
For LlamaIndex apps the cluster names look like:
- “Router picks the vector engine when the query needs the SQL engine”
- “Tree-summarize synthesizer drops the key citation on long contexts”
- “Sub-question decomposition loses a conjunction constraint”
- “Recursive retriever runs past depth 4 on entity-graph queries”
- “Compact synthesizer truncates the last retrieved node on token-budget cap”
Each named cluster ships as a Linear issue today (Slack, GitHub, Jira, and PagerDuty are on the roadmap). Two patterns close the loop. The immediate_fix feeds the Platform’s self-improving evaluators so the rubric ages with your product. Representative traces from each cluster promote into the eval set under engineer sign-off; the next PR touching the offending primitive has to clear the new entries. The dataset ratchets stronger every week; the four-layer CI gate catches more LlamaIndex-specific regressions every quarter. For the full closed-loop pattern across CI and prod, evaluate RAG applications in CI/CD walks the regression-gate side of the same workflow.
Three deliberate tradeoffs
- Four layers cost more eval budget than one. Scoring retrieval, query-pipeline, synthesizer, and agent rubrics on every CI run is more LLM-judge calls than a single faithfulness score on the final answer. The payoff is debuggable regressions and a bisect that takes minutes. Future AGI’s classifier-backed evals on the Platform run at lower per-eval cost than Galileo Luna-2, which makes weekly full-dataset reruns the default rather than the exception.
- CustomLLMJudge rubrics for router and decomposition add maintenance. Two extra rubrics to keep calibrated as the router prompt and the decomposition prompt evolve. The lift is the only signal that separates a routing bug from a retriever bug. New deployments without a router can skip this layer; the moment you add a RouterQueryEngine, write the rubric the same week.
- AggregationStrategy.ALL guardrails cost latency. Strict aggregation runs every template before the answer ships, which adds 100 to 400 ms depending on rubric mix. Worth it for compliance-sensitive paths. Optional for casual workloads where MAJORITY is the right operating point.
How Future AGI ships this
Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals on the four LlamaIndex layers. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent against your live traces.
- ai-evaluation SDK (Apache 2.0): from fi.evals import Evaluator. RAG layer: ContextRelevance (9), ChunkAttribution (11), ChunkUtilization (12), Groundedness (47), ContextAdherence (5), Completeness (10), AnswerRefusal (88), FactualAccuracy (66). Agent layer: EvaluateFunctionCalling (98), TaskCompletion (99). Plus 50+ total templates, CustomLLMJudge for router and decomposition rubrics, 13 guardrail backends (9 open-weight, 4 API), 8 sub-10ms Scanners, four distributed runners (Celery, Ray, Temporal, Kubernetes).
- traceAI (Apache 2.0): LlamaIndexInstrumentor().instrument(...) covers every major query engine, retriever, synthesizer, and agent. 50+ AI surfaces across Python, TypeScript, Java, C#. Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY). 14 span kinds including first-class RETRIEVER, RERANKER, EMBEDDING, AGENT. 62 built-in evals via EvalTag.
- Future AGI Platform: self-improving evaluators tuned by thumbs-up and thumbs-down feedback; in-product authoring agent generates LlamaIndex-specific rubrics (router accuracy, decomposition fidelity, synthesizer-mode appropriateness) from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- Error Feed (inside the eval stack): HDBSCAN clustering on ClickHouse-stored span embeddings groups failing traces; Sonnet 4.5 Judge writes the immediate_fix; representative traces promote into the eval set. Linear OAuth wired today; Slack, GitHub, Jira, PagerDuty on the roadmap.
- agent-opt (Apache 2.0): six optimisers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) for tuning synthesizer, router, and decomposition prompts against the rubric set above.
- Agent Command Center: 17 MB Go binary self-hosts in your VPC for the LLM calls underneath every LlamaIndex stage. 100+ providers, 18+ built-in guardrail scanners, exact + semantic caching, SOC 2 Type II, HIPAA, GDPR, CCPA certified (ISO/IEC 27001 in active audit).
Ready to evaluate your first LlamaIndex app? Wire ContextRelevance, Groundedness, ChunkAttribution, and EvaluateFunctionCalling into a pytest fixture this afternoon against the ai-evaluation SDK, then add LlamaIndexInstrumentor when production traces start asking questions the CI gate missed. Build it once, refresh the golden set weekly, and the four-layer eval loop pays for itself the first time it catches the regression that would have shipped.
Related reading
- RAG Evaluation Metrics: A Deep Dive (2026)
- PDF Q&A Chatbot: Build and Evaluate (2026)
- Evaluate RAG Applications in CI/CD (2026)
- Best Rerankers for RAG (2026)
- Agent Evaluation Frameworks (2026)
- Instrument Your AI Agent with traceAI
- Advanced Chunking Techniques for RAG
- Contract Review RAG: Build and Evaluate (2026)