Evaluating LangChain RAG Applications in 2026
LangChain RAG eval is two problems: the retriever and the chain. Per-step rubrics catch the bug; chain-level Groundedness on the LCEL output confirms the fix.
LangChain makes RAG easy to compose and hard to debug. Six lines of LCEL wire create_history_aware_retriever to create_retrieval_chain, and the first hundred queries pass. Then a user asks a follow-up that depends on turn one, the reformulator builds a bad standalone query, the retriever pulls the wrong namespace, the synthesizer drops the citation that mattered, and the answer ships back confident and wrong. A faithfulness score reports a drop. It does not tell you which runnable broke.
LangChain RAG eval is two problems, not one. The retriever and the chain. Per-step rubrics (retrieve, rerank, generate) score each runnable in isolation against its own success criterion. Chain-level rubrics score the final LCEL output. Most teams ship the second layer and discover the per-step gaps in production. The opinion this post earns: the per-step layer catches the bug; the chain-level layer confirms the fix. Run both, and attribute every score back to the LCEL component that produced the underlying span.
This guide walks the two rubric layers, the Future AGI templates that map to each, the LangChainInstrumentor that connects offline rubrics to live spans, the CI gate, and the Error Feed that closes the loop back to the dataset. The code is shaped against the ai-evaluation SDK, with the exact EvalTemplate IDs the SDK ships.
TL;DR: the two-layer LangChain rubric set
| Layer | LangChain surface | Rubric set |
|---|---|---|
| Per-step (retrieve) | BaseRetriever, EnsembleRetriever, MultiQueryRetriever, ParentDocumentRetriever | ContextRelevance (9), ChunkAttribution (11), ChunkUtilization (12), Recall@k, MRR, nDCG |
| Per-step (rerank) | ContextualCompressionRetriever, CrossEncoderReranker, CohereRerank | Delta ContextRelevance (pre vs post rerank), recall preservation |
| Per-step (generate) | LLM step inside create_retrieval_chain, custom RunnableSequence | Groundedness (47), ContextAdherence (5), AnswerRefusal (88) on the LLM step’s output against its retrieved context |
| Chain-level | Final LCEL output via RunnableSequence.invoke(), create_retrieval_chain answer | Groundedness, ContextAdherence, Completeness (10), AnswerRefusal, FactualAccuracy (66) |
| Agent overlay | AgentExecutor, LangGraph StateGraph, tool-using flows | EvaluateFunctionCalling (98), TaskCompletion (99), per-node Groundedness |
| Bridge | traceAI LangChainInstrumentor | Same rubrics as EvalTag span-attached scorers; CHAIN, RETRIEVER, LLM, TOOL, AGENT span kinds |
The built-in langchain.evaluation.QAEvalChain and CriteriaEvalChain ship a useful floor for prototyping. They score the final answer with one rubric and do not separate retrieval from generation. The two-layer set above does, and it runs in CI and as production guardrails on the same Evaluator API.
Why generic RAG eval misses LangChain-specific failure modes
A flat retrieve-then-synthesize pipeline has two surfaces, and the canonical rubric split covered in our RAG evaluation metrics deep dive handles it. LangChain is rarely flat. LCEL composition lets runnables nest arbitrarily deep, and four failure modes appear that final-answer scores cannot see.
A history-aware retriever runs a reformulation LLM before retrieval. create_history_aware_retriever takes the chat history, asks the LLM for a standalone query, then feeds that into the retriever. If the reformulator builds a bad standalone query, ContextRelevance drops on retrieved chunks, the synthesizer covers for it with model priors, and the answer looks plausible. Per-step rubrics catch the reformulator. Chain-level rubrics do not.
An EnsembleRetriever fuses ranks from a vector store and a BM25 retriever. The fusion can silently overweight one side; both score well in isolation but recall on the fused output drops on queries BM25 used to catch. Per-retriever drift tracking catches it.
An AgentExecutor exposes RAG as one tool. The agent calls the web-search tool when the RAG tool was the right call, the synthesizer grounds the answer in the wrong evidence, and Groundedness passes. The user got the wrong answer because the wrong tool ran.
A RunnableParallel fans out to three retrievers and merges through a RunnableLambda. If two retrievers return the same canonical doc with different URLs, the merge keeps both, the context window fills with duplicates, ChunkUtilization drops. Generic eval sees a slightly lower answer score with no obvious cause.
Each failure becomes a five-minute bisect once the rubric set is tagged by LCEL component.
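The duplicate-document failure in the RunnableParallel case can be made concrete with a small merge function. This is a minimal sketch, not LangChain code: it assumes the fan-out result is a dict of branch name to a list of document dicts, and `canonical_key` and `merge_unique` are hypothetical helpers invented for illustration. A function of this shape could be wrapped in a RunnableLambda as the merge step.

```python
from urllib.parse import urlsplit


def canonical_key(doc: dict) -> str:
    """Hypothetical canonicalisation: prefer an explicit doc_id, else host + path."""
    if doc.get("doc_id"):
        return doc["doc_id"]
    parts = urlsplit(doc["url"])
    host = parts.netloc.lower().removeprefix("www.")
    # Scheme, query string, and trailing slash are ignored on purpose.
    return f"{host}{parts.path.rstrip('/')}"


def merge_unique(branches: dict[str, list[dict]]) -> list[dict]:
    """Merge fan-out results, keeping the first copy of each canonical document."""
    seen: set[str] = set()
    merged: list[dict] = []
    for retriever_name, docs in branches.items():
        for doc in docs:
            key = canonical_key(doc)
            if key in seen:
                continue  # same document reached through a different URL
            seen.add(key)
            # Tag the surviving copy so scores can be attributed per retriever.
            merged.append({**doc, "retriever": retriever_name})
    return merged
```

Tagging each surviving document with the branch that produced it is what lets a later ChunkUtilization drop point at one retriever rather than at the app as a whole.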
Layer 1: per-step rubrics on the retriever and rerank
Per-step rubrics gate each runnable against its own success criterion. They are the upstream signal. If retrieval or rerank regressed, every chain-level score will follow, and the bisect should start here.
Three Future AGI templates do the work on the retriever. ContextRelevance (eval_id 9) scores whether each retrieved chunk is on-topic for the query, catching the “asked about Section 12, got Section 9” failure when vector similarity surfaces a lexical neighbour. ChunkAttribution (eval_id 11) scores whether each cited chunk actually supports the cited claim. ChunkUtilization (eval_id 12) scores how much of the retrieved context the synthesizer actually used; high retrieval cost with low utilization signals k is too high or the reranker is misconfigured.
Pair these with IR-style rubrics on a labelled probe set: Recall@k, MRR, nDCG. The local python/fi/evals/metrics/rag/retrieval/ package ships recall_at_k, precision_at_k, mrr, ndcg, context_recall, context_precision, and context_entity_recall as deterministic metrics with no API call.
```python
from fi.evals import Evaluator
from fi.evals.templates import ContextRelevance, ChunkAttribution, ChunkUtilization
from fi.testcases import TestCase

ev = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env


def score_retrieve_step(query, retrieved_docs, response):
    # One test case per query: the retrieved chunks are the context under test.
    tc = TestCase(
        input=query, output=response,
        context=[d.page_content for d in retrieved_docs],
    )
    return ev.evaluate(
        eval_templates=[ContextRelevance(), ChunkAttribution(), ChunkUtilization()],
        inputs=[tc],
    ).eval_results[0]
```
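For the IR-style rubrics on a labelled probe set, the definitions are short enough to write out. The sketch below gives standalone reference versions in plain Python; these are not the SDK's implementations, whose signatures may differ, and the function names and argument shapes are assumptions chosen for illustration.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the labelled relevant docs that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mrr(queries):
    """Mean reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg_at_k(ranked_ids, gains, k):
    """nDCG with graded gains given as a dict of doc_id -> gain."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(pos + 2)
        for pos, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

All three are deterministic, which is why they can run on every push and on 100 percent of traffic without a judge call.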
For the rerank step, score ContextRelevance on the retriever output before ContextualCompressionRetriever (or CohereRerank, or CrossEncoderReranker) and after. The delta is the rubric. A reranker that does not improve ContextRelevance on average is dead weight; one that drops Recall@k on the top-3 has overfit to lexical features. The best rerankers for RAG (2026) walks the reranker side of the same call.
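The before-and-after comparison can be reduced to a small report. This is a sketch under stated assumptions: the ContextRelevance scores are taken as already computed per query, and `RerankProbe` and `rerank_delta` are hypothetical names, not part of any SDK.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RerankProbe:
    relevance_pre: float   # ContextRelevance on the raw retriever output
    relevance_post: float  # ContextRelevance after the reranker
    ids_pre: list          # ranked doc ids before reranking
    ids_post: list         # ranked doc ids after reranking
    relevant_ids: set      # labelled relevant doc ids for this query


def _recall_at_k(ids, relevant, k):
    return len(set(ids[:k]) & relevant) / len(relevant) if relevant else 0.0


def rerank_delta(probes, k=3):
    """Average relevance lift and top-k recall before and after the reranker."""
    recall_pre = mean(_recall_at_k(p.ids_pre, p.relevant_ids, k) for p in probes)
    recall_post = mean(_recall_at_k(p.ids_post, p.relevant_ids, k) for p in probes)
    return {
        "relevance_delta": mean(p.relevance_post - p.relevance_pre for p in probes),
        "recall_pre": recall_pre,
        "recall_post": recall_post,
        "recall_preserved": recall_post >= recall_pre,
    }
```

A non-positive `relevance_delta` flags the dead-weight case, and `recall_preserved` being false flags the reranker that pushed labelled documents out of the top k.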
Two operating rules. Score per retriever, not per app. Tagging the score by retriever lets a regression point at the side that moved. Set a per-rubric floor. ChunkUtilization below 30 percent says you are over-fetching. Recall@k below 0.7 says the chunker is the problem before the retriever is, and advanced chunking techniques for RAG covers the upstream call.
Layer 2: chain-level rubrics on the LCEL output
Chain-level rubrics gate what the user actually sees. Five Future AGI templates cover the answer surface.
Groundedness (eval_id 47) scores whether every claim in the answer is supported by the retrieved context. The hallucination check. An LLM step that improvises beyond context fails here even when retrieval was perfect. ContextAdherence (eval_id 5) catches a different failure: the synthesizer drifting into the conversation buffer or model priors instead of staying close to the retrieved context. Both matter for ConversationalRetrievalChain because it has a drift surface pure RetrievalQA does not.
Completeness (eval_id 10) scores whether the answer covers the question and the retrieved evidence relevant to it; a short answer that ignored half the evidence fails here. AnswerRefusal (eval_id 88) catches the over-cautious refusal when evidence was right there. FactualAccuracy (eval_id 66) scores world-knowledge claims independent of retrieval; useful when the synthesizer pads with model-prior trivia.
```python
from fi.evals.templates import Groundedness, ContextAdherence, Completeness, AnswerRefusal

# Reuses `ev` and `TestCase` from the retriever example above.


def score_chain_output(query, retrieved_docs, answer):
    # Same test-case shape as the per-step run; only the templates change.
    tc = TestCase(
        input=query, output=answer,
        context=[d.page_content for d in retrieved_docs],
    )
    return ev.evaluate(
        eval_templates=[Groundedness(), ContextAdherence(), Completeness(), AnswerRefusal()],
        inputs=[tc],
    ).eval_results[0]
```
The pairing rule. Per-step failures with passing chain scores mean the synthesizer covered for an upstream bug; the next harder query will fail. Chain failures with passing per-step scores mean the synthesizer dropped or paraphrased away evidence that was in the context; the bug is in the LLM step, not retrieval. Same templates, two runs, two attribution paths.
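The pairing rule is effectively a two-by-two decision table, which a few lines of plain Python can capture. This is a sketch with hypothetical helper names (`layer_ok`, `diagnose`); the floors are placeholders, and the "both failing" branch follows the earlier point that the bisect starts upstream.

```python
def layer_ok(scores: dict, floors: dict) -> bool:
    """A layer passes only if every rubric clears its floor."""
    return all(scores[name] >= floor for name, floor in floors.items())


def diagnose(per_step_ok: bool, chain_ok: bool) -> str:
    """Map the two layer verdicts to where the bisect should start."""
    if per_step_ok and chain_ok:
        return "healthy"
    if not per_step_ok and chain_ok:
        return "upstream bug masked by the synthesizer: inspect retrieve/rerank"
    if per_step_ok and not chain_ok:
        return "evidence dropped in generation: inspect the LLM step"
    return "both layers failing: start the bisect at retrieve/rerank"
```

For example, `diagnose(layer_ok(step_scores, step_floors), layer_ok(chain_scores, chain_floors))` turns two score dicts into a single starting point for the investigation.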
For LangChain-specific patterns the standard templates do not cover, CustomLLMJudge writes the rubric in natural language and runs through the same Evaluator. Three are worth writing the week the matching primitive enters the app: LCEL component attribution (which runnable produced the first wrong output), memory consistency (did the chain preserve the right prior turns at turn N), and chain composition correctness (was the chosen chain shape right for the query class).
Layer 3: the agent overlay for AgentExecutor and LangGraph
AgentExecutor and LangGraph add two failure modes per-step and chain rubrics cannot see. The agent calls the wrong tool, the answer is grounded in wrong evidence, Groundedness passes. Or the agent calls every tool correctly but loops, refuses, or quits early; per-step rubrics pass while the user gets nothing.
EvaluateFunctionCalling (eval_id 98, exported as LLMFunctionCalling) scores whether each tool call was right with the right arguments at that step. TaskCompletion (eval_id 99) scores whether the agent finished the user’s task end-to-end. Both treat the trajectory as the unit of evaluation.
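```python
from fi.evals.templates import EvaluateFunctionCalling, TaskCompletion, Groundedness
# Note: the text above says eval_id 98 is exported as LLMFunctionCalling;
# switch the import to that name if EvaluateFunctionCalling is not found.

# Reuses `ev` and `TestCase` from the retriever example above.


def score_agent(query, trajectory_spans, final_answer):
    # One test case per tool call so each step is scored on its own.
    tool_cases = [
        TestCase(input=query, output=s.tool_call, context=s.tool_args)
        for s in trajectory_spans
    ]
    final_case = TestCase(input=query, output=final_answer)
    tool_scores = ev.evaluate(eval_templates=[EvaluateFunctionCalling()], inputs=tool_cases)
    task_score = ev.evaluate(eval_templates=[TaskCompletion(), Groundedness()], inputs=[final_case])
    return tool_scores, task_score
```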
For LangGraph the same templates run per node; the bundled LangChainInstrumentor emits superstep-level metadata so the score attributes to the right graph step. A CustomLLMJudge rubric for graph-edge correctness catches routing bugs the per-node rubrics miss. Agent evaluation frameworks 2026 covers tool-trajectory rubrics across frameworks; LangGraph agent evaluation covers the graph-specific layer.
Instrumenting LangChain with traceAI
CI catches the regressions you can think of. Production catches the rest. The same two-layer rubric set should run as span-attached scorers against live LangChain traces, and that requires a tracer that understands LCEL composition.
traceAI (Apache 2.0) ships a LangChainInstrumentor that hooks LangChain’s callback manager and emits OpenTelemetry spans for every runnable, retriever, LLM, tool, and agent invocation. Each span carries fi.span.kind set to CHAIN, RETRIEVER, LLM, TOOL, or AGENT, the LCEL component name, the rendered prompt, retrieved documents with similarity scores and source, and the response.
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_langchain import LangChainInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="langchain_rag_prod",
)
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
```
After that one call, every chain invocation produces a span tree that maps to the LCEL composition. create_retrieval_chain produces a root CHAIN span with a RETRIEVER child carrying retrieved documents and an LLM child carrying the synthesizer prompt and answer. A history-aware composition adds an LLM reformulator child before RETRIEVER. An AgentExecutor produces one TOOL child per tool call and a final LLM span. A LangGraph state machine adds langgraph.superstep.node_count, nodes executed per superstep, and state diffs.
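Once scores are attached to a span tree of that shape, attribution is a tree walk. The sketch below assumes spans have been exported as nested dicts with `kind`, `name`, an optional `score`, and `children` in execution order; that export format and the `first_failing_span` helper are hypothetical, chosen only to illustrate the bisect.

```python
def first_failing_span(span, floors, path=()):
    """Return the path to the earliest span scoring under its kind's floor.

    Children are checked before their parent, in execution order, so a failing
    RETRIEVER is reported ahead of the CHAIN span that merely inherits the drop.
    """
    here = path + (f"{span['kind']}:{span['name']}",)
    for child in span.get("children", []):
        found = first_failing_span(child, floors, here)
        if found is not None:
            return found
    score = span.get("score")
    floor = floors.get(span["kind"])
    if score is not None and floor is not None and score < floor:
        return here
    return None
```

Given a root CHAIN span whose RETRIEVER child is under its floor, the function returns the path to the retriever rather than to the synthesizer that ran after it.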
The instrumentor coexists with whatever callback handler the team already wrote. Sync, async (ainvoke, astream), batched (batch, abatch), and streamed chains emit through the same callback surface. RunnableParallel, RunnableLambda, and RunnableSequence propagate span context so parallel branches produce separate subtrees. Instrumenting your AI agent with traceAI covers the broader pattern; LangChain RAG observability covers the stack underneath.
Attach the same Evaluator rubrics as span-attached scorers via EvalTag; the verdict lives on the trace next to latency, model, and chunk IDs. Sample 5 to 10 percent of production traffic for LLM-judge rubrics. Run IR metrics and ChunkUtilization on 100 percent. Alarm on a 2 to 5 point sustained drop in rolling-mean per rubric per chain over 30 to 90 minutes.
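The sampling and alarm rules can be sketched without any tracing backend. Everything here is an assumption for illustration: scores are taken to be on a 0 to 100 scale, timestamps are in seconds, and `sampled` and `RollingDropAlarm` are hypothetical helpers, with one alarm instance kept per rubric per chain.

```python
import hashlib
from collections import deque


def sampled(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministic sampling: the same trace id always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


class RollingDropAlarm:
    """Fires when the rolling mean sits `drop_points` below the baseline."""

    def __init__(self, baseline, drop_points=3.0, window_s=3600, min_samples=20):
        self.baseline = baseline
        self.drop_points = drop_points
        self.window_s = window_s
        self.min_samples = min_samples
        self._samples = deque()  # (timestamp, score) pairs, oldest first

    def observe(self, ts, score) -> bool:
        self._samples.append((ts, score))
        # Evict samples that have aged out of the window.
        while self._samples and self._samples[0][0] < ts - self.window_s:
            self._samples.popleft()
        if len(self._samples) < self.min_samples:
            return False  # not enough evidence to call the drop sustained
        rolling = sum(s for _, s in self._samples) / len(self._samples)
        return self.baseline - rolling >= self.drop_points
```

The `min_samples` guard stands in for "sustained": a single bad trace cannot fire the alarm, only a window full of them.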
The CI gate: per-step on push, chain-level on merge
Budget eval cost across two triggers. Per-step rubrics run on every push (cheap, deterministic for IR metrics, fast for LLM-judges scoring small contexts). Chain-level rubrics run on every merge to main, protecting the artifact rather than blocking every push.
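```python
# tests/test_langchain_rag_eval.py
import json
from statistics import mean

import pytest
from fi.evals import Evaluator
from fi.evals.templates import ContextRelevance, ChunkAttribution, ChunkUtilization
from fi.testcases import TestCase

ev = Evaluator()


def load_golden_set(path):
    # Assumed layout: one JSON object per line with "question" and "expected" keys.
    with open(path, encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    return [(row["question"], row["expected"]) for row in rows]


@pytest.fixture(scope="module")
def golden_set():
    return load_golden_set("data/langchain_rag_golden.jsonl")


# `rag_chain` is assumed to be a fixture (for example in conftest.py) that
# returns the create_retrieval_chain runnable under test.
def test_retriever_floor(golden_set, rag_chain):
    cases = []
    for q, _ in golden_set:
        result = rag_chain.invoke({"input": q})
        cases.append(TestCase(
            input=q, output=result["answer"],
            context=[d.page_content for d in result["context"]],
        ))
    results = ev.evaluate(
        eval_templates=[ContextRelevance(), ChunkAttribution(), ChunkUtilization()],
        inputs=cases,
    )
    assert mean(r.metrics[0].value for r in results.eval_results) >= 0.80
```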
The merge gate runs Groundedness, ContextAdherence, and Completeness on the same golden set with stricter thresholds (0.85 to 0.90 depending on domain risk) and posts the per-rubric delta from the previous main as a PR comment. Evaluate RAG applications in CI/CD walks the regression-gate side in detail.
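A merge gate of this kind needs only the current scores, the previous main's scores, and a floor per rubric. The sketch below is one way to build the verdict and the PR comment body; `merge_gate`, the rubric keys, and the comment layout are assumptions, and actually posting the comment is left to whatever CI integration is in use.

```python
def merge_gate(current: dict, previous: dict, floors: dict):
    """Return (passed, markdown_comment) for the chain-level rubrics."""
    lines = [
        "| Rubric | Score | Floor | Delta vs main | Status |",
        "|---|---|---|---|---|",
    ]
    passed = True
    for rubric, floor in floors.items():
        score = current[rubric]
        prev = previous.get(rubric)
        delta = "n/a" if prev is None else f"{score - prev:+.3f}"
        ok = score >= floor
        passed = passed and ok
        status = "PASS" if ok else "FAIL"
        lines.append(f"| {rubric} | {score:.3f} | {floor:.2f} | {delta} | {status} |")
    return passed, "\n".join(lines)
```

In a pytest-based gate, the test would assert on `passed` and write the comment string to a file that a later CI step publishes.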
For cost control on high-volume apps, the augment=True cascade runs a cheap classifier first and only escalates to a frontier judge on uncertain inputs. On a 10,000-query batch this cuts judge cost by 5 to 10x without changing the score distribution.
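The general cascade pattern is easy to sketch on its own. The code below illustrates the idea only and makes no claim about how the SDK implements augment=True: `cascade_score`, the `cheap` and `judge` callables, and the confidence band are all hypothetical.

```python
from typing import Callable, Tuple


def cascade_score(
    case: dict,
    cheap: Callable[[dict], float],
    judge: Callable[[dict], float],
    low: float = 0.2,
    high: float = 0.8,
) -> Tuple[float, str]:
    """Score with the cheap classifier; escalate only inside the uncertain band."""
    score = cheap(case)
    if score <= low or score >= high:
        return score, "classifier"  # confident either way, no judge call
    return judge(case), "judge"     # uncertain, pay for the frontier judge
```

The width of the band between `low` and `high` is the cost knob: narrowing it sends fewer cases to the judge, so it is worth checking on a labelled sample that the score distribution stays put before tightening it.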
Production observability and the Error Feed
The closed loop is what makes a LangChain eval system compound. Without it, every incident produces a one-off fix and the team writes the same regression twice next quarter.
Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every failing trace into a named issue. A Claude Sonnet 4.5 Judge on Bedrock (30-turn budget, eight span-tools, Haiku Chauffeur for spans over 3000 characters, prompt-cache hit near 90 percent) reads each failing trace and writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-to-5).
For LangChain RAG apps the cluster names look like:
- “History-aware reformulator builds a bad standalone query on turn 3 onwards”
- “Agent calls web-search tool when the RAG tool was the right call”
- “EnsembleRetriever overweights vector side, BM25 recall drops to zero”
- “ContextualCompressionRetriever downranks the correct doc on long queries”
- “LangGraph planner takes wrong conditional edge on empty tool_calls state”
Each named cluster ships as a Linear issue today (Slack, GitHub, Jira, PagerDuty on the roadmap). Two patterns close the loop. The immediate_fix feeds the Platform’s self-improving evaluators so the rubric ages with the product. Representative traces from each cluster promote into the golden set under engineer sign-off; the next PR touching the offending runnable has to clear the new entries.
For safety-critical retrieval (medical, legal, financial), wrap the same templates in the Guardrails API at request time with AggregationStrategy.MAJORITY for casual workloads or AggregationStrategy.ALL for compliance paths. Thirteen guardrail backends sit behind the API (nine open-weight including LLAMAGUARD_3_8B, QWEN3GUARD_8B, GRANITE_GUARDIAN_8B, SHIELDGEMMA_2B; four API). AI compliance guardrails for enterprise LLMs covers the model choices; LangChain callback tracing best practices covers the callback-side instrumentation.
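The difference between the two aggregation strategies comes down to how individual verdicts combine. The sketch below uses a local enum that only mirrors the strategy names; it is not the Guardrails API, and treating a tie as a block under majority voting is an assumption.

```python
from enum import Enum


class Aggregation(Enum):  # local stand-in, not the SDK's AggregationStrategy
    MAJORITY = "majority"
    ALL = "all"


def allow_response(verdicts: list, strategy: Aggregation) -> bool:
    """Combine per-guardrail pass/fail verdicts (True means the check passed)."""
    if not verdicts:
        raise ValueError("at least one guardrail verdict is required")
    if strategy is Aggregation.ALL:
        return all(verdicts)  # compliance path: one failure blocks
    # Majority path: strictly more than half must pass; a tie blocks.
    return sum(verdicts) * 2 > len(verdicts)
```

With verdicts of two passes and one failure, the majority strategy lets the answer through while the strict strategy blocks it, which is the tradeoff between casual and compliance workloads.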
Three deliberate tradeoffs
- Two layers cost more eval budget than one. Running per-step and chain-level rubrics together is more LLM-judge calls than one faithfulness score. The payoff is debuggable regressions and a five-minute bisect. Future AGI’s classifier-backed evals run at lower per-eval cost than Galileo Luna-2, which makes weekly full-dataset reruns the default.
- CustomLLMJudge rubrics add maintenance. LCEL attribution and memory consistency rubrics need recalibration as the chain composition evolves. The lift is the only signal that separates a reformulator bug from a retriever bug.
- AggregationStrategy.ALL guardrails cost latency. Strict aggregation runs every template before the answer ships, adding 100 to 400 ms. Worth it for compliance paths. Optional where MAJORITY is the right operating point.
How Future AGI ships this
Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals on the two LangChain layers. Graduate to the Platform when you want self-improving rubrics against your live traces.
- ai-evaluation SDK (Apache 2.0): per-step layer ContextRelevance (9), ChunkAttribution (11), ChunkUtilization (12); chain layer Groundedness (47), ContextAdherence (5), Completeness (10), AnswerRefusal (88), FactualAccuracy (66); agent overlay EvaluateFunctionCalling (98), TaskCompletion (99). 50+ total templates, CustomLLMJudge for LCEL attribution and memory consistency, 13 guardrail backends (9 open-weight, 4 API), 8 sub-10ms Scanners, four distributed runners (Celery, Ray, Temporal, Kubernetes).
- traceAI (Apache 2.0): LangChainInstrumentor().instrument(...) covers every runnable, retriever, LLM, tool, agent, and LangGraph node. 50+ AI surfaces across Python, TypeScript, Java, C#; pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY); 14 span kinds including first-class RETRIEVER, RERANKER, EMBEDDING, AGENT.
- Future AGI Platform: self-improving evaluators tuned by thumbs-up/down feedback from production traces; in-product agent authors LangChain-specific rubrics from natural language; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- Error Feed (inside the eval stack): HDBSCAN clustering groups failing traces, Sonnet 4.5 Judge writes the immediate_fix, representative traces promote into the golden set. Linear OAuth today; Slack, GitHub, Jira, PagerDuty on roadmap.
- agent-opt (Apache 2.0): six optimisers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) for tuning synthesizer, reformulator, and tool-selection prompts.
- Agent Command Center: Go binary self-hosts in your VPC for the LLM calls underneath every LangChain runnable. 100+ providers, 18+ built-in guardrail scanners, exact + semantic caching; SOC 2 Type II, HIPAA, GDPR, CCPA certified (ISO/IEC 27001 in active audit).
Ready to evaluate your first LangChain RAG app? Wire ContextRelevance and ChunkAttribution on the retriever, Groundedness and Completeness on the chain output, and EvaluateFunctionCalling if RAG sits inside an AgentExecutor, into a pytest fixture this afternoon against the ai-evaluation SDK. Add LangChainInstrumentor when production traces ask questions the CI gate missed. The two-layer loop pays for itself the first time it catches the regression that would have shipped.
Related reading
- RAG Evaluation Metrics: A Deep Dive (2026)
- Evaluating LlamaIndex RAG Applications (2026)
- Evaluate RAG Applications in CI/CD (2026)
- Best Rerankers for RAG (2026)
- Agent Evaluation Frameworks (2026)
- LangGraph Agent Evaluation (2026)
- Instrument Your AI Agent with traceAI
- LangChain RAG Observability
- LangChain Callback Tracing Best Practices (2026)
- Advanced Chunking Techniques for RAG
Frequently asked questions
Why does LangChain RAG need a different evaluation approach from generic RAG eval?
What is the right rubric split for a LangChain RAG pipeline?
How does traceAI instrument LangChain?
How big should a LangChain RAG golden set be?
How do per-step and chain-level rubrics differ in CI?
How does traceAI handle LangChain callbacks and async chains?
How does the Error Feed cluster LangChain failures?
What does Future AGI ship for LangChain RAG eval today, and what is roadmap?