
Evaluating LangChain RAG Applications in 2026

LangChain RAG eval is two problems: the retriever and the chain. Per-step rubrics catch the bug; chain-level Groundedness on LCEL output confirms the fix.

March 14, 2026

Updated May 20, 2026

12 min read

langchain rag llm-evaluation llm-observability traceAI 2026


LangChain makes RAG easy to compose and hard to debug. Six lines of LCEL wire create_history_aware_retriever to create_retrieval_chain, and the first hundred queries pass. Then a user asks a follow-up that depends on turn one, the reformulator builds a bad standalone query, the retriever pulls the wrong namespace, the synthesizer drops the citation that mattered, and the answer ships back confident and wrong. A faithfulness score reports a drop. It does not tell you which runnable broke.

LangChain RAG eval is two problems, not one: the retriever and the chain. Per-step rubrics (retrieve, rerank, generate) score each runnable in isolation against its own success criterion. Chain-level rubrics score the final LCEL output. Most teams ship the second layer and discover the per-step gaps in production. The position this post argues: the per-step layer catches the bug; the chain-level layer confirms the fix. Run both, and attribute every score back to the LCEL component that produced the underlying span.

This guide walks through the two rubric layers, the Future AGI templates that map to each, the LangChainInstrumentor that connects offline rubrics to live spans, the CI gate, and the Error Feed that closes the loop back to the dataset. The code is written against the ai-evaluation SDK and uses the EvalTemplate IDs the SDK ships.

TL;DR: the two-layer LangChain rubric set

Layer	LangChain surface	Rubric set
Per-step (retrieve)	`BaseRetriever`, `EnsembleRetriever`, `MultiQueryRetriever`, `ParentDocumentRetriever`	`ContextRelevance` (9), `ChunkAttribution` (11), `ChunkUtilization` (12), Recall@k, MRR, nDCG
Per-step (rerank)	`ContextualCompressionRetriever`, `CrossEncoderReranker`, `CohereRerank`	Delta `ContextRelevance` (pre vs post rerank), recall preservation
Per-step (generate)	LLM step inside `create_retrieval_chain`, custom `RunnableSequence`	`Groundedness` (47), `ContextAdherence` (5), `AnswerRefusal` (88) on the LLM step’s output against its retrieved context
Chain-level	Final LCEL output via `RunnableSequence.invoke()`, `create_retrieval_chain` answer	`Groundedness`, `ContextAdherence`, `Completeness` (10), `AnswerRefusal`, `FactualAccuracy` (66)
Agent overlay	`AgentExecutor`, LangGraph `StateGraph`, tool-using flows	`EvaluateFunctionCalling` (98), `TaskCompletion` (99), per-node `Groundedness`
Bridge	traceAI `LangChainInstrumentor`	Same rubrics as `EvalTag` span-attached scorers; `CHAIN`, `RETRIEVER`, `LLM`, `TOOL`, `AGENT` span kinds

The built-in langchain.evaluation.QAEvalChain and CriteriaEvalChain ship a useful floor for prototyping. They score the final answer with one rubric and do not separate retrieval from generation. The two-layer set above does, and it runs in CI and as production guardrails on the same Evaluator API.

Why generic RAG eval misses LangChain-specific failure modes

A flat retrieve-then-synthesize pipeline has two surfaces, and the canonical rubric split covered in our RAG evaluation metrics deep dive handles it. LangChain is rarely flat. LCEL composition lets runnables nest arbitrarily deep, and four failure modes appear that final-answer scores cannot see.

A history-aware retriever runs a reformulation LLM before retrieval. create_history_aware_retriever takes the chat history, asks the LLM for a standalone query, then feeds that into the retriever. If the reformulator builds a bad standalone query, ContextRelevance drops on retrieved chunks, the synthesizer covers for it with model priors, and the answer looks plausible. Per-step rubrics catch the reformulator. Chain-level rubrics do not.

An EnsembleRetriever fuses ranks from a vector store and a BM25 retriever. The fusion can silently overweight one side; both score well in isolation but recall on the fused output drops on queries BM25 used to catch. Per-retriever drift tracking catches it.
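To see how fusion weights can push one side out of the top results, here is a minimal stdlib sketch of weighted reciprocal rank fusion, the general technique behind ensemble retrieval. It is an illustration, not LangChain's code: the function name `weighted_rrf`, the constant `c=60`, and the document IDs are assumptions made for the example.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked ID lists with weighted reciprocal rank fusion.

    rankings: one ranked list of doc IDs per retriever (best first).
    weights:  one weight per retriever.
    c:        smoothing constant (60 is a common choice; an assumption here).
    """
    scores = defaultdict(float)
    for ranked_ids, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first; ties keep first-seen order.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical result lists: "b1" is a document only BM25 finds.
vector_hits = ["v1", "v2", "v3"]
bm25_hits = ["b1", "v2", "b2"]

balanced_top3 = weighted_rrf([vector_hits, bm25_hits], [0.5, 0.5])[:3]
skewed_top3 = weighted_rrf([vector_hits, bm25_hits], [0.9, 0.1])[:3]
```

With balanced weights the BM25-only document stays in the top three; with the vector side at 0.9 it falls out, even though neither retriever changed. Tracking recall per retriever and on the fused list is what makes that visible.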

An AgentExecutor exposes RAG as one tool. The agent calls the web-search tool when the RAG tool was the right call, the synthesizer grounds the answer in the wrong evidence, and Groundedness passes. The user got the wrong answer because the wrong tool ran.

A RunnableParallel fans out to three retrievers and merges through a RunnableLambda. If two retrievers return the same canonical doc with different URLs, the merge keeps both, the context window fills with duplicates, and ChunkUtilization drops. Generic eval sees a slightly lower answer score with no obvious cause.

Each failure becomes a five-minute bisect once the rubric set is tagged by LCEL component.
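For the duplicate-merge failure in the RunnableParallel case, one possible fix is a merge step that deduplicates on normalized content rather than on URL. The sketch below is stdlib-only and uses a stand-in `Doc` dataclass instead of LangChain's document type; `canonical_key` and `merge_unique` are hypothetical names, and hashing whitespace-normalized, lowercased text is only one of several reasonable canonicalization choices.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    """Stand-in for a retrieved document (hypothetical, for illustration)."""
    page_content: str
    url: str

def canonical_key(doc: Doc) -> str:
    """Hash of lowercased, whitespace-normalized content."""
    normalized = " ".join(doc.page_content.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def merge_unique(*result_lists):
    """Merge retriever outputs, keeping the first copy of each canonical doc."""
    seen = set()
    merged = []
    for docs in result_lists:
        for doc in docs:
            key = canonical_key(doc)
            if key not in seen:
                seen.add(key)
                merged.append(doc)
    return merged

# Two retrievers return the same text under different URLs.
from_a = [Doc("Refund policy: 30 days.", "https://a.example/refunds")]
from_b = [
    Doc("refund  policy: 30 days.", "https://b.example/policy"),
    Doc("Shipping takes 5 days.", "https://b.example/shipping"),
]
merged = merge_unique(from_a, from_b)
```

A function like this could sit in the merge step of the chain; exact-hash matching will miss near-duplicates, so treat it as a floor rather than a complete answer.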

Layer 1: per-step rubrics on the retriever and rerank

Per-step rubrics gate each runnable against its own success criterion. They are the upstream signal. If retrieval or rerank regressed, every chain-level score will follow, and the bisect should start here.

Three Future AGI templates do the work on the retriever. ContextRelevance (eval_id 9) scores whether each retrieved chunk is on-topic for the query, catching the “asked about Section 12, got Section 9” failure when vector similarity surfaces a lexical neighbour. ChunkAttribution (eval_id 11) scores whether each cited chunk actually supports the cited claim. ChunkUtilization (eval_id 12) scores how much of the retrieved context the synthesizer actually used; high retrieval cost with low utilization signals k is too high or the reranker is misconfigured.

Pair these with IR-style rubrics on a labelled probe set: Recall@k, MRR, nDCG. The local python/fi/evals/metrics/rag/retrieval/ package ships recall_at_k, precision_at_k, mrr, ndcg, context_recall, context_precision, and context_entity_recall as deterministic metrics with no API call.
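To make the three IR metrics concrete, here is a minimal stdlib sketch with binary relevance labels. It is not the SDK's implementation; the function names and signatures below are assumptions chosen for the example.

```python
import math
from typing import Sequence, Set

def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def mrr(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant document, 0.0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of this ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# One labelled probe: two relevant docs, found at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
labels = {"d1", "d2"}
probe = (recall_at_k(ranked, labels, 3), mrr(ranked, labels), ndcg_at_k(ranked, labels, 4))
```

Averaging each metric over the probe set gives the per-retriever number to track across releases.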

from fi.evals import Evaluator
from fi.evals.templates import ContextRelevance, ChunkAttribution, ChunkUtilization
from fi.testcases import TestCase

ev = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

def score_retrieve_step(query, retrieved_docs, response):
    tc = TestCase(
        input=query, output=response,
        context=[d.page_content for d in retrieved_docs],
    )
    return ev.evaluate(
        eval_templates=[ContextRelevance(), ChunkAttribution(), ChunkUtilization()],
        inputs=[tc],
    ).eval_results[0]

For the rerank step, score ContextRelevance on the retriever output before ContextualCompressionRetriever (or CohereRerank, or CrossEncoderReranker) and after. The delta is the rubric. A reranker that does not improve ContextRelevance on average is dead weight; one that drops Recall@k on the top-3 has overfit to lexical features. The best rerankers for RAG (2026) walks the reranker side of the same call.
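The rerank delta can be computed from scores you already have. The sketch below assumes per-chunk ContextRelevance scores have been collected before and after the reranker; `rerank_delta` and `recall_kept` are hypothetical helper names, and defining recall preservation as the share of relevant candidates that survive into the reranked top k is one reasonable reading, not the only one.

```python
from statistics import mean

def rerank_delta(pre_scores, post_scores):
    """Mean relevance after rerank minus mean relevance before."""
    return mean(post_scores) - mean(pre_scores)

def recall_kept(pre_ids, post_ids, relevant, k):
    """Share of relevant docs in the candidate list that survive into the top k."""
    available = set(pre_ids) & relevant
    if not available:
        return 1.0  # nothing relevant was retrieved, so nothing could be lost
    return len(set(post_ids[:k]) & available) / len(available)

# Hypothetical numbers for one query.
delta = rerank_delta(pre_scores=[0.4, 0.6, 0.5], post_scores=[0.7, 0.8])
kept = recall_kept(
    pre_ids=["a", "b", "c", "d"], post_ids=["c", "a"], relevant={"a", "d"}, k=2
)
```

Here the delta is positive but half of the relevant candidates were cut, which is the pattern to watch for: relevance improves on average while recall quietly drops.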

Two operating rules. First, score per retriever, not per app: tagging the score by retriever lets a regression point at the side that moved. Second, set a per-rubric floor. ChunkUtilization below 30 percent says you are over-fetching. Recall@k below 0.7 says the chunker is the problem before the retriever is, and advanced chunking techniques for RAG covers the upstream call.
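Both rules fit in a few lines once scores are keyed by retriever. The sketch below uses the two floors named above; the dictionary layout, the rubric keys, and the `floor_violations` helper are assumptions for illustration.

```python
# Floors taken from the operating rules above.
FLOORS = {"chunk_utilization": 0.30, "recall_at_k": 0.70}

def floor_violations(scores_by_retriever, floors=FLOORS):
    """Return (retriever, rubric, value, floor) for every score under its floor."""
    violations = []
    for retriever, scores in sorted(scores_by_retriever.items()):
        for rubric, floor in floors.items():
            value = scores.get(rubric)
            if value is not None and value < floor:
                violations.append((retriever, rubric, value, floor))
    return violations

# Hypothetical mean scores, tagged by retriever.
scores = {
    "vector": {"chunk_utilization": 0.25, "recall_at_k": 0.82},
    "bm25": {"chunk_utilization": 0.41, "recall_at_k": 0.64},
}
violations = floor_violations(scores)
```

Because each violation names the retriever, the failure message already says which side of the ensemble moved.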

Layer 2: chain-level rubrics on the LCEL output

Chain-level rubrics gate what the user actually sees. Five Future AGI templates cover the answer surface.

Groundedness (eval_id 47) scores whether every claim in the answer is supported by the retrieved context. It is the hallucination check: an LLM step that improvises beyond context fails here even when retrieval was perfect. ContextAdherence (eval_id 5) catches a different failure: the synthesizer drifting away from the retrieved context into the conversation buffer or model priors. Both matter for ConversationalRetrievalChain because it has a drift surface that pure RetrievalQA does not.

Completeness (eval_id 10) scores whether the answer covers the question and the retrieved evidence relevant to it; a short answer that ignored half the evidence fails here. AnswerRefusal (eval_id 88) catches the over-cautious refusal when evidence was right there. FactualAccuracy (eval_id 66) scores world-knowledge claims independent of retrieval; useful when the synthesizer pads with model-prior trivia.

# Reuses `ev` and `TestCase` from the Layer 1 example above.
from fi.evals.templates import Groundedness, ContextAdherence, Completeness, AnswerRefusal

def score_chain_output(query, retrieved_docs, answer):
    tc = TestCase(
        input=query, output=answer,
        context=[d.page_content for d in retrieved_docs],
    )
    return ev.evaluate(
        eval_templates=[Groundedness(), ContextAdherence(), Completeness(), AnswerRefusal()],
        inputs=[tc],
    ).eval_results[0]

The pairing rule. Per-step failures with passing chain scores mean the synthesizer covered for an upstream bug; the next harder query will fail. Chain failures with passing per-step scores mean the synthesizer dropped or paraphrased away evidence that was in the context; the bug is in the LLM step, not retrieval. Same templates, two runs, two attribution paths.
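The pairing rule reduces to a small decision table. The sketch below encodes it as a function; the name `attribute_regression` and the returned strings are illustrative, and the both-fail branch follows the Layer 1 guidance that the bisect starts upstream.

```python
def attribute_regression(per_step_pass: bool, chain_pass: bool) -> str:
    """Map the two layer verdicts to where the bisect should start."""
    if per_step_pass and chain_pass:
        return "ok"
    if not per_step_pass and chain_pass:
        # The synthesizer covered for an upstream bug.
        return "masked upstream bug: inspect retriever and rerank"
    if per_step_pass and not chain_pass:
        # Evidence was in context but the LLM step dropped or distorted it.
        return "generation bug: inspect the LLM step"
    # Both failed: upstream regressions drag chain scores down with them.
    return "start the bisect at retrieval"

verdicts = {
    (per_step, chain): attribute_regression(per_step, chain)
    for per_step in (True, False)
    for chain in (True, False)
}
```

Logging this label next to each CI run makes the two attribution paths explicit instead of leaving them to whoever reads the dashboard.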

For LangChain-specific patterns the standard templates do not cover, CustomLLMJudge writes the rubric in natural language and runs through the same Evaluator. Three are worth writing the week the matching primitive enters the app: LCEL component attribution (which runnable produced the first wrong output), memory consistency (did the chain preserve the right prior turns at turn N), and chain composition correctness (was the chosen chain shape right for the query class).

Layer 3: the agent overlay for AgentExecutor and LangGraph

AgentExecutor and LangGraph add two failure modes per-step and chain rubrics cannot see. The agent calls the wrong tool, the answer is grounded in wrong evidence, Groundedness passes. Or the agent calls every tool correctly but loops, refuses, or quits early; per-step rubrics pass while the user gets nothing.

EvaluateFunctionCalling (eval_id 98, exported as LLMFunctionCalling) scores whether each tool call was right with the right arguments at that step. TaskCompletion (eval_id 99) scores whether the agent finished the user’s task end-to-end. Both treat the trajectory as the unit of evaluation.

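# The function-calling template (eval_id 98) is exported as LLMFunctionCalling;
# the alias keeps the rubric name used throughout this guide.
from fi.evals.templates import LLMFunctionCalling as EvaluateFunctionCalling
from fi.evals.templates import TaskCompletion, Groundedness

# Reuses `ev` and `TestCase` from the Layer 1 example above.
def score_agent(query, trajectory_spans, final_answer):
    # One test case per tool call; `tool_call` and `tool_args` are assumed
    # attributes of whatever span objects your tracer returns.
    tool_cases = [
        TestCase(input=query, output=s.tool_call, context=s.tool_args)
        for s in trajectory_spans
    ]
    final_case = TestCase(input=query, output=final_answer)
    tool_scores = ev.evaluate(eval_templates=[EvaluateFunctionCalling()], inputs=tool_cases)
    task_score = ev.evaluate(eval_templates=[TaskCompletion(), Groundedness()], inputs=[final_case])
    return tool_scores, task_score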

For LangGraph the same templates run per node; the bundled LangChainInstrumentor emits superstep-level metadata so the score attributes to the right graph step. A CustomLLMJudge rubric for graph-edge correctness catches routing bugs the per-node rubrics miss. Agent evaluation frameworks 2026 covers tool-trajectory rubrics across frameworks; LangGraph agent evaluation covers the graph-specific layer.

Instrumenting LangChain with traceAI

CI catches the regressions you can think of. Production catches the rest. The same two-layer rubric set should run as span-attached scorers against live LangChain traces, and that requires a tracer that understands LCEL composition.

traceAI (Apache 2.0) ships a LangChainInstrumentor that hooks LangChain’s callback manager and emits OpenTelemetry spans for every runnable, retriever, LLM, tool, and agent invocation. Each span carries fi.span.kind set to CHAIN, RETRIEVER, LLM, TOOL, or AGENT, the LCEL component name, the rendered prompt, retrieved documents with similarity scores and source, and the response.

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_langchain import LangChainInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="langchain_rag_prod",
)
LangChainInstrumentor().instrument(tracer_provider=trace_provider)

After that one call, every chain invocation produces a span tree that maps to the LCEL composition. create_retrieval_chain produces a root CHAIN span with a RETRIEVER child carrying retrieved documents and an LLM child carrying the synthesizer prompt and answer. A history-aware composition adds an LLM reformulator child before RETRIEVER. An AgentExecutor produces one TOOL child per tool call and a final LLM span. A LangGraph state machine adds langgraph.superstep.node_count, nodes executed per superstep, and state diffs.

The instrumentor coexists with whatever callback handler the team already wrote. Sync, async (ainvoke, astream), batched (batch, abatch), and streamed chains emit through the same callback surface. RunnableParallel, RunnableLambda, and RunnableSequence propagate span context so parallel branches produce separate subtrees. Instrumenting your AI agent with traceAI covers the broader pattern; LangChain RAG observability covers the stack underneath.

Attach the same Evaluator rubrics as span-attached scorers via EvalTag; the verdict lives on the trace next to latency, model, and chunk IDs. Sample 5 to 10 percent of production traffic for LLM-judge rubrics. Run IR metrics and ChunkUtilization on 100 percent. Alarm on a 2 to 5 point sustained drop in rolling-mean per rubric per chain over 30 to 90 minutes.
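The alarm rule can be prototyped before wiring it into a monitoring stack. The sketch below is a count-based simplification: it treats scores as 0 to 100, uses a fixed-size window of recent scores in place of the 30 to 90 minute time window, and fires only after several consecutive breaches. `RollingDropAlarm` and its parameters are assumptions, not part of traceAI or the SDK.

```python
from collections import deque

class RollingDropAlarm:
    """Fire when the rolling mean stays `drop_points` below baseline."""

    def __init__(self, baseline, drop_points=3.0, window=50, sustain=3):
        self.baseline = baseline
        self.drop_points = drop_points
        self.sustain = sustain
        self.scores = deque(maxlen=window)
        self.breaches = 0

    def observe(self, score) -> bool:
        """Record one score; return True once the drop has been sustained."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data for a stable rolling mean
        rolling_mean = sum(self.scores) / len(self.scores)
        if self.baseline - rolling_mean >= self.drop_points:
            self.breaches += 1
        else:
            self.breaches = 0  # a recovery resets the sustained-drop counter
        return self.breaches >= self.sustain

# Hypothetical stream for one rubric on one chain.
alarm = RollingDropAlarm(baseline=90, drop_points=3, window=3, sustain=2)
fired = [alarm.observe(score) for score in [90, 90, 90, 80, 80]]
```

Keep one alarm instance per rubric per chain so a page names the rubric and the chain that moved.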

The CI gate: per-step on push, chain-level on merge

Budget eval cost across two triggers. Per-step rubrics run on every push (cheap, deterministic for IR metrics, fast for LLM-judges scoring small contexts). Chain-level rubrics run on every merge to main, protecting the artifact rather than blocking every push.

# tests/test_langchain_rag_eval.py
import pytest
from statistics import mean
from fi.evals import Evaluator
from fi.evals.templates import ContextRelevance, ChunkAttribution, ChunkUtilization
from fi.testcases import TestCase

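ev = Evaluator()

# `load_golden_set` and the `rag_chain` fixture are assumed to be
# project-specific helpers (for example, defined in conftest.py).
@pytest.fixture(scope="module")
def golden_set():
    return load_golden_set("data/langchain_rag_golden.jsonl")

def test_retriever_floor(golden_set, rag_chain):
    cases = []
    for q, _ in golden_set:  # (query, expected_answer) pairs
        result = rag_chain.invoke({"input": q})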
        cases.append(TestCase(
            input=q, output=result["answer"],
            context=[d.page_content for d in result["context"]],
        ))
    results = ev.evaluate(
        eval_templates=[ContextRelevance(), ChunkAttribution(), ChunkUtilization()],
        inputs=cases,
    )
    assert mean(r.metrics[0].value for r in results.eval_results) >= 0.80

The merge gate runs Groundedness, ContextAdherence, and Completeness on the same golden set with stricter thresholds (0.85 to 0.90 depending on domain risk) and posts the per-rubric delta from the previous main as a PR comment. Evaluate RAG applications in CI/CD walks the regression-gate side in detail.
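The merge gate's pass/fail decision and its PR comment can come from the same function. The sketch below assumes mean scores per rubric are already computed for the current branch and for the previous main; `gate_report`, the threshold values, and the comment format are illustrative, and posting the comment is left to whatever CI integration is in use.

```python
def gate_report(current, previous, thresholds):
    """Return (passed, report) with one line per gated rubric."""
    lines = []
    passed = True
    for rubric in sorted(thresholds):
        score = current[rubric]
        # With no prior score on main, report a zero delta.
        delta = score - previous.get(rubric, score)
        ok = score >= thresholds[rubric]
        passed = passed and ok
        status = "PASS" if ok else "FAIL"
        lines.append(f"{rubric}: {score:.2f} ({delta:+.2f} vs main) {status}")
    return passed, "\n".join(lines)

# Hypothetical scores for one merge.
passed, report = gate_report(
    current={"Groundedness": 0.91, "Completeness": 0.84},
    previous={"Groundedness": 0.89, "Completeness": 0.88},
    thresholds={"Groundedness": 0.90, "Completeness": 0.85},
)
```

Reporting the delta alongside the verdict shows reviewers a rubric that is still passing but trending down.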

For cost control on high-volume apps, the augment=True cascade runs a cheap classifier first and only escalates to a frontier judge on uncertain inputs. On a 10,000-query batch this cuts judge cost by 5 to 10x without changing the score distribution.
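The general cascade idea can be sketched without the SDK. This is not how `augment=True` is implemented internally; it only illustrates the pattern, with judges passed in as plain callables that return a score between 0 and 1. The names `cascade_score` and `run_batch` and the uncertainty band of 0.2 to 0.8 are assumptions.

```python
def cascade_score(item, cheap_judge, frontier_judge, low=0.2, high=0.8):
    """Score with the cheap judge; escalate only when it is uncertain.

    Returns (score, escalated).
    """
    score = cheap_judge(item)
    if low < score < high:
        return frontier_judge(item), True
    return score, False

def run_batch(items, cheap_judge, frontier_judge):
    """Score a batch and count how many items needed the frontier judge."""
    results = [cascade_score(item, cheap_judge, frontier_judge) for item in items]
    scores = [score for score, _ in results]
    escalated = sum(1 for _, was_escalated in results if was_escalated)
    return scores, escalated

# Toy judges: the cheap one echoes a precomputed confidence value.
scores, escalated = run_batch(
    items=[0.05, 0.5, 0.95],
    cheap_judge=lambda confidence: confidence,
    frontier_judge=lambda confidence: 1.0,
)
```

The escalation count is the cost lever: widening the band sends more traffic to the frontier judge, narrowing it trusts the classifier more.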

Production observability and the Error Feed

The closed loop is what makes a LangChain eval system compound. Without it, every incident produces a one-off fix and the team writes the same regression twice next quarter.

Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every failing trace into a named issue. A Claude Sonnet 4.5 Judge on Bedrock (30-turn budget, eight span-tools, Haiku Chauffeur for spans over 3000 characters, prompt-cache hit near 90 percent) reads each failing trace and writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-to-5).

For LangChain RAG apps the cluster names look like:

“History-aware reformulator builds a bad standalone query on turn 3 onwards”
“Agent calls web-search tool when the RAG tool was the right call”
“EnsembleRetriever overweights vector side, BM25 recall drops to zero”
“ContextualCompressionRetriever downranks the correct doc on long queries”
“LangGraph planner takes wrong conditional edge on empty tool_calls state”

Each named cluster ships as a Linear issue today (Slack, GitHub, Jira, PagerDuty on the roadmap). Two patterns close the loop. The immediate_fix feeds the Platform’s self-improving evaluators so the rubric ages with the product. Representative traces from each cluster promote into the golden set under engineer sign-off; the next PR touching the offending runnable has to clear the new entries.

For safety-critical retrieval (medical, legal, financial), wrap the same templates in the Guardrails API at request time with AggregationStrategy.MAJORITY for casual workloads or AggregationStrategy.ALL for compliance paths. Thirteen guardrail backends sit behind the API (nine open-weight including LLAMAGUARD_3_8B, QWEN3GUARD_8B, GRANITE_GUARDIAN_8B, SHIELDGEMMA_2B; four API). AI compliance guardrails for enterprise LLMs covers the model choices; LangChain callback tracing best practices covers the callback-side instrumentation.
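The difference between the two operating points is easiest to see on a list of verdicts. The sketch below is a stand-in, not the Guardrails API: the `Aggregation` enum is hypothetical, and it assumes MAJORITY means more than half of the checks must pass and ALL means every check must pass.

```python
from enum import Enum

class Aggregation(Enum):
    """Hypothetical stand-in for the SDK's AggregationStrategy."""
    MAJORITY = "majority"
    ALL = "all"

def aggregate(verdicts, strategy):
    """Combine per-guardrail pass/fail verdicts into one decision."""
    if not verdicts:
        raise ValueError("at least one guardrail verdict is required")
    passed = sum(1 for verdict in verdicts if verdict)
    if strategy is Aggregation.ALL:
        return passed == len(verdicts)
    # MAJORITY: strictly more than half must pass, so ties fail.
    return passed > len(verdicts) / 2

# Three guardrails, one of which flags the answer.
checks = [True, True, False]
casual = aggregate(checks, Aggregation.MAJORITY)
strict = aggregate(checks, Aggregation.ALL)
```

The same answer ships under the casual setting and is blocked under the strict one, which is the tradeoff to decide per route rather than per app.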

Three deliberate tradeoffs

Two layers cost more eval budget than one. Running per-step and chain-level rubrics together is more LLM-judge calls than one faithfulness score. The payoff is debuggable regressions and a five-minute bisect. Future AGI’s classifier-backed evals run at lower per-eval cost than Galileo Luna-2, which makes weekly full-dataset reruns the default.
CustomLLMJudge rubrics add maintenance. LCEL attribution and memory consistency rubrics need recalibration as the chain composition evolves. The payoff: they are the only signal that separates a reformulator bug from a retriever bug.
AggregationStrategy.ALL guardrails cost latency. Strict aggregation runs every template before the answer ships, adding 100 to 400 ms. Worth it for compliance paths. Optional where MAJORITY is the right operating point.

How Future AGI ships this

Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals on the two LangChain layers. Graduate to the Platform when you want self-improving rubrics against your live traces.

ai-evaluation SDK (Apache 2.0): per-step layer ContextRelevance (9), ChunkAttribution (11), ChunkUtilization (12); chain layer Groundedness (47), ContextAdherence (5), Completeness (10), AnswerRefusal (88), FactualAccuracy (66); agent overlay EvaluateFunctionCalling (98), TaskCompletion (99). 50+ total templates, CustomLLMJudge for LCEL attribution and memory consistency, 13 guardrail backends (9 open-weight, 4 API), 8 sub-10ms Scanners, four distributed runners (Celery, Ray, Temporal, Kubernetes).
traceAI (Apache 2.0): LangChainInstrumentor().instrument(...) covers every runnable, retriever, LLM, tool, agent, and LangGraph node. 50+ AI surfaces across Python, TypeScript, Java, C#; pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY); 14 span kinds including first-class RETRIEVER, RERANKER, EMBEDDING, AGENT.
Future AGI Platform: self-improving evaluators tuned by thumbs-up/down feedback from production traces; in-product agent authors LangChain-specific rubrics from natural language; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
Error Feed (inside the eval stack): HDBSCAN clustering groups failing traces, Sonnet 4.5 Judge writes the immediate_fix, representative traces promote into the golden set. Linear OAuth today; Slack, GitHub, Jira, PagerDuty on roadmap.
agent-opt (Apache 2.0): six optimisers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) for tuning synthesizer, reformulator, and tool-selection prompts.
Agent Command Center: Go binary self-hosts in your VPC for the LLM calls underneath every LangChain runnable. 100+ providers, 18+ built-in guardrail scanners, exact + semantic caching; SOC 2 Type II, HIPAA, GDPR, CCPA certified (ISO/IEC 27001 in active audit).

Ready to evaluate your first LangChain RAG app? Wire ContextRelevance and ChunkAttribution on the retriever, Groundedness and Completeness on the chain output, and EvaluateFunctionCalling if RAG sits inside an AgentExecutor, into a pytest fixture this afternoon against the ai-evaluation SDK. Add LangChainInstrumentor when production traces ask questions the CI gate missed. The two-layer loop pays for itself the first time it catches the regression that would have shipped.

Frequently asked questions

Why does LangChain RAG need a different evaluation approach from generic RAG eval?

LangChain ships a chain framework, not one RAG pattern. The same app might use a flat `RetrievalQA`, a `create_history_aware_retriever` + `create_retrieval_chain` composition, an `AgentExecutor` that calls RAG as one tool, or a LangGraph state machine with retrieval as one node. Each pattern emits a different span tree and fails differently. LCEL composition makes the failure surface bigger because runnables nest arbitrarily deep, and a metric that scores the final answer rolls every runnable's contribution into one number. LangChain-aware eval separates per-step rubrics on the retriever and rerank from chain-level rubrics on the LCEL output, then attributes each score back to the runnable that produced the span. The first catches the bug; the second confirms the fix.

What is the right rubric split for a LangChain RAG pipeline?

Two rubric layers, scored separately. Per-step rubrics on the retriever (`ContextRelevance`, `ChunkAttribution`, `ChunkUtilization`, plus Recall@k and MRR on a labelled probe set) and on the rerank (the delta in `ContextRelevance` from pre- to post-rerank tells you whether the reranker actually improved the ordering). Chain-level rubrics on the final LCEL output (`Groundedness`, `ContextAdherence`, `Completeness`, `AnswerRefusal`, `FactualAccuracy`). For agent-RAG, layer `EvaluateFunctionCalling` and `TaskCompletion` on top of both. Score them as different runs through the same `Evaluator` so a regression points at the layer that moved instead of one aggregate number that hides which step broke.

How does traceAI instrument LangChain?

One call. `LangChainInstrumentor().instrument(tracer_provider=trace_provider)` hooks LangChain's callback manager and emits OpenTelemetry spans for every runnable, retriever, LLM, tool, and agent invocation. Each span carries the right `fi.span.kind` (`CHAIN`, `RETRIEVER`, `LLM`, `TOOL`, `AGENT`), the LCEL component name, the input and output payloads, and retrieval metadata (documents, similarity scores, source). For LangGraph the bundled instrumentor adds superstep-level metadata (`langgraph.superstep.node_count`, nodes_executed, state diffs) so a failure attributes to the specific graph node that produced it rather than to the whole graph. The same instrumentor covers `RetrievalQA`, `ConversationalRetrievalChain`, the modern `create_retrieval_chain` composition, `AgentExecutor`, and compiled LangGraph state machines.

How big should a LangChain RAG golden set be?

Plan for 200 to 500 query and expected-answer pairs as the working baseline, sampled from production traces once instrumentation is live. If the app uses a `ConversationalRetrievalChain` or any history-aware composition, add 100 multi-turn dialogues that probe memory and reformulation. If RAG is one tool inside an agent, add 100 examples where another tool is the right call and 100 where RAG is, so the eval can score the tool-selection layer. Cover happy-path queries, the hardest 10 percent of historical failures, and three to five edge cases unique to the LangChain primitive the app actually uses. Promote new failing production traces into the set every week through Error Feed; the dataset ratchets stronger as the app ages.

How do per-step and chain-level rubrics differ in CI?

Per-step rubrics gate the runnable that produced the span. `ContextRelevance` on the retriever, `ChunkUtilization` on the retrieval output, and the IR metrics on a labelled probe set fail a PR that regresses retrieval quality even if the synthesizer covers for it. Chain-level rubrics gate the LCEL output. `Groundedness` and `Completeness` on the final answer fail a PR that ships a wrong answer even if every per-step rubric passes individually (because the synthesizer dropped the citation, the reformulator built the wrong standalone query, or the merge step lost a constraint). Most teams ship at the chain level only and discover the per-step gaps in production. Run both in CI; budget the per-step layer on every push, the chain-level layer on every merge.

How does traceAI handle LangChain callbacks and async chains?

The LangChain instrumentor hooks the same callback manager LangChain uses internally for `LCEL`. Sync chains, async chains (`ainvoke`, `astream`), batched chains (`batch`, `abatch`), and streamed chains all emit spans through the same callback surface. Span context propagates across `RunnableParallel`, `RunnableLambda`, and `RunnableSequence` so parallel branches produce separate child spans with their own retriever and LLM children. For custom callbacks the user has already written, the instrumentor coexists rather than replaces: traceAI emits OpenTelemetry spans while the user's callback handler keeps doing whatever it did before. No callback rewrites required.

How does the Error Feed cluster LangChain failures?

Failing traces are embedded over their span attributes and soft-clustered with HDBSCAN over ClickHouse-stored embeddings. A Sonnet 4.5 Judge agent on Bedrock (30-turn budget, eight span-tools, Haiku Chauffeur for spans over 3000 characters, prompt-cache hit ratio near 90 percent) reads each failing trace, writes the RCA, evidence quotes, an `immediate_fix`, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, each 1-to-5). For LangChain RAG the common cluster names look like 'history-aware reformulator builds a bad standalone query on turn 3,' 'agent calls web-search when the RAG tool was the right call,' or 'reranker downranks the correct doc.' Each named cluster ships as a Linear issue today (Slack, GitHub, Jira, PagerDuty on the roadmap) and the fix recommendations feed back into the Platform's self-improving evaluators.

What does Future AGI ship for LangChain RAG eval today, and what is roadmap?

Shipping today: traceAI's `LangChainInstrumentor` (Apache 2.0) covering every major chain type, runnable, retriever, tool, agent, and LangGraph node; the `ai-evaluation` SDK (Apache 2.0) with the RAG-relevant templates (`ContextRelevance`, `ChunkAttribution`, `ChunkUtilization`, `Groundedness`, `ContextAdherence`, `Completeness`, `AnswerRefusal`, `FactualAccuracy`), plus `EvaluateFunctionCalling` and `TaskCompletion` for agent flows, 50+ total templates, 13 guardrail backends (9 open-weight, 4 API), 8 sub-10ms Scanners, four distributed runners (Celery, Ray, Temporal, Kubernetes); `CustomLLMJudge` for LangChain-specific rubrics (LCEL component attribution, memory consistency, chain composition correctness); Error Feed inside the eval stack via HDBSCAN on ClickHouse-stored span embeddings; Linear OAuth wired as the issue sink. Roadmap: Slack, GitHub, Jira, PagerDuty integrations for Error Feed; the trace-stream-to-agent-opt connector that auto-promotes high-signal production traces into optimization datasets is in flight.
