Evaluating Haystack RAG Pipelines in 2026
Haystack Pipelines are component DAGs, not black boxes. Per-component rubrics on Retriever, Ranker, Generator + pipeline-level Groundedness.
Haystack 2.x makes RAG easy to compose and easy to misjudge. Five lines wire SentenceTransformersTextEmbedder to InMemoryEmbeddingRetriever, a PromptBuilder, and an OpenAIGenerator, and the first hundred queries pass. Then someone drops a TransformersSimilarityRanker into the DAG, the ranker downranks the correct passage on multi-hop queries, the prompt template silently truncates context, and the generator ships back a confident answer grounded in the wrong evidence. A flat faithfulness score on the final reply tells you the answer is wrong. It does not tell you which Component broke.
Haystack is the most component-decomposable RAG framework. Every node in a Pipeline is a discrete Component with named input and output sockets, a class the runtime can introspect, and a clean boundary the framework maintains for you. The opinion this post earns: per-component rubrics on the Retriever, Ranker, and Generator, plus a pipeline-level Groundedness check on the rendered PromptBuilder output, is the methodology that matches the architecture. Stop treating Pipeline.run() like a black box. Score each component on its own success criterion and attribute every regression back to the span the framework already drew for you.
This guide walks through the two rubric layers, the Future AGI templates that map to each, the HaystackInstrumentor that connects offline rubrics to live spans, the CI gate, and the Error Feed that closes the loop back to the dataset. The code is shaped against the ai-evaluation SDK and uses the exact EvalTemplate IDs the SDK ships.
TL;DR: the Haystack rubric set
| Layer | Haystack surface | Rubric set |
|---|---|---|
| Per-component (retrieve) | InMemoryEmbeddingRetriever, BM25Retriever, WebSearch, DocumentJoiner | ContextRelevance (9), ChunkAttribution (11), ChunkUtilization (12), Recall@k, MRR, nDCG |
| Per-component (rerank) | TransformersSimilarityRanker, CohereRanker, MetaFieldRanker | Delta ContextRelevance (pre vs post rank), top-3 recall preservation |
| Per-component (generate) | OpenAIGenerator, AnthropicGenerator, HuggingFaceGenerator | Groundedness (47), ContextAdherence (5), AnswerRefusal (88) on the generator’s reply against its retrieved context |
| Pipeline-level | Rendered PromptBuilder output, final Generator.replies[0] | Groundedness, ContextAdherence, Completeness (10), AnswerRefusal, FactualAccuracy (66) |
| Tool overlay | ToolInvoker, function-calling Generator flows | EvaluateFunctionCalling (98), TaskCompletion (99), per-step Groundedness |
| Bridge | traceAI HaystackInstrumentor | Same rubrics as EvalTag span-attached scorers; RETRIEVER, RERANKER, EMBEDDING, LLM span kinds |
Haystack ships built-in evaluators (SASEvaluator, LLMEvaluator, ContextRelevanceEvaluator, FaithfulnessEvaluator) that give a useful floor for prototyping. They score the final answer with one rubric per call and do not attribute scores back to the component that produced the span. The two-layer set above does, and it runs in CI and as production guardrails on the same Evaluator API.
Why Haystack’s Pipeline architecture changes RAG eval
Most RAG frameworks make you reconstruct component boundaries after the fact. LangChain runnables nest arbitrarily; LlamaIndex composes engines that compose engines. Haystack does the opposite. A Pipeline is a DAG with explicitly declared edges between Component instances, and the runtime walks that graph through Pipeline._run_component. The framework knows every component’s name, class, input sockets, and output sockets at run time. The trace boundary is a free side effect.
That makes four eval moves natural that are awkward elsewhere.
Score the retriever in isolation, on its own output socket. InMemoryEmbeddingRetriever.run returns a documents socket; score ContextRelevance against that socket directly. Retriever regressions show up before any LLM cost.
Score the ranker with a before-and-after diff on the same docs. Pull the retriever’s documents socket, pull the ranker’s documents socket, diff ContextRelevance in the top-k. A ranker that does not move relevant docs up is dead weight.
Score the generator against its actual rendered context, not the original query. The PromptBuilder renders a Jinja template against the retriever’s docs; the LLM span’s input is the rendered prompt. Score Groundedness against what the prompt actually carried, after any token-budget truncation.
Score the joiner when hybrid retrieval is in play. A DocumentJoiner fusing BM25 and embedding retrievers can silently overweight one side. Two retrievers with passing per-component scores plus a joiner with no rubric is the classic configuration where production drift hides until a quarterly review.
Each move is one rubric on one span the framework already emitted. The four together turn “the answer is wrong” into “the ranker downranked the correct doc on multi-hop queries for the de_DE locale” in five minutes of bisect.
Layer 1: per-component rubrics
Per-component rubrics gate each Component against its own success criterion. They are the upstream signal. If the retriever or the ranker regressed, every pipeline-level score follows, and the bisect should start here.
Three Future AGI templates do the work on the retriever. ContextRelevance (eval_id 9) scores whether each retrieved chunk is on-topic for the query, catching the classic “vector similarity surfaced a lexical neighbour” failure. ChunkAttribution (eval_id 11) scores whether each cited chunk actually supports the cited claim. ChunkUtilization (eval_id 12) scores how much of the retrieved context the generator actually used; sustained ChunkUtilization below 30 percent says top_k is too high or the ranker is misconfigured. Pair these with IR-style rubrics on a labelled probe set; the local python/fi/evals/metrics/rag/retrieval/ package ships recall_at_k, precision_at_k, mrr, ndcg, and context_recall as deterministic metrics with no API call.
from fi.evals import Evaluator
from fi.evals.templates import ContextRelevance, ChunkAttribution, ChunkUtilization
from fi.testcases import TestCase

ev = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

def score_retriever(query, retriever_output, generator_reply):
    docs = retriever_output["documents"]
    tc = TestCase(input=query, output=generator_reply,
                  context=[d.content for d in docs])
    return ev.evaluate(
        eval_templates=[ContextRelevance(), ChunkAttribution(), ChunkUtilization()],
        inputs=[tc],
    ).eval_results[0]
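The IR-style rubrics mentioned above are simple enough to sketch by hand. The block below is a minimal, standard-library illustration of Recall@k, MRR, and binary-gain nDCG on a labelled probe set. It is not the SDK's implementation; the function names and the example document IDs are assumptions chosen for illustration.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labelled relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def mrr(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant doc; 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary gains: 1 for a relevant doc, 0 otherwise."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical probe: doc IDs from a retriever's documents socket, plus gold labels.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, 3), mrr(ranked, relevant), ndcg_at_k(ranked, relevant, 3))
```

Because these metrics are deterministic and need no API call, they are cheap enough to run on every push.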
For the ranker, score ContextRelevance on the retriever’s documents socket before the ranker runs and on the ranker’s documents socket after. The delta is the rubric. A TransformersSimilarityRanker that does not improve ContextRelevance on average is dead weight; one that drops Recall@k on the top-3 has overfit to lexical features. Best rerankers for RAG (2026) walks the reranker call.
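One way to compute that delta, assuming you already have a per-chunk relevance score (for example from a ContextRelevance run) keyed by document ID. The helper names, the score dictionary, and the gold labels below are hypothetical; this is a sketch of the arithmetic, not an SDK call.

```python
from statistics import mean
from typing import Dict, Sequence, Set


def topk_mean_relevance(ranked_ids: Sequence[str], relevance: Dict[str, float], k: int) -> float:
    """Mean per-chunk relevance over the top k; unscored docs count as 0.0."""
    top = list(ranked_ids[:k])
    return mean(relevance.get(doc_id, 0.0) for doc_id in top) if top else 0.0


def ranker_delta(pre_ids, post_ids, relevance, gold: Set[str], k: int = 3) -> dict:
    """Compare the retriever's ordering (pre) with the ranker's ordering (post)."""
    pre_hits = set(pre_ids[:k]) & gold
    post_hits = set(post_ids[:k]) & gold
    return {
        # Positive means the ranker raised average relevance in the top k.
        "relevance_delta": topk_mean_relevance(post_ids, relevance, k)
        - topk_mean_relevance(pre_ids, relevance, k),
        # True when every gold doc that was in the top k before is still there.
        "recall_preserved": pre_hits <= post_hits,
    }


# Hypothetical example: the ranker promotes "d" into the top 3.
pre = ["a", "b", "c", "d"]
post = ["d", "a", "b", "c"]
relevance = {"a": 0.9, "b": 0.2, "c": 0.1, "d": 0.8}
result = ranker_delta(pre, post, relevance, gold={"a", "d"}, k=3)
print(result)
```

Averaged over the golden set, a relevance_delta near zero is the dead-weight signal, and a false recall_preserved on known-relevant docs is the regression signal.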
Two operating rules. Score per retriever, not per pipeline. Hybrid pipelines with both BM25Retriever and InMemoryEmbeddingRetriever need separate per-retriever recall before the DocumentJoiner fuses them, because the merge can hide a single-side regression. Set a per-rubric floor. Recall@k below 0.7 says the chunker is the problem before the retriever is, and advanced chunking techniques for RAG covers the upstream call.
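A small sketch of why the per-retriever split matters. The run names, document IDs, and the below_floor helper are made up for illustration; the 0.7 floor is the one suggested above.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labelled relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def per_retriever_recall(runs: Dict[str, Sequence[str]], relevant: Set[str], k: int) -> Dict[str, float]:
    """Score each named ranking (one per retriever, plus the joined list) separately."""
    return {name: recall_at_k(ids, relevant, k) for name, ids in runs.items()}


def below_floor(scores: Dict[str, float], floor: float = 0.7) -> List[str]:
    """Names of the rankings whose recall is under the floor."""
    return sorted(name for name, score in scores.items() if score < floor)


# Hypothetical hybrid run: the embedding side found nothing relevant,
# but the fused list still looks healthy because BM25 carried it.
runs = {
    "bm25": ["d1", "d9", "d4"],
    "embedding": ["d5", "d6", "d8"],
    "joined": ["d1", "d5", "d4"],
}
relevant = {"d1", "d4"}
scores = per_retriever_recall(runs, relevant, k=3)
print(scores, below_floor(scores))
```

Scoring only the joined list would report full recall here; the per-retriever view is what exposes the embedding-side regression.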
For the generator’s own span (the LLM call inside OpenAIGenerator.run), score Groundedness, ContextAdherence, and AnswerRefusal against the rendered prompt context, not the original retrieved set. The prompt template can truncate; the generator’s grounding is only as good as what the prompt actually carried.
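A rough way to find out what the prompt actually carried is to check which retrieved chunks survive verbatim in the rendered prompt string. The sketch below assumes the template inserts each document's content unmodified; if your template reformats or summarizes chunks, a substring check will undercount. The function name and sample strings are hypothetical.

```python
from typing import List


def surviving_context(rendered_prompt: str, doc_contents: List[str]) -> List[str]:
    """Return the chunks whose full text is still present in the rendered prompt."""
    return [content for content in doc_contents if content in rendered_prompt]


# Hypothetical example: the second chunk was cut off by a token budget.
docs = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]
prompt = "Context:\nParis is the capital of France.\nBerlin is the cap"
kept = surviving_context(prompt, docs)
print(f"{len(kept)}/{len(docs)} chunks reached the generator intact")
```

Passing only the surviving chunks as the context for Groundedness keeps the generator's score tied to the evidence it could actually see.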
Layer 2: pipeline-level rubrics on the final reply
Pipeline-level rubrics gate what the user actually sees. Five templates cover the answer surface.
Groundedness (eval_id 47) scores whether every claim is supported by retrieved context. The hallucination check. ContextAdherence (eval_id 5) catches the inverse: the generator drifting into model priors instead of staying close to context. Both matter when the pipeline carries chat history into the prompt. Completeness (eval_id 10) scores whether the answer covers the question and the relevant evidence. AnswerRefusal (eval_id 88) catches the over-cautious refusal when evidence was right there. FactualAccuracy (eval_id 66) scores world-knowledge claims independent of retrieval; useful when the generator pads with model-prior trivia.
from fi.evals.templates import (
    Groundedness, ContextAdherence, Completeness, AnswerRefusal, FactualAccuracy,
)

# Reuses `ev` and `TestCase` from the retriever example above.
# pipeline_result must include the retriever's output: call
# pipeline.run(..., include_outputs_from={"retriever"}), because Haystack
# otherwise returns only outputs that no downstream component consumed.
def score_pipeline_reply(query, pipeline_result, expected_answer=None):
    docs = pipeline_result["retriever"]["documents"]
    reply = pipeline_result["llm"]["replies"][0]
    tc = TestCase(
        input=query, output=reply,
        context=[d.content for d in docs],
        expected_output=expected_answer,
    )
    return ev.evaluate(
        eval_templates=[Groundedness(), ContextAdherence(), Completeness(),
                        AnswerRefusal(), FactualAccuracy()],
        inputs=[tc],
    ).eval_results[0]
The pairing rule. Per-component failures with passing pipeline scores mean the generator covered for an upstream bug; the next harder query will fail. Pipeline failures with passing per-component scores mean the PromptBuilder Jinja loop truncated context under token pressure, the DocumentJoiner lost a side, or the generator drifted into priors. Same templates, two runs, two attribution paths.
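The pairing rule reduces to a small decision table. The sketch below encodes it under the assumption that each rubric has already been collapsed to a pass/fail flag against its floor; the label strings and the suspect names are illustrative, not SDK output.

```python
from typing import Dict, List, Tuple


def attribute(component_pass: Dict[str, bool], pipeline_pass: bool) -> Tuple[str, List[str]]:
    """Map per-component and pipeline-level verdicts to a place to start the bisect."""
    failed = [name for name, ok in component_pass.items() if not ok]
    if failed and pipeline_pass:
        # The generator covered for an upstream bug; harder queries will fail.
        return ("masked_upstream_bug", failed)
    if failed:
        # Upstream regression dragged the pipeline score down with it.
        return ("upstream_regression", failed)
    if not pipeline_pass:
        # Components look fine, so suspect the glue between them.
        return ("glue_or_generator_drift",
                ["prompt_builder", "document_joiner", "generator_priors"])
    return ("healthy", [])


print(attribute({"retriever": True, "ranker": False}, pipeline_pass=True))
print(attribute({"retriever": True, "ranker": True}, pipeline_pass=False))
```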
For Haystack-specific patterns the standard templates do not cover, CustomLLMJudge writes the rubric in natural language and runs through the same Evaluator. Three are worth writing the week the matching primitive enters the pipeline. Ranker uplift: given retriever and ranker top-k, did the ranker move a known-relevant doc up (+1), leave it unchanged (0), or downrank it (-1). Joiner correctness: when two retrievers feed a DocumentJoiner, did the joiner keep the right docs from each side, or silently drop BM25 hits on rare-token queries. Branch routing: when the pipeline branches on query type, did the right branch fire for this query class.
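When the golden set labels a known-relevant document, the ranker uplift rubric has a deterministic counterpart that needs no judge. A minimal sketch, assuming document IDs are stable between the retriever and ranker outputs; the function name is hypothetical, and a document missing from a list is treated as ranked past the end of it.

```python
from typing import Sequence


def ranker_uplift(pre_ids: Sequence[str], post_ids: Sequence[str], relevant_id: str) -> int:
    """+1 if the ranker moved the relevant doc up, 0 if unchanged, -1 if it moved down or was dropped."""

    def position(ids: Sequence[str]) -> int:
        ids = list(ids)
        return ids.index(relevant_id) if relevant_id in ids else len(ids)

    pre_pos, post_pos = position(pre_ids), position(post_ids)
    if post_pos < pre_pos:
        return 1
    if post_pos > pre_pos:
        return -1
    return 0


# Hypothetical orderings from the retriever (pre) and the ranker (post).
pre = ["a", "b", "c"]
post = ["c", "a", "b"]
print(ranker_uplift(pre, post, "c"), ranker_uplift(pre, post, "a"), ranker_uplift(pre, pre, "b"))
```

The natural-language CustomLLMJudge version remains useful when no gold label exists and relevance has to be judged per query.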
For tool-using flows (Generator calling Haystack Tools through ToolInvoker), layer EvaluateFunctionCalling (98) and TaskCompletion (99) on top of the standard set. They catch the agent-shape failures: a wrong tool call grounded in the wrong evidence, or a correct trajectory that loops or quits early. Agent evaluation frameworks 2026 covers tool-trajectory rubrics across frameworks.
Instrumenting Haystack with traceAI
CI catches the regressions you can think of. Production catches the rest. The same two-layer rubric set should run as span-attached scorers against live Haystack traces, and that requires a tracer that understands the Pipeline graph.
traceAI (Apache 2.0) ships a HaystackInstrumentor that wraps haystack.Pipeline.run and Pipeline._run_component, then introspects each component class to assign the right fi.span.kind. _get_component_type maps *Embedder to EMBEDDING, *Retriever and *WebSearch to RETRIEVER, *Ranker to RERANKER, anything with Generator in the class name (or a replies output socket) to LLM, and detects PromptBuilder explicitly.
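The mapping described above can be pictured as a handful of class-name checks. The function below is an illustrative re-creation of that description, not the traceAI source: the name span_kind_for, the CHAIN label for PromptBuilder (taken from the trace shape described in this post), and the UNKNOWN fallback are all assumptions.

```python
from typing import Iterable


def span_kind_for(class_name: str, output_sockets: Iterable[str] = ()) -> str:
    """Guess a span kind from a component's class name and output socket names."""
    if class_name.endswith("Embedder"):
        return "EMBEDDING"
    if class_name.endswith(("Retriever", "WebSearch")):
        return "RETRIEVER"
    if class_name.endswith("Ranker"):
        return "RERANKER"
    if "Generator" in class_name or "replies" in set(output_sockets):
        return "LLM"
    if class_name == "PromptBuilder":
        return "CHAIN"  # assumption: the prompt build shows up as a CHAIN span
    return "UNKNOWN"  # assumption: placeholder for anything unmatched


for name in ["SentenceTransformersTextEmbedder", "InMemoryEmbeddingRetriever",
             "TransformersSimilarityRanker", "PromptBuilder", "OpenAIGenerator"]:
    print(name, "->", span_kind_for(name))
```

Note that InMemoryEmbeddingRetriever resolves to RETRIEVER: suffix matching means the "Embedding" in the middle of the name does not trigger the embedder rule.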
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_haystack import HaystackInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="haystack_rag_prod",
)
HaystackInstrumentor().instrument(tracer_provider=trace_provider)
After that one call, every Pipeline.run() produces a trace tree that mirrors the DAG. A basic RAG pipeline emits a root CHAIN span with EMBEDDING, RETRIEVER, CHAIN (for the prompt build), and LLM children. Add a TransformersSimilarityRanker and a RERANKER span appears with full input and output document lists, scores, and ranker model name. Add a ToolInvoker and the LLM span carries the tool-call payload with one child span per tool execution. Sync and async pipelines both flow through Pipeline.run, so the same wrapper catches both. Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY via register()) let the same instrumentation ingest into Phoenix or Traceloop without re-instrumenting. For the broader pattern, instrumenting your AI agent with traceAI covers it end-to-end.
Attach the same Evaluator rubrics as span-attached scorers via EvalTag; the verdict lives on the trace next to latency, model, and chunk IDs. Sample 5 to 10 percent of production traffic for LLM-judge rubrics, run IR metrics and ChunkUtilization on 100 percent, and alarm on a 2 to 5 point sustained drop in rolling-mean per rubric per component over 30 to 90 minutes.
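The alarm rule can be sketched as a rolling mean compared against a baseline. The version below makes several simplifying assumptions: scores are on a 0 to 1 scale (so a 3 point drop is 0.03), the window is a fixed number of observations rather than a 30 to 90 minute time span, and you keep one alarm instance per rubric per component. The class name and the example numbers are hypothetical.

```python
from collections import deque
from statistics import mean


class RollingDropAlarm:
    """Fires when the rolling mean sits at least max_drop below the baseline."""

    def __init__(self, baseline: float, window: int, max_drop: float):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data for a full window yet
        return self.baseline - mean(self.scores) >= self.max_drop


# Hypothetical stream of ContextRelevance scores on the RETRIEVER span.
alarm = RollingDropAlarm(baseline=0.85, window=4, max_drop=0.03)
stream = [0.86, 0.84, 0.85, 0.85, 0.78, 0.79, 0.80, 0.80]
fired = [alarm.observe(s) for s in stream]
print(fired)
```

Requiring a full window before firing is what makes the drop "sustained": a single bad trace moves the mean but does not page anyone.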
For the LLM calls underneath any Generator, point the component at the Future AGI gateway: OpenAIGenerator(model="gpt-4o-mini", api_base_url="https://gateway.futureagi.com/v1", ...). Every reply carries x-prism-cost, x-prism-latency-ms, x-prism-model-used, and x-prism-routing-strategy headers; traceAI captures them on the LLM span. Routing policy moves to the gateway; the pipeline definition stays portable.
The CI gate: per-component on push, pipeline-level on merge
Budget eval cost across two triggers. Per-component rubrics run on every push (cheap, deterministic for IR metrics, fast for LLM-judges scoring small contexts). Pipeline-level rubrics run on every merge to main, protecting the shipped artifact rather than blocking every push.
# tests/test_haystack_rag_eval.py
import pytest
from statistics import mean
from fi.evals import Evaluator
from fi.evals.templates import ContextRelevance, ChunkAttribution, ChunkUtilization
from fi.testcases import TestCase

ev = Evaluator()

# load_golden_set and the rag_pipeline fixture are project helpers (not shown):
# the loader yields (query, expected_answer) pairs, the fixture a built Pipeline.
@pytest.fixture(scope="module")
def golden_set():
    return load_golden_set("data/haystack_rag_golden.jsonl")

def test_retriever_floor(golden_set, rag_pipeline):
    cases = []
    for q, _ in golden_set:
        result = rag_pipeline.run(
            {"text_embedder": {"text": q}, "prompt_builder": {"question": q}},
            # Without this, the retriever's documents are consumed by the
            # PromptBuilder and never appear in the result dict.
            include_outputs_from={"retriever"},
        )
        cases.append(TestCase(
            input=q, output=result["llm"]["replies"][0],
            context=[d.content for d in result["retriever"]["documents"]],
        ))
    results = ev.evaluate(
        eval_templates=[ContextRelevance(), ChunkAttribution(), ChunkUtilization()],
        inputs=cases,
    )
    # Gates on the first metric of each result only.
    assert mean(r.metrics[0].value for r in results.eval_results) >= 0.80
The merge gate runs Groundedness, ContextAdherence, and Completeness on the same golden set with stricter thresholds (0.85 to 0.90 depending on domain risk) and posts the per-rubric delta from previous main as a PR comment. Evaluate RAG applications in CI/CD covers the regression-gate side in detail.
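The delta report is plain bookkeeping once the per-rubric means exist. A minimal sketch, assuming you persist the previous main's scores somewhere your CI job can read them; the function names, the comment layout, and the numbers are illustrative, and posting the text to the PR is left to whatever CI integration you already use.

```python
from typing import Dict, List


def failing_rubrics(current: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Rubrics whose current mean is under its merge-gate threshold."""
    return [name for name, floor in thresholds.items() if current.get(name, 0.0) < floor]


def delta_comment(current: Dict[str, float], previous: Dict[str, float]) -> str:
    """Render per-rubric scores with the signed change from the previous main."""
    lines = ["Rubric deltas vs previous main:"]
    for name in sorted(current):
        if name in previous:
            lines.append(f"- {name}: {current[name]:.2f} ({current[name] - previous[name]:+.2f})")
        else:
            lines.append(f"- {name}: {current[name]:.2f} (new)")
    return "\n".join(lines)


# Hypothetical scores for this merge and for the previous main.
current = {"Groundedness": 0.88, "ContextAdherence": 0.91, "Completeness": 0.83}
previous = {"Groundedness": 0.90, "ContextAdherence": 0.89, "Completeness": 0.85}
thresholds = {"Groundedness": 0.85, "ContextAdherence": 0.85, "Completeness": 0.85}

comment = delta_comment(current, previous)
failing = failing_rubrics(current, thresholds)
print(comment)
print("blocking:", failing)
```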
For cost control on high-volume pipelines, augment=True on the Evaluator cascades a cheap classifier first and only escalates to a frontier judge on uncertain inputs. On a 10,000-query batch this cuts judge cost by 5 to 10x without changing the score distribution. Classifier-backed evals at lower per-eval cost than Galileo Luna-2 make weekly full-dataset reruns the default.
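The cascade idea itself is easy to picture. The sketch below is a generic cheap-first, escalate-on-uncertainty pattern, not a description of how augment=True is implemented; the cascade_score function, the 0.8 confidence cutoff, and both scorer stubs are hypothetical.

```python
from typing import Callable, List, Tuple


def cascade_score(
    item: str,
    cheap: Callable[[str], Tuple[float, float]],  # returns (score, confidence)
    judge: Callable[[str], float],                # expensive scorer, returns score
    min_confidence: float = 0.8,
) -> Tuple[float, str]:
    """Use the cheap scorer when it is confident; otherwise escalate to the judge."""
    score, confidence = cheap(item)
    if confidence >= min_confidence:
        return score, "classifier"
    return judge(item), "judge"


# Stub scorers for illustration only.
def cheap(text: str) -> Tuple[float, float]:
    # Toy heuristic: short answers are "easy", long ones are "uncertain".
    return (1.0, 0.95) if len(text) < 20 else (0.5, 0.4)


judge_calls: List[str] = []


def judge(text: str) -> float:
    judge_calls.append(text)
    return 0.7


results = [cascade_score(t, cheap, judge)
           for t in ["short answer", "a much longer and more ambiguous answer"]]
print(results, "judge calls:", len(judge_calls))
```

The cost saving comes entirely from how often the cheap path is confident, so it is worth logging the escalation rate alongside the scores.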
For safety-critical retrieval (medical, legal, financial), wrap the same templates in the Guardrails API at request time. AggregationStrategy.MAJORITY for casual workloads, AggregationStrategy.ALL for compliance paths. Thirteen guardrail backends sit behind the API (nine open-weight including LLAMAGUARD_3_8B, QWEN3GUARD_8B, GRANITE_GUARDIAN_8B, SHIELDGEMMA_2B; four API). AI compliance guardrails for enterprise LLMs covers the model choices.
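To make the two operating points concrete, here is a toy aggregation over boolean verdicts from several guardrail backends. It assumes ALL means every backend must pass and MAJORITY means more than half must pass; those semantics, the Strategy enum, and the aggregate function are assumptions for illustration and are not the Guardrails API.

```python
from enum import Enum
from typing import Sequence


class Strategy(Enum):
    MAJORITY = "majority"
    ALL = "all"


def aggregate(verdicts: Sequence[bool], strategy: Strategy) -> bool:
    """Combine per-backend pass/fail verdicts into one allow/block decision."""
    if not verdicts:
        raise ValueError("need at least one guardrail verdict")
    if strategy is Strategy.ALL:
        return all(verdicts)
    # MAJORITY: strictly more than half of the backends must pass.
    return sum(verdicts) > len(verdicts) / 2


# Hypothetical verdicts from three backends on one request.
verdicts = [True, True, False]
print(aggregate(verdicts, Strategy.MAJORITY), aggregate(verdicts, Strategy.ALL))
```

Under these assumptions a single dissenting backend blocks the request on the compliance path but not on the casual one, which is the tradeoff between false blocks and missed violations.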
Production observability and the Error Feed
The closed loop is what makes a Haystack eval system compound. Without it, every incident produces a one-off fix and the team writes the same regression twice next quarter.
Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every failing trace into a named issue. A Sonnet 4.5 Judge on Bedrock (30-turn budget, eight span-tools, Haiku Chauffeur for spans over 3000 characters, prompt-cache hit near 90 percent) reads each failing trace and writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-to-5).
For Haystack RAG the cluster names show up like:
- “TransformersSimilarityRanker downranks correct doc below top-3 on multi-hop queries”
- “PromptBuilder Jinja loop truncates context when token budget creeps over 4096”
- “InMemoryEmbeddingRetriever misses domain synonyms after corpus refresh”
- “DocumentJoiner drops BM25 side on rare-token queries”
Each cluster ships as a Linear issue today (Slack, GitHub, Jira, PagerDuty on roadmap). Two patterns close the loop. The immediate_fix feeds the Platform’s self-improving evaluators so the rubric ages with the pipeline. Representative traces promote into the golden set under engineer sign-off; the next PR touching the offending component has to clear the new entries. The dataset ratchets stronger every week.
For prompt and template optimization, agent-opt ships six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer). Run them against the same rubric set; the loop earns its budget the first time it rewrites a PromptBuilder template that was silently truncating context.
Three deliberate tradeoffs
- Two layers cost more eval budget than one. Per-component plus pipeline-level rubrics is more LLM-judge calls than one faithfulness score. The payoff is a five-minute bisect that points at a specific Component. Classifier-backed evals at lower per-eval cost than Galileo Luna-2 make weekly full-dataset reruns the default.
- CustomLLMJudge rubrics for ranker uplift, joiner correctness, and branch routing add maintenance. Three extra rubrics to keep calibrated as the pipeline evolves. The lift is the only signal that separates a ranker regression from a retriever regression, or a joiner bug from a single-retriever bug. The moment a TransformersSimilarityRanker enters the DAG, write the uplift rubric the same week.
- AggregationStrategy.ALL guardrails cost latency. Strict aggregation adds 100 to 400 ms depending on rubric mix. Worth it for compliance-sensitive paths; MAJORITY is the right operating point elsewhere.
How Future AGI ships this
Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals on the two Haystack layers. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent against your live traces.
- ai-evaluation SDK (Apache 2.0): per-component layer ContextRelevance (9), ChunkAttribution (11), ChunkUtilization (12); pipeline layer Groundedness (47), ContextAdherence (5), Completeness (10), AnswerRefusal (88), FactualAccuracy (66); tool overlay EvaluateFunctionCalling (98), TaskCompletion (99). 50+ total templates, CustomLLMJudge for ranker uplift, joiner correctness, and branch routing; 13 guardrail backends (9 open-weight, 4 API), 8 sub-10ms Scanners, four distributed runners (Celery, Ray, Temporal, Kubernetes).
- traceAI (Apache 2.0): HaystackInstrumentor().instrument(...) covers every Haystack 2.9+ component with auto-detected EMBEDDING, RETRIEVER, RERANKER, LLM span kinds. 50+ AI surfaces across Python, TypeScript, Java, C#; pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY).
- Future AGI Platform: self-improving evaluators tuned by thumbs-up and thumbs-down feedback from production traces; in-product authoring agent writes Haystack-specific rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- Error Feed (inside the eval stack): HDBSCAN clustering groups failing traces; Sonnet 4.5 Judge writes the immediate_fix; representative traces promote into the golden set. Linear OAuth today; Slack, GitHub, Jira, PagerDuty on roadmap.
- Agent Command Center: 17 MB Go binary self-hosts in your VPC for the LLM calls underneath every Generator. 100+ providers, 18+ built-in guardrail scanners, exact and semantic caching; SOC 2 Type II, HIPAA, GDPR, CCPA certified (ISO/IEC 27001 in active audit).
Ready to evaluate your first Haystack RAG pipeline? Wire ContextRelevance on the retriever socket, ranker uplift on the RERANKER span, and Groundedness plus Completeness on replies[0] into a pytest fixture against the ai-evaluation SDK. Add HaystackInstrumentor when production traces start asking questions the CI gate missed. The per-component loop pays for itself the first time it points at the TransformersSimilarityRanker instead of blaming the OpenAIGenerator.
Related reading
- RAG Evaluation Metrics: A Deep Dive (2026)
- Evaluating LangChain RAG Applications (2026)
- Evaluating LlamaIndex RAG Applications (2026)
- Evaluate RAG Applications in CI/CD (2026)
- Best Rerankers for RAG (2026)
- Agent Evaluation Frameworks (2026)
- Instrument Your AI Agent with traceAI
- Advanced Chunking Techniques for RAG
- AI Compliance Guardrails for Enterprise LLMs