Evaluating Haystack RAG Pipelines in 2026

Haystack Pipelines are component DAGs, not black boxes. Per-component rubrics on Retriever, Ranker, Generator + pipeline-level Groundedness.

March 10, 2026

Updated May 20, 2026

12 min read

haystack rag llm-evaluation llm-observability traceAI 2026

Haystack 2.x makes RAG easy to compose and easy to misjudge. Five lines wire SentenceTransformersTextEmbedder to InMemoryEmbeddingRetriever, a PromptBuilder, and an OpenAIGenerator, and the first hundred queries pass. Then someone drops a TransformersSimilarityRanker into the DAG, the ranker downranks the correct passage on multi-hop queries, the prompt template silently truncates context, and the generator ships back a confident answer grounded in the wrong evidence. A flat faithfulness score on the final reply tells you the answer is wrong. It does not tell you which Component broke.

Haystack is the most component-decomposable RAG framework. Every node in a Pipeline is a discrete Component with named input and output sockets, a class the runtime can introspect, and a clean boundary the framework maintains for you. The opinion this post earns: per-component rubrics on the Retriever, Ranker, and Generator, plus a pipeline-level Groundedness check on the rendered PromptBuilder output, is the methodology that matches the architecture. Stop treating Pipeline.run() like a black box. Score each component on its own success criterion and attribute every regression back to the span the framework already drew for you.

This guide walks through the two rubric layers, the Future AGI templates that map to each, the HaystackInstrumentor that connects offline rubrics to live spans, the CI gate, and the Error Feed that closes the loop back to the dataset. The code is shaped against the ai-evaluation SDK and uses the exact EvalTemplate IDs the SDK ships.

TL;DR: the Haystack rubric set

| Layer | Haystack surface | Rubric set |
| --- | --- | --- |
| Per-component (retrieve) | `InMemoryEmbeddingRetriever`, `BM25Retriever`, `WebSearch`, `DocumentJoiner` | `ContextRelevance` (9), `ChunkAttribution` (11), `ChunkUtilization` (12), Recall@k, MRR, nDCG |
| Per-component (rerank) | `TransformersSimilarityRanker`, `CohereRanker`, `MetaFieldRanker` | Delta `ContextRelevance` (pre vs post rank), top-3 recall preservation |
| Per-component (generate) | `OpenAIGenerator`, `AnthropicGenerator`, `HuggingFaceGenerator` | `Groundedness` (47), `ContextAdherence` (5), `AnswerRefusal` (88) on the generator’s reply against its retrieved context |
| Pipeline-level | Rendered `PromptBuilder` output, final `Generator.replies[0]` | `Groundedness`, `ContextAdherence`, `Completeness` (10), `AnswerRefusal`, `FactualAccuracy` (66) |
| Tool overlay | `ToolInvoker`, function-calling `Generator` flows | `EvaluateFunctionCalling` (98), `TaskCompletion` (99), per-step `Groundedness` |
| Bridge | traceAI `HaystackInstrumentor` | Same rubrics as `EvalTag` span-attached scorers; `RETRIEVER`, `RERANKER`, `EMBEDDING`, `LLM` span kinds |

Haystack ships built-in evaluators (SASEvaluator, LLMEvaluator, ContextRelevanceEvaluator, FaithfulnessEvaluator) that give a useful floor for prototyping. They score the final answer with one rubric per call and do not attribute scores back to the component that produced the span. The two-layer set above does, and it runs in CI and as production guardrails on the same Evaluator API.

Why Haystack’s Pipeline architecture changes RAG eval

Most RAG frameworks make you reconstruct component boundaries after the fact. LangChain runnables nest arbitrarily; LlamaIndex composes engines that compose engines. Haystack does the opposite. A Pipeline is a DAG with explicitly declared edges between Component instances, and the runtime walks that graph through Pipeline._run_component. The framework knows every component’s name, class, input sockets, and output sockets at run time. The trace boundary is a free side effect.

That makes four eval moves natural that are awkward elsewhere.

Score the retriever in isolation, on its own output socket. InMemoryEmbeddingRetriever.run returns a documents socket; score ContextRelevance against that socket directly. Retriever regressions show up before any LLM cost.

Score the ranker with a before-and-after diff on the same docs. Pull the retriever’s documents socket, pull the ranker’s documents socket, diff ContextRelevance in the top-k. A ranker that does not move relevant docs up is dead weight.

Score the generator against its actual rendered context, not the original query. The PromptBuilder renders a Jinja template against the retriever’s docs; the LLM span’s input is the rendered prompt. Score Groundedness against what the prompt actually carried, after any token-budget truncation.

Score the joiner when hybrid retrieval is in play. A DocumentJoiner fusing BM25 and embedding retrievers can silently overweight one side. Two retrievers with passing per-component scores plus a joiner with no rubric is the classic configuration where production drift hides until a quarterly review.

Each move is one rubric on one span the framework already emitted. The four together turn “the answer is wrong” into “the ranker downranked the correct doc on multi-hop queries for the de_DE locale” in five minutes of bisect.

Layer 1: per-component rubrics

Per-component rubrics gate each Component against its own success criterion. They are the upstream signal. If the retriever or the ranker regressed, every pipeline-level score follows, and the bisect should start here.

Three Future AGI templates do the work on the retriever. ContextRelevance (eval_id 9) scores whether each retrieved chunk is on-topic for the query, catching the classic “vector similarity surfaced a lexical neighbour” failure. ChunkAttribution (eval_id 11) scores whether each cited chunk actually supports the cited claim. ChunkUtilization (eval_id 12) scores how much of the retrieved context the generator actually used; sustained ChunkUtilization below 30 percent says top_k is too high or the ranker is misconfigured. Pair these with IR-style rubrics on a labelled probe set; the local python/fi/evals/metrics/rag/retrieval/ package ships recall_at_k, precision_at_k, mrr, ndcg, and context_recall as deterministic metrics with no API call.

from fi.evals import Evaluator
from fi.evals.templates import ContextRelevance, ChunkAttribution, ChunkUtilization
from fi.testcases import TestCase

ev = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

def score_retriever(query, retriever_output, generator_reply):
    docs = retriever_output["documents"]
    tc = TestCase(input=query, output=generator_reply,
                  context=[d.content for d in docs])
    return ev.evaluate(
        eval_templates=[ContextRelevance(), ChunkAttribution(), ChunkUtilization()],
        inputs=[tc],
    ).eval_results[0]
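The IR-style rubrics named above (Recall@k, MRR, nDCG) are deterministic and small enough to sketch in plain Python. The block below is an illustration using binary relevance labels, not the SDK's own implementation; the function names and the sample document IDs are assumptions made for the example.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Share of the labelled relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-gain nDCG: discounted hits divided by the best achievable ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# One labelled probe query: d1 and d2 are the known-relevant documents (made-up IDs).
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, k=3))
print(reciprocal_rank(ranked, relevant))
print(round(ndcg_at_k(ranked, relevant, k=4), 3))
```

MRR is the mean of `reciprocal_rank` over every query in the probe set. Because none of this needs an API call, it can run on every push against the retriever's documents socket.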

For the ranker, score ContextRelevance on the retriever’s documents socket before the ranker runs and on the ranker’s documents socket after. The delta is the rubric. A TransformersSimilarityRanker that does not improve ContextRelevance on average is dead weight; one that drops Recall@k on the top-3 has overfit to lexical features. Best rerankers for RAG (2026) walks the reranker call.
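The before-and-after comparison can be sketched as two small functions. This is a minimal illustration that assumes per-document relevance scores (for example from a ContextRelevance run) are already available as a dict keyed by document ID; the function names, the IDs, and the scores are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence, Set


def ranker_delta(pre_ids: Sequence[str], post_ids: Sequence[str],
                 relevance: Dict[str, float], k: int = 3) -> float:
    """Mean relevance of the top k after ranking minus the same mean before."""
    def top_k_mean(ids: Sequence[str]) -> float:
        top = ids[:k]
        return mean(relevance.get(doc_id, 0.0) for doc_id in top) if top else 0.0
    return top_k_mean(post_ids) - top_k_mean(pre_ids)


def top_k_recall_preserved(pre_ids: Sequence[str], post_ids: Sequence[str],
                           relevant_ids: Set[str], k: int = 3) -> bool:
    """True if every relevant doc in the pre-rank top k is still in the post-rank top k."""
    before = set(pre_ids[:k]) & relevant_ids
    after = set(post_ids[:k]) & relevant_ids
    return before <= after


# Made-up scores: the ranker promotes "d" into the top 3 and keeps "a" there.
relevance = {"a": 0.9, "b": 0.2, "c": 0.1, "d": 0.8}
pre = ["a", "b", "c", "d"]    # retriever order
post = ["d", "a", "b", "c"]   # ranker order
print(round(ranker_delta(pre, post, relevance), 3))
print(top_k_recall_preserved(pre, post, {"a", "d"}))
```

A positive delta averaged over the golden set is the evidence that the ranker earns its latency; a `False` from the preservation check on a labelled query is the case worth opening in the trace.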

Two operating rules. First, score per retriever, not per pipeline: hybrid pipelines with both BM25Retriever and InMemoryEmbeddingRetriever need separate per-retriever recall before the DocumentJoiner fuses them, because the merge can hide a single-side regression. Second, set a per-rubric floor: Recall@k below 0.7 says the chunker is the problem before the retriever is, and advanced chunking techniques for RAG covers the upstream call.
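To see why the merge can hide a single-side regression, here is a small sketch under stated assumptions: the fused list is a plain order-preserving concatenation standing in for whatever join mode the DocumentJoiner is configured with, and the side names, document IDs, and the `sides_below_floor` helper are all hypothetical.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Share of the labelled relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def sides_below_floor(side_results: Dict[str, Sequence[str]],
                      relevant_ids: Set[str], k: int, floor: float = 0.7) -> List[str]:
    """Names of the retrievers whose own recall@k is under the floor."""
    return [name for name, ids in side_results.items()
            if recall_at_k(ids, relevant_ids, k) < floor]


sides = {
    "bm25": ["x1", "x2", "x3"],        # found nothing relevant
    "embedding": ["r1", "r2", "x9"],   # found both relevant docs
}
relevant = {"r1", "r2"}

# Stand-in for the joiner: concatenate, then drop duplicates while keeping order.
fused = list(dict.fromkeys(sides["embedding"] + sides["bm25"]))

print(recall_at_k(fused, relevant, k=3))        # fused recall looks healthy
print(sides_below_floor(sides, relevant, k=3))  # the BM25 side is flagged
```

The fused list passes while one side contributes nothing, which is exactly the regression a pipeline-only recall number would miss.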

For the generator’s own span (the LLM call inside OpenAIGenerator.run), score Groundedness, ContextAdherence, and AnswerRefusal against the rendered prompt context, not the original retrieved set. The prompt template can truncate; the generator’s grounding is only as good as what the prompt actually carried.
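One way to find out what the prompt actually carried is to check which retrieved documents survive verbatim in the rendered prompt string. The sketch below assumes the template inserts each document's content unchanged and that truncation cuts text off rather than paraphrasing it; `split_by_carried` and the sample strings are hypothetical.

```python
from typing import Dict, List, Sequence


def split_by_carried(rendered_prompt: str, doc_contents: Sequence[str]) -> Dict[str, List[str]]:
    """Partition retrieved documents by whether their full text is in the rendered prompt."""
    carried: List[str] = []
    dropped: List[str] = []
    for content in doc_contents:
        if content and content in rendered_prompt:
            carried.append(content)
        else:
            dropped.append(content)
    return {"carried": carried, "dropped": dropped}


docs = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]
# Simulate a token budget that cuts the second document mid-sentence.
prompt = "Context:\n" + docs[0] + "\n" + docs[1][:10] + "\nQuestion: What is the capital of Germany?"

result = split_by_carried(prompt, docs)
print(result["carried"])
print(result["dropped"])
```

Pass only the `carried` list as the test case context when scoring the generator span; a non-empty `dropped` list is itself a signal worth logging next to the Groundedness verdict.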

Layer 2: pipeline-level rubrics on the final reply

Pipeline-level rubrics gate what the user actually sees. Five templates cover the answer surface.

Groundedness (eval_id 47) scores whether every claim is supported by retrieved context. The hallucination check. ContextAdherence (eval_id 5) catches the inverse: the generator drifting into model priors instead of staying close to context. Both matter when the pipeline carries chat history into the prompt. Completeness (eval_id 10) scores whether the answer covers the question and the relevant evidence. AnswerRefusal (eval_id 88) catches the over-cautious refusal when evidence was right there. FactualAccuracy (eval_id 66) scores world-knowledge claims independent of retrieval; useful when the generator pads with model-prior trivia.

from fi.evals.templates import (
    Groundedness, ContextAdherence, Completeness, AnswerRefusal, FactualAccuracy,
)

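def score_pipeline_reply(query, pipeline_result, expected_answer=None):
    # Pipeline.run() returns only leaf-component outputs by default; call it with
    # include_outputs_from={"retriever"} so the retriever's documents are present.
    docs = pipeline_result["retriever"]["documents"]
    reply = pipeline_result["llm"]["replies"][0]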
    tc = TestCase(
        input=query, output=reply,
        context=[d.content for d in docs],
        expected_output=expected_answer,
    )
    return ev.evaluate(
        eval_templates=[Groundedness(), ContextAdherence(), Completeness(),
                        AnswerRefusal(), FactualAccuracy()],
        inputs=[tc],
    ).eval_results[0]

The pairing rule. Per-component failures with passing pipeline scores mean the generator covered for an upstream bug; the next harder query will fail. Pipeline failures with passing per-component scores mean the PromptBuilder Jinja loop truncated context under token pressure, the DocumentJoiner lost a side, or the generator drifted into priors. Same templates, two runs, two attribution paths.
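The pairing rule reads naturally as a small decision table. The sketch below is one possible encoding; the `attribute` function, its return strings, and the component names are assumptions for illustration, and the dict is expected to list components in upstream-to-downstream order.

```python
from typing import Dict


def attribute(component_pass: Dict[str, bool], pipeline_pass: bool) -> str:
    """Map the two rubric layers' verdicts to the next debugging step."""
    failed = [name for name, ok in component_pass.items() if not ok]
    if failed and pipeline_pass:
        # The generator covered for an upstream bug this time.
        return "masked upstream bug in: " + ", ".join(failed)
    if not failed and not pipeline_pass:
        # Every component passed alone, so look at the seams between them.
        return "check prompt truncation, joiner, or generator drift"
    if failed:
        # Both layers failed: start at the most upstream failing component.
        return "start bisect at: " + failed[0]
    return "ok"


print(attribute({"retriever": True, "ranker": False, "generator": True}, pipeline_pass=True))
print(attribute({"retriever": True, "ranker": True, "generator": True}, pipeline_pass=False))
```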

For Haystack-specific patterns the standard templates do not cover, CustomLLMJudge writes the rubric in natural language and runs through the same Evaluator. Three are worth writing the week the matching primitive enters the pipeline. Ranker uplift: given retriever and ranker top-k, did the ranker move a known-relevant doc up (+1), leave it unchanged (0), or downrank it (-1). Joiner correctness: when two retrievers feed a DocumentJoiner, did the joiner keep the right docs from each side, or silently drop BM25 hits on rare-token queries. Branch routing: when the pipeline branches on query type, did the right branch fire for this query class.
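When the golden set labels one known-relevant document per query, the ranker-uplift rule has a deterministic form that is handy for calibrating the LLM judge against. This is a sketch of that labelled variant, not the CustomLLMJudge rubric itself; `ranker_uplift` and the document IDs are hypothetical.

```python
from typing import Sequence


def ranker_uplift(pre_ids: Sequence[str], post_ids: Sequence[str], relevant_id: str) -> int:
    """+1 if the ranker moved the relevant doc up, 0 if unchanged, -1 if it moved down."""
    # A doc missing from a list ranks worse than any position present in either list.
    missing = max(len(pre_ids), len(post_ids))
    pre_rank = pre_ids.index(relevant_id) if relevant_id in pre_ids else missing
    post_rank = post_ids.index(relevant_id) if relevant_id in post_ids else missing
    if post_rank < pre_rank:
        return 1
    if post_rank > pre_rank:
        return -1
    return 0


print(ranker_uplift(["a", "b", "c"], ["c", "a", "b"], "c"))  # moved from 3rd to 1st
print(ranker_uplift(["a", "b", "c"], ["a", "b"], "c"))       # cut by the ranker's top_k
```

Treating a document that the ranker's cutoff removed as a downrank is a design choice: from the generator's point of view, a dropped document and a buried one are the same failure.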

For tool-using flows (Generator calling Haystack Tools through ToolInvoker), layer EvaluateFunctionCalling (98) and TaskCompletion (99) on top of the standard set. They catch the agent-shape failures: a wrong tool call grounded in the wrong evidence, or a correct trajectory that loops or quits early. Agent evaluation frameworks 2026 covers tool-trajectory rubrics across frameworks.

Instrumenting Haystack with traceAI

CI catches the regressions you can think of. Production catches the rest. The same two-layer rubric set should run as span-attached scorers against live Haystack traces, and that requires a tracer that understands the Pipeline graph.

traceAI (Apache 2.0) ships a HaystackInstrumentor that wraps haystack.Pipeline.run and Pipeline._run_component, then introspects each component class to assign the right fi.span.kind. _get_component_type maps *Embedder to EMBEDDING, *Retriever and *WebSearch to RETRIEVER, *Ranker to RERANKER, anything with Generator in the class name (or a replies output socket) to LLM, and detects PromptBuilder explicitly.
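The class-name mapping described above can be pictured with a few lines of Python. This is an illustrative reconstruction, not traceAI's actual `_get_component_type`: the function name, the order of the checks, and the `CHAIN` fallback for unrecognised components are assumptions.

```python
from typing import Iterable


def guess_span_kind(class_name: str, output_sockets: Iterable[str] = ()) -> str:
    """Assign a span kind from a component's class name and output socket names."""
    if class_name == "PromptBuilder":
        return "CHAIN"          # the prompt build shows up as a CHAIN span
    if class_name.endswith("Embedder"):
        return "EMBEDDING"
    if class_name.endswith("Retriever") or class_name.endswith("WebSearch"):
        return "RETRIEVER"
    if class_name.endswith("Ranker"):
        return "RERANKER"
    if "Generator" in class_name or "replies" in set(output_sockets):
        return "LLM"
    return "CHAIN"              # assumed fallback for anything else


for name in ("SentenceTransformersTextEmbedder", "InMemoryEmbeddingRetriever",
             "TransformersSimilarityRanker", "OpenAIGenerator"):
    print(name, "->", guess_span_kind(name))
```

The practical consequence of name-based detection is that a custom component should either follow the naming convention or expose a `replies` socket if it is meant to be scored as an LLM span.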

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_haystack import HaystackInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="haystack_rag_prod",
)
HaystackInstrumentor().instrument(tracer_provider=trace_provider)

After that one call, every Pipeline.run() produces a trace tree that mirrors the DAG. A basic RAG pipeline emits a root CHAIN span with EMBEDDING, RETRIEVER, CHAIN (for the prompt build), and LLM children. Add a TransformersSimilarityRanker and a RERANKER span appears with full input and output document lists, scores, and ranker model name. Add a ToolInvoker and the LLM span carries the tool-call payload with one child span per tool execution. Sync and async pipelines both flow through Pipeline.run, so the same wrapper catches both. Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY via register()) let the same instrumentation ingest into Phoenix or Traceloop without re-instrumenting. For the broader pattern, instrumenting your AI agent with traceAI covers it end-to-end.

Attach the same Evaluator rubrics as span-attached scorers via EvalTag; the verdict lives on the trace next to latency, model, and chunk IDs. Sample 5 to 10 percent of production traffic for LLM-judge rubrics, run IR metrics and ChunkUtilization on 100 percent, and alarm on a 2 to 5 point sustained drop in rolling-mean per rubric per component over 30 to 90 minutes.
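The alarm rule can be sketched as a rolling mean compared with a fixed baseline. The version below is a simplification under stated assumptions: scores are on a 0 to 100 scale, the window is a count of recent scores rather than a 30 to 90 minute time window, and the `RollingDropAlarm` class is hypothetical rather than part of traceAI.

```python
from collections import deque
from statistics import mean


class RollingDropAlarm:
    """Fire when the rolling mean of a rubric falls a set number of points below baseline."""

    def __init__(self, baseline: float, window: int, drop_points: float) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.baseline = baseline
        self.drop_points = drop_points
        self.scores = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record one score; return True if the alarm condition holds."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data for a sustained signal yet
        return self.baseline - mean(self.scores) >= self.drop_points


# One alarm per (rubric, component) pair, e.g. ContextRelevance on the retriever.
alarm = RollingDropAlarm(baseline=85.0, window=3, drop_points=3.0)
for score in (84.0, 83.0, 80.0, 79.0):
    print(score, alarm.observe(score))
```

Requiring a full window before firing is what makes the drop "sustained": a single bad trace moves the mean but does not page anyone.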

For the LLM calls underneath any Generator, point the component at the Future AGI gateway: OpenAIGenerator(model="gpt-4o-mini", api_base_url="https://gateway.futureagi.com/v1", ...). Every reply carries x-prism-cost, x-prism-latency-ms, x-prism-model-used, and x-prism-routing-strategy headers; traceAI captures them on the LLM span. Routing policy moves to the gateway; the pipeline definition stays portable.

The CI gate: per-component on push, pipeline-level on merge

Budget eval cost across two triggers. Per-component rubrics run on every push (cheap, deterministic for IR metrics, fast for LLM-judges scoring small contexts). Pipeline-level rubrics run on every merge to main, protecting the shipped artifact rather than blocking every push.

# tests/test_haystack_rag_eval.py
import pytest
from statistics import mean
from fi.evals import Evaluator
from fi.evals.templates import ContextRelevance, ChunkAttribution, ChunkUtilization
from fi.testcases import TestCase

ev = Evaluator()

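@pytest.fixture(scope="module")
def golden_set():
    # load_golden_set is a project helper (not shown) that yields
    # (query, expected_answer) pairs.
    return load_golden_set("data/haystack_rag_golden.jsonl")

def test_retriever_floor(golden_set, rag_pipeline):
    # rag_pipeline is a fixture returning the built Pipeline; it is assumed
    # to be defined elsewhere, for example in conftest.py.
    cases = []
    for q, _ in golden_set:
        # include_outputs_from keeps the retriever's documents in the result;
        # by default Pipeline.run() returns only leaf-component outputs.
        result = rag_pipeline.run(
            {"text_embedder": {"text": q}, "prompt_builder": {"question": q}},
            include_outputs_from={"retriever"},
        )
        cases.append(TestCase(
            input=q, output=result["llm"]["replies"][0],
            context=[d.content for d in result["retriever"]["documents"]],
        ))
    results = ev.evaluate(
        eval_templates=[ContextRelevance(), ChunkAttribution(), ChunkUtilization()],
        inputs=cases,
    )
    assert mean(r.metrics[0].value for r in results.eval_results) >= 0.80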

The merge gate runs Groundedness, ContextAdherence, and Completeness on the same golden set with stricter thresholds (0.85 to 0.90 depending on domain risk) and posts the per-rubric delta from previous main as a PR comment. Evaluate RAG applications in CI/CD covers the regression-gate side in detail.
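The merge gate's two outputs, a pass or fail and a per-rubric delta for the PR comment, can be sketched without any SDK calls. The `merge_gate` function, the score dictionaries, and the comment format are hypothetical; the scores would come from the Evaluator run on the golden set and from the stored results of the previous main build.

```python
from typing import Dict, Tuple


def merge_gate(current: Dict[str, float], previous: Dict[str, float],
               thresholds: Dict[str, float]) -> Tuple[bool, str]:
    """Check each rubric against its floor and format the delta from previous main."""
    passed = True
    lines = []
    for rubric, floor in thresholds.items():
        score = current[rubric]
        delta = score - previous.get(rubric, score)  # no baseline means zero delta
        ok = score >= floor
        passed = passed and ok
        lines.append(f"{rubric}: {score:.2f} ({delta:+.2f} vs main) {'PASS' if ok else 'FAIL'}")
    return passed, "\n".join(lines)


ok, comment = merge_gate(
    current={"Groundedness": 0.88, "Completeness": 0.84},
    previous={"Groundedness": 0.90, "Completeness": 0.80},
    thresholds={"Groundedness": 0.85, "Completeness": 0.85},
)
print(ok)
print(comment)
```

Gating on the absolute floor while only reporting the delta keeps the gate stable: a rubric that improved but is still under the floor fails, and a small dip above the floor is visible in the comment without blocking the merge.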

For cost control on high-volume pipelines, augment=True on the Evaluator cascades a cheap classifier first and only escalates to a frontier judge on uncertain inputs. On a 10,000-query batch this cuts judge cost by 5 to 10x without changing the score distribution. Classifier-backed evals at lower per-eval cost than Galileo Luna-2 make weekly full-dataset reruns the default.
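The cascade idea itself is easy to picture: trust the cheap scorer when it is confident and pay for the frontier judge only in the uncertain band. The sketch below shows that pattern in general terms and makes no claim about how `augment=True` is implemented; the function, the band limits, and the stub judges are hypothetical.

```python
from typing import Callable, Tuple


def cascade_score(item: str,
                  cheap: Callable[[str], float],
                  frontier: Callable[[str], float],
                  low: float = 0.3, high: float = 0.7) -> Tuple[float, bool]:
    """Return (score, escalated). Escalate only when the cheap score is inconclusive."""
    score = cheap(item)
    if low < score < high:
        return frontier(item), True
    return score, False


# Stub judges standing in for a classifier and an LLM judge.
def cheap_stub(text: str) -> float:
    return 0.5 if "maybe" in text else 0.95


def frontier_stub(text: str) -> float:
    return 0.2


batch = ["clearly grounded answer", "maybe grounded answer"]
results = [cascade_score(text, cheap_stub, frontier_stub) for text in batch]
print(results)
print(sum(escalated for _, escalated in results), "of", len(batch), "escalated")
```

The width of the uncertain band is the cost dial: widening it sends more traffic to the expensive judge, narrowing it leans harder on the classifier's calibration.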

For safety-critical retrieval (medical, legal, financial), wrap the same templates in the Guardrails API at request time. AggregationStrategy.MAJORITY for casual workloads, AggregationStrategy.ALL for compliance paths. Thirteen guardrail backends sit behind the API (nine open-weight including LLAMAGUARD_3_8B, QWEN3GUARD_8B, GRANITE_GUARDIAN_8B, SHIELDGEMMA_2B; four API). AI compliance guardrails for enterprise LLMs covers the model choices.
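The difference between the two strategies comes down to how individual backend verdicts combine. The sketch below assumes each backend returns `True` when the content passes, that ALL means every backend must pass, and that MAJORITY means more than half must pass; it illustrates the trade-off and is not the Guardrails API's own code.

```python
from typing import Sequence


def aggregate(passes: Sequence[bool], strategy: str) -> bool:
    """Combine per-backend pass verdicts into one allow/block decision."""
    if not passes:
        raise ValueError("at least one guardrail verdict is required")
    if strategy == "ALL":
        return all(passes)
    if strategy == "MAJORITY":
        return sum(passes) * 2 > len(passes)
    raise ValueError(f"unknown strategy: {strategy}")


verdicts = [True, True, False]  # two backends pass, one flags the reply
print(aggregate(verdicts, "MAJORITY"))
print(aggregate(verdicts, "ALL"))
```

Under these assumptions a single dissenting backend blocks the reply with ALL but not with MAJORITY, which is why the stricter mode suits compliance paths and costs more false blocks elsewhere.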

Production observability and the Error Feed

The closed loop is what makes a Haystack eval system compound. Without it, every incident produces a one-off fix and the team writes the same regression twice next quarter.

Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups every failing trace into a named issue. A Sonnet 4.5 Judge on Bedrock (30-turn budget, eight span-tools, Haiku Chauffeur for spans over 3000 characters, prompt-cache hit near 90 percent) reads each failing trace and writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-to-5).

For Haystack RAG the cluster names show up like:

“TransformersSimilarityRanker downranks correct doc below top-3 on multi-hop queries”
“PromptBuilder Jinja loop truncates context when token budget creeps over 4096”
“InMemoryEmbeddingRetriever misses domain synonyms after corpus refresh”
“DocumentJoiner drops BM25 side on rare-token queries”

Each cluster ships as a Linear issue today (Slack, GitHub, Jira, PagerDuty on roadmap). Two patterns close the loop. The immediate_fix feeds the Platform’s self-improving evaluators so the rubric ages with the pipeline. Representative traces promote into the golden set under engineer sign-off; the next PR touching the offending component has to clear the new entries. The dataset ratchets stronger every week.

For prompt and template optimization, agent-opt ships six optimisers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer). Run them against the same rubric set; the loop earns its budget the first time it rewrites a PromptBuilder template that was silently truncating context.

Three deliberate tradeoffs

Two layers cost more eval budget than one. Per-component plus pipeline-level rubrics is more LLM-judge calls than one faithfulness score. The payoff is a five-minute bisect that points at a specific Component. Classifier-backed evals at lower per-eval cost than Galileo Luna-2 make weekly full-dataset reruns the default.
CustomLLMJudge rubrics for ranker uplift, joiner correctness, and branch routing add maintenance. Three extra rubrics to keep calibrated as the pipeline evolves. The lift is the only signal that separates a ranker regression from a retriever regression, or a joiner bug from a single-retriever bug. The moment a TransformersSimilarityRanker enters the DAG, write the uplift rubric the same week.
AggregationStrategy.ALL guardrails cost latency. Strict aggregation adds 100 to 400 ms depending on rubric mix. Worth it for compliance-sensitive paths; MAJORITY is the right operating point elsewhere.

How Future AGI ships this

Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals on the two Haystack layers. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent against your live traces.

ai-evaluation SDK (Apache 2.0): per-component layer ContextRelevance (9), ChunkAttribution (11), ChunkUtilization (12); pipeline layer Groundedness (47), ContextAdherence (5), Completeness (10), AnswerRefusal (88), FactualAccuracy (66); tool overlay EvaluateFunctionCalling (98), TaskCompletion (99). 50+ total templates, CustomLLMJudge for ranker uplift, joiner correctness, and branch routing; 13 guardrail backends (9 open-weight, 4 API), 8 sub-10ms Scanners, four distributed runners (Celery, Ray, Temporal, Kubernetes).
traceAI (Apache 2.0): HaystackInstrumentor().instrument(...) covers every Haystack 2.9+ component with auto-detected EMBEDDING, RETRIEVER, RERANKER, LLM span kinds. 50+ AI surfaces across Python, TypeScript, Java, C#; pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY).
Future AGI Platform: self-improving evaluators tuned by thumbs-up and thumbs-down feedback from production traces; in-product authoring agent writes Haystack-specific rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
Error Feed (inside the eval stack): HDBSCAN clustering groups failing traces; Sonnet 4.5 Judge writes the immediate_fix; representative traces promote into the golden set. Linear OAuth today; Slack, GitHub, Jira, PagerDuty on roadmap.
Agent Command Center: 17 MB Go binary self-hosts in your VPC for the LLM calls underneath every Generator. 100+ providers, 18+ built-in guardrail scanners, exact and semantic caching; SOC 2 Type II, HIPAA, GDPR, CCPA certified (ISO/IEC 27001 in active audit).

Ready to evaluate your first Haystack RAG pipeline? Wire ContextRelevance on the retriever socket, ranker uplift on the RERANKER span, and Groundedness plus Completeness on replies[0] into a pytest fixture against the ai-evaluation SDK. Add HaystackInstrumentor when production traces start asking questions the CI gate missed. The per-component loop pays for itself the first time it points at the TransformersSimilarityRanker instead of blaming the OpenAIGenerator.

Frequently asked questions

Why does Haystack RAG need a per-component evaluation strategy?

Haystack 2.x represents a RAG app as an explicit `Pipeline` DAG of `Component` instances with named input and output sockets. A typical pipeline is `SentenceTransformersTextEmbedder` to `InMemoryEmbeddingRetriever` to optional `TransformersSimilarityRanker` to `PromptBuilder` to `OpenAIGenerator`. Each component has its own failure mode and its own success criterion. A single `Groundedness` score on the final answer collapses five components into one number and tells you nothing about which one regressed. Per-component rubrics map each score back to the component that produced the span; a drop in MRR after a ranker swap points at the ranker, not the generator. The framework already exposes the boundary that most eval setups throw away. Use it.

What is the right rubric split for a Haystack RAG pipeline?

Two layers, scored separately. Per-component rubrics on the Retriever (`ContextRelevance`, `ChunkAttribution`, `ChunkUtilization`, plus Recall@k and MRR on a labelled probe set), on the Ranker (delta `ContextRelevance` from pre- to post-rank, recall preservation in top-k), and on the Generator (`Groundedness`, `ContextAdherence`, `AnswerRefusal` on the LLM step output against its retrieved context). Pipeline-level rubrics on the rendered `PromptBuilder` output and the final `OpenAIGenerator` reply (`Groundedness`, `ContextAdherence`, `Completeness`, `AnswerRefusal`, `FactualAccuracy`). Per-component rubrics catch the bug. Pipeline-level rubrics confirm the fix.

How does traceAI instrument Haystack?

One call. `HaystackInstrumentor().instrument(tracer_provider=trace_provider)` wraps `haystack.Pipeline.run` and `Pipeline._run_component`, then sniffs each component class to assign the right `fi.span.kind`. `Embedder` components map to `EMBEDDING`, `Retriever` and `WebSearch` components to `RETRIEVER`, `Ranker` components to `RERANKER`, and any class with `Generator` in its name (or a `replies` output socket) to `LLM`. The `PromptBuilder` is detected explicitly. Spans carry the component class name, the rendered prompt, retrieved documents with scores, ranker input and output document IDs, and the final generation. Sync, async, and branched pipelines all emit through the same wrapper.

How big should a Haystack RAG golden set be?

Plan for 200 to 500 query and expected-answer pairs as the working baseline, sampled from production traces once instrumentation is live. If the pipeline uses hybrid retrieval (`BM25Retriever` plus `InMemoryEmbeddingRetriever` with a joiner), add 50 queries that explicitly favour each side, so the eval can score per-retriever recall before the merge. If the pipeline branches on query type (a conditional router that skips the ranker for short factoid lookups), add 50 queries per branch plus 50 near the decision boundary. Cover happy-path queries, the hardest 10 percent of historical failures, and three to five edge cases unique to the component the app actually uses. Promote new failing production traces into the set every week through Error Feed.

How do per-component and pipeline-level rubrics differ in CI?

Per-component rubrics gate the component that produced the span. `ContextRelevance` on `InMemoryEmbeddingRetriever`, delta `ContextRelevance` across `TransformersSimilarityRanker`, and the IR metrics on a labelled probe set fail a PR that regresses retrieval even when the generator covers for it. Pipeline-level rubrics gate the `OpenAIGenerator` reply. `Groundedness` and `Completeness` on the final answer fail a PR that ships a wrong answer even when every per-component rubric passes (because the prompt template silently truncated context, or the joiner dropped half the docs, or the generator drifted into model priors). Run both in CI; budget per-component on every push, pipeline-level on every merge.

Where does the Future AGI gateway fit in a Haystack pipeline?

Point `OpenAIGenerator`, `AnthropicGenerator`, or any LLM-backed Ranker at `https://gateway.futureagi.com/v1` via the component's `api_base_url` argument. The gateway returns `x-prism-cost`, `x-prism-latency-ms`, `x-prism-model-used`, and `x-prism-routing-strategy` headers on every response. traceAI captures them on the `LLM` span, so per-component cost and latency aggregate to the pipeline root span automatically. Routing policy (provider preference, fallback chain, token budget) moves to the gateway, which means the Haystack pipeline definition stops carrying provider-specific knobs and stays portable across environments. 18+ built-in guardrail scanners run at the same network hop.

How does the Error Feed cluster Haystack failures?

Failing pipeline runs stream into Error Feed. HDBSCAN soft-clusters them over ClickHouse-stored span embeddings, grouping every failing trace into a named issue. A Sonnet 4.5 Judge agent on Bedrock (30-turn budget, eight span-tools, Haiku Chauffeur for spans over 3000 characters, prompt-cache hit ratio near 90 percent) reads the failing trace and writes the RCA, evidence quotes, an `immediate_fix`, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, each 1-to-5). Typical Haystack clusters: 'Ranker downranks correct doc below top-3 on multi-hop queries,' 'PromptBuilder template truncates context when token budget creeps over 4096,' 'Embedding retriever misses domain synonyms after corpus refresh.' Each cluster ships as a Linear issue today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.

What does Future AGI ship for Haystack RAG eval today, and what is roadmap?

Shipping today: traceAI's `HaystackInstrumentor` (Apache 2.0) auto-detecting every Haystack 2.9+ component (Generator, Embedder, Ranker, Retriever, WebSearch, PromptBuilder) and emitting OpenTelemetry spans with the right `fi.span.kind`; the `ai-evaluation` SDK (Apache 2.0) with the RAG-relevant templates (`ContextRelevance`, `ChunkAttribution`, `ChunkUtilization`, `Groundedness`, `ContextAdherence`, `Completeness`, `AnswerRefusal`, `FactualAccuracy`), plus `EvaluateFunctionCalling` and `TaskCompletion` for tool-using flows, 50+ total templates, 13 guardrail backends (9 open-weight, 4 API), 8 sub-10ms Scanners; `CustomLLMJudge` for Haystack-specific rubrics (ranker uplift, joiner correctness, branch routing); Error Feed inside the eval stack with HDBSCAN clustering and Sonnet 4.5 Judge; Linear OAuth wired as the issue sink; Agent Command Center self-hosts at 100+ providers with 18+ built-in guardrail scanners. Roadmap: Slack, GitHub, Jira, PagerDuty integrations for Error Feed; the trace-stream-to-agent-opt connector that auto-promotes high-signal production traces into optimization datasets.
