
What Is a RAG Pipeline?

The runtime execution path a query takes through a retrieval-augmented generation system, from embedding to retrieval to generation, traced as a span sequence.

What Is a RAG Pipeline?

A RAG pipeline is the runtime execution path of a single query through a retrieval-augmented generation system. The query enters, is embedded by the embedding model, and is sent to the vector store as a similarity search. The search returns the top-k passages, which optionally pass through a reranker and then land in the LLM’s prompt as retrieved context for generation. Each step is a span; the full pipeline is the trace. It is the dynamic counterpart to RAG architecture (the static layout). Engineers measure pipeline-level signals (latency per step, token cost per trace, per-step eval scores) to ship and debug RAG systems in production.
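The step-per-span idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the traceAI implementation: the Span and Trace classes and the embed, retrieve, rerank, and generate functions are hypothetical stand-ins for real components.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    latency_ms: float = 0.0
    attributes: dict = field(default_factory=dict)


@dataclass
class Trace:
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name):
        # Record one pipeline step; latency is measured around the body.
        s = Span(name)
        start = time.perf_counter()
        try:
            yield s
        finally:
            s.latency_ms = (time.perf_counter() - start) * 1000
            self.spans.append(s)


# Hypothetical stand-ins for the real components.
def embed(query):
    return [float(len(word)) for word in query.split()]


def retrieve(vector, k):
    corpus = ["doc about latency", "doc about pricing", "doc about onboarding"]
    return corpus[:k]


def rerank(query, passages):
    # Toy reranker: passages sharing more words with the query come first.
    words = set(query.lower().split())
    return sorted(passages, key=lambda p: -len(words & set(p.split())))


def generate(query, context):
    return f"Answer to {query!r} using {len(context)} passages"


def run_pipeline(query, k=3):
    trace = Trace()
    with trace.span("embed") as s:
        vector = embed(query)
        s.attributes["dim"] = len(vector)
    with trace.span("retrieve") as s:
        passages = retrieve(vector, k)
        s.attributes["top_k"] = k
    with trace.span("rerank") as s:
        passages = rerank(query, passages)
        s.attributes["order"] = list(passages)
    with trace.span("generate") as s:
        answer = generate(query, passages)
        s.attributes["answer_chars"] = len(answer)
    return answer, trace
```

Calling run_pipeline("what is the latency target") returns the answer plus a trace whose spans appear in execution order, which is the property that lets an engineer localise a failure to a single step.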

Why It Matters in Production LLM and Agent Systems

The pipeline is where the architecture meets reality. A clean architecture diagram tells you nothing about whether retrieval times out under load, whether reranker latency blows the response budget, or whether the prompt that wraps retrieved context is actually formatted correctly when the LLM sees it. Pipeline observability is what makes a RAG system debuggable; without per-step traces, an engineer staring at “answer is wrong” has no way to localise the failure to embed, retrieve, rerank, or generate.

The pain shows up across roles. SREs see p99 latency spikes with no idea which component blew the budget. ML engineers see hallucinations that prompt tweaks cannot fix because the wrong context made it into the prompt in the first place. Cost engineers see token-cost-per-trace creeping up and need to identify which pipeline variant is over-fetching. Product managers see the eval fail rate spike and cannot tell users what changed.

In 2026 agent stacks, pipelines compose. An agentic-RAG pipeline runs a retrieval pipeline as one step in a larger agent loop, and a corrective-RAG pipeline conditionally runs a fallback retrieval pipeline if the first retrieval fails an evaluator. These nested pipelines are only debuggable when each level is independently traced. Pipeline observability is no longer optional; it is the substrate that makes agentic RAG explainable to engineers and auditors.
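The corrective-RAG control flow described above can be sketched as follows. This is an illustrative outline under stated assumptions: corrective_retrieve, overlap_evaluator, and the span names retrieve.primary and retrieve.fallback are hypothetical, and the evaluator is a toy word-overlap score rather than a real relevance model.

```python
def overlap_evaluator(query, docs):
    # Toy evaluator: fraction of documents sharing at least one word with the query.
    words = set(query.lower().split())
    if not docs:
        return 0.0
    hits = sum(1 for d in docs if words & set(d.lower().split()))
    return hits / len(docs)


def corrective_retrieve(query, primary, fallback, evaluator, threshold=0.5):
    """Run primary retrieval; run the fallback only if the score is below threshold."""
    steps = []  # one record per retrieval level, so each level is traceable
    docs = primary(query)
    score = evaluator(query, docs)
    steps.append({"span": "retrieve.primary", "score": score})
    if score < threshold:
        docs = fallback(query)
        score = evaluator(query, docs)
        steps.append({"span": "retrieve.fallback", "score": score})
    return docs, steps
```

Recording a step for each level, rather than only the final result, is what keeps the nested pipeline debuggable: the trace shows whether the fallback ran and what score triggered it.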

How FutureAGI Handles RAG Pipelines

FutureAGI’s approach is to capture every pipeline step as a typed span and run evaluators against those spans live or in batch. The traceAI-haystack integration auto-instruments Haystack pipelines: each component (embedder, retriever, ranker, generator) emits a span with inputs, outputs, latency, and tokens. traceAI-langchain and traceAI-llamaindex do the same for those frameworks. The result is a trace where you can see the embedding for the query, the top-k chunks returned, the reranker’s reordering, and the final LLM call, all in one timeline.

On top of that, fi.evals.RAGScore runs end-to-end on the pipeline’s output, while ContextRelevance, ChunkAttribution, and Groundedness score individual stages. A live eval can be wired to fire on every retrieve span where retrieval.documents is present, write the score back as a span event, and trigger an alert when the fail rate crosses a threshold.
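The live-eval loop described here (filter on retrieve spans, score, write a span event, alert on fail rate) can be sketched in plain Python. This is not the FutureAGI API: the LiveEval class, the dict-shaped spans, the "input" attribute key, and the event name eval.context_relevance are assumptions made for illustration. Only retrieval.documents comes from the text above.

```python
from collections import deque


class LiveEval:
    """Hypothetical live evaluator: scores retrieve spans, tracks a rolling fail rate."""

    def __init__(self, scorer, pass_threshold=0.5, alert_fail_rate=0.3, window=100):
        self.scorer = scorer
        self.pass_threshold = pass_threshold
        self.alert_fail_rate = alert_fail_rate
        self.results = deque(maxlen=window)  # rolling window of pass/fail booleans
        self.alerts = []

    def on_span(self, span):
        attrs = span.get("attributes", {})
        docs = attrs.get("retrieval.documents")
        # Fire only on retrieve spans that actually carry documents.
        if span.get("name") != "retrieve" or not docs:
            return None
        score = self.scorer(attrs.get("input", ""), docs)
        passed = score >= self.pass_threshold
        # Write the result back onto the span as an event.
        span.setdefault("events", []).append(
            {"name": "eval.context_relevance", "score": score, "passed": passed}
        )
        self.results.append(passed)
        fail_rate = self.results.count(False) / len(self.results)
        if fail_rate > self.alert_fail_rate:
            self.alerts.append(fail_rate)
        return score
```

The rolling window is a design choice for this sketch; it keeps the alert responsive to recent traffic instead of averaging over the whole history.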

A typical FutureAGI workflow: a Haystack RAG pipeline starts degrading on a long-tail customer cohort. The trace dashboard shows retrieve-span p99 latency is fine but ContextRelevance p10 has dropped from 0.78 to 0.42 on that cohort. The engineer drills into a failing trace, sees the retriever pulled three off-topic chunks for a query mentioning a renamed product, and ships a query-rewriter as a new pipeline step. Re-running the canonical golden dataset confirms ContextRelevance recovered without Groundedness regressing. That cycle (observed → diagnosed → fixed → re-evaluated) runs in a single afternoon when the pipeline is fully instrumented.

How to Measure or Detect It

Pipeline-level signals are about steps, not just the final answer:

  • Per-span latency: p50/p99 on embed, retrieve, rerank, generate spans, captured automatically by traceAI integrations.
  • Token-cost-per-trace: llm.token_count.prompt + llm.token_count.completion summed per pipeline trace.
  • fi.evals.ContextRelevance on the retrieve span: 0–1 score per request; the canonical retrieval-quality signal.
  • fi.evals.Groundedness on the generate span: pass/fail per request; catches hallucination at the last step.
  • fi.evals.RAGScore end-to-end on the pipeline output.
  • Eval-fail-rate-by-cohort: per-route, per-tenant, per-pipeline-variant; the regression alarm.

from fi.evals import ContextRelevance

# Score how relevant the retrieved context is to the query (0–1 per request).
retrieval_score = ContextRelevance().evaluate(
    input="What's the latency target?",
    context=["...latency target: <300ms p99..."],
)
print(retrieval_score.score)
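The first two signals in the list can be computed from exported span records with a few lines of aggregation. The sketch below assumes spans are available as plain dicts with hypothetical trace_id, name, and latency_ms keys; the llm.token_count.prompt and llm.token_count.completion attribute names follow the list above. The percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math
from collections import defaultdict


def token_cost_per_trace(spans):
    # Sum prompt + completion tokens over every span in each trace.
    totals = defaultdict(int)
    for s in spans:
        attrs = s["attributes"]
        totals[s["trace_id"]] += attrs.get("llm.token_count.prompt", 0)
        totals[s["trace_id"]] += attrs.get("llm.token_count.completion", 0)
    return dict(totals)


def percentile(values, pct):
    # Nearest-rank percentile: smallest value with at least pct% of data at or below it.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_by_step(spans):
    # Group latencies by span name, then report p50 and p99 per step.
    by_name = defaultdict(list)
    for s in spans:
        by_name[s["name"]].append(s["latency_ms"])
    return {
        name: {"p50": percentile(v, 50), "p99": percentile(v, 99)}
        for name, v in by_name.items()
    }
```

Grouping by span name rather than by trace is what turns a single end-to-end latency number into a per-step breakdown.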

2026 RAG pipeline stages and signals

In our 2026 evals, the most reliable RAG pipelines record a structured span per stage, with at least the fields below. Anything less and an incident becomes guesswork.

| Stage | Span fields to capture | Quality signal |
|---|---|---|
| Embed query | Embedding model id, dim, normalization | Embedding latency, drift detector |
| Rewrite (query rewriting) | Original input, rewritten text, rewrite model | Recall delta vs no-rewrite |
| Retrieve | Index name, top-k, filter set, score distribution | ContextRelevance, ContextRecall |
| Rerank (reranker) | Reranker provider, top-n in/out, latency | ContextPrecision, rerank p99 latency |
| Prompt assembly | Final ordered chunk IDs, prompt token count | Token-cost-per-trace |
| Generate | Model name, completion tokens, finish reason | Groundedness, Faithfulness, AnswerRelevancy |
| Cite | Source IDs cited, citation count | ChunkAttribution, ChunkUtilization |
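One way to enforce the per-stage fields is a small completeness check run against each span before it is exported. The attribute keys below are hypothetical spellings of the table's entries, chosen for illustration; they are not a documented traceAI schema.

```python
# Hypothetical attribute keys mirroring the "Span fields to capture" column.
REQUIRED_FIELDS = {
    "embed": {"embedding.model_id", "embedding.dim", "embedding.normalized"},
    "rewrite": {"rewrite.original_input", "rewrite.text", "rewrite.model"},
    "retrieve": {"retrieval.index", "retrieval.top_k", "retrieval.filters", "retrieval.scores"},
    "rerank": {"rerank.provider", "rerank.top_n_in", "rerank.top_n_out", "rerank.latency_ms"},
    "prompt_assembly": {"prompt.chunk_ids", "prompt.token_count"},
    "generate": {"llm.model_name", "llm.token_count.completion", "llm.finish_reason"},
    "cite": {"citation.source_ids", "citation.count"},
}


def missing_fields(stage, attributes):
    """Return the required attribute keys that a span for this stage lacks, sorted."""
    return sorted(REQUIRED_FIELDS[stage] - set(attributes))
```

A check like this could run in CI against a sample trace, so a span that silently loses a field is caught before an incident rather than during one.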

Unlike LangSmith’s pipeline view, FutureAGI’s traceAI-haystack, traceAI-langchain, and traceAI-llamaindex integrations attach evaluator results to the span as a span event, so a regression on a Gemini 3 Pro vs Claude Opus 4.7 swap surfaces at the exact stage where the score moved. That keeps RAG pipeline debugging deterministic in a stack where each stage may use a different model, provider, and budget.

Anchor each stage of the release gate to public data. For retrieval, RAGBench (12 RAG tasks across 6 domains, 100K+ examples) and CRAG (Meta, 4,400 stratified questions with noise injection; frontier RAG pipelines hit 30-45% without custom tuning) are the standard cross-domain stress tests. For grounding, RAGTruth (18K labeled chunks; frontier models fail Groundedness on 5-8%) is the cleanest signal: a pipeline that adds a reranker should observably push that failure floor down on the RAGTruth replay before promotion.
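A release gate of this kind reduces to comparing candidate scores against a baseline on the same dataset and blocking promotion on any regression. The sketch below is illustrative: release_gate and its max_drop tolerance are hypothetical, and the scores are assumed to be mean metric values computed elsewhere.

```python
def release_gate(baseline, candidate, max_drop=0.02):
    """Compare mean scores per metric; both dicts must come from the same dataset.

    Returns (passed, regressions), where regressions maps each failing metric
    to its (baseline, candidate) pair. A metric missing from the candidate
    counts as a regression.
    """
    regressions = {}
    for metric, base in baseline.items():
        cand = candidate.get(metric)
        if cand is None or base - cand > max_drop:
            regressions[metric] = (base, cand)
    return (not regressions), regressions
```

Gating per metric rather than on one blended score keeps the failure localised: a retrieval gain cannot mask a grounding loss.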

Common Mistakes

  • Treating architecture and pipeline as the same thing. Architecture is the diagram; pipeline is the live execution. Confusing them masks where to debug.
  • Tracing only the LLM call. The LLM is the last step; most RAG failures originate upstream in retrieval. Trace every span or you cannot localise.
  • Over-fetching top-k to “be safe”. Larger k inflates token cost and dilutes the prompt. Tune k against ContextRelevance and ChunkUtilization, not intuition.
  • Skipping the rerank span when latency budgets are tight. A cross-encoder reranker on top-20 → top-3 typically beats a dense-only top-3; the latency is usually worth it.
  • Caching pipeline output by exact prompt. Embedding non-determinism and minor query variation make exact-match cache hit rates collapse; semantic-cache via Agent Command Center is the right primitive.
  • Treating the pipeline as a single model concern. A pipeline that works on Claude Opus 4.7 may regress on Gemini 3 Pro or GPT-5.1 even though all three are frontier models; the right release gate is per-route RAGScore against the same golden dataset.
  • Skipping trace anchors on rewrite and reflect spans. Agentic RAG loops add steps; if those steps are not instrumented, the failure is invisible.
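The caching point in the list above can be made concrete with a toy comparison. The sketch below is not how Agent Command Center implements semantic caching: SemanticCache, toy_embed, and the 0.85 threshold are assumptions, and the bag-of-words vector stands in for a real embedding model.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    # Stand-in for an embedding model: lowercase bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def put(self, query, answer):
        self.entries.append((toy_embed(query), answer))

    def get(self, query):
        # Return the answer of the most similar cached query, if similar enough.
        vec = toy_embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None
```

With "what is the latency target" cached, a lookup for "what is the p99 latency target" misses an exact-match dict but clears the similarity threshold here, while an unrelated query still misses.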

Frequently Asked Questions

What is a RAG pipeline?

A RAG pipeline is the runtime path a query takes through a retrieval-augmented generation system: embed the query, retrieve top-k from the vector store, optionally rerank, then generate the answer with the LLM conditioned on that context.

How is a RAG pipeline different from RAG architecture?

Architecture is the static layout of components: what the system contains. The pipeline is the runtime execution of those components for a single request. Architecture is the diagram; pipeline is the trace.

How do you trace a RAG pipeline?

FutureAGI's traceAI-haystack, traceAI-llamaindex, and traceAI-langchain integrations capture spans for every pipeline step (embed, retrieve, rerank, generate) with OpenTelemetry attributes like retrieval.documents and llm.token_count.