
Agentic RAG in 2026: Tool-Using Retrieval Agents, Query Rewriting, Multi-Hop Patterns, and How to Observe Them

Agentic RAG in 2026: tool-using agents over vector DBs, query rewriting, multi-hop retrieval, and how to trace and evaluate every retrieve span with FAGI.


A research agent answers a multi-hop legal question in production. The pipeline retrieves 8 chunks, generates a draft, and ships. Two days later a customer flags one paragraph as fabricated. The trace shows the agent retrieved 8 chunks, used 6 of them, and fabricated the flagged claim with no backing chunk at all. No span scored faithfulness. No judge gated the answer. The agent had every framework feature it needed, except the one that would have caught the hallucination: a self-check loop. This is what 2025-era RAG looks like when the agent layer is bolted onto a classic pipeline without a trace + eval back-end. This post is the 2026 picture of agentic RAG: the patterns that actually ship, the failure modes that actually bite, and how to observe and evaluate every span using FutureAGI’s traceAI (Apache 2.0) and ai-evaluation (Apache 2.0) libraries.

TL;DR: Agentic RAG in one table

What | Classic RAG (2023-2024) | Agentic RAG (2026)
---- | ----------------------- | -------------------
Retrieval calls per turn | 1 | 1 to 6, dynamic
Query rewriting | Optional | Default, often with decomposition
Multi-hop reasoning | No | Yes, with state across hops
Self-check on draft | No | Faithfulness judge gates the answer
Re-retrieval on failure | No | Loop until supported or step-budget hit
Latency | 1 LLM call + 1 retrieve | 3 to 8 LLM calls + 2 to 6 retrieves
Best fit | FAQ, single-doc lookup | Multi-hop research, compliance, ambiguous queries
Failure mode | Under-retrieves | Over-retrieves, loops

If you only read one row: agentic RAG trades latency and tokens for faithfulness on hard questions. If your hardest questions are single-doc lookups, classic RAG is correct. If your hardest questions are multi-hop or ambiguous, agentic RAG with a self-check loop wins.

What is agentic RAG, precisely

Classic RAG is a function: query in, retrieve once, generate once, answer out. The retrieval call is part of the prompt construction; the model has no agency over whether to retrieve or what to retrieve.

Agentic RAG is a loop with a policy. The LLM agent decides: do I retrieve? With what query? With which retriever? Is the result enough? Do I retrieve again? Is the draft answer supported by the retrieved evidence? If not, do I retrieve more or hand back a refusal?

Concretely, agentic RAG has four primitives the classic pipeline does not:

  1. Decision to retrieve. The agent can choose to answer from its own knowledge for trivial questions and only call the retriever when the question is corpus-specific.
  2. Query transformation. The agent rewrites the user query before retrieval. It can decompose a 2-hop question into 2 sub-queries, expand entities, or restate in domain vocabulary.
  3. Iterative retrieval. The agent retrieves, reads, decides whether more evidence is needed, and retrieves again. The chain runs until the agent has enough or hits a step budget.
  4. Self-check. Before the answer ships, a judge scores faithfulness or groundedness. If the judge flags a claim, the agent loops back to retrieval.

Each of these primitives has a cost in tokens and latency. The trade is reliability on hard questions for cost on easy ones. A well-tuned agentic RAG routes easy questions through 1 retrieve and hard questions through 4, hitting the latency target on the former and the faithfulness target on the latter.

Figure 1: Agentic RAG architecture in 2026. A tool-using agent over a retrieval index loops over query rewriting, retrieve, and reason, with a faithfulness judge gating the final response.

The five patterns that ship in 2026

Across the agentic RAG pipelines that hit production in the last year, five patterns recur. A real system uses 3 or 4 of them, rarely all 5.

Pattern 1: Query rewriting and decomposition

The user query is rarely what the retriever wants. “What’s the latency difference between the new turbo model and what we shipped last quarter?” needs to become two retrieval queries: one for the new turbo model spec, one for last quarter’s release. Decomposition is the agent step that splits and rewrites.

Two common shapes:

  • Step-back prompting: rewrite the literal query into a more general question that surfaces background, then a more specific question that surfaces the fact. Useful for ambiguous queries where the literal phrasing misses the corpus.
  • Sub-query decomposition: split a multi-hop question into N independent sub-questions, retrieve each, then synthesize. The classic example is HotpotQA-style 2-hop questions where one entity is in document A and one is in document B.

Across the multi-hop QA literature (HotpotQA, MuSiQue, 2WikiMultiHopQA), sub-query decomposition consistently lifts retrieval recall versus single-shot retrieval; reported gains vary by retriever and corpus, often in the range of several points to tens of points at recall@k. The trade is one extra LLM call per turn. The cost-benefit only works when multi-hop questions are common in the workload; for single-doc lookups, query rewriting alone (no decomposition) is enough.
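The decomposition step is small enough to sketch inline. The helper below is illustrative, not a library API: call_llm stands in for whatever chat-completion wrapper your stack uses, and the prompt wording and the 2-to-4 sub-query cap are starting points to tune against your own corpus.

# Illustrative sub-query decomposition (Pattern 1). `call_llm` is a
# placeholder for your own chat-completion helper; the prompt is a sketch.
DECOMPOSE_PROMPT = (
    "Rewrite the user question into 2 to 4 standalone search queries, "
    "one per line. Each query should target exactly one fact needed to "
    "answer the question.\n\nQuestion: {question}"
)

def decompose(question: str) -> list[str]:
    raw = call_llm(DECOMPOSE_PROMPT.format(question=question))
    sub_queries = [line.strip("-• ").strip() for line in raw.splitlines() if line.strip()]
    return sub_queries or [question]  # fall back to the original query

# "What's the latency difference between the new turbo model and what we
# shipped last quarter?" typically splits into a spec lookup for the turbo
# model and a release-notes lookup for last quarter's release.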

Pattern 2: Multi-hop retrieval with state

Multi-hop retrieval keeps state across hops: each retrieve call sees the previous retrieves and the partial reasoning. The agent reads chunk 1, identifies the missing fact, retrieves chunk 2 against the rewritten query, and continues.

Three design choices matter:

  • Step budget. Cap the number of retrieves per turn (typical: 4 to 6). Without a cap, the agent loops.
  • Stop condition. The loop ends when the agent emits an “I have enough” signal, a confidence score clears a threshold, or a faithfulness check passes. Without an explicit stop, the agent keeps retrieving.
  • State carry-over. Pass the prior chunks (or a summary) and the running plan into the next retrieve prompt so the agent does not re-retrieve the same chunks.

LangGraph’s state machine, the OpenAI Agents SDK’s tool-call loop, and LlamaIndex’s AgentWorkflow all provide this shape natively.

Pattern 3: Tool routing across retrievers

Production agentic RAG rarely uses one retriever. The agent picks per query:

  • Dense vector search for semantic similarity (default).
  • BM25 / sparse retrieval for exact term matching, IDs, error codes, version numbers.
  • Hybrid (BM25 + dense) for the common case where both signals matter.
  • Web search when the corpus is stale on a recent event.
  • SQL / structured data when the question is a metric (a count, a sum, a comparison).
  • Re-ranker (Cohere Rerank, BGE Reranker, Jina Reranker) as a second stage on top of the first-stage retriever to lift precision.

The tool-routing step is itself a tool-call: the agent emits “use BM25 for this query” or “use web_search for this query” and the framework dispatches. The router can be the LLM itself (zero-shot tool selection) or a small classifier (faster, less flexible).
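The “small classifier” variant can start as a few heuristics and graduate to a learned router later. The sketch below is one possible shape for the pick_retriever placeholder used in the reference implementation further down; the retriever objects and the regexes are illustrative, not a tuned policy.

import re

# Illustrative first-pass router (Pattern 3). bm25_retriever, sql_retriever,
# and hybrid_retriever are placeholders for your own index clients.
def pick_retriever(query: str):
    if re.search(r"\b([A-Z]{2,}-\d+|ERR_\w+|v\d+\.\d+(\.\d+)?)\b", query):
        return bm25_retriever      # exact terms: IDs, error codes, versions
    if re.search(r"\b(how many|count|sum|average|total)\b", query, re.IGNORECASE):
        return sql_retriever       # metric question: structured data
    return hybrid_retriever        # default: BM25 + dense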

Pattern 4: Self-check on the draft

Before the final answer ships, a judge scores faithfulness or groundedness against the retrieved chunks. Three options for the judge:

  • Reference-free faithfulness judge (a common production pattern in 2026). The judge sees the draft and the retrieved chunks and emits a per-sentence support score.
  • Citation enforcement. The agent is required to attach a chunk ID to every sentence; sentences without a citation are stripped or rewritten.
  • Hallucination detector running independently of the agent’s own confidence.

FutureAGI’s fi.evals templates cover faithfulness, groundedness, and hallucination scoring directly; citation enforcement is then a thin application rule on top of those scores (drop or rewrite sentences without a matching chunk ID). The judge call costs 1 to 2 extra evals per turn, typically with a fast judge model (FutureAGI turing_flash at ~1-2s for online screening, turing_small at ~2-3s for deeper faithfulness scoring).
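Citation enforcement as a post-processing rule can be as small as the sketch below. It assumes the generation prompt requires a [chunk-id] marker at the end of every sentence; the citation format and the sentence splitter are assumptions to adapt to your own prompt.

import re

# Illustrative citation enforcement (Pattern 4). Drops sentences whose
# [chunk-id] markers do not match a retrieved chunk. The [chunk-id] format
# is an assumption; match it to whatever your generation prompt enforces.
def enforce_citations(draft: str, chunk_ids: set[str]) -> str:
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", draft):
        cited = re.findall(r"\[([\w-]+)\]", sentence)
        if cited and all(cid in chunk_ids for cid in cited):
            kept.append(sentence)
    return " ".join(kept)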

Pattern 5: Re-retrieval on failure

When the self-check flags an unsupported claim, the agent does not give up. It rewrites the query around the unsupported claim and retrieves again. Two or three retries is the typical cap; beyond that the agent should refuse or escalate.

The re-retrieve loop is what separates a good agentic RAG from a bad one. A naive system without re-retrieval ships hallucinated answers when the first retrieve missed; a good system catches the failure, retrieves more, and either grounds the answer or refuses.

A reference implementation

The skeleton below shows the five patterns wired together using LangGraph, with FutureAGI traceAI for spans and fi.evals for the faithfulness judge. Substitute the framework you use; the pattern shape is the same.

# PSEUDOCODE: adapt to your stack. The fi.evals.evaluate call and
# the fi_instrumentation.register / traceai_langchain.LangChainInstrumentor
# calls are real APIs (pip install "traceAI-langchain[langgraph]" future-agi
# ai-evaluation). The llm_rewrite, pick_retriever, has_enough_evidence, and
# llm_generate functions are placeholders for your own retriever, planner,
# and generation primitives.

from fi_instrumentation import register
from traceai_langchain import LangChainInstrumentor  # covers LangGraph via the [langgraph] extra
from fi.evals import evaluate

# 1. Wire OTel tracing once at startup (real API)
register(project_name="agentic-rag-prod")
LangChainInstrumentor().instrument()

MAX_HOPS = 4       # step budget per turn (Pattern 2)
MAX_RETRIES = 2    # re-retrieval cap (Pattern 5)

# 2. Define the loop: rewrite -> retrieve -> reason -> (re-retrieve | answer)
def rewrite_query(state):
    # Pattern 1: query rewriting / decomposition (placeholder)
    state["queries"] = llm_rewrite(state["user_query"])
    return state

def retrieve(state):
    # Pattern 3: tool routing across retrievers (placeholder)
    tool = pick_retriever(state["queries"][-1])
    chunks = tool.retrieve(state["queries"][-1], k=10)
    state["chunks"].extend(chunks)
    return state

def reason(state):
    # Pattern 2: multi-hop state carry-over (placeholder)
    if has_enough_evidence(state) or state["hops"] >= MAX_HOPS:
        state["next"] = "draft"
    else:
        state["next"] = "rewrite"
        state["hops"] += 1
    return state

def draft(state):
    state["draft"] = llm_generate(state["user_query"], state["chunks"])
    return state

def self_check(state):
    # Pattern 4: faithfulness judge gates the answer (real fi.evals call)
    result = evaluate(
        "faithfulness",
        output=state["draft"],
        context="\n".join(c.text for c in state["chunks"]),
    )
    if result.score >= 0.8:
        state["next"] = "ship"
    elif state["retries"] < MAX_RETRIES:
        state["next"] = "rewrite"  # Pattern 5: re-retrieve on failure
        state["retries"] += 1
    else:
        state["next"] = "refuse"
    return state
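The node functions above still need to be wired into a graph. A minimal LangGraph wiring might look like the sketch below; the state schema is illustrative and has to stay in sync with whatever keys your node functions read and write.

# PSEUDOCODE continued: minimal LangGraph wiring for the nodes above.
from typing import List, TypedDict
from langgraph.graph import END, StateGraph

class RAGState(TypedDict):
    user_query: str
    queries: List[str]
    chunks: list
    hops: int
    retries: int
    draft: str
    next: str

graph = StateGraph(RAGState)
graph.add_node("rewrite", rewrite_query)
graph.add_node("retrieve", retrieve)
graph.add_node("reason", reason)
graph.add_node("draft_answer", draft)   # node name kept distinct from the "draft" state key
graph.add_node("self_check", self_check)

graph.set_entry_point("rewrite")
graph.add_edge("rewrite", "retrieve")
graph.add_edge("retrieve", "reason")
graph.add_conditional_edges("reason", lambda s: s["next"],
                            {"draft": "draft_answer", "rewrite": "rewrite"})
graph.add_edge("draft_answer", "self_check")
graph.add_conditional_edges("self_check", lambda s: s["next"],
                            {"ship": END, "refuse": END, "rewrite": "rewrite"})

app = graph.compile()
# result = app.invoke({"user_query": q, "queries": [], "chunks": [],
#                      "hops": 0, "retries": 0, "draft": "", "next": ""})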

The five-step loop above is the canonical 2026 shape. Every step emits a span; every span carries an evaluator score; the dashboard rolls up per-trace into the four metrics that matter: faithfulness rate, retrieves-per-correct-answer, latency p95, cost-per-correct-answer.

Observability and evaluation: the part most teams skip

A working agentic RAG without trace + eval is a black box. You see latency p95 and an answer, but not which hop missed, which retriever returned the wrong chunks, or whether the faithfulness judge agreed with humans.

Trace every span

Use traceAI (Apache 2.0) or OpenInference (Apache 2.0) to emit OTel spans for every retrieve, rerank, generate, and judge call. The span attributes (input, output, model, tokens, latency) flow into FutureAGI’s dashboard or any OTel-compatible back-end.

Attach evaluators to spans

The four core evaluators for agentic RAG:

  1. context_relevance: scored on the retrieve span, measures whether the retrieved chunks are relevant to the rewritten query. Catches retrieval drift.
  2. faithfulness (or groundedness): scored on the generate span, measures whether the draft is supported by the retrieved chunks. Catches hallucination.
  3. hallucination: a reference-free judge on the final answer. The safety net.
  4. answer_relevance: scored on the final answer, measures whether the answer actually addresses the user query. Catches off-topic drift.

All four ship as fi.evals templates and run as cloud-evals (turing_flash at ~1-2 seconds for online screening, turing_small at ~2-3 seconds for deeper analysis, turing_large at ~3-5 seconds).

The four roll-up metrics

From per-span evaluator scores, four trace-level metrics give you the dashboard:

  • Faithfulness rate: percent of traces where the final answer passes the faithfulness threshold.
  • Retrieves-per-correct-answer: total retrieves divided by traces with passing faithfulness. Surfaces over-retrieval.
  • Latency p95: the standard latency tail.
  • Cost-per-correct-answer: total cost divided by passing traces. The composite you use for ship/no-ship.
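All four roll up from per-trace records with a few lines of Python. The record shape below (a faithfulness score, a retrieve count, latency, and cost per trace) is an assumption; map it from whatever your trace back-end exports.

import statistics

# Illustrative roll-up computation. Each trace dict is assumed to carry
# "faithfulness", "retrieves", "latency_s", and "cost_usd" fields.
def rollups(traces: list[dict], threshold: float = 0.8) -> dict:
    passing = [t for t in traces if t["faithfulness"] >= threshold]
    correct = max(len(passing), 1)  # avoid divide-by-zero on a bad day
    return {
        "faithfulness_rate": len(passing) / len(traces),
        "retrieves_per_correct_answer": sum(t["retrieves"] for t in traces) / correct,
        "latency_p95_s": statistics.quantiles([t["latency_s"] for t in traces], n=20)[-1],
        "cost_per_correct_answer": sum(t["cost_usd"] for t in traces) / correct,
    }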

Tracking only latency p95 is the most common eval mistake. A pipeline that drops latency by 30% by skipping the self-check looks like a win on the dashboard and ships hallucinations.

How to choose between agentic and classic RAG

Three questions decide.

  1. Are the hardest questions multi-hop? If yes, agentic RAG wins. If no, classic RAG is faster and cheaper.
  2. Does faithfulness matter more than latency? Legal, medical, compliance, finance: yes, agentic with self-check. High-volume customer support FAQ: latency wins, classic.
  3. Do you have observability? Agentic RAG without trace + eval ships hallucinations you cannot debug. If you cannot wire in traceAI or equivalent before ship, run classic until you can.

For most production builders, the answer is hybrid: classic single-pass for the 80% of easy questions, agentic for the 20% of hard ones, with the agent itself deciding which path. The dispatch logic is a tool-routing step at the top of the agent.
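The dispatch itself can start as a heuristic and be replaced by a small classifier or a zero-shot LLM routing call once you have traffic to learn from. The markers below are illustrative, not a validated rule set.

# Illustrative hybrid dispatch: route easy questions to the single-pass
# pipeline, hard ones to the agentic loop. Tune or replace the heuristic.
def dispatch(query: str) -> str:
    multi_hop_markers = ("compare", "difference between", "both", "versus", "before and after")
    if any(m in query.lower() for m in multi_hop_markers) or len(query.split()) > 25:
        return "agentic"   # multi-hop or ambiguous: full loop with self-check
    return "classic"       # single retrieve-then-generate pass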

Frameworks at a glance

As of May 2026:

  • LangGraph: state-machine framework on top of LangChain. The default for stateful agentic RAG. Mature, debuggable, instrumentable with traceAI through the LangChain integration.
  • OpenAI Agents SDK: tool-call loop built on the OpenAI Responses API. Lightest weight, fastest to prototype. Good for single-vendor stacks.
  • LlamaIndex AgentWorkflow: retrieval-first agent framework. The fit when retrieval is the primary primitive and the agent is a loop around it.
  • CrewAI: role-based multi-agent. Good for retrieve-and-write workflows where a researcher agent and a writer agent share state.
  • Microsoft Agent Framework: the AutoGen successor, more stable runtime, multi-agent dispatch.
  • Pydantic AI: typed agent framework. The pick for Python codebases that already use Pydantic for everything.

Each can be instrumented via traceAI or OpenInference (some with a ready-made integration, others through OTel adapters or custom wrappers); all can be evaluated with fi.evals. The framework is replaceable; the trace + eval back-end is the load-bearing piece.

For a deeper comparison of OSS agent frameworks, see The Open-Source Stack for AI Agents in 2026. For framework-level eval, see Agent Eval Metrics in 2026.

Common failure modes and the fix for each

Over-retrieval

The agent loops, retrieves 8 to 12 times, burns tokens. The fix is a hard step budget (max 4 retrieves), an explicit stop signal in the agent prompt, and a retrieves-per-correct-answer metric on the dashboard so you can see when a deploy regresses.

Under-retrieval

The agent stops at 1 retrieve when the question needed 3. The fix is the self-check loop: the faithfulness judge flags unsupported claims and triggers re-retrieval. Without the judge, the agent has no signal to keep going.

Tool misrouting

The agent picks BM25 for a semantic query or vector search for an exact ID match. The fix is a per-tool span with a hit-rate metric. Misrouting shows up as a tool whose retrieval results are never used downstream.

Judge drift

The faithfulness judge labels too leniently and lets hallucinations through. The fix is periodic calibration: 50 to 100 human labels per month, compared against the judge’s verdicts. A typical starting threshold is Cohen’s kappa around 0.6 (the moderate-to-substantial agreement boundary on Landis and Koch’s scale); tune the threshold based on the risk class of the application and the cost of false positives.
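The calibration check itself is one function call if you keep the human labels and the judge verdicts as parallel lists; the sketch assumes scikit-learn is available and that verdicts are binary faithful/unfaithful labels.

from sklearn.metrics import cohen_kappa_score

# human_labels and judge_labels: parallel 0/1 verdicts (1 = faithful) over
# the same 50 to 100 sampled traces for the month.
kappa = cohen_kappa_score(human_labels, judge_labels)
if kappa < 0.6:  # moderate/substantial boundary; tune per risk class
    print(f"Judge drift: kappa={kappa:.2f}; recalibrate the judge prompt or swap models")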

State contamination

Long agent runs accumulate context from prior turns and confuse the retriever. The fix is per-turn state reset and a summary primitive that compresses prior turns into a 200-token summary rather than carrying the full transcript.
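The summary primitive is a single LLM call. The sketch below reuses the call_llm placeholder from the decomposition example; the prompt wording is illustrative.

# Illustrative per-turn state compression. `call_llm` is the same placeholder
# as in the decomposition sketch; the 200-token budget comes from the prose above.
SUMMARIZE_PROMPT = (
    "Compress the prior turns below into at most 200 tokens. Keep entities, "
    "resolved facts, and open questions; drop retrieved chunk text.\n\n{transcript}"
)

def compress_state(prior_turns: list[str]) -> str:
    return call_llm(SUMMARIZE_PROMPT.format(transcript="\n".join(prior_turns)))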

For depth on retrieval-quality monitoring, see Best Retrieval Quality Monitoring Tools 2026. For re-ranker selection, see Best Rerankers for RAG 2026.

Where this is going in 2027

Three trends are visible.

First, retrieval is becoming a learned policy rather than a static pipeline. The agent picks not only which retriever but which top-k, which re-ranker depth, and which chunk size per query. Early systems are showing 15-20% retrieval-quality lift over static configurations.

Second, the agentic and structured-data boundary is blurring. Production agents increasingly query SQL, knowledge graphs, and vector indexes interchangeably; the agent’s job is to pick. Graph-RAG and SQL-aware agents are absorbing what used to be separate categories.

Third, the trace + eval layer is becoming the platform. Frameworks come and go (CrewAI to LangGraph to Agents SDK in two years); the OTel + evaluator-attached-spans pattern is the constant. Pick the platform that owns the trace + eval layer first, the framework second.

How to start

If you are building agentic RAG in 2026:

  1. Pick the framework that fits your team (LangGraph for stateful, Agents SDK for OpenAI-native, LlamaIndex for retrieval-first).
  2. Wire traceAI (Apache 2.0) or OpenInference at the framework instrumentation point. Use the ready-made integration where available; otherwise add an OTel adapter or wrapper.
  3. Attach fi.evals templates (context_relevance, faithfulness, answer_relevance) to your retrieve and generate spans.
  4. Run a 100-query golden set with the four roll-up metrics and a manual review of 20 failures. The pattern in those 20 failures tells you which of the 5 patterns is missing.
  5. Add the missing pattern, re-run, ship when faithfulness rate clears your threshold.

The FutureAGI platform handles steps 2 to 4 in one stack: traceAI for spans, fi.evals for scoring, the dashboard for the roll-ups. Self-host or use the cloud; the Apache 2.0 trace + eval libraries work either way.


Frequently asked questions

What is agentic RAG and how does it differ from classic RAG in 2026?
Agentic RAG wraps a retrieval-augmented generation pipeline inside an LLM agent that decides when to retrieve, what to retrieve, and whether the answer is good enough. Classic RAG does one retrieve-then-generate pass per user turn. Agentic RAG can call the retriever many times across one turn, rewrite the query between hops, switch tools (vector search, BM25, web search, SQL), check faithfulness on the draft, and re-retrieve if a fact is unsupported. The price is more tokens and more latency per turn, traded for higher final-answer faithfulness on multi-hop and ambiguous questions.
When should I use agentic RAG instead of classic RAG?
Use agentic RAG when one of three things is true. First, the question is multi-hop: the answer needs facts from more than one document and a single embedding lookup will miss at least one of them. Second, the question is ambiguous: the literal user query is far from the corpus phrasing and a rewrite step lifts recall substantially. Third, faithfulness matters more than latency: a legal, medical, or compliance agent needs a self-check loop. For high-volume FAQ or single-doc lookups, classic single-pass RAG is faster and cheaper and the agentic overhead is wasted.
What are the core patterns in an agentic RAG system in 2026?
Five patterns recur across production agentic RAG systems in 2026. Query rewriting and decomposition: the agent rewrites the user query into a retrieval-friendly form, optionally splitting it into sub-queries. Multi-hop retrieval: the agent iterates retrieve-reason-retrieve until it has enough evidence. Tool routing: the agent picks between vector search, BM25, web search, SQL, and re-rankers per query. Self-check on the draft: a faithfulness or groundedness judge gates the final answer. Re-retrieval on failure: when the judge flags an unsupported claim, the agent goes back to retrieval with a refined query.
How do I observe and evaluate an agentic RAG pipeline?
Wire OTel spans around every retrieve, rerank, generate, and judge step using FutureAGI traceAI (Apache 2.0). Attach evaluator scores to each retrieve span using fi.evals templates: context_relevance for retrieval quality, faithfulness or groundedness for the generation, and hallucination for the final answer. Aggregate per-trace into a cost-per-correct-answer and a per-hop drop-off chart. The trajectory view from FutureAGI surfaces over-retrieval (the agent ran 6 retrieves when 2 would have answered) and under-retrieval (the agent stopped at 1 retrieve and missed the second hop).
What about hallucinations in agentic RAG? Does the agent layer help or hurt?
It depends on whether you wire in a self-check. A naive agentic RAG that calls the retriever 4 times and stitches the chunks together hallucinates more than classic RAG because the agent has more chances to drift and more context to confuse the generator. An agentic RAG with a faithfulness judge attached to the draft, plus a re-retrieve-on-failure loop, hallucinates less than classic RAG because every claim has a backing chunk before it ships. The judge cost is real: budget for 1 to 2 additional eval calls per turn.
Which framework should I use to build an agentic RAG system in 2026?
LangGraph and the OpenAI Agents SDK dominate stateful agentic RAG; LlamaIndex's AgentWorkflow is strong on retrieval-first patterns; CrewAI works for role-based multi-agent retrieval teams; Pydantic AI is the typed alternative. Most still need a separate eval and observability layer for production-quality scoring. Pair whichever framework fits your team with traceAI (Apache 2.0) for OTel-based spans and fi.evals for retrieval and generation scoring. The framework is replaceable; the trace and eval layer is not.
How does query rewriting actually improve retrieval in agentic RAG?
User queries are often too short, too natural-language, or use entity names the corpus does not. Query rewriting in agentic RAG runs the user query through the LLM with the instruction to produce a retrieval-friendly form: longer, with synonyms, with entity expansion, and (for decomposition) split into 2 to 4 sub-queries each matching one fact in the answer. Published multi-hop QA studies on HotpotQA and MuSiQue report recall gains ranging from several points to tens of points, at the cost of one extra LLM call per turn.
What is the biggest failure mode of agentic RAG in production?
Over-retrieval. The agent gets stuck in a loop, calls the retriever 8 to 12 times for a question that two retrieves would answer, burns tokens and latency, and often surfaces a worse final answer because the generator is now drowning in 40 chunks of context. The fix is a hard step budget (max 4 retrieve calls per turn), a faithfulness judge that exits the loop the moment the draft is supported, and a trace-attached metric for retrieves-per-correct-answer so you can spot the loop pattern in your dashboard.