What Is Modular RAG?
An agentic RAG design that splits retrieval, ranking, grounding, generation, and fallback into independently evaluated modules.
Modular RAG is an agent-system pattern that breaks retrieval-augmented generation into independently tested modules: query planning, retrieval, reranking, context assembly, generation, grounding, and fallback. Instead of one opaque RAG chain, each module exposes inputs, outputs, scores, and trace spans, so teams can replace or route components without rewriting the whole system. It shows up in eval pipelines and production traces when agents choose different retrievers, tools, prompts, or correction steps per task.
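The boundary contract is easy to picture in code. Below is a minimal sketch with hypothetical names, not FutureAGI SDK types: each module consumes the previous module's output and returns a result that carries its own score and trace attributes, so any block can be swapped or rerouted without touching its neighbors.

# Illustrative module-boundary contract; names are hypothetical, not
# FutureAGI SDK types. Each module exposes typed inputs, outputs, a score,
# and trace attributes, so blocks can be replaced independently.
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class ModuleResult:
    output: object          # e.g. retrieved chunks, reranked passages, answer text
    score: float            # module-level evaluator score, 0-1
    trace_attrs: dict = field(default_factory=dict)  # attributes for this module's span

class RAGModule(Protocol):
    name: str
    def run(self, payload: object) -> ModuleResult: ...

def run_pipeline(modules: list[RAGModule], query: str) -> ModuleResult:
    """Run modules in sequence; each result is scored and traceable on its own."""
    payload: object = query
    result = ModuleResult(output=payload, score=1.0)
    for module in modules:
        result = module.run(payload)
        payload = result.output  # next module consumes the previous output
    return result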
Why modular RAG matters in production LLM and agent systems
When modular RAG is ignored, one bad boundary makes an entire agent look random. The retriever may return stale policy chunks, the reranker may bury the only relevant passage, or the generator may cite context it never actually used. The visible failure is a fluent but wrong answer, then a wrong tool call, then a user-facing action such as an incorrect refund, denial, or escalation. FutureAGI treats these boundaries as production surfaces, not only design choices, because each module can regress independently.
The pain lands on different teams at different times. Developers see failing traces where the final answer looks plausible, but no trace field explains whether retrieval, ranking, or generation broke. SREs see p99 latency jump after a new reranker rollout. Compliance reviewers see unsupported claims and need proof that the answer came from approved documents. Product teams see thumbs-down rate rise for one customer cohort while the average RAG score still looks acceptable.
This matters more in 2026 agent stacks because retrieval is no longer a single preprocessing step. Agents call retrievers conditionally, switch tools mid-trajectory, run corrective loops, and send retrieved context into downstream actions. A monolithic RAG score may tell you quality fell. Modular RAG tells you which block caused the fall and which fallback should run next.
How FutureAGI handles modular RAG
FutureAGI’s approach is to treat modular RAG as a traced agent workflow with evaluators attached at module boundaries. For the eval:RAGScore surface, fi.evals.RAGScore scores the full retrieval-to-answer path, while ContextRelevance, Groundedness, ChunkAttribution, and ToolSelectionAccuracy explain which module helped or hurt the result. Unlike a single final-answer check in a Ragas run or a custom notebook, the score is tied back to the production trace that produced the answer.
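One way to picture the pairing is a simple map from span kind to evaluator, as in the illustrative snippet below; the evaluator names come from the paragraph above, while the mapping layout itself is an assumption, not a platform schema.

# Illustrative pairing of evaluators to module boundaries. Evaluator names
# come from the text above; the dict layout is an assumption, not a
# FutureAGI configuration schema.
EVALUATORS_BY_SPAN = {
    "retriever": ["ContextRelevance", "ChunkAttribution"],  # did we fetch the right chunks?
    "agent":     ["ToolSelectionAccuracy"],                 # did the agent pick the right tool?
    "llm":       ["Groundedness"],                          # is the answer supported by context?
    "trace":     ["RAGScore"],                              # full retrieval-to-answer path
}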
A real workflow looks like this: a support agent receives a billing-policy question, writes a retrieval plan, calls a vector retriever, reranks ten passages down to three, generates an answer, then decides whether to file a refund tool call. traceAI-langchain records each module as a span. The retriever span carries the retrieved documents. The agent span carries agent.trajectory.step. The LLM span carries llm.token_count.prompt, llm.token_count.completion, and per-call cost.
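A minimal instrumentation sketch, assuming traceAI-langchain follows the usual register-then-instrument pattern; exact import paths and argument names may vary by SDK version, so treat this as a shape rather than a contract.

# Instrumentation sketch, assuming the register-then-instrument pattern;
# import paths and argument names may differ by SDK version.
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_langchain import LangChainInstrumentor

trace_provider = register(
    project_name="billing-support-agent",  # hypothetical project name
    project_type=ProjectType.OBSERVE,
)
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
# After this, each retriever, reranker, and LLM call in the chain is recorded
# as its own span, carrying attributes such as llm.token_count.prompt.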
The engineer then wires thresholds to the right surface. If ContextRelevance drops under 0.70 on the retriever span, the agent reroutes to a broader query rewrite. If Groundedness fails on the generation span, the system returns a fallback response instead of executing the refund tool. If ToolSelectionAccuracy falls for the refund step, the trace is pushed into a regression eval dataset. FutureAGI keeps the fix local: replace the reranker, tighten the prompt, or tune the fallback route without rewriting the whole RAG application.
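The routing itself is plain control flow. Here is a hypothetical sketch of the three thresholds above; the score container and the stubbed helpers are illustrative placeholders, not FutureAGI APIs, and only the threshold logic mirrors the text.

# Hypothetical routing sketch for the thresholds described above. The score
# container and the stubbed helpers are placeholders for illustration.
from dataclasses import dataclass

@dataclass
class BoundaryScores:
    context_relevance: float      # retriever-span evaluator, 0-1
    groundedness_passed: bool     # generation-span evaluator verdict
    tool_selection_passed: bool   # refund-step evaluator verdict

def rewrite_query_and_retry() -> str:
    return "retrying with a broader query rewrite"  # stub

def fallback_response() -> str:
    return "escalating to a human agent"  # stub: safe default, no tool call

def push_to_regression_dataset(trace_id: str) -> None:
    print(f"queued trace {trace_id} for regression evals")  # stub

def route(scores: BoundaryScores, answer: str, trace_id: str) -> str:
    if scores.context_relevance < 0.70:   # retriever boundary failed
        return rewrite_query_and_retry()
    if not scores.groundedness_passed:    # never execute the refund tool
        return fallback_response()
    if not scores.tool_selection_passed:  # keep the fix local: collect evidence
        push_to_regression_dataset(trace_id)
    return answer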
How to measure modular RAG
Measure modular RAG at the boundary between modules, not only at the final answer:
- RAGScore: comprehensive RAG evaluation across retrieval and answer quality for the full modular flow.
- ContextRelevance: 0-1 retrieval signal for whether the fetched chunks answer the input.
- Groundedness: checks whether the generated answer is supported by the retrieved context.
- ToolSelectionAccuracy: scores whether the agent chose the right retriever, reranker, or action tool.
- Trace signals: agent.trajectory.step, retriever latency p99, llm.token_count.prompt, token-cost-per-trace, and eval-fail-rate-by-module.
- User proxy: thumbs-down rate or escalation rate by module version, not just by whole agent version.
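For example, a single RAGScore call scores one question, answer, and context triple for the full flow: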
from fi.evals import RAGScore

# Score one question-answer-context triple across the full modular flow.
score = RAGScore().evaluate(
    input="Can I refund a duplicate invoice?",                             # user query
    output="Yes, duplicate invoices can be refunded after verification.",  # generated answer
    context=["Refunds are allowed for verified duplicate invoices."],      # retrieved chunks
)
print(score.score)  # overall score for the retrieval-to-answer path
Common mistakes
- Scoring only the final answer. A passing response can hide a weak retriever, an overactive reranker, or a generator that ignored context.
- Treating module names as observability. A diagram is not a trace; each block needs inputs, outputs, latency, and evaluator scores.
- Sharing one top-k setting across tasks. Billing, legal, and troubleshooting queries need different recall and token-cost tradeoffs.
- Replacing modules without regression cohorts. A better reranker for common queries can break rare long-tail document families.
- Letting correction loops run without exit criteria. Corrective RAG needs thresholds, retry caps, and agent.trajectory.step inspection; see the sketch after this list.
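A minimal sketch of such exit criteria, with the retriever, scorer, and query rewrite injected as hypothetical callables rather than real APIs:

# Corrective-RAG loop with explicit exit criteria: a score threshold and a
# retry cap. The injected callables are hypothetical stand-ins for your own
# retriever, a ContextRelevance-style scorer, and a query rewriter.
from typing import Callable

MAX_RETRIES = 2             # retry cap bounds the trajectory length
RELEVANCE_THRESHOLD = 0.70  # same boundary threshold as the retriever span

def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    relevance: Callable[[str, list[str]], float],
    broaden: Callable[[str], str],
) -> list[str]:
    for step in range(MAX_RETRIES + 1):
        chunks = retrieve(query)
        score = relevance(query, chunks)
        print(f"agent.trajectory.step={step} relevance={score:.2f}")  # trace-friendly log
        if score >= RELEVANCE_THRESHOLD:
            return chunks            # exit on quality, not on luck
        query = broaden(query)       # rewrite and retry within the cap
    return []                        # hard exit: hand off to the fallback module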
Frequently Asked Questions
What is Modular RAG?
Modular RAG decomposes retrieval-augmented generation into independently observable agent blocks such as retrieval, reranking, generation, grounding, and fallback. Each block can be traced, scored, routed, and replaced without rebuilding the whole chain.
How is Modular RAG different from Agentic RAG?
Agentic RAG is the broader pattern where an agent decides when and how to retrieve. Modular RAG is the implementation discipline: every retrieval, ranking, grounding, and correction step is treated as a replaceable and measurable module.
How do you measure Modular RAG?
FutureAGI measures Modular RAG with fi.evals.RAGScore for the full flow, plus ContextRelevance, Groundedness, ChunkAttribution, and ToolSelectionAccuracy on module-level traces.