What Is Query Rewriting?

Query rewriting is a retrieval-augmented generation (RAG) technique that converts a user’s raw question into one or more search-ready queries before fetching context. It shows up in the retriever span of a production trace or offline eval pipeline. FutureAGI teams judge a rewrite by whether it improves ContextRelevance, ContextRecall, and Groundedness without changing the user’s intent, adding unsupported constraints, or pushing retrieval cost past the route budget.

Why Query Rewriting Matters in Production LLM and Agent Systems

The concrete failure is retrieval drift before generation starts. Users ask messy questions: “Can I cancel after annual?”, “why did invoice jump?”, or “does this apply in Germany?” A raw vector search may miss the corpus phrasing, retrieve a neighboring policy, or return chunks about cancellation, invoices, and Germany without the clause that answers the question. The generator then writes a fluent answer from weak evidence. The visible symptom looks like a RAG hallucination, but the first bug was a bad search query.

Developers feel this as non-reproducible retrieval quality. The same intent works when phrased in product language and fails when phrased like a customer ticket. SREs see higher retriever retries, p99 latency jumps from fan-out rewrites, and token-cost-per-trace rising because broad rewrites pull too much context. Product teams see thumbs-down comments such as “that is not what I asked” or “wrong policy.” Compliance teams care because a rewrite can silently cross a policy boundary, such as changing “EU cancellation” into a generic global cancellation query.

Agentic RAG raises the stakes. A multi-step support agent may rewrite the query, retrieve evidence, choose an action, and call a billing tool. If the rewrite drops “annual” or adds “refund” when the user asked about cancellation eligibility, the later tool call can be wrong even when every downstream span looks well-formed.

How FutureAGI Handles Query Rewriting in RAG

Query rewriting is a RAG design pattern, not a standalone FutureAGI product surface, so FutureAGI evaluates the rewrite through the surrounding retrieval and answer trace. The important record is not only the final answer; it is the original user input, rewritten query, retrieved chunks, reranker output, and generated response joined under one trace.

Example: a LangChain support bot instrumented with traceAI-langchain receives “Can I still cancel if I moved to annual last week?” The rewrite step emits “annual plan cancellation policy refund proration first week.” The retriever returns plan-change policy chunks, then the answer cites the cancellation window. FutureAGI evaluates the run with ContextRelevance for query-to-context fit, ContextRecall for whether required policy clauses were retrieved in offline regression sets, ContextPrecision for ranking quality, and Groundedness for the final answer.
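
A minimal rewrite step is a single constrained LLM call. The sketch below is illustrative rather than a FutureAGI or LangChain API: llm stands for any callable that takes a prompt string and returns a completion string, and the prompt wording and function name are assumptions.

# Minimal rewrite step, assuming `llm` is any callable that takes a prompt
# string and returns a completion string. Prompt text is illustrative.
REWRITE_PROMPT = (
    "Rewrite the user question into one search query for the policy knowledge base.\n"
    "Preserve every constraint the user stated (plan type, region, timeframe).\n"
    "Do not add facts, dates, or constraints the user did not state.\n"
    "User question: {question}\n"
    "Search query:"
)

def rewrite_query(question: str, llm) -> str:
    """Return a retrieval-ready query; fall back to the raw question if the rewrite is empty."""
    rewritten = llm(REWRITE_PROMPT.format(question=question)).strip()
    return rewritten or question

Logging both the raw question and the returned query on the same retriever span is what lets the evals attribute a bad retrieval to the rewrite.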

The engineer’s next step depends on the failure split. If ContextRecall rises but ContextPrecision falls, the rewrite is too broad and needs constraints or a smaller fan-out. If relevance improves but groundedness fails, the answer prompt may be using context incorrectly. If all retrieval metrics improve but latency p99 crosses budget, the team can cap rewrite variants or add a semantic cache for repeated intents. Unlike a Ragas-only offline scorecard, this workflow keeps the rewrite and the production trace tied together, so the owner can debug the exact query that changed retrieval behavior.
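
That failure split can be encoded as a simple triage rule over the eval deltas. The sketch below is illustrative; the keys and thresholds are assumptions, with each delta value holding the rewrite-on minus rewrite-off average:

# Illustrative triage over A/B eval deltas (rewrite-on minus rewrite-off
# averages). Field names and thresholds are assumptions, not FutureAGI defaults.
def triage_rewrite(delta: dict) -> str:
    if delta["context_recall"] > 0 and delta["context_precision"] < 0:
        return "rewrite too broad: add constraints or shrink the fan-out"
    if delta["context_relevance"] > 0 and delta["groundedness"] <= 0:
        return "retrieval improved: debug how the answer prompt uses context"
    retrieval_ok = all(delta[k] >= 0 for k in
                       ("context_recall", "context_precision", "context_relevance"))
    if retrieval_ok and delta["latency_p99_ms"] > 0:
        return "quality holds, latency regressed: cap rewrite variants or cache repeated intents"
    return "no regression detected: keep the rewrite within the route budget"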

How to Measure or Detect Query Rewriting

Measure query rewriting by comparing the retrieval path with and without the rewrite:

  • ContextRelevance: scores whether retrieved context matches the original user input; low scores mean the rewrite pulled off-topic evidence.
  • ContextRecall: on labeled eval sets, checks whether the rewritten query retrieved the facts needed for the reference answer.
  • ContextPrecision: catches broad rewrites that retrieve the right chunk but bury it below irrelevant context.
  • A/B delta: run original-query and rewritten-query retrievers on the same eval dataset; compare recall, precision, groundedness, latency, and cost (a minimal harness sketch follows this list).
  • Trace fields: store original input, rewritten query, rewrite model, top-k, reranker config, retrieved chunk ids, and final answer on the same trace (a sample record appears after the check below).
  • Dashboard signals: monitor rewrite-adoption rate, empty-retrieval rate, retrieval p99 latency, token-cost-per-trace, eval-fail-rate-by-cohort, and thumbs-down rate.
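
A minimal harness for the A/B delta above, assuming retrieve returns ranked chunk ids and each labeled example lists the evidence ids a correct answer needs (evidence_ids is an illustrative field name, not a FutureAGI schema):

# Hedged A/B sketch: same retriever and eval set, with and without the rewrite.
def recall_at_k(retrieved_ids, required_ids, k=5):
    """Fraction of required evidence that appears in the top-k results."""
    top = set(retrieved_ids[:k])
    return len(top & set(required_ids)) / max(len(required_ids), 1)

def rewrite_ab_delta(dataset, retrieve, rewrite, k=5):
    """Mean recall@k with the rewrite minus mean recall@k without it."""
    base = [recall_at_k(retrieve(ex["question"]), ex["evidence_ids"], k)
            for ex in dataset]
    treated = [recall_at_k(retrieve(rewrite(ex["question"])), ex["evidence_ids"], k)
               for ex in dataset]
    return sum(treated) / len(dataset) - sum(base) / len(dataset)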

Minimal retrieval-side check:

from fi.evals import ContextRelevance

# Score the retrieved context against the ORIGINAL user input, not the
# rewritten query, so intent drift introduced by the rewrite still lowers
# the score.
result = ContextRelevance().evaluate(
    input="Can I cancel after moving to annual?",
    context=["Annual plans can be cancelled within 7 days of upgrade."]
)
print(result.score, result.reason)
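
For the trace-fields bullet above, a single record per request is enough to join the rewrite to its retrieval outcome. Field names and values here are illustrative, not a fixed traceAI schema:

# Illustrative trace record joining the rewrite to its retrieval outcome.
trace_record = {
    "user_input": "Can I still cancel if I moved to annual last week?",
    "rewritten_query": "annual plan cancellation policy refund proration first week",
    "rewrite_model": "rewriter-model-name",                      # placeholder
    "top_k": 5,
    "reranker_config": {"model": "reranker-name", "top_n": 3},   # placeholder
    "retrieved_chunk_ids": ["policy_0412", "policy_0377"],
    "final_answer": "Annual plans can be cancelled within 7 days of upgrade.",
}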

Common Mistakes

  • Optimizing for recall alone. A rewrite that retrieves everything also floods the prompt; track ContextPrecision, Groundedness, and token-cost-per-trace in dashboards before rollout.
  • Letting the rewrite answer the question. Query rewriting should reformulate intent for search, not add facts, assumptions, dates, policy exceptions, or conclusions; the added-terms check after this list can flag this cheaply.
  • Scoring against the rewritten query only. The user asked the original question; evaluate relevance against that input to catch semantic drift and intent loss before search.
  • Using one rewrite prompt for every route. Legal, support, code, and sales corpora need different synonyms, constraints, and refusal rules rather than one generic rewrite style.
  • Ignoring latency fan-out. Multi-query expansion can improve recall while breaking p99 latency, retriever rate limits, and gateway cost budgets during peak traffic.
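
The "rewrite answers the question" mistake can be flagged cheaply before search. The sketch below surfaces content words the rewrite introduced that the user never said; the stopword list is illustrative, and because synonym expansion is often intentional, hits should be treated as a review signal rather than a hard block:

# Illustrative guard: surface content words the rewrite added beyond the
# user's own words. Treat hits as a review signal, not a hard block.
STOPWORDS = {"a", "an", "the", "i", "if", "to", "of", "in", "on", "for",
             "can", "did", "does", "is", "are", "my", "this", "after", "still"}

def added_terms(question: str, rewrite: str) -> set:
    def content_words(text: str) -> set:
        return {w.strip("?.,!").lower() for w in text.split()} - STOPWORDS
    return content_words(rewrite) - content_words(question)

A production version would stem or embed terms so morphological variants such as "cancel" and "cancellation" do not count as additions.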

Frequently Asked Questions

What is query rewriting in RAG?

Query rewriting in RAG transforms a user's raw question into one or more retrieval-ready queries before knowledge-base search, improving recall and reducing irrelevant context.

How is query rewriting different from reranking?

Query rewriting changes the search request before retrieval. Reranking orders the retrieved documents after the first search pass, so it cannot recover evidence that was never fetched.

How do you measure query rewriting?

FutureAGI measures its effect with ContextRelevance, ContextRecall, ContextPrecision, and Groundedness by comparing original inputs, rewritten queries, retrieved chunks, and final answers.