What Is HyDE?

HyDE generates a hypothetical answer, embeds it, and uses that embedding to retrieve relevant passages for RAG.

HyDE, short for Hypothetical Document Embeddings, is a RAG retrieval technique that turns a user query into a synthetic answer-like document, embeds that generated text, and uses the embedding to retrieve similar source passages. It shows up in the retrieval stage of production traces, before reranking and generation. The method often improves recall for vague queries, but it can also retrieve context that matches the LLM’s guess rather than the user’s real intent. In FutureAGI, treat HyDE as a retrieval variant to test, not as a grounding guarantee.
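
The mechanics are easy to sketch. The snippet below is a minimal, library-agnostic illustration of the HyDE flow, not FutureAGI or LangChain code: generate, embed, and vector_index are placeholders for whatever LLM client, embedding model, and vector store your stack uses.

HYDE_PROMPT = (
    "Write a short passage that plausibly answers the question below. "
    "It does not need to be factually correct; it only guides retrieval.\n\n"
    "Question: {query}\nPassage:"
)

def hyde_retrieve(query, generate, embed, vector_index, k=5):
    # 1. Generate a hypothetical answer-like document for the query.
    hypothetical_doc = generate(HYDE_PROMPT.format(query=query))
    # 2. Embed the synthetic document instead of the raw query.
    probe = embed(hypothetical_doc)
    # 3. Retrieve real corpus passages nearest to that embedding.
    #    Only these retrieved passages may ground the final answer.
    return vector_index.search(probe, top_k=k)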

Why It Matters in Production LLM and Agent Systems

HyDE changes the failure profile of retrieval. Plain vector search may miss a relevant document because the user query is short, informal, or uses vocabulary absent from the corpus. HyDE can fix that by expanding the query into a richer semantic probe. The risk is that the generated hypothetical document contains the wrong product name, legal clause, date, API, or domain assumption. Retrieval then follows the synthetic answer, not the user’s need.

The symptoms are subtle. Logs show high embedding similarity, but the retrieved chunks do not answer the original question. A RAG answer looks polished while citing adjacent but wrong policy text. Support agents start with a plausible false premise, then call the wrong tool or escalate to the wrong queue. SREs see higher token cost because HyDE adds an LLM call before retrieval; product teams see “close but wrong” answers; compliance reviewers see source citations that look valid until inspected line by line.

This matters more in 2026-era multi-step systems than in single-turn chat. An agent may use HyDE retrieval to decide which tool to call, feed that context into a planner, then persist the result into memory. One retrieval drift can become three downstream decisions. Unlike BM25, which can preserve exact identifiers, HyDE optimizes for semantic neighborhood. That is powerful for vague questions and dangerous for exact-match tasks.

How FutureAGI Evaluates HyDE Retrieval

FutureAGI’s approach is to evaluate HyDE as a retrieval strategy, not as a dedicated product primitive. Since HyDE has no dedicated evaluator of its own, the right workflow is to compare the HyDE branch against a baseline retriever and score the effect on context and answer quality. Use a fi.datasets.Dataset with rows for the original query, baseline context, HyDE context, final answer, and expected source when available.
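
A rough sketch of how those rows can be assembled before loading them into the dataset; baseline_retrieve, hyde_retrieve, and answer_with are placeholders for your own retrieval and generation steps, and the exact fi.datasets.Dataset constructor depends on your SDK version.

def build_comparison_rows(golden_queries, baseline_retrieve, hyde_retrieve, answer_with):
    # One row per golden query, pairing baseline and HyDE context for scoring.
    rows = []
    for item in golden_queries:
        query = item["query"]
        hyde_context = hyde_retrieve(query)
        rows.append({
            "query": query,
            "baseline_context": baseline_retrieve(query),
            "hyde_context": hyde_context,
            "answer": answer_with(query, hyde_context),
            "expected_source": item.get("expected_source"),
        })
    return rows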

For a LangChain RAG workflow, instrument the chain with the traceAI-langchain integration, then run both paths over the same golden dataset. The relevant FutureAGI evaluators are ContextRecall, ContextRelevance, Groundedness, and optionally ChunkAttribution. ContextRecall checks whether the needed source material appears in the retrieved context. ContextRelevance checks whether that context is useful for the query. Groundedness checks whether the final answer stays supported by retrieved evidence.
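
A minimal side-by-side comparison can reuse one evaluator on both branches of each row. The sketch below assumes the evaluate(input=..., context=...) call pattern shown in the ContextRelevance snippet later in this section and the row layout built above.

from fi.evals import ContextRelevance

def compare_branches(row, evaluator=None):
    # Score baseline and HyDE context for the same query and report the delta.
    evaluator = evaluator or ContextRelevance()
    baseline = evaluator.evaluate(input=row["query"], context=row["baseline_context"])
    hyde = evaluator.evaluate(input=row["query"], context=row["hyde_context"])
    return {
        "query": row["query"],
        "baseline_score": baseline.score,
        "hyde_score": hyde.score,
        "delta": hyde.score - baseline.score,
    }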

A practical decision rule is simple: ship HyDE only when recall improves without a relevance or groundedness drop. If ContextRecall.score rises on vague queries but Groundedness.score falls on exact-ID queries, route HyDE only for the vague-query cohort and keep baseline retrieval for identifiers. In our 2026 evals, the useful HyDE deployments are conditional: they improve long-tail semantic search while leaving SKU, ticket, code, and policy-number lookup on exact or hybrid retrieval. The engineer’s next action is a threshold, fallback, or regression eval, not a blanket retriever swap.
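
One way to encode that rule as a regression gate; the metric keys and threshold values below are purely illustrative.

def mean(scores):
    return sum(scores) / len(scores) if scores else 0.0

def promote_hyde(baseline, hyde, min_recall_gain=0.05, max_groundedness_drop=0.0):
    # baseline / hyde: dicts mapping metric name -> list of per-query scores
    # from the same golden dataset; thresholds are illustrative placeholders.
    recall_gain = mean(hyde["context_recall"]) - mean(baseline["context_recall"])
    groundedness_drop = mean(baseline["groundedness"]) - mean(hyde["groundedness"])
    return recall_gain >= min_recall_gain and groundedness_drop <= max_groundedness_drop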

How to Measure or Detect It

Measure HyDE by comparing baseline retrieval and HyDE retrieval on the same query set:

  • ContextRecall: scores whether the retrieved context contains the facts or source entities needed to answer the query.
  • ContextRelevance: scores whether the retrieved passages are actually useful for the user query, not just semantically nearby.
  • Groundedness: scores whether the final answer is supported by the retrieved context.
  • MRR and NDCG@k: rank-sensitive retrieval metrics for labeled query-document pairs (a minimal sketch follows the evaluator snippet below).
  • Dashboard signals: eval-fail-rate-by-cohort, retrieval p99 latency, token-cost-per-trace, and thumbs-down rate after HyDE-enabled answers.

A minimal spot check with the ContextRelevance evaluator looks like this, where retrieved_context holds whatever passages the branch under test returned for the query:

from fi.evals import ContextRelevance

# retrieved_context: passages returned by the retrieval branch under test
# (baseline or HyDE), e.g. a list of chunk strings.
result = ContextRelevance().evaluate(
    input="How do I rotate a FutureAGI API key?",
    context=retrieved_context,
)
print(result.score, result.reason)
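
For the rank-sensitive metrics, a minimal sketch with binary relevance labels follows; ranked is the ordered list of retrieved document ids and relevant is the labeled gold set for the query.

import math

def reciprocal_rank(ranked, relevant):
    # Reciprocal rank of the first relevant hit; MRR is the mean over queries.
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    # Binary-relevance NDCG@k: discounted gain normalized by the ideal ranking.
    dcg = sum(1.0 / math.log2(position + 1)
              for position, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(position + 1)
                for position in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0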

The most important detection pattern is disagreement: HyDE improves recall on broad natural-language questions but worsens relevance on exact lookup tasks. Track those cohorts separately, or the average score will hide the regression.
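
A small helper makes that cohort split explicit; cohort labels such as "vague" and "exact-id" are assumptions you would assign when building the golden dataset.

from collections import defaultdict

def delta_by_cohort(results):
    # results: dicts with "cohort", "baseline_score", "hyde_score" per query.
    # Averaging deltas per cohort keeps a long-tail gain from hiding an
    # exact-lookup regression in the overall mean.
    buckets = defaultdict(list)
    for r in results:
        buckets[r["cohort"]].append(r["hyde_score"] - r["baseline_score"])
    return {cohort: sum(deltas) / len(deltas) for cohort, deltas in buckets.items()}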

Common Mistakes

  • Treating the hypothetical document as evidence. HyDE generates a retrieval probe, not a source. Only retrieved corpus passages can ground the answer.
  • Using one retrieval path for every query. BM25 or hybrid search often beats HyDE on IDs, error codes, clause numbers, and SKUs.
  • Measuring only final-answer faithfulness. If the answer refuses safely, retrieval may still have failed. Score the retrieved context directly.
  • Generating with too much randomness. High-temperature hypothetical documents drift into invented entities, which pulls vector search toward the wrong neighborhood.
  • Skipping agent trace boundaries. Each agent step needs its own retrieval evidence; recycled HyDE context can contaminate later tool calls.

Frequently Asked Questions

What is HyDE?

HyDE is a RAG retrieval technique that asks an LLM to generate a hypothetical answer, embeds that generated text, and searches for source passages near the synthetic embedding. It is useful when raw user queries are too short or mismatched to corpus wording.

How is HyDE different from query rewriting?

Query rewriting reformulates the user's question into a clearer search query. HyDE instead generates an answer-like document and embeds that document, so retrieval is guided by a richer semantic target rather than a rewritten keyword query.

How do you measure HyDE?

Use FutureAGI's ContextRecall and ContextRelevance evaluators to compare baseline retrieval with HyDE retrieval, then use Groundedness on the final answer. Track eval-fail-rate-by-cohort for vague, long-tail, and out-of-domain queries.