What Is Hybrid Search?
A RAG retrieval method that fuses keyword and vector search to improve recall for exact identifiers and semantic matches.
Hybrid search is a RAG retrieval method that combines sparse keyword search, often BM25, with dense vector search to retrieve both exact identifiers and semantically similar passages. It appears in production traces before reranking and generation, usually as fused Elasticsearch and vector-query spans. FutureAGI uses traceAI-elasticsearch to observe those retrieval paths, then scores returned context with ContextRelevance, ContextPrecision, and Groundedness so engineers can catch missing evidence before the model answers.
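A minimal query sketch of that fused path, assuming Elasticsearch 8.x and its Python client; the index name, field names, and placeholder vector below are illustrative:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query_text = "Can plan AP-8127 export audit logs?"
query_vector = [0.1] * 768  # placeholder; use your embedding model's output in practice

# One request carries both branches: BM25 over the text field plus approximate kNN
# over the embedding field. Elasticsearch returns a single fused hit list.
resp = es.search(
    index="support-docs",                    # illustrative index name
    query={"match": {"body": query_text}},   # sparse / BM25 branch
    knn={
        "field": "embedding",
        "query_vector": query_vector,
        "k": 10,
        "num_candidates": 100,
    },
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])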
Why It Matters in Production LLM and Agent Systems
Hybrid search failures usually show up as false negatives. Dense retrieval can miss an exact SKU, ticket ID, regulation clause, customer name, or method signature because the embedding model treats it as a low-signal token. BM25 can miss a paraphrased support question because the user did not reuse the words in the knowledge base. Either miss becomes a downstream answer problem: the LLM may generate a plausible response from weak context, cite the wrong document, or ask an agent to call a tool with the wrong entity.
The pain lands on several teams. Developers see high answer variance for entity-heavy queries. SREs see retrieval p99 climb when top-k is inflated to compensate for weak recall. Compliance teams see audit tickets where the final answer lacks the right source. End users see confident answers that are one account, policy, or product version off.
The symptoms are concrete in logs: low overlap between BM25 and vector hits, repeated reranker demotions of dense-only results, rising no-answer rate for exact IDs, and answer quality that drops on long-tail tenants. In 2026-era agentic systems, this matters more because retrieval is often only one step inside a longer plan. A single missed document can make an agent choose the wrong tool, skip a required clarification, or carry bad context across several turns. Unlike dense-only vector search, hybrid search gives engineers a controlled way to preserve exact-match recall while still covering semantic phrasing.
How FutureAGI Handles Hybrid Search
FutureAGI’s approach is to treat hybrid search as an observable retrieval decision, not a black-box index setting. The traceAI-elasticsearch integration instruments Elasticsearch retrieval spans directly in the serving path. A support RAG service can emit one trace where the sparse branch records the BM25 query, the dense branch records the vector query, and the fused retrieval span records returned document IDs, scores, latency, and top-k.
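A minimal instrumentation sketch, assuming traceAI-elasticsearch follows the usual traceAI instrumentor pattern; the register helper, the ElasticsearchInstrumentor class name, and the project name below are assumptions to verify against the current SDK docs:
from fi_instrumentation import register                      # assumed setup helper; verify in the SDK docs
from traceai_elasticsearch import ElasticsearchInstrumentor  # assumed instrumentor class name

# Register a tracer provider for the project, then patch the Elasticsearch client so
# each search call emits a retrieval span with query, hit IDs, scores, and latency.
tracer_provider = register(project_name="support-rag")       # project name is illustrative
ElasticsearchInstrumentor().instrument(tracer_provider=tracer_provider)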
FutureAGI then attaches evaluators to the span outputs. ContextRelevance checks whether the returned passages answer the user query. ContextPrecision measures whether relevant items are ranked early enough to enter the prompt. Groundedness catches the final failure mode: the LLM used retrieved context incorrectly or added unsupported claims. For ranking regression work, teams can also track NDCG and MRR on labeled query sets, especially for entity-heavy cohorts.
A real workflow: an enterprise search assistant starts failing queries such as “Can plan AP-8127 export audit logs?” The trace shows Elasticsearch returned the exact plan document in the BM25 branch, but the fusion policy pushed three generic billing articles above it. The engineer lowers the vector weight for queries matching account-code patterns, adds a reranker after fusion, and reruns a golden dataset with product codes, acronyms, and policy names. If ContextRelevance p10 rises and Groundedness does not regress, the change ships behind a cohort threshold. Unlike a final-answer-only Ragas faithfulness check, this catches the retrieval defect before generation hides it.
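One way to implement that routing, sketched with a hypothetical account-code pattern and illustrative weights:
import re

# Hypothetical routing rule: queries that contain an account or plan code such as "AP-8127"
# lean on the BM25 branch; everything else keeps the default fusion weights.
ACCOUNT_CODE = re.compile(r"\b[A-Z]{2,4}-\d{3,6}\b")

def fusion_weights(query: str) -> tuple[float, float]:
    """Return (bm25_weight, vector_weight) used when fusing branch scores."""
    if ACCOUNT_CODE.search(query):
        return 0.8, 0.2   # exact identifiers: favor keyword recall
    return 0.5, 0.5       # default: balanced hybrid

print(fusion_weights("Can plan AP-8127 export audit logs?"))  # (0.8, 0.2)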
How to Measure or Detect It
Measure hybrid search by separating recall, ranking, answer support, and runtime cost:
- Branch recall: percent of labeled queries where BM25, vector, or the fused set retrieves the expected document (sketched after the code example below).
- ContextRelevance: FutureAGI evaluator score for whether retrieved context answers the query.
- ContextPrecision and NDCG: ranking signals that show whether useful documents appear early enough for the prompt budget.
- Trace signals: Elasticsearch span latency, top-k, returned document IDs, retrieval scores, and branch overlap.
- Production proxies: no-answer rate, thumbs-down rate, escalation rate, and eval-fail-rate-by-cohort.
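For example, scoring one query and one retrieved passage with the ContextRelevance evaluator: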
from fi.evals import ContextRelevance
query = "When does order 8127 ship?"
contexts = ["Order 8127 ships on May 9, 2026."]
score = ContextRelevance().evaluate(input=query, context=contexts)
print(score.score)
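Branch recall and branch overlap, noted above, can be tracked with a few lines over a labeled query set; the data below is illustrative:
# Illustrative labeled set: expected doc ID plus the IDs each branch returned.
labeled = [
    {"expected": "doc-8127", "bm25": ["doc-8127", "doc-12"], "vector": ["doc-44", "doc-12"]},
    {"expected": "doc-31",   "bm25": ["doc-9"],              "vector": ["doc-31", "doc-9"]},
]

def branch_recall(rows, branch):
    # Fraction of queries where the branch returned the expected document.
    return sum(r["expected"] in r[branch] for r in rows) / len(rows)

def branch_overlap(rows):
    # Jaccard overlap between the two branches' hit sets, averaged over queries.
    scores = []
    for r in rows:
        a, b = set(r["bm25"]), set(r["vector"])
        scores.append(len(a & b) / len(a | b) if a | b else 0.0)
    return sum(scores) / len(scores)

print(branch_recall(labeled, "bm25"), branch_recall(labeled, "vector"), branch_overlap(labeled))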
Watch the tails, not only the mean. Hybrid search often looks healthy on broad FAQ traffic while failing on IDs, abbreviations, policy names, and recently added documents.
Common Mistakes
- Mixing BM25 and vector scores without normalization. The scores live on different scales; use rank fusion or tuned weights (see the sketch after this list).
- Treating high recall as enough. Bigger top-k can bury the exact match under semantic noise and inflate prompt tokens.
- Skipping entity-heavy test sets. Hybrid search proves its value on SKUs, codes, names, dates, and acronyms, not generic FAQ questions.
- Evaluating only final answers. A good answer can hide retrieval misses; score context before generation with ContextRelevance.
- Sharing one fusion policy across tenants. Corpora differ; legal, support, and code indexes need separate weights and thresholds.
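For the score-normalization point above, reciprocal rank fusion (RRF) is a common scale-free alternative to raw score mixing; a minimal sketch:
def rrf(bm25_ids, vector_ids, k=60):
    """Fuse two ranked ID lists with reciprocal rank fusion; earlier in the output is better."""
    scores = {}
    for ranked in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# The exact-match hit ranked first by BM25 stays near the top even if the vector branch missed it.
print(rrf(["doc-8127", "doc-12", "doc-44"], ["doc-44", "doc-12", "doc-99"]))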
Frequently Asked Questions
What is hybrid search in RAG?
Hybrid search combines keyword retrieval, often BM25, with vector retrieval so a RAG system can find exact terms and semantic matches in the same query path.
How is hybrid search different from vector search?
Vector search ranks by embedding similarity only. Hybrid search adds sparse keyword matching, then fuses or reranks both result sets to handle product codes, names, legal citations, and paraphrased questions.
How do you measure hybrid search?
FutureAGI traces Elasticsearch retrieval with traceAI-elasticsearch and scores returned context with ContextRelevance, ContextPrecision, NDCG, and Groundedness across labeled query cohorts.