What Is Semantic Search?

Retrieval that ranks content by meaning using embeddings, not just exact keyword overlap.

Semantic search is retrieval that ranks content by meaning rather than exact keyword overlap, usually by comparing embeddings. In a RAG pipeline, it appears inside the retriever span: the user query becomes a vector, the vector database returns top-k candidate chunks, and the model uses those chunks as context. FutureAGI treats semantic search as a measurable RAG component, not a black box, by tracing retrieval calls and scoring whether returned context actually supports the answer.
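The retriever step described above (query becomes a vector, the index returns top-k candidates) can be sketched in a few lines. This is a minimal illustration, not how any particular vector database implements it: the three-dimensional vectors, the document IDs, and the `top_k` helper are invented for the example, and a production system would use a real embedding model and an approximate nearest-neighbor index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Score every document against the query and keep the k best."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings" (hypothetical values, chosen by hand for illustration).
index = {
    "rotate-api-key": [0.9, 0.1, 0.0],
    "annual-renewal": [0.1, 0.9, 0.2],
    "monthly-trial": [0.2, 0.7, 0.6],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for the embedded user query

results = top_k(query_vec, index, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

The chunks behind the returned IDs are what the model would receive as context.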

The 2026 backdrop: embedding models have consolidated around a small set of strong production options (OpenAI text-embedding-3-large, Cohere Embed v4, Voyage v3, Google Gemini Embedding), and most teams now pair semantic search with keyword retrieval (hybrid search) and a reranker by default. On RAGTruth’s 18K labeled chunks and the BRIGHT benchmark, hybrid (BM25 + dense) plus reranking beats pure dense semantic search by 12-20 points on top-1 retrieval precision, a gap that holds across every embedding family. Pure semantic retrieval as a complete stack is mostly an anti-pattern at this point.
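One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF). The text above does not say which fusion method is used, so treat this as one possible sketch: the document IDs and both input rankings are hypothetical, and `k=60` is a conventional default rather than a tuned value. A reranker would then reorder the fused list; that step is not shown.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs.

    Each document scores 1 / (k + rank) per list it appears in,
    and the per-list scores are summed.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical rankings for a query that contains an exact product code.
bm25_ranking = ["sku-4471-spec", "returns-policy", "pricing-faq"]
dense_ranking = ["pricing-faq", "sku-4471-spec", "shipping-guide"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, which is why the exact-match hit from BM25 survives even though the dense retriever ranked it second.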

Why semantic search matters in production LLM and agent systems

Poor semantic search causes quiet RAG failures. The LLM may answer fluently because it received context, while the retrieved chunks are merely adjacent to the question. A billing assistant might retrieve a renewal policy for annual plans when the user asked about a monthly trial. A support agent might find a migration guide for the right product but the wrong version. The output looks grounded until a customer, compliance reviewer, or on-call engineer asks which source actually supported the claim.

The symptoms are usually split across systems. Retrieval logs show high similarity scores, but product metrics show thumbs-down feedback, repeated clarifying questions, or escalation-rate spikes. SREs see p99 retrieval latency rise when teams increase top-k to compensate for poor recall. ML engineers see Groundedness failures downstream and have to determine whether the generator ignored good evidence or the retriever supplied weak evidence.

Agentic systems make the failure sharper. A multi-step agent may use semantic search to pick a tool, choose a policy, fetch account context, or write a follow-up task. One retrieval miss can route the next five steps through the wrong branch. Unlike BM25 keyword search, semantic search can bridge vocabulary gaps, but it can also over-match broad concepts and miss exact constraints such as product codes, jurisdictions, dates, or customer entitlements.

FutureAGI’s approach is to treat semantic search as a retriever that can be traced, evaluated, and compared across releases. With traceAI-pinecone (or the equivalent for Weaviate, Qdrant, Milvus, pgvector), a vector-database query becomes a trace span rather than an invisible library call. Engineers can inspect the query, index or namespace, top-k setting, retrieved document identifiers, retrieval.score, retrieval.documents, and retrieval latency beside the model span that consumed the context.
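To make the idea of a retrieval span concrete, here is a minimal sketch of the kind of record such a span could carry. It does not use or imitate the traceAI API: `RetrievalSpan`, `traced_retrieve`, and `fake_search` are hypothetical names, and `retrieval.latency_ms` and `retrieval.empty` are assumed attribute names added for illustration (only `retrieval.score` and `retrieval.documents` come from the text above).

```python
import time
from dataclasses import dataclass, field

@dataclass
class RetrievalSpan:
    """Hypothetical span record for one vector-database query."""
    query: str
    index_name: str
    top_k: int
    attributes: dict = field(default_factory=dict)

def traced_retrieve(query, index_name, top_k, search_fn):
    """Run search_fn and capture what a reviewer needs to debug a miss."""
    span = RetrievalSpan(query=query, index_name=index_name, top_k=top_k)
    start = time.perf_counter()
    hits = search_fn(query, top_k)  # expected: list of (doc_id, score)
    span.attributes["retrieval.latency_ms"] = (time.perf_counter() - start) * 1000
    span.attributes["retrieval.documents"] = [doc_id for doc_id, _ in hits]
    span.attributes["retrieval.score"] = [score for _, score in hits]
    span.attributes["retrieval.empty"] = not hits
    return hits, span

def fake_search(query, top_k):
    """Stand-in for a vector-database client; returns canned results."""
    canned = [("kb-102", 0.83), ("kb-087", 0.79), ("kb-311", 0.61)]
    return canned[:top_k]

hits, span = traced_retrieve("How do I rotate an API key?", "support-docs", 2, fake_search)
print(span.attributes["retrieval.documents"], span.attributes["retrieval.score"])
```

With the query, index, top-k, document IDs, scores, and latency on one record, a weak answer can be traced back to what the retriever actually returned.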

| Decision | What to compare | Evaluator |
| --- | --- | --- |
| New embedding model | Recall on golden queries | ContextRelevance |
| Top-k tuning | Precision vs latency | ContextPrecision |
| Adding a reranker | Rank improvement | ContextPrecision, p99 latency |
| Switching to hybrid search | Recall on exact-match tokens | ContextRelevance per cohort |
| Index reshard | Empty-result rate | retrieval span counts |
| Embedding-model snapshot update | Score distribution shift | retrieval-score histogram |

The evaluation layer separates retrieval quality from answer quality. ContextRelevance checks whether the returned context matches the user’s query. ContextPrecision measures retrieval ranking quality, that is, whether the best evidence appears before weaker matches. Groundedness then evaluates whether the final response is grounded in the provided context. That distinction matters: unlike a Ragas faithfulness-only workflow, the team can see whether the answer failed because retrieval was wrong or because generation ignored good evidence.

A typical workflow starts with a golden query set from production. The team tests a new embedding model or vector-index configuration, runs both variants through the same questions, and compares ContextRelevance, ContextPrecision, MRR, empty-result rate, and retrieval p99. If the new index improves p10 ContextRelevance but adds 180 ms at p99, the engineer can gate the rollout, lower top-k, add a reranker, or keep the old index for latency-sensitive cohorts.
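The rollout decision in that workflow can be expressed as a small gate. This is a sketch under stated assumptions: the per-query scores and latencies are made-up numbers, the 100 ms latency budget is a hypothetical threshold, and `percentile` uses the simple nearest-rank definition rather than any particular monitoring tool's method.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def gate_rollout(baseline, candidate, max_added_latency_ms=100.0):
    """Compare p10 relevance and p99 latency between two index variants."""
    relevance_gain = percentile(candidate["relevance"], 10) - percentile(baseline["relevance"], 10)
    added_latency = percentile(candidate["latency_ms"], 99) - percentile(baseline["latency_ms"], 99)
    if relevance_gain <= 0:
        return "keep-baseline"
    if added_latency > max_added_latency_ms:
        return "hold-for-review"  # better quality, but over the latency budget
    return "promote"

# Hypothetical per-query results on the same golden query set.
baseline = {
    "relevance": [0.42, 0.55, 0.61, 0.70, 0.74, 0.78, 0.81, 0.85, 0.88, 0.91],
    "latency_ms": [40, 42, 45, 47, 50, 52, 55, 60, 70, 120],
}
candidate = {
    "relevance": [0.58, 0.63, 0.69, 0.74, 0.79, 0.82, 0.84, 0.88, 0.90, 0.93],
    "latency_ms": [45, 48, 50, 55, 58, 60, 66, 75, 90, 300],
}

decision = gate_rollout(baseline, candidate)
print(decision)
```

Here the candidate lifts the weakest-decile relevance but adds 180 ms at p99, so the gate holds the rollout instead of promoting it; lowering top-k or routing only latency-tolerant cohorts to the new index are the follow-up options the text describes.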

Measure semantic search before the generator has a chance to hide retrieval errors.

  • ContextRelevance: scores whether retrieved context matches the user’s query intent.
  • ContextPrecision: measures retrieval ranking quality, especially whether the best chunks appear near the top.
  • Recall@k: tracks whether known gold documents appear in the top-k result set.
  • MRR and NDCG: catch cases where the right result exists but is ranked too low to be useful.
  • Trace signals: retrieval.score, retrieval.documents, index name, top-k, empty-result rate, and retrieval p99 by cohort.
  • User proxy: correlate thumbs-down rate and escalation rate with the retriever span, not only the final LLM span.

from fi.evals import ContextRelevance, ContextPrecision

# Example inputs; in a real pipeline these come from the retriever and the generator.
question = "How do I rotate an API key?"
retrieved_chunks = [
    "API keys can be rotated from Settings > Security.",
    "Billing contacts are managed under Settings > Billing.",
]
final_answer = "Go to Settings > Security and rotate the key there."

# Query-to-context fit for a single retrieved chunk.
ctx = ContextRelevance().evaluate(
    input=question,
    context=retrieved_chunks[0],
)
# Ranking quality across the retrieved chunks, given the final answer.
prec = ContextPrecision().evaluate(
    input=question,
    context=retrieved_chunks,
    output=final_answer,
)
print(ctx.score, prec.score)
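The rank-based metrics from the list above (Recall@k, MRR, NDCG) need only retrieved IDs and a set of gold IDs, so they can be computed without an evaluator model. The sketch below uses the standard binary-relevance definitions; the document IDs and gold labels are hypothetical.

```python
import math

def recall_at_k(retrieved, gold, k):
    """Fraction of gold documents that appear in the top-k results."""
    if not gold:
        return 0.0
    return len(set(retrieved[:k]) & set(gold)) / len(gold)

def reciprocal_rank(retrieved, gold):
    """1 / rank of the first gold document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR over a list of (retrieved, gold) pairs, one pair per query."""
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs) if runs else 0.0

def ndcg_at_k(retrieved, gold, k):
    """NDCG@k with binary relevance (a document is either gold or not)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in gold
    )
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical result list and gold labels for one query.
retrieved = ["d7", "d2", "d9", "d4"]
gold = {"d2", "d4"}

recall = recall_at_k(retrieved, gold, k=3)
rr = reciprocal_rank(retrieved, gold)
ndcg = ndcg_at_k(retrieved, gold, k=3)
print(recall, rr, round(ndcg, 3))
```

In this example one of two gold documents is in the top 3 and the first hit is at rank 2, so recall and reciprocal rank are both 0.5, while NDCG is lower because the hit is not at rank 1 and the second gold document is missing from the top 3.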

Common mistakes

  • Treating semantic similarity as correctness. A retrieved passage can sound related while missing the required policy, version, entitlement, or account state.
  • Trusting top-k without reranking. Dense retrieval finds the right neighborhood but not the best evidence at rank one; the reranker is what closes that gap.
  • Ignoring lexical constraints. Error codes, SKUs, laws, and person names often need BM25 or hybrid search beside embeddings.
  • Comparing embedding models on vendor benchmarks instead of a golden dataset built from real production queries. MTEB rankings rarely predict retrieval quality on your corpus.
  • Hiding retriever spans behind the final LLM call. You need query, result IDs, scores, and latency to debug retrieval misses.
  • Not pinning the embedding model snapshot. Vendor updates change similarity geometry without warning; a static “text-embedding-3-large” string can refer to two different models months apart.
  • Indexing once and never re-evaluating. Source documents evolve; an index built six months ago will quietly drift even when the embedding model is pinned.
  • Confusing dense recall with semantic understanding. Dense retrieval finds the right neighborhood, but the neighborhood may not contain the policy clause the answer needs. A reranker plus a precision evaluator usually does more for quality than a bigger embedding model.
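One way to notice an unpinned embedding model changing underneath an index is a canary check: store the vectors of a few fixed probe strings at index time, re-embed the same strings later, and compare. This is an illustrative sketch, not a FutureAGI feature. `embed_v1` and `embed_v2` are toy stand-ins for two model snapshots, and the probe texts and the 0.999 threshold are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def snapshot_drifted(stored_probes, embed_fn, min_similarity=0.999):
    """True if any probe text now embeds differently than at index time."""
    for text, stored_vec in stored_probes.items():
        if cosine(stored_vec, embed_fn(text)) < min_similarity:
            return True
    return False

# Toy embedding functions standing in for two snapshots of "the same" model.
def embed_v1(text):
    return [float(len(text)), text.count("a") + 1.0, 1.0]

def embed_v2(text):
    return [1.0, text.count("a") + 1.0, float(len(text))]

probe_texts = ["refund policy", "rotate an API key"]
stored_probes = {text: embed_v1(text) for text in probe_texts}  # saved at index time

print(snapshot_drifted(stored_probes, embed_v1))  # same snapshot
print(snapshot_drifted(stored_probes, embed_v2))  # changed snapshot
```

A check like this covers only the model side; drift from evolving source documents still needs periodic re-evaluation against the golden query set.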

Frequently Asked Questions

What is semantic search?

Semantic search ranks content by meaning rather than exact keyword overlap. In RAG, it turns queries and documents into embeddings so the retriever can return context that matches user intent.

How is semantic search different from keyword search?

Keyword search such as BM25 matches tokens and term frequency. Semantic search uses embeddings to match intent, so it can find relevant passages even when query and document wording differs.

How do you measure semantic search?

FutureAGI measures semantic search with ContextRelevance for query-to-context fit, ContextPrecision for ranking quality, and traceAI-pinecone spans for retrieval scores and latency.