RAG

What Is Vector Search?

Similarity retrieval over high-dimensional embedding vectors using approximate nearest-neighbour algorithms, returning top-k semantically closest matches.

Vector search is similarity retrieval over high-dimensional embedding vectors. A query is encoded by an embedding model into a vector, then a vector database returns the top-k indexed vectors closest to that query under a distance metric — typically cosine similarity, dot product, or L2. To make this sub-linear at scale, vector DBs use approximate nearest-neighbour (ANN) algorithms like HNSW (hierarchical navigable small worlds) or IVF (inverted file). Vector search is the default retriever in modern RAG: it surfaces semantically relevant passages even when query and document share no exact keywords. Hybrid search combines it with BM25 keyword matching for stronger precision.
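
A minimal brute-force sketch of that core operation, assuming numpy and toy 4-dimensional vectors in place of a real embedding model; a production vector DB replaces the exhaustive scan with an ANN index such as HNSW or IVF:

import numpy as np

def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3):
    """Brute-force cosine-similarity search: score the query against every document vector."""
    # Normalise so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(-scores)[:k]  # indices of the k highest-scoring documents
    return [(int(i), float(scores[i])) for i in top]

# Illustrative 4-dim "embeddings"; real models emit hundreds to thousands of dimensions.
docs = np.array([[0.1, 0.9, 0.0, 0.2],
                 [0.8, 0.1, 0.3, 0.0],
                 [0.2, 0.7, 0.1, 0.4]])
query = np.array([0.15, 0.8, 0.05, 0.3])
print(cosine_top_k(query, docs, k=2))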

Why It Matters in Production LLM and Agent Systems

Vector search is what makes RAG work over real-world corpora where users ask questions in natural language and documents use different vocabulary. A keyword index would miss “how do I cancel my account?” against a doc titled “Subscription Termination Policy”; vector search hits it because the embeddings capture intent. That capability is the reason RAG works at all on customer-support, knowledge-base, and research-agent workloads.

The pain shows up across roles. Retrieval engineers see recall ceilings tied to embedding-model choice — switching from text-embedding-3-small to a domain-tuned model can lift recall 15+ points. Platform engineers see latency budgets blown by ANN parameters tuned for accuracy over speed. ML engineers see RAG hallucinations that turn out to be vector-search misses: the right chunk wasn’t in top-k. Cost engineers see token-per-query ratios spike when over-fetching k=20 to compensate for poor retrieval.

In 2026, vector search is no longer the only retriever in production. Hybrid search (vector + BM25) is the default for high-precision use cases. Multi-vector retrieval (ColBERT-style) trades index size for query precision. Sparse-dense fusion combines lexical and semantic signals. Choosing the retriever is now a per-workload decision, and one best made with component-level evals rather than by intuition.
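
As one illustration of sparse-dense fusion, reciprocal rank fusion (RRF) merges a BM25 ranking with a vector ranking without needing comparable scores; the document ids and the k=60 smoothing constant below are illustrative assumptions, not a recommended configuration.

from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists: each doc accumulates 1 / (k + rank) across lists."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

bm25_hits  = ["doc-7", "doc-2", "doc-9"]   # lexical ranking (illustrative)
dense_hits = ["doc-2", "doc-4", "doc-7"]   # vector ranking (illustrative)
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))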

FutureAGI’s approach is to instrument vector-search calls so retrieval quality is observable end-to-end. The traceAI-pinecone, traceAI-weaviate, traceAI-qdrant, traceAI-chromadb, and other vector-DB integrations auto-emit OpenTelemetry spans for every search call. Each span carries attributes like retrieval.documents (the actual chunk text returned), retrieval.score (the similarity score per result), vector.metric (the distance function used), and vector.collection (which index served the query). These attributes flow into the trace timeline so an engineer can see exactly what was retrieved and how confidently.
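
The traceAI integrations emit these spans automatically; the snippet below is only a conceptual sketch of that attribute shape, hand-rolled against the standard OpenTelemetry Python SDK with made-up values.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Minimal OTel setup so the span is actually exported (to stdout here).
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("retrieval-demo")

# Illustrative span: the attribute names mirror those described above, values are made up.
with tracer.start_as_current_span("vector.search") as span:
    span.set_attribute("vector.collection", "support-docs")
    span.set_attribute("vector.metric", "cosine")
    span.set_attribute("retrieval.documents", ["Free trials end 14 days after signup unless upgraded earlier."])
    span.set_attribute("retrieval.score", [0.82])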

On the eval side, fi.evals.ContextRelevance scores whether the returned vectors actually match query intent — the canonical vector-search-quality signal. ChunkAttribution and ChunkUtilization complete the diagnostic: did the LLM downstream use the retrieved vectors, and how well? These run on every sampled production trace as an online eval, and on the golden dataset offline before any retriever change ships.

A typical FutureAGI workflow: a team is testing whether to upgrade their embedding model from a 1536-dim general-purpose embedding to a 1024-dim domain-tuned one. They re-embed the corpus, run the same golden dataset against both indexes, and compare ContextRelevance and ChunkAttribution side by side in the FutureAGI dashboard. The domain-tuned model lifts ContextRelevance p10 from 0.61 to 0.78; they ship it. That decision rested on three numbers, not a vendor benchmark.
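
A sketch of that offline comparison, reusing the ContextRelevance signature shown in the measurement section below; golden_queries and the two retrieve_* helpers are hypothetical stand-ins for the golden dataset and the two candidate indexes.

import numpy as np
from fi.evals import ContextRelevance

def p10_context_relevance(queries, retrieve):
    """Return the 10th-percentile ContextRelevance score over a golden query set."""
    scores = []
    for query in queries:
        context = retrieve(query)  # hypothetical helper: top chunk from the index under test
        result = ContextRelevance().evaluate(input=query, context=context)
        scores.append(result.score)
    return float(np.percentile(scores, 10))

golden_queries = ["When does my trial expire?", "How do I cancel my account?"]  # illustrative

# retrieve_general and retrieve_domain_tuned are hypothetical wrappers around the two indexes.
baseline  = p10_context_relevance(golden_queries, retrieve_general)
candidate = p10_context_relevance(golden_queries, retrieve_domain_tuned)
print(f"ContextRelevance p10: {baseline:.2f} -> {candidate:.2f}")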

How to Measure or Detect It

Vector-search quality is measurable at multiple grains:

  • fi.evals.ContextRelevance: 0–1 score on whether retrieved chunks match the query — the canonical retrieval-quality signal.
  • Recall@k on a labelled set: percentage of queries where the gold doc appears in top-k — the canonical recall benchmark (a small helper for this and MRR follows the code sample below).
  • fi.evals.ChunkAttribution: pass/fail on whether the LLM downstream used the retrieved vectors.
  • Query latency: p50/p99 on retrieve spans — auto-captured by traceAI vector-DB integrations.
  • MRR (Mean Reciprocal Rank) on labelled queries: position of the first correct hit — sensitive to ranking quality.
  • OTel attributes: retrieval.documents, retrieval.score, vector.metric, vector.collection — emitted on every search.
A minimal spot-check of ContextRelevance with the eval SDK:

from fi.evals import ContextRelevance

result = ContextRelevance().evaluate(
    input="When does my trial expire?",
    context="Free trials end 14 days after signup unless upgraded earlier."
)
print(result.score, result.reason)
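
For the Recall@k and MRR rows above, a small helper is enough once a labelled set exists; the ranked result ids and gold ids below are illustrative.

def recall_at_k(ranked_ids: list[list[str]], gold_ids: list[str], k: int = 5) -> float:
    """Fraction of queries whose gold document appears in the top-k results."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)

def mean_reciprocal_rank(ranked_ids: list[list[str]], gold_ids: list[str]) -> float:
    """Average of 1/rank of the first correct hit (0 when the gold doc is missing)."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)

# Illustrative labelled set: ranked retriever output per query and the gold doc id.
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d8", "d1"]]
gold   = ["d1", "d2", "d6"]
print(recall_at_k(ranked, gold, k=3), mean_reciprocal_rank(ranked, gold))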

Common Mistakes

  • Defaulting to cosine similarity without checking what the embedding model expects. Some embedding models are trained for dot product, not cosine — wrong metric drops quality silently.
  • Skipping a reranker on dense-only top-k. Vector search has decent recall but mediocre top-1 precision; a cross-encoder reranker on top-20 → top-3 typically lifts answer quality measurably (see the sketch after this list).
  • Tuning ef/nlist ANN parameters by intuition. These are recall-vs-latency knobs; tune them against ContextRelevance and p99 latency, not vibes.
  • Embedding once and never re-embedding. Embedding-model upgrades invalidate the entire index. Plan re-embed cycles as part of the retrieval roadmap.
  • Using vector search alone for queries with proper nouns or codes. Hybrid search (vector + BM25) almost always beats vector-only for entity-heavy queries.
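
A minimal reranking sketch for the second point above, using a public cross-encoder from sentence-transformers; the checkpoint name and candidate passages are illustrative, not a recommendation.

from sentence_transformers import CrossEncoder

# Illustrative public reranker checkpoint; swap for whatever your stack standardises on.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "When does my trial expire?"
candidates = [  # e.g. the dense top-20, truncated here for brevity
    "Free trials end 14 days after signup unless upgraded earlier.",
    "Enterprise plans include a dedicated account manager.",
    "You can export your data at any time from the settings page.",
]

# Score each (query, passage) pair and keep the best 3 by cross-encoder score.
scores = reranker.predict([(query, passage) for passage in candidates])
top3 = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)[:3]
for passage, score in top3:
    print(f"{score:.3f}  {passage}")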

Frequently Asked Questions

What is vector search?

Vector search is similarity retrieval over high-dimensional embedding vectors. A query is encoded as a vector, then the vector database returns the top-k closest indexed vectors under a distance metric like cosine or dot product.

How is vector search different from keyword search?

Keyword search (BM25) matches exact tokens and their statistical importance. Vector search matches semantic meaning via embeddings, finding relevant passages even when query and document share no words. Production systems often combine both as hybrid search.

How do you measure vector-search quality?

FutureAGI's fi.evals ContextRelevance returns 0–1 on whether retrieved vectors actually match query intent, and ChunkAttribution returns pass/fail on whether the answer used what was retrieved.