Evaluating Vector Database Recall Quality in 2026
Vendor vector-DB benchmarks are theater. ANN-vs-exact-knn recall on your vectors plus p99 under your filter cardinality is the eval that decides prod.
Most vector-DB vendor benchmark pages report Recall@10 of 0.97 on a 1M-vector synthetic corpus with no filters, no payload reads, and a hand-tuned HNSW config. Then you turn on tenant filters at 1% selectivity, plug in your actual embeddings, push real QPS, and recall@10 collapses to 0.79 inside an afternoon. The benchmark just didn’t measure what production looks like.
The opinion this post earns: vendor vector-database benchmarks are theater. They measure ANN-vs-exact recall on synthetic queries against an index nobody deploys, then publish a chart with “p99 < 2 ms” in the title. Your production has metadata payloads, hybrid scoring, multi-tenant filters with 0.5% selectivity, and an HNSW configuration somebody picked once and never re-tuned. The eval that matters is ANN-vs-exact-knn recall on your vectors, p99 latency under your filter cardinality, and filter correctness on your payload schema. Without that, you are picking a vendor by leaderboard theater.
This post is the methodology: why vendor benchmarks don’t transfer, the ANN-vs-exact-knn protocol, p99 latency under realistic filter cardinality, hybrid search lift, filter correctness as a unit-test surface, and the production patterns that cause silent recall drift. For the upstream embedding decision, Evaluating Embedding Models in 2026 sets the recall floor every layer below inherits; this post is the layer below.
TL;DR: the four numbers that decide a vector DB
| Number | What it measures | Why vendor pages skip it |
|---|---|---|
| ANN recall vs exact-knn | Did the ANN return the real top-k? | They report against a synthetic ground truth, not yours |
| p99 latency by filter selectivity | Tail under realistic filter load | They publish unfiltered or 50% selectivity |
| Hybrid lift (dense + BM25 vs dense) | Recall gain from fusion | Most vendor charts skip BM25 entirely |
| Filter correctness on payload | Does the filter actually hold? | A correctness bug looks like a recall miss on the page |
Two non-negotiables. First, compute exact-knn ground truth on your vectors before you compare anything; vendor recall numbers are scored against the vendor’s ground truth, not yours. Second, bucket every latency number by filter selectivity; a single p99 figure is meaningless when 0.5% filters and 50% filters live on opposite sides of a graph-walk cliff.
Why vendor benchmarks do not transfer
Four reasons your downstream eval lies if you pick a vector DB from a vendor leaderboard.
Synthetic queries, synthetic distributions. Most vendor benchmarks use SIFT, GIST, or a slice of MS MARCO. None of those share your embedding model, chunk size, or query shape. A 0.97 recall@10 on SIFT-1M tells you the implementation works; it tells you nothing about your 12M embeddings of 6-token enterprise tickets.
Filters are skipped or run at 50% selectivity. HNSW with post-filtering walks the graph, collects candidates, then discards what fails the filter. At 50% selectivity, half survive and recall barely moves. At 0.5%, the walk runs out of nearest neighbors before k candidates pass, and recall drops 10-20 points. Vendor benchmarks rarely publish the low-selectivity tail because the chart looks bad.
Payloads are absent from the benchmark, dominant in production. Reading 4 KB of JSON payload per hit doubles the per-query memory hop. Vendor benchmarks disable payload reads. Your production code reads the payload on every hit, every time.
HNSW parameters are hand-tuned for the slide. ef_construction=400, M=64, ef_search=300 is the published recipe. It produces a beautiful chart and a 4x larger index than the default ef_construction=200, M=16, ef_search=100. Nobody runs the published config in production; everybody runs defaults.
Use vendor benchmarks to drop the bottom half of the field, then run the four-number eval below on your data. The candidate that wins the leaderboard can lose your workload. The candidate that loses the leaderboard can win on filter-heavy multi-tenant traffic. Both are invisible without your ground truth.
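The post-filtering collapse described in the second reason above is mostly arithmetic, and a rough sketch makes it concrete. The model below is a simplification, not how any particular engine behaves: it assumes each candidate the graph walk collects passes the filter independently with probability equal to the filter selectivity. The function names are illustrative.

```python
from math import comb


def expected_survivors(n_candidates: int, selectivity: float) -> float:
    """Expected number of collected candidates that pass the filter."""
    return n_candidates * selectivity


def prob_at_least_k_survivors(n_candidates: int, selectivity: float, k: int) -> float:
    """P(at least k of n_candidates pass the filter).

    Assumption: candidates pass independently with probability `selectivity`
    (a binomial model; real filters are often correlated with vector position).
    """
    p_fewer_than_k = sum(
        comb(n_candidates, i)
        * selectivity**i
        * (1.0 - selectivity) ** (n_candidates - i)
        for i in range(k)
    )
    return 1.0 - p_fewer_than_k


if __name__ == "__main__":
    # 100 collected candidates (for example ef_search=100), asking for k=10.
    for s in (0.5, 0.05, 0.005):
        print(
            f"selectivity={s:.3f} "
            f"expected_survivors={expected_survivors(100, s):.1f} "
            f"P(>=10 pass)={prob_at_least_k_survivors(100, s, 10):.4f}"
        )
```

Under this model, 100 candidates at 0.5% selectivity leave half a survivor on average, so a post-filtered search has almost no chance of filling a top-10 without collecting far more candidates.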
The ANN-vs-exact-knn protocol
A real recall number compares an approximate-nearest-neighbor result against the exact top-k for the same query on the same corpus. Most teams skip this because they assume it is expensive. It is not.
Step 1: sample 500 queries from production. Stratify by filter selectivity (0-1%, 1-5%, 5-25%, 25-100%), namespace, and query length. Production traces from traceAI are the source; an afternoon of trace export and stratification builds the seed set.
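Step 2: compute exact-knn ground truth. Brute-force a flat cosine or L2 sweep across the whole index for each of the 500 queries. For filtered queries, restrict the sweep to rows that pass the filter, so the ground truth is the exact top-k the filtered ANN query could have returned. The output is a table of (query_id, exact_top_k_ids). A brute-force matrix multiply on a single GPU handles 10M vectors in minutes (NumPy itself runs on CPU, so use a GPU array library or Faiss there); Faiss IndexFlatIP is the cleanest wrapper for larger corpora, with vectors L2-normalized first if the metric is cosine. Rebuild the ground truth on corpus deltas above 5%; smaller deltas reuse the existing table.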
Step 3: score ANN recall as the overlap. ANN recall@k is len(ann_top_k ∩ exact_top_k) / k, averaged across queries. Read it per stratum, never as a flat aggregate.
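Step 1's stratification can be sketched in a few lines. This is a minimal version under stated assumptions: each trace is a dict with hypothetical `selectivity` and `namespace` keys, bucket edges follow the four ranges named in Step 1, and every stratum gets an equal share of the sample (which deliberately oversamples rare strata). Query length can be added to the stratum key the same way.

```python
import random
from collections import defaultdict

# Upper bound (inclusive) and label for each selectivity bucket from Step 1.
BUCKETS = [(0.01, "0-1%"), (0.05, "1-5%"), (0.25, "5-25%"), (1.0, "25-100%")]


def selectivity_bucket(selectivity: float) -> str:
    """Map a filter selectivity in [0, 1] to its bucket label."""
    for upper, label in BUCKETS:
        if selectivity <= upper:
            return label
    raise ValueError(f"selectivity out of range: {selectivity}")


def stratified_sample(traces, total=500, seed=0):
    """Draw roughly `total` traces, split evenly across (bucket, namespace) strata."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for trace in traces:
        key = (selectivity_bucket(trace["selectivity"]), trace["namespace"])
        by_stratum[key].append(trace)
    if not by_stratum:
        return []
    per_stratum = max(1, total // len(by_stratum))
    sample = []
    for key in sorted(by_stratum):
        rows = by_stratum[key]
        # Small strata contribute everything they have.
        sample.extend(rng.sample(rows, min(per_stratum, len(rows))))
    return sample
```

The sampled rows play the role of golden_set in the scoring code that follows.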
```python
import numpy as np
from fi.evals import Evaluator
from fi.evals.templates import CustomLLMJudge
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env


def exact_knn(query_vec, corpus_matrix, k=10):
    # Brute-force inner product; this equals cosine only for L2-normalized vectors.
    sims = corpus_matrix @ query_vec
    # argpartition returns the top-k ids in arbitrary order, which is enough for set overlap.
    return np.argpartition(-sims, k)[:k]


ann_exact_recall = CustomLLMJudge(
    name="ann_exact_recall_at_k",
    rubric=(
        "Given (query_id, ann_top_k_ids, exact_top_k_ids), "
        "return len(ann_top_k_ids INTERSECT exact_top_k_ids) / k. "
        "Flag any per-bucket recall below 0.90."
    ),
)

filter_correctness = CustomLLMJudge(
    name="filter_correctness",
    rubric=(
        "Given (query, filter_expression, returned_rows), "
        "evaluate the filter against every returned row's payload. "
        "Return 1.0 only if every row satisfies the expression. "
        "A score below 0.999 indicates a payload-index bug, not a recall issue."
    ),
)


def score_vector_db(golden_set, ann_search_fn, corpus_matrix):
    test_cases = []
    for row in golden_set:
        # NOTE: this ground truth is unfiltered. For filtered queries, pass only the
        # rows of corpus_matrix that satisfy row.filter so both sides see the same candidates.
        exact_ids = exact_knn(row.query_vec, corpus_matrix, k=10)
        ann_results = ann_search_fn(
            row.query_vec, k=10, filter=row.filter, namespace=row.namespace,
        )
        test_cases.append(TestCase(
            query_id=row.query_id,
            ann_top_k_ids=[r.id for r in ann_results],
            exact_top_k_ids=exact_ids.tolist(),
            filter_expression=row.filter,
            returned_rows=ann_results,
            selectivity_bucket=row.bucket,
            namespace=row.namespace,
        ))
    return evaluator.evaluate(
        eval_templates=[ann_exact_recall, filter_correctness],
        inputs=test_cases,
    )
```
The same Evaluator.evaluate() call works against Pinecone, Weaviate, Qdrant, Milvus, pgvector, and Turbopuffer. Only the ann_search_fn changes. That’s the point: one comparable score column per vector DB candidate, produced by the same judge against the same ground truth.
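Because the overlap in Step 3 is plain set arithmetic, a local, deterministic version is a useful cross-check on any judge-produced score. The sketch below is an assumption-laden illustration, not part of the SDK: the filter is represented as a boolean mask over corpus rows, the function names are ours, and recall divides by the number of exact neighbors that exist (at most k) rather than by k, so a filter that admits fewer than k rows is not counted against the ANN index.

```python
from collections import defaultdict

import numpy as np


def exact_knn_filtered(query_vec, corpus, mask, k=10):
    """Exact top-k row ids by inner product among rows where mask is True."""
    eligible = np.flatnonzero(mask)
    if eligible.size == 0:
        return np.array([], dtype=np.int64)
    sims = corpus[eligible] @ query_vec
    k_eff = min(k, eligible.size)
    # Partition to find the k_eff best, then sort just those by similarity.
    top = np.argpartition(-sims, k_eff - 1)[:k_eff]
    top = top[np.argsort(-sims[top])]
    return eligible[top]


def recall_at_k(ann_ids, exact_ids):
    """Fraction of exact neighbors that the ANN result also returned."""
    exact = {int(i) for i in exact_ids}
    if not exact:
        raise ValueError("no exact neighbors: recall is undefined for this query")
    found = {int(i) for i in ann_ids}
    return len(found & exact) / len(exact)


def recall_by_bucket(rows):
    """rows: iterable of (bucket_label, ann_ids, exact_ids). Returns mean recall per bucket."""
    per_bucket = defaultdict(list)
    for bucket, ann_ids, exact_ids in rows:
        if len(exact_ids) > 0:  # skip queries whose filter matched nothing
            per_bucket[bucket].append(recall_at_k(ann_ids, exact_ids))
    return {bucket: float(np.mean(vals)) for bucket, vals in per_bucket.items()}
```

Reading the returned dict per bucket, rather than averaging it, keeps the low-selectivity stratum from being hidden by the broad one.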
p99 latency under your filter cardinality
A single p99 number is meaningless. Bucket by selectivity, then read each bucket separately.
| Selectivity bucket | Median QPS | p50 | p95 | p99 |
|---|---|---|---|---|
| 0-1% (rare tenant) | 180 | 18 ms | 92 ms | 220 ms |
| 1-5% (small slice) | 340 | 14 ms | 61 ms | 140 ms |
| 5-25% (typical) | 1,100 | 9 ms | 31 ms | 72 ms |
| 25-100% (broad) | 2,400 | 7 ms | 19 ms | 38 ms |
Most production tail regressions live in the 0.1-5% band. HNSW with post-filtering walks the graph, collects candidates, then drops everything that fails the filter; at low selectivity, the walk runs out before it has k survivors, so recall drops while latency rises. Pre-filtered ANN (Qdrant payload index, Milvus partitioned scalar filter, Weaviate inverted index) trades some build cost for stable tail behavior in this band, which is why the bake-off matters.
Run the load at production QPS for at least one hour. Shorter runs miss GC pauses, replica failovers, and the moment the hot-path stops fitting in page cache. Capture latency separately from rerank and generation; mixing them buries the retriever signal.
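A per-bucket latency table is a small aggregation once each measurement carries its selectivity bucket. The sketch below assumes the load test emits (bucket_label, latency_ms) pairs for the retriever call only; the function name and output shape are illustrative.

```python
from collections import defaultdict

import numpy as np


def latency_table(samples):
    """samples: iterable of (bucket_label, latency_ms). Returns percentiles per bucket."""
    by_bucket = defaultdict(list)
    for bucket, latency_ms in samples:
        by_bucket[bucket].append(latency_ms)
    table = {}
    for bucket, values in by_bucket.items():
        # np.percentile uses linear interpolation between order statistics by default.
        p50, p95, p99 = np.percentile(values, [50, 95, 99])
        table[bucket] = {
            "n": len(values),
            "p50": float(p50),
            "p95": float(p95),
            "p99": float(p99),
        }
    return table
```

Keeping `n` in the output is deliberate: with only a few hundred samples in a bucket, the p99 rests on a handful of observations and should be read with that in mind.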
For Turbopuffer, the on-disk layout means cold queries have a different tail than warm queries; bucket by cache_state as an extra dimension. For pgvector with HNSW + btree filter, watch for the planner choosing a sequential scan above ~10% selectivity, which silently changes the recall regime.
Hybrid search: dense + BM25 lift
A pure dense head misses keyword-anchored queries; a pure BM25 head misses paraphrases. Fused retrieval is where most production RAG corpora actually live. The eval has to score all three.
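Run the same 500-query set through three configurations: pure dense (cosine over embeddings), pure sparse (BM25 over the same chunks), and fused (reciprocal rank fusion or a learned reranker over both candidate sets). Score recall@k for all three against one shared ground truth. The exact-knn table from the dense protocol is dense-only by construction, so it cannot credit BM25 for relevant chunks the embedding misses; measuring hybrid lift needs relevance labels for the same queries.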
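For the fused configuration, reciprocal rank fusion is short enough to sketch inline. This is a minimal version: each input is a ranked list of document ids, every document scores the sum of 1 / (rrf_k + rank) over the lists it appears in, and rrf_k=60 is the commonly used constant rather than a tuned value. The function name and the tie-break rule are our choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=10, rrf_k=60):
    """Fuse ranked id lists (best first) into a single top-k list.

    rankings: list of lists, for example [dense_ids, bm25_ids].
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Sort by fused score, breaking ties on the id string for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], str(d)))
    return ordered[:k]


if __name__ == "__main__":
    dense = ["a", "b", "c"]
    bm25 = ["c", "a", "d"]
    print(reciprocal_rank_fusion([dense, bm25], k=4))
```

Because the fusion only needs ranks, it works across heads whose raw scores (cosine versus BM25) are not comparable.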
The interesting number is the lift per stratum.
| Stratum | Dense | BM25 | Fused | Lift over dense |
|---|---|---|---|---|
| short keyword (≤5 tok) | 0.71 | 0.82 | 0.92 | +0.21 |
| long natural (>15 tok) | 0.86 | 0.64 | 0.89 | +0.03 |
| domain identifiers | 0.68 | 0.91 | 0.94 | +0.26 |
| multilingual fragments | 0.79 | 0.58 | 0.83 | +0.04 |
Short-keyword and identifier-heavy queries are where BM25 does most of the work; long natural-language queries barely move. If your traffic is 60% short-keyword and you ship dense-only, you are leaving roughly 20 recall points on the floor for those queries. If your traffic is 80% long natural language and you ship hybrid, you are paying for a BM25 index that buys you about three points. The traffic shape decides the architecture, not the slideware. The retrieval quality monitoring guide keeps this lift number honest as the corpus drifts.
Filter correctness as a unit test
A vector DB can return high recall on the right candidates and still apply the filter wrong on payloads. We have seen Qdrant return tenant-A documents in tenant-B filtered queries when the payload index lagged behind the vector index by a few minutes after a bulk insert. We have seen pgvector with HNSW plus a btree filter return technically-correct rows that violated an OR-of-AND clause because the planner short-circuited the wrong subexpression. Both look like recall misses on a metric dashboard. Both are correctness bugs.
Filter correctness is a unit-test surface, not a recall metric. Run 500 queries with known filter expressions, retrieve, parse every returned row’s payload, and assert the filter holds. The pass rate floor is 0.999. Anything below means a payload-index bug, a planner bug, or a stale-index race condition, and recall numbers above it are not trustworthy until the bug is fixed.
The same CustomLLMJudge template covers it. Wire FilterCorrectness into the same eval run as ANNExactRecall; the cost is negligible and the catch rate on payload bugs is high.
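Because filter correctness is a yes/no property of each returned row, it can also be asserted deterministically in a plain unit test. The sketch below uses a made-up filter grammar ({"and": [...]}, {"or": [...]}, and leaf clauses with field/op/value); a real suite would translate the vendor's own filter syntax into something like it. All names here are hypothetical.

```python
import operator

# Leaf operators supported by this illustrative grammar.
OPS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gte": operator.ge,
    "lte": operator.le,
    "in": lambda value, allowed: value in allowed,
}


def matches(expr, payload):
    """Return True if the payload dict satisfies the filter expression."""
    if "and" in expr:
        return all(matches(sub, payload) for sub in expr["and"])
    if "or" in expr:
        return any(matches(sub, payload) for sub in expr["or"])
    field = expr["field"]
    if field not in payload:
        return False  # a missing field never satisfies a leaf clause
    return OPS[expr["op"]](payload[field], expr["value"])


def filter_violations(expr, rows):
    """Ids of returned rows whose payload does NOT satisfy the filter."""
    return [row["id"] for row in rows if not matches(expr, row["payload"])]


def pass_rate(expr, rows):
    """Share of returned rows that satisfy the filter (1.0 for an empty result)."""
    if not rows:
        return 1.0
    return 1.0 - len(filter_violations(expr, rows)) / len(rows)
```

In a test, asserting `filter_violations(expr, rows) == []` gives a failure message that names the offending row ids, which is more useful for debugging than a bare pass rate.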
Production patterns: drift on inserts, payload races, hot-namespace bias
Three production patterns separate teams that ship recall from teams that pay for it.
Drift on inserts. HNSW is incrementally constructable but not incrementally optimal. After 30-50% of the index lands post-build, recall@10 drifts down 2-5 points because the graph entry points no longer reflect the distribution. Pinecone serverless papers over this with background reindexing; self-hosted Qdrant and Milvus require an explicit rebuild on a schedule. Re-run the golden set weekly and gate on the trend, not the snapshot.
Payload-index race conditions. Most vendors index payloads asynchronously after the vector is written. A query in the gap can return a row that fails the filter at read time. The window is usually milliseconds to seconds, which is long enough to corrupt tenant isolation if traffic crosses tenants fast.
Hot-namespace bias. Tenant A with 10M vectors and tenant B with 100K share the same HNSW graph in most managed offerings. The entry points are dominated by tenant A’s distribution; tenant B sits in sparser regions and recalls 5-10 points lower. An aggregate recall number averages this away. Per-namespace recall is the only metric that catches it; gate on worst-namespace, not mean.
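Gating on the worst namespace rather than the mean is a small change to the aggregation. The sketch below assumes per-query recall has already been computed and tagged with its namespace; the 0.90 floor is only an illustrative default, and the function names are ours.

```python
from collections import defaultdict


def per_namespace_recall(records):
    """records: iterable of (namespace, recall) per query. Returns mean recall per namespace."""
    by_ns = defaultdict(list)
    for namespace, recall in records:
        by_ns[namespace].append(recall)
    return {ns: sum(vals) / len(vals) for ns, vals in by_ns.items()}


def worst_namespace_gate(records, floor=0.90):
    """Pass only if the lowest-recall namespace clears the floor."""
    records = list(records)
    if not records:
        raise ValueError("no records to gate on")
    per_ns = per_namespace_recall(records)
    worst_ns = min(per_ns, key=per_ns.get)
    mean_recall = sum(r for _, r in records) / len(records)
    return {
        "worst_namespace": worst_ns,
        "worst_recall": per_ns[worst_ns],
        "mean_recall": mean_recall,  # reported for contrast, not used for gating
        "passed": per_ns[worst_ns] >= floor,
    }


if __name__ == "__main__":
    records = [
        ("tenant_a", 0.95), ("tenant_a", 0.97),
        ("tenant_b", 0.85), ("tenant_b", 0.87),
    ]
    print(worst_namespace_gate(records))
```

In the usage example, the query-level mean clears a 0.90 floor while tenant_b sits well under it, which is the failure mode an aggregate number hides.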
Instrumenting retrieval with traceAI
Every retrieval call gets wrapped in a typed RETRIEVER span. Without this, you cannot slice recall by index type, filter selectivity, or namespace later.
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from opentelemetry import trace

tracer_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="vector-db-bakeoff",
)
tracer = trace.get_tracer(__name__)

# qdrant_client, embed(), and bucket_for() are assumed to be defined elsewhere in the application.


def retrieve(query: str, namespace: str, filter_expr: dict, k: int = 10):
    with tracer.start_as_current_span("retrieve") as span:
        # Index and filter metadata, so recall can be sliced by these dimensions later.
        span.set_attribute("fi.span.kind", "RETRIEVER")
        span.set_attribute("input.value", query)
        span.set_attribute("vector.db", "qdrant")
        span.set_attribute("vector.index_type", "HNSW")
        span.set_attribute("vector.ef_search", 100)
        span.set_attribute("vector.namespace", namespace)
        span.set_attribute("vector.filter", str(filter_expr))
        span.set_attribute("vector.filter.selectivity_bucket", bucket_for(filter_expr))

        results = qdrant_client.search(
            collection_name=namespace,
            query_vector=embed(query),
            query_filter=filter_expr,
            limit=k,
            search_params={"hnsw_ef": 100},
        )

        # Record each returned document's id and score on the span.
        for i, hit in enumerate(results):
            span.set_attribute(f"retrieval.documents.{i}.document.id", hit.id)
            span.set_attribute(f"retrieval.documents.{i}.document.score", hit.score)
        return results
```
This works identically across Pinecone, Weaviate, Milvus, pgvector, Turbopuffer, and Vespa. Only the client and the attribute values change; the span shape stays constant, which is what makes a side-by-side bake-off possible. The agent observability vs evaluation vs benchmarking writeup covers the broader observability picture.
How Future AGI ships vector-DB evaluation
Future AGI ships the eval stack as a package. Start with the SDK. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.
- ai-evaluation SDK (Apache 2.0): six RAG-specific EvalTemplate classes (Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization) plus 50+ total; CustomLLMJudge for ANNExactRecall, FilterCorrectness, HybridLift, and PerNamespaceFairness rubrics; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- Future AGI Platform: self-improving evaluators tuned by thumbs up/down feedback; in-product authoring agent writes vector-DB-eval rubrics from natural-language descriptions; four distributed runners (Celery, Ray, Temporal, Kubernetes) collapse a six-vendor-by-three-index-config sweep to minutes.
- traceAI (Apache 2.0): auto-instrumentation across 50+ AI surfaces in Python, TypeScript, Java, and C# (Pinecone, Weaviate, Qdrant, Milvus, Chroma, pgvector). Every retrieval call emits a typed RETRIEVER span with vector.db, vector.index_type, vector.ef_search, vector.namespace, and the filter expression, so a config drift shows up against the same dashboard recall lands on.
- Error Feed (inside the eval stack): HDBSCAN soft-clustering over ClickHouse-stored span embeddings; Sonnet 4.5 Judge writes the immediate_fix per cluster. Common clusters: “Qdrant ef_search=50 drops 6 points on niche-domain queries,” “pgvector planner switched to seq scan at 12% selectivity,” “tenant 7 lagged tenant 1 by 8 points after the May 14 backfill.”
- Agent Command Center: OpenAI-compatible gateway as a single Go binary (Apache 2.0). 100+ providers; the RAG-related LLM calls (rerank, query rewriting, generation) route through gateway.futureagi.com/v1 for per-call cost visibility. Combined with vector-DB cost capture, you get full RAG-stack budget visibility. SOC 2 Type II, HIPAA, GDPR, CCPA certified; ISO/IEC 27001 in active audit.
For very large sweeps where shipping data to a hosted evaluator is the wrong shape, the Platform supports BYOC. Eval workers run in your VPC next to the vector DB, so a 240-config sweep against a 50M-vector index runs at network speeds, not internet speeds.
Honest framing. Today, only Linear sync is wired from Error Feed; Slack, Jira, GitHub, and PagerDuty are on the roadmap. The trace-stream-to-dataset connector for agent-opt is also roadmap. Optimization on retriever query rewriting and rerank prompts is eval-driven today and ships via the six optimizers (RandomSearch, BayesianSearch, MetaPrompt, ProTeGi, GEPA, PromptWizard). If a vendor tells you the closed loop from trace to dataset to optimized prompt is one-click today, they are bending the definition of one-click.
Ready to compute exact-knn ground truth on your corpus this afternoon? Wire ANNExactRecall, FilterCorrectness, and a per-bucket latency table into a pytest fixture against the ai-evaluation SDK. Stratify by filter selectivity, namespace, and query length. Capture every retrieval call as a typed RETRIEVER span via traceAI. Add the gateway when production traces start asking questions the CI gate missed.