Evaluating Vector Database Recall Quality in 2026

Vendor vector-DB benchmarks are theater. ANN-vs-exact-knn recall on your vectors plus p99 under your filter cardinality is the eval that decides prod.


Most vector-DB vendor benchmark pages report Recall@10 of 0.97 on a 1M-vector synthetic corpus with no filters, no payload reads, and a hand-tuned HNSW config. Then you turn on tenant filters at 1% selectivity, plug in your actual embeddings, push real QPS, and recall@10 collapses to 0.79 inside an afternoon. The benchmark just didn’t measure what production looks like.

The opinion this post earns: vendor vector-database benchmarks are theater. They measure ANN-vs-exact recall on synthetic queries against an index nobody deploys, then publish a chart with “p99 < 2 ms” in the title. Your production has metadata payloads, hybrid scoring, multi-tenant filters with 0.5% selectivity, and an HNSW configuration somebody picked once and never re-tuned. The eval that matters is ANN-vs-exact-knn recall on your vectors, p99 latency under your filter cardinality, and filter correctness on your payload schema. Without that, you are picking a vendor by leaderboard theater.

This post is the methodology. It covers why vendor benchmarks don’t transfer, the ANN-vs-exact-knn protocol, p99 latency under realistic filter cardinality, hybrid search lift, filter correctness as a unit-test surface, and the production patterns that cause silent recall drift. For the upstream embedding decision, Evaluating Embedding Models in 2026 sets the recall floor every layer below inherits; this post is the layer below.

TL;DR: the four numbers that decide a vector DB

Number                               |  What it measures                    |  Why vendor pages skip it
ANN recall vs exact-knn              |  Did the ANN return the real top-k?  |  They report against a synthetic ground truth, not yours
p99 latency by filter selectivity    |  Tail under realistic filter load    |  They publish unfiltered or 50% selectivity
Hybrid lift (dense + BM25 vs dense)  |  Recall gain from fusion             |  Most vendor charts skip BM25 entirely
Filter correctness on payload        |  Does the filter actually hold?      |  A correctness bug looks like a recall miss on the page

Two non-negotiables. Compute exact-knn ground truth on your vectors before you compare anything. Vendor recall numbers are scored against the vendor’s ground truth, not yours. Bucket every latency number by filter selectivity. A single p99 figure is meaningless when 0.5% filters and 50% filters live on opposite sides of a graph-walk cliff.

Why vendor benchmarks do not transfer

Four reasons your downstream eval lies if you pick a vector DB from a vendor leaderboard.

Synthetic queries, synthetic distributions. Most vendor benchmarks use SIFT, GIST, or a slice of MS MARCO. None of those share your embedding model, your chunk size, or the query shape of your corpus. A 0.97 recall@10 on SIFT-1M tells you the implementation works; it tells you nothing about your 12M embeddings of 6-token enterprise tickets.

Filters are skipped or run at 50% selectivity. HNSW with post-filtering walks the graph, collects candidates, then discards what fails the filter. At 50% selectivity, half survive and recall barely moves. At 0.5%, the walk runs out of nearest neighbors before k candidates pass, and recall drops 10-20 points. Vendor benchmarks rarely publish the low-selectivity tail because the chart looks bad.

Payloads are absent from the benchmark, dominant in production. Reading 4 KB of JSON payload per hit doubles the per-query memory hop. Vendor benchmarks disable payload reads. Your production code reads the payload on every hit, every time.

HNSW parameters are hand-tuned for the slide. ef_construction=400, M=64, ef_search=300 is the published recipe. It produces a beautiful chart and a 4x larger index than the default ef_construction=200, M=16, ef_search=100. Nobody runs the published config in production; everybody runs defaults.

Use vendor benchmarks to drop the bottom half of the field, then run the four-number eval below on your data. The candidate that wins the leaderboard can lose your workload. The candidate that loses the leaderboard can win on filter-heavy multi-tenant traffic. Both are invisible without your ground truth.

The ANN-vs-exact-knn protocol

A real recall number compares an approximate-nearest-neighbor result against the exact top-k for the same query on the same corpus. Most teams skip this because they assume it is expensive. It is not.

Step 1: sample 500 queries from production. Stratify by filter selectivity (0-1%, 1-5%, 5-25%, 25-100%), namespace, and query length. Production traces from traceAI are the source; an afternoon of trace export and stratification builds the seed set.

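Step 2: compute exact-knn ground truth. Brute-force a flat cosine or L2 sweep for each of the 500 queries, restricted to the rows that satisfy that query’s filter, so the ground truth answers the same question the ANN is asked. The output is a table of (query_id, exact_top_k_ids). A batched NumPy matrix multiply on CPU, or the same sweep on a GPU array library, handles 10M vectors in minutes; Faiss IndexFlatIP (over L2-normalized vectors for cosine) or IndexFlatL2 is the cleanest wrapper for larger corpora. Rebuild the ground truth on corpus deltas above 5%; smaller deltas reuse the existing table.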

Step 3: score ANN recall as the overlap. ANN recall@k is len(ann_top_k ∩ exact_top_k) / k, averaged across queries. Read it per stratum, never as a flat aggregate.
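Before the scoring code, here is a minimal sketch of the stratified draw from Step 1. It stratifies on filter selectivity only; namespace and query length would be handled the same way with extra keys. The trace shape (a dict with a "selectivity" field) and the even split across buckets are assumptions for illustration, not the traceAI export format.

```python
import random
from collections import defaultdict

# Bucket edges mirror the strata in Step 1 (fraction of the corpus a filter matches).
BUCKETS = [(0.01, "0-1%"), (0.05, "1-5%"), (0.25, "5-25%"), (1.0, "25-100%")]

def selectivity_bucket(selectivity: float) -> str:
    """Map a filter selectivity in [0, 1] to its stratum label."""
    for upper, label in BUCKETS:
        if selectivity <= upper:
            return label
    raise ValueError(f"selectivity out of range: {selectivity}")

def stratified_sample(traces, total=500, seed=0):
    """Draw up to `total` queries, split evenly across selectivity buckets.

    `traces` is a list of dicts with a hypothetical "selectivity" key.
    A bucket with fewer traces than its share contributes everything it has.
    """
    rng = random.Random(seed)  # fixed seed keeps the golden set reproducible
    by_bucket = defaultdict(list)
    for trace in traces:
        by_bucket[selectivity_bucket(trace["selectivity"])].append(trace)
    per_bucket = total // len(BUCKETS)
    sample = []
    for _, label in BUCKETS:
        pool = by_bucket.get(label, [])
        sample.extend(rng.sample(pool, min(per_bucket, len(pool))))
    return sample
```

An even split deliberately over-represents the rare low-selectivity bucket relative to raw traffic, which is where the recall and latency cliffs tend to show up.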

import numpy as np
from fi.evals import Evaluator
from fi.evals.templates import CustomLLMJudge
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

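def exact_knn(query_vec, corpus_matrix, k=10, allowed_ids=None):
    # Brute-force inner product; equals cosine when rows and query are L2-normalized.
    # allowed_ids (optional) restricts the sweep to corpus rows that pass the
    # query's filter, so the ground truth matches what the ANN was asked.
    if allowed_ids is None:
        allowed_ids = np.arange(corpus_matrix.shape[0])
        sims = corpus_matrix @ query_vec
    else:
        allowed_ids = np.asarray(allowed_ids)
        sims = corpus_matrix[allowed_ids] @ query_vec
    k = min(k, len(allowed_ids))  # fewer than k rows may pass a tight filter
    if k == 0:
        return allowed_ids[:0]
    # argpartition returns the top-k unordered, which is enough for set overlap.
    return allowed_ids[np.argpartition(-sims, k - 1)[:k]]

ann_exact_recall = CustomLLMJudge(
    name="ann_exact_recall_at_k",
    rubric=(
        "Given (query_id, ann_top_k_ids, exact_top_k_ids), "
        "return len(ann_top_k_ids INTERSECT exact_top_k_ids) / k. "
        "Flag any per-bucket recall below 0.90."
    ),
)

filter_correctness = CustomLLMJudge(
    name="filter_correctness",
    rubric=(
        "Given (query, filter_expression, returned_rows), "
        "evaluate the filter against every returned row's payload. "
        "Return 1.0 only if every row satisfies the expression. "
        "A score below 0.999 indicates a payload-index bug, not a recall issue."
    ),
)

def score_vector_db(golden_set, ann_search_fn, corpus_matrix):
    test_cases = []
    for row in golden_set:
        # Assumption: row.allowed_ids (if present) lists the corpus row indices
        # that satisfy row.filter, and ANN result ids are those same row indices.
        exact_ids = exact_knn(
            row.query_vec, corpus_matrix, k=10,
            allowed_ids=getattr(row, "allowed_ids", None),
        )
        ann_results = ann_search_fn(
            row.query_vec, k=10, filter=row.filter, namespace=row.namespace,
        )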
        test_cases.append(TestCase(
            query_id=row.query_id,
            ann_top_k_ids=[r.id for r in ann_results],
            exact_top_k_ids=exact_ids.tolist(),
            filter_expression=row.filter,
            returned_rows=ann_results,
            selectivity_bucket=row.bucket,
            namespace=row.namespace,
        ))
    return evaluator.evaluate(
        eval_templates=[ann_exact_recall, filter_correctness],
        inputs=test_cases,
    )

The same Evaluator.evaluate() call works against Pinecone, Weaviate, Qdrant, Milvus, pgvector, and Turbopuffer. Only the ann_search_fn changes. That’s the point: one comparable score column per vector DB candidate, produced by the same judge against the same ground truth.
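Because recall@k is a plain set overlap, it can also be checked deterministically alongside the judge. The sketch below is an illustration, not part of the ai-evaluation SDK: the function names and the row shape are assumptions, and the 0.90 floor is borrowed from the rubric above.

```python
from collections import defaultdict

def recall_at_k(ann_ids, exact_ids):
    """Fraction of the exact top-k that the ANN also returned.

    Divides by len(exact_ids) rather than a fixed k so that queries whose
    filter admits fewer than k rows are not penalized.
    """
    exact = set(exact_ids)
    if not exact:
        return 1.0  # nothing to find, nothing missed
    return len(set(ann_ids) & exact) / len(exact)

def recall_by_bucket(rows, floor=0.90):
    """rows: iterable of (bucket, ann_ids, exact_ids) tuples.

    Returns ({bucket: mean recall}, [buckets below the floor]).
    """
    per_bucket = defaultdict(list)
    for bucket, ann_ids, exact_ids in rows:
        per_bucket[bucket].append(recall_at_k(ann_ids, exact_ids))
    means = {b: sum(v) / len(v) for b, v in per_bucket.items()}
    flagged = sorted(b for b, r in means.items() if r < floor)
    return means, flagged
```

Reading the result per bucket keeps a weak 0-1% stratum from being averaged away by the broad-filter queries.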

p99 latency under your filter cardinality

A single p99 number is meaningless. Bucket by selectivity, then read each bucket separately.

selectivity bucket   |  median QPS  |  p50    p95    p99
0-1%   (rare tenant) |    180       |  18 ms  92 ms  220 ms
1-5%   (small slice) |    340       |  14 ms  61 ms  140 ms
5-25%  (typical)     |  1,100       |   9 ms  31 ms   72 ms
25-100% (broad)      |  2,400       |   7 ms  19 ms   38 ms

Most production tail regressions live in the 0.1-5% band. HNSW with post-filtering walks the graph, collects candidates, then drops everything that fails the filter; at low selectivity, the walk runs out before it has k survivors, so recall drops while latency rises. Pre-filtered ANN (Qdrant payload index, Milvus partitioned scalar filter, Weaviate inverted index) trades some build cost for stable tail behavior in this band, which is why the bake-off matters.
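A back-of-the-envelope model shows why the cliff sits where it does. Assume, as a simplification, that each of the ef_search candidates passes the filter independently with probability equal to the selectivity. Real engines over-fetch or switch strategies, so treat this as intuition rather than a prediction for any specific database.

```python
from math import comb

def expected_survivors(ef_search: int, selectivity: float) -> float:
    """Mean number of candidates that pass the filter under the toy model."""
    return ef_search * selectivity

def prob_at_least_k(ef_search: int, selectivity: float, k: int) -> float:
    """P(at least k of ef_search candidates pass), binomial model."""
    p_fewer = sum(
        comb(ef_search, i) * selectivity**i * (1 - selectivity) ** (ef_search - i)
        for i in range(k)
    )
    return 1.0 - p_fewer

if __name__ == "__main__":
    # Compare a broad filter, a narrow one, and a rare-tenant one at ef_search=100, k=10.
    for s in (0.5, 0.05, 0.005):
        print(s, expected_survivors(100, s), prob_at_least_k(100, s, 10))
```

At 0.5% selectivity the model expects less than one survivor out of 100 candidates, so a post-filtered search cannot fill a top-10 without widening the walk.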

Run the load at production QPS for at least one hour. Shorter runs miss GC pauses, replica failovers, and the moment the hot-path stops fitting in page cache. Capture latency separately from rerank and generation; mixing them buries the retriever signal.
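The per-bucket table can be produced from raw load-test samples with a few lines of NumPy. This is a minimal sketch; the (bucket, latency_ms) sample shape is an assumption about how your load generator logs results.

```python
from collections import defaultdict

import numpy as np

def latency_table(samples):
    """samples: iterable of (selectivity_bucket, latency_ms) pairs.

    Returns {bucket: {"n", "p50", "p95", "p99"}} using NumPy's default
    linear-interpolation percentiles.
    """
    by_bucket = defaultdict(list)
    for bucket, latency_ms in samples:
        by_bucket[bucket].append(latency_ms)
    table = {}
    for bucket, values in by_bucket.items():
        p50, p95, p99 = np.percentile(values, [50, 95, 99])
        table[bucket] = {
            "n": len(values),
            "p50": float(p50),
            "p95": float(p95),
            "p99": float(p99),
        }
    return table
```

Keep the sample count next to each percentile: a p99 over a few hundred samples in the rare-tenant bucket rests on a handful of observations and should be read with that in mind.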

For Turbopuffer, the on-disk layout means cold queries have a different tail than warm queries; bucket by cache_state as an extra dimension. For pgvector with HNSW + btree filter, watch for the planner choosing a sequential scan above ~10% selectivity, which silently changes the recall regime.

Hybrid search: dense + BM25 lift

A pure dense head misses keyword-anchored queries; a pure BM25 head misses paraphrases. Fused retrieval is where most production RAG corpora actually live. The eval has to score all three.

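Run the same 500-query set through three configurations: pure dense (cosine over embeddings), pure sparse (BM25 over the same chunks), and fused (reciprocal rank fusion or a learned reranker over both candidate sets). Score recall@k for all three against one shared ground truth of labeled relevant chunks per query. The exact-knn table from the dense protocol is not enough here: it only measures how closely the ANN tracks the dense ranking, so it cannot credit a relevant chunk that BM25 found and the dense head missed.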

The interesting number is the lift per stratum.

stratum                |  dense   BM25   fused   lift
short keyword (≤5 tok) |   0.71   0.82   0.92    +0.21 over dense
long natural (>15 tok) |   0.86   0.64   0.89    +0.03 over dense
domain identifiers     |   0.68   0.91   0.94    +0.26 over dense
multilingual fragments |   0.79   0.58   0.83    +0.04 over dense

Short-keyword and identifier-heavy queries are where BM25 does most of the work; long natural-language queries barely move. If your traffic is 60% short-keyword and you ship dense-only, you are leaving roughly 20 recall points on the floor for that traffic. If your traffic is 80% long natural language and you ship hybrid, you are paying for a BM25 index that buys you about three points. The traffic shape decides the architecture, not the slideware. The retrieval quality monitoring guide keeps this lift number honest as the corpus drifts.
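For the fused configuration, reciprocal rank fusion is small enough to sketch in full. The constant k=60 is the value commonly used for RRF; `relevant` stands for whatever ground-truth ID set you score against, and the function names are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties on the ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))[:top_n]

def recall(retrieved, relevant):
    """Fraction of the relevant IDs present in the retrieved list."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def hybrid_lift(dense, sparse, relevant, top_n=10):
    """Recall of the fused list minus recall of the dense list, both at top_n."""
    fused = reciprocal_rank_fusion([dense, sparse], top_n=top_n)
    return recall(fused, relevant) - recall(dense[:top_n], relevant)
```

Computing the lift per stratum, rather than once over the whole query set, is what separates the short-keyword rows from the long natural-language rows in the table above.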

Filter correctness as a unit test

A vector DB can return high recall on the right candidates and still apply the filter wrong on payloads. We have seen Qdrant return tenant-A documents in tenant-B filtered queries when the payload index lagged behind the vector index by a few minutes after a bulk insert. We have seen pgvector with HNSW plus a btree filter return technically-correct rows that violated an OR-of-AND clause because the planner short-circuited the wrong subexpression. Both look like recall misses on a metric dashboard. Both are correctness bugs.

Filter correctness is a unit-test surface, not a recall metric. Run 500 queries with known filter expressions, retrieve, parse every returned row’s payload, and assert the filter holds. The pass rate floor is 0.999. Anything below means a payload-index bug, a planner bug, or a stale-index race condition, and recall numbers above it are not trustworthy until the bug is fixed.

The same CustomLLMJudge template covers it. Wire FilterCorrectness into the same eval run as ANNExactRecall; the cost is negligible and the catch rate on payload bugs is high.
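Since the check is deterministic, it can also run as a plain assertion with no judge in the loop. The sketch below assumes a tiny illustrative filter grammar (nested "and" / "or" over equality leaves) and rows shaped as {"id", "payload"}; a real harness would translate your database's filter syntax into something similar.

```python
def matches(payload: dict, expr: dict) -> bool:
    """Evaluate {"and": [...]}, {"or": [...]}, or {"field": name, "eq": value}."""
    if "and" in expr:
        return all(matches(payload, sub) for sub in expr["and"])
    if "or" in expr:
        return any(matches(payload, sub) for sub in expr["or"])
    return payload.get(expr["field"]) == expr["eq"]

def filter_violations(filter_expr: dict, rows: list) -> list:
    """IDs of returned rows whose payload does not satisfy the filter."""
    return [row["id"] for row in rows if not matches(row["payload"], filter_expr)]

def filter_pass_rate(runs) -> float:
    """runs: iterable of (filter_expr, rows). Fraction of returned rows that pass."""
    total = passed = 0
    for filter_expr, rows in runs:
        total += len(rows)
        passed += len(rows) - len(filter_violations(filter_expr, rows))
    return passed / total if total else 1.0
```

A test would then assert `filter_pass_rate(runs) >= 0.999` and print the violating IDs on failure, which points straight at the tenant or clause that leaked.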

Production patterns: drift on inserts, payload races, hot-namespace bias

Three production patterns separate teams that ship recall from teams that pay for it.

Drift on inserts. HNSW is incrementally constructable but not incrementally optimal. After 30-50% of the index lands post-build, recall@10 drifts down 2-5 points because the graph entry points no longer reflect the distribution. Pinecone serverless papers over this with background reindexing; self-hosted Qdrant and Milvus require an explicit rebuild on a schedule. Re-run the golden set weekly and gate on the trend, not the snapshot.

Payload-index race conditions. Most vendors index payloads asynchronously after the vector is written. A query in the gap can return a row that fails the filter at read time. The window is usually milliseconds to seconds, which is long enough to corrupt tenant isolation if traffic crosses tenants fast.

Hot-namespace bias. Tenant A with 10M vectors and tenant B with 100K share the same HNSW graph in most managed offerings. The entry points are dominated by tenant A’s distribution; tenant B sits in sparser regions and recalls 5-10 points lower. An aggregate recall number averages this away. Per-namespace recall is the only metric that catches it; gate on worst-namespace, not mean.
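Two of these gates are simple enough to sketch: gating on the worst namespace rather than the mean, and gating on the weekly trend rather than the latest snapshot. The 0.90 floor and the four-week window are placeholder assumptions; pick thresholds from your own baseline.

```python
def gate_worst_namespace(recall_by_namespace: dict, floor: float = 0.90):
    """Return (passed, worst_namespace, worst_recall) for a per-namespace recall map."""
    worst_ns = min(recall_by_namespace, key=recall_by_namespace.get)
    worst = recall_by_namespace[worst_ns]
    return worst >= floor, worst_ns, worst

def trend_drop(weekly_recall: list, window: int = 4) -> float:
    """Recall lost across the last `window` weekly runs (positive means degradation)."""
    recent = weekly_recall[-window:]
    return recent[0] - recent[-1]

def gate_trend(weekly_recall: list, max_drop: float = 0.02, window: int = 4) -> bool:
    """Pass only if recall has not fallen more than `max_drop` over the window."""
    return trend_drop(weekly_recall, window) <= max_drop
```

With a large tenant at 0.95 and a small one at 0.87, the mean still looks acceptable while the worst-namespace gate fails, which is the case the aggregate number hides.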

Instrumenting retrieval with traceAI

Every retrieval call gets wrapped in a typed RETRIEVER span. Without this, you cannot slice recall by index type, filter selectivity, or namespace later.

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from opentelemetry import trace

tracer_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="vector-db-bakeoff",
)
tracer = trace.get_tracer(__name__)

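# qdrant_client, embed(), and bucket_for() are assumed to be defined elsewhere in
# your service: the Qdrant client instance, the embedding call, and a helper that
# maps a filter expression to its selectivity bucket label.
def retrieve(query: str, namespace: str, filter_expr: dict, k: int = 10):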
    with tracer.start_as_current_span("retrieve") as span:
        span.set_attribute("fi.span.kind", "RETRIEVER")
        span.set_attribute("input.value", query)
        span.set_attribute("vector.db", "qdrant")
        span.set_attribute("vector.index_type", "HNSW")
        span.set_attribute("vector.ef_search", 100)
        span.set_attribute("vector.namespace", namespace)
        span.set_attribute("vector.filter", str(filter_expr))
        span.set_attribute("vector.filter.selectivity_bucket", bucket_for(filter_expr))

        results = qdrant_client.search(
            collection_name=namespace,
            query_vector=embed(query),
            query_filter=filter_expr,
            limit=k,
            search_params={"hnsw_ef": 100},
        )

        for i, hit in enumerate(results):
            span.set_attribute(f"retrieval.documents.{i}.document.id", hit.id)
            span.set_attribute(f"retrieval.documents.{i}.document.score", hit.score)
        return results

This works identically across Pinecone, Weaviate, Milvus, pgvector, Turbopuffer, and Vespa. Only the client and the attribute values change; the span shape stays constant, which is what makes a side-by-side bake-off possible. The agent observability vs evaluation vs benchmarking writeup covers the broader observability picture.

How Future AGI ships vector-DB evaluation

Future AGI ships the eval stack as a package. Start with the SDK. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.

  • ai-evaluation SDK (Apache 2.0): six RAG-specific EvalTemplate classes (Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization) plus 50+ total; CustomLLMJudge for ANNExactRecall, FilterCorrectness, HybridLift, and PerNamespaceFairness rubrics; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
  • Future AGI Platform: self-improving evaluators tuned by thumbs up/down feedback; in-product authoring agent writes vector-DB-eval rubrics from natural-language descriptions; four distributed runners (Celery, Ray, Temporal, Kubernetes) collapse a six-vendor-by-three-index-config sweep to minutes.
  • traceAI (Apache 2.0): auto-instrumentation across 50+ AI surfaces in Python, TypeScript, Java, and C# (Pinecone, Weaviate, Qdrant, Milvus, Chroma, pgvector). Every retrieval call emits a typed RETRIEVER span with vector.db, vector.index_type, vector.ef_search, vector.namespace, and the filter expression, so a config drift shows up against the same dashboard recall lands on.
  • Error Feed (inside the eval stack): HDBSCAN soft-clustering over ClickHouse-stored span embeddings; Sonnet 4.5 Judge writes the immediate_fix per cluster. Common clusters: “Qdrant ef_search=50 drops 6 points on niche-domain queries,” “pgvector planner switched to seq scan at 12% selectivity,” “tenant 7 lagged tenant 1 by 8 points after the May 14 backfill.”
  • Agent Command Center: OpenAI-compatible gateway as a single Go binary (Apache 2.0). 100+ providers; the RAG-related LLM calls (rerank, query rewriting, generation) route through gateway.futureagi.com/v1 for per-call cost visibility. Combined with vector-DB cost capture, you get full RAG-stack budget visibility. SOC 2 Type II, HIPAA, GDPR, CCPA certified; ISO/IEC 27001 in active audit.

For very large sweeps where shipping data to a hosted evaluator is the wrong shape, the Platform supports BYOC. Eval workers run in your VPC next to the vector DB, so a 240-config sweep against a 50M-vector index runs at network speeds, not internet speeds.

Honest framing. Today, only Linear sync is wired from Error Feed; Slack, Jira, GitHub, and PagerDuty are on the roadmap. The trace-stream-to-dataset connector for agent-opt is also roadmap. Optimization on retriever query rewriting and rerank prompts is eval-driven today and ships via the six optimizers (RandomSearch, BayesianSearch, MetaPrompt, ProTeGi, GEPA, PromptWizard). If a vendor tells you the closed loop from trace to dataset to optimized prompt is one-click today, they are bending the definition of one-click.

Ready to compute exact-knn ground truth on your corpus this afternoon? Wire ANNExactRecall, FilterCorrectness, and a per-bucket latency table into a pytest fixture against the ai-evaluation SDK. Stratify by filter selectivity, namespace, and query length. Capture every retrieval call as a typed RETRIEVER span via traceAI. Add the gateway when production traces start asking questions the CI gate missed.

Frequently asked questions

Why don't vendor vector database benchmarks transfer to my production?
Because the vendor ran ANN-vs-exact recall on a synthetic 1M-vector corpus with no filters, no payload reads, and an HNSW configuration nobody can replicate. Your production has filters with 200K candidates per query, metadata payloads that double the memory hop, hybrid scoring that fuses BM25 and dense, and an `ef_search` somebody picked once in 2024. The published recall@10 of 0.97 collapses to 0.81 the moment a tenant filter trims the candidate set, because most vendors apply filters as a post-stage and run out of nearest neighbors before the filter is satisfied. Treat the leaderboard as a competence floor. Compute ANN recall against your own exact-knn ground truth on your vectors, with your filters on, at your production cardinality.
How do I compute exact-knn ground truth for ANN recall evaluation?
Brute-force a flat L2 or cosine sweep for a 500-query sample, restricted to the rows that pass each query's filter, ranked top-k, and store the resulting `(query_id, exact_top_k_ids)` table as the ground truth. A batched NumPy matrix multiply on CPU, or the same sweep on a GPU array library, handles 10M vectors in minutes; a Faiss IndexFlatIP (over L2-normalized vectors for cosine) or IndexFlatL2 wraps it cleanly for larger corpora. ANN recall@k is then the fraction of the ANN top-k that overlaps with the exact top-k, averaged across queries. Rebuild the ground truth whenever the corpus changes by more than 5%; smaller corpus deltas can reuse the existing table. The cost is one batch sweep per refresh, not per-query.
What is the right way to measure p99 latency under filter cardinality?
Bucket queries by filter selectivity, then measure p50, p95, and p99 inside each bucket. A 50% selectivity filter (half the corpus matches) behaves very differently from a 0.5% selectivity filter (one row in 200 matches). HNSW with post-filtering degrades sharply as selectivity drops, because the graph walks find candidates that the filter then discards. Most production p99 regressions live in the 0.1 to 5% selectivity band. Run a synthetic load with realistic filter distributions captured from your traffic logs, at production QPS, for at least an hour to see the tail behave.
How does hybrid search (vector + BM25) change recall evaluation?
Hybrid search adds a second retrieval path and a fusion step (reciprocal rank fusion, weighted sum, or learned reranker). The eval has to score the dense head, the sparse head, and the fused output independently. Pure dense recall might be 0.83; pure BM25 recall might be 0.71; fused recall is often 0.91 because the two heads miss different chunks. The interesting number is the lift, not the absolute. Score all three heads against the same labeled relevant chunks per query, since an exact-knn table only measures how closely the ANN tracks the dense ranking and cannot credit a chunk that only BM25 found. Then read the per-stratum lift to see where BM25 is doing the real work.
Why is filter correctness a separate evaluation from recall?
Because a vector DB can return high recall on the right candidates but apply the filter wrong on payloads. We have seen Qdrant return tenant A documents in tenant B's filtered query when the payload index lagged the vector index, and we have seen pgvector with HNSW + a btree filter return technically correct rows that violated an OR-of-AND clause. Filter correctness is a unit-test problem, not a recall metric. Run 500 queries with known filter expressions, retrieve, parse the payload, and assert the filter holds on every result. A 0.999 pass rate is the floor; anything below means a payload-index bug, not a recall bug.
How does Future AGI evaluate vector database recall in production?
Three surfaces, one loop. traceAI auto-instruments retrieval calls into a typed RETRIEVER span with `vector.db`, `vector.index_type`, `vector.ef_search`, `vector.namespace`, and the filter expression captured as a span attribute. The ai-evaluation SDK (Apache 2.0) ships RAG-specific EvalTemplate classes (Groundedness, ContextAdherence, ContextRelevance, ChunkAttribution, ChunkUtilization, Completeness); CustomLLMJudge covers ANNExactRecall, FilterCorrectness, and HybridLift rubrics. Four distributed runners (Celery, Ray, Temporal, Kubernetes) parallelize a sweep across vector DB candidates and HNSW grids. Error Feed clusters failures via HDBSCAN over ClickHouse-stored span embeddings; a Sonnet 4.5 Judge writes an `immediate_fix` per cluster that feeds the Platform's self-improving evaluators.
Can I run the same eval setup across Pinecone, Weaviate, Qdrant, Milvus, pgvector, and Turbopuffer?
Yes. The golden set is corpus-level, not DB-level. The retrieval call is wrapped in a traceAI RETRIEVER span with vendor as an attribute. The same `Evaluator.evaluate()` call against the same eval templates produces a comparable score column per DB. The only DB-specific code is the client wrapper; the exact-knn ground truth, the filter expressions, the QPS profile, and the judge are shared. This is how a real bake-off runs: same queries, same filters, same expected chunks, four to six backends, one comparable score column per vendor.