What Is a Vector Database?

A database optimised for storing high-dimensional embedding vectors and serving approximate nearest-neighbour search with metadata filtering.

A vector database is a storage system optimised for high-dimensional embedding vectors and approximate nearest-neighbour (ANN) search. Unlike a traditional row store that indexes scalar values for exact-match or range queries, a vector database indexes vectors using structures such as HNSW (Hierarchical Navigable Small World), IVF (Inverted File), DiskANN, or LSH, trading exact results for sub-linear search time at billion-vector scale. As of May 2026, a modern vector DB does much more than top-k similarity: it combines vector similarity with metadata filters (filtered search), supports streaming inserts and multi-tenant namespacing with per-tenant index isolation, offers hybrid sparse+dense scoring (BM25 + dense, or SPLADE), and increasingly ships built-in reranker hooks. In RAG systems it holds embedded document chunks; at query time an embedding model embeds the user’s query, the vector DB searches for it, and the top-k matches return in tens of milliseconds. FutureAGI does not ship one; we instrument the leading options through traceAI so retrieval is observable and evaluable wherever you store your embeddings.
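To make the query path concrete, here is a minimal sketch of the exact (brute-force) baseline that ANN indexes approximate: score every stored vector against the query, apply a metadata filter, and keep the top-k. The record layout, the `top_k` helper, and the toy 3-dimensional vectors are assumptions made for illustration, not any product's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, records, k, where=None):
    """Exact top-k search with an optional metadata filter.

    records: list of (doc_id, vector, metadata) tuples (hypothetical layout).
    where:   dict of metadata key/value pairs every hit must match.
    """
    where = where or {}
    candidates = [
        (cosine(query, vec), doc_id)
        for doc_id, vec, meta in records
        if all(meta.get(key) == value for key, value in where.items())
    ]
    candidates.sort(reverse=True)  # highest similarity first
    return [(doc_id, round(score, 3)) for score, doc_id in candidates[:k]]

# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
records = [
    ("refund-policy", [0.9, 0.1, 0.0], {"tenant": "acme"}),
    ("shipping-faq", [0.1, 0.9, 0.0], {"tenant": "acme"}),
    ("refund-internal", [1.0, 0.0, 0.0], {"tenant": "globex"}),
]
query = [1.0, 0.0, 0.0]

# The filter removes the other tenant's document even though it is the closest vector.
hits = top_k(query, records, k=2, where={"tenant": "acme"})
print(hits)
```

This scan is linear in corpus size, which is why production systems replace it with an ANN index once the corpus grows.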

Why a vector database matters in production LLM and agent systems

The vector database is the backbone of every RAG system, and its operational properties (latency, recall, freshness, cost, multi-tenant isolation, chunking compatibility) set hard ceilings on the rest of the stack. A vector DB that supports only unfiltered ANN search cannot handle multi-tenant filters; one that lacks streaming inserts cannot serve a knowledge base that updates daily; one without per-tenant namespacing leaks data across customers in subtle ways that pass a casual audit. Choosing the wrong index type at the wrong scale leads either to p99 latency that blows the response budget or to a recall ceiling that caps RAG quality below what the LLM could use.
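The namespacing point can be sketched with a toy in-memory store. This is an illustration of the isolation idea only: `NamespacedStore` and its methods are hypothetical names, not a real client library.

```python
from collections import defaultdict

class NamespacedStore:
    """Toy in-memory store with one isolated record list per tenant namespace."""

    def __init__(self):
        self._namespaces = defaultdict(list)

    def upsert(self, namespace, doc_id, vector):
        self._namespaces[namespace].append((doc_id, vector))

    def query(self, namespace, vector, k):
        # Only the caller's namespace is ever scanned, so a forgotten
        # metadata filter cannot surface another tenant's documents.
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))

        ranked = sorted(
            self._namespaces.get(namespace, []),
            key=lambda rec: dot(vector, rec[1]),
            reverse=True,
        )
        return [doc_id for doc_id, _ in ranked[:k]]

store = NamespacedStore()
store.upsert("acme", "acme-contract", [1.0, 0.0])
store.upsert("globex", "globex-contract", [1.0, 0.0])

acme_hits = store.query("acme", [1.0, 0.0], k=5)
print(acme_hits)
```

Both tenants hold an identical vector here, yet the query for one tenant cannot return the other's document, because the isolation is structural rather than a filter applied at query time.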

The 2026 stakes are higher than 2023 because agents do far more retrieval than chatbots did. A single agent turn often issues 3-10 vector queries: one for the user question, several for tool-input expansion, a few for working-memory lookups, and at least one for final answer assembly. Multiply that by an average conversation of 4-5 turns and a single conversation can produce 30 or more vector queries. The amortised latency budget per query is now around 30ms p99 at frontier latency targets, not the 200ms p99 that was acceptable for single-turn RAG in 2023. A vector DB that was “fast enough” two years ago will not survive a CrewAI or LangGraph agent workload today.
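The arithmetic behind that budget can be worked through directly. The sketch below assumes, for simplicity, that queries run sequentially and each one lands at the p99 figure; real agents parallelise some lookups, so treat the totals as a worst case.

```python
def queries_per_conversation(queries_per_turn, turns):
    """Total vector queries issued across a multi-turn conversation."""
    return queries_per_turn * turns

def retrieval_time_ms(num_queries, p99_ms):
    """Worst-case retrieval time if every query runs sequentially at p99."""
    return num_queries * p99_ms

low = queries_per_conversation(3, 4)     # light agent, short conversation
high = queries_per_conversation(10, 5)   # heavy agent, longer conversation

budget_2023 = retrieval_time_ms(30, 200)  # 30 queries at the 2023-era 200ms p99
budget_2026 = retrieval_time_ms(30, 30)   # 30 queries at a 30ms p99

print(low, high)                  # 12 50
print(budget_2023, budget_2026)   # 6000 900
```

At 30 queries, a 200ms p99 costs up to six seconds of retrieval alone, while 30ms keeps the same workload under one second.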

The pain shows up across roles. Retrieval engineers see recall drop as the corpus grows because the index parameters (HNSW ef_construction, M, IVF nlist/nprobe) were tuned for a smaller dataset and never re-tuned. Platform engineers see cost overruns from over-provisioned dedicated indexes when serverless would have worked, or surprise costs from per-query pricing on a high-throughput agent workload. SREs see incidents when index rebuilds bring down search, or when streaming-insert lag spikes after a large document import. Compliance leads need vector DBs that support tenant isolation, encryption at rest with customer-managed keys, audit logs, and EU residency; not all of them do, and pretending otherwise is a 2026 risk-register entry waiting to happen.
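To show why parameters such as nlist and nprobe need measuring rather than guessing, here is a toy IVF index in pure Python. It is a sketch under simplifying assumptions: centroids are sampled from the data instead of trained with k-means, and the random vectors are synthetic. It shows the mechanism only, not the behaviour of any particular database.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector index to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]

def exact_search(query, vectors, k):
    """Brute-force ground truth."""
    return sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:k]

rng = random.Random(0)  # fixed seed so the run is reproducible
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids = rng.sample(vectors, 16)  # nlist = 16 (sampled, not k-means trained)
lists = build_ivf(vectors, centroids)
queries = [[rng.random() for _ in range(8)] for _ in range(20)]

def mean_recall(nprobe, k=10):
    """Average overlap between IVF results and exact results across all queries."""
    scores = []
    for q in queries:
        approx = set(ivf_search(q, vectors, centroids, lists, k, nprobe))
        truth = set(exact_search(q, vectors, k))
        scores.append(len(approx & truth) / k)
    return sum(scores) / len(scores)

recall_by_nprobe = {n: mean_recall(n) for n in (1, 4, 16)}
print(recall_by_nprobe)
```

Probing all 16 lists is equivalent to an exhaustive scan, so recall reaches 1.0 there; smaller nprobe values scan fewer candidates and can only match or lose recall. How much they lose depends on the data, which is why the parameters need re-tuning as the corpus changes.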

In May 2026, the vector-DB landscape is broader and more differentiated than it was: Pinecone (managed serverless, strong ops story, expensive at scale), Weaviate (open-source with hybrid search and built-in modules), Qdrant (Rust-based, fast, gaining enterprise share), ChromaDB (developer-friendly, weak at large-scale ops), Milvus (cloud-native scale, complex), pgvector (Postgres extension for teams that want one DB, now with pgvectorscale for serious throughput), LanceDB (columnar, multimodal, file-based), Turbopuffer (serverless object-storage-backed, very cheap), Vespa (Yahoo’s mature ranking engine, hybrid scoring), and Vald. MongoDB Atlas, Redis Stack, Elasticsearch, OpenSearch, Azure AI Search, and Couchbase now ship first-class vector capabilities too. The differences matter for RAG quality and ops cost, but choice matters less than instrumentation: a fast vector DB that you cannot trace is still a black box.

How FutureAGI handles vector databases

FutureAGI’s approach is explicit: we do not ship a vector store; we instrument the ones you use. The traceAI library ships first-class integrations: traceAI-pinecone, traceAI-weaviate, traceAI-qdrant, and traceAI-chromadb, plus traceAI-milvus, traceAI-pgvector, traceAI-lancedb, traceAI-mongodb, traceAI-redis, traceAI-elasticsearch, and traceAI-azure-search. Each integration auto-instruments the client SDK and emits an OpenTelemetry span for every query, with attributes such as retrieval.documents, retrieval.score, vector.collection, vector.metric, vector.top_k, vector.filter, db.system, and db.operation. The fi.span.kind = "RETRIEVER" tag lets dashboards and the agent command center filter every retrieval span across the whole agent trajectory without parsing free-form span names.
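As a rough illustration of why a structured span-kind attribute helps, the sketch below filters a list of exported spans by attribute instead of by name. The dict shape, the span names, and the "LLM" kind value are assumptions made for the example; only the attribute keys come from the text above.

```python
# Hypothetical exported spans, represented as plain dicts for illustration.
spans = [
    {
        "name": "pinecone.query",
        "attributes": {
            "fi.span.kind": "RETRIEVER",
            "db.system": "pinecone",
            "vector.collection": "support-docs",
            "vector.top_k": 5,
        },
    },
    {"name": "llm.chat", "attributes": {"fi.span.kind": "LLM"}},
    {
        "name": "qdrant.search",
        "attributes": {
            "fi.span.kind": "RETRIEVER",
            "db.system": "qdrant",
            "vector.collection": "research-notes",
            "vector.top_k": 20,
        },
    },
]

def retriever_spans(all_spans):
    """Select retrieval spans by attribute, regardless of how each span is named."""
    return [s for s in all_spans if s["attributes"].get("fi.span.kind") == "RETRIEVER"]

collections = sorted(s["attributes"]["vector.collection"] for s in retriever_spans(spans))
print(collections)
```

The two retrieval spans come from different databases and carry different names, yet one attribute check finds both.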

That instrumentation feeds the eval and observability layers. fi.evals.ContextRelevance scores whether the retrieved vectors actually matched the query intent. fi.evals.ChunkAttribution confirms that the downstream LLM actually used what the vector DB returned, not something it invented. fi.evals.ContextPrecision and fi.evals.ContextRecall together measure whether the retrieved set is concentrated on relevant chunks and whether the relevant chunks are present at all. fi.evals.Groundedness then checks that the final answer is supported by what came back. The tracing dashboard shows per-collection p99 latency, per-tenant query distribution, recall outliers, and the eval-fail rate sliced by retrieval source: operational signals that vector DB providers’ own dashboards almost never combine with downstream RAG quality, because their telemetry stops at the storage layer.

A typical 2026 FutureAGI workflow: a team running Qdrant for one tenant cluster and Pinecone for another instruments both with traceAI. They run a unified golden dataset through each, see that Qdrant has 12% better ContextRelevance p10 but Pinecone has 40% lower p99 query latency under sustained agent load, and decide based on the latency-versus-quality tradeoff for each tenant: enterprise workloads where p99 latency matters more go to Pinecone, and internal research workloads where recall matters more go to Qdrant. The decision rests on data, not a vendor pitch. We treat vector DBs as a substrate FutureAGI runs on, not a layer FutureAGI replaces. The contrast worth naming: Pinecone and Weaviate each ship their own observability dashboards, but those dashboards stop at query-level latency and recall; they do not connect a retrieval span to the downstream LLM answer or the eval verdict on that answer. FutureAGI’s value is closing that loop.
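A comparison like the one described can be reduced to two tail statistics per candidate. The sketch below uses a nearest-rank percentile and invented sample values for two anonymous indexes; the numbers are placeholders for illustration, not measurements of any product.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Invented per-query samples from running one golden dataset through two indexes.
runs = {
    "index_a": {
        "latency_ms": [12, 14, 15, 15, 16, 18, 20, 22, 25, 48],
        "context_relevance": [0.62, 0.70, 0.74, 0.78, 0.80, 0.82, 0.85, 0.88, 0.90, 0.93],
    },
    "index_b": {
        "latency_ms": [9, 10, 10, 11, 12, 12, 13, 14, 15, 28],
        "context_relevance": [0.55, 0.63, 0.70, 0.75, 0.79, 0.81, 0.84, 0.86, 0.90, 0.92],
    },
}

summary = {
    name: {
        "p99_latency_ms": percentile(run["latency_ms"], 99),
        "p10_relevance": percentile(run["context_relevance"], 10),
    }
    for name, run in runs.items()
}

def pick(priority):
    """Choose the index with the lowest p99 latency or the highest p10 relevance."""
    if priority == "latency":
        return min(summary, key=lambda name: summary[name]["p99_latency_ms"])
    return max(summary, key=lambda name: summary[name]["p10_relevance"])

print(summary)
print(pick("latency"), pick("quality"))
```

Looking at p10 relevance rather than the mean keeps the focus on the worst-served queries, which is where users notice retrieval failures.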

Comparison: the 2026 vector DB landscape

For an opinionated picture of where each option fits in a serious 2026 stack:

Vector DB | Index types | Strengths | Watch-outs | Best fit
Pinecone | Proprietary serverless + pod | Strong managed ops, fast hybrid search, namespacing | Cost at scale; vendor lock-in | Enterprise SaaS, multi-tenant, low ops burden
Weaviate | HNSW + flat | Hybrid (BM25+dense), modules, GraphQL | Memory profile on large corpora | Open-source, hybrid search, semantic + keyword
Qdrant | HNSW (Rust) | Very fast, payload filtering, distributed | Newer enterprise story | Latency-critical, multi-tenant filters
Milvus / Zilliz | HNSW, IVF, DiskANN, GPU | Billion-scale, GPU search, hot/cold tiers | Operational complexity | Massive corpora, ML-heavy retrieval
ChromaDB | HNSW | Dev-friendly Python-first API | Weaker at large-scale ops | Prototyping, small to mid corpora
pgvector + pgvectorscale | HNSW, IVFFlat | Single-DB simplicity, joins, transactions | Throughput ceiling per node | Teams already on Postgres
LanceDB | IVF-PQ on Arrow/Lance | Columnar, multimodal, file-based | Younger ecosystem | Multimodal, embedded, edge
Turbopuffer | Object-storage-backed | Very cheap per GB, serverless | Higher latency tier | Cost-sensitive, large but cool corpora
Vespa | HNSW + rank profiles | Mature ranking, hybrid scoring | Steep learning curve | Search-heavy products, ranking engineering
Elasticsearch / OpenSearch | HNSW (Lucene 9.x) | Hybrid with existing ES infra | Slower vector-only path | Teams with existing ES skills
Azure AI Search | HNSW | Tight Azure integration, semantic ranker | Azure-only | Microsoft-shop enterprise
MongoDB Atlas Vector | HNSW | Same DB as operational data | Newer; query patterns evolving | Mongo-first stacks

The honest 2026 take: there is no single best vector DB. There is the best vector DB for your latency budget, recall floor, multi-tenant model, embedding dimension, ops staffing, and compliance footprint. The decision should come from a measured run through your RAG pipeline, not from a benchmark blog post.

How to use a vector database with FutureAGI

The install pattern looks the same regardless of which vector DB you pick. Install the matching traceAI integration, register the project, and let the retriever spans flow:

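from fi_instrumentation import register
from traceai_pinecone import PineconeInstrumentor
from traceai_langchain import LangChainInstrumentor

# Register the project once; every instrumentor shares the returned provider.
trace_provider = register(project_name="support-rag-prod")
# Auto-instrument the Pinecone client so each query emits a retriever span.
PineconeInstrumentor().instrument(tracer_provider=trace_provider)
# Instrument LangChain as well, so chain steps and retrieval share one trace.
LangChainInstrumentor().instrument(tracer_provider=trace_provider)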

Once instrumentation is live, every vector query produces a span with fi.span.kind = "RETRIEVER", the query embedding hash, the top-k results, per-result scores, the metadata filter, and the wall-clock latency. From there:

  1. Production monitoring: point the OTLP exporter at the FutureAGI agent command center; the tracing dashboard shows per-collection p99 latency, recall outliers, and per-tenant volume.
  2. Offline replay: re-run the same queries against a candidate index (different embedding model, different ANN params, different DB) through simulate; compare ContextRelevance and Groundedness deltas before swapping production.
  3. Release gates: attach ContextRelevance, ContextPrecision, ContextRecall, and ChunkAttribution to the golden dataset; the CI job re-runs the gate and blocks the deploy on retrieval regression.
  4. Cost and latency budgeting: slice query cost and gen_ai.client.operation.duration by vector.collection and tenant; the same dashboard works whether traffic hit Pinecone, Qdrant, or pgvector.
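The release-gate step can be sketched as a plain comparison of candidate scores against a stored baseline. The `failed_gates` helper, the 0.02 tolerance, and the score values are all assumptions for illustration; in practice the scores would come from the eval run on the golden dataset.

```python
def failed_gates(baseline, candidate, max_drop=0.02):
    """Return the metrics whose candidate score fell more than max_drop below baseline."""
    return sorted(
        name
        for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - max_drop
    )

# Invented mean scores for the current production index and a candidate index.
baseline = {
    "ContextRelevance": 0.84,
    "ContextPrecision": 0.79,
    "ContextRecall": 0.88,
    "ChunkAttribution": 0.91,
}
candidate = {
    "ContextRelevance": 0.85,
    "ContextPrecision": 0.74,
    "ContextRecall": 0.87,
    "ChunkAttribution": 0.91,
}

regressions = failed_gates(baseline, candidate)
print(regressions)  # ['ContextPrecision']
# In a CI job, a non-empty list would exit non-zero and block the deploy.
```

A small tolerance keeps the gate from failing on eval noise while still catching a real regression, such as the five-point precision drop in this example.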

How to measure or detect vector database quality

Vector-DB quality is measured at the storage and retrieval layer:

  • Query latency: p50/p99 on retrieve spans, captured automatically by traceAI-pinecone, traceAI-weaviate, traceAI-qdrant, traceAI-chromadb, and the other integrations. Hold p99 under 50ms for interactive agent loops and under 100ms for batch RAG; alert when p99 doubles week over week.
  • Recall@k on a labelled set: the percentage of queries where the gold doc appears in the top-k, the canonical retrieval-quality benchmark. Track recall@5, recall@10, and recall@20 separately; a corpus that drops recall@5 but holds recall@20 needs a reranker, not a new vector DB.
  • fi.evals.ContextRelevance: 0-1 score on whether the returned vectors actually match query intent. The most important RAG retrieval signal in 2026 because it directly predicts downstream answer quality.
  • fi.evals.ContextPrecision: how concentrated relevant chunks are in the top-k; penalises a noisy top-k.
  • fi.evals.ContextRecall: whether the relevant chunks are present at all in the retrieved set, regardless of order.
  • fi.evals.ChunkAttribution: pass/fail on whether the answer used what the vector DB returned. A retrieved chunk that the LLM ignores is wasted cost.
  • fi.evals.Groundedness: whether the final answer is supported by the retrieved context. Pairs with retrieval metrics for end-to-end RAG signal.
  • Index freshness lag: time from corpus insert to query visibility. Surfaces stale-context failures that look like model errors but are actually streaming-insert lag.
  • Per-tenant cost: cost per query and cost per GB stored, segmented by tenant. Serverless pricing that is cheapest at low volume often stops being the cheaper option at scale.
  • OTel attributes: vector.collection, vector.metric, vector.top_k, vector.filter, retrieval.documents, retrieval.score.
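Recall@k and the week-over-week latency alert from the list above are both simple to compute once the data is collected. A minimal sketch, using a tiny invented labelled set with one gold document per query:

```python
def recall_at_k(ranked_ids_per_query, gold_id_per_query, k):
    """Fraction of queries whose gold document appears in the top-k results."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_ids_per_query, gold_id_per_query)
        if gold in ranked[:k]
    )
    return hits / len(gold_id_per_query)

def p99_doubled(last_week_ms, this_week_ms):
    """Alert condition: this week's p99 is at least twice last week's."""
    return this_week_ms >= 2 * last_week_ms

# Invented results: each inner list is the ranked doc ids returned for one query.
ranked = [
    ["d1", "d7", "d3"],  # gold d1 at rank 1
    ["d9", "d4", "d2"],  # gold d2 at rank 3
    ["d5", "d6", "d8"],  # gold d0 not retrieved
    ["d3", "d2", "d1"],  # gold d3 at rank 1
]
gold = ["d1", "d2", "d0", "d3"]

recall = {k: recall_at_k(ranked, gold, k) for k in (1, 2, 3)}
print(recall)               # {1: 0.5, 2: 0.5, 3: 0.75}
print(p99_doubled(40, 85))  # True
```

The gap between recall@1 and recall@3 in this toy set is the pattern that points to a reranker: the gold document is being retrieved, just not ranked first.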

Minimal Python for scoring a retrieval result:

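from fi.evals import ContextRelevance, ContextPrecision, ContextRecall

# Illustrative inputs: in practice top_k_chunks comes from the vector DB
# and labelled_relevant from your golden dataset.
query = "What's the cancellation policy?"
top_k_chunks = [
    "Cancel any time within the first 30 days for a full refund.",
    "Shipping takes 3-5 business days.",
]
labelled_relevant = ["Cancel any time within the first 30 days for a full refund."]

cr = ContextRelevance()
cp = ContextPrecision()
crc = ContextRecall()

# Does the retrieved context match the query intent?
result_cr = cr.evaluate(
    input=query,
    context=top_k_chunks[0],
)
# How concentrated are the relevant chunks in the top-k?
result_cp = cp.evaluate(
    input=query,
    context_chunks=top_k_chunks,
    relevant_chunks=labelled_relevant,
)
# Are the relevant chunks present in the retrieved set at all?
result_crc = crc.evaluate(
    input=query,
    context_chunks=top_k_chunks,
    relevant_chunks=labelled_relevant,
)
print(result_cr.score, result_cp.score, result_crc.score)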

In our 2026 evals, the teams that consistently win on RAG quality pair retrieval metrics with end-to-end evals on the same dataset rather than chasing recall@10 in isolation. A vector DB that delivers 95% recall@10 but whose chunks the LLM mostly ignores is worse than one at 88% recall@10 whose chunks the model actually grounds against. We recommend running every candidate index through public retrieval benchmarks before swapping production: RAGTruth (roughly 18K annotated LLM responses; useful for span-level groundedness) and CRAG (4,409 questions across five domains) catch retrieval-attribution failures that single-corpus recall@k charts hide, and MultiHop-RAG forces multi-step retrievers to reveal whether top-k captures the supporting evidence chain or just lexical lookalikes.

Common mistakes

  • Picking a vector DB before knowing the access pattern. Read-heavy FAQ workloads have different needs than write-heavy chat-history workloads. Multi-tenant SaaS has different needs than single-corpus internal search. Define the access pattern first, then choose.
  • Over-tuning HNSW ef parameters without measuring recall. A larger ef costs latency on every query; how much recall it buys is empirical and dataset-specific. Re-tune after any embedding model change.
  • Ignoring metadata-filter performance. Filtered search on a poorly indexed metadata column can degrade latency by an order of magnitude. Index filter columns explicitly; in Pinecone use selective namespacing, in Qdrant payload indexes, in pgvector composite GIN/BTREE.
  • Single-index for multi-tenant data. Tenant data leaks happen at the metadata-filter layer. Use namespacing or per-tenant collections; do not rely solely on WHERE tenant_id = ... for security.
  • Skipping index rebuild plans. Embedding-model upgrades require a full re-embed; a vector DB without a clean rebuild path traps you on an old embedding version. Test the rebuild path before you depend on it.
  • Not measuring chunking impact. The vector DB’s recall ceiling is set by the chunks you put in. Bad chunking caps quality regardless of which DB you pick.
  • Ignoring the reranker option. Many “switch vector DB” projects should have been “add a Cohere or BGE reranker” projects. Reranking on top-50 with a cross-encoder almost always beats tuning ANN params on top-10.
  • Trusting vendor benchmarks. Published recall@10 numbers come from datasets the vendor chose. Run your own benchmark with your embeddings, your filters, and your scale before committing.
  • Skipping the PII redaction step before insert. Vector DBs that store raw text in payloads become a data-residency problem the moment an auditor asks. Strip or hash PII at chunk time.
  • No observability connecting the retrieval span to the answer. A retrieval that returned the right chunk, which the LLM then ignored, is invisible without ChunkAttribution. Without that link, you cannot tell whether quality issues are a retrieval problem or a generation problem.
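For the PII point in the list above, a chunk-time redaction pass might look like the following sketch. It handles email addresses only, and the regex, the salt handling, and the token format are assumptions for illustration; a real pipeline needs broader PII coverage and proper secret management.

```python
import hashlib
import re

# Deliberately simple email pattern; it will not catch every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact_emails(chunk, salt="rotate-me"):
    """Replace each email address with a salted, truncated SHA-256 token."""
    def _token(match):
        normalised = match.group(0).lower()
        digest = hashlib.sha256((salt + normalised).encode("utf-8")).hexdigest()
        return f"<email:{digest[:12]}>"

    return EMAIL_RE.sub(_token, chunk)

chunk = "Contact Jane at jane.doe@example.com or JANE.DOE@example.com for refunds."
clean = redact_emails(chunk)
print(clean)
```

Hashing rather than deleting keeps the same address mapped to the same token, so chunks can still be joined on identity without storing the raw value in the payload.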

Frequently Asked Questions

What is a vector database?

A vector database stores high-dimensional embedding vectors and serves approximate nearest-neighbour search at low latency. In RAG systems it holds embedded document chunks and returns top-k matches when the user query is embedded and searched.

How is a vector database different from a traditional database?

Traditional databases index scalar values for exact-match or range queries. Vector databases index high-dimensional vectors for similarity search using ANN algorithms like HNSW or IVF. Most modern vector DBs also support filtered search, which combines metadata filters with vector similarity, and many add hybrid sparse+dense scoring.

Does FutureAGI ship a vector database?

No. FutureAGI does not ship a vector store. We instrument the leading vector databases (Pinecone, Weaviate, Qdrant, ChromaDB, pgvector, Milvus, LanceDB) through traceAI integrations, so retrieval calls are observable and evaluable wherever you store your embeddings.