What Is pgvector?

A PostgreSQL extension that stores embedding vectors and performs nearest-neighbor search for RAG retrieval inside the database.

pgvector is an open-source PostgreSQL extension for storing embedding vectors and running nearest-neighbor search inside Postgres. In a RAG system, it is the vector-search layer that returns relevant document chunks before the LLM answers. It shows up in production traces as a retrieval span with query embedding, top-k results, scores, collection/table name, and filter metadata. FutureAGI instruments pgvector through traceAI:pgvector so teams can connect retrieval behavior to answer quality.

Why pgvector Matters in Production LLM and Agent Systems

pgvector usually enters a system as a convenience choice: the team already has Postgres, so embeddings live beside accounts, permissions, and document metadata. That convenience is real, but so are the failure modes, and they are mundane. A missing HNSW or IVFFlat index turns top-k retrieval into a latency incident. An embedding-dimension mismatch blocks writes during ingestion. Skipping the re-embed and reindex after an embedding-model migration returns neighbors from the old vector space. A tenant filter applied after vector search can leak context or bury the right document behind another customer’s chunk.
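
A minimal sketch of those schema and index decisions, assuming illustrative names (a chunks table, 1536-dimensional embeddings): the typed vector column rejects wrong-dimension writes at ingestion time, and the HNSW index keeps top-k search off the sequential-scan path.

import psycopg  # pip install "psycopg[binary]"

with psycopg.connect("dbname=rag", autocommit=True) as conn:  # illustrative DSN
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id        bigserial PRIMARY KEY,
            tenant_id text NOT NULL,
            content   text NOT NULL,
            embedding vector(1536) NOT NULL  -- inserts with any other dimension fail here
        )
    """)
    # Without this (or an IVFFlat equivalent), every top-k query is a full scan.
    conn.execute("""
        CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
        ON chunks USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    """)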

Developers feel this as flaky RAG quality rather than a database error. SREs see p99 retrieval latency, lock waits, autovacuum pressure, and CPU spikes on retrieval spans. Product teams see answer quality vary by corpus size because the first hundred thousand rows behaved differently than the next ten million. Compliance teams care because pgvector often shares the same database as sensitive business data.

In 2026-era agent systems, one weak retrieval step compounds across a plan. A support agent may retrieve the wrong refund policy, call a CRM tool with the wrong justification, and then write a follow-up email. Unlike Pinecone or Qdrant, pgvector inherits Postgres operational boundaries: indexes, transactions, vacuum, replicas, and SQL permissions all shape RAG reliability. That inheritance is an asset only when retrieval quality and database health are measured together.

How FutureAGI Handles pgvector

FutureAGI’s approach is to make pgvector a traceable retrieval dependency, not a hidden SQL call. The traceAI:pgvector (traceAI-pgvector) integration wraps pgvector-backed retrieval in Python, TypeScript, and Java workflows and records a span for each search. Engineers inspect retrieval.documents, retrieval.score, vector.collection, vector.metric, top-k, embedding model, and filter metadata beside the downstream LLM span.
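
traceAI:pgvector emits these spans automatically; purely to show their shape, a hand-rolled OpenTelemetry sketch might look like the following, with attribute keys taken from the fields above and run_pgvector_query a hypothetical helper.

from opentelemetry import trace  # pip install opentelemetry-api

tracer = trace.get_tracer("rag.retrieval")

def traced_search(query_embedding, k=10):
    # The integration records this span for you; shown manually for shape only.
    with tracer.start_as_current_span("pgvector.search") as span:
        span.set_attribute("vector.collection", "chunks")
        span.set_attribute("vector.metric", "cosine")
        span.set_attribute("retrieval.top_k", k)  # key name illustrative
        results = run_pgvector_query(query_embedding, k)  # hypothetical helper
        span.set_attribute("retrieval.documents", [r.id for r in results])
        span.set_attribute("retrieval.score", [r.score for r in results])
        return results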

That trace becomes useful when it is paired with evals. ContextRelevance checks whether retrieved chunks match the user query. Groundedness checks whether the generated answer stays inside the supplied context. ChunkAttribution checks whether the answer can be tied back to the chunks pgvector returned. Together, those signals separate database retrieval problems from generation problems.
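
A sketch of running two of those checks side by side. The ContextRelevance call follows the example shown later on this page; the Groundedness signature is an assumption, not confirmed fi.evals API, so treat its parameter names as placeholders.

from fi.evals import ContextRelevance, Groundedness

query = "How do I rotate an API key?"
context = "API keys can be rotated from Settings > Security > Keys."
answer = "Open Settings > Security > Keys and rotate the key there."

relevance = ContextRelevance().evaluate(input=query, context=context)
# Assumed to mirror the ContextRelevance call with the model output added;
# check the fi.evals reference for the real parameter names.
grounded = Groundedness().evaluate(input=query, context=context, output=answer)
print(relevance.score, grounded.score)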

A typical FutureAGI workflow starts when a team changes pgvector from exact search to an HNSW index. The trace dashboard shows p99 retrieval latency improving, but the evaluation cohort shows ContextRelevance dropping for policy questions with strict tenant filters. The engineer compares spans before and after the migration, sees lower scores on filtered searches, raises ef_search, rebuilds the index, and sets an alert on eval-fail-rate-by-cohort. The fix is not “use a different database”; it is to make the pgvector configuration measurable against the RAG job it serves.
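
The ef_search change in that workflow is one setting: raising it widens the HNSW candidate pool at some latency cost, and SET LOCAL scopes the change to a single transaction. Values and the conn handle are illustrative, continuing the earlier psycopg sketch.

with conn.transaction():
    # SET LOCAL confines the wider search to this transaction;
    # pgvector's default hnsw.ef_search is 40.
    conn.execute("SET LOCAL hnsw.ef_search = 100")
    rows = conn.execute(
        "SELECT id FROM chunks ORDER BY embedding <=> %s LIMIT 10",
        (query_embedding,),  # bound via pgvector's psycopg adapter (see below)
    ).fetchall()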

How to Measure or Detect pgvector Quality

Measure pgvector at both the database and RAG layers:

  • Retrieval latency: p50 and p99 on traceAI:pgvector spans, sliced by table, tenant, index type, and filter shape.
  • Recall@k: whether known relevant chunks appear in the top-k results for a labelled query set.
  • ContextRelevance: scores whether retrieved chunks match query intent before the LLM sees them.
  • ChunkAttribution: checks whether the final answer uses the chunks pgvector returned.
  • Freshness lag: time from source-document update to searchable vector row.
  • Operational signals: Postgres CPU, lock waits, index size, replica lag, and slow-query count.

For example, a single ContextRelevance check against one retrieved chunk:

from fi.evals import ContextRelevance

# Score how well the retrieved context matches the user's question.
result = ContextRelevance().evaluate(
    input="How do I rotate an API key?",
    context="API keys can be rotated from Settings > Security > Keys."
)
print(result.score, result.reason)

Pair the evaluator result with the retrieval span. If latency is fine but ContextRelevance drops, inspect chunking, embedding model version, distance metric, and filter selectivity before changing the prompt.
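
Recall@k from the list above needs no special tooling. Given a labelled query set and a wrapper around the pgvector search (both hypothetical names here), a few lines of plain Python suffice:

def recall_at_k(labelled_queries, search_top_k, k=10):
    # labelled_queries: {query_text: iterable of known-relevant chunk ids}
    # search_top_k: hypothetical wrapper that embeds the query, runs the
    #   pgvector top-k search, and returns chunk ids.
    hits, total = 0, 0
    for query, relevant_ids in labelled_queries.items():
        returned = set(search_top_k(query, k))
        hits += len(returned & set(relevant_ids))
        total += len(relevant_ids)
    return hits / total if total else 0.0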

Common Mistakes

  • Treating pgvector as “just Postgres.” Vector search adds index choices, distance metrics, and recall tradeoffs that normal B-tree intuition does not cover.
  • Skipping labelled recall tests. A fast HNSW query is not useful if the correct policy document falls outside top-k.
  • Mixing embedding dimensions. Model migrations fail quietly when old and new vectors share a table without versioned columns or backfill rules.
  • Filtering after retrieval. Applying tenant or permission filters too late can hide relevant chunks or expose the wrong context; see the query sketch after this list.
  • Ignoring Postgres maintenance. Vacuum pressure, replica lag, and index bloat can become RAG incidents before application metrics fire.
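
For the filtering pitfall, the safest default is to keep the tenant predicate in the same statement as the distance ordering, so no cross-tenant row is ever a top-k candidate. Names continue the illustrative schema above; register_vector is pgvector's own psycopg adapter.

from pgvector.psycopg import register_vector  # pip install pgvector

def search_chunks(conn, query_embedding, tenant_id, k=10):
    register_vector(conn)  # lets numpy arrays bind to the vector column
    # Filter and vector ordering run in one statement, before LIMIT applies.
    return conn.execute(
        """
        SELECT id, content, embedding <=> %s AS distance
        FROM chunks
        WHERE tenant_id = %s
        ORDER BY embedding <=> %s
        LIMIT %s
        """,
        (query_embedding, tenant_id, query_embedding, k),
    ).fetchall()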

Frequently Asked Questions

What is pgvector?

pgvector is an open-source PostgreSQL extension that stores embeddings and runs nearest-neighbor vector search inside Postgres. In RAG systems, it keeps document chunks, metadata, and retrieval indexes close to the operational database.

How is pgvector different from Pinecone?

pgvector runs inside PostgreSQL, so it shares SQL, permissions, backup, and operational controls with the rest of the database. Pinecone is a managed vector database built as a separate retrieval service.

How do you measure pgvector?

FutureAGI measures pgvector through traceAI:pgvector retrieval spans plus evaluators such as ContextRelevance and ChunkAttribution. Track retrieval latency, recall@k, retrieved scores, and downstream groundedness together.