Best Embedding Models in 2026: NV-Embed-v2, BGE-M3, E5-mistral, Voyage 3 Compared

The best embedding models in 2026: NV-Embed-v2, BGE-M3, E5-mistral, OpenAI v3, Voyage 3, Cohere Embed-3. MTEB benchmarks, pricing, and how to pick.


TL;DR: Best embedding models in May 2026

| Workload | Best model | License | Why |
| --- | --- | --- | --- |
| MTEB-topping general-purpose | NVIDIA NV-Embed-v2 | NVIDIA AI Foundation Models License (research + limited commercial) | 72.31 on MTEB English, 4096-dim, 32K context |
| Production multilingual RAG | BAAI BGE-M3 | MIT | 100+ languages, dense + sparse + multi-vector in one model, 8K context |
| Hosted API, default pick | OpenAI text-embedding-3-large | Commercial | 3072-dim, 8K context, broad language coverage, $0.13/1M tokens |
| Long-context hosted API | Voyage AI voyage-3 | Commercial | 32K context, $0.06/1M tokens, top-3 on MTEB v2 |
| Multilingual hosted API | Cohere Embed-3 | Commercial | Best non-English performance among hosted APIs |
| Best small/medium open model | Snowflake arctic-embed-l-v2.0 (568M) or bge-small-en (33M) | Apache 2.0 | Arctic for strongest leaderboard at small scale; bge-small-en for absolute minimum footprint |
| Code embeddings | jina-embeddings-v2-code or Codestral Embed | Apache 2.0 / commercial | Code-specific training, high recall on code retrieval |

Why embedding models are the foundation of modern AI

Embedding models translate text (or images, or audio) into dense numerical vectors that capture semantic meaning. Three workloads depend on them:

  1. Retrieval-augmented generation (RAG): every chunk in your knowledge base and every query gets embedded; cosine similarity surfaces the relevant chunks (see the sketch after this list).
  2. Semantic search: traditional keyword search misses synonyms, paraphrases, and intent; embeddings catch them.
  3. Clustering and classification: customer-support tickets, product catalog, news articles, anything that needs to be grouped by meaning.
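
To make the retrieval step concrete, here is a minimal sketch of cosine-similarity ranking. The 4-dimensional vectors are hypothetical stand-ins; real models emit 256-4096 dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity = dot product of the L2-normalised vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dim embeddings standing in for real 256-4096-dim vectors.
query_vec = np.array([0.1, 0.7, 0.2, 0.0])
chunk_vecs = [
    np.array([0.2, 0.6, 0.1, 0.1]),  # semantically close to the query
    np.array([0.9, 0.0, 0.1, 0.0]),  # unrelated
]

# Rank chunks by similarity to the query and keep the top-k.
scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
top_k = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:1]
print(top_k, [round(s, 3) for s in scores])
```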

According to Gartner, more than 30% of the increase in API demand through 2026 will come from LLMs and tools built on them. Most of those tools sit on top of an embedding model.

Types of embedding models: 2026 view

Static word embeddings (Word2Vec, GloVe, FastText)

  • Word2Vec: Google’s 2013 neural-net approach to word vectors.
  • GloVe: Stanford’s global matrix factorisation plus local context windows.
  • FastText: Facebook’s subword-aware extension of Word2Vec.

Verdict in 2026: static word embeddings are legacy. They give every occurrence of “bank” the same vector regardless of whether the context is financial or fluvial. Use them only for historical reproducibility, not new builds.

Contextual word embeddings (ELMo, BERT)

  • ELMo: bidirectional LSTM contextual embeddings (2018).
  • BERT: transformer-based bidirectional encoder (2018-2019), the architecture that everything since builds on.

Verdict in 2026: BERT-style encoders are the architectural ancestors of modern sentence embedders. Plain BERT is rarely used directly anymore; use sentence-tuned variants (SBERT, MPNet) or modern descendants.

Sentence-level embeddings (USE, SBERT, InferSent)

  • Universal Sentence Encoder (USE): Google’s 2018 sentence-level embedder.
  • SBERT: Sentence-BERT fine-tuned for semantic similarity. The progenitor of most modern dense embedders.
  • InferSent: Facebook’s BiLSTM-on-NLI-data sentence embedder (2017).

Verdict in 2026: SBERT-style models still ship in many production pipelines, especially the modern descendants (all-MiniLM, all-mpnet, bge-large). USE and InferSent are largely retired.
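
These descendants are typically served through the sentence-transformers library. A minimal usage sketch with all-MiniLM-L6-v2, the classic small SBERT descendant:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 emits 384-dim sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(
    ["How do I reset my password?", "Steps to recover your account login"],
    normalize_embeddings=True,  # unit-length vectors, so dot product == cosine
)
print(embeddings.shape)  # (2, 384)
```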

Universal text embeddings (E5, BGE, NV-Embed, GritLM, SFR-Embedding)

  • E5: Microsoft’s general embedder family. E5-mistral-7b is the LLM-backbone variant.
  • BGE: BAAI’s series; BGE-M3 is the current open-weight production champion.
  • NV-Embed: NVIDIA’s LLM-backbone embedder. NV-Embed-v2 currently leads the MTEB English leaderboard.
  • GritLM-7B: unified generation and embedding from one model.
  • SFR-Embedding-2: Salesforce’s strong general embedder.

Verdict in 2026: this is where the action is. Universal text embeddings handle retrieval, classification, clustering, and similarity from one model. Reranking is usually layered on top with a cross-encoder like BGE-reranker-v2 or Cohere Rerank-3.5 rather than served by the same encoder.

The MTEB leaderboard: who leads in May 2026

MTEB (Massive Text Embedding Benchmark) is the de facto leaderboard for embedding models. The current MTEB English Leaderboard (Snapshot May 2026, top 7) looks roughly like this:

| Model | Score | Dim | Context | License |
| --- | --- | --- | --- | --- |
| NV-Embed-v2 | 72.31 | 4096 | 32K | NVIDIA AI Foundation Models License |
| stella_en_1.5B_v5 | ~71 | 1536 | 8K | MIT |
| SFR-Embedding-2_R | ~71 | 4096 | 32K | CC BY-NC 4.0 |
| voyage-3-large (Voyage AI) | ~71 | 1024 | 32K | Commercial API; the larger sibling of voyage-3 |
| BGE-multilingual-gemma2 | ~70 | 3584 | 8K | Gemma license |
| GritLM-7B | ~66 | 4096 | 32K | Apache 2.0 |
| E5-mistral-7b-instruct | ~66 | 4096 | 32K | MIT |

Always check the live leaderboard, not this table. Numbers shift weekly.

Capabilities and optimisation

These models are trained on broad data and fine-tuned with contrastive objectives so they handle retrieval, classification, clustering, and similarity all at once.

Optimisation techniques that matter in production:

  • Matryoshka representations: truncate the embedding to the first 256 or 512 dimensions and lose only 1-3% recall. Cuts storage and search cost dramatically. OpenAI v3 and Snowflake arctic-embed both support this natively (see the truncation sketch after this list).
  • Quantisation: int8 or binary embeddings cut memory 4-32x with a 1-5% recall hit. Cohere Embed-3 ships native int8 and binary modes.
  • Knowledge distillation: a smaller “student” model trained to mimic a larger “teacher”. The bge-small and arctic-embed-m families are distilled to under 500M params.
  • Pruning: removing low-importance parameters; combined with distillation, it cuts inference cost 50-80%.
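
A minimal sketch of Matryoshka truncation. This assumes the model was trained with a Matryoshka objective; truncating an ordinary embedding this way degrades quality badly. The 3072-dim input is a stand-in for an OpenAI v3-style vector:

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dims: int = 256) -> np.ndarray:
    # Keep the leading dimensions, then re-normalise so cosine
    # similarity remains comparable across truncation levels.
    head = embedding[:dims]
    return head / np.linalg.norm(head)

full = np.random.default_rng(0).normal(size=3072)  # stand-in for a 3072-dim embedding
short = truncate_matryoshka(full, dims=256)
print(short.shape)  # (256,) -- 12x less storage per vector
```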

Core architecture in modern embedding models

Encoder-only (BERT family)

BERT and its sentence-tuned descendants (SBERT, MPNet) use deep bidirectional transformer encoders to build rich contextual representations.

Decoder-only with embed pooling (LLM-backbone)

NV-Embed-v2, E5-mistral-7b, GritLM, and SFR-Embedding all start from a decoder LLM (Mistral, Llama) and add a pooling layer plus contrastive fine-tuning. This pattern unlocked the recent MTEB gains.
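
A minimal sketch of the pooling step, assuming right-padded batches and last-token pooling; the exact recipe varies by model (NV-Embed, for instance, adds a learned latent-attention pooler):

```python
import torch
import torch.nn.functional as F

def last_token_pool(hidden_states: torch.Tensor,
                    attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, dim) from the decoder's final layer.
    # With right padding, the last real token sits at index sum(mask) - 1.
    last_idx = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(hidden_states.size(0))
    pooled = hidden_states[batch_idx, last_idx]  # (batch, dim)
    return F.normalize(pooled, dim=-1)           # unit-length embeddings
```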

Self-attention

Self-attention scores the relevance of every token to every other token. This is the mechanism that lets modern embedders capture long-range context that ELMo and BiLSTMs missed.

Pretraining objectives in 2026

  • Masked language modelling (MLM) (BERT-era): predict masked tokens.
  • Contrastive learning: pull similar pairs together, push dissimilar pairs apart. The 2026 default for embedding models (see the loss sketch after this list).
  • Cross-encoder pretraining: train a separate cross-encoder for reranking after dense retrieval. BGE-reranker-v2 and Cohere Rerank-3.5 lead.
  • Instruction tuning for embedders: prepend “Represent this for retrieval:” to the query to bias the embedding toward the task. Used by E5, NV-Embed, BGE.
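
The standard contrastive objective is an InfoNCE loss with in-batch negatives. A minimal sketch; the temperature of 0.05 is a typical but not universal choice:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    # Row i of query_emb pairs with row i of pos_emb; every other
    # row in the batch serves as an in-batch negative.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature        # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```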

Architectural trade-offs: NV-Embed vs BGE-M3 vs OpenAI v3

| Aspect | NV-Embed-v2 | BGE-M3 | OpenAI text-embedding-3-large |
| --- | --- | --- | --- |
| MTEB English | 72.31 | ~67 | ~64 |
| Languages | English | 100+ | Broad (no leaderboard claim) |
| Modes | Dense | Dense, sparse, multi-vector (ColBERT) | Dense |
| Params | 7.85B | ~568M | Undisclosed |
| Context | 32K | 8K | 8K |
| License | NVIDIA AIFM (research-friendly, commercial with restrictions) | MIT | Commercial |
| Best for | Max accuracy on English retrieval | Multilingual production RAG | Zero-ops hosted API |

NV-Embed-v2 takes the leaderboard crown but has license restrictions on commercial use; check the NVIDIA AI Foundation Models License terms.

BGE-M3 is the production workhorse: it is MIT-licensed, supports 100+ languages, and ships dense, sparse, and multi-vector retrieval modes in one model. Most 2026 production RAG stacks default to BGE-M3 plus BGE-reranker-v2.
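
A usage sketch of all three modes, based on the FlagEmbedding package's documented interface at the time of writing; verify the signature against the current BGE-M3 README:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
out = model.encode(
    ["What is hybrid retrieval?"],
    return_dense=True,         # 1024-dim dense vectors
    return_sparse=True,        # learned lexical (token) weights
    return_colbert_vecs=True,  # per-token multi-vectors, ColBERT-style
)
print(out["dense_vecs"].shape, len(out["lexical_weights"][0]))
```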

How large language models power high-quality embeddings

LLMs like Mistral 7B and Llama 4 now serve as the backbone for the strongest open-weight embedders (NV-Embed-v2, E5-mistral, GritLM, SFR-Embedding-2). The training recipe:

  1. Start from a pretrained decoder LLM (Mistral 7B is the common choice).
  2. Replace the next-token head with a pooling layer (last-token or mean-pool).
  3. Contrastive fine-tune on hundreds of millions of paraphrase, NLI, and retrieval pairs.
  4. Optionally add instruction prompts to bias toward specific tasks.

The result: embeddings that benefit from the LLM’s broad world knowledge, at the cost of being 5-10x larger than BGE-style encoders. NV-Embed-v2 is 7.85B parameters; BGE-large is ~330M.
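
Step 4 in practice: instruction-tuned embedders prepend a task description to the query side only. A sketch of the E5-mistral-style template; check each model's card for its exact format, and note that document passages are typically embedded without the prefix:

```python
# E5-mistral-style query template; documents are embedded as-is.
def format_query(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"

print(format_query(
    "Given a web search query, retrieve relevant passages that answer the query",
    "best embedding model for multilingual RAG",
))
```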

How to choose an embedding model in 2026: a decision matrix

| Your need | Default pick | Backup |
| --- | --- | --- |
| Hosted API, default | OpenAI text-embedding-3-large | Voyage AI voyage-3 |
| Multilingual RAG production | BGE-M3 (self-host) | Cohere Embed-3 (hosted) |
| MTEB leaderboard max accuracy | NV-Embed-v2 | stella_en_1.5B_v5 |
| Long context (32K) | NV-Embed-v2 (self-host) or voyage-3 (hosted) | E5-mistral-7b |
| Code embeddings | Codestral Embed (commercial) | jina-embeddings-v2-code (Apache 2.0) |
| Smallest production model | bge-small-en (33M) | Snowflake arctic-embed-l-v2.0 (568M) for stronger quality at a slightly larger size |
| Hybrid retrieval (dense + sparse) | BGE-M3 | SPLADE |
| Permissive open license | BGE-M3 (MIT), arctic-embed (Apache 2.0), GritLM (Apache 2.0), jina-embeddings-v2 (Apache 2.0) | E5 family (MIT) |

Most teams should start with BGE-M3 or OpenAI text-embedding-3-large, build a labelled retrieval set, and re-evaluate after 90 days of production data.

How to evaluate embedding models for your workload

Leaderboards lie. Your domain is different. Build a labelled retrieval set and score candidates yourself.

The 5-step eval pattern that works:

  1. Collect 200+ queries. Pull from real production logs if available; synthetic if not.
  2. Mark the correct passages. For each query, identify the chunks that should be retrieved.
  3. Score recall@k. k=5 and k=10 are the standard cutoffs.
  4. Score NDCG@10. Rewards correct passages near the top. Both metrics are sketched after this list.
  5. Add a reranker. BGE-reranker-v2 or Cohere Rerank-3.5 typically lifts recall@5 by 10-25%.
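
The metrics from steps 3 and 4 fit in a few lines. A minimal sketch assuming binary relevance labels and string document IDs:

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Fraction of the relevant passages that appear in the top-k results.
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    # Binary-relevance NDCG: log-discounted gain for each relevant hit,
    # normalised by the best achievable ordering.
    dcg = sum(1 / math.log2(i + 2) for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

print(recall_at_k(["d3", "d1", "d9"], {"d1", "d2"}, k=3))            # 0.5
print(round(ndcg_at_k(["d3", "d1", "d9"], {"d1", "d2"}, k=3), 3))    # ~0.387
```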

Future AGI’s evaluate function exposes retrieval metrics directly:

```python
import os
from fi.evals import evaluate, Evaluator

os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

# For each (query, retrieved-chunks) pair, score retrieval quality.
query = "What were the EU AI Act enforcement deadlines added in 2026?"
retrieved_chunks = ["Retrieved chunk 1...", "Retrieved chunk 2..."]
generated_answer = "The model's RAG answer goes here."

score = evaluate(
    evaluator=Evaluator.CONTEXT_RELEVANCE,
    input=query,
    output=generated_answer,
    context=retrieved_chunks,
)
print(score)
```

Set FI_API_KEY and FI_SECRET_KEY to log to the dashboard. The library is Apache 2.0 at github.com/future-agi/ai-evaluation.

How to monitor embedding-model performance in production with Future AGI

Embedding choice locks in retrieval recall, which locks in RAG accuracy. If your embedding model drifts (or your data drifts under your embedding), your downstream eval scores drift.

Two pieces of the Future AGI platform pair with any embedding model choice:

  • traceAI captures every retrieval span (query embedding, top-k chunks returned, similarity scores) as Apache 2.0 OpenTelemetry instrumentation. Source at github.com/future-agi/traceAI.
  • Evaluate scores retrieval and generation quality together with built-in metrics: context_relevance, context_sufficiency, chunk_attribution, groundedness. Apache 2.0 library at github.com/future-agi/ai-evaluation.

Future AGI is not an embedding model vendor and never will be: embeddings are best owned by specialist labs (NVIDIA, BAAI, OpenAI, Voyage, Cohere, Jina). Future AGI is the eval and observability layer that pairs with whichever embedding you pick.


Frequently asked questions

What is the best embedding model in 2026?
For pure MTEB-leaderboard performance, NVIDIA's NV-Embed-v2 (4096-dim, 7.85B params) is the top open-weight model with an MTEB English Leaderboard score of 72.31 as of May 2026. For production RAG, BAAI's BGE-M3 dominates because it handles dense, sparse, and ColBERT-style multi-vector retrieval in one model and supports 100+ languages with 8K context. For hosted APIs, OpenAI text-embedding-3-large and Voyage AI's voyage-3 are the strongest. For multilingual retrieval, Cohere Embed-3 leads. Pick on workload, not on leaderboard rank.
Should I use an open-source embedding model or a hosted API in 2026?
Use open-source (BGE-M3, NV-Embed-v2, E5-mistral-7b) when you need full control over data privacy, want to fine-tune on your domain, or have predictable enough volume that a single H100 makes economic sense. Use a hosted API (OpenAI v3, Voyage 3, Cohere Embed-3) when you want zero infrastructure operations, need to start in days not weeks, or your volume varies wildly. The crossover point is usually around 100M tokens/month: below that, hosted wins; above that, self-host wins.
What is MTEB and how reliable is it as a benchmark for embedding models in 2026?
MTEB (Massive Text Embedding Benchmark) is the de facto leaderboard for embedding models, covering 56 datasets spanning retrieval, classification, clustering, semantic similarity, reranking, and summarisation. It is reliable for comparing general-purpose English performance, but models tuned for the leaderboard can overfit to MTEB tasks. Always validate on your own retrieval set, especially if your domain is medical, legal, or non-English. Reference: https://huggingface.co/spaces/mteb/leaderboard.
What context length do modern embedding models support in 2026?
Context length has grown sharply. BGE-M3 supports 8K tokens. Voyage AI voyage-3 supports 32K. OpenAI text-embedding-3-large supports 8K. NVIDIA NV-Embed-v2 supports 32K. The 512-token cap of BERT-era embeddings is gone. Long-context embeddings reduce chunking pain in RAG but cost more per call and need careful tuning because most retrieval still works best on 500-2,000 token chunks.
How do I evaluate embedding models for my RAG application in 2026?
Build a labelled retrieval set: 200+ queries with the correct passages marked. Score candidate embedding models on recall@k (k=5, k=10) and NDCG@10. Reranking with a cross-encoder like BGE-reranker-v2 or Cohere Rerank-3.5 typically lifts recall by 10-25%. Future AGI's evaluate function exposes retrieval metrics (context_relevance, context_sufficiency, chunk_attribution) so you can score the embedding choice, reranker, and chunking strategy as one pipeline.
What is the difference between dense and sparse embeddings in 2026?
Dense embeddings (NV-Embed, BGE, OpenAI v3, Voyage 3) map text to a fixed-size float vector (768-4096 dimensions); excellent at semantic similarity. Sparse representations (BM25, SPLADE, BGE-M3's sparse mode) keep bag-of-words sparsity, with the learned variants additionally training term weights; excellent at exact-match recall and rare-token retrieval. Hybrid retrieval (dense + sparse + reranker) is the May 2026 default in production RAG, and BGE-M3 ships all three modes in one model.
Which embedding model handles multilingual text best in 2026?
BGE-M3 supports 100+ languages and is the open-weight leader on multilingual retrieval. Cohere Embed-3 is the strongest hosted multilingual API. OpenAI text-embedding-3-large handles non-English passably but is not the leader. For specific language families (Indic, Arabic, Chinese), region-specific models (Jina v3, multilingual-e5-large) sometimes beat the global leaders. Always validate against a language-specific labelled set.
How much does it cost to embed 1 million tokens in 2026?
OpenAI text-embedding-3-small costs $0.02 per 1M tokens. text-embedding-3-large costs $0.13. Voyage AI voyage-3 costs $0.06 per 1M. Cohere Embed-3 costs $0.10 per 1M. Self-hosting BGE-M3 or NV-Embed-v2 on a rented H100 costs roughly $2/hour and can embed tens of millions of tokens per hour at high batch sizes, often dropping per-token cost below the hosted small-model tier when the GPU is kept saturated. At low or bursty volume, hosted APIs are far simpler and the cost difference is small. Crossover depends heavily on your batch size, model choice, and hardware utilisation.