
Evaluating Embedding Models in 2026

MTEB Recall@10 does not transfer to your domain. 500 labeled query-passage pairs from your traffic decide which embedding wins. Without that, you are picking by leaderboard theater.


You shipped a RAG pipeline last quarter on OpenAI text-embedding-3-large because it topped the MTEB leaderboard. Recall@10 on your data is 0.74. Support agents tell you the bot returns the wrong warranty clause on enterprise tickets. The trace says the right chunk sits at rank 47 of 50: the supporting clause is in the corpus, but the embedding never moved it into the top ten. The team meeting blames the reranker. MTEB says the model is great. MTEB is true and useless.

The opinion this post earns: MTEB Recall@10 does not transfer. Your domain has its own retrieval distribution. Code, legal, medical, multilingual, semi-structured tickets all behave differently from the 56-task MTEB average. Two models within one point on the leaderboard regularly sit eight to twelve points apart on a 500-query set sampled from your real traffic. The cheapest reliability win in most RAG stacks is not a bigger model or a better reranker; it is a 500-pair labeled retrieval evaluation built from your traffic, sweeping six candidates at three dimension targets each through one gateway, with a per-stratum breakdown. Without that, you are picking by leaderboard theater.

This guide is the methodology. It covers why MTEB lies for production retrieval, the 500-pair labeled-eval protocol, per-domain decisions across code, legal, medical, multilingual, and prose, the cost-latency-dimension tradeoff, production patterns (Matryoshka, quantization, caching), and the traceAI EMBEDDING-span instrumentation that keeps the verdict honest after deploy. For the broader RAG stack, the RAG evaluation metrics deep dive covers the foundational definitions; this post sets the recall floor every other RAG layer inherits.

TL;DR: which embedding wins for which domain

| Domain | Default winner | Why |
| --- | --- | --- |
| English prose RAG | OpenAI text-embedding-3-large @ 1024 | Strongest default, Matryoshka dimensions, broad coverage |
| Multilingual production | Cohere Embed v4 | 100+ languages, smallest per-language floor |
| Code and code-mixed | Voyage voyage-3-large (code variant) | Domain-tuned on identifiers and call graphs |
| Legal and finance | Voyage voyage-3-large (domain variants) | Fine-tuned on contracts and filings |
| Self-host, on-prem, air-gapped | Mixedbread mxbai-embed-large-v2 | Open weights, quantization-friendly, single-GPU |
| Cost-constrained open-weight | Stella v5 (1.5B) | Strongest small model, OSS, cheap inference |
| Hybrid retrieval (dense + sparse + late-interaction) | BGE bge-m3 | One model serves three signals |

Two non-negotiables hold across every embedding decision. First, build a 500-pair labeled retrieval set from your production traffic before you pick: MTEB filters; your data decides. Second, stratify by query length, language, and document type, because a scalar average hides the cohort that fails real customers.

Why MTEB lies for production retrieval

MTEB is a useful competence filter. As a verdict on which embedding wins on your data, it is consistently wrong. Three reasons.

Distributional mismatch. MTEB aggregates 56 tasks across web text, scientific abstracts, BEIR-style question-passage pairs, and a long tail of clustering and classification benchmarks. The joint distribution of (query shape, passage shape, language) in the average has approximately zero overlap with most production corpora. A model that wins news classification can lose warranty-clause retrieval where queries are six-token enterprise-tier fragments and passages are nested legal subsections. On per-domain Recall@10, the ordering of top-five embeddings flips on at least one workload in every controlled study we have seen.

Benchmark contamination. Several frontier embeddings train on data that overlaps with MTEB tasks. The leaderboard climbs without production recall climbing with it. Voyage’s technical notes flag this; Anthropic and Cohere have published variants of the same caveat. Treat MTEB as a competence floor, not a ranking.

Frozen tasks, moving world. MTEB’s task mix was assembled in 2022-2023. The 2026 retrieval workload is shorter queries, more multilingual fragments, code-mixed documents, and agent loops that embed and retrieve multiple times per turn. The average underweights every shift.

The fix is not to argue with MTEB. Use it to drop the bottom half of the catalog (anything below 60 on MTEB English is rarely worth the eval slot), then run a 500-pair labeled retrieval evaluation on your data. The candidate that wins the average can still lose the strata that matter. The candidate that loses the average can win your workload by ten points. Both are invisible without your own labels.

The 500-pair labeled-eval methodology

A 500-pair labeled query-passage set built from production traces beats a 5000-pair synthetic set every time. Five hundred pairs is enough to separate winners from losers with 95% confidence on the rank order for most production corpora. The shape of the set is what makes it useful.
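Whether 500 pairs actually separates two candidates on your corpus is something you can check rather than assume. The sketch below is one common way to do that, a paired bootstrap over per-query Recall@10 hits; it is not a procedure the SDK prescribes. The function name, the 2,000-resample default, and the 0/1 hit encoding are all assumptions made for illustration.

```python
import random

def paired_bootstrap(hits_a, hits_b, n_resamples=2000, seed=0):
    """Paired bootstrap on per-query Recall@10 hits (1 = hit, 0 = miss).

    hits_a and hits_b must be aligned: index i is the same query for both models.
    Returns (observed mean difference, 2.5th percentile, 97.5th percentile).
    """
    if len(hits_a) != len(hits_b):
        raise ValueError("both models must be scored on the same queries")
    rng = random.Random(seed)
    n = len(hits_a)
    diffs = [a - b for a, b in zip(hits_a, hits_b)]
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each (a, b) pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lo, hi
```

If the interval excludes zero, the gap between the two candidates is unlikely to be sampling noise at this set size. If it straddles zero, either label more pairs or treat the two models as tied on recall and decide on cost and latency.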

Sample from real traffic. Pull 500 queries from production logs across four strata: short keyword (under five tokens), long natural language (more than 15 tokens), domain jargon (SKUs, ICD codes, statute references, identifiers), and multilingual fragments if you serve more than one language. The RAG observability workflow makes this routine; an afternoon of trace export, deduplication, and stratification builds the seed set.
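The stratification step can be sketched in a few lines. This is a minimal illustration, assuming queries arrive as `(text, language)` tuples, that whitespace splitting is a good enough proxy for token counts, and that a simple letters-plus-digits regex flags jargon. The `JARGON` pattern and the `mid_length` fallback bucket are hypothetical additions, not part of the four strata above.

```python
import random
import re
from collections import defaultdict

# Hypothetical jargon detector: a token of 4+ characters mixing letters and digits
# (SKU-like or code-like). Replace with patterns that match your own identifiers.
JARGON = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)[\w.\-]{4,}\b")

def assign_stratum(query, language="en"):
    n_tokens = len(query.split())  # whitespace split as a rough token count
    if language != "en":
        return "multilingual"
    if JARGON.search(query):
        return "domain_jargon"
    if n_tokens < 5:
        return "short_keyword"
    if n_tokens > 15:
        return "long_natural_language"
    return "mid_length"  # assumption: bucket for queries of 5 to 15 tokens

def stratified_sample(queries, per_stratum, seed=0):
    """queries: iterable of (text, language). Deduplicates, then samples per stratum."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    seen = set()
    for text, language in queries:
        key = " ".join(text.lower().split())  # case- and whitespace-insensitive dedup
        if key in seen:
            continue
        seen.add(key)
        buckets[assign_stratum(text, language)].append((text, language))
    sample = {}
    for stratum, items in buckets.items():
        rng.shuffle(items)
        sample[stratum] = items[:per_stratum]
    return sample
```

Sampling a fixed count per stratum, rather than proportionally, keeps small but important cohorts from vanishing in the seed set.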

Label with a human reviewer for the supporting chunk. Each query gets a chunk ID that contains the supporting span. Span-level labels matter: a chunk counts as a hit only if it carries the answer text, never the topic alone. Synthetic labeling with an LLM is fine as a bootstrap; the production rubric needs human ground truth for the cohort that ships.

Score Recall@10, MRR, NDCG@10, p95 latency, and cost per million tokens. Five numbers, per candidate, per stratum.
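The three ranking metrics are deterministic, so they can be computed in plain Python alongside any judge-based rubric. The sketch below uses the standard definitions and assumes each query has a list of ranked chunk IDs plus either a set of relevant IDs or a dict of graded gains. The function names are illustrative and do not come from the SDK.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant chunks that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one labeled chunk")
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gains, k=10):
    """gains maps chunk_id -> graded relevance; unlabeled chunks count as 0."""
    dcg = sum(gains.get(cid, 0) / math.log2(i + 2)
              for i, cid in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

MRR is the mean of `reciprocal_rank` over the query set. With one labeled chunk per query, `recall_at_k` reduces to a 0/1 hit, which is the form the per-stratum breakdown needs.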

from fi.evals import Evaluator
from fi.evals.templates import (
    ContextRelevance, ChunkAttribution, ChunkUtilization,
    Groundedness, ContextAdherence, Completeness,
    CustomLLMJudge,
)
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

# Custom rubric: did the labeled chunk land in the top 10, and at what rank?
domain_recall = CustomLLMJudge(
    name="domain_recall_at_k",
    rubric="""
    Given (query, expected_chunk_id, retrieved_chunks_top_10),
    return 1.0 if expected_chunk_id is in retrieved_chunks_top_10,
    else 0.0. Also flag the rank of the expected chunk if present.
    """,
    judge_model="gpt-4.1",
)

def score_embedding_candidate(golden_set, embed_fn, retrieve_fn):
    """Score one embedding candidate on every example in the labeled set."""
    rows = []
    for ex in golden_set:
        # Retrieve the top 10 chunks using the candidate's embedding function.
        retrieved = retrieve_fn(ex.query, k=10, embed=embed_fn)
        tc = TestCase(
            query=ex.query,
            expected_chunk_id=ex.expected_chunk_id,
            retrieved_chunks=[r.id for r in retrieved],
            context="\n\n".join(r.text for r in retrieved),
            stratum=ex.stratum,
            language=ex.language,
        )
        # Run the retrieval rubrics plus the custom recall judge on this example.
        result = evaluator.evaluate(
            eval_templates=[
                ContextRelevance(), ChunkAttribution(),
                ChunkUtilization(), Groundedness(),
                domain_recall,
            ],
            inputs=[tc],
        )
        # Keep stratum and language with each result for the per-stratum breakdown.
        rows.append((ex.stratum, ex.language, result))
    return rows

Wire downstream rubrics so retrieval-to-answer stays connected. Score Groundedness, ContextRelevance, ChunkAttribution, and ChunkUtilization on the generated answer with the retrieved context. A 5-point recall gain that does not move answer quality means the generator is ignoring the recall. The chunk-attribution deep dive covers the rubric definitions.

Decide per stratum, not on the average. A model that wins on average and loses 15 points on multilingual is not a winner if multilingual is 20% of your traffic. Read the per-language, per-query-length, per-document-type breakdown. The leaderboard you build internally is the one that ships the decision.
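A per-stratum breakdown is a small aggregation once each query has a 0/1 hit. The sketch below assumes rows shaped as `(stratum, hit)` tuples; both helper names are hypothetical. It reports recall per stratum and the largest loss a candidate takes against a baseline on any single stratum.

```python
from collections import defaultdict

def per_stratum_recall(rows):
    """rows: iterable of (stratum, hit), hit = 1 if the labeled chunk was in the top 10."""
    totals = defaultdict(lambda: [0, 0])  # stratum -> [hits, queries]
    for stratum, hit in rows:
        totals[stratum][0] += hit
        totals[stratum][1] += 1
    return {s: hits / n for s, (hits, n) in totals.items()}

def worst_stratum_gap(candidate, baseline):
    """Largest per-stratum recall loss of candidate vs baseline, as (gap, stratum)."""
    shared = candidate.keys() & baseline.keys()
    return max(((baseline[s] - candidate[s], s) for s in shared),
               default=(0.0, None))
```

A candidate that leads on the average but shows a large `worst_stratum_gap` on a stratum that carries real traffic is the case the average hides.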

Per-domain decisions: where each embedding wins

The same six candidates behave differently across document types. The per-domain breakdown is the methodology’s main output.

Code and code-mixed: Voyage voyage-3-large (code variant) or BGE bge-m3 with late-interaction. Function names, variable identifiers, and call-graph references decide the answer. Pooled embeddings flatten the identifier signal. Voyage’s code-tuned variants lift Recall@10 by 4 to 8 points over OpenAI text-embedding-3-large on most code corpora we have measured. BGE bge-m3 with the late-interaction (ColBERT-style) head can match or beat it at 4x to 10x storage; the chunking-strategies eval covers the late-interaction case.

Legal, finance, medical: Voyage domain variants or OpenAI with clause-level chunking. Contracts, regulations, clinical notes, filings. Recall is typically high (the right clause is in the top 50); precision is typically low (similar-worded clauses cluster). Voyage’s domain-tuned variants pick up 2 to 5 points on identifier-heavy queries (ICD codes, statute numbers, ticker symbols). OpenAI text-embedding-3-large with clause-level chunking is the default if you avoid vendor lock-in; pair with a cross-encoder reranker and the gap closes. The contract review RAG guide covers the segmenter pattern.

Multilingual production: Cohere Embed v4. 100-plus languages, smallest per-language floor we have measured. English-trained embeddings (text-embedding-3, Voyage, Mixedbread, Stella) drop 10 to 20 points on lower-resource languages (Hindi, Arabic, Bengali, Swahili), and that drop never shows up in a global metric. Cohere holds quality across Spanish, French, German, Japanese, Mandarin, Arabic, Hindi, Portuguese. For mixed-language corpora (English code with Russian comments, Spanish-English insurance docs), pair Cohere or bge-m3 with late-interaction.

Long-form English prose: OpenAI text-embedding-3-large at 1024. Research papers, transcripts, books, narrative knowledge bases. Strong default, Matryoshka dimensions knob, broad coverage. 1024 is the sweet spot; 3072 buys 2 to 3 points of Recall@10 at 3x the storage.

Marketing copy, FAQs, product docs: any strong default at 512 dimensions. Uniform short paragraphs, lookup-shaped queries, Recall@3 sits near 0.95. Stella v5 or BGE bge-m3 at 512 matches OpenAI here for a fraction of the cost.

Self-host, on-prem, air-gapped, sensitive data: Mixedbread mxbai-embed-large-v2. Open weights, runs in your VPC on a single A10 or L4 GPU, quantization-friendly (scalar quantization cuts storage 4x with a sub-1 point recall hit; binary quantization cuts it 32x with a 2-to-5 point hit). The per-call cost moves from per-million-tokens to amortized infrastructure.

Route by document type before embedding, not after. A single embedding applied uniformly is the source of most embedding-eval failures we see in production.
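Routing before embedding can be as simple as a lookup table keyed by document type. The sketch below is a hypothetical configuration: the model labels mirror the recommendations above rather than exact provider API identifiers, and the dimension values are placeholders to be replaced by whatever your own sweep selects.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingRoute:
    model: str       # illustrative label, not necessarily the provider's API model ID
    dimensions: int  # placeholder; choose from your own dimension sweep

# Hypothetical routing table keyed by document type.
ROUTES = {
    "code": EmbeddingRoute("voyage-3-large (code variant)", 1024),
    "legal": EmbeddingRoute("voyage-3-large (domain variant)", 1024),
    "multilingual": EmbeddingRoute("cohere-embed-v4", 1024),
    "faq": EmbeddingRoute("bge-m3", 512),
}
DEFAULT_ROUTE = EmbeddingRoute("text-embedding-3-large", 1024)

def route_for(doc_type):
    """Pick the embedding route for a document type, falling back to the default."""
    return ROUTES.get(doc_type, DEFAULT_ROUTE)
```

Queries have to go through the same route as the documents they target, since vectors from different models are not comparable. In practice that means one index per route.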

The cost-latency-dimension tradeoff

Dimension count is a price-performance lever, not a fixed model property. OpenAI text-embedding-3-large supports a dimensions parameter that returns 256, 512, 1024, 1536, or 3072 from one inference. Matryoshka training in BGE bge-m3 and Nomic variants ships the same knob without an API parameter. The Recall@10 delta between 3072 and 1024 is often two points; between 1024 and 256 it can be six. Pair with a cross-encoder reranker and the reranker often recovers most of the loss.
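For Matryoshka-trained models, a shorter vector is the prefix of the full one, renormalized to unit length. The sketch below shows that client-side step in plain Python, assuming you already hold the full-dimension vector as a list of floats. It is an illustration of the idea, not a replacement for a provider's own `dimensions` parameter.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    prefix = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0:
        return prefix  # all-zero prefix: nothing to rescale
    return [x / norm for x in prefix]

def sweep_dimensions(vec, targets=(256, 512, 1024)):
    """Produce one truncated vector per target dimension from a single embedding."""
    return {d: truncate_and_normalize(vec, d) for d in targets if d <= len(vec)}
```

Renormalizing matters when the index uses dot-product similarity; without it, truncated vectors have norms below one and scores stop being comparable across documents. Truncation only preserves quality for models trained with a Matryoshka objective.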

| Candidate | Recall@10 | Recall@10 after rerank | $ / 1M tokens | p95 retrieval (ms) | Storage / 1M docs |
| --- | --- | --- | --- | --- | --- |
| text-embedding-3-large @ 3072 | 0.87 | 0.92 | $0.13 | 38 | 11.7 GB |
| text-embedding-3-large @ 1024 | 0.85 | 0.91 | $0.13 | 22 | 3.9 GB |
| text-embedding-3-large @ 256 | 0.79 | 0.88 | $0.13 | 14 | 1.0 GB |
| Cohere Embed v4 @ 1024 | 0.86 | 0.91 | $0.10 | 26 | 3.9 GB |
| Voyage voyage-3-large @ 1024 | 0.86 | 0.92 | $0.06 | 24 | 3.9 GB |
| BGE bge-m3 @ 1024 (self-host) | 0.84 | 0.91 | ~$0.01 | 26 | 3.9 GB |
| Mixedbread mxbai-embed-large-v2 (self-host) | 0.83 | 0.90 | ~$0.01 | 28 | 3.9 GB |
| Stella v5 (1.5B, self-host) | 0.82 | 0.90 | ~$0.005 | 24 | 3.9 GB |

Numbers are illustrative, not your benchmark. The point is the shape: once the table is filled in from your 500-pair set, the decision is rarely controversial. Three patterns recur:

  • Dimension reduction is mostly free with a reranker. Drop 3072 to 1024 and the reranker recovers the loss. Storage cuts 3x; p95 latency cuts roughly in half.
  • Self-hosted Mixedbread, Stella, or BGE is within one to two points of the API leaders at one-tenth the cost. Break-even is roughly two to five million embedded tokens per day. Past that, self-host wins.
  • Cost per correct answer is the metric that matters. Tag every gateway call with embedding model and read cost / correct_answer_count per route. Cheaper embeddings that lose one point of recall but cost one-tenth often win on this denominator.
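The cost-per-correct-answer metric from the last bullet is a grouped ratio over tagged calls. The sketch below assumes each logged call is a dict with `model`, `cost_usd`, and `correct` fields. Those field names are hypothetical; map them to whatever your gateway logs actually carry.

```python
from collections import defaultdict

def cost_per_correct_answer(calls):
    """calls: iterable of dicts with 'model', 'cost_usd', and 'correct' (bool).

    Returns model -> total cost divided by number of correct answers.
    A model with no correct answers gets float('inf').
    """
    cost = defaultdict(float)
    correct = defaultdict(int)
    for call in calls:
        cost[call["model"]] += call["cost_usd"]
        correct[call["model"]] += 1 if call["correct"] else 0
    return {
        model: (cost[model] / correct[model]) if correct[model] else float("inf")
        for model in cost
    }
```

Incorrect answers still add to the numerator, which is the point: a cheap model that answers wrong more often pays for it here.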

Production patterns: Matryoshka, quantization, caching

Three production patterns separate teams that ship recall from teams that pay for it.

Matryoshka embeddings: one model, many dimensions. Train once, serve at any prefix length. OpenAI text-embedding-3, BGE bge-m3, Nomic Embed v2 all ship Matryoshka-trained heads. The implication: do not pick a dimension; pick a model and sweep the dimension. The same 500-pair set runs at three targets in one job. Ship the dimension that maximizes Recall@10 after rerank divided by total monthly cost on your traffic.
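The selection rule in the last sentence is a one-line argmax once the sweep results exist. The sketch below assumes each result is a dict with `dimensions`, `recall_after_rerank`, and `monthly_cost_usd`. The `min_recall` floor is an added assumption rather than part of the rule above: a pure recall-per-dollar ratio tends to pick the cheapest option unless something bounds quality from below.

```python
def pick_dimension(results, min_recall=0.0):
    """Pick the sweep entry that maximizes recall-after-rerank per dollar.

    results: list of dicts with 'dimensions', 'recall_after_rerank', 'monthly_cost_usd'.
    min_recall: optional quality floor (an assumption added for this sketch).
    """
    eligible = [r for r in results if r["recall_after_rerank"] >= min_recall]
    if not eligible:
        raise ValueError("no candidate meets the recall floor")
    return max(eligible,
               key=lambda r: r["recall_after_rerank"] / r["monthly_cost_usd"])

# Recall values follow the illustrative table; monthly costs are made up.
sweep = [
    {"dimensions": 3072, "recall_after_rerank": 0.92, "monthly_cost_usd": 900.0},
    {"dimensions": 1024, "recall_after_rerank": 0.91, "monthly_cost_usd": 400.0},
    {"dimensions": 256, "recall_after_rerank": 0.88, "monthly_cost_usd": 250.0},
]
```

With these made-up costs, the unconstrained rule picks 256 dimensions, while a floor of 0.90 picks 1024. Which one is right depends on how much recall your worst stratum can afford to lose.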

Quantization: 4x to 32x storage savings, low recall cost. Binary quantization (one bit per dimension) cuts storage 32x with a 2-to-5 point Recall@10 hit on most corpora; scalar quantization (8 bits per dimension) cuts 4x with a sub-1 point hit. Mixedbread and BGE ship quantization-friendly weights. The order: evaluate at full precision, decide the model, then sweep quantization on the chosen model.
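The storage ratios follow directly from the bit widths: float32 uses 32 bits per dimension, scalar quantization 8, binary 1. The sketch below shows both schemes in plain Python for illustration. It assumes components lie roughly in [-1, 1], which holds for unit-normalized vectors; production systems would use the vector database's built-in quantization instead.

```python
def binary_quantize(vec):
    """1 bit per dimension: the sign of each component, packed big-endian into bytes."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    n_bytes = (len(vec) + 7) // 8
    pad = n_bytes * 8 - len(vec)  # left-align when the length is not a multiple of 8
    return (bits << pad).to_bytes(n_bytes, "big")

def hamming_distance(a, b):
    """Number of differing bits; the usual distance for binary-quantized vectors."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")

def scalar_quantize(vec, lo=-1.0, hi=1.0):
    """8 bits per dimension: map [lo, hi] linearly onto 0..255, clamping outliers."""
    scale = 255.0 / (hi - lo)
    return bytes(min(255, max(0, round((x - lo) * scale))) for x in vec)

def scalar_dequantize(quantized, lo=-1.0, hi=1.0):
    """Approximate inverse of scalar_quantize."""
    step = (hi - lo) / 255.0
    return [lo + b * step for b in quantized]
```

A 1024-dimension float32 vector takes 4,096 bytes. The scalar form takes 1,024 bytes (4x smaller) and the binary form 128 bytes (32x smaller).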

Semantic caching of embedding outputs. For workloads with query repetition (product search, support FAQs, agent loops that re-embed the same query mid-turn), an embedding cache cuts cost 30 to 60 percent and trims p50 latency to sub-5ms on hits. Two layers. Exact cache on (text_normalised, model, dimensions) keys handles deterministic lookups. Semantic cache on embedding-similarity-to-past-queries above a threshold (0.95 cosine for English question-answer corpora; lower for code) handles paraphrased queries. The Agent Command Center ships both as gateway primitives.
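The two layers can be sketched as one small class. This reads the description above as follows: the exact layer maps `(text_normalised, model, dimensions)` to an embedding, and the semantic layer maps a past query embedding to its cached retrieval results. That reading is an interpretation, and the class, its linear scan, and the `embed_fn` and `retrieve_fn` callables are hypothetical stand-ins rather than the gateway's actual implementation.

```python
import math

def normalise(text):
    return " ".join(text.lower().split())

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class TwoLayerCache:
    def __init__(self, embed_fn, retrieve_fn, model, dimensions, threshold=0.95):
        self.embed_fn = embed_fn        # text -> vector (the paid call)
        self.retrieve_fn = retrieve_fn  # vector -> retrieval results
        self.model = model
        self.dimensions = dimensions
        self.threshold = threshold
        self.exact = {}     # (normalised text, model, dimensions) -> vector
        self.semantic = []  # list of (vector, results)

    def embed(self, text):
        key = (normalise(text), self.model, self.dimensions)
        if key not in self.exact:
            self.exact[key] = self.embed_fn(text)
        return self.exact[key]

    def retrieve(self, text):
        vec = self.embed(text)
        # Linear scan for clarity; a real cache would use an ANN index here.
        for past_vec, results in self.semantic:
            if cosine(vec, past_vec) >= self.threshold:
                return results
        results = self.retrieve_fn(vec)
        self.semantic.append((vec, results))
        return results
```

Including the model and dimensions in the exact key means a model or dimension change can never serve stale vectors from the cache.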

How Future AGI ships embedding evaluation

Future AGI ships the eval stack as a package. Start with the SDK. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.

  • ai-evaluation SDK (Apache 2.0): six RAG-specific EvalTemplate classes (Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization) plus 50+ total; CustomLLMJudge for DomainRecallAtK, DimensionEfficiency, and MultilingualEmbedQuality rubrics; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
  • Future AGI Platform: self-improving evaluators tuned by thumbs up/down feedback; in-product authoring agent writes embedding-eval rubrics from natural-language descriptions; four distributed runners (Celery, Ray, Temporal, Kubernetes) collapse a six-candidate-by-three-dimensions sweep to minutes.
  • traceAI (Apache 2.0): auto-instrumentation across 50+ AI surfaces in Python, TypeScript, Java, and C# (OpenAI, Cohere, Voyage, plus the vector-DB stack: Pinecone, Qdrant, Milvus, Weaviate, pgvector). Every embed call emits a typed EMBEDDING span with embedding.model_name, embedding.dimensions, and embedding.input_length, so a model-version drift shows up against the same dashboard recall lands on.
  • Error Feed (inside the eval stack): HDBSCAN soft-clustering over ClickHouse-stored span embeddings; Sonnet 4.5 Judge writes the immediate_fix per cluster. Common clusters: “text-embedding-3-small drops 8 points on legal queries,” “Cohere wins multilingual but loses to Voyage on technical English,” “model-version drift moved recall 4 points on April 12.”
  • Agent Command Center: OpenAI-compatible gateway as a single Go binary (Apache 2.0). 100+ providers including OpenAI, Cohere, Voyage, Mixedbread, and self-hosted BGE, Stella, and Nomic Embed through one endpoint. 18+ built-in guardrail scanners plus 15 third-party adapters. Exact and semantic caching at the gateway layer. SOC 2 Type II, HIPAA, GDPR, CCPA certified; ISO/IEC 27001 in active audit.

Registering the traceAI instrumentor for the OpenAI client takes a few lines:

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="embedding-eval-prod",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

The 500-pair labeled set that gates CI also runs against live traffic on a daily schedule. Alarm on a 2-to-5 point Recall@10 drop in the rolling mean over a 30-to-90-minute window. CI catches regressions you can think of; production catches the silent provider-version drift you cannot.
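The alarm logic can be sketched as a fixed-size rolling window compared against a baseline. The class below is a hypothetical illustration that assumes one Recall@10 sample per minute, so a window of 60 covers an hour, and a drop threshold expressed as a fraction (0.03 for three points). It is not part of traceAI.

```python
from collections import deque

class RecallDriftAlarm:
    def __init__(self, baseline, drop_threshold=0.03, window=60):
        self.baseline = baseline              # Recall@10 measured at deploy time
        self.drop_threshold = drop_threshold  # e.g. 0.03 = three recall points
        self.samples = deque(maxlen=window)   # most recent `window` samples

    def observe(self, recall_value):
        """Record one sample; return True when the rolling mean has dropped too far."""
        self.samples.append(recall_value)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data to fill the window yet
        rolling_mean = sum(self.samples) / len(self.samples)
        return self.baseline - rolling_mean >= self.drop_threshold
```

Waiting for a full window before firing trades detection speed for fewer false alarms from a single noisy sample.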

Ready to run your first 500-pair embedding sweep? Wire ContextRelevance, ChunkAttribution, ChunkUtilization, and a DomainRecallAtK CustomLLMJudge into a pytest fixture this afternoon against the ai-evaluation SDK. Stratify by query length, language, and document type. Route six candidates through the Agent Command Center so cost and version headers come for free. Add the traceAI instrumentor when production traces start asking questions the CI gate missed.

Frequently asked questions

Why is MTEB Recall@10 a bad way to pick an embedding model in 2026?
MTEB averages 56 tasks across web text, scientific abstracts, and synthetic question-passage pairs that no production team actually runs. Two models within one point on the MTEB leaderboard regularly sit eight to twelve points apart on a 500-query domain set sampled from real traffic. The reason is distributional. Your queries are short keyword lookups, semi-structured tickets, multilingual fragments, or identifier-heavy code; the gold passages are clauses, paragraphs, function bodies, or call-note sections written in your house style. Nothing in the MTEB average tells you how a model behaves on that joint distribution. Use MTEB to drop the bottom half of the catalog. Pick the winner on your data.
How many labeled query-passage pairs do I need to evaluate an embedding model?
Five hundred is enough to separate winners from losers with 95% confidence on the rank order for most production corpora. Two hundred works if your domain is narrow (one document type, one language) and you accept noisier results. A thousand pairs sharpens dimension-and-quantization tradeoffs but rarely changes the model decision. The constraint is not volume. It is stratification: cover short and long queries, in-domain jargon, multilingual segments, and the long-tail patterns that your traffic logs already show. A 500-pair set built from real queries beats a 5000-pair synthetic set built from a tutorial template. Sample from production traces, label with a human reviewer for the supporting chunk, and version the set the way you would a fine-tune dataset.
How does Voyage compare to Cohere Embed v4, OpenAI text-embedding-3, Mixedbread, Stella, and BGE in 2026?
On most English RAG corpora the top six sit within three to five Recall@10 points of each other. Voyage voyage-3-large wins on code, finance, and legal where the domain variants are tuned hard. Cohere Embed v4 wins on multilingual production at scale; 100-plus languages with the smallest per-language floor. OpenAI text-embedding-3-large is the strongest default for English RAG with the Matryoshka dimensions knob, but pays a premium per million tokens. Mixedbread mxbai-embed-large-v2 is the best self-hostable English option with open weights and quantization-friendly behavior. Stella v5 (1.5B) is the strongest small open-weight model for cost-constrained production. BGE bge-m3 wins on hybrid retrieval (dense plus sparse plus ColBERT-style late interaction) where the same model serves multiple signals. The decision is workload shape, not absolute quality.
What metrics decide an embedding model evaluation?
Five numbers on the same 500-pair set, scored per stratum. Recall@10 (does the right passage land in the top ten?), MRR (how high in the ranking?), NDCG@10 (graded relevance, when you have graded labels), p95 retrieval latency at your target dimension and index, and cost per million tokens embedded at expected volume. Stratify by query length, language, and document type so the floor on your worst slice stays visible. A model that wins on average and loses 15 points on multilingual is not a winner. Pair retrieval scores with downstream Groundedness, ContextRelevance, ChunkAttribution, and ChunkUtilization so the embedding-to-answer chain stays connected; better retrieval that does not move generation quality means you are buying recall the generator is not using.
Should I use Matryoshka embeddings and quantization in production?
Yes, when measured. Matryoshka-style embeddings (OpenAI text-embedding-3 dimensions parameter, BGE bge-m3, Nomic Embed v2) let one model serve 256, 512, 1024, and 3072 dimensions from a single inference. The Recall@10 delta between 3072 and 1024 is often two points; between 1024 and 256 it can be six. Pair with a cross-encoder reranker and the reranker often recovers most of the loss, which makes lower dimensions effectively free on storage and latency. Scalar quantization cuts storage another 4x with a sub-1 point recall hit on most corpora; binary quantization cuts it 32x with a 2-to-5 point hit (BGE and Mixedbread support both natively). Measure both on your data; do not assume the published curves transfer. The right number is the one that maximizes Recall@10 after rerank divided by total monthly cost on your traffic shape.
How does Future AGI score embedding model quality in production?
A 500-pair labeled set runs in CI before deploy and as span-attached scorers on live traces after. The ai-evaluation SDK (Apache 2.0) ships Groundedness, ContextRelevance, ContextAdherence, ChunkAttribution, ChunkUtilization, and Completeness as named EvalTemplate classes; CustomLLMJudge covers domain-specific rubrics like DomainRecallAtK and MultilingualEmbedQuality. traceAI auto-instrumentation emits a typed EMBEDDING span with embedding.model_name, embedding.dimensions, and embedding.input_length per call, so a model-version drift shows up against the same dashboard recall lands on. Error Feed soft-clusters failing retrievals with HDBSCAN over ClickHouse-stored span embeddings; a Sonnet 4.5 Judge agent writes the immediate_fix per cluster. The Agent Command Center fronts OpenAI, Cohere, Voyage, Mixedbread, and self-hosted BGE or Nomic through one OpenAI-compatible endpoint so the candidate sweep does not special-case any provider.
Where does reranking fit into embedding evaluation?
Pick the embedding by Recall@50 with a wide candidate set, then add a reranker and pick the final cut by NDCG@10 and Recall@10 after rerank. A first-stage Recall@50 of 0.95 plus a strong cross-encoder beats a first-stage Recall@10 of 0.90 with no reranker on most corpora. The reranker is precision-only; it cannot rescue a chunk that never made the candidate list. The order that ships: the embedding decides what enters the candidate set, the reranker decides what survives, and the eval triangle (NDCG@k_post_rerank, recall delta pre to post, latency added by the rerank hop) tells you whether the pair earns its slot. For the reranker layer specifically, the [Cohere Rerank evaluation deep dive](/blog/evaluating-cohere-rerank-rag-2026/) covers the protocol; this post is the embedding-side decision that sets the recall floor.
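The recall side of that triangle can be computed per query from two ranked lists. The sketch below assumes one labeled chunk per query and that you have the first-stage ranking and the reranked ranking as lists of chunk IDs. The function and key names are illustrative.

```python
def rerank_report(first_stage_ids, reranked_ids, expected_id, wide_k=50, final_k=10):
    """Per-query recall before and after the rerank hop (1 = hit, 0 = miss)."""
    in_candidates = int(expected_id in first_stage_ids[:wide_k])
    pre = int(expected_id in first_stage_ids[:final_k])
    post = int(expected_id in reranked_ids[:final_k])
    return {
        "recall_at_wide_k": in_candidates,   # did the embedding surface it at all?
        "recall_at_final_k_pre": pre,        # top-k hit without the reranker
        "recall_at_final_k_post": post,      # top-k hit after the reranker
        "recall_delta": post - pre,          # what the rerank hop bought
    }
```

A query with `recall_at_wide_k` of 0 is an embedding failure regardless of the reranker. A query with `recall_at_wide_k` of 1 and `recall_at_final_k_post` of 0 is a reranker failure.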