Evaluating Embedding Models in 2026
MTEB Recall@10 does not transfer to your domain. 500 labeled query-passage pairs from your traffic decide which embedding wins. Without that, you are picking by leaderboard theater.
You shipped a RAG pipeline last quarter on OpenAI text-embedding-3-large because it topped the MTEB leaderboard. Recall@10 on your data is 0.74. Support agents tell you the bot returns the wrong warranty clause on enterprise tickets. The trace says the right chunk sits at rank 47 of 50: the supporting clause is in the corpus, but the embedding never moved it into the top ten. The team meeting blames the reranker. MTEB says the model is great. MTEB is true and useless.
The opinion this post earns: MTEB Recall@10 does not transfer. Your domain has its own retrieval distribution. Code, legal, medical, multilingual, semi-structured tickets all behave differently from the 56-task MTEB average. Two models within one point on the leaderboard regularly sit eight to twelve points apart on a 500-query set sampled from your real traffic. The cheapest reliability win in most RAG stacks is not a bigger model or a better reranker; it is a 500-pair labeled retrieval evaluation built from your traffic, sweeping six candidates at three dimension targets each through one gateway, with a per-stratum breakdown. Without that, you are picking by leaderboard theater.
This guide is the methodology: why MTEB lies for production retrieval, the 500-pair labeled-eval protocol, per-domain decisions across code, legal, medical, multilingual, and prose, the cost-latency-dimension tradeoff, production patterns (Matryoshka, quantization, caching), and the traceAI EMBEDDING-span instrumentation that keeps the verdict honest after deploy. For the broader RAG stack, the RAG evaluation metrics deep dive covers the foundational definitions; this post sets the recall floor every other RAG layer inherits.
TL;DR: which embedding wins for which domain
| Domain | Default winner | Why |
|---|---|---|
| English prose RAG | OpenAI text-embedding-3-large @ 1024 | Strongest default, Matryoshka dimensions, broad coverage |
| Multilingual production | Cohere Embed v4 | 100+ languages, smallest per-language floor |
| Code and code-mixed | Voyage voyage-3-large (code variant) | Domain-tuned on identifiers and call graphs |
| Legal and finance | Voyage voyage-3-large (domain variants) | Fine-tuned on contracts and filings |
| Self-host, on-prem, air-gapped | Mixedbread mxbai-embed-large-v2 | Open weights, quantization-friendly, single-GPU |
| Cost-constrained open-weight | Stella v5 (1.5B) | Strongest small model, OSS, cheap inference |
| Hybrid retrieval (dense + sparse + late-interaction) | BGE bge-m3 | One model serves three signals |
Two non-negotiables across every embedding decision. First, build a 500-pair labeled retrieval set from your production traffic before you pick: MTEB filters; your data decides. Second, stratify by query length, language, and document type: a scalar average hides the cohort that fails real customers.
Why MTEB lies for production retrieval
MTEB is a useful competence filter. As a verdict on which embedding wins on your data, it is consistently wrong. Three reasons.
Distributional mismatch. MTEB aggregates 56 tasks across web text, scientific abstracts, BEIR-style question-passage pairs, and a long tail of clustering and classification benchmarks. The joint distribution of (query shape, passage shape, language) in the average has approximately zero overlap with most production corpora. A model that wins news classification can lose warranty-clause retrieval where queries are six-token enterprise-tier fragments and passages are nested legal subsections. On per-domain Recall@10, the ordering of top-five embeddings flips on at least one workload in every controlled study we have seen.
Benchmark contamination. Several frontier embeddings train on data that overlaps with MTEB tasks. The leaderboard climbs without production recall climbing with it. Voyage’s technical notes flag this; Anthropic and Cohere have published variants of the same caveat. Treat MTEB as a competence floor, not a ranking.
Frozen tasks, moving world. MTEB’s task mix was assembled in 2022-2023. The 2026 retrieval workload is shorter queries, more multilingual fragments, code-mixed documents, and agent loops that embed and retrieve multiple times per turn. The average underweights every shift.
The fix is not to argue with MTEB. Use it to drop the bottom half of the catalog (anything below 60 on MTEB English is rarely worth the eval slot), then run a 500-pair labeled retrieval evaluation on your data. The candidate that wins the average can still lose the strata that matter. The candidate that loses the average can win your workload by ten points. Both are invisible without your own labels.
The 500-pair labeled-eval methodology
A 500-pair labeled query-passage set built from production traces beats a 5000-pair synthetic set every time. At 500 pairs, the 95% confidence interval on a Recall@10 near 0.85 is roughly three points either side, which is enough to fix the rank order when candidates sit eight to twelve points apart. The shape is what makes it useful.
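To see what 500 pairs can and cannot resolve, a minimal sketch of a Wilson score interval on Recall@10 helps. The function name and the 425-of-500 example are illustrative assumptions, not figures from any benchmark, and the interval treats each candidate independently; a paired comparison on the same queries would be tighter.

```python
import math

def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion such as Recall@10 (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical candidate: 425 of 500 labeled queries hit in the top ten.
low, high = wilson_interval(hits=425, n=500)
print(f"Recall@10 = {425 / 500:.2f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

The interval is about three points wide on each side, so a ten-point gap between two candidates is unambiguous while a one-point gap is noise at this sample size.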
Sample from real traffic. Pull 500 queries from production logs across four strata: short keyword (under five tokens), long natural language (more than 15 tokens), domain jargon (SKUs, ICD codes, statute references, identifiers), and multilingual fragments if you serve more than one language. The RAG observability workflow makes this routine; an afternoon of trace export, deduplication, and stratification builds the seed set.
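The stratification step can be sketched in a few lines. Everything here is an assumption for illustration: the jargon regex is a placeholder for your own SKU, ICD, or statute patterns, whitespace tokens stand in for a real tokenizer, equal allocation across strata is one possible choice, and the extra `medium` bucket exists only so that queries of five to fifteen tokens are not dropped silently.

```python
import random
import re
from collections import defaultdict

# Placeholder patterns: "SKU-4471"-style identifiers and ICD-10-style codes such as "E11.9".
JARGON = re.compile(r"[A-Z]{2,}-\d+|\b[A-Z]\d{2}\.\d+\b")

def assign_stratum(query: str, language: str = "en") -> str:
    if language != "en":
        return "multilingual"
    if JARGON.search(query):
        return "jargon"
    n_tokens = len(query.split())  # whitespace tokens as a rough proxy
    if n_tokens < 5:
        return "short_keyword"
    if n_tokens > 15:
        return "long_natural"
    return "medium"  # not one of the four named strata

def stratified_sample(queries, total=500, seed=0):
    """queries: iterable of (text, language). Returns {stratum: [(text, language), ...]}."""
    rng = random.Random(seed)
    buckets, seen = defaultdict(list), set()
    for text, language in queries:
        key = " ".join(text.lower().split())  # cheap deduplication key
        if key in seen:
            continue
        seen.add(key)
        buckets[assign_stratum(text, language)].append((text, language))
    if not buckets:
        return {}
    per_bucket = total // len(buckets)  # equal allocation across observed strata
    sample = {}
    for name, items in buckets.items():
        rng.shuffle(items)
        sample[name] = items[:per_bucket]
    return sample
```

If one stratum dominates real traffic, allocating proportionally with a per-stratum minimum may be a better fit than equal allocation.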
Label with a human reviewer for the supporting chunk. Each query gets a chunk ID that contains the supporting span. Span-level labels matter: a chunk counts as a hit only if it carries the answer text, never the topic alone. Synthetic labeling with an LLM is fine as a bootstrap; the production rubric needs human ground truth for the cohort that ships.
Score Recall@10, MRR, NDCG@10, p95 latency, and cost per million tokens. Five numbers, per candidate, per stratum.
```python
from fi.evals import Evaluator
from fi.evals.templates import (
    ContextRelevance, ChunkAttribution, ChunkUtilization,
    Groundedness, ContextAdherence, Completeness,
    CustomLLMJudge,
)
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

# Custom rubric: did the labeled chunk land in the top ten?
domain_recall = CustomLLMJudge(
    name="domain_recall_at_k",
    rubric="""
    Given (query, expected_chunk_id, retrieved_chunks_top_10),
    return 1.0 if expected_chunk_id is in retrieved_chunks_top_10,
    else 0.0. Also flag the rank of the expected chunk if present.
    """,
    judge_model="gpt-4.1",
)

def score_embedding_candidate(golden_set, embed_fn, retrieve_fn):
    rows = []
    for ex in golden_set:
        # Retrieve with the candidate embedding, then score one test case.
        retrieved = retrieve_fn(ex.query, k=10, embed=embed_fn)
        tc = TestCase(
            query=ex.query,
            expected_chunk_id=ex.expected_chunk_id,
            retrieved_chunks=[r.id for r in retrieved],
            context="\n\n".join(r.text for r in retrieved),
            stratum=ex.stratum,
            language=ex.language,
        )
        result = evaluator.evaluate(
            eval_templates=[
                ContextRelevance(), ChunkAttribution(),
                ChunkUtilization(), Groundedness(),
                domain_recall,
            ],
            inputs=[tc],
        )
        # Keep stratum and language so results can be broken down later.
        rows.append((ex.stratum, ex.language, result))
    return rows
```
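Recall@10, MRR, and NDCG@10 are deterministic given a ranked list of chunk IDs and a labeled chunk ID, so they can also be computed without a judge model. A minimal sketch, assuming exactly one labeled supporting chunk per query and binary relevance; the function names are illustrative and are not part of the ai-evaluation SDK.

```python
import math

def recall_at_k(ranked_ids, relevant_id, k=10):
    """1.0 if the labeled chunk appears in the top k, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the labeled chunk, or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_id, k=10):
    """With one relevant chunk and binary gain, the ideal DCG is 1,
    so NDCG reduces to 1 / log2(rank + 1) inside the top k."""
    for rank, chunk_id in enumerate(ranked_ids[:k], start=1):
        if chunk_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

ranked = ["c7", "c2", "c9"]  # hypothetical retrieval result
print(recall_at_k(ranked, "c2"), reciprocal_rank(ranked, "c2"), ndcg_at_k(ranked, "c2"))
```

MRR is the mean of `reciprocal_rank` over the query set; average each metric per stratum rather than only globally.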
Wire downstream rubrics so retrieval-to-answer stays connected. Score Groundedness, ContextRelevance, ChunkAttribution, and ChunkUtilization on the generated answer with the retrieved context. A 5-point recall gain that does not move answer quality means the generator is ignoring the recall. The chunk-attribution deep dive covers the rubric definitions.
Decide per stratum, not on the average. A model that wins on average and loses 15 points on multilingual is not a winner if multilingual is 20% of your traffic. Read the per-language, per-query-length, per-document-type breakdown. The leaderboard you build internally is the one that ships the decision.
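The gap between an average and a per-stratum reading is easy to show with made-up numbers. In this sketch the rows are fabricated for illustration only (400 English queries at 0.90 recall, 100 multilingual queries at 0.70), and `recall_by_stratum` is a hypothetical helper.

```python
from collections import defaultdict

def recall_by_stratum(rows):
    """rows: iterable of (stratum, hit) with hit equal to 1.0 or 0.0."""
    totals = defaultdict(lambda: [0.0, 0])
    for stratum, hit in rows:
        totals[stratum][0] += hit
        totals[stratum][1] += 1
    return {stratum: hits / n for stratum, (hits, n) in totals.items()}

# Illustrative data, not a measurement.
rows = (
    [("english", 1.0)] * 360 + [("english", 0.0)] * 40
    + [("multilingual", 1.0)] * 70 + [("multilingual", 0.0)] * 30
)
overall = sum(hit for _, hit in rows) / len(rows)
breakdown = recall_by_stratum(rows)
print(f"overall={overall:.2f}", {k: round(v, 2) for k, v in breakdown.items()})
```

The overall figure looks healthy while the multilingual stratum, a fifth of the traffic in this example, sits twenty points lower.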
Per-domain decisions: where each embedding wins
The same six candidates behave differently across document types. The per-domain breakdown is the methodology’s main output.
Code and code-mixed: Voyage voyage-3-large (code variant) or BGE bge-m3 with late-interaction. Function names, variable identifiers, and call-graph references decide the answer. Pooled embeddings flatten the identifier signal. Voyage’s code-tuned variants lift Recall@10 by 4 to 8 points over OpenAI text-embedding-3-large on most code corpora we have measured. BGE bge-m3 with the late-interaction (ColBERT-style) head can match or beat it at 4x to 10x storage; the chunking-strategies eval covers the late-interaction case.
Legal, finance, medical: Voyage domain variants or OpenAI with clause-level chunking. Contracts, regulations, clinical notes, filings. Recall is typically high (the right clause is in the top 50); precision is typically low (similar-worded clauses cluster). Voyage’s domain-tuned variants pick up 2 to 5 points on identifier-heavy queries (ICD codes, statute numbers, ticker symbols). OpenAI text-embedding-3-large with clause-level chunking is the default if you avoid vendor lock-in; pair with a cross-encoder reranker and the gap closes. The contract review RAG guide covers the segmenter pattern.
Multilingual production: Cohere Embed v4. 100-plus languages, smallest per-language floor we have measured. English-trained embeddings (text-embedding-3, Voyage, Mixedbread, Stella) drop 10 to 20 points on lower-resource languages (Hindi, Arabic, Bengali, Swahili), and that drop never shows up in a global metric. Cohere holds quality across Spanish, French, German, Japanese, Mandarin, Arabic, Hindi, Portuguese. For mixed-language corpora (English code with Russian comments, Spanish-English insurance docs), pair Cohere or bge-m3 with late-interaction.
Long-form English prose: OpenAI text-embedding-3-large at 1024. Research papers, transcripts, books, narrative knowledge bases. Strong default, Matryoshka dimensions knob, broad coverage. 1024 is the sweet spot; 3072 buys 2 to 3 points of Recall@10 at 3x the storage.
Marketing copy, FAQs, product docs: any strong default at 512 dimensions. Uniform short paragraphs, lookup-shaped queries, Recall@3 sits near 0.95. Stella v5 or BGE bge-m3 at 512 matches OpenAI here for a fraction of the cost.
Self-host, on-prem, air-gapped, sensitive data: Mixedbread mxbai-embed-large-v2. Open weights, runs in your VPC on a single A10 or L4 GPU, quantization-friendly (binary and scalar quantization cut storage 4x to 32x with a one-to-three point recall hit). The per-call cost moves from per-million-tokens to amortized infrastructure.
Route by document type before embedding, not after. A single embedding applied uniformly is the source of most embedding-eval failures we see in production.
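Routing before embedding can be as simple as a lookup table consulted at both index time and query time. A minimal sketch: the route keys, the `EmbeddingRoute` type, and the dimension choices are assumptions for illustration, and the model strings are the display names used in this post rather than provider API identifiers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingRoute:
    model: str        # display name from this post, not an API model ID
    dimensions: int

# Hypothetical routing table; fill it in from your own per-stratum results.
ROUTES = {
    "code": EmbeddingRoute("Voyage voyage-3-large (code variant)", 1024),
    "legal": EmbeddingRoute("Voyage voyage-3-large (domain variant)", 1024),
    "multilingual": EmbeddingRoute("Cohere Embed v4", 1024),
    "prose": EmbeddingRoute("OpenAI text-embedding-3-large", 1024),
    "faq": EmbeddingRoute("BGE bge-m3", 512),
}
DEFAULT_ROUTE = ROUTES["prose"]

def route_for(doc_type: str) -> EmbeddingRoute:
    """Pick the embedding route for a document type, falling back to the default."""
    return ROUTES.get(doc_type, DEFAULT_ROUTE)

print(route_for("code"))
print(route_for("unknown-type"))
```

A query must be embedded with the same model and dimension as the index it searches, so the router has to classify queries as well as documents, or fan out across indexes.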
The cost-latency-dimension tradeoff
Dimension count is a price-performance lever, not a fixed model property. OpenAI text-embedding-3-large supports a dimensions parameter that returns 256, 512, 1024, 1536, or 3072 from one inference. Matryoshka training in BGE bge-m3 and Nomic variants ships the same knob without an API parameter. The Recall@10 delta between 3072 and 1024 is often two points; between 1024 and 256 it can be six. Pair with a cross-encoder reranker and the reranker often recovers most of the loss.
| Candidate | Recall@10 | Recall@10 after rerank | $ / 1M tokens | p95 retrieval (ms) | Storage / 1M docs |
|---|---|---|---|---|---|
| text-embedding-3-large @ 3072 | 0.87 | 0.92 | $0.13 | 38 | 11.7 GB |
| text-embedding-3-large @ 1024 | 0.85 | 0.91 | $0.13 | 22 | 3.9 GB |
| text-embedding-3-large @ 256 | 0.79 | 0.88 | $0.13 | 14 | 1.0 GB |
| Cohere Embed v4 @ 1024 | 0.86 | 0.91 | $0.10 | 26 | 3.9 GB |
| Voyage voyage-3-large @ 1024 | 0.86 | 0.92 | $0.06 | 24 | 3.9 GB |
| BGE bge-m3 @ 1024 (self-host) | 0.84 | 0.91 | ~$0.01 | 26 | 3.9 GB |
| Mixedbread mxbai-embed-large-v2 (self-host) | 0.83 | 0.90 | ~$0.01 | 28 | 3.9 GB |
| Stella v5 (1.5B, self-host) | 0.82 | 0.90 | ~$0.005 | 24 | 3.9 GB |
Numbers are illustrative, not your benchmark. The point is the shape: once it is filled in on your 500-pair set, the decision is rarely controversial. Three patterns recur:
- Dimension reduction is mostly free with a reranker. Drop 3072 to 1024 and the reranker recovers the loss. Storage shrinks 3x; p95 latency falls by roughly half.
- Self-hosted Mixedbread, Stella, or BGE is within one to two points of the API leaders at one-tenth the cost. Break-even is roughly two to five million embedded tokens per day. Past that, self-host wins.
- Cost per correct answer is the metric that matters. Tag every gateway call with embedding model and read `cost / correct_answer_count` per route. Cheaper embeddings that lose one point of recall but cost one-tenth often win on this denominator.
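The cost-per-correct-answer comparison from the last pattern can be worked through in a few lines. The spend and answer counts below are invented for illustration and cover embedding spend only; the function name is hypothetical.

```python
def cost_per_correct_answer(total_cost_usd: float, correct_answers: int) -> float:
    """Spend divided by correct answers; infinite if nothing was correct."""
    if correct_answers <= 0:
        return float("inf")
    return total_cost_usd / correct_answers

# Illustrative month on one route, 10,000 answered queries per candidate.
api_model = cost_per_correct_answer(total_cost_usd=13.00, correct_answers=9_100)
self_hosted = cost_per_correct_answer(total_cost_usd=1.30, correct_answers=9_000)
print(f"API model:   ${api_model:.6f} per correct answer")
print(f"Self-hosted: ${self_hosted:.6f} per correct answer")
```

In this made-up case the cheaper candidate gives up one point of correctness and still costs about a tenth as much per correct answer.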
Production patterns: Matryoshka, quantization, caching
Three production patterns separate teams that ship recall from teams that pay for it.
Matryoshka embeddings: one model, many dimensions. Train once, serve at any prefix length. OpenAI text-embedding-3, BGE bge-m3, Nomic Embed v2 all ship Matryoshka-trained heads. The implication: do not pick a dimension; pick a model and sweep the dimension. The same 500-pair set runs at three targets in one job. Ship the dimension that maximizes Recall@10 after rerank divided by total monthly cost on your traffic.
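Serving a Matryoshka-trained embedding at a shorter length amounts to keeping a prefix of the vector and re-normalizing it. A minimal sketch in plain Python, assuming the model was trained with a Matryoshka objective (truncating an ordinary embedding this way generally degrades it); `truncate_and_normalize` is an illustrative name, and where a provider exposes a dimension parameter, that is usually the simpler route.

```python
import math

def truncate_and_normalize(vector, dims):
    """Keep the first `dims` components and rescale to unit L2 norm."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    prefix = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # nothing to rescale
    return [x / norm for x in prefix]

full = [3.0, 4.0, 12.0]            # toy 3-dimensional "embedding"
short = truncate_and_normalize(full, 2)
print(short)
```

Running the same labeled set over several values of `dims` from one stored full-length vector is what makes the dimension sweep cheap.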
Quantization: 4x to 32x storage savings, low recall cost. Binary quantization (one bit per dimension) cuts storage 32x with a 2-to-5 point Recall@10 hit on most corpora; scalar quantization (8 bits per dimension) cuts 4x with a sub-1 point hit. Mixedbread and BGE ship quantization-friendly weights. The order: evaluate at full precision, decide the model, then sweep quantization on the chosen model.
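The 4x and 32x figures follow directly from bits per dimension relative to float32. The sketch below shows one simple way to do each quantization in plain Python; the fixed `[-1, 1]` range for the scalar case and the sign threshold for the binary case are assumptions, and production systems typically calibrate ranges per dimension and rescore candidates at higher precision.

```python
def binary_quantize(vector) -> bytes:
    """One bit per dimension: 1 if the component is positive, else 0."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits.to_bytes((len(vector) + 7) // 8, "big")

def scalar_quantize(vector, lo=-1.0, hi=1.0) -> bytes:
    """Eight bits per dimension: map [lo, hi] linearly onto 0..255, clamping outliers."""
    scale = 255.0 / (hi - lo)
    return bytes(min(255, max(0, round((x - lo) * scale))) for x in vector)

def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits; the usual distance for binary codes."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")

def storage_bytes(n_docs: int, dims: int, bits_per_dim: int) -> int:
    """Raw vector storage only, excluding index overhead."""
    return n_docs * dims * bits_per_dim // 8

float32 = storage_bytes(1_000_000, 1024, 32)
print("scalar saves", float32 // storage_bytes(1_000_000, 1024, 8), "x")
print("binary saves", float32 // storage_bytes(1_000_000, 1024, 1), "x")
```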
Semantic caching of embedding outputs. For workloads with query repetition (product search, support FAQs, agent loops that re-embed the same query mid-turn), an embedding cache cuts cost 30 to 60 percent and trims p50 latency to sub-5ms on hits. Two layers. Exact cache on (text_normalised, model, dimensions) keys handles deterministic lookups. Semantic cache on embedding-similarity-to-past-queries above a threshold (0.95 cosine for English question-answer corpora; lower for code) handles paraphrased queries. The Agent Command Center ships both as gateway primitives.
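A rough sketch of the two layers follows. It reflects one reading of the pattern: the exact layer skips the embed call entirely, while the semantic layer needs the query vector and therefore reuses a downstream result (for example, retrieved chunk IDs) from a sufficiently similar past query. The class, its methods, and the linear scan are illustrative assumptions and not the Agent Command Center API.

```python
import math

def normalise(text: str) -> str:
    return " ".join(text.lower().split())

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class TwoLayerCache:
    def __init__(self, embed_fn, model, dimensions, threshold=0.95):
        self.embed_fn, self.model, self.dimensions = embed_fn, model, dimensions
        self.threshold = threshold
        self._exact = {}      # (text_normalised, model, dimensions) -> vector
        self._semantic = []   # (vector, cached downstream result)

    def embed(self, text):
        """Exact layer: identical normalised text never hits the provider twice."""
        key = (normalise(text), self.model, self.dimensions)
        if key not in self._exact:
            self._exact[key] = self.embed_fn(text)
        return self._exact[key]

    def lookup(self, vector):
        """Semantic layer: best past result at or above the cosine threshold, else None."""
        best, best_sim = None, self.threshold
        for past, result in self._semantic:
            sim = cosine(vector, past)
            if sim >= best_sim:
                best, best_sim = result, sim
        return best

    def store(self, vector, result):
        self._semantic.append((vector, result))

calls = []
def fake_embed(text):            # stand-in for a real provider call
    calls.append(text)
    return [1.0, 0.0]

cache = TwoLayerCache(fake_embed, model="demo-model", dimensions=2)
cache.embed("Warranty  Clause")
cache.embed("warranty clause")   # exact hit after normalisation
cache.store([1.0, 0.01], ["chunk-12"])
print(len(calls), cache.lookup([1.0, 0.0]), cache.lookup([0.0, 1.0]))
```

A linear scan is fine for a sketch; at production volume the semantic layer would sit on an approximate nearest-neighbor index.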
How Future AGI ships embedding evaluation
Future AGI ships the eval stack as a package. Start with the SDK. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.
- ai-evaluation SDK (Apache 2.0): six RAG-specific `EvalTemplate` classes (`Groundedness`, `ContextAdherence`, `ContextRelevance`, `Completeness`, `ChunkAttribution`, `ChunkUtilization`) plus 50+ total; `CustomLLMJudge` for `DomainRecallAtK`, `DimensionEfficiency`, and `MultilingualEmbedQuality` rubrics; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- Future AGI Platform: self-improving evaluators tuned by thumbs up/down feedback; in-product authoring agent writes embedding-eval rubrics from natural-language descriptions; four distributed runners (Celery, Ray, Temporal, Kubernetes) collapse a six-candidate-by-three-dimensions sweep to minutes.
- traceAI (Apache 2.0): auto-instrumentation across 50+ AI surfaces in Python, TypeScript, Java, and C# (OpenAI, Cohere, Voyage, plus the vector-DB stack: Pinecone, Qdrant, Milvus, Weaviate, pgvector). Every embed call emits a typed `EMBEDDING` span with `embedding.model_name`, `embedding.dimensions`, and `embedding.input_length`, so a model-version drift shows up against the same dashboard recall lands on.
- Error Feed (inside the eval stack): HDBSCAN soft-clustering over ClickHouse-stored span embeddings; Sonnet 4.5 Judge writes the `immediate_fix` per cluster. Common clusters: “text-embedding-3-small drops 8 points on legal queries,” “Cohere wins multilingual but loses to Voyage on technical English,” “model-version drift moved recall 4 points on April 12.”
- Agent Command Center: OpenAI-compatible gateway as a single Go binary (Apache 2.0). 100+ providers including OpenAI, Cohere, Voyage, Mixedbread, and self-hosted BGE, Stella, and Nomic Embed through one endpoint. 18+ built-in guardrail scanners plus 15 third-party adapters. Exact and semantic caching at the gateway layer. SOC 2 Type II, HIPAA, GDPR, CCPA certified; ISO/IEC 27001 in active audit.
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

# Register a tracer provider for the production project.
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="embedding-eval-prod",
)

# Instrument OpenAI client calls so embed requests emit spans.
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
```
The same 500-pair labeled set that gates CI runs against live traffic on a daily schedule. Alarm on a 2-to-5 point Recall@10 drop in rolling-mean over 30 to 90 minutes. CI catches regressions you can think of; production catches the silent provider-version drift you cannot.
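The rolling-mean alarm can be sketched as a small time-windowed monitor. This is an illustration under assumptions, not a traceAI feature: the class name, the 60-minute default window, the 3-point default drop, and the minimum sample count are all placeholders to tune against your own baseline.

```python
from collections import deque

class RecallDriftAlarm:
    """Fires when rolling Recall@10 falls a set margin below the baseline."""

    def __init__(self, baseline, drop=0.03, window_seconds=3600, min_samples=30):
        self.baseline = baseline
        self.drop = drop
        self.window_seconds = window_seconds
        self.min_samples = min_samples
        self.samples = deque()  # (timestamp_seconds, hit) in arrival order

    def observe(self, timestamp, hit) -> bool:
        """Record one scored query (hit is 1.0 or 0.0); return True if the alarm fires."""
        self.samples.append((timestamp, hit))
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()  # evict samples older than the window
        if len(self.samples) < self.min_samples:
            return False  # too few samples for a stable mean
        rolling = sum(h for _, h in self.samples) / len(self.samples)
        return self.baseline - rolling >= self.drop

# Simulated stream: 100 scored queries, 10 seconds apart, 80% hits.
alarm = RecallDriftAlarm(baseline=0.85)
fired = False
for i in range(100):
    fired = alarm.observe(timestamp=i * 10, hit=0.0 if i % 5 == 0 else 1.0)
print("alarm fired:", fired)
```

The minimum sample count keeps a quiet period with a handful of misses from paging anyone.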
Ready to run your first 500-pair embedding sweep? Wire ContextRelevance, ChunkAttribution, ChunkUtilization, and a DomainRecallAtK CustomLLMJudge into a pytest fixture this afternoon against the ai-evaluation SDK. Stratify by query length, language, and document type. Route six candidates through the Agent Command Center so cost and version headers come for free. Add the traceAI instrumentor when production traces start asking questions the CI gate missed.