What Is MTEB?

A benchmark suite covering 8 task families and 50+ datasets used to rank text embedding models for RAG, search, and clustering.

MTEB (Massive Text Embedding Benchmark) is a benchmark suite for text embedding models, designed to evaluate them across the full range of tasks embeddings actually serve in production. It covers eight task families — classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), summarization, and bitext mining — across more than 50 datasets and 100+ languages. Each model is scored per task and ranked on a public leaderboard by mean score. Released by Hugging Face in 2022, MTEB became the standard filter teams use when picking embedding models for retrieval-augmented generation, semantic search, or clustering pipelines.
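
For orientation, the scoring itself is run with the open-source mteb Python package, which works with any SentenceTransformer-compatible model. A minimal sketch, assuming a recent mteb version (the model name is illustrative, and API details can vary between releases):

import mteb
from sentence_transformers import SentenceTransformer

# Any SentenceTransformer-compatible embedding model works; the name is illustrative.
model = SentenceTransformer("intfloat/e5-base-v2")

# Restrict the run to the retrieval task family rather than the full 50+ dataset suite.
# In practice you would subset further, since running every retrieval task is slow.
tasks = mteb.get_tasks(task_types=["Retrieval"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/e5-base-v2")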

Why It Matters in Production LLM and Agent Systems

Embeddings are a quiet, expensive component of every RAG system. The chosen embedding model determines what gets retrieved, which determines what the LLM sees, which determines whether the answer is grounded or hallucinated. A wrong embedding choice does not error — it silently retrieves slightly-off chunks, the LLM produces fluent but incorrect answers, and the team blames the LLM.

The pain shows up across roles. ML engineers swap an embedding model based on MTEB rank and find their RAG retrieval quality drops, because the new model was strong on STS but weak on long-document retrieval. Platform engineers see embedding inference cost explode because a higher-MTEB model is also 5x larger and re-indexing 10M documents now requires a planning meeting. Product owners ship a “better RAG” only to receive bug reports about citations that don’t match the answer.

By 2026 the MTEB leaderboard is densely populated with strong models: open-weight families such as BGE, E5, Stella, Jina, and GTE compete with proprietary entries from Voyage and Cohere, and dozens of variants crowd the top 30. The headline mean score is no longer a clean discriminator; pick based on the per-task-family scores that match your use case. For RAG specifically, retrieval and STS scores matter more than classification or clustering, and the gap between MTEB rank and domain retrieval performance can be wide.

How FutureAGI Handles MTEB

FutureAGI does not maintain an MTEB leaderboard; Hugging Face's public leaderboard remains the canonical reference. FutureAGI instead bridges MTEB-rank credibility and domain-specific retrieval performance, the metric that actually determines RAG quality.

The workflow has two parts. First, filter via MTEB: pick three to five candidate embedding models from the leaderboard's retrieval task family, optionally filtered by language, dimension, or open vs. closed weight. Second, validate domain-specifically: load your domain query-document pairs into a Dataset, run fi.evals.ContextRelevance and fi.evals.ContextRecall on the retrieval output of each candidate, and compare the per-cohort scores. The candidate that wins on your domain is often not the candidate ranked highest on MTEB.
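
A minimal sketch of that second step, assuming a hypothetical retrieve_with() helper that returns each candidate's top-k chunks and a domain_queries list of labelled queries; the evaluate() signature follows the example under "How to Measure or Detect It" below:

from statistics import mean
from fi.evals import ContextRelevance

ctx_rel = ContextRelevance()

# Candidates short-listed from the MTEB retrieval leaderboard (names illustrative).
candidates = ["candidate-a", "candidate-b", "candidate-c"]

scores = {}
for model_id in candidates:
    per_query = []
    for query in domain_queries:  # your labelled domain queries (assumed defined)
        chunks = retrieve_with(model_id, query)  # hypothetical retrieval helper
        per_query.append(ctx_rel.evaluate(input=query, context=chunks).score)
    scores[model_id] = mean(per_query)

# ContextRecall runs the same way; the ranking here can differ from MTEB's.
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))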

fi.evals.EmbeddingSimilarity provides direct semantic-similarity scoring when you need to evaluate the embedding output itself rather than retrieval downstream. The KnowledgeBase SDK handles the indexing side: swap embedding models, re-index, and run the same evaluator suite to compare. Compared to running MTEB and a separate retrieval eval in two different stacks, FutureAGI keeps the comparison in one dataset, with the model identifier and embedding identifier on every row. Concretely: a team picking an embedding for a legal RAG system filters MTEB to the top five retrieval models, runs ContextRelevance and ContextRecall against 2,000 lawyer-labelled queries, and picks the model that wins on legal-cohort performance, even if it ranks fourth on MTEB overall.
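
When the embedding output itself is the thing under test, a sketch with EmbeddingSimilarity looks like the following; the parameter names are an assumption by analogy with the retrieval evaluators, so check the SDK reference for the exact signature:

from fi.evals import EmbeddingSimilarity

emb_sim = EmbeddingSimilarity()

# Parameter names assumed by analogy with ContextRelevance.evaluate().
result = emb_sim.evaluate(
    input="What was the 2024 ruling on Section 230?",
    output="The 2024 decision left the Section 230 safe harbor intact.",
)
print(result.score)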

How to Measure or Detect It

MTEB and complementary domain evaluation surface a small set of practical signals:

  • MTEB retrieval task score: the most predictive MTEB sub-score for RAG; weight it over the global mean.
  • MTEB STS score: the second-most useful sub-score for RAG, especially for paraphrase-heavy queries.
  • fi.evals.ContextRelevance: 0–1 score for whether retrieved chunks are relevant to the query — the canonical RAG retrieval evaluator.
  • fi.evals.ContextRecall: 0–1 score for retrieval completeness; pairs with ContextRelevance.
  • fi.evals.EmbeddingSimilarity: direct semantic similarity between texts — useful for re-ranker evaluation.
  • Embedding inference cost per million tokens: the cost axis MTEB does not measure — re-indexing cost dominates in large-corpus deployments.
  • Embedding dimension: 1024 vs. 4096 affects vector store size and query speed; track alongside quality (see the sizing sketch after the code below).

Minimal Python:

from fi.evals import ContextRelevance, ContextRecall

# The two canonical RAG retrieval evaluators.
ctx_rel = ContextRelevance()
ctx_recall = ContextRecall()

# retrieved_chunks: the list of chunks your retriever returned for this query.
result = ctx_rel.evaluate(
    input="What was the 2024 ruling on Section 230?",
    context=retrieved_chunks,
)
print(result.score, result.reason)

# ContextRecall runs the same way against the same retrieval output
# (parameter names assumed analogous).
recall = ctx_recall.evaluate(
    input="What was the 2024 ruling on Section 230?",
    context=retrieved_chunks,
)
print(recall.score, recall.reason)
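
The dimension and cost signals above reduce to simple arithmetic. A back-of-envelope sizing sketch for raw float32 vectors, ignoring index overhead:

BYTES_PER_FLOAT32 = 4
NUM_DOCS = 10_000_000  # the 10M-document corpus from the example above

for dim in (1024, 4096):
    gb = NUM_DOCS * dim * BYTES_PER_FLOAT32 / 1e9
    print(f"{dim}-dim: ~{gb:.0f} GB of raw vectors")

# 1024-dim: ~41 GB; 4096-dim: ~164 GB. A 4x storage gap before any index overhead.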

Common Mistakes

  • Picking by MTEB mean score. The mean averages eight task families; for RAG, retrieval and STS matter most. Filter the leaderboard by task family instead.
  • Skipping domain validation. MTEB’s datasets are not your data; a top-ranked model can underperform on legal, medical, or code corpora. Always run a domain eval.
  • Ignoring embedding dimension and re-index cost. A 4096-dim model may yield small retrieval gains but stores 4x the floats per vector vs. 1024-dim, quadrupling vector store size, and typically comes from a larger, costlier model.
  • Comparing models without re-indexing. Embedding models are not interchangeable on the same index — re-index every candidate or the comparison is meaningless.
  • Trusting only English MTEB scores for multilingual products. Use the multilingual MTEB subset (MIRACL, etc.) for non-English deployments.

Frequently Asked Questions

What is MTEB?

MTEB (Massive Text Embedding Benchmark) is a benchmark suite for text embedding models, covering 8 task families across 50+ datasets and 100+ languages, used to rank embedding models for RAG and search.

How do you use MTEB to pick an embedding model?

Filter the MTEB leaderboard by the task families you care about (typically retrieval and STS for RAG), then run the top 3-5 candidates against your own domain corpus and measure ContextRelevance and ContextRecall on real queries.

Is MTEB sufficient for picking a production embedding model?

No. MTEB scores are a useful first filter but rarely predict domain-specific retrieval quality. Always validate against a domain golden dataset using FutureAGI's ContextRelevance and EmbeddingSimilarity evaluators.