What Is a Cross-Lingual Language Model?

A language model that shares representations across multiple languages, enabling retrieval, reasoning, and generation that cross language boundaries.

A cross-lingual language model is a language model trained or fine-tuned to share representations across multiple languages, so a query in one language can drive retrieval, reasoning, or generation in another. It learns language-agnostic embeddings during pretraining — through multilingual corpora, parallel-data alignment, translation pretext tasks, or contrastive objectives — and uses those embeddings to transfer knowledge across the language boundary. In production, cross-lingual models power multilingual RAG, global support agents, and translation pipelines. FutureAGI evaluates them by language pair and cohort, not only by a global score.
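
A quick way to see shared representations in practice is to embed the same sentence in several languages and compare the vectors. A minimal sketch, assuming sentence-transformers is installed and using the paraphrase-multilingual-MiniLM-L12-v2 checkpoint (any multilingual embedding model works):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# The same meaning in three languages should land close together
# in the shared embedding space.
sentences = [
    "Where is my order?",        # en
    "¿Dónde está mi pedido?",    # es
    "Di mana pesanan saya?",     # id
]
embeddings = model.encode(sentences)
print(util.cos_sim(embeddings, embeddings))  # high off-diagonal similarity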

Why It Matters in Production LLM and Agent Systems

A model that scores well on English benchmarks can degrade silently on lower-resource languages, and the per-language quality gap is invisible until a customer in São Paulo or Jakarta complains. Average accuracy looks fine because most traffic is English; the long tail dies quietly. The same gap appears in retrieval — a Spanish query embedded poorly will not match the English document that contains the right answer, and the support agent confidently produces a generic response instead of the specific one.

The pain is felt across roles. A platform engineer ships a global RAG system, and Indonesian users see hallucination rates 3× higher than English users. A localization lead cannot explain why translation quality varies between French and Vietnamese. A compliance team in the EU asks how the system performs on Romanian and Bulgarian, and the only honest answer is “we don’t know — we evaluated on English.”

In 2026 agent stacks, the gap compounds. An agent’s tool calls, retrieved context, and final answer all need to work across languages, and a weak point at any step breaks the trajectory. Cross-lingual evaluation has to slice metrics by language pair and by step, because global means hide failures.

How FutureAGI Handles Cross-Lingual Language Models

FutureAGI’s approach is to treat language as a cohort dimension on every eval and trace. When you load a Dataset of cross-lingual test cases, each row carries source_lang and target_lang metadata. Dataset.add_evaluation() attaches TranslationAccuracy for direct translation tasks and AnswerRelevancy for retrieval-based answers — both are scored per language pair, so the dashboard exposes an English/French/Hindi/Indonesian breakdown rather than a single mean. For embedding-based retrieval, EmbeddingSimilarity scores cross-lingual query/document matches against gold pairs; this surfaces embedding-space drift between languages before it shows up as a hallucination at the top of the stack.
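
A rough sketch of that wiring. The Dataset import path, constructor arguments, and row shape here are assumptions for illustration; consult the SDK documentation for the exact API:

from fi.datasets import Dataset  # import path assumed
from fi.evals import TranslationAccuracy, AnswerRelevancy

# Each row carries source_lang / target_lang so results can be
# grouped per language pair instead of averaged globally.
dataset = Dataset(
    rows=[
        {
            "input": "Where is my order?",
            "output": "¿Dónde está mi pedido?",
            "source_lang": "en",
            "target_lang": "es",
        },
    ]
)

# Attach evaluators; scores are reported per (source_lang, target_lang) cohort.
dataset.add_evaluation(TranslationAccuracy())
dataset.add_evaluation(AnswerRelevancy())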

In production, traceAI integrations such as langchain and llamaindex annotate spans with the detected language and fields such as llm.token_count.prompt, so eval-fail-rate-by-cohort can be sliced by language without extra work. A team running a global support agent on openai-agents can see at a glance which language cohort is regressing after a model upgrade, and the regression eval against a canonical cross-lingual Dataset blocks the deploy.
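
The integrations attach these attributes automatically. For a custom pipeline step, the equivalent manual annotation looks roughly like this OpenTelemetry-style sketch; the app.language attribute name is made up for illustration, while llm.token_count.prompt is the field named above:

from opentelemetry import trace

tracer = trace.get_tracer("support-agent")

with tracer.start_as_current_span("llm.call") as span:
    # Record the language cohort and token counts as span attributes
    # so the dashboard can slice eval-fail-rate by language.
    span.set_attribute("app.language", "id")            # attribute name assumed
    span.set_attribute("llm.token_count.prompt", 412)   # field named in the docs above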

Compared to running monolingual evaluators and averaging — the common shortcut — FutureAGI’s cohort-aware approach forces the gap to be visible. We’ve found that the per-language breakdown is the single most revealing slice when shipping a multilingual product.

How to Measure or Detect It

Slice every metric by language pair; never trust a single global score. Build the dataset with paired source, target, locale, task type, and retrieval-context columns. Keep high-volume languages and low-resource languages in separate cohorts, because one threshold rarely fits both. For agent systems, measure the planner input, retrieved context, tool arguments, and final answer separately. A slicing sketch follows the metric list below.

  • TranslationAccuracy: scores generated translations against references; configurable for source/target language.
  • EmbeddingSimilarity: cross-lingual retrieval quality between queries and documents in different languages.
  • AnswerRelevancy: works reference-free across languages — surfaces relevance gaps in cross-lingual RAG.
  • Per-language eval-fail-rate (dashboard signal): language-cohort failure rates plotted side-by-side; alert on tail-language regressions.
  • Refusal-rate by language: a model that refuses more in low-resource languages is a fairness flag, not a robustness signal.
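
As promised above, a minimal slicing sketch, assuming per-row eval results have been exported to a DataFrame (the column names are illustrative):

import pandas as pd

# Hypothetical export of per-row eval results; column names are illustrative.
results = pd.DataFrame([
    {"source_lang": "en", "target_lang": "fr", "passed": True},
    {"source_lang": "en", "target_lang": "id", "passed": False},
    {"source_lang": "en", "target_lang": "id", "passed": False},
    {"source_lang": "en", "target_lang": "fr", "passed": True},
])

# Fail rate per language pair; a global mean would hide the id cohort.
fail_rate = 1 - results.groupby(["source_lang", "target_lang"])["passed"].mean()
print(fail_rate)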

Minimal Python:

from fi.evals import TranslationAccuracy, EmbeddingSimilarity

translate = TranslationAccuracy()
embed = EmbeddingSimilarity()

# Reference-based translation check: the output is scored against the reference.
result = translate.evaluate(
    input="The meeting starts at noon.",
    output="La reunión empieza al mediodía.",
    expected_response="La reunión comienza a las doce.",
)
print(result.score, result.reason)

# Cross-lingual retrieval check: Spanish query vs. English document.
# (Argument names mirror the call above; check the SDK docs for the exact signature.)
match = embed.evaluate(
    input="¿Dónde está mi pedido?",
    output="Your order shipped on Tuesday and arrives Friday.",
)
print(match.score)

Common Mistakes

  • Reporting a global score across languages. The English mean almost always masks tail-language failures. Always slice by language pair.
  • Using BLEU as the only translation metric. BLEU rewards n-gram overlap; for paraphrase-heavy targets, pair it with embedding similarity and reference-free judges (see the first sketch after this list).
  • Evaluating retrieval and generation together. The cross-lingual gap can live in the embedding step or the generator; score them separately or you cannot fix the right one.
  • Skipping low-resource cohorts. Coverage on five languages does not generalize to fifty. If you serve a language, you evaluate it.
  • Ignoring tokenizer effects. Some tokenizers fragment non-Latin scripts heavily, inflating cost and latency on those cohorts; measure it (see the second sketch after this list).
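
A minimal illustration of the BLEU pitfall, assuming sacrebleu and sentence-transformers are installed (exact scores will vary):

import sacrebleu
from sentence_transformers import SentenceTransformer, util

reference = "La reunión comienza a las doce."
paraphrase = "La reunión empieza al mediodía."  # valid paraphrase, low n-gram overlap

# BLEU penalizes the paraphrase even though the translation is correct.
bleu = sacrebleu.sentence_bleu(paraphrase, [reference]).score

# Embedding similarity stays high because the meaning is preserved.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
emb = model.encode([reference, paraphrase])
cosine = util.cos_sim(emb[0], emb[1]).item()

print(f"BLEU: {bleu:.1f}  cosine: {cosine:.2f}")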
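
And a quick way to measure tokenizer fragmentation, assuming tiktoken is installed (cl100k_base is used as an example encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "en": "The meeting starts at noon.",
    "hi": "बैठक दोपहर में शुरू होती है।",
}

# Tokens per character: a higher ratio means more fragmentation,
# hence higher cost and latency for that cohort.
for lang, text in samples.items():
    ratio = len(enc.encode(text)) / len(text)
    print(lang, round(ratio, 2))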

Frequently Asked Questions

What is a cross-lingual language model?

A cross-lingual language model is trained to share representations across multiple languages so the same query embedding works regardless of source language and the model can generate fluent output in any supported language.

How is cross-lingual different from multilingual?

A multilingual model handles many languages; a cross-lingual model also aligns them in a shared space so a Spanish query can retrieve English documents, and reasoning transfers across the language boundary.

How do you evaluate a cross-lingual model?

FutureAGI runs TranslationAccuracy for direct generation tasks and EmbeddingSimilarity for retrieval, plus AnswerRelevancy on per-language cohorts to surface quality gaps no single global score reveals.