What Is Semantic Similarity?

A meaning-based comparison that scores whether two texts express the same intent or content despite different wording.

Semantic similarity is an LLM-evaluation signal that measures whether two texts express the same meaning rather than the same surface words. It shows up in eval pipelines when a generated answer is correct but paraphrased, when a retriever returns a conceptually matching chunk, or when a dataset needs near-duplicate cleanup. FutureAGI measures it with embedding-based comparisons that produce a thresholdable 0-1 cosine similarity score, then layers Faithfulness or AnswerRelevancy on top so similar-but-wrong answers do not pass.

In 2026, frontier models paraphrase aggressively by default. The exact-match dashboards that teams ran in 2023 now show false-fail rates above 30% on perfectly good answers. On TruthfulQA’s 817 questions and HaluEval (35K Q&A), embedding-similarity grading recovers 60-80% of the correct paraphrases that exact match silently rejects, but it stays vulnerable to negation (“we allow refunds” vs “we do not allow refunds”), which is why pairing it with NLI checks is now standard. Semantic similarity is the cheap first filter that gets you past exact-match false fails, and it remains the basic building block underneath richer signals like semantic accuracy and AnswerRelevancy.

Why semantic similarity matters in production LLM and agent systems

Semantic similarity matters because production answers are rarely word-for-word replicas of a reference. If you score open-ended answers with exact match, a correct paraphrase fails. If you score them with only token overlap, a fluent but wrong answer can pass because it repeats the right nouns. The result is metric inversion: the eval suite rewards formatting, not meaning.
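
The two failure directions described above can be reproduced with a few lines of standard-library Python. This is a minimal sketch: the helper names (`tokens`, `exact_match`, `token_overlap`) are illustrative, and Jaccard overlap is used here only as a simple stand-in for token-overlap metrics in general.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def exact_match(response: str, reference: str) -> bool:
    """Strict string equality after trimming whitespace."""
    return response.strip() == reference.strip()


def token_overlap(response: str, reference: str) -> float:
    """Jaccard overlap between the two token sets."""
    a, b = tokens(response), tokens(reference)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


reference = "Reset links expire after 10 minutes."
paraphrase = "The reset link is valid for ten minutes."
policy = "We do not allow refunds."
wrong = "We allow refunds."

# A correct paraphrase fails exact match and shares few tokens.
print(exact_match(paraphrase, reference))
print(round(token_overlap(paraphrase, reference), 2))

# An answer with the opposite meaning shares most of its tokens.
print(round(token_overlap(wrong, policy), 2))
```

The wrong refund answer scores higher on overlap than the correct reset-link paraphrase, which is the inversion the paragraph describes.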

The pain reaches several teams at once. A developer sees a 40% exact-match pass rate on a support bot even though human review finds most answers acceptable. An SRE sees eval-fail-rate-by-cohort jump after a prompt release, but user thumbs-down rate stays flat. A data lead finds that near-duplicate golden examples inflate benchmark confidence because the same intent appears six ways in the holdout set.

The failure modes are concrete: false negatives on correct paraphrases, false positives on lexical overlap, duplicate test leakage, and silent semantic drift across multi-step agents. Agentic systems amplify the issue because each step rewrites state: the planner summarizes the goal, a tool returns structured data, memory compresses the exchange, and the final answer paraphrases all of it. If you cannot tell “same meaning, different words” from “different intent, similar words,” a regression eval can block good releases and ship bad ones.
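
Duplicate test leakage is the easiest of these failure modes to probe directly. The sketch below groups golden rows whose embeddings are nearly parallel. It assumes the embeddings were computed elsewhere; the three-dimensional vectors, the row ids, and the 0.95 threshold are all made-up illustration values, and the greedy grouping is one simple strategy among several.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def near_duplicate_groups(rows, threshold=0.95):
    """Greedy grouping: each row joins the first group whose anchor it resembles."""
    groups = []   # lists of row ids; the first id in each list is the anchor
    anchors = []  # anchor vectors, parallel to groups
    for row_id, vector in rows:
        for group, anchor in zip(groups, anchors):
            if cosine(vector, anchor) >= threshold:
                group.append(row_id)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([row_id])
            anchors.append(vector)
    return groups


# Toy vectors standing in for real sentence embeddings.
golden = [
    ("reset-1", [0.90, 0.10, 0.30]),
    ("reset-2", [0.88, 0.12, 0.31]),
    ("refund-1", [0.10, 0.95, 0.20]),
    ("reset-3", [0.91, 0.09, 0.28]),
]
for group in near_duplicate_groups(golden):
    print(group)
```

Groups with more than one member are candidates for human review rather than automatic deletion, since rows that look alike may still cover different policies or locales.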

How FutureAGI handles semantic similarity

FutureAGI’s approach is to treat semantic similarity as a soft matching layer, not a universal truth metric. The cheap path compares response and expected_response by sentence embeddings and returns a 0-1 score. The same comparison can run against dataset rows, regression evals, or sampled production traces from traceAI-langchain.
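
To make the embedding comparison concrete, here is a minimal sketch of the scoring step only. It is not FutureAGI's implementation: the vectors are toy three-dimensional stand-ins for real sentence embeddings, and clamping negative cosines to 0 is an assumption made here to get a 0-1 score.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def similarity_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Clamp cosine into [0, 1] so it can be compared against a threshold."""
    return max(0.0, min(1.0, cosine_similarity(a, b)))


# Toy vectors standing in for embeddings of expected_response and two responses.
expected = [0.9, 0.1, 0.3]
paraphrase = [0.8, 0.2, 0.35]
unrelated = [-0.2, 0.9, -0.4]

print(round(similarity_score(expected, paraphrase), 3))
print(round(similarity_score(expected, unrelated), 3))
```

In a real pipeline the vectors would come from a sentence-embedding model, and the same model would need to embed both texts for the score to be meaningful.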

Example: a support team has a golden answer, “Reset links expire after 10 minutes.” The model writes, “The reset link is valid for ten minutes.” Exact match fails; BLEU and ROUGE may under-score the paraphrase; semantic similarity scores it high. The team sets a threshold of 0.82 for paraphrase acceptance, then pairs it with Groundedness or Faithfulness when the answer must be supported by retrieved policy text. That pairing matters: a hallucination can closely resemble the reference in wording and structure yet be unsupported by the retrieved context. Compared with LangSmith’s built-in embedding-similarity evaluator, FutureAGI keeps the embedding-model identity, dimension, and version on the eval row, so a similarity shift after an embedding-model snapshot update is attributable rather than a mystery.
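
The pairing in this example can be sketched as a small acceptance gate. Everything here is illustrative rather than FutureAGI's schema: `EvalRow`, its field names, the `example-embedder` model name, and the 0.70 faithfulness threshold are assumptions; only the 0.82 similarity threshold comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRow:
    similarity: float        # 0-1 score against the golden answer
    faithfulness: float      # 0-1 score against the retrieved policy text
    embedding_model: str     # recorded so a score shift stays attributable
    embedding_dim: int
    embedding_version: str


def accept(row: EvalRow, sim_threshold: float = 0.82,
           faith_threshold: float = 0.70) -> bool:
    """Pass only when the answer is both close in meaning and supported."""
    return row.similarity >= sim_threshold and row.faithfulness >= faith_threshold


def comparable(a: EvalRow, b: EvalRow) -> bool:
    """Scores are comparable only when the same embedding snapshot produced them."""
    return (a.embedding_model, a.embedding_dim, a.embedding_version) == (
        b.embedding_model, b.embedding_dim, b.embedding_version)


# Hypothetical rows with made-up scores.
paraphrase = EvalRow(0.91, 0.88, "example-embedder", 768, "v1")
hallucination = EvalRow(0.90, 0.20, "example-embedder", 768, "v1")
after_update = EvalRow(0.84, 0.88, "example-embedder", 768, "v2")

print(accept(paraphrase), accept(hallucination))
print(comparable(paraphrase, after_update))
```

The hallucinated row clears the similarity threshold but is rejected on faithfulness, and the `comparable` check keeps a score drop after a snapshot update from being read as a model regression.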

| Use case | What semantic similarity does | What it cannot do alone |
| --- | --- | --- |
| Grading paraphrased answers | Accepts correct paraphrases that exact-match rejects | Verify factual correctness |
| Retrieval evaluation | Confirms chunk topical match | Catch contradictions in retrieved text |
| Golden-dataset deduplication | Surfaces near-duplicate prompts | Distinguish nuanced variants |
| Conversational drift | Tracks meaning stability across turns | Detect persona changes |
| Multi-agent handoff | Confirms a handed-off goal kept its meaning | Verify the handoff was authorized |

The engineer’s next action is operational. In a regression eval, any cohort whose median similarity drops below the release baseline blocks the prompt or model rollout. In production, similarity scores attached to traces are segmented by prompt version, locale, and task type. If Spanish refund answers drop while English stays flat, the fix is likely translation coverage or retrieval, not model quality in general.
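
The regression rule above reduces to a per-cohort median comparison. This is a minimal sketch under stated assumptions: the cohort keys, scores, and baselines are invented, and cohorts with no recorded baseline or no scores are simply skipped.

```python
from statistics import median


def blocking_cohorts(scores_by_cohort, baseline_median):
    """Return cohorts whose median similarity fell below their release baseline."""
    blocked = []
    for cohort, scores in scores_by_cohort.items():
        baseline = baseline_median.get(cohort)
        if baseline is None or not scores:
            continue  # nothing to compare against in this sketch
        if median(scores) < baseline:
            blocked.append(cohort)
    return sorted(blocked)


# Hypothetical candidate-release scores, keyed by locale and task.
candidate = {
    "en/refund": [0.91, 0.88, 0.93, 0.90],
    "es/refund": [0.71, 0.80, 0.76, 0.83],
}
baseline = {"en/refund": 0.89, "es/refund": 0.86}

blocked = blocking_cohorts(candidate, baseline)
print(blocked)  # a non-empty list would block the rollout
```

Keeping the result as a list of cohort names, rather than a single pass/fail flag, points the engineer at the segment to investigate first.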

How to measure semantic similarity

Measure semantic similarity as a calibrated metric, not a single global threshold. Useful signals include:

  • Embedding-based similarity. 0-1 cosine between response and expected_response sentence embeddings. The cheap first filter.
  • Similarity distribution by cohort. p10, median, and p90 by prompt version, locale, task, and model.
  • Exact-match disagreement rate. Rows where exact match fails but semantic similarity passes; these are usually acceptable paraphrases.
  • Near-miss false positives. Pairs like “cancel subscription” vs “keep subscription” that should stay below threshold.
  • Faithfulness. Pair it with semantic similarity so a high-similarity answer must also be supported by the cited context.
  • AnswerRelevancy. Confirms the response addresses the user’s actual ask.
  • User-feedback proxy. thumbs-down or escalation rate for cases near the similarity threshold.
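
Two of the signals in this list, the per-cohort distribution and the exact-match disagreement rate, can be computed with the standard library alone. The sketch below assumes each eval row is a plain dict with hypothetical `similarity` and `exact_match` keys; the scores are invented.

```python
from statistics import median, quantiles


def summarize(scores):
    """p10, median, and p90 of a list of similarity scores (needs 2+ values)."""
    deciles = quantiles(scores, n=10, method="inclusive")  # nine cut points
    return {"p10": deciles[0], "median": median(scores), "p90": deciles[8]}


def exact_match_disagreement_rate(rows, threshold):
    """Share of rows where exact match fails but similarity passes."""
    if not rows:
        return 0.0
    disagreements = sum(
        1 for row in rows
        if not row["exact_match"] and row["similarity"] >= threshold
    )
    return disagreements / len(rows)


# Hypothetical eval rows for one cohort.
rows = [
    {"similarity": 0.93, "exact_match": False},
    {"similarity": 1.00, "exact_match": True},
    {"similarity": 0.88, "exact_match": False},
    {"similarity": 0.41, "exact_match": False},
]

summary = summarize([row["similarity"] for row in rows])
rate = exact_match_disagreement_rate(rows, threshold=0.85)
print(summary)
print(rate)
```

Running `summarize` once per prompt version, locale, task, and model gives the cohort view; a rising disagreement rate is a hint that exact match is rejecting acceptable paraphrases.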

Minimal Python:

from fi.evals import AnswerRelevancy, Faithfulness

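# AnswerRelevancy checks that the output addresses the question that was asked;
# it is used here as the meaning-level (semantic-similarity-style) check.
relevancy = AnswerRelevancy().evaluate(
    input="When do password reset links expire?",
    output="The reset link is valid for ten minutes.",
)

# Faithfulness checks that the output is supported by the retrieved context,
# so a similar-sounding but unsupported answer does not pass.
faith = Faithfulness().evaluate(
    output="The reset link is valid for ten minutes.",
    context="Password reset links expire after 10 minutes.",
)

# Inspect both scores together before applying any thresholds.
print(relevancy.score, faith.score)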

Calibrate thresholds on labeled examples before using them in CI. A score of 0.78 may be high enough to accept a support paraphrase yet too loose for a legal-policy clause. Our default starting point is a 0.85 acceptance threshold for free-form chat and 0.92 for structured fields, which we then tune per cohort with a small human-labeled calibration set.
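
One simple way to calibrate is to sweep candidate thresholds over the labeled set and keep the lowest one that meets a precision target. This is a sketch of that idea, not a prescribed procedure: the `calibrate_threshold` name, the 0.95 precision target, and the labeled pairs are all illustrative assumptions.

```python
def calibrate_threshold(labeled, min_precision=0.95):
    """Lowest threshold whose accepted set meets the precision target.

    labeled: list of (similarity_score, human_says_equivalent) pairs.
    Returns None if no candidate threshold reaches the target.
    """
    candidates = sorted({score for score, _ in labeled})
    for threshold in candidates:
        # Everything at or above the threshold would be auto-accepted.
        accepted = [ok for score, ok in labeled if score >= threshold]
        precision = sum(accepted) / len(accepted)
        if precision >= min_precision:
            return threshold
    return None


# Hypothetical human-labeled calibration set for one cohort.
labeled = [
    (0.95, True), (0.91, True), (0.88, True), (0.86, False),
    (0.84, True), (0.80, False), (0.78, True), (0.70, False),
]

print(calibrate_threshold(labeled))
print(calibrate_threshold(labeled, min_precision=0.80))
```

Choosing the lowest qualifying threshold keeps as many correct paraphrases as possible while holding false accepts to the target; a stricter cohort such as legal text would use a higher precision target and usually end up with a higher threshold.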

Common mistakes

Most weak semantic-similarity evals fail because engineers treat “close meaning” as a full quality verdict.

  • Treating semantic similarity as factual correctness. A plausible paraphrase can preserve a hallucination; pair it with Groundedness or Faithfulness.
  • Reusing one threshold across models. Cosine scores shift when you change embedding model, dimension, language, or domain.
  • Evaluating only positives. Include adversarial near-misses like “cancel” vs “keep” subscription to tune false positives.
  • Using it for canonical outputs. Country codes, JSON keys, SKUs, and tool names need exact or schema metrics.
  • Deduplicating golden datasets too aggressively. Near-duplicate prompts may cover different intents, policies, or locales.
  • Comparing similarity scores across embedding-model snapshots. Vendor updates change geometry; pin the model.
  • Treating one embedding model as universal. Multilingual evals need multilingual embeddings; legal-domain evals benefit from domain-tuned embeddings. The right embedding model is task-dependent, not vendor-dependent.
  • Ignoring negation. “We allow refunds” and “we do not allow refunds” can sit close in embedding space; pair with a contradiction-aware check on safety-sensitive surfaces.
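
For the negation case in particular, even a crude lexical guard can route risky pairs to a stronger check. The sketch below is a heuristic stand-in, not an NLI or contradiction model: the cue list is deliberately small and incomplete, the function names are illustrative, and the 0.93 similarity used in the example is a made-up score.

```python
import re

# Illustrative, incomplete list of negation cues.
NEGATION_CUES = {"not", "no", "never", "cannot", "without"}


def has_negation(text: str) -> bool:
    """True if the text contains any cue word, after expanding n't to not."""
    normalized = text.lower().replace("\u2019", "'").replace("n't", " not")
    words = set(re.findall(r"[a-z]+", normalized))
    return bool(words & NEGATION_CUES)


def negation_mismatch(response: str, reference: str) -> bool:
    """True when exactly one of the two texts is negated."""
    return has_negation(response) != has_negation(reference)


def needs_contradiction_review(similarity: float, response: str,
                               reference: str, threshold: float = 0.85) -> bool:
    """Flag pairs that pass on similarity but disagree on negation."""
    return similarity >= threshold and negation_mismatch(response, reference)


reference = "We do not allow refunds."
print(needs_contradiction_review(0.93, "We allow refunds.", reference))
print(needs_contradiction_review(0.93, "We don't allow refunds.", reference))
```

A flag here is a reason to run a contradiction-aware check or human review, not a verdict: double negatives and phrases such as “no later than” will fool a cue list like this one.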

Frequently Asked Questions

What is semantic similarity?

Semantic similarity scores whether two texts express the same meaning even when their wording differs. In LLM evaluation, it lets teams grade correct paraphrases and retrieval matches without requiring exact strings.

How is semantic similarity different from exact match?

Exact match requires the response and reference to be identical. Semantic similarity allows different wording if the meaning is close enough, so it is better for open-ended answers.

How do you measure semantic similarity?

FutureAGI measures it by comparing embeddings of response and expected_response and returning a 0-1 score, then pairs it with Faithfulness or AnswerRelevancy to avoid rewarding fluent but wrong answers.