What Is Semantic Similarity?

A meaning-based comparison that scores whether two texts express the same intent or content despite different wording.

Semantic similarity is an LLM-evaluation signal that measures whether two texts express the same meaning rather than the same surface words. It shows up in eval pipelines when a generated answer is correct but paraphrased, when a retriever returns a conceptually matching chunk, or when a dataset needs near-duplicate cleanup. FutureAGI measures it with fi.evals.EmbeddingSimilarity, a local metric that compares sentence embeddings and returns a thresholdable 0-1 similarity score.

Why It Matters in Production LLM and Agent Systems

Semantic similarity matters because production answers are rarely word-for-word replicas of a reference. If you score open-ended answers with exact-match, a correct paraphrase fails. If you score them with only token overlap, a fluent but wrong answer can pass because it repeats the right nouns. The result is metric inversion: the eval suite rewards formatting, not meaning.

The pain reaches several teams at once. A developer sees a 40% exact-match pass rate on a support bot even though human review finds most answers acceptable. An SRE sees eval-fail-rate-by-cohort jump after a prompt release, but user thumbs-down rate stays flat. A data lead finds that near-duplicate golden examples inflate benchmark confidence because the same intent appears six ways in the holdout set.

The failure modes are concrete: false negatives on correct paraphrases, false positives on lexical overlap, duplicate test leakage, and silent semantic drift across multi-step agents. Agentic systems amplify the issue because each step rewrites state: the planner summarizes the goal, a tool returns structured data, memory compresses the exchange, and the final answer paraphrases all of it. If you cannot tell “same meaning, different words” from “different intent, similar words,” a regression eval can block good releases and ship bad ones.

How FutureAGI Handles Semantic Similarity

FutureAGI’s approach is to treat semantic similarity as a soft matching layer, not as a universal truth metric. The product surface is the EmbeddingSimilarity eval: the fi.evals.EmbeddingSimilarity local metric compares response and expected_response with sentence embeddings and returns a 0-1 score. The same evaluator can run against dataset rows, regression evals, or sampled production traces from traceAI-langchain.

Example: a support team has a golden answer, “Reset links expire after 10 minutes.” The model writes, “The reset link is valid for ten minutes.” Exact match fails; BLEU and ROUGE may under-score the paraphrase; EmbeddingSimilarity should score it high. The team sets a threshold of 0.82 for paraphrase acceptance, then pairs it with Groundedness or Faithfulness when the answer must be supported by retrieved policy text. That pairing matters: a hallucination can be semantically similar to the reference shape but unsupported by context.
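
A minimal sketch of that acceptance gate; the 0.82 threshold comes from the example above, and the is_grounded flag stands in for a separate Groundedness or Faithfulness eval rather than anything computed here:

PARAPHRASE_THRESHOLD = 0.82  # calibrated per use case, not universal

def accept_answer(similarity_score: float, is_grounded: bool) -> bool:
    # Pass only when the answer is both semantically close to the
    # reference and supported by the retrieved context.
    return similarity_score >= PARAPHRASE_THRESHOLD and is_grounded

print(accept_answer(0.91, is_grounded=True))   # True: correct paraphrase
print(accept_answer(0.91, is_grounded=False))  # False: similar shape, unsupported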

The score should drive an operational next step, not just a report. In a regression eval, any cohort whose median EmbeddingSimilarity.score drops below the release baseline blocks the prompt or model rollout. In production, similarity scores attached to traces are segmented by prompt version, locale, and task type. If Spanish refund answers drop while English stays flat, the fix is likely translation coverage or retrieval, not model quality in general.
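
A minimal sketch of that release gate, with hypothetical per-locale cohorts and baseline medians:

from statistics import median

# Hypothetical baseline medians from the current release, keyed by locale.
baseline = {"en": 0.86, "es": 0.84}

# EmbeddingSimilarity.score samples from the candidate prompt version.
candidate = {
    "en": [0.88, 0.85, 0.90, 0.87],
    "es": [0.71, 0.69, 0.74, 0.70],  # Spanish refund answers regressed
}

def blocked_cohorts(candidate, baseline):
    # Block the rollout if any cohort's median falls below its baseline.
    return [locale for locale, scores in candidate.items()
            if median(scores) < baseline[locale]]

print(blocked_cohorts(candidate, baseline))  # ['es'] blocks the rollout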

How to Measure or Detect It

Measure semantic similarity as a calibrated metric, not a single global threshold. Useful signals include the following; a computation sketch for the first three follows the list:

  • fi.evals.EmbeddingSimilarity - returns a 0-1 cosine-similarity score between response and expected_response.
  • Similarity distribution by cohort - p10, median, and p90 by prompt version, locale, task, and model.
  • Exact-match disagreement rate - rows where exact match fails but semantic similarity passes; these are usually acceptable paraphrases.
  • Near-miss false positives - pairs like “cancel subscription” versus “keep subscription” that should stay below threshold.
  • User-feedback proxy - thumbs-down or escalation rate for cases near the similarity threshold.
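
A minimal sketch of those first three signals, using only the standard library on hypothetical eval rows:

from statistics import median, quantiles

# Hypothetical rows: (cohort, exact_match_passed, EmbeddingSimilarity score).
rows = [
    ("en", True, 0.97), ("en", False, 0.91), ("en", False, 0.88),
    ("es", True, 0.95), ("es", False, 0.71), ("es", False, 0.62),
]

by_cohort = {}
for cohort, _, score in rows:
    by_cohort.setdefault(cohort, []).append(score)

for cohort, scores in sorted(by_cohort.items()):
    deciles = quantiles(scores, n=10)  # needs a handful of rows per cohort
    print(cohort, f"p10={deciles[0]:.2f} median={median(scores):.2f} p90={deciles[8]:.2f}")

# Exact-match disagreement: exact match failed but similarity passed.
THRESHOLD = 0.82
disagree = [r for r in rows if not r[1] and r[2] >= THRESHOLD]
print(f"disagreement rate: {len(disagree) / len(rows):.2f}")  # 0.33 here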

Minimal Python:

from fi.evals import EmbeddingSimilarity

# Compare a generated answer against the golden reference answer.
metric = EmbeddingSimilarity()
result = metric.evaluate(
    response="The reset link is valid for ten minutes.",
    expected_response="Password reset links expire after 10 minutes.",
)

# result.score is a 0-1 similarity; threshold it per use case.
print(result.score)

Calibrate thresholds on labeled examples before using them in CI. A 0.78 score may pass support paraphrases and fail legal-policy clauses.
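
A minimal sketch of that calibration step; the scores and acceptability labels are hypothetical and would come from human review:

# Hypothetical labeled pairs: (EmbeddingSimilarity score, human-judged acceptable).
labeled = [
    (0.95, True), (0.88, True), (0.84, True), (0.81, True),
    (0.79, False), (0.76, False), (0.68, False), (0.55, False),
]

def best_threshold(labeled):
    # Pick the cutoff with the fewest misclassifications on the labeled set.
    candidates = [t / 100 for t in range(50, 100)]
    errors = lambda t: sum((score >= t) != ok for score, ok in labeled)
    return min(candidates, key=errors)

print(best_threshold(labeled))  # 0.8 on this toy set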

Common Mistakes

Most weak semantic-similarity evals fail because engineers treat “close meaning” as a full quality verdict.

  • Treating semantic similarity as factual correctness. A plausible paraphrase can preserve a hallucination; pair it with Groundedness, Faithfulness, or FactualConsistency.
  • Reusing one threshold across models. Cosine scores shift when you change embedding model, dimension, language, or domain.
  • Evaluating only positives. Include adversarial near-misses like “cancel” versus “keep” subscription to tune false positives; a test sketch follows this list.
  • Using it for canonical outputs. Country codes, JSON keys, SKUs, and tool names need exact or schema metrics.
  • Deduplicating golden datasets too aggressively. Near-duplicate prompts may cover different intents, policies, or locales.
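
A minimal regression test for that near-miss case, assuming the evaluate/score interface from the Minimal Python example above; the pairs and the 0.82 threshold are illustrative:

from fi.evals import EmbeddingSimilarity

# Pairs that share most tokens but invert the intent.
NEAR_MISSES = [
    ("Cancel my subscription.", "Keep my subscription."),
    ("The refund was approved.", "The refund was denied."),
]

metric = EmbeddingSimilarity()
THRESHOLD = 0.82
for response, expected in NEAR_MISSES:
    result = metric.evaluate(response=response, expected_response=expected)
    # These must stay below threshold, or the metric passes wrong answers.
    assert result.score < THRESHOLD, (response, expected, result.score)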

Frequently Asked Questions

What is semantic similarity?

Semantic similarity scores whether two texts express the same meaning even when their wording differs. In LLM evaluation, it lets teams grade correct paraphrases and retrieval matches without requiring exact strings.

How is semantic similarity different from exact match?

Exact match requires the response and reference to be identical. Semantic similarity allows different wording if the meaning is close enough, so it is better for open-ended answers.

How do you measure semantic similarity?

FutureAGI measures it with fi.evals.EmbeddingSimilarity, which compares response and expected_response embeddings and returns a 0-1 similarity score for thresholding.