What Is Cosine Similarity?
A metric that measures the cosine of the angle between two vectors, returning a value from -1 to 1; widely used for comparing embeddings.
What Is Cosine Similarity?
Cosine similarity measures how closely two vectors point in the same direction, returning a value from -1 to 1. In LLM systems, it compares embeddings for vector search, response-versus-reference evaluation, retrieval ranking, and semantic-cache deduplication. Unlike Euclidean distance, cosine similarity ignores vector magnitude, so it stays useful when embedding models produce vectors at different scales; FutureAGI uses it inside EmbeddingSimilarity and retrieval traces.
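In code, the definition is a one-liner over numpy; the vectors below are toy values, not real embeddings:
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between a and b: dot product over the product of norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.2, 0.9, 0.1])
b = np.array([0.25, 0.8, 0.3])
print(cosine_similarity(a, b))   # near 1: the vectors point the same way
print(cosine_similarity(a, -a))  # exactly -1: opposite directions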
Why Cosine Similarity Matters in Production LLM and Agent Systems
Almost every modern LLM stack ranks something by cosine similarity. Vector databases — Pinecone, Weaviate, Qdrant, pgvector, Chroma, Milvus, LanceDB — index embeddings and serve nearest-neighbour queries by cosine. Semantic caches deduplicate prompts by cosine. Evaluation pipelines compare a generated response to an expected response by cosine over their embeddings. Reranker models combine cosine with cross-encoder scores. If you ship retrieval, you ship cosine.
The pain comes from treating cosine as a black-box quality score. A platform engineer sets the semantic-cache threshold at 0.85 because “that’s what the docs said,” and watches cache hit rate stagnate at 4% — the threshold was tuned for a different embedding model on a different distribution. An ML engineer evaluates a RAG system with cosine alone and misses that the embedding model collapses paraphrases of opposing claims close together. A data scientist switches embedding model from text-embedding-3-small to text-embedding-3-large and forgets that the cosine distribution shifts; the same threshold now blocks half the legitimate cache hits.
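One way to carry a threshold across a model switch is percentile matching: find where the old cutoff sits in the old score distribution and take the same percentile of the new one. A minimal sketch, with synthetic distributions standing in for real benchmark scores:
import numpy as np

# Stand-ins for cosine scores of the same benchmark pairs under two models;
# in practice these come from re-embedding your own query pairs.
rng = np.random.default_rng(1)
old_scores = rng.normal(0.80, 0.06, 5_000)  # e.g. the old embedding model
new_scores = rng.normal(0.55, 0.12, 5_000)  # e.g. the new embedding model

old_threshold = 0.85
# Where does the old threshold sit in the old distribution?
pct = (old_scores < old_threshold).mean() * 100
# Take the score at the same percentile of the new distribution.
new_threshold = np.percentile(new_scores, pct)
print(f"old 0.85 is roughly new {new_threshold:.2f}")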
In 2026, the stack has standardised on cosine as the default similarity metric, but production reliability requires evaluating the embeddings themselves — drift, calibration, alignment to the eval task — not just the cosine numbers they produce.
How FutureAGI Handles Cosine Similarity
FutureAGI’s approach is to treat cosine as a primitive that powers higher-level evaluators rather than a final metric.
- EmbeddingSimilarity evaluator: returns 0–1 cosine similarity between two texts (typically remapped from -1..1) using a stable embedding backbone, with a returned reason that names the chosen model. Use it for response-vs-reference, query-vs-retrieved-chunk, or chunk-vs-chunk comparisons.
- SemanticListContains: extends cosine to one-vs-many: does a response contain any phrase semantically similar to a list of references?
- Retrieval surface: traceAI integrations such as traceAI-pinecone, traceAI-pgvector, traceAI-qdrant, and traceAI-weaviate capture the cosine score returned by the vector store as a span attribute, so you can correlate retrieval rank with downstream answer quality.
- Semantic cache: Agent Command Center’s semantic-cache uses cosine to decide hit-or-miss; a per-route threshold is configurable and observable.
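The hit-or-miss decision itself is small; a product-agnostic sketch with toy unit vectors (this is not the Agent Command Center API):
import numpy as np

def cache_lookup(query_vec, cache, threshold=0.85):
    """Return (key, score) of the best cache hit, or (None, score) on a miss."""
    best_key, best_sim = None, -1.0
    for key, vec in cache.items():
        # Unit-length vectors: dot product equals cosine similarity.
        sim = float(np.dot(query_vec, vec))
        if sim > best_sim:
            best_key, best_sim = key, sim
    return (best_key, best_sim) if best_sim >= threshold else (None, best_sim)

# Toy unit vectors standing in for real prompt embeddings.
cache = {
    "reset password": np.array([0.6, 0.8]),
    "billing address": np.array([1.0, 0.0]),
}
query = np.array([0.55, 0.835])
query = query / np.linalg.norm(query)
print(cache_lookup(query, cache))  # hit: the closest cached prompt clears 0.85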
Concretely: a RAG team building a legal-research assistant evaluates two embedding models — text-embedding-3-large and a domain-tuned alternative — by running 800 question-document pairs through EmbeddingSimilarity. The domain-tuned model wins by 6 cosine points on average but underperforms on two regulated topics because of training-data overlap, surfaced when the team slices by topic. They pick text-embedding-3-large and tune the cache threshold per topic. Without per-cohort cosine evaluation, that decision would have shipped with hidden regressions. Unlike a generic vector-search benchmark, FutureAGI ties the cosine number to the downstream Faithfulness and AnswerRelevancy of the RAG pipeline.
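A sketch of that per-topic slice, with illustrative numbers in place of the real 800-pair results:
import pandas as pd

# One row per question-document pair; 'score' is the EmbeddingSimilarity output.
results = pd.DataFrame({
    "topic": ["tax", "tax", "securities", "securities", "contracts", "contracts"],
    "model": ["domain", "3-large", "domain", "3-large", "domain", "3-large"],
    "score": [0.91, 0.85, 0.62, 0.78, 0.88, 0.82],
})

# Mean cosine per topic and model: an overall average would hide the
# regulated-topic regression described above.
print(results.groupby(["topic", "model"])["score"].mean().unstack())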
How to Measure Cosine Similarity
Cosine itself is one number, but its production behaviour is multi-signal:
- EmbeddingSimilarity: returns 0–1 cosine between two texts; the canonical evaluator.
- Retrieval cosine score: captured per chunk on every retrieval span; correlate with ContextRelevance to detect drift.
- Semantic-cache hit-rate (dashboard signal): tunable by cosine threshold; track per-route.
- Cosine distribution per cohort: histogram of similarity scores by intent or topic; surfaces miscalibration.
- Threshold sensitivity: the change in hit-rate per 0.01 of threshold; useful when picking a cache-or-block cutoff (sketched in code after the minimal example below).
Minimal Python:
from fi.evals import EmbeddingSimilarity

# Instantiate the evaluator; it embeds both texts with a stable backbone.
sim = EmbeddingSimilarity()

result = sim.evaluate(
    text_a="How do I reset my password?",
    text_b="What is the password reset flow?",
)

# score is the 0-1 cosine similarity; reason names the embedding model used.
print(result.score, result.reason)
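The threshold-sensitivity signal from the list above reduces to a sweep over the cutoff; a sketch with synthetic scores standing in for the real best-match cosine per cache lookup:
import numpy as np

# Stand-in for real best-match cosine scores, one per cache lookup.
rng = np.random.default_rng(0)
scores = np.clip(rng.normal(loc=0.84, scale=0.05, size=10_000), -1.0, 1.0)

for t in np.arange(0.80, 0.93, 0.01):
    hit_rate = (scores >= t).mean()
    print(f"threshold {t:.2f}: hit-rate {hit_rate:6.1%}")
# A steep drop between adjacent rows means scores bunch near the cutoff;
# small threshold errors will swing the hit-rate hard.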
Common mistakes
- Reusing thresholds across embedding models. A 0.85 cutoff with text-embedding-3-small is a different operating point than 0.85 with text-embedding-3-large.
- Treating cosine as a quality score. It is a similarity metric; quality requires evaluating the downstream task.
- Ignoring sign. Most production embeddings live in (0, 1), but task-specific models can produce negative similarity; check before clipping.
- Using dot product as cosine when normalisation is broken. True cosine is scale-invariant, but many indexes compute a plain dot product and assume unit-length vectors; if the pipeline forgets to L2-normalise, those scores are no longer cosine. Normalise once at index time (see the sketch after this list).
- No per-cohort calibration. Average cosine hides that one cohort’s similarity distribution is shifted; calibrate per intent or topic.
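A minimal sketch of the normalise-once fix from the list above, with random vectors standing in for real embeddings:
import numpy as np

def l2_normalise(vectors: np.ndarray) -> np.ndarray:
    # Divide each row by its L2 norm; after this, dot product equals cosine.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

embeddings = np.random.default_rng(2).normal(size=(1000, 384))  # stand-in vectors
index_vectors = l2_normalise(embeddings)  # normalise once, at index time
# A dot-product index over index_vectors now returns true cosine scores.
print(np.linalg.norm(index_vectors, axis=1)[:3])  # all ~1.0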
Frequently Asked Questions
What is cosine similarity?
Cosine similarity is the cosine of the angle between two vectors, returning -1 to 1. Close to 1 means vectors point in similar directions; close to 0 means orthogonal; close to -1 means opposite.
How is cosine similarity different from Euclidean distance?
Euclidean distance compares vector magnitudes and positions in space; cosine similarity compares only directions. Cosine is preferred for embeddings because length carries no semantic meaning and varies across embedding models.
How does FutureAGI use cosine similarity?
FutureAGI's EmbeddingSimilarity evaluator returns cosine similarity between two texts; vector-search integrations such as traceAI-pinecone and traceAI-pgvector capture the cosine scores vector stores use to rank retrieval; semantic-cache uses it for near-duplicate detection.