What Is an Embedding Projector?

An embedding projector is a visualization tool that maps high-dimensional embeddings — typically 384, 768, or 1536 dimensions — into 2D or 3D using techniques like t-SNE, UMAP, or PCA so engineers can inspect cluster structure, neighbors, and outliers. The TensorBoard projector is the canonical example; many teams build equivalents in notebooks. It is most useful when debugging retrieval, comparing embedding models, or auditing whether two cohorts share an embedding space. FutureAGI does not ship a projector UI, but exposes the source data — fi.datasets.Dataset rows, retrieval spans, evaluator scores — so any projector tool can plot the same vectors you evaluate.
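
The projection step itself is only a few lines. A minimal sketch with scikit-learn's PCA on stand-in vectors (t-SNE or UMAP drop in the same way; the shapes here are illustrative, not tied to any FutureAGI export):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real embeddings: 500 vectors of 768 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))

# Project to 2D for plotting; swap in sklearn.manifold.TSNE or umap.UMAP
# when you want neighborhood-preserving layouts instead of variance axes.
coords = PCA(n_components=2).fit_transform(embeddings)
print(coords.shape)  # (500, 2)
```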

Why Embedding Projectors Matter in Production LLM and Agent Systems

Cosine similarity scores tell you whether two vectors are close. They do not tell you why retrieval is missing the right document. The “why” is usually structural: the wrong cluster, a gap between two related sub-topics, an isolated outlier the encoder did not learn well. A projector exposes that structure.

ML engineers use projectors when retrieval quality regresses for an unknown reason. SREs see it as the diagnostic step when vector-search recall@k drops on a specific cohort and the team needs to decide between encoder swap, threshold change, or content rewrite. Product managers see the value when a one-off “why is this happening” investigation produces a screenshot showing the failing queries clustered far away from the correct documents.

This matters more in agentic traces because retrieval mistakes become inputs to planning, tool selection, and memory writes. A projector can reveal the broken upstream neighborhood before downstream evaluators start flagging the symptoms.

In 2026 stacks, projectors are also the fastest way to compare two encoders side by side. A team upgrading from text-embedding-3-small to a fine-tuned domain encoder can plot both spaces colored by topic and visually confirm that the new encoder separates classes the old one collapsed. That qualitative confirmation rarely lives in a single number, and it is often the deciding factor for the upgrade.
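
The side-by-side comparison is mechanical once both encoders' vectors are exported. A sketch using a separate PCA fit per encoder (any projection technique substitutes; `project_both` and the shapes are illustrative, not a FutureAGI API):

```python
import numpy as np
from sklearn.decomposition import PCA

def project_both(old_vecs, new_vecs):
    """Fit an independent 2D projection per encoder for side-by-side plots.

    The two embedding spaces have different geometries (and possibly
    different dimensions), so each gets its own fit.
    """
    return (PCA(n_components=2).fit_transform(np.asarray(old_vecs)),
            PCA(n_components=2).fit_transform(np.asarray(new_vecs)))

# Usage: scatter each projection in its own subplot, colored by topic label,
# and check whether the new encoder separates classes the old one collapsed.
```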

How FutureAGI Handles Embedding-Projector Workflows

FutureAGI does not provide a built-in embedding projector — that work is well-served by TensorBoard, UMAP-in-notebook, or vendor tools. FutureAGI’s approach is to expose the data the projector needs in a form that lines up with the rest of your eval and trace stack. Each retrieval span captured through traceAI-langchain or traceAI-llamaindex includes the query embedding, the retrieved chunk embeddings, and the similarity scores. Each fi.datasets.Dataset row can carry a vector column or a reference to one. Both are exportable to NumPy, parquet, or TensorBoard’s projector format.
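
As one concrete target, TensorBoard's standalone projector expects two tab-separated files: one of raw vectors, one of row metadata. A sketch of that write, with illustrative column names rather than a fixed FutureAGI export schema:

```python
import csv
import numpy as np

def write_projector_files(vectors, metadata, vec_path="vectors.tsv",
                          meta_path="metadata.tsv"):
    """Write embeddings plus row metadata in TensorBoard projector TSV format."""
    np.savetxt(vec_path, np.asarray(vectors), delimiter="\t")
    with open(meta_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(metadata[0]), delimiter="\t")
        writer.writeheader()  # header row is required when there is >1 column
        writer.writerows(metadata)

vectors = np.random.default_rng(1).normal(size=(3, 4))
metadata = [
    {"query": "refund status", "cohort": "billing", "score": 0.91},
    {"query": "update card", "cohort": "billing", "score": 0.34},
    {"query": "cancel plan", "cohort": "billing", "score": 0.78},
]
write_projector_files(vectors, metadata)
```

Each metadata row lines up positionally with the vector on the same line, which is what lets the projector color points by any exported column.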

A real workflow: a RAG team sees ContextRelevance regressing on the “billing” cohort. They export the last 1,000 billing-cohort traces from FutureAGI as a CSV with query text, retrieved chunk text, embeddings, and the evaluator score. They load it into a Jupyter notebook, run UMAP, color each point by ContextRelevance score, and immediately see two distinct clusters of billing queries — one served correctly, one mapped near a marketing-content cluster. The fix is a content split, not an encoder change.
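
The notebook half of that workflow is short. A sketch, assuming the export has query, embedding (a JSON-encoded list), and context_relevance columns, and using PCA as a stand-in where the team ran UMAP:

```python
import json
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def project_traces(csv_path):
    """Load exported traces and project their embeddings to 2D.

    Assumed columns: query, embedding (JSON-encoded list), context_relevance.
    PCA stands in for UMAP here; swap in umap.UMAP(n_components=2) for the
    neighborhood-preserving layout described above.
    """
    df = pd.read_csv(csv_path)
    vectors = np.stack(df["embedding"].map(json.loads).tolist())
    coords = PCA(n_components=2).fit_transform(vectors)
    return df, coords

# Usage: color each point by its evaluator score so failing clusters stand out.
# df, coords = project_traces("billing_cohort_traces.csv")
# plt.scatter(coords[:, 0], coords[:, 1], c=df["context_relevance"])
```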

For comparing embedding models, the same data export is the input to two projector runs side-by-side. FutureAGI’s recommendation is to color points by evaluator score rather than topic label; this surfaces “embedding geometry where retrieval fails” rather than “embedding geometry that matches your prior assumptions.” We’ve found that switching the coloring axis from topic to score is the single change that turns projector exploration into a useful debugging tool.

How to Measure Embedding-Projector Results

Treat the projector as a diagnostic view over the same data you evaluate. The plot itself is not a release metric; the measurable part is the retrieval and embedding behavior behind the projection:

  • fi.datasets.Dataset export — write embedding vectors plus query, chunk, cohort, and score metadata to parquet, CSV, NPY, or TSV for TensorBoard or UMAP.
  • fi.evals.EmbeddingSimilarity — returns semantic similarity for query/context or answer/reference pairs; watch cohort medians and tail failures.
  • fi.evals.ContextRelevance — colors points by retrieval quality so a pretty cluster does not hide wrong-context retrieval.
  • traceAI-langchain retrieval spans — keep query text, retrieved documents, model version, similarity scores, and llm.token_count.prompt near the projected point.
  • Dashboard signals — alert on eval-fail-rate-by-cohort, top-k score compression, nearest-neighbor churn, p99 vector-search latency, and thumbs-down rate.
  • Projector validation — inspect outlier clusters, then confirm the suspected fix with a regression eval before changing the encoder or chunker.

A quick pairwise check with the similarity evaluator:

from fi.evals import EmbeddingSimilarity

# Score semantic similarity between a query and a candidate chunk
sim = EmbeddingSimilarity()
result = sim.evaluate(
    text_a="How do I update billing details?",
    text_b="Change payment method from billing settings.",
)
# The result carries a numeric score and a short natural-language reason
print(result.score, result.reason)

Common Mistakes

Most projector incidents come from treating a visual diagnostic as if it were a calibrated metric:

  • Projecting every vector at once. A 100K-point plot becomes a dense blob; sample by cohort, time window, and failure mode.
  • Trusting t-SNE distances. t-SNE preserves local neighborhoods, not global scale; prefer PCA, or a UMAP model fit once and reused, when comparing movement between releases.
  • Coloring by topic alone. The useful failures appear when points are colored by ContextRelevance, correctness, tenant, or escalation outcome.
  • Skipping the second-encoder comparison. Single-encoder projections show shape, not regression; always compare against the previous accepted encoder.
  • Using the projector as a release gate. It is exploratory evidence; ship decisions still need Dataset.add_evaluation thresholds and cohort-level pass rates.

Tie every screenshot to a dataset slice, evaluator score, and explicit follow-up experiment.
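
The first mistake above has a mechanical fix. A pandas sketch of per-cohort capping before projection (`sample_for_projection` and the column name are illustrative):

```python
import pandas as pd

def sample_for_projection(df, cohort_col="cohort", per_cohort=500, seed=7):
    """Cap each cohort's point count so no cohort turns the plot into a blob."""
    return (
        df.groupby(cohort_col, group_keys=False)
          .apply(lambda g: g.sample(min(len(g), per_cohort), random_state=seed))
    )
```

Capping per cohort (rather than sampling globally) keeps small cohorts visible, which is usually where the failures live.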

Frequently Asked Questions

What is an embedding projector?

An embedding projector is a visualization tool that maps high-dimensional embeddings — typically 768-dim or 1536-dim — into 2D or 3D using techniques like t-SNE, UMAP, or PCA. It lets engineers inspect cluster structure, neighbors, and outliers in an embedding space.

How is an embedding projector different from cosine-similarity scoring?

Cosine similarity is a numerical comparison between two vectors. An embedding projector is a qualitative visualization tool that surfaces structure in the entire space — clusters, gaps, isolates — that a single cosine number cannot show.
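
For concreteness, the numerical side is one line of linear algebra. A minimal NumPy sketch of cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 — identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 — orthogonal
```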

How do you use an embedding projector with FutureAGI data?

Export the embeddings used by your retrieval pipeline from FutureAGI Dataset or trace spans, load them into TensorBoard's projector or a UMAP notebook, and color points by evaluator score, cohort, or label to debug retrieval quality.