What Is K-Means?
K-means is an unsupervised clustering algorithm that partitions data into k groups by assigning each point to the nearest centroid and recomputing centroids until assignments stabilize. In AI reliability work, it is a model-analysis clustering method used to group embeddings, prompts, users, or traces into production cohorts. FutureAGI uses k-means-style cohort labels to slice evaluation results and trace patterns, then validates those groups with metrics such as EmbeddingSimilarity before teams act on a cluster.
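The assign-then-recompute loop in that definition is short enough to sketch directly. Below is a minimal NumPy version of Lloyd's algorithm, for illustration only; production pipelines should use `sklearn.cluster.KMeans`, which adds k-means++ initialization and multiple restarts:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: assign each point to the nearest
    centroid, move each centroid to the mean of its points, repeat."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids
```

On two tight, well-separated blobs this converges to one label per blob regardless of which distinct points seed the centroids.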
Why K-Means Matters in Production LLM and Agent Systems
LLM teams hit k-means quietly. A retrieval team builds an embedding index, then runs k-means over the corpus to produce coarse-grained topic centroids that speed up nearest-neighbor search. A platform team needs to slice eval-fail-rate-by-cohort and discovers there are 9 distinct prompt patterns in production — k-means over prompt embeddings produces those cohorts. A safety team profiling jailbreak attempts clusters them to find new attack families before any single instance trips an evaluator. None of this is “k-means as a model”; it’s k-means as analysis infrastructure.
The pain shows up when clusters are wrong. An ML engineer slices eval results by k=5 clusters and concludes the regression is in cluster 3 — but cluster 3 is actually two distinct user behaviors that k-means merged because it picked the wrong k. The team chases the wrong cohort for a week. SREs see latency anomalies clustering at one centroid and assume hardware drift, when the centroid is just an artifact of a bad initialization seed.
For 2026-era multi-step agent traces, the relevant use is trajectory clustering: cluster trajectories by step pattern, surface the modal behaviors, and run targeted evals on the outlier clusters. K-means is the cheapest first pass for that: fast, easy to recompute on every cohort refresh, and a sensible default before reaching for HDBSCAN or graph-based methods.
How FutureAGI Uses K-Means in Evaluation and Tracing
FutureAGI does not train k-means models; it consumes cluster labels as evaluation and observability context. In fi.datasets.Dataset, an engineer can attach a precomputed cluster_id to every row and call Dataset.add_evaluation to slice scores by cluster, surfacing the cohort where Faithfulness regresses. At trace level, traceAI spans from traceAI-openai, traceAI-langchain, and traceAI-llamaindex carry llm.input.messages and llm.output.text, which can be embedded and clustered with k-means in the offline pipeline; the resulting cluster id becomes a cohort axis in the dashboard.
Concretely: a RAG team running on traceAI-pinecone exports a week of retrieval traces, embeds the queries, runs k-means with k=8, and labels each trace with its cluster. Then they call Dataset.add_evaluation(ContextRelevance) and pivot the result by cluster. One cluster, say cluster 4, shows a ContextRelevance mean of 0.62 versus 0.84 elsewhere; that is the cohort to fix. FutureAGI's approach is to treat k-means as a hypothesis generator: it points to recurring failure cohorts, then ContextRelevance, Faithfulness, TaskCompletion, or fi.evals.EmbeddingSimilarity decides whether the cohort is actionable. Unlike HDBSCAN, k-means requires k upfront and prefers roughly spherical clusters, so the dashboard should show cluster stability and evaluator deltas, not just a scatter plot.
How to Measure K-Means Quality
K-means quality is measurable, but the cluster label is never the metric. Treat the label as a cohort key and validate whether the cohort is stable, internally coherent, and connected to a real production regression.
- Silhouette score: ranges from −1 to 1; under 0.3 is weak, over 0.5 is usually usable for coarse production cohorts.
- Within-cluster sum of squares (inertia): use the elbow method to pick k, then re-check that the chosen k produces interpretable slices.
- fi.evals.EmbeddingSimilarity: validate intra-cluster cohesion by sampling pairs and checking that similarity scores stay above the team's threshold inside a cluster.
- Per-cluster eval-fail-rate (dashboard signal): the practical reason to cluster; see which cohort fails on Faithfulness, ContextRelevance, or TaskCompletion.
- llm.input.messages and llm.output.text: traceAI fields that can be embedded and grouped when the clustering target is prompt or response behavior.
- Cluster stability across runs: re-cluster on next week's data and check that label assignments stay >85% stable; instability means k is wrong.
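The silhouette and inertia checks above can be sketched with scikit-learn. `pick_k` below is an illustrative helper, not a FutureAGI API, and it assumes `embeddings` is an `(n_rows, dim)` array produced by your embedding model:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(embeddings, k_range=range(2, 12), seed=42):
    """Score candidate k values: inertia for the elbow method,
    silhouette for cohesion/separation. Returns {k: {metric: value}}."""
    scores = {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
        scores[k] = {
            "inertia": km.inertia_,  # within-cluster sum of squares
            "silhouette": silhouette_score(embeddings, km.labels_),
        }
    return scores
```

Pick the k where the inertia elbow and a silhouette above roughly 0.5 agree; inertia alone always decreases as k grows, so it cannot choose k by itself.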
A minimal sketch of this workflow, assuming `embeddings` is an array of row embeddings and `dataset` is an `fi.datasets.Dataset`:

```python
from fi.evals import EmbeddingSimilarity
from sklearn.cluster import KMeans

# Pin the seed so cluster ids are reproducible across reruns.
labels = KMeans(n_clusters=8, n_init="auto", random_state=42).fit_predict(embeddings)

# Attach the cluster id to every row, then slice the eval by cohort.
dataset["cluster_id"] = labels
dataset.add_evaluation(EmbeddingSimilarity(), group_by="cluster_id")
```
Common Mistakes
- Picking k by inspection. Use silhouette score or the elbow method on inertia; intuition produces unstable cohorts.
- Running k-means on raw token counts. Sparse high-dimensional data clusters poorly; embed first, then cluster.
- Assuming clusters are stable across weeks. Production drift moves centroids; recompute on a defined cadence.
- Using k-means on non-spherical data. If the eval slice doesn’t separate, switch to HDBSCAN or density-based clustering instead.
- Forgetting to pin the random seed. Different inits produce different clusters; pin the seed for reproducibility.
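The stability and seed points above can be checked with the adjusted Rand index, which compares partitions rather than raw label values (k-means label ids are arbitrary across runs, so `labels_week1 == labels_week2` is the wrong test). `stability` is an illustrative helper:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(embeddings, k, seed_a=0, seed_b=1):
    """Adjusted Rand index between two k-means runs with different
    initialization seeds: 1.0 means identical partitions, ~0.0 means
    the clustering is no better than chance agreement."""
    la = KMeans(n_clusters=k, n_init=10, random_state=seed_a).fit_predict(embeddings)
    lb = KMeans(n_clusters=k, n_init=10, random_state=seed_b).fit_predict(embeddings)
    return adjusted_rand_score(la, lb)
```

The same comparison applied to this week's versus last week's labels (on the shared rows) gives the week-over-week stability signal; a low score suggests k or the embedding features have drifted.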
Frequently Asked Questions
What is k-means?
K-means is an unsupervised clustering algorithm that partitions a dataset into k groups by iteratively assigning each point to the nearest centroid and recomputing centroids until convergence.
How is k-means different from density-based clustering?
K-means assumes spherical clusters of similar size and requires k upfront. Density-based clustering (DBSCAN, HDBSCAN) finds arbitrary-shape clusters and discovers k from the data, but is slower on large datasets.
How do you measure k-means quality in an LLM pipeline?
Track silhouette score and within-cluster sum of squares (inertia) for the chosen k. In an LLM stack, validate clusters by running `EmbeddingSimilarity` within and across clusters and checking that downstream evals slice cleanly per cluster.