Models

What Are Clustering Algorithms?

Methods for grouping items by similarity without labels, including partitioning (k-means), hierarchical, density-based (DBSCAN), distribution-based (GMM), and message-passing (affinity propagation) families.

Clustering algorithms are unsupervised methods that group items by similarity without labels. The main families include partitioning methods such as k-means, hierarchical clustering, density-based methods such as DBSCAN or HDBSCAN, distribution-based Gaussian mixture models, and message-passing methods such as affinity propagation. In production LLM systems, that choice shapes retrieval cohorts, routing, prompt design, and trace triage. FutureAGI evaluates clustering algorithms by testing downstream task quality and cluster-level EmbeddingSimilarity, not by trusting silhouette score alone.

Why Clustering Algorithms Matter in Production LLM and Agent Systems

The same dataset clustered with k-means versus HDBSCAN can produce wildly different groupings. K-means assumes spherical clusters of similar size; HDBSCAN finds variable-density clusters and labels noise points as outliers. If your production trace embeddings have a few hot intents and a long tail of one-off queries, k-means crams the long tail into the nearest centroid and loses the signal; HDBSCAN labels them as noise and you can route them to a human reviewer.

The pain shows up across roles. An ML engineer runs k-means with k=10 on 50K trace embeddings, charts a clean silhouette score, and ships intent-routing — three months later a long-tail customer cohort gets the wrong prompt because k-means absorbed it into a larger centroid. A platform engineer dedups a corpus with hierarchical clustering at a single linkage threshold; the chosen threshold is wrong for one document type and that type gets over-deduplicated, hurting RAG recall. A product lead reviews “anomaly clusters” from a k-means run on telemetry and the cluster labeled “anomalous” is actually two distinct anomaly types lumped together because the algorithm cannot find non-spherical shapes.

In 2026 agent stacks, where clustering feeds routing, prompt design, and trace triage, the algorithm pick is part of the contract with downstream evaluation. Picking the wrong family is rarely caught by silhouette score alone.

How FutureAGI Evaluates Clustering Algorithm Choice

FutureAGI’s approach is to judge clustering algorithms by the reliability of the LLM workflow that consumes their clusters, not by the cluster labels alone. FutureAGI does not implement clustering inside your app; it grades retrieval, routing, and triage behavior after the clusters are written to traces or datasets. The pattern is the same regardless of which algorithm produced the clusters: write cluster_id as a span attribute on every trace, run downstream evaluators, segment by cluster_id, and rank algorithm choices by the resulting eval scores.
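The segment-and-rank pattern can be sketched in plain Python. Every trace ID, score, and cluster assignment below is hypothetical; in production the cluster_id would be written by your tracing SDK and the eval scores by the evaluators:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical traces, each already carrying a downstream eval score
traces = [
    {"trace_id": "t1", "eval_score": 0.91},
    {"trace_id": "t2", "eval_score": 0.40},
    {"trace_id": "t3", "eval_score": 0.88},
    {"trace_id": "t4", "eval_score": 0.35},
]

# Cluster assignments produced by two candidate algorithms
assignments = {
    "kmeans_k2": {"t1": 0, "t2": 0, "t3": 1, "t4": 1},
    "hdbscan": {"t1": 0, "t2": -1, "t3": 0, "t4": -1},  # -1 = noise
}

def rank_algorithms(traces, assignments):
    """Write cluster_id onto each trace, then rank algorithms by the
    mean eval score of their non-noise clusters."""
    ranking = {}
    for algo, mapping in assignments.items():
        by_cluster = defaultdict(list)
        for t in traces:
            cid = mapping[t["trace_id"]]
            t[f"cluster_id_{algo}"] = cid  # span-attribute analogue
            by_cluster[cid].append(t["eval_score"])
        scored = {c: mean(s) for c, s in by_cluster.items() if c != -1}
        ranking[algo] = mean(scored.values())
    return ranking

print(rank_algorithms(traces, assignments))
```

In this toy run the density-based assignment scores higher precisely because its noise label keeps low-scoring traces out of the clusters, mirroring the fallback behavior described below.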

Concretely: a RAG team is choosing between k-means (k=12), HDBSCAN (min_cluster_size=8), and affinity propagation for organizing their 8K production query embeddings. They build a 500-row golden Dataset with a labeled “expected retrieval set” per query. For each algorithm, they cluster the corpus and run Dataset.add_evaluation(Groundedness()) plus Dataset.add_evaluation(EmbeddingSimilarity()) on retrievals built from the algorithm’s output. The dashboard ranks algorithms by aggregate downstream score. HDBSCAN wins by 4 points on Groundedness and 2 points on long-tail recall: not because its silhouette is highest, but because its noise labeling forces the long-tail queries to a human-curated fallback rather than a wrong centroid. Unlike Ragas faithfulness, which scores the final RAG answer, this comparison isolates whether the clustering choice improves retrieval and fallback behavior.

For trace triage and anomaly detection, the same cluster_id span attribute lets AnswerRelevancy segment and rank by cluster, surfacing which algorithm-derived clusters host the worst LLM behavior. If clusters feed a gateway semantic-cache or routing policy, the engineer can threshold by cluster-level eval failures and send noisy cohorts to fallback review.
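The thresholding step is simple to sketch. The cluster names, scores, and gate value here are assumptions for illustration; the threshold would be tuned against your golden dataset:

```python
from collections import defaultdict

# Hypothetical per-trace relevancy scores, keyed by cluster_id span attribute
trace_scores = [
    ("billing", 0.92), ("billing", 0.88),
    ("refunds", 0.41), ("refunds", 0.37),
    ("noise", 0.55),
]

FAIL_THRESHOLD = 0.6  # assumed gate, not a recommended default

by_cluster = defaultdict(list)
for cluster_id, score in trace_scores:
    by_cluster[cluster_id].append(score)

# Clusters whose mean eval score fails the gate go to fallback review
fallback = sorted(
    cid for cid, scores in by_cluster.items()
    if sum(scores) / len(scores) < FAIL_THRESHOLD
)
print("route to fallback:", fallback)
```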

How to Measure Clustering Algorithms

Algorithm choice is graded by intrinsic metrics, downstream eval, and operational fit:

  • EmbeddingSimilarity (fi.evals): cluster-member cohesion check — high pairwise similarity within a cluster confirms the algorithm captured semantics.
  • Silhouette score (intrinsic): cohesion vs. separation; useful as a sanity check but not as a final gate.
  • Davies-Bouldin (intrinsic): lower is better; complements silhouette.
  • Downstream eval lift: AnswerRelevancy, TaskCompletion, or Groundedness segmented by cluster_id; the algorithm whose clusters produce the highest scores wins.
  • Noise-handling fit: density-based algorithms label outliers; if your downstream system can route outliers, HDBSCAN beats k-means on long-tail performance.

A quick cohesion check on two candidate cluster members looks like this:

from fi.evals import EmbeddingSimilarity

sim = EmbeddingSimilarity()
# Score whether members of a candidate cluster are semantically coherent
result = sim.evaluate(
    text_a="please cancel my subscription",
    text_b="how do I unsubscribe from this service",
)
print(result.score, result.reason)

Common Mistakes

  • Defaulting to k-means without justifying k. K-means is fast but assumes spherical equal-size clusters; many LLM-embedding distributions break that assumption.
  • Skipping noise handling. K-means forces every point into a cluster; HDBSCAN’s noise label is often the most useful output.
  • Tuning hyperparameters on silhouette alone. Silhouette can be high while downstream eval is low.
  • Comparing algorithms on different similarity metrics. Use a single metric (typically cosine for embeddings) when benchmarking algorithm choice.
  • Re-clustering on every batch without versioning. Production teams need stable cluster IDs for routing; version the clustering output with the embedding model.
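The last point can be made concrete with a deterministic version key derived from the embedding model and algorithm settings, so routing rules can pin a specific clustering run. The model and parameter names below are illustrative:

```python
import hashlib
import json

def clustering_version(embedding_model: str, algorithm: str, params: dict) -> str:
    """Stable version tag for a clustering run: same inputs, same tag."""
    payload = json.dumps(
        {"model": embedding_model, "algo": algorithm, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

v1 = clustering_version("text-embed-v3", "hdbscan", {"min_cluster_size": 8})
v2 = clustering_version("text-embed-v3", "hdbscan", {"min_cluster_size": 16})
print(v1, v2)  # different params produce a different version tag

# Namespace cluster IDs with the version so routing never mixes runs,
# e.g. f"{v1}:cluster_17"
```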

Frequently Asked Questions

What are clustering algorithms?

Clustering algorithms are unsupervised methods that group items by similarity without labels. Major families are partitioning (k-means), hierarchical, density-based (DBSCAN, HDBSCAN), distribution-based (Gaussian mixture models), and message-passing (affinity propagation).

Which clustering algorithm should I use for embeddings?

For dense LLM embeddings under 100K points, HDBSCAN handles noise and variable cluster size without requiring k upfront. For larger sets, mini-batch k-means scales linearly. For small sets where cluster count is genuinely unknown, affinity propagation works.

How does FutureAGI evaluate clustering algorithm choice?

FutureAGI scores the LLM application that consumes clustering output. Run AnswerRelevancy or TaskCompletion against a Dataset segmented by cluster_id, and the algorithm whose clusters produce the highest downstream score is the right pick; silhouette alone is insufficient.