What Is Affinity Propagation?

A clustering algorithm that uses message passing between data points to identify exemplars without requiring the number of clusters in advance.

Affinity propagation, introduced by Frey and Dueck in 2007, is a clustering algorithm that identifies exemplars (data points that best represent each cluster) through iterative message passing. Two messages flow between every pair of points: the responsibility r(i, k), sent from i to k, reflects how well-suited k is to serve as the exemplar for i; the availability a(i, k), sent from k to i, reflects how appropriate it would be for i to choose k as its exemplar, given the support k has accumulated from other points. Iterating these updates converges to a stable set of exemplars and assignments. Unlike k-means, you don't pick k; the algorithm determines the number of clusters itself. The cost is O(N²) memory and per-iteration time, which restricts it to small datasets.
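As a concrete sketch, scikit-learn ships an implementation (a minimal example on toy data, not part of FutureAGI):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Two tight, well-separated pairs of 2-D points; affinity propagation
# should discover the cluster count (here, 2) without being told.
X = np.array([[0.0, 0.0], [0.1, 0.0],
              [5.0, 5.0], [5.1, 5.0]])

ap = AffinityPropagation(damping=0.5, random_state=0).fit(X)
print(ap.cluster_centers_indices_)  # indices of the exemplar points
print(ap.labels_)                   # cluster assignment for each point
```

The damping factor (0.5 to just under 1.0) smooths the message updates to avoid oscillation; raise it if the algorithm fails to converge.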

Why It Matters in Production LLM and Agent Systems

Most LLM stacks need clustering somewhere — grouping similar production traces for triage, deduplicating queries before caching, organising embeddings for retrieval, surfacing user-intent clusters for prompt redesign. The choice of algorithm shapes what those groups look like.

K-means is the default but it forces you to pick k. Pick wrong and similar queries get scattered across clusters or lumped together. Affinity propagation removes that knob — useful when the underlying cluster count is genuinely unknown and the dataset is small enough (a few thousand points) for O(N²) to fit. Common LLM-stack uses: clustering 2K embeddings of failed traces to find recurring intents, grouping prompts in a small library by similarity for semantic-cache design, partitioning a curated golden dataset into homogeneous slices for cohort eval.
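The exemplar-selection pattern can be sketched with scikit-learn; the query list and stand-in embeddings below are illustrative, and in production the vectors would come from your embedding model:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

queries = ["How do I reset my password?",
           "My password isn't working",
           "Cancel my subscription",
           "Please stop billing me"]

# Stand-in embeddings (one row per query); replace with real model output.
emb = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Cluster on the cosine-similarity matrix and pull out exemplar texts,
# which can then serve as few-shot examples or routing anchors.
ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(emb @ emb.T)
exemplars = [queries[i] for i in ap.cluster_centers_indices_]
print(exemplars)
```

Passing affinity="precomputed" lets you cluster on any similarity you trust (cosine here), rather than the default negative squared Euclidean distance.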

The pain of getting clustering wrong shows up across roles. An ML engineer runs k-means with k=8 on production trace embeddings, declares “we have 8 intents”, and ships a router based on it — three months later half the traffic falls into a single bloated cluster because k=8 was too few. A product lead reviews the “intent clusters” and finds they are linguistically similar but semantically incoherent. A platform engineer caches by exact prompt match because clustering the prompts was too painful, and watches cache hit rate flatline at 4%.

In 2026 agent stacks where clustering feeds into routing, prompt design, and dataset curation, picking the right algorithm is more than a hyperparameter — it is a contract with downstream evaluation.

How FutureAGI Handles Clustering Outputs

FutureAGI does not implement affinity propagation. We sit downstream of clustering: when your pipeline groups embeddings, queries, or traces and feeds the result into an LLM application, FutureAGI evaluates whether that grouping helped or hurt.

Concretely: a team clusters their last 5K customer-support queries with affinity propagation, identifies 23 exemplar queries, and uses them as few-shot examples in their system prompt. They test two prompt versions, one with affinity-propagation exemplars and one with k-means (k=10) exemplars, through Prompt.commit() versions and run regression evals against a Dataset versioned at v6. FutureAGI's AnswerRelevancy and TaskCompletion evaluators score both prompt variants; the affinity-propagation prompt scores 3.2 points higher on the long-tail intent cohort. The team picks it, and the dashboard reveals where the gain came from: the 23rd exemplar covered a tail intent that k-means had absorbed into a larger cluster.

For embedding-based RAG retrieval, the EmbeddingSimilarity evaluator scores whether retrieved chunks are semantically close to the query, regardless of how those chunks were clustered upstream. If your clustering step is poisoning retrieval, the eval surfaces it.

How to Measure or Detect It

Clustering quality is measured by downstream task quality plus standard intrinsic metrics:

  • EmbeddingSimilarity: returns 0–1 cosine similarity between two embeddings; useful for measuring whether cluster members are semantically close.
  • Silhouette score (intrinsic metric): measures cluster cohesion vs separation; higher is better.
  • Davies-Bouldin index (intrinsic metric): lower is better; tracks intra-cluster dispersion vs inter-cluster distance.
  • Downstream eval lift: compare AnswerRelevancy or TaskCompletion between prompts/routes built on different clustering outputs.
  • Cluster-count stability (dashboard signal): the number of exemplars affinity propagation converges to across re-runs; large variance hints at unstable similarity matrix.
For example, a minimal check with the FutureAGI SDK:

from fi.evals import EmbeddingSimilarity

# Score how semantically close two queries are (0-1 cosine similarity).
sim = EmbeddingSimilarity()
result = sim.evaluate(
    text_a="How do I reset my password?",
    text_b="My password isn't working",
)
print(result.score, result.reason)
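The intrinsic metrics are one import away in scikit-learn (a minimal sketch on toy data):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score, davies_bouldin_score

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [4.0, 4.0], [4.2, 4.1], [4.1, 4.3]])

labels = AffinityPropagation(random_state=0).fit_predict(X)

# Silhouette: cohesion vs separation, range [-1, 1]; higher is better.
# Davies-Bouldin: dispersion vs inter-cluster distance; lower is better.
print(silhouette_score(X, labels))
print(davies_bouldin_score(X, labels))
```

Both metrics need at least two clusters, so guard the call if your pipeline can ever collapse to one.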

Common Mistakes

  • Running affinity propagation on a dataset too large for O(N²). It will not converge in reasonable time; use mini-batch k-means or HDBSCAN past 10K points.
  • Trusting cluster labels without inspecting exemplars. A “cluster of 200 points” labeled by an unrepresentative exemplar is a poor few-shot example.
  • Tuning the preference parameter blindly. Affinity propagation’s preference vector controls cluster count; sweep it and pick by downstream eval, not silhouette alone.
  • Mixing similarity metrics across the pipeline. If you cluster on cosine similarity but retrieve with dot product, the cluster structure does not generalise.
  • Treating clustering as a one-shot offline step. Production embeddings drift; re-cluster on a schedule and version the cluster output.
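The preference-tuning point above can be sketched as a sweep; silhouette stands in for the downstream eval here purely to keep the example self-contained:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score

X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0],
              [4.2, 4.1], [8.0, 0.0], [8.2, 0.1]])

# Lower (more negative) preference -> fewer exemplars. Sweep it and
# score each resulting clustering rather than trusting one setting.
results = {}
for pref in (-200.0, -50.0, -1.0):
    labels = AffinityPropagation(preference=pref, random_state=0).fit_predict(X)
    n = len(set(labels))
    if 1 < n < len(X):            # silhouette is defined for 2..N-1 clusters
        results[pref] = (n, silhouette_score(X, labels))

for pref, (n, sil) in sorted(results.items()):
    print(f"preference={pref}: {n} clusters, silhouette={sil:.2f}")
```

In a real pipeline, replace the silhouette scoring with your downstream eval (e.g. AnswerRelevancy lift per preference setting) before committing to a value.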

Frequently Asked Questions

What is affinity propagation?

Affinity propagation is a clustering algorithm that uses message passing between data points to converge on a set of exemplars, without needing the number of clusters to be specified in advance.

How is affinity propagation different from k-means?

K-means requires k upfront and is sensitive to initialisation. Affinity propagation discovers cluster count automatically through message passing but costs O(N²) per iteration, which limits its dataset size.

How does FutureAGI use clustering algorithms like affinity propagation?

FutureAGI does not run affinity propagation directly. We evaluate LLM outputs and embeddings; if your pipeline clusters embeddings as a preprocessing step, the EmbeddingSimilarity evaluator scores how well downstream LLM responses preserve cluster semantics.