
Affinity propagation

Affinity propagation is an exemplar-based clustering algorithm that identifies representative data points, or exemplars, by iteratively exchanging real-valued messages between pairs of data points until a high-quality set of clusters emerges. Introduced in 2007 by Brendan J. Frey and Delbert Dueck, the method takes as input a matrix of pairwise similarities—such as negative distances—and a preference parameter that influences the number of clusters by setting the baseline attractiveness of each point as a potential exemplar. Unlike traditional clustering techniques like k-means, which require specifying the number of clusters in advance and rely on random initializations, affinity propagation avoids these constraints by using a message-passing framework inspired by statistical physics to self-organize data into clusters without a predefined cluster count. The algorithm operates through two types of messages: responsibility messages, which indicate how well-suited a data point is as an exemplar for another point, and availability messages, which reflect how appropriate it would be for a point to choose a given candidate as its exemplar, given the support that candidate receives from other points. Responsibility r(i,k) for point i considering point k as exemplar is computed as the similarity s(i,k) minus the maximum of [similarity plus availability] over other potential exemplars, i.e., \max_{k' \neq k} [s(i,k') + a(i,k')], while availability a(i,k) aggregates positive evidence from other points, capped above at zero so that accumulated support cannot grow without bound. Updates incorporate a damping factor (typically 0.5) to stabilize convergence, and the process halts after a fixed number of iterations or when assignments stabilize. The output consists of exemplars and cluster assignments, determined by maximizing the sum of responsibility and availability for each point. Affinity propagation demonstrates superior performance in benchmarks, achieving lower quantization errors—such as 108 versus 119 on a dataset of 900 face images—compared to k-centers clustering, while completing tasks up to 100 times faster, as seen in gene expression analysis of 75,066 DNA segments (6 minutes versus 208 hours). It excels with non-Euclidean similarities and large datasets, making it suitable for applications including image clustering (e.g., faces), bioinformatics (e.g., gene detection in microarray data), natural language processing (e.g., sentence-based summarization), and network analysis (e.g., identifying central cities in airline travel routes among 456 cities). The algorithm's flexibility and efficiency have led to extensions for streaming data, mixed data types, and constrained clustering, though it can be computationally intensive for very large similarity matrices due to O(N²) complexity.

Overview

Description

Affinity propagation is an unsupervised clustering algorithm that identifies representative exemplars among data points by exchanging real-valued messages based on pairwise similarities between them, forming clusters around these exemplars to group similar points. Unlike traditional methods that rely on predefined parameters, it determines the number of clusters dynamically through this message-passing process. The algorithm initially treats every data point as a potential exemplar, evaluating each point's suitability to serve as a cluster center based on how well it represents nearby points in the similarity space. The number of resulting clusters is controlled by preference values assigned to each point, which reflect their a priori attractiveness as exemplars; higher preferences lead to more clusters, while lower ones promote fewer, more general groupings—for instance, setting all preferences to the median of the input similarities typically produces a moderate number of clusters. In contrast to centroid-based approaches like k-means, which require specifying the exact number of clusters and initial center positions that can lead to suboptimal results if poorly chosen, affinity propagation avoids these constraints by considering all points as candidates and using similarity measures that need not be metric or symmetric. The primary inputs are a similarity matrix quantifying affinities between pairs of points and the preference values for each point.

History

Affinity propagation was introduced in 2007 by Brendan J. Frey and Delbert Dueck through their seminal paper "Clustering by Passing Messages Between Data Points," published in Science. This work presented the algorithm as an exemplar-based clustering method that exchanges real-valued messages between data points to identify clusters without requiring the number of clusters to be specified in advance. The algorithm's development was motivated by belief-propagation techniques for inference in graphical models, aiming to improve upon traditional methods by enabling automatic determination of exemplars and clusters through iterative message passing. In the same year, Dueck and Frey extended its application to image categorization in their ICCV paper "Non-metric Affinity Propagation for Unsupervised Image Categorization," demonstrating its utility in identifying meaningful image categories from unlabeled data. Post-2007 developments included semi-supervised variants to incorporate prior knowledge, such as the approach by Wang et al. in 2012, which used constraint projections to integrate must-link and cannot-link constraints into affinity propagation for improved clustering on UCI datasets. In economic analysis, Almeida and Balanco applied the algorithm in 2020 to uncover temporal patterns in input-output multipliers across U.S. industries from 1997 to 2017, revealing structural shifts through cluster analysis of economic data. By the early 2020s, extensions integrated the method with deep learning for scalable clustering, notably in a 2020 AAAI paper by Huang et al. that combined message-passing mechanisms with neural networks to address error propagation in high-dimensional clustering tasks. Further advancements have focused on incremental and adaptive versions for evolving data streams, enhancing efficiency while preserving the core message-passing framework.

Algorithm

Message Passing Mechanism

Affinity propagation operates through a message-passing mechanism where data points exchange real-valued messages to determine exemplars, which serve as cluster centers. The core of this process involves two primary types of messages: responsibilities and availabilities. A responsibility message from data point i to candidate exemplar k, denoted r(i,k), expresses the accumulated evidence for how well-suited k is to serve as the exemplar for i relative to other candidate exemplars. In contrast, an availability message from candidate exemplar k to data point i, denoted a(i,k), reflects the accumulated evidence for how appropriate it would be for i to choose k as its exemplar, based on the support from other points that would choose k. The algorithm begins by initializing all availability values to zero, assuming no initial preferences for any point serving as an exemplar. It then enters an iterative loop where, in each iteration, responsibility messages are updated for all pairs of points based on the current availabilities and the input similarities between points; these updates quantify the relative suitability of each candidate exemplar. Next, availability messages are updated using the newly computed responsibilities, incorporating the self-responsibility of the candidate exemplar to account for its preference to be an exemplar itself. For each data point i, the net influence on its choice of exemplar is computed by summing the responsibility and availability for each candidate k; the point k that maximizes this sum becomes the exemplar for i, and points that are their own exemplars (i.e., k = i) are designated as the final exemplars. To stabilize the iterative process and prevent oscillations in message values, a damping factor \lambda (typically set to 0.5, with 0 < \lambda < 1) is applied during updates. This factor blends the previous message value with the newly computed value, using the rule: message = \lambda \times (previous message) + (1 - \lambda) \times (newly computed message), which ensures smoother convergence. The process continues until convergence is achieved, such as when exemplar assignments remain stable over a fixed number of iterations (e.g., 10). The message-passing loop can be outlined in pseudocode as follows:
Initialize a(i,k) = 0 for all i, k
While not converged:
    # Update responsibilities for all pairs (including self)
    For each i:
        For each k:  # includes k = i
            tmp_r(i,k) = s(i,k) - max_{k' ≠ k} [a(i,k') + s(i,k')]  # for k = i this yields the self-responsibility r(i,i)
    # Apply damping to responsibilities
    For each i, k:
        r(i,k) = λ * old_r(i,k) + (1-λ) * tmp_r(i,k)
    # Update availabilities
    For each k:
        For each i ≠ k:
            tmp_a(i,k) = min(0, r(k,k) + sum_{j ≠ i,k} max(0, r(j,k)))
        tmp_a(k,k) = sum_{j ≠ k} max(0, r(j,k))  # self-availability
    # Apply damping to availabilities
    For each i, k:
        a(i,k) = λ * old_a(i,k) + (1-λ) * tmp_a(i,k)
    # Assign exemplars using damped messages
    For each i:
        k* = argmax_k [a(i,k) + r(i,k)]
        If k* = i, mark i as exemplar
    Check for convergence (e.g., stable exemplars for 10 iterations)
Return exemplars and cluster assignments
This procedure leverages the input similarity measures to iteratively refine exemplar selections through decentralized communication among data points.
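
The same loop can be written concretely in a few lines of NumPy. The following is a minimal sketch, assuming only a precomputed similarity matrix S whose diagonal holds the preference values; the function name, default parameters, and convergence test are illustrative rather than canonical:

import numpy as np

def affinity_propagation(S, damping=0.5, max_iter=200, conv_iter=10):
    # S: (n, n) similarity matrix with preferences s(k,k) on the diagonal
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i,k)
    A = np.zeros((n, n))  # availabilities a(i,k), initialized to zero
    rows = np.arange(n)
    exemplars, stable = None, 0
    for _ in range(max_iter):
        # Responsibilities: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        top = AS.argmax(axis=1)
        first = AS[rows, top]
        AS[rows, top] = -np.inf
        second = AS.max(axis=1)
        max_excl = np.full((n, n), first[:, None])
        max_excl[rows, top] = second
        R = damping * R + (1 - damping) * (S - max_excl)
        # Availabilities: a(i,k) = min(0, r(k,k) + sum_{j not in {i,k}} max(0, r(j,k)))
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]           # keep the raw self-responsibility r(k,k)
        col = Rp.sum(axis=0)
        A_new = np.minimum(0, col[None, :] - Rp)
        A_new[rows, rows] = col - R[rows, rows]  # a(k,k) = sum_{j != k} max(0, r(j,k))
        A = damping * A + (1 - damping) * A_new
        # Exemplar choice: k*(i) = argmax_k [a(i,k) + r(i,k)]
        labels = (A + R).argmax(axis=1)
        if exemplars is not None and np.array_equal(labels, exemplars):
            stable += 1
            if stable >= conv_iter:
                break
        else:
            stable = 0
        exemplars = labels
    return exemplars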

Iterative Updates and Convergence

The affinity propagation algorithm operates through an iterative loop that alternates between updating all elements of the responsibility matrix, based on the current availability matrix, and then updating all elements of the availability matrix, based on the updated responsibilities. Following these matrix updates, the combined responsibility and availability values are used to monitor and update the current exemplar decisions for each data point. This process repeats, with messages exchanged between all pairs of data points, until the algorithm converges. Convergence is typically declared when the changes in the responsibility and availability messages fall below a predefined threshold, after a fixed maximum number of iterations (such as 200), or when the exemplar assignments remain unchanged for a specified number of consecutive iterations (often 10 to 15). These criteria ensure that the clustering stabilizes without excessive computation, as the message-passing mechanism naturally propagates evidence toward consistent exemplar selections over iterations. To prevent oscillations in message updates that could hinder convergence, especially in cases with similar data points or noisy similarities, a damping factor is applied during each iteration. This factor, commonly set to 0.5, computes new messages as a weighted average of their previous values and the undamped updates, promoting stability while allowing the algorithm to exhibit self-correcting behavior in practice. If oscillations persist, increasing the damping factor or adding small random noise to the similarities can further aid convergence without significantly altering the final results. Upon convergence, the algorithm assigns each data point to its preferred exemplar by selecting the candidate that maximizes the sum of the corresponding responsibility and availability values, thereby defining the cluster memberships. The resulting partition can then be evaluated for quality using the net similarity objective, which sums the similarities of all points to their assigned exemplars, providing a measure of overall clustering fitness.
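
As a concrete illustration of this quality measure, the following small helper works under the same assumptions as the earlier sketch (one exemplar index per point, preferences on the diagonal of S):

import numpy as np

def net_similarity(S, labels):
    # Sum of s(i, exemplar(i)) over all points; for an exemplar k the term
    # is s(k,k), i.e., its preference, matching the net similarity objective
    return S[np.arange(S.shape[0]), labels].sum()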

Mathematical Formulation

Similarity Measure

Affinity propagation requires a similarity matrix as input, where the entry s(i,k) quantifies how well-suited data point k is to serve as an exemplar for data point i. This reflects the appropriateness of k representing i in a cluster, with higher values indicating stronger suitability. For real-valued vector data, a common choice for the similarity is the negative squared Euclidean distance, defined as s(i,k) = -\| \mathbf{x}_i - \mathbf{x}_k \|^2, which ensures that closer points yield higher similarities. The diagonal elements s(k,k), known as the preference values, represent the a priori suitability of point k to be an exemplar; these are often set to the median or minimum of the off-diagonal similarities to influence the number of clusters, where higher preferences promote more exemplars and thus more clusters. In cases involving non-Euclidean data, alternative similarity functions are employed, such as the Gaussian kernel s(i,k) = \exp\left( -\frac{\| \mathbf{x}_i - \mathbf{x}_k \|^2}{2\sigma^2} \right), which captures local similarities effectively for applications like spatiotemporal point pattern analysis. For domain-specific tasks, custom metrics may be used, such as log-likelihood ratios under probabilistic models or measures based on encoding costs for sequences like gene exons. Preprocessing the input data is crucial to ensure robust similarity computation, particularly through normalization to scale features appropriately and prevent dominance by high-variance dimensions. For high-dimensional inputs, techniques like principal component analysis are often applied beforehand to mitigate the curse of dimensionality and improve clustering performance in areas such as text analysis.
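
A brief sketch of this construction for vector data, using negative squared Euclidean distances and the median-of-similarities default for the preferences; the helper name is illustrative:

import numpy as np

def similarity_matrix(X, preference=None):
    # s(i,k) = -||x_i - x_k||^2 for the rows of X
    S = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    if preference is None:
        # default: median of the off-diagonal similarities
        preference = np.median(S[~np.eye(len(S), dtype=bool)])
    np.fill_diagonal(S, preference)  # preferences s(k,k) on the diagonal
    return S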

Responsibility and Availability Matrices

In affinity propagation, the responsibility matrix \mathbf{R} and availability matrix \mathbf{A} encode the message-passing dynamics between data points, where each entry r(i,k) and a(i,k) quantifies the evidence for point k serving as the exemplar for point i. These matrices are interdependent, with responsibilities influencing availabilities and vice versa, enabling the emergence of clusters through iterative updates. The input to these computations is a similarity matrix \mathbf{S}, where s(i,k) measures the suitability of k as exemplar for i. The responsibility r(i,k) captures the accumulated evidence that point k is the exemplar for point i, relative to other candidate exemplars k' \neq k. It is computed as r(i,k) = s(i,k) - \max_{k' \neq k} \left[ a(i,k') + s(i,k') \right], where the subtracted maximum reflects the net attractiveness of k after accounting for competition from the most suitable alternative, weighted by its availability. The self-responsibility r(k,k) follows the same formula as the general case: r(k,k) = s(k,k) - \max_{k' \neq k} [a(k,k') + s(k,k')]. This formulation ensures that high-similarity points with low competition receive positive responsibilities, promoting them as potential exemplars. The availability a(i,k) represents the degree to which point i is available to choose point k as its exemplar, based on the support k receives from other points. For i \neq k, a(i,k) = \min \left\{ 0, \, r(k,k) + \sum_{i' \neq k, i} \max \left( 0, r(i',k) \right) \right\}, which aggregates positive responsibilities from other points toward k, including k's self-responsibility, but is capped above at zero so that strong accumulated support cannot grow without bound. The self-availability a(k,k) is simply the sum of all positive responsibilities directed to k from other points: a(k,k) = \sum_{i' \neq k} \max \left( 0, r(i',k) \right). These equations highlight the interdependency: availabilities depend on prior responsibilities, which in turn rely on prior availabilities, necessitating iterative refinement. To stabilize the updates and avoid oscillations in high-dimensional or noisy data, they incorporate damping with a factor \lambda (typically 0.5). The new responsibility is thus r_{\text{new}}(i,k) = \lambda \, r_{\text{old}}(i,k) + (1 - \lambda) \, r_{\text{computed}}(i,k), and similarly for a(i,k), where r_{\text{computed}} and a_{\text{computed}} are the undamped values from the equations above. This blends the previous message with the current computation, ensuring smoother convergence. Exemplars are identified by evaluating the net evidence r(i,k) + a(i,k) for each point i, with the exemplar for i being k = \arg\max_k [r(i,k) + a(i,k)]. Points i for which this maximum occurs at k = i (i.e., r(i,i) + a(i,i) \geq r(i,k) + a(i,k) for all k \neq i) are designated as exemplars, as they sufficiently outcompete alternatives based on self-preference and incoming support. Cluster assignments follow directly: all points sharing the same exemplar form a cluster. In practice, points with positive r(k,k) + a(k,k) are more likely to emerge as exemplars, reflecting strong self-evidence.
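
Putting the pieces together on toy data, reusing the hypothetical similarity_matrix and affinity_propagation helpers sketched earlier:

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),   # one tight blob near the origin
               rng.normal(5, 0.5, (20, 2))])  # a second blob near (5, 5)
S = similarity_matrix(X)
labels = affinity_propagation(S)
print(np.unique(labels))  # exemplar indices; typically one per blob here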

Applications

In Bioinformatics

Affinity propagation has found significant application in bioinformatics for transcript identification and gene co-expression clustering, particularly in analyzing microarray data. In an early demonstration, the algorithm clustered 75,066 putative exons from chromosome 1 using expression levels across 12 tissue samples, where similarities were defined by genomic proximity and coordinated transcription patterns; this approach identified clusters with a true-positive rate of 39% at a 3% false-positive rate, surpassing k-centers clustering's 17% true-positive rate. Such clustering leverages the message-passing mechanism to group co-expressed segments without requiring a predefined number of clusters, aiding in the discovery of functional gene modules from sparse, high-dimensional expression data. In protein-protein interaction (PPI) networks, affinity propagation identifies functional modules by treating proteins as nodes and deriving similarities from interaction strengths or association scores. An enhanced variant, Overlapped Affinity Propagation (OAP), applies the algorithm to a PPI network (4,928 proteins, 17,201 interactions) to detect overlapping protein complexes, yielding 232 modules, 93% of which were statistically significant (P < 0.001), with a precision of 50.4% and an F-measure of 47.2% against a reference set of 428 known complexes. This method outperforms baselines like Markov Clustering (MCL) on these measures by up to 200.7% while capturing biologically relevant modules involved in cellular processes, validated using established protein-complex and annotation databases. A notable case study involves clustering microbial genomes and metagenomic sequences for species identification, as extended post-2007. The tool BinSanity employs affinity propagation to bin metagenomic contigs based on coverage profiles across samples, effectively grouping sequences from the same microbial genome in complex communities; it demonstrated superior performance over tools like MetaBAT and CONCOCT, leaving fewer contigs unclustered in diverse and strain-mixed datasets. These applications highlight affinity propagation's advantages in handling noisy biological data—such as incomplete interaction maps or variable expression levels—without assuming cluster shapes or sizes, enabling robust discovery in irregular, real-world genomic datasets.

In Image and Text Processing

Affinity propagation has been applied in computer vision tasks, particularly for clustering images in large databases. In one early demonstration, it was used to cluster the Olivetti faces dataset, identifying exemplars that represent distinct facial expressions and identities with lower error rates than k-centers methods. This approach leverages pairwise similarities, such as negative distances in pixel space, to group similar faces without a predefined number of clusters. Additionally, affinity propagation facilitates exemplar selection for object detection, where it mines representative instances from large corpora to train detectors. For instance, in class exemplar mining, it selects diverse yet representative object samples from relational graphs of features, improving detection accuracy on benchmarks like Caltech-101 by focusing on high-quality exemplars. In text mining, affinity propagation supports document clustering and topic modeling by treating documents or sentences as data points with similarities based on metrics like cosine distance over TF-IDF vectors. The original formulation demonstrated its efficacy in selecting representative sentences from manuscripts, effectively summarizing text by identifying key exemplars that capture thematic content. A semi-supervised variant incorporates labeled constraints to guide clustering of text data, enhancing performance on partially annotated corpora by projecting similarities to respect must-link and cannot-link pairs. For example, it has been applied to group similar news articles, where exemplars serve as representative headlines for sets of related stories, or to high-dimensional word embeddings from models such as word2vec, revealing semantic groupings in large text corpora. Due to its O(n²) complexity from the full similarity matrix, affinity propagation faces scalability challenges on large image or text datasets exceeding thousands of points. To address this, approximations such as landmark-based methods or hierarchical local-global schemes reduce computations while preserving cluster quality, enabling applications to datasets with millions of high-dimensional features like image descriptors or embeddings. These variants maintain the message-passing mechanism's ability to handle irregular data structures in visual and textual domains.
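
As an illustrative sketch of this text-clustering setup, scikit-learn's AffinityPropagation can be run on a precomputed cosine-similarity matrix over TF-IDF vectors (the toy documents here are invented):

from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "a cat and a dog played",
        "stock markets fell sharply today", "markets rally on strong earnings"]
S = cosine_similarity(TfidfVectorizer().fit_transform(docs))
ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
for k in ap.cluster_centers_indices_:
    print("exemplar:", docs[k])  # exemplar documents act as mini-summaries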

Properties

Advantages

Affinity propagation offers a key advantage over traditional clustering algorithms like k-means by not requiring the number of clusters to be specified in advance. Instead, the algorithm uses a preference parameter to influence the number of exemplars, allowing the data's inherent structure to determine the optimal partitioning through iterative message passing. This flexibility makes it particularly suitable for exploratory data analysis where the cluster count is unknown or variable. The method excels in handling arbitrary cluster shapes and sizes due to its reliance on pairwise similarity measures rather than geometric assumptions, such as spherical clusters in k-means. By selecting exemplars directly from the data points, affinity propagation demonstrates robustness to outliers, as points with low similarity to others are less likely to become exemplars and can be assigned to nearby clusters without distorting the overall structure. This exemplar-based approach enables effective clustering in non-Euclidean or asymmetric similarity spaces, such as those encountered in bioinformatics or natural language processing. Empirically, affinity propagation has shown superior performance in benchmarks, achieving faster runtimes and lower error rates compared to methods like k-centers. For instance, in face image clustering, it produced solutions with a squared error of 108, outperforming the best of 100 k-centers runs (error of 119), and in gene expression analysis, it detected genes with a 39% true-positive rate at a 3% false-positive rate, compared to 17% for k-centers—all while requiring significantly less computation time, such as 6 minutes versus 208 hours for large datasets. These results highlight its efficiency for non-convex clusters and real-world applications. The interpretability of affinity propagation stems from its use of actual data points as exemplars, providing meaningful representatives that facilitate human analysis and validation of clusters, unlike abstract centroids generated by other algorithms. This feature enhances its utility in domains requiring explainable results, such as the analysis of sensory data.

Limitations

Affinity propagation exhibits a time and space complexity of O(n²) due to the requirement of computing and storing a full similarity matrix among all data points, which limits its practical application to datasets with approximately 10,000 points or fewer on standard hardware without specialized approximations or sparse similarity structures. The algorithm is highly sensitive to the choice of the preference parameter, which controls the number of exemplars and thus the clustering granularity; suboptimal tuning can result in excessive fragmentation (over-clustering) or insufficient separation (under-clustering), necessitating careful selection through methods like scanning preferences over the range of observed similarities or cross-validation. Compared to k-means, affinity propagation is generally slower for large-scale datasets owing to its iterative message-passing nature and quadratic scaling, making it less suitable for applications where k-means' near-linear per-iteration complexity provides better efficiency. Additionally, it tends to underperform on very high-dimensional data without dimensionality reduction preprocessing, as the curse of dimensionality degrades the reliability of similarity measures. In specific domains like protein interaction networks, Markov clustering often outperforms affinity propagation on dense graphs by demonstrating greater robustness to noise and better recovery of biologically relevant modules. Unlike probabilistic models such as Gaussian mixture models, affinity propagation lacks inherent uncertainty estimates or soft assignments, providing only hard deterministic clusterings without statistical guarantees on data generation assumptions.
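
The preference sensitivity noted above can be observed directly; a brief sketch with scikit-learn on synthetic blobs (the preference values are arbitrary illustrations):

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
for pref in (-500, -50, -5):  # more negative preferences yield fewer exemplars
    ap = AffinityPropagation(preference=pref, damping=0.9, random_state=0).fit(X)
    print(pref, len(ap.cluster_centers_indices_))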

Implementations

Open-Source Libraries

Affinity propagation has been implemented in several open-source libraries across popular programming languages, enabling researchers and practitioners to apply the algorithm to diverse datasets. These implementations typically adhere to the core message-passing mechanism while incorporating optimizations for efficiency and flexibility, such as support for custom similarity measures and handling of large-scale data. Most initial open-source versions emerged between 2008 and 2015, with subsequent updates focusing on performance improvements and compatibility with modern environments. In Python, the scikit-learn library provides the AffinityPropagation class within its clustering module, offering a robust implementation that supports custom similarity matrices through the affinity parameter (e.g., precomputed similarities or Euclidean distances) and an adjustable damping factor to control convergence. This implementation handles the responsibility and availability updates internally, making it suitable for medium-sized datasets, though its quadratic memory usage limits applicability for very large inputs. The class was introduced in early releases of the library and has seen efficiency enhancements in releases after version 1.0, including better random-state handling for reproducibility. For R users, the apcluster package delivers a comprehensive affinity propagation toolkit, including the core apcluster() function for standard clustering and extensions like leveraged affinity propagation (apclusterL()) for subsampling large datasets to reduce computational load. It supports parallel computation options through integration with R's parallel processing capabilities and Rcpp-based C++ backends for faster execution, particularly beneficial for rectangular similarity matrices or sparse data structures via the Matrix package. Initially released in 2011, the package has been updated regularly, with version 1.4.0 (2014) adding sparse matrix support and the latest version (1.4.14 as of September 2025) incorporating refinements for exemplar-based agglomerative clustering. Julia's Clustering.jl package includes affinity propagation via the affinityprop() function, integrating with the language's ecosystem (e.g., distance metrics from Distances.jl) to compute similarities such as negative Euclidean distances on the fly. This implementation emphasizes performance, leveraging Julia's just-in-time compilation for efficient operation on numerical arrays, and returns cluster assignments along with exemplar indices. First appearing in package versions around 2015, it has evolved with Julia's ecosystem, maintaining compatibility with recent releases (e.g., v1.10+ by 2025) for optimized handling of multidimensional data. In Java, the ELKI (Environment for Developing KDD-Applications Supported by Index-Structures) framework incorporates affinity propagation through the AffinityPropagation class in its clustering module, tailored for clustering tasks on large datasets with support for various distance functions and indexing structures to mitigate the algorithm's O(n²) cost. It processes relational data efficiently, outputting clusterings with exemplar points, and is extensible for custom similarity definitions. ELKI's implementation dates back to early releases around 2010-2012, with ongoing updates in versions up to 0.8.0 (2022) enhancing scalability for high-dimensional inputs common in research applications.
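
A minimal usage sketch of the scikit-learn class described above, on toy two-dimensional points:

import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
ap = AffinityPropagation(random_state=0).fit(X)
print(ap.labels_)                   # cluster label for each point
print(ap.cluster_centers_indices_)  # row indices of the chosen exemplars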

Integration in Frameworks

Affinity propagation is integrated into the scikit-learn library as part of its clustering module, where it is implemented as the AffinityPropagation class in sklearn.cluster. This allows seamless use within scikit-learn pipelines, such as combining it with preprocessing steps like principal component analysis (PCA) to handle high-dimensional data and improve computational efficiency by reducing the feature space before computing the similarity matrix. The algorithm's quadratic time and space complexity with respect to the number of samples makes such preprocessing essential for scalability on larger datasets. In MATLAB, affinity propagation is available through user-contributed implementations on the MATLAB File Exchange, including adaptive and semi-supervised variants that extend the original algorithm for specific use cases. These implementations, often based on the MATLAB code provided by the algorithm's originators, are commonly applied in bioinformatics, such as for clustering gene expression data to reveal temporal patterns in transcriptional programs. For instance, affinity propagation has been used to identify co-expression modules in time-series data by accounting for delays and phase shifts in expression profiles. Post-2020 developments have seen custom extensions of affinity propagation within frameworks like PyTorch and TensorFlow, particularly for clustering high-dimensional neural embeddings generated by models such as autoencoders or transformers. These integrations facilitate end-to-end workflows where embeddings from neural networks serve as input similarities for affinity propagation, enhancing applications in tasks like grouping learned representations. Examples include PyTorch-based implementations for speaker verification embeddings and adaptations for general clustering pipelines. To address the O(n²) complexity, practitioners often combine affinity propagation with dimensionality reduction techniques like PCA, which mitigate memory and runtime demands by lowering the effective input size, though they do not fully resolve scalability for very large n. Parallel versions are available through external packages for Apache Spark, such as SparkAffinityPropagation, enabling distributed computation of the message-passing updates across clusters for big-data scenarios. The preference parameter, which influences the number of exemplars, can be tuned in these frameworks to balance cluster quality and computational cost.
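
A sketch of the pipeline pattern described above, pairing PCA with AffinityPropagation in scikit-learn; the component count and damping value are illustrative choices:

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
pipe = Pipeline([
    ("scale", StandardScaler()),                    # standardize features
    ("pca", PCA(n_components=20, random_state=0)),  # shrink the feature space first
    ("ap", AffinityPropagation(damping=0.9, random_state=0)),
])
labels = pipe.fit_predict(X)  # exemplar-based cluster labels for the digits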