
k-means++

k-means++ is an algorithm for choosing the initial cluster centers (seeds) for the k-means clustering method, selecting them in a way that improves both the speed and the quality of the resulting clustering. Introduced in 2007, it addresses the limitations of traditional random initialization in k-means by choosing the first center uniformly at random from the data points and each subsequent center with probability proportional to its squared distance from the nearest already-chosen center, thereby spreading the initial centers out more effectively. This procedure ensures that the expected cost of the resulting clustering is at most O(\log k) times the optimal cost, a guarantee that is \Theta(\log k)-competitive. The k-means algorithm itself partitions a set of n points in \mathbb{R}^d into k clusters by iteratively assigning points to the nearest center and updating centers to the mean of their assigned points, aiming to minimize the sum of squared distances within clusters (known as the potential \phi). However, its performance depends heavily on the initial centers; poor choices can lead to suboptimal local minima and slow convergence. k-means++ mitigates this through its probabilistic initialization, which empirical tests show reduces the number of iterations needed for convergence while yielding clusterings closer to the optimum than uniform random seeding. The method is simple to implement, requiring one pass over the data for each of the k seeds, and has been extended to variants such as k-median clustering with similar logarithmic approximation factors. Widely adopted in machine learning libraries such as scikit-learn, k-means++ has become a standard enhancement for practical applications in data analysis, image segmentation, and pattern recognition.

Introduction

Overview

k-means++ is a seeding algorithm that enhances the k-means clustering method by providing a more robust initialization of cluster centroids, mitigating the sensitivity of k-means to random starting configurations that can lead to suboptimal local minima. The core mechanism involves selecting initial centroids probabilistically: the first centroid is chosen uniformly at random, and subsequent ones are picked with probability proportional to the squared distance from the nearest previously selected centroid, promoting a wider spread of starting points across the data. This initialization yields an expected approximation factor of O(\log k) relative to the optimal clustering cost, where k is the number of clusters, resulting in improved solution quality and faster convergence to high-quality clusters compared to standard random seeding. Developed by David Arthur and Sergei Vassilvitskii, k-means++ was introduced in their 2007 paper, "k-means++: The Advantages of Careful Seeding," and has since become a standard preprocessing step for implementations in various libraries.

Historical Development

The k-means++ algorithm was introduced in 2007 by David Arthur and Sergei Vassilvitskii in their paper "k-means++: The Advantages of Careful Seeding," presented at the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2007). The work addressed the sensitivity of the standard k-means algorithm to the choice of initial cluster centers, a problem that often leads to suboptimal clustering results, particularly on large-scale datasets. The initialization method builds directly on Lloyd's k-means algorithm, originally described in a 1957 technical report by Stuart Lloyd at Bell Laboratories and later republished in 1982, which had become a foundational technique for partitioning data but suffered from inconsistent performance due to random or arbitrary starting points. Following its publication, k-means++ gained rapid adoption in machine learning libraries and tools, notably as the default initialization strategy for the KMeans implementation in scikit-learn, enhancing its accessibility for practitioners working with moderate to large datasets. Extensions later emerged to handle large-scale and distributed environments, including variants that adapt the seeding process to distributed computing frameworks; for instance, the scalable k-means|| algorithm, proposed in 2012, performs efficient oversampling over a small number of rounds to approximate k-means++ performance while reducing the number of sequential passes, making it suitable for massive-scale clustering on distributed systems. By 2025, the original k-means++ paper had amassed over 13,000 citations, reflecting its influence on clustering research and practice, and it inspired numerous variants, including parallel adaptations like k-means|| that prioritize scalability without sacrificing approximation quality. This enduring impact underscores k-means++'s role in bridging theoretical improvements with real-world demands for robust, efficient initialization in clustering.

Fundamentals of k-means Clustering

Standard k-means Algorithm

The standard k-means algorithm, often referred to as Lloyd's algorithm, is an iterative partitioning method that assigns a given set of n points in d-dimensional space to k clusters by minimizing the within-cluster sum of squares (WCSS) objective function. This objective, denoted as J = \sum_{i=1}^k \sum_{\mathbf{x} \in C_i} \| \mathbf{x} - \boldsymbol{\mu}_i \|^2, measures the total squared distance between each point \mathbf{x} in cluster C_i and the cluster's centroid \boldsymbol{\mu}_i, which is the mean of the points in C_i. The algorithm seeks a local minimum of this function through alternating optimization steps, though it does not guarantee the global optimum because the objective is non-convex.

The process begins with an initial selection of k centroids, after which it proceeds in two main phases repeated until convergence: assignment and update. In the assignment step, each data point is assigned to the nearest centroid based on squared Euclidean distance, forming tentative clusters. The update step then recomputes each centroid as the mean of all points assigned to its cluster. These steps iterate until the centroids stabilize (i.e., the change in centroids falls below a small tolerance \varepsilon) or a maximum number of iterations is reached, at which point the assignments represent the final clustering. Convergence is monotonic in the objective function value, but the algorithm may terminate in a local minimum influenced by the starting centroids.

Common initialization strategies for the centroids include Forgy's method, which randomly selects k distinct data points from the dataset as the initial centroids, and the random partition method, which first randomly assigns each data point to one of the k clusters (with roughly equal sizes) and then sets the initial centroids as the means of these partitions. Both approaches are straightforward and widely used as defaults in implementations, though their randomness can lead to variability in outcomes across runs. The choice of initialization affects the final clustering quality, often necessitating multiple runs with different seeds to select the best result based on the objective value. The following Python sketch outlines the core iterative loop of the k-means algorithm, assuming initial centroids have been provided:
import numpy as np

def kmeans(X, k, initial_centroids, max_iter=300, tol=1e-4):
    """Lloyd's batch k-means, starting from the supplied initial centroids."""
    centroids = np.asarray(initial_centroids, dtype=float).copy()   # shape: (k, d)
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance from every point to every centroid.
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)

        # Update step: recompute each centroid as the mean of its assigned points.
        new_centroids = centroids.copy()
        for i in range(k):
            members = X[labels == i]
            if members.size == 0:
                # Handle an empty cluster, e.g. reseed it at the point that is
                # farthest from its current nearest centroid.
                new_centroids[i] = X[distances.min(axis=1).argmax()]
            else:
                new_centroids[i] = members.mean(axis=0)

        # Convergence check: stop once the centroids barely move.
        converged = np.linalg.norm(new_centroids - centroids) < tol
        centroids = new_centroids
        if converged:
            break
    # Final assignment, consistent with the returned centroids.
    labels = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return labels, centroids
This implementation follows the batch update style of Lloyd's original description, where all assignments are performed before any centroid updates.
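As a brief usage sketch of the function above (the synthetic three-blob dataset and the Forgy-style seeding are illustrative assumptions, not part of the original description):
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated 2-D Gaussian blobs, e.g. around (0, 0), (5, 0), and (2.5, 4).
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2))
               for loc in ([0, 0], [5, 0], [2.5, 4])])

# Forgy-style initialization: k distinct data points chosen uniformly at random.
init = X[rng.choice(len(X), size=3, replace=False)]
labels, centroids = kmeans(X, k=3, initial_centroids=init)
print(centroids)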

Limitations of Initialization

The standard k-means algorithm exhibits high sensitivity to the initial placement of centroids, particularly when using random initialization methods such as Forgy's or random partition, which can result in unbalanced cluster assignments, the formation of empty clusters, slower convergence, and entrapment in suboptimal local minima. This sensitivity arises because random selection has no mechanism to ensure the initial centroids are well spread across the data space, often leading to biased partitioning that increases the number of iterations required for convergence or yields clustering results far from the global optimum. An illustrative example occurs in a synthetic two-dimensional dataset comprising three well-separated Gaussian distributions, each representing a distinct cluster of points forming a roughly circular blob. If random initialization selects all initial centroids within the densest group, say the leftmost one, the algorithm may assign nearly all data points to that single cluster, rendering the other two clusters empty and producing a highly degenerate solution. In a simple visualization, this appears as three isolated groups of points (e.g., one at coordinates around (0,0), another at (5,0), and a third at (2.5,4)), with all starting centroids clustered tightly in the first group, highlighting how poor initialization fails to capture the underlying structure. Empirical studies confirm these limitations, demonstrating that random initialization leads to significant variability in the final clustering cost across multiple runs on the same dataset, often necessitating multiple restarts to obtain reliable results. For instance, comparisons across diverse synthetic and real-world datasets show that non-deterministic methods like Forgy and random partition exhibit high variance in the sum-of-squared-error (SSE) objective, with poor initializations frequently yielding significantly worse solutions than more careful seeding approaches. To address these drawbacks, researchers have proposed alternative initialization strategies, including k-means++ and the Bradley-Fayyad refinement method, which prioritize selecting diverse and representative initial centroids to improve consistency and quality.
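A minimal sketch of this run-to-run variability, built on the three-blob configuration above with scikit-learn (the dataset, cluster spread, and number of runs are illustrative assumptions):
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three well-separated Gaussian blobs, as in the example above.
X = np.vstack([rng.normal(loc, 0.4, size=(200, 2))
               for loc in ([0, 0], [5, 0], [2.5, 4])])

def sse_spread(init_method, runs=20):
    # n_init=1 so that each run reflects a single initialization.
    costs = [KMeans(n_clusters=3, init=init_method, n_init=1,
                    random_state=r).fit(X).inertia_ for r in range(runs)]
    return min(costs), max(costs)

print("random    (min, max SSE):", sse_spread("random"))
print("k-means++ (min, max SSE):", sse_spread("k-means++"))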

The k-means++ Algorithm

Initialization Procedure

The k-means++ initialization procedure provides a simple randomized method for selecting initial cluster centroids, aiming to improve the quality of the subsequent clustering by distributing the starting points more evenly across the dataset. The approach begins by selecting the first centroid uniformly at random from the set of data points, ensuring an unbiased starting position that does not favor any particular region of the data space. For the remaining k-1 centroids, the algorithm iteratively computes, for each data point x, the distance D(x) to the nearest previously chosen centroid. A new centroid is then selected with probability proportional to D(x)^2, which biases the choice toward points that are farther from the existing centroids. This step is repeated until all k centroids are chosen, promoting a spread-out initialization that avoids concentrating multiple centroids in high-density areas. The procedure can be expressed in pseudocode as follows:
1. Select an initial centroid c_1 uniformly at random from the data points X.
2. For i = 2 to k:
   a. For each data point x in X, compute D(x) = min_{j=1 to i-1} ||x - c_j|| (the Euclidean distance to the nearest existing centroid).
   b. Select the i-th centroid c_i from X, choosing point x with probability D(x)^2 / \sum_{x' \in X} D(x')^2.
3. Return the set of k centroids {c_1, ..., c_k}.
This probabilistic selection leverages the D^2 weighting to favor outlying points or those in underrepresented regions, thereby enhancing the initial coverage of the data space and reducing the risk of poor local optima in the full k-means process. Due to its randomized nature, the method inherently avoids deterministic ties in centroid selection, as the probability distribution gives every eligible point a non-zero chance of selection; exact ties in distances are resolved arbitrarily if they occur. Empty clusters are unlikely at initialization and are handled by the standard k-means iterations that follow seeding.
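The steps above translate directly into a short routine. The following NumPy sketch (the function name kmeans_pp_init is an illustrative choice) maintains a running array of squared distances to the nearest chosen centroid, so each new selection requires only one pass over the data:
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++ seeding: D^2-weighted sampling of k initial centroids."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    centroids = np.empty((k, d))
    # Step 1: choose the first centroid uniformly at random.
    centroids[0] = X[rng.integers(n)]
    # d2[j] holds the squared distance from X[j] to its nearest chosen centroid.
    d2 = ((X - centroids[0]) ** 2).sum(axis=1)
    for i in range(1, k):
        # Step 2b: sample the next centroid with probability proportional to D(x)^2.
        probs = d2 / d2.sum()
        centroids[i] = X[rng.choice(n, p=probs)]
        # Keep the running minimum squared distances up to date.
        d2 = np.minimum(d2, ((X - centroids[i]) ** 2).sum(axis=1))
    return centroids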

Integration with k-means

The k-means++ algorithm integrates seamlessly into the standard k-means pipeline by replacing the traditional random selection of initial centroids with its probability-based seeding procedure. Once the initial k centroids are chosen (a uniform random selection for the first, with each subsequent one chosen with probability proportional to the squared distance from the nearest existing centroid), the process continues with conventional Lloyd's iterations: assigning each data point to the nearest centroid and recomputing centroids as the means of their assigned points until convergence or a maximum number of iterations is reached. The complete workflow for k-means with k-means++ initialization encompasses several stages: first, preprocessing to determine the number of clusters k, often via methods like the elbow criterion; second, the k-means++ initialization phase to select well-spaced initial centroids; third, the iterative clustering loop involving point-to-centroid assignment and centroid updates; and finally, outputting the cluster assignments and centroids. This structure ensures that the enhanced initialization feeds directly into the core optimization loop without altering its mechanics. In practice, it is advisable to run k-means with k-means++ initialization several times using different random seeds and keep the lowest-cost result to mitigate residual variability, though this is far less essential than with purely random initialization because the seeding is inherently more stable. Empirically, incorporating k-means++ leads to faster convergence, typically requiring substantially fewer iterations than random initialization, and yields cost improvements of approximately 2-5 times on benchmark datasets such as the UCI KDD Cup 1999 intrusion detection and Spambase datasets.
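Combining the seeding and Lloyd's-iteration sketches above, a complete pipeline with multiple restarts might look as follows (kmeans_pp_init and kmeans refer to the earlier sketches; the restart count n_init is an illustrative choice):
import numpy as np

def kmeans_plus_plus(X, k, n_init=3, seed=0):
    """Sketch of the full pipeline: k-means++ seeding followed by Lloyd's
    iterations, repeated n_init times and keeping the lowest-cost solution."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        init = kmeans_pp_init(X, k, rng)
        labels, centroids = kmeans(X, k, init)
        cost = ((X - centroids[labels]) ** 2).sum()   # within-cluster sum of squares
        if best is None or cost < best[0]:
            best = (cost, labels, centroids)
    return best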

Theoretical Foundations

Probabilistic Analysis

The probabilistic analysis of k-means++ centers on the initialization procedure, which selects centers in a manner that probabilistically favors points distant from already chosen centers, leading to an expected clustering cost that scales favorably with the optimal cost. In this model, the first center is chosen uniformly at random from the dataset X. Subsequent centers are selected with probability P(x) = \frac{D(x)^2}{\sum_{y \in X} D(y)^2}, where D(x) denotes the distance from point x to the nearest previously selected center. This D^2-sampling mechanism ensures that the procedure spreads out the initial centers, avoiding concentrations in dense regions of the data.

The expected cost of the initialization, defined through the potential \phi(C) = \sum_{x \in X} \min_{c \in C} \|x - c\|^2 for a set of centers C, is analyzed through a series of lemmas building toward a logarithmic bound. For the first center chosen uniformly, the expected potential satisfies E[\phi(A_1)] \leq 2 \phi_{OPT}, where \phi_{OPT} is the cost of an optimal k-clustering and A_1 is the set containing the first center; this follows from the fact that a uniformly chosen point incurs, in expectation, at most twice the optimal cost of its cluster. For each subsequent center i = 2, \dots, k, the D^2-sampling yields an expected increase bounded by E[\phi(A_i) \mid A_{i-1}] \leq \phi(A_{i-1}) + 8 \phi_{OPT}, derived by showing that the probability of selecting a point near an existing center is low and that the contribution from distant points aligns with the optimal distribution. Summing these bounds inductively using the harmonic series H_{k-1} \leq \ln(k-1) + 1 (where H_t tracks the growth in expected cost at step t) yields the overall bound E[\phi(A_k)] \leq 8 (\ln k + 2) \phi_{OPT}. A later analysis improves this constant to 5(\ln k + 2). This demonstrates that k-means++ initialization achieves an expected cost within an O(\log k) factor of the optimum, which is asymptotically tight up to constants given the \Omega(\log k) lower bound.

Compared to uniform random initialization, the D^2-sampling in k-means++ reduces variance in the starting configuration by probabilistically prioritizing underrepresented regions, thereby decreasing the likelihood of degenerate clusterings in which multiple centers coincide or crowd into low-variance areas. This spreading effect mitigates the high variance inherent in uniform sampling, where poor seeds can lead to costs far exceeding the optimum, as evidenced by empirical variance comparisons in the original analysis. The analysis assumes data points lie in \mathbb{R}^d, with the cost measured via squared Euclidean distances, and implicitly presumes no significant outliers that could skew the D^2 probabilities. These assumptions limit applicability to non-Euclidean metrics or outlier-heavy datasets, where the probabilistic guarantees may not hold without modifications.
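For reference, the seeding rule and the headline guarantee of the original analysis, restating the formulas above, can be written compactly in display form:
\[
P(x) \;=\; \frac{D(x)^2}{\sum_{y \in X} D(y)^2},
\qquad
\mathbb{E}\!\left[\phi(A_k)\right] \;\le\; 8\left(\ln k + 2\right)\phi_{OPT}.
\]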

Approximation Guarantees

The problem of finding an exactly optimal k-means clustering is NP-hard, even when the number of clusters k is fixed at 2, underscoring the necessity of approximation algorithms for practical use. The k-means++ algorithm addresses this by providing a seeded initialization that, when combined with the standard Lloyd's iteration (the core of the k-means procedure), yields a clustering with expected cost at most 5(\ln k + 2) times the optimal cost \mathrm{OPT}, where \mathrm{OPT} denotes the minimum possible k-means objective value over all clusterings. This bound simplifies asymptotically to O(\log k) \cdot \mathrm{OPT}, marking a significant theoretical improvement over uniform random initialization, which only guarantees an expected cost of O(k) \cdot \mathrm{OPT}. Extensions of k-means++ have further refined these guarantees in specialized settings. For instance, in streaming and distributed models, a scalable variant selects O(k \log k) centers over a small number of passes and achieves a constant-factor approximation to \mathrm{OPT} in expectation. Similarly, for weighted k-means variants, where points have associated weights reflecting importance or frequency, reductions to the unweighted case preserve the O(\log k)-approximation guarantee of the original algorithm. Despite these advances, the theoretical bounds for k-means++ remain loose, particularly in high-dimensional spaces where the curse of dimensionality amplifies distances and weakens relative guarantees. In practice, empirical performance frequently surpasses these theoretical limits, often achieving costs much closer to \mathrm{OPT} on real-world datasets.

Practical Considerations

Computational Complexity

The k-means++ algorithm introduces an initialization procedure that precedes the standard Lloyd's iterations of k-means, so the overall running time is determined by both phases. A naive implementation of the initialization step recomputes, at each of the k selections, the squared distance from each of the n data points to the closest existing centroid, yielding a time complexity of O(n k^2). This arises because the cost accumulates as O(n) for the first centroid (uniform sampling), O(n) for distances to one centroid in the second step, and up to O(n k) in the k-th step, summing to a quadratic dependence on k. Integrating this with the subsequent k-means iterations, which naively cost O(n k) per iteration for assigning points to the nearest centroid and updating centroids, leads to a total runtime of O(n k^2 + I n k), where I is the number of iterations; in practice, k-means++ typically converges in fewer iterations, reducing effective runtime by 50-88% across benchmarks.

Optimizations significantly mitigate these costs. Elkan's algorithm, applicable to the iterative phase, exploits the triangle inequality to maintain upper and lower bounds on distances, avoiding up to 99% of distance computations in practice; its per-iteration cost remains O(n k) in the worst case, and with t processing units the work can be partitioned to roughly O(n k / t) per unit. For the initialization, approximations via sampling reduce the quadratic term to O(s k^2 + n \log k), where s is a small sample size, enabling seeding on massive datasets. Distance computations remain the primary bottleneck in both phases, as they constitute the bulk of operations in high-dimensional spaces.

Space complexity for k-means++ is O(n k + n d + k d), where d is the data dimensionality, primarily due to storing points, centroids, and, under optimizations like Elkan's, an O(n k) matrix of distance bounds; this can be mitigated for disk-resident data via streaming, though full in-memory storage suits moderate scales. For large n, sampling during initialization further scales the space requirement to O((n + s) k). The algorithm is also readily parallelizable, which addresses scalability challenges: distance calculations and centroid updates can be distributed via cluster-computing frameworks, achieving near-linear speedup by partitioning points across nodes and aggregating partial sums for the updates. As of 2025, GPU accelerations in libraries like cuML leverage CUDA for parallel distance computations and matrix operations, delivering 10-50x speedups over CPU implementations for datasets with millions of points, particularly in high dimensions.

Parameter Selection

The primary hyperparameter in k-means++ is the number of clusters k, which must be specified in advance and significantly influences the quality of the resulting partitioning. Selecting an appropriate k requires evaluating clustering validity across a range of candidate values, often using internal metrics that assess compactness and separation without ground-truth labels.

One common heuristic for choosing k is the elbow method, which involves running k-means++ for increasing values of k (typically from 1 to a reasonable upper bound, such as \sqrt{n} where n is the number of data points), computing the within-cluster sum of squares (WCSS) for each, and plotting WCSS against k. The optimal k is selected at the "elbow" point where the rate of decrease in WCSS begins to diminish significantly, indicating that additional clusters yield only marginal improvements in compactness. This method provides an intuitive visual aid but can be subjective in identifying the elbow, particularly for noisy or high-dimensional data.

Another approach is the silhouette score, which quantifies how well each data point fits within its assigned cluster relative to neighboring clusters. For a given k, the average silhouette score is computed across all points, ranging from -1 (poor clustering) to 1 (well-separated clusters), and the k maximizing this average is chosen as optimal. In practice with k-means++, this evaluation often requires fewer repeated runs than with standard k-means, as the robust initialization reduces variability in outcomes across trials.

The gap statistic offers a more statistically grounded alternative by comparing the log of WCSS for the observed data under k-means++ to that expected under a null reference distribution (typically uniform on the data's bounding box or principal components). For each k, the gap value is the difference between these logs, standardized by simulation-based variability; the optimal k is the smallest value whose gap exceeds the gap at k+1 minus its standard error. This method accounts for the data's intrinsic structure and is particularly useful when clusters are unevenly sized.

In cases where domain expertise is available, k can be informed directly by prior knowledge of the data's underlying groups, bypassing purely data-driven methods. For automatic selection, extensions like X-means may be employed; X-means uses standard k-means initialization but can be adapted to incorporate k-means++, testing splits at potential cluster boundaries with the Bayesian Information Criterion (BIC) to decide whether to increase k.
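A brief sketch of the elbow and silhouette evaluations using scikit-learn (the candidate range, n_init, and dataset X are illustrative assumptions):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def evaluate_k(X, k_values):
    """Return WCSS (for an elbow plot) and mean silhouette score for each candidate k."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, init="k-means++", n_init=10,
                       random_state=0).fit(X)
        sil = silhouette_score(X, model.labels_) if k > 1 else float("nan")
        results[k] = (model.inertia_, sil)
    return results

# Example: scan k = 2..10 and pick the k with the highest average silhouette score.
# scores = evaluate_k(X, range(2, 11))
# best_k = max(scores, key=lambda k: scores[k][1])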

Applications and Implementations

Real-World Applications

k-means++ finds extensive application in color quantization for image compression, where it clusters pixel colors to generate reduced color palettes, enabling efficient storage and transmission while minimizing visual distortion. By selecting initial centroids that are spread out in the color feature space, k-means++ achieves better quantization than random initialization, particularly for high-resolution images processed in image-editing software akin to Photoshop. For instance, in PDE-based compression schemes, k-means++ has been used to cluster pixel values for palette construction as part of the quantisation stage.

In the realm of e-commerce, k-means++ enhances customer segmentation by partitioning customers into homogeneous groups based on purchasing patterns, demographics, and browsing behavior, thereby supporting targeted marketing and personalized recommendations of the kind used by large online retailers. This initialization method improves cluster quality and convergence speed on large customer datasets, allowing businesses to identify high-value segments more reliably. Studies on e-commerce platforms demonstrate that k-means++ outperforms standard k-means in segmentation accuracy, facilitating better targeting for promotions and inventory management.

Within bioinformatics, k-means++ is employed for clustering gene expression profiles to delineate cancer subtypes, aiding in precise diagnosis and treatment planning. The algorithm's probabilistic seeding helps mitigate sensitivity to initial conditions in high-dimensional data, leading to more stable identification of molecular subgroups associated with disease. For example, in multi-modal analyses of tumor samples, k-means++ has been used to integrate gene expression with other molecular features and reveal an event-free patient subgroup, enhancing prognostic models.

A prominent case involves Google's deployment of k-means++ variants in document clustering for search indexing, enabling efficient organization of massive web corpora to boost retrieval relevance and speed. By leveraging careful initialization, the approach scales to billions of documents, supporting features like query-result grouping and related enhancements. Google researchers have also extended k-means++ with privacy-preserving (differentially private) mechanisms for such large-scale applications, underscoring its practical impact in production search systems.

Software Libraries

The k-means++ initialization method is implemented in the popular scikit-learn library through its KMeans class, where it serves as the default initialization strategy via the init='k-means++' parameter. This implementation uses a "greedy" variant of k-means++ that performs multiple trials at each sampling step and keeps the best candidate, improving speed and quality. Users can control the number of full initialization runs with the n_init parameter to mitigate sensitivity to random starting points; a basic usage example involves importing the class as from sklearn.cluster import KMeans and fitting the model to data, as sketched below.

Apache Spark's MLlib provides a distributed implementation of k-means with k-means||, a parallelized variant of k-means++ designed for large-scale clustering across clusters of machines. This approach can handle massive datasets, such as those with billions of points, by distributing computations over multiple nodes from Scala, Java, or Python. The KMeans estimator in MLlib supports scalable training through parameters such as k for the number of clusters and initMode='k-means||' for the initialization scheme, making it suitable for production pipelines.

The Weka toolkit, implemented in Java, includes k-means++ as an initialization option in its SimpleKMeans clusterer, selectable via the -init 1 command-line flag or through the graphical user interface. SimpleKMeans supports both Euclidean and Manhattan distance metrics and is particularly useful for exploratory analysis with visual tools, allowing users to load datasets, configure clusters, and visualize assignments without extensive coding.

As of 2025, integrations with deep-learning frameworks such as TensorFlow facilitate embedding k-means++-style clustering into larger pipelines, for example for feature extraction or post-processing, often via custom layers or compatible extensions. Additionally, NVIDIA's cuML library offers GPU-accelerated k-means++ implementations compatible with the scikit-learn API, providing significant speedups for large datasets through CUDA-enabled computations.
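A minimal scikit-learn usage sketch along these lines (the synthetic data and parameter values are placeholders):
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 2))        # placeholder data
model = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
labels = model.fit_predict(X)                              # cluster label for each point
print(model.cluster_centers_)                              # final centroids
print(model.inertia_)                                      # within-cluster sum of squares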
Library | Primary Language | Scalability | Unique Features
scikit-learn | Python | Single-node, up to millions of points | Multiple initializations (n_init); integration with the scientific Python stack
Apache Spark MLlib | Scala/Java/Python | Distributed, billions of points across clusters | Parallel k-means|| initialization
Weka | Java | Single-node, moderate datasets | GUI for visual analysis; multiple distance metrics
RAPIDS cuML | Python (CUDA) | GPU-accelerated, large datasets | Drop-in scikit-learn API compatibility; roughly 10-50x speedups on GPUs
