Cluster analysis
Cluster analysis is an unsupervised data analysis technique that partitions a set of objects into groups, or clusters, such that objects within the same cluster are more similar to each other than to those in other clusters, based on predefined measures of similarity or dissimilarity.[1] This method aims to discover inherent structures or patterns in data without relying on predefined labels or categories, making it a fundamental tool for exploratory analysis in fields like statistics and machine learning.[2] The process typically involves selecting appropriate distance metrics, such as the Euclidean distance for numerical data, to quantify similarities and iteratively forming clusters that maximize intra-cluster cohesion and inter-cluster separation.[3]

Key characteristics of cluster analysis include its nonparametric nature, which allows it to handle diverse data types (numerical, categorical, or mixed) without assuming underlying distributions, and its focus on either hard partitioning (where each object belongs to exactly one cluster) or soft partitioning (allowing probabilistic memberships).[1] Common algorithms fall into several categories: partitional methods like k-means, which divide data into a fixed number of non-overlapping subsets by minimizing variance within clusters; hierarchical methods, such as agglomerative clustering that builds a tree-like structure by successively merging similar clusters; and density-based methods like DBSCAN, which identify clusters as dense regions separated by sparse areas.[2] Evaluation often relies on internal criteria, such as silhouette scores measuring cluster compactness and separation, or external validation when ground truth labels are available.[1]

Originating in the biological sciences for taxonomic classification in the early 20th century, cluster analysis has evolved significantly with advancements in computational power and data mining, gaining prominence through seminal works in the 1970s and 1980s that formalized algorithms for broader applications.[1] Today, it is widely applied in diverse domains, including genomics for gene expression grouping, marketing for customer segmentation, image processing for object recognition, and climate science for pattern detection in environmental data.[3] Despite its utility, challenges persist, such as sensitivity to outliers, the need to determine the optimal number of clusters, and scalability for large datasets, driving ongoing research into robust and efficient variants.[2]
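The overview above can be made concrete with a short sketch. The following Python example, assuming scikit-learn is available, runs one representative of each algorithm family on synthetic data and evaluates the resulting partitions with the silhouette score; the dataset and every parameter value (three centers, eps, min_samples, and so on) are illustrative assumptions rather than recommended settings.

```python
# A minimal sketch, assuming Python with scikit-learn, of the algorithm families
# and the silhouette-based internal evaluation described above. All parameter
# values are illustrative assumptions, not prescribed choices.
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic numerical data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Partitional: k-means divides the data into a fixed number of non-overlapping
# subsets by minimizing within-cluster variance.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: agglomerative clustering successively merges the most similar clusters.
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Density-based: DBSCAN grows clusters from dense regions; the label -1 marks
# points treated as noise in sparse areas.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Internal validation: the silhouette score balances compactness and separation.
for name, labels in [("k-means", kmeans_labels),
                     ("agglomerative", agglo_labels),
                     ("DBSCAN", dbscan_labels)]:
    if len(set(labels)) > 1:  # the silhouette score needs at least two distinct labels
        print(name, round(silhouette_score(X, labels), 3))
```

The choice of the cluster count for k-means and of the neighbourhood radius and minimum point count for DBSCAN strongly affects the resulting partition, which reflects the parameter-selection challenge noted above.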
Fundamentals
Definition and Objectives
Cluster analysis is an unsupervised machine learning technique that partitions a given dataset into subsets, known as clusters, such that data points within the same cluster exhibit greater similarity to each other than to those in other clusters, typically measured using distance or dissimilarity metrics.[4] This process relies on inherent data structures without requiring predefined labels or supervision, enabling the discovery of natural groupings in unlabeled data.[5] The primary objectives of cluster analysis include facilitating exploratory data analysis to uncover hidden patterns, supporting pattern discovery in complex datasets, aiding anomaly detection by identifying outliers as points distant from cluster centers, and contributing to dimensionality reduction by summarizing data into representative cluster prototypes.[5] These goals emphasize its role in unsupervised learning, where the aim is to reveal intrinsic data organization for subsequent tasks like classification or visualization, rather than predictive modeling.[4]

Effective cluster analysis presupposes appropriate data representation, encompassing numerical attributes (e.g., continuous values like measurements) and categorical attributes (e.g., discrete labels like categories), which may require preprocessing to handle mixed types.[4] Central to this are similarity or distance measures that quantify resemblance between data points; common examples include the Euclidean distance, defined as

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},
where x and y are data points in an n-dimensional space, and the Manhattan distance, d(x, y) = \sum_{i=1}^{n} |x_i - y_i|, which sums absolute differences along each dimension.[4]

Key concepts in cluster analysis distinguish between hard clustering, where each data point is assigned exclusively to one cluster with no overlap, and soft clustering, where points may belong to multiple clusters with varying degrees of membership, often represented as probabilities.[5] Clusters can exhibit overlap in soft approaches, allowing for ambiguous boundaries in real-world data. Scalability is a further concern: many algorithms have high computational complexity and struggle with datasets containing millions of points, necessitating efficient implementations.[4]
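The following NumPy sketch implements the two distance measures just defined and contrasts hard assignment with one simple form of soft assignment; the cluster centers, the query point, and the inverse-distance membership rule are hypothetical choices made for illustration, not part of any particular algorithm.

```python
# A minimal sketch, assuming NumPy, of the distance measures and of hard versus
# soft assignment; centers, query point, and the inverse-distance membership
# rule are illustrative assumptions.
import numpy as np

def euclidean(x, y):
    """Euclidean distance: d(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    """Manhattan distance: d(x, y) = sum_i |x_i - y_i|."""
    return np.sum(np.abs(x - y))

# Two assumed cluster prototypes and one query point in 2-dimensional space.
centers = np.array([[0.0, 0.0],
                    [3.0, 4.0]])
point = np.array([1.0, 1.0])

dists = np.array([euclidean(point, c) for c in centers])

# Hard clustering: the point belongs exclusively to its nearest center.
hard_label = int(np.argmin(dists))

# Soft clustering: degrees of membership across all centers that sum to one;
# inverse-distance weighting stands in here for a probabilistic model.
weights = 1.0 / (dists + 1e-12)
memberships = weights / weights.sum()

print(manhattan(point, centers[0]))  # 2.0
print(hard_label)                    # 0 (index of the closest center)
print(memberships)                   # roughly [0.72, 0.28], summing to 1
```

Probabilistic models such as Gaussian mixtures derive soft memberships from fitted distributions rather than from raw distances, which is one common way the probabilistic view of soft clustering is realized in practice.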