
Cluster analysis

Cluster analysis is an unsupervised technique that partitions a set of objects into groups, or clusters, such that objects within the same cluster are more similar to each other than to those in other clusters, based on predefined measures of similarity or dissimilarity. This method aims to discover inherent structures or patterns in data without relying on predefined labels or categories, making it a fundamental tool for exploratory analysis in fields like statistics and machine learning. The process typically involves selecting appropriate distance metrics, such as the Euclidean distance for numerical data, to quantify similarities and iteratively forming clusters that maximize intra-cluster cohesion and inter-cluster separation.

Key characteristics of cluster analysis include its nonparametric nature, which allows it to handle diverse data types—numerical, categorical, or mixed—without assuming underlying distributions, and its focus on either hard partitioning (where each object belongs to exactly one cluster) or soft partitioning (allowing probabilistic memberships). Common algorithms fall into several categories: partitional methods like k-means, which divide data into a fixed number of non-overlapping subsets by minimizing variance within clusters; hierarchical methods, such as agglomerative clustering, which builds a tree-like structure by successively merging similar clusters; and density-based methods like DBSCAN, which identify clusters as dense regions separated by sparse areas. Evaluation often relies on internal criteria, such as silhouette scores measuring cluster compactness and separation, or external validation when labels are available.

Originating in early 20th-century taxonomic and classificatory work, cluster analysis has evolved significantly with advancements in computational power and data mining, gaining prominence through seminal works in the 1970s and 1980s that formalized algorithms for broader applications. Today, it is widely applied in diverse domains, including genomics for gene expression grouping, marketing for customer segmentation, image processing for object recognition, and climate science for pattern detection in environmental data. Despite its utility, challenges persist, such as sensitivity to outliers, the need to determine the optimal number of clusters, and scalability for large datasets, driving ongoing research into robust and efficient variants.

Fundamentals

Definition and Objectives

Cluster analysis is an unsupervised learning technique that partitions a given dataset into subsets, known as clusters, such that data points within the same cluster exhibit greater similarity to each other than to those in other clusters, typically measured using similarity or dissimilarity metrics. This process relies on inherent structures without requiring predefined labels or supervision, enabling the discovery of natural groupings in unlabeled data. The primary objectives of cluster analysis include facilitating exploratory data analysis to uncover hidden patterns, supporting pattern discovery in complex datasets, aiding anomaly detection by identifying outliers as points distant from cluster centers, and contributing to data reduction by summarizing observations into representative cluster prototypes. These goals emphasize its role in unsupervised learning, where the aim is to reveal intrinsic organization that can inform subsequent analysis, rather than predictive modeling.

Effective cluster analysis presupposes appropriate data representation, encompassing numerical attributes (e.g., continuous measurements) and categorical attributes (e.g., nominal labels), which may require preprocessing to handle mixed types. Central to this are similarity or distance measures that quantify resemblance between points; common examples include the Euclidean distance, defined as
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},
where x and y are points in an n-dimensional space, and the Manhattan distance, d(x, y) = \sum_{i=1}^{n} |x_i - y_i|, which sums absolute differences along each dimension.
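As a concrete illustration, the following minimal Python sketch computes both distances with NumPy; the vectors are arbitrary examples, and scipy.spatial.distance provides equivalent euclidean and cityblock functions.

```python
# Minimal sketch of the Euclidean and Manhattan distances defined above.
import numpy as np

def euclidean(x: np.ndarray, y: np.ndarray) -> float:
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)"""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x: np.ndarray, y: np.ndarray) -> float:
    """d(x, y) = sum_i |x_i - y_i|"""
    return float(np.sum(np.abs(x - y)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(euclidean(x, y))  # sqrt(13) ~ 3.606
print(manhattan(x, y))  # 5.0
```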
Key concepts in cluster analysis distinguish between hard clustering, where each data point is assigned exclusively to one cluster with no overlap, and soft clustering, where points may belong to multiple clusters with varying degrees of membership, often represented as probabilities. Clusters can exhibit overlap in soft approaches, allowing for ambiguous boundaries in real-world data, while scalability issues arise with large datasets, as many algorithms struggle with the computational cost of processing millions of points, necessitating efficient implementations.

Historical Development

The roots of cluster analysis trace back to the early 20th century in anthropology, where Harold E. Driver and Alfred L. Kroeber introduced quantitative methods to assess similarities in cultural traits, marking one of the first systematic attempts to group observations based on resemblance measures. This approach laid the groundwork for empirical classification by using similarity coefficients to organize qualitative data into hierarchical structures, though computations were manual and limited to small datasets. In the late 1930s, the technique entered psychology through Joseph Zubin's 1938 proposal for measuring "like-mindedness" among individuals via correlations, enabling the identification of homogeneous groups in behavioral data. Robert C. Tryon further advanced these ideas in 1939 with his monograph on cluster analysis, applying correlation profiles and orthometric factors to isolate unities in traits and mental abilities, thus formalizing clustering as a tool for psychological research.

The 1960s saw significant milestones in the development of algorithmic frameworks, driven by the need for more robust statistical methods. Joe H. Ward's 1963 minimum variance method introduced an objective function to minimize within-cluster variance in hierarchical agglomerative clustering, providing a criterion for optimal grouping that became widely adopted for its computational efficiency. Building on this, G. N. Lance and W. T. Williams published their 1967 agglomerative framework, which generalized classificatory sorting strategies through recursive distance updates, unifying various linkage criteria and facilitating implementation on early computers. Concurrently, James MacQueen formalized the k-means algorithm in 1967, offering a partitioning approach that iteratively assigns data points to clusters by minimizing squared distances to centroids, though its roots extended to earlier work like Lloyd's 1957 algorithm.

From the 1970s onward, cluster analysis transitioned from manual to computer-aided processes, spurred by advancements in statistical software. The integration of clustering routines into packages like SPSS version 5.0 around 1975 enabled hierarchical and k-means methods for larger datasets, democratizing access for researchers in social sciences and beyond. This era also witnessed the emergence of fuzzy clustering approaches, such as those developed by Dunn in 1973 and refined by Bezdek, allowing for partial memberships in clusters. Additionally, density-based approaches emerged, exemplified by Martin Ester and colleagues' 1996 DBSCAN algorithm, which identifies clusters of arbitrary shape in spatial data by focusing on density reachability rather than predefined parameters like the number of clusters. Influential figures such as Teuvo Kohonen contributed in the 1980s with self-organizing maps, a neural network-inspired method for topographic clustering that visualizes high-dimensional data on low-dimensional grids.

By the 2000s, cluster analysis underwent a shift toward probabilistic models, influenced by machine learning's emphasis on uncertainty and scalability. Traditional deterministic methods like k-means gave way to model-based techniques, such as Gaussian mixture models, which assign soft memberships via expectation-maximization and better handle overlapping clusters, as reviewed in works on large-scale data clustering. This evolution reflected broader integration with data mining and machine learning, enabling applications such as gene expression analysis for biological pattern discovery.

Clustering Methods

Hierarchical Clustering

Hierarchical clustering, also known as connectivity-based clustering, constructs a hierarchy of clusters either by progressively merging smaller clusters into larger ones or by recursively splitting larger clusters into smaller ones. The agglomerative approach, which is the most commonly used, starts with each data point as its own cluster and iteratively merges the closest pair of clusters until all points form a single cluster. In contrast, the divisive approach begins with all data points in one cluster and repeatedly splits the most heterogeneous cluster into two until each point is isolated. The resulting hierarchy is typically visualized using a dendrogram, a tree-like diagram where the height of each merge or split indicates the dissimilarity level at which clusters are combined or divided, allowing users to select a partitioning by cutting the tree at a desired level.

The merging or splitting process in hierarchical clustering relies on linkage criteria that define the distance between clusters. Single linkage, also known as nearest-neighbor linkage, measures the minimum distance between any point in one cluster and any point in the other, which can lead to chaining effects in elongated clusters. Complete linkage, or furthest-neighbor linkage, uses the maximum distance between points in the two clusters, promoting compact, spherical clusters but being sensitive to outliers. Average linkage computes the mean distance between all pairs of points across the clusters, balancing the tendencies of single and complete methods to produce more balanced hierarchies. Ward's method, a variance-minimizing approach, merges the pair of clusters that results in the smallest increase in total within-cluster variance, with the increase given by \Delta = \frac{n_i n_j}{n_i + n_j} \| \mu_i - \mu_j \|^2, where n_i and n_j are the sizes of clusters i and j, and \mu_i and \mu_j are their centroids.

Several algorithms implement hierarchical clustering efficiently. The naive agglomerative algorithm updates distances after each merge, requiring O(n^3) time for n points due to repeated pairwise computations. Optimized versions like SLINK for single linkage reduce this to O(n^2) time by maintaining a compact representation of the dendrogram and avoiding redundant distance calculations. Similarly, CLINK achieves O(n^2) time for complete linkage through an incremental approach that extends partial hierarchies. To determine the number of clusters, stopping criteria such as the inconsistency coefficient are used, which measures how much a link's height deviates from the average and standard deviation of nearby links in the dendrogram, with values above a threshold indicating natural cuts.

Hierarchical clustering excels at handling non-spherical and varying-density clusters without requiring a predefined number of clusters, providing interpretable hierarchies via dendrograms. However, it is computationally intensive, with even optimized implementations requiring O(n^2) time and space, making it less suitable for very large datasets. In bioinformatics, hierarchical clustering is widely applied to gene expression data, where average or Ward's linkage with correlation-based distances groups genes or samples based on similar expression profiles, visualized in dendrograms to reveal co-regulated pathways or tissue-specific patterns. For instance, in microarray datasets, it identifies clusters of genes upregulated in cancer samples, aiding in biomarker discovery.
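The sketch below illustrates agglomerative clustering with SciPy on small synthetic data (an assumption for demonstration); the linkage method names ('single', 'complete', 'average', 'ward') correspond to the criteria described above, and fcluster cuts the dendrogram at a chosen number of clusters.

```python
# Agglomerative clustering with SciPy: build a merge tree, then cut it.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two synthetic Gaussian blobs, purely for illustration
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(5, 0.5, size=(20, 2))])

Z = linkage(X, method="ward")                     # agglomerative merge tree
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree for visual inspection.
```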

Centroid-Based Clustering

Centroid-based clustering methods partition a dataset into a predefined number of clusters, K, by iteratively assigning data points to the nearest centroid (prototype) and updating the centroids based on the assignments until convergence is achieved. This approach optimizes for compact, spherical clusters by minimizing the distance between points and their assigned centroids, typically using Euclidean distance. The core principle relies on creating Voronoi partitions where each cluster consists of the points closest to its centroid, promoting intra-cluster similarity.

The k-means algorithm, a foundational centroid-based method, formalizes this process through Lloyd's iterative algorithm. It begins with initialization of K centroids, often randomly selected from the data or using improved strategies like k-means++ for better spread. In the assignment step, each point x is allocated to the cluster C_k with the nearest centroid \mu_k, forming Voronoi cells. The update step recomputes each centroid as the mean of its assigned points: \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x. This alternation continues until centroids stabilize or a maximum number of iterations is reached. The objective is to minimize the within-cluster sum of squares (WCSS): \sum_{k=1}^K \sum_{x \in C_k} \| x - \mu_k \|^2, which measures total intra-cluster variance.

Variants address specific limitations of standard k-means. The k-medoids algorithm, implemented as Partitioning Around Medoids (PAM), selects actual data points as medoids instead of means, making it more robust to outliers by minimizing total dissimilarity rather than squared Euclidean distance; it iteratively swaps medoids to optimize the objective. Fuzzy c-means extends the approach to soft assignments, allowing partial memberships u_{ik} for point i in cluster k via u_{ik} = \frac{1}{\sum_j (d_{ik}/d_{ij})^{2/(m-1)}}, where d_{ij} is the distance from point i to centroid j and m > 1 controls fuzziness, enabling overlapping clusters.

Initialization significantly impacts results due to k-means' sensitivity to starting centroids, often leading to local optima; multiple runs with random seeds or k-means++—which probabilistically selects initial centroids farther apart—are recommended to mitigate this. Determining K involves methods like the elbow technique, plotting WCSS against K and selecting the "elbow" point where marginal gains diminish, indicating balanced complexity and fit.

Despite its efficiency, centroid-based clustering assumes clusters are spherical, roughly equal-sized, and of similar variance, which fails for elongated, varying-density, or unevenly sized groups; it is also sensitive to outliers, since squared distances amplify their influence, especially in high dimensions. The time complexity is O(nkt), where n is the number of points, k the number of clusters, and t the number of iterations, scaling poorly for large datasets without approximations.
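A hedged example of this procedure with scikit-learn's KMeans follows; the blob data and the range of K values scanned are illustrative assumptions, and the attribute inertia_ exposes the WCSS used in the elbow technique.

```python
# Lloyd's algorithm via scikit-learn, with a simple elbow-style scan of WCSS.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)           # within-cluster sum of squares for this k
print(wcss)                            # look for the "elbow" as k increases

final = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(final.cluster_centers_)          # centroids mu_k
print(final.labels_[:10])              # hard assignments
```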

Density-Based Clustering

Density-based clustering algorithms define clusters as dense regions in the data space, separated by areas of lower density, allowing for the discovery of clusters with arbitrary shapes and the identification of noise without assuming a fixed number of clusters. In this framework, a cluster is a maximal set of density-connected points, where two points are density-connected if they belong to a chain of points such that each consecutive pair is directly density-reachable from the other based on local density criteria. Points lying in sparse regions, not belonging to any such dense structure, are treated as noise or outliers.

The foundational algorithm, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), was introduced in 1996 by Ester et al. It operates using two key parameters: Eps, which defines the radius of the ε-neighborhood around a point, and MinPts, the minimum number of points required to consider a neighborhood dense. Points are categorized as core points if their ε-neighborhood contains at least MinPts points (including themselves), border points if they lie in the ε-neighborhood of a core point but have fewer than MinPts points in their own neighborhood, and noise points otherwise. Direct density-reachability occurs when a core point p has another point q within its ε-neighborhood; density-reachability extends this transitively through chains of such connections. The algorithm processes the dataset by selecting an arbitrary unvisited point, performing a range query to retrieve all points within its ε-neighborhood (with a naive implementation taking O(n) time per query), and expanding the cluster by recursively adding all density-reachable points from core points; unexpandable points are marked as noise, and the process repeats until all points are visited.

To handle datasets with varying densities, where a single Eps value may fail to capture both sparse and dense regions, OPTICS (Ordering Points To Identify the Clustering Structure) was developed in 1999 by Ankerst et al. as an extension of DBSCAN. OPTICS does not produce flat clusters directly but generates an augmented ordering of points based on their core-distance (the minimum radius needed to qualify as a core point) and reachability-distance (the minimum distance at which a point becomes density-reachable from a processed core point), visualized in a reachability plot that reveals hierarchical cluster structures at multiple density thresholds; clusters can then be extracted by applying a steepness criterion to the plot.

A notable variant, HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), proposed in 2013 by Campello et al., further advances density-based methods by constructing a cluster hierarchy without requiring a fixed Eps parameter. It builds upon mutual reachability distances to form a minimum spanning tree, applies single-linkage clustering to create a hierarchy of density levels, and extracts flat clusters by selecting a stability-based cut in the hierarchy, effectively accommodating clusters of differing densities and providing robust noise detection through the hierarchy's leaves.

Density-based clustering offers several advantages, including the ability to detect clusters without specifying the number of clusters in advance, robustness to noise through explicit outlier labeling, and flexibility in identifying non-convex, arbitrary-shaped clusters in spatial or high-dimensional data.
However, these methods are sensitive to parameter selection—such as Eps and MinPts in DBSCAN—which can significantly affect results if not tuned appropriately via techniques like k-distance plots, and they may underperform on datasets with substantial density variations unless extended by algorithms like OPTICS or HDBSCAN. A practical example of density-based clustering is its application to spatial datasets, such as grouping earthquake hypocenters to delineate seismic fault zones; DBSCAN can identify dense swarms of aftershocks as clusters while classifying isolated low-magnitude events as noise, aiding in hazard assessment without assuming spherical cluster shapes.
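The following brief sketch applies scikit-learn's DBSCAN to synthetic moon-shaped data (an illustrative assumption); eps and min_samples correspond to Eps and MinPts above, and points labeled -1 are noise.

```python
# DBSCAN on non-convex synthetic data; -1 labels denote noise points.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.07, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```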

Model-Based Clustering

Model-based clustering approaches treat the data as arising from a mixture of underlying probability distributions, where each component represents a cluster, and parameters are estimated via maximum likelihood to uncover the latent structure. This probabilistic framework allows for soft assignments of data points to clusters based on posterior probabilities, enabling the modeling of uncertainty and overlap between groups. Unlike deterministic methods, it assumes the data are generated from a finite mixture model, typically specified as p(\mathbf{x}) = \sum_{k=1}^K \pi_k f(\mathbf{x} \mid \theta_k), where \pi_k are mixing proportions with \sum_{k=1}^K \pi_k = 1 and \pi_k > 0, and f(\mathbf{x} \mid \theta_k) is the density of the k-th component parameterized by \theta_k. A common instantiation uses Gaussian components, yielding the Gaussian mixture model p(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), which assumes clusters form ellipsoidal shapes determined by the component covariance matrices.

Parameter estimation is typically performed using the expectation-maximization (EM) algorithm, which maximizes the likelihood by iteratively alternating expectation and maximization steps. In the E-step, posterior probabilities of component membership are calculated as z_{ik} = \frac{\pi_k \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} for each data point i and component k. The M-step then updates the parameters: mixing proportions as \pi_k = \frac{1}{n} \sum_{i=1}^n z_{ik}, means as \boldsymbol{\mu}_k = \frac{\sum_{i=1}^n z_{ik} \mathbf{x}_i}{\sum_{i=1}^n z_{ik}}, and covariances as \boldsymbol{\Sigma}_k = \frac{\sum_{i=1}^n z_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T}{\sum_{i=1}^n z_{ik}}.

Bayesian extensions address the fixed number of components K by employing nonparametric priors like the Dirichlet process mixture, which allows the number of clusters to be inferred from the data rather than specified in advance. This approach uses Markov chain Monte Carlo methods to sample from the posterior distribution over partitions and parameters. The core assumption is that clusters correspond to draws from underlying probability distributions, accommodating varying shapes and sizes through flexible modeling, though it excels particularly with ellipsoidal forms. Implementation is facilitated by software like the mclust package in R, which fits Gaussian finite mixture models via EM and supports model selection across various covariance structures.

Despite its strengths, model-based clustering incurs high computational cost due to iterative optimization, especially for large datasets or complex covariances. Choosing the number of components is a key difficulty, mitigated by criteria such as the Bayesian information criterion (BIC), defined as \mathrm{BIC} = -2 \ln L + p \ln n, where L is the maximized likelihood, p the number of parameters, and n the sample size; lower BIC values favor parsimonious models. K-means can be viewed as a special case with hard assignments and spherical covariances equal across clusters.
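As an illustrative sketch, the code below fits Gaussian mixtures with scikit-learn's GaussianMixture over a range of component counts and selects K by BIC; the synthetic blob data and the candidate range are assumptions for demonstration.

```python
# Gaussian mixture clustering with EM and BIC-based choice of K.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=1.2, random_state=7)

best_k, best_bic, best_model = None, np.inf, None
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    gmm.fit(X)                      # EM: E-step posteriors, M-step parameter updates
    bic = gmm.bic(X)                # lower BIC is preferred
    if bic < best_bic:
        best_k, best_bic, best_model = k, bic, gmm

print(best_k)                        # number of components chosen by BIC
resp = best_model.predict_proba(X)   # soft assignments z_ik
print(resp[:3].round(3))
```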

Grid-Based and Other Methods

Grid-based clustering methods partition the data space into a finite number of non-overlapping cells or hyper-rectangles, allowing clustering operations to be performed efficiently on these discretized structures rather than on individual data points. This approach reduces computational cost by summarizing statistical properties within each cell, such as count, mean, variance, and distribution type, enabling quick query processing and region-based clustering. A key parameter is the cell or grid size, which determines the granularity of the partitioning and trades off accuracy against efficiency; finer grids capture more detail but increase processing time. These methods are particularly suited for spatial data mining tasks where exact point-level precision is less critical than scalability.

The STING algorithm (Statistical Information Grid) exemplifies this paradigm by employing a hierarchical, quadtree-like structure that adaptively refines cells based on the data distribution, starting from a coarse global level and drilling down to finer resolutions only where necessary. Developed in 1997, STING precomputes statistical summaries for each cell—such as counts, means, min/max values, and distribution types—to support clustering by identifying dense regions through bottom-up aggregation or top-down refinement, making it efficient for large datasets with varying densities. Adaptive grids like those in STING handle non-uniform distributions by allowing variable cell sizes, avoiding the inefficiencies of uniform partitioning in sparse areas.

Subspace clustering extends grid-based techniques to high-dimensional data, where clusters may exist only in lower-dimensional projections, mitigating the curse of dimensionality by searching for dense regions in arbitrary subspaces. The CLIQUE algorithm, introduced in 1998, applies a grid structure across subspaces generated from the full feature set, identifying dense unit hypercubes and merging them into clusters while producing concise cluster descriptions in disjunctive normal form (DNF). CLIQUE partitions each dimension into intervals, covers the space with rectangular units, and uses an Apriori-style pruning strategy to discard sparse candidate subspaces, enabling scalability to hundreds of dimensions without exhaustive subspace enumeration. This grid-based subspace approach excels in datasets where the relevant features vary across clusters.

Spectral clustering treats data points as vertices in a similarity graph, using the graph Laplacian to embed points into a low-dimensional space for subsequent partitioning, which is effective for non-convex clusters. The unnormalized Laplacian is defined as L = D - A, where A is the affinity matrix with entries representing pairwise similarities (e.g., from a Gaussian kernel), and D is the diagonal degree matrix with D_{ii} = \sum_j A_{ij}. The algorithm computes the k smallest eigenvectors of L, forms a matrix of these eigenvectors (each row corresponding to a data point), and applies k-means in this embedding space to obtain the final partitions. This method leverages spectral graph theory to approximately minimize the normalized cut, balancing cluster cohesion and separation.

Among other methods, self-organizing maps (SOMs) provide a neural network-based alternative that preserves topological relationships through competitive learning on a discrete lattice. Proposed by Kohonen in 1982, a SOM initializes a grid of neurons with weight vectors and iteratively updates the winning neuron (the one closest to the input) and its neighbors toward the data point, gradually forming a continuous mapping of the high-dimensional input space onto the lower-dimensional grid. This process results in topology-preserving clusters where similar inputs activate nearby neurons, useful for visualization and exploratory analysis.
Grid-based and spectral methods offer advantages in scalability for large-scale and high-dimensional datasets, with time complexities often linear in the number of points or grid cells, outperforming point-based algorithms on massive data. However, quantization in grid-based approaches can lead to loss of precision, as boundary points may be arbitrarily assigned to cells, potentially distorting cluster shapes in low-density regions. Spectral methods, while robust to noise and outliers, incur higher costs for eigenvector computation in very large graphs. Some hybrid approaches integrate grid structures with density estimation to refine boundaries, enhancing accuracy without full point-wise processing. For instance, spectral clustering has been applied to image segmentation by modeling pixels as graph nodes with intensity-based affinities, partitioning the image into coherent regions as demonstrated in the normalized cuts framework.
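To make the spectral procedure concrete, the hedged sketch below runs scikit-learn's SpectralClustering on synthetic concentric circles, a non-convex case where centroid methods fail; the k-nearest-neighbor affinity and the choice of 10 neighbors are illustrative assumptions.

```python
# Spectral clustering of non-convex data: the library builds the affinity graph
# and Laplacian embedding internally, then runs k-means on the embedded points.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=1)

sc = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",   # sparse k-NN similarity graph
    n_neighbors=10,
    assign_labels="kmeans",
    random_state=0,
)
labels = sc.fit_predict(X)
print(labels[:20])
```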

Evaluation Techniques

Internal Validation Metrics

Internal validation metrics evaluate the quality of a clustering solution solely based on the intrinsic information within the dataset and the resulting partition, without reference to external ground truth labels. These metrics typically emphasize two key properties: compactness, which measures how closely related the objects are within each cluster, and separation, which assesses how distinct the clusters are from one another. By quantifying these aspects, internal metrics enable the selection of the optimal number of clusters k, the comparison of different clustering algorithms on the same data, or the assessment of partition stability, often under assumptions such as the use of Euclidean distance in feature space.

One of the earliest internal metrics is the Dunn index, proposed by Dunn in 1974, which computes the ratio of the smallest inter-cluster distance to the largest intra-cluster diameter across all clusters. A higher value indicates better-defined clusters with strong separation relative to internal dispersion, making it suitable for identifying compact and well-separated structures. However, it can be computationally intensive for large datasets due to the need to evaluate all pairwise distances.

The Davies–Bouldin index, introduced by Davies and Bouldin in 1979, provides a measure of cluster similarity by averaging, for each cluster, the maximum ratio of the sum of within-cluster scatters to the between-cluster separation with respect to other clusters. Formally, it is defined as DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}}, where s_i is the average distance between points in cluster i and its centroid, d_{ij} is the distance between the centroids of clusters i and j, and lower values signify superior clustering quality. This index is particularly useful for comparing partitions where compactness and separation trade-offs are critical, though it assumes roughly spherical clusters in Euclidean space.

The silhouette coefficient, developed by Rousseeuw in 1987, offers a more intuitive per-object assessment of clustering validity by measuring how similar an object is to its own cluster compared to neighboring clusters. For each data point i, the silhouette value is calculated as s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, where a(i) is the average distance from i to the other points in its cluster, and b(i) is the smallest average distance from i to the points in any other cluster; the global silhouette score is the mean over all points, ranging from -1 (poor assignment) to 1 (well-clustered). This metric not only yields a summary score but also enables graphical silhouette plots to detect outliers or suboptimal merges, assuming a distance metric such as the Euclidean distance.

Another prominent metric is the Calinski–Harabasz index, also known as the variance ratio criterion, proposed by Caliński and Harabasz in 1974, which evaluates the ratio of between-cluster dispersion to within-cluster dispersion, adjusted for the number of clusters and data points. Higher values indicate favorable partitions with greater inter-cluster variance relative to intra-cluster variance, making it effective for selecting k in methods like k-means under multivariate normal assumptions in Euclidean feature space. These metrics collectively support evaluation but may vary in sensitivity to cluster shape, density, or dimensionality, necessitating careful selection based on data characteristics.
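A short example of computing these indices with scikit-learn follows; the Dunn index has no built-in implementation there, so only the silhouette, Davies–Bouldin, and Calinski–Harabasz scores are shown, on synthetic data clustered with k-means.

```python
# Internal validation of a k-means partition using three library metrics.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.9, random_state=3)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("silhouette:        ", silhouette_score(X, labels))          # higher is better
print("Davies-Bouldin:    ", davies_bouldin_score(X, labels))       # lower is better
print("Calinski-Harabasz: ", calinski_harabasz_score(X, labels))    # higher is better
```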

External Validation Metrics

External validation metrics assess the quality of a clustering by comparing the obtained partitions to a known labeling of the data, typically available in supervised evaluation scenarios where external class information exists. These metrics quantify the agreement between the clustering output and the reference labels, focusing on aspects such as pairwise similarities, set overlaps, or information preservation, and are particularly useful for benchmarking algorithms when ground truth is accessible. Unlike internal metrics, which rely solely on the data and the partition itself, external measures provide an objective evaluation against predefined categories, enabling direct assessment of how well the clustering recovers the underlying structure.

A foundational tool for computing many external metrics is the contingency table, also known as a confusion matrix in this context, which tabulates the joint frequencies of cluster assignments and class labels. For a dataset with n points, the table C is an m \times k matrix where C_{i,j} denotes the number of points assigned to cluster i in the clustering and class j in the reference labeling, with m clusters and k classes. This table serves as the basis for deriving agreement statistics, such as true positives (TP: pairs in the same cluster and the same class), true negatives (TN: pairs in different clusters and different classes), false positives (FP: pairs in the same cluster but different classes), and false negatives (FN: pairs in different clusters but the same class). The contingency table allows for straightforward calculation of overlaps and mismatches, facilitating the application of various indices.

The Rand index (RI), introduced by Rand in 1971, measures the fraction of pairwise agreements between the clustering and the reference labeling, treating pairs of points as correctly classified if they are either both in the same group or both in different groups according to both partitions. It is defined as RI = \frac{TP + TN}{TP + TN + FP + FN}, yielding a value between 0 (no agreement) and 1 (perfect agreement). However, the basic RI is sensitive to chance agreements and is often adjusted using the Hubert–Arabie formula to produce the Adjusted Rand Index (ARI), which subtracts the expected agreement under random labeling and normalizes the result to range from -1 to 1, with values near 0 indicating random partitions. This adjustment makes ARI a robust choice for comparing clusterings of varying sizes.

The F-measure adapts the precision–recall framework from information retrieval to clustering evaluation, computing the harmonic mean of precision (the fraction of points in a cluster that belong to its dominant class) and recall (the fraction of points in a class assigned to its dominant cluster). For each cluster, precision and recall are calculated relative to the best-matching class, and the F-measure is F = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}, typically averaged across clusters using macro-averaging (equal weight per cluster) or micro-averaging (equal weight per point) to handle imbalances. This metric emphasizes balanced recovery of classes but requires solving an assignment problem to match clusters to classes optimally, as discussed in comparative analyses of extrinsic metrics.

The Jaccard index, originally a set similarity measure, evaluates clustering by computing the intersection-over-union for pairs of clusters and their corresponding classes, often extended to average pairwise similarities between predicted and true partitions. For two sets A and B, it is J(A, B) = \frac{|A \cap B|}{|A \cup B|}, ranging from 0 (no overlap) to 1 (identical sets), and in clustering it is applied via the contingency table to quantify overlap after optimal matching.
This index is particularly intuitive for assessing set-based agreement but can be computationally intensive for large numbers of clusters due to the need for bipartite matching.

Normalized Mutual Information (NMI) provides an information-theoretic measure of the shared information between the clustering and the reference labeling, normalized to account for entropy differences. It is calculated as NMI(X; Y) = \frac{2 I(X; Y)}{H(X) + H(Y)}, where I(X; Y) is the mutual information between the cluster and class assignments, and H(X), H(Y) are the entropies of the cluster and class distributions, respectively, derived from the contingency-table probabilities. NMI ranges from 0 (independent partitions) to 1 (identical), offering robustness to varying cluster numbers and sizes, as analyzed in studies of information-theoretic clustering comparisons.

The Fowlkes–Mallows index (FM) focuses on pairwise concordance across all pairs of points without requiring full class assignments. It is given by FM = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}, which equals the geometric mean of pairwise precision and recall and ranges from 0 to 1. Proposed for comparing hierarchical clusterings, FM is less sensitive to cluster size imbalances than arithmetic means and performs well when the reference partition has clear boundaries, as demonstrated in its original formulation for dendrogram comparison.

Purity measures the extent to which each cluster contains points from a single class, assigning each cluster to its dominant class, computing the fraction of points matching that class, and then averaging over all clusters weighted by size. Formally, purity is \frac{1}{n} \sum_{i} \max_j |C_{i,j}|, where |C_{i,j}| is the number of points of class j in cluster i and n is the total number of points, yielding values from 0 (no purity) to 1 (perfect). This simple metric highlights homogeneity but tends to favor solutions with many small clusters, as it does not penalize over-fragmentation.

External validation metrics share several limitations that can affect their applicability. They inherently require access to ground truth labels, which are often unavailable in real-world unsupervised settings, limiting their use to benchmark datasets. Many, such as RI, FM, and purity, assume non-overlapping hard clusters and perform poorly with fuzzy or probabilistic partitions, where points belong to multiple groups. Additionally, metrics like purity and F-measure can be biased toward balanced or equal-sized clusters, overestimating quality for fragmented solutions and underestimating it for uneven distributions, as highlighted in formal constraint analyses of extrinsic indices.
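The sketch below computes several of these measures with scikit-learn on a small hand-made labeling; purity is derived directly from the contingency table, since the library provides no dedicated function for it.

```python
# External validation against ground-truth labels (toy labelings for illustration).
import numpy as np
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             fowlkes_mallows_score)
from sklearn.metrics.cluster import contingency_matrix

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]

print("ARI:", adjusted_rand_score(y_true, y_pred))
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
print("FM :", fowlkes_mallows_score(y_true, y_pred))

C = contingency_matrix(y_true, y_pred)          # rows: classes, columns: clusters
purity = np.sum(np.max(C, axis=0)) / np.sum(C)  # dominant class per cluster
print("purity:", purity)
```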

Assessing Cluster Tendency

Assessing cluster tendency involves evaluating whether a dataset exhibits inherent structure suitable for clustering, prior to applying clustering algorithms, to avoid analyzing random noise as meaningful groups. This step helps determine whether the data is "clusterable," meaning it contains non-random patterns or groupings that can be captured by clustering methods. Visual and statistical approaches are commonly used to inspect this tendency, often as a preliminary step that informs subsequent clustering decisions.

Visual methods provide an intuitive way to inspect potential groupings in the data. Scatter plots of the original features or pairs of variables allow users to visually detect concentrations or separations in low-dimensional data. For higher-dimensional datasets, projecting the data onto principal components using principal component analysis (PCA) can reveal hidden structures by reducing dimensionality while preserving variance, enabling observation of clusters in 2D or 3D plots. These techniques are particularly useful for initial exploratory analysis, though they rely on human interpretation and may miss subtle patterns in noisy data.

The Hopkins statistic is a widely used statistical test for quantifying cluster tendency by comparing nearest-neighbor distances of randomly generated points with those of actual data points. Introduced in 1954, it computes a value H between 0 and 1: values near 0.5 indicate spatially random data, values approaching 1 suggest a clustering tendency, and values well below 0.5 indicate regularly spaced data. The statistic is computed from the summed distances from m uniformly sampled points to their nearest data neighbors, divided by the combined sum that also includes the distances from m sampled data points to their nearest neighbors, providing a measure of departure from spatial randomness. It is effective for detecting overall clusterability but can be sensitive to outliers.

Visual Assessment of Tendency (VAT), proposed in 2002, is a matrix-based visual method that reorders a pairwise dissimilarity matrix to produce an image revealing block-diagonal structures, where each dark block corresponds to a potential cluster. The algorithm applies a nearest-neighbor, minimum-spanning-tree style ordering to permute the rows and columns of the matrix, highlighting natural groupings as dark blocks along the diagonal against a lighter background of inter-cluster dissimilarities. VAT is computationally efficient for moderate-sized datasets and aids in estimating the number of clusters by counting distinct blocks, though it requires a dissimilarity matrix as input. Extensions like iVAT improve robustness by incorporating path-based distances.

Other tools for assessing cluster tendency include the NbClust R package, which applies up to 30 indices to recommend the optimal number of clusters, incorporating measures such as the silhouette, gap statistic, and Dunn indices for comprehensive evaluation. The elbow method plots the within-cluster sum of squares against the number of clusters k, identifying an "elbow" point where adding more clusters yields diminishing returns, signaling natural groupings. Similarly, the gap statistic compares the log of the within-cluster dispersion to that expected under a random reference distribution, selecting the k where the gap is maximized to detect non-random structure. These methods often complement tendency assessment by suggesting whether clustering is viable. Assessing cluster tendency assumes that datasets may contain noise or outliers, which can mask true structures; robust methods like VAT, or the Hopkins statistic combined with preprocessing, help mitigate this by focusing on dominant patterns.
Preprocessing plays a crucial role, such as feature scaling (e.g., standardization to zero mean and unit variance) to ensure distance-based measures treat all dimensions equally, preventing bias from varying scales. Without scaling, tendency tests may falsely indicate randomness in heterogeneous data. In gene expression data analysis, assessing cluster tendency is essential for identifying natural groupings of genes or samples before clustering; for instance, applying the Hopkins statistic to microarray datasets has yielded values indicating strong departure from spatial randomness, confirming clusterable structures related to biological pathways or disease states.
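A minimal NumPy sketch of the Hopkins statistic as described above follows; it uses the H = \sum u_i / (\sum u_i + \sum w_i) form, samples uniformly within the data's bounding box, and is an illustrative implementation rather than a reference one.

```python
# Illustrative Hopkins statistic: values near 0.5 suggest randomness,
# values approaching 1 suggest clustering tendency.
import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(X: np.ndarray, m: int = 50, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    n, d = X.shape
    tree = cKDTree(X)

    # u_i: distances from m uniform points in the bounding box to the nearest data point
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u, _ = tree.query(uniform, k=1)

    # w_i: distances from m sampled data points to their nearest *other* data point
    idx = rng.choice(n, size=m, replace=False)
    w, _ = tree.query(X[idx], k=2)   # k=2 because the nearest hit is the point itself
    w = w[:, 1]

    return float(u.sum() / (u.sum() + w.sum()))

X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (100, 2)),
               np.random.default_rng(2).normal(4, 0.3, (100, 2))])
print(hopkins_statistic(X))   # clustered data: expect a value well above 0.5
```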

Applications

In Biology and Medicine

Cluster analysis plays a pivotal role in biology and medicine by enabling the organization of high-dimensional datasets, such as gene expression profiles and medical images, into meaningful groups that reveal underlying patterns in biological processes and disease mechanisms. In biology, hierarchical clustering has been instrumental in analyzing genome-wide expression patterns from DNA microarrays, where genes with similar expression levels across samples are grouped to identify co-regulated functional modules. For instance, Eisen et al. (1998) introduced a hierarchical clustering system that visualizes expression data as color-coded dendrograms, facilitating the discovery of cell cycle-regulated genes in yeast. Similarly, hierarchical methods like the unweighted pair group method with arithmetic mean (UPGMA) are widely used in phylogenetic tree construction to infer evolutionary relationships from distance matrices of sequence similarities, assuming a constant rate of evolution. In protein interaction networks, spectral clustering uncovers functional modules by leveraging the eigenvectors of the network's Laplacian matrix, identifying densely connected subgraphs that correspond to protein complexes. Li et al. (2010) demonstrated this approach on yeast PPI data, achieving higher precision in complex detection compared to traditional methods.

In bioinformatics, clustering addresses the redundancy in large sequence datasets, with tools like CD-HIT employing a greedy incremental algorithm to group protein or nucleotide sequences at user-specified identity thresholds, reducing database sizes for faster similarity searches. Li et al. (2006) reported that CD-HIT processes million-sequence databases in hours, outperforming earlier tools in speed and accuracy for non-redundant set construction. For metagenomics, density-based methods such as DBSCAN identify operational taxonomic units (OTUs) in microbial communities by grouping sequences based on neighborhood density, accommodating varying cluster shapes in sparse 16S rRNA data. Bhat and Prabhu (2017) highlighted DBSCAN's utility in handling noise and outliers in uncultured microbial profiles, improving taxonomic resolution over centroid-based alternatives.

In medicine, centroid-based clustering like k-means stratifies patients using electronic health records (EHRs) to uncover disease subtypes based on clinical features such as vital signs and lab results. Chen et al. (2016) applied k-means to EHR data from chronic disease cohorts, identifying prognostic subgroups that informed personalized care coordination. For image analysis, fuzzy c-means segmentation partitions MRI scans into tumor, edema, and healthy tissue regions by allowing partial memberships, which is robust to the intensity inhomogeneities common in brain imaging. Koley and Sadhu (2019) enhanced fuzzy c-means with rough sets for precise glioma boundary delineation, achieving Dice scores above 0.85 on benchmark datasets.

Notable case studies illustrate these applications' impact. In cancer subtyping, model-based clustering with Gaussian mixtures analyzes The Cancer Genome Atlas (TCGA) multi-omics data to delineate molecular subtypes, such as luminal and basal-like breast cancers, guiding targeted therapies. Lee et al. (2020) developed the Hydra framework, which automatically detects multimodal distributions in TCGA profiles, revealing subtype-specific driver mutations. During the COVID-19 pandemic, density-based clustering grouped symptoms like fever and dyspnea from patient reports to identify risk profiles, aiding triage.
A 2024 analysis using DBSCAN on symptom co-occurrence data from 2020 outbreaks uncovered persistent clusters linked to long COVID severity. Despite these advances, clustering in biology and medicine faces challenges from high dimensionality and noise in omics data, where the "curse of dimensionality" amplifies irrelevant features and dropout events in single-cell RNA-seq. Duò et al. (2025) benchmarked algorithms on scRNA-seq datasets, showing that noise distorts cluster boundaries unless mitigated by dimensionality reduction, yet over-reduction risks losing biological signals. These issues necessitate robust preprocessing and validation to ensure interpretable results in clinical settings.

In Business and Social Sciences

In business, cluster analysis plays a pivotal role in market segmentation, enabling firms to divide heterogeneous consumer bases into homogeneous groups based on shared characteristics such as demographics, behaviors, or psychographics, thereby facilitating targeted marketing strategies and resource allocation. This approach, rooted in seminal methodological frameworks, allows companies to identify actionable customer segments for product differentiation and personalized advertising. For instance, k-means clustering has been applied to retail data to segment customers by recency, frequency, and monetary value (RFM) metrics, improving campaign effectiveness in e-commerce.

In the financial sector, cluster analysis supports risk management by grouping assets, institutions, or clients according to risk profiles derived from financial indicators such as credit scores, transaction histories, and volatility measures. Applications include credit risk evaluation, where mixed-data clustering combines quantitative and qualitative features to classify loan applicants into low-, medium-, and high-risk categories, enhancing lending decisions. Additionally, density-based methods like DBSCAN have been used for fraud detection in transaction datasets, isolating anomalous clusters from normal patterns to mitigate financial losses.

In the social sciences, cluster analysis aids in uncovering typologies and structures within complex human data, such as identifying social classes or behavioral patterns in sociological surveys. A foundational application involves hierarchical clustering of relational data to classify social networks, revealing community structures based on interaction similarities, as demonstrated in early work on network partitioning. In psychology, it has been employed to delineate personality types from Big Five trait inventories, with robust data-driven approaches identifying four distinct configurations—average, reserved, self-centered, and role model—across large datasets, providing insights into individual differences and mental health correlations. Model-based clustering further extends this to qualitative social data, such as interview codings, to reveal motives or attitudes in prevention studies.

In Computer Science and Web Technologies

In computer science and web technologies, cluster analysis serves as a foundational technique for organizing unstructured data, enabling efficient processing in areas such as information retrieval and pattern recognition. Algorithms like k-means and DBSCAN are integrated into scalable systems to handle large-scale datasets, supporting tasks from search optimization to user behavior modeling. This application leverages clustering's ability to group similar items without supervision, enhancing algorithmic efficiency in dynamic environments like the web.

Document clustering, a key method in information retrieval, often employs term frequency-inverse document frequency (TF-IDF) vectorization combined with k-means to group text-based content for search engines. For instance, this approach has been applied to the 20 Newsgroups dataset, a standard benchmark simulating news articles, where TF-IDF transforms documents into numerical vectors before k-means partitions them into topical clusters, improving relevance in search results such as early Google news grouping systems. The technique reduces dimensionality and highlights discriminative terms, achieving high purity scores in topical separation for web-scale text corpora.

Density-based clustering, such as DBSCAN, plays a critical role in anomaly detection for cybersecurity, identifying unusual patterns in network traffic that signal intrusions. By defining clusters based on data point density, these methods isolate outliers representing potential threats, such as irregular connection spikes in enterprise networks, with applications in intrusion detection systems that process real-time logs to flag deviations from normal behavior. Studies show DBSCAN outperforms partition-based alternatives in handling varying densities of attack vectors, enhancing detection accuracy in high-dimensional network data.

In recommendation systems, spectral clustering enhances collaborative filtering by uncovering latent structures in user-item interactions, grouping similar videos or content for platforms like YouTube. This method constructs affinity graphs from user ratings and applies eigenvector decomposition to partition the data, enabling personalized suggestions through clustered user preferences or item similarities, as demonstrated in optimizations that improve recommendation diversity and precision. For video grouping, spectral techniques reduce the computational overhead of matrix factorization in large-scale collaborative models.

Web usage mining utilizes hierarchical clustering to analyze user sessions, representing navigation paths as sequences of page visits to identify behavioral patterns. In this process, sessions are preprocessed into vectors or trees and then agglomerated using linkage criteria to form dendrograms that reveal site structure and user intents, aiding personalization and load balancing for web servers. This approach excels at capturing sequential dependencies in navigation paths, with evaluations showing improved silhouette scores for session grouping in e-commerce logs.

Case studies in image retrieval after 2015 highlight deep clustering's impact, where convolutional neural networks extract features before clustering to enable content-based search in vast databases. For example, web-scale image collections of 100 million items have been clustered using deep representations and approximate nearest neighbors, achieving sub-hour processing on single machines for retrieval tasks in search engines.
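A hedged sketch of the TF-IDF plus k-means document-clustering workflow described above is shown below, using scikit-learn and the 20 Newsgroups benchmark; the selected categories and vectorizer settings are illustrative assumptions, and the dataset is downloaded on first use.

```python
# Document clustering: TF-IDF vectorization followed by k-means.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

cats = ["sci.space", "rec.sport.hockey", "talk.politics.mideast", "comp.graphics"]
docs = fetch_20newsgroups(subset="train", categories=cats,
                          remove=("headers", "footers", "quotes")).data

tfidf = TfidfVectorizer(max_df=0.5, min_df=5, stop_words="english")
X = tfidf.fit_transform(docs)                 # sparse document-term matrix

km = KMeans(n_clusters=len(cats), n_init=10, random_state=0).fit(X)

# Inspect the most heavily weighted terms in each cluster centroid
terms = tfidf.get_feature_names_out()
for i, center in enumerate(km.cluster_centers_):
    top = center.argsort()[-8:][::-1]
    print(f"cluster {i}:", ", ".join(terms[j] for j in top))
```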
In big data contexts, MapReduce implementations of k-means distribute centroid updates across nodes, scaling to terabyte-scale datasets by parallelizing distance computations and iterations, as shown in frameworks that reduce convergence time by factors of 10-20 on clusters. Clustering integrates seamlessly into machine learning pipelines via libraries like scikit-learn, where estimators such as KMeans or DBSCAN are chained with preprocessing steps like standardization in Pipeline objects for end-to-end workflows. This modular design supports deployment in production systems, from feature extraction to model evaluation, ensuring reproducible clustering in applications like web analytics.
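The following small example illustrates that pipeline pattern, chaining standardization with MiniBatchKMeans (the mini-batch variant often used in scalable settings) on synthetic data; the step names and parameter values are arbitrary assumptions.

```python
# Preprocessing and clustering chained into a single scikit-learn estimator.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, n_features=8, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                       # zero mean, unit variance
    ("cluster", MiniBatchKMeans(n_clusters=5, batch_size=256,
                                n_init=5, random_state=0)),
])
labels = pipe.fit_predict(X)
print(labels[:20])
```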

Advanced Topics

Fuzzy and Probabilistic Clustering

Fuzzy clustering extends traditional hard partitioning by allowing data points to belong to multiple clusters with varying degrees of membership, represented by values in the interval [0,1] that sum to 1 across all clusters for each point. This approach is particularly useful for handling overlapping clusters and ambiguous data where strict boundaries are inappropriate. The seminal fuzzy c-means (FCM) algorithm, originally proposed by Dunn in 1973 and generalized by Bezdek in 1981, minimizes an objective function that incorporates membership degrees raised to a fuzziness parameter m > 1: J = \sum_{i=1}^n \sum_{k=1}^c u_{ik}^m \| x_i - v_k \|^2, where u_{ik} is the membership of data point x_i in cluster k, v_k is the cluster prototype, and \| \cdot \|^2 denotes the squared Euclidean norm. The algorithm iteratively updates memberships and prototypes: u_{ik} = \frac{1}{\sum_{j=1}^c \left( \frac{\| x_i - v_k \|}{\| x_i - v_j \|} \right)^{2/(m-1)}} and v_k = \frac{\sum_{i=1}^n u_{ik}^m x_i}{\sum_{i=1}^n u_{ik}^m}. Convergence is reached when changes in J (or in the membership matrix) fall below a chosen tolerance.

Probabilistic clustering interprets these soft assignments as posterior probabilities of cluster membership, often modeled via finite mixture distributions such as Gaussian mixtures. The expectation-maximization (EM) algorithm, introduced by Dempster et al. in 1977, iteratively estimates mixture parameters by maximizing the likelihood, yielding probabilistic cluster assignments in the E-step. This framework links fuzzy methods to statistical mixture modeling, enabling uncertainty quantification in cluster assignments.

Possibilistic clustering, developed by Krishnapuram and Keller in 1993, relaxes the sum-to-1 constraint on memberships, treating them as degrees of possibility rather than probabilities. This modification enhances robustness to noise and outliers by avoiding forced assignments to all clusters, with an objective function similar to FCM but incorporating a parameter to control typicality.

In applications, fuzzy clustering excels in image processing, such as MRI brain tissue segmentation, where it delineates regions with gradual transitions and handles intensity inhomogeneities effectively. It also supports decision-making under uncertainty by providing graded memberships in ambiguous scenarios. Key advantages of fuzzy and probabilistic methods include robustness to outliers and noise compared to hard clustering, as memberships dilute the influence of anomalous points. However, they require careful tuning of the fuzziness parameter m, where values near 1 yield crisp partitions and larger m increases overlap but may lead to less distinct clusters.
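The NumPy sketch below implements the FCM update equations above on synthetic two-cluster data; it is an illustrative implementation under those assumptions, and maintained packages such as scikit-fuzzy provide production versions.

```python
# Illustrative fuzzy c-means: alternate membership and prototype updates.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)                    # memberships sum to 1 per point

    for _ in range(n_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]         # prototype update v_k
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)  # n x c distances
        d = np.fmax(d, 1e-10)                            # avoid division by zero
        # membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1)), axis=2)
        if np.linalg.norm(U_new - U) < tol:
            U = U_new
            break
        U = U_new
    return U, V

X = np.vstack([np.random.default_rng(1).normal(0, 0.4, (50, 2)),
               np.random.default_rng(2).normal(3, 0.4, (50, 2))])
U, V = fuzzy_c_means(X, c=2)
print(V)               # cluster prototypes
print(U[:5].round(3))  # soft memberships for the first few points
```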

Recent Developments and Challenges

In recent years, deep clustering has emerged as a significant advancement, integrating deep neural networks with traditional clustering techniques to learn representations and cluster assignments jointly. A seminal approach is Deep Embedded Clustering (DEC), introduced in 2016, which uses an autoencoder to map data into a lower-dimensional space and optimizes cluster assignments via a clustering loss that minimizes distances to cluster centers while repelling points from other clusters. This method has inspired subsequent work, including surveys highlighting its role in improving clustering on complex datasets through end-to-end learning. Building on this, graph neural networks (GNNs) have been adapted for clustering in the 2020s, particularly for graph-structured data; for instance, Deep Modularity Networks (DMoN) employ unsupervised GNN pooling inspired by modularity measures to identify dense node communities, outperforming traditional methods on graphs.

Scalability remains a key focus for handling large-scale and streaming data. Mini-batch k-means, a variant of k-means that processes data in small random subsets, reduces computational time while approximating the full-batch objective, making it suitable for massive datasets. Similarly, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), originally proposed in 1996, has seen updates for streaming environments, where it builds hierarchical summaries of data points to enable incremental clustering without full data access. These methods address the limitations of classical algorithms in high-volume scenarios.

Subspace and ensemble clustering have advanced to manage multi-view and multimodal data prevalent in modern applications. Multi-view subspace clustering seeks low-dimensional subspaces shared across views, with recent methods like adaptive consensus graph learning constructing unified affinity matrices from multiple data representations to enhance robustness. Ensemble approaches integrate diverse clustering results, particularly for multimodal data in the 2020s, by combining subspace recoveries to mitigate view-specific noise and improve overall partition quality.

Despite these progressions, clustering faces persistent challenges. The curse of dimensionality leads to sparsity and distance-metric degradation in high-dimensional spaces, complicating meaningful cluster separation; dimensionality reduction techniques like t-SNE, applied as preprocessing, help by embedding data into lower dimensions while preserving local structures. Interpretability is another hurdle, especially in deep clustering models, where explainable AI (XAI) methods aim to elucidate cluster decisions through inherent model design or post-hoc explanations, such as prototype-based interpretations. Ethical concerns arise from biases in AI clustering, often stemming from unrepresentative training data that perpetuate discriminatory groupings, such as in social or medical applications, necessitating bias audits and diverse datasets.

Looking ahead, quantum clustering prototypes show promise for substantial speedups on complex problems. Variational quantum algorithms, explored since 2023, use quantum circuits to optimize clustering objectives, enabling multi-cluster detection on noisy intermediate-scale quantum (NISQ) devices; subsequent hybrid quantum-classical approaches, such as quantum-assisted k-means in 2024, and NISQ-compatible improvements in 2025, further enhance practicality for real-world datasets.
Additionally, federated learning frameworks support privacy-preserving clustering by allowing decentralized model training across clients without sharing raw data, as in patient clustering across institutions; recent advances in clustered federated learning as of 2025 address data heterogeneity in healthcare and other privacy-sensitive applications through model-based and hybrid partitioning strategies. These directions aim to balance computational efficiency, privacy, and ethical integrity in evolving data landscapes.

References

  1. [1]
    [PDF] Cluster Analysis: Basic Concepts and Algorithms
    Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should ...
  2. [2]
    Cluster analysis: A modern statistical review - Jaeger - 2023
    Aug 19, 2022 · It focuses on hierarchical agglomerative clustering, k-means clustering, mixture models, and then several related topics of which any cluster analysis ...
  3. [3]
    Finding groups in data: an introduction to cluster analysis - NIH
    Jun 4, 2023 · Cluster analysis, known as the unsupervised classification, looks for the natural groups or clusters in a given set of data, ...
  4. [4]
    [PDF] Cluster Analysis: Basic Concepts and Algorithms
    Cluster analysis divides data into groups based on data information, where objects within a group are similar and related to each other.
  5. [5]
    Data clustering: a review: ACM Computing Surveys: Vol 31, No 3
    Sep 1, 1999 · Data clustering: a review. Authors: A. K. Jain, M. N. Murty, P. J. Flynn. ACM Computing Surveys, Vol. 31, No. 3.
  6. [6]
  7. [7]
    A general theory of classificatory sorting strategies - Oxford Academic
    G. N. Lance, W. T. Williams, A general theory of classificatory sorting strategies: II. Clustering systems, The Computer Journal, Volume 10, Issue 3, 1967 ...
  8. [8]
    Some methods for classification and analysis of multivariate ...
    VOL. 5.1 | 1967 Some methods for classification and analysis of multivariate observations. Chapter Author(s) J. MacQueen. Editor(s) Lucien M. Le Cam, ...
  9. [9]
    SPSS Library: A history of SPSS statistical features - OARC Stats
    This white paper lists the statistical enhancements that we've added to the SPSS product line starting with version 5.0 through our upcoming version, 9.0.
  10. [10]
    Data clustering: 50 years beyond K-means - ScienceDirect.com
    Jun 1, 2010 · We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering ...
  11. [11]
    [PDF] A survey of hierarchical clustering algorithms
    Hierarchical clustering algorithms divide into two categories: Agglomerative and Divisive. Agglomerative clustering executes in a bottom–top fashion, which ...
  12. [12]
    [PDF] An Efficient and Effective Generic Agglomerative Hierarchical ...
    In this paper, we focus on hierarchical clustering methods. There are two kinds of proce- dures: agglomerative and divisive. The former type builds the ...
  13. [13]
    [PDF] Efficient Algorithms for - Agglomerative Hierarchical Clustering ...
    Sibson (1973) describes a single linkage SAHN algorithm based on the efficient extension of a hierarchical clustering of m objects to one of (m+1) objects ...
  14. [14]
    [PDF] A Characterization of Linkage-Based Hierarchical Clustering
    This class includes commonly-used algorithms such as single-linkage, average-linkage, complete-linkage, and Ward's method. In this paper, we provide a property- ...
  15. [15]
    [PDF] Ward's Hierarchical Clustering Method - arXiv
    Dec 11, 2011 · Ward's method minimizes the increase in total within-cluster sum of squared error.
  16. [16]
    SLINK: An optimally efficient algorithm for the single-link cluster ...
    Jan 1, 1973 · The SLINK algorithm carries out single-link (nearest-neighbour) cluster analysis on an arbitrary dissimilarity coefficient and provides a representation of the ...
  17. [17]
    An efficient algorithm for a complete link method - Oxford Academic
    The algorithm is based, like the algorithm for the single link cluster method (Slink) presented by Sibson (1973), on a compact representation of a dendrogram: ...
  18. [18]
    Inconsistency coefficient - MATLAB - MathWorks
    This MATLAB function returns the inconsistency coefficient for each link of the hierarchical cluster tree Z generated by the linkage function.
  19. [19]
    Clustering Algorithms: Their Application to Gene Expression Data
    Nov 30, 2016 · This review examines the various clustering algorithms applicable to the gene expression data in order to discover and provide useful knowledge.
  20. [20]
    [PDF] Hierarchical clustering of gene expression data
    One is hierarchical clustering. For example, Eisen et al [1] used hierarchical agglomerative clustering (HAC) to cluster two spotted DNA microarray data and ...
  21. [21]
  22. [22]
  23. [23]
    Density-Based Clustering in Spatial Databases: The Algorithm ...
    Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. 2nd ...
  24. [24]
    OPTICS: ordering points to identify the clustering structure
    OPTICS: ordering points to identify the clustering structure ; Mihael Ankerst ; Markus M. · Breunig ; Hans-Peter Kriegel ; Jörg Sander.
  25. [25]
    [PDF] A Survey of Some Density Based Clustering Techniques - arXiv
    In this paper, a study of these methods is done along with their characteristics, advantages and disadvantages and most importantly – their applicability to ...
  26. [26]
    Improved Earthquake Clustering Using a Density‐Adaptive ...
    Dec 6, 2023 · In this study, we propose a modification to the DBSCAN algorithm by introducing an event density map replacing S_min and ε.
  27. [27]
    Mixture modelling for cluster analysis - G J McLachlan, S U Chang ...
    Cluster analysis via a finite mixture model approach is considered. With this approach to clustering, the data can be partitioned into a specified number of ...
  28. [28]
    Maximum Likelihood from Incomplete Data Via the EM Algorithm
    Summary. A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality.
  29. [29]
    Markov Chain Sampling Methods for Dirichlet Process Mixture Models
    Feb 21, 2012 · This article reviews Markov chain methods for sampling from the posterior distribution of a Dirichlet process mixture model and presents two new classes of ...
  30. [30]
    [PDF] Markov Chain Sampling Methods for Dirichlet Process Mixture ...
    Sep 21, 2007 · This article reviews Markov chain methods for sampling from Dirichlet process mixture models, presenting two new approaches: Metropolis- ...
  31. [31]
    [PDF] SOFTWARE REVIEW Enhanced Model-Based Clustering, Density ...
    The MCLUST software, with offerings originally limited to model-based hierarchical clustering, was extended to include EM for parameterized Gaussian mixture ...
  32. [32]
    Clustering via the Bayesian information criterion with applications in ...
    We propose to choose the number of clusters by optimizing the Bayesian information criterion (BIC), a model selection criterion in the statistics literature.
  33. [33]
    [PDF] STING: A Statistical Information Grid Approach to Spatial Data Mining
    This is the first paper that introduces clustering techniques into spatial data mining problems and it represents a significant improvement on large data sets ...
  34. [34]
    [PDF] Survey of Clustering Data Mining Techniques
    Stat. Assoc., 58, 301, 235-244. WANG, W., YANG, J., and MUNTZ, R. 1997. STING: a statistical information grid approach to spatial data mining. In Proceedings ...
  35. [35]
    [PDF] Automatic Subspace Clustering of High Dimensional Data for Data ...
    CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized ...
  36. [36]
    (PDF) A Survey of Grid Based Clustering Algorithms - ResearchGate
    Aug 10, 2025 · Grid based methods quantize the object space into a finite number of cells (hyper-rectangles) and then perform the required operations on the quantized space.
  37. [37]
    [PDF] On Spectral Clustering: Analysis and an algorithm Andrew Y. Ng CS ...
    In this paper, we present a simple spectral clustering algorithm that can be ... In Neural Information Processing Systems 14, 2002. [3] F. Chung ...
  38. [38]
    [0711.0189] A Tutorial on Spectral Clustering - arXiv
    Nov 1, 2007 · The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties.
  39. [39]
    [PDF] Self-organized formation of topologically correct feature maps
    This work contains a theoretical study and computer simulations of a new self-organizing process. The principal discovery is that in a simple network of.
  40. [40]
    [PDF] Normalized cuts and image segmentation - People @EECS
    SHI AND MALIK: NORMALIZED CUTS AND IMAGE SEGMENTATION. Fig. 1. A case where minimum cut gives a bad partition. ... groups, are in fact identical and ...
  41. [41]
    From A-to-Z review of clustering validation indices - ScienceDirect.com
    Oct 7, 2024 · The main goal of this study is to comprehensively review and explain the mathematical operation of internal and external cluster validity indices.
  42. [42]
    A Cluster Separation Measure | IEEE Journals & Magazine
    A measure is presented which indicates the similarity of clusters which are assumed to have a data density which is a decreasing function of distance.
  43. [43]
    A graphical aid to the interpretation and validation of cluster analysis
    November 1987, Pages 53-65. Journal of Computational and Applied Mathematics. Silhouettes: A graphical aid to the interpretation and validation of cluster ...
  44. [44]
    A dendrite method for cluster analysis: Communications in Statistics
    Original Articles. A dendrite method for cluster analysis. T. Caliński, Academy of Agriculture, Poznań, Poland, and J. Harabasz, Academy of Agriculture, Poznań, ...
  45. [45]
    Assessing Clustering Tendency - Datanovia.com
    In this article, we described how to assess clustering tendency using the Hopkins statistics and a visual method. After showing that a data is clusterable, the ...
  46. [46]
    Validating clusters using the Hopkins statistic - ResearchGate
    Aug 6, 2025 · A novel scheme for cluster validity using a test for random position hypothesis is proposed. The random position hypothesis is tested against an alternative ...
  47. [47]
    Cluster analysis and display of genome-wide expression patterns
    A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms.
  48. [48]
    Common Methods for Phylogenetic Tree Construction and Their ...
    May 11, 2024 · In this review, we summarize common methods for constructing phylogenetic trees, including distance methods, maximum parsimony, maximum likelihood, Bayesian ...
  49. [49]
    Diffusion Model Based Spectral Clustering for Protein-Protein ...
    We propose a diffusion model-based spectral clustering algorithm, which analytically solves the cluster structure of PPI networks as a problem of random walks.
  50. [50]
    Cd-hit: a fast program for clustering and comparing large sets of ...
    Cd-hit is a fast program for clustering and comparing large protein or nucleotide sequences, using short word filtering to efficiently handle huge databases.
  51. [51]
    (PDF) OTU Clustering: A window to analyse uncultured microbial ...
    Jan 23, 2018 · It is an important tool in genomics and metagenomics which performs taxonomic profiling of the microbial world by grouping 16S RDNA amplicon ...
  52. [52]
    Patient Stratification Using Electronic Health Records from a Chronic ...
    The goal of this study is to devise a machine learning framework to assist care coordination programs in prognostic stratification.
  53. [53]
    MRI Brain Tumor Segmentation and Analysis using Rough-Fuzzy C ...
    Present work proposes an automated brain tumor segmentation method using rough-fuzzy C-means (RFCM) and shape based topological properties.
  54. [54]
    Hydra: A mixture modeling framework for subtyping pediatric cancer ...
    Apr 10, 2020 · We developed a computational framework to automatically detect multimodal expression distributions within well-defined disease populations.
  55. [55]
    (PDF) An In-Depth Analysis of COVID-19 Symptoms Considering the ...
    Jun 30, 2025 · More precisely, we implemented clustering techniques to discover symptomatic patterns across the various dimensions. For instance, in analyzing ...
  56. [56]
    Comparative benchmarking of single-cell clustering algorithms for ...
    Sep 3, 2025 · However, high-dimensional transcriptomic data also presents challenges, such as redundancy and noise. In this section, we first examine the ...
  57. [57]
    Market Segmentation: Conceptual and Methodological Foundations
    The book starts with a framework for considering the various bases and methods available for conducting segmentation studies.
  58. [58]
    An Exploration of Clustering Algorithms for Customer Segmentation ...
    Oct 12, 2023 · This paper explores K-means, GMM, DBSCAN, agglomerative clustering, and BIRCH for customer segmentation in the UK retail market, using the RFM ...
  59. [59]
    [PDF] Clustering Approaches for Financial Data Analysis: a Survey - arXiv
    In this paper, we survey different clustering algorithms for analysing different financial datasets for a variety of applications; credit cards fraud detection, ...
  60. [60]
    Cluster Analysis for mixed data: An application to credit risk evaluation
    This paper highlights the relevance of both quantitative and qualitative features of applicants and proposes a new methodology based on mixed data clustering ...
  61. [61]
    An algorithm for clustering relational data with applications to social ...
    A method of hierarchical clustering for relational data is presented, which begins by forming a new square matrix of product-moment correlations.
  62. [62]
    A robust data-driven approach identifies four personality types ...
    Sep 17, 2018 · The FFM surmises the existence of five traits—neuroticism, extraversion, openness, agreeableness and conscientiousness—that capture the main ...
  63. [63]
    Clustering Methods with Qualitative Data: A Mixed ... - PubMed Central
    Cluster analysis, using methods like hierarchical, K-Means, and latent class analysis, can clarify findings from qualitative data in prevention research.
  64. [64]
    2.3. Clustering — scikit-learn 1.7.2 documentation
    The algorithm has a time complexity of the order O(N²T), where N is the number of samples and T is the number of iterations until convergence. Further, ...
  65. [65]
    Clustering text documents using k-means - Scikit-learn
    Text documents are clustered using k-means with a Bag of Words approach, using TfidfVectorizer or HashingVectorizer, and LSA for dimensionality reduction.
  66. [66]
    20 news groups | K-Means Clustering - Kaggle
    Application: TF-IDF is commonly used for tasks such as document similarity, text classification, and information retrieval. It helps to identify and prioritize ...
  67. [67]
    [PDF] Topical Clustering of Search Results - Google Research
    Feb 12, 2012 · ABSTRACT. Search results clustering (SRC) is a challenging algorithmic problem that requires grouping together the results returned.
  68. [68]
    Density-Based Clustering and Anomaly Detection - ResearchGate
    This paper presents the results of the analysis of the network intrusion detection systems using data mining techniques and anomaly detection. Anomaly detection ...
  69. [69]
    A SURVEY ON THE USE OF DATA CLUSTERING FOR INTRUSION ...
    Density-based clustering algorithm clusters data points based on the density of the points in a region. DBSCAN is one of the best density-based algorithms. Yang ...
  70. [70]
    A study on a recommendation algorithm based on spectral ...
    This article proposes an optimization method for a recommendation system based on spectral clustering (SC) and gated recurrent unit (GRU), named the GRU-KSC ...
  71. [71]
    An overview of video recommender systems: state-of-the-art and ...
    2. Unsupervised learning. Video-based collaborative filtering often starts with clustering to decrease the search space of the model-based approach.
  72. [72]
    (PDF) Web user profiling using hierarchical clustering with improved ...
    Feb 19, 2018 · ... Web usage mining process. The proposed method defines user sessions as a set of navigation paths in the Web graph and produces a complete ...
  73. [73]
    A Review on Clustering Techniques: Creating Better User ... - MDPI
    Sep 13, 2021 · This paper presents a review of clustering techniques used in web usage mining, namely the partition-based, hierarchical, density-based, and fuzzy clustering ...
  74. [74]
    ICCV 2015 Open Access Repository
    Combined with powerful deep learned representations, we achieve clustering of a 100 million image collection on a single machine in less than one hour.
  75. [75]
    Efficient MapReduce Kernel k-Means for Big Data Clustering
    The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. A classic clustering algorithm is the so-called k-Means. It ...
  76. [76]
    7.1. Pipelines and composite estimators - Scikit-learn
    Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data.
  77. [77]
    Fuzzy C-means cluster analysis - Scholarpedia
    Aug 26, 2009 · Bezdek (2011), Scholarpedia, 6(7):2057. doi:10.4249/scholarpedia.2057.
  78. [78]
    2.1. Gaussian mixture models - Scikit-learn
    The GaussianMixture object implements the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussian models. It can also draw confidence ellipsoids ...
  79. [79]
  80. [80]
    A Novel Brain MRI Image Segmentation Method Using an Improved ...
    Mar 25, 2021 · The classic fuzzy c-means (FCM) algorithm is extremely sensitive to noise and offset fields. If the algorithm is used directly to segment the ...
  81. [81]
    Performance Analysis of Fuzzy C-Means Clustering Methods for MRI ...
    FCM is most usually used techniques for image segmentation of medical image applications because of its fuzzy nature, where one pixel can belong to multiple ...
  82. [82]
    Unsupervised Deep Embedding for Clustering Analysis
    In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural ...
  83. [83]
    [PDF] Deep Clustering: A Comprehensive Survey - arXiv
    Oct 9, 2022 · 8 SUMMARY OF DEEP CLUSTERING METHODS. In this paper, we introduce recent advances in the field of deep clustering. This is mainly kind of ...
  84. [84]
    Graph Clustering with Graph Neural Networks
    We introduce Deep Modularity Networks (DMoN), an unsupervised pooling method inspired by the modularity measure of clustering quality.
  85. [85]
    State-of-the-art on clustering data streams | Big Data Analytics
    Dec 1, 2016 · This paper presents a comprehensive survey of the data stream clustering methods and an overview of the most well-known streaming platforms which implement ...
  86. [86]
    Multi-view Subspace Clustering via An Adaptive Consensus Graph ...
    Jan 30, 2024 · In this paper, we initially assume the existence of a consensus reconstruction coefficient matrix and then use it to build a consensus graph ...
  87. [87]
    A Survey on Multi-View Clustering - PMC - PubMed Central
    In practice, the data could be sampled from multiple subspaces. Subspace clustering [28] is the technique to find the underlying subspaces and then cluster the ...
  88. [88]
    Clustering on the output of t-SNE - Cross Validated - Stack Exchange
    Feb 23, 2017 · Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection – A Remedy Against the Curse of Dimensionality? In ...
  89. [89]
    XAI Beyond Classification: Interpretable Neural Clustering
    In this paper, we study two challenging problems in explainable AI (XAI) and data clustering. The first is how to directly design a neural network with ...
  90. [90]
    Ethical and Bias Considerations in Artificial Intelligence/Machine ...
    Three broad factors are responsible for biases in AI models: (1) data bias, which is the use of unrepresentative data; (2) development bias, which is the result ...
  91. [91]
    Variational quantum and quantum-inspired clustering - Nature
    Aug 16, 2023 · Here we present a quantum algorithm for clustering data based on a variational quantum circuit. The algorithm allows to classify data into many clusters.
  92. [92]
    Privacy-preserving patient clustering for personalized federated ...
    Jul 17, 2023 · We propose Privacy-preserving Community-Based Federated machine Learning (PCBFL), a novel Clustered FL framework that can cluster patients using patient-level ...