Rand index
In statistics and data clustering, the Rand index, introduced by William M. Rand in 1971,[1] is a measure of similarity between two clusterings of the same dataset, quantifying the proportion of pairs of data points that are assigned consistently to the same or different clusters across both clusterings. It serves as an external evaluation metric for assessing clustering algorithms by comparing their output to a reference partitioning, such as ground truth labels. The index is computed by considering all possible pairs of elements in the dataset, where the total number of such pairs is \binom{n}{2} for n data points. It is formally defined as R = \frac{a + d}{a + b + c + d}, with a denoting the number of pairs in the same cluster in both partitions, d the number in different clusters in both, b the number in the same cluster in the first but different in the second, and c the reverse. Equivalently, these counts can be derived from a contingency table whose entry n_{ij} gives the number of elements common to cluster i in the first partitioning and cluster j in the second, with n_i the size of cluster i and n'_j the size of cluster j. Key properties include a range from 0 (no agreement on any pair) to 1 (perfect agreement), invariance to permutations of cluster labels, and the fact that 1 - R provides a dissimilarity measure between partitions. However, it tends to yield high values even for random clusterings, particularly when many pairs are ungrouped in both, prompting the development of adjusted variants to correct for chance agreement. In practice, the Rand index is widely applied in machine learning for validating unsupervised clustering methods like k-means or hierarchical clustering against known structures, as implemented in libraries such as scikit-learn.[2]

Background
Cluster analysis basics
Cluster analysis is a core method in unsupervised machine learning that groups similar objects into clusters based on their features or attributes, without relying on labeled training data to guide the process.[3] This technique aims to partition datasets such that objects within the same cluster exhibit high internal similarity (often measured by distance metrics like Euclidean distance) while differing substantially from those in other clusters.[4] By uncovering hidden patterns and structures in unlabeled data, cluster analysis serves as a foundational tool for exploratory data analysis across various domains. Clustering approaches are broadly categorized into hard and soft variants. Hard clustering assigns each object to precisely one cluster, producing non-overlapping partitions of the dataset, as exemplified by the k-means algorithm where assignments are binary and exclusive.[5] In soft clustering, objects can belong to multiple clusters simultaneously with degrees of membership, typically represented as probabilities or fuzzy values that sum to one, enabling the capture of overlapping or ambiguous group structures; fuzzy c-means, originally proposed by Dunn in 1973 and generalized by Bezdek in 1981, is a prominent example.[5][6] Common applications of cluster analysis span data mining, where it supports tasks like customer segmentation and outlier detection; bioinformatics, including the grouping of gene expression profiles and protein interaction networks; and pattern recognition, such as classifying images or documents based on inherent similarities.[7][8] Historically, cluster analysis emerged in the 1950s and 1960s through interdisciplinary efforts in statistics and computer science, with early methods like k-means independently developed by Steinhaus in 1956 and Lloyd in 1957, though evaluation measures for comparing clusterings arose later.[9] A key output of these methods is a partition of the dataset into cohesive groups.

Partitions and equivalence relations
In mathematics, a partition of a set U is defined as a subdivision of the set into nonempty subsets that are pairwise disjoint and whose union equals U.[10] This structure ensures that every element of U belongs to exactly one subset, providing a way to group elements without overlap or omission.[11] Cluster analysis involves generating such partitions to group data points based on similarity, where each cluster forms one of the disjoint subsets.[4] For a finite set S with n elements, a clustering can be denoted as a partition of S into k nonempty subsets, often written as \mathcal{P} = \{ C_1, C_2, \dots, C_k \}, where \bigcup_{i=1}^k C_i = S and C_i \cap C_j = \emptyset for i \neq j.[12] Partitions are closely linked to equivalence relations on a set. An equivalence relation \sim on a set A is a binary relation that satisfies reflexivity (for all a \in A, a \sim a), symmetry (if a \sim b, then b \sim a), and transitivity (if a \sim b and b \sim c, then a \sim c).[13] Such a relation induces a partition of A into equivalence classes, where two elements are in the same class (or cluster) if they are related by \sim, and the classes form the disjoint subsets covering A.[14] Conversely, any partition of A defines an equivalence relation by placing elements in the same subset if and only if they are related.[15] To illustrate, consider the set \{1, 2, 3, 4\}. One partition is \{\{1, 2\}, \{3, 4\}\}, grouping 1 with 2 and 3 with 4, while another is \{\{1, 3\}, \{2, 4\}\}, pairing 1 with 3 and 2 with 4; these differ in how elements are assigned to subsets, highlighting distinct groupings without overlap.[11]

Core Concepts
Definition of the Rand index
The Rand index is a measure of the similarity between two data clusterings, based on the pairwise agreement between their cluster assignments. Given two partitions U and V of a set S with n = |S| elements, the Rand index R(U, V) quantifies the proportion of pairs of elements that are either both placed in the same cluster or both placed in different clusters in U and V. This index was introduced by William M. Rand in 1971 as an objective criterion for evaluating clustering methods.[16] Formally, let a denote the number of element pairs that are in the same cluster in both U and V, b the number of pairs in the same cluster in U but different clusters in V, c the number of pairs in different clusters in U but the same cluster in V, and d the number of pairs in different clusters in both partitions. The total number of possible pairs is \binom{n}{2} = a + b + c + d. The Rand index is then defined as R(U, V) = \frac{a + d}{a + b + c + d}, where a + d represents the concordant pairs (agreeing on cluster membership) and b + c the discordant pairs (disagreeing).[16] To illustrate, consider a small set S = \{1, 2, 3, 4\} with partitions U = \{\{1, 2\}, \{3, 4\}\} and V = \{\{1, 2, 3\}, \{4\}\}. The possible pairs are (1,2), (1,3), (1,4), (2,3), (2,4), (3,4). Here, a = 1 (pair (1,2) same in both), b = 1 (pair (3,4) same in U but different in V), c = 2 (pairs (1,3) and (2,3) different in U but same in V), and d = 2 (pairs (1,4) and (2,4) different in both). Thus, R(U, V) = \frac{1 + 2}{1 + 1 + 2 + 2} = 0.5.[16]

Interpretation and range
The Rand index quantifies the similarity between two clusterings of a dataset by evaluating the agreement on whether pairs of elements are grouped together or separately. A value of 1 denotes perfect agreement, indicating that the two clusterings are identical in their assignments. Conversely, a value of 0 signifies no agreement, where every pair of elements is classified discordantly (either together in one clustering but apart in the other, or vice versa), resulting in complete opposition between the partitions. The index always lies within the closed interval [0, 1], providing a bounded measure that facilitates straightforward comparisons across different clustering evaluations. This range ensures that the metric is normalized, with higher values reflecting greater concordance in the pairwise decisions made by the clusterings. Intuitively, the Rand index represents the fraction of all possible pairs of data points that are correctly classified as similar or dissimilar by both clusterings relative to the total number of pairs. This pairwise perspective underscores its foundation in counting agreements on element relationships, such as pairs in the same or different clusters across the two partitions. One limitation of the Rand index is its sensitivity to cluster sizes, where the presence of large clusters in both partitions can lead to inflated values even if the overall structures differ substantially, as many pairs within those clusters may agree by coincidence.

Properties and Relations
Mathematical properties
The Rand index exhibits several key mathematical properties that underpin its utility in evaluating clustering similarity. It is invariant to permutations of cluster labels, meaning that relabeling the clusters in either partition does not alter the index value, as the measure relies solely on pairwise agreements rather than specific label assignments. This invariance ensures that the index remains consistent under arbitrary renumbering of clusters, a desirable feature for unsupervised learning contexts where labels are not predefined. Additionally, the Rand index is scale-invariant with respect to the dataset size n, as it normalizes by the total number of pairs \binom{n}{2}, allowing direct comparability across datasets of varying sizes. The index is symmetric, satisfying R(U, V) = R(V, U) for any two partitions U and V, due to the symmetric treatment of pairs in the underlying counting of agreements and disagreements. Regarding metric-like behavior, the Rand index as a similarity measure is non-negative and bounded in [0, 1], with R = 1 indicating identical partitions. The associated distance d(U, V) = 1 - R(U, V) is the fraction of pairs on which the two partitions disagree; since this is the normalized Hamming distance between the pair-membership indicator vectors of the two partitions (equivalently, the normalized Mirkin metric), it is symmetric, non-negative, satisfies the triangle inequality, and vanishes only for identical partitions, making it a genuine metric on the space of partitions. The Rand index is bounded in [0, 1] because it represents the proportion of agreeing pairs out of all possible pairs, with the counts of true positives (pairs clustered together in both), true negatives (pairs separated in both), false positives, and false negatives being non-negative integers that sum to \binom{n}{2}. Thus, the numerator (agreeing pairs) ranges from 0 to the total, yielding 0 \leq R \leq 1.
A brief proof follows from the non-negativity: let a, b, c, d \geq 0 denote the respective pair counts with a + b + c + d = T = \binom{n}{2}; then R = (a + d)/T \in [0, 1] since 0 \leq a + d \leq T. Under a random clustering model where one partition is fixed and the other is generated randomly (e.g., via uniform random assignment respecting cluster sizes), the expected value of the Rand index is given by E[R] = \frac{S_U S_V + (T - S_U)(T - S_V)}{T^2}, where T = \binom{n}{2}, S_U = \sum_i \binom{|U_i|}{2}, and S_V = \sum_j \binom{|V_j|}{2}. This expectation varies with cluster size distributions; for balanced partitions (e.g., equal-sized clusters in binary or few-cluster settings), E[R] \approx 0.5, reflecting approximately half the pairs agreeing by chance, though it approaches 1 for many singleton clusters where most pairs are separated in both.

Relation to classification accuracy
In supervised classification, the Rand index serves as an analogy to accuracy by treating the ground truth labels as a reference partition and the predicted labels as another partition, thereby measuring how well the predictions recover the underlying class structure through pair-wise agreements. This approach is particularly relevant in semi-supervised settings, where ground truth may be available only at the cluster level rather than for individual instances, allowing evaluation of partition similarity without requiring exact label matching.[17][18] Unlike standard classification accuracy, which computes the proportion of instances correctly assigned to their true class, the Rand index evaluates pair-wise concordance: it counts the fraction of instance pairs that are either both correctly placed in the same class or correctly separated into different classes. In binary classification, these metrics align closely, as pair-wise decisions mirror instance-level correctness under fixed labels. However, they diverge in multi-class scenarios, where the Rand index remains invariant to label permutations—yielding a high value for well-separated classes even if labels are swapped—while accuracy drops to zero without optimal label alignment; this difference is pronounced in imbalanced datasets, where accuracy may inflate by favoring majority classes, but the Rand index better captures structural fidelity.[19][18] For ground truth partition U and predicted partition V over n instances, the Rand index is given by R(U, V) = 1 - \frac{b + c}{\binom{n}{2}}, where b and c denote the numbers of pairs incorrectly grouped together or apart, respectively, and \binom{n}{2} is the total number of pairs; this formulation positions the index as one minus the normalized error rate in pair-wise label assignments.[17] Consider a multi-class dataset like the UCI Glass identification set with imbalanced classes (six types with sizes 70, 76, 17, 13, 9, 29; type 4 absent). 
In such imbalanced datasets, the Rand index penalizes cross-class confusions in minority groups more than per-instance accuracy, highlighting the index's emphasis on relational structure over individual matches.[18] This property renders the Rand index valuable in applications where instance-level ground truth is unavailable, such as evaluating semi-supervised models that produce cluster assignments for unlabeled data, enabling assessment against partial supervision via partition comparison.[18]

Adjusted Rand Index
Motivation for adjustment
The raw Rand index, while providing a measure of agreement between two clusterings based on pairwise concurrences, exhibits a significant bias toward higher values even when the clusterings are produced randomly. This occurs particularly in scenarios involving varying numbers of clusters or unequal cluster sizes, where the index typically ranges from approximately 0.5 to 0.9 for random assignments, making it difficult to distinguish meaningful structure from noise. This bias stems from the chance agreement problem, wherein the raw index fails to account for agreements that would occur by random chance, thereby overestimating similarity without penalizing incidental matches. Analogous to Cohen's kappa in inter-rater agreement for classification tasks, the raw Rand index requires adjustment to normalize for this expected random overlap, ensuring a more reliable assessment of clustering quality.[20] To address this limitation, Hubert and Arabie proposed the adjusted Rand index in 1985, which corrects the raw measure by subtracting the expected agreement under a hypergeometric distribution model of randomness, thereby centering the baseline at zero for random clusterings.[21] The adjustment is particularly essential when comparing clustering results across diverse datasets or algorithms, as the expected value of the raw Rand index varies with dataset characteristics such as sample size and cluster granularity, potentially leading to misleading cross-study interpretations.

Contingency table construction
The contingency table provides a compact representation for comparing two clusterings, denoted as partitions U = \{U_1, \dots, U_{n_U}\} and V = \{V_1, \dots, V_{n_V}\} of a set of n elements. It is defined as an n_U \times n_V matrix \mathbf{N}, where the entry n_{ij} counts the number of elements belonging to both cluster U_i in U and cluster V_j in V. This structure captures the overlap between the partitions without explicitly listing element assignments. To construct the contingency table, iterate over each of the n elements exactly once: for an element assigned to cluster U_i in partition U and cluster V_j in partition V, increment the cell n_{ij} by 1. Upon completion, the row sums a_i = \sum_{j=1}^{n_V} n_{ij} yield the sizes of the clusters in U, the column sums b_j = \sum_{i=1}^{n_U} n_{ij} yield the sizes of the clusters in V, and the overall sum n = \sum_{i=1}^{n_U} a_i = \sum_{j=1}^{n_V} b_j equals the total number of elements. This linear-time process, with complexity O(n + n_U n_V) for building and summing, forms the basis for subsequent index calculations. For illustration, consider partitions U = \{\{1,2\}, \{3,4,5,6\}, \{7,8,9,10\}\} and V = \{\{1,3,7\}, \{2,4,5\}, \{6,8,9,10\}\} over elements labeled 1 through 10. The resulting contingency table is:

| Cluster U \ Cluster V | V_1 | V_2 | V_3 | Row sums (a_i) |
|---|---|---|---|---|
| U_1 | 1 | 1 | 0 | 2 |
| U_2 | 1 | 2 | 1 | 4 |
| U_3 | 1 | 0 | 3 | 4 |
| Column sums (b_j) | 3 | 3 | 4 | n = 10 |
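The single-pass construction described above can be sketched in a few lines of Python; this is a minimal illustration (the label vectors encoding the example partitions are written out by hand, and `contingency_table` is an illustrative helper, not a library function):

```python
from collections import Counter

def contingency_table(labels_u, labels_v):
    """Build the n_U x n_V matrix whose (i, j) entry counts elements
    assigned to cluster i in U and cluster j in V, in one pass over the data."""
    counts = Counter(zip(labels_u, labels_v))
    rows = sorted(set(labels_u))
    cols = sorted(set(labels_v))
    return [[counts[(i, j)] for j in cols] for i in rows]

# Label vectors for elements 1..10 under the example partitions:
# U = {1,2}, {3,4,5,6}, {7,8,9,10};  V = {1,3,7}, {2,4,5}, {6,8,9,10}
U = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
V = [1, 2, 1, 2, 2, 3, 1, 3, 3, 3]

table = contingency_table(U, V)
row_sums = [sum(row) for row in table]        # cluster sizes in U
col_sums = [sum(col) for col in zip(*table)]  # cluster sizes in V
```

Running this reproduces the table above, with row sums (2, 4, 4), column sums (3, 3, 4), and overall sum n = 10.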
Computation and Variants
Formula and calculation
To compute the raw Rand index from a contingency table, where n_{ij} is the number of elements in cluster i of the first partitioning and cluster j of the second, with row sums a_i = \sum_j n_{ij} and column sums b_j = \sum_i n_{ij}, first calculate:
- a = number of agreeing pairs in the same cluster in both: \sum_{ij} \binom{n_{ij}}{2}
- b = pairs in the same cluster in the first but different in the second: \sum_i \binom{a_i}{2} - a
- c = pairs in different clusters in the first but the same in the second: \sum_j \binom{b_j}{2} - a
- d = pairs in different clusters in both: \binom{n}{2} - a - b - c
The Rand index is then R = (a + d)/\binom{n}{2}. As an example, consider two partitions of n = 6 elements with the following contingency table:
| | V_1 | V_2 | Row sums |
|---|---|---|---|
| U_1 | 3 | 0 | 3 |
| U_2 | 1 | 2 | 3 |
| Column sums | 4 | 2 | n = 6 |
From this table, a = \binom{3}{2} + \binom{0}{2} + \binom{1}{2} + \binom{2}{2} = 3 + 0 + 0 + 1 = 4, b = \left[\binom{3}{2} + \binom{3}{2}\right] - 4 = 2, c = \left[\binom{4}{2} + \binom{2}{2}\right] - 4 = 3, and d = \binom{6}{2} - 4 - 2 - 3 = 6, giving R = (4 + 6)/15 \approx 0.667.
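Translating the four pair-count formulas into code gives a short sketch in Python (`rand_index_from_table` is an illustrative helper name, not a library function):

```python
from math import comb

def rand_index_from_table(table):
    """Raw Rand index from a contingency table, via the pair-count formulas:
    a from cell counts, b from row sums, c from column sums, d as the remainder."""
    n = sum(sum(row) for row in table)
    total_pairs = comb(n, 2)
    a = sum(comb(n_ij, 2) for row in table for n_ij in row)  # same cluster in both
    b = sum(comb(sum(row), 2) for row in table) - a          # same in first only
    c = sum(comb(sum(col), 2) for col in zip(*table)) - a    # same in second only
    d = total_pairs - a - b - c                              # different in both
    return (a + d) / total_pairs

# The 2x2 example table: a = 4, b = 2, c = 3, d = 6, so R = 10/15
r = rand_index_from_table([[3, 0], [1, 2]])
```

For the example table this yields R = 10/15 ≈ 0.667; a purely diagonal table (identical partitions up to relabeling) gives R = 1.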
Related indices
The Fowlkes-Mallows index (FMI) measures the similarity between two clusterings by taking the geometric mean of the precision and recall of cluster assignments, defined as \text{FMI} = \sqrt{ \frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN} }, where TP is the number of pairs correctly clustered together (same in both), FP is the number of pairs in the same predicted cluster but different true clusters, and FN is the number of pairs in different predicted clusters but the same true cluster.[22] This index is particularly sensitive to cluster overlap, as it penalizes both false positive and false negative pairs while ignoring true negatives.[23] The Jaccard index, originally a set similarity measure, is adapted for clustering evaluation by computing the ratio of intersecting pairs (correctly clustered together) to the union of all relevant pairs, expressed as \frac{a}{a + b + c}, where b and c represent false positive and false negative pairs, respectively. This adaptation provides a simpler metric than the Rand index by focusing solely on agreements among pairs that are clustered together in at least one partitioning, thereby ignoring information about true negatives.[23] Normalized Mutual Information (NMI) offers an information-theoretic approach to clustering similarity, calculated as \text{NMI} = \frac{2 I(U;V)}{H(U) + H(V)}, where I(U;V) is the mutual information between clusterings U and V, and H(\cdot) denotes entropy.[24] It is especially suitable for evaluating probabilistic or overlapping clusterings, as it quantifies shared information normalized by the entropies of the individual clusterings, providing bounds between 0 and 1.[24] These indices share computational foundations with the Rand index and Adjusted Rand Index (ARI), often relying on contingency tables to tally pair agreements or class-cluster overlaps.[23] A comparison of their properties is summarized below:

| Index | Basis | Key Strength | Limitation | Chance-Corrected? |
|---|---|---|---|---|
| Rand Index | Pair counting | Captures all pair agreements (TP, TN) | Sensitive to clustering size | No |
| Adjusted Rand Index | Pair counting | Corrects for random agreement | Can be unstable for small or highly imbalanced datasets | Yes |
| Fowlkes-Mallows | Precision-recall mean | Sensitive to overlap detection | Ignores true negatives | No |
| Jaccard | Set intersection/union | Simple, focuses on positive matches | Ignores true negatives and scale | No |
| Normalized Mutual Info | Information theory | Handles probabilistic clusterings | Less intuitive for crisp partitions | No |
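The pair-counting relationships among these indices can be illustrated directly from the four counts a, b, c, d. The following Python sketch, using a hypothetical pair of label vectors, computes the Rand, Jaccard, and Fowlkes-Mallows values from a single tally:

```python
from itertools import combinations
from math import sqrt

def pair_counts(labels_u, labels_v):
    """Classify every pair of elements as a (together in both), b (together
    in U only), c (together in V only), or d (separated in both)."""
    a = b = c = d = 0
    for (u1, v1), (u2, v2) in combinations(zip(labels_u, labels_v), 2):
        same_u, same_v = u1 == u2, v1 == v2
        if same_u and same_v:
            a += 1
        elif same_u:
            b += 1
        elif same_v:
            c += 1
        else:
            d += 1
    return a, b, c, d

# Hypothetical labelings of six elements
U = [0, 0, 1, 1, 1, 2]
V = [0, 0, 1, 1, 2, 2]

a, b, c, d = pair_counts(U, V)
rand    = (a + d) / (a + b + c + d)          # counts both kinds of agreement
jaccard = a / (a + b + c)                    # drops the true negatives d
fmi     = sqrt(a / (a + b) * (a / (a + c)))  # geometric mean of pair precision/recall
```

For this pair of labelings (a, b, c, d) = (2, 2, 1, 10), so the Rand index (0.8) sits well above Jaccard (0.4) and FMI (≈0.577), illustrating how the true-negative count d inflates the Rand value relative to the indices that ignore it.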