Variation of information

Variation of information (VI) is an information-theoretic metric designed to quantify the distance between two clusterings of the same dataset, measuring the amount of information lost and gained when transitioning from one partitioning to another. Introduced by Marina Meilă in 2003, it provides a principled way to compare clusterings based on entropy and mutual information, addressing limitations of earlier similarity measures like the adjusted Rand index. Formally, for two clusterings C and C' of a set of n elements, VI is defined as \text{VI}(C, C') = H(C) + H(C') - 2I(C; C'), where H(C) denotes the entropy of clustering C, representing the expected number of bits needed to specify the cluster assignment of a randomly chosen element, and I(C; C') is the mutual information between C and C', capturing the information shared by the two partitions. This formulation derives from basic identities of information theory, under which VI can also be interpreted as the symmetric difference of the information contents of the two clusterings.

VI possesses several desirable properties that establish it as a true metric on the space of all possible clusterings: it is non-negative, achieving zero if and only if the two clusterings are identical; it is symmetric, meaning \text{VI}(C, C') = \text{VI}(C', C); and it satisfies the triangle inequality, \text{VI}(C, C'') \leq \text{VI}(C, C') + \text{VI}(C', C''), allowing for meaningful comparisons across multiple clusterings. Unlike mutual information alone, which is a similarity rather than a distance, VI is bounded by the sum of the individual entropies, 0 \leq \text{VI}(C, C') \leq H(C) + H(C').

In practice, VI is widely applied in machine learning and data analysis for evaluating clustering algorithms, such as comparing the output of different methods on the same data or assessing stability across runs. It is particularly valued for its independence from cluster labeling and its ability to handle varying numbers of clusters, making it suitable for hierarchical and probabilistic clustering contexts. Meilă's axiomatic derivation further justifies VI as the unique criterion that is aligned with the lattice of partitions (respecting the partial order of clusterings) and convexly additive, though it also highlights an impossibility result for bounded metrics that would satisfy these axioms simultaneously.

Overview

Definition

Variation of information (VI) is a metric used to quantify the distance between two clusterings of the same dataset, treating clusterings as partitions of a set X with |X| = n into non-empty, disjoint subsets called clusters. Intuitively, VI measures the total amount of information that must be communicated to transform one clustering into the other, capturing the information lost from the original clustering and the information gained in the new one. This notion is grounded in information theory, where entropy and mutual information serve as foundational tools for assessing uncertainty and shared information in random variables representing cluster assignments.

Formally, given two clusterings C and C', VI is defined as \text{VI}(C, C') = H(C) + H(C') - 2I(C; C'), where H(C) denotes the entropy of the random variable indicating the cluster assignment under C (with elements drawn uniformly from X), H(C') is defined analogously, and I(C; C') is the mutual information between these assignment variables. This formulation equals the sum of the conditional entropies H(C'|C) + H(C|C'), representing the expected additional bits needed to specify the clusters of one clustering given the other.

To illustrate, consider a dataset X = \{1, 2, 3, 4\} with two clusterings C = \{\{1,2\}, \{3,4\}\} and C' = \{\{1,3\}, \{2,4\}\}, each partitioning X into two clusters of size 2. The entropy H(C) = -2 \cdot (0.5 \log_2 0.5) = 1 bit, and similarly H(C') = 1 bit, since the assignments are balanced. The joint entropy H(C, C') = 2 bits, as the four intersection cells each contain one element with probability 1/4. Thus I(C; C') = H(C) + H(C') - H(C, C') = 0 bits, yielding \text{VI}(C, C') = 1 + 1 - 2 \cdot 0 = 2 bits, indicating the clusterings share no information and require full reconfiguration.
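The arithmetic of this example can be checked directly. The following minimal sketch (assuming NumPy is available; the label encodings of the two partitions are illustrative) rebuilds the 2 × 2 contingency table and reproduces H(C) = H(C') = 1 bit, I(C; C') = 0, and VI = 2 bits:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector (zero entries ignored)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Cluster labels of elements 1..4 under each clustering.
labels_c  = np.array([0, 0, 1, 1])   # C  = {{1,2},{3,4}}
labels_cp = np.array([0, 1, 0, 1])   # C' = {{1,3},{2,4}}
n = labels_c.size

# 2x2 contingency table of intersection sizes, then joint probabilities.
table = np.zeros((2, 2))
for a, b in zip(labels_c, labels_cp):
    table[a, b] += 1
p_joint = table / n

h_c  = entropy_bits(p_joint.sum(axis=1))   # H(C)  = 1 bit
h_cp = entropy_bits(p_joint.sum(axis=0))   # H(C') = 1 bit
h_j  = entropy_bits(p_joint.ravel())       # H(C, C') = 2 bits
mi   = h_c + h_cp - h_j                    # I(C; C') = 0 bits
vi   = h_c + h_cp - 2 * mi                 # VI(C, C') = 2 bits
print(h_c, h_cp, mi, vi)                   # 1.0 1.0 0.0 2.0
```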

Historical Background

The variation of information (VI) metric was introduced by Marina Meilă in her 2003 paper "Comparing Clusterings by the Variation of Information," presented at the 16th Annual Conference on Learning Theory (COLT) and published in the proceedings Learning Theory and Kernel Machines. This work established VI as an information-theoretic distance measure specifically designed for evaluating the similarity between clusterings, drawing on foundational concepts in information theory developed by Claude Shannon in 1948. Meilă's motivation stemmed from the shortcomings of earlier clustering comparison measures, such as the Rand index introduced by William M. Rand in 1971, which often fail to satisfy desirable metric properties like the triangle inequality, limiting their utility in systematic analyses. By framing clustering comparison in terms of mutual information and entropy, VI addressed these gaps, providing a measure that behaves as a true distance and aligns with probabilistic interpretations of partitions. Subsequent developments built on this foundation, with normalized variants of VI proposed by N. X. Vinh, J. Epps, and J. Bailey in 2009 and later formalized in their 2010 Journal of Machine Learning Research (JMLR) paper "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance." These refinements introduced adjustments for chance agreement and normalization to improve comparability across datasets of varying sizes, marking 2003 as the pivotal year for VI's inception and 2010 for key enhancements in its practical application.

Mathematical Formulation

Entropy-Based Expression

The variation of information (VI) between two clusterings C and C' of a dataset can be expressed in terms of conditional entropies as \text{VI}(C, C') = H(C' \mid C) + H(C \mid C'), where H(C' \mid C) represents the conditional entropy of C' given C, quantifying the additional information required to specify the clustering C' after knowing C, and H(C \mid C') similarly quantifies the additional information for C given C'. This decomposition interprets VI as the sum of the information "lost" (the uncertainty about the reference clustering that remains after conditioning on the other) and the information "gained" (the uncertainty about the new clustering that is not resolved by the reference). For clusterings, the entropy H(C) measures the uncertainty in the partition C = \{C_k\}_{k=1}^K, defined as H(C) = -\sum_{k=1}^K p_k \log p_k, where p_k = |C_k| / n is the probability that a randomly chosen element belongs to cluster k, with n the total number of elements; the entropy H(C') is defined analogously. The conditional entropy H(C' \mid C) is then H(C, C') - H(C), where the joint entropy H(C, C') is -\sum_{k,k'} p(k,k') \log p(k,k') over the joint probabilities p(k,k'). This entropy-based form connects directly to the mutual information I(C; C'), the information shared between the clusterings, given by I(C; C') = H(C) - H(C \mid C') = H(C') - H(C' \mid C), which symmetrizes to \text{VI}(C, C') = H(C) + H(C') - 2I(C; C').

To compute VI, construct the K \times K' contingency table with entries n_{kk'} = |C_k \cap C'_{k'}|, yielding joint probabilities p_{kk'} = n_{kk'}/n and marginals p_k = \sum_{k'} p_{kk'} and p'_{k'} = \sum_k p_{kk'}. Then calculate H(C), H(C'), and the joint entropy H(C, C') from these, with the conditional entropies following as differences; the logarithm is typically base 2 for measurement in bits, though base e is also used.
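As a concrete sketch of this contingency-table recipe (assuming the clusterings are supplied as integer label arrays and NumPy is available; the function name is illustrative, not a standard library call):

```python
import numpy as np

def variation_of_information(labels_u, labels_v, base=2.0):
    """VI(U, V) = H(U | V) + H(V | U), computed from the contingency table."""
    labels_u = np.asarray(labels_u)
    labels_v = np.asarray(labels_v)
    n = labels_u.size

    # Contingency table n_{kk'} = |C_k ∩ C'_{k'}|.
    _, u_idx = np.unique(labels_u, return_inverse=True)
    _, v_idx = np.unique(labels_v, return_inverse=True)
    table = np.zeros((u_idx.max() + 1, v_idx.max() + 1))
    np.add.at(table, (u_idx, v_idx), 1)

    p = table / n                    # joint probabilities p_{kk'}
    pu = p.sum(axis=1)               # marginals p_k
    pv = p.sum(axis=0)               # marginals p'_{k'}

    def H(q):
        q = q[q > 0]
        return float(-np.sum(q * np.log(q) / np.log(base)))

    # H(U|V) + H(V|U) = 2 H(U,V) - H(U) - H(V)
    return 2.0 * H(p.ravel()) - H(pu) - H(pv)

# The toy example from the Definition section gives VI = 2 bits.
print(variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 2.0
```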

Alternative Formulations

One alternative formulation of the variation of information (VI) expresses it in terms of joint entropy: the distance between two clusterings C and C' is given by \text{VI}(C, C') = 2H(C, C') - H(C) - H(C'), with H(C, C') denoting the joint entropy of the cluster assignments under the joint distribution P(k, k') = |C_k \cap C'_{k'}|/n. This form highlights the redundancy of the joint distribution relative to the marginal entropies of the individual clusterings. Equivalently, VI can be decomposed into conditional entropies, representing the information required to describe each clustering given the other: \text{VI}(C, C') = H(C \mid C') + H(C' \mid C). Here, H(C \mid C') measures the uncertainty remaining in C after knowing C', and vice versa, capturing the symmetric information loss and gain between partitions. A sum-based variant, often denoted d_{\text{sum}}(U, V), directly emphasizes the additivity of the distance as the sum of marginal entropies minus twice the mutual information: d_{\text{sum}}(U, V) = H(U) + H(V) - 2I(U; V). This expression, equivalent to the standard VI, underscores its structure as a non-negative additive measure suitable for clustering comparisons, though it is not normalized for dataset size. To address varying dataset sizes and bound VI between 0 and 1, a normalized version is defined as \text{NVI}(C, C') = \frac{\text{VI}(C, C')}{\log n}, where n is the number of data points; this scaling ensures comparability across different sample sizes while, for a fixed dataset, preserving the metric properties of the original measure.
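These formulations coincide, as a short derivation using the chain rule H(C \mid C') = H(C, C') - H(C') and the identity I(C; C') = H(C) + H(C') - H(C, C') makes explicit:

\begin{aligned}
\text{VI}(C, C') &= H(C \mid C') + H(C' \mid C) \\
&= \bigl[H(C, C') - H(C')\bigr] + \bigl[H(C, C') - H(C)\bigr] \\
&= 2H(C, C') - H(C) - H(C') \\
&= H(C) + H(C') - 2\bigl[H(C) + H(C') - H(C, C')\bigr] \\
&= H(C) + H(C') - 2I(C; C').
\end{aligned}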

Properties

Metric Axioms

The variation of information (VI) serves as a metric on the space of clusterings, satisfying the fundamental axioms of non-negativity, symmetry, and the identity of indiscernibles, which distinguish it from similarity measures such as mutual information that do not form proper distances. These properties arise from its formulation in terms of entropy and mutual information, ensuring VI quantifies dissimilarity between clusterings in a mathematically rigorous manner.

Non-negativity requires that VI(C, C') ≥ 0 for any clusterings C and C', with equality holding if and only if C = C' up to a permutation of labels. To see this, recall that \text{VI}(C, C') = H(C) + H(C') - 2I(C; C'), where H denotes entropy and I denotes mutual information. Mutual information satisfies I(C; C') \leq \min(H(C), H(C')), so assuming without loss of generality that H(C) \leq H(C'), we obtain \text{VI}(C, C') \geq H(C) + H(C') - 2H(C) = H(C') - H(C) \geq 0. Equality I(C; C') = \min(H(C), H(C')) occurs precisely when one clustering is a deterministic function of the other (i.e., one partition refines the other); combined with H(C) = H(C'), this forces the two partitions to coincide up to relabeling. This non-negativity ensures VI behaves as a valid distance, unlike mutual information, which achieves its maximum (rather than zero) when clusterings are identical.

Symmetry states that VI(C, C') = VI(C', C) for all clusterings C and C'. This follows directly from the symmetry of mutual information, I(C; C') = I(C'; C), and the symmetry of the sum of entropies, yielding identical expressions when swapping C and C'. Consequently, VI treats the order of clusterings indifferently, a key requirement for any metric that avoids directional bias in comparisons.

The identity of indiscernibles is a direct consequence of non-negativity: VI(C, C') = 0 if and only if C = C' (up to label permutation). If the clusterings are identical, then H(C|C') = 0 and H(C'|C) = 0, implying VI(C, C') = 0; conversely, if VI(C, C') = 0, the conditional entropies vanish, meaning each clustering fully determines the other. This axiom reinforces VI's utility in distinguishing distinct clusterings while equating equivalent ones.

Identities and Bounds

The variation of information satisfies several mathematical identities that relate it to entropy measures of associated partitions. In the lattice of partitions, the meet C \wedge C' is the coarsest partition that refines both C and C' (the product partition whose blocks are the non-empty intersections C_k \cap C'_{k'}). Because H(C \wedge C') equals the joint entropy H(C, C'), the conditional entropies can be written as H(C' \mid C) = H(C \wedge C') - H(C) and H(C \mid C') = H(C \wedge C') - H(C'), which yields the identity \text{VI}(C, C') = 2H(C \wedge C') - H(C) - H(C'), where H denotes the Shannon entropy of a partition.

The variation of information also obeys the triangle inequality, establishing it as a metric on the space of clusterings: for any clusterings C, C', and C'', \text{VI}(C, C'') \leq \text{VI}(C, C') + \text{VI}(C', C''). This holds due to the subadditivity of conditional entropy, H(A \mid B) \leq H(A \mid D) + H(D \mid B) for any partitions A, B, and D, which implies H(C'' \mid C) \leq H(C'' \mid C') + H(C' \mid C) and H(C \mid C'') \leq H(C \mid C') + H(C' \mid C''). Adding these yields the inequality. The triangle inequality aligns with the metric axioms, confirming VI's role as a true distance measure.

Bounds on VI provide insight into its range. Non-negativity follows from the non-negativity of conditional entropies: 0 \leq \text{VI}(C, C') \leq H(C) + H(C'), with equality to zero if and only if C = C' (i.e., the clusterings are identical). A tighter upper bound is \text{VI}(C, C') \leq \log n, achieved when one clustering is the discrete partition into singletons and the other is the trivial partition into one cluster.

To illustrate the triangle inequality explicitly, consider a dataset with n = 4 elements and three clusterings (using base-2 logarithms, so entropies are in bits): C = \{\{1\},\{2\},\{3\},\{4\}\} (discrete, H(C) = 2); C' = \{\{1,2\},\{3,4\}\} (two equal-sized clusters, H(C') = 1); and C'' = \{\{1,2,3,4\}\} (trivial, H(C'') = 0). Then \text{VI}(C, C') = 1, \text{VI}(C', C'') = 1, and \text{VI}(C, C'') = 2, satisfying the triangle inequality with equality, since 2 = 1 + 1.
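Because each clustering in this chain refines the next (C refines C', which refines C''), one conditional entropy in each pair vanishes and VI reduces to a difference of entropies:

\begin{aligned}
\text{VI}(C, C') &= H(C \mid C') + H(C' \mid C) = \bigl(H(C) - H(C')\bigr) + 0 = 2 - 1 = 1,\\
\text{VI}(C', C'') &= H(C') - H(C'') = 1 - 0 = 1,\\
\text{VI}(C, C'') &= H(C) - H(C'') = 2 - 0 = 2.
\end{aligned}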

Applications

Clustering Evaluation

In unsupervised learning, variation of information (VI) serves as an external evaluation metric for assessing the quality of clusterings when ground-truth labels are available, such as in benchmark datasets. It quantifies the information lost and gained when transitioning from the predicted clustering to the true partitioning, thereby penalizing both over-clustering (excessive fragmentation) and under-clustering (insufficient separation) by measuring conditional entropies relative to the ground truth. In practice, VI is computed from a contingency table that captures the overlap between predicted clusters and ground-truth labels, where table entries count the number of data points assigned to each pair of clusters. This table can be generated via cross-tabulation functions such as scikit-learn's confusion_matrix or pandas' crosstab, followed by entropy calculations over the marginal and joint distributions. For instance, on the Iris dataset (comprising 150 samples across three species with four sepal and petal measurements), applying k-means with three centroids to the features and comparing the resulting assignments to the true species labels yields a VI score of approximately 0.81, reflecting a strong but imperfect alignment due to minor misclassifications between versicolor and virginica samples.

Key advantages of VI in clustering evaluation include its ability to handle clusterings with differing numbers of groups and its invariance to arbitrary label permutations, ensuring fair comparisons across algorithms that may relabel clusters differently. VI finds specific application in validating hierarchical clustering by comparing dendrogram cuts at various levels to ground-truth partitions, helping select an appropriate resolution without assuming a fixed number of clusters. In community detection on graphs, it evaluates detected structures against known communities, offering an information-theoretic alternative to internal criteria like modularity by directly assessing partition similarity.
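A minimal sketch of this workflow follows (results depend on the k-means initialization and random seed, and the values here are in nats because scikit-learn's mutual_info_score uses natural logarithms, whereas the figure quoted above may use a different base):

```python
import numpy as np
from scipy.stats import entropy
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import mutual_info_score

def label_entropy(labels):
    """Shannon entropy (nats) of a hard cluster assignment."""
    _, counts = np.unique(labels, return_counts=True)
    return entropy(counts)            # scipy normalizes the counts

X, y_true = load_iris(return_X_y=True)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# VI = H(true) + H(pred) - 2 I(true; pred), all in nats.
mi = mutual_info_score(y_true, y_pred)
vi = label_entropy(y_true) + label_entropy(y_pred) - 2.0 * mi
print(f"VI(k-means, species) = {vi:.3f} nats")
```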

Extensions and Variants

One prominent extension of the variation of information (VI) is the normalized variation of information (NVI), which addresses the metric's dependence on dataset size and clustering granularity by rescaling it for comparability across different cluster structures. Specifically, NVI can be defined as 1 - \frac{I(C; C')}{H(C, C')}, where H(C, C') is the joint entropy, yielding values in [0, 1] while preserving VI's information-theoretic interpretation. This is particularly beneficial for comparing clusterings with varying granularity, such as in high-throughput sequencing datasets.

Another variant, the adjusted variation of information (AVI), incorporates a correction for chance agreement to mitigate biases in random clusterings, similar to the adjusted Rand index. AVI is based on subtracting the expected value of the measure, \text{AVI}(U, V) = \text{VI}(U, V) - E[\text{VI}(U, V)], where E[\text{VI}(U, V)] is the expected VI under a null model of independent clusterings, commonly approximated with a hypergeometric model of randomness for computational tractability. This adjustment enhances reliability in applications with varying cluster numbers or sizes, producing values close to zero for chance-level agreement.

Weighted variants extend VI to handle unequal cluster sizes or probabilistic assignments, such as in fuzzy clustering, by incorporating membership degrees into the entropy and mutual information calculations. In soft clusterings, for instance, hard indicator functions are replaced with fuzzy membership matrices, generalizing VI to measure distances between overlapping or probabilistic partitions while aiming to retain its metric properties. These adaptations are useful for scenarios like fuzzy c-means algorithms, where partial assignments reflect uncertainty in data partitioning.

Beyond clustering evaluation, VI variants have found application in bioinformatics for analyzing gene expression data. For example, the AutoSOME method employs VI to assess clustering stability and module identification in high-dimensional expression profiles without assuming a fixed number of clusters, enabling robust discovery of co-expression patterns in datasets such as those from cell cycle studies.
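As a sketch of the joint-entropy normalization described above (label-array inputs assumed; the function name is illustrative):

```python
import numpy as np

def normalized_variation_of_information(labels_u, labels_v):
    """NVI = 1 - I(U;V) / H(U,V), which equals VI(U,V) / H(U,V) and lies in [0, 1]."""
    labels_u = np.asarray(labels_u)
    labels_v = np.asarray(labels_v)
    n = labels_u.size
    _, u_idx = np.unique(labels_u, return_inverse=True)
    _, v_idx = np.unique(labels_v, return_inverse=True)
    table = np.zeros((u_idx.max() + 1, v_idx.max() + 1))
    np.add.at(table, (u_idx, v_idx), 1)
    p = table / n

    def H(q):
        q = q[q > 0]
        return float(-np.sum(q * np.log2(q)))

    h_joint = H(p.ravel())
    if h_joint == 0.0:                 # both clusterings trivial: distance 0
        return 0.0
    mi = H(p.sum(axis=1)) + H(p.sum(axis=0)) - h_joint
    return 1.0 - mi / h_joint

print(normalized_variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 1.0 (no shared information)
print(normalized_variation_of_information([0, 0, 1, 1], [1, 1, 0, 0]))  # 0.0 (identical up to labels)
```

Note that this normalization divides by the joint entropy of the particular pair being compared, whereas the \text{VI}/\log n scaling mentioned under Alternative Formulations divides by a dataset-wide constant.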

Comparisons

With Mutual Information

The variation of information (VI) between two clusterings C and C' relates directly to mutual information through the expression \text{VI}(C, C') = H(C) + H(C') - 2I(C; C'), where H denotes entropy and I denotes mutual information; this shows VI as a symmetric transformation that converts the similarity I(C; C') into a distance by referencing it to the sum of the marginal entropies. Mutual information I(C; C') measures the information shared between clusterings, bounded between 0 (indicating independence) and \min(H(C), H(C')) (indicating that one clustering fully determines the other). However, mutual information is a similarity rather than a distance: it is maximal, not zero, for identical clusterings and does not satisfy the metric axioms, limiting its use in geometric or distance-based analyses. In contrast, VI satisfies all metric axioms (non-negativity, symmetry, and the triangle inequality) and is bounded in [0, \log_2 n] for a dataset of n points (in bits), with equality to 0 only when C = C' and the upper bound achieved between the all-singletons clustering and the trivial (single-cluster) clustering.

This incorporation of marginal entropies means VI penalizes discrepancies in clustering granularity, such as over-clustering (high H(C)) or under-clustering (low H(C)), which mutual information overlooks by focusing solely on overlap. For example, consider two clusterings on a small dataset where I(C; C') = 0.5 bits due to partial agreement, but one clustering has higher entropy from finer partitions; here, VI might equal 1.2 bits, reflecting the additional information loss from mismatched structures beyond the shared content. Mutual information is thus suited for gauging dependence strength in information-theoretic contexts, whereas VI is preferred as a clustering distance, enabling applications like hierarchical comparisons or optimization under metric constraints.
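A small sketch (with hypothetical label arrays) illustrating the granularity issue: mutual information cannot distinguish an identical partition from a strict refinement of it, while VI can. scikit-learn's mutual_info_score is used, so all quantities are in nats:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def H(labels):
    """Shannon entropy (nats) of a label assignment."""
    _, counts = np.unique(labels, return_counts=True)
    return entropy(counts)

def vi(a, b):
    return H(a) + H(b) - 2.0 * mutual_info_score(a, b)

coarse    = [0, 0, 1, 1]     # two clusters of size 2
identical = [1, 1, 0, 0]     # the same partition with permuted labels
singleton = [0, 1, 2, 3]     # every point in its own cluster (a refinement)

# Mutual information is ln 2 in both comparisons ...
print(mutual_info_score(coarse, identical), mutual_info_score(coarse, singleton))
# ... but VI separates them: 0 for the identical partition, ln 2 for the refinement.
print(vi(coarse, identical), vi(coarse, singleton))
```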

With Other Clustering Metrics

The Rand index (RI) is a pair-counting measure that quantifies the proportion of point pairs (either both clustered together or both apart) that agree between two partitions, scaled to [0, 1]. While computationally efficient, RI is sensitive to the granularity of the clusterings and fails to adequately account for imbalances in cluster sizes, often overestimating similarity when cardinalities are uneven. In contrast, variation of information (VI), as an information-theoretic distance, uses entropy to penalize discrepancies in cluster size distributions, offering a more robust evaluation that reflects the information actually lost or gained across partitions.

The adjusted Rand index (ARI) improves upon RI by adjusting for chance agreement via a hypergeometric model, producing values in [-1, 1] where random clusterings score near 0. This correction enhances reliability for imbalanced data, but ARI is not a true metric (it lacks the triangle inequality) and requires careful interpretation because of its dependence on dataset size. VI addresses these limitations as a bona fide distance that is symmetric and satisfies all metric axioms, and its normalized form provides boundedness, making it preferable for applications that require a genuine distance. Empirical analyses show VI and ARI disagree in approximately 37.6% of real-world clustering comparisons, particularly when cluster sizes vary significantly.

Normalized mutual information (NMI) assesses clustering similarity by scaling mutual information against the geometric mean of the individual clustering entropies, \text{NMI}(C, C') = \frac{I(C; C')}{\sqrt{H(C)H(C')}}, yielding scores in [0, 1]. This provides an intuitive bounded measure, but NMI variants can introduce biases toward smaller clusters and do not form metrics, with some normalizations (for example, by only one of the two entropies) even losing symmetry. VI, derived from conditional entropies, maintains symmetry and the metric properties while capturing both shared and unique information more directly, avoiding NMI's normalization pitfalls in datasets with disparate cluster counts. Studies indicate VI and NMI diverge in about 40.3% of cases, with VI demonstrating better discrimination for unequal partitions.

To illustrate these differences, consider synthetic datasets with balanced versus imbalanced clusters: pair-counting metrics like RI and ARI can yield similar scores for partitions that superficially agree on pairs while ignoring size-driven information loss, whereas VI reports greater distances in imbalanced scenarios because of its entropy-based penalization; a small numerical comparison appears after the table below. The following table summarizes key properties:
Metric | Range | Metric properties | Sensitivity to cluster size imbalance | Normalization required
RI | [0, 1] | No | Low (insensitive, overestimates) | No
ARI | [-1, 1] | No | Moderate (chance-corrected) | Yes, for scale invariance
NMI | [0, 1] | No | Moderate (biases toward small clusters) | Built-in, but variant-dependent
VI | [0, ∞) | Yes (symmetric, triangle inequality) | Low (entropy-handled) | Optional, for [0, 1]
Overall, VI's information-theoretic foundation provides advantages in theoretical rigor and practical utility over these traditional metrics, especially in diverse clustering scenarios.
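The following hypothetical construction (not taken from the cited empirical studies) shows how ARI and VI can rank two candidate clusterings in opposite order when one cluster is much larger than the other:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import adjusted_rand_score, mutual_info_score

def H(labels):
    """Shannon entropy (bits) of a label assignment."""
    _, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=2)

def vi(a, b):
    """Variation of information in bits (mutual_info_score returns nats)."""
    return H(a) + H(b) - 2.0 * mutual_info_score(a, b) / np.log(2)

truth  = [0] * 8 + [1] * 2            # one large and one small cluster
merged = [0] * 10                     # everything collapsed into a single cluster
split  = [0] * 4 + [1] * 4 + [2] * 2  # large cluster halved, small cluster kept

print(adjusted_rand_score(truth, merged), adjusted_rand_score(truth, split))
# ARI: 0.0 vs ~0.37 -> ranks the split as the closer match
print(vi(truth, merged), vi(truth, split))
# VI: ~0.72 vs 0.80 bits -> ranks the merged clustering slightly closer
```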
