Variation of information
Variation of information (VI) is an information-theoretic metric designed to quantify the distance between two clusterings of the same dataset, measuring the amount of information lost and gained when transitioning from one partitioning to another.[1] Introduced by Marina Meilă in 2003, it provides a principled way to compare clusterings based on entropy and mutual information, addressing limitations of earlier similarity measures such as the adjusted Rand index.[2] Formally, for two clusterings C and C' of a set of n elements, VI is defined as \text{VI}(C, C') = H(C) + H(C') - 2I(C; C'), where H(C) denotes the entropy of clustering C, representing the expected information needed to specify the cluster assignment of a randomly chosen element, and I(C; C') is the mutual information between C and C', capturing the shared information between the two partitions.[1] This formulation derives from the properties of information theory, where VI can also be interpreted as the symmetric difference in the information contents of the clusterings.[2] VI possesses several desirable properties that establish it as a true metric on the space of all possible clusterings: it is non-negative, achieving zero if and only if the two clusterings are identical; it is symmetric, meaning \text{VI}(C, C') = \text{VI}(C', C); and it satisfies the triangle inequality, \text{VI}(C, C'') \leq \text{VI}(C, C') + \text{VI}(C', C''), allowing for meaningful comparisons across multiple clusterings.[1] Unlike mutual information, which is a similarity rather than a distance and does not satisfy the metric axioms, VI is bounded by the sum of the individual entropies: 0 \leq \text{VI}(C, C') \leq H(C) + H(C').[2] In practice, VI is widely applied in machine learning and data analysis for evaluating clustering algorithms, such as comparing the output of different methods on the same data or assessing stability across runs.[3] It is particularly valued for its independence from cluster labeling and its ability to handle varying numbers of clusters, making it suitable for hierarchical and probabilistic clustering contexts.[1] Meilă's axiomatic derivation further justifies VI as the unique criterion that is aligned with the lattice of partitions (respecting their partial order) and convexly additive, while also showing that no bounded metric can satisfy these axioms simultaneously.[1]
Overview
Definition
Variation of information (VI) is a metric used to quantify the distance between two clusterings of the same dataset, treating clusterings as partitions of a set X with |X| = n elements into non-empty, disjoint subsets called clusters.[4] Intuitively, VI measures the total amount of information that must be communicated to transform one clustering into the other, capturing the information lost from the original clustering and the information gained in the new one. This is grounded in information theory, where entropy and mutual information serve as foundational tools for assessing uncertainty and shared information in random variables representing cluster assignments.[4] Formally, given two clusterings C and C', VI is defined as \text{VI}(C, C') = H(C) + H(C') - 2I(C; C'), where H(C) denotes the entropy of the random variable indicating the cluster assignment under C (with elements drawn uniformly from X), H(C') is defined analogously, and I(C; C') is the mutual information between these assignment variables.[4] This formulation equals the sum of the conditional entropies H(C'|C) + H(C|C'), representing the expected additional bits needed to specify the clusters of one given the other. To illustrate, consider a dataset X = \{1, 2, 3, 4\} with two clusterings: C = \{\{1,2\}, \{3,4\}\} and C' = \{\{1,3\}, \{2,4\}\}, each partitioning X into two clusters of size 2. The entropy is H(C) = -2 \cdot (0.5 \log_2 0.5) = 1 bit, and similarly H(C') = 1 bit, since the assignments are balanced. The joint entropy is H(C, C') = 2 bits, as the four intersection cells each contain one element with probability 1/4. Thus, I(C; C') = H(C) + H(C') - H(C, C') = 0 bits, yielding \text{VI}(C, C') = 1 + 1 - 2 \cdot 0 = 2 bits, indicating that the clusterings share no information and require full reconfiguration.[4]
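This worked example can be reproduced in a few lines of Python. The sketch below is an illustration added here (not part of the original formulation); it assumes uniform sampling of the four elements and base-2 logarithms, and recovers the same entropy, mutual information, and VI values.

```python
import numpy as np

# Cluster assignments for X = {1, 2, 3, 4}:
# C  = {{1,2},{3,4}}  ->  element i belongs to cluster c[i]
# C' = {{1,3},{2,4}}
c       = np.array([0, 0, 1, 1])
c_prime = np.array([0, 1, 0, 1])

def entropy_bits(counts):
    """Shannon entropy in bits of a vector of counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

h_c  = entropy_bits(np.bincount(c))                       # 1 bit
h_cp = entropy_bits(np.bincount(c_prime))                 # 1 bit
joint_counts = np.bincount(c * 2 + c_prime, minlength=4)  # the four intersection cells
h_joint = entropy_bits(joint_counts)                      # 2 bits
mi = h_c + h_cp - h_joint                                 # 0 bits
vi = h_c + h_cp - 2 * mi                                  # 2 bits
print(h_c, h_cp, h_joint, mi, vi)
```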
Historical Background
The variation of information (VI) metric was introduced by Marina Meilă in her 2003 paper "Comparing Clusterings by the Variation of Information," presented at the 16th Annual Conference on Learning Theory (COLT) and published in the proceedings Learning Theory and Kernel Machines.[4] This work established VI as an information-theoretic distance measure specifically designed for evaluating the similarity between clusterings, drawing on foundational concepts in information theory developed by Claude Shannon in 1948. Meilă's motivation stemmed from the shortcomings of earlier clustering comparison measures, such as the Rand index introduced by William M. Rand in 1971, which behave inconsistently across different numbers and sizes of clusters and do not form true distance metrics, limiting their utility in systematic analyses.[4] By framing clustering comparison in terms of mutual information and entropy, VI addressed these gaps, providing a comparison criterion that is a genuine metric on the space of clusterings and aligns with probabilistic interpretations of partitions.[4] Subsequent developments built on this foundation, with normalized variants of VI and related information-theoretic measures analyzed by N. X. Vinh, J. Epps, and J. Bailey in their 2009 work, later formalized in a 2010 Journal of Machine Learning Research (JMLR) paper titled "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance."[5] These refinements introduced adjustments for chance agreement and normalization to improve comparability across datasets of varying sizes, marking 2003 as the pivotal year for VI's inception and 2010 for key enhancements in its practical application.[5]
Mathematical Formulation
Entropy-Based Expression
The variation of information (VI) between two clusterings C and C' of a dataset can be expressed in terms of conditional entropies as \text{VI}(C, C') = H(C' \mid C) + H(C \mid C'), where H(C' \mid C) represents the conditional entropy of C' given C, quantifying the additional information required to specify the clustering C' after knowing C, and H(C \mid C') similarly quantifies the additional information for C given C'.[6] This decomposition interprets VI as the sum of information "lost" (the uncertainty about the reference clustering that remains after conditioning on the other) and information "gained" (the uncertainty about the comparison clustering that remains after conditioning on the reference).[6] For clusterings, the entropy H(C) measures the uncertainty in the partition C = \{C_k\}_{k=1}^K, defined as H(C) = -\sum_{k=1}^K p_k \log p_k, where p_k = |C_k| / n is the empirical probability of an element belonging to cluster k, with n the total number of elements; the entropy H(C') is defined analogously.[6] The conditional entropy H(C' \mid C) is then H(C, C') - H(C), where the joint entropy H(C, C') is -\sum_{k,k'} p(k,k') \log p(k,k') over joint probabilities p(k,k').[6] This entropy-based form connects directly to mutual information I(C; C'), the shared information between the clusterings, given by I(C; C') = H(C) - H(C \mid C') = H(C') - H(C' \mid C), which symmetrizes to \text{VI}(C, C') = H(C) + H(C') - 2I(C; C').[6] To compute VI, construct the K \times K' contingency table with entries n_{kk'} = |C_k \cap C'_{k'}|, yielding joint probabilities p_{kk'} = n_{kk'}/n, marginals p_k = \sum_{k'} p_{kk'}, and p'_{k'} = \sum_k p_{kk'}.[6] Then calculate H(C), H(C'), and the joint H(C, C') from these, with conditional entropies following as differences; the logarithm is typically base-2 for measurement in bits, though base-e is also used.[6]
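As an illustration of this procedure, the following sketch is a minimal implementation written for this article (not a library routine); it assumes hard assignments given as one cluster label per data point and uses base-2 logarithms.

```python
import numpy as np

def variation_of_information(labels_a, labels_b):
    """Variation of information, in bits, between two hard clusterings.

    labels_a and labels_b hold one cluster label per data point.  The function
    builds the contingency table n_kk', converts it to joint probabilities,
    and evaluates H(C) + H(C') - 2 I(C; C').
    """
    a = np.unique(labels_a, return_inverse=True)[1]
    b = np.unique(labels_b, return_inverse=True)[1]
    n = a.size
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)           # contingency table n_kk'
    p_joint = table / n                    # p_kk'
    p_a = p_joint.sum(axis=1)              # marginal p_k
    p_b = p_joint.sum(axis=0)              # marginal p'_k'

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    h_a, h_b, h_joint = entropy(p_a), entropy(p_b), entropy(p_joint.ravel())
    mutual_info = h_a + h_b - h_joint
    return h_a + h_b - 2.0 * mutual_info   # equals H(C | C') + H(C' | C)

# The "crossed" partitions from the Definition section give VI = 2 bits.
print(variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))
```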
Alternative Formulations
One alternative formulation of the variation of information (VI) expresses it in terms of joint entropy, where the distance between two clusterings C and C' is given by VI(C, C') = 2H(C, C') - H(C) - H(C'), with H(C, C') denoting the joint entropy of the cluster assignments under the joint distribution P(k, k') = |C_k \cap C'_{k'}|/n.[6] This form highlights the redundancy in the joint distribution relative to the marginal entropies of the individual clusterings. Equivalently, VI can be decomposed into conditional entropies, representing the explicit information content required to describe one clustering given the other without redundancy: VI(C, C') = H(C \mid C') + H(C' \mid C). Here, H(C \mid C') measures the remaining uncertainty in C after knowing C', and vice versa, capturing the symmetric information loss and gain between partitions.[6] A sum-based variant, often denoted as d_{\text{sum}}(U, V), directly emphasizes the additivity of the distance as the sum of marginal entropies minus twice the mutual information: d_{\text{sum}}(U, V) = H(U) + H(V) - 2I(U; V). This expression, equivalent to the standard VI, underscores its structure as a non-negative additive measure suitable for hierarchical clustering comparisons, though it is not normalized for dataset size.[3] To address varying dataset sizes and bound VI between 0 and 1, a normalized version is defined as \text{NVI}(C, C') = \frac{VI(C, C')}{\log n}, where n is the number of data points; this scaling ensures comparability across different sample sizes while preserving the metric properties of the original.[6]
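The equivalence of these formulations can be checked numerically. The sketch below (using hypothetical label arrays and base-2 logarithms) evaluates the joint-entropy form, the conditional-entropy form, and the sum form on the same pair of clusterings and confirms that they coincide.

```python
import numpy as np

def probs(labels_a, labels_b):
    """Joint and marginal cluster-assignment probabilities from two label arrays."""
    a = np.unique(labels_a, return_inverse=True)[1]
    b = np.unique(labels_b, return_inverse=True)[1]
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)
    p = table / table.sum()
    return p, p.sum(axis=1), p.sum(axis=0)

def h(p):
    """Shannon entropy in bits of a probability vector or matrix."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

labels_a = [0, 0, 0, 1, 1, 2]     # hypothetical clustering C
labels_b = [0, 0, 1, 1, 2, 2]     # hypothetical clustering C'
p_joint, p_a, p_b = probs(labels_a, labels_b)
h_a, h_b, h_ab = h(p_a), h(p_b), h(p_joint)
i_ab = h_a + h_b - h_ab           # mutual information

vi_joint = 2 * h_ab - h_a - h_b              # joint-entropy form
vi_cond  = (h_ab - h_b) + (h_ab - h_a)       # H(C|C') + H(C'|C)
vi_sum   = h_a + h_b - 2 * i_ab              # d_sum form
print(vi_joint, vi_cond, vi_sum)             # all three agree up to rounding
```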
Properties
Metric Axioms
The variation of information (VI) serves as a metric on the space of clusterings, satisfying the fundamental axioms of non-negativity, symmetry, and identity of indiscernibles, which distinguishes it from similarity measures like mutual information that do not form proper distances.[7] These properties arise from its formulation in terms of entropy and mutual information, ensuring that VI quantifies dissimilarity between clusterings in a mathematically rigorous manner.[7] Non-negativity requires that VI(C, C') ≥ 0 for any clusterings C and C', with equality holding if and only if C = C' up to a permutation of labels. To see this, recall that VI(C, C') = H(C) + H(C') - 2I(C; C'), where H denotes entropy and I denotes mutual information.[7] From information theory, I(C; C') ≤ min(H(C), H(C')), so without loss of generality assume H(C) ≤ H(C'); then VI(C, C') ≥ H(C) + H(C') - 2H(C) = H(C') - H(C) ≥ 0.[7] Equality throughout requires both I(C; C') = H(C) and H(C) = H(C'), so that both conditional entropies vanish and each clustering is a deterministic function of the other; that is, the two induce the same partition up to relabeling.[7] This non-negativity ensures VI behaves as a valid distance, unlike mutual information, which achieves its maximum (rather than zero) when clusterings are identical.[7] Symmetry states that VI(C, C') = VI(C', C) for all clusterings C and C'. This follows directly from the symmetry of mutual information, I(C; C') = I(C'; C), since the expression H(C) + H(C') - 2I(C; C') is unchanged when C and C' are swapped.[7] Consequently, VI treats the order of clusterings indifferently, a key requirement for any metric that avoids directional bias in comparisons.[7] The identity of indiscernibles follows from the equality analysis above: VI(C, C') = 0 if and only if C = C' (up to label permutation). If the clusterings are identical, then H(C|C') = 0 and H(C'|C) = 0, implying VI(C, C') = 0; conversely, if VI(C, C') = 0, the conditional entropies vanish, meaning each clustering fully determines the other.[7] This axiom reinforces VI's utility in distinguishing distinct clusterings while equating equivalent ones.[7]
Identities and Bounds
The variation of information satisfies several key mathematical identities that relate it to entropy measures of associated partitions in the lattice of partitions. The meet C \wedge C' is the coarsest partition that refines both C and C' (the product partition whose blocks are the non-empty intersections C_k \cap C'_{k'}), while the join C \vee C' is the finest partition that both C and C' refine (its blocks are formed by merging intersecting blocks of C and C'). Since the entropy of the meet equals the joint entropy, H(C \wedge C') = H(C, C'), the variation of information can be written as \text{VI}(C, C') = 2H(C \wedge C') - H(C) - H(C') = H(C \wedge C') - I(C; C'), which follows from the conditional entropy expression \text{VI}(C, C') = \left[H(C \wedge C') - H(C)\right] + \left[H(C \wedge C') - H(C')\right].[6] The meet moreover lies between the two clusterings in the metric sense: because C \wedge C' refines both, \text{VI}(C, C \wedge C') = H(C \wedge C') - H(C) = H(C' \mid C) and \text{VI}(C \wedge C', C') = H(C \wedge C') - H(C') = H(C \mid C'), so \text{VI}(C, C \wedge C') + \text{VI}(C \wedge C', C') = \text{VI}(C, C'). The variation of information also obeys the triangle inequality, establishing it as a metric on the space of clusterings: for any clusterings C, C', and C'', \text{VI}(C, C'') \leq \text{VI}(C, C') + \text{VI}(C', C''). This holds due to the subadditivity of conditional entropy: H(A \mid B) \leq H(A \mid D) + H(D \mid B) for any partitions A, B, and D, which implies H(C'' \mid C) \leq H(C'' \mid C') + H(C' \mid C) and H(C \mid C'') \leq H(C \mid C') + H(C' \mid C''). Adding these yields the inequality.[6] The triangle inequality aligns with the metric axioms, confirming VI's role as a true distance measure.[8] Bounds on VI provide insight into its range. Non-negativity follows from the non-negativity of conditional entropies: 0 \leq \text{VI}(C, C') \leq H(C) + H(C'), with equality to zero if and only if C = C' (i.e., the clusterings are identical).[6] An upper bound independent of the particular clusterings is \text{VI}(C, C') \leq \log n, attained when one clustering is the discrete partition into singletons and the other is the trivial partition into a single cluster.[6] To illustrate the triangle inequality explicitly, consider a dataset with n = 4 elements and three clusterings (using base-2 logarithms, so entropies are in bits): let C = \{\{1\},\{2\},\{3\},\{4\}\} (discrete, H(C) = 2); C' = \{\{1,2\},\{3,4\}\} (two equal-sized clusters, H(C') = 1); and C'' = \{\{1,2,3,4\}\} (trivial, H(C'') = 0). Then \text{VI}(C, C') = 1, \text{VI}(C', C'') = 1, and \text{VI}(C, C'') = 2, so the triangle inequality holds with equality: 2 \leq 1 + 1.[6]
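The three clusterings in this example can be checked directly. The sketch below (assuming base-2 logarithms) computes the pairwise VI values, confirms the triangle inequality, and additionally verifies symmetry and the zero distance for identical clusterings.

```python
import numpy as np

def vi_bits(a, b):
    """Variation of information, in bits, for two integer label arrays."""
    a, b = np.asarray(a), np.asarray(b)
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)
    p = table / a.size
    def h(q):
        q = q[q > 0]
        return -np.sum(q * np.log2(q))
    return 2 * h(p.ravel()) - h(p.sum(axis=1)) - h(p.sum(axis=0))

c_discrete = [0, 1, 2, 3]   # {{1},{2},{3},{4}}
c_pairs    = [0, 0, 1, 1]   # {{1,2},{3,4}}
c_trivial  = [0, 0, 0, 0]   # {{1,2,3,4}}

d_xy = vi_bits(c_discrete, c_pairs)      # 1.0
d_yz = vi_bits(c_pairs, c_trivial)       # 1.0
d_xz = vi_bits(c_discrete, c_trivial)    # 2.0
print(d_xy, d_yz, d_xz)
print(d_xz <= d_xy + d_yz)                              # triangle inequality (equality here)
print(np.isclose(vi_bits(c_pairs, c_discrete), d_xy))   # symmetry
print(np.isclose(vi_bits(c_pairs, c_pairs), 0.0))       # identity of indiscernibles
```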
Applications
Clustering Evaluation
In unsupervised learning, variation of information (VI) serves as an external evaluation metric for assessing the quality of clusterings when ground-truth labels are available, such as in benchmark datasets. It quantifies the information lost and gained when transitioning from the predicted clustering to the true partitioning, thereby penalizing both over-clustering (excessive fragmentation) and under-clustering (insufficient separation) through the conditional entropies relative to the ground truth.[1][9] In practice, VI is computed from a contingency table that captures the overlap between predicted clusters and ground-truth labels, where table entries count the data points assigned to each pair of clusters. Such a table can be generated via cross-tabulation functions such as scikit-learn's confusion_matrix or pandas' crosstab, followed by entropy calculations over the marginal and joint distributions. For instance, on the Iris dataset (150 samples across three species, each described by four sepal and petal measurements), applying k-means clustering with three centroids to the features and comparing the resulting assignments to the true species labels yields a VI score of approximately 0.81, reflecting a strong but imperfect alignment due to misclassifications between versicolor and virginica samples.[10]
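A sketch of this workflow using scikit-learn's KMeans and a pandas cross-tabulation is shown below; the exact value printed depends on the logarithm base (bits here) and on the k-means initialization, so the figure quoted above should be read as approximate.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# Contingency table between predicted clusters and true species labels.
table = pd.crosstab(pred, iris.target).to_numpy()
p = table / table.sum()

def h(q):
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

vi = 2 * h(p.ravel()) - h(p.sum(axis=1)) - h(p.sum(axis=0))
print(f"VI between k-means clusters and species labels: {vi:.2f} bits")
```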
Key advantages of VI in clustering evaluation include its ability to handle clusterings with differing numbers of groups without bias toward granularity and its invariance to arbitrary label permutations, ensuring fair comparisons across algorithms that may relabel clusters differently.[1][11]
VI finds specific application in validating hierarchical clustering by comparing dendrogram cuts at various levels to ground-truth partitions, helping select optimal resolutions without assuming a fixed number of clusters. In community detection on graphs, it evaluates detected structures against known communities, offering an information-theoretic alternative to internal metrics like modularity by directly assessing partition similarity.[12][13][14]
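As an illustration of the dendrogram use case, the sketch below is one possible workflow (an assumption of this article, not a prescribed method): it builds a Ward linkage on the Iris features with scipy, cuts the tree at several cluster counts, and reports the VI of each cut against the species labels, with the smallest VI suggesting a resolution.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_iris

def vi_bits(a, b):
    """Variation of information, in bits, between two hard clusterings."""
    a = np.unique(a, return_inverse=True)[1]
    b = np.unique(b, return_inverse=True)[1]
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)
    p = table / a.size
    h = lambda q: -np.sum(q[q > 0] * np.log2(q[q > 0]))
    return 2 * h(p.ravel()) - h(p.sum(axis=1)) - h(p.sum(axis=0))

iris = load_iris()
tree = linkage(iris.data, method="ward")

# Compare dendrogram cuts at several resolutions to the ground-truth species.
for k in range(2, 7):
    cut = fcluster(tree, t=k, criterion="maxclust")
    print(k, round(vi_bits(cut, iris.target), 3))
```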
Extensions and Variants
One prominent extension of the variation of information (VI) is the normalized variation of information (NVI), which addresses the metric's dependence on clustering granularity by scaling it for comparability across different cluster structures. Specifically, NVI is defined as 1 - \frac{I(C; C')}{H(C, C')}, where H(C, C') is the joint entropy, yielding a range of [0, 1] and ensuring normalization while preserving VI's information-theoretic properties. This normalization is particularly beneficial for comparing clusterings with varying entropy levels, such as in high-throughput sequencing datasets.[3] Another variant, the adjusted variation of information (AVI), incorporates a correction for chance agreement to mitigate biases in random clusterings, similar in spirit to the adjusted Rand index. AVI is defined as \text{AVI}(U, V) = \text{VI}(U, V) - E[\text{VI}(U, V)], where E[\text{VI}(U, V)] is the expected VI under the null hypothesis of independent clusterings, approximated using the hypergeometric distribution for computational efficiency. This adjustment improves reliability when cluster numbers or sizes vary: under this convention, chance-level agreement yields values near zero, while better-than-chance agreement yields negative values.[3] Weighted variants extend VI to handle unequal cluster sizes or probabilistic assignments, such as in fuzzy clustering, by incorporating membership degrees into the entropy and mutual information calculations. For instance, in soft clusterings, hard indicator functions are replaced with fuzzy membership matrices, generalizing VI to measure distances between overlapping or probabilistic partitions while maintaining metric properties. These adaptations are useful for scenarios like fuzzy c-means algorithms, where partial assignments reflect uncertainty in data partitioning.[15] Beyond clustering evaluation, VI variants have found application in bioinformatics for analyzing gene expression data partitioning. For example, the AutoSOME method employs VI to assess clustering stability and module identification in high-dimensional expression profiles without assuming a fixed number of clusters, enabling robust discovery of co-expression patterns in datasets like those from yeast cell cycle studies.[16]
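The joint-entropy normalization 1 - I(C; C')/H(C, C') discussed above, together with the simpler division by \log n from the Alternative Formulations section, can be computed directly once the joint distribution is available. The sketch below assumes hard label arrays and base-2 logarithms; the chance-corrected AVI is omitted because its expected value requires a permutation or hypergeometric model.

```python
import numpy as np

def cluster_info(a, b):
    """Return (VI, mutual information, joint entropy) in bits for two label arrays."""
    a = np.unique(a, return_inverse=True)[1]
    b = np.unique(b, return_inverse=True)[1]
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)
    p = table / a.size
    h = lambda q: -np.sum(q[q > 0] * np.log2(q[q > 0]))
    h_a, h_b, h_ab = h(p.sum(axis=1)), h(p.sum(axis=0)), h(p.ravel())
    mi = h_a + h_b - h_ab
    return h_a + h_b - 2 * mi, mi, h_ab

labels_a = [0, 0, 0, 1, 1, 2, 2, 2]   # hypothetical clusterings
labels_b = [0, 0, 1, 1, 1, 2, 2, 0]
vi, mi, h_joint = cluster_info(labels_a, labels_b)
n = len(labels_a)

nvi_logn  = vi / np.log2(n)      # normalization by log n, bounded by 1
nvi_joint = 1 - mi / h_joint     # equals VI / H(C, C'), also in [0, 1]
print(vi, nvi_logn, nvi_joint)
```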
Comparisons
With Mutual Information
The variation of information (VI) between two clusterings C and C' relates directly to mutual information through the expression \text{VI}(C, C') = H(C) + H(C') - 2I(C; C'), where H denotes entropy and I denotes mutual information; this shows VI as a symmetric transformation that adjusts mutual information by the sum of marginal entropies to yield a distance metric.[2][1] Mutual information I(C; C') measures the shared information content between clusterings, bounded between 0 (indicating independence) and \min(H(C), H(C')) (attained when one clustering fully determines the other).[1] However, while symmetric, mutual information is a similarity rather than a distance and fails to satisfy the triangle inequality, so it does not form a metric, limiting its use in geometric or distance-based analyses.[1] In contrast, VI satisfies all metric axioms (non-negativity, symmetry, and the triangle inequality) and is bounded in [0, \log_2 n] for a dataset of n points (in bits), with equality to 0 only when C = C' and the upper bound attained between the all-singletons clustering and the trivial single-cluster clustering.[2][1] By incorporating the marginal entropies, VI penalizes discrepancies in clustering granularity, such as over-clustering (high H(C)) or under-clustering (low H(C)), which mutual information overlooks by focusing solely on shared content.[2] For example, consider two clusterings on a small dataset where I(C; C') = 0.5 bits due to partial agreement, but one clustering has higher entropy from finer partitions; here, VI might equal 1.2 bits, reflecting the additional information loss from mismatched structures beyond the shared content.[2] Mutual information is thus suited for gauging dependence strength in information-theoretic contexts, whereas VI is preferred as a clustering distance metric to enable applications like hierarchical comparisons or optimization under metric constraints.[1]
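The granularity issue can be made concrete: relative to a reference clustering C, both C itself and the all-singletons clustering achieve the maximal mutual information H(C), yet their VI values differ sharply. A minimal sketch, assuming base-2 logarithms and an illustrative eight-point dataset:

```python
import numpy as np

def mi_and_vi(a, b):
    """Mutual information and variation of information, in bits."""
    a = np.unique(a, return_inverse=True)[1]
    b = np.unique(b, return_inverse=True)[1]
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)
    p = table / a.size
    h = lambda q: -np.sum(q[q > 0] * np.log2(q[q > 0]))
    h_a, h_b, h_ab = h(p.sum(axis=1)), h(p.sum(axis=0)), h(p.ravel())
    mi = h_a + h_b - h_ab
    return mi, h_a + h_b - 2 * mi

reference  = [0, 0, 0, 0, 1, 1, 1, 1]     # two balanced clusters, H = 1 bit
identical  = list(reference)               # same partition
singletons = list(range(len(reference)))   # every point in its own cluster

print(mi_and_vi(reference, identical))     # MI = 1.0, VI = 0.0
print(mi_and_vi(reference, singletons))    # MI = 1.0, VI = 2.0  (= log2(8) - 1)
```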
With Other Clustering Metrics
The Rand Index (RI) is a pair-counting similarity measure that quantifies the proportion of point pairs (those clustered together in both partitions plus those separated in both) that agree between two partitions, taking values in [0,1]. While computationally efficient, RI is sensitive to the granularity of clusterings and fails to adequately account for imbalances in cluster sizes, often overestimating similarity when cardinalities are uneven. In contrast, variation of information (VI), as an information-theoretic distance, leverages entropy to penalize discrepancies in cluster size distributions, offering a more robust evaluation that reflects the actual information content lost or gained across partitions.[6] The Adjusted Rand Index (ARI) improves upon RI by adjusting for chance agreement via a hypergeometric model, producing values in [-1,1] where random clusterings score near 0. This correction improves reliability for imbalanced data, but ARI is not a true metric (it does not satisfy the triangle inequality) and requires careful interpretation because of its dependence on dataset size without inherent normalization. VI addresses these limitations as a bona fide metric that is symmetric and satisfies all metric axioms, while its normalized form ensures scale invariance, making it preferable for applications requiring distance computations, such as hierarchical clustering algorithms. Empirical analyses show VI and ARI disagree in approximately 37.6% of real-world clustering comparisons, particularly when cluster sizes vary significantly.[17] Normalized Mutual Information (NMI) assesses clustering similarity by scaling mutual information against the geometric mean of the individual clustering entropies, \text{NMI}(C,C') = \frac{I(C;C')}{\sqrt{H(C)H(C')}}, yielding scores in [0,1]. This normalization provides an intuitive bounded measure, but NMI variants can introduce biases toward smaller clusters and are not metrics, potentially leading to asymmetric interpretations in non-standard forms. VI, derived from conditional entropies, maintains symmetry and metric properties while capturing both shared and unique information more directly, avoiding NMI's normalization pitfalls in datasets with disparate cluster counts. Studies indicate VI and NMI diverge in about 40.3% of cases on benchmark datasets, with VI demonstrating superior discrimination for unequal partitions.[3] To illustrate these differences, consider synthetic datasets with balanced versus imbalanced clusters: pair-counting metrics like RI and ARI often yield similar scores for partitions that superficially agree on pairs but ignore size-driven information loss, whereas VI highlights greater distances in imbalanced scenarios due to its entropy-based penalization. The following table summarizes key properties:

| Metric | Range | True metric | Handling of cluster size imbalance | Normalization |
|---|---|---|---|---|
| RI | [0, 1] | No | Poor (insensitive; overestimates similarity) | None |
| ARI | [-1, 1] | No | Moderate (chance-corrected) | Chance adjustment only; not scale-invariant |
| NMI | [0, 1] | No | Moderate (variant-dependent bias toward small clusters) | Built in, but variant-dependent |
| VI | [0, log n] | Yes (symmetric, satisfies the triangle inequality) | Good (entropy-based penalization) | Optional, e.g. division by log n to reach [0, 1] |
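For a side-by-side feel of these measures, the sketch below compares a balanced and an imbalanced disagreement pattern using scikit-learn's pair-counting and information-theoretic scores together with a hand-rolled VI; the specific labelings are illustrative assumptions, not benchmark data.

```python
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score, normalized_mutual_info_score

def vi_bits(a, b):
    """Variation of information, in bits, between two hard clusterings."""
    a = np.unique(a, return_inverse=True)[1]
    b = np.unique(b, return_inverse=True)[1]
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)
    p = table / a.size
    h = lambda q: -np.sum(q[q > 0] * np.log2(q[q > 0]))
    return 2 * h(p.ravel()) - h(p.sum(axis=1)) - h(p.sum(axis=0))

truth_balanced   = [0] * 10 + [1] * 10
pred_balanced    = [0] * 8 + [1] * 2 + [1] * 8 + [0] * 2    # 4 of 20 points swapped
truth_imbalanced = [0] * 18 + [1] * 2
pred_imbalanced  = [0] * 16 + [1] * 2 + [1] * 1 + [0] * 1   # 3 of 20 points swapped

cases = {"balanced": (truth_balanced, pred_balanced),
         "imbalanced": (truth_imbalanced, pred_imbalanced)}
for name, (t, p) in cases.items():
    print(name,
          round(rand_score(t, p), 3),
          round(adjusted_rand_score(t, p), 3),
          round(normalized_mutual_info_score(t, p), 3),
          round(vi_bits(t, p), 3))
```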