Normalized compression distance
The Normalized Compression Distance (NCD) is a parameter-free similarity metric that estimates the information shared between two objects—such as strings, files, or sequences—from the compressed lengths produced by standard data compression algorithms, thereby approximating the theoretical Normalized Information Distance based on Kolmogorov complexity.[1] Introduced by Rudi Cilibrasi and Paul M. B. Vitányi in their 2005 paper "Clustering by Compression," the NCD enables domain-independent comparisons without requiring specialized features or prior knowledge about the data.[1] The NCD between two objects x and y is formally defined as \text{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}, where C(z) denotes the length of the compressed representation of z produced by a given compressor, and xy is the concatenation of x and y.[1]

This formulation yields a value between 0 (indicating high similarity, as one object is largely redundant given the other) and just over 1 (indicating dissimilarity). It is reasonably robust across different compressors such as gzip or bzip2, though results can vary slightly due to compressor idiosyncrasies.[2] The metric's universality stems from its roots in algorithmic information theory, making it applicable to diverse data types without alignment or feature extraction.[1]

NCD has been widely applied in clustering tasks across fields, including biological classification, music genre classification, and cybersecurity for malware detection, often outperforming traditional methods in unsupervised settings.[3][4][5] Its hierarchical clustering variant, using dendrograms, has revealed insights such as evolutionary relationships in genomes and structural similarities in musical compositions.[1] Recent developments as of 2024 include neural variants of NCD for enhanced classification tasks.[6] Despite its strengths, NCD is sensitive to compressor choice and data preprocessing, so careful implementation is needed to avoid pitfalls such as non-idempotence under repeated compression.[2]

Theoretical Foundations
Information Distance
The information distance serves as a foundational theoretical measure of dissimilarity between two finite objects, represented as binary strings x and y, rooted in algorithmic information theory.[7] The Kolmogorov complexity K(x) of a string x is defined as the length of the shortest binary program that outputs x when executed on a universal Turing machine. Similarly, the conditional Kolmogorov complexity K(x|y) is the length of the shortest program that outputs x given y as an auxiliary input to the universal Turing machine. The universal information distance between x and y, denoted ID(x,y), is given by ID(x,y) = \max\{K(x|y), K(y|x)\}. This quantity represents the additional information required to describe one string given the other, capturing the shared information content between them up to additive logarithmic terms. Equivalently, up to such terms, ID(x,y) corresponds to the length of the shortest program that outputs both x and y, minus the smaller of the two individual complexities. Since Kolmogorov complexity is uncomputable—a consequence of the undecidability of the halting problem—ID(x,y) is likewise non-computable, motivating the development of practical approximations in subsequent work. This framework was developed into a similarity metric by Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul M. B. Vitányi in their 2004 paper on similarity metrics based on complexity theory.[7]
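The equivalence just stated follows from the symmetry-of-information (chain rule) identity for Kolmogorov complexity. As an illustrative sketch, with every equality holding only up to additive logarithmic terms: K(x|y) = K(x,y) - K(y) and K(y|x) = K(x,y) - K(x), so that \mathrm{ID}(x,y) = \max\{K(x|y), K(y|x)\} = K(x,y) - \min\{K(x), K(y)\}.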
Normalized Information Distance
The normalized information distance (NID) provides a bounded metric for measuring the similarity between two objects x and y based on their Kolmogorov complexities, addressing the unbounded nature of the raw information distance. Defined as \text{NID}(x,y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}}, where K(x|y) denotes the conditional Kolmogorov complexity of x given y, NID ranges from 0, indicating that x and y are identical or maximally similar (one fully compressible given the other), to 1, indicating maximal dissimilarity (no shared information, with each object requiring its full description length independently).[8] This normalization derives from the information distance \text{ID}(x,y) = \max\{K(x|y), K(y|x)\} + O(1), which itself approximates the minimal program length needed to transform one object into the other. NID satisfies the properties of a metric: non-negativity holds since complexities are non-negative, yielding \text{NID}(x,y) \geq 0; symmetry follows from the symmetric form of the definition, so \text{NID}(x,y) = \text{NID}(y,x); and the triangle inequality \text{NID}(x,z) \leq \text{NID}(x,y) + \text{NID}(y,z) holds up to an additive term that is negligible relative to the complexities involved, so NID behaves as a distance in the limit. In interpretation, \text{NID}(x,y) \approx 0 when x and y share substantial mutual information, making one highly compressible given the other, while \text{NID}(x,y) \approx 1 when they are informationally independent. Compared with the raw information distance, NID offers theoretical advantages including universality across object types, independence from individual object sizes (avoiding bias toward longer descriptions), and metric properties that enable its use in clustering and similarity-based analyses. Notably, NID is the only known non-trivial metric derivable directly from plain Kolmogorov complexity.[9]
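The two boundary cases can be read off directly from the definition; as a brief illustration (with all approximations up to the usual additive logarithmic terms): for identical objects, K(x|x) = O(1), so \text{NID}(x,x) = \frac{O(1)}{K(x)} \approx 0, while for informationally independent objects, K(x|y) \approx K(x) and K(y|x) \approx K(y), so \text{NID}(x,y) \approx \frac{\max\{K(x), K(y)\}}{\max\{K(x), K(y)\}} = 1.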
Definition and Computation
Normalized Compression Distance Metric
The Normalized Compression Distance (NCD) was first proposed by Li et al. in 2004 as a universal similarity metric suitable for tasks such as clustering across diverse data types, including strings, genomes, and languages.[10] The metric is defined by the formula \mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}, where C(z) denotes the length, in bytes, of the compressed representation of an object z produced by a real-world compression algorithm.[10] In this expression, C(xy) is the compressed length of the concatenation of objects x and y, which reflects the shared information or redundancy between them: greater overlap allows more efficient joint compression relative to the individual compressions.[10] The NCD yields values in the approximate range [0, 1 + \epsilon], where \epsilon (typically around 0.1) accounts for practical compression imperfections; a value near 0 signifies high similarity, as the objects are highly compressible together, while a value near 1 indicates maximal dissimilarity, and occasional values above 1 arise from compressor overhead.[10] The NCD provides a computable approximation to the Normalized Information Distance, with the approximation quality improving as the compressor more closely emulates Kolmogorov complexity; in this limit, \mathrm{NCD}(x,y) \approx \mathrm{NID}(x,y) + O(\log n / n), where n relates to the object size.[9]
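A minimal sketch of the formula in Python, using the standard-library zlib module as the stand-in compressor; the function names (compressed_size, ncd) and the example inputs are illustrative rather than part of any reference implementation:

    import zlib

    def compressed_size(data: bytes) -> int:
        # Stand-in for C(z): length in bytes of the zlib-compressed representation.
        return len(zlib.compress(data, 9))

    def ncd(x: bytes, y: bytes) -> float:
        # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
        cx, cy = compressed_size(x), compressed_size(y)
        cxy = compressed_size(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    if __name__ == "__main__":
        import os
        a = b"the quick brown fox jumps over the lazy dog " * 20
        b = b"the quick brown fox leaps over the sleepy cat " * 20
        r = os.urandom(1024)        # unrelated, incompressible data
        print(ncd(a, a))            # close to 0: identical objects
        print(ncd(a, b))            # intermediate: related but not identical
        print(ncd(a, r))            # close to 1: little shared information

Because zlib adds a small fixed header and models the data imperfectly, ncd(a, a) is small but not exactly zero, mirroring the \epsilon discussed above.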
Approximation via Compression Algorithms
In practice, the normalized information distance is approximated by the normalized compression distance (NCD) through the use of off-the-shelf compression algorithms, with the Kolmogorov complexity of an object estimated by the length of its compressed representation.[1] This substitution allows NCD to be computed empirically without requiring uncomputable theoretical quantities.[1] Common compressors include gzip (based on Lempel-Ziv parsing), bzip2 (employing block sorting with Huffman coding), and PPMZ (a prediction-by-partial-matching variant), with the compressed size C(z) taken as the output length in bytes.[1] To compute NCD between two objects x and y, the process involves compressing x individually to obtain C(x), compressing y to obtain C(y), and compressing their concatenation xy to obtain C(xy); these lengths are then inserted into the NCD formula.[1] For non-string data, inputs are serialized into byte sequences suitable for compression: binary data is treated directly as bitstreams, images are converted via raster-order pixel scanning (e.g., row by row), and multisets are serialized by concatenating elements in length-increasing lexicographic order, often with delimiters to distinguish components.[1][11][12] A key challenge arises from compressor-specific limitations, such as the fixed block size in bzip2 (typically 900 KB) or the sliding-window limit in gzip (about 32 KB), which prevent redundancy from being fully captured once objects exceed these thresholds and can push NCD(x, x) well away from 0, occasionally even above 1. To mitigate this, practitioners recommend applying multiple compressors and averaging the resulting NCD values for more robust estimates. NCD remains parameter-free, requiring no tuning or domain-specific thresholds, and alignment-free, relying solely on compression patterns without sequence matching or feature extraction.[1]
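The averaging recommendation can be sketched as follows, assuming Python's standard zlib, bz2, and lzma modules stand in for the compressors; the names (COMPRESSORS, averaged_ncd) are illustrative:

    import bz2
    import lzma
    import zlib
    from statistics import mean

    # One C(z) estimator per compressor; each returns a compressed length in bytes.
    COMPRESSORS = {
        "zlib": lambda d: len(zlib.compress(d, 9)),
        "bz2":  lambda d: len(bz2.compress(d, 9)),
        "lzma": lambda d: len(lzma.compress(d)),
    }

    def ncd_with(compress_len, x: bytes, y: bytes) -> float:
        cx, cy, cxy = compress_len(x), compress_len(y), compress_len(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    def averaged_ncd(x: bytes, y: bytes) -> float:
        # Average the estimates from several compressors to smooth out
        # compressor-specific artifacts (window limits, block sizes, headers).
        return mean(ncd_with(c, x, y) for c in COMPRESSORS.values())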
Properties and Performance
Theoretical Properties
The Normalized Compression Distance (NCD) exhibits several key metric properties when computed using a normal compressor, defined as one that is idempotent and monotonic and that satisfies the symmetry and distributivity conditions up to additive logarithmic terms.[13] Specifically, NCD is symmetric, meaning NCD(x, y) = NCD(y, x) for any objects x and y, because a normal compressor compresses the concatenations xy and yx to essentially the same length.[9] It is also non-negative, with NCD(x, y) ≥ 0, since monotonicity ensures that the concatenation cannot compress to less than either of its parts; the value is 0 (up to the compressor's imperfections) when x = y.[13] The triangle inequality NCD(x, y) ≤ NCD(x, z) + NCD(z, y) holds theoretically for normal compressors via the distributivity property of compression lengths, but it can be violated in practice because real-world compressors are imperfect, for example through non-idempotence or suboptimal encoding.[13][14] NCD possesses universality when the underlying compressor is universal, meaning it approaches the Normalized Information Distance (NID)—a non-computable metric based on Kolmogorov complexity—for any pair of objects as their sizes increase, minorizing every computable similarity distance up to an additive error term ε = [C(x|y) - K(x|y)] / C(x), where K denotes Kolmogorov complexity.[13] This universality ensures that NCD captures the dominant shared information between objects regardless of the specific compressor used, provided it is effective.[9] In terms of behavioral properties, NCD approaches 1 for pairs of random, independent strings, reflecting maximal dissimilarity because their concatenation compresses no better than the two strings compressed separately; it equals 0 for identical copies, indicating perfect similarity; and it yields intermediate values between 0 and 1 for objects sharing patterns, quantifying the degree of redundancy in their joint description.[9] Cilibrasi and Vitányi (2005) provide a proof sketch for the asymptotic satisfaction of the triangle inequality, showing that for sufficiently large objects deviations are bounded by O((log n)/n), where n is the maximum length, but empirical violations arise from compressor limitations such as fixed precision or non-asymptotic behavior.[13] The normalization in NCD confers scale-invariance, enabling fair comparisons between objects of disparate sizes by relativizing the compression gain to the larger object's compressed length, thus bounding the metric in [0, 1] independently of absolute scales.[9]
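These properties can be probed empirically; the following sketch, again assuming zlib as the compressor, illustrates approximate symmetry, the small but non-zero self-distance of a practical compressor, and a spot check of the triangle inequality (which real compressors are not guaranteed to satisfy):

    import os
    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        c = lambda d: len(zlib.compress(d, 9))
        cx, cy, cxy = c(x), c(y), c(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    x = os.urandom(2000)
    y = os.urandom(2000)
    z = x[:1000] + y[:1000]   # shares material with both x and y

    print(ncd(x, y), ncd(y, x))   # nearly identical: symmetry holds closely in practice
    print(ncd(x, x))              # small but not exactly 0: headers, imperfect modelling
    print(ncd(x, y) <= ncd(x, z) + ncd(z, y))   # triangle inequality spot check; real
                                                # compressors can occasionally violate it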
Empirical evaluations of the Normalized Compression Distance (NCD) have demonstrated its effectiveness in practical tasks such as plagiarism detection and text clustering, though it often incurs higher computational costs than simpler metrics such as Euclidean distance. In plagiarism detection for programming assignments, NCD-based systems using compressors like bzip2 have identified high-similarity pairs with p-values as low as 0.002, enabling confident flagging of 1.9% of submissions in a class of 369 as potential plagiarism cases and outperforming tools like zyBooks in surfacing subtle matches after variable renaming.[15] For text clustering, NCD has shown utility in grouping documents without domain-specific features, achieving competitive results in hierarchical clustering of heterogeneous data, though exact accuracy varies by dataset and compressor. However, NCD computations are generally slower than Euclidean distance-based methods because of the overhead of repeated compression operations, making it less suitable for very large-scale clustering without optimization.

Key limitations of NCD arise from its dependence on practical compressors, which deviate from ideal theoretical properties. NCD is sensitive to the choice of compressor, with variances in distance estimates of up to 0.1 between algorithms like gzip and bzip2 for identical objects, leading to inconsistent similarity rankings. For instance, gzip shows a discontinuity once inputs exceed its sliding window of about 32 KB, causing NCD(x,x) to jump to around 0.9 for larger files, while bzip2 exhibits NCD(x,x) values of 0.2–0.3 for files up to 450 KB. Additionally, compressors often fail idempotence (C(xx) ≠ C(x)), violating the metric's identity axiom and introducing non-metric behavior that can produce clustering artifacts, such as inflated distances for short strings where redundancy detection is limited. This overhead is pronounced for short inputs, where NCD fluctuations hinder reliable comparisons.[16] NCD computation also scales quadratically with the number of objects because of pairwise compressions, typically limiting practical use to datasets up to gigabyte scale without parallelization, as each distance requires compressing the individual and concatenated strings.

Recent developments address these issues through neural approximations, such as Neural NCD (2024), which employs large language models (LLMs) like RWKV as lossless compressors to capture semantic similarities better than traditional methods like gzip on natural language tasks. For example, Neural NCD achieved approximately 65% test accuracy on the AGNews dataset in the 100-shot setting, with a compression rate of 0.248 compared with gzip's 0.785, though at higher computational expense due to LLM inference. However, it underperforms on datasets like DBpedia (approximately 55% accuracy) and shows variable results independent of compression efficiency, highlighting a disconnect between compression quality and classification utility.[17]

GPU acceleration has mitigated scalability issues for image and satellite data applications. Using NVIDIA's nvCOMP library, GPU-based NCD variants achieve roughly 10x speedup over multi-threaded CPU implementations, reducing computation time from 84,967 seconds to 8,828 seconds on the UC Merced Land Use dataset (2,100 images) while reaching 75% classification accuracy for binary tasks (e.g., rural vs. urban). Similar gains apply to satellite imagery clustering on BigEarthNet, reducing false alarms in land cover detection through parallel patch-wise compressions.[18]
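The window-size discontinuity described above is easy to reproduce; a small sketch using Python's zlib (which shares DEFLATE's 32 KB window with gzip), with the sizes chosen purely for illustration:

    import os
    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        c = lambda d: len(zlib.compress(d, 9))
        cx, cy, cxy = c(x), c(y), c(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    # Once x is longer than the 32 KB sliding window, the second copy in the
    # concatenation xx can no longer back-reference the first copy, so the
    # self-distance NCD(x, x) drifts from near 0 toward 1.
    for size in (1_000, 16_000, 32_000, 64_000, 256_000):
        x = os.urandom(size)
        print(size, round(ncd(x, x), 3))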
Applications
Clustering and Classification
The Normalized Compression Distance (NCD) functions as a versatile distance metric in unsupervised clustering tasks, enabling the grouping of similar objects such as documents, files, or binary data without relying on predefined features or domain knowledge. In hierarchical clustering, NCD is used to compute pairwise distances, facilitating the construction of dendrograms that reveal hierarchical similarity structures among datasets. For instance, when applied to Russian literary texts (both originals and translations), NCD-based hierarchical clustering achieved a tree quality score S(T) = 0.949, effectively grouping works by author and translator.[19] Similarly, in clustering heterogeneous data such as music MIDI files across genres (jazz, rock, classical), NCD produced dendrograms with only 6 misclassifications out of 36 samples, demonstrating its ability to capture stylistic redundancies.[19] NCD can also be plugged into partitioning algorithms such as k-medoids (or k-means-style procedures adapted to arbitrary distances) by serving as the distance function that assigns data points to clusters based on compression-derived similarities. This approach has been employed for web page clustering, where NCD's focus on structural redundancy allows grouping of similar content without text preprocessing, yielding purity scores of around 85–93% in benchmarks involving document-like files.[19] The parameter-free nature of NCD makes it particularly advantageous over feature-engineered methods, as it automatically detects shared information patterns, such as repetitive structures in text or code.

In supervised classification, NCD supports nearest-neighbor classifiers by measuring similarity to labeled training examples, enabling tasks like genre detection and anomaly identification across diverse data types. For text classification on the Reuters-21578 dataset (subsets R8 and R52), an NCD-based k-NN approach using gzip compression achieved accuracies of 95.4% on R8 and 89.6% on R52, comparable to TF-IDF with logistic regression (94.9% and 87.4%, respectively), but without requiring tokenization, lowercasing, or other preprocessing steps.[20] This highlights NCD's efficiency in low-resource settings, where it leverages raw data to approximate information overlap.

Practical examples include spam filtering, where NCD enhances Bayesian classifiers by verifying email similarity to known spam samples; combining it with a spam probability threshold above 0.5 reduced total error to 4.98% and boosted detection to 99.49% on benchmark corpora.[21] For binary data such as malware, NCD classifiers identify anomalies by comparing file compressibility, capturing structural redundancies in executables with accuracies of 65–77% on large datasets such as Microsoft's malware classification challenge; they perform best on unstructured byte sequences and benefit from faster approximations for scalability.[22] Overall, NCD's universal applicability provides a robust, preprocessing-free alternative to traditional methods in both clustering and classification.
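A compression-based nearest-neighbor text classifier of the kind described above can be sketched in a few lines; gzip is used as the compressor here, and the helper names (clen, ncd, knn_predict) and the space-joined concatenation are illustrative choices rather than a reference implementation:

    import gzip
    from collections import Counter

    def clen(s: str) -> int:
        return len(gzip.compress(s.encode("utf-8")))

    def ncd(a: str, b: str) -> float:
        ca, cb = clen(a), clen(b)
        return (clen(a + " " + b) - min(ca, cb)) / max(ca, cb)

    def knn_predict(text: str, train: list[tuple[str, str]], k: int = 3) -> str:
        # train holds (document, label) pairs; predict by majority vote among the
        # k training documents with the smallest NCD to the query text.
        nearest = sorted(train, key=lambda pair: ncd(text, pair[0]))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]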
Bioinformatics and Phylogenetics
In bioinformatics, the Normalized Compression Distance (NCD) provides an alignment-free method for measuring similarity between DNA and protein sequences, approximating shared informational content via compression algorithms to capture evolutionary conservation without presupposing specific substitution models. This approach treats sequences as digital objects, leveraging conditional Kolmogorov complexity to quantify how much one sequence compresses in the presence of another, thus highlighting conserved patterns indicative of functional or evolutionary relatedness.[23]

In phylogenetics, NCD generates distance matrices from unaligned molecular sequences, serving as an alternative to alignment-dependent techniques like multiple sequence alignment for constructing evolutionary trees. These matrices feed distance-based tree-building methods, such as neighbor joining, and have been applied to datasets including mammalian mitochondrial genomes and plant sequences such as those from Arabidopsis thaliana. Recent evaluations demonstrate that NCD-based phylogenies perform robustly even under incomplete lineage sorting, comparing favorably to coalescent- and concatenation-based approaches.[24][25] NCD excels in scenarios involving highly divergent species, where it outperforms traditional model-based distances like Jukes-Cantor by better accommodating complex sequence divergences without model assumptions, achieving higher topological accuracy in tree reconstruction. For instance, NCD has facilitated clustering of viral genomes, such as hemagglutinin sequences from influenza viruses, revealing strain relationships through shared compressibility patterns.[26] Similarly, it has been used to cluster protein gene families, like 3-phosphoglycerate kinase across taxa, effectively identifying distant homologies where alignment methods falter.[23]

A notable extension applies NCD to multisets of k-mers for read mapping in next-generation sequencing, comparing k-mer compositions to align short reads to references while minimizing full-sequence alignments and computational overhead; this remains pertinent for processing large-scale genomic data. NCD's parameter-free nature also allows detection of non-homologous similarities, such as those arising from structural rearrangements, making it advantageous for identifying evolutionary events decoupled from standard vertical inheritance.[3][23]
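The distance-matrix workflow can be sketched as follows, with bz2 as the compressor and SciPy's average-linkage hierarchical clustering standing in for a distance-based tree builder such as neighbor joining (which would typically come from a dedicated package); the sequences dictionary and function names are illustrative:

    import bz2
    import itertools
    import numpy as np
    from scipy.cluster.hierarchy import average
    from scipy.spatial.distance import squareform

    def clen(seq: str) -> int:
        return len(bz2.compress(seq.encode("ascii")))

    def ncd(a: str, b: str) -> float:
        ca, cb = clen(a), clen(b)
        return (clen(a + b) - min(ca, cb)) / max(ca, cb)

    def ncd_matrix(sequences: dict[str, str]) -> tuple[list[str], np.ndarray]:
        # Pairwise NCD matrix over unaligned sequences keyed by taxon name.
        names = list(sequences)
        d = np.zeros((len(names), len(names)))
        for i, j in itertools.combinations(range(len(names)), 2):
            d[i, j] = d[j, i] = ncd(sequences[names[i]], sequences[names[j]])
        return names, d

    def build_tree(sequences: dict[str, str]):
        # Average-linkage clustering of the NCD matrix; the resulting linkage can be
        # drawn as a dendrogram or replaced by neighbor joining for a proper phylogeny.
        names, d = ncd_matrix(sequences)
        return names, average(squareform(d, checks=False))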
Other Domains
In neuroscience, the normalized compression distance (NCD) has been applied to analyze cortico-muscular connectivity by measuring functional coupling between electroencephalography (EEG) and electromyography (EMG) signals. In a 2022 study involving 15 healthy volunteers performing a weak handgrip task, NCD quantified synchronization levels between cortical and muscular activities, revealing lower synchronization in less dexterous networks and providing a data-driven estimate of two-node functional interactions without requiring predefined features.[27]

In imaging applications, NCD facilitates similarity assessment for satellite and aerial images by treating pixel data as compressible strings, enabling tasks such as change detection and scene clustering. Recent GPU-accelerated implementations have optimized NCD computations for large-scale satellite imagery, dividing images into patches and aggregating pairwise similarities to identify temporal changes efficiently.[28] For synthetic aperture radar (SAR) image change detection, NCD-based methods applied since 2020 directly compare compressed representations of bi-temporal images, bypassing traditional feature extraction.[29]

For graph-structured data, the normalized graph compression distance (NGCD) extends NCD to measure similarity between networks by compressing their adjacency matrices, supporting applications in network matching. Introduced in a 2025 framework, NGCD computes pairwise graph dissimilarities rapidly, offering a parameter-free alternative to edit-distance methods for tasks like graph classification and clustering.[30]

Beyond these areas, NCD detects plagiarism in source code by evaluating compression-based similarities between programs, as demonstrated in tools for identifying cloned buggy code in student assignments and professional repositories.[31] Similarly, for images, NCD identifies duplicated or altered content through direct similarity scoring, proving effective in experimental setups for detecting near-duplicates without alignment.[32] In natural language processing, NCD adaptations for multisets enable bag-of-words representations by treating term frequency vectors as compressible multisets, facilitating document similarity computations in alignment-free text analysis.[3]
Related Metrics
Normalized Google Distance
The Normalized Google Distance (NGD) is a metric for measuring semantic similarity between two terms x and y based on their co-occurrence frequencies in web search results, serving as a practical approximation to theoretical information distance measures. Introduced by Cilibrasi and Vitányi in 2007, it provides a low-cost alternative to compression-based methods for assessing topical relatedness at web scale.[33] The formula for NGD is \text{NGD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log N - \min\{\log f(x), \log f(y)\}}, where f(x) denotes the number of web pages containing term x, f(x,y) is the number containing both x and y, and N is the total number of indexed web pages (approximately 8 billion at the time of introduction). Computation involves querying a search engine such as Google for the hit counts of x, y, and the conjunction "x y" (approximating f(x,y)), with logarithms in any fixed base (the base cancels in the ratio). This normalization makes the measure scale-invariant and bounded, similar in spirit to the normalization in Normalized Compression Distance but leveraging search frequencies instead of data compression.

NGD values range from 0 (indicating strong semantic association, as when f(x,y) \approx \max\{f(x), f(y)\}) to approximately 1 (for unrelated terms under independence assumptions, where f(x,y) \approx f(x) f(y) / N), and become infinite if the terms never co-occur (f(x,y) = 0). For instance, the terms "horse" and "rider" yield an NGD of about 0.44, reflecting moderate relatedness, while highly associated pairs approach 0. The metric effectively captures topical co-occurrences, such as a low NGD between "banana" and "monkey" due to frequent joint mentions in contextual web content. However, it struggles with syntactic similarities (e.g., grammatical structures without topical links) and with rare terms whose hit counts are low or zero, leading to unreliable estimates.
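Given hit counts obtained from a search engine, the formula itself is a one-liner; the sketch below uses hypothetical counts of the same order of magnitude as the "horse"/"rider" example cited above, purely for illustration (no search API is queried):

    from math import log

    def ngd(f_x: float, f_y: float, f_xy: float, n: float) -> float:
        # f_x, f_y: pages containing each term; f_xy: pages containing both; n: index size.
        if f_xy == 0:
            return float("inf")   # terms that never co-occur
        numerator = max(log(f_x), log(f_y)) - log(f_xy)
        denominator = log(n) - min(log(f_x), log(f_y))
        return numerator / denominator

    # Hypothetical counts: ~4.7e7 pages for "horse", ~1.2e7 for "rider",
    # ~2.6e6 for both, with an index of ~8e9 pages.
    print(ngd(4.67e7, 1.21e7, 2.63e6, 8.0e9))   # ≈ 0.44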
Normalized Relative Compression
The Normalized Relative Compression (NRC) is a one-sided similarity metric derived from compression-based approximations to Kolmogorov complexity, defined as \text{NRC}(x|y) = \frac{C(x||y)}{|x|},
where C(\cdot) denotes the compressed length of an object using a given compressor, C(x||y) is the relative (exclusive) compression of x given y, and |x| is the uncompressed length of x.[34] This formulation measures how much x can be compressed using information from y, capturing shared information content. The metric ranges from 0 to 1, where values close to 0 indicate high similarity (as x is largely predictable from y) and values near 1 indicate dissimilarity (little compression benefit from y).[34] Unlike the Normalized Compression Distance (NCD), which uses concatenated compression normalized by the maximum individual compressed length, NRC uses relative compression normalized by the uncompressed length, offering greater stability when comparing objects of unequal sizes by directly accounting for length disparities.[34] Because NRC is asymmetric, symmetric variants such as the Normalized Conditional Compression Distance (NCCD) are used when a true distance is required:
\text{NCCD}(x,y) = \frac{\max\{C(x|y), C(y|x)\}}{\max\{C(x), C(y)\}},
or the average of the two directed relative compressions.[34] NRC was proposed to enhance compression-based similarity measures, particularly to mitigate overhead issues in compressing short data strings, where fixed compressor costs can distort estimates.[34] Computationally, it relies on compressors that support a relative or conditional mode (e.g., via modifications to gzip or bzip2) to compute C(x||y); where no such mode is available, the concatenation approximation C(x||y) \approx C(xy) - C(y) is used, though it captures the conditional information less accurately.[34]
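A minimal sketch of NRC and NCCD under the concatenation approximation mentioned above, using Python's lzma module as the compressor (the standard library offers no true conditional mode, so this is only the fallback approximation; the function names are illustrative):

    import lzma

    def clen(data: bytes) -> int:
        return len(lzma.compress(data))

    def nrc(x: bytes, y: bytes) -> float:
        # NRC(x|y) = C(x||y) / |x|, with C(x||y) approximated by C(xy) - C(y).
        c_x_given_y = max(clen(x + y) - clen(y), 0)
        return c_x_given_y / len(x)

    def nccd(x: bytes, y: bytes) -> float:
        # Symmetric NCCD variant built from the same conditional approximations.
        cx, cy = clen(x), clen(y)
        c_x_given_y = max(clen(x + y) - cy, 0)
        c_y_given_x = max(clen(y + x) - cx, 0)
        return max(c_x_given_y, c_y_given_x) / max(cx, cy)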