Normalized compression distance
The Normalized Compression Distance (NCD) is a parameter-free similarity metric that estimates the information shared between two objects—such as strings, files, or sequences—from the compressed lengths produced by standard data compression algorithms, thereby approximating the theoretical Normalized Information Distance based on Kolmogorov complexity.[1] Introduced by Rudi Cilibrasi and Paul M. B. Vitányi in their 2005 paper "Clustering by Compression," the NCD enables domain-independent comparisons without requiring specialized features or prior knowledge about the data.[1] The NCD between two objects x and y is formally defined as \text{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}, where C(z) denotes the length of the compressed representation of z produced by a given compressor, and xy is the concatenation of x and y.[1]

This formulation yields a value between 0 (indicating high similarity, as one object is largely redundant given the other) and just over 1 (indicating dissimilarity). It is reasonably robust across different compressors such as gzip or bzip2, though results can vary slightly due to compressor idiosyncrasies.[2] The metric's universality stems from its roots in algorithmic information theory, making it applicable to diverse data types without alignment or feature extraction.[1]

NCD has been widely applied in clustering tasks across fields, including biological classification, music genre classification, and cybersecurity for malware detection, often outperforming traditional methods in unsupervised settings.[3][4][5] Its hierarchical clustering variant, using dendrograms, has revealed insights such as evolutionary relationships in genomes and structural similarities in musical compositions.[1] Recent developments as of 2024 include neural variants of NCD for enhanced classification tasks.[6] Despite its strengths, NCD is sensitive to compressor choice and data preprocessing, so careful implementation is needed to avoid pitfalls such as non-idempotence under repeated compression.[2]

Theoretical Foundations
Information Distance
The information distance serves as a foundational theoretical measure of dissimilarity between two finite objects, represented as binary strings x and y, rooted in algorithmic information theory.[7] The Kolmogorov complexity K(x) of a string x is defined as the length of the shortest binary program that outputs x when executed on a universal Turing machine. Similarly, the conditional Kolmogorov complexity K(x|y) is the length of the shortest program that outputs x given y as an auxiliary input to the universal Turing machine. The universal information distance between x and y, denoted ID(x,y), is given by ID(x,y) = \max\{K(x|y), K(y|x)\}. This quantity represents the additional information required to describe one string given the other, capturing the shared information content between them up to additive logarithmic terms. Equivalently, up to such terms, ID(x,y) corresponds to the length of the shortest program that outputs both x and y, minus the smaller of the two individual complexities. Since Kolmogorov complexity is uncomputable—a consequence of the undecidability of the halting problem—ID(x,y) is likewise non-computable, motivating the development of practical approximations in subsequent work. This framework was developed into a similarity metric by Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul M. B. Vitányi in their 2004 paper on similarity metrics based on complexity theory.[7]
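The equivalence just stated follows from the symmetry-of-information (chain rule) identity for Kolmogorov complexity. As an illustrative sketch, with every equality holding only up to additive logarithmic terms: K(x|y) = K(x,y) - K(y) and K(y|x) = K(x,y) - K(x), so that \mathrm{ID}(x,y) = \max\{K(x|y), K(y|x)\} = K(x,y) - \min\{K(x), K(y)\}.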
Normalized Information Distance
The normalized information distance (NID) provides a bounded metric for measuring the similarity between two objects x and y based on their Kolmogorov complexities, addressing the unbounded nature of the raw information distance. Defined as \text{NID}(x,y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}}, where K(x|y) denotes the conditional Kolmogorov complexity of x given y, NID ranges from 0, indicating that x and y are identical or maximally similar (one fully compressible given the other), to 1, indicating maximal dissimilarity (no shared information, with each object requiring its full description length independently).[8] This normalization derives from the information distance \text{ID}(x,y) = \max\{K(x|y), K(y|x)\} + O(1), which itself approximates the minimal program length needed to transform one object into the other. NID satisfies the properties of a metric: non-negativity holds since complexities are non-negative, yielding \text{NID}(x,y) \geq 0; symmetry follows from the symmetric form of the definition, so \text{NID}(x,y) = \text{NID}(y,x); and the triangle inequality \text{NID}(x,z) \leq \text{NID}(x,y) + \text{NID}(y,z) holds up to an additive term that is negligible relative to the complexities involved, so NID behaves as a distance in the limit. In interpretation, \text{NID}(x,y) \approx 0 when x and y share substantial mutual information, making one highly compressible given the other, while \text{NID}(x,y) \approx 1 when they are informationally independent. Compared with the raw information distance, NID offers theoretical advantages including universality across object types, independence from individual object sizes (avoiding bias toward longer descriptions), and metric properties that enable its use in clustering and similarity-based analyses. Notably, NID is the only known non-trivial metric derivable directly from plain Kolmogorov complexity.[9]
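The two boundary cases can be read off directly from the definition; as a brief illustration (with all approximations up to the usual additive logarithmic terms): for identical objects, K(x|x) = O(1), so \text{NID}(x,x) = \frac{O(1)}{K(x)} \approx 0, while for informationally independent objects, K(x|y) \approx K(x) and K(y|x) \approx K(y), so \text{NID}(x,y) \approx \frac{\max\{K(x), K(y)\}}{\max\{K(x), K(y)\}} = 1.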
Definition and Computation
Normalized Compression Distance Metric
The Normalized Compression Distance (NCD) was first proposed by Li et al. in 2004 as a universal similarity metric suitable for tasks such as clustering across diverse data types, including strings, genomes, and languages.[10] The metric is defined by the formula \mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}, where C(z) denotes the length, in bytes, of the compressed representation of an object z produced by a real-world compression algorithm.[10] In this expression, C(xy) is the compressed length of the concatenation of objects x and y, which reflects the shared information or redundancy between them: greater overlap allows more efficient joint compression relative to the individual compressions.[10] The NCD yields values in the approximate range [0, 1 + \epsilon], where \epsilon (typically around 0.1) accounts for practical compression imperfections; a value near 0 signifies high similarity, as the objects are highly compressible together, while a value near 1 indicates maximal dissimilarity, and occasional values above 1 arise from compressor overhead.[10] The NCD provides a computable approximation to the Normalized Information Distance, with the approximation quality improving as the compressor more closely emulates Kolmogorov complexity; in this limit, \mathrm{NCD}(x,y) \approx \mathrm{NID}(x,y) + O(\log n / n), where n relates to the object size.[9]
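A minimal sketch of the formula in Python, using the standard-library zlib module as the stand-in compressor; the function names (compressed_size, ncd) and the example inputs are illustrative rather than part of any reference implementation:

    import zlib

    def compressed_size(data: bytes) -> int:
        # Stand-in for C(z): length in bytes of the zlib-compressed representation.
        return len(zlib.compress(data, 9))

    def ncd(x: bytes, y: bytes) -> float:
        # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
        cx, cy = compressed_size(x), compressed_size(y)
        cxy = compressed_size(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    if __name__ == "__main__":
        import os
        a = b"the quick brown fox jumps over the lazy dog " * 20
        b = b"the quick brown fox leaps over the sleepy cat " * 20
        r = os.urandom(1024)        # unrelated, incompressible data
        print(ncd(a, a))            # close to 0: identical objects
        print(ncd(a, b))            # intermediate: related but not identical
        print(ncd(a, r))            # close to 1: little shared information

Because zlib adds a small fixed header and models the data imperfectly, ncd(a, a) is small but not exactly zero, mirroring the \epsilon discussed above.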
Approximation via Compression Algorithms
In practice, the normalized information distance is approximated by the normalized compression distance (NCD) through the use of off-the-shelf compression algorithms, with the Kolmogorov complexity of an object estimated by the length of its compressed representation.[1] This substitution allows NCD to be computed empirically without requiring uncomputable theoretical quantities.[1] Common compressors include gzip (based on Lempel-Ziv parsing), bzip2 (employing block sorting with Huffman coding), and PPMZ (a prediction-by-partial-matching variant), with the compressed size C(z) taken as the output length in bytes.[1] To compute NCD between two objects x and y, the process involves compressing x individually to obtain C(x), compressing y to obtain C(y), and compressing their concatenation xy to obtain C(xy); these lengths are then inserted into the NCD formula.[1] For non-string data, inputs are serialized into byte sequences suitable for compression: binary data is treated directly as bitstreams, images are converted via raster-order pixel scanning (e.g., row by row), and multisets are serialized by concatenating elements in length-increasing lexicographic order, often with delimiters to distinguish components.[1][11][12] A key challenge arises from compressor-specific limitations, such as the fixed block size in bzip2 (typically 900 KB) or the sliding-window limit in gzip (about 32 KB), which prevent redundancy from being fully captured once objects exceed these thresholds and can push NCD(x, x) well away from 0, occasionally even above 1. To mitigate this, practitioners recommend applying multiple compressors and averaging the resulting NCD values for more robust estimates. NCD remains parameter-free, requiring no tuning or domain-specific thresholds, and alignment-free, relying solely on compression patterns without sequence matching or feature extraction.[1]
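The averaging recommendation can be sketched as follows, assuming Python's standard zlib, bz2, and lzma modules stand in for the compressors; the names (COMPRESSORS, averaged_ncd) are illustrative:

    import bz2
    import lzma
    import zlib
    from statistics import mean

    # One C(z) estimator per compressor; each returns a compressed length in bytes.
    COMPRESSORS = {
        "zlib": lambda d: len(zlib.compress(d, 9)),
        "bz2":  lambda d: len(bz2.compress(d, 9)),
        "lzma": lambda d: len(lzma.compress(d)),
    }

    def ncd_with(compress_len, x: bytes, y: bytes) -> float:
        cx, cy, cxy = compress_len(x), compress_len(y), compress_len(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    def averaged_ncd(x: bytes, y: bytes) -> float:
        # Average the estimates from several compressors to smooth out
        # compressor-specific artifacts (window limits, block sizes, headers).
        return mean(ncd_with(c, x, y) for c in COMPRESSORS.values())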
Properties and Performance
Theoretical Properties
The Normalized Compression Distance (NCD) exhibits several key metric properties when computed using a normal compressor, defined as one that is idempotent and monotonic and that satisfies the symmetry and distributivity conditions up to additive logarithmic terms.[13] Specifically, NCD is symmetric, meaning NCD(x, y) = NCD(y, x) for any objects x and y, because a normal compressor compresses the concatenations xy and yx to essentially the same length.[9] It is also non-negative, with NCD(x, y) ≥ 0, since monotonicity ensures that the concatenation cannot compress to less than either of its parts; the value is 0 (up to the compressor's imperfections) when x = y.[13] The triangle inequality NCD(x, y) ≤ NCD(x, z) + NCD(z, y) holds theoretically for normal compressors via the distributivity property of compression lengths, but it can be violated in practice because real-world compressors are imperfect, for example through non-idempotence or suboptimal encoding.[13][14] NCD possesses universality when the underlying compressor is universal, meaning it approaches the Normalized Information Distance (NID)—a non-computable metric based on Kolmogorov complexity—for any pair of objects as their sizes increase, minorizing every computable similarity distance up to an additive error term ε = [C(x|y) - K(x|y)] / C(x), where K denotes Kolmogorov complexity.[13] This universality ensures that NCD captures the dominant shared information between objects regardless of the specific compressor used, provided it is effective.[9] In terms of behavioral properties, NCD approaches 1 for pairs of random, independent strings, reflecting maximal dissimilarity because their concatenation compresses no better than the two strings compressed separately; it equals 0 for identical copies, indicating perfect similarity; and it yields intermediate values between 0 and 1 for objects sharing patterns, quantifying the degree of redundancy in their joint description.[9] Cilibrasi and Vitányi (2005) provide a proof sketch for the asymptotic satisfaction of the triangle inequality, showing that for sufficiently large objects deviations are bounded by O((log n)/n), where n is the maximum length, but empirical violations arise from compressor limitations such as fixed precision or non-asymptotic behavior.[13] The normalization in NCD confers scale-invariance, enabling fair comparisons between objects of disparate sizes by relativizing the compression gain to the larger object's compressed length, thus bounding the metric in [0, 1] independently of absolute scales.[9]
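These properties can be probed empirically; the following sketch, again assuming zlib as the compressor, illustrates approximate symmetry, the small but non-zero self-distance of a practical compressor, and a spot check of the triangle inequality (which real compressors are not guaranteed to satisfy):

    import os
    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        c = lambda d: len(zlib.compress(d, 9))
        cx, cy, cxy = c(x), c(y), c(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    x = os.urandom(2000)
    y = os.urandom(2000)
    z = x[:1000] + y[:1000]   # shares material with both x and y

    print(ncd(x, y), ncd(y, x))   # nearly identical: symmetry holds closely in practice
    print(ncd(x, x))              # small but not exactly 0: headers, imperfect modelling
    print(ncd(x, y) <= ncd(x, z) + ncd(z, y))   # triangle inequality spot check; real
                                                # compressors can occasionally violate it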
Empirical evaluations of the Normalized Compression Distance (NCD) have demonstrated its effectiveness in practical tasks such as plagiarism detection and text clustering, though it often incurs higher computational costs than simpler metrics such as Euclidean distance. In plagiarism detection for programming assignments, NCD-based systems using compressors like bzip2 have identified high-similarity pairs with p-values as low as 0.002, enabling confident flagging of 1.9% of submissions in a class of 369 as potential plagiarism cases and outperforming tools like zyBooks in surfacing subtle matches after variable renaming.[15] For text clustering, NCD has shown utility in grouping documents without domain-specific features, achieving competitive results in hierarchical clustering of heterogeneous data, though exact accuracy varies by dataset and compressor. However, NCD computations are generally slower than Euclidean distance-based methods because of the overhead of repeated compression operations, making it less suitable for very large-scale clustering without optimization.

Key limitations of NCD arise from its dependence on practical compressors, which deviate from ideal theoretical properties. NCD is sensitive to the choice of compressor, with variances in distance estimates of up to 0.1 between algorithms like gzip and bzip2 for identical objects, leading to inconsistent similarity rankings. For instance, gzip shows a discontinuity once inputs exceed its sliding window of about 32 KB, causing NCD(x,x) to jump to around 0.9 for larger files, while bzip2 exhibits NCD(x,x) values of 0.2–0.3 for files up to 450 KB. Additionally, compressors often fail idempotence (C(xx) ≠ C(x)), violating the metric's identity axiom and introducing non-metric behavior that can produce clustering artifacts, such as inflated distances for short strings where redundancy detection is limited. This overhead is pronounced for short inputs, where NCD fluctuations hinder reliable comparisons.[16] NCD computation also scales quadratically with the number of objects because of pairwise compressions, typically limiting practical use to datasets up to gigabyte scale without parallelization, as each distance requires compressing the individual and concatenated strings.

Recent developments address these issues through neural approximations, such as Neural NCD (2024), which employs large language models (LLMs) like RWKV as lossless compressors to capture semantic similarities better than traditional methods like gzip on natural language tasks. For example, Neural NCD achieved approximately 65% test accuracy on the AGNews dataset in the 100-shot setting, with a compression rate of 0.248 compared with gzip's 0.785, though at higher computational expense due to LLM inference. However, it underperforms on datasets like DBpedia (approximately 55% accuracy) and shows variable results independent of compression efficiency, highlighting a disconnect between compression quality and classification utility.[17]

GPU acceleration has mitigated scalability issues for image and satellite data applications. Using NVIDIA's nvCOMP library, GPU-based NCD variants achieve roughly 10x speedup over multi-threaded CPU implementations, reducing computation time from 84,967 seconds to 8,828 seconds on the UC Merced Land Use dataset (2,100 images) while reaching 75% classification accuracy for binary tasks (e.g., rural vs. urban). Similar gains apply to satellite imagery clustering on BigEarthNet, reducing false alarms in land cover detection through parallel patch-wise compressions.[18]
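The window-size discontinuity described above is easy to reproduce; a small sketch using Python's zlib (which shares DEFLATE's 32 KB window with gzip), with the sizes chosen purely for illustration:

    import os
    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        c = lambda d: len(zlib.compress(d, 9))
        cx, cy, cxy = c(x), c(y), c(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    # Once x is longer than the 32 KB sliding window, the second copy in the
    # concatenation xx can no longer back-reference the first copy, so the
    # self-distance NCD(x, x) drifts from near 0 toward 1.
    for size in (1_000, 16_000, 32_000, 64_000, 256_000):
        x = os.urandom(size)
        print(size, round(ncd(x, x), 3))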
Applications
Clustering and Classification
The Normalized Compression Distance (NCD) functions as a versatile distance metric in unsupervised clustering tasks, enabling the grouping of similar objects such as documents, files, or binary data without relying on predefined features or domain knowledge. In hierarchical clustering, NCD is used to compute pairwise distances, facilitating the construction of dendrograms that reveal hierarchical similarity structures among datasets. For instance, when applied to Russian literary texts (both originals and translations), NCD-based hierarchical clustering achieved a tree quality score S(T) = 0.949, effectively grouping works by author and translator.[19] Similarly, in clustering heterogeneous data such as music MIDI files across genres (jazz, rock, classical), NCD produced dendrograms with only 6 misclassifications out of 36 samples, demonstrating its ability to capture stylistic redundancies.[19] NCD can also be plugged into partitioning algorithms such as k-medoids (or k-means-style procedures adapted to arbitrary distances) by serving as the distance function that assigns data points to clusters based on compression-derived similarities. This approach has been employed for web page clustering, where NCD's focus on structural redundancy allows grouping of similar content without text preprocessing, yielding purity scores of around 85–93% in benchmarks involving document-like files.[19] The parameter-free nature of NCD makes it particularly advantageous over feature-engineered methods, as it automatically detects shared information patterns, such as repetitive structures in text or code.

In supervised classification, NCD supports nearest-neighbor classifiers by measuring similarity to labeled training examples, enabling tasks like genre detection and anomaly identification across diverse data types. For text classification on the Reuters-21578 dataset (subsets R8 and R52), an NCD-based k-NN approach using gzip compression achieved accuracies of 95.4% on R8 and 89.6% on R52, comparable to TF-IDF with logistic regression (94.9% and 87.4%, respectively), but without requiring tokenization, lowercasing, or other preprocessing steps.[20] This highlights NCD's efficiency in low-resource settings, where it leverages raw data to approximate information overlap.

Practical examples include spam filtering, where NCD enhances Bayesian classifiers by verifying email similarity to known spam samples; combining it with a spam probability threshold above 0.5 reduced total error to 4.98% and boosted detection to 99.49% on benchmark corpora.[21] For binary data such as malware, NCD classifiers identify anomalies by comparing file compressibility, capturing structural redundancies in executables with accuracies of 65–77% on large datasets such as Microsoft's malware classification challenge; they perform best on unstructured byte sequences and benefit from faster approximations for scalability.[22] Overall, NCD's universal applicability provides a robust, preprocessing-free alternative to traditional methods in both clustering and classification.
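A compression-based nearest-neighbor text classifier of the kind described above can be sketched in a few lines; gzip is used as the compressor here, and the helper names (clen, ncd, knn_predict) and the space-joined concatenation are illustrative choices rather than a reference implementation:

    import gzip
    from collections import Counter

    def clen(s: str) -> int:
        return len(gzip.compress(s.encode("utf-8")))

    def ncd(a: str, b: str) -> float:
        ca, cb = clen(a), clen(b)
        return (clen(a + " " + b) - min(ca, cb)) / max(ca, cb)

    def knn_predict(text: str, train: list[tuple[str, str]], k: int = 3) -> str:
        # train holds (document, label) pairs; predict by majority vote among the
        # k training documents with the smallest NCD to the query text.
        nearest = sorted(train, key=lambda pair: ncd(text, pair[0]))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]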
Bioinformatics and Phylogenetics
In bioinformatics, the Normalized Compression Distance (NCD) provides an alignment-free method for measuring similarity between DNA and protein sequences, approximating shared informational content via compression algorithms to capture evolutionary conservation without presupposing specific substitution models. This approach treats sequences as digital objects, leveraging conditional Kolmogorov complexity to quantify how much one sequence compresses in the presence of another, thus highlighting conserved patterns indicative of functional or evolutionary relatedness.[23]

In phylogenetics, NCD generates distance matrices from unaligned molecular sequences, serving as an alternative to alignment-dependent techniques like multiple sequence alignment for constructing evolutionary trees. These matrices feed distance-based tree-building methods, such as neighbor joining, and have been applied to datasets including mammalian mitochondrial genomes and plant sequences such as those from Arabidopsis thaliana. Recent evaluations demonstrate that NCD-based phylogenies perform robustly even under incomplete lineage sorting, comparing favorably to coalescent- and concatenation-based approaches.[24][25] NCD excels in scenarios involving highly divergent species, where it outperforms traditional model-based distances like Jukes-Cantor by better accommodating complex sequence divergences without model assumptions, achieving higher topological accuracy in tree reconstruction. For instance, NCD has facilitated clustering of viral genomes, such as hemagglutinin sequences from influenza viruses, revealing strain relationships through shared compressibility patterns.[26] Similarly, it has been used to cluster protein gene families, like 3-phosphoglycerate kinase across taxa, effectively identifying distant homologies where alignment methods falter.[23]

A notable extension applies NCD to multisets of k-mers for read mapping in next-generation sequencing, comparing k-mer compositions to align short reads to references while minimizing full-sequence alignments and computational overhead; this remains pertinent for processing large-scale genomic data. NCD's parameter-free nature also allows detection of non-homologous similarities, such as those arising from structural rearrangements, making it advantageous for identifying evolutionary events decoupled from standard vertical inheritance.[3][23]
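The distance-matrix workflow can be sketched as follows, with bz2 as the compressor and SciPy's average-linkage hierarchical clustering standing in for a distance-based tree builder such as neighbor joining (which would typically come from a dedicated package); the sequences dictionary and function names are illustrative:

    import bz2
    import itertools
    import numpy as np
    from scipy.cluster.hierarchy import average
    from scipy.spatial.distance import squareform

    def clen(seq: str) -> int:
        return len(bz2.compress(seq.encode("ascii")))

    def ncd(a: str, b: str) -> float:
        ca, cb = clen(a), clen(b)
        return (clen(a + b) - min(ca, cb)) / max(ca, cb)

    def ncd_matrix(sequences: dict[str, str]) -> tuple[list[str], np.ndarray]:
        # Pairwise NCD matrix over unaligned sequences keyed by taxon name.
        names = list(sequences)
        d = np.zeros((len(names), len(names)))
        for i, j in itertools.combinations(range(len(names)), 2):
            d[i, j] = d[j, i] = ncd(sequences[names[i]], sequences[names[j]])
        return names, d

    def build_tree(sequences: dict[str, str]):
        # Average-linkage clustering of the NCD matrix; the resulting linkage can be
        # drawn as a dendrogram or replaced by neighbor joining for a proper phylogeny.
        names, d = ncd_matrix(sequences)
        return names, average(squareform(d, checks=False))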
Other Domains
In neuroscience, the normalized compression distance (NCD) has been applied to analyze cortico-muscular connectivity by measuring functional coupling between electroencephalography (EEG) and electromyography (EMG) signals. In a 2022 study involving 15 healthy volunteers performing a weak handgrip task, NCD quantified synchronization levels between cortical and muscular activities, revealing lower synchronization in less dexterous networks and providing a data-driven estimate of two-node functional interactions without requiring predefined features.[27]

In imaging applications, NCD facilitates similarity assessment for satellite and aerial images by treating pixel data as compressible strings, enabling tasks such as change detection and scene clustering. Recent GPU-accelerated implementations have optimized NCD computations for large-scale satellite imagery, dividing images into patches and aggregating pairwise similarities to identify temporal changes efficiently.[28] For synthetic aperture radar (SAR) image change detection, NCD-based methods applied since 2020 directly compare compressed representations of bi-temporal images, bypassing traditional feature extraction.[29]

For graph-structured data, the normalized graph compression distance (NGCD) extends NCD to measure similarity between networks by compressing their adjacency matrices, supporting applications in network matching. Introduced in a 2025 framework, NGCD computes pairwise graph dissimilarities rapidly, offering a parameter-free alternative to edit-distance methods for tasks like graph classification and clustering.[30]

Beyond these areas, NCD detects plagiarism in source code by evaluating compression-based similarities between programs, as demonstrated in tools for identifying cloned buggy code in student assignments and professional repositories.[31] Similarly, for images, NCD identifies duplicated or altered content through direct similarity scoring, proving effective in experimental setups for detecting near-duplicates without alignment.[32] In natural language processing, NCD adaptations for multisets enable bag-of-words representations by treating term frequency vectors as compressible multisets, facilitating document similarity computations in alignment-free text analysis.[3]
Related Metrics
Normalized Google Distance
The Normalized Google Distance (NGD) is a metric for measuring semantic similarity between two terms x and y based on their co-occurrence frequencies in web search results, serving as a practical approximation to theoretical information distance measures. Introduced by Cilibrasi and Vitányi in 2007, it provides a low-cost alternative to compression-based methods for assessing topical relatedness at web scale.[33] The formula for NGD is \text{NGD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log N - \min\{\log f(x), \log f(y)\}}, where f(x) denotes the number of web pages containing term x, f(x,y) is the number containing both x and y, and N is the total number of indexed web pages (approximately 8 billion at the time of introduction). Computation involves querying a search engine such as Google for the hit counts of x, y, and the conjunction "x y" (approximating f(x,y)), with logarithms in any fixed base (the base cancels in the ratio). This normalization makes the measure scale-invariant and bounded, similar in spirit to the normalization in Normalized Compression Distance but leveraging search frequencies instead of data compression.

NGD values range from 0 (indicating strong semantic association, as when f(x,y) \approx \max\{f(x), f(y)\}) to approximately 1 (for unrelated terms under independence assumptions, where f(x,y) \approx f(x) f(y) / N), and become infinite if the terms never co-occur (f(x,y) = 0). For instance, the terms "horse" and "rider" yield an NGD of about 0.44, reflecting moderate relatedness, while highly associated pairs approach 0. The metric effectively captures topical co-occurrences, such as a low NGD between "banana" and "monkey" due to frequent joint mentions in contextual web content. However, it struggles with syntactic similarities (e.g., grammatical structures without topical links) and with rare terms whose hit counts are low or zero, leading to unreliable estimates.
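Given hit counts obtained from a search engine, the formula itself is a one-liner; the sketch below uses hypothetical counts of the same order of magnitude as the "horse"/"rider" example cited above, purely for illustration (no search API is queried):

    from math import log

    def ngd(f_x: float, f_y: float, f_xy: float, n: float) -> float:
        # f_x, f_y: pages containing each term; f_xy: pages containing both; n: index size.
        if f_xy == 0:
            return float("inf")   # terms that never co-occur
        numerator = max(log(f_x), log(f_y)) - log(f_xy)
        denominator = log(n) - min(log(f_x), log(f_y))
        return numerator / denominator

    # Hypothetical counts: ~4.7e7 pages for "horse", ~1.2e7 for "rider",
    # ~2.6e6 for both, with an index of ~8e9 pages.
    print(ngd(4.67e7, 1.21e7, 2.63e6, 8.0e9))   # ≈ 0.44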
Normalized Relative Compression
The Normalized Relative Compression (NRC) is a one-sided similarity metric derived from compression-based approximations to Kolmogorov complexity, defined as \text{NRC}(x|y) = \frac{C(x||y)}{|x|},
where C(\cdot) denotes the compressed length of an object using a given compressor, C(x||y) is the relative (exclusive) compression of x given y, and |x| is the uncompressed length of x.[34] This formulation measures how much x can be compressed using information from y, capturing shared information content. The metric ranges from 0 to 1, where values close to 0 indicate high similarity (as x is largely predictable from y) and values near 1 indicate dissimilarity (little compression benefit from y).[34] Unlike the Normalized Compression Distance (NCD), which uses concatenated compression normalized by the maximum individual compressed length, NRC uses relative compression normalized by the uncompressed length, offering greater stability when comparing objects of unequal sizes by directly accounting for length disparities.[34] Because NRC is asymmetric, symmetric variants such as the Normalized Conditional Compression Distance (NCCD) are used when a true distance is required:
\text{NCCD}(x,y) = \frac{\max\{C(x|y), C(y|x)\}}{\max\{C(x), C(y)\}},
or the average of the two directed relative compressions.[34] NRC was proposed to enhance compression-based similarity measures, particularly to mitigate overhead issues in compressing short data strings, where fixed compressor costs can distort estimates.[34] Computationally, it relies on compressors that support a relative or conditional mode (e.g., via modifications to gzip or bzip2) to compute C(x||y); where no such mode is available, the concatenation approximation C(x||y) \approx C(xy) - C(y) is used, though it captures the conditional information less accurately.[34]
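A minimal sketch of NRC and NCCD under the concatenation approximation mentioned above, using Python's lzma module as the compressor (the standard library offers no true conditional mode, so this is only the fallback approximation; the function names are illustrative):

    import lzma

    def clen(data: bytes) -> int:
        return len(lzma.compress(data))

    def nrc(x: bytes, y: bytes) -> float:
        # NRC(x|y) = C(x||y) / |x|, with C(x||y) approximated by C(xy) - C(y).
        c_x_given_y = max(clen(x + y) - clen(y), 0)
        return c_x_given_y / len(x)

    def nccd(x: bytes, y: bytes) -> float:
        # Symmetric NCCD variant built from the same conditional approximations.
        cx, cy = clen(x), clen(y)
        c_x_given_y = max(clen(x + y) - cy, 0)
        c_y_given_x = max(clen(y + x) - cx, 0)
        return max(c_x_given_y, c_y_given_x) / max(cx, cy)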