
# MinHash

MinHash is a probabilistic algorithm for efficiently estimating the Jaccard similarity between large sets, such as those representing documents or data streams, by generating compact signatures through the application of random permutations or hash functions and recording minimum values. It serves as a core component of locality-sensitive hashing (LSH), enabling the rapid detection of similar items in massive datasets without exhaustive pairwise comparisons. Introduced by Andrei Broder and colleagues in 1997, MinHash was originally developed to address the challenge of identifying near-duplicate documents in web-scale corpora, such as the AltaVista search engine's index of over 30 million pages, where it successfully clustered millions of similar items using shingling techniques combined with min-wise independent permutations.

The method relies on the key probabilistic property that, for two sets $A$ and $B$, the probability that the minimum hash value (under a random permutation) is the same for both sets equals their Jaccard similarity $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, providing an unbiased estimator when multiple independent hash functions are used to form signatures. This estimation is both dimension-independent and scalable, reducing the signature size to a few hundred bytes per set while maintaining accuracy for similarities above a tunable threshold. In practice, MinHash signatures are constructed by applying $k$ independent hash functions to the elements of a set and selecting the minimum value for each, yielding a compact signature that stands in for the set in similarity computations. For LSH applications, these signatures are partitioned into $b$ bands of $r$ rows each ($k = b \times r$), where pairs colliding in at least one band are flagged as candidate duplicates; the probability of detection follows the curve $1 - (1 - s^r)^b$, with $s$ as the similarity, allowing false positives and negatives to be balanced around specific thresholds like 0.5 or 0.8. The algorithm's efficiency stems from approximate min-wise independent or pseudorandom permutations, avoiding the need for true random permutations, and it has been extended to weighted variants for denser data representations.

Beyond document deduplication, MinHash has become foundational in diverse fields, including bioinformatics for genome sketching (e.g., the Mash tool for sequencing-data comparison), recommendation systems for user-item similarity, and entity resolution in databases, where it handles sets of arbitrary size with linear-time preprocessing and subquadratic querying. Its influence persists in modern frameworks, underscoring its role in scalable similarity search amid growing data volumes.

## Background Concepts

### Jaccard Similarity

The Jaccard similarity between two finite sets $A$ and $B$ is defined as the cardinality of their intersection divided by the cardinality of their union: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. This measure quantifies the degree of overlap between the sets, where a value of 1 indicates complete identity and values approaching 0 signify minimal shared elements. Introduced by botanist Paul Jaccard in 1901 to compare the similarity of plant species distributions across geographic regions, the metric provides a geometric interpretation as the proportion of common elements relative to the total distinct elements in the combined sets. In computational contexts, such as information retrieval and data mining, it serves as a foundational tool for assessing set-based resemblance without assuming vector weights or order.[](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf)

The Jaccard similarity exhibits key properties: it is symmetric, satisfying $J(A, B) = J(B, A)$; reflexive, with $J(A, A) = 1$; and bounded in the interval $[0, 1]$, where 0 denotes [disjoint sets](/page/Disjoint_sets).[](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf) For illustration, consider sets $A = \{1, 2, 3\}$ and $B = \{2, 3, 4\}$: the [intersection](/page/Intersection) $\{2, 3\}$ has size 2 and the [union](/page/Union) $\{1, 2, 3, 4\}$ has size 4, yielding $J(A, B) = 0.5$. In set-based applications, such as comparing [binary](/page/Binary) feature representations of documents or items, Jaccard similarity is favored over vector-oriented measures like [cosine similarity](/page/Cosine_similarity) because it explicitly incorporates the union in the denominator, offering a direct gauge of relative overlap for unweighted, presence-absence [data](/page/Data).[](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf)
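The definition translates directly into code; the following minimal Python sketch (function name ours) reproduces the example above.

```python
def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two finite sets."""
    if not a and not b:
        return 1.0  # convention: two empty sets are considered identical
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
```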
### Locality-Sensitive Hashing

Locality-sensitive hashing (LSH) is a probabilistic framework for approximate [nearest neighbor search](/page/Nearest_neighbor_search) in high-dimensional spaces, consisting of a family of hash functions designed such that similar items are hashed to the same value (i.e., collide) with higher probability than dissimilar items.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf) Formally, a family $\mathcal{H}$ of functions from a universe $\mathcal{U}$ to a set of buckets is $(r, cr, p_1, p_2)$-sensitive with respect to a [distance](/page/Distance) metric $d$ if, for any two points $x, y \in \mathcal{U}$,

$$\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1 \quad \text{if } d(x, y) \leq r,$$
$$\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2 \quad \text{if } d(x, y) > cr,$$

where $c > 1$ and $p_1 > p_2$.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf) This property enables efficient indexing by bucketing data points and querying only candidate neighbors in the same bucket, achieving sublinear query times while trading exactness for scalability.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf) The LSH framework was introduced by Indyk and Motwani in 1998 to mitigate the curse of dimensionality in nearest neighbor problems, where traditional methods like exhaustive search or tree-based indexing fail in high dimensions due to exponential growth in volume and sparsity.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf)

To balance [precision and recall](/page/Precision_and_recall), amplification techniques refine the basic LSH family: the AND-construction concatenates $k$ independent hash functions into a single composite hash, reducing false positives by requiring collisions in all components (increasing specificity); conversely, the OR-construction applies $l$ independent LSH families in parallel, treating a collision in any one as a match to boost recall by capturing more true positives.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf) These methods allow tuning the trade-off, with space and query time scaling as $O(n^{1 + \rho})$ and $O(n^\rho)$, respectively, where $\rho = \frac{\log(1/p_1)}{\log(1/p_2)} < 1$.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf)

In the context of Jaccard similarity, which measures set overlap as $|A \cap B| / |A \cup B|$, LSH is particularly valuable because sets are often represented as characteristic vectors in extremely high-dimensional spaces (e.g., one dimension per possible element in the universe), exacerbating the curse of dimensionality through sparsity and the infeasibility of comparing all pairs.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) [MinHash](/page/MinHash) provides an LSH instantiation tailored to this metric, enabling efficient estimation of similar sets without exhaustive computation.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)
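To make the AND/OR amplification concrete, the short Python sketch below evaluates the banded collision probability $1 - (1 - s^r)^b$ (derived for MinHash-LSH later in this article) at a few similarity values; the parameter choices are illustrative only.

```python
def collision_probability(s: float, r: int, b: int) -> float:
    """Probability that two items with similarity s collide in at least
    one of b bands, each band being an AND of r hash comparisons."""
    return 1.0 - (1.0 - s**r) ** b

# With k = 120 hashes split into b = 30 bands of r = 4 rows, the curve
# transitions sharply near the threshold t ≈ (1/b)**(1/r) ≈ 0.43.
for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s = {s:.1f} -> P(collide) = {collision_probability(s, r=4, b=30):.3f}")
```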
## Core Algorithm

### Multiple Hash Functions Variant

The multiple hash functions variant of MinHash estimates the Jaccard similarity between sets by generating compact signatures using several independent hash functions, providing an unbiased estimator whose variance is controlled by the number of functions employed.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf) Sets are represented as characteristic vectors over a universe $U$, where each set $S \subseteq U$ indicates presence or absence of elements. To compute signatures, $k$ independent universal hash functions $h_1, h_2, \dots, h_k : U \to [0, 1)$ (or a large integer range) are selected, approximating random permutations. For a set $S$, the signature $\sigma(S)$ is a $k$-dimensional vector whose $i$-th entry is $\sigma(S)_i = \min_{x \in S} h_i(x)$. This process simulates applying $k$ random permutations to the universe and recording the minimum value (or rank) for each permuted set.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

The Jaccard similarity $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$ is estimated as the fraction of matching entries in the signatures: $\hat{J}(A, B) = \frac{1}{k} \sum_{i=1}^k \mathbf{1}[\sigma(A)_i = \sigma(B)_i]$, where $\mathbf{1}$ is the indicator function. This estimator is unbiased, with expectation $\mathbb{E}[\hat{J}(A, B)] = J(A, B)$.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

The probability that signatures match at a given position equals the Jaccard similarity. For a fixed hash function $h_i$ approximating a random permutation $\pi_i$, the minima $\min_{x \in A} h_i(x)$ and $\min_{x \in B} h_i(x)$ match exactly when the element of $A \cup B$ with the smallest $h_i$-value (or permuted rank) lies in $A \cap B$. Since $h_i$ randomizes the order uniformly, each element in the union is equally likely to be the minimum, yielding $\Pr[\sigma(A)_i = \sigma(B)_i] = \frac{|A \cap B|}{|A \cup B|} = J(A, B)$. The matches across the $k$ independent functions are independent and identically distributed, enabling concentration bounds like Hoeffding's inequality for accuracy guarantees.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

The following Python rendering of the signature computation takes a collection of sets and $k$ hash functions:

```python
def compute_signatures(sets, hash_functions):
    """Return an m x k matrix of MinHash signatures:
    one row per set, one column per hash function."""
    signatures = []
    for s in sets:                              # for each set S_j
        row = []
        for h in hash_functions:                # for each hash function h_i
            row.append(min(h(x) for x in s))    # minimum hash value over S_j
        signatures.append(row)
    return signatures
```

This implementation assumes hash functions are pre-chosen and efficient to evaluate; in practice, universal hashing families like linear polynomials over finite fields are used.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

For illustration, consider universe $U = \{1, 2, 3, 4, 5\}$, sets $A = \{1, 3\}$ and $B = \{1, 4\}$ with true Jaccard similarity $J(A, B) = 1/3$, and $k = 2$ hash functions. Suppose $h_1(1) = 0.2$, $h_1(3) = 0.7$, $h_1(4) = 0.4$, so $\sigma(A)_1 = 0.2$ and $\sigma(B)_1 = 0.2$ (match); and $h_2(1) = 0.5$, $h_2(3) = 0.3$, $h_2(4) = 0.8$, so $\sigma(A)_2 = 0.3$ and $\sigma(B)_2 = 0.5$ (no match). The estimate is $\hat{J}(A, B) = 1/2 = 0.5$, compared with the true value $1/3 \approx 0.333$; different hash realizations yield varying estimates, which average to $1/3$.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

To achieve an $\varepsilon$-approximation (i.e., $|\hat{J} - J| < \varepsilon$ with probability at least $1 - \delta$), select $k \approx \frac{1}{\varepsilon^2} \ln \frac{1}{\delta}$; common practical values are $k = 100$ to 500 for $\varepsilon \approx 0.1$. Larger $k$ reduces variance but increases storage and computation proportionally.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)
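End to end, the variant amounts to two small functions. The sketch below (helper names ours, with salted SHA-1 slices standing in for a universal hash family) builds signatures for the example sets and applies the estimator $\hat{J}$.

```python
import hashlib

def h_i(i: int, x) -> int:
    """i-th salted hash of element x: a 64-bit slice of SHA-1 (illustrative)."""
    digest = hashlib.sha1(f"{i}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(s: set, k: int) -> list:
    """k-entry MinHash signature using the salted hash family above."""
    return [min(h_i(i, x) for x in s) for i in range(k)]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where two equal-length signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig_a = minhash_signature({1, 3}, k=128)
sig_b = minhash_signature({1, 4}, k=128)
print(estimate_jaccard(sig_a, sig_b))  # close to the true value 1/3
```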
### Single Hash Function Variant

The single hash function variant of MinHash generates signatures by applying a fixed hash function to the minimum elements identified under multiple random permutations of the universe, offering a theoretically grounded approach to estimating set similarities. Unlike the multiple hash functions variant, which relies on independent hash computations for each signature component, this method uses permutations to induce random orderings before hashing the resulting minima.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

Given a set $S \subseteq U$, where $U$ is the universe of elements, and $k$ independent random permutations $\pi_1, \pi_2, \dots, \pi_k$ of $U$, the signature $\sigma(S)$ is a vector of length $k$ defined by

$$\sigma(S)_i = h \left( \arg\min_{x \in S} \pi_i(x) \right)$$

for $i = 1, \dots, k$, where $h$ is a single fixed hash function (e.g., a 64-bit universal hash) applied to the element with the lowest rank under $\pi_i$. The Jaccard similarity between two sets is estimated as the fraction of positions where their signatures match.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf) Each permuted minimum behaves equivalently to a component from a random hash function in the multiple hash variant, as the permutation provides a uniform random ordering, ensuring the probability of matching signatures at any position equals the Jaccard similarity $J(S, T) = \frac{|S \cap T|}{|S \cup T|}$.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

This approach yields a storage benefit by representing each set with just $k$ fixed-size hash values, avoiding the need to retain full hash outputs for every element across multiple independent computations, which can be substantial for dense or large sets. Typical signatures store $k = 64$ to 128 64-bit hash values, i.e., roughly 512 to 1024 bytes per set, enabling efficient indexing of millions of sets.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf) In implementation, explicit permutations over large universes are impractical, so they are simulated via compositions of simpler hash functions; for instance, a base hash can map elements to pseudo-ranks, with $h$ then applied to the identified minimum, often using rolling hashes like Rabin fingerprints for shingle-based sets.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

For example, consider sets $A = \{a, b, c\}$ and $B = \{b, c, d\}$ over universe $U = \{a, b, c, d\}$, with Jaccard similarity 0.5. Under permutation $\pi_1: a \to 3, b \to 1, c \to 2, d \to 4$, the argmin for both $A$ and $B$ is $b$ (lowest rank 1), so $\sigma(A)_1 = \sigma(B)_1 = h(b)$. Under $\pi_2: a \to 2, b \to 4, c \to 1, d \to 3$, the argmin for both is $c$ (rank 1), so $\sigma(A)_2 = \sigma(B)_2 = h(c)$. Matching at both positions estimates similarity 1.0, while other permutations may mismatch, so estimates average toward 0.5 over many trials.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) This variant is suited to applications with very large universes (e.g., billions of possible shingles in web documents), where generating and storing signatures via multiple full independent hash passes over all elements would be prohibitively expensive in computation and memory.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)
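Under the simulation strategy described above (a salted base hash standing in for each permutation, and a single fixed hash $h$ applied to the argmin element), a minimal sketch looks as follows; all helper names are ours.

```python
import hashlib

def _int64(data: str) -> int:
    """64-bit integer from SHA-1; used for both pseudo-ranks and the fixed h."""
    return int.from_bytes(hashlib.sha1(data.encode()).digest()[:8], "big")

def pseudo_rank(i: int, x) -> int:
    """Simulated rank of element x under permutation pi_i (salted base hash)."""
    return _int64(f"pi{i}:{x}")

def fixed_h(x) -> int:
    """The single fixed hash h, applied to the selected minimum element."""
    return _int64(f"h:{x}")

def single_hash_signature(s: set, k: int) -> list:
    """Entry i is h(argmin_x pi_i(x)): we hash the element itself, not its
    rank, so one fixed h serves all k simulated permutations."""
    return [fixed_h(min(s, key=lambda x, i=i: pseudo_rank(i, x)))
            for i in range(k)]
```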
### Complexity Analysis

The time complexity of computing a MinHash signature for a single set of size $n$ (the number of elements in the set) using $k$ independent hash functions is $O(nk)$, as each of the $n$ elements must be hashed under all $k$ functions to determine the minimum value for each.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) This process scales linearly with the set size for sparse representations, making it efficient for large, high-dimensional datasets where elements are drawn from a potentially vast universe.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) Pairwise similarity estimation between two signatures then requires $O(k)$ time, simply by computing the fraction of matching entries in the $k$-dimensional signatures.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf)

The space complexity for storing a single MinHash signature is $O(k)$, consisting of the $k$ minimum hash values, which remains independent of the universe size when a universal hash family approximates the permutations without explicit enumeration of the full universe.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) For a collection of $N$ sets, the total storage is $O(Nk)$, enabling compact sketches that drastically reduce memory needs compared to raw set representations—for instance, compressing sets of up to 200,000 elements into signatures of roughly 1,000 bytes when $k \approx 100$.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf)

In the context of locality-sensitive hashing (LSH) for indexing, MinHash signatures are organized into $b$ bands of $r$ rows each, such that $k = b \cdot r$; this structure yields an average-case query time of $O(1)$ for retrieving candidate similar sets, as the banding concentrates similar items into the same buckets with high probability, limiting the number of comparisons to a constant expected value independent of $N$.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) Preprocessing to build the LSH index takes $O(Nk)$ time overall, but subsequent queries probe only a fixed number of buckets on average.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf)

Key trade-offs arise with the choice of $k$: larger $k$ enhances estimation accuracy by reducing variance in the Jaccard similarity approximation, but it proportionally increases both storage and computation costs. Chernoff bounds quantify this, showing that the probability of significant error decreases exponentially with $k$; e.g., the estimate deviates from the true similarity by more than $\delta$ with probability at most $2\exp(-k\delta^2/3)$.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) These asymptotic behaviors position MinHash as suitable for sublinear sketching in streaming models, where fixed $k$ allows processing and storage in $O(k)$ space per update while approximating similarities without full data retention.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf)
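The quoted Chernoff bound can be inverted to pick $k$ for a target accuracy; the helper below does that arithmetic under the stated failure probability $2\exp(-k\delta^2/3)$ (a sketch of the calculation, not a prescription from the sources).

```python
import math

def hashes_needed(delta: float, failure_prob: float) -> int:
    """Smallest k with 2 * exp(-k * delta^2 / 3) <= failure_prob."""
    return math.ceil(3.0 * math.log(2.0 / failure_prob) / delta**2)

print(hashes_needed(delta=0.1, failure_prob=0.05))  # 1107 hashes
```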
## Theoretical Foundations

### Min-Wise Independent Permutations

A family of permutations $\mathcal{F} \subseteq S_n$ on the universe $[n] = \{1, \dots, n\}$ is min-wise independent if, for every nonempty subset $X \subseteq [n]$ and every element $x \in X$,

$$\Pr_{\pi \sim \mathcal{F}} \left[ \pi(x) = \min \{ \pi(y) \mid y \in X \} \right] = \frac{1}{|X|}.$$

This condition ensures that, under a randomly selected permutation $\pi$ from $\mathcal{F}$, each element of $X$ has an equal probability of mapping to the smallest value among the images of $X$. The concept was introduced by Broder et al. in the context of estimating document resemblances for web duplicate detection, where random permutations model the relative ordering of document features to capture set overlaps accurately.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf) The formal definition and analysis were further developed in their subsequent work, establishing min-wise independence as a key probabilistic tool for similarity estimation.[](https://dl.acm.org/doi/10.1145/276698.276781)

Min-wise independent permutations are essential for the theoretical soundness of MinHash because they provide an unbiased mechanism for estimating the Jaccard similarity between sets, linking directly to universal hashing principles that minimize collision biases. In MinHash, hash functions are constructed to approximate these permutations, ensuring that the minimum hash value over a set behaves as if drawn from a truly random reordering. This independence property guarantees that the Jaccard estimator remains unbiased, even with finite samples, by preserving the uniform selection of minima across set elements. Without min-wise independence, the collision probabilities could deviate systematically, leading to skewed similarity estimates in applications like large-scale data deduplication.[](https://www.sciencedirect.com/science/article/pii/S0022000099916902)

The proof that min-wise independence yields exact Jaccard similarity estimation proceeds as follows. For two sets $A$ and $B$, let $U = A \cup B$ and $I = A \cap B$. Under a min-wise independent permutation $\pi$, the element achieving the minimum value $\min \{ \pi(y) \mid y \in U \}$ is uniformly distributed over the elements of $U$. The MinHash values collide, $h(A) = h(B)$, if and only if this minimum occurs in $I$, which happens with probability $|I| / |U|$. Thus, the collision probability equals the Jaccard similarity $J(A, B) = |I| / |U|$. This equivalence holds precisely because the min-wise property applies to the union set.[](https://www.sciencedirect.com/science/article/pii/S0022000099916902)

While exact min-wise independence requires the property to hold for all set sizes, stronger $k$-wise independence conditions—where permutations behave independently on any $k$ elements—can ensure exactness for small sets and approximate behavior for larger ones. In practice, 2-wise (pairwise) independence often suffices for [MinHash](/page/MinHash)'s probabilistic guarantees, as it controls variance in collision estimates without needing full higher-order independence, though approximate $k$-wise constructions are used to achieve $\varepsilon$-close min-wise behavior efficiently.[](https://www.sciencedirect.com/science/article/pii/S0022000099916902) [MinHash](/page/MinHash), grounded in this framework, functions as a locality-sensitive hashing scheme for Jaccard similarity, where the collision probability exactly matches the similarity measure, enabling provable performance in nearest-neighbor searches.[](https://www.sciencedirect.com/science/article/pii/S0022000099916902)
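The defining property is easy to check empirically for the family of all permutations: sampling uniform random permutations, each element of a fixed $X$ should achieve the minimum about $1/|X|$ of the time. A quick Monte Carlo check (illustrative, not a proof):

```python
import random
from collections import Counter

n, X, trials = 10, [2, 5, 7, 9], 100_000
wins = Counter()
for _ in range(trials):
    pi = list(range(n))
    random.shuffle(pi)                       # uniform random permutation of [n]
    wins[min(X, key=lambda x: pi[x])] += 1   # element of X with smallest image

for x in X:
    print(x, wins[x] / trials)  # each frequency close to 1/|X| = 0.25
```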
### Practical Constructions

In practical implementations of MinHash, approximate min-wise independent hash functions are employed to simulate random permutations efficiently, as exact constructions are computationally prohibitive. The most common approach uses 2-universal (2U) linear hash functions of the form $h_j(t) = \big( (a_{1,j} + a_{2,j}\, t) \bmod p \big) \bmod D$, where $p$ is a large prime greater than the universe size, $D$ is typically a power of 2 for bit-range control (e.g., $2^b$ for $b$-bit outputs), and the coefficients $a_{1,j}$, $a_{2,j}$ are chosen uniformly at random from $[0, p-1]$.[](https://www.combinatorics.org/ojs/index.php/eljc/article/view/v7i1r26)[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf) This family provides pairwise independence, ensuring low collision probabilities while remaining computationally lightweight, with each hash evaluation requiring only modular arithmetic.

For improved approximation quality, especially with sparse data or small sketch sizes, 4-universal (4U) polynomial hash functions extend the construction to higher-degree polynomials: $h_j(t) = \big( \sum_{i=1}^4 a_{i,j}\, t^{i-1} \bmod p \big) \bmod D$, using four random coefficients.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf) These offer greater randomness than 2U functions, reducing variance in [Jaccard similarity](/page/Jaccard_similarity) estimates, at a modest increase in computation per hash (still $O(1)$ time). Empirical evaluations demonstrate that 4U performs comparably to or slightly better than 2U when using 200 or more hashes ($k \geq 200$) and $b \geq 4$ bits per hash value, achieving mean squared error (MSE) close to theoretical minima for sparse sets.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf)

To support 64-bit architectures without overflow issues, concatenated or combined hashes are often used, such as $(h_1(x) \cdot M + h_2(x)) \bmod 2^{64}$ with $M = 2^{32}$, which interleaves two 32-bit hashes into a single 64-bit value for efficient processing on modern hardware.[](http://papers.neurips.cc/paper/7239-practical-hash-functions-for-similarity-estimation-and-dimensionality-reduction.pdf) This method maintains approximate independence while leveraging unsigned 64-bit integer operations for speed.

Error bounds for these approximations show that the estimated [Jaccard similarity](/page/Jaccard_similarity) converges to the true value with high probability when $k$ is large (e.g., 200–500), with relative errors under 1% for typical web-scale datasets.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf) Such constructions are integrated into libraries like Apache Spark's MinHashLSH module, which employs the 2U linear family with random coefficients modulo a prime for scalable similarity search on distributed datasets.[](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html) Despite not achieving exact min-wise independence, these pseudorandom approximations prove empirically effective, often matching full-permutation accuracy in practice while reducing preprocessing time by orders of magnitude (e.g., via GPU acceleration for batch computations).[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf) Limitations include potential accuracy degradation for very dense sets or small $D$, necessitating larger $k$ (up to 500) in machine learning applications compared to duplicate detection.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf)
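A sketch of the 2U construction described above, with parameter choices that are ours rather than prescribed by the sources (a Mersenne prime $p = 2^{61} - 1$ and $D = 2^{32}$):

```python
import random

P = (1 << 61) - 1   # Mersenne prime larger than the universe size (assumed)
D = 1 << 32         # power-of-two output range

def make_2u_hash(rng: random.Random):
    """One member of the 2-universal linear family t -> ((a1 + a2*t) % P) % D."""
    a1 = rng.randrange(P)
    a2 = rng.randrange(1, P)   # nonzero slope
    return lambda t: ((a1 + a2 * t) % P) % D

rng = random.Random(42)
hash_family = [make_2u_hash(rng) for _ in range(128)]

def signature(s: set) -> list:
    """MinHash signature of a set of integer-encoded elements."""
    return [min(h(x) for x in s) for h in hash_family]
```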
## Extensions and Variants

### Weighted MinHash

Weighted MinHash extends the standard [MinHash](/page/MinHash) technique to handle sets whose elements carry nonnegative weights, allowing for more nuanced similarity estimation where uniform treatment of elements is insufficient. In the unweighted case, [MinHash](/page/MinHash) assumes all elements contribute equally, but real-world data often involves varying importance, such as frequencies or priorities. The [weighted Jaccard similarity](/page/Weighted_Jaccard_similarity), which generalizes the traditional [Jaccard index](/page/Jaccard_index), is defined as

$$J_w(A, B) = \frac{\sum_x \min(w_A(x), w_B(x))}{\sum_x \max(w_A(x), w_B(x))},$$

where $w_A(x)$ and $w_B(x)$ are the weights of element $x$ in sets $A$ and $B$, respectively, and the sums run over all relevant elements. This metric captures the overlap weighted by the minimum contributions relative to the total maximum contributions, providing a better measure for applications like document comparison where term frequencies matter.[](http://www.cohenwang.org/edith/Surveys/minhash.pdf)

The core adaptation modifies the hashing process to bias sampling toward higher-weight elements, ensuring inclusion probabilities proportional to weights. One approach uses thresholding, where a random threshold is applied to normalized weights to select elements probabilistically. A key probabilistic sampling method, consistent weighted sampling (CWS), generates samples $(k, y_k)$ for each element such that the selection probability is proportional to the weight $w(x)$, often by drawing ranks from a distribution that depends on the weight.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2010/06/ConsistentWeightedSampling2.pdf) In practice, the weighted minimum is computed as $\arg\min_x \frac{h(x)}{w(x)}$, where $h(x)$ is a random hash function; this effective rank favors elements with larger $w(x)$, since higher weights reduce the adjusted hash value and increase the chance of selection. Multiple such hashes are typically computed to form a sketch, with the process repeated for stability.

One early variant, proposed by Haveliwala et al. in 2000, quantizes weights by multiplying them by a large constant and applying standard MinHash to the resulting binary expansions.[](https://theory.stanford.edu/~taherh/papers/simsearch00.pdf) Later constructions, such as those using CWS, ensure that the probability of a sketch match between two weighted sets approximates their weighted Jaccard similarity, specifically $\Pr[\text{match}] \approx \sum_x \min(w_A(x), w_B(x)) / \sum_x \max(w_A(x), w_B(x))$. The mechanism adjusts sampling densities to handle weight disparities, making it suitable for scalable estimation without full data access.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2010/06/ConsistentWeightedSampling2.pdf)

A representative example is document similarity, where elements are terms and weights are term frequencies (e.g., via [tf-idf](/page/tf-idf) scoring); here, heavily weighted shared terms drive the similarity score, unlike unweighted [bag-of-words](/page/bag-of-words) models that ignore multiplicity. Such weighted sketches enable applications in text analysis beyond simple presence, including near-duplicate detection in large corpora and clustering documents with varying term importance, as seen in web search indexing.[](http://www.cohenwang.org/edith/Surveys/minhash.pdf)
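The rank-transform idea from the paragraphs above fits in a few lines. The sketch below is a simplified illustration, not a full consistent-weighted-sampling implementation, and its match probability only roughly tracks the weighted Jaccard similarity; all names and example weights are ours.

```python
import hashlib

def uniform_hash(i: int, x) -> float:
    """i-th salted hash of x, mapped into (0, 1]."""
    v = int.from_bytes(hashlib.sha1(f"{i}:{x}".encode()).digest()[:8], "big")
    return (v + 1) / 2.0**64

def weighted_signature(weights: dict, k: int) -> list:
    """Entry i is the element minimizing h_i(x) / w(x): larger weights
    shrink the adjusted rank and so are selected more often."""
    return [min(weights, key=lambda x: uniform_hash(i, x) / weights[x])
            for i in range(k)]

doc_a = {"data": 3.0, "mining": 1.0, "sets": 2.0}
doc_b = {"data": 2.0, "mining": 1.0, "graphs": 2.0}
sa, sb = weighted_signature(doc_a, 256), weighted_signature(doc_b, 256)
print(sum(x == y for x, y in zip(sa, sb)) / 256)  # rough weighted-overlap estimate
```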
### Advanced Adaptations

One prominent adaptation of MinHash is b-bit compression, where each hash value (typically 64 bits) is truncated to only its lowest $b$ bits, such as $b = 5$, so that signature entries map into $2^b$ possible values.[](https://arxiv.org/abs/0910.3349) This reduction enables significant space savings by storing fewer bits per hash while still approximating [Jaccard similarity](/page/Jaccard_similarity) through the proportion of matching b-bit values across multiple hash functions.[](https://arxiv.org/abs/0910.3349) The approach trades some estimation accuracy for efficiency, as the truncated hashes introduce additional variance in the similarity estimate.[](https://arxiv.org/abs/0910.3349) The collision probability in b-bit MinHash depends on the chosen $b$; for $b = 1$, the effective Jaccard similarity threshold for reliable estimation is around 0.5, while larger $b$ values (e.g., $b = 8$) shift this threshold higher, improving precision at the cost of more storage.[](https://arxiv.org/abs/0910.3349) An unbiased estimator corrects for chance collisions of truncated values, ensuring the expected value matches the true Jaccard similarity regardless of $b$.[](https://arxiv.org/abs/0910.3349)

For handling ordered data like sequences, Ordinal MinHash (also known as Order MinHash or OMH) incorporates position-aware hashing by storing selected minhash values in ordered sublists, preserving relative order to better approximate edit distance rather than unordered set similarity.[](https://www.biorxiv.org/content/10.1101/534446v1) This variant refines standard MinHash by assigning hashes that account for positional relationships, making it suitable for sequence alignment applications where order matters.[](https://www.biorxiv.org/content/10.1101/534446v1) In contrast to MinHash, which targets Jaccard similarity for sets, SimHash applies a similar locality-sensitive hashing principle to cosine similarity on sparse binary vectors, such as document representations, by projecting onto random hyperplanes and recording sign bits rather than taking minima.[](https://arxiv.org/abs/1407.4416)

A post-2010 adaptation extends MinHash to graph similarity by applying it to neighborhood sets around vertices, where sketches capture the k-nearest neighbors under random rankings to estimate structural similarities like distances or centralities efficiently.[](https://www.semanticscholar.org/paper/58225ffe658820dcfbea7423828311179edf01b4) Post-2020 developments include [ProbMinHash](/page/ProbMinHash) (2020), which extends [MinHash](/page/MinHash) to estimate the probability Jaccard similarity for continuous probability measures and distributions, and [C-MinHash](/page/C-MinHash) (2022), a variant using circulant permutations to reduce the number of hash functions to one while maintaining accuracy for Jaccard estimation in large-scale applications.[](https://arxiv.org/abs/2008.08916)[](https://proceedings.mlr.press/v162/li22m.html) These adaptations generally achieve compression ratios of 10–20x in space (e.g., $b = 1$ versus 64 bits) but introduce error rates that increase with lower $b$ or sparser neighborhoods, with relative errors often under 5% for Jaccard similarities above 0.5 in controlled settings.[](https://arxiv.org/abs/0910.3349)
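In code, b-bit truncation is just a mask over an existing signature. The correction below is a simplification of the full estimator in the b-bit minwise hashing paper: it assumes uniform hashes over a large universe, where truncated values collide by chance with probability about $2^{-b}$.

```python
def truncate(sig: list, b: int) -> list:
    """Keep only the lowest b bits of each hash value in a signature."""
    mask = (1 << b) - 1
    return [v & mask for v in sig]

def estimate_from_bbit(sig_a: list, sig_b: list, b: int) -> float:
    """Bias-corrected Jaccard estimate from two b-bit signatures."""
    p_hat = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
    c = 2.0 ** -b   # chance-collision rate (large-universe approximation)
    return max(0.0, (p_hat - c) / (1.0 - c))
```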
## Applications and Uses

### Similarity Search in Sets

MinHash serves as the foundation for locality-sensitive hashing (LSH) in efficient similarity search over large collections of sets, enabling approximate nearest neighbor queries under [Jaccard similarity](/page/Jaccard_similarity) without exhaustive pairwise comparisons. In this paradigm, MinHash signatures—compact representations of sets derived from multiple hash functions—are generated for each set in the database. These signatures are then partitioned into $b$ bands of $r$ consecutive rows each, where the total signature length $k$ satisfies $k = b \times r$. For a query set, its signature is banded in the same way, and candidate pairs are identified as those sharing an identical band hash value in at least one band; this hashing of bands into buckets ensures that sets with Jaccard similarity $s$ collide in some bucket with probability approximately $1 - (1 - s^r)^b$. This approach originated in the context of detecting near-duplicate web pages, where documents were represented as sets of [shingles](/page/shingles) and clustered using MinHash sketches to process millions of pages scalably.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf)

To tune the LSH parameters for a desired Jaccard similarity threshold $t$, the number of rows per band $r$ is typically set to approximately $(1/t) \log(1/t)$ so that the band-collision probability transitions sharply around $t$, while the number of bands is chosen as $b = k / r$ to balance false positives and negatives, with $k$ fixed by available storage (e.g., 100–256 hashes for practical accuracy). This configuration amplifies the collision probability for similar sets (Jaccard $> t$) to near 1 while keeping it low for dissimilar ones, allowing subquadratic query times even on billion-scale datasets; for instance, an LSH ensemble using MinHash indexed 262 million [web](/page/Web) domains for [containment](/page/Containment) similarity search, achieving efficient retrieval through multiple hash tables.

For sparse sets—common in applications like document term vectors or [genome](/page/Genome) k-mers—MinHash integrates seamlessly with [inverted index](/page/Inverted_index) structures, where band hashes serve as keys mapping to lists of set identifiers, enabling fast posting-list intersections during candidate generation without materializing full signatures.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)[](http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf) Compared to exact Jaccard similarity search, which requires $O(n^2)$ comparisons for $n$ sets, MinHash-LSH delivers substantial efficiency gains while maintaining high recall; optimizations in large-scale deployments have achieved over 100× [speedup](/page/Speedup) in end-to-end pipelines at 90% recall on seismic datasets represented as [binary](/page/Binary) sets, demonstrating its viability for [real-time](/page/Real-time) queries over massive corpora.[](https://www.bailis.org/papers/quakes-vldb2018.pdf)
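A minimal banding index over precomputed signatures might look as follows; bucket keys are tuples of one band's values, and the helper names are ours.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures: dict, b: int, r: int) -> set:
    """signatures: {item_id: signature of length k = b * r}.
    Returns the pairs of items sharing at least one identical band."""
    pairs = set()
    for band in range(b):
        buckets = defaultdict(list)
        for item, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # this band's r rows
            buckets[key].append(item)
        for items in buckets.values():                 # all items in a bucket
            pairs.update(combinations(sorted(items), 2))
    return pairs
```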
### Duplicate Detection and Clustering

MinHash facilitates duplicate detection by generating compact signatures for data items represented as sets, enabling efficient estimation of Jaccard similarities between pairs to identify near-duplicates whose similarity exceeds a predefined threshold, such as 0.8 for textual content.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) The process begins with preprocessing the data into set representations, followed by signature computation using multiple hash functions to approximate the min-wise independence property, which ensures that the probability of matching hash minima correlates with set overlap.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf)

For clustering, MinHash signatures support all-pairs similarity estimation, where pairs surpassing the threshold are linked to form clusters using union-find (disjoint-set union) structures, efficiently merging components in time nearly linear in the number of candidate pairs.[](https://arxiv.org/pdf/1406.1143) This approach scales to large repositories by first generating candidate pairs via [locality-sensitive hashing](/page/Locality-sensitive_hashing) on signatures, then applying union-find to group duplicates into equivalence classes, as demonstrated in processing millions of [Wikipedia](/page/Wikipedia) sentences to yield over 1 million clusters.[](https://arxiv.org/pdf/1406.1143)

In textual duplicate detection, [shingling](/page/Shingle) preprocesses documents into sets of k-grams—contiguous sequences of k tokens or characters—to capture near-duplicates arising from minor edits, insertions, or deletions, transforming the problem into set similarity estimation amenable to MinHash.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) For instance, with $k = 10$ for word-level [shingling](/page/Shingle), a document's shingle set allows MinHash to estimate resemblance as the [Jaccard index](/page/Jaccard_index) of these sets, enabling detection of plagiarized or mirrored web pages.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf)
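Shingling itself is a sliding window; word-level k-shingles of the kind described above can be produced as follows (a simple sketch, without the fingerprinting step applied in the original system).

```python
def shingles(text: str, k: int = 10) -> set:
    """Set of word-level k-shingles (k consecutive tokens) of a document."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}
```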
A seminal application occurred in 1997 with AltaVista's [web crawler](/page/Web_crawler), which processed 30 million documents using MinHash on 10-[shingle](/page/Shingle)s, generating compact document signatures by selecting minima from 40-bit shingle fingerprints; the system identified 5.3 million documents as exact duplicates in 2.1 million clusters and formed 3.6 million clusters from 12.3 million documents in a 150 GB corpus.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) This reduced storage redundancy and improved search quality by eliminating syntactic duplicates during indexing.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf)

For streaming environments with dynamic sets, such as evolving data streams, MinHash signatures support online updates by incrementally adjusting minima upon set insertions or deletions, maintaining approximation quality without full recomputation, as in fully-dynamic models handling subset additions in sublinear time.[](https://arxiv.org/pdf/2407.21614) Hierarchical clustering leverages MinHash-estimated distances—derived from signature mismatches as proxies for Jaccard dissimilarity—to construct dendrograms, where agglomerative methods successively merge clusters under average or complete linkage, revealing nested group structures in large collections like malware binaries.[](https://dl.acm.org/doi/pdf/10.1145/3704268.3748684)

To mitigate false positives from probabilistic estimation, a secondary verification step compares the full content of candidate pairs flagged by MinHash, confirming duplicates via exact Jaccard computation or [edit distance](/page/Edit_distance); this filters noise while preserving high recall in the initial screening.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf)
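Putting the pieces together, candidate pairs from LSH banding can be verified exactly and merged with a union-find structure, as described above; a compact sketch under the same illustrative names used earlier in this article:

```python
def find(parent: dict, x):
    """Root of x with path halving; unseen items become their own root."""
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_duplicates(sets: dict, candidates: set, threshold: float = 0.8) -> dict:
    """sets: {item_id: set of shingles}; candidates: pairs from LSH banding.
    Verifies each candidate with the exact Jaccard and unions true matches."""
    parent = {}
    for a, b in candidates:
        inter = len(sets[a] & sets[b])
        union = len(sets[a] | sets[b])
        if union and inter / union >= threshold:   # exact verification step
            parent[find(parent, a)] = find(parent, b)
    clusters = {}
    for item in sets:
        clusters.setdefault(find(parent, item), []).append(item)
    return clusters
```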
### Other Domains

In recommendation systems, MinHash has been employed to enhance [collaborative filtering](/page/Collaborative_filtering) by estimating similarities between user-item interaction sets, enabling efficient online collaborative filtering for millions of daily users. In large-scale news recommendation, for instance, MinHash clustering groups users and articles based on overlapping interests represented as sets.

In [genomics](/page/Genomics), MinHash serves as a sketching technique for comparing microbial strains and metagenomes through [k-mer](/page/K-mer) representations, building on post-2010 advances in high-throughput sequencing. Tools like [Mash](/page/Mash) apply MinHash to estimate pairwise [mutation](/page/Mutation) distances between genomes, supporting rapid [containment](/page/Containment) searches in large databases without full [alignment](/page/Alignment), which is crucial for metagenomic [classification](/page/Classification) and outbreak detection.

For [network](/page/Network) analysis, neighborhood-based MinHash variants compute graph similarities by hashing local [node](/page/Node) neighborhoods as sets, with work from 2015 onward enabling efficient estimation of [distance](/page/Distance) metrics in massive [graph](/page/Graph)s. This approach approximates neighborhood sizes and structural similarities, aiding community detection and [link prediction](/page/Link_prediction) without exhaustive pairwise computations.[](https://ieeexplore.ieee.org/document/8257969)

In [big data](/page/Big_data) environments, MinHash integrates with frameworks like [Apache Spark](/page/Apache_Spark) for distributed [locality-sensitive hashing](/page/Locality-sensitive_hashing) (LSH), allowing scalable similarity joins over sparse datasets. Implementations in Spark MLlib treat non-zero vector elements as binary sets for Jaccard distance estimation, as demonstrated in healthcare [sentiment analysis](/page/Sentiment_analysis) pipelines processing vast [COVID-19](/page/COVID-19) data volumes.[](https://ieeexplore.ieee.org/document/9799934)

Emerging applications in the [2020s](/page/2020s) leverage MinHash for privacy-preserving similarity computations in [federated learning](/page/Federated_learning), where min-wise independent permutations cluster clients based on local data sketches without sharing raw information. This supports clustered federated setups, maintaining [differential privacy](/page/Differential_privacy) while improving model convergence on heterogeneous datasets.[](https://www.diva-portal.org/smash/get/diva2:1712025/FULLTEXT01.pdf) Recent extensions as of 2024 include fully-dynamic MinHash sketches for [streaming data](/page/Streaming_data) with recovery mechanisms, and MinCloud for trusted [malware](/page/Malware) detection in [Linux](/page/Linux) environments using transferable sketches.[](https://arxiv.org/pdf/2407.21614)[](https://www.sciencedirect.com/science/article/abs/pii/S2214212624002096)

A key limitation of MinHash is its design for sparse, binary set representations under Jaccard similarity; for dense real-valued vectors requiring [cosine similarity](/page/Cosine_similarity), SimHash provides a more suitable alternative.[](http://proceedings.mlr.press/v33/shrivastava14.pdf)

## Performance Evaluation

### Benchmarking Methods

Evaluating MinHash implementations typically involves assessing both the accuracy of similarity estimation and the efficiency of processing large-scale data. Key accuracy metrics include [precision and recall](/page/Precision_and_recall) in candidate pair generation for [locality-sensitive hashing](/page/Locality-sensitive_hashing) (LSH) applications, which measure, respectively, the proportion of retrieved pairs that are true positives and the fraction of truly similar pairs identified. For direct Jaccard similarity estimation, the [mean absolute error](/page/Mean_absolute_error) (MAE) or variance of the estimate relative to the true [Jaccard index](/page/Jaccard_index) is commonly used, with error bounds derived from the number of hash functions applied. Throughput is quantified as the number of pairs processed per second or as average query time, often benchmarked on commodity hardware to evaluate [scalability](/page/Scalability).[](http://ekzhu.com/datasketch/lsh.html)[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)

Benchmarking datasets span synthetic and real-world collections to test robustness across distributions. Synthetic datasets often consist of uniform random sets or sets following power-law distributions, e.g., 1,000 samples over 100,000 features, allowing controlled variation in set cardinality and overlap to isolate estimation bias and variance. Real datasets include shingled web documents from large crawls, such as over 30 million AltaVista pages totaling 150 GB, used to evaluate clustering at 50% resemblance thresholds; Wikipedia articles for near-duplicate detection; the 20 Newsgroups corpus (on average 193 3-shingles per document); and Reddit post sets for pairwise similarity tasks.
These enable assessment of performance on sparse, high-dimensional data typical of text processing.[](https://arxiv.org/pdf/1811.04633)[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)[](http://ekzhu.com/datasketch/lsh.html)

Testing protocols standardize parameter tuning and [performance measurement](/page/Performance_measurement) to ensure [reproducibility](/page/Reproducibility). Implementations vary the number of hash permutations $k$ (num_perm in some libraries), the number of bands $b$, and the rows per band $r$ (with $k = b \times r$), plotting [receiver operating characteristic](/page/Receiver_operating_characteristic) (ROC) curves of [precision](/page/Precision) against recall to balance false positives and negatives. Experiments are repeated (e.g., 10 times) to report means and standard deviations, often using open-source tools like the [Python](/page/Python) datasketch library for [MinHash](/page/MinHash) and LSH indexing, which supports [Redis](/page/Redis) as a scalable storage layer and includes [benchmark](/page/Benchmark) scripts for query [latency](/page/Latency). Common pitfalls include sensitivity to [universe](/page/Universe) size, where insufficiently random [hash](/page/Hash) functions lead to biased permutations, and practical [hash](/page/Hash) collisions that inflate [error](/page/Error) in finite implementations, necessitating 40-bit or wider hash values for >99.9% confidence bounds.[](http://ekzhu.com/datasketch/lsh.html)[](https://arxiv.org/pdf/1811.04633)[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)

In the [2020s](/page/2020s), benchmarks have increasingly focused on [hardware acceleration](/page/Hardware_acceleration), particularly GPU implementations of MinHash LSH for [dataset](/page/Data_set) deduplication. For instance, the FED framework achieves up to 107.2x speedup over CPU baselines on [datasets](/page/Data_set) like RedPajama-1T (1.2T tokens), processing 30 billion tokens in 111.7 seconds across 16 GPUs (for a subset) and completing the full [dataset](/page/Data_set) in 6 hours, using a Jaccard similarity [threshold](/page/Threshold) of 0.8; the largest gains come in MinHash generation (260x faster) and candidate filtering. Such evaluations emphasize end-to-end throughput on real corpora like C4 and RealNews, underscoring MinHash's viability for trillion-token-scale tasks.[](https://arxiv.org/pdf/2501.01046)
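For reference, a typical small-scale run with the datasketch library mentioned above (API as documented for datasketch; the threshold and num_perm values here are illustrative) looks like:

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(tokens, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))   # feed each token into the sketch
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)   # banding chosen internally
lsh.insert("doc1", minhash_of(["minhash", "lsh", "jaccard"]))
print(lsh.query(minhash_of(["minhash", "lsh", "jaccard", "sets"])))
```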
### Comparative Analysis

MinHash offers a probabilistic [approximation](/page/Approximation) to exact Jaccard similarity computation, which involves directly calculating set intersections and unions for all pairs—an $O(N^2 \cdot s)$ [time complexity](/page/Time_complexity) in the worst case for $N$ sets of average size $s$ that is infeasible for large-scale datasets. In contrast, MinHash reduces query times to sublinear complexity through fixed-size sketches, enabling efficient similarity estimation at the cost of [approximation error](/page/Approximation_error).[](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x)[](https://www.researchgate.net/publication/349391865_A_Survey_on_Locality_Sensitive_Hashing_Algorithms_and_their_Applications)

Compared to other [locality-sensitive hashing](/page/Locality-sensitive_hashing) techniques, MinHash is specifically tailored to Jaccard similarity on sets, unlike [HyperLogLog](/page/HyperLogLog), which addresses only [cardinality](/page/Cardinality) estimation and offers no direct pairwise similarity support. Similarly, SimHash targets [Hamming distance](/page/Hamming_distance) for document similarity under cosine metrics, making it less suitable for pure set-overlap tasks where MinHash excels, as demonstrated in near-duplicate detection benchmarks. MinHash also provides better Jaccard estimates than Euclidean-based LSH variants, which are optimized for vector distances rather than set [containment](/page/Containment).[](https://www.researchgate.net/publication/349391865_A_Survey_on_Locality_Sensitive_Hashing_Algorithms_and_their_Applications)[](https://dl.acm.org/doi/10.1145/2063576.2063749)

Empirical evaluations highlight MinHash's effectiveness: in a 2006 study on 1.6 billion [web](/page/Web) pages, MinHash/LSH achieved 86% [precision](/page/Precision) for cross-site near-duplicates versus 90% for SimHash, while a combined approach reached 79% precision at 79% relative recall, with low false-positive rates achieved through banding. More recent genomic benchmarks, such as those using [Mash](/page/Mash), report high correlation (RMSE 0.00274) with exact average nucleotide identity on 54,000 genomes, with false positives controlled via [p-value](/page/P-value) thresholds below $10^{-10}$.[](https://www3.cs.stonybrook.edu/~cse692/papers/henzinger_sigir06.pdf)[](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x)

Key trade-offs include MinHash's space requirements, which scale with the number of hash functions needed for accuracy (e.g., 100–1000 for 1–5% error), potentially exceeding alternatives like the [Count-Min sketch](/page/Count-Min_sketch) in streaming environments where frequency estimation suffices over direct similarity. [Count-Min](/page/Count-Min) offers lower space for approximate counts in dynamic data but lacks MinHash's unbiased Jaccard estimator.[](https://www.researchgate.net/publication/349391865_A_Survey_on_Locality_Sensitive_Hashing_Algorithms_and_their_Applications)[](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1809-x)

In modern contexts, particularly for sparse sets like k-mers in [genomics](/page/Genomics), MinHash outperforms [deep learning](/page/Deep_learning) embeddings in efficiency, enabling roughly 10x faster processing than alignment-based exact methods or neural approaches on gigabase-scale [data](/page/Data), though embeddings may capture denser semantics at higher computational cost. Post-2015 parallel implementations, including GPU-accelerated MinHash, achieve up to 25x speedups over parallel CPU baselines for sketching large collections.[](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x)[](https://www.slideshare.net/slideshow/gpu-acceleration-of-set-similarity-joins/63164204)

## References

1. [PDF] On the resemblance and containment of documents — cs.princeton.edu.
2. [PDF] Minhashing — Stanford InfoLab.
3. [PDF] A Review for Weighted MinHash Algorithms — arXiv, Nov 12, 2018.
4. Mash Screen: what's in my sequencing run? — Sep 25, 2017.
5. [PDF] Introduction to Information Retrieval — Manning, Raghavan, Schütze; Cambridge University Press.
6. [PDF] Indyk & Motwani, locality-sensitive hashing paper (STOC 1998) — graphics.stanford.edu.
7. [PDF] Min-Wise Independent Permutations — Computer Science.
8. [PDF] Mining of Massive Datasets — Stanford InfoLab.
9. Min-wise independent permutations (extended abstract) — ACM.
10. Min-Wise Independent Permutations — ScienceDirect.
11. Min-Wise Independent Linear Permutations — Bohman, Cooper, et al.; Electronic Journal of Combinatorics, Vol. 7 (2000), R26.
12. [PDF] b-Bit Minwise Hashing in Practice — Microsoft.
13. [PDF] Practical Hash Functions for Similarity Estimation and Dimensionality Reduction — NIPS.
14. org.apache.spark.ml.feature.MinHashLSHModel — Apache Spark API documentation.
15. [PDF] MinHash Sketches — Edith Cohen, Jun 14, 2016.
16. [PDF] Consistent Weighted Sampling — Microsoft.
17. b-Bit Minwise Hashing — arXiv:0910.3349, Oct 18, 2009.
18. Locality sensitive hashing for the edit distance (Order MinHash, OMH).
19. In Defense of MinHash Over SimHash — arXiv:1407.4416, Jul 16, 2014.
20. [PDF] Syntactic clustering of the Web.
21. [PDF] LSH Ensemble: Internet-Scale Domain Search — VLDB Endowment.
23. [PDF] Identifying Duplicate and Contradictory Information in Wikipedia — arXiv.
24. [PDF] Maintaining k-MinHash Signatures over Fully-Dynamic Data Streams — arXiv, Mar 9, 2025.
25. Hierarchical Clustering of the SOREL Malware Corpus.
28. [PDF] Cluster selection for Clustered Federated Learning using Min-wise Independent Permutations.
29. [PDF] In Defense of MinHash Over SimHash.
30. MinHash LSH — datasketch 1.6.5 documentation.
31. [PDF] GPU-accelerated MinHash LSH deduplication — arXiv:2501.01046v3 [cs.CL], Mar 12, 2025.
32. Mash: fast genome and metagenome distance estimation using MinHash — Jun 20, 2016.
33. A Survey on Locality Sensitive Hashing Algorithms and their Applications — Apr 27, 2021.
35. Finding near-duplicate web pages: a large-scale evaluation of algorithms.
36. When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data — Sep 13, 2019.