
# MinHash

MinHash is a probabilistic algorithm for efficiently estimating the Jaccard similarity between large sets, such as those representing documents or data streams, by generating compact signatures through the application of random permutations or hash functions and recording minimum values. It serves as a core component of locality-sensitive hashing (LSH), enabling the rapid detection of similar items in massive datasets without exhaustive pairwise comparisons. Introduced by Andrei Broder and colleagues in 1997, MinHash was originally developed to address the challenge of identifying near-duplicate documents in web-scale corpora, such as the AltaVista search engine's index of over 30 million pages, where it successfully clustered millions of similar items using shingling techniques combined with min-wise independent permutations.

The method relies on the key probabilistic property that, for two sets $A$ and $B$, the probability that the minimum hash value (under a random permutation) is the same for both sets equals their Jaccard similarity $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, providing an unbiased estimator when multiple independent hash functions are used to form signatures. This estimation is both dimension-independent and scalable, reducing the signature size to a few hundred bytes per set while maintaining accuracy for similarities above a tunable threshold. In practice, MinHash signatures are constructed by applying $k$ independent hash functions to the elements of a set and selecting the minimum value for each, yielding a compact signature that stands in for the set in similarity computations. For LSH applications, these signatures are partitioned into $b$ bands of $r$ rows each ($k = b \times r$), where pairs colliding in at least one band are flagged as candidate duplicates; the probability of detection follows the curve $1 - (1 - s^r)^b$, with $s$ as the similarity, allowing false positives and negatives to be balanced around specific thresholds like 0.5 or 0.8. The algorithm's efficiency stems from approximate min-wise independent or pseudorandom permutations, avoiding the need for true random permutations, and it has been extended to weighted variants for denser data representations.

Beyond document deduplication, MinHash has become foundational in diverse fields, including bioinformatics for genome sketching (e.g., the Mash tool for sequencing-data comparison), recommendation systems for user-item similarity, and entity resolution in databases, where it handles sets of arbitrary size with linear-time preprocessing and subquadratic querying. Its influence persists in modern frameworks, underscoring its role in scalable similarity search amid growing data volumes.

## Background Concepts

### Jaccard Similarity

The Jaccard similarity between two finite sets $A$ and $B$ is defined as the cardinality of their intersection divided by the cardinality of their union: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. This measure quantifies the degree of overlap between the sets, where a value of 1 indicates complete identity and values approaching 0 signify minimal shared elements. Introduced by botanist Paul Jaccard in 1901 to compare the similarity of plant species distributions across geographic regions, the metric provides a geometric interpretation as the proportion of common elements relative to the total distinct elements in the combined sets. In computational contexts, such as information retrieval and data mining, it serves as a foundational tool for assessing set-based resemblance without assuming vector weights or order.[](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf)

The Jaccard similarity exhibits key properties: it is symmetric, satisfying $J(A, B) = J(B, A)$; reflexive, with $J(A, A) = 1$; and bounded in the interval $[0, 1]$, where 0 denotes [disjoint sets](/page/Disjoint_sets).[](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf) For illustration, consider sets $A = \{1, 2, 3\}$ and $B = \{2, 3, 4\}$: the [intersection](/page/Intersection) $\{2, 3\}$ has size 2 and the [union](/page/Union) $\{1, 2, 3, 4\}$ has size 4, yielding $J(A, B) = 0.5$. In set-based applications, such as comparing [binary](/page/Binary) feature representations of documents or items, Jaccard similarity is favored over vector-oriented measures like [cosine similarity](/page/Cosine_similarity) because it explicitly incorporates the union in the denominator, offering a direct gauge of relative overlap for unweighted, presence-absence [data](/page/Data).[](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf)
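The definition translates directly into code; the following minimal Python sketch (function name ours) reproduces the example above.

```python
def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two finite sets."""
    if not a and not b:
        return 1.0  # convention: two empty sets are considered identical
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
```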
### Locality-Sensitive Hashing

Locality-sensitive hashing (LSH) is a probabilistic framework for approximate [nearest neighbor search](/page/Nearest_neighbor_search) in high-dimensional spaces, consisting of a family of hash functions designed such that similar items are hashed to the same value (i.e., collide) with higher probability than dissimilar items.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf) Formally, a family $\mathcal{H}$ of functions from a universe $\mathcal{U}$ to a set of buckets is $(r, cr, p_1, p_2)$-sensitive with respect to a [distance](/page/Distance) metric $d$ if, for any two points $x, y \in \mathcal{U}$,

$$\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1 \quad \text{if } d(x, y) \leq r,$$
$$\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2 \quad \text{if } d(x, y) > cr,$$

where $c > 1$ and $p_1 > p_2$.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf) This property enables efficient indexing by bucketing data points and querying only candidate neighbors in the same bucket, achieving sublinear query times while trading exactness for scalability.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf) The LSH framework was introduced by Indyk and Motwani in 1998 to mitigate the curse of dimensionality in nearest neighbor problems, where traditional methods like exhaustive search or tree-based indexing fail in high dimensions due to exponential growth in volume and sparsity.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf)

To balance [precision and recall](/page/Precision_and_recall), amplification techniques refine the basic LSH family: the AND-construction concatenates $k$ independent hash functions into a single composite hash, reducing false positives by requiring collisions in all components (increasing specificity); conversely, the OR-construction applies $l$ independent LSH families in parallel, treating a collision in any one as a match to boost recall by capturing more true positives.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf) These methods allow tuning the trade-off, with space and query time scaling as $O(n^{1 + \rho})$ and $O(n^\rho)$, respectively, where $\rho = \frac{\log(1/p_1)}{\log(1/p_2)} < 1$.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf)

In the context of Jaccard similarity, which measures set overlap as $|A \cap B| / |A \cup B|$, LSH is particularly valuable because sets are often represented as characteristic vectors in extremely high-dimensional spaces (e.g., one dimension per possible element in the universe), exacerbating the curse of dimensionality through sparsity and the infeasibility of comparing all pairs.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) [MinHash](/page/MinHash) provides an LSH instantiation tailored to this metric, enabling efficient estimation of similar sets without exhaustive computation.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)
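To make the AND/OR amplification concrete, the short Python sketch below evaluates the banded collision probability $1 - (1 - s^r)^b$ (derived for MinHash-LSH later in this article) at a few similarity values; the parameter choices are illustrative only.

```python
def collision_probability(s: float, r: int, b: int) -> float:
    """Probability that two items with similarity s collide in at least
    one of b bands, each band being an AND of r hash comparisons."""
    return 1.0 - (1.0 - s**r) ** b

# With k = 120 hashes split into b = 30 bands of r = 4 rows, the curve
# transitions sharply near the threshold t ≈ (1/b)**(1/r) ≈ 0.43.
for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s = {s:.1f} -> P(collide) = {collision_probability(s, r=4, b=30):.3f}")
```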
## Core Algorithm

### Multiple Hash Functions Variant

The multiple hash functions variant of MinHash estimates the Jaccard similarity between sets by generating compact signatures using several independent hash functions, providing an unbiased estimator whose variance is controlled by the number of functions employed.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf) Sets are represented as characteristic vectors over a universe $U$, where each set $S \subseteq U$ indicates presence or absence of elements. To compute signatures, $k$ independent universal hash functions $h_1, h_2, \dots, h_k : U \to [0, 1)$ (or a large integer range) are selected, approximating random permutations. For a set $S$, the signature $\sigma(S)$ is a $k$-dimensional vector whose $i$-th entry is $\sigma(S)_i = \min_{x \in S} h_i(x)$. This process simulates applying $k$ random permutations to the universe and recording the minimum value (or rank) for each permuted set.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

The Jaccard similarity $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$ is estimated as the fraction of matching entries in the signatures: $\hat{J}(A, B) = \frac{1}{k} \sum_{i=1}^k \mathbf{1}[\sigma(A)_i = \sigma(B)_i]$, where $\mathbf{1}$ is the indicator function. This estimator is unbiased, with expectation $\mathbb{E}[\hat{J}(A, B)] = J(A, B)$.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

The probability that signatures match at a given position equals the Jaccard similarity. For a fixed hash function $h_i$ approximating a random permutation $\pi_i$, the minima $\min_{x \in A} h_i(x)$ and $\min_{x \in B} h_i(x)$ match exactly when the element of $A \cup B$ with the smallest $h_i$-value (or permuted rank) lies in $A \cap B$. Since $h_i$ randomizes the order uniformly, each element in the union is equally likely to be the minimum, yielding $\Pr[\sigma(A)_i = \sigma(B)_i] = \frac{|A \cap B|}{|A \cup B|} = J(A, B)$. The matches across the $k$ independent functions are independent and identically distributed, enabling concentration bounds like Hoeffding's inequality for accuracy guarantees.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

The following Python rendering of the signature computation takes a collection of sets and $k$ hash functions:

```python
def compute_signatures(sets, hash_functions):
    """Return an m x k matrix of MinHash signatures:
    one row per set, one column per hash function."""
    signatures = []
    for s in sets:                              # for each set S_j
        row = []
        for h in hash_functions:                # for each hash function h_i
            row.append(min(h(x) for x in s))    # minimum hash value over S_j
        signatures.append(row)
    return signatures
```

This implementation assumes hash functions are pre-chosen and efficient to evaluate; in practice, universal hashing families like linear polynomials over finite fields are used.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

For illustration, consider universe $U = \{1, 2, 3, 4, 5\}$, sets $A = \{1, 3\}$ and $B = \{1, 4\}$ with true Jaccard similarity $J(A, B) = 1/3$, and $k = 2$ hash functions. Suppose $h_1(1) = 0.2$, $h_1(3) = 0.7$, $h_1(4) = 0.4$, so $\sigma(A)_1 = 0.2$ and $\sigma(B)_1 = 0.2$ (match); and $h_2(1) = 0.5$, $h_2(3) = 0.3$, $h_2(4) = 0.8$, so $\sigma(A)_2 = 0.3$ and $\sigma(B)_2 = 0.5$ (no match). The estimate is $\hat{J}(A, B) = 1/2 = 0.5$, compared with the true value $1/3 \approx 0.333$; different hash realizations yield varying estimates, which average to $1/3$.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

To achieve an $\varepsilon$-approximation (i.e., $|\hat{J} - J| < \varepsilon$ with probability at least $1 - \delta$), select $k \approx \frac{1}{\varepsilon^2} \ln \frac{1}{\delta}$; common practical values are $k = 100$ to 500 for $\varepsilon \approx 0.1$. Larger $k$ reduces variance but increases storage and computation proportionally.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)
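End to end, the variant amounts to two small functions. The sketch below (helper names ours, with salted SHA-1 slices standing in for a universal hash family) builds signatures for the example sets and applies the estimator $\hat{J}$.

```python
import hashlib

def h_i(i: int, x) -> int:
    """i-th salted hash of element x: a 64-bit slice of SHA-1 (illustrative)."""
    digest = hashlib.sha1(f"{i}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(s: set, k: int) -> list:
    """k-entry MinHash signature using the salted hash family above."""
    return [min(h_i(i, x) for x in s) for i in range(k)]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where two equal-length signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig_a = minhash_signature({1, 3}, k=128)
sig_b = minhash_signature({1, 4}, k=128)
print(estimate_jaccard(sig_a, sig_b))  # close to the true value 1/3
```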
### Single Hash Function Variant

The single hash function variant of MinHash generates signatures by applying a fixed hash function to the minimum elements identified under multiple random permutations of the universe, offering a theoretically grounded approach to estimating set similarities. Unlike the multiple hash functions variant, which relies on independent hash computations for each signature component, this method uses permutations to induce random orderings before hashing the resulting minima.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

Given a set $S \subseteq U$, where $U$ is the universe of elements, and $k$ independent random permutations $\pi_1, \pi_2, \dots, \pi_k$ of $U$, the signature $\sigma(S)$ is a vector of length $k$ defined by

$$\sigma(S)_i = h \left( \arg\min_{x \in S} \pi_i(x) \right)$$

for $i = 1, \dots, k$, where $h$ is a single fixed hash function (e.g., a 64-bit universal hash) applied to the element with the lowest rank under $\pi_i$. The Jaccard similarity between two sets is estimated as the fraction of positions where their signatures match.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf) Each permuted minimum behaves equivalently to a component from a random hash function in the multiple hash variant, as the permutation provides a uniform random ordering, ensuring the probability of matching signatures at any position equals the Jaccard similarity $J(S, T) = \frac{|S \cap T|}{|S \cup T|}$.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

This approach yields a storage benefit by representing each set with just $k$ fixed-size hash values, avoiding the need to retain full hash outputs for every element across multiple independent computations, which can be substantial for dense or large sets. Typical signatures store $k = 64$ to 128 64-bit hash values, i.e., roughly 512 to 1024 bytes per set, enabling efficient indexing of millions of sets.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf) In implementation, explicit permutations over large universes are impractical, so they are simulated via compositions of simpler hash functions; for instance, a base hash can map elements to pseudo-ranks, with $h$ then applied to the identified minimum, often using rolling hashes like Rabin fingerprints for shingle-based sets.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

For example, consider sets $A = \{a, b, c\}$ and $B = \{b, c, d\}$ over universe $U = \{a, b, c, d\}$, with Jaccard similarity 0.5. Under permutation $\pi_1: a \to 3, b \to 1, c \to 2, d \to 4$, the argmin for both $A$ and $B$ is $b$ (lowest rank 1), so $\sigma(A)_1 = \sigma(B)_1 = h(b)$. Under $\pi_2: a \to 2, b \to 4, c \to 1, d \to 3$, the argmin for both is $c$ (rank 1), so $\sigma(A)_2 = \sigma(B)_2 = h(c)$. Matching at both positions estimates similarity 1.0, while other permutations may mismatch, so estimates average toward 0.5 over many trials.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) This variant is suited to applications with very large universes (e.g., billions of possible shingles in web documents), where generating and storing signatures via multiple full independent hash passes over all elements would be prohibitively expensive in computation and memory.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)
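Under the simulation strategy described above (a salted base hash standing in for each permutation, and a single fixed hash $h$ applied to the argmin element), a minimal sketch looks as follows; all helper names are ours.

```python
import hashlib

def _int64(data: str) -> int:
    """64-bit integer from SHA-1; used for both pseudo-ranks and the fixed h."""
    return int.from_bytes(hashlib.sha1(data.encode()).digest()[:8], "big")

def pseudo_rank(i: int, x) -> int:
    """Simulated rank of element x under permutation pi_i (salted base hash)."""
    return _int64(f"pi{i}:{x}")

def fixed_h(x) -> int:
    """The single fixed hash h, applied to the selected minimum element."""
    return _int64(f"h:{x}")

def single_hash_signature(s: set, k: int) -> list:
    """Entry i is h(argmin_x pi_i(x)): we hash the element itself, not its
    rank, so one fixed h serves all k simulated permutations."""
    return [fixed_h(min(s, key=lambda x, i=i: pseudo_rank(i, x)))
            for i in range(k)]
```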
### Complexity Analysis

The time complexity of computing a MinHash signature for a single set of size $n$ (the number of elements in the set) using $k$ independent hash functions is $O(nk)$, as each of the $n$ elements must be hashed under all $k$ functions to determine the minimum value for each.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) This process scales linearly with the set size for sparse representations, making it efficient for large, high-dimensional datasets where elements are drawn from a potentially vast universe.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) Pairwise similarity estimation between two signatures then requires $O(k)$ time, simply by computing the fraction of matching entries in the $k$-dimensional signatures.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf)

The space complexity for storing a single MinHash signature is $O(k)$, consisting of the $k$ minimum hash values, which remains independent of the universe size when a universal hash family approximates the permutations without explicit enumeration of the full universe.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) For a collection of $N$ sets, the total storage is $O(Nk)$, enabling compact sketches that drastically reduce memory needs compared to raw set representations—for instance, compressing sets of up to 200,000 elements into signatures of roughly 1,000 bytes when $k \approx 100$.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf)

In the context of locality-sensitive hashing (LSH) for indexing, MinHash signatures are organized into $b$ bands of $r$ rows each, such that $k = b \cdot r$; this structure yields an average-case query time of $O(1)$ for retrieving candidate similar sets, as the banding concentrates similar items into the same buckets with high probability, limiting the number of comparisons to a constant expected value independent of $N$.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) Preprocessing to build the LSH index takes $O(Nk)$ time overall, but subsequent queries probe only a fixed number of buckets on average.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf)

Key trade-offs arise with the choice of $k$: larger $k$ enhances estimation accuracy by reducing variance in the Jaccard similarity approximation, but it proportionally increases both storage and computation costs. Chernoff bounds quantify this, showing that the probability of significant error decreases exponentially with $k$; e.g., the estimate deviates from the true similarity by more than $\delta$ with probability at most $2\exp(-k\delta^2/3)$.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) These asymptotic behaviors position MinHash as suitable for sublinear sketching in streaming models, where fixed $k$ allows processing and storage in $O(k)$ space per update while approximating similarities without full data retention.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf)
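The quoted Chernoff bound can be inverted to pick $k$ for a target accuracy; the helper below does that arithmetic under the stated failure probability $2\exp(-k\delta^2/3)$ (a sketch of the calculation, not a prescription from the sources).

```python
import math

def hashes_needed(delta: float, failure_prob: float) -> int:
    """Smallest k with 2 * exp(-k * delta^2 / 3) <= failure_prob."""
    return math.ceil(3.0 * math.log(2.0 / failure_prob) / delta**2)

print(hashes_needed(delta=0.1, failure_prob=0.05))  # 1107 hashes
```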
## Theoretical Foundations

### Min-Wise Independent Permutations

A family of permutations $\mathcal{F} \subseteq S_n$ on the universe $[n] = \{1, \dots, n\}$ is min-wise independent if, for every nonempty subset $X \subseteq [n]$ and every element $x \in X$,

$$\Pr_{\pi \sim \mathcal{F}} \left[ \pi(x) = \min \{ \pi(y) \mid y \in X \} \right] = \frac{1}{|X|}.$$

This condition ensures that, under a randomly selected permutation $\pi$ from $\mathcal{F}$, each element of $X$ has an equal probability of mapping to the smallest value among the images of $X$. The concept was introduced by Broder et al. in the context of estimating document resemblances for web duplicate detection, where random permutations model the relative ordering of document features to capture set overlaps accurately.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf) The formal definition and analysis were further developed in their subsequent work, establishing min-wise independence as a key probabilistic tool for similarity estimation.[](https://dl.acm.org/doi/10.1145/276698.276781)

Min-wise independent permutations are essential for the theoretical soundness of MinHash because they provide an unbiased mechanism for estimating the Jaccard similarity between sets, linking directly to universal hashing principles that minimize collision biases. In MinHash, hash functions are constructed to approximate these permutations, ensuring that the minimum hash value over a set behaves as if drawn from a truly random reordering. This independence property guarantees that the Jaccard estimator remains unbiased, even with finite samples, by preserving the uniform selection of minima across set elements. Without min-wise independence, the collision probabilities could deviate systematically, leading to skewed similarity estimates in applications like large-scale data deduplication.[](https://www.sciencedirect.com/science/article/pii/S0022000099916902)

The proof that min-wise independence yields exact Jaccard similarity estimation proceeds as follows. For two sets $A$ and $B$, let $U = A \cup B$ and $I = A \cap B$. Under a min-wise independent permutation $\pi$, the element achieving the minimum value $\min \{ \pi(y) \mid y \in U \}$ is uniformly distributed over the elements of $U$. The MinHash values collide, $h(A) = h(B)$, if and only if this minimum occurs in $I$, which happens with probability $|I| / |U|$. Thus, the collision probability equals the Jaccard similarity $J(A, B) = |I| / |U|$. This equivalence holds precisely because the min-wise property applies to the union set.[](https://www.sciencedirect.com/science/article/pii/S0022000099916902)

While exact min-wise independence requires the property to hold for all set sizes, stronger $k$-wise independence conditions—where permutations behave independently on any $k$ elements—can ensure exactness for small sets and approximate behavior for larger ones. In practice, 2-wise (pairwise) independence often suffices for [MinHash](/page/MinHash)'s probabilistic guarantees, as it controls variance in collision estimates without needing full higher-order independence, though approximate $k$-wise constructions are used to achieve $\varepsilon$-close min-wise behavior efficiently.[](https://www.sciencedirect.com/science/article/pii/S0022000099916902) [MinHash](/page/MinHash), grounded in this framework, functions as a locality-sensitive hashing scheme for Jaccard similarity, where the collision probability exactly matches the similarity measure, enabling provable performance in nearest-neighbor searches.[](https://www.sciencedirect.com/science/article/pii/S0022000099916902)
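The defining property is easy to check empirically for the family of all permutations: sampling uniform random permutations, each element of a fixed $X$ should achieve the minimum about $1/|X|$ of the time. A quick Monte Carlo check (illustrative, not a proof):

```python
import random
from collections import Counter

n, X, trials = 10, [2, 5, 7, 9], 100_000
wins = Counter()
for _ in range(trials):
    pi = list(range(n))
    random.shuffle(pi)                       # uniform random permutation of [n]
    wins[min(X, key=lambda x: pi[x])] += 1   # element of X with smallest image

for x in X:
    print(x, wins[x] / trials)  # each frequency close to 1/|X| = 0.25
```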
### Practical Constructions

In practical implementations of MinHash, approximate min-wise independent hash functions are employed to simulate random permutations efficiently, as exact constructions are computationally prohibitive. The most common approach uses 2-universal (2U) linear hash functions of the form $h_j(t) = \big( (a_{1,j} + a_{2,j}\, t) \bmod p \big) \bmod D$, where $p$ is a large prime greater than the universe size, $D$ is typically a power of 2 for bit-range control (e.g., $2^b$ for $b$-bit outputs), and the coefficients $a_{1,j}$, $a_{2,j}$ are chosen uniformly at random from $[0, p-1]$.[](https://www.combinatorics.org/ojs/index.php/eljc/article/view/v7i1r26)[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf) This family provides pairwise independence, ensuring low collision probabilities while remaining computationally lightweight, with each hash evaluation requiring only modular arithmetic.

For improved approximation quality, especially with sparse data or small sketch sizes, 4-universal (4U) polynomial hash functions extend the construction to higher-degree polynomials: $h_j(t) = \big( \sum_{i=1}^4 a_{i,j}\, t^{i-1} \bmod p \big) \bmod D$, using four random coefficients.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf) These offer greater randomness than 2U functions, reducing variance in [Jaccard similarity](/page/Jaccard_similarity) estimates, at a modest increase in computation per hash (still $O(1)$ time). Empirical evaluations demonstrate that 4U performs comparably to or slightly better than 2U when using 200 or more hashes ($k \geq 200$) and $b \geq 4$ bits per hash value, achieving mean squared error (MSE) close to theoretical minima for sparse sets.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf)

To support 64-bit architectures without overflow issues, concatenated or combined hashes are often used, such as $(h_1(x) \cdot M + h_2(x)) \bmod 2^{64}$ with $M = 2^{32}$, which interleaves two 32-bit hashes into a single 64-bit value for efficient processing on modern hardware.[](http://papers.neurips.cc/paper/7239-practical-hash-functions-for-similarity-estimation-and-dimensionality-reduction.pdf) This method maintains approximate independence while leveraging unsigned 64-bit integer operations for speed.

Error bounds for these approximations show that the estimated [Jaccard similarity](/page/Jaccard_similarity) converges to the true value with high probability when $k$ is large (e.g., 200–500), with relative errors under 1% for typical web-scale datasets.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf) Such constructions are integrated into libraries like Apache Spark's MinHashLSH module, which employs the 2U linear family with random coefficients modulo a prime for scalable similarity search on distributed datasets.[](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html) Despite not achieving exact min-wise independence, these pseudorandom approximations prove empirically effective, often matching full-permutation accuracy in practice while reducing preprocessing time by orders of magnitude (e.g., via GPU acceleration for batch computations).[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf) Limitations include potential accuracy degradation for very dense sets or small $D$, necessitating larger $k$ (up to 500) in machine learning applications compared to duplicate detection.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf)
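A sketch of the 2U construction described above, with parameter choices that are ours rather than prescribed by the sources (a Mersenne prime $p = 2^{61} - 1$ and $D = 2^{32}$):

```python
import random

P = (1 << 61) - 1   # Mersenne prime larger than the universe size (assumed)
D = 1 << 32         # power-of-two output range

def make_2u_hash(rng: random.Random):
    """One member of the 2-universal linear family t -> ((a1 + a2*t) % P) % D."""
    a1 = rng.randrange(P)
    a2 = rng.randrange(1, P)   # nonzero slope
    return lambda t: ((a1 + a2 * t) % P) % D

rng = random.Random(42)
hash_family = [make_2u_hash(rng) for _ in range(128)]

def signature(s: set) -> list:
    """MinHash signature of a set of integer-encoded elements."""
    return [min(h(x) for x in s) for h in hash_family]
```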
## Extensions and Variants

### Weighted MinHash

Weighted MinHash extends the standard [MinHash](/page/MinHash) technique to handle sets whose elements carry nonnegative weights, allowing for more nuanced similarity estimation where uniform treatment of elements is insufficient. In the unweighted case, [MinHash](/page/MinHash) assumes all elements contribute equally, but real-world data often involves varying importance, such as frequencies or priorities. The [weighted Jaccard similarity](/page/Weighted_Jaccard_similarity), which generalizes the traditional [Jaccard index](/page/Jaccard_index), is defined as

$$J_w(A, B) = \frac{\sum_x \min(w_A(x), w_B(x))}{\sum_x \max(w_A(x), w_B(x))},$$

where $w_A(x)$ and $w_B(x)$ are the weights of element $x$ in sets $A$ and $B$, respectively, and the sums run over all relevant elements. This metric captures the overlap weighted by the minimum contributions relative to the total maximum contributions, providing a better measure for applications like document comparison where term frequencies matter.[](http://www.cohenwang.org/edith/Surveys/minhash.pdf)

The core adaptation modifies the hashing process to bias sampling toward higher-weight elements, ensuring inclusion probabilities proportional to weights. One approach uses thresholding, where a random threshold is applied to normalized weights to select elements probabilistically. A key probabilistic sampling method, consistent weighted sampling (CWS), generates samples $(k, y_k)$ for each element such that the selection probability is proportional to the weight $w(x)$, often by drawing ranks from a distribution that depends on the weight.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2010/06/ConsistentWeightedSampling2.pdf) In practice, the weighted minimum is computed as $\arg\min_x \frac{h(x)}{w(x)}$, where $h(x)$ is a random hash function; this effective rank favors elements with larger $w(x)$, since higher weights reduce the adjusted hash value and increase the chance of selection. Multiple such hashes are typically computed to form a sketch, with the process repeated for stability.

One early variant, proposed by Haveliwala et al. in 2000, quantizes weights by multiplying them by a large constant and applying standard MinHash to the resulting binary expansions.[](https://theory.stanford.edu/~taherh/papers/simsearch00.pdf) Later constructions, such as those using CWS, ensure that the probability of a sketch match between two weighted sets approximates their weighted Jaccard similarity, specifically $\Pr[\text{match}] \approx \sum_x \min(w_A(x), w_B(x)) / \sum_x \max(w_A(x), w_B(x))$. The mechanism adjusts sampling densities to handle weight disparities, making it suitable for scalable estimation without full data access.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2010/06/ConsistentWeightedSampling2.pdf)

A representative example is document similarity, where elements are terms and weights are term frequencies (e.g., via [tf-idf](/page/tf-idf) scoring); here, heavily weighted shared terms drive the similarity score, unlike unweighted [bag-of-words](/page/bag-of-words) models that ignore multiplicity. Such weighted sketches enable applications in text analysis beyond simple presence, including near-duplicate detection in large corpora and clustering documents with varying term importance, as seen in web search indexing.[](http://www.cohenwang.org/edith/Surveys/minhash.pdf)
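The rank-transform idea from the paragraphs above fits in a few lines. The sketch below is a simplified illustration, not a full consistent-weighted-sampling implementation, and its match probability only roughly tracks the weighted Jaccard similarity; all names and example weights are ours.

```python
import hashlib

def uniform_hash(i: int, x) -> float:
    """i-th salted hash of x, mapped into (0, 1]."""
    v = int.from_bytes(hashlib.sha1(f"{i}:{x}".encode()).digest()[:8], "big")
    return (v + 1) / 2.0**64

def weighted_signature(weights: dict, k: int) -> list:
    """Entry i is the element minimizing h_i(x) / w(x): larger weights
    shrink the adjusted rank and so are selected more often."""
    return [min(weights, key=lambda x: uniform_hash(i, x) / weights[x])
            for i in range(k)]

doc_a = {"data": 3.0, "mining": 1.0, "sets": 2.0}
doc_b = {"data": 2.0, "mining": 1.0, "graphs": 2.0}
sa, sb = weighted_signature(doc_a, 256), weighted_signature(doc_b, 256)
print(sum(x == y for x, y in zip(sa, sb)) / 256)  # rough weighted-overlap estimate
```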
### Advanced Adaptations

One prominent adaptation of MinHash is b-bit compression, where each hash value (typically 64 bits) is truncated to only its lowest $b$ bits, such as $b = 5$, so that signature entries map into $2^b$ possible values.[](https://arxiv.org/abs/0910.3349) This reduction enables significant space savings by storing fewer bits per hash while still approximating [Jaccard similarity](/page/Jaccard_similarity) through the proportion of matching b-bit values across multiple hash functions.[](https://arxiv.org/abs/0910.3349) The approach trades some estimation accuracy for efficiency, as the truncated hashes introduce additional variance in the similarity estimate.[](https://arxiv.org/abs/0910.3349) The collision probability in b-bit MinHash depends on the chosen $b$; for $b = 1$, the effective Jaccard similarity threshold for reliable estimation is around 0.5, while larger $b$ values (e.g., $b = 8$) shift this threshold higher, improving precision at the cost of more storage.[](https://arxiv.org/abs/0910.3349) An unbiased estimator corrects for chance collisions of truncated values, ensuring the expected value matches the true Jaccard similarity regardless of $b$.[](https://arxiv.org/abs/0910.3349)

For handling ordered data like sequences, Ordinal MinHash (also known as Order MinHash or OMH) incorporates position-aware hashing by storing selected minhash values in ordered sublists, preserving relative order to better approximate edit distance rather than unordered set similarity.[](https://www.biorxiv.org/content/10.1101/534446v1) This variant refines standard MinHash by assigning hashes that account for positional relationships, making it suitable for sequence alignment applications where order matters.[](https://www.biorxiv.org/content/10.1101/534446v1) In contrast to MinHash, which targets Jaccard similarity for sets, SimHash applies a similar locality-sensitive hashing principle to cosine similarity on sparse binary vectors, such as document representations, by projecting onto random hyperplanes and recording sign bits rather than taking minima.[](https://arxiv.org/abs/1407.4416)

A post-2010 adaptation extends MinHash to graph similarity by applying it to neighborhood sets around vertices, where sketches capture the k-nearest neighbors under random rankings to estimate structural similarities like distances or centralities efficiently.[](https://www.semanticscholar.org/paper/58225ffe658820dcfbea7423828311179edf01b4) Post-2020 developments include [ProbMinHash](/page/ProbMinHash) (2020), which extends [MinHash](/page/MinHash) to estimate the probability Jaccard similarity for continuous probability measures and distributions, and [C-MinHash](/page/C-MinHash) (2022), a variant using circulant permutations to reduce the number of hash functions to one while maintaining accuracy for Jaccard estimation in large-scale applications.[](https://arxiv.org/abs/2008.08916)[](https://proceedings.mlr.press/v162/li22m.html) These adaptations generally achieve compression ratios of 10–20x in space (e.g., $b = 1$ versus 64 bits) but introduce error rates that increase with lower $b$ or sparser neighborhoods, with relative errors often under 5% for Jaccard similarities above 0.5 in controlled settings.[](https://arxiv.org/abs/0910.3349)
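In code, b-bit truncation is just a mask over an existing signature. The correction below is a simplification of the full estimator in the b-bit minwise hashing paper: it assumes uniform hashes over a large universe, where truncated values collide by chance with probability about $2^{-b}$.

```python
def truncate(sig: list, b: int) -> list:
    """Keep only the lowest b bits of each hash value in a signature."""
    mask = (1 << b) - 1
    return [v & mask for v in sig]

def estimate_from_bbit(sig_a: list, sig_b: list, b: int) -> float:
    """Bias-corrected Jaccard estimate from two b-bit signatures."""
    p_hat = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
    c = 2.0 ** -b   # chance-collision rate (large-universe approximation)
    return max(0.0, (p_hat - c) / (1.0 - c))
```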
## Applications and Uses

### Similarity Search in Sets

MinHash serves as the foundation for locality-sensitive hashing (LSH) in efficient similarity search over large collections of sets, enabling approximate nearest neighbor queries under [Jaccard similarity](/page/Jaccard_similarity) without exhaustive pairwise comparisons. In this paradigm, MinHash signatures—compact representations of sets derived from multiple hash functions—are generated for each set in the database. These signatures are then partitioned into $b$ bands of $r$ consecutive rows each, where the total signature length $k$ satisfies $k = b \times r$. For a query set, its signature is banded in the same way, and candidate pairs are identified as those sharing an identical band hash value in at least one band; this hashing of bands into buckets ensures that sets with Jaccard similarity $s$ collide in some bucket with probability approximately $1 - (1 - s^r)^b$. This approach originated in the context of detecting near-duplicate web pages, where documents were represented as sets of [shingles](/page/shingles) and clustered using MinHash sketches to process millions of pages scalably.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf)

To tune the LSH parameters for a desired Jaccard similarity threshold $t$, the number of rows per band $r$ is typically set to approximately $(1/t) \log(1/t)$ so that the band-collision probability transitions sharply around $t$, while the number of bands is chosen as $b = k / r$ to balance false positives and negatives, with $k$ fixed by available storage (e.g., 100–256 hashes for practical accuracy). This configuration amplifies the collision probability for similar sets (Jaccard $> t$) to near 1 while keeping it low for dissimilar ones, allowing subquadratic query times even on billion-scale datasets; for instance, an LSH ensemble using MinHash indexed 262 million [web](/page/Web) domains for [containment](/page/Containment) similarity search, achieving efficient retrieval through multiple hash tables.

For sparse sets—common in applications like document term vectors or [genome](/page/Genome) k-mers—MinHash integrates seamlessly with [inverted index](/page/Inverted_index) structures, where band hashes serve as keys mapping to lists of set identifiers, enabling fast posting-list intersections during candidate generation without materializing full signatures.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)[](http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf) Compared to exact Jaccard similarity search, which requires $O(n^2)$ comparisons for $n$ sets, MinHash-LSH delivers substantial efficiency gains while maintaining high recall; optimizations in large-scale deployments have achieved over 100× [speedup](/page/Speedup) in end-to-end pipelines at 90% recall on seismic datasets represented as [binary](/page/Binary) sets, demonstrating its viability for [real-time](/page/Real-time) queries over massive corpora.[](https://www.bailis.org/papers/quakes-vldb2018.pdf)
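A minimal banding index over precomputed signatures might look as follows; bucket keys are tuples of one band's values, and the helper names are ours.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures: dict, b: int, r: int) -> set:
    """signatures: {item_id: signature of length k = b * r}.
    Returns the pairs of items sharing at least one identical band."""
    pairs = set()
    for band in range(b):
        buckets = defaultdict(list)
        for item, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # this band's r rows
            buckets[key].append(item)
        for items in buckets.values():                 # all items in a bucket
            pairs.update(combinations(sorted(items), 2))
    return pairs
```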
### Duplicate Detection and Clustering

MinHash facilitates duplicate detection by generating compact signatures for data items represented as sets, enabling efficient estimation of Jaccard similarities between pairs to identify near-duplicates whose similarity exceeds a predefined threshold, such as 0.8 for textual content.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) The process begins with preprocessing the data into set representations, followed by signature computation using multiple hash functions to approximate the min-wise independence property, which ensures that the probability of matching hash minima correlates with set overlap.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf)

For clustering, MinHash signatures support all-pairs similarity estimation, where pairs surpassing the threshold are linked to form clusters using union-find (disjoint-set union) structures, efficiently merging components in time nearly linear in the number of candidate pairs.[](https://arxiv.org/pdf/1406.1143) This approach scales to large repositories by first generating candidate pairs via [locality-sensitive hashing](/page/Locality-sensitive_hashing) on signatures, then applying union-find to group duplicates into equivalence classes, as demonstrated in processing millions of [Wikipedia](/page/Wikipedia) sentences to yield over 1 million clusters.[](https://arxiv.org/pdf/1406.1143)

In textual duplicate detection, [shingling](/page/Shingle) preprocesses documents into sets of k-grams—contiguous sequences of k tokens or characters—to capture near-duplicates arising from minor edits, insertions, or deletions, transforming the problem into set similarity estimation amenable to MinHash.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) For instance, with $k = 10$ for word-level [shingling](/page/Shingle), a document's shingle set allows MinHash to estimate resemblance as the [Jaccard index](/page/Jaccard_index) of these sets, enabling detection of plagiarized or mirrored web pages.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf)
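Shingling itself is a sliding window; word-level k-shingles of the kind described above can be produced as follows (a simple sketch, without the fingerprinting step applied in the original system).

```python
def shingles(text: str, k: int = 10) -> set:
    """Set of word-level k-shingles (k consecutive tokens) of a document."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}
```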
A seminal application occurred in 1997 with AltaVista's [web crawler](/page/Web_crawler), which processed 30 million documents using MinHash on 10-[shingle](/page/Shingle)s, generating compact document signatures by selecting minima from 40-bit shingle fingerprints; the system identified 5.3 million documents as exact duplicates in 2.1 million clusters and formed 3.6 million clusters from 12.3 million documents in a 150 GB corpus.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) This reduced storage redundancy and improved search quality by eliminating syntactic duplicates during indexing.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf)

For streaming environments with dynamic sets, such as evolving data streams, MinHash signatures support online updates by incrementally adjusting minima upon set insertions or deletions, maintaining approximation quality without full recomputation, as in fully-dynamic models handling subset additions in sublinear time.[](https://arxiv.org/pdf/2407.21614) Hierarchical clustering leverages MinHash-estimated distances—derived from signature mismatches as proxies for Jaccard dissimilarity—to construct dendrograms, where agglomerative methods successively merge clusters under average or complete linkage, revealing nested group structures in large collections like malware binaries.[](https://dl.acm.org/doi/pdf/10.1145/3704268.3748684)

To mitigate false positives from probabilistic estimation, a secondary verification step compares the full content of candidate pairs flagged by MinHash, confirming duplicates via exact Jaccard computation or [edit distance](/page/Edit_distance); this filters noise while preserving high recall in the initial screening.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf)
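Putting the pieces together, candidate pairs from LSH banding can be verified exactly and merged with a union-find structure, as described above; a compact sketch under the same illustrative names used earlier in this article:

```python
def find(parent: dict, x):
    """Root of x with path halving; unseen items become their own root."""
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_duplicates(sets: dict, candidates: set, threshold: float = 0.8) -> dict:
    """sets: {item_id: set of shingles}; candidates: pairs from LSH banding.
    Verifies each candidate with the exact Jaccard and unions true matches."""
    parent = {}
    for a, b in candidates:
        inter = len(sets[a] & sets[b])
        union = len(sets[a] | sets[b])
        if union and inter / union >= threshold:   # exact verification step
            parent[find(parent, a)] = find(parent, b)
    clusters = {}
    for item in sets:
        clusters.setdefault(find(parent, item), []).append(item)
    return clusters
```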
### Other Domains

In recommendation systems, MinHash has been employed to enhance [collaborative filtering](/page/Collaborative_filtering) by estimating similarities between user-item interaction sets, enabling efficient online collaborative filtering for millions of daily users. In large-scale news recommendation, for instance, MinHash clustering groups users and articles based on overlapping interests represented as sets.

In [genomics](/page/Genomics), MinHash serves as a sketching technique for comparing microbial strains and metagenomes through [k-mer](/page/K-mer) representations, building on post-2010 advances in high-throughput sequencing. Tools like [Mash](/page/Mash) apply MinHash to estimate pairwise [mutation](/page/Mutation) distances between genomes, supporting rapid [containment](/page/Containment) searches in large databases without full [alignment](/page/Alignment), which is crucial for metagenomic [classification](/page/Classification) and outbreak detection.

For [network](/page/Network) analysis, neighborhood-based MinHash variants compute graph similarities by hashing local [node](/page/Node) neighborhoods as sets, with work from 2015 onward enabling efficient estimation of [distance](/page/Distance) metrics in massive [graph](/page/Graph)s. This approach approximates neighborhood sizes and structural similarities, aiding community detection and [link prediction](/page/Link_prediction) without exhaustive pairwise computations.[](https://ieeexplore.ieee.org/document/8257969)

In [big data](/page/Big_data) environments, MinHash integrates with frameworks like [Apache Spark](/page/Apache_Spark) for distributed [locality-sensitive hashing](/page/Locality-sensitive_hashing) (LSH), allowing scalable similarity joins over sparse datasets. Implementations in Spark MLlib treat non-zero vector elements as binary sets for Jaccard distance estimation, as demonstrated in healthcare [sentiment analysis](/page/Sentiment_analysis) pipelines processing vast [COVID-19](/page/COVID-19) data volumes.[](https://ieeexplore.ieee.org/document/9799934)

Emerging applications in the [2020s](/page/2020s) leverage MinHash for privacy-preserving similarity computations in [federated learning](/page/Federated_learning), where min-wise independent permutations cluster clients based on local data sketches without sharing raw information. This supports clustered federated setups, maintaining [differential privacy](/page/Differential_privacy) while improving model convergence on heterogeneous datasets.[](https://www.diva-portal.org/smash/get/diva2:1712025/FULLTEXT01.pdf) Recent extensions as of 2024 include fully-dynamic MinHash sketches for [streaming data](/page/Streaming_data) with recovery mechanisms, and MinCloud for trusted [malware](/page/Malware) detection in [Linux](/page/Linux) environments using transferable sketches.[](https://arxiv.org/pdf/2407.21614)[](https://www.sciencedirect.com/science/article/abs/pii/S2214212624002096)

A key limitation of MinHash is its design for sparse, binary set representations under Jaccard similarity; for dense real-valued vectors requiring [cosine similarity](/page/Cosine_similarity), SimHash provides a more suitable alternative.[](http://proceedings.mlr.press/v33/shrivastava14.pdf)

## Performance Evaluation

### Benchmarking Methods

Evaluating MinHash implementations typically involves assessing both the accuracy of similarity estimation and the efficiency of processing large-scale data. Key accuracy metrics include [precision and recall](/page/Precision_and_recall) in candidate pair generation for [locality-sensitive hashing](/page/Locality-sensitive_hashing) (LSH) applications, which measure, respectively, the proportion of retrieved pairs that are true positives and the fraction of truly similar pairs identified. For direct Jaccard similarity estimation, the [mean absolute error](/page/Mean_absolute_error) (MAE) or variance of the estimate relative to the true [Jaccard index](/page/Jaccard_index) is commonly used, with error bounds derived from the number of hash functions applied. Throughput is quantified as the number of pairs processed per second or as average query time, often benchmarked on commodity hardware to evaluate [scalability](/page/Scalability).[](http://ekzhu.com/datasketch/lsh.html)[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)

Benchmarking datasets span synthetic and real-world collections to test robustness across distributions. Synthetic datasets often consist of uniform random sets or sets following power-law distributions, e.g., 1,000 samples over 100,000 features, allowing controlled variation in set cardinality and overlap to isolate estimation bias and variance. Real datasets include shingled web documents from large crawls, such as over 30 million AltaVista pages totaling 150 GB, used to evaluate clustering at 50% resemblance thresholds; Wikipedia articles for near-duplicate detection; the 20 Newsgroups corpus (on average 193 3-shingles per document); and Reddit post sets for pairwise similarity tasks.
These enable assessment of performance on sparse, high-dimensional data typical of text processing.[](https://arxiv.org/pdf/1811.04633)[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)[](http://ekzhu.com/datasketch/lsh.html)

Testing protocols standardize parameter tuning and [performance measurement](/page/Performance_measurement) to ensure [reproducibility](/page/Reproducibility). Implementations vary the number of hash permutations $k$ (num_perm in some libraries), the number of bands $b$, and the rows per band $r$ (with $k = b \times r$), plotting [receiver operating characteristic](/page/Receiver_operating_characteristic) (ROC) curves of [precision](/page/Precision) against recall to balance false positives and negatives. Experiments are repeated (e.g., 10 times) to report means and standard deviations, often using open-source tools like the [Python](/page/Python) datasketch library for [MinHash](/page/MinHash) and LSH indexing, which supports [Redis](/page/Redis) as a scalable storage layer and includes [benchmark](/page/Benchmark) scripts for query [latency](/page/Latency). Common pitfalls include sensitivity to [universe](/page/Universe) size, where insufficiently random [hash](/page/Hash) functions lead to biased permutations, and practical [hash](/page/Hash) collisions that inflate [error](/page/Error) in finite implementations, necessitating 40-bit or wider hash values for >99.9% confidence bounds.[](http://ekzhu.com/datasketch/lsh.html)[](https://arxiv.org/pdf/1811.04633)[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)

In the [2020s](/page/2020s), benchmarks have increasingly focused on [hardware acceleration](/page/Hardware_acceleration), particularly GPU implementations of MinHash LSH for [dataset](/page/Data_set) deduplication. For instance, the FED framework achieves up to 107.2x speedup over CPU baselines on [datasets](/page/Data_set) like RedPajama-1T (1.2T tokens), processing 30 billion tokens in 111.7 seconds across 16 GPUs (for a subset) and completing the full [dataset](/page/Data_set) in 6 hours, using a Jaccard similarity [threshold](/page/Threshold) of 0.8; the largest gains come in MinHash generation (260x faster) and candidate filtering. Such evaluations emphasize end-to-end throughput on real corpora like C4 and RealNews, underscoring MinHash's viability for trillion-token-scale tasks.[](https://arxiv.org/pdf/2501.01046)
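For reference, a typical small-scale run with the datasketch library mentioned above (API as documented for datasketch; the threshold and num_perm values here are illustrative) looks like:

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(tokens, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))   # feed each token into the sketch
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)   # banding chosen internally
lsh.insert("doc1", minhash_of(["minhash", "lsh", "jaccard"]))
print(lsh.query(minhash_of(["minhash", "lsh", "jaccard", "sets"])))
```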
### Comparative Analysis

MinHash offers a probabilistic [approximation](/page/Approximation) to exact Jaccard similarity computation, which involves directly calculating set intersections and unions for all pairs—an $O(N^2 \cdot s)$ [time complexity](/page/Time_complexity) in the worst case for $N$ sets of average size $s$ that is infeasible for large-scale datasets. In contrast, MinHash reduces query times to sublinear complexity through fixed-size sketches, enabling efficient similarity estimation at the cost of [approximation error](/page/Approximation_error).[](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x)[](https://www.researchgate.net/publication/349391865_A_Survey_on_Locality_Sensitive_Hashing_Algorithms_and_their_Applications)

Compared to other [locality-sensitive hashing](/page/Locality-sensitive_hashing) techniques, MinHash is specifically tailored to Jaccard similarity on sets, unlike [HyperLogLog](/page/HyperLogLog), which addresses only [cardinality](/page/Cardinality) estimation and offers no direct pairwise similarity support. Similarly, SimHash targets [Hamming distance](/page/Hamming_distance) for document similarity under cosine metrics, making it less suitable for pure set-overlap tasks where MinHash excels, as demonstrated in near-duplicate detection benchmarks. MinHash also provides better Jaccard estimates than Euclidean-based LSH variants, which are optimized for vector distances rather than set [containment](/page/Containment).[](https://www.researchgate.net/publication/349391865_A_Survey_on_Locality_Sensitive_Hashing_Algorithms_and_their_Applications)[](https://dl.acm.org/doi/10.1145/2063576.2063749)

Empirical evaluations highlight MinHash's effectiveness: in a 2006 study on 1.6 billion [web](/page/Web) pages, MinHash/LSH achieved 86% [precision](/page/Precision) for cross-site near-duplicates versus 90% for SimHash, while a combined approach reached 79% precision at 79% relative recall, with low false-positive rates achieved through banding. More recent genomic benchmarks, such as those using [Mash](/page/Mash), report high correlation (RMSE 0.00274) with exact average nucleotide identity on 54,000 genomes, with false positives controlled via [p-value](/page/P-value) thresholds below $10^{-10}$.[](https://www3.cs.stonybrook.edu/~cse692/papers/henzinger_sigir06.pdf)[](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x)

Key trade-offs include MinHash's space requirements, which scale with the number of hash functions needed for accuracy (e.g., 100–1000 for 1–5% error), potentially exceeding alternatives like the [Count-Min sketch](/page/Count-Min_sketch) in streaming environments where frequency estimation suffices over direct similarity. [Count-Min](/page/Count-Min) offers lower space for approximate counts in dynamic data but lacks MinHash's unbiased Jaccard estimator.[](https://www.researchgate.net/publication/349391865_A_Survey_on_Locality_Sensitive_Hashing_Algorithms_and_their_Applications)[](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1809-x)

In modern contexts, particularly for sparse sets like k-mers in [genomics](/page/Genomics), MinHash outperforms [deep learning](/page/Deep_learning) embeddings in efficiency, enabling roughly 10x faster processing than alignment-based exact methods or neural approaches on gigabase-scale [data](/page/Data), though embeddings may capture denser semantics at higher computational cost. Post-2015 parallel implementations, including GPU-accelerated MinHash, achieve up to 25x speedups over parallel CPU baselines for sketching large collections.[](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x)[](https://www.slideshare.net/slideshow/gpu-acceleration-of-set-similarity-joins/63164204)

## References

1. [PDF] On the resemblance and containment of documents — cs.princeton.edu.
2. [PDF] Minhashing — Stanford InfoLab.
3. [PDF] A Review for Weighted MinHash Algorithms — arXiv, Nov 12, 2018.
4. Mash Screen: what's in my sequencing run? — Sep 25, 2017.
5. [PDF] Introduction to Information Retrieval — Manning, Raghavan, Schütze; Cambridge University Press.
6. [PDF] Indyk & Motwani, locality-sensitive hashing paper (STOC 1998) — graphics.stanford.edu.
7. [PDF] Min-Wise Independent Permutations — Computer Science.
8. [PDF] Mining of Massive Datasets — Stanford InfoLab.
9. Min-wise independent permutations (extended abstract) — ACM.
10. Min-Wise Independent Permutations — ScienceDirect.
11. Min-Wise Independent Linear Permutations — Bohman, Cooper, et al.; Electronic Journal of Combinatorics, Vol. 7 (2000), R26.
12. [PDF] b-Bit Minwise Hashing in Practice — Microsoft.
13. [PDF] Practical Hash Functions for Similarity Estimation and Dimensionality Reduction — NIPS.
14. org.apache.spark.ml.feature.MinHashLSHModel — Apache Spark API documentation.
15. [PDF] MinHash Sketches — Edith Cohen, Jun 14, 2016.
16. [PDF] Consistent Weighted Sampling — Microsoft.
17. b-Bit Minwise Hashing — arXiv:0910.3349, Oct 18, 2009.
18. Locality sensitive hashing for the edit distance (Order MinHash, OMH).
19. In Defense of MinHash Over SimHash — arXiv:1407.4416, Jul 16, 2014.
20. [PDF] Syntactic clustering of the Web.
21. [PDF] LSH Ensemble: Internet-Scale Domain Search — VLDB Endowment.
23. [PDF] Identifying Duplicate and Contradictory Information in Wikipedia — arXiv.
24. [PDF] Maintaining k-MinHash Signatures over Fully-Dynamic Data Streams — arXiv, Mar 9, 2025.
25. Hierarchical Clustering of the SOREL Malware Corpus.
28. [PDF] Cluster selection for Clustered Federated Learning using Min-wise Independent Permutations.
29. [PDF] In Defense of MinHash Over SimHash.
30. MinHash LSH — datasketch 1.6.5 documentation.
31. [PDF] GPU-accelerated MinHash LSH deduplication — arXiv:2501.01046v3 [cs.CL], Mar 12, 2025.
32. Mash: fast genome and metagenome distance estimation using MinHash — Jun 20, 2016.
33. A Survey on Locality Sensitive Hashing Algorithms and their Applications — Apr 27, 2021.
35. Finding near-duplicate web pages: a large-scale evaluation of algorithms.
36. When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data — Sep 13, 2019.