
Jaccard index

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistical measure used to quantify the similarity between two finite sets by computing the ratio of the size of their intersection to the size of their union, formally defined as $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $A$ and $B$ are the sets and the value ranges from 0 (no similarity) to 1 (identical sets). This index was introduced in 1901 by Swiss botanist Paul Jaccard (1868–1944) under the name coefficient de communauté in his study of floral distribution across regions in the Swiss Alps and Jura Mountains, originally applied to assess the overlap in plant species presence between ecological samples. The Jaccard index has since become a foundational tool across disciplines, particularly in ecology for evaluating species similarity and co-occurrence patterns based on presence-absence data, where it helps identify biogeographic relationships without considering abundance. In computer science and data mining, it is widely employed for tasks such as clustering analysis, duplicate detection in databases, and measuring overlap in feature vectors, including applications in information retrieval, recommender systems, and graph similarity computations. Its set-based nature makes it robust for qualitative comparisons, though variants like weighted versions extend it to account for element frequencies or continuous data in fields such as bioinformatics and cheminformatics.

Introduction and Fundamentals

Definition and Formula

The Jaccard index is a measure of similarity between two sets, defined as the ratio of the size of their intersection to the size of their union. This captures the proportion of elements shared by both sets relative to the total distinct elements across them, providing a value that reflects overlap without regard to element order or multiplicity. The primary formula for sets $A$ and $B$ is

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|},$$

where $|\cdot|$ denotes cardinality. This derives directly from the inclusion–exclusion principle, as the union cardinality satisfies $|A \cup B| = |A| + |B| - |A \cap B|$, so the index can equivalently be expressed as

$$J(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}.$$

The result lies in the interval $[0, 1]$, with 0 indicating no common elements and 1 indicating identical sets. For binary vectors representing set memberships (indicator vectors where entries are 1 if an element is present and 0 otherwise), the Jaccard index extends to

$$J(\mathbf{x}, \mathbf{y}) = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n (x_i + y_i - x_i y_i)},$$

where the numerator sums positions with 1 in both vectors (intersection) and the denominator sums positions with 1 in at least one vector (union).[](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html)

To illustrate, consider sets $A = \{1, 2, 3\}$ and $B = \{2, 3, 4\}$. The intersection is $A \cap B = \{2, 3\}$ with size 2, and the union is $A \cup B = \{1, 2, 3, 4\}$ with size 4. Thus, $J(A, B) = 2 / 4 = 0.5$.
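The definitions translate directly into code; a minimal sketch in Python (function names are illustrative, not from any particular library), using the empty-set convention $J(\emptyset, \emptyset) = 1$ discussed later in this article:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; returns 1.0 for two empty sets by convention."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_binary(x: list, y: list) -> float:
    """Equivalent computation on 0/1 indicator vectors of equal length."""
    inter = sum(xi * yi for xi, yi in zip(x, y))
    union = sum(xi + yi - xi * yi for xi, yi in zip(x, y))
    return 1.0 if union == 0 else inter / union

print(jaccard({1, 2, 3}, {2, 3, 4}))               # 0.5
print(jaccard_binary([1, 1, 1, 0], [0, 1, 1, 1]))  # 0.5
```

Both forms agree on the worked example above, as the indicator-vector formula is simply the set formula written coordinate-wise.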
In binary vector form, $\mathbf{x} = [1, 1, 1, 0]$ and $\mathbf{y} = [0, 1, 1, 1]$ yield numerator $\sum_i x_i y_i = 0 + 1 + 1 + 0 = 2$ and denominator $\sum_i (x_i + y_i - x_i y_i) = 1 + 1 + 1 + 1 = 4$, confirming the same value of 0.5.[](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html)

### Historical Background

The Jaccard index originated with Swiss botanist Paul Jaccard (1868–1944), who introduced it in 1901 as a measure for comparing the presence of plant species across different ecological localities in the [Swiss Alps](/page/Swiss_Alps) and [Jura Mountains](/page/Jura_Mountains). In his seminal paper, _Distribution de la flore alpine dans le Bassin des Dranses et dans quelques régions voisines_ (Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, pp. 241–272), Jaccard applied the coefficient, termed the "coefficient of community," to floristic analysis, quantifying similarity based on shared species to assess floral distributions in regions such as the Wildhorn Massif and Trient Basin.[](https://herbmedit.org/papers/6b3e7c30-39a4-427b-9b8f-8972dc226ee8) This work marked the index's debut in ecological studies, focusing on binary presence-absence data to reveal patterns in plant communities without considering abundance. In the early 20th century, the Jaccard index found early applications in [phytosociology](/page/Phytosociology), where Jaccard and his contemporaries used it to compare floras and vegetation relevés, facilitating the classification of plant associations in alpine environments. It gained further prominence through integrations in ecological surveys, such as those by Josias Braun-Blanquet in his 1928 treatise on plant sociology, which emphasized its utility for delineating community similarities in binary formats.
The index's evolution accelerated in the mid-20th century with the advent of [numerical taxonomy](/page/Numerical_taxonomy) and [multivariate statistics](/page/Multivariate_statistics), transitioning from [ecology](/page/Ecology) to broader statistical applications. Peter H. A. Sneath and Robert R. Sokal incorporated it into their 1963 foundational text on [numerical taxonomy](/page/Numerical_taxonomy), formalizing its computation via a 2×2 contingency table for similarity assessments in biological classification. This paved the way for its adoption in [cluster analysis](/page/Cluster_analysis), as detailed in Michael R. Anderberg's 1973 monograph, which highlighted the index's effectiveness for [pattern recognition](/page/Pattern_recognition) in diverse datasets. By the 1970s and 1980s, amid growing computational capabilities, the Jaccard index integrated into similarity matrices for [multivariate statistics](/page/Multivariate_statistics), influencing early [data mining](/page/Data_mining) and [computer science](/page/Computer_science) applications in fields like [information retrieval](/page/Information_retrieval) and [pattern matching](/page/Pattern_matching).

## Core Properties and Interpretations

### Mathematical Properties

The Jaccard distance, defined as $d(A, B) = 1 - J(A, B)$, where $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, satisfies the properties of a metric on the power set of a finite universe.
Non-negativity holds because $0 \leq J(A, B) \leq 1$, implying $d(A, B) \geq 0$, with equality if and only if $A = B$.[](https://arxiv.org/pdf/1612.02696) Symmetry follows directly from the commutativity of set intersection and union, so $d(A, B) = d(B, A)$.[](https://arxiv.org/pdf/1612.02696) The triangle inequality $d(A, B) \leq d(A, C) + d(C, B)$ is established using the submodularity and monotonicity of the cardinality function: for sets $A, B, C$, it relies on the inequality $|A \cap C| \cdot |B \cup C| + |A \cup C| \cdot |B \cap C| \leq |C| \cdot (|A| + |B|)$, which ensures the required bound after algebraic manipulation.[](https://arxiv.org/pdf/1612.02696)

The Jaccard index exhibits idempotence, meaning $J(A, A) = 1$ for any set $A$, as the intersection and union coincide. Regarding monotonicity with respect to set inclusions, the index is non-decreasing when elements are added simultaneously to both sets in a way that expands the intersection. Specifically, if $A \subseteq C$ and $B \subseteq D$ such that $C \setminus A \subseteq B$ and $D \setminus B \subseteq A$ (i.e., added elements contribute to the intersection), then $J(A, B) \leq J(C, D)$.[](https://www.sciencedirect.com/science/article/abs/pii/S0020025516304376)

A useful algebraic decomposition expresses the Jaccard index in terms of relative set sizes:

$$J(A, B) = \frac{1}{\frac{|A|}{|A \cap B|} + \frac{|B|}{|A \cap B|} - 1},$$

derived from substituting $|A \cup B| = |A| + |B| - |A \cap B|$ into the standard formula, emphasizing how similarity diminishes with disproportionate growth in individual set sizes beyond the overlap.
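These metric properties can be checked exhaustively on a small universe; the brute-force sketch below (written for this article, not taken from any library) verifies the triangle inequality for the Jaccard distance over all subset triples of a four-element universe:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index with the convention J(∅, ∅) = 1."""
    return 1.0 if not (a | b) else len(a & b) / len(a | b)

def d(a, b):
    """Jaccard distance 1 - J(A, B)."""
    return 1.0 - jaccard(a, b)

universe = {1, 2, 3, 4}
subsets = [set(c) for r in range(5) for c in combinations(universe, r)]

# Triangle inequality: d(A, B) <= d(A, C) + d(C, B) for every triple,
# with a tiny tolerance for floating-point rounding.
ok = all(d(a, b) <= d(a, c) + d(c, b) + 1e-12
         for a in subsets for b in subsets for c in subsets)
print(ok)  # True
```

Enumerating all $16^3 = 4096$ triples is instant at this scale; a check like this is no substitute for the proof cited above, but it is a quick sanity test when experimenting with modified similarity measures.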
For edge cases involving empty sets, the index is conventionally defined to avoid [division by zero](/page/Division_by_zero): $J(\emptyset, \emptyset) = 1$ (perfect similarity, as the sets are identical) and $J(\emptyset, B) = 0$ for any non-empty $B$ (no overlap).[](https://arxiv.org/pdf/1612.02696)

### Range and Normalization

The Jaccard index, denoted as $J(A, B)$, takes values in the closed [interval](/page/Interval) $[0, 1]$, where $J(A, B) = 0$ signifies that sets $A$ and $B$ are disjoint (no common elements), and $J(A, B) = 1$ signifies that the sets are identical (perfect overlap). This bounded [range](/page/Range) arises directly from the ratio of [intersection](/page/Intersection) size to [union](/page/Union) size, ensuring the measure is scale-invariant and comparable across different set pairs. As a similarity coefficient, the Jaccard index is inherently normalized within $[0, 1]$, eliminating the need for additional scaling in most applications. It can be converted to a percentage scale by multiplying by 100, which aids in intuitive reporting, such as expressing 0.75 as 75% similarity. Probabilistically, the index represents the probability that an element chosen uniformly at random from the union of $A$ and $B$ lies in their intersection, providing a natural interpretive framework for random sampling scenarios.
For empty sets, the index is conventionally defined as $ J(\emptyset, \emptyset) = 1 $ (identical sets) and $ J(\emptyset, B) = 0 $ for non-empty $ B $ (complete dissimilarity), avoiding issues with [division by zero](/page/Division_by_zero) in the formula.[](https://arxiv.org/pdf/1612.02696) The complementary Jaccard distance, $ d_J(A, B) = 1 - J(A, B) $, also lies in $[0, 1]$, with 0 for identical sets and 1 for [disjoint sets](/page/Disjoint_sets); it qualifies as a proper [metric space](/page/Metric_space) distance, satisfying non-negativity, symmetry, and the [triangle inequality](/page/Triangle_inequality).[](https://arxiv.org/abs/1612.02696) The index exhibits invariance under permutations of elements in the universal set, as its computation relies solely on set membership overlaps rather than ordering.[](https://nsojournals.onlinelibrary.wiley.com/doi/full/10.1002/ecog.07737) Nonetheless, it remains sensitive to set sizes, where disproportionate growth in [union](/page/Union) size (due to differing cardinalities) can diminish the index even with fixed [intersection](/page/Intersection) sizes. 
## Variants for [Binary Data](/page/Binary_data)

### Asymmetric Binary Similarity

In the context of asymmetric [binary data](/page/Binary_data), the Jaccard index measures similarity by emphasizing the presence of attributes while disregarding cases where both attributes are absent, as absences often provide limited informative value.[](https://www.researchgate.net/publication/266496381_A_Survey_of_Binary_Similarity_and_Distance_Measures) This approach treats binary vectors where the state "1" (presence) carries more significance than "0" (absence), making it ideal for scenarios like ecological surveys or sparse representations.[](https://www.sciencedirect.com/topics/computer-science/jaccard-coefficient) For two binary vectors $\mathbf{x}$ and $\mathbf{y}$, the Jaccard similarity is defined as

$$J(\mathbf{x}, \mathbf{y}) = \frac{a}{a + b + c},$$

where $a$ is the number of attributes present in both vectors (shared 1s), $b$ is the number present only in $\mathbf{x}$ (1 in $\mathbf{x}$, 0 in $\mathbf{y}$), and $c$ is the number present only in $\mathbf{y}$ (0 in $\mathbf{x}$, 1 in $\mathbf{y}$). The 0-0 matches (denoted $d$) are excluded from the denominator, as they do not contribute to assessing overlap in positive attributes.[](https://www.researchgate.net/publication/266496381_A_Survey_of_Binary_Similarity_and_Distance_Measures) This formulation is particularly rational for sparse datasets, where most attributes are absent across observations; including 0-0 matches would artificially inflate similarity scores and dilute the focus on actual shared features.
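Counting $a$, $b$, and $c$ (while ignoring $d$) is a single pass over a pair of 0/1 vectors; a minimal sketch, with an illustrative function name:

```python
def asymmetric_jaccard(x, y):
    """Jaccard similarity a / (a + b + c) for binary vectors; 0-0 matches are ignored."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # shared presences
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # only in x
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # only in y
    denom = a + b + c
    return 1.0 if denom == 0 else a / denom

# Two presence-absence profiles over four attributes.
print(asymmetric_jaccard([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.3333...
```

Note that all-zero pairs fall through to the $denom = 0$ branch, mirroring the convention that two empty presence profiles are treated as identical.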
In [ecology](/page/Ecology), it originated from Paul Jaccard's analysis of floral community distributions, where it quantified similarity based on [species](/page/Species) presences rather than absences, which may result from sampling limitations rather than true ecological absence.[](https://www.e-periodica.ch/digbib/view?pid=bsv-002:1901:37::547) Similarly, in document-term matrices for [information retrieval](/page/Information_retrieval), the index effectively captures term overlap in sparse binary representations of texts, where the absence of a term in a document is rarely diagnostic of dissimilarity.[](https://www.sciencedirect.com/topics/computer-science/jaccard-coefficient)

Consider an example with two ecological samples assessed for the presence (1) or absence (0) of four species (A, B, C, D):

| Species | Sample 1 | Sample 2 |
|---------|----------|----------|
| A       | 1        | 1        |
| B       | 1        | 0        |
| C       | 0        | 1        |
| D       | 0        | 0        |

Here, $a = 1$ (species A present in both), $b = 1$ (species B present only in Sample 1), $c = 1$ (species C present only in Sample 2), and $d = 1$ (species D absent in both, ignored).
Thus, $J = \frac{1}{1 + 1 + 1} = \frac{1}{3} \approx 0.333$, indicating moderate similarity driven by the single shared presence relative to the total unique and shared presences.[](https://www.researchgate.net/publication/266496381_A_Survey_of_Binary_Similarity_and_Distance_Measures)

### Comparison to Simple Matching Coefficient

The [simple matching coefficient](/page/Simple_matching_coefficient) (SMC), also known as the Sokal–Michener coefficient, measures the similarity between two binary vectors by calculating the proportion of attributes that match, including both positive matches (where both vectors have a 1) and negative matches (where both have a 0).[](https://jcsites.juniata.edu/faculty/rhodes/ml/simdissim.htm) Its formula is

$$\text{SMC}(x, y) = \frac{a + d}{a + b + c + d},$$

where $a$ is the number of attributes where both vectors are 1, $b$ where $x$ is 1 and $y$ is 0, $c$ where $x$ is 0 and $y$ is 1, and $d$ where both are 0.[](https://www.jstor.org/stable/1302424)

In contrast to the Jaccard index, which ignores negative matches ($d$) and focuses solely on the union of positive attributes, the SMC treats presences and absences symmetrically by including $d$ in both the numerator and denominator.[](https://jcsites.juniata.edu/faculty/rhodes/ml/simdissim.htm) This makes the Jaccard index stricter for asymmetric binary data, such as sets where absences are not informative (e.g., species presence in ecological samples), while the SMC can inflate similarity scores when many attributes are absent in both vectors, potentially overemphasizing agreement on non-occurrences.[](https://www.jstor.org/stable/1302424) The Jaccard index formula is $J(x, y) = \frac{a}{a + b + c}$, highlighting that it penalizes mismatches in presences more heavily than the SMC.[](https://jcsites.juniata.edu/faculty/rhodes/ml/simdissim.htm)

An exact algebraic relationship between the two measures is $\text{SMC}(x, y) = J(x, y) \cdot u + (1 - u)$, where $u = \frac{a + b + c}{a + b + c + d}$ represents the proportion of attributes with at least one presence; this shows how the SMC adjusts the Jaccard value upward by incorporating the weight of shared absences. When there are no shared absences ($d = 0$), the SMC equals the Jaccard index exactly.[](https://www.jstor.org/stable/1302424)

For illustration, consider two binary vectors $x = (1, 0, 1, 1, 0, 1, 0, 0, 0, 1)$ and $y = (1, 1, 0, 1, 0, 0, 0, 0, 1, 1)$, yielding $a = 3$, $b = 2$, $c = 2$, and $d = 3$. Here, $J(x, y) = \frac{3}{3 + 2 + 2} \approx 0.429$, but $\text{SMC}(x, y) = \frac{3 + 3}{10} = 0.6$, demonstrating how the SMC's inclusion of $d$ increases the similarity estimate due to multiple 0-0 matches.

In clustering contexts, the SMC is sometimes referred to as the [Rand index](/page/Rand_index) when evaluating pairwise agreements between clusterings, though it originates as a general binary attribute [similarity measure](/page/Similarity_measure). This equivalence underscores its symmetric treatment but limits its applicability in scenarios where absences should not contribute to similarity, favoring the Jaccard index for such cases.[](https://www.jstor.org/stable/1302424)

## Generalized and Weighted Forms

### Weighted Jaccard Measures

The standard Jaccard index treats all elements in sets equally, assigning a uniform weight of 1 to present elements and 0 to absent ones; however, this assumption fails in scenarios involving weighted graphs, multisets, or attribute-valued data where elements carry varying degrees of importance, such as edge weights in graphs or term frequencies in documents.[](https://www.cs.cornell.edu/courses/cs4300/2015sp/Slides/lecture3/vsm_cheatsheet.pdf) The weighted Jaccard measure addresses this limitation by incorporating element-specific weights, enabling a prioritized assessment of similarity that better captures relative contributions in applications like [information retrieval](/page/Information_retrieval) and [machine learning](/page/Machine_learning) classification. The formula for the weighted Jaccard similarity between two sets $A$ and $B$, with non-negative weight functions $w_A(i)$ and $w_B(i)$ defined over a universe of elements $i$, is

$$J_w(A, B) = \frac{\sum_i \min(w_A(i), w_B(i))}{\sum_i \max(w_A(i), w_B(i))}.$$

This yields a value between 0 and 1, where 1 indicates identical weighted sets and 0 indicates no overlap in weighted elements. As a baseline, the standard unweighted Jaccard emerges when weights are binary (1 for presence, 0 for absence).[](https://www.cs.cornell.edu/courses/cs4300/2015sp/Slides/lecture3/vsm_cheatsheet.pdf) This formulation derives from a natural generalization of set intersection and union to weighted contexts: the numerator represents a weighted intersection, aggregating the minimum weights across elements to quantify shared contribution without double-counting dominance; the denominator captures a weighted union via maximum weights, reflecting total coverage across the sets.[](https://arxiv.org/pdf/2110.09619) Such a min-max approach ensures the measure remains normalized and interpretable, analogous to the unweighted ratio of intersection cardinality to union cardinality, while handling continuous or discrete weights seamlessly.[](https://arxiv.org/pdf/2110.09619)

A representative example arises in [information retrieval](/page/Information_retrieval), where documents are modeled as weighted sets of terms with frequencies as weights. Consider two documents: Document A with term frequencies $\{\text{apple}: 2, \text{banana}: 1\}$, and Document B with $\{\text{apple}: 1, \text{banana}: 3, \text{cherry}: 1\}$. The weighted intersection sums the minima: $\min(2,1) + \min(1,3) + \min(0,1) = 1 + 1 + 0 = 2$. The weighted union sums the maxima: $\max(2,1) + \max(1,3) + \max(0,1) = 2 + 3 + 1 = 6$.
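Carried out in code, with documents represented as plain dictionaries mapping terms to weights (a sketch written for this example, not a library API):

```python
def weighted_jaccard(wa: dict, wb: dict) -> float:
    """Weighted Jaccard: sum of element-wise minima over sum of maxima."""
    keys = set(wa) | set(wb)
    num = sum(min(wa.get(k, 0), wb.get(k, 0)) for k in keys)
    den = sum(max(wa.get(k, 0), wb.get(k, 0)) for k in keys)
    return 1.0 if den == 0 else num / den

doc_a = {"apple": 2, "banana": 1}
doc_b = {"apple": 1, "banana": 3, "cherry": 1}
print(weighted_jaccard(doc_a, doc_b))  # 2/6 ≈ 0.333
```

With binary weights (every value 0 or 1) this reduces to the standard index, matching the baseline noted above.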
Thus, $J_w(A, B) = 2/6 \approx 0.333$, indicating moderate similarity driven by overlapping terms but tempered by differing frequencies and the unique term in B.[](https://www.cs.cornell.edu/courses/cs4300/2015sp/Slides/lecture3/vsm_cheatsheet.pdf) In recommendation systems, weighted Jaccard measures have been adapted since the early [2000s](/page/2000s) to evaluate overlaps in user-item interactions, where weights might represent ratings or interaction strengths, facilitating personalized suggestions through similarity in weighted preference sets.[](https://ieeexplore.ieee.org/document/10931083)

### Tanimoto Similarity and Distance

The Tanimoto similarity coefficient extends the Jaccard index to real-valued vectors $\mathbf{x}$ and $\mathbf{y}$ in a finite-dimensional space, defined as

$$T(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - \mathbf{x} \cdot \mathbf{y}},$$

where $\mathbf{x} \cdot \mathbf{y}$ denotes the dot product and $\|\cdot\|^2$ the squared Euclidean norm.[](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-015-0069-3) For binary vectors, where components are 0 or 1, this formula simplifies exactly to the Jaccard similarity, measuring the ratio of intersecting 1s to the union of 1s across the vectors.[](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-015-0069-3) This vector form enables its application beyond sets to continuous or weighted representations in pattern recognition and data analysis.[](https://pubs.acs.org/doi/10.1021/ci9800211)

The coefficient was introduced by T. T. Tanimoto in 1958 within an elementary mathematical framework for [classification](/page/Classification) and prediction of patterns, originally developed at [IBM](/page/IBM) to quantify similarity between attribute profiles.[](https://link.springer.com/article/10.1023/A:1019154432472) A common distance variant is the Tanimoto distance, defined as $D(\mathbf{x}, \mathbf{y}) = 1 - T(\mathbf{x}, \mathbf{y})$, which ranges from 0 (identical vectors) to 1 (no overlap in support).[](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-015-0069-3) In cheminformatics, the Tanimoto coefficient serves as the standard metric for assessing molecular similarity using binary fingerprints, such as Daylight fingerprints, where it equates to the Jaccard index for bit-string representations of substructural features; for instance, two compounds with fingerprints sharing 5 out of 10 active bits in their union yield $T = 0.5$.[](https://www.daylight.com/dayhtml/doc/theory/theory.finger.html) This application became prevalent in the [1980s](/page/1980s) with the rise of structural databases and fingerprint methods, facilitating [virtual screening](/page/Virtual_screening) in [drug discovery](/page/Drug_discovery).[](https://pubs.acs.org/doi/10.1021/ci9800211)

## Probabilistic and Advanced Extensions

### Probability Jaccard Similarity

The probability Jaccard similarity extends the standard Jaccard index to compare discrete probability distributions over a universe of elements, providing a scale-invariant measure suitable for uncertain or weighted data, such as normalized term frequencies in documents or probabilistic set memberships.
Introduced by Moulton and Jiang in 2018, it addresses scenarios where exact set representations are unavailable or approximate, such as in [streaming data](/page/Streaming_data) or fuzzy matching.[](https://arxiv.org/pdf/1809.04052) Formally, for probability vectors $x$ and $y$ (non-negative vectors summing to 1), the probability Jaccard similarity is defined as

$$J_{\mathcal{P}}(x, y) = \sum_{i : x_i > 0,\; y_i > 0} \frac{1}{\sum_j \max\left( \frac{x_j}{x_i}, \frac{y_j}{y_i} \right)},$$

where the sum is over elements $i$ with positive probability in both distributions. This formula ensures $J_{\mathcal{P}}(x, y) = 1$ if $x = y$, and for indicator vectors of crisp sets (uniform probability $1/|\text{set}|$ over members), it reduces to the standard Jaccard index. Unlike simpler weighted variants like the fuzzy Jaccard $\sum_i \min(x_i, y_i) / \sum_i \max(x_i, y_i)$, the probability Jaccard is invariant to scaling of the weights, making it appropriate for relative frequencies.[](https://arxiv.org/pdf/1809.04052)

Computing $J_{\mathcal{P}}$ exactly requires full knowledge of the distributions, which is challenging for large universes. Approximations use [locality-sensitive hashing](/page/Locality-sensitive_hashing) (LSH) schemes tailored to probabilistic data, such as P-MinHash (or ProbMinHash), which generate signatures where the probability of hash collisions equals $J_{\mathcal{P}}(x, y)$. These methods draw hash values from exponential distributions scaled by the probabilities, enabling unbiased estimation via averaging over multiple hashes: the estimator $\hat{J}_{\mathcal{P}} = \frac{1}{k} \sum_{r=1}^k \mathbf{1}_{h_r(x) = h_r(y)}$ satisfies $\mathbb{E}[\hat{J}_{\mathcal{P}}] = J_{\mathcal{P}}(x, y)$, with variance bounded by $J_{\mathcal{P}}(1 - J_{\mathcal{P}})/k$.
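A toy version of this estimator can be written in a few lines: for each of $k$ seeds, draw a deterministic exponential variate per element, scale it by the element's probability, and record the argmin. This is a sketch for illustration only (production ProbMinHash implementations use faster hashing and early-stopping optimizations); the helper names are invented here:

```python
import hashlib
import math

def _uniform(seed: int, elem: str) -> float:
    """Deterministic pseudo-uniform value in (0, 1) derived from a seed and an element."""
    h = hashlib.sha256(f"{seed}:{elem}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)

def pminhash_signature(dist: dict, k: int) -> list:
    """For each seed, the element minimizing Exp(1)/p_i; collision prob equals J_P."""
    return [min(dist, key=lambda e: -math.log(_uniform(seed, e)) / dist[e])
            for seed in range(k)]

def estimate_jp(x: dict, y: dict, k: int = 2000) -> float:
    """Fraction of matching signature slots: the collision-based estimator."""
    sx, sy = pminhash_signature(x, k), pminhash_signature(y, k)
    return sum(a == b for a, b in zip(sx, sy)) / k

def exact_jp(x: dict, y: dict) -> float:
    """Direct evaluation of the probability Jaccard formula."""
    support = set(x) | set(y)
    return sum(1.0 / sum(max(x.get(j, 0) / x[i], y.get(j, 0) / y[i]) for j in support)
               for i in set(x) & set(y))

x = {"a": 0.5, "b": 0.3, "c": 0.2}
y = {"a": 0.4, "b": 0.4, "d": 0.2}
print(round(exact_jp(x, y), 3), round(estimate_jp(x, y), 3))
```

Because the exponentials for a given element are shared between the two signatures (same seed, same element), the probability that the argmins agree is exactly $J_{\mathcal{P}}(x, y)$, and with $k = 2000$ the estimate typically lands within a few hundredths of the exact value.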
For [cardinality](/page/Cardinality) estimation in unions, adaptations of [HyperLogLog](/page/HyperLogLog) can complement these sketches, but LSH-based approaches are preferred for direct similarity estimation.[](https://arxiv.org/abs/1809.04052)[](https://arxiv.org/abs/1911.00675) For example, in web-scale document similarity, documents can be represented as probability distributions over [shingles](/page/Shingles) (n-grams), where probabilities reflect normalized frequencies. P-MinHash sketches produce compact signatures for fast estimation of $J_{\mathcal{P}}$, extending the classic [MinHash](/page/MinHash) approach for crisp sets pioneered by Broder et al. to handle uncertainty in term weights. This is useful in search engines for detecting near-duplicates under noisy or probabilistic data.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)[](https://arxiv.org/abs/1911.00675)

### Optimality and Theoretical Justification

P-MinHash and related LSH schemes provide unbiased estimators for the probability Jaccard similarity, with the expected value of the collision-based estimator equaling $J_{\mathcal{P}}(x, y)$. This unbiasedness holds under the hashing model's assumptions, where hash collisions directly reflect the scale-invariant overlap in distributions.[](https://arxiv.org/abs/1809.04052) The probability Jaccard exhibits strong theoretical properties, including Pareto optimality among sampling-based similarity measures for probability distributions: no alternative sampling method can improve estimation accuracy for all pairs without worsening it for some, particularly in high-dimensional or sparse settings where uniform sampling (as in standard MinHash) underperforms. This optimality is proven via analysis of sensitivity to distributional overlaps and resistance to false alignments.
The estimator's standard deviation scales as $1/\sqrt{k}$, where $k$ is the number of hash functions, with variance $J_{\mathcal{P}}(1 - J_{\mathcal{P}})/k$, allowing reliable approximations with modest $k$ even in uncertain environments.[](https://arxiv.org/abs/1809.04052) For computational efficiency in dynamic settings, such as turnstile streaming models (where set memberships update incrementally), the framework achieves optimal $(1 \pm \varepsilon)$-approximations of $J_{\mathcal{P}}$ using $O(\varepsilon^{-2} \log |U|)$ space and $O(1)$ update time per operation, matching information-theoretic lower bounds. This makes it preferable for [real-time](/page/Real-time) applications like database querying or large-scale recommendation systems over methods requiring full materialization. These results were established in the late 2010s literature on streaming algorithms.[](https://arxiv.org/abs/1911.00675)[](https://arxiv.org/abs/1605.03949)

## Applications in Data Analysis

### Role in Binary Classification

The Jaccard index serves as a performance metric in [binary classification](/page/Binary_classification) by measuring the similarity between the set of predicted positives and the set of actual positives, effectively capturing the overlap in positive predictions without considering true negatives. In the context of a [confusion matrix](/page/Confusion_matrix), it is defined as

$$J = \frac{TP}{TP + FP + FN},$$

where $TP$ represents true positives (correctly predicted positives), $FP$ false positives (incorrectly predicted positives), and $FN$ false negatives (missed positives).
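This confusion-matrix form can be confirmed against the set definition in a few lines (the item ids and counts here are made up for illustration):

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Hypothetical item ids: which items a classifier flagged positive, and which truly are.
predicted = {1, 2, 3, 4, 5}
actual    = {3, 4, 5, 6}

tp = len(predicted & actual)   # 3: flagged and truly positive
fp = len(predicted - actual)   # 2: flagged but negative
fn = len(actual - predicted)   # 1: positive but missed

assert jaccard(predicted, actual) == tp / (tp + fp + fn)
print(tp / (tp + fp + fn))  # 0.5
```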
This expression corresponds to the size of the [intersection](/page/Intersection) divided by the size of the [union](/page/Union) of the predicted and actual positive sets, providing a direct measure of set overlap.[](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) The Jaccard index relates to [precision](/page/Precision) $P = \frac{TP}{TP + FP}$ and recall $R = \frac{TP}{TP + FN}$ through the formula

$$J = \frac{P \cdot R}{P + R - P \cdot R}.$$

To verify, substitute the definitions: the numerator is $P \cdot R = \frac{TP^2}{(TP + FP)(TP + FN)}$, and the denominator is $P + R - P \cdot R = \frac{TP(TP + FN) + TP(TP + FP) - TP^2}{(TP + FP)(TP + FN)} = \frac{TP\,(TP + FP + FN)}{(TP + FP)(TP + FN)}$. Dividing the two recovers $J = \frac{TP}{TP + FP + FN}$, confirming equivalence. This relation highlights the Jaccard index as a balanced compromise between [precision and recall](/page/Precision_and_recall), distinct from the [harmonic mean](/page/Harmonic_mean) used in the F1 score.[](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html)

In information retrieval, the Jaccard index evaluates ranking quality under binary relevance by assessing the overlap between the retrieved document set and the relevant document set, offering a set-based measure of retrieval effectiveness.[](https://web.stanford.edu/class/cs276/19handouts/lecture6-tfidf-6per.pdf) A specific application appears in computer vision object detection since the mid-2000s, notably in the PASCAL VOC challenges starting in 2007, where it is known as the Intersection over Union (IoU) and quantifies the spatial overlap between predicted and ground-truth bounding boxes to determine detection accuracy.[](https://giou.stanford.edu/) Key advantages of the Jaccard index in [binary classification](/page/Binary_classification)
include its operation on hard binary predictions, making it independent of probabilistic thresholds, and its exclusive focus on positive overlap, which ignores true negatives and thus performs well in imbalanced scenarios where negatives vastly outnumber positives.[](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html)

### Uses in [Machine Learning](/page/Machine_learning) and Beyond

In [machine learning](/page/Machine_learning), the Jaccard index serves as a key [similarity measure](/page/Similarity_measure) in clustering algorithms, particularly for binary or set-based data. For instance, in [hierarchical clustering](/page/Hierarchical_clustering), it is employed as a linkage criterion to compute distances between clusters by assessing the overlap of their constituent sets, enabling effective grouping of categorical or presence-absence features.[](https://arxiv.org/abs/1507.03340) Additionally, during [feature selection](/page/Feature_selection), the index quantifies set overlaps to evaluate the stability and relevance of selected attributes, helping to identify robust subsets that minimize [redundancy](/page/Redundancy) across iterations of the selection process.[](https://jmlr.csail.mit.edu/papers/volume18/17-514/17-514.pdf)

The Jaccard index also finds application in recommendation systems, where it measures user-item similarity based on the overlap of [transaction](/page/Transaction) or [preference](/page/Preference) sets. In [collaborative filtering](/page/Collaborative_filtering) approaches, such as those modeling user interactions with movies or products, it computes the intersection of items rated or purchased by users relative to their union, facilitating personalized suggestions in systems akin to those handling large-scale [streaming data](/page/Streaming_data).
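A neighborhood-based recommender of this kind reduces to ranking other users by set overlap; a minimal sketch with made-up interaction data:

```python
# Hypothetical user → interacted-item sets (e.g., movies watched).
baskets = {
    "u1": {"m1", "m2", "m3"},
    "u2": {"m2", "m3", "m4"},
    "u3": {"m7", "m8"},
}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 1.0

def most_similar(user: str) -> str:
    """The other user whose item set has the highest Jaccard overlap."""
    others = (u for u in baskets if u != user)
    return max(others, key=lambda u: jaccard(baskets[user], baskets[u]))

print(most_similar("u1"))  # u2 (overlap {m2, m3} gives J = 2/4)
```

Items favored by the nearest neighbor but not yet seen by the target user (here, "m4") would then be candidate recommendations.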
Beyond [machine learning](/page/Machine_learning), the Jaccard index is widely used in [ecology](/page/Ecology) to analyze species co-occurrence patterns, quantifying the similarity between communities as the proportion of shared [species](/page/Species) relative to the total unique species across sites.[](https://www.science.org/doi/10.1126/sciadv.abj9204) In [genomics](/page/Genomics), it assesses gene set similarity, such as the overlap between functional modules or genetic variants, aiding the identification of conserved biological pathways or pairwise alignments in large sequencing datasets.[](https://pmc.ncbi.nlm.nih.gov/articles/PMC3959882/) Similarly, in [social network analysis](/page/Social_network_analysis), the index evaluates node similarity through common connections, such as shared friends, to predict links or detect community structures in graphs representing [interpersonal ties](/page/Interpersonal_ties).[](https://www.cs.uic.edu/~elena/pubs/zheleva-kdd08.pdf) A notable application is duplicate detection and [record linkage](/page/Record_linkage), where Jaccard thresholds have been applied since the 1990s to resolve entities by comparing attribute sets, such as names or addresses, in databases so that duplicates are merged while unique records are preserved.[](https://www.researchgate.net/publication/3297646_Duplicate_Record_Detection_A_Survey) For efficient computation on large datasets, implementations leverage inverted indices to accelerate Jaccard similarity joins, partitioning sets and pruning non-overlapping candidates so that the join scales to billions of elements without exhaustive pairwise comparisons.[](http://www.vldb.org/pvldb/vol9/p360-deng.pdf) In recent [natural language processing](/page/Natural_language_processing) work, particularly in the transformer era since 2020, the index complements embedding-based methods to measure text chunk similarity, such as overlapping n-grams in generated or summarized content, supporting tasks like [plagiarism](/page/Plagiarism) detection and response evaluation.[](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-84.pdf)
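All of the set-based applications above reduce to the same computation, J(A, B) = |A ∩ B| / |A ∪ B|. The following minimal Python sketch applies it to two of the use cases mentioned: word-bigram overlap between texts and shared-neighbour similarity between nodes (the example strings and the friendship graph are invented for illustration):

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def word_ngrams(text, n=2):
    """Set of word n-grams of a text, a common unit for measuring text overlap."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

# Text similarity via overlapping bigrams (one word differs between the texts)
s1 = "the quick brown fox jumps over the lazy dog"
s2 = "the quick brown fox leaps over the lazy dog"
text_sim = jaccard(word_ngrams(s1), word_ngrams(s2))  # 6 shared of 10 distinct bigrams

# Node similarity via shared neighbours in a hypothetical friendship graph
friends = {"ann": {"bob", "carl", "dana"}, "eve": {"bob", "dana", "fred"}}
link_score = jaccard(friends["ann"], friends["eve"])  # {bob, dana} of 4 distinct friends
```

Note the convention chosen for two empty sets: the ratio 0/0 is undefined, so implementations must pick a value (scikit-learn's `jaccard_score`, for instance, exposes a `zero_division` parameter for this case).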

References

  1. [1]
    [PDF] An Analytical Approach to the Jaccard Similarity Index - arXiv
    Oct 21, 2024 · The Jaccard similarity index has often been employed in science and technology as a means to quantify the similarity between two sets.
  2. [2]
    [PDF] Further Generalizations of the Jaccard Index - arXiv
    Nov 18, 2021 · We start by providing a brief historic perspective of the Jaccard index, which was introduced in 1901 by Paul Jaccard (1868–1944) under ...
  3. [3]
    Jaccard/Tanimoto similarity test and estimation methods for ... - NIH
    The Jaccard/Tanimoto coefficient measuring similarity between two species has long been used to evaluate co-occurrences between species or between biogeographic ...
  4. [4]
    Jaccard Coefficient - an overview | ScienceDirect Topics
    The Jaccard coefficient, also known as the Jaccard Index or intersection over union (IoU), is defined as the rate of overlap between predicted masks and ...
  5. [5]
    jaccard_score — scikit-learn 1.7.1 documentation
    For the binary case, setting average='binary' will return the Jaccard similarity coefficient for pos_label . If average is not 'binary' , pos_label is ignored ...
  6. [6]
    [PDF] A note on the triangle inequality for the Jaccard distance - arXiv
    Dec 8, 2016 · The Jaccard distance Jδ is known to fulfill all properties of a metric, most notably, the triangle inequality—a fact that has been observed ...
  7. [7]
    [PDF] arXiv:2202.00913v2 [stat.ME] 8 Jul 2022
    Jul 8, 2022 · The Jaccard similarity equals one if the two sets are equal, zero if they are disjoint and takes a value in (0, 1) otherwise.
  8. [8]
    Mathematical properties of soft cardinality: Enhancing Jaccard, Dice ...
    In this paper, a formal definition of soft cardinality is proposed together with an analysis of its boundaries, monotonicity property and a method for ...
  9. [9]
    Is the Jaccard distance a distance? - MathOverflow
    Mar 13, 2010 · The triangle inequality for the Jaccard metric can be written as a rational inequality in the joint membership counts of the three sets.
  10. [10]
    [1612.02696] A note on the triangle inequality for the Jaccard distance
    Dec 8, 2016 · Two simple proofs of the triangle inequality for the Jaccard distance in terms of nonnegative, monotone, submodular functions are given and discussed.
  11. [11]
    Jaccard dissimilarity in stochastic community models based on the ...
    Jun 4, 2025 · This halving occurs because JD is invariant under permutations of site labels within ... The probabilistic basis of Jaccard's index of similarity.
  12. [12]
    A Survey of Binary Similarity and Distance Measures - ResearchGate
    Aug 9, 2025 · Jaccard distance was used to calculate similarity between response profiles, as this measure emphasizes shared features (Choi et al., 2010) .
  13. [13]
  14. [14]
    Similarity and Dissimilarity
    Jun 23, 2021 · Simple Matching and Jaccard Coefficients. Simple Matching Coef. SMC is measuring similarity in a symmetric setting, i.e., both 0 and 1 matches.
  15. [15]
  16. [16]
    1(b).2.1: Measures of Similarity and Dissimilarity - STAT ONLINE
    Calculate the Simple matching coefficient and the Jaccard coefficient. Simple matching coefficient = (0 + 7) / (0 + 1 + 2 + 7) = 0.7. Jaccard coefficient = 0 / ...
  17. [17]
    [PDF] Language and Information Vector Space Model Cheatsheet
    Generalized Jaccard: sim(q, d_j) = Σ_k min(w_{k,j}, w_{k,q}) / Σ_k max(w_{k,j}, w_{k,q}). Using binary weights, this becomes equivalent to the ...
  18. [18]
  19. [19]
    Why is Tanimoto index an appropriate choice for fingerprint-based ...
    May 20, 2015 · The Tanimoto index, Dice index, Cosine coefficient and Soergel distance were identified to be the best (and in some sense equivalent) metrics for similarity ...Missing: Toshio 1958
  20. [20]
    Chemical Similarity Searching - ACS Publications
    This paper reviews the use of similarity searching in chemical databases. It begins by introducing the concept of similarity searching.
  21. [21]
    A proof of the triangle inequality for the Tanimoto distance
    T.T. Tanimoto, An elementary mathematical theory of classification and prediction, IBM Report (November, 1958), cited in: G. Salton, Automatic Information ...
  22. [22]
    Daylight Theory: Fingerprints
    Note this differs from the Tanimoto index in which the similarity between two objects is estimated. This inherent asymmetry means that the Tversky index is ...
  23. [23]
    [PDF] On the resemblance and containment of documents - cs.Princeton
    On the resemblance and containment of documents. Andrei Z. Broder digital Systems Research Center. 130 Lytton Avenue, Palo Alto, CA 94301, USA broder@pa.dec.com.
  24. [24]
    Summary of Expected Jaccard/IoU for Probabilistic Predictions
  25. [25]
    [PDF] Combining the Best of Graphical Models and ConvNets for Semantic ...
    Dec 16, 2014 · We also considered expected intersection, expected union, expected intersection over expected union, and expected union over expected ...
  26. [26]
    Maximally Consistent Sampling and the Jaccard Index of Probability ...
    Sep 11, 2018 · Abstract page for arXiv paper 1809.04052: Maximally Consistent Sampling and the Jaccard Index of Probability Distributions.
  27. [27]
    A Class of Locality-Sensitive Hash Algorithms for the (Probability ...
    Nov 2, 2019 · ProbMinHash -- A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity. Authors:Otmar Ertl.
  28. [28]
  29. [29]
    [PDF] Scoring, Term Weighting and the - Information Retrieval
    How can we rank-order the documents in the collection with respect to a query? Assign a score, say in [0, 1], to each document. This score measures how ...
  30. [30]
    Generalized Intersection over Union
    Intersection over Union (IoU), also known as the Jaccard index, is the most popular evaluation metric for tasks such as segmentation, object detection and ...
  31. [31]
    [1507.03340] Quantitative Evaluation of Performance and Validity ...
    Jul 13, 2015 · This paper describes various validity and accuracy measures including Dunn's Index, Davies Bouldin Index, C Index, Rand Index, Jaccard Index, Silhouette Index, ...
  32. [32]
    [PDF] On the Stability of Feature Selection Algorithms
    Feature Selection is central to modern data science, from exploratory data analysis to predictive model-building. The “stability” of a feature selection ...
  33. [33]
    A better index for analysis of co-occurrence and similarity - Science
    Jan 26, 2022 · Published datasets reanalyzed with both α and Jaccard's index (J) yield profoundly different biological inferences. For example, a published ...
  34. [34]
    Using networks to measure similarity between genes - NIH
    The Jaccard index calculates the proportion of shared Y-type nodes between two X-type nodes, relative to the total number of Y-type nodes connected to either X- ...
  35. [35]
    [PDF] Using Friendship Ties and Family Circles for Link Prediction
    The Jaccard coefficient is a standard metric for measuring the similarity of two sets. Unlike the feature number of common friends, it considers the size of the ...
  36. [36]
    Duplicate Record Detection: A Survey | Request PDF - ResearchGate
    In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect ...
  37. [37]
    [PDF] An Efficient Partition Based Method for Exact Set Similarity Joins
    We group the sets with same size to share computation. We build inverted indexes on the subsets to efficiently perform the set similarity join.
  38. [38]
    [PDF] Advancing Robust and Aligned Measures of Semantic Similarity in ...
    May 10, 2024 · Jaccard similarity is most useful in contexts where the absence of certain words is more significant than their frequency or order, such as with ...