
Similarity measure

A similarity measure is a numerical function that quantifies the degree of resemblance between two data objects or entities within the same reference set, often producing values between 0 (indicating no similarity) and 1 (indicating complete similarity). These measures capture inexact matching by assessing proximity or likeness, contrasting with dissimilarity measures that emphasize differences, and are essential for tasks requiring comparison of incomplete or approximate data. Formally, a normalized similarity measure assigns to any pair of elements x_1, x_2 in a set X a score \sigma \in [0, 1], where \sigma = 1 if and only if x_1 = x_2, and it is symmetric such that \sigma(x_1, x_2) = \sigma(x_2, x_1). Similarity measures underpin numerous algorithms in machine learning, data mining, and pattern recognition, enabling applications like clustering, classification, anomaly detection, and information retrieval. For instance, in recommendation systems, they help identify user preferences by comparing item or profile vectors, while in text analysis, they support document ranking and plagiarism detection. Their choice significantly impacts algorithm performance, as different measures suit varying data types—such as binary, numerical, or categorical—and problem contexts, with properties like metric compliance (non-negativity, symmetry, and triangle inequality for distances) ensuring reliable computations. Key types of similarity measures are tailored to specific data structures and include:
  • Jaccard similarity (or index), which for sets A and B is defined as |A \cap B| / |A \cup B|, ideal for binary or set-based data like tags or presence-absence features.
  • Cosine similarity, computed as the cosine of the angle between two vectors \mathbf{u} and \mathbf{v} via \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}|| \cdot ||\mathbf{v}||}, commonly used in high-dimensional spaces like text or image embeddings to emphasize directional alignment over magnitude.
  • Euclidean similarity, often derived from the Euclidean distance d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum (x_i - y_i)^2} by normalization (e.g., 1 - d / \max_d), suitable for continuous numerical data in clustering and nearest-neighbor searches.
Advanced variants, such as the Mahalanobis distance for covariance-adjusted comparisons or kernel-based measures for non-linear similarities, extend these foundations to handle correlated or complex data distributions. Overall, selecting an appropriate measure balances computational efficiency, interpretability, and alignment with domain-specific notions of "likeness," with ongoing research focusing on robust, scalable options for large-scale data and AI applications.
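To make the three examples above concrete, the following minimal Python sketch computes each one on toy inputs (the sets, vectors, and the assumed maximum distance are purely illustrative):

```python
# Illustrative sketch of the three measures listed above: Jaccard for sets,
# cosine for vectors, and a normalized Euclidean similarity for bounded data.
import math

def jaccard(a: set, b: set) -> float:
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)

def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def euclidean_similarity(x, y, max_distance: float) -> float:
    # 1 - d / max_d, assuming max_distance bounds distances in the domain.
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 - d / max_distance

print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))                       # 0.5
print(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))                        # 1.0 (same direction)
print(euclidean_similarity([0.0, 0.0], [3.0, 4.0], max_distance=10.0)) # 0.5
```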

Fundamentals

Definition

A similarity measure is a function, often denoted as \operatorname{sim}(A, B), that quantifies the degree of resemblance between two objects A and B, typically producing a real-valued output where higher scores indicate greater similarity. These measures are generally normalized to the interval [0, 1], with 1 representing maximum similarity (e.g., identical objects) and 0 indicating complete dissimilarity, though some variants may extend to negative values or unbounded ranges depending on the formulation. In data mining and machine learning, similarity measures address the need to compare, group, or rank objects when exact equality is impractical, particularly in high-dimensional or noisy datasets where subtle resemblances must be detected. They enable core operations such as clustering similar data points, retrieving relevant information in search systems, and identifying patterns in complex structures, thereby facilitating decision-making in scenarios involving incomplete or approximate matches. Such measures apply to diverse object types, including numerical vectors representing points in feature spaces, sets of elements like documents or user preferences, strings such as text or genetic sequences, and graphs modeling relationships in networks. The origins of similarity measures trace back to late 19th- and early 20th-century statistics, with foundational developments like the Pearson correlation coefficient introduced in 1895 for assessing linear relationships between variables, and the Jaccard index proposed in 1901 for ecological associations. In computer science, they gained traction in the 1950s through coding-theory applications, exemplified by the Hamming distance introduced in 1950 for error detection in binary codes, marking an early bridge to computational tasks.

Properties

Similarity measures are characterized by several mathematical properties that define their behavior, though these are not always strictly required and can vary depending on the context and specific measure. A core property is symmetry, which states that the similarity between two objects A and B equals the similarity between B and A, ensuring undirected comparisons: \operatorname{sim}(A, B) = \operatorname{sim}(B, A). This property holds for most practical similarity measures in data analysis and machine learning, promoting consistency in relational assessments. Another essential property is reflexivity, where an object is maximally similar to itself: \operatorname{sim}(A, A) = \max, typically normalized to 1, indicating perfect self-similarity and serving as a reference point for comparisons. Strong reflexivity further specifies that this maximum value occurs only when the objects are identical. In contrast, transitivity—requiring that high similarity between A and B, and between B and C, implies high similarity between A and C—is not generally satisfied by similarity measures, as it depends on a specific aggregation operator and often fails in practical scenarios, distinguishing them from strict equivalence relations. Desirable but non-essential properties include boundedness, where similarity values are confined to a finite interval, commonly [0, 1] with 0 denoting no similarity and 1 indicating identity, facilitating interpretation and comparability across datasets. Additionally, monotonicity with respect to an underlying distance ensures that similarity decreases (or at least does not increase) as the distance between objects grows, providing an intuitive relationship without requiring strict proportionality. Unlike distance measures, which often form metrics satisfying the triangle inequality (i.e., the direct distance between two points is no greater than the path through a third), similarity measures frequently lack this property, leading to non-metric behaviors that can simplify computations but complicate geometric interpretations. For instance, the Kullback-Leibler divergence, originally a non-symmetric measure of distributional difference, violates symmetry and the triangle inequality in its asymmetric form and is sometimes adapted (e.g., via averaging) for use as a similarity, highlighting practical deviations from ideal properties.
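A small Python sketch can make these properties tangible; the bounded toy measure 1/(1 + |a - b|) and the test values below are illustrative assumptions, chosen only to exercise symmetry, reflexivity, boundedness, and the failure of transitivity discussed above:

```python
# Illustrative property checks for a toy similarity on real numbers.
def sim(a: float, b: float) -> float:
    # Bounded similarity derived from absolute difference: 1 / (1 + |a - b|).
    return 1.0 / (1.0 + abs(a - b))

pairs = [(0.0, 0.0), (1.0, 3.0), (3.0, 1.0), (-2.0, 5.0)]

for a, b in pairs:
    s = sim(a, b)
    assert 0.0 < s <= 1.0        # boundedness in (0, 1]
    assert s == sim(b, a)        # symmetry
assert sim(4.2, 4.2) == 1.0      # reflexivity: maximal self-similarity

# Transitivity is NOT guaranteed: sim(0, 1) and sim(1, 2) are both 0.5,
# yet sim(0, 2) drops to 1/3, illustrating the point made above.
print(sim(0, 1), sim(1, 2), sim(0, 2))
```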

Relation to Distance Measures

Inverse Relationships

Similarity measures are frequently derived from distance measures through inversion or normalization techniques that map dissimilarity to a scale of likeness, typically ranging from 0 to 1. A standard transformation involves normalizing the distance by the maximum possible value in the domain and subtracting from 1, expressed as:

\text{sim}(A, B) = 1 - \frac{d(A, B)}{\max_d}

where d(A, B) denotes the distance between objects A and B, and \max_d is the diameter of the domain or the maximum observed distance. This approach ensures that identical objects yield a similarity of 1, while maximally distant objects yield 0, providing an intuitive inverse relationship. For example, the Euclidean distance, which quantifies straight-line separation in vector spaces, can be inverted using this formula to produce a corresponding similarity measure, especially in bounded domains where \max_d is predefined or estimated from the data. This inversion is particularly natural in optimization contexts, where maximizing similarity aligns directly with minimizing the underlying distance, allowing seamless adaptation of gradient-based or search algorithms originally formulated for distances. One key advantage of these inverse relationships is compatibility with existing distance-oriented algorithms, such as k-nearest neighbors or k-means clustering, enabling them to output similarity scores without structural modifications—simply by applying the transformation post-computation. This facilitates broader applicability in fields like information retrieval, where higher similarity values are preferred for ranking. Despite these benefits, inversions carry limitations: they do not always preserve metric properties of the original distance. For instance, even if d satisfies the triangle inequality, the resulting similarity may induce a non-metric space when the transformation is reversed or reinterpreted, potentially failing conditions like non-negativity or the triangle inequality in derived distances. This can complicate theoretical analyses or convergence guarantees in certain algorithms.
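The inversion can be sketched in a few lines of Python; the toy points and the convention of estimating \max_d from the observed data are illustrative assumptions:

```python
# Illustrative conversion of pairwise Euclidean distances to similarities
# via sim = 1 - d / max_d, with max_d taken from the observed distances.
import numpy as np

def pairwise_euclidean(X: np.ndarray) -> np.ndarray:
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distances_to_similarities(D: np.ndarray) -> np.ndarray:
    max_d = D.max()
    return 1.0 - D / max_d if max_d > 0 else np.ones_like(D)

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = pairwise_euclidean(X)
S = distances_to_similarities(D)
print(S)  # diagonal is 1.0 (identical points); the most distant pair gets 0.0
```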

Metric vs. Non-Metric Similarities

In the context of similarity measures, a metric similarity is one that, when converted to a corresponding distance (often via a transformation such as distance = 1 - similarity or a monotonic decreasing function), satisfies the axioms of a metric: non-negativity, symmetry, identity of indiscernibles, and the triangle inequality. This ensures that the similarity induces a geometry in which direct paths are no longer than indirect ones, as formalized in the original definition of metric spaces by Maurice Fréchet in 1906. For instance, similarities derived from Euclidean distances, such as those used in early multidimensional scaling (MDS), adhere to these properties when the data are measured on an interval or ratio scale. Non-metric similarities, by contrast, do not satisfy the triangle inequality in their induced distance, allowing for greater flexibility in modeling complex relationships that violate strict geometric constraints. A prominent example is kernel-based similarities in machine learning, such as the sigmoid kernel, which can produce indefinite kernel matrices that fail positive semi-definiteness in order to capture non-linear patterns. Another classic case is Tversky's feature contrast model, where similarity is computed as a weighted combination of common and distinctive features, s(A, B) = \theta |A \cap B| - \alpha |A - B| - \beta |B - A|, which empirically fits psychological judgments better than geometric alternatives by permitting asymmetries and violations of the triangle inequality. The distinction carries significant implications for algorithmic use: metric similarities support efficient geometric computations, such as shortest-path algorithms in graphs or isometric embeddings into Euclidean spaces, enabling scalable optimizations in clustering and nearest-neighbor search. Non-metric similarities, however, excel in probabilistic and kernel-based models, like support vector machines, where relaxing the metric requirements allows implicit high-dimensional mappings without explicit coordinate computation, though at the cost of potential instability in distance-based approximations. This trade-off influences choices in practice, with metrics preferred for interpretable spatial structures and non-metrics for expressive, data-driven flexibility. Historically, the foundation of metric spaces was laid by Fréchet in his 1906 dissertation, generalizing distance concepts from Euclidean geometry to abstract sets, which by the mid-20th century influenced psychometrics through Torgerson's metric MDS in 1952. Non-metric extensions emerged in the 1960s with Kruskal's non-metric MDS in 1964, accommodating ordinal data in perceptual studies, and gained traction in machine learning by the 1980s as researchers adopted kernel methods for non-Euclidean similarities.
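Tversky's contrast model is easy to sketch directly from the formula above; the weights and feature sets below are illustrative, chosen to show how unequal \alpha and \beta produce the asymmetry that breaks the metric axioms:

```python
# Illustrative sketch of Tversky's feature contrast model,
# s(A, B) = theta*|A ∩ B| - alpha*|A - B| - beta*|B - A|.
def tversky_contrast(a: set, b: set, theta=1.0, alpha=0.8, beta=0.2) -> float:
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    return theta * common - alpha * only_a - beta * only_b

# Invented feature sets for two stimuli.
korea = {"asian", "peninsula", "divided"}
china = {"asian", "large", "communist", "ancient"}

# With alpha != beta the score is direction-dependent, so the induced
# "distance" cannot satisfy the metric axioms (symmetry fails outright).
print(tversky_contrast(korea, china))  # 1 - 0.8*2 - 0.2*3 = -1.2
print(tversky_contrast(china, korea))  # 1 - 0.8*3 - 0.2*2 = -1.8
```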

Common Measures for Vectors

Cosine Similarity

Cosine similarity is a measure of similarity between two non-zero vectors in an inner product space, defined as the cosine of the angle between them. This metric quantifies how well the directions of the vectors align, independent of their magnitudes, making it particularly suitable for comparing orientations in high-dimensional spaces. The formula for cosine similarity between two vectors \mathbf{A} and \mathbf{B} is given by:

\text{sim}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}

where \mathbf{A} \cdot \mathbf{B} denotes the dot product and \|\mathbf{A}\|, \|\mathbf{B}\| are the Euclidean norms (magnitudes) of the vectors. This expression derives from the geometric definition of the cosine of the angle \theta between two vectors: \cos \theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}, which normalizes the dot product to focus solely on directional alignment rather than vector length. By ignoring magnitude, cosine similarity effectively treats vectors as directions on a unit hypersphere, which is advantageous when comparing relative orientations without bias from scaling differences. (p. 121)

Key properties of cosine similarity include its bounded range of [-1, 1], where 1 indicates identical directions, 0 represents orthogonality, and -1 signifies opposite directions; it is symmetric such that \text{sim}(\mathbf{A}, \mathbf{B}) = \text{sim}(\mathbf{B}, \mathbf{A}); and it is reflexive with \text{sim}(\mathbf{A}, \mathbf{A}) = 1 for non-zero vectors. However, it is not a metric because it fails to satisfy the triangle inequality, as the cosine of angles does not preserve additive distances in vector spaces. These properties make it a semi-metric suitable for orientation-based comparisons but not for strict distance geometries. (pp. 121, 133)

In practice, cosine similarity is widely applied to document similarity using term frequency-inverse document frequency (TF-IDF) vector representations, where documents are modeled as sparse vectors in a high-dimensional term space. TF-IDF weights emphasize distinctive terms while downweighting common ones, and cosine similarity then measures the angular alignment of these vectors to assess topical overlap between documents. This approach excels in sparse data environments, such as text corpora, because it inherently handles high dimensionality and zero-valued entries by focusing on non-zero coordinate overlaps, reducing the impact of irrelevant dimensions and improving efficiency in information retrieval tasks. (pp. 119-125, 133)

Computing cosine similarity requires calculating the dot product and norms, which for vectors of dimension n takes O(n) time in the worst case, though sparsity allows optimizations like skipping zero entries to achieve sublinear performance in practice. Pre-normalizing vectors to unit length further simplifies repeated computations to just the dot product. (pp. 121, 125, 398)
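A minimal Python sketch of cosine similarity on sparse, dictionary-backed vectors (mimicking TF-IDF document vectors with made-up weights) illustrates the zero-skipping optimization mentioned above:

```python
# Illustrative cosine similarity for sparse vectors stored as dicts;
# iterating over the shorter dict skips all zero-valued coordinates.
import math

def cosine_sparse(u: dict, v: dict) -> float:
    if len(u) > len(v):
        u, v = v, u                      # iterate over the shorter vector
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented TF-IDF-style weights for two short documents.
doc1 = {"similarity": 0.7, "measure": 0.5, "vector": 0.3}
doc2 = {"similarity": 0.6, "vector": 0.4, "distance": 0.5}
print(round(cosine_sparse(doc1, doc2), 3))
```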

Pearson Correlation

The Pearson correlation coefficient, denoted as r, serves as a similarity measure for assessing the strength and direction of linear relationships between two numerical vectors, such as feature profiles or samples in multivariate data. Developed by Karl Pearson in 1895 as part of his work on regression and heredity, it was originally formulated to quantify the degree of association between biological traits but later adapted in the 1990s for similarity computations in collaborative filtering applications, including early recommender systems. The formula for the Pearson correlation between two vectors \mathbf{A} and \mathbf{B} of length n is given by:

r = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^2} \sqrt{\sum_{i=1}^{n} (B_i - \bar{B})^2}} = \frac{\text{cov}(\mathbf{A}, \mathbf{B})}{\sigma_A \sigma_B},

where \bar{A} and \bar{B} are the means, \text{cov} is the covariance, and \sigma_A, \sigma_B are the standard deviations. This expression normalizes the covariance by the product of the standard deviations, yielding values in the range [-1, 1], where r = 1 indicates perfect positive linear similarity, r = -1 perfect negative, and r = 0 no linear association. Notably, the Pearson correlation coefficient is equivalent to the cosine similarity between the mean-centered versions of the two vectors. The derivation arises from centering the vectors by subtracting their means, which removes location effects and focuses on deviations from the average; the resulting normalized inner product then measures the extent to which these deviations co-vary, quantifying linear dependence after this centering step. Under the assumption of joint normality, a value of zero indicates statistical independence, though in general it only implies absence of linear dependence. Key properties include symmetry (r(\mathbf{A}, \mathbf{B}) = r(\mathbf{B}, \mathbf{A})), boundedness in [-1, 1], and invariance to linear scaling and translation of either vector, making it suitable for comparing datasets with differing units or magnitudes. However, it assumes linearity and, for inferential purposes, bivariate normality of the data; violations can lead to underestimation of non-linear relationships. Variants include adjustments for statistical significance testing, such as the t-test statistic t = r \sqrt{\frac{n-2}{1-r^2}}, which follows a t-distribution under the null hypothesis of no correlation, enabling p-value computation to assess whether observed similarity is likely due to chance.
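The equivalence between Pearson correlation and cosine similarity of mean-centered vectors can be checked directly; the sample vectors below are illustrative:

```python
# Illustrative check: Pearson correlation equals the cosine similarity
# of the mean-centered vectors, and agrees with NumPy's built-in corrcoef.
import numpy as np

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    a_c, b_c = a - a.mean(), b - b.mean()            # remove location effects
    return float(np.dot(a_c, b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))

a = np.array([2.0, 4.0, 6.0, 8.0])
b = np.array([1.0, 3.0, 2.0, 5.0])

print(round(pearson(a, b), 4))
print(round(float(np.corrcoef(a, b)[0, 1]), 4))      # NumPy agrees
```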

Common Measures for Sets and Sequences

Jaccard Similarity

The Jaccard similarity, also known as the Jaccard index or Jaccard coefficient, is a measure designed to quantify the overlap between two finite sets by focusing on the presence or absence of elements, originally developed in the context of botanical similarity studies. For two sets A and B, it is formally defined as

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

where |A \cap B| denotes the size of the intersection (shared elements) and |A \cup B| the size of the union (all unique elements across both sets), with the convention that J(\emptyset, \emptyset) = 1 and J(A, \emptyset) = 0 for nonempty A. This formulation yields a value between 0 (no overlap) and 1 (identical sets). The measure derives from the intuitive ratio of shared features to the total distinct features, emphasizing presence/absence patterns rather than quantities or orders, which makes it particularly suitable for comparing collections like document term sets or binary feature vectors. In ecological applications, for instance, it assesses similarity by treating species occurrences as set elements, capturing how much of one habitat's flora is represented in another. Computationally, exact evaluation can be performed efficiently using hash-based set representations, achieving O(\min(|A|, |B|)) expected time by iterating over the smaller set to compute intersection and union sizes via lookups. Key properties include boundedness in [0, 1], symmetry (J(A, B) = J(B, A)), and reflexivity (J(A, A) = 1), ensuring it behaves as a valid similarity measure. Additionally, the associated Jaccard distance d_J(A, B) = 1 - J(A, B) satisfies the axioms of a metric (non-negativity, symmetry, identity of indiscernibles, and the triangle inequality) on the collection of finite sets. An extension, the weighted Jaccard similarity, generalizes this by assigning weights w_e \geq 0 to each element e, defining

J_w(A, B) = \frac{\sum_{e \in A \cap B} \min(w_e^A, w_e^B)}{\sum_{e \in A \cup B} \max(w_e^A, w_e^B)},

where w_e^A and w_e^B are the weights of e in sets A and B, respectively; this prioritizes elements with higher weights in overlap assessments, useful for scenarios like term frequency weighting in text analysis.
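Both the plain and weighted forms follow directly from the definitions above; in this Python sketch the element weights stand in for something like term frequencies and are purely illustrative:

```python
# Illustrative plain and weighted Jaccard similarities; the weighted variant
# takes element -> weight dicts (e.g., term frequencies).
def jaccard(a: set, b: set) -> float:
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)

def weighted_jaccard(wa: dict, wb: dict) -> float:
    keys = set(wa) | set(wb)
    num = sum(min(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    den = sum(max(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    return num / den if den else 1.0

print(jaccard({"cat", "dog", "fish"}, {"dog", "fish", "bird"}))        # 0.5
print(weighted_jaccard({"dog": 3, "fish": 1}, {"dog": 1, "fish": 2}))  # 2/5 = 0.4
```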

Levenshtein Distance as Similarity

The Levenshtein distance, named after Vladimir I. Levenshtein who introduced it in 1965 for analyzing error-correcting codes in binary sequences, quantifies the dissimilarity between two strings by counting the minimum number of single-character operations—insertions, deletions, or substitutions—needed to convert one string into the other. Each operation has a cost of 1, making the distance the smallest total cost for transformation. This measure accounts for the sequential order of characters, distinguishing it from order-agnostic set-based similarities.

The distance is computed efficiently using dynamic programming, as formalized by Wagner and Fischer in 1974. For strings A of length m and B of length n, a matrix dp of size (m+1) \times (n+1) is constructed, where dp[i][j] holds the edit distance between the prefixes A[1..i] and B[1..j]. The recurrence is:

dp[i][j] = \min \begin{cases} dp[i-1][j] + 1 & \text{(deletion of } A[i]\text{)} \\ dp[i][j-1] + 1 & \text{(insertion of } B[j]\text{)} \\ dp[i-1][j-1] + c & \text{(substitution or match, where } c = 0 \text{ if } A[i] = B[j], \text{ else } 1\text{)} \end{cases}

with base cases dp[i][0] = i, dp[0][j] = j, and dp[0][0] = 0. This runs in O(mn) time and space, enabling practical computation for moderate-length strings.

To adapt the Levenshtein distance as a similarity measure, it is typically normalized to produce values in the interval [0, 1], where 1 indicates perfect similarity. The common formula is \operatorname{sim}(A, B) = 1 - \frac{d(A, B)}{\max(|A|, |B|)}, with d(A, B) denoting the edit distance; this scales the distance by the longer string's length to handle inputs of varying length. The resulting similarity is symmetric, as d(A, B) = d(B, A), and bounded, but it does not always form a metric because the normalization can violate the triangle inequality. For example, similar short strings may yield high similarity, while inserting a long unrelated substring could disrupt additivity in comparisons.

A notable variant is the Damerau-Levenshtein distance, which extends the original by including transpositions of adjacent characters as a single edit operation with cost 1, better capturing common typing errors. This was proposed by Frederick J. Damerau in 1964 specifically for improving spell-checker accuracy in computational systems. The dynamic programming computation for this variant modifies the recurrence to detect and penalize adjacent swaps, maintaining O(mn) complexity while increasing sensitivity to local rearrangements.
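The Wagner-Fischer recurrence and the normalized similarity translate directly into a short Python sketch (inputs illustrative):

```python
# Illustrative Wagner-Fischer dynamic program for Levenshtein distance,
# plus the normalized similarity 1 - d / max(|A|, |B|).
def levenshtein(a: str, b: str) -> int:
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                              # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                              # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[m][n]

def levenshtein_similarity(a: str, b: str) -> float:
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))                        # 3
print(round(levenshtein_similarity("kitten", "sitting"), 3))   # 1 - 3/7
```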

Applications in Machine Learning

Clustering

In machine learning, similarity measures play a pivotal role in clustering by quantifying the resemblance between data points, enabling algorithms to group similar instances into cohesive clusters without labeled supervision. These measures serve as the foundation for defining cluster proximity, such as in partitional methods like k-means where points are assigned to the nearest centroid based on inverted similarity (e.g., distance derived from similarity), or in hierarchical approaches where they guide the iterative merging or splitting of clusters. By establishing how "close" points or groups are, similarity measures directly influence the structure and quality of the resulting partitions, making their selection critical for capturing underlying data patterns.

In hierarchical clustering, similarity measures are integrated through linkage criteria that determine when to combine clusters during agglomerative processes. Single-linkage, also known as nearest-neighbor linkage, merges clusters based on the maximum similarity between any pair of points from the two groups, promoting elongated or chain-like structures that can connect loosely associated data. In contrast, complete-linkage, or farthest-neighbor linkage, uses the minimum similarity across all inter-cluster pairs, favoring compact, spherical clusters by requiring overall high resemblance. These criteria, often applied to vector-based similarities like cosine for text data or Jaccard for sets, allow hierarchical methods to build dendrograms revealing multi-level groupings, though they can be sensitive to outliers in single-linkage scenarios.

A key challenge in using similarity measures for clustering arises from the curse of dimensionality in high-dimensional spaces, where distances between points become uniformly large and similarities converge, diminishing the measures' ability to discriminate clusters effectively—for instance, Euclidean or cosine similarities may treat nearly all points as equally dissimilar beyond moderate dimensions. This phenomenon, observed in datasets with thousands of features such as text or gene-expression profiles, leads to poorer cluster separation and requires dimensionality-reduction techniques or robust alternatives to maintain meaningful groupings. To evaluate clustering quality, the silhouette score provides a metric that assesses how well each point aligns with its assigned cluster relative to others, computed from the average similarity (or inverse distance) within the cluster versus the nearest neighboring cluster, with values closer to 1 indicating strong cohesion and separation. Introduced as a graphical validation aid, it helps quantify the impact of different similarity measures on overall partition validity without ground truth labels.

Advancements since the early 2000s have seen density-based clustering algorithms like DBSCAN adapt similarity measures to focus on local neighborhoods rather than global distances, defining clusters as dense regions separated by sparse areas using parameters like a neighborhood radius (acting as a similarity threshold) and a minimum number of points, thus handling arbitrary shapes and noise more robustly than traditional linkage methods. This approach, building on foundational density notions, has been extended in variants to incorporate variable-density similarities for large, high-dimensional data.
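The following sketch, assuming scikit-learn and SciPy are available, shows one way such a pipeline can be wired together: cosine similarities on toy vectors are inverted to distances, clustered with complete linkage, and scored with the silhouette on the same precomputed distances:

```python
# Illustrative similarity-driven hierarchical clustering on toy 2-D data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])

S = cosine_similarity(X)                 # pairwise similarities
D = 1.0 - S                              # invert to a dissimilarity
np.fill_diagonal(D, 0.0)                 # enforce exact zeros on the diagonal

Z = linkage(squareform(D, checks=False), method="complete")  # farthest-neighbor linkage
labels = fcluster(Z, t=2, criterion="maxclust")              # cut into 2 clusters

print(labels)                                                 # e.g. [1 1 2 2]
print(round(silhouette_score(D, labels, metric="precomputed"), 3))
```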

Recommender Systems

In recommender systems, similarity measures play a central role in collaborative filtering, a technique that generates personalized recommendations by identifying patterns in user-item interactions. Collaborative filtering assumes that users who agreed in the past will agree in the future, leveraging similarity to predict preferences for unrated items. Two primary mechanisms dominate: user-user collaborative filtering, which computes similarities between users based on their rating histories to predict a target user's ratings by aggregating from similar users, and item-item collaborative filtering, which identifies similarities between items based on user ratings to recommend items akin to those a user has liked. For instance, user-user approaches often employ Pearson correlation to measure rating-pattern similarities across users, accounting for individual biases in rating scales. Algorithms in neighborhood-based collaborative filtering typically rely on k-nearest neighbors (k-NN) methods, where similarity scores determine the k most similar entities (users or items) above a predefined threshold, and predictions are weighted averages of their ratings or interactions. This neighborhood-based approach efficiently scales to sparse user-item matrices by focusing computations on high-similarity pairs, enabling real-time recommendations in systems like e-commerce platforms and streaming services. The cold-start problem, where new users or items lack sufficient interaction data for reliable similarity computation, is commonly mitigated through hybrid methods that integrate collaborative filtering with content-based similarities, such as cosine similarity on item attributes (e.g., genres or descriptions) or user demographics, to bootstrap initial profiles. Evaluation of these similarity-driven recommenders emphasizes ranking quality, with Precision@K serving as a key metric that assesses the proportion of relevant items among the top-K recommendations generated from similarity-ranked lists. This metric prioritizes the accuracy of high-ranked suggestions, reflecting user satisfaction with the top results. Post-2000s advancements evolved beyond explicit k-NN similarities toward matrix factorization techniques, which implicitly capture user-item affinities through low-dimensional latent factors, improving performance on implicit feedback data like clicks or views while addressing sparsity.
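A toy user-user collaborative filtering sketch in Python illustrates the mechanics described above; the rating matrix, user and item names, and the neighborhood size k are invented for illustration:

```python
# Illustrative user-user collaborative filtering: Pearson similarity between
# users, then a k-NN weighted average of mean-centered ratings.
import numpy as np

ratings = {
    "alice": {"it1": 5, "it2": 3, "it3": 4},
    "bob":   {"it1": 4, "it2": 2, "it3": 5, "it4": 4},
    "carol": {"it1": 1, "it2": 5, "it4": 2},
}

def pearson(u: dict, v: dict) -> float:
    common = sorted(set(u) & set(v))
    if len(common) < 2:
        return 0.0
    a = np.array([u[i] for i in common], dtype=float)
    b = np.array([v[i] for i in common], dtype=float)
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict(user: str, item: str, k: int = 2) -> float:
    mean_u = np.mean(list(ratings[user].values()))
    neighbors = [(pearson(ratings[user], ratings[v]), v)
                 for v in ratings if v != user and item in ratings[v]]
    neighbors = sorted(neighbors, reverse=True)[:k]
    num = sum(s * (ratings[v][item] - np.mean(list(ratings[v].values())))
              for s, v in neighbors)
    den = sum(abs(s) for s, v in neighbors)
    return mean_u + num / den if den else mean_u

print(round(predict("alice", "it4"), 2))  # predicted rating for the unseen item
```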

Applications in Other Domains

Sequence Alignment

Sequence alignment is a fundamental technique in bioinformatics for measuring similarity between biological sequences, such as DNA, RNA, or protein sequences, by identifying regions of correspondence that may indicate functional, structural, or evolutionary relationships. This process involves arranging sequences to maximize a similarity score while accounting for possible insertions, deletions, and substitutions, often using dynamic programming or heuristic approaches to handle the combinatorial complexity of long sequences. Unlike general string matching, biological alignments incorporate domain-specific scoring systems that reflect evolutionary probabilities rather than simple edit costs.

The Needleman-Wunsch algorithm, introduced in 1970, provides a classic method for global alignment, which aligns entire sequences end-to-end to compute an optimal similarity score based on matches, mismatches, and gaps. It employs a scoring matrix where identical residues receive positive scores, mismatches are penalized (often with lower positive or negative values), and gaps—representing insertions or deletions—are assigned negative penalties to discourage excessive indels. For protein sequences, substitution matrices like BLOSUM (BLOcks SUbstitution Matrix) are commonly used, derived from observed alignments in conserved protein blocks; for instance, BLOSUM62 assigns +11 for a tryptophan-tryptophan match and a minimum of -4 for the most dissimilar substitutions, enabling nuanced similarity assessment that captures evolutionary conservation. These scores are computed via a dynamic programming table, where each cell represents the maximum similarity up to that position, ultimately yielding an alignment path that quantifies overall sequence relatedness.

In applications such as phylogenetics, pairwise similarity scores from alignments serve as distance metrics to construct evolutionary trees, where higher similarity implies closer relatedness among species or genes. For example, aligned protein sequences from diverse organisms can be scored to generate a distance matrix, which is then used in clustering algorithms like neighbor-joining to infer branching patterns and divergence times. Tools like BLAST (Basic Local Alignment Search Tool), developed in 1990, extend this by using heuristic methods to rapidly identify local similarities—short, high-scoring segments—rather than full global alignments, making it efficient for database searches against vast genomic repositories. BLAST approximates optimal local alignments by seeding with short exact matches and extending them with scoring penalties similar to Needleman-Wunsch, reporting E-values to assess the statistical significance of similarities.

Advancements in the 2010s incorporated machine learning to learn context-aware similarity functions for alignments, surpassing traditional fixed matrices by training neural networks on large datasets of known alignments. Deep reinforcement learning approaches, such as those applied to multiple sequence alignment, optimize alignment paths by treating scoring as a policy learned from evolutionary data, achieving higher accuracy on benchmark datasets like BAliBASE compared to classical methods. These learned models adaptively weigh matches and gaps based on sequence context, enhancing applications in structure prediction and functional annotation.
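The global-alignment scoring scheme can be sketched with a simple match/mismatch/gap model in place of a real substitution matrix like BLOSUM62; the scores and sequences below are illustrative:

```python
# Illustrative Needleman-Wunsch global alignment scoring with a simple
# match/mismatch/gap scheme (not a real substitution matrix).
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + gap            # leading gaps in b
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + gap            # leading gaps in a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag,
                           dp[i - 1][j] + gap,   # gap in b
                           dp[i][j - 1] + gap)   # gap in a
    return dp[m][n]

print(needleman_wunsch_score("GATTACA", "GCATGCU"))
```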

Computer Vision

In computer vision, similarity measures are essential for comparing images, features, or regions to enable tasks such as retrieval, matching, and recognition. Early approaches often relied on color distributions, where histogram intersection serves as a simple yet effective metric for quantifying overlap between color histograms of two images, originally proposed for efficient color-based object indexing. This measure computes the sum of minimum bin values across corresponding histogram bins, providing a bounded similarity score between 0 and 1 that is computationally efficient for large databases. For more perceptually meaningful comparisons of color distributions, the Earth Mover's Distance (EMD) treats histograms as distributions of "earth" piles and calculates the minimum work required to transform one into the other, using a ground distance such as the Euclidean distance in color space; it has demonstrated superior retrieval performance over simpler metrics like chi-squared distance in color and texture-based image databases.

Local feature descriptors further advanced similarity computation, with the Scale-Invariant Feature Transform (SIFT) generating 128-dimensional vectors from keypoints that are compared using Euclidean distance to match invariant features across images under scale, rotation, and partial illumination changes. This pairwise Euclidean similarity, combined with ratio tests to discard ambiguous matches, enables robust correspondence in tasks like stereo vision and image stitching. In object recognition, bag-of-words models treat SIFT descriptors as visual words, quantizing them into a codebook via k-means clustering and representing images as histograms weighted by term frequency-inverse document frequency (TF-IDF); cosine similarity between these histograms then ranks candidate images, adapting text retrieval techniques to achieve scalable recognition in large video datasets. Challenges in these measures include sensitivity to illumination variations, which can distort color histograms and descriptor responses; invariance is often achieved through normalized color spaces or preprocessing, such as histogram normalization or chromaticity-based adaptations that mitigate affine illuminant changes while preserving intersection-based similarities.

The 2010s marked a shift with deep learning, where convolutional neural networks (CNNs) produce high-dimensional embeddings for images or regions, typically compared via cosine similarity on L2-normalized vectors to emphasize angular alignment over magnitude, as seen in metric learning frameworks for face recognition that improved verification accuracy on benchmarks like LFW. In object detection pipelines, mean Average Precision (mAP) evaluates performance by varying similarity thresholds, such as Intersection over Union (IoU) for bounding boxes, where detections exceeding a threshold (e.g., 0.5) are deemed true positives, aggregating precision-recall curves across classes and IoU levels from 0.5 to 0.95 to quantify overall detection quality.
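Two of the measures above, histogram intersection and IoU, reduce to a few lines of Python; the histograms and bounding boxes are illustrative:

```python
# Illustrative histogram intersection for normalized color histograms and
# Intersection over Union (IoU) for axis-aligned bounding boxes.
import numpy as np

def histogram_intersection(h1: np.ndarray, h2: np.ndarray) -> float:
    # Assumes both histograms are L1-normalized, so the score lies in [0, 1].
    return float(np.minimum(h1, h2).sum())

def iou(box_a, box_b) -> float:
    # Boxes given as (x1, y1, x2, y2).
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

h1 = np.array([0.2, 0.5, 0.3])
h2 = np.array([0.4, 0.4, 0.2])
print(histogram_intersection(h1, h2))       # 0.2 + 0.4 + 0.2 = 0.8
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```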

Information Retrieval

In information retrieval (IR), similarity measures play a central role in ranking documents by their relevance to a user query, enabling efficient search over large corpora. The vector space model (VSM), introduced in the 1970s, represents both queries and documents as vectors in a high-dimensional space where each dimension corresponds to a term in the vocabulary, with weights typically derived from term frequency-inverse document frequency (TF-IDF). Cosine similarity is the predominant measure in the VSM, computing the cosine of the angle between query vector \mathbf{q} and document vector \mathbf{d} to assess their directional alignment, normalized to account for vector lengths:

\cos(\theta) = \frac{\mathbf{q} \cdot \mathbf{d}}{\|\mathbf{q}\| \, \|\mathbf{d}\|}

This approach prioritizes documents sharing terms with the query while downweighting document length effects, outperforming earlier Boolean models in empirical evaluations on collections like Cranfield.

An extension to the VSM, the BM25 ranking function, refines similarity by incorporating probabilistic weighting that emphasizes term rarity and saturation effects to avoid over-penalizing long documents. Developed within the Okapi IR system, BM25 estimates relevance as a sum over query terms t_i, with weights balancing term frequency tf, inverse document frequency IDF, document length dl, and average length avgdl:

\text{BM25}(d, q) = \sum_{i=1}^{n} \text{IDF}(t_i) \cdot \frac{tf(t_i, d) \cdot (k_1 + 1)}{tf(t_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}

where k_1 and b are tunable parameters, and IDF reflects term scarcity across the corpus. This measure has become a benchmark in search engines like Elasticsearch, consistently improving precision over basic TF-IDF cosine similarity on TREC datasets in mean average precision.

Query expansion enhances similarity computation by augmenting the original query with semantically related terms, such as synonyms, to address vocabulary mismatch. Techniques leveraging lexical resources like WordNet compute semantic similarity between query terms and synsets (e.g., via path distance or gloss overlap), selecting top-related words to expand the query vector before applying cosine or BM25. For instance, expanding "car" with "automobile" and "vehicle" via WordNet's hypernym relations has shown recall improvements of up to 15% on TREC queries without significant precision loss. This method integrates seamlessly with the VSM, broadening the effective term space while preserving ranking via established similarity scores.

IR systems evaluate ranked outputs using metrics like Normalized Discounted Cumulative Gain (NDCG), which assesses similarity-based rankings by rewarding relevant documents at higher positions with graded relevance scores r_i and position discounts:

\text{DCG}_p = \sum_{i=1}^{p} \frac{r_i}{\log_2(i+1)}, \quad \text{NDCG}_p = \frac{\text{DCG}_p}{\text{IDCG}_p}

where IDCG is the ideal DCG for a perfect ranking. NDCG correlates strongly with user satisfaction in TREC evaluations, emphasizing top-k results where similarity measures directly impact perceived quality.

In the 2020s, neural IR has shifted toward dense embeddings, where queries and documents are encoded into low-dimensional vectors via transformer models like BERT, with inner-product (dot product) similarity serving as the primary ranking signal for its efficiency in approximating semantic relevance.
Models like Dense Passage Retrieval (DPR) precompute embeddings for passages and retrieve via maximum inner product search (MIPS), achieving approximately 50% relative improvement in top-5 passage retrieval accuracy over BM25 on open-domain QA benchmarks like Natural Questions. Late-interaction variants, such as ColBERT, refine this by computing token-level similarities before aggregation, balancing expressiveness and speed while outperforming DPR by about 25% relatively in MRR@10 on MS MARCO. These approaches dominate modern dense retrievers, enabling scalable semantic search in production systems.
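As a rough illustration of the BM25 scoring described above, the following Python sketch ranks a toy tokenized corpus against a query; the documents, the +1-smoothed IDF variant, and the default parameters k_1 = 1.5, b = 0.75 are assumptions for the example:

```python
# Illustrative BM25 scoring over a toy tokenized corpus.
import math
from collections import Counter

docs = [
    "similarity measures rank documents by relevance".split(),
    "cosine similarity compares vectors of terms".split(),
    "bm25 weights term frequency and document length".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(term for d in docs for term in set(d))   # document frequencies

def idf(term: str) -> float:
    return math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))

def bm25(doc, query, k1: float = 1.5, b: float = 0.75) -> float:
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        num = tf[term] * (k1 + 1)
        den = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(term) * num / den
    return score

query = "similarity of documents".split()
for i, d in enumerate(docs):
    print(i, round(bm25(d, query), 3))
```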
