
Similarity measure

A similarity measure is a numerical function that quantifies the degree of resemblance between two data objects or entities within the same reference set, often producing values between 0 (indicating no similarity) and 1 (indicating complete similarity). These measures capture inexact matching by assessing proximity or likeness, contrasting with dissimilarity measures that emphasize differences, and are essential for tasks requiring comparison of incomplete or approximate data. Formally, a normalized similarity measure assigns to any pair of elements x_1, x_2 in a set X a score \sigma \in [0, 1], where \sigma = 1 if and only if x_1 = x_2, and it is symmetric such that \sigma(x_1, x_2) = \sigma(x_2, x_1). Similarity measures underpin numerous algorithms in machine learning, data mining, and pattern recognition, enabling applications like clustering, classification, anomaly detection, and information retrieval. For instance, in recommendation systems, they help identify user preferences by comparing item or profile vectors, while in text analysis, they support document ranking and plagiarism detection. Their choice significantly impacts algorithm performance, as different measures suit varying data types—such as binary, numerical, or categorical—and problem contexts, with properties like metric compliance (non-negativity, symmetry, and triangle inequality for distances) ensuring reliable computations. Key types of similarity measures are tailored to specific data structures and include:
  • Jaccard similarity (or index), which for sets A and B is defined as |A \cap B| / |A \cup B|, ideal for binary or set-based data like tags or presence-absence features.
  • Cosine similarity, computed as the cosine of the angle between two vectors \mathbf{u} and \mathbf{v} via \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}|| \cdot ||\mathbf{v}||}, commonly used in high-dimensional spaces like text or image embeddings to emphasize directional alignment over magnitude.
  • Euclidean similarity, often derived from the Euclidean distance d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum (x_i - y_i)^2} by normalization (e.g., 1 - d / \max_d), suitable for continuous numerical data in clustering and nearest-neighbor searches.
Advanced variants, such as the Mahalanobis distance for covariance-adjusted comparisons or kernel-based measures for non-linear similarities, extend these foundations to handle correlated or complex data distributions. Overall, selecting an appropriate measure balances computational efficiency, interpretability, and alignment with domain-specific notions of "likeness," with ongoing research focusing on robust, scalable options for large-scale data and AI applications.
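To make the three examples above concrete, the following minimal Python sketch computes each one on toy inputs (the sets, vectors, and the assumed maximum distance are purely illustrative):

```python
# Illustrative sketch of the three measures listed above: Jaccard for sets,
# cosine for vectors, and a normalized Euclidean similarity for bounded data.
import math

def jaccard(a: set, b: set) -> float:
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)

def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def euclidean_similarity(x, y, max_distance: float) -> float:
    # 1 - d / max_d, assuming max_distance bounds distances in the domain.
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 - d / max_distance

print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))                       # 0.5
print(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))                        # 1.0 (same direction)
print(euclidean_similarity([0.0, 0.0], [3.0, 4.0], max_distance=10.0)) # 0.5
```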

Fundamentals

Definition

A similarity measure is a function, often denoted as \operatorname{sim}(A, B), that quantifies the degree of resemblance between two objects A and B, typically producing a real-valued output where higher scores indicate greater similarity. These measures are generally normalized to the interval [0, 1], with 1 representing maximum similarity (e.g., identical objects) and 0 indicating complete dissimilarity, though some variants may extend to negative values or unbounded ranges depending on the formulation. In data mining and machine learning, similarity measures address the need to compare, group, or rank objects when exact equality is impractical, particularly in high-dimensional or noisy datasets where subtle resemblances must be detected. They enable core operations such as clustering similar data points, retrieving relevant information in search systems, and identifying patterns in complex structures, thereby facilitating decision-making in scenarios involving incomplete or approximate matches. Such measures apply to diverse object types, including numerical vectors representing points in feature spaces, sets of elements like documents or user preferences, strings such as text or genetic sequences, and graphs modeling relationships in networks. The origins of similarity measures trace back to late 19th- and early 20th-century statistics, with foundational developments like the Pearson correlation coefficient introduced in 1895 for assessing linear relationships between variables, and the Jaccard index proposed in 1901 for ecological associations. In computer science, they gained traction in the 1950s through coding-theory applications, exemplified by the Hamming distance introduced in 1950 for error detection in binary codes, marking an early bridge to computational tasks.

Properties

Similarity measures are characterized by several mathematical properties that define their behavior, though these are not always strictly required and can vary depending on the context and specific measure. A core property is symmetry, which states that the similarity between two objects A and B equals the similarity between B and A, ensuring undirected comparisons: \operatorname{sim}(A, B) = \operatorname{sim}(B, A). This property holds for most practical similarity measures in data analysis and machine learning, promoting consistency in relational assessments. Another essential property is reflexivity, where an object is maximally similar to itself: \operatorname{sim}(A, A) = \max, typically normalized to 1, indicating perfect self-similarity and serving as a reference point for comparisons. Strong reflexivity further specifies that this maximum value occurs only when the objects are identical. In contrast, transitivity—requiring that high similarity between A and B, and between B and C, implies high similarity between A and C—is not generally satisfied by similarity measures, as it depends on a specific aggregation operator and often fails in practical scenarios, distinguishing them from strict equivalence relations. Desirable but non-essential properties include boundedness, where similarity values are confined to a finite interval, commonly [0, 1] with 0 denoting no similarity and 1 indicating identity, facilitating interpretation and comparability across datasets. Additionally, monotonicity with respect to an underlying distance ensures that similarity decreases (or at least does not increase) as the distance between objects grows, providing an intuitive relationship without requiring strict proportionality. Unlike distance measures, which often form metrics satisfying the triangle inequality (i.e., the direct distance between two points is no greater than the path through a third), similarity measures frequently lack this property, leading to non-metric behaviors that can simplify computations but complicate geometric interpretations. For instance, the Kullback-Leibler divergence, originally a non-symmetric measure of distributional difference, violates symmetry and the triangle inequality in its asymmetric form and is sometimes adapted (e.g., via averaging) for use as a similarity, highlighting practical deviations from ideal properties.
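A small Python sketch can make these properties tangible; the bounded toy measure 1/(1 + |a - b|) and the test values below are illustrative assumptions, chosen only to exercise symmetry, reflexivity, boundedness, and the failure of transitivity discussed above:

```python
# Illustrative property checks for a toy similarity on real numbers.
def sim(a: float, b: float) -> float:
    # Bounded similarity derived from absolute difference: 1 / (1 + |a - b|).
    return 1.0 / (1.0 + abs(a - b))

pairs = [(0.0, 0.0), (1.0, 3.0), (3.0, 1.0), (-2.0, 5.0)]

for a, b in pairs:
    s = sim(a, b)
    assert 0.0 < s <= 1.0        # boundedness in (0, 1]
    assert s == sim(b, a)        # symmetry
assert sim(4.2, 4.2) == 1.0      # reflexivity: maximal self-similarity

# Transitivity is NOT guaranteed: sim(0, 1) and sim(1, 2) are both 0.5,
# yet sim(0, 2) drops to 1/3, illustrating the point made above.
print(sim(0, 1), sim(1, 2), sim(0, 2))
```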

Relation to Distance Measures

Inverse Relationships

Similarity measures are frequently derived from distance measures through inversion or normalization techniques that map dissimilarity to a scale of likeness, typically ranging from 0 to 1. A standard transformation involves normalizing the distance by the maximum possible value in the domain and subtracting from 1, expressed as:

\text{sim}(A, B) = 1 - \frac{d(A, B)}{\max_d}

where d(A, B) denotes the distance between objects A and B, and \max_d is the diameter of the domain or the maximum observed distance. This approach ensures that identical objects yield a similarity of 1, while maximally distant objects yield 0, providing an intuitive inverse relationship. For example, the Euclidean distance, which quantifies straight-line separation in vector spaces, can be inverted using this formula to produce a corresponding similarity measure, especially in bounded domains where \max_d is predefined or estimated from the data. This inversion is particularly natural in optimization contexts, where maximizing similarity aligns directly with minimizing the underlying distance, allowing seamless adaptation of gradient-based or search algorithms originally formulated for distances. One key advantage of these inverse relationships is compatibility with existing distance-oriented algorithms, such as k-nearest neighbors or k-means clustering, enabling them to output similarity scores without structural modifications—simply by applying the transformation post-computation. This facilitates broader applicability in fields like information retrieval, where higher similarity values are preferred for ranking. Despite these benefits, inversions carry limitations: they do not always preserve metric properties of the original distance. For instance, even if d satisfies the triangle inequality, the resulting similarity may induce a non-metric space when the transformation is reversed or reinterpreted, potentially failing conditions like non-negativity or the triangle inequality in derived distances. This can complicate theoretical analyses or convergence guarantees in certain algorithms.
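The inversion can be sketched in a few lines of Python; the toy points and the convention of estimating \max_d from the observed data are illustrative assumptions:

```python
# Illustrative conversion of pairwise Euclidean distances to similarities
# via sim = 1 - d / max_d, with max_d taken from the observed distances.
import numpy as np

def pairwise_euclidean(X: np.ndarray) -> np.ndarray:
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distances_to_similarities(D: np.ndarray) -> np.ndarray:
    max_d = D.max()
    return 1.0 - D / max_d if max_d > 0 else np.ones_like(D)

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = pairwise_euclidean(X)
S = distances_to_similarities(D)
print(S)  # diagonal is 1.0 (identical points); the most distant pair gets 0.0
```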

Metric vs. Non-Metric Similarities

In the context of similarity measures, a metric similarity is one that, when converted to a corresponding distance (often via a transformation such as distance = 1 - similarity or a monotonic decreasing function), satisfies the axioms of a metric: non-negativity, symmetry, identity of indiscernibles, and the triangle inequality. This ensures that the similarity induces a geometry in which direct paths are no longer than indirect ones, as formalized in the original definition of metric spaces by Maurice Fréchet in 1906. For instance, similarities derived from Euclidean distances, such as those used in early multidimensional scaling (MDS), adhere to these properties when the data are measured on an interval or ratio scale. Non-metric similarities, by contrast, do not satisfy the triangle inequality in their induced distance, allowing for greater flexibility in modeling complex relationships that violate strict geometric constraints. A prominent example is kernel-based similarities in machine learning, such as the sigmoid kernel, which can produce indefinite kernel matrices that fail positive semi-definiteness in order to capture non-linear patterns. Another classic case is Tversky's feature contrast model, where similarity is computed as a weighted combination of common and distinctive features, s(A, B) = \theta |A \cap B| - \alpha |A - B| - \beta |B - A|, which empirically fits psychological judgments better than geometric alternatives by permitting asymmetries and violations of the triangle inequality. The distinction carries significant implications for algorithmic use: metric similarities support efficient geometric computations, such as shortest-path algorithms in graphs or isometric embeddings into Euclidean spaces, enabling scalable optimizations in clustering and nearest-neighbor search. Non-metric similarities, however, excel in probabilistic and kernel-based models, like support vector machines, where relaxing the metric requirements allows implicit high-dimensional mappings without explicit coordinate computation, though at the cost of potential instability in distance-based approximations. This trade-off influences choices in practice, with metrics preferred for interpretable spatial structures and non-metrics for expressive, data-driven flexibility. Historically, the foundation of metric spaces was laid by Fréchet in his 1906 dissertation, generalizing distance concepts from Euclidean geometry to abstract sets, which by the mid-20th century influenced psychometrics through Torgerson's metric MDS in 1952. Non-metric extensions emerged in the 1960s with Kruskal's non-metric MDS in 1964, accommodating ordinal data in perceptual studies, and gained traction in machine learning by the 1980s as researchers adopted kernel methods for non-Euclidean similarities.
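Tversky's contrast model is easy to sketch directly from the formula above; the weights and feature sets below are illustrative, chosen to show how unequal \alpha and \beta produce the asymmetry that breaks the metric axioms:

```python
# Illustrative sketch of Tversky's feature contrast model,
# s(A, B) = theta*|A ∩ B| - alpha*|A - B| - beta*|B - A|.
def tversky_contrast(a: set, b: set, theta=1.0, alpha=0.8, beta=0.2) -> float:
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    return theta * common - alpha * only_a - beta * only_b

# Invented feature sets for two stimuli.
korea = {"asian", "peninsula", "divided"}
china = {"asian", "large", "communist", "ancient"}

# With alpha != beta the score is direction-dependent, so the induced
# "distance" cannot satisfy the metric axioms (symmetry fails outright).
print(tversky_contrast(korea, china))  # 1 - 0.8*2 - 0.2*3 = -1.2
print(tversky_contrast(china, korea))  # 1 - 0.8*3 - 0.2*2 = -1.8
```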

Common Measures for Vectors

Cosine Similarity

Cosine similarity is a measure of similarity between two non-zero vectors in an inner product space, defined as the cosine of the angle between them. This metric quantifies how well the directions of the vectors align, independent of their magnitudes, making it particularly suitable for comparing orientations in high-dimensional spaces. The formula for cosine similarity between two vectors \mathbf{A} and \mathbf{B} is given by:

\text{sim}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}

where \mathbf{A} \cdot \mathbf{B} denotes the dot product and \|\mathbf{A}\|, \|\mathbf{B}\| are the Euclidean norms (magnitudes) of the vectors. This expression derives from the geometric definition of the cosine of the angle \theta between two vectors: \cos \theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}, which normalizes the dot product to focus solely on directional alignment rather than vector length. By ignoring magnitude, cosine similarity effectively treats vectors as directions on a unit hypersphere, which is advantageous when comparing relative orientations without bias from scaling differences. (p. 121)

Key properties of cosine similarity include its bounded range of [-1, 1], where 1 indicates identical directions, 0 represents orthogonality, and -1 signifies opposite directions; it is symmetric such that \text{sim}(\mathbf{A}, \mathbf{B}) = \text{sim}(\mathbf{B}, \mathbf{A}); and it is reflexive with \text{sim}(\mathbf{A}, \mathbf{A}) = 1 for non-zero vectors. However, it is not a metric because it fails to satisfy the triangle inequality, as the cosine of angles does not preserve additive distances in vector spaces. These properties make it a semi-metric suitable for orientation-based comparisons but not for strict distance geometries. (pp. 121, 133)

In practice, cosine similarity is widely applied to document similarity using term frequency-inverse document frequency (TF-IDF) vector representations, where documents are modeled as sparse vectors in a high-dimensional term space. TF-IDF weights emphasize distinctive terms while downweighting common ones, and cosine similarity then measures the angular alignment of these vectors to assess topical overlap between documents. This approach excels in sparse data environments, such as text corpora, because it inherently handles high dimensionality and zero-valued entries by focusing on non-zero coordinate overlaps, reducing the impact of irrelevant dimensions and improving efficiency in information retrieval tasks. (pp. 119-125, 133)

Computing cosine similarity requires calculating the dot product and norms, which for vectors of dimension n takes O(n) time in the worst case, though sparsity allows optimizations like skipping zero entries to achieve sublinear performance in practice. Pre-normalizing vectors to unit length further simplifies repeated computations to just the dot product. (pp. 121, 125, 398)
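A minimal Python sketch of cosine similarity on sparse, dictionary-backed vectors (mimicking TF-IDF document vectors with made-up weights) illustrates the zero-skipping optimization mentioned above:

```python
# Illustrative cosine similarity for sparse vectors stored as dicts;
# iterating over the shorter dict skips all zero-valued coordinates.
import math

def cosine_sparse(u: dict, v: dict) -> float:
    if len(u) > len(v):
        u, v = v, u                      # iterate over the shorter vector
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented TF-IDF-style weights for two short documents.
doc1 = {"similarity": 0.7, "measure": 0.5, "vector": 0.3}
doc2 = {"similarity": 0.6, "vector": 0.4, "distance": 0.5}
print(round(cosine_sparse(doc1, doc2), 3))
```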

Pearson Correlation

The Pearson correlation coefficient, denoted as r, serves as a similarity measure for assessing the strength and direction of linear relationships between two numerical vectors, such as feature profiles or samples in multivariate data. Developed by Karl Pearson in 1895 as part of his work on regression and heredity, it was originally formulated to quantify the degree of association between biological traits but later adapted in the 1990s for similarity computations in collaborative filtering applications, including early recommender systems. The formula for the Pearson correlation between two vectors \mathbf{A} and \mathbf{B} of length n is given by:

r = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^2} \sqrt{\sum_{i=1}^{n} (B_i - \bar{B})^2}} = \frac{\text{cov}(\mathbf{A}, \mathbf{B})}{\sigma_A \sigma_B},

where \bar{A} and \bar{B} are the means, \text{cov} is the covariance, and \sigma_A, \sigma_B are the standard deviations. This expression normalizes the covariance by the product of the standard deviations, yielding values in the range [-1, 1], where r = 1 indicates perfect positive linear similarity, r = -1 perfect negative, and r = 0 no linear association. Notably, the Pearson correlation coefficient is equivalent to the cosine similarity between the mean-centered versions of the two vectors. The derivation arises from centering the vectors by subtracting their means, which removes location effects and focuses on deviations from the average; the resulting normalized inner product then measures the extent to which these deviations co-vary, quantifying linear dependence after this centering step. Under the assumption of joint normality, a value of zero indicates statistical independence, though in general it only implies absence of linear dependence. Key properties include symmetry (r(\mathbf{A}, \mathbf{B}) = r(\mathbf{B}, \mathbf{A})), boundedness in [-1, 1], and invariance to linear scaling and translation of either vector, making it suitable for comparing datasets with differing units or magnitudes. However, it assumes linearity and, for inferential purposes, bivariate normality of the data; violations can lead to underestimation of non-linear relationships. Variants include adjustments for statistical significance testing, such as the t-test statistic t = r \sqrt{\frac{n-2}{1-r^2}}, which follows a t-distribution under the null hypothesis of no correlation, enabling p-value computation to assess whether observed similarity is likely due to chance.
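The equivalence between Pearson correlation and cosine similarity of mean-centered vectors can be checked directly; the sample vectors below are illustrative:

```python
# Illustrative check: Pearson correlation equals the cosine similarity
# of the mean-centered vectors, and agrees with NumPy's built-in corrcoef.
import numpy as np

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    a_c, b_c = a - a.mean(), b - b.mean()            # remove location effects
    return float(np.dot(a_c, b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))

a = np.array([2.0, 4.0, 6.0, 8.0])
b = np.array([1.0, 3.0, 2.0, 5.0])

print(round(pearson(a, b), 4))
print(round(float(np.corrcoef(a, b)[0, 1]), 4))      # NumPy agrees
```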

Common Measures for Sets and Sequences

Jaccard Similarity

The Jaccard similarity, also known as the Jaccard index or Jaccard coefficient, is a measure designed to quantify the overlap between two finite sets by focusing on the presence or absence of elements, originally developed in the context of botanical similarity studies. For two sets A and B, it is formally defined as

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

where |A \cap B| denotes the size of the intersection (shared elements) and |A \cup B| the size of the union (all unique elements across both sets), with the convention that J(\emptyset, \emptyset) = 1 and J(A, \emptyset) = 0 for nonempty A. This formulation yields a value between 0 (no overlap) and 1 (identical sets). The measure derives from the intuitive ratio of shared features to the total distinct features, emphasizing presence/absence patterns rather than quantities or orders, which makes it particularly suitable for comparing collections like document term sets or binary feature vectors. In ecological applications, for instance, it assesses similarity by treating species occurrences as set elements, capturing how much of one habitat's flora is represented in another. Computationally, exact evaluation can be performed efficiently using hash-based set representations, achieving O(\min(|A|, |B|)) expected time by iterating over the smaller set to compute intersection and union sizes via lookups. Key properties include boundedness in [0, 1], symmetry (J(A, B) = J(B, A)), and reflexivity (J(A, A) = 1), ensuring it behaves as a valid similarity measure. Additionally, the associated Jaccard distance d_J(A, B) = 1 - J(A, B) satisfies the axioms of a metric (non-negativity, symmetry, identity of indiscernibles, and the triangle inequality) on the collection of finite sets. An extension, the weighted Jaccard similarity, generalizes this by assigning weights w_e \geq 0 to each element e, defining

J_w(A, B) = \frac{\sum_{e \in A \cap B} \min(w_e^A, w_e^B)}{\sum_{e \in A \cup B} \max(w_e^A, w_e^B)},

where w_e^A and w_e^B are the weights of e in sets A and B, respectively; this prioritizes elements with higher weights in overlap assessments, useful for scenarios like term frequency weighting in text analysis.
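Both the plain and weighted forms follow directly from the definitions above; in this Python sketch the element weights stand in for something like term frequencies and are purely illustrative:

```python
# Illustrative plain and weighted Jaccard similarities; the weighted variant
# takes element -> weight dicts (e.g., term frequencies).
def jaccard(a: set, b: set) -> float:
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)

def weighted_jaccard(wa: dict, wb: dict) -> float:
    keys = set(wa) | set(wb)
    num = sum(min(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    den = sum(max(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    return num / den if den else 1.0

print(jaccard({"cat", "dog", "fish"}, {"dog", "fish", "bird"}))        # 0.5
print(weighted_jaccard({"dog": 3, "fish": 1}, {"dog": 1, "fish": 2}))  # 2/5 = 0.4
```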

Levenshtein Distance as Similarity

The Levenshtein distance, named after Vladimir I. Levenshtein who introduced it in 1965 for analyzing error-correcting codes in binary sequences, quantifies the dissimilarity between two strings by counting the minimum number of single-character operations—insertions, deletions, or substitutions—needed to convert one string into the other. Each operation has a cost of 1, making the distance the smallest total cost for transformation. This measure accounts for the sequential order of characters, distinguishing it from order-agnostic set-based similarities.

The distance is computed efficiently using dynamic programming, as formalized by Wagner and Fischer in 1974. For strings A of length m and B of length n, a matrix dp of size (m+1) \times (n+1) is constructed, where dp[i][j] holds the edit distance between the prefixes A[1..i] and B[1..j]. The recurrence is:

dp[i][j] = \min \begin{cases} dp[i-1][j] + 1 & \text{(deletion of } A[i]\text{)} \\ dp[i][j-1] + 1 & \text{(insertion of } B[j]\text{)} \\ dp[i-1][j-1] + c & \text{(substitution or match, where } c = 0 \text{ if } A[i] = B[j], \text{ else } 1\text{)} \end{cases}

with base cases dp[i][0] = i, dp[0][j] = j, and dp[0][0] = 0. This runs in O(mn) time and space, enabling practical computation for moderate-length strings.

To adapt the Levenshtein distance as a similarity measure, it is typically normalized to produce values in the interval [0, 1], where 1 indicates perfect similarity. The common formula is \operatorname{sim}(A, B) = 1 - \frac{d(A, B)}{\max(|A|, |B|)}, with d(A, B) denoting the edit distance; this scales the distance by the longer string's length to handle inputs of varying length. The resulting similarity is symmetric, as d(A, B) = d(B, A), and bounded, but it does not always form a metric because the normalization can violate the triangle inequality. For example, similar short strings may yield high similarity, while inserting a long unrelated substring could disrupt additivity in comparisons.

A notable variant is the Damerau-Levenshtein distance, which extends the original by including transpositions of adjacent characters as a single edit operation with cost 1, better capturing common typing errors. This was proposed by Frederick J. Damerau in 1964 specifically for improving spell-checker accuracy in computational systems. The dynamic programming computation for this variant modifies the recurrence to detect and penalize adjacent swaps, maintaining O(mn) complexity while increasing sensitivity to local rearrangements.
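The Wagner-Fischer recurrence and the normalized similarity translate directly into a short Python sketch (inputs illustrative):

```python
# Illustrative Wagner-Fischer dynamic program for Levenshtein distance,
# plus the normalized similarity 1 - d / max(|A|, |B|).
def levenshtein(a: str, b: str) -> int:
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                              # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                              # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[m][n]

def levenshtein_similarity(a: str, b: str) -> float:
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))                        # 3
print(round(levenshtein_similarity("kitten", "sitting"), 3))   # 1 - 3/7
```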

Applications in Machine Learning

Clustering

In machine learning, similarity measures play a pivotal role in clustering by quantifying the resemblance between data points, enabling algorithms to group similar instances into cohesive clusters without labeled supervision. These measures serve as the foundation for defining cluster proximity, such as in partitional methods like k-means where points are assigned to the nearest centroid based on inverted similarity (e.g., distance derived from similarity), or in hierarchical approaches where they guide the iterative merging or splitting of clusters. By establishing how "close" points or groups are, similarity measures directly influence the structure and quality of the resulting partitions, making their selection critical for capturing underlying data patterns.

In hierarchical clustering, similarity measures are integrated through linkage criteria that determine when to combine clusters during agglomerative processes. Single-linkage, also known as nearest-neighbor linkage, merges clusters based on the maximum similarity between any pair of points from the two groups, promoting elongated or chain-like structures that can connect loosely associated data. In contrast, complete-linkage, or farthest-neighbor linkage, uses the minimum similarity across all inter-cluster pairs, favoring compact, spherical clusters by requiring overall high resemblance. These criteria, often applied to vector-based similarities like cosine for text data or Jaccard for sets, allow hierarchical methods to build dendrograms revealing multi-level groupings, though they can be sensitive to outliers in single-linkage scenarios.

A key challenge in using similarity measures for clustering arises from the curse of dimensionality in high-dimensional spaces, where distances between points become uniformly large and similarities converge, diminishing the measures' ability to discriminate clusters effectively—for instance, Euclidean or cosine similarities may treat nearly all points as equally dissimilar beyond moderate dimensions. This phenomenon, observed in datasets with thousands of features such as text or gene-expression profiles, leads to poorer cluster separation and requires dimensionality-reduction techniques or robust alternatives to maintain meaningful groupings. To evaluate clustering quality, the silhouette score provides a metric that assesses how well each point aligns with its assigned cluster relative to others, computed from the average similarity (or inverse distance) within the cluster versus the nearest neighboring cluster, with values closer to 1 indicating strong cohesion and separation. Introduced as a graphical validation aid, it helps quantify the impact of different similarity measures on overall partition validity without ground truth labels.

Advancements since the early 2000s have seen density-based clustering algorithms like DBSCAN adapt similarity measures to focus on local neighborhoods rather than global distances, defining clusters as dense regions separated by sparse areas using parameters like a neighborhood radius (acting as a similarity threshold) and a minimum number of points, thus handling arbitrary shapes and noise more robustly than traditional linkage methods. This approach, building on foundational density notions, has been extended in variants to incorporate variable-density similarities for large, high-dimensional data.
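The following sketch, assuming scikit-learn and SciPy are available, shows one way such a pipeline can be wired together: cosine similarities on toy vectors are inverted to distances, clustered with complete linkage, and scored with the silhouette on the same precomputed distances:

```python
# Illustrative similarity-driven hierarchical clustering on toy 2-D data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])

S = cosine_similarity(X)                 # pairwise similarities
D = 1.0 - S                              # invert to a dissimilarity
np.fill_diagonal(D, 0.0)                 # enforce exact zeros on the diagonal

Z = linkage(squareform(D, checks=False), method="complete")  # farthest-neighbor linkage
labels = fcluster(Z, t=2, criterion="maxclust")              # cut into 2 clusters

print(labels)                                                 # e.g. [1 1 2 2]
print(round(silhouette_score(D, labels, metric="precomputed"), 3))
```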

Recommender Systems

In recommender systems, similarity measures play a central role in collaborative filtering, a technique that generates personalized recommendations by identifying patterns in user-item interactions. Collaborative filtering assumes that users who agreed in the past will agree in the future, leveraging similarity to predict preferences for unrated items. Two primary mechanisms dominate: user-user collaborative filtering, which computes similarities between users based on their rating histories to predict a target user's ratings by aggregating from similar users, and item-item collaborative filtering, which identifies similarities between items based on user ratings to recommend items akin to those a user has liked. For instance, user-user approaches often employ Pearson correlation to measure rating-pattern similarities across users, accounting for individual biases in rating scales. Algorithms in neighborhood-based collaborative filtering typically rely on k-nearest neighbors (k-NN) methods, where similarity scores determine the k most similar entities (users or items) above a predefined threshold, and predictions are weighted averages of their ratings or interactions. This neighborhood-based approach efficiently scales to sparse user-item matrices by focusing computations on high-similarity pairs, enabling real-time recommendations in systems like e-commerce platforms and streaming services. The cold-start problem, where new users or items lack sufficient interaction data for reliable similarity computation, is commonly mitigated through hybrid methods that integrate collaborative filtering with content-based similarities, such as cosine similarity on item attributes (e.g., genres or descriptions) or user demographics, to bootstrap initial profiles. Evaluation of these similarity-driven recommenders emphasizes ranking quality, with Precision@K serving as a key metric that assesses the proportion of relevant items among the top-K recommendations generated from similarity-ranked lists. This metric prioritizes the accuracy of high-ranked suggestions, reflecting user satisfaction with the top results. Post-2000s advancements evolved beyond explicit k-NN similarities toward matrix factorization techniques, which implicitly capture user-item affinities through low-dimensional latent factors, improving performance on implicit feedback data like clicks or views while addressing sparsity.
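A toy user-user collaborative filtering sketch in Python illustrates the mechanics described above; the rating matrix, user and item names, and the neighborhood size k are invented for illustration:

```python
# Illustrative user-user collaborative filtering: Pearson similarity between
# users, then a k-NN weighted average of mean-centered ratings.
import numpy as np

ratings = {
    "alice": {"it1": 5, "it2": 3, "it3": 4},
    "bob":   {"it1": 4, "it2": 2, "it3": 5, "it4": 4},
    "carol": {"it1": 1, "it2": 5, "it4": 2},
}

def pearson(u: dict, v: dict) -> float:
    common = sorted(set(u) & set(v))
    if len(common) < 2:
        return 0.0
    a = np.array([u[i] for i in common], dtype=float)
    b = np.array([v[i] for i in common], dtype=float)
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict(user: str, item: str, k: int = 2) -> float:
    mean_u = np.mean(list(ratings[user].values()))
    neighbors = [(pearson(ratings[user], ratings[v]), v)
                 for v in ratings if v != user and item in ratings[v]]
    neighbors = sorted(neighbors, reverse=True)[:k]
    num = sum(s * (ratings[v][item] - np.mean(list(ratings[v].values())))
              for s, v in neighbors)
    den = sum(abs(s) for s, v in neighbors)
    return mean_u + num / den if den else mean_u

print(round(predict("alice", "it4"), 2))  # predicted rating for the unseen item
```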

Applications in Other Domains

Sequence Alignment

Sequence alignment is a fundamental technique in bioinformatics for measuring similarity between biological sequences, such as DNA, RNA, or protein sequences, by identifying regions of correspondence that may indicate functional, structural, or evolutionary relationships. This process involves arranging sequences to maximize a similarity score while accounting for possible insertions, deletions, and substitutions, often using dynamic programming or heuristic approaches to handle the combinatorial complexity of long sequences. Unlike general string matching, biological alignments incorporate domain-specific scoring systems that reflect evolutionary probabilities rather than simple edit costs.

The Needleman-Wunsch algorithm, introduced in 1970, provides a classic method for global alignment, which aligns entire sequences end-to-end to compute an optimal similarity score based on matches, mismatches, and gaps. It employs a scoring matrix where identical residues receive positive scores, mismatches are penalized (often with lower positive or negative values), and gaps—representing insertions or deletions—are assigned negative penalties to discourage excessive indels. For protein sequences, substitution matrices like BLOSUM (BLOcks SUbstitution Matrix) are commonly used, derived from observed alignments in conserved protein blocks; for instance, BLOSUM62 assigns +11 for a tryptophan-tryptophan match and a minimum of -4 for the most dissimilar substitutions, enabling nuanced similarity assessment that captures evolutionary conservation. These scores are computed via a dynamic programming table, where each cell represents the maximum similarity up to that position, ultimately yielding an alignment path that quantifies overall sequence relatedness.

In applications such as phylogenetics, pairwise similarity scores from alignments serve as distance metrics to construct evolutionary trees, where higher similarity implies closer relatedness among species or genes. For example, aligned protein sequences from diverse organisms can be scored to generate a distance matrix, which is then used in clustering algorithms like neighbor-joining to infer branching patterns and divergence times. Tools like BLAST (Basic Local Alignment Search Tool), developed in 1990, extend this by using heuristic methods to rapidly identify local similarities—short, high-scoring segments—rather than full global alignments, making it efficient for database searches against vast genomic repositories. BLAST approximates optimal local alignments by seeding with short exact matches and extending them with scoring penalties similar to Needleman-Wunsch, reporting E-values to assess the statistical significance of similarities.

Advancements in the 2010s incorporated machine learning to learn context-aware similarity functions for alignments, surpassing traditional fixed matrices by training neural networks on large datasets of known alignments. Deep reinforcement learning approaches, such as those applied to multiple sequence alignment, optimize alignment paths by treating scoring as a policy learned from evolutionary data, achieving higher accuracy on benchmark datasets like BAliBASE compared to classical methods. These learned models adaptively weigh matches and gaps based on sequence context, enhancing applications in structure prediction and functional annotation.
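The global-alignment scoring scheme can be sketched with a simple match/mismatch/gap model in place of a real substitution matrix like BLOSUM62; the scores and sequences below are illustrative:

```python
# Illustrative Needleman-Wunsch global alignment scoring with a simple
# match/mismatch/gap scheme (not a real substitution matrix).
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + gap            # leading gaps in b
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + gap            # leading gaps in a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag,
                           dp[i - 1][j] + gap,   # gap in b
                           dp[i][j - 1] + gap)   # gap in a
    return dp[m][n]

print(needleman_wunsch_score("GATTACA", "GCATGCU"))
```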

Computer Vision

In computer vision, similarity measures are essential for comparing images, features, or regions to enable tasks such as retrieval, matching, and recognition. Early approaches often relied on color distributions, where histogram intersection serves as a simple yet effective metric for quantifying overlap between color histograms of two images, originally proposed for efficient color-based object indexing. This measure computes the sum of minimum bin values across corresponding histogram bins, providing a bounded similarity score between 0 and 1 that is computationally efficient for large databases. For more perceptually meaningful comparisons of color distributions, the Earth Mover's Distance (EMD) treats histograms as distributions of "earth" piles and calculates the minimum work required to transform one into the other, using a ground distance such as the Euclidean distance in color space; it has demonstrated superior retrieval performance over simpler metrics like chi-squared distance in color and texture-based image databases.

Local feature descriptors further advanced similarity computation, with the Scale-Invariant Feature Transform (SIFT) generating 128-dimensional vectors from keypoints that are compared using Euclidean distance to match invariant features across images under scale, rotation, and partial illumination changes. This pairwise Euclidean similarity, combined with ratio tests to discard ambiguous matches, enables robust correspondence in tasks like stereo vision and image stitching. In object recognition, bag-of-words models treat SIFT descriptors as visual words, quantizing them into a codebook via k-means clustering and representing images as histograms weighted by term frequency-inverse document frequency (TF-IDF); cosine similarity between these histograms then ranks candidate images, adapting text retrieval techniques to achieve scalable recognition in large video datasets. Challenges in these measures include sensitivity to illumination variations, which can distort color histograms and descriptor responses; invariance is often achieved through normalized color spaces or preprocessing, such as histogram normalization or chromaticity-based adaptations that mitigate affine illuminant changes while preserving intersection-based similarities.

The 2010s marked a shift with deep learning, where convolutional neural networks (CNNs) produce high-dimensional embeddings for images or regions, typically compared via cosine similarity on L2-normalized vectors to emphasize angular alignment over magnitude, as seen in metric learning frameworks for face recognition that improved verification accuracy on benchmarks like LFW. In object detection pipelines, mean Average Precision (mAP) evaluates performance by varying similarity thresholds, such as Intersection over Union (IoU) for bounding boxes, where detections exceeding a threshold (e.g., 0.5) are deemed true positives, aggregating precision-recall curves across classes and IoU levels from 0.5 to 0.95 to quantify overall detection quality.
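Two of the measures above, histogram intersection and IoU, reduce to a few lines of Python; the histograms and bounding boxes are illustrative:

```python
# Illustrative histogram intersection for normalized color histograms and
# Intersection over Union (IoU) for axis-aligned bounding boxes.
import numpy as np

def histogram_intersection(h1: np.ndarray, h2: np.ndarray) -> float:
    # Assumes both histograms are L1-normalized, so the score lies in [0, 1].
    return float(np.minimum(h1, h2).sum())

def iou(box_a, box_b) -> float:
    # Boxes given as (x1, y1, x2, y2).
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

h1 = np.array([0.2, 0.5, 0.3])
h2 = np.array([0.4, 0.4, 0.2])
print(histogram_intersection(h1, h2))       # 0.2 + 0.4 + 0.2 = 0.8
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```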

Information Retrieval

In information retrieval (IR), similarity measures play a central role in ranking documents by their relevance to a user query, enabling efficient search over large corpora. The vector space model (VSM), introduced in the 1970s, represents both queries and documents as vectors in a high-dimensional space where each dimension corresponds to a term in the vocabulary, with weights typically derived from term frequency-inverse document frequency (TF-IDF). Cosine similarity is the predominant measure in the VSM, computing the cosine of the angle between query vector \mathbf{q} and document vector \mathbf{d} to assess their directional alignment, normalized to account for vector lengths:

\cos(\theta) = \frac{\mathbf{q} \cdot \mathbf{d}}{\|\mathbf{q}\| \, \|\mathbf{d}\|}

This approach prioritizes documents sharing terms with the query while downweighting document length effects, outperforming earlier Boolean models in empirical evaluations on collections like Cranfield.

An extension to the VSM, the BM25 ranking function, refines similarity by incorporating probabilistic weighting that emphasizes term rarity and saturation effects to avoid over-penalizing long documents. Developed within the Okapi IR system, BM25 estimates relevance as a sum over query terms t_i, with weights balancing term frequency tf, inverse document frequency IDF, document length dl, and average length avgdl:

\text{BM25}(d, q) = \sum_{i=1}^{n} \text{IDF}(t_i) \cdot \frac{tf(t_i, d) \cdot (k_1 + 1)}{tf(t_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}

where k_1 and b are tunable parameters, and IDF reflects term scarcity across the corpus. This measure has become a benchmark in search engines like Elasticsearch, consistently improving precision over basic TF-IDF cosine similarity on TREC datasets in mean average precision.

Query expansion enhances similarity computation by augmenting the original query with semantically related terms, such as synonyms, to address vocabulary mismatch. Techniques leveraging lexical resources like WordNet compute semantic similarity between query terms and synsets (e.g., via path distance or gloss overlap), selecting top-related words to expand the query vector before applying cosine or BM25. For instance, expanding "car" with "automobile" and "vehicle" via WordNet's hypernym relations has shown recall improvements of up to 15% on TREC queries without significant precision loss. This method integrates seamlessly with the VSM, broadening the effective term space while preserving ranking via established similarity scores.

IR systems evaluate ranked outputs using metrics like Normalized Discounted Cumulative Gain (NDCG), which assesses similarity-based rankings by rewarding relevant documents at higher positions with graded relevance scores r_i and position discounts:

\text{DCG}_p = \sum_{i=1}^{p} \frac{r_i}{\log_2(i+1)}, \quad \text{NDCG}_p = \frac{\text{DCG}_p}{\text{IDCG}_p}

where IDCG is the ideal DCG for a perfect ranking. NDCG correlates strongly with user satisfaction in TREC evaluations, emphasizing top-k results where similarity measures directly impact perceived quality.

In the 2020s, neural IR has shifted toward dense embeddings, where queries and documents are encoded into low-dimensional vectors via transformer models like BERT, with inner-product (dot product) similarity serving as the primary ranking signal for its efficiency in approximating semantic relevance.
Models like Dense Passage Retrieval (DPR) precompute embeddings for passages and retrieve via maximum inner product search (MIPS), achieving approximately 50% relative improvement in top-5 passage retrieval accuracy over BM25 on open-domain QA benchmarks like Natural Questions. Late-interaction variants, such as ColBERT, refine this by computing token-level similarities before aggregation, balancing expressiveness and speed while outperforming DPR by about 25% relatively in MRR@10 on MS MARCO. These approaches dominate modern dense retrievers, enabling scalable semantic search in production systems.
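As a rough illustration of the BM25 scoring described above, the following Python sketch ranks a toy tokenized corpus against a query; the documents, the +1-smoothed IDF variant, and the default parameters k_1 = 1.5, b = 0.75 are assumptions for the example:

```python
# Illustrative BM25 scoring over a toy tokenized corpus.
import math
from collections import Counter

docs = [
    "similarity measures rank documents by relevance".split(),
    "cosine similarity compares vectors of terms".split(),
    "bm25 weights term frequency and document length".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(term for d in docs for term in set(d))   # document frequencies

def idf(term: str) -> float:
    return math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))

def bm25(doc, query, k1: float = 1.5, b: float = 0.75) -> float:
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        num = tf[term] * (k1 + 1)
        den = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(term) * num / den
    return score

query = "similarity of documents".split()
for i, d in enumerate(docs):
    print(i, round(bm25(d, query), 3))
```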
