
Cosine similarity

Cosine similarity is a measure of similarity between two non-zero vectors in an inner product space, obtained by calculating the cosine of the angle between them, which quantifies their directional alignment regardless of magnitude. Mathematically, it is defined as the dot product of the two vectors divided by the product of their norms, expressed as \cos \theta = \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \cdot ||\mathbf{B}||}, where the result ranges from -1 (opposite directions) to 1 (identical directions), with 0 indicating orthogonality. This measure is particularly valuable in high-dimensional spaces, such as those encountered in text analysis, because it normalizes for vector length and focuses solely on orientation. The concept draws from linear algebra, where the cosine captures angular proximity, and it exhibits key properties like invariance to positive scaling, making it robust for comparing datasets of varying sizes or densities. For mean-centered variables, cosine similarity equates to the Pearson correlation coefficient, linking it to statistical analysis of linear relationships. Unlike distance metrics such as Euclidean distance, which are sensitive to magnitude, cosine similarity treats vectors of different lengths as equally similar if their directions match, which is advantageous for sparse data representations. Cosine similarity gained prominence in information retrieval through its application to term frequency-inverse document frequency (tf-idf) vectors, with tf-idf weighting introduced by Karen Spärck Jones in 1972 to interpret term specificity and improve document ranking. Today, it underpins numerous fields, including natural language processing for semantic matching and document clustering, recommendation systems for user-item matching, and machine learning tasks like topic modeling and image recognition. Its efficiency in handling high-dimensional, sparse data has made it a staple in large-scale applications, such as those in search engines and large language models.

Mathematical Foundations

Vectors in Inner Product Spaces

In mathematics, vectors are abstract elements of a real vector space, which is a set equipped with operations of addition and scalar multiplication by real numbers, satisfying axioms such as commutativity of addition, associativity, existence of a zero vector, additive inverses, distributivity, and compatibility with scalar multiplication. When this vector space is endowed with an additional structure called an inner product, it becomes a real inner product space, providing a framework for geometric interpretations in abstract settings. Finite-dimensional examples include the Euclidean space \mathbb{R}^n, where vectors are ordered n-tuples of real numbers representing points or directions in n-dimensional space. The inner product, denoted \langle \cdot, \cdot \rangle, is a real-valued function on pairs of vectors that satisfies three key properties: linearity in the first argument, symmetry, and positive-definiteness. Specifically, for vectors u, v, w in the space and scalar \alpha \in \mathbb{R},
  • Linearity in the first argument: \langle \alpha u + w, v \rangle = \alpha \langle u, v \rangle + \langle w, v \rangle,
  • Symmetry: \langle u, v \rangle = \langle v, u \rangle,
  • Positive-definiteness: \langle u, u \rangle \geq 0, with equality if and only if u = 0.
These axioms ensure the inner product behaves analogously to the familiar dot product while extending to more general contexts. A prominent example is the standard inner product in \mathbb{R}^n, known as the dot product, defined for vectors x = (x_1, \dots, x_n) and y = (y_1, \dots, y_n) by \langle x, y \rangle = \sum_{i=1}^n x_i y_i. This satisfies the axioms and corresponds to the algebraic form underlying projections and lengths in Euclidean geometry. The concept generalizes to infinite-dimensional abstract spaces, such as the space of square-integrable functions L^2([a,b]) over an interval [a,b], where the inner product is \langle f, g \rangle = \int_a^b f(t) g(t) \, dt, allowing similar algebraic operations on functions treated as vectors. Inner product spaces form the foundational structure for advanced mathematical theories, enabling the consistent definition of vector operations across diverse applications in physics and engineering. The inner product induces a norm \|v\| = \sqrt{\langle v, v \rangle}, which quantifies vector magnitudes and underpins the definitions of length, distance, and angle used in what follows.
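
As a minimal illustration in Python (assuming NumPy is available), the sketch below evaluates the standard inner product on \mathbb{R}^n and approximates the L^2 inner product of two functions by a Riemann sum; the specific vectors and functions are arbitrary examples.

```python
import numpy as np

# Standard inner product on R^n: the dot product.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -1.0, 0.5])
print(np.dot(x, y))  # 1*4 + 2*(-1) + 3*0.5 = 3.5

# Approximate L^2 inner product <f, g> = integral of f(t) * g(t) dt over [0, 1],
# treating the functions as "vectors" and the integral as their inner product.
t = np.linspace(0.0, 1.0, 10_001)
f = np.sin(np.pi * t)
g = t ** 2
inner_fg = np.sum(f * g) * (t[1] - t[0])  # simple Riemann-sum quadrature
print(inner_fg)
```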

Dot Product and Vector Norms

The dot product of two vectors \mathbf{u} = (u_1, \dots, u_n) and \mathbf{v} = (v_1, \dots, v_n) in \mathbb{R}^n is defined as the scalar \mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^n u_i v_i. This algebraic formulation provides a computational method: multiply corresponding components and sum the results. The Euclidean norm, or \ell^2-norm, of a vector \mathbf{u} = (u_1, \dots, u_n) measures its length and is given by \|\mathbf{u}\| = \sqrt{\mathbf{u} \cdot \mathbf{u}} = \sqrt{\sum_{i=1}^n u_i^2}. This norm satisfies key properties, including positivity, where \|\mathbf{u}\| > 0 if \mathbf{u} \neq \mathbf{0} and \|\mathbf{u}\| = 0 if and only if \mathbf{u} = \mathbf{0}, and homogeneity, such that \|c \mathbf{u}\| = |c| \|\mathbf{u}\| for any scalar c. For example, consider the 2D vectors \mathbf{u} = (3, 4) and \mathbf{v} = (1, 2). The dot product is \mathbf{u} \cdot \mathbf{v} = 3 \cdot 1 + 4 \cdot 2 = 11. The norm of \mathbf{u} is \|\mathbf{u}\| = \sqrt{3^2 + 4^2} = \sqrt{25} = 5, representing the length of the vector. The norm relates to the dot product via the identity \|\mathbf{u}\|^2 = \mathbf{u} \cdot \mathbf{u}. Additionally, two nonzero vectors are orthogonal (perpendicular) if their dot product is zero, indicating no directional alignment between them. For instance, \mathbf{u} = (1, 0) and \mathbf{v} = (0, 1) satisfy \mathbf{u} \cdot \mathbf{v} = 1 \cdot 0 + 0 \cdot 1 = 0, confirming orthogonality.
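
The following short Python sketch (using NumPy) reproduces the worked examples above: the dot product of (3, 4) and (1, 2), the norm of (3, 4), and the orthogonality check for the standard basis vectors.

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([1.0, 2.0])

dot_uv = np.dot(u, v)        # 3*1 + 4*2 = 11
norm_u = np.linalg.norm(u)   # sqrt(3^2 + 4^2) = 5
print(dot_uv, norm_u)

# The squared norm equals the dot product of a vector with itself.
assert np.isclose(norm_u ** 2, np.dot(u, u))

# Orthogonality: perpendicular vectors have a zero dot product.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(np.dot(e1, e2))        # 0.0
```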

Core Definition

Cosine Similarity Formula

The cosine similarity between two non-zero vectors \mathbf{u} and \mathbf{v} in an inner product space is defined as the cosine of the angle \theta between them, given by the formula \cos \theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}, where \mathbf{u} \cdot \mathbf{v} denotes the dot product and \|\mathbf{u}\| = \sqrt{\mathbf{u} \cdot \mathbf{u}} is the Euclidean norm (length) of \mathbf{u}. This expression normalizes the dot product by the product of the vectors' magnitudes, yielding a measure that depends only on their directional alignment rather than their lengths. The formula arises from the geometric interpretation of the dot product in Euclidean space. To derive it, consider the triangle formed by vectors \mathbf{u} and \mathbf{v} originating from a common point. The law of cosines states that for the side opposite angle \theta, denoted \|\mathbf{u} - \mathbf{v}\|, \|\mathbf{u} - \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2 \|\mathbf{u}\| \|\mathbf{v}\| \cos \theta. Expanding the left side using the algebraic definition of the dot product gives \|\mathbf{u} - \mathbf{v}\|^2 = (\mathbf{u} - \mathbf{v}) \cdot (\mathbf{u} - \mathbf{v}) = \mathbf{u} \cdot \mathbf{u} - 2 \mathbf{u} \cdot \mathbf{v} + \mathbf{v} \cdot \mathbf{v} = \|\mathbf{u}\|^2 - 2 \mathbf{u} \cdot \mathbf{v} + \|\mathbf{v}\|^2. Equating the two expressions and solving for the cosine term yields \mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos \theta, or equivalently, the cosine similarity formula above. The value of cosine similarity ranges from -1 to 1. A value of 1 indicates that the vectors point in the same direction (zero angle between them), 0 signifies orthogonality (a 90-degree angle), and -1 means they point in exactly opposite directions (180 degrees). In applications with non-negative components, such as term-frequency vectors in information retrieval, the range is typically restricted to [0, 1]. To compute cosine similarity, first calculate the dot product, then the norms, and divide. For example, consider vectors \mathbf{u} = (1, 2) and \mathbf{v} = (3, 4) in \mathbb{R}^2:
  • Dot product: \mathbf{u} \cdot \mathbf{v} = 1 \cdot 3 + 2 \cdot 4 = 11.
  • Norm of \mathbf{u}: \|\mathbf{u}\| = \sqrt{1^2 + 2^2} = \sqrt{5} \approx 2.236.
  • Norm of \mathbf{v}: \|\mathbf{v}\| = \sqrt{3^2 + 4^2} = \sqrt{25} = 5.
  • Cosine similarity: \cos \theta = \frac{11}{\sqrt{5} \cdot 5} \approx \frac{11}{11.180} \approx 0.984.
This positive value close to 1 reflects the near-alignment of the vectors. Cosine similarity is undefined for zero vectors, as the norms would be zero, leading to division by zero. In practice, inputs are assumed to be non-zero or are validated beforehand; if a zero vector occurs, the similarity is conventionally treated as 0 or the case is excluded to avoid undefined results.
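
A minimal Python implementation of the formula, with the conventional zero-vector fallback mentioned above, is sketched below; the function name and the fallback value of 0.0 are illustrative choices, not a standard API.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; returns 0.0 for zero vectors by convention."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0  # undefined in theory; treated as 0 here to avoid division by zero
    return float(np.dot(u, v) / denom)

print(cosine_similarity([1, 2], [3, 4]))  # ~0.984, matching the worked example
```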

Geometric Interpretation

Cosine similarity quantifies the directional alignment between two vectors by computing the cosine of the angle θ between them, thereby disregarding differences in their magnitudes and focusing solely on orientation. This measure arises naturally from the geometric properties of vectors in Euclidean space, where the angle θ determines how closely the directions match: a value of 1 indicates identical directions (θ = 0°), 0 signifies orthogonality (θ = 90°), and -1 denotes opposite directions (θ = 180°). When vectors are normalized to unit length, they can be visualized as points on the surface of a unit hypersphere, and the cosine similarity simplifies to their dot product, directly capturing the angular separation. This underscores the measure's emphasis on relative positioning rather than scale, making it particularly apt for applications where directionality, such as in document term vectors, conveys semantic relatedness. For instance, two nearly parallel vectors exhibit high cosine similarity approaching 1, reflecting strong directional concordance, while two vectors of identical magnitude diverging at right angles yield a similarity of 0, highlighting the measure's insensitivity to length. Conversely, vectors pointing in opposing directions result in negative similarity, emphasizing opposition in orientation. This interpretation extends to vector projections: for unit vectors, the cosine of θ equals the scalar projection of one vector onto the other, representing the extent to which one aligns along the direction of the second. Such a view illustrates cosine similarity's role in assessing how much "overlap" exists in directional components, independent of overall vector extent.
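
The angular cases described above can be checked numerically: after normalizing to unit length, the cosine similarity is simply the dot product of the unit vectors. The vectors below are arbitrary examples chosen to realize angles of 0°, 90°, and 180°.

```python
import numpy as np

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

u = np.array([2.0, 0.0])   # reference direction
v = np.array([5.0, 0.0])   # same direction, different length
w = np.array([0.0, 3.0])   # perpendicular to u
x = np.array([-1.0, 0.0])  # opposite to u

# For unit vectors, the cosine equals the dot product (the scalar projection).
print(np.dot(unit(u), unit(v)))  #  1.0  (theta = 0 degrees)
print(np.dot(unit(u), unit(w)))  #  0.0  (theta = 90 degrees)
print(np.dot(unit(u), unit(x)))  # -1.0  (theta = 180 degrees)
```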

Cosine Distance

Cosine distance is defined as the complement of cosine similarity, serving as a dissimilarity measure between two non-zero vectors \mathbf{u} and \mathbf{v} in an inner product space. It quantifies how misaligned the directions of the vectors are, with the formula given by d(\mathbf{u}, \mathbf{v}) = 1 - \cos \theta = 1 - \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}, where \cos \theta is the cosine similarity, \mathbf{u} \cdot \mathbf{v} is the dot product, and \|\mathbf{u}\| and \|\mathbf{v}\| are the Euclidean norms. The distance ranges from 0, when the vectors point in identical directions (\cos \theta = 1), to 2, when they point in exactly opposite directions (\cos \theta = -1). As a dissimilarity function, cosine distance exhibits non-negativity (d(\mathbf{u}, \mathbf{v}) \geq 0) and a form of the identity of indiscernibles (d(\mathbf{u}, \mathbf{v}) = 0 if and only if \mathbf{u} and \mathbf{v} are scalar multiples with the same direction). However, it is not a true metric because it generally violates the triangle inequality; for example, there exist vectors where d(\mathbf{u}, \mathbf{w}) > d(\mathbf{u}, \mathbf{v}) + d(\mathbf{v}, \mathbf{w}). This limitation arises from its focus solely on angular separation, ignoring magnitude differences beyond normalization. To illustrate, consider two-dimensional vectors \mathbf{u} = (3, 4) and \mathbf{v} = (3, 0). The dot product is \mathbf{u} \cdot \mathbf{v} = 9, \|\mathbf{u}\| = 5, and \|\mathbf{v}\| = 3, yielding \cos \theta = 9 / (5 \times 3) = 0.6 and d(\mathbf{u}, \mathbf{v}) = 1 - 0.6 = 0.4. For orthogonal vectors like \mathbf{u} = (3, 4) and \mathbf{w} = (-4, 3), \cos \theta = 0, so d(\mathbf{u}, \mathbf{w}) = 1. For opposite directions, such as \mathbf{u} = (1, 0) and \mathbf{x} = (-1, 0), \cos \theta = -1 and d(\mathbf{u}, \mathbf{x}) = 2. In practice, cosine distance is commonly employed in clustering algorithms, such as hierarchical agglomerative clustering, where dissimilarity measures facilitate grouping vectors by directional alignment rather than absolute distance, particularly for high-dimensional data like text documents. This preference stems from its utility in scenarios where similarity scores need inversion to fit distance-based optimization frameworks.
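
The numerical examples above can be reproduced with a few lines of Python (NumPy assumed); the helper function is an illustrative sketch rather than a library routine.

```python
import numpy as np

def cosine_distance(u, v):
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - cos

print(cosine_distance([3, 4], [3, 0]))   # 0.4 (cos = 0.6)
print(cosine_distance([3, 4], [-4, 3]))  # 1.0 (orthogonal vectors)
print(cosine_distance([1, 0], [-1, 0]))  # 2.0 (opposite directions)
```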

L2-Normalized Euclidean Distance

The L2-normalized Euclidean distance is computed by first dividing each vector by its L2 norm to obtain unit vectors, then calculating the Euclidean distance between these normalized vectors. This process ensures that the distance measures only the angular separation between the vectors, as the magnitudes are standardized to 1. For two unit-normalized vectors \hat{u} and \hat{v}, the squared L2-normalized Euclidean distance derives from the vector subtraction formula: \|\hat{u} - \hat{v}\|^2 = \|\hat{u}\|^2 + \|\hat{v}\|^2 - 2 \hat{u} \cdot \hat{v} = 1 + 1 - 2 \cos \theta = 2(1 - \cos \theta), where \cos \theta is the cosine similarity between \hat{u} and \hat{v}. Thus, the distance is \|\hat{u} - \hat{v}\| = \sqrt{2(1 - \cos \theta)}, linking it directly to cosine similarity via the formula for the chord length on the unit hypersphere. This equivalence shows that minimizing the L2-normalized Euclidean distance is equivalent to maximizing cosine similarity in terms of ranking proximity. Unlike the plain Euclidean distance, which depends on both the direction and magnitude of the vectors and thus penalizes differences in vector lengths, the L2-normalized version ignores magnitude entirely and captures only directional differences. For instance, consider u = [1, 0] and v = [2, 0]; their plain Euclidean distance is \|u - v\| = 1, but after L2 normalization (\hat{u} = [1, 0], \hat{v} = [1, 0]), the normalized distance is 0, consistent with their cosine similarity of 1. This highlights how normalization shifts the focus from absolute scale to relative orientation.
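
The chord-length identity \|\hat{u} - \hat{v}\| = \sqrt{2(1 - \cos \theta)} can be verified directly, as in the sketch below (NumPy assumed; the vectors are the arbitrary examples used earlier in this article).

```python
import numpy as np

def l2_normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
normalized_dist = np.linalg.norm(l2_normalize(u) - l2_normalize(v))

# ||u_hat - v_hat|| = sqrt(2 * (1 - cos theta))
assert np.isclose(normalized_dist, np.sqrt(2.0 * (1.0 - cos)))
print(normalized_dist)

# Same direction, different lengths: plain distance 1, normalized distance 0.
a, b = np.array([1.0, 0.0]), np.array([2.0, 0.0])
print(np.linalg.norm(a - b), np.linalg.norm(l2_normalize(a) - l2_normalize(b)))  # 1.0 0.0
```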

Properties

Algebraic Properties

Cosine similarity exhibits invariance to positive scaling of the vectors, a key algebraic property that distinguishes it from measures sensitive to magnitude. For any scalar k > 0 and non-zero vectors \mathbf{u} and \mathbf{v} in an inner product space, the cosine similarity satisfies \cos(k\mathbf{u}, \mathbf{v}) = \frac{(k\mathbf{u}) \cdot \mathbf{v}}{\|k\mathbf{u}\| \|\mathbf{v}\|} = \frac{k (\mathbf{u} \cdot \mathbf{v})}{k \|\mathbf{u}\| \|\mathbf{v}\|} = \cos(\mathbf{u}, \mathbf{v}). This cancellation occurs because the scaling factor k appears in both the dot product numerator and the norm denominator, rendering the measure dependent solely on the directional relationship between the vectors rather than their absolute lengths. This invariance ensures that proportional rescalings, common in data preprocessing, do not alter similarity assessments. The measure is bounded within the interval [-1, 1], the full range of the cosine function. This boundedness directly follows from the Cauchy-Schwarz inequality, which implies |\mathbf{u} \cdot \mathbf{v}| \leq \|\mathbf{u}\| \|\mathbf{v}\|, so \left| \cos(\mathbf{u}, \mathbf{v}) \right| = \left| \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|} \right| \leq 1. Equality holds at the upper bound when \mathbf{u} and \mathbf{v} are positively linearly dependent, i.e., \mathbf{u} = k \mathbf{v} for some k > 0, yielding \cos(\mathbf{u}, \mathbf{v}) = 1; conversely, \cos(\mathbf{u}, \mathbf{v}) = -1 when k < 0. These conditions stem from the equality case of Cauchy-Schwarz, where the vectors are scalar multiples. Such bounds provide a natural normalization, preventing overflow in computations and facilitating probabilistic interpretations in applications. Cosine similarity lacks linearity under vector addition, meaning it does not preserve additive structure: in general, \cos(\mathbf{u} + \mathbf{v}, \mathbf{w}) \neq \cos(\mathbf{u}, \mathbf{w}) + \cos(\mathbf{v}, \mathbf{w}) or any simple linear combination thereof. This non-linearity arises because the normalization by the norm of the sum \|\mathbf{u} + \mathbf{v}\| introduces a non-linear dependence on the relative magnitudes and angles involved. However, algebraic bounds can be derived using inner product properties; for instance, by the Cauchy-Schwarz and triangle inequalities, |(\mathbf{u} + \mathbf{v}) \cdot \mathbf{w}| \leq \|\mathbf{u} + \mathbf{v}\| \|\mathbf{w}\| \leq (\|\mathbf{u}\| + \|\mathbf{v}\|) \|\mathbf{w}\|, which implies |\cos(\mathbf{u} + \mathbf{v}, \mathbf{w})| \leq 1, though tighter estimates follow from expanding the dot product componentwise. These properties highlight the measure's focus on directional rather than vector-sum behaviors. Computationally, evaluating cosine similarity can encounter stability issues when vector norms approach zero, as the division by small \|\mathbf{u}\| \|\mathbf{v}\| amplifies rounding errors in floating-point arithmetic, occasionally producing values exceeding the theoretical bounds of [-1, 1]. For exactly zero norms, the measure is undefined, requiring assumptions of non-zero inputs or preprocessing to avoid division by zero. Stabilized implementations, such as those using safeguarded normalization (e.g., clamping small norms to a threshold like 10^{-8}), mitigate these effects by ensuring denominator robustness without significantly altering results for well-conditioned vectors.
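
A short Python sketch illustrates both the scale invariance and one possible safeguarded implementation; the epsilon threshold and the clipping to [-1, 1] are illustrative stabilization choices, not a standard library behavior.

```python
import numpy as np

def safe_cosine(u, v, eps=1e-8):
    """Cosine similarity with a clamped denominator and the output clipped to [-1, 1]."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = max(np.linalg.norm(u) * np.linalg.norm(v), eps)  # guards near-zero norms
    return float(np.clip(np.dot(u, v) / denom, -1.0, 1.0))   # guards rounding overshoot

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Invariance to positive scaling: cos(k*u, v) == cos(u, v) for k > 0.
print(safe_cosine(u, v), safe_cosine(1000.0 * u, v))

# Boundedness: the result never leaves [-1, 1], even with floating-point noise.
print(safe_cosine(u, u))  # 1.0 up to rounding; clipping prevents values above 1
```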

Metric Properties

Cosine similarity, defined as the cosine of the angle between two non-zero vectors, serves as a measure of directional alignment rather than a distance metric. As a similarity function, it exhibits symmetry, since the angle between vectors \mathbf{u} and \mathbf{v} is the same as between \mathbf{v} and \mathbf{u}, yielding \cos(\theta_{\mathbf{u},\mathbf{v}}) = \cos(\theta_{\mathbf{v},\mathbf{u}}). It also satisfies reflexivity, with \cos(\theta_{\mathbf{u},\mathbf{u}}) = 1 for any non-zero \mathbf{u}. However, similarities like cosine are not metrics, as metric spaces require non-negative distances satisfying specific axioms, and cosine similarity can take negative values when angles exceed 90 degrees. The associated cosine distance, d(\mathbf{u}, \mathbf{v}) = 1 - \cos(\theta_{\mathbf{u},\mathbf{v}}), transforms the similarity into a non-negative dissimilarity measure ranging from 0 to 2. This distance satisfies non-negativity, as 1 - \cos(\theta) \geq 0 for \theta \in [0, \pi]; symmetry, following from the symmetry of cosine similarity; and the identity of indiscernibles, since d(\mathbf{u}, \mathbf{u}) = 0 and d(\mathbf{u}, \mathbf{v}) = 0 implies \mathbf{u} and \mathbf{v} are parallel with the same direction (up to positive scaling). Despite these properties, cosine distance generally fails to satisfy the triangle inequality d(\mathbf{u}, \mathbf{w}) \leq d(\mathbf{u}, \mathbf{v}) + d(\mathbf{v}, \mathbf{w}). A counterexample arises with vectors separated by small angles, such as \theta = 10^\circ between consecutive pairs: the direct distance approximates 0.0603, exceeding the sum of pairwise distances (each ≈0.0152, totaling 0.0304). This violation occurs because cosine distance approximates \theta^2 / 2 for small \theta, leading to d(\mathbf{u}, \mathbf{w}) \approx 2\theta^2 > \theta^2 \approx d(\mathbf{u}, \mathbf{v}) + d(\mathbf{v}, \mathbf{w}). A relaxed form of the inequality holds under certain restrictions, such as for non-negative vectors in the non-negative orthant, where angles are at most 90 degrees and cosine similarities are non-negative; in this setting, the relaxed inequality enables bounding techniques for similarity search. More generally, the angular triangle inequality \theta_{\mathbf{u},\mathbf{w}} \leq \theta_{\mathbf{u},\mathbf{v}} + \theta_{\mathbf{v},\mathbf{w}} provides derived bounds, such as \cos(\theta_{\mathbf{u},\mathbf{w}}) \geq \cos(\theta_{\mathbf{u},\mathbf{v}} + \theta_{\mathbf{v},\mathbf{w}}), which serve as pseudo-metric approximations suitable for indexing structures like metric trees. Despite its non-metric nature, cosine distance remains valuable in high-dimensional applications, as its focus on angular separation ignores magnitude variations (common in sparse data like text representations), allowing effective similarity detection without the computational overhead of true metrics like the Euclidean distance. The availability of relaxed inequalities supports efficient algorithms in nearest-neighbor search and clustering, where exact metric compliance is often secondary to practical performance.
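
The small-angle counterexample can be checked numerically with three unit vectors in the plane at 0°, 10°, and 20°, as in this sketch (NumPy assumed).

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Three unit vectors in the plane at 0, 10, and 20 degrees.
angles = np.deg2rad([0.0, 10.0, 20.0])
u, v, w = (np.array([np.cos(a), np.sin(a)]) for a in angles)

d_uw = cosine_distance(u, w)                           # ~0.0603
d_sum = cosine_distance(u, v) + cosine_distance(v, w)  # ~0.0152 + 0.0152 = 0.0304

print(d_uw, d_sum, d_uw <= d_sum)  # the triangle inequality fails: prints False
```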

Variants and Extensions

Otsuka-Ochiai Coefficient

The Otsuka-Ochiai coefficient serves as a variant of cosine similarity specifically tailored for binary data or sets, where it is computed as the size of the intersection divided by the square root of the product of the cardinalities of the two sets: \frac{|A \cap B|}{\sqrt{|A| \cdot |B|}}. This formulation arises from representing sets A and B as binary incidence vectors, making it mathematically identical to the cosine similarity in that context. The coefficient derives its name from Yanosuke Otsuka, who proposed it in 1936, and Akira Ochiai, who independently developed and applied a similar measure in 1957 to quantify similarities in species distributions, such as those of soleoid fishes in marine ecosystems. It gained prominence in ecological studies for assessing presence-absence patterns across biological samples, providing a probabilistic lens: the coefficient equals the geometric mean of the two conditional probabilities of co-occurrence, P(A|B) and P(B|A). In contrast to the standard cosine similarity, which operates on real-valued vectors and balances feature weights against magnitudes, the Otsuka-Ochiai coefficient for binary data inherently prioritizes the intersection's role in set overlap, using the geometric mean of the set sizes in the denominator to downweight comparisons between collections of disparate size without incorporating real-valued weights. For instance, in analyzing document term sets, the coefficient evaluates topical similarity by treating each document's unique terms as a set; if document A has 10 terms and document B has 16 terms with 4 shared terms, the similarity is 4 / \sqrt{10 \times 16} \approx 0.32, indicating moderate overlap. Similarly, in genetics, it compares profiles of genetic markers across samples, where shared markers (e.g., insertions/deletions present in both) form the intersection, enabling quantification of relatedness in binary data such as those from InDel markers in crop cultivars.
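
A small Python sketch makes the set formulation and its equivalence to cosine similarity on incidence vectors concrete; the term sets below are synthetic stand-ins for the 10-term and 16-term documents in the example.

```python
import math
import numpy as np

def otsuka_ochiai(a, b):
    """|A intersect B| / sqrt(|A| * |B|) for two sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

shared = {"t1", "t2", "t3", "t4"}
doc_a = {f"a{i}" for i in range(6)} | shared   # 10 terms total
doc_b = {f"b{i}" for i in range(12)} | shared  # 16 terms total
print(otsuka_ochiai(doc_a, doc_b))             # 4 / sqrt(160) ~= 0.316

# Equivalence with cosine similarity of the binary incidence vectors.
vocab = sorted(doc_a | doc_b)
x = np.array([t in doc_a for t in vocab], dtype=float)
y = np.array([t in doc_b for t in vocab], dtype=float)
print(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))  # same value
```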

Soft Cosine Measure

The soft cosine measure is an extension of the standard cosine similarity that incorporates a feature similarity matrix S to account for relationships between dimensions, enabling the handling of semantic or fuzzy matches in scenarios like text comparison where exact overlaps are absent. This approach replaces the direct dot product with a weighted version that reflects feature interdependencies, such as synonymy or conceptual proximity. The formula for the soft cosine measure between vectors \mathbf{u} and \mathbf{v} is given by \cos_{\text{soft}}(\mathbf{u}, \mathbf{v}) = \frac{\sum_{i=1}^n \sum_{j=1}^n s_{ij} u_i v_j}{\sqrt{\sum_{i=1}^n \sum_{j=1}^n s_{ij} u_i u_j} \cdot \sqrt{\sum_{i=1}^n \sum_{j=1}^n s_{ij} v_i v_j}}, where s_{ij} represents the similarity between the i-th and j-th features (e.g., words), typically ranging from 0 to 1 and derived from external resources like thesauri or pre-trained embedding models. When s_{ij} = \delta_{ij} (the Kronecker delta, assuming orthogonal features), the measure reduces to the standard cosine similarity. This measure offers advantages in natural language processing by capturing non-exact matches, such as synonyms, which improves retrieval and classification accuracy without requiring dimensionality-reduction techniques like latent semantic analysis. For instance, empirical evaluations on tasks like entrance-exam question answering have shown it outperforming traditional cosine similarity, with c@1 scores increasing from 0.42 to 0.45. A representative example involves comparing two short documents using Word2Vec embeddings to populate the similarity matrix S: "Obama speaks to the media in Illinois" and "The president greets the press in Chicago". After preprocessing (e.g., stopword removal and tokenization), convert each to TF-IDF weighted bag-of-words vectors \mathbf{u} and \mathbf{v}. Compute s_{ij} as the cosine similarity between the Word2Vec vectors of terms i and j (e.g., high similarity between "media" and "press", or between "Illinois" and "Chicago" as locations). Apply the soft cosine formula to yield a score of approximately 0.26, indicating moderate semantic relatedness despite no shared content words; an unrelated document like "Oranges are my favorite fruit" scores near 0.
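
The following toy Python sketch implements the formula with a hand-specified 3-by-3 similarity matrix (standing in for Word2Vec-derived similarities) over the hypothetical vocabulary ["media", "press", "fruit"]; the 0.8 similarity between "media" and "press" is an assumed value for illustration.

```python
import numpy as np

def soft_cosine(u, v, S):
    """Soft cosine measure with feature-similarity matrix S (S[i][i] = 1)."""
    u, v, S = np.asarray(u, float), np.asarray(v, float), np.asarray(S, float)
    num = u @ S @ v
    den = np.sqrt(u @ S @ u) * np.sqrt(v @ S @ v)
    return float(num / den)

# Vocabulary: ["media", "press", "fruit"]; "media" and "press" treated as near-synonyms.
S = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

doc1 = [1, 0, 0]  # mentions "media"
doc2 = [0, 1, 0]  # mentions "press"

print(soft_cosine(doc1, doc2, S))          # 0.8: related despite no shared terms
print(soft_cosine(doc1, doc2, np.eye(3)))  # 0.0: reduces to the standard cosine
```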

Applications

Information Retrieval

In information retrieval, cosine similarity serves as a core measure for assessing the relevance of documents to a user query within the vector space model. Documents and queries are represented as vectors in a high-dimensional space, where each dimension corresponds to a term in the vocabulary, and vector components are weighted using term frequency-inverse document frequency (TF-IDF). The TF-IDF weighting scheme assigns higher values to terms that are frequent in a specific document but rare across the entire corpus, emphasizing discriminative content. Cosine similarity then quantifies relevance by computing the cosine of the angle between the query vector and document vector, focusing on their directional alignment rather than magnitude differences, which captures semantic overlap effectively. This approach was popularized in the 1970s and 1980s through the vector space model, notably implemented in the SMART retrieval system developed by Gerard Salton at Cornell University. The SMART system, originating in the 1960s and refined through experiments in the 1970s, employed cosine similarity to rank documents by their proximity to queries in term space, demonstrating improved retrieval performance over earlier models on standard test collections. Salton's seminal work formalized this integration, establishing cosine similarity as a standard ranking function in experimental IR systems during that era. For example, consider a query "digital cameras" processed against a document containing the terms "digital cameras and video cameras," assuming a corpus of 10 million documents with stop words like "and" removed. The query might have components [4, 4] for "digital" (TF=1, IDF≈4) and "cameras" (TF=1, IDF≈4), while the document, after length normalization, has weighted components approximately [0.71, 0.71] for those terms (with zero elsewhere). The cosine similarity score is the dot product divided by the product of the norms, yielding a value close to 1 for strong alignment on the shared terms, allowing the system to rank this document highly relative to others with lower angular similarity. Such scores enable efficient sorting of large document sets to retrieve the most relevant top-k results. Cosine similarity offers advantages over alternatives like Euclidean distance in IR, particularly for high-dimensional, sparse data typical of text corpora, where most terms do not appear in most documents. By normalizing for vector length, cosine avoids biasing against longer documents that naturally have higher term counts, ensuring fair comparison based on term distribution orientation; Euclidean distance, conversely, penalizes length disparities, leading to suboptimal rankings in such sparse environments. This property has made cosine the preferred metric in vector-based IR since its early adoption.
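
A minimal ranking sketch in Python, assuming scikit-learn is installed, shows the TF-IDF plus cosine pipeline end to end; the three documents and the query are invented examples echoing the one above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "digital cameras and video cameras",
    "film cameras for professional photography",
    "video streaming on mobile phones",
]
query = ["digital cameras"]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs)  # sparse TF-IDF document vectors
query_vec = vectorizer.transform(query)      # the query in the same term space

scores = cosine_similarity(query_vec, doc_matrix).ravel()
for idx in scores.argsort()[::-1]:           # rank documents by descending similarity
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```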

Machine Learning and Clustering

In machine learning, cosine similarity plays a key role in clustering algorithms for high-dimensional data, where it emphasizes directional alignment over magnitude differences. The spherical k-means algorithm, a variant of traditional k-means, employs cosine similarity to assign data points to clusters by maximizing the cosine similarity between points and centroids, proving particularly effective for sparse datasets like text corpora or gene expression profiles. This approach clusters items such as user profiles by grouping those with similar orientation in feature space, avoiding distortions from varying vector lengths. In hierarchical clustering, cosine similarity is used as a linkage metric to progressively merge clusters of similar items, such as documents or behavioral vectors, based on average pairwise cosines within and between groups. Cosine similarity is also integral to recommendation systems, especially in collaborative filtering, where it quantifies user-item or item-item similarities from rating matrices. In item-based collaborative filtering, cosine similarity computes the overlap in user preferences for pairs of items, enabling predictions by weighting ratings from the most similar items; this method enhances scalability for large systems like e-commerce platforms. For instance, in embedding spaces generated by neural models, cosine-based nearest-neighbor search identifies similar user profiles or content features, supporting personalized suggestions in applications akin to Netflix's viewing recommendations. Its effectiveness in these tasks stems from robustness to sparse, high-dimensional data, as it ignores zero entries and prioritizes shared non-zero features, making it suitable for scenarios like user-item interactions with many missing ratings. However, cosine similarity remains sensitive to the curse of dimensionality: in extremely high dimensions, pairwise angles concentrate and vectors tend toward near-orthogonality, reducing its ability to distinguish subtle similarities. Its invariance to vector length further aids these applications by focusing on relative orientations.
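
As a sketch of the item-based collaborative-filtering idea (not a production recommender), the Python snippet below computes item-item cosine similarities from a small hypothetical user-item rating matrix.

```python
import numpy as np

# Hypothetical rating matrix: rows are users, columns are items, 0 means "not rated".
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

# Item-item cosine similarity: compare items via their columns of user ratings.
norms = np.linalg.norm(R, axis=0)
norms[norms == 0] = 1.0                       # guard against all-zero (unrated) items
item_sim = (R.T @ R) / np.outer(norms, norms)

print(np.round(item_sim, 2))
# An item-based recommender would then score an unseen item for a user as a
# similarity-weighted average of that user's ratings on the most similar items.
```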

Historical Development

Origins in Geometry

The roots of cosine similarity lie in classical geometry, where the cosine function underpins the measurement of angles between lines or directions. The foundational geometric relation appears in Euclid's Elements (circa 300 BCE), specifically in Book II, Propositions 12 and 13, which establish the relationship between the squares of a triangle's sides and the included angle, equivalent to the law of cosines, though without explicit trigonometric terminology. These propositions for obtuse triangles (II.12) and acute triangles (II.13) provided an early algebraic expression for angular dependence in spatial configurations. The explicit cosine function developed later within trigonometry, building on ancient chord-based methods but formalized in the 18th and 19th centuries amid advances in analysis and calculus. Leonhard Euler's Introductio in analysin infinitorum (1748) standardized the sine and cosine as functions of angles, integrating them into power series and complex analysis, which extended their utility beyond pure geometry. By the early 19th century, these functions were routinely applied in surveying, astronomy, and engineering, emphasizing the cosine's role in projecting lengths along directions. In the 19th century, cosine-based angle measurement entered vectorial frameworks through William Rowan Hamilton's quaternions (introduced 1843), which separated scalar and vector components to handle rotations and orientations in three dimensions. This evolved into modern vector analysis via J. Willard Gibbs and Oliver Heaviside's independent developments in the 1880s, where the dot product emerged as a tool for quantifying directional alignment between vectors. Gibbs's Vector Analysis (1901, with Edwin Bidwell Wilson) formalized this as a scalar operation capturing angular similarity. Earlier notational precursors trace to Leibniz's 17th-century proposals for a symbolic geometry of position, but the geometric interpretation of the dot product solidified in these 19th-century works. Pre-computing applications highlighted the cosine's practical value in physics, particularly for directional forces. In classical mechanics, the work done by a force \vec{F} over a displacement \vec{d} is given by W = \vec{F} \cdot \vec{d} = \|\vec{F}\| \|\vec{d}\| \cos \theta, where \theta is the angle between them, accounting only for the force component parallel to the motion. This formulation, introduced by Gaspard-Gustave de Coriolis in his 1829 Calcul de l'effet des machines, quantified energy transfer in machines and resolved components of effort, predating vector notation but relying on cosine projections. The transition to similarity measures occurred implicitly in early 20th-century statistics, where cosine-like angular interpretations described correlations between variables. Karl Pearson's development of the correlation coefficient in 1895 treated standardized data as vectors, with the coefficient equaling the cosine of the angle between them, enabling quantitative assessment of linear associations in biological and social data.

Adoption in Computing

Cosine similarity entered computing prominently in the 1970s through its integration into information retrieval systems, particularly via Gerard Salton's vector space model introduced in 1975. This model represented documents and queries as vectors in term space, employing cosine similarity to compute the angle between vectors and thereby rank retrieval results based on relevance, independent of vector magnitudes. This approach marked a shift from Boolean models to ranked retrieval, enabling more effective handling of sparse, high-dimensional data in early search engines. In the 1980s and 1990s, cosine similarity expanded into natural language processing and machine learning, underpinning techniques like latent semantic analysis (LSA) for dimensionality reduction and semantic matching in text processing. LSA, patented in 1988 and detailed in 1990, used singular value decomposition on term-document matrices followed by cosine similarity to capture latent relationships, improving retrieval accuracy over raw term matching. It also gained traction in bioinformatics starting in the late 1990s for gene expression analysis, where cosine measures (or equivalent Pearson correlations on centered data) compared high-dimensional profiles from microarray data to cluster co-expressed genes. From the 2000s onward, cosine similarity boomed in machine learning and recommender systems alongside the proliferation of dense vector embeddings. It powered content-based and collaborative filtering in recommenders, such as the item-similarity computations in item-based algorithms from 2001, which scaled user-item interactions via cosine for personalized suggestions. Integration into open-source libraries like scikit-learn, initiated in 2007, standardized its use for tasks including text clustering and nearest-neighbor search, democratizing access for broader applications. Up to 2025, cosine similarity endures as a foundational measure in natural language processing, enhanced by transformer architectures, where it quantifies semantic relatedness between contextual embeddings in tasks such as paraphrase detection and retrieval-augmented generation. While advancements have refined embedding representations, no paradigm shifts have supplanted cosine's role in angle-based comparisons, though recent critiques highlight its limitations in anisotropic embedding spaces.

    Jan 24, 2014 · We also include in our analysis four "traditional" proximity measures, i.e., Cosine similarity - adapted as distance (COS), Euclidean distance ( ...Missing: history | Show results with:history