Cosine similarity
Cosine similarity is a measure of similarity between two non-zero vectors in an inner product space, obtained by calculating the cosine of the angle between them, which quantifies their directional alignment regardless of magnitude.[1] Mathematically, it is defined as the dot product of the two vectors divided by the product of their Euclidean norms, expressed as \cos \theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \cdot \|\mathbf{B}\|}, where the result ranges from -1 (opposite directions) to 1 (identical directions), with 0 indicating orthogonality.[2] This measure is particularly valuable in high-dimensional spaces, such as those encountered in text analysis, because it normalizes for vector length and focuses solely on orientation.[1]

The concept draws from linear algebra, where the cosine captures angular proximity, and it exhibits key properties such as scale invariance, making it robust for comparing datasets of varying sizes or densities.[2] For mean-centered variables, cosine similarity equals the Pearson correlation coefficient, linking it to the statistical analysis of linear relationships.[1] Unlike distance metrics such as Euclidean distance, which are sensitive to magnitude, cosine similarity treats vectors of different lengths as equally similar whenever their directions match, which is advantageous for sparse data representations.[2]

Cosine similarity gained prominence in information retrieval through its application to term frequency-inverse document frequency (tf-idf) vectors; the inverse document frequency component of tf-idf weighting was introduced by Karen Spärck Jones in 1972 as a statistical interpretation of term specificity to improve document ranking. Today, it underpins numerous fields, including natural language processing for semantic search and document clustering, recommendation systems for user-item matching, and machine learning tasks such as topic modeling and image recognition.[2] Its efficiency on high-dimensional, sparse data has made it a staple in large-scale applications such as search engines and large language models.
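As a minimal sketch of the information-retrieval use case described above, the following example computes pairwise cosine similarities between tf-idf document vectors using scikit-learn; the library choice, the example documents, and the variable names are illustrative assumptions rather than part of the source text:

```python
# Illustrative sketch: cosine similarity between tf-idf document vectors.
# Assumes scikit-learn is installed; the documents are made-up examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]

# Build sparse tf-idf vectors, one row per document.
tfidf = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarities; values lie in [0, 1] because tf-idf weights are non-negative.
print(cosine_similarity(tfidf).round(3))
```

In this sketch, the first two documents share most of their terms and therefore score much closer to 1 with each other than with the third.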
Mathematical Foundations
Vectors in Inner Product Spaces
In mathematics, vectors are abstract elements of a real vector space, which is a set equipped with operations of addition and scalar multiplication by real numbers, satisfying axioms such as commutativity of addition, associativity, existence of a zero vector, additive inverses, distributivity, and compatibility with scalar multiplication.[3] When this vector space is endowed with an additional structure called an inner product, it becomes a real inner product space, providing a framework for geometric interpretation in abstract settings.[4] Finite-dimensional examples include the Euclidean space \mathbb{R}^n, where vectors are ordered n-tuples of real numbers representing points or directions in n-dimensional space.[3] The inner product, denoted \langle \cdot, \cdot \rangle, is a real-valued function of pairs of vectors that satisfies three key properties: linearity in the first argument, symmetry, and positive-definiteness.[5] Specifically, for vectors u, v, w in the space and a scalar \alpha \in \mathbb{R}, these properties read as follows (a numerical check appears after the list):
- Linearity in the first argument: \langle \alpha u + w, v \rangle = \alpha \langle u, v \rangle + \langle w, v \rangle,
- Symmetry: \langle u, v \rangle = \langle v, u \rangle,
- Positive-definiteness: \langle u, u \rangle \geq 0, with equality if and only if u = 0.
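The standard dot product on \mathbb{R}^n is the prototypical inner product satisfying these axioms. The sketch below spot-checks them numerically for sample vectors; it is an illustration on made-up values, not a proof, and the vectors and scalar are assumptions introduced here:

```python
import numpy as np

# Spot-check the inner product axioms for the standard dot product on R^3.
# The vectors and the scalar are arbitrary example values.
u = np.array([1.0, -2.0, 0.5])
v = np.array([3.0, 0.0, 4.0])
w = np.array([-1.0, 2.0, 2.0])
alpha = 2.5

inner = np.dot  # <u, v> = sum_i u_i * v_i

# Linearity in the first argument: <alpha*u + w, v> == alpha*<u, v> + <w, v>
assert np.isclose(inner(alpha * u + w, v), alpha * inner(u, v) + inner(w, v))

# Symmetry: <u, v> == <v, u>
assert np.isclose(inner(u, v), inner(v, u))

# Positive-definiteness: <u, u> >= 0, with equality only for the zero vector
assert inner(u, u) > 0 and np.isclose(inner(np.zeros(3), np.zeros(3)), 0.0)

print("All three axioms hold for these sample vectors.")
```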
Dot Product and Vector Norms
The dot product of two vectors \mathbf{u} = (u_1, \dots, u_n) and \mathbf{v} = (v_1, \dots, v_n) in Euclidean space is defined as the scalar \mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^n u_i v_i.[6] This algebraic formulation computes the dot product by multiplying corresponding components and summing the results.[6] The Euclidean norm, or \ell^2-norm, of a vector \mathbf{u} = (u_1, \dots, u_n) measures its length and is given by \|\mathbf{u}\| = \sqrt{\mathbf{u} \cdot \mathbf{u}} = \sqrt{\sum_{i=1}^n u_i^2}.[7] This norm satisfies key properties, including positivity (\|\mathbf{u}\| > 0 if \mathbf{u} \neq \mathbf{0}, and \|\mathbf{u}\| = 0 if and only if \mathbf{u} = \mathbf{0}) and homogeneity (\|c \mathbf{u}\| = |c| \|\mathbf{u}\| for any scalar c).[7]

For example, consider the 2D vectors \mathbf{u} = (3, 4) and \mathbf{v} = (1, 2). The dot product is \mathbf{u} \cdot \mathbf{v} = 3 \cdot 1 + 4 \cdot 2 = 11.[6] The norm of \mathbf{u} is \|\mathbf{u}\| = \sqrt{3^2 + 4^2} = \sqrt{25} = 5, the length of the vector.[7] The dot product relates to vector length via the norm, since the squared length is \|\mathbf{u}\|^2 = \mathbf{u} \cdot \mathbf{u}.[6] Additionally, two nonzero vectors are orthogonal (perpendicular) if and only if their dot product is zero, indicating no directional alignment between them. For instance, \mathbf{u} = (1, 0) and \mathbf{v} = (0, 1) satisfy \mathbf{u} \cdot \mathbf{v} = 1 \cdot 0 + 0 \cdot 1 = 0, confirming orthogonality.
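These computations can be reproduced directly; the following is a minimal NumPy sketch using the same example vectors, offered as an illustration rather than a prescribed implementation:

```python
import numpy as np

# Worked example from the text: dot product, norm, and an orthogonality check.
u = np.array([3.0, 4.0])
v = np.array([1.0, 2.0])

dot_uv = np.dot(u, v)        # 3*1 + 4*2 = 11
norm_u = np.linalg.norm(u)   # sqrt(3**2 + 4**2) = 5
print(dot_uv, norm_u)

# Orthogonal vectors have a zero dot product.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(np.dot(e1, e2))        # 0.0
```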
Core Definition
Cosine Similarity Formula
The cosine similarity between two non-zero vectors \mathbf{u} and \mathbf{v} in an inner product space is defined as the cosine of the angle \theta between them, given by the formula \cos \theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}, where \mathbf{u} \cdot \mathbf{v} denotes the dot product and \|\mathbf{u}\| = \sqrt{\mathbf{u} \cdot \mathbf{u}} is the Euclidean norm (magnitude) of \mathbf{u}.[8] This expression normalizes the dot product by the product of the vectors' magnitudes, yielding a measure that depends only on their directional alignment rather than their lengths.[9]

The formula arises from the geometric interpretation of the dot product in Euclidean space. To derive it, consider the triangle formed by vectors \mathbf{u} and \mathbf{v} originating from a common point. The law of cosines states that for the side opposite angle \theta, whose length is \|\mathbf{u} - \mathbf{v}\|, \|\mathbf{u} - \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2 \|\mathbf{u}\| \|\mathbf{v}\| \cos \theta. Expanding the left side using the algebraic definition of the dot product gives \|\mathbf{u} - \mathbf{v}\|^2 = (\mathbf{u} - \mathbf{v}) \cdot (\mathbf{u} - \mathbf{v}) = \mathbf{u} \cdot \mathbf{u} - 2 \mathbf{u} \cdot \mathbf{v} + \mathbf{v} \cdot \mathbf{v} = \|\mathbf{u}\|^2 - 2 \mathbf{u} \cdot \mathbf{v} + \|\mathbf{v}\|^2. Equating the two expressions and solving for the cosine term yields \mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos \theta, or equivalently, the cosine similarity formula above.[8][10]

The value of cosine similarity ranges from -1 to 1. A value of 1 indicates that the vectors point in the same direction (zero angle between them), 0 signifies orthogonality (a right angle), and -1 means they point in exactly opposite directions (an angle of 180 degrees).[9] In applications with non-negative components, such as term-frequency vectors in information retrieval, the range is restricted to [0, 1].[9]

To compute cosine similarity, first calculate the dot product, then the norms, and finally divide. For example, consider vectors \mathbf{u} = (1, 2) and \mathbf{v} = (3, 4) in \mathbb{R}^2 (a code sketch follows the list):
- Dot product: \mathbf{u} \cdot \mathbf{v} = 1 \cdot 3 + 2 \cdot 4 = 11.
- Norm of \mathbf{u}: \|\mathbf{u}\| = \sqrt{1^2 + 2^2} = \sqrt{5} \approx 2.236.
- Norm of \mathbf{v}: \|\mathbf{v}\| = \sqrt{3^2 + 4^2} = \sqrt{25} = 5.
- Cosine similarity: \cos \theta = \frac{11}{\sqrt{5} \cdot 5} \approx \frac{11}{11.180} \approx 0.984.
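The same steps can be packaged as a small function. The sketch below is a NumPy-based illustration (the function name and structure are choices made here, not a canonical implementation) and reproduces the value of approximately 0.984 for the example vectors:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
print(round(cosine_similarity(u, v), 3))  # ~0.984, i.e. 11 / (sqrt(5) * 5)
```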