Sum of squares
In mathematics, the sum of squares refers to the aggregation of the squared values of a set of numbers, serving as a foundational operation with applications across algebra, statistics, number theory, geometry, and optimization.[1] This simple yet powerful concept underpins measures of dispersion, algebraic identities, and the representation of integers as sums of integer squares.[2] For a finite set of real numbers x_1, x_2, \dots, x_n, the sum of squares is defined as \sum_{i=1}^n x_i^2.[3]

In statistics, the sum of squares is a key measure of variability within a dataset, calculated as the total of squared differences between each observation and the sample mean.[2] It partitions data variation into components such as the total sum of squares (SST or TSS), which captures overall deviation from the mean via \sum (y_i - \bar{y})^2; the regression sum of squares (SSR), quantifying variation explained by a model as \sum (\hat{y}_i - \bar{y})^2; and the error sum of squares (SSE), representing unexplained residuals as \sum (y_i - \hat{y}_i)^2, so that SST = SSR + SSE.[4] These decompositions are essential in regression analysis and analysis of variance (ANOVA), enabling assessments of model fit through metrics such as R-squared (SSR/SST) and hypothesis testing via F-statistics derived from mean squares (sums of squares divided by their degrees of freedom).[2]

Algebraically, sums of squares appear in identities that facilitate expansions and simplifications, such as the two-variable identity x^2 + y^2 = (x + y)^2 - 2xy, which follows directly from the binomial square formula.[3] For sequences, closed-form formulas exist, including the sum of squares of the first n natural numbers, \sum_{k=1}^n k^2 = \frac{n(n+1)(2n+1)}{6}, which can be proven by mathematical induction or by telescoping series.[3] Similar expressions hold for even and odd numbers, for example \sum_{k=1}^n (2k)^2 = \frac{2n(n+1)(2n+1)}{3}.[3]

In number theory, the sum of squares function r_k(n) counts the ways a positive integer n can be expressed as the sum of k integer squares, allowing zeros and distinguishing order.[1] Landmark results include Fermat's theorem on sums of two squares (1636), which states that a prime p is expressible as p = a^2 + b^2 if and only if p = 2 or p \equiv 1 \pmod{4}; Euler extended this to all positive integers, which are sums of two squares exactly when every prime congruent to 3 \pmod{4} appears with an even exponent in their factorization.[1] Lagrange's four-square theorem (1770) asserts that every natural number is a sum of at most four squares, while Legendre's three-square theorem (1798) characterizes those expressible as three squares, excluding integers of the form 4^a(8b + 7).[5] These theorems, rooted in Diophantus's early work, connect arithmetic progressions and modular forms to deeper analytic number theory.[1]
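As a brief illustrative check of two of the formulas above, the following Python sketch verifies the closed-form expression \sum_{k=1}^n k^2 = \frac{n(n+1)(2n+1)}{6} for small n and searches for two-square representations of small integers; the helper names are ad hoc and the snippet is not drawn from the cited sources.

```python
from math import isqrt

def sum_of_squares_formula(n: int) -> int:
    """Closed form n(n+1)(2n+1)/6 for 1^2 + 2^2 + ... + n^2."""
    return n * (n + 1) * (2 * n + 1) // 6

# Brute-force check of the closed form for the first few hundred values of n.
assert all(sum(k * k for k in range(1, n + 1)) == sum_of_squares_formula(n)
           for n in range(1, 300))

def two_square_representations(p: int) -> list[tuple[int, int]]:
    """All pairs (a, b) with 0 <= a <= b and a^2 + b^2 = p."""
    reps = []
    for a in range(isqrt(p) + 1):
        b = isqrt(p - a * a)
        if b * b == p - a * a and a <= b:
            reps.append((a, b))
    return reps

# Consistent with Fermat's criterion: 13 ≡ 1 (mod 4) has a representation,
# while 7 ≡ 3 (mod 4) has none.
print(two_square_representations(13))  # [(2, 3)]
print(two_square_representations(7))   # []
```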
Fundamental Concepts
Definition and Notation
In mathematics, the sum of squares refers to the aggregate of the squares of a finite collection of real numbers a_1, \dots, a_n, formally defined as \sum_{i=1}^n a_i^2. This expression quantifies the total squared magnitude of the numbers and forms the foundation for various norms and measures in algebra and analysis. For instance, in the simple case of two scalars a and b, the sum of squares is a^2 + b^2, which represents the squared distance from the origin in the plane spanned by these values.[6] In vector spaces, the sum of squares commonly appears in the notation for the squared Euclidean norm of a vector \mathbf{x} = (x_1, \dots, x_n) \in \mathbb{R}^n, denoted \|\mathbf{x}\|^2 = \sum_{i=1}^n x_i^2. This notation emphasizes the connection to the length of the vector, since the norm itself is the square root of this sum.

Historically, early explorations of such expressions were linked to quadratic forms; Leonhard Euler, in his 18th-century investigations of Diophantine problems, employed notations for quadratic polynomials such as a x^2 \pm b x \pm c = 0, and quadratic expressions of this kind underlie the study of sums of squares in multivariate settings.[7] The concept extends naturally to complex numbers, where the squared modulus of z \in \mathbb{C} is defined as |z|^2 = z \overline{z}, with \overline{z} denoting the complex conjugate. For a complex vector, this generalizes componentwise to \sum_i |z_i|^2.[8]
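The following short Python sketch (using NumPy; the variable names are chosen here for illustration) shows the three notations described above: the plain sum of squares of scalars, the squared Euclidean norm of a real vector, and the sum of squared moduli of complex entries.

```python
import numpy as np

# Two scalars a and b: their sum of squares a^2 + b^2.
a, b = 3.0, 4.0
print(a**2 + b**2)                  # 25.0, the squared distance from the origin

# Squared Euclidean norm of a real vector: ||x||^2 = sum_i x_i^2.
x = np.array([1.0, 2.0, 2.0])
print(np.dot(x, x))                 # 9.0
print(np.linalg.norm(x) ** 2)       # 9.0, the norm is the square root of the sum

# Complex vector: componentwise squared moduli, sum_i |z_i|^2 = sum_i z_i * conj(z_i).
z = np.array([1 + 1j, 2 - 1j])
print(np.sum(z * np.conj(z)).real)  # 7.0  (|1+i|^2 = 2, |2-i|^2 = 5)
```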
Basic Properties
The sum of squares of real numbers is fundamentally non-negative: \sum_{i=1}^n a_i^2 \geq 0 for any real a_i, with equality holding if and only if a_1 = a_2 = \dots = a_n = 0.[9] This property follows directly from the non-negativity of squares of real numbers and the additivity of the sum.[10] Sums of squares are homogeneous of degree 2, meaning that scaling the inputs by a constant c \in \mathbb{R} scales the sum by c^2: \sum_{i=1}^n (c a_i)^2 = c^2 \sum_{i=1}^n a_i^2.[10] This homogeneity arises from the quadratic nature of the expression and is preserved under linear transformations of the variables.[10]

In the context of vector spaces, the sum of squares \sum_{i=1}^n x_i^2 can be expressed as the quadratic form \mathbf{x}^T A \mathbf{x}, where \mathbf{x} = (x_1, \dots, x_n)^T is the column vector and A is the n \times n identity matrix, which is positive definite.[10] More generally, any positive semidefinite quadratic form \mathbf{x}^T A \mathbf{x}, with A symmetric and all eigenvalues nonnegative, can be written as a weighted sum of squares after diagonalization.[9] This connection underscores the role of sums of squares in defining norms and inner products in Euclidean spaces.

A key inequality involving sums of squares is the Cauchy-Schwarz inequality, which states that for real sequences a_i and b_i, \left( \sum_{i=1}^n a_i b_i \right)^2 \leq \left( \sum_{i=1}^n a_i^2 \right) \left( \sum_{i=1}^n b_i^2 \right), with equality if and only if the sequences are proportional.[11] A proof sketch via quadratic forms considers the expression \sum_{i=1}^n (a_i + \lambda b_i)^2 \geq 0 for all real \lambda, which expands to a quadratic in \lambda: \left( \sum b_i^2 \right) \lambda^2 + 2 \left( \sum a_i b_i \right) \lambda + \sum a_i^2 \geq 0.[11] For this quadratic to be nonnegative for all \lambda, its discriminant must be nonpositive, yielding \left( \sum a_i b_i \right)^2 \leq \left( \sum a_i^2 \right) \left( \sum b_i^2 \right).[11]
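A small numerical sketch of these properties (illustrative only, using randomly generated data) checks the quadratic-form representation with the identity matrix, the degree-2 homogeneity, and the Cauchy-Schwarz inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
a = rng.normal(size=5)
b = rng.normal(size=5)

# Sum of squares as the quadratic form x^T A x with A the identity matrix.
A = np.eye(5)
assert np.isclose(x @ A @ x, np.sum(x**2))

# Homogeneity of degree 2: scaling the inputs by c scales the sum by c^2.
c = 3.7
assert np.isclose(np.sum((c * x)**2), c**2 * np.sum(x**2))

# Cauchy-Schwarz: (sum a_i b_i)^2 <= (sum a_i^2)(sum b_i^2).
assert np.dot(a, b)**2 <= np.sum(a**2) * np.sum(b**2)
print("all properties verified on this sample")
```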
Applications in Statistics
Sum of Squared Errors
The sum of squared errors (SSE), also known as the residual sum of squares, is a measure of the discrepancy between observed values y_i and predicted values \hat{y}_i in a dataset, defined as \text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2, where n is the number of observations.[12] This metric quantifies the total squared deviation of the residuals, providing a way to assess how well a model fits the data, with smaller values indicating a better fit.[12] The SSE plays a central role in the method of least squares, where the objective is to minimize this sum in order to determine the model parameters that best approximate the observed data.[13] In simple linear regression, minimizing the SSE leads to the ordinary least squares estimator for the slope coefficient, \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, where \bar{x} and \bar{y} are the sample means of the predictor x_i and response y_i, respectively.[14] This approach was developed in the early 19th century by Adrien-Marie Legendre and Carl Friedrich Gauss, primarily for fitting models to astronomical observations; Legendre published the first clear exposition in 1805, while Gauss claimed prior invention around 1795.[15]

For example, consider fitting a line to three data points: (1, 2), (2, 4), and (3, 5). An initial guess of the line \hat{y} = 2x yields predicted values 2, 4, and 6, giving residuals 0, 0, and -1, so SSE = 0^2 + 0^2 + (-1)^2 = 1. Optimizing via least squares gives the fitted line \hat{y} = 1.5x + \frac{2}{3}, with predicted values \frac{13}{6}, \frac{11}{3}, and \frac{31}{6}, residuals -\frac{1}{6}, \frac{1}{3}, and -\frac{1}{6}, and SSE = \left(-\frac{1}{6}\right)^2 + \left(\frac{1}{3}\right)^2 + \left(-\frac{1}{6}\right)^2 = \frac{1}{6}, demonstrating the reduction achieved by minimization.
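The worked example above can be reproduced with a few lines of Python (the function and variable names are chosen here for illustration, not taken from a particular library):

```python
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 5.0]

def sse(slope: float, intercept: float) -> float:
    """Sum of squared errors for the line y_hat = slope * x + intercept."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

print(sse(2.0, 0.0))        # initial guess y_hat = 2x  ->  SSE = 1.0

# Ordinary least squares estimates of the slope and intercept.
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
beta1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
beta0 = y_bar - beta1 * x_bar
print(beta1, beta0)         # 1.5 and 2/3
print(sse(beta1, beta0))    # 1/6 ≈ 0.1667, the minimum over all lines
```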
Variance and Analysis of Variance
In statistics, the sum of squared deviations measures the total variability in a dataset relative to its mean, providing a foundational quantity for assessing dispersion. For a sample of n observations x_1, x_2, \dots, x_n with sample mean \bar{x}, the total sum of squares (SST) is defined as \text{SST} = \sum_{i=1}^n (x_i - \bar{x})^2. This quantity captures the overall spread of the data around the central tendency. The sample variance, which normalizes this sum to estimate population variability, is then computed as s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2, where the denominator n-1 applies Bessel's correction to yield an unbiased estimator of the population variance.[16]

This decomposition of variability becomes central in analysis of variance (ANOVA), a method developed by Ronald Fisher to partition the total sum of squares into components attributable to different sources, enabling inference about group differences. In one-way ANOVA, which compares means across k groups with N total observations, SST decomposes into the between-group sum of squares (SSB), capturing variability due to group means, and the within-group sum of squares (SSW), reflecting variability within groups: \text{SST} = \text{SSB} + \text{SSW}, where \text{SSB} = \sum_{j=1}^k n_j (\bar{y}_j - \bar{y})^2 and \text{SSW} = \sum_{j=1}^k \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2, with \bar{y}_j the mean of group j and \bar{y} the grand mean. This partitioning, rooted in least squares principles, allows testing whether observed differences in group means exceed what would be expected by chance.[17]

The ANOVA table summarizes this decomposition, including mean squares (MS) obtained by dividing sums of squares by their degrees of freedom (df). For one-way ANOVA, the degrees of freedom are df_between = k - 1 and df_within = N - k, with total df = N - 1. The between-group mean square is MSB = SSB / (k - 1), and the within-group mean square is MSW = SSW / (N - k). The F-statistic, which tests the null hypothesis of equal group means, is F = MSB / MSW, following an F-distribution with (k - 1, N - k) degrees of freedom under the null. A large F-value indicates that between-group variability significantly exceeds within-group variability, leading to rejection of the null hypothesis.[18]

To illustrate, consider a one-way ANOVA comparing crop yields across three fertilizer treatments (k = 3), with group sizes n_1 = n_2 = n_3 = 5 (N = 15) and yields (in kg): Group 1: 4.2, 4.5, 4.8, 5.0, 4.7; Group 2: 5.1, 5.3, 5.6, 5.4, 5.2; Group 3: 6.0, 6.2, 5.9, 6.1, 6.3. The grand mean is \bar{y} \approx 5.35. Computations yield SST ≈ 5.96 (df = 14), SSB ≈ 5.34 (df = 2), and SSW ≈ 0.62 (df = 12), with MSB ≈ 2.67, MSW ≈ 0.0517, and F ≈ 51.7 (p < 0.001), indicating significant differences among the treatments; a computational sketch reproducing these figures follows the table. The resulting ANOVA table is:
| Source | SS | df | MS | F |
|---|---|---|---|---|
| Between | 5.34 | 2 | 2.67 | 51.7 |
| Within | 0.62 | 12 | 0.0517 | |
| Total | 5.96 | 14 | | |
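The figures in this example and table can be reproduced with the short Python sketch below (the data are those given above; variable names are chosen here for illustration). As a cross-check, the same F-statistic is returned by scipy.stats.f_oneway(*groups).

```python
groups = [
    [4.2, 4.5, 4.8, 5.0, 4.7],   # fertilizer treatment 1
    [5.1, 5.3, 5.6, 5.4, 5.2],   # fertilizer treatment 2
    [6.0, 6.2, 5.9, 6.1, 6.3],   # fertilizer treatment 3
]
k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / N
group_means = [sum(g) / len(g) for g in groups]

# Between-group, within-group, and total sums of squares.
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
ssw = sum((yi - m) ** 2 for g, m in zip(groups, group_means) for yi in g)
sst = sum((yi - grand_mean) ** 2 for g in groups for yi in g)

# Mean squares and the F-statistic.
msb, msw = ssb / (k - 1), ssw / (N - k)
f_stat = msb / msw
print(round(ssb, 2), round(ssw, 2), round(sst, 2))     # 5.34 0.62 5.96
print(round(msb, 2), round(msw, 4), round(f_stat, 1))  # 2.67 0.0517 51.7
```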