
Regularized least squares

Regularized least squares (RLS) is a statistical method for estimating the parameters of a linear model by minimizing the sum of squared residuals between observed and predicted values, augmented by a penalty term that constrains the complexity of the solution, thereby mitigating issues such as overfitting, multicollinearity, and instability in ill-posed problems. This approach, also known as ridge regression or Tikhonov regularization, introduces a hyperparameter λ that balances fidelity to the data against the regularization penalty, typically formulated as minimizing \|Ax - b\|^2 + \lambda \|x\|^2 for linear systems, where A is the design matrix, b the response vector, and x the parameter vector.

The origins of RLS trace back to Andrey Tikhonov's pioneering work on stabilizing solutions to ill-posed inverse problems in the 1960s, where he introduced the regularization method to obtain stable approximations for equations sensitive to perturbations. In the statistical context, the technique gained prominence through ridge regression, developed by Arthur E. Hoerl and Robert W. Kennard in 1970 as a biased estimation technique for nonorthogonal regression problems with high multicollinearity. These foundational contributions established RLS as a cornerstone of modern machine learning and data analysis, with extensions to reproducing kernel Hilbert spaces enabling nonlinear modeling via kernel regularized least squares.

Key advantages of RLS include its computational tractability (it is solvable via linear systems such as (K + \lambda I)c = y in the kernel case, where K is the kernel matrix) and its ability to incorporate prior knowledge through the choice of regularization norm, such as L2 for ridge or more general seminorms. The regularization parameter λ is often tuned using cross-validation or discrepancy principles to optimize generalization performance, making RLS widely applicable in fields like signal processing, geophysics, and supervised learning tasks. Notable variants include total least squares for error-in-variables models and sparse formulations combining L1 penalties, though the core L2-regularized form remains the most studied and implemented.

Fundamentals

Ordinary least squares review

Ordinary least squares (OLS) is a fundamental method in linear regression that estimates the parameters of a linear model by minimizing the sum of the squared residuals between observed and predicted values. Formally, given a dataset with n observations, an n \times p design matrix X (including a column of ones for the intercept), and a response vector y \in \mathbb{R}^n, OLS solves the optimization problem \min_w \|Xw - y\|_2^2, where w \in \mathbb{R}^p is the parameter vector. This approach assumes the true relationship can be approximated by a linear function plus additive error, making it the baseline for many statistical modeling tasks. The method was developed in the early 19th century, with Adrien-Marie Legendre publishing the first clear exposition in 1805 as a procedure for fitting astronomical observations, while Carl Friedrich Gauss claimed prior invention around 1801, leading to a notable priority dispute generally resolved in favor of Gauss's earlier conception on the basis of historical evidence.

Under the condition that X^T X is invertible, OLS admits a closed-form solution given by w = (X^T X)^{-1} X^T y, which projects the response onto the column space of X. This explicit formula is computationally efficient for moderate-sized problems and forms the foundation for more advanced techniques.

OLS relies on several key assumptions to ensure reliable estimates: (1) linearity in parameters, meaning the model is y = X\beta + \varepsilon; (2) strict exogeneity, E(\varepsilon | X) = 0; (3) no perfect multicollinearity, so the rank of X equals p; and (4) spherical errors, where E(\varepsilon \varepsilon^T | X) = \sigma^2 I_n (homoscedasticity and no serial correlation). These assumptions imply that the errors are uncorrelated with constant variance \sigma^2. Under the first three assumptions, the OLS estimator is unbiased, satisfying E(w | X) = \beta. Furthermore, with all four assumptions (the Gauss-Markov conditions), it is the best linear unbiased estimator (BLUE), possessing the minimum variance among all linear unbiased estimators, with covariance matrix \text{Var}(w | X) = \sigma^2 (X^T X)^{-1}. In high-dimensional settings where the number of features approaches or exceeds the sample size, OLS estimates can suffer from high variance and overfitting, fitting noise rather than the underlying signal.
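
The closed-form solution above translates directly into code. The following minimal NumPy sketch (synthetic data, illustrative only) fits OLS via the normal equations and checks the result against a library least-squares solver.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept column plus features
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    # OLS via the normal equations: w = (X^T X)^{-1} X^T y
    w_normal = np.linalg.solve(X.T @ X, X.T @ y)

    # Cross-check with a numerically stabler least-squares routine
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(w_normal, w_lstsq))  # True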

Need for regularization

In ordinary least squares (OLS), the estimator can exhibit high variance, leading to overfitting, where the model captures noise in the training data rather than the underlying signal, resulting in poor predictions on unseen data. This issue arises particularly when the number of features is large relative to the sample size, amplifying sensitivity to small changes in the input data. Ill-posed problems further exacerbate OLS limitations, especially when the matrix X^T X is ill-conditioned or singular due to multicollinearity among predictors, causing the solution to be highly sensitive to perturbations in the data or measurement errors. In such cases, small eigenvalues in X^T X lead to unstable estimates with large magnitudes, inflating prediction variance and undermining reliability. The curse of dimensionality intensifies this when the number of features p exceeds the number of samples n (p > n), as the feature space becomes sparse, making it difficult to estimate parameters accurately without excessive variance or computational burden. Regularization addresses these challenges through the bias-variance tradeoff, intentionally introducing a small bias into the estimates to substantially reduce variance and improve overall generalization on new data. By constraining the model complexity, it stabilizes predictions without severely compromising the fit to the true relationship.

Core Formulation

General

Regularized least squares (RLS) addresses the estimation problem by formulating it as a penalized optimization task that balances data fidelity with model complexity. Specifically, given an n \times p design matrix X and an n-dimensional response vector y, the goal is to find the parameter vector w \in \mathbb{R}^p that minimizes the objective function \min_w \|Xw - y\|_2^2 + \lambda R(w), where \lambda > 0 is the regularization parameter and R(w) is a penalty function, often chosen to promote desirable properties in w such as sparsity or small norm. This formulation originated in the context of ridge regression, where R(w) = \|w\|_2^2, to mitigate issues like multicollinearity in estimation.

The regularization parameter \lambda governs the tradeoff between the least squares fitting term \|Xw - y\|_2^2, which measures how well the model fits the observed data, and the penalty term \lambda R(w), which discourages overly complex solutions that may lead to poor generalization. As \lambda increases, the solution w becomes more constrained, reducing variance at the potential cost of increased bias; conversely, as \lambda decreases the solution approaches the ordinary least squares estimate, recovering it exactly when \lambda = 0.

For common choices of R(w), such as quadratic forms like R(w) = \|w\|_2^2 or more generally \|Lw\|_2^2 for a suitably chosen matrix L, the objective function is differentiable and convex, as the sum of a convex quadratic loss and a convex penalty yields a convex problem. The gradient of the objective J(w) with respect to w is given by \nabla_w J(w) = 2X^T(Xw - y) + \lambda \nabla R(w), which can be set to zero to find critical points, facilitating numerical optimization via methods like gradient descent. Under strict convexity of the objective (ensured, for instance, when the Hessian 2X^T X + \lambda \nabla^2 R(w) is positive definite for all w), the minimizer exists and is unique, guaranteeing a well-defined solution even in ill-conditioned settings.
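
As a concrete illustration of this objective and its gradient, the sketch below (assuming the ridge penalty R(w) = \|w\|_2^2 and synthetic data) runs plain gradient descent and verifies that the gradient vanishes at the minimizer.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, lam = 200, 5, 1.0
    X = rng.normal(size=(n, p))
    y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

    def objective(w):
        # ||Xw - y||^2 + lambda * ||w||^2
        r = X @ w - y
        return r @ r + lam * w @ w

    def gradient(w):
        # 2 X^T (Xw - y) + 2 lambda w  (the gradient of the ridge penalty is 2w)
        return 2 * X.T @ (X @ w - y) + 2 * lam * w

    # Gradient descent with step size 1/L, where L = 2 (sigma_max(X)^2 + lambda)
    L = 2 * (np.linalg.norm(X, 2) ** 2 + lam)
    w = np.zeros(p)
    for _ in range(500):
        w -= (1.0 / L) * gradient(w)

    print(np.linalg.norm(gradient(w)))  # close to zero at the minimizer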

Solution and properties

The solution to the regularized least squares (RLS) optimization problem, which seeks to minimize \|Xw - y\|^2 + \lambda R(w) for a convex differentiable regularizer R(w), is characterized by the first-order optimality condition 2X^T(Xw - y) + \lambda \nabla R(w) = 0. For quadratic regularizers of the form R(w) = w^T \Gamma w, where \Gamma is a positive semidefinite matrix, this simplifies to the linear normal equations (X^T X + \lambda \Gamma) w = X^T y, yielding a closed-form solution w = (X^T X + \lambda \Gamma)^{-1} X^T y upon inversion, assuming the matrix is invertible for \lambda > 0.

Asymptotically, under assumptions such as fixed dimensionality p, independent identically distributed data with finite moments, and appropriate choice of \lambda (e.g., \lambda \to 0 at a suitable rate), the RLS estimator \hat{w} is strongly consistent, converging to the true parameter w_0 as the sample size n \to \infty. Furthermore, it achieves the convergence rate of O_p(1/\sqrt{n}) for the estimation error \|\hat{w} - w_0\|, matching that of ordinary least squares while mitigating ill-conditioning through the regularization term. These properties hold provided the design matrix X satisfies standard conditions and the noise is sub-Gaussian.

Geometrically, the RLS solution represents a shrinkage of the ordinary least squares estimate towards the origin in the parameter space (for \Gamma = I), or more generally towards a subspace defined by the eigenspace of \Gamma, balancing fidelity to the data via the residual ellipsoid with confinement to a penalized region shaped by the regularizer. This shrinkage reduces the variance of the estimator at the cost of introducing bias, which is particularly beneficial in ill-conditioned or high-dimensional settings where the unregularized solution amplifies noise. The regularization parameter \lambda critically influences this bias-variance tradeoff, with small \lambda yielding solutions close to ordinary least squares and large \lambda enforcing stronger shrinkage. Optimal \lambda is typically selected via cross-validation, such as leave-one-out or k-fold methods, to minimize out-of-sample prediction error.

Computationally, solving the normal equations directly requires forming X^T X in O(np^2) time followed by a p \times p system solve in O(p^3) time, though for large p, iterative methods like conjugate gradients can reduce this to O(np) per iteration under suitable preconditioning.
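
A minimal NumPy sketch of the closed-form solution above, assuming the common case \Gamma = I (ridge) and illustrative synthetic data; the normal-equations matrix is solved directly rather than explicitly inverted.

    import numpy as np

    def rls_solve(X, y, lam, Gamma=None):
        # Solve (X^T X + lam * Gamma) w = X^T y for the quadratic regularizer w^T Gamma w
        p = X.shape[1]
        if Gamma is None:
            Gamma = np.eye(p)  # ridge case
        return np.linalg.solve(X.T @ X + lam * Gamma, X.T @ y)

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 10))
    y = rng.normal(size=50)

    w_ridge = rls_solve(X, y, lam=0.5)
    w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

    # Regularization shrinks the estimate relative to OLS
    print(np.linalg.norm(w_ridge) <= np.linalg.norm(w_ols))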

Kernel-Based Approaches

Reproducing kernel Hilbert spaces

A reproducing kernel Hilbert space (RKHS), denoted \mathcal{H}, is a Hilbert space of functions defined on a set X such that the point evaluation functional is bounded (and hence continuous) for every x \in X. This means that there exists a constant C > 0 (possibly depending on x) such that |f(x)| \leq C \|f\|_{\mathcal{H}} for all f \in \mathcal{H}. The defining feature of an RKHS is the reproducing property: for each x \in X, there exists a function k_x \in \mathcal{H} such that f(x) = \langle f, k_x \rangle_{\mathcal{H}} for all f \in \mathcal{H}. The reproducing kernel k: X \times X \to \mathbb{R} is then given by k(x, x') = \langle k_x, k_{x'} \rangle_{\mathcal{H}}, which serves as the inner product between the representing functions. This property allows the inner product in \mathcal{H} to be computed solely through kernel evaluations without explicit knowledge of the functions themselves.

Functions in an RKHS can often be expressed as finite linear combinations of kernel functions centered at data points, f = \sum_{i=1}^n \alpha_i k_{x_i}, where \alpha_i \in \mathbb{R}. The squared norm of such a function is then \|f\|_{\mathcal{H}}^2 = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j), reflecting the quadratic form induced by the kernel matrix. For the space to form a valid Hilbert space with this inner product, the kernel k must be positive definite, meaning that for any finite set of points \{x_1, \dots, x_n\} \subset X and coefficients \{\alpha_1, \dots, \alpha_n\}, the sum \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \geq 0.

The concept of RKHS was formally established by Nachman Aronszajn in 1950, providing a rigorous framework that unifies earlier work on integral operators and positive definite functions. This formalization is essential for extending linear methods, such as regularized least squares, to nonlinear settings through kernel representations.
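
The quadratic-form expression for the RKHS norm translates directly into code. The sketch below (illustrative only, assuming an RBF kernel with bandwidth sigma) evaluates a function f = \sum_i \alpha_i k_{x_i} at a test point and computes \|f\|_{\mathcal{H}}^2 = \alpha^T K \alpha.

    import numpy as np

    def rbf_kernel(A, B, sigma=1.0):
        # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), evaluated pairwise
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))

    rng = np.random.default_rng(3)
    Xtr = rng.normal(size=(20, 2))        # centers x_1, ..., x_n
    alpha = rng.normal(size=20)           # coefficients alpha_i

    K = rbf_kernel(Xtr, Xtr)              # kernel (Gram) matrix
    norm_sq = alpha @ K @ alpha           # ||f||_H^2 = alpha^T K alpha

    x_star = np.array([[0.5, -0.2]])
    f_at_x = rbf_kernel(x_star, Xtr) @ alpha   # f(x*) = sum_i alpha_i k(x_i, x*)

    print(norm_sq >= 0, f_at_x)           # a positive definite kernel guarantees a nonnegative norm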

Kernel ridge regression

Kernel ridge regression extends regularized least squares to non-linear problems by incorporating kernel functions, which implicitly map the input data into a high-dimensional reproducing kernel Hilbert space (RKHS) without explicit computation of the feature vectors. The resulting model captures complex, non-linear dependencies between inputs and outputs while controlling overfitting through ridge regularization.

The prediction function in kernel ridge regression is expressed as f(x) = \sum_{i=1}^n \alpha_i k(x_i, x), where \{x_i\}_{i=1}^n are the training inputs, k(\cdot, \cdot) is a positive semi-definite kernel function, and the coefficient vector \alpha is computed as \alpha = (K + \lambda I)^{-1} y. Here, K is the n \times n Gram matrix with entries K_{ij} = k(x_i, x_j), y is the vector of observed targets, \lambda > 0 is the regularization parameter, and I is the identity matrix. This dual formulation solves the problem efficiently in the space of coefficients rather than the original feature space.

By the representer theorem, the optimal solution to the kernel-regularized problem lies in the finite-dimensional span of the kernel functions centered at the training points, ensuring that f can be represented solely in terms of the data without requiring functions outside this span. This property confines the search for the minimizer to a finite-dimensional subspace, facilitating practical computation even for infinite-dimensional feature spaces. Kernel ridge regression is mathematically equivalent to applying finite-dimensional ridge regression directly in the feature space \phi(x) induced by the kernel, where k(x, x') = \langle \phi(x), \phi(x') \rangle, but the kernel trick avoids materializing the potentially infinite-dimensional \phi. This equivalence preserves the closed-form solution of ridge regression while enabling non-linear mappings.

The choice of kernel function determines the structure of the implicit feature space and thus the model's flexibility; common selections include the radial basis function (RBF) kernel, k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right), which corresponds to an infinite-dimensional space suitable for capturing local patterns, and polynomial kernels of the form k(x, x') = (x^\top x' + c)^d, which generate finite-dimensional spaces mimicking polynomial interactions. Hyperparameters such as \sigma, c, and d are typically tuned via cross-validation to balance bias and variance. Through this implicit high-dimensional mapping, kernel ridge regression achieves non-linearity in the input space without the prohibitive computational demands of explicit feature construction, making it effective for tasks where relationships are inherently non-linear.
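
The following compact sketch (synthetic one-dimensional data, RBF kernel, illustrative hyperparameters) implements the dual solution \alpha = (K + \lambda I)^{-1} y and the prediction rule f(x) = \sum_i \alpha_i k(x_i, x).

    import numpy as np

    def rbf_kernel(A, B, sigma=0.5):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))

    rng = np.random.default_rng(4)
    X_train = rng.uniform(-3, 3, size=(100, 1))
    y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=100)

    lam = 0.1
    K = rbf_kernel(X_train, X_train)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)  # dual coefficients

    X_test = np.linspace(-3, 3, 50)[:, None]
    y_pred = rbf_kernel(X_test, X_train) @ alpha     # f(x*) = sum_i alpha_i k(x_i, x*)
    print(np.max(np.abs(y_pred - np.sin(X_test[:, 0]))))  # approximation error on the smooth target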

Prediction and complexity

In kernel regularized least squares (RLS), predictions for a new input x_* are made using the dual representation of the learned function, given by f(x_*) = k(x_*, X)^T (K + \lambda I)^{-1} y, where k(x_*, X) is the vector of kernel evaluations between x_* and the training inputs X, K is the n \times n kernel matrix with entries K_{ij} = k(x_i, x_j), \lambda > 0 is the regularization parameter, and y is the vector of training targets. This form leverages the representer theorem, expressing the solution as a linear combination of kernel functions centered at the training points, enabling non-linear mappings without explicitly computing high-dimensional features.

The primary computational bottleneck in kernel RLS arises during training, where inverting the kernel matrix requires O(n^3) time complexity, with O(n^2) space needed to store the matrix itself for n training points. Once trained, predictions for out-of-sample points can be computed efficiently in O(n) time per query via the dual form, avoiding any need to retrain the model or recompute the inversion. This out-of-sample extension is a key advantage of the dual formulation, allowing seamless application to new data without altering the underlying coefficients (K + \lambda I)^{-1} y.

To address scalability for large n, approximations such as the Nyström method and random features have been developed to reduce computational demands while preserving much of the model's expressive power. The Nyström method subsamples m \ll n landmark points to form a low-rank approximation of K, enabling training in O(m^3 + n m^2) time and O(m^2 + n m) space, followed by predictions in O(m) time. Similarly, random Fourier features approximate shift-invariant kernels by projecting inputs into an explicit m-dimensional feature space via random projections, transforming kernel RLS into a linear problem solvable in O(m^3 + n m^2) time, which scales linearly with n for fixed m. These methods typically select m on the order of hundreds to thousands, achieving near-exact performance for well-conditioned kernels like the RBF.

In high-n regimes, such as datasets with millions of points, these approximations introduce a fundamental tradeoff: reducing m accelerates training and prediction by orders of magnitude but may degrade accuracy due to approximation error, particularly for kernels with slow-decaying eigenvalues. Empirical studies show that Nyström approximations often outperform random features in accuracy for structured data (e.g., achieving 10-20% lower MSE on tasks with m = 1000), though random features offer simpler implementation and better uniformity guarantees across diverse kernels. Selection of the approximation thus balances model fidelity against runtime constraints in practical deployments.
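
As a sketch of the random-features route (assuming the RBF kernel with bandwidth sigma and an illustrative feature count m), the code below draws random Fourier features z(x) = \sqrt{2/m} \cos(Wx + b) and solves an ordinary ridge problem in that explicit m-dimensional space.

    import numpy as np

    rng = np.random.default_rng(5)
    n, d, m, sigma, lam = 5000, 10, 500, 1.0, 1.0
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

    # Random Fourier features for k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)):
    # W has i.i.d. N(0, 1/sigma^2) entries, b is uniform on [0, 2*pi]
    W = rng.normal(scale=1.0 / sigma, size=(d, m))
    b = rng.uniform(0, 2 * np.pi, size=m)

    def features(A):
        return np.sqrt(2.0 / m) * np.cos(A @ W + b)

    Z = features(X)                                            # n x m explicit feature matrix
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)    # linear ridge: O(n m^2 + m^3)

    X_new = rng.normal(size=(3, d))
    print(features(X_new) @ w)                                 # O(m) prediction per query after featurization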

Theoretical Foundations

Feature maps and Mercer's theorem

Mercer's theorem provides a foundational decomposition for symmetric positive semi-definite kernels, enabling their representation in terms of feature maps. Specifically, for a continuous symmetric positive semi-definite kernel k: \mathcal{X} \times \mathcal{X} \to \mathbb{R} defined on a compact subset \mathcal{X} of \mathbb{R}^d, there exist non-negative eigenvalues \{\lambda_i\}_{i=1}^\infty with \sum_{i=1}^\infty \lambda_i < \infty and orthonormal eigenfunctions \{\phi_i\}_{i=1}^\infty in the space L^2(\mathcal{X}) such that k(x, x') = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(x') for all x, x' \in \mathcal{X}. This expansion arises from the spectral decomposition of the associated integral operator T_k f(x) = \int_{\mathcal{X}} k(x, y) f(y) \, d\mu(y), where \mu is a positive measure on \mathcal{X}, and the eigenfunctions satisfy T_k \phi_i = \lambda_i \phi_i.

The theorem links kernels to explicit feature maps by defining a mapping \phi: \mathcal{X} \to \ell^2, the Hilbert space of square-summable sequences, via \phi(x) = (\sqrt{\lambda_1} \phi_1(x), \sqrt{\lambda_2} \phi_2(x), \dots), such that k(x, x') = \langle \phi(x), \phi(x') \rangle_{\ell^2}. For certain kernels, this feature map can be constructed explicitly in finite dimensions. A prominent example is the polynomial kernel of degree p, k(x, x') = (x^\top x' + c)^p with c \geq 0, whose explicit feature map \phi(x) consists of all monomials up to degree p, resulting in a vector of dimension \binom{d + p}{p}. This mapping transforms the original data into a higher-dimensional space where linear methods can capture nonlinear relationships in the input space.

In practice, explicit computation of \phi(x) is often infeasible due to the high or infinite dimensionality, leading to implicit feature maps via the kernel trick. This technique allows algorithms to operate solely on kernel evaluations k(x_i, x_j) to implicitly compute dot products \langle \phi(x_i), \phi(x_j) \rangle without ever materializing the map, as required in methods like kernel ridge regression. The series in Mercer's theorem converges absolutely and uniformly on \mathcal{X} \times \mathcal{X} for continuous kernels, ensuring that finite truncations approximate the kernel arbitrarily well on compact sets. This uniform convergence supports the theoretical justification for using kernel approximations in finite-dimensional computations.

Kernels violating the positive semi-definiteness condition, termed non-Mercer kernels, fail to admit such a decomposition and may lead to non-convex optimization problems or invalid Hilbert space structures in kernel methods. Fixes include regularizing the kernel matrix to enforce positive semi-definiteness, such as through eigenvalue clipping (replacing negative eigenvalues with zeros or small positives), or adapting algorithms to indefinite kernels using Kreĭn spaces, though these approaches can compromise convergence guarantees.
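
To make the explicit-feature-map view concrete, the sketch below (degree 2, two-dimensional inputs, an arbitrary offset c) builds the monomial feature map for the polynomial kernel and confirms numerically that \langle \phi(x), \phi(x') \rangle = (x^\top x' + c)^2.

    import numpy as np

    c = 1.0

    def poly_kernel(x, z):
        return (x @ z + c) ** 2

    def phi(x):
        # Explicit feature map for (x^T z + c)^2 with d = 2:
        # monomials up to degree 2, weighted so the inner product reproduces the kernel
        x1, x2 = x
        return np.array([
            x1**2, x2**2, np.sqrt(2) * x1 * x2,
            np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2,
            c,
        ])

    rng = np.random.default_rng(6)
    x, z = rng.normal(size=2), rng.normal(size=2)
    print(np.isclose(phi(x) @ phi(z), poly_kernel(x, z)))  # True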

Bayesian interpretation

The Bayesian interpretation frames regularized least squares (RLS) as a form of probabilistic inference, specifically maximum a posteriori (MAP) estimation, where regularization arises from incorporating prior beliefs about the model parameters. In this view, the observed data y is modeled with a Gaussian likelihood p(y \mid X, w) = \mathcal{N}(Xw, \sigma^2 I), assuming independent Gaussian noise with variance \sigma^2. The parameter vector w is assigned a zero-mean Gaussian prior p(w) = \mathcal{N}(0, (\lambda / \sigma^2)^{-1} I), which penalizes large values of w and corresponds to the ridge regularization case. This setup connects the deterministic RLS objective to Bayesian updating, treating the solution as the mode of the posterior distribution p(w \mid y, X) \propto p(y \mid X, w) p(w).

The MAP estimate is obtained by maximizing the posterior, which is equivalent to minimizing the negative log-posterior: \arg\max_w p(w \mid y, X) = \arg\min_w \frac{1}{\sigma^2}\left( \|Xw - y\|^2 + \lambda \|w\|^2 \right) = \arg\min_w \|Xw - y\|^2 + \lambda \|w\|^2. This shows how the regularization term \lambda \|w\|^2 emerges directly from the negative log of the prior, up to constants. The full posterior distribution is also Gaussian, centered at the RLS solution w_{\text{RLS}} with covariance matrix (\lambda I / \sigma^2 + X^T X / \sigma^2)^{-1} = \sigma^2 (X^T X + \lambda I)^{-1}, providing not only point estimates but also uncertainty quantification over the parameters.

The hyperparameter \lambda plays a key role as the precision (inverse variance) of the prior distribution on w, scaled by the noise variance; larger \lambda implies a stronger belief that w is close to zero, reflecting assumptions about model simplicity or sparsity in magnitude. This interpretation allows \lambda to be tuned via empirical Bayes methods, such as maximizing the marginal likelihood p(y \mid X), which integrates out w. Extending to kernel-based RLS, the approach aligns with Gaussian process (GP) regression, where the posterior mean under a GP prior with covariance function given by the kernel (e.g., the radial basis function kernel) matches the kernel ridge regression predictions. This equivalence highlights RLS as a finite-data approximation to nonparametric Bayesian function estimation in reproducing kernel Hilbert spaces.
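
A brief numerical sketch of this correspondence (synthetic data, assumed noise variance sigma^2): the Gaussian posterior mean coincides with the ridge solution, and the posterior covariance \sigma^2 (X^T X + \lambda I)^{-1} quantifies parameter uncertainty.

    import numpy as np

    rng = np.random.default_rng(7)
    n, p, sigma2, lam = 100, 4, 0.25, 2.0
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + np.sqrt(sigma2) * rng.normal(size=n)

    A = X.T @ X + lam * np.eye(p)
    w_map = np.linalg.solve(A, X.T @ y)        # posterior mode = posterior mean = ridge solution
    post_cov = sigma2 * np.linalg.inv(A)       # posterior covariance sigma^2 (X^T X + lam I)^{-1}
    post_std = np.sqrt(np.diag(post_cov))      # per-coefficient uncertainty

    # Same estimate obtained without Bayesian machinery, as an augmented least-squares fit
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    w_ridge = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

    print(np.allclose(w_map, w_ridge), post_std)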

Specific Methods

Ridge regression

Ridge regression is a specific instance of regularized least squares that employs an \ell_2 penalty on the coefficients to address issues like multicollinearity in linear regression models. The optimization problem is formulated as \min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2, where X is the n \times p design matrix, y is the response vector, w is the coefficient vector, and \lambda \geq 0 is the regularization parameter controlling the strength of the penalty. This approach was introduced by Hoerl and Kennard in 1970 as a biased estimation technique for nonorthogonal problems, and is equivalent to Tikhonov regularization with an identity regularization operator.

The closed-form solution for the ridge estimator is w = (X^T X + \lambda I)^{-1} X^T y, where I is the p \times p identity matrix. This solution shrinks the coefficients toward zero, introducing a small bias but substantially reducing variance, particularly when features are highly correlated. The bias-variance tradeoff stabilizes estimates in the presence of multicollinear features, often leading to better prediction performance than ordinary least squares.

The regularization parameter \lambda is typically tuned using cross-validation methods, such as leave-one-out cross-validation, which has a computationally efficient closed-form expression for linear models, or generalized cross-validation, a criterion akin to Mallows's C_p that balances fit and complexity. Computationally, the solution can be obtained via the singular value decomposition (SVD) X = U S V^T, which gives w = V \, \mathrm{diag}\big(s_i / (s_i^2 + \lambda)\big) U^T y and allows the estimator to be recomputed cheaply across many values of \lambda, provided the data are centered. From a Bayesian perspective, ridge regression corresponds to the posterior mean under a Gaussian prior on the coefficients.
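
A sketch of the SVD route just described (illustrative data): one factorization of X yields ridge solutions for any \lambda via the filter factors s_i / (s_i^2 + \lambda).

    import numpy as np

    rng = np.random.default_rng(8)
    X = rng.normal(size=(60, 8))
    X = X - X.mean(axis=0)               # center the predictors
    y = rng.normal(size=60)
    y = y - y.mean()

    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    def ridge_svd(lam):
        # w = V diag(s_i / (s_i^2 + lam)) U^T y
        return Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

    for lam in (0.01, 1.0, 100.0):
        w = ridge_svd(lam)
        direct = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
        print(lam, np.allclose(w, direct))   # True for every lambda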

Lasso and sparse regularization

The Lasso, or least absolute shrinkage and selection operator, is a regularized least squares method that incorporates an \ell_1 penalty on the coefficients to promote sparsity. It solves the optimization problem \min_w \|Xw - y\|_2^2 + \lambda \|w\|_1, where \lambda > 0 controls the strength of the penalty, X is the design matrix, y is the response vector, and \| \cdot \|_1 denotes the \ell_1 norm (sum of absolute values). This formulation was introduced by Tibshirani in 1996 as a technique for simultaneous parameter shrinkage and variable selection in linear regression, addressing limitations of ordinary least squares in high-dimensional settings.

Geometrically, the Lasso can be interpreted as a constrained problem: minimize \|Xw - y\|_2^2 subject to \|w\|_1 \leq t, where t > 0 is a budget on the \ell_1 norm. The constraint region forms a diamond-shaped set in coefficient space (a cross-polytope in higher dimensions), with vertices aligned on the coordinate axes. The quadratic loss contours are ellipsoids centered at the ordinary least squares solution; their intersection with the diamond often occurs at a vertex or edge, forcing one or more coefficients to exactly zero and inducing sparsity.

Key properties of Lasso solutions include high sparsity, where many coefficients are precisely zero, enabling automatic variable selection by identifying irrelevant features. This sparsity aids interpretability and reduces model complexity, particularly when the number of features exceeds the sample size. However, the \ell_1 penalty introduces bias, especially for large true coefficients, as it shrinks all estimates toward zero indiscriminately before selection.

Common algorithms for solving the Lasso include coordinate descent, which cyclically optimizes one coefficient at a time while holding others fixed, leveraging the separability of the \ell_1 penalty. This approach, implemented in packages like glmnet, efficiently computes the entire regularization path as \lambda varies. Proximal gradient methods, such as the iterative shrinkage-thresholding algorithm (ISTA), approximate the solution by performing a gradient step on the smooth loss followed by soft-thresholding on the nonsmooth penalty. In contrast, exact \ell_0 penalization, which counts the number of nonzero coefficients (\min_w \|Xw - y\|_2^2 + \lambda \|w\|_0), directly enforces sparsity but is NP-hard due to its combinatorial nature, requiring enumeration over 2^p subsets for p features. The \ell_1 penalty approximates this \ell_0 ideal through convex relaxation, as justified by basis pursuit principles, where the \ell_1 norm promotes sparse solutions under certain conditions like the restricted isometry property.
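
A minimal ISTA sketch for the stated objective \|Xw - y\|_2^2 + \lambda \|w\|_1 (synthetic sparse signal, illustrative step size 1/L with L = 2\|X\|_2^2): each iteration takes a gradient step on the quadratic loss and then applies the soft-thresholding operator.

    import numpy as np

    def soft_threshold(v, t):
        # Proximal operator of t * ||.||_1
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ista(X, y, lam, n_iter=1000):
        L = 2 * np.linalg.norm(X, 2) ** 2           # Lipschitz constant of the gradient 2 X^T (Xw - y)
        w = np.zeros(X.shape[1])
        for _ in range(n_iter):
            grad = 2 * X.T @ (X @ w - y)
            w = soft_threshold(w - grad / L, lam / L)
        return w

    rng = np.random.default_rng(9)
    n, p = 100, 50
    X = rng.normal(size=(n, p))
    w_true = np.zeros(p)
    w_true[:5] = [3, -2, 1.5, 2.5, -1]              # sparse ground truth
    y = X @ w_true + 0.1 * rng.normal(size=n)

    w_hat = ista(X, y, lam=5.0)
    print(np.count_nonzero(w_hat), "nonzero coefficients out of", p)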

Elastic net and hybrids

The elastic net is a regularization method that combines the \ell_1 penalty from the lasso with the \ell_2 penalty from ridge regression, formulated as the optimization problem \min_w \|Xw - y\|^2_2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|^2_2, where \lambda_1 \geq 0 and \lambda_2 \geq 0 control the relative contributions of sparsity and shrinkage, respectively. This approach was proposed by Zou and Hastie in 2005 to address limitations of the lasso, particularly in high-dimensional settings where features are highly correlated.

A key advantage of the elastic net is its ability to handle groups of correlated features by encouraging similar coefficients within those groups, unlike the lasso, which tends to select only one feature from a correlated set. It also mitigates the estimation bias introduced by the lasso's \ell_1 penalty through the stabilizing effect of the \ell_2 term. Computationally, the elastic net can be transformed into a lasso problem on an augmented dataset, where the original design matrix is rescaled and augmented with rows of a scaled identity matrix (and the response padded with zeros) to incorporate the \ell_2 penalty.

Parameter tuning for the elastic net typically involves grid search over \lambda_1 and \lambda_2, or equivalently over a regularization strength \lambda and a mixing ratio \rho = \frac{\lambda_2}{\lambda_1 + \lambda_2}, with cross-validation to select the values that minimize prediction error. Other hybrid approaches include the adaptive lasso, which modifies the \ell_1 penalty with data-dependent weights to achieve oracle properties like consistent variable selection and asymptotic normality.
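
The augmentation can be written in a few lines. This sketch (reusing a simple ISTA lasso solver, illustrative only) stacks \sqrt{\lambda_2} I beneath X and appends zeros to y, so that a lasso fit with penalty \lambda_1 on the augmented data yields the (naive) elastic net coefficients.

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_ista(X, y, lam, n_iter=1000):
        L = 2 * np.linalg.norm(X, 2) ** 2
        w = np.zeros(X.shape[1])
        for _ in range(n_iter):
            w = soft_threshold(w - 2 * X.T @ (X @ w - y) / L, lam / L)
        return w

    def elastic_net(X, y, lam1, lam2):
        n, p = X.shape
        X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)])   # adds lam2 * ||w||^2 to the loss
        y_aug = np.concatenate([y, np.zeros(p)])
        return lasso_ista(X_aug, y_aug, lam1)               # naive elastic net coefficients

    rng = np.random.default_rng(10)
    X = rng.normal(size=(80, 30))
    X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=80)          # two nearly identical features
    y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=80)

    w = elastic_net(X, y, lam1=1.0, lam2=1.0)
    print(w[:2])   # correlated features tend to receive similar, nonzero weights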

Extensions and Variations

Iterative and online algorithms

Iterative methods for solving the regularized least squares (RLS) problem offer scalable alternatives to direct solvers, particularly for large datasets where forming the full Gram matrix or solving a dense linear system is prohibitive. The core approach involves applying gradient descent to the convex objective function \min_w \frac{1}{2} \|Xw - y\|^2 + \frac{\lambda}{2} \|w\|^2_R, where X \in \mathbb{R}^{n \times p} is the design matrix, y \in \mathbb{R}^n the response vector, \lambda > 0 the regularization parameter, and \|\cdot\|_R a seminorm induced by a positive semi-definite matrix R. The full-batch gradient is \nabla J(w) = X^T (Xw - y) + \lambda R w, and the update rule is w_{t+1} = w_t - \eta_t \nabla J(w_t) for step size \eta_t > 0. For constant step sizes appropriately chosen based on the smoothness constant, this yields linear convergence in the objective value, with rate factor (1 - \mu / L) per iteration, where \mu and L are the strong convexity and smoothness constants, respectively, and the condition number \kappa = L / \mu is bounded by the eigenvalues of X^T X + \lambda R.

Stochastic gradient descent (SGD) extends this framework to online settings, where data arrives sequentially, enabling updates after each observation without storing the full dataset. For the ridge case (R = I, penalty \|w\|^2), the stochastic gradient approximates the full gradient using a single sample (x_i, y_i), yielding the update w \leftarrow w - \eta (x_i^T (x_i w - y_i) + \lambda w), where \eta is the learning rate and the factor of 2 is absorbed for simplicity. This process minimizes excess risk relative to the regularized solution with \lambda \propto 1/t, exhibiting implicit regularization that biases toward minimum-norm solutions in overparameterized regimes. Averaged SGD variants achieve optimal statistical rates of O(1/n) for bias and O(p/n) for variance after n steps, under bounded noise assumptions.

In kernelized RLS, such as kernel ridge regression, online variants adapt gradient methods in the reproducing kernel Hilbert space (RKHS) to handle nonlinear mappings implicitly via kernel expansions f(x) = \sum_i \alpha_i k(x_i, x), with coefficients updated as \alpha_i \leftarrow (1 - \eta \lambda) \alpha_i - \eta (y_i - f(x_i)) for squared loss. To address non-stationary data streams, these methods incorporate forgetting mechanisms, such as decay on old coefficients or truncation of the expansion to retain only recent support vectors, effectively downweighting outdated samples. Recursive least-squares trackers further employ time-varying forgetting factors \lambda < 1 in the update recursion, reformulating the problem to penalize deviations from prior estimates while adapting to concept drift. Multiple forgetting factors can decouple parameters with varying change rates, improving tracking in dynamic environments like signal processing.

These iterative approaches reduce per-iteration complexity from the batch solver's O(np^2) for n > p (dominated by forming and inverting X^T X + \lambda R) to O(np) for full-batch gradient descent or O(p) for SGD and kernel updates, assuming fixed iteration counts proportional to \log(1/\epsilon) for \epsilon-accuracy in strongly convex cases. Incremental variants further amortize costs over streaming data, achieving up to linear speedups over repeated batch solves. Historically, iterative RLS traces to adaptive filtering in signal processing, where the recursive least squares (RLS) algorithm gained prominence in the 1980s for real-time echo cancellation and channel equalization, building on earlier least-squares foundations to enable constant-time updates via recursions closely related to the Kalman filter.
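
A compact sketch of the online ridge update just described (synthetic streaming data, illustrative constant learning rate eta): each arriving pair (x_i, y_i) triggers a single stochastic gradient step costing O(p).

    import numpy as np

    rng = np.random.default_rng(11)
    p, lam, eta = 10, 0.1, 0.01
    w_true = rng.normal(size=p)

    w = np.zeros(p)
    for t in range(20000):
        x = rng.normal(size=p)                     # one streamed observation
        y = x @ w_true + 0.1 * rng.normal()
        grad_i = x * (x @ w - y) + lam * w         # stochastic gradient of the ridge objective
        w -= eta * grad_i                          # online update, O(p) per step

    print(np.linalg.norm(w - w_true))              # approaches the (slightly shrunk) ridge solution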

Multi-task and group regularization

In multi-task learning, regularized least squares extends to scenarios where multiple related prediction tasks share underlying structures, such as common features across outputs, by incorporating penalties that promote joint sparsity. A common formulation minimizes the sum of squared errors across tasks plus a shared regularizer, such as the \ell_{2,1}-norm on the weight matrix W, given by \min_W \|XW - Y\|_F^2 + \lambda \|W\|_{2,1}, where \| \cdot \|_F denotes the Frobenius norm, X is the input design matrix, Y is the multi-output response matrix, and \|W\|_{2,1} = \sum_{i=1}^p \|w_i\|_2 sums the \ell_2-norms of the rows of W to encourage row-sparsity, selecting features shared across tasks. This approach induces sparsity at the feature level jointly across tasks, allowing the model to identify features relevant to all or many tasks while ignoring irrelevant ones, which is particularly useful for high-dimensional data with correlated outputs.

The group lasso, a foundational extension of \ell_1 regularization to grouped variables, penalizes the sum of group-wise norms \sum_g \|w_g\|_2, where the groups g partition the features, promoting selection of entire groups rather than individual coefficients. Introduced for linear regression with structured predictors, this method selects variables in predefined groups (e.g., based on biological pathways or spatial proximity), shrinking irrelevant groups to zero while estimating within-group coefficients. In multi-task settings, it builds on this by applying the penalty across tasks to enforce joint group selection, as seen in variants from 2008 onward that adapt the penalty to matrix-valued weights.

Applications of these methods include joint regression for multi-output prediction in domains like genomics, where tasks might involve predicting expression levels across related genes or conditions, leveraging group structures from known pathways to improve interpretability and generalization. For instance, in gene expression data analysis, supervised group lasso identifies clustered genes associated with phenotypes, enhancing predictive accuracy over independent task models.

Optimization of these non-smooth problems typically relies on proximal methods, such as proximal gradient or alternating direction methods, which handle the \ell_{2,1}-norm via closed-form proximal operators that threshold group norms. These algorithms efficiently solve the resulting convex problems, scaling to large multi-task settings by exploiting the separability of the penalty.
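
The proximal operator of the \ell_{2,1}-norm mentioned above acts row-wise, shrinking each row of W toward zero by its \ell_2-norm. A minimal sketch (synthetic data, illustrative penalty strength) is given below, with a proximal gradient loop for the multi-task objective.

    import numpy as np

    def prox_l21(W, t):
        # Row-wise group soft-thresholding: w_i -> max(0, 1 - t / ||w_i||_2) * w_i
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
        return scale * W

    rng = np.random.default_rng(12)
    n, p, T, lam = 100, 20, 3, 10.0          # T related tasks
    X = rng.normal(size=(n, p))
    W_true = np.zeros((p, T))
    W_true[:4, :] = rng.normal(size=(4, T))  # only the first 4 features matter, shared across tasks
    Y = X @ W_true + 0.1 * rng.normal(size=(n, T))

    L = 2 * np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the smooth part
    W = np.zeros((p, T))
    for _ in range(500):
        grad = 2 * X.T @ (X @ W - Y)         # gradient of ||XW - Y||_F^2
        W = prox_l21(W - grad / L, lam / L)  # proximal gradient step

    print(np.count_nonzero(np.linalg.norm(W, axis=1)), "active rows out of", p)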

References

  1. Notes on Regularized Least Squares. DSpace@MIT, 2007.
  2. Lecture 7: Regularized Least-Squares and Gauss-Newton Method. Lecture notes.
  3. Ridge Regression: Biased Estimation for Nonorthogonal Problems.
  4. Regularized Least Squares. MIT lecture notes (lecturer: Charlie Frogner), 2010.
  5. Numerical Differentiation and Regularization. SIAM Journal; cites A. N. Tikhonov, Solution of Incorrectly Formulated Problems and the Regularization Method, Soviet Math. Dokl. 4 (1963).
  6. Tikhonov Regularization and Total Least Squares. SIAM Journal.
  7. Linear Least Squares Regression. Information Technology Laboratory.
  8. Finite-Sample Properties of OLS. Princeton University.
  9. Gauss and the Invention of Least Squares. Project Euclid.
  10. OLS in Review (Chapter 4). Micro-Econometrics.
  11. The Elements of Statistical Learning. Springer.
  12. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Arthur E. Hoerl and Robert W. Kennard.
  13. Tikhonov Regularization: An Overview. ScienceDirect Topics.
  14. Tutorial on Asymptotic Properties of Regularized Least Squares ... 2021.
  15. Theory of Reproducing Kernels. N. Aronszajn.
  16. Ridge Regression Learning Algorithm in Dual Variables.
  18. A Generalized Representer Theorem.
  19. Learning with Kernels. Bernhard Schölkopf and Alexander Smola.
  20. Kernel Ridge Regression (lecture note).
  21. Statistically and Computationally Efficient Variance Estimator for ...
  22. Divide and Conquer Kernel Ridge Regression.
  23. Nyström Method for Kernel Ridge Regression: Complexity Reduction and Comparison to Random Features.
  24. Random Features for Large-Scale Kernel Machines.
  25. Reproducing Kernel Hilbert Space, Mercer's Theorem ... arXiv:2106.08443, 2021.
  26. On the Speed of Uniform Convergence in Mercer's Theorem. arXiv, 2022.
  27. Learning with Non-Positive Kernels.
  28. Pattern Recognition and Machine Learning.
  29. An Introduction to Statistical Learning.
  30. Gaussian Processes for Machine Learning. Carl Edward Rasmussen and Christopher K. I. Williams.
  31. Gaussian Processes and Kernel Methods: A Review on ... arXiv, 2018.
  32. Regression Shrinkage and Selection via the Lasso. Oxford Academic.
  33. Convergence of the Gradient Method for Ill-Posed Problems.
  34. Convergence Rates for Least-Squares Regression.
  35. The Implicit Regularization of Stochastic Gradient Flow for Least Squares.
  36. Online Kernel Learning Method for Regression.
  37. Kernel Recursive Least-Squares Tracker for Time-Varying Regression.
  38. A New Recursive Least-Squares Method with Multiple Forgetting Factors. 2015.
  39. Incremental Regularized Least Squares for Dimensionality Reduction of ...
  40. Recursive Least-Squares Adaptive Filters. UCSB ECE.