
Coordinate descent

Coordinate descent is an optimization algorithm that iteratively minimizes a multivariable objective function by successively performing approximate minimization along one coordinate direction (or a block of coordinates) at a time, while holding all other variables fixed. This approach decomposes the full-dimensional problem into a sequence of simpler, lower-dimensional subproblems, making it particularly suitable for high-dimensional data and structured objectives.

The method traces its origins to foundational iterative techniques in numerical linear algebra, such as the Gauss-Seidel method developed in the nineteenth century for solving systems of linear equations, which updates variables sequentially using the most recent values. It was later adapted to constrained optimization by Hildreth in 1957, who proposed a procedure for solving convex quadratic programs with linear constraints via coordinate-wise updates. Theoretical foundations were further developed from the 1970s onward, notably in Ortega and Rheinboldt's analysis of iterative relaxation methods for nonlinear problems and in Luo and Tseng's work on global convergence rates for convex functions. Interest in coordinate descent surged in the 2000s with the rise of large-scale machine learning, driven by efficient implementations for sparse models like the lasso, as demonstrated by Friedman et al. in 2007.

Key variants include cyclic coordinate descent, which updates coordinates in a predetermined order, and randomized coordinate descent, which selects coordinates probabilistically to improve practical convergence, especially for nonsmooth objectives. Block coordinate descent generalizes this by updating groups of coordinates simultaneously, enabling parallelism and handling coupled variables. Accelerated versions, inspired by Nesterov's momentum techniques, achieve optimal rates for smooth convex problems, while parallel implementations support distributed computing environments. Coordinate descent excels in applications requiring scalability, such as lasso regression, compressed sensing, and support vector machines, where it leverages separability in the loss or regularization terms. Its advantages include simplicity, low memory requirements, and adaptability to proximal operators for nonsmooth terms, though it may converge slowly for ill-conditioned problems without preconditioning or acceleration. For convex objectives, convergence to the global minimum is guaranteed under conditions like Lipschitz continuity of the partial gradients, with linear rates in strongly convex cases.

Introduction

Definition and Motivation

Coordinate descent is an iterative optimization method for minimizing a multivariate objective function by successively optimizing along individual coordinate directions, while holding all other variables fixed at their current values. This approach decomposes the high-dimensional optimization problem into a sequence of simpler, one-dimensional subproblems, each targeting a single variable. Mathematically, consider an objective function f: \mathbb{R}^n \to \mathbb{R}. At each iteration k, a coordinate index i_k is selected, and the update rule is given by x_i^{k+1} = \arg\min_{x_i} f(x_1^k, \ldots, x_{i-1}^k, x_i, x_{i+1}^k, \ldots, x_n^k) for i = i_k, with all other coordinates remaining unchanged. This exact minimization along the chosen coordinate leverages partial derivatives implicitly, as the solution often involves setting the partial derivative \frac{\partial f}{\partial x_i} to zero when f is differentiable, though the method applies more broadly. Prerequisite concepts include understanding optimization landscapes, where convexity ensures that any local minimum is global and partial derivatives indicate the direction of steepest descent along each axis without requiring full gradient computation. The primary motivation for coordinate descent lies in its computational efficiency for high-dimensional problems, where evaluating the full gradient or Hessian becomes prohibitively expensive due to the curse of dimensionality. By exploiting potential separability in the objective (such as when regularization terms like the \ell_1-norm decompose across coordinates), it simplifies subproblem solutions and reduces memory usage. Additionally, its structure facilitates parallelization, as updates to different coordinates can proceed independently on separate processors. For instance, in minimizing the quadratic function f(x, y) = 7x^2 + 6xy + 8y^2, coordinate descent alternates exact minimizations: fixing y and solving for x yields x = -\frac{3y}{7}, then fixing the new x and solving for y yields y = -\frac{3x}{8}, progressively reducing the objective value toward the minimum at the origin. In contrast to full-vector methods like gradient descent, which update all coordinates simultaneously using the entire gradient, coordinate descent offers a scalable alternative for sparse or structured problems.
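The two-variable example above can be written out directly; the following minimal Python sketch (not from any referenced implementation; the starting point is arbitrary) alternates the two closed-form updates and prints the decreasing objective values.

def f(x, y):
    # Quadratic objective from the example: positive definite, minimized at (0, 0)
    return 7 * x**2 + 6 * x * y + 8 * y**2

x, y = 1.0, 1.0  # arbitrary starting point
for k in range(10):
    x = -3.0 * y / 7.0  # exact minimization over x with y fixed (solves 14x + 6y = 0)
    y = -3.0 * x / 8.0  # exact minimization over y with the new x fixed (solves 6x + 16y = 0)
    print(k, x, y, f(x, y))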

Historical Background

The origins of coordinate descent methods can be traced to the Gauss-Seidel iterative method for solving linear systems, introduced by Carl Friedrich Gauss around 1823 and formalized by Philipp Ludwig von Seidel in 1874, which updates variables sequentially and serves as a foundational example for quadratic optimization. Related relaxation techniques emerged in the early twentieth century, such as Richard V. Southwell's 1940 work on iterative methods for partial differential equations, which employed coordinate-wise adjustments to approximate solutions. In optimization contexts, an early notable application appeared in Clifford Hildreth's 1957 quadratic programming algorithm, which used coordinate descent updates to handle constrained least-squares problems common in economic analysis. The related alternating projections method for finding points in subspace intersections was developed by John von Neumann in 1933, providing a two-block variant that influenced later multi-coordinate extensions. Coordinate descent gained prominence in nonlinear optimization through David G. Luenberger's 1973 textbook Introduction to Linear and Nonlinear Programming, where it was presented as a practical descent strategy. Concurrently, M.J.D. Powell's 1973 analysis in "On search directions for minimization algorithms" demonstrated the potential nonconvergence of cyclic implementations via a counterexample, spurring investigations into selection rules and reliability in nonlinear programming. Extensions to non-smooth objectives were advanced by Paul Tseng and Sangwoon Yun in their 2009 paper on coordinate gradient descent for separable nonsmooth minimization, broadening applicability beyond differentiable cases. In the 2000s, the method was integrated into machine learning via proximal formulations, exemplified by Jerome Friedman and coauthors' 2007 coordinate descent algorithm for lasso regularization paths. This progression built upon precursors like von Neumann's projections and Gauss-Seidel iterations and leads to the contemporary randomized variants used for large-scale problems.

Algorithm Description

Basic Procedure

Coordinate descent is an iterative optimization method for minimizing a multivariate objective function f: \mathbb{R}^n \to \mathbb{R} by successively optimizing along individual coordinate directions while holding the other coordinates fixed. The algorithm begins by selecting an initial point x^{(0)} \in \mathbb{R}^n, often chosen based on domain knowledge or a simple heuristic such as the origin. At each iteration k \geq 0, a coordinate index i_k \in \{1, \dots, n\} is selected, and the corresponding component is updated to minimize the univariate restriction of f along that direction: x_{i_k}^{(k+1)} = x_{i_k}^{(k)} + \arg\min_{t \in \mathbb{R}} f(x^{(k)} + t e_{i_k}), where e_{i_k} is the standard basis vector with 1 in the i_k-th position and 0 elsewhere, while all other components remain unchanged (x_j^{(k+1)} = x_j^{(k)} for j \neq i_k). This process repeats until a predefined stopping criterion is met. In the standard cyclic variant, coordinates are selected in a fixed sequential order, cycling through i_k = (k \mod n) + 1. The following pseudocode outlines this procedure:
Input: Initial point x^{(0)} ∈ ℝⁿ, maximum iterations K, tolerance ε > 0
x ← x^{(0)}
k ← 0
while k < K:
    x_old ← copy of x
    for i = 1 to n:
        x_i ← arg min_t f(x_1, …, x_{i-1}, t, x_{i+1}, …, x_n)
    k ← k + 1
    if ||x - x_old|| ≤ ε:
        break
Output: x
The core computational step involves solving the univariate subproblem \min_t f(x^{(k)} + t e_{i_k}), which can be performed exactly or inexactly depending on the structure of f. For exact minimization, the one-dimensional function is optimized to global precision, often via closed-form solutions when available; this is feasible for separable or structured objectives like quadratics. Inexact approaches approximate the minimizer, for instance, using a step length \alpha_k > 0 such that x^{(k+1)} = x^{(k)} - \alpha_k \nabla_{i_k} f(x^{(k)}) e_{i_k}, where \nabla_{i_k} f denotes the partial derivative with respect to the i_k-th coordinate, and \alpha_k is chosen via line search or set to a constant like 1/L_{i_k}, with L_{i_k} bounding the second derivative along that direction. For a quadratic objective f(x) = \frac{1}{2} x^T Q x + c^T x with positive definite Q, the univariate subproblem along coordinate i_k reduces to the one-dimensional quadratic \frac{1}{2} Q_{i_k i_k} t^2 + b t + d in the change t of the i_k-th coordinate, where b = \sum_{j} Q_{i_k j} x_j^{(k)} + c_{i_k} = \nabla_{i_k} f(x^{(k)}) and d is constant. The exact minimizer is then t^* = -\frac{b}{Q_{i_k i_k}}, yielding the closed-form update x_{i_k}^{(k+1)} = x_{i_k}^{(k)} + t^*. Common stopping criteria include tolerance-based checks such as \|x^{(k+1)} - x^{(k)}\| < \epsilon for small \epsilon > 0, which monitors the change in the iterate; gradient-based criteria like \max_i |\nabla_i f(x^{(k)})| < \epsilon, ensuring partial derivatives are near zero; or simply limiting the total number of iterations to K to prevent excessive computation. The choice of coordinate selection strategy, such as cyclic ordering, can influence the efficiency of these updates by affecting how quickly the algorithm explores the space.
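As an illustration, the sketch below (an assumption-laden example, not a reference implementation; the helper name cyclic_cd_quadratic and the test matrix are invented for this snippet) applies the cyclic procedure from the pseudocode to a small positive definite quadratic, using the closed-form update t* = -b / Q_{i_k i_k} and the tolerance-based stopping criterion.

import numpy as np

def cyclic_cd_quadratic(Q, c, x0, max_cycles=100, tol=1e-8):
    # Minimize f(x) = 1/2 x^T Q x + c^T x by exact cyclic coordinate descent.
    x = np.array(x0, dtype=float)
    n = len(x)
    for _ in range(max_cycles):
        x_old = x.copy()
        for i in range(n):
            b = Q[i, :] @ x + c[i]   # partial derivative of f along coordinate i
            x[i] -= b / Q[i, i]      # exact one-dimensional minimizer
        if np.linalg.norm(x - x_old) <= tol:  # stop when a full cycle barely changes x
            break
    return x

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([-1.0, -2.0])
print(cyclic_cd_quadratic(Q, c, np.zeros(2)))  # approaches the solution of Q x = -c
print(np.linalg.solve(Q, -c))                  # direct solve, for comparison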

Coordinate Selection Methods

In coordinate descent algorithms, the choice of which coordinate to update at each iteration significantly influences the overall efficiency and performance. Various selection strategies balance computational cost, convergence speed, and ease of implementation, with no single method dominating all scenarios. Cyclic selection updates the coordinates in a fixed sequential order, cycling through indices 1 to n and repeating. This approach is deterministic, ensuring reproducible results, and offers simplicity in implementation with low per-iteration overhead since no additional computations are needed for selection. Greedy selection identifies the coordinate that yields the maximum decrease in the objective function f after performing the exact minimization along that direction. Evaluating the potential decrease for all n coordinates typically incurs an O(n) computational cost per selection, assuming the line minimization for each coordinate is efficient; however, this can escalate in practice for complex functions. Random selection chooses the index i_k uniformly at random from 1 to n at each step. This probabilistic strategy avoids pathological orderings that might slow convergence in cyclic methods, while also facilitating theoretical analysis and parallelization. The Gauss-Southwell rule selects the coordinate corresponding to the largest absolute partial derivative |\partial f / \partial x_i|. Historically, this rule originated in work by Carl Friedrich Gauss and was popularized by R. V. Southwell in his 1946 treatise on relaxation methods for solving systems of linear equations. Implementation challenges arise from the requirement to compute or approximate the full gradient vector at every iteration, making it computationally intensive for high-dimensional or large-scale problems.
Method | Pros (speed, convergence, applicability) | Cons (speed, convergence, applicability)
Cyclic | Fast, constant-time selection; simple and deterministic for any dimension; broadly applicable in serial implementations. | May converge slowly when coordinates are unevenly scaled, of unequal importance, or strongly coupled; less suitable for non-convex problems where ordering matters; limited parallelizability.
Greedy | Often the fastest practical convergence by prioritizing impactful updates; effective for problems with sparse or structured solutions. | High selection cost (O(n) or more per step); computationally prohibitive for very large n without approximations or efficient evaluation shortcuts.
Random | Low, constant-time selection overhead; improved convergence guarantees in expectation; highly parallelizable and robust to non-convexity. | Non-deterministic, so key coordinates may go unselected for stretches and performance varies across runs; convergence may lag behind greedy in worst-case scenarios.
Gauss-Southwell | Convergence rate superior to random selection in most cases, especially for smooth functions; applicable to gradient-based problems. | Expensive gradient computation per iteration; challenging with inexact or stochastic gradients; impractical for massive datasets without gradient approximations; sensitive to gradient accuracy.
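The selection rules above differ only in how the index is chosen; the hedged sketch below (with an invented test problem and helper name run) contrasts cyclic, uniformly random, and Gauss-Southwell selection for coordinate gradient steps on a quadratic, using the diagonal entries Q_{ii} as the coordinate-wise step-size constants. Note that the Gauss-Southwell rule needs the full gradient at every step, which is exactly the cost concern mentioned above.

import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
Q = A.T @ A + np.eye(n)              # positive definite Hessian of 1/2 x^T Q x + c^T x
c = rng.standard_normal(n)

def run(rule, iters=2000):
    x = np.zeros(n)
    for k in range(iters):
        g = Q @ x + c                # full gradient (only its i-th entry is strictly
                                     # needed for cyclic and random selection)
        if rule == "cyclic":
            i = k % n
        elif rule == "random":
            i = int(rng.integers(n))
        else:                        # "gauss-southwell": largest absolute partial derivative
            i = int(np.argmax(np.abs(g)))
        x[i] -= g[i] / Q[i, i]       # coordinate step with step size 1 / L_i
    return 0.5 * x @ Q @ x + c @ x   # final objective value

for rule in ("cyclic", "random", "gauss-southwell"):
    print(rule, run(rule))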

Theoretical Foundations

Convergence Analysis for Differentiable Functions

Convergence analysis for coordinate descent applied to differentiable functions typically assumes that the objective function f: \mathbb{R}^n \to \mathbb{R} is convex and continuously differentiable, with a Lipschitz continuous gradient. More precisely, the partial derivatives \nabla_i f are assumed to satisfy coordinate-wise Lipschitz conditions, meaning there exist constants L_i > 0 such that |\nabla_i f(x + t e_i) - \nabla_i f(x)| \leq L_i |t| for all x \in \mathbb{R}^n, i = 1, \dots, n, and t \in \mathbb{R}, where e_i is the i-th standard basis vector. These assumptions ensure that exact minimization along each coordinate direction can be performed efficiently, often in closed form or via line search. Under these conditions, coordinate descent achieves global convergence to a stationary point x^*, where \nabla f(x^*) = 0. A proof sketch relies on the descent property: at each iteration k, selecting coordinate i_k and updating x^{k+1} = \arg\min_{t} f(x^k + t e_{i_k}) yields f(x^{k+1}) < f(x^k) unless \nabla_{i_k} f(x^k) = 0 for all selected coordinates over a full cycle. Since f is convex and, under the standard additional assumption that its level sets are bounded, the iterates \{x^k\} remain in a compact set and the algorithm mapping is closed, the sequence converges to a point satisfying the first-order optimality condition. This result holds for both cyclic and greedy coordinate selection strategies. A foundational descent inequality establishes that each update strictly decreases the objective unless the selected coordinate gradient vanishes. Specifically, for a differentiable f with coordinate Lipschitz constant L_{i_k}, the decrease satisfies f(x^{k+1}) \leq f(x^k) - \frac{1}{2 L_{i_k}} |\nabla_{i_k} f(x^k)|^2, where i_k is the selected coordinate. If the selection maximizes the potential decrease, as in greedy variants, this can be sharpened to involve \max_i |\nabla_i f(x^k)|^2 / L_i. This property underpins global convergence by guaranteeing sufficient progress toward stationarity. For convergence rates, when f is merely convex, coordinate descent exhibits sublinear convergence: f(x^k) - f^* = O(1/k), where f^* is the minimum value, assuming uniform selection probabilities or cyclic ordering. Under additional strong convexity with parameter \mu > 0 (i.e., f(y) \geq f(x) + \nabla f(x)^T (y - x) + (\mu/2) \|y - x\|^2), the method achieves linear convergence: \|x^k - x^*\| = O(\rho^k) for some \rho < 1, with the rate depending on the condition number \kappa = \max_i L_i / \mu. Quadratic functions provide a canonical example, as they are strongly convex (when the Hessian is positive definite) with constant Hessian; here, exact coordinate updates yield linear convergence, and in separable cases (diagonal Hessian), the method terminates in at most n steps.
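The per-update decrease quoted above follows directly from the coordinate-wise Lipschitz assumption; a compact derivation, sketched under exactly those assumptions, is:

f(x^k + t e_{i_k}) \leq f(x^k) + t \, \nabla_{i_k} f(x^k) + \frac{L_{i_k}}{2} t^2 \quad \text{for all } t \in \mathbb{R},

which follows by integrating the bound |\nabla_{i_k} f(x^k + s e_{i_k}) - \nabla_{i_k} f(x^k)| \leq L_{i_k} |s| along the segment from x^k to x^k + t e_{i_k}. Minimizing the right-hand side over t gives t = -\nabla_{i_k} f(x^k) / L_{i_k}, hence

\min_{t} f(x^k + t e_{i_k}) \leq f(x^k) - \frac{1}{2 L_{i_k}} |\nabla_{i_k} f(x^k)|^2,

and exact coordinate minimization, which does at least as well as this particular choice of t, satisfies the same decrease.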

Convergence for Non-Differentiable Functions

In the case of non-differentiable convex functions, convergence analysis for coordinate descent relies on extensions that handle nonsmoothness through subgradients or proximal operators, assuming the objective f is convex but lacks differentiability everywhere. Typically, the function is decomposed as f(x) = h(x) + r(x), where h is convex and smooth (with coordinate-wise Lipschitz continuous gradients \nabla_i h), and r is convex, possibly nonsmooth, and separable across coordinates (r(x) = \sum_i r_i(x_i)) to enable coordinate-wise updates. This separability condition ensures that the proximal operator for r can be computed independently for each coordinate, avoiding the need to solve coupled subproblems. Without separability, exact coordinate minimization may fail to converge or may cycle, as demonstrated in early analyses of nonsmooth cases. The proximal coordinate descent method addresses nonsmoothness by solving a linearized proximal subproblem for the selected coordinate i: x_i^{k+1} = \prox_{\beta_k r_i}\left( x_i^k - \beta_k \nabla_i h(x^k) \right), where \beta_k > 0 is a step size (often chosen as \beta_k = 1/L_i, with L_i the Lipschitz constant of \nabla_i h), and the proximal operator is defined as \prox_{\beta r_i}(z) = \argmin_{x_i} \left\{ r_i(x_i) + \frac{1}{2\beta} \|x_i - z\|^2 \right\}. Here, \nabla_i h(x^k) supplies the descent direction for the smooth part, while the proximal term handles the nonsmooth r_i. For fully nonsmooth objectives without a smooth component, a coordinate subgradient method can be used instead: select i and a subgradient g_i \in \partial_i f(x^k), then update x_i^{k+1} = x_i^k - \beta_k g_i, assuming bounded subgradients and step-size rules like \beta_k = O(1/\sqrt{k}). This mirrors the full subgradient method but scales poorly with dimension due to coordinate sampling. Under convexity and the above assumptions, proximal coordinate descent achieves sublinear convergence rates. For randomized selection of coordinates (uniform probability 1/n), the expected optimality gap satisfies \mathbb{E}[f(\bar{x}^k) - f^*] \leq O\left( \frac{n \max_i L_i \|x^0 - x^*\|^2}{k} \right), where \bar{x}^k is an ergodic average, x^* is an optimal point, and k is the iteration count; corresponding bounds also hold with high probability under additional assumptions. The analyses of Nesterov and of Richtárik and Takáč unify this setting, showing that the rate matches the complexity of full proximal gradient methods up to the dimension factor n, with improvements possible via nonuniform selection probabilities proportional to L_i. For specific nonsmooth terms like \ell_1-regularization, the proximal operator reduces to soft-thresholding: \prox_{\beta \| \cdot \|_1}(z)_j = \sign(z_j) \max(|z_j| - \beta, 0), which can be evaluated coordinate-wise in closed form. These results extend the differentiable case by replacing exact gradient steps with proximal corrections, ensuring convergence toward a point where 0 \in \partial f(x).
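For concreteness, the sketch below (the helper names soft_threshold and prox_cd_lasso are invented, and the data are synthetic) implements randomized proximal coordinate descent for the composite objective h(x) + r(x) with h(x) = \frac{1}{2}\|Ax - y\|_2^2 and r(x) = \lambda \|x\|_1, using step sizes 1/L_i with L_i = \|A_{:,i}\|^2 and the soft-thresholding proximal operator.

import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * |.|, applied componentwise
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_cd_lasso(A, y, lam, iters=20000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    L = (A ** 2).sum(axis=0)            # coordinate-wise Lipschitz constants of grad h
    x = np.zeros(n)
    r = A @ x - y                       # residual A x - y, maintained incrementally
    for _ in range(iters):
        i = int(rng.integers(n))        # uniform random coordinate selection
        g_i = A[:, i] @ r               # partial derivative of h at the current x
        x_i_new = soft_threshold(x[i] - g_i / L[i], lam / L[i])
        r += A[:, i] * (x_i_new - x[i]) # update the residual for the changed coordinate
        x[i] = x_i_new
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 300))
y = A[:, :5] @ np.array([2.0, -1.0, 0.5, 3.0, -2.0]) + 0.01 * rng.standard_normal(100)
x_hat = prox_cd_lasso(A, y, lam=5.0)
print(np.count_nonzero(x_hat))          # the l1 term keeps most coordinates at exactly zero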

Variants and Extensions

Block Coordinate Descent

Block coordinate descent (BCD) extends the standard coordinate descent method by partitioning the variables into multiple blocks B_1, \dots, B_m, where each block consists of one or more coordinates, and sequentially minimizing the objective function over each block while keeping the others fixed. This approach generalizes the single-coordinate case, which corresponds to blocks of size one. In the algorithm, at each iteration k, a block B_j is selected, and the variables in that block are updated by solving x_{B_j}^{k+1} = \arg\min_{x_{B_j}} f(x_{B_1}^k, \dots, x_{B_{j-1}}^k, x_{B_j}, x_{B_{j+1}}^k, \dots, x_{B_m}^k), where f is the objective function, and the other blocks remain unchanged. BCD offers advantages in parallelization, as updates within a block can often be computed concurrently, making it suitable for large-scale problems on distributed systems. It also handles correlated variables effectively by grouping them into blocks that exploit problem structure, such as in sparse or low-rank models. For instance, in nonnegative matrix factorization, BCD updates blocks of factor matrices sequentially, leading to faster convergence and better solution quality compared to alternating least squares methods on datasets like facial images. Under conditions such as block-separability of the nonsmooth part of the objective or block-wise Lipschitz continuity of the gradient, BCD converges to a stationary point, with rates analogous to coordinate descent but influenced by block-specific Lipschitz constants that measure the smoothness within each block. These constants determine the progress per iteration, with well-chosen blocks reducing the effective condition number and accelerating convergence. For partial block updates, inexact minimizations (such as approximate solutions via conjugate gradients or preconditioned iterations) require weaker assumptions, like bounded relative errors in the block subproblems, while preserving global convergence guarantees.
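A minimal two-block illustration (synthetic problem; the index sets B1 and B2 are chosen arbitrarily) is sketched below: each block of a strictly convex quadratic is minimized exactly by solving the corresponding linear system while the other block is held fixed.

import numpy as np

rng = np.random.default_rng(1)
n, n1 = 6, 3                               # total dimension and size of the first block
A = rng.standard_normal((n, n))
Q = A.T @ A + np.eye(n)                    # positive definite Hessian
c = rng.standard_normal(n)
B1, B2 = np.arange(n1), np.arange(n1, n)   # the two blocks of coordinates

x = np.zeros(n)
for _ in range(200):
    # Exact minimization over block B1 with x[B2] fixed:
    #   Q[B1, B1] x[B1] = -(c[B1] + Q[B1, B2] x[B2])
    x[B1] = np.linalg.solve(Q[np.ix_(B1, B1)], -(c[B1] + Q[np.ix_(B1, B2)] @ x[B2]))
    # Exact minimization over block B2 with x[B1] fixed
    x[B2] = np.linalg.solve(Q[np.ix_(B2, B2)], -(c[B2] + Q[np.ix_(B2, B1)] @ x[B1]))

print(np.linalg.norm(x - np.linalg.solve(Q, -c)))  # distance to the global minimizer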

Randomized and Stochastic Variants

Randomized coordinate descent (RCD) introduces randomness into the coordinate selection process by choosing the index i_k uniformly at random from \{1, \dots, n\} at each iteration, rather than following a fixed order. This uniform sampling ensures that each coordinate is updated with equal probability 1/n, promoting balanced progress across all variables and mitigating the risk of slow convergence due to adversarial ordering in deterministic methods. As a probabilistic extension of coordinate selection strategies, RCD often exhibits faster practical convergence than cyclic selection, particularly on large-scale problems where coordinate interactions vary. Theoretical analysis for RCD on smooth strongly convex functions establishes an expected linear convergence rate, requiring O\left(n \log \frac{F(x_0) - F^*}{\epsilon \rho}\right) iterations to achieve \epsilon-accuracy in function value with probability at least 1 - \rho, where n is the dimension; because each iteration is roughly a factor of n cheaper than a full-gradient step, this improves scalability. Accelerated variants of RCD incorporate momentum mechanisms, such as Nesterov-style extrapolation, to further enhance convergence; for instance, these methods achieve rates akin to accelerated full-gradient descent while maintaining the low per-iteration complexity of coordinate updates. Key developments in this area include the foundational work by Richtárik and Takáč on uniform sampling and its iteration complexity, which laid the groundwork for analyzing randomized methods on composite objectives. Stochastic coordinate descent builds on RCD by employing noisy approximations of partial derivatives, typically through mini-batches of data points for each selected coordinate, enabling efficient handling of massive datasets where computing exact partial derivatives is prohibitive. A prominent variance-reduced formulation, such as SVRCD, constructs an unbiased estimator for the chosen coordinate i as g_i = \nabla_i F(\tilde{x}) + \nabla_i f(x; \xi) - \nabla_i f(\tilde{x}; \xi), where \tilde{x} is a periodically recomputed reference point at which a full gradient is evaluated, F is the full objective, and \xi is a random sample (or mini-batch average); this correction reduces the variance of the gradient estimates, leading to accelerated progress toward the optimum. Minibatch extensions facilitate distributed implementations by parallelizing computations across nodes, with each worker processing a subset of the data for the selected coordinate. In terms of convergence, stochastic variants with variance reduction inherit linear rates for strongly convex objectives, often matching or approaching O\left(\left(n + \sqrt{\frac{L}{\mu}}\right) \log \frac{1}{\epsilon}\right) iterations, where L and \mu are the smoothness and strong convexity constants, while providing empirical speedups over non-stochastic RCD on high-dimensional data. Implementation-wise, these methods are memory efficient for large-scale settings, as they require storing only the current iterate and a few gradient quantities, with sparse data patterns allowing updates without loading the entire dataset into memory. Recent extensions as of 2025 include accelerated versions like ASVRCD for faster non-asymptotic rates and differentially private randomized block coordinate descent for privacy-preserving optimization.
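The sketch below (a toy quadratic; the helper name rcd is invented) contrasts uniform 1/n sampling with nonuniform probabilities proportional to L_i, using coordinate gradient steps of size 1/L_i; it is meant only to show how the sampling distribution plugs into the update, not to reproduce any published benchmark.

import numpy as np

rng = np.random.default_rng(2)
n = 100
A = rng.standard_normal((2 * n, n))
Q = A.T @ A / (2 * n) + np.diag(rng.uniform(0.1, 20.0, n))  # heterogeneous curvature
c = rng.standard_normal(n)
L = np.diag(Q).copy()                 # coordinate-wise Lipschitz constants

def rcd(probs, iters=20000):
    x = np.zeros(n)
    for _ in range(iters):
        i = int(rng.choice(n, p=probs))
        g_i = Q[i, :] @ x + c[i]      # partial derivative of 1/2 x^T Q x + c^T x
        x[i] -= g_i / L[i]            # coordinate step of size 1 / L_i
    return 0.5 * x @ Q @ x + c @ x

print(rcd(np.full(n, 1.0 / n)))       # uniform sampling
print(rcd(L / L.sum()))               # probabilities proportional to L_i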

Limitations and Improvements

Key Limitations

One key limitation of coordinate descent is the zigzagging phenomenon, observed particularly in cyclic implementations, where the algorithm alternates updates between coordinates with minimal overall progress. This issue is especially pronounced in ill-conditioned problems where the ratio of the global Lipschitz constant to the maximum coordinate-wise Lipschitz constant is large. This oscillation delays convergence by restricting progress to axis-aligned directions, leading to inefficient paths toward the optimum. The method's effectiveness is also hindered by its dependence on problem conditioning; strongly coupled coordinates, often reflected in a high condition number, result in poor performance as the algorithm requires numerous iterations to achieve meaningful reductions in the objective function. In such cases, the convergence rate deteriorates, with theoretical bounds showing linear rates of 1 - \mu / (n L_{\max}) per iteration for randomized selection, where \mu is the strong convexity parameter, n the dimension, and L_{\max} the maximum coordinate-wise smoothness constant, highlighting worst-case slowdowns compared to full-gradient methods. Additionally, solving the univariate subproblems poses challenges when the objective is non-separable or non-smooth, as exact minimization along a coordinate may demand costly line searches or proximal computations, complicating practical implementation. Scalability suffers in high-dimensional non-sparse settings, where each full pass cycles through all n coordinates at an O(n) cost, making the approach inefficient for very large problems without exploitable structure. Empirically, these issues manifest in functions like the Rosenbrock function, which has a narrow, curved valley; coordinate descent exhibits slow convergence here relative to methods that exploit full gradient or curvature information, as its axis-aligned steps struggle to follow the valley's contour efficiently.
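The Rosenbrock behaviour described above is easy to reproduce; the sketch below (using scipy.optimize.minimize_scalar for the one-dimensional x-step, the closed form y = x^2 for the y-step, and an arbitrarily chosen starting point) shows exact coordinate descent crawling along the curved valley and still falling short of the minimizer (1, 1) after many cycles.

from scipy.optimize import minimize_scalar

def rosenbrock(x, y):
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

x, y = -1.2, 2.0                     # starting point off the valley floor
for k in range(1000):
    x = minimize_scalar(lambda t: rosenbrock(t, y)).x  # exact 1-D minimization in x
    y = x ** 2                                         # exact 1-D minimization in y
print(x, y, rosenbrock(x, y))        # x is still noticeably above 1 after 1000 cycles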

Strategies to Address Limitations

To mitigate the slow convergence often observed in coordinate descent due to issues like zigzagging or imbalanced coordinate scales, several techniques have been developed. One prominent approach is Nesterov-like acceleration applied to coordinate updates, which incorporates momentum to achieve faster rates akin to full-gradient accelerated methods. In this method, an auxiliary sequence y^k is maintained, and the coordinate update is performed as x_i^{k+1} = y_i^k - \frac{1}{L_i} \nabla_i f(y^k), followed by an extrapolation step such as y^{k+1} = x^{k+1} + \beta_k (x^{k+1} - x^k), where L_i is the coordinate-wise Lipschitz constant for the i-th coordinate and \beta_k is a momentum parameter. This variant improves upon standard coordinate descent by leveraging momentum to reduce the number of iterations needed for convergence on smooth convex problems. Preconditioning addresses the limitation of disparate coordinate Lipschitz constants, which can lead to inefficient progress in poorly scaled problems. A common strategy is diagonal scaling, where the optimization variables are rescaled by the inverse square roots of the diagonal elements approximating the Hessian or the Lipschitz constants, effectively balancing the condition number across coordinates. This simple yet effective preconditioner can be computed from initial gradient evaluations and integrated into the coordinate update steps without altering the core algorithm structure. Hybrid approaches combine coordinate descent with full-gradient steps to counteract zigzagging, where the iterate oscillates between coordinates without substantial progress toward the optimum. For instance, periodic full-gradient computations can guide coordinate selection or adjust step sizes, while active set methods identify and temporarily fix inactive coordinates (those whose partial derivatives are near zero) to focus updates on promising directions. These integrations enhance directional efficiency, particularly in bound-constrained settings. For problems where exact coordinate minimization is computationally expensive, approximation techniques allow inexact subproblem solves while preserving convergence guarantees. One such method employs conjugate gradients to iteratively approximate the optimal coordinate update, solving a quadratic model of the subproblem up to a controlled residual tolerance. This approach reduces per-iteration cost significantly for high-dimensional or structured objectives, with the number of inner iterations tuned based on the desired accuracy. Software implementations facilitate the adoption of these enhanced strategies; for example, the generic coordinate descent solver designed for integration with CVXPY supports accelerated variants through modular extensions for acceleration and preconditioning.
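The diagonal-scaling idea is simple enough to show directly; the sketch below (synthetic, badly scaled quadratic; the helper name cyclic_cd is invented) compares coordinate gradient steps that use one conservative global step size 1/\max_i L_i with steps of size 1/L_i, which are equivalent to unit steps after rescaling each variable by \sqrt{L_i}.

import numpy as np

rng = np.random.default_rng(3)
n = 50
Q = np.diag(rng.uniform(0.1, 100.0, n))      # badly scaled curvature along the axes
A = rng.standard_normal((n, n)) / n
Q = Q + A.T @ A                              # mild coupling between coordinates
c = rng.standard_normal(n)
L = np.diag(Q)

def cyclic_cd(step, cycles=100):
    x = np.zeros(n)
    for _ in range(cycles):
        for i in range(n):
            x[i] -= step[i] * (Q[i, :] @ x + c[i])   # coordinate gradient step
    return 0.5 * x @ Q @ x + c @ x

print(cyclic_cd(np.full(n, 1.0 / L.max())))  # unscaled: one global conservative step size
print(cyclic_cd(1.0 / L))                    # diagonally preconditioned per-coordinate steps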

Applications

In Optimization Problems

Coordinate descent methods find extensive use in solving linear systems through iterative updates that minimize quadratic objectives. For instance, the Gauss-Seidel method for a symmetric positive-definite system can be interpreted as cyclic coordinate descent on the associated quadratic objective; applied to the normal equations, it minimizes the least-squares objective \|Ax - b\|_2^2, where A is an m \times n matrix and b \in \mathbb{R}^m. At each step it solves for one variable while fixing the others, leading to convergence for symmetric positive-definite systems. This approach is particularly effective for sparse or structured matrices in numerical linear algebra, where direct factorizations are costly. In nonlinear constrained optimization, coordinate descent handles bound constraints by incorporating projections onto feasible sets during updates. For problems like \min f(x) subject to l \leq x \leq u, each coordinate update minimizes the objective along that direction and projects the result back onto the interval [l_i, u_i], enabling efficient handling of box constraints without requiring full projections. This projected variant maintains the simplicity of coordinate-wise minimization while ensuring feasibility, making it suitable for medium-scale nonlinear problems with simple bounds. In compressed sensing, coordinate descent addresses sparse signal recovery by minimizing objectives with \ell_1 penalties, such as \min_u \|u\|_1 + \lambda \|Au - f\|_2^2, where A is the measurement matrix and f the observations. A greedy coordinate descent variant selects the coordinate with the largest-magnitude update, achieving fast and accurate sparse signal reconstruction competitive with state-of-the-art methods like basis pursuit. A representative example is portfolio optimization, where coordinate descent minimizes investment variance subject to norm constraints on the weights. For the problem \min_w w^T \Sigma w subject to \sum w_i = 1 with penalties like \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2, coordinate-wise updates alternate between solving univariate subproblems for each weight and enforcing the sum constraint, yielding sparse portfolios. Performance benchmarks on medium-scale problems, such as \ell_1-regularized least squares with thousands of variables, show coordinate descent methods achieving comparable solution times to interior-point methods for modest accuracy (e.g., 1% suboptimality) but requiring more iterations for high precision, where interior-point approaches leverage second-order information for faster convergence.
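As an illustration of the projected update for box constraints, the sketch below (a synthetic bound-constrained quadratic) clips each exact coordinate minimizer back onto the interval [l_i, u_i].

import numpy as np

rng = np.random.default_rng(4)
n = 20
A = rng.standard_normal((n, n))
Q = A.T @ A + np.eye(n)                 # positive definite Hessian
c = 5.0 * rng.standard_normal(n)
l, u = -np.ones(n), np.ones(n)          # box constraints l <= x <= u

x = np.zeros(n)
for _ in range(200):
    for i in range(n):
        t = x[i] - (Q[i, :] @ x + c[i]) / Q[i, i]  # unconstrained coordinate minimizer
        x[i] = min(max(t, l[i]), u[i])             # project onto [l_i, u_i]
print(x.min(), x.max())                  # every component stays within [-1, 1]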

In Machine Learning and Statistics

Coordinate descent plays a central role in fitting high-dimensional statistical models, particularly those involving regularization to promote sparsity. In lasso regression, which minimizes the least-squares loss plus an L1 penalty, cyclic coordinate descent iteratively updates each coefficient by solving a univariate soft-thresholding problem while holding the others fixed. The update for the j-th coefficient is given by \beta_j \leftarrow \text{sign}(z_j) \left( |z_j| - \lambda \right)_+ with z_j = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i + x_{ij} \beta_j) x_{ij}, where \lambda is the regularization parameter and ( \cdot )_+ denotes the positive part; this approach extends naturally to the elastic net by incorporating an L2 penalty that shrinks coefficients toward a ridge solution before thresholding. This cyclic strategy computes the full regularization path efficiently and can be viewed as a special case of the iterative soft-thresholding algorithm (ISTA) when applied to orthogonal designs, where coordinate-wise minimization aligns with proximal gradient steps. For generalized linear models (GLMs), coordinate descent integrates into the iteratively reweighted least squares (IRLS) framework to handle non-Gaussian responses, such as in logistic regression for binary outcomes. In each IRLS iteration, the deviance is approximated by a quadratic function, and coordinate descent solves the resulting penalized weighted least-squares problem via soft-thresholding updates on the coefficients, enabling sparse model estimation along the regularization path. This method supports family-specific link functions and penalties, making it suitable for high-dimensional data where traditional methods fail due to computational cost. In matrix completion and nonnegative matrix factorization (NMF), block coordinate descent optimizes low-rank approximations by alternately updating blocks of factors, such as rows or columns of the factor matrices. For matrix completion, which seeks to fill missing entries in a partially observed matrix under a low-rank assumption, block updates minimize the Frobenius-norm loss over subsets of variables, often incorporating nuclear-norm regularization for convexity; this yields scalable solutions for recommendation systems with millions of entries. Similarly, in NMF, which decomposes a nonnegative matrix into low-rank nonnegative factors for interpretable dimensionality reduction, block coordinate descent applies multiplicative updates or exact line searches per block, promoting sparsity and interpretability in applications like topic modeling. For structure learning in Gaussian Markov random fields (MRFs), coordinate descent underlies the graphical lasso algorithm, which estimates sparse inverse covariance (precision) matrices by penalizing the L1 norm of off-diagonal elements. The method performs block updates on the precision matrix columns, solving lasso subproblems for each variable's conditional dependencies while enforcing positive definiteness through Cholesky factorization; this reveals the underlying graph structure in high-dimensional multivariate data, such as gene expression networks. Implementations like those in glmnet and scikit-learn leverage coordinate descent for the lasso and elastic net, scaling to datasets with up to 10^6 features by exploiting sparsity and warm starts for path computation. Empirical studies demonstrate that coordinate descent achieves speedups of 10-100x over stochastic gradient descent (SGD) on sparse high-dimensional problems, due to exact univariate solves and reduced variance in updates.
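In practice these solvers are usually invoked through a library; the short usage sketch below relies on scikit-learn's Lasso estimator, whose underlying solver is coordinate descent, with synthetic data and an illustratively chosen alpha (which plays the role of \lambda above).

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n_samples, n_features = 200, 500
X = rng.standard_normal((n_samples, n_features))
beta_true = np.zeros(n_features)
beta_true[:5] = [3.0, -2.0, 1.5, 0.5, 4.0]            # sparse ground-truth coefficients
y = X @ beta_true + 0.1 * rng.standard_normal(n_samples)

model = Lasso(alpha=0.1, max_iter=10000).fit(X, y)     # fitted by coordinate descent
print(np.count_nonzero(model.coef_), model.coef_[:5])  # sparse fit: few nonzero coefficients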
