Kaczmarz method
The Kaczmarz method is an iterative algorithm for solving consistent systems of linear equations Ax = b, where A is an m \times n matrix with full column rank and m \geq n, by performing successive orthogonal projections of the current iterate onto the hyperplanes defined by each individual equation in the system.[1] In each step, starting from an initial guess x_0, the method updates the solution by projecting onto the i-th hyperplane H_i = \{x \in \mathbb{R}^n : \langle a_i, x \rangle = b_i \}, yielding x_{k+1} = x_k + \frac{b_i - \langle a_i, x_k \rangle}{\|a_i\|^2} a_i, where a_i is the i-th row of A and the rows are cycled through sequentially.[1] This row-action approach is particularly efficient for large-scale, overdetermined systems where direct methods are computationally prohibitive.[2]
Developed by Polish mathematician Stefan Kaczmarz in 1937 as a means to approximate solutions to linear systems, the method was initially published in German under the title Angenäherte Auflösung von Systemen linearer Gleichungen.[3] It saw limited attention until its rediscovery and application in the late 20th century, notably in 1970 when Gordon, Bender, and Herman adapted it for image reconstruction in computed tomography as the Algebraic Reconstruction Technique (ART).[2] The method's popularity surged in the 2000s with the introduction of randomized variants, which select rows probabilistically rather than cyclically, leading to faster practical convergence.[4]
Key applications of the Kaczmarz method span medical imaging, such as positron emission tomography (PET) and magnetic resonance imaging (MRI), where it reconstructs images from projection data, as well as signal processing, geophysics, and machine learning for solving sparse recovery problems.[5] For consistent systems, the classical cyclic version converges linearly to the unique solution when A is square and nonsingular, with the rate depending on the matrix's condition number.[1] Randomized extensions, such as those proposed by Strohmer and Vershynin in 2009, achieve exponential convergence in expectation, bounded by \left(1 - \frac{1}{\kappa^2(A)}\right)^k, where \kappa(A) is the scaled condition number, making them robust for ill-conditioned or noisy data.[4] Variants like block Kaczmarz, which project onto multiple rows simultaneously, and greedy selections further enhance speed for large-scale problems.[2]
Overview
Definition and Motivation
The Kaczmarz method is an iterative algorithm for solving consistent systems of linear equations Ax = b, where A is an m \times n matrix with m \geq n, x \in \mathbb{R}^n, and b \in \mathbb{R}^m, assuming an exact solution exists.[6] The method iteratively refines an initial guess by performing successive orthogonal projections onto the hyperplanes defined by the individual equations a_i^T x = b_i, with a_i denoting the i-th row of A.[6] This projection-based approach geometrically drives the solution toward the intersection of all hyperplanes, leveraging the structure of the system without requiring full matrix operations in each step.[6]
For large-scale overdetermined systems, the Kaczmarz method offers significant advantages over direct least-squares approaches, such as solving the normal equations A^T A x = A^T b. Forming A^T A demands substantial memory for the dense n \times n matrix, especially when n is large, and subsequent factorization is computationally intensive.[6] In contrast, the row-action nature of the Kaczmarz method processes only one equation per iteration, storing just the original matrix A and facilitating parallel implementations across rows, which is ideal for sparse or distributed data environments.[6]
A simple illustration arises in a 2D consistent system where two equations represent intersecting lines as hyperplanes. Starting from an arbitrary point, the method projects it orthogonally onto the first line, then projects the result onto the second line, repeating until convergence to the intersection point.[6] This geometric intuition underscores its efficacy for visualizing solution processes in low dimensions.
The method originated in 1937 as an approximate solver for linear systems and was later adapted for computed tomography, where it enables real-time image reconstruction from sparse projection data using row-action updates.[6] Randomized variants extend its applicability to inconsistent cases with faster convergence.[6]
Historical Background
The Kaczmarz method was introduced by Polish mathematician Stefan Kaczmarz in 1937 through his seminal paper "Angenäherte Auflösung von Systemen linearer Gleichungen" ("Approximate solution of systems of linear equations"), published in the Bulletin International de l'Académie Polonaise des Sciences et des Lettres.[7] In this work, Kaczmarz proposed an iterative approach based on successive orthogonal projections onto the hyperplanes defined by the equations of a linear system, establishing convergence for systems with nonsingular coefficient matrices.[1] Despite its elegance, the method received limited initial attention; Kaczmarz himself died in 1939 amid the onset of World War II.[7]
The algorithm saw independent rediscoveries in the late 1940s and early 1950s, notably by E. Bodewig in 1948 and G.E. Forsythe in 1953, who explored its utility in numerical linear algebra for engineering contexts.[1] Its popularity surged in the 1970s with applications to image reconstruction in computerized tomography (CT) scans, where Richard Gordon, Robert Bender, and Gabor T. Herman revived it, independently of Kaczmarz's original work, as the Algebraic Reconstruction Technique (ART) in their 1970 paper.[8] This adoption integrated the method into medical imaging pipelines, emphasizing its efficiency for sparse, overdetermined systems arising from projection data.[8] Throughout the 1970s and 1980s, researchers such as Herman extended its use in algebraic reconstruction, solidifying its role in tomography despite the hardware limitations of the time.
A pivotal review came in 1981 with Yair Censor's survey on row-action methods, which systematically analyzed the Kaczmarz algorithm alongside related iterative projection techniques for large-scale sparse systems, highlighting its theoretical foundations and practical implementations.[9] The method's deterministic cycling was observed to exhibit slow convergence in ill-conditioned cases, motivating relaxations and ordering strategies to improve performance.[1] Interest revived dramatically in 2009, when Thomas Strohmer and Roman Vershynin proposed a randomized variant, proving exponential expected convergence rates and demonstrating its efficacy in signal processing for compressed sensing problems.[10] This resurgence connected the classical method to modern machine learning and data science applications.[10]
Classical Method
Algorithm Description
The classical Kaczmarz method is an iterative projection technique for solving consistent linear systems Ax = b, where A \in \mathbb{R}^{m \times n} is the coefficient matrix with m \geq n and full column rank, x \in \mathbb{R}^n is the unknown solution vector, and b \in \mathbb{R}^m is the right-hand side vector. The method proceeds by cyclically selecting rows of A and orthogonally projecting the current iterate onto the hyperplane defined by the corresponding equation.[1]
To implement the algorithm, initialize an arbitrary starting vector x^0 \in \mathbb{R}^n, often chosen as the zero vector for simplicity. For each iteration k = 0, 1, 2, \dots, select the row index i = (k \mod m) + 1, and update the iterate using the projection formula:
x^{k+1} = x^k + \frac{b_i - a_i^T x^k}{\|a_i\|^2} a_i,
where a_i^T denotes the i-th row of A. This update enforces exact satisfaction of the i-th equation at each step.[1]
The full procedure can be expressed in pseudocode as follows, which cycles through all rows sequentially in a loop until a convergence criterion is met, such as \|A x^k - b\|_2 < \epsilon for a small tolerance \epsilon > 0:
Input: Matrix A ∈ ℝ^{m×n} (full column rank), vector b ∈ ℝ^m, initial x⁰ ∈ ℝⁿ, tolerance ε > 0
Set k = 0
While ||A xᵏ - b||₂ ≥ ε:
    For i = 1 to m:
        x^{k+1} = xᵏ + (b_i - a_iᵀ xᵏ) / ||a_i||₂² * a_i
        k = k + 1
    End For
End While
Output: Approximate solution xᵏ
This structure completes one full cycle over the m rows in O(mn) time for dense A, with each row projection requiring O(n) operations for the inner product, scalar computation, and vector update; the method is particularly efficient when A is sparse, as operations scale with the number of nonzeros per row.
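The cyclic sweep above translates directly into a few lines of NumPy. The following is a minimal illustrative sketch (function and variable names are not from any standard library), assuming a consistent system whose rows are all nonzero:
import numpy as np

def kaczmarz_cyclic(A, b, x0=None, tol=1e-10, max_sweeps=1000):
    """Cyclic Kaczmarz: sweep over the rows, projecting onto each hyperplane in turn."""
    m, n = A.shape
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    row_norms_sq = np.einsum('ij,ij->i', A, A)  # ||a_i||^2 for every row
    for _ in range(max_sweeps):
        for i in range(m):
            x = x + (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
        if np.linalg.norm(A @ x - b) < tol:  # check the residual once per sweep
            break
    return x

# Small consistent overdetermined example
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = rng.standard_normal(10)
b = A @ x_true
print(np.linalg.norm(kaczmarz_cyclic(A, b) - x_true))  # close to zero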
Geometric Interpretation
The Kaczmarz method admits a clear geometric interpretation as an iterative process of orthogonal projections onto the hyperplanes defined by the rows of the linear system Ax = b. Each equation a_i^T x = b_i, where a_i is the i-th row of A, corresponds to an affine hyperplane in \mathbb{R}^n, and the method seeks the point in the intersection of these hyperplanes (the solution set) by successively minimizing the Euclidean distance from the current iterate to each hyperplane in a cyclic manner.[6] This projection-based view underscores the method's role in feasibility problems, where the goal is to find a point in the intersection of convex sets, here specifically the hyperplanes.[11]
The orthogonal projection P_i(x) of a point x onto the hyperplane H_i = \{ z \in \mathbb{R}^n : a_i^T z = b_i \} is given by the formula
P_i(x) = x + \frac{b_i - a_i^T x}{\|a_i\|^2} a_i,
which adds to x the vector along the normal direction a_i that bridges the residual distance to the hyperplane, ensuring the minimal Euclidean distance.[6] This update represents the closest point on H_i to x, and in the Kaczmarz iteration, the next estimate is obtained by applying such projections sequentially over the hyperplanes.[11]
In two dimensions, where the hyperplanes reduce to lines, the iterative process manifests as a zig-zag trajectory: starting from an initial point, each projection reflects the current position perpendicularly onto the next line, gradually spiraling inward toward the intersection point that satisfies all equations.[11] For instance, consider lines defined by x_1 + x_2 = 1 and -x_1 + 3x_2 = 2; successive projections from a starting point like (0,0) would alternate between the lines, forming a polygonal path that converges to the solution (1/4, 3/4). In higher dimensions, the process similarly generates successive approximations that refine the position within the solution subspace, though the path is less visually intuitive, converging to the affine subspace defined by the consistent system.[6]
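This two-line example can be reproduced numerically; the short illustrative script below applies alternating projections starting from the origin, as described above, and converges to (1/4, 3/4):
import numpy as np

# Lines x1 + x2 = 1 and -x1 + 3 x2 = 2, written as rows of A x = b
A = np.array([[1.0, 1.0], [-1.0, 3.0]])
b = np.array([1.0, 2.0])

x = np.zeros(2)  # start at the origin
for k in range(20):
    i = k % 2  # alternate between the two lines
    x = x + (b[i] - A[i] @ x) / (A[i] @ A[i]) * A[i]

print(x)  # approaches [0.25, 0.75], the intersection of the two lines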
A conceptual diagram of this process in 2D typically depicts the initial point, the sequence of projection lines, and the orthogonal segments connecting iterates to their projections, illustrating the non-increasing distance to the solution and the eventual convergence at the intersection; arrows along the path highlight the iterative refinement.[11] The method's projections are nonexpansive, preserving or reducing distances, which ensures Fejér monotonicity: each iterate is no farther from any point in the solution set than its predecessor, promoting steady progress toward the intersection.[12]
Basic Convergence
The classical Kaczmarz method converges to the unique solution x^* of the consistent linear system Ax = b, where A \in \mathbb{R}^{m \times n} has full column rank and m \geq n, starting from any initial point x^0.
This convergence is linear, with the contraction factor depending on the angles between the row vectors of A; larger angles between successive rows lead to faster progress per iteration.[13] In the worst case, the method requires O(\kappa^2) iterations to achieve a desired accuracy, where \kappa denotes the condition number of A.[5]
A proof sketch relies on the orthogonal projection property: each iteration projects the current iterate onto the hyperplane defined by the selected equation, yielding \|x^{k+1} - x^*\| \leq \|x^k - x^*\| in the Euclidean norm, with strict inequality unless x^k already lies on that hyperplane. Over a full cycle through all rows, the error decreases by a factor bounded below by the smallest singular value of the normalized row matrix, ensuring monotonic decrease toward x^*.[5]
The method exhibits slow convergence for ill-conditioned systems (large \kappa) or when rows are nearly parallel (small angles), as progress per step becomes minimal.[13] Empirically, multiple full cycles through the rows are often needed for practical convergence in such cases.[5]
Randomized Variant
The randomized Kaczmarz method, introduced by Strohmer and Vershynin in 2009, extends the classical approach by incorporating probabilistic row selection to achieve faster expected convergence for solving linear systems Ax = b, where A \in \mathbb{R}^{m \times n} with m \geq n. This variant selects equations with probabilities proportional to the squared row norms rather than in a fixed sequence, making it particularly effective for large-scale overdetermined systems.
At each iteration k, the row index i_k is chosen randomly with probability p_{i_k} = \frac{\|a_{i_k}\|^2}{\|A\|_F^2}. The iterate is then updated via orthogonal projection onto the hyperplane defined by the selected equation a_{i_k}^T x = b_{i_k}:
x^{k+1} = x^k + \frac{b_{i_k} - a_{i_k}^T x^k}{\|a_{i_k}\|^2} a_{i_k}.
This update rule applies the same projection as the classical Kaczmarz method, which is recovered as the deterministic special case when the rows are cycled through sequentially rather than sampled in proportion to their squared norms.
The algorithm assumes an overdetermined system (m \geq n) that may be consistent or inconsistent due to noise; it also applies to underdetermined cases (m < n) but primarily targets overdetermined ones where multiple equations constrain the solution.[14]
Here is the pseudocode for the randomized Kaczmarz method:
Initialize x^0 ∈ ℝ^n arbitrarily (e.g., x^0 = 0)
Compute p_i = ||a_i||^2 / ||A||_F^2 for i = 1, ..., m
While not converged (e.g., until ||Ax^k - b|| < ε or max iterations reached):
    Select i_k randomly with probability p_{i_k}
    x^{k+1} = x^k + (b_{i_k} - a_{i_k}^T x^k) / ||a_{i_k}||^2 * a_{i_k}
    k ← k + 1
Return x^k
This formulation yields faster convergence in expectation than the deterministic version, with the norm-weighted random selection promoting diverse projections that accelerate progress toward the solution. For inconsistent systems, the iterates approach a neighborhood of the least-squares solution \hat{x} = \arg\min_x \|Ax - b\|_2^2, up to an error determined by the noise level.[14]
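A minimal NumPy sketch of the randomized iteration (illustrative names, not a reference implementation), sampling rows in proportion to their squared norms:
import numpy as np

def randomized_kaczmarz(A, b, num_iters=5000, seed=0):
    """Randomized Kaczmarz: sample row i with probability ||a_i||^2 / ||A||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    row_norms_sq = np.einsum('ij,ij->i', A, A)
    probs = row_norms_sq / row_norms_sq.sum()  # ||A||_F^2 equals the sum of squared row norms
    for _ in range(num_iters):
        i = rng.choice(m, p=probs)
        x = x + (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    return x

# Consistent overdetermined example
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 20))
b = A @ rng.standard_normal(20)
print(np.linalg.norm(A @ randomized_kaczmarz(A, b) - b))  # small residual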
Convergence Proof
The convergence of the randomized Kaczmarz method is analyzed in expectation, yielding a linear rate that is exponentially fast under mild conditions on the system matrix. For a consistent overdetermined linear system Ax = b where A \in \mathbb{R}^{m \times n} has full column rank and rows a_i^\top for i = 1, \dots, m, the method generates iterates x^k by randomly selecting an index i_k with probability p_i = \|a_i\|^2 / \|A\|_F^2 and updating via the projection x^{k+1} = x^k + \frac{b_{i_k} - a_{i_k}^\top x^k}{\|a_{i_k}\|^2} a_{i_k}. The following theorem establishes the expected squared distance to the true solution x^*:
\begin{equation}
\mathbb{E} \left[ \|x^k - x^*\|^2 \right] \leq \left(1 - \frac{1}{\kappa_F^2}\right)^k \|x^0 - x^*\|^2,
\end{equation}
where \kappa_F = \|A\|_F / \sigma_{\min}(A) is the frame condition number of A, with \|A\|_F denoting the Frobenius norm and \sigma_{\min}(A) the smallest singular value of A.[15]
To prove this, condition on the iterate x^k and take the expectation over the random row index i. The update satisfies
\begin{equation}
\|x^{k+1} - x^*\|^2 = \|x^k - x^*\|^2 - \frac{(a_i^\top (x^k - x^*))^2}{\|a_i\|^2},
\end{equation}
so the conditional expectation is
\begin{equation}
\mathbb{E} \left[ \|x^{k+1} - x^*\|^2 \mid x^k \right] = \|x^k - x^*\|^2 - \sum_{i=1}^m p_i \frac{(a_i^\top (x^k - x^*))^2}{\|a_i\|^2} = \|x^k - x^*\|^2 - \frac{\|A(x^k - x^*)\|^2}{\|A\|_F^2}.
\end{equation}
The subtracted term is nonnegative, ensuring nonexpansion in expectation. To bound the contraction, note that \|A(x^k - x^*)\|^2 \geq \sigma_{\min}^2(A) \|x^k - x^*\|^2, yielding a uniform contraction factor 1 - 1/\kappa_F^2. Iterating the expectation then gives the theorem.[15]
For inconsistent systems where no exact solution exists, the randomized Kaczmarz method converges in expectation to a neighborhood of the minimum Euclidean norm solution of the associated least-squares problem \min_x \|Ax - b\|_2^2, with the error bounded by a term proportional to the noise level, assuming the algorithm starts at the origin x^0 = 0. This follows from the method's behavior as a stochastic projection onto the row space, asymptotically approaching the least-squares solution up to noise-induced error.[16]
This expected convergence rate offers a key advantage over the deterministic Kaczmarz method, as it is independent of the number of equations m and avoids slowdowns due to near-parallel row directions, achieving exponential decay governed solely by the frame condition number.[15]
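The bound can also be illustrated empirically. The following illustrative script averages the squared error over several independent runs on a random consistent system and compares the result with the (1 - 1/\kappa_F^2)^k envelope; it is a sanity check under these stated assumptions, not part of the original analysis:
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 20
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star

row_norms_sq = np.einsum('ij,ij->i', A, A)
probs = row_norms_sq / row_norms_sq.sum()
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]
rate = 1.0 - sigma_min**2 / row_norms_sq.sum()   # 1 - 1/kappa_F^2

iters, trials = 300, 50
mean_sq_err = 0.0
for _ in range(trials):
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.choice(m, p=probs)
        x = x + (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    mean_sq_err += np.linalg.norm(x - x_star) ** 2 / trials

bound = rate ** iters * np.linalg.norm(x_star) ** 2
print(mean_sq_err, bound)  # the empirical mean squared error typically sits well below the bound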
Insights and Equivalences
The randomized Kaczmarz method admits several theoretical equivalences that provide deeper insights into its behavior and unify it with broader classes of iterative algorithms. One key perspective views the uniform selection variant as a form of stochastic gradient descent (SGD) on the least-squares objective function \|Ax - b\|_2^2 / 2, where a row a_i of A is selected uniformly at random, and the update step employs a data-dependent step size of 1/\|a_i\|_2^2.[17] This SGD interpretation, established by connecting randomized projections to mini-batch gradient steps, reveals the method's reliance on row-wise residuals as unbiased stochastic gradients and explains its empirical efficiency in high-dimensional settings. The weighted variant discussed here can be seen as an importance-sampled version of this SGD, achieving improved convergence rates.[17]
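This equivalence can be checked directly: for the single-row objective f_i(x) = \tfrac{1}{2}(a_i^T x - b_i)^2, an SGD step with step size 1/\|a_i\|_2^2 coincides with the Kaczmarz projection. The snippet below is an illustrative verification:
import numpy as np

rng = np.random.default_rng(2)
n = 8
a_i = rng.standard_normal(n)
b_i = rng.standard_normal()
x = rng.standard_normal(n)

# Kaczmarz projection onto the hyperplane a_i^T x = b_i
x_kaczmarz = x + (b_i - a_i @ x) / (a_i @ a_i) * a_i

# SGD step on f_i(x) = 0.5 * (a_i^T x - b_i)^2 with step size 1 / ||a_i||^2
grad = (a_i @ x - b_i) * a_i
x_sgd = x - grad / (a_i @ a_i)

print(np.allclose(x_kaczmarz, x_sgd))  # True: the two updates coincide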
Further equivalences arise within the sketch-and-project framework, where randomized Kaczmarz corresponds to sketching the rows of A with a random single-row subspace and then projecting onto the solution set of the sketched system.[18] Equivalently, it can be seen as a constrain-and-approximate procedure: first projecting the current iterate onto a randomly selected constraint hyperplane defined by the i-th equation, followed by an approximation step that minimizes the distance to the original point within that constraint.[18] From a random update viewpoint, the iteration aligns directly with the SGD formulation above, emphasizing the role of randomized coordinate selection in variance reduction.[18] Additionally, as a random fixed-point iteration, the update takes the form x^{k+1} = P_{i_k}(x^k), where P_i denotes the orthogonal projection operator onto the i-th hyperplane \{x : a_i^\top x = b_i\}, and i_k is drawn randomly; this casts convergence as approaching the fixed point of the expected projection operator.[18]
These perspectives extend to connections with expectation maximization (EM) algorithms, particularly in regularized variants for ill-posed inverse problems, where randomized Kaczmarz updates mimic EM steps by iteratively maximizing likelihoods over random subsets of constraints.[19] The fixed-point and SGD views, in turn, link randomized Kaczmarz to proximal methods and expectation-based iterations in optimization.[18] Such unifications facilitate comparisons across iterative solvers and motivate accelerations, as variance reduction techniques from SGD—such as importance sampling—directly improve convergence by reducing iteration-to-iteration fluctuations.[17]
Building on these insights, Gower and Richtárik (2015) generalized the method to allow arbitrary sketching measures, including weighted row selection probabilities p_i proportional to \|a_i\|_2^2, enabling tailored convergence rates that adapt to matrix conditioning and further bridging randomized Kaczmarz with weighted stochastic approximations in machine learning.[18] This weighting enhances the method's robustness, as seen in earlier analyses where probabilities scaled by squared row norms yield exponential expected convergence under minimal assumptions. Overall, these equivalences underscore randomized Kaczmarz's versatility, positioning it as a foundational primitive for randomized linear algebra and iterative optimization.[18]
Advanced Generalizations
Gower-Richtarik Framework
The Gower-Richtárik framework, introduced in 2015, provides a unified generalization of randomized iterative methods, including the randomized Kaczmarz method, to abstract sketch-and-project schemes for solving linear systems of the form Ax = b, where A \in \mathbb{R}^{m \times n} and B \succ 0 is a symmetric positive definite matrix defining the metric for projections.[20] This framework abstracts the projection step by incorporating random sketching matrices S_k drawn from a distribution \mathcal{D}, allowing for broader applicability beyond row selections.[20]
The core algorithm, often referred to as sketch-and-project, operates as follows: at each iteration k, select a random sketch S_k \sim \mathcal{D}; solve the subproblem \arg\min_z \|z - x^k\|_B subject to S_k^T A z = S_k^T b; and set x^{k+1} equal to the minimizer z.[20] For the classical randomized Kaczmarz method, this recovers the standard form by setting B = I_n (the identity matrix) and S_k = e_i for a randomly chosen standard basis vector e_i (corresponding to selecting row i of A), yielding the orthogonal projection onto the selected hyperplane.[20]
Under this framework, the method exhibits expected convergence in the B-norm: \mathbb{E}[\|x^{k+1} - x^*\|_B^2] \leq (1 - \mu_k) \|x^k - x^*\|_B^2, where x^* is the solution and \mu_k > 0 depends on the expected properties of the sketches, such as the trace of a certain projection matrix involving A, B, and the distribution \mathcal{D}.[20] For consistent systems (b \in \operatorname{range}(A)), the framework guarantees linear convergence to the solution with rate 1 - \lambda_{\min}(B^{-1} \mathbb{E}[Z]), where Z = A^T S (S^T A B^{-1} A^T S)^\dagger S^T A. In the randomized Kaczmarz case, this specializes to 1 - \lambda_{\min}(A^T A) / \|A\|_F^2 = 1 - 1/\kappa^2, where \kappa = \|A\|_F / \sigma_{\min}(A) is the scaled condition number.[20]
The flexibility of the Gower-Richtárik framework enables extensions to weighted projections (via choice of B), block-coordinate updates (through structured sketches S_k), and even nonlinear constraints by adapting the sketching distribution \mathcal{D}, making it a foundational tool for modern randomized optimization.[20]
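A minimal sketch of one sketch-and-project step in the Euclidean case B = I (illustrative code, not the authors' implementation): the subproblem is solved as a least-norm correction, and a one-hot sketch recovers the classical Kaczmarz projection.
import numpy as np

def sketch_and_project_step(A, b, x, S):
    """One sketch-and-project step with B = I: min ||z - x|| subject to S A z = S b."""
    SA = S @ A
    r = S @ b - SA @ x
    return x + np.linalg.pinv(SA) @ r   # least-norm correction satisfying the sketched equations

rng = np.random.default_rng(3)
m, n = 30, 10
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)
x = rng.standard_normal(n)

# A single-row ("one-hot") sketch reproduces the classical Kaczmarz update for row i
i = 4
S = np.zeros((1, m))
S[0, i] = 1.0
x_sp = sketch_and_project_step(A, b, x, S)
x_kz = x + (b[i] - A[i] @ x) / (A[i] @ A[i]) * A[i]
print(np.allclose(x_sp, x_kz))  # True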
Equivalent Perspectives
The Gower-Richtarik method for solving linear systems admits six equivalent formulations, each providing a distinct perspective that unifies various iterative solvers under a common framework.[18] These viewpoints—sketch-and-project, constrain-and-approximate, random intersect, random linear solve, random update, and random fixed-point—all generate the identical update rule for the iterate, as established by Theorem 2.1 in the original analysis, which demonstrates their equivalence through operator formulations and dual optimizations.[18]
The sketch-and-project view interprets the method as projecting the current iterate onto the solution set of a randomly sketched subsystem, minimizing the distance in a Bregman divergence induced by a positive definite matrix B. Specifically, given a linear system Ax = b with A \in \mathbb{R}^{m \times n}, a random sketching matrix S \in \mathbb{R}^{k \times m} (with k < m) yields the update
x^{k+1} = \arg\min_{x \in \mathcal{K}(S)} \|x - x^k\|_B,
where \mathcal{K}(S) = \{x : S A x = S b\} is the sketched solution space. This perspective highlights connections to randomized dimensionality reduction techniques.[18]
In the constrain-and-approximate formulation, the update minimizes the distance to the true solution while approximately satisfying a relaxed constraint. It solves
x^{k+1} = \arg\min_x \|x - x^*\|_B \quad \text{subject to} \quad x = x^k + B^{-1} A^T S y
for some y, which approximates the full system's constraints by restricting to an affine subspace parallel to the sketched directions. This view emphasizes constrained optimization duality.[18]
The random intersect perspective views the iteration as finding the unique intersection point of two randomly chosen affine subspaces: one passing through the current iterate orthogonal to the sketched rows, and another containing the solution. Equivalence arises from the geometry of projections onto these spaces.[18]
Under the random linear solve interpretation, each step solves a smaller random linear subsystem S A x = S b and uses its solution to update via a pseudoinverse, reducing to
x^{k+1} = x^k - B^{-1} A^T S^T (S A B^{-1} A^T S^T)^{-1} (S A x^k - S b).
This framing connects to direct methods for subsystems.[18]
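For B = I, the equivalence between this explicit formula and the sketch-and-project least-norm solve can be verified numerically, using the identity M^+ = M^\top (M M^\top)^+; the check below is illustrative:
import numpy as np

rng = np.random.default_rng(4)
m, n, q = 40, 12, 3                      # q is the sketch size
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)
x = rng.standard_normal(n)
S = rng.standard_normal((q, m))          # random sketching matrix

SA = S @ A
r = SA @ x - S @ b

# Random-linear-solve form (B = I): x - A^T S^T (S A A^T S^T)^{-1} (S A x - S b)
x_rls = x - A.T @ S.T @ np.linalg.solve(SA @ SA.T, r)

# Sketch-and-project form: subtract the least-norm correction via the pseudoinverse of S A
x_sp = x - np.linalg.pinv(SA) @ r

print(np.allclose(x_rls, x_sp))  # True whenever S A has full row rank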
The random update viewpoint treats the method as a stochastic gradient-like step in a randomized direction, with the update expressed as x^{k+1} = x^k + \alpha^k d^k, where d^k is a random search direction derived from the residual and B-norm. It aligns with stochastic optimization by interpreting the step as minimizing a local quadratic model.[18]
Finally, the random fixed-point formulation casts the iteration as a contraction mapping in the B-norm toward the solution, with the operator T_S(x) = x - B^{-1} A^T S^T (S A B^{-1} A^T S^T)^{-1} (S A x - S b) satisfying \|T_S(x) - x^*\|_B \leq \rho_S \|x - x^*\|_B for some contraction factor \rho_S < 1. This enables fixed-point theory for convergence analysis.[18]
These equivalences are proven by showing that each formulation derives the same explicit update expression through algebraic manipulations, often leveraging the Woodbury matrix identity or properties of Moore-Penrose pseudoinverses to align the projections and solves.[18] The multiplicity of views facilitates diverse convergence proofs, such as those using fixed-point contractions or stochastic optimization duality, allowing tailored analyses for different sketching distributions.[18]
In the context of the classical Kaczmarz method, which corresponds to uniform row sampling (S selects a single row of A) and B = I, each of these perspectives recovers the standard projection onto the hyperplane defined by the selected equation a_i^T x = b_i, yielding the update x^{k+1} = x^k - \frac{a_i^T x^k - b_i}{\|a_i\|^2} a_i.[18] This unification underscores how the Gower-Richtarik framework generalizes Kaczmarz while preserving its geometric projection essence across all views.[18]
Special Cases
One prominent special case within the Gower-Richtárik framework arises when the matrix A is square and the sketching operator selects single coordinates, which recovers the randomized coordinate descent (CD) method applied to the least-squares objective f(x) = \frac{1}{2} \|Ax - b\|^2.[18] In this setup, the update projects onto the subspace defined by a randomly chosen coordinate direction, equivalent to minimizing f(x) along that axis, with probabilities uniform over the n coordinates.[18]
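A minimal illustration of this coordinate descent specialization (an illustrative sketch, assuming uniform coordinate sampling): each step exactly minimizes f(x) = \tfrac{1}{2}\|Ax - b\|^2 along one randomly chosen coordinate.
import numpy as np

def randomized_cd_least_squares(A, b, num_iters=5000, seed=5):
    """Randomized CD on f(x) = 0.5*||Ax - b||^2 with exact minimization along each coordinate."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    col_norms_sq = np.einsum('ij,ij->j', A, A)   # ||A[:, j]||^2 for every column
    residual = A @ x - b
    for _ in range(num_iters):
        j = rng.integers(n)                      # uniform coordinate selection
        step = (A[:, j] @ residual) / col_norms_sq[j]
        x[j] -= step
        residual -= step * A[:, j]               # keep the residual consistent with x
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((60, 15))
b = A @ rng.standard_normal(15)
print(np.linalg.norm(A @ randomized_cd_least_squares(A, b) - b))  # near zero for this consistent system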
Another specialization links the framework to the expectation-maximization (EM) algorithm in statistical models with missing data. The randomized Kaczmarz method, interpreted as a sketch-and-project update, aligns with EM iterations for imputing missing observations in linear models under Poisson likelihoods or Gaussian noise, particularly in applications like tomography where subsets of equations represent partial data.[19] This connection arises because EM's E-step estimates missing values via conditional expectations, while the M-step maximizes the complete-data likelihood, mirroring the cyclic projections in Kaczmarz-type methods for ill-posed systems.[19]
Further cases include weighted variants of the Kaczmarz method, where row selection probabilities are set to p_i = \|a_i\|^2 / \|A\|_F^2, achieving optimal convergence rates by emphasizing rows with larger norms. Block versions emerge as multi-row sketches, where the sketching matrix selects a random subset of rows, leading to updates that solve smaller least-squares subproblems over those blocks.[18]
Regarding convergence, the coordinate descent case satisfies \mathbb{E}[\|x_{k+1} - x^*\|_{A^\top A}^2] \leq (1 - \lambda_{\min}(A^\top A)/\|A\|_F^2) \|x_k - x^*\|_{A^\top A}^2, a contraction factor never worse than 1 - 1/(n \kappa^2), where \kappa is the condition number of A and n is the dimension.[18] For EM interpretations, convergence relies on probabilistic models, ensuring monotonicity in the likelihood under suitable regularity conditions for missing data setups.[19]
These specializations explain the acceleration of Kaczmarz methods in systems with sparse or well-conditioned rows, as the effective sketching dimension increases, tightening the convergence factor \rho by better capturing the solution subspace relative to the trace of the metric matrix.[18]
Extensions and Applications
Block and Adaptive Methods
The block Kaczmarz method extends the classical approach by selecting multiple rows simultaneously in each iteration, projecting the current iterate onto the intersection of the corresponding hyperplanes to accelerate convergence for large-scale systems. In this variant, a block of \tau rows, with submatrix A_B and corresponding right-hand side b_B, is chosen, and the update is given by
x^{k+1} = \arg\min_{z} \|z - x^k\|_2^2 \quad \text{subject to} \quad A_B z = b_B,
which simplifies to x^{k+1} = x^k + A_B^+ (b_B - A_B x^k), where A_B^+ is the Moore-Penrose pseudoinverse of A_B.[21] This formulation allows for parallel computation within blocks, making it suitable for distributed systems, though the choice of block size \tau influences the trade-off between computational cost and progress per step.[21]
Block selection strategies vary between random and greedy approaches to balance exploration and exploitation of residuals. In the randomized block Kaczmarz method, blocks are selected uniformly at random or with probabilities proportional to the squared norms of block residuals, enabling expected linear convergence rates that depend on the minimum singular value of A and the block partitioning overlap.[21] For greedy selection, the block with the largest aggregate residual norm, \|A_B x^k - b_B\|_2, is chosen deterministically, which empirically speeds up convergence for ill-conditioned systems by prioritizing informative constraints.[22]
The following pseudocode outlines a generic block Kaczmarz iteration with greedy selection:
Input: A ∈ ℝ^{m×n}, b ∈ ℝ^m, x^0 ∈ ℝ^n, block size τ
For k = 0, 1, 2, ...
    Select block indices B_k ⊂ {1,...,m} with |B_k| = τ maximizing ||A_{B_k} x^k - b_{B_k}||₂
    Compute r_{B_k} = b_{B_k} - A_{B_k} x^k
    x^{k+1} = x^k + A_{B_k}^+ r_{B_k}
Output: x^K after K iterations
This greedy variant avoids pseudoinverse computation in some implementations by using row averaging or relaxation, reducing overhead for sparse matrices.[22]
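A schematic NumPy version of the greedy block iteration (illustrative; here the block is formed from the rows with the largest individual residuals, one common way to instantiate the greedy rule, and the pseudoinverse is applied via a least-squares solve):
import numpy as np

def greedy_block_kaczmarz(A, b, block_size=10, num_iters=300):
    """Greedy block Kaczmarz: project onto the block of rows with the largest residuals."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(num_iters):
        r = b - A @ x
        block = np.argsort(np.abs(r))[-block_size:]   # indices of the most-violated equations
        # x <- x + A_B^+ r_B, computed as the minimum-norm least-squares correction
        dx, *_ = np.linalg.lstsq(A[block], r[block], rcond=None)
        x += dx
    return x

rng = np.random.default_rng(6)
A = rng.standard_normal((100, 20))
b = A @ rng.standard_normal(20)
print(np.linalg.norm(A @ greedy_block_kaczmarz(A, b) - b))  # residual driven to a small value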
Adaptive deterministic variants further refine this by dynamically adjusting block size or selection order based on current residual norms, incorporating momentum to dampen oscillations and enhance progress. For instance, the momentum-augmented adaptive deterministic block Kaczmarz (mADBK) method selects blocks via a residual-norm criterion and updates with a heavy-ball step: x^{k+1} = x^k + \alpha (A_{B_k}^+ r_{B_k}) + \beta (x^k - x^{k-1}), where \alpha and \beta are step and momentum parameters tuned adaptively.[23] Such adaptations ensure linear convergence to the least-norm solution for consistent systems without fixed partitioning, with rates improved by up to 20-30% over static greedy blocks in numerical tests on large sparse matrices.[23]
Convergence analysis for block methods yields sublinear rates for cyclic deterministic selections but linear rates in expectation for randomized versions, with the contraction factor 1 - c / \kappa^2 where c scales with block overlap and \kappa is the condition number; larger blocks reduce iterations but increase per-step cost.[21] These rates outperform single-row Kaczmarz (the case \tau = 1) when rows are correlated, as blocks capture joint constraints more efficiently.[22] In the extreme case of full blocks (\tau = m), the method solves the system in one iteration by computing the projection onto the full solution space, equivalent to solving the normal equations A^T A x = A^T b for consistent overdetermined systems.[21]
Bregman and Weighted Variants
The Bregman-Kaczmarz method extends the classical Kaczmarz algorithm by replacing Euclidean projections with projections defined via Bregman divergences, enabling the handling of non-Euclidean geometries in constrained optimization problems.[24] The Bregman divergence associated with a strongly convex differentiable function F is given by
D_F(x, y) = F(x) - F(y) - \langle \nabla F(y), x - y \rangle,
which generalizes the squared Euclidean distance when F(x) = \frac{1}{2} \|x\|^2.[24] This formulation is particularly useful for problems involving probability distributions, where the Kullback-Leibler (KL) divergence arises by choosing F(x) = \sum_i x_i \log x_i, allowing projections onto simplices or other non-Euclidean constraint sets while preserving sparsity or positivity.[24]
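For the negative-entropy choice F(x) = \sum_i x_i \log x_i mentioned above, the divergence can be written out explicitly; the following short derivation fills in the intermediate step. Since \nabla F(y)_i = \log y_i + 1,
\begin{equation}
D_F(x, y) = \sum_i x_i \log x_i - \sum_i y_i \log y_i - \sum_i (\log y_i + 1)(x_i - y_i) = \sum_i x_i \log \frac{x_i}{y_i} - \sum_i x_i + \sum_i y_i,
\end{equation}
which is the generalized Kullback-Leibler divergence and reduces to the KL divergence \sum_i x_i \log (x_i / y_i) when x and y both lie on the probability simplex.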
A recent accelerated variant, the block accelerated randomized Bregman-Kaczmarz (ARBK) method from 2024, employs block-randomized updates that minimize the Bregman divergence D_F(z, x^k) subject to a subset of linear constraints, incorporating momentum through interpolation parameters and a restart mechanism to stabilize iterations.[25] In each step, a random block of rows is selected with probabilities proportional to their Lipschitz constants, and the update is computed in the primal space for efficiency, making it suitable for large-scale linearly constrained minimization of strongly convex functions.[25] This approach builds on the Gower-Richtárik framework of weighted randomized iterations, which provides the foundation for incorporating row norms into the selection probabilities.
Complementing these divergence-based extensions, the weighted average fast block Kaczmarz (WAFBK) method, proposed in 2024, addresses overdetermined linear systems by integrating greedy row selection—based on residual distances—with weighted averages of multiple projections to accelerate convergence without computing pseudoinverses.[26] The algorithm selects rows using one of four weighting schemes (uniform, non-uniform, residual-based, or distance-based) and updates via a weighted combination that balances speed and robustness, outperforming prior greedy block methods in numerical tests on large consistent systems.[26]
Convergence analyses for these variants demonstrate significant improvements over standard rates. The ARBK method achieves accelerated sublinear convergence of O(1/k^2) in Bregman distance for general cases and linear rates under the Polyak-Łojasiewicz condition and regularity assumptions on the blocks, with the restart scheme ensuring last-iterate guarantees.[25] Similarly, WAFBK exhibits linear convergence to the minimum-norm solution, with error bounds scaling as \|x_{k+1} - x^*\|_2^2 \leq (1 - \theta \frac{\sigma_{\min}^2(A)}{m \|A\|_F^2} \frac{\|A_{\mathcal{I}_k}\|_F^2}{\sigma_{\max}^2(A_{\mathcal{I}_k})}) \|x_k - x^*\|_2^2 for uniform weighting, where \theta captures the greedy selection efficacy.[26] These results hold in smooth interpolation regimes, where the constraints tightly interpolate the objective.
Recent developments in 2025 further advance adaptive block variants within row-action frameworks, emphasizing applications in machine learning such as linear regression on infinite-dimensional spaces and non-orthogonal expansions.[27] Surveys of row-action methods highlight how these adaptive Bregman and weighted extensions enhance scalability for ML tasks like sparse recovery and tensor regression, providing explicit regret bounds under noisy data.[27]
Practical Applications
The Kaczmarz method, particularly through its implementation as the Algebraic Reconstruction Technique (ART), plays a central role in computed tomography (CT) imaging by iteratively solving systems of linear equations derived from sparse X-ray projections to reconstruct cross-sectional images.[28] ART initializes a crude reconstruction and applies data-driven relaxation parameters to enhance convergence, enabling high-quality images from limited projections in compressed sensing frameworks, which reduces radiation dose in medical scans.[28] In magnetic resonance imaging (MRI), a phase-constrained variant of Kaczmarz's ART processes oversampled k-space data to mitigate phase errors and Gibbs ringing, achieving accelerated scans (e.g., 2x undersampling) without explicit regularization while preserving image fidelity for Cartesian and spiral trajectories.[29]
In signal processing, the randomized Kaczmarz method addresses underdetermined systems in compressed sensing by enabling fast sparse signal recovery through iterative projections onto randomly selected hyperplanes.[30] Variants like Kaczmarz-based Iterative Hard Thresholding (KZIHT) combine projection steps with thresholding to achieve linear convergence for bounded orthonormal and sub-Gaussian measurement matrices, recovering sparse signals with optimal statistical bias under minimal sampling requirements (e.g., m ≥ C s log⁴ N).[30] The sparse randomized Kaczmarz algorithm further supports recovery in multiple measurement vector problems with corruptions, using binning for robust support estimation in streaming settings, as demonstrated in applications like hyperspectral tomography.[31]
Within machine learning, the Kaczmarz method serves as a stochastic optimizer for least-squares regression, equivalent to stochastic gradient descent (SGD) with stepwise-optimal stepsizes in the interpolation regime of overparameterized models.[32] Recent analyses link it to continual learning, where block Kaczmarz iterations model sequential task adaptation, yielding universal forgetting rates of O(1/√k) for random orderings independent of dimensionality, improving upon prior SGD bounds.[33] In overparameterized settings, such as neural tangent kernels, last-iterate convergence reaches O(1/√T) with η = 1/β, explaining efficient training in high-dimensional linear models without variance reduction.[32]
Beyond these domains, the method applies to inverse problems in geophysics, such as seismic imaging for volcano monitoring, where distributed randomized Kaczmarz solves large-scale sparse least-squares systems from travel-time data across sensor networks, achieving accuracy comparable to centralized solvers with reduced communication overhead.[34] Parallel implementations on graphics processing units (GPUs) accelerate Kaczmarz for large-scale dense systems (e.g., n up to 15,000), outperforming CPU versions by factors of nearly 2x per iteration through vectorized dot products, though benefits diminish for sparse matrices due to data transfer costs.[35]
Empirical studies highlight performance advantages in high-dimensional settings, where randomized Kaczmarz variants converge faster than conjugate gradient methods for consistent overdetermined sparse systems, often reducing computation time significantly in tomography and imaging benchmarks.[6]