Conjugate gradient method
The conjugate gradient method is an iterative algorithm for numerically solving systems of linear equations Ax = b, where A is an n \times n symmetric positive-definite matrix.[1] Developed by Magnus R. Hestenes and Eduard Stiefel in 1952, it constructs a sequence of A-conjugate search directions—vectors p_k satisfying p_i^T A p_j = 0 for i \neq j—to minimize the associated quadratic functional \frac{1}{2} x^T A x - b^T x, whose negative gradient is the residual r = b - Ax.[2] In exact arithmetic, the method converges to the unique solution in at most n iterations: the successive residuals are mutually orthogonal and span a growing Krylov subspace until the error is eliminated.[3] The approach excels on large, sparse systems arising in scientific computing, requiring only the storage of a few vectors and one matrix-vector product per iteration, which makes it more memory-efficient than direct methods such as Gaussian elimination for high-dimensional problems.[3] Key advantages include its robustness to ill-conditioning when preconditioned—using a symmetric positive-definite matrix M that approximates A^{-1} to accelerate convergence—and its theoretical error bound, in which the A-norm of the error decreases per iteration by a factor that depends on the condition number \kappa(A).[1] Extensions, such as nonlinear conjugate gradient variants, apply the idea to unconstrained optimization by adapting the direction update to non-quadratic objectives.[4] Notable applications span partial differential equations (PDEs) discretized via finite elements or finite differences, where a sparse A models physical phenomena such as heat diffusion or structural mechanics; least-squares problems in machine learning and optimization; and preconditioned solvers in computational fluid dynamics.[5] Although finite-precision effects can slow convergence beyond the theoretical n steps, preconditioning techniques such as incomplete Cholesky factorization enhance practical performance, establishing the method as a cornerstone of iterative linear algebra.[3]
Mathematical Background
Quadratic Optimization Problem
The conjugate gradient method addresses the problem of minimizing a quadratic function of the form
f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x} + c,
where A is an n \times n symmetric positive definite matrix, \mathbf{b} \in \mathbb{R}^n, c \in \mathbb{R}, and \mathbf{x} \in \mathbb{R}^n.[2] This formulation assumes A is symmetric, so that the gradient of the quadratic form takes the simple form A\mathbf{x} - \mathbf{b}, and positive definite, which guarantees a unique global minimum.[2] The minimizer \mathbf{x}^* of f(\mathbf{x}) satisfies the equivalent linear system
A \mathbf{x}^* = \mathbf{b}.
This equivalence follows from setting the gradient \nabla f(\mathbf{x}) = A \mathbf{x} - \mathbf{b} = \mathbf{0}, which is exactly the linear system above.[2] The method operates in the Euclidean space \mathbb{R}^n equipped with the standard inner product \langle \mathbf{p}, \mathbf{q} \rangle = \mathbf{p}^T \mathbf{q}.[2] Additionally, an A-inner product is defined as \langle \mathbf{p}, \mathbf{q} \rangle_A = \mathbf{p}^T A \mathbf{q}, which induces a geometry aligned with the quadratic form and plays a key role in the method's convergence properties.[2] The analysis of the conjugate gradient method initially assumes exact arithmetic, with no rounding errors, under which the algorithm terminates in at most n iterations.[2] This idealization establishes the theoretical foundation, though practical implementations must account for finite-precision effects.[2]
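The equivalence between the minimization problem and the linear system can be checked numerically. The following is a minimal MATLAB sketch; the 3×3 matrix, right-hand side, and constant c are arbitrary illustrative choices, not taken from the article:

% Check that the minimizer of f(x) = 0.5*x'*A*x - b'*x + c solves A*x = b,
% i.e. that the gradient A*x - b vanishes at the minimizer.
A = [4 1 0; 1 3 1; 0 1 2];       % small symmetric positive definite test matrix
b = [1; 2; 3];
c = 0;                           % constant offset, irrelevant to the minimizer
f = @(x) 0.5*x'*A*x - b'*x + c;  % quadratic objective
x_star = A \ b;                  % direct solve, used here only as a reference
grad_at_min = A*x_star - b;      % gradient of f at x_star, should be ~0
fprintf('||grad f(x*)|| = %.2e\n', norm(grad_at_min));
% Any perturbation increases f, consistent with x* being the global minimum.
d = randn(3, 1);
fprintf('f(x*) = %.6f, f(x* + d) = %.6f\n', f(x_star), f(x_star + d));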
Conjugate Directions and Vectors
In the conjugate gradient method, the core idea revolves around directions that are conjugate with respect to a symmetric positive definite matrix A. Two nonzero vectors p and q are A-conjugate if p^T A q = 0, and a set of vectors \{p_i\}_{i=0}^{n-1} is mutually A-conjugate if p_i^T A p_j = 0 for all i \neq j. This conjugacy condition generalizes orthogonality from the standard Euclidean inner product to the A-weighted inner product \langle u, v \rangle_A = u^T A v, providing a framework for efficient exploration of the quadratic optimization landscape.[2]
A fundamental property of mutually A-conjugate directions is that any n such nonzero directions are linearly independent and therefore form a basis for \mathbb{R}^n. In the context of minimizing the quadratic function f(x) = \frac{1}{2} x^T A x - b^T x, starting from an initial point, exact minimization can be achieved in at most n line searches along these directions. Each step minimizes f over the affine subspace spanned by the initial point and the conjugate directions generated up to that iteration, so no effort is wasted revisiting previously optimized subspaces. This spanning property exploits the geometry of the ellipsoidal level sets defined by A, allowing the method to reach the global minimum.[2]
During the conjugate gradient iterations, the residual vectors r_k = b - A x_k are mutually orthogonal, satisfying r_i^T r_j = 0 for i \neq j, and the error vectors e_k = x^* - x_k are A-orthogonal to all earlier search directions, satisfying p_i^T A e_k = 0 for i < k. These properties, alongside the conjugacy of the search directions, ensure efficient subspace minimization and rapid error reduction. The term "conjugate gradient" originates from this interplay between conjugate directions and the gradients (residuals) in the optimization process, as coined by Hestenes and Stiefel in their seminal 1952 paper.[2][3]
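These properties can be observed directly by running the recurrences of the method on a small test problem and inspecting the Gram matrices of the directions and residuals. The following MATLAB sketch uses an arbitrary random symmetric positive definite matrix, chosen purely for illustration:

% Run the conjugate gradient recurrences, store all directions and residuals,
% and check the conjugacy and orthogonality properties stated above.
n = 5;
B = randn(n); A = B'*B + n*eye(n);    % random symmetric positive definite matrix
b = randn(n, 1);
x = zeros(n, 1); r = b - A*x; p = r;
P = []; R = r;                         % P collects directions, R collects residuals
for k = 1:n
    P = [P, p];
    Ap = A*p;
    alpha = (r'*r) / (p'*Ap);
    x = x + alpha*p;
    rnew = r - alpha*Ap;
    beta = (rnew'*rnew) / (r'*r);
    p = rnew + beta*p;
    r = rnew;
    R = [R, r];
end
% Off-diagonal entries of P'*A*P and R'*R should be at roundoff level.
M = P'*A*P;  G = R(:,1:n)'*R(:,1:n);
fprintf('max off-diagonal of P''*A*P: %.2e\n', max(max(abs(M - diag(diag(M))))));
fprintf('max off-diagonal of R''*R:   %.2e\n', max(max(abs(G - diag(diag(G))))));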
Derivation of the Method
Direct Method Perspective
The conjugate gradient method can be viewed as a direct solver for systems of linear equations Ax = b, where A is an n \times n symmetric positive definite (SPD) matrix, guaranteeing the exact solution in at most n steps under exact arithmetic. This perspective treats the method as a means of expanding the solution in a basis of A-conjugate directions, which form an orthogonal set with respect to the inner product \langle \mathbf{u}, \mathbf{v} \rangle_A = \mathbf{u}^T A \mathbf{v}. Starting from an initial guess \mathbf{x}_0, the residual is \mathbf{r}_0 = \mathbf{b} - A \mathbf{x}_0, and the search directions \mathbf{p}_k are generated iteratively such that \mathbf{p}_i^T A \mathbf{p}_j = 0 for i \neq j. The exact solution is then expressed as \mathbf{x}^* = \mathbf{x}_0 + \sum_{k=0}^{n-1} \alpha_k \mathbf{p}_k, where each coefficient \alpha_k is chosen to minimize the quadratic functional f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x} along its direction, yielding \alpha_k = \frac{\mathbf{r}_k^T \mathbf{r}_k}{\mathbf{p}_k^T A \mathbf{p}_k}. This step-wise minimization ensures that each update \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k reduces the residual via \mathbf{r}_{k+1} = \mathbf{r}_k - \alpha_k A \mathbf{p}_k, with the residuals remaining mutually orthogonal: \mathbf{r}_i^T \mathbf{r}_j = 0 for i \neq j.[2]
The finite termination in at most n steps follows from the dimensionality of the space. Since A is SPD, the A-inner product is a genuine inner product, so the nonzero conjugate directions \{\mathbf{p}_0, \dots, \mathbf{p}_{n-1}\} are automatically linearly independent and form a basis of \mathbb{R}^n in which the initial error \mathbf{e}_0 = \mathbf{x}^* - \mathbf{x}_0 can be expanded. Moreover, if none of the first n residuals vanishes, the mutually orthogonal residuals \{\mathbf{r}_0, \dots, \mathbf{r}_{n-1}\} span \mathbb{R}^n, and \mathbf{r}_n, being orthogonal to all of them, must be zero; the method therefore solves the system exactly in at most n steps. Termination can occur in fewer than n steps when the initial residual lies in a lower-dimensional invariant subspace of A, for instance when A has few distinct eigenvalues, but the worst-case bound is n.[2][3]
This direct-solver interpretation relates closely to the Gram-Schmidt orthogonalization process applied to the Krylov subspace \mathcal{K}_k(\mathbf{r}_0, A) = \operatorname{span}\{\mathbf{r}_0, A\mathbf{r}_0, \dots, A^{k-1}\mathbf{r}_0\}. The residuals \mathbf{r}_k are orthogonalized versions of the Krylov basis vectors with respect to the Euclidean inner product, while the search directions \mathbf{p}_k are obtained by A-orthogonalizing the successive residuals against the previous directions, mirroring the Gram-Schmidt procedure in the A-inner product. The conjugacy condition ensures that each new direction is A-orthogonal to the span of the prior directions, which coincides with the growing Krylov subspace of dimension k at step k. This connection underscores why the method terminates in at most n steps: the full Krylov subspace \mathcal{K}_n(\mathbf{r}_0, A) equals \mathbb{R}^n for a generic \mathbf{r}_0, for example when A has n distinct eigenvalues.[3][2]
To maintain conjugacy without explicit orthogonalization against all previous directions, the next direction is updated as \mathbf{p}_{k+1} = \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k, where \beta_k is chosen to ensure \mathbf{p}_{k+1}^T A \mathbf{p}_j = 0 for all j \leq k. Imposing this for j = k gives \mathbf{r}_{k+1}^T A \mathbf{p}_k + \beta_k \mathbf{p}_k^T A \mathbf{p}_k = 0.
Since A \mathbf{p}_k = (\mathbf{r}_k - \mathbf{r}_{k+1}) / \alpha_k and the residuals are mutually orthogonal, it follows that \mathbf{r}_{k+1}^T A \mathbf{p}_k = -\mathbf{r}_{k+1}^T \mathbf{r}_{k+1} / \alpha_k and \mathbf{p}_k^T A \mathbf{p}_k = \mathbf{r}_k^T \mathbf{r}_k / \alpha_k, leading to \beta_k = \frac{\mathbf{r}_{k+1}^T \mathbf{r}_{k+1}}{\mathbf{r}_k^T \mathbf{r}_k}. One can further show that this choice makes \mathbf{p}_{k+1} conjugate to all earlier directions, not only to \mathbf{p}_k. The resulting short recurrence avoids the need to store the full basis while still realizing the direct expansion in finitely many steps.[2][3]
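The connection to Gram-Schmidt orthogonalization in the A-inner product can be made concrete in a few lines of code. The following MATLAB sketch uses an arbitrary random SPD matrix (not from the article): it A-orthogonalizes the Krylov basis and verifies that the resulting conjugate-direction expansion reproduces the solution:

% Build A-conjugate directions by Gram-Schmidt in the A-inner product applied
% to the Krylov basis {r0, A*r0, A^2*r0, ...}, then expand the solution.
n = 6;
B = randn(n); A = B'*B + n*eye(n);    % random SPD matrix (illustrative choice)
b = randn(n, 1); x0 = zeros(n, 1);
r0 = b - A*x0;
P = zeros(n, n); v = r0;
for k = 1:n
    p = v;
    for j = 1:k-1                     % A-orthogonalize against earlier directions
        p = p - (P(:,j)'*A*v) / (P(:,j)'*A*P(:,j)) * P(:,j);
    end
    P(:,k) = p;
    v = A*v;                          % next Krylov vector A^k * r0
end
% By A-conjugacy, the expansion x* = x0 + sum_k alpha_k p_k has coefficients
% alpha_k = p_k'*r0 / (p_k'*A*p_k).
alpha = (P'*r0) ./ diag(P'*A*P);
x = x0 + P*alpha;
fprintf('residual of the expansion: %.2e\n', norm(b - A*x));  % should be tiny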
Iterative Method Perspective
The conjugate gradient method serves as an iterative approximation scheme for solving the symmetric positive definite linear system Ax = b, where the goal is to minimize the associated quadratic functional f(x) = \frac{1}{2} x^T A x - b^T x. At step k, the iterate x_k minimizes f over the affine space x_0 + \mathcal{K}_k, where \mathcal{K}_k = \operatorname{span}\{r_0, A r_0, \dots, A^{k-1} r_0\} is the k-th Krylov subspace and r_0 = b - A x_0; equivalently, x_k minimizes the A-norm of the error \|x_k - x^*\|_A over that affine space, and is therefore the best approximation to the exact solution available within it.[2]
The iteration proceeds by advancing along a conjugate search direction p_k, yielding the update x_{k+1} = x_k + \alpha_k p_k, where \alpha_k > 0 is the step length that minimizes f along the line x_k + \alpha p_k. This line search gives \alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}, which enforces the orthogonality p_k^T r_{k+1} = 0 once the residual is updated as r_{k+1} = r_k - \alpha_k A p_k. The resulting residual r_{k+1} lies in \mathcal{K}_{k+2} and is orthogonal to \mathcal{K}_{k+1}, in particular to all previous residuals \{r_0, \dots, r_k\}, a property that maintains the minimization over the expanding Krylov subspaces.[2]
To generate the next direction, p_{k+1} = r_{k+1} + \beta_k p_k with \beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}, which keeps the directions A-conjugate (p_i^T A p_j = 0 for i \neq j) and the residuals mutually orthogonal. This conjugacy, combined with the short recurrences for the residuals and directions, avoids explicit construction of the full Krylov basis and requires only O(n) storage and, beyond a single matrix-vector product, O(n) work per iteration for an n \times n system. The method thus reaches the exact minimizer of f(x) in finitely many steps for the quadratic case, while producing successively better approximations along the way.[2]
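The optimality of each iterate over the corresponding Krylov subspace can be verified numerically. Below is a MATLAB sketch with an arbitrary random SPD test matrix (not from the article) that compares a few CG steps against a brute-force A-norm minimization over an explicit Krylov basis:

% Verify that the k-th CG iterate matches the A-norm-optimal approximation in
% x0 + K_k, obtained here by brute force from an explicit Krylov basis.
n = 8; k = 3;
B = randn(n); A = B'*B + n*eye(n);    % random SPD matrix (illustrative choice)
b = randn(n, 1); x0 = zeros(n, 1);
r0 = b - A*x0;
% k steps of conjugate gradient
x = x0; r = r0; p = r;
for i = 1:k
    Ap = A*p; alpha = (r'*r) / (p'*Ap);
    x = x + alpha*p; rnew = r - alpha*Ap;
    beta = (rnew'*rnew) / (r'*r); p = rnew + beta*p; r = rnew;
end
% Brute-force minimizer of ||x0 + V*c - x*||_A over the explicit Krylov basis V
V = zeros(n, k); v = r0;
for i = 1:k, V(:,i) = v; v = A*v; end
c = (V'*A*V) \ (V'*r0);               % normal equations in the A-inner product
x_opt = x0 + V*c;
fprintf('difference between CG iterate and subspace optimum: %.2e\n', norm(x - x_opt));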
The Algorithm
Core Steps and Updates
The conjugate gradient method is implemented via an initialization phase followed by an iterative loop that computes step lengths, updates the approximate solution and residual, and maintains conjugate search directions. This algorithm, originally proposed by Hestenes and Stiefel, solves the linear system Ax = b, where A is symmetric positive definite, in at most n iterations for an n \times n system in exact arithmetic.[2]
Initialization requires selecting an arbitrary initial approximation x_0. The initial residual vector is then formed as r_0 = b - A x_0, and the initial search direction is set to p_0 = r_0. These choices ensure the method begins along the direction of steepest descent of the associated quadratic objective.[2]
The core iteration, for k = 0, 1, 2, \dots, proceeds as follows. First, compute the step size \alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}. Update the solution approximation and residual recursively: x_{k+1} = x_k + \alpha_k p_k, \quad r_{k+1} = r_k - \alpha_k A p_k. Next, compute the conjugacy parameter \beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}, and update the search direction: p_{k+1} = r_{k+1} + \beta_k p_k. The formulas for \alpha_k and \beta_k arise from the orthogonality of the residuals and the conjugacy of the directions, and they yield a short-recurrence, low-cost iteration.[2]
Due to floating-point arithmetic, the recursive residual update can accumulate roundoff errors, leading to a loss of conjugacy. To mitigate this, the residual should be recomputed explicitly at intervals as r_{k+1} = b - A x_{k+1}, for example after a fixed number of steps or when the computed residuals are no longer nearly orthogonal (r_{k+1}^T r_k noticeably different from zero).[3] Stopping criteria include a tolerance on the residual norm, such as \| r_k \|_2 < \epsilon \| b \|_2 for a relative tolerance \epsilon, together with a maximum iteration count, which in finite precision may need to exceed the problem dimension for ill-conditioned systems.[2]
The full algorithm in pseudocode form is:

Input: A (symmetric positive definite), b, x_0 (initial guess), ε (tolerance), max_iter
Output: x (approximate solution)

x ← x_0
r ← b - A x_0
p ← r
rsold ← r^T r
for k = 0 to max_iter - 1 do
    if √rsold < ε then break end if
    Ap ← A p
    α ← rsold / (p^T Ap)
    x ← x + α p
    r ← r - α Ap
    rsnew ← r^T r
    if √rsnew < ε then break end if
    β ← rsnew / rsold
    p ← r + β p
    rsold ← rsnew
end for
// Optional: Recompute r = b - A x for accuracy

This implementation uses dot products for efficiency and monitors the squared residual norm, avoiding square roots except when checking the stopping criterion.[2][3]
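For concreteness, the pseudocode above translates directly into MATLAB. The sketch below is an illustrative implementation (the function name cg_basic and the recompute interval of 50 steps are arbitrary choices, not prescribed by the sources) that also recomputes the residual explicitly at fixed intervals, as recommended above:

function x = cg_basic(Afun, b, x0, tol, max_iter)
    % Basic conjugate gradient loop following the pseudocode above.
    % Afun is a function handle that returns the matrix-vector product A*v.
    x = x0;
    r = b - Afun(x);
    p = r;
    rsold = r' * r;
    for k = 1:max_iter
        if sqrt(rsold) < tol, break; end
        Ap = Afun(p);
        alpha = rsold / (p' * Ap);
        x = x + alpha * p;
        r = r - alpha * Ap;
        if mod(k, 50) == 0            % recompute the residual explicitly at
            r = b - Afun(x);          % intervals to counter roundoff drift
        end
        rsnew = r' * r;
        if sqrt(rsnew) < tol, break; end
        beta = rsnew / rsold;
        p = r + beta * p;
        rsold = rsnew;
    end
end

For an explicit matrix A, it can be called as x = cg_basic(@(v) A*v, b, zeros(size(b)), 1e-8, numel(b)).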
Practical Considerations and Restarting
In practice, rounding errors in finite-precision arithmetic can accumulate during the iterations of the conjugate gradient method, leading to a gradual loss of orthogonality among the residual vectors and potentially causing the algorithm to stall or converge more slowly than expected. To counteract this, a widely adopted implementation strategy is periodic restarting: every m iterations—typically with m around 20 to 30 for problems of moderate size—the search direction is reset to the current residual, p_k = r_k whenever k is a multiple of m. This simple reset restarts the generation of conjugate directions from the updated residual, restoring approximate orthogonality and enhancing overall numerical stability, though it may slightly slow convergence for ill-conditioned systems.[3]
Another key consideration for numerical robustness is the computation of the parameter \beta_k, which updates the search direction. Theoretically r_{k+1}^T r_k = 0 by orthogonality, but in floating-point arithmetic this dot product is small and nonzero, so expressions for \beta_k that involve it (such as the algebraically equivalent form \beta_k = \frac{r_{k+1}^T (r_{k+1} - r_k)}{r_k^T r_k}) are prone to cancellation errors. Practitioners therefore prefer the expression \beta_k = \frac{\| r_{k+1} \|^2}{\| r_k \|^2}, which relies solely on positive quantities (squared residual norms) and better preserves the method's properties under rounding perturbations.[3]
Regarding efficiency, the conjugate gradient method excels in scenarios with large sparse matrices, as it requires only O(n) storage space for a handful of n-dimensional vectors (such as the iterate, residual, search direction, and the auxiliary product Ap), without needing to store the full matrix A explicitly. The work per iteration is dominated by a single on-the-fly matrix-vector multiplication Ap, which is computationally inexpensive for sparse representations, making the method scalable to systems with millions of unknowns.[2]
The core iterative loop, incorporating the stable \beta_k computation and restarting, can be implemented succinctly in languages like MATLAB or Julia. Below is example code in MATLAB syntax, assuming A is a function handle for sparse matrix-vector multiplication and restart_interval is set to 30:

function [x, k] = cg_restart(A, b, x0, tol, maxit, restart_interval)
    x = x0;
    r = b - A(x);
    rsold = dot(r, r);
    p = r;
    k = 0;
    if sqrt(rsold) < tol, return; end
    while k < maxit
        k = k + 1;
        if mod(k, restart_interval) == 0
            p = r;                     % restart: reset the search direction
            rsold = dot(r, r);
        end
        Ap = A(p);
        alpha = rsold / dot(p, Ap);
        x = x + alpha * p;
        r = r - alpha * Ap;
        rsnew = dot(r, r);
        if sqrt(rsnew) < tol, break; end
        beta = rsnew / rsold;          % stable norm-based beta
        p = r + beta * p;
        rsold = rsnew;
    end
end

This formulation ensures efficient execution while mitigating common numerical pitfalls through the norm-based \beta_k and explicit restarts.[3]
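As a usage illustration (the tridiagonal test matrix and the right-hand side below are arbitrary choices, not from the sources), the function can be called with a sparse matrix wrapped in a function handle:

% Example call of cg_restart with a sparse SPD test matrix.
n = 10000;
e = ones(n, 1);
A = spdiags([-e 4*e -e], -1:1, n, n);   % sparse, diagonally dominant, SPD
x_true = ones(n, 1);
b = A * x_true;                         % right-hand side with known solution
Afun = @(v) A * v;                      % matrix-vector product as a function handle
[x, iters] = cg_restart(Afun, b, zeros(n, 1), 1e-10, n, 30);
fprintf('iterations: %d, error: %.2e\n', iters, norm(x - x_true));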
Numerical Example
To illustrate the conjugate gradient method, consider solving the linear system Ax = b, where A is the symmetric positive definite 3×3 matrix A = \begin{pmatrix} 3 & 2 & 1 \\ 2 & 6 & 2 \\ 1 & 2 & 7 \end{pmatrix}, b = \begin{pmatrix} 2 \\ -8 \\ 2 \end{pmatrix}, and the initial approximation is x_0 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}. The exact solution is x = \begin{pmatrix} 21/11 \\ -24/11 \\ 7/11 \end{pmatrix} \approx \begin{pmatrix} 1.909 \\ -2.182 \\ 0.636 \end{pmatrix}.
The initial residual is r_0 = b - A x_0 = b = \begin{pmatrix} 2 \\ -8 \\ 2 \end{pmatrix}, and the initial search direction is p_0 = r_0, with \|r_0\|^2 = r_0^T r_0 = 72. Compute A p_0 = \begin{pmatrix} -8 \\ -40 \\ 0 \end{pmatrix}, so p_0^T A p_0 = 304 and \alpha_0 = 72 / 304 = 9/38 \approx 0.237. The first update is x_1 = x_0 + \alpha_0 p_0 = (9/38) p_0 = \begin{pmatrix} 9/19 \\ -36/19 \\ 9/19 \end{pmatrix} \approx \begin{pmatrix} 0.474 \\ -1.895 \\ 0.474 \end{pmatrix}, and r_1 = r_0 - \alpha_0 A p_0 = \begin{pmatrix} 74/19 \\ 28/19 \\ 38/19 \end{pmatrix} \approx \begin{pmatrix} 3.895 \\ 1.474 \\ 2.000 \end{pmatrix}, with \|r_1\|^2 = 7704/361 \approx 21.341. The next coefficient is \beta_0 = \|r_1\|^2 / \|r_0\|^2 = 107/361 \approx 0.296, so p_1 = r_1 + \beta_0 p_0 \approx \begin{pmatrix} 4.488 \\ -0.898 \\ 2.593 \end{pmatrix}.
Continuing the iterations yields the updates shown in the table below, where \|r_k\|^2 decreases to zero in exactly 3 steps, as expected for a 3 \times 3 system in exact arithmetic; in floating-point arithmetic the value after the third step is a numerical zero at roundoff level.

| Iteration k | x_k (approximate) | \|r_k\|^2 (approximate) |
|---|---|---|
| 0 | (0.000, 0.000, 0.000) | 72.000 |
| 1 | (0.474, -1.895, 0.474) | 21.341 |
| 2 | (1.343, -2.069, 0.976) | 5.485 |
| 3 | (1.909, -2.182, 0.636) | 0 (numerical zero) |
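The table can be reproduced with a few lines of MATLAB; the following minimal sketch reruns the recurrences of the algorithm on the 3×3 system above and prints the iterates and squared residual norms:

% Reproduce the worked 3x3 example by running the conjugate gradient recurrences.
A = [3 2 1; 2 6 2; 1 2 7];
b = [2; -8; 2];
x = [0; 0; 0];
r = b - A*x;  p = r;  rsold = r'*r;
fprintf('k = 0: x = (%.3f, %.3f, %.3f), ||r||^2 = %.3f\n', x, rsold);
for k = 1:3
    Ap = A*p;
    alpha = rsold / (p'*Ap);
    x = x + alpha*p;
    r = r - alpha*Ap;
    rsnew = r'*r;
    fprintf('k = %d: x = (%.3f, %.3f, %.3f), ||r||^2 = %.3e\n', k, x, rsnew);
    beta = rsnew / rsold;
    p = r + beta*p;
    rsold = rsnew;
end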