Conjugate gradient method
The conjugate gradient method is an iterative algorithm for numerically solving systems of linear equations Ax = b, where A is an n \times n symmetric positive-definite matrix.[1] Developed by Magnus R. Hestenes and Eduard Stiefel in 1952, it constructs a sequence of A-conjugate search directions—vectors p_k satisfying p_i^T A p_j = 0 for i \neq j—to minimize the associated quadratic functional \frac{1}{2} x^T A x - b^T x, whose negative gradient is the residual r = b - Ax.[2] In exact arithmetic, the method converges to the unique solution in at most n iterations: the successive residuals are mutually orthogonal and span a growing Krylov subspace until the error is eliminated.[3] The approach excels on large, sparse systems arising in scientific computing, requiring only the storage of a few vectors and one matrix-vector product per iteration, which makes it more memory-efficient than direct methods such as Gaussian elimination for high-dimensional problems.[3] Key advantages include its robustness to ill-conditioning when preconditioned—using a symmetric positive-definite matrix M that approximates A^{-1} to accelerate convergence—and its theoretical error bound, in which the A-norm of the error decreases per iteration by a factor that depends on the condition number \kappa(A).[1] Extensions, such as nonlinear conjugate gradient variants, apply the idea to unconstrained optimization by adapting the direction update to non-quadratic objectives.[4] Notable applications span partial differential equations (PDEs) discretized via finite elements or finite differences, where a sparse A models physical phenomena such as heat diffusion or structural mechanics; least-squares problems in machine learning and optimization; and preconditioned solvers in computational fluid dynamics.[5] Although finite-precision effects can slow convergence beyond the theoretical n steps, preconditioning techniques such as incomplete Cholesky factorization enhance practical performance, establishing the method as a cornerstone of iterative linear algebra.[3]
Mathematical Background
Quadratic Optimization Problem
The conjugate gradient method addresses the problem of minimizing a quadratic function of the form
f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x} + c,
where A is an n \times n symmetric positive definite matrix, \mathbf{b} \in \mathbb{R}^n, c \in \mathbb{R}, and \mathbf{x} \in \mathbb{R}^n.[2] This formulation assumes A is symmetric, so that the gradient of the quadratic form takes the simple form A\mathbf{x} - \mathbf{b}, and positive definite, which guarantees a unique global minimum.[2] The minimizer \mathbf{x}^* of f(\mathbf{x}) satisfies the equivalent linear system
A \mathbf{x}^* = \mathbf{b}.
This equivalence follows from setting the gradient \nabla f(\mathbf{x}) = A \mathbf{x} - \mathbf{b} = \mathbf{0}, which is exactly the linear system above.[2] The method operates in the Euclidean space \mathbb{R}^n equipped with the standard inner product \langle \mathbf{p}, \mathbf{q} \rangle = \mathbf{p}^T \mathbf{q}.[2] Additionally, an A-inner product is defined as \langle \mathbf{p}, \mathbf{q} \rangle_A = \mathbf{p}^T A \mathbf{q}, which induces a geometry aligned with the quadratic form and plays a key role in the method's convergence properties.[2] The analysis of the conjugate gradient method initially assumes exact arithmetic, with no rounding errors, under which the algorithm terminates in at most n iterations.[2] This idealization establishes the theoretical foundation, though practical implementations must account for finite-precision effects.[2]
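The equivalence between the minimization problem and the linear system can be checked numerically. The following is a minimal MATLAB sketch; the 3×3 matrix, right-hand side, and constant c are arbitrary illustrative choices, not taken from the article:

% Check that the minimizer of f(x) = 0.5*x'*A*x - b'*x + c solves A*x = b,
% i.e. that the gradient A*x - b vanishes at the minimizer.
A = [4 1 0; 1 3 1; 0 1 2];       % small symmetric positive definite test matrix
b = [1; 2; 3];
c = 0;                           % constant offset, irrelevant to the minimizer
f = @(x) 0.5*x'*A*x - b'*x + c;  % quadratic objective
x_star = A \ b;                  % direct solve, used here only as a reference
grad_at_min = A*x_star - b;      % gradient of f at x_star, should be ~0
fprintf('||grad f(x*)|| = %.2e\n', norm(grad_at_min));
% Any perturbation increases f, consistent with x* being the global minimum.
d = randn(3, 1);
fprintf('f(x*) = %.6f, f(x* + d) = %.6f\n', f(x_star), f(x_star + d));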
Conjugate Directions and Vectors
In the conjugate gradient method, the core idea revolves around directions that are conjugate with respect to a symmetric positive definite matrix A. Two nonzero vectors p and q are A-conjugate if p^T A q = 0, and a set of vectors \{p_i\}_{i=0}^{n-1} is mutually A-conjugate if p_i^T A p_j = 0 for all i \neq j. This conjugacy condition generalizes orthogonality from the standard Euclidean inner product to the A-weighted inner product \langle u, v \rangle_A = u^T A v, providing a framework for efficient exploration of the quadratic optimization landscape.[2]
A fundamental property of mutually A-conjugate directions is that any n such nonzero directions are linearly independent and therefore form a basis for \mathbb{R}^n. In the context of minimizing the quadratic function f(x) = \frac{1}{2} x^T A x - b^T x, starting from an initial point, exact minimization can be achieved in at most n line searches along these directions. Each step minimizes f over the affine subspace spanned by the initial point and the conjugate directions generated up to that iteration, so no effort is wasted revisiting previously optimized subspaces. This spanning property exploits the geometry of the ellipsoidal level sets defined by A, allowing the method to reach the global minimum.[2]
During the conjugate gradient iterations, the residual vectors r_k = b - A x_k are mutually orthogonal, satisfying r_i^T r_j = 0 for i \neq j, and the error vectors e_k = x^* - x_k are A-orthogonal to all earlier search directions, satisfying p_i^T A e_k = 0 for i < k. These properties, alongside the conjugacy of the search directions, ensure efficient subspace minimization and rapid error reduction. The term "conjugate gradient" originates from this interplay between conjugate directions and the gradients (residuals) in the optimization process, as coined by Hestenes and Stiefel in their seminal 1952 paper.[2][3]
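These properties can be observed directly by running the recurrences of the method on a small test problem and inspecting the Gram matrices of the directions and residuals. The following MATLAB sketch uses an arbitrary random symmetric positive definite matrix, chosen purely for illustration:

% Run the conjugate gradient recurrences, store all directions and residuals,
% and check the conjugacy and orthogonality properties stated above.
n = 5;
B = randn(n); A = B'*B + n*eye(n);    % random symmetric positive definite matrix
b = randn(n, 1);
x = zeros(n, 1); r = b - A*x; p = r;
P = []; R = r;                         % P collects directions, R collects residuals
for k = 1:n
    P = [P, p];
    Ap = A*p;
    alpha = (r'*r) / (p'*Ap);
    x = x + alpha*p;
    rnew = r - alpha*Ap;
    beta = (rnew'*rnew) / (r'*r);
    p = rnew + beta*p;
    r = rnew;
    R = [R, r];
end
% Off-diagonal entries of P'*A*P and R'*R should be at roundoff level.
M = P'*A*P;  G = R(:,1:n)'*R(:,1:n);
fprintf('max off-diagonal of P''*A*P: %.2e\n', max(max(abs(M - diag(diag(M))))));
fprintf('max off-diagonal of R''*R:   %.2e\n', max(max(abs(G - diag(diag(G))))));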
Derivation of the Method
Direct Method Perspective
The conjugate gradient method can be viewed as a direct solver for systems of linear equations Ax = b, where A is an n \times n symmetric positive definite (SPD) matrix, guaranteeing the exact solution in at most n steps under exact arithmetic. This perspective treats the method as a means of expanding the solution in a basis of A-conjugate directions, which form an orthogonal set with respect to the inner product \langle \mathbf{u}, \mathbf{v} \rangle_A = \mathbf{u}^T A \mathbf{v}. Starting from an initial guess \mathbf{x}_0, the residual is \mathbf{r}_0 = \mathbf{b} - A \mathbf{x}_0, and the search directions \mathbf{p}_k are generated iteratively such that \mathbf{p}_i^T A \mathbf{p}_j = 0 for i \neq j. The exact solution is then expressed as \mathbf{x}^* = \mathbf{x}_0 + \sum_{k=0}^{n-1} \alpha_k \mathbf{p}_k, where each coefficient \alpha_k is chosen to minimize the quadratic functional f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x} along its direction, yielding \alpha_k = \frac{\mathbf{r}_k^T \mathbf{r}_k}{\mathbf{p}_k^T A \mathbf{p}_k}. This step-wise minimization ensures that each update \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k reduces the residual via \mathbf{r}_{k+1} = \mathbf{r}_k - \alpha_k A \mathbf{p}_k, with the residuals remaining mutually orthogonal: \mathbf{r}_i^T \mathbf{r}_j = 0 for i \neq j.[2]
The finite termination in at most n steps follows from the dimensionality of the space. Since A is SPD, the A-inner product is a genuine inner product, so the nonzero conjugate directions \{\mathbf{p}_0, \dots, \mathbf{p}_{n-1}\} are automatically linearly independent and form a basis of \mathbb{R}^n in which the initial error \mathbf{e}_0 = \mathbf{x}^* - \mathbf{x}_0 can be expanded. Moreover, if none of the first n residuals vanishes, the mutually orthogonal residuals \{\mathbf{r}_0, \dots, \mathbf{r}_{n-1}\} span \mathbb{R}^n, and \mathbf{r}_n, being orthogonal to all of them, must be zero; the method therefore solves the system exactly in at most n steps. Termination can occur in fewer than n steps when the initial residual lies in a lower-dimensional invariant subspace of A, for instance when A has few distinct eigenvalues, but the worst-case bound is n.[2][3]
This direct-solver interpretation relates closely to the Gram-Schmidt orthogonalization process applied to the Krylov subspace \mathcal{K}_k(\mathbf{r}_0, A) = \operatorname{span}\{\mathbf{r}_0, A\mathbf{r}_0, \dots, A^{k-1}\mathbf{r}_0\}. The residuals \mathbf{r}_k are orthogonalized versions of the Krylov basis vectors with respect to the Euclidean inner product, while the search directions \mathbf{p}_k are obtained by A-orthogonalizing the successive residuals against the previous directions, mirroring the Gram-Schmidt procedure in the A-inner product. The conjugacy condition ensures that each new direction is A-orthogonal to the span of the prior directions, which coincides with the growing Krylov subspace of dimension k at step k. This connection underscores why the method terminates in at most n steps: the full Krylov subspace \mathcal{K}_n(\mathbf{r}_0, A) equals \mathbb{R}^n for a generic \mathbf{r}_0, for example when A has n distinct eigenvalues.[3][2]
To maintain conjugacy without explicit orthogonalization against all previous directions, the next direction is updated as \mathbf{p}_{k+1} = \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k, where \beta_k is chosen to ensure \mathbf{p}_{k+1}^T A \mathbf{p}_j = 0 for all j \leq k. Imposing this for j = k gives \mathbf{r}_{k+1}^T A \mathbf{p}_k + \beta_k \mathbf{p}_k^T A \mathbf{p}_k = 0.
Since A \mathbf{p}_k = (\mathbf{r}_k - \mathbf{r}_{k+1}) / \alpha_k and the residuals are mutually orthogonal, it follows that \mathbf{r}_{k+1}^T A \mathbf{p}_k = -\mathbf{r}_{k+1}^T \mathbf{r}_{k+1} / \alpha_k and \mathbf{p}_k^T A \mathbf{p}_k = \mathbf{r}_k^T \mathbf{r}_k / \alpha_k, leading to \beta_k = \frac{\mathbf{r}_{k+1}^T \mathbf{r}_{k+1}}{\mathbf{r}_k^T \mathbf{r}_k}. One can further show that this choice makes \mathbf{p}_{k+1} conjugate to all earlier directions, not only to \mathbf{p}_k. The resulting short recurrence avoids the need to store the full basis while still realizing the direct expansion in finitely many steps.[2][3]
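The connection to Gram-Schmidt orthogonalization in the A-inner product can be made concrete in a few lines of code. The following MATLAB sketch uses an arbitrary random SPD matrix (not from the article): it A-orthogonalizes the Krylov basis and verifies that the resulting conjugate-direction expansion reproduces the solution:

% Build A-conjugate directions by Gram-Schmidt in the A-inner product applied
% to the Krylov basis {r0, A*r0, A^2*r0, ...}, then expand the solution.
n = 6;
B = randn(n); A = B'*B + n*eye(n);    % random SPD matrix (illustrative choice)
b = randn(n, 1); x0 = zeros(n, 1);
r0 = b - A*x0;
P = zeros(n, n); v = r0;
for k = 1:n
    p = v;
    for j = 1:k-1                     % A-orthogonalize against earlier directions
        p = p - (P(:,j)'*A*v) / (P(:,j)'*A*P(:,j)) * P(:,j);
    end
    P(:,k) = p;
    v = A*v;                          % next Krylov vector A^k * r0
end
% By A-conjugacy, the expansion x* = x0 + sum_k alpha_k p_k has coefficients
% alpha_k = p_k'*r0 / (p_k'*A*p_k).
alpha = (P'*r0) ./ diag(P'*A*P);
x = x0 + P*alpha;
fprintf('residual of the expansion: %.2e\n', norm(b - A*x));  % should be tiny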
Iterative Method Perspective
The conjugate gradient method serves as an iterative approximation scheme for solving the symmetric positive definite linear system Ax = b, where the goal is to minimize the associated quadratic functional f(x) = \frac{1}{2} x^T A x - b^T x. At step k, the iterate x_k minimizes f over the affine space x_0 + \mathcal{K}_k, where \mathcal{K}_k = \operatorname{span}\{r_0, A r_0, \dots, A^{k-1} r_0\} is the k-th Krylov subspace and r_0 = b - A x_0; equivalently, x_k minimizes the A-norm of the error \|x_k - x^*\|_A over that affine space, and is therefore the best approximation to the exact solution available within it.[2]
The iteration proceeds by advancing along a conjugate search direction p_k, yielding the update x_{k+1} = x_k + \alpha_k p_k, where \alpha_k > 0 is the step length that minimizes f along the line x_k + \alpha p_k. This line search gives \alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}, which enforces the orthogonality p_k^T r_{k+1} = 0 once the residual is updated as r_{k+1} = r_k - \alpha_k A p_k. The resulting residual r_{k+1} lies in \mathcal{K}_{k+2} and is orthogonal to \mathcal{K}_{k+1}, in particular to all previous residuals \{r_0, \dots, r_k\}, a property that maintains the minimization over the expanding Krylov subspaces.[2]
To generate the next direction, p_{k+1} = r_{k+1} + \beta_k p_k with \beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}, which keeps the directions A-conjugate (p_i^T A p_j = 0 for i \neq j) and the residuals mutually orthogonal. This conjugacy, combined with the short recurrences for the residuals and directions, avoids explicit construction of the full Krylov basis and requires only O(n) storage and, beyond a single matrix-vector product, O(n) work per iteration for an n \times n system. The method thus reaches the exact minimizer of f(x) in finitely many steps for the quadratic case, while producing successively better approximations along the way.[2]
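The optimality of each iterate over the corresponding Krylov subspace can be verified numerically. Below is a MATLAB sketch with an arbitrary random SPD test matrix (not from the article) that compares a few CG steps against a brute-force A-norm minimization over an explicit Krylov basis:

% Verify that the k-th CG iterate matches the A-norm-optimal approximation in
% x0 + K_k, obtained here by brute force from an explicit Krylov basis.
n = 8; k = 3;
B = randn(n); A = B'*B + n*eye(n);    % random SPD matrix (illustrative choice)
b = randn(n, 1); x0 = zeros(n, 1);
r0 = b - A*x0;
% k steps of conjugate gradient
x = x0; r = r0; p = r;
for i = 1:k
    Ap = A*p; alpha = (r'*r) / (p'*Ap);
    x = x + alpha*p; rnew = r - alpha*Ap;
    beta = (rnew'*rnew) / (r'*r); p = rnew + beta*p; r = rnew;
end
% Brute-force minimizer of ||x0 + V*c - x*||_A over the explicit Krylov basis V
V = zeros(n, k); v = r0;
for i = 1:k, V(:,i) = v; v = A*v; end
c = (V'*A*V) \ (V'*r0);               % normal equations in the A-inner product
x_opt = x0 + V*c;
fprintf('difference between CG iterate and subspace optimum: %.2e\n', norm(x - x_opt));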
The Algorithm
Core Steps and Updates
The conjugate gradient method is implemented via an initialization phase followed by an iterative loop that computes step lengths, updates the approximate solution and residual, and maintains conjugate search directions. This algorithm, originally proposed by Hestenes and Stiefel, solves the linear system Ax = b, where A is symmetric positive definite, in at most n iterations for an n \times n system in exact arithmetic.[2]
Initialization requires selecting an arbitrary initial approximation x_0. The initial residual vector is then formed as r_0 = b - A x_0, and the initial search direction is set to p_0 = r_0. These choices ensure the method begins along the direction of steepest descent of the associated quadratic objective.[2]
The core iteration, for k = 0, 1, 2, \dots, proceeds as follows. First, compute the step size \alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}. Update the solution approximation and residual recursively: x_{k+1} = x_k + \alpha_k p_k, \quad r_{k+1} = r_k - \alpha_k A p_k. Next, compute the conjugacy parameter \beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}, and update the search direction: p_{k+1} = r_{k+1} + \beta_k p_k. The formulas for \alpha_k and \beta_k arise from the orthogonality of the residuals and the conjugacy of the directions, and they yield a short-recurrence, low-cost iteration.[2]
Due to floating-point arithmetic, the recursive residual update can accumulate roundoff errors, leading to a loss of conjugacy. To mitigate this, the residual should be recomputed explicitly at intervals as r_{k+1} = b - A x_{k+1}, for example after a fixed number of steps or when the computed residuals are no longer nearly orthogonal (r_{k+1}^T r_k noticeably different from zero).[3] Stopping criteria include a tolerance on the residual norm, such as \| r_k \|_2 < \epsilon \| b \|_2 for a relative tolerance \epsilon, together with a maximum iteration count, which in finite precision may need to exceed the problem dimension for ill-conditioned systems.[2]
The full algorithm in pseudocode form is:

Input: A (symmetric positive definite), b, x_0 (initial guess), ε (tolerance), max_iter
Output: x (approximate solution)

x ← x_0
r ← b - A x_0
p ← r
rsold ← r^T r
for k = 0 to max_iter - 1 do
    if √rsold < ε then break end if
    Ap ← A p
    α ← rsold / (p^T Ap)
    x ← x + α p
    r ← r - α Ap
    rsnew ← r^T r
    if √rsnew < ε then break end if
    β ← rsnew / rsold
    p ← r + β p
    rsold ← rsnew
end for
// Optional: Recompute r = b - A x for accuracy

This implementation uses dot products for efficiency and monitors the squared residual norm, avoiding square roots except when checking the stopping criterion.[2][3]
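For concreteness, the pseudocode above translates directly into MATLAB. The sketch below is an illustrative implementation (the function name cg_basic and the recompute interval of 50 steps are arbitrary choices, not prescribed by the sources) that also recomputes the residual explicitly at fixed intervals, as recommended above:

function x = cg_basic(Afun, b, x0, tol, max_iter)
    % Basic conjugate gradient loop following the pseudocode above.
    % Afun is a function handle that returns the matrix-vector product A*v.
    x = x0;
    r = b - Afun(x);
    p = r;
    rsold = r' * r;
    for k = 1:max_iter
        if sqrt(rsold) < tol, break; end
        Ap = Afun(p);
        alpha = rsold / (p' * Ap);
        x = x + alpha * p;
        r = r - alpha * Ap;
        if mod(k, 50) == 0            % recompute the residual explicitly at
            r = b - Afun(x);          % intervals to counter roundoff drift
        end
        rsnew = r' * r;
        if sqrt(rsnew) < tol, break; end
        beta = rsnew / rsold;
        p = r + beta * p;
        rsold = rsnew;
    end
end

For an explicit matrix A, it can be called as x = cg_basic(@(v) A*v, b, zeros(size(b)), 1e-8, numel(b)).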
Practical Considerations and Restarting
In practice, rounding errors in finite-precision arithmetic can accumulate during the iterations of the conjugate gradient method, leading to a gradual loss of orthogonality among the residual vectors and potentially causing the algorithm to stall or converge more slowly than expected. To counteract this, a widely adopted implementation strategy is periodic restarting: every m iterations—typically with m around 20 to 30 for problems of moderate size—the search direction is reset to the current residual, p_k = r_k whenever k is a multiple of m. This simple reset restarts the generation of conjugate directions from the updated residual, restoring approximate orthogonality and enhancing overall numerical stability, though it may slightly slow convergence for ill-conditioned systems.[3]
Another key consideration for numerical robustness is the computation of the parameter \beta_k, which updates the search direction. Theoretically r_{k+1}^T r_k = 0 by orthogonality, but in floating-point arithmetic this dot product is small and nonzero, so expressions for \beta_k that involve it (such as the algebraically equivalent form \beta_k = \frac{r_{k+1}^T (r_{k+1} - r_k)}{r_k^T r_k}) are prone to cancellation errors. Practitioners therefore prefer the expression \beta_k = \frac{\| r_{k+1} \|^2}{\| r_k \|^2}, which relies solely on positive quantities (squared residual norms) and better preserves the method's properties under rounding perturbations.[3]
Regarding efficiency, the conjugate gradient method excels in scenarios with large sparse matrices, as it requires only O(n) storage space for a handful of n-dimensional vectors (such as the iterate, residual, search direction, and the auxiliary product Ap), without needing to store the full matrix A explicitly. The work per iteration is dominated by a single on-the-fly matrix-vector multiplication Ap, which is computationally inexpensive for sparse representations, making the method scalable to systems with millions of unknowns.[2]
The core iterative loop, incorporating the stable \beta_k computation and restarting, can be implemented succinctly in languages like MATLAB or Julia. Below is example code in MATLAB syntax, assuming A is a function handle for sparse matrix-vector multiplication and restart_interval is set to 30:

function [x, k] = cg_restart(A, b, x0, tol, maxit, restart_interval)
    x = x0;
    r = b - A(x);
    rsold = dot(r, r);
    p = r;
    k = 0;
    if sqrt(rsold) < tol, return; end
    while k < maxit
        k = k + 1;
        if mod(k, restart_interval) == 0
            p = r;                     % restart: reset the search direction
            rsold = dot(r, r);
        end
        Ap = A(p);
        alpha = rsold / dot(p, Ap);
        x = x + alpha * p;
        r = r - alpha * Ap;
        rsnew = dot(r, r);
        if sqrt(rsnew) < tol, break; end
        beta = rsnew / rsold;          % stable norm-based beta
        p = r + beta * p;
        rsold = rsnew;
    end
end

This formulation ensures efficient execution while mitigating common numerical pitfalls through the norm-based \beta_k and explicit restarts.[3]
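As a usage illustration (the tridiagonal test matrix and the right-hand side below are arbitrary choices, not from the sources), the function can be called with a sparse matrix wrapped in a function handle:

% Example call of cg_restart with a sparse SPD test matrix.
n = 10000;
e = ones(n, 1);
A = spdiags([-e 4*e -e], -1:1, n, n);   % sparse, diagonally dominant, SPD
x_true = ones(n, 1);
b = A * x_true;                         % right-hand side with known solution
Afun = @(v) A * v;                      % matrix-vector product as a function handle
[x, iters] = cg_restart(Afun, b, zeros(n, 1), 1e-10, n, 30);
fprintf('iterations: %d, error: %.2e\n', iters, norm(x - x_true));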
Numerical Example
To illustrate the conjugate gradient method, consider solving the linear system Ax = b, where A is the symmetric positive definite 3×3 matrix A = \begin{pmatrix} 3 & 2 & 1 \\ 2 & 6 & 2 \\ 1 & 2 & 7 \end{pmatrix}, b = \begin{pmatrix} 2 \\ -8 \\ 2 \end{pmatrix}, and the initial approximation is x_0 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}. The exact solution is x = \begin{pmatrix} 21/11 \\ -24/11 \\ 7/11 \end{pmatrix} \approx \begin{pmatrix} 1.909 \\ -2.182 \\ 0.636 \end{pmatrix}.
The initial residual is r_0 = b - A x_0 = b = \begin{pmatrix} 2 \\ -8 \\ 2 \end{pmatrix}, and the initial search direction is p_0 = r_0, with \|r_0\|^2 = r_0^T r_0 = 72. Compute A p_0 = \begin{pmatrix} -8 \\ -40 \\ 0 \end{pmatrix}, so p_0^T A p_0 = 304 and \alpha_0 = 72 / 304 = 9/38 \approx 0.237. The first update is x_1 = x_0 + \alpha_0 p_0 = (9/38) p_0 = \begin{pmatrix} 9/19 \\ -36/19 \\ 9/19 \end{pmatrix} \approx \begin{pmatrix} 0.474 \\ -1.895 \\ 0.474 \end{pmatrix}, and r_1 = r_0 - \alpha_0 A p_0 = \begin{pmatrix} 74/19 \\ 28/19 \\ 38/19 \end{pmatrix} \approx \begin{pmatrix} 3.895 \\ 1.474 \\ 2.000 \end{pmatrix}, with \|r_1\|^2 = 7704/361 \approx 21.341. The next coefficient is \beta_0 = \|r_1\|^2 / \|r_0\|^2 = 107/361 \approx 0.296, so p_1 = r_1 + \beta_0 p_0 \approx \begin{pmatrix} 4.488 \\ -0.898 \\ 2.593 \end{pmatrix}.
Continuing the iterations yields the updates shown in the table below, where \|r_k\|^2 decreases to zero in exactly 3 steps, as expected for a 3 \times 3 system in exact arithmetic; in floating-point arithmetic the value after the third step is a numerical zero at roundoff level.

| Iteration k | x_k (approximate) | \|r_k\|^2 (approximate) |
|---|---|---|
| 0 | (0.000, 0.000, 0.000) | 72.000 |
| 1 | (0.474, -1.895, 0.474) | 21.341 |
| 2 | (1.343, -2.069, 0.976) | 5.485 |
| 3 | (1.909, -2.182, 0.636) | 0 (numerical zero) |
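The table can be reproduced with a few lines of MATLAB; the following minimal sketch reruns the recurrences of the algorithm on the 3×3 system above and prints the iterates and squared residual norms:

% Reproduce the worked 3x3 example by running the conjugate gradient recurrences.
A = [3 2 1; 2 6 2; 1 2 7];
b = [2; -8; 2];
x = [0; 0; 0];
r = b - A*x;  p = r;  rsold = r'*r;
fprintf('k = 0: x = (%.3f, %.3f, %.3f), ||r||^2 = %.3f\n', x, rsold);
for k = 1:3
    Ap = A*p;
    alpha = rsold / (p'*Ap);
    x = x + alpha*p;
    r = r - alpha*Ap;
    rsnew = r'*r;
    fprintf('k = %d: x = (%.3f, %.3f, %.3f), ||r||^2 = %.3e\n', k, x, rsnew);
    beta = rsnew / rsold;
    p = r + beta*p;
    rsold = rsnew;
end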