
Anderson acceleration

Anderson acceleration is an iterative method designed to accelerate the convergence of fixed-point iterations by combining previous iterates and residuals to form an improved update, originally developed for solving nonlinear integral equations. Introduced by Donald G. Anderson in 1965, it addresses the slow convergence often encountered in simple fixed-point schemes, such as those arising from nonlinear systems or optimization problems. The method, also known as Anderson mixing, operates by minimizing a least-squares problem involving differences in function values (residuals) over a limited history of recent iterations, typically retaining only a small number of previous steps (the depth or window size m). This approach draws from quasi-Newton techniques and sequence transformation methods, such as the Shanks transformation, which generalize earlier acceleration strategies like Aitken's delta-squared process. Anderson acceleration comes in type-I and type-II variants, corresponding to multisecant approximations of the Jacobian and of its inverse, respectively, with type-II being the more common choice in practice owing to its stability. In applications, Anderson acceleration has proven particularly valuable in computational physics and chemistry, including electronic structure calculations, where it is essentially equivalent to Pulay mixing or direct inversion in the iterative subspace (DIIS). It has been extended to optimization algorithms such as proximal gradient methods and coordinate descent, as well as to nonsmooth problems and machine learning tasks such as federated learning. Recent theoretical advances have established local linear convergence rates under contractivity assumptions and global convergence guarantees with regularization, enhancing its reliability across diverse numerical challenges.

History and Background

Origins and Development

Anderson acceleration was first proposed by Donald G. Anderson in 1965 as an iterative technique to accelerate the convergence of fixed-point iterations for solving nonlinear integral equations, particularly those arising in the kinetic theory of gases. The method involved a generalized secant approach that combines previous iterates to form an improved estimate, limiting the degree of extrapolation to maintain stability and efficiency in numerical solutions. During the late 1960s and 1970s, the technique saw adoption for nonlinear integral equations in computational physics, and it was independently rediscovered in the 1970s in quantum chemistry and electronic structure theory, where it became known as Anderson mixing, particularly within self-consistent field methods. It experienced a significant revival in the 1980s in electronic structure calculations, where it was routinely employed to speed up self-consistent field iterations in quantum chemistry and materials science simulations. This resurgence highlighted its practical value in high-dimensional nonlinear systems and led to broader recognition in scientific computing. A pivotal milestone came in 2011 with the work of Homer F. Walker and Peng Ni, who provided a rigorous mathematical formulation and convergence analysis for Anderson acceleration applied to general fixed-point problems, bridging its historical use with modern optimization theory. Their SIAM Journal on Numerical Analysis paper formalized the approach, emphasizing its potential beyond specialized domains and spurring further theoretical and applied developments.

Initial Applications

The initial practical application of Anderson acceleration appeared in Donald G. Anderson's 1965 work, where it was employed to solve nonlinear integral equations derived from the kinetic theory of rarefied gases, a framework pertinent to modeling viscous fluid flows under low-density conditions. Numerical experiments in the paper focused on accelerating Picard iterations for the Krook kinetic equation, demonstrating improved convergence over standard fixed-point methods for these coupled singular Fredholm integral equations of the second kind. This development stemmed from Anderson's doctoral research in 1962, motivated by the need to expedite slowly converging Picard iterations in fixed-point problems with many degrees of freedom. Early works highlighted several limitations, including sensitivity to the choice of depth parameter m, with large values leading to numerical instabilities and rank deficiency in the underlying least-squares problems; modest m (typically 2 to 12) was recommended to maintain robustness. Additional challenges involved handling outdated iterate data in highly nonlinear settings, potential divergence when the update coefficients \beta take the wrong sign, and increased computational overhead for high-dimensional problems, often mitigated by small relaxation parameters.

Core Concepts

Fixed-Point Iterations

A fixed-point problem involves finding a vector \mathbf{x} in a suitable space such that \mathbf{x} = g(\mathbf{x}), where g denotes a fixed-point mapping from the space into itself. This formulation arises naturally in various contexts, including the analysis of dynamical systems and the solution of equations where the fixed point represents an equilibrium or solution state. The standard iterative procedure for approximating the solution to a fixed-point problem is the Picard iteration, defined by the recurrence \mathbf{x}_{k+1} = g(\mathbf{x}_k) for k = 0, 1, 2, \dots, initialized with some starting vector \mathbf{x}_0. Convergence of this iteration to the fixed point is guaranteed under the contraction mapping theorem, also known as the Banach fixed-point theorem: if the space is complete and g is a contraction (i.e., there exists \delta \in [0, 1) such that \|g(\mathbf{x}) - g(\mathbf{y})\| \leq \delta \|\mathbf{x} - \mathbf{y}\| for all \mathbf{x}, \mathbf{y} in the space), then the iteration converges uniquely to the fixed point from any initial guess. Fixed-point problems frequently emerge from nonlinear equations of the form F(\mathbf{x}) = \mathbf{0}, where F is a nonlinear operator, by reformulating them as \mathbf{x} = g(\mathbf{x}) with g(\mathbf{x}) = \mathbf{x} - F(\mathbf{x}); a solution satisfies F(\mathbf{x}) = \mathbf{0} if and only if \mathbf{x} is a fixed point of g. This reformulation is particularly useful in numerical methods for systems where direct root-finding is challenging, such as in optimization or physics-based simulations, though the choice of g must ensure the contraction property for reliable convergence. Despite its simplicity, the Picard iteration often exhibits linear convergence, meaning the error \|\mathbf{x}_{k+1} - \mathbf{x}^*\| is asymptotically proportional to \delta times the previous error \|\mathbf{x}_k - \mathbf{x}^*\|, where \delta is the contraction constant. This rate becomes impractically slow when \delta is close to 1, a scenario common in ill-conditioned problems where small changes in the input lead to large variations in the solution, amplifying the need for careful selection of the mapping g to improve the constant.
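As a concrete illustration, the following MATLAB sketch runs a plain Picard iteration on the scalar problem x = \cos(x); the map, starting point, and tolerance are arbitrary illustrative choices. Because the contraction factor near the fixed point is about 0.67, the error shrinks only by roughly that factor per step, which is the slow linear convergence that acceleration methods aim to overcome.
matlab
% Plain Picard iteration for x = g(x), illustrated on g(x) = cos(x)
g   = @(x) cos(x);   % fixed-point map (illustrative choice)
x   = 0;             % initial guess
tol = 1e-10;
for k = 1:200
    x_new = g(x);              % Picard update x_{k+1} = g(x_k)
    res   = abs(x_new - x);    % residual |g(x_k) - x_k|
    x     = x_new;
    if res < tol, break; end
end
% Roughly 60 iterations are needed, since the error contracts by only ~0.67 per step
fprintf('Picard: %d iterations, x = %.9f, residual = %.2e\n', k, x, res);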

Acceleration Mechanism

Anderson acceleration modifies the standard fixed-point iteration x_{k+1} = g(x_k) by constructing the next iterate as a linear combination of the m+1 most recent function evaluations g(x_{k-m}), \dots, g(x_k), where the coefficients are selected to minimize a residual norm that measures the inconsistency in the fixed-point condition across these points. This extrapolation step aims to skip over slow-converging regions by leveraging historical data to predict a better approximation to the fixed point. The method, originally proposed for accelerating iterations in nonlinear integral equations, has since been generalized to various fixed-point problems. The core intuition behind this acceleration lies in its similarity to quasi-Newton methods for solving nonlinear equations, but adapted to the fixed-point setting x = g(x). By minimizing the residual over multiple previous steps, Anderson acceleration implicitly approximates the inverse Jacobian of the fixed-point map without requiring derivative computations, effectively capturing nonlinear behaviors more accurately than derivative-free alternatives. This subspace-based approximation enhances the directional update, promoting faster convergence even when the underlying iteration exhibits linear or oscillatory damping. In contrast to simple secant methods, which approximate the update using only one or two prior steps to mimic a Newton-like correction, Anderson acceleration integrates information from up to m past iterates, solving a larger least-squares problem to obtain a higher-dimensional approximation. This multi-secant approach reduces sensitivity to poor individual steps and provides a more stable acceleration, particularly in high-dimensional or ill-conditioned problems where single-secant methods may falter. The workflow proceeds as follows: at iteration k, after computing g(x_k), the method forms the residuals f_j = g(x_j) - x_j for the most recent m+1 iterates j = k-m to k. It then determines the coefficients \alpha by solving the minimization problem \min_{\alpha} \left\| \sum_{i=0}^m \alpha_i f_{k-m+i} \right\|_2 subject to \sum_{i=0}^m \alpha_i = 1. The resulting \alpha yields the accelerated update x_{k+1} = \sum_{i=0}^m \alpha_i g(x_{k-m+i}) (type-II variant), effectively enforcing a smaller predicted residual at the new point. This process repeats, typically resetting or shifting the history window as needed to maintain computational efficiency.
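As a minimal sketch of a single type-II step in the constrained form above, the helper below computes the weights \alpha by solving the equality-constrained least-squares problem through its KKT system and then forms the new iterate as the corresponding combination of stored g values. The names Fhist and Ghist are hypothetical containers for the stored residuals and function evaluations; a production implementation would add regularization or a QR-based solve for robustness.
matlab
% One constrained type-II Anderson step (illustrative sketch).
% Fhist, Ghist: hypothetical n-by-(m+1) matrices whose columns are the stored
% residuals f_{k-m},...,f_k and mapped values g(x_{k-m}),...,g(x_k).
function x_next = aa_step_constrained(Fhist, Ghist)
    p = size(Fhist, 2);
    % KKT system for: minimize ||Fhist*alpha||_2^2 subject to sum(alpha) = 1
    K   = [2 * (Fhist' * Fhist), ones(p, 1);
           ones(1, p),           0         ];
    rhs = [zeros(p, 1); 1];
    sol = K \ rhs;
    alpha  = sol(1:p);          % affine weights, sum(alpha) = 1
    x_next = Ghist * alpha;     % x_{k+1} = sum_i alpha_i * g(x_i)
end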

Mathematical Formulation

Derivation

Anderson acceleration derives from the standard fixed-point iteration framework for solving equations of the form x = g(x), where g: \mathbb{R}^n \to \mathbb{R}^n is a given mapping. The basic iteration is x_{k+1} = g(x_k), which converges under suitable conditions such as contractivity of g. To accelerate convergence, the method extends this by incorporating information from the m previous iterates x_k, x_{k-1}, \dots, x_{k-m+1}, along with their corresponding function values g(x_i). Define the residuals as f_i = g(x_i) - x_i for each iterate i, which measure the deviation from the fixed point. These residuals satisfy f(x) = 0 at the solution. To approximate the behavior of g or the associated Jacobian, compute the successive differences \Delta x_j = x_j - x_{j-1} and \Delta f_j = f_j - f_{j-1} for j = k-m+1, \dots, k. These differences provide secant-like approximations to the action of the Jacobian J of f, since \Delta f_j \approx J \Delta x_j. Form the matrices from these differences: let \Delta X_k be the n \times m matrix with columns \Delta x_j and \Delta F_k the n \times m matrix with columns \Delta f_j, for j = k-m+1 to k. The approximation \Delta F_k \approx J \Delta X_k suggests that J \approx \Delta F_k (\Delta X_k)^\dagger, where (\cdot)^\dagger denotes a pseudoinverse, leading to an approximate inverse J^{-1} \approx \Delta X_k (\Delta F_k)^\dagger. To estimate the Newton-like step \delta \approx -J^{-1} f_k, solve for the coefficient vector \gamma \in \mathbb{R}^m that minimizes the least-squares problem \gamma = \arg\min_{\gamma \in \mathbb{R}^m} \| \Delta F_k \gamma - f_k \|_2^2. This \gamma satisfies the secant conditions in a least-squares sense, approximating f_k \approx \Delta F_k \gamma. Thus, the step is \delta \approx -\Delta X_k \gamma, and the extrapolated point is \tilde{x}_{k+1} = x_k - \Delta X_k \gamma. Here, \Delta X_k serves as the matrix G approximating the action of J^{-1} via the differences, aligning with the secant relation \Delta x_j \approx J^{-1} \Delta f_j. The full Anderson update incorporates a relaxation parameter \beta \in [0,1] for numerical stability, blending the extrapolated point with the unaccelerated fixed-point step: x_{k+1} = (1 - \beta) \, g(x_k) + \beta \, \tilde{x}_{k+1}. When \beta = 1, the update uses only the extrapolation \tilde{x}_{k+1}; smaller \beta mixes in the fixed-point step g(x_k) to dampen oscillations. This formulation enables the method to leverage historical data from up to m previous steps for faster convergence.
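The derivation above translates almost line by line into code. The sketch below assumes hypothetical arrays X and F whose columns hold the stored iterates x_{k-m}, \dots, x_k and residuals f_{k-m}, \dots, f_k, together with the current mapped value g(x_k) and a damping factor \beta; it forms the difference matrices, solves the least-squares problem for \gamma, and returns the damped Anderson update.
matlab
% Anderson update assembled directly from the derivation (illustrative sketch).
% X, F: hypothetical n-by-(m+1) arrays of stored iterates and residuals;
% g_xk: the value g(x_k); beta: damping factor in [0,1].
function x_next = aa_update_from_histories(X, F, g_xk, beta)
    dX = diff(X, 1, 2);                 % columns Delta x_j = x_j - x_{j-1}
    dF = diff(F, 1, 2);                 % columns Delta f_j = f_j - f_{j-1}
    fk = F(:, end);                     % current residual f_k
    gamma   = dF \ fk;                  % gamma = argmin || dF*gamma - f_k ||_2
    x_tilde = X(:, end) - dX * gamma;   % extrapolated point x_k - DeltaX_k * gamma
    x_next  = (1 - beta) * g_xk + beta * x_tilde;   % damped Anderson update
end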

Minimization Problem Solution

The minimization problem central to Anderson acceleration involves solving a least-squares subproblem at each iteration to determine the coefficients \gamma that optimally combine previous residual differences. Specifically, this entails minimizing \min_{\gamma \in \mathbb{R}^m} \|\tilde{F}_k \gamma - \tilde{f}_k\|_2, where \tilde{F}_k = [\Delta f_{k-m+1}, \dots, \Delta f_k] is the n \times m matrix whose columns are the differences in successive function values (residuals) from the past m iterations, \Delta f_j = f_j - f_{j-1} with f_j = g(x_j) - x_j denoting the fixed-point residual at iterate x_j, and \tilde{f}_k = f_k is the current residual vector. This unconstrained least-squares problem admits a closed-form solution via the normal equations: \gamma = (\tilde{F}_k^T \tilde{F}_k)^{-1} \tilde{F}_k^T \tilde{f}_k, where the m \times m matrix \tilde{F}_k^T \tilde{F}_k is the Gram matrix of the residual differences. This \gamma is then used to form the accelerated update \tilde{x}_{k+1} = x_k - \Delta X_k \gamma, where \Delta X_k = [\Delta x_{k-m+1}, \dots, \Delta x_k] is the matrix of iterate differences. Equivalently, \tilde{x}_{k+1} is an affine combination of the previous iterates \sum_{i=0}^m \alpha_i x_{k-m+i} with coefficients \alpha derived from \gamma such that \sum \alpha_i = 1. In practice, direct solution of the normal equations can be unstable due to ill-conditioning of \tilde{F}_k^T \tilde{F}_k, which arises when the columns of \tilde{F}_k become nearly linearly dependent; this occurs readily as the iteration converges, even for modest depths (typically m \leq 10). Instead, stable computation is achieved through QR factorization of \tilde{F}_k = Q R, where R is upper triangular, allowing \gamma to be obtained by solving the reduced system R \gamma = Q^T \tilde{f}_k via back-substitution; this approach maintains numerical robustness with O(m^2 n) cost per iteration and enables efficient rank monitoring. Alternatively, a singular value decomposition (SVD) \tilde{F}_k = U \Sigma V^T provides the pseudoinverse solution \gamma = V \Sigma^{+} U^T \tilde{f}_k, which is preferred for detecting and handling rank deficiency by truncating small singular values. To address ill-conditioning or singularity, which is common when the residuals exhibit linear dependencies or the fixed-point map g is nearly linear, implementations often incorporate truncation, removing columns of \tilde{F}_k whose contribution is negligible (e.g., coefficients with |\gamma_j| < 10^{-10} or correspondingly small diagonal entries of R), or regularization via a ridge term \lambda I added to \tilde{F}_k^T \tilde{F}_k with small \lambda > 0, ensuring invertibility while preserving convergence properties. These techniques are particularly effective in applications like nonlinear solvers where m is modest, balancing acceleration gains against computational overhead.
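The following sketch illustrates the QR-based solve together with a simple conditioning safeguard that discards the oldest difference columns when the triangular factor becomes too ill-conditioned. The matrix dF of residual differences, the current residual fk, and the condition-number threshold are illustrative assumptions rather than part of a standard interface.
matlab
% QR-based computation of the Anderson coefficients with column dropping
% (illustrative sketch; dF holds residual differences ordered oldest first).
function gamma = aa_coefficients_qr(dF, fk, cond_max)
    if nargin < 3, cond_max = 1e12; end     % arbitrary conditioning threshold
    % Drop the oldest columns until R is acceptably conditioned
    while size(dF, 2) > 1
        [~, R] = qr(dF, 0);                 % economy-size QR factorization
        if cond(R) < cond_max, break; end
        dF = dF(:, 2:end);                  % discard the oldest difference column
    end
    [Q, R] = qr(dF, 0);
    gamma = R \ (Q' * fk);                  % solve R*gamma = Q'*fk (min ||dF*gamma - fk||)
end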

Relaxation Parameter

In Anderson acceleration, the relaxation parameter β ∈ [0,1] is incorporated into the update step as x_{k+1} = (1 - \beta) g(x_k) + \beta \tilde{x}_{k+1}, where \tilde{x}_{k+1} denotes the accelerated iterate obtained from the minimization problem, and g(x_k) is the standard fixed-point mapping applied to the current iterate. This formulation mixes the accelerated update with the unaccelerated one, allowing for a damped progression toward the fixed point. The primary purpose of the relaxation parameter is to enhance stability and prevent divergence, particularly when the underlying fixed-point mapping is non-contractive or subject to noise, such as in practical scientific computing applications. By introducing damping through β < 1, the method mitigates oscillations that can arise in pure acceleration (β = 1), thereby improving robustness without relying on strong theoretical assumptions about the mapping's contractivity. Common strategies for selecting β include using a fixed value of β = 1, which corresponds to the pure Anderson acceleration without damping and is suitable for well-behaved problems. Alternatively, β can be determined via line search to minimize the residual norm \|f_{k+1}\|, where f_{k+1} = g(x_{k+1}) - x_{k+1}, by solving an optimization such as \beta_k^* = -\frac{\langle f(\tilde{y}_k) - f(x_k), f(x_k) \rangle}{\|f(\tilde{y}_k) - f(x_k)\|^2}. Adaptive approaches based on residual norms, such as projecting onto the direction that minimizes the distance to the fixed-point mapping, further tailor β to the iteration's progress, often requiring no additional function evaluations. While β < 1 bolsters robustness in challenging settings, it can reduce the acceleration effect and slow overall convergence compared to β = 1 in ideal cases.
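A minimal sketch of the optimized damping choice is given below; it evaluates the closed-form \beta_k^* from the two residuals appearing in the formula above. The residual function handle r and the candidate point \tilde{y}_k are assumed inputs, and note that evaluating r at the candidate costs one extra application of g.
matlab
% Optimized damping factor for one Anderson step (illustrative sketch).
% r: handle returning the residual r(x) = g(x) - x; xk: current iterate;
% y_tilde: undamped Anderson candidate. Requires one extra evaluation of g.
function beta_opt = aa_optimal_damping(r, xk, y_tilde)
    f0 = r(xk);                           % residual at the current iterate
    f1 = r(y_tilde);                      % residual at the accelerated candidate
    d  = f1 - f0;
    beta_opt = -(d' * f0) / (d' * d);     % minimizes ||(1-beta)*f0 + beta*f1||_2
    beta_opt = min(max(beta_opt, 0), 1);  % optional clipping to [0, 1]
end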

Depth Parameters m and m_k

In Anderson acceleration, the parameter m represents the maximum depth of memory, or the fixed upper limit on the number of previous iterates and function values incorporated into the least-squares minimization at each step. This parameter balances the quality of the secant approximation to the Jacobian against computational overhead, with typical values ranging from 5 to 10 in many applications, though larger values up to 50 have been used for high-dimensional problems. The actual depth utilized at iteration k, denoted m_k, is adaptively set as m_k = \min(m, k-1), allowing the method to start with limited history when few iterates are available and gradually incorporate more as the algorithm progresses. This growing depth enhances the approximation in later stages but requires careful management to prevent numerical issues. Larger values of m generally yield a more accurate low-rank approximation of the Jacobian through richer historical data, potentially accelerating convergence, but they escalate the per-step computational cost to O(m^2 n) due to the least-squares solve, where n is the problem dimension, and may introduce ill-conditioning in the coefficient matrix. Empirically, m is often selected via cross-validation or problem-specific tuning, starting with small values (e.g., 3–5) for initial testing and scaling up while monitoring residual norms and matrix condition numbers. To mitigate stagnation from accumulated poor-quality history, restart strategies periodically reset m_k to a lower value, such as reinitializing the memory after every m steps or switching depths dynamically based on residual thresholds (e.g., increasing from 3 to 10 once residuals drop below 10^{-2}). Such techniques, including column filtering for linear dependence, help maintain stability without full resets.
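The threshold-based depth switching described above can be expressed as a small helper; the sketch below, with arbitrary cutoff values, starts with a shallow window and enlarges it once the residual norm falls below a prescribed level.
matlab
% Residual-threshold depth schedule (illustrative sketch; cutoffs are arbitrary).
% k: current iteration index; m_max: maximum depth; res: current residual norm.
function m_k = choose_depth(k, m_max, res, res_switch)
    if nargin < 4, res_switch = 1e-2; end
    if res > res_switch
        m_k = min(3, k - 1);        % shallow window in the pre-asymptotic phase
    else
        m_k = min(m_max, k - 1);    % full window once residuals are small
    end
    m_k = max(m_k, 0);              % guard against negative depth at startup
end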

Theoretical Properties

Convergence Analysis

Under the assumption that the fixed-point map g is contractive, meaning it satisfies a Lipschitz condition with constant less than 1 in a neighborhood of the fixed point, Anderson acceleration achieves local R-linear convergence, where the rate depends on the contraction constant and the boundedness of the mixing coefficients. This improves upon the linear convergence rate of the unaccelerated fixed-point iteration by incorporating a linear combination of previous iterates that minimizes the residual in a low-dimensional subspace. For fixed-point iterations that converge linearly, Anderson acceleration further enhances the local behavior to faster R-linear convergence, as the acceleration reduces the first-order residual terms by a factor related to the gain at each step. Global convergence guarantees for Anderson acceleration typically require additional modifications, such as safeguarding steps or regularization, to ensure progress from arbitrary starting points assuming only that g is nonexpansive. When the relaxation parameter \beta = 1, corresponding to a full Anderson step without damping, the method can produce a monotonic decrease in residual norms under contractive conditions in appropriate norms, though this relies on the minimization step yielding sufficient reduction relative to prior iterates. However, standard Anderson acceleration is generally non-monotone and may fail to converge globally without such safeguards. In terms of comparative rates, Anderson acceleration outperforms simple extrapolation methods for underlying iterations with linear convergence by achieving faster linear convergence rates, but it can slow down iterations that already exhibit quadratic convergence by introducing higher-order terms that degrade the asymptotic order—for instance, reducing the order from 2 to approximately 1.618 for depth m=1 in Newton-based variants. Recent analyses have addressed non-monotone behavior through globalization strategies like adaptive quadratic regularization within a trust-region framework, ensuring global convergence to a stationary point while preserving the local R-linear rate of the original method under mild assumptions on residual monotonicity and averaged operators. The depth parameter m influences these rates by allowing larger subspaces for residual minimization, potentially accelerating convergence but increasing computational cost per iteration.

Relations to Other Methods

Anderson acceleration (AA) serves as a root-finding analog to quasi-Newton methods, particularly Broyden's method, by employing secant updates to approximate the Jacobian for fixed-point iterations. In this framework, AA constructs an approximate inverse Jacobian using differences in iterates and function values, akin to how Broyden's "good" and "bad" updates maintain secant conditions for nonlinear equations. For instance, Type-I AA corresponds to Broyden's good update, while Type-II aligns with the bad update, enabling multi-secant approximations that accelerate convergence without explicit Jacobian computations. Type-II AA without truncation is essentially equivalent to the generalized minimal residual (GMRES) method when applied to linear fixed-point problems. The equivalence holds as long as GMRES does not stagnate, with the AA iterates recoverable from the corresponding GMRES iterates until the grade of the initial residual is reached; for nonlinear problems the correspondence is only approximate. This connection underscores AA's Krylov subspace-like behavior in nonlinear settings, making it a nonlinear extension of GMRES for root-finding tasks. AA shares conceptual similarities with Direct Inversion in the Iterative Subspace (DIIS), both relying on subspace extrapolation to accelerate fixed-point iterations by linearly combining previous iterates. However, DIIS typically minimizes a norm of error vectors (such as residuals) within a subspace under a summation constraint, often via least-squares, whereas AA directly incorporates function values (e.g., differences in fixed-point mappings) to form the extrapolation coefficients. This distinction allows AA to operate more generally on function evaluations without requiring explicit error vector definitions, though in specific applications like quantum chemistry, the two methods coincide as reformulations of Pulay mixing. In contrast to momentum methods like Nesterov's accelerated gradient descent, which are tailored to smooth, gradient-based optimization and introduce extrapolation parameters to achieve optimal rates for convex functions, AA targets general fixed-point problems without assuming gradient structure. Nesterov momentum excels in early descent phases but can oscillate near solutions, while AA provides robust local acceleration for arbitrary nonlinear mappings, such as in expectation-maximization or proximal algorithms; hybrid approaches often combine Nesterov's global momentum with AA for enhanced performance across phases.

Variants and Extensions

Type-I and Type-II Formulations

Anderson acceleration admits two primary formulations, Type-I and Type-II, distinguished by how the mixing coefficients are computed from the stored history. Both variants apply to fixed-point iterations of the form x = g(x) with residual f(x) = g(x) - x, and both can be written in terms of the difference matrices \Delta X_k = [\Delta x_{k-m+1}, \dots, \Delta x_k] and \Delta F_k = [\Delta f_{k-m+1}, \dots, \Delta f_k], where \Delta x_j = x_j - x_{j-1} and \Delta f_j = f_j - f_{j-1}. The Type-II formulation, which is the standard form of Anderson acceleration, determines the coefficients by the least-squares problem \gamma^{(k)} = \arg\min_{\gamma} \| f_k - \Delta F_k \gamma \|_2 and, in the undamped case, updates x_{k+1} = g(x_k) - \Delta G_k \gamma^{(k)}, where \Delta G_k = \Delta X_k + \Delta F_k contains the differences of the mapped values g(x_j). This variant corresponds to a multisecant version of Broyden's second ("bad") update, imposing the secant conditions on an approximation of the inverse Jacobian, and it reduces to GMRES on linear problems. The Type-I formulation uses the same update x_{k+1} = g(x_k) - \Delta G_k \gamma^{(k)} but computes the coefficients from the oblique projection \gamma^{(k)} = (\Delta X_k^T \Delta F_k)^{-1} \Delta X_k^T f_k, which enforces \Delta X_k^T (\Delta F_k \gamma - f_k) = 0. This corresponds to a multisecant version of Broyden's first ("good") update, imposing the secant conditions on a direct approximation of the Jacobian. Computationally, Type-II requires only a standard least-squares solve with the symmetric positive semidefinite Gram matrix \Delta F_k^T \Delta F_k (or, more stably, a QR factorization of \Delta F_k), which makes it comparatively robust. Type-I instead requires the square but generally nonsymmetric matrix \Delta X_k^T \Delta F_k, which can be singular or severely ill-conditioned, so practical Type-I implementations typically add regularization or safeguarding steps. Type-II is therefore favored in most applications, while Type-I is of interest for its closer connection to Broyden's good method and, with appropriate safeguards, for nonsmooth fixed-point problems.
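The sketch below contrasts the two coefficient solves on the same stored data; the difference matrices dX and dF, the current residual fk, and the current mapped value gk are hypothetical inputs. Both branches share the update through \Delta G_k = \Delta X_k + \Delta F_k, and in practice the Type-I solve would be regularized because \Delta X_k^T \Delta F_k may be nearly singular.
matlab
% Undamped Anderson step in Type-I and Type-II form (illustrative sketch).
% dX, dF: hypothetical n-by-m difference matrices; fk = g(x_k) - x_k; gk = g(x_k).
function [x_type1, x_type2] = aa_type1_type2_step(dX, dF, fk, gk)
    dG = dX + dF;                        % differences of the mapped values g(x_j)
    % Type-II: least squares on residual differences (multisecant "bad" Broyden)
    gamma2  = dF \ fk;                   % argmin || dF*gamma - fk ||_2
    x_type2 = gk - dG * gamma2;
    % Type-I: oblique projection (multisecant "good" Broyden)
    gamma1  = (dX' * dF) \ (dX' * fk);   % solves dX' * (dF*gamma - fk) = 0
    x_type1 = gk - dG * gamma1;
end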

Generalized and Alternating Versions

Generalized versions of Anderson acceleration extend the classical type-I and type-II formulations by incorporating damping mechanisms or weighted norms into the minimization step, particularly to improve robustness against noisy data prevalent in practical optimization settings. These modifications, emerging in the 2010s, introduce a damping parameter that blends historical iterates with fresh updates, often augmented by regularization terms such as l2 penalties to counteract instability from perturbations. For example, damped Anderson mixing stabilizes the linear combination of past fixed-point evaluations by enforcing a contraction factor, which has proven effective in accelerating convergence while maintaining reliability in noisy environments like reinforcement learning tasks. Alternating Anderson acceleration hybridizes fixed-point iterations with acceleration phases, interleaving unaccelerated steps with Anderson updates to foster more efficient hybrid optimization frameworks. A recently proposed generalized alternating scheme operates on a periodic cycle of t standard fixed-point iterations followed by s Anderson acceleration steps with window size m, allowing a tunable balance between computational simplicity and speedup (a skeleton of this cycle is sketched at the end of this subsection). By integrating such alternation, the method implicitly incorporates gradient corrections through accelerated phases, enhancing performance on solvers such as gradient descent or ADMM for both smooth and nonsmooth problems. Numerical experiments demonstrate that optimal parameter choices (m, s, t) yield fewer iterations and reduced CPU time compared to vanilla windowed or single-step alternating variants, with convergence guarantees for contractive mappings and extensions to noncontractive cases via diagonalizable operators. Around 2020, Anderson acceleration was adapted to proximal operators to address non-smooth fixed-point problems arising in proximal gradient methods, enabling acceleration of iterations that handle composite objectives with nondifferentiable terms. These extensions reformulate the proximal gradient step as a fixed-point map, applying Anderson mixing to the sequence of proximal evaluations while preserving the method's ability to manage constraints and nonsmooth regularizers like l1 norms. The resulting algorithms maintain global convergence properties under nonexpansiveness assumptions, with practical safeguards to avoid divergence in high-dimensional settings. Recent variants in 2024 include derivative-free Anderson acceleration tailored for constrained monotone nonlinear equations, merging it with projection methods to eliminate reliance on gradients. The accelerated derivative-free projection method (AA-DFPM) interleaves Anderson updates with nonstationary relaxation parameters and minor per-iteration adjustments, achieving linear convergence rates under assumptions of monotonicity and Lipschitz continuity of the operators. This approach excels in large-scale problems where derivative computation is infeasible, outperforming unaccelerated counterparts in iteration efficiency on benchmark constrained systems. Another notable extension applies Anderson acceleration to geometry optimization, accelerating local-global solvers while enforcing energy decrease to guarantee monotonic progress and global convergence in nonconvex landscapes common to computer graphics simulations.
By treating projection and optimization substeps as fixed-point iterations, this variant reduces iteration counts substantially with minimal overhead, ensuring stable shape deformations and physics-based adjustments.
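A skeleton of the alternating cycle described above is sketched below, with t plain fixed-point steps followed by s Anderson-accelerated steps per cycle over a window of size m; the function name, parameter defaults, and loop structure are illustrative assumptions rather than the published algorithm.
matlab
% Skeleton of an alternating Anderson scheme (illustrative sketch).
% g: fixed-point map; x0: initial guess; t plain steps and s accelerated steps
% per cycle; m: window size; maxcycle, tol: stopping controls.
function x = alternating_aa(g, x0, t, s, m, maxcycle, tol)
    x = x0(:);
    F = []; G = [];                         % histories shared across phases
    for cycle = 1:maxcycle
        for i = 1:t                         % plain fixed-point phase
            y = g(x); f = y - x;
            if norm(f) < tol, x = y; return; end
            F = [F, f]; G = [G, y];
            if size(F, 2) > m, F = F(:, 2:end); G = G(:, 2:end); end
            x = y;
        end
        for i = 1:s                         % Anderson acceleration phase
            y = g(x); f = y - x;
            if norm(f) < tol, x = y; return; end
            dF = F - repmat(f, 1, size(F, 2));
            dG = G - repmat(y, 1, size(G, 2));
            gamma = dF \ (-f);              % min || dF*gamma + f ||_2
            F = [F, f]; G = [G, y];
            if size(F, 2) > m, F = F(:, 2:end); G = G(:, 2:end); end
            x = y + dG * gamma;             % accelerated update
        end
    end
end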

Applications

In Nonlinear Optimization

Anderson acceleration finds extensive application in nonlinear optimization, particularly for solving systems of nonlinear equations F(x) = 0 through fixed-point reformulations x = G(x), where G is derived from the original problem. This approach is especially useful in derivative-free projection methods for constrained monotone nonlinear equations, where Anderson acceleration enhances convergence by incorporating historical iterates to extrapolate solutions more efficiently. For instance, in such methods, the fixed-point iteration is accelerated to achieve linear convergence under mild assumptions, reducing the computational burden in large-scale problems without requiring gradient information. A prominent use is in accelerating proximal gradient descent for composite optimization problems of the form \min_x f(x) + g(x), where f is smooth and g is nonsmooth. Here, Anderson acceleration is applied to the fixed-point mapping x = \prox_{\gamma g}(x - \gamma \nabla f(x)), leveraging past iterates to speed up convergence while preserving the method's global guarantees through stabilization techniques. This adaptation ensures local linear convergence for smooth mappings and practical speed-ups in non-smooth and constrained settings, making it suitable for machine learning tasks involving sparsity or regularization. Recent extensions include applications to operator splitting methods for convex optimization, providing unified acceleration frameworks. In manifold optimization, Anderson acceleration extends to Riemannian settings for solving fixed-point problems on curved spaces, such as geometry-aware tasks. Riemannian Anderson mixing methods accelerate iterations by extrapolating along the manifold using historical geodesic information, avoiding expensive retractions and achieving local linear convergence for C^2-smooth functions. These methods outperform standard Riemannian gradient descent in experiments on manifold-constrained problems, offering lower per-iteration costs. The benefits of Anderson acceleration in nonlinear optimization include substantial reductions in iteration counts, often by factors of 2 to 5 times in large-scale unconstrained problems, enhancing efficiency without significant added complexity. For example, in bioinformatics machine learning applications like support vector machine training on biological datasets, it accelerates convergence and lowers training loss more rapidly than unaccelerated methods. Further, it has been applied in dynamic optimal transport problems to improve theoretical and practical performance in optimization tasks.
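As a small worked sketch of the proximal gradient use case, the snippet below wraps the forward-backward step for an \ell_1-regularized least-squares problem as a fixed-point map and hands it to the anderson_acc routine listed in the Implementations section below. The problem data, step size, and parameter values are arbitrary illustrative choices, and for nonsmooth maps of this kind practical implementations usually add safeguarding or damping.
matlab
% Anderson-accelerated proximal gradient (ISTA) for
%   min_x 0.5*||A*x - b||_2^2 + lambda*||x||_1
% Illustrative sketch: data and parameters are arbitrary, and anderson_acc
% refers to the routine given in the Implementations section below.
rng(0);
A = randn(40, 100);  b = randn(40, 1);  lambda = 0.1;
L = norm(A)^2;                    % Lipschitz constant of the smooth part's gradient
t = 1 / L;                        % step size
soft = @(z, tau) sign(z) .* max(abs(z) - tau, 0);        % soft-thresholding prox
g = @(x) soft(x - t * (A' * (A * x - b)), t * lambda);   % proximal gradient map
[x_aa, it_aa] = anderson_acc(g, zeros(100, 1), 1e-8, 500, 5, 1);
fprintf('Anderson-accelerated ISTA: %d iterations\n', it_aa);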

In Scientific Computing and Simulations

Anderson acceleration plays a pivotal role in electronic structure calculations within density functional theory (DFT), where it accelerates the self-consistent field (SCF) iterations essential for determining ground-state electron densities. The SCF process involves iteratively solving the Kohn-Sham equations through fixed-point updates of the density, which can require numerous cycles due to slow convergence or instability from small spectral gaps. By incorporating a linear combination of previous density and potential iterates, Anderson acceleration enhances the damping strategies in these fixed-point methods, enabling robust convergence even in challenging cases like Kohn-Sham DFT for materials with narrow band gaps. Numerical benchmarks demonstrate that it outperforms simple damped SCF by achieving faster convergence rates, often succeeding where unaccelerated approaches diverge. In the numerical solution of partial differential equations (PDEs) governing physical processes, Anderson acceleration improves the efficiency of Picard iterations for nonlinear problems, such as advection-diffusion equations with convective terms. For instance, in simulations of natural convection via the steady Boussinesq equations, AA applied to Picard fixed-point updates accelerates convergence at high Rayleigh numbers (e.g., \mathrm{Ra} = 10^6 to 2 \times 10^6), where standard iterations stagnate. This approach not only reduces the total iteration count but also stabilizes solutions by mitigating higher-order error terms in the residual, making it suitable for finite element discretizations of convection-dominated flows. Recent monographs highlight its growing role in solving a wide range of nonlinear PDE systems across scientific computing. Applications in astrophysics and plasma physics leverage Anderson acceleration for fixed-point solves in multi-temperature models that couple energy transport across species. A key example is the three-temperature energy equations, which describe nonequilibrium dynamics of electron, ion, and radiation temperatures in high-energy plasmas, discretized via finite volume methods. Applying AA to the Picard iteration halves the required iterations compared to the unaccelerated variant and yields superior efficiency over Jacobian-free Newton-Krylov solvers in both 2D (e.g., CH/SO₂ material benchmarks) and 3D configurations, while enforcing physical constraints like non-negative temperatures through iterate adjustments. This has proven effective in radiation-hydrodynamics simulations relevant to inertial confinement fusion and astrophysical phenomena. In physics-based simulations, Anderson acceleration facilitates geometry optimization tasks by speeding up local-global solvers for nonlinear, non-convex problems. These solvers alternate between local solves (e.g., optimizing individual elements) and global projections, forming a fixed-point iteration amenable to AA. The method ensures monotonic energy decrease and global convergence via a carefully chosen mixing parameter, reducing iteration counts substantially across diverse applications like elastic rod simulations and cloth dynamics, with only a modest per-iteration overhead.

Implementations

Pseudocode

The standard Anderson acceleration procedure, in its Type-II formulation, accelerates the fixed-point iteration x_{k+1} = g(x_k) by extrapolating using a limited history of previous iterates and residuals, where the residual is defined as f_k = g(x_k) - x_k. The algorithm initializes with an initial guess x_0, a maximum depth parameter m \geq 0 (controlling the number of previous steps used for extrapolation), and an optional relaxation parameter \beta \in [0,1] (with \beta = 1 indicating no relaxation). At each iteration, it computes the current residual f_k, forms matrices of differences from the history relative to the current step, solves a least-squares problem to find an optimal combination \gamma, and updates the next iterate accordingly. The following pseudocode outlines the core steps of the Type-II Anderson acceleration algorithm:
Algorithm: Type-II Anderson Acceleration

Input: Initial x_0, maximum depth m ≥ 0, relaxation β ∈ [0,1], tolerance tol > 0, maximum iterations K_max
Output: Approximate fixed point x

// Initialization
k ← 0
x_k ← x_0
g_old ← []  (empty matrix for g values)
f_old ← []  (empty matrix for residuals)
m_k ← 0     (current depth)

// Compute initial
g_k ← g(x_k)
f_k ← g_k - x_k

while norm(f_k) > tol and k < K_max do
    m_k ← min(m, k)
    
    if m_k > 0 then
        // Form difference matrices relative to current (previous m_k steps)
        ΔF_k ← zeros(d, m_k)  // d x m_k
        ΔG_k ← zeros(d, m_k)
        for i = 1 to m_k do
            // f_old(:, i), g_old(:, i) hold the i-th oldest of the last m_k stored residuals and g values
            ΔF_k(:, i) ← f_old(:, i) - f_k
            ΔG_k(:, i) ← g_old(:, i) - g_k
        end for
        // Solve unconstrained LS: min_γ || ΔF_k γ + f_k ||_2^2  (equivalent to constrained original)
        γ ← argmin_γ || ΔF_k γ + f_k ||_2^2   (e.g., via QR factorization: γ = ΔF_k \ (-f_k))
        
        x_{k+1} ← g_k + ΔG_k γ
    else
        x_{k+1} ← g_k
    end if
    
    // Apply relaxation if β < 1
    if β < 1 then
        x_{k+1} ← (1 - β) g_k + β x_{k+1}
    end if
    
    // Update history: shift in new f_k, g_k (truncate to m)
    if k ≥ m then
        f_old ← f_old(:, 2:end)
        g_old ← g_old(:, 2:end)
    end if
    f_old ← [f_old, f_k]
    g_old ← [g_old, g_k]
    
    // Advance
    x_k ← x_{k+1}
    g_k ← g(x_k)
    f_k ← g_k - x_k
    k ← k + 1
end while

return x_k
This pseudocode uses the Type-II update x_{k+1} = g_k + \Delta G_k \gamma, where \gamma minimizes \| \Delta F_k \gamma + f_k \|_2 over the history of depth m_k. The history depth m_k grows as m_k = \min(m, k) until it stabilizes at m, at which point older columns are discarded to maintain a fixed-size window and prevent unbounded memory growth. If m = 0, the algorithm reduces to the unaccelerated fixed-point iteration x_{k+1} = g(x_k), as no extrapolation is performed. The per-iteration computational complexity is dominated by forming the difference matrices (O(m d)) and solving the least-squares problem via QR factorization or direct methods (O(m^2 d)), where d is the dimension of x and m is typically small (e.g., 5–10) relative to d; the fixed-point evaluation g(x_k) adds problem-specific cost. In practice, monitor the condition number of \Delta F_k and drop columns with small singular values to avoid instability.

Example Code in MATLAB

A MATLAB implementation of Anderson acceleration in its unconstrained Type-II form is provided below for solving fixed-point problems x = g(x), where g is a function handle computing the fixed-point map. The function anderson_acc takes the initial guess x0, tolerance tol, maximum iterations maxit, acceleration depth m, and damping factor beta (with beta = 1 corresponding to no damping). Residuals are computed as f_k = g(x_k) - x_k, and the history of the most recent m residuals and mapped values g(x_j) is retained. At each step, the implementation forms difference matrices relative to the current step, with columns f_j - f_k and g(x_j) - g(x_k), solves the unconstrained least-squares problem \gamma = \arg\min_\gamma \| \Delta F_k \gamma + f_k \|_2 using the backslash operator (\), which is efficient for dense, low-dimensional problems (for sparse or high-dimensional cases, an iterative solver such as lsqr can be substituted), and forms the accelerated iterate \tilde{x}_{k+1} = g(x_k) + \Delta G_k \gamma. This reparametrization is equivalent to the constrained formulation with affine weights summing to one, but avoids handling the constraint explicitly. Damping is applied post-acceleration as x_{k+1} = (1 - \beta) y_k + \beta \tilde{x}_{k+1}, where y_k = g(x_k) is the unaccelerated iterate, to improve stability in oscillatory cases.
matlab
function [x, iter, res_hist] = anderson_acc(g, x0, tol, maxit, m, beta)
% ANDERSON_ACC Unconstrained Type-II Anderson acceleration for fixed-point x = g(x).
%   [x, iter, res_hist] = anderson_acc(g, x0, tol, maxit, m, beta)
%   g: function handle for fixed-point map.
%   x0: initial guess (column vector).
%   tol: convergence tolerance on ||g(x) - x||.
%   maxit: maximum iterations.
%   m: acceleration depth / window size (default 5).
%   beta: damping factor (default 1, full acceleration).
%
% Outputs:
%   x: converged solution.
%   iter: number of iterations.
%   res_hist: vector of residual norms.

if nargin < 5 || isempty(m), m = 5; end
if nargin < 6 || isempty(beta), beta = 1; end

x = x0(:);   % Ensure column vector
res_hist = [];
F_hist = []; % History of residuals f_j = g(x_j) - x_j (oldest first)
G_hist = []; % History of mapped values g(x_j) (oldest first)
iter = 0;

while iter < maxit
    y = g(x);   % Unaccelerated fixed-point step
    f = y - x;  % Residual
    res = norm(f);
    res_hist = [res_hist, res];

    if res < tol
        break;
    end

    num_prev = size(F_hist, 2);  % Number of stored previous steps (<= m)
    if num_prev > 0
        % Differences of stored residuals / mapped values relative to current step
        dF = F_hist - repmat(f, 1, num_prev);
        dG = G_hist - repmat(y, 1, num_prev);
        gamma = dF \ (-f);          % Solve min ||dF * gamma + f||_2
        x_tilde = y + dG * gamma;   % Accelerated update (affine combination of g values)
    else
        x_tilde = y;                % No history yet: plain fixed-point step
    end

    % Apply damping
    x_new = (1 - beta) * y + beta * x_tilde;

    % Update history, keeping the most recent m entries
    F_hist = [F_hist, f];
    G_hist = [G_hist, y];
    if size(F_hist, 2) > m
        F_hist = F_hist(:, 2:end);
        G_hist = G_hist(:, 2:end);
    end

    x = x_new;
    iter = iter + 1;
end

if iter == maxit
    warning('anderson_acc: Max iterations reached without convergence.');
end
end
To demonstrate usage, consider the scalar fixed-point equation x = \cos(x), a classic example where plain fixed-point iteration converges slowly because the contraction factor |\sin(x^*)| \approx 0.67 is not far from 1. Define g as an anonymous function and call anderson_acc with x0 = 0, tol = 1e-10, maxit = 100, m = 5, beta = 1. For this scalar problem the least-squares subproblem is underdetermined whenever more than one previous step is stored, so the method effectively reduces to a secant-type update; it still typically reaches the tolerance in roughly ten iterations, compared with nearly 60 for the unaccelerated iteration, returning a solution near 0.739085. The residual history can be plotted via semilogy(res_hist) to visualize the rapid decrease.
matlab
g = @(x) cos(x);
[x_sol, iter, res_hist] = anderson_acc(g, 0, 1e-10, 100, 5, 1);
fprintf('Solution: %f, Iterations: %d, Final residual: %e\n', x_sol, iter, res_hist(end));
% Representative output: Solution: 0.739085, Iterations: ~10, Final residual: below 1e-10
This implementation mirrors the pseudocode above by maintaining residual and g-value histories and performing the least-squares update at each step once history is available, enabling direct testing and extension (e.g., substituting lsqr(dF, -f) for sparse problems). For low-dimensional nonlinear systems, assembling the difference matrices and using backslash is efficient; for larger m or dimensions, monitor the conditioning of the difference matrix to avoid ill-posed least-squares problems, for example by dropping nearly dependent columns.
