Quasi-Newton method
The quasi-Newton methods are a class of iterative optimization algorithms for unconstrained nonlinear problems. Rather than computing or inverting the exact Hessian at every step, as classical Newton's method requires, they approximate the second-order information (the Hessian matrix) of the objective function using only first-order gradient evaluations.[1][2] At each iteration, these methods update an approximate Hessian (or its inverse) via low-rank modifications that satisfy the secant condition, which ensures the approximation correctly captures the change in gradients between successive points.[1][2]
Originating in the late 1950s with work by W. C. Davidon at Argonne National Laboratory (not formally published until 1991), these methods evolved as efficient alternatives to full Newton's method for large-scale problems where second derivatives are costly or unavailable.[1][3] Key developments include the Davidon-Fletcher-Powell (DFP) update, devised by Davidon in 1959 and formalized by Fletcher and Powell in 1963, and the more robust Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, independently proposed in 1970 by Charles George Broyden, Roger Fletcher, Donald Goldfarb, and David Shanno.[2][3] Other notable variants include the symmetric rank-one (SR1) update and the broader Broyden class of methods, which generalizes these updates through a parameter that interpolates between BFGS and DFP behavior.[1]
Quasi-Newton methods offer significant advantages over first-order techniques like gradient descent, achieving superlinear convergence—meaning the error decreases faster than linearly near the optimum—while maintaining a per-iteration cost of only O(n²) for n-dimensional problems, compared to O(n³) for exact Newton steps.[1][3] Under suitable conditions, such as a twice continuously differentiable objective function with Lipschitz continuous Hessian and strict convexity ensuring positive-definiteness of updates (e.g., yᵀs > 0 where y and s are gradient and step differences), BFGS guarantees global convergence with line search and local superlinear rates.[1][3] For high-dimensional or memory-constrained applications, limited-memory variants like L-BFGS further reduce storage to O(n m) using a fixed number m of past updates, making them widely applicable in machine learning and scientific computing.[1]
Background and Fundamentals
Newton's Method Overview
Newton's method, also known as the Newton-Raphson method, is an iterative algorithm for finding roots of equations, originally developed by Isaac Newton in the mid-1660s and independently formulated by Joseph Raphson in 1690, though it gained widespread use in numerical analysis only in the 20th century.[4] For a univariate function f(x), the method generates a sequence of approximations via the iteration
x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)},
where f'(x_k) is the derivative evaluated at the current iterate x_k, provided f'(x_k) \neq 0.[5] This update approximates the root by intersecting the tangent line to the curve y = f(x) at x_k with the x-axis. Under suitable conditions, such as f being twice continuously differentiable near the root x^* with f'(x^*) \neq 0, the method exhibits local quadratic convergence, meaning the error e_{k+1} \approx C e_k^2 for some constant C > 0 and sufficiently small e_k = |x_k - x^*|.[5]
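As a concrete sketch, the univariate iteration takes only a few lines of Python; the target function f(x) = x² − 2 and the starting point are illustrative choices, not from the source:

```python
import math

def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Univariate Newton iteration: x_{k+1} = x_k - f(x_k)/f'(x_k)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x -= fx / fprime(x)  # tangent-line step; assumes f'(x) != 0
    return x

# Find the positive root of f(x) = x^2 - 2, i.e. sqrt(2).
root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.5)
print(root)  # converges to 1.41421356... in a handful of iterations
```

The quadratic convergence is visible in practice: the number of correct digits roughly doubles at each step.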
In multivariate optimization, Newton's method seeks to minimize a twice continuously differentiable function f: \mathbb{R}^n \to \mathbb{R} by solving \nabla f(\mathbf{x}^*) = \mathbf{0}. The iteration becomes
\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{H}_k^{-1} \nabla f(\mathbf{x}_k),
where \mathbf{H}_k = \nabla^2 f(\mathbf{x}_k) is the Hessian matrix at \mathbf{x}_k, assumed invertible.[5] The step minimizes, over \mathbf{p}, the local quadratic model m_k(\mathbf{p}) = f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T \mathbf{p} + \frac{1}{2} \mathbf{p}^T \mathbf{H}_k \mathbf{p} of f, provided \mathbf{H}_k is positive definite. Local quadratic convergence holds if f is twice continuously differentiable near a stationary point \mathbf{x}^* where \nabla f(\mathbf{x}^*) = \mathbf{0} and the Hessian \mathbf{H}(\mathbf{x}^*) is positive definite, ensuring the point is a strict local minimum.[5]
The method extends naturally to solving systems of nonlinear equations \mathbf{F}(\mathbf{x}) = \mathbf{0} in \mathbb{R}^n, where \mathbf{F}: \mathbb{R}^n \to \mathbb{R}^n is continuously differentiable. The update is
\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{J}_k^{-1} \mathbf{F}(\mathbf{x}_k),
with \mathbf{J}_k = \nabla \mathbf{F}(\mathbf{x}_k) the Jacobian matrix, provided it is nonsingular.[5] Quadratic convergence occurs locally if \mathbf{F} is continuously differentiable near a solution \mathbf{x}^* with \mathbf{F}(\mathbf{x}^*) = \mathbf{0} and \mathbf{J}(\mathbf{x}^*) invertible.[5] This formulation generalizes the univariate iteration to n dimensions; replacing \mathbf{J}_k with a cheaper approximation, in the spirit of the derivative-free secant method, leads directly to the quasi-Newton methods discussed below.[5]
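A minimal NumPy sketch of the multivariate iteration, applied to an illustrative two-equation system (the system and starting point are demonstration choices, not from the source):

```python
import numpy as np

def F(x):
    """Example system: intersection of the unit circle with the line x0 = x1."""
    return np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])

def J(x):
    """Analytic Jacobian of F."""
    return np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])

x = np.array([1.0, 0.5])
for _ in range(20):
    fx = F(x)
    if np.linalg.norm(fx) < 1e-12:
        break
    x = x - np.linalg.solve(J(x), fx)  # Newton step: solve J s = -F, then x <- x + s

print(x)  # converges to (1/sqrt(2), 1/sqrt(2))
```

Note that the linear system is solved rather than the Jacobian explicitly inverted; this is the O(n³) step that quasi-Newton methods seek to avoid.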
Motivation and Advantages of Quasi-Newton Methods
Newton's method achieves quadratic convergence near a solution, serving as an ideal benchmark for optimization and root-finding algorithms. However, its practical application is hindered by the high computational cost of evaluating exact Jacobians or Hessians, particularly in high-dimensional problems or when functions lack analytical second derivatives, necessitating finite-difference approximations that require O(n^2) additional function or gradient evaluations per iteration.[6] Furthermore, storing these full n × n matrices demands O(n^2) space, and factorizing or inverting them costs O(n^3) time, which becomes prohibitive for large-scale problems where n exceeds thousands.[6]
Quasi-Newton methods address these limitations by constructing low-rank updates that approximate the Jacobian or Hessian using only function and gradient evaluations, thereby avoiding direct computation of second derivatives.[7] Per-iteration expense drops accordingly: with low-rank update formulas, the approximation and the associated linear solve can be maintained in O(n^2) time, versus the O(n^3) factorization required for an exact Newton step.[6] These methods retain superlinear convergence, faster than linear but slower than Newton's quadratic rate, providing a favorable trade-off between efficiency and accuracy.[6]
The development of quasi-Newton methods emerged in the 1950s and 1960s, driven by the need for efficient optimization techniques in operations research and scientific computing, with early contributions including Davidon's variable-metric approach in 1959 and the Davidon-Fletcher-Powell update formalized in 1963.[8] Broyden's extensions in the mid-1960s further generalized these ideas to nonlinear systems, establishing a family of methods that balance computational feasibility with rapid convergence.[7] Overall, quasi-Newton approaches enable scalable solutions to problems where full Hessian information is impractical, maintaining much of Newton's desirable properties while mitigating its drawbacks.[6]
Core Update Mechanisms
The Secant Condition
The secant condition forms the cornerstone of quasi-Newton methods, providing a way to update an approximate derivative matrix without computing exact second-order information. In the context of root-finding for nonlinear systems \mathbf{F}(\mathbf{x}) = \mathbf{0}, where \mathbf{F}: \mathbb{R}^n \to \mathbb{R}^n is a nonlinear function, the condition specifies that the updated approximation \mathbf{B}_{k+1} to the Jacobian \mathbf{J}(\mathbf{x}_k) satisfies \mathbf{B}_{k+1} (\mathbf{x}_{k+1} - \mathbf{x}_k) = \mathbf{F}(\mathbf{x}_{k+1}) - \mathbf{F}(\mathbf{x}_k).[9] This equation ensures that the linear model defined by \mathbf{B}_{k+1} matches the actual change in \mathbf{F} along the step from \mathbf{x}_k to \mathbf{x}_{k+1}, mimicking the behavior of the true Jacobian on that direction.[10]
For unconstrained optimization problems of minimizing a twice-differentiable function f: \mathbb{R}^n \to \mathbb{R}, the secant condition adapts to the Hessian approximation: \mathbf{B}_{k+1} (\mathbf{x}_{k+1} - \mathbf{x}_k) = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k).[11] Here, \mathbf{B}_{k+1} approximates the Hessian \nabla^2 f(\mathbf{x}_k), and the condition captures the change in the gradient over the iteration step, denoted compactly as \mathbf{B}_{k+1} \mathbf{s}_k = \mathbf{y}_k where \mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k and \mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k).[8]
This condition derives from the first-order Taylor expansion of the underlying function. For root-finding, the Taylor series around \mathbf{x}_k gives \mathbf{F}(\mathbf{x}_{k+1}) \approx \mathbf{F}(\mathbf{x}_k) + \mathbf{J}(\mathbf{x}_k) (\mathbf{x}_{k+1} - \mathbf{x}_k); imposing the secant condition on \mathbf{B}_{k+1} ensures that the approximation \mathbf{B}_{k+1} agrees with this expansion exactly along the secant direction \mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k.[12] Similarly, in optimization, the expansion \nabla f(\mathbf{x}_{k+1}) \approx \nabla f(\mathbf{x}_k) + \nabla^2 f(\mathbf{x}_k) (\mathbf{x}_{k+1} - \mathbf{x}_k) motivates the condition, guaranteeing that \mathbf{B}_{k+1} reproduces the first-order change in the gradient.[2]
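For a quadratic objective the gradient change is exactly linear in the step, so the secant condition forces any update to match the true Hessian's action along \mathbf{s}_k. A quick NumPy check, with an illustrative quadratic and step:

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x with constant Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x

x_k = np.array([1.0, -0.5])
s_k = np.array([0.2, 0.3])            # step to the next iterate
x_next = x_k + s_k
y_k = grad(x_next) - grad(x_k)        # gradient change over the step

# For a quadratic, y_k = A s_k exactly: any B satisfying the secant
# condition B s_k = y_k reproduces the true Hessian's action on s_k.
print(np.allclose(y_k, A @ s_k))  # True
```

For non-quadratic functions the relation holds only to first order, which is exactly what the Taylor expansion above expresses.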
However, the secant condition does not uniquely determine \mathbf{B}_{k+1}, as infinitely many matrices in \mathbb{R}^{n \times n} can satisfy the single equation \mathbf{B}_{k+1} \mathbf{s}_k = \mathbf{y}_k for given \mathbf{s}_k and \mathbf{y}_k.[8] Solutions typically involve low-rank modifications to the previous approximation \mathbf{B}_k, such as rank-one updates, to minimally alter \mathbf{B}_k while enforcing the condition.[12]
Key properties of updates satisfying the secant condition include the potential for nonsingularity preservation: if \mathbf{B}_k is nonsingular and \mathbf{s}_k \neq \mathbf{0}, then under mild conditions (e.g., \mathbf{y}_k linearly independent from prior steps), \mathbf{B}_{k+1} remains nonsingular.[9] In the optimization setting, if \mathbf{B}_k is symmetric and positive definite, certain updates can preserve these properties, ensuring descent directions.[11]
Broyden's Family of Updates
In 1965, C. G. Broyden introduced quasi-Newton updates for approximating the Jacobian matrix in methods for solving systems of nonlinear equations, providing a framework that satisfies the secant condition while allowing flexibility.[9] This work was further explored in his 1967 paper on applications to function minimization, extending the ideas to optimization contexts.
For root-finding, a key update from Broyden's 1965 work is the rank-one modification:
\mathbf{B}_{k+1} = \mathbf{B}_k + \frac{ (\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k) \mathbf{s}_k^T }{ \mathbf{s}_k^T \mathbf{s}_k },
where \mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k and \mathbf{y}_k = \mathbf{F}(\mathbf{x}_{k+1}) - \mathbf{F}(\mathbf{x}_k). This formula, known as Broyden's "good" (or first) update, enforces the secant condition through an outer-product correction, achieving O(n²) computational cost per iteration. A companion "bad" (or second) update instead modifies the inverse approximation \mathbf{H}_k = \mathbf{B}_k^{-1} directly:
\mathbf{H}_{k+1} = \mathbf{H}_k + \frac{ (\mathbf{s}_k - \mathbf{H}_k \mathbf{y}_k) \mathbf{y}_k^T }{ \mathbf{y}_k^T \mathbf{y}_k }.
In the optimization context, Broyden's later contributions led to the Broyden class of updates, which generalizes specific methods like BFGS and DFP through a scalar parameter \phi_k; for \phi_k \in [0,1] the update is the convex combination \mathbf{B}_{k+1} = (1 - \phi_k) \mathbf{B}_{k+1}^{\mathrm{BFGS}} + \phi_k \mathbf{B}_{k+1}^{\mathrm{DFP}}.[1] These updates satisfy the secant condition by construction and are detailed in subsequent sections on optimization applications. The choice of update influences numerical stability and convergence, with rank-one forms promoting simplicity and rank-two forms incorporating more curvature information.
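Because both endpoint updates satisfy the secant condition, so does any combination of them. A small NumPy sketch verifies this; the matrix, step, and parameter below are arbitrary illustrative data:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation B."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def dfp_update(B, s, y):
    """DFP update of the Hessian approximation B."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V @ B @ V.T + rho * np.outer(y, y)

B = np.array([[2.0, 0.3], [0.3, 1.5]])   # current symmetric positive definite approximation
s = np.array([1.0, 0.5])
y = np.array([3.5, 2.0])                  # curvature condition: y^T s = 4.5 > 0

phi = 0.3  # Broyden-class parameter
B_new = (1 - phi) * bfgs_update(B, s, y) + phi * dfp_update(B, s, y)

print(np.allclose(B_new @ s, y))              # secant condition holds
print(np.all(np.linalg.eigvalsh(B_new) > 0))  # positive definiteness preserved
```

Both prints yield True: the secant condition is linear in the update, and a convex combination of positive definite matrices is positive definite.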
Applications in Optimization
BFGS and DFP Methods
The Davidon–Fletcher–Powell (DFP) method represents one of the earliest quasi-Newton approaches for unconstrained optimization, originally proposed by Davidon in an unpublished 1959 technical report and later formalized and analyzed by Fletcher and Powell in 1963.[13] This method maintains an approximation \mathbf{H}_k to the inverse Hessian (\nabla^2 f(\mathbf{x}_k))^{-1}, updated with curvature information from successive iterates, specifically the displacement \mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k and the gradient change \mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k). The DFP update formula is given by
\mathbf{H}_{k+1} = \mathbf{H}_k - \frac{\mathbf{H}_k \mathbf{y}_k \mathbf{y}_k^T \mathbf{H}_k}{\mathbf{y}_k^T \mathbf{H}_k \mathbf{y}_k} + \frac{\mathbf{s}_k \mathbf{s}_k^T}{\mathbf{y}_k^T \mathbf{s}_k},
where the update satisfies the secant condition in its inverse form, \mathbf{H}_{k+1} \mathbf{y}_k = \mathbf{s}_k. If the initial approximation \mathbf{H}_0 is positive definite and the curvature condition \mathbf{y}_k^T \mathbf{s}_k > 0 holds (ensured by a suitable line search), the DFP update preserves positive definiteness of \mathbf{H}_{k+1}.
In practice, the DFP algorithm proceeds iteratively: at each step k, compute the search direction \mathbf{p}_k = -\mathbf{H}_k \nabla f(\mathbf{x}_k), perform a line search to select a step length \alpha_k > 0 satisfying the Wolfe conditions (which imply \mathbf{y}_k^T \mathbf{s}_k > 0), update the iterate \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k, and apply the DFP update to obtain \mathbf{H}_{k+1}. This framework ensures descent in the objective function f while approximating curvature with low-rank modifications, avoiding costly exact Hessian computations.
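A minimal sketch of this loop on a small quadratic test problem, maintaining an inverse-Hessian approximation H (update H ← H − Hyyᵀ H/(yᵀHy) + ssᵀ/(yᵀs)) and using the exact line search available in closed form for quadratics; the matrix A, vector b, and starting point are illustrative choices:

```python
import numpy as np

# Quadratic test problem f(x) = 0.5 x^T A x - b^T x, minimizer x* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b

x = np.array([0.0, 0.0])
H = np.eye(2)  # initial inverse-Hessian approximation

for _ in range(10):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    p = -H @ g                          # quasi-Newton search direction
    alpha = -(g @ p) / (p @ A @ p)      # exact line search, closed form for a quadratic
    s = alpha * p
    x_new = x + s
    y = grad(x_new) - grad(x)
    if y @ s > 1e-12:                   # curvature condition
        # DFP update of the inverse-Hessian approximation
        H = H - np.outer(H @ y, H @ y) / (y @ H @ y) + np.outer(s, s) / (y @ s)
    x = x_new

print(x)  # approaches the minimizer A^{-1} b
```

With exact line searches on an n-dimensional quadratic, DFP terminates at the exact minimizer in at most n steps, a classical conjugate-direction property.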
The Broyden–Fletcher–Goldfarb–Shanno (BFGS) method, developed independently by Broyden, Fletcher, Goldfarb, and Shanno in 1970, provides a dual update to DFP that often exhibits superior numerical stability. The BFGS update for the Hessian approximation is
\mathbf{B}_{k+1} = \mathbf{B}_k + \frac{\mathbf{y}_k \mathbf{y}_k^T}{\mathbf{y}_k^T \mathbf{s}_k} - \frac{\mathbf{B}_k \mathbf{s}_k \mathbf{s}_k^T \mathbf{B}_k}{\mathbf{s}_k^T \mathbf{B}_k \mathbf{s}_k},
also satisfying the secant condition, with positive definiteness preserved under the same assumptions (\mathbf{B}_0 positive definite and \mathbf{y}_k^T \mathbf{s}_k > 0). The algorithm follows the same line-search outline as DFP, with the search direction obtained by solving \mathbf{B}_k \mathbf{p}_k = -\nabla f(\mathbf{x}_k) or, equivalently, by updating the inverse approximation directly; the BFGS formula tends to yield better-conditioned approximations, particularly under inexact line searches.[14]
Empirical studies confirm that BFGS outperforms DFP in many unconstrained optimization problems, requiring fewer iterations and function evaluations due to its improved conditioning and robustness to scaling in the objective.[14] Both methods belong to Broyden's broader family of quasi-Newton updates but are specifically designed for optimization, emphasizing symmetry and positive definiteness for reliable descent directions.
Limited-Memory Variants
Limited-memory variants of quasi-Newton methods address the scalability issues of full-matrix updates in high-dimensional optimization problems by storing only a limited number of update pairs rather than the entire approximate Hessian matrix. The most prominent example is the limited-memory BFGS (L-BFGS) algorithm, which approximates the inverse Hessian using the m most recent pairs of curvature information, denoted as \mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k and \mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k), where m is a small integer typically between 5 and 20.
In L-BFGS, the approximate inverse Hessian-vector product \mathbf{H}_{k+1} \mathbf{g}_{k+1} (where \mathbf{g}_{k+1} = \nabla f(\mathbf{x}_{k+1})) is computed implicitly through a two-loop recursion procedure, avoiding explicit storage of the n \times n matrix. The recursion uses the m most recent pairs (\mathbf{s}_i, \mathbf{y}_i) for i = k-m+1, \dots, k, with \rho_i = 1 / (\mathbf{y}_i^T \mathbf{s}_i). It begins with the first loop (from most recent i = k to oldest i = k-m+1): initialize \mathbf{q} = \mathbf{g}_{k+1}; for each i, compute \alpha_i = \rho_i \mathbf{s}_i^T \mathbf{q} and update \mathbf{q} \leftarrow \mathbf{q} - \alpha_i \mathbf{y}_i. Then, apply an initial scaling \mathbf{H}_0 = \gamma_{k+1} \mathbf{I}, where \gamma_{k+1} = \frac{\mathbf{s}_k^T \mathbf{y}_k}{\mathbf{y}_k^T \mathbf{y}_k}, to obtain \mathbf{r} = \mathbf{H}_0 \mathbf{q}. This is followed by the second loop (from oldest i = k-m+1 to most recent i = k): for each i, compute \beta_i = \rho_i \mathbf{y}_i^T \mathbf{r} and update \mathbf{r} \leftarrow \mathbf{r} + (\alpha_i - \beta_i) \mathbf{s}_i. The result \mathbf{r} approximates \mathbf{H}_{k+1} \mathbf{g}_{k+1}.[15] This approach ensures that the storage requirement is O(m n) instead of O(n^2) for the full BFGS method, making it practical for problems where the dimension n > 1000.
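The recursion translates directly into code. The sketch below uses illustrative data (steps on a quadratic, so the curvature pairs are well defined) and checks the result against the explicitly formed BFGS inverse Hessian, to which the two-loop procedure is algebraically equivalent:

```python
import numpy as np

def lbfgs_two_loop(g, S, Y, gamma):
    """Compute H g via the L-BFGS two-loop recursion.

    S, Y: lists of s_i, y_i pairs ordered oldest -> newest.
    gamma: scaling of the initial matrix H_0 = gamma * I.
    """
    rhos = [1.0 / (y @ s) for s, y in zip(S, Y)]
    q = g.copy()
    alphas = []
    for s, y, rho in zip(reversed(S), reversed(Y), reversed(rhos)):  # newest -> oldest
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    r = gamma * q                                                    # apply H_0
    for (s, y, rho), a in zip(zip(S, Y, rhos), reversed(alphas)):    # oldest -> newest
        beta = rho * (y @ r)
        r = r + (a - beta) * s
    return r

# Illustrative data: steps on a quadratic with SPD Hessian A, so y_i = A s_i.
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 0.5], [0.0, 0.5, 2.0]])
S = [np.array([1.0, 0.0, 0.5]), np.array([0.2, 1.0, -0.3]), np.array([-0.4, 0.3, 1.0])]
Y = [A @ s for s in S]
gamma = (S[-1] @ Y[-1]) / (Y[-1] @ Y[-1])
g = np.array([1.0, -2.0, 0.5])

# Reference: explicitly apply the recursive inverse-BFGS formula from H_0 = gamma I.
H = gamma * np.eye(3)
for s, y in zip(S, Y):
    rho = 1.0 / (y @ s)
    V = np.eye(3) - rho * np.outer(y, s)
    H = V.T @ H @ V + rho * np.outer(s, s)

print(np.allclose(lbfgs_two_loop(g, S, Y, gamma), H @ g))  # True
```

The two-loop form never materializes the n × n matrix H, which is the entire point: only the 2m stored vectors and O(mn) work are needed per iteration.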
L-BFGS has found widespread application in machine learning tasks, such as training logistic regression models on large datasets, where its efficiency in handling high-dimensional parameter spaces outperforms first-order methods like gradient descent in terms of convergence speed per iteration. In these settings, the choice of m (typically 5 to 20) balances approximation quality and memory usage, often leading to robust performance on datasets with millions of features.
Variants of L-BFGS leverage a compact (outer-product) representation of the approximate Hessian or its inverse, written as a scaled identity \gamma_k \mathbf{I} plus a low-rank correction built from the matrices \mathbf{S}_k and \mathbf{Y}_k of stored vectors and a small matrix \mathbf{M}_k encoding their inner products; this structure enables efficient use as a preconditioner in conjugate gradient methods for solving the search direction subproblems.[16]
Applications in Root Finding
Broyden's Method for Systems
Broyden's method is a quasi-Newton technique specifically designed for solving systems of nonlinear equations \mathbf{F}(\mathbf{x}) = \mathbf{0}, where \mathbf{F}: \mathbb{R}^n \to \mathbb{R}^n is a continuously differentiable function with a nonsingular Jacobian at the solution \mathbf{x}^*. The method approximates the Jacobian matrix \mathbf{J}(\mathbf{x}) iteratively without requiring explicit derivative computations after the initial step, updating the approximation using function values to satisfy the secant condition \mathbf{y}_k = \mathbf{B}_{k+1} \mathbf{s}_k, where \mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k and \mathbf{y}_k = \mathbf{F}(\mathbf{x}_{k+1}) - \mathbf{F}(\mathbf{x}_k). This approach extends the scalar secant method to higher dimensions and is particularly useful when Jacobian evaluations are costly or unavailable.
The algorithm begins with an initial guess \mathbf{x}_0 and an initial Jacobian approximation \mathbf{B}_0, often chosen as the identity matrix I for simplicity when no prior information is available. At each iteration k, the step \mathbf{s}_k is computed by solving the linear system \mathbf{B}_k \mathbf{s}_k = -\mathbf{F}(\mathbf{x}_k), typically via LU factorization or iterative solvers, and the next iterate is \mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{s}_k. The approximation is then updated using Broyden's rank-one formula:
\mathbf{B}_{k+1} = \mathbf{B}_k + \frac{ (\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k) \mathbf{s}_k^T }{ \|\mathbf{s}_k\|^2 },
which ensures the secant condition is met while introducing a minimal perturbation to \mathbf{B}_k. This update requires only matrix-vector products and avoids full Jacobian recomputation, reducing computational cost to O(n^2) per iteration after the first.
Within Broyden's framework, two primary update variants exist: the "good" Broyden update and the "bad" Broyden update. The good update minimizes the change in the Frobenius norm \|\mathbf{B}_{k+1} - \mathbf{B}_k\|_F subject to the secant condition, yielding a stable approximation with generally better numerical behavior. The bad update applies the analogous minimal correction to the inverse approximation instead; it is cheaper to apply, since no linear solve is needed, but can be less stable, particularly when the iterates are far from the solution, as errors in the approximation may be amplified. The good variant is generally preferred in practice for its robustness, though both satisfy the secant condition via rank-one updates.[17]
Under suitable conditions, Broyden's method exhibits local superlinear convergence. Specifically, if the initial approximation \mathbf{B}_0 is sufficiently close to \mathbf{J}(\mathbf{x}^*) and the Dennis-Moré condition holds, namely \lim_{k \to \infty} \frac{\|(\mathbf{B}_k - \mathbf{J}(\mathbf{x}^*)) \mathbf{s}_k \|}{\|\mathbf{s}_k\|} = 0, then the error satisfies \|\mathbf{x}_{k+1} - \mathbf{x}^*\| = o(\|\mathbf{x}_k - \mathbf{x}^*\|) as k \to \infty. In one dimension the method reduces to the secant method, whose convergence order is (1+\sqrt{5})/2 \approx 1.618. Global convergence requires additional safeguards like line searches or trust regions, but locally, the method is highly efficient near the root.
To illustrate, consider the 2D nonlinear system \mathbf{F}(\mathbf{x}) = \begin{pmatrix} \sin x_1 + x_2 \\ x_1^2 + x_2^2 - 1 \end{pmatrix} = \mathbf{0}, which couples a sine curve with the unit circle. Substituting x_2 = -\sin x_1 into the second equation gives x_1^2 = \cos^2 x_1, so the positive root satisfies x_1 = \cos x_1, i.e., x_1 \approx 0.739 and x_2 = -\sin x_1 \approx -0.674. Starting from \mathbf{x}_0 = (0.5, 0.5) and \mathbf{B}_0 = \mathbf{I}, the first step solves \mathbf{I} \mathbf{s}_0 = -\mathbf{F}(\mathbf{x}_0), yielding \mathbf{x}_1 \approx (-0.479, 1.0), and subsequent rank-one updates steer the iterates toward a root of the coupled transcendental equations; a better-scaled \mathbf{B}_0, such as a finite-difference Jacobian, typically yields markedly faster and more reliable convergence than the identity.
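A compact NumPy sketch of Broyden's method on this system; as an assumption for a reliable demonstration, the starting point is taken near the positive root and \mathbf{B}_0 is initialized by finite differences rather than the identity:

```python
import numpy as np

def F(x):
    return np.array([np.sin(x[0]) + x[1], x[0]**2 + x[1]**2 - 1.0])

def fd_jacobian(F, x, h=1e-6):
    """Forward-difference Jacobian approximation (used only to initialize B)."""
    n = len(x)
    J = np.zeros((n, n))
    f0 = F(x)
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (F(x + e) - f0) / h
    return J

x = np.array([0.7, -0.6])        # assumed starting point near the positive root
B = fd_jacobian(F, x)            # initial Jacobian approximation
fx = F(x)
for _ in range(25):
    if np.linalg.norm(fx) < 1e-12:
        break
    s = np.linalg.solve(B, -fx)  # solve B s = -F(x)
    x_new = x + s
    f_new = F(x_new)
    y = f_new - fx
    # Broyden's good (rank-one) update: B <- B + (y - B s) s^T / ||s||^2
    B = B + np.outer(y - B @ s, s) / (s @ s)
    x, fx = x_new, f_new

print(x)  # approaches the root x1 = cos(x1) ~ 0.739, x2 = -sin(x1) ~ -0.674
```

After the initial finite-difference setup, no further derivative information is used: every subsequent correction to B comes from function values alone.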
In practice, selecting \mathbf{B}_0 as the identity matrix works well for well-scaled problems but may require scaling adjustments for ill-conditioned systems. If \mathbf{B}_k becomes singular during iterations—detectable via a small minimum singular value—regularization techniques such as adding a small multiple of the identity (\mathbf{B}_k + \epsilon I) or restarting with a fresh finite-difference Jacobian can restore stability, though these increase computational overhead. Line searches along \mathbf{s}_k to ensure \|\mathbf{F}(\mathbf{x}_{k+1})\| < \|\mathbf{F}(\mathbf{x}_k)\| further enhance reliability, especially for nonconvex \mathbf{F}.
Variants for Nonlinear Equations
While Broyden's method serves as a foundational quasi-Newton approach for solving systems of nonlinear equations, several variants have been developed to enhance its robustness, particularly in cases prone to stagnation or divergence.[18]
Powell's hybrid method combines quasi-Newton Jacobian updates with a dogleg trust-region strategy that blends the quasi-Newton step with the steepest-descent direction, promoting global convergence when the initial approximation is far from the solution. The approach alternates occasional full Jacobian evaluations with cheap approximate updates, reducing computational cost while ensuring descent in the residual norm.[19]
Anderson mixing, originally proposed in 1965 and revived in the 2010s, accelerates convergence by extrapolating a linear combination of multiple previous iterates and residuals, effectively incorporating more historical information than single-secant methods like Broyden's. This technique is particularly effective for fixed-point iterations underlying nonlinear systems, often achieving faster linear convergence rates in practice.[20]
The Shamanskii modification improves stability by periodically recomputing the full Jacobian every m steps, using it unchanged for the intermediate quasi-Newton updates to mitigate error accumulation in the approximation. This periodic reset enhances reliability for ill-conditioned systems without the full expense of recomputation at every iteration.[21]
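The idea can be sketched as a Newton-type loop that refreshes the Jacobian only every m steps and reuses the frozen factorization in between; the system, starting point, and m below are illustrative choices:

```python
import numpy as np

def F(x):
    """Illustrative system: unit circle intersected with the line x0 = x1."""
    return np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])

def J(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])

x = np.array([1.0, 0.5])
m = 3  # recompute the Jacobian every m steps; reuse it in between
J_frozen = J(x)
for k in range(60):
    fx = F(x)
    if np.linalg.norm(fx) < 1e-12:
        break
    if k % m == 0:
        J_frozen = J(x)           # periodic refresh (Shamanskii-style)
    x = x - np.linalg.solve(J_frozen, fx)

print(x)  # converges to (1/sqrt(2), 1/sqrt(2))
```

In a production implementation one would factor the refreshed Jacobian once (e.g., LU) and reuse the factorization for the m intermediate solves, which is where the cost saving comes from.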
Globalized variants extend these methods with trust-region or line-search strategies to enforce sufficient decrease in the merit function, ensuring global convergence even from poor starting points. Trust-region approaches constrain updates within a dynamic radius based on trial steps, while line searches along quasi-Newton directions accept steps satisfying Wolfe conditions.[22]
These variants find prominent applications in chemical engineering, such as modeling large-scale reactor systems where nonlinear equations arise from mass and energy balances, often involving ill-conditioned Jacobians due to stiff kinetics.[21]
Theoretical Foundations
Convergence Properties
Quasi-Newton methods demonstrate local superlinear (q-superlinear) convergence, meaning \|\mathbf{x}_{k+1} - \mathbf{x}^*\| / \|\mathbf{x}_k - \mathbf{x}^*\| \to 0, under appropriate assumptions on the objective function and update mechanism. This property arises when the approximation matrices act like the true Hessian (or Jacobian) along the step directions. For methods in Broyden's family, such as Broyden's method for nonlinear equations, superlinear convergence is characterized by the Dennis-Moré condition,
\lim_{k \to \infty} \frac{ \| (\mathbf{B}_k - \mathbf{J}(\mathbf{x}^*)) \mathbf{s}_k \| }{ \| \mathbf{s}_k \| } = 0,
where \mathbf{B}_k is the approximation matrix at iteration k, \mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k, and \mathbf{J}(\mathbf{x}^*) is the true Jacobian (or Hessian, in optimization) at the solution. The condition requires the approximation error to vanish along the step directions, not that \mathbf{B}_k converge to \mathbf{J}(\mathbf{x}^*) in full, and it is both necessary and sufficient for q-superlinear convergence.[23]
In the context of unconstrained optimization, the BFGS method achieves q-superlinear convergence locally when paired with line searches satisfying the Wolfe conditions, which enforce sufficient decrease and curvature along the search direction. These conditions guarantee that the step length balances progress toward the minimum while preserving the quasi-Newton update's desirable properties, such as positive definiteness of the approximation. The theoretical foundation for this behavior in BFGS and related rank-two updates stems from analyses showing that the updates satisfy the Dennis-Moré condition asymptotically under Lipschitz continuity of the Hessian.[7]
Global convergence of quasi-Newton methods, including those from Broyden's class, requires additional safeguards like exact or inexact line searches or trust-region strategies to ensure descent and prevent divergence on nonconvex problems. Early results from the 1970s demonstrated that, with appropriate line search techniques, these methods converge globally to a stationary point for continuously differentiable functions, provided the initial approximation is sufficiently close or the objective satisfies mild regularity conditions. For instance, modifications to Broyden's method ensure global convergence even for linear mappings, extending to nonlinear cases with bounded deterioration of the approximation.[7]
The convergence rate of quasi-Newton methods is influenced by factors such as the quality of the initial Hessian or Jacobian approximation and the Lipschitz continuity of the function's derivatives, which control how quickly the secant condition is satisfied. Compared to full Newton's method, which achieves quadratic convergence but requires exact second derivatives, quasi-Newton methods offer a slower superlinear rate yet outperform first-order methods like steepest descent, which exhibit only linear convergence, especially near the solution where second-order information accelerates progress.[7]
Connection to Matrix Inversion
Quasi-Newton methods approximate the inverse of the Jacobian or Hessian matrix using secant conditions derived from successive function evaluations, providing an efficient alternative to exact matrix inversion in nonlinear optimization and root-finding problems. These updates construct low-rank modifications to an initial approximation, iteratively refining the inverse without recomputing it from scratch each time. This approach draws from iterative techniques for matrix inversion, where each update mimics a step toward solving for the inverse through successive corrections.[7]
A key linear algebra tool underlying these updates is the Sherman-Morrison formula, which computes the inverse of a matrix after a rank-one perturbation:
(\mathbf{A} + \mathbf{u} \mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1} \mathbf{u} \mathbf{v}^T \mathbf{A}^{-1}}{1 + \mathbf{v}^T \mathbf{A}^{-1} \mathbf{u}},
assuming \mathbf{A} is invertible and the denominator is nonzero. This formula enables efficient rank-one updates in quasi-Newton methods, such as Broyden's good method, by adjusting the approximate inverse with minimal computational cost. For rank-two updates, common in optimization methods like BFGS, the more general Woodbury (or Sherman-Morrison-Woodbury) identity extends this capability to handle paired vectors, preserving the structure of the approximation.[24]
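The identity is easy to verify numerically; the matrix and vectors below are arbitrary illustrative data:

```python
import numpy as np

# Illustrative data: a symmetric positive definite A and arbitrary vectors u, v.
A = np.array([[3.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 2.0]])
u = np.array([1.0, 0.0, 2.0])
v = np.array([0.5, 1.0, -1.0])

A_inv = np.linalg.inv(A)
denom = 1.0 + v @ A_inv @ u   # must be nonzero for the formula to apply

# Sherman-Morrison: inverse of the rank-one-perturbed matrix from A^{-1} alone,
# at O(n^2) cost instead of a fresh O(n^3) inversion.
B_inv = A_inv - np.outer(A_inv @ u, v @ A_inv) / denom

print(np.allclose(B_inv, np.linalg.inv(A + np.outer(u, v))))  # True
```

This O(n²) correction is exactly what allows quasi-Newton methods to maintain an inverse approximation across rank-one updates without repeated full inversions.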
In Broyden's family, the update admits a least-change interpretation: \mathbf{B}_{k+1} is the matrix closest to \mathbf{B}_k in the Frobenius norm among all matrices satisfying the secant condition \mathbf{B}_{k+1} \mathbf{s}_k = \mathbf{y}_k. The resulting rank-one correction (\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k) \mathbf{s}_k^T / (\mathbf{s}_k^T \mathbf{s}_k) corrects the approximation exactly along the step direction while leaving its action on the orthogonal complement unchanged, effectively advancing the iterative inversion process with a minimal, targeted perturbation.[25]
The development of these connections emerged in the 1960s with Broyden's foundational work in 1965, which introduced the class of updates satisfying the secant condition, linking them to efficient inversion techniques prevalent in numerical linear algebra at the time.[25][7]
An important implication of these update mechanisms is the preservation of nonsingularity: if the initial matrix approximation is nonsingular, the low-rank updates maintain this property under the Sherman-Morrison-Woodbury framework, provided the perturbation terms avoid singularity in the denominator. This stability ensures reliable step computations throughout the iteration, contributing to the robustness of quasi-Newton methods in practice.[24]
Implementations and Extensions
Software and Algorithms
MINPACK, developed in the 1970s by researchers at Argonne National Laboratory, is an early Fortran library providing implementations of quasi-Newton methods for solving systems of nonlinear equations and nonlinear least squares problems. Its subroutines HYBRD1 and HYBRJ1 implement Powell's hybrid method, combining Broyden quasi-Newton updates of the Jacobian with a dogleg trust-region strategy for robust nonlinear root-finding.[26]
In modern Python environments, the SciPy library's optimize.minimize function supports quasi-Newton optimization through the 'BFGS' method for unconstrained problems and 'L-BFGS-B' for bounded constraints, leveraging limited-memory approximations to handle large-scale objectives efficiently.[27] MATLAB's Optimization Toolbox implements quasi-Newton methods in the fminunc solver, which defaults to this approach for unconstrained multivariable minimization and allows specification via the 'Algorithm' option.[28]
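A minimal usage sketch with SciPy; the objective below is an illustrative smooth test function with its minimum at (1, 1), not taken from the source:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative smooth objective with minimum at (1, 1).
def f(x):
    return (x[0] - 1.0)**2 + 10.0 * (x[1] - x[0]**2)**2

def grad(x):
    return np.array([2.0 * (x[0] - 1.0) - 40.0 * x[0] * (x[1] - x[0]**2),
                     20.0 * (x[1] - x[0]**2)])

# Quasi-Newton BFGS; method='L-BFGS-B' would handle bounds or larger problems.
res = minimize(f, x0=np.array([-1.0, 1.0]), jac=grad, method='BFGS')
print(res.x)  # close to the minimizer (1, 1)
```

Supplying the analytic gradient via `jac` avoids internal finite-difference approximations; only first derivatives are needed, since BFGS builds its own curvature model.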
The NLopt library, implemented in C with bindings for multiple languages, provides a low-storage BFGS (L-BFGS) algorithm suitable for high-dimensional unconstrained problems, enabling integration into applications requiring nonlinear optimization.[29] BFGS-type methods serve as a common default in such libraries due to their balance of efficiency and superlinear convergence.[30]
A generic algorithmic framework for the BFGS method with Armijo line search, as outlined in standard numerical optimization references, proceeds as follows:
Initialize: choose an initial point x_0, an initial inverse Hessian approximation H_0 (e.g., the identity matrix), and a tolerance ε > 0
Set k = 0
While ||∇f(x_k)|| > ε:
    Compute the search direction: p_k = -H_k ∇f(x_k)
    Armijo line search: find α_k > 0 such that f(x_k + α_k p_k) ≤ f(x_k) + c α_k ∇f(x_k)^T p_k for some c ∈ (0, 1)
    Update the iterate: x_{k+1} = x_k + α_k p_k
    Compute s_k = x_{k+1} - x_k and y_k = ∇f(x_{k+1}) - ∇f(x_k)
    If s_k^T y_k > 0: update H_{k+1} = (I - ρ_k s_k y_k^T) H_k (I - ρ_k y_k s_k^T) + ρ_k s_k s_k^T, where ρ_k = 1 / (s_k^T y_k)
    Else: H_{k+1} = H_k (or reset to the initial approximation)
    Set k = k + 1
Return x_k
This pseudocode assumes exact gradient evaluations and focuses on the core update loop; note that H_k approximates the inverse Hessian, so the step is computed by matrix-vector multiplication rather than a linear solve, and the Armijo condition ensures sufficient decrease at each step.
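The loop above can be sketched in NumPy as follows; the backtracking parameters (halving α, c = 10⁻⁴) and the quadratic test function are illustrative choices:

```python
import numpy as np

def bfgs_armijo(f, grad, x0, eps=1e-6, c=1e-4, max_iter=200):
    """BFGS on the inverse Hessian approximation H with Armijo backtracking."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                      # H_0: identity
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        p = -H @ g                     # search direction p_k = -H_k grad f(x_k)
        alpha = 1.0                    # Armijo backtracking line search
        while f(x + alpha * p) > f(x) + c * alpha * (g @ p):
            alpha *= 0.5
        s = alpha * p                  # s_k = x_{k+1} - x_k
        x = x + s
        g_new = grad(x)
        y = g_new - g                  # y_k = gradient difference
        sy = s @ y
        if sy > 1e-12:                 # curvature condition s_k^T y_k > 0
            rho = 1.0 / sy
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        g = g_new
    return x

# Illustrative use: minimize f(x) = (x0 - 2)^2 + 2*(x1 + 1)^2, minimum at (2, -1)
f = lambda x: (x[0] - 2.0)**2 + 2.0 * (x[1] + 1.0)**2
grad = lambda x: np.array([2.0 * (x[0] - 2.0), 4.0 * (x[1] + 1.0)])
xmin = bfgs_armijo(f, grad, [0.0, 0.0])
print(xmin)  # ≈ [2, -1]
```

Skipping the update when s_kᵀy_k ≤ 0 keeps H symmetric positive definite, so every p_k remains a descent direction and the backtracking loop terminates.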
Best practices for implementing quasi-Newton methods emphasize scaling the variables so that gradients and function values have comparable magnitudes, which prevents ill-conditioning in high-dimensional spaces, and choosing the initial Hessian approximation as a diagonal matrix estimated from gradient components or finite differences to accelerate the early iterations.[31] These techniques enhance numerical stability without altering the method's asymptotic convergence properties.[32]
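One common scaling heuristic, described in standard references, rescales the initial inverse Hessian to H_0 = (s₀ᵀy₀ / y₀ᵀy₀) I once the first step pair (s₀, y₀) is available, so that H_0 reflects the problem's curvature scale. A sketch, with an illustrative ill-scaled quadratic:

```python
import numpy as np

def scaled_initial_H(s, y):
    """Heuristic initial inverse-Hessian scaling H_0 = (s^T y / y^T y) I,
    applied after the first step pair (s, y) is available."""
    gamma = (s @ y) / (y @ y)          # rough estimate of 1 / (Hessian scale)
    return gamma * np.eye(s.size)

# Example: for f(x) = 0.5 * x^T diag(1, 100) x, a step s gives y = diag(1,100) s
s = np.array([1.0, 1.0])
y = np.array([1.0, 100.0])
H0 = scaled_initial_H(s, y)
print(np.diag(H0))
```

Here γ = 101/10001 ≈ 0.01, close to the reciprocal of the dominant curvature 100, so the first quasi-Newton step is far better scaled than with H_0 = I.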
Recent Advances
Recent developments in quasi-Newton methods have focused on enhancing efficiency and applicability in large-scale and distributed optimization settings. A notable advancement is the greedy quasi-Newton method introduced by Rodomanov and Nesterov in 2021, which selects updates from the Broyden family that minimize a specific error measure, leading to explicit superlinear convergence rates.[33] This approach has been extended to distributed environments, where adaptive greedy updates achieve non-asymptotic convergence guarantees, making it suitable for parallel computing in high-dimensional problems.[34]
In 2024, incremental quasi-Newton methods incorporating symmetric rank-1 (SR1) updates were proposed to handle streaming data in online learning scenarios. These methods update approximations incrementally as new data arrives, achieving faster superlinear convergence compared to traditional variants, which is particularly beneficial for sequential optimization tasks with evolving objectives.[35] For multi-variable nonlinear equations, full-rank Jacobian updates via the T-Secant strategy, developed around 2023, improve numerical stability by ensuring complete rank preservation in each iteration through dual secant conditions, reducing sensitivity to ill-conditioning.[36]
Applications in machine learning have seen quasi-Newton methods adapted for challenging problems like saddle-point optimization. The multiple greedy SR1 method for strongly-convex-strongly-concave saddle points, introduced in 2024, uses multiple greedy selections of SR1 updates to escape saddle points more effectively, demonstrating superior performance in training generative models.[37] In multi-objective optimization, limited-memory quasi-Newton approaches from 2023 construct Pareto fronts by approximating Hessians with low-rank updates, enabling scalable solutions for conflicting objectives in engineering design.[38] Additionally, a 2025 diagonal quasi-Newton method employing penalty decomposition addresses sparse nonconvex bound-constrained problems, promoting sparsity while maintaining convergence through diagonal Hessian approximations.[39]
These advances address longstanding gaps in classical quasi-Newton methods, such as handling noisy gradients and scalability in AI-driven applications, moving beyond the 1970s-era focus on small-scale unconstrained problems to support modern distributed and data-intensive frameworks.