Limited-memory BFGS

Limited-memory BFGS (L-BFGS) is a quasi-Newton optimization algorithm designed for solving large-scale unconstrained nonlinear optimization problems by approximating the inverse Hessian matrix using a limited number of vector pairs from recent iterations, thereby avoiding the prohibitive memory requirements of the full BFGS method. This approach enables efficient computation in high-dimensional spaces, where the dimension d can reach thousands or more, by storing only m pairs of vectors (s_k, y_k), with s_k representing the step taken and y_k the corresponding change in the gradient, typically setting m between 3 and 20 to balance accuracy and efficiency. The method computes search directions via a two-loop recursion that implicitly applies the approximate inverse Hessian to the current gradient, achieving linear convergence while requiring only O(m d) storage and time per iteration, in contrast to the O(d^2) demands of standard quasi-Newton updates.

Introduced by Dong C. Liu and Jorge Nocedal in their 1989 paper, L-BFGS builds on the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update formula but modifies it for limited storage to handle problems where full matrix approximations are infeasible due to memory constraints. The algorithm initializes the approximation with a scaled identity matrix, where the scaling factor \gamma_k = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}} provides a simple yet effective diagonal estimate of the inverse Hessian, and it employs a line search satisfying the Wolfe conditions to ensure sufficient decrease in the objective function along the search direction. Numerical studies in the original work demonstrated that L-BFGS outperforms earlier limited-memory methods, such as those by Buckley and LeNir, in terms of iteration speed and overall efficiency, particularly when additional storage is available for more vector pairs. Global convergence is guaranteed for uniformly convex functions under standard assumptions, making it a robust choice for practical large-scale optimization.

Key features of L-BFGS include its adaptability to inexact line search procedures and its ability to leverage gradient information without requiring second derivatives, which enhances its applicability in fields like machine learning and scientific computing where objective functions are often expensive to evaluate. Variants such as L-BFGS-B extend the method to bound-constrained optimization by combining gradient projection techniques with the limited-memory BFGS approximation, allowing handling of simple bounds on variables while maintaining efficiency for large problems. Overall, L-BFGS remains a cornerstone of modern optimization software libraries due to its favorable trade-off between computational cost and performance.
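To make the storage contrast concrete, the following back-of-the-envelope Python sketch (with illustrative sizes d = 10^6 and m = 10, assuming 8-byte floating-point values) compares the memory needed by a dense inverse-Hessian approximation with that of the L-BFGS vector pairs:

# Illustrative memory comparison; sizes are hypothetical, not from any benchmark.
d = 1_000_000            # problem dimension
m = 10                   # number of stored (s, y) pairs
bytes_per_float = 8

full_bfgs = d * d * bytes_per_float      # dense d x d inverse Hessian approximation
lbfgs = 2 * m * d * bytes_per_float      # m pairs of d-dimensional vectors

print(f"full BFGS : {full_bfgs / 1e12:.1f} TB")   # about 8.0 TB
print(f"L-BFGS    : {lbfgs / 1e6:.0f} MB")        # about 160 MB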

Background

Quasi-Newton Methods

Quasi-Newton methods are a class of iterative algorithms for unconstrained optimization that approximate the second-order information of the objective function, specifically the Hessian matrix, using only evaluations of the function and its gradient. This approximation enables the methods to mimic the superlinear convergence properties of Newton's method while avoiding the high computational cost associated with explicitly computing or storing the full Hessian, which is particularly beneficial for large-scale problems where second derivatives are expensive or infeasible to obtain.

The development of quasi-Newton methods began in the late 1950s with W. C. Davidon's unpublished work on variable metric methods, which was later refined and popularized by R. Fletcher and M. J. D. Powell in 1963 through their introduction of the Davidon-Fletcher-Powell (DFP) update formula. This marked a significant advancement in nonlinear optimization, followed by further contributions in the 1970s from C. G. Broyden, R. Fletcher, D. Goldfarb, and D. F. Shanno, who proposed additional update schemes that improved robustness and performance. These methods addressed the limitations of earlier techniques like steepest descent, offering a practical bridge between gradient-based and full second-order approaches for minimizing smooth functions.

A core feature of quasi-Newton methods is adherence to the secant condition, which ensures that the Hessian approximation B_{k+1} satisfies B_{k+1} s_k = y_k, where s_k = x_{k+1} - x_k is the step vector and y_k = \nabla f(x_{k+1}) - \nabla f(x_k) is the change in gradient; this condition captures the local curvature information implicitly. Many quasi-Newton updates, such as the DFP formula, are designed to preserve the positive definiteness of the approximation when the curvature condition y_k^T s_k > 0 holds, ensuring descent directions and global convergence under suitable conditions. Compared to exact Newton's method, which requires solving a linear system involving the true Hessian at each iteration (costing O(n^3) for n-dimensional problems), quasi-Newton methods update the approximation in O(n^2) time per iteration, trading quadratic convergence for superlinear rates while reducing overall expense in high dimensions.

One common way to update the inverse Hessian approximation in quasi-Newton methods is through a rank-two modification that satisfies the secant condition H_{k+1} y_k = s_k. For instance, the DFP update for the inverse Hessian H_{k+1} takes the form

H_{k+1} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{s_k^T y_k},

where s_k = x_{k+1} - x_k and y_k = \nabla f(x_{k+1}) - \nabla f(x_k); this formula adjusts the prior approximation H_k to better match the observed curvature along the latest step, with the subtracted term correcting the previous approximation for inconsistencies and the added outer product incorporating the new gradient-change information. BFGS is a closely related rank-two update, derived from a different weighting, that enhances preservation of positive definiteness and practical robustness.
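As an illustration of the DFP formula above, the following NumPy sketch (an illustrative helper, not library code, with randomly generated vectors constructed so that y_k^T s_k > 0 is very likely) applies one inverse-Hessian update and checks the secant condition:

import numpy as np

def dfp_update(H, s, y):
    # DFP update of the inverse Hessian approximation:
    # H+ = H - (H y y^T H) / (y^T H y) + (s s^T) / (s^T y)
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (s @ y)

rng = np.random.default_rng(0)
n = 5
H = np.eye(n)                          # initial inverse Hessian approximation
s = rng.standard_normal(n)             # step x_{k+1} - x_k
y = s + 0.1 * rng.standard_normal(n)   # gradient change, constructed so y^T s > 0

H_new = dfp_update(H, s, y)
print(np.allclose(H_new @ y, s))       # secant condition H_{k+1} y_k = s_k holds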

BFGS Algorithm

The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative quasi-Newton method for unconstrained nonlinear optimization, designed to approximate the inverse Hessian H_k of the objective function f(x) using only gradient evaluations. Developed independently by Broyden, Fletcher, Goldfarb, and Shanno in 1970, it updates the approximation at each step to satisfy the secant condition while ensuring positive definiteness under suitable conditions.

The algorithm initializes H_0 as a positive definite matrix, typically the identity H_0 = I, or a scaled identity based on an initial curvature estimate. At iteration k, the search direction is computed as p_k = -H_k \nabla f(x_k), providing a descent direction since H_k is positive definite and p_k^T \nabla f(x_k) = - \nabla f(x_k)^T H_k \nabla f(x_k) < 0. A line search then determines the step size \alpha_k > 0 satisfying the Wolfe conditions: f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k p_k^T \nabla f(x_k) for sufficient decrease (with 0 < c_1 < 1, often c_1 = 10^{-4}) and p_k^T \nabla f(x_k + \alpha_k p_k) \geq c_2 p_k^T \nabla f(x_k) for curvature (with c_1 < c_2 < 1, often c_2 = 0.9). The update yields x_{k+1} = x_k + \alpha_k p_k, s_k = \alpha_k p_k = x_{k+1} - x_k, and y_k = \nabla f(x_{k+1}) - \nabla f(x_k). The Wolfe conditions ensure s_k^T y_k > 0, preserving positive definiteness in subsequent updates.

The BFGS update for the inverse Hessian approximation is given by

H_{k+1} = \left( I - \rho_k s_k y_k^T \right) H_k \left( I - \rho_k y_k s_k^T \right) + \rho_k s_k s_k^T, where \rho_k = \frac{1}{s_k^T y_k}.

This rank-two update satisfies the secant condition H_{k+1} y_k = s_k, ensuring the approximation captures the curvature along the latest step, and it maintains symmetry and positive definiteness if H_k is positive definite and s_k^T y_k > 0. Each update requires O(n^2) arithmetic operations for an n-dimensional problem.

To derive this formula, consider the general problem of finding H_{k+1} that minimizes the change from H_k in a weighted Frobenius norm \|H_{k+1} - H_k\|_W = \| W^{1/2} (H_{k+1} - H_k) W^{1/2} \|_F, where the weight matrix W is any positive definite matrix satisfying W s_k = y_k (for example, an average Hessian over the step), subject to the secant condition H_{k+1} y_k = s_k and symmetry of H_{k+1}. Formulating this as a constrained minimization problem and applying the method of Lagrange multipliers, with a vector multiplier \lambda for the secant condition and a symmetric matrix multiplier \Gamma for symmetry, leads to the optimality conditions. Solving these yields the auxiliary factors V_k = I - \rho_k y_k s_k^T and the outer-product correction \rho_k s_k s_k^T, resulting in the compact BFGS form H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T, equivalent to the expression above. This choice of weighting distinguishes BFGS from other quasi-Newton updates like DFP and is associated with superior practical performance.

Assuming f is twice continuously differentiable, the Hessian is positive definite at a strict local minimizer and Lipschitz continuous near it, and the line searches satisfy the Wolfe conditions, BFGS achieves superlinear convergence: the error \|x_{k+1} - x^*\| satisfies \|x_{k+1} - x^*\| / \|x_k - x^*\| \to 0 as k \to \infty; more precisely, the rate is q-superlinear. The full n \times n matrix H_k requires O(n^2) storage, and updates cost O(n^2) time, which scales poorly for high-dimensional problems (e.g., n > 10^3), necessitating limited-memory variants for large-scale applications.
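The product form of the update translates directly into code. The NumPy sketch below (an illustrative helper function, not any library's implementation) applies one BFGS inverse-Hessian update; the secant property H_{k+1} y_k = s_k can be verified numerically exactly as in the DFP sketch above:

import numpy as np

def bfgs_inverse_update(H, s, y):
    # BFGS update H+ = (I - rho s y^T) H (I - rho y s^T) + rho s s^T,
    # valid when the curvature condition s^T y > 0 holds.
    rho = 1.0 / (s @ y)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)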

Core Algorithm

Memory Approximation

The limited-memory BFGS (L-BFGS) method approximates the inverse Hessian H_k using only the m most recent pairs of vectors (s_i, y_i), i = k-m, \dots, k-1, where s_i = x_{i+1} - x_i is the step taken at iteration i and y_i = \nabla f(x_{i+1}) - \nabla f(x_i) captures the corresponding change in the gradient. This implicit representation leverages a series of low-rank updates derived from the standard BFGS formula, avoiding the explicit storage of the full n \times n matrix required in the conventional BFGS method. By maintaining just these m pairs, each of dimension n, the method achieves a memory footprint of O(mn), which is particularly advantageous for high-dimensional optimization problems where n is large.

The compact form of the approximation expresses H_k through successive BFGS updates applied to an initial scaling matrix:

H_k = \left( V_{k-1}^T \cdots V_{k-m}^T \right) \gamma_k I \left( V_{k-m} \cdots V_{k-1} \right) + \sum_{j=k-m}^{k-1} \rho_j \left( V_{k-1}^T \cdots V_{j+1}^T \right) s_j s_j^T \left( V_{j+1} \cdots V_{k-1} \right),

where V_i = I - \rho_i y_i s_i^T, \rho_i = \frac{1}{y_i^T s_i} is the standard BFGS scaling parameter, and \gamma_k = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}} provides a scalar estimate of the scale of the inverse Hessian along the most recent step. The summation contains the outer-product terms \rho_j s_j s_j^T, composed with the intervening update factors, that complete the successive BFGS updates; the most recent update enforces the secant condition H_k y_{k-1} = s_{k-1}. This structure implicitly composes the last m BFGS updates, treating earlier history as approximated by the scaled identity \gamma_k I, and it keeps the approximation positive definite under standard assumptions on the objective function, such as satisfaction of the curvature condition y_i^T s_i > 0 enforced by the Wolfe conditions. The product H_k \nabla f(x_k) is computed efficiently via a two-loop recursion that requires O(mn) operations without forming H_k explicitly.

The parameter m, known as the memory length, controls the number of stored pairs and thus the trade-off between approximation quality and resource demands. Typical values range from 3 to 20, with smaller m (e.g., 5) often sufficient for many large-scale problems to achieve near-full BFGS performance while minimizing storage and per-iteration cost. Increasing m enhances the accuracy of the Hessian approximation by incorporating more curvature information, potentially accelerating convergence, but at the expense of higher memory usage and computational overhead proportional to m; empirical studies show that m > 20 rarely yields significant gains relative to the added cost. In contrast to the full BFGS method's O(n^2) storage, this limited-memory approach scales linearly with the problem dimension n, making L-BFGS practical for optimization problems where n reaches thousands or millions.

The initial scaling \gamma_k I plays a crucial role in adapting the approximation to local curvature, ensuring that the method begins each iteration with a reasonable estimate of the inverse Hessian near the current point x_k. By setting \gamma_k = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}, which approximates the inverse of an average eigenvalue of the true Hessian along the recent step (projected onto the gradient-change direction), the scaling promotes rapid initial progress and maintains the positive definiteness of H_k when the objective is convex or the updates satisfy the Wolfe conditions. This dynamic rescaling distinguishes L-BFGS from static initializations and contributes to its robustness across diverse problem classes.
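A minimal Python sketch of this bookkeeping (the helper names store_pair and initial_scaling are illustrative, and the curvature threshold is an assumption) shows how the m pairs and the scaling \gamma_k can be maintained:

from collections import deque
import numpy as np

m = 10                                   # memory length (illustrative value)
memory = deque(maxlen=m)                 # keeps only the m most recent (s, y) pairs

def store_pair(memory, s, y, eps=1e-10):
    # Keep the pair only if it carries positive curvature information (s^T y > 0).
    sy = float(s @ y)
    if sy > eps:
        memory.append((s, y))            # oldest pair is discarded automatically

def initial_scaling(memory):
    # gamma_k = s_{k-1}^T y_{k-1} / (y_{k-1}^T y_{k-1}), used as H_k^0 = gamma_k I.
    if not memory:
        return 1.0
    s, y = memory[-1]
    return float(s @ y) / float(y @ y)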

Update Procedure

The limited-memory BFGS (L-BFGS) algorithm proceeds iteratively to minimize an objective function by approximating the inverse Hessian matrix using a limited number of past updates, enabling efficient computation for large-scale problems. The procedure begins with initialization: select an initial point x_0, a small integer m (typically 3 to 20) for the number of stored corrections, and an initial symmetric positive definite scaling matrix H_k^0 (often the identity scaled by an initial curvature estimate). In each iteration k, compute the gradient g_k at x_k, then determine the search direction p_k \approx -H_k g_k via the two-loop recursion without explicitly forming the n \times n Hessian approximation, where n is the problem dimension. Perform a line search to find a step length \alpha_k > 0 such that x_{k+1} = x_k + \alpha_k p_k satisfies descent conditions, compute the displacement s_k = x_{k+1} - x_k and curvature vector y_k = g_{k+1} - g_k, and update the limited memory by adding the new pair \{s_k, y_k\}, discarding the oldest pair if more than m are stored.

The core of the update is the two-loop recursion, which implicitly applies the m stored BFGS updates to compute the search direction efficiently. The backward (first) loop starts with q = g_k and, for i = k-1, \dots, k-m, computes the scaling factors \rho_i = 1 / (y_i^T s_i) and \alpha_i = \rho_i (s_i^T q), then updates q \leftarrow q - \alpha_i y_i, storing each \alpha_i; the resulting vector is then scaled by the initial matrix H_k^0, chosen as \gamma_k I with \gamma_k = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}. The forward (second) loop then builds the direction by initializing r = H_k^0 q and, for each i from oldest to newest, computing \beta_i = \rho_i (y_i^T r) and updating r \leftarrow r + (\alpha_i - \beta_i) s_i, resulting in r \approx H_k g_k and thus p_k = -r. This recursion costs O(m n) operations per iteration, making it suitable for high-dimensional optimization.

Line search integration ensures global convergence by enforcing sufficient decrease and curvature conditions. Typically, the Wolfe conditions are used: the Armijo rule requires f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k g_k^T p_k for a small c_1 \in (0,1) (e.g., 0.0001), while the curvature condition demands g_{k+1}^T p_k \geq c_2 g_k^T p_k for c_2 \in (c_1, 1) (e.g., 0.9), guaranteeing positive curvature pairs (s_k^T y_k > 0) and sufficient progress along the approximate Newton direction. The search often starts with \alpha_k = 1 and backtracks (e.g., via cubic interpolation) until the conditions hold, balancing efficiency and robustness.

Termination occurs when convergence criteria are met, such as the gradient norm falling below a threshold like \|g_k\| \leq \epsilon \max(1, \|x_k\|) for a tolerance \epsilon (e.g., 10^{-5}), or after a maximum number of iterations to prevent excessive computation. These criteria ensure the algorithm halts upon achieving a stationary point or practical optimality within the problem's scale. The full L-BFGS procedure can be outlined in pseudocode as follows:
Algorithm L-BFGS:
Given: initial x_0, m, H_0 (initial scaling), parameters for line search (c_1, c_2), tolerance ε, max iterations K_max.

Set k = 0
Store empty lists for the m most recent {s_i, y_i} pairs.
Store list for α_i from recursion.

Compute g_0 = ∇f(x_0)
While k < K_max and ||g_k|| > ε * max(1, ||x_k||):
    
    // Two-loop recursion for r ≈ H_k g_k, then p_k = -r
    q = g_k
    For i = k-1 downto max(0, k-m):
        ρ_i = 1 / (y_i^T s_i)
        α_i = ρ_i * (s_i^T q)
        Store α_i for this i
        q = q - α_i * y_i
    r = H_k^0 * q  // Scale by initial matrix, often diagonal γ_k I
    For i = max(0, k-m) to k-1:
        β = ρ_i * (y_i^T r)
        r = r + (α_i - β) * s_i
    p_k = -r
    
    // Line search for α_k satisfying Wolfe conditions
    α_k = 1.0
    While not Wolfe conditions satisfied (backtrack α_k):
        α_k = α_k / 2  // Or use cubic interpolation
        If α_k falls below a minimum threshold, terminate the line search and report failure
    x_{k+1} = x_k + α_k p_k
    g_{k+1} = ∇f(x_{k+1})
    s_k = α_k p_k
    y_k = g_{k+1} - g_k
    
    // Update memory: store the new pair {s_k, y_k}; if more than m pairs are held, discard the oldest
    Store (s_k, y_k)
    Update H_{k+1}^0 = γ_{k+1} I with γ_{k+1} = (s_k^T y_k) / (y_k^T y_k)
    k = k + 1

Return x_k as approximate minimizer.
(Note: The recursion details in the code are standard; implementations may vary slightly in indexing or storage.)
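The two-loop recursion at the heart of this pseudocode can be written compactly with NumPy. The sketch below is a minimal illustration rather than any library's code; it assumes the stored pairs are ordered oldest first and returns the search direction p_k = -H_k g_k:

import numpy as np

def two_loop_direction(g, pairs, gamma):
    # pairs: list of (s_i, y_i) arrays, oldest first, newest last.
    # gamma: scaling of the initial matrix H_k^0 = gamma * I.
    q = g.copy()
    alphas, rhos = [], []
    for s, y in reversed(pairs):               # backward loop: newest to oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
        rhos.append(rho)
    r = gamma * q                              # apply the initial matrix H_k^0 = gamma I
    for (s, y), alpha, rho in zip(pairs, reversed(alphas), reversed(rhos)):
        beta = rho * (y @ r)                   # forward loop: oldest to newest
        r += (alpha - beta) * s
    return -r                                  # search direction p_k = -H_k g_k

With an empty pair list this reduces to scaled steepest descent, p_k = -gamma * g_k, which matches the behavior of the pseudocode on the first iteration.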

Applications

Optimization in Machine Learning

Limited-memory BFGS (L-BFGS) plays a key role in training machine learning models, particularly for tasks involving high-dimensional parameter spaces and large datasets, such as logistic regression, neural networks, and support vector machines (SVMs). In logistic regression, L-BFGS serves as an efficient solver for minimizing the negative log-likelihood with L2 regularization, enabling scalable binary or multiclass classification without requiring full Hessian storage, which would be prohibitive for datasets with millions of samples. Similarly, for SVMs, L-BFGS facilitates the optimization of the dual formulation by approximating second-order curvature information through limited vector pairs, allowing effective handling of the sparse, high-dimensional feature spaces common in such problems. In neural networks, it provides a quasi-Newton alternative to first-order stochastic methods, approximating the inverse Hessian to guide weight updates in deep architectures where gradient noise can hinder convergence.

L-BFGS integrates seamlessly into popular machine learning frameworks, enhancing their optimization capabilities for regularized problems. In scikit-learn, the LogisticRegression estimator uses 'lbfgs' as its default solver for L2-regularized (or unpenalized) problems, optimizing linear classifiers with improved generalization over unregularized approaches. For deep learning, frameworks like PyTorch include L-BFGS as a built-in optimizer, often used for fine-tuning or second-order approximations in scenarios where full-batch gradients are feasible, such as smaller networks or post-SGD refinement to accelerate convergence near local minima. This integration allows practitioners to leverage L-BFGS for problems where the objective is smooth and twice-differentiable, providing a bridge between first- and second-order optimization without excessive computational overhead.

In natural language processing, L-BFGS has been applied to train conditional random fields (CRFs) for sequence labeling tasks, where it efficiently optimizes the log-likelihood over structured outputs by managing the memory-limited updates for high-dimensional feature weights. A case study in CRF-based language processing demonstrates its effectiveness in integrating with probabilistic models to achieve robust performance on sequential data, outperforming gradient-based methods in iteration efficiency for moderately sized corpora. In computer vision, L-BFGS aids in parameter tuning for SVMs applied to image classification, such as optimizing hyperparameters and regularization strengths on datasets like MNIST or CIFAR, where hybrid approaches combining swarm-based search with BFGS yield competitive accuracy with reduced tuning time.

Compared to first-order methods like stochastic gradient descent (SGD), L-BFGS offers faster convergence in practice for smooth, strongly convex objectives, leveraging its curvature approximations to take more informed steps and reduce the number of iterations needed. Empirical results on deep neural networks and logistic models show L-BFGS converging faster than SGD in full-batch settings, particularly in the later stages of training where fine-grained optimization is critical, though it requires careful line search handling to maintain descent properties.
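As a concrete illustration of this integration, the sketch below fits an L2-regularized logistic regression with scikit-learn's 'lbfgs' solver on synthetic data; the dataset sizes and parameter values are illustrative only:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic high-dimensional classification problem (illustrative sizes).
X, y = make_classification(n_samples=5000, n_features=200, n_informative=20, random_state=0)

# 'lbfgs' is the default solver and handles L2 (or no) regularization.
clf = LogisticRegression(solver="lbfgs", C=1.0, max_iter=500)
clf.fit(X, y)
print(clf.score(X, y))    # training accuracy of the fitted model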

Scientific Computing

Limited-memory BFGS (L-BFGS) plays a crucial role in scientific computing for solving large-scale optimization problems arising from partial differential equations (PDEs), particularly in inverse problems within geophysics and fluid dynamics. In geophysics, L-BFGS is applied to least-squares formulations for seismic inversion, such as least-squares pre-stack Kirchhoff migration, where it efficiently approximates the Hessian using limited memory to handle datasets with millions of parameters. This approach enables accurate reconstruction of subsurface structures by minimizing misfits between observed and modeled seismic data in high-dimensional spaces. Similarly, in fluid dynamics, L-BFGS optimizes network parameters in physics-informed neural networks (PINNs) for solving forward and inverse PDEs, providing mesh-free solutions for complex phenomena like nanofluid flows governed by Navier-Stokes equations.

In engineering domains, L-BFGS facilitates structural optimization and molecular simulations, where problem dimensions often exceed 10^6. For structural optimization, it addresses bound-constrained problems in aerostructural design, integrating quasi-Newton updates with gradient projections to minimize design objectives while satisfying aerodynamic and structural constraints in finite element-based models. In molecular simulation, L-BFGS minimizes potential energy surfaces for large biomolecular systems, enabling efficient relaxation of atomic configurations in simulations involving thousands to millions of atoms, as seen in stability tests of machine learning potentials.

Performance evaluations highlight L-BFGS's superiority over conjugate gradient methods in ill-conditioned problems typical of scientific simulations, such as those with poorly scaled Hessians from discretized PDEs, where it achieves faster convergence rates, often requiring fewer iterations, while maintaining low memory overhead. Introduced in the late 1980s as a memory-efficient quasi-Newton method, L-BFGS has since been extended through modern parallel computing frameworks, including GPU implementations and distributed systems, to scale to exascale scientific applications like global circulation models.

Variants

L-BFGS-B

L-BFGS-B is an extension of the limited-memory BFGS (L-BFGS) algorithm designed for solving large-scale nonlinear optimization problems subject to simple bound constraints of the form l \leq x \leq u, where l and u are lower and upper bounds on the variables x. Introduced by Byrd, Lu, Nocedal, and Zhu in 1995, it combines the efficiency of L-BFGS quasi-Newton approximations with a gradient projection method to ensure feasibility while maintaining low memory requirements. Unlike the core L-BFGS, which assumes unconstrained problems, L-BFGS-B handles bounds directly through projection techniques rather than incorporating Lagrange multipliers, making it particularly suitable for applications in which variables represent weights or allocations that must remain non-negative or within specified limits.

The algorithm modifies the search direction computation by projecting the steepest descent direction onto the feasible region defined by the constraints, ensuring that steps remain within the bounds l \leq x \leq u. Specifically, the projected search path is defined as x(t) = P_C(x_k - t g_k, l, u), where P_C denotes the projection onto the constraint set C = \{ x \mid l \leq x \leq u \}, x_k is the current iterate, g_k is the gradient, and t > 0 is a step parameter. This creates a piecewise linear path that respects the bounds, allowing the algorithm to identify active constraints (variables fixed at their lower or upper limits) via the breakpoints of the path. The method then employs an active set strategy to partition variables into free and fixed sets at each iteration, optimizing only over the free variables while treating fixed ones as constants to satisfy the bounds.

A key component is the computation of the Cauchy point, which serves as an initial trial point for the subsequent subspace minimization and guarantees sufficient decrease in the objective function while preserving feasibility. The Cauchy point is obtained as the first local minimizer of the piecewise quadratic model of the objective along the projected path, with the model built from the limited-memory BFGS approximation. This computation is efficient, requiring O(n) operations for the initial segment of the path and O(m) for subsequent segments, where n is the problem dimension and m (typically small, around 5–10) is the number of stored correction pairs in the L-BFGS approximation. Backtracking line searches along this projected path further ensure that the step satisfies both feasibility and the Armijo sufficient decrease condition. By avoiding full Hessian inversions or Lagrange multiplier computations, L-BFGS-B achieves scalability for high-dimensional problems with bound constraints.
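The projection at the core of the method is simply componentwise clipping onto the box. A minimal NumPy sketch of the projected steepest-descent path x(t) = P_C(x_k - t g_k, l, u) might look as follows (function names are illustrative, not from the L-BFGS-B code):

import numpy as np

def project_box(x, l, u):
    # P_C(x) for the box C = {x : l <= x <= u}, applied componentwise.
    return np.minimum(np.maximum(x, l), u)

def projected_path(x, g, t, l, u):
    # A point on the piecewise linear projected steepest-descent path x(t) = P_C(x - t g).
    return project_box(x - t * g, l, u)

# Variables at a bound with a gradient pushing outward stay fixed (active constraints).
x = np.array([0.0, 0.5, 1.0])
g = np.array([1.0, -0.2, -1.0])
print(projected_path(x, g, t=0.3, l=np.zeros(3), u=np.ones(3)))   # -> [0.   0.56 1.  ]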

OWL-QN

The Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) method is a variant of the L-BFGS algorithm designed specifically for optimizing objectives with L1 regularization, such as f(x) = \ell(x) + C \|x\|_1, where \ell(x) is a smooth loss function and the L1 term induces sparsity. Developed by Galen Andrew and Jianfeng Gao in 2007 as an extension of L-BFGS, OWL-QN addresses the non-differentiability of the L1 penalty at zero, which causes standard quasi-Newton updates to break down when iterates cross coordinate orthants. By incorporating orthant-wise projections, the method exploits the fact that the L1 term is linear, and hence smooth, within any single orthant, leading to improved convergence on non-smooth objectives compared to plain gradient-based alternatives.

Central to OWL-QN are its orthant-wise updates, which confine each step to a single orthant (a region of space in which the sign of every coordinate is fixed) so that the L1 penalty remains differentiable along the step. This is achieved through a coordinatewise projection that respects the piecewise linearity of the L1 penalty: for a reference vector v, the projection operator is defined as \pi(x; v)_i = x_i if \operatorname{sign}(x_i) = \operatorname{sign}(v_i), and 0 otherwise. The search direction is then computed as p_k = \pi(H_k v_k; v_k), where v_k is the negative pseudo-gradient of the regularized objective and H_k approximates the inverse Hessian of the smooth loss; this ensures the direction remains in the orthant indicated by v_k, avoiding invalid steps across zero. These selective updates enhance efficiency in high-dimensional settings by restricting movement to consistent coordinates, reducing computational overhead while accelerating convergence toward sparse solutions.

OWL-QN finds primary application in sparse modeling tasks, such as L1-regularized regression and basis pursuit, where the L1 penalty promotes solutions with many zero entries. In these problems the method needs only limited-memory storage of curvature pairs for the smooth loss, without full matrix inversion, because the L1 term contributes no curvature of its own. Empirical evaluations on large-scale L1-regularized log-linear models in natural language processing demonstrate that OWL-QN converges faster than competing methods for L1-regularized training, achieving state-of-the-art sparsity with fewer iterations on large datasets.

To handle non-differentiability at zero, OWL-QN keeps the standard L-BFGS memory update but defines the curvature vector y_k = \nabla \ell(x_{k+1}) - \nabla \ell(x_k) using only the smooth loss, ignoring the L1 subgradient. The curvature pairs therefore describe the smooth part of the objective alone, preserving positive definiteness of the limited-memory approximation and global convergence guarantees akin to those for L-BFGS on smooth functions. The adaptation retains the two-loop recursion of L-BFGS but constrains each iterate to the orthant chosen at the current point, preventing curvature mismatches caused by the non-smooth penalty.
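The orthant projection \pi is easy to express directly; the NumPy sketch below (an illustrative helper, not the original OWL-QN code) zeroes every component of a candidate step whose sign disagrees with the reference direction:

import numpy as np

def orthant_project(x, v):
    # pi(x; v)_i = x_i if sign(x_i) == sign(v_i), else 0 (coordinatewise orthant projection).
    return np.where(np.sign(x) == np.sign(v), x, 0.0)

# Components of the quasi-Newton step whose sign disagrees with the reference
# direction v are zeroed, keeping the step inside the chosen orthant.
step = np.array([0.8, -0.3, 0.1, -0.5])
v    = np.array([1.0,  0.4, 0.2, -0.9])
print(orthant_project(step, v))   # -> [ 0.8  0.   0.1 -0.5]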

O-LBFGS

O-LBFGS is an online, stochastic variant of the limited-memory BFGS algorithm, tailored for problems where data is processed sequentially or in mini-batches rather than in full batches. Developed by Schraudolph, Yu, and Günter in 2007, it extends the quasi-Newton framework to handle noisy gradient estimates, making it suitable for large-scale applications with streaming data. Unlike traditional L-BFGS, which relies on exact gradients and line searches, O-LBFGS incorporates stochastic gradient updates while maintaining a low-memory representation of the inverse Hessian that is applied through vector products.

The core mechanism involves storing a limited number m of curvature pairs (s_k, y_k), where s_k = x_{k+1} - x_k is the parameter displacement and y_k = g_{k+1} - g_k is the difference in gradients evaluated consistently on the same mini-batch to mitigate noise. The inverse Hessian approximation is applied on the fly using the standard two-loop recursion of L-BFGS to form the product H_k g_{k+1}: starting from q = g_{k+1}, the backward loop runs through the stored pairs computing \rho_i = \frac{1}{y_i^T s_i}, \quad \alpha_i = \rho_i (s_i^T q), \quad q \leftarrow q - \alpha_i y_i, followed by a forward loop that recovers H_k g_{k+1}. To adapt to the online setting, the line search is replaced by a diminishing gain sequence \eta_t = \frac{\tau}{\tau + t} \eta_0, ensuring step sizes decrease appropriately, and a model-trust parameter \lambda regularizes the curvature pairs to keep the approximation well conditioned. This results in an iteration complexity of O(m d), where d is the problem dimension, comparable to SGD but with superior curvature exploitation.

O-LBFGS demonstrates robust performance in high-dimensional settings, converging approximately 20 times faster than SGD on non-realizable quadratic benchmarks with mini-batch sizes as small as 4. It has been applied to parameter estimation in conditional random fields for natural language processing tasks, such as text chunking on the CoNLL-2000 corpus and biomedical text chunking on the NLPBA-2004 corpus, where it achieves lower error rates with fewer iterations than natural gradient or SGD methods. Theoretical analysis confirms almost sure global convergence for convex objectives under standard assumptions.
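A minimal sketch of the diminishing gain schedule described above, with illustrative values for \tau and \eta_0 (these numbers are assumptions, not recommendations from the original paper):

def gain(t, eta0=0.1, tau=100.0):
    # Diminishing step size eta_t = tau / (tau + t) * eta0, replacing the line search.
    return tau / (tau + t) * eta0

print([round(gain(t), 4) for t in (0, 10, 100, 1000)])   # -> [0.1, 0.0909, 0.05, 0.0091]

In an online loop, the step at time t would combine this gain with the two-loop direction computed from curvature pairs whose gradients were evaluated on the same mini-batch.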

Implementations

Available Libraries

Several open-source libraries provide implementations of the Limited-memory BFGS (L-BFGS) algorithm and its variants, often derived from Jorge Nocedal's original Fortran code for unconstrained optimization. The seminal L-BFGS implementation, developed at Northwestern University's Optimization Center, serves as the foundation for many subsequent ports and remains available for direct use in numerical computing environments. A widely adopted C port, libLBFGS, translates this Fortran code while preserving its core functionality for large-scale unconstrained minimization, and it is licensed under the MIT License for broad accessibility. In Python, SciPy's optimize module includes the L-BFGS-B variant for bound-constrained problems via the minimize function, and L-BFGS is used in machine learning workflows through scikit-learn's solvers for models such as LogisticRegression and MLPClassifier.

For specialized variants, the L-BFGS-B algorithm, which handles simple bound constraints, is implemented in Fortran subroutines as described in ACM TOMS Algorithm 778, with modern wrappers available in languages like C for integration into larger systems. The Orthant-Wise L-BFGS (OWL-QN) variant, designed for L1-regularized sparse optimization, has implementations such as a C++ library released by Microsoft, enabling efficient handling of nonsmooth objectives in high-dimensional settings. In MATLAB, the Manopt toolbox supports Riemannian variants like rlbfgs (Riemannian L-BFGS) for optimization on manifolds, facilitating applications in geometry-constrained problems under the GNU General Public License v3.0.

Cross-language coverage extends L-BFGS accessibility across scientific computing ecosystems. In C++, the header-only LBFGS++ library implements both L-BFGS and L-BFGS-B for unconstrained and bound-constrained minimization, emphasizing simplicity and performance without external dependencies under a permissive open-source license. Julia's Optim.jl package incorporates L-BFGS with options for preconditioning and supports both unconstrained problems and related variants, licensed under the MIT License for seamless integration in high-performance numerical tasks. R's base stats package provides L-BFGS-B through the optim function, supporting box constraints natively and widely used in statistical modeling. Among deep learning frameworks as of November 2025, PyTorch includes L-BFGS in its torch.optim module for unconstrained optimization, and TensorFlow Probability provides lbfgs_minimize for differentiable functions, enabling integration into training pipelines. Most of these libraries are freely available under permissive open-source licenses such as BSD, MIT, or Apache 2.0, promoting widespread adoption. Recent developments have also introduced GPU-accelerated variants, such as cuLBFGSB, a CUDA-based implementation of L-BFGS-B that achieves up to a 10x speedup on GPU hardware for large-scale problems.
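For example, minimizing the Rosenbrock test function with SciPy's L-BFGS-B interface looks as follows; the call is a standard SciPy usage, while the bounds and option values shown are illustrative:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(10)
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
               bounds=[(-2.0, 2.0)] * 10,             # simple box constraints
               options={"maxcor": 10, "gtol": 1e-8})  # maxcor = number of stored pairs m
print(res.x, res.nit)   # solution estimate and iteration count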

Numerical Considerations

The memory parameter m in L-BFGS determines the number of vector pairs stored for approximating the inverse Hessian, trading off approximation accuracy against storage and computational cost; values around 5 are typically sufficient for small-scale problems, while 10 to 30 are recommended for large-scale applications to capture more curvature information without excessive overhead. The initial approximation is often a scaled identity H_0 = \gamma I, where \gamma is set to \frac{s_k^T y_k}{y_k^T y_k} from the most recent pair, or to an average over recent pairs, improving the initial search direction by reflecting the local curvature scale. Line search parameters are crucial for ensuring descent and curvature conditions; common choices include the Armijo constant c_1 = 10^{-4} for sufficient decrease and the Wolfe curvature constant c_2 = 0.9, which promote stable progress while satisfying quasi-Newton requirements.

Stability in L-BFGS implementations requires careful handling of degenerate updates to preserve positive definiteness of the Hessian approximation. If the curvature s_k^T y_k falls below a small threshold \epsilon (typically 10^{-8} or smaller), the corresponding (s_k, y_k) pair is skipped to avoid introducing ill-conditioning or non-positive definiteness. For near-singular updates, damping techniques bound the scaling factor, for example setting \gamma \leftarrow \min\left( \gamma, \frac{s_k^T y_k}{y_k^T y_k} \right), or apply regularization, ensuring the approximation remains well-behaved. In ill-conditioned problems, where eigenvalues of the Hessian span many orders of magnitude, preconditioning via diagonal scaling (e.g., using approximate second derivatives) or limited-memory approximations on transformed variables accelerates convergence by mitigating slow progress in flat directions.

Parallelization strategies for L-BFGS in distributed environments focus on scaling the function and gradient evaluations, since the core two-loop recursion involves only O(m) vector operations and is essentially serial. Gradient computations can be distributed across nodes using frameworks such as MapReduce, where partial gradients are computed locally and aggregated via all-reduce operations, enabling handling of billions of parameters on hundreds of machines without synchronizing the full curvature history. In multi-batch settings, stable quasi-Newton updating can be maintained by forming curvature pairs from gradients evaluated on overlapping samples between consecutive batches, which controls variance and allows fault-tolerant distributed execution while preserving quasi-Newton properties.

Common pitfalls in L-BFGS implementations include numerical overflow during the two-loop recursion in high dimensions or with poorly scaled variables, which can be mitigated by employing double-precision arithmetic throughout and monitoring intermediate quantities for bounds. Regularization, such as clipping extreme \gamma values or adding a small multiple of the identity to the updates, further prevents instability from noisy gradients or near-collinear steps.
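A minimal sketch of the curvature-skip and scaling safeguards described above (the function name update_memory and the threshold values are illustrative assumptions, not a prescribed implementation):

import numpy as np

def update_memory(memory, gamma, s, y, m=10, eps=1e-8):
    # Skip degenerate pairs (s^T y < eps) to preserve positive definiteness,
    # and safeguard the scaling factor as gamma <- min(gamma, s^T y / y^T y).
    sy = float(s @ y)
    if sy < eps:
        return memory, gamma                 # skip: pair would ill-condition the approximation
    memory.append((s, y))
    if len(memory) > m:
        memory.pop(0)                        # discard the oldest stored pair
    return memory, min(gamma, sy / float(y @ y))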
