Sequential quadratic programming
Sequential quadratic programming (SQP) is a class of iterative algorithms for solving nonlinear optimization problems subject to nonlinear constraints by approximating the original problem, at each iteration, with a quadratic programming subproblem.[1][2] These methods leverage the Karush-Kuhn-Tucker (KKT) conditions to formulate the subproblems, combining elements of Newton's method for unconstrained optimization with active set strategies for handling constraints.[1][3]
The origins of SQP trace back to Wilson's 1963 PhD thesis, which introduced the foundational idea of using quadratic approximations and Lagrange multipliers for constrained nonlinear programs.[2][3] Significant advancements occurred in the 1970s with contributions from Han and Powell, who developed quasi-Newton variants incorporating merit functions for global convergence and line search techniques to ensure descent.[2][3] Further refinements in the 1980s and 1990s, including trust-region approaches by Fletcher and filter methods, addressed issues like QP subproblem infeasibility and enhanced scalability for large-scale problems with sparse derivatives.[3]
In the SQP algorithm, at each iterate x_k, a quadratic subproblem is solved to determine a search direction d_k: minimize \nabla f(x_k)^T d + \frac{1}{2} d^T B_k d subject to linearized equality and inequality constraints \nabla h(x_k)^T d + h(x_k) = 0 and \nabla g(x_k)^T d + g(x_k) \geq 0, where B_k approximates the [Hessian](/page/Hessian) of the Lagrangian.[1][2] The next iterate is obtained via x_{k+1} = x_k + \alpha_k d_k, with step length \alpha_k selected using a merit function such as the augmented Lagrangian or \ell_1 penalty to promote global convergence.[2] Locally, SQP exhibits quadratic convergence when using the exact Hessian and superlinear convergence with quasi-Newton updates like BFGS, assuming regularity conditions such as linear independence of constraint gradients and strict complementarity.[2][3]
SQP methods are widely implemented in commercial and open-source software, including SNOPT for large-scale sparse problems and NPSOL for dense cases, making them suitable for applications in engineering design, chemical process optimization, and economics.[1][3] Their ability to handle general nonlinearities in objectives and constraints, while providing certificates of infeasibility, distinguishes them from interior-point or augmented Lagrangian methods, though they require efficient QP solvers and analytical derivatives for practical use.[1][3]
Background and Prerequisites
Nonlinear Programming Overview
Nonlinear programming (NLP) encompasses the optimization of a nonlinear objective function f(\mathbf{x}), where \mathbf{x} \in \mathbb{R}^n, subject to equality constraints \mathbf{c}(\mathbf{x}) = \mathbf{0} and inequality constraints \mathbf{g}(\mathbf{x}) \leq \mathbf{0}, with f, \mathbf{c}, and \mathbf{g} being nonlinear functions.[4] This formulation generalizes linear programming by allowing curvature in the objective and constraints, enabling the modeling of complex real-world phenomena.[5]
NLPs are classified into unconstrained problems, where no constraints are present; equality-constrained problems, involving only \mathbf{c}(\mathbf{x}) = \mathbf{0}; and general constrained cases that incorporate both equality and inequality constraints.[4] These categories reflect increasing complexity in handling feasibility and optimality. Solving NLPs presents significant challenges, including nonconvexity that can lead to multiple local minima rather than a unique global optimum, as well as heightened computational demands due to the nonlinearity disrupting straightforward search procedures like those in linear programming.[4]
The field emerged prominently in the post-World War II era, with foundational work in the early 1950s, but saw substantial growth and application in the 1960s and 1970s driven by engineering design problems and economic modeling needs, such as production planning under nonlinear costs.[6][4] For instance, consider minimizing f(x) = x^2 + \sin(x) subject to linear constraints like x \geq 0 and x \leq 2, where the oscillatory term introduces potential local minima. Quadratic programming, with its quadratic objective and linear constraints, acts as a key building block for addressing broader NLPs.[4]
Quadratic Programming Basics
Quadratic programming (QP) is a class of optimization problems characterized by a quadratic objective function and linear constraints. The standard formulation seeks to minimize \frac{1}{2} \mathbf{x}^T H \mathbf{x} + \mathbf{c}^T \mathbf{x}, where H is a symmetric positive semidefinite matrix, subject to equality constraints A \mathbf{x} = \mathbf{b} and inequality constraints G \mathbf{x} \leq \mathbf{d}. This structure ensures the problem is convex when H is positive semidefinite, allowing for a global minimum under feasible conditions.[7][8][9]
Several methods address the solvability of QP problems, each leveraging the convexity for efficient computation. Active-set strategies iteratively determine the binding constraints by solving reduced equality-constrained subproblems and adjusting the active set based on Lagrange multipliers. Interior-point methods navigate the feasible region interior using logarithmic barrier functions to handle inequalities, solving a sequence of barrier subproblems via Newton's method. Dual formulations transform the primal QP into a maximization problem over dual variables, often yielding insights into sensitivity and enabling specialized solvers.[7][10][11]
In broader optimization contexts, QP serves as a convex approximation to nonlinear problems, where the quadratic term models the local curvature of a more complex objective, facilitating reliable and computationally tractable local solutions. Convex QP instances exhibit polynomial-time solvability, as demonstrated by the ellipsoid method for bounded feasibility checks and interior-point approaches using self-concordant barrier functions, which achieve complexity bounds polynomial in the problem dimension and data precision.[12][13][14]
A representative application is mean-variance portfolio optimization, where the quadratic objective minimizes the portfolio's variance \mathbf{w}^T \Sigma \mathbf{w} (with \Sigma as the covariance matrix and \mathbf{w} as asset weights) subject to linear constraints on expected return \mu^T \mathbf{w} \geq r and budget \mathbf{1}^T \mathbf{w} = 1, balancing risk and return efficiently.[9][15]
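As a minimal numerical sketch (with hypothetical asset data, not taken from any cited source), the mean-variance problem with both constraints treated as active equalities reduces to a single KKT linear system that can be solved directly:

```python
import numpy as np

# Hypothetical three-asset data (illustrative values only).
Sigma = np.array([[0.10, 0.02, 0.01],
                  [0.02, 0.05, 0.01],
                  [0.01, 0.01, 0.08]])   # covariance matrix (positive definite)
mu = np.array([0.12, 0.08, 0.10])        # expected returns
r = 0.10                                  # target return, treated as an equality

# KKT system for the equality-constrained QP
#   minimize (1/2) w^T Sigma w  s.t.  mu^T w = r,  1^T w = 1
A = np.vstack([mu, np.ones(3)])           # 2 x 3 constraint matrix
K = np.block([[Sigma, A.T],
              [A, np.zeros((2, 2))]])
rhs = np.concatenate([np.zeros(3), [r, 1.0]])

sol = np.linalg.solve(K, rhs)
w = sol[:3]                               # minimum-variance weights

print(w, mu @ w, w.sum())
```

In practice the return constraint is an inequality handled by an active-set or interior-point QP solver; the direct solve above applies only once the active constraints have been fixed.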
Theoretical Foundations
Karush-Kuhn-Tucker Conditions
The Karush–Kuhn–Tucker (KKT) conditions are first-order necessary optimality criteria for local solutions to nonlinear programming problems involving both equality and inequality constraints. These conditions extend the classical Lagrange multiplier theorem from equality-constrained optimization to handle inequalities by incorporating nonnegative dual variables and a complementarity requirement. They form a cornerstone of constrained optimization theory, providing the framework for verifying optimality and deriving algorithms that seek points satisfying them.
The conditions were first derived by William Karush in his 1939 master's thesis, where he addressed minima of functions subject to inequality side constraints. Independently, Harold W. Kuhn and Albert W. Tucker formalized and published them in 1951 as part of their work on nonlinear programming, earning the conditions their common acronym. Karush's contribution remained largely unrecognized until the 1970s, after which the full name was adopted to honor all three originators.[16]
Consider the general nonlinear programming problem of minimizing a twice continuously differentiable objective function f: \mathbb{R}^n \to \mathbb{R} subject to m equality constraints c(x) = 0, where c: \mathbb{R}^n \to \mathbb{R}^m is continuously differentiable, and p inequality constraints g(x) \leq 0, where g: \mathbb{R}^n \to \mathbb{R}^p is continuously differentiable:
\begin{align*}
\min_{x \in \mathbb{R}^n} \quad & f(x) \\
\text{subject to} \quad & c(x) = 0, \\
& g(x) \leq 0.
\end{align*}
To derive the KKT conditions, form the Lagrangian \mathcal{L}(x, \lambda, \mu) = f(x) + \lambda^\top c(x) + \mu^\top g(x), where \lambda \in \mathbb{R}^m and \mu \in \mathbb{R}^p are Lagrange multipliers for the equalities and inequalities, respectively. Suppose x^* is a local minimizer. For the equality constraints, the method of Lagrange multipliers yields the stationarity condition \nabla_x \mathcal{L}(x^*, \lambda^*, \mu^*) = 0 for some \lambda^*, which gives
\nabla f(x^*) + \sum_{i=1}^m \lambda_i^* \nabla c_i(x^*) + \sum_{j=1}^p \mu_j^* \nabla g_j(x^*) = 0.
For the inequalities, the derivation considers the active set \mathcal{A}(x^*) = \{ j \mid g_j(x^*) = 0 \}. Perturbations around x^* must not decrease the objective while satisfying the constraints, leading to the requirement that multipliers \mu_j^* for inactive constraints (j \notin \mathcal{A}(x^*)) can be zero, but for active ones, they must be nonnegative to balance the gradient. This results in the full set of KKT conditions, which hold if a constraint qualification is satisfied:
- Stationarity: there exist multipliers \lambda^* \in \mathbb{R}^m and \mu^* \in \mathbb{R}^p such that \nabla f(x^*) + \sum_{i=1}^m \lambda_i^* \nabla c_i(x^*) + \sum_{j=1}^p \mu_j^* \nabla g_j(x^*) = 0.
- Primal feasibility: c(x^*) = 0 and g(x^*) \leq 0.
- Dual feasibility: \mu^* \geq 0.
- Complementary slackness: \mu_j^* g_j(x^*) = 0 for all j = 1, \dots, p.
These conditions ensure that at x^*, the objective gradient is a nonnegative linear combination of the active constraint gradients, generalizing the unconstrained case \nabla f(x^*) = 0.
The KKT conditions are necessary under appropriate constraint qualifications (CQs), which prevent degeneracy and ensure the existence and uniqueness of multipliers. The linear independence constraint qualification (LICQ) requires that the set \{ \nabla c_i(x^*) \mid i=1,\dots,m \} \cup \{ \nabla g_j(x^*) \mid j \in \mathcal{A}(x^*) \} consists of linearly independent vectors. A weaker condition is the Mangasarian–Fromovitz constraint qualification (MFCQ), which states that \{ \nabla c_i(x^*) \mid i=1,\dots,m \} are linearly independent and there exists a vector d \in \mathbb{R}^n such that \nabla c(x^*)^\top d = 0 and \nabla g_j(x^*)^\top d < 0 for all j \in \mathcal{A}(x^*). Both CQs guarantee that any local minimizer satisfies the KKT conditions, with LICQ also implying unique multipliers.
In quadratic programming, the KKT conditions simplify to a linear complementarity problem, serving as exact optimality criteria that sequential methods approximate for general nonlinear cases.
As an illustrative example, consider the simple problem \min_{x \in \mathbb{R}} x^2 subject to x \geq 1, or equivalently g(x) = 1 - x \leq 0. The unconstrained minimizer x = 0 violates the constraint, so check x^* = 1. Here, f(x) = x^2, \nabla f(1) = 2, and \nabla g(1) = -1. Stationarity requires 2 + \mu^* (-1) = 0, so \mu^* = 2 \geq 0. Primal feasibility holds since g(1) = 0 \leq 0, and complementary slackness holds since \mu^* g(1) = 2 \cdot 0 = 0. Thus, x^* = 1 satisfies the KKT conditions under LICQ (a single active constraint gradient).
Newton-Type Methods in Optimization
Newton-type methods in optimization leverage second-order information, specifically the Hessian matrix of the objective function or Lagrangian, to compute search directions that accelerate convergence compared to first-order methods like gradient descent. These methods form the foundation for sequential quadratic programming (SQP) by approximating the nonlinear problem locally with quadratic models.[17]
In unconstrained optimization, the Newton method seeks to minimize a twice-differentiable function f: \mathbb{R}^n \to \mathbb{R} by iteratively solving the system \nabla^2 f(x_k) d_k + \nabla f(x_k) = 0 for the step d_k, yielding the update x_{k+1} = x_k + d_k. Near a strict local minimum where the Hessian is positive definite, this method exhibits quadratic convergence, meaning the error reduces quadratically with each iteration.[17]
For equality-constrained optimization, the method extends to the Lagrangian \mathcal{L}(x, \lambda) = f(x) + \lambda^T c(x), where c(x) = 0 are the constraints. The Newton step solves the linear system
\begin{bmatrix}
\nabla_{xx}^2 \mathcal{L}(x_k, \lambda_k) & \nabla c(x_k) \\
\nabla c(x_k)^T & 0
\end{bmatrix}
\begin{bmatrix}
d_k \\
\Delta \lambda_k
\end{bmatrix}
= -
\begin{bmatrix}
\nabla f(x_k) + \nabla c(x_k) \lambda_k \\
c(x_k)
\end{bmatrix},
providing a direction that reduces both the objective and constraint violation when the linearized constraints are consistent and the Hessian on the tangent space is positive definite. This approach relates to the Karush-Kuhn-Tucker (KKT) conditions by iteratively satisfying the associated nonlinear system.[17]
When exact second derivatives are unavailable or computationally expensive, quasi-Newton methods approximate the Hessian (or its inverse) using gradient information from successive iterations. Seminal updates include the Davidon-Fletcher-Powell (DFP) formula, which maintains an inverse-Hessian approximation \tilde{H}_k satisfying the secant condition \tilde{H}_{k+1} y_k = s_k while preserving symmetry and positive definiteness under suitable line searches, and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update, which maintains a Hessian approximation B_k satisfying B_{k+1} s_k = y_k and often performs better in practice due to its numerical stability.[17]
These methods offer superlinear or quadratic local convergence rates, exploiting curvature information for efficient progress toward optima, but they can fail globally without safeguards like line searches or trust regions, as pure Newton steps may increase the function value far from the solution.[17]
As an illustrative example, consider minimizing the one-dimensional function f(x) = (x-1)^2 + (x-2)^4, which has its minimum near x \approx 1.41. Starting at x_0 = 0, the gradient is \nabla f(0) = -34 and the Hessian is \nabla^2 f(0) = 50, so the Newton step is d_0 = -(\nabla^2 f(0))^{-1} \nabla f(0) = 34/50 = 0.68, yielding x_1 = 0.68. Subsequent steps converge rapidly, demonstrating the inversion of the (here scalar) Hessian to compute the direction.[17]
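The iteration above can be sketched in a few lines (Python, using nothing beyond the formulas in the text):

```python
# Newton's method on f(x) = (x - 1)^2 + (x - 2)^4 from the example above.
def grad(x):
    return 2.0 * (x - 1.0) + 4.0 * (x - 2.0) ** 3

def hess(x):
    return 2.0 + 12.0 * (x - 2.0) ** 2

x = 0.0
for k in range(8):
    d = -grad(x) / hess(x)   # Newton step: solve f''(x) d = -f'(x)
    x = x + d

print(x)   # converges to the stationary point near x = 1.41
```

The first step reproduces d_0 = 34/50 = 0.68; because f is strictly convex (f''(x) = 2 + 12(x-2)^2 > 0), the pure Newton iteration converges here without safeguards.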
Core SQP Framework
Sequential quadratic programming (SQP) addresses equality-constrained nonlinear optimization problems of the form \min_x f(x) subject to c(x) = 0, where f: \mathbb{R}^n \to \mathbb{R} is a nonlinear objective function and c: \mathbb{R}^n \to \mathbb{R}^m consists of m nonlinear equality constraints with m < n. At each iteration k, starting from a current estimate x_k, SQP approximates the nonlinear problem by solving an equality-constrained quadratic program (QP) that captures the local curvature of the objective and linearizes the constraints. The QP subproblem is formulated as
\min_d \ \nabla f(x_k)^T d + \frac{1}{2} d^T B_k d \quad \text{subject to} \quad \nabla c(x_k)^T d + c(x_k) = 0,
where B_k is a symmetric positive definite matrix approximating the Hessian of the Lagrangian \mathcal{L}(x, \lambda) = f(x) + \lambda^T c(x) with respect to x. This subproblem derives from applying Newton's method to the first-order Karush-Kuhn-Tucker (KKT) conditions of the original problem, yielding a local quadratic model of the objective and a linear approximation of the constraints.
The solution to the QP subproblem provides both the search direction d_k and an updated estimate of the Lagrange multipliers \lambda_{k+1}. Specifically, the KKT conditions for the QP are
B_k d_k + \nabla c(x_k) \lambda_{k+1} = -\nabla f(x_k), \quad \nabla c(x_k)^T d_k = -c(x_k),
which form a linear system solved for d_k and \lambda_{k+1}. The next iterate is then computed as x_{k+1} = x_k + \alpha_k d_k, where the step length \alpha_k > 0 is chosen via a line search procedure to ensure sufficient decrease in a merit function, as discussed in subsequent sections. This update maintains feasibility approximately while progressing toward the optimum.
The matrix B_k may be taken as the exact second derivative \nabla_{xx}^2 \mathcal{L}(x_k, \lambda_k), leading to a Newton-type method with potential quadratic convergence near the solution, or approximated using quasi-Newton updates to reduce computational cost. Common quasi-Newton approximations, such as the BFGS formula, update B_k based on differences in gradient information from successive iterates, ensuring positive definiteness while requiring only first-order oracle access. For instance, the BFGS update is
B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k},
where s_k = x_{k+1} - x_k and y_k = \nabla_x \mathcal{L}(x_{k+1}, \lambda_{k+1}) - \nabla_x \mathcal{L}(x_k, \lambda_k). This approximation maintains superlinear convergence under suitable conditions.
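A quick numerical check of the BFGS formula (with illustrative vectors s_k and y_k chosen so the curvature condition y_k^T s_k > 0 holds) confirms the secant condition B_{k+1} s_k = y_k and the preservation of symmetry and positive definiteness:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation B,
    assuming the curvature condition y^T s > 0 holds."""
    Bs = B @ s
    return (B
            + np.outer(y, y) / (y @ s)
            - np.outer(Bs, Bs) / (s @ Bs))

B = np.eye(2)
s = np.array([0.3, -0.1])          # step x_{k+1} - x_k (illustrative)
y = np.array([0.5,  0.2])          # Lagrangian-gradient difference (illustrative)

B1 = bfgs_update(B, s, y)
print(B1 @ s, y)                   # secant condition: B1 s = y
```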
To illustrate, consider the problem \min_{x,y} x^2 + y^2 subject to x + y = 1. At the initial point (x_k, y_k) = (0, 0), we have \nabla f = (0, 0), c_k = -1, and \nabla c = (1, 1). Assuming B_k = I_2 (the 2x2 identity matrix) as a simple initial approximation, the QP subproblem becomes
\min_{d_x, d_y} \ \frac{1}{2} (d_x^2 + d_y^2) \quad \text{subject to} \ d_x + d_y = 1.
The solution yields d = (0.5, 0.5) and \lambda_{k+1} = -0.5, providing a direction that moves toward the optimum (0.5, 0.5). Subsequent iterations refine this using updated B_k and line search. This example demonstrates how the QP linearizes the constraint and quadratically models the objective to generate feasible progress.[2]
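The worked example can be reproduced by solving the QP's KKT system directly; the sketch below assumes B_k = I as in the text:

```python
import numpy as np

# QP subproblem at x_k = (0, 0) for min x^2 + y^2 s.t. x + y = 1, using
# the KKT system  B_k d + \nabla c \lambda = -\nabla f,  \nabla c^T d = -c.
B = np.eye(2)
grad_f = np.array([0.0, 0.0])
grad_c = np.array([[1.0], [1.0]])   # column vector \nabla c(x_k)
c = np.array([-1.0])                # c(x_k) = 0 + 0 - 1

K = np.block([[B, grad_c],
              [grad_c.T, np.zeros((1, 1))]])
rhs = -np.concatenate([grad_f, c])
sol = np.linalg.solve(K, rhs)
d, lam = sol[:2], sol[2]

print(d, lam)   # d = [0.5, 0.5], lambda = -0.5
```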
Inequality-Constrained Extension
The inequality-constrained extension of sequential quadratic programming (SQP) addresses general nonlinear programs of the form \min_x f(x) subject to c(x) = 0 and g(x) \geq 0, where f is the objective, c are equality constraints, and g are inequality constraints. This builds on the equality-constrained case by incorporating inequality handling into the quadratic programming (QP) subproblem at each iteration k, approximating the nonlinear program locally around the current point x_k. The QP subproblem is formulated as
\min_d \ \nabla f(x_k)^T d + \frac{1}{2} d^T B_k d
subject to \nabla c(x_k)^T d + c(x_k) = 0 and \nabla g_i(x_k)^T d + g_i(x_k) \geq 0 for each inequality index i, where B_k approximates the Hessian of the Lagrangian.[2][3]
A common approach to solving this QP subproblem is the active-set method, which estimates a working set \mathcal{A}_k \subseteq \{1, \dots, p\} of potentially active inequalities at x_k. The inequalities in \mathcal{A}_k are treated as equalities in the QP, yielding a reduced problem:
\min_d \ \nabla f(x_k)^T d + \frac{1}{2} d^T B_k d \quad \text{s.t.} \quad \nabla c(x_k)^T d + c(x_k) = 0, \ \nabla g_i(x_k)^T d + g_i(x_k) = 0 \ \forall i \in \mathcal{A}_k,
with the remaining inequalities checked for feasibility post-solution. The QP multipliers \mu_{k,i} \geq 0 for i \in \mathcal{A}_k are computed; if any \mu_{k,i} < 0, the constraint is removed from the active set, or if a violated inequality exists outside \mathcal{A}_k, it is added. This process iterates until a suitable QP solution is obtained, ensuring complementary slackness \mu_{k,i} ( \nabla g_i(x_k)^T d + g_i(x_k) ) = 0. The active-set strategy was foundational in early SQP developments for inequalities.[2][3]
For a more integrated treatment, the full Karush-Kuhn-Tucker (KKT) system of the QP subproblem can be solved directly, incorporating updates to Lagrange multipliers \lambda for equalities and \mu for inequalities. Assuming an active set \mathcal{A}_k, the system is
\begin{bmatrix}
B_k & \nabla c(x_k) & \nabla g_{\mathcal{A}_k}(x_k) \\
\nabla c(x_k)^T & 0 & 0 \\
\nabla g_{\mathcal{A}_k}(x_k)^T & 0 & 0
\end{bmatrix}
\begin{bmatrix}
d \\
\lambda_{k+1} \\
\mu_{k+1, \mathcal{A}_k}
\end{bmatrix}
= -
\begin{bmatrix}
\nabla f(x_k) \\
c(x_k) \\
g_{\mathcal{A}_k}(x_k)
\end{bmatrix},
where complementarity is enforced through the active-set choice: constraints placed in \mathcal{A}_k must yield multipliers \mu_{k+1,i} \geq 0, while inactive constraints have \mu_{k+1,i} = 0. This primal-dual formulation facilitates efficient computation in large-scale settings.[3][18]
Nonconvexity in the QP subproblem arises when B_k is indefinite, potentially leading to unbounded or multiple local solutions, which can cause numerical instability or failure to progress. To mitigate this, B_k is often modified to be positive semidefinite (e.g., via eigenvalue adjustment) or embedded in a trust-region constraint \|d\| \leq \Delta_k, ensuring a well-posed QP while preserving superlinear convergence near solutions. Such safeguards are essential for practical reliability.[2][3]
As an illustrative example, consider \min_x x^2 subject to x \geq 0. The global minimum is at x = 0. At an interior point x_k > 0, the inequality is inactive, so the QP reduces to \min_d 2x_k d + d^2 with no constraint on d, yielding d = -x_k. Near the boundary, say at x_k = 0, the active set includes the inequality, transforming the QP to \min_d d^2 subject to d \geq 0, with solution d = 0 and \mu_k = 0, confirming stationarity as the unconstrained QP minimum lies on the boundary without requiring constraint force. This demonstrates how active-set handling captures boundary optima.[2]
Iterative Algorithm Description
Sequential quadratic programming (SQP) operates as an iterative procedure to solve nonlinear optimization problems by successively approximating the original problem with a sequence of quadratic programming (QP) subproblems. At each iteration k, given a current estimate x_k of the solution and associated Lagrange multipliers \lambda_k, the method forms a QP subproblem that uses a linear approximation of the constraints and a quadratic approximation of the Lagrangian objective function. The solution to this QP provides a search direction d_k and updated multipliers \lambda_{k+1}, which are then used to compute a step length \alpha_k and advance to the next iterate x_{k+1} = x_k + \alpha_k d_k. This process continues until convergence criteria are met, generating a sequence of points that ideally approach a local optimum satisfying the Karush-Kuhn-Tucker (KKT) conditions.[2]
The core iterative algorithm can be outlined in pseudocode as follows:
Initialize: Choose initial point x_0, multipliers λ_0, and Hessian approximation B_0 (e.g., [identity matrix](/page/Identity_matrix)); set k = 0; choose tolerance ε > 0.
While not converged:
Form QP subproblem: Minimize ∇f(x_k)^T d + (1/2) d^T B_k d
subject to ∇c(x_k)^T d + c(x_k) = 0 (equalities)
and ∇g(x_k)^T d + g(x_k) ≥ 0 (inequalities),
where c and g are the [constraint](/page/Constraint) functions.
Solve [QP](/page/QP) to obtain direction d_k and multipliers λ_{k+1}.
Compute step length α_k (e.g., via [line search](/page/Line_search) to reduce a merit function).
Set x_{k+1} = x_k + α_k d_k.
Update B_{k+1} (e.g., using BFGS formula).
Set k = k + 1.
End while
Return x_k, λ_k as approximate solution.
This pseudocode integrates the QP subproblem formation and solution as described in the equality- and inequality-constrained formulations.[2][3]
Termination occurs when the iterates satisfy approximate first-order optimality conditions, typically checked by ensuring the norm of the Lagrangian gradient \|\nabla L(x_k, \lambda_k)\| < \epsilon, the equality constraint violation \|c(x_k)\| < \epsilon, and a measure of inequality violation and complementarity (e.g., \|\min(g(x_k), \lambda_k)\| < \epsilon) are all sufficiently small. These criteria verify proximity to a KKT point, with \epsilon often set to 10^{-6} or smaller depending on problem scale and desired accuracy.[2]
Initialization strategies aim to provide a reasonable starting point to promote rapid convergence and avoid infeasible QP subproblems. For equality-constrained problems, a feasible x_0 can be found by solving a least-squares problem minimizing \|c(x)\|^2; for inequalities, slack variables may be introduced to render the initial point feasible, or a phase-I procedure minimizes constraint violations before the main loop. Multipliers \lambda_0 are often set to zero or estimated from a least-squares fit \nabla c(x_0) \lambda \approx -\nabla f(x_0), while B_0 is typically a positive definite matrix like the identity to ensure QP convexity.[3][2]
The computational cost per iteration is dominated by solving the QP subproblem, which for a problem with n variables and m constraints generally requires O(n^3) operations in the dense case using methods like active-set or interior-point solvers, though sparsity can reduce this to O(n^2) or better with specialized factorizations. Updating the Hessian approximation B_k adds O(n^2) cost via quasi-Newton formulas.[3]
To illustrate, consider the toy nonlinear program: minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 \geq 0. The optimum is at (0.5, 0.5) with f = 0.5, where the constraint is active. Assume an initial feasible point x_0 = (1, 0), \lambda_0 = 0, and Hessian approximation B_0 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} (matching the true Hessian of the Lagrangian). At iteration 0, \nabla f(x_0) = (2, 0)^T, \nabla g(x_0) = (1, 1)^T, and g(x_0) = 0. The QP subproblem is: minimize 2 d_x + d_x^2 + d_y^2 subject to d_x + d_y \geq 0. Assuming the constraint binds (verified post-solution), solving yields d_0 = (-0.5, 0.5)^T and \lambda_1 = 1. Taking a full step \alpha_0 = 1 (as it reduces the objective), x_1 = (0.5, 0.5). At iteration 1, \nabla L(x_1, \lambda_1) = (0, 0)^T and g(x_1) = 0, so convergence is achieved (e.g., \|\nabla L\| < 10^{-6}). No further iterations are needed, since the quadratic model matches the true Lagrangian exactly here. This example highlights how SQP rapidly identifies the active set and converges in few steps for low-dimensional problems.[2]
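The first iteration of this example can be checked numerically by solving the QP's KKT system with the inequality held active (a sketch, using the sign convention B_k d - \mu \nabla g = -\nabla f for g \geq 0 constraints):

```python
import numpy as np

# One SQP iteration for: min x^2 + y^2  s.t.  g(x, y) = x + y - 1 >= 0,
# starting at x0 = (1, 0) with B0 = 2 I, assuming the constraint binds.
x = np.array([1.0, 0.0])
B = 2.0 * np.eye(2)
grad_f = 2.0 * x                     # \nabla f = (2x, 2y)
grad_g = np.array([1.0, 1.0])
g = x[0] + x[1] - 1.0                # g(x0) = 0 (active)

# KKT system with the inequality treated as an equality:
#   B d - mu * grad_g = -grad_f,   grad_g^T d = -g
K = np.block([[B, -grad_g.reshape(2, 1)],
              [grad_g.reshape(1, 2), np.zeros((1, 1))]])
rhs = np.concatenate([-grad_f, [-g]])
sol = np.linalg.solve(K, rhs)
d, mu = sol[:2], sol[2]

x_new = x + d                        # full step alpha = 1
print(d, mu, x_new)                  # d = [-0.5, 0.5], mu = 1, x_new = [0.5, 0.5]
```

The recovered multiplier mu = 1 is nonnegative, validating the assumption that the constraint binds.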
Analysis and Properties
Local Convergence Theory
Local convergence theory for sequential quadratic programming (SQP) methods establishes that, under suitable regularity conditions at a solution, the iterates converge rapidly to a local optimum of the nonlinear program. Specifically, when the exact Hessian of the Lagrangian is used in the quadratic subproblems, SQP exhibits quadratic convergence provided the linear independence constraint qualification (LICQ) and strong second-order sufficient conditions (SSOSC) hold. These conditions ensure the solution is isolated and the method behaves like a Newton iteration on the Karush-Kuhn-Tucker (KKT) system. If a quasi-Newton approximation to the Hessian is employed instead, convergence is superlinear, assuming the approximation satisfies standard quality conditions such as the Dennis-Moré criterion.
The key assumptions underpinning this local theory include LICQ, which requires the gradients of the active constraints to be linearly independent at the solution, guaranteeing unique Lagrange multipliers and well-posedness of the KKT conditions. Additionally, SSOSC demands that the Hessian of the Lagrangian be positive definite on the tangent space to the active constraints, confirming the point is a strict local minimizer. Strict complementarity slackness is also typically assumed, ensuring nonzero multipliers for active inequality constraints, which stabilizes the identification of the active set near the solution. These assumptions are standard in nonlinear optimization and are verified at the KKT point.
A proof sketch for quadratic convergence with the exact Hessian interprets the SQP step as a Newton iteration applied to the nonlinear KKT system. Let e_k = (x_k - x^*, \lambda_k - \lambda^*) denote the error in the primal-dual pair at iteration k. The Jacobian of the KKT residual at the solution is the KKT matrix
\begin{pmatrix} \nabla^2_{xx} L(x^*, \lambda^*) & A(x^*) \\ A(x^*)^T & 0 \end{pmatrix},
where A(x^*) is the matrix of active constraint gradients. Under LICQ and SSOSC this matrix is nonsingular, and the standard Newton analysis, using a second-order Taylor expansion of the KKT residual, yields
\|e_{k+1}\| \leq C \|e_k\|^2
for some constant C > 0, i.e., quadratic convergence. For quasi-Newton variants, the approximation error in the Hessian introduces an additional first-order term, reducing the rate to superlinear, \|e_{k+1}\| = o(\|e_k\|).
Sensitivity analysis reveals that the Lagrange multipliers converge at the same rate as the primal variables: quadratically with the exact Hessian and superlinearly with quasi-Newton approximations. This follows from the nonsingularity of the KKT matrix under LICQ and the continuous dependence of multipliers on the primal solution.
Despite these strong guarantees, the local convergence can fail if the assumptions are violated, such as near degenerate constraints where LICQ does not hold, leading to nonunique multipliers or ill-conditioned subproblems. Similarly, absence of strict complementarity may cause erratic active set changes, disrupting the superlinear or quadratic behavior.
Global Convergence Enhancements
Sequential quadratic programming (SQP) methods, while exhibiting rapid local convergence near solutions, can fail to progress from arbitrary initial points due to the lack of descent guarantees in the nonlinear problem. To address this, global convergence enhancements incorporate mechanisms that ensure monotonic decrease in a suitable merit function, preventing divergence or cycling and promoting convergence to stationary points under mild assumptions. These techniques transform the purely local Newton-like updates into globally reliable algorithms, often at the cost of slightly slower local rates.[19]
A key component is the use of merit functions to balance objective reduction and constraint satisfaction. One common choice is the quadratic penalty merit function, defined as
\phi(x, \rho) = f(x) + \frac{\rho}{2} \|c(x)\|^2 + \frac{\rho}{2} \|\min(0, g(x))\|^2,
where f(x) is the objective, c(x) and g(x) are the equality and inequality constraint functions, respectively, and \rho > 0 is a penalty parameter that may be updated iteratively to enforce feasibility. This quadratic penalty form provides a continuously differentiable measure of progress, with the \min(0, g(x)) term penalizing violations of inequalities while ignoring satisfied ones. The search direction from the QP subproblem is then accepted only if it yields sufficient decrease in \phi, ensuring descent even when the linearized constraints are inaccurate far from the solution.
To achieve this decrease, line search procedures are employed along the computed direction d_k. Standard Armijo conditions require that the step length \alpha_k satisfies \phi(x_k + \alpha_k d_k, \rho_k) \leq \phi(x_k, \rho_k) + \eta \alpha_k \nabla \phi(x_k, \rho_k)^T d_k for some $0 < \eta < 1, often combined with Wolfe curvature conditions to promote full steps near the solution. These backtracking searches safeguard against overshooting, maintaining bounded iterates and global convergence properties. Alternatively, trust-region variants constrain the QP subproblem to \|d\| \leq \Delta_k, solving for the direction within a local model trust region of radius \Delta_k. The radius is then adjusted—expanded if the actual merit function decrease exceeds a predicted model decrease by a factor greater than \gamma \in (0,1), or reduced otherwise—providing a geometry-based globalization that handles ill-conditioning robustly.[20]
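A minimal backtracking routine implementing the Armijo condition above might look as follows (a sketch; the merit function, its directional derivative, and the parameter values are illustrative assumptions, not a specific solver's implementation):

```python
import numpy as np

def armijo_backtrack(phi, x, d, dphi, eta=1e-4, beta=0.5, max_iter=50):
    """Return a step length alpha satisfying the Armijo condition
    phi(x + alpha d) <= phi(x) + eta * alpha * dphi,
    where dphi < 0 is the directional derivative of phi at x along d."""
    alpha = 1.0
    phi0 = phi(x)
    for _ in range(max_iter):
        if phi(x + alpha * d) <= phi0 + eta * alpha * dphi:
            return alpha
        alpha *= beta                # shrink the step and retry
    return alpha

# Usage on a quadratic penalty merit for min x^2 s.t. x = 1 (illustrative):
rho = 10.0
phi = lambda x: x[0]**2 + 0.5 * rho * (x[0] - 1.0)**2
x = np.array([0.0])
d = np.array([1.0])                  # descent direction toward x = 1
dphi = -rho                          # phi'(0) = 2*0 + rho*(0 - 1) = -rho
alpha = armijo_backtrack(phi, x, d, dphi)
```

Here the full step alpha = 1 already satisfies the sufficient-decrease test, so no backtracking occurs; far from a solution the loop would shrink alpha geometrically.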
When the Hessian approximation B_k in the QP subproblem is indefinite, leading to potentially unbounded or multiple solutions, modifications are applied to convexify it. A common approach adds a shifting term, such as \sigma_k I with \sigma_k \geq 0 chosen minimally to render B_k + \sigma_k I positive semidefinite, preserving sparsity and ensuring the QP remains bounded below while approximating the true curvature. This modification maintains descent properties without altering the null space of the constraints. Under assumptions like Lipschitz continuity of the problem functions, boundedness of iterates and Hessian approximations, and nondegeneracy, these enhancements guarantee that the sequence \{x_k\} converges globally to a point satisfying the Karush-Kuhn-Tucker conditions.[21]
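The minimal-shift convexification can be sketched as follows (the margin parameter and the example matrix are illustrative assumptions; an eigenvalue decomposition is used for clarity, though practical codes often infer the shift from a factorization):

```python
import numpy as np

def convexify(B, margin=1e-8):
    """Choose the smallest sigma >= 0 (up to a small margin) such that
    B + sigma * I is positive semidefinite."""
    lam_min = np.linalg.eigvalsh(B).min()
    sigma = max(0.0, -lam_min + margin)
    return B + sigma * np.eye(B.shape[0]), sigma

B = np.array([[1.0, 2.0],
              [2.0, -3.0]])          # indefinite example matrix
B_pd, sigma = convexify(B)
print(sigma, np.linalg.eigvalsh(B_pd).min())
```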
Implementation Aspects
Merit Functions and Step Adjustment
In sequential quadratic programming (SQP), merit functions serve as auxiliary scalar functions that combine the objective and constraint violation terms to guide the line search and ensure descent toward feasible and optimal points.[3] A common choice is the \ell_1 penalty merit function, defined for problems with equality constraints c(x) = 0 and inequality constraints g(x) \leq 0 as
m(x; \rho) = f(x) + \rho \|c(x)\|_1 + \rho \| [g(x)]_+ \|_1,
where \rho > 0 is a penalty parameter, \| \cdot \|_1 denotes the \ell_1-norm (sum of absolute values), and [g(x)]_+ = \max(0, g(x)) elementwise.[3] This function penalizes constraint violations while promoting reduction in the objective f(x). The \ell_1 merit function is nonsmooth due to the absolute values but computationally inexpensive, as it requires only function evaluations rather than gradients of the penalty terms.[3]
Exact penalty functions, such as the \ell_1 merit with sufficiently large \rho > \|\pi^*\|_\infty (where \pi^* are the optimal Lagrange multipliers), ensure that local minimizers of the merit function coincide with those of the original constrained problem, avoiding "phantom" solutions that might arise from smaller penalties.[3] In contrast, approximate penalties with smaller \rho balance feasibility and optimality progress but may require careful updating to maintain descent properties. The penalty parameter \rho is typically initialized to a modest value and updated dynamically: it is increased (e.g., doubled) if the norm of the multipliers from the quadratic program subproblem exceeds the current \rho, indicating potential inconsistency between the step and constraint satisfaction.[3] This strategy, originally proposed in early SQP frameworks, prevents stagnation when constraints are active or ill-conditioned.[3]
Line search procedures in SQP use the merit function to determine an acceptable step length \alpha_k > 0 along the search direction d_k from the QP subproblem, ensuring sufficient decrease. A standard variant is backtracking line search with the Armijo condition, which accepts \alpha_k if
m(x_k + \alpha_k d_k; \rho_k) \leq m(x_k; \rho_k) + \eta \alpha_k \nabla m(x_k; \rho_k)^T d_k,
where 0 < \eta < 1 (often \eta = 10^{-4}) and \alpha_k is reduced from an initial value (e.g., 1) by a factor like 0.5 until the condition is satisfied.[3] This condition guarantees a fraction \eta of the predicted linear decrease, promoting monotonic reduction in the merit function and contributing to global convergence when combined with positive definiteness of the Hessian approximation.[3]
To address cases where strict monotonicity leads to overly conservative steps or the Maratos effect (slow convergence near solutions), non-monotonic techniques like the watchdog method are employed. In the watchdog approach, the line search tolerates temporary increases in the merit function by referencing a "watchdog value" updated every few iterations (e.g., after q successful steps), allowing \alpha_k = 1 more readily while bounding overall progress.[3] This enhances efficiency in practice without compromising convergence guarantees.
For illustration, consider a simple iteration in a problem minimizing f(x) = x^2 subject to c(x) = x - 1 = 0, starting at x_0 = 2 with \rho_0 = 1. The QP yields direction d_0 = -1, and the directional derivative of the merit function is (2x_0 + \rho_0 \cdot \text{sign}(c(x_0))) \cdot d_0 = (4 + 1)(-1) = -5, so the predicted decrease is 5\alpha. With \eta = 0.1, test \alpha = 1: m(1; 1) = 1 + 1 \cdot |0| = 1, while m(2; 1) + 0.1 \cdot 1 \cdot (-5) = 5 - 0.5 = 4.5; since 1 < 4.5, accept \alpha = 1. If \rho were updated to 10 due to a multiplier norm exceeding 1, then m(2; 10) = 4 + 10 \cdot |1| = 14, the directional derivative becomes (4 + 10)(-1) = -14, and the right-hand side is 14 + 0.1 \cdot 1 \cdot (-14) = 14 - 1.4 = 12.6; since m(1; 10) = 1 < 12.6, the full step is still accepted, with feasibility now weighted more heavily.[3]
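The backtracking Armijo search and the worked example above can be reproduced in a few lines of Python; function and parameter names here are illustrative, and the directional derivative is supplied by the caller because the \ell_1 merit function is nonsmooth.

```python
def armijo_backtrack(phi, x, d, dphi_d, eta=0.1, beta=0.5, max_iter=30):
    """Backtracking line search: shrink alpha until the Armijo
    sufficient-decrease condition on the merit function phi holds.
    dphi_d is the directional derivative of phi along d at x."""
    alpha, phi0 = 1.0, phi(x)
    for _ in range(max_iter):
        if phi(x + alpha * d) <= phi0 + eta * alpha * dphi_d:
            return alpha
        alpha *= beta  # halve the step and retry
    return alpha

# Worked example: f(x) = x^2, c(x) = x - 1, rho = 1, x0 = 2, d0 = -1.
merit = lambda x: x**2 + 1.0 * abs(x - 1.0)
alpha = armijo_backtrack(merit, 2.0, -1.0, dphi_d=-5.0, eta=0.1)
print(alpha)  # 1.0: the full step is accepted, as in the text
```

With the penalty raised to \rho = 10 (merit x^2 + 10|x - 1|, directional derivative -14) the full step is likewise accepted, matching the calculation above.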
Constraint Handling Techniques
In sequential quadratic programming (SQP) methods, feasibility restoration is invoked during iterations when the current point is infeasible and the quadratic programming (QP) subproblem lacks a feasible solution, addressing violations of equality and inequality constraints. This technique solves an auxiliary QP that minimizes the squared norm of linearized constraint violations, formulated as minimizing \| \nabla c(x_k)^T d + c(x_k) \|^2 + \| [ \nabla g(x_k)^T d + g(x_k) ]_+ \|^2, where c denotes equality constraints, g \leq 0 inequality constraints, and [ \cdot ]_+ = \max(0, \cdot) elementwise. This phase continues until feasibility is restored or a minimum violation is achieved, allowing the algorithm to return to normal SQP steps.[22]
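A simplified dense sketch of such a restoration step follows. It approximates the stated minimization by treating only the currently violated inequalities (those with g > 0 under the g \leq 0 convention) as equalities and solving the resulting linear least-squares problem; real restoration phases handle the [\cdot]_+ term exactly and iterate.

```python
import numpy as np

def restoration_step(Jc, c, Jg, g):
    """Least-squares feasibility step: approximately minimize
    ||Jc d + c||^2 + ||max(0, Jg d + g)||^2 by including only the
    currently violated inequality rows (g > 0) as equalities."""
    viol = g > 0
    A = np.vstack([Jc, Jg[viol]])          # stacked constraint Jacobian
    b = -np.concatenate([c, g[viol]])      # drive linearized residuals to 0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d
```

For a well-determined stacked system the step zeroes the linearized violations exactly; otherwise it returns the minimum-norm least-squares reduction.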
Degeneracy in SQP arises when constraint gradients are linearly dependent or when strict complementarity fails, potentially leading to ill-conditioned QP subproblems and stalled convergence. To handle this, stabilized SQP methods introduce regularization via an augmented Lagrangian term in the QP, incorporating a penalty parameter \rho_k > 0 to penalize constraint violations and ensure the Hessian remains positive definite on the tangent space, with \rho_k bounded between constants \sigma_0 and \sigma_1 to maintain stability. Perturbations, such as small shifts in the right-hand side of linearized constraints, can also be applied to resolve linear dependence, promoting full-rank Jacobians without altering the solution set asymptotically. These approaches achieve local quadratic convergence even under violated Mangasarian-Fromovitz constraint qualifications.[23]
Slack variables enable SQP to initiate from infeasible points by reformulating inequality constraints g(x) \leq 0 as equalities g(x) + z = 0 with z \geq 0, transforming the QP subproblem into an equality-constrained form solvable via standard techniques. This allows the algorithm to generate steps that gradually reduce both objective and infeasibility, without requiring an initial feasible point, which is particularly useful for problems where finding a strictly feasible start is computationally expensive. The slacks are updated iteratively, converging to zero at feasible solutions under appropriate line search conditions.[2]
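A minimal sketch of the slack initialization for the reformulation g(x) + z = 0, z \geq 0; the clipping floor `z_min` is a hypothetical choice that keeps the slacks strictly positive even when the starting point violates some inequalities.

```python
import numpy as np

def init_slacks(g0, z_min=1e-3):
    """Initialize slacks for g(x) + z = 0, z >= 0: z = -g at feasible
    points, clipped below at z_min so an infeasible start is allowed."""
    return np.maximum(-g0, z_min)
```

At a feasible point (g0 \leq 0) this gives z = -g0 exactly, so the reformulated equality holds; at an infeasible point the residual g0 + z > 0 is then reduced by the equality-constrained SQP iterations.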
Scaling and preconditioning normalize constraints to mitigate numerical imbalances between equalities and inequalities, ensuring equitable treatment in the QP solver. Constraints are scaled by estimating their magnitudes (e.g., via norms of gradients or function values), applying diagonal scaling matrices to the Jacobian to balance condition numbers, often reducing sensitivity to ill-conditioning by orders of magnitude in large-scale problems. Preconditioning further stabilizes the reduced Hessian in the QP by incorporating limited-memory quasi-Newton approximations, enhancing iterative solver efficiency for sparse systems.[24]
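A simplified dense sketch of the diagonal constraint scaling just described (the function name and the gradient-norm heuristic are illustrative): each linearized constraint row is normalized to unit norm so that equalities and inequalities of very different magnitudes are weighted comparably in the QP.

```python
import numpy as np

def row_scale(J, c, eps=1e-12):
    """Diagonally scale constraints c(x)=0 with Jacobian J so that each
    linearized row has unit norm, balancing the QP's conditioning."""
    norms = np.maximum(np.linalg.norm(J, axis=1), eps)  # per-row gradient norms
    D = 1.0 / norms                                     # diagonal scaling
    return D[:, None] * J, D * c
```

Scaling both J and c by the same diagonal matrix leaves the feasible set of the linearized constraints unchanged while typically improving the condition number seen by the QP solver.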
For instance, in resolving QP infeasibility due to near-zero constraint gradients—indicative of degeneracy near a kink or manifold intersection—feasibility restoration with slacks can perturb the linearized inequalities slightly, yielding a direction d that moves toward the feasible set while avoiding singular systems, as demonstrated in stabilized variants where regularization ensures a unique solution with bounded multipliers.[25]
Applications and Alternatives
Practical Applications
Sequential quadratic programming (SQP) finds extensive use in engineering applications, particularly in optimal control problems within aerospace. For instance, it is employed for trajectory optimization of rockets and reentry vehicles, where nonlinear constraints on dynamics, fuel consumption, and thermal loads must be satisfied to minimize time or maximize payload. The SNOPT solver, based on SQP, has been applied to such large-scale problems in the aerospace industry, demonstrating robustness for sparse nonlinear programs involving thousands of variables.[26]
In chemical engineering, SQP facilitates parameter estimation in reactor design under nonlinear constraints, such as mass balance and kinetic rate laws, to optimize process efficiency and yield. A hybrid genetic algorithm-SQP approach has been used to determine kinetic parameters in complex reacting systems, handling the nonlinearity inherent in parameter estimation problems from experimental data. This enables accurate modeling of reactor performance while respecting constraints on temperature, pressure, and concentrations.[27]
Financial applications of SQP include portfolio optimization incorporating transaction costs and risk measures, addressing nonlinear objectives like expected utility maximization subject to cardinality or diversification constraints. In models with credibility measures for uncertain returns, SQP solves the resulting nonlinear programs to adjust portfolios dynamically, balancing returns against costs and risks such as value-at-risk. Such methods support rebalancing strategies that account for market frictions, improving practical implementability.[28]
In machine learning, SQP aids hyperparameter tuning and constrained neural network training, where optimization involves nonlinear loss functions with bounds on model complexity or fairness constraints. For physics-informed neural networks, trust-region SQP enforces hard constraints during training, enhancing accuracy in scientific simulations by integrating physical laws directly into the optimization. This approach mitigates issues like constraint violations in deep learning models applied to constrained domains.[29]
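As a concrete illustration of SQP in practice, SciPy's `minimize` exposes an SQP variant via `method="SLSQP"`. The toy problem below (chosen for illustration, not drawn from any application above) minimizes the squared distance to the point (1, 2) over the unit disk, whose solution lies on the boundary at (1, 2)/\sqrt{5}.

```python
from scipy.optimize import minimize

# Minimize (x0 - 1)^2 + (x1 - 2)^2 subject to x0^2 + x1^2 <= 1.
# SLSQP uses the "ineq" convention fun(x) >= 0.
res = minimize(
    lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2,
    x0=[0.5, 0.0],
    method="SLSQP",
    constraints=[{"type": "ineq",
                  "fun": lambda x: 1 - x[0] ** 2 - x[1] ** 2}],
)
print(res.x)  # point on the unit circle closest to (1, 2)
```

The solver converges to approximately (0.447, 0.894), the projection of (1, 2) onto the unit circle, solving a QP subproblem with linearized constraints at each iterate as described earlier.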
Comparison with Other Methods
Sequential quadratic programming (SQP) differs from interior-point methods, such as those implemented in IPOPT, primarily in its handling of inequality constraints. While interior-point methods employ barrier functions to maintain feasibility by penalizing constraint violations through logarithmic terms, SQP solves quadratic programming subproblems that approximate the nonlinear constraints linearly, allowing temporary infeasibility. This makes SQP generally faster per iteration, as evidenced by benchmarks where SQP achieves higher success rates (91.7%) on the CUTEst test collection compared to IPOPT's 82.3% with first-order information, though interior-point methods scale better for very large-scale problems with many inequalities due to their ability to exploit sparse structures efficiently.[13]
In contrast to augmented Lagrangian methods like ALGENCAN, which augment the Lagrangian with quadratic penalty terms on constraints to enforce feasibility and handle nonconvexity through outer iterations updating multipliers, SQP integrates second-order information directly into its quadratic subproblems, often requiring fewer outer loops for convergence. Augmented Lagrangian approaches excel in robustness for highly nonconvex problems by decoupling constraint satisfaction from objective minimization, but they typically demand more iterations overall due to the need for multiple penalty parameter adjustments. Stabilized variants combining SQP with augmented Lagrangian elements can mitigate these issues, achieving superlinear convergence under second-order sufficient conditions while preserving global convergence guarantees.[30][31]
Compared to first-order gradient-based methods, such as projected gradient descent, SQP leverages second-order Hessian approximations for more accurate step directions, leading to faster local convergence near solutions but at higher computational cost per iteration. First-order methods, relying solely on gradients, are preferable for very large-scale problems where Hessian computation is prohibitive, offering scalability at the expense of slower overall convergence and lower accuracy on medium-scale nonlinear problems with sparse structures.[32]
Empirical comparisons on the CUTEst test set highlight SQP's strengths in medium-scale structural optimization, where implementations like SNOPT solve over 90% of problems with fewer function evaluations than interior-point alternatives, particularly when degrees of freedom are below 3000. For instance, SQP methods based on quasi-Newton Hessian updates demonstrate superior iteration efficiency on constrained problems, outperforming first-order approaches.[13][24]
SQP is particularly suited for problems featuring sparse Hessians and a moderate number of constraints, where its quadratic subproblems can be solved efficiently; in such scenarios, it outperforms interior-point methods on iteration speed and augmented Lagrangian on loop count, while first-order methods are better reserved for ultra-large, derivative-light applications.[33]