
Differential dynamic programming

Differential dynamic programming (DDP) is an iterative trajectory optimization algorithm for solving nonlinear optimal control problems, applicable to both discrete-time and continuous-time systems. It applies the principle of optimality from dynamic programming by constructing local quadratic approximations to the Hamilton-Jacobi-Bellman equation around a nominal trajectory, enabling efficient backward and forward sweeps to refine control policies and state trajectories. DDP was developed in the late 1960s as an extension of dynamic programming to handle nonlinear dynamics without relying on the calculus of variations. The method was first proposed by D. Q. Mayne and systematically analyzed by D. H. Jacobson and Mayne in their 1970 book, which established its theoretical foundations and convergence properties for unconstrained problems. Unlike global dynamic programming approaches that suffer from the curse of dimensionality, DDP's local approximations make it computationally tractable for high-dimensional systems. Key features of DDP include its ability to incorporate second-order information for faster convergence compared to first-order methods such as gradient descent, support for control-affine dynamics where closed-form control updates are possible, and adaptability to discrete-time formulations. It has been extended to handle constraints, uncertainties, and stochastic systems, finding applications in robotics for motion planning, in aerospace for trajectory optimization, and in related optimal control problems.

Introduction

Overview

Differential dynamic programming (DDP) is a second-order extension of dynamic programming for solving finite-horizon, discrete-time nonlinear optimal control problems. It builds on the principle of optimality by approximating the nonlinear dynamics and cost-to-go functions locally, enabling efficient backward and forward sweeps. The core purpose of DDP is to iteratively refine an initial trajectory, using local quadratic approximations around a nominal path to minimize a total cost that captures system performance and terminal objectives. This approach generates both open-loop control adjustments and linear feedback policies, allowing for robust handling of perturbations in practice. Key characteristics of DDP include its reliance on local linearization of the dynamics and quadratic approximation of the value function, which facilitates scalability to high-dimensional systems without exhaustive state-space discretization. It is particularly well-suited for complex, nonlinear environments, such as those in robotics where real-time computation is essential. DDP finds typical applications in robot motion planning, such as bipedal locomotion and manipulation tasks, as well as spacecraft trajectory optimization for orbital transfers. In unconstrained settings, the algorithm demonstrates quadratic convergence to a local minimum, often requiring fewer iterations than first-order methods for smooth problems. Its iterative structure alternates between a backward pass, which propagates value function approximations from the terminal state, and a forward pass, which simulates and updates the trajectory using the derived controls.

Historical Development

Differential dynamic programming (DDP) emerged from the foundational principles of dynamic programming developed by Richard Bellman in the 1950s, which provided a framework for solving multistage optimization problems through recursive decomposition, and from advances in optimal control theory that enabled stagewise procedures for handling complex dynamics. DDP specifically builds on these by approximating the value function quadratically at each stage to mitigate the curse of dimensionality inherent in Bellman's original method. The algorithm was first introduced by David Q. Mayne in a seminal 1966 paper (received December 1965) that proposed a second-order gradient method for optimizing nonlinear discrete-time systems. This approach unified dynamic programming with second-order approximations to solve optimal control problems more efficiently than first-order methods. A comprehensive formalization came in the 1970 book Differential Dynamic Programming by David H. Jacobson and David Q. Mayne, which detailed the backward-forward pass structure, established local quadratic convergence under suitable conditions, and positioned DDP as a practical tool for nonlinear systems. The book emphasized DDP's advantages over calculus-of-variations methods by leveraging dynamic programming's recursive nature while incorporating differential approximations for computational tractability. In the 1970s, extensions to continuous-time formulations appeared, including early adaptations reviewed by Michael K. Sain in 1971 and further contributions by researchers such as William F. Denham. During the 1980s and 1990s, DDP found significant applications in aerospace engineering, particularly through NASA-funded research on optimal trajectory design for spacecraft and aircraft maneuvers, as documented in technical reports addressing rigid-flexible body dynamics and low-thrust transfers. These efforts highlighted DDP's robustness for high-dimensional problems in real-time control scenarios. The 2000s marked a resurgence of DDP driven by advances in computational power, revitalizing its use in robotics for trajectory optimization in underactuated systems and manipulation tasks. This modern adoption is exemplified in the 2024 update to MIT's Underactuated Robotics textbook, which integrates DDP into chapters on dynamic programming and trajectory optimization, underscoring its role in contemporary robotic control. In the 2020s, DDP continues to evolve with extensions such as continuation methods for highly nonlinear problems and game-theoretic formulations for robust control, as seen in recent applications in robotics and spacecraft trajectory design (as of 2025).

Problem Setup

Finite-Horizon Discrete-Time Optimal Control

The finite-horizon discrete-time optimal control problem seeks a sequence of control inputs that minimizes a cumulative cost functional over a predetermined number of time steps, starting from a given initial state. This formulation is central to trajectory optimization in nonlinear systems, where the goal is to drive the system toward desirable behavior while balancing performance and control effort. The objective is to minimize the cost functional J = \sum_{k=0}^{N-1} l_k(x_k, u_k) + \phi(x_N) over the state-control trajectory \{x_k, u_k\}_{k=0}^N, subject to the fixed initial state x_0. Here, l_k: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R} denotes the running cost at time step k, which captures stage-wise objectives such as tracking errors or control energy expenditure, and \phi: \mathbb{R}^n \to \mathbb{R} is the terminal cost evaluating the final state x_N. The system evolves according to the nonlinear state dynamics x_{k+1} = f(x_k, u_k), \quad k = 0, \dots, N-1, where x_k \in \mathbb{R}^n is the state at time k, u_k \in \mathbb{R}^m is the control input applied over the interval [k, k+1), and f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n is a differentiable function representing the system model, often derived from discretization of continuous-time dynamics. An initial nominal trajectory \{\bar{x}_k, \bar{u}_k\} is typically required to initialize iterative solution methods. Key assumptions include a finite horizon N, discrete time steps without stochastic disturbances or hard constraints on states/controls in the basic setup, and that the running costs l_k, terminal cost \phi, and dynamics f are twice continuously differentiable to enable second-order approximations. These conditions facilitate analytical and numerical tractability while focusing on deterministic nonlinear behavior. This problem class is computationally challenging due to the nonlinearity of f and l_k, which generally yields a non-convex optimization landscape with multiple local minima, and the curse of dimensionality, where the exponential growth in state space complexity hampers exact solution methods for high-dimensional systems (n \gg 1).
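The quantities above can be made concrete in code. The following Python sketch (using NumPy) sets up a hypothetical finite-horizon problem, an Euler-discretized pendulum with quadratic running and terminal costs; all function names, constants, and the zero-control nominal trajectory are illustrative assumptions rather than a standard benchmark.

```python
import numpy as np

# Hypothetical problem definition: a damped pendulum discretized with Euler's method,
# quadratic running and terminal costs. All names and constants are illustrative.
N = 50          # horizon length
dt = 0.05       # time step
n, m = 2, 1     # state and control dimensions

def f(x, u):
    """Nonlinear dynamics x_{k+1} = f(x_k, u_k) for a pendulum state (theta, theta_dot)."""
    theta, omega = x
    omega_dot = -9.81 * np.sin(theta) - 0.1 * omega + u[0]
    return np.array([theta + dt * omega, omega + dt * omega_dot])

Q = np.diag([1.0, 0.1])          # state weight in the running cost
R = np.array([[0.01]])           # control weight in the running cost
Qf = np.diag([10.0, 1.0])        # terminal state weight
x_goal = np.array([np.pi, 0.0])  # upright target state

def l(x, u):
    """Running cost l_k(x_k, u_k)."""
    dx = x - x_goal
    return 0.5 * dx @ Q @ dx + 0.5 * u @ R @ u

def phi(x):
    """Terminal cost phi(x_N)."""
    dx = x - x_goal
    return 0.5 * dx @ Qf @ dx

def total_cost(xs, us):
    """J = sum_k l(x_k, u_k) + phi(x_N) along a trajectory."""
    return sum(l(xs[k], us[k]) for k in range(N)) + phi(xs[-1])

# A nominal trajectory obtained by rolling out zero controls from the fixed initial state.
x0 = np.zeros(n)
us_bar = [np.zeros(m) for _ in range(N)]
xs_bar = [x0]
for k in range(N):
    xs_bar.append(f(xs_bar[-1], us_bar[k]))
print("nominal cost:", total_cost(xs_bar, us_bar))
```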

Dynamic Programming Principles

Dynamic programming provides a foundational framework for solving problems by decomposing them into a sequence of simpler subproblems through recursive computation. At its core is the Bellman principle of optimality, which asserts that an optimal policy has the property that, regardless of the initial state and initial decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. This principle enables the systematic evaluation of decisions stage by stage in multi-stage decision processes. Central to this approach is the value function V_k(x), defined as the minimum cost-to-go from stage k onward when starting from state x. The value function satisfies the Bellman recursion, given by V_k(x) = \min_u \left[ l_k(x, u) + V_{k+1}(f(x, u)) \right], with the terminal condition V_N(x) = \phi(x) at the final stage N, where l_k(x, u) is the stage cost and f(x, u) is the state transition function. This recursion allows the optimal cost-to-go to be computed backward from the terminal stage to the initial stage. Once the value functions are obtained, an optimal policy can be derived by selecting, at each stage k, the control u that achieves the minimum in the Bellman recursion; the optimal trajectory is then simulated forward using this policy. In the context of optimal control, dynamic programming yields an exact solution in principle by exhaustively searching the state-action space at each stage. However, for nonlinear problems with high-dimensional states, exact dynamic programming becomes intractable due to the curse of dimensionality, where computational and storage requirements grow exponentially with the state dimension. This limitation motivates the development of approximate local methods, such as differential dynamic programming, which leverage differential approximations to make the approach feasible for practical nonlinear systems.
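For intuition, the backward Bellman recursion can be carried out exactly on a coarse grid for a low-dimensional example. The sketch below uses a hypothetical scalar system with discretized state and control sets to illustrate how V_k is tabulated backward from V_N = \phi; the grid resolution and dynamics are illustrative, and the exponential growth of such tables with state dimension is precisely the curse of dimensionality discussed above.

```python
import numpy as np

# Illustrative exact dynamic programming on a coarse grid for a scalar system,
# implementing the backward recursion V_k(x) = min_u [ l(x, u) + V_{k+1}(f(x, u)) ].
N = 20
xs = np.linspace(-2.0, 2.0, 81)      # discretized state grid
us = np.linspace(-1.0, 1.0, 21)      # discretized control grid

def f(x, u):            # simple scalar nonlinear dynamics (illustrative)
    return x + 0.1 * u - 0.05 * x**3

def l(x, u):            # running cost
    return x**2 + 0.1 * u**2

def phi(x):             # terminal cost
    return 10.0 * x**2

V = phi(xs)                          # V_N tabulated on the grid
policy = np.zeros((N, xs.size))      # stores the minimizing control per stage and state
for k in range(N - 1, -1, -1):
    # For every grid state, evaluate every control and interpolate V_{k+1} at f(x, u).
    X, U = np.meshgrid(xs, us, indexing="ij")
    Vnext = np.interp(f(X, U).ravel(), xs, V).reshape(X.shape)
    Qtable = l(X, U) + Vnext                     # Q_k(x, u) on the grid
    best = Qtable.argmin(axis=1)
    policy[k] = us[best]
    V = Qtable.min(axis=1)                       # V_k(x) = min_u Q_k(x, u)

print("V_0 at x = 0:", np.interp(0.0, xs, V))
```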

Core Algorithm

Mathematical Formulation

Differential dynamic programming approximates the dynamic programming value functions locally around a nominal trajectory \{\bar{x}_k, \bar{u}_k\}_{k=0}^{N} using second-order expansions to derive quadratic subproblems for control updates. This approach specializes the general dynamic programming principles by focusing on local linear-quadratic approximations rather than exact global solutions. Consider the nonlinear dynamics x_{k+1} = f_k(x_k, u_k) and stage cost l_k(x_k, u_k). Around the nominal trajectory, the dynamics are linearized as \delta x_{k+1} = f_k(\bar{x}_k, \bar{u}_k) + f_{x,k} \delta x_k + f_{u,k} \delta u_k - \bar{x}_{k+1}, where \delta x_k = x_k - \bar{x}_k, \delta u_k = u_k - \bar{u}_k, f_{x,k} = \frac{\partial f_k}{\partial x}(\bar{x}_k, \bar{u}_k), and f_{u,k} = \frac{\partial f_k}{\partial u}(\bar{x}_k, \bar{u}_k). This yields the affine approximation \delta x_{k+1} \approx A_k \delta x_k + B_k \delta u_k, with A_k = f_{x,k} and B_k = f_{u,k}. For the full second-order approximation, the dynamics expansion includes quadratic terms: f_k(\bar{x}_k + \delta x_k, \bar{u}_k + \delta u_k) \approx \bar{x}_{k+1} + A_k \delta x_k + B_k \delta u_k + \frac{1}{2} \delta x_k^T f_{xx,k} \delta x_k + \delta x_k^T f_{xu,k} \delta u_k + \frac{1}{2} \delta u_k^T f_{uu,k} \delta u_k, where f_{xx,k}, f_{xu,k}, f_{uu,k} are the second partial derivative tensors evaluated at (\bar{x}_k, \bar{u}_k). The value function V_{k+1}(x_{k+1}), representing the optimal cost-to-go from stage k+1, is approximated quadratically around \bar{x}_{k+1}: V_{k+1}(x_{k+1} + \delta x_{k+1}) \approx V_{k+1}(\bar{x}_{k+1}) + V_x^{k+1} \delta x_{k+1} + \frac{1}{2} \delta x_{k+1}^T V_{xx}^{k+1} \delta x_{k+1}, where V_x^{k+1} = \frac{\partial V_{k+1}}{\partial x}(\bar{x}_{k+1}) is the gradient and V_{xx}^{k+1} = \frac{\partial^2 V_{k+1}}{\partial x^2}(\bar{x}_{k+1}) is the Hessian. The action-value function at stage k is defined as Q_k(x_k, u_k) = l_k(x_k, u_k) + V_{k+1}(f_k(x_k, u_k)). Substituting the local approximations, Q_k becomes quadratic in the deviations \delta x_k and \delta u_k. First, expand the stage cost via second-order Taylor series: l_k(\bar{x}_k + \delta x_k, \bar{u}_k + \delta u_k) \approx l_k(\bar{x}_k, \bar{u}_k) + l_{x,k} \delta x_k + l_{u,k} \delta u_k + \frac{1}{2} \begin{pmatrix} \delta x_k \\ \delta u_k \end{pmatrix}^T \begin{pmatrix} l_{xx,k} & l_{xu,k} \\ l_{ux,k} & l_{uu,k} \end{pmatrix} \begin{pmatrix} \delta x_k \\ \delta u_k \end{pmatrix}, with l_{x,k} = \frac{\partial l_k}{\partial x}(\bar{x}_k, \bar{u}_k), l_{u,k} = \frac{\partial l_k}{\partial u}(\bar{x}_k, \bar{u}_k), and the Hessians l_{xx,k} = \frac{\partial^2 l_k}{\partial x^2}(\bar{x}_k, \bar{u}_k), etc. (noting l_{xu,k} = l_{ux,k}^T). For the value term, substitute the second-order dynamics approximation into the quadratic V_{k+1}: V_{k+1}(f_k(\bar{x}_k + \delta x_k, \bar{u}_k + \delta u_k)) \approx V_{k+1}(\bar{x}_{k+1}) + V_x^{k+1} (A_k \delta x_k + B_k \delta u_k) + \frac{1}{2} (A_k \delta x_k + B_k \delta u_k)^T V_{xx}^{k+1} (A_k \delta x_k + B_k \delta u_k) + V_x^{k+1} \cdot \left( \frac{1}{2} \delta x_k^T f_{xx,k} \delta x_k + \delta x_k^T f_{xu,k} \delta u_k + \frac{1}{2} \delta u_k^T f_{uu,k} \delta u_k \right), where \cdot denotes contraction of the value gradient with the second-derivative tensors of the dynamics.
Combining both expansions, the action-value function simplifies to Q_k(\delta x_k, \delta u_k) \approx q_k + Q_{x,k} \delta x_k + Q_{u,k} \delta u_k + \frac{1}{2} \delta x_k^T Q_{xx,k} \delta x_k + \delta x_k^T Q_{xu,k} \delta u_k + \frac{1}{2} \delta u_k^T Q_{uu,k} \delta u_k, where the constant q_k = l_k(\bar{x}_k, \bar{u}_k) + V_{k+1}(\bar{x}_{k+1}), the linear terms are Q_{x,k} = l_{x,k} + A_k^T V_x^{k+1} and Q_{u,k} = l_{u,k} + B_k^T V_x^{k+1}, and the quadratic coefficients derive as Q_{xx,k} = l_{xx,k} + A_k^T V_{xx}^{k+1} A_k + V_x^{k+1} \cdot f_{xx,k}, \quad Q_{xu,k} = l_{xu,k} + A_k^T V_{xx}^{k+1} B_k + V_x^{k+1} \cdot f_{xu,k}, \quad Q_{uu,k} = l_{uu,k} + B_k^T V_{xx}^{k+1} B_k + V_x^{k+1} \cdot f_{uu,k}. These coefficients capture the second-order effects from both the stage cost and the propagated value function. Minimizing Q_k with respect to \delta u_k for a given \delta x_k yields the optimal local policy \delta u_k = -Q_{uu,k}^{-1} (Q_{u,k} + Q_{xu,k}^T \delta x_k) = \alpha_k + K_k \delta x_k, where the feedforward term is \alpha_k = -Q_{uu,k}^{-1} Q_{u,k} and the feedback gain is K_k = -Q_{uu,k}^{-1} Q_{xu,k}^T. This affine policy approximates the optimal control deviation around the nominal trajectory.
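The assembly of the Q-coefficients and the resulting affine policy can be summarized in a short helper, sketched below in Python with NumPy. The derivative arrays are assumed to be supplied by the caller; when the second-derivative tensors of the dynamics are omitted, the expressions reduce to the Gauss-Newton (iLQR) variant.

```python
import numpy as np

def q_terms(lx, lu, lxx, lxu, luu, A, B, Vx, Vxx, fxx=None, fxu=None, fuu=None):
    """Assemble the local quadratic model of Q_k from cost and dynamics derivatives.

    A = f_x and B = f_u are the dynamics Jacobians; Vx, Vxx approximate V_{k+1}.
    fxx, fxu, fuu are optional second-derivative tensors of shape (n, ., .); when they
    are omitted the update reduces to the Gauss-Newton / iLQR variant."""
    Qx = lx + A.T @ Vx
    Qu = lu + B.T @ Vx
    Qxx = lxx + A.T @ Vxx @ A
    Qxu = lxu + A.T @ Vxx @ B
    Quu = luu + B.T @ Vxx @ B
    if fxx is not None:
        # Contract the value gradient with the dynamics second derivatives:
        # (Vx . f_xx)_{ij} = sum_a Vx[a] * fxx[a, i, j], and similarly for fxu, fuu.
        Qxx = Qxx + np.tensordot(Vx, fxx, axes=1)
        Qxu = Qxu + np.tensordot(Vx, fxu, axes=1)
        Quu = Quu + np.tensordot(Vx, fuu, axes=1)
    return Qx, Qu, Qxx, Qxu, Quu

def affine_policy(Qu, Qxu, Quu):
    """Feedforward alpha_k and feedback gain K_k minimizing the quadratic model over du."""
    Quu_inv = np.linalg.inv(Quu)
    alpha = -Quu_inv @ Qu          # feedforward term
    K = -Quu_inv @ Qxu.T           # feedback gain
    return alpha, K
```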

Backward Pass

The backward pass in differential dynamic programming (DDP) constitutes the core recursive procedure for approximating the value function and deriving locally optimal control policies by propagating second-order information backward through time along a nominal trajectory. This pass leverages quadratic expansions of the cost-to-go and the dynamics to compute local linear-quadratic approximations, enabling the identification of feedback gains and feedforward control terms that approximate the global optimum in a neighborhood of the current trajectory. Introduced as a second-order extension of dynamic programming principles, the backward pass ensures efficient exploitation of curvature information for convergence in nonlinear optimal control problems. The process begins with initialization at the terminal time step k = N, where the value function V_N(x_N) is set to the terminal cost \phi(x_N), with its first-order gradient V_x = \phi_x and second-order Hessian V_{xx} = \phi_{xx}. These terms represent the exact quadratic approximation at the horizon end, assuming \phi is twice differentiable. In standard notation, the value function approximation takes the form V_k(x_k) \approx c_k + s_k^\top \delta x_k + \frac{1}{2} \delta x_k^\top S_k \delta x_k, where \delta x_k = x_k - \bar{x}_k is the state deviation from the nominal \bar{x}_k, S_k \succeq 0 is the value Hessian, s_k is the value gradient, and c_k is the constant term. At k = N, c_N = \phi(\bar{x}_N), s_N = \phi_x, and S_N = \phi_{xx}. The recursion proceeds backward from k = N-1 to k = 0, at each step computing the action-value function coefficients Q_x, Q_u, Q_{xx}, Q_{xu}, Q_{uu} via second-order Taylor expansions of the stage cost l_k(x_k, u_k) and dynamics f_k(x_k, u_k) around the nominal trajectory, incorporating the value approximation from step k+1. Specifically, \begin{align} Q_x &= l_x + f_x^\top V_x, \\ Q_u &= l_u + f_u^\top V_x, \\ Q_{xx} &= l_{xx} + f_x^\top V_{xx} f_x + V_x \cdot f_{xx}, \\ Q_{uu} &= l_{uu} + f_u^\top V_{xx} f_u + V_x \cdot f_{uu}, \\ Q_{xu} &= l_{xu} + f_x^\top V_{xx} f_u + V_x \cdot f_{xu}, \end{align} where subscripts denote partial derivatives evaluated at (\bar{x}_k, \bar{u}_k), and \cdot indicates contraction over tensor products for the higher-order terms. These Q-terms approximate the change in the Bellman residual Q_k(x_k, u_k) = l_k(x_k, u_k) + V_{k+1}(f_k(x_k, u_k)) quadratically in deviations \delta x_k, \delta u_k. From these, the value function parameters are updated to complete the recursion. The value Hessian is obtained by eliminating the control deviation in the quadratic Q-form: S_k = Q_{xx} - Q_{xu} Q_{uu}^{-1} Q_{xu}^\top, assuming Q_{uu} \succ 0. The value gradient follows as s_k = Q_x - Q_{xu} Q_{uu}^{-1} Q_u, and the constant term incorporates the minimized quadratic over controls: c_k = q_k - \frac{1}{2} Q_u^\top Q_{uu}^{-1} Q_u, where q_k = l_k(\bar{x}_k, \bar{u}_k) + c_{k+1} is the zeroth-order Q-term. These updates ensure the value approximation V_k minimizes the expected future cost under the local dynamics. The backward pass also yields the local control policy by minimizing the quadratic Q_k over \delta u_k, resulting in affine feedback form \delta u_k = k_k + K_k \delta x_k. The feedback gain is K_k = -Q_{uu}^{-1} Q_{xu}^\top, which provides state-deviation correction, and the feedforward term is k_k = -Q_{uu}^{-1} Q_u, capturing nominal adjustments. By iteratively refining these gains and value approximations, the backward pass propagates second-order sensitivity of the cost to state and control perturbations, facilitating rapid quadratic convergence near local optima when applied in conjunction with forward trajectory updates.
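A minimal sketch of this recursion is given below, under the common iLQR simplification that drops the second-order dynamics tensors and with a small fixed damping on Q_{uu}; the function signature and list-based bookkeeping are illustrative choices rather than a standard API.

```python
import numpy as np

def backward_pass(A_list, B_list, lx, lu, lxx, lxu, luu, phi_x, phi_xx, reg=1e-6):
    """One DDP/iLQR backward pass given per-stage derivatives along the nominal trajectory.

    A_list[k], B_list[k] are the dynamics Jacobians f_x, f_u at stage k; lx, lu, lxx,
    lxu, luu are lists of cost derivatives; phi_x, phi_xx are the terminal-cost
    derivatives. Second-order dynamics tensors are omitted (the iLQR simplification);
    reg is a small Levenberg-Marquardt damping on Q_uu."""
    N = len(A_list)
    m = B_list[0].shape[1]
    Vx, Vxx = phi_x, phi_xx                     # initialize at the terminal stage
    ks = [None] * N                             # feedforward terms k_k
    Ks = [None] * N                             # feedback gains K_k
    for k in range(N - 1, -1, -1):
        A, B = A_list[k], B_list[k]
        Qx  = lx[k]  + A.T @ Vx
        Qu  = lu[k]  + B.T @ Vx
        Qxx = lxx[k] + A.T @ Vxx @ A
        Qxu = lxu[k] + A.T @ Vxx @ B
        Quu = luu[k] + B.T @ Vxx @ B + reg * np.eye(m)
        Quu_inv = np.linalg.inv(Quu)
        ks[k] = -Quu_inv @ Qu
        Ks[k] = -Quu_inv @ Qxu.T
        # Value-function update for stage k (gradient and Hessian).
        Vx  = Qx  - Qxu @ Quu_inv @ Qu
        Vxx = Qxx - Qxu @ Quu_inv @ Qxu.T
        Vxx = 0.5 * (Vxx + Vxx.T)               # keep the Hessian numerically symmetric
    return ks, Ks
```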

Forward Pass

The forward pass in differential dynamic programming utilizes the linear feedback policy derived during the backward pass to update the nominal trajectory by simulating the dynamics forward in time, starting from the fixed initial state \bar{x}_0. This process refines the candidate trajectory toward local optimality by applying the affine control law at each time step, where the control u_k is computed as u_k = \bar{u}_k + k_k + K_k (x_k - \bar{x}_k), with k_k and K_k being the feedforward and feedback gains obtained from the backward-pass minimization of the value function. The resulting controls are then used to propagate the states via the true nonlinear dynamics: x_{k+1} = f(x_k, u_k) for k = 0 to N-1, yielding a new candidate trajectory \{x_k, u_k\}. Once the full trajectory is simulated, the total cost J is evaluated as J = \sum_{k=0}^{N-1} l(x_k, u_k) + \phi(x_N) and compared to the cost of the previous nominal trajectory. If the new cost represents an improvement (i.e., \Delta J < 0), the updated trajectory is accepted as the new nominal \{\bar{x}_k, \bar{u}_k\}; otherwise, the step is rejected or scaled back. This forward evaluation ensures that the trajectory remains feasible under the actual dynamics, contrasting with the local quadratic approximations used in the backward pass. The overall algorithm alternates between backward and forward passes iteratively until convergence, typically assessed by a small change in cost such as |\Delta J| < \epsilon, where \epsilon is a user-defined tolerance (e.g., \epsilon = 10^{-6}). In practice, convergence is often achieved in 10 to 100 such cycles for many nonlinear control problems, with step sizes often decreasing over iterations to promote stability near the optimum. Upon convergence, the algorithm outputs the optimal open-loop trajectory \{x_k^*, u_k^*\} along with the time-varying feedback policy K_k for potential use in receding-horizon implementations, where the policy can stabilize execution against perturbations.
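The forward rollout and the accept/reject test can be expressed compactly as follows; the callables f, l, and phi and the trajectory bookkeeping are assumed to match the problem setup sketched earlier, and the acceptance rule shown is the simplest cost-decrease test.

```python
def forward_pass(f, l, phi, xs_bar, us_bar, ks, Ks, alpha=1.0):
    """Roll out the true nonlinear dynamics under the affine policy from the backward pass.

    f, l, phi are the dynamics, running cost, and terminal cost callables; xs_bar, us_bar
    is the nominal trajectory; ks, Ks are the feedforward terms and feedback gains;
    alpha scales the feedforward step (alpha = 1 is the full step)."""
    N = len(us_bar)
    xs_new = [xs_bar[0]]                        # the initial state is fixed
    us_new = []
    cost = 0.0
    for k in range(N):
        dx = xs_new[k] - xs_bar[k]
        u = us_bar[k] + alpha * ks[k] + Ks[k] @ dx
        us_new.append(u)
        cost += l(xs_new[k], u)
        xs_new.append(f(xs_new[k], u))          # propagate with the true dynamics
    cost += phi(xs_new[-1])
    return xs_new, us_new, cost

def ddp_iteration(f, l, phi, xs_bar, us_bar, cost_bar, ks, Ks):
    """Accept the new trajectory only if it decreases the total cost."""
    xs_new, us_new, cost_new = forward_pass(f, l, phi, xs_bar, us_bar, ks, Ks)
    if cost_new < cost_bar:
        return xs_new, us_new, cost_new, True
    return xs_bar, us_bar, cost_bar, False
```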

Practical Enhancements

Regularization Techniques

In differential dynamic programming, the Hessian Q_{uu} computed during the backward pass may become indefinite due to model nonlinearities, resulting in unstable gains that hinder convergence. To address this, Levenberg-Marquardt regularization modifies the control Hessian by adding a positive damping term: \tilde{Q}_{uu} = Q_{uu} + \lambda I, where \lambda > 0 and I is the identity matrix, yielding the stabilized update \delta u = -(\tilde{Q}_{uu})^{-1} Q_u. This ensures positive definiteness, blending Newton-like steps (small \lambda) with gradient-descent-like steps (large \lambda). The parameter \lambda is adapted iteratively, typically starting from a small value such as 10^{-6} and increased (e.g., multiplied by 10) if the updated trajectory increases the cost, mimicking trust-region enforcement, while decreased (e.g., divided by 10) on successful cost reductions to accelerate convergence. An alternative approach applies regularization directly to the value function Hessian S_k (or V_{xx}) propagated backward, adding \lambda I to enforce S_k \succ 0 and indirectly stabilizing Q_{uu} through the dynamics linearization. These techniques preserve local quadratic convergence near optima, with \lambda values commonly spanning 10^{-6} to 10^{6} depending on problem conditioning.
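A possible implementation of this damping and its adaptation schedule is sketched below; the Cholesky-based positive-definiteness check and the factor-of-10 schedule are common but illustrative choices.

```python
import numpy as np

def regularized_gains(Qu, Qxu, Quu, lam):
    """Compute damped feedforward/feedback terms with Levenberg-Marquardt regularization."""
    m = Quu.shape[0]
    Quu_reg = Quu + lam * np.eye(m)             # Q_uu + lambda * I
    try:
        np.linalg.cholesky(Quu_reg)             # positive-definiteness check
    except np.linalg.LinAlgError:
        return None, None, False                # caller should increase lambda and retry
    k_ff = -np.linalg.solve(Quu_reg, Qu)        # damped feedforward term
    K_fb = -np.linalg.solve(Quu_reg, Qxu.T)     # damped feedback gain
    return k_ff, K_fb, True

def update_lambda(lam, improved, lam_min=1e-6, lam_max=1e6, factor=10.0):
    """Simple schedule: shrink lambda after an accepted step, grow it after a rejected one."""
    if improved:
        return max(lam / factor, lam_min)
    return min(lam * factor, lam_max)
```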

Line Search Methods

In differential dynamic programming (DDP), applying the full step derived from the backward pass during the forward pass may lead to overshooting or increased cost due to the linear-quadratic approximations of the nonlinear dynamics and cost function. To address this, line search methods adjust the step size \alpha along the search direction provided by the linear feedback gains K_k and feedforward terms k_k, ensuring reliable trajectory updates that decrease the objective. A common approach is backtracking line search, which begins with \alpha = 1 and iteratively reduces \alpha (typically by halving) until an acceptance criterion is met. The trial control input is computed as u_k^\alpha = \bar{u}_k + \alpha (k_k + K_k \delta x_k), followed by simulation of the trial trajectory \hat{x}_{k+1} = f(\hat{x}_k, u_k^\alpha) to evaluate the update. This process continues until the Armijo condition holds: \Delta J \leq \eta \alpha \nabla J^T d, where \Delta J is the change in cost, \eta is a small constant (e.g., \eta = 10^{-4}), \nabla J^T d is the predicted descent, and d is the search direction. The merit function for acceptance is typically the total cost J, though augmented versions incorporating constraints may be used in constrained settings. Alternatives include exact line search, which minimizes J(\alpha) directly (e.g., via quadratic interpolation or derivative-free methods), or trust-region methods that bound \alpha based on a local model trust radius to balance progress and reliability. These techniques ensure monotonic descent in the cost, preventing divergence and promoting stable convergence, and are widely implemented in practical DDP variants such as iterative linear-quadratic regulation (iLQR). The added computational cost is modest, typically requiring 1–5 forward evaluations per iteration due to the backtracking reductions.
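A backtracking search of this kind might be organized as follows; the rollout and expected_decrease callables are assumed helpers (the latter returning the cost change predicted by the local quadratic model), and the acceptance threshold mirrors the Armijo condition above.

```python
def backtracking_line_search(rollout, cost_bar, expected_decrease,
                             alpha0=1.0, beta=0.5, eta=1e-4, max_iters=10):
    """Backtracking (Armijo-style) line search over the feedforward step scale alpha.

    rollout(alpha) is assumed to simulate the trial trajectory with controls
    u_k = u_bar_k + alpha * k_k + K_k (x_k - x_bar_k) and return (trajectory, cost);
    expected_decrease(alpha) returns the cost change predicted by the quadratic model,
    e.g. alpha * sum_k Q_u^T k_k + 0.5 * alpha^2 * sum_k k_k^T Q_uu k_k (negative for descent)."""
    alpha = alpha0
    for _ in range(max_iters):
        traj, cost = rollout(alpha)
        actual = cost_bar - cost                    # realized cost reduction
        predicted = -expected_decrease(alpha)       # model-predicted reduction (> 0 for descent)
        if actual > 0 and actual >= eta * predicted:
            return traj, cost, alpha                # Armijo-type acceptance
        alpha *= beta                               # backtrack: shrink the step
    return None, cost_bar, 0.0                      # no acceptable step found
```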

Advanced Variants

Stochastic and Monte Carlo DDP

Stochastic differential dynamic programming (SDDP) extends the deterministic DDP framework to systems with additive or multiplicative process noise, where the dynamics are modeled as x_{k+1} = f(x_k, u_k, w_k) with w_k \sim p(w) representing zero-mean noise drawn from a known distribution p(w). The objective is to minimize the expected total cost \mathbb{E} \left[ \sum_{k=0}^{N-1} l_k(x_k, u_k) + \phi(x_N) \right], where the expectation is taken over the noise realizations, and l_k and \phi are the stage and terminal costs, respectively. This formulation accounts for uncertainty in the system evolution, making SDDP suitable for real-world applications, such as robotics and aerospace systems, where disturbances are inevitable. Monte Carlo methods approximate the expectations in SDDP by sampling multiple noise trajectories. In the forward pass, M noise samples are drawn to generate M rollouts from the current nominal trajectory, and the average cost is used to approximate the quadratic terms in the action-value function Q_k. This sampling-based approach, known as sampled DDP (SaDDP), enables handling of high-dimensional nonlinear systems by avoiding explicit computation of covariance matrices for the noise. For instance, in systems with up to 100 states, SaDDP demonstrates improved scalability compared to traditional methods that propagate uncertainty analytically. The backward pass in SDDP approximates expectations of the value function derivatives using sample averages. Specifically, the second-order terms like \mathbb{E}[V_{xx}] are estimated by averaging the Hessian contributions from the sampled trajectories, incorporating linear and quadratic noise effects for multiplicative cases. This leads to modified update rules that propagate uncertainty backward, yielding a stochastic policy that hedges against noise. Risk-sensitive variants of SDDP adopt a game-theoretic perspective, formulating the problem as a min-max optimization over worst-case disturbances to enhance robustness. In min-max DDP, the controller minimizes the cost while an adversarial player maximizes it through disturbance actions v_k, resulting in x_{k+1} = f(x_k, u_k, v_k) and a saddle-point solution under suitable existence conditions. This approach, equivalent to risk-sensitive control with exponential cost criteria, has been applied to quadrotor tracking under wind disturbances, outperforming nominal DDP in experimental settings. Convergence of SDDP improves with larger M, reducing variance in the approximations at the cost of higher computational demand; it is particularly effective for multiplicative noise where analytical expectations are intractable. Seminal work on SDDP for additive and multiplicative noise establishes its theoretical foundations, influencing subsequent sampling-based extensions. Recent advances include data-driven stochastic game-theoretic DDP, which uses regression to approximate unknown costs and dynamics from input-output data, enabling control of nonlinear systems without explicit models (published April 2025). Additional extensions enhance robustness for low-thrust trajectory design in space exploration, outperforming deterministic methods under model uncertainties.
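As an illustration of the Monte Carlo approximation of expected cost, the following sketch averages rollouts of the affine DDP policy under additive Gaussian noise; the noise model, sample count, and function names are illustrative assumptions rather than a specific published SaDDP implementation.

```python
import numpy as np

def sampled_rollout_cost(f, l, phi, xs_bar, us_bar, ks, Ks, noise_std, M=32, seed=0):
    """Monte Carlo estimate of the expected cost of the affine policy under additive noise.

    The dynamics are perturbed as x_{k+1} = f(x_k, u_k) + w_k with w_k ~ N(0, noise_std^2 I);
    M independent rollouts are averaged. All names and the noise model are illustrative."""
    rng = np.random.default_rng(seed)
    N = len(us_bar)
    n = xs_bar[0].shape[0]
    costs = []
    for _ in range(M):
        x = xs_bar[0]
        cost = 0.0
        for k in range(N):
            u = us_bar[k] + ks[k] + Ks[k] @ (x - xs_bar[k])
            cost += l(x, u)
            w = noise_std * rng.standard_normal(n)       # sampled process noise
            x = f(x, u) + w
        costs.append(cost + phi(x))
    return np.mean(costs), np.std(costs) / np.sqrt(M)    # estimate and its standard error
```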

Constrained DDP

Constrained differential dynamic programming (CDDP) extends the standard DDP framework to incorporate state and control constraints, ensuring trajectories remain feasible while minimizing the cost functional. These constraints are typically formulated as stagewise inequalities g_k(x_k, u_k) \leq 0 and equalities h_k(x_k, u_k) = 0, where x_k and u_k denote the state and control at time step k, respectively. Such formulations arise in applications requiring obstacle avoidance, joint limits, or resource bounds, transforming the unconstrained optimal control problem into a constrained one that requires modifications to both the backward and forward passes. One common approach to handle inequalities is the penalty method, which augments the original cost with a penalty term \mu \sum_k \max(0, g_k(x_k, u_k))^2, where \mu > 0 is a penalty parameter that increases iteratively to enforce feasibility. This quadratic penalty on violations integrates seamlessly into the quadratic approximation of the DDP backward pass, treating the augmented cost as unconstrained while gradually tightening adherence to g_k \leq 0. For equality constraints, similar penalties like \mu \sum_k h_k(x_k, u_k)^2 can be applied, or augmented Lagrangian formulations may incorporate Lagrange multipliers to improve convergence. These methods, often iterated in an outer loop around the standard DDP inner solver, balance computational efficiency with constraint satisfaction but can suffer from ill-conditioning for large \mu. Active-set methods address constraints by dynamically identifying the set of active inequalities (those near or violating g_k \leq 0) at each iteration, linearizing them as equalities C_k \delta u_k + D_k \delta x_k \approx 0 around the nominal trajectory, and projecting the solution onto the feasible set. In the backward pass, this reduces to solving an equality-constrained quadratic program (QP) for the feedback gains and feedforward terms, excluding inactive constraints to maintain efficiency. The active set is refined based on Lagrange multipliers and violation tolerances, ensuring monotonic decrease in the merit function during the forward pass. This approach excels for sparse or predictable constraints but requires careful initialization to avoid frequent set switches, which can increase computational cost. Primal-dual methods incorporate Lagrange multipliers \lambda_k directly into the DDP recursion, augmenting the control gradient term as Q_{u,k} \leftarrow Q_{u,k} + \lambda_k^T g_{u,k} (and similarly for cross terms) while linearizing the constraints via their Jacobians. Multipliers are updated via dual ascent steps, such as \lambda_k \leftarrow \lambda_k + \alpha \mu g_k in an augmented Lagrangian setup, where \alpha is a step size ensuring descent. For equalities, \lambda_k directly enforce h_k = 0 through the KKT conditions solved in each QP subproblem. These techniques, often combined with slack variables for inequalities, promote faster convergence near boundaries compared to pure penalties. Nonlinear constraints pose significant challenges in CDDP, necessitating accurate linearizations around the current trajectory to avoid infeasible QPs or constraint violations; poor approximations can lead to slower convergence than unconstrained DDP, with iteration counts increasing by factors of 2–5 in practice. Early developments in the late 1970s applied constrained DDP to multireservoir control problems, demonstrating its viability for large-scale sequential decisions with linear inequalities.
In modern robotics applications, such as quadrotor flight and legged locomotion, CDDP variants have enabled real-time trajectory optimization under complex state constraints, with seminal works revisiting and hybridizing penalty and active-set strategies for improved robustness. Recent applications include constraint-augmented DDP for automatic falling recovery in robots, achieving over a 95% success rate in simulations and real experiments on varied terrains (2025). Parameterized variants enable waypoint-based trajectory optimization for UAVs with flexible timing and second-order convergence (2024). Proximal methods, such as PROXDDP, support real-time whole-body model-predictive control for quadruped locomotion and manipulator tasks (2025).
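A penalty-based outer loop of the kind described above might look as follows; ddp_solve stands in for any unconstrained DDP/iLQR solver, and the quadratic penalty and growth schedule are illustrative.

```python
import numpy as np

def penalized_stage_cost(l, g, mu):
    """Wrap a stage cost with a quadratic penalty on inequality violations g(x, u) <= 0.

    l(x, u) is the original running cost and g(x, u) returns a vector of constraint
    values; mu is the penalty weight, increased between outer iterations."""
    def l_aug(x, u):
        violation = np.maximum(0.0, g(x, u))            # elementwise max(0, g)
        return l(x, u) + mu * np.sum(violation ** 2)
    return l_aug

def penalty_method(ddp_solve, l, g, mu0=1.0, growth=10.0, outer_iters=5):
    """Outer loop: solve a sequence of unconstrained DDP problems with growing mu.

    ddp_solve is a placeholder for any unconstrained DDP/iLQR solver that accepts a
    running-cost callable and returns a solution trajectory."""
    mu = mu0
    solution = None
    for _ in range(outer_iters):
        solution = ddp_solve(penalized_stage_cost(l, g, mu))
        mu *= growth                                     # tighten the constraints
    return solution
```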
