
Differential dynamic programming

Differential dynamic programming (DDP) is an iterative trajectory optimization algorithm for solving nonlinear optimal control problems, applicable to both discrete-time and continuous-time systems. It applies the principle of optimality from dynamic programming by constructing local quadratic approximations to the Hamilton-Jacobi-Bellman equation around a nominal trajectory, enabling efficient backward and forward sweeps to refine control policies and state trajectories. DDP was developed in the late 1960s as an extension of dynamic programming to handle nonlinear dynamics without relying on the calculus of variations. The method was first proposed by D. Q. Mayne and systematically analyzed by D. H. Jacobson and Mayne in their 1970 book, which established its theoretical foundations and convergence properties for unconstrained problems. Unlike global dynamic programming approaches that suffer from the curse of dimensionality, DDP's local approximations make it computationally tractable for high-dimensional systems. Key features of DDP include its ability to incorporate second-order information for faster convergence compared to first-order methods such as gradient descent, support for control-affine dynamics where closed-form control updates are possible, and adaptability to discrete-time formulations. It has been extended to handle constraints, uncertainties, and stochastic systems, finding applications in robotics for motion planning, in aerospace for trajectory optimization, and in related optimal control problems.

Introduction

Overview

Differential dynamic programming (DDP) is a second-order extension of dynamic programming for solving finite-horizon, discrete-time nonlinear optimal control problems. It builds on the principle of optimality by approximating the nonlinear dynamics and cost-to-go functions locally, enabling efficient backward and forward sweeps. The core purpose of DDP is to iteratively refine an initial trajectory, using local quadratic approximations around a nominal path to minimize a total cost that captures system performance and terminal objectives. This approach generates both open-loop control adjustments and linear feedback policies, allowing for robust handling of perturbations in practice. Key characteristics of DDP include its reliance on local linearization of the dynamics and quadratic approximation of the value function, which facilitates scalability to high-dimensional systems without exhaustive state-space discretization. It is particularly well-suited for complex, nonlinear environments, such as those in robotics where real-time computation is essential. DDP finds typical applications in robot motion planning, such as bipedal locomotion and manipulation tasks, as well as spacecraft trajectory optimization for orbital transfers. In unconstrained settings, the algorithm demonstrates quadratic convergence to a local minimum, often requiring fewer iterations than first-order methods for smooth problems. Its iterative structure alternates between a backward pass, which propagates value function approximations from the terminal state, and a forward pass, which simulates and updates the trajectory using the derived controls.

Historical Development

Differential dynamic programming (DDP) emerged from the foundational principles of dynamic programming developed by Richard Bellman in the 1950s, which provided a framework for solving multistage optimization problems through recursive decomposition, and from advances in optimal control theory that enabled stagewise procedures for handling complex dynamics. DDP specifically builds on these by approximating the value function quadratically at each stage to mitigate the curse of dimensionality inherent in Bellman's original method. The algorithm was first introduced by David Q. Mayne in a seminal 1966 paper (received December 1965) that proposed a second-order gradient method for optimizing nonlinear discrete-time systems. This approach unified dynamic programming with second-order approximations to solve optimal control problems more efficiently than first-order methods. A comprehensive formalization came in the 1970 book Differential Dynamic Programming by David H. Jacobson and David Q. Mayne, which detailed the backward-forward pass structure, established local quadratic convergence under suitable conditions, and positioned DDP as a practical tool for nonlinear systems. The book emphasized DDP's advantages over calculus-of-variations methods by leveraging dynamic programming's recursive nature while incorporating differential approximations for computational tractability. In the 1970s, extensions to continuous-time formulations appeared, including early adaptations reviewed by Michael K. Sain in 1971 and further contributions by researchers such as William F. Denham. During the 1980s and 1990s, DDP found significant applications in aerospace engineering, particularly through NASA-funded research on optimal trajectory design for spacecraft and aircraft maneuvers, as documented in technical reports addressing rigid-flexible body dynamics and low-thrust transfers. These efforts highlighted DDP's robustness for high-dimensional problems in real-time control scenarios. The 2000s marked a resurgence of DDP driven by advances in computational power, revitalizing its use in robotics for trajectory optimization in underactuated systems and manipulation tasks. This modern adoption is exemplified in the 2024 update to MIT's Underactuated Robotics textbook, which integrates DDP into chapters on dynamic programming and trajectory optimization, underscoring its role in contemporary robotic control. In the 2020s, DDP continues to evolve with extensions such as continuation methods for highly nonlinear problems and game-theoretic formulations for robust control, as seen in recent applications in robotics and spacecraft trajectory design (as of 2025).

Problem Setup

Finite-Horizon Discrete-Time Optimal Control

The finite-horizon discrete-time optimal control problem seeks a sequence of control inputs that minimizes a cumulative cost functional over a predetermined number of time steps, starting from a given initial state. This formulation is central to trajectory optimization in nonlinear systems, where the goal is to drive the system toward desirable behavior while balancing performance and control effort. The objective is to minimize the cost functional J = \sum_{k=0}^{N-1} l_k(x_k, u_k) + \phi(x_N) over the state-control trajectory \{x_k, u_k\}_{k=0}^N, subject to the fixed initial state x_0. Here, l_k: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R} denotes the running cost at time step k, which captures stage-wise objectives such as tracking errors or control energy expenditure, and \phi: \mathbb{R}^n \to \mathbb{R} is the terminal cost evaluating the final state x_N. The system evolves according to the nonlinear state dynamics x_{k+1} = f(x_k, u_k), \quad k = 0, \dots, N-1, where x_k \in \mathbb{R}^n is the state at time k, u_k \in \mathbb{R}^m is the control input applied over the interval [k, k+1), and f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n is a differentiable function representing the system model, often derived from discretization of continuous-time dynamics. An initial nominal trajectory \{\bar{x}_k, \bar{u}_k\} is typically required to initialize iterative solution methods. Key assumptions include a finite horizon N, discrete time steps without stochastic disturbances or hard constraints on states/controls in the basic setup, and that the running costs l_k, terminal cost \phi, and dynamics f are twice continuously differentiable to enable second-order approximations. These conditions facilitate analytical and numerical tractability while focusing on deterministic nonlinear behavior. This problem class is computationally challenging due to the nonlinearity of f and l_k, which generally yields a non-convex optimization landscape with multiple local minima, and the curse of dimensionality, where the exponential growth in state space complexity hampers exact solution methods for high-dimensional systems (n \gg 1).
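The quantities above can be made concrete in code. The following Python sketch (using NumPy) sets up a hypothetical finite-horizon problem, an Euler-discretized pendulum with quadratic running and terminal costs; all function names, constants, and the zero-control nominal trajectory are illustrative assumptions rather than a standard benchmark.

```python
import numpy as np

# Hypothetical problem definition: a damped pendulum discretized with Euler's method,
# quadratic running and terminal costs. All names and constants are illustrative.
N = 50          # horizon length
dt = 0.05       # time step
n, m = 2, 1     # state and control dimensions

def f(x, u):
    """Nonlinear dynamics x_{k+1} = f(x_k, u_k) for a pendulum state (theta, theta_dot)."""
    theta, omega = x
    omega_dot = -9.81 * np.sin(theta) - 0.1 * omega + u[0]
    return np.array([theta + dt * omega, omega + dt * omega_dot])

Q = np.diag([1.0, 0.1])          # state weight in the running cost
R = np.array([[0.01]])           # control weight in the running cost
Qf = np.diag([10.0, 1.0])        # terminal state weight
x_goal = np.array([np.pi, 0.0])  # upright target state

def l(x, u):
    """Running cost l_k(x_k, u_k)."""
    dx = x - x_goal
    return 0.5 * dx @ Q @ dx + 0.5 * u @ R @ u

def phi(x):
    """Terminal cost phi(x_N)."""
    dx = x - x_goal
    return 0.5 * dx @ Qf @ dx

def total_cost(xs, us):
    """J = sum_k l(x_k, u_k) + phi(x_N) along a trajectory."""
    return sum(l(xs[k], us[k]) for k in range(N)) + phi(xs[-1])

# A nominal trajectory obtained by rolling out zero controls from the fixed initial state.
x0 = np.zeros(n)
us_bar = [np.zeros(m) for _ in range(N)]
xs_bar = [x0]
for k in range(N):
    xs_bar.append(f(xs_bar[-1], us_bar[k]))
print("nominal cost:", total_cost(xs_bar, us_bar))
```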

Dynamic Programming Principles

Dynamic programming provides a foundational framework for solving problems by decomposing them into a sequence of simpler subproblems through recursive computation. At its core is the Bellman principle of optimality, which asserts that an optimal policy has the property that, regardless of the initial state and initial decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. This principle enables the systematic evaluation of decisions stage by stage in multi-stage decision processes. Central to this approach is the value function V_k(x), defined as the minimum cost-to-go from stage k onward when starting from state x. The value function satisfies the Bellman recursion, given by V_k(x) = \min_u \left[ l_k(x, u) + V_{k+1}(f(x, u)) \right], with the terminal condition V_N(x) = \phi(x) at the final stage N, where l_k(x, u) is the stage cost and f(x, u) is the state transition function. This recursion allows the optimal cost-to-go to be computed backward from the terminal stage to the initial stage. Once the value functions are obtained, an optimal policy can be derived by selecting, at each stage k, the control u that achieves the minimum in the Bellman recursion; the optimal trajectory is then simulated forward using this policy. In the context of optimal control, dynamic programming yields an exact solution in principle by exhaustively searching the state-action space at each stage. However, for nonlinear problems with high-dimensional states, exact dynamic programming becomes intractable due to the curse of dimensionality, where computational and storage requirements grow exponentially with the state dimension. This limitation motivates the development of approximate local methods, such as differential dynamic programming, which leverage differential approximations to make the approach feasible for practical nonlinear systems.
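For intuition, the backward Bellman recursion can be carried out exactly on a coarse grid for a low-dimensional example. The sketch below uses a hypothetical scalar system with discretized state and control sets to illustrate how V_k is tabulated backward from V_N = \phi; the grid resolution and dynamics are illustrative, and the exponential growth of such tables with state dimension is precisely the curse of dimensionality discussed above.

```python
import numpy as np

# Illustrative exact dynamic programming on a coarse grid for a scalar system,
# implementing the backward recursion V_k(x) = min_u [ l(x, u) + V_{k+1}(f(x, u)) ].
N = 20
xs = np.linspace(-2.0, 2.0, 81)      # discretized state grid
us = np.linspace(-1.0, 1.0, 21)      # discretized control grid

def f(x, u):            # simple scalar nonlinear dynamics (illustrative)
    return x + 0.1 * u - 0.05 * x**3

def l(x, u):            # running cost
    return x**2 + 0.1 * u**2

def phi(x):             # terminal cost
    return 10.0 * x**2

V = phi(xs)                          # V_N tabulated on the grid
policy = np.zeros((N, xs.size))      # stores the minimizing control per stage and state
for k in range(N - 1, -1, -1):
    # For every grid state, evaluate every control and interpolate V_{k+1} at f(x, u).
    X, U = np.meshgrid(xs, us, indexing="ij")
    Vnext = np.interp(f(X, U).ravel(), xs, V).reshape(X.shape)
    Qtable = l(X, U) + Vnext                     # Q_k(x, u) on the grid
    best = Qtable.argmin(axis=1)
    policy[k] = us[best]
    V = Qtable.min(axis=1)                       # V_k(x) = min_u Q_k(x, u)

print("V_0 at x = 0:", np.interp(0.0, xs, V))
```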

Core Algorithm

Mathematical Formulation

Differential dynamic programming approximates the dynamic programming value functions locally around a nominal trajectory \{\bar{x}_k, \bar{u}_k\}_{k=0}^{N} using second-order expansions to derive quadratic subproblems for control updates. This approach specializes the general dynamic programming principles by focusing on local linear-quadratic approximations rather than exact global solutions. Consider the nonlinear dynamics x_{k+1} = f_k(x_k, u_k) and stage cost l_k(x_k, u_k). Around the nominal trajectory, the dynamics are linearized as \delta x_{k+1} = f_k(\bar{x}_k, \bar{u}_k) + f_{x,k} \delta x_k + f_{u,k} \delta u_k - \bar{x}_{k+1}, where \delta x_k = x_k - \bar{x}_k, \delta u_k = u_k - \bar{u}_k, f_{x,k} = \frac{\partial f_k}{\partial x}(\bar{x}_k, \bar{u}_k), and f_{u,k} = \frac{\partial f_k}{\partial u}(\bar{x}_k, \bar{u}_k). This yields the affine approximation \delta x_{k+1} \approx A_k \delta x_k + B_k \delta u_k, with A_k = f_{x,k} and B_k = f_{u,k}. For the full second-order approximation, the dynamics expansion includes quadratic terms: f_k(\bar{x}_k + \delta x_k, \bar{u}_k + \delta u_k) \approx \bar{x}_{k+1} + A_k \delta x_k + B_k \delta u_k + \frac{1}{2} \delta x_k^T f_{xx,k} \delta x_k + \delta x_k^T f_{xu,k} \delta u_k + \frac{1}{2} \delta u_k^T f_{uu,k} \delta u_k, where f_{xx,k}, f_{xu,k}, f_{uu,k} are the second partial derivative tensors evaluated at (\bar{x}_k, \bar{u}_k). The value function V_{k+1}(x_{k+1}), representing the optimal cost-to-go from stage k+1, is approximated quadratically around \bar{x}_{k+1}: V_{k+1}(x_{k+1} + \delta x_{k+1}) \approx V_{k+1}(\bar{x}_{k+1}) + V_x^{k+1} \delta x_{k+1} + \frac{1}{2} \delta x_{k+1}^T V_{xx}^{k+1} \delta x_{k+1}, where V_x^{k+1} = \frac{\partial V_{k+1}}{\partial x}(\bar{x}_{k+1}) is the gradient and V_{xx}^{k+1} = \frac{\partial^2 V_{k+1}}{\partial x^2}(\bar{x}_{k+1}) is the Hessian. The action-value function at stage k is defined as Q_k(x_k, u_k) = l_k(x_k, u_k) + V_{k+1}(f_k(x_k, u_k)). Substituting the local approximations, Q_k becomes quadratic in the deviations \delta x_k and \delta u_k. First, expand the stage cost via second-order Taylor series: l_k(\bar{x}_k + \delta x_k, \bar{u}_k + \delta u_k) \approx l_k(\bar{x}_k, \bar{u}_k) + l_{x,k} \delta x_k + l_{u,k} \delta u_k + \frac{1}{2} \begin{pmatrix} \delta x_k \\ \delta u_k \end{pmatrix}^T \begin{pmatrix} l_{xx,k} & l_{xu,k} \\ l_{ux,k} & l_{uu,k} \end{pmatrix} \begin{pmatrix} \delta x_k \\ \delta u_k \end{pmatrix}, with l_{x,k} = \frac{\partial l_k}{\partial x}(\bar{x}_k, \bar{u}_k), l_{u,k} = \frac{\partial l_k}{\partial u}(\bar{x}_k, \bar{u}_k), and the Hessians l_{xx,k} = \frac{\partial^2 l_k}{\partial x^2}(\bar{x}_k, \bar{u}_k), etc. (noting l_{xu,k} = l_{ux,k}^T). For the value term, substitute the second-order dynamics approximation into the quadratic V_{k+1}: V_{k+1}(f_k(\bar{x}_k + \delta x_k, \bar{u}_k + \delta u_k)) \approx V_{k+1}(\bar{x}_{k+1}) + V_x^{k+1} (A_k \delta x_k + B_k \delta u_k) + \frac{1}{2} (A_k \delta x_k + B_k \delta u_k)^T V_{xx}^{k+1} (A_k \delta x_k + B_k \delta u_k) + V_x^{k+1} \cdot \left( \frac{1}{2} \delta x_k^T f_{xx,k} \delta x_k + \delta x_k^T f_{xu,k} \delta u_k + \frac{1}{2} \delta u_k^T f_{uu,k} \delta u_k \right), where \cdot denotes contraction of the value gradient with the second-derivative tensors of the dynamics.
Combining both expansions, the action-value function simplifies to Q_k(\delta x_k, \delta u_k) \approx q_k + Q_{x,k} \delta x_k + Q_{u,k} \delta u_k + \frac{1}{2} \delta x_k^T Q_{xx,k} \delta x_k + \delta x_k^T Q_{xu,k} \delta u_k + \frac{1}{2} \delta u_k^T Q_{uu,k} \delta u_k, where the constant q_k = l_k(\bar{x}_k, \bar{u}_k) + V_{k+1}(\bar{x}_{k+1}), the linear terms are Q_{x,k} = l_{x,k} + A_k^T V_x^{k+1} and Q_{u,k} = l_{u,k} + B_k^T V_x^{k+1}, and the quadratic coefficients derive as Q_{xx,k} = l_{xx,k} + A_k^T V_{xx}^{k+1} A_k + V_x^{k+1} \cdot f_{xx,k}, \quad Q_{xu,k} = l_{xu,k} + A_k^T V_{xx}^{k+1} B_k + V_x^{k+1} \cdot f_{xu,k}, \quad Q_{uu,k} = l_{uu,k} + B_k^T V_{xx}^{k+1} B_k + V_x^{k+1} \cdot f_{uu,k}. These coefficients capture the second-order effects from both the stage cost and the propagated value function. Minimizing Q_k with respect to \delta u_k for a given \delta x_k yields the optimal local policy \delta u_k = -Q_{uu,k}^{-1} (Q_{u,k} + Q_{xu,k}^T \delta x_k) = \alpha_k + K_k \delta x_k, where the feedforward term is \alpha_k = -Q_{uu,k}^{-1} Q_{u,k} and the feedback gain is K_k = -Q_{uu,k}^{-1} Q_{xu,k}^T. This affine policy approximates the optimal control deviation around the nominal trajectory.
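The assembly of the Q-coefficients and the resulting affine policy can be summarized in a short helper, sketched below in Python with NumPy. The derivative arrays are assumed to be supplied by the caller; when the second-derivative tensors of the dynamics are omitted, the expressions reduce to the Gauss-Newton (iLQR) variant.

```python
import numpy as np

def q_terms(lx, lu, lxx, lxu, luu, A, B, Vx, Vxx, fxx=None, fxu=None, fuu=None):
    """Assemble the local quadratic model of Q_k from cost and dynamics derivatives.

    A = f_x and B = f_u are the dynamics Jacobians; Vx, Vxx approximate V_{k+1}.
    fxx, fxu, fuu are optional second-derivative tensors of shape (n, ., .); when they
    are omitted the update reduces to the Gauss-Newton / iLQR variant."""
    Qx = lx + A.T @ Vx
    Qu = lu + B.T @ Vx
    Qxx = lxx + A.T @ Vxx @ A
    Qxu = lxu + A.T @ Vxx @ B
    Quu = luu + B.T @ Vxx @ B
    if fxx is not None:
        # Contract the value gradient with the dynamics second derivatives:
        # (Vx . f_xx)_{ij} = sum_a Vx[a] * fxx[a, i, j], and similarly for fxu, fuu.
        Qxx = Qxx + np.tensordot(Vx, fxx, axes=1)
        Qxu = Qxu + np.tensordot(Vx, fxu, axes=1)
        Quu = Quu + np.tensordot(Vx, fuu, axes=1)
    return Qx, Qu, Qxx, Qxu, Quu

def affine_policy(Qu, Qxu, Quu):
    """Feedforward alpha_k and feedback gain K_k minimizing the quadratic model over du."""
    Quu_inv = np.linalg.inv(Quu)
    alpha = -Quu_inv @ Qu          # feedforward term
    K = -Quu_inv @ Qxu.T           # feedback gain
    return alpha, K
```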

Backward Pass

The backward pass in differential dynamic programming (DDP) constitutes the core recursive procedure for approximating the value function and deriving locally optimal control policies by propagating second-order information backward through time along a nominal trajectory. This pass leverages quadratic expansions of the cost-to-go and the dynamics to compute local linear-quadratic approximations, enabling the identification of feedback gains and feedforward control terms that approximate the global optimum in a neighborhood of the current trajectory. Introduced as a second-order extension of dynamic programming principles, the backward pass ensures efficient exploitation of curvature information for convergence in nonlinear optimal control problems. The process begins with initialization at the terminal time step k = N, where the value function V_N(x_N) is set to the terminal cost \phi(x_N), with its first-order gradient V_x = \phi_x and second-order Hessian V_{xx} = \phi_{xx}. These terms represent the exact quadratic approximation at the horizon end, assuming \phi is twice differentiable. In standard notation, the value function approximation takes the form V_k(x_k) \approx c_k + s_k^\top \delta x_k + \frac{1}{2} \delta x_k^\top S_k \delta x_k, where \delta x_k = x_k - \bar{x}_k is the state deviation from the nominal \bar{x}_k, S_k \succeq 0 is the value Hessian, s_k is the value gradient, and c_k is the constant term. At k = N, c_N = \phi(\bar{x}_N), s_N = \phi_x, and S_N = \phi_{xx}. The recursion proceeds backward from k = N-1 to k = 0, at each step computing the action-value function coefficients Q_x, Q_u, Q_{xx}, Q_{xu}, Q_{uu} via second-order Taylor expansions of the stage cost l_k(x_k, u_k) and dynamics f_k(x_k, u_k) around the nominal trajectory, incorporating the value approximation from step k+1. Specifically, \begin{align} Q_x &= l_x + f_x^\top V_x, \\ Q_u &= l_u + f_u^\top V_x, \\ Q_{xx} &= l_{xx} + f_x^\top V_{xx} f_x + V_x \cdot f_{xx}, \\ Q_{uu} &= l_{uu} + f_u^\top V_{xx} f_u + V_x \cdot f_{uu}, \\ Q_{xu} &= l_{xu} + f_x^\top V_{xx} f_u + V_x \cdot f_{xu}, \end{align} where subscripts denote partial derivatives evaluated at (\bar{x}_k, \bar{u}_k), and \cdot indicates contraction over tensor products for the higher-order terms. These Q-terms approximate the change in the Bellman residual Q_k(x_k, u_k) = l_k(x_k, u_k) + V_{k+1}(f_k(x_k, u_k)) quadratically in deviations \delta x_k, \delta u_k. From these, the value function parameters are updated to complete the recursion. The value Hessian is obtained by eliminating the control deviation in the quadratic Q-form: S_k = Q_{xx} - Q_{xu} Q_{uu}^{-1} Q_{xu}^\top, assuming Q_{uu} \succ 0. The value gradient follows as s_k = Q_x - Q_{xu} Q_{uu}^{-1} Q_u, and the constant term incorporates the minimized quadratic over controls: c_k = q_k - \frac{1}{2} Q_u^\top Q_{uu}^{-1} Q_u, where q_k = l_k(\bar{x}_k, \bar{u}_k) + c_{k+1} is the zeroth-order Q-term. These updates ensure the value approximation V_k minimizes the expected future cost under the local dynamics. The backward pass also yields the local control policy by minimizing the quadratic Q_k over \delta u_k, resulting in affine feedback form \delta u_k = k_k + K_k \delta x_k. The feedback gain is K_k = -Q_{uu}^{-1} Q_{xu}^\top, which provides state-deviation correction, and the feedforward term is k_k = -Q_{uu}^{-1} Q_u, capturing nominal adjustments. By iteratively refining these gains and value approximations, the backward pass propagates second-order sensitivity of the cost to state and control perturbations, facilitating rapid quadratic convergence near local optima when applied in conjunction with forward trajectory updates.
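A minimal sketch of this recursion is given below, under the common iLQR simplification that drops the second-order dynamics tensors and with a small fixed damping on Q_{uu}; the function signature and list-based bookkeeping are illustrative choices rather than a standard API.

```python
import numpy as np

def backward_pass(A_list, B_list, lx, lu, lxx, lxu, luu, phi_x, phi_xx, reg=1e-6):
    """One DDP/iLQR backward pass given per-stage derivatives along the nominal trajectory.

    A_list[k], B_list[k] are the dynamics Jacobians f_x, f_u at stage k; lx, lu, lxx,
    lxu, luu are lists of cost derivatives; phi_x, phi_xx are the terminal-cost
    derivatives. Second-order dynamics tensors are omitted (the iLQR simplification);
    reg is a small Levenberg-Marquardt damping on Q_uu."""
    N = len(A_list)
    m = B_list[0].shape[1]
    Vx, Vxx = phi_x, phi_xx                     # initialize at the terminal stage
    ks = [None] * N                             # feedforward terms k_k
    Ks = [None] * N                             # feedback gains K_k
    for k in range(N - 1, -1, -1):
        A, B = A_list[k], B_list[k]
        Qx  = lx[k]  + A.T @ Vx
        Qu  = lu[k]  + B.T @ Vx
        Qxx = lxx[k] + A.T @ Vxx @ A
        Qxu = lxu[k] + A.T @ Vxx @ B
        Quu = luu[k] + B.T @ Vxx @ B + reg * np.eye(m)
        Quu_inv = np.linalg.inv(Quu)
        ks[k] = -Quu_inv @ Qu
        Ks[k] = -Quu_inv @ Qxu.T
        # Value-function update for stage k (gradient and Hessian).
        Vx  = Qx  - Qxu @ Quu_inv @ Qu
        Vxx = Qxx - Qxu @ Quu_inv @ Qxu.T
        Vxx = 0.5 * (Vxx + Vxx.T)               # keep the Hessian numerically symmetric
    return ks, Ks
```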

Forward Pass

The forward pass in differential dynamic programming utilizes the linear feedback policy derived during the backward pass to update the nominal trajectory by simulating the dynamics forward in time, starting from the fixed initial state \bar{x}_0. This process refines the candidate trajectory toward local optimality by applying the affine control law at each time step, where the control u_k is computed as u_k = \bar{u}_k + k_k + K_k (x_k - \bar{x}_k), with k_k and K_k being the feedforward and feedback gains obtained from the backward-pass minimization of the value function. The resulting controls are then used to propagate the states via the true nonlinear dynamics: x_{k+1} = f(x_k, u_k) for k = 0 to N-1, yielding a new candidate trajectory \{x_k, u_k\}. Once the full trajectory is simulated, the total cost J is evaluated as J = \sum_{k=0}^{N-1} l(x_k, u_k) + \phi(x_N) and compared to the cost of the previous nominal trajectory. If the new cost represents an improvement (i.e., \Delta J < 0), the updated trajectory is accepted as the new nominal \{\bar{x}_k, \bar{u}_k\}; otherwise, the step is rejected or scaled back. This forward evaluation ensures that the trajectory remains feasible under the actual dynamics, contrasting with the local quadratic approximations used in the backward pass. The overall algorithm alternates between backward and forward passes iteratively until convergence, typically assessed by a small change in cost such as |\Delta J| < \epsilon, where \epsilon is a user-defined tolerance (e.g., \epsilon = 10^{-6}). In practice, convergence is often achieved in 10 to 100 such cycles for many nonlinear control problems, with step sizes often decreasing over iterations to promote stability near the optimum. Upon convergence, the algorithm outputs the optimal open-loop trajectory \{x_k^*, u_k^*\} along with the time-varying feedback policy K_k for potential use in receding-horizon implementations, where the policy can stabilize execution against perturbations.
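The forward rollout and the accept/reject test can be expressed compactly as follows; the callables f, l, and phi and the trajectory bookkeeping are assumed to match the problem setup sketched earlier, and the acceptance rule shown is the simplest cost-decrease test.

```python
def forward_pass(f, l, phi, xs_bar, us_bar, ks, Ks, alpha=1.0):
    """Roll out the true nonlinear dynamics under the affine policy from the backward pass.

    f, l, phi are the dynamics, running cost, and terminal cost callables; xs_bar, us_bar
    is the nominal trajectory; ks, Ks are the feedforward terms and feedback gains;
    alpha scales the feedforward step (alpha = 1 is the full step)."""
    N = len(us_bar)
    xs_new = [xs_bar[0]]                        # the initial state is fixed
    us_new = []
    cost = 0.0
    for k in range(N):
        dx = xs_new[k] - xs_bar[k]
        u = us_bar[k] + alpha * ks[k] + Ks[k] @ dx
        us_new.append(u)
        cost += l(xs_new[k], u)
        xs_new.append(f(xs_new[k], u))          # propagate with the true dynamics
    cost += phi(xs_new[-1])
    return xs_new, us_new, cost

def ddp_iteration(f, l, phi, xs_bar, us_bar, cost_bar, ks, Ks):
    """Accept the new trajectory only if it decreases the total cost."""
    xs_new, us_new, cost_new = forward_pass(f, l, phi, xs_bar, us_bar, ks, Ks)
    if cost_new < cost_bar:
        return xs_new, us_new, cost_new, True
    return xs_bar, us_bar, cost_bar, False
```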

Practical Enhancements

Regularization Techniques

In differential dynamic programming, the Hessian Q_{uu} computed during the backward pass may become indefinite due to model nonlinearities, resulting in unstable gains that hinder convergence. To address this, Levenberg-Marquardt regularization modifies the control Hessian by adding a positive damping term: \tilde{Q}_{uu} = Q_{uu} + \lambda I, where \lambda > 0 and I is the identity matrix, yielding the stabilized update \delta u = -(\tilde{Q}_{uu})^{-1} Q_u. This ensures positive definiteness, blending Newton-like steps (small \lambda) with gradient-descent-like steps (large \lambda). The parameter \lambda is adapted iteratively, typically starting from a small value such as 10^{-6} and increased (e.g., multiplied by 10) if the updated trajectory increases the cost, mimicking trust-region enforcement, while decreased (e.g., divided by 10) on successful cost reductions to accelerate convergence. An alternative approach applies regularization directly to the value function Hessian S_k (or V_{xx}) propagated backward, adding \lambda I to enforce S_k \succ 0 and indirectly stabilizing Q_{uu} through the dynamics linearization. These techniques preserve local quadratic convergence near optima, with \lambda values commonly spanning 10^{-6} to 10^{6} depending on problem conditioning.
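A possible implementation of this damping and its adaptation schedule is sketched below; the Cholesky-based positive-definiteness check and the factor-of-10 schedule are common but illustrative choices.

```python
import numpy as np

def regularized_gains(Qu, Qxu, Quu, lam):
    """Compute damped feedforward/feedback terms with Levenberg-Marquardt regularization."""
    m = Quu.shape[0]
    Quu_reg = Quu + lam * np.eye(m)             # Q_uu + lambda * I
    try:
        np.linalg.cholesky(Quu_reg)             # positive-definiteness check
    except np.linalg.LinAlgError:
        return None, None, False                # caller should increase lambda and retry
    k_ff = -np.linalg.solve(Quu_reg, Qu)        # damped feedforward term
    K_fb = -np.linalg.solve(Quu_reg, Qxu.T)     # damped feedback gain
    return k_ff, K_fb, True

def update_lambda(lam, improved, lam_min=1e-6, lam_max=1e6, factor=10.0):
    """Simple schedule: shrink lambda after an accepted step, grow it after a rejected one."""
    if improved:
        return max(lam / factor, lam_min)
    return min(lam * factor, lam_max)
```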

Line Search Methods

In differential dynamic programming (DDP), applying the full step derived from the backward pass during the forward pass may lead to overshooting or increased cost due to the linear-quadratic approximations of the nonlinear dynamics and cost function. To address this, line search methods adjust the step size \alpha along the search direction provided by the linear feedback gains K_k and feedforward terms k_k, ensuring reliable trajectory updates that decrease the objective. A common approach is backtracking line search, which begins with \alpha = 1 and iteratively reduces \alpha (typically by halving) until an acceptance criterion is met. The trial control input is computed as u_k^\alpha = \bar{u}_k + \alpha (k_k + K_k \delta x_k), followed by simulation of the trial trajectory \hat{x}_{k+1} = f(\hat{x}_k, u_k^\alpha) to evaluate the update. This process continues until the Armijo condition holds: \Delta J \leq \eta \alpha \nabla J^T d, where \Delta J is the change in cost, \eta is a small constant (e.g., \eta = 10^{-4}), \nabla J^T d is the predicted descent, and d is the search direction. The merit function for acceptance is typically the total cost J, though augmented versions incorporating constraints may be used in constrained settings. Alternatives include exact line search, which minimizes J(\alpha) directly (e.g., via quadratic interpolation or derivative-free methods), or trust-region methods that bound \alpha based on a local model trust radius to balance progress and reliability. These techniques ensure monotonic descent in the cost, preventing divergence and promoting stable convergence, and are widely implemented in practical DDP variants such as iterative linear-quadratic regulation (iLQR). The added computational cost is modest, typically requiring 1–5 forward evaluations per iteration due to the backtracking reductions.
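A backtracking search of this kind might be organized as follows; the rollout and expected_decrease callables are assumed helpers (the latter returning the cost change predicted by the local quadratic model), and the acceptance threshold mirrors the Armijo condition above.

```python
def backtracking_line_search(rollout, cost_bar, expected_decrease,
                             alpha0=1.0, beta=0.5, eta=1e-4, max_iters=10):
    """Backtracking (Armijo-style) line search over the feedforward step scale alpha.

    rollout(alpha) is assumed to simulate the trial trajectory with controls
    u_k = u_bar_k + alpha * k_k + K_k (x_k - x_bar_k) and return (trajectory, cost);
    expected_decrease(alpha) returns the cost change predicted by the quadratic model,
    e.g. alpha * sum_k Q_u^T k_k + 0.5 * alpha^2 * sum_k k_k^T Q_uu k_k (negative for descent)."""
    alpha = alpha0
    for _ in range(max_iters):
        traj, cost = rollout(alpha)
        actual = cost_bar - cost                    # realized cost reduction
        predicted = -expected_decrease(alpha)       # model-predicted reduction (> 0 for descent)
        if actual > 0 and actual >= eta * predicted:
            return traj, cost, alpha                # Armijo-type acceptance
        alpha *= beta                               # backtrack: shrink the step
    return None, cost_bar, 0.0                      # no acceptable step found
```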

Advanced Variants

Stochastic and Monte Carlo DDP

Stochastic differential dynamic programming (SDDP) extends the deterministic DDP framework to systems with additive or multiplicative process noise, where the dynamics are modeled as x_{k+1} = f(x_k, u_k, w_k) with w_k \sim p(w) representing zero-mean noise drawn from a known distribution p(w). The objective is to minimize the expected total cost \mathbb{E} \left[ \sum_{k=0}^{N-1} l_k(x_k, u_k) + \phi(x_N) \right], where the expectation is taken over the noise realizations, and l_k and \phi are the stage and terminal costs, respectively. This formulation accounts for uncertainty in the system evolution, making SDDP suitable for real-world applications, such as robotics and aerospace systems, where disturbances are inevitable. Monte Carlo methods approximate the expectations in SDDP by sampling multiple noise trajectories. In the forward pass, M noise samples are drawn to generate M rollouts from the current nominal trajectory, and the average cost is used to approximate the quadratic terms in the action-value function Q_k. This sampling-based approach, known as sampled DDP (SaDDP), enables handling of high-dimensional nonlinear systems by avoiding explicit computation of covariance matrices for the noise. For instance, in systems with up to 100 states, SaDDP demonstrates improved scalability compared to traditional methods that propagate uncertainty analytically. The backward pass in SDDP approximates expectations of the value function derivatives using sample averages. Specifically, the second-order terms like \mathbb{E}[V_{xx}] are estimated by averaging the Hessian contributions from the sampled trajectories, incorporating linear and quadratic noise effects for multiplicative cases. This leads to modified update rules that propagate uncertainty backward, yielding a stochastic policy that hedges against noise. Risk-sensitive variants of SDDP adopt a game-theoretic perspective, formulating the problem as a min-max optimization over worst-case disturbances to enhance robustness. In min-max DDP, the controller minimizes the cost while an adversarial player maximizes it through disturbance actions v_k, resulting in x_{k+1} = f(x_k, u_k, v_k) and a saddle-point solution under suitable existence conditions. This approach, equivalent to risk-sensitive control with exponential cost criteria, has been applied to quadrotor tracking under wind disturbances, outperforming nominal DDP in experimental settings. Convergence of SDDP improves with larger M, reducing variance in the approximations at the cost of higher computational demand; it is particularly effective for multiplicative noise where analytical expectations are intractable. Seminal work on SDDP for additive and multiplicative noise establishes its theoretical foundations, influencing subsequent sampling-based extensions. Recent advances include data-driven stochastic game-theoretic DDP, which uses regression to approximate unknown costs and dynamics from input-output data, enabling control of nonlinear systems without explicit models (published April 2025). Additional extensions enhance robustness for low-thrust trajectory design in space exploration, outperforming deterministic methods under model uncertainties.
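As an illustration of the Monte Carlo approximation of expected cost, the following sketch averages rollouts of the affine DDP policy under additive Gaussian noise; the noise model, sample count, and function names are illustrative assumptions rather than a specific published SaDDP implementation.

```python
import numpy as np

def sampled_rollout_cost(f, l, phi, xs_bar, us_bar, ks, Ks, noise_std, M=32, seed=0):
    """Monte Carlo estimate of the expected cost of the affine policy under additive noise.

    The dynamics are perturbed as x_{k+1} = f(x_k, u_k) + w_k with w_k ~ N(0, noise_std^2 I);
    M independent rollouts are averaged. All names and the noise model are illustrative."""
    rng = np.random.default_rng(seed)
    N = len(us_bar)
    n = xs_bar[0].shape[0]
    costs = []
    for _ in range(M):
        x = xs_bar[0]
        cost = 0.0
        for k in range(N):
            u = us_bar[k] + ks[k] + Ks[k] @ (x - xs_bar[k])
            cost += l(x, u)
            w = noise_std * rng.standard_normal(n)       # sampled process noise
            x = f(x, u) + w
        costs.append(cost + phi(x))
    return np.mean(costs), np.std(costs) / np.sqrt(M)    # estimate and its standard error
```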

Constrained DDP

Constrained differential dynamic programming (CDDP) extends the standard DDP framework to incorporate state and control constraints, ensuring trajectories remain feasible while minimizing the cost functional. These constraints are typically formulated as stagewise inequalities g_k(x_k, u_k) \leq 0 and equalities h_k(x_k, u_k) = 0, where x_k and u_k denote the state and control at time step k, respectively. Such formulations arise in applications requiring obstacle avoidance, joint limits, or resource bounds, transforming the unconstrained optimal control problem into a constrained one that requires modifications to both the backward and forward passes. One common approach to handle inequalities is the penalty method, which augments the original cost with a penalty term \mu \sum_k \max(0, g_k(x_k, u_k))^2, where \mu > 0 is a penalty parameter that increases iteratively to enforce feasibility. This quadratic penalty on violations integrates seamlessly into the quadratic approximation of the DDP backward pass, treating the augmented cost as unconstrained while gradually tightening adherence to g_k \leq 0. For equality constraints, similar penalties like \mu \sum_k h_k(x_k, u_k)^2 can be applied, or augmented Lagrangian formulations may incorporate Lagrange multipliers to improve convergence. These methods, often iterated in an outer loop around the standard DDP inner solver, balance computational efficiency with constraint satisfaction but can suffer from ill-conditioning for large \mu. Active-set methods address constraints by dynamically identifying the set of active inequalities (those near or violating g_k \leq 0) at each iteration, linearizing them as equalities C_k \delta u_k + D_k \delta x_k \approx 0 around the nominal trajectory, and projecting the solution onto the feasible set. In the backward pass, this reduces to solving an equality-constrained quadratic program (QP) for the feedback gains and feedforward terms, excluding inactive constraints to maintain efficiency. The active set is refined based on Lagrange multipliers and violation tolerances, ensuring monotonic decrease in the merit function during the forward pass. This approach excels for sparse or predictable constraints but requires careful initialization to avoid frequent set switches, which can increase computational cost. Primal-dual methods incorporate Lagrange multipliers \lambda_k directly into the DDP recursion, augmenting the control gradient term as Q_{u,k} \leftarrow Q_{u,k} + \lambda_k^T g_{u,k} (and similarly for cross terms) while linearizing the constraints via their Jacobians. Multipliers are updated via dual ascent steps, such as \lambda_k \leftarrow \lambda_k + \alpha \mu g_k in an augmented Lagrangian setup, where \alpha is a step size ensuring descent. For equalities, \lambda_k directly enforce h_k = 0 through the KKT conditions solved in each QP subproblem. These techniques, often combined with slack variables for inequalities, promote faster convergence near boundaries compared to pure penalties. Nonlinear constraints pose significant challenges in CDDP, necessitating accurate linearizations around the current trajectory to avoid infeasible QPs or constraint violations; poor approximations can lead to slower convergence than unconstrained DDP, with iteration counts increasing by factors of 2–5 in practice. Early developments in the late 1970s applied constrained DDP to multireservoir control problems, demonstrating its viability for large-scale sequential decisions with linear inequalities.
In modern robotics applications, such as quadrotor flight and legged locomotion, CDDP variants have enabled real-time trajectory optimization under complex state constraints, with seminal works revisiting and hybridizing penalty and active-set strategies for improved robustness. Recent applications include constraint-augmented DDP for automatic falling recovery in robots, achieving over a 95% success rate in simulations and real experiments on varied terrains (2025). Parameterized variants enable waypoint-based trajectory optimization for UAVs with flexible timing and second-order convergence (2024). Proximal methods, such as PROXDDP, support real-time whole-body model-predictive control for quadruped locomotion and manipulator tasks (2025).
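A penalty-based outer loop of the kind described above might look as follows; ddp_solve stands in for any unconstrained DDP/iLQR solver, and the quadratic penalty and growth schedule are illustrative.

```python
import numpy as np

def penalized_stage_cost(l, g, mu):
    """Wrap a stage cost with a quadratic penalty on inequality violations g(x, u) <= 0.

    l(x, u) is the original running cost and g(x, u) returns a vector of constraint
    values; mu is the penalty weight, increased between outer iterations."""
    def l_aug(x, u):
        violation = np.maximum(0.0, g(x, u))            # elementwise max(0, g)
        return l(x, u) + mu * np.sum(violation ** 2)
    return l_aug

def penalty_method(ddp_solve, l, g, mu0=1.0, growth=10.0, outer_iters=5):
    """Outer loop: solve a sequence of unconstrained DDP problems with growing mu.

    ddp_solve is a placeholder for any unconstrained DDP/iLQR solver that accepts a
    running-cost callable and returns a solution trajectory."""
    mu = mu0
    solution = None
    for _ in range(outer_iters):
        solution = ddp_solve(penalized_stage_cost(l, g, mu))
        mu *= growth                                     # tighten the constraints
    return solution
```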
