
Bellman equation

The Bellman equation is a core mathematical relation in dynamic programming that defines the value of a decision problem at a given state in terms of the maximum of its immediate reward plus the value of the remaining subproblem, enabling the decomposition of complex optimization tasks into simpler recursive components. Named after American applied mathematician Richard Bellman, who developed it in the late 1950s as part of his pioneering work on dynamic programming, the equation formalizes the principle of optimality: an optimal policy for a multistage decision process must consist of optimal policies for each of the subproblems arising after the initial stages. This principle, articulated by Bellman in 1957, asserts that "an optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." The equation typically takes the form V(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V(s') \right] for a state s, action a, reward R, transition probability P, and discount factor \gamma, though it generalizes to deterministic and continuous cases. In essence, it equates the value of being in a given state to the best possible immediate reward plus the discounted value of the next state under optimal future behavior, providing a fixed-point equation whose solution yields the optimal value function. Bellman introduced dynamic programming during his time at the RAND Corporation to address large-scale optimization problems in logistics and operations research, such as routing and inventory management, where traditional exhaustive enumeration proved inefficient. Beyond its origins in applied mathematics and operations research, the Bellman equation has become indispensable in fields like economics for solving stochastic growth models and in artificial intelligence for reinforcement learning algorithms. In Markov decision processes (MDPs), a foundational framework for sequential decision-making under uncertainty, it underpins methods like value iteration and policy iteration to compute optimal policies by iteratively updating value estimates until convergence. For instance, in reinforcement learning, the equation facilitates learning optimal behaviors in environments like games or robotics by balancing exploration and exploitation through temporal difference updates. Its recursive structure not only ensures computational tractability for problems with overlapping subproblems but also extends to Hamilton-Jacobi-Bellman equations in continuous-time optimal control, influencing modern applications in robotics, autonomous systems, and finance.

Background Concepts

Dynamic Programming Foundations

Dynamic programming is a mathematical method for solving complex optimization problems by decomposing them into a series of simpler subproblems, where the solutions to these subproblems can be reused due to overlapping structures, thereby avoiding redundant computations. This approach, originally developed for multistage decision processes, enables the efficient computation of optimal policies by recursively building solutions from smaller components. The term "dynamic programming" was coined by Richard Bellman in the early 1950s while working at the RAND Corporation, chosen deliberately to appeal to military funding sources during the Cold War era by evoking notions of planning and motion rather than pure mathematical research, which faced scrutiny in government contracts. Key components of a dynamic programming formulation include states, which represent the system's configuration at a given stage; decisions or actions available at each state; transitions that describe how the system evolves based on chosen actions; rewards or costs associated with transitions; and value functions that quantify the long-term desirability of states under optimal decision-making. These elements form the foundation for modeling sequential decision problems, such as shortest-path routing or inventory management, where choices at one stage influence future outcomes. Dynamic programming problems are distinguished by their time horizon: finite horizon problems involve a fixed number of stages, allowing solutions via backward induction, which starts from the terminal stage and works backward to compute optimal values and decisions for each prior stage. In contrast, infinite horizon problems extend indefinitely, often requiring discounting of future rewards to ensure convergence or focusing on average performance metrics, which introduces stationarity in policies across stages. This distinction is crucial for applications in economics and operations research, where the horizon reflects the problem's duration and objectives.

Bellman's Principle of Optimality

Bellman's Principle of Optimality states that an optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. This principle, introduced by Richard Bellman in his foundational work on dynamic programming, provides the theoretical justification for decomposing complex optimization problems into simpler, recursive subproblems. To illustrate informally, consider the problem of finding the shortest path in a weighted graph from a starting node A to an ending node Z, where edge weights represent distances. If the overall shortest path from A to Z passes through an intermediate node B, then the subpath from B to Z must itself be the shortest possible path between B and Z; otherwise, replacing that subpath with a shorter one would yield a shorter overall path from A to Z, contradicting the optimality of the original path. This analogy highlights how the principle ensures that optimal solutions to larger problems inherit optimality from their subcomponents, without needing to reevaluate the entire structure from scratch. A proof sketch by contradiction proceeds as follows: Suppose there exists an optimal policy \pi^* for the full problem starting from initial state s_0, but for some initial decision leading to state s_1, the remaining policy \pi^*_{tail} from s_1 is not optimal for the subproblem starting at s_1. Then, there must exist another policy \tilde{\pi} that is optimal from s_1 and yields a strictly better value than \pi^*_{tail}. Constructing a new policy \tilde{\pi}^* that follows \pi^* up to s_1 and then switches to \tilde{\pi} would then outperform \pi^* for the original problem, contradicting the assumption that \pi^* is optimal. Thus, the tail of any optimal policy must itself be optimal for the resulting subproblem. The implications of this principle are profound for recursive methods: it guarantees that subproblems can be solved optimally and independently, enabling the reuse of solutions across overlapping instances—a key distinction from divide-and-conquer approaches, which typically do not revisit solved subproblems. By focusing on state-based decomposition, the principle underpins the efficiency of dynamic programming algorithms, where the value of a state depends only on the optimal values of subsequent states.

Historical Development

Richard Bellman developed the foundational ideas of dynamic programming, including what became known as the Bellman equation, while working at the RAND Corporation starting in 1949. His research was motivated by military problems during the early Cold War era, such as multistage decision problems in logistics and planning under uncertainty, which required efficient methods to handle complex sequential choices. To obtain funding for this work amid skepticism toward pure mathematical research at the time, Bellman invented the term "dynamic programming" around 1952, selecting "programming" for its association with practical planning and "dynamic" to evoke mathematical sophistication and motion. The Bellman equation emerged as a core tool within this framework, systematically formalized in his influential 1957 book Dynamic Programming, which provided a recursive approach to optimality based on Bellman's Principle of Optimality. Early applications in the 1950s demonstrated its utility, including solutions to problems in communication networks and inventory management, where it optimized paths and stock replenishment policies over multiple stages. Extensions to stochastic settings appeared in Bellman's work during the mid-1950s, incorporating probabilistic transitions to address real-world uncertainties in decision processes, as detailed in his publications and the 1957 book. By the 1960s, the framework connected to optimal control theory, culminating in the Hamilton-Jacobi-Bellman equation, which was explicitly formulated by Rudolf Kalman in the early part of the decade to link dynamic programming with differential equations for continuous systems. Following a period of relative dormancy, the Bellman equation saw renewed interest post-2010 in artificial intelligence, particularly through reinforcement learning, where advances in computational power and neural networks enabled scalable implementations for large-scale problems, though the core theory remained unchanged.

Mathematical Formulation

Deterministic Case

In the deterministic case, the Bellman equation describes optimal decision-making in environments where the transition from one state to the next is completely determined by the current state and chosen action, without any probabilistic uncertainty. This formulation assumes a discrete state space S and action space A, a reward function r: S \times A \to \mathbb{R} providing immediate payoff, and a deterministic transition function f: S \times A \to S specifying the next state. The equation expresses the optimal value function, which represents the maximum achievable total reward starting from a given state. For finite-horizon problems with time steps t = 0, 1, \dots, T, the Bellman equation takes the recursive form V_t(s) = \max_{a \in A} \left[ r(s, a) + V_{t+1}(f(s, a)) \right], with the terminal condition V_{T+1}(s) = 0 for all s \in S, or more generally incorporating a terminal reward if applicable. This equation allows backward computation of the value function from the end of the horizon, capturing the tradeoff between immediate rewards and future consequences under deterministic dynamics. In the infinite-horizon undiscounted setting, the equation simplifies to V(s) = \max_{a \in A} \left[ r(s, a) + V(f(s, a)) \right] for all s \in S, but solutions exist and converge only under specific conditions, such as acyclic state graphs or the use of proper policies that eventually reach absorbing states with zero future rewards. Without such assumptions, the equation may not yield a unique finite solution due to potential cycles leading to unbounded rewards. To address convergence issues in infinite-horizon problems, a discount factor \gamma \in [0, 1) is introduced, yielding the form V(s) = \max_{a \in A} \left[ r(s, a) + \gamma V(f(s, a)) \right]. This discounted version makes the associated Bellman operator a contraction mapping, guaranteeing a unique bounded solution that prioritizes nearer-term rewards while still accounting for long-term outcomes. The interpretation across all variants is that the optimal value of a state equals the best immediate reward plus the (discounted) optimal value of the deterministically resulting next state, embodying Bellman's principle of optimality for multistage deterministic processes.
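The finite-horizon recursion lends itself directly to computation by backward induction over a table of values. The following Python sketch assumes a small finite state set, a deterministic transition function f(s, a) whose outputs stay inside that set, and a reward function r(s, a); the function and argument names are illustrative rather than standard:
def backward_induction(states, actions, r, f, T):
    # Solve V_t(s) = max_a [ r(s, a) + V_{t+1}(f(s, a)) ] with terminal condition V_{T+1}(s) = 0.
    V = {(T + 1, s): 0.0 for s in states}
    policy = {}
    for t in range(T, -1, -1):                  # t = T, T-1, ..., 0
        for s in states:
            candidates = [(r(s, a) + V[(t + 1, f(s, a))], a) for a in actions(s)]
            best_value, best_action = max(candidates, key=lambda va: va[0])
            V[(t, s)] = best_value
            policy[(t, s)] = best_action
    return V, policy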

Stochastic Case

In stochastic environments, the Bellman equation accounts for uncertainty in state transitions, typically within the framework of Markov decision processes (MDPs), where the next state is drawn from a probability distribution conditioned on the current state and action. This formulation extends the deterministic case by replacing fixed outcomes with expectations over possible next states, enabling the modeling of real-world systems like inventory management or robotics navigation under noise. Key assumptions underlying the stochastic Bellman equation include the Markov property, which posits that the distribution over future states depends solely on the current state and action, independent of prior history; and stationarity, meaning transition probabilities and rewards remain constant over time. These assumptions simplify analysis while capturing essential dynamics in sequential decision problems. For the infinite-horizon discounted case in an MDP, the optimal value function V(s), representing the maximum expected discounted return from state s, satisfies the Bellman optimality equation: V(s) = \max_a \left[ r(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right], where r(s,a) is the immediate reward for taking action a in state s, \gamma \in [0,1) is the discount factor prioritizing near-term rewards, and P(s'|s,a) is the transition probability to next state s'. This equation embodies Bellman's principle of optimality, decomposing the value into immediate reward plus the discounted expected future value. In the finite-horizon setting, the value function is time-dependent, V_t(s), and evolves backward from the terminal time T, where V_T(s) typically equals zero or a terminal reward: V_t(s) = \max_a \left[ r(s,a) + \gamma \sum_{s'} P(s'|s,a) V_{t+1}(s') \right], for t = 0, 1, \dots, T-1. This recursive structure allows computation via dynamic programming, starting from the end and propagating optimal values upstream. An equivalent formulation uses the action-value function Q(s,a), which gives the maximum expected return starting from state s, taking action a, and following the optimal policy thereafter: Q(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q(s',a'). The optimal policy can then be derived as \pi(s) = \arg\max_a Q(s,a), and the value function relates via V(s) = \max_a Q(s,a). This Q-form is particularly useful in algorithms like Q-learning, as it decouples action selection from state evaluation. The summation over s' interprets the future value as an expectation, \mathbb{E}_{s' \sim P(\cdot|s,a)} [V(s')], highlighting how the equation balances immediate gains against probabilistic long-term outcomes under discounting. The deterministic case emerges as a special instance when P(s'|s,a) concentrates probability 1 on a single deterministic next state.
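In the tabular case the stochastic backup can be written as a single vectorized operation. The sketch below applies one step of the Bellman optimality operator, assuming the model is stored as NumPy arrays P[a, s, s'] (transition probabilities) and R[s, a] (expected rewards); this array layout is an assumption of the example, not a fixed convention:
import numpy as np

def bellman_optimality_backup(V, P, R, gamma):
    # Q[s, a] = r(s, a) + gamma * sum_{s'} P(s'|s, a) V(s'); (P @ V)[a, s] is the expectation term.
    Q = R + gamma * (P @ V).T
    return Q.max(axis=1), Q.argmax(axis=1)      # updated values (TV)(s) and a greedy policy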

Continuous-State Extensions

The Bellman equation extends to continuous state and action spaces in discrete time by formulating the value function V(x) as a solution to the functional equation V(x) = \max_u \left[ r(x,u) + \gamma \int V(x') p(dx'|x,u) \right], where x \in \mathbb{R}^n is the state, u \in \mathbb{R}^m is the control, r(x,u) is the reward function, \gamma \in (0,1) is the discount factor, and p(dx'|x,u) is the transition probability kernel. This generalization preserves the optimality principle but requires solving an operator equation over infinite-dimensional function spaces, contrasting with finite-state discretizations. In continuous-time settings, the Bellman equation evolves into the Hamilton-Jacobi-Bellman (HJB) partial differential equation (PDE), \frac{\partial V}{\partial t}(t,x) + \min_u \left[ l(x,u) + f(x,u) \cdot \nabla V(t,x) \right] = 0, with terminal boundary condition V(T,x) = g(x) at final time T, where V(t,x) is the value function, f(x,u) is the state dynamics, l(x,u) is the running cost, and the minimization is over admissible controls u. The HJB equation arises in deterministic optimal control problems and links directly to classical Hamilton-Jacobi theory in mechanics, providing necessary and sufficient conditions for optimality under smoothness assumptions. Solving these continuous extensions faces the curse of dimensionality, where computational cost grows exponentially with state dimension due to the need to approximate high-dimensional integrals or PDEs over \mathbb{R}^n. This often necessitates approximations such as grid-based discretization, which reduces the problem to a finite-state case but introduces errors scaling with grid resolution. Existence and uniqueness of solutions to the HJB equation hold under assumptions like Lipschitz continuity of the dynamics f and cost l with respect to state and action, ensuring the Hamiltonian is continuous and coercive. Viscosity solutions provide a generalized solution concept that circumvents classical differentiability requirements, guaranteeing uniqueness in bounded domains via comparison principles. These continuous formulations connect to the calculus of variations through variational principles for optimal paths and to PDE theory in control, underpinning derivations of necessary conditions like Pontryagin's maximum principle.
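Under smoothness assumptions on V, the passage from the discrete-time recursion to the HJB equation can be sketched by applying the principle of optimality over a short interval \Delta t and Taylor-expanding the value function:
V(t, x) = \min_u \left[ l(x, u)\,\Delta t + V\big(t + \Delta t,\; x + f(x, u)\,\Delta t\big) \right] + o(\Delta t) = \min_u \left[ l(x, u)\,\Delta t + V(t, x) + \frac{\partial V}{\partial t}(t, x)\,\Delta t + \nabla_x V(t, x) \cdot f(x, u)\,\Delta t \right] + o(\Delta t).
Subtracting V(t, x) from both sides, dividing by \Delta t, and letting \Delta t \to 0 recovers the deterministic HJB equation above; in the stochastic case, the second-order term of the Taylor (Itô) expansion contributes the additional trace term.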

Derivation and Proofs

From Optimality Principle

In the context of dynamic programming, consider a deterministic finite-horizon decision problem spanning N stages, indexed by t = 0, 1, \dots, N-1. At each stage t, the system occupies a state s_t \in S, where S is the state space, and the decision maker selects an action a_t \in A(s_t), with A(s_t) denoting the feasible action set for state s_t. This choice yields an immediate reward r(s_t, a_t) and deterministically transitions the system to the next state s_{t+1} = f(s_t, a_t), where f: S \times A \to S is the transition function. At the terminal stage t = N, a final reward g(s_N) is received, and no further actions are taken. The objective is to choose the sequence of actions to maximize the total accumulated reward \sum_{t=0}^{N-1} r(s_t, a_t) + g(s_N). Bellman's principle of optimality asserts that an optimal policy possesses the property that, regardless of the initial state and initial action, the remaining decisions form an optimal policy for the subprocess starting from the resulting state. Applying this principle, the optimal value function V_t(s_t), which represents the maximum total reward attainable from stage t to the end when starting in state s_t, must satisfy a recursive relationship. Specifically, the optimal value from s_t is obtained by selecting the action a_t that maximizes the sum of the immediate reward and the optimal value from the subsequent state s_{t+1}. This yields the Bellman optimality equation for the finite-horizon case: V_t(s_t) = \max_{a_t \in A(s_t)} \left[ r(s_t, a_t) + V_{t+1} \left( f(s_t, a_t) \right) \right], for t = 0, 1, \dots, N-1, subject to the terminal condition V_N(s_N) = g(s_N) for all s_N \in S. The values of V_t are computed through backward induction: initialize V_N(s) = g(s) for all s \in S, then iteratively apply the recursion from t = N-1 down to t = 0, substituting the already computed V_{t+1} at each step. This process systematically decomposes the original problem into a sequence of single-stage maximizations, leveraging the optimality principle to ensure each subproblem's solution contributes to the global optimum. For the infinite-horizon discounted variant, where the horizon extends indefinitely without a terminal reward and future rewards are discounted by a factor \gamma \in (0,1) to ensure the sum converges, the total reward becomes \sum_{t=0}^\infty \gamma^t r(s_t, a_t). Assuming a stationary setting where the state space, action sets, rewards, and transitions do not depend on time, the optimal value function V(s) satisfies the time-invariant Bellman optimality equation: V(s) = \max_{a \in A(s)} \left[ r(s, a) + \gamma V \left( f(s, a) \right) \right], \quad \forall s \in S. This equation captures the optimal value as the fixed point of the Bellman operator T, defined by (TV)(s) = \max_a [r(s, a) + \gamma V(f(s, a))]. To prove uniqueness of the solution, observe that T is a contraction mapping on the space of bounded real-valued functions on S equipped with the supremum norm \|\cdot\|_\infty. For any two bounded functions V, W: S \to \mathbb{R}, \|TV - TW\|_\infty = \sup_{s \in S} \left| \max_a [r(s,a) + \gamma V(f(s,a))] - \max_a [r(s,a) + \gamma W(f(s,a))] \right| \leq \gamma \|V - W\|_\infty, since the difference of two maxima is bounded by the maximum difference of their arguments, assuming bounded rewards |r(s,a)| \leq R < \infty for all s, a. By the Banach fixed-point theorem, T has a unique fixed point V^*, which solves the Bellman equation and equals the optimal value function. 
Furthermore, the contraction property guarantees convergence of value iteration: starting from any bounded initial function V^0, the sequence V^{k+1} = T V^k converges to V^* at a linear rate, with \|V^k - V^*\|_\infty \leq \frac{\gamma^k}{1 - \gamma} \|V^0 - V^*\|_\infty. This establishes that the infinite-horizon Bellman equation admits a unique solution, derivable from the finite-horizon recursion in the limit as N \to \infty.
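The geometric rate quoted here follows by iterating the contraction inequality, a standard step worth making explicit:
\|V^k - V^*\|_\infty = \|T V^{k-1} - T V^*\|_\infty \leq \gamma \|V^{k-1} - V^*\|_\infty \leq \cdots \leq \gamma^k \|V^0 - V^*\|_\infty,
and the stated bound with the additional factor 1/(1 - \gamma) follows immediately since 1/(1 - \gamma) \geq 1 for \gamma \in (0,1).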

In Markov Decision Processes

A Markov decision process (MDP) is formally defined as a tuple (S, A, P, r, \gamma), where S is the finite set of states, A is the finite set of actions, P(s'|s,a) denotes the probability of transitioning to state s' \in S from state s \in S under action a \in A, r(s,a) is the expected immediate reward for taking action a in state s, and \gamma \in [0,1) is the discount factor that weights future rewards. The optimal value function V^*(s) in an infinite-horizon discounted MDP represents the maximum expected total discounted reward obtainable starting from state s under an optimal policy \pi^*, given by V^*(s) = \max_{\pi} \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\middle|\; s_0 = s, \, a_t = \pi(s_t) \right]. This formulation extends the deterministic principle of optimality to stochastic environments by accounting for probabilistic transitions. To derive the Bellman optimality equation, condition the expectation on the initial action a and the resulting next state s'. The optimal value satisfies V^*(s) = \max_a \left[ r(s,a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)} [V^*(s')] \right] = \max_a \left[ r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^*(s') \right], which recursively decomposes the long-term value into the immediate reward plus the discounted expected future value. This equation is characterized through the Bellman operator T, defined for any value function V: S \to \mathbb{R} by (TV)(s) = \max_a \left[ r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V(s') \right]. The operator T possesses key properties that ensure the existence and uniqueness of the solution: it is monotonic, meaning if V \leq W pointwise for all s \in S, then TV \leq TW pointwise; and it is a contraction mapping with modulus \gamma in the supremum norm, satisfying \|TV - TW\|_\infty \leq \gamma \|V - W\|_\infty. By the Banach fixed-point theorem applied to the complete metric space of bounded value functions under the supremum norm, there exists a unique fixed point V^* such that V^* = T V^*, and this V^* coincides with the optimal value function satisfying the Bellman optimality equation. Moreover, for any initial value function V^{(0)}, the sequence V^{(k+1)} = T V^{(k)} converges to V^* at a geometric rate bounded by \gamma^k.
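These operator properties can be checked numerically on a small randomly generated MDP. The Python sketch below assumes the model is stored as arrays P[a, s, s'] and R[s, a] (an illustrative layout, not a standard convention) and verifies the contraction inequality for one pair of value functions:
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)                  # normalize rows into probability vectors
R = rng.random((n_states, n_actions))

def bellman_operator(V):
    # (TV)(s) = max_a [ r(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]
    return (R + gamma * (P @ V).T).max(axis=1)

V = rng.normal(size=n_states)
W = rng.normal(size=n_states)
lhs = np.max(np.abs(bellman_operator(V) - bellman_operator(W)))   # ||TV - TW||_inf
rhs = gamma * np.max(np.abs(V - W))                               # gamma * ||V - W||_inf
print(lhs <= rhs + 1e-12)                                         # prints True: contraction holds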

Solution Techniques

Value Iteration

Value iteration is an iterative algorithm designed to solve the Bellman optimality equation by computing successive approximations of the optimal value function V^* in Markov decision processes. It applies the Bellman optimality operator T repeatedly, starting from an initial value function V_0(s) = 0 for all states s, to generate the sequence V_{k+1} = T V_k until the supremum norm of the change \|V_{k+1} - V_k\|_\infty < \epsilon for a specified tolerance \epsilon > 0. This method is particularly suitable for infinite-horizon discounted problems with discount factor 0 \leq \gamma < 1, as the contraction property of T ensures reliable convergence to V^*. The Bellman optimality operator is given by (TV)(s) = \max_a \mathbb{E}[r(s, a, s') + \gamma V(s') \mid s, a], where the expectation is over the next state s' and reward r according to the transition probabilities. The core iteration updates the value of each state by selecting the action that maximizes the expected discounted future reward based on the current value estimates. The standard synchronous implementation of value iteration uses values from the previous full iteration to update all states simultaneously. The following pseudocode illustrates this process:
Initialize V(s) ← 0 for all s ∈ S

repeat
    Δ ← 0
    V_old ← V                      (copy of the previous iteration's values)
    for each s ∈ S
        V(s) ← max_a ∑_{s', r} p(s', r | s, a) [r + γ V_old(s')]
        Δ ← max(Δ, |V(s) - V_old(s)|)
until Δ < ε
return V
In this synchronous approach, updates for all states rely exclusively on the value function from the prior iteration, ensuring straightforward analysis of convergence. Asynchronous value iteration, by contrast, updates states sequentially and in place, incorporating newly computed values immediately for subsequent updates within the same sweep; this can lead to faster practical convergence but still guarantees convergence to V^* under the same conditions, provided every state is updated infinitely often. Convergence of value iteration follows from the fact that T is a contraction mapping with modulus \gamma < 1 in the supremum norm, implying a geometric rate of convergence to the unique fixed point V^*. Specifically, the error at each state after k iterations satisfies |V_k(s) - V^*(s)| \leq \frac{\gamma^k}{1 - \gamma} \max_s |V_0(s) - V^*(s)| for all s, providing a quantifiable bound on the approximation quality. With the common initialization V_0(s) = 0, this bound simplifies further since \max_s |V^*(s)| \leq R_{\max}/(1 - \gamma), where R_{\max} is the maximum reward magnitude. Each iteration of value iteration requires O(|S|^2 |A|) time in the worst case, since each backup sums over the possible successor states for every state-action pair. The total number of iterations to achieve an \epsilon-accurate approximation is up to O(\log(1/\epsilon)), though it scales with 1/(1 - \gamma) in precise analyses due to the error bound's dependence on the discount factor. This makes value iteration computationally efficient for problems with moderate state and action spaces, especially when \gamma is not too close to 1.
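A runnable counterpart to the pseudocode, written in Python against an explicit tabular model, might look as follows; the array layout P[a, s, s'] for transition probabilities and R[s, a] for expected rewards is an assumption of this sketch rather than a fixed convention:
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-8):
    # Synchronous value iteration: repeatedly apply the Bellman optimality operator.
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V).T              # Q[s, a] = r(s,a) + gamma * E[V(s') | s, a]
        V_new = Q.max(axis=1)                  # greedy backup over actions
        if np.max(np.abs(V_new - V)) < eps:    # stop when the sup-norm change is below eps
            return V_new, Q.argmax(axis=1)     # value estimate and a greedy policy
        V = V_new
Because each sweep applies the contraction T once, stopping when the change falls below eps leaves the returned estimate within roughly \gamma \epsilon / (1 - \gamma) of V^* in the supremum norm.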

Policy Iteration

Policy iteration is a dynamic programming algorithm for solving Markov decision processes (MDPs) by iteratively improving a policy through alternating steps of policy evaluation and policy improvement. Introduced by Ronald Howard in 1960, it begins with an arbitrary initial policy \pi_0 and generates a sequence of policies \pi_k that monotonically improves until convergence to the optimal policy \pi^*. Unlike methods that directly approximate value functions, policy iteration explicitly maintains and refines a policy at each iteration, leveraging the policy improvement theorem to ensure progress toward optimality. The algorithm proceeds in two main phases per iteration. First, policy evaluation computes the value function V^{\pi_k} for the current policy \pi_k by solving the linear system V^{\pi_k}(s) = \sum_{a} \pi_k(a|s) \left[ r(s,a) + \gamma \sum_{s'} P(s'|s,a) V^{\pi_k}(s') \right] for all states s, where r(s,a) is the reward, \gamma \in [0,1) is the discount factor, and P(s'|s,a) is the transition probability. This can be solved exactly using linear equation solvers for finite MDPs or approximated iteratively, such as by a truncated iterative policy evaluation that runs for a fixed number of steps. Second, policy improvement updates the policy to \pi_{k+1} by greedily selecting, for each state s, \pi_{k+1}(s) = \arg\max_a \left[ r(s,a) + \gamma \sum_{s'} P(s'|s,a) V^{\pi_k}(s') \right]. These steps repeat until \pi_{k+1} = \pi_k, at which point the policy is optimal. The improvement step guarantees that V^{\pi_{k+1}} \geq V^{\pi_k} for all states, with strict inequality for at least one state unless the policy is already optimal. In finite MDPs with finite state and action spaces, policy iteration converges to an optimal policy in a finite number of iterations, bounded by the number of distinct deterministic policies, due to the monotonic improvement and the absence of cycles in the policy sequence. This finite-step convergence contrasts with value iteration, which approximates the value function iteratively but may require more total computations despite simpler per-step updates; policy iteration typically requires fewer iterations overall, though each is more computationally intensive owing to the full policy evaluation. Upon convergence, the final policy \pi^* is optimal, and it can also be extracted from the optimal value function V^* via the same greedy improvement formula, \pi^*(s) = \arg\max_a [r(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s')], confirming consistency with the Bellman optimality equation.
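A compact tabular implementation of the two phases is sketched below, with exact policy evaluation done by a linear solve; the array layout P[a, s, s'] and R[s, a] is again an assumption of the example:
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)               # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi for the current policy
        P_pi = P[policy, np.arange(n_states), :]
        r_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V
        Q = R + gamma * (P @ V).T
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):            # unchanged greedy policy is optimal
            return V, policy
        policy = new_policy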

Other Numerical Methods

The Bellman equation can also be solved through linear programming (LP) formulations that recast the optimality conditions as a linear optimization problem. A standard dual LP for the discounted case minimizes the expected value function under an initial state distribution μ, subject to the Bellman optimality inequalities holding for all actions: \min_V \sum_s \mu(s) V(s) \text{subject to} \quad V(s) \geq r(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \quad \forall s,a. The optimal solution V^* achieves equality in the constraints for at least one optimal action per state, yielding the solution to the Bellman optimality equation. This approach, introduced in early work on sequential decision models, provides a global optimization perspective complementary to iterative methods. The corresponding primal LP optimizes over discounted occupancy measures x(s,a), which represent the expected discounted frequency of state-action pairs under an optimal policy: \max_x \sum_{s,a} x(s,a) r(s,a) \text{subject to} \quad \sum_a x(s,a) - \gamma \sum_{s',a} P(s|s',a) x(s',a) = \mu(s) \quad \forall s, \quad x(s,a) \geq 0 \ \forall s,a. Strong duality ensures the primal and dual yield the same optimal value, with the optimal policy recoverable from the dual solution via argmax over actions in the tight constraints or from the primal via normalization of x(s,a). Primal-dual formulations enhance scalability for large MDPs by leveraging the saddle-point structure of the LP, enabling efficient algorithms that avoid explicit enumeration of all constraints. For instance, stochastic primal-dual methods approximate the saddle point through online updates on sampled trajectories, achieving convergence in time polynomial in the problem size. These approaches are particularly useful for high-dimensional problems where full LP solvers are computationally prohibitive. For problems with large or continuous state spaces where exact LP is infeasible, approximation methods parameterize the value function to reduce dimensionality. Linear function approximation posits V(s) \approx \phi(s)^T \theta, where \phi(s) is a feature vector and \theta a parameter vector; the optimal \theta solves the projected Bellman equation \Pi T V = V, with \Pi the orthogonal projection onto the span of \{\phi(s)\} under a sampling distribution. This fixed point minimizes the mean-squared projected Bellman error and exists when the projected Bellman operator remains a contraction. Temporal-difference (TD) learning offers a model-free numerical method to approximate solutions to the (projected) Bellman equation using sampled transitions, without requiring the full transition model. In its basic form, TD updates the value estimate via \Delta V(s) = \alpha [r + \gamma V(s') - V(s)], iteratively approaching the fixed point of the Bellman operator or its projection in function approximation settings. This stochastic approximation ties directly to the equation by bootstrapping future estimates, enabling practical computation in unknown environments. To handle large or continuous state spaces, discretization maps continuous variables to a finite grid, allowing application of exact methods like LP at the cost of approximation error that grows with grid resolution. For high-dimensional cases, neural networks serve as nonlinear approximators for the value function, representing V(s) through layered compositions that capture complex dependencies while mitigating the curse of dimensionality.
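For small tabular MDPs, the dual LP above can be handed directly to a generic solver. The sketch below uses scipy.optimize.linprog with the same assumed array layout P[a, s, s'] and R[s, a]; the weights mu should be strictly positive:
import numpy as np
from scipy.optimize import linprog

def solve_bellman_lp(P, R, gamma, mu):
    # min_V mu^T V  subject to  V(s) >= r(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')  for all s, a
    n_actions, n_states, _ = P.shape
    A_ub = np.zeros((n_states * n_actions, n_states))
    b_ub = np.zeros(n_states * n_actions)
    for s in range(n_states):
        for a in range(n_actions):
            row = gamma * P[a, s, :].copy()
            row[s] -= 1.0                      # constraint rewritten as (gamma P - e_s) . V <= -r(s,a)
            A_ub[s * n_actions + a] = row
            b_ub[s * n_actions + a] = -R[s, a]
    res = linprog(c=mu, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n_states, method="highs")
    return res.x                               # approximates the optimal value function V*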

Applications

In Optimal Control

In optimal control, the Bellman equation manifests as the Hamilton-Jacobi-Bellman (HJB) equation, which characterizes the value function for continuous-time control problems. Consider a deterministic optimal control problem where the goal is to minimize the cost functional J(u) = \int_{t_0}^{T} l(x(t), u(t)) \, dt + g(x(T)), subject to the dynamics \dot{x}(t) = f(x(t), u(t)), with initial state x(t_0) = x_0. The value function V(t, x), defined as the infimum of J(u) over admissible controls, satisfies the nonlinear partial differential equation known as the HJB equation: \frac{\partial V}{\partial t}(t, x) + \min_{u} \left[ l(x, u) + \nabla_x V(t, x) \cdot f(x, u) \right] = 0, with terminal condition V(T, x) = g(x). This equation arises from applying Bellman's principle of optimality to infinitesimal time steps, ensuring that the optimal cost-to-go remains optimal along every subpath. For stochastic optimal control problems, the dynamics incorporate noise via the stochastic differential equation dX(t) = f(X(t), u(t)) \, dt + \sigma(X(t), u(t)) \, dW(t), where W(t) is a standard Wiener process, and the objective is to minimize the expected cost \mathbb{E} \left[ \int_{t_0}^{T} l(X(t), u(t)) \, dt + g(X(T)) \right]. The corresponding value function V(t, x) obeys the stochastic HJB equation: \frac{\partial V}{\partial t}(t, x) + \min_{u} \left[ l(x, u) + f(x, u) \cdot \nabla_x V(t, x) + \frac{1}{2} \operatorname{tr} \left( \sigma(x, u) \sigma(x, u)^T \operatorname{Hess}_x V(t, x) \right) \right] = 0, again with terminal condition V(T, x) = g(x). The trace term accounts for the diffusion induced by the noise, transforming the HJB into a fully nonlinear parabolic PDE. This formulation extends the deterministic case by incorporating the second-order terms of Itô's formula in the derivation from the dynamic programming principle. A prominent example is the linear quadratic regulator (LQR), where the dynamics are linear \dot{x} = A x + B u, the running cost is quadratic l(x, u) = x^T Q x + u^T R u, and the terminal cost is g(x) = x^T P_f x. Substituting into the HJB equation yields a specific solution where V(t, x) = x^T P(t) x, and P(t) satisfies the differential Riccati equation \dot{P} = -P A - A^T P + P B R^{-1} B^T P - Q, with P(T) = P_f. The optimal control is then u^*(t, x) = -R^{-1} B^T P(t) x, providing linear feedback. This special case admits closed-form solutions via the Riccati equation, highlighting the tractability of quadratic problems in engineering applications like spacecraft attitude control. The verification theorem ensures that solutions to the HJB equation yield optimal controls: if V is a sufficiently smooth solution to the HJB with the correct boundary condition, then V(t_0, x_0) equals the optimal cost, and the control u^*(t, x) = \arg\min_u \left[ l(x, u) + f(x, u) \cdot \nabla V(t, x) + \frac{1}{2} \operatorname{tr} \left( \sigma \sigma^T \operatorname{Hess} V \right) \right] (in the stochastic case) is optimal for the original problem. This theorem relies on applying Itô's formula (or the chain rule in deterministic settings) to show that V along any trajectory bounds the cost from below, with equality under the minimizing control. Unlike discrete-time Bellman equations, which guarantee existence via contraction mappings in finite spaces, continuous-time HJB equations may lack classical C^2 solutions due to nonlinearity; existence and uniqueness are established using viscosity solutions, which provide a weak formulation robust to discontinuities.
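As a concrete illustration of the LQR case, the differential Riccati equation can be integrated backward from its terminal condition with a simple explicit Euler scheme; the matrices below are illustrative example values, and a dedicated ODE or algebraic Riccati solver would normally be preferred:
import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])          # example dynamics (double integrator)
B = np.array([[0.0], [1.0]])
Q = np.eye(2)                                    # running state cost
Rm = np.array([[1.0]])                           # running control cost
P = np.eye(2)                                    # terminal condition P(T) = P_f
T_final, dt = 5.0, 1e-3

for _ in range(int(T_final / dt)):
    # dP/dt = -P A - A^T P + P B R^{-1} B^T P - Q; step backward from t to t - dt
    Pdot = -P @ A - A.T @ P + P @ B @ np.linalg.inv(Rm) @ B.T @ P - Q
    P = P - dt * Pdot
K = np.linalg.inv(Rm) @ B.T @ P                  # feedback gain at t = 0, so u*(x) = -K x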

In Economics and Operations Research

The Bellman equation has been foundational in operations research since its inception, with Richard Bellman applying dynamic programming principles to logistics problems in the 1950s, such as optimizing multistage decision processes in supply chains and resource allocation under uncertainty. These early applications at the RAND Corporation addressed military and industrial planning challenges, demonstrating how the equation decomposes complex sequential decisions into recursive subproblems. In economic growth models, the discrete-time Bellman equation underpins the Ramsey-Cass-Koopmans framework, where the value function V(k) for capital stock k is defined as V(k) = \max_c \left[ u(c) + \beta V(f(k) - c) \right], with c denoting consumption, u(\cdot) the utility function, \beta the discount factor, and f(\cdot) the production function. This formulation captures intertemporal optimization by balancing current consumption against future capital accumulation, yielding Euler equations that describe optimal savings paths in neoclassical growth theory. Inventory management in operations research often employs the Bellman equation for stochastic demand scenarios, as in the backorder model where the value function V(s) for inventory level s satisfies V(s) = \max_q \left[ -h s - b \mathbb{E}[(d - s - q)^+] + \gamma \mathbb{E} V(s + q - d) \right], with q the order quantity, d the random demand, h the holding cost, b the backorder cost, and \gamma the discount factor. This recursive structure enables computation of optimal ordering policies that minimize expected costs over time, accounting for uncertainties in supply and demand. In auction theory, dynamic programming via the Bellman equation models bidder strategies in repeated or multi-stage auctions, where agents solve for value functions that incorporate future bidding opportunities and information updates. Similarly, in contract design, principal-agent problems use the equation to derive incentive-compatible mechanisms, with the principal's value function recursively optimizing payments and actions under moral hazard or adverse selection. These applications highlight the equation's role in ensuring time-consistent equilibria in economic interactions. In finance, the Bellman equation and its continuous-time HJB extension are applied to portfolio optimization, where investors maximize expected utility of wealth over time under stochastic asset returns, and to option pricing via stochastic control formulations that solve for value functions representing arbitrage-free prices. Reinforcement learning variants further enable algorithmic trading strategies that learn optimal buy/sell policies from market data. High-dimensional economic models pose significant computational challenges when solving Bellman equations, primarily due to the curse of dimensionality, where the state space grows exponentially with variables like multiple capital types or heterogeneous agents, rendering standard value iteration infeasible. Techniques such as projection methods or sparse grids have been developed to mitigate this, but global solutions remain demanding for models beyond four or five dimensions.
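The Ramsey-type Bellman equation above can be solved numerically by value function iteration on a capital grid. The following sketch assumes log utility, a Cobb-Douglas production function f(k) = k^\alpha with full depreciation, and illustrative parameter values; it is a minimal discretized approximation rather than a calibrated model:
import numpy as np

alpha, beta = 0.3, 0.95                          # production curvature and discount factor
k_grid = np.linspace(0.05, 0.5, 200)             # grid of capital stocks
V = np.zeros_like(k_grid)

for _ in range(2000):                            # Bellman iterations
    V_new = np.empty_like(V)
    for i, k in enumerate(k_grid):
        c = k**alpha - k_grid                    # consumption implied by each next-period capital choice
        values = np.where(c > 0, np.log(np.maximum(c, 1e-12)) + beta * V, -np.inf)
        V_new[i] = values.max()                  # V(k) = max_c [ u(c) + beta V(f(k) - c) ]
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new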

In Reinforcement Learning and AI

In reinforcement learning (RL), the Bellman equation underpins model-free approaches that approximate its solution by learning value functions or action-value functions directly from sampled interactions with the environment, without requiring a complete model of the transition dynamics or reward function. This paradigm enables agents to solve sequential decision problems in partially observable or unknown settings, where classical dynamic programming methods like value iteration become infeasible due to the curse of dimensionality. Seminal works frame RL as a form of approximate dynamic programming, emphasizing temporal-difference updates to iteratively refine estimates toward the Bellman optimality equation through bootstrapping from experience tuples (state, action, reward, next state). A cornerstone algorithm is Q-learning, which directly approximates the optimal action-value function Q^*(s, a) satisfying the Bellman optimality equation Q^*(s, a) = \mathbb{E}[r + \gamma \max_{a'} Q^*(s', a') \mid s, a]. The update rule performs stochastic approximation to this fixed point: Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] where \alpha is the learning rate, r the observed reward, \gamma the discount factor, and s' the next state. This off-policy method converges to the optimal Q-function under suitable conditions, such as infinite exploration and decreasing step sizes, allowing agents to learn optimal behavior solely from trial-and-error. Actor-critic methods extend this framework by combining policy-based (actor) and value-based (critic) components, where the critic estimates the value function via Bellman backups to provide a baseline for reducing variance in policy gradient updates. The actor parameterizes the policy \pi_\theta(a \mid s), updated via gradients like \nabla_\theta J(\theta) \approx \mathbb{E} [\nabla_\theta \log \pi_\theta(a \mid s) (Q(s, a) - b(s))], with the Bellman-derived Q-function serving as an advantage estimate over a baseline b(s). This hybrid approach, rooted in early policy iteration approximations, enables efficient learning in continuous action spaces and stochastic policies. Deep reinforcement learning scales these ideas to high-dimensional environments by representing Q-functions with neural networks, as in the Deep Q-Network (DQN), which approximately solves the Bellman optimality equation in pixel-based Atari games using convolutional networks for state encoding and experience replay to stabilize updates. DQN's target network, updated periodically, mitigates divergence in deep Bellman backups, achieving human-level performance on 49 games by approximating optimal Q-values from raw visual input. Compared to classical dynamic programming, RL's sample-based approximations offer scalability to vast, unknown state-action spaces—handling millions of dimensions via function approximation—while avoiding exhaustive model enumeration. However, challenges include training instability from non-stationary targets and high sample complexity in sparse-reward settings. The post-2015 RL boom, exemplified by AlphaGo, leveraged Monte Carlo tree search with neural value networks trained on Bellman residuals to master Go, demonstrating superhuman planning in combinatorial domains through iterative policy and value improvements. Recent advances as of 2025 include deep neural networks that provably solve certain high-dimensional Bellman equations without the curse of dimensionality, enabling applications in complex simulations, and memristive hardware solvers that accelerate decision-making in resource-constrained environments.
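A tabular version of the Q-learning update is sketched below; env_step is an assumed stand-in for the environment's sampling interface, returning (next_state, reward, done), and the hyperparameters are illustrative:
import numpy as np

def q_learning(env_step, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.99, epsilon=0.1, max_steps=200, start_state=0):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            # epsilon-greedy exploration over the current Q estimates
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r, done = env_step(s, a)
            # temporal-difference update toward the Bellman target r + gamma * max_a' Q(s', a')
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            if done:
                break
    return Q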

Illustrative Examples

Deterministic Inventory Control

In deterministic inventory control, the Bellman equation is applied to optimize ordering decisions over a finite planning horizon when demands are known in advance. The state s_t represents the inventory level at the beginning of period t, the action q_t \geq 0 is the order quantity placed at the start of the period, and the known demand is d_t. The inventory level transitions deterministically as s_{t+1} = \max(0, s_t + q_t - d_t), with costs consisting of an ordering cost c(q_t) and a holding cost h(s_{t+1}). To account for potential shortages, a shortage cost p \cdot \max(0, d_t - s_t - q_t) is included, where p > 0. The objective is to minimize the total cost over T periods, leading to the Bellman equation for the value function V_t(s_t), which denotes the minimum cost-to-go from period t onward starting with s_t: V_t(s_t) = \min_{q_t \geq 0} \left[ c(q_t) + h(s_{t+1}) + p \cdot \max(0, d_t - s_t - q_t) + V_{t+1}(s_{t+1}) \right], with the terminal condition V_{T+1}(s_{T+1}) = 0 and s_{T+1} = \max(0, s_T + q_T - d_T). This formulation captures the trade-off between ordering costs (which may include fixed and variable components to reflect setup expenses) and holding costs (incurred for excess inventory carried forward), while penalizing shortages. In the deterministic setting, the Bellman equation enables backward induction to compute optimal policies, often yielding simple structured policies, such as base-stock or (s, S) rules, whose target inventory levels decrease over time as the horizon shortens, since future demands diminish and avoiding unnecessary holding becomes optimal. To illustrate, consider a three-period (T=3) problem with states limited to s_t \in \{0, 1, 2\} for simplicity, constant demand d_t = 1 each period, unit ordering cost of 1 per unit plus a fixed setup cost of 3 whenever q_t > 0 (so c(q_t) = 3 + q_t if q_t > 0, else 0), holding cost h(s) = s, and shortage penalty p = 10. Actions q_t are non-negative integers up to 3 to cover cumulative needs. The value functions are computed backward as follows: For period 3 (V_4 \equiv 0):
  • V_3(0) = 4 (optimal q_3 = 1)
  • V_3(1) = 0 (optimal q_3 = 0)
  • V_3(2) = 1 (optimal q_3 = 0)
For period 2:
  • V_2(0) = 6 (optimal q_2 = 2)
  • V_2(1) = 4 (optimal q_2 = 0)
  • V_2(2) = 1 (optimal q_2 = 0)
For period 1:
  • V_1(0) = 9 (optimal q_1 = 3)
  • V_1(1) = 6 (optimal q_1 = 0)
  • V_1(2) = 5 (optimal q_1 = 0)
These values are summarized in the table below:
Period t    V_t(0)    V_t(1)    V_t(2)
1           9         6         5
2           6         4         1
3           4         0         1
The optimal policy has an (s, S) structure induced by the fixed ordering cost: orders are placed only when the inventory is depleted (here, when s_t = 0), and they raise the stock to a target level that decreases over time, namely S_1 = 3 (covering all remaining demand so the fixed cost is paid only once), S_2 = 2, and S_3 = 1; when the starting inventory is positive, drawing it down is cheaper than paying another setup cost. For instance, starting with s_1 = 0, ordering q_1 = 3 results in s_2 = 2, followed by no orders in periods 2 and 3, incurring holding costs of 2 + 1 + 0 = 3 plus a single ordering cost of 3 + 3 = 6, for a total of 9, lower than ordering one unit in each period separately (total 12). This demonstrates the key insight: the fixed component of ordering costs incentivizes bulk ordering early to minimize setups, balanced against accumulating holding costs for carried inventory.
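The table above can be reproduced with a short backward-induction script over the small state and action sets; the following sketch hard-codes the specific three-period instance described in the text (a slightly larger state range is used so that all reachable inventory levels are covered):
T, demand, p_short = 3, 1, 10
states = range(5)                                # inventory levels 0..4, enough headroom

def order_cost(q):
    return 3 + q if q > 0 else 0                 # fixed setup cost of 3 plus unit cost of 1

V = {s: 0.0 for s in states}                     # terminal condition V_{T+1} = 0
policy = {}
for t in range(T, 0, -1):                        # t = 3, 2, 1
    V_new = {}
    for s in states:
        best = None
        for q in range(0, 4):
            if s + q > 4:                        # keep the next state inside the table
                continue
            shortage = max(0, demand - s - q)
            s_next = max(0, s + q - demand)
            cost = order_cost(q) + s_next + p_short * shortage + V[s_next]
            if best is None or cost < best[0]:
                best = (cost, q)
        V_new[s], policy[(t, s)] = best
    V = V_new
    print(t, {s: V[s] for s in (0, 1, 2)})       # matches the V_t values tabulated above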

Stochastic Gambling Problem

The stochastic gambling problem serves as a canonical example of an infinite-horizon discounted Markov decision process (MDP) that demonstrates the Bellman equation in a stochastic environment with absorbing states. The gambler begins with an initial capital s, an integer from 0 to N. The state space consists of capital levels, where s = 0 and s = N are absorbing states representing ruin and the target fortune, respectively. In non-absorbing states (0 < s < N), the action is to choose a bet size b \in \{1, 2, \dots, \min(s, N - s)\}, staking no more than the amount needed to reach N. The outcome is determined by a fair coin: with probability p = 0.5, the capital increases to s + b; with probability 0.5, it decreases to s - b. Each play yields an immediate reward of -1, modeling the ongoing cost of continued play, and the process terminates upon absorption with value 0 at both boundaries. The goal is to maximize the expected total discounted reward using discount factor \gamma \in (0, 1). This setup captures the tension between risk and the desire to end the process quickly due to the negative rewards. The Bellman optimality equation for the optimal value function V(s) is V(s) = \max_{1 \leq b \leq \min(s, N-s)} \left[ -1 + \gamma \left( 0.5 \, V(s + b) + 0.5 \, V(s - b) \right) \right] for 0 < s < N, subject to V(0) = 0 and V(N) = 0. This equation expresses the optimal value at s as the maximum over actions of the immediate reward plus the discounted expected value of the next state. The symmetry of the fair game (p = 0.5) and of the admissible bets implies V(s) = V(N - s). The solution to this MDP reveals that the optimal policy is bold play: at each state s, bet b = \min(s, N - s), staking either the entire capital or the exact amount needed to reach N, whichever is smaller. This maximizes the value function and can be verified by solving the Bellman equation using value iteration, which initializes V^0(s) = 0 for all s and updates V^{k+1}(s) according to the optimality equation until convergence (typically measured by \max_s |V^{k+1}(s) - V^k(s)| < \epsilon). Bold play remains optimal in fair and subfair casinos even under discounting, as established in foundational gambling theory extended to discounted criteria. For numerical illustration with N = 10 and \gamma = 0.9, value iteration converges to a symmetric value function V(s) that is negative for 0 < s < 10, reflecting the cumulative discounted costs. Under the bold policy the central state resolves in a single play, so V(5) = -1, while every other interior state reaches absorption in two plays on average and has value -1/(1 - \gamma/2) \approx -1.82; the value is exactly 0 only at the absorbing boundaries. A key insight from this problem is the emergence of risk-seeking behavior in the optimal policy. Despite fair odds, bold play favors high-variance actions to accelerate absorption and minimize discounted costs, rather than conservative bets that prolong exposure to the -1 rewards. This variance amplification—betting large fractions of capital—increases the chance of immediate ruin or success, effectively shortening the horizon in expectation. Such dynamics highlight how discounting and per-play costs in MDPs can lead to counterintuitive strategies, distinct from risk-averse choices in non-absorbing settings.
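The value-iteration computation described above can be reproduced with a short script; N, gamma, and the tolerance are the illustrative values from the text, and bets are restricted to at most the amount needed to reach N:
N, gamma, eps = 10, 0.9, 1e-10
V = [0.0] * (N + 1)                              # V(0) = V(N) = 0 at the absorbing boundaries
while True:
    V_new = V[:]
    for s in range(1, N):
        # Bellman optimality backup over admissible bets b = 1, ..., min(s, N - s)
        V_new[s] = max(-1 + gamma * 0.5 * (V[min(s + b, N)] + V[max(s - b, 0)])
                       for b in range(1, min(s, N - s) + 1))
    if max(abs(x - y) for x, y in zip(V_new, V)) < eps:
        break
    V = V_new
print([round(v, 3) for v in V])                  # V(5) = -1, other interior states about -1.818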

References

  1. [1]
    Bellman equation - Optimization Wiki
    Nov 28, 2021 · The Bellman equation is an optimality condition used in dynamic programming and named for Richard Bellman, whose principle of optimality is needed to derive it.
  2. [2]
    [PDF] Chapter 4 Introduction to Dynamic Programming - andrew.cmu.ed
    (a) Derive the Bellman equation and use it to characterize the optimal policy. (b) Assume that utility is given by u(ct)=ln(ct). Use the method of undeter ...
  3. [3]
    [PDF] RICHARD BELLMAN ON THE BIRTH OF DYNAMIC PROGRAMMING
    I decided to call this technique “The principle of optimal- ity.” Oliver Gross said one day, 'The principle is not rigor- ous.' I replied, 'Of course not. It's ...
  4. [4]
    [PDF] Chapter 3. Dynamic Programming
    This chapter introduces basic ideas and methods of dynamic programming. ... equation (the Bellman equation), presents three methods for solving the Bellman.
  5. [5]
    [PDF] The Bellman Equation
    • The Bellman equation gives the utility of a state. • If there are n ... Define U1(s) to be the utility if the agent is at state s and lives for 1 ...
  6. [6]
    [PDF] THE THEORY OF DYNAMIC PROGRAMMING - Richard Bellman
    stated above, the basic idea of the theory of dynamic programming is that of viewing an optimal policy as one deter- mining the decision required at each time ...
  7. [7]
    [PDF] Economics 2010c: Lecture 1 Introduction to Dynamic Programming
    Sep 2, 2014 · The Bellman equation for this problem is (relatively) easy to write: ( ) = max{ [ ( +1)]}. (1). Page 18. Our problem is to find the value ...
  8. [8]
    4.2 Solving Markov Decision Processes
    The Bellman equation is an example of a dynamic programming equation, an equation that decomposes a problem into smaller subproblems via an inherent recursive ...
  9. [9]
    [PDF] Markov Decision Process and Bellman Equations - Stanford University
    MDP Definition for Discrete Time, Countable States. Definition. A Markov Decision Process (MDP) comprises of: A countable set of states S (State Space), a set ...
  10. [10]
    [PDF] Hamilton-Jacobi-Bellman Equations for Q-Learning in Continuous ...
    In this paper, we introduce Hamilton–Jacobi–Bellman (HJB) equations for Q-functions in continuous- time optimal control problems with Lipschitz continuous ...
  11. [11]
    THE THEORY OF DYNAMIC PROGRAMMING
    THE THEORY OF DYNAMIC PROGRAMMING. RICHARD BELLMAN. 1. Introduction. Before turning to a discussion of some representa- tive problems which will permit us to ...
  12. [12]
    Dynamic Programming and Optimal Control - Athena Scientific
    The book is a rigorous yet highly readable and comprehensive source on all aspects relevant to DP: applications, algorithms, mathematical aspects, ...
  13. [13]
    Richard Bellman on the Birth of Dynamic Programming - PubsOnLine
    HomeOperations ResearchVol. 50, No. 1. Richard Bellman on the Birth of Dynamic Programming. Stuart Dreyfus. Stuart Dreyfus. dreyfus@ieor.berkeley.edu.
  14. [14]
    Richard Bellman on the Birth of Dynamic Programming - PubsOnLine
    I had seen problems in economics and operations research and it was clear that some effort was required. I had a good team,. Irving Glicksberg and Oliver Gross.
  15. [15]
    [PDF] richard bellman - on a routing problem
    Introduction. The problem we wish to treat is a combinatorial one involving the determination of an optimal route from one point to another. These problems are.
  16. [16]
    5.1.5 Historical remarks - Daniel Liberzon
    This connection was explicitly made in the early 1960s by Kalman, who was apparently the first to use the name ``HJB equation." Kalman's derivation of ...
  17. [17]
    [PDF] Reinforcement Learning: An Introduction - Stanford University
    We first came to focus on what is now known as reinforcement learning in late. 1979. We were both at the University of Massachusetts, working on one of.
  18. [18]
    [PDF] Lecture Notes on Dynamic Programming
    2) A Deterministic Finite Horizon Problem. 2.1) Finding necessary conditions ... This is called Bellman's equation. We can regard this as an equation ...
  19. [19]
  20. [20]
    [PDF] Dynamic Programming and Markov Processes - Gwern
    * Equation 3.1 is the application of the “Principle of Optimality”' of dynamic programming to the Markovian decision process; this and other applications are.
  21. [21]
  22. [22]
    [PDF] DYNAMIC PROGRAMMING - Gwern
    RICHARD BELLMAN. PRINCETON UNIVERSITY PRESS. PRINCETON, NEW JERSEY. A Rand Corporation Research Study. Published, 1957, by Princeton University Press.
  23. [23]
    Dynamic programming : Bellman, Richard, 1920-1984
    Dec 18, 2020 · Dynamic programming. xxv, 340 p. : 22 cm. Originally published: Princeton, NJ : Princeton University Press, 1957. Includes bibliographical references and ...
  24. [24]
    [PDF] Markov Decision Processes - The University of Manchester
    Markov Decision Processes: Discrete Stochastic Dynamic Programming. MARTIN L. PUTERMAN. University of ...
  25. [25]
    [PDF] Value Iteration & Policy Iteration - CMU School of Computer Science
    Apr 8, 2020 · Value iteration is an algorithm, and policy iteration is a system of equations. Value iteration can be synchronous or asynchronous. Policy  ...
  26. [26]
    [PDF] Value and Policy Iterations in Optimal Control and Adaptive ... - MIT
    Bertsekas and Yu [5], [6] deal with issues that relate in part to the intricacies of the convergence of VI and PI in undiscounted infinite horizon DP. This ...
  27. [27]
    [PDF] Error Bounds for Approximate Value Iteration
    Iteration algorithm defined by the iteration Vn+1 = T Vn. Due to the contraction property in L∞. −norm of the opera- tor T , the iterates Vn converge to V ∗ ...
  28. [28]
    Linear Programming and Sequential Decisions | Management Science
    This paper demonstrates how a typical sequential probabilistic model may be formulated in terms of (a) an initial decision rule and (b) a Markov process, and ...
  29. [29]
    Linear programming formulations of Markov decision processes
    We give a new derivation of the linear program corresponding to a Markov Decision Process (MDP) in steady state, which seeks to minimize discounted total ...
  30. [30]
  31. [31]
    (PDF) Applied Optimal Control: Optimization, Estimation, and Control
    Aug 9, 2025 · Applied Optimal Control: Optimization, Estimation, and Control. DOI:10.1109/TSMC.1979.4310229. Authors: Arthur Bryson at Stanford University.
  32. [32]
    Stochastic Controls: Hamiltonian Systems and HJB Equations
    Hamiltonian Systems and HJB Equations. Authors: Jiongmin Yong, Xun Yu Zhou. Series Title: Stochastic Modelling and Applied Probability. DOI: https://doi.org ...
  33. [33]
    On some applications of the theory of dynamic programming to
    Richard Bellman, 1954. "On some applications of the theory of dynamic programming to logistics," Naval Research Logistics Quarterly, John Wiley & Sons, vol.
  34. [34]
    [PDF] Operations Research and the RAND Corporation
    Richard Bellman, together with a few collaborators, almost exclusively pioneered the development of this theory. The first RAND report on dynamic programming ...
  35. [35]
    [PDF] Lecture 10: Consumption-Savings Problem Optimal Growth
    Planner's problem: Neoclassical, Ramsey-Cass-Koopmans optimal growth model ... Bellman equation: V (k) = max c,k0. 1u(c) + βV (k/)l subject to c + k/ - (1 ...
  36. [36]
    [PDF] 4: Reinforcement Learning 1 Stochastic inventory management
    Sep 26, 2017 · In this section, we model stochastic demand for inventory that is not perishable, where solving the problem requires a se- quence of decisions.
  37. [37]
    [PDF] Dynamic Stochastic Inventory Management with Reference Price ...
    We analyze the joint inventory and pricing decisions of a firm when demand depends on not only the current selling price but also a memory-based reference ...
  38. [38]
    [PDF] Dynamic Demand Estimation in Auction Markets - Greg Lewis
    Oct 12, 2020 · We study demand estimation in a large auction market. In our model, a dynami- cally evolving population of buyers with unit demand and ...
  39. [39]
    [PDF] RECURSIVE CONTRACTS
    Importantly, ψ is a time-invariant policy function that solves the Bellman equation. We refer to this as the “standard dynamic programming” case.
  40. [40]
    [PDF] Dynamic Programming Approach to Principal-Agent Problems - CMAP
    Nov 21, 2015 · Abstract. We consider a general formulation of the Principal-Agent problem with a lump-sum payment on a finite horizon.
  41. [41]
    [PDF] Reducing the curse of dimensionality in dynamic stochastic ...
    The solvability of many economic models suffers from the so-called curse of dimensionality: the computing time required to solve these models presents an ...
  42. [42]
    [PDF] NBER WORKING PAPER SERIES DEEP LEARNING FOR SOLVING ...
    While powerful, these approaches face the curse of dimensionality, making global solutions computationally infeasible as the number of state variables increases ...
  43. [43]
    Q-learning | Machine Learning
    This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to ...
  44. [44]
    [PDF] Q-learning - Semantic Scholar
    This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989), showing that Q-learning converges to ...
  45. [45]
    [1312.5602] Playing Atari with Deep Reinforcement Learning - arXiv
    Dec 19, 2013 · This paper presents a deep learning model using a convolutional neural network and Q-learning to learn control policies from Atari game pixels, ...