Markov decision process
A Markov decision process (MDP) is a mathematical framework used to model sequential decision-making problems in environments where outcomes are partially stochastic and partially controlled by an agent's actions, satisfying the Markov property that the future state depends only on the current state and action, not on prior history.[1] Formally, an MDP is defined as a tuple (S, A, P, R, \gamma), where S is the set of states, A is the set of actions, P(s'|s,a) is the transition probability function specifying the probability of moving to state s' from state s after action a, R(s,a) is the reward function providing immediate rewards for state-action pairs, and \gamma \in [0,1) is the discount factor that weights future rewards relative to immediate ones.[2]
The origins of MDPs trace back to the 1950s, when Richard Bellman and Lloyd Shapley developed foundational concepts in dynamic programming and controlled Markov chains to address optimal decision-making under uncertainty.[3] Bellman's seminal 1957 paper, "A Markovian Decision Process," introduced key ideas such as value functions, establishing MDPs as a cornerstone for solving problems in stochastic control.[4] These early works formalized how agents can compute optimal policies—mappings from states to actions—that maximize expected cumulative discounted rewards over time, often through algorithms like value iteration or policy iteration.[5]
MDPs have broad applications across fields, including reinforcement learning where they underpin algorithms like Q-learning for training autonomous agents, optimal control in engineering systems such as robotics and inventory management, and finance for portfolio optimization and option pricing under uncertainty.[6] In healthcare, MDPs model treatment planning to balance short-term risks and long-term outcomes, while in manufacturing, they optimize production scheduling amid variable demands and machine failures.[7] Extensions like partially observable MDPs (POMDPs) address real-world scenarios with incomplete information, enhancing applicability in complex, dynamic environments.[8]
Fundamentals
Definition
A Markov decision process (MDP) is a mathematical framework used to model decision-making in dynamic environments where outcomes are influenced by both deliberate choices and inherent randomness. It formalizes sequential decision problems as a stochastic process, enabling the analysis of optimal policies under uncertainty. MDPs assume basic knowledge of probability theory, particularly stochastic processes, which are sequences of random variables representing system states evolving over time.
Formally, an MDP is defined as a tuple (S, A, P, R, \gamma), where S denotes the state space, encompassing all possible configurations of the system; A represents the action space, the set of available decisions at each state; P: S \times A \times S \to [0,1] is the transition probability function, specifying the likelihood of moving to a next state given the current state and action; R: S \times A \to \mathbb{R} is the reward function, assigning immediate payoffs for taking an action in a state; and \gamma \in [0,1) is the discount factor, which weights the importance of future rewards relative to immediate ones. This structure captures the essence of problems where an agent interacts with an environment over discrete time steps, selecting actions to maximize cumulative rewards.
Central to the MDP framework is the Markov property, which posits that the future state of the system depends solely on the current state and the action taken, independent of any prior history. Mathematically, this is expressed as P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} | s_t, a_t), ensuring that the process is memoryless and simplifying computation by focusing decisions on the present. This property distinguishes MDPs from more general decision processes that might require tracking full histories.
The concept of MDPs originated in the 1950s, coined by Richard Bellman as part of his development of dynamic programming techniques for addressing sequential decision problems in uncertain settings. Bellman's foundational work introduced the Markovian structure to model asymptotic behaviors in controlled stochastic systems, laying the groundwork for modern applications in optimization and control.[9]
Components
A Markov decision process (MDP) is formally defined by a tuple consisting of several key components that model the environment for sequential decision-making under uncertainty.[10]
The state space S represents the set of all possible states in which the decision-maker can find itself, which may be discrete or continuous, finite or infinite in size.[10] For example, in a grid world navigation task, states could correspond to the agent's position as coordinates on a 2D lattice.[10] This component encapsulates all relevant information about the system's configuration at any time, adhering to the Markov property where future states depend only on the current state.[10]
The action space A denotes the set of actions available to the decision-maker, which can vary depending on the current state, leading to state-dependent action spaces A(s) \subseteq A.[10] Actions may be deterministic, where selecting an action leads to a fixed outcome, or part of stochastic policies that randomize over actions to explore or balance risk.[10] In practice, actions often represent choices like moving in a specific direction or adjusting a control parameter in a robotic system.[10]
The transition function P(s' \mid s, a) specifies the probability distribution over next states s' \in S given the current state s \in S and action a \in A(s), capturing the dynamics of the environment.[10] This function assumes stationarity, meaning the transition probabilities do not change over time, which simplifies modeling long-term behaviors.[11] Stochastic transitions are common in real-world applications, such as weather-dependent movements in environmental planning.[10]
The reward function R(s, a, s') or alternatively R(s, a) provides the immediate reward or cost received after transitioning from state s to s' via action a, which can be deterministic or stochastic to reflect uncertainty in outcomes.[10] Rewards quantify the desirability of state-action transitions, guiding the decision-maker toward beneficial behaviors, such as positive scores for reaching goals in inventory management problems.[10] Stochastic rewards allow modeling noisy feedback, as seen in financial trading scenarios where payoffs vary.[11]
The discount factor \gamma \in [0, 1) is a parameter that weights the importance of future rewards relative to immediate ones, ensuring convergence in infinite-horizon settings by prioritizing short-term gains when \gamma is low.[10] A value of \gamma = 0 implies myopic decision-making focused solely on the current reward, while values closer to 1 emphasize long-term planning, as in sustainable resource allocation tasks.[10]
MDPs can operate over finite or infinite horizons, where a finite horizon limits the number of decision steps, often leading to time-dependent policies, whereas infinite horizons assume ongoing interactions without a fixed endpoint.[10] Additionally, tasks may be episodic, consisting of distinct sequences starting from initial states and ending in terminal states, or continuing, involving perpetual interactions without natural terminations, such as in ongoing network routing.[10]
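To make these components concrete, the following is a minimal sketch of how the tuple (S, A, P, R, \gamma) might be represented for a small finite MDP in Python; the dataclass, field names, and the two-state machine-maintenance example are illustrative assumptions rather than a standard representation.

```python
# Minimal sketch (illustrative names): the tuple (S, A, P, R, gamma) for a small finite MDP.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class FiniteMDP:
    states: List[str]                                     # S
    actions: Dict[str, List[str]]                         # state-dependent action sets A(s)
    transitions: Dict[Tuple[str, str], Dict[str, float]]  # P(s' | s, a)
    rewards: Dict[Tuple[str, str], float]                 # R(s, a)
    gamma: float                                          # discount factor in [0, 1)

# Hypothetical two-state machine-maintenance example.
mdp = FiniteMDP(
    states=["working", "broken"],
    actions={"working": ["run", "service"], "broken": ["repair"]},
    transitions={
        ("working", "run"):     {"working": 0.9, "broken": 0.1},
        ("working", "service"): {"working": 1.0},
        ("broken", "repair"):   {"working": 0.8, "broken": 0.2},
    },
    rewards={("working", "run"): 1.0, ("working", "service"): -0.5, ("broken", "repair"): -1.0},
    gamma=0.95,
)
```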
Discrete-Time Model
In discrete-time Markov decision processes (MDPs), time progresses in discrete steps t = 0, 1, 2, \dots, where at each step the decision-maker observes the current state s_t \in \mathcal{S}, selects an action a_t \in \mathcal{A}(s_t), receives a reward r_{t+1}, and transitions to the next state s_{t+1}.[12] The evolution of the system is governed by the state transition probabilities, which specify the probability distribution over the next state and reward given the current state and action. Specifically, the transition kernel is defined as P(s_{t+1}, r_{t+1} \mid s_t, a_t), the joint probability of transitioning to state s_{t+1} and receiving reward r_{t+1} after taking action a_t in state s_t. This can be marginalized to yield the state transition probabilities P(s_{t+1} \mid s_t, a_t) = \sum_r P(s_{t+1}, r \mid s_t, a_t) and the expected reward function R(s_t, a_t) = \sum_{s_{t+1}, r} r \, P(s_{t+1}, r \mid s_t, a_t).[12] These components ensure the Markov property: the future state and reward depend only on the current state and action, not on prior history.[9]
The performance of a policy \pi, which maps states to action distributions \pi(a \mid s), is evaluated using the value function V^\pi(s), defined as the expected discounted cumulative reward starting from state s and following \pi thereafter. Formally,
V^\pi(s) = \mathbb{E} \left[ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \;\middle|\; s_t = s, \pi \right],
where 0 \leq \gamma < 1 is the discount factor that weights future rewards.[12] Similarly, the action-value function Q^\pi(s, a) assesses the expected discounted return when starting in state s, taking action a, and then adhering to \pi:
Q^\pi(s, a) = \mathbb{E} \left[ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \;\middle|\; s_t = s, a_t = a, \pi \right].
This relates to the value function via V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a), providing a way to decompose policy evaluation.[12]
The Bellman expectation equation provides a recursive characterization of V^\pi(s) by conditioning on the immediate action, next state, and reward under \pi. Starting from the definition of V^\pi(s), expand the expectation:
V^\pi(s) = \mathbb{E} \left[ r_{t+1} + \gamma V^\pi(s_{t+1}) \;\middle|\; s_t = s, \pi \right],
since the remaining sum from k=1 onward is \gamma times V^\pi(s_{t+1}) by the law of total expectation and the Markov property. Substituting the policy and transition probabilities yields
V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a) \left[ r + \gamma V^\pi(s') \right].
This equation expresses the value as the expected immediate reward plus the discounted value of the successor state, averaged over actions and outcomes.[12] A similar recursion holds for Q^\pi(s, a) = \sum_{s', r} P(s', r \mid s, a) [r + \gamma V^\pi(s')].[12]
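This recursion can be applied as a fixed-point iteration to evaluate a given policy numerically. The following is a minimal sketch assuming tabular arrays P[a][s][s'] for the transition probabilities, R[a][s] for expected rewards, and pi[s][a] for the policy; the names and shapes are illustrative.

```python
# Minimal sketch of iterative policy evaluation using the Bellman expectation equation.
# Assumed (illustrative) shapes: P is (|A|, |S|, |S|), R is (|A|, |S|), pi is (|S|, |A|).
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9, tol=1e-8):
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s,a) V(s')
        Q = R + gamma * P @ V                  # shape (|A|, |S|)
        V_new = (pi.T * Q).sum(axis=0)         # V(s) = sum_a pi(a|s) Q(s, a)
        if np.abs(V_new - V).max() < tol:
            return V_new
        V = V_new
```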
The optimal value function V^*(s) is defined as the supremum over all policies: V^*(s) = \max_\pi V^\pi(s), representing the best possible expected discounted return from state s. It satisfies V^*(s) \geq V^\pi(s) for any \pi, with equality for optimal policies.
Optimization Objective
The primary optimization objective in a Markov decision process (MDP) is to select a policy \pi that maximizes the expected discounted cumulative reward over an infinite horizon, formally expressed as \max_{\pi} \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1}) \mid \pi \right], where 0 \leq \gamma < 1 is the discount factor ensuring convergence. This expectation is taken with respect to the state-action trajectories induced by \pi and the transition dynamics P. An optimal policy \pi^* achieves this maximum and satisfies \pi^*(s) = \arg\max_a Q^*(s, a) for all states s, where Q^*(s, a) denotes the optimal action-value function representing the maximum expected return starting from state s and taking action a.
The Bellman optimality equation characterizes the optimal value function V^*, defined as the maximum expected discounted reward from state s, via
V^*(s) = \max_a \sum_{s', r} P(s', r \mid s, a) \left[ r + \gamma V^*(s') \right]
for all s \in \mathcal{S}.[9] Similarly, the optimal action-value function Q^* obeys
Q^*(s, a) = \sum_{s', r} P(s', r \mid s, a) \left[ r + \gamma \max_{a'} Q^*(s', a') \right].
[9] These equations embody the principle of optimality: whatever the initial state and first action, the remaining decisions must themselves constitute an optimal policy for the resulting successor state.
The existence and uniqueness of V^* as the fixed point of the Bellman optimality operator T^* V(s) = \max_a \sum_{s', r} P(s', r \mid s, a) [r + \gamma V(s')] follow from the Banach fixed-point theorem applied in the complete metric space of bounded value functions equipped with the supremum norm \|V\|_\infty = \sup_s |V(s)|. Specifically, T^* is a \gamma-contraction mapping because for any value functions V, W,
\|T^* V - T^* W\|_\infty \leq \gamma \|V - W\|_\infty,
ensuring a unique fixed point V^* that is approached iteratively by applying T^* to any initial V_0. This contraction property holds under the finite-state, finite-action assumption and bounded rewards, guaranteeing the operator's well-definedness on the space of bounded functions.
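The contraction property can be illustrated numerically. The sketch below constructs a small random finite MDP (the array shapes and random values are assumptions for illustration), applies the Bellman optimality operator T^* to two arbitrary value functions, and verifies the inequality above.

```python
# Minimal sketch: numerical check of the gamma-contraction of the Bellman optimality operator.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)              # normalize rows to transition probabilities
R = rng.random((n_actions, n_states))          # expected rewards R(s, a)

def T_star(V):
    # max over actions of the one-step lookahead R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
    return (R + gamma * P @ V).max(axis=0)

V, W = rng.random(n_states), rng.random(n_states)
lhs = np.abs(T_star(V) - T_star(W)).max()      # ||T*V - T*W||_inf
rhs = gamma * np.abs(V - W).max()              # gamma * ||V - W||_inf
print(lhs <= rhs + 1e-12)                      # the contraction inequality always holds
```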
The policy improvement theorem provides a constructive way to approach optimality: for any policy \pi, the greedy policy \pi'(s) = \arg\max_a Q^\pi(s, a) (greedy with respect to the action-value function Q^\pi of \pi) satisfies V^{\pi'}(s) \geq V^\pi(s) for all s, and the improvement is zero everywhere precisely when \pi already satisfies the Bellman optimality equation, that is, when \pi is optimal. This theorem underpins methods for refining policies toward \pi^*.
In finite MDPs (with finite state and action spaces), there always exists an optimal policy that is stationary (time-independent) and deterministic, meaning \pi^*(s) selects a single action for each s without randomization or dependence on time or history. Such policies suffice to achieve V^* from any starting state, leveraging the contraction properties and the finite structure to ensure the existence of pure-strategy optima.
Examples and Applications
Illustrative Example
A simple yet insightful example of a Markov decision process is the gridworld environment, where an agent navigates a 4×4 grid representing the state space S, with positions labeled as (i, j) for row i = 1 to 4 (top to bottom) and column j = 1 to 4 (left to right). The agent's actions A are up, down, left, and right. The goal state is the absorbing position at (4,4), which provides a reward of +1 upon entry and 0 thereafter. Each transition in non-absorbing states yields a reward of -1, encouraging efficient paths. Two pitfall states at (2,2) and (3,3) are also absorbing, delivering a reward of -10 upon entry to penalize dangerous areas. This configuration highlights risk-reward trade-offs in stochastic environments, akin to examples in standard reinforcement learning literature.[10]
The dynamics P introduce stochasticity: from state s and action a, the agent moves to the intended neighbor with probability 0.8; with probability 0.1 each, it slips perpendicularly left or right relative to the intended direction (treating up/down as cycling to left/right slips). If a slip or move would exit the grid, the agent remains in s. Absorbing states transition to themselves with probability 1 and reward 0. The reward function R is deterministic based on the next state: +1 for entering the goal, -10 for pitfalls, and -1 otherwise.[10]
Consider a uniform random policy \pi, where each action is chosen with probability 1/4. The state-value function v_\pi(s) satisfies the Bellman expectation equation:
v_\pi(s) = \sum_{a \in A} \pi(a|s) \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_\pi(s') \right]
for discount factor \gamma = 0.9, with v_\pi(s) = 0 for absorbing states. To illustrate the computation, consider state s = (4,3), adjacent to the goal. Under the random policy, the expected update is the average over actions. For action right (a_R): probability 0.8 to the goal (s' = (4,4), r = +1), 0.1 up to (3,3) (pit, r = -10), 0.1 down (stay, r = -1). Thus, the expected value for a_R is 0.8(+1 + 0.9 \cdot 0) + 0.1(-10 + 0.9 \cdot 0) + 0.1(-1 + 0.9 v_\pi(4,3)) = 0.8 - 1.0 - 0.1 + 0.09 v_\pi(4,3) = -0.3 + 0.09 v_\pi(4,3). Similar calculations for the other actions yield lower values due to risks (e.g., left to (4,2), up/down slips toward the pit). Solving the full system gives v_\pi(4,3) \approx -1.2, reflecting the policy's inefficiency from random slips toward the pit. For a distant state like s = (1,1), v_\pi(1,1) \approx -15.4, as random walks prolong the episode and increase pitfall risk.[10]
The optimal value function v^*(s) and policy \pi^* maximize expected returns, solved via methods like value iteration (detailed elsewhere). In this gridworld, v^*(s) increases toward the goal while dropping sharply near pitfalls: e.g., v^*(4,3) = 0.62, v^*(4,2) = -1.2 (avoiding left slip to safe path), and v^*(1,1) = -3.5. The optimal policy directs actions toward the goal while detouring pitfalls, visualized as arrows: from (1,1) right to (1,2), down from (1,3) to (2,3) (bypassing (2,2) pit), and right/down near goal. A table summarizes approximate v^*(s) values:
| | j = 1 | j = 2 | j = 3 | j = 4 |
|---|---|---|---|---|
| i = 1 | -3.5 | -2.9 | -2.4 | -2.0 |
| i = 2 | -4.1 | -10.0* | -1.9 | -3.2 |
| i = 3 | -2.8 | -3.4 | -10.0* | 0.1 |
| i = 4 | -1.8 | -1.2 | 0.62 | 0.0* |
(*Absorbing states: the pits give -10 on entry, the goal gives 0 after entry.) Arrows indicate \pi^*: → → ↓ → for row 1; ↓ ← ↓ ↓ for row 2, and so on, forming a safe path.[10]
The discount factor \gamma profoundly influences planning. With \gamma = 0.9, the agent weighs future rewards heavily, yielding v^*(1,1) \approx -3.5 by valuing the eventual +1 goal over many -1 steps (optimal path length ~4). In contrast, \gamma = 0 renders the agent myopic, with v^*(s) = \max_a \mathbb{E}[r \mid s, a], equaling -1 for states far from terminals where all transitions yield r = -1, but higher near the goal (e.g., ≈ -0.3 from (4,3) by risking entry) or lower near pits; this leads to policies attempting immediate high-reward actions where possible, rather than random movement. This demonstrates how \gamma balances immediate expected rewards against long-term utility in MDPs.[10]
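The single-action backup for state (4,3) worked out above can be reproduced in a few lines. The following is a minimal sketch assuming the slip dynamics and rewards described earlier; the function name and constants are illustrative.

```python
# Minimal sketch: expected Bellman backup for action "right" in state (4,3)
# of the gridworld above, under the 0.8/0.1/0.1 slip dynamics.
GAMMA = 0.9

def backup_right_from_4_3(v_43: float) -> float:
    """One-step expected return of action 'right' taken in (4,3)."""
    # 0.8: intended move into the goal (4,4): reward +1, successor value 0 (absorbing)
    # 0.1: slip "up" into the pit (3,3): reward -10, successor value 0 (absorbing)
    # 0.1: slip "down" off the grid: stay in (4,3), reward -1
    return (0.8 * (1.0 + GAMMA * 0.0)
            + 0.1 * (-10.0 + GAMMA * 0.0)
            + 0.1 * (-1.0 + GAMMA * v_43))

print(backup_right_from_4_3(0.0))  # -0.3, matching -0.3 + 0.09 * v_pi(4,3) in the text
```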
Real-World Applications
Markov decision processes (MDPs) have found extensive application in robotics, particularly for path planning and control in uncertain environments. In robot navigation, MDPs model states as the robot's position and sensor readings, actions as movement directions, and rewards based on progress toward goals while penalizing collisions or inefficiencies. Post-2000 advancements in approximate methods, such as focussed processing of MDPs, have enabled scalable solutions for large state spaces in dynamic settings, allowing robots to compute near-optimal policies efficiently.[13] For instance, real-time path planning algorithms using MDPs handle stochastic obstacles by iteratively updating value functions to balance exploration and exploitation.[14]
In inventory management, MDPs optimize dynamic stocking policies under demand uncertainty, where states represent current inventory levels, actions involve ordering quantities, and rewards are defined as sales profits minus holding and shortage costs. This formulation allows for periodic review systems that minimize long-term costs by solving the Bellman equation for optimal ordering thresholds. Seminal work has demonstrated how value iteration on MDPs yields policies superior to traditional economic order quantity models in stochastic environments.[15] Recent extensions incorporate supply chain disruptions, showing up to 15-20% cost reductions in multi-echelon inventories through MDP-based decision rules.[16]
Financial portfolio optimization leverages MDPs to manage asset allocation over time, with states capturing current portfolio values and market conditions, actions as rebalancing decisions, and rewards tied to risk-adjusted returns. This sequential framework accounts for transaction costs and market volatility, enabling dynamic strategies that outperform static buy-and-hold approaches. An influential book on MDPs in finance details how discounted infinite-horizon models derive optimal policies via linear programming, applied to equity-bond portfolios with empirical backtests showing improved Sharpe ratios.[17]
In healthcare, MDPs facilitate treatment sequencing, such as dosing regimens for chronic diseases like diabetes, where states include patient health metrics and time since the last dose, actions are dosage adjustments, and rewards reflect improved outcomes minus side effects. Applications in the 2020s combine MDPs with electronic health records to develop personalized treatment recommendations that improve glycemic control.[18] For depression management, MDP models optimize sequential therapy choices by reformulating state-transition models, showing potential for dynamic treatment comparisons superior to static guidelines.[19]
MDPs underpin AI in games like backgammon, modeling board positions as states, legal moves as actions, and win/loss outcomes as rewards to learn optimal strategies through reinforcement learning. The seminal TD-Gammon program, using temporal-difference methods on the MDP framework, achieved superhuman performance by self-play, revolutionizing game AI with over 1.5 million training games.[20]
Modern applications include autonomous driving, where MDPs guide decision-making in uncertain traffic, as seen in systems like Waymo's since the 2010s, modeling vehicle states and actions to minimize collision risks while maximizing efficiency. In climate policy modeling, MDPs support sequential decisions on emissions reductions and adaptation, aligning with IPCC frameworks for socio-economic impacts; a 2020 model formulates government actions in response to climate states, optimizing welfare under uncertainty.[21]
Solution Algorithms
Value Iteration
Value iteration is a dynamic programming algorithm used to solve finite Markov decision processes (MDPs) by iteratively approximating the optimal value function until convergence. It directly computes the value of each state under the optimal policy by successive approximations based on the Bellman optimality equation, which relates the value of a state to the maximum expected reward plus discounted future value over possible actions.
The algorithm begins by initializing a value function V_0(s) = 0 for all states s \in S. In each iteration k, the value function is updated for every state according to the Bellman optimality operator:
V_{k+1}(s) = \max_{a \in A} \sum_{s', r} P(s', r \mid s, a) \left[ r + \gamma V_k(s') \right],
where A is the action set, P(s', r \mid s, a) is the transition probability and reward distribution, and \gamma \in [0, 1) is the discount factor. This update is performed synchronously across all states in each iteration, repeating until the maximum change in value across states falls below a small threshold \epsilon > 0, indicating convergence to the optimal value function V^*.
Here is pseudocode for the value iteration algorithm:
Initialize V(s) = 0 for all s in S
Repeat:
    Δ = 0
    For each s in S:
        V_old = V(s)
        V(s) = max_a ∑_{s',r} P(s',r|s,a) [r + γ V(s')]
        Δ = max(Δ, |V(s) - V_old|)
Until Δ < ε (a small threshold)
Return V
The pseudocode is written in terms of the joint distribution P(s', r \mid s, a), so it accommodates stochastic rewards directly; with a deterministic reward function R(s, a), the inner sum reduces to \sum_{s'} P(s' \mid s, a) \left[ R(s, a) + \gamma V(s') \right].
To illustrate, consider a simple 3-state MDP with states S = \{1, 2, 3\}, where state 3 is terminal with value 0. Actions are "left" and "right" in states 1 and 2, yielding transitions: from 1, left stays in 1 (reward 0), right goes to 2 (reward 1); from 2, left goes to 1 (reward 1), right goes to 3 (reward 10). Assume \gamma = 0.9 and deterministic transitions. Starting with V_0 = [0, 0, 0], the first iteration yields V_1 = [1, 10, 0] (choosing right in 1, right in 2). The second iteration gives V_2 = [10, 10, 0] (right in 1 accounts for discounted value from 2, right in 2). Further iterations remain at V^* = [10, 10, 0].
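These iterations can be reproduced with a short script. Below is a minimal sketch of synchronous value iteration on this 3-state example; the dictionary-based model representation is an illustrative choice.

```python
# Minimal sketch: synchronous value iteration on the 3-state example above.
# Transitions are deterministic, so each (state, action) maps to one (next state, reward).
GAMMA = 0.9

MODEL = {
    (1, "left"):  (1, 0.0),   # stay in state 1, reward 0
    (1, "right"): (2, 1.0),   # move to state 2, reward 1
    (2, "left"):  (1, 1.0),   # move to state 1, reward 1
    (2, "right"): (3, 10.0),  # move to terminal state 3, reward 10
}

V = {1: 0.0, 2: 0.0, 3: 0.0}   # state 3 is terminal and keeps value 0
while True:
    new_V = dict(V)
    for s in (1, 2):
        new_V[s] = max(r + GAMMA * V[s2]
                       for (s0, _a), (s2, r) in MODEL.items() if s0 == s)
    delta = max(abs(new_V[s] - V[s]) for s in V)
    V = new_V
    if delta < 1e-9:
        break

print(V)   # {1: 10.0, 2: 10.0, 3: 0.0}, matching V* = [10, 10, 0]
```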
Value iteration converges to the optimal value function because the Bellman optimality operator is a contraction mapping in the sup-norm with modulus \gamma. Specifically, for any value functions V and W,
\| TV - TW \|_\infty \leq \gamma \| V - W \|_\infty,
where T is the operator, implying \| V_{k+1} - V^* \|_\infty \leq \gamma \| V_k - V^* \|_\infty. In finite MDPs the iterates therefore converge geometrically to the optimal value function, and after a finite number of iterations the greedy policy derived from V_k is already optimal, even though the values themselves generally stabilize only in the limit.
Once the optimal value function V^* is obtained, an optimal policy \pi^* is extracted greedily by selecting, for each state s,
\pi^*(s) = \arg\max_{a \in A} Q^*(s, a),
where the optimal action-value function is Q^*(s, a) = \sum_{s', r} P(s', r \mid s, a) [r + \gamma V^*(s')]. This policy is optimal because it achieves V^* in every state.
Computationally, each iteration requires evaluating the Bellman update for all states and actions, leading to a time complexity of O(|S|^2 |A|) per iteration when transitions are stored explicitly, plus O(|S|) space for the value function. Synchronous updates compute all new values from the previous iterate before applying them, matching the contraction analysis, while asynchronous updates (e.g., Gauss-Seidel style) apply changes immediately, often speeding convergence in practice and still converging provided every state continues to be updated. The number of iterations to \epsilon-convergence is typically O\left( \frac{\log(|S|/\epsilon)}{1 - \gamma} \right), consistent with the geometric contraction rate.
Policy Iteration
Policy iteration is a dynamic programming algorithm for finding an optimal policy in a Markov decision process (MDP) by alternating between two steps: policy evaluation and policy improvement. Introduced by Ronald Howard in his seminal 1960 work, the method begins with an arbitrary initial policy \pi_0 and iteratively refines it until convergence to the optimal policy \pi^*.[22] Unlike value iteration, which performs successive approximations to the value function, policy iteration fully evaluates the current policy before deriving an improved one, often leading to faster convergence in practice by exploiting the structure of stationary policies.
In the policy evaluation step, the value function V^{\pi_k} for the current policy \pi_k is computed by solving the system of linear equations derived from the Bellman equation for that policy:
V^{\pi_k}(s) = \sum_{s', r} P(s', r \mid s, \pi_k(s)) \left[ r + \gamma V^{\pi_k}(s') \right], \quad \forall s \in \mathcal{S}.
For finite MDPs, this can be expressed in matrix form as (I - \gamma P^{\pi_k}) V^{\pi_k} = R^{\pi_k}, where I is the identity matrix, P^{\pi_k} is the |\mathcal{S}| \times |\mathcal{S}| transition probability matrix under \pi_k, and R^{\pi_k} is the expected reward vector under \pi_k. This linear system can be solved exactly using methods like Gaussian elimination, which requires O(|\mathcal{S}|^3) time in the worst case, though iterative solvers exploit sparsity for efficiency. The evaluation step converges monotonically to the true value function V^{\pi_k} under the contraction mapping property of the discounted Bellman operator.
The policy improvement step then greedily updates the policy based on the evaluated value function. For each state s, the new policy \pi_{k+1}(s) selects the action that maximizes the action-value (Q-function) under V^{\pi_k}:
\pi_{k+1}(s) = \arg\max_a \sum_{s', r} P(s', r \mid s, a) \left[ r + \gamma V^{\pi_k}(s') \right], \quad \forall s \in \mathcal{S}.
This step requires O(|\mathcal{S}| |\mathcal{A}|) time for finite MDPs, as it evaluates the one-step lookahead for each state-action pair. The improved policy \pi_{k+1} is guaranteed to be at least as good as \pi_k, with strict improvement unless \pi_k is already optimal, ensuring monotonic progress in the value function.[22]
For finite-state, finite-action discounted MDPs, policy iteration converges to the optimal policy in a finite number of iterations, bounded by the number of distinct deterministic stationary policies (at most |\mathcal{A}|^{|\mathcal{S}|}), though typically far fewer due to the greedy switching rule. Each full iteration incurs O(|\mathcal{S}|^2 |\mathcal{A}|) time if approximate iterative methods are used for evaluation (e.g., solving until a tolerance), plus the O(|\mathcal{S}| |\mathcal{A}|) for improvement. In comparison to value iteration, policy iteration generally requires fewer iterations because each improvement leverages a complete policy evaluation, resulting in larger value gains per step and empirical speedups on many benchmark problems despite the higher per-iteration cost.[23]
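As a concrete illustration of the two alternating steps, the following is a minimal sketch of exact policy iteration for a finite tabular MDP, assuming arrays P[a][s][s'] of transition probabilities and R[a][s] of expected rewards (names and shapes are illustrative). Evaluation solves the linear system exactly; improvement performs the greedy one-step lookahead.

```python
# Minimal sketch of exact policy iteration for a finite tabular MDP.
# Assumed (illustrative) shapes: P is (|A|, |S|, |S|), R is (|A|, |S|).
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, np.arange(n_states), :]    # |S| x |S| transition matrix under policy
        R_pi = R[policy, np.arange(n_states)]       # expected reward vector under policy
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy one-step lookahead on Q(s, a).
        Q = R + gamma * P @ V                       # shape (|A|, |S|)
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```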
In MDPs with infinite state spaces, exact policy iteration is intractable, necessitating approximations such as function approximation for the value function or linear programming formulations of the Bellman optimality equations. Recent scalable variants in the 2020s integrate deep neural networks with policy iteration for high-dimensional problems, such as inventory optimization, where integer programming constraints ensure feasible policies, achieving near-optimal performance on large-scale instances.[24]
Extensions
Partially Observable MDPs
A partially observable Markov decision process (POMDP) extends the Markov decision process framework to model decision-making under uncertainty where the agent receives only partial information about the true state of the environment.[25] In a POMDP, the agent observes outcomes from an observation space O rather than the full state, requiring it to infer the state distribution based on prior beliefs and new evidence.[25]
Formally, a POMDP is defined by the tuple (S, A, O, P, R, \Omega, \gamma), augmenting the standard MDP with the observation space O and the observation function \Omega: O \times S \times A \to [0,1], which specifies the probability \Omega(o \mid s', a) of observing o \in O after executing action a \in A and transitioning to state s' \in S according to the transition probabilities P(s' \mid s, a).[25] The reward function R(s, a) and discount factor \gamma \in [0,1) remain as in the MDP.[25] The agent's knowledge is represented by a belief state b: S \to [0,1], a probability distribution over S that summarizes all available information, satisfying \sum_{s \in S} b(s) = 1.[25]
The POMDP can be reformulated as a belief-MDP, where the state space consists of all possible belief states forming the (|S|-1)-dimensional probability simplex \Delta(S).[25] In this view, the transition from belief b to a new belief b' after action a and observation o follows from Bayes' rule:
b'(s') = \frac{\sum_{s \in S} b(s) \, P(s' \mid s, a) \, \Omega(o \mid s', a)}{\Pr(o \mid b, a)},
where the normalizing denominator is \Pr(o \mid b, a) = \sum_{s' \in S} \sum_{s \in S} b(s) \, P(s' \mid s, a) \, \Omega(o \mid s', a).[25] This update preserves the Markov property, allowing dynamic programming techniques to be applied over the belief space, though the rewards and actions are now defined in terms of beliefs: the expected reward is \sum_{s \in S} b(s) \, R(s, a), and the policy maps beliefs to actions.[25]
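The belief update translates directly into code. The following is a minimal sketch assuming tabular arrays T[s, a, s'] = P(s' \mid s, a) and O[s', a, o] = \Omega(o \mid s', a); the array names and shapes are illustrative.

```python
# Minimal sketch of the POMDP belief update via Bayes' rule.
# Assumed (illustrative) shapes: b is (|S|,), T is (|S|, |A|, |S|), O is (|S|, |A|, |O|).
import numpy as np

def belief_update(b, a, o, T, O):
    """Return b'(s') proportional to Omega(o|s',a) * sum_s b(s) P(s'|s,a)."""
    predicted = b @ T[:, a, :]            # sum_s b(s) P(s'|s,a), shape (|S|,)
    unnormalized = O[:, a, o] * predicted # weight by the observation likelihood
    prob_o = unnormalized.sum()           # Pr(o | b, a), the normalizing denominator
    return unnormalized / prob_o
```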
Solving POMDPs presents significant challenges, primarily due to the curse of dimensionality: the belief space is a continuous simplex whose dimension grows with the number of states, and the complexity of exact value-function representations can grow exponentially with the planning horizon, making exact computation intractable for even moderately sized problems.[25] Moreover, policies that map only the most recent observation to an action (memoryless policies) can generally be improved by randomization, whereas optimal behavior requires conditioning on the full belief state, over which deterministic policies suffice.[25]
Foundational work on POMDPs dates to Sondik (1971), who established that value functions in finite-horizon POMDPs can be represented as piecewise linear convex functions, enabling exact dynamic programming solutions via backups over these representations. Post-2000s developments have focused on scalable approximations, such as point-based value iteration (PBVI), which performs value function backups only at a finite set of strategically selected belief points, yielding anytime algorithms that improve solutions incrementally without enumerating the full belief space.[26] For large-scale POMDPs, methods like POMCP (2010) leverage Monte Carlo tree search for online planning, simulating trajectories from sampled beliefs to approximate optimal actions efficiently in high-dimensional or unfactored environments. Recent advances as of 2025 integrate deep reinforcement learning techniques, including recurrent natural policy gradients using neural networks for belief tracking and policy optimization, and deep belief Markov models for improved inference in complex POMDPs.[27][28]
Continuous-Time MDPs
A continuous-time Markov decision process (CTMDP) generalizes the discrete-time MDP framework to model decision-making problems where state transitions and rewards accumulate over continuous time, rather than at fixed intervals.[12] In a CTMDP, the system is defined by a state space S, action space A, transition rate functions \lambda(s, a, s') for s, s' \in S and a \in A(s), and reward rate functions r(s, a). The time spent in state s under action a follows an exponential distribution with rate \lambda(s, a) = \sum_{s' \neq s} \lambda(s, a, s'), after which the process jumps to s' with probability \lambda(s, a, s') / \lambda(s, a). Rewards are earned continuously at rate r(s, a) while in state s under action a.[12]
To solve CTMDPs, a common approach is uniformization, which transforms the model into an equivalent discrete-time Markov decision process (DTMDP) for analysis. Let \lambda_{\max} = \sup_{s, a} \lambda(s, a) be the maximum transition rate. The uniformized DTMDP has transition probabilities p(s' | s, a) = \lambda(s, a, s') / \lambda_{\max} for s' \neq s, and p(s | s, a) = 1 - \lambda(s, a) / \lambda_{\max} for self-loops representing fictitious jumps that do not change the state. The reward in the DTMDP is adjusted to r(s, a) / \lambda_{\max}, and the discount factor becomes \gamma' = e^{-\alpha / \lambda_{\max}}, where \alpha > 0 is the continuous-time discount rate. This equivalence preserves the optimal value function up to scaling, allowing standard DTMDP algorithms to be applied.[29]
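The transition-probability part of this construction is straightforward to implement. The sketch below converts an array of transition rates into uniformized transition probabilities, assuming rates[a][s][s'] holds \lambda(s, a, s') (names and shapes are illustrative); the reward and discount adjustments would follow the scaling described above.

```python
# Minimal sketch of uniformization: converting CTMDP transition *rates* into
# DTMDP transition *probabilities*, following the construction described above.
import numpy as np

def uniformize(rates):
    """rates[a, s, s'] = lambda(s, a, s') for s' != s (illustrative layout)."""
    n_actions, n_states, _ = rates.shape
    rates = rates.copy()
    rates[:, np.arange(n_states), np.arange(n_states)] = 0.0   # ignore self-rates
    exit_rates = rates.sum(axis=2)                 # lambda(s, a) = sum_{s' != s} lambda(s, a, s')
    lam_max = exit_rates.max()                     # uniformization constant
    p = rates / lam_max                            # p(s'|s,a) = lambda(s,a,s') / lam_max, s' != s
    p[:, np.arange(n_states), np.arange(n_states)] = 1.0 - exit_rates / lam_max  # fictitious self-loops
    return p, lam_max
```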
For finite discrete state spaces, the optimal value function V^* of a discounted CTMDP can be formulated as a linear program (LP) after uniformization. The primal LP minimizes \sum_s \mu(s) V(s) subject to V(s) \geq r(s, a)/\lambda_{\max} + \gamma' \sum_{s'} p(s' | s, a) V(s') for all s \in S, a \in A(s), where \mu is a positive weighting measure (e.g., the stationary distribution under some policy, normalized so \sum_s \mu(s) = 1). The dual LP involves occupation measures and provides insights into optimal policies. This formulation enables efficient computation via standard LP solvers for moderate-sized problems.[12]
In continuous state spaces, the optimal value function satisfies the Hamilton-Jacobi-Bellman (HJB) equation 0 = \max_{a \in A(s)} \left[ r(s, a) + \sum_{s'} \lambda(s, a, s') (V(s') - V(s)) - \alpha V(s) \right], where \alpha > 0 is the discount rate. For general cases with possible discontinuities, existence and uniqueness of solutions are established in the sense of viscosity solutions, which provide a weak formulation robust to irregularities in the rates or rewards.[30]
Recent advances in the 2020s have leveraged deep learning to numerically solve HJB equations for high-dimensional CTMDPs, addressing the curse of dimensionality in traditional methods. For instance, physics-informed neural networks (PINNs) approximate viscosity solutions by minimizing residuals of the HJB equation alongside boundary conditions, trained via automatic differentiation.[31] Another approach uses deep operator networks (DeepONet) in conjunction with methods like deep Q-learning to approximate solutions to HJB-related PDEs, enabling handling of nonlinear operators in continuous-time settings.[32] These methods have shown effectiveness in applications like stochastic control with up to hundreds of dimensions.
Connections to Reinforcement Learning
Model-Based Methods
Model-based reinforcement learning (MBRL) methods address Markov decision processes (MDPs) by explicitly estimating the environment's dynamics, including the transition probabilities P(s'|s, a) and reward function R(s, a, s'), from sampled experiences, and subsequently leveraging these models for planning via established MDP algorithms like value or policy iteration. This paradigm contrasts with direct policy optimization by enabling simulated trajectories within the learned model to augment real-world data, thereby enhancing decision-making in sequential tasks. Seminal work in this area emphasizes the integration of model learning with planning to improve efficiency in environments where data collection is costly.[33]
A foundational algorithm in MBRL is Dyna, introduced by Sutton in 1990, which interleaves real experience with planning updates using a learned model to perform additional backup steps, effectively combining model-free learning for immediate action with model-based planning for long-term foresight. To focus computational effort on high-impact states, prioritized sweeping extends this framework by maintaining a priority queue based on temporal-difference errors, selectively updating value estimates for states likely to yield the largest improvements, as developed by Moore and Atkeson in 1993. These methods demonstrate how learned models can accelerate convergence in tabular MDPs by simulating experiences that mimic real interactions.[34][35]
Model learning techniques vary by scale: for small, discrete MDPs, tabular representations directly estimate transition and reward parameters from frequency counts of observed tuples (s, a, s', r), while larger, high-dimensional spaces employ function approximation, such as neural networks, to predict dynamics and rewards. Advanced approaches, including expectation-maximization (EM)-like algorithms, handle latent structure in transitions by iteratively refining model parameters to maximize the likelihood of observed data. MBRL offers advantages in sample efficiency, particularly in simulated or low-data regimes, by generating vast numbers of synthetic rollouts for planning, while maintaining strong connections to classical dynamic programming solutions for MDPs.[10][33]
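In the tabular case, model learning amounts to bookkeeping over observed experience tuples. The following minimal sketch (with illustrative names) estimates transition probabilities from frequency counts and expected rewards from sample averages, as described above.

```python
# Minimal sketch of tabular model learning from experience tuples (s, a, s', r).
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))    # counts[(s, a)][s'] = observed visits
reward_sums = defaultdict(float)                  # running sum of rewards per (s, a)
totals = defaultdict(int)                         # total visits per (s, a)

def update_model(s, a, s_next, r):
    counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    totals[(s, a)] += 1

def estimated_P(s, a, s_next):
    return counts[(s, a)][s_next] / totals[(s, a)]   # empirical P(s' | s, a)

def estimated_R(s, a):
    return reward_sums[(s, a)] / totals[(s, a)]      # empirical expected reward
```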
Modern advancements, such as MuZero developed by DeepMind in 2019, extend MBRL to complex games like Atari, Go, chess, and shogi by learning latent models that predict future rewards, values, and policies without explicit rules, integrating tree-based search (e.g., Monte Carlo Tree Search) with the model for superhuman performance across domains. More recent developments, like the DreamerV3 algorithm (2022), further advance scalable MBRL using recurrent state-space models for world modeling in continuous control tasks, achieving strong performance in benchmarks as of 2025. This approach highlights MBRL's scalability through implicit model representations, achieving state-of-the-art results with fewer environmental interactions compared to model-free counterparts in benchmark tasks.[36][37]
Model-Free Methods
Model-free methods in reinforcement learning address Markov decision processes by directly estimating action-value functions or policies through interaction with the environment, without constructing an explicit model of the state transition probabilities or reward function. These techniques rely on trajectories of state-action-reward sequences to update estimates, making them suitable for environments where dynamics are unknown, stochastic, or computationally expensive to model. By avoiding model estimation, model-free approaches enable learning in high-dimensional or complex domains, though they often require more samples for convergence compared to model-based alternatives.[10]
A foundational class of model-free methods is temporal-difference (TD) learning, which bootstraps value estimates using partial returns observed from experience rather than full episodic returns. TD methods update estimates incrementally, balancing bias and variance to propagate information about future rewards backward through time.[10]
Q-learning exemplifies off-policy TD learning, estimating the optimal action-value function Q^*(s, a), which represents the maximum expected discounted return starting from state s, taking action a, and following the optimal policy thereafter. The update rule is
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],
where \alpha is the learning rate, r is the immediate reward, \gamma is the discount factor, s' is the next state, and the maximum over a' selects the best next action greedily. This off-policy nature allows Q-learning to learn the optimal policy regardless of the behavior policy generating the data, and it converges to Q^* with probability one under suitable conditions on the learning rates and exploration.[38] Q-learning was first proposed in Watkins' 1989 PhD thesis and rigorously analyzed in a 1992 paper co-authored with Dayan.[39][38]
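The update rule corresponds to a short tabular implementation. The sketch below pairs it with \epsilon-greedy action selection; the hyperparameter values, data structures, and helper names are illustrative assumptions.

```python
# Minimal sketch of tabular Q-learning with epsilon-greedy exploration.
import random
from collections import defaultdict

Q = defaultdict(float)                 # Q[(s, a)], initialized to 0
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1  # illustrative hyperparameters

def choose_action(s, actions):
    """Epsilon-greedy behavior policy."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(s, a, r, s_next, actions):
    """Off-policy update toward the greedy target max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```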
In contrast, SARSA is an on-policy TD method that learns the value of the policy being followed, updating the action-value function based on actions actually selected by the current policy. Its update replaces the maximum in Q-learning with the value of the next action a' sampled from the policy \pi:
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right].
This makes SARSA sensitive to the exploration strategy of the policy, as it evaluates and improves the same policy used for action selection, ensuring safer learning in environments where greedy actions might lead to risky states. SARSA, originally termed modified connectionist Q-learning, was introduced by Rummery and Niranjan in 1994.[40]
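For comparison, the corresponding SARSA update differs only in the target term, using the next action actually selected by the behavior policy instead of the greedy maximum; the function signature below is an illustrative assumption.

```python
# Minimal sketch of the on-policy SARSA update; Q maps (state, action) pairs to values.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # a_next is the action actually chosen by the current (e.g., epsilon-greedy) policy in s_next
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```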
Policy gradient methods offer an alternative model-free paradigm by directly optimizing parameterized policies \pi_\theta to maximize the expected return J(\theta), bypassing value function estimation. These stochastic gradient ascent techniques adjust policy parameters \theta based on the gradient of performance, enabling learning in continuous action spaces where value-based methods struggle. The REINFORCE algorithm provides a Monte Carlo estimate of this gradient:
\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta (a_t | s_t) \, G_t,
where G_t is the actual return following time step t, and the sum is over a trajectory. This approach treats returns as unbiased but high-variance estimates of value, with baselines often added to reduce variance. REINFORCE was developed by Williams in 1992.[41] Actor-critic variants enhance policy gradients by using a learned critic to estimate values and subtract them as baselines, combining the strengths of policy-based and value-based methods for more stable updates. A prominent example is Proximal Policy Optimization (PPO, 2017), which uses clipped objectives to prevent large policy updates, making it reliable and widely used in practical RL applications as of 2025.[10][42]
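As an illustration of the REINFORCE estimator above, the following minimal sketch computes the gradient for a tabular softmax policy \pi_\theta(a \mid s) \propto \exp(\theta[s, a]); the trajectory format, the use of discounted returns for G_t, and the variable names are assumptions for illustration.

```python
# Minimal sketch of the REINFORCE gradient estimate for a tabular softmax policy.
import numpy as np

def softmax_policy(theta, s):
    prefs = theta[s] - theta[s].max()      # stabilize the softmax
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_gradient(theta, trajectory, gamma=0.99):
    """trajectory: list of (s, a, r) tuples; returns a Monte Carlo estimate of grad J(theta)."""
    grad = np.zeros_like(theta)
    # Compute returns G_t backwards through the trajectory.
    G, returns = 0.0, []
    for (_, _, r) in reversed(trajectory):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # Accumulate grad log pi(a_t | s_t) * G_t.
    for (s, a, _), G_t in zip(trajectory, returns):
        p = softmax_policy(theta, s)
        glog = -p                          # d/dtheta[s] log pi(a|s) = e_a - pi(.|s)
        glog[a] += 1.0
        grad[s] += glog * G_t
    return grad
```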
For discrete MDPs with finite state and action spaces, tabular representations suffice for storing exact value functions in Q-learning or SARSA, as each state-action pair can maintain a dedicated entry. However, these methods scale poorly to large or continuous spaces due to the exponential growth in table size, leading to the sample inefficiency known as the curse of dimensionality.[10]
To overcome scaling limitations, deep reinforcement learning extends model-free methods with neural network function approximators. The Deep Q-Network (DQN) adapts Q-learning by representing Q(s, a; \theta) with a deep convolutional network, incorporating experience replay and target networks for stability; it achieved performance exceeding human experts on 49 Atari games using only pixel inputs. DQN was introduced by Mnih et al. in 2013.[43]
Theoretical Aspects
Computational Complexity
The computational complexity of solving Markov decision processes (MDPs) varies significantly depending on the model parameters, such as the state and action spaces, discount factor, and optimality criterion. For finite-state and finite-action discounted MDPs, exact solution methods like value iteration and policy iteration achieve polynomial-time complexity in the number of states |S| and actions |A|. Specifically, each iteration of value iteration requires O(|S|^2 |A|) time to update the value function across all state-action pairs via the Bellman operator, and the number of iterations T needed for convergence to an ε-optimal policy is O((1/(1-γ)) log(|S| R_max / ε)), where γ is the discount factor and R_max bounds the rewards, yielding overall time complexity O(|S|^2 |A| (1/(1-γ)) log(|S| / ε)). Policy iteration similarly converges in O(|S| |A|^2 (1/(1-γ)) log(|S| / ε)) time, as each policy evaluation step takes O(|S|^2 |A|) and improvement is O(|S| |A|). These bounds hold under the assumption that transition probabilities and rewards are provided explicitly, making the problem solvable in polynomial time relative to the input size when 1/(1-γ) is treated as a parameter.[44]
In contrast, undiscounted infinite-horizon MDPs, particularly under the average-reward criterion, exhibit greater hardness. When rewards are unrestricted in sign, finding an optimal policy is NP-complete, as the problem reduces to determining whether there exists a policy achieving average reward above a given threshold, which encodes known NP-hard problems like subset sum.[44] For nonnegative rewards, the problem remains P-complete, allowing polynomial-time solutions but with practical dependence on numerical precision due to the lack of contraction in the Bellman operator.[44] These results highlight that removing the discount factor eliminates the geometric convergence of iterative methods, often requiring linear programming formulations with O(|S| |A|) variables and constraints, solvable in polynomial time but sensitive to the reward structure.
For continuous-state or high-dimensional MDPs, exact solutions suffer from the curse of dimensionality, where complexity grows exponentially with the state dimension d, rendering standard methods intractable beyond low dimensions. Approximate solutions address this via Probably Approximately Correct (PAC) learning frameworks, which bound the number of samples needed to find an ε-optimal policy with probability at least 1-δ. In the generative model setting for discounted finite MDPs, PAC-MDP algorithms like R-MAX achieve sample complexity O(|S| |A| / ε^2 (1/(1-γ))^3 log(|S| |A| / (δ ε))), by exploring unknown transitions and rewards to estimate the model before applying value iteration. This bound captures the exploration overhead, scaling linearly in |S| and |A| but cubically in the effective horizon 1/(1-γ), emphasizing the need for model-based approaches in sample-efficient learning.[45]
Recent advances in scalable approximations mitigate these issues for structured MDPs, such as linear MDPs where transitions and rewards are linear in d-dimensional features. Algorithms like LSVI-UCB achieve polynomial sample complexity Õ(d^3 H^2 / ε^2) for horizon H and error ε in the episodic setting, providing the first provably efficient methods without strong convexity assumptions.[46] These guarantees extend to infinite-horizon discounted linear MDPs, enabling near-optimal regret Õ(√(d^3 / (1-γ)^5 T)) after T steps, facilitating practical reinforcement learning in high dimensions.[47]
Alternative Notations
In the literature on Markov decision processes (MDPs), the reward function is commonly denoted as R(s, a), representing the expected immediate reward for taking action a in state s, where the expectation is over the next state distribution P(s' | s, a).[48] An alternative notation specifies the reward as R(s, a, s'), which captures the immediate reward upon transitioning to the next state s'; this form is particularly useful in settings where rewards depend explicitly on the successor state, such as in some stochastic control applications.[49] In sample-based formulations, rewards are treated as realizations rather than expectations, often denoted similarly but sampled from a distribution r(s, a, s'), to emphasize empirical estimation in reinforcement learning algorithms.[50]
Discounting in MDPs typically involves a factor \gamma \in [0, 1) applied to future rewards in infinite-horizon settings, yielding the discounted total reward criterion \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]. An alternative is the average reward criterion, which evaluates policies based on the long-run average reward per step \lim_{T \to \infty} \frac{1}{T} \mathbb{E} \left[ \sum_{t=0}^{T-1} r_t \right], omitting \gamma and focusing on steady-state performance in recurrent MDPs; this is prevalent in applications like queueing systems where infinite horizons without discounting are natural.[51] For finite-horizon problems, the total undiscounted reward \mathbb{E} \left[ \sum_{t=0}^{H-1} r_t \right] is used, where H is the horizon length, avoiding discounting altogether to prioritize cumulative gains over a fixed period.[52]
Semi-Markov decision processes (SMDPs) extend standard MDPs by allowing variable-time steps, often notated with an effective horizon \tau(s, a) representing the expected duration of executing action a from state s; this framework underpins the options model in reinforcement learning, where options are temporally extended actions that induce an SMDP over the base MDP.[53] In the options framework, transitions and rewards are aggregated over \tau, with cumulative rewards scaled by duration to maintain consistency with average-reward criteria.[54]
Matrix formulations of MDPs represent transitions for each action a as a stochastic matrix P_a \in \mathbb{R}^{|S| \times |S|}, where (P_a)_{s,s'} = P(s' | s, a), and rewards as a vector R_a \in \mathbb{R}^{|S|} with (R_a)_s = R(s, a); value functions then satisfy linear equations like v = R_\pi + \gamma P_\pi v for policy \pi.[55] These forms enable linear programming (LP) formulations of the optimization, in which the primal LP minimizes c^\top v over value vectors v subject to v \geq R_a + \gamma P_a v for every action a, while the dual LP maximizes expected reward over state-action occupation measures; at optimality both recover the Bellman optimality equations.[56]
Domain-specific notations vary across fields; in reinforcement learning, trajectories are often denoted \tau = (s_0, a_0, r_0, s_1, \dots), emphasizing sequences of state-action-reward tuples for policy evaluation via Monte Carlo methods.[57] In control theory, states are frequently indexed as x_t at time t, with dynamics x_{t+1} = f(x_t, u_t) + w_t in stochastic settings, aligning MDPs with continuous-time or hybrid systems.[58]
Post-2010 reinforcement learning papers often clarify distinctions between discrete and continuous MDPs to resolve notation confusions; for instance, discrete MDPs use finite sets S and A with probabilistic transitions P, while continuous variants employ densities p(s' | s, a) over \mathbb{R}^n, adapting algorithms like Q-learning to function approximators without assuming discreteness.[59]