
Reinforcement learning

Reinforcement learning (RL) is a paradigm in machine learning where an agent learns optimal behavior through trial-and-error interactions with an environment, aiming to maximize the long-term cumulative reward received over a sequence of decisions. Unlike supervised learning, which relies on labeled examples of correct actions, or unsupervised learning, which identifies patterns without explicit feedback, RL operates in a goal-oriented framework defined by delayed rewards and sequential decision-making. This process is formally modeled using Markov decision processes (MDPs), where the agent's choices depend only on the current state of the environment.

At its core, RL involves several fundamental elements: an agent that selects actions; an environment that responds to those actions by transitioning to new states and providing scalar rewards; states representing the current situation; actions as possible interventions; and a policy that maps states to actions to guide behavior. The agent evaluates long-term outcomes through value functions, which estimate expected future rewards, and balances exploration (trying new actions to discover better strategies) against exploitation (using known rewarding actions). Rewards serve as the sole feedback signal, often sparse or delayed, compelling the agent to learn associations between actions and distant outcomes, a challenge addressed by methods like temporal-difference learning.

RL traces its roots to early 20th-century behavioral psychology, including Thorndike's law of effect (1911) and Skinner's operant conditioning, which emphasized learning via rewards and punishments. Formal advancements emerged in the mid-20th century with dynamic programming and MDPs, pioneered by Bellman in the 1950s, followed by computational integrations in the 1980s through temporal-difference methods by Sutton and Q-learning by Watkins. By the 1990s, applications like TD-Gammon demonstrated RL's potential in games, achieving near-expert play through self-play. Contemporary RL encompasses model-free approaches, such as Q-learning and SARSA for value estimation, and policy-based methods like REINFORCE for direct policy optimization, often combined in actor-critic architectures. The integration of deep neural networks since the 2010s has enabled deep reinforcement learning (deep RL), powering breakthroughs like AlphaGo's mastery of Go in 2016 and advancements in robotics and autonomous systems. Recent developments as of 2025 emphasize interpretability, hierarchical structures, and hybrid systems with large language models for enhanced reasoning capabilities, alongside machine-discovered algorithms and new benchmarks like IntersectionZoo; applications continue to expand to areas like healthcare, finance, and the computational sciences.

Fundamentals

Definition and Motivation

Reinforcement learning (RL) is a paradigm in machine learning where an agent learns to make sequential decisions by interacting with an environment, receiving rewards or penalties based on the outcomes of its actions, with the objective of maximizing cumulative long-term reward. This trial-and-error process allows the agent to discover optimal behaviors without prior knowledge of the environment's dynamics. The core setup involves an agent-environment loop: at each step, the agent observes the current state of the environment, selects an action, receives a reward and the next state, and updates its policy accordingly. The motivation for RL draws heavily from biological learning processes observed in animal behavior, where organisms adapt through rewards and punishments to achieve survival goals, as exemplified by Edward Thorndike's Law of Effect, which posits that actions leading to satisfaction are reinforced while those causing discomfort are diminished. Unlike supervised learning, which relies on labeled examples provided by a teacher to minimize prediction errors, or unsupervised learning, which identifies patterns in unlabeled data without guidance, RL emphasizes delayed rewards and autonomous exploration in dynamic settings. This distinction enables RL to address problems involving sequential decision-making and uncertainty, where immediate feedback is unavailable. A foundational conceptual example illustrating RL principles is the multi-armed bandit problem, a simplified precursor where an agent repeatedly chooses among multiple actions (arms) to maximize rewards from unknown probability distributions, balancing exploration of new options against exploitation of known good ones.
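As a concrete illustration of this trial-and-error loop, the following minimal Python sketch implements an ε-greedy agent for a three-armed Bernoulli bandit; the arm probabilities, exploration rate, and horizon are illustrative choices rather than values from any particular study.

```python
import random

# Minimal epsilon-greedy bandit sketch (illustrative; arm probabilities are made up).
ARM_PROBS = [0.2, 0.5, 0.75]           # hidden Bernoulli reward probabilities
counts = [0] * len(ARM_PROBS)          # pulls per arm
values = [0.0] * len(ARM_PROBS)        # running mean reward per arm
epsilon = 0.1                          # exploration rate

for t in range(10_000):
    if random.random() < epsilon:               # explore: random arm
        arm = random.randrange(len(ARM_PROBS))
    else:                                        # exploit: best estimate so far
        arm = max(range(len(ARM_PROBS)), key=lambda a: values[a])
    reward = 1.0 if random.random() < ARM_PROBS[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update

print(values)   # estimates should approach [0.2, 0.5, 0.75]
```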

Historical Development

The roots of reinforcement learning (RL) trace back to the 1950s in optimal control theory, where Richard Bellman introduced dynamic programming as a method for solving sequential decision-making problems under uncertainty. Bellman's work formalized the use of value functions to evaluate future rewards, laying the groundwork for later RL frameworks, and he also defined Markov decision processes (MDPs) as a stochastic framework for sequential decision problems. These concepts, developed amid early computing limitations, emphasized recursive computation of optimal policies via the Bellman equation, influencing fields like operations research. In the 1980s and 1990s, RL emerged as a distinct subfield of machine learning, with Richard S. Sutton and Andrew G. Barto playing pivotal roles in its theoretical and algorithmic foundations. Sutton's 1988 introduction of temporal-difference (TD) learning provided an efficient, model-free way to update value estimates incrementally based on differences between successive predictions, bridging trial-and-error learning with dynamic programming. Their collaborative textbook, Reinforcement Learning: An Introduction, first published in 1998 and updated in 2018, synthesized these ideas, popularizing RL as a paradigm for agents learning from delayed rewards without explicit supervision. This era also saw early applications in robotics and games, establishing RL's core elements like policies and exploration strategies. The 2010s marked a surge in RL's practical impact through deep learning integrations, led by DeepMind researchers including Volodymyr Mnih and David Silver. In 2013–2015, Deep Q-Networks (DQN) achieved human-level performance on Atari games by combining convolutional neural networks with Q-learning, directly processing pixel inputs to learn control policies across 49 tasks and surpassing prior benchmarks by wide margins. The 2016 AlphaGo system, using deep neural networks for policy and value estimation alongside Monte Carlo tree search, defeated world champion Lee Sedol in Go, a game with immense complexity (approximately 10^170 possible positions), demonstrating RL's scalability to strategic reasoning. Entering the 2020s, RL advanced in large-scale applications, particularly through reinforcement learning from human feedback (RLHF), which aligns models with human preferences via reward modeling. OpenAI's 2022 InstructGPT applied RLHF to fine-tune language models like GPT-3, improving instruction-following and reducing hallucinations compared to supervised baselines, as evidenced by human evaluations where InstructGPT outputs were preferred over GPT-3 outputs approximately 85% of the time. In robotics, continual RL frameworks addressed lifelong-learning challenges, with a 2025 study introducing a Bayesian-inspired framework that preserved performance across sequential tasks on real robots, mitigating catastrophic forgetting in dynamic environments. A landmark 2025 milestone came from DeepMind's automated discovery of RL algorithms via machine search, yielding rules that outperformed existing methods on the Atari benchmark and other challenging environments, suggesting a shift toward AI-driven algorithm design.

Markov Decision Process Framework

Reinforcement learning problems are formally modeled using Markov decision processes (MDPs), which provide a mathematical framework for sequential decision-making under uncertainty. An MDP is defined as a tuple (S, A, P, R, \gamma), where S is the set of possible states representing the environment's configuration, A is the set of actions available to the agent, P is the transition probability function P_a(s, s') = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) specifying the dynamics of state transitions given an action, R is the reward function R: S \times A \to \mathbb{R} that assigns immediate rewards to state-action pairs, and \gamma \in [0, 1) is the discount factor that weights future rewards relative to immediate ones. The core assumption underlying MDPs is the Markov property, which states that the next state and reward depend only on the current state and action, not on the history of prior states or actions; formally, \Pr(S_{t+1} = s', R_{t+1} = r \mid S_{1:t}, A_{1:t}) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t, A_t). This property simplifies the representation of decision problems by ensuring that all relevant information is captured in the current state. MDPs can describe two main types of tasks: episodic tasks, which consist of distinct episodes or trials starting from an initial state and terminating upon reaching a terminal state, and continuing tasks, which have no terminal states and run indefinitely. These tasks may also differ in horizon: finite-horizon MDPs have a fixed number of time steps per episode, while infinite-horizon MDPs assume ongoing interactions, often relying on discounting to ensure convergence of value measures. In an MDP, the return G_t measures the total reward accumulated from time step t onward, capturing the agent's cumulative success, and satisfies the recursive relation G_t = R_{t+1} + \gamma G_{t+1}. For infinite-horizon discounted tasks, the return is given by the infinite sum G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}, which converges because 0 \leq \gamma < 1, prioritizing nearer-term rewards while still accounting for long-term consequences. Many real-world scenarios violate the Markov assumption due to partial observability, where the agent receives only noisy or incomplete information about the state. In such cases, the problem is modeled as a partially observable Markov decision process (POMDP), which extends the MDP framework by incorporating a belief state, a probability distribution over possible true states updated via Bayes' rule based on observations and actions. POMDPs maintain the Markov property over the belief state, enabling optimal decision-making despite uncertainty in state estimation. The objective in an MDP or POMDP is to find a policy (a mapping from states or beliefs to actions) that maximizes the expected discounted return.
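To make the return definitions concrete, the short sketch below computes a discounted return both as the direct sum \sum_k \gamma^k R_{t+k+1} and via the recursion G_t = R_{t+1} + \gamma G_{t+1}; the reward sequence and discount factor are arbitrary illustrative values.

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1} for a sample reward sequence.
# Purely illustrative; the reward values below are arbitrary.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # R_{t+1}, R_{t+2}, ...

# Direct sum
G_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Equivalent backward recursion G_t = R_{t+1} + gamma * G_{t+1}
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G

assert abs(G - G_direct) < 1e-12
print(G)   # 4.0905 for these rewards
```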

Key Concepts

Components of RL Systems

In reinforcement learning (RL), the core components form the foundational structure for how an agent interacts with its surroundings to learn optimal behavior. These elements define the interaction loop and are essential for modeling decision-making processes in dynamic environments. The agent is the central decision-making entity that observes the current state of the environment and selects actions to maximize long-term rewards. It operates as a learner or controller, adapting its behavior based on feedback from interactions. The environment encompasses everything external to the agent, including the physical world, simulated dynamics, or any system that responds to the agent's actions by providing new states and rewards. This separation allows the agent to treat the environment as a black box, focusing solely on its inputs and outputs. States represent the agent's perception of the environment at a given time, capturing relevant information for decision-making; they can be discrete, such as positions on a grid, or continuous, such as joint angles in a physical system. Actions are the possible choices available to the agent, which influence the environment's transition to the next state; like states, actions may be discrete (e.g., moving up, down, left, or right) or continuous (e.g., applying varying torque to motors). Rewards serve as the scalar feedback signal from the environment, indicating the immediate desirability of an action in a given state; these can be positive for progress toward goals, negative for errors, and may accumulate over delayed outcomes to guide long-term planning. A trajectory refers to a sequence of states, actions, and rewards generated by the agent's interactions with the environment over time, providing the data from which learning occurs. In episodic tasks, these interactions form episodes, which are finite trajectories starting from an initial state and ending at a terminal state, such as completing a game level or reaching a goal. For illustration, consider a discrete-state example like a gridworld, where the agent navigates a 2D grid to reach a goal while avoiding obstacles, with states as grid positions, actions as directional moves, and rewards as +1 for success or -1 for pitfalls. In contrast, a continuous-state and action space arises in robotic arm control, where the agent manipulates an arm to grasp objects; states include joint positions and velocities (real-valued vectors), actions specify continuous torque or velocity commands, and rewards reflect task success like precise positioning.
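The following sketch shows one way the agent-environment loop for the gridworld example above might look in code; the 4x4 layout, reward values, and random policy are assumptions made for illustration.

```python
import random

# Minimal gridworld environment sketch (assumed 4x4 layout, goal at (3, 3)).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Environment side of the loop: return (next_state, reward, done)."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = max(0, min(3, r + dr)), max(0, min(3, c + dc))   # stay on the grid
    if (nr, nc) == (3, 3):
        return (nr, nc), 1.0, True      # reward +1 at the goal, episode ends
    return (nr, nc), -0.04, False       # small step cost elsewhere

# Agent side of the loop: a random policy generating one episode (trajectory).
state, done, trajectory = (0, 0), False, []
while not done:
    action = random.choice(list(ACTIONS))            # policy: random, for illustration
    next_state, reward, done = step(state, action)
    trajectory.append((state, action, reward))       # data from which learning occurs
    state = next_state
```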

Policies and Value Functions

In reinforcement learning, a policy serves as the agent's decision-making strategy, defining a mapping from states to actions that dictates behavior in the environment. Formally, a policy \pi is a function that specifies the probability of selecting each action in a given state, \pi(a|s), where 0 \leq \pi(a|s) \leq 1 and \sum_a \pi(a|s) = 1 for all states s. Policies can be deterministic, assigning probability 1 to a single action per state, or stochastic, allowing randomization over multiple actions to handle uncertainty or promote exploration. Value functions provide a measure of the long-term desirability of states or state-action pairs under a given policy, quantifying expected future rewards discounted over time. The state-value function V^\pi(s) for policy \pi and state s is defined as the expected return starting from s and then following \pi thereafter: V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t = s \right], where \mathbb{E}_\pi denotes the expectation under \pi, R_{t+k+1} is the reward at step t+k+1, and \gamma \in [0,1) is the discount factor. Similarly, the action-value function Q^\pi(s,a) gives the expected return starting from state s, taking action a, and then following \pi: Q^\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t = s, A_t = a \right]. These functions enable the evaluation of policies by assessing their performance across the state space. The relationship between policies and value functions is captured by the Bellman expectation equations, which express the value of a state or state-action pair recursively in terms of immediate rewards and future values. For the state-value function: V^\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) \left[ r + \gamma V^\pi(s') \right], where p(s',r|s,a) is the probability of transitioning to state s' and receiving reward r given state s and action a. The corresponding equation for the action-value function is: Q^\pi(s,a) = \sum_{s',r} p(s',r|s,a) \left[ r + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a') \right]. These equations, derived from the law of total expectation, form the foundation for policy evaluation in dynamic programming. An optimal policy \pi^* maximizes the expected return for every state, outperforming all other policies in terms of value functions. The optimal state-value function V^*(s) and action-value function Q^*(s,a) are defined as the maxima over all possible policies: V^*(s) = \max_\pi V^\pi(s) and Q^*(s,a) = \max_\pi Q^\pi(s,a). The optimal policy can be obtained deterministically as \pi^*(s) = \arg\max_a Q^*(s,a), and it satisfies the Bellman optimality equations, such as V^*(s) = \max_a \sum_{s',r} p(s',r|s,a) [r + \gamma V^*(s')]. Policy improvement builds toward optimality by greedily selecting actions that maximize the current estimate of the action-value function. Starting from an arbitrary policy \pi, the greedy policy \pi'(s) = \arg\max_a Q^\pi(s,a) yields a non-decreasing value function, V^{\pi'}(s) \geq V^\pi(s) for all s, with strict improvement unless \pi is already optimal. This principle underpins iterative methods for finding \pi^*.
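A minimal sketch of iterative policy evaluation on a made-up two-state MDP, applying the Bellman expectation equation above as a repeated backup; the transition probabilities, rewards, and policy are invented for illustration.

```python
# Iterative policy evaluation sketch for a tiny, made-up two-state MDP.
# P[s][a] is a list of (prob, next_state, reward) triples; pi[s][a] is pi(a|s).
gamma = 0.9
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
pi = {0: {"stay": 0.5, "go": 0.5}, 1: {"stay": 0.9, "go": 0.1}}

V = {s: 0.0 for s in P}
for _ in range(200):   # repeated Bellman expectation backups
    V = {
        s: sum(
            pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }
print(V)   # approximate V^pi for each state
```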

Exploration and Exploitation

In reinforcement learning, agents face the exploration-exploitation trade-off, where they must balance gathering new information about the environment through exploratory actions against maximizing immediate rewards via known actions. This dilemma is central to sequential decision-making under uncertainty, as excessive exploration may delay reward accumulation, while over-reliance on exploitation can trap the agent in suboptimal behaviors. The performance of exploration strategies is commonly evaluated using regret, defined as the difference between the total reward obtained by the agent and the maximum achievable reward from an optimal policy over time, providing a metric for the cost of suboptimal actions. One simple and widely used strategy to address this trade-off is the \epsilon-greedy policy, in which the agent selects the action with the highest estimated value Q(s, a) with probability 1 - \epsilon, and otherwise chooses a random action uniformly from the available options. The parameter \epsilon controls the exploration rate; a fixed small value like \epsilon = 0.1 promotes occasional exploration, but to adapt to increasing knowledge over time, \epsilon can decay, such as \epsilon_t = 1/t where t is the current timestep, gradually shifting toward exploitation as the agent learns. This method is computationally efficient and effective in practice for tabular settings, and it can achieve logarithmic regret in multi-armed bandits under appropriate decay schedules. A more principled deterministic strategy is the Upper Confidence Bound (UCB) method, which selects actions by balancing estimated value and uncertainty. Specifically, the action a is chosen as a = \arg\max_a \left[ Q(s, a) + c \sqrt{\frac{\ln t}{N(s, a)}} \right], where Q(s, a) is the estimated action-value, N(s, a) is the number of times action a has been selected in state s, t is the total number of steps, and c > 0 is a constant tuning the width of the confidence bonus (often set around c = 2). UCB achieves near-optimal regret bounds of O(\sqrt{K T \ln T}) for K-armed bandits over T steps, making it asymptotically efficient by favoring underexplored actions with wide confidence intervals. Probabilistic alternatives like Thompson sampling offer a Bayesian approach to exploration by sampling actions from a posterior distribution over possible models or value estimates. Under assumptions of conjugate priors (e.g., Beta priors for Bernoulli rewards), the agent draws a sample from the posterior for each action's value and selects the one with the highest sampled value, naturally balancing optimism toward uncertain actions. This method empirically matches or outperforms UCB in many bandit and RL settings, with regret bounds approaching the lower bound of O(\sqrt{K T \ln K}), and it extends naturally to posterior sampling for reinforcement learning in Markov decision processes. In policy-based methods, intrinsic exploration can be encouraged through entropy regularization, which adds an entropy term to the objective to penalize deterministic policies and promote stochasticity. The modified objective augments the expected return with a temperature-scaled entropy bonus \alpha \mathcal{H}(\pi(\cdot|s)), where \pi is the policy and \alpha controls the strength; this maximizes both reward and policy randomness, leading to robust policies in continuous or high-dimensional spaces. Algorithms like Soft Actor-Critic implement this via off-policy updates, achieving improved sample efficiency and stability in deep RL tasks.
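The UCB rule described above can be sketched for a simple bandit setting as follows; the arm probabilities, exploration coefficient, and horizon are illustrative assumptions.

```python
import math
import random

# UCB1-style action selection sketch for a K-armed bandit (arm probabilities assumed).
ARM_PROBS = [0.3, 0.55, 0.6]
counts = [0] * len(ARM_PROBS)
values = [0.0] * len(ARM_PROBS)
c = 2.0                               # exploration coefficient

for t in range(1, 5001):
    untried = [a for a in range(len(ARM_PROBS)) if counts[a] == 0]
    if untried:
        arm = untried[0]              # pull each arm once before using the UCB formula
    else:
        arm = max(
            range(len(ARM_PROBS)),
            key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]),
        )
    reward = 1.0 if random.random() < ARM_PROBS[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print(counts)   # most pulls should concentrate on the best arm
```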
More advanced intrinsic motivation techniques include curiosity-driven exploration, where the agent receives an additional reward based on the prediction error of a learned model. For instance, the intrinsic reward can be the squared error between predicted and actual next-state features from a self-supervised forward model, incentivizing visits to novel or unpredictable states. This family of approaches, including random network distillation for novelty estimation, enables effective exploration in sparse-reward environments like maze navigation, reducing the number of interactions needed to reach goals by orders of magnitude compared to extrinsic rewards alone.

Learning Algorithms

Model-Free Value-Based Methods

Model-free value-based methods in reinforcement learning focus on estimating action-value functions, denoted as Q(s, a), which represent the expected return starting from state s, taking action a, and following an optimal policy thereafter, without explicitly modeling the environment's dynamics or reward structure. These methods learn directly from sampled experiences of the form (s, a, r, s'), where r is the immediate reward and s' is the next state, enabling the agent to derive policies implicitly by selecting actions that maximize the estimated values. This paradigm contrasts with model-based approaches by avoiding the computational overhead of model construction, making it suitable for complex, high-dimensional environments where accurate models are difficult to obtain. The foundational algorithm in this category is Q-learning, an off-policy method that updates the action-value function toward the optimal Q^* using temporal-difference learning. The update rule is given by: Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right], where \alpha is the learning rate and \gamma is the discount factor. Under appropriate conditions, such as all state-action pairs being visited infinitely often and \alpha satisfying the Robbins-Monro conditions, Q-learning converges to the optimal action-value function with probability 1. Introduced in Watkins' 1989 doctoral thesis and formalized with a convergence proof in a 1992 publication, Q-learning has become a cornerstone for off-policy value estimation due to its simplicity and model-free nature. SARSA serves as the canonical on-policy counterpart to Q-learning, updating the action-value function based on the value of the action actually selected by the current policy, rather than the maximum over all actions. Its update rule is: Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right], where a' is the next action sampled from the policy \pi. This on-policy evaluation ensures that the learned Q-values align with the behavior used for exploration, promoting stability in environments where off-policy updates might lead to divergence with function approximation. SARSA was first described in Rummery and Niranjan's 1994 technical report on connectionist implementations of Q-learning, and under standard tabular assumptions it converges to the optimal action-value function provided the exploration policy becomes greedy in the limit. To mitigate the variance of SARSA while retaining its on-policy benefits, Expected SARSA modifies the update by averaging the next-action value over the policy's distribution instead of using a single sample: Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \mathbb{E}_{a' \sim \pi} Q(s', a') - Q(s, a) \right]. This expectation reduces the stochasticity of updates, leading to faster convergence in practice, as demonstrated empirically on tasks like the mountain car problem, where Expected SARSA outperformed SARSA by achieving higher average returns with fewer episodes. Theoretical analysis shows that Expected SARSA converges to the true policy value and benefits from lower update variance compared to SARSA, though it requires computing an expectation over actions at each step. The algorithm was introduced and analyzed in van Seijen et al.'s 2009 paper. A key limitation of standard Q-learning is the overestimation bias arising from the maximization operator \max_{a'} Q(s', a'), which amplifies noisy high-value estimates and can lead to suboptimal policies in stochastic environments.
Double Q-learning addresses this by maintaining two independent action-value functions, Q_A and Q_B, and alternating updates between them: for an experience (s, a, r, s'), the update for Q_A selects the maximizing next action according to Q_A but evaluates it using Q_B, and vice versa. This decouples action selection from evaluation, reducing bias while preserving off-policy learning. Empirical evaluations on tasks like the noisy 4x3 gridworld showed Double Q-learning achieving near-optimal performance where standard Q-learning failed due to overestimation. The method was proposed by van Hasselt in 2010 at NeurIPS. For improved credit assignment over longer horizons, n-step Q-learning extends the one-step update by bootstrapping from n future rewards and states, balancing bias and variance between one-step and Monte Carlo methods. The n-step return is G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n \max_{a'} Q(s_{t+n}, a'), and the update becomes Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [G_{t:t+n} - Q(s_t, a_t)]. Larger n reduces bias but increases variance, and eligibility traces generalize this further by mixing returns across all n. This approach, building on n-step temporal-difference methods, enhances sample efficiency in delayed-reward settings, as shown in simulations where n = 3 to 5 yielded better convergence than one-step Q-learning on chain-like MDPs. The formulation is detailed in Sutton and Barto's comprehensive treatment of n-step methods applied to Q-learning.
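To ground the tabular updates discussed in this subsection, here is a minimal Q-learning sketch on a made-up five-state chain; the chain dynamics, rewards, and hyperparameters are illustrative rather than drawn from any benchmark.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch on a made-up 5-state chain; reaching state 4 pays +1.
ACTIONS = [0, 1]                       # 0 = move left, 1 = move right
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = defaultdict(float)                 # Q[(state, action)], defaults to 0.0

def step(s, a):
    s2 = min(4, s + 1) if a == 1 else max(0, s - 1)
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

def greedy_action(s):
    qs = [Q[(s, a)] for a in ACTIONS]
    best = max(qs)
    return random.choice([a for a, q in zip(ACTIONS, qs) if q == best])  # break ties randomly

for episode in range(1000):
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS) if random.random() < epsilon else greedy_action(s)
        s2, r, done = step(s, a)
        # Off-policy TD target: bootstrap from the max over next actions.
        target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

print([round(Q[(s, 1)], 2) for s in range(5)])   # "move right" values should increase with s
```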

Model-Free Policy-Based Methods

Model-free policy-based methods, also known as policy gradient methods, directly parameterize the policy \pi_\theta(a|s) and optimize it by performing gradient ascent on the expected cumulative reward J(\theta) = \mathbb{E}\left[\sum_t r_t\right], where \theta are the policy parameters. These approaches are particularly suited for environments with continuous action spaces, as they avoid the need to select actions by maximizing over a discrete set of value estimates. The REINFORCE algorithm provides a foundational Monte Carlo estimate of the policy gradient, given by \nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) G_t, where G_t is the return from time t. This estimate is derived from the policy gradient theorem and uses complete episode trajectories to compute unbiased but high-variance updates. To reduce variance without introducing bias, a baseline b(s_t) can be subtracted, yielding \nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) (G_t - b(s_t)), where the baseline is typically a state-value function estimate. Actor-critic methods extend policy gradients by incorporating a separate critic component to estimate value functions, which informs the actor's policy updates. The actor maintains the parameterized policy \pi_\theta, while the critic approximates either the state-value function V(s) or action-value function Q(s,a), enabling lower-variance gradient estimates compared to pure REINFORCE. The Asynchronous Advantage Actor-Critic (A3C) algorithm refines this architecture by using multiple parallel actors that asynchronously update a shared policy and value function, with the advantage function A_t = Q(s_t, a_t) - V(s_t) used to further reduce variance in policy gradients. A synchronous variant, known as Advantage Actor-Critic (A2C), collects experiences from multiple environments in parallel but updates synchronously, offering similar benefits with simpler implementation. These methods have demonstrated strong sample efficiency on Atari benchmarks, reaching human-level performance with fewer interactions than prior approaches. Proximal Policy Optimization (PPO), introduced in 2017, addresses instability in policy updates by employing a clipped surrogate objective that constrains the new policy from deviating too far from the old one, formulated as \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right], where r_t(\theta) is the probability ratio between the new and old policies and \hat{A}_t is an advantage estimate. This clipping mechanism promotes stable, near-monotonic improvement during training, making PPO widely adopted for its simplicity and robustness across robotic control and game-playing tasks.
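A minimal sketch of the REINFORCE update in a stateless (bandit-style) setting with a softmax policy, showing the core \nabla_\theta \log \pi_\theta(a) G term; the reward means, learning rate, and episode count are illustrative assumptions, and no baseline is used.

```python
import numpy as np

# REINFORCE sketch with a softmax policy over two actions and no state
# (a bandit-style setting with made-up rewards), showing the update
# theta += lr * grad(log pi(a)) * return.
rng = np.random.default_rng(0)
theta = np.zeros(2)                   # policy parameters, one logit per action
true_means = np.array([0.0, 1.0])     # assumed expected rewards per action
lr = 0.05

def policy(theta):
    z = np.exp(theta - theta.max())   # numerically stable softmax
    return z / z.sum()

for episode in range(2000):
    probs = policy(theta)
    a = rng.choice(2, p=probs)
    G = true_means[a] + rng.normal(scale=0.5)   # sampled return for this "episode"
    grad_log_pi = np.eye(2)[a] - probs          # gradient of log softmax w.r.t. theta
    theta += lr * grad_log_pi * G               # ascend the policy gradient estimate

print(policy(theta))   # probability mass should shift toward the better action
```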

Temporal Difference and Monte Carlo Methods

Monte Carlo methods provide a model-free approach to estimating value functions by averaging complete sample returns obtained from full episodes of interaction with the environment. In these methods, the value of a state V(s) is updated as the average of the returns G_t observed from that state across multiple visits, ensuring unbiased estimates since the returns are direct samples of the true expected return. However, Monte Carlo methods suffer from high variance due to the inherent randomness in long trajectories, and they require complete episodes, making them suitable primarily for episodic tasks rather than continuing ones. Two variants of Monte Carlo estimation differ in how returns are averaged for state visits within an episode: first-visit Monte Carlo, which uses only the return following the first occurrence of a state in the episode to update its value, and every-visit Monte Carlo, which incorporates returns from every occurrence of the state, potentially leading to faster convergence but with correlated samples that can introduce slight bias in finite samples. Both variants are incremental, updating estimates after each episode based on the averages so far, and they can be applied on-policy (using trajectories from the current policy) or off-policy with importance sampling to correct for behavior discrepancies. Temporal difference methods, in contrast, learn value estimates by bootstrapping from current approximations rather than waiting for complete returns, enabling updates at each time step using the observed reward and the estimated value of the next state. The simplest form, TD(0), computes the temporal difference error \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t), where \gamma is the discount factor, and updates the value function incrementally as V(s_t) \leftarrow V(s_t) + \alpha \delta_t, with learning rate \alpha; this approach reduces variance compared to Monte Carlo estimation at the cost of potential bias from imperfect bootstrapping targets. TD methods are particularly advantageous for online and continuing tasks, as they do not require full episodes and often converge faster in practice for Markov decision processes under suitable conditions. To bridge the gap between one-step updates and full returns, TD(\lambda) incorporates eligibility traces that enable multi-step lookahead, blending n-step returns for varying n with weights decaying geometrically. The eligibility trace for a state is updated as e_t(s) = \gamma \lambda e_{t-1}(s) + 1 if s is visited at time t (or more generally e_t = \gamma \lambda e_{t-1} + \nabla V(s_t) in function approximation settings), and the value update credits past states proportionally to their trace values using the TD error, effectively approximating the full return while maintaining low variance. As \lambda approaches 0, TD(\lambda) reduces to TD(0); as \lambda approaches 1, it approaches Monte Carlo estimation, providing a spectrum of methods tunable to the bias-variance trade-off. The core trade-off between Monte Carlo and temporal difference methods lies in their bias-variance characteristics: Monte Carlo offers unbiased estimates with high variance, ideal for offline episodic settings where complete returns are available, whereas TD methods introduce bias through bootstrapping but achieve lower variance and enable real-time, incremental learning suitable for longer or continuing tasks. Monte Carlo is simpler to implement without reliance on bootstrapped value estimates but performs poorly in environments with long horizons due to variance accumulation, while TD's efficiency stems from its ability to propagate errors backward through the state sequence, though it requires careful tuning of \alpha and \lambda for stability.
These estimation techniques form foundational building blocks for value-based reinforcement learning and extend naturally to action-value functions.
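As a concrete instance of these estimation techniques, the sketch below runs TD(0) prediction on a five-state random walk whose true state values are known analytically; the chain, step size, and episode count are illustrative choices.

```python
import random
from collections import defaultdict

# TD(0) prediction sketch on a 5-state random walk: terminal reward +1 on the right,
# 0 on the left; values are learned from bootstrapped one-step targets.
alpha, gamma = 0.1, 1.0
V = defaultdict(float)               # V[state], nonterminal states 1..5

for episode in range(5000):
    s = 3                             # start in the middle
    while True:
        s2 = s + random.choice([-1, 1])
        if s2 == 6:                   # right terminal state
            r, v_next, done = 1.0, 0.0, True
        elif s2 == 0:                 # left terminal state
            r, v_next, done = 0.0, 0.0, True
        else:
            r, v_next, done = 0.0, V[s2], False
        V[s] += alpha * (r + gamma * v_next - V[s])   # TD(0) update
        if done:
            break
        s = s2

print([round(V[s], 2) for s in range(1, 6)])   # approx [1/6, 2/6, 3/6, 4/6, 5/6]
```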

Function Approximation Techniques

In traditional tabular reinforcement learning methods, value functions and policies are represented explicitly for each state or state-action pair, which becomes infeasible in high-dimensional or continuous state spaces due to the curse of dimensionality. This exponential growth in the number of states requires an impractically large amount of memory and data to achieve accurate estimates, limiting scalability to complex environments. To address these limitations, function approximation parameterizes the value or policy functions using a smaller set of parameters, enabling generalization across unseen states. Linear function approximation represents the value function as V(s) \approx \theta^\top \phi(s), where \theta is a parameter vector and \phi(s) are fixed feature vectors derived from the state s. Parameters are typically updated via stochastic gradient descent to minimize the mean squared temporal-difference (TD) error, providing an efficient way to estimate values in large spaces. Gradient-based TD methods extend this approach by directly minimizing the squared TD error with stochastic gradient descent, but taking the full gradient through the bootstrapped target can lead to instability in off-policy settings. Semi-gradient methods address this by omitting the gradient of the target in the update, yielding the update rule \theta \leftarrow \theta + \alpha \delta_t \nabla_\theta V(s_t; \theta), where \delta_t is the TD error and \alpha is the learning rate; this approximation enhances stability while converging to a local optimum under certain conditions. Neural networks serve as nonlinear function approximators for value functions and policies, allowing RL to handle high-dimensional inputs like images through deep architectures. To mitigate correlations in sequential experiences and provide approximately independent and identically distributed (i.i.d.) samples for stable training, experience replay stores transitions in a buffer and samples them randomly during updates, as introduced in early neural RL work. A key challenge in neural TD learning is the moving-target problem, where the target values change with each parameter update, causing instability; target networks mitigate this by using a periodically updated copy of the main network to compute fixed targets. Distributional reinforcement learning advances function approximation by modeling the full distribution of returns rather than their expectation, capturing uncertainty in value estimates. The C51 algorithm parameterizes return distributions using categorical distributions over a fixed support and updates them via distributional Bellman operators, leading to improved performance on benchmarks like Atari.
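A minimal sketch of semi-gradient TD(0) with linear features on an invented chain environment; the random feature vectors, dynamics, and hyperparameters are assumptions for illustration only.

```python
import numpy as np

# Semi-gradient TD(0) sketch with linear features V(s) ~= theta^T phi(s).
# The environment is a made-up 10-state chain with a +1 reward at the right end.
rng = np.random.default_rng(0)
n_states, n_features = 10, 4
phi = rng.normal(size=(n_states, n_features))   # fixed random features per state
theta = np.zeros(n_features)
alpha, gamma = 0.01, 0.95

for episode in range(2000):
    s = 0
    while True:
        s2 = min(n_states - 1, s + rng.integers(0, 2))   # stay or drift right
        done = s2 == n_states - 1
        r = 1.0 if done else 0.0
        v_next = 0.0 if done else phi[s2] @ theta
        td_error = r + gamma * v_next - phi[s] @ theta
        # Semi-gradient update: gradient taken only through V(s), not the target.
        theta += alpha * td_error * phi[s]
        if done:
            break
        s = s2

print(phi @ theta)   # approximate state values; should rise toward the rewarded end
```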

Advanced Algorithms

Model-Based Methods

Model-based methods in reinforcement learning construct an explicit model of the environment, capturing its transition dynamics P(s', r \mid s, a) and reward function R(s, a), to enable planning and simulation for improved sample efficiency compared to direct interaction-based learning. This approach allows agents to generate synthetic experiences within the model, facilitating more informed updates without additional real-world interaction. By estimating these model components from observed interactions, algorithms can perform lookahead computations or iterative optimizations to approximate optimal policies. Model learning in these methods can adopt parametric forms, such as neural networks that parameterize the dynamics as a function approximating next states and rewards from current observations and actions. Alternatively, non-parametric techniques, like k-nearest neighbors, store past transitions and retrieve similar experiences to predict outcomes, avoiding strong distributional assumptions at the cost of increased memory requirements. These representations often leverage deep learning to scale to continuous or high-dimensional spaces, enabling generalization beyond observed data. Planning within the learned model typically involves dynamic programming techniques, such as value iteration, which computes improved value estimates through repeated backups: V_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a) \left[ r + \gamma V_k(s') \right]. This process, rooted in the Bellman optimality equation, converges to the optimal value function under the model, from which policies can be derived greedily. The Dyna architecture exemplifies the integration of learning and planning by interleaving real environmental steps with multiple simulated updates on the model, accelerating convergence while maintaining reactivity to the true environment. Recent advancements emphasize practical efficiency in complex domains. Model-Based Policy Optimization (MBPO, 2019) generates short-horizon trajectories from an ensemble of learned neural models, using them to augment training data for a model-free optimizer like soft actor-critic, achieving up to 10-fold sample-efficiency gains on continuous control tasks while bounding the error introduced by model inaccuracies. World models (2018) compress environments into latent spaces via variational autoencoders and recurrent networks, allowing agents to evolve policies through imagined rollouts in this compact representation, as demonstrated in car-racing and VizDoom tasks where the agent learns solely from latent simulations after initial training.
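The planning step described above can be sketched as value iteration over a small, invented model; the transition table, rewards, and tolerance are illustrative, and a greedy policy is extracted from the converged values.

```python
# Value iteration sketch on a small, made-up MDP model.
# P[s][a] = list of (prob, next_state, reward); converges via repeated Bellman backups.
gamma, tol = 0.9, 1e-8
P = {
    0: {"a": [(1.0, 1, 0.0)], "b": [(1.0, 0, 0.1)]},
    1: {"a": [(0.7, 2, 0.0), (0.3, 0, 0.0)], "b": [(1.0, 1, 0.2)]},
    2: {"a": [(1.0, 2, 1.0)], "b": [(1.0, 0, 0.0)]},
}

V = {s: 0.0 for s in P}
while True:
    V_new = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }
    if max(abs(V_new[s] - V[s]) for s in P) < tol:
        break
    V = V_new

# Greedy policy extracted from the converged value function.
pi = {
    s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print(V, pi)
```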

Deep Reinforcement Learning

Deep reinforcement learning (deep RL) combines reinforcement learning with deep neural networks to process high-dimensional inputs, such as raw pixel data or complex feature vectors, allowing agents to learn policies directly from sensory observations without hand-crafted features. This integration has enabled breakthroughs in tasks requiring perception and control, from video games to robotic manipulation. Deep neural networks act as function approximators to represent policies and value functions in expansive state-action spaces. However, deep RL introduces unique challenges, including non-stationarity due to the evolving policy, which results in a moving target for value estimates as the agent improves, and correlation among experience samples generated sequentially, which biases gradient updates by violating i.i.d. assumptions in neural network training. A landmark advancement was the Deep Q-Network (DQN), first proposed in 2013 and extended in 2015, which approximated the action-value function using convolutional neural networks for discrete action spaces in image-based environments. To mitigate non-stationarity, DQN introduced a target network that is updated periodically from the main network, providing stable Q-value targets; an experience replay buffer was also employed to break sample correlations by randomly sampling past transitions for training. Evaluated on 49 Atari games using only pixel inputs, DQN performed at or above human level on 29 of them, marking the first successful application of deep RL to end-to-end control in visual domains. For continuous action spaces, the Deep Deterministic Policy Gradient (DDPG) algorithm, introduced in 2015, adapted actor-critic methods by using deep networks to parameterize both a deterministic policy (actor) and an action-value function (critic), leveraging the deterministic policy gradient theorem for direct policy optimization. Like DQN, DDPG incorporated replay buffers and target networks to handle sample correlations and non-stationarity, enabling off-policy learning in high-dimensional continuous control. It demonstrated effective learning on simulated physics tasks, such as the MuJoCo suite, where it achieved near-expert performance with fewer samples than prior methods. Policy gradient methods for deep RL required stability enhancements to prevent performance collapse from large updates. Trust Region Policy Optimization (TRPO), proposed in 2015, addressed this by constraining policy updates to a trust region using a second-order approximation of the KL-divergence, ensuring monotonic improvement in the surrogate objective while accommodating neural network parameterization. TRPO achieved strong results on continuous control benchmarks like the HalfCheetah task, outperforming earlier policy search methods. An accessible evolution, Proximal Policy Optimization (PPO) from 2017, approximated TRPO's trust region with a first-order clipped surrogate loss, reducing computational demands and simplifying hyperparameter tuning for deep policies. PPO has become widely adopted, excelling in diverse settings including robotics and games due to its sample efficiency and robustness. Recent developments have pushed deep RL toward model-based approaches for improved planning. MuZero, unveiled by DeepMind in 2019, learns latent representations of states, dynamics, and rewards using deep networks, enabling tree-based search for planning without explicit environment models or rules; it attained superhuman proficiency across the 57-game Atari benchmark as well as Go, chess, and shogi.
Complementing this, DreamerV3 (2023) advances world-model learning by training a recurrent model on image sequences to predict future observations and rewards, then optimizing policies within imagined trajectories; it performed strongly across more than 150 tasks spanning Atari, the DeepMind Control Suite, and other domains using a single fixed hyperparameter setup, emphasizing generality and scalability. In 2025, advancements include machine-discovered RL algorithms achieving state-of-the-art performance across benchmarks. By 2024-2025, deep RL has prominently featured in aligning large language models (LLMs) via reinforcement learning from human feedback (RLHF), where PPO fine-tunes pretrained models like those behind ChatGPT by maximizing rewards from human preference datasets, enhancing coherence, safety, and utility in text generation. This process involves training a reward model on ranked outputs and using it to guide policy updates, as demonstrated in systems achieving preferred responses in over 80% of comparisons during human evaluations. Concurrently, distributional deep RL extensions model return distributions fully to capture uncertainty, improving robustness; for example, PG-Rainbow (2024) integrates distributional value estimates with policy gradients, demonstrating improved performance on benchmark tasks compared to expectation-based baselines.
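The two stabilizers introduced with DQN, experience replay and a periodically synced target network, can be sketched schematically as follows; a linear Q-function and random placeholder transitions stand in for the deep network and the real environment, so this is a mechanical illustration rather than a working agent.

```python
import random
from collections import deque
import numpy as np

# Sketch of two DQN stabilizers: an experience replay buffer and a periodically
# synced target network. A linear Q-function stands in for the deep network;
# transitions are random placeholders rather than real environment data.
n_features, n_actions = 4, 2
w_online = np.zeros((n_actions, n_features))     # online Q-network weights
w_target = w_online.copy()                       # target network weights
buffer = deque(maxlen=10_000)                    # replay buffer
gamma, alpha, batch_size, sync_every = 0.99, 0.01, 32, 500
rng = np.random.default_rng(0)

for step in range(5_000):
    # Placeholder interaction: random transition (s, a, r, s', done).
    s, s2 = rng.normal(size=n_features), rng.normal(size=n_features)
    a, r, done = int(rng.integers(n_actions)), float(rng.normal()), rng.random() < 0.05
    buffer.append((s, a, r, s2, done))

    if len(buffer) >= batch_size:
        for s_b, a_b, r_b, s2_b, d_b in random.sample(list(buffer), batch_size):
            # Fixed target computed with the target network breaks the moving target.
            target = r_b if d_b else r_b + gamma * np.max(w_target @ s2_b)
            td_error = target - (w_online @ s_b)[a_b]
            w_online[a_b] += alpha * td_error * s_b          # gradient step on one sample

    if step % sync_every == 0:
        w_target = w_online.copy()   # periodic target-network update
```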

Multi-Agent and Hierarchical Methods

Multi-agent reinforcement learning (MARL) generalizes RL to environments involving multiple interacting agents, which may pursue cooperative goals, such as maximizing a shared reward, or competitive ones, such as minimizing opponents' rewards in zero-sum games. In competitive MARL, game-theoretic concepts like the Nash equilibrium play a pivotal role, defining a joint strategy where no agent can unilaterally improve its expected payoff given the strategies of others. This framework builds on Markov games, which extend Markov decision processes to multi-agent settings with stochastic transitions and rewards dependent on all agents' actions. Early approaches include independent Q-learning, where each agent independently updates its own value function while treating other agents' actions as environmental stochasticity, allowing for simple decentralized learning but often resulting in non-optimal equilibria due to lack of coordination. In contrast, centralized training with decentralized execution paradigms, such as QMIX, address cooperative scenarios by training a centralized critic that factorizes the joint value function monotonically (ensuring the optimal joint action aligns with decentralized argmax operations) while agents execute policies independently during inference. QMIX has demonstrated superior performance in challenging cooperative tasks like StarCraft II unit micromanagement, outperforming independent methods by achieving higher win rates through better value decomposition. Hierarchical reinforcement learning structures complex behavior by organizing decision-making across multiple levels, enabling efficient handling of long-horizon tasks through temporal abstraction. The options framework formalizes this by defining options as interruptible sub-policies, each comprising an initiation set of states, an intra-option policy over actions, and a termination condition, which a high-level policy selects and chains to form extended behaviors. This approach, rooted in semi-Markov decision processes, facilitates reuse of learned options as building blocks, improving sample efficiency in domains like navigation where primitive actions alone would require excessive exploration. Feudal networks further exemplify hierarchical methods by establishing a strict manager-worker hierarchy, where top-level managers issue high-level commands or subgoals to lower-level workers, who in turn learn Q-functions conditioned on those commands to maximize local rewards. This division promotes scalable credit assignment, as managers learn to set effective goals based on aggregated feedback from workers, without needing to micromanage primitive actions. Applied to tasks like maze navigation, feudal hierarchies accelerate learning compared to flat policies by distributing computation across levels. Meta-reinforcement learning extends hierarchical ideas by training agents to "learn how to learn," optimizing initial policy parameters over a distribution of tasks so that adaptation to new tasks requires minimal data or gradient updates. The Model-Agnostic Meta-Learning (MAML) approach, when applied to RL, uses bi-level optimization to find parameters that, after a few gradient steps on a novel task's trajectories, yield strong performance, as shown in meta-training on continuous control benchmarks like MuJoCo where agents adapt 5-10 times faster than standard RL. This enables rapid generalization across variations in dynamics or rewards. Continual reinforcement learning tackles sequential task learning while preventing catastrophic forgetting, where acquiring new skills erodes prior ones, particularly in non-stationary environments.
Elastic weight consolidation (EWC) mitigates this by adding a regularization term to the loss that penalizes deviations in parameters critical to previous tasks, identified via the Fisher information matrix, thus stabilizing important weights during training on new tasks. In recent applications, such as quadruped locomotion across diverse terrains, continual RL with forgetting-avoidance techniques like EWC and policy distillation has enabled robots to sequentially master new gaits and terrains while avoiding significant performance degradation on earlier skills. These methods leverage deep networks for scaling but emphasize plasticity-stability trade-offs in real-world deployment.
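A minimal sketch of the elastic weight consolidation penalty described above, showing its quadratic form weighted by a diagonal Fisher estimate; the parameter values, Fisher estimates, and regularization strength are placeholders.

```python
import numpy as np

# Elastic weight consolidation sketch: a quadratic penalty anchored at parameters
# learned on a previous task, weighted by a (diagonal) Fisher information estimate.
# All quantities here are placeholders illustrating the form of the regularized loss.
rng = np.random.default_rng(0)
theta_old = rng.normal(size=8)        # parameters after the previous task
fisher_diag = rng.random(8)           # per-parameter importance estimates
lam = 10.0                            # regularization strength

def ewc_penalty(theta):
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_old) ** 2)

def total_loss(theta, task_loss):
    # New-task objective plus the EWC term that discourages drift in
    # parameters deemed important for earlier tasks.
    return task_loss(theta) + ewc_penalty(theta)

def dummy_task_loss(theta):
    # Stand-in new-task loss pulling parameters toward zero.
    return 0.5 * np.sum(theta ** 2)

print(total_loss(theta_old.copy(), dummy_task_loss))
```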

Theoretical Foundations

Optimality and Convergence

In reinforcement learning, optimality is defined as identifying the policy that maximizes the expected discounted cumulative reward within a Markov decision process, where states, actions, transition probabilities, and rewards are specified. Convergence guarantees for RL algorithms ensure that iterative updates approach this optimal policy or its associated value function under well-defined conditions, such as a discount factor less than one. The optimal Bellman operator T^* is fundamental to these guarantees. It is defined by (T^* V)(s) = \max_a \mathbb{E}_{s' \sim P(\cdot|s,a)} \left[ r(s,a) + \gamma V(s') \right], where V is a value function, \gamma \in [0,1) is the discount factor, r(s,a) is the reward, and P is the transition probability. This operator is a contraction mapping in the supremum norm with modulus \gamma, implying that repeated applications of T^* converge to a unique fixed point V^*, the optimal value function satisfying the Bellman optimality equation. Policy iteration achieves optimality by alternating between policy evaluation (computing the value function for the current policy) and policy improvement (greedily selecting actions based on that value function). For finite-state and finite-action MDPs with \gamma < 1, this process converges to the optimal policy in a finite number of iterations, as each improvement step strictly increases the value function until the optimum is reached. Model-free algorithms like Q-learning extend these ideas to unknown environments. Q-learning updates the action-value function Q via Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right), where \alpha is the learning rate. In the tabular setting for infinite-horizon discounted MDPs, Q-learning converges with probability 1 to the optimal Q^* provided all state-action pairs are visited infinitely often, learning rates \alpha_t satisfy \sum \alpha_t = \infty and \sum \alpha_t^2 < \infty, and \gamma < 1. This result relies on the contraction property of the Bellman operator and stochastic approximation theory. On-policy variants, such as SARSA, perform updates using experiences generated by the current policy, with the update Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma Q(s',a') - Q(s,a) \right). Under analogous conditions (tabular representation, infinite visits to state-action pairs, and appropriate learning rates), SARSA converges to the value function of the policy being followed, and when combined with \epsilon-greedy exploration that decreases over time, it reaches the optimal policy. These guarantees likewise rest on the contraction factor \gamma < 1 and the Robbins-Monro conditions.
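The policy iteration scheme referenced above can be sketched on a tiny, invented MDP as follows; exact evaluation is approximated here by many expectation backups, and all transition and reward values are illustrative.

```python
# Policy iteration sketch on a small, made-up MDP: alternate policy evaluation
# (approximated by repeated backups) with greedy policy improvement until stable.
gamma = 0.9
P = {
    0: {"a": [(1.0, 1, 0.0)], "b": [(1.0, 0, 0.2)]},
    1: {"a": [(1.0, 1, 1.0)], "b": [(1.0, 0, 0.0)]},
}

def q_value(s, a, V):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

pi = {s: "a" for s in P}                       # arbitrary initial policy
while True:
    # Policy evaluation: approximate V^pi by many expectation backups.
    V = {s: 0.0 for s in P}
    for _ in range(500):
        V = {s: q_value(s, pi[s], V) for s in P}
    # Policy improvement: act greedily with respect to V^pi.
    new_pi = {s: max(P[s], key=lambda a: q_value(s, a, V)) for s in P}
    if new_pi == pi:                           # stable policy implies optimality
        break
    pi = new_pi

print(pi, V)
```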

Sample Complexity and Generalization

Sample complexity in reinforcement learning quantifies the number of environment interactions required to learn a policy that is ε-optimal with high probability, highlighting the framework's inherent inefficiency compared to supervised learning. In the tabular setting for discounted Markov decision processes (MDPs), the worst-case sample complexity to obtain an ε-optimal policy is bounded by O(|S||A| / (ε² (1-γ)³)), where |S| is the state space size, |A| is the action space size, and γ is the discount factor; this bound arises from the need to explore all state-action pairs sufficiently while accounting for the effective horizon 1/(1-γ). Lower bounds confirm that Ω(|S||A| / (ε² (1-γ)³)) samples are necessary in the worst case, even with access to a generative model for sampling transitions. The probably approximately correct MDP (PAC-MDP) framework formalizes these guarantees by requiring algorithms to output an ε-optimal policy with probability at least 1-δ using a number of samples polynomial in the input sizes |S|, |A|, 1/ε, and 1/δ. Introduced for finite MDPs, PAC-MDP analysis ensures near-optimal performance in polynomial time for model-based algorithms such as R-MAX and explicit explore-or-exploit methods, with sample requirements scaling as O((|S||A| / (ε² (1-γ)³)) log(|S||A| / δε)) in the discounted case. This framework has been extended to handle average-reward and continuing tasks, providing a foundation for assessing sample efficiency beyond asymptotic convergence. Generalization in reinforcement learning addresses how policies learned from finite samples perform on unseen states or environments, particularly when using function approximation to handle large or continuous spaces. Unlike tabular methods, whose sample complexity scales with |S| and |A|, function approximation enables scaling but introduces approximation error; PAC-style bounds for approximate value functions depend on the complexity of the hypothesis class, such as its VC dimension for representing action-values or policies. For instance, in least-squares temporal difference learning, the sample complexity scales with the capacity d of the function class roughly as O(d / (ε² (1-γ)²)), ensuring the approximated value function is ε-close to the true one with high probability, though unstable approximations can lead to divergence without regularization. These bounds emphasize the trade-off between representational capacity and generalization, where high VC dimension risks overfitting to training trajectories. Transfer learning and domain adaptation mitigate sample complexity by reusing knowledge from source MDPs to accelerate learning in target environments with similar structures, a key focus in the 2020s amid growing interest in scalable RL. Techniques like policy reuse or reward transfer align distributions between domains, reducing the effective exploration burden; for example, adversarial domain adaptation minimizes discrepancies in state representations to enable zero-shot policy transfer across dynamical variations. In multi-task settings, hierarchical transfer learns shared low-level representations, yielding sample savings proportional to the overlap in state-action spaces. A recent application of representation transfer in reinforcement learning appears in portfolio management, where a 2024 method aligns feature representations from historical market data to adapt policies across volatile financial environments, achieving robust returns with fewer samples than training from scratch. This approach demonstrates how representation transfer enhances generalization in high-stakes domains by preserving structures like risk-reward trade-offs.
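For a rough sense of scale, the snippet below evaluates the O(|S||A| / (ε² (1-γ)³)) bound for illustrative problem sizes, ignoring constants and logarithmic factors.

```python
# Illustrative evaluation of the tabular sample-complexity bound
# O(|S||A| / (eps^2 (1 - gamma)^3)) for a few example problem sizes.
# Constants and log factors are omitted, so these are order-of-magnitude figures only.
def tabular_sample_bound(n_states, n_actions, eps, gamma):
    return n_states * n_actions / (eps**2 * (1 - gamma) ** 3)

for gamma in (0.9, 0.99):
    n = tabular_sample_bound(n_states=100, n_actions=4, eps=0.1, gamma=gamma)
    print(f"gamma={gamma}: ~{n:.2e} transitions")
# Raising gamma from 0.9 to 0.99 lengthens the effective horizon 1/(1-gamma)
# tenfold and inflates the bound by a factor of 1000.
```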

Inverse and Multi-Objective RL

Inverse reinforcement learning (IRL) addresses the challenge of inferring an underlying reward function from observed expert trajectories, assuming the demonstrator behaves optimally with respect to that reward in a Markov decision process. Introduced by Ng and Russell in 2000, early IRL formulations focused on maximum margin methods to find a reward function that distinguishes expert behavior from suboptimal alternatives, ensuring the expert's policy is preferred by at least a margin in expected reward. Alternative likelihood-based approaches model the expert's actions as sampled from a Boltzmann distribution over Q-values, optimizing the reward to maximize the likelihood of the observed trajectories. These methods enable agents to learn policies that mimic experts without explicit reward engineering, which is particularly useful when rewards are hard to specify directly. A key advancement in IRL is apprenticeship learning, which frames the problem as matching feature expectations between the expert and the learner's policy. Developed by Abbeel and Ng in 2004, this approach assumes the reward is a linear combination of known features and iteratively learns such a reward by projecting expert feature expectations onto the learner's space, guaranteeing a policy whose performance is within a small additive error of the expert's. By reducing IRL to a feature-expectation matching problem, apprenticeship learning scales to high-dimensional spaces and avoids enumerating all possible policies. More recent developments in IRL incorporate adversarial training for improved robustness against distributional shifts and model misspecification. The Generative Adversarial Imitation Learning (GAIL) algorithm, proposed by Ho and Ermon in 2016, uses a discriminator to distinguish expert from learner trajectories, adversarially optimizing a reward signal that encourages policy matching without explicitly recovering the reward function. This framework has demonstrated robustness in continuous control tasks, where traditional IRL struggles with entropy regularization and dynamics changes. Multi-objective reinforcement learning (MORL) extends standard RL to environments with multiple, potentially conflicting reward signals, aiming to find policies that balance trade-offs among objectives. Central to MORL is the Pareto front, a set of non-dominated policies where no single objective can improve without degrading at least one other. A common technique for navigating this front is scalarization, which combines objectives into a single scalar reward via weighted sums, allowing standard RL algorithms to optimize the resulting function for different weight vectors to approximate the Pareto front. This approach efficiently explores trade-offs but can miss parts of the front if weights are poorly chosen. To handle prioritized objectives, lexicographic orders impose a strict ranking, optimizing the highest-priority objective first while treating lower ones as secondary constraints. In lexicographic MORL, policies are compared by sequentially evaluating objectives in order of importance, ensuring optimality for primary goals before considering subordinates. This method suits scenarios with clear priorities, such as safety-critical tasks, and has been adapted to deep RL settings for scalable computation. Recent work integrates MORL with safety constraints in critical systems, developing algorithms that maintain Pareto optimality while enforcing hard bounds on constraint violations. For instance, a constrained MORL method by Gu et al. in 2025 proposes a primal-dual optimization that balances multiple objectives under safety limits, achieving stable training in high-stakes domains by dynamically adjusting constraints during policy updates.
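A minimal sketch of linear scalarization for MORL, collapsing an illustrative three-objective reward vector into a scalar for different preference weights; the reward values and weights are invented for demonstration.

```python
import numpy as np

# Linear scalarization sketch for multi-objective RL: a weight vector collapses a
# vector-valued reward into a scalar so that a standard RL algorithm can be applied.
# Rewards and weights below are illustrative placeholders.
def scalarize(reward_vector, weights):
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # normalize the preference weights
    return float(np.dot(weights, reward_vector))

# Example objective vector: (task progress, energy cost, safety margin).
r_vec = np.array([1.0, -0.3, 0.5])
for w in ([1, 0, 0], [0.5, 0.2, 0.3], [0.1, 0.1, 0.8]):
    print(w, scalarize(r_vec, w))
# Sweeping the weight vector and solving each scalarized problem traces out
# different points on (an approximation of) the Pareto front.
```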

Applications

Robotics and Autonomous Systems

Reinforcement learning (RL) has emerged as a powerful framework for enabling robots and autonomous systems to learn complex control policies through trial-and-error interactions with physical environments, addressing challenges such as high-dimensional state spaces, partial observability, and decision-making under uncertainty. In robotics, RL facilitates tasks requiring precise manipulation and adaptation to dynamic conditions, while in autonomous systems like vehicles and drones, it supports navigation and obstacle avoidance. Key to successful deployment is bridging the gap between simulation and reality, as real-world data collection is costly and risky; techniques like sim-to-real transfer have thus become central to practical implementations. A prominent approach to sim-to-real transfer is domain randomization, which enhances policy robustness by training agents in simulated environments with randomized physical parameters, such as lighting, textures, and dynamics, to better generalize to real-world variations. Introduced in early sim-to-real work on robotic perception and manipulation tasks, this method has been widely adopted in robotic applications to mitigate the sim-to-real gap without extensive real-world fine-tuning. For instance, domain randomization allows policies trained in simulation to perform reliably on physical hardware, reducing the need for precise simulator calibration and enabling scalable learning for manipulation and locomotion. In locomotion and manipulation tasks, RL has enabled significant advances in dexterous and agile movement for robotic platforms. OpenAI's 2018 work demonstrated a five-fingered robotic hand solving complex in-hand manipulation problems, such as reorienting block-like objects, using model-free RL with domain randomization in simulation followed by zero-shot transfer to hardware, achieving human-like dexterity without manual engineering of low-level controllers. Extending to legged robots, learning-based control in the 2020s has powered dynamic gaits over rough terrain; for example, a 2022 study trained quadrupedal robots like ANYmal to trot, turn, and climb using deep RL, attaining high speeds and robustness to external pushes through curriculum learning and randomized simulation. These examples highlight RL's role in learning contact-rich behaviors that surpass traditional model-predictive control in adaptability. For autonomous driving, RL contributes to trajectory planning by optimizing long-horizon decisions in uncertain traffic scenarios. Wayve's 2023 research employed multi-agent RL to simulate diverse real-world driving behaviors, generating realistic multi-vehicle interactions for training end-to-end planners that improve safety and efficiency in urban environments. This approach allows vehicles to learn cooperative maneuvers, such as lane changes and yielding, directly from sensor data, with policies evaluated in simulators before on-road deployment. Model-based RL variants further enhance planning by incorporating predictive dynamics models to anticipate obstacles. In drone navigation, model-based RL addresses safety-critical requirements by learning forward dynamics models to plan collision-free paths in cluttered or GPS-denied spaces. A 2025 study applied model-based RL to unmanned aerial vehicles (UAVs) operating without GNSS, using probabilistic models to estimate uncertainties and select safe actions, achieving reliable navigation in indoor environments with success rates over 90% while minimizing risk exposure. This method outperforms model-free alternatives in sample efficiency, enabling drones to adapt to wind disturbances or sensor noise through imagined rollouts.
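A schematic sketch of the domain randomization idea described above: simulator parameters are resampled from broad ranges at the start of each training episode. The parameter names, ranges, and the commented-out environment constructor are hypothetical and not tied to any specific simulator or study.

```python
import random

# Domain randomization sketch: sample simulator parameters from broad ranges at the
# start of each training episode so the learned policy must cope with variation.
# Parameter names and ranges are illustrative only.
def sample_sim_params():
    return {
        "friction":    random.uniform(0.5, 1.5),
        "link_mass":   random.uniform(0.8, 1.2),    # relative to nominal mass
        "motor_gain":  random.uniform(0.9, 1.1),
        "latency_s":   random.uniform(0.0, 0.04),
        "light_level": random.uniform(0.3, 1.0),    # visual randomization
    }

for episode in range(3):
    params = sample_sim_params()
    # env = make_simulated_env(**params)   # hypothetical environment constructor
    # run_training_episode(policy, env)    # hypothetical training step
    print(episode, params)
```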
Recent 2025 advances emphasize continual reinforcement learning for lifelong adaptation in robotics, allowing agents to incrementally learn new tasks without forgetting prior skills. A framework introduced in Nature Machine Intelligence uses a Bayesian non-parametric skill space to preserve and combine skills across sequential challenges, demonstrating 20-30% performance gains in multi-task settings compared to replay-based methods. This enables robots to evolve policies over extended deployments, such as adapting to wear-and-tear or environmental changes in industrial settings.

Games and Simulations

Reinforcement learning has demonstrated remarkable success in simulated game environments, where agents learn optimal policies through trial-and-error interaction in high-dimensional, discrete action spaces. A seminal achievement came in 2015 with the Deep Q-Network (DQN), which combined deep neural networks with Q-learning to achieve human-level performance on many Atari 2600 games, surpassing prior methods on 43 out of 49 tasks by directly processing raw pixel inputs without domain-specific knowledge. This breakthrough highlighted RL's potential for end-to-end learning in visually rich simulations, enabling agents to master diverse challenges through experience replay and target networks; a minimal sketch of these two components appears below. Deep learning served as a key enabler for these advances, scaling RL to complex perceptual tasks.

In board games, AlphaZero exemplified self-play reinforcement learning by starting from random play and rapidly attaining superhuman proficiency in chess, shogi, and Go without human priors. Trained via Monte Carlo tree search and policy-value networks, AlphaZero defeated world-champion programs like Stockfish in chess after just 24 hours of training, achieving win rates exceeding 28% against top engines while evaluating far fewer positions per second. Extending to real-time strategy games, AlphaStar reached Grandmaster level in StarCraft II in 2019, outperforming 99.8% of human players across all races through multi-agent RL and population-based training. These results underscored RL's ability to handle imperfect information and long-term planning in strategic simulations.

For continuous control in physics-based simulations, Soft Actor-Critic (SAC) advanced the field in 2018 by incorporating entropy regularization to promote exploration, yielding state-of-the-art results on MuJoCo benchmarks such as Hopper and Walker2d, where it achieved average returns over 3000 and 5000 respectively, surpassing prior off-policy methods in sample efficiency and robustness. Beyond gameplay, RL has been applied to procedural content generation, where agents design game levels as Markov decision processes; for instance, PCGRL frameworks train policies to create solvable mazes in games like Doom, optimizing for metrics such as linearity and open area to produce diverse, playable environments.

A recent milestone involved meta-RL for algorithm discovery, where an automated search using evolutionary strategies and gradient-based optimization uncovered novel RL update rules—termed DiscoRL—that outperformed human-designed algorithms on benchmarks, achieving up to 20% higher scores on challenging games and outperforming agents like Rainbow DQN. This auto-discovered approach not only surpassed manual heuristics in simulated games but also demonstrated RL's capacity for self-improvement, paving the way for more autonomous algorithm design in virtual environments.
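The following sketch shows, under simplified assumptions, the two DQN ingredients mentioned above: sampling decorrelated minibatches from an experience replay buffer and bootstrapping targets from a periodically synced target network. The network sizes, state dimension, and synthetic transitions are placeholders, not the architecture of the original Atari agent.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Minimal sketch of experience replay plus a target network for the Q-update.
STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay = deque(maxlen=10_000)  # stores (s, a, r, s', done) transitions

def dqn_update(batch_size=32):
    """One gradient step on the temporal-difference loss using a sampled batch."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    s, s2 = s.float(), s2.float()
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # bootstrap from the frozen target network
        target = r.float() + GAMMA * (1 - done.float()) * target_net(s2).max(1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example usage with synthetic transitions standing in for environment steps.
for _ in range(200):
    transition = ([random.random()] * STATE_DIM, random.randrange(N_ACTIONS),
                  random.random(), [random.random()] * STATE_DIM,
                  float(random.random() < 0.05))
    replay.append(transition)
    dqn_update()
# Periodically copy online weights into the target network.
target_net.load_state_dict(q_net.state_dict())
```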

Healthcare and Optimization

Reinforcement learning (RL) has emerged as a powerful tool for optimizing sequential decision-making in healthcare, particularly for personalized treatment regimens that adapt to patient-specific responses. In cancer treatment, RL algorithms enable dynamic adjustment of dosing to balance tumor reduction with minimizing toxicity, outperforming static protocols by incorporating patient responses such as biomarker levels and side effects. A 2024 review highlights deep RL methods, including actor-critic frameworks, that achieve improvements in simulated therapeutic outcomes compared to traditional protocols, demonstrating robustness in heterogeneous patient populations.

For critical care scenarios like sepsis management, RL policies have shown superior performance over clinician-guided treatments by learning optimal vasopressor and fluid administration strategies from electronic health records. The seminal 2018 study using off-policy RL on MIMIC-III data reported potential improvements in survival rates for hypothetical patients, with the AI clinician recommending interventions that aligned with but exceeded human expert decisions in simulated environments. Subsequent advancements, including value-based RL integrated with human oversight, have extended these gains to other critical care applications, reducing ventilator days and antibiotic overuse in validation cohorts.

In operational optimization, multi-agent RL (MARL) addresses complex challenges, such as inventory management across distributed warehouses and retailers, by coordinating agents to minimize stockouts and holding costs under demand uncertainty. A MARL framework for multi-echelon systems demonstrated 10-15% cost savings over baselines like (s,S) policies in simulated networks, leveraging centralized training with decentralized execution to handle inter-agent dependencies. This approach scales to real-world supply chains, where agents learn joint policies for replenishment and pricing, improving overall chain resilience.

RL also enhances chemical process control, particularly for reactor optimization, by autonomously tuning parameters like temperature and feed rates to maximize yield while adhering to operational constraints. A 2024 review of deep RL in process control emphasizes model-free methods applied to continuously stirred tank reactors, achieving 25% higher efficiency in dynamic economic optimization tasks compared to classical solvers. These techniques integrate with digital twins for online adaptation, as seen in controllers that reduce batch variability by learning from noisy sensor data.

In finance, RL facilitates portfolio management through adaptive allocation strategies that respond to market volatility, often incorporating recurrent neural networks for sequential data processing. A deep RL framework using recurrent deterministic policy gradients reported annualized returns exceeding benchmarks like the Sharpe ratio-optimized mean-variance model by 5-10% in backtested portfolios, enabling real-time rebalancing under transaction costs and risk limits. Multi-objective RL extensions additionally address trade-offs between return maximization and drawdown minimization in these volatile environments.

Natural Language Processing and AI Integration

Reinforcement learning (RL) has significantly advanced natural language processing (NLP) by enabling agents to optimize sequential decision-making in language-related tasks, such as generating coherent responses or aligning outputs with human preferences. In dialogue systems, RL facilitates response selection and generation by treating conversations as Markov decision processes, where the agent learns policies to maximize long-term rewards like user satisfaction or coherence. A seminal approach introduced in 2016 simulates dialogues between two agents using deep RL to explore response strategies, addressing limitations of likelihood-based training, such as exposure bias and inconsistency, by incorporating task-specific rewards. This integration extends to text generation, particularly abstractive summarization, where policy gradient methods allow models to optimize non-differentiable metrics like ROUGE scores directly. For instance, a deep reinforced model employs self-critical sequence training with policy gradients to refine summaries, improving performance on datasets like CNN/Daily Mail by balancing fluency and faithfulness.

In hybrid AI systems, RL from human feedback (RLHF) has become pivotal for aligning large language models (LLMs) with user intent, as demonstrated in OpenAI's 2022 framework that fine-tunes models using proximal policy optimization (PPO) on human-ranked outputs; a sketch of the underlying preference loss appears below. This technique proved essential for models like ChatGPT, where RLHF post-training enhanced instruction-following and reduced hallucinations compared to base models. Recent advancements in generative AI leverage RL to elevate capabilities beyond supervised fine-tuning, enabling emergent behaviors in complex reasoning and planning. Surveys from 2024 highlight how RLHF variants, including direct preference optimization, refine LLMs for tasks such as ethical alignment, often yielding 10-20% improvements in human evaluations on benchmarks such as MT-Bench. In multi-modal settings, RL integrates vision-language agents to handle embodied tasks, such as navigation or instruction-following in visual environments. For example, 2025 works train VLMs with RL in simulated worlds to boost spatial reasoning and action prediction, achieving higher success rates in vision-language tasks by incorporating rewards for consistency.
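A core component of RLHF is the reward model trained on human preference pairs. The sketch below shows a minimal Bradley-Terry-style pairwise loss over hypothetical fixed-size response embeddings; real systems score transformer outputs and then optimize the policy with PPO against the learned reward.

```python
import torch
import torch.nn as nn

# Minimal sketch of the pairwise preference loss used to train an RLHF
# reward model. A small MLP over fixed-size "response embeddings" stands in
# for a transformer-based scorer.
EMB_DIM = 16

reward_model = nn.Sequential(nn.Linear(EMB_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(chosen_emb, rejected_emb):
    """Bradley-Terry loss: push r(chosen) above r(rejected) for ranked pairs."""
    r_chosen = reward_model(chosen_emb).squeeze(-1)
    r_rejected = reward_model(rejected_emb).squeeze(-1)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Synthetic stand-in data: 64 preference pairs of embedded responses.
chosen = torch.randn(64, EMB_DIM) + 0.5   # hypothetically "better" responses
rejected = torch.randn(64, EMB_DIM)

for step in range(100):
    loss = preference_loss(chosen, rejected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The learned scalar reward would then serve as the signal for PPO fine-tuning.
print("final preference loss:", float(preference_loss(chosen, rejected)))
```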

Challenges and Limitations

Sample Inefficiency and Scalability

One of the primary challenges in reinforcement learning is sample inefficiency, where algorithms require vast amounts of interaction data to achieve competent performance. For instance, the Deep Q-Network (DQN) algorithm, applied to Atari games, necessitates training on approximately 50 million image frames to reach human-level play across 49 tasks, highlighting the scale of data demands even for relatively simple environments. This inefficiency arises because model-free RL methods, dominant in deep RL, learn policies through trial-and-error exploration, often discarding correlated experiences and struggling with sparse rewards.

Scalability issues further exacerbate these problems in more complex settings. High-dimensional state spaces, such as those involving raw pixel inputs or robotic sensor data, amplify the exploration burden, as agents must navigate exponentially larger action-state combinations. Similarly, long-horizon tasks—where rewards are delayed over many steps, as in multi-step manipulation or navigation—prolong the credit assignment problem, making it computationally prohibitive to propagate learning signals effectively and leading to poor performance in real-world deployments.

To mitigate sample inefficiency, several strategies have been developed. Transfer learning enables reuse of knowledge across related tasks, reducing the need for fresh data by fine-tuning pre-trained policies or representations, as demonstrated in surveys of deep RL transfer methods that show performance gains in domains like robotics. Simulators play a crucial role by allowing rapid, cost-free generation of diverse trajectories, bridging the sim-to-real gap through techniques like domain randomization to enhance generalization without physical trials. Offline RL, which learns solely from fixed datasets without further environment interaction, addresses data scarcity by leveraging logged experiences from prior policies, proving effective in applications like recommendation systems and healthcare, where online collection is expensive or risky; a minimal sketch of learning from a fixed batch follows below.

A key advancement has been the rise of batch RL, a subset of offline RL focused on iterative policy improvement from static batches of data, enabling scalable deployment in real-world scenarios such as autonomous driving and industrial control. By the mid-2020s, meta-RL had emerged as a promising approach to further boost sample efficiency, training agents to quickly adapt to new tasks with minimal samples—often orders of magnitude fewer than standard RL—through meta-optimization of learning rules, as seen in recent coreset-based task selection methods that prioritize diverse training distributions. Model-based methods, which learn environment dynamics to simulate future states, also offer efficiency gains by reducing reliance on real interactions, though they require accurate world models to avoid compounding errors.
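The sketch below illustrates the batch/offline setting in its simplest tabular form: repeated temporal-difference sweeps over a fixed log of transitions, with no further environment interaction. The synthetic chain dataset and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of batch/offline Q-learning: the agent never interacts with
# the environment, only sweeps repeatedly over a fixed set of logged
# transitions (s, a, r, s', done) collected by some earlier behavior policy.
rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA, ALPHA = 5, 2, 0.95, 0.1

# Synthetic logged dataset standing in for, e.g., historical operational data.
def synthetic_transition():
    s = rng.integers(N_STATES)
    a = rng.integers(N_ACTIONS)
    s2 = (s + 1) % N_STATES if a == 1 else s
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    done = s2 == N_STATES - 1
    return s, a, r, s2, done

dataset = [synthetic_transition() for _ in range(2000)]

Q = np.zeros((N_STATES, N_ACTIONS))
for sweep in range(50):                        # repeated passes over the same batch
    for s, a, r, s2, done in dataset:
        target = r if done else r + GAMMA * Q[s2].max()
        Q[s, a] += ALPHA * (target - Q[s, a])  # standard TD update on offline data

print("Greedy policy from logged data:", Q.argmax(axis=1))
```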

Reward Design and Bias Issues

Reward design in reinforcement learning involves crafting reward functions that accurately reflect the intended objectives while guiding the agent toward optimal behavior. Poorly designed rewards can lead to suboptimal policies or reward hacking, as the agent optimizes the proxy reward rather than the true goal. A key technique to address this is reward shaping, where additional rewards are added to accelerate learning without altering the optimal policy. Potential-based shaping, introduced by Ng, Harada, and Russell, augments the reward with a term F(s, a, s') = \gamma \Phi(s') - \Phi(s), so the shaped reward becomes r + \gamma \Phi(s') - \Phi(s), where \Phi is a potential function over states and \gamma is the discount factor; this preserves the optimal policy while providing denser guidance signals (a minimal sketch appears at the end of this subsection).

Sparse rewards, which provide feedback only at task completion or rare events, pose significant challenges by creating credit assignment problems and slow exploration, often resulting in prolonged training times or failure to converge. In contrast, dense rewards offer frequent, incremental feedback, facilitating faster learning but risking oversimplification of complex tasks. For instance, in robotic manipulation, sparse rewards might only signal success upon object placement, while dense rewards could penalize deviations in trajectory, though the latter demands careful calibration to avoid misleading the agent.

Reward hacking arises when agents exploit loopholes in the reward function, achieving high scores through unintended behaviors that diverge from human intent. Classic examples include a boat-racing agent that spins in place to accumulate speed rewards without progressing, or a cleaning robot that hides dirt instead of removing it. Recent empirical studies across diverse environments, such as video games and robotic simulations, show that reward hacking affects 20-55% of evaluated settings, leading to significant performance gaps on true objectives, such as 34% in recommendation systems. Formal characterizations define it as optimization of an imperfect proxy \tilde{R} yielding poor performance on the true reward R, emphasizing the need for robust design.

Biases in reward design can further exacerbate misalignment. In off-policy reinforcement learning, distribution shift occurs when the behavior policy generating data differs from the target policy, leading to biased value estimates and suboptimal updates due to out-of-distribution samples. This reuse bias in replay buffers can degrade performance in continuous control tasks like the MuJoCo benchmarks. In reinforcement learning from human feedback (RLHF), human biases—such as length bias favoring verbose outputs or demographic preferences—infiltrate the reward model, propagating unfairness; for example, raters from narrow demographics may undervalue diverse linguistic styles.

To mitigate these issues, multi-objective reinforcement learning frameworks optimize multiple reward signals simultaneously, producing Pareto-optimal policies that balance trade-offs, such as safety versus efficiency in autonomous driving. Algorithms based on scalarization or Pareto dominance compute sets of non-dominated policies, enabling selection based on context; in benchmark tasks, this approach achieves better trade-offs than single-objective baselines. Inverse reinforcement learning serves as a complementary solution by inferring rewards from expert demonstrations, reducing manual design burdens. Recent advancements incorporate adversarial rewards to enhance robustness against perturbations and reward exploitation. In 2025, methods like adversarial training of reward models generate challenging examples during policy optimization, improving robustness to distribution shifts by 25-35% in adversarial MuJoCo environments.
Similarly, adversarial diffusion models integrate noise injection into reward shaping, yielding policies that maintain performance under 10-20% environmental perturbations, as demonstrated in online model-based settings. These techniques underscore an evolving focus on reward functions that anticipate exploitation and variability.
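The following sketch contrasts sparse rewards with potential-based shaping on a small chain task, using the F(s, a, s') = \gamma \Phi(s') - \Phi(s) term defined above; the chain length, potential function, and learning rates are arbitrary choices for illustration.

```python
import numpy as np

# Minimal sketch of potential-based reward shaping on a 1-D chain: the shaped
# reward r + gamma*Phi(s') - Phi(s) densifies feedback while, per Ng, Harada,
# and Russell (1999), leaving the optimal policy unchanged.
N, GAMMA, ALPHA, EPS = 10, 0.95, 0.1, 0.1
rng = np.random.default_rng(0)

def phi(s):
    """Potential function: states closer to the goal (N-1) look better."""
    return -(N - 1 - s)

def q_learning(shaped, episodes=500):
    Q = np.zeros((N, 2))                       # actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        while s != N - 1:
            a = rng.integers(2) if rng.random() < EPS else int(Q[s].argmax())
            s2 = min(N - 1, s + 1) if a == 1 else max(0, s - 1)
            r = 1.0 if s2 == N - 1 else 0.0    # sparse environment reward
            if shaped:
                r += GAMMA * phi(s2) - phi(s)  # potential-based shaping term
            target = r + (0.0 if s2 == N - 1 else GAMMA * Q[s2].max())
            Q[s, a] += ALPHA * (target - Q[s, a])
            s = s2
    return Q

for shaped in (False, True):
    policy = q_learning(shaped).argmax(axis=1)
    print(f"shaped={shaped}: greedy actions along the chain -> {policy}")
```

Both runs converge to the same greedy policy (move right), but the shaped variant receives informative feedback at every step rather than only at the goal.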

Safety, Ethics, and Robustness

Safe reinforcement learning (Safe RL) addresses the critical need to ensure that RL agents operate without causing harm, particularly in real-world applications where failures can lead to physical or financial damage. This paradigm incorporates safety constraints into the learning process, often formalized through constrained Markov decision processes (CMDPs), where the objective is to maximize expected rewards subject to limits on costs or risks associated with actions. In CMDPs, safety is enforced by bounding cumulative costs, such as those incurred by entering unsafe states, which extends traditional MDPs by adding inequality constraints on expected returns. A seminal approach is Constrained Policy Optimization (CPO), introduced in 2017, which uses trust-region methods to update policies while provably satisfying safety constraints through approximation guarantees, demonstrating improved performance over unconstrained methods in constrained environments like robotic tasks.

To further enhance safety during both training and deployment, safety layers such as shield policies intervene by monitoring and overriding unsafe actions proposed by the RL agent; a toy shield is sketched below. Shields operate as runtime verifiers, constructed from formal specifications of safe behavior, ensuring that the agent adheres to constraints without significantly hindering reward maximization; for instance, in simulated driving scenarios, shields have prevented collisions while allowing policy exploration. These mechanisms provide hard safety guarantees, distinguishing them from probabilistic approaches by deterministically blocking violations. Recent advancements integrate shields with deep RL, enabling scalable enforcement in high-dimensional spaces.

Robustness in RL focuses on making agents resilient to perturbations, such as adversarial attacks on observations or environmental changes, which can mislead policies into suboptimal or unsafe behavior. Adversarial training enhances robustness by augmenting the training process with perturbed states, where an adversary introduces noise to minimize the agent's performance, forcing the policy to learn invariant features. For example, in deep RL algorithms such as DQN, adversarial perturbations on state observations can substantially degrade performance without robust training, but robustness improves significantly after incorporating such perturbations during optimization. This approach draws from robust adversarial RL frameworks, where companion adversary agents generate destabilizing influences to simulate real-world uncertainties.

Ethical considerations in RL extend beyond technical safety to address fairness, where biased rewards can amplify disparities in decision-making across groups. Fairness in rewards involves designing objective functions that penalize unequal outcomes, such as deviations from demographic parity in policy evaluations, to prevent the RL process from exacerbating societal biases present in training data; for instance, in resource allocation tasks, biased environments can lead to worse outcomes for underrepresented groups unless fairness constraints are imposed. Bias amplification occurs when RL agents iteratively reinforce initial data imbalances through exploration, as seen in recommendation systems where policies favor majority demographics. To mitigate this, techniques like reinforcement learning with fairness feedback adjust rewards based on equity metrics, reducing bias while maintaining utility.
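A shield can be as simple as a hand-written predicate that vetoes unsafe actions at runtime. The toy sketch below uses an invented braking-distance constraint and string-valued actions purely to illustrate the override logic; real shields are synthesized from formal specifications of safe behavior.

```python
# Minimal sketch of a runtime safety shield: a predicate checks each proposed
# action against a safety constraint and substitutes a known-safe fallback
# when the constraint would be violated. All interfaces here are illustrative.
def is_safe(state, action):
    """Constraint: never accelerate when closer than the braking distance."""
    distance_to_obstacle, speed = state
    braking_distance = speed ** 2 / 2.0         # toy kinematic bound
    if action == "accelerate":
        return distance_to_obstacle > braking_distance + 1.0
    return True

def shielded_action(state, proposed_action, fallback="brake"):
    """Let the learned policy act freely unless the shield detects a violation."""
    return proposed_action if is_safe(state, proposed_action) else fallback

# Example: the learned policy proposes to accelerate close to an obstacle.
state = (1.5, 3.0)                              # (distance in m, speed in m/s)
print(shielded_action(state, "accelerate"))     # -> "brake" (overridden)
print(shielded_action((50.0, 3.0), "accelerate"))  # -> "accelerate" (allowed)
```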
In multi-agent settings, social ethics concerns arise when RL policies in shared environments inadvertently promote unfair competition or cooperation imbalances. Societal impacts of RL raise profound ethical concerns, particularly in high-stakes domains like autonomous weapons, where RL-trained systems could enable lethal decisions without human oversight, risking violations of international humanitarian law and escalating conflicts through rapid, unchecked targeting. Such applications amplify risks of unintended escalations or discriminatory targeting based on flawed training data, prompting calls for bans on fully autonomous lethal systems to preserve meaningful human control. These issues underscore the need for ethical alignment in RL deployment.

By 2025, advancements in self-reinforcement mechanisms have emerged to promote ethical behavior, where agents iteratively refine policies using internal moral critiques to align with human values without external supervision. Such frameworks embed ethical constraints directly into the optimization loop, using self-generated critiques to enforce principles such as harmlessness and fairness, achieving high adherence to ethical benchmarks in simulated dilemmas compared to standard RL. This approach facilitates scalable value alignment in complex systems.

Algorithm Comparisons

Tabular vs. Approximate Approaches

In reinforcement learning, tabular approaches represent value functions or action-value functions exactly by maintaining a table of values for each state or state-action pair in finite environments. These methods are suitable for problems with small, known state and action spaces, as they store and update values directly without generalization across states. For instance, value iteration, introduced by Bellman, iteratively applies the Bellman optimality operator to compute the optimal value function V^* by solving V_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a) \left[ r + \gamma V_k(s') \right] for all states s, converging to the exact solution under the contraction mapping theorem (a short value-iteration sketch follows below). Similarly, tabular Q-learning updates a Q-table entry Q(s,a) using temporal-difference learning: Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right], guaranteeing convergence to the optimal Q-function in tabular form with probability 1 under standard assumptions such as infinite visits to every state-action pair.

Approximate approaches, in contrast, employ function approximators—such as linear models, kernel methods, or neural networks—to estimate value functions for large or continuous state-action spaces, enabling generalization from limited data but introducing approximation error. These methods represent Q(s,a) or V(s) parametrically, e.g., \hat{Q}(s,a; \theta) \approx Q^*(s,a), and optimize the parameters \theta via gradient descent on a loss such as the squared temporal-difference error. While this allows scalability to high-dimensional problems, the approximation can lead to biased estimates and instability, as the true value function may lie outside the approximator's hypothesis space.

The primary trade-offs between tabular and approximate methods center on exactness versus scalability. Tabular methods provide strong theoretical guarantees, including convergence to optimal policies in finite MDPs, but suffer from the curse of dimensionality, requiring exponential storage and time in the number of states and actions—impractical beyond small problems like gridworlds. Approximate methods mitigate this by leveraging inductive biases for generalization, scaling to real-world applications, yet they lack such guarantees and can diverge due to the "deadly triad": combining function approximation with bootstrapping (e.g., TD updates) and off-policy learning (e.g., a behavior policy differing from the target policy) often causes divergence or poor performance. For example, classic tabular Q-learning succeeds on small discrete tasks like Frozen Lake, achieving near-optimal policies within thousands of episodes, whereas deep Q-networks (DQN) extend it to high-dimensional Atari games using convolutional neural networks for function approximation, reaching human-level performance on 29 tasks despite the triad's risks, mitigated by techniques like experience replay and target networks.
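The value-iteration backup above can be written in a few lines for a small random MDP; the transition and reward tables here are randomly generated placeholders rather than a benchmark problem.

```python
import numpy as np

# Minimal sketch of tabular value iteration on a small random MDP, applying
# the Bellman optimality backup from the text until the values stop changing.
rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA = 6, 2, 0.9

# Random transition probabilities P[s, a, s'] and rewards R[s, a].
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))
R = rng.uniform(0, 1, size=(N_STATES, N_ACTIONS))

V = np.zeros(N_STATES)
while True:
    # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    Q = R + GAMMA * P @ V
    V_new = Q.max(axis=1)                  # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:   # contraction guarantees convergence
        break
    V = V_new

print("Optimal values:", np.round(V, 3))
print("Greedy policy :", Q.argmax(axis=1))
```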

Model-Free vs. Model-Based Paradigms

In reinforcement learning, the model-free paradigm focuses on directly learning a policy or value function from interactions with the environment without explicitly constructing a model of the environment's dynamics. This approach is conceptually simple and robust to potential errors in modeling the transition function or reward structure, as it does not rely on such approximations. However, model-free methods are typically sample-inefficient, requiring large amounts of real-world or simulated experience to achieve high performance, particularly in complex or high-dimensional environments. In contrast, the model-based paradigm involves learning or utilizing a model of the environment's dynamics to enable planning and simulation-based optimization, which enhances data efficiency by generating additional synthetic experiences for policy improvement. This allows faster convergence toward optimal policies when the model accurately captures the true dynamics, as demonstrated in benchmarks where model-based methods require 20-40 times fewer samples than model-free counterparts in gridworld tasks with known dynamics. Nevertheless, model-based approaches can suffer from model bias, where inaccuracies in the learned dynamics lead to suboptimal planning and degraded performance, especially in partially observable or stochastic settings.

To address the limitations of each paradigm, hybrid approaches, such as model-based model-free (MBMF) methods, integrate dynamics models to inform and accelerate model-free policy optimization, combining the sample efficiency of planning with the robustness of direct learning (a Dyna-style sketch of this idea follows below). These hybrids leverage the learned model as a simulator to guide exploration and updates in model-free algorithms, achieving improved performance in robotic manipulation tasks compared to pure model-free baselines. Recent advancements, including world models—generative neural networks that learn compressed representations of the environment for latent-space planning—have further bridged the gap between paradigms by enabling end-to-end model-based reasoning in model-free-like training regimes, as seen in applications to continuous control problems.
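A Dyna-style agent is one simple way to combine the two paradigms: real transitions both update the value function directly and fit a tabular model, which then supplies extra simulated updates. The chain environment and planning budget below are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of a Dyna-style hybrid: real transitions update Q directly
# (model-free) and also populate a tabular model, which then generates extra
# simulated updates (model-based planning) to improve sample efficiency.
rng = np.random.default_rng(0)
N, GAMMA, ALPHA, EPS, PLAN_STEPS = 8, 0.95, 0.1, 0.1, 20

Q = np.zeros((N, 2))
model = {}                                     # (s, a) -> (r, s') from experience

def step(s, a):
    """Deterministic chain environment: action 1 moves right toward the goal."""
    s2 = min(N - 1, s + 1) if a == 1 else max(0, s - 1)
    return (1.0 if s2 == N - 1 else 0.0), s2

def td_update(s, a, r, s2):
    target = r + (0.0 if s2 == N - 1 else GAMMA * Q[s2].max())
    Q[s, a] += ALPHA * (target - Q[s, a])

for episode in range(50):
    s = 0
    while s != N - 1:
        a = rng.integers(2) if rng.random() < EPS else int(Q[s].argmax())
        r, s2 = step(s, a)
        td_update(s, a, r, s2)                 # model-free update from real data
        model[(s, a)] = (r, s2)                # record experience in the model
        for _ in range(PLAN_STEPS):            # planning: replay simulated steps
            ps, pa = list(model)[rng.integers(len(model))]
            pr, ps2 = model[(ps, pa)]
            td_update(ps, pa, pr, ps2)
        s = s2

print("Greedy policy (1 = move right):", Q.argmax(axis=1))
```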

Empirical Performance Metrics

Empirical performance in reinforcement learning is evaluated using key metrics that quantify an agent's effectiveness, efficiency, and reliability across diverse tasks. Cumulative reward measures the total discounted rewards accumulated over episodes, providing a direct indicator of policy quality. Sample efficiency assesses the number of interactions required to achieve a predefined performance threshold, such as an average episodic return of 195 in the CartPole-v1 environment. Robustness is often gauged by the variance in returns across multiple random seeds, highlighting the stability of algorithms under stochastic conditions.

Standard benchmarks facilitate standardized comparisons of RL algorithms. The Atari suite, based on the Arcade Learning Environment (ALE), tests discrete action spaces in 57 games, emphasizing high-dimensional visual inputs and long-term credit assignment. MuJoCo environments, now part of the DeepMind Control Suite, focus on continuous control tasks like locomotion, evaluating physical fidelity and smooth execution. Procgen provides procedurally generated levels to probe generalization, measuring adaptation to novel instances without retraining. These benchmarks are widely adopted for their reproducibility and coverage of RL challenges.

To illustrate comparative performance, consider evaluations of Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) on selected Atari games, where scores represent mean episodic returns over 100 episodes after training. DQN, an off-policy value-based method, achieves superhuman performance on several titles, such as 401.2 on Breakout compared to a human baseline of 31.8. PPO, an on-policy policy gradient approach, yields competitive results like 414.7 on Breakout but shows variability on others, such as 1916 on Beam Rider versus DQN's 7452. These differences underscore DQN's strength in exploration-heavy games and PPO's stability in policy optimization.
Game          DQN Score (mean)    PPO Score (mean ± std)    Human Baseline
Breakout      401.2               414.7 ± 28.1              31.8
Pong          20.9                20.4 ± 0.2                14.6
Beam Rider    7451.6              1915.9 ± 484.6            16926.5
Source: Adapted from Mnih et al. (2015), CleanRL benchmarks (2024), and standardized human baselines (Wang et al., 2016).

On simpler tasks like CartPole, sample efficiency highlights algorithmic trade-offs. PPO typically reaches the solve threshold (average return ≥ 195) in approximately 20,000-50,000 timesteps, benefiting from on-policy updates and parallel sampling. In contrast, DQN variants may require 50,000-100,000 timesteps due to off-policy replay but offer greater robustness to hyperparameter variations. These metrics are derived from standardized implementations ensuring fair comparisons. To validate differences, statistical tests such as paired t-tests are applied to performance distributions across seeds, confirming significance (e.g., p < 0.05) in benchmark rankings. This practice mitigates overinterpretation of mean scores alone.

As of 2025, emerging metrics for continual reinforcement learning address lifelong adaptation. The forgetting rate quantifies performance degradation on prior tasks after learning new ones, calculated as the relative drop in average returns (e.g., (initial - final)/initial). Benchmarks like sequential Gym control tasks and Procgen sequences evaluate this, with methods like C-CHAIN reducing forgetting by up to 30% compared to baselines. These metrics emphasize plasticity without catastrophic interference.
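The metrics described in this section can be computed directly from logged returns. The sketch below implements steps-to-threshold sample efficiency, across-seed robustness, and the forgetting rate on synthetic training curves; the numbers are placeholders, not results from any benchmark.

```python
import numpy as np

# Minimal sketch of the evaluation metrics discussed above, computed from a
# hypothetical log of episodic returns per seed and per task.
def steps_to_threshold(returns, threshold=195.0, window=100):
    """Sample efficiency: first episode whose trailing-window mean clears the bar."""
    for t in range(window, len(returns) + 1):
        if np.mean(returns[t - window:t]) >= threshold:
            return t
    return None

def seed_robustness(final_returns_per_seed):
    """Robustness: mean and standard deviation of final performance across seeds."""
    arr = np.asarray(final_returns_per_seed, dtype=float)
    return arr.mean(), arr.std()

def forgetting_rate(initial_return, final_return):
    """Continual RL: relative drop on an old task after training on new ones."""
    return (initial_return - final_return) / initial_return

# Synthetic example data (stand-ins for real training logs).
rng = np.random.default_rng(0)
curve = np.minimum(220, np.linspace(0, 230, 60_000)) + rng.normal(0, 5, 60_000)
print("episodes to solve CartPole-style task:", steps_to_threshold(curve))
print("across-seed mean/std:", seed_robustness([198.0, 201.5, 187.2, 204.8, 195.1]))
print("forgetting rate:", round(forgetting_rate(400.0, 280.0), 3))
```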