
Deep reinforcement learning

Deep reinforcement learning (DRL) is a paradigm that integrates reinforcement learning principles with deep neural networks to enable autonomous agents to learn optimal decision-making policies through trial-and-error interactions with complex, high-dimensional environments, maximizing cumulative rewards without explicit supervision. In this framework, deep networks approximate value functions or policies, allowing DRL to handle raw sensory inputs like images or audio, which traditional methods struggle with due to the curse of dimensionality.

The field gained prominence with early breakthroughs in 2013, when the Deep Q-Network (DQN) algorithm demonstrated the first successful application of deep learning to reinforcement learning by training a convolutional neural network to play Atari games directly from pixel inputs, achieving superhuman performance in several tasks. This was followed in 2015 by an advanced DQN variant that reached human-level control across 49 Atari 2600 games, addressing challenges like unstable training through techniques such as experience replay and target networks. A landmark milestone came in 2016 with AlphaGo, which combined deep reinforcement learning with Monte Carlo tree search to defeat world champion Lee Sedol in the game of Go, showcasing DRL's ability to master strategic games with vast state spaces previously deemed intractable for AI. Subsequent advances include the use of DRL in training large language models through reinforcement learning from human feedback (RLHF), enabling improved reasoning and alignment in AI systems as of the 2020s.

DRL has since expanded to diverse applications, including robotics for manipulation tasks, autonomous vehicle navigation, resource optimization in energy systems, fine-tuning of large models, and healthcare, where agents learn adaptive behaviors from simulated or real-world interactions. Notable successes include robotic arms for dexterous object handling and optimizing trading strategies in finance by simulating market dynamics. However, the approach faces significant challenges, such as sample inefficiency (often requiring millions of interactions for convergence), exploration-exploitation trade-offs in sparse-reward environments, and issues with generalization and safety in real-world deployments. Ongoing research focuses on improving scalability, interpretability, and robustness to bridge these gaps.

Fundamentals

Deep Learning

Deep learning is a subset of machine learning that employs multi-layered artificial neural networks to automatically learn hierarchical representations of data, enabling the modeling of complex patterns without explicit feature engineering. These networks excel in perception tasks by transforming raw inputs into high-level abstractions through successive layers of nonlinear processing.

The core components of deep neural networks include artificial neurons, which mimic biological counterparts by computing a weighted sum of inputs plus a bias, followed by an activation function to introduce nonlinearity. Networks are structured in layers: an input layer that receives data features, one or more hidden layers that perform intermediate computations, and an output layer that produces predictions or classifications. Common activation functions include the sigmoid, defined as \sigma(x) = \frac{1}{1 + e^{-x}}, which outputs values between 0 and 1, and the rectified linear unit (ReLU), f(x) = \max(0, x), which promotes sparsity and faster convergence in training. Training occurs via the backpropagation algorithm, which efficiently computes gradients of the loss with respect to the weights by applying the chain rule in reverse through the network layers.

The training process optimizes network parameters by minimizing a loss function that quantifies prediction errors, such as mean squared error (MSE), \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, for regression problems. Gradient descent iteratively updates weights in the direction opposite to the gradient of the loss, with variants like stochastic gradient descent using mini-batches for efficiency. To mitigate overfitting, where models memorize training data at the expense of generalization, techniques like dropout randomly deactivate neurons during training to encourage robustness, and regularization adds a penalty term, such as the L2 norm \lambda \sum w^2, to the loss to constrain weight magnitudes.

Historically, deep learning traces its roots to the perceptron, a single-layer neural model proposed by Frank Rosenblatt in 1958 for binary classification tasks, which laid the groundwork for connectionist approaches despite limitations exposed by the XOR problem. The field experienced a revival in 2006 with the introduction of deep belief networks by Geoffrey Hinton and colleagues, which used unsupervised pre-training to initialize deep architectures, overcoming vanishing gradient issues and enabling effective learning in multi-layer networks.

A prominent example of deep learning architectures is the convolutional neural network (CNN), specialized for processing grid-like data such as images. Pioneered by Yann LeCun in the late 1980s and refined in his 1998 work on document recognition, CNNs incorporate convolutional layers that apply learnable filters to detect local features like edges or textures, followed by pooling layers that downsample activations to capture spatial hierarchies while reducing computational load and providing a degree of translation invariance. This design allows CNNs to approximate image-to-label functions efficiently, as demonstrated in tasks like handwritten digit recognition, where error rates dropped significantly compared to prior methods.
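As a concrete illustration of these components, the following minimal sketch trains a one-hidden-layer network on a toy regression problem with plain NumPy; the data, layer sizes, and learning rate are invented for demonstration and are not drawn from any particular system.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(256, 2))                 # toy input features
y = X[:, :1] ** 2 + 0.5 * X[:, 1:]                    # toy regression target

# one hidden layer with ReLU, one linear output layer
W1, b1 = rng.normal(0.0, 0.5, (2, 16)), np.zeros(16)
W2, b2 = rng.normal(0.0, 0.5, (16, 1)), np.zeros(1)
lr = 0.1

for step in range(2000):
    # forward pass: weighted sums plus biases, ReLU nonlinearity
    h_pre = X @ W1 + b1
    h = np.maximum(0.0, h_pre)                        # ReLU activation
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)                  # MSE loss

    # backward pass (chain rule), then a plain gradient-descent step
    d_yhat = 2.0 * (y_hat - y) / len(X)
    dW2, db2 = h.T @ d_yhat, d_yhat.sum(axis=0)
    d_h = (d_yhat @ W2.T) * (h_pre > 0)               # ReLU derivative
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final MSE: {loss:.4f}")
```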

Reinforcement Learning

Reinforcement learning (RL) is a paradigm in machine learning where an agent learns to make sequential decisions by interacting with an environment through trial and error, aiming to maximize the cumulative reward over time. The process involves the agent observing the current state of the environment, selecting an action, receiving a reward or penalty, and transitioning to a new state, with learning occurring based on the feedback from these interactions. Unlike supervised learning, which relies on labeled examples, RL focuses on delayed and sparse rewards, enabling the agent to discover optimal behaviors in complex, dynamic settings without explicit instructions.

The foundational framework for RL is the Markov decision process (MDP), a mathematical model that formalizes sequential decision-making under uncertainty. An MDP is defined by a tuple (S, A, P, R, \gamma), where S is the set of states representing the environment's configuration, A is the set of possible actions the agent can take, P(s'|s,a) denotes the transition probabilities to next states s' given state s and action a, R(s,a) is the reward function providing immediate feedback, and \gamma \in [0,1) is the discount factor that prioritizes immediate over future rewards. Central to solving MDPs is the Bellman equation, which expresses the optimal value function V^*(s) for a state s as the maximum expected return achievable from that state onward:

V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right]

This recursive equation, derived from dynamic programming principles, allows computation of optimal policies by breaking down long-term value into one-step lookahead decisions.

Key elements in RL include the agent, which perceives and acts; the policy \pi(a|s), a mapping from states to actions that can be deterministic (\pi(s) = a) or stochastic (a probability distribution over actions); and value functions that estimate expected returns, such as the state-value function V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s \right] under policy \pi, or the action-value function Q^\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s, a_0 = a \right]. A core challenge is the exploration-exploitation dilemma, where the agent must balance trying new actions to discover better rewards (exploration) against leveraging known high-reward actions (exploitation) to avoid suboptimal performance.

Classic algorithms for solving MDPs in tabular form, where values are stored in lookup tables, include Q-learning and SARSA. Q-learning is an off-policy temporal-difference method that learns the optimal action-value function Q^*(s,a) via the update rule:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

where \alpha is the learning rate and the maximum over next actions promotes optimality regardless of the current policy. In contrast, SARSA is an on-policy algorithm that updates Q^\pi(s,a) based on the action actually selected by the policy:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s',a') - Q(s,a) \right]

ensuring updates align with the behavior policy for safer learning in stochastic environments. These methods converge to optimal solutions in finite MDPs under appropriate conditions, such as sufficient exploration and decreasing learning rates.

Tabular RL methods, however, suffer from the curse of dimensionality, becoming computationally infeasible in environments with large or continuous state-action spaces, as the table size grows exponentially with dimensionality. This limitation motivates the use of function approximation techniques, such as deep neural networks, to generalize across states in high-dimensional settings.
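The sketch below runs the tabular Q-learning update above on a toy five-state chain, using an \epsilon-greedy behavior policy; the environment, rewards, and hyperparameters are made up for illustration.

```python
import numpy as np

n_states, n_actions = 5, 2                 # actions: 0 = left, 1 = right
gamma, alpha, eps = 0.95, 0.1, 0.1
Q = np.zeros((n_states, n_actions))        # tabular action-value estimates
rng = np.random.default_rng(0)

def step(s, a):
    """Move along the chain; reaching the last state pays +1 and resets."""
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    if s_next == n_states - 1:
        return 0, 1.0, True                # reset to start, reward, done
    return s_next, 0.0, False

s = 0
for t in range(20000):
    # epsilon-greedy exploration
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r, done = step(s, a)
    # Q-learning update: bootstrap from the greedy value of the next state
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
    s = s_next

print(np.round(Q, 2))                      # right-moving actions should dominate
```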

Integration of Deep Learning and Reinforcement Learning

Deep reinforcement learning (DRL) is a subfield of machine learning that integrates deep neural networks with reinforcement learning algorithms, employing them as function approximators for key components such as policies, value functions, and models within Markov decision processes. This fusion enables agents to learn optimal behaviors in complex environments by approximating high-dimensional functions that map states to actions or values, overcoming the limitations of traditional tabular methods, which scale poorly beyond low-dimensional discrete spaces.

The primary motivation for integrating deep learning into reinforcement learning stems from the instability encountered when using linear function approximators in RL settings, particularly when combined with bootstrapping and off-policy learning, a combination known as the "deadly triad." Linear approximators often fail to capture the non-linearities inherent in continuous or combinatorial state-action spaces, leading to divergence and poor convergence in value estimation. DRL addresses these issues by leveraging deep neural networks, such as convolutional neural networks for processing raw inputs, to provide non-linear approximations that handle high-dimensional sensory data directly through end-to-end learning, eliminating the need for manual feature engineering.

A seminal framework illustrating this integration is the Deep Q-Network (DQN), which uses a deep convolutional neural network to approximate the action-value function and learn policies from unstructured visual inputs in Atari 2600 games. To mitigate the deadly triad's instabilities, DQN incorporates experience replay, which stores and randomly samples past transitions to decorrelate data and stabilize training, and target networks, which maintain a fixed copy of the Q-network for target computation to reduce feedback loops. These mechanisms allow DRL to scale to environments with millions of states, such as video games or robotic control tasks.

The advantages of this integration include enhanced scalability to complex, high-dimensional environments and the ability to perform representation learning directly from raw observations, like images or sensor readings, enabling agents to discover hierarchical features autonomously. For instance, DQN achieved human-level performance on 49 Atari games using only pixel inputs and game scores, demonstrating how deep networks facilitate generalization across diverse tasks without domain-specific priors.
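A minimal PyTorch sketch of the two stabilizing mechanisms described above (a randomly sampled replay buffer and a periodically synchronized target network) is shown below; the environment, network sizes, and hyperparameters are placeholders rather than a reference implementation of DQN.

```python
import random
from collections import deque

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99

def make_q_net():
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = make_q_net(), make_q_net()
target_net.load_state_dict(q_net.state_dict())        # start from identical weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=50_000)                          # experience replay buffer

def dqn_update(batch_size=32):
    batch = random.sample(replay, batch_size)          # random sampling decorrelates data
    s  = torch.tensor([t[0] for t in batch], dtype=torch.float32)
    a  = torch.tensor([t[1] for t in batch], dtype=torch.int64)
    r  = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    s2 = torch.tensor([t[3] for t in batch], dtype=torch.float32)
    d  = torch.tensor([t[4] for t in batch], dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                              # fixed target network reduces feedback
        target = r + gamma * (1.0 - d) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# fill the buffer with random placeholder transitions so one update can run
for _ in range(1000):
    replay.append(([random.random() for _ in range(obs_dim)],
                   random.randrange(n_actions), 0.0,
                   [random.random() for _ in range(obs_dim)], False))
print(dqn_update())
# periodically: target_net.load_state_dict(q_net.state_dict())
```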

History

Early Developments

The early developments in deep reinforcement learning trace back to the integration of neural networks with reinforcement learning principles during the 1980s and 1990s, laying theoretical and practical foundations for handling complex sequential decision-making problems. Pioneering work emphasized approximate dynamic programming methods augmented by neural architectures to address the curse of dimensionality in large state spaces. A seminal contribution was the 1998 textbook Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, which formalized core RL concepts and inspired extensions using neural function approximators for value estimation and policy learning.

One of the first notable applications of neural networks in reinforcement learning emerged in 1992 with TD-Gammon, developed by Gerald Tesauro, which employed temporal-difference learning to train a multi-layer perceptron for evaluating backgammon positions. By self-playing millions of games, TD-Gammon achieved expert-level performance, demonstrating the potential of neural function approximation for board games while highlighting the efficacy of TD methods in bootstrapping learning from incomplete information. Concurrently, the 1996 book Neuro-Dynamic Programming by Dimitri P. Bertsekas and John N. Tsitsiklis provided a rigorous framework for combining neural networks with dynamic programming, introducing techniques like approximate value iteration to mitigate computational intractability in high-dimensional environments.

In the 2000s, advancements focused on batch-mode algorithms to improve efficiency, with fitted Q-iteration proposed by Ernst and colleagues in 2005 as a method to iteratively approximate the Q-function using supervised learning on collected trajectories. Building on this, Riedmiller's 2005 neural fitted Q-iteration extended the approach by parameterizing the Q-function with multi-layer perceptrons, enabling effective learning in continuous state spaces through repeated fitting on experience replay buffers. These methods marked progress in scaling to more realistic tasks but encountered significant challenges, including training instability due to non-stationary distributions and the deadly triad of function approximation, bootstrapping, and off-policy learning, which often led to divergent policies.

Early experiments in this period applied shallow neural networks to RL benchmarks resembling later video game domains, such as simple arcade environments, where linear or single-hidden-layer approximators achieved modest performance on tasks like cart-pole balancing or basic pursuit-evasion but struggled with high-dimensional visual inputs. Pre-deep RL milestones also included the rise of Monte Carlo tree search (MCTS) around 2006 for games like Go, which, though initially without neural enhancements, served as a search-based precursor to later hybrid RL systems by efficiently exploring action trees in combinatorial domains. Reflections on these eras, as articulated in Richard Sutton's 2019 essay "The Bitter Lesson," underscore how early reliance on domain-specific knowledge often hindered scalability, paving the way for computation-driven neural methods.

Breakthroughs and Milestones

The field of deep reinforcement learning (DRL) saw its first major breakthrough in 2013–2015 with the development of Deep Q-Networks (DQN) by researchers at DeepMind. Introduced in a seminal 2015 paper, DQN combined deep neural networks with Q-learning to enable end-to-end learning directly from high-dimensional pixel inputs, achieving human-level performance on a suite of 49 Atari 2600 games without prior knowledge of game rules. This work introduced key innovations such as experience replay, which stabilizes training by sampling past experiences uniformly from a replay buffer, and target networks to mitigate moving-target problems in value estimation. Building on DQN, Double DQN addressed overestimation biases in Q-value approximations, leading to more robust performance across Atari tasks and demonstrating up to 50% improvements in scores on challenging games like Seaquest.

In 2016, DeepMind's AlphaGo marked a pivotal milestone by defeating the world champion Go player Lee Sedol in a historic five-game match, showcasing DRL's potential in complex strategic domains. AlphaGo integrated deep convolutional neural networks for policy and value estimation with Monte Carlo tree search (MCTS), trained through supervised learning from human games and self-play reinforcement learning, achieving superhuman performance with a 99.8% win rate against top Go programs. This success highlighted the power of combining model-free policy gradients with model-based search, influencing subsequent DRL applications beyond games.

The 2017 release of AlphaZero extended these advances to multiple board games, including chess and shogi, using a single algorithm that learned through self-play without human knowledge. AlphaZero surpassed world-champion programs like Stockfish in chess after just four hours of training on a single machine, attaining superhuman levels in all three domains (Go, chess, shogi) via unified neural networks for move prediction and value estimation coupled with MCTS. Its impact lay in demonstrating scalable, general-purpose DRL that could master diverse rulesets, inspiring broader adoption in planning and optimization tasks.

From 2018 to 2019, OpenAI's Five system applied DRL at unprecedented scale to the game Dota 2, defeating professional human teams including world champions in a best-of-three series. Trained via self-play with Proximal Policy Optimization on 256 GPUs and 128,000 CPU cores over ten months, OpenAI Five handled the game's vast action space (over 20,000 possible actions per turn) and partial observability, achieving superhuman coordination in five-versus-five matches. This milestone underscored DRL's viability in multi-agent, continuous-action environments, bridging simulated games to potential real-world coordination and control applications.

In 2019, DeepMind's MuZero advanced model-based DRL by learning latent models of environments without access to rules or state transitions, excelling in Atari games, Go, chess, and shogi. MuZero combined a representation network, dynamics predictor, and prediction function with MCTS for planning, outperforming prior model-free methods on the Atari benchmark and achieving state-of-the-art median human-normalized scores while matching superhuman performance in board games. Published in 2020, it represented a shift toward more sample-efficient, generalizable planning in unknown domains.

Entering the 2020s, scalable model-based methods like DreamerV2 emerged, achieving state-of-the-art results on the Atari benchmark through discrete world models that imagined latent trajectories for policy learning. DreamerV2 reached human-level performance across 55 Atari games using a single GPU, surpassing prior model-based agents by leveraging a Recurrent State-Space Model (RSSM) for efficient imagination-based training.
Concurrently, the 2021 Decision Transformer reframed offline DRL as sequence modeling with transformers, conditioning actions on desired returns to generate expert-level trajectories in Atari and Key-to-Door tasks, outperforming state-of-the-art offline baselines like CQL by generating high-reward policies from fixed datasets without online interaction. In 2022, reinforcement learning from human feedback (RLHF) gained prominence, powering the alignment of large language models like ChatGPT through reward modeling and proximal policy optimization (PPO), enabling scalable incorporation of human preferences in generative AI. Further progress included Voyager in 2023, which integrated LLMs for automatic curriculum and skill libraries in Minecraft, advancing open-ended exploration and lifelong learning in expansive environments. These developments facilitated transitions from purely simulated benchmarks to real-world deployments, such as robotics and autonomous systems, emphasizing sample efficiency and data utilization.

Core Algorithms

Value-Based Methods

Value-based methods approximate the optimal action-value function Q^*(s, a), which estimates the expected discounted return starting from state s, taking action a, and following the optimal policy thereafter, using a neural network parameterized by \theta, denoted Q_\theta(s, a). The corresponding policy is derived greedily via \pi(s) = \arg\max_a Q_\theta(s, a), enabling off-policy control without directly parameterizing the policy. This paradigm builds on tabular Q-learning by employing deep networks to handle high-dimensional inputs, such as raw pixels from environments like Atari 2600 games, where function approximation is essential for scalability.

The foundational algorithm, Deep Q-Network (DQN), trains the network by minimizing the squared Bellman error on transitions (s, a, r, s') sampled from an experience replay buffer, which decorrelates data and allows reuse of past experiences for stable off-policy learning:

L(\theta) = \mathbb{E} \left[ \left( r + \gamma \max_{a'} Q_{\theta'}(s', a') - Q_\theta(s, a) \right)^2 \right],

where \theta' denotes a target network's parameters, periodically copied from \theta to mitigate instability from moving targets, \gamma is the discount factor, and the expectation is over the replay buffer distribution. Exploration is facilitated by an \epsilon-greedy policy, selecting random actions with probability \epsilon (typically annealed from 1.0 to 0.1), balancing exploitation of learned values with discovery of new states. These mechanisms enabled DQN to achieve human-level performance on many Atari games, surpassing prior approaches based on hand-crafted features.

Subsequent variants addressed key limitations in DQN. Double DQN reduces the overestimation bias inherent in standard Q-learning by decoupling action selection (using the online network) from evaluation (using the target network) in the target computation, yielding more accurate value estimates and improved stability across benchmarks. Dueling DQN enhances representation efficiency by factoring the Q-network into separate streams for the state value function V(s) and advantages A(s, a), combined as Q_\theta(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right), which better captures states where action differences are minor relative to the base value.

Rainbow integrates multiple enhancements into a single framework, including Double DQN, dueling networks, prioritized experience replay (sampling transitions proportional to their temporal-difference error magnitude for efficient learning from informative experiences), multi-step returns (bootstrapping over n > 1 steps to incorporate immediate rewards and reduce single-step bias), distributional reinforcement learning via the C51 algorithm (modeling the full return distribution Z_\theta(s, a) as a categorical distribution over fixed support atoms, updated via distributional Bellman projections), and noisy networks (parametric noise in the weights) for integrated exploration. This combination substantially outperforms the individual variants, achieving superior sample efficiency and scores on the Atari benchmark, such as exceeding 90% of human-level performance across 39 games.

Despite these advances, value-based methods are inherently suited to discrete action spaces, where the argmax yields a unique maximizer efficiently. In continuous spaces, approximating the action maximizer requires non-trivial optimizations, often resulting in training instability or suboptimal policies without further modifications.
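The sketch below illustrates two of the refinements described above: a dueling head that recombines value and mean-centered advantage streams, and the Double DQN target in which the online network selects the next action while the target network evaluates it. Layer sizes and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s) stream
        self.adv = nn.Linear(hidden, n_actions)        # A(s, a) stream

    def forward(self, s):
        h = self.body(s)
        v, a = self.value(h), self.adv(h)
        return v + a - a.mean(dim=1, keepdim=True)     # Q = V + (A - mean A)

def double_dqn_target(r, s_next, done, online, target, gamma=0.99):
    with torch.no_grad():
        best_a = online(s_next).argmax(dim=1, keepdim=True)      # selection (online net)
        q_next = target(s_next).gather(1, best_a).squeeze(1)     # evaluation (target net)
        return r + gamma * (1.0 - done) * q_next

online, target = DuelingQNet(), DuelingQNet()
target.load_state_dict(online.state_dict())
s_next, r, done = torch.randn(8, 4), torch.zeros(8), torch.zeros(8)
print(double_dqn_target(r, s_next, done, online, target).shape)  # torch.Size([8])
```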

Policy-Based Methods

Policy-based methods in deep reinforcement learning directly parameterize the policy \pi_\theta(a|s), which maps states s to actions a (often stochastically), using a deep neural network with parameters \theta. These methods optimize the policy by ascending the gradient of the expected return J(\theta), avoiding explicit value function estimation and instead focusing on direct policy search, which is particularly effective for environments with continuous or high-dimensional action spaces.

The foundation of these approaches is the policy gradient theorem, which states that the gradient of the performance objective is

\nabla_\theta J(\theta) \approx \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A(s_t, a_t) \right],

where \tau is a trajectory, T is the episode length, and A(s_t, a_t) is the advantage function estimating the relative value of a_t in state s_t. This formulation derives from differentiating the expected return under the parameterized policy and enables gradient-based optimization of complex, nonlinear policies represented by deep networks.

A seminal algorithm in this class is REINFORCE, which computes policy gradients using Monte Carlo sampling of complete trajectories to estimate returns, yielding an unbiased but high-variance gradient update. To mitigate variance, REINFORCE incorporates baselines, such as a state-value estimate V(s), subtracted from returns to form advantages without introducing bias: the update becomes \nabla_\theta J(\theta) \approx \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot (G - V(s)) \right], where G is the realized return; this technique stabilizes training in practice.

Building on these ideas, subsequent variants enhance scalability and stability through parallelization and constrained updates. Asynchronous Advantage Actor-Critic (A3C) employs multiple parallel environments to generate diverse trajectories asynchronously, updating a shared network via on-policy gradients to accelerate learning and reduce correlation in samples. Trust Region Policy Optimization (TRPO) addresses destructive large-step updates by constraining policy changes within a trust region, enforcing a bound on the Kullback–Leibler divergence between old and new policies during optimization: \max_\theta \mathbb{E} [L(\theta)] subject to \mathbb{E} [D_{KL}(\pi_{\theta_{old}} \| \pi_\theta)] \leq \delta, where L(\theta) is a surrogate objective; this ensures monotonic improvement in performance. Proximal Policy Optimization (PPO) simplifies TRPO's constraints with a clipped surrogate objective,

L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right],

where r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} is the probability ratio and \hat{A}_t is an advantage estimate; this prevents excessive policy shifts while allowing multiple epochs of minibatch updates on the same data, making it computationally efficient and widely adopted.

These methods excel in handling high-dimensional continuous action spaces, such as robotic control tasks, where discrete value-based approaches struggle with action selection, and their stochastic nature naturally accommodates noisy or multimodal policies.
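The clipped surrogate objective above can be written compactly for a categorical policy, as in the sketch below; the network shape is arbitrary, and the old log-probabilities and advantage estimates are random placeholders standing in for quantities computed from rollouts.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, eps_clip = 4, 3, 0.2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

def ppo_clip_loss(states, actions, old_log_probs, advantages):
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)            # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantages
    # maximizing min(unclipped, clipped) == minimizing its negation
    return -torch.min(unclipped, clipped).mean()

states = torch.randn(16, obs_dim)
actions = torch.randint(0, n_actions, (16,))
old_log_probs = torch.randn(16).clamp(-3, 0)                 # placeholder values
advantages = torch.randn(16)
loss = ppo_clip_loss(states, actions, old_log_probs, advantages)
loss.backward()
print(float(loss))
```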

Actor-Critic Methods

Actor-critic methods in deep reinforcement learning integrate elements of both policy-based and value-based approaches by employing two distinct neural networks: an actor that learns a parameterized policy to select actions, and a critic that evaluates the expected returns of those actions to guide policy improvement. This structure enables more stable learning compared to standalone policy-gradient methods, as the critic provides a baseline to reduce variance during policy updates. These methods are particularly effective for both discrete and continuous action spaces, with the actor typically outputting a probability distribution over actions or a deterministic action, while the critic approximates either the state-value function V(s) or the action-value function Q(s, a).

The core architecture of actor-critic methods features the actor network, parameterized by \theta, which outputs an action distribution \pi_\theta(\cdot | s) or a deterministic action \mu_\theta(s) given state s, and the critic network, parameterized by \phi, which estimates values such as Q_\phi(s, a) or V_\phi(s). The actor is updated by ascending an estimated policy gradient derived from the critic's value estimates, often using the advantage function A(s, a) = Q(s, a) - V(s) to further reduce variance. This setup allows for on-policy or off-policy learning, where the actor and critic can be trained using trajectories generated by the current policy or a separate behavior policy.

A foundational advancement is the Advantage Actor-Critic (A2C) algorithm, a synchronous variant of the Asynchronous Advantage Actor-Critic (A3C) method, which trains multiple workers in parallel but updates a shared model synchronously after collecting experiences. In A2C, the objective combines policy gradient ascent for the actor, value function regression for the critic, and an entropy bonus to encourage exploration, formulated as a joint loss: a policy term based on \log \pi_\theta(a|s) \cdot A(s,a), minus a value loss \left( V_\phi(s) - R \right)^2, plus an entropy term H(\pi_\theta(\cdot|s)). This approach achieves strong performance on Atari games with fewer computational resources than fully asynchronous methods.

The Soft Actor-Critic (SAC) algorithm, introduced in 2018, extends actor-critic methods to off-policy settings by incorporating maximum entropy regularization, maximizing the expected return plus an entropy term: J(\pi) = \mathbb{E} \left[ \sum_t \left( r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot | s_t)) \right) \right], where \alpha is a temperature parameter automatically tuned to balance reward and exploration. SAC uses separate critic networks for target value estimation and a stochastic actor, enabling robust learning in continuous control tasks like robotic locomotion, often outperforming prior methods in sample efficiency and asymptotic performance on MuJoCo benchmarks.

For continuous action spaces, the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, proposed in 2018, builds on deterministic actor-critic frameworks to mitigate overestimation bias in Q-value critics through three key innovations: twin critics with the minimum value selected for updates, delayed policy updates every few critic steps, and target action smoothing via Gaussian noise. TD3 updates the actor \mu_\theta(s) and critics Q_{\phi_1}(s,a), Q_{\phi_2}(s,a) using off-policy data, achieving superior results on continuous control benchmarks such as HalfCheetah and Hopper compared to its predecessor DDPG.
The Deterministic Policy Gradient (DPG) theorem underpins many continuous-action actor-critic methods, stating that the gradient of the performance objective with respect to actor parameters is \nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu} \left[ \nabla_a Q(s,a) \big|_{a=\mu_\theta(s)} \nabla_\theta \mu_\theta(s) \right], where \rho^\mu is the discounted state distribution under \mu, allowing optimization of deterministic policies using critic gradients. This theorem enables off-policy learning with replay buffers, as demonstrated in early applications to high-dimensional control. Actor-critic methods offer reduced variance in policy gradients relative to pure policy-based approaches, thanks to the critic's baseline, while supporting off-policy efficiency for better sample utilization compared to on-policy-only methods. These benefits have made them foundational for scalable deep RL in complex environments.
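The sketch below shows how TD3's critic target, described above, combines a target actor with clipped Gaussian smoothing noise and the minimum of two target critics; dimensions, network shapes, and noise scales are illustrative assumptions rather than a reference implementation.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, act_limit = 8, 2, 1.0

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor_target = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh())
critic1_target = mlp(obs_dim + act_dim, 1)
critic2_target = mlp(obs_dim + act_dim, 1)

def td3_target(r, s_next, done, gamma=0.99, noise_std=0.2, noise_clip=0.5):
    with torch.no_grad():
        a_next = act_limit * actor_target(s_next)
        noise = (noise_std * torch.randn_like(a_next)).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-act_limit, act_limit)   # target policy smoothing
        sa = torch.cat([s_next, a_next], dim=1)
        q_min = torch.min(critic1_target(sa), critic2_target(sa)).squeeze(1)
        return r + gamma * (1.0 - done) * q_min                  # clipped double-Q target

s_next, r, done = torch.randn(32, obs_dim), torch.zeros(32), torch.zeros(32)
print(td3_target(r, s_next, done).shape)                         # torch.Size([32])
```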

Key Challenges

Exploration and Exploitation

In reinforcement learning, the exploration-exploitation dilemma refers to the challenge of balancing the need to acquire new knowledge about the environment through diverse actions (exploration) and leveraging known information to maximize immediate rewards (exploitation). This trade-off is particularly acute in deep reinforcement learning (deep RL), where high-dimensional state and action spaces can lead to inefficient learning if exploration is insufficient. In sparse reward settings, where useful feedback is infrequent, poor exploration strategies can result in agents getting stuck in suboptimal behaviors, delaying or preventing convergence to effective policies.

Classic approaches to address this dilemma include the ε-greedy method, which selects a random action with probability ε (typically decaying from 1.0 to 0.01 over training) and otherwise follows the current policy, as employed in the Deep Q-Network (DQN) algorithm for Atari games. Another foundational technique is the Upper Confidence Bound (UCB), which promotes optimism by favoring actions with high estimated rewards plus an uncertainty bonus, originally developed for multi-armed bandits and adapted to RL settings. These methods, while effective in tabular RL, often underperform in deep RL due to the curse of dimensionality, necessitating specialized adaptations.

In deep RL, adaptations like noisy networks introduce parameterized noise into the weights of neural networks to generate stochastic actions, enabling continuous exploration without explicit randomness and achieving state-of-the-art results on 57 Atari games with a reported score of 633. Entropy maximization, as in the Soft Actor-Critic (SAC) algorithm, adds an entropy term to the objective to encourage stochasticity, promoting diverse behaviors while optimizing expected returns; this approach yielded scores such as 3155 on MuJoCo continuous-control benchmarks. These techniques integrate seamlessly with off-policy methods to reuse exploratory experience efficiently.

Intrinsic motivation methods further enhance exploration by generating internal rewards based on novelty or surprise. Curiosity-driven exploration, exemplified by Random Network Distillation (RND), rewards agents for prediction errors from a fixed random network, fostering visits to unpredictable states and attaining 7500 points on the sparse-reward Montezuma's Revenge game. Count-based exploration assigns higher intrinsic rewards to less-visited states using neural density models to estimate visit frequencies, as in methods that unify counting with entropy regularization, improving performance on exploration-heavy environments like Montezuma's Revenge. Information-theoretic approaches maximize mutual information between actions and future states to direct exploration toward informative trajectories; for instance, predictive information maximization prioritizes policies that reduce uncertainty about the environment's dynamics. Bayesian methods, such as approximations to Thompson sampling, sample actions from posterior distributions over value functions to balance uncertainty and reward.

Evaluating exploration effectiveness in deep RL often involves metrics like effective sample size, which quantifies the diversity of state-action pairs encountered relative to total interactions, highlighting inefficiencies in high-dimensional spaces. Challenges persist in sparse reward scenarios, where extrinsic signals are rare, leading to the "noisy-TV problem" of endless novelty-seeking without progress; solutions like intrinsic rewards mitigate this but require careful tuning to avoid over-exploration.
Recent research as of 2025 continues to explore hybrid methods combining intrinsic motivation with model-based planning to improve robustness in diverse environments.
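As an illustration of the curiosity-driven approach above, the sketch below implements an RND-style intrinsic reward: a predictor network is trained to match a fixed, randomly initialized target network, and its per-state prediction error serves as an exploration bonus that decays for familiar states. Sizes and learning rates are assumptions.

```python
import torch
import torch.nn as nn

obs_dim, feat_dim = 8, 32
target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)                     # the target stays fixed and random
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(states):
    """Per-state prediction error, used as an exploration bonus."""
    with torch.no_grad():
        return (predictor(states) - target(states)).pow(2).mean(dim=1)

def rnd_update(states):
    """Train the predictor on visited states so familiar states stop paying a bonus."""
    loss = (predictor(states) - target(states)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

batch = torch.randn(16, obs_dim)                # placeholder observations
print(intrinsic_reward(batch))                  # novel states receive larger bonuses
rnd_update(batch)
```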

Sample Efficiency and Off-Policy Learning

In deep reinforcement learning, algorithms are categorized as on-policy or off-policy based on how they utilize experience data for policy updates. On-policy methods, such as Proximal Policy Optimization (PPO), generate data using the current policy \pi and discard it after a single update, requiring fresh samples for each iteration to ensure the data distribution matches the policy being optimized. In contrast, off-policy methods, like Deep Q-Networks (DQN), allow the behavior policy \mu that collects data to differ from the target policy \pi, enabling reuse of past experiences stored in a replay buffer for multiple updates, which enhances data efficiency but introduces challenges in correcting distribution shifts.

A core mechanism for off-policy learning is importance sampling, which reweights experiences from \mu to estimate expectations under \pi using the ratio \rho = \frac{\pi(a|s)}{\mu(a|s)}, incorporated into value or gradient updates to account for behavioral differences. However, this ratio can lead to high variance, particularly in long-horizon tasks where products of ratios accumulate exponentially, causing unstable updates and divergence in deep RL settings. To mitigate this, techniques like V-trace, introduced in the IMPALA framework, apply truncated and weighted importance sampling to balance bias and variance, stabilizing off-policy actor-critic updates while enabling scalable distributed learning.

Experience replay further improves sample efficiency by storing and resampling transitions, with advancements prioritizing samples based on learning potential. Prioritized experience replay assigns higher sampling probabilities to transitions with larger temporal-difference (TD) errors, focusing updates on informative experiences and accelerating convergence in value-based methods like DQN. For sparse-reward environments, hindsight experience replay (HER) enhances replay by relabeling failed trajectories with achieved goals as "successful" outcomes, allowing the agent to learn from any reached state and significantly boosting sample efficiency in goal-conditioned tasks without altering the environment.

In batch and offline reinforcement learning, where interaction with the environment is limited or impossible, off-policy techniques adapt to fixed datasets by incorporating regularization to avoid overestimation of unseen actions. Conservative Q-learning (CQL) learns a conservative Q-function by adding a penalty term that downweights out-of-distribution actions during training, ensuring pessimistic estimates that prevent extrapolation errors and improve performance on diverse offline benchmarks. Complementary approaches, such as behavior cloning regularization, constrain the learned policy to stay close to the dataset's behavior policy via additional losses, stabilizing training and reducing the risk of deploying unsafe policies in real-world applications.

Despite these advances, deep RL remains sample-inefficient compared to human learning, often requiring millions of interactions to achieve proficiency in tasks like Atari games, whereas humans master similar visuomotor skills with orders of magnitude fewer trials through prior knowledge and generalization. For instance, the original DQN algorithm demands around 200 million frames to reach human-level performance, highlighting the gap in sample efficiency that off-policy methods aim to narrow but have not fully closed. As of 2025, ongoing efforts in scalable world models and offline reinforcement learning seek to further reduce this gap in practical deployments.
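A minimal sketch of proportional prioritized experience replay, as described above, is shown below: transitions are sampled with probability proportional to |TD error|^\alpha and corrected by importance-sampling weights. A plain Python list stands in for the sum-tree used in efficient implementations, and all constants are illustrative.

```python
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity=10_000, alpha=0.6, beta=0.4, eps=1e-3):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size, rng=np.random.default_rng()):
        p = np.asarray(self.priorities)
        p = p / p.sum()
        idx = rng.choice(len(self.data), size=batch_size, p=p)
        # importance-sampling weights correct the bias from non-uniform sampling
        w = (len(self.data) * p[idx]) ** (-self.beta)
        return [self.data[i] for i in idx], idx, w / w.max()

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = (abs(e) + self.eps) ** self.alpha

buf = PrioritizedReplay()
for _ in range(100):
    buf.add(("s", 0, 0.0, "s_next", False), td_error=np.random.rand())
batch, idx, weights = buf.sample(8)
buf.update_priorities(idx, np.random.rand(8))   # refresh with new TD errors
```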

Advanced Research Areas

Generalization and Transfer Learning

Deep reinforcement learning (DRL) agents often struggle with generalization due to domain shift, where changes in state distributions or action spaces between training and deployment environments lead to degraded performance. This issue arises because DRL models, reliant on deep neural networks, tend to overfit to specific training dynamics, limiting their robustness to unseen variations such as altered physics or visual appearances. To address this, meta-reinforcement learning (meta-RL) enables fast adaptation to new tasks by learning initial parameters that allow quick fine-tuning with minimal data. A prominent example is the application of Model-Agnostic Meta-Learning (MAML) to RL, which optimizes policies for rapid adaptation across related Markov decision processes (MDPs), as demonstrated in continuous control tasks where agents adapt in fewer than 10 episodes.

Transfer learning techniques in DRL mitigate generalization gaps by reusing knowledge from source tasks. Parameter sharing across tasks involves training shared neural network layers to extract common representations, while task-specific heads handle unique rewards or dynamics, improving efficiency in multi-task settings like robotic manipulation. Successor features (SFs) provide a modular approach by representing expected future feature vectors under a policy, decoupling environment dynamics from rewards to enable zero-shot transfer to new reward functions without retraining the policy. For instance, SFs combined with generalized policy improvement (GPI) allow agents to select the best policy from a library for novel tasks, achieving near-optimal performance in gridworlds and Atari games with shared dynamics. Context-aware networks further enhance transfer by incorporating task-specific context into the policy network, such as through adapters that modulate shared features based on environmental cues, promoting adaptation in procedurally generated domains.

Hierarchical reinforcement learning (HRL) supports generalization by introducing temporal abstractions that facilitate transfer across tasks with structural similarities. The options framework, introduced by Sutton et al., formalizes hierarchies as temporally extended actions (options) with initiation sets, policies, and termination functions, allowing reusable sub-policies for high-level planning. Deep extensions integrate this with neural networks, enabling end-to-end learning of options for abstraction in complex environments. Feudal Networks (FuNs), proposed in 2017, extend the feudal RL paradigm with a manager-worker hierarchy where the manager issues abstract goals to workers, promoting transfer by decomposing tasks into modular, reusable components, as shown in Atari games where hierarchical policies significantly improve sample efficiency over flat ones.

Evaluation of generalization in DRL emphasizes benchmarks that test robustness to unseen environments. The Procgen benchmark, consisting of 16 procedurally generated 2D games, assesses zero-shot generalization by training on one set of levels and evaluating on held-out levels with varied visuals and layouts, revealing that standard DRL agents like PPO achieve only 20-50% of easy-mode performance in hard-mode zero-shot settings. Few-shot transfer, involving fine-tuning with limited interactions on target tasks, contrasts with zero-shot evaluation by allowing parameter updates, often yielding 10-30% gains in benchmarks like Meta-World for robotic tasks. Despite these advances, challenges persist in achieving reliable generalization.
Negative transfer occurs when knowledge from source tasks hinders performance on target tasks, particularly in continual learning where sequential task exposure leads to catastrophic forgetting or suboptimal policies, resulting in substantial performance degradation in multi-task Atari sequences without regularization. Compositional generalization, the ability to recombine learned primitives for novel scenarios, remains elusive in deep representations, as DRL agents often fail to extrapolate to combinations of objects or goals, as shown in synthetic tasks requiring color-shape recombinations. These issues underscore the need for representations that capture reusable structures across domains. Recent developments as of 2025 include the integration of large language models (LLMs) with meta-RL for improved compositional generalization and transfer, as well as new benchmarks like GenPlan for evaluating generalization in planning.
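The sketch below illustrates zero-shot transfer with successor features and generalized policy improvement as described above: each stored policy carries successor features \psi_i(s, a), a new task is specified only by reward weights w, and GPI acts greedily over the implied Q-values. All arrays are random placeholders.

```python
import numpy as np

n_policies, n_states, n_actions, feat_dim = 3, 4, 2, 5
rng = np.random.default_rng(0)

# psi[i, s, a] = expected discounted sum of features under stored policy i
psi = rng.random((n_policies, n_states, n_actions, feat_dim))

def gpi_action(state, w):
    """Zero-shot action selection for a new task defined by reward weights w."""
    q = psi[:, state] @ w                          # Q_i(s, a) = psi_i(s, a) . w
    best_policy, best_action = np.unravel_index(np.argmax(q), q.shape)
    return int(best_action), int(best_policy)

w_new_task = rng.normal(size=feat_dim)             # a task never trained on
action, source = gpi_action(state=2, w=w_new_task)
print(f"GPI chose action {action}, borrowed from stored policy {source}")
```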

Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning (MARL) extends deep reinforcement learning to environments where multiple agents interact, learn, and influence each other's outcomes, contrasting with single-agent settings by introducing dynamics of cooperation, competition, or both. These interactions are formalized within the Markov games framework, a generalization of Markov decision processes (MDPs) that incorporates multiple agents with joint state spaces, action spaces, and reward functions, allowing for stochastic transitions based on collective actions. Paradigms in MARL include fully cooperative scenarios with shared rewards, fully competitive zero-sum games where one agent's gain is another's loss, and mixed settings combining elements of both, enabling the study of complex social dynamics.

A key distinction in MARL approaches is between centralized and decentralized training paradigms. In decentralized methods, agents learn independently using local observations and actions, which promotes scalability but struggles with coordination. Centralized training with decentralized execution (CTDE) addresses this by allowing a centralized critic to access global information during training while enforcing decentralized actors for execution, mitigating issues like non-stationarity from co-adapting policies. For cooperative tasks, value decomposition methods like QMIX decompose the joint value function into per-agent values using a monotonic mixing network, ensuring that the contribution of individual actions to the global value remains consistent and enabling effective credit assignment in teams. In mixed settings, algorithms such as MADDPG extend single-agent deep deterministic policy gradients to multi-agent contexts by training centralized critics that condition on all agents' actions and observations, while actors remain decentralized, allowing adaptation to both cooperative and competitive environments.

Communication protocols further enhance coordination in MARL by enabling agents to exchange information through learned channels integrated into deep networks. Differentiable messaging schemes, for instance, allow end-to-end training of continuous or discrete messages as part of the policy network, fostering emergent communication that improves joint performance in partially observable settings. Emergent behaviors often arise in these systems, such as sophisticated team tactics in cooperative games like the StarCraft II micromanagement challenge (SMAC), where agents develop tactics like flanking maneuvers without explicit programming, outperforming independent learners. In social dilemmas modeled after the prisoner's dilemma, self-interested agents trained via deep Q-networks exhibit emergent cooperation or defection patterns, revealing how repeated interactions lead to stable equilibria beyond Nash predictions.

Scalability in MARL is hindered by challenges like non-stationarity, where each agent's updates alter the environment dynamics perceived by others, causing distributional shifts that destabilize learning. Credit assignment exacerbates this in cooperative settings, as attributing team rewards to individual contributions becomes ambiguous amid interdependent actions, often requiring specialized techniques to propagate gradients effectively across agents. These issues underscore the need for robust methods that handle evolving multi-agent dynamics without assuming fixed opponent behaviors. As of 2025, advances include foundation models like MARL-GPT for scalable multi-agent coordination and applications in real-world systems such as cybersecurity and resource allocation.
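The sketch below shows the monotonic mixing idea behind QMIX described above: per-agent utilities are combined into a joint value by a mixing network whose weights are produced from the global state by hypernetworks and passed through an absolute value, keeping \partial Q_{tot} / \partial Q_i \geq 0. Sizes are illustrative assumptions, not the published architecture's exact dimensions.

```python
import torch
import torch.nn as nn

n_agents, state_dim, embed = 3, 10, 32

class MonotonicMixer(nn.Module):
    def __init__(self):
        super().__init__()
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_w2 = nn.Linear(state_dim, embed)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                      nn.Linear(embed, 1))

    def forward(self, agent_qs, state):                 # agent_qs: (batch, n_agents)
        B = agent_qs.size(0)
        w1 = self.hyper_w1(state).abs().view(B, n_agents, embed)  # abs => monotone
        b1 = self.hyper_b1(state).view(B, 1, embed)
        h = torch.relu(agent_qs.unsqueeze(1) @ w1 + b1)           # (B, 1, embed)
        w2 = self.hyper_w2(state).abs().view(B, embed, 1)
        b2 = self.hyper_b2(state).view(B, 1, 1)
        return (h @ w2 + b2).squeeze(-1).squeeze(-1)              # Q_tot: (B,)

mixer = MonotonicMixer()
agent_qs = torch.randn(8, n_agents)                     # per-agent chosen-action values
state = torch.randn(8, state_dim)                       # global state (training only)
print(mixer(agent_qs, state).shape)                     # torch.Size([8])
```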

Inverse and Goal-Conditioned Reinforcement Learning

Inverse reinforcement learning (IRL) addresses the challenge of inferring an underlying reward function from expert demonstrations, enabling agents to learn policies that generalize beyond observed trajectories. Unlike direct imitation methods, IRL recovers a reward model that rationalizes the expert's behavior as optimal under a Markov decision process, allowing the learner to optimize for that reward using standard reinforcement learning techniques. This approach was pioneered in the context of apprenticeship learning, where the goal is to match feature expectations of the expert's policy through linear reward functions over predefined features.

A key limitation of simpler imitation techniques like behavioral cloning, which directly maps states to actions via supervised learning, is covariate shift: small errors compound, leading the learner into states outside the expert's demonstration distribution, where predictions become unreliable. IRL mitigates this by learning a reward that guides behavior and recovery from errors, promoting robust policy optimization. To handle ambiguity in reward inference (multiple rewards can explain the same behavior), maximum entropy IRL formulates the problem probabilistically, maximizing the entropy of the induced trajectory distribution while matching expert demonstrations. The objective is to find a reward r such that the expert demonstrations are most likely under the maximum entropy optimal policy for r, often solved by matching feature expectations \hat{\mu}(E) = \sum_{\tau \in E} \mu(\tau) with \mathbb{E}_{\pi_r} [f(s,a)] = \mathbb{E}_{\pi_E} [f(s,a)], where f are state-action features, while maximizing the causal entropy of trajectories.

In deep reinforcement learning settings, IRL objectives are approximated using neural networks for high-dimensional spaces. A prominent example is generative adversarial imitation learning (GAIL), which frames IRL as an adversarial game between a policy generator and a discriminator that distinguishes expert from learner trajectories, effectively minimizing a Jensen-Shannon divergence proxy for the maximum entropy objective without explicit reward modeling. This method has demonstrated sample-efficient imitation in continuous control tasks, outperforming behavioral cloning by leveraging off-policy data reuse.

Goal-conditioned reinforcement learning extends standard RL to handle variable objectives by parameterizing policies and value functions with goals g, yielding forms like \pi(a \mid s, g) and universal value function approximators (UVFAs) V(s, g; \theta) that estimate returns for any state-goal pair using a shared network. UVFAs enable transfer across goals by learning a joint representation of states and goals, facilitating multi-task policies in sparse-reward environments. To address sample inefficiency in goal pursuit, hindsight experience replay (HER) relabels failed trajectories with achieved goals as "successes," allowing off-policy algorithms like DDPG to learn from any outcome, significantly improving success rates in robotic manipulation tasks, for example achieving approximately 100% success in tasks like FetchPickAndPlace with varied goals compared to 0% without relabeling.

Combining IRL with goal conditioning enables multi-task imitation learning, where demonstrations for specific goals inform a shared reward model that generalizes to unseen objectives, often via language or visual specifications. For instance, goal-conditioned adversarial imitation uses goals to condition the discriminator, learning interpretable rewards for robotic tasks like block stacking with varied targets.
In robotics, these methods support diverse objectives, such as adaptive grasping or navigation, by inferring task-specific rewards from few demonstrations, reducing the need for manual reward engineering. Despite advances, IRL faces challenges like reward ambiguity, where infinitely many rewards match expert behavior, requiring regularization (e.g., maximum entropy) to select parsimonious solutions, and scalability to high-dimensional goal spaces, where deep approximations struggle with the curse of dimensionality in trajectory matching. These issues limit deployment in complex, real-world settings, though ongoing work in adversarial and Bayesian formulations aims to enhance robustness. As of 2025, recent advances include LLM-based methods for reward inference, such as post-training alignment via reinforcement learning from human feedback using rewards inferred from language specifications, enhancing applications in robotics and conversational agents.
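The sketch below illustrates the hindsight relabeling idea referenced above for goal-conditioned learning: transitions from a failed episode are duplicated with the goal replaced by a state actually achieved later in the same trajectory, so a sparse goal-reaching reward becomes informative. The environment fields, tolerance, and relabeling count are placeholders.

```python
import random
import numpy as np

def sparse_reward(achieved_goal, goal, tol=0.05):
    """1.0 if the achieved goal lies within tolerance of the desired goal, else 0.0."""
    return float(np.linalg.norm(achieved_goal - goal) < tol)

def her_relabel(episode, k=4):
    """episode: list of dicts with keys state, action, achieved_goal, goal."""
    relabeled = []
    for t, tr in enumerate(episode):
        # original transition with its (likely zero) sparse reward
        relabeled.append(dict(tr, reward=sparse_reward(tr["achieved_goal"], tr["goal"])))
        # "future" strategy: sample up to k goals achieved later in the trajectory
        future = episode[t:]
        for _ in range(min(k, len(future))):
            new_goal = random.choice(future)["achieved_goal"]
            relabeled.append(dict(tr, goal=new_goal,
                                  reward=sparse_reward(tr["achieved_goal"], new_goal)))
    return relabeled

episode = [dict(state=np.random.rand(3), action=np.random.rand(2),
                achieved_goal=np.random.rand(2), goal=np.ones(2)) for _ in range(10)]
augmented = her_relabel(episode)
print(len(episode), "->", len(augmented), "transitions after relabeling")
```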

Applications

Gaming and Robotics

Deep reinforcement learning has achieved remarkable success in gaming environments, particularly through benchmark tasks that demonstrate its ability to handle high-dimensional inputs and complex decision-making. In the Atari 2600 suite, Deep Q-Networks (DQN) enabled agents to achieve human-level control across 49 games by learning directly from pixel inputs, marking a pivotal advancement in applying deep learning to reinforcement learning. For board games, AlphaZero utilized self-play and deep reinforcement learning to master chess, shogi, and Go, achieving superhuman performance without prior human knowledge by combining Monte Carlo tree search with policy and value networks. In esports, OpenAI Five demonstrated coordinated 5v5 gameplay in Dota 2, where five agents trained via self-play defeated professional teams, highlighting deep RL's capacity for multi-agent cooperation in partially observable, continuous-action spaces.

In robotics, deep reinforcement learning excels in simulated continuous control tasks, often using environments like OpenAI Gym and NVIDIA Isaac Gym for scalable training. Proximal Policy Optimization (PPO) has been widely applied in MuJoCo simulations for locomotion tasks, such as training humanoid or ant agents to walk and balance by optimizing policies over high-dimensional state spaces. For manipulation, goal-conditioned policies allow agents to achieve diverse objectives, such as grasping or stacking objects, by incorporating hindsight experience replay to relabel failed trajectories and improve sample efficiency in sparse-reward settings.

Transferring policies from simulation to real robots, known as sim-to-real transfer, relies on techniques like domain randomization, which varies simulation parameters (e.g., friction, mass) to robustify policies against real-world discrepancies. Notable achievements include OpenAI's 2019 work on dexterous manipulation, where a five-fingered robotic hand solved a Rubik's Cube in simulation and transferred to hardware via randomized physics, demonstrating fine-grained control without real-world demonstrations. Similarly, deep RL has enabled agile quadruped locomotion in the real world, as seen in policies trained in simulation for robots like ANYmal, which navigate rough terrain after domain randomization to bridge dynamics gaps. In 2024, DRL enabled the ANYmal robot to perform agile parkour maneuvers, including jumping and climbing obstacles, demonstrating advanced locomotion skills.

Despite these advances, sim-to-real deployment faces challenges like partial observability, where real sensors provide noisy or incomplete data unlike perfect simulations, necessitating robust observation models. Safety remains a critical limitation, as unconstrained exploration in physical systems risks hardware damage, prompting frameworks like Safety Gym to evaluate constraint satisfaction during training. Sim-to-real gaps, including unmodeled dynamics and actuator delays, often degrade performance, underscoring the need for ongoing refinements in simulation fidelity and adaptation methods.
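A minimal sketch of domain randomization as described above is shown below: physical parameters are resampled at the start of every episode so the policy must cope with a range of dynamics rather than one exact simulator. The parameter names, ranges, and the reset_simulator hook are illustrative assumptions, not a specific simulator's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_physics():
    """Draw one random instance of the simulated dynamics."""
    return {
        "friction": rng.uniform(0.5, 1.5),
        "link_mass_scale": rng.uniform(0.8, 1.2),
        "motor_strength": rng.uniform(0.85, 1.15),
        "sensor_noise_std": rng.uniform(0.0, 0.02),
        "action_latency_steps": int(rng.integers(0, 3)),
    }

def reset_simulator(physics):
    """Placeholder for pushing the sampled parameters into a simulator."""
    print("episode physics:",
          {k: round(v, 3) if isinstance(v, float) else v for k, v in physics.items()})

for episode in range(3):
    reset_simulator(sample_physics())
    # ... collect a rollout under these dynamics and update the policy ...
```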

Real-World Domains

Deep reinforcement learning has been applied to various real-world domains where decision-making under uncertainty is critical, extending beyond controlled simulations to influence physical and economic systems directly. These applications leverage deep RL's ability to learn policies from complex, high-dimensional data, often integrating off-policy methods to utilize historical datasets while addressing safety and efficiency concerns.

In autonomous driving, deep RL enables end-to-end learning for vehicle control and navigation tasks. For instance, Wayve's approach uses deep RL to train driving policies directly from camera inputs, achieving lane-following in under 20 minutes without predefined maps or rules. Complementing this, Wayve's FIERY model performs probabilistic future instance prediction in bird's-eye view, forecasting future paths of road agents to inform RL-based decisions, improving anticipation of dynamic environments. In traffic management, multi-agent reinforcement learning (MARL) optimizes signal control across intersections, reducing average vehicle delay by up to 20% compared to fixed-time strategies in urban simulations validated on real traffic data.

In healthcare, deep RL develops personalized treatment policies, particularly for critical conditions like sepsis. The AI Clinician, an off-policy RL model trained on over 46,000 patient records from the MIMIC-III database, recommends vasopressor and IV fluid dosages, outperforming clinicians in expected outcomes by improving survival rates in retrospective evaluations. For drug discovery, deep RL combined with generative models accelerates molecule design by optimizing chemical properties like binding affinity. A policy gradient-based generative model produces novel molecules with desired drug-like features, achieving higher validity and uniqueness scores than traditional generative adversarial networks in benchmark datasets.

Resource management benefits from deep RL in optimizing energy grids and recommendation systems. In smart grids, deep RL algorithms manage microgrid operations, such as balancing renewable sources and loads, reducing operational costs by 15-25% in simulated scenarios through actor-critic methods like deep deterministic policy gradients. For recommendation systems, extensions from multi-armed bandits to deep RL enhance user engagement; a REINFORCE-based recommender system applies RL to YouTube's video suggestions, optimizing long-term satisfaction via slate-based policies that consider sequential user interactions.

In finance, deep RL supports algorithmic trading and portfolio management amid market volatility. Policy gradient methods learn trading strategies for equities, minimizing transaction costs while maximizing returns; one implementation on US stocks outperformed baseline buy-and-hold approaches in backtests. For portfolio optimization under uncertainty, deep RL frameworks incorporate risk-sensitive rewards, dynamically allocating assets to achieve higher risk-adjusted returns, with reduced cumulative regret compared to mean-variance models in volatile periods.

Deploying deep RL in real-world settings introduces challenges like real-time inference constraints and ethical considerations. Systems must operate within milliseconds for applications like autonomous driving, often requiring model compression or distillation to meet latency requirements without sacrificing policy quality. Ethical issues arise from reward biases, which can perpetuate inequalities; for example, biased training data in healthcare RL may disadvantage underrepresented groups, necessitating fairness-aware reward shaping.
Case studies highlight regret minimization as a key metric: in financial trading deployments, deep RL policies achieve sublinear regret bounds, bounding cumulative losses relative to optimal strategies over time horizons of thousands of trades. Generalization techniques, such as domain randomization, aid adaptation from simulated training to real environments by enhancing policy robustness.

References

  1. [1]
    [1811.12560] An Introduction to Deep Reinforcement Learning - arXiv
    Nov 30, 2018 · Deep reinforcement learning is the combination of reinforcement learning (RL) and deep learning. This field of research has been able to solve ...Missing: definition | Show results with:definition
  2. [2]
    Human-level control through deep reinforcement learning - Nature
    Feb 25, 2015 · To achieve this, we developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial ...
  3. [3]
    [1312.5602] Playing Atari with Deep Reinforcement Learning - arXiv
    Dec 19, 2013 · We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning.
  4. [4]
    Mastering the game of Go with deep neural networks and tree search
    Jan 27, 2016 · Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go ...
  5. [5]
    Deep Reinforcement Learning - an overview | ScienceDirect Topics
    Definition of topic​​ AI. Deep reinforcement learning (DRL) is defined as a combination of deep learning (DL) and reinforcement learning (RL) principles, aimed ...Core Concepts and Algorithms... · Computational Frameworks...
  6. [6]
    [1701.07274] Deep Reinforcement Learning: An Overview - arXiv
    Jan 25, 2017 · We give an overview of recent exciting achievements of deep reinforcement learning (RL). We discuss six core elements, six important mechanisms, and twelve ...
  7. [7]
    Deep Reinforcement Learning: A Chronological Overview ... - MDPI
    We then trace the historical development of deep RL, highlighting key milestones such as the advent of deep Q-networks (DQN).
  8. [8]
    Learning representations by back-propagating errors - Nature
    Oct 9, 1986 · We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in ...
  9. [9]
    Reinforcement Learning - MIT Press
    In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the field's key ideas and algorithms. This second edition has ...
  10. [10]
    Q-learning | Machine Learning
This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to the ...
  11. [11]
    On-Line Q-Learning Using Connectionist Systems - ResearchGate
Updates for model-free learning were described using the SARSA TD algorithm (Rummery and Niranjan 1994). The reward prediction error (δ) was computed as the ...
  12. [12]
    [1812.02648] Deep Reinforcement Learning and the Deadly Triad
    Dec 6, 2018 · Sutton and Barto (2018) identify a deadly triad of function approximation, bootstrapping, and off-policy learning. When these three ...
  13. [13]
    [PDF] Neuro-Dynamic Programming - MIT
© 1996 Dimitri P. Bertsekas and John N. Tsitsiklis.
  14. [14]
  15. [15]
    [PDF] TD-Gammon, A Self-teaching Backgammon Program, Achieves ...
    TD-Gammon is a neural network that self-teaches backgammon by playing against itself, learning from the results, and achieving master-level play.
  16. [16]
    11.1 TD-Gammon
    One of the most impressive applications of reinforcement learning to date is that by Gerry Tesauro to the game of backgammon (Tesauro, 1992, 1994, 1995).
  17. [17]
    [PDF] Tree-Based Batch Mode Reinforcement Learning
    The fitted Q iteration algorithm is a batch mode reinforcement learning algorithm which yields an approximation of the Q-function corresponding to an infinite ...
  18. [18]
    Neural Fitted Q Iteration – First Experiences with a Data Efficient ...
    This paper introduces NFQ, an algorithm for efficient and effective training of a Q-value function represented by a multi-layer perceptron.
  19. [19]
    Neural Fitted Q Iteration – First Experiences with a Data Efficient ...
    Aug 7, 2025 · Early neural value estimation methods Riedmiller (2005) incorporated action conditioning by incorporating both state and action as model inputs.
  20. [20]
    [PDF] Monte Carlo Tree Search in Go - Department of Computing Science
    Abstract Monte Carlo Tree Search (MCTS) was born in Computer Go, i.e. in the application of artificial intelligence to the game of Go.
  21. [21]
    The Bitter Lesson - Rich Sutton
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most ...
  22. [22]
    Deep Reinforcement Learning with Double Q-learning - arXiv
    Sep 22, 2015 · We first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games.
  23. [23]
    A general reinforcement learning algorithm that masters chess ...
    Dec 7, 2018 · In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games.
  24. [24]
    [1912.06680] Dota 2 with Large Scale Deep Reinforcement Learning
Dec 13, 2019 · OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.
  25. [25]
    Mastering Atari, Go, chess and shogi by planning with a learned model
Dec 23, 2020 · To better understand the nature of MuZero's learning algorithm ...
  26. [26]
    [2010.02193] Mastering Atari with Discrete World Models - arXiv
Oct 5, 2020 · We introduce DreamerV2, a reinforcement learning agent that learns behaviors purely from predictions in the compact latent space of a powerful world model.
  27. [27]
    Decision Transformer: Reinforcement Learning via Sequence ... - arXiv
    Jun 2, 2021 · Abstract page for arXiv paper 2106.01345: Decision Transformer: Reinforcement Learning via Sequence Modeling.
  28. [28]
    Dueling Network Architectures for Deep Reinforcement Learning
    Nov 20, 2015 · In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators.
  29. [29]
    Rainbow: Combining Improvements in Deep Reinforcement Learning
    Oct 6, 2017 · This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of- ...
  30. [30]
    [1511.05952] Prioritized Experience Replay - arXiv
    Nov 18, 2015 · In this paper we develop a framework for prioritizing experience, so as to replay important transitions more frequently, and therefore learn more efficiently.
  31. [31]
    [1707.06887] A Distributional Perspective on Reinforcement Learning
    Jul 21, 2017 · In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning ...
  32. [32]
    Asynchronous Methods for Deep Reinforcement Learning - arXiv
Feb 4, 2016 · We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep ...
  33. [33]
    [1502.05477] Trust Region Policy Optimization - arXiv
    Feb 19, 2015 · This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks.
  34. [34]
    [1707.06347] Proximal Policy Optimization Algorithms - arXiv
    Proximal Policy Optimization (PPO) is a policy gradient method for reinforcement learning that uses multiple epochs of minibatch updates, and is simpler than ...
  35. [35]
    [1812.05905] Soft Actor-Critic Algorithms and Applications - arXiv
    Dec 13, 2018 · In this paper, we describe Soft Actor-Critic (SAC), our recently introduced off-policy actor-critic algorithm based on the maximum entropy RL framework.
  36. [36]
    Addressing Function Approximation Error in Actor-Critic Methods
    We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic.
  37. [37]
    [PDF] Deterministic Policy Gradient Algorithms
Abstract. In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy ...
  38. [38]
  39. [39]
  40. [40]
  41. [41]
    Relative Importance Sampling for off-Policy Actor-Critic in Deep ...
    Oct 30, 2018 · However, importance sampling has high variance, which is exacerbated in sequential scenarios. We propose a smooth form of importance sampling ...
  42. [42]
    IMPALA: Scalable Distributed Deep-RL with Importance Weighted ...
Feb 5, 2018 · We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V- ...
  43. [43]
    [1707.01495] Hindsight Experience Replay - arXiv
    We present a novel technique called Hindsight Experience Replay which allows sample-efficient learning from rewards which are sparse and binary.
  44. [44]
    Conservative Q-Learning for Offline Reinforcement Learning - arXiv
    Jun 8, 2020 · In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function.
  45. [45]
    Adaptive Behavior Cloning Regularization for Stable Offline-to ...
    Oct 25, 2022 · We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability.
  46. [46]
    [PDF] Playing Atari with Deep Reinforcement Learning - cs.Toronto
    We present the first deep learning model to successfully learn control policies di- rectly from high-dimensional sensory input using reinforcement learning.
  47. [47]
    [PDF] Sample-Efficient Deep Reinforcement Learning via Episodic ...
    We theoretically prove the convergence of the EBU method and experimentally demonstrate its performance in both deterministic and stochastic environments.
  48. [48]
    Transfer Learning in Deep Reinforcement Learning: A Survey - PMC
    In this survey, we systematically investigate the recent progress of transfer learning approaches in the context of deep reinforcement learning.
  49. [49]
    A Survey of Zero-shot Generalisation in Deep Reinforcement Learning
Nov 18, 2021 · The study of zero-shot generalisation (ZSG) in deep Reinforcement Learning (RL) aims to produce RL algorithms whose policies generalise well to ...
  50. [50]
    Meta Reinforcement Learning - Lil'Log
    Jun 23, 2019 · A meta-RL model is trained over a distribution of MDPs, and at test time, it is able to learn to solve a new task quickly.
  51. [51]
    [PDF] Transfer in Deep Reinforcement Learning Using Successor ...
The ability to transfer skills across tasks has the potential to scale up reinforcement learning (RL) agents to environments currently out of reach. Recently, ...
  52. [52]
    Successor Features for Transfer in Reinforcement Learning - arXiv
    Jun 16, 2016 · We propose a transfer framework for the scenario where the reward function changes between tasks but the environment's dynamics remain the same.
  53. [53]
    [PDF] Dynamics Generalisation in Reinforcement Learning via Adaptive ...
    This allows the agent to modify its behaviour for each context by having a shared feature-extractor network which is modulated by the context-aware adapter.
  54. [54]
    FeUdal Networks for Hierarchical Reinforcement Learning - arXiv
Mar 3, 2017 · We introduce FeUdal Networks (FuNs): a novel architecture for hierarchical reinforcement learning. Our approach is inspired by the feudal reinforcement ...
  55. [55]
    [PDF] FeUdal Networks for Hierarchical Reinforcement Learning
The options framework (Sutton et al., 1999; Precup, 2000) is a popular formulation for considering the problem with a two level hierarchy. The bottom level ...
  56. [56]
    [PDF] Leveraging Procedural Generation to Benchmark Reinforcement ...
We have created Procgen Benchmark to fulfill this need. This benchmark is ideal for evaluating generalization, as distinct training and test sets can be ...
  57. [57]
    Compositional Learning of Visually-Grounded Concepts Using ...
    Sep 8, 2023 · We investigate the compositional abilities of RL agents, using the task of navigating to specified color-shape targets in synthetic 3D environments.
  58. [58]
    [PDF] Robust Subtask Learning for Compositional Generalization
Compositional reinforcement learning is a promising approach for training policies to perform complex long-horizon tasks. Typically, a ...
  59. [59]
    Multi-Agent Actor-Critic for Mixed Cooperative-Competitive ... - arXiv
    Jun 7, 2017 · This paper explores deep reinforcement learning for multi-agent domains, adapting actor-critic methods to consider other agents' policies and ...
  60. [60]
    [PDF] Markov games as a framework for multi-agent reinforcement learning
Markov games (see e.g., [Van Der Wal, 1981]) is an extension of game theory to MDP-like environments. This paper considers the consequences of using the Markov ...
  61. [61]
    Multi-agent Reinforcement Learning in Sequential Social Dilemmas
    Feb 10, 2017 · We analyze the dynamics of policies learned by multiple self-interested independent learning agents, each using its own deep Q-network, on two Markov games.
  62. [62]
    Multi-Agent Reinforcement Learning: A Review of Challenges and ...
    In this review, we present an analysis of the most used multi-agent reinforcement learning algorithms. Starting with the single-agent reinforcement learning ...
  63. [63]
    [PDF] Apprenticeship Learning via Inverse Reinforcement Learning
The problem of deriving a reward function from observed behavior is referred to as inverse reinforcement learning (Ng & Russell, 2000). In this paper, we ...
  64. [64]
    [PDF] Maximum Entropy Inverse Reinforcement Learning
The maximum entropy approach provides a principled method of dealing with this uncertainty. We discuss several additional advantages in modeling behavior that ...
  65. [65]
    A Reduction of Imitation Learning and Structured Prediction to No ...
    Nov 2, 2010 · In this paper, we propose a new iterative algorithm, which trains a stationary deterministic policy, that can be seen as a no regret algorithm in an online ...
  66. [66]
    [1606.03476] Generative Adversarial Imitation Learning - arXiv
Jun 10, 2016 · Authors: Jonathan Ho, Stefano Ermon.
  67. [67]
    Universal Value Function Approximators
    In this paper we introduce universal value function approximators (UVFAs) V(s,g;theta) that generalise not just over states s but also over goals g.
  68. [68]
    [PDF] Goal-Induced Inverse Reinforcement Learning - UC Berkeley EECS
    May 17, 2019 · This work explores using natural language to communicate goal-conditioned rewards, which can be learned via solving the inverse reinforcement ...
  69. [69]
    Advances and applications in inverse reinforcement learning
    Mar 26, 2025 · This comprehensive review focuses on three key aspects: the diverse methodologies employed in IRL, its wide-ranging applications across fields such as robotics ...
  70. [70]
    A survey of inverse reinforcement learning: Challenges, methods ...
    Inverse reinforcement learning (IRL) is the problem of inferring the reward function of an agent, given its policy or observed behavior.
  71. [71]
    Sim-to-Real: Learning Agile Locomotion For Quadruped Robots
    Apr 27, 2018 · In this paper, we present a system to automate this process by leveraging deep reinforcement learning techniques.
  72. [72]
    Safety Gym - OpenAI
Nov 21, 2019 · We're releasing Safety Gym, a suite of environments and tools for measuring progress towards reinforcement learning agents that respect safety constraints ...
  73. [73]
    FIERY: Future Instance Prediction in Bird's-Eye View from Surround ...
    Apr 21, 2021 · We present FIERY: a probabilistic future prediction model in bird's-eye view from monocular cameras. Our model predicts future instance segmentation and motion.
  74. [74]
  75. [75]
    The Artificial Intelligence Clinician learns optimal treatment ... - Nature
    Oct 22, 2018 · We developed the AI Clinician, a computational model using reinforcement learning, which is able to dynamically suggest optimal treatments for ...
  76. [76]
  77. [77]
    Deep Reinforcement Learning: Policy Gradients for US Equities ...
    Dec 4, 2023 · This paper presents a novel approach to applying Deep Reinforcement Learning (DRL) within the financial trading domain.
  78. [78]
    Deep reinforcement learning for portfolio selection - ScienceDirect
    This study proposes an advanced model-free deep reinforcement learning (DRL) framework to construct optimal portfolio strategies in dynamic, complex, and large ...