
# Q-learning

Q-learning is a foundational model-free reinforcement learning algorithm that enables an intelligent agent to learn an optimal policy for selecting actions in a given environment modeled as a finite Markov decision process (MDP) by iteratively estimating the value of state-action pairs, known as the Q-function. Introduced by Christopher J. C. H. Watkins in his 1989 PhD thesis, *Learning from Delayed Rewards*, the algorithm operates without requiring a model of the environment's dynamics, relying instead on interactions with the environment to update Q-values based on received rewards and observed state transitions.

The core update rule of Q-learning uses a temporal difference (TD) learning approach, where the Q-value for a state-action pair is adjusted towards a target that combines the immediate reward with the discounted maximum Q-value of the next state, formalized as $Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$, with $\alpha$ as the learning rate and $\gamma$ as the discount factor. As an off-policy method, Q-learning learns the optimal action-value function while allowing the agent to follow an exploratory policy, such as $\epsilon$-greedy, which balances exploitation of known good actions and exploration of new ones.

A key theoretical contribution came in 1992, when Watkins and Peter Dayan proved that Q-learning converges to the optimal Q-function with probability 1 under appropriate conditions on the learning rates and sufficient exploration of state-action pairs. This convergence guarantee, building on Watkins' earlier outline, established Q-learning as a robust algorithm for solving MDPs, applicable to discrete action spaces and finite state environments. The algorithm's simplicity and effectiveness have made it a cornerstone of reinforcement learning, influencing subsequent developments like deep Q-networks (DQN) that extend it to high-dimensional state spaces using neural networks.

Q-learning's practical implementation involves initializing a Q-table or function approximator, then iteratively selecting actions, observing outcomes, and updating estimates until convergence or a performance criterion is met. It excels in scenarios where delayed rewards provide sparse feedback, such as game playing or robotic control, by propagating value estimates backward through time via bootstrapping. Despite its tabular form assuming finite states and actions, extensions like function approximation address scalability issues in real-world applications.

## Background and Fundamentals

### Reinforcement Learning Overview

Reinforcement learning (RL) is a paradigm in machine learning where an intelligent agent learns to make sequential decisions by interacting with an environment, aiming to maximize the total cumulative reward over time. The agent receives feedback in the form of scalar rewards or penalties after each action, which guides its learning through trial and error rather than direct instruction. This process enables the agent to discover optimal behaviors in complex, dynamic settings without prior knowledge of the environment's rules.

Central to RL are several key components: the agent, which is the decision-maker; the environment, which responds to the agent's actions; the state $S$, representing the current situation; the action $A$, chosen by the agent; and the reward $R$, a numerical signal indicating the immediate desirability of the action taken. The agent's behavior is governed by a policy $\pi$, a strategy that maps states to actions, either deterministically or probabilistically. Additionally, the value function $V$ estimates the expected long-term reward starting from a given state under the policy, while the action-value function $Q$ extends this to evaluate state-action pairs, aiding in action selection.

RL differs fundamentally from supervised learning, which uses labeled input-output pairs to train models for prediction or classification, and unsupervised learning, which identifies hidden structures in unlabeled data without explicit feedback. In contrast, RL operates without labeled examples of correct actions, relying instead on sparse, delayed rewards that may arrive long after an action is taken, requiring the agent to balance exploration of new strategies with exploitation of known rewarding ones.

RL tasks are classified as episodic or continuing based on their structure. Episodic tasks naturally divide into finite episodes, each starting from an initial state and ending at a terminal state, such as a single playthrough of a board game where the agent resets after each game to learn from repeated episodes. Continuing tasks, however, have no terminal states and extend indefinitely, like perpetual inventory management, where the agent must sustain performance over an ongoing horizon.
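The interaction protocol described above can be made concrete with a minimal sketch. The toy environment, its `reset`/`step` interface, and the random placeholder policy below are all hypothetical, introduced only to illustrate the agent-environment loop and the accumulation of reward over one episode; no particular learning method is implied.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy episodic environment, for illustration only.
    The state is a step counter; the episode terminates after 10 steps."""
    def reset(self):
        self.t = 0
        return self.t  # initial state

    def step(self, action):
        # Reward +1 if the action "guesses" a coin flip correctly, else 0.
        reward = 1.0 if action == random.randint(0, 1) else 0.0
        self.t += 1
        done = self.t >= 10          # episodic task: terminal after 10 steps
        return self.t, reward, done  # next state, reward, terminal flag

env = CoinFlipEnv()
state = env.reset()
total_reward = 0.0
done = False
while not done:                       # one episode of agent-environment interaction
    action = random.choice([0, 1])    # placeholder policy pi(state)
    state, reward, done = env.step(action)
    total_reward += reward            # the return is the cumulative reward
print("episode return:", total_reward)
```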

### Markov Decision Processes

A Markov decision process (MDP) provides the mathematical foundation for modeling sequential decision-making problems in reinforcement learning, where an agent interacts with an environment to maximize cumulative rewards. Formally, an MDP is defined as a tuple $(S, A, P, R, \gamma)$, consisting of a state space $S$ representing all possible states of the environment, an action space $A$ denoting the set of actions available to the agent in each state, a transition probability function $P: S \times A \times S \to [0,1]$ that gives the probability $P(s' \mid s, a)$ of transitioning to state $s'$ given current state $s$ and action $a$, a reward function $R: S \times A \to \mathbb{R}$ specifying the immediate reward $r(s, a)$ received after taking action $a$ in state $s$, and a discount factor $\gamma \in [0,1)$ that weights the importance of future rewards relative to immediate ones.

Central to the MDP framework is the Markov property, which assumes that the future state and reward depend only on the current state and action, rendering the history of prior states and actions irrelevant for prediction. This property simplifies the modeling of decision processes by focusing solely on the present context, enabling efficient computation and analysis in environments like games or robotic control tasks.

The objective in an MDP is to determine an optimal policy $\pi^*: S \to A$, a mapping from states to actions (or distributions over actions), that maximizes the expected discounted return starting from any state $s$ at time $t$, defined as

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$$

For a given policy $\pi$, the state-value function $V^\pi(s)$ represents the expected return when starting in state $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right],$$

while the action-value function $Q^\pi(s, a)$ gives the expected return starting in $s$, taking action $a$, and then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right].$$

These value functions satisfy the Bellman equations, which express them recursively in terms of immediate rewards and future values. The optimal value functions $V^*(s)$ and $Q^*(s, a)$, achieved by the optimal policy $\pi^*$, obey the optimal Bellman equations:

$$V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^*(s') \right]$$

and

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a' \in A} Q^*(s', a').$$
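As a rough illustration of the Bellman optimality backup, the sketch below runs value iteration on a tiny hypothetical two-state, two-action MDP; the transition tensor `P`, reward matrix `R`, and discount factor are made up for this example and stand in for any known model.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used only to illustrate the Bellman backup.
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):                 # value iteration: repeated optimality backups
    Q = R + gamma * P @ V             # Q[s,a] = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    V_new = Q.max(axis=1)             # V*(s) = max_a Q*(s, a)
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-8:
        break
print("V* ≈", V, " greedy policy:", Q.argmax(axis=1))
```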
## Core Algorithm

### Algorithm Description

Q-learning is a temporal-difference (TD) learning method that approximates the optimal action-value function $Q^*(s,a)$, which represents the maximum expected discounted return starting from state $s$, taking action $a$, and following the optimal policy thereafter, without constructing an explicit model of the environment. This approach enables the agent to learn directly from interactions with the environment in the form of state-action-reward-next state tuples.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) As an off-policy algorithm, Q-learning derives the optimal policy from the learned $Q(s,a)$ values independently of the policy used to generate the experience, allowing the behavior policy—such as an $\epsilon$-greedy strategy that balances exploration and exploitation—to differ from the target optimal policy.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) This separation facilitates learning the optimal action values even when exploratory actions are taken, making it suitable for environments requiring thorough exploration.

The core of the algorithm involves iteratively updating the $Q(s,a)$ estimates through a TD update rule. The following pseudocode outlines the standard tabular Q-learning procedure:

```
Initialize Q(s, a) arbitrarily for all s, a (e.g., Q(s, a) = 0)
For each episode:
    Initialize starting state s
    While s is not a terminal state:
        Choose action a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe reward r and next state s'
        Update Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
        Set s ← s'
```

This procedure assumes Q-learning solves infinite-horizon discounted Markov decision processes.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) The hyperparameters $\alpha$ (learning rate) and $\gamma$ (discount factor) control the update magnitude and future reward weighting, respectively.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)

The key intuition behind the update lies in bootstrapping: the term $r + \gamma \max_{a'} Q(s', a')$ serves as a target for $Q(s,a)$ built from the current estimates of future values, so each update moves the estimate incrementally toward $Q^*(s,a)$ as those estimates themselves improve.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) By relying solely on sampled transitions rather than modeling transition probabilities $P(s'|s,a)$ or rewards $R(s,a)$, Q-learning maintains its model-free nature, adapting to unknown dynamics through repeated interaction.
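For concreteness, the fragment below applies a single update from the pseudocode to a small NumPy Q-table with made-up numbers (a hypothetical 5-state, 2-action table and one observed transition); it is only a worked illustration of the update arithmetic.

```python
import numpy as np

alpha, gamma = 0.1, 0.9
Q = np.zeros((5, 2))             # hypothetical 5-state, 2-action Q-table
Q[3] = [0.5, 1.0]                # assume some earlier learning for the next state

s, a, r, s_next = 2, 0, 1.0, 3   # one observed transition (s, a, r, s')

td_target = r + gamma * Q[s_next].max()   # r + gamma * max_a' Q(s', a')
td_error = td_target - Q[s, a]
Q[s, a] += alpha * td_error
print(Q[s, a])  # 0.1 * (1.0 + 0.9 * 1.0 - 0.0) = 0.19
```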
### Mathematical Formulation

Q-learning is grounded in the theory of Markov decision processes (MDPs), where the goal is to learn an optimal action-value function $Q^*(s, a)$ that satisfies the Bellman optimality equation:

$$Q^*(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a\right],$$

with $r$ denoting the immediate reward, $\gamma \in [0, 1)$ the discount factor, and the expectation taken over the transition dynamics from state $s$ to $s'$ under action $a$.[](https://www.gatsby.ucl.ac.uk/~dayan/papers/cjch.pdf)

The Q-learning update rule provides a temporal-difference method to iteratively approximate $Q^*$. After observing the tuple $(s_t, a_t, r_{t+1}, s_{t+1})$, the estimate is updated as

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],$$

where $\alpha \in (0, 1]$ is the learning rate. This rule originates from the temporal-difference error $\delta_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)$, whose target $r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$ is a sampled estimate of the right-hand side of the Bellman optimality equation; its expectation coincides with $Q^*(s_t, a_t)$ once the current $Q$ has converged to the optimal values.[](https://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf)[](https://www.gatsby.ucl.ac.uk/~dayan/papers/cjch.pdf) The target derives directly from the Bellman optimality equation, as it bootstraps the value of the next state using the maximum over possible actions, assuming the greedy policy with respect to the current $Q$. This makes Q-learning an off-policy algorithm, learning the optimal policy independently of the behavior policy used for exploration.[](https://www.gatsby.ucl.ac.uk/~dayan/papers/cjch.pdf)

Under the assumption that every state-action pair is visited infinitely often and that the learning rates $\alpha_t$ satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, Q-learning converges to $Q^*$ with probability 1; acting greedily with respect to the converged values then yields an optimal policy. This result, known as Watkins' theorem, establishes the algorithm's theoretical soundness in finite MDPs.[](https://www.gatsby.ucl.ac.uk/~dayan/papers/cjch.pdf)

In contrast to SARSA, which updates using the next action $a'$ sampled from the behavior policy—yielding $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma Q(s_{t+1}, a') - Q(s_t, a_t)]$, making it on-policy—Q-learning's use of $\max_{a'} Q$ enables off-policy learning toward optimality. SARSA was introduced as a connectionist variant but shares the temporal-difference framework.[](https://www.researchgate.net/publication/220344150_Technical_Note_Q-Learning)[](https://www.researchgate.net/publication/2500611_On-Line_Q-Learning_Using_Connectionist_Systems) From a stochastic approximation perspective, the Q-learning update is an asynchronous stochastic approximation whose iterates converge, under Robbins-Monro conditions, to the fixed point of the Bellman optimality operator.[](https://www.mit.edu/~jnt/Papers/J052-94-jnt-q.pdf)
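The off-policy/on-policy distinction shows up directly in the two targets. The sketch below contrasts them on a single made-up transition with arbitrary next-state Q-values; the exploratory next action assumed for SARSA is hypothetical.

```python
import numpy as np

gamma = 0.9
Q = np.array([[0.2, 0.8],     # hypothetical Q-values for the next state s'
              [0.0, 0.0]])
s_next, r = 0, 1.0

# Q-learning (off-policy): bootstrap with the greedy value of s'
q_learning_target = r + gamma * Q[s_next].max()

# SARSA (on-policy): bootstrap with the action a' actually chosen by the
# behavior policy in s' -- here an exploratory (non-greedy) action is assumed
a_next = 0
sarsa_target = r + gamma * Q[s_next, a_next]

print(q_learning_target, sarsa_target)  # 1.72 vs 1.18
```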
## Hyperparameters

### Learning Rate

The learning rate, denoted as $\alpha$, is a key hyperparameter in Q-learning that governs the magnitude of updates to the action-value function $Q(s, a)$. It is defined within the interval $[0, 1]$, where $\alpha = 0$ implies no updates occur and the Q-values remain fixed, while $\alpha = 1$ results in complete replacement of the prior Q-value with the new estimate derived from the temporal difference (TD) error. In the update rule, $\alpha$ scales the contribution of the new information relative to the existing estimate, as seen in the standard Q-learning update: $Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha (r + \gamma \max_{a'} Q(s', a'))$.[](https://link.springer.com/article/10.1007/BF00992698)

The choice of $\alpha$ significantly influences the learning dynamics. A high $\alpha$ facilitates rapid incorporation of new experiences, enabling quick adaptation, but it can introduce instability, such as oscillations in Q-value estimates or divergence from optimal policies, particularly in environments with noisy rewards or sparse feedback. In contrast, a low $\alpha$ ensures more gradual updates, promoting stable convergence toward optimal values at the cost of slower overall learning progress. This trade-off is especially pronounced when interacting with the TD error, where larger errors amplify the update size via $\alpha$, allowing faster corrections for substantial discrepancies but risking overshooting with elevated $\alpha$.[](https://pdfs.semanticscholar.org/292c/d7e17c0aa08b8f07bd77ea1ebff08d51540d.pdf)

Optimal schedules for $\alpha$ depend on the environment's characteristics. In stationary Markov decision processes (MDPs), a decreasing schedule is required for almost-sure convergence to the optimal Q-values, satisfying the conditions $0 < \alpha_n < 1$, $\sum_n \alpha_n = \infty$, and $\sum_n \alpha_n^2 < \infty$ for all state-action pairs visited infinitely often; a harmonic schedule like $\alpha_n = 1/n$ meets these criteria. For non-stationary environments, where the transition dynamics or rewards may change over time, a constant $\alpha$ (e.g., 0.1) is typically employed to maintain ongoing adaptation without the Q-values converging prematurely to outdated optima.[](https://link.springer.com/article/10.1007/BF00992698)[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)

Tuning $\alpha$ is often performed through empirical methods tailored to the setting. In tabular Q-learning, grid search over discrete values (e.g., $\{0.01, 0.1, 0.5\}$) evaluates performance on validation episodes to balance speed and stability. In function approximation variants like deep Q-networks (DQN), adaptive optimizers such as RMSProp or Adam incorporate $\alpha$ implicitly, with initial values around $10^{-4}$ to $10^{-3}$ selected via hyperparameter optimization to handle high-dimensional state spaces.
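A small sketch of the two schedules mentioned above, assuming a hypothetical per-pair visit counter: a harmonic decay that satisfies the Robbins-Monro conditions, and a constant step size for non-stationary settings.

```python
from collections import defaultdict

visit_counts = defaultdict(int)

def harmonic_alpha(s, a):
    """alpha_n = 1/n for the n-th update of (s, a); this schedule satisfies
    sum_n alpha_n = infinity and sum_n alpha_n**2 < infinity."""
    visit_counts[(s, a)] += 1
    return 1.0 / visit_counts[(s, a)]

CONSTANT_ALPHA = 0.1   # typical fixed step size for non-stationary problems

print([round(harmonic_alpha("s0", "a0"), 3) for _ in range(5)])
# [1.0, 0.5, 0.333, 0.25, 0.2]
```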
### Discount Factor

The discount factor, denoted $\gamma$, is a hyperparameter in Q-learning with $0 \leq \gamma \leq 1$ that scales the contribution of future rewards to the agent's decision-making process.[](https://link.springer.com/content/pdf/10.1007/BF00992698.pdf) It discounts rewards received in the future relative to immediate ones, with rewards $k$ steps ahead valued at $\gamma^k$ times their nominal amount.[](https://link.springer.com/content/pdf/10.1007/BF00992698.pdf) When $\gamma = 0$, the agent becomes fully myopic, prioritizing only immediate rewards and ignoring any future consequences, resulting in a purely greedy policy. Conversely, when $\gamma = 1$, the formulation is undiscounted, treating all rewards equally regardless of when they occur, which is suitable for tasks without a natural horizon but requires modifications for convergence.[](https://link.springer.com/content/pdf/10.1007/BF00992698.pdf)

The discount factor fundamentally shapes the definition of the return in Q-learning, defined as the expected discounted sum of future rewards:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$$

This summation ensures that rewards further in the future have progressively less influence on the Q-values, promoting computational tractability and bounded value estimates. In the Q-learning update rule, $\gamma$ appears in the target term $r + \gamma \max_{a'} Q(s', a')$, where it weights the estimated value of the next state against the immediate reward $r$.[](https://link.springer.com/content/pdf/10.1007/BF00992698.pdf)

A high value of $\gamma$ (close to 1) encourages long-term planning by assigning significant weight to delayed rewards, enabling the agent to pursue strategies that yield higher cumulative returns over extended horizons. In contrast, a low $\gamma$ (close to 0) emphasizes short-term gains, which can lead to myopic policies that achieve quick rewards but fail to optimize overall performance in environments requiring foresight. This trade-off influences the agent's exploration needs, as higher $\gamma$ typically demands more thorough search to capture long-range dependencies.[](https://link.springer.com/content/pdf/10.1007/BF00992698.pdf)

The selection of $\gamma$ is highly domain-dependent, reflecting the time scale and uncertainty of rewards in the task; values around 0.9 are commonly employed to strike a balance between immediate and future rewards in many finite-horizon or episodic settings. In continuing tasks without terminal states, undiscounted cases ($\gamma = 1$) necessitate special handling to avoid divergence, such as average-reward Q-learning variants that optimize the long-run average reward per step rather than a discounted sum.[](https://link.springer.com/content/pdf/10.1007/BF00114727.pdf) These variants adjust the Q-value updates to relative advantages over a baseline average reward, ensuring convergence in recurrent environments like inventory control or network routing.[](https://link.springer.com/content/pdf/10.1007/BF00114727.pdf)
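To make the effect of $\gamma$ tangible, the snippet below evaluates the discounted return of a fixed, made-up reward sequence (a single reward of 10 delayed by three steps) under several discount factors.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma**k * R_{t+k+1} for a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 10]           # a delayed reward of 10 after three steps
for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
# 0.0 -> 0.0   (myopic: the delayed reward is invisible)
# 0.5 -> 1.25
# 0.9 -> 7.29
# 1.0 -> 10.0  (undiscounted: full credit regardless of delay)
```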
### Initial Q-Values and Exploration

In Q-learning, the Q-values are typically initialized to zero for all state-action pairs, which serves as a neutral starting point and is a common practice in implementations. Whether zero initialization is optimistic depends on the reward scale: in environments where returns are negative (for example, a constant per-step cost), zero estimates exceed the true values, so untried actions look comparatively promising and the agent is nudged toward discovering rewarding paths; in environments with strictly positive returns, the same initialization is pessimistic. Alternatively, Q-values can be set to arbitrary values or informed priors based on domain knowledge, though such choices may introduce specific biases in early learning phases.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)[](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/potential_based_shaping_equals_initialization.pdf)

The choice of initial Q-values influences the early policy by favoring actions with higher starting estimates, potentially accelerating convergence toward optimal behavior if the initialization aligns with the environment's reward structure. However, under the theoretical guarantees of Q-learning, the algorithm converges to the optimal Q-function regardless of the initial values, provided that all state-action pairs are visited infinitely often and the learning rate satisfies standard conditions. Optimistic initializations, such as setting Q-values to a positive constant near the maximum attainable return, further promote exploration by inflating estimates of undiscovered actions, which is particularly effective in sparse-reward settings.[](https://henriquetmaia.github.io/pdf/papers/watkins1992.pdf)[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)[](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/potential_based_shaping_equals_initialization.pdf)

Exploration in Q-learning is essential to discover high-reward actions and avoid suboptimal policies, and the ε-greedy strategy is the most widely adopted method for balancing exploration and exploitation. In ε-greedy, the agent selects a random action with probability ε (exploration) or the action maximizing the current Q-value with probability 1−ε (exploitation); ε is often initialized at 1.0 for full exploration at the start and decayed multiplicatively or linearly to a small value like 0.1 over episodes. This decay schedule ensures thorough initial probing of the environment while gradually shifting toward exploitation as the Q-values become more reliable.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)[](https://henriquetmaia.github.io/pdf/papers/watkins1992.pdf)

Alternative exploration strategies include softmax (or Boltzmann) selection, which probabilistically chooses actions based on their Q-values exponentiated and normalized by a temperature parameter τ, allowing softer preferences that decrease as τ anneals over time. Upper confidence bound (UCB) methods extend optimistic principles by adding a bonus term to Q-values that accounts for uncertainty, such as the number of visits to a state-action pair, prioritizing actions with high estimated value and low experience. These approaches, while less common than ε-greedy in basic Q-learning, can improve sample efficiency in certain domains by directing exploration more intelligently.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)

The tuning of exploration parameters, such as ε or τ, is crucial to prevent premature convergence to suboptimal policies, with higher initial values recommended for longer episodes or larger state spaces to ensure adequate coverage. In practice, ε is adjusted empirically based on episode length, often decaying after a fixed number of steps to balance the exploration-exploitation trade-off effectively.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)
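A minimal sketch of ε-greedy selection with multiplicative decay, using a hypothetical Q-row and illustrative decay constants:

```python
import numpy as np
import random

def epsilon_greedy(q_row, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))
    return int(np.argmax(q_row))

epsilon, eps_min, eps_decay = 1.0, 0.1, 0.995
q_row = np.array([0.1, 0.4, 0.2])     # hypothetical Q-values for one state

for episode in range(3):
    action = epsilon_greedy(q_row, epsilon)
    epsilon = max(eps_min, epsilon * eps_decay)   # multiplicative decay toward 0.1
    print(episode, action, round(epsilon, 4))
```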
## Implementation Aspects

### Tabular Methods

In tabular Q-learning, the action-value function $Q(s, a)$ is represented exactly using a lookup table, typically implemented as a two-dimensional array of size $|S| \times |A|$, where $|S|$ is the number of states and $|A|$ is the number of actions. This structure allows direct storage and retrieval of Q-values for each state-action pair, with updates applied to specific table entries following the Q-learning rule: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$, where $\alpha$ is the learning rate, $r$ is the reward, $\gamma$ is the discount factor, and $s'$ is the next state.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)[](https://www.gatsby.ucl.ac.uk/~dayan/papers/cjch.pdf)

This method is particularly suitable for environments with small, finite, and discrete state and action spaces, such as gridworld tasks, where it provides an exact representation of the optimal action-value function $Q^*(s, a)$. In such settings, the table can capture the full policy without approximation errors, enabling convergence to the optimal Q-values under standard assumptions like infinite exploration and appropriate learning rates.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)[](https://www.gatsby.ucl.ac.uk/~dayan/papers/cjch.pdf)

Implementation begins with initializing the Q-table, often to zero or small random values, to ensure uniform exploration. During training episodes, the agent observes the current state $s$, selects an action $a$ (e.g., via an $\epsilon$-greedy policy), executes it to receive reward $r$ and next state $s'$, then updates the table entry for $(s, a)$ using the temporal-difference error. Over multiple episodes, the table entries are iteratively refined, indexing directly by state and action indices for efficient lookups and updates, until convergence to the optimal table representing $Q^*$.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)

A representative example is the FrozenLake environment, a 4x4 gridworld where the agent navigates from a start position (S) to a goal (G) while avoiding holes (H) on slippery ice tiles (F). The state space has 16 discrete positions, and actions are up, down, left, right ($|A| = 4$). Initially, the Q-table is all zeros:

| State  | Up  | Down | Left | Right |
|--------|-----|------|------|-------|
| 0 (S)  | 0.0 | 0.0  | 0.0  | 0.0   |
| ...    | ... | ...  | ...  | ...   |
| 15 (G) | 0.0 | 0.0  | 0.0  | 0.0   |

After training (e.g., a few thousand episodes with $\alpha = 0.1$, $\gamma = 0.99$, $\epsilon = 0.1$), the table evolves to reflect the learned values: entries for actions that lead toward the goal acquire higher Q-values, while actions leading into holes stay near zero, and the greedy policy reaches the goal far more reliably than random play. This evolution demonstrates how tabular updates build the optimal policy through trial and error.[](https://gymnasium.farama.org/tutorials/training_agents/frozenlake_q_learning/)

The computational cost of tabular Q-learning is straightforward: it requires $O(|S| |A|)$ space for the table and $O(1)$ time per update, making it efficient for modest problem sizes but impractical for larger spaces due to memory limits.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)
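A compact sketch of this workflow on FrozenLake is given below. It assumes the `gymnasium` package (and its `FrozenLake-v1` environment) is installed; the hyperparameters mirror the illustrative values above rather than a tuned configuration.

```python
import numpy as np
import gymnasium as gym  # assumed dependency; provides FrozenLake-v1

env = gym.make("FrozenLake-v1", is_slippery=True)
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        if np.random.rand() < epsilon:                 # epsilon-greedy behavior policy
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Tabular update; terminal states keep Q = 0, so bootstrapping there adds nothing.
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1).reshape(4, 4))  # greedy action index per grid cell
```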
### Function Approximation

In high-dimensional or continuous state spaces, maintaining a tabular representation of the Q-function becomes computationally infeasible due to the exponential growth in the number of states, often referred to as the curse of dimensionality.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) Function approximation mitigates this by parameterizing the Q-function as $Q(s, a; \theta)$, where $\theta$ represents a set of adjustable parameters, allowing generalization across similar states without explicitly storing values for every state-action pair.[](https://www.geeksforgeeks.org/machine-learning/function-approximation-in-reinforcement-learning/)

Linear function approximation methods represent the Q-function as a linear combination of features: $Q(s, a) = \phi(s, a)^T w$, where $\phi(s, a)$ is a feature vector derived from the state-action pair and $w$ is the weight vector. Common feature constructions include tile coding, which partitions the state space into overlapping grids (tilings) and activates binary features for tiles containing the current state, enabling controlled generalization and efficient updates.[](http://www.incompleteideas.net/papers/sutton-96.pdf) Radial basis functions (RBFs) provide an alternative, using continuous Gaussian-like features centered at predefined points in the state space to produce smooth approximations.[](https://sameen.info/pdfs/papers/Q_Learning.pdf) The parameters $w$ are updated using semi-gradient descent on the temporal-difference (TD) error, minimizing the mean squared error between the current Q-estimate and the TD target $r + \gamma \max_{a'} Q(s', a'; w)$, with the update rule $w \leftarrow w + \alpha \delta \phi(s, a)$, where $\delta$ is the TD error and $\alpha$ is the learning rate.[](https://facsmelo.github.io/publications/melo07ecc.pdf)

Nonlinear function approximators, such as deep neural networks, extend this capability by modeling complex, non-linear dependencies in $Q(s, a; \theta)$, where $\theta$ denotes the network weights. These networks are trained by minimizing the mean squared error (MSE) loss on TD targets, typically using stochastic gradient descent with backpropagation.[](https://arxiv.org/abs/1312.5602)

A major challenge in function approximation for Q-learning arises from the "deadly triad" of combining function approximation with bootstrapping (TD updates) and off-policy learning, which can lead to instability and divergence of the value estimates.[](https://arxiv.org/abs/1812.02648) This instability stems from correlated updates and non-stationary targets in off-policy settings, potentially causing the algorithm to oscillate or explode. Solutions include the use of target networks, which maintain a separate, periodically updated copy of the Q-network to stabilize the TD targets and mitigate divergence.[](https://proceedings.mlr.press/v139/zhang21y.html)

Unlike tabular Q-learning, which converges to the optimal Q-function under standard conditions, function approximation offers no such theoretical guarantees, as the approximation may not span the true Q-function and can introduce bias.[](https://facsmelo.github.io/publications/melo07ecc.pdf) However, empirical success has been demonstrated in complex environments, such as Atari 2600 games, where deep Q-networks with function approximation achieved human-level or superhuman performance on several tasks by learning directly from pixel inputs.[](https://arxiv.org/abs/1312.5602)
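The semi-gradient update for the linear case can be sketched as follows; the feature map is a made-up fixed random projection standing in for tile coding or RBFs, and the transition values are arbitrary.

```python
import numpy as np

n_features, n_actions = 8, 2
rng = np.random.default_rng(0)
proj = rng.standard_normal((n_features, 2))   # fixed random projection: an illustrative
                                              # stand-in for tile coding or RBF features

def phi(state):
    """Hypothetical feature vector phi(s) for a 2-D continuous state."""
    return np.cos(proj @ np.asarray(state))   # bounded, smooth features

def q_value(state, action, w):
    return phi(state) @ w[action]             # Q(s, a) = phi(s)^T w_a

alpha, gamma = 0.01, 0.9
w = np.zeros((n_actions, n_features))

# One semi-gradient Q-learning update for a made-up transition.
s, a, r, s_next = (0.2, -0.5), 1, 1.0, (0.3, -0.4)
td_target = r + gamma * max(q_value(s_next, b, w) for b in range(n_actions))
td_error = td_target - q_value(s, a, w)
w[a] += alpha * td_error * phi(s)             # w <- w + alpha * delta * phi(s, a)
print(q_value(s, a, w))
```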
### Discretization Techniques

Discretization techniques in Q-learning address the challenge of applying the algorithm to environments with continuous state or action spaces, or excessively large discrete ones, by mapping them to a finite, manageable set of categories. This process involves partitioning the continuous variables into discrete bins or regions, allowing the standard tabular Q-learning update rule to be applied on the resulting pseudo-discrete space. For instance, continuous state coordinates can be rounded to the nearest integer or assigned to predefined intervals, transforming the problem into one with a countable state-action space.

Common discretization methods include uniform binning, where the space is divided into equally spaced intervals, such as creating a grid that segments each dimension of the state vector into fixed-width bins. This approach is straightforward and computationally efficient but assumes uniform importance across the space. Another technique employs k-means clustering to partition the state space, where data points from observed states are grouped into clusters based on similarity, with each cluster center representing a discrete state; this method adaptively identifies natural groupings without assuming uniformity. Adaptive grid methods further refine partitions dynamically to allocate finer resolution where needed.[](https://ijssst.info/Vol-17/No-24/paper6.pdf)

These techniques involve inherent trade-offs: coarser discretizations reduce the size of the Q-table and accelerate convergence but introduce approximation errors by aggregating dissimilar states or actions, potentially leading to suboptimal policies; finer grids minimize such errors and better approximate the true continuous dynamics but multiply the number of discrete states, which grows exponentially with the number of dimensions, raising storage and update costs. For example, in the CartPole balancing task, the continuous four-dimensional state (cart position, velocity, pole angle, angular velocity) can be discretized into a manageable Q-table using 10 bins for each dimension.[](https://www.sciencedirect.com/science/article/pii/S1474667017574583)[](https://arxiv.org/abs/2006.04938) After discretization, the resulting finite space is treated using standard tabular Q-learning, where the Q-values are updated via the Bellman equation on the binned representations, enabling convergence guarantees under typical assumptions like sufficient exploration.
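A rough sketch of uniform binning for a CartPole-like four-dimensional state, using hypothetical bounds and `np.digitize`:

```python
import numpy as np

# Uniform binning of a CartPole-like 4-D continuous state into 10 bins per
# dimension (the bounds are illustrative assumptions, not exact environment limits).
lows  = np.array([-2.4, -3.0, -0.21, -3.0])
highs = np.array([ 2.4,  3.0,  0.21,  3.0])
n_bins = 10
edges = [np.linspace(l, h, n_bins - 1) for l, h in zip(lows, highs)]

def discretize(state):
    """Map a continuous state vector to a tuple of bin indices usable as a Q-table key."""
    return tuple(int(np.digitize(x, e)) for x, e in zip(state, edges))

state = np.array([0.05, -0.3, 0.02, 0.4])
print(discretize(state))   # a tuple of bin indices, one of 10**4 discrete states
```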
## Historical Context

### Origins and Development

Q-learning was invented by Christopher J. C. H. Watkins as part of his PhD research at the University of Cambridge.[](https://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf) In his 1989 thesis titled *Learning from Delayed Rewards*, Watkins introduced the algorithm as a method for agents to learn optimal behavior in environments with delayed feedback.[](https://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf) The primary motivation for developing Q-learning stemmed from the need to extend temporal-difference (TD) learning methods, originally proposed by Richard S. Sutton in 1988, to support off-policy, model-free control in Markov decision processes (MDPs).[](https://link.springer.com/article/10.1007/BF00115009) Sutton's TD learning focused on prediction tasks, estimating value functions from experience without a model of the environment, but it was primarily on-policy.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) Q-learning advanced this by enabling learning of action-value functions under any policy while converging to the optimal policy, addressing control problems in stochastic environments without requiring knowledge of transition probabilities or rewards.[](https://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf)

The name "Q-learning" derives from the action-value function, denoted as $Q(s, a)$, which estimates the expected return for taking action $a$ in state $s$ and following an optimal policy thereafter.[](https://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf) In his thesis, Watkins provided a proof of convergence for Q-learning in deterministic MDPs under suitable conditions, such as infinite exploration and learning rates summing to infinity while their squares sum to a finite value.[](https://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf) This was followed by the seminal publication "Q-learning" by Watkins and Peter Dayan in 1992, which extended the convergence proof to stochastic MDPs and formalized the algorithm's theoretical guarantees.[](https://link.springer.com/article/10.1007/BF00992698)

Q-learning's foundations also drew from the Rescorla-Wagner model in psychology, a 1972 theory of classical conditioning that inspired TD learning's error-driven updates, and early AI planning techniques involving MDPs, as pioneered by Richard Bellman in the 1950s through dynamic programming for sequential decision-making.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) These influences positioned Q-learning as a bridge between psychological learning models and computational approaches to optimal control in unknown environments.[](https://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf)

### Key Publications and Milestones

Q-learning's development in the 1990s built upon foundational work by Chris Watkins, with key advancements in function approximation explored by Rummery and Niranjan, who introduced on-line Q-learning using connectionist systems to handle continuous state spaces through neural networks.[](https://www.semanticscholar.org/paper/On-line-Q-learning-using-connectionist-systems-Rummery-Niranjan/7a09464f26e18a25a948baaa736270bfb84b5e12) Early applications during this period extended to robotics, where Q-learning was used for tasks like mobile robot navigation and control in uncertain environments.[](https://www.ri.cmu.edu/pub_files/pub1/thrun_sebastian_1996_7/thrun_sebastian_1996_7.pdf)

In the 2000s, Q-learning saw increased adoption in game AI, particularly for multi-agent scenarios, as evidenced by the development of Nash Q-learning for general-sum stochastic games, which enabled agents to converge to equilibria in competitive settings.[](https://www.jmlr.org/papers/volume4/hu03a/hu03a.pdf) Theoretical progress included the analysis by Tsitsiklis and Van Roy on temporal-difference learning with function approximation, which quantified approximation errors and convergence bounds, highlighting challenges in high-dimensional spaces.[](https://www.mit.edu/~jnt/Papers/J063-97-bvr-td.pdf)

The 2010s marked a pivotal milestone with the introduction of Deep Q-Networks (DQN) by Mnih et al.
Their 2013 arXiv preprint demonstrated scaling Q-learning to Atari games using convolutional neural networks on raw pixel inputs, without domain-specific features, and surpassed a human expert on several of the games tested.[](https://arxiv.org/abs/1312.5602) This was further advanced in their 2015 Nature paper, which combined experience replay with a separate target network to stabilize training and reported human-level performance across the Atari suite, establishing DQN as a cornerstone of deep reinforcement learning.[](https://www.nature.com/articles/nature14236) Q-learning's principles influenced hybrid systems like AlphaGo in 2016, where value networks provided state value estimates to guide Monte Carlo tree search in Go.

In the 2020s, extensions addressed offline reinforcement learning, with Kumar et al.'s conservative Q-learning (CQL) mitigating overestimation in static datasets by penalizing out-of-distribution actions, yielding 2-5 times better performance on benchmarks like D4RL.[](https://arxiv.org/abs/2006.04779) Integration with transformers enhanced representation learning, as seen in works like QT-TDM (2024), which combined autoregressive Q-learning with transformer dynamics models for efficient long-term planning in sequential tasks.[](https://www.researchgate.net/publication/386029334_QT-TDM_Planning_With_Transformer_Dynamics_Model_and_Autoregressive_Q-Learning) By 2025, Q-learning remained foundational in libraries such as Stable Baselines3, providing reliable implementations of variants like DQN for practical reinforcement learning workflows.[](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html)

## Variants and Extensions

### Double Q-Learning

Standard Q-learning exhibits an overestimation bias in its value updates, primarily because the target value $r + \gamma \max_{a'} Q(s', a')$ relies on the same Q-function for both action selection (via the max operator) and evaluation, which tends to select noisy, overestimated Q-values in stochastic environments, leading to positively biased estimates that propagate and degrade performance.

To address this issue, Double Q-learning introduces two independent Q-functions, denoted $Q^A$ and $Q^B$, which are updated alternately using a decoupled approach: one Q-function is used to select the action (via argmax), while the other evaluates the corresponding Q-value in the target, thereby reducing the correlation that causes overestimation in the standard algorithm. The update rule proceeds as follows: with equal probability, either update $Q^A$ using the target $r + \gamma Q^B(s', \arg\max_{a'} Q^A(s', a'))$, or update $Q^B$ using the target $r + \gamma Q^A(s', \arg\max_{a'} Q^B(s', a'))$, following the standard temporal-difference form with learning rate $\alpha$:

$$Q^A(s, a) \leftarrow Q^A(s, a) + \alpha \left[ r + \gamma Q^B\left(s', \arg\max_{a'} Q^A(s', a')\right) - Q^A(s, a) \right]$$

(and symmetrically for $Q^B$).

This variant significantly lowers the overestimation bias, resulting in more accurate Q-value estimates and improved empirical performance, particularly in environments with stochastic transitions or rewards, while maintaining off-policy learning properties. Under the standard tabular assumptions (e.g., infinite visits to state-action pairs and appropriate exploration), Double Q-learning converges to the optimal Q-function $Q^*$ with probability 1, similar to standard Q-learning but without the asymptotic bias. In implementation, two separate Q-tables (or function approximators in non-tabular settings) are maintained for $Q^A$ and $Q^B$; for action selection during episodes, the policy can derive from either function or their average to further mitigate variance.
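A minimal tabular sketch of the decoupled select/evaluate update described above, applied to a made-up transition (table sizes and hyperparameters are arbitrary):

```python
import numpy as np
import random

n_states, n_actions = 5, 2
QA = np.zeros((n_states, n_actions))
QB = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

def double_q_update(s, a, r, s_next):
    """One tabular Double Q-learning update."""
    if random.random() < 0.5:
        a_star = int(np.argmax(QA[s_next]))               # select with Q^A ...
        target = r + gamma * QB[s_next, a_star]           # ... evaluate with Q^B
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        b_star = int(np.argmax(QB[s_next]))               # select with Q^B ...
        target = r + gamma * QA[s_next, b_star]           # ... evaluate with Q^A
        QB[s, a] += alpha * (target - QB[s, a])

double_q_update(s=0, a=1, r=1.0, s_next=2)
print(QA[0, 1], QB[0, 1])   # exactly one of the two tables has been updated
```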
### Deep Q-Networks

Deep Q-Networks (DQN) integrate Q-learning with deep neural networks to scale reinforcement learning to high-dimensional state spaces, particularly raw pixel inputs from environments like video games. Developed by Mnih et al. in 2015, DQN uses a convolutional neural network (CNN) to approximate the action-value function $Q(s, a; \theta)$, where $s$ denotes the state (e.g., stacked image frames), $a$ is the discrete action, and $\theta$ represents the network parameters. This end-to-end learning approach eliminates the need for hand-crafted features, enabling agents to process visual observations directly and achieve control policies in complex domains.[](https://www.nature.com/articles/nature14236)

A core challenge in applying deep networks to Q-learning is training instability due to non-stationary targets and correlated sequential data. DQN addresses this through two key innovations: experience replay and a target network. Experience replay stores agent experiences as transitions $(s, a, r, s')$ in a large buffer $\mathcal{D}$, from which random minibatches are sampled to perform approximately independent and identically distributed (i.i.d.) updates, decorrelating samples and improving data efficiency. The target network, a periodic copy of the main Q-network denoted $Q(s', a'; \theta^-)$, provides stable target values for the Bellman update, with $\theta^-$ updated infrequently (e.g., every few thousand steps) to mitigate divergence during optimization. These techniques allow gradient-based training via backpropagation on the entire network.[](https://www.nature.com/articles/nature14236)

The training objective minimizes the mean squared error (MSE) between the predicted Q-value and the TD target:

$$L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right],$$

where $\gamma$ is the discount factor, stabilizing learning by decoupling the target computation from the evolving policy. On the Atari 2600 benchmark of 49 games, DQN achieved more than 75% of the score of a professional human tester on 29 games, demonstrating human-level control from pixels for the first time in reinforcement learning.[](https://www.nature.com/articles/nature14236)

Extensions of DQN have further enhanced its capabilities; notably, Rainbow DQN integrates six complementary improvements—double Q-learning for bias reduction, prioritized experience replay for focusing on high-error transitions, dueling architectures for separate value and advantage estimation, multi-step returns, distributional Q-learning, and noisy networks for exploration—substantially improving median human-normalized scores over vanilla DQN on Atari by 2018. Adaptations have also extended DQN principles to continuous action spaces, such as Deep Deterministic Policy Gradients (DDPG), which employs an actor-critic framework inspired by DQN's replay and target mechanisms to handle deterministic policies in robotic control tasks. By 2025, subsequent research has mitigated DQN's sample inefficiency through advanced replay strategies and auxiliary tasks, enabling broader application in real-world scenarios while preserving its foundational role in deep reinforcement learning.[](https://arxiv.org/abs/1710.02298)
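A minimal sketch of the DQN loss under stated assumptions: a small fully connected network in place of the paper's CNN, a random minibatch standing in for samples drawn from a replay buffer, and PyTorch used purely for illustration.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())      # periodic copy: theta^-
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A fake minibatch (s, a, r, s', done) of size 32, standing in for replay samples.
s = torch.randn(32, obs_dim)
a = torch.randint(0, n_actions, (32,))
r = torch.randn(32)
s_next = torch.randn(32, obs_dim)
done = torch.zeros(32)

with torch.no_grad():                                # targets carry no gradient
    td_target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_pred, td_target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```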
### Other Single-Agent Variants

Prioritized experience replay enhances standard experience replay in Q-learning by sampling transitions from the replay buffer based on the magnitude of their temporal-difference (TD) errors, prioritizing those that are more surprising or informative to accelerate learning. This approach addresses the inefficiency of uniform sampling, which treats all experiences equally regardless of their learning potential. When integrated into deep Q-networks, prioritized replay has demonstrated superior performance, outperforming uniform replay on 41 out of 49 Atari games by achieving higher scores and faster convergence.[](https://arxiv.org/abs/1511.05952)

Dueling deep Q-networks extend Q-learning by modifying the neural network architecture to separately estimate the state value function $V(s)$ and the action advantage function $A(s, a)$, with the Q-value computed as their combination: $Q(s, a) = V(s) + \bigl(A(s, a) - \operatorname{mean}_{a'} A(s, a')\bigr)$. This decomposition allows the network to better represent the relative importance of actions while sharing a common feature extractor for the state value, leading to more efficient learning in environments where many actions have similar values. Empirical evaluations on Atari benchmarks show that dueling architectures consistently outperform standard deep Q-networks across a wide range of games, with improvements in both final performance and sample efficiency.[](https://arxiv.org/abs/1511.06581)

Noisy networks introduce parametric noise directly into the weights of the neural networks used in Q-learning to enable integrated exploration, obviating the need for separate ε-greedy or entropy-based mechanisms. By adding learnable noise parameters to the network layers, the agent can adaptively explore by sampling different noisy versions during training and inference, promoting diverse behavior without explicit decay schedules. This method has been shown to yield state-of-the-art results on Atari games when combined with deep Q-networks, surpassing prior exploration strategies in terms of average scores and robustness to hyperparameter choices.[](https://arxiv.org/abs/1706.10295)

Conservative Q-learning (CQL) adapts Q-learning for offline settings by adding a conservatism penalty to the Q-function update, which discourages overestimation of out-of-distribution (OOD) actions not present in the fixed dataset. The penalty is computed by minimizing the Q-values of actions sampled from a distribution broader than the data, ensuring the learned policy remains grounded in the observed behavior. In benchmarks like D4RL, CQL achieves competitive performance with model-free methods while providing theoretical guarantees on policy improvement, making it a stable choice for offline single-agent reinforcement learning tasks.[](https://arxiv.org/abs/2006.04779)

Recent advancements (2023–2025) have explored integrating world models into Q-learning to augment planning and improve sample efficiency in model-based offline reinforcement learning.
For instance, lower expectile Q-learning combines a learned dynamics model with a conservative Q-value estimator that uses lower expectiles to bound overestimation errors, enabling effective policy learning from static datasets without online interaction. Evaluations on standard offline benchmarks demonstrate that this approach outperforms prior model-based methods across multiple domains, achieving higher normalized scores by leveraging model rollouts for planning.[](https://arxiv.org/abs/2407.00699)

### Multi-Agent Adaptations

In multi-agent environments, Q-learning faces significant challenges due to non-stationarity, as the policies of other agents evolve over time, altering the transition dynamics and rewards from the perspective of any individual learner and violating the Markov assumption underlying standard Q-learning.[](https://arxiv.org/abs/2312.10256) One straightforward adaptation is Independent Q-Learning (IQL), where each agent maintains its own Q-function and updates it independently, treating the actions of other agents as stochastic elements of the environment rather than modeling their strategic behavior.[](https://dl.acm.org/doi/10.5555/3091529.3091572) This approach, introduced in early multi-agent reinforcement learning work, allows for decentralized execution but often struggles with coordination and convergence in competitive or cooperative settings due to the lack of explicit interaction modeling.[](https://dl.acm.org/doi/10.5555/3091529.3091572)

To address these issues, paradigms like Centralized Training with Decentralized Execution (CTDE) have been developed, enabling agents to learn decentralized policies while leveraging centralized information during training. A prominent example is QMIX, which factorizes the joint action-value function into individual Q-values using a monotonic mixing network, ensuring that the optimal joint action aligns with optimal individual actions under cooperative objectives.[](https://arxiv.org/abs/1803.11485) QMIX has demonstrated superior performance in cooperative tasks, such as unit micromanagement in the StarCraft II environment, by allowing non-linear mixing of per-agent values conditioned on global state information during training.[](https://arxiv.org/abs/1803.11485)

Opponent modeling extends Q-learning by augmenting the state representation with estimates of other agents' Q-functions or policies, enabling learners to anticipate and adapt to adversaries' strategies. This technique, applied in deep reinforcement learning frameworks, involves training a separate network to predict opponent actions, which informs the primary Q-network's updates and improves robustness in competitive scenarios.[](https://proceedings.mlr.press/v48/he16.html)

Recent advances from 2020 to 2025 have focused on multi-agent deep Q-networks (MADQN) for cooperative tasks, particularly in partially observable environments like StarCraft micromanagement challenges, where extensions handle non-stationarity through parameter sharing and recurrent architectures.
These methods, building on benchmarks such as the StarCraft Multi-Agent Challenge (SMAC), have achieved state-of-the-art results in scaling to dozens of agents while managing partial observability via centralized critics or opponent-aware value functions.[](https://arxiv.org/abs/2312.10256) Despite this progress, multi-agent Q-learning adaptations face limitations in scalability with increasing agent numbers, as the joint action space grows exponentially, and convergence remains more challenging than in single-agent settings due to persistent non-stationarity and credit assignment problems.[](https://arxiv.org/abs/2312.10256)

## Applications

### Classic Examples

One of the foundational demonstrations of Q-learning is the gridworld environment, a discrete grid where an agent moves between cells to reach a goal while avoiding obstacles. In the classic cliff walking variant, the grid measures 4 rows by 12 columns, with the agent starting at the bottom-left cell (3,0) and the goal at the bottom-right (3,11); a "cliff" occupies cells (3,1) to (3,10), incurring a -100 reward and a reset to the start if entered, while all other moves yield -1 reward. Tabular Q-learning learns the optimal policy of traveling directly along the row adjacent to the cliff, reaching the goal in 13 steps for a total reward of -13, converging to this policy after roughly 500 episodes with parameters such as α=0.5, ε=0.1 (decaying), and γ=0.9.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)

The taxi problem serves as another illustrative case for Q-learning in structured navigation tasks with multiple subgoals. Here, a taxi operates in a 5x5 grid, with states defined by the taxi's position (25 possibilities), passenger location (including in the taxi, 5 options), and destination (4 fixed locations), yielding 500 discrete states; actions include moving north, south, east, or west, picking up, or dropping off the passenger. Rewards are -10 for an invalid pick-up or drop-off, -1 per step, and +20 for a successful drop-off. Q-learning iteratively improves the policy from random movements to efficient routes that minimize steps and achieve the high reward, typically converging within a few thousand episodes of tabular updates.

Frozen Lake highlights Q-learning's handling of stochastic transitions and the need for balanced exploration. The environment is a 4x4 or 8x8 grid of tiles, where the agent starts at one corner and aims for the opposite goal tile; "frozen" tiles are slippery, moving the agent with probability 1/3 each in the intended direction or one of the two perpendicular directions, while "hole" tiles end the episode with 0 reward and the goal yields +1. Using an ε-greedy policy (e.g., ε starting at 1.0 and decaying to 0.1), Q-learning explores risky paths initially to discover safe routes, converging to near-optimal performance in 1000-5000 episodes depending on grid size and slipperiness, emphasizing how a high initial ε promotes discovery of hole-free paths.

Cliff walking further illustrates the influence of the discount factor γ on policy selection in Q-learning.
With γ=0.9, the agent favors the high-reward short path despite the cliff risk, as future rewards are sufficiently valued to offset potential falls; lowering γ to 0.7 shifts preference toward safer, longer routes (e.g., staying farther from the cliff edge), accepting additional per-step penalties in exchange for reduced exposure to the large cliff penalty, demonstrating how γ tunes risk-aversion in off-policy learning.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)[](https://pdfs.semanticscholar.org/0726/ec5c8c8f519de9c147d9a15ab2b679ec8632.pdf)

A simple Python implementation of tabular Q-learning for a basic 3x3 gridworld (goal at (2,2), no obstacles, rewards of -1 per step and +10 at the goal) using NumPy for the Q-table is as follows, based on standard algorithmic pseudocode adapted for discrete states:

```python
import numpy as np
import random

# Environment parameters
n_rows, n_cols = 3, 3
n_actions = 4  # 0: up, 1: down, 2: left, 3: right
goal = (2, 2)
gamma = 0.9
alpha = 0.1
epsilon = 0.1
episodes = 1000

# Actions: (delta row, delta col)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def is_valid(row, col):
    return 0 <= row < n_rows and 0 <= col < n_cols

# Initialize Q-table: states as flattened index (row * n_cols + col)
q_table = np.zeros((n_rows * n_cols, n_actions))

def get_state(row, col):
    return row * n_cols + col

def get_reward(state, action):
    row, col = divmod(state, n_cols)
    drow, dcol = actions[action]
    new_row, new_col = row + drow, col + dcol
    if not is_valid(new_row, new_col):
        return -1  # Invalid move: same penalty as a normal step
    if (new_row, new_col) == goal:
        return 10
    return -1

for episode in range(episodes):
    state = get_state(0, 0)  # Start at (0, 0)
    done = False
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = random.choice(range(n_actions))  # Explore
        else:
            action = int(np.argmax(q_table[state]))   # Exploit
        reward = get_reward(state, action)
        row, col = divmod(state, n_cols)
        drow, dcol = actions[action]
        new_row, new_col = row + drow, col + dcol
        if not is_valid(new_row, new_col):
            new_state = state  # Stay in place if the move is invalid
        else:
            new_state = get_state(new_row, new_col)
            if (new_row, new_col) == goal:
                done = True
        # Q-update
        best_next = np.max(q_table[new_state])
        td_target = reward + gamma * best_next
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = new_state

# Learned policy
policy = {}
for i in range(n_rows * n_cols):
    row, col = divmod(i, n_cols)
    policy[(row, col)] = actions[int(np.argmax(q_table[i]))]
print("Learned policy:", policy)
```

This code initializes a Q-table, applies ε-greedy action selection, and performs off-policy updates, typically converging to the optimal path in under 500 episodes.[](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)

### Modern Real-World Uses

In robotics, Q-learning and its deep variants have been applied to path planning and manipulation tasks, particularly for unmanned aerial vehicles (UAVs) navigating dynamic environments.
For instance, double deep Q-networks (DDQN) have been integrated with convolutional neural networks to enable autonomous drone navigation, allowing real-time obstacle avoidance and trajectory optimization in complex urban settings, achieving up to 95% success rates in simulated structural inspection missions.[](https://www.frontiersin.org/journals/neurorobotics/articles/10.3389/fnbot.2025.1512953/full) Similarly, Q-learning-based dynamic trajectory planning has demonstrated robust performance in uncertain environments, optimizing delivery paths for drones while minimizing energy consumption and collision risks, with empirical results showing a 20-30% improvement in efficiency over traditional methods. These approaches address challenges in real-world deployment, such as partial observability and variable wind conditions, by leveraging experience replay and target networks for stable learning.

In modern gaming, Q-learning elements are incorporated into hybrid reinforcement learning frameworks for training agents in complex, multi-agent environments beyond simple Atari benchmarks. These frameworks build on Q-learning's foundational principles to enable strategic decision-making and emergent behaviors in large-scale simulations.[](https://www.scitepress.org/Papers/2024/132349/132349.pdf)

Offline Q-learning has gained traction in recommendation systems, particularly for personalized content delivery under data constraints. At Netflix, conservative Q-learning (CQL), an offline variant, is used in post-training generative recommenders to fine-tune models on logged user interactions, penalizing value overestimation and improving long-term engagement by 5-10% in A/B tests without requiring online exploration.[](https://netflixtechblog.com/post-training-generative-recommenders-with-advantage-weighted-supervised-finetuning-61a538d717a9) This approach mitigates the challenges of sparse feedback in sequential recommendations, enabling safe deployment in production environments where real-time experimentation is limited, as demonstrated in frameworks that integrate advantage-weighted regression for bandit-like adaptations.

Multi-agent Q-learning adaptations are increasingly utilized in autonomous vehicles for tasks like traffic signal control and merging decisions. In highway on-ramp scenarios, robust multi-agent reinforcement learning with Q-learning components coordinates ego vehicles with surrounding traffic, reducing merge conflicts by 40% in simulations derived from real-world datasets, as shown in 2023 studies on decentralized decision-making.[](https://www.sciencedirect.com/science/article/abs/pii/S0957417423019607) Waymo's research incorporates similar RL fine-tuning for agent behaviors in large-scale simulations, enhancing prediction accuracy for multi-vehicle interactions and supporting safer autonomous driving policies in urban environments as of 2023-2024 evaluations.

In energy management, Q-learning facilitates optimization in smart grids, focusing on load balancing and sustainable resource allocation.
Multi-agent Q-learning adaptations are increasingly utilized in autonomous vehicles for tasks like traffic signal control and merging decisions. In highway on-ramp scenarios, robust multi-agent reinforcement learning with Q-learning components coordinates ego vehicles with surrounding traffic, reducing merge conflicts by 40% in simulations derived from real-world datasets, as shown in 2023 studies on decentralized decision-making.[](https://www.sciencedirect.com/science/article/abs/pii/S0957417423019607) Waymo's research incorporates similar RL fine-tuning for agent behaviors in large-scale simulations, enhancing prediction accuracy for multi-vehicle interactions and supporting safer autonomous driving policies in urban environments as of 2023-2024 evaluations.

In energy management, Q-learning facilitates optimization in smart grids, focusing on load balancing and sustainable resource allocation. A multi-agent deep constrained Q-learning method minimizes daily energy costs in smart buildings by dynamically adjusting flexible loads under uncertainty, achieving up to 15% cost reductions in 2024 IEEE benchmarks through cooperative agent interactions.[](https://ieeexplore.ieee.org/document/10496495/) Recent 2025 analyses compare Q-learning with variants like deep Q-networks for decentralized grid operations, highlighting its efficacy in reducing peak demand by 25% in solar-integrated systems and promoting renewable energy adoption via adaptive pricing and reserve management.[](https://www.ifaamas.org/Proceedings/aamas2025/pdfs/p361.pdf)

## Limitations and Challenges

### Theoretical Limitations

Q-learning's convergence guarantees rely on stringent conditions that are rarely attainable in practice. The algorithm is proven to converge to the optimal action-value function $Q^*$ only if every state-action pair is visited infinitely often and the learning rates satisfy the summability conditions $\sum \alpha_t = \infty$ and $\sum \alpha_t^2 < \infty$, where $\alpha_t$ is the learning rate at time $t$.[](https://link.springer.com/content/pdf/10.1007/BF00992698.pdf) This infinite-visitation requirement, closely related to the "greedy in the limit with infinite exploration" (GLIE) condition on exploration policies, implies that finite training cannot guarantee reaching $Q^*$: learning may stop before sufficient exploration has occurred, leaving only a suboptimal approximation.[](https://link.springer.com/content/pdf/10.1007/BF00992698.pdf)

A core assumption underlying Q-learning's theoretical framework is that the environment forms a stationary Markov decision process (MDP), whose transition probabilities and rewards remain constant over time. In non-stationary environments, such as those with drifting rewards or changing dynamics, these guarantees break down, as the value-function targets shift continuously and prevent convergence to a fixed optimal policy.[](https://link.springer.com/content/pdf/10.1007/BF00992698.pdf) For instance, if rewards evolve due to external factors, Q-values learned early in training become outdated, producing persistent errors unless explicit mechanisms are added to adapt to the change.[](https://jmlr.org/papers/volume17/14-037/14-037.pdf)

Even in tabular settings with stationary MDPs, Q-learning suffers from an inherent overestimation bias introduced by the maximization operator in its update rule, $Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$. The operator selects the highest estimated Q-value among actions, which tends to propagate optimistic errors: noisy or overestimated values are more likely to be chosen than underestimated ones, leading to systematically inflated Q-values and suboptimal policies. Studies have shown that this bias can cause Q-learning to favor actions with spuriously high estimates even when better alternatives exist, a limitation that motivated double Q-learning and related variants.
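The overestimation effect is easy to reproduce numerically: when several actions have the same true value but the estimates are noisy, the maximum of the estimates is biased upward even though each individual estimate is unbiased. The short NumPy experiment below is an illustrative sketch (the number of actions, noise level, and trial count are arbitrary choices), and it also shows how a double-estimator in the style of double Q-learning removes most of the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials, noise = 10, 100_000, 1.0
true_values = np.zeros(n_actions)  # every action is truly worth 0

# Single-estimator (Q-learning style): take the max over one set of noisy estimates
single = np.max(true_values + noise * rng.standard_normal((n_trials, n_actions)), axis=1)

# Double-estimator (double Q-learning style): one estimate set picks the action,
# an independent set evaluates it
pick = true_values + noise * rng.standard_normal((n_trials, n_actions))
evaluate = true_values + noise * rng.standard_normal((n_trials, n_actions))
double = evaluate[np.arange(n_trials), np.argmax(pick, axis=1)]

print(f"max over single noisy estimates: {single.mean():+.3f}  (true maximum is 0)")
print(f"double-estimator value:          {double.mean():+.3f}")
# The first average is clearly positive (roughly 1.5 for ten standard-normal draws),
# while the second stays close to the true value of 0.
```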
When extending Q-learning beyond tabular representations, the combination of three elements (function approximation, off-policy learning, and bootstrapping) forms the "deadly triad," which can cause instability or divergence. Function approximation, used to handle large state spaces, generalizes value estimates across states; off-policy learning reuses data generated by other policies; and bootstrapping builds update targets from the current estimates. Together, these amplify errors: off-policy updates with approximate value functions can produce biased targets, and bootstrapping propagates those biases iteratively, often resulting in oscillations or failure to converge. Classical divergence examples, such as Baird's counterexample, demonstrate that naive combinations of all three elements can fail to converge even in simple MDPs, underscoring Q-learning's vulnerability in approximated settings.

Finally, Q-learning exhibits poor sample efficiency. In the synchronous tabular setting it requires on the order of $|S||A| / ((1 - \gamma)^4 \epsilon^2)$ samples (up to logarithmic factors) to obtain an $\epsilon$-accurate estimate of $Q^*$ with high probability, where $|S|$ and $|A|$ are the sizes of the state and action spaces and $\gamma$ is the discount factor.[](https://users.ece.cmu.edu/~yuejiec/papers/SyncQlearning.pdf) This bound scales linearly with the number of state-action pairs, and since the number of states itself grows exponentially with the dimensionality of the state description, the tabular algorithm becomes impractical for large or continuous environments without additional structure. Such complexity arises because the algorithm must explore enough to resolve the value of every pair, and matching lower bounds show that Q-learning is not minimax optimal: algorithms whose sample complexity scales as $(1-\gamma)^{-3}$ exist under the same assumptions.[](https://users.ece.cmu.edu/~yuejiec/papers/SyncQlearning.pdf)

### Practical Challenges

One of the primary practical challenges in implementing Q-learning is the curse of dimensionality: the Q-table holds one entry per state-action pair, and because the number of states |S| grows exponentially with the number of state variables, exact tabular methods become infeasible for high-dimensional observations such as images or for continuous spaces.[](https://arxiv.org/pdf/2307.10649) This exponential growth demands vast memory and computation, often exceeding practical limits in real-world applications like robotics or games with complex observations. Mitigation strategies typically involve function approximation, such as linear or neural-network representations of the Q-function, which reduces storage needs but introduces approximation errors that can degrade policy quality and convergence (a minimal linear sketch appears at the end of this subsection).

The exploration-exploitation dilemma further complicates Q-learning deployment, as agents must balance discovering new actions (exploration) to avoid settling on suboptimal policies against leveraging known high-reward actions (exploitation) to maximize returns. In practice, ε-greedy exploration, which selects a random action with probability ε, can trap the agent in local optima if ε is too low, or waste samples and prolong training without proportional gains if ε is too high. Tuning the exploration rate therefore requires careful empirical adjustment, often through extensive experimentation in simulated environments.

Safety concerns and the sim-to-real gap pose significant hurdles when transferring Q-learning policies from simulation to physical systems, where learned behaviors prove brittle under discrepancies in dynamics, sensing, or unmodeled noise. For instance, policies that are optimal in simulation may fail catastrophically in reality, risking damage in applications like autonomous driving or manipulation. Domain randomization, which augments training with varied physical parameters (e.g., friction or lighting), helps bridge this gap by fostering robust policies, though it increases training complexity and may still require real-world fine-tuning.
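To make the function-approximation mitigation concrete, and to show exactly where the deadly triad enters, the following is a minimal sketch of semi-gradient Q-learning with a linear Q-function. The feature dimension, random feature vectors, and sample transition are illustrative assumptions rather than part of any specific benchmark.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_actions = 16, 4
alpha, gamma = 0.01, 0.99

# One weight vector per action: Q(s, a) is approximated by w[a] . phi(s)
w = np.zeros((n_actions, n_features))

def q_values(phi):
    return w @ phi  # vector of approximate Q(s, a) for all actions

def semi_gradient_update(phi, a, r, phi_next, done):
    """Off-policy, bootstrapped update with the target held fixed (semi-gradient)."""
    target = r if done else r + gamma * np.max(q_values(phi_next))
    td_error = target - q_values(phi)[a]
    w[a] += alpha * td_error * phi  # gradient of Q(s, a) with respect to w[a] is phi

# Illustrative transition with random feature vectors standing in for real states
phi, phi_next = rng.standard_normal(n_features), rng.standard_normal(n_features)
semi_gradient_update(phi, a=2, r=-1.0, phi_next=phi_next, done=False)
print(q_values(phi))
```

Because the bootstrapped target depends on the same weights being updated, and the transitions may come from an arbitrary behavior policy, this update combines all three elements of the deadly triad; target networks and experience replay are the usual stabilizers in deep variants.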
Deep variants of Q-learning, such as those using neural networks for approximation, impose substantial computational demands, often requiring high-end GPUs for feasible training times in large-scale environments. For example, training a deep Q-network on Atari games has been reported to take roughly 20-25 hours on a single modern GPU, highlighting the resource intensity of even moderately complex tasks.[](https://arxiv.org/abs/2111.01264) In addition, standard Q-learning is data-inefficient in offline settings because it assumes online interaction, which can be costly or risky; batch reinforcement learning methods such as batch-constrained deep Q-learning address this by learning from fixed datasets of logged transitions, but they demand large, high-quality corpora and can suffer from distributional shift between the logged behavior and the learned policy (a minimal sketch of the batch setting follows at the end of this section).

Ethical issues, particularly reward hacking, emerge as a critical challenge in Q-learning applications: agents can exploit loopholes in the reward function to achieve high scores without fulfilling the intended objective, potentially leading to harmful or unintended behaviors in deployed systems.[](https://arxiv.org/pdf/2209.13085) In domains like autonomous systems, this can manifest as unsafe shortcuts, such as a robot prioritizing speed over collision avoidance when the reward inadequately penalizes risk. Recent regulations, including the EU AI Act that entered into force in 2024 and phases in its high-risk provisions through 2027, mandate safety assessments and transparency for RL-based systems in critical infrastructure to mitigate such misalignments.
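As a minimal illustration of the batch setting mentioned above, the sketch below runs repeated tabular Q-learning sweeps over a fixed dataset of logged transitions instead of interacting online. The dataset is a made-up placeholder; real batch methods such as batch-constrained or conservative Q-learning add explicit constraints or penalties to cope with distributional shift.

```python
import numpy as np

n_states, n_actions = 9, 4
alpha, gamma, sweeps = 0.1, 0.9, 200
q = np.zeros((n_states, n_actions))

# Fixed, logged dataset of (state, action, reward, next_state, done) tuples.
# These few transitions are placeholders standing in for a real interaction log.
dataset = [
    (0, 3, -1.0, 1, False),
    (1, 1, -1.0, 4, False),
    (4, 3, -1.0, 5, False),
    (5, 1, 10.0, 8, True),
]

for _ in range(sweeps):
    for s, a, r, s_next, done in dataset:
        target = r if done else r + gamma * np.max(q[s_next])
        q[s, a] += alpha * (target - q[s, a])  # same update, but only on logged data

print(np.round(q, 2))
# Values propagate backward through the logged chain; state-action pairs that never
# appear in the log are never updated, which is why naive batch Q-learning can
# overestimate unseen actions and why conservative variants penalize them.
```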

References

  1. [1]
    PhD Thesis: Learning from Delayed Rewards
    The thesis introduces the notion of reinforcement learning as learning to control a Markov Decision Process by incremental dynamic programming.
  2. [2]
    Q-learning | Machine Learning
    This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to ...
  3. [3]
    Q-Learning Agent - MATLAB & Simulink - MathWorks
    The Q-learning algorithm is an off-policy reinforcement learning method for environments with a discrete action space. A Q-learning agent trains a Q-value ...
  4. [4]
    [PDF] Reinforcement Learning: An Introduction - Stanford University
    We focus on the simplest aspects of reinforcement learning and on its main distinguishing features. ... on examples of correct behavior, reinforcement learning is ...
  5. [5]
    [PDF] Chapter 1 Introduction - Rich Sutton
    These two characteristics—trial-and-error search and delayed reward—are the two most important distinguishing features of reinforcement learning. Reinforcement ...
  6. [6]
    [PDF] Technical Note: Q-Learning
    This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to ...
  7. [7]
    [PDF] Learning from Delayed Rewards - Computer Science
    May 1, 1989 · Learning from Delayed Rewards. Christopher John Cornish Hellaby Watkins. King's College. Thesis Submitted for Ph.D. May, 1989.
  8. [8]
    (PDF) Technical Note: Q-Learning - ResearchGate
    Oct 24, 2025 · This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning ...
  9. [9]
    On-Line Q-Learning Using Connectionist Systems - ResearchGate
    Updates for model-free learning were described using the SARSA TD algorithm (Rummery and Niranjan 1994). The reward prediction error (δ) was computed as the ...
  10. [10]
    [PDF] Asynchronous Stochastic Approximation and Q-Learning - MIT
    The Q-learning algorithm is a method for computing V* based on a reformulation of the Bellman equation V* = T(V*). We provide a brief description of the ...
  11. [11]
    [PDF] An Investigation Into the Effect of the Learning Rate on ...
    In the beginning of training, a reasonably high learning rate is important to learn fast, but once a good approximation has been learned, using a low learning ...
  12. [12]
    Q-Learning
    This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to ...
  13. [13]
    Average reward reinforcement learning: Foundations, algorithms ...
    This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks ...
  14. [14]
    [PDF] Potential-Based Shaping and Q-Value Initialization are Equivalent
    With Q-values initialized below their optimal value, an agent may require learning time exponential in the state and action space in order to find a goal state.
  15. [15]
    [PDF] Q-Learning - Henrique Maia
    This paper has presented the proof outlined by Watkins (1989) that Q-learning converges with probability one under reasonable conditions on the learning rates ...
  16. [16]
    Solving Frozenlake with Tabular Q-Learning
    This tutorial trains an agent for FrozenLake using tabular Q-learning. In this post we'll compare a bunch of different map sizes on the FrozenLake environment.
  17. [17]
    Function Approximation in Reinforcement Learning - GeeksforGeeks
    Jul 23, 2025 · Function approximation is a critical concept in reinforcement learning (RL), enabling algorithms to generalize from limited experience to a broader set of ...
  18. [18]
    [PDF] Successful Examples Using Sparse Coarse Coding - Rich Sutton
    Reinforcement learning is a broad class of optimal control methods based on estimating value functions from experience, simulation, or search (Barto, Bradtke & ...
  19. [19]
    [PDF] Q-FUNCTION APPROXIMATION WITH RADIAL BASIS NETWORK ...
    Following that, we found that an RBF approximation of this off-policy method performed best with J = 20 basis functions.
  20. [20]
    [PDF] Convergence of Q-learning with linear function approximation
    In this paper, we describe Q-learning with linear function approximation. This algorithm can be seen as an exten- sion to control problems of temporal- ...
  21. [21]
    Playing Atari with Deep Reinforcement Learning
    Summary: Deep Neural Networks for Q-Function Approximation in Atari Games
  22. [22]
    [1812.02648] Deep Reinforcement Learning and the Deadly Triad
    Dec 6, 2018 · Sutton and Barto (2018) identify a deadly triad of function approximation, bootstrapping, and off-policy learning. When these three ...
  23. [23]
    Breaking the Deadly Triad with a Target Network
    The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, ...
  24. [24]
    [PDF] K-Means Clustering based Reinforcement Learning Algorithm for ...
    While partitioning the goal of reinforcement learning, we apply a modified K-means clustering algorithm to discretize continuous state and action spaces.
  25. [25]
    Convergence Analysis of Discretization Procedure in Q-Learning
    Discretization of the state and decision spaces is required when Q-Learning is used to solve stochastic optimal control problems with the state and decision ...
  26. [26]
    Balancing a CartPole System with Reinforcement Learning - arXiv
    Jun 8, 2020 · In this paper, we provide the details of implementing various reinforcement learning (RL) algorithms for controlling a Cart-Pole system.
  27. [27]
    Learning to predict by the methods of temporal differences
    Feb 4, 1988 · This article introduces a class of incremental learning procedures specialized for prediction-that is, for using past experience with an incompletely known ...
  28. [28]
    On-line Q-learning using connectionist systems - Semantic Scholar
    On-line Q-learning using connectionist systems · Gavin Adrian Rummery, M. Niranjan · Published 1994 · Computer Science.
  29. [29]
    Q-Learning for Robot Control - ResearchGate
    Q-Learning is a method for solving reinforcement learning problems. Reinforcement learning problems require improvement of behaviour based on received ...
  30. [30]
    [PDF] Nash Q-Learning for General-Sum Stochastic Games
    In extending Q-learning to multiagent environments, we adopt the framework of general-sum stochastic games. In a stochastic game, each agent's reward depends ...
  31. [31]
    [PDF] An Analysis Of Temporal-difference Learning With Function ... - MIT
  32. [32]
    Human-level control through deep reinforcement learning - Nature
    Feb 25, 2015 · Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn ...
  33. [33]
    Conservative Q-Learning for Offline Reinforcement Learning - arXiv
    Jun 8, 2020 · In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function.
  34. [34]
    (PDF) QT-TDM: Planning With Transformer Dynamics Model and ...
    Dec 12, 2024 · Our proposed method, QT-TDM, integrates the robust predictive capabilities of Transformers as dynamics models with the efficacy of a model-free ...
  35. [35]
    DQN — Stable Baselines3 2.7.1a3 documentation - Read the Docs
    Deep Q Network (DQN) builds on Fitted Q-Iteration (FQI) and makes use of different tricks to stabilize the learning with neural networks.
  36. [36]
    Rainbow: Combining Improvements in Deep Reinforcement Learning
    Oct 6, 2017 · This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of- ...
  37. [37]
    [1511.05952] Prioritized Experience Replay - arXiv
    Nov 18, 2015 · We use prioritized experience replay in Deep Q-Networks (DQN), a reinforcement learning algorithm that achieved human-level performance across ...
  38. [38]
    Dueling Network Architectures for Deep Reinforcement Learning
    Nov 20, 2015 · Dueling Network Architectures for Deep Reinforcement Learning, by Ziyu Wang and 5 other authors.
  39. [39]
    [1706.10295] Noisy Networks for Exploration - arXiv
    Jun 30, 2017 · Noisy Networks for Exploration, by Meire Fortunato and 11 other authors.
  40. [40]
    Model-based Offline Reinforcement Learning with Lower Expectile ...
    Jun 30, 2024 · Abstract page for arXiv paper 2407.00699: Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning.
  41. [41]
    Multi-agent Reinforcement Learning: A Comprehensive Survey - arXiv
    This survey examines these challenges, placing an emphasis on studying seminal concepts from game theory (GT) and machine learning (ML).
  42. [42]
    Multi-agent reinforcement learning - ACM Digital Library
    Multi-agent reinforcement learning: independent versus cooperative agents. Ming Tan. ICML'93: Proceedings ...
  43. [43]
    QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent ...
    Mar 30, 2018 · Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion.
  44. [44]
    Opponent Modeling in Deep Reinforcement Learning
    Opponent modeling is needed in multi-agent settings. This paper uses neural models to learn opponent behavior, encoding observations into a deep Q-Network.
  45. [45]
    [PDF] The Effect of Hyperparameters on the Model Convergence Rate of ...
    This paper studies how hyperparameters like learning rate (alpha) and discount factor (gamma) affect the convergence speed of Q-Learning algorithm.
  46. [46]
    [PDF] Addressing Environment Non-Stationarity by Repeating Q-learning ...
    Abstract. Q-learning (QL) is a popular reinforcement learning algorithm that is guaranteed to converge to optimal policies in Markov decision processes.
  47. [47]
    [PDF] Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis
    ... sample complexity of Q-learning to be on the order of |S||A| / ((1−γ)^4 ε^2) (up to log factors). Our theory unveils the strict sub-optimality of Q-learning when ...
  48. [48]
    [PDF] arXiv:2307.10649v1 [q-fin.CP] 20 Jul 2023
    Jul 20, 2023 · It is important to note that the curse of dimensionality in Q-learning makes it challenging to handle high-dimensional data. While Hen ...
  49. [49]
    [PDF] Defining and Characterizing Reward Hacking - arXiv
    Mar 5, 2025 · Reward hacking occurs when optimizing a proxy reward function leads to poor performance according to the true reward function, in reinforcement ...