
AlphaZero

AlphaZero is a general-purpose reinforcement learning algorithm developed by DeepMind, a subsidiary of Alphabet Inc., that achieves superhuman performance in complex board games including chess, shogi, and Go by learning solely through self-play starting from random moves, with no incorporation of human game knowledge or domain-specific heuristics beyond the basic rules. Introduced in late 2017, AlphaZero employs a single deep neural network to approximate both the policy function (for selecting moves) and the value function (for evaluating positions), integrated with Monte Carlo tree search during gameplay to guide decision-making, and updates the network from the outcomes of simulated self-play games. Trained on specialized hardware consisting of roughly 5,000 TPUs (tensor processing units) for self-play generation plus additional TPUs for network training, it required approximately 9 hours for chess, 12 hours for shogi, and 13 days for Go to reach its evaluated performance levels through around 700,000 training steps. In evaluations, AlphaZero demonstrated overwhelming superiority over established champion programs: in chess, it defeated Stockfish (the 2016 TCEC season 9 winner) in a 1,000-game match with 155 wins, 6 losses, and 839 draws; in shogi, it beat Elmo, the 2017 Computer Shogi Association world champion, with a 91.2% win rate across 1,000 games; and in Go, it surpassed AlphaGo Zero (a prior DeepMind system) by winning 61% of their head-to-head games. AlphaZero first outperformed Stockfish in chess after just 4 hours of training, Elmo in shogi after 2 hours, and AlphaGo Lee in Go after 30 hours, and its games highlighted an ability to discover novel strategies, such as aggressive piece sacrifices in chess and unconventional pawn structures, often described by experts as creative and intuitive. Beyond gaming, the algorithm's generalized approach has influenced subsequent DeepMind research, including extensions to real-world applications like optimization problems in computing, underscoring its broader impact on artificial intelligence.

Background and Development

Relation to AlphaGo Zero

AlphaGo Zero represented a significant advancement in reinforcement learning by mastering the game of Go through pure self-play, starting from random play without any reliance on human-generated data, expert knowledge, or hand-crafted features. This approach utilized a single deep neural network to represent both the policy (move selection) and value (position evaluation) functions, trained via reinforcement learning and Monte Carlo tree search (MCTS), enabling it to surpass previous versions of AlphaGo that incorporated human game records. AlphaZero directly extended this methodology to create a general-purpose algorithm capable of achieving superhuman performance across multiple board games, including chess and shogi, in addition to Go. Unlike AlphaGo Zero, which was tailored specifically to the rules and board representation of Go, AlphaZero employed a unified framework in which adapting to a different game required only changes to the input representation and action space definitions, without altering the core learning algorithm or network architecture. This generalization allowed a single AlphaZero algorithm to learn each game from scratch, demonstrating the versatility of self-play reinforcement learning beyond a single domain. A key improvement in AlphaZero was the end-to-end training of a shared policy-value network that could handle diverse game complexities, such as varying board sizes and branching factors, while maintaining the efficiency of AlphaGo Zero's MCTS integration for search-guided decision-making. This unified design facilitated rapid adaptation to new games, as the same training loop—iterating through self-play generation, evaluation, and improvement—produced expert-level play across chess, shogi, and Go within comparable training timelines. The development timeline underscores AlphaZero's position as an immediate successor: AlphaGo Zero was publicly detailed in October 2017, and AlphaZero was announced just two months later in December 2017 as a broader extension of that work.

Comparison to Traditional Game Engines

Traditional game engines, such as Stockfish for chess and Elmo for shogi, rely on decades of human expertise encoded through hand-crafted evaluation functions, opening books, endgame tablebases, and sophisticated search algorithms like alpha-beta pruning. These engines evaluate positions using carefully tuned weights for features like material balance, king safety, and pawn structure, developed by expert programmers and players, allowing them to search tens of millions of positions per second on standard hardware. For instance, Stockfish employs alpha-beta search to efficiently explore deep tactical lines, prioritizing moves that lead to captures or checks, which enhances its strength in sharp, calculative scenarios. In contrast, AlphaZero employs a tabula rasa approach, starting from random play with no prior human knowledge, domain-specific heuristics, or curated databases, and learns entirely through reinforcement learning guided by a single deep neural network. This network outputs both move probabilities and position values, replacing traditional hand-crafted evaluations and enabling the discovery of intuitive strategies that emerge organically from millions of simulated games. Unlike conventional engines, AlphaZero integrates Monte Carlo tree search with neural network guidance, searching far fewer positions—around 80,000 per second in chess—yet achieving deeper strategic insights by focusing on high-value branches rather than exhaustive enumeration. In chess, this manifests as AlphaZero favoring positional understanding and novel openings over Stockfish's tactical prowess; for example, AlphaZero develops its pieces for long-term control and sacrifices material for initiative in ways that bypass traditional engines' materialistic biases. Similarly, in shogi, Elmo's optimizations incorporate game-specific heuristics for promotions and drops, honed by human analysis, whereas AlphaZero's generalized self-play uncovers unconventional formations without such tailoring, emphasizing fluid piece activity across the larger board. These differences highlight AlphaZero's ability to transcend human-derived rules, producing playstyles that blend intuition with precision in ways unattainable by rule-bound traditional systems.
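For contrast with AlphaZero's neural-guided search, the following minimal sketch shows the negamax form of alpha-beta pruning at the core of traditional engines. The position interface (legal_moves, play, evaluate, is_terminal) is hypothetical, and real engines layer hand-tuned evaluation terms, move ordering, transposition tables, and many further pruning heuristics on top of this skeleton.

```python
# Minimal negamax alpha-beta sketch, illustrating the exhaustive-search
# paradigm of engines such as Stockfish. The `position` object and its
# methods are hypothetical placeholders for this example.
def alphabeta(position, depth, alpha=float("-inf"), beta=float("inf")):
    """Return the negamax score of `position` searched to `depth` plies."""
    if depth == 0 or position.is_terminal():
        return position.evaluate()          # hand-crafted static evaluation
    best = float("-inf")
    for move in position.legal_moves():     # ideally ordered best-first
        child = position.play(move)
        score = -alphabeta(child, depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:                   # beta cutoff: opponent avoids this line
            break
    return best
```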

Technical Architecture

Neural Network Design

AlphaZero employs a deep neural network architecture based on residual convolutional layers to process game states and output both move policy probabilities and position value estimates. The network takes as input a multi-plane representation of the board state, capturing the current position along with recent move history to encode dynamic game information. For chess, this consists of 119 planes arranged in an 8×8 stack: 96 planes encode the piece placements of both players (6 piece types per player, giving 12 planes for each of the current and previous 7 positions), and 23 additional planes encode castling rights, the side to move, the no-progress and total move counts, and repetition information. For shogi, the input uses a 362-plane format on a 9×9 board, adapting the encoding to shogi's piece types, promotions, captured-piece (prisoner) counts, and rules such as drop moves. The core of the network is a residual tower of 20 residual blocks, each consisting of two convolutional layers with 256 filters and 3×3 kernels, followed by batch normalization and rectified linear unit (ReLU) activations, enabling the model to learn complex spatial patterns in the board state without vanishing gradients. The body processes the input through an initial convolutional layer with 256 filters and a 3×3 kernel, after which the residual blocks refine the features. The network culminates in two output heads attached to the residual tower: the policy head, which applies a convolutional layer followed by a fully connected layer and softmax to produce a probability distribution over all possible moves (a space of 4,672 moves for chess, covering queen-style and knight moves plus underpromotions); and the value head, which uses additional convolutional and fully connected layers with a tanh activation to output a scalar estimate (-1 to +1) of the expected outcome for the current player, where a win is valued at +1, a loss at -1, and a draw at 0. Training optimizes the network parameters θ using a combined loss function that jointly supervises the policy and value predictions against targets derived from self-play games enhanced by Monte Carlo Tree Search (MCTS). The total loss is the sum of the value loss (mean squared error between predicted value v_θ(s_t) and actual game outcome z_t), the policy loss (cross-entropy between predicted policy p_θ(s_t) and MCTS-improved target policy π_t), and an L2 regularization term. This is formalized as: L = (z_t - v_\theta(s_t))^2 - \pi_t^\top \log p_\theta(s_t) + c \|\theta\|^2, where s_t denotes the state at time t and c is the regularization coefficient. The policy loss encourages the network to approximate the improved move distribution from search, while the value loss aligns predictions with empirical outcomes, with the negative sign in the policy term reflecting the maximization of log-likelihood.
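For illustration, the following is a minimal PyTorch-style sketch of a policy-value network and the combined loss described above, using the chess plane counts and head structure from this section. It is a reconstruction for clarity, not DeepMind's implementation; names such as PolicyValueNet and alphazero_loss are invented for this example, and the L2 term c‖θ‖² is assumed to be applied through the optimizer's weight decay.

```python
# Illustrative AlphaZero-style policy-value network for chess (not DeepMind's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

IN_PLANES = 119      # chess input planes (board history + auxiliary planes)
FILTERS = 256        # filters per convolutional layer
N_BLOCKS = 20        # residual blocks in the tower
N_MOVES = 4672       # size of the chess move representation (8x8x73)

class ResidualBlock(nn.Module):
    def __init__(self, filters):
        super().__init__()
        self.conv1 = nn.Conv2d(filters, filters, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(filters)
        self.conv2 = nn.Conv2d(filters, filters, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(filters)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)                # skip connection avoids vanishing gradients

class PolicyValueNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(IN_PLANES, FILTERS, 3, padding=1, bias=False),
            nn.BatchNorm2d(FILTERS), nn.ReLU())
        self.tower = nn.Sequential(*[ResidualBlock(FILTERS) for _ in range(N_BLOCKS)])
        # Policy head: small conv, then a fully connected layer over the move space.
        self.policy_conv = nn.Sequential(
            nn.Conv2d(FILTERS, 2, 1, bias=False), nn.BatchNorm2d(2), nn.ReLU())
        self.policy_fc = nn.Linear(2 * 8 * 8, N_MOVES)
        # Value head: small conv, hidden layer, scalar output squashed to [-1, 1].
        self.value_conv = nn.Sequential(
            nn.Conv2d(FILTERS, 1, 1, bias=False), nn.BatchNorm2d(1), nn.ReLU())
        self.value_fc = nn.Sequential(
            nn.Linear(8 * 8, 256), nn.ReLU(), nn.Linear(256, 1), nn.Tanh())

    def forward(self, x):                     # x: (batch, 119, 8, 8)
        h = self.tower(self.stem(x))
        p = self.policy_fc(self.policy_conv(h).flatten(1))   # move logits
        v = self.value_fc(self.value_conv(h).flatten(1))     # position value
        return p, v.squeeze(-1)

def alphazero_loss(policy_logits, value, target_pi, target_z):
    """Combined loss (z - v)^2 - pi^T log p; the c*||theta||^2 term is handled
    by the optimizer's weight_decay in this sketch."""
    value_loss = F.mse_loss(value, target_z)
    policy_loss = -(target_pi * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    return value_loss + policy_loss
```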

Reinforcement Learning Mechanism

AlphaZero employs a tabula rasa reinforcement learning approach, beginning with a randomly initialized neural network that lacks any domain-specific knowledge beyond the game's rules. This initial network generates self-play games, in which two instances of the same network play against each other, producing trajectories of positions, moves, and outcomes. The resulting data is then used to update the network parameters through stochastic gradient descent, minimizing prediction errors for both move probabilities and position values. The core of this process is self-play reinforcement learning, where the network serves as its own teacher by evaluating positions during search, enabling iterative improvement without any external supervision or human-generated data. In each training iteration, the current network guides move selection via Monte Carlo Tree Search (MCTS), which refines the raw policy outputs into more effective action probabilities; these, along with the games' final outcomes, form the training targets. There is no initial supervised learning phase—learning proceeds purely through reinforcement learning from self-play, with approximately 44 million games generated for chess to achieve superhuman performance. Value estimation during training relies on the actual game results as targets, providing unbiased returns, while the search phase propagates value estimates through the tree, updating node values as the average of child evaluations. This is complemented by policy iteration, in which the MCTS-enhanced policies from self-play iteratively refine the network's move selection, leading to increasingly sophisticated strategies over successive generations of games. The entire mechanism thus forms a closed-loop system of self-improvement, converging toward optimal play through repeated cycles of generation, evaluation, and optimization.
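The closed loop described above can be summarized schematically as below, reusing the hypothetical PolicyValueNet and alphazero_loss from the previous sketch and assuming an undefined helper self_play_game() that plays one MCTS-guided game and returns (state, search policy π, outcome z) records. Buffer size, batch size, and learning rate are illustrative placeholders rather than DeepMind's settings.

```python
# Schematic AlphaZero training loop: generate self-play data, then fit the
# network to the search-improved targets. Illustrative only.
import random
from collections import deque

import torch

def train_alphazero(net, n_iterations, games_per_iteration, batch_size=4096):
    buffer = deque(maxlen=1_000_000)          # replay buffer of recent positions
    optimizer = torch.optim.SGD(net.parameters(), lr=0.02,
                                momentum=0.9, weight_decay=1e-4)  # L2 term
    for _ in range(n_iterations):
        # 1. Generation: the current network plays against itself via MCTS.
        for _ in range(games_per_iteration):
            # Each record is (state_planes, mcts_policy pi_t, outcome z_t).
            buffer.extend(self_play_game(net))
        # 2. Optimization: sample recent positions and update the network.
        batch = random.sample(buffer, min(batch_size, len(buffer)))
        states, pis, zs = zip(*batch)
        policy_logits, values = net(torch.stack(states))
        loss = alphazero_loss(policy_logits, values,
                              torch.stack(pis),
                              torch.tensor(zs, dtype=torch.float32))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return net
```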

Monte Carlo Tree Search Integration

AlphaZero employs a variant of Monte Carlo Tree Search (MCTS) that leverages the policy-value network to guide the search process during play, replacing traditional random rollouts with network-derived move probabilities P and position evaluations V. This integration allows for effective decision-making with shallower search trees, as the neural network provides informed priors and value estimates rather than relying on extensive simulations to the game's end. The MCTS proceeds in four key steps. In selection, the algorithm traverses from the root node to a leaf by repeatedly choosing actions that balance exploitation of known high-value moves and exploration of promising untried options, using the Predictor + Upper Confidence Bound for Trees (PUCT) formula. Expansion adds child nodes corresponding to legal actions at the selected leaf. Evaluation applies the neural network to the leaf state, yielding the value V for that position and prior probabilities P to initialize the child nodes. Backpropagation then updates the visit counts and average values Q along the path from root to leaf using the evaluated V. The PUCT formula for action selection is given by \text{PUCT}(s, a) = Q(s, a) + c \cdot P(s, a) \cdot \frac{\sqrt{N(s)}}{1 + N(s, a)}, where Q(s, a) is the mean value of action a in state s, c is an exploration constant, P(s, a) is the neural network's prior probability, N(s) is the visit count of the parent node, and N(s, a) is the visit count of the child node corresponding to action a. During self-play training, AlphaZero performs 800 simulations per move, which suffices to produce strong play while requiring far fewer position evaluations than conventional search methods like alpha-beta pruning in traditional engines. This efficiency stems from the neural network's ability to prune unpromising branches early, concentrating simulations on high-potential paths.
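A compact sketch of PUCT-based selection and the backup step, following the formula above, is shown below. The Node container, the exploration constant value of 1.25, and the simple sign-flipping backup are assumptions made for illustration rather than details taken directly from the AlphaZero paper.

```python
# Sketch of PUCT selection and value backup at a search-tree node.
import math

class Node:
    def __init__(self, priors):
        self.P = dict(priors)                 # prior probability P(s, a) from the network
        self.N = {a: 0 for a in priors}       # visit count N(s, a)
        self.W = {a: 0.0 for a in priors}     # total value accumulated below a
        self.children = {}                    # a -> child Node, filled during expansion

    def Q(self, a):
        return self.W[a] / self.N[a] if self.N[a] > 0 else 0.0

def select_action(node, c_puct=1.25):
    """Pick the action maximizing Q(s,a) + c * P(s,a) * sqrt(N(s)) / (1 + N(s,a))."""
    parent_visits = sum(node.N.values())
    def puct(a):
        u = c_puct * node.P[a] * math.sqrt(parent_visits) / (1 + node.N[a])
        return node.Q(a) + u
    return max(node.P, key=puct)

def backup(path, value):
    """Propagate a leaf evaluation up the visited (node, action) path, flipping
    the sign each ply so values stay relative to the player to move."""
    for node, action in reversed(path):
        node.N[action] += 1
        node.W[action] += value
        value = -value
```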

Training Process

Self-Play and Data Generation

AlphaZero employs an iterative self-play mechanism to generate its training data, where two instances of the current neural network act as opponents in simulated games. Each move is selected through Monte Carlo Tree Search (MCTS), which uses the neural network to guide the search by providing prior probabilities and value estimates for positions. Upon completion, each game is recorded as a sequence of states, actions taken, and the final outcome (win, loss, or draw from the perspective of the player to move). To ensure data efficiency and focus on the network's evolving capabilities, AlphaZero maintains a replay buffer containing positions from the most recent games, discarding older data in a FIFO (first-in, first-out) manner. This prevents the model from being hindered by outdated strategies and emphasizes learning from current performance. Positions are sampled uniformly across the stored games for training, while diversity in early-game play is encouraged during generation itself, since openings are critical for strategic variety in board games. This diversity is achieved primarily through stochastic move selection during the opening phase. A temperature parameter is applied to the MCTS visit-count distribution for the first 30 moves (or fewer in Go), encouraging a broader exploration of openings rather than optimal but repetitive play. Additionally, Dirichlet noise is added to the prior probabilities at the root node of the MCTS to introduce randomness and promote branching into less-visited actions, preventing the self-play from converging too quickly to narrow paths. Through this process, AlphaZero reaches superhuman levels after approximately 9 hours of training in chess, 12 hours in shogi, and 13 days in Go, generating tens of millions of games in each case—for instance, roughly 44 million games in chess—while first surpassing prior benchmarks after 4 hours in chess, 2 hours in shogi, and 30 hours in Go.
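The two exploration mechanisms described here can be sketched as follows. The Dirichlet parameters (alpha = 0.3, epsilon = 0.25, roughly the values reported for chess) and the 30-move temperature cutoff are treated as illustrative assumptions rather than a definitive specification.

```python
# Sketch of root exploration noise and temperature-based move sampling.
import numpy as np

def add_root_noise(priors, alpha=0.3, epsilon=0.25):
    """Mix Dirichlet noise into the root prior probabilities P(s, a)."""
    actions = list(priors)
    noise = np.random.dirichlet([alpha] * len(actions))
    return {a: (1 - epsilon) * priors[a] + epsilon * n
            for a, n in zip(actions, noise)}

def sample_move(visit_counts, move_number, temperature_moves=30):
    """Sample proportionally to MCTS visit counts for the opening moves
    (temperature 1); afterwards play the most-visited move (temperature -> 0)."""
    actions = list(visit_counts)
    counts = np.array([visit_counts[a] for a in actions], dtype=np.float64)
    if move_number < temperature_moves:
        probs = counts / counts.sum()
        idx = np.random.choice(len(actions), p=probs)
        return actions[idx]
    return actions[int(counts.argmax())]
```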

Hardware and Timeline

AlphaZero's training relied heavily on Google's Tensor Processing Units (TPUs) for efficient parallel computation. Self-play games were generated using approximately 5,000 first-generation TPUs to enable massive parallelism in simulating games, while the neural networks were trained on 64 second-generation TPUs. During evaluation matches, inference ran on a single machine with four first-generation TPUs, allowing AlphaZero to evaluate around 80,000 chess positions per second under tournament time controls. The development timeline began with a preliminary version detailed in a December 2017 preprint, in which AlphaZero for chess achieved superhuman performance after just 9 hours of training, specifically surpassing Stockfish following 300,000 training steps in about 4 hours. This initial run encompassed 700,000 total steps with mini-batches of 4,096 positions each, starting from random initialization and requiring no human data beyond the game's encoded rules. The full version appeared in a December 2018 Science paper, incorporating extended evaluation but maintaining the core 700,000-step regimen for chess. In terms of scale, the network architecture featured 20 residual blocks. Key differences between the versions included longer overall training durations in the full release for stability, alongside refined hyperparameters such as learning rates and exploration constants to mitigate variance in self-play outcomes.

Game Performance

Results in Chess

AlphaZero's performance in chess was first demonstrated in a preliminary evaluation published in December 2017, where it competed against Stockfish 8 after just four hours of training from scratch. In a 100-game match under time controls of approximately one minute per move, AlphaZero achieved 28 wins, 72 draws, and 0 losses, corresponding to an Elo rating roughly 100 points higher than Stockfish 8. In the full evaluation published in December 2018, AlphaZero faced Stockfish 8 (the 2016 TCEC season 9 winner) in a 1,000-game match under time controls of three hours per game plus 15 seconds per move, resulting in 155 wins, 6 losses, and 839 draws for AlphaZero. Under faster time controls of three minutes per move, AlphaZero secured 29 wins, 0 losses, and 71 draws in a separate 100-game match against Stockfish 8. These outcomes highlighted AlphaZero's superiority across different time constraints. AlphaZero's gameplay introduced novel strategies that diverged from traditional engines, emphasizing sacrificial play for initiative and dynamic pawn structures over material preservation. For instance, it frequently sacrificed pawns or pieces to gain long-term positional advantages, such as open lines for attacks or control of key squares, leading to aggressive and creative opening play, including unconventional treatments of lines such as Sicilian Defense variations. This approach contrasted with Stockfish's more conservative, materialistic evaluation. During matches, AlphaZero evaluated about 80,000 positions per second using Monte Carlo tree search, far fewer than Stockfish's 70 million, yet achieved superior results by prioritizing high-quality evaluations over raw search volume. The matches were conducted on hardware matched to each program's design: AlphaZero ran on four tensor processing units (TPUs) and 44 CPU cores for inference during play, while Stockfish utilized a 64-core CPU setup optimized for parallel search. Despite this, AlphaZero's neural network-guided search proved more efficient, demonstrating the effectiveness of its self-play learning paradigm in a domain dominated by handcrafted heuristics.

Results in Shogi

AlphaZero first surpassed Elmo in shogi after approximately 110,000 training steps, equivalent to about two hours of training on its self-play infrastructure of roughly 5,000 first-generation tensor processing units (TPUs), and reached its full evaluated strength after about 12 hours. This rapid learning curve demonstrated the algorithm's efficiency in mastering the game's rules and strategies solely through reinforcement learning, without any prior human game records or opening books. In a preliminary evaluation, AlphaZero competed against Elmo, the 2017 Computer Shogi Association world champion, in a 100-game match under tournament time controls (3 minutes per move plus 2 minutes for the first 60 moves). AlphaZero secured 90 victories, 8 defeats, and 2 draws, showcasing its dominance even early in training. The full evaluation confirmed this superiority in a 1,000-game match under standard time controls of three hours per game plus 15 seconds per move, where AlphaZero achieved a 91.2% win rate against Elmo. Notably, AlphaZero's play style diverged from traditional human and engine approaches, frequently employing aggressive pawn pushes to open lines and seize initiative, strategies rarely seen in conventional shogi theory. This innovation underscored the algorithm's ability to explore novel tactics. Shogi's inclusion of piece drops significantly expands its complexity, giving it a game-tree complexity of roughly 10^{226}, compared with about 10^{123} for chess, yet AlphaZero seamlessly incorporated these rules into its neural network evaluations and Monte Carlo Tree Search.

Results in Go

AlphaZero applied the same general algorithm to Go as it did to chess and shogi, using a 19×19 board without any game-specific adjustments beyond the rules. This approach built directly on AlphaGo Zero's framework but eliminated all reliance on human-generated game data, starting instead from completely random play. Compared to AlphaGo Zero, whose initial version reached mastery after 3 days of training and played on a single machine with 4 TPUs, AlphaZero for Go was trained for 13 days, with thousands of TPUs generating self-play games. In an evaluation of 100 games under standard time controls (3 hours per game plus 15 seconds per move), the fully trained AlphaZero defeated AlphaGo Zero with 60 wins to 40 losses and 0 draws, a 60% win rate. AlphaZero also demonstrated greater computational efficiency during play, evaluating approximately 1,600 positions per move via Monte Carlo tree search compared to AlphaGo Zero's 10,000, yet maintaining its dominant win rate through a more precise policy. This contributed to a playing style that professional Go players described as more human-like and intuitive, favoring creative, balanced strategies over brute-force computation.

Analysis and Impact

Strengths and Limitations

AlphaZero demonstrates remarkable generality: the same algorithm, without any modifications to its core architecture or hyperparameters, achieves superhuman performance across diverse board games including chess, shogi, and Go, relying solely on the rules of each game as input. This approach enables the discovery of novel, superhuman strategies that diverge from centuries of human expertise, such as aggressive piece sacrifices in chess for long-term positional dominance or unconventional territorial plays in Go. Furthermore, AlphaZero's learning is remarkably fast in wall-clock terms; it reaches world-class proficiency in chess after just four hours of training and full superhuman levels in approximately nine hours, contrasting sharply with the years or decades required for human grandmasters to accumulate comparable knowledge through guided study and practice. Despite these advances, AlphaZero's methodology imposes significant limitations, particularly its exorbitant computational demands, requiring 5,000 first-generation TPUs for game generation and 64 second-generation TPUs for training, over periods ranging from hours for chess to nearly two weeks for Go, to attain peak performance. The system's reliance on deep neural networks renders its decision processes largely uninterpretable, functioning as a black box in which the rationale behind specific moves—such as why a particular sacrifice is preferred—remains opaque to human observers, hindering insights into its strategic reasoning. Additionally, AlphaZero exhibits potential brittleness to perturbations in the environment, as its policies are optimized strictly for the fixed rules provided during training, necessitating full retraining for even minor rule modifications, which limits its adaptability in dynamic or evolving domains. Empirical results highlight AlphaZero's strengths in long-term strategic planning, where it excels at evaluating complex positional trade-offs over dozens of moves, often outperforming traditional engines that prioritize immediate tactical gains. However, it occasionally falters in short-term tactical puzzles. In head-to-head matches, AlphaZero establishes clear dominance; for instance, it won 91.2% of games against the top shogi engine Elmo, defeated AlphaGo Zero in 60 of 100 Go games, and scored 574.5 points (155 wins, 839 draws, 6 losses) against Stockfish in a 1,000-game chess match. The original evaluations, however, provide incomplete coverage of long-term stability, focusing primarily on fixed-match outcomes without extended assessments of performance degradation or robustness over prolonged, repeated play under varying conditions.

Reactions from Experts

Demis Hassabis, CEO of DeepMind, described AlphaZero as representing a historical turning point for artificial intelligence, emphasizing its ability to exhibit human-like intuition and creativity in its play without relying on human knowledge. Garry Kasparov, the former world chess champion, praised AlphaZero as revolutionary, noting in a foreword to a book on its impact that it marked a profound shift in computer chess by prioritizing piece activity and aggressive positions over material conservation, surpassing traditional engines in a way that revitalized interest in the game. In a Science editorial accompanying the 2018 publication, Kasparov further highlighted AlphaZero's significance beyond chess, calling it a model for replicating the accumulation of knowledge in other fields. Criticisms emerged regarding the disparity in computing resources and configuration in AlphaZero's preliminary match against Stockfish, in which AlphaZero utilized four tensor processing units (TPUs) while Stockfish ran on 64 CPU threads with only a 1 GB hash table, leading Stockfish developers and commentators to argue that the setup disadvantaged the traditional engine and flattered AlphaZero's margin. Additionally, the initial announcement faced reproducibility challenges because DeepMind withheld the full code and training details, making independent validation difficult and costly without crowdsourced reverse-engineering efforts. In the shogi community, professionals expressed amazement at AlphaZero's novel tactics, such as advancing the king toward the board's center—contrary to established theory and appearing risky from a traditional viewpoint—which demonstrated unconventional strategies beyond established play. Go experts observed a stylistic evolution in AlphaZero compared to AlphaGo Zero, noting its more intuitive and fluid approach that balanced attack and defense with flair, differing from prior systems while resembling top human play in some aspects; AlphaZero defeated AlphaGo Zero in 60 out of 100 games, underscoring this shift. The publication of AlphaZero's methods was widely lauded for generalizing across games, though some researchers questioned potential biases in self-play training, where Elo ratings could become unrealistically inflated without external validation, as noted in related DeepMind works. No major updates to the original AlphaZero system have been announced since its 2018 publication.

Legacy and Subsequent Works

AlphaZero's algorithmic framework has directly inspired several successor systems at DeepMind, notably MuZero, introduced in a 2019 preprint and later detailed in a 2020 Nature publication. MuZero extends AlphaZero's search-based approach by learning its own model of the environment's dynamics, rewards, and state representations without requiring prior knowledge of the rules, enabling it to achieve superhuman performance across board games like Go, chess, and shogi, as well as Atari video games. This advancement addressed a key limitation of AlphaZero, which assumed access to game rules for simulation, by allowing MuZero to infer the necessary environmental model solely from interactions, thus broadening applicability to domains with partially observable or unknown dynamics. Building further on this lineage, AlphaDev, unveiled by DeepMind in 2023, adapts AlphaZero-inspired reinforcement learning to optimize low-level software algorithms, specifically targeting sorting routines in the LLVM libc++ standard library. By framing algorithm discovery as a single-player game in which the agent refines assembly code sequences to minimize execution time on random inputs, AlphaDev uncovered novel sorting routines for small arrays (up to length 5) that outperform human-designed benchmarks, resulting in up to 70% speedups for short sequences when integrated into the LLVM libc++ library. These improvements have been upstreamed into widely used open-source implementations, demonstrating the reach of AlphaZero's principles in accelerating foundational computing tasks beyond games. Beyond direct successors, AlphaZero has influenced reinforcement learning applications in diverse fields, including matrix multiplication, quantum control, and optimal control. In mathematics, AlphaTensor (2022) employs an AlphaZero-like search to discover faster algorithms for matrix multiplication, surpassing longstanding human records for small dimensions such as 4×4 and achieving efficiency gains of up to roughly 20% for certain larger matrices by optimizing tensor decompositions through reinforcement learning. For quantum control, researchers have adapted AlphaZero's search framework to optimize control protocols, such as pulse and gate sequences in quantum circuits, enabling solutions that outperform traditional methods in tasks like state preparation and error mitigation. In optimal and model predictive control, AlphaZero's policy iteration and lookahead techniques have informed control frameworks, enhancing stability and performance in adaptive systems, as explored in analyses of its principles for suboptimal and model predictive control. Academically, AlphaZero has driven a surge in research on model-free and self-supervised reinforcement learning, with its seminal 2018 Science paper garnering over 3,700 citations as of 2025 in Crossref and influencing a shift toward scalable, general-purpose agents in both games and real-world optimization. This body of work, exceeding 5,000 citations across related DeepMind publications since 2018, underscores AlphaZero's role in catalyzing advancements that extend far beyond its original game-playing domain.

References

  1. [1]
    A general reinforcement learning algorithm that masters chess ...
    Dec 7, 2018 · In chess, AlphaZero first outperformed Stockfish after just 4 hours (300,000 steps); in shogi, AlphaZero first outperformed Elmo after 2 hours ( ...
  2. [2]
    [1712.01815] Mastering Chess and Shogi by Self-Play with a ... - arXiv
    Dec 5, 2017 · In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains.
  3. [3]
    AlphaZero: Shedding new light on chess, shogi, and Go
    Dec 6, 2018 · AlphaZero, a single system that taught itself from scratch how to master the games of chess, shogi (Japanese chess), and Go, beating a world-champion program in ...
  4. [4]
    AlphaZero and MuZero - Google DeepMind
    AlphaZero and MuZero are now helping us solve real-world problems ...
  5. [5]
    Mastering the game of Go without human knowledge - Nature
    Oct 19, 2017 · (4) AlphaGo Zero is the program described in this paper. It learns from self-play reinforcement learning, starting from random initial weights, ...
  6. [6]
    DeepMind's AlphaZero now showing human-like intuition in ...
    Dec 6, 2018 · DeepMind's AlphaZero now showing human-like intuition in historical 'turning point' for AI.
  7. [7]
    Rise of the machines: new book shows how revolutionary AlphaZero is
    Feb 9, 2019 · Rise of the machines: new book shows how revolutionary AlphaZero is ... Garry Kasparov has written a foreword for a newly published book in ...
  8. [8]
    Chess, a Drosophila of reasoning - Science
    Garry Kasparov wrote an article entitled "Chess, a Drosophila of reasoning" (1). Machine learning has been outperforming human champions in ...
  9. [9]
    Alpha Zero: Comparing "Orangutans and Apples" - ChessBase
    Dec 13, 2017 · Alpha Zero is a chess program and won a 100 game match against Stockfish by a large margin. But some questions remain. Reactions from chess professionals and ...
  10. [10]
    Mastering Atari, Go, Chess and Shogi by Planning with a Learned ...
    Nov 19, 2019 · In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of ...
  11. [11]
    Mastering Atari, Go, chess and shogi by planning with a learned model
    Dec 23, 2020 · Here we present the MuZero algorithm, which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging ...
  12. [12]
    Faster sorting algorithms discovered using deep reinforcement ...
    Jun 7, 2023 · AlphaDev discovered small sorting algorithms from scratch that outperformed previously known human benchmarks.
  13. [13]
    AlphaDev discovers faster sorting algorithms - Google DeepMind
    Jun 7, 2023 · AlphaDev uncovered new sorting algorithms that led to improvements in the LLVM libc++ sorting library that were up to 70% faster for shorter ...
  14. [14]
    Discovering faster matrix multiplication algorithms with ... - Nature
    Oct 5, 2022 · Here we report a deep reinforcement learning approach based on AlphaZero 1 for discovering efficient and provably correct algorithms for the multiplication of ...
  15. [15]
    Global optimization of quantum dynamics with AlphaZero deep ...
    Jan 14, 2020 · However, AlphaZero still performs well despite its limitation of only having amplitude-discretized controls. To improve the AlphaZero algorithm ...
  16. [16]
    [PDF] Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive ...
    “Reinforcement Learning for Control: Performance, Stability, and Deep Approximators,” Annual Reviews in Control, Vol. 46, pp. 8-28. [BKB20] Bhattacharya ...