Efficiently updatable neural network
An efficiently updatable neural network (NNUE) is a specialized neural network architecture optimized for evaluating positions in board games like shogi and chess, featuring an input layer that encodes board states in a way that enables incremental updates with minimal computational overhead when pieces move.[1] Developed initially for shogi engines, NNUE replaces traditional handcrafted evaluation functions with machine-learned models trained on vast datasets of game positions, achieving high accuracy while running efficiently on standard CPUs during alpha-beta search.[2]
Originating in 2018 from the Japanese shogi programming community, the technique was pioneered by Yu Nasu as an extension of piece-square table methods from earlier engines like Bonanza, using a multi-layer perceptron with half king-piece (HalfKP) features that consider king-relative piece placements for both players.[3] In 2020, programmer Hisayori "Nodchip" Noda adapted NNUE for the chess engine Stockfish, integrating it into version 12 and yielding an Elo rating improvement of approximately 90 points over the prior hand-tuned evaluation, marking a pivotal shift toward neural network dominance in competitive chess engines.[2] Since then, NNUE has been widely adopted in engines such as Komodo Dragon, Igel, and Ethereal, combining the precision of deep learning with the speed required for real-time play, and influencing training methodologies that leverage self-play and supervised learning from high-quality game databases.[1]
The architecture typically includes an overparameterized input layer (e.g., 768 features for chess), one or more hidden layers with clipped ReLU activations (often 1024–3072 neurons), and a single output neuron producing a centipawn evaluation score, all quantized for further efficiency.[1]
Overview
Definition and purpose
An efficiently updatable neural network (NNUE), stylized as ƎUИИ, is a type of feedforward neural network specifically engineered to support rapid incremental updates when inputs undergo minor modifications, making it ideal for real-time evaluation functions in board games such as shogi and chess.[4] Unlike conventional deep neural networks that require full recomputation for each input change, NNUE leverages sparse, differential updates to maintain computational efficiency on standard CPUs, exploiting the fact that game states typically evolve through small alterations like piece movements.[5] This architecture was originally developed for computer shogi but has since been adapted for other strategic games.[6]
The primary purpose of NNUE is to generate a numerical score representing the strength of a given game position (for example, the estimated advantage for one player over the other), thereby approximating the intuitive assessments made by human experts more faithfully than the hand-crafted heuristic functions traditionally used in alpha-beta search algorithms.[4] By integrating neural network capabilities into game engines, NNUE enhances the accuracy of position assessments without compromising the low-latency requirements of search processes, which often demand millions of evaluations per second.[5] This allows engines to achieve superhuman performance levels while remaining computationally feasible for consumer hardware.[7]
The name NNUE originates from a Japanese wordplay on nue, a mythical chimera-like creature from folklore, and was coined by its inventor, Yu Nasu, in reference to the network's hybrid and adaptive nature.[3]
The core evaluation function of NNUE can be expressed mathematically as s = f\left(W_4 \cdot \sigma\left(W_3 \cdot \sigma\left(W_2 \cdot \sigma\left(W_1 \cdot x + b_1\right) + b_2\right) + b_3\right) + b_4\right), where x represents the vectorized board state, W_i denote the weight matrices for each layer, b_i are the corresponding bias terms, \sigma is the clipped ReLU activation function, and f is a linear scaling function to produce the final score s.[4] This formulation enables the efficient propagation of changes through the network, particularly in the initial feature transformation layer W_1 \cdot x + b_1, to support seamless integration with game tree search.[5]
Key advantages
Efficiently updatable neural networks (NNUE) offer significant update efficiency, achieving constant-time O(1) complexity for evaluating single-piece moves by incrementally updating only the affected neurons in the input layer, in contrast to the O(n) cost, with n the number of input features, of the full forward passes required by standard neural networks for each position change.[1] This reuse of prior computations from accumulators enables rapid adaptation to board states with minimal changes, such as quiet moves affecting just two neurons or captures impacting three.[1]
NNUE facilitates hybrid integration with classical search algorithms like alpha-beta pruning, allowing neural evaluations to enhance traditional tree searches without incurring proportional computational slowdowns, thereby supporting deeper exploration of game trees on standard hardware.[2] In practice, this combination has enabled CPU-based engines to achieve performance levels competitive with GPU-accelerated deep learning methods.[8]
In terms of accuracy, NNUE models trained on self-play or human game data outperform hand-crafted evaluation functions by approximately 100 Elo points in engines like Stockfish.[9][8]
NNUE maintains a favorable resource profile, operating efficiently on CPUs with network weights typically under 50 MB, around 10-50 million parameters, and inference times below 1 microsecond per position, ensuring low memory and latency suitable for real-time game play.[10][1]
History
Development in shogi
The efficiently updatable neural network (NNUE) was invented by Yu Nasu in 2018 as an advancement over traditional static neural evaluators in computer shogi programs, such as Bonanza and its derivatives like YaneuraOu.[4][11] Nasu, a member of the Ziosoft Computer Shogi Club and the Tanu-King team, developed NNUE to address limitations in prior evaluation functions that struggled with the computational demands of shogi's search algorithms.[4][12] The primary motivation stemmed from shogi's unique characteristics, including its 9x9 board and piece drop rules, which generate highly dynamic positions requiring evaluation functions resilient to frequent incremental changes during alpha-beta search.[4]
Early prototypes employed half-KP (King-Piece) features, which encode relative positions between the king and other pieces in a sparse, symmetric manner to capture essential tactical and positional motifs while minimizing input dimensionality.[4] This design allowed for efficient updates through delta computations, where only affected features are recalculated upon piece movements, enabling rapid evaluation without full network recomputation.[4][13]
Nasu detailed this "efficiently updatable" paradigm in his seminal 2018 paper, "Efficiently Updatable Neural-Network-based Evaluation Functions for Computer Shogi," presented as an appeal document for the 28th World Computer Shogi Championship.[4] Accompanying the publication, he released the initial codebase on GitHub, implementing NNUE as a USI-compliant evaluation module integrable into existing shogi engines.[13] The architecture featured a shallow feedforward network with clipped ReLU activations, optimized for CPU execution to maintain search speeds comparable to handcrafted evaluators.[4]
Early adoption occurred swiftly within the Japanese shogi programming community, with NNUE integrated into the open-source engine YaneuraOu by developer Motohiro Isozaki as early as May 2018.[12][14] This integration enabled the use of wider, more expressive networks without incurring search slowdowns, as the update mechanism preserved low-latency performance during gameplay.[4] By 2019, YaneuraOu employing NNUE (under the banner "YaneuraOu with Otafuku Lab") won the 29th World Computer Shogi Championship (WCSC29), demonstrating substantial performance gains over prior handcrafted and static neural approaches.[12][15] NNUE continued to power top-performing shogi engines in subsequent championships through the 2020s.
Integration into chess engines
The adaptation of efficiently updatable neural networks (NNUE) from shogi to Western chess began in 2020, driven by Japanese developer Hisayori "Nodchip" Noda, who ported the technology into a development version of the Stockfish engine.[3][8] This port modified the input representation to suit chess, incorporating piece-square table (PST)-style features that encode piece positions relative to the king, enabling efficient updates during search.[1] Community efforts facilitated the integration, with the NNUE code merged into Stockfish's main repository through collaborative pull requests, marking a pivotal shift toward hybrid neural-classical evaluation in open-source chess engines.[2] Stockfish version 12, released on September 2, 2020, officially introduced NNUE as the default evaluation function, representing a milestone in chess engine development.[16] The network was trained on evaluations from millions of positions generated at moderate search depths using prior Stockfish versions, resulting in a substantial strength increase of approximately 90 Elo points compared to Stockfish 11.[2][9] This upgrade preserved Stockfish's alpha-beta search efficiency while enhancing positional understanding, quickly establishing NNUE as a standard for CPU-based engines. 
By 2021, the chess AI community expanded NNUE's reach, with Leela Chess Zero (LC0) incorporating hybrid NNUE variants to combine its deep neural network style with faster CPU inference.[17] Ongoing refinements in Stockfish included the HalfKAv2 architecture in version 14 (September 2021), which reduced input redundancy by focusing on king-relative features, and larger networks such as the SFNNv6 net in version 16 (June 2023), further boosting performance through refined training and implementation.[8][18]
By 2025, NNUE had been integrated into numerous engines beyond Stockfish, including Komodo Dragon, which adopted the technology in its November 2020 release to blend traditional search with neural evaluation for deeper positional insight.[19] Similarly, Fairy-Stockfish supports variant-specific NNUE networks for fairy chess variants, enabling strong play across non-standard rulesets like those with custom pieces or board geometries.[20] Recent advancements, such as Stockfish 17 released in September 2024, continued to optimize NNUE for even greater efficiency and strength.[21] These developments underscore NNUE's versatility beyond standard chess, fostering broader adoption in specialized and experimental engines.
Technical architecture
Layer composition
The Efficiently Updatable Neural Network (NNUE) utilizes a shallow multi-layer feedforward architecture optimized for rapid incremental updates in game tree search. The network typically comprises four layers: a sparse input layer derived from board features, two hidden layers employing clipped ReLU activations, and a single-neuron output layer that yields a scalar evaluation score in centipawns. This topology balances representational capacity with low latency, enabling evaluation speeds exceeding 10 million positions per second on consumer hardware.[4][5]
The hidden layers feature neuron counts such as 1024–3072 in the first and 32 in the second, facilitating hierarchical feature processing from raw board states to refined evaluations. Modern implementations, such as Stockfish as of 2025, favor first hidden layers in this larger range and may employ variants like squared clipped ReLU (SCReLU) for improved performance. The clipped ReLU activation, applied after each hidden layer's linear transformation, is defined as \sigma(x) = \min(\max(0, x), 1), which ensures numerical stability by clipping outputs to the [0, 1] interval, mitigating gradient issues during training and overflow in integer computations. To further enhance inference speed, weights and biases in these layers are quantized to 8-bit signed integers, reducing memory footprint while preserving accuracy.[4][5]
In chess implementations, the network typically encompasses around 10 million parameters, predominantly in the first hidden layer due to the expansive input dimensionality.
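As a rough illustration of the figures above, the following sketch counts the parameters of one hypothetical layout and applies the clipped ReLU and 8-bit quantization described in this section. The layer sizes (41,024 HalfKP features per perspective feeding 256 neurons, followed by 512→32→32→1) are an assumption chosen to land near the quoted 10 million parameters, not values stated in the sources here, and the scale factor of 64 is likewise illustrative.

```python
# Hypothetical layer sizes (assumed for illustration, not taken from a
# specific engine release).
INPUT_FEATURES = 41_024   # assumed HalfKP features per perspective
ACC_SIZE = 256            # assumed first hidden layer width per perspective
HIDDEN = 32               # width of the later hidden layers


def affine_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def count_parameters() -> dict:
    """Parameter count per layer for the assumed layout."""
    return {
        "feature_transformer": affine_params(INPUT_FEATURES, ACC_SIZE),
        # Both perspectives are concatenated before the next layer.
        "hidden_1": affine_params(2 * ACC_SIZE, HIDDEN),
        "hidden_2": affine_params(HIDDEN, HIDDEN),
        "output": affine_params(HIDDEN, 1),
    }


def clipped_relu(x: float) -> float:
    """sigma(x) = min(max(0, x), 1), as defined in the text."""
    return min(max(0.0, x), 1.0)


def quantize_int8(w: float, scale: int = 64) -> int:
    """Round a float weight to a signed 8-bit integer (scale is assumed)."""
    return max(-128, min(127, round(w * scale)))


counts = count_parameters()
total = sum(counts.values())
share = counts["feature_transformer"] / total
print(f"total parameters: {total:,}")
print(f"share in first layer: {share:.1%}")
```

Under these assumptions the total comes to roughly 10.5 million parameters, with more than 99% of them in the first layer, which is why that layer dominates both memory use and the benefit of incremental updates.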
Shogi variants, adapted to a larger 9x9 board and diverse piece promotions, scale to significantly more parameters, often tens of millions including biases, to capture increased positional complexity.[5][4]
The output layer performs a linear projection without activation, producing a value normalized to the [-1, 1] range; this is scaled during evaluation such that +1.0 corresponds to approximately +400 centipawns advantage for the player to move, aligning with traditional engine scoring conventions.[5]
Input representation
The input representation in efficiently updatable neural networks (NNUE) employs a sparse binary vector to encode board states, prioritizing king-centric perspectives to reflect strategic priorities in games like chess and shogi, such as piece mobility and safety around the kings.[1] A core feature type is the HalfKP encoding, which constructs sub-vectors for each possible king position (up to 64 squares in chess), capturing interactions between the king and non-king pieces on the board. In chess implementations, the encoding uses 6 piece types × 64 squares = 384 features per side in the effective representation, extended over 64 king positions in the overparameterized HalfKP structure.[8][22]
The overall input vector spans a dimension of roughly 80,000, yet leverages inherent board sparsity, activating only the features corresponding to the current pieces on the board, approximately 32 active features per position, via indexing of piece types and their square positions relative to the king. This design ensures that only occupied squares contribute non-zero values, minimizing computational overhead while preserving positional detail.[8][5]
Encoding is inherently side-specific, maintaining distinct representations for white and black kings to incorporate asymmetric viewpoints; features cover all standard piece types (pawn through queen) with square positions defined relative to each king, thereby emphasizing proximity, attacks, and defensive configurations.[1][22]
Adaptations for game variants extend this framework: shogi NNUE includes dedicated flags for pieces held in hand, accounting for drop mechanics and promotions without inflating the core vector. For fairy chess variants, additional custom vectors encode bespoke piece movements and interactions, ensuring compatibility with non-standard rulesets.[1][4]
Update mechanism
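The update mechanism operates on feature indices, so it helps to see how the king-relative encoding described under Input representation might map a position to those indices. The sketch below is a simplified layout and not the exact scheme of any engine: it assumes squares numbered 0 to 63 with a1 = 0, five non-king piece types in two colors (10 kinds), and a rank flip for Black's perspective. All function names are hypothetical.

```python
# Simplified HalfKP-style indexing (an assumed layout; real engines differ in
# ordering, padding, and orientation conventions).
NUM_SQUARES = 64
PIECE_TYPES = 5                      # pawn, knight, bishop, rook, queen
PIECE_KINDS = 2 * PIECE_TYPES        # own pieces, then opponent pieces
FEATURES_PER_PERSPECTIVE = NUM_SQUARES * PIECE_KINDS * NUM_SQUARES


def orient(square: int, perspective_is_white: bool) -> int:
    """Flip ranks for Black so each side views the board from its own end."""
    return square if perspective_is_white else square ^ 56


def halfkp_index(king_sq: int, piece_type: int, piece_is_white: bool,
                 piece_sq: int, perspective_is_white: bool) -> int:
    """Index of the (own king square, piece kind, piece square) feature."""
    own_piece = piece_is_white == perspective_is_white
    kind = piece_type + (0 if own_piece else PIECE_TYPES)
    k = orient(king_sq, perspective_is_white)
    p = orient(piece_sq, perspective_is_white)
    return (k * PIECE_KINDS + kind) * NUM_SQUARES + p


def active_features(king_sq, pieces, perspective_is_white):
    """pieces: iterable of (piece_type, piece_is_white, square), kings excluded."""
    return sorted(
        halfkp_index(king_sq, t, w, sq, perspective_is_white)
        for t, w, sq in pieces
    )


# White pawn on e2 (square 12) and black queen on d8 (square 59).
pieces = [(0, True, 12), (4, False, 59)]
white_view = active_features(4, pieces, True)    # white king on e1
black_view = active_features(60, pieces, False)  # black king on e8
print(white_view, black_view)
```

With these assumptions each perspective has 40,960 possible features, so the two perspectives together give 81,920 inputs, in line with the "roughly 80,000" figure above, while only one feature per non-king piece is active in each perspective.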
Incremental computation
The core of incremental computation in NNUE lies in the delta update principle, which maintains persistent states for the hidden layer activations across board position changes. Rather than recomputing the entire network input from scratch for each evaluation, the system tracks an accumulator representing the pre-activation values of the first hidden layer. When a minimal change occurs, such as a single move, the accumulator is updated by subtracting the contributions from the removed or altered features (e.g., a piece leaving its square) and adding the contributions from the new features (e.g., the piece arriving at its destination or a captured piece being removed). This is formalized as h' = h - W \cdot \delta_{\text{out}} + W \cdot \delta_{\text{in}}, where h is the current accumulator, W is the weight matrix for the input-to-hidden layer, \delta_{\text{out}} encodes the outgoing feature vector (typically a one-hot vector for the old position), and \delta_{\text{in}} encodes the incoming one.[4][5]
This approach optimizes the forward pass by limiting recomputation to only the affected features, which are few in number per move: typically 2 for a quiet move (old and new position of the piece), 3 for a capture (plus the captured piece), or up to 4 for special moves like castling. A full recomputation from scratch would form the accumulator as h_{\text{new}} = W \cdot x_{\text{new}} + b, where x_{\text{new}} is the updated sparse input feature vector and b is the bias; the incremental variant instead computes the change directly as \Delta h = W \cdot (x_{\text{new}} - x_{\text{old}}) and adds it to the existing accumulator, with the clipped ReLU activation \sigma(h_{\text{new}}) applied only when the position is actually evaluated.
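The delta update above can be sketched in a few lines. This is a toy illustration under stated assumptions, not an engine implementation: the feature space and hidden layer are tiny, weights are random integers, and the names `refresh`, `update`, and `clipped_relu_int` as well as the activation ceiling of 127 are hypothetical.

```python
import random

HIDDEN_SIZE = 8      # tiny hidden layer for illustration
NUM_FEATURES = 100   # tiny feature space (assumed)

rng = random.Random(0)
# One column of HIDDEN_SIZE integer weights per input feature.
WEIGHTS = [[rng.randint(-50, 50) for _ in range(HIDDEN_SIZE)]
           for _ in range(NUM_FEATURES)]
BIAS = [rng.randint(-50, 50) for _ in range(HIDDEN_SIZE)]


def refresh(active):
    """Full recomputation: h = W x + b for a sparse binary x."""
    acc = list(BIAS)
    for f in active:
        for i in range(HIDDEN_SIZE):
            acc[i] += WEIGHTS[f][i]
    return acc


def update(acc, removed, added):
    """Incremental update: h' = h - W*delta_out + W*delta_in."""
    new_acc = list(acc)
    for f in removed:
        for i in range(HIDDEN_SIZE):
            new_acc[i] -= WEIGHTS[f][i]
    for f in added:
        for i in range(HIDDEN_SIZE):
            new_acc[i] += WEIGHTS[f][i]
    return new_acc


def clipped_relu_int(acc, ceiling=127):
    """Activation applied only at evaluation time (ceiling is assumed)."""
    return [min(max(0, v), ceiling) for v in acc]


acc = refresh({3, 17, 42, 77})
# Quiet move: feature 17 disappears, feature 18 appears (2 changes).
acc_quiet = update(acc, removed=[17], added=[18])
# Capture: mover goes 42 -> 43 and the captured piece's feature 77 vanishes.
acc_capture = update(acc_quiet, removed=[42, 77], added=[43])
activations = clipped_relu_int(acc_capture)
print(acc_capture == refresh({3, 18, 43}))
```

Because the arithmetic is purely integer addition and subtraction, the incrementally maintained accumulator matches a full refresh exactly, which is the property search code relies on when it makes and unmakes moves.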
Keeping the activation out of the update path ensures that the bulk of the network's computation remains deferred until necessary, with the accumulator serving as a lightweight, updatable intermediate representation.[4][5]
The time complexity of these updates is effectively O(1) per move, as it scales linearly with the small number of changed features (bounded by a constant like 4) multiplied by the hidden layer size, contrasting sharply with the O(d) complexity of a naive full-input recomputation, where d is the total number of input features (often thousands for board representations). In practice, this enables rapid state transitions during search, with updates performed via simple integer additions and subtractions on the accumulator values.[4][5]
NNUE's incremental updates involve no approximations, relying on exact arithmetic to preserve precision; quantized integer representations (e.g., 16-bit for accumulators) are used to avoid floating-point accumulation errors over multiple moves, with careful scaling to prevent overflow given the maximum number of active features. This exactness ensures that evaluations remain deterministic and faithful to the trained network parameters, without degradation from repeated delta applications.[4][5]
Efficiency optimizations
To achieve low-latency evaluations during alpha-beta searches in chess engines, NNUE implementations employ several hardware-aware optimizations that reduce memory bandwidth and computational overhead without compromising positional accuracy. Quantization is a primary technique, converting floating-point weights and activations to low-precision integers such as int8 for weights and int16 for accumulators, with dequantization performed on-the-fly using scaling factors like powers of 2 for efficient bit shifts.[5] This reduces the model size by approximately 4x (from 32-bit floats to 8-bit integers) while limiting accuracy degradation to negligible levels in shallow networks like those used in Stockfish, where clipped ReLU outputs are bounded to 0-127.[1] In practice, Stockfish's feature transformer layer applies int8 multiplications followed by int32 accumulation, enabling faster integer arithmetic on modern CPUs and supporting larger hidden layer sizes without proportional increases in memory usage.[5]
SIMD instructions further accelerate core operations, with AVX2 vectorizing dot products to process 16 int16 values per 256-bit register and AVX-512 extending this to 32 values, providing additional speedup over AVX2 in the innermost evaluation loops.[1] These extensions leverage instructions like VNNI for fused int8 multiply-accumulate, minimizing data movement in the affine transformations of hidden layers.[5]
For cache efficiency, implementations precompute sparse feature vectors relative to king positions (using HalfKP encoding, where each piece is paired with the king of the perspective being evaluated) and store incremental accumulators that update only affected elements per move, avoiding full recomputation except on king relocations.[8] Additionally, transposing weight matrices aligns memory access patterns with CPU cache lines, improving locality during batched evaluations and reducing load/store latencies by up to 20% in optimized builds.[23]
Experimental ports extend NNUE to non-x86 hardware, including SIMD-free variants for mobile devices via NNAPI and GPU backends like CUDA, though these remain secondary to CPU optimizations due to the network's small size and the latency costs of data transfer in inference scenarios. Stockfish has supported NNUE on ARM-based systems since version 12 (2020), with ongoing optimizations for portability and minimal performance loss as of Stockfish 17 (September 2024).[2][24] In Stockfish 16 (June 2023), the classical hand-crafted evaluation was fully removed, marking complete reliance on NNUE.[8]
Training methods
Dataset generation
Training datasets for efficiently updatable neural networks (NNUE) are constructed as pairs of board positions and corresponding evaluation labels, primarily derived from simulations of games played by strong engines. Self-play games generated by engines like Stockfish, searched to depths of 20 or greater, form the core of these datasets, providing a vast array of positions encountered during play. To enhance diversity and capture human-like strategic nuances, positions from human master games are incorporated alongside the self-play data. These datasets have scaled dramatically over time, starting with approximately 10 million positions for initial shogi implementations in 2018 and expanding to 800 million for early chess NNUE in 2020, reaching up to 20 billion positions by 2024; by 2022, some datasets exceeded 4 TB in size.
Labels for these positions are assigned based on game outcomes (win = 1, draw = 0.5, loss = 0), often interpolated linearly to reflect expected scores, or derived from value estimates produced by deeper engine searches or Monte Carlo Tree Search (MCTS) rollouts in compatible frameworks. Supervision is applied sparsely, prioritizing quiet positions where no immediate tactics disrupt the board state, as these allow the network to learn stable positional evaluations without interference from volatile moves. Recent studies emphasize filtering out noisy or tactically unstable positions to improve data quality: a position is excluded when the absolute difference between its initial evaluation and a quiescence search result exceeds 60 centipawns (0.6 pawns), or when the corresponding negamax difference exceeds 70 centipawns.[25]
Data augmentation techniques further expand the effective dataset size while promoting symmetry invariance. Common methods include mirroring the board across the vertical axis, generating equivalent positions from a single source without altering the underlying evaluation. Positions are also rigorously filtered for reliability, such as discarding those where the discrepancy between the game outcome and the engine evaluation is greater than 1 pawn (100 centipawns), ensuring the training data aligns closely with reliable assessments. This preprocessing yields high-quality inputs tailored for NNUE's supervised learning paradigm.
Optimization techniques
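The optimizer consumes the output of the preparation steps described under Dataset generation. As a minimal sketch of that input pipeline, the quiet-position filter and the file mirroring could look like the following. The `Sample` record, its field names, and the 0 to 63 square numbering (a1 = 0, so `square ^ 7` swaps the a- and h-files) are assumptions made for illustration; only the 60-centipawn threshold comes from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Sample:
    """Hypothetical training record (field names are illustrative)."""
    static_eval_cp: int    # evaluation before quiescence search
    qsearch_eval_cp: int   # evaluation after quiescence search
    result: float          # game outcome: 1.0 win, 0.5 draw, 0.0 loss
    pieces: tuple          # ((piece_code, square), ...), squares 0..63


QUIET_THRESHOLD_CP = 60    # threshold quoted in the text


def is_quiet(s: Sample) -> bool:
    """Keep positions whose static and quiescence evaluations agree."""
    return abs(s.static_eval_cp - s.qsearch_eval_cp) <= QUIET_THRESHOLD_CP


def mirror_files(s: Sample) -> Sample:
    """Mirror left-right: file a <-> h, b <-> g, and so on."""
    return replace(s, pieces=tuple((code, sq ^ 7) for code, sq in s.pieces))


def prepare(samples):
    """Drop noisy positions, then emit each survivor and its mirror image."""
    out = []
    for s in samples:
        if is_quiet(s):
            out.append(s)
            out.append(mirror_files(s))
    return out


samples = [
    Sample(30, 45, 1.0, ((1, 12), (6, 52))),   # quiet: difference of 15 cp
    Sample(10, 160, 0.0, ((5, 0),)),           # noisy: difference of 150 cp
]
prepared = prepare(samples)
print(len(prepared))
```

One caveat on the design: in chess, left-right mirroring preserves the evaluation only when neither side retains castling rights, so a real pipeline would likely need an additional check that this sketch omits.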
The training of efficiently updatable neural networks (NNUE) follows a supervised regression framework, where the primary objective is to minimize the discrepancy between the network's predicted position evaluations and target values derived from game outcomes. The standard loss function is the mean squared error (MSE), formulated as
L = \frac{1}{N} \sum_{i=1}^{N} (s_{\text{pred},i} - s_{\text{target},i})^2,
where N is the number of samples in a batch, s_{\text{pred},i} denotes the network's output score for sample i, and s_{\text{target},i} represents the discounted game result for that sample, typically scaled as 1 for a win, 0.5 for a draw, and 0 for a loss, with adjustments based on the game phase to emphasize midgame or endgame relevance.[4][1] This setup ensures the network learns a smooth evaluation function aligned with actual play outcomes, prioritizing positions near the end of self-play games for higher reliability.[5]
Optimization relies on gradient-based methods such as Adam or stochastic gradient descent (SGD) with momentum to update network weights efficiently. A common learning rate schedule begins at 10^{-3} and decays progressively to 10^{-5} over roughly 100 epochs, allowing initial rapid convergence followed by fine adjustments to avoid overshooting minima.[26] Batch sizes typically range from 10,000 to 100,000 positions, enabling stable gradient estimates while leveraging GPU parallelism for large-scale training; smaller batches may introduce noise beneficial for generalization, but larger ones accelerate throughput on modern hardware.[5][8]
To mitigate overfitting, L2 regularization via weight decay is applied, penalizing large weights to promote small-magnitude, generalizable weights suitable for the NNUE's integer-based deployment. Early stopping monitors validation performance, halting training when metrics like Elo rating on a held-out set plateau, typically after observing no improvement over several epochs.[26][6]
Post-training refinement often involves knowledge distillation, where a smaller NNUE student network is trained to mimic outputs from a larger teacher model, compressing knowledge while preserving accuracy. Additionally, fine-tuning on curated datasets of human expert games can infuse stylistic elements, such as positional preferences or aggressive tendencies, enhancing the network's alignment with intuitive play beyond pure win-rate optimization.[1][26]
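The loss, learning rate schedule, and early stopping described above can be sketched without any training framework. Several details here are assumptions rather than facts from the sources: the decay is taken to be geometric (the text gives only the endpoints), validation loss stands in for the Elo-based metric, and the function names and the patience of 5 epochs are hypothetical.

```python
def mse_loss(pred, target):
    """L = (1/N) * sum_i (s_pred_i - s_target_i)^2."""
    n = len(pred)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / n


def learning_rate(epoch, total_epochs=100, lr_start=1e-3, lr_end=1e-5):
    """Geometric decay from lr_start to lr_end (decay shape is assumed)."""
    frac = epoch / (total_epochs - 1)
    return lr_start * (lr_end / lr_start) ** frac


def should_stop(val_losses, patience=5):
    """Stop when the last `patience` epochs brought no improvement."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


# Targets use the 1 / 0.5 / 0 scaling for win / draw / loss.
loss = mse_loss([1.0, 0.5, 0.0], [1.0, 0.0, 0.0])
schedule = [learning_rate(e) for e in range(100)]
print(loss, schedule[0], schedule[-1])
```

A geometric schedule spends comparatively more epochs at small learning rates than a linear one would, which matches the stated goal of rapid early convergence followed by fine adjustment; other decay shapes would satisfy the same endpoints.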