
Batch normalization

Batch normalization is a technique in deep learning that normalizes the inputs to each layer of a neural network by subtracting the batch mean and dividing by the batch standard deviation, then applying learnable scale and shift parameters to restore representational power. This process addresses internal covariate shift (the change in the distribution of layer inputs during training), enabling faster convergence, higher learning rates, and more stable training without requiring meticulous weight initialization. Proposed by Sergey Ioffe and Christian Szegedy in 2015 and presented at the 2015 International Conference on Machine Learning (ICML), batch normalization was introduced as a mechanism to accelerate the training of deep neural networks, particularly those with many layers, where traditional training is hindered by shifting distributions that necessitate low learning rates and careful parameter tuning. The paper received the ICML 2025 Test of Time Award.

The approach integrates directly into the network architecture, typically applied immediately before the nonlinearity in fully connected or convolutional layers, transforming inputs x = Wu (where W are the weights and u the previous layer's outputs) into normalized versions. The core mechanism operates on mini-batches during training: for a mini-batch B = \{x_1, x_2, \dots, x_m\}, it computes the empirical mean \mu_B = \frac{1}{m} \sum_{i=1}^m x_i and variance \sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2; each input is then normalized as \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} (with \epsilon a small constant for numerical stability), followed by scaling and shifting y_i = \gamma \hat{x}_i + \beta, where \gamma and \beta are learned parameters initialized to 1 and 0, respectively. During inference, population statistics (running averages of means and variances from training) replace batch statistics to ensure deterministic outputs.

Among its key benefits, batch normalization not only speeds up training, achieving accuracy comparable to prior methods with up to 14 times fewer training steps on ImageNet classification, but also serves as a form of regularization, sometimes obviating the need for dropout while improving generalization. It has become a standard component in convolutional neural networks (CNNs) and other deep architectures, significantly contributing to advances in computer vision and beyond, though it performs best with sufficiently large mini-batches and can introduce challenges in scenarios like recurrent networks or small-batch training.

History and Introduction

Development and Original Proposal

Batch normalization was introduced by Sergey Ioffe and Christian Szegedy, researchers at Google, in their seminal 2015 paper titled "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," published as an arXiv preprint and later presented at the 2015 International Conference on Machine Learning (ICML). The technique emerged as a response to the growing difficulties in training increasingly deep neural networks following breakthroughs like AlexNet in 2012, where extending network depth beyond a few layers often led to training instability. At the time, deep network training was hindered by issues such as vanishing gradients and activation saturation, which were exacerbated in deeper architectures and necessitated careful parameter initialization, small learning rates, and activation functions like ReLU to mitigate slow convergence. Ioffe and Szegedy proposed batch normalization to address these challenges by standardizing the inputs to each layer during training, using the mean and variance computed from mini-batch statistics, thereby stabilizing the learning process across layers. This approach was motivated by the problem of internal covariate shift, where the distribution of layer inputs shifts as parameters are updated, complicating optimization. In initial experiments, the authors applied batch normalization to an image classification model on the ImageNet dataset, demonstrating that it achieved the same accuracy as a baseline state-of-the-art model in 14 times fewer training steps and, using an ensemble of batch-normalized networks, surpassed it with a top-5 validation error of 4.9%. These results highlighted batch normalization's ability to enable much higher learning rates and to reduce sensitivity to initialization, marking a significant advancement in accelerating deep network training.

Impact and Recognition

Following its introduction in 2015, batch normalization was rapidly integrated into major deep learning frameworks, with TensorFlow incorporating it as a core layer by late 2015 through its contrib module and later as a standard layer. PyTorch, released in early 2017, included batch normalization as a built-in module from its initial versions, facilitating seamless adoption in convolutional neural networks (CNNs) and recurrent architectures. This quick integration transformed batch normalization into a default component in model design, extending its use beyond vision tasks to natural language processing and generative models. The technique's key impacts include enabling the stable training of much deeper networks, such as the Residual Networks (ResNets) introduced in 2015, which achieved unprecedented depths of over 100 layers on ImageNet without gradient vanishing issues. It also diminished the need for meticulous weight initialization schemes like Xavier or He initialization, allowing practitioners to employ higher learning rates and simpler setups, thereby streamlining experimentation. These advancements played a pivotal role in democratizing deep learning by lowering barriers to effective model training for researchers and engineers without specialized tuning expertise.

By 2020, batch normalization had become ubiquitous, appearing in over 90% of top-performing classification models on ImageNet leaderboards, and it maintained strong prevalence in benchmarks through the mid-2020s. A notable milestone was its inclusion in major architectures like Inception-v2 in 2015, where it improved accuracy and training speed on large-scale datasets. In recognition of its lasting influence, the original 2015 paper by Sergey Ioffe and Christian Szegedy received the ICML Test of Time Award in 2025, underscoring its enduring relevance a decade after publication as a foundational enabler of modern deep learning pipelines.

Motivation

Internal Covariate Shift

Internal covariate shift refers to the change in the distributions of internal activations within a deep neural network during training, primarily caused by updates to the parameters of preceding layers. This phenomenon alters the inputs to subsequent layers, producing shifting activation distributions that the network must continually adapt to. The consequences of internal covariate shift include the need for lower learning rates to maintain stability, more careful parameter initialization to avoid initial instability, and overall slower convergence, particularly in deeper networks. These issues are exacerbated by the amplification of small changes through multiple layers, often resulting in the saturation of nonlinearities such as the sigmoid, which in turn causes vanishing gradients and hinders effective learning.

Evidence for internal covariate shift is illustrated in the original proposal through plots of activation distributions in a sigmoid-activated network trained on MNIST without normalization; these show significant shifts in the mean and variance of activations over the course of training, indicating ongoing distributional changes. In contrast, networks employing batch normalization maintain stable distributions throughout training. This concept extends the broader notion of covariate shift, where the input distribution changes between training and test phases, to the internal representations of hidden layers rather than solely the input features, making it a network-specific challenge akin to repeated domain shifts within the model itself. Batch normalization mitigates internal covariate shift by normalizing the inputs to each layer using statistics computed from each mini-batch, thereby fixing the mean and variance of the activations and reducing the distributional changes caused by parameter updates.

Modern Interpretations of Benefits

Subsequent research has challenged the original hypothesis that batch normalization primarily mitigates internal covariate shift (ICS), proposing instead that its benefits stem from smoothing the optimization landscape of the loss function. The role of internal covariate shift nonetheless remains debated, with some studies, such as Rauf et al. (2020), arguing that reducing ICS is essential and sufficient for batch normalization's performance gains, countering the earlier challenges. In a seminal study, Santurkar et al. demonstrated through controlled experiments that internal covariate shift persists even with batch normalization applied, yet training converges faster and more reliably than in networks without it. Specifically, they showed that networks without batch normalization, but with artificially stabilized layer distributions to minimize covariate shift, do not exhibit the same acceleration in training or stability gains. This empirical evidence from ablation studies suggests that the core advantage lies elsewhere, shifting focus to how batch normalization alters the geometry of the loss surface for more effective optimization.

One key mechanism identified is the reduction in the Lipschitz constant of the loss function, which bounds the norms of gradients across the parameter space and prevents explosive behavior during optimization. By enforcing this smoothness, batch normalization enables the use of substantially larger learning rates without risking instability, as the constrained gradient magnitudes maintain consistent update steps. Complementing this, Bjorck et al. argued that batch normalization acts as a form of preconditioning, transforming the curvature of the loss to make it more isotropic in parameter space, thereby stabilizing gradients and promoting convergence akin to optimizing a more convex-like function. More recent work, as of 2025, suggests that batch normalization also improves the clustering characteristics of hidden representations, enhancing generalization without relying on sparsity. These interpretations collectively explain the empirical observation that batch-normalized networks train more efficiently across diverse architectures and tasks.

An additional perspective highlights the regularizing effect arising from the stochasticity in mini-batch statistics, which injects beneficial noise during training to enhance generalization. This noise, inherent to estimating means and variances from finite mini-batches rather than the full dataset, mirrors techniques like dropout by introducing variability that discourages overfitting, though it requires no extra inference-time computation. While not the primary driver of optimization speed, this aspect contributes to the overall robustness of batch normalization in improving model performance on unseen data.

Procedures

Forward Pass Normalization

Batch normalization is applied during the forward pass immediately after the linear transformation in a layer, such as a fully connected or convolutional operation (e.g., x = Wu + b), but before the nonlinearity (e.g., ReLU). For a mini-batch of m activations B = \{x_1, x_2, \dots, x_m\}, the batch mean \mu_B and variance \sigma_B^2 are first computed as follows: \mu_B = \frac{1}{m} \sum_{i=1}^m x_i, \quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2. These statistics normalize the input to have zero mean and unit variance. The normalization step then transforms each input x_i to \hat{x}_i: \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, where \epsilon is a small positive constant added to the denominator for numerical stability, preventing division by zero when the batch variance is small. To retain representational power, the normalized values are subsequently scaled and shifted using learnable parameters \gamma and \beta, one pair per feature: y_i = \gamma \hat{x}_i + \beta. This allows the layer to recover the original distribution if needed, while benefiting from the stabilized inputs. In convolutional layers, normalization occurs independently for each channel (feature map), with statistics aggregated across the spatial dimensions (height p and width q) and the batch, resulting in an effective mini-batch size of m' = m \cdot p \cdot q; a single pair of parameters \gamma and \beta is applied per channel.
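As a concrete illustration, the following NumPy sketch implements this training-time forward pass for an (m, d) batch of fully connected activations; the function name batchnorm_forward, the cached tuple, and the toy data are illustrative choices made here, not any framework's API.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for an (m, d) activation matrix."""
    mu = x.mean(axis=0)                    # per-feature batch mean mu_B
    var = x.var(axis=0)                    # per-feature (biased) batch variance sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    y = gamma * x_hat + beta               # learnable scale and shift restore expressiveness
    cache = (x_hat, mu, var, gamma, eps)   # saved for the backward pass
    return y, cache

# Toy mini-batch: 4 examples, 3 features, far from zero mean / unit variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(4, 3))
y, _ = batchnorm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0))  # approximately 0 per feature
print(y.std(axis=0))   # approximately 1 per feature (up to eps)
```

For convolutional inputs of shape (m, C, H, W), the same computation would be applied per channel, with the mean and variance reduced over the batch and spatial axes, matching the effective mini-batch size m' = m \cdot p \cdot q described above.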

Backpropagation Through Normalization

During training, backpropagation through the batch normalization (BN) layer computes gradients with respect to the layer's inputs and learnable parameters using the chain rule, accounting for the dependencies introduced by the mini-batch mean \mu_B and variance \sigma_B^2. The gradient with respect to the normalized input \hat{x}_i is first obtained as \frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i} \cdot \gamma, where L is the loss, y_i is the scaled and shifted output, and \gamma is the scale parameter; this propagates the upstream gradient \frac{\partial L}{\partial y_i} through the affine transformation. To compute the gradient with respect to the original input x_i, the dependencies on \mu_B and \sigma_B^2 must be resolved via the chain rule: \frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{2(x_i - \mu_B)}{m} + \frac{\partial L}{\partial \mu_B} \cdot \frac{1}{m}, where m is the mini-batch size, and the auxiliary gradients are \frac{\partial L}{\partial \sigma_B^2} = \sum_{j=1}^m \frac{\partial L}{\partial \hat{x}_j} \cdot \frac{(x_j - \mu_B) \cdot (-1/2)}{(\sigma_B^2 + \epsilon)^{3/2}}, \quad \frac{\partial L}{\partial \mu_B} = \sum_{j=1}^m \frac{\partial L}{\partial \hat{x}_j} \cdot \left( -\frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \frac{\partial L}{\partial \sigma_B^2} \cdot \sum_{j=1}^m \frac{-2 (x_j - \mu_B)}{m}. These terms correct for the batch-wide statistics, ensuring the input gradient reflects the normalization's effect on the entire mini-batch; the small \epsilon > 0 (typically 10^{-5}) is included in the denominator to prevent division by zero and maintain numerical stability during backpropagation. Note that the second term in \frac{\partial L}{\partial \mu_B} evaluates to zero since \sum_j (x_j - \mu_B) = 0.

The gradients for the learnable parameters \gamma and \beta are simpler, as they depend only on the normalized inputs and upstream gradients, without batch-statistic corrections: \frac{\partial L}{\partial \gamma} = \sum_{i=1}^m \frac{\partial L}{\partial y_i} \cdot \hat{x}_i, \quad \frac{\partial L}{\partial \beta} = \sum_{i=1}^m \frac{\partial L}{\partial y_i}. These are essentially unnormalized sums over the mini-batch (or averages when divided by m) that enable efficient updates to the affine parameters via stochastic gradient descent. Computationally, propagating gradients through BN incurs an additional cost of O(m \cdot d) per layer, where d is the feature dimension, due to the summations over the mini-batch; however, this is offset by the layer's role in enabling larger learning rates and more stable parameter updates during training.
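A companion NumPy sketch of these gradient formulas, reusing the cache produced by the forward-pass example above, is shown below; it is an illustrative transcription of the equations rather than a library routine, and checking it against finite differences is a reasonable way to validate it.

```python
import numpy as np

def batchnorm_backward(dy, cache):
    """Gradients of the loss through batch normalization via the chain rule above.

    dy    : (m, d) upstream gradient dL/dy
    cache : (x_hat, mu, var, gamma, eps) saved by the forward pass
    """
    x_hat, mu, var, gamma, eps = cache
    m = dy.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)
    x_minus_mu = x_hat / inv_std                    # recover x_i - mu_B from x_hat

    dbeta = dy.sum(axis=0)                          # dL/dbeta  = sum_i dL/dy_i
    dgamma = (dy * x_hat).sum(axis=0)               # dL/dgamma = sum_i dL/dy_i * x_hat_i
    dx_hat = dy * gamma                             # dL/dx_hat_i = dL/dy_i * gamma

    dvar = (dx_hat * x_minus_mu * -0.5 * inv_std**3).sum(axis=0)
    dmu = (dx_hat * -inv_std).sum(axis=0) + dvar * (-2.0 * x_minus_mu).sum(axis=0) / m
    dx = dx_hat * inv_std + dvar * 2.0 * x_minus_mu / m + dmu / m
    return dx, dgamma, dbeta
```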

Inference Phase

During the inference phase of models employing batch normalization, the process adapts to operate without mini-batches, relying instead on fixed estimates of the mean and variance to ensure deterministic outputs. Specifically, the mini-batch mean \mu_B and variance \sigma_B^2 used in training are replaced by estimates of the population mean \mu and variance \sigma^2, approximated as \mu \approx \frac{1}{T} \sum_{t=1}^T \mu_B^t and \sigma^2 \approx \frac{m}{m-1} \frac{1}{T} \sum_{t=1}^T \sigma_B^{2,t}, where T is the number of mini-batches and m is the mini-batch size, with the factor \frac{m}{m-1} providing an unbiased estimate of the variance (neglecting the typically small variance of the batch means). These statistics are estimated during training by maintaining running averages over the mini-batches, often implemented as moving averages to prioritize recent batches and reduce computational overhead. The update rule typically follows an exponential scheme: the running mean is updated as \text{running\_mean} \leftarrow \text{momentum} \times \text{running\_mean} + (1 - \text{momentum}) \times \mu_B, with a similar form for the running variance, where the momentum parameter (commonly set to 0.9 or 0.99) controls the decay rate and balances responsiveness to new data with stability.

At inference, the normalization step then applies the standard batch normalization formula but with these fixed running statistics: \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, followed by scaling and shifting y = \gamma \hat{x} + \beta, where \epsilon is a small constant for numerical stability, and \gamma and \beta are the learned parameters. This substitution yields consistent, batch-independent predictions, as the output depends solely on the input and the fixed parameters rather than on stochastic mini-batch variations.

A key challenge in the inference phase arises when processing small or single-sample inputs, such as in online deployment or real-time applications, where computing fresh mini-batch statistics is infeasible or unreliable due to high variance in small samples. Solutions include relying on the precomputed running averages, which approximate the full dataset distribution without requiring batch aggregation, though this can introduce minor mismatches if training batches were atypically sized. Alternatively, one can compute exact population statistics by passing the entire training dataset through the model post-training, albeit at significant computational cost, or adopt batch-independent normalization techniques like layer normalization, which normalizes across features within each individual sample rather than across a batch.

Unlike the training phase, where mini-batch statistics introduce noise that acts as a form of regularization to prevent overfitting, inference with fixed population statistics produces smoother, more deterministic predictions without this variability. This eliminates the "noise" from batch sampling, potentially leading to slightly less robust generalization in scenarios sensitive to distributional shifts, though it enhances reproducibility and efficiency in deployment.
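The split between training-time and inference-time statistics can be summarized in a minimal stateful layer. The NumPy sketch below follows the exponential-moving-average update given in the text; the class name, momentum convention, and default values are illustrative choices, and the m/(m-1) bias correction discussed above is omitted for brevity.

```python
import numpy as np

class SimpleBatchNorm:
    """Minimal 1-D batch-norm layer with running statistics (illustrative sketch)."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.momentum = momentum                  # decay applied to the previous running value
        self.eps = eps
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def __call__(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # Exponential moving averages of the mini-batch statistics.
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # Inference: fixed statistics give deterministic, batch-independent outputs.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = SimpleBatchNorm(3)
rng = np.random.default_rng(1)
for _ in range(200):                               # simulated training mini-batches
    bn(rng.normal(loc=2.0, scale=4.0, size=(32, 3)), training=True)
print(bn(rng.normal(loc=2.0, scale=4.0, size=(1, 3)), training=False))  # single sample at inference
```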

Theoretical Foundations

Loss Landscape Smoothness

Batch normalization enhances the smoothness of the loss landscape in optimization by reducing the Lipschitz constant of the gradient, ensuring that \|\nabla \mathcal{L}(\theta) - \nabla \mathcal{L}(\theta')\| \leq \beta \|\theta - \theta'\| for some \beta > 0, where \mathcal{L} is the loss function and \theta are the parameters. This property, known as \beta-smoothness, bounds how rapidly the gradient changes with respect to parameter perturbations, leading to more predictable and stable optimization dynamics. The mechanism underlying this effect arises from batch normalization's reparametrization of the optimization problem, which constrains the magnitudes of activations and effective weights, thereby limiting the sensitivity of gradients to small changes in parameters and reducing overall gradient variability during training.

Mathematically, consider a layer with output y = f(Wx), where W is the weight matrix and x is the input. Batch normalization, applied to the pre-activation values, normalizes them and introduces learnable scale and shift parameters \gamma and \beta, which effectively rescale the Hessian matrix associated with the layer's loss contribution. This rescaling reduces the condition number of the Hessian, as shown by bounding the second-order terms in the gradient computation: specifically, the normalized gradient satisfies \|\nabla_{y_j} \hat{\mathcal{L}}\|_2^2 \leq \frac{\gamma^2}{\sigma_j^2} \left( \|\nabla_{y_j} \mathcal{L}\|_2^2 - \frac{1}{m} \langle \mathbf{1}, \nabla_{y_j} \mathcal{L} \rangle^2 - \frac{1}{m} \langle \nabla_{y_j} \mathcal{L}, \hat{y}_j \rangle^2 \right), where \hat{\mathcal{L}} is the loss of the batch-normalized network, \sigma_j^2 is the batch variance of the j-th feature, and m is the batch size; a similar bound applies to the Hessian quadratic form, promoting a more well-conditioned optimization problem that facilitates efficient convergence.

Empirically, Santurkar et al. (2018) demonstrate this effect through visualizations of loss contours on deep linear networks, where batch normalization transforms elongated, ill-conditioned surfaces into rounder, more isotropic ones, reducing loss variation by up to two orders of magnitude early in training. This smoothing enables the use of substantially higher learning rates, up to 10 times larger, without divergence, accelerating convergence while maintaining stability across architectures such as VGG networks.
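One way to probe this property numerically is to estimate the local ratio \|\nabla \mathcal{L}(\theta) - \nabla \mathcal{L}(\theta + \delta)\| / \|\delta\| on a toy problem. The sketch below does this for a small synthetic one-hidden-layer network with and without a batch normalization step, using finite-difference gradients; the network, data, perturbation size, and the resulting numbers are arbitrary illustrative choices that vary with the random seed, so this shows how such an estimate can be computed rather than serving as evidence for the theoretical claims.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))                       # synthetic mini-batch of inputs
Y = rng.normal(size=(64, 1))                       # synthetic regression targets

def loss(params, use_bn, eps=1e-5):
    """MSE of a one-hidden-layer ReLU net, optionally with BN on the hidden pre-activations."""
    W1, W2, gamma, beta = params
    z = X @ W1
    if use_bn:
        z = gamma * (z - z.mean(0)) / np.sqrt(z.var(0) + eps) + beta
    h = np.maximum(z, 0.0)
    return float(np.mean((h @ W2 - Y) ** 2))

def flat_grad(params, use_bn, h=1e-5):
    """Central-difference gradient of the loss with respect to all parameters, flattened."""
    shapes = [p.shape for p in params]
    flat = np.concatenate([p.ravel() for p in params])

    def unflatten(v):
        out, i = [], 0
        for s in shapes:
            n = int(np.prod(s))
            out.append(v[i:i + n].reshape(s))
            i += n
        return out

    g = np.zeros_like(flat)
    for i in range(flat.size):
        e = np.zeros_like(flat)
        e[i] = h
        g[i] = (loss(unflatten(flat + e), use_bn) - loss(unflatten(flat - e), use_bn)) / (2 * h)
    return flat, g, unflatten

params = [rng.normal(scale=0.5, size=(4, 8)), rng.normal(scale=0.5, size=(8, 1)),
          np.ones(8), np.zeros(8)]
direction = rng.normal(size=sum(p.size for p in params))

for use_bn in (False, True):
    theta, g0, unflatten = flat_grad(params, use_bn)
    delta = 1e-2 * direction                       # small fixed random parameter perturbation
    _, g1, _ = flat_grad(unflatten(theta + delta), use_bn)
    ratio = np.linalg.norm(g1 - g0) / np.linalg.norm(delta)
    print(("with BN   " if use_bn else "without BN"), "local smoothness estimate:", round(float(ratio), 4))
```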

Covariate Shift Quantification

To quantify the extent of internal covariate shift, researchers have employed metrics that measure changes in the distributions of layer activations over the course of training. A common approach is the Kullback-Leibler (KL) divergence between the activation distributions at the initial epoch (p_1) and a later epoch k (p_k), given by D_\text{KL}(p_1 \Vert p_k) = \int p_1(x) \log \frac{p_1(x)}{p_k(x)} \, dx, which quantifies how much the distribution has shifted. This metric highlights the degree to which parameter updates alter the input distributions of subsequent layers, a core aspect of internal covariate shift. In the seminal batch normalization paper, experiments on a multi-layer network trained on MNIST demonstrated that without normalization, activation distributions undergo substantial shifts during training, as visualized in histograms of layer inputs that deviate significantly from their initial Gaussian-like form. Batch normalization stabilizes these distributions by enforcing zero mean and unit variance per mini-batch, effectively reducing the observed shift and allowing higher learning rates without divergence. Results indicated that this stabilization is particularly pronounced in early layers, where shifts are most disruptive to gradient flow.

Subsequent critiques, notably a 2018 NeurIPS study, challenged the centrality of shift reduction by showing that internal covariate shift persists even with batch normalization. Using a measure based on the difference in mean and standard deviation of layer inputs across training iterations, the study found that batch-normalized networks exhibit comparable or sometimes greater shift magnitudes than non-normalized ones, yet train substantially faster, reaching 83% accuracy on CIFAR-10 with a VGG-like ReLU network versus 80% without. This lack of correlation between measured shift and training speed suggests that internal covariate shift mitigation is not the primary mechanism behind batch normalization's effectiveness. While batch normalization mitigates internal covariate shift to some degree, the empirical evidence indicates it is not the sole source of its advantages, with complementary interpretations like loss landscape smoothness providing additional explanatory power.
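A simple way to compute such a shift measure in practice is a histogram-based estimate of the KL divergence between activation samples recorded at two points in training. The sketch below uses synthetic stand-in samples (drifted Gaussians) rather than real recorded activations; the helper name, bin count, and smoothing constant are illustrative choices.

```python
import numpy as np

def kl_divergence(samples_p, samples_q, bins=50):
    """Crude histogram estimate of D_KL(p || q) from two sets of scalar activation samples."""
    lo = min(samples_p.min(), samples_q.min())
    hi = max(samples_p.max(), samples_q.max())
    p, _ = np.histogram(samples_p, bins=bins, range=(lo, hi))
    q, _ = np.histogram(samples_q, bins=bins, range=(lo, hi))
    p = (p + 1e-9) / (p.sum() + bins * 1e-9)       # additive smoothing avoids log(0)
    q = (q + 1e-9) / (q.sum() + bins * 1e-9)
    return float(np.sum(p * np.log(p / q)))

# Stand-ins for one hidden unit's activations recorded at epoch 1 and a later epoch k.
rng = np.random.default_rng(0)
act_epoch1 = rng.normal(0.0, 1.0, size=10_000)
act_epochk_plain = rng.normal(0.8, 1.7, size=10_000)   # drifted mean and variance (no normalization)
act_epochk_bn = rng.normal(0.05, 1.05, size=10_000)    # roughly stabilized distribution (with BN)
print("estimated shift without BN:", round(kl_divergence(act_epoch1, act_epochk_plain), 3))
print("estimated shift with BN:   ", round(kl_divergence(act_epoch1, act_epochk_bn), 3))
```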

Gradient Flow Stabilization

In deep neural networks, gradients propagate backward through the chain rule, where the gradient at each layer is the product of the incoming gradient and the layer's Jacobian. This multiplicative process often results in vanishing gradients, scaling roughly as \exp(-c \cdot \text{depth}) for some constant c > 0, or in exploding gradients when the product grows uncontrollably, hindering effective training of deep architectures. Batch normalization addresses this instability by normalizing the inputs to each layer, constraining activations to have zero mean and unit variance during the forward pass. This keeps activation magnitudes in a stable range, reducing the risk of scale-induced amplification or attenuation in subsequent computations. Additionally, the learnable scaling parameter \gamma decouples the learning of activation scales from the transformation itself, allowing the network to adaptively control magnitudes without relying on initialization alone.

Mathematically, the effective Jacobian of a batch-normalized layer tends to have singular values clustered around 1, promoting balanced gradient flow across depths, in contrast to unnormalized layers where singular values can drift far from unity and cause vanishing or exploding gradients. The backward pass through batch normalization introduces a rescaling factor, approximating the layer-wise gradient as \frac{\partial L}{\partial x} \approx \left( \frac{\partial L}{\partial y} \right) \cdot \left( \frac{\partial y}{\partial x} \right) / \sigma, where \sigma is the standard deviation of the layer inputs; this division by \sigma normalizes the gradient magnitude, counteracting the cumulative effects of prior layers. Empirical evidence from deep networks demonstrates this stabilization: in 90-layer models trained on permutation-invariant MNIST, batch normalization helps prevent exponential decay of gradient norms compared to no normalization, though norms may increase with depth for certain activations; combining it with techniques like backward gradient normalization achieves flat gradient profiles across layers.

The synergy between batch normalization and skip connections in residual networks (ResNets) enhances this effect further. Skip connections enable identity mappings, allowing gradients to flow directly through the shortcut paths, while batch normalization ensures these propagated signals remain well-conditioned; together, they enable stable training of networks exceeding 100 layers without gradient pathologies.
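The forward-pass side of this argument is easy to visualize on random deep stacks: without normalization, ReLU activations shrink (or, with other initialization scales, blow up) roughly exponentially with depth, while per-feature standardization keeps their magnitude roughly constant. The following sketch measures only forward signal magnitudes, as a proxy for the scale pathologies that also afflict backpropagated gradients; the depth, width, and initialization scale are arbitrary choices.

```python
import numpy as np

def rms_per_layer(depth=50, width=256, batch=128, normalize=False, seed=0, eps=1e-5):
    """RMS activation magnitude after each random linear+ReLU layer, with optional
    per-feature standardization of the pre-activations (BN with gamma=1, beta=0)."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    rms = []
    for _ in range(depth):
        W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
        z = h @ W
        if normalize:
            z = (z - z.mean(0)) / np.sqrt(z.var(0) + eps)
        h = np.maximum(z, 0.0)
        rms.append(float(np.sqrt(np.mean(h ** 2))))
    return rms

plain = rms_per_layer(normalize=False)
normed = rms_per_layer(normalize=True)
print("RMS at layers 1 and 50, no normalization:  ", round(plain[0], 4), round(plain[-1], 10))
print("RMS at layers 1 and 50, with normalization:", round(normed[0], 4), round(normed[-1], 4))
```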

Parameter Decoupling

In standard neural networks without normalization, the weights in each layer entangle the scale (length, denoted as \|W\|) and direction (denoted as W / \|W\|) of the linear transformation, which can lead to ill-conditioned optimization problems where small changes in direction require large compensating adjustments in scale to maintain output magnitude. Batch normalization addresses this by applying normalization after the linear transformation, allowing gradient descent to primarily learn the direction through the weights W, while the learnable scale parameter \gamma absorbs and controls the overall output magnitude; as a result, the effective output scale is set by \gamma rather than by \|W\|, decoupling these aspects. For a linear layer, this can be expressed mathematically as \mathrm{BN}(Wx) \approx \gamma \left( \frac{Wx}{\|Wx\|} \right) + \beta, where \beta is the learnable shift parameter, effectively transferring the influence of \|W\| to \gamma and isolating directional learning in W. This decoupling reduces the network's sensitivity to weight initialization, as initial scale variations are normalized away, enabling more robust starting points without fine-tuned initialization schemes. It also permits aggressive updates and larger learning rates during training, as directional adjustments no longer risk explosive scale changes that could destabilize gradients. Theoretically, batch normalization improves the conditioning of the Fisher information matrix by reducing its maximum eigenvalue, which flattens the loss landscape and facilitates smoother optimization.
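A quick numerical check of this scale invariance, using only the normalization part of BN (with \gamma = 1 and \beta = 0 for simplicity), is given below; the data and weight shapes are arbitrary illustrative choices.

```python
import numpy as np

def bn(z, eps=1e-5):
    """Per-feature batch normalization of pre-activations z with gamma=1, beta=0."""
    return (z - z.mean(0)) / np.sqrt(z.var(0) + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
W = rng.normal(size=(5, 3))

out = bn(X @ W)
out_scaled = bn(X @ (10.0 * W))      # rescale the weight matrix by a factor of 10
# The normalized output is unchanged up to eps effects: it depends only on the direction of W.
print(np.max(np.abs(out - out_scaled)))
```

Because the output is insensitive to \|W\|, the loss gradient with respect to W shrinks as the weights grow and is approximately orthogonal to W, which is one way to see how control of the layer's scale is effectively delegated to \gamma.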

Convergence Guarantees

Least-Squares Optimization

In linear models, batch normalization is analyzed in the context of solving the ordinary least-squares (OLS) problem, which seeks to minimize the objective \min_W \|XW - Y\|_F^2, where X \in \mathbb{R}^{n \times d} is the input data matrix, Y \in \mathbb{R}^{n \times m} is the target matrix, and W \in \mathbb{R}^{d \times m} are the weights, with the Frobenius norm measuring the squared error across all outputs. This setup captures the essence of training a linear layer, where gradient descent iteratively updates W to reduce the loss. Without batch normalization, convergence of gradient descent can be slow when features in X exhibit varying scales or heterogeneous variances, leading to an ill-conditioned X^\top X with a large condition number \kappa = \lambda_{\max}/\lambda_{\min}, where \lambda_{\max} and \lambda_{\min} are its largest and smallest eigenvalues. This ill-conditioning results in a suboptimal convergence rate for gradient descent, often requiring careful tuning of the learning rate and many more iterations to reach a small error.

Batch normalization, applied before the linear transformation, standardizes each feature across the batch to zero mean and unit variance, effectively preprocessing X to mitigate scale differences and reduce the condition number \kappa of the effective Gram matrix. Consequently, gradient descent with batch normalization achieves linear convergence at a rate \mathcal{O}\left(1 - \frac{1}{\kappa}\right), where the reduced \kappa accelerates the process compared to the unnormalized case. The proof relies on analyzing the dynamics of gradient descent through the normalized Gram matrix H^*, which batch normalization constructs via over-parameterization and standardization, making its eigenvalues more balanced and closer to unity under feature normalization. A key theorem establishes that this yields a speedup factor proportional to the variance reduction across features, with the contraction rate improving linearly as \kappa decreases (Theorem 3.4 in Cai et al., 2019). Experiments on synthetic datasets, where features are generated with heterogeneous variances (e.g., exponentially increasing scales), demonstrate that batch normalization reduces the number of epochs to convergence by 2 to 5 times relative to standard gradient descent, while maintaining robustness to larger learning rates. This linear case provides foundational insight, with extensions to related problems like halfspace learning showing similar benefits in classification settings.
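A toy experiment in this spirit, offered as an illustration rather than a reproduction of the cited analysis, compares plain gradient descent on an ill-conditioned OLS problem against the same problem after the per-feature standardization that batch normalization would apply; the feature scales, tolerance, and 1/L step-size rule are arbitrary choices, and exact iteration counts depend on them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
scales = 2.0 ** np.arange(d)                     # exponentially increasing feature scales
X = rng.normal(size=(n, d)) * scales
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def gd_iterations(X, y, tol=1e-6, max_iters=100_000):
    """Gradient descent on (1/2n)||Xw - y||^2 with step size 1/L, where L is the largest
    eigenvalue of X^T X / n; returns the iteration at which the gradient norm drops below tol."""
    H = X.T @ X / len(y)
    lr = 1.0 / np.linalg.eigvalsh(H).max()
    w = np.zeros(X.shape[1])
    for t in range(1, max_iters + 1):
        grad = X.T @ (X @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return t
        w -= lr * grad
    return max_iters                              # did not converge within the budget

X_std = (X - X.mean(0)) / X.std(0)                # the per-feature standardization BN would apply
print("iterations to converge, raw features:         ", gd_iterations(X, y))
print("iterations to converge, standardized features:", gd_iterations(X_std, y))
```

The gap in iteration counts reflects the difference in the condition number of X^\top X before and after standardization, which is the quantity the convergence rate above depends on.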

Halfspace Learning

In the context of halfspace learning, batch normalization (BN) is analyzed for linear classifiers such as logistic regression or support vector machines (SVMs) trained on separable data, where the objective is to minimize losses like the logistic loss or hinge loss to separate the data into halfspaces. This setup typically assumes Gaussian-distributed inputs, and the optimization problem is formulated as minimizing the expected loss f(\tilde{w}) = \mathbb{E}_{y,x} [\phi(-y x^T \tilde{w})], where \tilde{w} represents the normalized weight vector and \phi is the loss function. BN normalizes the inputs to the linear layer, which decouples the optimization into length and direction components of the weights, thereby improving margin maximization during training. This normalization effect leads to faster convergence, achieving an iteration complexity of O(\log(1/\epsilon)) for strongly convex losses to reach an \epsilon-accurate solution with stochastic gradient descent (SGD). A key theoretical result from Kohler et al. (2018) demonstrates that BN provides a linear speedup in SGD by effectively bounding the step sizes, exploiting the decoupled structure to stabilize and accelerate the optimization process. Post-BN, the effective loss landscape becomes \mu-strongly convex with an L-Lipschitz continuous gradient, enabling linear convergence at a rate of 1 - \mu/L per iteration.
Specifically, under assumptions of Gaussian inputs and separable data, the gradient norm satisfies
\|\nabla_{\tilde{w}} f(\tilde{w}_{T_d})\| \leq (1 - \mu/L)^{2T_d} \Phi^2 (\rho(w_0) - \rho^*) + \text{error term},
where T_d is the number of iterations for the direction update, \Phi bounds the initial margin, and \rho measures the margin. This result holds for the direction optimization phase after length decoupling via BN.
Without BN, high-variance features in the input distribution can disproportionately slow down margin growth, as the optimization struggles with ill-conditioned landscapes. In contrast, BN equalizes feature variances through normalization, promoting balanced updates and mitigating these slowdowns to achieve the accelerated rates. Some of the derivations in this analysis use least-squares objectives as a proxy, via quadratic approximations of the classification losses.
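To illustrate the qualitative claim (not the formal construction of Kohler et al.), the sketch below trains a logistic-loss linear classifier with plain gradient descent on separable synthetic data in which one feature has a much larger variance, and compares the normalized margin reached with and without equalizing the feature variances; the data-generating choices and the 1/L step-size rule are arbitrary assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
u = np.ones(d) / np.sqrt(d)                           # true separating direction
scales = np.array([1.0, 1.0, 1.0, 1.0, 50.0])         # one high-variance feature
Z = rng.normal(size=(n, d))
y = np.where(Z @ u >= 0, 1.0, -1.0)
Z = Z + 0.5 * y[:, None] * u                          # enforce a margin of 0.5 along u
X = Z * scales                                        # raw, badly scaled features

def min_margin_after_gd(X, y, steps=2000):
    """Gradient descent on the logistic loss with step size 1/L (L from the 0.25 * lambda_max
    smoothness bound); returns the minimum normalized margin min_i y_i x_i^T w / ||w||."""
    L = 0.25 * np.linalg.eigvalsh(X.T @ X / len(y)).max()
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        m = np.clip(y * (X @ w), -500.0, 500.0)       # clip to keep exp() finite
        grad = -(X * (y / (1.0 + np.exp(m)))[:, None]).mean(axis=0)
        w -= (1.0 / L) * grad
    return float(np.min(y * (X @ w)) / (np.linalg.norm(w) + 1e-12))

X_eq = X / X.std(axis=0)                              # variance equalization, as BN would perform
print("min normalized margin, raw features:      ", round(min_margin_after_gd(X, y), 4))
print("min normalized margin, equalized features:", round(min_margin_after_gd(X_eq, y), 4))
```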

Deep Neural Network Training

In overparameterized deep neural networks, batch normalization (BN) has been studied in the neural tangent kernel (NTK) regime, where sufficiently wide networks behave like kernel methods during training with stochastic gradient descent (SGD). This regime assumes infinite width, transforming the nonlinear training dynamics into a linear system governed by the NTK, whose eigenvalues determine the conditioning of the optimization landscape. BN can affect these eigenvalues by altering the network's dynamical regimes, such as promoting chaotic behavior that influences trainability, as analyzed in studies of normalization effects on the NTK spectrum. Theoretical analyses in the NTK regime suggest that BN contributes to improved optimization landscapes in certain settings, with extensions from linear models indicating potential benefits in stabilizing gradient flow and reducing spectral bias in specific architectures. These effects are derived primarily under the infinite-width assumption; finite-width networks approximate the behavior but can deviate due to feature learning beyond the lazy-training regime. Empirical evidence on benchmarks like CIFAR-10 supports faster training with BN, where networks reach 90% test accuracy in under 100 epochs, compared to unnormalized models requiring more iterations for similar performance.

References

  1. [1]
    Batch Normalization: Accelerating Deep Network Training by ... - arXiv
    Feb 11, 2015 · Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases ...
  2. [2]
    [1805.11604] How Does Batch Normalization Help Optimization?
    May 29, 2018 · Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs).
  3. [3]
    How could I use batch normalization in TensorFlow? - Stack Overflow
    Nov 27, 2015 · Update July 2016: The easiest way to use batch normalization in TensorFlow is through the higher-level interfaces provided in either ...
  5. [5]
    ICML Test Of Time Batch Normalization: Accelerating Deep Network ...
    Batch Normalization accelerates deep network training by reducing internal covariate shift.
  6. [6]
    [1806.02375] Understanding Batch Normalization - arXiv
    Jun 1, 2018 · Abstract: Batch normalization (BN) is a technique to normalize activations in intermediate layers of deep neural networks.
  7. [7]
    [PDF] Batch Normalization: Accelerating Deep Network Training by ... - arXiv
    Mar 2, 2015 · We propose a new mechanism, which we call Batch Normalization, that takes a step towards re- ducing internal covariate shift, and in doing so ...
  8. [8]
    [PDF] Backward Gradient Normalization in Deep Neural Networks - arXiv
    Jun 17, 2021 · Batch normalization affects δ(K) by increasing the gradient norm with the layer depth. This effect is more pronounced for ReLU and tanh ...
  9. [9]
    Batch normalization and skip connections stabilizing gradient flow in deep ResNets.
  10. [10]
    [PDF] Analysis on Gradient Propagation in Batch Normalized Residual ...
    Dec 2, 2018 · We conduct mathematical analysis on the effect of batch normalization (BN) on gradient backpropogation in residual network training, which is ...
  11. [11]
    [1805.10694] Exponential convergence rates for Batch Normalization
    May 27, 2018 · Batch Normalization accelerates optimization by splitting the task into optimizing length and direction of parameters separately.
  12. [12]
    [1903.02606] Mean-field Analysis of Batch Normalization - arXiv
    Mar 6, 2019 · We show that it has a flattening effect on the loss landscape, as quantified by the maximum eigenvalue of the Fisher Information Matrix. These ...
  13. [13]
    [PDF] A Quantitative Analysis of the Effect of Batch Normalization on ...
    In this paper, we provide such an analysis on the simple problem of ordinary least squares (OLS), where the precise dynamical properties of gradient descent ...
  14. [14]
    [PDF] Batch Normalization Alleviates the Spectral Bias in Coordinate ...
    Batch normalization (BN) reduces the maximum and variance of NTK's eigenvalues, shifting their distribution, thus alleviating spectral bias in coordinate ...
  15. [15]
    Order and Chaos: NTK views on DNN Normalization, Checkerboard ...
    Jul 11, 2019 · We observe a similar effect for Batch Normalization (BN) applied after the last nonlinearity. We uncover the same order and chaos modes in ...
  16. [16]
    A Quantitative Analysis of the Effect of Batch Normalization on ...
    Sep 29, 2018 · In this paper, we provide such an analysis on the simple problem of ordinary least squares (OLS).