
Stochastic gradient descent

Stochastic gradient descent (SGD) is an optimization algorithm used to minimize an objective function, particularly in machine learning, by iteratively updating model parameters in the direction of a negative estimate of the gradient, where the estimate is computed using a randomly selected subset of the data rather than the full dataset, enabling efficient processing of large-scale problems. This approach contrasts with batch gradient descent, which uses the entire dataset for each update, by introducing stochasticity that reduces computational cost while adding noise that can help escape local minima. The foundational principles of SGD trace back to stochastic approximation methods introduced by Herbert Robbins and Sutton Monro in 1951, who proposed an iterative procedure to find roots of equations under noisy observations, laying the groundwork for handling stochastic gradients in optimization. While the method was initially developed for general statistical estimation, it gained prominence in machine learning during the resurgence of neural networks, when full-batch gradient computation became impractical for deep architectures. In modern applications, SGD and its variants, such as mini-batch SGD, which balances noise and efficiency by using small batches of data, are the workhorses for training machine learning models, offering faster convergence on massive datasets compared to deterministic methods, though they require careful tuning of learning rates to manage variance in gradient estimates. Key advantages include scalability to large datasets and implicit regularization effects that improve generalization, making SGD indispensable in fields like deep learning and natural language processing.

Fundamentals

Definition and Motivation

Stochastic gradient descent (SGD) is an iterative optimization algorithm that approximates the true gradient of an objective function by using gradients computed from individual samples or small subsets called mini-batches, enabling efficient updates in the opposite direction of the estimated gradient. This technique, foundational to SGD, was first formalized by Robbins and Monro in their 1951 work on solving root-finding equations where exact function evaluations are infeasible. In machine learning contexts, SGD targets empirical risk minimization, where the objective is the average loss over a finite dataset that approximates the expected risk under the true data distribution. The motivation for SGD arises primarily from its scalability to large datasets and high-dimensional parameter spaces, where full-batch gradient descent becomes computationally intractable due to the sheer volume of data. By processing only a fraction of the data per iteration, SGD drastically reduces per-step computational cost while still progressing toward the optimum. Furthermore, the inherent noise in these gradient estimates introduces variability that acts as an implicit form of regularization, promoting exploration of the loss landscape and aiding escape from suboptimal local minima. Fundamentally, SGD embodies a trade-off in estimation quality: single-sample updates yield unbiased but high-variance approximations that accelerate training but introduce significant fluctuations, whereas mini-batches mitigate this variance, yielding more stable progress, at the expense of increased computation per step. This balance allows SGD to outperform exact methods like batch gradient descent in practice for massive problems, where the latter's precision comes at an unaffordable scaling cost.
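This trade-off can be checked numerically. The following sketch (illustrative only; the synthetic regression data, batch sizes, and repetition count are arbitrary choices) compares mini-batch gradient estimates of a squared-error loss against the full gradient, showing that the estimates are essentially unbiased while their variance shrinks roughly in proportion to 1/b:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def full_gradient(w):
    """Gradient of the mean squared error over the entire dataset."""
    return 2 * X.T @ (X @ w - y) / n

def minibatch_gradient(w, b):
    """Gradient estimate from a random mini-batch of size b."""
    idx = rng.choice(n, size=b, replace=False)
    return 2 * X[idx].T @ (X[idx] @ w - y[idx]) / b

w = np.zeros(d)
g_full = full_gradient(w)
for b in (1, 16, 256):
    draws = np.stack([minibatch_gradient(w, b) for _ in range(2000)])
    print(f"b={b:4d}  |mean - full grad|={np.linalg.norm(draws.mean(0) - g_full):.3f}  "
          f"total variance={draws.var(0).sum():.3f}")
```
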

Comparison to Batch Gradient Descent

Batch gradient descent (BGD) computes the gradient of the objective function using the entire training dataset at each iteration, resulting in smooth and stable updates but at a high computational cost, especially for large datasets. The update rule for BGD is given by \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla Q(\mathbf{w}_t), where Q(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n Q_i(\mathbf{w}) is the average loss over n samples, \eta is the learning rate, and \nabla Q(\mathbf{w}_t) is the full gradient. This approach ensures precise gradient estimates, leading to reliable convergence toward a global minimum in convex problems or toward local minima in non-convex ones. In contrast, stochastic gradient descent (SGD) approximates the gradient using a single sample or a small mini-batch, yielding the update \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla Q_i(\mathbf{w}_t) for a randomly selected i, which introduces noise and variability in the optimization path. While each iteration in SGD is computationally inexpensive (requiring only constant time per update rather than time proportional to the dataset size), the stochasticity results in a noisier trajectory, often necessitating more iterations to reach convergence compared to BGD's smoother descent. This makes SGD particularly suitable for large-scale problems where processing the full dataset repeatedly is infeasible, as it enables online learning and scalability.
| Aspect | Stochastic Gradient Descent (SGD) | Batch Gradient Descent (BGD) |
|---|---|---|
| Computational Cost | Low per iteration (O(1), or O(b) for mini-batch size b); scales well to massive datasets. | High per iteration (O(n) for dataset size n); inefficient for large n. |
| Convergence Behavior | Noisy path with high variance; slower asymptotic rate but can escape local minima. | Smooth, stable convergence; faster on small datasets but prone to getting stuck in local minima. |
| Scalability | Excellent for large datasets; supports streaming data. | Poor for large datasets due to repeated full passes. |
| Generalization | Often better generalization, with noise acting as implicit regularization. | More prone to overfitting on small datasets; stable but less exploratory. |
| Suitability | Ideal for big data and non-convex problems like deep learning. | Best for small datasets or when precise gradients are critical. |

Empirically, SGD typically demands more iterations than BGD to achieve comparable accuracy but incurs lower overall computational expense for large datasets, as demonstrated in benchmarks on tasks like text classification where SGD completed in seconds versus hours for BGD equivalents. For instance, on the RCV1 dataset, SGD variants achieved near-optimal performance after a single pass, highlighting their efficiency in high-dimensional, large-scale settings.
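
As a minimal, concrete illustration of the two update rules compared above (the synthetic data, learning rate, and dimensions are placeholders, not taken from the cited benchmarks), the following sketch performs one BGD step, whose cost grows with the dataset size, and one single-sample SGD step, whose cost does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10_000, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
eta = 0.01
w = np.zeros(d)

# One BGD step: average gradient over all n samples, cost O(n d).
grad_full = 2 * X.T @ (X @ w - y) / n
w_bgd = w - eta * grad_full

# One SGD step: gradient of the loss on a single random sample, cost O(d).
i = rng.integers(n)
grad_i = 2 * (X[i] @ w - y[i]) * X[i]
w_sgd = w - eta * grad_i
```
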

Mathematical Formulation

Objective Function and Parameter Updates

In stochastic optimization, the objective of stochastic gradient descent (SGD) is to minimize a smooth objective function Q(\mathbf{w}), where \mathbf{w} represents the parameters of a model. This function is typically defined as the expected loss over a distribution of data points: Q(\mathbf{w}) = \mathbb{E}_{\mathbf{z} \sim \mathcal{D}} [Q(\mathbf{z}, \mathbf{w})], where Q(\mathbf{z}, \mathbf{w}) is the loss incurred by the model on a single data point \mathbf{z} = (\mathbf{x}, y) drawn from the data distribution \mathcal{D}. In machine learning applications with a finite dataset of n samples, this expectation is approximated by the empirical risk: Q(\mathbf{w}) \approx \frac{1}{n} \sum_{i=1}^n Q(\mathbf{z}_i, \mathbf{w}). The core update rule of SGD iteratively adjusts the parameters using a stochastic estimate of the gradient. At each iteration t, the parameters are updated as \mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \tilde{\mathbf{g}}_t, where \eta_t > 0 is the learning rate (or step size) at step t, and \tilde{\mathbf{g}}_t is an unbiased estimate of the true gradient \nabla Q(\mathbf{w}_t). This estimate \tilde{\mathbf{g}}_t is obtained by randomly selecting a single data point \mathbf{z}_t from the dataset and computing \tilde{\mathbf{g}}_t = \nabla_{\mathbf{w}} Q(\mathbf{z}_t, \mathbf{w}_t), ensuring that \mathbb{E}[\tilde{\mathbf{g}}_t \mid \mathbf{w}_t] = \nabla Q(\mathbf{w}_t). The method originates from stochastic approximation techniques, which generalize gradient-based updates to noisy observations. To reduce the variance of the gradient estimate and improve convergence, SGD is often extended to use mini-batches. In this variant, a random subset of b data points \{\mathbf{z}_{t,j}\}_{j=1}^b is selected at iteration t, and the stochastic gradient is the average \tilde{\mathbf{g}}_t = \frac{1}{b} \sum_{j=1}^b \nabla_{\mathbf{w}} Q(\mathbf{z}_{t,j}, \mathbf{w}_t), which remains an unbiased estimate of \nabla Q(\mathbf{w}_t) while providing lower variance than single-sample updates. The choice of b trades off computational efficiency against noise reduction, with b = 1 recovering basic SGD and b = n approaching batch gradient descent. SGD assumes that the objective function Q(\mathbf{w}) is differentiable, allowing gradient computations, and that the stochastic gradients are unbiased with respect to the current parameters. These assumptions hold for both convex and non-convex objectives, though the convergence behavior differs accordingly; for instance, convex problems permit stronger theoretical guarantees on reaching the minimum, while non-convex cases, common in deep learning, focus on finding good local minima.
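
A minimal sketch of the mini-batch update rule described above, assuming a generic per-sample gradient function (the squared-error loss, step-size schedule, and synthetic data are illustrative choices, not prescribed by the formulation):

```python
import numpy as np

def sgd_step(w, samples, grad_fn, eta, batch_size, rng):
    """One mini-batch SGD update: w <- w - eta * average of per-sample gradients."""
    idx = rng.choice(len(samples), size=batch_size, replace=False)
    g_tilde = np.mean([grad_fn(samples[i], w) for i in idx], axis=0)  # unbiased gradient estimate
    return w - eta * g_tilde

# Per-sample squared-error loss Q(z, w) = (y - w.x)^2 and its gradient (illustrative choice).
def grad_sq(z, w):
    x, y = z
    return -2.0 * (y - w @ x) * x

rng = np.random.default_rng(0)
samples = [(x, x @ np.array([2.0, -1.0]) + 0.05 * rng.normal()) for x in rng.normal(size=(500, 2))]
w = np.zeros(2)
for t in range(1, 201):
    w = sgd_step(w, samples, grad_sq, eta=0.1 / np.sqrt(t), batch_size=8, rng=rng)
```
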

Convergence Guarantees

The foundational conditions for the convergence of stochastic gradient descent (SGD), viewed as a stochastic approximation method, were established by Robbins and Monro, requiring that the learning rates \eta_t satisfy \sum_{t=1}^\infty \eta_t = \infty and \sum_{t=1}^\infty \eta_t^2 < \infty, along with unbiased stochastic gradient estimates having bounded variance. These conditions ensure that the step sizes diminish sufficiently slowly to approach the optimum while preventing excessive accumulation of variance from the noise in gradient estimates. A key result supporting almost sure convergence in SGD is the Robbins-Siegmund theorem, which applies to non-negative almost supermartingales and guarantees that, under suitable assumptions on the objective function and noise, the parameter sequence converges almost surely to a point where a Lyapunov function, such as the distance to the optimum, is minimized. This theorem provides a general framework for proving convergence in stochastic approximation settings, including SGD, by leveraging supermartingale properties to bound the behavior of the iterates. For convex objective functions, SGD achieves a convergence rate of O(1/\sqrt{T}) in expectation for the function value error after T iterations, assuming bounded gradients and variance, with the rate holding for the average iterate or a Polyak-Juditsky averaged version. In the strongly convex case, the rate improves to O(1/T) for the expected excess loss, provided the learning rate decays appropriately and the strong convexity parameter is positive. Using mini-batches of size b in SGD reduces the gradient variance by a factor of approximately 1/b, leading to improved constants in these rates or effective acceleration in practice without altering the asymptotic order. In non-convex settings, where the objective may have multiple local minima, SGD converges in expectation to a stationary point, with the expected squared gradient norm bounded by O(1/\sqrt{T}) under assumptions of Lipschitz smoothness and bounded variance of the stochastic gradients. This result highlights SGD's robustness for deep learning applications, though it does not guarantee global optimality.
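
As a standard worked example (a textbook family of schedules rather than one prescribed by the results above), the polynomial decay \eta_t = \eta_0 t^{-\alpha} with \frac{1}{2} < \alpha \le 1 satisfies both Robbins-Monro conditions: the series \sum_{t=1}^\infty t^{-\alpha} diverges because \alpha \le 1, while \sum_{t=1}^\infty t^{-2\alpha} converges because 2\alpha > 1. The choice \alpha = 1 recovers the classical \eta_t = \eta_0 / t schedule often used in strongly convex analyses, while \alpha = 1/2 is excluded only marginally, since \sum_t 1/t diverges.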

Algorithm Implementation

Step-by-Step Process

The stochastic gradient descent (SGD) algorithm proceeds iteratively by approximating the true gradient through subsampling, enabling efficient optimization over large datasets. This process begins with the initialization of model parameters and continues through repeated cycles of gradient estimation and parameter updates, where each update uses a noisy but unbiased estimate of the gradient to move toward a minimizer of the objective function. The core algorithm can be expressed in pseudocode as follows:
Initialize parameters w_0
For t = 1 to T:
    Sample a mini-batch B_t from the training data
    Compute the stochastic gradient g_t = (1 / |B_t|) ∑_{i ∈ B_t} ∇Q_i(w_{t-1})
    Update w_t = w_{t-1} - η_t g_t
Here, w_0 is the initial parameter vector, often set randomly or to zero depending on the problem; B_t is a randomly selected subset of training examples (with batch size typically between 1 and the full dataset size); Q_i(w) denotes the loss for the i-th example; and \eta_t > 0 is the step size at iteration t. The stochastic gradient g_t approximates the full gradient of the objective, providing an efficient but noisy direction of descent. In practice, the iteration cycle is structured around epochs, where one epoch consists of processing the entire training dataset once through multiple mini-batches. To enhance randomness and avoid correlated updates, the training data is shuffled at the beginning of each epoch before sampling mini-batches sequentially or randomly within the epoch. The total number of iterations T is often determined by a fixed number of epochs, such as 10 to 100, depending on convergence behavior. Early stopping serves as a criterion to halt the process when further iterations yield diminishing improvements, typically assessed after each epoch by evaluating performance on a validation set. This prevents overfitting and ensures the model generalizes well without exhaustive computation. To address potential non-convergence, such as when the algorithm stalls in a region of slow progress, the training loss is monitored across iterations; if the loss exhibits prolonged plateaus (e.g., no significant decrease over several epochs), the process may be interrupted to avoid unnecessary computation.
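
The pseudocode and the practical considerations above can be combined into a compact implementation. The following sketch (NumPy-based; the validation split, decay schedule, and patience threshold are illustrative assumptions rather than prescribed values) implements mini-batch SGD with per-epoch shuffling, a decaying step size, and early stopping on a held-out validation set:

```python
import numpy as np

def sgd_train(X, y, grad_fn, loss_fn, eta0=0.1, batch_size=32,
              max_epochs=100, patience=5, val_frac=0.2, seed=0):
    """Mini-batch SGD with per-epoch shuffling and early stopping on a validation set."""
    rng = np.random.default_rng(seed)
    n_val = int(len(X) * val_frac)
    X_val, y_val, X_tr, y_tr = X[:n_val], y[:n_val], X[n_val:], y[n_val:]

    w = np.zeros(X.shape[1])                       # simple zero initialization
    best_w, best_val, bad_epochs, t = w.copy(), np.inf, 0, 0
    for epoch in range(max_epochs):
        perm = rng.permutation(len(X_tr))          # reshuffle once per epoch
        for start in range(0, len(X_tr), batch_size):
            idx = perm[start:start + batch_size]
            t += 1
            w -= (eta0 / np.sqrt(t)) * grad_fn(X_tr[idx], y_tr[idx], w)
        val = loss_fn(X_val, y_val, w)             # validation check after each epoch
        if val < best_val - 1e-6:
            best_val, best_w, bad_epochs = val, w.copy(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:             # early stopping on a plateau
                break
    return best_w

# Example plug-ins: mean-squared-error gradient and loss on a mini-batch.
mse_grad = lambda Xb, yb, w: 2 * Xb.T @ (Xb @ w - yb) / len(yb)
mse_loss = lambda Xb, yb, w: np.mean((Xb @ w - yb) ** 2)
```
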

Hyperparameter Selection

Hyperparameter selection in stochastic gradient descent (SGD) is crucial for achieving efficient convergence and optimal model performance, as inappropriate choices can lead to slow training, instability, or suboptimal solutions. The primary hyperparameters include the learning rate, batch size, number of epochs, and weight initialization scheme, each influencing the algorithm's behavior in distinct ways. The learning rate \eta determines the magnitude of parameter updates and must be carefully chosen to balance rapid progress toward the minimum and avoidance of overshooting. A constant learning rate may suffice for simple problems but often leads to oscillations or stagnation in complex landscapes; instead, decaying schedules are commonly employed to ensure diminishing step sizes over time, promoting convergence. For instance, a popular schedule is \eta_t = \frac{\eta_0}{\sqrt{t}}, where \eta_0 is the initial rate and t is the iteration number, which adapts to increasing gradient accuracy as training progresses. Tuning typically involves grid search over a range of values (e.g., 0.001 to 0.1) or validation-based methods, where performance on a held-out set guides selection. Batch size b governs the number of samples used per gradient estimate, trading off between stochastic noise, which can enhance generalization by acting as implicit regularization, and computational efficiency, as larger batches reduce variance but increase per-update cost. Small batches (e.g., b = 1) introduce high variance in gradients, potentially escaping sharp minima but risking erratic paths; larger ones (e.g., b > 1000) yield more stable but less exploratory updates, often requiring adjusted learning rates to maintain progress. Typical ranges in practice span 1 to 256, with the optimal size depending on the gradient noise scale, which predicts the critical batch size beyond which efficiency plateaus. Additional hyperparameters include the number of epochs, defined as complete passes through the training dataset, which controls total exposure to data; too few epochs may underfit, while too many can overfit, so the count is typically tuned to 10–100 based on validation monitoring. Weight initialization, such as Xavier (or Glorot) initialization, draws initial parameters from a uniform distribution U\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right] to preserve gradient variance across layers, preventing vanishing or exploding gradients and facilitating smoother SGD convergence without extensive rate adjustments. Diagnostics like learning curves, which plot training loss against iterations, together with validation loss are essential for tuning, revealing issues such as divergence (learning rate too high) or saturation (learning rate too low or batches too small). By comparing these curves, practitioners iteratively refine hyperparameters to minimize validation error while avoiding overfitting.
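
As a small illustration of two of these choices (the layer sizes and initial rate below are placeholders), the following sketch implements the \eta_t = \eta_0 / \sqrt{t} schedule and the Xavier/Glorot uniform initialization described above:

```python
import numpy as np

def lr_schedule(eta0, t):
    """Decaying learning rate eta_t = eta0 / sqrt(t), as described above."""
    return eta0 / np.sqrt(t)

def xavier_uniform(n_in, n_out, rng):
    """Xavier/Glorot uniform initialization U[-a, a] with a = sqrt(6 / (n_in + n_out))."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

rng = np.random.default_rng(0)
W1 = xavier_uniform(784, 256, rng)       # layer sizes are placeholders
for t in (1, 10, 100, 1000):
    print(t, lr_schedule(0.1, t))
```
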

Illustrative Example

Linear Regression Setup

Linear regression serves as a foundational example for illustrating optimization techniques like stochastic gradient descent, where the goal is to find parameters that best fit a linear model to observed data. In the simplest univariate case, the model posits that the response variable y is approximately a linear function of the predictor x, expressed as y \approx w_0 + w_1 x, where w = [w_0, w_1]^T represents the parameter vector consisting of the intercept w_0 and slope w_1. Given a dataset of n independent and identically distributed (IID) samples \{(x_i, y_i)\}_{i=1}^n, the problem is formulated as minimizing the mean squared error objective function
Q(w) = \frac{1}{n} \sum_{i=1}^n (y_i - w^T x_i)^2,

where each input is augmented as $x_i = [1, x_i]^T$ to include the intercept term.[](https://www.cns.nyu.edu/~eero/math-tools01/leastSquares.pdf)[](http://www.cs.toronto.edu/~mbrubake/teaching/C11/Handouts/LinearRegression.pdf)[](https://pages.cs.wisc.edu/~jerryzhu/cs731/regression.pdf) The IID assumption on the samples ensures that the errors $\epsilon_i = y_i - w^T x_i$ are independent across observations, which is crucial for the validity of [statistical inference](/page/Statistical_inference) and optimization in this setting.[](https://www.mcw.edu/-/media/MCW/Departments/Biostatistics/commonerrorsinlinearregression11912.pdf?la=en&hash=432B8FCCEF25285B9A07647615E55D50) This setup can be underparameterized when $n > 2$ (more data points than parameters, leading to a unique minimum) or overparameterized when $n < 2$ (fewer data points, resulting in multiple solutions), though practical applications typically assume sufficient data for identifiability.[](https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote08.html)[](https://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf)

For comparison, the ordinary least squares (OLS) method provides a closed-form solution to this quadratic optimization problem, given by w = (X^T X)^{-1} X^T y, where $X$ is the $n \times 2$ design matrix with rows $x_i^T$ and $y$ is the vector of responses, assuming $X^T X$ is invertible.[](https://pages.cs.wisc.edu/~jerryzhu/cs731/regression.pdf)[](https://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf)[](http://users.ece.cmu.edu/~asaluja/lms.pdf) This exact solution highlights the convexity of $Q(w)$, as the Hessian is positive semi-definite, enabling efficient alternatives like [stochastic gradient descent](/page/Stochastic_gradient_descent) for large-scale problems.[](https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote08.html) The gradient of the per-sample loss term $Q_i(w) = (y_i - w^T x_i)^2$ is \nabla Q_i(w) = -2 (y_i - w^T x_i) x_i, which forms the basis for iterative updates in gradient-based methods.[](https://www.cs.toronto.edu/~rgrosse/courses/csc311_f20/readings/notes_on_linear_regression.pdf)[](https://web.stanford.edu/class/archive/cs/cs109/cs109.1244/lectures/24_gradient_ascent_annotated.pdf)[](https://www.cs.cmu.edu/~mgormley/courses/10601-f21/slides/lecture8-opt.pdf)

Applying SGD to Linear Regression

In linear regression, stochastic gradient descent (SGD) applies the general iterative update process by selecting a single training example at each step to compute the gradient approximation.[](https://tongzhang-ml.org/papers/icml04-stograd.pdf) The update rule for the weight vector $\mathbf{w}$ using the squared error loss without the $\frac{1}{2}$ factor is given by \mathbf{w}_{t+1} = \mathbf{w}_t + 2 \eta (y_i - \mathbf{w}_t^T \mathbf{x}_i) \mathbf{x}_i, where $\eta$ is the learning rate, $(\mathbf{x}_i, y_i)$ is the randomly selected sample, and the term $(y_i - \mathbf{w}_t^T \mathbf{x}_i)$ represents the residual error for that sample.[](https://tongzhang-ml.org/papers/icml04-stograd.pdf) This single-sample update contrasts with batch methods by introducing stochasticity, which approximates the full gradient through noisy but unbiased estimates.[](https://tongzhang-ml.org/papers/icml04-stograd.pdf)

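A minimal sketch of this single-sample update on synthetic univariate data, compared against the closed-form OLS solution from the setup above (the true coefficients, noise level, learning rate, and epoch count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)
y = 3.0 + 2.0 * x + 0.2 * rng.normal(size=n)      # true intercept and slope are placeholders
X = np.column_stack([np.ones(n), x])               # augmented inputs [1, x_i]

w_ols = np.linalg.solve(X.T @ X, X.T @ y)          # closed-form OLS solution for comparison

w, eta = np.zeros(2), 0.05
for epoch in range(50):
    for i in rng.permutation(n):
        w += 2 * eta * (y[i] - w @ X[i]) * X[i]    # single-sample SGD update from above
print(w_ols, w)                                     # the SGD iterate oscillates around w_ols
```
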
Simulations of SGD on linear regression tasks illustrate the trajectory of $\mathbf{w}_t$ over multiple epochs, often showing a jagged path that oscillates around the optimal solution due to the variance in per-sample gradients.[](https://tongzhang-ml.org/papers/icml04-stograd.pdf) In comparison, the trajectory of batch gradient descent (GD) follows a smoother, more direct path toward convergence, but SGD typically exhibits faster initial progress, reaching lower error levels in the early iterations before the noise accumulates.[](https://tongzhang-ml.org/papers/icml04-stograd.pdf) For instance, in experiments on large-scale linear prediction problems, SGD's path demonstrates quicker reductions in excess risk during the first 50-100 passes through the data, after which batch GD may catch up if not properly tuned.[](https://tongzhang-ml.org/papers/icml04-stograd.pdf)

The stochastic noise in SGD leads to oscillations in the parameter estimates, which can prevent exact convergence to the minimum but enable efficient exploration of the loss surface.[](https://tongzhang-ml.org/papers/icml04-stograd.pdf) These oscillations result in faster initial progress compared to batch GD, as the noise acts like an implicit regularization that avoids slow traversal of flat regions.[](https://tongzhang-ml.org/papers/icml04-stograd.pdf) To enhance stability, averaging the final iterates, such as taking the mean of $\mathbf{w}_t$ over the last several epochs, reduces variance and yields estimates closer to the batch optimum, often improving mean squared error by 10-20% in simulated linear regression settings.[](https://tongzhang-ml.org/papers/icml04-stograd.pdf)

SGD's design makes it particularly efficient for [linear regression](/page/linear_regression) on large datasets with $n$ samples, as each update requires only $O(d)$ time (where $d$ is the feature dimension) rather than $O(nd)$ for full-batch computations, enabling single-pass processing through massive data.[](https://tongzhang-ml.org/papers/icml04-stograd.pdf) Regarding estimation properties, SGD produces unbiased but high-variance parameter estimates compared to [batch GD](/page/batch_gradient_descent), introducing a bias-variance trade-off where the variance decreases with more epochs or averaging, while the method's stochasticity can lower bias in overparameterized settings by promoting generalization.[](https://tongzhang-ml.org/papers/icml04-stograd.pdf)

Historical Development

Stochastic Approximation Origins

The origins of stochastic gradient descent trace back to the field of stochastic approximation, pioneered in the mid-20th century for solving root-finding problems under noisy observations.

In their seminal 1951 paper, [Herbert Robbins](/page/Herbert_Robbins) and [Sutton Monro](/page/Sutton_Monro) introduced a method to iteratively estimate the root $x_0$ of an equation $M(x) = a$, where $M(x)$ represents the expected value of a random variable $Y(x)$ whose distribution is unknown, and only noisy realizations of $Y(x)$ are observable through sequential experiments.[](https://www.columbia.edu/~ww2040/8100F16/RM51.pdf) This approach addressed challenges in settings where direct computation of $M(x)$ was infeasible, such as in statistical estimation with inherent randomness.[](https://www.columbia.edu/~ww2040/8100F16/RM51.pdf)

The core innovation was an iterative update rule for stochastic systems, formulated as $\theta_{n+1} = \theta_n - a_n Y_n(\theta_n)$, where $\theta_n$ is the parameter estimate at step $n$, $a_n > 0$ is a step size sequence satisfying $\sum a_n = \infty$ and $\sum a_n^2 < \infty$, and $Y_n(\theta_n)$ provides a noisy estimate of the direction toward the root.[](https://www.columbia.edu/~ww2040/8100F16/RM51.pdf) Under assumptions like the monotonicity of $M(x)$ and bounded variance of $Y(x)$, the sequence $\theta_n$ converges in probability to the true root $x_0$.[](https://www.columbia.edu/~ww2040/8100F16/RM51.pdf) Almost sure convergence guarantees were later established using the Robbins–Siegmund lemma.[](https://www.sciencedirect.com/science/article/pii/B9780126045505500158)

Early applications of stochastic approximation extended to signal processing and control theory, where it facilitated adaptive estimation in noisy environments, such as quantile estimation from response data in bioassays or sensitivity analysis in dynamic systems.[](https://www.columbia.edu/~ww2040/8100F16/RM51.pdf)[](https://projecteuclid.org/journals/annals-of-statistics/volume-31/issue-2/Stochastic-approximation-invited-paper/10.1214/aos/1051027873.full) These uses highlighted its utility for recursive algorithms in optimization and adaptation without requiring full knowledge of underlying models.[](https://projecteuclid.org/journals/annals-of-statistics/volume-31/issue-2/Stochastic-approximation-invited-paper/10.1214/aos/1051027873.full) However, the method had notable limitations, including the necessity of diminishing step sizes $a_n$ to ensure convergence, which could slow practical implementation, and a lack of direct connection to machine learning contexts at the time.[](https://www.columbia.edu/~ww2040/8100F16/RM51.pdf)

Integration into Machine Learning

The integration of stochastic gradient descent (SGD) into machine learning began in the late 1950s with Frank Rosenblatt's development of the perceptron algorithm, an early form of SGD applied to binary classification tasks. In his 1958 paper, Rosenblatt introduced the perceptron as a probabilistic model for pattern recognition, where weights are updated incrementally based on classification errors using a stochastic approximation of the gradient. This approach, detailed further in his 1962 book *Principles of Neurodynamics*, enabled online learning on individual examples, marking SGD's initial adaptation from general stochastic approximation methods to neural network training.[](https://www.ling.upenn.edu/courses/cogs501/Rosenblatt1958.pdf)[](https://gwern.net/doc/ai/nn/1962-rosenblatt-principlesofneurodynamics.pdf)

By the 1980s, SGD gained renewed prominence through its combination with the [backpropagation](/page/Backpropagation) algorithm for training multilayer neural networks.

Rumelhart, Hinton, and Williams' 1986 paper described [backpropagation](/page/Backpropagation) as a method to compute gradients efficiently across layers, with SGD serving as the core optimization technique to adjust weights iteratively on mini-batches of data. This pairing addressed the computational challenges of earlier perceptron-like models and facilitated the training of more complex architectures, laying foundational groundwork for connectionist approaches in [machine learning](/page/Machine_learning).[](https://www.nature.com/articles/323533a0)

In the 1990s and 2000s, Léon Bottou advanced SGD's role in large-scale [machine learning](/page/Machine_learning), emphasizing its efficiency for high-dimensional problems like support vector machines (SVMs) and [logistic regression](/page/Logistic_regression). Bottou's 1998 work on [online learning](/page/Online_learning) formalized SGD as a practical [stochastic approximation](/page/Stochastic_approximation) technique for [convex optimization](/page/Convex_optimization) in [supervised learning](/page/Supervised_learning), enabling scalable implementations that process data sequentially without storing full datasets. His contributions extended to applications in SVM training and probabilistic models, where SGD's low [memory footprint](/page/Memory_footprint) and fast [convergence](/page/Convergence) proved advantageous for real-world datasets, as elaborated in his 2010 review.[](https://leon.bottou.org/publications/pdf/online-1998.pdf)[](https://leon.bottou.org/publications/pdf/compstat-2010.pdf)

The 2010s witnessed a surge in [deep learning](/page/Deep_learning), where SGD emerged as the default optimizer in major frameworks, powering breakthroughs in image recognition and beyond. Frameworks like [TensorFlow](/page/TensorFlow), introduced in 2015, integrated SGD with momentum as a standard tool for training deep neural networks on massive datasets. Seminal works, such as the 2012 [AlexNet](/page/AlexNet) model, relied on SGD to achieve state-of-the-art performance on [ImageNet](/page/ImageNet), demonstrating its robustness in handling noisy gradients and large-scale optimization in convolutional architectures. This era solidified SGD's centrality in [machine learning](/page/Machine_learning) pipelines, driven by its simplicity and empirical effectiveness in distributed training environments.[](http://download.tensorflow.org/paper/whitepaper2015.pdf)

Key Applications

Supervised Learning in Neural Networks

Stochastic gradient descent (SGD) plays a central role in [supervised learning](/page/Supervised_learning) with neural networks by iteratively updating model parameters to minimize a [loss function](/page/Loss_function), typically using gradients computed via [backpropagation](/page/Backpropagation). Backpropagation leverages the chain rule to efficiently calculate gradients for each layer, propagating errors from the output backward through the network, allowing SGD to perform weight updates layer-by-layer based on stochastic approximations of the full gradient.

This integration enables the training of multi-layer architectures for tasks such as image classification and [regression](/page/Regression), where the objective is to map input features to target labels by adjusting weights to reduce prediction errors.[](https://www.nature.com/articles/323533a0.pdf)

Training deep neural networks with SGD encounters significant challenges, including vanishing and exploding gradients, which can hinder effective learning in deeper architectures. Vanishing gradients occur when gradients become exponentially small during [backpropagation](/page/Backpropagation), slowing or stalling updates in early layers, while exploding gradients cause unstable large updates that can make the optimization diverge. To mitigate these issues, rectified linear units (ReLUs), defined as $f(x) = \max(0, x)$, are widely adopted as activation functions to preserve gradient flow without saturation, improving convergence in deep networks. Additionally, [batch normalization](/page/Batch_normalization) standardizes layer inputs by subtracting the batch mean and dividing by the batch standard deviation, followed by scaling and shifting, which reduces internal covariate shift and stabilizes SGD training, enabling deeper models without gradient instability.[](https://proceedings.mlr.press/v28/pascanu13.pdf)[](https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf)[](https://arxiv.org/pdf/1502.03167)

The empirical success of SGD in supervised neural network training is exemplified by its pivotal role in landmark achievements, such as the 2012 ImageNet competition, where [AlexNet](/page/AlexNet) achieved a top-5 error rate of 15.3% using SGD to train an eight-layer convolutional network on over a million images. In more recent architectures like transformers for sequence tasks, SGD with [momentum](/page/Momentum), where updates incorporate a fraction of the previous velocity to accelerate convergence, serves as a robust baseline optimizer, often matching adaptive methods like [Adam](/page/Adam) when hyperparameters are tuned appropriately for small-batch settings. These successes underscore SGD's reliability in scaling [supervised learning](/page/Supervised_learning) to complex, high-dimensional data.

For scalability in large-scale supervised training, distributed SGD employs data parallelism, where the dataset is partitioned across multiple workers that compute local gradients on mini-batches and aggregate them (e.g., via averaging) to update a shared model, as demonstrated in early frameworks like DistBelief, which trained billion-parameter networks across thousands of cores. This approach reduces training time while maintaining SGD's stochastic benefits, facilitating the handling of massive datasets in neural network applications.[](https://research.google.com/archive/large_deep_networks_nips2012.pdf)[](https://proceedings.neurips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)[](https://openreview.net/forum?id=R5mcjzbQa6)

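A minimal sketch of one synchronous data-parallel SGD step as described above (the shard count, mini-batch, and loss are illustrative placeholders, not the mechanics of any specific framework):

```python
import numpy as np

def data_parallel_sgd_step(w, shards, grad_fn, eta):
    """One synchronous data-parallel step: each worker computes the gradient on its
    shard of the mini-batch, the gradients are averaged, and the shared model is updated."""
    local_grads = [grad_fn(Xs, ys, w) for (Xs, ys) in shards]   # one entry per worker
    g = np.mean(local_grads, axis=0)                            # aggregation by averaging
    return w - eta * g

# Example: split one mini-batch across 4 hypothetical workers.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 10)), rng.normal(size=256)
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
mse_grad = lambda Xb, yb, w: 2 * Xb.T @ (Xb @ w - yb) / len(yb)
w = data_parallel_sgd_step(np.zeros(10), shards, mse_grad, eta=0.01)
```
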
Emerging Uses in Distributed Systems

In geo-distributed streaming systems, GPU-accelerated stochastic gradient descent (SGD) has emerged as a key optimization technique for operator placement, enabling scalable handling of large topologies with minimal [latency](/page/Latency). The NEMO-SGD framework formulates operator placement as a differentiable [loss function](/page/Loss_function) that minimizes communication [latency](/page/Latency) while enforcing resource constraints through soft penalties, using mini-batch SGD to iteratively update virtual node positions in parallel on GPUs. This approach scales to topologies with up to 1 million nodes, optimizing placements in under 1 second and reducing optimization time by up to 70% compared to prior CPU-based methods, while improving 90th [percentile](/page/Percentile) [latency](/page/Latency) by up to 75× under heavy loads without [capacity](/page/Capacity) violations.[](https://www.vldb.org/2025/Workshops/VLDB-Workshops-2025/ADMS/ADMS25-03.pdf)

Federated learning leverages SGD variants to train models on decentralized data across distributed devices, preserving privacy by performing local updates without centralizing raw data. The seminal FedAvg algorithm, a form of local SGD, averages model updates from multiple clients after several local epochs, enabling efficient [convergence](/page/Convergence) in heterogeneous environments (a minimal sketch of one such round appears at the end of this subsection). Recent advancements, such as FedEff, optimize the number of local epochs in FedAvg variants to balance communication overhead and model accuracy, achieving up to 2.5× faster [convergence](/page/Convergence) on non-IID data distributions in resource-constrained settings like mobile networks. These SGD-based methods have been widely adopted in applications like personalized recommendations and healthcare [analytics](/page/Analytics), with surveys highlighting their robustness to statistical heterogeneity through adaptive aggregation.[](https://www.nature.com/articles/s41598-025-22672-1)[](https://www.sciencedirect.com/science/article/pii/S0925231224007902)

In geo-distributed [training](/page/Training) across cloud data centers, asynchronous SGD addresses WAN [latency](/page/Latency) by allowing independent local updates without global [synchronization](/page/Synchronization), mitigating straggler effects in large-scale model [training](/page/Training). The HALoS framework introduces hierarchical asynchronous local SGD with regional parameter servers for intra-region fast updates and global merging, using momentum-enhanced SGD to handle slow inter-region communications. This results in up to 7.5× faster convergence than synchronous baselines and 2.1× speedup over prior asynchronous methods for large language models, while maintaining comparable final accuracy and reducing total [training](/page/Training) time by up to 68.6× in cross-continental setups.[](https://arxiv.org/abs/2506.04531) Complementary techniques, such as multi-TCP connections in geo-distributed pipelines, further enhance SGD [efficiency](/page/Efficiency) by [scaling](/page/Scaling) [bandwidth](/page/Bandwidth) to 5 Gbps and improving GPU utilization to 94%, yielding up to 17× faster [training](/page/Training) overall.[](https://arxiv.org/abs/2411.14458)

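A minimal sketch of one FedAvg-style round built from plain local SGD (the client partitioning, local epoch count, and size-weighted averaging are simplified assumptions, not the exact algorithm of any cited system):

```python
import numpy as np

def local_sgd(w, data, grad_fn, eta, epochs, rng):
    """Plain SGD over one client's local data for a few epochs."""
    w = np.array(w, dtype=float)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            w -= eta * grad_fn(data[i], w)
    return w

def fedavg_round(w_global, clients, grad_fn, eta=0.05, local_epochs=3, seed=0):
    """One FedAvg-style round: every client runs local SGD from the current global model,
    and the server averages the returned models weighted by client dataset size."""
    rng = np.random.default_rng(seed)
    updates = [local_sgd(w_global, data, grad_fn, eta, local_epochs, rng) for data in clients]
    sizes = np.array([len(data) for data in clients], dtype=float)
    weights = sizes / sizes.sum()
    return np.sum([wk * uk for wk, uk in zip(weights, updates)], axis=0)

# Hypothetical usage: clients is a list of per-client sample lists, grad_fn a per-sample gradient.
# w_next = fedavg_round(w_global, clients, grad_fn)
```
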
Beyond [machine learning](/page/Machine_learning), SGD finds emerging applications in [geophysics](/page/Geophysics) for full waveform inversion (FWI), where it optimizes subsurface models against seismic data in stochastic Bayesian frameworks to account for uncertainties. A 2024 approach integrates deep generative priors with gradient-based [Bayesian inference](/page/Bayesian_inference) methods, such as Stein Variational [Gradient Descent](/page/Gradient_descent) (SVGD), to perform stochastic FWI on time-lapse seismic datasets, iteratively minimizing misfit functions to reconstruct velocity models with quantified posterior distributions, demonstrating improved [resolution](/page/Resolution) in noisy, high-dimensional inversion problems compared to deterministic methods. This enables [real-time](/page/Real-time) updates in monitoring applications like carbon storage sites, where SGD's gradient estimates handle the ill-posed nature of wave propagation simulations efficiently.[](https://arxiv.org/abs/2406.04859)

For [real-time](/page/Real-time) processing of erratic datasets (characterized by [noise](/page/Noise), non-stationarity, or outliers), adaptive SGD variants dynamically adjust learning rates and incorporate [momentum](/page/Momentum) to stabilize [convergence](/page/Convergence) without full data reshuffling. A 2024 study proposes an adaptive SGD method for training on erratic datasets like [sensor](/page/Sensor) streams or financial [time series](/page/Time_series), achieving faster [convergence](/page/Convergence) and reduced [overfitting](/page/Overfitting) compared to standard SGD in non-IID scenarios. These techniques support low-latency inference in [edge computing](/page/Edge_computing) environments, prioritizing robustness over exhaustive [batch processing](/page/Batch_processing).[](https://www.sciencedirect.com/science/article/abs/pii/S0167739X24006460)

Variants and Extensions

Momentum and Averaging Techniques

Momentum methods enhance stochastic gradient descent (SGD) by incorporating a [velocity](/page/Velocity) term that accumulates past [gradients](/page/Gradient), helping to accelerate [convergence](/page/Convergence) and dampen oscillations caused by noisy gradient estimates.[](https://papers.baulab.info/papers/also/Polyak-1964.pdf) In the heavy-ball formulation, originally proposed for deterministic optimization, the update rule is given by v_t = \beta v_{t-1} + g_t, \quad w_{t+1} = w_t - \eta v_t, where $v_t$ is the [velocity](/page/Velocity), $\beta \in (0,1)$ is the momentum coefficient (typically set to 0.9 in practice for [machine learning](/page/Machine_learning) applications), $g_t$ is the stochastic gradient, $\eta$ is the [learning rate](/page/Learning_rate), and $w_t$ denotes the parameters at [iteration](/page/Iteration) $t$.[](https://papers.baulab.info/papers/also/Polyak-1964.pdf)[](https://proceedings.mlr.press/v28/sutskever13.pdf) This approach simulates physical [momentum](/page/Momentum), allowing the optimizer to continue moving in consistent directions despite transient fluctuations in gradients, thereby reducing oscillations along ravine-like regions in the loss [landscape](/page/Landscape).[](https://papers.baulab.info/papers/also/Polyak-1964.pdf) When adapted to SGD, [momentum](/page/Momentum) proves particularly effective in [deep learning](/page/Deep_learning), where it speeds up training by smoothing the optimization trajectory and escaping shallow local minima more readily.[](https://proceedings.mlr.press/v28/sutskever13.pdf)

A refined variant, Nesterov's accelerated [gradient descent](/page/Gradient_descent), introduces a lookahead [mechanism](/page/Mechanism) to further improve performance, especially in [convex](/page/Convex) settings.

The update computes the gradient at an extrapolated point $y_t = w_t + \beta (w_t - w_{t-1})$ before applying the correction, yielding w_{t+1} = y_t - \eta \nabla f(y_t), where $\beta$ is often chosen as $\frac{t-1}{t+2}$ to achieve optimal rates.[](https://hengshuaiyao.github.io/papers/nesterov83.pdf) This lookahead adjustment anticipates the parameter shift, enabling tighter bounds on convergence for [smooth](/page/Smooth) [convex](/page/Convex) functions, with a rate of $O(1/t^2)$ compared to $O(1/t)$ for standard [gradient descent](/page/Gradient_descent).[](https://hengshuaiyao.github.io/papers/nesterov83.pdf) In [stochastic](/page/Stochastic) contexts, Nesterov acceleration maintains these advantages by better handling the inherent variance, leading to more stable progress in ill-conditioned problems where standard SGD might stall.[](https://hengshuaiyao.github.io/papers/nesterov83.pdf)

Averaging techniques complement momentum by reducing the variance of SGD iterates, particularly in [convex optimization](/page/Convex_optimization). The Polyak-Ruppert averaging scheme computes the final estimate as the uniform average of all iterates, \bar{w}_T = \frac{1}{T} \sum_{t=1}^T w_t, which serves as a more stable proxy for the minimizer after many steps.[](https://meyn.ece.ufl.edu/wp-content/uploads/sites/77/archive/spm_files/Courses/ECE555-2011/555media/poljud92.pdf) Introduced independently by Ruppert for stochastic approximation procedures and extended by Polyak and Juditsky to achieve asymptotic optimality, this method mitigates the high variance of last-iterate SGD, yielding [convergence](/page/Convergence) rates matching the optimal $O(1/\sqrt{T})$ for nonsmooth [convex](/page/Convex) objectives without requiring decreasing step sizes.[](https://epubs.siam.org/doi/10.1137/0330046) In practice, averaging is especially beneficial for ill-conditioned problems, as it smooths out noise accumulation and promotes faster empirical [convergence](/page/Convergence) in [machine learning](/page/Machine_learning) tasks like [logistic regression](/page/Logistic_regression).[](https://epubs.siam.org/doi/10.1137/0330046)

Collectively, these techniques ([momentum](/page/Momentum) for directional persistence, [Nesterov](/page/Nesterov) acceleration for anticipatory corrections, and averaging for variance control) enable SGD to tackle ill-conditioned landscapes more efficiently, often reducing the number of iterations needed for [convergence](/page/Convergence) by factors of 5-10 in high-dimensional settings.[](https://proceedings.mlr.press/v28/sutskever13.pdf)[](https://epubs.siam.org/doi/10.1137/0330046)

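A minimal sketch combining the heavy-ball update above with averaging of the tail iterates (a practical variant of Polyak-Ruppert averaging), applied to a noisy quadratic objective whose matrix, noise level, and hyperparameters are arbitrary illustrative choices:

```python
import numpy as np

def heavy_ball_sgd(grad_fn, w0, eta=0.01, beta=0.9, steps=2000, average_last=500, seed=0):
    """SGD with heavy-ball momentum; also returns the average of the last iterates."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    tail = []
    for t in range(steps):
        g = grad_fn(w, rng)          # stochastic gradient
        v = beta * v + g             # velocity accumulates past gradients
        w = w - eta * v              # heavy-ball update
        if t >= steps - average_last:
            tail.append(w.copy())
    return w, np.mean(tail, axis=0)

# Noisy quadratic Q(w) = 0.5 * w' A w with additive gradient noise (placeholders).
A = np.diag([10.0, 1.0])
noisy_grad = lambda w, rng: A @ w + 0.5 * rng.normal(size=2)
w_last, w_avg = heavy_ball_sgd(noisy_grad, [5.0, 5.0])
print(w_last, w_avg)   # the averaged iterate typically sits closer to the minimum at the origin
```
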
Adaptive Learning Rate Optimizers

Adaptive [learning rate](/page/Learning_rate) optimizers represent a significant evolution in stochastic gradient descent (SGD) algorithms, designed to dynamically adjust the learning rate for each parameter based on the history of gradients encountered during [training](/page/Training). These methods address the limitations of fixed or globally scheduled learning rates by scaling updates per-coordinate, which proves particularly effective for problems with sparse gradients or varying gradient magnitudes, such as those in [natural language processing](/page/Natural_language_processing) and recommendation systems. Unlike classical SGD with [momentum](/page/Momentum), which applies a uniform acceleration across parameters, adaptive optimizers focus on individualized adaptation to enhance [convergence](/page/Convergence) speed and [stability](/page/Stability).[](https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)

AdaGrad, introduced in 2011, was one of the first adaptive methods tailored for SGD in [online learning](/page/Online_learning) and sparse data settings. It accumulates the squared [gradient](/page/Gradient)s in a [diagonal matrix](/page/Diagonal_matrix) and scales the [learning rate](/page/Learning_rate) inversely proportional to the [square root](/page/Square_root) of this accumulation for each coordinate, formulated as $\eta / \sqrt{\sum_{\tau=1}^t g_{\tau,j}^2 + \epsilon}$, where $\eta$ is the base [learning rate](/page/Learning_rate), $g_{\tau,j}$ is the [gradient](/page/Gradient) at time step $\tau$ for [parameter](/page/Parameter) $j$, and $\epsilon$ is a small constant for [numerical stability](/page/Numerical_stability). This per-coordinate scaling allows frequently updated parameters to receive smaller steps, mitigating overshooting, while rarely updated ones get larger adjustments, making it suitable for sparse features. However, AdaGrad's aggressive accumulation of past gradients can lead to prematurely diminishing learning rates, limiting its effectiveness in non-stationary problems like deep [neural network](/page/Neural_network) training.[](https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)

To overcome AdaGrad's rapid decay, RMSProp was proposed in 2012 as an adaptive variant that employs an exponential [moving average](/page/Moving_average) of squared gradients instead of a full [sum](/page/Sum). The update for the moving average is given by $E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2$, where $\beta$ is typically set to 0.9, and the [learning rate](/page/Learning_rate) is then scaled by $\eta / \sqrt{E[g^2]_t + \epsilon}$. This approach maintains a bounded effective learning rate by [forgetting](/page/Forgetting) older gradients exponentially, enabling better [performance](/page/Performance) on recurrent neural networks and other non-convex objectives with persistent gradient flow. RMSProp's simplicity and empirical robustness have made it a staple in [deep learning](/page/Deep_learning) frameworks, though it lacks formal convergence guarantees in all settings.[](https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)

Adam, published in 2014, combines the benefits of RMSProp's adaptive scaling with [momentum](/page/Momentum)-based acceleration, emerging as the most widely adopted optimizer for SGD in modern [machine learning](/page/Machine_learning). It maintains two moving averages: the first [moment](/page/Moment) $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ for gradient [momentum](/page/Momentum) and the second [moment](/page/Moment) $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ for variance [estimation](/page/Estimation), with default hyperparameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$. [Bias](/page/Bias) correction is applied as $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$, yielding the update $\theta_{t+1} = \theta_t - \eta \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$. This formulation provides adaptive per-parameter learning rates while incorporating gradient direction smoothing, leading to faster [convergence](/page/Convergence) and reduced sensitivity to hyperparameter choices in tasks like image classification and generative modeling. Adam's empirical success stems from its ability to handle noisy, non-stationary objectives typical in [deep learning](/page/Deep_learning).[](https://arxiv.org/pdf/1412.6980)

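A minimal sketch of a single Adam update following the moment estimates and bias corrections given above (the defaults match the stated hyperparameters; the usage values are placeholders):

```python
import numpy as np

def adam_step(w, g, state, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using the moment estimates and bias corrections described above."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * g          # first moment: gradient momentum
    v = beta2 * v + (1 - beta2) * g * g      # second moment: per-parameter variance estimate
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)

# Usage: the optimizer state starts at (zeros, zeros, 0) and is threaded through training.
w, state = np.zeros(4), (np.zeros(4), np.zeros(4), 0)
g = np.array([0.1, -0.2, 0.05, 0.0])         # stand-in stochastic gradient
w, state = adam_step(w, g, state)
```
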
Despite its popularity, Adam's original convergence analysis has limitations, prompting variants like AMSGrad in 2018, which modifies the second moment by using the maximum of all past exponentially decaying averages, $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$, to ensure monotonic decay and restore theoretical guarantees for [convex](/page/Convex) objectives. This adjustment addresses cases where Adam's variance estimate can cause non-convergence by preventing the effective [learning rate](/page/Learning_rate) from increasing over time. AMSGrad maintains similar empirical performance to Adam while fixing these theoretical issues, influencing subsequent implementations in libraries like [TensorFlow](/page/TensorFlow) and [PyTorch](/page/PyTorch).[](https://openreview.net/pdf?id=ryQu7f-RZ)

By 2024, evolutionary trends in adaptive optimizers have emphasized [memory](/page/Memory) [efficiency](/page/Efficiency) and [integration](/page/Integration) with large-scale models, such as LDAdam, which maintains low-dimensional [gradient](/page/Gradient) statistics to reduce computational overhead when training billion-parameter networks. These advancements build on Adam's foundation, incorporating model complexity awareness or fractional derivatives for finer rate adjustments, further enhancing SGD's applicability to emerging [AI](/page/Ai) paradigms like large [language](/page/Language) models.[](https://arxiv.org/pdf/2410.16103)

Privacy-Preserving and Sign-Based Methods

Privacy-preserving variants of [stochastic gradient descent](/page/Stochastic_gradient_descent) (SGD) address the risk of data leakage in [machine learning](/page/Machine_learning) models trained on sensitive information, particularly in federated or distributed settings. A prominent method is Differentially Private SGD (DP-SGD), introduced in 2016, which modifies the standard SGD [update](/page/Update) by incorporating [gradient](/page/Gradient) clipping and noise addition to achieve [differential privacy](/page/Differential_privacy) guarantees. Specifically, per-sample gradients are clipped to have a maximum L2-norm of C, and [Gaussian noise](/page/Gaussian_noise) from a [distribution](/page/Distribution) N(0, σ² C² I) is added to the aggregated minibatch [gradient](/page/Gradient) before the [update](/page/Update) step. This mechanism bounds the privacy loss with parameters ε (privacy budget) and δ (failure probability), ensuring that the presence or absence of any single training example has limited influence on the model's output [distribution](/page/Distribution).[](https://arxiv.org/abs/1607.00133)

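A minimal sketch of one DP-SGD update implementing the clipping and noise mechanism described above (the clipping norm, noise multiplier, and stand-in per-sample gradients are illustrative placeholders, not recommended privacy settings):

```python
import numpy as np

def dp_sgd_step(w, per_sample_grads, eta, clip_C, sigma, rng):
    """One DP-SGD update: clip each per-sample gradient to L2 norm clip_C, sum,
    add Gaussian noise N(0, (sigma*clip_C)^2 I), average over the batch, then step."""
    clipped = [g * min(1.0, clip_C / (np.linalg.norm(g) + 1e-12)) for g in per_sample_grads]
    b = len(clipped)
    noisy_avg = (np.sum(clipped, axis=0) + sigma * clip_C * rng.normal(size=w.shape)) / b
    return w - eta * noisy_avg

rng = np.random.default_rng(0)
grads = [rng.normal(size=3) for _ in range(32)]        # stand-in per-sample gradients
w = dp_sgd_step(np.zeros(3), grads, eta=0.1, clip_C=1.0, sigma=1.1, rng=rng)
```
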
DP-SGD has been refined in recent years to improve utility-privacy trade-offs. For instance, analyses in 2024 revealed that the [sensitivity](/page/Sensitivity) of gradients in DP-SGD is often overestimated in common benchmarks, leading to conservative noise levels that can be relaxed without compromising [privacy](/page/Privacy) guarantees, thus enhancing model accuracy.[](https://www.usenix.org/conference/usenixsecurity24/presentation/thudi) Further optimizations, such as adaptive clipping strategies, were proposed in 2025 to counter membership inference attacks while maintaining [convergence](/page/Convergence) rates comparable to non-private SGD on datasets like CIFAR-10.[](https://www.sciencedirect.com/science/article/abs/pii/S1389128625001070)[](https://arxiv.org/abs/2503.22988)

Sign-based methods, such as signSGD, prioritize communication efficiency in distributed training environments where bandwidth is limited. In signSGD, the update rule replaces the full gradient with its sign: $ w_{t+1} = w_t - \eta \, \mathrm{sign}(g_t) $, where $\mathrm{sign}(g_t)$ is the element-wise sign of the stochastic gradient $g_t$, transmitting only 1 bit per parameter instead of 32 bits for full-precision values. This approach converges to stationary points for non-convex objectives at rates similar to standard SGD, with empirical demonstrations on [ImageNet](/page/ImageNet) showing minimal accuracy degradation (less than 1%) while reducing communication by over 99%. Recent extensions incorporate [variance reduction](/page/Variance_reduction) techniques to accelerate convergence in heterogeneous distributed systems, achieving up to 2x faster training on large-scale models without additional communication overhead.[](https://arxiv.org/abs/1802.04434)[](https://arxiv.org/abs/2406.00489)

Implicit SGD (ISGD) extends SGD by using proximal updates to handle constraints or regularization more stably than explicit updates. The ISGD step solves an implicit equation: $ w_{t+1} = \arg\min_w \left( \frac{1}{2\eta} \|w - w_t\|^2 + \ell(w; z_t) \right) $, where $\ell$ is the loss on sample $z_t$, effectively incorporating proximal operators for [constrained optimization](/page/Constrained_optimization) problems like L1-regularized [regression](/page/Regression). This formulation improves [numerical stability](/page/Numerical_stability) and bias reduction in high-dimensional settings, with theoretical guarantees showing O(1/√T) [convergence](/page/Convergence) for [convex](/page/Convex) losses under weaker assumptions than standard SGD.

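A minimal sketch of the signSGD update above, together with the bit-packing that motivates its roughly 1-bit-per-parameter communication cost (the gradient values are placeholders):

```python
import numpy as np

def signsgd_step(w, g, eta):
    """signSGD update: step along the element-wise sign of the stochastic gradient."""
    return w - eta * np.sign(g)

# A worker can transmit just the packed sign bits, roughly 1 bit per coordinate:
g = np.array([0.3, -1.2, 0.0, 2.4])          # stand-in stochastic gradient
bits = np.packbits(g > 0)                    # compressed message sent to the server
w = signsgd_step(np.zeros(4), g, eta=0.01)
```
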
Emerging developments in 2024 have tailored SGD variants for challenging data regimes. Adaptive SGD for erratic or non-stationary datasets dynamically adjusts the [learning rate](/page/Learning_rate) based on gradient variance, incorporating [momentum](/page/Momentum) to mitigate [noise](/page/Noise) from outliers, and improves [convergence](/page/Convergence) on synthetic erratic benchmarks compared to vanilla SGD.[](https://www.sciencedirect.com/science/article/abs/pii/S0167739X24006460)

Theoretical Foundations

Continuous-Time Approximations

Continuous-time approximations provide a [framework](/page/Framework) for analyzing the asymptotic behavior of stochastic gradient descent (SGD) by modeling its iterates as solutions to stochastic differential equations (SDEs), bridging [discrete optimization](/page/Discrete_optimization) dynamics with continuous stochastic processes.[](https://www.stephanmandt.com/papers/Mandtetal_NIPS_OPT2015.pdf)[](http://proceedings.mlr.press/v70/li17f/li17f.pdf) In the limit of small step sizes, the deterministic component of SGD converges to the **gradient flow**, an [ordinary differential equation](/page/Ordinary_differential_equation) (ODE) describing the continuous evolution of parameters under the true gradient. This ODE is given by dw = -\nabla Q(w) \, dt, where $Q(w)$ is the objective function, $w$ represents the parameters, and $\nabla Q(w)$ is its gradient.[](https://www.stephanmandt.com/papers/Mandtetal_NIPS_OPT2015.pdf) This flow captures the mean trajectory of SGD, illustrating how parameters descend along the steepest direction toward minima in the absence of noise.[](http://proceedings.mlr.press/v70/li17f/li17f.pdf)

To incorporate the stochasticity inherent in SGD, arising from mini-batch gradient estimates, the process is approximated by an SDE that adds a diffusion term driven by Brownian motion. For a time-varying learning rate $\eta_t$, the SGD dynamics are modeled as dw_t = -\eta_t \nabla Q(w_t) \, dt + \sqrt{\eta_t} \, \sigma(w_t) \, dB_t, where $B_t$ is standard Brownian motion and $\sigma(w_t)$ captures the local covariance of the gradient noise.[](http://proceedings.mlr.press/v70/li17f/li17f.pdf) This formulation, known as a stochastic modified equation, provides a weak approximation to the discrete SGD updates, with the diffusion coefficient scaling as the square root of the learning rate to reflect the accumulation of noise over infinitesimal steps.[](https://jmlr.org/papers/volume25/23-0237/23-0237.pdf) Near stationary points, the SDE simplifies to a multivariate Ornstein-Uhlenbeck process, enabling precise analysis of local behavior.[](https://www.stephanmandt.com/papers/Mandtetal_NIPS_OPT2015.pdf)

This SDE bears a close resemblance to **[Langevin dynamics](/page/Langevin_dynamics)**, the continuous-time limit of [Markov chain Monte Carlo](/page/Markov_chain_Monte_Carlo) methods for sampling from posterior distributions.

In this interpretation, the gradient term acts as a drift pulling toward the mode of the target distribution, while the noise term, proportional to a "[temperature](/page/Temperature)" parameter related to the [learning rate](/page/Learning_rate), facilitates exploration of the parameter space.[](https://www.stephanmandt.com/papers/Mandtetal_NIPS_OPT2015.pdf) Specifically, running SGD with a constant learning rate approximates sampling from a distribution $q(w) \propto \exp\left(-\frac{2}{\eta} Q(w)\right)$, where the inverse temperature is inversely proportional to $\eta$, allowing SGD to be viewed as approximate [Bayesian inference](/page/Bayesian_inference) in non-convex objectives.[](https://www.jmlr.org/papers/volume18/17-214/17-214.pdf)

Analysis of these SDEs reveals key properties of SGD's long-term behavior. The **invariant measure** of the SDE, which describes the [stationary distribution](/page/Stationary_distribution) of parameters under constant learning rates, is Gaussian near quadratic minima, with [covariance](/page/Covariance) $\Sigma$ satisfying the [Lyapunov equation](/page/Lyapunov_equation) $\nabla^2 Q(w^*) \Sigma + \Sigma (\nabla^2 Q(w^*))^T = \eta \sigma \sigma^T / S$, where $S$ is the mini-batch size and $w^*$ is the minimum.[](https://www.stephanmandt.com/papers/Mandtetal_NIPS_OPT2015.pdf) This measure quantifies [uncertainty](/page/Uncertainty) in parameter estimates and guides optimal [learning rate](/page/Learning_rate) selection to minimize divergence from the true posterior.

For global dynamics, the SDE framework informs **escape times** from local minima, modeled as first-exit times from a neighborhood around a spurious minimum. Under [Gaussian noise](/page/Gaussian_noise) (the Brownian case), the expected escape time scales inversely with the noise level $\sqrt{\eta}$, enabling SGD to evade sharp minima faster than deterministic gradient flow, with explicit bounds derived via [Itô calculus](/page/Itô_calculus) for quadratic potentials.[](https://arxiv.org/pdf/1906.09069) These analyses highlight how noise injection promotes convergence to flatter, more generalizable minima in overparameterized models.[](https://www.stephanmandt.com/papers/Mandtetal_NIPS_OPT2015.pdf)

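A minimal sketch that simulates the SDE above with an Euler-Maruyama discretization on a quadratic objective (the objective, noise covariance, step sizes, and horizon are arbitrary choices); the spread of the late-time iterates illustrates the Gaussian invariant measure discussed above:

```python
import numpy as np

def euler_maruyama(grad, sigma, w0, eta=0.1, dt=0.01, steps=20000, seed=0):
    """Euler-Maruyama simulation of dw = -eta*grad(w) dt + sqrt(eta)*sigma(w) dB,
    the SDE surrogate for constant-learning-rate SGD described above."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    path = [w.copy()]
    for _ in range(steps):
        dB = np.sqrt(dt) * rng.normal(size=w.shape)            # Brownian increment
        w = w - eta * grad(w) * dt + np.sqrt(eta) * sigma(w) @ dB
        path.append(w.copy())
    return np.array(path)

# Quadratic loss Q(w) = 0.5 * w' A w with a constant noise covariance (all placeholders);
# the spread of the late-time iterates approximates the Gaussian invariant measure.
A = np.diag([4.0, 1.0])
path = euler_maruyama(lambda w: A @ w, lambda w: 0.3 * np.eye(2), [2.0, 2.0])
print(path[len(path) // 2:].std(axis=0))
```
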
Recent Advances in Analysis

Recent theoretical analyses have advanced the understanding of stochastic gradient descent (SGD) by modeling its dynamics through partial differential equations (PDEs). In a 2025 study, SGD is interpreted as the Euler-Maruyama discretization of a stochastic differential equation (SDE), which is further approximated by a Fokker-Planck PDE describing the evolution of the probability density of the parameters: \partial_t \rho = \nabla \cdot \left( \varepsilon^2 \nabla \cdot (Q(x) \rho) + \rho \nabla L(x) \right), where $\rho(t,x)$ is the transition probability density, $\varepsilon^2 = \eta / (2b)$ represents an effective learning rate scaled by the batch size $b$, $Q(x)$ captures the noise covariance, and $L(x)$ is the loss function.[](https://arxiv.org/abs/2501.08425) This PDE perspective reveals two distinct regimes in SGD's behavior: an initial drift regime where parameter mass concentrates around local minima, quantified by lower bounds on mass preservation, and a subsequent diffusion regime where stochastic fluctuations enable escape from suboptimal minima, with novel bounds on the mean exit time (MET) providing both lower and upper estimates under non-degeneracy assumptions.[](https://arxiv.org/abs/2501.08425) These results yield new effectiveness bounds for SGD's exploration-exploitation trade-off in non-convex landscapes, demonstrating exponential convergence to stationary measures via duality and entropy methods.[](https://arxiv.org/abs/2501.08425)

Under relaxed smoothness assumptions, such as locally [Lipschitz](/page/Lipschitz) gradients rather than global [Lipschitz continuity](/page/Lipschitz_continuity), accelerated variants of SGD incorporating [Nesterov](/page/Nesterov) and heavy-ball [momentum](/page/Momentum) have been shown to maintain strong [convergence](/page/Convergence) guarantees. A 2024 analysis introduces normalized gradient descent (NGD) stepsizes integrated with these acceleration techniques, using local Lipschitz constants to adapt step sizes dynamically and avoid reliance on global bounds.[](https://optimization-online.org/wp-content/uploads/2024/11/CHC25.pdf) For [convex](/page/Convex) objectives, the heavy-ball accelerated NGD (NGDh) and [Nesterov](/page/Nesterov) accelerated NGD (NGDn) achieve ergodic [convergence](/page/Convergence) rates of $O(1/K)$ after $K$ iterations, with conditions on [momentum](/page/Momentum) parameters $\gamma$ and error tolerances ensuring stability, such as $e^{\sum \varepsilon_j} \leq (1-\gamma)/L$ for NGDh.[](https://optimization-online.org/wp-content/uploads/2024/11/CHC25.pdf) Stochastic extensions of these methods extend the [analysis](/page/Analysis) to non-[convex](/page/Convex) settings, providing [convergence](/page/Convergence) under bounded gradient variance, which enhances applicability to [deep learning](/page/Deep_learning) tasks where global smoothness fails.[](https://optimization-online.org/wp-content/uploads/2024/11/CHC25.pdf)

For datasets exhibiting erratic or non-independent and identically distributed (non-IID) structures, adaptive mechanisms in SGD have been developed to improve [convergence](/page/Convergence) robustness.

A 2024 framework proposes an adaptive SGD variant that dynamically adjusts learning rates based on historical [gradient](/page/Gradient) information and incorporates [momentum](/page/Momentum) to selectively prioritize consistent [data](/page/Data) points, mitigating the impact of noisy or heterogeneous samples.[](https://doi.org/10.1016/j.future.2024.107682) This approach demonstrates faster [convergence](/page/Convergence) and higher accuracy in empirical evaluations on [logistic regression](/page/Logistic_regression) tasks, with adaptive rates enabling stable updates even under [data](/page/Data) variability that degrades standard SGD performance.[](https://doi.org/10.1016/j.future.2024.107682) While its theoretical convergence is supported mainly by empirical stability, the method highlights the potential of adaptive strategies to handle real-world non-IID challenges without explicit distributional assumptions.

Despite these advances, several open questions persist in SGD analysis, particularly regarding tight convergence rates in highly non-convex settings and sharp generalization bounds linking optimization trajectories to test performance. Recent works underscore the need for refined high-probability bounds that account for heavy-tailed noise and weak smoothness, as current analyses often rely on idealized assumptions that do not fully capture practical generalization gaps.[](https://openreview.net/pdf?id=vEFAR8KH1l) Continuous SDE models provide a useful bridge to these issues by approximating discrete SGD steps, but deriving non-asymptotic generalization bounds from them remains an active area of research.[](https://arxiv.org/abs/2501.08425)