
Stochastic gradient Langevin dynamics

Stochastic gradient Langevin dynamics (SGLD) is an algorithm for performing scalable Bayesian inference in models with large datasets, combining stochastic gradient descent with Langevin dynamics by injecting Gaussian noise into parameter updates to sample from the posterior distribution. Introduced by Max Welling and Yee Whye Teh in 2011, SGLD enables efficient approximation of posterior samples without requiring full-dataset computations at each step, instead using mini-batches to estimate gradients and annealing the step size over iterations to ensure convergence to the target distribution. The method addresses the computational challenges of traditional Markov chain Monte Carlo (MCMC) techniques, such as Langevin Monte Carlo, by leveraging the scalability of stochastic gradient descent while providing the uncertainty quantification and protection against overfitting inherent in Bayesian approaches. In practice, the update rule for parameters \theta_t at iteration t is given by \theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \hat{\nabla} \log p(\theta_t | D) + \eta_t, where \epsilon_t is a decreasing step size, \hat{\nabla} \log p(\theta_t | D) is a noisy estimate of the gradient of the log-posterior using a mini-batch of the data D, and \eta_t \sim \mathcal{N}(0, \epsilon_t I) is isotropic Gaussian noise. This formulation draws from the overdamped Langevin equation, ensuring that under appropriate step-size schedules (e.g., \epsilon_t \propto (b + t)^{-\gamma} for \gamma \in (0.5, 1]), the iterates converge in distribution to the posterior p(\theta | D). SGLD has been applied to various models, including Gaussian mixtures, logistic regression, and independent component analysis, demonstrating its utility in high-dimensional settings. Theoretical analyses have established its consistency and fluctuation properties, showing that it achieves favorable scaling for posterior estimation in both convex and non-convex settings, though practical implementations often require variance-reduction techniques to mitigate biases from constant step sizes. Extensions, such as preconditioned or variance-reduced variants, have further improved its performance in deep learning and large-scale environments.

Introduction

Overview and Motivation

Stochastic gradient Langevin dynamics (SGLD) is a Markov chain Monte Carlo (MCMC) method designed to approximate the posterior distribution in Bayesian inference for large-scale datasets. It achieves this by extending Langevin dynamics—a stochastic differential equation that samples from a target distribution through gradient-based updates augmented with Gaussian noise—with stochastic gradients computed from mini-batches of data, enabling efficient exploration of high-dimensional parameter spaces such as those encountered in deep learning models. The primary motivation for SGLD arises from the computational intractability of traditional full-batch MCMC methods, which require evaluating the full likelihood and its gradient at each iteration, rendering them impractical for massive datasets. By incorporating the scalability of stochastic gradient descent (SGD), SGLD injects isotropic Gaussian noise into the parameter updates to sample from the posterior p(\theta \mid D) \propto p(D \mid \theta) p(\theta), where \theta denotes model parameters and D the observed data, thus bridging optimization and probabilistic inference without needing complete data passes. At its core, the intuition behind SGLD lies in the role of the noise term, which counters the tendency of pure SGD to converge to a single mode (mode collapse) by facilitating escapes from local optima and enabling ergodic sampling from complex, non-convex posteriors. Key benefits include unbiased gradient estimates obtained through mini-batching, which lowers computational overhead compared to full-batch approaches, and the ability to handle large datasets by avoiding exhaustive likelihood evaluations per step. For instance, to sample the weights of a Bayesian neural network from their posterior, SGLD proceeds by initializing parameters and iteratively applying updates based on noisy stochastic gradients derived from random mini-batches of training data, gradually annealing the step size to refine the chain toward stationary samples that capture parameter uncertainty.

Historical Development

The foundations of stochastic gradient Langevin dynamics (SGLD) trace back to the early 20th century in physics, where Paul Langevin introduced the Langevin equation in 1908 to describe the Brownian motion of particles under random forces and friction. This provided a framework for simulating equilibrium distributions in physical systems, laying the groundwork for later adaptations in statistical computing. By the 1990s, Langevin dynamics had been incorporated into Markov chain Monte Carlo (MCMC) methods in statistics, notably through the Metropolis-adjusted Langevin algorithm (MALA), which uses Langevin proposals within a Metropolis-Hastings framework to sample from complex posterior distributions while ensuring ergodicity. SGLD emerged as a scalable extension of these ideas in machine learning, introduced by Max Welling and Yee Whye Teh in their 2011 paper "Bayesian Learning via Stochastic Gradient Langevin Dynamics" at the International Conference on Machine Learning (ICML). Motivated by the need for efficient Bayesian inference on large datasets, where traditional MCMC methods like MALA become computationally prohibitive due to full-data gradient requirements, the authors proposed injecting Gaussian noise into stochastic gradient updates to approximate posterior sampling via mini-batches. This approach addressed overfitting in high-dimensional models by transitioning from point estimates to full posterior exploration, enabling practical Bayesian learning in big-data regimes. The paper has garnered over 3,600 citations as of 2025, underscoring its foundational impact. Early extensions in the mid-2010s focused on mitigating discretization bias and enabling constant learning rates in SGLD-like methods. A key development was stochastic gradient Hamiltonian Monte Carlo (SGHMC) by Tianqi Chen, Emily Fox, and Carlos Guestrin in 2014, which incorporated momentum and friction terms inspired by Hamiltonian dynamics to reduce the bias inherent in SGLD's Euler-Maruyama discretization while maintaining scalability. Between 2014 and 2016, further refinements addressed variance reduction and adaptive preconditioning to improve mixing and convergence on non-convex objectives. By 2015, SGLD began integrating with deep learning for tasks like image classification, as demonstrated by Chunyuan Li, Changyou Chen, David Carlson, and Lawrence Carin in their work on preconditioned SGLD for Bayesian neural networks, which enhanced predictive distributions by adapting to the ill-conditioned Hessians in deep architectures. This milestone facilitated broader adoption in scalable Bayesian methods, with SGLD recognized in surveys on probabilistic machine learning up to 2020 for sparking the stochastic gradient MCMC (SG-MCMC) family of algorithms.

Mathematical Background

Langevin Dynamics

Langevin dynamics, in its overdamped form, provides a foundational stochastic process for sampling from a target probability distribution p(\theta). The dynamics are governed by the stochastic differential equation (SDE) d\theta_t = \frac{1}{2} \nabla \log p(\theta_t) \, dt + dW_t, where \theta_t represents the state at time t, \frac{1}{2} \nabla \log p(\theta_t) is half the score function serving as the drift term, and W_t is a standard Wiener process in \mathbb{R}^d that introduces isotropic Gaussian noise. This equation models the evolution of a particle undergoing Brownian motion within a potential energy landscape defined by U(\theta) = -\log p(\theta), where the drift corresponds to half the negative gradient of the potential, -\frac{1}{2} \nabla U(\theta) = \frac{1}{2} \nabla \log p(\theta), following the scaling convention common in machine learning applications. Physically, the overdamped Langevin equation simulates the diffusive behavior of a particle in a force field derived from the target density p(\theta), approximating the high-friction limit where inertial effects are negligible. The noise term ensures exploration of the state space, while the drift guides the particle toward regions of high probability density. Under suitable regularity conditions on p(\theta), such as strong log-concavity and smoothness, the process is ergodic, meaning that time averages converge to expectations under the stationary distribution p(\theta), and exhibits exponential convergence to this invariant measure. This property makes Langevin dynamics a cornerstone for constructing Markov chains that sample from complex distributions in statistical inference. To implement Langevin dynamics computationally, the continuous-time SDE is discretized using the Euler-Maruyama scheme, yielding the iterative update \theta_{t+1} = \theta_t + \frac{\epsilon}{2} \nabla \log p(\theta_t) + \sqrt{\epsilon} \, \eta_t, where \epsilon > 0 is a small time step and \eta_t \sim \mathcal{N}(0, I_d) is standard Gaussian noise in d dimensions. This discretization forms the basis for Langevin-based Markov chain Monte Carlo (MCMC) methods, generating an approximate Markov chain whose invariant measure approximates p(\theta). To ensure exact invariance and reversibility, the chain can be corrected via a Metropolis-Hastings acceptance step, rendering it a reversible Markov chain that converges geometrically to the target distribution under appropriate conditions on \epsilon and the target density. As a prerequisite for stochastic gradient Langevin dynamics (SGLD), the Langevin diffusion establishes the core mechanism of injecting controlled noise to facilitate sampling, which SGLD extends by stochastically approximating the gradient in high-dimensional, large-data settings.
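To make the discretization concrete, the following minimal NumPy sketch implements the Euler-Maruyama update above for a toy target; the function names and the standard-Gaussian example are illustrative assumptions, not drawn from any reference implementation.

```python
import numpy as np

def langevin_sample(grad_log_p, theta0, eps=1e-2, n_steps=5000, rng=None):
    """Unadjusted Langevin algorithm: the Euler-Maruyama discretization
    theta <- theta + (eps/2) * grad_log_p(theta) + sqrt(eps) * N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_steps, theta.size))
    for t in range(n_steps):
        theta = (theta + 0.5 * eps * grad_log_p(theta)
                 + np.sqrt(eps) * rng.standard_normal(theta.size))
        samples[t] = theta
    return samples

# Toy target: standard 2-D Gaussian, for which grad log p(theta) = -theta.
chain = langevin_sample(lambda th: -th, theta0=np.zeros(2))
print(chain[1000:].mean(axis=0), chain[1000:].var(axis=0))  # approx. 0 and 1
```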

Stochastic Gradient Descent

Stochastic gradient descent (SGD) is an iterative optimization algorithm designed to minimize an objective function by using noisy approximations of the true gradient, computed from random subsets of data called minibatches. The core update rule is \theta_{t+1} = \theta_t - \epsilon_t g_t, where \theta_t represents the model parameters at step t, \epsilon_t > 0 is the learning rate (or step size), and g_t is an unbiased estimator of the true gradient \nabla L(\theta_t), with L(\theta) = -\log p(D \mid \theta) denoting the negative log-likelihood over the full dataset D = \{x_i\}_{i=1}^N. This estimator g_t is typically obtained by averaging the gradients over a minibatch of size n \ll N, ensuring that \mathbb{E}[g_t \mid \theta_t] = \nabla L(\theta_t). A primary advantage of minibatching in SGD is the reduction in computational cost per iteration, dropping from O(N) for full-batch gradient descent to O(n), which enables efficient training on large datasets. The stochasticity introduced by minibatches also imparts a form of implicit regularization, though it increases variance compared to exact gradients. Specifically, the variance of the minibatch estimate is approximately \sigma_g^2 \approx \frac{N - n}{n (N - 1)} \mathrm{Var}(\nabla \log p(x_i \mid \theta)), where the variance term reflects the heterogeneity across individual data points; this scales roughly as O(1/n) for large N, motivating larger minibatches to stabilize updates at the expense of increased per-iteration cost. Convergence of SGD to a minimizer of L(\theta) requires appropriate learning-rate schedules, such as diminishing steps satisfying \sum_t \epsilon_t = \infty and \sum_t \epsilon_t^2 < \infty, which ensure the algorithm explores sufficiently while the noise diminishes over time. These conditions trace back to the foundational stochastic approximation framework, where they guarantee almost sure convergence under mild assumptions like bounded variance and Lipschitz gradients. Alternatively, constant learning rates can achieve optimal convergence rates when paired with averaging of the iterates, as in Polyak-Ruppert averaging, which mitigates the bias from persistent noise. The roots of SGD lie in the stochastic approximation method introduced by Robbins and Monro in 1951, which provided the theoretical basis for iteratively refining estimates using noisy observations. Its adoption and refinement in machine learning, particularly for large-scale problems, were advanced by Bottou in 2010, emphasizing practical implementations like momentum and adaptive rates to handle non-convex losses prevalent in deep learning. In optimization contexts like maximum likelihood estimation, pure SGD converges to a local minimum of L(\theta), corresponding to a maximum a posteriori estimate under uniform priors; however, its gradient noise alone often fails to enable broad exploration of the parameter space for sampling purposes, requiring augmentation with explicit noise sources.
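For illustration, here is a minimal sketch of minibatch SGD with a Robbins-Monro step-size schedule; the helper names, defaults, and the toy mean-estimation objective are assumptions made for this example.

```python
import numpy as np

def sgd(grad_loss_i, data, theta0, eps0=0.1, t0=10.0, rho=0.75,
        batch_size=32, n_steps=2000, rng=None):
    """Minibatch SGD with eps_t = eps0 / (t + t0)^rho, a schedule satisfying
    sum_t eps_t = inf and sum_t eps_t^2 < inf (Robbins-Monro conditions)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    N = len(data)
    for t in range(n_steps):
        eps_t = eps0 / (t + t0) ** rho
        batch = data[rng.choice(N, size=batch_size, replace=False)]
        # Average per-example gradients: an unbiased estimate of the full gradient.
        g = np.mean([grad_loss_i(theta, x) for x in batch], axis=0)
        theta = theta - eps_t * g
    return theta

# Toy example: estimate the mean of noisy data under squared loss (theta - x)^2.
data = np.random.default_rng(0).normal(3.0, 1.0, size=(1000, 1))
print(sgd(lambda th, x: 2.0 * (th - x), data, theta0=np.zeros(1)))  # approx. 3.0
```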

Formal Definition

Discrete-Time Update Rule

The discrete-time update rule of stochastic gradient Langevin dynamics (SGLD) forms the core iterative algorithm for approximating samples from the posterior distribution p(\theta \mid D) in Bayesian inference, where \theta denotes model parameters and D represents the dataset. This rule discretizes the continuous-time Langevin diffusion using a stochastic gradient approximation and additive Gaussian noise to mimic the posterior's invariant distribution. The update at each iteration t is given by \theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \tilde{\nabla} \log p(\theta_t \mid D) + \sqrt{\epsilon_t} \eta_t, where \tilde{\nabla} \log p(\theta_t \mid D) is a noisy estimate of the gradient of the log-joint density using a mini-batch, \epsilon_t > 0 is the step size, and \eta_t \sim \mathcal{N}(0, I) is standard Gaussian noise in the parameter space. The stochastic gradient \tilde{\nabla} \log p(\theta_t \mid D) approximates the full-batch gradient \nabla \log p(\theta_t \mid D) = \nabla \log p(\theta_t) + \sum_{i=1}^N \nabla \log p(x_i \mid \theta_t), where N is the dataset size and p(\theta_t) is the prior. Specifically, it is computed as \tilde{\nabla} \log p(\theta_t \mid D) \approx \nabla \log p(\theta_t) + \frac{N}{n} \sum_{i=1}^n \nabla \log p(x_{t,i} \mid \theta_t), with \{x_{t,1}, \dots, x_{t,n}\} drawn as a mini-batch of size n \ll N at iteration t. This mini-batch scaling by N/n ensures unbiasedness of the estimate relative to the full-data gradient. Key hyperparameters include the step size \epsilon_t, which decreases over time to promote convergence, typically as \epsilon_t = \epsilon_0 / (t + t_0)^\rho with \rho \in (0.5, 1] and offset t_0 \geq 1 to avoid overly large initial steps; the initial step size \epsilon_0 is tuned empirically, often via grid search on validation metrics like log-likelihood. The mini-batch size n balances computational efficiency and variance, with smaller n increasing the gradient noise that aids exploration but may slow mixing. These choices satisfy the conditions \sum_t \epsilon_t = \infty and \sum_t \epsilon_t^2 < \infty for asymptotic consistency. The algorithm proceeds as follows: initialize \theta_0 \sim p(\theta); for t = 1 to T, sample a mini-batch, compute the stochastic gradient \tilde{\nabla} \log p(\theta_t \mid D), and apply the update rule with the current \epsilon_t and fresh \eta_t. Early iterations emphasize optimization via the gradient term, transitioning to sampling as the noise dominates with decreasing \epsilon_t. A long burn-in period is required, often monitored by the relative magnitude of the gradient and noise contributions (e.g., when the noise variance exceeds the gradient step by a factor \alpha \ll 1). The unadjusted Markov chain exhibits O(\epsilon_t) discretization bias from the Euler-Maruyama scheme, mitigated by the diminishing step sizes but necessitating extended runs for accurate posterior approximation.
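The full loop can be sketched in a few lines of NumPy; the gradient callbacks, hyperparameter defaults, and the conjugate-Gaussian example below are illustrative assumptions rather than settings from the original paper.

```python
import numpy as np

def sgld(grad_log_prior, grad_log_lik_i, data, theta0, eps0=1e-3,
         t0=100.0, rho=0.55, batch_size=64, n_steps=10000, rng=None):
    """SGLD: theta <- theta + (eps_t/2) * g_tilde + sqrt(eps_t) * N(0, I), with
    g_tilde = grad log p(theta) + (N/n) * sum_i grad log p(x_i | theta) over a
    minibatch, and step size eps_t = eps0 / (t + t0)^rho decreasing in t."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    N, chain = len(data), []
    for t in range(n_steps):
        eps_t = eps0 / (t + t0) ** rho
        idx = rng.choice(N, size=batch_size, replace=False)
        # N/n scaling makes the minibatch gradient an unbiased estimate.
        g = grad_log_prior(theta) + (N / batch_size) * sum(
            grad_log_lik_i(theta, data[i]) for i in idx)
        theta = (theta + 0.5 * eps_t * g
                 + np.sqrt(eps_t) * rng.standard_normal(theta.size))
        chain.append(theta.copy())
    return np.array(chain)

# Toy posterior: mean of Gaussian data, N(0, 10^2) prior, N(theta, 1) likelihood.
data = np.random.default_rng(0).normal(2.0, 1.0, size=(1000, 1))
chain = sgld(lambda th: -th / 100.0, lambda th, x: x - th, data, np.zeros(1))
print(chain[2000:].mean())  # approx. 2.0, near the analytic posterior mean
```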

Continuous-Time Approximation

The discrete-time updates of stochastic gradient Langevin dynamics (SGLD) can be viewed as an Euler-Maruyama discretization of a continuous-time stochastic differential equation (SDE), where the step size \epsilon_k \to 0 and the continuous time t = \sum_{i=1}^k \epsilon_i evolves accordingly. Under suitable assumptions on decreasing step sizes satisfying \sum \epsilon_k = \infty and \sum \epsilon_k^2 < \infty, along with smoothness of the log-posterior, the SGLD process converges weakly to this limiting SDE. In the continuous-time limit, the dynamics follow the SDE d\theta_t = \frac{1}{2} \tilde{\nabla} \log p(\theta_t \mid D) \, dt + dW_t, where p(\theta \mid D) denotes the posterior distribution given data D, W_t is standard Brownian motion, and \tilde{\nabla} \log p(\theta_t \mid D) is an effective gradient that incorporates bias from the stochastic approximation alongside the true full-batch gradient \nabla \log p(\theta_t \mid D). The minibatch-induced noise in the discrete updates contributes an additional state-dependent diffusion term, effectively preconditioning the diffusion coefficient beyond the identity scaling from dW_t. This enhanced diffusion arises from the covariance of the stochastic gradient noise, which scales inversely with minibatch size and smooths the exploration of the parameter space. The probability density \pi(\theta, t) of the process satisfies the Fokker-Planck equation \frac{\partial}{\partial t} \pi(\theta, t) = -\nabla \cdot \left[ \frac{1}{2} \tilde{\nabla} \log p(\theta \mid D) \, \pi(\theta, t) \right] + \frac{1}{2} \Delta \pi(\theta, t), where \Delta is the Laplacian. In the stationary regime (t \to \infty), \pi(\theta, \infty) \approx p(\theta \mid D), though a persistent bias remains due to the stochastic gradient approximation and discretization effects, with the bias scaling as \Theta(1) independently of dataset size. This continuous-time perspective justifies the ergodicity of SGLD in the limit of vanishing step sizes, as the process mixes towards the (approximate) invariant posterior measure under dissipativity conditions on the log-posterior drift, ensuring exponential convergence in Wasserstein distance. The minibatch noise's role as preconditioning on the diffusion coefficient implies improved sampling efficiency in high dimensions by adapting the noise scale to the local geometry of the posterior. For numerical stability, the discrete SGLD steps approximate the SDE well when the step size \epsilon is sufficiently small and the potential (negative log-posterior) is smooth and strongly convex, minimizing discretization bias and ensuring bounded moments via Lyapunov functions; otherwise, larger steps can lead to instability or amplified bias in non-convex settings.

Applications

Bayesian Neural Networks

In Bayesian neural networks, the network weights \theta are modeled as random variables drawn from a prior distribution p(\theta), often an isotropic Gaussian, while the likelihood p(D|\theta) is defined through the forward pass of the neural network on the observed dataset D. The resulting posterior distribution p(\theta|D) \propto p(\theta) p(D|\theta) is analytically intractable owing to the high dimensionality of \theta, which can encompass millions of parameters in modern architectures. Stochastic gradient Langevin dynamics (SGLD) addresses this by generating a sequence of iterates \theta_t that approximate samples from the posterior p(\theta|D), leveraging mini-batch gradients to scale to large datasets. Predictive quantities, such as the expected output f(\theta) for a new input, can then be estimated via Monte Carlo integration: \int f(\theta) p(\theta|D) \, d\theta \approx \frac{1}{T} \sum_{t=1}^T f(\theta_t), where T is the number of samples collected after a burn-in period. This approach enables full Bayesian inference without restrictive assumptions on the model structure. SGLD offers key advantages for Bayesian neural networks, including scalability to deep architectures such as convolutional neural networks without relying on approximations like Laplace methods, and the flexibility to accommodate non-conjugate priors that capture complex weight dependencies. In the seminal 2011 work introducing SGLD, the method was demonstrated on logistic regression for classification tasks, achieving rapid convergence to posterior samples while mitigating overfitting through inherent regularization from the sampling process. Extensions in 2015 applied preconditioned variants of SGLD to convolutional networks for image classification on datasets like MNIST, yielding test errors as low as 0.45% and demonstrating superior predictive performance compared to standard stochastic gradient descent. Despite these benefits, applying SGLD to deep neural networks faces challenges from high variance in gradient estimates, exacerbated by the non-convex loss landscapes and pathological curvature in high-dimensional spaces. Effective mixing and convergence often necessitate long sampling chains of 10^4 to 10^6 iterations, increasing computational demands, though preconditioning techniques can mitigate this by adapting to local geometry.
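As a sketch of how the Monte Carlo integration above is carried out in practice, the following hypothetical helper averages network outputs over post-burn-in SGLD iterates; the forward function, burn-in length, and thinning interval are assumptions made for illustration.

```python
import numpy as np

def posterior_predictive(forward, samples, x_new, burn_in=1000, thin=10):
    """Monte Carlo estimate of E[f(theta, x*)] and Var[f(theta, x*)] under the
    posterior, using SGLD iterates after burn-in, thinned to reduce correlation."""
    kept = samples[burn_in::thin]                    # discard burn-in, thin chain
    preds = np.stack([forward(theta, x_new) for theta in kept])
    return preds.mean(axis=0), preds.var(axis=0)     # predictive mean, variance
```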

Uncertainty Quantification in Deep Learning

Stochastic gradient Langevin dynamics (SGLD) facilitates uncertainty quantification in deep learning by enabling sampling from the posterior distribution over model parameters, which captures epistemic uncertainty arising from limited data. This is achieved through the posterior predictive variance, given by \mathrm{Var}_{\theta \sim p(\theta \mid \mathcal{D})} [f(\theta, x^*)] = \mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})} [f(\theta, x^*)^2] - \left( \mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})} [f(\theta, x^*)] \right)^2, where \theta denotes the parameters, \mathcal{D} the training data, f(\theta, x^*) the model's output for a new input x^*, and expectations are approximated via Monte Carlo samples from SGLD iterates. The total predictive uncertainty combines this epistemic component with aleatoric uncertainty, which models irreducible noise in the data, often estimated by adding a heteroscedastic noise variance term to the predictive distribution. Key techniques for implementing uncertainty estimation with SGLD include approximations like Monte Carlo dropout, which serves as a scalable proxy for Bayesian inference and can be viewed as approximating the stochasticity in SGLD updates. Direct SGLD sampling, on the other hand, provides asymptotically exact posterior draws for regression and classification tasks, allowing for reliable variance computation without variational assumptions. In practical applications, SGLD-based uncertainty has proven valuable in medical imaging, where it quantifies uncertainty in diffeomorphic registration of 3D brain MRI scans, enabling identification of regions with high predictive variability due to imaging artifacts or anatomical borders. Similarly, in reinforcement learning, SGLD enhances policy robustness by incorporating parameter uncertainty into adversarial training, leading to safer decision-making in uncertain environments like robotic navigation. These approaches yield benefits such as predictive confidence intervals that guide risk-aware predictions in safety-critical systems and support active learning by selecting data points with elevated epistemic uncertainty for annotation, thereby improving model efficiency. Evaluation of SGLD uncertainty typically employs metrics like expected calibration error (ECE), which assesses how well predicted probabilities align with true accuracies across confidence bins, revealing miscalibration in overconfident models. Comparisons with deep ensembles indicate that SGLD can provide comparable calibration on benchmarks.
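Since ECE is the evaluation metric named above, a minimal sketch of its computation follows; the binning scheme and generic array inputs are assumptions for the example, not tied to any particular library.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then take the weighted average of
    |bin accuracy - bin mean confidence| across the bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Bin weight is the fraction of predictions falling in this bin.
            ece += mask.mean() * abs(correct[mask].mean()
                                     - confidences[mask].mean())
    return ece
```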

Variants and Extensions

Preconditioned SGLD

Preconditioned stochastic gradient Langevin dynamics (pSGLD) adapts the standard SGLD update by incorporating a positive-definite preconditioning matrix G_t to account for the local geometry of the posterior distribution. The modified discrete-time update rule is given by \theta_{t+1} = \theta_t + \frac{\varepsilon_t}{2} G_t^{-1} \tilde{\nabla} \log p(\theta_t \mid \mathcal{D}_t) + \sqrt{\varepsilon_t} G_t^{-1/2} \eta_t, where \tilde{\nabla} \log p(\theta_t \mid \mathcal{D}_t) is a noisy estimate of the gradient based on a mini-batch \mathcal{D}_t, \eta_t \sim \mathcal{N}(0, I), and G_t approximates the inverse of the Fisher information matrix or the diagonal of the Hessian at step t. This formulation recovers the base SGLD as a special case when G_t = I. The primary motivation for preconditioning arises from the anisotropic noise structure in high-dimensional posteriors, where parameter correlations—such as those between weights in convolutional neural network layers—lead to poor conditioning and slow mixing. By rescaling the gradient and noise terms via G_t^{-1} and G_t^{-1/2}, pSGLD accelerates exploration along directions of high curvature while reducing variance in flat directions, thereby improving sampling efficiency in ill-conditioned landscapes common in deep learning models. Key algorithms include the Riemannian variant proposed by Patterson and Teh (2013), which uses a diagonal G_t tailored to manifold geometries like the probability simplex (e.g., G(\theta) = \operatorname{diag}(\theta)^{-1} for multinomial parameters, approximating the Fisher information). More general adaptive schemes draw from optimization methods: RMSprop-preconditioned SGLD estimates G_t via an exponentially moving average of squared gradients, V_{t+1} = \alpha V_t + (1 - \alpha) g_t \odot g_t, with G_{t+1} = \operatorname{diag}(1 / (\lambda + \sqrt{V_{t+1}})) for regularization parameter \lambda > 0 and decay \alpha \approx 0.99. Adam-like extensions incorporate momentum in the preconditioner, adapting both first- and second-moment estimates to further stabilize updates in stochastic settings. Empirically, pSGLD demonstrates speedups on supervised classification tasks, achieving lower test errors (e.g., 14.85% on the a9a dataset) and higher effective sample sizes compared to vanilla SGLD, while reducing the effective dimensionality of the sampling space during training. In probabilistic topic models on large corpora like NIPS abstracts, it outperforms online variational methods, with perplexity scores around 1400–2200 approaching the quality of full-batch samplers. These gains extend to deep neural networks, where preconditioning mitigates pathological curvature and enhances posterior approximation. A notable limitation is the added computational overhead from estimating G_t, which incurs an O(d) cost per iteration for diagonal approximations in d-dimensional parameter spaces, though this is often offset by faster mixing.
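A single RMSprop-preconditioned update can be sketched as below, following the moving-average recipe just described; the function signature is an illustrative assumption, and the \Gamma(\theta) correction term of the full algorithm is omitted, as is common in practice.

```python
import numpy as np

def psgld_step(theta, v, grad_tilde, eps, alpha=0.99, lam=1e-5, rng=None):
    """One pSGLD step: V <- alpha*V + (1-alpha)*g*g, G = diag(1/(lam + sqrt(V))),
    then theta <- theta + (eps/2)*G*g + sqrt(eps)*G^{1/2}*N(0, I).
    v holds the running average of squared gradients (initialize to zeros);
    the Gamma(theta) curvature-correction term is omitted here."""
    rng = np.random.default_rng() if rng is None else rng
    v = alpha * v + (1.0 - alpha) * grad_tilde ** 2   # second-moment average
    G = 1.0 / (lam + np.sqrt(v))                      # diagonal preconditioner
    theta = (theta + 0.5 * eps * G * grad_tilde
             + np.sqrt(eps * G) * rng.standard_normal(theta.size))
    return theta, v
```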

Kinetic and Hamiltonian Variants

Kinetic stochastic gradient Langevin dynamics, often referred to as underdamped SGLD (uSGLD), extends the standard overdamped formulation by introducing a momentum (velocity) variable, resulting in second-order dynamics that better capture inertial effects for improved sampling efficiency. The continuous-time limit is described by the coupled equations d\theta = v \, dt, dv = -\nabla U(\theta) \, dt - \gamma v \, dt + \sqrt{2\gamma} \, dW, where \theta denotes the parameters (position), v is the auxiliary momentum variable (velocity), \gamma > 0 is the friction coefficient, U(\theta) is the negative log-posterior potential, and W is a standard Wiener process. In discrete-time implementations, the true gradient \nabla U(\theta) is replaced by a stochastic estimate computed from a mini-batch of data, enabling scalable inference on large datasets. A prominent Hamiltonian variant is stochastic gradient Hamiltonian Monte Carlo (SGHMC), introduced by Chen et al. in 2014, which discretizes the underdamped dynamics while incorporating a friction term to dampen the momentum and preserve the target posterior as the invariant distribution. The update rules for SGHMC are \theta_{t+1} = \theta_t + \epsilon M^{-1} v_t, v_{t+1} = v_t - \epsilon \tilde{\nabla} U(\theta_t) - \epsilon \gamma M^{-1} v_t + \mathcal{N}(0, 2\epsilon \gamma I), where \epsilon > 0 is the step size, M is a preconditioning mass matrix (often the identity), \tilde{\nabla} U(\theta_t) is the stochastic gradient estimate, and the noise term approximates the diffusion in the continuous limit. This formulation avoids the computationally expensive leapfrog integrator of classical HMC by using a simpler Euler-Maruyama scheme adapted for noisy gradients. These kinetic and Hamiltonian variants offer significant advantages over overdamped SGLD, particularly in multimodal posteriors, where the added momentum enables longer excursions and reduces inefficient random-walk behavior, leading to faster mixing and more effective exploration of complex parameter spaces. In applications, SGHMC has demonstrated faster convergence for sampling in deep Gaussian processes compared to variational inference alternatives, achieving the target effective sample size in approximately 1.6 times fewer iterations. Empirical evaluations also show superior performance in topic modeling tasks, such as Latent Dirichlet Allocation on large corpora like Wikipedia articles, where the momentum aids in navigating high-dimensional sparse posteriors. Discretization of the friction term in these methods can introduce bias, which is controlled through partial momentum refresh strategies that periodically resample a fraction of the momentum components from their Gaussian stationary distribution, balancing accuracy and computational cost without fully resetting the velocity at every step.
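The SGHMC update rules above translate directly into code; this sketch assumes an identity mass matrix M = I and illustrative parameter names.

```python
import numpy as np

def sghmc_step(theta, v, grad_u_tilde, eps, gamma=0.1, rng=None):
    """One SGHMC step with M = I: theta_{t+1} = theta_t + eps * v_t, and
    v_{t+1} = v_t - eps*grad_U(theta_t) - eps*gamma*v_t + N(0, 2*eps*gamma*I)."""
    rng = np.random.default_rng() if rng is None else rng
    g = grad_u_tilde(theta)                 # stochastic gradient at theta_t
    theta_next = theta + eps * v
    v_next = (v - eps * g - eps * gamma * v
              + np.sqrt(2.0 * eps * gamma) * rng.standard_normal(np.size(v)))
    return theta_next, v_next
```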

Recent Developments

In recent years, advancements in stochastic gradient Langevin dynamics (SGLD) have focused on enhancing its ability to navigate complex loss landscapes and improve sampling efficiency in high-dimensional settings. A notable development is Flatness-Aware SGLD (fSGLD), introduced in October 2025, which incorporates random weight perturbations with isotropic Gaussian noise to target flat minima for superior generalization. By optimizing a randomized-smoothing objective that implicitly regularizes via the trace of the Hessian, fSGLD couples the noise scale \sigma and inverse temperature \beta (as \sigma = \beta^{-(1+\eta)/4} for \eta > 0) to concentrate on globally flat minimizers in non-convex landscapes, avoiding direct Hessian computation. Experimental evaluations on noisy-label datasets like CIFAR-10N demonstrate accuracies up to 91.72%, outperforming baselines such as SGD and sharpness-aware methods while maintaining computational costs comparable to SGD. Another innovation is Gradient-Adjusted Underdamped Langevin (GAUL) dynamics, proposed in 2024, which extends underdamped Langevin processes by adding stochastic gradient corrections through primal-dual and Hessian-driven damping terms. Defined via an SDE of the form dX_t = -Q \nabla H(X_t) \, dt + \sqrt{2 \, \mathrm{sym}(Q)} \, dB_t, where the matrix Q incorporates a preconditioner C and parameters a, \gamma, GAUL accelerates mixing in non-convex distributions by achieving convergence rates dependent on the square root of the condition number \kappa, such as O(\sqrt{\kappa} \log(d/\delta)) iterations for Gaussian targets using Euler-Maruyama discretization. Numerical results on Bayesian regression and neural networks show GAUL surpassing standard overdamped and underdamped variants in sampling speed for non-Gaussian posteriors. For Bayesian neural networks (BNNs), Parameter-Expanded Stochastic Gradient Markov Chain Monte Carlo (PX-SGMCMC), accepted at ICLR 2025, addresses sampling inefficiencies by reparameterizing weight matrices W = PVQ to expand the parameter space with additional matrices P and Q of depths c and d. This augmentation increases parameter dimensionality without raising inference costs, as the matrices are reassembled after sampling, and it preconditions gradients to bound sample distances by \epsilon L^2 (c + d + 1) M^{(c + d)} (h + s) + \epsilon LC. In reported benchmarks, PX-SGMCMC yields lower expected risk (0.121) and negative log-likelihood (0.388) than SGHMC, enhancing exploration of BNN posteriors. Emerging trends include SGLD's integration with diffusion models for faster generative sampling, as seen in cyclical diffusion sampling methods from 2024 that embed SGLD trajectories to elucidate score-based model paths, reducing sampling steps while preserving distribution fidelity. Additionally, 2024 analyses in ESAIM provide contraction rate estimates for stochastic gradient kinetic Langevin integrators, achieving rates of O(m/M) under strong convexity assumptions on the potential U, where m and M denote its strong-convexity and gradient-Lipschitz constants. A 2025 study further establishes exponential convergence of SGLD in the lazy training regime, with rates \exp(-2 m \lambda^2 t) to the empirical risk minimizer, where \lambda^2 is the minimum eigenvalue of the neural tangent kernel, alongside high-probability bounds on parameter stability near initialization. Despite this progress, open challenges persist in GPU implementations for real-time SGLD sampling, particularly due to the computational overhead of stochastic perturbations in large-scale models, which can degrade performance by factors of 7x compared to optimized solvers in benchmarks.
These issues highlight the need for hardware-accelerated preconditioning and parallelizable integrators to enable practical deployment in dynamic environments.

Theoretical Properties

Convergence Guarantees

Stochastic gradient Langevin dynamics (SGLD) exhibits convergence to the target posterior distribution under assumptions of smoothness of the potential and a log-Sobolev inequality (LSI) on the target measure. These guarantees rely on bounded variance of the stochastic gradients and decreasing step sizes \varepsilon_t to ensure asymptotic consistency and control discretization errors. Bias in SGLD arises primarily from discretization of the continuous-time Langevin diffusion, which introduces an error of order O(\varepsilon). Mini-batching introduces fluctuations in the gradient estimates of order O(1/\sqrt{n}), where n is the batch size, affecting the overall accuracy. These effects can be mitigated using control variates to adjust the estimates and reduce variance, though the computational cost scales as O(N \log N) with the dataset size N. Note that SGLD with constant step sizes exhibits persistent bias and does not converge to the true posterior; decreasing step sizes are required for asymptotic consistency. Under strong convexity assumptions, SGLD achieves geometric ergodicity in the Wasserstein-1 distance to its invariant measure. This also holds for targets that are log-concave outside a compact set, ensuring geometric convergence to stationarity. Recent analyses in the lazy training regime for overparameterized models provide sharper bounds, demonstrating exponential convergence under conditions where parameters remain near initialization due to large network width. Mixing rates serve as a complementary metric to these guarantees, quantifying autocorrelation decay separately from the overall error to stationarity.

Mixing Rates and Bias Analysis

Mixing time in stochastic gradient Langevin dynamics (SGLD) refers to the number of iterations required for the generated Markov chain to come ε-close to the target posterior distribution in total variation distance. For smooth and strongly log-concave posteriors, non-asymptotic bounds have been established using Wasserstein distances. In non-convex settings relevant to deep learning, the mixing time can be bounded in terms of the smoothness constant L and initialization radius R, with exponential dependence e^{O(L R^2)} arising from the potential's growth, though variance-reduction techniques can mitigate this. These bounds highlight the curse of dimensionality in SGLD, where high dimension d exacerbates slow mixing. Autocorrelation in SGLD chains measures the dependence between successive samples, with the integrated autocorrelation time τ_int quantifying the number of steps needed to obtain effectively independent samples. In strongly convex posteriors, autocorrelation times are larger for ill-conditioned problems common in machine learning. In non-convex deep learning landscapes, τ_int is typically higher due to local minima and saddle points, resulting in correlated samples that reduce the effective sample size. Bias in SGLD arises from noisy minibatch gradients, which introduce discretization and estimation errors, preventing exact sampling from the posterior. Cyclic minibatching cycles through the dataset to reduce variance while preserving scalability. The Metropolis-adjusted Langevin algorithm (MALA), which augments Langevin proposals with accept-reject steps, reduces bias compared to unadjusted SGLD under smoothness assumptions. Empirically, in Bayesian neural networks (BNNs), the effective sample size (ESS) is estimated as ESS ≈ T / (2 τ_int), where T is the total number of iterations, providing a measure of the number of independent samples drawn. Monitoring mixing via trace plots of parameters or log-likelihoods reveals autocorrelation decay, with poor mixing indicated by slow convergence to stationarity in high-dimensional BNN parameter spaces. Recent advances include contraction rate analyses for variants of kinetic Langevin dynamics, enhancing efficiency for non-convex optimization.
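The ESS estimate described above can be computed from a scalar chain as in the following sketch, where the truncation rule (stopping at the first non-positive autocorrelation) is a common heuristic assumed for the example.

```python
import numpy as np

def effective_sample_size(chain, max_lag=200):
    """ESS ~ T / (2 * tau_int), with tau_int = 1/2 + sum_k rho_k estimated from
    empirical autocorrelations rho_k, truncated at the first non-positive term."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    T, var = len(x), x.var()
    tau_int = 0.5
    for k in range(1, min(max_lag, T - 1)):
        rho_k = np.dot(x[:-k], x[k:]) / ((T - k) * var)  # lag-k autocorrelation
        if rho_k <= 0.0:
            break
        tau_int += rho_k
    return T / (2.0 * tau_int)   # equals T for an uncorrelated chain
```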
