
Stochastic optimization

Stochastic optimization encompasses a broad class of mathematical and computational methods designed to minimize or maximize an objective function in the presence of randomness, such as noisy function evaluations, uncertain parameters, or stochastic constraints. Unlike deterministic optimization, which assumes exact knowledge of all problem data, stochastic optimization addresses real-world scenarios where uncertainty arises from factors like measurement errors, environmental variability, or incomplete data, making it essential for decision-making under incomplete information. These methods often rely on iterative procedures that use probabilistic models or random sampling to approximate solutions, balancing computational efficiency with convergence to optimal or near-optimal policies.

The field traces its roots to early 20th-century statistical techniques, such as maximum likelihood estimation introduced by Ronald Fisher in 1922, and gained momentum in the mid-20th century with foundational work on stochastic approximation by Herbert Robbins and Sutton Monro in 1951. By the 1960s and 1970s, concepts from biological evolution inspired algorithms like evolution strategies and early genetic algorithms, while operations research advanced multistage stochastic programming for sequential decisions. The 1980s and 1990s saw explosive growth with applications in machine learning and simulation-based optimization, culminating in influential frameworks like simultaneous perturbation stochastic approximation (SPSA) for high-dimensional problems. Recent decades have unified the area under broader models, such as Warren Powell's 2019 framework, which categorizes problems using state variables, decisions, and exogenous information to handle diverse uncertainties across disciplines.

Key approaches in stochastic optimization include stochastic approximation methods, which iteratively refine estimates using noisy gradients, as in stochastic gradient descent (SGD) prevalent in machine learning; sample average approximation (SAA), which replaces the expected objective with averages from random samples; and evolutionary algorithms like genetic algorithms that mimic natural selection for global search in non-convex landscapes. For multistage problems, techniques such as Markov decision processes and stochastic dynamic programming model sequential policies under uncertainty, often incorporating value function approximations to manage the curse of dimensionality. Critical challenges involve handling noise to avoid false optima, scaling to high dimensions, and ensuring convergence, with algorithms like SPSA offering efficiency by requiring only two measurements per iteration regardless of dimensionality.

Stochastic optimization finds applications across numerous fields, including finance for portfolio selection under market volatility, engineering for robust design in uncertain environments, supply chain management for inventory planning with demand fluctuations, and machine learning for training models on large, noisy datasets. In healthcare, it supports treatment planning amid patient variability, while in energy systems, chance-constrained programming ensures reliability in grids subject to renewable intermittency. Recent advances integrate these methods with reinforcement learning, enabling real-time decisions in robotics and autonomous systems, and hybrid approaches combining stochastic and robust optimization to address distributional shifts in dynamic settings.

Fundamentals

Definition and Scope

Stochastic optimization is a framework within mathematical optimization for addressing problems where randomness or uncertainty affects the objective function, constraints, or parameters, requiring decisions that perform well in expectation or with high probability. Unlike deterministic optimization, where all elements are known precisely, stochastic optimization relies on probabilistic models to handle inherent variability, such as noise in measurements or unpredictable future states. The foundational notation in optimization involves finding the minimizer x^* = \arg\min_{x \in \mathcal{X}} f(x), where \mathcal{X} denotes the feasible set and f: \mathcal{X} \to \mathbb{R} is the objective function. In the stochastic setting, the objective becomes f(x) = \mathbb{E}_{\xi} [F(x, \xi)], where \xi is a random variable drawn from a known distribution, and F(x, \xi) captures the scenario-dependent contribution; the goal is to compute \min_{x \in \mathcal{X}} f(x) without direct access to f(x), often using samples of F(x, \xi). This formulation underpins methods that iteratively approximate the optimum through noisy evaluations.

Common types of stochastic problems include those with noisy objectives, where function evaluations yield perturbed observations F(x, \xi) = f(x) + \epsilon(\xi) and \epsilon(\xi) represents additive noise, necessitating algorithms robust to such variability. Stochastic constraints introduce randomness into feasibility, as in chance-constrained problems where requirements hold with a specified probability, such as \mathbb{P}(g(x, \xi) \leq 0) \geq \alpha for confidence level \alpha \in (0,1). Additionally, two-stage recourse models capture sequential decisions under uncertainty: first-stage choices x_1 are made before \xi is realized, followed by second-stage recourse actions x_2(\xi) that adapt to the outcome, minimizing the expected total cost \min_{x_1} \mathbb{E}_{\xi} [c_1^T x_1 + Q(x_1, \xi)], where Q(x_1, \xi) = \min_{x_2} c_2^T x_2 subject to recourse constraints.

An illustrative example is estimating the mean of a standard normal random variable \xi \sim \mathcal{N}(0,1) by solving \min_x \mathbb{E}[(x - \xi)^2], which yields the optimal x^* = 0 as the mean, but in practice requires stochastic samples (x_k - \xi_k)^2 to approximate the gradient or update iteratively.
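As a minimal illustration of this example, the following sketch (assuming NumPy and an arbitrary Robbins–Monro-style step size a_k = 1/k, neither of which is prescribed by the formulation above) estimates the mean by descending along noisy gradients of (x - \xi)^2:

```python
import numpy as np

rng = np.random.default_rng(0)
x = 5.0                                  # arbitrary starting estimate
for k in range(1, 10001):
    xi = rng.standard_normal()           # one sample of xi ~ N(0, 1)
    grad_est = 2.0 * (x - xi)            # unbiased estimate of d/dx E[(x - xi)^2] = 2x
    x -= (1.0 / k) * grad_est            # diminishing step size a_k = 1/k
print(x)                                 # close to the true minimizer x* = 0
```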

Comparison to Deterministic Optimization

Stochastic optimization differs fundamentally from deterministic optimization in its treatment of function evaluations and objective landscapes. In deterministic optimization, the function and its derivatives, if available, are evaluated exactly at each point, allowing for precise computation of gradients or subgradients without noise. In contrast, stochastic optimization operates on problems where exact evaluations are infeasible or costly, relying instead on noisy, unbiased estimates of the objective or its gradients derived from random samples of the underlying distribution. This introduces randomness into the optimization process, modeling real-world uncertainties such as variable inputs or measurement noise, as opposed to the fixed parameters assumed in deterministic settings.

A primary advantage of stochastic approaches lies in their ability to handle uncertainty inherent in large-scale, high-dimensional problems, such as those in machine learning or process design, where deterministic methods may fail due to oversimplification of variability. By using approximate gradients from mini-batches or single samples, stochastic methods enable faster iterations with lower per-iteration computational cost—often O(1) compared to O(n) for full-batch deterministic evaluations—making them scalable for massive datasets. For instance, in convex problems, stochastic gradient methods can achieve ε-optimality in O(1/ε²) iterations, providing quicker progress in practice despite theoretical sublinear rates like O(1/√k). Additionally, stochastic optimization supports adaptive policies that take recourse as new information arrives, yielding more robust solutions under uncertainty, as quantified by metrics like the value of the stochastic solution (VSS), which measures improvement over deterministic approximations (e.g., VSS ≈ 3% in some process networks).

However, the introduction of randomness in stochastic optimization brings significant challenges, including high variance in gradient estimates that can lead to erratic progress and slower asymptotic convergence compared to the linear rates possible in deterministic methods for smooth, strongly convex problems. This variance necessitates techniques like iterate averaging or variance reduction to stabilize trajectories, and often requires multiple samples to approximate expected performance, increasing overall sample complexity—typically O(1/ε²) for non-accelerated stochastic methods versus O(log(1/ε)) iterations in deterministic cases. Moreover, stochastic approaches demand accurate modeling of probability distributions, which may not always be available, and can result in realized performance deviating from expected values due to the non-differentiable or discontinuous nature of sampled functions. In summary, while deterministic optimization excels in controlled, low-uncertainty environments with exact solutions, stochastic methods trade precision for practicality in uncertain, data-rich domains.

Historical Development

Origins in the Mid-20th Century

The emergence of stochastic optimization in the mid-20th century was driven by post-World War II advancements in statistics and control engineering, fueled by military and engineering demands during the early Cold War period. These fields required methods to handle uncertainty in dynamic systems, such as signal detection and feedback control mechanisms, where deterministic approaches proved inadequate for noisy or probabilistic environments. Researchers sought iterative techniques to approximate optimal solutions without full knowledge of underlying probability distributions, laying the groundwork for handling real-world variability in system design and operations.

A key foundational contribution came from Abraham Wald's work on statistical decision theory in the 1940s, which formalized decision-making under uncertainty through minimax criteria and sequential analysis. Wald's work on sequential probability ratio tests, culminating in his 1947 book Sequential Analysis, and his 1950 book Statistical Decision Functions provided a framework for evaluating risks in stochastic settings, influencing early optimization problems by emphasizing robust choices amid incomplete information. This theory bridged probability and optimization, enabling the treatment of uncertain parameters as integral to decision processes rather than mere perturbations.

The field advanced significantly with the 1951 paper by Herbert Robbins and Sutton Monro, which introduced the stochastic approximation method for solving root-finding problems in noisy environments. Titled "A Stochastic Approximation Method," their work proposed an iterative algorithm that converges to the desired solution using unbiased noisy observations, assuming decreasing step sizes and bounded variances—conditions that ensured convergence under mild assumptions. This seminal contribution established stochastic approximation as a cornerstone for optimization under uncertainty, directly applicable to parameter estimation in probabilistic models.

During the 1950s, stochastic approximation found early applications in adaptive systems, particularly for noise cancellation and adaptive filtering, paving the way for least-mean-squares (LMS) filtering techniques. These developments addressed real-time adjustment in control systems, where algorithms iteratively minimized error based on noisy observations, enhancing performance in signal-processing tasks. Concurrently, operations research saw the first formal formulations of stochastic programming problems around 1955, with George Dantzig and E.M.L. Beale independently proposing two-stage stochastic linear programs to model recourse actions under uncertain parameters, marking the integration of probability into mathematical programming frameworks.

Key Milestones Post-1950s

In the 1960s and 1970s, expansions of the Kiefer-Wolfowitz algorithm, originally introduced in 1952, addressed limitations in convergence rates and applicability to higher-dimensional problems, with theoretical advancements demonstrating strong convergence under relaxed assumptions on noise and step sizes. These developments facilitated broader use in noisy environments, such as simulation and control systems. In the late 1980s and early 1990s, derivative-free methods advanced with the introduction of simultaneous perturbation stochastic approximation (SPSA) by James Spall, requiring only two function measurements per iteration regardless of dimension.

The 1980s saw stochastic gradient methods gain prominence through their integration with neural networks, particularly via backpropagation as detailed by Rumelhart, Hinton, and Williams in 1986, which enabled efficient training of multilayer perceptrons using mini-batch stochastic updates to handle large datasets. Refinements in the 1990s, including momentum enhancements and early adaptive learning rates, addressed vanishing gradients and slow convergence in deep architectures, laying groundwork for later deep learning applications despite the computational constraints of the era.

From the 2000s to the 2020s, stochastic optimization experienced explosive growth driven by machine learning demands, with the Adam optimizer introduced in 2014 by Kingma and Ba combining adaptive moment estimates to achieve faster convergence and robustness in non-stationary objectives, rapidly becoming a standard for training deep networks. Variance reduction techniques, exemplified by the Stochastic Variance Reduced Gradient (SVRG) method proposed by Johnson and Zhang in 2013, mitigated the high variance in stochastic gradients for finite-sum problems, enabling linear convergence rates in convex settings and influencing subsequent algorithms. The advent of big data and GPU acceleration in the 2010s scaled these methods to massive datasets, allowing training of models with billions of parameters, while the 2020s emphasized federated learning paradigms to handle distributed, privacy-preserving optimization across devices.

Stochastic Approximation Methods

Robbins-Monro Algorithm

The Robbins–Monro algorithm, introduced in 1951, is a pioneering stochastic approximation method designed to solve root-finding problems where direct evaluation of the underlying function is impossible or costly due to noise in observations. It operates in a one-dimensional setting by iteratively refining an estimate x_n toward a root x^* satisfying E[Y_n \mid x_n = x^*] = x^*, using unbiased noisy measurements Y_n. The core update rule is given by x_{n+1} = x_n + a_n (Y_n - x_n), where a_n > 0 is a sequence of step sizes that decrease over iterations, and Y_n provides a noisy estimate of the function value at x_n. This form targets fixed points of the expected observation, with the term (Y_n - x_n) serving as an unbiased estimate of M(x_n) - x_n, the deviation from the root, assuming the noise \epsilon_n = Y_n - E[Y_n \mid x_n] has mean zero.

Convergence of x_n to x^* in probability is guaranteed by the Robbins–Monro theorem provided the step sizes satisfy \sum_{n=1}^\infty a_n = \infty and \sum_{n=1}^\infty a_n^2 < \infty, which balances exploration (infinite total step) with stability (finite variance accumulation). Common choices include a_n = c/n for constant c > 0, ensuring these conditions hold while promoting asymptotic efficiency. Key assumptions include unbiased noise, i.e., E[Y_n \mid x_n] = M(x_n) where M(x^*) = x^*, and bounded variance of the noise, often formalized as \Pr[|Y_n| \leq C \mid x_n] = 1 for some constant C, alongside conditions on M such as monotonicity (M(x) > x for x < x^* and M(x) < x for x > x^*) to ensure uniqueness and approachability of the root.

A representative application is the estimation of the autoregressive coefficient \phi in an AR(1) process X_t = \phi X_{t-1} + \epsilon_t, where \epsilon_t is white noise; here, the algorithm recursively solves the normal equation E[X_t X_{t-1}] = \phi E[X_{t-1}^2] by treating lagged observations as noisy inputs and updating \phi_n via the stochastic approximation rule to achieve consistent on-line estimation without storing full data history.
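A minimal sketch of this AR(1) example, assuming NumPy, a simulated true coefficient of 0.6, and a gain a_n = 1/n; the residual-driven update below is one standard way to instantiate the stochastic approximation rule, not the only one:

```python
import numpy as np

rng = np.random.default_rng(1)
phi_true, phi_est, x_prev = 0.6, 0.0, 0.0
for n in range(1, 50001):
    x = phi_true * x_prev + rng.standard_normal()     # next observation of the AR(1) stream
    a_n = 1.0 / n                                      # decreasing gain
    phi_est += a_n * x_prev * (x - phi_est * x_prev)   # stochastic update from the prediction residual
    x_prev = x
print(phi_est)                                         # approaches the true coefficient 0.6
```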

Kiefer-Wolfowitz Procedure

The Kiefer-Wolfowitz procedure is a stochastic approximation method for locating the maximum (or minimum, by sign adjustment) of a regression function using finite-difference estimates of the gradient, applicable to unconstrained optimization problems where only noisy function evaluations are available. Introduced in 1952, it extends the Robbins-Monro algorithm from root-finding to general optimization by approximating the gradient without direct access to derivatives. This approach is particularly useful when the objective function is observed with additive noise, such as in simulation-based or experimental settings.

The algorithm operates iteratively as follows. Start with an initial point x_1 \in \mathbb{R}^d. At step n, for each dimension i = 1, \dots, d, evaluate the noisy function at perturbed points: Y_{n,i}^+ = M(x_n + c_n e_i) and Y_{n,i}^- = M(x_n - c_n e_i), where M denotes the stochastic oracle providing noisy evaluations of the objective f, e_i is the i-th standard basis vector, and c_n > 0 is the perturbation size. The i-th component of the gradient estimate is then \hat{g}_{n,i} = \frac{Y_{n,i}^+ - Y_{n,i}^-}{2 c_n}. The full gradient estimate is \hat{g}_n = (\hat{g}_{n,1}, \dots, \hat{g}_{n,d})^\top, and the update rule is x_{n+1} = x_n + a_n \hat{g}_n, where a_n > 0 is the step size. This requires 2d function evaluations per iteration for the two-sided difference, though one-sided variants exist that reduce this to d+1.

Under assumptions such as the objective f being differentiable with a unique maximum, bounded second derivatives, and the noise having finite variance independent of the parameter, the procedure converges almost surely to the optimum. For mean squared error convergence, suitable sequences are a_n = a / n with a > 0 and c_n = c / n^{1/6} with c > 0 small enough, yielding an optimal rate of \mathbb{E}[\|x_n - x^*\|^2] = O(n^{-2/3}), or \|x_n - x^*\| = O_p(n^{-1/3}). These rates hold provided \sum a_n = \infty, \sum a_n^2 < \infty, c_n \to 0, and c_n / a_n \to 0, ensuring the bias from finite differences vanishes while controlling variance.

Key limitations arise from the finite-difference approximation: the gradient estimates exhibit high variance, especially for small c_n, which can lead to erratic updates and slower practical convergence compared to methods with direct gradient access. Additionally, the computational cost scales as O(d) function evaluations per iteration in d-dimensional problems, making it inefficient for high-dimensional optimization without modifications like simultaneous perturbations. The need to tune both a_n and c_n sequences adds further implementation complexity.

A representative example is the maximization of the quadratic regression function f(x) = -x^2 (maximum at x^* = 0) with additive independent standard normal noise in each evaluation. Starting from x_1 = 1, the procedure generates noisy evaluations at x_n \pm c_n, estimates the gradient (approximately -2x_n in expectation), and updates towards zero; simulations show the mean squared error decreasing at the O(n^{-2/3}) rate after sufficient iterations.
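The quadratic example above can be reproduced with a short simulation; the following sketch assumes NumPy and the step-size choices a_n = 1/n and c_n = n^{-1/6} quoted earlier, with all other constants illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
def noisy_f(x):                              # noisy evaluation of f(x) = -x^2
    return -x**2 + rng.standard_normal()

x = 1.0                                      # starting point x_1 = 1 from the example
for n in range(1, 20001):
    a_n, c_n = 1.0 / n, 1.0 / n**(1 / 6)
    g_hat = (noisy_f(x + c_n) - noisy_f(x - c_n)) / (2 * c_n)  # two-sided finite-difference slope
    x += a_n * g_hat                         # ascent step toward the maximum at x* = 0
print(x)                                     # close to 0
```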

Gradient-Based Stochastic Methods

Stochastic Gradient Descent

Stochastic gradient descent (SGD) is a fundamental iterative algorithm for minimizing objective functions in large-scale optimization problems, particularly those expressible as expectations over data distributions. It approximates the true gradient by using noisy estimates derived from individual data points or small subsets, enabling efficient updates in scenarios where computing the full gradient is computationally prohibitive. This approach stems from the broader framework of stochastic approximation, where the update rule leverages unbiased estimates of the gradient to drive convergence toward a minimizer.

The core formulation of SGD involves iteratively updating the parameter vector \mathbf{x} according to the rule \mathbf{x}_{k+1} = \mathbf{x}_k - \eta_k \mathbf{g}_k, where \eta_k > 0 is the step size at iteration k, and \mathbf{g}_k is an unbiased stochastic estimate of the true gradient \nabla f(\mathbf{x}_k) of the objective function f. For empirical risk minimization problems common in machine learning, where f(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^n \ell(\mathbf{x}; \mathbf{z}_i) with data points \{\mathbf{z}_i\}_{i=1}^n, the estimate \mathbf{g}_k is typically the gradient of the loss \ell(\mathbf{x}_k; \mathbf{z}_{i_k}) for a randomly sampled \mathbf{z}_{i_k}, ensuring \mathbb{E}[\mathbf{g}_k] = \nabla f(\mathbf{x}_k). This stochasticity introduces noise but allows for scalable computation, as each update requires only a constant-time gradient evaluation relative to the dataset size.

A key aspect of SGD lies in the bias-variance trade-off inherent to the choice of gradient estimator. When using a single data point (batch size 1), the estimator exhibits high variance due to the randomness in sampling, leading to erratic updates that can escape shallow local minima but may oscillate around the optimum. In contrast, mini-batch gradients, computed over a small subset of b points, reduce variance proportionally to 1/b while remaining unbiased, providing smoother paths at the cost of increased per-iteration computation. This balances computational efficiency with gradient accuracy, with mini-batch sizes often chosen empirically between 1 and a few hundred to optimize wall-clock training time in practice. Larger batches approximate batch gradient descent more closely but diminish the exploratory benefits of stochasticity.

Under suitable assumptions, such as convexity and Lipschitz continuity of the objective, SGD exhibits sublinear convergence guarantees. For convex functions, the expected optimality gap satisfies \mathbb{E}[f(\bar{\mathbf{x}}_T) - f(\mathbf{x}^*)] = O(1/\sqrt{T}) after T iterations, where \bar{\mathbf{x}}_T is an appropriately averaged iterate, matching the information-theoretic lower bound for stochastic first-order methods. This rate holds when step sizes diminish appropriately, such as \eta_k = O(1/\sqrt{k}), and assumes bounded variance in the stochastic gradients. For non-convex but smooth objectives, weaker guarantees ensure convergence to stationary points, though the convex case underscores SGD's minimax optimality in stochastic settings.

Practical implementation of SGD requires careful scheduling of the step size \{\eta_k\}. Diminishing step sizes, satisfying conditions like \sum_k \eta_k = \infty and \sum_k \eta_k^2 < \infty, promote asymptotic convergence by allowing initial large steps for exploration followed by finer adjustments. Constant step sizes \eta_k = \eta simplify tuning and enable perpetual learning in online settings but may lead to persistent oscillation around the optimum, with the achievable error scaling as O(\eta) balanced against variance terms.
In large-scale applications, adaptive schedules like linear decay or cyclic annealing are often used to navigate this choice empirically.
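A minimal mini-batch SGD sketch on a synthetic least-squares instance of empirical risk minimization; the batch size of 32, the step size \eta_k = 0.1/\sqrt{k}, and the problem dimensions are illustrative assumptions rather than recommended settings:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, batch = 1000, 5, 32
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y = A @ x_true + 0.1 * rng.standard_normal(n)        # noisy linear observations

x = np.zeros(d)
for k in range(1, 5001):
    idx = rng.integers(0, n, size=batch)             # sample a mini-batch of indices
    g = A[idx].T @ (A[idx] @ x - y[idx]) / batch     # unbiased estimate of the full gradient
    x -= (0.1 / np.sqrt(k)) * g                      # diminishing step size eta_k = O(1/sqrt(k))
print(np.linalg.norm(x - x_true))                    # small residual error
```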

Momentum and Adaptive Variants

To address the limitations of vanilla stochastic gradient descent (SGD), which can exhibit slow convergence in the presence of noisy gradients or ill-conditioned landscapes, momentum methods introduce a velocity term that accumulates past gradients, effectively damping oscillations and accelerating progress along consistent directions. In momentum SGD, the update rule incorporates a momentum parameter β (typically 0.9) to form a velocity vector:

\mathbf{v}_{k+1} = \beta \mathbf{v}_k + (1 - \beta) \mathbf{g}_k, \qquad \mathbf{x}_{k+1} = \mathbf{x}_k - \eta \mathbf{v}_{k+1},

where \mathbf{g}_k is the stochastic gradient at iteration k, η is the learning rate, and \mathbf{v}_k acts as an exponentially weighted moving average of past gradients. This formulation helps traverse ravines more efficiently by building inertia, reducing the number of iterations needed for convergence in non-convex settings.

Building on momentum, adaptive variants like Adam (Adaptive Moment Estimation) further enhance performance by incorporating per-parameter learning rates that adapt based on the historical magnitudes of gradients, making them particularly effective for sparse or high-dimensional data common in deep learning. Introduced by Kingma and Ba in 2014, Adam maintains two moving averages: the first moment estimate \mathbf{m}_k (similar to momentum) and the second moment estimate \mathbf{v}_k (uncentered variance, akin to RMSProp). These are computed as

\mathbf{m}_k = \beta_1 \mathbf{m}_{k-1} + (1 - \beta_1) \mathbf{g}_k, \qquad \mathbf{v}_k = \beta_2 \mathbf{v}_{k-1} + (1 - \beta_2) \mathbf{g}_k^2,

with bias corrections \hat{\mathbf{m}}_k = \mathbf{m}_k / (1 - \beta_1^k) and \hat{\mathbf{v}}_k = \mathbf{v}_k / (1 - \beta_2^k) to account for initialization bias, where β₁ ≈ 0.9 and β₂ ≈ 0.999. The parameter update then becomes \mathbf{x}_{k+1} = \mathbf{x}_k - \eta \hat{\mathbf{m}}_k / (\sqrt{\hat{\mathbf{v}}_k} + \epsilon), with ε as a small constant for numerical stability.

The per-parameter adaptation in Adam scales the effective learning rate element-wise as \eta / \sqrt{\hat{v}_{k,i}} for the i-th component, allowing larger effective steps for parameters with infrequent or small-magnitude gradients and smaller steps for those with consistently large gradients, which stabilizes training in ill-conditioned problems. This element-wise scaling draws from RMSProp's root-mean-square normalization but integrates it with momentum for smoother trajectories. In deep learning benchmarks, such as training convolutional neural networks on CIFAR-10, Adam has shown faster initial convergence compared to SGD with Nesterov momentum and AdaGrad.
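The Adam update can be condensed into a few lines; the sketch below assumes NumPy and applies the rules above, with the quoted default hyperparameters, to a toy noisy quadratic rather than a real training task:

```python
import numpy as np

def adam_step(x, g, m, v, k, eta=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters x given stochastic gradient g at 1-based iteration k."""
    m = beta1 * m + (1 - beta1) * g            # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g**2         # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**k)                 # bias corrections for the zero initialization
    v_hat = v / (1 - beta2**k)
    return x - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

rng = np.random.default_rng(4)
x, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for k in range(1, 1001):
    g = 2 * x + 0.1 * rng.standard_normal(3)   # noisy gradient of the toy objective ||x||^2
    x, m, v = adam_step(x, g, m, v, k)
print(x)                                       # close to the minimizer at the origin
```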

Derivative-Free and Randomized Methods

Finite-Difference Stochastic Approximations

Finite-difference stochastic approximations represent a class of derivative-free methods in stochastic optimization that estimate gradients through function evaluations at perturbed points, enabling optimization in black-box settings where analytical derivatives are unavailable or impractical. These techniques build upon early ideas but incorporate randomization and efficiency improvements to handle high-dimensional problems and noisy evaluations. Unlike coordinate-wise approaches, modern variants use perturbations along random directions to reduce the number of required measurements while maintaining gradient approximation quality.

A foundational approach is the two-point finite-difference estimator, which approximates the gradient in a random direction \mathbf{u} (typically drawn from a symmetric distribution on the unit sphere) as \hat{\mathbf{g}} \approx \frac{f(\mathbf{x} + c \mathbf{u}) - f(\mathbf{x} - c \mathbf{u})}{2c} \mathbf{u}, where c > 0 is a small perturbation size and f is the noisy function. This requires only two function evaluations per iteration, regardless of dimension, and provides a nearly unbiased gradient estimate under suitable noise and smoothness assumptions, making it suitable for noisy environments like simulations with inherent variability. It generalizes the classic Kiefer-Wolfowitz procedure by randomizing the perturbation direction to avoid curse-of-dimensionality issues in high dimensions.

The simultaneous perturbation stochastic approximation (SPSA) further enhances efficiency by estimating the full d-dimensional gradient using just two function measurements, regardless of d, through simultaneous perturbations across all coordinates. In SPSA, a random vector \Delta_k with independent components (often \pm 1) is generated, and the i-th component of the gradient estimate is approximated as \hat{g}_{k,i}(\mathbf{x}_k) = \frac{f(\mathbf{x}_k + c_k \Delta_k) - f(\mathbf{x}_k - c_k \Delta_k)}{2 c_k \Delta_{k,i}}; the update then proceeds via stochastic gradient descent as \mathbf{x}_{k+1} = \mathbf{x}_k - a_k \hat{\mathbf{g}}_k(\mathbf{x}_k), with step sizes a_k and perturbation sizes c_k decreasing appropriately. This O(1) evaluation complexity per iteration significantly outperforms traditional finite differences, which scale with dimension, and has been shown to achieve mean-squared error comparable to coordinate-wise finite-difference methods under weak conditions on the objective. The method was introduced by Spall in 1992 and has become widely adopted for its robustness to noise.

Efficiency in these approximations hinges on balancing bias and variance through the choice of perturbation sizes c_n, which decrease over iterations as c_n \propto n^{-\gamma} with 0 < \gamma < 1. The bias arises from the finite-difference approximation, scaling as O(c_n^2) for smooth objectives, while the variance stems from both the random perturbation directions and the noise inherent in f, with a dominant term of order \sigma^2 / c_n^2, where \sigma^2 is the noise variance. Choosing c_n to minimize the mean-squared error of the gradient estimate leads to rates of O(n^{-1/3}) for two-point methods and similar for SPSA under standard assumptions, with near-optimality established for central finite differences in certain function classes. These bounds ensure reliable performance even in high-noise regimes, though careful tuning is required to avoid excessive variance from small c_n.

An illustrative application is in control engineering, where SPSA optimizes controller parameters amid noise from uncertain dynamics or measurement errors. For instance, in stochastic source-seeking tasks for mobile robots navigating scalar signal fields, SPSA estimates gradients of a noisy signal strength using perturbed trajectories, converging to the source location with fewer evaluations than dimension-dependent methods despite environmental stochasticity. This demonstrates SPSA's practical value in real-time, black-box settings like robotics.
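A compact SPSA sketch on a noisy quadratic, assuming NumPy, ±1 perturbations, and the commonly suggested decay exponents of roughly 0.602 and 0.101 for a_k and c_k; the gain constants and the test problem are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
def noisy_loss(x):                            # black-box objective ||x||^2 with additive noise
    return np.sum(x**2) + 0.01 * rng.standard_normal()

d = 10
x = np.ones(d)
for k in range(1, 5001):
    a_k = 0.1 / k**0.602                      # step-size decay
    c_k = 0.1 / k**0.101                      # perturbation-size decay
    delta = rng.choice([-1.0, 1.0], size=d)   # simultaneous +/-1 perturbation of every coordinate
    y_plus = noisy_loss(x + c_k * delta)
    y_minus = noisy_loss(x - c_k * delta)
    g_hat = (y_plus - y_minus) / (2 * c_k * delta)   # two measurements give all d components
    x -= a_k * g_hat
print(np.linalg.norm(x))                      # near 0, the minimizer
```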

Evolutionary Algorithms

Evolutionary algorithms represent a class of population-based, derivative-free techniques inspired by biological evolution, particularly suited for optimization problems where the objective function is noisy or subject to random variations. These methods maintain a set of candidate solutions, or individuals, and iteratively evolve them through processes mimicking selection, recombination, and mutation to explore the search space globally. Unlike gradient-based approaches, they do not require differentiability or smoothness, making them effective for complex, multimodal landscapes common in stochastic settings.

Genetic algorithms, a foundational type of evolutionary algorithm, operate on a population of encoded solutions represented as strings, such as binary or real-valued vectors. The process begins with an initial random population, followed by evaluation of each individual's fitness using the stochastic objective function. Selection favors higher-fitness individuals probabilistically, often via methods like roulette wheel or tournament selection, to form a mating pool. Crossover recombines pairs of selected individuals to produce offspring, while mutation introduces small random changes to maintain diversity. This cycle repeats over generations, with the population updated based on fitness rankings. John H. Holland introduced these mechanisms in his seminal work, emphasizing adaptation through genetic operators to solve optimization problems efficiently.

Evolution strategies, another key variant, focus on real-parameter optimization and incorporate self-adaptation to dynamically adjust search parameters. In the (μ + λ)-evolution strategy, a parent population of μ individuals generates λ offspring through mutation, typically by adding normally distributed random vectors scaled by a strategy parameter σ. The μ best individuals, drawn from the offspring together with the parents in the "+" variant, are selected for the next generation. Self-adaptation evolves the mutation strengths σ alongside the object variables by treating them as additional parameters subject to mutation and selection, enabling the algorithm to respond to the problem's scale and difficulty. This approach originated in the work of Ingo Rechenberg and Hans-Paul Schwefel in the early 1970s, with self-adaptation formalized to enhance performance on continuous functions.

A prominent advancement in evolution strategies is the covariance matrix adaptation evolution strategy (CMA-ES), which adapts not only mutation strengths but also the full covariance matrix of the multivariate normal distribution used for sampling offspring. This allows the algorithm to capture correlations in the search space, rotating and scaling the mutation ellipsoid to align with the objective function's contours. The update rule ranks individuals by fitness and adjusts the covariance matrix using a combination of rank-μ and rank-one updates, promoting efficient exploration in correlated, stochastic environments. Nikolaus Hansen developed CMA-ES, with a comprehensive review in 2006 highlighting its empirical superiority on benchmark functions with noise.

In stochastic optimization, evolutionary algorithms address noise in fitness evaluations by performing multiple evaluations per individual to estimate a more reliable average fitness, often using techniques like resampling where the same solution is re-evaluated across independent realizations. This reduces variance in selection decisions but increases computational cost, so strategies like dynamic resampling allocate more evaluations to promising individuals based on observed variance. Such methods ensure robustness when the objective involves inherent randomness, such as simulation-based models.

The strengths of evolutionary algorithms in stochastic optimization lie in their population-level diversity, which enables global search and escape from local optima in rugged landscapes, and their derivative-free nature, which handles non-smooth or discontinuous functions without sensitivity to small perturbations from noise. These properties make them particularly robust to noise and non-smoothness, as the parallel evaluation of multiple solutions mitigates the impact of stochastic fluctuations and promotes convergence to high-quality global solutions over time.
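The resampling idea for noisy fitness evaluations can be sketched with a bare-bones (μ + λ) evolution strategy; the code below assumes NumPy, uses a simple deterministic step-size decay in place of full self-adaptation, and all population sizes and noise levels are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
def avg_fitness(x, resamples=5):
    # Average several noisy evaluations of f(x) = -||x||^2 to stabilize selection
    return np.mean([-np.sum(x**2) + 0.1 * rng.standard_normal() for _ in range(resamples)])

mu, lam, d, sigma = 5, 20, 3, 0.5
pop = [rng.standard_normal(d) for _ in range(mu)]     # initial parent population
for gen in range(200):
    offspring = [pop[rng.integers(mu)] + sigma * rng.standard_normal(d) for _ in range(lam)]
    union = pop + offspring                           # (mu + lambda) selection pool
    union.sort(key=avg_fitness, reverse=True)         # rank by resampled average fitness
    pop = union[:mu]
    sigma *= 0.98                                     # crude step-size decay instead of self-adaptation
print(np.linalg.norm(pop[0]))                         # small, i.e., near the maximizer at the origin
```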

Applications

Machine Learning and AI

Stochastic optimization plays a central role in machine learning and artificial intelligence by enabling the training of models through empirical risk minimization (ERM), where the goal is to minimize the average loss over a finite dataset of samples. In this framework, stochastic gradient descent (SGD) approximates the true gradient of the expected risk by using gradients computed on random subsets of data, allowing efficient optimization of non-convex objectives in high-dimensional spaces. This approach is particularly suited to large-scale datasets, as it provides unbiased estimates of the gradient while reducing computational costs compared to full-batch methods.

In deep neural networks, stochastic optimization is integral to training by backpropagation, where mini-batch SGD updates model parameters by propagating errors backward through the network layers using gradients from small data batches. This technique has been pivotal in training convolutional neural networks for image classification, as demonstrated in the AlexNet model, which employed mini-batch SGD with a batch size of 128 to achieve a top-5 error rate of 15.3% on the ImageNet dataset. Similarly, in reinforcement learning, stochastic optimization underpins policy gradient methods, such as the REINFORCE algorithm, which estimates gradients of the expected reward with respect to policy parameters using Monte Carlo sampling of trajectories, enabling learning in environments with stochastic dynamics.

A notable case study is the training of vision transformer (ViT) models on ImageNet using the AdamW optimizer, an adaptive variant of SGD that adjusts learning rates per parameter based on gradient moments. In the DeiT framework, AdamW—a decoupled weight decay version of Adam—was used to train a base ViT model, attaining 81.8% top-1 accuracy on ImageNet-1K with 86M parameters, marking a state-of-the-art result for data-efficient transformer-based architectures at the time. This success highlights how adaptive stochastic methods accelerate convergence and improve generalization in vision tasks by handling noisy gradients effectively.

For scalability in large-scale pipelines, distributed variants of SGD, such as Hogwild!, enable parallel training across multiple processors without locks, allowing asynchronous updates to shared parameters in sparse models like matrix factorization. Hogwild! has been shown to outperform lock-based alternatives by up to an order of magnitude in speed while maintaining convergence guarantees under certain conditions, facilitating the training of massive models on multi-core systems.

Finance and Risk Management

In finance, stochastic optimization plays a pivotal role in portfolio selection by addressing the uncertainty inherent in asset returns, extending the classic Markowitz mean-variance framework to incorporate stochastic elements such as regime switching or scenario-based projections. The mean-variance model, originally deterministic, is adapted to stochastic returns by modeling expected returns and covariances as random processes, allowing for dynamic adjustments that minimize risk while maximizing expected return under uncertainty. Scenario-based methods further enhance this by generating multiple plausible future market paths from historical data or Monte Carlo simulations, enabling allocations that hedge against extreme outcomes without assuming perfect distributional knowledge.

Stochastic programming formulations are widely applied in financial decision-making, particularly through two-stage models for portfolio planning that account for initial decisions followed by recourse actions under uncertainty. In these models, the objective is to minimize the first-stage cost c^\top x + \mathbb{E}[Q(x, \xi)], where x represents the initial allocation vector, c the associated costs, \xi the random scenario, and Q(x, \xi) the optimal recourse value function for adapting the portfolio once the uncertainty is revealed. Such approaches are instrumental in balancing immediate investment choices with flexible adjustments to liabilities or returns, as demonstrated in bond portfolio optimizations where scenarios simulate interest-rate fluctuations.

Risk assessment in finance leverages stochastic optimization to handle tail risks via measures like Conditional Value-at-Risk (CVaR), which quantifies expected losses beyond a given quantile and is optimized using stochastic approximations for tractability with high-dimensional portfolios. CVaR minimization formulations transform the problem into a linear program solvable via sample average approximations or gradient-based methods, providing coherent risk metrics superior to Value-at-Risk for hedging. These techniques are particularly effective in volatile markets, where stochastic approximations iteratively estimate CVaR from simulated loss distributions to guide allocation toward lower-tail stability.

Option pricing under market incompleteness also benefits from stochastic optimization, framing the problem as minimizing a superhedging cost or expected hedging error over admissible strategies in the presence of unhedgeable risks. By solving stochastic control problems via approximations like least-squares Monte Carlo, these methods yield bounds or approximations for prices in models with jumps or stochastic volatility, outperforming traditional PDE solutions in multi-asset settings.

In real-time trading systems, stochastic gradient descent (SGD) facilitates high-frequency decision-making by updating trading parameters on streaming market data, enabling adaptive strategies that respond to microstructure noise and order-book dynamics. For instance, SGD-based online learning optimizes execution algorithms to minimize slippage in limit order books, achieving sub-millisecond adjustments that improve profitability in liquid markets like equities.
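The CVaR minimization described above can be illustrated with a small sample-average sketch based on the Rockafellar–Uryasev reformulation, here with an exact inner minimization over the threshold (the empirical VaR) and a projected subgradient step on long-only weights; the Gaussian return scenarios and all parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

def project_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1} (long-only, fully invested)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

# Simulated Gaussian return scenarios for 4 assets (rows = scenarios)
N, d, alpha = 5000, 4, 0.95
mean = np.array([0.02, 0.05, 0.07, 0.10])
vol = np.array([0.02, 0.06, 0.10, 0.20])
R = mean + vol * rng.standard_normal((N, d))

w = np.full(d, 1.0 / d)                               # start from equal weights
for k in range(1, 2001):
    losses = -(R @ w)                                 # scenario losses L_i(w) = -r_i . w
    t = np.quantile(losses, alpha)                    # empirical VaR: optimal threshold for fixed w
    tail = losses >= t                                # scenarios contributing to the CVaR tail
    grad_w = -R[tail].sum(axis=0) / ((1 - alpha) * N) # subgradient of the sample-average CVaR in w
    w = project_simplex(w - (0.5 / np.sqrt(k)) * grad_w)

losses = -(R @ w)
t = np.quantile(losses, alpha)
cvar = t + np.maximum(losses - t, 0).mean() / (1 - alpha)
print(w, cvar)                                        # low-tail-risk weights and their sample CVaR
```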

Challenges and Extensions

Convergence and Stability Analysis

Convergence analysis in stochastic optimization establishes conditions under which algorithms, such as stochastic gradient descent (SGD), approach optimal solutions despite noisy gradient estimates. A foundational result for almost-sure convergence relies on martingale arguments, particularly through the Robbins-Siegmund lemma, which applies to non-negative almost supermartingales arising in stochastic approximation procedures. This lemma ensures that, under suitable step-size conditions and bounded variance, the sequence of function values or parameter norms converges almost surely to a limit. For mean-squared error (MSE) bounds, non-asymptotic analyses provide explicit rates; for instance, in strongly convex settings with Lipschitz gradients, SGD achieves an MSE of O(1/T) after T iterations, assuming unbiased gradients with bounded variance.

Specific to SGD, the Polyak-Juditsky theorem demonstrates ergodic convergence for averaged iterates in convex problems. Under assumptions of convexity, Lipschitz continuity of gradients, and decreasing step sizes satisfying \sum \gamma_t = \infty and \sum \gamma_t^2 < \infty, the averaged parameters \bar{\theta}_T satisfy \mathbb{E}[f(\bar{\theta}_T)] - f^* = O(1/\sqrt{T}), where f^* is the optimal value. This result extends the classical Robbins-Monro framework, which guarantees almost-sure convergence to the sought roots under similar conditions. Strong convexity further improves rates to O(1/T) for the averaged iterates.

Stability analysis addresses robustness to perturbations, including non-stationary noise, using Lyapunov functions for stochastic systems. For perturbed gradient systems, a Lyapunov function V(\theta) satisfying \mathbb{E}[V(\theta_{t+1}) \mid \theta_t] \leq V(\theta_t) - \gamma_t \|\nabla f(\theta_t)\|^2 + B_t, where B_t bounds the noise variance, ensures bounded iterates and convergence under smoothness and strong convexity assumptions. In non-stationary settings, such as adaptive data sampling, drift conditions like \mathbb{E}[V(Z_{t+1}) \mid Z_t] \leq \rho V(Z_t) + K (with \rho < 1) quantify mixing times and yield stability guarantees, leading to rates like O(\tau (\log T)^2 / T) for strongly convex objectives, where \tau is the mixing time. These assumptions—Lipschitz-continuous gradients (\|\nabla f(\theta) - \nabla f(\theta')\| \leq L \|\theta - \theta'\|) and strong convexity (f(\theta') \geq f(\theta) + \nabla f(\theta)^\top (\theta' - \theta) + \frac{\mu}{2} \|\theta' - \theta\|^2)—underpin most theoretical guarantees.
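The O(1/T) mean-squared-error rate for strongly convex objectives can be checked empirically on a one-dimensional quadratic; the sketch below assumes NumPy, unit strong convexity, unit noise variance, and the step size \gamma_t = 1/t:

```python
import numpy as np

rng = np.random.default_rng(8)
def run_sgd(T):
    # f(x) = 0.5 x^2 (mu = L = 1); stochastic gradient g_t = x_t + unit-variance noise
    x = 5.0
    for t in range(1, T + 1):
        g = x + rng.standard_normal()
        x -= (1.0 / t) * g                     # step size gamma_t = 1/(mu * t)
    return x**2                                # squared error, since x* = 0

for T in [100, 1000, 10000]:
    mse = np.mean([run_sgd(T) for _ in range(200)])
    print(T, mse, mse * T)                     # mse * T stays roughly constant, consistent with O(1/T)
```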

Recent Advances in Variance Reduction

Variance reduction techniques have emerged as a cornerstone for improving the efficiency of stochastic optimization methods, particularly in addressing the high variance inherent in gradient estimates from finite-sum problems. These methods aim to accelerate convergence by constructing unbiased or low-bias gradient estimates with reduced variance, often achieving linear rates under strong convexity assumptions.

One seminal approach is the Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson and Zhang in 2013, which periodically computes the full gradient at a snapshot point and uses it as a control variate, subtracting the difference between the current and snapshot per-component gradients from the stochastic update. This periodic full-gradient computation, typically every m inner iterations where m is on the order of the number of data points, ensures that the variance of the gradient estimator decreases over epochs, leading to faster convergence than standard SGD without additional storage proportional to the dataset size.

Building on SVRG, the SAGA method, proposed by Defazio et al. in 2014, maintains a table of past gradients for each component and updates the stochastic gradient estimate by subtracting the stored gradient of the sampled component and adding the average of all stored gradients, achieving variance reduction through incremental updates without requiring full passes. SAGA exhibits linear convergence for strongly convex objectives in finite-sum settings, with a geometric rate governed by the strong convexity parameter \mu, the number of components n, and the smoothness constant L, making it particularly effective for large-scale tasks like regularized empirical risk minimization. Similarly, Katyusha, developed by Allen-Zhu in 2016, integrates Nesterov acceleration with SVRG-style variance reduction using a novel "negative momentum" to counteract variance amplification, yielding accelerated rates of O((n + \sqrt{n\kappa}) \log(1/\epsilon)) iterations for strongly convex problems, where \kappa = L/\mu.

More general variance reduction strategies include control variates and importance sampling, which can be applied beyond finite-sum structures to lower the variance of stochastic estimates in broader stochastic optimization contexts. Control variates involve subtracting from the gradient estimator a correlated quantity with known expectation; for instance, Wang et al. (2013) demonstrated their use in stochastic gradient optimization by leveraging low-order moment statistics as controls, reducing variance in applications like logistic regression while preserving unbiasedness. Importance sampling, on the other hand, adjusts sampling probabilities to focus on high-impact components, as explored by Csiba and Richtárik (2018) for minibatch SGD, where non-uniform sampling based on the smoothness constants of the component functions can reduce the expected variance by up to a factor of n, enhancing convergence in practice.

In the 2020s, variance reduction has been increasingly integrated with federated learning to handle privacy-preserving optimization across distributed devices, mitigating "client drift" caused by heterogeneous data. A notable example is SCAFFOLD, introduced by Karimireddy et al. in 2020, which employs server- and client-side control variates to correct for update biases, achieving faster and more stable convergence than local-update baselines under heterogeneous data, with rates of O(1/T + \sqrt{\sigma^2 / (KT)}) for T communication rounds with K clients and gradient-noise variance \sigma^2, as validated on benchmark datasets in federated settings.

Subsequent developments have extended variance reduction to more challenging settings. For instance, adaptive variance reduction methods based on the stochastic recursive momentum (STORM) estimator, proposed by Jian et al. in 2024, relax traditional assumptions like bounded variance by dynamically estimating higher-order moments, achieving near-optimal convergence rates for non-smooth objectives in large-scale problems. In zeroth-order optimization, population-based variance-reduced evolution (PVRE), introduced in 2025, combines evolutionary strategies with variance reduction for black-box stochastic problems, demonstrating improved sample efficiency in high-dimensional landscapes without gradient access. Additionally, variance-reduced estimators for stochastic variational inference on manifolds, developed by Shi et al. in 2025, address geometric constraints in Bayesian models, reducing variance in posterior sampling for applications in probabilistic machine learning.
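A minimal NumPy sketch of the SVRG control-variate construction described above, applied to a synthetic least-squares finite sum; the step size, epoch length, and problem sizes are illustrative choices rather than values from the original paper:

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 500, 5
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                                  # finite-sum least-squares problem, minimized at x_star

def grad_i(x, i):                               # gradient of the i-th component loss
    return A[i] * (A[i] @ x - b[i])

x, eta, m = np.zeros(d), 0.01, n                # inner-loop length on the order of n
for epoch in range(30):
    x_snap = x.copy()
    full_grad = A.T @ (A @ x_snap - b) / n      # periodic full gradient at the snapshot
    for _ in range(m):
        i = rng.integers(n)
        # Control-variate estimator: unbiased, with variance shrinking as x approaches x_snap
        g = grad_i(x, i) - grad_i(x_snap, i) + full_grad
        x -= eta * g
print(np.linalg.norm(x - x_star))               # decreases geometrically across epochs
```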
