
Bayesian optimization

Bayesian optimization is a sequential design strategy for global optimization of black-box functions that are expensive or time-consuming to evaluate. It typically uses a probabilistic surrogate model, such as a Gaussian process, to approximate the objective function and an acquisition function to select the next evaluation point by balancing exploration of uncertain regions and exploitation of promising areas. The approach emerged in the statistical literature during the 1960s, building on earlier concepts such as multi-armed bandits from sequential decision-making, and gained prominence in the 2010s due to its efficiency in machine learning applications such as hyperparameter tuning. Key components include the surrogate model, which provides a posterior over possible objective functions based on observed evaluations, and the acquisition function, with common examples including expected improvement (EI), which quantifies the expected gain from evaluating a point, and the upper confidence bound (UCB), which incorporates uncertainty bounds. It is particularly suited to low-dimensional problems (often up to 20 dimensions) where each evaluation might involve complex simulations or real-world experiments, such as in machine learning, engineering design, and the experimental sciences. Recent advances have extended Bayesian optimization to challenging settings, including high-dimensional spaces through variable selection and embeddings, structured or discrete inputs via latent representations such as variational autoencoders, and constrained or noisy environments with tailored acquisition functions such as constrained expected improvement. Developments after 2016 also address parallel and batch evaluations for faster iterations, distributed systems for large-scale applications, and emerging concerns such as privacy in federated settings and fairness in algorithmic decisions. These enhancements have broadened its use in fields such as scientific discovery, autonomous systems, and automated machine learning, while open challenges remain in scalability and handling heterogeneous evaluation costs.

Introduction

Definition and Motivation

Bayesian optimization is a sequential design strategy for the global optimization of black-box functions that are expensive or infeasible to evaluate repeatedly. It models the unknown function using a prior distribution, which is updated to a posterior distribution via Bayes' rule as new function evaluations are observed. The selection of subsequent evaluation points is guided by an acquisition function designed to balance exploration of regions with high uncertainty against exploitation of areas likely to yield improved solutions. This approach is particularly motivated by scenarios where function evaluations incur significant costs, such as time-intensive simulations in engineering design, resource-heavy laboratory experiments, or computationally demanding hyperparameter tuning in machine learning. Unlike grid search or random search methods, which require numerous evaluations to cover the search space adequately, Bayesian optimization achieves sample efficiency by leveraging the probabilistic model to focus evaluations intelligently, often converging to near-optimal solutions with far fewer queries—typically tens to hundreds in low- to moderate-dimensional spaces (e.g., up to 20 dimensions). The basic procedure initiates with a prior belief about the objective function, often seeded by a small set of initial observations to inform the surrogate model. As evaluations proceed, the posterior is refined to reflect the accumulated data, and the acquisition function identifies the next query point that maximizes expected progress toward the optimum. This iterative loop enables effective optimization of functions without access to derivatives or explicit analytical forms. Bayesian optimization emerged in the mid-20th century as a method for locating maxima of noisy, derivative-free functions, with foundational contributions tracing back to Kushner's 1964 work on probabilistic improvement strategies using stochastic process models. Subsequent developments from the 1970s through the 1990s, including Mockus's Bayesian global search and Jones et al.'s efficient global optimization (EGO) framework, solidified its role in practical applications.

Comparison to Other Optimization Methods

Bayesian optimization belongs to the broader class of black-box optimization techniques, which address problems where the objective provides no gradient information and is treated as an opaque mapping from inputs to outputs. Derivative-based methods, such as gradient descent, excel at optimizing smooth, differentiable functions by following local descent directions but fail when derivatives are unavailable or the function is noisy or discontinuous. In contrast, derivative-free methods encompass a diverse set, including local search algorithms like Nelder-Mead, which perform simplex-based explorations suitable for low-dimensional, unimodal problems, and evolutionary algorithms like genetic algorithms, which maintain populations of candidate solutions to navigate multimodal landscapes through selection and mutation. The distinctive strength of Bayesian optimization lies in its probabilistic framework, which employs surrogate models—typically Gaussian processes—to approximate the objective function and quantify uncertainty, enabling informed decisions even with noisy or multimodal black-box evaluations. This approach yields high sample efficiency, often requiring only 10 to 100 evaluations to achieve strong performance on expensive functions, in contrast to the thousands typically needed by evolutionary methods or exhaustive grid search. For instance, while grid search systematically enumerates a predefined lattice of points, it becomes prohibitively inefficient in high-dimensional spaces due to exponential growth in the number of evaluations; random search, though a simple non-adaptive baseline that samples uniformly, often outperforms grid search when only a few dimensions strongly affect the objective, but it cannot exploit information from past evaluations. However, Bayesian optimization incurs significant computational overhead from updating the surrogate posterior and optimizing acquisition functions, with costs scaling cubically in the number of observations for Gaussian processes, and it suffers from the curse of dimensionality beyond roughly 20 dimensions. It is thus ill-suited to scenarios with cheap evaluations, where simpler methods like Nelder-Mead or random search suffice without the modeling burden. Bayesian optimization shines in settings like hyperparameter tuning or experimental design, where function evaluations are costly (e.g., training runs or physical experiments) and the evaluation budget covers only a small fraction of the searchable space, allowing its uncertainty-aware exploration—via surrogate models—to outperform non-adaptive alternatives.

Surrogate Models

Gaussian Processes

Gaussian processes (GPs) are non-parametric probabilistic models that define a distribution over functions, treating the unknown objective function as a sample from this distribution. A GP is specified by a mean function, often taken as zero for simplicity in Bayesian optimization, and a covariance function (kernel) that encodes assumptions about the function's properties, such as smoothness. A widely used kernel is the squared exponential (or RBF) kernel, k(x, x') = \sigma_f^2 \exp\left( -\frac{\|x - x'\|^2}{2\ell^2} \right), where \sigma_f^2 controls the vertical scale of the function variations and \ell determines the length scale of correlations. Given noisy observations D = \{(x_i, y_i)\}_{i=1}^n with y_i = f(x_i) + \epsilon_i and \epsilon_i \sim \mathcal{N}(0, \sigma_n^2), the posterior distribution over functions is also a GP, updated via Bayes' rule. The posterior mean at a new point x_* is \mu(x_*) = \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}, and the posterior variance is \sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*, where \mathbf{K} is the n \times n kernel matrix with entries K_{ij} = k(x_i, x_j), \mathbf{k}_* = [k(x_*, x_1), \dots, k(x_*, x_n)]^T, and \mathbf{y} = [y_1, \dots, y_n]^T. These expressions yield predictive distributions that quantify both the expected function value and the epistemic uncertainty, which shrinks as more data is observed. The GP hyperparameters—such as \ell, \sigma_f^2, and \sigma_n^2—are estimated by maximizing the log marginal likelihood \log p(\mathbf{y} | X) = -\frac{1}{2} \mathbf{y}^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} - \frac{1}{2} \log |\mathbf{K} + \sigma_n^2 \mathbf{I}| - \frac{n}{2} \log 2\pi, typically via gradient-based optimization such as L-BFGS. This evidence-based approach fits the model to the data while regularizing against overfitting. GPs rely on assumptions encoded by the kernel, such as stationarity and smoothness for the squared exponential kernel, which may not hold for highly non-stationary or discontinuous functions. Exact GP inference incurs a cubic computational cost O(n^3) from inverting the kernel matrix and O(n^2) storage, limiting applicability to datasets with hundreds of points; scalable approximations, such as variational sparse GPs with inducing points, mitigate this by projecting the GP onto a low-rank representation based on m \ll n pseudo-inputs, achieving O(m^3 + nm^2) complexity while preserving predictive quality. In Bayesian optimization, GPs serve as the primary surrogate model because their posterior provides calibrated uncertainty estimates alongside point predictions, enabling acquisition functions to balance exploration and exploitation effectively.
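The posterior equations above can be written compactly in code. The following is a minimal NumPy sketch under the stated assumptions (zero mean function, squared exponential kernel, Gaussian noise); the function names, hyperparameter values, and toy data are illustrative rather than drawn from any particular library.

```python
import numpy as np


def sq_exp_kernel(A, B, ell=1.0, sigma_f=1.0):
    """Squared exponential (RBF) kernel matrix between two sets of points."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * sq_dists / ell**2)


def gp_posterior(X, y, X_star, ell=1.0, sigma_f=1.0, sigma_n=1e-3):
    """Posterior mean and variance of a zero-mean GP at test points X_star."""
    K = sq_exp_kernel(X, X, ell, sigma_f) + sigma_n**2 * np.eye(len(X))
    k_star = sq_exp_kernel(X, X_star, ell, sigma_f)            # n x m cross-covariances
    alpha = np.linalg.solve(K, y)                              # (K + sigma_n^2 I)^{-1} y
    mu = k_star.T @ alpha                                      # posterior mean at X_star
    v = np.linalg.solve(K, k_star)                             # (K + sigma_n^2 I)^{-1} k_*
    var = np.diag(sq_exp_kernel(X_star, X_star, ell, sigma_f)) - np.sum(k_star * v, axis=0)
    return mu, np.maximum(var, 0.0)                            # clip tiny negative values


# Toy usage: observe f(x) = sin(x) at three points, then predict on a small grid.
X = np.array([[0.0], [1.0], [2.5]])
y = np.sin(X).ravel()
mu, var = gp_posterior(X, y, np.linspace(0.0, 3.0, 5).reshape(-1, 1))
```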

Alternative Surrogate Models

While Gaussian processes remain the standard surrogate in Bayesian optimization due to their probabilistic uncertainty estimates and smooth interpolation, several alternatives address key limitations such as computational scalability beyond small datasets (typically n > 100 observations), challenges in high-dimensional spaces (d > 10), and handling of non-continuous or conditional inputs. These models provide flexible approximations of the objective function while maintaining the ability to estimate uncertainty for acquisition function evaluation. Tree-structured Parzen estimators (TPE) offer a non-parametric approach that models promising regions of the search space using ratios of densities rather than direct regression of the objective. Introduced for hyperparameter optimization, TPE partitions observations into "good" and "bad" sets based on a performance threshold and fits separate Parzen estimators (kernel density estimates organized over tree-structured spaces) to each, enabling the acquisition function to favor configurations likely to outperform the current best. This method excels in tree-structured configuration spaces with conditional dependencies, such as neural network architectures, and is implemented in the Hyperopt library for practical hyperparameter tuning. Compared to Gaussian processes, TPE avoids costly matrix inversions, achieving better scalability for high-dimensional problems while performing competitively on benchmarks such as SVM and neural network tuning. Random forests serve as ensemble-based surrogate models by aggregating predictions from multiple decision trees, providing mean estimates and uncertainty derived from the spread of predictions across trees. In Bayesian optimization frameworks like SMAC, random forests model the objective as a regression task, naturally accommodating mixed continuous and categorical variables without requiring kernel design, which is advantageous for algorithm configuration tasks with discrete parameters. Their tree structure enables efficient handling of high-dimensional inputs (up to d ≈ 50) and non-stationary functions, outperforming Gaussian processes in scenarios with sparse or noisy data, as demonstrated on combinatorial benchmarks such as vehicle routing. Scalability arises from parallelizable tree construction, making them suitable for larger observation sets where Gaussian processes become prohibitive. Neural network-based surrogates, such as Bayesian neural networks (BNNs) and deep Gaussian processes, leverage layered architectures to capture complex, non-smooth dependencies in the data, extending beyond the smoothness assumptions of Gaussian processes. BNNs incorporate uncertainty through priors on weights, enabling variational inference or Monte Carlo sampling for posterior predictions, which proves effective for non-stationary and multi-modal objectives in high dimensions (d > 20). For instance, infinite-width BNNs have demonstrated competitive performance on low-dimensional benchmarks like Ackley and Branin, while outperforming Gaussian processes in high-dimensional and non-stationary settings, particularly with a large number of observations (e.g., thousands of queries), due to their adaptive representations. These models are particularly useful in hyperparameter tuning where the response surface exhibits sharp transitions. Student-t processes generalize Gaussian processes by using a Student-t distribution for the marginals, introducing a degrees-of-freedom parameter that imparts heavier tails and greater robustness to outliers in the observations. Derived from hierarchical Gaussian process models with an inverse Wishart prior on the covariance, they retain closed-form predictions while allowing the predictive variance to adapt to the observed data values, improving reliability in noisy or heavy-tailed environments such as hyperparameter tuning. On synthetic tasks, Student-t processes outperform Gaussian processes in outlier-contaminated settings without additional computational overhead.
| Surrogate Model | Scalability (Time Complexity) | Flexibility | Key Strengths |
|---|---|---|---|
| Gaussian Processes | O(n³) for exact inference | High for continuous, low-d spaces | Smooth uncertainty, exact posteriors |
| Tree-structured Parzen Estimators | O(n log n) for density fitting | Excellent for conditional/discrete spaces | Efficient in high-d hyperparameter tuning |
| Random Forests | O(n log n) per tree | Strong for mixed categorical/continuous | Handles high-d, non-stationary data |
| Neural Network-based (e.g., BNNs) | O(n²) or better with approximations | High for non-smooth, multi-modal objectives | Adapts to complex structures in d > 20 |
| Student-t Processes | O(n³), same as GPs | Similar to GPs, with added robustness | Heavier tails for noisy data |
Gaussian processes are preferred for low-dimensional (d < 10) continuous problems where precise uncertainty quantification is critical, but alternatives like random forests or TPE should be chosen for high-dimensional (d > 10) or non-continuous inputs to mitigate the curse of dimensionality and improve scalability. For instance, random forests suit mixed-variable algorithm configuration, while TPE excels in conditional hyperparameter spaces. Neural and Student-t variants are appropriate when the objective is non-smooth or outlier-prone, respectively.
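As a concrete illustration of an alternative surrogate, the following hedged sketch shows how a random forest can supply both a mean prediction and a simple uncertainty proxy from the spread of per-tree predictions, in the spirit of (but not identical to) the estimator used in SMAC; the dataset, dimensions, and helper name are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def forest_mean_std(forest, X):
    """Surrogate mean and an uncertainty proxy from the spread of per-tree predictions."""
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])   # (n_trees, n_points)
    return per_tree.mean(axis=0), per_tree.std(axis=0)


# Fit the forest surrogate on observed (configuration, objective) pairs.
rng = np.random.default_rng(0)
X_obs = rng.uniform(-2, 2, size=(40, 3))                  # 40 observed configurations in 3-d
y_obs = np.sum(X_obs**2, axis=1) + 0.1 * rng.normal(size=40)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_obs, y_obs)

# Score a pool of candidates; mu and sigma can then feed any acquisition function.
X_cand = rng.uniform(-2, 2, size=(500, 3))
mu, sigma = forest_mean_std(forest, X_cand)
```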

Acquisition Functions

Role and Design Principles

In Bayesian optimization, the acquisition function serves as a utility function defined over the posterior distribution provided by the surrogate model, guiding the selection of the next evaluation point x_{t+1} = \arg\max_x \alpha(x | D_t), where D_t represents the data observed up to iteration t and \alpha quantifies the value of evaluating the objective at x. This mechanism balances the inherent trade-off between exploration—sampling regions of high predictive uncertainty, often indicated by high posterior variance—and exploitation—focusing on areas with high predicted values from the surrogate—to efficiently search the parameter space of expensive black-box functions. Within a regret-minimization framing, the acquisition function aims to minimize cumulative regret, defined as the accumulated difference between the optimal function value and the values at the selected points, ensuring no-regret guarantees under certain conditions such as sub-Gaussian noise. The general form of an acquisition function is \alpha(x) = \mathbb{E}[u(f(x)) | D_t], where the expectation is taken with respect to the posterior predictive distribution p(f(x) | D_t), and u encodes a utility such as the improvement over the current best observed value. This formulation allows the acquisition function to adapt to the evolving posterior, prioritizing points that maximize anticipated gains while accounting for uncertainty. Design principles for acquisition functions emphasize desirable properties to ensure effective optimization. Monotonicity requires that the acquisition value does not decrease as the predicted improvement over the incumbent grows, preserving the incentive to search promising regions. Tractability demands efficient computation and optimization of \alpha, often leveraging closed-form expressions or cheap approximations to handle high-dimensional spaces without excessive cost. Handling noisy observations is critical, with acquisition functions designed to incorporate noise models in the posterior to avoid overconfidence in uncertain estimates. From an information-theoretic perspective, approaches like the knowledge gradient treat the acquisition as the expected increase in decision value from sampling, maximizing the reduction in posterior uncertainty or the expected future utility. Challenges in acquisition function design include extending to multi-objective and cost-aware settings, where information-theoretic functions such as entropy search—which minimize the entropy of the posterior over the global optimum's location—can be adapted to balance multiple criteria such as evaluation cost or constraints. Additionally, optimizing \alpha(x) itself—often a non-convex problem—requires techniques like multi-start optimization or gradient-based methods to find reliable maximizers, particularly as dimensionality increases.

Specific Acquisition Functions

Expected improvement (EI) is one of the most widely used acquisition functions in Bayesian optimization, quantifying the expected gain in function value over the current best observation f(x^+) when evaluating at a new point x. Formally, under a Gaussian process surrogate, it is given by \alpha_{\text{EI}}(x) = \mathbb{E}[\max(0, f(x) - f(x^+) - \xi) \mid \mathcal{D}] = (\mu(x) - f(x^+) - \xi) \Phi(z) + \sigma(x) \phi(z), where z = \frac{\mu(x) - f(x^+) - \xi}{\sigma(x)}, \Phi and \phi are the cumulative distribution and probability density functions of the standard normal distribution, respectively, \mu(x) and \sigma(x) are the posterior mean and standard deviation at x, and \xi \geq 0 is a non-negative parameter that controls the exploration-exploitation trade-off by penalizing improvements below a margin. This function accounts for both the predicted improvement and its uncertainty, making it suitable for noisy observations when the posterior incorporates measurement noise. The point selected by EI is invariant to positive affine (linear) transformations of the objective function, which preserve the relative ordering of candidates. Probability of improvement (PI) offers a simpler alternative, measuring the likelihood that evaluating at x yields a better outcome than the current best f(x^+). It is defined as \alpha_{\text{PI}}(x) = \Phi\left( \frac{\mu(x) - f(x^+) - \xi}{\sigma(x)} \right), again using the Gaussian posterior mean and standard deviation, with \xi serving a similar role in encouraging exploration. Unlike EI, PI does not account for the magnitude of potential improvements, focusing solely on the probability of surpassing the incumbent, which can lead to overly conservative selections in high-uncertainty regions. The upper confidence bound (UCB) acquisition function balances prediction and uncertainty through a direct additive form, selecting points that maximize \alpha_{\text{UCB}}(x) = \mu(x) + \kappa \sigma(x), where \kappa > 0 is a parameter tuning the exploration strength. Analyzed in the Gaussian process upper confidence bound (GP-UCB) algorithm, this approach provides theoretical guarantees, including sublinear regret bounds of O(\sqrt{T \gamma_T}) after T iterations for a kernel with maximum information gain \gamma_T, demonstrating no-regret optimization for functions drawn from the GP prior. Other acquisition functions emphasize information-theoretic principles for more global search strategies. The knowledge gradient (KG) measures the expected increase in the value of the optimal decision following a single evaluation, formulated as the difference in expected maximum posterior values before and after observing f(x), often computed via Monte Carlo sampling from the posterior. Entropy search (ES) directly maximizes the expected reduction in uncertainty about the global optimum's location, approximating the mutual information between the observation at x and the maximizer x^* through sampling over a distribution of potential optima. Hybrids, such as portfolios combining improvement-based criteria with entropy-based methods, improve robustness by selecting among multiple acquisition functions at each step, as in the entropy search portfolio approach, which has shown empirical benefits in diverse optimization landscapes.
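The three closed-form criteria above translate directly into code. The following NumPy/SciPy sketch assumes a posterior mean mu and standard deviation sigma are available at candidate points (e.g., from a Gaussian process surrogate) and follows the maximization convention used in the formulas; function names and default parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm


def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI for maximization: E[max(0, f(x) - f_best - xi)] under a Gaussian posterior."""
    sigma = np.maximum(sigma, 1e-12)            # avoid division by zero at observed points
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)


def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI: probability that f(x) exceeds the incumbent best by at least xi."""
    sigma = np.maximum(sigma, 1e-12)
    return norm.cdf((mu - f_best - xi) / sigma)


def upper_confidence_bound(mu, sigma, kappa=2.0):
    """GP-UCB score: optimistic estimate mu + kappa * sigma."""
    return mu + kappa * sigma
```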

Optimization Process

Sequential Bayesian Optimization

Sequential Bayesian optimization forms the core of the Bayesian optimization process, where evaluations of the objective function are performed one at a time in an iterative manner to efficiently search for the global optimum of expensive black-box functions. This approach leverages a probabilistic surrogate model, typically a Gaussian process, to approximate the objective and guide the selection of the next point by maximizing an acquisition function that balances exploration and exploitation. The process is particularly suited for settings where each function evaluation is costly, such as hyperparameter tuning or simulation-based design, allowing for informed decisions with a limited evaluation budget. The algorithm begins by initializing a dataset D_0 = \{(x_i, y_i)\}_{i=1}^{n_0} using a space-filling design, such as Latin hypercube sampling, to provide an initial set of observations for the surrogate. For each iteration t = 1 to T, the surrogate is fitted to the current dataset D_{t-1}, yielding a posterior over the objective f. An acquisition function \alpha(x | D_{t-1}) is then optimized to select the next point x_t = \arg\max_x \alpha(x | D_{t-1}). The objective is evaluated at x_t to obtain y_t = f(x_t) + \epsilon, where \epsilon represents observation noise (which may be zero in noise-free settings), and the dataset is updated as D_t = D_{t-1} \cup \{(x_t, y_t)\}. This loop continues until a stopping criterion is met, after which the optimum is inferred from the best observed point in D_T. Optimizing the acquisition function to find x_t is a non-convex global optimization problem in its own right, often addressed using gradient-based methods like L-BFGS-B, which exploit analytical derivatives of common acquisition functions for efficiency. To mitigate local optima, multi-start strategies initialize multiple L-BFGS-B runs from random or promising points and select the best result; alternatively, derivative-free evolutionary methods such as CMA-ES (a variant of evolution strategies) can be employed for robust global search. These techniques help ensure reliable maximization even in higher-dimensional spaces. In noise-free settings, the surrogate assumes deterministic evaluations (\epsilon = 0), leading to a posterior with zero variance at observed points. For noisy observations, the model incorporates a noise variance \sigma^2_\epsilon > 0 in the likelihood, adjusting the posterior to account for uncertainty in repeated evaluations and preventing overconfidence at sampled locations. This adaptation is crucial for real-world applications where measurement error is inevitable. Common stopping criteria include exhaustion of the evaluation budget T, or convergence indicators such as the maximum expected improvement falling below a threshold (e.g., 1% of the current best objective value). These ensure the process terminates practically while maximizing information gain. The following outlines the sequential Bayesian optimization algorithm (assuming minimization of a function f):
Initialize dataset D_0 = {(x_i, y_i)}_{i=1}^{n_0} using space-filling design
For t = 1 to T:
    Fit surrogate model (e.g., Gaussian process) to D_{t-1}
    Optimize acquisition function α(x | D_{t-1}) to obtain x_t = arg max_x α(x | D_{t-1})
    Evaluate y_t = f(x_t) + ε  (ε ~ N(0, σ_ε²); σ_ε = 0 for noise-free)
    Update D_t = D_{t-1} ∪ {(x_t, y_t)}
    If stopping criterion met (e.g., max α < threshold or t = T), break
Return the point in D_T with the minimum observed y
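A minimal runnable version of this loop can be assembled from standard libraries. The sketch below uses scikit-learn's Gaussian process regressor and multi-start L-BFGS-B (with finite-difference gradients) over a one-dimensional toy objective, with expected improvement written for minimization to match the pseudocode; the objective, bounds, budget, and kernel settings are illustrative assumptions rather than part of the algorithm itself.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel


def f(x):
    """Toy expensive black-box objective (stands in for a simulation or training run)."""
    return np.sin(3 * x[0]) + 0.5 * x[0] ** 2


bounds = np.array([[-2.0, 2.0]])                     # box constraints on the single input
rng = np.random.default_rng(0)

# Initial space-filling design D_0 (uniform random here; Latin hypercube also common).
X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(4, 1))
y = np.array([f(x) for x in X])


def neg_ei(x, gp, f_best, xi=0.01):
    """Negative expected improvement (minimization convention) at a single point."""
    mu, sigma = gp.predict(x.reshape(1, -1), return_std=True)
    sigma = max(sigma[0], 1e-12)
    z = (f_best - mu[0] - xi) / sigma
    return -((f_best - mu[0] - xi) * norm.cdf(z) + sigma * norm.pdf(z))


for t in range(20):                                  # evaluation budget T
    gp = GaussianProcessRegressor(kernel=ConstantKernel(1.0) * RBF(length_scale=1.0),
                                  alpha=1e-6, normalize_y=True).fit(X, y)

    # Multi-start L-BFGS-B on the negated acquisition; gradients are finite-differenced.
    starts = rng.uniform(bounds[:, 0], bounds[:, 1], size=(10, 1))
    results = [minimize(neg_ei, x0, args=(gp, y.min()), bounds=bounds, method="L-BFGS-B")
               for x0 in starts]
    x_next = min(results, key=lambda r: r.fun).x

    X = np.vstack([X, x_next])                       # evaluate the objective and grow D_t
    y = np.append(y, f(x_next))

print("best x:", X[np.argmin(y)], "best y:", y.min())
```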

Batch and Parallel Optimization

Traditional Bayesian optimization proceeds sequentially, selecting and evaluating one point at a time while updating the surrogate model after each observation. However, with the availability of parallel computing resources, evaluating multiple points simultaneously—known as batch or parallel optimization—can substantially reduce overall computation time for expensive objective functions. This adaptation addresses the challenge of non-myopic selection, where the acquisition function must account for joint utility across a batch of q points rather than a single point, often leading to more conservative and diverse selections to mitigate information overlap. A key approach is the q-Expected Improvement (qEI), which extends the single-point Expected Improvement to batches by quantifying the anticipated maximum improvement from any point in the proposed set \mathbf{x} = (x_1, \dots, x_q). Formally, \alpha(\mathbf{x}) = \mathbb{E}\left[\max\left(0, f(x^+) - \min_{i=1}^q f(x_i)\right)\right], where f(x^+) denotes the current best (minimum) observed value (assuming minimization), and the expectation is taken over the joint posterior predictive distribution of the surrogate model, such as a Gaussian process. For q > 1, exact computation is intractable due to the maximum over correlated predictions, so approximations rely on Monte Carlo sampling: samples are drawn from the multivariate Gaussian posterior to simulate batch outcomes and estimate the expectation. This method promotes batches that jointly maximize improvement while encouraging diversity through correlation penalties implicit in the joint distribution. Alternative heuristics simplify batch selection for practicality. Local penalization employs a sequential greedy strategy, iteratively selecting points by maximizing a single-point acquisition (e.g., Expected Improvement) while applying a penalty to the acquisition surface around previously chosen batch points; the penalty is scaled by an empirical estimate of the objective's Lipschitz constant to discourage clustering and ensure exploration. Fantasizing techniques approximate the joint effect by simulating unobserved outcomes for prior batch selections—treating them as realized—to update the surrogate before choosing the next point, enabling standard sequential acquisition functions in a myopic, parallelizable loop. A related heuristic, the Kriging Believer, specifically hallucinates these simulated values as the surrogate's posterior mean at each pending point, providing a simple update rule that balances belief in the model with ongoing optimization. Scalability in batch optimization benefits from methods that generate diverse candidates efficiently, particularly in distributed environments where evaluations occur across multiple processors. For instance, batch selection can integrate clustering to identify representative points from large pools, facilitating parallel execution while minimizing redundant evaluations. Post-2015 developments, such as the BoTorch library, enable GPU-accelerated Monte Carlo approximations for qEI, supporting high-dimensional batches and large sample sizes through Monte Carlo integration, thus making parallel Bayesian optimization viable for industrial-scale applications.
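The Monte Carlo estimator for qEI described above can be sketched in a few lines, assuming the surrogate provides the joint Gaussian posterior (mean vector and covariance matrix) over the q candidate points; the numbers below are illustrative, and the minimization convention matches the formula in this section.

```python
import numpy as np


def q_expected_improvement(mu, cov, f_best, n_samples=10_000, rng=None):
    """Monte Carlo qEI for minimization: E[max(0, f_best - min_i f(x_i))].

    mu  : (q,) posterior mean at the q batch candidates
    cov : (q, q) joint posterior covariance over the candidates
    """
    rng = rng or np.random.default_rng()
    samples = rng.multivariate_normal(mu, cov, size=n_samples)    # joint draws over the batch
    improvement = np.maximum(0.0, f_best - samples.min(axis=1))   # best outcome within each draw
    return improvement.mean()


# Usage: score a hypothetical batch of q = 3 points given a GP's joint posterior there.
mu = np.array([0.8, 1.1, 0.9])
cov = np.array([[0.20, 0.05, 0.01],
                [0.05, 0.15, 0.02],
                [0.01, 0.02, 0.25]])
print(q_expected_improvement(mu, cov, f_best=1.0))
```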

History

Early Foundations (1960s-1990s)

The foundations of Bayesian optimization emerged in the 1960s through efforts to address optimization in noisy environments. In 1964, Harold J. Kushner introduced a method for locating the maximum of an arbitrary multipeak curve amid noise, marking an early precursor to sequential strategies for handling uncertainty in black-box functions. The 1970s and 1980s saw substantial theoretical progress, primarily driven by Jonas Mockus and collaborators in the Soviet literature, who formalized Bayesian approaches to the global optimization of multimodal functions. Mockus's series of publications established Bayesian optimization as a framework for minimizing the expected deviation from the global extremum using probabilistic surrogates, with initial developments focusing on one-dimensional and low-dimensional cases. During this era, Mockus introduced the Probability of Improvement (PI) criterion in the early 1970s as a rule for selecting promising evaluation points based on the likelihood of surpassing the current best observation. In 1978, Mockus, V. Tiešis, and A. Žilinskas proposed the Expected Improvement (EI) acquisition function, quantifying the anticipated gain over the incumbent solution to guide efficient exploration-exploitation trade-offs. Antanas Žilinskas played a key role alongside Mockus, contributing algorithms for multidimensional global optimization and statistical selection criteria for use in noisy settings during the 1970s and 1980s. These works emphasized surrogate models based on stochastic process priors to approximate expensive objectives, though computational constraints—such as the cubic scaling of model updates with the number of observations—limited applications to low-dimensional problems with fewer than 20 variables. In 1989, Jerome Sacks, William J. Welch, Toby J. Mitchell, and Henry P. Wynn advanced the methodology by applying kriging (a form of Gaussian process regression) to the design and analysis of computer experiments, providing a rigorous statistical basis for predicting responses and their uncertainties in deterministic simulations. Mockus's 1989 book, Bayesian Approach to Global Optimization: Theory and Applications, synthesized these contributions, detailing theoretical guarantees such as convergence rates and practical examples in engineering design, while highlighting the method's efficiency for functions that can be evaluated only a few dozen times. By the late 1990s, the field transitioned toward practicality, exemplified by Donald R. Jones, Matthias Schonlau, and William J. Welch's 1998 paper on efficient global optimization of expensive black-box functions. This work integrated Gaussian processes as surrogates with the expected improvement criterion, demonstrating superior performance over alternatives such as response surface methods on test functions, and facilitating the development of initial software tools for broader adoption.

Modern Developments (2000s-Present)

The 2000s marked a pivotal period for Bayesian optimization, shifting from foundational theory to practical implementations with rigorous guarantees. A landmark advancement was the introduction of the Gaussian Process Upper Confidence Bound (GP-UCB) algorithm by Srinivas et al. in 2010, which provided the first no-regret bounds for GP-based optimization in the bandit setting by bounding cumulative regret in terms of information gain. This theoretical progress spurred applications in hyperparameter optimization for machine learning, where expensive black-box evaluations are common. Tools like SMAC, developed by Hutter et al. in 2011, leveraged random forests as surrogate models for sequential model-based algorithm configuration, demonstrating superior performance on algorithm tuning benchmarks compared to grid search or random search. Similarly, Spearmint, by Snoek et al. in 2012, popularized GP-based optimization for tuning deep neural networks, achieving significant improvements in generalization performance on datasets such as CIFAR-10 with fewer evaluations than alternatives. The 2010s emphasized scalability, parallelization, and handling complex search spaces. Batch optimization methods gained prominence to exploit parallel computing resources; for instance, the q-Expected Improvement (qEI) acquisition function, formalized in works around 2012, extended single-point EI to select batches of q points that maximize joint expected improvement, reducing wall-clock time in parallel settings without substantial regret increase. Open-source libraries accelerated adoption: GPyOpt, released in 2014 and detailed in subsequent documentation, integrated GPs with various acquisition functions for constrained and multi-fidelity problems. BoTorch, announced in 2019 and published in 2020 by Balandat et al., introduced a modular PyTorch-based framework supporting Monte Carlo acquisition functions and sample average approximation, enabling efficient optimization for up to thousands of evaluations on high-performance hardware. To tackle high-dimensional spaces (d > 20), techniques like random embeddings emerged; the REMBO method by Wang et al. in 2013 projected high-dimensional inputs (up to billions of dimensions) onto lower-dimensional subspaces via random linear maps, preserving optimization guarantees while outperforming direct GP methods on synthetic high-dimensional benchmarks. Entering the 2020s, Bayesian optimization integrated deeply with AutoML and emerging paradigms, enhancing scalability for massive datasets. NASBOT, proposed by Kandasamy et al. in 2018 but widely applied in the following years, used GPs with optimal transport metrics for neural architecture search, discovering competitive architectures on CIFAR-10 with roughly 10x fewer evaluations than baselines. Scalable surrogate models advanced through Stochastic Variational Gaussian Processes (SVGPs), introduced by Hensman et al. in 2013 and refined over the decade, which approximate full GPs via variational inference on inducing points, scaling to millions of data points with minimal accuracy loss in BO loops. Applications proliferated in scientific domains; in quantum physics, BO optimized state preparation protocols in ultracold atoms by 2024, achieving fractional quantum Hall states with 50% fewer experiments than grid search via GP modeling of noisy observables. Post-2020 developments addressed distributed and safety-critical settings. Federated Bayesian optimization emerged to preserve data privacy, aggregating GP posteriors across decentralized nodes without sharing raw data. In AI safety, Bayesian methods tuned large language models for alignment from 2023 onward, reducing harmful outputs while balancing objectives in RLHF pipelines. Theoretical advances included no-regret guarantees for broader classes of objectives. Hybrids with reinforcement learning integrated BO for meta-optimization, where policy search in Markov decision processes used BO to tune hyperparameters online, outperforming direct policy search on continuous control tasks like MuJoCo. Despite this progress, open challenges persist in ultra-high dimensions (100D+) and multi-fidelity settings. The curse of dimensionality limits GP expressiveness beyond d = 50 without embeddings, leading to poor regret scaling, as noted in surveys such as Shahriari et al. (2016).

Applications

Machine Learning and Hyperparameter Tuning

Bayesian optimization plays a pivotal role in machine learning workflows by efficiently tuning hyperparameters—configuration settings such as learning rates, tree depths in random forests, or kernel parameters in support vector machines (SVMs) that control model behavior without being learned from data. Unlike exhaustive grid search, which scales poorly with dimensionality, or random search, which samples uniformly without exploiting prior evaluations, Bayesian optimization minimizes the number of required evaluations—often by orders of magnitude—for computationally expensive training runs, particularly for deep neural networks where each trial involves hours or days of computation. The sequential model-based optimization (SMBO) framework formalizes this process by treating hyperparameter tuning as a black-box optimization problem, where the objective is typically the cross-validation error or a similar performance metric on held-out data, and evaluations are sequential to build an increasingly accurate surrogate model of the response surface. In practice, SMBO iteratively fits a probabilistic surrogate (e.g., a Gaussian process) to observed validation scores and selects the next hyperparameters via an acquisition function that balances exploration and exploitation, enabling effective optimization even in high-dimensional spaces. Prominent case studies illustrate its impact: the Auto-sklearn system (2015) integrates Bayesian optimization within an AutoML pipeline to jointly optimize algorithm selection and hyperparameters, outperforming prior AutoML baselines on diverse datasets by leveraging meta-learning to warm-start the search. Similarly, the Optuna framework (2019) employs a variant of Bayesian optimization using tree-structured Parzen estimators to handle complex search spaces, facilitating scalable tuning in distributed environments. In competitive settings such as Kaggle competitions, Bayesian optimization has been applied to tune gradient boosting models, refining parameters such as subsample ratios and maximum depths to achieve leaderboard improvements over default configurations, as demonstrated in risk classification tasks. A key advantage in this domain is Bayesian optimization's native support for mixed continuous and discrete parameter spaces, including conditional dependencies—such as enabling dropout rates only if the number of layers exceeds one in neural architectures—which grid or random search handle inefficiently without custom adaptations. Post-2020 applications extend to large-scale models, where it tunes hyperparameters for transformers such as BERT, optimizing learning rate schedules and regularization to minimize validation loss while reducing search costs compared to random sampling. Recent advances as of 2025 include its use in optimizing prompts for large language models (LLMs), enhancing performance in black-box tasks with limited evaluations. The objective is often framed as minimizing simple regret, defined as the difference between the best observed performance and the true optimum, or directly as the cross-validation error, to ensure generalizable models.
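As a concrete, hedged example of SMBO-style tuning in practice, the following sketch uses the Optuna framework mentioned above (whose default sampler is a TPE variant) to minimize the cross-validation error of a gradient boosting classifier; the dataset, parameter ranges, and trial budget are illustrative choices, not prescriptions.

```python
import optuna
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)


def objective(trial):
    """One trial = one expensive model evaluation; the sampler proposes the next configuration."""
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
    }
    model = GradientBoostingClassifier(random_state=0, **params)
    # Objective: cross-validation error, as discussed above (lower is better).
    return 1.0 - cross_val_score(model, X, y, cv=3).mean()


study = optuna.create_study(direction="minimize")    # default sampler is TPE-based
study.optimize(objective, n_trials=25)
print(study.best_params, study.best_value)
```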

Engineering and Scientific Domains

Bayesian optimization has been extensively applied in engineering design, particularly in aerodynamic design, where simulations of fluid flow are computationally expensive. A seminal example involves the optimization of airfoil shapes for aerodynamic performance, demonstrating the efficiency of efficient global optimization (EGO), an early form of Bayesian optimization using Gaussian processes, which reduced the number of evaluations needed to achieve near-optimal lift-to-drag ratios compared to traditional methods. In modern contexts, Bayesian optimization integrates with computational fluid dynamics (CFD) simulations to refine design parameters while minimizing high-fidelity evaluations. Similarly, in robotics, it facilitates controller tuning by optimizing impedance parameters to enhance stability and performance in dynamic environments, such as robotic arms performing precise manipulations, where noisy feedback is common. For sensor placement in networks, Bayesian optimization selects optimal locations to maximize information gain in monitoring systems, such as structural health monitoring networks, reducing uncertainty in predictions with fewer sensors. In scientific domains, Bayesian optimization addresses costly experimental evaluations, such as in drug discovery, where it optimizes molecular properties represented as SMILES strings to maximize binding affinity while navigating vast chemical spaces. This approach has accelerated candidate identification by iteratively suggesting promising molecules, outperforming random screening in terms of convergence speed. In materials science, it guides alloy composition optimization for desired properties such as strength and corrosion resistance, with benchmarking across materials datasets demonstrating efficiency gains over traditional search methods. In chemistry, Bayesian optimization maximizes reaction yields by tuning conditions such as temperature, catalysts, and reagents, as demonstrated in optimizing synthetic reactions where yields improved in fewer iterations than manual methods. Notable case studies underscore its impact. The 1990s airfoil optimization highlighted Bayesian methods' sample efficiency in simulation-driven design, a foundation for contemporary integrations with CFD tools. Post-2015 applications in computational chemistry employ Bayesian optimization to construct potential energy surfaces for molecular systems, enabling accurate prediction of reaction pathways with fewer quantum mechanical calculations. In environmental modeling during the 2020s, it calibrates parameters in complex hydrological or climate models, improving simulation accuracy for flood prediction or subsurface flow by automating the fitting of uncertain inputs such as permeability. Multi-fidelity Bayesian optimization extends these applications by combining low-fidelity (cheap, approximate) models with high-fidelity (accurate but expensive) ones, such as using coarse simulations within Gaussian process surrogates to inform engineering designs like airfoil shapes, achieving cost reductions without sacrificing optimality. As of 2025, it has also gained traction in automated experimentation for optimizing conditions in self-driving laboratories. However, challenges persist in these domains, including handling noisy real-world data from experiments, which Bayesian methods mitigate through robust noise modeling. Safety constraints are addressed via constrained Bayesian optimization, ensuring designs in aerospace or chemical processes avoid unsafe regions, such as exceeding material stress limits, while maintaining exploration efficiency.
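As a small illustration of the constrained formulation mentioned above, the following hedged sketch scores candidate designs by expected improvement weighted by the posterior probability that a safety constraint is satisfied, assuming independent Gaussian posteriors for the objective and the constraint; all names and numbers are hypothetical.

```python
import numpy as np
from scipy.stats import norm


def constrained_ei(mu_f, sigma_f, f_best, mu_c, sigma_c, c_limit, xi=0.01):
    """EI for minimizing the objective, weighted by P(constraint <= c_limit)."""
    sigma_f = np.maximum(sigma_f, 1e-12)
    z = (f_best - mu_f - xi) / sigma_f
    ei = (f_best - mu_f - xi) * norm.cdf(z) + sigma_f * norm.pdf(z)
    prob_feasible = norm.cdf((c_limit - mu_c) / np.maximum(sigma_c, 1e-12))
    return ei * prob_feasible


# Hypothetical candidate designs: posteriors over drag (objective) and stress (constraint).
mu_f, sigma_f = np.array([0.42, 0.38]), np.array([0.05, 0.12])
mu_c, sigma_c = np.array([180.0, 240.0]), np.array([10.0, 25.0])    # predicted stress in MPa
scores = constrained_ei(mu_f, sigma_f, f_best=0.45, mu_c=mu_c, sigma_c=sigma_c, c_limit=200.0)
print(scores)   # higher score = larger expected improvement that is also likely feasible
```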