
Bayesian optimization

Bayesian optimization is a sequential design strategy for global optimization of black-box functions that are expensive or time-consuming to evaluate. It typically uses a probabilistic surrogate model, such as a Gaussian process, to approximate the objective function and an acquisition function to select the next evaluation point by balancing exploration of uncertain regions and exploitation of promising areas. The approach emerged in the statistical literature during the 1960s, building on earlier concepts such as multi-armed bandits from sequential decision-making, and gained prominence in the 2010s due to its efficiency in machine learning applications such as hyperparameter tuning. Key components include the surrogate model, which provides a posterior over possible objective functions based on observed evaluations, and the acquisition function, with common examples including expected improvement (EI), which quantifies the expected gain from evaluating a point, and the upper confidence bound (UCB), which incorporates uncertainty bounds. It is particularly suited to low-dimensional problems (often up to 20 dimensions) where each evaluation might involve complex simulations or real-world experiments, such as in machine learning, engineering design, and the experimental sciences. Recent advances have extended Bayesian optimization to challenging settings, including high-dimensional spaces through variable selection and embeddings, structured or discrete inputs via latent representations such as variational autoencoders, and constrained or noisy environments with tailored acquisition functions such as constrained expected improvement. Developments after 2016 also address parallel and batch evaluations for faster iterations, distributed systems for large-scale applications, and emerging concerns such as privacy in federated settings and fairness in algorithmic decisions. These enhancements have broadened its use in fields such as scientific discovery, autonomous systems, and automated machine learning, while open challenges remain in scalability and handling heterogeneous evaluation costs.

Introduction

Definition and Motivation

Bayesian optimization is a sequential design strategy for the global optimization of black-box functions that are expensive or infeasible to evaluate repeatedly. It models the unknown function using a prior distribution, which is updated to a posterior distribution via Bayes' rule as new function evaluations are observed. The selection of subsequent evaluation points is guided by an acquisition function designed to balance exploration of regions with high uncertainty against exploitation of areas likely to yield improved solutions. This approach is particularly motivated by scenarios where function evaluations incur significant costs, such as time-intensive simulations in engineering design, resource-heavy laboratory experiments, or computationally demanding hyperparameter tuning in machine learning. Unlike grid search or random search methods, which require numerous evaluations to cover the search space adequately, Bayesian optimization achieves sample efficiency by leveraging the probabilistic model to focus evaluations intelligently, often converging to near-optimal solutions with far fewer queries—typically tens to hundreds in low- to moderate-dimensional spaces (e.g., up to 20 dimensions). The basic procedure initiates with a prior belief about the objective function, often seeded by a small set of initial observations to inform the surrogate model. As evaluations proceed, the posterior is refined to reflect the accumulated data, and the acquisition function identifies the next query point that maximizes expected progress toward the optimum. This iterative loop enables effective optimization of functions without access to derivatives or explicit analytical forms. Bayesian optimization emerged in the mid-20th century as a method for locating maxima of noisy, derivative-free functions, with foundational contributions tracing back to Kushner's 1964 work on probabilistic improvement strategies using stochastic process models. Subsequent developments from the 1970s through the 1990s, including Mockus's Bayesian global search and Jones et al.'s efficient global optimization (EGO) framework, solidified its role in practical applications.

Comparison to Other Optimization Methods

Bayesian optimization belongs to the broader class of black-box optimization techniques, which address problems where the objective provides no gradient information and is treated as an opaque mapping from inputs to outputs. Derivative-based methods, such as gradient descent, excel at optimizing smooth, differentiable functions by following local descent directions but fail when derivatives are unavailable or the function is noisy or discontinuous. In contrast, derivative-free methods encompass a diverse set, including local search algorithms like Nelder-Mead, which perform simplex-based explorations suitable for low-dimensional, unimodal problems, and evolutionary algorithms like genetic algorithms, which maintain populations of candidate solutions to navigate multimodal landscapes through selection and mutation. The distinctive strength of Bayesian optimization lies in its probabilistic framework, which employs surrogate models—typically Gaussian processes—to approximate the objective function and quantify uncertainty, enabling informed decisions even with noisy or multimodal black-box evaluations. This approach yields high sample efficiency, often requiring only 10 to 100 evaluations to achieve strong performance on expensive functions, in contrast to the thousands typically needed by evolutionary methods or exhaustive grid search. For instance, while grid search systematically enumerates a predefined lattice of points, it becomes prohibitively inefficient in high-dimensional spaces due to exponential growth in the number of evaluations; random search, though a simple non-adaptive baseline that samples uniformly, often outperforms grid search when only a few dimensions strongly affect the objective, but it cannot exploit information from past evaluations. However, Bayesian optimization incurs significant computational overhead from updating the surrogate posterior and optimizing acquisition functions, with costs scaling cubically in the number of observations for Gaussian processes, and it suffers from the curse of dimensionality beyond roughly 20 dimensions. It is thus ill-suited to scenarios with cheap evaluations, where simpler methods like Nelder-Mead or random search suffice without the modeling burden. Bayesian optimization shines in settings like hyperparameter tuning or experimental design, where function evaluations are costly (e.g., training runs or physical experiments) and the evaluation budget covers only a small fraction of the searchable space, allowing its uncertainty-aware exploration—via surrogate models—to outperform non-adaptive alternatives.

Surrogate Models

Gaussian Processes

Gaussian processes (GPs) are non-parametric probabilistic models that define a distribution over functions, treating the unknown objective function as a sample from this distribution. A GP is specified by a mean function, often taken as zero for simplicity in Bayesian optimization, and a covariance function (kernel) that encodes assumptions about the function's properties, such as smoothness. A widely used kernel is the squared exponential (or RBF) kernel, k(x, x') = \sigma_f^2 \exp\left( -\frac{\|x - x'\|^2}{2\ell^2} \right), where \sigma_f^2 controls the vertical scale of the function variations and \ell determines the length scale of correlations. Given noisy observations D = \{(x_i, y_i)\}_{i=1}^n with y_i = f(x_i) + \epsilon_i and \epsilon_i \sim \mathcal{N}(0, \sigma_n^2), the posterior distribution over functions is also a GP, updated via Bayes' rule. The posterior mean at a new point x_* is \mu(x_*) = \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}, and the posterior variance is \sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*, where \mathbf{K} is the n \times n kernel matrix with entries K_{ij} = k(x_i, x_j), \mathbf{k}_* = [k(x_*, x_1), \dots, k(x_*, x_n)]^T, and \mathbf{y} = [y_1, \dots, y_n]^T. These expressions yield predictive distributions that quantify both the expected function value and the epistemic uncertainty, which shrinks as more data is observed. The GP hyperparameters—such as \ell, \sigma_f^2, and \sigma_n^2—are estimated by maximizing the log marginal likelihood \log p(\mathbf{y} | X) = -\frac{1}{2} \mathbf{y}^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} - \frac{1}{2} \log |\mathbf{K} + \sigma_n^2 \mathbf{I}| - \frac{n}{2} \log 2\pi, typically via gradient-based optimization such as L-BFGS. This evidence-based approach fits the model to the data while regularizing against overfitting. GPs rely on assumptions encoded by the kernel, such as stationarity and smoothness for the squared exponential kernel, which may not hold for highly non-stationary or discontinuous functions. Exact GP inference incurs a cubic computational cost O(n^3) from inverting the kernel matrix and O(n^2) storage, limiting applicability to datasets with hundreds of points; scalable approximations, such as variational sparse GPs with inducing points, mitigate this by projecting the GP onto a low-rank representation based on m \ll n pseudo-inputs, achieving O(m^3 + nm^2) complexity while preserving predictive quality. In Bayesian optimization, GPs serve as the primary surrogate model because their posterior provides calibrated uncertainty estimates alongside point predictions, enabling acquisition functions to balance exploration and exploitation effectively.
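The posterior equations above can be written compactly in code. The following is a minimal NumPy sketch under the stated assumptions (zero mean function, squared exponential kernel, Gaussian noise); the function names, hyperparameter values, and toy data are illustrative rather than drawn from any particular library.

```python
import numpy as np


def sq_exp_kernel(A, B, ell=1.0, sigma_f=1.0):
    """Squared exponential (RBF) kernel matrix between two sets of points."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * sq_dists / ell**2)


def gp_posterior(X, y, X_star, ell=1.0, sigma_f=1.0, sigma_n=1e-3):
    """Posterior mean and variance of a zero-mean GP at test points X_star."""
    K = sq_exp_kernel(X, X, ell, sigma_f) + sigma_n**2 * np.eye(len(X))
    k_star = sq_exp_kernel(X, X_star, ell, sigma_f)            # n x m cross-covariances
    alpha = np.linalg.solve(K, y)                              # (K + sigma_n^2 I)^{-1} y
    mu = k_star.T @ alpha                                      # posterior mean at X_star
    v = np.linalg.solve(K, k_star)                             # (K + sigma_n^2 I)^{-1} k_*
    var = np.diag(sq_exp_kernel(X_star, X_star, ell, sigma_f)) - np.sum(k_star * v, axis=0)
    return mu, np.maximum(var, 0.0)                            # clip tiny negative values


# Toy usage: observe f(x) = sin(x) at three points, then predict on a small grid.
X = np.array([[0.0], [1.0], [2.5]])
y = np.sin(X).ravel()
mu, var = gp_posterior(X, y, np.linspace(0.0, 3.0, 5).reshape(-1, 1))
```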

Alternative Surrogate Models

While Gaussian processes remain the standard surrogate in Bayesian optimization due to their probabilistic uncertainty estimates and smooth interpolation, several alternatives address key limitations such as computational scalability beyond small datasets (typically n > 100 observations), challenges in high-dimensional spaces (d > 10), and handling of non-continuous or conditional inputs. These models provide flexible approximations of the objective function while maintaining the ability to estimate uncertainty for acquisition function evaluation. Tree-structured Parzen estimators (TPE) offer a non-parametric approach that models promising regions of the search space using ratios of densities rather than direct regression of the objective. Introduced for hyperparameter optimization, TPE partitions observations into "good" and "bad" sets based on a performance threshold and fits separate Parzen estimators (kernel density estimates organized over tree-structured spaces) to each, enabling the acquisition function to favor configurations likely to outperform the current best. This method excels in tree-structured configuration spaces with conditional dependencies, such as neural network architectures, and is implemented in the Hyperopt library for practical hyperparameter tuning. Compared to Gaussian processes, TPE avoids costly matrix inversions, achieving better scalability for high-dimensional problems while performing competitively on benchmarks such as SVM and neural network tuning. Random forests serve as ensemble-based surrogate models by aggregating predictions from multiple decision trees, providing mean estimates and uncertainty derived from the spread of predictions across trees. In Bayesian optimization frameworks like SMAC, random forests model the objective as a regression task, naturally accommodating mixed continuous and categorical variables without requiring kernel design, which is advantageous for algorithm configuration tasks with discrete parameters. Their tree structure enables efficient handling of high-dimensional inputs (up to d ≈ 50) and non-stationary functions, outperforming Gaussian processes in scenarios with sparse or noisy data, as demonstrated on combinatorial benchmarks such as vehicle routing. Scalability arises from parallelizable tree construction, making them suitable for larger observation sets where Gaussian processes become prohibitive. Neural network-based surrogates, such as Bayesian neural networks (BNNs) and deep Gaussian processes, leverage layered architectures to capture complex, non-smooth dependencies in the data, extending beyond the smoothness assumptions of Gaussian processes. BNNs incorporate uncertainty through priors on weights, enabling variational inference or Monte Carlo sampling for posterior predictions, which proves effective for non-stationary and multi-modal objectives in high dimensions (d > 20). For instance, infinite-width BNNs have demonstrated competitive performance on low-dimensional benchmarks like Ackley and Branin, while outperforming Gaussian processes in high-dimensional and non-stationary settings, particularly with a large number of observations (e.g., thousands of queries), due to their adaptive representations. These models are particularly useful in hyperparameter tuning where the response surface exhibits sharp transitions. Student-t processes generalize Gaussian processes by using a Student-t distribution for the marginals, introducing a degrees-of-freedom parameter that imparts heavier tails and greater robustness to outliers in the observations. Derived from hierarchical Gaussian process models with an inverse Wishart prior on the covariance, they retain closed-form predictions while allowing the predictive variance to adapt to the observed data values, improving reliability in noisy or heavy-tailed environments such as hyperparameter tuning. On synthetic tasks, Student-t processes outperform Gaussian processes in outlier-contaminated settings without additional computational overhead.
| Surrogate Model | Scalability (Time Complexity) | Flexibility | Key Strengths |
|---|---|---|---|
| Gaussian Processes | O(n³) for exact inference | High for continuous, low-d spaces | Smooth uncertainty, exact posteriors |
| Tree-structured Parzen Estimators | O(n log n) for density fitting | Excellent for conditional/discrete spaces | Efficient in high-d hyperparameter tuning |
| Random Forests | O(n log n) per tree | Strong for mixed categorical/continuous | Handles high-d, non-stationary data |
| Neural Network-based (e.g., BNNs) | O(n²) or better with approximations | High for non-smooth, multi-modal objectives | Adapts to complex structures in d > 20 |
| Student-t Processes | O(n³), same as GPs | Similar to GPs, with added robustness | Heavier tails for noisy data |
Gaussian processes are preferred for low-dimensional (d < 10) continuous problems where precise uncertainty quantification is critical, but alternatives like random forests or TPE should be chosen for high-dimensional (d > 10) or non-continuous inputs to mitigate the curse of dimensionality and improve scalability. For instance, random forests suit mixed-variable algorithm configuration, while TPE excels in conditional hyperparameter spaces. Neural and Student-t variants are appropriate when the objective is non-smooth or outlier-prone, respectively.
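As a concrete illustration of an alternative surrogate, the following hedged sketch shows how a random forest can supply both a mean prediction and a simple uncertainty proxy from the spread of per-tree predictions, in the spirit of (but not identical to) the estimator used in SMAC; the dataset, dimensions, and helper name are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def forest_mean_std(forest, X):
    """Surrogate mean and an uncertainty proxy from the spread of per-tree predictions."""
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])   # (n_trees, n_points)
    return per_tree.mean(axis=0), per_tree.std(axis=0)


# Fit the forest surrogate on observed (configuration, objective) pairs.
rng = np.random.default_rng(0)
X_obs = rng.uniform(-2, 2, size=(40, 3))                  # 40 observed configurations in 3-d
y_obs = np.sum(X_obs**2, axis=1) + 0.1 * rng.normal(size=40)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_obs, y_obs)

# Score a pool of candidates; mu and sigma can then feed any acquisition function.
X_cand = rng.uniform(-2, 2, size=(500, 3))
mu, sigma = forest_mean_std(forest, X_cand)
```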

Acquisition Functions

Role and Design Principles

In Bayesian optimization, the acquisition function serves as a utility function defined over the posterior distribution provided by the surrogate model, guiding the selection of the next evaluation point x_{t+1} = \arg\max_x \alpha(x | D_t), where D_t represents the data observed up to iteration t and \alpha quantifies the value of evaluating the objective at x. This mechanism balances the inherent trade-off between exploration—sampling regions of high predictive uncertainty, often indicated by high posterior variance—and exploitation—focusing on areas with high predicted values from the surrogate—to efficiently search the parameter space of expensive black-box functions. Within a regret-minimization framing, the acquisition function aims to minimize cumulative regret, defined as the accumulated difference between the optimal function value and the values at the selected points, ensuring no-regret guarantees under certain conditions such as sub-Gaussian noise. The general form of an acquisition function is \alpha(x) = \mathbb{E}[u(f(x)) | D_t], where the expectation is taken with respect to the posterior predictive distribution p(f(x) | D_t), and u encodes a utility such as the improvement over the current best observed value. This formulation allows the acquisition function to adapt to the evolving posterior, prioritizing points that maximize anticipated gains while accounting for uncertainty. Design principles for acquisition functions emphasize desirable properties to ensure effective optimization. Monotonicity requires that the acquisition value does not decrease as the predicted improvement over the incumbent grows, preserving the incentive to search promising regions. Tractability demands efficient computation and optimization of \alpha, often leveraging closed-form expressions or cheap approximations to handle high-dimensional spaces without excessive cost. Handling noisy observations is critical, with acquisition functions designed to incorporate noise models in the posterior to avoid overconfidence in uncertain estimates. From an information-theoretic perspective, approaches like the knowledge gradient treat the acquisition as the expected increase in decision value from sampling, maximizing the reduction in posterior uncertainty or the expected future utility. Challenges in acquisition function design include extending to multi-objective and cost-aware settings, where information-theoretic functions such as entropy search—which minimize the entropy of the posterior over the global optimum's location—can be adapted to balance multiple criteria such as evaluation cost or constraints. Additionally, optimizing \alpha(x) itself—often a non-convex problem—requires techniques like multi-start optimization or gradient-based methods to find reliable maximizers, particularly as dimensionality increases.

Specific Acquisition Functions

Expected improvement (EI) is one of the most widely used acquisition functions in Bayesian optimization, quantifying the expected gain in function value over the current best observation f(x^+) when evaluating at a new point x. Formally, under a Gaussian process surrogate, it is given by \alpha_{\text{EI}}(x) = \mathbb{E}[\max(0, f(x) - f(x^+) - \xi) \mid \mathcal{D}] = (\mu(x) - f(x^+) - \xi) \Phi(z) + \sigma(x) \phi(z), where z = \frac{\mu(x) - f(x^+) - \xi}{\sigma(x)}, \Phi and \phi are the cumulative distribution and probability density functions of the standard normal distribution, respectively, \mu(x) and \sigma(x) are the posterior mean and standard deviation at x, and \xi \geq 0 is a non-negative parameter that controls the exploration-exploitation trade-off by penalizing improvements below a margin. This function accounts for both the predicted improvement and its uncertainty, making it suitable for noisy observations when the posterior incorporates measurement noise. The point selected by EI is invariant to positive affine (linear) transformations of the objective function, which preserve the relative ordering of candidates. Probability of improvement (PI) offers a simpler alternative, measuring the likelihood that evaluating at x yields a better outcome than the current best f(x^+). It is defined as \alpha_{\text{PI}}(x) = \Phi\left( \frac{\mu(x) - f(x^+) - \xi}{\sigma(x)} \right), again using the Gaussian posterior mean and standard deviation, with \xi serving a similar role in encouraging exploration. Unlike EI, PI does not account for the magnitude of potential improvements, focusing solely on the probability of surpassing the incumbent, which can lead to overly conservative selections in high-uncertainty regions. The upper confidence bound (UCB) acquisition function balances prediction and uncertainty through a direct additive form, selecting points that maximize \alpha_{\text{UCB}}(x) = \mu(x) + \kappa \sigma(x), where \kappa > 0 is a parameter tuning the exploration strength. Analyzed in the Gaussian process upper confidence bound (GP-UCB) algorithm, this approach provides theoretical guarantees, including sublinear regret bounds of O(\sqrt{T \gamma_T}) after T iterations for a kernel with maximum information gain \gamma_T, demonstrating no-regret optimization for functions drawn from the GP prior. Other acquisition functions emphasize information-theoretic principles for more global search strategies. The knowledge gradient (KG) measures the expected increase in the value of the optimal decision following a single evaluation, formulated as the difference in expected maximum posterior values before and after observing f(x), often computed via Monte Carlo sampling from the posterior. Entropy search (ES) directly maximizes the expected reduction in uncertainty about the global optimum's location, approximating the mutual information between the observation at x and the maximizer x^* through sampling over a distribution of potential optima. Hybrids, such as portfolios combining improvement-based criteria with entropy-based methods, improve robustness by selecting among multiple acquisition functions at each step, as in the entropy search portfolio approach, which has shown empirical benefits in diverse optimization landscapes.
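The three closed-form criteria above translate directly into code. The following NumPy/SciPy sketch assumes a posterior mean mu and standard deviation sigma are available at candidate points (e.g., from a Gaussian process surrogate) and follows the maximization convention used in the formulas; function names and default parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm


def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI for maximization: E[max(0, f(x) - f_best - xi)] under a Gaussian posterior."""
    sigma = np.maximum(sigma, 1e-12)            # avoid division by zero at observed points
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)


def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI: probability that f(x) exceeds the incumbent best by at least xi."""
    sigma = np.maximum(sigma, 1e-12)
    return norm.cdf((mu - f_best - xi) / sigma)


def upper_confidence_bound(mu, sigma, kappa=2.0):
    """GP-UCB score: optimistic estimate mu + kappa * sigma."""
    return mu + kappa * sigma
```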

Optimization Process

Sequential Bayesian Optimization

Sequential Bayesian optimization forms the core of the Bayesian optimization process, where evaluations of the objective function are performed one at a time in an iterative manner to efficiently search for the global optimum of expensive black-box functions. This approach leverages a probabilistic surrogate model, typically a Gaussian process, to approximate the objective and guide the selection of the next point by maximizing an acquisition function that balances exploration and exploitation. The process is particularly suited for settings where each function evaluation is costly, such as hyperparameter tuning or simulation-based design, allowing for informed decisions with a limited evaluation budget. The algorithm begins by initializing a dataset D_0 = \{(x_i, y_i)\}_{i=1}^{n_0} using a space-filling design, such as Latin hypercube sampling, to provide an initial set of observations for the surrogate. For each iteration t = 1 to T, the surrogate is fitted to the current dataset D_{t-1}, yielding a posterior over the objective f. An acquisition function \alpha(x | D_{t-1}) is then optimized to select the next point x_t = \arg\max_x \alpha(x | D_{t-1}). The objective is evaluated at x_t to obtain y_t = f(x_t) + \epsilon, where \epsilon represents observation noise (which may be zero in noise-free settings), and the dataset is updated as D_t = D_{t-1} \cup \{(x_t, y_t)\}. This loop continues until a stopping criterion is met, after which the optimum is inferred from the best observed point in D_T. Optimizing the acquisition function to find x_t is a non-convex global optimization problem in its own right, often addressed using gradient-based methods like L-BFGS-B, which exploit analytical derivatives of common acquisition functions for efficiency. To mitigate local optima, multi-start strategies initialize multiple L-BFGS-B runs from random or promising points and select the best result; alternatively, derivative-free evolutionary methods such as CMA-ES (a variant of evolution strategies) can be employed for robust global search. These techniques help ensure reliable maximization even in higher-dimensional spaces. In noise-free settings, the surrogate assumes deterministic evaluations (\epsilon = 0), leading to a posterior with zero variance at observed points. For noisy observations, the model incorporates a noise variance \sigma^2_\epsilon > 0 in the likelihood, adjusting the posterior to account for uncertainty in repeated evaluations and preventing overconfidence at sampled locations. This adaptation is crucial for real-world applications where measurement error is inevitable. Common stopping criteria include exhaustion of the evaluation budget T, or convergence indicators such as the maximum expected improvement falling below a threshold (e.g., 1% of the current best objective value). These ensure the process terminates practically while maximizing information gain. The following outlines the sequential Bayesian optimization algorithm (assuming minimization of a function f):
Initialize dataset D_0 = {(x_i, y_i)}_{i=1}^{n_0} using space-filling design
For t = 1 to T:
    Fit surrogate model (e.g., Gaussian process) to D_{t-1}
    Optimize acquisition function α(x | D_{t-1}) to obtain x_t = arg max_x α(x | D_{t-1})
    Evaluate y_t = f(x_t) + ε  (ε ~ N(0, σ_ε²); σ_ε = 0 for noise-free)
    Update D_t = D_{t-1} ∪ {(x_t, y_t)}
    If stopping criterion met (e.g., max α < threshold or t = T), break
Return the point in D_T with the minimum observed y
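A minimal runnable version of this loop can be assembled from standard libraries. The sketch below uses scikit-learn's Gaussian process regressor and multi-start L-BFGS-B (with finite-difference gradients) over a one-dimensional toy objective, with expected improvement written for minimization to match the pseudocode; the objective, bounds, budget, and kernel settings are illustrative assumptions rather than part of the algorithm itself.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel


def f(x):
    """Toy expensive black-box objective (stands in for a simulation or training run)."""
    return np.sin(3 * x[0]) + 0.5 * x[0] ** 2


bounds = np.array([[-2.0, 2.0]])                     # box constraints on the single input
rng = np.random.default_rng(0)

# Initial space-filling design D_0 (uniform random here; Latin hypercube also common).
X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(4, 1))
y = np.array([f(x) for x in X])


def neg_ei(x, gp, f_best, xi=0.01):
    """Negative expected improvement (minimization convention) at a single point."""
    mu, sigma = gp.predict(x.reshape(1, -1), return_std=True)
    sigma = max(sigma[0], 1e-12)
    z = (f_best - mu[0] - xi) / sigma
    return -((f_best - mu[0] - xi) * norm.cdf(z) + sigma * norm.pdf(z))


for t in range(20):                                  # evaluation budget T
    gp = GaussianProcessRegressor(kernel=ConstantKernel(1.0) * RBF(length_scale=1.0),
                                  alpha=1e-6, normalize_y=True).fit(X, y)

    # Multi-start L-BFGS-B on the negated acquisition; gradients are finite-differenced.
    starts = rng.uniform(bounds[:, 0], bounds[:, 1], size=(10, 1))
    results = [minimize(neg_ei, x0, args=(gp, y.min()), bounds=bounds, method="L-BFGS-B")
               for x0 in starts]
    x_next = min(results, key=lambda r: r.fun).x

    X = np.vstack([X, x_next])                       # evaluate the objective and grow D_t
    y = np.append(y, f(x_next))

print("best x:", X[np.argmin(y)], "best y:", y.min())
```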

Batch and Parallel Optimization

Traditional Bayesian optimization proceeds sequentially, selecting and evaluating one point at a time while updating the surrogate model after each observation. However, with the availability of parallel computing resources, evaluating multiple points simultaneously—known as batch or parallel optimization—can substantially reduce overall computation time for expensive objective functions. This adaptation addresses the challenge of non-myopic selection, where the acquisition function must account for joint utility across a batch of q points rather than a single point, often leading to more conservative and diverse selections to mitigate information overlap. A key approach is the q-Expected Improvement (qEI), which extends the single-point Expected Improvement to batches by quantifying the anticipated maximum improvement from any point in the proposed set \mathbf{x} = (x_1, \dots, x_q). Formally, \alpha(\mathbf{x}) = \mathbb{E}\left[\max\left(0, f(x^+) - \min_{i=1}^q f(x_i)\right)\right], where f(x^+) denotes the current best (minimum) observed value (assuming minimization), and the expectation is taken over the joint posterior predictive distribution of the surrogate model, such as a Gaussian process. For q > 1, exact computation is intractable due to the maximum over correlated predictions, so approximations rely on Monte Carlo sampling: samples are drawn from the multivariate Gaussian posterior to simulate batch outcomes and estimate the expectation. This method promotes batches that jointly maximize improvement while encouraging diversity through correlation penalties implicit in the joint distribution. Alternative heuristics simplify batch selection for practicality. Local penalization employs a sequential greedy strategy, iteratively selecting points by maximizing a single-point acquisition (e.g., Expected Improvement) while applying a penalty to the acquisition surface around previously chosen batch points; the penalty is scaled by an empirical estimate of the objective's Lipschitz constant to discourage clustering and ensure exploration. Fantasizing techniques approximate the joint effect by simulating unobserved outcomes for prior batch selections—treating them as realized—to update the surrogate before choosing the next point, enabling standard sequential acquisition functions in a myopic, parallelizable loop. A related heuristic, the Kriging Believer, specifically hallucinates these simulated values as the surrogate's posterior mean at each pending point, providing a simple update rule that balances belief in the model with ongoing optimization. Scalability in batch optimization benefits from methods that generate diverse candidates efficiently, particularly in distributed environments where evaluations occur across multiple processors. For instance, batch selection can integrate clustering to identify representative points from large pools, facilitating parallel execution while minimizing redundant evaluations. Post-2015 developments, such as the BoTorch library, enable GPU-accelerated Monte Carlo approximations for qEI, supporting high-dimensional batches and large sample sizes through Monte Carlo integration, thus making parallel Bayesian optimization viable for industrial-scale applications.
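The Monte Carlo estimator for qEI described above can be sketched in a few lines, assuming the surrogate provides the joint Gaussian posterior (mean vector and covariance matrix) over the q candidate points; the numbers below are illustrative, and the minimization convention matches the formula in this section.

```python
import numpy as np


def q_expected_improvement(mu, cov, f_best, n_samples=10_000, rng=None):
    """Monte Carlo qEI for minimization: E[max(0, f_best - min_i f(x_i))].

    mu  : (q,) posterior mean at the q batch candidates
    cov : (q, q) joint posterior covariance over the candidates
    """
    rng = rng or np.random.default_rng()
    samples = rng.multivariate_normal(mu, cov, size=n_samples)    # joint draws over the batch
    improvement = np.maximum(0.0, f_best - samples.min(axis=1))   # best outcome within each draw
    return improvement.mean()


# Usage: score a hypothetical batch of q = 3 points given a GP's joint posterior there.
mu = np.array([0.8, 1.1, 0.9])
cov = np.array([[0.20, 0.05, 0.01],
                [0.05, 0.15, 0.02],
                [0.01, 0.02, 0.25]])
print(q_expected_improvement(mu, cov, f_best=1.0))
```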

History

Early Foundations (1960s-1990s)

The foundations of Bayesian optimization emerged in the 1960s through efforts to address optimization in noisy environments. In 1964, Harold J. Kushner introduced a method for locating the maximum of an arbitrary multipeak curve amid noise, marking an early precursor to sequential strategies for handling uncertainty in black-box functions. The 1970s and 1980s saw substantial theoretical progress, primarily driven by Jonas Mockus and collaborators in the Soviet literature, who formalized Bayesian approaches to the global optimization of multimodal functions. Mockus's series of publications established Bayesian optimization as a framework for minimizing the expected deviation from the global extremum using probabilistic surrogates, with initial developments focusing on one-dimensional and low-dimensional cases. During this era, Mockus introduced the Probability of Improvement (PI) criterion in the early 1970s as a rule for selecting promising evaluation points based on the likelihood of surpassing the current best observation. In 1978, Mockus, V. Tiešis, and A. Žilinskas proposed the Expected Improvement (EI) acquisition function, quantifying the anticipated gain over the incumbent solution to guide efficient exploration-exploitation trade-offs. Antanas Žilinskas played a key role alongside Mockus, contributing algorithms for multidimensional global optimization and statistical selection criteria for use in noisy settings during the 1970s and 1980s. These works emphasized surrogate models based on stochastic process priors to approximate expensive objectives, though computational constraints—such as the cubic scaling of model updates with the number of observations—limited applications to low-dimensional problems with fewer than 20 variables. In 1989, Jerome Sacks, William J. Welch, Toby J. Mitchell, and Henry P. Wynn advanced the methodology by applying kriging (a form of Gaussian process regression) to the design and analysis of computer experiments, providing a rigorous statistical basis for predicting responses and their uncertainties in deterministic simulations. Mockus's 1989 book, Bayesian Approach to Global Optimization: Theory and Applications, synthesized these contributions, detailing theoretical guarantees such as convergence rates and practical examples in engineering design, while highlighting the method's efficiency for functions that can be evaluated only a few dozen times. By the late 1990s, the field transitioned toward practicality, exemplified by Donald R. Jones, Matthias Schonlau, and William J. Welch's 1998 paper on efficient global optimization of expensive black-box functions. This work integrated Gaussian processes as surrogates with the expected improvement criterion, demonstrating superior performance over alternatives such as response surface methods on test functions, and facilitating the development of initial software tools for broader adoption.

Modern Developments (2000s-Present)

The 2000s marked a pivotal period for Bayesian optimization, shifting from foundational theory to practical implementations with rigorous guarantees. A landmark advancement was the introduction of the Gaussian Process Upper Confidence Bound (GP-UCB) algorithm by Srinivas et al. in 2010, which provided the first no-regret bounds for GP-based optimization in the bandit setting by bounding cumulative regret in terms of information gain. This theoretical progress spurred applications in hyperparameter optimization for machine learning, where expensive black-box evaluations are common. Tools like SMAC, developed by Hutter et al. in 2011, leveraged random forests as surrogate models for sequential model-based algorithm configuration, demonstrating superior performance on algorithm tuning benchmarks compared to grid search or random search. Similarly, Spearmint, by Snoek et al. in 2012, popularized GP-based optimization for tuning deep neural networks, achieving significant improvements in generalization performance on datasets such as CIFAR-10 with fewer evaluations than alternatives. The 2010s emphasized scalability, parallelization, and handling complex search spaces. Batch optimization methods gained prominence to exploit parallel computing resources; for instance, the q-Expected Improvement (qEI) acquisition function, formalized in works around 2012, extended single-point EI to select batches of q points that maximize joint expected improvement, reducing wall-clock time in parallel settings without substantial regret increase. Open-source libraries accelerated adoption: GPyOpt, released in 2014 and detailed in subsequent documentation, integrated GPs with various acquisition functions for constrained and multi-fidelity problems. BoTorch, announced in 2019 and published in 2020 by Balandat et al., introduced a modular PyTorch-based framework supporting Monte Carlo acquisition functions and sample average approximation, enabling efficient optimization for up to thousands of evaluations on high-performance hardware. To tackle high-dimensional spaces (d > 20), techniques like random embeddings emerged; the REMBO method by Wang et al. in 2013 projected high-dimensional inputs (up to billions of dimensions) onto lower-dimensional subspaces via random linear maps, preserving optimization guarantees while outperforming direct GP methods on synthetic high-dimensional benchmarks. Entering the 2020s, Bayesian optimization integrated deeply with AutoML and emerging paradigms, enhancing scalability for massive datasets. NASBOT, proposed by Kandasamy et al. in 2018 but widely applied in the following years, used GPs with optimal transport metrics for neural architecture search, discovering competitive architectures on CIFAR-10 with roughly 10x fewer evaluations than baselines. Scalable surrogate models advanced through Stochastic Variational Gaussian Processes (SVGPs), introduced by Hensman et al. in 2013 and refined over the decade, which approximate full GPs via variational inference on inducing points, scaling to millions of data points with minimal accuracy loss in BO loops. Applications proliferated in scientific domains; in quantum physics, BO optimized state preparation protocols in ultracold atoms by 2024, achieving fractional quantum Hall states with 50% fewer experiments than grid search via GP modeling of noisy observables. Post-2020 developments addressed distributed and safety-critical settings. Federated Bayesian optimization emerged to preserve data privacy, aggregating GP posteriors across decentralized nodes without sharing raw data. In AI safety, Bayesian methods tuned large language models for alignment from 2023 onward, reducing harmful outputs while balancing objectives in RLHF pipelines. Theoretical advances included no-regret guarantees for broader classes of objectives. Hybrids with reinforcement learning integrated BO for meta-optimization, where policy search in Markov decision processes used BO to tune hyperparameters online, outperforming direct policy search on continuous control tasks like MuJoCo. Despite this progress, open challenges persist in ultra-high dimensions (100D+) and multi-fidelity settings. The curse of dimensionality limits GP expressiveness beyond d = 50 without embeddings, leading to poor regret scaling, as noted in surveys such as Shahriari et al. (2016).

Applications

Machine Learning and Hyperparameter Tuning

Bayesian optimization plays a pivotal role in machine learning workflows by efficiently tuning hyperparameters—configuration settings such as learning rates, tree depths in random forests, or kernel parameters in support vector machines (SVMs) that control model behavior without being learned from data. Unlike exhaustive grid search, which scales poorly with dimensionality, or random search, which samples uniformly without exploiting prior evaluations, Bayesian optimization minimizes the number of required evaluations—often by orders of magnitude—for computationally expensive training runs, particularly for deep neural networks where each trial involves hours or days of computation. The sequential model-based optimization (SMBO) framework formalizes this process by treating hyperparameter tuning as a black-box optimization problem, where the objective is typically the cross-validation error or a similar performance metric on held-out data, and evaluations are sequential to build an increasingly accurate surrogate model of the response surface. In practice, SMBO iteratively fits a probabilistic surrogate (e.g., a Gaussian process) to observed validation scores and selects the next hyperparameters via an acquisition function that balances exploration and exploitation, enabling effective optimization even in high-dimensional spaces. Prominent case studies illustrate its impact: the Auto-sklearn system (2015) integrates Bayesian optimization within an AutoML pipeline to jointly optimize algorithm selection and hyperparameters, outperforming prior AutoML baselines on diverse datasets by leveraging meta-learning to warm-start the search. Similarly, the Optuna framework (2019) employs a variant of Bayesian optimization using tree-structured Parzen estimators to handle complex search spaces, facilitating scalable tuning in distributed environments. In competitive settings such as Kaggle competitions, Bayesian optimization has been applied to tune gradient boosting models, refining parameters such as subsample ratios and maximum depths to achieve leaderboard improvements over default configurations, as demonstrated in risk classification tasks. A key advantage in this domain is Bayesian optimization's native support for mixed continuous and discrete parameter spaces, including conditional dependencies—such as enabling dropout rates only if the number of layers exceeds one in neural architectures—which grid or random search handle inefficiently without custom adaptations. Post-2020 applications extend to large-scale models, where it tunes hyperparameters for transformers such as BERT, optimizing learning rate schedules and regularization to minimize validation loss while reducing search costs compared to random sampling. Recent advances as of 2025 include its use in optimizing prompts for large language models (LLMs), enhancing performance in black-box tasks with limited evaluations. The objective is often framed as minimizing simple regret, defined as the difference between the best observed performance and the true optimum, or directly as the cross-validation error, to ensure generalizable models.
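As a concrete, hedged example of SMBO-style tuning in practice, the following sketch uses the Optuna framework mentioned above (whose default sampler is a TPE variant) to minimize the cross-validation error of a gradient boosting classifier; the dataset, parameter ranges, and trial budget are illustrative choices, not prescriptions.

```python
import optuna
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)


def objective(trial):
    """One trial = one expensive model evaluation; the sampler proposes the next configuration."""
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
    }
    model = GradientBoostingClassifier(random_state=0, **params)
    # Objective: cross-validation error, as discussed above (lower is better).
    return 1.0 - cross_val_score(model, X, y, cv=3).mean()


study = optuna.create_study(direction="minimize")    # default sampler is TPE-based
study.optimize(objective, n_trials=25)
print(study.best_params, study.best_value)
```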

Engineering and Scientific Domains

Bayesian optimization has been extensively applied in engineering design, particularly in aerodynamic design, where simulations of fluid flow are computationally expensive. A seminal example involves the optimization of airfoil shapes for aerodynamic performance, demonstrating the efficiency of efficient global optimization (EGO), an early form of Bayesian optimization using Gaussian processes, which reduced the number of evaluations needed to achieve near-optimal lift-to-drag ratios compared to traditional methods. In modern contexts, Bayesian optimization integrates with computational fluid dynamics (CFD) simulations to refine design parameters while minimizing high-fidelity evaluations. Similarly, in robotics, it facilitates controller tuning by optimizing impedance parameters to enhance stability and performance in dynamic environments, such as robotic arms performing precise manipulations, where noisy feedback is common. For sensor placement in networks, Bayesian optimization selects optimal locations to maximize information gain in monitoring systems, such as structural health monitoring networks, reducing uncertainty in predictions with fewer sensors. In scientific domains, Bayesian optimization addresses costly experimental evaluations, such as in drug discovery, where it optimizes molecular properties represented as SMILES strings to maximize binding affinity while navigating vast chemical spaces. This approach has accelerated candidate identification by iteratively suggesting promising molecules, outperforming random screening in terms of convergence speed. In materials science, it guides alloy composition optimization for desired properties such as strength and corrosion resistance, with benchmarking across materials datasets demonstrating efficiency gains over traditional search methods. In chemistry, Bayesian optimization maximizes reaction yields by tuning conditions such as temperature, catalysts, and reagents, as demonstrated in optimizing synthetic reactions where yields improved in fewer iterations than manual methods. Notable case studies underscore its impact. The 1990s airfoil optimization highlighted Bayesian methods' sample efficiency in simulation-driven design, a foundation for contemporary integrations with CFD tools. Post-2015 applications in computational chemistry employ Bayesian optimization to construct potential energy surfaces for molecular systems, enabling accurate prediction of reaction pathways with fewer quantum mechanical calculations. In environmental modeling during the 2020s, it calibrates parameters in complex hydrological or climate models, improving simulation accuracy for flood prediction or subsurface flow by automating the fitting of uncertain inputs such as permeability. Multi-fidelity Bayesian optimization extends these applications by combining low-fidelity (cheap, approximate) models with high-fidelity (accurate but expensive) ones, such as using coarse simulations within Gaussian process surrogates to inform engineering designs like airfoil shapes, achieving cost reductions without sacrificing optimality. As of 2025, it has also gained traction in automated experimentation for optimizing conditions in self-driving laboratories. However, challenges persist in these domains, including handling noisy real-world data from experiments, which Bayesian methods mitigate through robust noise modeling. Safety constraints are addressed via constrained Bayesian optimization, ensuring designs in aerospace or chemical processes avoid unsafe regions, such as exceeding material stress limits, while maintaining exploration efficiency.
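As a small illustration of the constrained formulation mentioned above, the following hedged sketch scores candidate designs by expected improvement weighted by the posterior probability that a safety constraint is satisfied, assuming independent Gaussian posteriors for the objective and the constraint; all names and numbers are hypothetical.

```python
import numpy as np
from scipy.stats import norm


def constrained_ei(mu_f, sigma_f, f_best, mu_c, sigma_c, c_limit, xi=0.01):
    """EI for minimizing the objective, weighted by P(constraint <= c_limit)."""
    sigma_f = np.maximum(sigma_f, 1e-12)
    z = (f_best - mu_f - xi) / sigma_f
    ei = (f_best - mu_f - xi) * norm.cdf(z) + sigma_f * norm.pdf(z)
    prob_feasible = norm.cdf((c_limit - mu_c) / np.maximum(sigma_c, 1e-12))
    return ei * prob_feasible


# Hypothetical candidate designs: posteriors over drag (objective) and stress (constraint).
mu_f, sigma_f = np.array([0.42, 0.38]), np.array([0.05, 0.12])
mu_c, sigma_c = np.array([180.0, 240.0]), np.array([10.0, 25.0])    # predicted stress in MPa
scores = constrained_ei(mu_f, sigma_f, f_best=0.45, mu_c=mu_c, sigma_c=sigma_c, c_limit=200.0)
print(scores)   # higher score = larger expected improvement that is also likely feasible
```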