Rejection sampling
Rejection sampling, also known as the acceptance-rejection method, is a fundamental Monte Carlo algorithm used to generate independent and identically distributed samples from a target probability density function f(x) that is difficult to sample directly, by leveraging a simpler proposal density g(x) from which samples can be easily drawn.[1][2][3] The method works by repeatedly proposing candidates from g(x) and accepting them with a probability that ensures the accepted samples follow the target distribution f(x), provided that f(x) \leq c g(x) for some finite constant c \geq 1.[1][2]
The core algorithm for the continuous case proceeds as follows: generate a candidate Y from the proposal distribution with density g(x); independently generate a uniform random variable U \sim \text{Uniform}(0,1); accept Y as a sample X if U \leq f(Y)/(c g(Y)), otherwise reject and repeat the process.[1][2] This acceptance criterion guarantees that the conditional distribution of accepted samples is exactly f(x), as the probability of acceptance adjusts for the discrepancy between f and g.[3] Key requirements are a finite constant c = \sup_x \{f(x)/g(x)\} and the ability to evaluate both f(x) and g(x), possibly only up to a normalizing constant.[1][2] The choice of proposal g(x) is crucial; an ideal g closely approximates the shape of f(x) to minimize c, thereby improving efficiency.[1]
Efficiency is determined by the expected number of iterations, which follows a geometric distribution with success probability 1/c, yielding an expected value of c proposals per accepted sample.[2][1] While the method provides exact samples without bias, it can become inefficient in high dimensions or when c is large, leading to high rejection rates, and is often contrasted with alternatives like importance sampling that reuse rejected samples via weights.[3] Rejection sampling applies to both continuous and discrete distributions, with adaptations for unnormalized densities common in Bayesian inference and statistical physics.[2][3]
Introduction
History
Rejection sampling emerged as a foundational technique within the broader development of Monte Carlo methods during the 1940s, primarily driven by efforts to simulate complex physical processes on early computers. John von Neumann proposed the core accept-reject mechanism in unpublished notes and a 1947 letter to Stanislaw Ulam, while working at Los Alamos National Laboratory on neutron diffusion simulations for nuclear weapons research. This approach addressed the challenge of generating nonuniform random variates from known distributions using uniform random numbers, marking an early innovation in statistical sampling for computational physics.[4][5]
The method gained formal recognition in the early 1950s through von Neumann's published work, including a 1951 report detailing rejection algorithms for sampling trigonometric functions, which improved efficiency in three-dimensional simulations. These developments evolved rejection sampling from rudimentary Monte Carlo implementations into a versatile tool for approximating integrals and distributions in reliability and physics contexts.[5][6]
Rejection sampling was further popularized in statistics and applied mathematics through John M. Hammersley and David C. Handscomb's influential 1964 monograph Monte Carlo Methods, which systematically described the technique alongside variance reduction strategies and provided theoretical foundations for its use in multidimensional problems. This book solidified its role in transitioning from naive uniform sampling to efficient methods for complex, non-uniform distributions. A significant advancement came in 1992 with W. R. Gilks and P. Wild's introduction of adaptive rejection sampling, tailored for log-concave densities in Gibbs sampling applications, enhancing adaptability and reducing rejection rates in Bayesian inference.[7][8][9]
Definition and Motivation
Rejection sampling is a fundamental Monte Carlo method for generating independent and identically distributed (i.i.d.) samples from a target probability density function f(x) when direct sampling is intractable. It relies on an auxiliary proposal distribution g(x), from which samples can be drawn efficiently, and selectively accepts or rejects these proposals based on the ratio of the target to the proposal density. This approach ensures that the accepted samples are exactly distributed according to f(x), making it a cornerstone technique in statistical simulation and inference.
The primary motivation for rejection sampling arises in scenarios where the target density f(x) is known only up to an unknown normalizing constant, such as in Bayesian posterior distributions or complex probabilistic models, rendering methods like inverse transform sampling impractical because they require evaluating and inverting the cumulative distribution function. In contrast, rejection sampling only necessitates pointwise evaluations of the unnormalized f(x) and g(x), allowing it to handle distributions that are difficult or impossible to sample directly. This is particularly valuable in Monte Carlo applications, where i.i.d. samples are essential for unbiased estimation of integrals, expectations, or probabilities via the law of large numbers.
At a high level, the technique constructs a bounding "envelope" around the target density using scaled versions of the proposal distribution, ensuring that proposals falling under the target curve are retained while others are discarded. Developed by John von Neumann in the context of early Monte Carlo simulations for physics problems, it provides an exact sampling mechanism without approximation, though efficiency depends on the choice of proposal.[10]
Mathematical Foundations
Target and Proposal Distributions
In rejection sampling, the target distribution refers to the desired probability distribution from which independent samples are to be generated. It is typically defined by an unnormalized density function f(\mathbf{x}), which is proportional to the true probability density function p(\mathbf{x}) = f(\mathbf{x}) / Z, where Z = \int f(\mathbf{x}) \, d\mathbf{x} is an unknown normalizing constant that often cannot be computed analytically. This setup is particularly useful in contexts such as Bayesian inference, where the target arises as a posterior distribution with an intractable normalization factor.[11]
The proposal distribution, denoted by density g(\mathbf{x}), serves as an auxiliary distribution that is easy to sample from directly, such as standard uniforms, normals, or exponentials. A key requirement is that it forms an envelope for the target, satisfying the inequality
f(\mathbf{x}) \leq M g(\mathbf{x})
for all \mathbf{x} in the domain, where M \geq 1 is a finite constant at least as large as the supremum of f(\mathbf{x}) / g(\mathbf{x}). This domination condition ensures that the proposal covers the target without underestimating its density in any region.[12]
The choice of proposal distribution is critical for practical efficiency, as it directly influences the value of M; a smaller M reduces wasted computations. Ideally, g(\mathbf{x}) should approximate the shape of p(\mathbf{x}) closely, such as using a uniform proposal over a bounded interval for targets with compact support, or a Gaussian proposal matched to the target's mean and variance for approximately normal targets. In higher dimensions, proposals like multivariate normals or mixtures may be employed when they align well with the target's modal structure.[11][12]
Both distributions must share compatible supports, meaning the support of g(\mathbf{x}) must include the entire support of p(\mathbf{x}) (i.e., g(\mathbf{x}) > 0 wherever f(\mathbf{x}) > 0) to ensure complete coverage and avoid biasing the samples toward subsets of the target space. Failure to match supports can lead to incomplete exploration of the target, rendering the method invalid.[12]
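A minimal numerical sketch of this choice is given below, assuming (purely for illustration) a standard normal target with a standard Cauchy proposal, and using SciPy's minimize_scalar to locate the supremum of f(x)/g(x); the heavier Cauchy tails keep the ratio bounded.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import cauchy, norm

# Illustrative pair: standard normal target f, standard Cauchy proposal g.
f, g = norm.pdf, cauchy.pdf

# M = sup_x f(x)/g(x); the ratio is symmetric in x, so searching x >= 0 suffices.
res = minimize_scalar(lambda x: -f(x) / g(x), bounds=(0.0, 10.0), method="bounded")
M = -res.fun
print(M, res.x)                      # M ~= 1.52, attained near x = 1

# Sanity check of the domination condition f(x) <= M g(x) on a grid.
xs = np.linspace(-10, 10, 2001)
print(np.max(f(xs) / (M * g(xs))))   # ~= 1.0: the envelope is tight but valid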
Acceptance Probability
In rejection sampling, samples are proposed from the distribution with density g(y), and an auxiliary uniform random variable u \sim \text{Uniform}(0,1) is generated independently. The proposal y is accepted if u < \frac{f(y)}{M g(y)}, where f is the unnormalized target density and M \geq \sup_y \frac{f(y)}{g(y)}; otherwise, it is rejected and the process repeats.[1][13]
To verify correctness, consider the joint density of (y, u), which is g(y) \cdot 1 over the support of g and u \in [0,1]. The acceptance region is defined by 0 \leq u < \frac{f(y)}{M g(y)}. The marginal density of accepted y is obtained by integrating the joint density over this region:
\int_0^{f(y)/(M g(y))} g(y) \, du = g(y) \cdot \frac{f(y)}{M g(y)} = \frac{f(y)}{M}.
The total probability of acceptance is \int \frac{f(y)}{M} \, dy = \frac{1}{M} \int f(y) \, dy = \frac{Z}{M}, where Z = \int f(y) \, dy. Thus, the conditional density of accepted y is \frac{f(y)/M}{Z/M} = \frac{f(y)}{Z}, matching the target distribution p(y).[1][13]
The overall acceptance probability is \frac{Z}{M}, which simplifies to \frac{1}{M} if Z = 1 (normalized case); consequently, the average number of proposals required per accepted sample is M/Z, or simply M in the normalized case.[1]
The generated accepted samples are independent and identically distributed (i.i.d.) from p(y), as each proposal-rejection cycle is independent of previous ones, regardless of the value of M.[1][13]
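These identities can be checked empirically. The sketch below assumes, purely for illustration, a Beta(2,2) target with density f(y) = 6y(1-y) on [0,1] and a Uniform(0,1) proposal, so that M = 1.5; the observed acceptance rate should be close to 1/M \approx 0.667 and the accepted samples should be indistinguishable from Beta(2,2) draws.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Target Beta(2,2): f(y) = 6 y (1 - y) on [0, 1]; proposal Uniform(0, 1) with
# g(y) = 1, so M = sup f/g = f(1/2) = 1.5 and the acceptance rate should be 1/M.
M = 1.5
n_proposals = 200_000
y = rng.uniform(0, 1, n_proposals)
u = rng.uniform(0, 1, n_proposals)
accepted = y[u < 6 * y * (1 - y) / M]            # accept when u < f(y)/(M g(y))

print(len(accepted) / n_proposals)               # ~= 1/M = 0.667
print(stats.kstest(accepted, stats.beta(2, 2).cdf).pvalue)   # no evidence of mismatch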
Algorithm
Standard Procedure
Rejection sampling begins with identifying a bounding constant M \geq 1 such that the target density f(x) satisfies f(x) \leq M g(x) for all x in the support, where g(x) is the proposal density from which samples can be easily generated.[1][14] This constant M is typically taken to be the supremum \sup_x \frac{f(x)}{g(x)}, which can be found analytically by calculus (locating the maximum of the ratio \frac{f(x)}{g(x)}), by numerical optimization, or through bounding approximations when an exact computation is infeasible.[2][1]
Once M is established, the procedure proceeds iteratively and independently in each trial to generate samples from f(x).[14] Sample a candidate y from the proposal distribution with density g(y), then independently sample u from the uniform distribution on [0, 1].[1][15] Accept y as a sample from f(x) if u \leq \frac{f(y)}{M g(y)}; otherwise, reject it and repeat the process.[14][1] This acceptance step must be performed by evaluating both f and g at the candidate y, which requires that these densities (or their unnormalized forms) can be computed efficiently.[2]
The iterations continue until the desired number of accepted samples is obtained, with each trial producing an independent draw from the target distribution upon acceptance.[14] In practice, when f(x) is available only up to a normalizing constant (i.e., as an unnormalized \tilde{f}(x) where f(x) = \tilde{f}(x)/Z and Z is unknown), the procedure adapts by using the ratio \frac{\tilde{f}(y)}{M g(y)} in the acceptance condition, as the constant Z cancels out and does not affect the validity.[15][1] Similarly, if g(y) is unnormalized, its normalizing constant also cancels in the ratio.[14]
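The following sketch illustrates the unnormalized case on a hypothetical example: a binomial likelihood with k = 7 successes in n = 10 trials and a flat prior on \theta, so the unnormalized posterior is \tilde{f}(\theta) = \theta^k (1-\theta)^{n-k} and the normalizing constant (a Beta function value) is never computed.

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: k = 7 successes in n = 10 trials, flat prior on theta.
k, n = 7, 10

def f_tilde(theta):
    # Unnormalized posterior; the normalizing constant B(k+1, n-k+1) is never used.
    return theta**k * (1 - theta)**(n - k)

# With a Uniform(0,1) proposal, M is the unnormalized density at its mode theta = k/n.
M = f_tilde(k / n)

def sample_posterior(n_samples):
    out = []
    while len(out) < n_samples:
        theta = rng.uniform()                      # candidate from the proposal
        if rng.uniform() <= f_tilde(theta) / M:    # acceptance uses only the ratio
            out.append(theta)
    return np.array(out)

draws = sample_posterior(5_000)
print(draws.mean())   # ~= (k + 1) / (n + 2) = 0.667, the Beta(8, 4) posterior mean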
Pseudocode and Simple Example
The rejection sampling algorithm can be implemented as a function that generates a specified number of samples from the target distribution f using a proposal distribution g and bounding constant M, where M \geq \sup_x f(x)/g(x).[1]
Here is pseudocode for the continuous case:
function rejection_sample(n_samples, f, g, M):
    samples = empty list
    while length(samples) < n_samples:
        y = sample from g
        u = uniform(0, 1)
        if u <= f(y) / (M * g(y)):
            append y to samples
    return samples
This procedure repeatedly proposes candidates from g until the desired number of acceptances is reached, ensuring the accepted samples follow the target f.[1][16]
A simple example illustrates the method for sampling from a truncated standard normal distribution restricted to [0, \infty), which has target density f(x) = \sqrt{2/\pi} \exp(-x^2/2) for x \geq 0 (the half-normal distribution). An exponential proposal distribution with rate 1, g(x) = \exp(-x) for x \geq 0, is suitable due to its similar tail behavior. The bounding constant is M = \sup_{x \geq 0} f(x)/g(x) = \sqrt{2/\pi} \exp(1/2) \approx 1.31. To generate samples, draw Y from the exponential, draw U \sim \text{Unif}(0,1), and accept Y if U \leq f(Y)/(M g(Y)); repeat until 1000 samples are obtained. A histogram of these samples should closely match the target half-normal density, confirming the method's correctness.[1]
For implementation, the algorithm applies to both continuous and discrete cases, with the primary difference being whether densities f, g or probability mass functions are used; in the discrete case, M \geq \sup_k p(k)/q(k) where p and q are the target and proposal pmfs, and proposals are drawn accordingly. In software like Python, a basic implementation for the half-normal example might use NumPy for sampling:
import numpy as np

def rejection_sample_half_normal(n_samples):
    # Envelope constant M = sup f(x)/g(x) = sqrt(2/pi) * exp(1/2)
    M = np.sqrt(2 / np.pi) * np.exp(0.5)
    samples = []
    while len(samples) < n_samples:
        y = np.random.exponential(1)                    # candidate from Exp(1) proposal
        u = np.random.uniform(0, 1)                     # uniform for the acceptance test
        f_y = np.sqrt(2 / np.pi) * np.exp(-y**2 / 2)    # half-normal target density
        g_y = np.exp(-y)                                # exponential proposal density
        if u <= f_y / (M * g_y):                        # accept with probability f/(M g)
            samples.append(y)
    return np.array(samples)

# Example: 1000 samples
samples = rejection_sample_half_normal(1000)
This code handles the continuous case efficiently for moderate n; for discrete distributions, replace exponential sampling with appropriate discrete generators like np.random.choice. Similar implementations exist in R using runif and rexp.[1][17]
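A minimal sketch of the discrete case, with an arbitrary target pmf chosen only for illustration and a uniform proposal over its support, is:

import numpy as np

rng = np.random.default_rng(3)

# Illustrative target pmf on {0, 1, 2, 3} (values chosen arbitrarily).
p = np.array([0.1, 0.2, 0.4, 0.3])
support = np.arange(len(p))

# Uniform proposal q(k) = 1/4, so M = max_k p(k)/q(k) = 4 * max(p) = 1.6.
q = np.full(len(p), 1 / len(p))
M = np.max(p / q)

def discrete_rejection(n_samples):
    out = []
    while len(out) < n_samples:
        k = rng.choice(support)                    # propose uniformly over the support
        if rng.uniform() <= p[k] / (M * q[k]):     # accept with probability p(k)/(M q(k))
            out.append(k)
    return np.array(out)

draws = discrete_rejection(20_000)
print(np.bincount(draws, minlength=len(p)) / len(draws))   # ~= [0.1, 0.2, 0.4, 0.3]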
Advantages
Rejection sampling offers significant simplicity in implementation, as it requires only the ability to evaluate the target density f and the proposal density g, along with a constant M such that Mg(x) \geq f(x) for all x, without needing to invert cumulative distribution functions or employ the intricate transition kernels typical of Markov chain Monte Carlo (MCMC) methods.[17][18]
A key advantage is its exactness: the method generates independent and identically distributed (i.i.d.) samples directly from the target distribution f, bypassing the approximations and correlations inherent in MCMC chains, which require burn-in periods and exhibit reduced effective sample sizes due to autocorrelation.[18][17]
The algorithm is particularly well-suited for handling unnormalized densities, making it ideal for sampling Bayesian posteriors where the normalizing constant (evidence) is intractable, as it operates solely on density ratios f(x)/g(x) without requiring the constant's computation.[17][19]
Compared to naive approaches like exhaustive enumeration or grid-based sampling, rejection sampling avoids the need to evaluate the entire support, which is especially beneficial for rare event estimation where low-probability regions would otherwise demand prohibitive computational resources; for instance, it bounds wasted proposals via the acceptance rate 1/M rather than scanning all grid points.[17]
When the envelope constant M is small (e.g., M < 2), rejection sampling achieves high computational efficiency, with acceptance probabilities exceeding 50% and minimal rejections, outperforming more complex alternatives in such favorable cases.[17]
Drawbacks
One major drawback of rejection sampling is its potential inefficiency when the constant M is large, as the acceptance probability drops to 1/M, resulting in a high number of rejections and requiring many proposal samples to obtain a single accepted one.[20] This issue is particularly pronounced when the proposal distribution g poorly envelopes the target density f, such as in cases involving multimodal target distributions where a simple unimodal proposal struggles to cover all modes without inflating M.[21] Consequently, the expected number of iterations needed to generate one sample equals M, which can render the method computationally prohibitive if M \gg 1.[22]
Rejection sampling also suffers from the curse of dimensionality, where performance degrades exponentially as the dimension d increases because the constant M typically grows with d, leading to vanishingly small acceptance rates.[20] For instance, when sampling from a multivariate normal target using a scaled normal proposal, the acceptance rate can decay as 1/\sigma^d for \sigma > 1, making the method impractical beyond low dimensions.[22] In high-dimensional spaces, the volume concentration phenomenon further exacerbates this, as most proposal samples fall outside the relevant support region of the target, necessitating an exponential number of trials.[23]
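This decay can be demonstrated directly. The sketch below assumes (as an illustrative choice) a standard multivariate normal target and an N(0, \sigma^2 I_d) proposal with \sigma = 1.5, for which M = \sigma^d and the theoretical acceptance rate is \sigma^{-d}.

import numpy as np

rng = np.random.default_rng(4)
sigma = 1.5               # proposal standard deviation per coordinate (illustrative)
n = 100_000               # proposals per dimension

# Target N(0, I_d), proposal N(0, sigma^2 I_d): f(y)/g(y) <= M = sigma**d, and the
# acceptance probability is f(y)/(M g(y)) = exp(-0.5 * ||y||^2 * (1 - 1/sigma**2)).
for d in (1, 2, 5, 10, 20):
    y = rng.normal(0.0, sigma, size=(n, d))
    accept_prob = np.exp(-0.5 * np.sum(y**2, axis=1) * (1 - 1 / sigma**2))
    accepted = rng.uniform(size=n) < accept_prob
    print(d, accepted.mean(), sigma**(-d))   # empirical vs theoretical acceptance rate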
A key limitation is the stringent requirement for an envelope distribution: the method demands a proposal g and constant M such that M g(x) \geq f(x) for all x in the support, which involves computing or bounding the supremum of the ratio f(x)/g(x).[24] Determining this bound can be challenging, especially for complex or unbounded targets, and may require extensive prior analysis or numerical optimization to ensure the envelope holds everywhere.[17]
Unlike adaptive sampling techniques, rejection sampling employs a fixed proposal distribution g and constant M throughout the process, with no mechanism to refine them based on observed rejections or samples.[25] This rigidity limits its flexibility in non-stationary or evolving sampling scenarios.
Finally, the repeated evaluation of the target density f during the acceptance step imposes a significant computational burden if f is expensive to compute, as each rejected proposal still incurs the full cost of density evaluation without contributing to the output sample set.[26] In applications where density evaluations dominate runtime, this can lead to substantial overhead, particularly when combined with low acceptance rates.[17]
Variants
Exponential Tilting
Exponential tilting is a variant of rejection sampling that optimizes the proposal distribution by incorporating adjustments from an exponential family, thereby improving efficiency for targets that share structural similarities with the base proposal. The tilted proposal distribution is given by
g_t(x) = \frac{\exp(t \, h(x)) \, g(x)}{Z_t},
where g(x) is the original proposal density, h(x) is a suitable tilting function (often a sufficient statistic), t is the tilting parameter, and Z_t = \mathbb{E}_g[\exp(t \, h(X))] is the normalizing constant ensuring g_t integrates to 1. This modification allows the proposal to be "tilted" toward regions where the target density f(x) is more prominent, reducing the bounding constant M = \sup_x f(x)/g_t(x) required for the rejection mechanism and increasing the overall acceptance rate. The approach is particularly effective when g belongs to an exponential family, as the tilting preserves the family structure while adapting to the target.[27]
To select the optimal t, one typically minimizes the Kullback-Leibler (KL) divergence between f and g_t, or equivalently maximizes \mathbb{E}_f[\log g_t(X)] over t, which leads to the moment-matching equation \mathbb{E}_{g_t}[h(X)] = \mathbb{E}_f[h(X)]. This optimization often involves numerical methods, such as solving the saddlepoint equation derived from the cumulant generating function of g, and results in M approaching 1 more closely than with the untilted proposal, especially for targets in or near the exponential family of g. In practice, the tilted proposal can be updated iteratively within the rejection sampling loop or precomputed if Z_t is tractable, making it suitable for hybrid methods that combine rejection sampling with importance sampling to estimate expectations or normalizing constants. This technique was developed in the context of exponential families within statistical Monte Carlo methods during the late 20th century.[28]
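A concrete, illustrative instance: take a Gamma(2,1) target f(x) = x e^{-x} with an Exponential(1) base proposal and tilting function h(x) = x. The tilted proposal \exp(t x) g(x)/Z_t is Exponential(1-t), and moment matching \mathbb{E}_{g_t}[X] = \mathbb{E}_f[X] = 2 gives t = 1/2, for which M = 4/e \approx 1.47. A minimal sketch under these assumptions:

import numpy as np

rng = np.random.default_rng(5)

# Target Gamma(2,1): f(x) = x exp(-x). Base proposal Exp(1), tilting function h(x) = x.
# Tilting by t gives g_t = Exp(1 - t); moment matching E_{g_t}[X] = E_f[X] = 2 gives t = 0.5.
t = 0.5
rate = 1.0 - t                     # tilted proposal is Exponential(rate)
M = 4.0 / np.e                     # sup_x f(x)/g_t(x), attained at x = 2

def f(x):
    return x * np.exp(-x)

def g_t(x):
    return rate * np.exp(-rate * x)

def sample_gamma(n_samples):
    out = []
    while len(out) < n_samples:
        y = rng.exponential(1.0 / rate)            # draw from the tilted proposal
        if rng.uniform() <= f(y) / (M * g_t(y)):   # standard rejection test
            out.append(y)
    return np.array(out)

draws = sample_gamma(10_000)
print(draws.mean(), draws.var())   # ~= 2 and ~= 2, matching Gamma(2, 1)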
Adaptive Rejection Sampling
Adaptive Rejection Sampling (ARS) is a variant of rejection sampling designed specifically for univariate log-concave target densities, where the logarithm of the unnormalized density h(x) = \log f(x) is concave.[9] This property allows the construction of piecewise linear upper and lower envelopes that tightly bound the target density, using tangent lines to h(x) at evaluated points and chords connecting those points.[9] The upper envelope forms a rejection bound that majorizes the target, while the lower envelope serves as a squeezing function to accelerate acceptances without density evaluations.[9] These envelopes are iteratively refined after each acceptance or rejection, incorporating new evaluation points to progressively tighten the bounds and improve sampling efficiency.[9]
The algorithm begins by initializing a set of abscissae that bracket the support of the target density, typically including points near the mode and boundaries, to define initial piecewise linear hulls for the upper envelope u_k(x) and lower envelope l_k(x).[9] A piecewise exponential proposal distribution s_k(x) is then constructed by normalizing the upper envelope, s_k(x) = c_k^{-1} \exp(u_k(x)), where c_k is the normalizing constant computed via numerical integration over the pieces.[9] Sampling proceeds by drawing a candidate x^* from s_k(x) and a uniform variate w \sim U(0,1); if w < \exp(l_k(x^*) - u_k(x^*)), accept immediately using the squeezing bound; otherwise, evaluate h(x^*) and accept if w < \exp(h(x^*) - u_k(x^*)).[9] Upon acceptance or rejection requiring evaluation, the new point x^* (along with its derivative if needed) is added to the set of abscissae, and the envelopes are updated to form tighter piecewise linear approximations.[9]
This adaptive process ensures high efficiency, as the acceptance probability approaches 1 over iterations due to the envelopes converging to the target density, and it avoids the need to compute a global supremum constant M by relying on local tangent constructions.[9] ARS was introduced by Gilks and Wild in 1992 for application in Gibbs sampling.[9] Extensions to non-log-concave densities incorporate squeezing bounds within an adaptive rejection Metropolis framework to handle deviations from concavity.[29]
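A condensed sketch of the upper-hull construction and piecewise-exponential sampling is given below for a standard normal target (an illustrative choice). It omits the squeezing step and refines the hull only after rejections, so it should be read as a minimal illustration of the Gilks-Wild construction rather than a faithful implementation.

import numpy as np

rng = np.random.default_rng(6)

def h(x):       # log of the (unnormalized) target density; standard normal here
    return -0.5 * x**2

def dh(x):      # derivative of h, used to build tangent lines
    return -x

def ars_sample(n_samples, x_init=(-2.0, 2.0), max_points=50):
    xs = sorted(x_init)                        # abscissae defining the upper hull
    samples = []
    while len(samples) < n_samples:
        hs = np.array([h(x) for x in xs])
        dhs = np.array([dh(x) for x in xs])
        # Intersections of adjacent tangent lines, padded with +/- infinity.
        z = [(hs[i + 1] - hs[i] - xs[i + 1] * dhs[i + 1] + xs[i] * dhs[i])
             / (dhs[i] - dhs[i + 1]) for i in range(len(xs) - 1)]
        z = np.array([-np.inf] + z + [np.inf])
        # Mass of each piecewise-exponential segment exp(h_i + m (x - x_i)).
        w = []
        for i in range(len(xs)):
            m, a, b = dhs[i], z[i], z[i + 1]
            ea = 0.0 if np.isinf(a) else np.exp(m * (a - xs[i]))
            eb = 0.0 if np.isinf(b) else np.exp(m * (b - xs[i]))
            w.append(np.exp(hs[i]) * ((eb - ea) / m if m != 0 else b - a))
        probs = np.array(w) / sum(w)
        # Choose a segment, then invert its CDF to draw a candidate.
        i = rng.choice(len(xs), p=probs)
        m, a, b = dhs[i], z[i], z[i + 1]
        ea = 0.0 if np.isinf(a) else np.exp(m * (a - xs[i]))
        eb = 0.0 if np.isinf(b) else np.exp(m * (b - xs[i]))
        v = rng.uniform()
        x_cand = xs[i] + np.log(ea + v * (eb - ea)) / m if m != 0 else a + v * (b - a)
        u_hull = hs[i] + m * (x_cand - xs[i])   # upper hull value at the candidate
        if rng.uniform() < np.exp(h(x_cand) - u_hull):
            samples.append(x_cand)
        elif len(xs) < max_points:
            xs = sorted(xs + [x_cand])          # refine the hull after a rejection
    return np.array(samples)

draws = ars_sample(5_000)
print(draws.mean(), draws.std())   # ~= 0 and ~= 1 for the standard normal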
A representative example is sampling from a log-concave posterior distribution in Bayesian inference, such as the full conditional for parameters in a monoclonal antibody reactivity model analyzed via Gibbs sampling.[9]
Applications
Statistical Sampling
Rejection sampling plays a key role in Bayesian computation by enabling the generation of samples from the posterior distribution p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta)\, p(\theta), particularly when conjugate priors are unavailable and the normalizing constant is intractable.[30] This approach is especially valuable for complex models where direct sampling from the posterior is infeasible, allowing inference through exact independent and identically distributed (i.i.d.) samples from the target distribution.[30] In modern applications, such as approximate Bayesian computation (ABC), rejection sampling facilitates posterior inference for models with intractable likelihoods by simulating data from the model and accepting parameters that produce auxiliary statistics close to observed data summaries.[31] The original motivation for rejection sampling traces back to John von Neumann's work on neutron transport simulations during the Manhattan Project, where it was used to model particle interactions by sampling from non-uniform distributions in a computationally efficient manner.[4]
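A minimal sketch of the ABC rejection scheme mentioned above, under purely illustrative assumptions (data modeled as N(\theta, 1), a N(0, 5^2) prior on \theta, the sample mean as summary statistic, and an arbitrary tolerance), is:

import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: 100 observations assumed to follow N(theta, 1); prior theta ~ N(0, 5^2);
# summary statistic = sample mean; the tolerance eps is an arbitrary illustrative choice.
obs = rng.normal(2.0, 1.0, size=100)       # "observed" data for the demo
s_obs = obs.mean()

def abc_rejection(n_accept, eps=0.1):
    accepted = []
    while len(accepted) < n_accept:
        theta = rng.normal(0.0, 5.0)                   # draw from the prior
        sim = rng.normal(theta, 1.0, size=len(obs))    # simulate data from the model
        if abs(sim.mean() - s_obs) < eps:              # keep theta if summaries are close
            accepted.append(theta)
    return np.array(accepted)

post = abc_rejection(500)
print(post.mean(), post.std())   # crude approximation to the posterior of theta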
In Monte Carlo integration, rejection sampling supports the estimation of expectations \mathbb{E}_p [h(X)] = \int h(x) p(x) \, dx by generating i.i.d. samples from the target density p(x), which can then be averaged to approximate the integral with reduced bias under suitable proposal distributions.[32] This method ensures unbiased estimates while leveraging the ability to evaluate unnormalized densities, making it suitable for high-dimensional integrals in statistical inference.[32]
For rare event simulation, rejection sampling aids in bounding tail probabilities in reliability analysis and queueing theory by proposing samples from an enveloping distribution and accepting those that fall within the rare event region, thereby estimating low-probability outcomes like system failures without excessive computational waste.[33] To enhance efficiency, rejection sampling is often integrated with importance sampling for variance reduction; in this hybrid approach, proposals from an importance distribution are weighted and selectively accepted, improving the estimation of expectations in ABC and other Monte Carlo settings by mitigating the high rejection rates of pure rejection methods.[34]
Machine Learning Contexts
In generative models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), rejection sampling is employed to draw samples from implicit densities by rejecting proposals that deviate from the learned data manifold. In GANs, Discriminator Rejection Sampling (DRS) leverages the discriminator's output as an importance weight to filter generated samples, improving quality by discarding those with low discriminator scores to approximate the true data distribution.[35] This approach addresses the challenge of implicit sampling in GANs, where direct likelihood evaluation is unavailable, and has been shown to enhance sample fidelity without retraining the model.[35]
As an alternative to Markov Chain Monte Carlo (MCMC) methods, rejection sampling is particularly useful in scenarios where MCMC chains exhibit poor mixing, such as in particle filters for sequential data processing in machine learning tasks like time-series forecasting or tracking. In particle filters, rejection-based variants, such as Variational Rejection Particle Filtering (VRPF), integrate rejection steps to resample particles more efficiently, reducing degeneracy in high-variance environments while maintaining unbiased estimates of the filtering distribution.[36] This makes it suitable for dynamic models in reinforcement learning or robotics, where rapid adaptation to new observations is critical.
Compared to Metropolis-Hastings (MH), a core MCMC algorithm, rejection sampling provides exact samples from the target distribution but becomes inefficient in high dimensions due to the curse of dimensionality, often requiring an impractically large number of proposals. In contrast, MH offers scalability through approximate sampling via Markov chains, making it preferable for deep learning applications with complex, high-dimensional posteriors, though it may suffer from autocorrelation in samples. Rejection sampling also relates to slice sampling, another MCMC technique, but differs by using a fixed global envelope M g(x) over the entire support rather than auxiliary variables that define local slice regions, leading to simpler implementation at the cost of higher rejection rates in multimodal distributions.
Post-2010 developments have integrated rejection sampling into normalizing flows for exact likelihood-based sampling, where learned rejection mechanisms serve as flexible base distributions to handle multimodal targets more effectively than standard Gaussian priors.[37] For instance, resampling base distributions via rejection in flows enables modeling of complicated densities in tasks like density estimation for anomaly detection.[37] However, its limitations in high-dimensional deep learning settings, such as those exceeding 100 dimensions, often render it less practical than MCMC alternatives due to exponential growth in rejection rates. A notable recent application appears in diffusion models, where Diffusion Rejection Sampling (DiffRS) rejects noisy intermediates during the reverse process to better match the target data distribution, improving generation quality in image synthesis tasks by aligning transition kernels with the true posterior.[38]