
Rejection sampling

Rejection sampling, also known as the acceptance-rejection method, is a fundamental Monte Carlo technique used to generate independent and identically distributed samples from a target probability density f(x) that is difficult to sample directly, by leveraging a simpler proposal density g(x) from which samples can be easily drawn. The method works by repeatedly proposing candidates from g(x) and accepting them with a probability that ensures the accepted samples follow the target distribution f(x), provided that f(x) \leq c g(x) for some finite constant c \geq 1. The core algorithm for the continuous case proceeds as follows: generate a candidate Y from the proposal with density g(x); independently generate a random variable U \sim \text{Uniform}(0,1); accept Y as a sample X if U \leq f(Y)/(c g(Y)), otherwise reject and repeat the process. This acceptance criterion guarantees that the conditional distribution of the accepted samples is exactly f(x), as the probability of acceptance adjusts for the discrepancy between f and g. Key requirements include the existence of a bounded constant c = \sup_x \{f(x)/g(x)\}, which must be finite, and the ability to evaluate both f(x) and g(x) pointwise, up to normalizing constants if needed. The choice of proposal g(x) is crucial; an ideal g closely approximates the shape of f(x) to minimize c, thereby improving efficiency. Efficiency is determined by the expected number of iterations, which follows a geometric distribution with success probability 1/c, yielding an average of c proposals per accepted sample. While the method provides exact samples without bias, it can become inefficient in high dimensions or when c is large, leading to high rejection rates, and is often contrasted with alternatives like importance sampling that avoid rejection by weighting all samples. Rejection sampling applies to both continuous and discrete distributions, with adaptations for unnormalized densities common in Bayesian inference and statistical physics.

Introduction

History

Rejection sampling emerged as a foundational technique within the broader development of Monte Carlo methods during the 1940s, primarily driven by efforts to simulate complex physical processes on early computers. John von Neumann proposed the core accept-reject mechanism in unpublished notes and a 1947 letter to Stanislaw Ulam, while working at Los Alamos on neutron diffusion simulations for nuclear weapons research. This approach addressed the challenge of generating nonuniform random variates from known distributions using uniform random numbers, marking an early innovation in statistical sampling for scientific computation. The method gained formal recognition in the early 1950s through von Neumann's published work, including a 1951 report detailing rejection algorithms for sampling from nonuniform distributions, which improved efficiency in three-dimensional simulations. These developments evolved rejection sampling from rudimentary implementations into a versatile tool for approximating integrals and distributions in reliability and physics contexts. Rejection sampling was further popularized in statistics and applied fields through John M. Hammersley and David C. Handscomb's influential 1964 monograph Monte Carlo Methods, which systematically described the technique alongside variance reduction strategies and provided theoretical foundations for its use in multidimensional problems. This book solidified its role in transitioning from naive uniform sampling to efficient methods for complex, non-uniform distributions. A significant advancement came in 1992 with W. R. Gilks and P. Wild's introduction of adaptive rejection sampling, tailored for log-concave densities in Gibbs sampling applications, enhancing adaptability and reducing rejection rates in Bayesian computation.

Definition and Motivation

Rejection sampling is a fundamental Monte Carlo technique for generating independent and identically distributed (i.i.d.) samples from a target density f(x) when direct sampling is intractable. It relies on an auxiliary proposal distribution g(x), from which samples can be drawn efficiently, and selectively accepts or rejects these proposals based on the ratio of the target to the proposal density. This approach ensures that the accepted samples are exactly distributed according to f(x), making it a cornerstone technique in statistical simulation and inference. The primary motivation for rejection sampling arises in scenarios where the target density f(x) is known only up to an unknown normalizing constant, such as in Bayesian posterior distributions or complex probabilistic models, rendering methods like inverse transform sampling impractical because they require evaluating and inverting the cumulative distribution function. In contrast, rejection sampling only necessitates pointwise evaluations of the unnormalized target f(x) and the proposal g(x), allowing it to handle distributions that are difficult or impossible to sample directly. This is particularly valuable in Monte Carlo applications, where i.i.d. samples are essential for unbiased estimation of integrals, expectations, or probabilities via the law of large numbers. At a high level, the technique constructs a bounding "envelope" around the target density using scaled versions of the proposal distribution, ensuring that proposals falling under the target curve are retained while others are discarded. Developed by John von Neumann in the context of early Monte Carlo simulations for physics problems, it provides an exact sampling mechanism without approximation, though efficiency depends on the choice of proposal.

Mathematical Foundations

Target and Proposal Distributions

In rejection sampling, the target distribution refers to the desired probability distribution from which independent samples are to be generated. It is typically defined by an unnormalized density function f(\mathbf{x}), which is proportional to the true density p(\mathbf{x}) = f(\mathbf{x}) / Z, where Z = \int f(\mathbf{x}) \, d\mathbf{x} is an unknown normalizing constant that often cannot be computed analytically. This setup is particularly useful in contexts such as Bayesian inference, where the target arises as a posterior with an intractable normalization factor. The proposal distribution, denoted by density g(\mathbf{x}), serves as an auxiliary distribution that is easy to sample from directly, such as standard uniforms, normals, or exponentials. A key requirement is that it forms an envelope for the target, satisfying the inequality f(\mathbf{x}) \leq M g(\mathbf{x}) for all \mathbf{x} in the support, where M \geq 1 is a finite constant bounding the supremum of f(\mathbf{x}) / g(\mathbf{x}). This domination condition ensures that the proposal covers the target without underestimating its density in any region. The choice of proposal distribution is critical for practical efficiency, as it directly influences the value of M; a smaller M reduces wasted computations. Ideally, g(\mathbf{x}) should approximate the shape of p(\mathbf{x}) closely, such as using a uniform proposal over a bounded interval for targets with compact support, or a Gaussian proposal matched to the target's mean and variance for approximately Gaussian targets. In higher dimensions, proposals like multivariate normals or mixtures may be employed when they align well with the target's modal structure. Both distributions must share compatible supports, meaning the support of g(\mathbf{x}) must include the entire support of p(\mathbf{x}) (i.e., g(\mathbf{x}) > 0 wherever f(\mathbf{x}) > 0) to ensure complete coverage and avoid biasing the samples toward subsets of the target space. Failure to match supports can lead to incomplete exploration of the target, rendering the method invalid.
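As a minimal illustration of the domination condition, consider a hypothetical Beta(2, 2) target with density f(x) = 6x(1 - x) on [0, 1] and a Uniform(0, 1) proposal g(x) = 1; the ratio f/g peaks at x = 0.5, giving M = 1.5. The short Python sketch below (an illustrative example, not a library routine) verifies the envelope numerically:
import numpy as np

# Hypothetical example: Beta(2, 2) target with a Uniform(0, 1) proposal.
# f(x) = 6 x (1 - x) on [0, 1]; g(x) = 1 on [0, 1].
def f(x):
    return 6.0 * x * (1.0 - x)

def g(x):
    return np.ones_like(x)

# The ratio f/g is maximized at x = 0.5, giving M = f(0.5) = 1.5.
M = float(f(np.array(0.5)) / g(np.array(0.5)))

# Check the domination condition f(x) <= M g(x) on a grid.
xs = np.linspace(0.0, 1.0, 1001)
assert np.all(f(xs) <= M * g(xs) + 1e-12)
print(f"M = {M:.2f}, acceptance rate = 1/M = {1/M:.2f}")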

Acceptance Probability

In rejection sampling, samples are proposed from the distribution with density g(y), and an auxiliary uniform random variable u \sim \text{Uniform}(0,1) is generated independently. The proposal y is accepted if u < \frac{f(y)}{M g(y)}, where f is the unnormalized target density and M \geq \sup_y \frac{f(y)}{g(y)}; otherwise, it is rejected and the process repeats. To verify correctness, consider the joint density of (y, u), which is g(y) \cdot 1 over the support of g and u \in [0,1]. The acceptance region is defined by 0 \leq u < \frac{f(y)}{M g(y)}. The marginal density of accepted y is obtained by integrating the joint density over this region: \int_0^{f(y)/(M g(y))} g(y) \, du = g(y) \cdot \frac{f(y)}{M g(y)} = \frac{f(y)}{M}. The total probability of acceptance is \int \frac{f(y)}{M} \, dy = \frac{1}{M} \int f(y) \, dy = \frac{Z}{M}, where Z = \int f(y) \, dy. Thus, the conditional density of accepted y is \frac{f(y)/M}{Z/M} = \frac{f(y)}{Z}, matching the target distribution p(y). The expected acceptance probability is \frac{Z}{M}, which simplifies to \frac{1}{M} if Z = 1 (normalized case); this implies the average number of proposals required per accepted sample is M. The accepted samples are independent and identically distributed (i.i.d.) from p(y), as each proposal-rejection cycle is independent of previous ones, regardless of the value of M.
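This acceptance rate can also be checked empirically. The sketch below assumes the hypothetical Beta(2, 2) target f(x) = 6x(1 - x) on [0, 1] with a Uniform(0, 1) proposal and M = 1.5; since the target is normalized (Z = 1), the empirical acceptance rate should be close to 1/M = 2/3:
import numpy as np

# Empirical check of the acceptance rate for the Beta(2, 2) / Uniform(0, 1)
# pair, where M = 1.5 and the target is normalized (Z = 1), so the expected
# acceptance probability is 1/M = 2/3.
rng = np.random.default_rng(0)
M = 1.5
n_trials = 100_000
y = rng.uniform(0.0, 1.0, size=n_trials)           # proposals from g
u = rng.uniform(0.0, 1.0, size=n_trials)           # auxiliary uniforms
accepted = u < 6.0 * y * (1.0 - y) / (M * 1.0)     # acceptance test u < f(y)/(M g(y))
print(f"empirical acceptance rate: {accepted.mean():.3f} (theory: {1/M:.3f})")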

Algorithm

Standard Procedure

Rejection sampling begins with identifying a bounding constant M \geq 1 such that the target density f(x) satisfies f(x) \leq M g(x) for all x in the support, where g(x) is the proposal density from which samples can be easily generated. This constant M is typically determined as the supremum \sup_x \frac{f(x)}{g(x)}, which can be found analytically via calculus by locating the maximum of the ratio \frac{f(x)}{g(x)}, numerically by optimization routines, or through bounding approximations when an exact computation is infeasible. Once M is established, the procedure proceeds iteratively and independently in each trial to generate samples from f(x). Sample a candidate y from the proposal distribution with density g(y), then independently sample u from the uniform distribution on [0, 1]. Accept y as a sample from f(x) if u \leq \frac{f(y)}{M g(y)}; otherwise, reject it and repeat the process. This acceptance step must be performed by evaluating both f and g at the candidate y, which requires that these densities (or their unnormalized forms) can be computed efficiently. The iterations continue until the desired number of accepted samples is obtained, with each trial producing an independent draw from the target distribution upon acceptance. In practice, when f(x) is available only up to a normalizing constant (i.e., as an unnormalized form \tilde{f}(x) where f(x) = \tilde{f}(x)/Z and Z is unknown), the procedure adapts by using the ratio \frac{\tilde{f}(y)}{M g(y)} in the acceptance condition, as the constant Z cancels out and does not affect the validity. Similarly, if g(y) is unnormalized, its normalizing constant also cancels in the ratio.
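When the supremum cannot be derived by hand, it can be approximated numerically. The sketch below assumes a hypothetical unnormalized Gamma(3, 1) target \tilde{f}(x) = x^2 e^{-x} with an Exponential(1/2) proposal and uses SciPy's bounded scalar optimizer to locate M; the analytical answer is M = 32 e^{-2} \approx 4.33, attained at x = 4:
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical setup: unnormalized Gamma(3, 1) target f_tilde(x) = x^2 exp(-x)
# with an Exponential(rate = 0.5) proposal g(x) = 0.5 exp(-0.5 x), both on x > 0.
def f_tilde(x):
    return x**2 * np.exp(-x)

def g(x):
    return 0.5 * np.exp(-0.5 * x)

# Numerically locate sup_x f_tilde(x) / g(x) by minimizing the negative log ratio.
res = minimize_scalar(lambda x: -(np.log(f_tilde(x)) - np.log(g(x))),
                      bounds=(1e-8, 50.0), method="bounded")
M = f_tilde(res.x) / g(res.x)
# Analytically x* = 4 and M = 32 e^{-2} ≈ 4.33; since Z = Gamma(3) = 2 here,
# the acceptance rate is Z/M ≈ 0.46 rather than 1/M.
print(f"argmax x* = {res.x:.3f}, M = {M:.3f}")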

Pseudocode and Simple Example

The rejection sampling algorithm can be implemented as a function that generates a specified number of samples from the target distribution f using a proposal distribution g and bounding constant M, where M \geq \sup_x f(x)/g(x). Here is pseudocode for the continuous case:
function rejection_sample(n_samples, f, g, M):
    samples = empty list
    while length(samples) < n_samples:
        y = sample from g
        u = uniform(0, 1)
        if u <= f(y) / (M * g(y)):
            append y to samples
    return samples
This procedure repeatedly proposes candidates from g until the desired number of acceptances is reached, ensuring the accepted samples follow the target f. A simple example illustrates the method for sampling from a truncated standard normal distribution restricted to [0, \infty), which has target density f(x) = \sqrt{2/\pi} \exp(-x^2/2) for x \geq 0 (the half-normal distribution). An exponential proposal distribution with rate 1, g(x) = \exp(-x) for x \geq 0, is suitable due to its similar tail behavior. The bounding constant is M = \sup_{x \geq 0} f(x)/g(x) = \sqrt{2/\pi} \exp(1/2) \approx 1.32, attained at x = 1. To generate samples, draw Y from the exponential, draw U \sim \text{Unif}(0,1), and accept Y if U \leq f(Y)/(M g(Y)); repeat until 1000 samples are obtained. A histogram of these samples should closely match the target half-normal density, confirming the method's correctness. For implementation, the algorithm applies to both continuous and discrete cases, with the primary difference being whether densities f, g or probability mass functions are used; in the discrete case, M \geq \sup_k p(k)/q(k) where p and q are the target and proposal pmfs, and proposals are drawn accordingly. In software like Python, a basic implementation for the half-normal example might use NumPy for sampling:
import numpy as np

def rejection_sample_half_normal(n_samples):
    # Bounding constant M = sup_x f(x)/g(x) = sqrt(2/pi) * exp(1/2), attained at x = 1
    M = np.sqrt(2 / np.pi) * np.exp(0.5)
    samples = []
    while len(samples) < n_samples:
        y = np.random.exponential(1)                    # candidate from the Exp(1) proposal
        u = np.random.uniform(0, 1)                     # auxiliary uniform variate
        f_y = np.sqrt(2 / np.pi) * np.exp(-y**2 / 2)    # half-normal target density at y
        g_y = np.exp(-y)                                # exponential proposal density at y
        if u <= f_y / (M * g_y):                        # accept with probability f(y)/(M g(y))
            samples.append(y)
    return np.array(samples)

# Example: 1000 samples
samples = rejection_sample_half_normal(1000)
This code handles the continuous case efficiently for moderate n; for discrete distributions, replace exponential sampling with appropriate discrete generators like np.random.choice. Similar implementations exist in R using runif and rexp.
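To make the discrete case concrete, the following hypothetical sketch samples from a small target pmf over \{0, \dots, 5\} using a uniform proposal, with M = \max_k p(k)/q(k):
import numpy as np

# Hypothetical discrete example: sample from a target pmf p over {0, ..., 5}
# using a uniform proposal q(k) = 1/6; here M = max_k p(k)/q(k).
rng = np.random.default_rng(0)
p = np.array([0.05, 0.10, 0.30, 0.35, 0.15, 0.05])   # target pmf (sums to 1)
q = np.full(6, 1.0 / 6.0)                            # uniform proposal pmf
M = np.max(p / q)

def rejection_sample_discrete(n_samples):
    samples = []
    while len(samples) < n_samples:
        k = rng.integers(0, 6)          # draw k from the uniform proposal
        u = rng.uniform()
        if u <= p[k] / (M * q[k]):      # accept with probability p(k)/(M q(k))
            samples.append(k)
    return np.array(samples)

draws = rejection_sample_discrete(10_000)
print(np.bincount(draws, minlength=6) / len(draws))   # should approximate p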

Performance Analysis

Advantages

Rejection sampling offers significant simplicity in implementation, as it requires only the ability to evaluate the target density f and the proposal density g, along with a constant M such that Mg(x) \geq f(x) for all x, without needing to invert cumulative distribution functions or employ the intricate transition kernels typical of Markov chain Monte Carlo (MCMC) methods. A key advantage is its exactness: the method generates independent and identically distributed (i.i.d.) samples directly from the target distribution f, bypassing the approximations and correlations inherent in MCMC chains, which require burn-in periods and exhibit reduced effective sample sizes due to autocorrelation. The algorithm is particularly well-suited for handling unnormalized densities, making it ideal for sampling Bayesian posteriors where the normalizing constant (evidence) is intractable, as it operates solely on density ratios f(x)/g(x) without requiring the constant's computation. Compared to naive approaches like exhaustive enumeration or grid-based sampling, rejection sampling avoids the need to evaluate the entire support, which is especially beneficial for rare event estimation where low-probability regions would otherwise demand prohibitive computational resources; for instance, it bounds wasted proposals via the acceptance rate 1/M rather than scanning all grid points. When the envelope constant M is small (e.g., M < 2), rejection sampling achieves high computational efficiency, with acceptance probabilities exceeding 50% and minimal rejections, outperforming more complex alternatives in such favorable cases.

Drawbacks

One major drawback of rejection sampling is its potential inefficiency when the constant M is large, as the acceptance probability drops to 1/M, resulting in a high number of rejections and requiring many proposal samples to obtain a single accepted one. This issue is particularly pronounced when the proposal distribution g poorly envelopes the target density f, such as in cases involving multimodal target distributions where a simple unimodal proposal struggles to cover all modes without inflating M. Consequently, the expected number of iterations needed to generate one sample equals M, which can render the method computationally prohibitive if M \gg 1. Rejection sampling also suffers from the curse of dimensionality, where performance degrades exponentially as the dimension d increases because the constant M typically grows with d, leading to vanishingly small acceptance rates. For instance, when sampling from a multivariate normal target using a scaled normal proposal, the acceptance rate can decay as 1/\sigma^d for \sigma > 1, making the method impractical beyond low dimensions. In high-dimensional spaces, the volume concentration phenomenon further exacerbates this, as most proposal samples fall outside the relevant support region of the target, necessitating an exponential number of trials. A key limitation is the stringent requirement for an envelope distribution: the method demands a proposal g and constant M such that M g(x) \geq f(x) for all x in the support, which involves computing or bounding the supremum of the ratio f(x)/g(x). Determining this bound can be challenging, especially for complex or unbounded targets, and may require extensive prior analysis or numerical optimization to ensure the bound holds everywhere. Unlike adaptive sampling techniques, rejection sampling employs a fixed proposal distribution g and constant M throughout the process, with no mechanism to refine them based on observed rejections or samples. This rigidity limits its flexibility in non-stationary or evolving sampling scenarios. Finally, the repeated evaluation of the target f during the acceptance step imposes a significant computational burden if f is expensive to compute, as each rejected proposal still incurs the full cost of density evaluation without contributing to the output sample set. In applications where density evaluations dominate the overall cost, this can lead to substantial overhead, particularly when combined with low acceptance rates.
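The dimensional scaling can be illustrated directly. In the hypothetical setup below, the target is a standard normal in d dimensions and the proposal is N(0, \sigma^2 I) with \sigma = 1.5, so M = \sigma^d and the acceptance rate \sigma^{-d} collapses as d grows:
import numpy as np

# Hypothetical illustration of the curse of dimensionality: target N(0, I_d),
# proposal N(0, sigma^2 I_d) with sigma > 1. Here M = sigma^d, so the
# acceptance rate sigma^(-d) shrinks exponentially with the dimension d.
rng = np.random.default_rng(0)
sigma = 1.5
n = 200_000
for d in [1, 2, 5, 10, 20]:
    y = rng.normal(0.0, sigma, size=(n, d))     # proposals from g = N(0, sigma^2 I)
    # Acceptance probability given y is f(y) / (M g(y)) = exp(-||y||^2 (1 - 1/sigma^2) / 2).
    accept_prob = np.exp(-0.5 * np.sum(y**2, axis=1) * (1.0 - 1.0 / sigma**2))
    rate = np.mean(rng.uniform(size=n) < accept_prob)
    print(f"d = {d:2d}: empirical rate = {rate:.5f}, theory sigma^-d = {sigma**(-d):.5f}")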

Variants

Exponential Tilting

Exponential tilting is a variant of rejection sampling that optimizes the proposal distribution by incorporating adjustments from an exponential family, thereby improving efficiency for targets that share structural similarities with the base proposal. The tilted proposal distribution is given by g_t(x) = \frac{\exp(t \, h(x)) \, g(x)}{Z_t}, where g(x) is the original proposal density, h(x) is a suitable tilting function (often a sufficient statistic), t is the tilting parameter, and Z_t = \mathbb{E}_g[\exp(t \, h(X))] is the normalizing constant ensuring g_t integrates to 1. This modification allows the proposal to be "tilted" toward regions where the target density f(x) is more prominent, reducing the bounding constant M = \sup_x f(x)/g_t(x) required for the rejection mechanism and increasing the overall acceptance rate. The approach is particularly effective when g belongs to an exponential family, as the tilting preserves the family structure while adapting to the target. To select the optimal t, one typically minimizes the Kullback-Leibler (KL) divergence between f and g_t, which can be achieved by solving the moment-matching equation \mathbb{E}_{g_t}[h(X)] = \mathbb{E}_f[h(X)]. This optimization often involves numerical methods, such as solving the saddlepoint equation derived from the cumulant generating function of g, and results in M approaching 1 more closely than with the untilted proposal, especially for targets in or near the exponential family of g. In practice, the tilted proposal can be updated iteratively within the rejection sampling loop or precomputed if Z_t is tractable, making it suitable for hybrid methods that combine rejection sampling with importance sampling to estimate expectations or normalizing constants. This technique was developed in the context of exponential families within statistical Monte Carlo methods during the late 20th century.
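As a concrete sketch of the idea (a hypothetical example rather than a general recipe), consider a standard normal tail target f(x) \propto \phi(x) for x \geq 2 with base proposal g = N(0, 1). Tilting with h(x) = x keeps the proposal in the normal family, g_t = N(t, 1); here t is chosen to minimize the bound M(t) directly (rather than by moment matching), which shrinks M from roughly 44 at t = 0 to about 5.9 at t = 2:
import numpy as np
from scipy.stats import norm

# Hypothetical example of exponential tilting: the target is a standard normal
# truncated to [2, inf), f(x) = phi(x) / P(X >= 2), and the base proposal is
# g = N(0, 1). Tilting with h(x) = x gives g_t = N(t, 1).
rng = np.random.default_rng(0)
tail = norm.sf(2.0)                                  # P(X >= 2) ≈ 0.0228

def M_of_t(t):
    # f(x)/g_t(x) = exp(-t x + t^2/2) / tail is decreasing in x for t > 0,
    # so the supremum over x >= 2 is attained at x = 2.
    return np.exp(-2.0 * t + 0.5 * t**2) / tail

print(f"untilted (t = 0): M = {M_of_t(0.0):.1f}")    # ≈ 44
print(f"tilted   (t = 2): M = {M_of_t(2.0):.1f}")    # ≈ 5.9

# Rejection sampling with the tilted proposal g_2 = N(2, 1).
t, M = 2.0, M_of_t(2.0)
samples = []
while len(samples) < 5000:
    y = rng.normal(t, 1.0)
    if y < 2.0:
        continue                                     # f(y) = 0 outside [2, inf)
    # f(y) / (M g_t(y)) = exp(-t y + t^2/2) / (tail * M)
    if rng.uniform() <= np.exp(-t * y + 0.5 * t**2) / (tail * M):
        samples.append(y)
print(f"mean of samples: {np.mean(samples):.3f}")    # ≈ E[X | X >= 2] ≈ 2.37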

Adaptive Rejection Sampling

Adaptive Rejection Sampling (ARS) is a variant of rejection sampling designed specifically for univariate log-concave target densities, where the logarithm of the unnormalized density, h(x) = \log f(x), is concave. This property allows the construction of piecewise linear upper and lower envelopes that tightly bound the target log-density, using tangent lines to h(x) at evaluated points and chords connecting those points. The upper envelope forms a rejection bound that majorizes the target, while the lower envelope serves as a squeezing function to accelerate acceptances without further evaluations of h. These envelopes are iteratively refined after each acceptance or rejection, incorporating new evaluation points to progressively tighten the bounds and improve sampling efficiency. The algorithm begins by initializing a set of abscissae that bracket the support of the target density, typically including points near the mode and boundaries, to define initial piecewise linear hulls for the upper envelope u_k(x) and lower envelope l_k(x). A piecewise exponential proposal distribution s_k(x) is then constructed by normalizing the upper envelope, s_k(x) = c_k^{-1} \exp(u_k(x)), where c_k is the normalizing constant computed via numerical integration over the pieces. Sampling proceeds by drawing a candidate x^* from s_k(x) and a uniform variate w \sim U(0,1); if w < \exp(l_k(x^*) - u_k(x^*)), accept immediately using the squeezing bound; otherwise, evaluate h(x^*) and accept if w < \exp(h(x^*) - u_k(x^*)). Upon acceptance or rejection requiring evaluation, the new point x^* (along with its derivative if needed) is added to the set of abscissae, and the envelopes are updated to form tighter piecewise linear approximations. This adaptive process ensures high efficiency, as the acceptance probability approaches 1 over iterations due to the envelopes converging to the target density, and it avoids the need to compute a global supremum constant M by relying on local constructions. ARS was introduced by Gilks and Wild in 1992 for application in Gibbs sampling. Extensions to non-log-concave densities, such as adaptive rejection Metropolis sampling, incorporate a Metropolis-Hastings step within the adaptive rejection framework to handle deviations from concavity. A representative example is sampling from a log-concave posterior distribution in Bayesian inference, such as the full conditional for parameters in a monoclonal antibody reactivity model analyzed via Gibbs sampling.
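A compact illustration of the tangent-based construction is sketched below for a standard normal target, which is log-concave with h(x) = -x^2/2 and h'(x) = -x. This is a simplified, hypothetical sketch: the lower squeezing bound is omitted (so h is evaluated for every candidate) and abscissae are added only after rejections, whereas the full Gilks-Wild algorithm also updates after accepted evaluations:
import numpy as np

# Simplified adaptive rejection sampling sketch for a standard normal target.
rng = np.random.default_rng(0)
h = lambda x: -0.5 * x**2       # log of the unnormalized target density
dh = lambda x: -x               # its derivative

def ars_sample(n_samples, x_init=(-1.0, 1.0)):
    xs = sorted(x_init)         # abscissae must bracket the mode
    samples = []
    while len(samples) < n_samples:
        x_arr = np.array(xs)
        hx, dhx = h(x_arr), dh(x_arr)
        # Intersections z[i] of the tangents at x[i] and x[i+1].
        z = (hx[1:] - hx[:-1] - x_arr[1:] * dhx[1:] + x_arr[:-1] * dhx[:-1]) / (dhx[:-1] - dhx[1:])
        lo = np.concatenate(([-np.inf], z))     # piece i covers (lo[i], hi[i])
        hi = np.concatenate((z, [np.inf]))
        # Tangent values at the piece boundaries (exp(-inf) handled as 0).
        a = np.where(np.isfinite(lo), hx + dhx * (lo - x_arr), -np.inf)
        b = np.where(np.isfinite(hi), hx + dhx * (hi - x_arr), -np.inf)
        w = (np.exp(b) - np.exp(a)) / dhx       # mass of each exponential piece
        # Pick a piece proportionally to its mass, then invert its CDF.
        i = rng.choice(len(xs), p=w / w.sum())
        v = rng.uniform()
        level = np.exp(a[i]) + v * (np.exp(b[i]) - np.exp(a[i]))
        x_star = x_arr[i] + (np.log(level) - hx[i]) / dhx[i]
        u_star = hx[i] + dhx[i] * (x_star - x_arr[i])   # upper hull at x_star
        if rng.uniform() < np.exp(h(x_star) - u_star):
            samples.append(x_star)
        else:
            xs = sorted(xs + [x_star])          # adapt: tighten the hull
    return np.array(samples)

draws = ars_sample(2000)
print(draws.mean(), draws.std())                # ≈ 0 and ≈ 1 for a standard normal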

Applications

Statistical Sampling

Rejection sampling plays a key role in Bayesian computation by enabling the generation of samples from the posterior distribution p(\theta \mid \mathrm{data}) \propto p(\mathrm{data} \mid \theta)\, p(\theta), particularly when conjugate priors are unavailable and the normalizing constant is intractable. This approach is especially valuable for complex models where direct sampling from the posterior is infeasible, allowing inference through independent and identically distributed (i.i.d.) samples that approximate the target distribution. In modern applications, such as approximate Bayesian computation (ABC), rejection sampling facilitates posterior inference for models with intractable likelihoods by simulating data from the model and accepting parameters that produce auxiliary statistics close to observed data summaries. The original motivation for rejection sampling traces back to John von Neumann's work on neutron transport simulations during the Manhattan Project, where it was used to model particle interactions by sampling from non-uniform distributions in a computationally efficient manner. In Monte Carlo integration, rejection sampling supports the estimation of expectations \mathbb{E}_p [h(X)] = \int h(x) p(x) \, dx by generating i.i.d. samples from the target p(x), which can then be averaged to approximate the integral with reduced bias under suitable proposal distributions. This method ensures unbiased estimates while leveraging the ability to evaluate unnormalized densities, making it suitable for integrals that lack closed-form solutions. For rare event simulation, rejection sampling aids in bounding tail probabilities in reliability analysis and risk assessment by proposing samples from an enveloping distribution and accepting those that fall within the rare event region, thereby estimating low-probability outcomes like system failures without excessive computational waste. To enhance efficiency, rejection sampling is often integrated with importance sampling for variance reduction; in this hybrid approach, proposals from an importance distribution are weighted and selectively accepted, improving the estimation of expectations in these and other settings by mitigating the high rejection rates of pure rejection methods.
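A minimal ABC rejection sketch of this idea is given below, under entirely hypothetical modeling choices: the mean \theta of a Normal(\theta, 1) model is inferred from simulated observations using a Normal(0, 5) prior, the sample mean as the summary statistic, and a tolerance \epsilon = 0.1:
import numpy as np

# Hypothetical ABC rejection example: infer the mean theta of a Normal(theta, 1)
# model using a Normal(0, 5) prior and the sample mean as the summary statistic.
# Proposals are accepted when the simulated summary falls within eps of the
# observed summary.
rng = np.random.default_rng(0)
observed = rng.normal(2.0, 1.0, size=50)          # stand-in for observed data
s_obs = observed.mean()
eps, n_posterior = 0.1, 1000

accepted = []
while len(accepted) < n_posterior:
    theta = rng.normal(0.0, 5.0)                  # draw theta from the prior
    simulated = rng.normal(theta, 1.0, size=50)   # simulate data from the model
    if abs(simulated.mean() - s_obs) < eps:       # compare summary statistics
        accepted.append(theta)

print(f"approximate posterior mean: {np.mean(accepted):.3f}")   # ≈ 2 for these data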

Machine Learning Contexts

In generative models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), rejection sampling is employed to draw samples from implicit densities by rejecting proposals that deviate from the learned manifold. In GANs, Discriminator Rejection Sampling (DRS) leverages the discriminator's output to estimate the density ratio between the data and generator distributions and uses it as an acceptance criterion for generated samples, improving sample quality by discarding those with low discriminator scores to better approximate the true data distribution. This approach addresses the challenge of implicit sampling in GANs, where direct likelihood evaluation is unavailable, and has been shown to enhance sample fidelity without retraining the model. As an alternative to Markov chain Monte Carlo (MCMC) methods, rejection sampling is particularly useful in scenarios where MCMC chains exhibit poor mixing, such as in particle filters for sequential data processing in tasks like time-series filtering or tracking. In particle filters, rejection-based variants, such as Variational Rejection Particle Filtering (VRPF), integrate rejection steps to resample particles more efficiently, reducing degeneracy in high-variance environments while maintaining unbiased estimates of the filtering distribution. This makes it suitable for dynamic models in online settings, where rapid adaptation to new observations is critical. Compared to Metropolis-Hastings (MH), a core MCMC algorithm, rejection sampling provides exact samples from the target distribution but becomes inefficient in high dimensions due to the curse of dimensionality, often requiring an impractically large number of proposals. In contrast, Metropolis-Hastings offers scalability through approximate sampling via Markov chains, making it preferable for applications with complex, high-dimensional posteriors, though it may suffer from autocorrelation in samples. Rejection sampling also relates to slice sampling, another MCMC technique, but differs by using a global envelope over the entire support rather than auxiliary variables to define slice regions, leading to simpler implementation at the cost of higher rejection rates when the envelope fits the target poorly. Post-2010 developments have integrated rejection sampling into normalizing flows for likelihood-based sampling, where learned rejection mechanisms serve as flexible base distributions to handle complex targets more effectively than standard Gaussian priors. For instance, resampling base distributions via rejection in flows enables modeling of complicated densities in tasks like density estimation. However, its limitations in high-dimensional settings, such as those exceeding 100 dimensions, often render it less practical than MCMC alternatives due to exponential growth in rejection rates. A notable recent application appears in diffusion models, where Diffusion Rejection Sampling (DiffRS) rejects noisy intermediates during the reverse process to better match the target distribution at each step, improving generation quality in image generation tasks by aligning transition kernels with the true posterior.