Binomial distribution
The binomial distribution is a discrete probability distribution that describes the probability of achieving a specific number of successes in a fixed number of independent trials, where each trial has exactly two possible outcomes—typically labeled as "success" or "failure"—and the probability of success remains constant across all trials.[1][2] It arises in scenarios modeled by a sequence of n Bernoulli trials, each with success probability p, and is fundamental to probability theory for counting discrete events.[3] The probability mass function (PMF) of the binomial distribution, denoted as B(n, p), gives the probability P(X = k) of exactly k successes in n trials as P(X = k) = \binom{n}{k} p^k (1-p)^{n-k},
where \binom{n}{k} = \frac{n!}{k!(n-k)!} is the binomial coefficient representing the number of ways to choose k successes out of n trials, for k = 0, 1, \dots, n.[1][2] The parameters are n, a positive integer specifying the number of trials, and p, a real number between 0 and 1 indicating the success probability per trial.[3] This formulation assumes independence between trials and a fixed success probability, distinguishing it from other distributions like the hypergeometric, which involves sampling without replacement.[1] Key statistical properties include the expected value (mean) \mu = np, which represents the average number of successes, and the variance \sigma^2 = np(1-p), measuring the spread around the mean; the standard deviation is \sigma = \sqrt{np(1-p)}.[1][2] The distribution is symmetric when p = 0.5, such as in fair coin flips; for p > 0.5 its mass concentrates at higher values of k (left-skewed), and for p < 0.5 at lower values (right-skewed).[2] The mode, or most likely value of k, lies between \lceil (n+1)p - 1 \rceil and \lfloor (n+1)p \rfloor.[1] In practice, the binomial distribution applies to fields like quality control (e.g., counting defective items in a batch), clinical trials (e.g., success rates of treatments), and surveys (e.g., proportion of favorable responses).[3] For large n, it can be approximated by the normal distribution when np and n(1-p) both exceed a common rule-of-thumb threshold (typically 5 or 10),[4] or by the Poisson distribution when n is large and p is small such that \lambda = np is moderate.[5]
Historically, the binomial distribution was formalized by Jacob Bernoulli in his 1713 posthumous work Ars Conjectandi, building on earlier combinatorial ideas from Blaise Pascal in the 17th century related to the "problem of points."[6] Abraham de Moivre later developed its normal approximation in 1733, advancing its role in statistical inference.[6]
Definitions
Probability Mass Function
The binomial random variable X represents the number of successes in n independent Bernoulli trials, each with success probability p, where 0 \leq p \leq 1 and n is a positive integer.[1] This distribution, originally developed by Jacob Bernoulli in his seminal 1713 work Ars Conjectandi, models scenarios such as coin flips or quality control inspections with binary outcomes.[7] The probability mass function (PMF) of X \sim \text{Bin}(n, p) is P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, where k = 0, 1, \dots, n and \binom{n}{k} = \frac{n!}{k!(n-k)!} denotes the binomial coefficient, counting the number of ways to select k successes from n trials.[1] The PMF is zero for any value of k outside this range.[1] This formula derives from the independence of the trials: the probability of exactly k successes and n-k failures in a specific sequence is p^k (1-p)^{n-k}, and multiplying by the \binom{n}{k} possible sequences yields the total probability.[8] Common notations include \text{Bin}(n, p) or B(n, p).[1]
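The PMF formula above can be evaluated directly for small n. The following sketch (Python; the function name binom_pmf is illustrative, and SciPy is used only as an optional cross-check) mirrors the definition:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p); zero outside the support {0, ..., n}."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example: probability of exactly 3 successes in 10 trials with p = 0.3.
print(binom_pmf(3, 10, 0.3))        # ~0.2668

# Optional cross-check against SciPy's implementation.
from scipy.stats import binom
print(binom.pmf(3, 10, 0.3))        # agrees to floating-point precision
```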
Parameters and Examples
The binomial distribution is characterized by two parameters: n, the number of independent Bernoulli trials, which must be a positive integer, and p, the probability of success on each trial, where 0 \leq p \leq 1.[1] These parameters define the distribution's support over the integers k = 0, 1, \dots, n, representing the number of successes observed.[9] This distribution models real-world scenarios involving a fixed number of independent trials, each with two possible outcomes: success or failure. For instance, it applies to coin flips, where a fair coin has p = 0.5; quality control processes, such as detecting defective items in manufacturing; or polling, where responses are yes/no on a binary question.[10] Consider a fair coin tossed 10 times, so n = 10 and p = 0.5. The probability of exactly 5 heads, P(X = 5), is given by the probability mass function as \binom{10}{5} (0.5)^{10} = 252 / 1024 \approx 0.246, which is the mode and reflects the symmetric, bell-shaped distribution centered at the mean.[11] In contrast, for a manufacturing process with a 20% defective rate (n = 100, p = 0.2), the probability mass function yields a mildly right-skewed histogram, with probabilities clustered around the mean of 20 defectives and a longer tail toward larger counts, illustrating how p < 0.5 concentrates mass toward the lower end of the support.[12] Boundary cases highlight the distribution's flexibility. When n = 1, it reduces to the Bernoulli distribution with parameter p, focusing on a single trial's outcome. If p = 0, all probability mass concentrates at 0 (no successes); if p = 1, it concentrates at n (all successes), resulting in degenerate distributions.[13]
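The coin-flip and manufacturing examples above can be checked numerically; a minimal sketch using only the Python standard library (the brute-force scan over k is purely illustrative):

```python
from math import comb

# Fair coin, n = 10: P(X = 5) = C(10, 5) / 2^10.
print(comb(10, 5) / 2**10)                      # 0.24609375, i.e. ~0.246

# Manufacturing example, n = 100, p = 0.2: mass concentrates near np = 20.
pmf = [comb(100, k) * 0.2**k * 0.8**(100 - k) for k in range(101)]
mode = max(range(101), key=lambda k: pmf[k])
print(mode, sum(pmf[:31]))                      # mode is 20; almost all mass lies below ~30
```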
Properties
Moments and Central Tendency
The binomial random variable X \sim \text{Bin}(n, p) represents the number of successes in n independent Bernoulli trials, each with success probability p. The expected value, or mean, of X is given by E[X] = np. This follows from expressing X as the sum of n indicator random variables Y_i for the i-th trial succeeding, where each Y_i \sim \text{Bernoulli}(p) has E[Y_i] = p, and applying the linearity of expectation: E[X] = E\left[\sum_{i=1}^n Y_i\right] = \sum_{i=1}^n E[Y_i] = np. The variance of X is \operatorname{Var}(X) = np(1-p), and the standard deviation is \sqrt{np(1-p)}. This derivation also uses the indicator representation, where \operatorname{Var}(Y_i) = p(1-p) for each trial, and independence implies \operatorname{Var}(X) = \sum_{i=1}^n \operatorname{Var}(Y_i) = np(1-p).
Higher moments provide further characterization of the distribution's shape. The skewness, measuring asymmetry, is \frac{1-2p}{\sqrt{np(1-p)}}; it is zero when p = 0.5 (symmetric case), positive for p < 0.5, and negative for p > 0.5. The kurtosis, indicating tail heaviness relative to the normal distribution, is 3 + \frac{1-6p(1-p)}{np(1-p)}; the excess kurtosis term \frac{1-6p(1-p)}{np(1-p)} is positive when p(1-p) < 1/6, negative otherwise, and approaches zero as n grows large.
The mode, the value of X with the highest probability, is \lfloor (n+1)p \rfloor or \lceil (n+1)p - 1 \rceil; the distribution is unimodal except when (n+1)p is an integer, in which case both (n+1)p and (n+1)p - 1 are modes. The median m, the value dividing the distribution such that half the probability lies on each side, approximates np and lies within the interval [\lfloor np \rfloor, \lceil np \rceil]. When p = 0.5 and n is even, the mean, median, and mode all coincide at n/2. For p \neq 0.5, the distribution is right-skewed when p < 0.5 and left-skewed when p > 0.5, although the mean, median, and mode always lie within one unit of one another.
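As a quick numerical check of these moment formulas, the sketch below (arbitrary parameter values; SciPy is used only for comparison) evaluates the closed forms alongside scipy.stats.binom.stats:

```python
import numpy as np
from scipy.stats import binom

n, p = 40, 0.3
mean, var = n * p, n * p * (1 - p)
skew = (1 - 2 * p) / np.sqrt(n * p * (1 - p))
excess_kurt = (1 - 6 * p * (1 - p)) / (n * p * (1 - p))

# SciPy reports mean, variance, skewness, and excess kurtosis directly.
m, v, s, k = binom.stats(n, p, moments='mvsk')
print(mean, var, skew, excess_kurt)
print(float(m), float(v), float(s), float(k))   # should match the closed-form values
```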
Tail Bounds and Inequalities
Tail bounds provide upper limits on the probability that a binomial random variable deviates significantly from its expected value, which is crucial for concentration inequalities and risk analysis. For a binomial random variable X \sim \operatorname{Bin}(n, p) with mean \mu = np, these bounds quantify the unlikelihood of extreme outcomes in the upper or lower tails.
Markov's inequality offers a simple first-order bound for the upper tail. For a > np, it states that P(X \geq a) \leq \frac{np}{a}.[14] This follows directly from applying the general Markov inequality to the non-negative random variable X, using its expectation. While loose, it provides a baseline without requiring higher moments or independence details beyond the mean.
Hoeffding's inequality delivers a sharper, parameter-independent bound on deviations from the mean. Specifically, P(|X - np| \geq t) \leq 2 \exp\left(-\frac{2t^2}{n}\right) for t > 0.[15] This result holds for sums of bounded independent random variables, such as the Bernoulli trials underlying the binomial distribution, and relies on the range of each variable (here, 0 to 1). It excels in additive deviation scenarios, particularly when p is unknown or varies.
Chernoff bounds provide tighter exponential control, especially for multiplicative deviations. For the upper tail, P(X \geq (1 + \delta)np) \leq \exp\left(-\frac{np \delta^2}{3}\right) holds for 0 < \delta < 1. A symmetric form applies to the lower tail: P(X \leq (1 - \delta)np) \leq \exp\left(-\frac{np \delta^2}{2}\right) for 0 < \delta < 1. These derive from applying Markov's inequality to the moment generating function \mathbb{E}[e^{\lambda X}] and optimizing over \lambda, yielding relative error bounds that scale with the variance np(1-p).
Sanov's theorem extends to large deviations in the empirical distribution of binomial samples, stating that the probability of the sample proportion deviating from p by a fixed amount decays exponentially with rate given by the relative entropy D(\hat{p} \| p) = \hat{p} \log(\hat{p}/p) + (1 - \hat{p}) \log((1 - \hat{p})/(1 - p)). For binomial settings, this characterizes the precise asymptotics of tail events as n \to \infty, focusing on type deviations rather than fixed thresholds.
These bounds find applications in algorithm analysis, such as bounding buffer overflow probabilities in randomized load balancing or controlling false positive rates in hypothesis testing for binomial outcomes.[16] For instance, Chernoff bounds ensure that hashing algorithms achieve near-uniform distribution with high probability, limiting collision risks. Comparisons reveal trade-offs: Hoeffding's bound is simpler and p-independent, making it preferable for worst-case additive errors, but Chernoff's multiplicative form is tighter when deviations are proportional to the mean, especially for p near 0 or 1.[17] Markov's inequality remains the coarsest, useful only for crude estimates.
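To illustrate the relative tightness of these bounds, the following sketch (arbitrary parameter values) compares them with the exact upper-tail probability computed by SciPy; the one-sided Hoeffding form is used here:

```python
import math
from scipy.stats import binom

n, p = 1000, 0.1
mu = n * p                     # mean = 100
a = 130                        # threshold above the mean
t = a - mu                     # additive deviation
delta = a / mu - 1             # multiplicative deviation (0.3)

exact = binom.sf(a - 1, n, p)                 # exact P(X >= a)
markov = mu / a                               # Markov bound
hoeffding = math.exp(-2 * t**2 / n)           # one-sided Hoeffding bound
chernoff = math.exp(-mu * delta**2 / 3)       # upper-tail Chernoff bound

print(f"exact={exact:.3e}  markov={markov:.3e}  "
      f"hoeffding={hoeffding:.3e}  chernoff={chernoff:.3e}")
```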
Characterizing the Distribution
Cumulative Distribution Function
The cumulative distribution function (CDF) of a binomial random variable X with parameters n (number of trials) and p (success probability) is given by F(k; n, p) = P(X \leq k) = \sum_{i=0}^{\lfloor k \rfloor} \binom{n}{i} p^i (1-p)^{n-i}, where \binom{n}{i} denotes the binomial coefficient, and the sum is taken over integer values of i from 0 to the floor of k.[1] This function is non-decreasing and right-continuous as a function of k, with boundary values F(-1; n, p) = 0 and F(n; n, p) = 1. It exhibits discontinuities (jumps) precisely at the integer points k = 0, 1, \dots, n, corresponding to the support of the distribution, where the size of each jump equals the probability mass at that point.[18]
An alternative expression for the CDF relates it to the regularized incomplete beta function I_x(a, b), defined as the ratio of the incomplete beta function to the complete beta function: F(k; n, p) = I_{1-p}(n - k, k + 1). This identity arises from integrating the beta density and equating it to the binomial sum via repeated integration by parts.[19]
For computational purposes, the direct summation is straightforward and accurate when n is small (e.g., n < 20), as it simply accumulates the terms of the probability mass function. However, for larger n, the summation can encounter numerical challenges, such as overflow from large intermediate binomial coefficients or loss of precision in floating-point arithmetic, particularly when p is close to 0 or 1. In modern software libraries, the beta function relation is often employed instead, leveraging stable algorithms for the incomplete beta (e.g., continued fraction expansions or series approximations) to ensure reliable evaluation across a wide range of parameters.[1]
In statistical applications, the binomial CDF plays a key role in hypothesis testing for proportions. For instance, under a null hypothesis specifying a particular p_0, the p-value for a one-sided lower-tail test based on an observed number of successes y is P(X \leq y \mid n, p_0) = F(y; n, p_0), providing the probability of observing a result at least as extreme as the data.[20]
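Both evaluation routes, direct summation and the incomplete beta identity, can be compared directly; a minimal sketch with illustrative helper names:

```python
from math import comb
from scipy.special import betainc   # betainc(a, b, x) = I_x(a, b)

def binom_cdf_sum(k, n, p):
    """F(k; n, p) by direct summation of PMF terms."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def binom_cdf_beta(k, n, p):
    """F(k; n, p) via the identity F(k; n, p) = I_{1-p}(n - k, k + 1)."""
    return betainc(n - k, k + 1, 1 - p)

n, p, k = 50, 0.3, 12
print(binom_cdf_sum(k, n, p), binom_cdf_beta(k, n, p))   # the two routes agree
```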
Quantile Function
The quantile function of the binomial distribution, denoted Q(α; n, p), is defined as the smallest integer k in {0, 1, ..., n} such that the cumulative distribution function F(k; n, p) ≥ α, where 0 < α < 1 and F(k; n, p) = ∑_{i=0}^k \binom{n}{i} p^i (1-p)^{n-i}.[1] This function is non-decreasing in α and always returns an integer value within the support of the distribution. For α = 0.5, Q(0.5; n, p) gives the median, which lies near np.[1]
Computation of Q(α; n, p) typically involves a binary search over k from 0 to n, evaluating F(k; n, p) at midpoints until the smallest k satisfying the inequality is found; this approach is efficient given the monotonicity of the CDF and requires O(log n) evaluations.[21] The CDF itself can be evaluated using its exact relation to the regularized incomplete beta function: F(k; n, p) = I_{1-p}(n - k, k + 1), where I_x(a, b) is the regularized incomplete beta function defined as I_x(a, b) = B_x(a, b) / B(a, b), with B_x(a, b) the incomplete beta function and B(a, b) the beta function. Equivalently, F(k; n, p) = 1 - I_p(k + 1, n - k), allowing the search condition to be rephrased as finding the minimal k where I_p(k + 1, n - k) ≤ 1 - α. Numerical implementations typically evaluate the CDF through the regularized incomplete beta function during the binary search, which remains the core method for precision across all parameter ranges.[21]
The quantile function finds application in determining critical values for binomial hypothesis tests, where Q(1 - α; n, p_0) under the null hypothesis p = p_0 defines the rejection region threshold. It also aids in sample size calculations to achieve specified power, by solving for n such that the quantile under the alternative hypothesis aligns with the desired type II error rate. For large n, an asymptotic approximation provides Q(α; n, p) ≈ np + z_α √[np(1 - p)], where z_α is the α-quantile of the standard normal distribution, though exact methods via the above procedures are preferred for small to moderate n.[1]
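A minimal binary-search implementation of this quantile function, assuming SciPy's binom.cdf for the CDF evaluations (the helper name binom_quantile is illustrative):

```python
from scipy.stats import binom

def binom_quantile(alpha, n, p):
    """Smallest integer k in {0, ..., n} with F(k; n, p) >= alpha."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if binom.cdf(mid, n, p) >= alpha:
            hi = mid            # mid already satisfies the condition; search left half
        else:
            lo = mid + 1        # condition fails; search right half
    return lo

n, p = 100, 0.4
print(binom_quantile(0.5, n, p))     # median, near np = 40
print(int(binom.ppf(0.5, n, p)))     # SciPy's quantile for comparison
```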
Statistical Inference
Parameter Estimation
In the binomial distribution, parameter estimation typically arises in two scenarios: when the number of trials n is known and the success probability p must be estimated, or when both n and p are unknown.[22][23]
When n is fixed and known, the primary focus is on estimating p from an observed number of successes X in n independent Bernoulli trials.[24] The maximum likelihood estimator (MLE) for p is given by \hat{p} = X / n, which represents the sample proportion of successes.[24][25] This estimator is derived by maximizing the log-likelihood function \ell(p) = X \log p + (n - X) \log (1 - p); setting the derivative with respect to p to zero, \frac{d\ell}{dp} = \frac{X}{p} - \frac{n - X}{1 - p} = 0, yields \hat{p} = X / n.[26][24] The MLE is unbiased, meaning E[\hat{p}] = p, and has variance \mathrm{Var}(\hat{p}) = p(1 - p)/n, which decreases as n increases.[22][27] For large n, it achieves the minimum possible variance among unbiased estimators, as per the Cramér-Rao lower bound.[24][28] The method of moments estimator for p coincides with the MLE when n is known, as it equates the observed count X to the population mean np, yielding the same \hat{p} = X/n.[22][28]
When both n and p are unknown, estimation becomes more complex; the method of moments requires solving simultaneous equations from the first two population moments, \mu_1 = np and \mu_2 = np(1 - p) + (np)^2, using the sample mean and variance, but this often leads to underestimation of n and numerical challenges due to the discrete nature of the parameters.[23] In such cases, observed frequencies can provide initial estimates, though they are prone to bias, and alternatives such as the negative binomial model, which counts failures until a fixed number of successes is reached, may be more suitable for certain data structures.[23][29]
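The estimators discussed above can be illustrated on simulated data; in the sketch below, the simulated counts and the simple moment-based solver for unknown (n, p) are assumptions for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Known n: the MLE of p is the sample proportion of successes.
n_trials, p_true = 50, 0.3
x = rng.binomial(n_trials, p_true, size=200)   # 200 independent Bin(50, 0.3) counts
p_hat = x.mean() / n_trials                    # pooled MLE across the 200 counts
print(p_hat)

# Unknown n and p: method of moments from mean = np and variance = np(1 - p).
m, v = x.mean(), x.var(ddof=1)
p_mom = 1 - v / m            # can be unstable when the sample variance is near the mean
n_mom = m / p_mom
print(round(n_mom), p_mom)
```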
Confidence Intervals for p
Confidence intervals for the binomial parameter p, with n known, provide a range [L, U] such that the probability P(L \leq p \leq U \mid \text{data}) \approx 1 - \alpha. These intervals quantify uncertainty around the point estimate \hat{p} = X/n, where X is the observed number of successes. Various methods exist to construct such intervals, balancing coverage probability (the actual probability that the interval contains the true p) and interval width, with performance varying by sample size and true p value.[30]
The Wald interval, also known as the normal approximation interval, is given by \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, where z_{\alpha/2} is the (1 - \alpha/2)-quantile of the standard normal distribution. This method relies on the asymptotic normality of \hat{p} and performs well for large n and p away from 0 or 1, but it exhibits poor coverage for small n or when p is near the boundaries, often undercovering the true p by up to 10-20% in simulations.[30][31]
The Wilson score interval improves upon the Wald interval by recentering and is formulated as \frac{1}{1 + z_{\alpha/2}^2/n}\left( \hat{p} + \frac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}(1 - \hat{p})}{n} + \frac{z_{\alpha/2}^2}{4n^2} } \right), where the center shifts toward 0.5 for better symmetry. Derived from inverting the score test, it achieves near-nominal coverage across a wider range of n and p, with simulation studies showing coverage probabilities closer to 1 - \alpha than the Wald interval, especially for n < 50. The interval was originally proposed by Edwin B. Wilson for practical inference in binomial settings.[30]
The Agresti-Coull interval simplifies computation by adding pseudocounts: treat the data as if there were X + z_{\alpha/2}^2/2 successes in n + z_{\alpha/2}^2 trials (z_{\alpha/2}^2 \approx 4 for 95% confidence), then apply the Wald formula to these adjusted values. For 95% intervals, this equates to adding roughly 2 successes and 4 trials. It offers good coverage for small samples, outperforming the exact Clopper-Pearson interval in terms of shorter average length while maintaining coverage above 94%, and is recommended for its ease and reliability in applied settings.[31][30]
The arcsine transformation interval applies the variance-stabilizing transformation \sin^2\left(\arcsin(\sqrt{\hat{p}}) \pm z_{\alpha/2} / (2 \sqrt{n})\right), which stretches the distribution ends to improve the normality approximation. This method stabilizes the variance of \hat{p} and yields reasonable coverage for moderate n, though it can produce asymmetric intervals and performs less favorably near boundaries compared to the Wilson or Agresti-Coull intervals in simulation-based evaluations.[30]
Among these, the Wilson score interval is often preferred for its balance of coverage accuracy and computational simplicity, with coverage rates typically within 1-2% of nominal across diverse scenarios, as evidenced by extensive simulations. The exact Clopper-Pearson interval, obtained by inverting the binomial test using beta quantiles, guarantees at least 1 - \alpha coverage but tends to be overly conservative (coverage up to 5-10% above nominal) and computationally intensive for large n, making it suitable mainly for small samples. Comparative studies recommend Wilson or Agresti-Coull over Wald for most practical uses, with coverage assessed via Monte Carlo simulations showing Wilson superior for p near 0 or 1.[30][31]
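The three closed-form intervals above translate directly into code; the following sketch gives illustrative (not library-grade) implementations at the 95% level:

```python
from math import sqrt
from scipy.stats import norm

def wald(x, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    ph = x / n
    h = z * sqrt(ph * (1 - ph) / n)
    return ph - h, ph + h

def wilson(x, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    ph = x / n
    denom = 1 + z**2 / n
    center = (ph + z**2 / (2 * n)) / denom
    half = z * sqrt(ph * (1 - ph) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def agresti_coull(x, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    n_adj = n + z**2
    p_adj = (x + z**2 / 2) / n_adj
    h = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - h, p_adj + h

# Example: 7 successes out of 50 trials.
print(wald(7, 50))
print(wilson(7, 50))
print(agresti_coull(7, 50))
```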
Related Distributions and Approximations
Exact Relations
The binomial distribution with a single trial, denoted \text{Bin}(1, p), is identical to the Bernoulli distribution with success probability p, where the probability mass function (PMF) is \Pr(X = 1) = p and \Pr(X = 0) = 1 - p. More generally, a binomial random variable X \sim \text{Bin}(n, p) arises as the sum of n independent and identically distributed (i.i.d.) Bernoulli random variables, each with success probability p; the PMF of the binomial is then \Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k} for k = 0, 1, \dots, n.[32]
The binomial distribution is closed under summation for independent random variables sharing the same success probability. Specifically, if X_i \sim \text{Bin}(n_i, p) for i = 1, \dots, m are independent, then their sum S = \sum_{i=1}^m X_i follows \text{Bin}\left( \sum_{i=1}^m n_i, p \right), with PMF \Pr(S = k) = \binom{\sum n_i}{k} p^k (1-p)^{\sum n_i - k} for k = 0, 1, \dots, \sum n_i. This property follows directly from the convolution of the individual PMFs, leveraging the identical p to preserve the binomial form.[32][33]
A generalization occurs when the success probabilities differ across trials: the sum of independent Bernoulli random variables X_i \sim \text{Bern}(p_i) for i = 1, \dots, n follows the Poisson binomial distribution, with PMF given by the convolution \Pr(S = k) = \sum_{A \subseteq \{1, \dots, n\} : |A| = k} \prod_{i \in A} p_i \prod_{j \notin A} (1 - p_j). When all p_i = p, the Poisson binomial reduces exactly to the binomial distribution \text{Bin}(n, p). This relation highlights the binomial as a special homogeneous case of the more general Poisson binomial.[34]
The beta-binomial distribution emerges when the binomial success probability is uncertain and follows a beta prior. If p \sim \text{Beta}(\alpha, \beta) and, conditional on p, X \mid p \sim \text{Bin}(n, p), then the marginal distribution of X is beta-binomial with parameters n, \alpha, \beta, having PMF \Pr(X = k) = \binom{n}{k} \frac{B(k + \alpha, n - k + \beta)}{B(\alpha, \beta)}, where B(\cdot, \cdot) is the beta function. This compound distribution accounts for heterogeneity in p, leading to overdispersion relative to the standard binomial: writing \bar{p} = \alpha/(\alpha+\beta) for the prior mean, the variance is n\bar{p}(1-\bar{p}) \frac{\alpha + \beta + n}{\alpha + \beta + 1} > n\bar{p}(1-\bar{p}).[35]
The binomial distribution relates to the hypergeometric distribution through a limiting process. The hypergeometric distribution \text{Hyper}(N, K, n) models the number of successes in n draws without replacement from a finite population of size N containing K successes, with PMF \Pr(X = k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}. As N \to \infty with K/N \to p fixed, the hypergeometric converges to the binomial \text{Bin}(n, p), reflecting independent trials in an effectively infinite population.[36][33]
The negative binomial distribution inverts the fixed-trials perspective of the binomial. While the binomial \text{Bin}(n, p) counts successes in n fixed trials, the negative binomial \text{NB}(r, p) (number of failures before the r-th success) counts trials until r fixed successes, with PMF \Pr(X = k) = \binom{k + r - 1}{k} p^r (1-p)^k for k = 0, 1, \dots.
The two are inversely related: for fixed n and r, the event \{ \text{NB}(r, p) \geq n \} corresponds to at most r-1 successes in the first n + r - 1 trials under the binomial distribution \text{Bin}(n + r - 1, p), linking their cumulative distribution and survival functions via P(\text{NB}(r, p) \geq n) = P(\text{Bin}(n + r - 1, p) \leq r - 1).[33]
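Two of these relations, closure under summation and the negative binomial tail identity, are easy to verify numerically; a sketch with arbitrary parameter values:

```python
import numpy as np
from scipy.stats import binom, nbinom

p = 0.35

# Closure under summation: Bin(10, p) + Bin(15, p) should equal Bin(25, p).
pmf1 = binom.pmf(np.arange(11), 10, p)
pmf2 = binom.pmf(np.arange(16), 15, p)
conv = np.convolve(pmf1, pmf2)                              # PMF of the independent sum
print(np.allclose(conv, binom.pmf(np.arange(26), 25, p)))   # True

# Tail identity: P(NB(r, p) >= n) = P(Bin(n + r - 1, p) <= r - 1);
# SciPy's nbinom counts failures before the r-th success.
r, n = 4, 9
print(nbinom.sf(n - 1, r, p), binom.cdf(r - 1, n + r - 1, p))   # equal values
```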
Limiting Approximations
The binomial distribution arises as the sum of independent Bernoulli trials, and under certain limiting regimes it converges in distribution to other well-known distributions, providing useful approximations for computation and analysis.
One primary limiting case occurs when the number of trials n tends to infinity while the success probability p tends to zero such that the product \lambda = np remains fixed and positive. In this regime, known as the law of rare events, the binomial distribution \mathrm{Bin}(n, p) converges to the Poisson distribution with parameter \lambda. This Poisson approximation is particularly effective for modeling rare events, where successes are infrequent relative to the total trials. The probability mass function (PMF) of the binomial, given by P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, approximates the Poisson PMF P(Y = k) = e^{-\lambda} \lambda^k / k! through the expansion \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k} \approx \frac{e^{-\lambda} \lambda^k}{k!} as n \to \infty, with the approximation improving when n is large and p is small (typically n \geq 100 and np \leq 10).[37] The error in this approximation can be bounded using techniques like the Stein-Chen method, yielding total variation distances of at most \min(1, 1/\lambda)\, np^2, which decreases as p \to 0.[38] This limit is foundational for applications in rare event simulation and queueing theory, where exact binomial computations become infeasible.
A second key limiting approximation arises when n \to \infty with p fixed (or more generally, 0 < p < 1), leading to convergence to the normal distribution via the central limit theorem (CLT). Specifically, if X \sim \mathrm{Bin}(n, p), then the standardized variable (X - np)/\sqrt{np(1-p)} converges in distribution to the standard normal Z \sim N(0,1), so X \approx N(np, np(1-p)) for large n. This de Moivre–Laplace theorem, originally established for the binomial case, requires the conditions np \to \infty and n(1-p) \to \infty to ensure the approximation's validity, typically holding well when n \geq 30 and p is not too close to 0 or 1.[39] The normal approximation is advantageous for moderate p and large n, facilitating analytical tractability in hypothesis testing and confidence intervals.[40] For cumulative probabilities, the continuity correction enhances accuracy by treating the discrete binomial as continuous: P(a \leq X \leq b) \approx \Phi\left(\frac{b + 0.5 - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{a - 0.5 - np}{\sqrt{np(1-p)}}\right), where \Phi is the standard normal CDF.
The error in the normal approximation is quantified by the Berry–Esseen theorem, which provides a uniform bound on the Kolmogorov distance of O(1/\sqrt{n}), specifically C \rho / (\sigma^3 \sqrt{n}) with C \leq 0.4748 among the best known bounds on the constant, where \rho = E[|X_1 - p|^3] and \sigma^2 = p(1-p) for the underlying Bernoulli variables.[41] This rate confirms the approximation's asymptotic sharpness, though practical errors may be smaller for symmetric cases near p = 0.5.[42] The local central limit theorem extends this to point probabilities, stating that for lattice distributions like the binomial, the point probabilities converge uniformly to the normal density: P(X = k) \approx \frac{1}{\sqrt{2\pi np(1-p)}} \exp\left( -\frac{(k - np)^2}{2np(1-p)} \right) as n \to \infty, under the same conditions, with error terms also O(1/\sqrt{n}).[43] Overall, the Poisson limit suits sparse success scenarios (e.g., defect counts in manufacturing), while the normal excels in balanced, high-volume settings (e.g., election polling), guiding the choice based on np and n(1-p).[44]
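The two regimes can be compared against the exact CDF; the sketch below (arbitrary parameters) evaluates the Poisson approximation in the rare-event regime and the continuity-corrected normal approximation in the CLT regime:

```python
import math
from scipy.stats import binom, norm, poisson

# Rare-event regime: large n, small p, lambda = np moderate.
n, p = 2000, 0.002
lam = n * p
k = 6
print(binom.cdf(k, n, p), poisson.cdf(k, lam))   # close agreement

# CLT regime: large n, fixed p; continuity correction for P(a <= X <= b).
n, p = 400, 0.3
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
a, b = 110, 130
exact = binom.cdf(b, n, p) - binom.cdf(a - 1, n, p)
approx = norm.cdf((b + 0.5 - mu) / sigma) - norm.cdf((a - 0.5 - mu) / sigma)
print(exact, approx)                             # error is O(1/sqrt(n))
```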
Computational Aspects
Generating Random Variates
Generating random variates from the binomial distribution Bin(n, p) is essential for Monte Carlo simulations and statistical modeling. Common methods include exact techniques like inverse transform sampling and the summation of Bernoulli trials, as well as rejection-based approaches for efficiency. Special-purpose algorithms optimize performance for specific parameter ranges, while software libraries provide robust implementations.[45]
The inverse transform method generates a binomial variate by drawing a uniform random variable U ~ Uniform(0,1) and finding the smallest integer k such that the cumulative distribution function F(k) ≥ U, where F(k) = ∑_{i=0}^k \binom{n}{i} p^i (1-p)^{n-i}. This can be computed efficiently using a recursive approach that accumulates probabilities sequentially until the sum exceeds U, avoiding full computation of the binomial coefficients each time. The expected number of iterations is roughly np + 1, so the method is fast when np is small but scales linearly with np in general.[45] A simple exact method, equivalent to inverse transform in some implementations, involves generating n independent Bernoulli(p) random variables (each via a single uniform comparison to p) and summing them to obtain the binomial variate; this requires O(n) time and is straightforward for moderate n.[45]
Rejection sampling enhances efficiency, particularly when approximations are viable. For cases where np is small (e.g., p near 0 and n moderate), a Poisson(λ = np) proposal distribution can be used, generating a Poisson variate and accepting it if it is at most n; rejections are rare under the approximation. More generally, rejection methods employ envelopes like triangular or exponential functions to bound the probability mass function, accepting or rejecting based on a uniform draw.[46] The BTPE (binomial, triangle, parallelogram, exponential) algorithm is a specialized rejection sampler for moderate n where min(np, n(1-p)) ≥ 10. It decomposes the support into regions (a triangular central region, parallelogram sides, and exponential tails) covered by a piecewise majorizing function, using 2-4 uniform variates per accepted sample on average and fixed memory independent of n. BTPE outperforms earlier constant-memory methods by a factor of more than 2 in execution time for its parameter range.[47]
The alias method provides fast generation for discrete distributions with finite support like the binomial. It preprocesses the probabilities into probability and alias tables in O(n) time, allowing O(1) time per variate thereafter by drawing a uniform index and a sub-uniform to decide between the index or its alias. This is particularly efficient for repeated sampling with fixed n and p.[47]
For large n, approximate proposals improve efficiency; for example, in rejection sampling from a normal approximation N(np, np(1-p)), a normal variate is generated, rounded to the nearest integer k (0 ≤ k ≤ n), and accepted with probability proportional to the ratio of the exact PMF to the normal envelope, yielding exact variates with an expected O(1) uniforms per sample after occasional rejections. Quantile-based generation via the inverse CDF remains an option for small n.[48]
In software, R's rbinom(n, size, prob) function generates n binomial variates with size trials and success probability prob, using a combination of methods including BTPE for suitable parameters to ensure efficiency.
Similarly, NumPy's numpy.random.Generator.binomial(n, p, size=None) (recommended over the legacy numpy.random.binomial) draws samples (scalar or array of shape size) from Bin(n, p); the generator obtains its uniforms from the underlying bit generator (PCG64 by default) and switches between an inversion method and a BTPE-style rejection method depending on the parameters.[49][50]
Direct methods like Bernoulli summation run in O(n) time, while rejection-based algorithms like BTPE achieve near-constant time per variate (average 2-4 uniforms), making them preferable for large-scale simulations. For validation, simulated samples should have empirical mean ≈ np and variance ≈ np(1-p), matching theoretical moments; deviations indicate implementation issues.[47][48]
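The Bernoulli-summation and inverse transform samplers described above, together with the moment-based validation, can be sketched as follows (function names are illustrative; NumPy's built-in generator is included only for comparison):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_bernoulli_sum(n, p, size, rng):
    # Sum of n Bernoulli(p) indicators per variate: O(n) time each.
    return (rng.random((size, n)) < p).sum(axis=1)

def sample_inverse_transform(n, p, rng):
    # Sequential search: accumulate PMF terms until the running CDF exceeds U.
    u = rng.random()
    pmf = (1 - p) ** n                            # P(X = 0)
    cdf, k = pmf, 0
    while cdf < u and k < n:
        pmf *= (n - k) / (k + 1) * p / (1 - p)    # recursive PMF update
        cdf += pmf
        k += 1
    return k

n, p = 50, 0.3
x = sample_bernoulli_sum(n, p, 100_000, rng)
y = np.array([sample_inverse_transform(n, p, rng) for _ in range(10_000)])
z = rng.binomial(n, p, 100_000)       # NumPy's built-in generator for comparison

for s in (x, y, z):
    print(s.mean(), s.var())          # each should be near np = 15 and np(1-p) = 10.5
```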
Numerical Evaluation
The probability mass function (PMF) of the binomial distribution can be computed via direct summation using the formula P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, where the binomial coefficient \binom{n}{k} is calculated recursively to minimize intermediate overflow: \binom{n}{k} = \binom{n}{k-1} \cdot \frac{n-k+1}{k}, starting from \binom{n}{0} = 1.[51] This multiplicative approach avoids computing large factorials directly and is suitable for moderate n (up to around 1000 in double-precision floating-point arithmetic). For larger n, direct computation risks overflow and underflow in floating-point representations, leading to numerical instability; relative errors can exceed machine epsilon (about 10^{-16} for double precision) when intermediate terms grow beyond 2^{53}, the mantissa limit.[51]
To address this, the logarithm of the PMF is often evaluated instead: \log P(X = k) = \log \binom{n}{k} + k \log p + (n-k) \log(1-p), where \log \binom{n}{k} uses the log-gamma function via \log \Gamma(n+1) - \log \Gamma(k+1) - \log \Gamma(n-k+1), implemented accurately in numerical libraries without explicit Stirling's approximation for the core computation.[51] Stirling's approximation, \log n! \approx n \log n - n + \frac{1}{2} \log (2 \pi n), provides a fallback for extremely large n beyond library support, with relative errors decreasing as O(1/n).[52]
The cumulative distribution function (CDF) F(k; n, p) = \sum_{i=0}^{k} P(X = i) can be computed by summing the recursive or log-PMF terms, but for efficiency and precision with large n it is usually evaluated through the relation to the regularized incomplete beta function, F(k; n, p) = 1 - I_p(k+1, n-k), where I_x(a,b) is computed using continued fraction expansions or numerical quadrature to avoid summation loops and overflow. These methods ensure relative errors below 10^{-15} for n > 1000 in standard double precision, switching to approximations only when exact computation exceeds feasible precision. Quantiles, or the inverse CDF, are computed numerically by solving F(q; n, p) = \alpha for integer q, typically using a Newton-type iteration on the CDF (whose discrete derivative is the PMF) for rapid convergence or a bisection search for guaranteed monotonicity and robustness, converging in O(\log n) iterations with errors bounded by 1 in the discrete domain.[51]
Standard libraries implement these techniques for practical evaluation. In Python's SciPy, binom.pmf(k, n, p) and binom.cdf(k, n, p) use log-gamma for the PMF and the incomplete beta function for the CDF, while binom.ppf(alpha, n, p) employs a monotone search on the CDF; for arbitrary precision with large n, the mpmath library supports exact computation via multi-precision gamma functions.[53] Similarly, MATLAB's binopdf(k, n, p) and binocdf(k, n, p) rely on log-gamma and beta integrals, handling n up to machine limits with relative errors under 10^{-14}. Error analysis in these implementations confirms that floating-point relative errors remain below 10^{-13} for n \leq 10^6 and p away from 0 or 1, prompting a shift to normal or Poisson approximations beyond that threshold for stability.[51]
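A log-space evaluation of the PMF along the lines described above can be written with the standard library's log-gamma function; a minimal sketch (the function name is illustrative):

```python
from math import exp, lgamma, log

def log_binom_pmf(k, n, p):
    """log P(X = k) for X ~ Bin(n, p), using log-gamma to avoid overflow."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_coeff + k * log(p) + (n - k) * log(1 - p)

# n = 10^6 would overflow a naive factorial-based computation, but the
# log-space value is well within double-precision range.
n, p, k = 1_000_000, 0.3, 300_000
lp = log_binom_pmf(k, n, p)
print(lp, exp(lp))   # exp(lp) is close to 1/sqrt(2*pi*n*p*(1-p)), about 8.7e-4
```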