Bernoulli distribution
The Bernoulli distribution is a discrete probability distribution that models the outcome of a single random experiment or trial with exactly two possible results: success, conventionally denoted as 1 and occurring with fixed probability p where 0 \leq p \leq 1, or failure, denoted as 0 and occurring with probability 1 - p.[1][2][3] It represents the simplest form of a binary random variable and serves as the foundational building block for more complex distributions, such as the binomial distribution, which arises as the sum of independent Bernoulli trials.[1][4] Named after the Swiss mathematician Jacob Bernoulli (1654–1705), who explored related concepts in probability in his posthumously published work Ars Conjectandi (1713), the distribution formalizes scenarios like coin flips, yes/no surveys, or any event with dichotomous outcomes under a constant success probability.[3][5] The probability mass function (PMF) of a Bernoulli random variable X is P(X = x) = p^x (1-p)^{1-x} for x \in \{0, 1\}, which simplifies to P(X = 1) = p and P(X = 0) = 1 - p.[1][6] Its mean (expected value) is E[X] = p, reflecting the long-run average success rate, while its variance is \operatorname{Var}(X) = p(1-p), measuring the spread around this mean and reaching its maximum at p = 0.5.[1][3] In statistical modeling and applications, the Bernoulli distribution underpins binary logistic regression for predicting probabilities of binary events, hypothesis testing for proportions, and simulations in fields like machine learning, genetics, and quality control, where outcomes are inherently categorical.[1][7] It also connects to the normal distribution in the limit of many trials via the central limit theorem, enabling approximations for larger sample analyses.[8]
Definition
Probability mass function
The Bernoulli distribution is a discrete probability distribution that models a random variable X taking only two possible values: 1 with probability p and 0 with probability q = 1 - p, where p \in [0, 1].[9][10] This setup represents the simplest case of a binary random experiment, such as a single trial in a sequence of independent events.[11] The probability mass function (PMF) of the Bernoulli distribution is given by P(X = k) = \begin{cases} p^k (1-p)^{1-k} & k = 0, 1 \\ 0 & \text{otherwise}. \end{cases} [10][11] This PMF can be interpreted as the probability of a success (value 1) or failure (value 0) in a binary trial, where the random variable X serves as an indicator for the occurrence of the success event.[9] For instance, in a fair coin toss, X = 1 for heads with p = 0.5 and X = 0 for tails, modeling equal likelihood outcomes.[10] The Bernoulli distribution corresponds to the binomial distribution in the special case of one trial (n=1).[11]
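The PMF formula can be checked numerically with a short Python sketch (an illustrative example, not drawn from the cited sources; the helper name bernoulli_pmf is chosen here for clarity):

```python
def bernoulli_pmf(k, p):
    """PMF of a Bernoulli(p) variable: p**k * (1-p)**(1-k) on the support {0, 1}."""
    if k not in (0, 1):
        return 0.0  # zero probability outside the support
    return p**k * (1 - p)**(1 - k)

# Fair coin toss (p = 0.5): heads (1) and tails (0) are equally likely.
print(bernoulli_pmf(1, 0.5))  # 0.5
print(bernoulli_pmf(0, 0.5))  # 0.5
print(bernoulli_pmf(1, 0.3))  # 0.3
```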
Cumulative distribution function
The cumulative distribution function (CDF) of a Bernoulli random variable X \sim \text{Bernoulli}(p), where 0 < p < 1 is the success probability, is defined as F(x) = P(X \leq x).[6] It takes the form F(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 - p & \text{if } 0 \leq x < 1, \\ 1 & \text{if } x \geq 1. \end{cases} [6] This piecewise definition reflects the discrete support of X on the values \{0, 1\}, accumulating the probability mass from the probability mass function up to x.[6] Because the Bernoulli distribution is discrete, the CDF is a step function, remaining constant between the support points and exhibiting jumps of size 1-p at x = 0 and p at x = 1.[12] This step-like structure distinguishes it from continuous distributions, whose CDFs are continuous and increasing.[12] Visually, the CDF is a staircase plot: it equals 0 for x < 0, jumps to 1-p at x = 0, stays flat at 1-p on the interval [0, 1), then jumps to 1 at x = 1 and remains at 1 thereafter.[13] Such staircase plots, typical of discrete distributions, aid in understanding how probability accumulates at the binary outcomes.[13] The CDF is essential for computing interval probabilities P(a \leq X \leq b) for any real numbers a \leq b, given by F(b) - F(a^-), where F(a^-) denotes the left limit at a to account for a possible jump at a.[14] This property enables efficient evaluation of cumulative probabilities without summing individual point masses, which is particularly useful in applications involving Bernoulli trials.[14]
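As a minimal illustration (not from the cited sources; the function name bernoulli_cdf and the small epsilon used to approximate the left limit are choices made here), the step CDF and the interval formula F(b) - F(a^-) can be sketched in Python:

```python
def bernoulli_cdf(x, p):
    """Step-function CDF of a Bernoulli(p) variable."""
    if x < 0:
        return 0.0    # no probability mass below 0
    if x < 1:
        return 1 - p  # jump of size 1-p at x = 0
    return 1.0        # jump of size p at x = 1

p = 0.3
a, b = 0, 1
eps = 1e-9  # approximates the left limit F(a^-)
print(bernoulli_cdf(b, p) - bernoulli_cdf(a - eps, p))  # 1.0 = P(0 <= X <= 1)
print(bernoulli_cdf(0.5, p))                            # 0.7 = P(X <= 0.5) = 1 - p
```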
Moments
Mean
The expected value, or mean, of a Bernoulli random variable X \sim \text{Bernoulli}(p) is E[X] = p.[15] This parameter p serves as the measure of central tendency for the distribution, where X takes the value 1 with probability p (success) and 0 with probability 1-p (failure).[16] The derivation follows from the definition of the expected value for a discrete random variable: E[X] = \sum_{k=0}^{1} k \, P(X = k) = 0 \cdot (1 - p) + 1 \cdot p = p. This summation directly uses the probability mass function of the Bernoulli distribution.[16] The mean p can be interpreted as the long-run proportion of successes observed in a sequence of repeated, independent Bernoulli trials.[17] For example, if p = 0.5, as in the case of a fair coin flip, then E[X] = 0.5, which aligns with the symmetric probability of heads or tails in repeated tosses.[16]
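The long-run-proportion interpretation can be illustrated with a small simulation, a sketch assuming only Python's standard library (the trial count n is an arbitrary choice made here):

```python
import random

# Simulate n independent Bernoulli(p) trials; the sample mean (proportion of
# successes) should be close to the theoretical mean E[X] = p.
p, n = 0.5, 100_000
trials = [1 if random.random() < p else 0 for _ in range(n)]
print(sum(trials) / n)  # approximately 0.5
```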
Variance
The variance of a Bernoulli random variable X \sim \text{Bernoulli}(p) quantifies the dispersion around its mean \mu = p, measuring the expected squared deviation from the mean. By definition, \operatorname{Var}(X) = E[(X - \mu)^2].[18] Since X takes values 0 or 1, this expands to \operatorname{Var}(X) = (0 - p)^2 \cdot (1 - p) + (1 - p)^2 \cdot p = p^2(1 - p) + p(1 - p)^2 = p(1 - p).[18] An equivalent derivation uses the second-moment formula \operatorname{Var}(X) = E[X^2] - (E[X])^2. For the Bernoulli distribution, E[X^2] = 0^2 \cdot (1 - p) + 1^2 \cdot p = p, so \operatorname{Var}(X) = p - p^2 = p(1 - p).[19] This can also be expressed as pq, where q = 1 - p.[19] The variance p(1 - p) reaches its maximum value of 0.25 when p = 0.5, indicating the greatest uncertainty in the binary outcome, and equals zero when p = 0 or p = 1, corresponding to deterministic cases with no dispersion.[18] This property highlights how the Bernoulli variance captures the inherent variability in probabilistic binary events, such as coin flips or success indicators.[20]
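A brief sketch (illustrative only; the helper name bernoulli_variance is chosen here) tabulates p(1-p) and shows the maximum of 0.25 at p = 0.5:

```python
def bernoulli_variance(p):
    """Variance of a Bernoulli(p) variable: p(1 - p)."""
    return p * (1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(p, bernoulli_variance(p))
# Output is 0 at the deterministic endpoints and rises to the maximum 0.25 at p = 0.5.
```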
Skewness
The skewness of a Bernoulli random variable X with success probability p (where 0 < p < 1) is the third standardized central moment, defined as \gamma_1 = \frac{\mathbb{E}[(X - \mu)^3]}{\sigma^3}, with mean \mu = p and variance \sigma^2 = p(1-p).[21] This quantity quantifies the asymmetry in the distribution's probability mass, which arises from the binary nature of outcomes (0 or 1), where deviations from p = 0.5 introduce imbalance between the success and failure probabilities.[10] The third central moment is \mathbb{E}[(X - p)^3] = p(1-p)(1-2p), obtained by evaluating the expectation over the two possible values: \mathbb{E}[(X - p)^3] = p(1-p)^3 + (1-p)(-p)^3 = p(1-p)^3 - (1-p)p^3 = p(1-p)(1-2p). Dividing by \sigma^3 = [p(1-p)]^{3/2} yields the skewness formula \gamma_1 = \frac{1-2p}{\sqrt{p(1-p)}}. Both the third moment and the skewness formula appear in standard references on statistical distributions.[21] The sign of \gamma_1 indicates the direction of asymmetry: positive for p < 0.5 (right-skewed, with longer tail toward higher values), negative for p > 0.5 (left-skewed, with longer tail toward lower values), and zero for p = 0.5 (symmetric).[10] For example, when p = 0.3, \gamma_1 \approx 0.873, reflecting moderate right skew due to the higher probability of the lower outcome (0). This asymmetry measure is particularly relevant in modeling binary events where p deviates from equality, such as in reliability testing or diagnostic outcomes.[21]
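The closed-form skewness can be evaluated directly; a minimal sketch (function name chosen here for illustration) reproduces the p = 0.3 value quoted above:

```python
import math

def bernoulli_skewness(p):
    """Skewness (1 - 2p) / sqrt(p(1 - p)) for 0 < p < 1."""
    return (1 - 2 * p) / math.sqrt(p * (1 - p))

print(round(bernoulli_skewness(0.3), 3))  # 0.873  (right-skewed)
print(bernoulli_skewness(0.5))            # 0.0    (symmetric)
print(round(bernoulli_skewness(0.7), 3))  # -0.873 (left-skewed)
```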
Kurtosis
The kurtosis of the Bernoulli distribution, which quantifies the peakedness and tail heaviness relative to the normal distribution, is given by the fourth standardized central moment \beta_2 = \frac{\mu_4}{\sigma^4}, where \mu_4 = E[(X - \mu)^4] is the fourth central moment and \sigma^2 = p(1-p) is the variance.[21] For a Bernoulli random variable X with success probability p (where 0 < p < 1), the fourth central moment is \mu_4 = p(1-p)[1 - 3p(1-p)].[6] Substituting these into the kurtosis formula yields \beta_2 = \frac{1 - 3p(1-p)}{p(1-p)}.[6] The excess kurtosis, defined as \kappa = \beta_2 - 3, simplifies to \kappa = \frac{1 - 6p(1-p)}{p(1-p)}.[21] The sign of the excess kurtosis depends on p: the distribution is platykurtic (negative excess kurtosis, with lighter tails than the normal) when 6p(1-p) > 1, that is, when p is sufficiently close to 0.5, and leptokurtic (positive excess kurtosis) when p lies near 0 or 1.[6] The excess kurtosis reaches its minimum value of -2 at p = 0.5, where the distribution is symmetric and most spread out relative to its variance, and approaches +\infty as p tends to 0 or 1, reflecting the increasing degeneracy at the boundaries.[21] To derive the fourth central moment, note that \mu = p, so (X - \mu)^4 = (1 - p)^4 with probability p and (-p)^4 = p^4 with probability 1-p. Thus, \mu_4 = p(1-p)^4 + (1-p)p^4 = p(1-p)[(1-p)^3 + p^3]. Expanding (1-p)^3 + p^3 = 1 - 3p + 3p^2 - p^3 + p^3 = 1 - 3p(1-p), which confirms \mu_4 = p(1-p)[1 - 3p(1-p)].[21] This derivation underscores the distribution's limited variability, with the excess kurtosis determined entirely by how far p lies from 0.5.[6]
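A short sketch (illustrative only; the helper name is chosen here) evaluates the excess kurtosis formula and shows the change of sign as p moves away from 0.5:

```python
def bernoulli_excess_kurtosis(p):
    """Excess kurtosis (1 - 6p(1-p)) / (p(1-p)) for 0 < p < 1."""
    q = 1 - p
    return (1 - 6 * p * q) / (p * q)

print(bernoulli_excess_kurtosis(0.5))             # -2.0, the minimum (platykurtic)
print(round(bernoulli_excess_kurtosis(0.3), 3))   # -1.238, still platykurtic
print(round(bernoulli_excess_kurtosis(0.05), 3))  # 15.053, leptokurtic near the boundary
```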
Advanced Properties
Higher moments and cumulants
The raw moments of a Bernoulli random variable X with success probability p are \mathbb{E}[X^k] = p for all integers k \geq 1, while \mathbb{E}[X^0] = 1. This follows from X^k = X almost surely for k \geq 1, since X takes values in \{0, 1\}.[21] The central moments are \mu_k = \mathbb{E}[(X - p)^k] = p (1 - p)^k + (1 - p) (-p)^k for k \geq 1. For even k, this simplifies to p (1 - p)^k + (1 - p) p^k, since (-p)^k = p^k. For odd k > 1, the two terms carry opposite signs, yielding expressions such as \mu_3 = p(1 - p)(1 - 2p). These moments provide a complete characterization beyond the mean and variance, emphasizing the distribution's binary nature.[22] Cumulants of the Bernoulli distribution are obtained from the cumulant generating function K(t) = \log(1 - p + p e^t), whose Taylor series coefficients satisfy K(t) = \sum_{n=1}^\infty \kappa_n \frac{t^n}{n!}. The first cumulant is \kappa_1 = p, the second is \kappa_2 = p(1 - p), the third is \kappa_3 = p(1 - p)(1 - 2p), the fourth is \kappa_4 = p(1 - p)[1 - 6p(1 - p)], the fifth is \kappa_5 = p(1 - p)(1 - 2p)[1 - 12p(1 - p)], and the sixth is \kappa_6 = p(1 - p)[1 - 30p(1 - p) + 120 p^2 (1 - p)^2]. Higher cumulants follow from differentiating K(t) or using relations like Faà di Bruno's formula to convert from moments; each \kappa_n with n \geq 2 equals p(1 - p) times a polynomial in p of degree n - 2.[23] A key advantage of cumulants is their additivity under independent summation: if X_1, \dots, X_n are i.i.d. Bernoulli(p), then the cumulants of their sum (a binomial random variable) are exactly n times the corresponding Bernoulli cumulants. This property facilitates approximations and analyses of sums in probability theory.[23]
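As a numerical check (a sketch, not drawn from the cited sources; the value of p and the finite-difference step h are arbitrary choices made here), the first two cumulants can be recovered from the cumulant generating function, and additivity gives the binomial cumulants as n times the Bernoulli ones:

```python
import math

p = 0.3
K = lambda t: math.log(1 - p + p * math.exp(t))  # cumulant generating function

h = 1e-5
kappa1 = (K(h) - K(-h)) / (2 * h)          # first derivative of K at 0
kappa2 = (K(h) - 2 * K(0) + K(-h)) / h**2  # second derivative of K at 0
print(kappa1)  # ~ 0.30 = p
print(kappa2)  # ~ 0.21 = p(1 - p)

# Additivity: a Binomial(n, p) variable is a sum of n i.i.d. Bernoulli(p) variables,
# so its cumulants are n times the Bernoulli cumulants (e.g. variance n*p*(1-p)).
n = 10
print(n * kappa2)  # ~ 2.1, the Binomial(10, 0.3) variance
```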
Generating functions
The probability generating function (PGF) of a Bernoulli random variable X with success probability p (where q = 1 - p) is defined as G_X(s) = \mathbb{E}[s^X] = q + p s, for |s| \leq 1.[24] This function encapsulates the probability mass function and facilitates the analysis of sums of independent random variables. The moment generating function (MGF) for the same distribution is M_X(t) = \mathbb{E}[e^{tX}] = q + p e^t, defined for t \in \mathbb{R}.[25] Similarly, the characteristic function, which is the Fourier transform of the distribution, is \phi_X(t) = \mathbb{E}[e^{i t X}] = q + p e^{i t}, for t \in \mathbb{R}.[6] These generating functions provide powerful tools for deriving properties of the Bernoulli distribution. Specifically, successive derivatives of the PGF evaluated at s = 1 yield the factorial moments of X, while derivatives of the MGF at t = 0 yield its raw moments.[26] Additionally, for a sum of independent Bernoulli random variables, the PGF of the sum is the product of the individual PGFs, leading directly to the PGF of a binomial distribution.[27]
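The product rule for PGFs can be verified numerically; a brief sketch (the values of n, p, and s are arbitrary choices made here) compares (q + ps)^n with the Binomial(n, p) PGF evaluated term by term:

```python
import math

p, q = 0.3, 0.7
G = lambda s: q + p * s  # Bernoulli PGF

n, s = 4, 0.5
product_of_pgfs = G(s) ** n
# Binomial(n, p) PGF evaluated directly: sum_k C(n, k) p^k q^(n-k) s^k
binomial_pgf = sum(math.comb(n, k) * p**k * q**(n - k) * s**k for k in range(n + 1))
print(product_of_pgfs)  # ~ 0.5220
print(binomial_pgf)     # ~ 0.5220, the same value as expected
```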
Exponential family representation
The Bernoulli distribution belongs to the exponential family of distributions, a class that encompasses many common parametric families and facilitates unified statistical inference procedures. In its general form, a distribution in the exponential family can be expressed as
p(x \mid \theta) = h(x) \exp\left[ \eta(\theta) T(x) - A(\theta) \right],
where h(x) is the base measure, \eta(\theta) is the natural parameter, T(x) is the sufficient statistic, and A(\theta) is the log-partition function that normalizes the distribution.[28][29] For the Bernoulli distribution with success probability p, the probability mass function p(x \mid p) = p^x (1-p)^{1-x} for x \in \{0, 1\} can be rewritten in canonical exponential family form by taking the logarithm:
\log p(x \mid p) = x \log \frac{p}{1-p} + \log(1-p).
This yields h(x) = 1, the natural parameter \eta = \log \frac{p}{1-p} (also denoted as the logit of p), the sufficient statistic T(x) = x, and the log-partition function A(\eta) = \log(1 + e^\eta), since p = \frac{e^\eta}{1 + e^\eta} and 1-p = \frac{1}{1 + e^\eta}.[28][29] The natural parameter \eta thus serves as a reparameterization of p, mapping the interval (0,1) to (-\infty, \infty), which proves useful in optimization and modeling contexts. Membership in the exponential family provides several inferential advantages for the Bernoulli distribution. The log-partition function A(\eta) acts as a cumulant generating function, allowing moments to be obtained via differentiation: the mean \mu = E[X] = \frac{\partial A}{\partial \eta} = \frac{e^\eta}{1 + e^\eta} = p, and the variance \mathrm{Var}(X) = \frac{\partial^2 A}{\partial \eta^2} = \mu(1 - \mu), expressing variability directly as a function of the mean without additional parameters.[28][29] This structure unifies the Bernoulli with other exponential family distributions, enabling shared techniques for maximum likelihood estimation and Bayesian inference across models. The exponential family representation also underpins the Bernoulli distribution's role in generalized linear models (GLMs), where it serves as the response distribution for binary outcomes with a logit link function connecting the linear predictor to the natural parameter \eta.[28][29] This connection facilitates extensions to logistic regression and broader GLM frameworks for predictive modeling.
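As a numerical illustration of the cumulant-generating role of A(\eta) (a sketch with an arbitrary p and finite-difference step, not drawn from the cited sources), differentiating the log-partition function recovers the mean and variance:

```python
import math

p = 0.3
eta = math.log(p / (1 - p))              # natural parameter: the logit of p
A = lambda e: math.log(1 + math.exp(e))  # log-partition function

h = 1e-5
mean = (A(eta + h) - A(eta - h)) / (2 * h)                # dA/d(eta)     = p
variance = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2  # d^2A/d(eta)^2 = p(1 - p)
print(mean)      # ~ 0.30
print(variance)  # ~ 0.21
```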