Bernoulli distribution
The Bernoulli distribution is a discrete probability distribution that models the outcome of a single random experiment or trial with exactly two possible results: success, conventionally denoted as 1 and occurring with fixed probability p where 0 \leq p \leq 1, or failure, denoted as 0 and occurring with probability 1 - p.[1][2][3] It represents the simplest form of a binary random variable and serves as the foundational building block for more complex distributions, such as the binomial distribution, which arises as the sum of independent Bernoulli trials.[1][4] Named after the Swiss mathematician Jacob Bernoulli (1654–1705), who explored related concepts in probability in his posthumously published work Ars Conjectandi (1713), the distribution formalizes scenarios like coin flips, yes/no surveys, or any event with dichotomous outcomes under a constant success probability.[3][5] The probability mass function (PMF) of a Bernoulli random variable X is P(X = x) = p^x (1-p)^{1-x} for x \in \{0, 1\}, which simplifies to P(X = 1) = p and P(X = 0) = 1 - p.[1][6] Its mean (expected value) is E[X] = p, reflecting the long-run average success rate, while its variance is \operatorname{Var}(X) = p(1-p), measuring the spread around this mean and reaching its maximum at p = 0.5.[1][3] In statistical modeling and applications, the Bernoulli distribution underpins binary logistic regression for predicting probabilities of binary events, hypothesis testing for proportions, and simulations in fields like machine learning, genetics, and quality control, where outcomes are inherently categorical.[1][7] It also connects to the normal distribution in the limit of many trials via the central limit theorem, enabling approximations for larger sample analyses.[8]
Definition
Probability mass function
The Bernoulli distribution is a discrete probability distribution that models a random variable X taking only two possible values: 1 with probability p and 0 with probability q = 1 - p, where p \in [0, 1].[9][10] This setup represents the simplest case of a binary random experiment, such as a single trial in a sequence of independent events.[11] The probability mass function (PMF) of the Bernoulli distribution is given by P(X = k) = \begin{cases} p^k (1-p)^{1-k} & k = 0, 1 \\ 0 & \text{otherwise}. \end{cases} [10][11] This PMF can be interpreted as the probability of a success (value 1) or failure (value 0) in a binary trial, where the random variable X serves as an indicator for the occurrence of the success event.[9] For instance, in a fair coin toss, X = 1 for heads with p = 0.5 and X = 0 for tails, modeling equal likelihood outcomes.[10] The Bernoulli distribution corresponds to the binomial distribution in the special case of one trial (n=1).[11]
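The PMF formula can be checked numerically with a short Python sketch (an illustrative example, not drawn from the cited sources; the helper name bernoulli_pmf is chosen here for clarity):

```python
def bernoulli_pmf(k, p):
    """PMF of a Bernoulli(p) variable: p**k * (1-p)**(1-k) on the support {0, 1}."""
    if k not in (0, 1):
        return 0.0  # zero probability outside the support
    return p**k * (1 - p)**(1 - k)

# Fair coin toss (p = 0.5): heads (1) and tails (0) are equally likely.
print(bernoulli_pmf(1, 0.5))  # 0.5
print(bernoulli_pmf(0, 0.5))  # 0.5
print(bernoulli_pmf(1, 0.3))  # 0.3
```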
Cumulative distribution function
The cumulative distribution function (CDF) of a Bernoulli random variable X \sim \text{Bernoulli}(p), where 0 < p < 1 is the success probability, is defined as F(x) = P(X \leq x).[6] It takes the form F(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 - p & \text{if } 0 \leq x < 1, \\ 1 & \text{if } x \geq 1. \end{cases} [6] This piecewise definition reflects the discrete support of X on the values \{0, 1\}, accumulating the probability mass from the probability mass function up to x.[6] Because the Bernoulli distribution is discrete, the CDF is a step function, remaining constant between the support points and exhibiting jumps of size 1-p at x = 0 and p at x = 1.[12] This step-like structure distinguishes it from continuous distributions, whose CDFs are continuous and increasing.[12] Visually, the CDF is a staircase plot: it equals 0 for x < 0, jumps to 1-p at x = 0, stays flat at 1-p on the interval [0, 1), then jumps to 1 at x = 1 and remains at 1 thereafter.[13] Such staircase plots, typical of discrete distributions, aid in understanding how probability accumulates at the binary outcomes.[13] The CDF is essential for computing interval probabilities P(a \leq X \leq b) for any real numbers a \leq b, given by F(b) - F(a^-), where F(a^-) denotes the left limit at a to account for a possible jump at a.[14] This property enables efficient evaluation of cumulative probabilities without summing individual point masses, which is particularly useful in applications involving Bernoulli trials.[14]
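As a minimal illustration (not from the cited sources; the function name bernoulli_cdf and the small epsilon used to approximate the left limit are choices made here), the step CDF and the interval formula F(b) - F(a^-) can be sketched in Python:

```python
def bernoulli_cdf(x, p):
    """Step-function CDF of a Bernoulli(p) variable."""
    if x < 0:
        return 0.0    # no probability mass below 0
    if x < 1:
        return 1 - p  # jump of size 1-p at x = 0
    return 1.0        # jump of size p at x = 1

p = 0.3
a, b = 0, 1
eps = 1e-9  # approximates the left limit F(a^-)
print(bernoulli_cdf(b, p) - bernoulli_cdf(a - eps, p))  # 1.0 = P(0 <= X <= 1)
print(bernoulli_cdf(0.5, p))                            # 0.7 = P(X <= 0.5) = 1 - p
```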
Moments
Mean
The expected value, or mean, of a Bernoulli random variable X \sim \text{Bernoulli}(p) is E[X] = p.[15] This parameter p serves as the measure of central tendency for the distribution, where X takes the value 1 with probability p (success) and 0 with probability 1-p (failure).[16] The derivation follows from the definition of the expected value for a discrete random variable: E[X] = \sum_{k=0}^{1} k \, P(X = k) = 0 \cdot (1 - p) + 1 \cdot p = p. This summation directly uses the probability mass function of the Bernoulli distribution.[16] The mean p can be interpreted as the long-run proportion of successes observed in a sequence of repeated, independent Bernoulli trials.[17] For example, if p = 0.5, as in the case of a fair coin flip, then E[X] = 0.5, which aligns with the symmetric probability of heads or tails in repeated tosses.[16]
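The long-run-proportion interpretation can be illustrated with a small simulation, a sketch assuming only Python's standard library (the trial count n is an arbitrary choice made here):

```python
import random

# Simulate n independent Bernoulli(p) trials; the sample mean (proportion of
# successes) should be close to the theoretical mean E[X] = p.
p, n = 0.5, 100_000
trials = [1 if random.random() < p else 0 for _ in range(n)]
print(sum(trials) / n)  # approximately 0.5
```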
Variance
The variance of a Bernoulli random variable X \sim \text{Bernoulli}(p) quantifies the dispersion around its mean \mu = p, measuring the expected squared deviation from the mean. By definition, \operatorname{Var}(X) = E[(X - \mu)^2].[18] Since X takes values 0 or 1, this expands to \operatorname{Var}(X) = (0 - p)^2 \cdot (1 - p) + (1 - p)^2 \cdot p = p^2(1 - p) + p(1 - p)^2 = p(1 - p).[18] An equivalent derivation uses the second-moment formula \operatorname{Var}(X) = E[X^2] - (E[X])^2. For the Bernoulli distribution, E[X^2] = 0^2 \cdot (1 - p) + 1^2 \cdot p = p, so \operatorname{Var}(X) = p - p^2 = p(1 - p).[19] This can also be expressed as pq, where q = 1 - p.[19] The variance p(1 - p) reaches its maximum value of 0.25 when p = 0.5, indicating the greatest uncertainty in the binary outcome, and equals zero when p = 0 or p = 1, corresponding to deterministic cases with no dispersion.[18] This property highlights how the Bernoulli variance captures the inherent variability in probabilistic binary events, such as coin flips or success indicators.[20]
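A brief sketch (illustrative only; the helper name bernoulli_variance is chosen here) tabulates p(1-p) and shows the maximum of 0.25 at p = 0.5:

```python
def bernoulli_variance(p):
    """Variance of a Bernoulli(p) variable: p(1 - p)."""
    return p * (1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(p, bernoulli_variance(p))
# Output is 0 at the deterministic endpoints and rises to the maximum 0.25 at p = 0.5.
```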
Skewness
The skewness of a Bernoulli random variable X with success probability p (where 0 < p < 1) is the third standardized central moment, defined as \gamma_1 = \frac{\mathbb{E}[(X - \mu)^3]}{\sigma^3}, with mean \mu = p and variance \sigma^2 = p(1-p).[21] This quantity quantifies the asymmetry in the distribution's probability mass, which arises from the binary nature of outcomes (0 or 1), where deviations from p = 0.5 introduce imbalance between the success and failure probabilities.[10] The third central moment is \mathbb{E}[(X - p)^3] = p(1-p)(1-2p), obtained by evaluating the expectation over the two possible values: \mathbb{E}[(X - p)^3] = p(1-p)^3 + (1-p)(-p)^3 = p(1-p)^3 - (1-p)p^3 = p(1-p)(1-2p). Dividing by \sigma^3 = [p(1-p)]^{3/2} yields the skewness formula \gamma_1 = \frac{1-2p}{\sqrt{p(1-p)}}. Both the third moment and the skewness formula appear in standard references on statistical distributions.[21] The sign of \gamma_1 indicates the direction of asymmetry: positive for p < 0.5 (right-skewed, with longer tail toward higher values), negative for p > 0.5 (left-skewed, with longer tail toward lower values), and zero for p = 0.5 (symmetric).[10] For example, when p = 0.3, \gamma_1 \approx 0.873, reflecting moderate right skew due to the higher probability of the lower outcome (0). This asymmetry measure is particularly relevant in modeling binary events where p deviates from equality, such as in reliability testing or diagnostic outcomes.[21]
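The closed-form skewness can be evaluated directly; a minimal sketch (function name chosen here for illustration) reproduces the p = 0.3 value quoted above:

```python
import math

def bernoulli_skewness(p):
    """Skewness (1 - 2p) / sqrt(p(1 - p)) for 0 < p < 1."""
    return (1 - 2 * p) / math.sqrt(p * (1 - p))

print(round(bernoulli_skewness(0.3), 3))  # 0.873  (right-skewed)
print(bernoulli_skewness(0.5))            # 0.0    (symmetric)
print(round(bernoulli_skewness(0.7), 3))  # -0.873 (left-skewed)
```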
Kurtosis
The kurtosis of the Bernoulli distribution, which quantifies the peakedness and tail heaviness relative to the normal distribution, is given by the fourth standardized central moment \beta_2 = \frac{\mu_4}{\sigma^4}, where \mu_4 = E[(X - \mu)^4] is the fourth central moment and \sigma^2 = p(1-p) is the variance.[21] For a Bernoulli random variable X with success probability p (where 0 < p < 1), the fourth central moment is \mu_4 = p(1-p)[1 - 3p(1-p)].[6] Substituting these into the kurtosis formula yields \beta_2 = \frac{1 - 3p(1-p)}{p(1-p)}.[6] The excess kurtosis, defined as \kappa = \beta_2 - 3, simplifies to \kappa = \frac{1 - 6p(1-p)}{p(1-p)}.[21] The sign of the excess kurtosis depends on p: the distribution is platykurtic (negative excess kurtosis, with lighter tails than the normal) when 6p(1-p) > 1, that is, when p is sufficiently close to 0.5, and leptokurtic (positive excess kurtosis) when p lies near 0 or 1.[6] The excess kurtosis reaches its minimum value of -2 at p = 0.5, where the distribution is symmetric and most spread out relative to its variance, and approaches +\infty as p tends to 0 or 1, reflecting the increasing degeneracy at the boundaries.[21] To derive the fourth central moment, note that \mu = p, so (X - \mu)^4 = (1 - p)^4 with probability p and (-p)^4 = p^4 with probability 1-p. Thus, \mu_4 = p(1-p)^4 + (1-p)p^4 = p(1-p)[(1-p)^3 + p^3]. Expanding (1-p)^3 + p^3 = 1 - 3p + 3p^2 - p^3 + p^3 = 1 - 3p(1-p), which confirms \mu_4 = p(1-p)[1 - 3p(1-p)].[21] This derivation underscores the distribution's limited variability, with the excess kurtosis determined entirely by how far p lies from 0.5.[6]
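A short sketch (illustrative only; the helper name is chosen here) evaluates the excess kurtosis formula and shows the change of sign as p moves away from 0.5:

```python
def bernoulli_excess_kurtosis(p):
    """Excess kurtosis (1 - 6p(1-p)) / (p(1-p)) for 0 < p < 1."""
    q = 1 - p
    return (1 - 6 * p * q) / (p * q)

print(bernoulli_excess_kurtosis(0.5))             # -2.0, the minimum (platykurtic)
print(round(bernoulli_excess_kurtosis(0.3), 3))   # -1.238, still platykurtic
print(round(bernoulli_excess_kurtosis(0.05), 3))  # 15.053, leptokurtic near the boundary
```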
Advanced Properties
Higher moments and cumulants
The raw moments of a Bernoulli random variable X with success probability p are \mathbb{E}[X^k] = p for all integers k \geq 1, while \mathbb{E}[X^0] = 1. This follows from X^k = X almost surely for k \geq 1, since X takes values in \{0, 1\}.[21] The central moments are \mu_k = \mathbb{E}[(X - p)^k] = p (1 - p)^k + (1 - p) (-p)^k for k \geq 1. For even k, this simplifies to p (1 - p)^k + (1 - p) p^k, since (-p)^k = p^k. For odd k > 1, the two terms carry opposite signs, yielding expressions such as \mu_3 = p(1 - p)(1 - 2p). These moments provide a complete characterization beyond the mean and variance, emphasizing the distribution's binary nature.[22] Cumulants of the Bernoulli distribution are obtained from the cumulant generating function K(t) = \log(1 - p + p e^t), whose Taylor series coefficients satisfy K(t) = \sum_{n=1}^\infty \kappa_n \frac{t^n}{n!}. The first cumulant is \kappa_1 = p, the second is \kappa_2 = p(1 - p), the third is \kappa_3 = p(1 - p)(1 - 2p), the fourth is \kappa_4 = p(1 - p)[1 - 6p(1 - p)], the fifth is \kappa_5 = p(1 - p)(1 - 2p)[1 - 12p(1 - p)], and the sixth is \kappa_6 = p(1 - p)[1 - 30p(1 - p) + 120 p^2 (1 - p)^2]. Higher cumulants follow from differentiating K(t) or using relations like Faà di Bruno's formula to convert from moments; each \kappa_n with n \geq 2 equals p(1 - p) times a polynomial in p of degree n - 2.[23] A key advantage of cumulants is their additivity under independent summation: if X_1, \dots, X_n are i.i.d. Bernoulli(p), then the cumulants of their sum (a binomial random variable) are exactly n times the corresponding Bernoulli cumulants. This property facilitates approximations and analyses of sums in probability theory.[23]
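As a numerical check (a sketch, not drawn from the cited sources; the value of p and the finite-difference step h are arbitrary choices made here), the first two cumulants can be recovered from the cumulant generating function, and additivity gives the binomial cumulants as n times the Bernoulli ones:

```python
import math

p = 0.3
K = lambda t: math.log(1 - p + p * math.exp(t))  # cumulant generating function

h = 1e-5
kappa1 = (K(h) - K(-h)) / (2 * h)          # first derivative of K at 0
kappa2 = (K(h) - 2 * K(0) + K(-h)) / h**2  # second derivative of K at 0
print(kappa1)  # ~ 0.30 = p
print(kappa2)  # ~ 0.21 = p(1 - p)

# Additivity: a Binomial(n, p) variable is a sum of n i.i.d. Bernoulli(p) variables,
# so its cumulants are n times the Bernoulli cumulants (e.g. variance n*p*(1-p)).
n = 10
print(n * kappa2)  # ~ 2.1, the Binomial(10, 0.3) variance
```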
Generating functions
The probability generating function (PGF) of a Bernoulli random variable X with success probability p (where q = 1 - p) is defined as G_X(s) = \mathbb{E}[s^X] = q + p s, for |s| \leq 1.[24] This function encapsulates the probability mass function and facilitates the analysis of sums of independent random variables. The moment generating function (MGF) for the same distribution is M_X(t) = \mathbb{E}[e^{tX}] = q + p e^t, defined for t \in \mathbb{R}.[25] Similarly, the characteristic function, which is the Fourier transform of the distribution, is \phi_X(t) = \mathbb{E}[e^{i t X}] = q + p e^{i t}, for t \in \mathbb{R}.[6] These generating functions provide powerful tools for deriving properties of the Bernoulli distribution. Specifically, successive derivatives of the PGF evaluated at s = 1 yield the factorial moments of X, while derivatives of the MGF at t = 0 yield its raw moments.[26] Additionally, for a sum of independent Bernoulli random variables, the PGF of the sum is the product of the individual PGFs, leading directly to the PGF of a binomial distribution.[27]
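The product rule for PGFs can be verified numerically; a brief sketch (the values of n, p, and s are arbitrary choices made here) compares (q + ps)^n with the Binomial(n, p) PGF evaluated term by term:

```python
import math

p, q = 0.3, 0.7
G = lambda s: q + p * s  # Bernoulli PGF

n, s = 4, 0.5
product_of_pgfs = G(s) ** n
# Binomial(n, p) PGF evaluated directly: sum_k C(n, k) p^k q^(n-k) s^k
binomial_pgf = sum(math.comb(n, k) * p**k * q**(n - k) * s**k for k in range(n + 1))
print(product_of_pgfs)  # ~ 0.5220
print(binomial_pgf)     # ~ 0.5220, the same value as expected
```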
Exponential family representation
The Bernoulli distribution belongs to the exponential family of distributions, a class that encompasses many common parametric families and facilitates unified statistical inference procedures. In its general form, a distribution in the exponential family can be expressed as
p(x \mid \theta) = h(x) \exp\left[ \eta(\theta) T(x) - A(\theta) \right],
where h(x) is the base measure, \eta(\theta) is the natural parameter, T(x) is the sufficient statistic, and A(\theta) is the log-partition function that normalizes the distribution.[28][29] For the Bernoulli distribution with success probability p, the probability mass function p(x \mid p) = p^x (1-p)^{1-x} for x \in \{0, 1\} can be rewritten in canonical exponential family form by taking the logarithm:
\log p(x \mid p) = x \log \frac{p}{1-p} + \log(1-p).
This yields h(x) = 1, the natural parameter \eta = \log \frac{p}{1-p} (also denoted as the logit of p), the sufficient statistic T(x) = x, and the log-partition function A(\eta) = \log(1 + e^\eta), since p = \frac{e^\eta}{1 + e^\eta} and 1-p = \frac{1}{1 + e^\eta}.[28][29] The natural parameter \eta thus serves as a reparameterization of p, mapping the interval (0,1) to (-\infty, \infty), which proves useful in optimization and modeling contexts. Membership in the exponential family provides several inferential advantages for the Bernoulli distribution. The log-partition function A(\eta) acts as a cumulant generating function, allowing moments to be obtained via differentiation: the mean \mu = E[X] = \frac{\partial A}{\partial \eta} = \frac{e^\eta}{1 + e^\eta} = p, and the variance \mathrm{Var}(X) = \frac{\partial^2 A}{\partial \eta^2} = \mu(1 - \mu), expressing variability directly as a function of the mean without additional parameters.[28][29] This structure unifies the Bernoulli with other exponential family distributions, enabling shared techniques for maximum likelihood estimation and Bayesian inference across models. The exponential family representation also underpins the Bernoulli distribution's role in generalized linear models (GLMs), where it serves as the response distribution for binary outcomes with a logit link function connecting the linear predictor to the natural parameter \eta.[28][29] This connection facilitates extensions to logistic regression and broader GLM frameworks for predictive modeling.
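As a numerical illustration of the cumulant-generating role of A(\eta) (a sketch with an arbitrary p and finite-difference step, not drawn from the cited sources), differentiating the log-partition function recovers the mean and variance:

```python
import math

p = 0.3
eta = math.log(p / (1 - p))              # natural parameter: the logit of p
A = lambda e: math.log(1 + math.exp(e))  # log-partition function

h = 1e-5
mean = (A(eta + h) - A(eta - h)) / (2 * h)                # dA/d(eta)     = p
variance = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2  # d^2A/d(eta)^2 = p(1 - p)
print(mean)      # ~ 0.30
print(variance)  # ~ 0.21
```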