Bernoulli distribution

The Bernoulli distribution is a discrete probability distribution that models the outcome of a single random experiment or trial with exactly two possible results: success, conventionally denoted as 1 and occurring with fixed probability p where 0 \leq p \leq 1, or failure, denoted as 0 and occurring with probability 1 - p. It represents the simplest form of a discrete random variable and serves as the foundational building block for more complex distributions, such as the binomial distribution, which arises from the sum of independent trials. Named after the Swiss mathematician Jacob Bernoulli (1654–1705), who explored related concepts in probability in his posthumously published work Ars Conjectandi (1713), the distribution formalizes scenarios like coin flips, yes/no surveys, or any event with dichotomous outcomes under constant success probability. The probability mass function (PMF) for a Bernoulli random variable X is defined as P(X = x) = p^x (1-p)^{1-x} for x \in \{0, 1\}, which simplifies to P(X = 1) = p and P(X = 0) = 1 - p. Its mean (expected value) is E[X] = p, reflecting the long-run average success rate, while the variance is \operatorname{Var}(X) = p(1-p), measuring the spread around this mean and achieving its maximum value at p = 0.5. In statistical modeling and applications, the Bernoulli distribution underpins binary logistic regression for predicting probabilities of binary events, hypothesis testing for proportions, and simulations in fields where outcomes are inherently categorical. It also connects to the normal distribution in the limit of many trials via the central limit theorem, enabling approximations for larger sample analyses.

Definition

Probability mass function

The Bernoulli distribution is a discrete probability distribution that models a random variable X taking only two possible values: 1 with probability p and 0 with probability q = 1 - p, where p \in [0, 1]. This setup represents the simplest case of a Bernoulli trial, such as a single trial in a sequence of independent events. The probability mass function (PMF) of the Bernoulli distribution is given by P(X = k) = \begin{cases} p^k (1-p)^{1-k} & k = 0, 1 \\ 0 & \text{otherwise}. \end{cases} This PMF can be interpreted as the probability of a success (value 1) or failure (value 0) in a binary trial, where the random variable X serves as an indicator for the occurrence of the success event. For instance, in a fair coin toss, X = 1 for heads with p = 0.5 and X = 0 for tails, modeling equal likelihood outcomes. The Bernoulli distribution corresponds to the binomial distribution in the special case of one trial (n=1).
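
As a concrete illustration, the PMF can be evaluated directly and checked against simulated draws. This is a minimal sketch in Python with NumPy (not prescribed by the text); the helper name bernoulli_pmf is purely illustrative.

```python
import numpy as np

def bernoulli_pmf(k, p):
    """P(X = k) = p^k (1-p)^(1-k) for k in {0, 1}, and 0 otherwise."""
    return p**k * (1 - p)**(1 - k) if k in (0, 1) else 0.0

p = 0.5                                       # fair coin toss
rng = np.random.default_rng(0)
draws = rng.binomial(n=1, p=p, size=10_000)   # Bernoulli(p) is Binomial(1, p)

print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))  # 0.5 0.5
print(draws.mean())                              # empirical frequency of 1s, close to p
```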

Cumulative distribution function

The cumulative distribution function (CDF) of a random variable X \sim \text{Bernoulli}(p), where 0 < p < 1 is the success probability, is defined as F(x) = P(X \leq x). It takes the form F(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 - p & \text{if } 0 \leq x < 1, \\ 1 & \text{if } x \geq 1. \end{cases} This piecewise definition reflects the discrete support of X on the values \{0, 1\}, accumulating the probability mass from the probability mass function up to x. Due to the discrete nature of the Bernoulli distribution, the CDF is a step function, remaining constant between the support points and exhibiting jumps of size 1-p at x=0 and p at x=1. This step-like structure distinguishes it from continuous distributions, where the CDF would be smooth and increasing. Visually, the CDF appears as a step plot that is 0 for x < 0, rises to 1-p at x = 0, holds flat until x = 1, then jumps to 1 and remains there. Such plots, often rendered as staircases for discrete distributions, aid in understanding the probability accumulation at the binary outcomes. The CDF is essential for computing interval probabilities P(a \leq X \leq b) for any real numbers a \leq b, given by F(b) - F(a^-), where F(a^-) denotes the left limit at a to account for the jump at a if applicable. This property enables efficient evaluation of cumulative probabilities without summing individual point masses.
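
A short sketch of the step CDF and the interval-probability rule F(b) - F(a^-) follows (illustrative Python; the left limit is approximated numerically with a small eps, which is an assumption of this sketch).

```python
def bernoulli_cdf(x, p):
    """F(x) = P(X <= x) for X ~ Bernoulli(p): 0 below 0, 1-p on [0, 1), 1 from 1 onward."""
    if x < 0:
        return 0.0
    return 1 - p if x < 1 else 1.0

def interval_prob(a, b, p, eps=1e-9):
    """P(a <= X <= b) = F(b) - F(a^-); the left limit is approximated by F(a - eps)."""
    return bernoulli_cdf(b, p) - bernoulli_cdf(a - eps, p)

p = 0.3
print(bernoulli_cdf(0.5, p))    # 0.7, the flat stretch between the two jumps
print(interval_prob(0, 1, p))   # 1.0, all of the probability mass
print(interval_prob(1, 1, p))   # 0.3, the jump of size p at x = 1
```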

Moments

Mean

The expected value, or mean, of a Bernoulli random variable X \sim \text{Bernoulli}(p) is E[X] = p. This parameter p serves as the measure of central tendency for the distribution, where X takes the value 1 with probability p (success) and 0 with probability 1-p (failure). The derivation follows from the definition of the expected value for a discrete random variable: E[X] = \sum_{k=0}^{1} k \, P(X = k) = 0 \cdot (1 - p) + 1 \cdot p = p. This summation directly uses the probability mass function of the Bernoulli distribution. The mean p can be interpreted as the long-run proportion of successes observed in a sequence of repeated, independent Bernoulli trials. For example, if p = 0.5, as in the case of a fair coin flip, then E[X] = 0.5, which aligns with the symmetric probability of heads or tails in repeated tosses.
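
A quick simulation check of E[X] = p as a long-run proportion (an illustrative sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
for p in (0.1, 0.5, 0.9):
    x = rng.binomial(1, p, size=100_000)   # i.i.d. Bernoulli(p) draws
    print(p, x.mean())                     # the sample mean settles near p
```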

Variance

The variance of a Bernoulli random variable X \sim \text{Bernoulli}(p) quantifies the dispersion around its mean \mu = p, measuring the expected squared deviation from the mean. By definition, \operatorname{Var}(X) = E[(X - \mu)^2]. Since X takes values 0 or 1, this expands to \operatorname{Var}(X) = (0 - p)^2 \cdot (1 - p) + (1 - p)^2 \cdot p = p^2(1 - p) + p(1 - p)^2 = p(1 - p). An equivalent derivation uses the second-moment formula \operatorname{Var}(X) = E[X^2] - (E[X])^2. For the Bernoulli distribution, E[X^2] = 0^2 \cdot (1 - p) + 1^2 \cdot p = p, so \operatorname{Var}(X) = p - p^2 = p(1 - p). This can also be expressed as pq where q = 1 - p. The variance p(1 - p) reaches its maximum value of 0.25 when p = 0.5, indicating the greatest uncertainty in the binary outcome, and equals zero when p = 0 or p = 1, corresponding to deterministic cases with no dispersion. This property highlights how the Bernoulli variance captures the inherent variability in probabilistic binary events, such as coin flips or success indicators.
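
The formula p(1 - p) and its maximum at p = 0.5 can be checked numerically; this is a small sketch assuming NumPy, not part of the original text.

```python
import numpy as np

p = np.linspace(0.01, 0.99, 99)
var = p * (1 - p)                        # Var(X) = p(1 - p) = pq
print(p[np.argmax(var)], var.max())      # 0.5 0.25: maximum uncertainty at p = 0.5
print(0.0 * (1 - 0.0), 1.0 * (1 - 1.0))  # 0.0 0.0: degenerate cases p = 0 and p = 1
```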

Skewness

The skewness of a Bernoulli random variable X with success probability p (where 0 < p < 1) is the third standardized central moment, defined as \gamma_1 = \frac{\mathbb{E}[(X - \mu)^3]}{\sigma^3}, with mean \mu = p and variance \sigma^2 = p(1-p). This quantity measures the asymmetry in the distribution's probability mass, which arises from the binary nature of outcomes (0 or 1), where deviations from p = 0.5 introduce imbalance between the success and failure probabilities. The third central moment is \mathbb{E}[(X - p)^3] = p(1-p)(1-2p), obtained by evaluating the expectation over the two possible values: \mathbb{E}[(X - p)^3] = p(1-p)^3 + (1-p)(-p)^3 = p(1-p)^3 - (1-p)p^3 = p(1-p)(1-2p). Dividing by \sigma^3 = [p(1-p)]^{3/2} yields the skewness formula \gamma_1 = \frac{1-2p}{\sqrt{p(1-p)}}. Both the third moment and the skewness formula appear in standard references on statistical distributions. The sign of \gamma_1 indicates the direction of asymmetry: positive for p < 0.5 (right-skewed, with the longer tail toward higher values), negative for p > 0.5 (left-skewed, with the longer tail toward lower values), and zero for p = 0.5 (symmetric). For example, when p = 0.3, \gamma_1 \approx 0.873, reflecting moderate right skewness due to the higher probability of the lower outcome (0). This measure is particularly relevant in modeling events where p deviates from 0.5, such as in reliability testing or diagnostic outcomes.
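
The closed-form skewness can be checked against SciPy's implementation, as in the sketch below (assuming scipy.stats is available; the helper name is illustrative).

```python
import numpy as np
from scipy import stats

def bernoulli_skewness(p):
    """gamma_1 = (1 - 2p) / sqrt(p (1 - p))."""
    return (1 - 2*p) / np.sqrt(p * (1 - p))

for p in (0.3, 0.5, 0.7):
    closed_form = bernoulli_skewness(p)
    scipy_value = stats.bernoulli.stats(p, moments='s')   # SciPy's skewness
    print(p, closed_form, float(scipy_value))
# p = 0.3 gives about +0.873 (right-skewed), p = 0.5 gives 0, p = 0.7 gives about -0.873
```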

Kurtosis

The kurtosis of the Bernoulli distribution, which quantifies the peakedness and tail heaviness relative to the normal distribution, is given by the fourth standardized moment \beta_2 = \frac{\mu_4}{\sigma^4}, where \mu_4 = E[(X - \mu)^4] is the fourth central moment and \sigma^2 = p(1-p) is the variance. For a Bernoulli random variable X with success probability p (where 0 < p < 1), the fourth central moment is \mu_4 = p(1-p)[1 - 3p(1-p)]. Substituting these into the kurtosis formula yields \beta_2 = \frac{1 - 3p(1-p)}{p(1-p)}. The excess kurtosis, defined as \kappa = \beta_2 - 3, simplifies to \kappa = \frac{1 - 6p(1-p)}{p(1-p)}. Compared to the normal distribution, which has excess kurtosis of zero, the Bernoulli distribution is platykurtic (negative excess kurtosis) when p(1-p) > 1/6, that is, for p not too close to the boundaries, reflecting its concentration at only two support points (0 and 1). The excess kurtosis reaches its minimum value of -2 at p = 0.5, where the distribution is symmetric and most spread out relative to its variance, and approaches +\infty as p tends to 0 or 1, reflecting the increasing degeneracy at the boundaries. To derive the fourth central moment, note that \mu = p, so (X - \mu)^4 = (1 - p)^4 with probability p and (-p)^4 = p^4 with probability 1-p. Thus, \mu_4 = p(1-p)^4 + (1-p)p^4 = p(1-p)[(1-p)^3 + p^3]. Expanding (1-p)^3 + p^3 = 1 - 3p + 3p^2 - p^3 + p^3 = 1 - 3p(1-p), which confirms \mu_4 = p(1-p)[1 - 3p(1-p)]. This derivation underscores the distribution's limited variability, which yields negative excess kurtosis for p near 0.5 and large positive excess kurtosis near the boundaries.
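
The excess-kurtosis expression can likewise be verified numerically; this sketch assumes SciPy, whose 'k' moment is the excess kurtosis.

```python
from scipy import stats

def bernoulli_excess_kurtosis(p):
    """kappa = (1 - 6 p (1 - p)) / (p (1 - p))."""
    return (1 - 6*p*(1 - p)) / (p * (1 - p))

for p in (0.5, 0.3, 0.05):
    print(p, bernoulli_excess_kurtosis(p),
          float(stats.bernoulli.stats(p, moments='k')))
# p = 0.5: -2, the minimum; p = 0.05: large and positive as p nears the boundary
```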

Advanced Properties

Higher moments and cumulants

The raw moments of a Bernoulli random variable X with success probability p are \mathbb{E}[X^k] = p for all integers k \geq 1, while \mathbb{E}[X^0] = 1. This follows from X^k = X almost surely for k \geq 1, since X takes values in \{0, 1\}. The central moments are \mu_k = \mathbb{E}[(X - p)^k] = p (1 - p)^k + (1 - p) (-p)^k for k \geq 1. For even k, this simplifies to p (1 - p)^k + (1 - p) p^k, which can be derived using the binomial theorem on expansions involving powers of (1 - p) and p. For odd k > 1, the expression yields antisymmetric patterns, such as \mu_3 = p(1 - p)(1 - 2p). These moments provide a complete characterization of the distribution beyond the mean and variance, emphasizing its discrete, two-point nature. Cumulants of the Bernoulli distribution are obtained from the cumulant generating function K(t) = \log(1 - p + p e^t), whose coefficients satisfy K(t) = \sum_{n=1}^\infty \kappa_n \frac{t^n}{n!}. The first cumulant is \kappa_1 = p, the second is \kappa_2 = p(1 - p), the third is \kappa_3 = p(1 - p)(1 - 2p), the fourth is \kappa_4 = p(1 - p)[1 - 6p(1 - p)], the fifth is \kappa_5 = p(1 - p)(1 - 2p)[1 - 12p(1 - p)], and the sixth is \kappa_6 = p(1 - p)[1 - 30p(1 - p) + 120 p^2 (1 - p)^2]. Higher cumulants follow from differentiating K(t) or from the standard moment–cumulant relations, resulting in \kappa_n equal to p(1 - p) times a polynomial in p of degree n - 2. A key advantage of cumulants is their additivity under independent summation: if X_1, \dots, X_n are i.i.d. Bernoulli(p), then the cumulants of their sum (a binomial random variable) are exactly n times the corresponding Bernoulli cumulants. This property facilitates approximations and analyses of sums in probability theory.
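
These cumulants can be reproduced symbolically by differentiating the cumulant generating function at t = 0. The sketch below uses SymPy, which the article does not mention, purely as a verification tool.

```python
import sympy as sp

p, t = sp.symbols('p t', positive=True)
K = sp.log(1 - p + p*sp.exp(t))    # cumulant generating function K(t)

# kappa_n is the n-th derivative of K evaluated at t = 0
for n in range(1, 5):
    kappa_n = sp.factor(sp.simplify(sp.diff(K, t, n).subs(t, 0)))
    print(n, kappa_n)
# SymPy prints equivalent factored forms of:
#   p,  p(1-p),  p(1-p)(1-2p),  p(1-p)[1 - 6p(1-p)]
```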

Generating functions

The probability generating function (PGF) of a Bernoulli random variable X with success probability p (where q = 1 - p) is defined as G_X(s) = \mathbb{E}[s^X] = q + p s, for |s| \leq 1. This function encapsulates the probability mass function and facilitates the analysis of sums of independent random variables. The moment generating function (MGF) for the same distribution is M_X(t) = \mathbb{E}[e^{tX}] = q + p e^t, defined for t \in \mathbb{R}. Similarly, the characteristic function, which is the Fourier transform of the distribution, is \phi_X(t) = \mathbb{E}[e^{i t X}] = q + p e^{i t}, for t \in \mathbb{R}. These generating functions provide powerful tools for deriving properties of the Bernoulli distribution. Specifically, successive derivatives of the PGF or MGF evaluated at appropriate points (such as s = 1 or t = 0) yield the moments of X. Additionally, for a sum of independent Bernoulli random variables, the PGF of the sum is the product of the individual PGFs, leading directly to the PGF of a binomial distribution.
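
A brief symbolic sketch (SymPy, illustrative) showing raw moments obtained from the MGF and the product-of-PGFs route to the binomial:

```python
import sympy as sp

s, t, p = sp.symbols('s t p', positive=True)
q = 1 - p

M = q + p*sp.exp(t)      # moment generating function
G = q + p*s              # probability generating function

# E[X^k] is the k-th derivative of M at t = 0; every raw moment equals p
print([sp.diff(M, t, k).subs(t, 0) for k in (1, 2, 3)])   # [p, p, p]

# PGF of the sum of 3 independent Bernoulli(p) variables: the binomial PGF (q + p s)^3
print(sp.expand(G**3))
```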

Exponential family representation

The Bernoulli distribution belongs to the exponential family of distributions, a class that encompasses many common parametric families and facilitates unified inference procedures. In its general form, a distribution in the exponential family can be expressed as
p(x \mid \theta) = h(x) \exp\left[ \eta(\theta) T(x) - A(\theta) \right],
where h(x) is the base measure, \eta(\theta) is the natural parameter, T(x) is the sufficient statistic, and A(\theta) is the log-partition function that normalizes the distribution.
For the Bernoulli distribution with success probability p, the probability mass function p(x \mid p) = p^x (1-p)^{1-x} for x \in \{0, 1\} can be rewritten in canonical exponential family form by taking the logarithm:
\log p(x \mid p) = x \log \frac{p}{1-p} + \log(1-p).
This yields h(x) = 1, the natural parameter \eta = \log \frac{p}{1-p} (also denoted as the logit of p), the sufficient statistic T(x) = x, and the log-partition function A(\eta) = \log(1 + e^\eta), since p = \frac{e^\eta}{1 + e^\eta} and 1-p = \frac{1}{1 + e^\eta}. The natural parameter \eta thus serves as a reparameterization of p, mapping the interval (0,1) to (-\infty, \infty), which proves useful in optimization and modeling contexts.
Membership in the exponential family provides several inferential advantages for the Bernoulli distribution. The log-partition function A(\eta) acts as a cumulant generating function, allowing moments to be obtained via differentiation: the mean is \mu = E[X] = \frac{\partial A}{\partial \eta} = \frac{e^\eta}{1 + e^\eta} = p, and the variance is \mathrm{Var}(X) = \frac{\partial^2 A}{\partial \eta^2} = \mu(1 - \mu), expressing variability directly as a function of the mean without additional parameters. This structure unifies the Bernoulli with other exponential-family distributions, enabling shared techniques for estimation and inference across models. The exponential family representation also underpins the Bernoulli distribution's role in generalized linear models (GLMs), where it serves as the response distribution for binary outcomes, with a logit link function connecting the linear predictor to the natural parameter \eta. This connection facilitates extensions to logistic regression and broader GLM frameworks for predictive modeling.
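
The logit reparameterization and the derivatives of the log-partition function can be checked numerically, as in this illustrative Python sketch (the helper names logit, sigmoid, and log_partition are not from the text; finite differences stand in for the analytic derivatives).

```python
import numpy as np

def logit(p):
    """Natural parameter eta = log(p / (1 - p))."""
    return np.log(p / (1 - p))

def sigmoid(eta):
    """Mean mapping p = e^eta / (1 + e^eta)."""
    return 1.0 / (1.0 + np.exp(-eta))

def log_partition(eta):
    """A(eta) = log(1 + e^eta)."""
    return np.log1p(np.exp(eta))

p = 0.3
eta = logit(p)
h = 1e-5
dA = (log_partition(eta + h) - log_partition(eta - h)) / (2*h)                        # ~ E[X] = p
d2A = (log_partition(eta + h) - 2*log_partition(eta) + log_partition(eta - h)) / h**2  # ~ Var(X)
print(sigmoid(eta), dA, d2A, p*(1 - p))   # 0.3, ~0.3, ~0.21, 0.21
```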

Information Measures

Entropy

The entropy of a Bernoulli random variable X with success probability p, denoted H(X), measures the average uncertainty in the outcome of X. Since the Bernoulli distribution is discrete, differential entropy does not apply; instead, the Shannon entropy is used, given by H(X) = -p \log_2 p - (1-p) \log_2 (1-p) in bits, or equivalently H(X) = -p \ln p - (1-p) \ln (1-p) in nats when using the natural logarithm. This formula arises from the general definition of entropy for a discrete random variable as the expected value of the negative logarithm of the probability mass function. The function H(p) is known as the binary entropy function, which quantifies the uncertainty inherent in a binary source with success probability p. It achieves its maximum value of 1 bit (or \ln 2 nats) when p = 0.5, corresponding to the case of maximum uncertainty where the outcomes are equally likely. At the boundaries, H(0) = H(1) = 0, reflecting complete certainty in the outcome. This entropy represents the average number of bits required to encode the outcome of X in an optimal code, providing a fundamental limit on lossless compression for sequences of independent Bernoulli trials. The binary entropy function is symmetric about p = 0.5, satisfying H(p) = H(1-p), and is strictly concave on [0,1], as its second derivative is negative for 0 < p < 1.
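
A small sketch of the binary entropy function (illustrative Python; the boundary convention 0 log 0 = 0 is made explicit in the code):

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-p*np.log2(p) - (1 - p)*np.log2(1 - p))

print(binary_entropy(0.5))                        # 1.0 bit, the maximum
print(binary_entropy(0.1), binary_entropy(0.9))   # equal values, since H(p) = H(1 - p)
```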

Fisher's information

The Fisher information for a single observation from a Bernoulli distribution with success probability p is defined as I(p) = \mathbb{E}\left[ \left( \frac{\partial}{\partial p} \log f(X \mid p) \right)^2 \right], where f(X \mid p) = p^X (1-p)^{1-X} for X \in \{0, 1\}. To compute this, first find the score function: the log-likelihood is \log f(X \mid p) = X \log p + (1 - X) \log (1 - p), so the derivative with respect to p is \frac{\partial}{\partial p} \log f(X \mid p) = \frac{X}{p} - \frac{1 - X}{1 - p}. The Fisher information is then the expected value of the square of this score, which evaluates to I(p) = \frac{1}{p(1-p)}. This quantity measures the curvature of the log-likelihood function and quantifies the amount of information that a single observation carries about the parameter p; notably, its inverse provides the Cramér–Rao lower bound on the variance of any unbiased estimator of p. The value of I(p) attains its minimum of 4 at p = 0.5 and grows without bound as p approaches 0 or 1: a single observation is least informative, and the variance bound p(1-p) is largest, when the two outcomes are equally likely. For n independent and identically distributed observations, the Fisher information scales additively to I_n(p) = \frac{n}{p(1-p)}.
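
The closed form I(p) = 1/(p(1-p)) can be checked against the Monte Carlo variance of the score, as in this illustrative Python sketch:

```python
import numpy as np

def score(x, p):
    """Score function: d/dp log f(x | p) = x/p - (1 - x)/(1 - p)."""
    return x / p - (1 - x) / (1 - p)

p = 0.3
rng = np.random.default_rng(2)
x = rng.binomial(1, p, size=500_000)

print(1 / (p * (1 - p)))        # closed-form Fisher information, about 4.76
print(np.var(score(x, p)))      # variance of the score, approximately the same
print(1 / (0.5 * 0.5))          # 4.0: the minimum of I(p), attained at p = 0.5
```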

Parameter Estimation

Maximum likelihood estimation

Consider a sample of n independent and identically distributed (i.i.d.) Bernoulli random variables X_1, X_2, \dots, X_n, each with success probability p. The likelihood function is given by L(p) = p^s (1-p)^{n-s}, where s = \sum_{i=1}^n X_i is the number of successes observed in the sample. The maximum likelihood estimator (MLE) of p is the value \hat{p} that maximizes L(p), which is the sample proportion \hat{p} = s/n. To derive this, consider the log-likelihood \ell(p) = \log L(p) = s \log p + (n-s) \log (1-p). Differentiating with respect to p yields \frac{\partial}{\partial p} \ell(p) = \frac{s}{p} - \frac{n-s}{1-p}. Setting the derivative equal to zero and solving gives \hat{p} = s/n. The MLE \hat{p} is unbiased, meaning E[\hat{p}] = p. It achieves the minimum variance among unbiased estimators, attaining the Cramér-Rao lower bound. Additionally, \hat{p} is asymptotically normal, with \sqrt{n} (\hat{p} - p) \xrightarrow{d} \mathcal{N}\left(0, p(1-p)\right) as n \to \infty. This asymptotic variance equals the reciprocal of the Fisher information for a single observation.
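
A short simulation of the MLE and its asymptotic normal approximation (illustrative Python; the 1.96 multiplier gives an approximate 95% Wald interval, an assumption of this sketch rather than something stated in the text):

```python
import numpy as np

rng = np.random.default_rng(42)
p_true, n = 0.3, 5_000
x = rng.binomial(1, p_true, size=n)

s = x.sum()
p_hat = s / n                             # MLE: the sample proportion
se = np.sqrt(p_hat * (1 - p_hat) / n)     # plug-in standard error from asymptotic normality
print(p_hat, (p_hat - 1.96*se, p_hat + 1.96*se))
```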

Bayesian estimation

In Bayesian estimation of the Bernoulli distribution's success probability p, the Beta distribution serves as the conjugate prior, parameterized by shape parameters \alpha > 0 and \beta > 0, which encode prior beliefs about p. Given n independent Bernoulli trials with s successes, the likelihood is proportional to p^s (1-p)^{n-s}, and the posterior distribution remains a Beta distribution, updated to \text{Beta}(\alpha + s, \beta + n - s). This conjugacy simplifies computation by preserving the family form, avoiding numerical integration. The posterior mean, a common point estimate, is given by \frac{\alpha + s}{\alpha + \beta + n}, which shrinks the maximum likelihood estimate s/n toward the prior mean \alpha/(\alpha + \beta), weighted by the prior strength \alpha + \beta. For \alpha > 1 and \beta > 1, the posterior mode is \frac{\alpha + s - 1}{\alpha + \beta + n - 2}, providing the most probable value of p under the posterior. Credible intervals for p can be constructed from the quantiles of this posterior distribution, offering probabilistic bounds that incorporate prior uncertainty. The parameters \alpha and \beta admit an interpretation in terms of pseudocounts: the prior reflects \alpha - 1 successes and \beta - 1 failures, regularizing estimates especially with limited data. A notable special case is the uniform prior \text{Beta}(1, 1), which yields a posterior mean of (s + 1)/(n + 2); this is known as Laplace's rule of succession, originally applied to predict the probability of future successes after observing s out of n trials. Such Bayesian approaches complement maximum likelihood estimation, particularly in small-sample scenarios, by providing uncertainty quantification through the full posterior.
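
A minimal Beta–Bernoulli update is sketched below (assuming SciPy; the counts s = 7 and n = 10 are purely illustrative):

```python
from scipy import stats

alpha, beta = 1.0, 1.0          # uniform Beta(1, 1) prior
s, n = 7, 10                    # observed successes and trials

posterior = stats.beta(alpha + s, beta + n - s)   # Beta(alpha + s, beta + n - s)
print(posterior.mean())          # (alpha + s)/(alpha + beta + n) = 8/12, Laplace's rule of succession
print(posterior.interval(0.95))  # equal-tailed 95% credible interval for p
```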

Binomial distribution

The binomial distribution arises as the distribution of the sum of a fixed number n of independent and identically distributed (i.i.d.) Bernoulli random variables, each with success probability p. Specifically, if X_1, X_2, \dots, X_n are i.i.d. Bernoulli(p) random variables, then their sum S_n = \sum_{i=1}^n X_i follows a binomial distribution with parameters n and p, denoted Binomial(n, p). The probability mass function (PMF) of the binomial distribution can be derived from the convolution of the individual Bernoulli PMFs. For S_n = k successes in n trials, the probability is the number of ways to choose k successes out of n trials, multiplied by the probability of success on those k trials and failure on the remaining n-k trials: P(S_n = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \dots, n. This formula reflects the combinatorial nature of sequencing independent Bernoulli trials. The mean and variance of the binomial distribution follow directly from the properties of sums of i.i.d. random variables. The mean is E[S_n] = np, obtained by linearity of expectation as the sum of the individual means E[X_i] = p. Similarly, the variance is \operatorname{Var}(S_n) = np(1-p), since the variables are independent, so \operatorname{Var}(S_n) = \sum_{i=1}^n \operatorname{Var}(X_i) = n \cdot p(1-p). For large n, the central limit theorem provides a normal approximation to the binomial distribution, stating that S_n is approximately normally distributed with mean np and variance np(1-p), i.e., S_n \approx \mathcal{N}(np, np(1-p)), provided np and n(1-p) are sufficiently large (typically both greater than 5 or 10). This approximation is useful for computing probabilities when exact binomial calculations are cumbersome.
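
The sum-of-Bernoullis construction and the normal approximation can be demonstrated with a short simulation (illustrative Python with NumPy and SciPy; the continuity correction at 12.5 is an assumption of this sketch):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 50, 0.3

# Each row holds n i.i.d. Bernoulli(p) trials; the row sums follow Binomial(n, p)
sums = rng.binomial(1, p, size=(100_000, n)).sum(axis=1)
print(sums.mean(), sums.var())     # approximately np = 15 and np(1-p) = 10.5

# Normal approximation to P(S_n <= 12), with a continuity correction at 12.5
print(stats.binom.cdf(12, n, p),
      stats.norm.cdf(12.5, loc=n*p, scale=np.sqrt(n*p*(1 - p))))
```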

Other connections

The Bernoulli process is defined as a sequence of independent and identically distributed Bernoulli random variables, each representing a trial with success probability p, which collectively model discrete-time stochastic processes such as repeated coin flips or success/failure experiments over time. This process captures the temporal evolution of binary outcomes in applications such as reliability analysis, where the independence assumption ensures that each trial's result does not influence subsequent ones. The geometric distribution arises directly from the Bernoulli process as the distribution of the number of failures preceding the first success in a sequence of trials, each with success probability p. Specifically, if trials continue until the initial success, the waiting time follows a geometric distribution with parameter p, providing a foundational link between single-trial outcomes and stopping-time problems in probability. The Bernoulli distribution is the special case of the general two-point distribution with support \{0, 1\}, while the Rademacher distribution is the symmetric two-point case with outcomes +1 and -1 each having probability 1/2, equivalent to the linear transformation 2X - 1 of X \sim \text{Bernoulli}(0.5). This connection highlights the Bernoulli's role in symmetric random walks and random-sign constructions. In the Poisson binomial distribution, the sum of independent but non-identically distributed Bernoulli random variables, each with its own success probability p_i, produces a more flexible generalization beyond fixed-p scenarios, applicable to heterogeneous risks like fault probabilities in component systems. This structure allows modeling of scenarios where trial success rates vary, distinguishing it from uniform-parameter cases while retaining the core Bernoulli nature of the components.
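
Two of these connections in miniature (illustrative Python): the geometric waiting time built from Bernoulli trials, and a Poisson binomial sum with heterogeneous p_i.

```python
import numpy as np

rng = np.random.default_rng(3)

# Geometric connection: number of failures before the first success in Bernoulli(p) trials
p = 0.25
def failures_before_first_success(rng, p):
    count = 0
    while rng.binomial(1, p) == 0:
        count += 1
    return count

waits = [failures_before_first_success(rng, p) for _ in range(20_000)]
print(np.mean(waits))                 # approximately (1 - p)/p = 3

# Poisson binomial: sum of independent Bernoulli(p_i) with different success probabilities
p_i = np.array([0.1, 0.4, 0.7, 0.9])
draws = (rng.random((100_000, p_i.size)) < p_i).sum(axis=1)
print(draws.mean(), p_i.sum())        # the mean of the sum equals the sum of the p_i
```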

History

Origins in probability theory

The systematic study of probability in games of chance emerged in the 16th century through Gerolamo Cardano's Liber de Ludo Aleae, a treatise on games of chance that analyzed dice throws and calculated probabilities for sums and sequences, marking an early systematic approach to quantifying random events in gambling contexts. Written around 1564 and published posthumously in 1663, Cardano's work assumed equal chances for dice faces and applied multiplicative rules to compound probabilities, but it lacked a rigorous model for repeated independent binary trials. Foundational developments in probability for games of chance advanced in the mid-17th century via the correspondence between Blaise Pascal and Pierre de Fermat in 1654, prompted by queries from the gambler Chevalier de Méré on fair stake division in interrupted games. Their exchanges addressed the "problem of points," involving binary-like success-failure scenarios in dice and card games, where they enumerated outcomes to determine equitable shares based on remaining probabilities of winning. This collaboration established core principles of probabilistic reasoning for discrete trials, influencing subsequent work without yet formalizing convergence properties. Jacob Bernoulli provided the first rigorous treatment of binary trials in his posthumously published Ars Conjectandi in 1713, where he developed the law of large numbers specifically for repeated independent experiments with fixed success probability p. In Part IV, Bernoulli proved that the sample proportion of successes converges to p as the number of trials n \to \infty, using binomial expansions to bound the probability of deviation and quantify the n required for high confidence. This key insight established statistical regularity in binary outcomes, building on earlier ideas from Pascal and Fermat to formalize the Bernoulli trial model as a cornerstone of probability theory.

Naming and legacy

The Bernoulli distribution is named after Swiss mathematician Jacob Bernoulli (1654–1705), whose seminal work Ars Conjectandi, published posthumously in 1713, established key principles of probability theory, including the law of large numbers for sequences of binary outcomes. Although Bernoulli himself described the general model for multiple trials using the term "binomial," the specific designation "Bernoulli distribution" for the single-trial case, honoring his foundational contributions, arose in the 20th century as textbooks formalized discrete distributions. Bernoulli's ideas formed the bedrock for limit theorems in probability, with his theorem serving as the inaugural result showing convergence of empirical proportions to theoretical probabilities in repeated binary experiments. This theorem profoundly influenced 19th-century probabilists, including Pierre-Simon Laplace, who generalized it into the central limit theorem to approximate sums of independent random variables, and Siméon Denis Poisson, who extended the results to scenarios with non-constant success probabilities. In modern probability education, the Bernoulli distribution has been a staple since at least the 1930s, appearing as a core concept in influential texts like J. V. Uspensky's Introduction to Mathematical Probability (1937), underscoring its role as the simplest discrete distribution for binary events. The 2005 Jakob Bernoulli Year, marking the 350th anniversary of his birth and the 300th of his death, celebrated his lasting impact through academic events and publications. The 2013 tricentennial of Ars Conjectandi further highlighted its enduring significance via international conferences dedicated to Bernoulli's probabilistic innovations. A key aspect of its legacy involves distinguishing the Bernoulli distribution, which models the outcome of one random variable with success probability p, from Bernoulli trials, which denote a series of independent such experiments underlying the binomial distribution for n > 1 trials. This clarification, emphasized in standard statistical literature, preserves Bernoulli's original emphasis on sequential processes while adapting his framework to modern single-variable analysis.

Applications

Modeling binary outcomes

The Bernoulli distribution serves as a foundational model for binary outcomes in probability experiments, where each trial results in either success or failure, with success probability denoted by p and failure probability 1 - p. A classic illustration is the coin toss, where a fair coin has p = 0.5 for heads (success), while a biased coin deviates from this value, allowing the distribution to capture asymmetry in real-world randomness. In quality control, the Bernoulli distribution models the occurrence of defects in individual items, treating each inspection as a Bernoulli trial where success might represent a non-defective product with p as the reliability rate. For instance, if historical data shows a 4% defect rate, the probability of a single item being defective follows a Bernoulli distribution with p = 0.04, aiding in pass-fail assessments during inspection. For hypothesis testing, the Bernoulli distribution underpins tests of fairness in binary events, such as setting the null hypothesis at p = 0.5 for a fair coin and comparing observed outcomes against an alternative p \neq 0.5 to detect bias. This approach evaluates whether deviations from expected success rates are statistically significant in simple yes/no scenarios. In simulation, Bernoulli random variables generate binary data for Monte Carlo methods, where repeated sampling from the distribution approximates complex probabilistic behaviors in computational experiments. For example, drawing from Bernoulli(p) produces sequences of 0s and 1s to model uncertain events in algorithmic testing. In risk analysis, the Bernoulli distribution quantifies the probability of event occurrence in a single trial, such as the likelihood of a failure or loss in an isolated assessment, providing a building block for evaluating potential impacts in fields like reliability engineering. This single-trial focus extends naturally to the binomial distribution for multiple independent repetitions.
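
A small Monte Carlo sketch of the inspection example (Python; the 4% defect rate comes from the text, everything else, including the helper name bernoulli_sample, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def bernoulli_sample(p, size, rng):
    """Inverse-transform sampling: X = 1 if U < p with U ~ Uniform(0, 1), else 0."""
    return (rng.random(size) < p).astype(int)

defect_rate = 0.04
inspections = bernoulli_sample(defect_rate, 100_000, rng)
print(inspections.mean())    # empirical defect proportion, close to 0.04
```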

Use in statistics and machine learning

In logistic regression, the Bernoulli distribution serves as the likelihood model for binary response variables, where each y_i \in \{0, 1\} is drawn from a Bernoulli distribution with success probability p_i, and the model parameters are estimated by maximizing the log-likelihood \sum_i [y_i \log p_i + (1 - y_i) \log (1 - p_i)]. The logit link function, defined as \log\left(\frac{p_i}{1 - p_i}\right) = \mathbf{x}_i^\top \boldsymbol{\beta}, linearly relates the predictors to the log-odds, enabling the modeling of how covariates influence outcomes such as success or failure. This framework, originally proposed for the regression analysis of binary sequences, remains foundational for generalized linear models in inferential statistics. In A/B testing, the Bernoulli distribution models conversion rates through the parameter p, representing the probability of a positive event like a user click or purchase under different variants. Maximum likelihood estimation provides point estimates of p for each variant as the observed proportion of successes, facilitating tests on differences in conversion probabilities. Bayesian approaches update prior distributions on p (often Beta priors, conjugate to Bernoulli likelihoods) with observed data to yield posterior distributions, enabling probabilistic statements about variant superiority and sequential testing decisions. In machine learning, the Bernoulli distribution underpins algorithms like Bernoulli naive Bayes for binary feature spaces, such as text classification where document-term presence (binary indicators) is assumed independent given the class label, with class-conditional probabilities estimated via maximum likelihood. This model excels on high-dimensional sparse data, outperforming multinomial variants when term frequencies are irrelevant. Hidden Markov models incorporate Bernoulli emissions for binary observation sequences, where the emission probability from each hidden state follows a Bernoulli distribution parameterized by state-specific success probabilities, supporting applications in sequential data inference like regime detection. For large-scale datasets, stochastic gradient descent optimizes the Bernoulli log-likelihood efficiently, approximating full-batch gradients with single observations or minibatches to scale logistic regression and related models to millions of samples, with the negative log-likelihood loss (binary cross-entropy) guiding parameter updates. This approach trades off gradient variance for computational speed, converging to near-optimal solutions in high-dimensional regimes.
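
A compact sketch of fitting logistic regression by gradient ascent on the Bernoulli log-likelihood (equivalently, minimizing binary cross-entropy); the synthetic data, coefficient values, and learning rate are illustrative assumptions, and batch gradients stand in for the stochastic/minibatch updates discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y_i ~ Bernoulli(sigmoid(x_i . beta_true))
n, d = 5_000, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# Gradient ascent on the average Bernoulli log-likelihood
beta = np.zeros(d)
learning_rate = 0.5
for _ in range(2_000):
    p_hat = 1.0 / (1.0 + np.exp(-X @ beta))
    beta += learning_rate * X.T @ (y - p_hat) / n   # gradient of the mean log-likelihood
print(beta)    # close to beta_true
```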
