Geometric distribution
The geometric distribution is a discrete probability distribution that models the number of independent Bernoulli trials required to achieve the first success, where each trial has a constant success probability p (with 0 < p \leq 1).[1] It arises in scenarios involving repeated independent experiments with binary outcomes, such as success or failure, and is fundamental in probability theory for analyzing waiting times until an event occurs.[2] There are two common parameterizations of the geometric distribution, differing in whether the random variable counts the total number of trials until the first success (with support X = 1, 2, 3, \dots) or the number of failures preceding the first success (with support Y = 0, 1, 2, \dots).[3] For the trials-until-success version, the probability mass function is given by P(X = k) = (1 - p)^{k-1} p, \quad k = 1, 2, 3, \dots
while for the failures-before-success version, it is
P(Y = k) = (1 - p)^k p, \quad k = 0, 1, 2, \dots. [2][3] The expected value is E[X] = \frac{1}{p} for the former and E[Y] = \frac{1-p}{p} for the latter, with both sharing the variance \frac{1-p}{p^2}.[2][3] A defining feature of the geometric distribution is its memoryless property, which states that the probability of requiring additional trials beyond a certain point is independent of the trials already conducted: P(X > s + t \mid X > s) = P(X > t) for non-negative integers s and t.[3] This property uniquely characterizes the geometric distribution among discrete distributions on the non-negative integers and parallels the exponential distribution in continuous settings.[3] The distribution also serves as the special case r = 1 of the negative binomial distribution, which generalizes it to the number of trials until the r-th success.[4] Applications of the geometric distribution are widespread in fields requiring modeling of waiting times or trial counts until an event, such as reliability engineering (e.g., time until an engine failure in independent tests), quality control (e.g., inspections until a defective item is found), and telecommunications (e.g., packet retransmissions until successful delivery).[5][6][7] It is also used in ecology for modeling runs of species occurrences and in computer science for analyzing algorithm performance in randomized settings.[7][6]
Definition
Parameterizations
The geometric distribution arises in the context of independent Bernoulli trials, each with success probability p where 0 < p \leq 1.[8] One standard parameterization defines the random variable X as the number of failures preceding the first success, with possible values in the set \{0, 1, 2, \dots \}.[8] This formulation, often considered primary in mathematical probability, directly corresponds to the terms of a geometric series indexed from 0.[8] An alternative parameterization specifies the random variable Y as the total number of trials required to achieve the first success, taking values in \{1, 2, 3, \dots \}.[9] Here, Y = X + 1, linking the two definitions.[10] Both parameterizations remain in common use due to contextual preferences: the failures version suits derivations involving infinite series and theoretical probability, while the trials version is favored in applied statistics for representing waiting times or experiment counts.[8][9] The probability of failure on each trial is conventionally denoted q = 1 - p.[8]
Probability Mass Function
The geometric distribution can be parameterized in terms of the number of failures X before the first success in a sequence of independent Bernoulli trials, each with success probability p where 0 < p \leq 1. The probability mass function (PMF) for this parameterization is given by P(X = k) = (1 - p)^k p, \quad k = 0, 1, 2, \dots [8] This PMF is verified to be a valid probability distribution, as the infinite sum over the support equals 1: \sum_{k=0}^{\infty} P(X = k) = p \sum_{k=0}^{\infty} (1 - p)^k = p \cdot \frac{1}{1 - (1 - p)} = 1, using the formula for the sum of an infinite geometric series with common ratio |1 - p| < 1.[8] An alternative parameterization models the number of trials Y until the first success, also based on independent Bernoulli trials with success probability p. The corresponding PMF is P(Y = k) = (1 - p)^{k-1} p, \quad k = 1, 2, 3, \dots [11] This PMF similarly sums to 1 over its support: \sum_{k=1}^{\infty} P(Y = k) = p \sum_{k=1}^{\infty} (1 - p)^{k-1} = p \sum_{j=0}^{\infty} (1 - p)^j = p \cdot \frac{1}{1 - (1 - p)} = 1, again applying the infinite geometric series sum.[11] The random variables X and Y are related by Y = X + 1, which induces a probability shift such that P(Y = k) = P(X = k - 1) for k = 1, 2, \dots, accounting for the difference in their supports and the exponent adjustment in the PMF.[11]
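The two PMFs can be checked numerically. The following sketch (assuming NumPy and SciPy are available; the value p = 0.3 is chosen only for illustration) evaluates both forms and confirms that SciPy's geom, which uses the trials convention, reproduces the failures form after a shift of the support.

```python
import numpy as np
from scipy import stats

p = 0.3
k_trials = np.arange(1, 11)        # support of Y (trials until first success)
k_failures = np.arange(0, 10)      # support of X (failures before first success)

# Trials parameterization: P(Y = k) = (1-p)^(k-1) * p, matches scipy.stats.geom
pmf_trials = (1 - p) ** (k_trials - 1) * p
assert np.allclose(pmf_trials, stats.geom.pmf(k_trials, p))

# Failures parameterization: P(X = k) = (1-p)^k * p, i.e. geom shifted with loc=-1
pmf_failures = (1 - p) ** k_failures * p
assert np.allclose(pmf_failures, stats.geom.pmf(k_failures, p, loc=-1))

# The PMF sums to 1 as the truncation point grows
print(stats.geom.pmf(np.arange(1, 200), p).sum())  # ~ 1.0
```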
Cumulative Distribution Function
The cumulative distribution function (CDF) of a geometric random variable provides the probability that the number of failures or trials until the first success is at most a given value. There are two common parameterizations of the geometric distribution, which differ in whether the random variable counts the number of failures before the first success (starting from 0) or the number of trials until the first success (starting from 1).[3] Consider first the parameterization where X denotes the number of failures before the first success in a sequence of independent Bernoulli trials, each with success probability p where 0 < p < 1. The CDF is given by F_X(k) = P(X \leq k) = 1 - (1 - p)^{k+1}, \quad k = 0, 1, 2, \dots This closed-form expression is obtained by summing the probability mass function (PMF) from 0 to k, which forms a finite geometric series.[3][12] For the alternative parameterization where Y represents the trial number on which the first success occurs, the support begins at 1, and the CDF is F_Y(k) = P(Y \leq k) = 1 - (1 - p)^k, \quad k = 1, 2, 3, \dots Similarly, this follows from summing the corresponding PMF from 1 to k.[12][13] In both cases, the CDF approaches 1 as k \to \infty, since |1 - p| < 1, ensuring that success eventually occurs with probability 1. The survival function, S(k) = 1 - F(k), is then S_X(k) = (1 - p)^{k+1} for the failures parameterization and S_Y(k) = (1 - p)^k for the trials parameterization, representing the probability that no success has occurred by trial k.[12][3] These CDFs give the cumulative probability of the first success happening by the k-th trial (or after at most k failures), which is fundamental for modeling waiting times in discrete processes.[13]
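As a brief numerical check (parameter values below are arbitrary), the closed-form CDF and survival function can be compared against SciPy's cdf and sf methods, again recalling that geom uses the trials convention and loc=-1 gives the failures form.

```python
import numpy as np
from scipy import stats

p = 0.25
k = np.arange(1, 15)  # trial indices for the trials parameterization Y

# Closed-form CDF and survival function for Y (trials until first success)
cdf_closed = 1 - (1 - p) ** k
sf_closed = (1 - p) ** k
assert np.allclose(cdf_closed, stats.geom.cdf(k, p))
assert np.allclose(sf_closed, stats.geom.sf(k, p))

# Failures parameterization X = Y - 1: F_X(k) = 1 - (1-p)^(k+1)
kx = np.arange(0, 14)
assert np.allclose(1 - (1 - p) ** (kx + 1), stats.geom.cdf(kx, p, loc=-1))
```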
Properties
Memorylessness
The memoryless property of the geometric distribution states that, for a random variable X representing the number of trials until the first success with success probability p (where 0 < p < 1), the conditional probability P(X > s + t \mid X > s) = P(X > t) holds for all nonnegative integers s and t.[14] This means that the probability of requiring more than t additional trials after already observing s failures remains unchanged from the original probability of exceeding t trials from the start. To prove this, consider the survival function derived from the cumulative distribution function (CDF). For this parameterization, P(X > k) = (1 - p)^k for k = 0, 1, 2, \dots. Thus, P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{(1 - p)^{s + t}}{(1 - p)^s} = (1 - p)^t = P(X > t). [15] This equality demonstrates that past outcomes do not influence future probabilities. The property implies that the distribution of the remaining number of trials (or failures) is identical to the original distribution, regardless of the number of prior failures observed, reflecting a lack of "aging" or dependence on history in the process. As the sole discrete distribution exhibiting memorylessness, the geometric distribution serves as the discrete analogue to the exponential distribution's continuous memoryless property.[8]
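The identity can be verified directly from the survival function; the sketch below (with an arbitrary p and a small grid of s, t values) checks it using SciPy's sf.

```python
import numpy as np
from scipy import stats

p = 0.2
dist = stats.geom(p)  # trials-until-first-success parameterization

# Check P(X > s + t | X > s) == P(X > t) on a grid of s, t values
for s in range(0, 6):
    for t in range(0, 6):
        lhs = dist.sf(s + t) / dist.sf(s)   # conditional survival probability
        rhs = dist.sf(t)
        assert np.isclose(lhs, rhs)
print("memoryless identity holds on the tested grid")
```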
Moments and Cumulants
The expected value of the geometric random variable X, representing the number of failures before the first success in a sequence of independent Bernoulli trials with success probability p, is given by E[X] = \frac{1-p}{p}.[8] This can be derived directly from the probability mass function P(X = k) = p (1-p)^k for k = 0, 1, 2, \dots, yielding E[X] = \sum_{k=0}^{\infty} k \, p (1-p)^k = p (1-p) \sum_{k=1}^{\infty} k (1-p)^{k-1} = \frac{1-p}{p}, where the sum is evaluated using the identity \sum_{k=1}^{\infty} k x^{k-1} = \frac{1}{(1-x)^2} for |x| < 1.[8] Alternatively, the tail-sum formula for non-negative integer random variables gives E[X] = \sum_{k=0}^{\infty} P(X > k), where P(X > k) = (1-p)^{k+1}, so E[X] = \sum_{k=0}^{\infty} (1-p)^{k+1} = \frac{1-p}{p}.[16] The variance is \operatorname{Var}(X) = \frac{1-p}{p^2}.[8] To derive this, first compute the second factorial moment E[X(X-1)] = \sum_{k=0}^{\infty} k(k-1) \, p (1-p)^k = p (1-p)^2 \sum_{k=2}^{\infty} k(k-1) (1-p)^{k-2} = \frac{2(1-p)^2}{p^2}, using \sum_{k=2}^{\infty} k(k-1) x^{k-2} = \frac{2}{(1-x)^3}.[8] Then, \operatorname{Var}(X) = E[X(X-1)] + E[X] - (E[X])^2 = \frac{2(1-p)^2}{p^2} + \frac{1-p}{p} - \left( \frac{1-p}{p} \right)^2 = \frac{1-p}{p^2}.[8] In the alternative parameterization where Y = X + 1 denotes the number of trials until the first success, the expected value is E[Y] = \frac{1}{p} and the variance is \operatorname{Var}(Y) = \frac{1-p}{p^2}.[12] The skewness of the geometric distribution (the same in both parameterizations) is \frac{2 - p}{\sqrt{1 - p}}, measuring the asymmetry toward positive values; it grows without bound as p \to 1. The excess kurtosis is 6 + \frac{p^2}{1 - p}, which is at least 6 for every p, indicating tails heavier than those of the normal distribution, and it likewise grows without bound as p \to 1. The first cumulants are \kappa_1 = \frac{1-p}{p} and \kappa_2 = \frac{1-p}{p^2}, and higher-order cumulants satisfy the recursion \kappa_{n+1} = q \, \frac{d\kappa_n}{dq} with q = 1 - p, giving for example \kappa_3 = \frac{q(1+q)}{p^3}.[8] These follow from the cumulant-generating function K(t) = \log \left( \frac{p}{1 - (1-p) e^t} \right), obtained as the logarithm of the moment-generating function.[8]
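The mean and variance formulas can be confirmed by truncated summation of the PMF and by the tail-sum identity; the following sketch uses an arbitrary p = 0.3 and a truncation point chosen so the neglected tail is negligible.

```python
import numpy as np

p, q = 0.3, 0.7
k = np.arange(0, 1000)          # truncation is ample: q**1000 is negligible
pmf = p * q ** k                # failures parameterization

mean_direct = np.sum(k * pmf)               # E[X] from the PMF
mean_tail = np.sum(q ** (k + 1))            # E[X] = sum_k P(X > k)
var_direct = np.sum(k**2 * pmf) - mean_direct**2

print(mean_direct, mean_tail, q / p)        # all ~ 2.3333
print(var_direct, q / p**2)                 # both ~ 7.7778
```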
Summary Statistics
The geometric distribution is characterized by its success probability parameter p \in (0,1), with two common parameterizations: one counting the number of failures before the first success (X = 0, 1, 2, \dots, PMF P(X = k) = p (1-p)^k) and the other counting the number of trials until the first success (Y = 1, 2, 3, \dots, PMF P(Y = k) = p (1-p)^{k-1}). Note that Y = X + 1, so the distributions share the same variance, skewness, and excess kurtosis, but differ in mean, mode, and median by a shift of 1.[17][18] The following table summarizes key statistics for both parameterizations, where q = 1 - p; a numerical check using SciPy is given after the table.
| Statistic | Failures (X) | Trials (Y) |
|---|---|---|
| Mean | \frac{q}{p} | \frac{1}{p} |
| Variance | \frac{q}{p^2} | \frac{q}{p^2} |
| Standard Deviation | \frac{\sqrt{q}}{p} | \frac{\sqrt{q}}{p} |
| Skewness | \frac{2 - p}{\sqrt{q}} | \frac{2 - p}{\sqrt{q}} |
| Excess Kurtosis | 6 + \frac{p^2}{q} | 6 + \frac{p^2}{q} |
| Mode | 0 | 1 |
| Median | \left\lceil \frac{\ln 0.5}{\ln q} \right\rceil - 1 | \left\lceil \frac{\ln 0.5}{\ln q} \right\rceil |
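The table entries can be reproduced with SciPy; the sketch below (with an arbitrary p = 0.4) compares the library's moments and median for the trials parameterization, using loc=-1 for the failures form.

```python
import numpy as np
from scipy import stats

p, q = 0.4, 0.6

# SciPy's geom is the trials parameterization Y; loc=-1 shifts it to failures X
mean_y, var_y, skew_y, kurt_y = stats.geom.stats(p, moments='mvsk')
mean_x, var_x, skew_x, kurt_x = stats.geom.stats(p, loc=-1, moments='mvsk')

assert np.isclose(mean_y, 1 / p) and np.isclose(mean_x, q / p)
assert np.isclose(var_y, q / p**2) and np.isclose(var_x, q / p**2)
assert np.isclose(skew_y, (2 - p) / np.sqrt(q))
assert np.isclose(kurt_y, 6 + p**2 / q)     # 'k' in mvsk is the excess kurtosis

# Median of the trials parameterization matches the ceiling formula
assert stats.geom.median(p) == np.ceil(np.log(0.5) / np.log(q))
```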
Information Measures
Entropy
The Shannon entropy of a discrete random variable quantifies the average uncertainty or information content in its outcomes, defined as H(X) = -\sum_k P(X=k) \log_2 P(X=k), measured in bits. Because the geometric distribution is discrete, the relevant quantity is this Shannon entropy rather than a differential entropy.[19] Consider the parameterization where X counts the number of failures before the first success, so X = 0, 1, 2, \dots with probability mass function P(X = k) = p (1-p)^k, where p is the success probability (0 < p \leq 1) and q = 1 - p. The entropy is H(X) = -\log_2 p - \frac{q \log_2 q}{p}, or equivalently H(X) = \frac{h_2(p)}{p}, where h_2(p) = -p \log_2 p - q \log_2 q is the binary entropy function.[19] For the alternative parameterization where Y counts the number of trials until the first success, so Y = 1, 2, 3, \dots with P(Y = k) = p q^{k-1}, note that Y = X + 1. Since adding a constant does not change the entropy, H(Y) = H(X) = \frac{h_2(p)}{p}. The binary entropy h_2(p) arises from the memoryless property of the geometric distribution, reflecting the uncertainty in each Bernoulli trial scaled by the expected number of trials 1/p.[19] The entropy H(X) (or H(Y)) is minimized at p = 1, where H = 0 bits, corresponding to certain success on the first trial with no uncertainty. As p \to 0^+, the entropy diverges to infinity, reflecting unbounded uncertainty due to the potentially infinite sequence of failures. For example, at p = 0.5, H = h_2(0.5)/0.5 = 2 bits. If natural logarithms are used instead, the entropy is measured in nats by replacing \log_2 with \ln.[19]
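As a sketch (p = 0.5 chosen for illustration), the closed form h_2(p)/p can be compared with a direct truncated summation and with SciPy's entropy method, which returns the value in nats and therefore needs a conversion to bits.

```python
import numpy as np
from scipy import stats

p = 0.5
q = 1 - p

# Closed form in bits: H = h2(p) / p
h2 = -p * np.log2(p) - q * np.log2(q)
H_bits = h2 / p

# Direct summation over a truncated support (failures parameterization)
k = np.arange(0, 500)
pmf = p * q ** k
H_direct = -np.sum(pmf * np.log2(pmf))

# SciPy reports entropy in nats; convert to bits for comparison
H_scipy = stats.geom(p).entropy() / np.log(2)

print(H_bits, H_direct, H_scipy)   # all ~ 2.0 bits at p = 0.5
```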
Fisher Information
The Fisher information quantifies the amount of information that an observable random variable carries about an unknown parameter in a statistical model. For the geometric distribution, where X denotes the number of failures before the first success with success probability p \in (0,1), and probability mass function f(k;p) = p (1-p)^k for k = 0,1,2,\dots, the Fisher information I(p) is a scalar measuring the sensitivity of the distribution to changes in p.[20] To derive I(p), consider the log-probability mass function: \log f(k;p) = \log p + k \log(1-p). The score function, or first derivative with respect to p, is \frac{\partial}{\partial p} \log f(k;p) = \frac{1}{p} - \frac{k}{1-p}. The second derivative is \frac{\partial^2}{\partial p^2} \log f(k;p) = -\frac{1}{p^2} - \frac{k}{(1-p)^2}. The Fisher information is then the negative expected value of this second derivative: I(p) = -E\left[ \frac{\partial^2}{\partial p^2} \log f(k;p) \right] = E\left[ \frac{1}{p^2} + \frac{k}{(1-p)^2} \right] = \frac{1}{p^2} + \frac{E[X]}{(1-p)^2}. Since E[X] = (1-p)/p for the geometric distribution, substitution yields I(p) = \frac{1}{p^2} + \frac{(1-p)/p}{(1-p)^2} = \frac{1}{p^2} + \frac{1}{p(1-p)} = \frac{1}{p^2(1-p)}. This expression holds equivalently when computed as the variance of the score function, confirming the result.[21][20] For a sample of n independent observations, the total Fisher information is n I(p). This directly informs the asymptotic efficiency of estimators, such as the maximum likelihood estimator \hat{p}. Specifically, the asymptotic variance of \sqrt{n} (\hat{p} - p) is 1/I(p) = p^2 (1-p), implying that \hat{p} is asymptotically normal with mean p and variance p^2 (1-p)/n. Thus, estimation precision is highest near the boundaries p \to 0 or p \to 1, where the variance approaches zero, and lowest around p = 2/3, where I(p) achieves its minimum value of 27/4.[20]
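Since the Fisher information equals the variance of the score, it can be checked by Monte Carlo; the sketch below (arbitrary p and seed) simulates failure counts and compares the empirical variance of the score with 1/(p^2(1-p)).

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3

# Sample X = failures before first success (numpy's geometric counts trials)
x = rng.geometric(p, size=200_000) - 1

# Score for one observation: d/dp log f(x; p) = 1/p - x/(1-p)
score = 1 / p - x / (1 - p)

fisher_mc = score.var()            # Fisher information = Var(score)
fisher_exact = 1 / (p**2 * (1 - p))
print(fisher_mc, fisher_exact)     # both ~ 15.87 for p = 0.3
```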
Related Distributions
Bernoulli Distribution
The Bernoulli distribution serves as the foundational building block for the geometric distribution, representing the outcome of a single trial in a sequence of independent experiments. A Bernoulli trial is a random experiment with exactly two possible outcomes: success, occurring with probability p where 0 < p < 1, or failure, occurring with probability 1 - p.[1] These trials are assumed to be independent, meaning the outcome of any one trial does not influence the others, and identically distributed, with the same success probability p for each. This setup forms a Bernoulli process, which underpins many discrete probability models. The geometric distribution emerges directly from a sequence of such independent and identically distributed (i.i.d.) Bernoulli trials as the distribution of the waiting time until the first success. Specifically, if trials continue until the initial success is observed, the number of trials required—denoted X—follows a geometric distribution with parameter p. This waiting time interpretation highlights the geometric as a natural extension of repeated Bernoulli experiments, where each preceding failure simply delays the eventual success without altering future probabilities.[22] The independence of the Bernoulli trials ensures that the probability of success remains constant across attempts, making the geometric distribution memoryless in its progression.[1] In a limiting case, the geometric distribution reduces to the Bernoulli distribution when the process is restricted to a single trial. Under the convention where the geometric random variable X counts the number of failures before the first success (so X = 0, 1, 2, \dots), the probability P(X = 0) = p exactly matches the Bernoulli probability of success on that solitary trial, with all higher values becoming impossible.[23] This adjustment underscores the geometric's role as a generalization of the Bernoulli, where expanding to multiple potential trials accommodates waiting beyond the immediate outcome. All key properties of the geometric distribution, such as its memorylessness and moment-generating function, inherit directly from the independence and identical distribution of the underlying Bernoulli trials. Without this prerequisite structure, the geometric could not maintain its characteristic lack of dependence on prior failures, emphasizing the Bernoulli's essential preparatory role in deriving and understanding the geometric model.
Binomial Distribution
The binomial distribution and the geometric distribution are both derived from sequences of independent Bernoulli trials, each with success probability p and failure probability q = 1 - p, but they model different aspects of the process. The binomial distribution describes the number of successes in a fixed number n of trials, with probability mass function P(X = k) = \binom{n}{k} p^k q^{n-k} for k = 0, 1, \dots, n. In contrast, the geometric distribution describes the number of trials until the first success (equivalently, one fixed success with a variable number of trials), emphasizing the waiting time rather than a predetermined trial count.[1] This distinction highlights how the binomial fixes the sample size while allowing variable outcomes, whereas the geometric fixes the outcome (first success) while allowing variable sample size.[24] A key probabilistic relation connects the two: the probability mass function of the geometric distribution, P(X = k) = q^{k-1} p for k = 1, 2, \dots, can be expressed as the product of the success probability and the binomial probability of zero successes in the preceding k-1 trials, i.e., P(X = k) = p \cdot P(Y = 0), where Y \sim \text{Binomial}(k-1, p).[23] Since P(Y = 0) = q^{k-1}, this directly yields the geometric form and illustrates how the waiting time to first success builds on the absence of successes, a core component of binomial probabilities.[25] This link underscores the geometric as a foundational waiting-time model within the broader framework of Bernoulli trial accumulations. Conditioning on the binomial outcome provides another perspective on their interplay. Given exactly one success in n trials (i.e., S_n = 1 where S_n \sim \text{Binomial}(n, p)), the position X of that single success follows a discrete uniform distribution on \{1, 2, \dots, n\}, with P(X = k \mid S_n = 1) = 1/n for each k.[23] This uniform conditioning contrasts with the unconditional geometric distribution, which decreases with k, and reflects an "inverse" waiting-time view: instead of waiting forward until success, the position is retrospectively uniform given the total constraint of one success.[23] Notably, this conditional uniformity holds independently of p, as the events of failures before and after the success balance out.[23] The probability generating functions further tie the distributions together, with the binomial's G(s) = (q + p s)^n reducing to the Bernoulli case (n=1) as G(s) = q + p s, the single-trial building block shared with the geometric.[24] The geometric extends this via its generating function G(s) = \frac{p s}{1 - q s} for |s| < 1/q, which arises from summing the infinite series of potential waiting times and can be seen as generalizing the fixed-n structure to unbounded trials until success.[24] This functional form facilitates derivations of moments and connections to other waiting-time models.[1]
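Both relations can be checked directly; in the sketch below (arbitrary values of p, k, and n), the geometric PMF is recovered as p times the binomial probability of zero successes, and the conditional position of a single success is seen to be uniform.

```python
import numpy as np
from scipy import stats

p, k, n = 0.35, 6, 10

# Geometric (trials) PMF as p times the binomial probability of zero successes
lhs = stats.geom.pmf(k, p)
rhs = p * stats.binom.pmf(0, k - 1, p)
assert np.isclose(lhs, rhs)

# Given exactly one success in n trials, its position is uniform on {1, ..., n}:
# P(position = k, S_n = 1) = p * (1-p)^(n-1) for every k, so the ratio is 1/n.
joint = p * (1 - p) ** (n - 1)
cond = joint / stats.binom.pmf(1, n, p)
assert np.isclose(cond, 1 / n)
```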
Negative Binomial Distribution
The negative binomial distribution arises as the distribution of the total number of failures occurring before the r-th success in a sequence of independent Bernoulli trials, each with success probability p. This can be viewed as the sum of r independent and identically distributed geometric random variables, where each geometric random variable counts the number of failures before a single success.[26][27] If X ~ Geometric(p) represents the number of failures before the first success, with probability mass function (PMF) P(X = k) = (1-p)^k p for k = 0, 1, 2, ..., then the negative binomial random variable Y = X_1 + X_2 + \dots + X_r, where each X_i ~ Geometric(p), follows a negative binomial distribution NB(r, p).[26][28] The PMF of the negative binomial distribution NB(r, p) for the number of failures y before the r-th success is given by P(Y = y) = \binom{y + r - 1}{r - 1} p^r (1-p)^y, \quad y = 0, 1, 2, \dots This form incorporates binomial coefficients to account for the number of ways to arrange y failures and r successes in a sequence ending with the r-th success.[29] The expected value (mean) of Y is E[Y] = r (1-p)/p, and the variance is Var(Y) = r (1-p)/p^2. These moments generalize those of the geometric distribution, which correspond to the special case r = 1, where E[Y] = (1-p)/p and Var(Y) = (1-p)/p^2.[29][27] The PMF of the negative binomial can be derived through convolution of the individual geometric PMFs, reflecting the additive nature of the failures across the r stages. Specifically, the probability P(Y = y) is the sum over all non-negative integers k_1, k_2, \dots, k_r such that k_1 + \dots + k_r = y of the product p^r (1-p)^{k_1 + \dots + k_r}, which simplifies to the binomial coefficient expression due to the identical distributions. Alternatively, this convolution result can be obtained using moment-generating functions: the MGF of a geometric random variable is p / (1 - (1-p) e^t) for t < -\ln(1-p), and raising it to the r-th power yields the MGF of the negative binomial, confirming the distribution.[30][28]
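The additive relationship can be illustrated by simulation; the sketch below (arbitrary p, r, seed, and sample size) sums r geometric failure counts and compares the empirical mean, variance, and a point probability against the negative binomial formulas and SciPy's nbinom, which counts failures before the r-th success.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p, r, n = 0.4, 5, 100_000

# Failures before first success for each of r stages, summed (numpy counts trials)
samples = (rng.geometric(p, size=(n, r)) - 1).sum(axis=1)

print(samples.mean(), r * (1 - p) / p)                    # both ~ 7.5
print(samples.var(), r * (1 - p) / p**2)                  # both ~ 18.75
print((samples == 3).mean(), stats.nbinom.pmf(3, r, p))   # empirical vs exact PMF
```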
Exponential Distribution
The exponential distribution arises as the continuous counterpart to the geometric distribution, modeling the waiting time until the first event in a continuous-time setting, much as the geometric distribution counts the number of discrete trials until the first success. In a Poisson process with rate parameter \lambda > 0, the interarrival time between events follows an exponential distribution with probability density function f(x) = \lambda e^{-\lambda x} for x \geq 0, paralleling the role of the geometric distribution in discrete-time Bernoulli processes.[31][32] This analogy is precise through parameter correspondence: for a geometric distribution with success probability p, the associated exponential rate is \lambda = -\log(1 - p), ensuring the survival functions align exactly, P(X > k) = (1 - p)^k = e^{-\lambda k} for integer k. When p is small, \lambda \approx p, reflecting the low-probability regime where discrete and continuous models converge.[31] Both distributions exhibit the memoryless property, where the conditional probability of waiting beyond an additional unit is independent of elapsed time or trials. In the continuous limit, the geometric probability mass function approximates the exponential density: consider time discretized into intervals of length \Delta t \to 0, with success probability p = \lambda \Delta t; then, for a waiting time scaled by \Delta t, the geometric PMF P(K = k) = (1 - p)^{k-1} p yields P(T \in [t, t + \Delta t)) \approx \lambda e^{-\lambda t} \Delta t, matching the exponential PDF.[31][17] This limit interprets the geometric as a discretized exponential, with the waiting time for the first Poisson event in continuous time corresponding to the trial count in discrete time. A time-rescaled geometric random variable, where the number of trials K is mapped to time T = K \Delta t as \Delta t \to 0 and p = \lambda \Delta t, converges in distribution to an exponential random variable with rate \lambda. This transformation highlights the geometric as the discrete embedding of the exponential, unifying their interpretations in renewal theory.[31][32]
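The two correspondences can be illustrated numerically; in the sketch below (arbitrary p, \lambda, \Delta t, and evaluation points), the first block checks the exact survival-function match under \lambda = -\log(1-p), and the second shows the discretization p = \lambda \Delta t approximating the exponential tail.

```python
import numpy as np
from scipy import stats

# Exact matching of survival functions: lambda = -log(1 - p)
p = 0.2
lam = -np.log(1 - p)
k = np.arange(0, 10)
assert np.allclose((1 - p) ** k, np.exp(-lam * k))

# Discretization limit: p = lam * dt with small dt approximates Exponential(lam)
lam, dt = 1.5, 1e-4
p_small = lam * dt
t = np.array([0.5, 1.0, 2.0])
geom_sf = (1.0 - p_small) ** (t / dt)        # ~ P(K * dt > t) for K ~ Geometric
expon_sf = stats.expon(scale=1 / lam).sf(t)
print(geom_sf, expon_sf)                     # nearly identical
```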
Statistical Inference
Method of Moments
The method of moments provides a straightforward approach to estimate the success probability p in the geometric distribution by equating the first population moment to the corresponding sample moment. There are two common parameterizations of the geometric distribution: one counting the number of failures X before the first success, with probability mass function P(X = k) = p (1-p)^k for k = 0, 1, 2, \dots and population mean E[X] = (1-p)/p; the other counting the number of trials Y until the first success, with P(Y = k) = p (1-p)^{k-1} for k = 1, 2, \dots and E[Y] = 1/p. For a random sample of size n, let \bar{X} and \bar{Y} denote the sample means from the failures and trials parameterizations, respectively. The method of moments estimator for p in the failures case is \hat{p} = 1 / (\bar{X} + 1), obtained by solving \bar{X} = (1 - \hat{p})/\hat{p}. In the trials case, the estimator simplifies to \hat{p} = 1 / \bar{Y}, by solving \bar{Y} = 1 / \hat{p}. These estimators arise directly from matching the unbiased sample mean to the population mean in each parameterization.[20] The trials version estimator is unbiased in the sense that the underlying sample mean \bar{Y} is unbiased for E[Y], though the nonlinear transformation to \hat{p} introduces finite-sample bias; regardless, both estimators are asymptotically consistent as n \to \infty, converging in probability to the true p. The asymptotic variance of \hat{p} is approximately p^2 (1-p)/n, derived from the Fisher information and applicable to large samples via the delta method; this can be estimated by substituting \hat{p} for p.[20] While higher-order moments could be matched for added robustness in cases of model misspecification, the first moment is typically sufficient for the single-parameter geometric distribution, yielding a simple and efficient estimator.
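A minimal sketch of the estimator on simulated data (true p, sample size, and seed chosen arbitrarily) shows that the failures-based and trials-based versions give the identical numerical estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
p_true, n = 0.25, 5_000

y = rng.geometric(p_true, size=n)   # trials parameterization (support 1, 2, ...)
x = y - 1                           # failures parameterization

p_hat_trials = 1 / y.mean()         # solves y_bar = 1 / p_hat
p_hat_failures = 1 / (x.mean() + 1) # solves x_bar = (1 - p_hat) / p_hat; same value

print(p_hat_trials, p_hat_failures, p_true)
```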
Maximum Likelihood Estimation
Consider a sample of n independent observations X_1, \dots, X_n from the geometric distribution, where X_i represents the number of failures before the first success, with probability mass function P(X_i = x_i) = p (1-p)^{x_i} for x_i = 0, 1, 2, \dots and parameter 0 < p < 1.[33] The likelihood function is then L(p) = \prod_{i=1}^n p (1-p)^{x_i} = p^n (1-p)^{\sum_{i=1}^n x_i}. [20] To find the maximum likelihood estimator (MLE), take the natural logarithm: \log L(p) = n \log p + \left( \sum_{i=1}^n x_i \right) \log(1-p). Differentiating with respect to p and setting the result to zero yields \frac{\partial \log L(p)}{\partial p} = \frac{n}{p} - \frac{\sum_{i=1}^n x_i}{1-p} = 0, which simplifies to \hat{p}_{\text{MLE}} = \frac{n}{n + \sum_{i=1}^n x_i} = \frac{1}{\bar{x} + 1}, where \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i.[20] The second derivative \frac{\partial^2 \log L(p)}{\partial p^2} = -\frac{n}{p^2} - \frac{\sum_{i=1}^n x_i}{(1-p)^2} < 0 confirms this is a maximum.[33] Notably, this MLE coincides with the method of moments estimator for the geometric distribution.[7] Under standard regularity conditions, the MLE \hat{p}_{\text{MLE}} is asymptotically normal as n \to \infty: \sqrt{n} (\hat{p}_{\text{MLE}} - p) \xrightarrow{d} N\left(0, p^2 (1-p)\right), so that \hat{p}_{\text{MLE}} has approximate variance \frac{p^2 (1-p)}{n} in large samples, derived from the inverse of the Fisher information I(p) = \frac{1}{p^2 (1-p)}.[20] The Fisher information for a single observation is I(p) = -\mathbb{E}\left[ \frac{\partial^2 \log f(X;p)}{\partial p^2} \right] = \frac{1}{p^2 (1-p)}, where f(x;p) is the probability mass function.[34] For the full sample, it scales to n I(p).[20] The form of the MLE is invariant under the alternative parameterization of the geometric distribution as the number of trials until the first success, Y_i = X_i + 1, with P(Y_i = y_i) = p (1-p)^{y_i - 1} for y_i = 1, 2, \dots. In this case, the MLE is \hat{p}_{\text{MLE}} = \frac{n}{\sum_{i=1}^n y_i}, which equals \frac{1}{\bar{y}} and matches the failures-based estimator since \sum y_i = \sum x_i + n.[33] The asymptotic properties remain the same under this reparameterization.[20]
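The closed-form MLE and a Wald interval built from the asymptotic variance can be computed in a few lines; the sketch below uses an arbitrary true p, sample size, and seed, and the 95% interval is only an asymptotic approximation.

```python
import numpy as np

rng = np.random.default_rng(3)
p_true, n = 0.3, 2_000

x = rng.geometric(p_true, size=n) - 1        # failures before first success
p_mle = n / (n + x.sum())                    # = 1 / (x_bar + 1)

# Approximate 95% Wald interval from the asymptotic variance p^2 (1-p) / n
se = np.sqrt(p_mle**2 * (1 - p_mle) / n)
print(p_mle, (p_mle - 1.96 * se, p_mle + 1.96 * se))
```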
Bayesian Inference
In Bayesian inference for the geometric distribution, the success probability p is assigned a Beta prior distribution with shape parameters \alpha > 0 and \beta > 0, which is the conjugate prior due to its compatibility with the likelihood form. For n independent observations y_1, \dots, y_n from Geometric(p), where P(Y = y) = p (1-p)^{y-1} for y = 1, 2, \dots, the likelihood is proportional to p^n (1-p)^{\sum (y_i - 1)}. The resulting posterior distribution is Beta(\alpha + n, \beta + \sum (y_i - 1)), providing a closed-form update that incorporates both prior beliefs and observed data. The posterior mean serves as a Bayesian point estimate for p: \hat{p} = \frac{\alpha + n}{\alpha + \beta + \sum y_i}, which shrinks the maximum likelihood estimate toward the prior mean \alpha / (\alpha + \beta) and converges to it as n increases. Credible intervals for p are obtained from the quantiles of this Beta posterior, offering probabilistic bounds that account for parameter uncertainty. The maximum a posteriori (MAP) estimate, corresponding to the posterior mode, is \hat{p}_{\text{MAP}} = \frac{\alpha + n - 1}{\alpha + \beta + \sum y_i - 2} when \alpha + n > 1 and \beta + \sum (y_i - 1) > 1, identifying the most probable value of p under the posterior. For non-informative priors, the Jeffreys prior \pi(p) \propto 1 / [p \sqrt{1-p}], derived from the square root of the Fisher information I(p) = 1 / [p^2 (1-p)], is improper but yields a proper posterior for observed data.[35] This prior leads to a posterior that approximates the maximum likelihood estimate in large samples, while incorporating parameter invariance properties. Bayesian approaches with conjugate priors like Beta handle small samples more effectively than maximum likelihood estimation by leveraging prior information to stabilize estimates and quantify uncertainty.[36]
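A minimal sketch of the conjugate update follows (the uniform Beta(1, 1) prior, the true p, the sample size, and the seed are all arbitrary choices for illustration); it reports the posterior mean, the MAP estimate, and an equal-tailed 95% credible interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
p_true, n = 0.2, 50
y = rng.geometric(p_true, size=n)             # trials parameterization

alpha, beta = 1.0, 1.0                        # assumed uniform Beta(1, 1) prior
a_post = alpha + n
b_post = beta + (y - 1).sum()                 # Beta(alpha + n, beta + sum(y_i - 1))

posterior = stats.beta(a_post, b_post)
post_mean = a_post / (a_post + b_post)        # = (alpha + n) / (alpha + beta + sum(y))
post_map = (a_post - 1) / (a_post + b_post - 2)
ci_95 = posterior.ppf([0.025, 0.975])         # equal-tailed 95% credible interval

print(post_mean, post_map, ci_95)
```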
Computation
Random Variate Generation
Generating random variates from the geometric distribution can be accomplished using several methods, with the choice depending on computational efficiency and the specific parameterization (number of trials until first success or number of failures before first success). The direct simulation method involves repeatedly generating independent Bernoulli random variables with success probability p until the first success occurs. For the number of trials Y (support \{1, 2, \dots\}), this counts the trials including the success; for the number of failures X (support \{0, 1, 2, \dots\}), it counts only the preceding failures. This approach is straightforward but inefficient for small p, as it requires an expected 1/p uniform random variates per sample, leading to high computational cost when the expected value is large.[37] A more efficient alternative is the inverse transform sampling method, which leverages the closed-form inverse of the cumulative distribution function (CDF). Generate U \sim \text{Uniform}(0,1). For the number of trials Y, compute Y = \left\lceil \frac{\log U}{\log (1-p)} \right\rceil; for the number of failures X, X = \left\lfloor \frac{\log U}{\log (1-p)} \right\rfloor. This relies on the CDF inversion F^{-1}(u) = \frac{\log(1-u)}{\log(1-p)}, adjusted for discreteness with ceiling or floor functions; since 1-U \sim \text{Uniform}(0,1), the form using U is equivalent. The method requires only one uniform variate and logarithmic operations per sample, making it suitable for all p > 0.[3][38] The inverse transform method is preferred in computational practice due to its efficiency, particularly for small p, and is implemented in many simulation libraries for generating geometric variates.[39]
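A minimal implementation of the inverse transform is shown below (arbitrary p, sample size, and seed); the floor-plus-one form is the same inversion as the ceiling formula above, written with log1p so that small p and the edge case U = 0 are handled safely, and it is compared against NumPy's built-in generator.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 0.15, 200_000

# Inverse transform: one uniform per variate (trials parameterization).
# floor(log(1-u)/log(1-p)) + 1 equals the ceiling form almost surely; log1p
# keeps the computation accurate for small p and avoids log(0).
u = rng.random(n)
y_inverse = np.floor(np.log1p(-u) / np.log1p(-p)).astype(np.int64) + 1

# Built-in generator for comparison (also the trials parameterization)
y_builtin = rng.geometric(p, size=n)

print(y_inverse.mean(), y_builtin.mean(), 1 / p)          # all ~ 6.67
print(y_inverse.var(), y_builtin.var(), (1 - p) / p**2)   # all ~ 37.8
```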
The probability mass function (PMF) and cumulative distribution function (CDF) of the geometric distribution, modeling the number of failures before the first success with success probability p (0 < p \leq 1), are given by P(X = k) = (1 - p)^k p, \quad k = 0, 1, 2, \dots and F(k) = P(X \leq k) = 1 - (1 - p)^{k+1}, \quad k = 0, 1, 2, \dots, respectively.[8] For large k, direct evaluation of (1 - p)^k risks numerical underflow in floating-point systems, as the value approaches zero exponentially. To mitigate this, computations are performed in log-space: \log P(X = k) = k \log(1 - p) + \log p, with the result exponentiated only if the unlogged probability is required; otherwise, log-probabilities are retained for stability in subsequent operations like summation or ratio calculations.[40] The CDF can similarly be evaluated as \log F(k) = \log\left(1 - \exp((k+1) \log(1 - p))\right), using stable primitives such as log1p and expm1 for accuracy when (1 - p)^{k+1} is close to 1.[41] The quantile function, which inverts the CDF to find the smallest k such that F(k) \geq u for u \in (0, 1], is derived from solving 1 - (1 - p)^{k+1} = u, yielding the real-valued form q(u) = \frac{\log(1 - u)}{\log(1 - p)} - 1. For the discrete case, this is rounded to the appropriate integer via ceiling to ensure the CDF condition holds: q(u) = \left\lceil \frac{\log(1 - u)}{\log(1 - p)} - 1 \right\rceil. This logarithmic formulation avoids overflow and underflow issues inherent in iterative summation methods, providing efficient evaluation even for extreme quantiles.[41] Tail probabilities, such as P(X > k) = (1 - p)^{k+1}, follow directly from the CDF as the survival function and are computed analogously in log-space as (k+1) \log(1 - p) to prevent underflow for large k. Recursive evaluation, where P(X > k) = (1 - p) \cdot P(X > k-1), can be carried out in log-space by adding \log(1 - p) at each step, avoiding repeated exponentiation. For very large means \mu = (1 - p)/p, where the distribution becomes heavy-tailed, a normal approximation may be applied to estimate central tail probabilities using mean \mu and variance \mu(1 + \mu), though this is less accurate for extreme tails due to skewness.[41] In software implementations, such as Python's SciPy library, the geometric distribution is handled via scipy.stats.geom, which computes PMF, CDF, and quantile functions with built-in safeguards against overflow; for instance, when p < 10^{-17}, results may clip to the integer maximum due to dtype limits, and log-scale methods are implicitly used for stability. For scenarios requiring arbitrary precision to fully avoid overflow, libraries like SymPy or mpmath enable exact or high-precision evaluations of the formulas above.[42]
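The sketch below illustrates the log-space evaluations discussed above (parameter values are arbitrary, and it assumes SciPy's geom exposes logpmf, logsf, and ppf, as recent versions do): the plain probability underflows for a deep tail point, while the logarithmic forms remain finite, and the quantile agrees with the closed-form inversion.

```python
import numpy as np
from scipy import stats

p = 1e-4
k = 20_000_000           # deep in the tail: (1-p)^k is far below the double minimum

dist = stats.geom(p)     # trials parameterization; loc=-1 would give the failures form

print(dist.pmf(k))       # 0.0 -- the plain probability underflows
print(dist.logpmf(k))    # about -2009: (k-1)*log(1-p) + log(p), kept in log-space
print(dist.logsf(k))     # about -2000: log P(Y > k) = k*log(1-p)

# Quantile (inverse CDF) far in the tail, compared with the closed-form inversion
u = 1 - 1e-12
print(dist.ppf(u), np.ceil(np.log1p(-u) / np.log1p(-p)))
```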