Negative binomial distribution
The negative binomial distribution is a discrete probability distribution that models the number of independent Bernoulli trials required to achieve a fixed number r of successes, or equivalently, the number of failures preceding r successes, where each trial has a constant success probability p (0 < p ≤ 1) and r is a positive integer parameter.[1][2][3] In one common parameterization, a negative binomial random variable X represents the number of failures before the r-th success, taking non-negative integer values (X = 0, 1, 2, ...).[1] The probability mass function is given by
P(X = k) = \binom{k + r - 1}{k} p^r (1 - p)^k
for k = 0, 1, 2, ..., where \binom{k + r - 1}{k} is the binomial coefficient.[1][3] An alternative parameterization defines Y = X + r as the total number of trials until the r-th success (Y = r, r+1, ...), with probability mass function
P(Y = n) = \binom{n - 1}{r - 1} p^r (1 - p)^{n - r}.[2][3] Both forms share the same variance but differ in their means: E[X] = r(1 - p)/p and E[Y] = r/p.[1][2] The variance for either is \operatorname{Var}(X) = \operatorname{Var}(Y) = r(1 - p)/p^2, which exceeds the mean, indicating overdispersion compared to the Poisson distribution.[1][2][3]
Key properties include the moment-generating function M(t) = \left[ p / (1 - (1 - p)e^t) \right]^r for the failures parameterization (valid for t < -\ln(1 - p)), from which the mean and variance can be derived via differentiation.[4] The distribution is unimodal, with the mode at \lfloor (r - 1)(1 - p)/p \rfloor for X.[3] A negative binomial random variable with parameters r and p is the sum of r independent geometric random variables, each representing the number of failures before a single success (with the same p).[5][3] When r = 1, it reduces exactly to the geometric distribution.[2][1] As r → ∞ and p → 1 with the mean held constant, the negative binomial converges to a Poisson distribution; for finite r it retains extra variability relative to the Poisson, which makes it useful for modeling processes with overdispersion.[6] Sums of independent negative binomials with the same p but different r values also follow a negative binomial distribution.[3]
The negative binomial distribution finds broad applications in statistics for analyzing count data that display greater variability than expected under a Poisson model, such as species abundances in ecology, accident frequencies in insurance, or word counts in linguistics.[7][8] In regression contexts, negative binomial models extend generalized linear models to handle overdispersed outcomes, common in fields like epidemiology and quality control.[9] It also appears in classic probability problems, such as the Banach matchbox problem or series of games like the "problem of points."[3] For large r, normal approximations facilitate further analysis and inference.[3]
Definitions
Probability Mass Function
The negative binomial distribution describes the probability of observing exactly k failures before the r-th success in a sequence of independent Bernoulli trials, each with success probability p.[2][10] The probability mass function is given by P(X = k) = \binom{k + r - 1}{k} p^r (1-p)^k for k = 0, 1, 2, \dots, where r > 0 is the number of successes (typically a positive integer) and 0 < p \leq 1 is the success probability.[2][10] The support of the distribution consists of the non-negative integers k.[2][10] This formula arises from the product of the binomial coefficient, which counts the number of ways to arrange k failures and r-1 successes in the first k + r - 1 trials, and the Bernoulli probabilities p^r (1-p)^k for the r successes and k failures overall, with the final trial being a success.[2][10] The probabilities sum to 1 over all k \geq 0 because, by the negative binomial series, \sum_{k \geq 0} \binom{k + r - 1}{k} (1-p)^k = p^{-r}, so the total probability is p^r \cdot p^{-r} = 1.[10] When r = 1, this reduces to the geometric distribution.[2]
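The probability mass function can be checked directly. The following is a minimal sketch in Python, assuming SciPy is available; scipy.stats.nbinom uses the same failures-before-the-r-th-success convention as the formula above.

```python
from math import comb
from scipy.stats import nbinom

def nbinom_pmf(k, r, p):
    """P(X = k): probability of exactly k failures before the r-th success."""
    return comb(k + r - 1, k) * p**r * (1 - p)**k

r, p = 3, 0.4
print(nbinom_pmf(5, r, p))            # explicit formula
print(nbinom.pmf(5, r, p))            # SciPy agrees (failures convention)
print(sum(nbinom_pmf(k, r, p) for k in range(500)))  # ~1.0: probabilities sum to one
```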
Cumulative Distribution Function
The cumulative distribution function of the negative binomial random variable X, which counts the number of failures before r successes in independent Bernoulli trials each with success probability p, is defined as
F(x; r, p) = P(X \leq x) = \sum_{k=0}^{\lfloor x \rfloor} \binom{k + r - 1}{k} p^r (1-p)^k
for x \geq 0, r > 0, and 0 < p < 1.[11] This finite sum admits a closed-form expression in terms of the regularized incomplete beta function:
F(x; r, p) = I_p(r, \lfloor x \rfloor + 1),
where
I_z(a, b) = \frac{1}{B(a, b)} \int_0^z t^{a-1} (1-t)^{b-1} \, dt
is the regularized incomplete beta function and B(a, b) is the beta function.[11][12] Equivalently, the survival function or tail probability is
P(X > x) = 1 - F(x; r, p) = I_{1-p}(\lfloor x \rfloor + 1, r). [11][12] Direct evaluation of the summation defining F(x; r, p) poses numerical challenges for large \lfloor x \rfloor or r, as it involves many terms with potentially large binomial coefficients that can cause overflow or require high-precision arithmetic. The beta function representation mitigates these issues by leveraging efficient algorithms for incomplete beta integrals, such as continued fractions or series expansions, which are implemented in numerical libraries for stable computation even in asymptotic regimes where x is large and F(x; r, p) approaches 1. For very large x, the tail P(X > x) becomes small, and its asymptotic decay follows the geometric tail behavior modulated by the binomial coefficients, often approximated via saddlepoint methods. For large r, normal approximations from the central limit theorem apply to the central region of the distribution.[12]
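As a small illustration of the beta-function route, the following sketch (assuming SciPy) compares the direct summation, the incomplete beta representation, and SciPy's built-in CDF, which should all agree.

```python
import numpy as np
from scipy.special import betainc
from scipy.stats import nbinom

r, p, x = 4, 0.3, 25

# Direct (finite) summation of the PMF up to floor(x).
direct = sum(nbinom.pmf(k, r, p) for k in range(int(np.floor(x)) + 1))

# Regularized incomplete beta representation: F(x; r, p) = I_p(r, floor(x) + 1).
via_beta = betainc(r, np.floor(x) + 1, p)

print(direct, via_beta, nbinom.cdf(x, r, p))  # three matching values
```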
Parameterizations and Formulations
Standard r-p Parameterization
The standard r-p parameterization of the negative binomial distribution employs two key parameters: r > 0, which represents the number of successes, and p, the probability of success on each independent Bernoulli trial, satisfying 0 < p \leq 1.[7] In its classical form, r is a positive integer denoting the fixed number of successes required to terminate the sequence of trials.[13] This setup models the distribution of the number of failures preceding the r-th success in a series of independent trials.[13] The parameterization extends naturally to real-valued r > 0, facilitating applications in generalized linear models for overdispersed count data, where r acts as a dispersion parameter controlling the variance-to-mean relationship.[7] Under this formulation, the mean number of failures is given by \mu = \frac{r(1 - p)}{p}, and the variance by \sigma^2 = \frac{r(1 - p)}{p^2}.[13] These expressions highlight the overdispersion property, as the variance exceeds the mean when p < 1.[7] This r-p form serves as the conventional baseline in theoretical probability and is widely implemented in statistical software libraries. For instance, Python's SciPy module uses scipy.stats.nbinom(n=r, p=p), supporting real-valued n (equivalent to r).[14] Similarly, R's base package employs dnbinom(x, size=r, prob=p), where size can be non-integer to accommodate the generalized case.[15]
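A quick check of this parameterization against the moment formulas, sketched with SciPy (the n argument plays the role of r and may be non-integer):

```python
from scipy.stats import nbinom

r, p = 2.5, 0.4                    # real-valued r is accepted
mean, var = nbinom.stats(r, p, moments="mv")

print(mean, r * (1 - p) / p)       # both 3.75
print(var, r * (1 - p) / p**2)     # both 9.375
```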
Alternative Parameterizations
The negative binomial distribution can be parameterized using the mean \mu and the success probability p, where the shape parameter is expressed as r = \frac{\mu p}{1 - p}.[16] This form maintains compatibility with the standard probability mass function while facilitating direct specification of the expected value in applications such as count regression models. Another common alternative is the overdispersion parameterization, which uses the mean \mu and variance \sigma^2 > \mu, with p = \frac{\mu}{\sigma^2} and r = \frac{\mu^2}{\sigma^2 - \mu}.[17] This approach is particularly useful in modeling count data where variance exceeds the mean, such as in ecological or econometric analyses, by explicitly accounting for the dispersion parameter \phi = \frac{1}{r} = \frac{\sigma^2 - \mu}{\mu^2} in the variance formula \sigma^2 = \mu + \phi \mu^2. In the limit as r \to 0, the negative binomial distribution relates to the logarithmic series distribution, which arises as a limiting case for modeling species abundance or rare events, with the probability mass function simplifying to P(X = k) = -\frac{\theta^k}{k \ln(1 - \theta)} for 0 < \theta < 1.[18] This logarithmic parameterization highlights the distribution's connection to infinite series expansions and is often applied in biodiversity studies. Transformations between parameter sets, such as from (r, p) to (\mu, \sigma^2), involve solving the relations \mu = \frac{r(1-p)}{p} and \sigma^2 = \frac{r(1-p)}{p^2}, with the inverse given above.
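These transformations are mechanical; a minimal sketch of the two conversions (the function names are illustrative):

```python
def rp_to_mean_var(r, p):
    """(r, p) -> (mu, sigma^2) in the failures parameterization."""
    mu = r * (1 - p) / p
    return mu, mu / p          # sigma^2 = r(1-p)/p^2 = mu/p

def mean_var_to_rp(mu, var):
    """(mu, sigma^2) -> (r, p); requires var > mu (overdispersion)."""
    if var <= mu:
        raise ValueError("negative binomial requires variance > mean")
    return mu**2 / (var - mu), mu / var

r, p = 5, 0.25
mu, var = rp_to_mean_var(r, p)      # (15.0, 60.0)
print(mean_var_to_rp(mu, var))      # recovers (5.0, 0.25)
```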
Interpretations
Number of Failures Before Fixed Successes
The negative binomial distribution arises in the context of a sequence of independent Bernoulli trials, each with a fixed success probability p (where 0 < p < 1), where the random variable X represents the number of failures observed before achieving exactly r successes, with r being a positive integer parameter. This setup models scenarios involving repeated independent experiments until a predetermined number of positive outcomes occur, such as testing items for defects until a specified number of non-defective units are found.[19][20] When r = 1, this interpretation simplifies to the geometric distribution, which counts the number of failures preceding the first success in Bernoulli trials. For the general case of r > 1, X is the sum of r independent and identically distributed geometric random variables, each capturing the failures occurring between consecutive successes: the first geometric variable for failures before the initial success, the second for failures between the first and second success, and so on, up to the r-th success. This additive structure intuitively builds the distribution by accumulating waiting times for each successive success milestone.[21][22]
The probability mass function emerges from a sequential probability argument in this trial process: for X = k (where k = 0, 1, 2, \dots), exactly r-1 successes must occur in the first k + r - 1 trials, interspersed with k failures, followed by a success on the (k + r)-th trial. This can be conceptualized via a probability tree, where branches denote success or failure at each step, and the total probability sums the paths leading to precisely k failures before the r-th success, accounting for the combinatorial arrangements of those outcomes.[23][20]
This failures-before-successes interpretation has historical ties to early probability developments in the 17th and 18th centuries, stemming from analyses of gambling problems involving repeated plays until a fixed number of wins, as explored in foundational works on chance. It also saw early applications in reliability testing, such as modeling production defects until a quota of reliable components is met.[24][25]
Number of Trials Until Fixed Successes
In the negative binomial distribution, an alternative interpretation arises by considering the total number of trials required to achieve a fixed number r of successes, building on the perspective of counting failures before those successes.[2] Let X denote the number of failures before the r-th success in a sequence of independent Bernoulli trials with success probability p; then define Y = X + r as the total number of trials until the r-th success occurs. The random variable Y follows a negative binomial distribution, with support y = r, r+1, r+2, \dots, and probability mass function P(Y = y) = \binom{y-1}{r-1} p^r (1-p)^{y-r}.[2][26] This formulation is commonly applied in quality control to model the total number of items inspected until r defects are identified, aiding in process monitoring and acceptance sampling decisions.[27] In clinical trials, it describes the total number of patients enrolled until r positive treatment responses are observed, supporting sequential design and early stopping rules.[28] The mean and variance of Y reflect the shift from the failures-only count: \mathbb{E}[Y] = r / p and \mathrm{Var}(Y) = r(1-p)/p^2.[26] Historically, this version of the negative binomial distribution (emphasizing trials until r successes) is synonymous with the Pascal distribution, named after Blaise Pascal for its roots in early probability problems involving repeated trials.[26]
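Because Y is simply X shifted by r, P(Y = n) can be computed from any failures-parameterization implementation; a short sketch, assuming SciPy:

```python
from math import comb
from scipy.stats import nbinom

r, p = 3, 0.4

def pmf_trials(n, r, p):
    """P(Y = n): the r-th success occurs on trial n (n = r, r+1, ...)."""
    return comb(n - 1, r - 1) * p**r * (1 - p)**(n - r)

# Y = X + r, so P(Y = n) = P(X = n - r) under the failures convention.
for n in range(r, r + 5):
    print(pmf_trials(n, r, p), nbinom.pmf(n - r, r, p))  # each pair agrees
```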
Properties
Moments
The negative binomial distribution in its standard parameterization models the number of failures X before observing r successes in a sequence of independent Bernoulli trials, each with success probability p (where 0 < p < 1 and r is a positive integer). The moments of this distribution can be derived by recognizing that X is the sum of r independent geometric random variables, each representing the number of failures before a single success, with mean \frac{1-p}{p} and variance \frac{1-p}{p^2}.[3] The expected value is obtained by linearity of expectation: \mathbb{E}[X] = r \cdot \frac{1-p}{p} = \frac{r(1-p)}{p}. This follows directly from the additivity of the means of the component geometric distributions.[3] Similarly, the variance is the sum of the individual variances: \mathrm{Var}(X) = r \cdot \frac{1-p}{p^2} = \frac{r(1-p)}{p^2}. Since \mathrm{Var}(X) = \mathbb{E}[X(X-1)] + \mathbb{E}[X] - (\mathbb{E}[X])^2, the second factorial moment \mathbb{E}[X(X-1)] = r(r+1) \frac{(1-p)^2}{p^2} can also be used to confirm this result via the probability generating function or direct summation.[3] For p < 1, the variance exceeds the mean (\mathrm{Var}(X) > \mathbb{E}[X]), a property known as overdispersion, which makes the distribution suitable for modeling count data with greater variability than a Poisson distribution.[3]
Higher-order moments describe the shape of the distribution. The skewness, measuring asymmetry, is positive and decreases with r: \gamma_1 = \frac{2 - p}{\sqrt{r(1 - p)}}. This expression arises because the skewness of the sum of r i.i.d. geometric random variables (each with skewness \frac{2 - p}{\sqrt{1 - p}}) scales by a factor of 1/\sqrt{r}.[11] The kurtosis, indicating tail heaviness, is \gamma_2 = 3 + \frac{6}{r} + \frac{p^2}{r(1 - p)}, which exceeds 3 for finite r and p < 1, reflecting leptokurtic tails relative to the normal distribution; as r \to \infty, it approaches 3. This can be derived from the fourth central moment using the representation as a sum of geometrics or via the probability generating function \left( \frac{p}{1 - (1 - p)t} \right)^r.[11]
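These formulas can be confirmed numerically; a sketch assuming SciPy, which reports excess kurtosis (the \gamma_2 above minus 3):

```python
import numpy as np
from scipy.stats import nbinom

r, p = 4, 0.3
mean, var, skew, ex_kurt = nbinom.stats(r, p, moments="mvsk")

print(mean, r * (1 - p) / p)                    # r(1-p)/p
print(var, r * (1 - p) / p**2)                  # r(1-p)/p^2
print(skew, (2 - p) / np.sqrt(r * (1 - p)))     # (2-p)/sqrt(r(1-p))
print(ex_kurt, 6 / r + p**2 / (r * (1 - p)))    # gamma_2 - 3
```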
Generating Functions
The generating functions of the negative binomial distribution provide powerful tools for deriving its moments and analyzing its properties, particularly in the context of sums of independent random variables. For a negative binomial random variable X with parameters r > 0 (number of successes) and 0 < p < 1 (success probability on each Bernoulli trial), where X counts the number of failures before the r-th success, these functions are derived from the probability mass function P(X = k) = \binom{k + r - 1}{k} p^r (1 - p)^k for k = 0, 1, 2, \dots.
The probability generating function (PGF) is defined as G(s) = \mathbb{E}[s^X] = \sum_{k=0}^\infty P(X = k) s^k. For the negative binomial distribution, it takes the form G(s) = \left( \frac{p}{1 - (1 - p)s} \right)^r, \quad |s| < \frac{1}{1 - p}. This closed-form expression arises from recognizing the PGF as the r-th power of the geometric distribution's PGF (for r = 1) and summing the resulting negative binomial series.[29] The moment-generating function (MGF) is M(t) = \mathbb{E}[e^{tX}], which for the negative binomial is M(t) = \left( \frac{p}{1 - (1 - p) e^t} \right)^r, \quad t < -\ln(1 - p). The domain ensures convergence, as (1 - p) e^t < 1. This MGF is obtained analogously to the PGF by substituting s = e^t into the series expansion.[10] The characteristic function, \phi(t) = \mathbb{E}[e^{itX}], extends the MGF to complex arguments and uniquely determines the distribution. For the negative binomial, it is \phi(t) = \left( \frac{p}{1 - (1 - p) e^{it}} \right)^r. This form follows directly from replacing t with it in the MGF.[30]
Moments of X can be systematically derived from these generating functions via differentiation. For the PGF, the mean is \mathbb{E}[X] = G'(1) = r(1 - p)/p, and higher moments follow from further derivatives evaluated at s = 1. Similarly, for the MGF, \mathbb{E}[X] = M'(0) = r(1 - p)/p and \mathrm{Var}(X) = M''(0) - [M'(0)]^2 = r(1 - p)/p^2, confirming the known expressions without direct summation of the probability mass function.[10][29] These functions also facilitate proofs of distributional properties, such as the negative binomial arising as the sum of r independent geometric random variables (each counting failures before one success). The MGF of the sum is the product of the individual MGFs, yielding \left[ p / (1 - (1 - p) e^t) \right]^r, which matches the negative binomial MGF and thus establishes the result under the assumption of independence.[31]
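The differentiation step can be carried out symbolically; a sketch assuming SymPy is available:

```python
import sympy as sp

s, p, r = sp.symbols("s p r", positive=True)
G = (p / (1 - (1 - p) * s)) ** r        # PGF of the failures count

mean = sp.simplify(sp.diff(G, s).subs(s, 1))                 # G'(1)
second_factorial = sp.simplify(sp.diff(G, s, 2).subs(s, 1))  # G''(1) = E[X(X-1)]
var = sp.simplify(second_factorial + mean - mean**2)

print(mean)   # r*(1 - p)/p
print(var)    # r*(1 - p)/p**2
```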
Recurrence Relations
The probability mass function of the negative binomial distribution admits a useful recurrence relation that allows for iterative calculation of probabilities without directly evaluating large binomial coefficients. For a random variable X \sim \mathrm{NB}(r, p), where r > 0 is the number of successes and 0 < p < 1 is the success probability on each Bernoulli trial (with X denoting the number of failures before the r-th success), the probabilities satisfy (k+1) P(X = k+1) = (r + k)(1 - p) P(X = k) for k = 0, 1, 2, \dots, with initial condition P(X = 0) = p^r. This relation arises from the ratio of successive binomial coefficients in the PMF expression \binom{k + r - 1}{k}.[32] This recurrence is particularly valuable for efficient numerical evaluation of the PMF and CDF when r or k is large, as direct computation of \binom{k + r - 1}{k} can lead to numerical instability from overflow or underflow in factorials or gamma functions. Starting from P(X = 0) and applying the relation sequentially minimizes computational cost and maintains precision, making it suitable for software implementations in statistical computing.[33]
Recurrence relations also extend to the moments of the distribution, often derived via differential equations from the probability generating function G(s) = \left[ \frac{p}{1 - (1-p)s} \right]^r. For instance, the cumulants \kappa_n obey the recurrence \kappa_{n+1} = PQ \frac{d \kappa_n}{d Q}, where P = (1-p)/p and Q = 1/p, with the mean \kappa_1 = rP as the starting point; this allows systematic computation of higher-order cumulants for theoretical analysis or approximation purposes. Similar recurrences apply to factorial moments, obtained by applying the operator s \frac{d}{ds} repeatedly to G(s), yielding relations like \mu_{(n+1)} = \frac{1-p}{p} (r + n) \mu_{(n)} for the falling factorial moments \mu_{(n)} = E[X(X-1)\cdots(X-n+1)].[11]
In special cases, particularly for non-integer r, closed-form expressions for the CDF involve hypergeometric functions. The CDF P(X \leq k) = I_p(r, k+1), where I_x(a, b) is the regularized incomplete beta function, can be equivalently expressed using the Gauss hypergeometric function {}_2F_1 via the integral representation of the beta function, providing exact solutions without summation for certain parameter values.[11]
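A compact sketch of the PMF recurrence in Python (no special functions needed):

```python
def nbinom_pmf_table(k_max, r, p):
    """P(X = 0), ..., P(X = k_max) via (k+1) P(k+1) = (r+k)(1-p) P(k), P(0) = p**r."""
    probs = [p**r]
    for k in range(k_max):
        probs.append(probs[-1] * (r + k) * (1 - p) / (k + 1))
    return probs

probs = nbinom_pmf_table(200, r=3, p=0.4)
print(probs[5])     # P(X = 5)
print(sum(probs))   # ~1.0 once the truncated tail is negligible
```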
Related Distributions
Connection to Geometric Distribution
The negative binomial distribution generalizes the geometric distribution, which models the number of failures before the first success in a sequence of independent Bernoulli trials with success probability p. When the number of successes r = 1, the negative binomial distribution reduces exactly to the geometric distribution, as both describe the waiting time until the initial success.[34][35] In its general form, a negative binomial random variable X \sim \text{NB}(r, p) can be expressed as the sum of r independent and identically distributed geometric random variables G_i \sim \text{Geometric}(p), where each G_i counts the number of failures between the (i-1)-th and the i-th success: X = \sum_{i=1}^r G_i. This representation arises because the process of achieving r successes consists of r independent phases, each waiting for one success after the previous.[34][35][36]
To verify this equivalence, consider the probability generating function (PGF) approach. The PGF of a single geometric random variable G \sim \text{Geometric}(p) (number of failures before first success) is G(s) = \frac{p}{1 - (1-p)s} for |s| < \frac{1}{1-p}. For the sum of r i.i.d. geometrics, the PGF is the product [G(s)]^r = \left(\frac{p}{1 - (1-p)s}\right)^r, which matches the PGF of the negative binomial distribution \text{NB}(r, p).[34][31] Alternatively, the connection can be established via convolution of the probability mass functions (PMFs). The PMF of a geometric G \sim \text{Geometric}(p) is P(G = k) = (1-p)^k p for k = 0, 1, 2, \dots. The PMF of the sum X = \sum_{i=1}^r G_i is obtained by the r-fold convolution, yielding P(X = x) = \sum_{k_1 + \cdots + k_r = x} \prod_{i=1}^r [(1-p)^{k_i} p] = p^r (1-p)^x \sum_{k_1 + \cdots + k_r = x} 1, where the sum counts the number of non-negative integer solutions, which is \binom{x + r - 1}{r - 1}. Thus, P(X = x) = \binom{x + r - 1}{r - 1} p^r (1-p)^x for x = 0, 1, 2, \dots, matching the negative binomial PMF (in the failures parameterization).[36][34]
This sum representation has practical implications for simulation: to generate a negative binomial random variable \text{NB}(r, p), independently simulate r geometric random variables \text{Geometric}(p) and sum their values, leveraging efficient algorithms for the geometric case.[34]
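The simulation recipe looks like the following sketch with NumPy; note that NumPy's geometric sampler counts trials (support starting at 1), so one is subtracted to obtain failures.

```python
import numpy as np

rng = np.random.default_rng(0)
r, p, n_samples = 3, 0.4, 100_000

# Failures before each of the r successes, then summed over the r phases.
geoms = rng.geometric(p, size=(n_samples, r)) - 1
x = geoms.sum(axis=1)

print(x.mean(), r * (1 - p) / p)      # ~4.5
print(x.var(), r * (1 - p) / p**2)    # ~11.25

# Compare with NumPy's built-in sampler (failures parameterization).
y = rng.negative_binomial(r, p, size=n_samples)
print(y.mean(), y.var())
```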
Gamma-Poisson Mixture
The negative binomial distribution arises as a marginal distribution in a hierarchical model where the rate parameter of a Poisson distribution is itself random and follows a gamma distribution. Specifically, let \Lambda follow a gamma distribution with shape parameter \alpha > 0 and scale parameter \beta > 0, so that the density of \Lambda is f(\lambda) = \frac{1}{\beta^\alpha \Gamma(\alpha)} \lambda^{\alpha-1} e^{-\lambda / \beta} for \lambda > 0.[37] Then, conditional on \Lambda = \lambda, let X follow a Poisson distribution with mean \lambda, having probability mass function P(X = x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} for x = 0, 1, 2, \dots.[37] The unconditional distribution of X is obtained by integrating out \lambda: P(X = x) = \int_0^\infty P(X = x \mid \lambda) f(\lambda) \, d\lambda = \int_0^\infty \frac{\lambda^x e^{-\lambda}}{x!} \cdot \frac{1}{\beta^\alpha \Gamma(\alpha)} \lambda^{\alpha-1} e^{-\lambda / \beta} \, d\lambda. This integral simplifies to the probability mass function of a negative binomial distribution with parameters r = \alpha and success probability p = \frac{1}{1 + \beta}: P(X = x) = \frac{\Gamma(\alpha + x)}{x! \Gamma(\alpha)} \left( \frac{1}{1 + \beta} \right)^\alpha \left( \frac{\beta}{1 + \beta} \right)^x, \quad x = 0, 1, 2, \dots. The integral evaluates in closed form because the integrand is proportional to a gamma density in \lambda, a consequence of the gamma distribution's conjugacy with the Poisson likelihood.[38][37]
Under this parameterization, the mean of X is E[X] = \alpha \beta, matching the mean of the gamma-distributed rate \Lambda.[37] The variance is Var(X) = \alpha \beta (1 + \beta), which exceeds the mean by \alpha \beta^2, the variance of \Lambda.[37] This overdispersion (variance exceeding the mean) contrasts with the Poisson distribution (where mean equals variance) and makes the negative binomial suitable for modeling count data with extra variability due to unobserved heterogeneity in rates.[39] In a hierarchical Bayesian framework, the gamma serves as the conjugate prior for the Poisson rate, enabling posterior inference: given observed x, the posterior for \Lambda is gamma with updated parameters \alpha + x and scale \beta / (1 + \beta).[37] This setup naturally incorporates prior uncertainty about the rate into the marginal predictive distribution for counts, facilitating robust modeling in Bayesian analyses of overdispersed data.[38][37]
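A simulation sketch of the mixture (assuming NumPy and SciPy), checking the marginal moments and a point probability against the negative binomial with r = \alpha and p = 1/(1 + \beta):

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(1)
alpha, beta = 2.0, 3.0                 # gamma shape and scale
n_samples = 200_000

lam = rng.gamma(alpha, beta, size=n_samples)   # random Poisson rates
x = rng.poisson(lam)                           # counts given those rates

r, p = alpha, 1.0 / (1.0 + beta)
print(x.mean(), alpha * beta)                  # ~6.0
print(x.var(), alpha * beta * (1 + beta))      # ~24.0
print(np.mean(x == 0), nbinom.pmf(0, r, p))    # empirical vs exact P(X = 0)
```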
Compound Poisson Representation
The negative binomial distribution admits a representation as a compound Poisson distribution, in which the total outcome is the sum of a random number of independent jumps, each following a logarithmic series distribution. This formulation highlights the distribution's infinitely divisible nature and provides a framework for analyzing processes involving clustered events, such as apparent contagion in counting experiments or aggregated risks in stochastic modeling.[40] Specifically, a negative binomial random variable X \sim \mathrm{NB}(r, p) (in the "number of failures" parameterization, with mean r(1-p)/p) can be expressed as X = \sum_{i=1}^N Y_i, where N \sim \mathrm{Poisson}(\lambda) is the number of Poisson events, the Y_i are independent and identically distributed as \mathrm{Logarithmic}(\theta) (a discrete distribution supported on the positive integers with probability mass function P(Y_i = k) = -\theta^k / (k \ln(1-\theta)) for k = 1, 2, \dots), and N is independent of the Y_i. The parameters are related by \lambda = -r \ln(1-\theta) and \theta = 1-p, ensuring compatibility with the standard negative binomial form.[41]
This equivalence can be derived using probability generating functions (PGFs). The PGF of the logarithmic series distribution is G_Y(s) = \frac{\ln(1 - \theta s)}{\ln(1 - \theta)}, \quad |s| < 1/\theta. The PGF of the compound sum X is then G_X(s) = \exp\bigl( \lambda \bigl( G_Y(s) - 1 \bigr) \bigr). Substituting the expressions for \lambda and G_Y(s) gives G_X(s) = \exp\left( r \ln \frac{1 - \theta}{1 - \theta s} \right) = \left( \frac{1 - \theta}{1 - \theta s} \right)^r = \left( \frac{p}{1 - (1-p)s} \right)^r, which matches the PGF of the negative binomial distribution \mathrm{NB}(r, p).[41] The parameter choices also ensure that the first two moments coincide with those of the standard negative binomial: the mean is \mathbb{E}[X] = \lambda \mathbb{E}[Y_1] = r \theta / (1 - \theta) = r (1-p)/p, and the variance is \mathrm{Var}(X) = \lambda \mathbb{E}[Y_1^2] = \lambda \bigl( \mathrm{Var}(Y_1) + \mathbb{E}[Y_1]^2 \bigr) = r \theta / (1 - \theta)^2 = r (1-p)/p^2. This moment matching underscores the representation's consistency for applications in fields like insurance, where it models total claim counts as Poisson-distributed clusters with logarithmic-sized jumps.[41]
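A small Monte Carlo sketch of this construction, assuming NumPy and SciPy (scipy.stats.logser implements the logarithmic series distribution, and passing a Generator as random_state is assumed to be supported by the installed SciPy version):

```python
import numpy as np
from scipy.stats import logser, nbinom

rng = np.random.default_rng(2)
r, p = 3.0, 0.4
theta = 1 - p
lam = -r * np.log(1 - theta)           # Poisson rate for the number of clusters

n_samples = 20_000
n_clusters = rng.poisson(lam, size=n_samples)
jumps = logser.rvs(theta, size=n_clusters.sum(), random_state=rng)

# Sum each sample's logarithmic jumps: scatter the flat jump array by cluster counts.
x = np.zeros(n_samples, dtype=int)
np.add.at(x, np.repeat(np.arange(n_samples), n_clusters), jumps)

print(x.mean(), r * (1 - p) / p)             # ~4.5
print(x.var(), r * (1 - p) / p**2)           # ~11.25
print(np.mean(x == 0), nbinom.pmf(0, r, p))  # P(X = 0) = p**r
```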
Parameter Estimation
Method of Moments
The method of moments estimation for the negative binomial distribution, which models the number of failures before the r-th success in independent Bernoulli trials with success probability p, relies on matching the first two population moments to their sample counterparts. The theoretical mean is given by \mathbb{E}[X] = \frac{r(1-p)}{p}, and the variance by \mathrm{Var}(X) = \frac{r(1-p)}{p^2}. To solve for the parameters, set the sample mean \bar{X} equal to \frac{r(1-p)}{p} and the sample variance s^2 equal to \frac{r(1-p)}{p^2}, yielding a system of two equations. Dividing the variance equation by the mean equation simplifies to \frac{s^2}{\bar{X}} = \frac{1}{p}, so \hat{p} = \frac{\bar{X}}{s^2}; substituting back gives \hat{r} = \frac{\bar{X}^2}{s^2 - \bar{X}}.[42][43] These estimators assume the data are independent and identically distributed according to the negative binomial distribution, with the sample size n sufficient to compute reliable moments. For large n, the estimators are consistent, converging in probability to the true parameters as n approaches infinity, and asymptotically normal. However, they exhibit bias, particularly when r is small, leading to overestimation or underestimation in finite samples; simulations show moment estimators perform reasonably but with higher mean squared error compared to alternatives in small samples.[44][45][42] A key limitation arises if the sample is underdispersed, where s^2 < \bar{X}, resulting in a negative \hat{r}, which is invalid since r must be positive. In such cases, the negative binomial assumption may not hold, and alternative distributions or estimation approaches should be considered to avoid nonsensical results.[46][42]
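A minimal sketch of these estimators in Python (the function name is illustrative):

```python
import numpy as np

def nbinom_method_of_moments(data):
    """Moment estimates (r_hat, p_hat); valid only when the sample is overdispersed."""
    xbar = np.mean(data)
    s2 = np.var(data, ddof=1)
    if s2 <= xbar:
        raise ValueError("sample variance must exceed the mean for a negative binomial fit")
    p_hat = xbar / s2
    r_hat = xbar**2 / (s2 - xbar)
    return r_hat, p_hat

rng = np.random.default_rng(3)
data = rng.negative_binomial(5, 0.3, size=10_000)
print(nbinom_method_of_moments(data))   # close to (5, 0.3)
```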
Maximum Likelihood Estimation
The maximum likelihood estimators (MLEs) for the parameters r > 0 and 0 < p < 1 of the negative binomial distribution are obtained by maximizing the log-likelihood function based on a random sample k_1, \dots, k_n of independent observations. The log-likelihood is given by l(r, p) = \sum_{i=1}^n \ln \binom{k_i + r - 1}{k_i} + n r \ln p + \left( \sum_{i=1}^n k_i \right) \ln (1 - p), where \ln \binom{k_i + r - 1}{k_i} = \ln \Gamma(k_i + r) - \ln \Gamma(r) - \ln \Gamma(k_i + 1) and \Gamma denotes the gamma function.[47] Setting the partial derivative with respect to p to zero yields the conditional MLE \hat{p}(r) = \frac{n r}{n r + \sum_{i=1}^n k_i} = \frac{r}{r + \bar{k}}, where \bar{k} = n^{-1} \sum_{i=1}^n k_i is the sample mean. Substituting this into the partial derivative with respect to r and setting it to zero gives the equation for the MLE \hat{r}: n \ln \hat{p}(\hat{r}) - n \psi(\hat{r}) + \sum_{i=1}^n \psi(\hat{r} + k_i) = 0, where \psi(\cdot) is the digamma function, defined as the derivative of the log-gamma function \psi(z) = \frac{d}{dz} \ln \Gamma(z). This equation has no closed-form solution and must be solved numerically for \hat{r}, after which \hat{p} is obtained by substitution.[47]
Numerical solutions for the joint MLE (\hat{r}, \hat{p}) typically employ iterative optimization methods such as Newton-Raphson, which leverages the second derivatives (involving the trigamma function) for convergence, or Brent's method for root-finding in the equation for \hat{r}. When r is restricted to integers (as in some applications modeling fixed successes), the likelihood can be evaluated directly over a grid of integer values near an initial moment-based estimate. Alternatively, the expectation-maximization (EM) algorithm provides a robust approach by interpreting the negative binomial as a Poisson-gamma mixture: in the E-step, latent gamma-distributed means are inferred given current parameters; in the M-step, the parameters are updated by solving a gamma MLE problem, iterating until convergence.[47][48]
Under standard regularity conditions, the MLEs (\hat{r}, \hat{p}) are consistent (converging in probability to the true values as n \to \infty) and asymptotically normal, with asymptotic distribution \sqrt{n} \begin{pmatrix} \hat{r} - r \\ \hat{p} - p \end{pmatrix} \overset{d}{\to} \mathcal{N}\left( \mathbf{0}, \mathcal{I}(r, p)^{-1} \right), where \mathcal{I}(r, p) is the Fisher information matrix evaluated at the true parameters; the inverse provides the asymptotic covariance for inference, such as confidence intervals via the delta method.[49] For finite samples, the MLE \hat{r} (or equivalently the dispersion parameter \alpha = 1/r) exhibits positive bias, particularly when overdispersion is strong or n is small; this can lead to underestimated variance in applications like count regression. Bias-corrected versions, such as the first-order correction \tilde{\alpha} = \hat{\alpha} / (1 + b/n) where b is derived from expected Hessian terms, reduce this bias to second order and improve small-sample accuracy without sacrificing asymptotic properties.[50]
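A sketch of the profile-likelihood approach in Python, assuming SciPy; the root-bracketing interval is an illustrative assumption and works when the sample is overdispersed.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def nbinom_mle(data):
    """MLE (r_hat, p_hat) by solving the profile score equation in r,
    with p profiled out as p(r) = r / (r + mean(data))."""
    data = np.asarray(data)
    n, kbar = data.size, data.mean()

    def score_r(r):
        p = r / (r + kbar)
        return n * np.log(p) - n * digamma(r) + digamma(r + data).sum()

    r_hat = brentq(score_r, 1e-6, 1e4)   # bracket assumed to contain the root
    return r_hat, r_hat / (r_hat + kbar)

rng = np.random.default_rng(4)
data = rng.negative_binomial(5, 0.3, size=5_000)
print(nbinom_mle(data))   # close to (5, 0.3)
```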
Applications
Modeling Overdispersion in Count Data
Overdispersion in count data occurs when the variance exceeds the mean, violating the equidispersion assumption of the Poisson distribution.[51] The negative binomial distribution addresses this by incorporating an additional dispersion parameter r, where finite r allows the variance to surpass the mean, modeling extra variation from unobserved heterogeneity or clustering.[51] This arises naturally as a gamma-Poisson mixture, where the Poisson rate follows a gamma distribution, leading to unconditional overdispersion.[51] Negative binomial regression extends this to covariate-dependent counts, formulated as a generalized linear model (GLM) with a logarithmic link function to ensure positive predicted means.[9] The dispersion parameter, often denoted as \alpha = 1/r, scales the variance as \mu + \alpha \mu^2, where \mu is the mean, enabling flexible handling of overdispersion without altering the mean structure.[9] To assess suitability against the Poisson model, researchers apply a likelihood ratio test, which evaluates the null hypothesis of no overdispersion (\alpha = 0) by comparing log-likelihoods.[9] Information criteria such as Akaike's Information Criterion (AIC) and Bayesian Information Criterion (BIC) further aid model selection, penalizing complexity while favoring better fit.[52] Implementation is straightforward in common statistical software. In R, the glm.nb function from the MASS package fits negative binomial GLMs, automatically estimating the dispersion parameter.[53] In Python, the statsmodels library offers the NegativeBinomial class within its discrete models module for similar estimation.[54]
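A regression sketch on simulated data, assuming statsmodels is available; the simulated coefficients, the dispersion value, and the data-generating recipe are illustrative, not part of any particular study.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 2_000
x = rng.normal(size=n)
X = sm.add_constant(x)

# Simulate overdispersed counts with a log-link mean and NB2 variance mu + alpha * mu^2.
beta0, beta1, alpha = 0.5, 0.8, 0.7
mu = np.exp(beta0 + beta1 * x)
r = 1.0 / alpha
y = rng.negative_binomial(r, r / (r + mu))

model = sm.NegativeBinomial(y, X)     # NB2 negative binomial regression
result = model.fit(disp=0)
print(result.params)                  # intercept, slope, and the estimated dispersion alpha
```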
Empirical applications include modeling road accident frequencies, where negative binomial regression captures overdispersion from factors like varying driver behavior or environmental conditions not fully observed in the data.[55] Similarly, in ecology, it analyzes species abundance counts across sites, accounting for extra variation due to habitat heterogeneity.[56]