
Multinomial distribution

The multinomial distribution is a fundamental discrete probability distribution in statistics that generalizes the binomial distribution to scenarios involving more than two possible outcomes per trial. It describes the probabilities associated with the counts of occurrences for k mutually exclusive categories in a fixed number n of independent trials, where each trial has predefined probabilities p_1, p_2, \dots, p_k for the respective categories, satisfying \sum_{i=1}^k p_i = 1. The probability mass function (PMF) of the multinomial distribution for a random vector \mathbf{X} = (X_1, \dots, X_k) with \sum_{i=1}^k X_i = n is given by P(\mathbf{X} = \mathbf{x}) = \frac{n!}{x_1! x_2! \cdots x_k!} p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}, where \mathbf{x} = (x_1, \dots, x_k) are non-negative integers summing to n. This formula extends the binomial PMF by incorporating a multinomial coefficient to account for the number of ways to partition n trials into k categories. Key properties include the marginal distribution of each X_i, which follows a binomial distribution with parameters n and p_i, and the conditional distribution of one subset of components given another, which is also multinomial. The expected value of X_i is E[X_i] = n p_i, the variance is \operatorname{Var}(X_i) = n p_i (1 - p_i), and the covariance between distinct components is \operatorname{Cov}(X_i, X_j) = -n p_i p_j for i \neq j. These moments highlight the negative dependence between categories due to the fixed total n. The multinomial distribution arises naturally in multinomial experiments, consisting of independent and identically distributed trials each resulting in one of k outcomes, and is widely applied in fields such as categorical data analysis, natural language processing for topic modeling, population genetics for allele frequencies, and quality control for defect classifications. Special cases include the binomial distribution when k = 2 and the categorical distribution when n = 1.

Definitions

Probability Mass Function

The multinomial distribution generalizes the binomial distribution to scenarios involving k > 2 possible outcomes or categories in a fixed number of independent trials. The probability mass function (PMF) specifies the probability of obtaining exactly x_i occurrences of the i-th outcome for i = 1, \dots, k, and is given by P(X_1 = x_1, \dots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k}, where n is the number of independent trials, the p_i are the probabilities of each outcome with \sum_{i=1}^k p_i = 1, and the x_i are non-negative integers satisfying \sum_{i=1}^k x_i = n. This PMF represents the probability of observing the count vector (x_1, \dots, x_k) across n independent trials, where each trial results in one of k mutually exclusive and exhaustive outcomes with respective probabilities p_i. The support of the distribution consists of all vectors of non-negative integers (x_1, \dots, x_k) such that \sum_{i=1}^k x_i = n. To derive the PMF, consider n independent and identically distributed categorical random variables Y_1, \dots, Y_n, each taking value i with probability p_i for i = 1, \dots, k. Define X_i = \sum_{j=1}^n \mathbf{1}_{\{Y_j = i\}}, the number of trials resulting in outcome i. The probability of any specific sequence with exactly x_i occurrences of outcome i is \prod_{i=1}^k p_i^{x_i}, and the number of such distinct sequences is the multinomial coefficient \frac{n!}{x_1! \cdots x_k!}, yielding the PMF upon multiplication.
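As a concrete check of this formula, the following Python sketch (assuming NumPy and SciPy are installed; the counts and probabilities are merely illustrative) evaluates the PMF directly from the definition and compares the result with scipy.stats.multinomial.

from math import factorial
import numpy as np
from scipy.stats import multinomial

def multinomial_pmf(x, p):
    """PMF from the definition: multinomial coefficient times the product of p_i^{x_i}."""
    coef = factorial(sum(x))
    for xi in x:
        coef //= factorial(xi)           # the running quotient stays an integer
    return coef * np.prod([pi ** xi for pi, xi in zip(p, x)])

x = [2, 1, 3]                            # counts summing to n = 6
p = [0.5, 0.2, 0.3]                      # category probabilities summing to 1
print(multinomial_pmf(x, p))             # direct evaluation of the formula
print(multinomial.pmf(x, n=sum(x), p=p)) # SciPy's implementation, for comparison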

Parameters and Support

The multinomial distribution is parameterized by the number of independent trials n, a positive integer representing the total number of observations or experiments conducted. It is also defined by the number of categories k, a positive integer with k \geq 2, indicating the distinct outcomes possible in each trial. Additionally, the distribution requires a probability vector \mathbf{p} = (p_1, p_2, \dots, p_k), where each p_i > 0 specifies the probability of the i-th category occurring in a single trial, and the probabilities satisfy \sum_{i=1}^k p_i = 1. The support of the multinomial distribution comprises all k-tuples of non-negative integers (x_1, x_2, \dots, x_k) that sum to exactly n, i.e., x_1 + x_2 + \dots + x_k = n with each x_i \geq 0 and x_i integer-valued, where x_i denotes the count of occurrences in the i-th category. Degenerate cases arise when k = 1, reducing the distribution to a trivial scenario where x_1 = n with probability 1, as there is only one category. Similarly, setting any p_i = 0 results in x_i = 0 with probability 1, effectively reducing the model to fewer than k categories. The PMF formula still applies, though many definitions assume p_i > 0 for all i to ensure k active categories.

Illustrative Example

Consider rolling a fair six-sided die 10 times, where each face has an equal probability of p_i = \frac{1}{6} for i = 1 to 6, and the rolls are independent. The vector of counts (X_1, X_2, X_3, X_4, X_5, X_6), where X_i is the number of times face i appears, follows a multinomial distribution with parameters n = 10 and \mathbf{p} = \left( \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6} \right). To illustrate, compute the probability of observing the specific counts (2, 1, 3, 0, 2, 2), meaning two 1's, one 2, three 3's, zero 4's, two 5's, and two 6's. Using the probability mass function, P(X_1=2, X_2=1, X_3=3, X_4=0, X_5=2, X_6=2) = \frac{10!}{2! \, 1! \, 3! \, 0! \, 2! \, 2!} \left( \frac{1}{6} \right)^{10}. First, evaluate the multinomial coefficient step by step: 10! = 3{,}628{,}800, and the denominator is 2! \cdot 1! \cdot 3! \cdot 0! \cdot 2! \cdot 2! = 2 \cdot 1 \cdot 6 \cdot 1 \cdot 2 \cdot 2 = 48. Thus, the coefficient is 3{,}628{,}800 / 48 = 75{,}600. Next, \left( \frac{1}{6} \right)^{10} = \frac{1}{6^{10}} = \frac{1}{60{,}466{,}176}. The probability is therefore 75{,}600 / 60{,}466{,}176 \approx 0.001250. This value can be obtained by direct computation or using statistical software, confirming the likelihood of this particular arrangement of outcomes. These counts represent the frequencies of each die face across the repeated trials, capturing how the multinomial distribution quantifies the joint probabilities of multiple categorical outcomes in such experiments. For intuition with a smaller number of trials, consider n=2 rolls of the same fair die. The table below summarizes probabilities for representative count vectors, grouped by type (the specific permutations across faces together make up the full distribution over count vectors).
Count Type | Example Vector | Multinomial Coefficient | Probability for Specific Vector
Both rolls same face | (2,0,0,0,0,0) | \frac{2!}{2! \, 0!^5} = 1 | 1 \times \left( \frac{1}{6} \right)^2 = \frac{1}{36}
Rolls show two different faces | (1,1,0,0,0,0) | \frac{2!}{1! \, 1! \, 0!^4} = 2 | 2 \times \left( \frac{1}{6} \right)^2 = \frac{2}{36}
There are 6 ways for the first type and \binom{6}{2} = 15 ways for the second, summing to total probability 1.
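The worked computation above can be reproduced in a few lines of Python; the sketch below (assuming SciPy is available) recovers the multinomial coefficient 75,600, the probability of the counts (2, 1, 3, 0, 2, 2), and the n = 2 total probability of 1.

from math import factorial
from scipy.stats import multinomial

counts = [2, 1, 3, 0, 2, 2]                       # counts for the 10-roll example
coef = factorial(10) // (factorial(2) * factorial(1) * factorial(3)
                         * factorial(0) * factorial(2) * factorial(2))
print(coef)                                       # 75600
print(coef / 6**10)                               # approximately 0.001250
print(multinomial.pmf(counts, n=10, p=[1/6] * 6)) # same value via SciPy

# n = 2 sanity check: 6 same-face vectors and 15 two-different-face vectors sum to 1.
print(6 * (1 / 36) + 15 * (2 / 36))               # 1.0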

Properties

Moments and Covariance

The moments of the multinomial distribution provide key summaries of its central tendency and spread. For a multinomial random vector \mathbf{X} = (X_1, \dots, X_k) with parameters n and \mathbf{p} = (p_1, \dots, p_k), the expected value of each component is given by E[X_i] = n p_i for i = 1, \dots, k. This follows from representing X_i as the sum of n indicator random variables I_{j,i} for the j-th trial yielding outcome i, where E[I_{j,i}] = p_i, and applying linearity of expectation: E[X_i] = \sum_{j=1}^n E[I_{j,i}] = n p_i. The variance of each X_i is \operatorname{Var}(X_i) = n p_i (1 - p_i). This result arises because the indicators I_{j,i} are independent Bernoulli random variables with parameter p_i, so \operatorname{Var}(X_i) = \sum_{j=1}^n \operatorname{Var}(I_{j,i}) = n p_i (1 - p_i), by additivity of variance for independent summands. Marginally, each X_i follows a binomial distribution with parameters n and p_i, confirming this variance formula. For i \neq j, the covariance between components is \operatorname{Cov}(X_i, X_j) = -n p_i p_j. To derive this, consider the double sum \operatorname{Cov}(X_i, X_j) = \sum_{m=1}^n \sum_{l=1}^n \operatorname{Cov}(I_{m,i}, I_{l,j}). Terms where m \neq l vanish due to independence across trials, leaving n terms where m = l: \operatorname{Cov}(I_{m,i}, I_{m,j}) = E[I_{m,i} I_{m,j}] - p_i p_j = 0 - p_i p_j = -p_i p_j, since I_{m,i} I_{m,j} = 0 (a single trial cannot yield both outcomes i and j). Thus, \operatorname{Cov}(X_i, X_j) = n (-p_i p_j). The full covariance matrix \Sigma of \mathbf{X} has diagonal entries \Sigma_{ii} = n p_i (1 - p_i) and off-diagonal entries \Sigma_{ij} = -n p_i p_j for i \neq j. This structure reflects the inherent dependence among the X_i, as their sum is fixed at n: an increase in one X_i must be offset by decreases in others, leading to negative covariances.
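A quick simulation (a sketch assuming NumPy; the parameter values are arbitrary) illustrates these moment formulas by comparing empirical means and covariances of multinomial draws with n\mathbf{p} and n(\operatorname{Diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T).

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, np.array([0.5, 0.3, 0.2])
samples = rng.multinomial(n, p, size=200_000)   # each row is one multinomial draw

print(samples.mean(axis=0))                     # empirical means, close to n * p = [25, 15, 10]
print(np.cov(samples.T))                        # empirical covariance matrix
print(n * (np.diag(p) - np.outer(p, p)))        # theoretical n (Diag(p) - p p^T)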

Generating Functions

The probability generating function (PGF) of a multinomial random vector \mathbf{X} = (X_1, \dots, X_k) with parameters n and \mathbf{p} = (p_1, \dots, p_k) is defined as G(s_1, \dots, s_k) = \mathbb{E}[s_1^{X_1} \cdots s_k^{X_k}], where |s_i| \leq 1 for all i. This function takes the closed form G(s_1, \dots, s_k) = \left( \sum_{i=1}^k p_i s_i \right)^n. \tag{1}\label{eq:pgf} The PGF arises naturally from the structure of the multinomial distribution as the distribution of the counts in n independent and identically distributed categorical trials, each with success probabilities \mathbf{p}. The PGF of a single categorical trial is \sum_{i=1}^k p_i s_i, and since the trials are independent, the PGF of their sum (the multinomial counts) is the nth power of the single-trial PGF.\cite{https://www.statlect.com/probability-distributions/multinomial-distribution} The PGF is particularly useful for computing factorial moments of the distribution. The rth-order factorial moment can be obtained by taking the appropriate mixed partial derivative of G and evaluating at s_1 = \dots = s_k = 1. For example, the rth factorial moment of a single component is \mathbb{E}[X_i (X_i - 1) \cdots (X_i - r + 1)] = \frac{\partial^r G}{\partial s_i^r} \big|_{s=1}, which simplifies to n(n-1) \cdots (n-r+1) p_i^r. Higher-order mixed factorial moments follow similarly from cross-derivatives, providing a systematic way to derive joint moments without direct summation over the support.\cite{https://www.statlect.com/probability-distributions/multinomial-distribution} For the marginal distribution of a single component X_j, which follows a binomial distribution with parameters n and p_j, the univariate PGF is the special case obtained by setting all s_i = 1 for i \neq j in \eqref{eq:pgf}, yielding G_j(s_j) = (1 - p_j + p_j s_j)^n. This confirms the known PGF for the binomial marginal and allows computation of its moments independently.\cite{https://www.statlect.com/probability-distributions/multinomial-distribution} The moment-generating function (MGF) of \mathbf{X} is defined as M(t_1, \dots, t_k) = \mathbb{E}[e^{t_1 X_1 + \dots + t_k X_k}], and it admits the closed form M(t_1, \dots, t_k) = \left( \sum_{i=1}^k p_i e^{t_i} \right)^n. \tag{2}\label{eq:mgf} Like the PGF, this follows from the independence of the n categorical trials: the MGF of one trial is \sum_{i=1}^k p_i e^{t_i}, raised to the nth power for the sum.\cite{https://ocw.mit.edu/courses/18-655-mathematical-statistics-spring-2016/669dd844d5208e4a01999a71bb1155ab_MIT18_655S16_LecNote8.pdf} The MGF facilitates derivation of ordinary moments via differentiation; for instance, the mean vector is the gradient of \log M at \mathbf{t} = \mathbf{0}, yielding \mathbb{E}[\mathbf{X}] = n \mathbf{p}, and higher-order moments (including covariances) emerge from higher derivatives, offering an efficient alternative to direct expectation calculations for complex joint distributions.\cite{https://ocw.mit.edu/courses/18-655-mathematical-statistics-spring-2016/669dd844d5208e4a01999a71bb1155ab_MIT18_655S16_LecNote8.pdf}

Matrix Notation

The multinomial distribution is conveniently reformulated using vector and matrix notation to facilitate multivariate analysis and computations. Consider the random vector \mathbf{X} = (X_1, \dots, X_k)^T following a multinomial distribution with parameters n (number of trials) and \mathbf{p} = (p_1, \dots, p_k)^T (a k \times 1 probability vector satisfying \sum_{i=1}^k p_i = 1), denoted \mathbf{X} \sim \text{Multinomial}(n, \mathbf{p}). The probability mass function in this notation is P(\mathbf{X} = \mathbf{x}) = \frac{n!}{\prod_{i=1}^k x_i!} \prod_{i=1}^k p_i^{x_i} = \binom{n}{\mathbf{x}} \mathbf{p}^{\odot \mathbf{x}}, where \mathbf{x} = (x_1, \dots, x_k)^T consists of non-negative integers summing to n, \binom{n}{\mathbf{x}} is the multinomial coefficient, and \mathbf{p}^{\odot \mathbf{x}} is shorthand for the product \prod_{i=1}^k p_i^{x_i} obtained by element-wise exponentiation. This form highlights the distribution's dependence on the vectorized counts and probabilities. The mean vector is \mathbb{E}[\mathbf{X}] = n \mathbf{p}, representing the expected counts scaled by the trial size and category probabilities. The covariance matrix captures the dependencies among components: \text{Cov}(\mathbf{X}) = n \left( \text{Diag}(\mathbf{p}) - \mathbf{p} \mathbf{p}^T \right), where \text{Diag}(\mathbf{p}) is the diagonal matrix with \mathbf{p} on the diagonal; the off-diagonal elements reflect negative covariances due to the fixed total number of trials. In Bayesian inference, the Dirichlet distribution serves as the conjugate prior for the probability vector \mathbf{p}, enabling closed-form posterior updates when observing multinomial data.

Normalization

The normalization of the multinomial distribution ensures that the probability mass function (PMF) assigns a total probability of 1 to the entire sample space, confirming it as a valid probability distribution. Specifically, for parameters n \geq 0 and probabilities p = (p_1, \dots, p_k) with \sum_{i=1}^k p_i = 1 and p_i \geq 0, the sum of the PMF over all non-negative integers x_1, \dots, x_k satisfying \sum_{i=1}^k x_i = n is given by \sum_{\substack{x_1, \dots, x_k \geq 0 \\ \sum x_i = n}} \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k} = 1. This equality follows directly from the multinomial theorem, which states that for non-negative integer n and variables p_1, \dots, p_k, (p_1 + \cdots + p_k)^n = \sum_{\substack{x_1, \dots, x_k \geq 0 \\ \sum x_i = n}} \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}. Substituting \sum p_i = 1 yields 1^n = 1, thus verifying the normalization. The multinomial coefficient \frac{n!}{x_1! \cdots x_k!} plays a crucial role in this scaling by accounting for the number of distinct sequences of n trials that result in exactly x_i outcomes of category i, for each i. This combinatorial factor, combined with the product of the probabilities, ensures that the probabilities are neither under- nor over-scaled across the support, as the theorem's expansion partitions the total probability mass exhaustively and exclusively. In the uniform case, where p_i = 1/k for all i = 1, \dots, k, the PMF simplifies to \frac{n!}{x_1! \cdots x_k!} (1/k)^n, and the normalization holds symmetrically due to the equal probabilities summing to 1. For the degenerate case, where one p_j = 1 and all other p_i = 0 for i \neq j, the distribution assigns probability 1 solely to the outcome with x_j = n and all other x_i = 0, while all other outcomes have probability 0, again summing to 1 as required.
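The normalization can also be verified numerically for small cases; the following sketch (assuming NumPy) enumerates all compositions of n into k non-negative parts and confirms that the PMF values sum to 1.

from math import factorial
import numpy as np

def compositions(n, k):
    """Yield all k-tuples of non-negative integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def pmf(x, p):
    coef = factorial(sum(x))
    for xi in x:
        coef //= factorial(xi)
    return coef * np.prod([pi ** xi for pi, xi in zip(p, x)])

n, p = 5, [0.2, 0.5, 0.3]
total = sum(pmf(x, p) for x in compositions(n, len(p)))
print(total)   # 1.0 up to floating-point rounding, as the multinomial theorem guarantees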

Representations and Visualizations

Combinatorial Interpretation

The multinomial distribution provides a probabilistic model for the counts of outcomes in a sequence of n independent trials, each with k possible categories occurring with probabilities p_1, \dots, p_k. Combinatorially, the probability of observing exactly x_1 occurrences of the first category, x_2 of the second, and so on up to x_k of the k-th category (where \sum x_i = n) is given by the product of the multinomial coefficient and the probabilities raised to the respective powers: \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}. The multinomial coefficient \binom{n}{x_1, \dots, x_k} = \frac{n!}{x_1! \cdots x_k!} counts the number of distinct sequences of length n that result in these counts, treating the trials as distinguishable by their position in the sequence and the outcomes within each category as indistinguishable. This arises from partitioning the n distinct trial positions into k labeled groups of sizes x_i, equivalent to the number of ways to assign each trial to a category while respecting the counts. This combinatorial view links to broader counting problems, such as distributing n distinct objects into k distinct bins with x_i objects per bin, where the multinomial coefficient enumerates the possible assignments. While related techniques like stars and bars address indistinguishable objects placed into distinguishable bins (yielding combinations with repetition), the multinomial coefficient focuses on distinguishable allocations in ordered trials. Historically, the multinomial coefficient emerged in combinatorial contexts through extensions of the binomial theorem, with early formulations appearing in late seventeenth-century works on permutations and expansions, and later probabilistic interpretations in counting problems of object distribution in the early 18th century.

Polynomial Coefficients

The multinomial theorem provides a direct algebraic link between the probability mass function (PMF) of the multinomial distribution and the coefficients in the expansion of the polynomial (p_1 + p_2 + \dots + p_k)^n, where p_i \geq 0 for i = 1, \dots, k and \sum_{i=1}^k p_i = 1. The theorem states that (x_1 + x_2 + \dots + x_k)^n = \sum_{n_1 + \dots + n_k = n \atop n_i \geq 0} \frac{n!}{n_1! n_2! \dots n_k!} x_1^{n_1} x_2^{n_2} \dots x_k^{n_k}, where the sum is over all non-negative integers n_1, \dots, n_k satisfying the condition, and the coefficients \frac{n!}{n_1! \dots n_k!} are the multinomial coefficients. Substituting x_i = p_i into the expansion yields (p_1 + p_2 + \dots + p_k)^n = \sum_{n_1 + \dots + n_k = n \atop n_i \geq 0} \frac{n!}{n_1! n_2! \dots n_k!} p_1^{n_1} p_2^{n_2} \dots p_k^{n_k}. The left side simplifies to 1^n = 1 because the probabilities sum to 1, so the right side sums to 1, confirming that each term \frac{n!}{n_1! \dots n_k!} p_1^{n_1} \dots p_k^{n_k} is precisely the PMF value for outcome (n_1, \dots, n_k). Thus, the PMF terms are exactly the contributions in this expansion, with the multinomial coefficients scaling the probability products. For a concrete example with n=2 and k=3, the expansion is (p_1 + p_2 + p_3)^2 = p_1^2 + p_2^2 + p_3^2 + 2 p_1 p_2 + 2 p_1 p_3 + 2 p_2 p_3. The terms correspond to the PMF as follows: p_1^2 for (2,0,0) (coefficient \frac{2!}{2!0!0!} = 1), 2 p_1 p_2 for (1,1,0) (coefficient \frac{2!}{1!1!0!} = 2), and similarly for the other permutations, with the full sum equaling 1. This polynomial perspective, grounded in the multinomial theorem, enables straightforward algebraic manipulations of the distribution, such as verifying normalization or expanding expressions for expected values under the probability constraints.

Slices of Generalized Pascal's Triangle

The generalized Pascal's triangle extends the classical structure to higher dimensions through multinomial coefficients, forming what is known as Pascal's simplex. For a fixed total count n and k categories, the relevant slice consists of a (k-1)-dimensional array of lattice points where each entry corresponds to a multinomial coefficient \binom{n}{x_1, x_2, \dots, x_k} = \frac{n!}{x_1! x_2! \cdots x_k!}, with x_1 + x_2 + \dots + x_k = n and each x_i \geq 0 an integer. This slice captures all possible ways to distribute n indistinguishable items into k distinguishable bins, with the coefficients representing the number of distinct outcomes for each composition (x_1, \dots, x_k). This construction generalizes the classical Pascal's triangle, which corresponds to the case k=2: there, each row n forms a one-dimensional array of coefficients \binom{n}{x_1} with x_2 = n - x_1, arranged linearly to exhibit the familiar additive recurrence. For k > 2, the structure becomes higher-dimensional, with entries satisfying a multinomial analogue of Pascal's rule: \binom{n}{x_1, \dots, x_k} = \sum \binom{n-1}{y_1, \dots, y_k} over appropriate predecessors where one y_i = x_i - 1 and the rest match, enabling recursive construction layer by layer from n=0 (a single entry of 1) upward. The resulting structure exhibits symmetries under permutations of the categories, reflecting the symmetry of the formula under relabeling of the bins, and shows growth in the number of non-zero entries, which is \binom{n + k - 1}{k - 1}, the number of compositions of n into k non-negative parts. For illustration with k=3 (the trinomial case, forming triangular slices), consider small values of n. The coefficients are symmetric under permutations of the indices, with larger values concentrated near balanced distributions. Below is a table for n=3, listing tuples and their multinomial coefficients; note the three permutations of (3,0,0) each with coefficient 1, six of (2,1,0)-type with 3, and the central (1,1,1) with 6.
Tuple (x_1, x_2, x_3) | \binom{3}{x_1, x_2, x_3}
(3,0,0) | 1
(0,3,0) | 1
(0,0,3) | 1
(2,1,0) | 3
(2,0,1) | 3
(1,2,0) | 3
(0,2,1) | 3
(1,0,2) | 3
(0,1,2) | 3
(1,1,1) | 6
This table demonstrates the permutation symmetry (e.g., all (2,1,0)-type entries are equal) and the growth from n=2 (6 non-zero entries totaling 9) to n=3 (10 entries totaling 27). For n=4, the pattern expands further, with 15 entries including coefficients up to 12 for (2,1,1)-type tuples. Visualizing these slices is challenging in higher dimensions, but for k=3, the triangular layer can be projected onto a 2D simplex representing the set \{ (x_1, x_2, x_3) \mid x_i \geq 0, \sum x_i = n \}, with coefficients placed at lattice points and colored by magnitude to highlight central peaks. For larger k, slicing along hyperplanes (e.g., fixing one x_i=0 to reduce to a lower-dimensional slice) or using barycentric coordinates aids projection onto 2D or 3D, revealing fractal-like self-similarities in the coefficient patterns across layers.
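The n = 3, k = 3 layer in the table can be regenerated programmatically; the short sketch below (plain Python, with names chosen for illustration) enumerates the lattice points of the slice and their coefficients, and confirms the entry count and total.

from math import factorial

def multinomial_coef(x):
    c = factorial(sum(x))
    for xi in x:
        c //= factorial(xi)
    return c

n = 3
layer = {(x1, x2, n - x1 - x2): multinomial_coef((x1, x2, n - x1 - x2))
         for x1 in range(n + 1) for x2 in range(n + 1 - x1)}
for tup, coef in sorted(layer.items(), key=lambda item: item[1]):
    print(tup, coef)                       # permutations of (3,0,0), then (2,1,0)-types, then (1,1,1)
print(len(layer), sum(layer.values()))     # 10 entries totaling 27 = 3^3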

Advanced Theoretical Properties

Large Deviation Principles

The large deviation principle (LDP) for the multinomial distribution characterizes the exponential decay of probabilities for rare events where the empirical frequencies deviate significantly from the true parameter vector \mathbf{p} = (p_1, \dots, p_k), as the number of trials n grows large. Specifically, if \mathbf{X} = (X_1, \dots, X_k) \sim \mathrm{Multinomial}(n, \mathbf{p}), then the scaled vector \mathbf{X}/n satisfies an LDP with speed n and good rate function I(\mathbf{x}) = \sum_{i=1}^k x_i \log(x_i / p_i) for \mathbf{x} \in \Delta_{k-1} = \{\mathbf{x} \in [0,1]^k : \sum x_i = 1\} with \mathbf{x} > 0, and I(\mathbf{x}) = +\infty otherwise, where \Delta_{k-1} is the (k-1)-simplex. This implies that for any closed set F \subseteq \Delta_{k-1} not containing \mathbf{p}, \limsup_{n \to \infty} \frac{1}{n} \log P(\mathbf{X}/n \in F) \leq -\inf_{\mathbf{x} \in F} I(\mathbf{x}), and for any open set G \subseteq \Delta_{k-1}, \liminf_{n \to \infty} \frac{1}{n} \log P(\mathbf{X}/n \in G) \geq -\inf_{\mathbf{x} \in G} I(\mathbf{x}). In particular, for \mathbf{q} \neq \mathbf{p} in the interior of \Delta_{k-1}, the probability P(\mathbf{X}/n \approx \mathbf{q}) decays exponentially as \exp(-n I(\mathbf{q})), with the rate I(\mathbf{q}) representing the minimal "cost" of deviation, which is zero only at \mathbf{q} = \mathbf{p}. This LDP follows from Sanov's theorem, which applies to the empirical measure of n i.i.d. samples from the categorical distribution with parameter \mathbf{p}; the multinomial counts \mathbf{X} directly yield the empirical frequencies as masses on the finite support \{1, \dots, k\}, and the rate function is the Kullback-Leibler (KL) divergence D_{\mathrm{KL}}(\mathbf{q} \| \mathbf{p}) = \sum q_i \log(q_i / p_i). Equivalently, Cramér's theorem for vector-valued i.i.d. random variables establishes the same LDP for the sample mean \bar{\mathbf{Y}}_n = (1/n) \sum_{j=1}^n \mathbf{Y}_j, where each \mathbf{Y}_j is a standard basis vector selected with probabilities p_i, since \mathbf{X}/n = \bar{\mathbf{Y}}_n and the Legendre-Fenchel transform of the cumulant generating function yields the KL divergence as the rate. The origins of large deviation theory trace back to the 1930s with Harald Cramér's work on ruin probabilities in insurance mathematics, providing the first rigorous results for scalar sums of i.i.d. variables, later extended to vectors in the 1950s by researchers like A. Ya. Khinchin. Sanov's theorem, published in 1958, generalized these ideas to empirical measures on finite spaces, directly encompassing the multinomial case. In modern applications, particularly in information theory, this LDP underpins analyses of typical sequences, source coding efficiency, and hypothesis testing, where the KL rate quantifies the distinguishability of distributions via rare-event probabilities.
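The exponential decay rate can be observed numerically: the sketch below (assuming SciPy; the vectors \mathbf{p} and \mathbf{q} are illustrative) evaluates -\frac{1}{n}\log P(\mathbf{X} = n\mathbf{q}) exactly for increasing n and compares it with the KL divergence D_{\mathrm{KL}}(\mathbf{q} \| \mathbf{p}), which the rates approach up to a polynomial correction that vanishes as n grows.

import numpy as np
from scipy.stats import multinomial

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])                 # deviant frequency vector
kl = np.sum(q * np.log(q / p))                # rate function I(q) = D_KL(q || p)

for n in [10, 100, 1000, 10000]:
    x = np.rint(n * q).astype(int)            # exact counts n * q (integers for these n)
    rate = -multinomial.logpmf(x, n=n, p=p) / n
    print(n, rate)
print("KL divergence:", kl)                   # the printed rates converge toward this value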

Asymptotic Behavior

As the number of trials n grows large, the multinomial random vector X concentrates around its mean np, with fluctuations of order \sqrt{n}. The central limit theorem establishes that the standardized vector converges in distribution to a multivariate normal: \frac{X - np}{\sqrt{n}} \xrightarrow{d} \mathcal{N}\left(0, \operatorname{Diag}(p) - p p^T \right), where \operatorname{Diag}(p) is the diagonal matrix with entries p_i and p is the probability vector. This result follows from the multivariate central limit theorem applied to the sum of n independent categorical random vectors, with the covariance matrix matching the scaled version of the multinomial's finite-sample covariance. For smooth functions g of the sample proportions \hat{p} = X/n, the delta method yields asymptotic normality of the transformed estimator. Specifically, if g: \mathbb{R}^k \to \mathbb{R}^m is continuously differentiable at p, then \sqrt{n} \left( g(\hat{p}) - g(p) \right) \xrightarrow{d} \mathcal{N}\left(0, \nabla g(p)^T \left[ \operatorname{Diag}(p) - p p^T \right] \nabla g(p) \right), where \nabla g(p) is the k \times m matrix of partial derivatives of g evaluated at p (the transpose of the Jacobian). This approximation facilitates inference on derived quantities, such as ratios or log-ratios of proportions, by leveraging the first-order Taylor expansion around the true parameter. Bounds on the error of the normal approximation are provided by the multivariate Berry–Esseen theorem, which quantifies the rate of convergence in the central limit theorem. For the multinomial, the supremum distance between the distribution function of the standardized vector and the multivariate normal satisfies \sup_x \left| P\left( \frac{X - np}{\sqrt{n}} \leq x \right) - \Phi(x) \right| \leq C \frac{\beta_3}{\sqrt{n}}, where \Phi is the multivariate standard normal cumulative distribution function, C is a universal constant, and \beta_3 is a measure of the third moments of the underlying categorical distribution, bounded by terms involving \max_i p_i (1 - p_i). These bounds remain effective even when the number of categories grows moderately with n, aiding assessments of approximation quality in high-dimensional settings.
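For a concrete delta-method example, take g(\mathbf{p}) = \log(p_1/p_2): its gradient is (1/p_1, -1/p_2, 0, \dots), so the sandwich formula gives asymptotic variance (1-p_1)/p_1 + (1-p_2)/p_2 + 2 = 1/p_1 + 1/p_2. The sketch below (assuming NumPy; the parameters are illustrative) checks this against simulation.

import numpy as np

rng = np.random.default_rng(1)
n, p = 500, np.array([0.5, 0.3, 0.2])
samples = rng.multinomial(n, p, size=100_000)
p_hat = samples / n

log_ratio = np.log(p_hat[:, 0] / p_hat[:, 1])   # g(p_hat) = log(p1_hat / p2_hat)
print(n * log_ratio.var())                      # empirical variance of sqrt(n) * g(p_hat)
print(1 / p[0] + 1 / p[1])                      # delta-method variance 1/p1 + 1/p2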

Concentration Inequalities

Concentration inequalities for the multinomial distribution provide non-asymptotic bounds on the probabilities that the random counts deviate significantly from their expected values, which are crucial for finite-sample analysis in statistics and machine learning. These bounds are exponential in nature and hold uniformly over the parameter space, offering guarantees without relying on asymptotic approximations. A fundamental result is Hoeffding's inequality applied to the marginal distributions of the multinomial counts. For the i-th component X_i, which follows a \mathrm{Binomial}(n, p_i) distribution, the normalized count \hat{p}_i = X_i / n satisfies P\left( \left| \hat{p}_i - p_i \right| \geq \epsilon \right) \leq 2 \exp\left(-2 n \epsilon^2 \right) for any \epsilon > 0. This bound arises because \hat{p}_i is the average of n independent bounded random variables (indicators for category i) taking values in [0, 1], and it is independent of the specific value of p_i. Similar bounds extend to the vector of normalized counts \hat{\mathbf{p}} via multivariate versions, controlling deviations in norms like \ell_1 or \ell_2. For more general functions of the multinomial vector \mathbf{X}, McDiarmid's bounded differences inequality applies when the underlying model is viewed as n independent categorical trials. If f(\mathbf{X}) is such that changing the outcome of any single trial alters f by at most c_j (for the j-th trial), then P\left( |f(\mathbf{X}) - \mathbb{E}[f(\mathbf{X})]| \geq t \right) \leq 2 \exp\left( -\frac{2 t^2}{\sum_{j=1}^n c_j^2} \right) for any t > 0. This is particularly useful for functions of the empirical distribution, such as those measuring discrepancy or risk in statistical estimators. Conditional concentration inequalities also hold for the multinomial distribution. Given that the sum of counts over a subset of categories S, \sum_{i \in S} X_i = m, the conditional distribution of (X_i)_{i \in S} is \mathrm{Multinomial}(m, \mathbf{q}), where q_i = p_i / \sum_{j \in S} p_j. Consequently, Hoeffding's inequality applies conditionally, yielding P\left( \left| \frac{X_i}{m} - q_i \right| \geq \epsilon \;\middle|\; \sum_{j \in S} X_j = m \right) \leq 2 \exp\left(-2 m \epsilon^2 \right) for i \in S and \epsilon > 0. This follows from the conditional distribution retaining the structure of an independent trials model with updated parameters and sample size m. These inequalities find applications in statistical learning theory, particularly for bounding the deviation of empirical risks based on multinomial samples, such as in goodness-of-fit testing, where they ensure that the empirical distribution concentrates around the true probabilities with high probability.
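The marginal Hoeffding bound can be compared with simulated deviation frequencies; the sketch below (assuming NumPy; n, p_i, and \epsilon are illustrative) shows the empirical tail probability falling below the distribution-free bound 2\exp(-2n\epsilon^2).

import numpy as np

rng = np.random.default_rng(2)
n, p_i, eps = 200, 0.3, 0.05
p_hat = rng.binomial(n, p_i, size=500_000) / n          # marginal proportions X_i / n

empirical = np.mean(np.abs(p_hat - p_i) >= eps)         # observed deviation frequency
hoeffding_bound = 2 * np.exp(-2 * n * eps**2)           # Hoeffding's upper bound
print(empirical, hoeffding_bound)                       # the empirical value lies below the bound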

Binomial and Categorical Connections

The multinomial distribution establishes direct connections to the binomial and categorical distributions through its marginal and conditional properties, as well as its interpretation as repeated independent trials. Specifically, for a multinomial random vector \mathbf{X} = (X_1, \dots, X_k) with parameters n and \mathbf{p} = (p_1, \dots, p_k), the marginal distribution of each component X_i follows a binomial distribution, X_i \sim \text{Binomial}(n, p_i), where the i-th category is treated as a "success" and all other categories collectively as "failure." This arises because the probability of observing exactly x_i outcomes in category i over n trials ignores the specific allocations among the remaining categories, reducing to the standard binomial setup. Furthermore, the conditional distribution of the remaining counts given X_i = m for some i and 0 \leq m \leq n is itself a multinomial distribution over the remaining k-1 categories, with updated parameters (n - m) and probabilities \mathbf{p}' = (p_1 / (1 - p_i), \dots, p_{i-1}/(1 - p_i), p_{i+1}/(1 - p_i), \dots, p_k / (1 - p_i)). This property reflects the independence of trials and the renormalization of probabilities after fixing the count in one category, allowing the model to decompose into simpler structures while preserving the overall multinomial form for the residual counts. The categorical distribution serves as the foundational single-trial case of the multinomial, where setting n = 1 yields the distribution of a single outcome Z taking value j with probability p_j, P(Z = j) = p_j for j = 1, \dots, k. In this sense, the multinomial distribution models the joint outcomes of n independent and identically distributed (i.i.d.) categorical random variables, where the vector of counts \mathbf{X} captures the frequencies of each category across the trials. This i.i.d. structure underpins the multinomial's utility in aggregating multiple categorical observations into a cohesive probabilistic framework.

Dirichlet-Multinomial Compound

The Dirichlet-multinomial distribution arises as a compound distribution in Bayesian modeling, where the category probabilities \mathbf{p} = (p_1, \dots, p_k) of a multinomial model are treated as random variables drawn from a Dirichlet prior \mathbf{p} \sim \mathrm{Dir}(\boldsymbol{\alpha}) with positive concentration parameters \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_k), and the observed counts \mathbf{X} = (X_1, \dots, X_k) conditional on \mathbf{p} follow a multinomial \mathbf{X} \mid \mathbf{p} \sim \mathrm{Multinomial}(n, \mathbf{p}) for fixed trial count n = \sum X_i. The resulting marginal distribution of \mathbf{X} is the Dirichlet-multinomial, denoted \mathrm{DM}(n, \boldsymbol{\alpha}), which integrates out the uncertainty in \mathbf{p}. This construction was formalized in the context of multinomial distributions, highlighting its role in generalizing beta-binomial models to multiple categories. The probability mass function of the Dirichlet-multinomial distribution is P(\mathbf{X} = \mathbf{x} \mid n, \boldsymbol{\alpha}) = \frac{n!}{x_1! \cdots x_k!} \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + n)} \prod_{i=1}^k \frac{\Gamma(\alpha_i + x_i)}{\Gamma(\alpha_i)}, where \alpha_0 = \sum_{i=1}^k \alpha_i, \sum_{i=1}^k x_i = n, and \Gamma denotes the gamma function. This expression derives directly from integrating the multinomial likelihood against the Dirichlet density over the probability simplex. In Bayesian inference, the Dirichlet serves as the conjugate prior for the multinomial likelihood, yielding a posterior \mathbf{p} \mid \mathbf{x} \sim \mathrm{Dir}(\boldsymbol{\alpha} + \mathbf{x}) that preserves the Dirichlet form with updated parameters. The predictive distribution for a future count vector \mathbf{y} with m trials then follows \mathbf{y} \mid \mathbf{x} \sim \mathrm{DM}(m, \boldsymbol{\alpha} + \mathbf{x}), enabling straightforward sequential updating and forecasting in closed form. The Dirichlet-multinomial distribution is widely applied to model overdispersion in multicategory count data, where observations exhibit greater variability than predicted by the standard multinomial under fixed probabilities, due to the hierarchical incorporation of uncertainty in \mathbf{p}. This makes it suitable for hierarchical modeling in fields requiring robust handling of extra variation.
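The PMF above can be evaluated stably on the log scale with log-gamma functions; the sketch below (assuming SciPy; the values of \boldsymbol{\alpha} and n are illustrative) implements it and checks that the compound PMF sums to 1 over the support.

import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_logpmf(x, alpha):
    """Log PMF of DM(n, alpha) at the count vector x, using log-gamma for stability."""
    x, alpha = np.asarray(x), np.asarray(alpha)
    n, a0 = x.sum(), alpha.sum()
    return (gammaln(n + 1) - gammaln(x + 1).sum()
            + gammaln(a0) - gammaln(a0 + n)
            + (gammaln(alpha + x) - gammaln(alpha)).sum())

def compositions(n, k):
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

alpha, n = [0.5, 1.0, 2.0], 4
total = sum(np.exp(dirichlet_multinomial_logpmf(x, alpha)) for x in compositions(n, 3))
print(total)   # 1.0 up to rounding, confirming the compound PMF is properly normalized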

Poisson and Other Limits

The multinomial distribution arises as the conditional distribution of independent Poisson random variables given that their sum equals a fixed integer n. Specifically, let Y_1, \dots, Y_k be independent Poisson random variables with means \lambda_i = n p_i for i = 1, \dots, k. Then the sum S = \sum_{i=1}^k Y_i follows a Poisson distribution with mean n, and the conditional distribution of (Y_1, \dots, Y_k) \mid S = n is multinomial with parameters n and probabilities (p_1, \dots, p_k). This relationship also underlies the Poisson limit for the multinomial distribution. As n \to \infty with each p_i \to 0 such that n p_i \to \lambda_i > 0 fixed for i = 1, \dots, k, the joint distribution of the multinomial counts (X_1, \dots, X_k) converges in distribution to independent Poisson random variables (Z_1, \dots, Z_k) with means \lambda_1, \dots, \lambda_k. The dependence induced by the fixed sum n becomes negligible in this regime, as the probability that S = n concentrates around its mean. This limit extends the classical Poisson approximation of the binomial case and is useful for modeling rare events across multiple categories. In the infinite-dimensional limit as the number of categories k \to \infty, for the uniform Dirichlet-multinomial model with concentration parameters \alpha_i = 1/k (corresponding to total concentration \theta = 1), the limiting partition structure of the counts follows the Ewens sampling formula with parameter \theta = 1, whose ranked relative frequencies yield the Poisson-Dirichlet distribution PD(\theta, 0). This convergence underlies models in population genetics (e.g., the infinite alleles model) and species sampling, where the number of potential categories grows without bound. The multinomial distribution also serves as the limiting case of the multivariate hypergeometric distribution in sampling without replacement from a finite population. Consider a population of size N with N_i items of type i for i = 1, \dots, k, so that the proportions are P_i = N_i / N. Drawing a sample of fixed size n without replacement yields counts following a multivariate hypergeometric distribution. As N \to \infty with n fixed and P_i fixed, the dependence between draws diminishes, and the distribution converges to the multinomial with parameters n and (P_1, \dots, P_k). This approximation is accurate when n \ll N, reflecting the transition from finite-population to infinite-population models. De-Poissonization techniques leverage the Poisson-multinomial relationship to approximate quantities for the fixed-n case using easier-to-analyze unconditional models. By Poissonizing the number of trials (replacing fixed n with a \mathrm{Poisson}(\mu) random variable where \mu = n), the category counts become independent Poisson variables, simplifying computations of moments, tail probabilities, or generating functions. To recover fixed-n results, methods such as local limit theorems, Stein-Chen bounds, or saddlepoint approximations are applied; for instance, expectations under the multinomial can often be shown to differ from their Poissonized counterparts by O(1/\sqrt{n}). These techniques are particularly valuable in combinatorial probability and large-deviation analysis where direct multinomial calculations are intractable.
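The Poisson-conditioning identity can be checked exactly for a small case; the sketch below (assuming SciPy; the parameters are illustrative) computes P(\mathbf{Y} = \mathbf{x} \mid S = n) from independent Poisson PMFs and confirms it equals the multinomial PMF.

import numpy as np
from scipy.stats import poisson, multinomial

n, p = 10, np.array([0.5, 0.3, 0.2])
lam = n * p                                   # Poisson means lambda_i = n * p_i
x = np.array([4, 3, 3])                       # a count vector with sum(x) = n

# P(Y = x | sum Y = n) = prod_i Poisson(x_i; lambda_i) / Poisson(n; sum_i lambda_i)
log_conditional = poisson.logpmf(x, lam).sum() - poisson.logpmf(n, lam.sum())
print(np.exp(log_conditional))
print(multinomial.pmf(x, n=n, p=p))           # identical to the conditional probability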

Statistical Inference

Parameter Estimation

The maximum likelihood estimator (MLE) for the probability parameters \mathbf{p} = (p_1, \dots, p_k) in the multinomial distribution, given a sample \mathbf{X} = (X_1, \dots, X_k) from Multinomial(n, \mathbf{p}), is derived from the likelihood function L(\mathbf{p} \mid \mathbf{X}) = \frac{n!}{X_1! \cdots X_k!} \prod_{i=1}^k p_i^{X_i}, subject to \sum_{i=1}^k p_i = 1 and p_i > 0. Maximizing the log-likelihood \ell(\mathbf{p} \mid \mathbf{X}) = \log L(\mathbf{p} \mid \mathbf{X}) = \sum_{i=1}^k X_i \log p_i + \text{const} yields the closed-form solution \hat{p}_i = X_i / n for each i = 1, \dots, k. This MLE \hat{\mathbf{p}} is unbiased for \mathbf{p}, as E[\hat{p}_i] = E[X_i]/n = (n p_i)/n = p_i for each i, following directly from the linearity of expectation and the definition of the multinomial distribution. If the number of trials n is unknown, its MLE is the observed total \sum_{i=1}^k X_i, though in most applications n is a fixed, known quantity from the experimental design. The estimator \hat{\mathbf{p}} is consistent, meaning \hat{\mathbf{p}} \to_p \mathbf{p} as n \to \infty, which follows from the strong law of large numbers applied componentwise to the proportions X_i / n. Under standard regularity conditions (such as bounded support and positive probabilities), the MLE also satisfies asymptotic normality: \sqrt{n} (\hat{\mathbf{p}} - \mathbf{p}) \to_d \mathcal{N}(\mathbf{0}, \Sigma), where \Sigma = \operatorname{Diag}(\mathbf{p}) - \mathbf{p} \mathbf{p}^\top is the covariance matrix of a single categorical trial, equal to the multinomial covariance divided by n. This result arises from the general asymptotic theory for MLEs in exponential families, to which the multinomial belongs, ensuring that the score function has mean zero and the information matrix equality holds. For small sample sizes n, where asymptotic approximations may perform poorly due to sparse counts or boundary effects in the parameter space, bootstrap resampling offers a nonparametric alternative to assess the sampling variability of \hat{\mathbf{p}}. The procedure generates B bootstrap samples \mathbf{X}^{*b} (b=1,\dots,B) by resampling n observations from the empirical distribution defined by \hat{\mathbf{p}}, computes \hat{\mathbf{p}}^{*b} = \mathbf{X}^{*b} / n for each, and uses the empirical distribution of \{\hat{\mathbf{p}}^{*b}\} to approximate the sampling distribution of \hat{\mathbf{p}}, for example for bias correction or variance estimation. This method is particularly useful in multinomial settings with sparse categories, improving reliability over direct reliance on the asymptotic form.
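A minimal sketch of the estimator and the bootstrap assessment (assuming NumPy; the observed counts are hypothetical) is shown below.

import numpy as np

rng = np.random.default_rng(3)
x = np.array([18, 7, 3, 2])              # hypothetical observed counts for k = 4 categories
n = x.sum()
p_hat = x / n                            # maximum likelihood estimate p_hat_i = X_i / n

# Bootstrap: resample counts from Multinomial(n, p_hat) and re-estimate B times.
B = 10_000
boot = rng.multinomial(n, p_hat, size=B) / n
se = boot.std(axis=0)                            # bootstrap standard errors per category
ci = np.percentile(boot, [2.5, 97.5], axis=0)    # percentile 95% intervals per category
print(p_hat, se, ci, sep="\n")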

Hypothesis Testing

Hypothesis testing for the multinomial distribution primarily involves assessing whether observed categorical counts conform to specified probability parameters or whether multiple samples arise from the same underlying distribution. Common approaches include goodness-of-fit tests to evaluate fit against hypothesized probabilities and tests for homogeneity or independence in multi-way contingency tables, where the sampling model is multinomial. These tests rely on asymptotic approximations to the chi-squared distribution under the null hypothesis, assuming large sample sizes and no expected cell counts below 5. The Pearson's chi-squared goodness-of-fit test evaluates the null hypothesis that observed frequencies O_i from a multinomial sample of size n follow expected frequencies E_i = n p_i under specified probabilities p_i, for i = 1, \dots, k categories. The test statistic is given by \chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}, which asymptotically follows a \chi^2_{k-1} distribution under the null, allowing computation of p-values for rejection. This test, introduced by Karl Pearson, is widely used for its simplicity and robustness in large samples. An alternative is the likelihood ratio test (LRT) for goodness-of-fit, which compares the maximized likelihood under the null hypothesis (specified p_i) to the unrestricted maximum likelihood estimate (MLE) of the probabilities from the data. The test statistic is G^2 = 2 \sum_{i=1}^k O_i \log \left( \frac{O_i}{E_i} \right), asymptotically distributed as \chi^2_{k-1} under the null. The LRT often provides better performance in moderate samples and is part of the broader power divergence family of tests, where Pearson's \chi^2 corresponds to a specific divergence parameter. For testing independence in r \times c contingency tables under multinomial sampling, the Pearson's chi-squared test extends naturally, with expected frequencies E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n} and (r-1)(c-1) degrees of freedom. Similarly, the LRT uses G^2 = 2 \sum_{i,j} O_{ij} \log \left( \frac{O_{ij}}{E_{ij}} \right). These tests assess whether the categorical variables are independent by fitting a product of marginal multinomials. Tests for homogeneity across multiple multinomial samples, such as comparing proportions in g groups with shared categories, employ the chi-squared statistic on a g \times k contingency table, with expected frequencies under the pooled MLE proportions and (g-1)(k-1) degrees of freedom. This tests the null hypothesis that all groups share the same category probabilities, equivalent to testing independence between group and category indicators. Power for these chi-squared tests depends on the non-centrality parameter \lambda = n \sum (p_i - p_{0i})^2 / p_{0i} under alternatives, where the distribution shifts to a non-central \chi^2 with the same degrees of freedom; sample sizes are chosen to achieve desired power (e.g., 80%) at a specified significance level, often using simulation or approximations for multi-category cases. When expected frequencies are small (e.g., <5 in any cell), the chi-squared approximation may inflate Type I error; modern adjustments include Yates's continuity correction, which modifies the Pearson statistic to \chi^2 = \sum (|O_i - E_i| - 0.5)^2 / E_i for 2x2 tables or low-df cases, improving accuracy by accounting for the discrete nature of counts. This correction, while conservative, is recommended for sparse data in contingency table tests.
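Both goodness-of-fit statistics are available in SciPy; the sketch below (the counts and null probabilities are illustrative) computes Pearson's \chi^2 via chisquare and G^2 via power_divergence with the log-likelihood option.

import numpy as np
from scipy.stats import chisquare, power_divergence

observed = np.array([22, 31, 27, 20])          # counts in k = 4 categories, n = 100
p0 = np.array([0.25, 0.25, 0.25, 0.25])        # null-hypothesis probabilities
expected = observed.sum() * p0                 # E_i = n * p_i

chi2_stat, chi2_p = chisquare(observed, f_exp=expected)            # Pearson chi-squared test
g2_stat, g2_p = power_divergence(observed, f_exp=expected,
                                 lambda_="log-likelihood")         # likelihood ratio (G^2) test
print(chi2_stat, chi2_p)
print(g2_stat, g2_p)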

Confidence Intervals and Equivalence Tests

Confidence intervals for the parameters of a multinomial distribution provide a range of plausible values for the true proportions \mathbf{p} = (p_1, \dots, p_k), leveraging the asymptotic normality of the maximum likelihood estimator \hat{\mathbf{p}} = \mathbf{X}/n, where \mathbf{X} denotes the count vector. For individual proportions p_i, the marginal distribution of X_i is binomial with parameters n and p_i, allowing direct application of binomial confidence interval methods adjusted for the multinomial structure. The Wilson score interval, which inverts a normal approximation test and centers on an adjusted proportion, extends effectively to these marginals and performs well even for small samples or extreme proportions. Similarly, the Agresti-Coull interval adds pseudocounts (typically 2 per category) to the observed counts before applying the normal approximation, improving coverage over the standard Wald interval for binomial marginals in the multinomial setting. For differences between proportions p_i - p_j (i \neq j), confidence intervals account for the dependence structure, with the asymptotic variance given by \frac{1}{n} [p_i(1 - p_i) + p_j(1 - p_j) + 2 p_i p_j]. A Wald-type interval uses the plug-in estimator \hat{p}_i - \hat{p}_j \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}_i (1 - \hat{p}_i)}{n} + \frac{\hat{p}_j (1 - \hat{p}_j)}{n} + 2 \frac{\hat{p}_i \hat{p}_j}{n} }, where z_{\alpha/2} is the standard normal quantile; this can be adjusted for simultaneity across multiple contrasts using methods like Goodman's procedure, which employs chi-square quantiles with 1 degree of freedom at level \alpha / [k(k-1)/2] to control the family-wise error rate for all pairwise differences. Simultaneous intervals for all pairwise differences or linear contrasts ensure joint coverage, with Goodman's approach providing conservative yet reliable bounds based on chi-squared quantiles with a Bonferroni-type adjustment adapted for multinomials. Equivalence tests for multinomial proportions assess whether the true \mathbf{p} lies within a predefined equivalence region, often using the two one-sided tests (TOST) procedure to reject hypotheses of meaningful deviation. For a single proportion p_i, TOST applies the binomial version by testing H_{01}: p_i \leq p_{iL} and H_{02}: p_i \geq p_{iU} (with equivalence bounds p_{iL} < p_i < p_{iU}), rejecting non-equivalence if both one-sided tests reject at level \alpha; this extends to multinomials by applying TOST marginally or to contrasts like p_i - p_j, bounding deviations from a nominal vector. For the full vector, joint equivalence can bound the maximum deviation \max_i |p_i - p_{i0}| < \delta using intersection-union tests, though coverage requires simulation calibration for small samples. In matched-pairs designs, where each pair yields a multinomial outcome across k categories (reducing the k \times k table to a k(k-1)-dimensional multinomial for off-diagonals), McNemar's test for binary outcomes extends to test marginal homogeneity via the Stuart-Maxwell statistic, \mathbf{D}^T \hat{\mathbf{V}}^{-1} \mathbf{D}, where \mathbf{D} is the vector of differences between corresponding row and column marginal totals and \hat{\mathbf{V}} its estimated covariance; this is asymptotically chi-squared with k-1 degrees of freedom under the null of equal marginals. Equivalence versions adapt TOST to these differences, testing bounds on \mathbf{D} for practical similarity in paired multinomial data. Recent advances incorporate simulation-based methods for more accurate multinomial confidence intervals, particularly for linear combinations \sum c_i p_i in sparse or small-sample settings.
Fiducial inference provides exact intervals by inverting the multinomial cumulative distribution, with Monte Carlo simulation approximating tail probabilities for coverage rates close to nominal levels (e.g., 95%) across various n and \mathbf{p}; a 2021 study demonstrated superior performance over asymptotic methods in simulations with k \leq 5 and n < 100. Parametric bootstrap approaches, resampling from the fitted multinomial, further refine confidence intervals by estimating percentile intervals for deviations, enhancing reliability in multi-category analyses.
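As an illustration of the Wald-type interval for a difference of proportions described above, the sketch below (assuming NumPy and SciPy; the counts are hypothetical) applies the plug-in variance that incorporates the negative covariance between categories.

import numpy as np
from scipy.stats import norm

x = np.array([45, 30, 25])                 # hypothetical observed counts, n = 100
n = x.sum()
p_hat = x / n
i, j = 0, 1                                # compare the first two categories

diff = p_hat[i] - p_hat[j]
var = (p_hat[i] * (1 - p_hat[i]) + p_hat[j] * (1 - p_hat[j])
       + 2 * p_hat[i] * p_hat[j]) / n      # accounts for Cov(p_hat_i, p_hat_j) = -p_i p_j / n
z = norm.ppf(0.975)
print(diff - z * np.sqrt(var), diff + z * np.sqrt(var))   # 95% Wald-type interval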

Applications

Statistical Modeling

The multinomial distribution serves as the foundational likelihood model in multinomial logistic regression, also known as softmax regression, where the category probabilities \mathbf{p} = (p_1, \dots, p_K) are modeled as a function of covariates \mathbf{x} through the softmax function. Specifically, for K categories, the probability for category k is given by p_k = \frac{\exp(\mathbf{x}^T \boldsymbol{\beta}_k)}{\sum_{j=1}^K \exp(\mathbf{x}^T \boldsymbol{\beta}_j)}, where \boldsymbol{\beta}_k are the parameters for category k, ensuring \sum_k p_k = 1. This approach extends binary logistic regression to multiclass settings by normalizing the exponential linear predictors, allowing the estimation of category-specific effects via maximum likelihood. In log-linear models for contingency tables, the multinomial distribution arises under fixed marginal sampling, but the Poisson-multinomial equivalence facilitates estimation by reparameterizing the problem as independent Poisson log-linear models, whose conditional distribution matches the multinomial. This equivalence, first noted in the context of maximum likelihood for multi-way tables, enables the use of standard generalized linear model software for fitting hierarchical log-linear structures to capture associations among categorical variables. For instance, under multinomial sampling with fixed totals, the log-expected frequencies \log \mu_{ij\dots} = \mathbf{z}_{ij\dots}^T \boldsymbol{\gamma} are modeled, where \mathbf{z} are design terms for main effects and interactions. Overdispersion in multinomial data, where observed variability exceeds that predicted by the standard multinomial variance \text{Var}(Y_k) = n p_k (1 - p_k), can be addressed using the Dirichlet-multinomial distribution, which introduces a Dirichlet prior on the probabilities to inflate the variance. This compound distribution has mean E[Y_k] = n p_k but variance \text{Var}(Y_k) = n p_k (1 - p_k) \frac{\alpha + n}{\alpha + 1}, where \alpha > 0 controls the overdispersion level, with smaller \alpha yielding greater inflation. Alternatively, the quasi-multinomial model adjusts the variance-covariance by a dispersion factor \phi > 1, scaling the standard multinomial variances while preserving the mean structure, suitable for robust inference without specifying a full alternative distribution. This quasi-likelihood approach estimates \phi from Pearson residuals and adjusts standard errors accordingly. Model selection for multinomial fits, such as in logistic or log-linear models, commonly employs information criteria like the Akaike information criterion (AIC) and Bayesian information criterion (BIC) to balance goodness-of-fit against model complexity. The AIC is defined as -2 \ell + 2 d, where \ell is the maximized log-likelihood and d is the number of parameters, while BIC imposes a stronger penalty: -2 \ell + d \log n for sample size n. These criteria facilitate comparison of nested or non-nested models, with BIC favoring sparser solutions in large samples, as applied in selecting covariate structures for multinomial regression models.
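A minimal sketch of the softmax link (assuming NumPy; the covariates and coefficient vectors are hypothetical, with the first category as reference) is given below.

import numpy as np

def softmax_probs(x, betas):
    """Category probabilities p_k = exp(x^T beta_k) / sum_j exp(x^T beta_j)."""
    scores = np.array([x @ b for b in betas])
    scores -= scores.max()                  # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

x = np.array([1.0, 2.0])                    # covariate vector
betas = [np.array([0.0, 0.0]),              # reference category (coefficients fixed at zero)
         np.array([0.5, -0.2]),
         np.array([-1.0, 0.3])]             # illustrative coefficient vectors
print(softmax_probs(x, betas))              # category probabilities summing to 1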

Natural and Social Sciences

In population genetics, the multinomial distribution models the frequencies of genotypes or alleles in a sample, particularly under the assumptions of Hardy-Weinberg equilibrium, where observed genotype counts across categories (homozygous dominant, heterozygous, homozygous recessive) follow a multinomial sampling process given fixed allele frequencies. For multi-allelic loci, extensions of the Hardy-Weinberg model use multinomial likelihoods to estimate allele frequencies and test deviations due to factors such as selection, enabling inference on evolutionary processes from observed genotype counts. This approach is fundamental in statistical genetics for analyzing large-scale genomic datasets, such as those from genotyping arrays, to detect deviations from equilibrium that signal non-random mating or migration. In ecology, the multinomial distribution underpins models of species abundance distributions (SADs), describing the counts of individuals across species categories in a sampled community, often assuming a fixed total sample size. Neutral biodiversity theory, for instance, employs the zero-sum multinomial distribution to predict relative abundances in local communities, where immigration and stochastic birth-death processes generate the observed counts, providing a null model against which to test ecological assembly rules. This framework has been applied to biodiversity inventories and surveys to quantify community structure and dispersal limitations, revealing patterns like the prevalence of rare species. In the social sciences, the multinomial distribution is used to analyze categorical responses in surveys, such as voter preferences across multiple parties or options, treating each response as a draw from a multinomial process with category-specific probabilities. For example, in electoral studies, it models vote shares as multinomial counts from a fixed electorate size, allowing estimation of preference probabilities via multinomial logistic regression to assess factors influencing choice, like socioeconomic variables. This application extends to polling, where multinomial models evaluate response distributions on multi-category items (e.g., approval ratings across scales), informing predictions of aggregate behaviors in diverse populations. In epidemiology, the multinomial distribution captures disease outcomes across categories (e.g., cured, improved, worsened, deceased) in clinical and cohort studies, modeling counts of patients falling into each outcome given exposure probabilities. It is particularly valuable for heterogeneous outcomes, where risk factors differentially affect categories, enabling multinomial regression to quantify associations while accounting for the distribution of outcomes in treated versus control groups. Applications include analyzing longitudinal data for multi-state disease progression, such as in cancer research, to estimate transition probabilities and evaluate intervention effects on categorical endpoints. Historically, the roots of exact inference methods like Fisher's exact test trace back to multinomial sampling in contingency tables, where Ronald A. Fisher developed the approach in 1935 to test independence in small samples by conditioning on marginal totals, generalizing from binomial to multinomial frameworks for multi-category data. This innovation addressed limitations of chi-squared approximations in sparse multinomial settings, influencing modern exact tests for categorical data in experimental designs across sciences.

Machine Learning and Data Analysis

The multinomial distribution plays a foundational role in applications involving categorical data, particularly in probabilistic modeling of high-dimensional discrete observations such as text or multi-class counts. It serves as a core component in generative models that capture dependencies among categories, enabling tasks like classification, clustering, and topic modeling. In these contexts, the distribution models the joint probability of multiple outcomes, often under assumptions of conditional independence or mixtures to handle complexity in real-world datasets. A prominent application is latent Dirichlet allocation (LDA), a generative probabilistic model for discovering latent topics in collections of discrete data, such as text corpora. In LDA, each document is represented as a mixture of topics, where the topic proportions follow a Dirichlet distribution, and each topic is a multinomial distribution over the vocabulary of words; conditionally, given the topic mixture, the words in a document are drawn from a multinomial distribution. This structure allows LDA to infer hidden topic structures by treating observed word counts as multinomial samples, facilitating unsupervised topic modeling in large-scale text analysis. The model was introduced as an extension of earlier Bayesian approaches to handle the challenges of inferring latent variables in discrete data settings. The multinomial variant of the naive Bayes classifier is widely used for text classification tasks, where documents are modeled as multinomial distributions over word features in a bag-of-words representation. Under the naive independence assumption, the classifier estimates class-conditional multinomial parameters from training data and predicts the class maximizing the posterior probability, which decomposes into prior and likelihood terms. This approach excels in high-dimensional sparse settings typical of text, such as spam detection or news categorization, due to its computational efficiency and robustness to irrelevant features. Empirical comparisons have shown the multinomial event model outperforming alternatives like the multivariate Bernoulli event model in most text benchmarks. For parameter estimation in multinomial mixture models, the Expectation-Maximization (EM) algorithm provides an iterative procedure to maximize the likelihood when latent variables, such as mixture components, are unobserved. In the E-step, posterior probabilities of component assignments are computed using current parameter estimates; in the M-step, these serve as weighted sufficient statistics to update the multinomial mixing proportions and component parameters. This procedure converges to a local maximum of the observed-data log-likelihood, making it suitable for fitting mixtures of multinomials in applications like document clustering or population segmentation. The algorithm's application to mixture densities addresses the incompleteness of data in probabilistic frameworks. High-dimensional challenges in these models often arise from sparsity in bag-of-words representations, where most word-category entries are zero due to limited overlap across documents. This sparsity can lead to unstable estimates and inflated variance in likelihood computations, particularly when the number of categories vastly exceeds the sample size. Mitigation strategies include regularization techniques, such as adding pseudocounts to multinomial parameters, or dimensionality reduction of the feature space, which preserve predictive performance in sparse regimes. These issues are exacerbated in text data, where vocabulary sizes can reach tens of thousands, demanding scalable approximations.
Recent advancements post-2023 have integrated the multinomial distribution with transformer architectures for generating categorical data, particularly in tabular and sequential settings. For instance, diffusion transformers model categorical features via multinomial diffusion processes, where noise is added to categorical states and reversed through transformer-based denoising to synthesize realistic multi-category samples. This approach leverages the parallel computation of transformers to handle heterogeneous tabular data, improving upon autoregressive methods in capturing complex dependencies among categorical variables. Such integrations enable synthetic data generation for privacy-preserving analysis and for augmenting imbalanced categorical datasets.

Computational Methods

Random Variate Generation

One common approach to generating a multinomial random variate with parameters n and \mathbf{p} = (p_1, \dots, p_k) is the direct method, which simulates n independent and identically distributed (i.i.d.) categorical random variables, each following a categorical distribution with probabilities \mathbf{p}, and then counts the frequency of each category to obtain the counts (X_1, \dots, X_k). This leverages the interpretive definition of the multinomial as arising from n independent trials, each resulting in one of k outcomes. To generate each categorical variate efficiently, the inverse transform method can be used, which forms the cumulative probabilities and selects a category via a uniform random number, or the alias method, which preprocesses an alias table for constant-time sampling per draw after an initial setup. An alternative is the sequential conditional binomial method, which exploits the fact that the marginal distribution of the first component X_1 is binomial with parameters n and p_1, the conditional distribution of X_2 given X_1 = x_1 is binomial with parameters n - x_1 and p_2 / (1 - p_1), and so on for subsequent components until the last count is determined by subtraction to ensure the total sums to n. This approach generates k-1 binomial variates sequentially, adjusting the remaining trials and renormalized probabilities at each step, and is particularly useful when k is small relative to n. The direct method runs in O(n) time per variate after an O(k) setup for the categorical sampling table, while the conditional method requires only k-1 binomial draws with negligible setup, though each binomial generation may involve additional overhead when n is large. To validate simulated multinomial variates, one standard practice is to compute empirical moments from multiple replications and verify they align with theoretical values, such as the expected count E[X_i] = n p_i and variance \mathrm{Var}(X_i) = n p_i (1 - p_i) for each i, along with covariances \mathrm{Cov}(X_i, X_j) = -n p_i p_j for i \neq j. Deviations beyond expected sampling variability indicate issues, and this moment-matching check ensures the simulations faithfully reproduce the distribution's properties for downstream applications.
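The direct method and its moment-matching validation can be sketched as follows (assuming NumPy; the parameters and replication count are illustrative).

import numpy as np

def multinomial_direct(n, p, rng):
    """Direct method: draw n categorical outcomes and tally the counts per category."""
    outcomes = rng.choice(len(p), size=n, p=p)       # n i.i.d. categorical draws
    return np.bincount(outcomes, minlength=len(p))

rng = np.random.default_rng(4)
n, p = 1000, [0.5, 0.3, 0.2]
print(multinomial_direct(n, p, rng))                 # one variate; counts sum to n

# Moment-matching validation over many replications.
reps = np.array([multinomial_direct(n, p, rng) for _ in range(5_000)])
print(reps.mean(axis=0))                             # close to n * p = [500, 300, 200]
print(np.cov(reps.T)[0, 1], -n * p[0] * p[1])        # Cov(X_1, X_2) close to -n p_1 p_2 = -150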

Sampling Algorithms

One efficient algorithm for generating samples from the multinomial distribution \operatorname{Mult}(n, \mathbf{p}), where \mathbf{p} = (p_1, \dots, p_k) and \sum_{i=1}^k p_i = 1, is the sequential conditional method. This approach decomposes the joint distribution into a chain of conditional binomial distributions, leveraging the property that the marginal distribution of X_1 is \operatorname{Bin}(n, p_1), the conditional distribution of X_2 given X_1 is \operatorname{Bin}(n - X_1, p_2 / (1 - p_1)), and so on for subsequent components. The algorithm proceeds iteratively, updating the remaining trials and unallocated probability mass at each step. The following sketch renders the method in Python (using NumPy's binomial generator; the function name is illustrative):
import numpy as np

def sample_multinomial(n, p, rng=None):
    """Draw one sample from Mult(n, p) via sequential conditional binomials."""
    rng = rng if rng is not None else np.random.default_rng()
    k = len(p)
    x = np.zeros(k, dtype=int)
    remaining = n
    cum_p = 1.0  # probability mass not yet allocated to earlier categories
    for i in range(k - 1):
        if remaining == 0:
            break
        prob = min(p[i] / cum_p, 1.0)         # guard against floating-point overshoot
        x[i] = rng.binomial(remaining, prob)  # draw from Bin(remaining, prob)
        remaining -= x[i]
        cum_p -= p[i]
    x[k - 1] = remaining                      # last category receives all leftover trials
    return x
This generates an exact sample without approximation or rejection steps. The sequential method offers several advantages: it is exact by construction, avoids the computational overhead of simulating individual trials (which scales with n), and relies on efficient binomial generators, making it particularly suitable for large n where the number of iterations depends only on k. For binomial draws, standard libraries use optimized techniques such as inverse transform or rejection-based generators, ensuring approximately constant-time performance per draw independent of n. An extension applies to the Dirichlet-multinomial distribution, a compound model where \mathbf{p} \sim \operatorname{Dir}(\boldsymbol{\alpha}) for concentration parameters \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_k), followed by \mathbf{X} \sim \operatorname{Mult}(n, \mathbf{p}). To sample, first generate \mathbf{p} from the Dirichlet by drawing independent \Gamma(\alpha_i, 1) variates G_i (using standard gamma generators) and normalizing p_i = G_i / \sum_{j=1}^k G_j; then apply the multinomial sampling with this \mathbf{p}. This yields exact samples and preserves the overdispersion relative to the plain multinomial.
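The compound sampler can be sketched with NumPy's gamma generator (the concentration parameters below are illustrative); NumPy's Generator also offers a dirichlet method that wraps the same normalization.

import numpy as np

def sample_dirichlet_multinomial(n, alpha, rng):
    """Compound sampling: p ~ Dirichlet(alpha) via gamma normalization, then X ~ Mult(n, p)."""
    g = rng.gamma(shape=np.asarray(alpha), scale=1.0)   # independent Gamma(alpha_i, 1) variates
    p = g / g.sum()                                     # normalize to a Dirichlet draw
    return rng.multinomial(n, p)

rng = np.random.default_rng(5)
print(sample_dirichlet_multinomial(20, [0.5, 1.0, 2.0], rng))   # one Dirichlet-multinomial draw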

Software Implementations

The multinomial distribution is implemented in several major statistical and programming environments, providing functions for probability evaluation, random sampling, and parameter estimation. These implementations facilitate applications in simulation, modeling, and machine learning tasks. In R, the base stats package includes dmultinom for computing the probability mass function and rmultinom for generating random variates from the multinomial distribution, where the latter takes parameters for the number of observations, trials per observation, and category probabilities. Multinomial logistic regression models are typically fit with the multinom function from the nnet package, which extends beyond the binomial responses handled by glm. Python offers robust support through multiple libraries. The numpy.random.multinomial function generates samples from the distribution, accepting arguments for the number of trials, probability vector, and sample size. For more comprehensive statistical operations, scipy.stats.multinomial provides methods like pmf for probability mass, logpmf for log-probability, rvs for random variates, and entropy for information measures, with support for frozen distributions to fix parameters. In machine learning contexts, scikit-learn's MultinomialNB class implements the multinomial naive Bayes classifier, which assumes features follow a multinomial distribution and is optimized for discrete count data such as text frequencies. MATLAB's Statistics and Machine Learning Toolbox includes mnrnd for generating random multinomial variates, specified by the number of trials and probability vector, and mnrfit (with fitmnr recommended in newer versions) for estimating parameters in multinomial regression models from response counts and predictors. Julia's Distributions.jl package supports the multinomial distribution via the Multinomial type, enabling random sampling with rand, probability evaluation with pdf and logpdf, and moment calculations, integrated into a broader framework for univariate and multivariate distributions. For deep probabilistic modeling, TensorFlow Probability provides the tfp.distributions.Multinomial class, which supports GPU-accelerated operations for sampling and log-probability computations through TensorFlow's backend, with enhancements in versions post-2023 improving scalability for large-scale probabilistic modeling on multi-GPU setups.
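A brief sketch of the NumPy and SciPy interfaces named above (the parameter values are illustrative):

import numpy as np
from scipy.stats import multinomial

rng = np.random.default_rng(6)
p = [0.5, 0.3, 0.2]

print(rng.multinomial(10, p, size=5))         # NumPy: five samples of 10 trials each
dist = multinomial(n=10, p=p)                 # SciPy: frozen distribution with fixed parameters
print(dist.pmf([5, 3, 2]), dist.logpmf([5, 3, 2]))
print(dist.rvs(size=3))                       # random variates from the frozen distribution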