Categorical distribution
The categorical distribution, also known as the multinomial distribution with a single trial, is a discrete probability distribution that models the probability of observing one out of K mutually exclusive and collectively exhaustive outcomes, where K is a positive integer representing the number of categories.[1] It is parameterized by a K-dimensional probability vector π = (π₁, π₂, ..., π_K), where each π_i ≥ 0 and ∑_{i=1}^K π_i = 1, such that the probability mass function is given by P(X = k) = π_k for k = 1, 2, ..., K.[1] This distribution generalizes the Bernoulli distribution (which corresponds to K=2) and serves as a foundational model for discrete choice scenarios, such as a single roll of a K-sided die.[2]
In statistical modeling, the categorical distribution is characterized by its mean vector E[X] = π (where X is the one-hot encoded outcome vector) and variance-covariance matrix with diagonal elements π_i(1 - π_i) and off-diagonal elements -π_i π_j for i ≠ j, reflecting the dependence structure across categories.[1] The Dirichlet distribution is its conjugate prior: if the prior over the parameters π is Dirichlet, the posterior after observing categorical data remains Dirichlet-distributed, which facilitates Bayesian inference.[1] Common parameterizations include the direct use of π or transformations like the softmax function, which maps a real-valued vector ψ to π via π_k = exp(ψ_k) / ∑_{j=1}^K exp(ψ_j), enabling gradient-based optimization in machine learning.[3]
The categorical distribution plays a central role in numerous applications, including multiclass classification (e.g., assigning labels in image recognition), natural language processing (e.g., next-word prediction in language models), and recommendation systems (e.g., selecting items from a finite set of options).[3] For large K, such as vocabularies exceeding 10,000 words or high-resolution image classes, direct computation of the distribution can be inefficient due to linear scaling in K, prompting scalable inference methods like the Augment-and-Reduce approach to approximate likelihoods without enumerating all categories.[3] Its simplicity and interpretability make it indispensable in probabilistic graphical models and exponential family frameworks, where it often represents observed or latent variables.[4]
Terminology and Parameters
The categorical distribution is a discrete probability distribution defined over a finite set of K mutually exclusive categories or outcomes, where K is a positive integer representing the number of possible distinct results.[5] It models scenarios in which a random event results in exactly one of these categories, such as the outcome of rolling a die or selecting a class label in a classification task.
In standard notation, the random variable X associated with the categorical distribution takes values in the finite support \{1, 2, \dots, K\}, with no probability mass outside this set and no extension to a continuous domain.[6] The distribution is fully parameterized by a probability vector \theta = (\theta_1, \theta_2, \dots, \theta_K), where each \theta_k \geq 0 denotes the probability assigned to category k, and these probabilities satisfy the normalization condition \sum_{k=1}^K \theta_k = 1. This vector \theta encapsulates all information needed to specify the distribution, directly representing the relative likelihoods of each category.[6]
The categorical distribution generalizes the Bernoulli distribution—which applies specifically to the case K=2, modeling binary outcomes—to an arbitrary number of categories.[5] It is named for its role in describing categorical variables, which are qualitative attributes divided into discrete, non-ordered classes, particularly in statistical classification and data analysis contexts. It is also the special case of the multinomial distribution with a single trial (n = 1).[5]
Probability Mass Function
The probability mass function (PMF) of a categorical random variable X taking values in the finite set \{1, 2, \dots, K\} with parameter vector \boldsymbol{\theta} = (\theta_1, \theta_2, \dots, \theta_K) is defined as
P(X = k \mid \boldsymbol{\theta}) = \theta_k, \quad k = 1, 2, \dots, K,
where each \theta_k \geq 0 represents the probability of outcome k.[7][2]
To explicitly account for the support, the PMF can be written using the indicator function I(\cdot):
P(X = k \mid \boldsymbol{\theta}) = \theta_k \cdot I(k \in \{1, \dots, K\}),
which ensures the probability is zero for values outside the defined categories.[8]
The parameters must satisfy the normalization constraint
\sum_{k=1}^K \theta_k = 1,
guaranteeing that the probabilities sum to unity over all possible outcomes.[7][4]
In vector notation, particularly common in machine learning contexts, the outcome X can be represented as a one-hot encoded vector \mathbf{x} \in \{0,1\}^K with exactly one entry equal to 1, and the PMF is
P(X = \mathbf{x} \mid \boldsymbol{\theta}) = \prod_{k=1}^K \theta_k^{x_k} = \boldsymbol{\theta}^\top \mathbf{x},
where the dot product selects the corresponding probability.[9][8]
For example, consider K=3 and \boldsymbol{\theta} = (0.2, 0.5, 0.3); then P(X=2 \mid \boldsymbol{\theta}) = 0.5.[7]
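As a minimal illustration in Python with NumPy (an assumed toolchain, not one prescribed by the sources), both the direct-indexing and one-hot forms of the PMF can be evaluated for this example:

import numpy as np

theta = np.array([0.2, 0.5, 0.3])   # K = 3 parameter vector

# Direct indexing: P(X = 2) with 1-based category labels
k = 2
p_direct = theta[k - 1]             # 0.5

# One-hot form: the dot product picks out the same probability
x = np.zeros(len(theta))
x[k - 1] = 1.0                      # one-hot encoding of outcome 2
p_onehot = theta @ x                # also 0.5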
This PMF arises as a specialization of the general form for discrete distributions, where the probability is assigned directly to each point in a finite support set without additional structure beyond the normalization condition.[4][2]
Properties
Moments and Expectations
A categorically distributed outcome can be represented as a scalar random variable X taking values in \{1, 2, \dots, K\} with probabilities \theta_1, \theta_2, \dots, \theta_K, where \sum_{k=1}^K \theta_k = 1 and \theta_k \geq 0 for all k. The r-th raw moment of X is m_r = \mathbb{E}[X^r] = \sum_{k=1}^K k^r \theta_k.[10] In particular, the first raw moment is the expected value \mu = \mathbb{E}[X] = \sum_{k=1}^K k \theta_k.[10] The variance follows from the second raw moment as \mathrm{Var}(X) = \mathbb{E}[X^2] - \mu^2 = \sum_{k=1}^K k^2 \theta_k - \mu^2, or equivalently \mathrm{Var}(X) = \sum_{k=1}^K (k - \mu)^2 \theta_k.[10]
The central moments provide further characterization, with the r-th central moment defined as \mu_r = \mathbb{E}[(X - \mu)^r] = \sum_{k=1}^K (k - \mu)^r \theta_k. The skewness \gamma, measuring asymmetry, is the standardized third central moment \gamma = \mu_3 / \sigma^3, where \sigma^2 = \mathrm{Var}(X).[11] The kurtosis \kappa, measuring tailedness, is the standardized fourth central moment \kappa = \mu_4 / \sigma^4.[11]
Alternatively, the categorical distribution arises as a vector of indicator random variables \mathbf{I} = (I_1, I_2, \dots, I_K), where I_j = 1 if category j is selected and I_j = 0 otherwise, with exactly one I_j = 1. The expected value of each indicator is \mathbb{E}[I_j] = \theta_j.[12] The variance of each indicator is \mathrm{Var}(I_j) = \theta_j (1 - \theta_j), and for j \neq k, the covariance is \mathrm{Cov}(I_j, I_k) = -\theta_j \theta_k.[12]
For example, consider K=2 with \theta = (0.5, 0.5) and labels 1 and 2. Then \mu = \mathbb{E}[X] = 1 \cdot 0.5 + 2 \cdot 0.5 = 1.5 and \mathrm{Var}(X) = (1 - 1.5)^2 \cdot 0.5 + (2 - 1.5)^2 \cdot 0.5 = 0.25.[10]
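The moment formulas above can be checked numerically; the following Python sketch (NumPy assumed) reproduces the example and the indicator covariance matrix:

import numpy as np

theta = np.array([0.5, 0.5])                 # K = 2, labels 1 and 2
labels = np.arange(1, len(theta) + 1)

mu = np.sum(labels * theta)                  # E[X] = 1.5
var = np.sum((labels - mu) ** 2 * theta)     # Var(X) = 0.25

# Covariance matrix of the one-hot indicator vector:
# diag(theta) - theta theta^T
cov = np.diag(theta) - np.outer(theta, theta)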
Mode and Entropy
The mode of the categorical distribution is the outcome k that maximizes the probability mass function, specifically k = \arg\max_j \theta_j.[13] If multiple outcomes share the maximum probability, the distribution is multimodal, with each such outcome serving as a mode. In the case of a uniform distribution, where \theta_k = 1/K for all k = 1, \dots, K, every category qualifies as a mode.[13]
The entropy of the categorical distribution quantifies the expected uncertainty associated with a random outcome drawn from it. It is given by the Shannon entropy formula:
H(\theta) = -\sum_{k=1}^K \theta_k \log \theta_k,
where the base of the logarithm determines the units: natural logarithm for nats or base-2 for bits.[14] This quantity equals the expected value \mathbb{E}[-\log P(X = k)], with the expectation computed over the distribution parameterized by \theta.[15]
The entropy achieves its maximum value of \log K (in the same units as the logarithm) when the distribution is uniform, \theta_k = 1/K for all k, as this configuration maximizes uncertainty subject to the constraint of K possible outcomes.[15]
For illustration, consider a binary categorical distribution with \theta = (0.9, 0.1). The mode is the first category, and the entropy is H(\theta) = -0.9 \log 0.9 - 0.1 \log 0.1 \approx 0.325 nats (using the natural logarithm). To arrive at this value, compute \log 0.9 \approx -0.1054 and \log 0.1 \approx -2.3026, yielding -0.9 \times (-0.1054) - 0.1 \times (-2.3026) = 0.0948 + 0.2303 = 0.3251. By comparison, the uniform binary distribution has entropy \log 2 \approx 0.693 nats, obtained directly from the maximum entropy formula.
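A short Python check (NumPy assumed) of the mode and entropy calculations:

import numpy as np

theta = np.array([0.9, 0.1])
mode = int(np.argmax(theta)) + 1    # category 1, with 1-based labels

# Shannon entropy in nats; terms with theta_k = 0 contribute zero
nz = theta[theta > 0]
H = -np.sum(nz * np.log(nz))        # ≈ 0.325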
Parameter Estimation
Maximum Likelihood Estimation
Given a sample of n independent and identically distributed observations x_1, \dots, x_n from a categorical distribution with probability vector \theta = (\theta_1, \dots, \theta_K)^\top satisfying \sum_{k=1}^K \theta_k = 1 and \theta_k > 0 for all k, the likelihood function is the product of the probability mass functions evaluated at the observations:
L(\theta) = \prod_{i=1}^n \theta_{x_i}.
This likelihood treats the categorical distribution as a special case of the multinomial distribution with one trial per observation.[16]
To facilitate maximization, consider the log-likelihood \ell(\theta) = \log L(\theta). Define n_k = \sum_{i=1}^n \mathbf{1}_{\{x_i = k\}} as the observed count for category k, so \sum_{k=1}^K n_k = n. Then,
\ell(\theta) = \sum_{k=1}^K n_k \log \theta_k.
Maximizing \ell(\theta) subject to the constraint \sum_{k=1}^K \theta_k = 1 requires incorporating a Lagrange multiplier \lambda, yielding the Lagrangian \mathcal{L}(\theta, \lambda) = \sum_{k=1}^K n_k \log \theta_k + \lambda \left(1 - \sum_{k=1}^K \theta_k\right). Differentiating with respect to \theta_k gives \partial \mathcal{L}/\partial \theta_k = n_k / \theta_k - \lambda = 0, so \theta_k = n_k / \lambda. Summing over k and applying the constraint implies \lambda = n, hence the maximum likelihood estimator (MLE) is the vector of empirical frequencies \hat{\theta}_k = n_k / n for each k = 1, \dots, K. This closed-form solution arises directly from the sufficient statistic consisting of the category counts.[16]
The MLE \hat{\theta} possesses desirable statistical properties under standard regularity conditions. It is unbiased, with E[\hat{\theta}_k] = \theta_k for each k, and consistent, converging in probability to the true \theta as n \to \infty.[17] Furthermore, \hat{\theta} is asymptotically normal: \sqrt{n} (\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma), where the covariance matrix \Sigma has diagonal elements \Sigma_{kk} = \theta_k (1 - \theta_k) and off-diagonal elements \Sigma_{kl} = -\theta_k \theta_l for k \neq l. The asymptotic variance of each component is thus \operatorname{Var}(\hat{\theta}_k) \approx \theta_k (1 - \theta_k)/n, reflecting the multinomial sampling variability.[17][16]
For illustration, consider n=3 observations \{1, 2, 1\} from a categorical distribution over K=3 categories. The counts are n_1 = 2, n_2 = 1, n_3 = 0, yielding \hat{\theta} = (2/3, 1/3, 0)^\top. Note that unobserved categories receive zero probability mass under this estimator.[16]
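The closed-form MLE amounts to counting; a Python sketch (NumPy assumed) for the worked example:

import numpy as np

x = np.array([1, 2, 1])                        # observed labels in 1..K
K = 3
counts = np.bincount(x, minlength=K + 1)[1:]   # n = (2, 1, 0)
theta_hat = counts / counts.sum()              # (2/3, 1/3, 0)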
Method of Moments
The method of moments provides an alternative approach to parameter estimation for the categorical distribution by equating population moments to their sample counterparts. For a categorical random variable X taking values in \{1, 2, \dots, K\} with probabilities \theta = (\theta_1, \dots, \theta_K), the first raw moment is \mathbb{E}[X] = \sum_{k=1}^K k \theta_k. The corresponding sample moment is the sample mean \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, where x_1, \dots, x_n is a random sample. Equating these yields \sum_{k=1}^K k \theta_k = \bar{x}, but this single equation underdetermines the K-1 free parameters (given \sum_{k=1}^K \theta_k = 1) when K > 2.[18]
To resolve this, the method employs indicator functions for each category: define I_k(X) = 1 if X = k and 0 otherwise, so \mathbb{E}[I_k(X)] = \theta_k. The sample analogue is the proportion \hat{\theta}_k = \frac{1}{n} \sum_{i=1}^n I_k(x_i) = \frac{n_k}{n}, where n_k is the count of observations equal to k. Equating moments gives the estimators \hat{\theta}_k = n_k / n for k = 1, \dots, K, which automatically satisfy the constraint \sum_{k=1}^K \hat{\theta}_k = 1.[19]
For the categorical distribution, this method of moments reduces to empirical frequency counting, which coincides exactly with the maximum likelihood estimator.[20]
Although the method of moments is straightforward here, it can be less statistically efficient than maximum likelihood estimation when the moment conditions involve higher-order moments that are nonlinear functions of the parameters. For the categorical distribution, however, the indicator-based moment conditions make such extensions unnecessary, and the method chiefly serves to illustrate moment-matching principles.[18]
As an example, suppose a sample of n = 100 observations yields counts n_1 = 30, n_2 = 40, n_3 = 30 for a 3-category distribution. The method of moments estimators are \hat{\theta}_1 = 0.30, \hat{\theta}_2 = 0.40, \hat{\theta}_3 = 0.30, identical to those from maximum likelihood.[19]
Conjugate Prior and Posterior
In Bayesian inference for the categorical distribution, the Dirichlet distribution serves as the conjugate prior for the probability parameter vector \boldsymbol{\theta} = (\theta_1, \dots, \theta_K). The prior is denoted \boldsymbol{\theta} \sim \text{Dir}(\boldsymbol{\alpha}), where \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K) with each \alpha_k > 0, and its density is proportional to \prod_{k=1}^K \theta_k^{\alpha_k - 1} over the (K-1)-simplex \sum_{k=1}^K \theta_k = 1, \theta_k \geq 0.[21]
The hyperparameters \alpha_k represent pseudo-counts, akin to prior observations for category k, with the total prior strength given by \sum_{k=1}^K \alpha_k. A uniform prior on the simplex arises when \alpha_k = 1 for all k, equivalent to \text{Dir}(1, \dots, 1).[22]
For a dataset consisting of N independent draws from the categorical distribution, let n_k denote the observed count for category k, so \sum_{k=1}^K n_k = N; the resulting likelihood takes a multinomial form. The posterior distribution is then \boldsymbol{\theta} \mid \mathbf{n} \sim \text{Dir}(\boldsymbol{\alpha} + \mathbf{n}), with updated parameters \alpha_k' = \alpha_k + n_k for each k.[21]
This conjugacy follows directly from the product of the prior density and likelihood, which is proportional to \prod_{k=1}^K \theta_k^{\alpha_k + n_k - 1} and thus shares the Dirichlet kernel form after normalization.[23]
The maximum a posteriori (MAP) estimate, obtained as the mode of the posterior (assuming \alpha_k + n_k > 1), is \theta_{\text{MAP},k} = \frac{\alpha_k + n_k - 1}{\sum_{j=1}^K (\alpha_j + n_j) - K}; the prior thus contributes \alpha_k - 1 pseudo-counts of smoothing to each category, and setting \alpha_k = 2 for all k recovers add-one (Laplace) smoothing.[22]
As an illustration, for K=3 categories with uniform prior \boldsymbol{\alpha} = (1,1,1) and counts \mathbf{n} = (2,1,0), the posterior is \text{Dir}(3,2,1).[22]
When \alpha_k = 1 for all k (the uniform prior), the MAP estimate reduces to \hat{\theta}_k = n_k / n and coincides with the maximum likelihood estimate.[21]
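The conjugate update is a single vector addition; the following Python sketch (NumPy assumed) reproduces the worked example, including the posterior mean and MAP estimate:

import numpy as np

alpha = np.array([1.0, 1.0, 1.0])           # uniform Dirichlet prior
n = np.array([2, 1, 0])                     # observed category counts

alpha_post = alpha + n                      # posterior Dir(3, 2, 1)
post_mean = alpha_post / alpha_post.sum()   # (1/2, 1/3, 1/6)

# MAP estimate (posterior mode); the third component lies on the
# simplex boundary here because alpha_3 + n_3 = 1
post_map = (alpha_post - 1) / (alpha_post.sum() - len(alpha_post))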
Predictive and Marginal Distributions
In Bayesian inference for the categorical distribution with a Dirichlet prior, the marginal likelihood represents the probability of the observed data integrated over the prior distribution on the parameters θ. This is given by
p(\mathbf{x} \mid \boldsymbol{\alpha}) = \int p(\mathbf{x} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \, d\boldsymbol{\theta} = \frac{ B(\boldsymbol{\alpha} + \mathbf{n}) }{ B(\boldsymbol{\alpha}) },
where \mathbf{x} denotes a particular ordered sample of n independent categorical observations with counts \mathbf{n} = (n_1, \dots, n_K) for the categories k = 1, \dots, K, \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K) are the prior parameters, and B(\cdot) is the multivariate beta function defined as B(\boldsymbol{\alpha}) = \prod_k \Gamma(\alpha_k) / \Gamma(\sum_k \alpha_k); if only the vector of counts is recorded, the expression is multiplied by the multinomial coefficient n! / \prod_k n_k!.[22]
The posterior predictive distribution provides the probability of a new observation x^* given the data and prior, obtained by integrating over the posterior distribution of θ, which is Dirichlet(\boldsymbol{\alpha} + \mathbf{n}). This yields a categorical distribution with updated parameters:
P(x^* = k \mid \mathbf{x}, \boldsymbol{\alpha}) = \frac{ \alpha_k + n_k }{ \sum_j \alpha_j + n }.
The derivation follows from the conjugacy: the posterior predictive is the expected value of θ_k under the posterior, reducing to the above normalized form.[22]
The posterior conditional distribution, which updates the posterior after observing a new category x^* = k, is proportional to the original posterior times θ_k, resulting in a Dirichlet distribution with parameters \boldsymbol{\alpha} + \mathbf{n} + \mathbf{e}_k, where \mathbf{e}_k is the unit vector with 1 in the k-th position. This sequential update property facilitates online Bayesian inference.
These distributions find application in model comparison through the marginal likelihood, which serves as the evidence for prior hyperparameters, and in predictive tasks such as data imputation, where the posterior predictive fills in missing categories while accounting for parameter uncertainty.[22]
For example, suppose the prior is Dirichlet(\boldsymbol{\alpha} = (3, 2, 1)) with observed counts \mathbf{n} = (0, 0, 0) (prior predictive case); the probability of a new observation in category 3 is then P(x^* = 3 \mid \boldsymbol{\alpha}) = 1 / (3 + 2 + 1) = 1/6.
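Both quantities are cheap to compute via log-gamma functions; a sketch in Python (NumPy and SciPy assumed, with hypothetical counts added to the prior of the example above):

import numpy as np
from scipy.special import gammaln           # log Gamma, for numerical stability

def log_B(a):
    # Log of the multivariate beta function B(a)
    return np.sum(gammaln(a)) - gammaln(np.sum(a))

alpha = np.array([3.0, 2.0, 1.0])
n = np.array([2.0, 1.0, 0.0])               # hypothetical observed counts

# Log marginal likelihood of an ordered sample with counts n
log_ml = log_B(alpha + n) - log_B(alpha)

# Posterior predictive probabilities for the next observation
pred = (alpha + n) / (alpha.sum() + n.sum())    # (5/9, 3/9, 1/9)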
Sampling Methods
Direct Sampling
Direct sampling from a categorical distribution with probability vector \theta = (\theta_1, \theta_2, \dots, \theta_K) relies on the inverse transform sampling method, a standard technique for generating random variates from any probability distribution using a uniform random variable.[24] The approach exploits the cumulative distribution function (CDF) F(k) = \sum_{j=1}^k \theta_j for k = 1, \dots, K, which is non-decreasing and reaches 1 at k = K. By generating a uniform random variable U \sim \text{Uniform}(0,1) and finding the smallest k such that U \leq F(k), the resulting k follows the desired categorical distribution, as this inverts the CDF to match the target probabilities.[24]
The algorithm proceeds as follows: draw U from the uniform distribution, then iteratively accumulate the probabilities until the cumulative sum exceeds U, selecting the category at that point. This ensures each category k is chosen with exact probability \theta_k, since the intervals defined by the partial sums partition the unit interval according to the PMF.[24]
Here is pseudocode for the direct sampling procedure:
function sample_categorical(θ):
    U ← random.uniform(0, 1)
    cumulative ← 0
    for k = 1 to K:
        cumulative ← cumulative + θ_k
        if U ≤ cumulative:
            return k
    return K   // fallback; only reachable through floating-point round-off when ∑θ = 1
This implementation runs in O(K) time complexity per sample, making it efficient for distributions with a small number of categories K, though less ideal for very large K where optimized alternatives may be preferred.[24]
For illustration, consider \theta = (0.3, 0.4, 0.3). If U = 0.5, the cumulative sums are F(1) = 0.3, F(2) = 0.7, and F(3) = 1.0. Since 0.3 < 0.5 \leq 0.7, the sample is category 2. Such examples demonstrate how the method preserves the probability structure through interval mapping.[24]
In practice, libraries like NumPy provide built-in functions for this task; for instance, np.random.choice with the p argument set to \theta generates samples from the categorical distribution, typically using an efficient variant suitable for the given parameters.[25]
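A minimal usage sketch with NumPy's Generator API (the legacy np.random.choice call behaves equivalently; note that NumPy labels categories 0 to K-1 rather than 1 to K):

import numpy as np

rng = np.random.default_rng(0)      # seeded generator for reproducibility
theta = [0.3, 0.4, 0.3]

samples = rng.choice(len(theta), size=5, p=theta)   # array of labels in {0, 1, 2}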
Gumbel-Max Trick
The Gumbel-max trick offers a perturbation-based method for sampling from a categorical distribution \operatorname{Cat}(\theta), where \theta = (\theta_1, \dots, \theta_K) with \sum_{k=1}^K \theta_k = 1 and \theta_k > 0, by adding independent Gumbel noise to the log-probabilities and selecting the maximizing index. This approach is particularly valuable in machine learning applications that require differentiable approximations to discrete sampling.[26]
The standard Gumbel distribution, denoted \operatorname{Gumbel}(0,1), arises in extreme value theory as the limiting distribution of maxima of i.i.d. random variables. Its cumulative distribution function is
F(g) = \exp\left(-\exp(-g)\right), \quad g \in \mathbb{R},
and its probability density function is
f(g) = \exp\left(-g - \exp(-g)\right).
A key property is that if G_1, \dots, G_K \sim \operatorname{Gumbel}(0,1) are i.i.d., then for constants c_1, \dots, c_K, the random variable \max_k (c_k + G_k) follows \operatorname{Gumbel}(\log \sum_k \exp(c_k), 1), and the index \arg\max_k (c_k + G_k) follows a categorical distribution with probabilities proportional to \exp(c_k).[26]
To sample an index i \sim \operatorname{Cat}(\theta), compute z_k = \log \theta_k + G_k for k = 1, \dots, K, where the G_k are i.i.d. \operatorname{Gumbel}(0,1), and set i = \arg\max_k z_k. This procedure generates an unbiased sample from the target distribution.[26]
A proof sketch relies on the independence of the G_k and the Gumbel CDF: the probability that z_j exceeds all other z_k for k \neq j is
P(z_j > z_k \ \forall k \neq j) = \int_{-\infty}^\infty f(g) \prod_{k \neq j} F(g + \log \theta_j - \log \theta_k) \, dg = \theta_j,
which follows from substituting the Gumbel forms and recognizing the integral as the softmax probability \theta_j = \frac{\exp(\log \theta_j)}{\sum_k \exp(\log \theta_k)}. This establishes that the argmax yields exactly the categorical distribution.[26]
The trick's main advantages include enabling reparameterization of discrete random variables, which allows gradients to flow through the sampling process during backpropagation. This is essential for training models like variational autoencoders (VAEs), where direct sampling would otherwise block differentiability, and it supports low-variance gradient estimates in reinforcement learning and structured prediction tasks.[27]
For illustration, suppose \theta = (0.3, 0.4, 0.3). Draw G = (-0.5, 0.2, -1.0). Then
z_1 \approx \log 0.3 - 0.5 \approx -1.70, \quad z_2 \approx \log 0.4 + 0.2 \approx -0.72, \quad z_3 \approx \log 0.3 - 1.0 \approx -2.20.
The maximum is z_2, so the sample is the second category (index 2). Different noise realizations will yield samples according to the probabilities in \theta.[26]
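The trick is a few lines in practice; the following Python sketch (NumPy assumed, 0-based labels) draws samples and checks their empirical frequencies against \theta:

import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, 0.4, 0.3])

def gumbel_max_sample(theta, rng):
    g = rng.gumbel(size=len(theta))         # i.i.d. Gumbel(0,1) noise
    return np.argmax(np.log(theta) + g)     # perturbed argmax

# Empirical check: frequencies should approach theta for many draws
draws = [gumbel_max_sample(theta, rng) for _ in range(100_000)]
freqs = np.bincount(draws, minlength=len(theta)) / len(draws)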
Historically, the underlying Gumbel distribution and its max-stability properties were developed by Emil J. Gumbel in the mid-20th century as part of extreme value theory. The specific application to sampling from categorical distributions, via the argmax perturbation, gained prominence in modern machine learning through connections to choice models and efficient inference algorithms.
Multinomial and Bernoulli
The multinomial distribution arises as the joint distribution of counts from n independent trials, each following a categorical distribution with K categories and probability vector \theta = (\theta_1, \dots, \theta_K) where \sum_{k=1}^K \theta_k = 1.[28] If X = (X_1, \dots, X_K) denotes the vector of counts with \sum_{k=1}^K X_k = n, then the probability mass function is given by
P(X = x) = \frac{n!}{x_1! \cdots x_K!} \prod_{k=1}^K \theta_k^{x_k},
for non-negative integers x_1, \dots, x_K summing to n.[28] This generalizes the categorical distribution to multiple trials by modeling the frequency of outcomes across categories. The marginal distribution for a single trial is categorical, while the vector X represents the sum of indicator variables from n independent categoricals.[28]
The Bernoulli distribution is a special case of the categorical distribution when K=2, with probability vector \theta = (p, 1-p) for 0 \leq p \leq 1.[29] For a Bernoulli random variable Y \in \{0,1\}, the probability mass function is P(Y=1) = p and P(Y=0) = 1-p, or equivalently P(Y=y) = p^y (1-p)^{1-y}.[29] The variance of Y is p(1-p).[30]
For the multinomial distribution, the covariance between distinct categories is \operatorname{Cov}(X_i, X_j) = -n \theta_i \theta_j for i \neq j, reflecting the negative dependence due to the fixed total n.[31] For example, if five independent categorical trials have \theta = (0.4, 0.3, 0.3), possible multinomial outcomes include counts like (2,1,2), with probability \frac{5!}{2!1!2!} (0.4)^2 (0.3)^1 (0.3)^2 = 0.1296.[28]
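The multinomial coefficient and PMF in the example can be verified directly; a Python sketch (NumPy assumed):

import numpy as np
from math import factorial

theta = np.array([0.4, 0.3, 0.3])
x = np.array([2, 1, 2])                     # counts from n = 5 trials
n = int(x.sum())

coef = factorial(n) // np.prod([factorial(int(c)) for c in x])   # 30
pmf = coef * np.prod(theta ** x)            # 30 × 0.00432 = 0.1296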
Dirichlet and Softmax
The Dirichlet distribution is a family of continuous multivariate probability distributions supported on the interior of the (K-1)-simplex, consisting of vectors \theta = (\theta_1, \dots, \theta_K) where \theta_k > 0 for all k and \sum_{k=1}^K \theta_k = 1. It is parameterized by a positive vector \alpha = (\alpha_1, \dots, \alpha_K) with \alpha_k > 0, and its probability density function is
f(\theta \mid \alpha) = \frac{\Gamma\left( \sum_{k=1}^K \alpha_k \right)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \theta_k^{\alpha_k - 1},
where \Gamma denotes the gamma function; this form ensures normalization over the simplex.[32] The expected value of each component is E[\theta_k] = \frac{\alpha_k}{\sum_{j=1}^K \alpha_j}, providing a straightforward interpretation of the concentration parameters \alpha in terms of average probabilities.[32] When all \alpha_k = 1, the distribution reduces to a uniform distribution over the simplex, while larger \alpha_k values concentrate the mass near the corresponding vertices.
In Bayesian statistics, the Dirichlet distribution acts as the conjugate prior for the unknown probability vector \theta of a categorical distribution, meaning that if the prior is Dirichlet(\alpha), the posterior after observing data remains Dirichlet with updated parameters. This conjugacy facilitates closed-form updates and predictive distributions, making it a foundational tool for modeling uncertainty in categorical parameters (detailed further in the Bayesian Inference section).
The softmax function offers a practical parameterization of the categorical distribution by mapping an unconstrained vector \beta \in \mathbb{R}^K to the probability simplex via
\theta_k = \frac{\exp(\beta_k)}{\sum_{j=1}^K \exp(\beta_j)}, \quad k = 1, \dots, K.
This ensures \theta_k > 0 and \sum_k \theta_k = 1, transforming raw scores (logits) into interpretable probabilities. It is widely employed in the output layers of neural networks for multi-class classification, where the network learns the \beta parameters directly, allowing optimization without probability constraints. The mapping is invertible only up to an additive constant: taking \beta_k = \log \theta_k recovers one valid logit vector, and shifting every \beta_k by the same constant (for example, by the log-sum-exp, for numerical stability) leaves the softmax output unchanged. Working with unconstrained logits facilitates gradient-based learning by avoiding the simplex boundary, where \log \theta_k diverges.
For illustration, consider K=2 with \beta = (0, \log(0.5)): the softmax computes \theta_1 = \frac{\exp(0)}{\exp(0) + \exp(\log(0.5))} = \frac{1}{1 + 0.5} = \frac{2}{3} and \theta_2 = \frac{1}{3}, aligning with the logistic function for binary (Bernoulli) outcomes as a special case.
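In code, the constant-shift invariance is exploited by subtracting the maximum logit before exponentiating; a Python sketch (NumPy assumed) reproducing the example:

import numpy as np

def softmax(beta):
    # Shift by the max for numerical stability; the result is
    # unchanged because softmax is invariant to constant shifts
    e = np.exp(beta - np.max(beta))
    return e / e.sum()

beta = np.array([0.0, np.log(0.5)])
theta = softmax(beta)               # (2/3, 1/3)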