Categorical distribution
The categorical distribution, also known as the multinomial distribution with a single trial, is a discrete probability distribution that models the probability of observing one out of K mutually exclusive and collectively exhaustive outcomes, where K is a positive integer representing the number of categories.[1] It is parameterized by a K-dimensional probability vector π = (π₁, π₂, ..., π_K), where each π_i ≥ 0 and ∑_{i=1}^K π_i = 1, such that the probability mass function is given by P(X = k) = π_k for k = 1, 2, ..., K.[1] This distribution generalizes the Bernoulli distribution (which corresponds to K=2) and serves as a foundational model for discrete choice scenarios, such as a single roll of a K-sided die.[2]
In statistical modeling, the categorical distribution is characterized by its mean vector E[X] = π (where X is the one-hot encoded outcome vector) and variance-covariance matrix with diagonal elements π_i(1 - π_i) and off-diagonal elements -π_i π_j for i ≠ j, reflecting the dependence structure across categories.[1] The Dirichlet distribution is its conjugate prior: if the prior over the parameters π is Dirichlet, the posterior after observing categorical data remains Dirichlet-distributed, which facilitates Bayesian inference.[1] Common parameterizations include the direct use of π or transformations like the softmax function, which maps a real-valued vector ψ to π via π_k = exp(ψ_k) / ∑_{j=1}^K exp(ψ_j), enabling gradient-based optimization in machine learning.[3]
The categorical distribution plays a central role in numerous applications, including multiclass classification (e.g., assigning labels in image recognition), natural language processing (e.g., next-word prediction in language models), and recommendation systems (e.g., selecting items from a finite set of options).[3] For large K, such as vocabularies exceeding 10,000 words or high-resolution image classes, direct computation of the distribution can be inefficient due to linear scaling in K, prompting scalable inference methods like the Augment-and-Reduce approach to approximate likelihoods without enumerating all categories.[3] Its simplicity and interpretability make it indispensable in probabilistic graphical models and exponential family frameworks, where it often represents observed or latent variables.[4]
Terminology and Parameters
The categorical distribution is a discrete probability distribution defined over a finite set of K mutually exclusive categories or outcomes, where K is a positive integer representing the number of possible distinct results.[5] It models scenarios in which a random event results in exactly one of these categories, such as the outcome of rolling a die or selecting a class label in a classification task.
In standard notation, the random variable X associated with the categorical distribution takes values in the finite support \{1, 2, \dots, K\}, with no probability mass outside this set and no extension to a continuous domain.[6] The distribution is fully parameterized by a probability vector \theta = (\theta_1, \theta_2, \dots, \theta_K), where each \theta_k \geq 0 denotes the probability assigned to category k, and these probabilities satisfy the normalization condition \sum_{k=1}^K \theta_k = 1. This vector \theta encapsulates all information needed to specify the distribution, directly representing the relative likelihoods of each category.[6]
The categorical distribution generalizes the Bernoulli distribution—which applies specifically to the case K=2, modeling binary outcomes—to an arbitrary number of categories.[5] It is named for its role in describing categorical variables, which are qualitative attributes divided into discrete, non-ordered classes, particularly in statistical classification and data analysis contexts. It is also the special case of the multinomial distribution with a single trial (n = 1).[5]
Probability Mass Function
The probability mass function (PMF) of a categorical random variable X taking values in the finite set \{1, 2, \dots, K\} with parameter vector \boldsymbol{\theta} = (\theta_1, \theta_2, \dots, \theta_K) is defined as
P(X = k \mid \boldsymbol{\theta}) = \theta_k, \quad k = 1, 2, \dots, K,
where each \theta_k \geq 0 represents the probability of outcome k.[7][2]
To explicitly account for the support, the PMF can be written using the indicator function I(\cdot):
P(X = k \mid \boldsymbol{\theta}) = \theta_k \cdot I(k \in \{1, \dots, K\}),
which ensures the probability is zero for values outside the defined categories.[8]
The parameters must satisfy the normalization constraint
\sum_{k=1}^K \theta_k = 1,
guaranteeing that the probabilities sum to unity over all possible outcomes.[7][4]
In vector notation, particularly common in machine learning contexts, the outcome X can be represented as a one-hot encoded vector \mathbf{x} \in \{0,1\}^K with exactly one entry equal to 1, and the PMF is
P(X = \mathbf{x} \mid \boldsymbol{\theta}) = \prod_{k=1}^K \theta_k^{x_k} = \boldsymbol{\theta}^\top \mathbf{x},
where the dot product selects the corresponding probability.[9][8]
For example, consider K=3 and \boldsymbol{\theta} = (0.2, 0.5, 0.3); then P(X=2 \mid \boldsymbol{\theta}) = 0.5.[7]
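As a minimal illustration in Python with NumPy (an assumed toolchain, not one prescribed by the sources), both the direct-indexing and one-hot forms of the PMF can be evaluated for this example:

import numpy as np

theta = np.array([0.2, 0.5, 0.3])   # K = 3 parameter vector

# Direct indexing: P(X = 2) with 1-based category labels
k = 2
p_direct = theta[k - 1]             # 0.5

# One-hot form: the dot product picks out the same probability
x = np.zeros(len(theta))
x[k - 1] = 1.0                      # one-hot encoding of outcome 2
p_onehot = theta @ x                # also 0.5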
This PMF arises as a specialization of the general form for discrete distributions, where the probability is assigned directly to each point in a finite support set without additional structure beyond the normalization condition.[4][2]
Properties
Moments and Expectations
A categorically distributed outcome can be represented as a scalar random variable X taking values in \{1, 2, \dots, K\} with probabilities \theta_1, \theta_2, \dots, \theta_K, where \sum_{k=1}^K \theta_k = 1 and \theta_k \geq 0 for all k. The r-th raw moment of X is m_r = \mathbb{E}[X^r] = \sum_{k=1}^K k^r \theta_k.[10] In particular, the first raw moment is the expected value \mu = \mathbb{E}[X] = \sum_{k=1}^K k \theta_k.[10] The variance follows from the second raw moment as \mathrm{Var}(X) = \mathbb{E}[X^2] - \mu^2 = \sum_{k=1}^K k^2 \theta_k - \mu^2, or equivalently \mathrm{Var}(X) = \sum_{k=1}^K (k - \mu)^2 \theta_k.[10]
The central moments provide further characterization, with the r-th central moment defined as \mu_r = \mathbb{E}[(X - \mu)^r] = \sum_{k=1}^K (k - \mu)^r \theta_k. The skewness \gamma, measuring asymmetry, is the standardized third central moment \gamma = \mu_3 / \sigma^3, where \sigma^2 = \mathrm{Var}(X).[11] The kurtosis \kappa, measuring tailedness, is the standardized fourth central moment \kappa = \mu_4 / \sigma^4.[11]
Alternatively, the categorical distribution arises as a vector of indicator random variables \mathbf{I} = (I_1, I_2, \dots, I_K), where I_j = 1 if category j is selected and I_j = 0 otherwise, with exactly one I_j = 1. The expected value of each indicator is \mathbb{E}[I_j] = \theta_j.[12] The variance of each indicator is \mathrm{Var}(I_j) = \theta_j (1 - \theta_j), and for j \neq k, the covariance is \mathrm{Cov}(I_j, I_k) = -\theta_j \theta_k.[12]
For example, consider K=2 with \theta = (0.5, 0.5) and labels 1 and 2. Then \mu = \mathbb{E}[X] = 1 \cdot 0.5 + 2 \cdot 0.5 = 1.5 and \mathrm{Var}(X) = (1 - 1.5)^2 \cdot 0.5 + (2 - 1.5)^2 \cdot 0.5 = 0.25.[10]
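The moment formulas above can be checked numerically; the following Python sketch (NumPy assumed) reproduces the example and the indicator covariance matrix:

import numpy as np

theta = np.array([0.5, 0.5])                 # K = 2, labels 1 and 2
labels = np.arange(1, len(theta) + 1)

mu = np.sum(labels * theta)                  # E[X] = 1.5
var = np.sum((labels - mu) ** 2 * theta)     # Var(X) = 0.25

# Covariance matrix of the one-hot indicator vector:
# diag(theta) - theta theta^T
cov = np.diag(theta) - np.outer(theta, theta)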
Mode and Entropy
The mode of the categorical distribution is the outcome k that maximizes the probability mass function, specifically k = \arg\max_j \theta_j.[13] If multiple outcomes share the maximum probability, the distribution is multimodal, with each such outcome serving as a mode. In the case of a uniform distribution, where \theta_k = 1/K for all k = 1, \dots, K, every category qualifies as a mode.[13]
The entropy of the categorical distribution quantifies the expected uncertainty associated with a random outcome drawn from it. It is given by the Shannon entropy formula:
H(\theta) = -\sum_{k=1}^K \theta_k \log \theta_k,
where the base of the logarithm determines the units: natural logarithm for nats or base-2 for bits.[14] This quantity equals the expected value \mathbb{E}[-\log P(X = k)], with the expectation computed over the distribution parameterized by \theta.[15]
The entropy achieves its maximum value of \log K (in the same units as the logarithm) when the distribution is uniform, \theta_k = 1/K for all k, as this configuration maximizes uncertainty subject to the constraint of K possible outcomes.[15]
For illustration, consider a binary categorical distribution with \theta = (0.9, 0.1). The mode is the first category, and the entropy is H(\theta) = -0.9 \log 0.9 - 0.1 \log 0.1 \approx 0.325 nats (using the natural logarithm). To arrive at this value, compute \log 0.9 \approx -0.1054 and \log 0.1 \approx -2.3026, yielding -0.9 \times (-0.1054) - 0.1 \times (-2.3026) = 0.0948 + 0.2303 = 0.3251. By comparison, the uniform binary distribution has entropy \log 2 \approx 0.693 nats, obtained directly from the maximum entropy formula.
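A short Python check (NumPy assumed) of the mode and entropy calculations:

import numpy as np

theta = np.array([0.9, 0.1])
mode = int(np.argmax(theta)) + 1    # category 1, with 1-based labels

# Shannon entropy in nats; terms with theta_k = 0 contribute zero
nz = theta[theta > 0]
H = -np.sum(nz * np.log(nz))        # ≈ 0.325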
Parameter Estimation
Maximum Likelihood Estimation
Given a sample of n independent and identically distributed observations x_1, \dots, x_n from a categorical distribution with probability vector \theta = (\theta_1, \dots, \theta_K)^\top satisfying \sum_{k=1}^K \theta_k = 1 and \theta_k > 0 for all k, the likelihood function is the product of the probability mass functions evaluated at the observations:
L(\theta) = \prod_{i=1}^n \theta_{x_i}.
This likelihood treats the categorical distribution as a special case of the multinomial distribution with one trial per observation.[16]
To facilitate maximization, consider the log-likelihood \ell(\theta) = \log L(\theta). Define n_k = \sum_{i=1}^n \mathbf{1}_{\{x_i = k\}} as the observed count for category k, so \sum_{k=1}^K n_k = n. Then,
\ell(\theta) = \sum_{k=1}^K n_k \log \theta_k.
Maximizing \ell(\theta) subject to the constraint \sum_{k=1}^K \theta_k = 1 requires incorporating a Lagrange multiplier \lambda, yielding the Lagrangian \mathcal{L}(\theta, \lambda) = \sum_{k=1}^K n_k \log \theta_k + \lambda \left(1 - \sum_{k=1}^K \theta_k\right). Differentiating with respect to \theta_k gives \partial \mathcal{L}/\partial \theta_k = n_k / \theta_k - \lambda = 0, so \theta_k = n_k / \lambda. Summing over k and applying the constraint implies \lambda = n, hence the maximum likelihood estimator (MLE) is the vector of empirical frequencies \hat{\theta}_k = n_k / n for each k = 1, \dots, K. This closed-form solution arises directly from the sufficient statistic consisting of the category counts.[16]
The MLE \hat{\theta} possesses desirable statistical properties under standard regularity conditions. It is unbiased, with E[\hat{\theta}_k] = \theta_k for each k, and consistent, converging in probability to the true \theta as n \to \infty.[17] Furthermore, \hat{\theta} is asymptotically normal: \sqrt{n} (\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma), where the covariance matrix \Sigma has diagonal elements \Sigma_{kk} = \theta_k (1 - \theta_k) and off-diagonal elements \Sigma_{kl} = -\theta_k \theta_l for k \neq l. The asymptotic variance of each component is thus \operatorname{Var}(\hat{\theta}_k) \approx \theta_k (1 - \theta_k)/n, reflecting the multinomial sampling variability.[17][16]
For illustration, consider n=3 observations \{1, 2, 1\} from a categorical distribution over K=3 categories. The counts are n_1 = 2, n_2 = 1, n_3 = 0, yielding \hat{\theta} = (2/3, 1/3, 0)^\top. Note that unobserved categories receive zero probability mass under this estimator.[16]
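The closed-form MLE amounts to counting; a Python sketch (NumPy assumed) for the worked example:

import numpy as np

x = np.array([1, 2, 1])                        # observed labels in 1..K
K = 3
counts = np.bincount(x, minlength=K + 1)[1:]   # n = (2, 1, 0)
theta_hat = counts / counts.sum()              # (2/3, 1/3, 0)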
Method of Moments
The method of moments provides an alternative approach to parameter estimation for the categorical distribution by equating population moments to their sample counterparts. For a categorical random variable X taking values in \{1, 2, \dots, K\} with probabilities \theta = (\theta_1, \dots, \theta_K), the first raw moment is \mathbb{E}[X] = \sum_{k=1}^K k \theta_k. The corresponding sample moment is the sample mean \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, where x_1, \dots, x_n is a random sample. Equating these yields \sum_{k=1}^K k \theta_k = \bar{x}, but this single equation underdetermines the K-1 free parameters (given \sum_{k=1}^K \theta_k = 1) when K > 2.[18]
To resolve this, the method employs indicator functions for each category: define I_k(X) = 1 if X = k and 0 otherwise, so \mathbb{E}[I_k(X)] = \theta_k. The sample analogue is the proportion \hat{\theta}_k = \frac{1}{n} \sum_{i=1}^n I_k(x_i) = \frac{n_k}{n}, where n_k is the count of observations equal to k. Equating moments gives the estimators \hat{\theta}_k = n_k / n for k = 1, \dots, K, which automatically satisfy the constraint \sum_{k=1}^K \hat{\theta}_k = 1.[19]
For the categorical distribution, this method of moments reduces to empirical frequency counting, which coincides exactly with the maximum likelihood estimator.[20]
Although the method of moments is straightforward here, it can be less statistically efficient than maximum likelihood estimation when the moment conditions involve higher-order moments that are nonlinear functions of the parameters. For the categorical distribution, however, the indicator-based moment conditions make such extensions unnecessary, and the method chiefly serves to illustrate moment-matching principles.[18]
As an example, suppose a sample of n = 100 observations yields counts n_1 = 30, n_2 = 40, n_3 = 30 for a 3-category distribution. The method of moments estimators are \hat{\theta}_1 = 0.30, \hat{\theta}_2 = 0.40, \hat{\theta}_3 = 0.30, identical to those from maximum likelihood.[19]
Conjugate Prior and Posterior
In Bayesian inference for the categorical distribution, the Dirichlet distribution serves as the conjugate prior for the probability parameter vector \boldsymbol{\theta} = (\theta_1, \dots, \theta_K). The prior is denoted \boldsymbol{\theta} \sim \text{Dir}(\boldsymbol{\alpha}), where \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K) with each \alpha_k > 0, and its density is proportional to \prod_{k=1}^K \theta_k^{\alpha_k - 1} over the (K-1)-simplex \sum_{k=1}^K \theta_k = 1, \theta_k \geq 0.[21]
The hyperparameters \alpha_k represent pseudo-counts, akin to prior observations for category k, with the total prior strength given by \sum_{k=1}^K \alpha_k. A uniform prior on the simplex arises when \alpha_k = 1 for all k, equivalent to \text{Dir}(1, \dots, 1).[22]
For a dataset consisting of N independent draws from the categorical distribution, let n_k denote the observed count for category k, so \sum_{k=1}^K n_k = N; the resulting likelihood takes a multinomial form. The posterior distribution is then \boldsymbol{\theta} \mid \mathbf{n} \sim \text{Dir}(\boldsymbol{\alpha} + \mathbf{n}), with updated parameters \alpha_k' = \alpha_k + n_k for each k.[21]
This conjugacy follows directly from the product of the prior density and likelihood, which is proportional to \prod_{k=1}^K \theta_k^{\alpha_k + n_k - 1} and thus shares the Dirichlet kernel form after normalization.[23]
The maximum a posteriori (MAP) estimate, obtained as the mode of the posterior (assuming \alpha_k + n_k > 1), is \theta_{\text{MAP},k} = \frac{\alpha_k + n_k - 1}{\sum_{j=1}^K (\alpha_j + n_j) - K}; the prior thus contributes \alpha_k - 1 pseudo-counts of smoothing to each category, and setting \alpha_k = 2 for all k recovers add-one (Laplace) smoothing.[22]
As an illustration, for K=3 categories with uniform prior \boldsymbol{\alpha} = (1,1,1) and counts \mathbf{n} = (2,1,0), the posterior is \text{Dir}(3,2,1).[22]
When \alpha_k = 1 for all k (the uniform prior), the MAP estimate reduces to \hat{\theta}_k = n_k / n and coincides with the maximum likelihood estimate.[21]
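The conjugate update is a single vector addition; the following Python sketch (NumPy assumed) reproduces the worked example, including the posterior mean and MAP estimate:

import numpy as np

alpha = np.array([1.0, 1.0, 1.0])           # uniform Dirichlet prior
n = np.array([2, 1, 0])                     # observed category counts

alpha_post = alpha + n                      # posterior Dir(3, 2, 1)
post_mean = alpha_post / alpha_post.sum()   # (1/2, 1/3, 1/6)

# MAP estimate (posterior mode); the third component lies on the
# simplex boundary here because alpha_3 + n_3 = 1
post_map = (alpha_post - 1) / (alpha_post.sum() - len(alpha_post))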
Predictive and Marginal Distributions
In Bayesian inference for the categorical distribution with a Dirichlet prior, the marginal likelihood represents the probability of the observed data integrated over the prior distribution on the parameters θ. This is given by
p(\mathbf{x} \mid \boldsymbol{\alpha}) = \int p(\mathbf{x} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \, d\boldsymbol{\theta} = \frac{ B(\boldsymbol{\alpha} + \mathbf{n}) }{ B(\boldsymbol{\alpha}) },
where \mathbf{x} denotes a particular ordered sample of n independent categorical observations with counts \mathbf{n} = (n_1, \dots, n_K) for the categories k = 1, \dots, K, \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K) are the prior parameters, and B(\cdot) is the multivariate beta function defined as B(\boldsymbol{\alpha}) = \prod_k \Gamma(\alpha_k) / \Gamma(\sum_k \alpha_k); if only the vector of counts is recorded, the expression is multiplied by the multinomial coefficient n! / \prod_k n_k!.[22]
The posterior predictive distribution provides the probability of a new observation x^* given the data and prior, obtained by integrating over the posterior distribution of θ, which is Dirichlet(\boldsymbol{\alpha} + \mathbf{n}). This yields a categorical distribution with updated parameters:
P(x^* = k \mid \mathbf{x}, \boldsymbol{\alpha}) = \frac{ \alpha_k + n_k }{ \sum_j \alpha_j + n }.
The derivation follows from the conjugacy: the posterior predictive is the expected value of θ_k under the posterior, reducing to the above normalized form.[22]
The posterior conditional distribution, which updates the posterior after observing a new category x^* = k, is proportional to the original posterior times θ_k, resulting in a Dirichlet distribution with parameters \boldsymbol{\alpha} + \mathbf{n} + \mathbf{e}_k, where \mathbf{e}_k is the unit vector with 1 in the k-th position. This sequential update property facilitates online Bayesian inference.
These distributions find application in model comparison through the marginal likelihood, which serves as the evidence for prior hyperparameters, and in predictive tasks such as data imputation, where the posterior predictive fills in missing categories while accounting for parameter uncertainty.[22]
For example, suppose the prior is Dirichlet(\boldsymbol{\alpha} = (3, 2, 1)) with observed counts \mathbf{n} = (0, 0, 0) (prior predictive case); the probability of a new observation in category 3 is then P(x^* = 3 \mid \boldsymbol{\alpha}) = 1 / (3 + 2 + 1) = 1/6.
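Both quantities are cheap to compute via log-gamma functions; a sketch in Python (NumPy and SciPy assumed, with hypothetical counts added to the prior of the example above):

import numpy as np
from scipy.special import gammaln           # log Gamma, for numerical stability

def log_B(a):
    # Log of the multivariate beta function B(a)
    return np.sum(gammaln(a)) - gammaln(np.sum(a))

alpha = np.array([3.0, 2.0, 1.0])
n = np.array([2.0, 1.0, 0.0])               # hypothetical observed counts

# Log marginal likelihood of an ordered sample with counts n
log_ml = log_B(alpha + n) - log_B(alpha)

# Posterior predictive probabilities for the next observation
pred = (alpha + n) / (alpha.sum() + n.sum())    # (5/9, 3/9, 1/9)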
Sampling Methods
Direct Sampling
Direct sampling from a categorical distribution with probability vector \theta = (\theta_1, \theta_2, \dots, \theta_K) relies on the inverse transform sampling method, a standard technique for generating random variates from any probability distribution using a uniform random variable.[24] The approach exploits the cumulative distribution function (CDF) F(k) = \sum_{j=1}^k \theta_j for k = 1, \dots, K, which is non-decreasing and reaches 1 at k = K. By generating a uniform random variable U \sim \text{Uniform}(0,1) and finding the smallest k such that U \leq F(k), the resulting k follows the desired categorical distribution, as this inverts the CDF to match the target probabilities.[24]
The algorithm proceeds as follows: draw U from the uniform distribution, then iteratively accumulate the probabilities until the cumulative sum exceeds U, selecting the category at that point. This ensures each category k is chosen with exact probability \theta_k, since the intervals defined by the partial sums partition the unit interval according to the PMF.[24]
Here is pseudocode for the direct sampling procedure:
function sample_categorical(θ):
    U ← random.uniform(0, 1)
    cumulative ← 0
    for k = 1 to K:
        cumulative ← cumulative + θ_k
        if U ≤ cumulative:
            return k
    return K   // fallback; only reachable through floating-point round-off when ∑θ = 1
This implementation runs in O(K) time complexity per sample, making it efficient for distributions with a small number of categories K, though less ideal for very large K where optimized alternatives may be preferred.[24]
For illustration, consider \theta = (0.3, 0.4, 0.3). If U = 0.5, the cumulative sums are F(1) = 0.3, F(2) = 0.7, and F(3) = 1.0. Since 0.3 < 0.5 \leq 0.7, the sample is category 2. Such examples demonstrate how the method preserves the probability structure through interval mapping.[24]
In practice, libraries like NumPy provide built-in functions for this task; for instance, np.random.choice with the p argument set to \theta generates samples from the categorical distribution, typically using an efficient variant suitable for the given parameters.[25]
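A minimal usage sketch with NumPy's Generator API (the legacy np.random.choice call behaves equivalently; note that NumPy labels categories 0 to K-1 rather than 1 to K):

import numpy as np

rng = np.random.default_rng(0)      # seeded generator for reproducibility
theta = [0.3, 0.4, 0.3]

samples = rng.choice(len(theta), size=5, p=theta)   # array of labels in {0, 1, 2}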
Gumbel-Max Trick
The Gumbel-max trick offers a perturbation-based method for sampling from a categorical distribution \operatorname{Cat}(\theta), where \theta = (\theta_1, \dots, \theta_K) with \sum_{k=1}^K \theta_k = 1 and \theta_k > 0, by adding independent Gumbel noise to the log-probabilities and selecting the maximizing index. This approach is particularly valuable in machine learning applications that require differentiable approximations to discrete sampling.[26]
The standard Gumbel distribution, denoted \operatorname{Gumbel}(0,1), arises in extreme value theory as the limiting distribution of maxima of i.i.d. random variables. Its cumulative distribution function is
F(g) = \exp\left(-\exp(-g)\right), \quad g \in \mathbb{R},
and its probability density function is
f(g) = \exp\left(-g - \exp(-g)\right).
A key property is that if G_1, \dots, G_K \sim \operatorname{Gumbel}(0,1) are i.i.d., then for constants c_1, \dots, c_K, the random variable \max_k (c_k + G_k) follows \operatorname{Gumbel}(\log \sum_k \exp(c_k), 1), and the index \arg\max_k (c_k + G_k) follows a categorical distribution with probabilities proportional to \exp(c_k).[26]
To sample an index i \sim \operatorname{Cat}(\theta), compute z_k = \log \theta_k + G_k for k = 1, \dots, K, where the G_k are i.i.d. \operatorname{Gumbel}(0,1), and set i = \arg\max_k z_k. This procedure generates an unbiased sample from the target distribution.[26]
A proof sketch relies on the independence of the G_k and the Gumbel CDF: the probability that z_j exceeds all other z_k for k \neq j is
P(z_j > z_k \ \forall k \neq j) = \int_{-\infty}^\infty f(g) \prod_{k \neq j} F(g + \log \theta_j - \log \theta_k) \, dg = \theta_j,
which follows from substituting the Gumbel forms and recognizing the integral as the softmax probability \theta_j = \frac{\exp(\log \theta_j)}{\sum_k \exp(\log \theta_k)}. This establishes that the argmax yields exactly the categorical distribution.[26]
The trick's main advantages include enabling reparameterization of discrete random variables, which allows gradients to flow through the sampling process during backpropagation. This is essential for training models like variational autoencoders (VAEs), where direct sampling would otherwise block differentiability, and it supports low-variance gradient estimates in reinforcement learning and structured prediction tasks.[27]
For illustration, suppose \theta = (0.3, 0.4, 0.3). Draw G = (-0.5, 0.2, -1.0). Then
z_1 \approx \log 0.3 - 0.5 \approx -1.70, \quad z_2 \approx \log 0.4 + 0.2 \approx -0.72, \quad z_3 \approx \log 0.3 - 1.0 \approx -2.20.
The maximum is z_2, so the sample is the second category (index 2). Different noise realizations will yield samples according to the probabilities in \theta.[26]
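The trick is a few lines in practice; the following Python sketch (NumPy assumed, 0-based labels) draws samples and checks their empirical frequencies against \theta:

import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, 0.4, 0.3])

def gumbel_max_sample(theta, rng):
    g = rng.gumbel(size=len(theta))         # i.i.d. Gumbel(0,1) noise
    return np.argmax(np.log(theta) + g)     # perturbed argmax

# Empirical check: frequencies should approach theta for many draws
draws = [gumbel_max_sample(theta, rng) for _ in range(100_000)]
freqs = np.bincount(draws, minlength=len(theta)) / len(draws)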
Historically, the underlying Gumbel distribution and its max-stability properties were developed by Emil J. Gumbel in the mid-20th century as part of extreme value theory. The specific application to sampling from categorical distributions, via the argmax perturbation, gained prominence in modern machine learning through connections to choice models and efficient inference algorithms.
Multinomial and Bernoulli
The multinomial distribution arises as the joint distribution of counts from n independent trials, each following a categorical distribution with K categories and probability vector \theta = (\theta_1, \dots, \theta_K) where \sum_{k=1}^K \theta_k = 1.[28] If X = (X_1, \dots, X_K) denotes the vector of counts with \sum_{k=1}^K X_k = n, then the probability mass function is given by
P(X = x) = \frac{n!}{x_1! \cdots x_K!} \prod_{k=1}^K \theta_k^{x_k},
for non-negative integers x_1, \dots, x_K summing to n.[28] This generalizes the categorical distribution to multiple trials by modeling the frequency of outcomes across categories. The marginal distribution for a single trial is categorical, while the vector X represents the sum of indicator variables from n independent categoricals.[28]
The Bernoulli distribution is a special case of the categorical distribution when K=2, with probability vector \theta = (p, 1-p) for 0 \leq p \leq 1.[29] For a Bernoulli random variable Y \in \{0,1\}, the probability mass function is P(Y=1) = p and P(Y=0) = 1-p, or equivalently P(Y=y) = p^y (1-p)^{1-y}.[29] The variance of Y is p(1-p).[30]
For the multinomial distribution, the covariance between distinct categories is \operatorname{Cov}(X_i, X_j) = -n \theta_i \theta_j for i \neq j, reflecting the negative dependence due to the fixed total n.[31] For example, if five independent categorical trials have \theta = (0.4, 0.3, 0.3), possible multinomial outcomes include counts like (2,1,2), with probability \frac{5!}{2!1!2!} (0.4)^2 (0.3)^1 (0.3)^2 = 0.1296.[28]
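The multinomial coefficient and PMF in the example can be verified directly; a Python sketch (NumPy assumed):

import numpy as np
from math import factorial

theta = np.array([0.4, 0.3, 0.3])
x = np.array([2, 1, 2])                     # counts from n = 5 trials
n = int(x.sum())

coef = factorial(n) // np.prod([factorial(int(c)) for c in x])   # 30
pmf = coef * np.prod(theta ** x)            # 30 × 0.00432 = 0.1296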
Dirichlet and Softmax
The Dirichlet distribution is a family of continuous multivariate probability distributions supported on the interior of the (K-1)-simplex, consisting of vectors \theta = (\theta_1, \dots, \theta_K) where \theta_k > 0 for all k and \sum_{k=1}^K \theta_k = 1. It is parameterized by a positive vector \alpha = (\alpha_1, \dots, \alpha_K) with \alpha_k > 0, and its probability density function is
f(\theta \mid \alpha) = \frac{\Gamma\left( \sum_{k=1}^K \alpha_k \right)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \theta_k^{\alpha_k - 1},
where \Gamma denotes the gamma function; this form ensures normalization over the simplex.[32] The expected value of each component is E[\theta_k] = \frac{\alpha_k}{\sum_{j=1}^K \alpha_j}, providing a straightforward interpretation of the concentration parameters \alpha in terms of average probabilities.[32] When all \alpha_k = 1, the distribution reduces to a uniform distribution over the simplex, while larger \alpha_k values concentrate the mass near the corresponding vertices.
In Bayesian statistics, the Dirichlet distribution acts as the conjugate prior for the unknown probability vector \theta of a categorical distribution, meaning that if the prior is Dirichlet(\alpha), the posterior after observing data remains Dirichlet with updated parameters. This conjugacy facilitates closed-form updates and predictive distributions, making it a foundational tool for modeling uncertainty in categorical parameters (detailed further in the Bayesian Inference section).
The softmax function offers a practical parameterization of the categorical distribution by mapping an unconstrained vector \beta \in \mathbb{R}^K to the probability simplex via
\theta_k = \frac{\exp(\beta_k)}{\sum_{j=1}^K \exp(\beta_j)}, \quad k = 1, \dots, K.
This ensures \theta_k > 0 and \sum_k \theta_k = 1, transforming raw scores (logits) into interpretable probabilities. It is widely employed in the output layers of neural networks for multi-class classification, where the network learns the \beta parameters directly, allowing optimization without probability constraints. The mapping is invertible only up to an additive constant: taking \beta_k = \log \theta_k recovers one valid logit vector, and shifting every \beta_k by the same constant (for example, by the log-sum-exp, for numerical stability) leaves the softmax output unchanged. Working with unconstrained logits facilitates gradient-based learning by avoiding the simplex boundary, where \log \theta_k diverges.
For illustration, consider K=2 with \beta = (0, \log(0.5)): the softmax computes \theta_1 = \frac{\exp(0)}{\exp(0) + \exp(\log(0.5))} = \frac{1}{1 + 0.5} = \frac{2}{3} and \theta_2 = \frac{1}{3}, aligning with the logistic function for binary (Bernoulli) outcomes as a special case.
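In code, the constant-shift invariance is exploited by subtracting the maximum logit before exponentiating; a Python sketch (NumPy assumed) reproducing the example:

import numpy as np

def softmax(beta):
    # Shift by the max for numerical stability; the result is
    # unchanged because softmax is invariant to constant shifts
    e = np.exp(beta - np.max(beta))
    return e / e.sum()

beta = np.array([0.0, np.log(0.5)])
theta = softmax(beta)               # (2/3, 1/3)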