
Categorical distribution

The categorical distribution, also known as the multinomial distribution with a single trial, is a discrete probability distribution that models the probability of observing one out of K mutually exclusive and collectively exhaustive outcomes, where K ≥ 1 is a positive integer representing the number of categories. It is parameterized by a K-dimensional probability vector π = (π₁, π₂, ..., π_K) (written θ in the sections below), where each π_i ≥ 0 and ∑_{i=1}^K π_i = 1, such that the probability mass function is given by P(X = k) = π_k for k = 1, 2, ..., K. This distribution generalizes the Bernoulli distribution (which corresponds to K=2) and serves as a foundational model for discrete choice scenarios, such as a single roll of a K-sided die.

In statistical modeling, the categorical distribution is characterized by its mean vector E[X] = π (where X is the one-hot encoded outcome vector) and variance-covariance matrix with diagonal elements π_i(1 - π_i) and off-diagonal elements -π_i π_j for i ≠ j, reflecting the negative dependence across categories. The Dirichlet distribution is its conjugate prior: if the prior over the parameters π is Dirichlet, the posterior after observing categorical data is again Dirichlet, which facilitates Bayesian inference. Common parameterizations include the direct use of π or transformations like the softmax function, which maps a real-valued vector ψ to π via π_k = exp(ψ_k) / ∑_{j=1}^K exp(ψ_j), enabling gradient-based optimization in machine learning.

The categorical distribution plays a central role in numerous applications, including classification (e.g., assigning labels in image recognition), natural language processing (e.g., next-word prediction in language models), and recommendation systems (e.g., selecting items from a set of options). For large K, such as vocabularies exceeding 10,000 words or image classification problems with very many classes, direct computation of the distribution can be inefficient because the cost of normalization scales linearly in K, prompting scalable methods like the Augment-and-Reduce approach to approximate likelihoods without enumerating all categories. Its simplicity and interpretability make it indispensable in probabilistic graphical models and related frameworks, where it often represents observed or latent variables.

Definition and Formulation

Terminology and Parameters

The categorical distribution is a discrete probability distribution defined over a set of K mutually exclusive categories or outcomes, where K is a positive integer representing the number of possible distinct results. It models scenarios in which a random experiment results in exactly one of these categories, such as the outcome of rolling a die or selecting a class label in a classification task. In standard notation, the random variable X associated with the distribution takes values in the finite support \{1, 2, \dots, K\}, with no probability mass outside this set and no extension to a continuous domain.

The distribution is fully parameterized by a probability vector \theta = (\theta_1, \theta_2, \dots, \theta_K), where each \theta_k \geq 0 denotes the probability assigned to category k, and these probabilities satisfy the normalization condition \sum_{k=1}^K \theta_k = 1. This vector \theta encapsulates all information needed to specify the distribution, directly representing the relative likelihoods of each category.

The categorical distribution generalizes the Bernoulli distribution, which applies specifically to the case K=2 and models binary outcomes, to an arbitrary number of categories. It is named for its role in describing categorical variables, which are qualitative attributes divided into discrete, non-ordered classes, particularly in statistics and machine learning contexts. As a single-trial special case, it corresponds to the multinomial distribution with one trial.

Probability Mass Function

The probability mass function (PMF) of a categorical random variable X taking values in the set \{1, 2, \dots, K\} with parameter vector \boldsymbol{\theta} = (\theta_1, \theta_2, \dots, \theta_K) is defined as P(X = k \mid \boldsymbol{\theta}) = \theta_k, \quad k = 1, 2, \dots, K, where each \theta_k \geq 0 represents the probability of outcome k. To explicitly account for the support, the PMF can be written using the indicator function I(\cdot): P(X = k \mid \boldsymbol{\theta}) = \theta_k \cdot I(k \in \{1, \dots, K\}), which ensures the probability is zero for values outside the defined categories. The parameters must satisfy the normalization constraint \sum_{k=1}^K \theta_k = 1, guaranteeing that the probabilities sum to unity over all possible outcomes.

In vector notation, particularly common in machine learning contexts, the outcome X can be represented as a one-hot encoded vector \mathbf{x} \in \{0,1\}^K with exactly one entry equal to 1, and the PMF is P(X = \mathbf{x} \mid \boldsymbol{\theta}) = \prod_{k=1}^K \theta_k^{x_k} = \boldsymbol{\theta}^\top \mathbf{x}, where the inner product selects the corresponding probability. For example, consider K=3 and \boldsymbol{\theta} = (0.2, 0.5, 0.3); then P(X=2 \mid \boldsymbol{\theta}) = 0.5. This PMF is the most general distribution on a finite support set: a probability is assigned directly to each point, with no structure beyond the normalization condition.
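
The scalar and one-hot forms of the PMF can be compared with a short sketch. The function names below (`categorical_pmf`, `one_hot`, `categorical_pmf_one_hot`) are illustrative choices rather than part of any library, and categories are assumed to be labeled 1 through K as in the text.

```python
def categorical_pmf(k, theta):
    """P(X = k | theta) for a 1-based category k; zero outside {1, ..., K}."""
    if isinstance(k, int) and 1 <= k <= len(theta):
        return theta[k - 1]
    return 0.0


def one_hot(k, K):
    """One-hot vector x with x_k = 1 and every other entry 0."""
    return [1 if j == k else 0 for j in range(1, K + 1)]


def categorical_pmf_one_hot(x, theta):
    """Product form prod_k theta_k ** x_k for a one-hot vector x."""
    p = 1.0
    for theta_k, x_k in zip(theta, x):
        p *= theta_k ** x_k
    return p


theta = (0.2, 0.5, 0.3)
print(categorical_pmf(2, theta))                      # scalar form
print(categorical_pmf_one_hot(one_hot(2, 3), theta))  # one-hot product form
print(categorical_pmf(4, theta))                      # value outside the support
```

Both forms return the same probability for category 2, and the indicator guard returns zero for a label outside the support.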

Properties

Moments and Expectations

The categorical distribution can be viewed as a scalar random variable X taking values in \{1, 2, \dots, K\} with probabilities \theta_1, \theta_2, \dots, \theta_K, where \sum_{k=1}^K \theta_k = 1 and \theta_k \geq 0 for all k. The r-th raw moment of X is m_r = \mathbb{E}[X^r] = \sum_{k=1}^K k^r \theta_k. In particular, the first raw moment is the mean \mu = \mathbb{E}[X] = \sum_{k=1}^K k \theta_k. The variance follows from the second raw moment as \mathrm{Var}(X) = \mathbb{E}[X^2] - \mu^2 = \sum_{k=1}^K k^2 \theta_k - \mu^2, or equivalently \mathrm{Var}(X) = \sum_{k=1}^K (k - \mu)^2 \theta_k. The central moments provide further characterization, with the r-th central moment defined as \mu_r = \mathbb{E}[(X - \mu)^r] = \sum_{k=1}^K (k - \mu)^r \theta_k. The skewness \gamma, measuring asymmetry, is the standardized third central moment \gamma = \mu_3 / \sigma^3, where \sigma^2 = \mathrm{Var}(X). The kurtosis \kappa, measuring tailedness, is the standardized fourth central moment \kappa = \mu_4 / \sigma^4. These scalar moments depend on the numeric labels assigned to the categories, so they are meaningful only when the labels themselves carry numeric meaning.

Alternatively, the categorical distribution arises as a vector of indicator random variables \mathbf{I} = (I_1, I_2, \dots, I_K), where I_j = 1 if category j is selected and I_j = 0 otherwise, with exactly one I_j = 1. The expected value of each indicator is \mathbb{E}[I_j] = \theta_j. The variance of each indicator is \mathrm{Var}(I_j) = \theta_j (1 - \theta_j), and for j \neq k, the covariance is \mathrm{Cov}(I_j, I_k) = -\theta_j \theta_k.

For example, consider K=2 with \theta = (0.5, 0.5) and labels 1, 2. Then \mu = \mathbb{E}[X] = 1 \cdot 0.5 + 2 \cdot 0.5 = 1.5 and \mathrm{Var}(X) = (1 - 1.5)^2 \cdot 0.5 + (2 - 1.5)^2 \cdot 0.5 = 0.25.
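
The moment formulas above translate directly into code. The following is a minimal sketch with hypothetical helper names (`scalar_moments`, `indicator_covariance`), assuming categories are labeled 1 through K.

```python
def scalar_moments(theta):
    """Mean and variance of X when categories are labeled 1, ..., K."""
    mean = sum(k * p for k, p in enumerate(theta, start=1))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(theta, start=1))
    return mean, var


def indicator_covariance(theta):
    """Covariance matrix of the indicator (one-hot) vector.

    Diagonal entries are theta_j * (1 - theta_j); off-diagonal entries
    are -theta_j * theta_k.
    """
    K = len(theta)
    return [
        [theta[j] * (1 - theta[j]) if j == k else -theta[j] * theta[k]
         for k in range(K)]
        for j in range(K)
    ]


mean, var = scalar_moments((0.5, 0.5))
print(mean, var)

cov = indicator_covariance((0.5, 0.5))
print(cov)
```

Each row of the indicator covariance matrix sums to zero, which reflects the constraint that exactly one indicator equals 1.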

Mode and Entropy

The mode of the categorical distribution is the outcome k that maximizes the probability mass function, specifically k = \arg\max_j \theta_j. If multiple outcomes share the maximum probability, the distribution is multimodal, with each such outcome serving as a mode. In the case of a uniform distribution, where \theta_k = 1/K for all k = 1, \dots, K, every category qualifies as a mode.

The entropy of the categorical distribution quantifies the expected uncertainty associated with a random outcome drawn from it. It is given by the Shannon entropy formula: H(\theta) = -\sum_{k=1}^K \theta_k \log \theta_k, with the convention 0 \log 0 = 0, where the base of the logarithm determines the units: natural logarithm for nats or base-2 for bits. This quantity equals the expected value \mathbb{E}[-\log P(X)], with the expectation computed over the distribution parameterized by \theta. The entropy achieves its maximum value of \log K (in the same units as the logarithm) when the distribution is uniform, \theta_k = 1/K for all k, as this configuration maximizes uncertainty subject to the constraint of K possible outcomes.

For illustration, consider a categorical distribution with \theta = (0.9, 0.1). The mode is the first category, and the entropy is H(\theta) = -0.9 \log 0.9 - 0.1 \log 0.1 \approx 0.325 nats (using the natural logarithm). To arrive at this value, compute \log 0.9 \approx -0.1054 and \log 0.1 \approx -2.3026, yielding -0.9 \times (-0.1054) - 0.1 \times (-2.3026) = 0.0948 + 0.2303 = 0.3251. By comparison, the uniform distribution over two categories has entropy \log 2 \approx 0.693 nats, obtained directly from the maximum formula.
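
A small sketch of both quantities, using only the standard library. The helper names `modes` and `entropy` are illustrative, and zero-probability categories are skipped under the usual 0 log 0 = 0 convention.

```python
import math


def modes(theta):
    """All 1-based categories that attain the maximum probability."""
    top = max(theta)
    return [k for k, p in enumerate(theta, start=1) if p == top]


def entropy(theta, base=math.e):
    """Shannon entropy; terms with theta_k = 0 contribute 0 by convention."""
    nats = -sum(p * math.log(p) for p in theta if p > 0)
    return nats / math.log(base)


print(modes((0.9, 0.1)), entropy((0.9, 0.1)))   # skewed two-category case
print(modes((0.5, 0.5)), entropy((0.5, 0.5)))   # uniform case, log 2 nats
print(entropy((0.5, 0.5), base=2))              # same distribution in bits
```

Comparing floating-point probabilities with `==` in `modes` is adequate for exact ties such as the uniform case; a tolerance would be a reasonable alternative for estimated probabilities.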

Parameter Estimation

Maximum Likelihood Estimation

Given a sample of n independent and identically distributed observations x_1, \dots, x_n from a categorical distribution with parameter vector \theta = (\theta_1, \dots, \theta_K)^\top satisfying \sum_{k=1}^K \theta_k = 1 and \theta_k > 0 for all k, the likelihood function is the product of the probability mass functions evaluated at the observations: L(\theta) = \prod_{i=1}^n \theta_{x_i}. This likelihood treats the categorical distribution as a special case of the multinomial distribution with one trial per observation.

To facilitate maximization, consider the log-likelihood \ell(\theta) = \log L(\theta). Define n_k = \sum_{i=1}^n \mathbf{1}_{\{x_i = k\}} as the observed count for category k, so \sum_{k=1}^K n_k = n. Then, \ell(\theta) = \sum_{k=1}^K n_k \log \theta_k. Maximizing \ell(\theta) subject to the constraint \sum_{k=1}^K \theta_k = 1 requires incorporating a Lagrange multiplier \lambda, yielding the Lagrangian \mathcal{L}(\theta, \lambda) = \sum_{k=1}^K n_k \log \theta_k + \lambda \left(1 - \sum_{k=1}^K \theta_k\right). Differentiating with respect to \theta_k gives \partial \mathcal{L}/\partial \theta_k = n_k / \theta_k - \lambda = 0, so \theta_k = n_k / \lambda. Summing over k and applying the constraint implies \lambda = n, hence the maximum likelihood estimator (MLE) is the vector of empirical frequencies \hat{\theta}_k = n_k / n for each k = 1, \dots, K. This closed-form solution depends on the data only through the category counts, which form a sufficient statistic.

The MLE \hat{\theta} possesses desirable statistical properties. It is unbiased, with E[\hat{\theta}_k] = \theta_k for each k, and consistent, converging in probability to the true \theta as n \to \infty. Furthermore, \hat{\theta} is asymptotically normal: \sqrt{n} (\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma), where the covariance matrix \Sigma has diagonal elements \Sigma_{kk} = \theta_k (1 - \theta_k) and off-diagonal elements \Sigma_{kl} = -\theta_k \theta_l for k \neq l. The variance of each component is thus \operatorname{Var}(\hat{\theta}_k) = \theta_k (1 - \theta_k)/n, which holds exactly for every n because n_k is binomially distributed, and reflects the multinomial sampling variability.

For illustration, consider n=3 observations \{1, 2, 1\} from a categorical distribution over K=3 categories. The counts are n_1 = 2, n_2 = 1, n_3 = 0, yielding \hat{\theta} = (2/3, 1/3, 0)^\top. Note that unobserved categories receive zero probability mass under this estimator.
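
The closed-form estimator amounts to counting. A minimal sketch follows, where `categorical_mle` is a hypothetical helper name and observations are assumed to be integer labels in 1 through K.

```python
from collections import Counter


def categorical_mle(observations, K):
    """Empirical frequencies n_k / n for categories 1, ..., K."""
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = Counter(observations)  # missing categories count as 0
    return [counts[k] / n for k in range(1, K + 1)]


theta_hat = categorical_mle([1, 2, 1], K=3)
print(theta_hat)  # the unobserved third category gets zero mass
```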

Method of Moments

The method of moments provides an alternative approach to parameter estimation for the categorical distribution by equating population moments to their sample counterparts. For a categorical random variable X taking values in \{1, 2, \dots, K\} with probabilities \theta = (\theta_1, \dots, \theta_K), the first raw moment is \mathbb{E}[X] = \sum_{k=1}^K k \theta_k. The corresponding sample moment is the sample mean \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, where x_1, \dots, x_n is a random sample. Equating these yields \sum_{k=1}^K k \theta_k = \bar{x}, but this single equation underdetermines the K-1 free parameters (given \sum_{k=1}^K \theta_k = 1) when K > 2.

To resolve this, the method employs indicator functions for each category: define I_k(X) = 1 if X = k and 0 otherwise, so \mathbb{E}[I_k(X)] = \theta_k. The sample analogue is the proportion \hat{\theta}_k = \frac{1}{n} \sum_{i=1}^n I_k(x_i) = \frac{n_k}{n}, where n_k is the count of observations equal to k. Equating moments gives the estimators \hat{\theta}_k = n_k / n for k = 1, \dots, K, which automatically satisfy the constraint \sum_{k=1}^K \hat{\theta}_k = 1. For the categorical distribution, this method of moments reduces to empirical frequency counting, which coincides exactly with the maximum likelihood estimator.

In other models the method of moments can be less statistically efficient than maximum likelihood, particularly when it requires higher-order moments that are nonlinear functions of the parameters. For the basic categorical case no such moments are needed, and the method mainly serves to illustrate moment-matching principles.

As an example, suppose a sample of n = 100 observations yields counts n_1 = 30, n_2 = 40, n_3 = 30 for a 3-category distribution. The method of moments estimators are \hat{\theta}_1 = 0.30, \hat{\theta}_2 = 0.40, \hat{\theta}_3 = 0.30, identical to those from maximum likelihood.
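
The underdetermination of the first moment for K > 2 can be seen numerically. In this sketch, `theta_b` is an illustrative second parameter vector chosen to share its mean with (0.3, 0.4, 0.3); the helper names are hypothetical.

```python
import math


def first_moment(theta):
    """E[X] for categories labeled 1, ..., K."""
    return sum(k * p for k, p in enumerate(theta, start=1))


def indicator_moment_estimates(sample, K):
    """Sample means of the indicators I_k, i.e. the proportions n_k / n."""
    n = len(sample)
    return [sum(1 for x in sample if x == k) / n for k in range(1, K + 1)]


# Two different parameter vectors with the same mean: matching E[X] alone
# cannot tell them apart.
theta_a = (0.3, 0.4, 0.3)
theta_b = (0.4, 0.2, 0.4)
print(math.isclose(first_moment(theta_a), first_moment(theta_b)))

# Matching the K indicator moments recovers every component.
sample = [1] * 30 + [2] * 40 + [3] * 30
theta_mom = indicator_moment_estimates(sample, K=3)
print(theta_mom)
```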

Conjugate Prior and Posterior

In Bayesian inference for the categorical distribution, the Dirichlet distribution serves as the conjugate prior for the probability parameter vector \boldsymbol{\theta} = (\theta_1, \dots, \theta_K). The prior is denoted \boldsymbol{\theta} \sim \text{Dir}(\boldsymbol{\alpha}), where \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K) with each \alpha_k > 0, and its density is proportional to \prod_{k=1}^K \theta_k^{\alpha_k - 1} over the (K-1)-simplex \sum_{k=1}^K \theta_k = 1, \theta_k \geq 0. The hyperparameters \alpha_k represent pseudo-counts, akin to prior observations for category k, with the total prior strength given by \sum_{k=1}^K \alpha_k. A uniform prior on the simplex arises when \alpha_k = 1 for all k, equivalent to \text{Dir}(1, \dots, 1).

For a dataset consisting of N independent draws from the categorical distribution, let n_k denote the observed count for category k, so \sum_{k=1}^K n_k = N; the resulting likelihood takes a multinomial form. The posterior distribution is then \boldsymbol{\theta} \mid \mathbf{n} \sim \text{Dir}(\boldsymbol{\alpha} + \mathbf{n}), with updated parameters \alpha_k' = \alpha_k + n_k for each k. This conjugacy follows directly from the product of the prior density and likelihood, which is proportional to \prod_{k=1}^K \theta_k^{\alpha_k + n_k - 1} and thus shares the Dirichlet kernel form after normalization.

The maximum a posteriori (MAP) estimate, obtained as the mode of the posterior (assuming \alpha_k + n_k > 1 for all k), is \theta_{\text{MAP},k} = \frac{\alpha_k + n_k - 1}{\sum_{j=1}^K (\alpha_j + n_j) - K}. With the uniform prior \alpha_k = 1 this reduces to the maximum likelihood estimate n_k / N, and with \alpha_k = 2 for all k it gives add-one (Laplace) smoothing, (n_k + 1)/(N + K). The posterior mean, \frac{\alpha_k + n_k}{\sum_{j=1}^K \alpha_j + N}, is an alternative point estimate; it yields add-one smoothing under the uniform prior and approaches the maximum likelihood estimate in the limit \boldsymbol{\alpha} \to \mathbf{0}.

As an illustration, for K=3 categories with uniform prior \boldsymbol{\alpha} = (1,1,1) and counts \mathbf{n} = (2,1,0), the posterior is \text{Dir}(3,2,1), with posterior mean (1/2, 1/3, 1/6).
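
The conjugate update and the two common point estimates can be sketched as follows. The function names are illustrative, and `posterior_mode` makes the additional assumption that every updated parameter is at least 1 (allowing the boundary case where a component of the mode is exactly 0).

```python
def dirichlet_posterior(alpha, counts):
    """Conjugate update: alpha_k' = alpha_k + n_k."""
    return [a + n for a, n in zip(alpha, counts)]


def posterior_mean(alpha_post):
    """E[theta_k] under Dir(alpha_post)."""
    total = sum(alpha_post)
    return [a / total for a in alpha_post]


def posterior_mode(alpha_post):
    """MAP estimate (alpha_k' - 1) / (sum_j alpha_j' - K).

    Assumption of this sketch: every alpha_k' >= 1 and the sum exceeds K.
    """
    K = len(alpha_post)
    denom = sum(alpha_post) - K
    if min(alpha_post) < 1 or denom <= 0:
        raise ValueError("mode formula needs alpha_k' >= 1 and sum > K")
    return [(a - 1) / denom for a in alpha_post]


post = dirichlet_posterior((1, 1, 1), (2, 1, 0))
print(post)                  # updated Dirichlet parameters
print(posterior_mean(post))  # smoothed estimate
print(posterior_mode(post))  # MAP estimate under the uniform prior
```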

Predictive and Marginal Distributions

In Bayesian inference for the categorical distribution with a Dirichlet prior, the marginal likelihood represents the probability of the observed data integrated over the prior distribution on the parameters θ. For an ordered sample \mathbf{x} of n independent categorical observations with counts \mathbf{n} = (n_1, \dots, n_K) for each category k = 1, \dots, K, it is given by p(\mathbf{x} \mid \boldsymbol{\alpha}) = \int p(\mathbf{x} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \, d\boldsymbol{\theta} = \frac{ B(\boldsymbol{\alpha} + \mathbf{n}) }{ B(\boldsymbol{\alpha}) }, where \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K) are the Dirichlet hyperparameters and B(\cdot) is the multivariate beta function defined as B(\boldsymbol{\alpha}) = \prod_k \Gamma(\alpha_k) / \Gamma(\sum_k \alpha_k). The probability of the count vector \mathbf{n} itself, regardless of order, multiplies this by the multinomial coefficient n! / \prod_k n_k!, giving the Dirichlet-multinomial distribution.

The posterior predictive distribution provides the probability of a new observation x^* given the data and prior, obtained by integrating over the posterior distribution of θ, which is Dirichlet(\boldsymbol{\alpha} + \mathbf{n}). This yields a categorical distribution with updated parameters: P(x^* = k \mid \mathbf{x}, \boldsymbol{\alpha}) = \frac{ \alpha_k + n_k }{ \sum_j \alpha_j + n }. The result follows from conjugacy: the posterior predictive probability is the expectation of θ_k under the posterior, which reduces to the normalized form above.

The posterior conditional distribution, which updates the posterior after observing a new category x^* = k, is proportional to the original posterior times θ_k, resulting in a Dirichlet distribution with parameters \boldsymbol{\alpha} + \mathbf{n} + \mathbf{e}_k, where \mathbf{e}_k is the unit vector with 1 in the k-th position. This sequential update property facilitates online Bayesian inference.

These distributions find application in model comparison through the marginal likelihood, which serves as the evidence for prior hyperparameters, and in predictive tasks such as data imputation, where the posterior predictive fills in missing categories while accounting for parameter uncertainty. For example, suppose the prior is Dirichlet(\boldsymbol{\alpha} = (3, 2, 1)) with observed counts \mathbf{n} = (0, 0, 0) (prior predictive case); the probability of a new observation in category 3 is then P(x^* = 3 \mid \boldsymbol{\alpha}) = 1 / (3 + 2 + 1) = 1/6.
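
The sequential update can be checked numerically: multiplying one-step-ahead predictive probabilities along an ordered sequence should reproduce the closed-form ratio B(α + n) / B(α) for that sequence. The sketch below uses `math.lgamma` from the standard library; the helper names and the example sequence [1, 2, 1] are illustrative assumptions.

```python
import math


def log_multivariate_beta(alpha):
    """log B(alpha) = sum_k log Gamma(alpha_k) - log Gamma(sum_k alpha_k)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def posterior_predictive(alpha, counts):
    """P(x* = k | data) = (alpha_k + n_k) / (sum_j alpha_j + n)."""
    total = sum(alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(alpha, counts)]


def sequence_evidence(alpha, sequence):
    """Probability of an ordered sequence via one-step-ahead predictions."""
    counts = [0] * len(alpha)
    prob = 1.0
    for x in sequence:  # x is a 1-based category label
        prob *= posterior_predictive(alpha, counts)[x - 1]
        counts[x - 1] += 1  # sequential Dirichlet update
    return prob, counts


alpha = (1.0, 1.0, 1.0)
prob, counts = sequence_evidence(alpha, [1, 2, 1])
closed_form = math.exp(
    log_multivariate_beta([a + n for a, n in zip(alpha, counts)])
    - log_multivariate_beta(alpha)
)
print(prob, closed_form)  # the two values should agree up to rounding

# Prior predictive case from the text: alpha = (3, 2, 1), no data yet.
print(posterior_predictive((3, 2, 1), (0, 0, 0))[2])
```

Working with log-gamma values rather than gamma values directly is a deliberate choice here, since the gamma function overflows quickly for large counts.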

Sampling Methods

Direct Sampling

Direct sampling from a categorical distribution with probability vector \theta = (\theta_1, \theta_2, \dots, \theta_K) relies on the inverse transform sampling method, a standard technique for generating random variates from any probability distribution using a uniform random variable. The approach exploits the cumulative distribution function (CDF) F(k) = \sum_{j=1}^k \theta_j for k = 1, \dots, K, which is non-decreasing and reaches 1 at k = K. By generating a uniform random variable U \sim \text{Uniform}(0,1) and finding the smallest k such that U \leq F(k), the resulting k follows the desired categorical distribution, as this inverts the CDF to match the target probabilities.

The algorithm proceeds as follows: draw U from the uniform distribution, then iteratively accumulate the probabilities until the cumulative sum reaches or exceeds U, selecting the category at that point. This ensures each k is chosen with exact probability \theta_k, since the intervals defined by the partial sums partition the unit interval according to the PMF. The direct sampling algorithm can be written as follows:
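```python
import random


def sample_categorical(theta):
    """Draw one category (1-based) by inverse transform sampling."""
    u = random.random()  # U ~ Uniform[0, 1)
    cumulative = 0.0
    for k, p in enumerate(theta, start=1):
        cumulative += p  # running CDF value F(k)
        if u <= cumulative:
            return k
    # Fallback: only reached if round-off leaves sum(theta) slightly below u.
    return len(theta)
```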
This implementation runs in O(K) time per sample, making it efficient for distributions with a small number of categories K, though less ideal for very large K, where precomputing the cumulative sums or using other optimized alternatives may be preferred.

For illustration, consider \theta = (0.3, 0.4, 0.3). If U = 0.5, the cumulative sums are F(1) = 0.3, F(2) = 0.7, and F(3) = 1.0. Since 0.3 < 0.5 \leq 0.7, the sample is category 2. The example shows how the method preserves the probability structure: each category owns a subinterval of the unit interval whose length equals its probability.

In practice, libraries like NumPy provide built-in functions for this task; for instance, np.random.choice with the p argument set to \theta generates samples from the categorical distribution. The NumPy documentation recommends the choice method of a Generator instance for new code.
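
When many samples are drawn from the same \theta, one possible optimization is to precompute the cumulative sums once and locate U by binary search, which costs O(\log K) per sample after an O(K) setup. This is a variant of the linear scan above, not the method any particular library is known to use; the helper names `build_cdf` and `sample_with_cdf` are hypothetical.

```python
import bisect
import itertools
import random


def build_cdf(theta):
    """Cumulative sums F(1), ..., F(K), computed once in O(K)."""
    return list(itertools.accumulate(theta))


def sample_with_cdf(cdf, u=None):
    """Smallest 1-based k with u <= F(k), found by binary search."""
    if u is None:
        u = random.random()
    index = bisect.bisect_left(cdf, u)    # first position with cdf[i] >= u
    return min(index, len(cdf) - 1) + 1   # clamp guards against round-off


cdf = build_cdf((0.3, 0.4, 0.3))
print(sample_with_cdf(cdf, u=0.5))  # worked example from the text
print([sample_with_cdf(cdf) for _ in range(10)])
```

The standard library's `random.choices(population, weights=...)` offers a ready-made alternative when a custom sampler is not needed.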

Gumbel-Max Trick

The Gumbel-max trick offers a perturbation-based method for sampling from a categorical distribution \operatorname{Cat}(\theta), where \theta = (\theta_1, \dots, \theta_K) with \sum_{k=1}^K \theta_k = 1 and \theta_k > 0, by adding independent Gumbel noise to the log-probabilities and selecting the maximizing index. This approach is particularly valuable in applications that require differentiable approximations to discrete sampling.

The standard Gumbel distribution, denoted \operatorname{Gumbel}(0,1), arises in extreme value theory as a limiting distribution of maxima of i.i.d. random variables. Its cumulative distribution function is F(g) = \exp\left(-\exp(-g)\right), \quad g \in \mathbb{R}, and its probability density function is f(g) = \exp\left(-g - \exp(-g)\right). A key property is that if G_1, \dots, G_K \sim \operatorname{Gumbel}(0,1) are i.i.d., then for constants c_1, \dots, c_K, the random variable \max_k (c_k + G_k) follows \operatorname{Gumbel}(\log \sum_k \exp(c_k), 1), and the \arg\max_k (c_k + G_k) follows a categorical distribution with probabilities proportional to \exp(c_k).

To sample an index i \sim \operatorname{Cat}(\theta), compute z_k = \log \theta_k + G_k for k = 1, \dots, K, where the G_k are i.i.d. \operatorname{Gumbel}(0,1), and set i = \arg\max_k z_k. This procedure generates an exact sample from the target distribution. A proof relies on the independence of the G_k and the Gumbel CDF: the probability that z_j exceeds all other z_k for k \neq j is P(z_j > z_k \ \forall k \neq j) = \int_{-\infty}^\infty f(g) \prod_{k \neq j} F(g + \log \theta_j - \log \theta_k) \, dg = \theta_j, which follows from substituting the Gumbel forms and recognizing the integral as the softmax probability \theta_j = \frac{\exp(\log \theta_j)}{\sum_k \exp(\log \theta_k)}. This establishes that the argmax yields exactly the categorical distribution.

The trick's main advantage is that it reparameterizes a discrete random variable as a deterministic function of the parameters and of noise that does not depend on them. The argmax itself is not differentiable, so gradient-based training replaces it with a temperature-controlled softmax (the Gumbel-Softmax relaxation), which allows approximate gradients to flow through the sampling process during backpropagation. This is useful for training models like variational autoencoders (VAEs) with discrete latent variables, where direct sampling would otherwise block differentiability, and it supports lower-variance gradient estimates in reinforcement learning and structured prediction tasks.

For illustration, suppose \theta = (0.3, 0.4, 0.3). Draw G = (-0.5, 0.2, -1.0). Then z_1 = \log 0.3 - 0.5 \approx -1.70, \quad z_2 = \log 0.4 + 0.2 \approx -0.72, \quad z_3 = \log 0.3 - 1.0 \approx -2.20. The maximum is z_2, so the sample is the second category (index 2). Different noise realizations will yield samples according to the probabilities in \theta.

Historically, the underlying Gumbel distribution and its max-stability properties were developed by Emil J. Gumbel in the mid-twentieth century as part of extreme value theory. The specific application to sampling from categorical distributions, via the argmax perturbation, gained prominence in modern machine learning through connections to discrete choice models and efficient inference algorithms.
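
A minimal standard-library sketch of the sampling procedure follows. Gumbel noise is generated by the inverse-CDF formula G = -\log(-\log U); the helper names are illustrative, and the empirical check at the end uses an arbitrary seed and sample size chosen for this example.

```python
import math
import random


def standard_gumbel(rng=random):
    """One Gumbel(0, 1) draw via G = -log(-log(U)), with U in (0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() may return 0.0, where log is undefined
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_index(theta, noise):
    """1-based argmax_k (log theta_k + noise_k); requires theta_k > 0."""
    scores = [math.log(p) + g for p, g in zip(theta, noise)]
    return max(range(len(scores)), key=scores.__getitem__) + 1


def sample_gumbel_max(theta, rng=random):
    """One categorical sample using fresh Gumbel noise per category."""
    noise = [standard_gumbel(rng) for _ in theta]
    return gumbel_max_index(theta, noise)


theta = (0.3, 0.4, 0.3)
print(gumbel_max_index(theta, (-0.5, 0.2, -1.0)))  # fixed noise from the text

rng = random.Random(0)  # arbitrary seed for reproducibility
draws = [sample_gumbel_max(theta, rng) for _ in range(20000)]
freqs = [draws.count(k) / len(draws) for k in (1, 2, 3)]
print(freqs)  # should be close to theta
```

This sketch covers only the exact argmax sampler; the differentiable Gumbel-Softmax relaxation requires an automatic differentiation framework and is not shown.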

Multinomial and Bernoulli

The multinomial distribution arises as the joint distribution of counts from n independent trials, each following a categorical distribution with K categories and probability vector \theta = (\theta_1, \dots, \theta_K) where \sum_{k=1}^K \theta_k = 1. If X = (X_1, \dots, X_K) denotes the vector of counts with \sum_{k=1}^K X_k = n, then the probability mass function is given by P(X = x) = \frac{n!}{x_1! \cdots x_K!} \prod_{k=1}^K \theta_k^{x_k}, for non-negative integers x_1, \dots, x_K summing to n. This generalizes the categorical distribution to multiple trials by modeling the frequency of outcomes across categories. The distribution for a single trial is categorical, while the vector X represents the sum of the one-hot indicator vectors from n independent categorical trials.

The Bernoulli distribution is a special case of the categorical distribution when K=2, with probability vector \theta = (p, 1-p) for 0 \leq p \leq 1. For a random variable Y \in \{0,1\}, the probability mass function is P(Y=1) = p and P(Y=0) = 1-p, or equivalently P(Y=y) = p^y (1-p)^{1-y}. The variance of Y is p(1-p).

For the multinomial distribution, the covariance between distinct categories is \operatorname{Cov}(X_i, X_j) = -n \theta_i \theta_j for i \neq j, reflecting the negative dependence due to the fixed total n. For example, if five independent categorical trials have \theta = (0.4, 0.3, 0.3), possible multinomial outcomes include counts like (2,1,2), with probability \frac{5!}{2!1!2!} (0.4)^2 (0.3)^1 (0.3)^2 = 0.1296.
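
The multinomial PMF and its single-trial reduction to the categorical PMF can be sketched with exact integer arithmetic for the coefficient. The function name `multinomial_pmf` is an illustrative choice.

```python
import math


def multinomial_pmf(counts, theta):
    """n! / (x_1! ... x_K!) * prod_k theta_k ** x_k."""
    n = sum(counts)
    coefficient = math.factorial(n)
    for x in counts:
        coefficient //= math.factorial(x)  # exact: each step stays an integer
    prob = float(coefficient)
    for x, p in zip(counts, theta):
        prob *= p ** x
    return prob


theta = (0.4, 0.3, 0.3)
print(multinomial_pmf((2, 1, 2), theta))  # five-trial example from the text
print(multinomial_pmf((0, 1, 0), theta))  # n = 1 reduces to theta_2
```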

Dirichlet and Softmax

The Dirichlet distribution is a family of continuous multivariate probability distributions supported on the interior of the (K-1)-simplex, consisting of vectors \theta = (\theta_1, \dots, \theta_K) where \theta_k > 0 for all k and \sum_{k=1}^K \theta_k = 1. It is parameterized by a positive vector \alpha = (\alpha_1, \dots, \alpha_K) with \alpha_k > 0, and its probability density function is f(\theta \mid \alpha) = \frac{\Gamma\left( \sum_{k=1}^K \alpha_k \right)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \theta_k^{\alpha_k - 1}, where \Gamma denotes the gamma function; this form ensures normalization over the simplex. The mean of each component is E[\theta_k] = \frac{\alpha_k}{\sum_{j=1}^K \alpha_j}, providing a straightforward interpretation of the concentration parameters \alpha in terms of average probabilities. When all \alpha_k = 1, the distribution reduces to a uniform distribution over the simplex. Increasing one \alpha_k relative to the others shifts mass toward the corresponding vertex, a larger total \sum_k \alpha_k concentrates the mass around the mean, and values \alpha_k < 1 push mass toward the boundary of the simplex.

In Bayesian inference, the Dirichlet distribution acts as the conjugate prior for the unknown probability vector \theta of a categorical distribution, meaning that if the prior is Dirichlet(\alpha), the posterior after observing data remains Dirichlet with updated parameters. This conjugacy facilitates closed-form updates and predictive distributions, making it a foundational tool for modeling uncertainty in categorical parameters (detailed further in the Conjugate Prior and Posterior section).

The softmax function offers a practical parameterization of the categorical distribution by mapping an unconstrained vector \beta \in \mathbb{R}^K to the simplex via \theta_k = \frac{\exp(\beta_k)}{\sum_{j=1}^K \exp(\beta_j)}, \quad k = 1, \dots, K. This ensures \theta_k > 0 and \sum_k \theta_k = 1, transforming raw scores (logits) into interpretable probabilities. It is widely employed in the output layers of neural networks for multi-class classification, where the network learns the \beta parameters directly, allowing optimization without probability constraints. Conversely, taking the natural logarithm of each \theta_k recovers \beta up to an additive constant, since \log \theta_k = \beta_k - \log \sum_{j=1}^K \exp(\beta_j) and the softmax is unchanged when the same constant is added to every \beta_k. Working with this logit representation, with the log-sum-exp term subtracted for numerical stability, facilitates gradient-based learning by keeping the optimization away from the boundary of the simplex.

For illustration, consider K=2 with \beta = (0, \log(0.5)): the softmax computes \theta_1 = \frac{\exp(0)}{\exp(0) + \exp(\log(0.5))} = \frac{1}{1 + 0.5} = \frac{2}{3} and \theta_2 = \frac{1}{3}, aligning with the logistic (sigmoid) function for binary (Bernoulli) outcomes as a special case.
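
A numerically stable softmax is commonly computed through the log-softmax, subtracting the maximum logit before exponentiating. The sketch below uses only the standard library; the function names are illustrative, and the shifted logits are an example chosen here to show invariance to an additive constant.

```python
import math


def log_softmax(beta):
    """log theta_k = beta_k - logsumexp(beta), computed stably."""
    m = max(beta)  # subtracting the max keeps exp() from overflowing
    log_sum_exp = m + math.log(sum(math.exp(b - m) for b in beta))
    return [b - log_sum_exp for b in beta]


def softmax(beta):
    """Map unconstrained logits to a probability vector on the simplex."""
    return [math.exp(v) for v in log_softmax(beta)]


theta = softmax([0.0, math.log(0.5)])
print(theta)  # worked example from the text

# Adding the same constant to every logit leaves the probabilities unchanged,
# and the stable form handles large logits without overflow.
shifted = softmax([1000.0, 1000.0 + math.log(0.5)])
print(shifted)
```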
