
Dirichlet distribution

The Dirichlet distribution is a family of continuous multivariate probability distributions defined on the interior of the standard simplex, where the components of a random vector sum to 1 and are each between 0 and 1. It generalizes the univariate beta distribution to multiple dimensions and is parameterized by a positive real-valued vector \alpha = (\alpha_1, \dots, \alpha_K), with each \alpha_i > 0. This distribution models uncertainty over probability vectors, such as proportions or categorical probabilities, and is fundamental in Bayesian statistics due to its conjugacy with the multinomial distribution. Named after the 19th-century mathematician Johann Peter Gustav Lejeune Dirichlet, the distribution emerged from early work on multivariate integrals and has since become a cornerstone of modern statistics and machine learning.

The probability density function of a K-dimensional Dirichlet random vector X = (X_1, \dots, X_K) is given by f(X = x \mid \alpha) = \frac{\Gamma\left(\sum_{k=1}^K \alpha_k\right)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K x_k^{\alpha_k - 1}, for x_k > 0, \sum_{k=1}^K x_k = 1, where \Gamma denotes the gamma function; the normalizing constant is the reciprocal of the multivariate beta function. The marginal distribution of each X_i is a beta distribution with parameters \alpha_i and \sum_{j \neq i} \alpha_j. The expected value is E[X_i] = \alpha_i / \sum_{k=1}^K \alpha_k, and the variance is \mathrm{Var}(X_i) = [\alpha_i (\sum \alpha_k - \alpha_i)] / [(\sum \alpha_k)^2 (\sum \alpha_k + 1)], with negative covariances between components reflecting their dependence through the sum-to-1 constraint.

Key applications of the Dirichlet distribution include Bayesian inference for categorical and multinomial data, where it serves as a flexible conjugate prior that updates to a posterior of the same form. It is prominently featured in latent Dirichlet allocation (LDA), a generative probabilistic model for discovering latent topics in text corpora by treating topic proportions and word distributions as Dirichlet draws. Other uses span population genetics for modeling allele frequencies, compositional data analysis, and nonparametric Bayesian methods via the Dirichlet process, which extends it to infinite-dimensional settings for random probability measures.

Definitions

Probability density function

The Dirichlet distribution, denoted \operatorname{Dir}(\boldsymbol{\alpha}), is defined for a K-dimensional random vector \mathbf{x} = (x_1, \dots, x_K) parameterized by a positive parameter vector \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K) with each \alpha_i > 0. The probability density function of the Dirichlet distribution is given by f(\mathbf{x} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^K x_i^{\alpha_i - 1}, where B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^K \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^K \alpha_i\right)} denotes the multivariate beta function, which serves as the normalizing constant ensuring the density integrates to 1 over the appropriate domain. This form arises as a natural multivariate generalization of the beta distribution, extending the two-parameter case (\alpha_1, \alpha_2) to K \geq 2 parameters while preserving conjugacy properties; specifically, the Dirichlet distribution is the conjugate prior for the parameter vector of a multinomial distribution, allowing closed-form posterior updates in Bayesian inference for categorical data. The distribution is named after the mathematician Johann Peter Gustav Lejeune Dirichlet (1805–1859), owing to his work on the multivariate beta integral. The full multivariate probabilistic formulation and its widespread use in statistics emerged in the mid-20th century through developments in Bayesian analysis and multivariate distributions.
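
To make the density concrete, here is a minimal sketch (using NumPy and SciPy, with illustrative parameter values) that evaluates the log-density from the formula above and compares it against SciPy's built-in implementation; the function name dirichlet_logpdf is just for illustration.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

def dirichlet_logpdf(x, alpha):
    """Log-density of Dir(alpha) at a point x in the interior of the simplex."""
    x, alpha = np.asarray(x, float), np.asarray(alpha, float)
    log_B = gammaln(alpha).sum() - gammaln(alpha.sum())  # log of the multivariate beta function
    return np.sum((alpha - 1.0) * np.log(x)) - log_B

alpha = np.array([2.0, 3.0, 5.0])
x = np.array([0.2, 0.3, 0.5])            # a point on the simplex
print(dirichlet_logpdf(x, alpha))        # manual evaluation of the formula
print(dirichlet.logpdf(x, alpha))        # SciPy's built-in result should agree
```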

Parameterization and support

The Dirichlet distribution is parameterized by a vector of positive concentration parameters \alpha = (\alpha_1, \dots, \alpha_K), where K \ge 2 is the number of categories (the dimension in the multivariate case) and each \alpha_i > 0. This parameterization ensures the distribution is well-defined over compositions or proportions. The support of the distribution is the (K-1)-dimensional probability simplex \Delta^{K-1} = \{x = (x_1, \dots, x_K) \in \mathbb{R}^K \mid x_i \ge 0 \ \forall i, \sum_{i=1}^K x_i = 1 \}. This set corresponds to all possible probability vectors over K categories, with the density supported on the interior of the simplex and its behavior at the boundary depending on the parameters. The requirement that each \alpha_i > 0 guarantees the positivity of the density function on the interior of the simplex and the finiteness of the normalizing constant, which is the multivariate beta function B(\alpha) = \prod_{i=1}^K \Gamma(\alpha_i) / \Gamma(\sum_{i=1}^K \alpha_i); the gamma function \Gamma(\cdot) is defined and positive only for positive arguments in this context. If any \alpha_i \le 0, the distribution is not defined, as the normalizing constant becomes infinite or undefined, rendering the density improper or non-integrable over the simplex. The concentration parameter is defined as \alpha_0 = \sum_{i=1}^K \alpha_i, which quantifies the overall concentration or precision of the distribution around its mean. Larger values of \alpha_0 correspond to less variability and greater concentration near the expected proportions \mu_i = \alpha_i / \alpha_0, while smaller values of \alpha_0 result in higher variance, with probability mass increasingly concentrated near the vertices of the simplex as \alpha_0 \to 0.
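
The effect of the concentration parameter can be seen directly by simulation; the following sketch (illustrative values, NumPy assumed) holds the expected proportions fixed and varies \alpha_0, showing the component-wise spread shrinking as \alpha_0 grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mean = np.array([0.2, 0.3, 0.5])          # fixed expected proportions alpha_i / alpha_0

for alpha0 in (1.0, 10.0, 1000.0):
    samples = rng.dirichlet(alpha0 * mean, size=100_000)
    # standard deviation of each component shrinks as alpha_0 increases
    print(alpha0, samples.std(axis=0).round(4))
```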

Special cases

The two-dimensional case of the Dirichlet distribution, where the parameter vector has only two components \boldsymbol{\alpha} = (\alpha_1, \alpha_2), reduces precisely to the beta distribution with parameters \alpha_1 and \alpha_2; the support is the interval [0, 1], and the density simplifies to the standard beta form f(x) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} x^{\alpha_1 - 1} (1 - x)^{\alpha_2 - 1} for 0 < x < 1. A symmetric Dirichlet distribution arises when all parameters are equal, \alpha_i = \alpha for i = 1, \dots, K and some \alpha > 0, reducing the family to a single-parameter distribution that is permutation-symmetric over the simplex because all components are treated identically. The uniform Dirichlet distribution occurs as a special instance of the symmetric case when \alpha_i = 1 for all i, yielding a constant density of (K-1)! over the (K-1)-dimensional simplex, equivalent to the uniform distribution on that space. When all \alpha_i = 1/2, the Dirichlet distribution corresponds to the distribution of the normalized squared coordinates of a point chosen uniformly at random from the portion of the surface of the unit sphere in K dimensions lying in the positive orthant, giving it a geometric interpretation and connecting it to spacings derived from order statistics in lower dimensions. In limiting cases, as individual \alpha_i \to 0^+ while keeping others fixed, the probability mass concentrates on the boundaries of the simplex, favoring configurations where one or more components approach 0 or 1; conversely, as the total concentration parameter \alpha_0 = \sum \alpha_i \to \infty with fixed ratios \alpha_i / \alpha_0, the distribution approaches a Dirac delta centered at the mean vector (\alpha_1 / \alpha_0, \dots, \alpha_K / \alpha_0).

Properties

Moments

The mean vector of a random vector \mathbf{X} \sim \mathrm{Dirichlet}(\boldsymbol{\alpha}), where \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K) with \alpha_i > 0 for all i and \alpha_0 = \sum_{j=1}^K \alpha_j, is given by \mathbb{E}[X_i] = \frac{\alpha_i}{\alpha_0}, \quad i = 1, \dots, K. This arises from integrating x_i against the Dirichlet density, which yields \mathbb{E}[X_i] = \frac{1}{B(\boldsymbol{\alpha})} \int_{\Delta^{K-1}} x_i \prod_{j=1}^K x_j^{\alpha_j - 1} \, d\mathbf{x} = \frac{B(\boldsymbol{\alpha}^{(i)})}{B(\boldsymbol{\alpha})}, where B(\cdot) is the multivariate beta function, \boldsymbol{\alpha}^{(i)} is \boldsymbol{\alpha} with the i-th component incremented by 1, and the property \Gamma(z+1) = z \Gamma(z) simplifies the ratio to \alpha_i / \alpha_0.

The second moments follow similarly. The variance of each component is \mathrm{Var}(X_i) = \frac{\alpha_i (\alpha_0 - \alpha_i)}{\alpha_0^2 (\alpha_0 + 1)}, obtained as \mathbb{E}[X_i^2] - (\mathbb{E}[X_i])^2, where \mathbb{E}[X_i^2] = \frac{\alpha_i (\alpha_i + 1)}{\alpha_0 (\alpha_0 + 1)} via the same beta integral with the power on x_i raised by 2. The covariance between distinct components is \mathrm{Cov}(X_i, X_j) = -\frac{\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)}, \quad i \neq j, derived from \mathbb{E}[X_i X_j] = \frac{\alpha_i \alpha_j}{\alpha_0 (\alpha_0 + 1)} and the relation \sum_{i=1}^K X_i = 1, which implies negative dependence. These expressions highlight how larger \alpha_0 reduces variability, with variances and covariances scaling inversely with \alpha_0^2.

Higher-order raw moments admit a general closed form. For non-negative integers k_1, \dots, k_K, \mathbb{E}\left[ \prod_{i=1}^K X_i^{k_i} \right] = \frac{\Gamma(\alpha_0) \prod_{i=1}^K \Gamma(\alpha_i + k_i)}{\Gamma\left(\alpha_0 + \sum_{i=1}^K k_i\right) \prod_{i=1}^K \Gamma(\alpha_i)}, derived by augmenting the exponents in the Dirichlet density integral, which shifts the parameters accordingly; the multivariate beta function normalizes the result using gamma-function identities. Central moments can be computed from these via inclusion-exclusion or generating functions, though explicit forms beyond the second order are typically expressed recursively or numerically. The marginal raw moments can equivalently be written as ratios of Pochhammer symbols (rising factorials), \mathbb{E}[X_i^m] = \frac{\Gamma(\alpha_i + m) \, \Gamma(\alpha_0)}{\Gamma(\alpha_i) \, \Gamma(\alpha_0 + m)} = \frac{\alpha_i^{(m)}}{\alpha_0^{(m)}}, a form that generalizes to joint products via the general moment formula above.

Alternatively, moments can be derived from the gamma representation: let Y_1, \dots, Y_K be independent \mathrm{Gamma}(\alpha_i, 1) random variables, so X_i = Y_i / \sum_{j=1}^K Y_j. The joint moments of the Y_j follow from gamma properties, and integrating over the sum yields the same closed forms via properties of gamma integrals. For log-moments, which arise in maximum-likelihood estimation and entropy calculations, the digamma function appears as \mathbb{E}[\log X_i] = \psi(\alpha_i) - \psi(\alpha_0), where \psi(z) = \Gamma'(z)/\Gamma(z), but raw moments suffice for standard measures of location and spread.

Asymptotically, when \alpha_0 \to \infty with fixed ratios \alpha_i / \alpha_0 = \mu_i, the distribution concentrates around the mean \boldsymbol{\mu} = (\mu_1, \dots, \mu_K), as the second moments show variance shrinking like O(1/\alpha_0); higher moments similarly approach those of a degenerate distribution at \boldsymbol{\mu}, reflecting a central limit theorem-like behavior for normalized gamma sums.
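
A short numerical check of the first and second moments (a sketch with illustrative parameters, assuming NumPy) compares the closed-form mean and covariance to Monte Carlo estimates.

```python
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])
a0 = alpha.sum()

mean = alpha / a0
cov = -np.outer(alpha, alpha) / (a0**2 * (a0 + 1.0))                 # off-diagonal covariances
np.fill_diagonal(cov, alpha * (a0 - alpha) / (a0**2 * (a0 + 1.0)))   # variances on the diagonal

rng = np.random.default_rng(1)
samples = rng.dirichlet(alpha, size=200_000)
print(mean, samples.mean(axis=0).round(4))   # closed form vs. Monte Carlo
print(cov.round(5))
print(np.cov(samples.T).round(5))
```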

Mode

The mode of the Dirichlet distribution represents the value of the random vector \mathbf{x} = (x_1, \dots, x_K) on the simplex that maximizes the probability density function, corresponding to the most likely composition under the distribution. To find the mode, consider the logarithm of the density, which is \log f(\mathbf{x} \mid \boldsymbol{\alpha}) = \text{const} + \sum_{i=1}^K (\alpha_i - 1) \log x_i, subject to the constraints \sum_{i=1}^K x_i = 1 and x_i \geq 0. Using Lagrange multipliers to maximize this under the equality constraint yields the critical point x_i = \frac{\alpha_i - 1}{\sum_{j=1}^K (\alpha_j - 1)} = \frac{\alpha_i - 1}{\alpha_0 - K}, where \alpha_0 = \sum_{j=1}^K \alpha_j. This interior mode exists provided that \alpha_i > 1 for all i = 1, \dots, K, ensuring all x_i > 0. If one or more \alpha_i \leq 1, the mode lies on the boundary of the simplex, specifically on the face where the corresponding x_i = 0. In such cases, the distribution restricted to the remaining coordinates (with those \alpha_i \leq 1 set aside) follows a lower-dimensional Dirichlet distribution, and the mode is determined recursively on that sub-simplex. The location of the mode shifts as the parameters \alpha_i change; larger \alpha_i relative to the others pull the mode toward higher values of x_i, reflecting a higher concentration of probability mass in that direction. In contrast to the mean, which is always \frac{\alpha_i}{\alpha_0}, the mode identifies the peak of the distribution rather than its center of mass. For the special case of the uniform Dirichlet distribution where \alpha_i = 1 for all i, the density is constant over the simplex, resulting in no unique interior mode, as every point is equally probable.
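
The interior mode formula is easy to evaluate; this sketch (illustrative parameters, NumPy assumed) also prints the mean for comparison.

```python
import numpy as np

def dirichlet_mode(alpha):
    """Interior mode of Dir(alpha); only defined when every alpha_i > 1."""
    alpha = np.asarray(alpha, dtype=float)
    if np.any(alpha <= 1.0):
        raise ValueError("interior mode requires alpha_i > 1 for all i")
    return (alpha - 1.0) / (alpha.sum() - alpha.size)

alpha = np.array([2.0, 3.0, 5.0])
print(dirichlet_mode(alpha))     # (alpha_i - 1) / (alpha_0 - K) = [1/7, 2/7, 4/7]
print(alpha / alpha.sum())       # mean alpha_i / alpha_0 = [0.2, 0.3, 0.5], for comparison
```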

Marginal distributions

The marginal distribution of any single component X_i of a random vector \mathbf{X} = (X_1, \dots, X_K)^\top \sim \operatorname{Dir}(\alpha_1, \dots, \alpha_K) follows a beta distribution, specifically X_i \sim \operatorname{Beta}(\alpha_i, \alpha_0 - \alpha_i), where \alpha_0 = \sum_{j=1}^K \alpha_j. This univariate marginal arises from integrating out the other components in the probability density function (PDF) of the Dirichlet distribution. The PDF is given by f(\mathbf{x}) = \frac{\Gamma(\alpha_0)}{\prod_{j=1}^K \Gamma(\alpha_j)} \prod_{j=1}^K x_j^{\alpha_j - 1}, \quad \mathbf{x} \in \Delta_K, where \Delta_K = \{\mathbf{x} \in [0,1]^K : \sum_{j=1}^K x_j = 1\} is the (K-1)-dimensional simplex. To obtain the marginal PDF of X_i, fix x_i and integrate over the remaining x_j (for j \neq i) subject to \sum_{j \neq i} x_j = 1 - x_i and x_j \geq 0. The integral over this sub-simplex equals the normalizing constant of a Dirichlet distribution on K-1 components with parameters \{\alpha_j\}_{j \neq i}, scaled by a power of (1 - x_i); it evaluates to \frac{\prod_{j \neq i} \Gamma(\alpha_j)}{\Gamma(\alpha_0 - \alpha_i)} (1 - x_i)^{\alpha_0 - \alpha_i - 1}. Substituting back into the PDF yields the marginal PDF: f_{X_i}(x_i) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_i) \Gamma(\alpha_0 - \alpha_i)} x_i^{\alpha_i - 1} (1 - x_i)^{\alpha_0 - \alpha_i - 1}, \quad 0 < x_i < 1.

For the joint marginal distribution of a subset of components, consider partitioning the indices \{1, \dots, K\} into m \leq K disjoint groups G_1, \dots, G_m, and define Y_\ell = \sum_{i \in G_\ell} X_i for \ell = 1, \dots, m. Then (Y_1, \dots, Y_m)^\top \sim \operatorname{Dir}(\beta_1, \dots, \beta_m), where \beta_\ell = \sum_{i \in G_\ell} \alpha_i. This aggregation property follows from the structure of the Dirichlet PDF, as integrating out components within each group leverages the gamma function identities in the normalizing constant, preserving the Dirichlet form with summed parameters. In particular, the joint marginal of the first m individual components (X_1, \dots, X_m)^\top is \operatorname{Dir}(\alpha_1, \dots, \alpha_m, \alpha_0 - \sum_{j=1}^m \alpha_j), treating the remaining components as a single aggregated variable.

The components of a Dirichlet random vector are dependent, despite their marginals being beta distributions, due to the constraint \sum_{j=1}^K X_j = 1. This sum-to-one condition induces negative dependence between pairs of components. Specifically, the covariance between distinct components is \operatorname{Cov}(X_i, X_j) = -\frac{\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)} for i \neq j, and the correlation is \operatorname{Corr}(X_i, X_j) = -\sqrt{\frac{\alpha_i \alpha_j}{(\alpha_0 - \alpha_i)(\alpha_0 - \alpha_j)}}. These expressions derive from the second moments, using E[X_i] = \alpha_i / \alpha_0 and \operatorname{Var}(X_i) = \alpha_i (\alpha_0 - \alpha_i) / [\alpha_0^2 (\alpha_0 + 1)], with the negative covariances reflecting a dependence structure consistent with the simplex constraint.
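
Both the beta marginal and the aggregation property can be checked empirically; the sketch below (illustrative parameters, NumPy/SciPy assumed) applies a Kolmogorov-Smirnov test to simulated draws, where large p-values indicate agreement with the stated beta laws.

```python
import numpy as np
from scipy.stats import beta, kstest

rng = np.random.default_rng(2)
alpha = np.array([2.0, 3.0, 5.0])
a0 = alpha.sum()
samples = rng.dirichlet(alpha, size=50_000)

# Marginal: X_1 should follow Beta(alpha_1, alpha_0 - alpha_1)
print(kstest(samples[:, 0], beta(alpha[0], a0 - alpha[0]).cdf).pvalue)

# Aggregation: X_1 + X_2 should follow Beta(alpha_1 + alpha_2, alpha_3),
# i.e. the first coordinate of Dir(alpha_1 + alpha_2, alpha_3)
print(kstest(samples[:, 0] + samples[:, 1], beta(alpha[0] + alpha[1], alpha[2]).cdf).pvalue)
```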

Conjugacy with multinomial

In Bayesian statistics, the Dirichlet distribution serves as the conjugate prior for the parameter vector \mathbf{p} = (p_1, \dots, p_K) of a multinomial distribution, where \mathbf{p} lies on the (K-1)-simplex. Specifically, if the prior is \mathbf{p} \sim \mathrm{Dir}(\boldsymbol{\alpha}) with concentration parameters \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K), and the likelihood arises from multinomial observations \mathbf{n} = (n_1, \dots, n_K) with total count N = \sum_{i=1}^K n_i, then the posterior distribution is \mathbf{p} \mid \mathbf{n} \sim \mathrm{Dir}(\boldsymbol{\alpha} + \mathbf{n}). This conjugacy preserves the Dirichlet family, facilitating analytical updates in inference. The posterior update rule is straightforward: each component becomes \alpha_i' = \alpha_i + n_i for i = 1, \dots, K, reflecting the addition of observed counts to the prior pseudocounts encoded by \boldsymbol{\alpha}. This closed-form posterior enables exact Bayesian inference without relying on approximation methods, a key advantage for models involving categorical data. The conjugacy also yields a posterior predictive distribution for future multinomial observations that follows the Dirichlet-multinomial compound distribution. Historically, this property builds on early Bayesian work by Thomas Bayes on the binomial case, with the Dirichlet providing the natural multivariate extension that underlies modern Bayesian methods for categorical data.
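
The update rule is a one-line computation; this sketch (hypothetical prior pseudocounts and observed counts, NumPy assumed) forms the posterior and draws from it.

```python
import numpy as np

alpha_prior = np.array([1.0, 1.0, 1.0])   # illustrative symmetric prior pseudocounts
counts = np.array([8, 3, 1])              # illustrative observed multinomial counts

alpha_post = alpha_prior + counts         # conjugate update: Dir(alpha + n)
print(alpha_post / alpha_post.sum())      # posterior mean of p

rng = np.random.default_rng(3)
print(rng.dirichlet(alpha_post, size=3))  # exact posterior draws for downstream inference
```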

Aggregation and neutrality

The aggregation property of the Dirichlet distribution states that if \mathbf{X} = (X_1, \dots, X_K) \sim \mathrm{Dir}(\boldsymbol{\alpha}) with \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K), and categories are grouped such that Y_j = \sum_{i \in G_j} X_i for disjoint groups G_1, \dots, G_M partitioning \{1, \dots, K\} with M < K, then \mathbf{Y} = (Y_1, \dots, Y_M) \sim \mathrm{Dir}(\boldsymbol{\beta}) with \beta_j = \sum_{i \in G_j} \alpha_i. This closure under summation preserves the distributional family when categories are combined, making it suitable for compositional data analysis.

The proof follows from direct integration of the Dirichlet probability density function (PDF). The joint PDF of \mathbf{X} is f(\mathbf{x}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^K x_i^{\alpha_i - 1} for \mathbf{x} in the simplex, where B(\boldsymbol{\alpha}) is the multivariate beta function. To obtain the density of \mathbf{Y}, integrate over the conditional allocations within each group: for fixed \mathbf{y}, the transformation yields f(\mathbf{y}) \propto \prod_{j=1}^M y_j^{\beta_j - 1} \prod_{j=1}^M \int \prod_{i \in G_j} u_i^{\alpha_i - 1} \, d\mathbf{u}_j, where each inner integral is a Dirichlet normalizing constant B(\boldsymbol{\alpha}_{G_j}), so the result simplifies to a lower-dimensional Dirichlet density after renormalization. Alternatively, using the gamma representation—where X_i = Z_i / \sum_k Z_k and Z_i \sim \mathrm{Gamma}(\alpha_i, 1) independently—the sum Y_j = (\sum_{i \in G_j} Z_i) / \sum_k Z_k follows a Dirichlet with summed parameters, because the sum of independent gammas with equal rates is gamma-distributed.

This property implies that the Dirichlet maintains compositionality under category aggregation, which is crucial in applications like topic models where topics can be hierarchically merged without altering the prior structure, as in latent Dirichlet allocation (LDA). In population genetics, it allows modeling allele frequency aggregations across loci while preserving the simplex constraint and prior beliefs on group proportions.

The Dirichlet distribution also exhibits neutrality, a characterizing independence property unique among distributions on the probability simplex. Specifically, complete neutrality means that if \mathbf{X} \sim \mathrm{Dir}(\boldsymbol{\alpha}), then for any k, X_k is independent of the renormalized remaining vector (X_j / (1 - X_k))_{j \neq k}, which itself follows \mathrm{Dir}(\{\alpha_j\}_{j \neq k}), allowing recursive decomposition without dependence. This extends to a regression version of neutrality, where conditional expectations align with the marginal structure, providing a moment-based characterization of the distribution. In hierarchical models, such as those with a global \boldsymbol{\theta} \sim \mathrm{Dir}(\boldsymbol{\alpha}) and local \mathbf{X} \mid \boldsymbol{\theta} \sim \mathrm{Multinomial}(n, \boldsymbol{\theta}), neutrality ensures the law of total variance decomposes as \mathrm{Var}(\mathbf{X}) = \mathbb{E}[\mathrm{Var}(\mathbf{X} \mid \boldsymbol{\theta})] + \mathrm{Var}(\mathbb{E}[\mathbf{X} \mid \boldsymbol{\theta}]) without introducing bias from prior dependencies, as the independent components facilitate unbiased partitioning of within- and between-level variances. This neutrality supports exchangeability in constructions like the Dirichlet process, enabling flexible modeling of clustered data in Bayesian nonparametrics.
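
Neutrality can likewise be probed by simulation; in this sketch (illustrative parameters, NumPy/SciPy assumed) the first component shows no detectable correlation with the renormalized remainder, whose first coordinate matches the implied beta marginal.

```python
import numpy as np
from scipy.stats import beta, kstest, pearsonr

rng = np.random.default_rng(4)
alpha = np.array([2.0, 3.0, 5.0])
x = rng.dirichlet(alpha, size=50_000)

# Renormalize the remaining components after removing X_1
rest = x[:, 1:] / (1.0 - x[:, :1])

r, p = pearsonr(x[:, 0], rest[:, 0])
print(r, p)   # correlation near 0, large p-value: consistent with independence

# The renormalized remainder should be Dir(3, 5), so its first coordinate is Beta(3, 5)
print(kstest(rest[:, 0], beta(3.0, 5.0).cdf).pvalue)
```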

Entropy and Kullback-Leibler divergence

The differential entropy of a Dirichlet distribution with parameters \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K) quantifies the expected uncertainty in the distribution over the simplex, computed as H(\mathrm{Dir}(\boldsymbol{\alpha})) = -\mathbb{E}_{\boldsymbol{X} \sim \mathrm{Dir}(\boldsymbol{\alpha})} [\log f(\boldsymbol{X}; \boldsymbol{\alpha})], where f(\boldsymbol{X}; \boldsymbol{\alpha}) is the probability density function. This expectation integrates over the simplex \Delta^{K-1} = \{\boldsymbol{x} \in [0,1]^K : \sum_{i=1}^K x_i = 1\}, yielding the closed-form expression H(\mathrm{Dir}(\boldsymbol{\alpha})) = \log B(\boldsymbol{\alpha}) + (\alpha_0 - K) \psi(\alpha_0) - \sum_{i=1}^K (\alpha_i - 1) \psi(\alpha_i), where \alpha_0 = \sum_{i=1}^K \alpha_i, B(\boldsymbol{\alpha}) is the multivariate beta function, and \psi(\cdot) denotes the digamma function. The derivation follows from substituting the density f(\boldsymbol{x}; \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^K x_i^{\alpha_i - 1} into the entropy integral, evaluating \mathbb{E}[\log f(\boldsymbol{X}; \boldsymbol{\alpha})] = -\log B(\boldsymbol{\alpha}) + \sum_{i=1}^K (\alpha_i - 1) [\psi(\alpha_i) - \psi(\alpha_0)], and negating to obtain H. For fixed relative shape ratios \alpha_i / \alpha_0, the entropy H(\mathrm{Dir}(\boldsymbol{\alpha})) decreases monotonically as \alpha_0 increases, reflecting reduced uncertainty as the distribution concentrates toward its mean.

The Kullback-Leibler (KL) divergence measures the information loss when one Dirichlet distribution is approximated by another, defined as D_{\mathrm{KL}}(\mathrm{Dir}(\boldsymbol{\alpha}) \parallel \mathrm{Dir}(\boldsymbol{\beta})) = \mathbb{E}_{\boldsymbol{X} \sim \mathrm{Dir}(\boldsymbol{\alpha})} \left[\log \frac{f(\boldsymbol{X}; \boldsymbol{\alpha})}{f(\boldsymbol{X}; \boldsymbol{\beta})}\right]. This asymmetric divergence admits a closed form involving gamma and digamma functions: D_{\mathrm{KL}}(\mathrm{Dir}(\boldsymbol{\alpha}) \parallel \mathrm{Dir}(\boldsymbol{\beta})) = \log \Gamma(\alpha_0) - \log \Gamma(\beta_0) - \sum_{i=1}^K \log \Gamma(\alpha_i) + \sum_{i=1}^K \log \Gamma(\beta_i) + \sum_{i=1}^K (\alpha_i - \beta_i) [\psi(\alpha_i) - \psi(\alpha_0)], where \beta_0 = \sum_{i=1}^K \beta_i. The derivation parallels the entropy computation, substituting the ratio of densities and evaluating expectations under \mathrm{Dir}(\boldsymbol{\alpha}). In Bayesian model selection, the KL divergence between a Dirichlet prior and posterior (or variational approximation) contributes to the evidence lower bound (ELBO) in variational inference, facilitating scalable posterior approximation for multinomial models.
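
Both closed forms translate directly into code; the sketch below (NumPy/SciPy assumed, illustrative parameters) checks the entropy against SciPy's implementation and verifies basic properties of the KL divergence.

```python
import numpy as np
from scipy.special import gammaln, digamma
from scipy.stats import dirichlet

def dir_entropy(alpha):
    """Differential entropy of Dir(alpha) from the closed form above."""
    alpha = np.asarray(alpha, dtype=float)
    a0, K = alpha.sum(), alpha.size
    log_B = gammaln(alpha).sum() - gammaln(a0)
    return log_B + (a0 - K) * digamma(a0) - np.sum((alpha - 1.0) * digamma(alpha))

def dir_kl(alpha, beta_):
    """KL( Dir(alpha) || Dir(beta_) ) from the closed form above."""
    alpha, beta_ = np.asarray(alpha, float), np.asarray(beta_, float)
    a0, b0 = alpha.sum(), beta_.sum()
    return (gammaln(a0) - gammaln(b0)
            - gammaln(alpha).sum() + gammaln(beta_).sum()
            + np.sum((alpha - beta_) * (digamma(alpha) - digamma(a0))))

a = np.array([2.0, 3.0, 5.0])
b = np.array([1.0, 1.0, 1.0])
print(dir_entropy(a), dirichlet(a).entropy())   # the two entropy values should agree
print(dir_kl(a, b), dir_kl(a, a))               # non-negative, and exactly 0 when alpha = beta
```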

Dirichlet-multinomial distribution

The Dirichlet-multinomial distribution arises as the marginal distribution of multinomial counts when the category probabilities are drawn from a Dirichlet prior, providing a flexible model for overdispersed categorical count data. This compound structure accounts for uncertainty in the probabilities, making it suitable for scenarios where observations exhibit more variability than expected under a standard multinomial assumption. The probability mass function for a vector of counts \mathbf{n} = (n_1, \dots, n_K) with \sum_{i=1}^K n_i = N and Dirichlet parameters \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K), where each \alpha_i > 0, is P(\mathbf{n} \mid \boldsymbol{\alpha}) = \frac{N!}{\prod_{i=1}^K n_i!} \frac{B(\boldsymbol{\alpha} + \mathbf{n})}{B(\boldsymbol{\alpha})}, with the multivariate beta function B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^K \Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^K \alpha_i)}. This form is derived by integrating the multinomial likelihood over the Dirichlet prior: P(\mathbf{n} \mid \boldsymbol{\alpha}) = \int_{\boldsymbol{p} \in \Delta^{K-1}} \frac{N!}{\prod_{i=1}^K n_i!} \prod_{i=1}^K p_i^{n_i} \cdot \frac{\Gamma(\alpha_0)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K p_i^{\alpha_i - 1} \, d\boldsymbol{p}, where \alpha_0 = \sum_{i=1}^K \alpha_i and \Delta^{K-1} is the (K-1)-dimensional simplex, yielding the ratio of beta functions after evaluating the integral. Compared to the multinomial distribution, the Dirichlet-multinomial displays over-dispersion, with greater variance in the counts due to the variability introduced by the prior on \boldsymbol{p}. The marginal variance for the i-th component is \mathrm{Var}(n_i) = N \mu_i (1 - \mu_i) \left( 1 + (N-1) \rho \right), where \mu_i = \alpha_i / \alpha_0 is the expected proportion and \rho = 1 / (\alpha_0 + 1) quantifies the overdispersion or extra variation; this exceeds the multinomial variance N \mu_i (1 - \mu_i) whenever \rho > 0. The Dirichlet-multinomial is closely related to the beta-binomial distribution, to which it reduces when K = 2, and it has been applied in Bayesian analyses of categorical data to model correlated proportions. In contemporary statistical modeling, the Dirichlet-multinomial plays a key role in topic modeling frameworks like latent Dirichlet allocation (LDA), where it describes the overdispersed distribution of word counts across topics within documents, with topic proportions drawn from a Dirichlet prior.
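
The probability mass function is conveniently evaluated on the log scale with gammaln; this sketch (illustrative parameters, NumPy/SciPy assumed) implements it and confirms that the pmf sums to one over all count vectors with a fixed total.

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_logpmf(n, alpha):
    """Log-pmf of the Dirichlet-multinomial for counts n and parameters alpha."""
    n, alpha = np.asarray(n, float), np.asarray(alpha, float)
    N = n.sum()
    log_coef = gammaln(N + 1.0) - gammaln(n + 1.0).sum()       # log of the multinomial coefficient
    log_B_post = gammaln(alpha + n).sum() - gammaln(alpha.sum() + N)
    log_B_prior = gammaln(alpha).sum() - gammaln(alpha.sum())
    return log_coef + log_B_post - log_B_prior

alpha = np.array([2.0, 3.0, 5.0])
N = 6
total = sum(np.exp(dirichlet_multinomial_logpmf([n1, n2, N - n1 - n2], alpha))
            for n1 in range(N + 1) for n2 in range(N + 1 - n1))
print(total)   # should be 1.0: the pmf sums to one over all count vectors with total N
```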

Beta distribution as marginal

The Dirichlet distribution serves as a multivariate generalization of the beta distribution, extending the modeling of proportions from two categories to an arbitrary number K \geq 2. Specifically, when K=2, the Dirichlet distribution with parameters \alpha_1 > 0 and \alpha_2 > 0 is equivalent to the beta distribution with the same parameters: \operatorname{Dir}(\alpha_1, \alpha_2) = \operatorname{Beta}(\alpha_1, \alpha_2). A key structural connection arises from the marginal distributions of the Dirichlet random vector (X_1, \dots, X_K), where each univariate marginal X_i follows a beta distribution: X_i \sim \operatorname{Beta}(\alpha_i, \alpha_0 - \alpha_i), with \alpha_0 = \sum_{j=1}^K \alpha_j. This property allows the Dirichlet to inherit the beta distribution's conjugacy with the binomial likelihood, thereby establishing the Dirichlet as the conjugate prior for the multinomial distribution in Bayesian inference for multicategory proportions. Several core properties of the beta distribution carry over directly to these marginals, including the mean \mathbb{E}[X_i] = \frac{\alpha_i}{\alpha_0}, variance \operatorname{Var}(X_i) = \frac{\alpha_i (\alpha_0 - \alpha_i)}{\alpha_0^2 (\alpha_0 + 1)}, and mode at \frac{\alpha_i - 1}{\alpha_0 - 2} when \alpha_i > 1 and \alpha_0 - \alpha_i > 1 (with adjustments at the boundaries otherwise). These inherited features underscore the Dirichlet's role in generalizing univariate proportion modeling to multivariate settings while preserving analytical tractability. Historically, the beta function integral, which normalizes the beta density, was introduced by Leonhard Euler in the 18th century as part of his work on the gamma function. The beta distribution itself emerged later in statistics, catalogued by Karl Pearson in 1895 as Type I within his system of frequency curves for modeling skewed data on bounded intervals like [0,1]. Johann Peter Gustav Lejeune Dirichlet extended this framework to multiple dimensions in 1838, evaluating the multivariate beta integral that underpins the Dirichlet distribution and enabling its application to joint proportion modeling. The beta distribution's established use in proportion modeling, such as for success probabilities in binary trials, predates the Dirichlet and provided the conceptual foundation for handling interdependent proportions across categories.

Generalizations and extensions

The generalized Dirichlet distribution extends the standard Dirichlet by allowing greater flexibility in modeling dependencies among components, building the vector from a sequence of independent beta distributions for successive conditional proportions, each with its own pair of shape parameters, rather than the single parameter per component of the standard form. This structure, introduced by Connor and Mosimann (1969), enables a wider range of covariance structures while preserving beta marginal distributions and certain aggregation properties. It has been applied in Bayesian analysis of proportions where asymmetric dependencies are expected.

The Dirichlet process serves as a key nonparametric generalization of the Dirichlet distribution, defined as the limiting case where the dimensionality K tends to infinity with a fixed total concentration parameter \alpha_0. Formally introduced by Ferguson (1973), it specifies a distribution over the space of probability measures, enabling nonparametric Bayesian inference without assuming a fixed number of categories or support points. The process is characterized by its stick-breaking construction or its Pólya urn (Chinese restaurant process) representation, which facilitates sampling and posterior computation. In machine learning applications since the 2000s, the Dirichlet process has underpinned infinite mixture models, particularly for topic modeling where the number of latent topics is unbounded. For instance, it allows automatic determination of the effective number of topics in text corpora through posterior inference, extending finite models like latent Dirichlet allocation.

The hierarchical Dirichlet process builds on this by stacking Dirichlet processes in a multilevel framework, suitable for grouped data where shared structure across levels is desired. Proposed by Teh et al. (2006), it models multiple related distributions—such as per-document topic proportions in a corpus—drawn from a common top-level process, promoting sparsity and sharing while adapting to group-specific variations. This extension supports applications in multilevel clustering and density estimation.

A scaled variant of the Dirichlet distribution, originating from Savage's work in the 1960s and elaborated by Dickey (1968), adjusts the standard form for Bayesian priors on proportions under constraints, such as ordered hypotheses or incomplete categories, by scaling the concentration parameters relative to a baseline component. This allows representation of partial allocations summing to less than 1, with applications in hypothesis testing via the Savage-Dickey density ratio.

In compositional data analysis, the Dirichlet distribution is often generalized through log-ratio transformations, such as the additive log-ratio y_i = \log(x_i / x_K) for i = 1, \dots, K-1, mapping the constrained simplex to an unconstrained Euclidean space. This approach, developed by Aitchison (1982), facilitates standard multivariate techniques while preserving the relative information in the proportions, with adjustments to account for the correlations induced by the original Dirichlet parameters.
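
As an illustration of the nonparametric extension, here is a minimal sketch of a truncated stick-breaking construction for a Dirichlet process (the truncation level, concentration value, and Gaussian base measure are all illustrative assumptions, not part of any particular library API).

```python
import numpy as np

def truncated_stick_breaking(concentration, num_sticks, rng):
    """Truncated stick-breaking weights approximating a draw from a Dirichlet process."""
    betas = rng.beta(1.0, concentration, size=num_sticks)
    # fraction of the unit stick remaining before each break
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

rng = np.random.default_rng(6)
weights = truncated_stick_breaking(concentration=2.0, num_sticks=50, rng=rng)
atoms = rng.normal(0.0, 1.0, size=50)    # atoms drawn from an illustrative base measure N(0, 1)
print(weights.sum())                     # close to 1 when enough sticks are used
```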

Applications and interpretations

Bayesian inference

The Dirichlet distribution serves as a conjugate prior for the parameters of a multinomial distribution, making it particularly useful in Bayesian inference for modeling categorical data such as contingency tables in social sciences or proportions in topic models. In contingency table analysis, the Dirichlet prior encodes uncertainty over cell probabilities, allowing for shrinkage toward uniformity or informed pseudocounts based on expert knowledge, which facilitates posterior inference on associations between variables. Similarly, in latent Dirichlet allocation (LDA) for topic modeling, the Dirichlet prior is applied to document-topic and topic-word distributions, enabling the discovery of latent structures in large text corpora by integrating over these proportions during inference.

Posterior sampling under a Dirichlet-multinomial model is straightforward in simple cases, where the posterior Dirichlet parameters are updated by adding observed counts to the prior alphas, yielding exact samples via standard methods such as normalizing independent gamma draws. For hierarchical models, such as those with shared priors across groups in multi-level categorical data, Markov chain Monte Carlo (MCMC) methods like Gibbs sampling are employed to approximate the joint posterior, accommodating dependencies that exceed the conjugate structure. Model criticism in Dirichlet-multinomial settings often relies on posterior predictive checks (PPCs), where simulated data from the posterior predictive distribution—generated by drawing parameters from the updated Dirichlet and then multinomial outcomes—are compared to observed data via test statistics like chi-squared discrepancies to detect systematic fit issues. This approach has been applied to validate topic models by assessing predictive fit on held-out documents.

Practical applications include Bayesian A/B testing for conversion proportions, where the Dirichlet prior aggregates prior experiments as pseudocounts, enabling probability statements on variant superiority without frequentist adjustments. In ecology and microbiome studies, Dirichlet-multinomial models infer abundance distributions from count data, accounting for overdispersion in microbiome or community surveys to estimate diversity metrics like Simpson's index. Since the 2000s, variational inference with mean-field approximations to the Dirichlet posterior has scaled Bayesian inference to massive datasets, as in LDA implementations that optimize evidence lower bounds for efficient topic discovery on millions of documents.

Intuitive parameter meanings

The parameters \alpha = (\alpha_1, \dots, \alpha_K) of the Dirichlet distribution can be intuitively understood as pseudo-counts that encode prior beliefs about the relative frequencies of K categories. Each \alpha_i > 0 represents the strength of prior belief, or the number of imaginary observations allocated to category i, influencing how much weight is given to that category relative to others. The sum \alpha_0 = \sum_{i=1}^K \alpha_i serves as the total strength or concentration parameter, determining the overall confidence in these prior allocations; a small \alpha_0 implies a weak or vague prior with high variability in the generated proportions, often favoring sparse outcomes where probability mass concentrates near the boundaries of the simplex (e.g., one category dominating), while a large \alpha_0 indicates a strong prior that tightly concentrates the distribution around the expected proportions \alpha_i / \alpha_0.

This pseudo-counts interpretation arises naturally from the connection to the gamma distribution, where samples from the Dirichlet can be generated by normalizing independent gamma random variables with shape parameters \alpha_i; for integer \alpha_i, each gamma variable corresponds to the waiting time for \alpha_i events in a Poisson process, directly analogous to counts of occurrences. However, the parameters need not be integers—they can be any positive real numbers—allowing for fractional strengths that generalize the counts analogy without requiring discreteness, though this flexibility means the \alpha_i do not always correspond exactly to whole-number observations.

A helpful non-mathematical analogy for the parameters is that of cutting a stick of unit length into a piece for each category. The expected length of the piece for category i is proportional to \alpha_i, reflecting the relative emphasis on that category, while \alpha_0 scales the overall precision or consistency of the cuts: low \alpha_0 results in rough, highly variable cuts that produce uneven or extreme lengths (sparse allocations), while high \alpha_0 yields precise cuts closely matching the expected proportions (allocations concentrated near the mean). In the symmetric case where all \alpha_i = 1 (so \alpha_0 = K), this simplifies to breaking the stick at K-1 uniformly random points and using the resulting spacings as the proportions, which evenly distributes the expected lengths but still allows variability.

Pólya urn model

The Pólya urn model, introduced by Eggenberger and Pólya in 1923 to describe contagious processes such as the spread of disease, offers a generative interpretation of the Dirichlet distribution through a reinforcement mechanism. In the multivariate setup, the urn begins with \alpha_i > 0 balls of each color i = 1, \dots, K, where the \alpha_i serve as initial counts reflecting the concentration parameters. At each step, a ball is drawn uniformly at random from the urn, its color is observed, and it is replaced along with one additional ball of the same color. This process, with reinforcement parameter a = 1, continues indefinitely. The model's dynamics embody a "rich-get-richer" principle, where colors already prevalent in the urn become increasingly likely to be selected in future draws, mimicking phenomena such as preferential attachment in networks or reinforcement dynamics in populations. As the number of draws n approaches infinity, the vector of proportions of balls of each color converges to a limiting random vector that follows the Dirichlet distribution with parameters (\alpha_1, \dots, \alpha_K). Moreover, the vector of color counts observed over n draws follows a Dirichlet-multinomial distribution, linking the finite draws to the underlying Dirichlet limit. Extensions of the model replace the single additional ball with a > 0 balls of the drawn color, yielding generalized Pólya urns. For a \neq 1, the limiting proportions still converge to a Dirichlet distribution, but with parameters scaled by 1/a, and the finite draw counts follow compound distributions distinct from the standard Dirichlet-multinomial, such as scaled beta-binomials in the bivariate case.
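
A direct simulation of the urn illustrates the convergence; this sketch (illustrative initial counts and run lengths, NumPy assumed) averages the limiting proportions over many independent runs, which should be close to \alpha_i / \alpha_0.

```python
import numpy as np

def polya_urn_proportions(alpha, num_draws, rng):
    """Simulate a Pólya urn starting with alpha_i balls of color i, adding one ball per draw."""
    counts = np.array(alpha, dtype=float)
    for _ in range(num_draws):
        color = rng.choice(len(counts), p=counts / counts.sum())
        counts[color] += 1.0              # reinforce the drawn color
    return counts / counts.sum()

rng = np.random.default_rng(7)
alpha = [2.0, 3.0, 5.0]
# each run approximates one draw from Dir(alpha); averaging runs recovers the mean
props = np.array([polya_urn_proportions(alpha, 1_000, rng) for _ in range(500)])
print(props.mean(axis=0))                 # roughly [0.2, 0.3, 0.5]
```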

Sampling methods

Using gamma variates

One standard algorithm for sampling from the Dirichlet distribution with parameters \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K), where each \alpha_i > 0, involves generating independent gamma-distributed random variables and normalizing by their sum. Specifically, draw Y_i \sim \mathrm{Gamma}(\alpha_i, 1) independently for i = 1, \dots, K, using the shape-rate parameterization where the density of \mathrm{Gamma}(\alpha, \beta) is f(y) = \frac{\beta^\alpha}{\Gamma(\alpha)} y^{\alpha-1} e^{-\beta y} for y > 0. Then compute X_i = Y_i / \sum_{j=1}^K Y_j for each i. The resulting vector \mathbf{X} = (X_1, \dots, X_K) follows \mathrm{Dir}(\boldsymbol{\alpha}).

The correctness of this procedure follows from the properties of the gamma distribution. The joint density of the independent Y_i is f(\mathbf{y}) = \prod_{i=1}^K \frac{1}{\Gamma(\alpha_i)} y_i^{\alpha_i - 1} e^{-y_i}, \quad y_i > 0. To obtain the distribution of \mathbf{X}, apply the transformation \mathbf{X} = \mathbf{Y} / S with S = \sum_j Y_j, noting that the Jacobian of this change of variables contributes a factor s^{K-1}. The marginal of S is \mathrm{Gamma}(\sum_i \alpha_i, 1), and conditioning on S = s yields a joint density for \mathbf{X} proportional to \prod_i x_i^{\alpha_i - 1} on the simplex, which matches the Dirichlet density after normalization by the constant \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)}. The choice of rate parameter 1 (or equivalently, scale 1 in the shape-scale parameterization) simplifies the exponential terms to e^{-y_i}; moreover, if instead Y_i \sim \mathrm{Gamma}(\alpha_i, \theta) for any common rate \theta > 0, the procedure still produces \mathbf{X} \sim \mathrm{Dir}(\boldsymbol{\alpha}), because the common scale cancels in the normalization.

This method is direct and rejection-free, requiring only K independent gamma samples, making it computationally efficient provided an accurate gamma sampler is available—such as those based on acceptance-rejection or inverse transform methods for the gamma distribution. It has been a cornerstone of non-uniform random variate generation for Dirichlet vectors since the 1970s, predating more specialized techniques and appearing in early works on multivariate simulation.
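
The procedure is a few lines of NumPy; the following sketch (illustrative parameters, helper name hypothetical) implements it and checks the empirical mean.

```python
import numpy as np

def sample_dirichlet_gamma(alpha, size, rng):
    """Sample Dir(alpha) by normalizing independent Gamma(alpha_i, 1) variates."""
    y = rng.gamma(shape=alpha, scale=1.0, size=(size, len(alpha)))
    return y / y.sum(axis=1, keepdims=True)

rng = np.random.default_rng(8)
alpha = np.array([2.0, 3.0, 5.0])
x = sample_dirichlet_gamma(alpha, size=100_000, rng=rng)
print(x.mean(axis=0))   # close to alpha / alpha.sum() = [0.2, 0.3, 0.5]
```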

Using beta variates

One alternative method for sampling from the Dirichlet distribution \mathrm{Dir}(\alpha_1, \dots, \alpha_K), where \alpha_0 = \sum_{i=1}^K \alpha_i, involves successive draws from univariate beta distributions, leveraging the marginal and conditional properties of the Dirichlet. This approach generates the components sequentially on the (K-1)-dimensional simplex, requiring only K-1 beta variates.

The algorithm proceeds as follows. Draw \phi_1 \sim \mathrm{Beta}(\alpha_1, \alpha_0 - \alpha_1) and set X_1 = \phi_1. Then, for i = 2, \dots, K-1, let r_{i-1} = 1 - \sum_{j=1}^{i-1} X_j denote the remaining mass, draw \phi_i \sim \mathrm{Beta}(\alpha_i, \sum_{j=i+1}^K \alpha_j), and set X_i = r_{i-1} \phi_i. Finally, set X_K = 1 - \sum_{i=1}^{K-1} X_i. In other words, the first component is a \mathrm{Beta}(\alpha_1, \sum_{j=2}^K \alpha_j) draw, the second is the remaining mass (1 - X_1) times a \mathrm{Beta}(\alpha_2, \sum_{j=3}^K \alpha_j) draw, and so on, with each subsequent conditional draw scaled by the mass not yet allocated.

This method derives from the known marginal and conditional structure of the Dirichlet distribution. Specifically, the marginal distribution of any single component X_i is \mathrm{Beta}(\alpha_i, \alpha_0 - \alpha_i). Furthermore, conditional on X_1 = x_1, the rescaled vector (X_2, \dots, X_K)/(1 - x_1) follows a Dirichlet distribution with parameters (\alpha_2, \dots, \alpha_K), so scaling by (1 - x_1) preserves the sum-to-one constraint on the simplex. Applying this recursively reduces the problem to univariate beta draws at each step, as the conditional Dirichlet for the remaining components has the same form. This sequential conditioning ensures the joint distribution matches the target Dirichlet.

The primary advantages of this beta-variate approach are its reliance solely on univariate beta samplers, which are straightforward to implement and useful when gamma variates are unavailable or costly to generate. It also constructs the vector incrementally, which can facilitate applications requiring partial samples or adaptive stopping. Computationally, it is efficient for moderate K, as only K-1 independent beta draws are needed, avoiding the final normalization step required by the gamma-based approach.
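
The sequential construction looks like this in code (a sketch with illustrative parameters; the helper name is hypothetical).

```python
import numpy as np

def sample_dirichlet_beta(alpha, rng):
    """Sample one Dir(alpha) vector using K-1 successive beta draws."""
    alpha = np.asarray(alpha, dtype=float)
    K = alpha.size
    x = np.empty(K)
    remaining = 1.0
    for i in range(K - 1):
        # phi_i ~ Beta(alpha_i, alpha_{i+1} + ... + alpha_K)
        phi = rng.beta(alpha[i], alpha[i + 1:].sum())
        x[i] = remaining * phi
        remaining -= x[i]
    x[K - 1] = remaining
    return x

rng = np.random.default_rng(9)
alpha = np.array([2.0, 3.0, 5.0])
samples = np.array([sample_dirichlet_beta(alpha, rng) for _ in range(50_000)])
print(samples.mean(axis=0))   # close to [0.2, 0.3, 0.5]
```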

Special parameter cases

When all parameters \alpha_i = 1 for i = 1, \dots, K, the Dirichlet distribution \mathrm{Dir}(1, \dots, 1) reduces to the uniform distribution over the (K-1)-dimensional simplex \{ \mathbf{x} \in \mathbb{R}^K_+ : \sum_{i=1}^K x_i = 1 \}. This special case arises equivalently from normalizing K independent exponential random variables with rate 1 by their sum, or directly as the distribution of the spacings induced by K-1 independent uniform breakpoints on [0,1].

An efficient sampling algorithm for this uniform case avoids general-purpose gamma samplers by exploiting the order statistics property. Generate K-1 independent uniform random variables U_1, \dots, U_{K-1} on [0,1], sort them to obtain the order statistics U_{(1)} < U_{(2)} < \dots < U_{(K-1)}, and define the components as X_1 = U_{(1)}, X_i = U_{(i)} - U_{(i-1)} for i = 2, \dots, K-1, and X_K = 1 - U_{(K-1)}. The vector \mathbf{X} = (X_1, \dots, X_K) then follows \mathrm{Dir}(1, \dots, 1). This method has O(K \log K) complexity due to the sorting step and is convenient in high dimensions, since it relies only on fast uniform generators rather than general-purpose Dirichlet or gamma samplers.

Another notable special case occurs when all \alpha_i = 1/2. Here, the Dirichlet distribution \mathrm{Dir}(1/2, \dots, 1/2) corresponds to the distribution of the squared coordinates of a point distributed uniformly on the unit (K-1)-sphere (equivalently, on its positive orthant). To sample from it, draw K independent standard normal random variables Z_1, \dots, Z_K \sim \mathcal{N}(0,1), square them, and normalize: X_i = Z_i^2 / \sum_{j=1}^K Z_j^2 for i = 1, \dots, K. This representation stems from the fact that each marginal follows a \mathrm{Beta}(1/2, (K-1)/2) distribution, with the \mathrm{Beta}(1/2, 1/2) case (for K=2) being the arcsine distribution, and the joint structure arising from the chi-squared properties of squared standard normals, where \sum_j Z_j^2 \sim \chi^2_K.

These parameter-specific geometric constructions—order statistics for the uniform case and squared normals for the half-parameter case—offer computational efficiency in simulations, as they leverage fast uniform and normal generators while avoiding the expense of general Dirichlet samplers for large K.
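
Both constructions are easy to vectorize; this sketch (illustrative dimension and sample size, NumPy assumed) draws from the uniform case via sorted uniforms and from the half-parameter case via normalized squared normals.

```python
import numpy as np

rng = np.random.default_rng(10)
K, n = 5, 100_000

# Dir(1, ..., 1): spacings between sorted uniforms (with 0 and 1 appended as endpoints)
u = np.sort(rng.uniform(size=(n, K - 1)), axis=1)
edges = np.concatenate([np.zeros((n, 1)), u, np.ones((n, 1))], axis=1)
x_uniform = np.diff(edges, axis=1)

# Dir(1/2, ..., 1/2): normalized squares of K standard normals
z = rng.standard_normal((n, K))
x_half = z**2 / (z**2).sum(axis=1, keepdims=True)

print(x_uniform.mean(axis=0))   # each component has mean 1/K
print(x_half.mean(axis=0))      # also mean 1/K, but with more mass near the simplex boundary
```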
