
Wishart distribution

The Wishart distribution is a multivariate probability distribution defined on the space of symmetric positive semi-definite p \times p matrices, serving as the natural generalization of the chi-squared distribution to the multivariate setting. It arises as the sampling distribution of the sum of outer products of independent multivariate normal random vectors, or equivalently, as the distribution of n times the sample covariance matrix obtained from n independent observations from a p-dimensional normal distribution with mean zero and covariance matrix \Sigma. The distribution is parameterized by the degrees of freedom n > p-1 (an integer equal to the sample size in the sampling interpretation, though non-integer values are also admissible) and the p \times p positive definite scale matrix \Sigma, with probability density function f(W) = \frac{|\Sigma|^{-n/2} |W|^{(n-p-1)/2} \exp\left(-\frac{1}{2} \operatorname{tr}(\Sigma^{-1} W)\right)}{2^{np/2} \Gamma_p(n/2)} for W symmetric positive definite, where \Gamma_p denotes the multivariate gamma function.

Named after the Scottish statistician John Wishart, who first derived it in 1928 while working under R. A. Fisher at Rothamsted Experimental Station, the distribution was introduced in his seminal paper on the generalized product-moment distribution for samples from a normal multivariate population. Wishart's work built on earlier univariate results, extending them to handle correlations among multiple variables, which was crucial for the agricultural and biometric applications prevalent at the time.

Key properties include the expected value \mathbb{E}[W] = n \Sigma, reflecting that the distribution scales with the degrees of freedom, and a mode at (n-p-1) \Sigma for n > p+1. The distribution is closed under convolution: the sum of independent Wishart matrices with the same scale parameter follows another Wishart distribution with added degrees of freedom. If W \sim \mathcal{W}_p(n, \Sigma), then W^{-1} follows an inverse-Wishart distribution, which is conjugate to the multivariate normal likelihood for Bayesian inference on covariance matrices. In applications, the Wishart distribution is fundamental in multivariate statistical analysis, particularly for hypothesis testing on covariance structures, such as in Hotelling's T-squared test or in deriving distributions of multivariate variance ratios. It also appears in random matrix theory as the Wishart ensemble, modeling the eigenvalues of sample covariance matrices in high-dimensional data, with implications for principal component analysis and signal processing. More broadly, it underpins Bayesian models for covariance and precision estimation in graphical models, spatial statistics, and machine learning, where prior distributions on precision matrices are often Wishart.

Fundamentals

Definition

The Wishart distribution, denoted W_p(n, \Sigma), is the probability distribution of the p \times p random matrix S = \sum_{i=1}^n X_i X_i^T, where X_1, \dots, X_n are independent p-dimensional multivariate normal random vectors, each distributed as \mathcal{N}_p(0, \Sigma). The random matrix S takes values in the space of p \times p positive semi-definite matrices. The distribution is parameterized by the degrees of freedom n, a positive integer that represents the number of observations and satisfies n > p - 1, and the scale matrix \Sigma, a p \times p positive definite matrix. It is named after the statistician John Wishart, who introduced the distribution in 1928 to model the sample covariance matrix of multivariate normal data. In the univariate case where p = 1, the Wishart distribution reduces to a scaled chi-squared distribution.
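
This defining construction can be checked by simulation. The following minimal sketch (assuming NumPy and SciPy are available; all variable names are illustrative) builds S as a sum of outer products and compares its sampling mean with n \Sigma, alongside SciPy's built-in sampler.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, n, reps = 3, 10, 20000

# An arbitrary positive definite scale matrix Sigma (illustrative choice).
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)

# Defining construction: S = sum_i X_i X_i^T with X_i ~ N_p(0, Sigma).
X = rng.multivariate_normal(np.zeros(p), Sigma, size=(reps, n))
S = np.einsum('rij,rik->rjk', X, X)            # reps draws, each p x p

# The sampling mean of S should be close to n * Sigma.
print(np.abs(S.mean(axis=0) - n * Sigma).max())

# scipy's built-in sampler targets the same distribution.
S2 = stats.wishart(df=n, scale=Sigma).rvs(size=reps, random_state=rng)
print(np.abs(S2.mean(axis=0) - n * Sigma).max())
```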

Occurrence and motivation

The Wishart distribution arises in the context of sampling from the multivariate normal distribution as the distribution of \sum_{i=1}^n X_i X_i^T, where the X_i are i.i.d. from \mathcal{N}_p(0, \Sigma) (known mean zero). This was first established by John Wishart in his seminal 1928 paper, in which he derived the generalized product-moment distribution for samples from a multivariate normal population, laying the foundational motivation for studying covariance structures in higher dimensions. Specifically, if X_1, \dots, X_n are i.i.d. from \mathcal{N}_p(0, \Sigma), then the matrix S = \sum_{i=1}^n X_i X_i^\top follows a Wishart distribution with n degrees of freedom and scale matrix \Sigma. When the mean is unknown, (n-1) times the unbiased sample covariance matrix follows a Wishart distribution with n-1 degrees of freedom and scale matrix \Sigma. The distribution serves as a multivariate generalization of the chi-squared distribution, extending univariate variance inference to the full covariance matrix in settings where observations exhibit correlated variability. Its motivation stems from the need to model and test properties of covariance matrices, such as in hypothesis testing for equality of covariances across groups or in assessing dependence structures.

Probability Density Function

General form

The probability density function of the Wishart distribution W_p(n, \Sigma), where S is a p \times p positive definite random matrix, n is the degrees of freedom, and \Sigma is the p \times p positive definite scale matrix, is given by f(S \mid n, \Sigma) = \frac{|S|^{(n-p-1)/2} \exp\left(-\frac{1}{2} \operatorname{tr}(\Sigma^{-1} S)\right)}{2^{np/2} |\Sigma|^{n/2} \Gamma_p(n/2)}, for S > 0, and zero otherwise. This formula, originally derived by Wishart, provides the explicit density in matrix form, generalizing the chi-squared distribution to the multivariate case. The components of the density highlight its multivariate structure: the term |S|^{(n-p-1)/2} involves the determinant of S and reflects the volume scaling in p dimensions; the exponential factor \exp\left(-\frac{1}{2} \operatorname{tr}(\Sigma^{-1} S)\right) incorporates the trace of the matrix \Sigma^{-1} S, measuring a Mahalanobis-like distance from the origin weighted by \Sigma; and the normalization constant includes the multivariate gamma function \Gamma_p(a) = \pi^{p(p-1)/4} \prod_{i=1}^p \Gamma\left(a - \frac{i-1}{2}\right), which ensures integrability over the space of positive definite matrices and generalizes the univariate gamma function to account for the dimensionality p. For the distribution to be concentrated on positive definite matrices with probability 1, the degrees of freedom must satisfy n \geq p. The distribution arises from the sum of outer products of n p-dimensional random vectors X_i \sim \mathcal{N}_p(0, \Sigma), where S = \sum_{i=1}^n X_i X_i^\top; the derivation proceeds by transforming the joint density of the vectorized X_i into the density of S via the Jacobian of the quadratic transformation, yielding the above form after integration over the appropriate manifolds.
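
As a sanity check on the normalization, the closed-form log-density above can be evaluated with the multivariate gamma function and compared against scipy.stats.wishart.logpdf. This is a sketch under the stated parameterization, with illustrative parameter values.

```python
import numpy as np
from scipy import stats
from scipy.special import multigammaln

def wishart_logpdf(S, n, Sigma):
    """Log of f(S | n, Sigma) as given above, valid for S positive definite."""
    p = Sigma.shape[0]
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_Sig = np.linalg.slogdet(Sigma)
    tr_term = np.trace(np.linalg.solve(Sigma, S))     # tr(Sigma^{-1} S)
    return (0.5 * (n - p - 1) * logdet_S
            - 0.5 * tr_term
            - 0.5 * n * p * np.log(2)
            - 0.5 * n * logdet_Sig
            - multigammaln(0.5 * n, p))

p, n = 3, 7.5                      # non-integer n > p - 1 is allowed for the density
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
S = stats.wishart(df=n, scale=Sigma).rvs(random_state=1)

print(wishart_logpdf(S, n, Sigma))
print(stats.wishart(df=n, scale=Sigma).logpdf(S))     # should agree
```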

Spectral decomposition

The spectral decomposition of a positive definite matrix S \sim W_p(n, \Sigma) expresses S = U \Lambda U^T, where U is an orthogonal matrix and \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p) with \lambda_i > 0. This representation facilitates rewriting the probability density function (PDF) of the Wishart distribution in coordinates involving the eigenvalues. The Lebesgue measure on the space of symmetric matrices transforms under this decomposition as dS = c_p \prod_{1 \leq i < j \leq p} |\lambda_i - \lambda_j| \, \prod_{i=1}^p d\lambda_i \, d\mu(U), where d\mu(U) denotes the Haar measure on the orthogonal group O(p) and c_p is a normalizing constant. Consequently, the joint density g(\Lambda, U) of (\Lambda, U) with respect to the product measure \prod d\lambda_i \, d\mu(U) is given by g(\Lambda, U) = c_p \, f(U \Lambda U^T) \prod_{i < j} |\lambda_i - \lambda_j|, where f is the Wishart PDF. Substituting the standard form of f(S), f(S) = c \, |S|^{(n-p-1)/2} \exp\left( -\frac{1}{2} \operatorname{tr}(\Sigma^{-1} S) \right) with normalizing constant c = 2^{-np/2} |\Sigma|^{-n/2} / \Gamma_p(n/2), yields g(\Lambda, U) = c \, c_p \, \det(\Lambda)^{(n-p-1)/2} \exp\left( -\frac{1}{2} \operatorname{tr}(\Sigma^{-1} U \Lambda U^T) \right) \prod_{i < j} |\lambda_i - \lambda_j|. Each \lambda_i thus appears in a chi-squared-like factor \lambda_i^{(n-p-1)/2}, adjusted by the Vandermonde determinant \prod_{i < j} |\lambda_i - \lambda_j| and by the trace term coupling the components through U. When \Sigma = I_p, the trace simplifies to \operatorname{tr}(\Lambda) = \sum \lambda_i, rendering the density independent of U. The marginal joint density of the ordered eigenvalues 0 < \lambda_p \leq \dots \leq \lambda_1 < \infty (with respect to Lebesgue measure on this region) is then h(\lambda_1, \dots, \lambda_p) = \tilde{c} \, \prod_{i=1}^p \lambda_i^{(n-p-1)/2} \exp\left( -\frac{1}{2} \sum_{i=1}^p \lambda_i \right) \prod_{1 \leq i < j \leq p} |\lambda_i - \lambda_j|, where \tilde{c} is the appropriate normalizing constant ensuring integration to 1 over the ordered Weyl chamber. This eigenvalue density integrates out the uniform distribution over U and highlights the repulsive interaction among eigenvalues induced by the Vandermonde term. This spectral form is instrumental in random matrix theory for analyzing eigenvalue statistics of high-dimensional Wishart matrices. In the asymptotic regime where p, n \to \infty with p/n \to \gamma \in (0,1), the empirical spectral distribution of the eigenvalues of the normalized matrix S/n converges weakly to the Marchenko-Pastur law, a deterministic density supported on [(1 - \sqrt{\gamma})^2, (1 + \sqrt{\gamma})^2] given by \rho(x) = \frac{1}{2\pi \gamma x} \sqrt{(x - a)(b - x)}, \quad a = (1 - \sqrt{\gamma})^2, \; b = (1 + \sqrt{\gamma})^2. Such limits underpin applications in principal component analysis and signal processing. For spiked Wishart models, where \Sigma features a few large eigenvalues amid smaller ones, developments in the 2020s explore outlier eigenvalue detection and phase transitions in the spectrum, extending classical results to structured covariances.
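
The Marchenko-Pastur limit can be illustrated numerically. The sketch below (NumPy only, illustrative sizes) compares the support edges and the first two moments of the empirical eigenvalue distribution of S/n with those of the limiting law, which are 1 and 1 + \gamma.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 300, 900                        # gamma = p / n = 1/3
gamma = p / n

# Eigenvalues of S / n for one large Wishart matrix W_p(n, I_p).
X = rng.normal(size=(n, p))
evals = np.linalg.eigvalsh(X.T @ X / n)

# Marchenko-Pastur support edges and first two moments of the limiting law.
a, b = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
print(evals.min(), a, evals.max(), b)      # extremes close to the support edges
print(evals.mean(), 1.0)                   # first moment of the MP law is 1
print((evals ** 2).mean(), 1.0 + gamma)    # second moment of the MP law is 1 + gamma
```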

Moments and Expectations

Expectation and variance

The expected value of a p \times p random matrix S following the Wishart distribution W_p(n, \Sigma), where n is the degrees of freedom and \Sigma is a positive definite scale matrix, is given by \mathbb{E}[S] = n \Sigma. This result follows directly from the distributional definition: if S = \sum_{i=1}^n X_i X_i^\top with X_i \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}_p(0, \Sigma), then \mathbb{E}[S] = n \mathbb{E}[X_1 X_1^\top] = n \Sigma, leveraging the linearity of expectation and the fact that \mathbb{E}[X_1 X_1^\top] = \Sigma. The second moments of S determine its variance-covariance structure. The covariance between elements is \operatorname{Cov}(S_{ij}, S_{kl}) = n (\sigma_{ik} \sigma_{jl} + \sigma_{il} \sigma_{jk}), where \sigma_{ab} denotes the (a,b)-element of \Sigma. This arises from the fourth-moment (Isserlis) identity for zero-mean normal vectors, \operatorname{Cov}(X_{a} X_{b}, X_{c} X_{d}) = \sigma_{ac} \sigma_{bd} + \sigma_{ad} \sigma_{bc}, summed over the n independent terms. In vectorized form, \operatorname{Var}(\operatorname{vec}(S)) = n (I_{p^2} + K_{pp}) (\Sigma \otimes \Sigma), where K_{pp} is the p^2 \times p^2 commutation matrix that swaps indices in the Kronecker product. For the variances of individual elements, the diagonal entries satisfy \operatorname{Var}(S_{ii}) = 2n \sigma_{ii}^2, while the off-diagonal entries satisfy \operatorname{Var}(S_{ij}) = n (\sigma_{ij}^2 + \sigma_{ii} \sigma_{jj}) for i \neq j. These follow by specializing the general covariance formula to the case (i,j) = (k,l). The uncentered second moment is \mathbb{E}[S^2] = n(n+1) \Sigma^2 + n (\operatorname{tr}(\Sigma)) \Sigma, from which the centered second moment \mathbb{E}[S^2] - (\mathbb{E}[S])^2 = n \Sigma^2 + n (\operatorname{tr}(\Sigma)) \Sigma is obtained, though element-wise expressions are more commonly used in applications. Higher-order uncentered moments exist in closed form via recursive or invariant polynomial representations but are typically derived only for specific purposes beyond the first two.
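
A quick Monte Carlo check of the first- and second-moment formulas, using SciPy's Wishart sampler (illustrative parameter values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, n, reps = 2, 8, 200000
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])

S = stats.wishart(df=n, scale=Sigma).rvs(size=reps, random_state=rng)

# First moment: E[S] = n * Sigma.
print(S.mean(axis=0), n * Sigma)

# Second moments: Cov(S_ij, S_kl) = n (sigma_ik sigma_jl + sigma_il sigma_jk).
def cov_formula(i, j, k, l):
    return n * (Sigma[i, k] * Sigma[j, l] + Sigma[i, l] * Sigma[j, k])

print(np.cov(S[:, 0, 0], S[:, 1, 1])[0, 1], cov_formula(0, 0, 1, 1))  # 2 n sigma_01^2
print(S[:, 0, 1].var(), cov_formula(0, 1, 0, 1))                      # n (sigma_01^2 + sigma_00 sigma_11)
```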

Log-expectation and log-variance

The expectation of the logarithm of the determinant of a p \times p Wishart-distributed matrix S \sim W_p(n, \Sigma), where n > p-1 denotes the degrees of freedom and \Sigma is the positive definite scale matrix, is \mathbb{E}[\log \det S] = \log \det \Sigma + \sum_{i=1}^p \psi\left( \frac{n + 1 - i}{2} \right) + p \log 2, with \psi(\cdot) denoting the digamma function. This result follows from the Bartlett decomposition of the Wishart matrix into independent chi-squared variates and moment properties of the gamma distribution. The corresponding variance is \mathrm{Var}(\log \det S) = \sum_{i=1}^p \psi'\left( \frac{n + 1 - i}{2} \right), where \psi'(\cdot) is the trigamma function. These moments capture the scale-invariant behavior of the determinant under Wishart variability. In the scalar case (p = 1), the Wishart distribution W_1(n, \Sigma) coincides with a gamma distribution, specifically S \sim \Gamma(n/2, 2\Sigma) in the shape-scale parameterization, yielding the exact expectation \mathbb{E}[\log S] = \psi(n/2) + \log(2\Sigma) and variance \mathrm{Var}(\log S) = \psi'(n/2), which aligns with the general formula. For the logarithms of individual elements of the full matrix, exact expressions are more involved due to dependence, but the scalar result provides insight into the marginal behavior of diagonal entries. These logarithmic moments are essential in Bayesian analysis, particularly for approximating the model evidence in models with Wishart priors on precision or covariance matrices, such as Gaussian graphical models and related latent-variable models, where they facilitate variational bounds on the log-evidence.
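
The log-determinant moments can be verified against simulation; the sketch below uses scipy.special.digamma and polygamma for \psi and \psi' (illustrative values).

```python
import numpy as np
from scipy import stats
from scipy.special import digamma, polygamma

rng = np.random.default_rng(0)
p, n, reps = 3, 9, 100000
Sigma = np.diag([1.0, 2.0, 0.5])

S = stats.wishart(df=n, scale=Sigma).rvs(size=reps, random_state=rng)
logdet = np.linalg.slogdet(S)[1]

i = np.arange(1, p + 1)
mean_formula = np.linalg.slogdet(Sigma)[1] + digamma((n + 1 - i) / 2).sum() + p * np.log(2)
var_formula = polygamma(1, (n + 1 - i) / 2).sum()     # polygamma(1, .) is the trigamma function

print(logdet.mean(), mean_formula)
print(logdet.var(), var_formula)
```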

Information Measures

Entropy

The differential entropy of a random matrix \mathbf{W} \sim \mathcal{W}_p(n, \boldsymbol{\Sigma}) measures the average uncertainty in its realizations over the space of positive definite matrices. It is given by the formula H(\mathbf{W}) = \log \Gamma_p\left(\frac{n}{2}\right) + \frac{np}{2} + \frac{p+1}{2} \log \left| 2 \boldsymbol{\Sigma} \right| - \frac{n - p - 1}{2} \sum_{i=1}^p \psi\left( \frac{n - p + i}{2} \right), where \Gamma_p(\cdot) denotes the multivariate gamma function, \psi(\cdot) is the digamma function, and the logarithm is the natural logarithm (measured in nats). This expression is derived by evaluating the definition of differential entropy, H(\mathbf{W}) = -\int f(\mathbf{W}) \log f(\mathbf{W}) \, d\mathbf{W}, where f(\mathbf{W}) is the probability density function of the Wishart distribution. The integral simplifies using known expectations: the trace term \mathbb{E}[\operatorname{tr}(\boldsymbol{\Sigma}^{-1} \mathbf{W})] = np, and the log-determinant term \mathbb{E}[\log |\mathbf{W}|] = \log |\boldsymbol{\Sigma}| + \sum_{i=1}^p \psi\left( \frac{n + 1 - i}{2}\right) + p \log 2, which rely on properties of the gamma and digamma functions from the normalizing constant and moments of the distribution. In applications, such as Bayesian inference for covariance matrices, this entropy quantifies the dispersion in possible estimates of \boldsymbol{\Sigma} from normal samples, with larger values indicating higher uncertainty. The entropy increases with the dimensionality p and, for fixed p, grows only logarithmically in n, reflecting that the spread of the distribution grows slowly relative to its mean n\boldsymbol{\Sigma} in large-sample covariance estimation.
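
A sketch evaluating the entropy formula and cross-checking it with a Monte Carlo estimate of -E[\log f(\mathbf{W})] via scipy.stats.wishart.logpdf (illustrative values; the loop keeps the example simple rather than fast):

```python
import numpy as np
from scipy import stats
from scipy.special import multigammaln, digamma

def wishart_entropy(n, Sigma):
    """Differential entropy (nats) from the closed form above."""
    p = Sigma.shape[0]
    i = np.arange(1, p + 1)
    logdet_2Sigma = np.linalg.slogdet(2 * Sigma)[1]
    return (multigammaln(n / 2, p) + n * p / 2
            + (p + 1) / 2 * logdet_2Sigma
            - (n - p - 1) / 2 * digamma((n - p + i) / 2).sum())

p, n = 3, 10
Sigma = np.array([[1.5, 0.2, 0.0],
                  [0.2, 1.0, 0.3],
                  [0.0, 0.3, 2.0]])
dist = stats.wishart(df=n, scale=Sigma)

# Monte Carlo estimate of H = -E[log f(W)] for comparison.
samples = dist.rvs(size=10000, random_state=0)
mc = -np.mean([dist.logpdf(W) for W in samples])

print(wishart_entropy(n, Sigma), mc)
```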

Cross-entropy and KL-divergence

The cross-entropy between two Wishart distributions, denoted H(W_p(\nu, \Sigma_1) \| W_p(\nu, \Sigma_2)), is defined as H(W_1 \| W_2) = -\int f_{W_1}(S) \log f_{W_2}(S) \, dS, where f_{W_i} is the probability density function of the i-th distribution. Substituting the Wishart density yields an expression involving the expected trace \mathbb{E}_{W_1}[\operatorname{tr}(\Sigma_2^{-1} S)] = \nu \operatorname{tr}(\Sigma_2^{-1} \Sigma_1), the expected log-determinant \mathbb{E}_{W_1}[\log |S|] = \log |\Sigma_1| + \sum_{i=1}^p \psi\left(\frac{\nu + 1 - i}{2}\right) + p \log 2, and normalizing constants that include multivariate gamma functions \Gamma_p(\nu/2) and powers of 2. This results in H(W_1 \| W_2) = -\frac{\nu - p - 1}{2} \mathbb{E}_{W_1}[\log |S|] + \frac{\nu}{2} \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) + \frac{\nu p}{2} \log 2 + \log \Gamma_p\left(\frac{\nu}{2}\right) + \frac{\nu}{2} \log |\Sigma_2|, assuming identical degrees of freedom \nu > p - 1. The Kullback-Leibler divergence between two Wishart distributions with the same degrees of freedom, D_{\mathrm{KL}}(W_p(\nu, \Sigma_1) \| W_p(\nu, \Sigma_2)), simplifies to a closed-form expression: D_{\mathrm{KL}}(W_p(\nu, \Sigma_1) \| W_p(\nu, \Sigma_2)) = \frac{\nu}{2} \left[ \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) - p + \log \frac{|\Sigma_2|}{|\Sigma_1|} \right]. This formula arises from the difference in log-densities, leveraging the linearity of expectation for the trace term and properties of the log-determinant under the Wishart measure; the multivariate gamma terms cancel when \nu is identical. For differing degrees of freedom \nu_1 and \nu_2, the expression acquires additional terms involving the multivariate digamma function (a term proportional to (\nu_1 - \nu_2) \psi_p(\nu_1/2)) and the ratio of multivariate gamma functions \log \left[ \Gamma_p(\nu_2/2) / \Gamma_p(\nu_1/2) \right]. Special cases of the KL divergence arise in Bayesian contexts, such as divergences between Wishart priors and posteriors on precision matrices, or between the corresponding inverse-Wishart distributions on covariance matrices, often used to merge components in Gaussian-inverse-Wishart mixtures by minimizing the divergence to a single component. Similarly, the KL divergence facilitates the construction of conjugate priors linking Wishart-distributed precision matrices and multivariate normal likelihoods in conjugate settings, where the Wishart models the sum of outer products from normal vectors. Recent applications of these measures appear in variational Bayes methods for approximate posterior inference in models with Wishart priors on precision matrices, where the KL term enters the evidence lower bound for scalable covariance estimation.
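
The closed-form KL divergence for equal degrees of freedom can be checked against a Monte Carlo estimate of E_1[\log f_1(S) - \log f_2(S)]; a sketch with illustrative values:

```python
import numpy as np
from scipy import stats

def wishart_kl_same_df(nu, Sigma1, Sigma2):
    """KL( W_p(nu, Sigma1) || W_p(nu, Sigma2) ) for identical degrees of freedom."""
    p = Sigma1.shape[0]
    M = np.linalg.solve(Sigma2, Sigma1)            # Sigma2^{-1} Sigma1
    _, logdet1 = np.linalg.slogdet(Sigma1)
    _, logdet2 = np.linalg.slogdet(Sigma2)
    return 0.5 * nu * (np.trace(M) - p + logdet2 - logdet1)

nu, p = 8, 2
Sigma1 = np.array([[1.0, 0.3], [0.3, 1.5]])
Sigma2 = np.array([[2.0, -0.2], [-0.2, 1.0]])

d1 = stats.wishart(df=nu, scale=Sigma1)
d2 = stats.wishart(df=nu, scale=Sigma2)

# Monte Carlo estimate of E_1[log f1(S) - log f2(S)] for comparison.
rng = np.random.default_rng(0)
samples = d1.rvs(size=10000, random_state=rng)
mc = np.mean([d1.logpdf(S) - d2.logpdf(S) for S in samples])

print(wishart_kl_same_df(nu, Sigma1, Sigma2), mc)
```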

Characteristic Function and Theorems

Characteristic function

The characteristic function of a random matrix S \sim W_p(n, \Sigma), where W_p(n, \Sigma) denotes the Wishart distribution with p \times p scale matrix \Sigma > 0 and n degrees of freedom, is defined as
\phi_S(T) = \mathbb{E} \left[ \exp \left( i \operatorname{tr}(T S) \right) \right] = \left| I_p - 2 i \Sigma T \right|^{-n/2},
for symmetric p \times p matrices T, using the principal branch of the power; the matrix I_p - 2 i \Sigma T is invertible for every real symmetric T, since the eigenvalues of \Sigma T are real and hence those of I_p - 2 i \Sigma T have real part 1. This expression characterizes the distribution even for degrees of freedom for which no density exists on the positive definite cone (for example, integer values n \leq p-1, which yield singular Wishart matrices), providing a complete characterization via the uniqueness of characteristic functions.
The formula arises from the defining representation S = \sum_{k=1}^n X_k X_k^\top, where the X_k are i.i.d. N_p(0, \Sigma). The characteristic function of each outer product X_k X_k^\top follows from the characteristic function of a quadratic form in a multivariate normal vector, \mathbb{E}[\exp(i \operatorname{tr}(T X X^\top))] = \left| I_p - 2 i \Sigma T \right|^{-1/2}, and independence of the X_k yields the product form raised to the power n. This characteristic function facilitates moment generation by successive differentiation with respect to the elements of T at T = 0, yielding expressions for means, variances, and higher cumulants of the Wishart random matrix. It also underpins proofs of key properties, such as the closure under convolution for independent Wishart matrices sharing the same scale matrix.
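
The closed form can be verified numerically for a small symmetric T by comparing it with a Monte Carlo estimate of E[\exp(i \operatorname{tr}(T S))]; a sketch with illustrative values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, n, reps = 2, 6, 200000
Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])
T = np.array([[0.10, 0.05], [0.05, -0.08]])     # an arbitrary small symmetric matrix

# Monte Carlo estimate of E[exp(i tr(T S))].
S = stats.wishart(df=n, scale=Sigma).rvs(size=reps, random_state=rng)
phi_mc = np.mean(np.exp(1j * np.trace(T @ S, axis1=1, axis2=2)))

# Closed form |I - 2 i Sigma T|^{-n/2}.
phi_exact = np.linalg.det(np.eye(p) - 2j * Sigma @ T) ** (-n / 2)

print(phi_mc, phi_exact)
```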

Independence theorem

The independence theorem for the Wishart distribution provides a fundamental decomposition of its structure, particularly when the scale matrix is the identity. Consider a random matrix S \sim W_p(n, I_p), where p is the dimension and n > p - 1 is the degrees of freedom. The theorem states that the vector of diagonal elements (S_{11}, S_{22}, \dots, S_{pp}) is independent of the vector of normalized off-diagonal elements (S_{ij} / \sqrt{S_{ii} S_{jj}} \mid 1 \leq i < j \leq p). Marginally, each diagonal element S_{ii} follows a chi-squared distribution with n degrees of freedom, S_{ii} \sim \chi^2_n; under the identity scale the diagonal elements are in fact mutually independent, since they are sums of squares of disjoint sets of independent standard normal components. The normalized off-diagonal elements, which correspond to elements of the sample correlation matrix derived from the Wishart, follow distributions related to the beta family; for instance, in the bivariate case (p=2), the squared correlation r^2 = (S_{12} / \sqrt{S_{11} S_{22}})^2 under zero true correlation follows a Beta(1/2, (n-1)/2) distribution. This separation highlights the distinction between the "radial" components (captured by the diagonals, representing scaled variances) and the "angular" components (captured by the normalized off-diagonals, representing correlations). The result extends the univariate case where the sample variance is chi-squared and independent of the sample mean, generalizing to multivariate settings under normality assumptions. The theorem is pivotal in multivariate analysis, facilitating separate inference on variances and correlations in sample covariance matrices. A proof follows from the representation S = X^\top X, where X is an n \times p matrix with i.i.d. standard normal entries: each diagonal element S_{ii} is the squared length of the i-th column of X, while each normalized off-diagonal element is the cosine of the angle between two columns. Because the columns are independent and spherically symmetric, their lengths are independent of their directions, so the diagonal elements (lengths) are jointly independent of the normalized off-diagonals (angles). The Bartlett decomposition S = T T^T, with T lower triangular, chi-squared diagonal entries with decreasing degrees of freedom, and standard normal subdiagonal entries, provides an equivalent route and explicit marginal distributions (see Bartlett decomposition for details). An important corollary integrates the spectral properties: in the eigendecomposition S = U \Lambda U^T, where \Lambda is the diagonal matrix of eigenvalues and U is orthogonal, the eigenvalues (entries of \Lambda) are independent of the eigenvectors (columns of U), with U distributed uniformly (according to Haar measure) on the orthogonal group O(p). This follows directly from the invariance of the Wishart density under orthogonal transformations when the scale is the identity, aligning the angular components with the uniform distribution on directions.
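
The statements above can be probed empirically in the bivariate identity-scale case: the diagonal behaves as \chi^2_n, the sample correlation between a diagonal element and the normalized off-diagonal is near zero (consistent with, though not proof of, independence), and r^2 has the stated beta mean 1/n. A sketch with illustrative values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, n, reps = 2, 10, 100000

# Wishart draws with identity scale.
S = stats.wishart(df=n, scale=np.eye(p)).rvs(size=reps, random_state=rng)
diag = S[:, 0, 0]
r = S[:, 0, 1] / np.sqrt(S[:, 0, 0] * S[:, 1, 1])

# Diagonals are chi-squared with n degrees of freedom (mean n, variance 2n).
print(diag.mean(), diag.var())                  # approx 10 and 20

# Near-zero correlation between a diagonal element and the normalized off-diagonal.
print(np.corrcoef(diag, r)[0, 1])

# r^2 follows Beta(1/2, (n-1)/2), whose mean is 1/n.
print(np.mean(r ** 2), 1 / n)
```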

Corollaries

The independence theorem for the Wishart distribution implies several important corollaries regarding the distributions of submatrices, which follow from properties of multivariate normal vectors and quadratic forms. A key corollary concerns the marginal distribution of principal submatrices. If S \sim W_p(n, \Sigma) and S_{11} is the leading k \times k principal submatrix of S (with k < p), then the marginal distribution of S_{11} is W_k(n, \Sigma_{11}), where \Sigma_{11} is the corresponding leading principal submatrix of \Sigma. This result extends to any principal submatrix by reordering variables. The proof follows by applying the defining construction to the partitioned multivariate normal vectors generating S, as the quadratic form for the submatrix depends only on the marginal normal distribution of those coordinates. Another significant corollary addresses partitioned matrices. Suppose S \sim W_p(n, \Sigma) is partitioned conformably as S = \begin{pmatrix} A & B \\ B^T & C \end{pmatrix}, where A is p_1 \times p_1, C is p_2 \times p_2 with p_1 + p_2 = p, and \Sigma is partitioned similarly as \Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{CC} \end{pmatrix}. Then, the marginal distribution of A is W_{p_1}(n, \Sigma_{AA}), and the marginal distribution of C is W_{p_2}(n, \Sigma_{CC}). Additionally, the Schur complement A - B C^{-1} B^T (the part of A adjusted for the regression of its underlying coordinates on those of C) follows a Wishart distribution with degrees of freedom n - p_2 and scale matrix \Sigma_{AA \mid C} = \Sigma_{AA} - \Sigma_{AB} \Sigma_{CC}^{-1} \Sigma_{BA}, and it is independent of (B, C). These results follow from conditioning the underlying normal vectors and applying orthogonal transformations that separate the partitions. When \Sigma = I_p, an additional corollary arises from the rotational invariance of the distribution: in the spectral decomposition S = U D U^T, with D the diagonal matrix of eigenvalues, the matrix of eigenvectors U is distributed according to the Haar measure on the orthogonal group and is independent of the eigenvalues (whose joint density carries the Vandermonde repulsion factor described above). This uniform orientation of the eigenvectors stems directly from the spherical symmetry of the generating normal vectors under the identity scale.
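
A Monte Carlo sketch (illustrative values) of the partitioning corollary: the leading block has mean n \Sigma_{AA}, and the Schur complement has mean (n - p_2) \Sigma_{AA \mid C}, consistent with its reduced degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, p1, p2, n, reps = 3, 2, 1, 12, 50000
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

S = stats.wishart(df=n, scale=Sigma).rvs(size=reps, random_state=rng)
A = S[:, :p1, :p1]
B = S[:, :p1, p1:]
C = S[:, p1:, p1:]

# Marginal of the leading block: E[A] = n * Sigma_AA.
print(A.mean(axis=0), n * Sigma[:p1, :p1])

# Schur complement A - B C^{-1} B^T is Wishart with n - p2 degrees of freedom
# and scale Sigma_AA - Sigma_AB Sigma_CC^{-1} Sigma_BA, so its mean is (n - p2) times that scale.
schur = A - B @ np.linalg.inv(C) @ B.transpose(0, 2, 1)
Sigma_cond = Sigma[:p1, :p1] - Sigma[:p1, p1:] @ np.linalg.inv(Sigma[p1:, p1:]) @ Sigma[p1:, :p1]
print(schur.mean(axis=0), (n - p2) * Sigma_cond)
```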

Decompositions and Estimation

Bartlett decomposition

The Bartlett decomposition expresses a Wishart matrix via a Cholesky-type factorization involving independent chi-squared and normal variates. For a Wishart matrix S \sim W_p(n, I_p) with n \geq p, the decomposition takes the form S = L L^T, where L is a p \times p lower triangular matrix with independent entries: the diagonal entries satisfy L_{ii}^2 \sim \chi^2_{n-i+1} for i = 1, \dots, p, and the off-diagonal entries satisfy L_{ij} \sim N(0,1) for i > j. This factorization highlights the underlying structure of the Wishart as arising from sums of outer products of multivariate normals. For the general scale matrix case S \sim W_p(n, \Sigma) where \Sigma is positive definite, first compute the Cholesky factorization \Sigma = C C^T with C lower triangular. Then S = C L L^T C^T, where L follows the same distribution as above; this extends the identity-scale decomposition by transforming through the scale matrix's factorization. This decomposition enables an efficient algorithm for generating Wishart random variates. To simulate S \sim W_p(n, I_p), generate the independent \chi^2_{n-i+1} variates and take their square roots for the diagonal of L, generate independent standard normals for the strict lower triangle of L, and compute S = L L^T. For general \Sigma, incorporate the pre- and post-multiplication by C. The process requires generating p(p-1)/2 standard normals and p chi-squared variates, followed by matrix multiplications of order O(p^3), making it suitable for computational statistics. The independence of the chi-squared variates (the squares of the diagonal entries of L) simplifies sampling by allowing modular generation of components and aids in proofs of independence results, such as those concerning the diagonal elements or the determinant of S, which factors into a product involving these chi-squared variates.
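
A minimal implementation of the Bartlett sampler as described (NumPy only; the function and variable names are illustrative), checked against the first-moment identity E[S] = n \Sigma:

```python
import numpy as np

def bartlett_wishart(n, Sigma, rng):
    """Draw one S ~ W_p(n, Sigma) via the Bartlett decomposition (a sketch)."""
    p = Sigma.shape[0]
    L = np.zeros((p, p))
    # Diagonal: square roots of chi-squared variates with n, n-1, ..., n-p+1 df.
    L[np.diag_indices(p)] = np.sqrt(rng.chisquare(n - np.arange(p)))
    # Strict lower triangle: independent standard normals.
    idx = np.tril_indices(p, k=-1)
    L[idx] = rng.normal(size=len(idx[0]))
    C = np.linalg.cholesky(Sigma)          # Sigma = C C^T
    CL = C @ L
    return CL @ CL.T

rng = np.random.default_rng(0)
p, n = 3, 7
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.4],
                  [0.0, 0.4, 1.5]])

draws = np.array([bartlett_wishart(n, Sigma, rng) for _ in range(20000)])
print(np.abs(draws.mean(axis=0) - n * Sigma).max())   # close to 0
```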

Covariance estimator

The Wishart distribution serves as the sampling distribution for the maximum likelihood estimator (MLE) of the covariance matrix in multivariate normal models. Consider n independent and identically distributed observations X_1, \dots, X_n from a p-dimensional normal distribution N_p(\mu, \Sigma), where \Sigma is the unknown p \times p positive definite covariance matrix. The MLE of \Sigma is given by \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T, where \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i is the sample mean. Under these assumptions, n \hat{\Sigma} follows a Wishart distribution with n-1 degrees of freedom and scale matrix \Sigma, denoted n \hat{\Sigma} \sim W_p(n-1, \Sigma). This estimator is biased, with expectation E[\hat{\Sigma}] = \frac{n-1}{n} \Sigma. An unbiased estimator is the sample covariance matrix S = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T = \frac{n}{n-1} \hat{\Sigma}, satisfying (n-1) S \sim W_p(n-1, \Sigma). The Wishart form enables exact finite-sample inference on \Sigma when n > p. Likelihood ratio tests for hypotheses on \Sigma, such as testing \Sigma = \Sigma_0 or specific structures like sphericity (\Sigma = \sigma^2 I_p), rely on the distributional properties of the Wishart under the null hypothesis. For the test of \Sigma = \Sigma_0, the test statistic involves the ratio of determinants of the sample covariance compared to \Sigma_0, and its null distribution is a function of Wishart matrices or related ratios, allowing critical values from exact tables or approximations for large n. These tests are pivotal in multivariate analysis for assessing covariance homogeneity or equality across groups. In high-dimensional settings where the dimension p approaches or exceeds the sample size n, the classical Wishart-based estimators suffer from substantial bias and variance, leading to ill-conditioned matrices. Finite-sample corrections, including shrinkage methods that blend the sample covariance matrix toward a structured target (e.g., a scaled identity matrix), have been developed to mitigate these issues and improve estimation accuracy. For instance, bias-corrected shrinkage estimators achieve consistent performance under p/n \to c > 0, with theoretical guarantees on risk reduction.
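
A simulation sketch (illustrative values) of the bias statement: averaging the MLE over many replications recovers \frac{n-1}{n}\Sigma, and the rescaled estimator recovers \Sigma.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 3, 12, 50000
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 2.0, 0.0],
                  [0.1, 0.0, 1.5]])

# reps independent samples of size n, each centered at its own sample mean.
X = rng.multivariate_normal(np.zeros(p), Sigma, size=(reps, n))
Xc = X - X.mean(axis=1, keepdims=True)
mle = np.einsum('rij,rik->rjk', Xc, Xc) / n      # per-replication MLE of Sigma

# E[Sigma_hat] = (n - 1)/n * Sigma, so the MLE is biased downward;
# rescaling by n/(n - 1) gives the unbiased sample covariance matrix.
print(np.abs(mle.mean(axis=0) - (n - 1) / n * Sigma).max())
print(np.abs(mle.mean(axis=0) * n / (n - 1) - Sigma).max())
```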

Marginal and Parameter Aspects

Marginal distributions of elements

The marginal distribution of a diagonal element S_{ii} of a random matrix \mathbf{S} \sim \mathcal{W}_p(n, \boldsymbol{\Sigma}) follows a scaled chi-squared distribution: S_{ii} \sim \sigma_{ii} \chi^2_n, where \chi^2_n is the chi-squared distribution with n degrees of freedom and \sigma_{ii} is the (i,i)-th element of \boldsymbol{\Sigma}. This result arises as the marginal distribution of a 1 \times 1 principal submatrix of \mathbf{S}, which itself follows a Wishart distribution \mathcal{W}_1(n, \sigma_{ii}). The marginal distribution of an off-diagonal element S_{ij} for i \neq j is more intricate and lacks a simple closed form like the chi-squared for diagonals. It can be derived by integrating the Wishart density over all other matrix elements, subject to the positive definiteness constraint, and is of variance-gamma type: its density involves a modified Bessel function of the second kind, with order and argument depending on n, \sigma_{ii}, \sigma_{jj}, and \sigma_{ij}. This expression highlights the dependence on the full scale matrix \boldsymbol{\Sigma} and the degrees of freedom n. Although the univariate marginals are as described, the elements of \mathbf{S} exhibit correlations that reflect the matrix structure. The covariance between elements is given by \operatorname{Cov}(S_{rs}, S_{tu}) = n (\sigma_{rt} \sigma_{su} + \sigma_{ru} \sigma_{st}) for indices r,s,t,u \in \{1, \dots, p\}. For instance, \operatorname{Cov}(S_{ii}, S_{jj}) = 2 n \sigma_{ij}^2 for i \neq j, and \operatorname{Var}(S_{ij}) = n (\sigma_{ii} \sigma_{jj} + \sigma_{ij}^2) for i \neq j. These relations underscore the interdependence among elements, with correlations generally increasing in magnitude with the off-diagonal scale parameters.
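
A sketch checking the diagonal marginal \sigma_{ii} \chi^2_n (via a Kolmogorov-Smirnov distance) and the covariance identity \operatorname{Cov}(S_{ii}, S_{jj}) = 2 n \sigma_{ij}^2, with illustrative values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, n, reps = 3, 9, 100000
Sigma = np.array([[1.5, 0.4, 0.2],
                  [0.4, 1.0, 0.1],
                  [0.2, 0.1, 2.0]])

S = stats.wishart(df=n, scale=Sigma).rvs(size=reps, random_state=rng)

# S_11 / sigma_11 should be chi-squared with n degrees of freedom.
z = S[:, 0, 0] / Sigma[0, 0]
print(z.mean(), z.var())                                   # approx n and 2n
print(stats.kstest(z, stats.chi2(df=n).cdf).statistic)     # small KS distance

# Cross-check one covariance identity: Cov(S_11, S_22) = 2 n sigma_12^2.
print(np.cov(S[:, 0, 0], S[:, 1, 1])[0, 1], 2 * n * Sigma[0, 1] ** 2)
```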

Shape parameter range

The shape parameter of the Wishart distribution, denoted n (degrees of freedom), governs key properties such as the existence of a density, the moments, and the rank of the random matrix \mathbf{S} \sim W_p(n, \Sigma), where p is the matrix dimension and \Sigma is the positive definite scale matrix. For the density to be proper and integrable, n > p - 1 is required; this ensures the normalizing constant, involving the multivariate gamma function \Gamma_p(n/2), is well-defined, as each gamma factor \Gamma\left(\frac{n+1-i}{2}\right) demands a positive argument. When this condition holds and \Sigma is positive definite, \mathbf{S} concentrates on the space of positive definite matrices. A closely related condition, n \geq p, guarantees in the integer-sample construction that \mathbf{S} is positive definite almost surely, meaning the matrix has full rank with probability 1 and is thus invertible. Under n > p - 1, the mean E[\mathbf{S}] = n \Sigma is finite, as are the variances of the elements, which take the form \operatorname{Var}(S_{ij}) = n (\sigma_{ij}^2 + \sigma_{ii} \sigma_{jj}) for the scale matrix entries \sigma_{kl}. The expected determinant E[|\mathbf{S}|] is also finite in this range, given by a product involving gamma functions that converges under the same constraint. In the univariate case (p = 1), where the distribution reduces to a scaled chi-squared distribution, the condition becomes simply n > 0. When n is an integer with n < p, the resulting singular Wishart distribution is supported on nonnegative definite matrices of rank at most n; it has no density with respect to Lebesgue measure on the full positive definite cone, but it is nonetheless a valid probability measure on the corresponding lower-dimensional subspace. Fractional (non-integer) values of n are allowed, provided n > p - 1 for properness, and limiting choices below this range are sometimes used as improper priors in statistical applications. Historically, John Wishart introduced the distribution in 1928 assuming integer n \geq p, motivated by sums of outer products from normal samples, but subsequent generalizations extended it to real n > p - 1 to accommodate broader theoretical and computational needs.

Applications

Bayesian usage

In Bayesian inference, the Wishart distribution is commonly employed as a conjugate prior for the precision matrix (the inverse of the covariance matrix) in models involving multivariate normal likelihoods. Suppose the data consist of n independent observations \mathbf{x}_1, \dots, \mathbf{x}_n from a p-dimensional multivariate normal distribution \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), where the mean \boldsymbol{\mu} is known and the precision matrix \boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1} has a prior \boldsymbol{\Lambda} \sim \text{Wishart}(\nu_0, \mathbf{S}_0^{-1}), with \nu_0 > p-1 degrees of freedom and a prior sum-of-squares matrix \mathbf{S}_0 (so that the Wishart scale matrix is \mathbf{S}_0^{-1}). The Wishart prior ensures that the posterior distribution for \boldsymbol{\Lambda} remains in the same family, facilitating closed-form updates. The posterior is then \boldsymbol{\Lambda} \mid \{\mathbf{x}_i\} \sim \text{Wishart}(\nu_0 + n, \mathbf{S}_n^{-1}), where the updated sum-of-squares matrix is \mathbf{S}_n = \mathbf{S}_0 + \sum_{i=1}^n (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^\top. This conjugacy property, first formalized in the context of multivariate normal models, allows for straightforward posterior computation without requiring numerical approximation for the marginal likelihood or posterior in simple cases. When the mean \boldsymbol{\mu} is unknown, the prior extends to the normal-Wishart distribution, maintaining conjugacy for both parameters. A key implication of this setup is the form of the posterior predictive distribution for a new observation \mathbf{x}_{n+1}, which follows a multivariate t-distribution: \mathbf{x}_{n+1} \mid \{\mathbf{x}_i\} \sim t_{\nu_0 + n - p + 1}\left(\boldsymbol{\mu}, \frac{\mathbf{S}_n}{\nu_0 + n - p + 1}\right). This distribution arises naturally from integrating the precision matrix out of the posterior, providing robust predictions that account for parameter uncertainty. In more complex hierarchical Bayesian models, such as those with multiple levels of multivariate normals (e.g., in multilevel regression or spatial statistics), the Wishart prior is applied to precision matrices at various levels to induce dependence structures. Modern probabilistic programming languages such as Stan and PyMC rely on Markov chain Monte Carlo (MCMC) methods to sample from posteriors in these settings, where conjugacy aids initialization but full MCMC is essential for non-conjugate extensions or high dimensions. Recent advancements, including the integration of sufficient statistics for faster Bayesian computation as of 2025, emphasize reparameterization techniques (e.g., via the Bartlett decomposition in PyMC) for efficient sampling of Wishart-distributed parameters.
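
A sketch of the conjugate update with known mean (illustrative values and names): the posterior is Wishart with \nu_0 + n degrees of freedom and scale \mathbf{S}_n^{-1}, and its mean (\nu_0 + n)\mathbf{S}_n^{-1} can be confirmed by sampling.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, n = 2, 50
mu = np.zeros(p)                                      # known mean
Lambda_true = np.array([[2.0, -0.5], [-0.5, 1.0]])    # true precision matrix
X = rng.multivariate_normal(mu, np.linalg.inv(Lambda_true), size=n)

# Wishart prior on the precision: Lambda ~ W_p(nu0, S0^{-1}).
nu0 = p + 2
S0 = np.eye(p)

# Conjugate update with known mean.
Sn = S0 + (X - mu).T @ (X - mu)
nun = nu0 + n
posterior = stats.wishart(df=nun, scale=np.linalg.inv(Sn))

# Posterior mean of the precision, E[Lambda | data] = nun * Sn^{-1};
# for moderately large n it sits near the true precision matrix.
print(nun * np.linalg.inv(Sn))
draws = posterior.rvs(size=20000, random_state=rng)
print(draws.mean(axis=0))
```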

Parameter selection in Bayesian contexts

In Bayesian analysis, the parameters of the Wishart prior for the precision matrix—degrees of freedom n_0 and scale matrix \Sigma_0—are selected to balance propriety, informativeness, and alignment with the data while ensuring the prior is conjugate to the multivariate normal likelihood. The prior is proper provided n_0 > p - 1, where p is the dimension of the matrix; values close to this lower bound produce weak, vague priors that exert minimal influence on the posterior. Smaller n_0 values enhance posterior robustness to model misspecification but risk impropriety if set too low. Common methods for parameter selection include empirical Bayes approaches, reference priors, and moment matching. In empirical Bayes, hyperparameters are estimated by maximizing the marginal likelihood. Reference priors, derived from information theory, yield noninformative forms prioritizing asymptotic optimality under entropy loss. Moment matching sets the prior mean n_0 \Sigma_0 to approximate a pilot estimate of the precision, useful for data-driven initialization. Selection criteria emphasize marginal likelihood maximization for predictive performance and posterior robustness, assessed via sensitivity analyses. For the scale matrix \Sigma_0, vague priors often use the identity matrix to impose minimal structure, while informative choices scale a target covariance or precision matrix so that the prior mean matches observed variability (for example, setting the inverse-Wishart scale so that the prior mean, (n_0 - p - 1)^{-1} times the scale, equals a target covariance). Empirical Bayes further refines \Sigma_0 by optimizing it alongside n_0 in high-dimensional settings, shrinking toward sparsity if needed. Recent advances (post-2020) in Bayesian computation have popularized separation strategies, decomposing the covariance matrix into standard deviations and correlations (or precision analogs) to allow separate priors on each component, improving flexibility over single Wishart specifications. These approaches, extended via Cholesky decompositions, enhance estimation in dynamic models and network meta-analyses by separately regularizing elements for better bias reduction and coverage.
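
A small sketch of moment matching (illustrative numbers and names): fix the prior mean n_0 \Sigma_0 at a pilot precision estimate and vary n_0 to control how concentrated the prior is around that mean.

```python
import numpy as np

def moment_matched_wishart_scale(pilot_precision, n0):
    """Choose Sigma0 so the Wishart prior mean n0 * Sigma0 equals a pilot precision estimate."""
    return pilot_precision / n0

# Pilot precision from a small preliminary analysis (illustrative numbers).
pilot_precision = np.array([[1.8, -0.4], [-0.4, 0.9]])
p = pilot_precision.shape[0]

# Larger n0 keeps the same prior mean but shrinks the prior spread:
# Var(Lambda_11) = 2 * n0 * Sigma0_11^2, so sd(Lambda_11) = pilot_11 * sqrt(2 / n0).
for n0 in [p + 1, p + 5, p + 20]:
    Sigma0 = moment_matched_wishart_scale(pilot_precision, n0)
    sd_11 = np.sqrt(2 * n0 * Sigma0[0, 0] ** 2)
    print(n0, n0 * Sigma0[0, 0], sd_11)
```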

Connections to other distributions

The Wishart distribution generalizes the univariate chi-squared distribution to the multivariate setting. Specifically, when the dimension p = 1, a random variable S \sim W_1(n, \sigma^2) follows a scaled chi-squared distribution, S \sim \sigma^2 \chi^2_n, where \chi^2_n denotes a chi-squared random variable with n degrees of freedom. This connection arises because the Wishart distribution is defined as the sum of outer products of independent multivariate normal vectors, mirroring the construction of the chi-squared as a sum of squared standard normal variables. The determinant of a Wishart matrix also exhibits a product form related to gamma distributions, which encompass the chi-squared as a special case. For S \sim W_p(n, I_p) with identity scale matrix and n > p-1, the determinant |S| is distributed as the product of p independent chi-squared random variables: |S| \stackrel{d}{=} \prod_{i=1}^p \chi^2_{n - i + 1}, where each \chi^2_k is chi-squared with k degrees of freedom, equivalent to a gamma distribution \Gamma(k/2, 2) in the shape-scale parameterization. This result follows from the Bartlett decomposition of the Wishart matrix into triangular factors with independent chi-squared variables on the diagonal. The inverse Wishart distribution is the distribution of the matrix inverse of a Wishart matrix. If S \sim W_p(n, \Sigma^{-1}), then S^{-1} follows an inverse-Wishart distribution with scale matrix \Sigma; depending on the convention, its degrees-of-freedom parameter is written as n or as n - p + 1. This relationship ensures that the inverse Wishart, like the inverse gamma for scalars, serves as a conjugate prior for covariance matrices in Bayesian models. The Wishart distribution can be expressed as a quadratic form involving the matrix normal distribution. Let X be an n \times p matrix with rows independently distributed as N_p(0, \Sigma); then X follows a matrix normal distribution MN_{n \times p}(0, I_n, \Sigma), and the Wishart matrix is S = X^\top X \sim W_p(n, \Sigma). This quadratic form representation highlights the Wishart's role in modeling sample covariance matrices from multivariate normals. Ratios involving independent Wishart matrices connect to matrix variate beta and F distributions. For independent W_1 \sim W_p(a, \Theta) and W_2 \sim W_p(b, \Theta) with the same scale \Theta, the matrix W_1 (W_1 + W_2)^{-1} (or its symmetrized version) follows a matrix variate beta type I distribution with parameters a/2 and b/2, while ratios of the form W_2^{-1/2} W_1 W_2^{-1/2} lead to matrix variate beta type II (matrix F) distributions. In hypothesis testing, such as for covariance equality, ratios of quadratic forms from Wishart-distributed statistics yield F-distributed test statistics, generalizing the univariate F for variance comparisons.
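
A sketch (illustrative values) of two of these connections: the determinant of a W_p(n, I_p) matrix matches, in distribution, a product of independent chi-squared variates, and the p = 1 case reduces to \sigma^2 \chi^2_n.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, n, reps = 3, 8, 100000

# det(S) for S ~ W_p(n, I_p) vs. a product of independent chi-squared variates
# with n, n-1, ..., n-p+1 degrees of freedom (from the Bartlett decomposition).
S = stats.wishart(df=n, scale=np.eye(p)).rvs(size=reps, random_state=rng)
det_S = np.linalg.det(S)
chi_prod = np.prod(rng.chisquare(n - np.arange(p), size=(reps, p)), axis=1)

# Compare log-scale means and variances of the two samples (equal in distribution).
print(np.log(det_S).mean(), np.log(chi_prod).mean())
print(np.log(det_S).var(), np.log(chi_prod).var())

# Univariate case: W_1(n, sigma^2) is sigma^2 * chi^2_n, so its mean is n * sigma^2.
sigma2 = 2.5
w1 = stats.wishart(df=n, scale=np.array([[sigma2]])).rvs(size=reps, random_state=rng)
print(w1.mean() / sigma2, n)
```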
