Wishart distribution
The Wishart distribution is a multivariate probability distribution defined on the space of symmetric positive semi-definite p \times p matrices, serving as the natural generalization of the chi-squared distribution to the multivariate setting.[1] It arises as the sampling distribution of the sum of outer products of independent multivariate normal random vectors, or equivalently, as the distribution of n times the sample covariance matrix obtained from n independent observations from a p-dimensional normal distribution with mean zero and covariance matrix \Sigma.[1] The distribution is parameterized by the degrees of freedom n > p-1 (equal to the sample size when n is an integer) and the p \times p positive definite scale matrix \Sigma, with probability density function f(W) = \frac{|\Sigma|^{-n/2} |W|^{(n-p-1)/2} \exp\left(-\frac{1}{2} \operatorname{tr}(\Sigma^{-1} W)\right)}{2^{np/2} \pi^{p(p-1)/4} \prod_{i=1}^p \Gamma\left(\frac{n+1-i}{2}\right)} for W symmetric positive definite.[2]

Named after the Scottish statistician John Wishart, who first derived it in 1928 while working under Ronald A. Fisher at Rothamsted Experimental Station, the distribution was introduced in his seminal paper on the generalized product-moment distribution for samples from a normal multivariate population. Wishart's work built on earlier univariate results, extending them to handle correlations among multiple variables, which was crucial for the agricultural and biometric applications prevalent at the time.[3]

Key properties include the expected value \mathbb{E}[W] = n \Sigma, reflecting that the distribution scales with the degrees of freedom, and a mode at (n-p-1) \Sigma for n > p+1.[2] The distribution is closed under convolution: the sum of independent Wishart matrices with the same scale parameter follows another Wishart distribution with added degrees of freedom.[2] If W \sim \mathcal{W}_p(n, \Sigma), then W^{-1} follows an inverse Wishart distribution, which is conjugate to the multivariate normal for Bayesian inference on covariance matrices.[2]

In applications, the Wishart distribution is fundamental in multivariate statistical analysis, particularly for hypothesis testing on covariance structures, such as in Hotelling's T-squared test or in deriving the F-distribution for variance ratios.[1] It also appears in random matrix theory as the Wishart ensemble, modeling eigenvalues of sample covariance matrices in high-dimensional data, with implications for principal component analysis and signal processing.[4] More broadly, it underpins Bayesian models for covariance estimation in genomics, finance, and machine learning, where prior distributions on precision matrices are often Wishart.[5]

Fundamentals
Definition
The Wishart distribution, denoted W_p(n, \Sigma), is the probability distribution of the p \times p random matrix S = \sum_{i=1}^n X_i X_i^T, where X_1, \dots, X_n are independent p-dimensional multivariate normal random vectors, each distributed as \mathcal{N}_p(0, \Sigma).[2][6] The random variable S takes values in the space of p \times p positive semi-definite matrices.[2][6] The distribution is parameterized by the degrees of freedom n, a positive integer that represents the number of independent observations and satisfies n > p - 1, and the scale matrix \Sigma, a p \times p positive definite covariance matrix.[2][6] It is named after the statistician John Wishart, who introduced the distribution in 1928 to model the sample covariance matrix from multivariate normal data.[7] In the univariate case where p = 1, the Wishart distribution reduces to a scaled chi-squared distribution.[8]
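The defining construction lends itself directly to simulation. The following minimal sketch (with illustrative values of p, n, and \Sigma that are not taken from the text, and assuming NumPy and SciPy are available) builds a single Wishart draw as a sum of outer products of normal vectors and compares it with a direct draw from scipy.stats.wishart.

```python
import numpy as np
from scipy.stats import wishart

# Illustrative parameters (not from the article): p = 3 dimensions, n = 10 degrees of freedom.
rng = np.random.default_rng(0)
p, n = 3, 10
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

# Defining construction: S = sum_i X_i X_i^T with X_i ~ N_p(0, Sigma) i.i.d.
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # shape (n, p)
S = X.T @ X                                               # p x p matrix, distributed as W_p(n, Sigma)

print("S is symmetric:", np.allclose(S, S.T))
print("S is positive semi-definite:", np.all(np.linalg.eigvalsh(S) >= -1e-10))

# The same distribution can be sampled directly.
S2 = wishart.rvs(df=n, scale=Sigma, random_state=rng)
print("direct draw has shape:", S2.shape)
```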
Occurrence and motivation
The Wishart distribution arises in multivariate statistics as the distribution of \sum_{i=1}^n X_i X_i^\top, where the X_i are i.i.d. \mathcal{N}_p(0, \Sigma) vectors (the known-mean-zero case): if X_1, \dots, X_n are i.i.d. from \mathcal{N}_p(0, \Sigma), then S = \sum_{i=1}^n X_i X_i^\top follows a Wishart distribution with degrees of freedom n and scale matrix \Sigma.[9] This was first established by John Wishart in his seminal 1928 paper, where he derived the generalized product-moment distribution for samples from a normal multivariate population, laying the foundation for studying covariance structures in higher dimensions. When the mean is unknown, (n-1) times the unbiased sample covariance matrix follows a Wishart distribution with degrees of freedom n-1 and scale matrix \Sigma.[2] The distribution thus serves as a multivariate generalization of the chi-squared distribution, extending univariate variance inference to the full covariance matrix in settings where observations exhibit correlated variability.[9] Its motivation stems from the need to model and test properties of covariance matrices, such as testing equality of covariances across groups or assessing independence structures.[10]
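As a quick numerical illustration of the unknown-mean case, the following hedged sketch (illustrative parameter values, not prescribed by the text) checks by Monte Carlo that (n-1) times the unbiased sample covariance has expectation (n-1)\Sigma, as a Wishart matrix with n-1 degrees of freedom should.

```python
import numpy as np

# Simulation check (illustrative values): with an unknown mean, (n-1) times the
# unbiased sample covariance has expectation (n-1) * Sigma, consistent with W_p(n-1, Sigma).
rng = np.random.default_rng(1)
p, n, reps = 2, 8, 20_000
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

acc = np.zeros((p, p))
for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    acc += (n - 1) * np.cov(X, rowvar=False)   # (n-1) * unbiased sample covariance

print("Monte Carlo mean:\n", acc / reps)
print("(n-1) * Sigma:\n", (n - 1) * Sigma)
```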
Probability Density Function
General form
The probability density function of the Wishart distribution W_p(n, \Sigma), where S is a p \times p positive definite random matrix, n is the degrees of freedom, and \Sigma is the p \times p positive definite scale matrix, is given by f(S \mid n, \Sigma) = \frac{|S|^{(n-p-1)/2} \exp\left(-\frac{1}{2} \operatorname{tr}(\Sigma^{-1} S)\right)}{2^{np/2} |\Sigma|^{n/2} \Gamma_p(n/2)}, for S > 0, and zero otherwise.[7] This formula, originally derived by Wishart, provides the explicit density in matrix form, generalizing the chi-squared distribution to the multivariate case. The components of the density highlight its multivariate structure: the term |S|^{(n-p-1)/2}, a power of the determinant of S, weights matrices according to their generalized volume in p dimensions; the exponential factor \exp\left(-\frac{1}{2} \operatorname{tr}(\Sigma^{-1} S)\right) involves the trace of \Sigma^{-1} S, which measures the overall size of S relative to the scale matrix \Sigma; and the normalization constant includes the multivariate gamma function \Gamma_p(a) = \pi^{p(p-1)/4} \prod_{i=1}^p \Gamma\left(a - \frac{i-1}{2}\right), which ensures integrability over the space of positive definite matrices and generalizes the univariate gamma function to account for the dimensionality p.[7][11] For the distribution to be concentrated on positive definite matrices with probability 1, the degrees of freedom must satisfy n \geq p. This density arises from the joint distribution of n independent p-dimensional normal random vectors X_i \sim \mathcal{N}_p(0, \Sigma), where S = \sum_{i=1}^n X_i X_i^\top; the derivation transforms the joint density of the X_i into the density of S via the Jacobian of the quadratic transformation, yielding the form above after integrating out the orientation variables over the Stiefel manifold.
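The density can be evaluated numerically from its log-determinant, trace, and multivariate gamma components. The sketch below (a minimal illustration with made-up parameters; the helper name wishart_logpdf is our own) implements the log-density exactly as written above and compares it with SciPy's built-in implementation.

```python
import numpy as np
from scipy.special import multigammaln
from scipy.stats import wishart

def wishart_logpdf(S, n, Sigma):
    """Log-density of W_p(n, Sigma) at a positive definite matrix S, following the formula in the text."""
    p = Sigma.shape[0]
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_Sig = np.linalg.slogdet(Sigma)
    trace_term = np.trace(np.linalg.solve(Sigma, S))
    return (0.5 * (n - p - 1) * logdet_S
            - 0.5 * trace_term
            - 0.5 * n * p * np.log(2)
            - 0.5 * n * logdet_Sig
            - multigammaln(0.5 * n, p))   # log of the multivariate gamma function Gamma_p(n/2)

# Illustrative check against SciPy's implementation.
rng = np.random.default_rng(2)
p, n = 3, 7
Sigma = np.diag([1.0, 2.0, 0.5])
S = wishart.rvs(df=n, scale=Sigma, random_state=rng)
print(wishart_logpdf(S, n, Sigma), wishart.logpdf(S, df=n, scale=Sigma))
```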
Spectral decomposition
The spectral decomposition of a positive definite matrix S \sim W_p(n, \Sigma) expresses S = U \Lambda U^T, where U is an orthogonal matrix and \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p) with \lambda_i > 0. This representation allows the probability density function (PDF) of the Wishart distribution to be rewritten in coordinates given by the eigenvalues and eigenvectors.[12] Under this decomposition the Lebesgue measure on the space of symmetric matrices factorizes as dS = c_p \prod_{1 \leq i < j \leq p} |\lambda_i - \lambda_j| \prod_{i=1}^p d\lambda_i \, d\mu(U), where d\mu(U) denotes the Haar measure on the orthogonal group O(p), c_p is a constant depending only on p, and \prod_{i < j} |\lambda_i - \lambda_j| is the Vandermonde factor arising from the Jacobian of the eigendecomposition. Consequently, the joint density g(\Lambda, U) of (\Lambda, U) with respect to the product measure \prod d\lambda_i \, d\mu(U) is proportional to f(U \Lambda U^T) \prod_{i < j} |\lambda_i - \lambda_j|, where f is the Wishart PDF. Substituting the standard form f(S) = c \, |S|^{(n-p-1)/2} \exp\left( -\frac{1}{2} \operatorname{tr}(\Sigma^{-1} S) \right), with normalizing constant c = 2^{-np/2} |\Sigma|^{-n/2} / \Gamma_p(n/2), yields g(\Lambda, U) \propto \prod_{i=1}^p \lambda_i^{(n-p-1)/2} \exp\left( -\frac{1}{2} \operatorname{tr}(\Sigma^{-1} U \Lambda U^T) \right) \prod_{i < j} |\lambda_i - \lambda_j|, a form in which each \lambda_i appears in a chi-squared-like factor, the eigenvalues repel one another through the Vandermonde determinant \prod_{i < j} |\lambda_i - \lambda_j|, and the trace term couples the eigenvalues to the eigenvectors through U.[13][12] When \Sigma = I_p, the trace simplifies to \operatorname{tr}(\Lambda) = \sum_i \lambda_i, rendering the density independent of U. Integrating out U against the Haar measure, the joint density of the ordered eigenvalues 0 < \lambda_p \leq \dots \leq \lambda_1 < \infty (with respect to Lebesgue measure on this region) is h(\lambda_1, \dots, \lambda_p) = \tilde{c} \, \prod_{i=1}^p \lambda_i^{(n-p-1)/2} \exp\left( -\frac{1}{2} \sum_{i=1}^p \lambda_i \right) \prod_{1 \leq i < j \leq p} |\lambda_i - \lambda_j|, where \tilde{c} is the normalizing constant ensuring integration to 1 over the ordered Weyl chamber. This eigenvalue density makes explicit the repulsive interaction among eigenvalues induced by the Vandermonde term.[13][12]

This spectral form is instrumental in random matrix theory for analyzing eigenvalue statistics of high-dimensional Wishart matrices. In the asymptotic regime where p, n \to \infty with p/n \to \gamma \in (0,1), the empirical spectral distribution of the eigenvalues of the normalized matrix S/n converges weakly to the Marchenko-Pastur law, a deterministic density supported on [ (1 - \sqrt{\gamma})^2, (1 + \sqrt{\gamma})^2 ] given by \rho(x) = \frac{1}{2\pi \gamma x} \sqrt{ (x - a)(b - x) }, \quad a = (1 - \sqrt{\gamma})^2, \, b = (1 + \sqrt{\gamma})^2. [14] Such limits underpin applications in principal component analysis and signal processing. For spiked Wishart models, where \Sigma features a few large eigenvalues amid smaller ones, recent developments in the 2020s explore outlier eigenvalue detection and phase transitions in the spectrum, extending classical results to structured covariances.[15]
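The Marchenko-Pastur limit is easy to visualize numerically. The following sketch (illustrative sizes p = 200 and n = 600, so \gamma = 1/3, chosen only for this example) computes the eigenvalues of S/n for \Sigma = I_p and compares an empirical histogram with the limiting density evaluated at the bin centers.

```python
import numpy as np

# Monte Carlo illustration of the Marchenko-Pastur limit for Sigma = I_p
# (illustrative sizes: p = 200, n = 600, so gamma = p/n = 1/3).
rng = np.random.default_rng(3)
p, n = 200, 600
gamma = p / n

X = rng.standard_normal((n, p))
S = X.T @ X                        # a single draw from W_p(n, I_p)
evals = np.linalg.eigvalsh(S / n)  # eigenvalues of the normalized matrix

a, b = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
hist, edges = np.histogram(evals, bins=20, range=(a, b), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
mp_density = np.sqrt((centers - a) * (b - centers)) / (2 * np.pi * gamma * centers)

print(np.round(hist, 2))        # empirical spectral density
print(np.round(mp_density, 2))  # Marchenko-Pastur density at the bin centers
```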
Moments and Expectations
Expectation and variance
The expected value of a p \times p random matrix S following the Wishart distribution W_p(n, \Sigma), where n > 0 is the degrees of freedom and \Sigma is a positive definite scale matrix, is given by \mathbb{E}[S] = n \Sigma. This result follows directly from the distributional definition: if S = \sum_{i=1}^n X_i X_i^\top with X_i \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}_p(0, \Sigma), then \mathbb{E}[S] = n \mathbb{E}[X_1 X_1^\top] = n \Sigma, using the linearity of expectation and the fact that \mathbb{E}[X_1 X_1^\top] = \Sigma.[6][16] The second moments of S determine its variance-covariance structure. The covariance between elements is \operatorname{Cov}(S_{ij}, S_{kl}) = n (\sigma_{ik} \sigma_{jl} + \sigma_{il} \sigma_{jk}), where \sigma_{ab} denotes the (a,b)-element of \Sigma. This arises from the fourth-moment (Isserlis) identity for the underlying normal vectors: for a single observation X \sim \mathcal{N}_p(0, \Sigma) with components X_a, one has \operatorname{Cov}(X_a X_b, X_c X_d) = \sigma_{ac} \sigma_{bd} + \sigma_{ad} \sigma_{bc}, and summing over the n independent observations contributes the factor n. In vectorized form, \operatorname{Var}(\operatorname{vec}(S)) = n (I_{p^2} + K_{pp}) (\Sigma \otimes \Sigma), where K_{pp} is the p^2 \times p^2 commutation matrix that swaps indices in the Kronecker product.[6][17] For the variances of individual elements, the diagonal entries satisfy \operatorname{Var}(S_{ii}) = 2n \sigma_{ii}^2, while the off-diagonal entries satisfy \operatorname{Var}(S_{ij}) = n (\sigma_{ij}^2 + \sigma_{ii} \sigma_{jj}) for i \neq j. These follow by specializing the general covariance formula to the case (i,j) = (k,l). The uncentered second moment is \mathbb{E}[S^2] = n(n+1) \Sigma^2 + n (\operatorname{tr}(\Sigma)) \Sigma, from which the centered second moment follows as \mathbb{E}[S^2] - (\mathbb{E}[S])^2 = n \Sigma^2 + n (\operatorname{tr}(\Sigma)) \Sigma, though element-wise expressions are more commonly used in applications. Higher-order uncentered moments exist in closed form via recursive or invariant polynomial representations but are typically derived only for specific purposes beyond the first two.[17][16][18]
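These first- and second-moment formulas can be checked by simulation. The sketch below (illustrative parameters; nothing here is prescribed by the text) compares Monte Carlo estimates of \mathbb{E}[S], \operatorname{Var}(S_{11}), and \operatorname{Var}(S_{12}) with n\Sigma, 2n\sigma_{11}^2, and n(\sigma_{12}^2 + \sigma_{11}\sigma_{22}).

```python
import numpy as np
from scipy.stats import wishart

# Monte Carlo check of E[S] = n * Sigma, Var(S_11) = 2 n sigma_11^2 and
# Var(S_12) = n (sigma_12^2 + sigma_11 sigma_22), with illustrative parameters.
rng = np.random.default_rng(4)
p, n, reps = 2, 6, 200_000
Sigma = np.array([[1.5, 0.4],
                  [0.4, 1.0]])

draws = wishart.rvs(df=n, scale=Sigma, size=reps, random_state=rng)  # shape (reps, p, p)
print("Monte Carlo mean:\n", draws.mean(axis=0), "\nn * Sigma:\n", n * Sigma)
print("Var(S_11):", draws[:, 0, 0].var(), "theory:", 2 * n * Sigma[0, 0] ** 2)
print("Var(S_12):", draws[:, 0, 1].var(),
      "theory:", n * (Sigma[0, 1] ** 2 + Sigma[0, 0] * Sigma[1, 1]))
```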
Log-expectation and log-variance
The expected value of the logarithm of the determinant of a p \times p Wishart-distributed random matrix S \sim W_p(n, \Sigma), where n > p-1 denotes the degrees of freedom and \Sigma is the positive definite scale matrix, is \mathbb{E}[\log \det S] = \log \det \Sigma + \sum_{i=1}^p \psi\left( \frac{n + 1 - i}{2} \right) + p \log 2, with \psi(\cdot) denoting the digamma function.[19] This result follows from the Bartlett decomposition of the Wishart matrix into independent chi-squared variates and properties of the multivariate gamma function. The corresponding variance is \mathrm{Var}(\log \det S) = \sum_{i=1}^p \psi'\left( \frac{n + 1 - i}{2} \right), where \psi'(\cdot) is the trigamma function; notably, the variance does not depend on \Sigma, reflecting the scale equivariance of the log-determinant under Wishart variability.[19] In the scalar case (p = 1), the Wishart distribution W_1(n, \Sigma) coincides with a gamma distribution with shape n/2 and scale 2\Sigma, yielding the exact expectation \mathbb{E}[\log S] = \psi(n/2) + \log(2\Sigma) and variance \mathrm{Var}(\log S) = \psi'(n/2), which aligns with the general formula.[19] For the logarithms of individual matrix elements, exact expressions are more involved due to dependence, but the scalar result provides insight into the marginal behavior of the diagonal entries. These logarithmic moments are essential in Bayesian analysis, particularly for approximating the evidence or marginal likelihood in models with Wishart priors on covariance matrices, such as Gaussian graphical models or mixture models, where they enter variational bounds on the log-evidence.[20]
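The digamma and trigamma expressions above translate directly into code. The following sketch (illustrative parameters only) evaluates the closed-form mean and variance of \log\det S and compares them with Monte Carlo estimates.

```python
import numpy as np
from scipy.special import digamma, polygamma
from scipy.stats import wishart

# Closed-form log-determinant moments versus Monte Carlo estimates
# (illustrative parameters p = 3, n = 9).
rng = np.random.default_rng(5)
p, n, reps = 3, 9, 100_000
Sigma = np.diag([0.5, 1.0, 2.0])

i = np.arange(1, p + 1)
mean_th = np.linalg.slogdet(Sigma)[1] + digamma((n + 1 - i) / 2).sum() + p * np.log(2)
var_th = polygamma(1, (n + 1 - i) / 2).sum()   # trigamma terms

draws = wishart.rvs(df=n, scale=Sigma, size=reps, random_state=rng)
logdets = np.linalg.slogdet(draws)[1]          # slogdet broadcasts over the stack of matrices
print(mean_th, logdets.mean())
print(var_th, logdets.var())
```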
Information Measures
Entropy
The differential entropy of a random matrix \mathbf{W} \sim \mathcal{W}_p(n, \boldsymbol{\Sigma}) measures the average uncertainty in its distribution over the space of positive definite matrices. It is given by the formula H(\mathbf{W}) = \log \Gamma_p\left(\frac{n}{2}\right) + \frac{np}{2} + \frac{p+1}{2} \log \left| 2 \boldsymbol{\Sigma} \right| - \frac{n - p - 1}{2} \sum_{i=1}^p \psi\left( \frac{n - p + i}{2} \right), where \Gamma_p(\cdot) denotes the multivariate gamma function, \psi(\cdot) is the digamma function, and the logarithm is the natural logarithm (so the entropy is measured in nats).[21] This expression is derived by evaluating the definition of differential entropy, H(\mathbf{W}) = -\int f(\mathbf{W}) \log f(\mathbf{W}) \, d\mathbf{W}, where f(\mathbf{W}) is the probability density function of the Wishart distribution. The integral simplifies using known expectations: the trace term \mathbb{E}[\operatorname{tr}(\boldsymbol{\Sigma}^{-1} \mathbf{W})] = np, and the log-determinant term \mathbb{E}[\log |\mathbf{W}|] = \log |\boldsymbol{\Sigma}| + \sum_{i=1}^p \psi\left( \frac{n + 1 - i}{2}\right) + p \log 2, both of which follow from the normalizing constant and the moments of the distribution.[21] In applications such as Bayesian inference for covariance matrices, this entropy quantifies the dispersion of \mathbf{W} over the cone of positive definite matrices, increasing with the dimensionality p and with the overall scale of \boldsymbol{\Sigma}. For fixed p, the entropy grows only logarithmically as n increases, even though the matrix itself scales like n \boldsymbol{\Sigma}; correspondingly, the normalized estimate \mathbf{W}/n concentrates around \boldsymbol{\Sigma}, reflecting reduced relative uncertainty in large-sample covariance estimation.[21]
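The entropy formula can be verified against its definition H(\mathbf{W}) = -\mathbb{E}[\log f(\mathbf{W})]. The sketch below (illustrative parameters; the helper name wishart_entropy is our own) implements the closed form with scipy.special.multigammaln and digamma and compares it with a Monte Carlo average of -\log f over simulated draws.

```python
import numpy as np
from scipy.special import multigammaln, digamma
from scipy.stats import wishart

def wishart_entropy(n, Sigma):
    """Differential entropy of W_p(n, Sigma) in nats, following the formula in the text."""
    p = Sigma.shape[0]
    i = np.arange(1, p + 1)
    logdet_2Sigma = p * np.log(2) + np.linalg.slogdet(Sigma)[1]
    return (multigammaln(n / 2, p) + n * p / 2
            + (p + 1) / 2 * logdet_2Sigma
            - (n - p - 1) / 2 * digamma((n - p + i) / 2).sum())

# Monte Carlo check: H = -E[log f(W)] (illustrative parameters).
rng = np.random.default_rng(6)
p, n, reps = 2, 5, 20_000
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.8]])
draws = wishart.rvs(df=n, scale=Sigma, size=reps, random_state=rng)
mc = -np.mean([wishart.logpdf(W, df=n, scale=Sigma) for W in draws])
print(wishart_entropy(n, Sigma), mc)
```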
Cross-entropy and KL-divergence
The cross-entropy between two Wishart distributions, denoted H(W_p(\nu, \Sigma_1) \| W_p(\nu, \Sigma_2)), is defined as H(W_1 \| W_2) = -\int f_{W_1}(S) \log f_{W_2}(S) \, dS, where f_{W_i} is the probability density function of the i-th distribution.[22] Substituting the Wishart density yields an expression involving the expected trace \mathbb{E}_{W_1}[\operatorname{tr}(\Sigma_2^{-1} S)] = \nu \operatorname{tr}(\Sigma_2^{-1} \Sigma_1), the expected log-determinant \mathbb{E}_{W_1}[\log |S|] = \log |\Sigma_1| + \sum_{i=1}^p \psi\left(\frac{\nu + 1 - i}{2}\right) + p \log 2, and normalizing constants that involve the multivariate gamma function \Gamma_p(\nu/2) and powers of 2.[22] This results in H(W_1 \| W_2) = -\frac{\nu - p - 1}{2} \mathbb{E}_{W_1}[\log |S|] + \frac{\nu}{2} \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) + \frac{\nu p}{2} \log 2 + \log \Gamma_p\left(\frac{\nu}{2}\right) + \frac{\nu}{2} \log |\Sigma_2|, assuming identical degrees of freedom \nu > p - 1.[22] The Kullback-Leibler divergence between two Wishart distributions with the same degrees of freedom, D_{\mathrm{KL}}(W_p(\nu, \Sigma_1) \| W_p(\nu, \Sigma_2)), simplifies to a closed-form expression: D_{\mathrm{KL}}(W_p(\nu, \Sigma_1) \| W_p(\nu, \Sigma_2)) = \frac{\nu}{2} \left[ \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) - p + \log \frac{|\Sigma_2|}{|\Sigma_1|} \right]. This formula arises from the difference of the log-densities, using the linearity of expectation for the trace term and the expected log-determinant under the Wishart measure; the log-determinant and multivariate gamma terms cancel when \nu is identical.[22] For differing degrees of freedom \nu_1 and \nu_2, the trace term carries the factor \nu_1/2, the log-determinant ratio carries \nu_2/2, and the divergence acquires the additional terms \log\left(\Gamma_p(\nu_2/2)/\Gamma_p(\nu_1/2)\right) and \frac{\nu_1 - \nu_2}{2} \psi_p(\nu_1/2), where \psi_p denotes the multivariate digamma function.[22] Special cases of the KL divergence arise in Bayesian contexts, for example between a Wishart prior and its posterior on a precision matrix, or when merging components of Gaussian-inverse-Wishart mixtures by minimizing the divergence to a single component.[23] Similarly, the KL divergence facilitates approximations between Wishart-distributed covariance matrices and multivariate normal likelihoods in conjugate settings, where the Wishart models the sum of outer products of normal vectors. Recent applications of these measures appear in variational Bayes methods for approximate posterior inference, such as in quasi-autoencoding variational Bayes for models with Wishart priors on precision matrices, where the KL term enters the evidence lower bound used for scalable covariance estimation.
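The equal-degrees-of-freedom KL formula is straightforward to implement and to verify by Monte Carlo, since D_{\mathrm{KL}} = \mathbb{E}_{W_1}[\log f_1(S) - \log f_2(S)]. The sketch below (illustrative parameters; the helper name wishart_kl_same_df is our own) does both.

```python
import numpy as np
from scipy.stats import wishart

def wishart_kl_same_df(nu, Sigma1, Sigma2):
    """KL(W_p(nu, Sigma1) || W_p(nu, Sigma2)) for equal degrees of freedom, closed form from the text."""
    p = Sigma1.shape[0]
    tr = np.trace(np.linalg.solve(Sigma2, Sigma1))
    logdet_ratio = np.linalg.slogdet(Sigma2)[1] - np.linalg.slogdet(Sigma1)[1]
    return 0.5 * nu * (tr - p + logdet_ratio)

# Monte Carlo check: D_KL = E_{W1}[log f1(S) - log f2(S)] (illustrative parameters).
rng = np.random.default_rng(7)
p, nu, reps = 2, 6, 20_000
Sigma1 = np.array([[1.0, 0.2], [0.2, 1.0]])
Sigma2 = np.array([[2.0, -0.3], [-0.3, 0.5]])

draws = wishart.rvs(df=nu, scale=Sigma1, size=reps, random_state=rng)
mc = np.mean([wishart.logpdf(S, df=nu, scale=Sigma1)
              - wishart.logpdf(S, df=nu, scale=Sigma2) for S in draws])
print(wishart_kl_same_df(nu, Sigma1, Sigma2), mc)
```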
Characteristic Function and Theorems
Characteristic function
The characteristic function of a random matrix S \sim W_p(n, \Sigma), where W_p(n, \Sigma) denotes the Wishart distribution with p \times p scale matrix \Sigma > 0 and n degrees of freedom, is defined as \phi_S(T) = \mathbb{E} \left[ \exp \left( i \operatorname{tr}(T S) \right) \right] = \left| I_p - 2 i \Sigma T \right|^{-n/2},
for all symmetric p \times p matrices T, with the complex power taken on the principal branch (the eigenvalues of \Sigma T are real, so the eigenvalues of I_p - 2 i \Sigma T lie in the half-plane \operatorname{Re}(z) > 0).[6] The expression remains valid for integer n \leq p-1, where the distribution is singular and has no density, and thus provides a complete characterization of the distribution via the uniqueness of characteristic functions.[25] The formula arises from the defining representation S = \sum_{k=1}^n X_k X_k^\top, where the X_k are i.i.d. N_p(0, \Sigma). The characteristic function of each outer product X_k X_k^\top follows from that of a Gaussian quadratic form, \mathbb{E}[\exp(i \operatorname{tr}(T X X^\top))] = |I_p - 2 i \Sigma T|^{-1/2}, and independence of the X_k yields this factor raised to the power n.[6] The characteristic function facilitates moment generation by successive differentiation with respect to the elements of T at T = 0, yielding expressions for the means, variances, and higher cumulants of the Wishart random matrix. It also underpins proofs of key properties, such as closure under convolution for independent Wishart matrices sharing the same scale matrix.[25]
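The characteristic function can be evaluated exactly through the (real) eigenvalues of \Sigma T and compared with a Monte Carlo estimate of \mathbb{E}[\exp(i \operatorname{tr}(T S))]. The sketch below (illustrative parameters, including an arbitrary small symmetric T) uses the factorization of the determinant over these eigenvalues with the principal branch.

```python
import numpy as np
from scipy.stats import wishart

# Exact characteristic function |I_p - 2 i Sigma T|^{-n/2} versus a Monte Carlo
# estimate of E[exp(i tr(T S))] (illustrative parameters; T is a small symmetric matrix).
rng = np.random.default_rng(8)
p, n, reps = 2, 5, 200_000
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
T = np.array([[0.10, 0.02],
              [0.02, -0.05]])   # symmetric "frequency" matrix

# Exact value: the eigenvalues of Sigma T are real (they equal those of L^T T L
# with Sigma = L L^T), and the CF factorizes over them on the principal branch.
L = np.linalg.cholesky(Sigma)
mu = np.linalg.eigvalsh(L.T @ T @ L)
cf_exact = np.prod((1 - 2j * mu) ** (-n / 2))

draws = wishart.rvs(df=n, scale=Sigma, size=reps, random_state=rng)   # shape (reps, p, p)
tr_TS = np.trace(T @ draws, axis1=1, axis2=2)                         # tr(T S_k) for each draw
cf_mc = np.mean(np.exp(1j * tr_TS))
print(cf_exact, cf_mc)
```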