The multivariate normal distribution, also known as the multivariate Gaussian distribution, is a continuous probability distribution that generalizes the univariate normal distribution to random vectors of arbitrary dimension n \geq 1. It is fully characterized by an n \times 1 mean vector \mu and an n \times n positive definite covariance matrix \Sigma, with the probability density function given by

f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \right),

where \mathbf{x} is the n \times 1 realization of the random vector, |\Sigma| denotes the determinant of \Sigma, and \Sigma^{-1} is the inverse covariance matrix.[1][2] This distribution arises naturally as the limiting form of sample means via the multivariate central limit theorem and serves as a foundational model in multivariate statistics due to its mathematical tractability and prevalence in real-world data.[3]

A defining feature of the multivariate normal distribution is its closure under affine transformations and linear combinations: any linear combination of its components follows a univariate normal distribution, and subvectors yield marginal distributions that are also multivariate normal, with parameters given by the corresponding subvector of \mu and submatrix of \Sigma.[2] Conditional distributions are likewise multivariate normal; for instance, in the bivariate case with components X_1 and X_2, the conditional X_1 \mid X_2 = x_2 is normal with mean \mu_1 + \rho \frac{\sigma_1}{\sigma_2} (x_2 - \mu_2) and variance \sigma_1^2 (1 - \rho^2), where \rho is the correlation coefficient.[2] Independence between subvectors holds if and only if the corresponding block of the covariance matrix is zero, making it a natural framework for modeling correlated variables.[1] The distribution is symmetric about \mu, with level sets forming ellipsoids defined by the Mahalanobis distance, and its moment-generating function is \exp(\mathbf{t}^T \mu + \frac{1}{2} \mathbf{t}^T \Sigma \mathbf{t}), facilitating derivations in inference.[2]

In statistical applications, the multivariate normal underpins techniques such as multivariate analysis of variance (MANOVA), principal component analysis, and outlier detection via Mahalanobis distances, often assuming approximate normality for large samples per the central limit theorem.[3] In machine learning, it forms the basis for Gaussian processes, probabilistic classifiers like quadratic and linear discriminant analysis, and data preprocessing methods such as whitening transformations to normalize feature scales.[4] Its degenerate form, where \Sigma is singular, models lower-dimensional data embedded in higher spaces, such as perfect linear dependencies.[2]
Definitions
Notation and Parametrization
The multivariate normal distribution describes the joint distribution of a p-dimensional random vector \mathbf{X}, denoted as \mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), where \boldsymbol{\mu} is the p \times 1 mean vector and \boldsymbol{\Sigma} is the p \times p covariance matrix, which is symmetric and positive semi-definite.[3][5] The mean vector \boldsymbol{\mu} functions as the location parameter, specifying the center or expected value of the distribution, E[\mathbf{X}] = \boldsymbol{\mu}.[6][5]

The covariance matrix \boldsymbol{\Sigma} serves as the dispersion parameter, quantifying the variances along the diagonal elements and the covariances off the diagonal, with \boldsymbol{\Sigma} = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T].[6][5] Its positive semi-definiteness ensures that all variances and covariances are feasible, meaning \mathbf{z}^T \boldsymbol{\Sigma} \mathbf{z} \geq 0 for any p \times 1 vector \mathbf{z}.[5] The eigenvalues of \boldsymbol{\Sigma} provide the scaling factors along the principal components, while the corresponding eigenvectors define the rotations or orientations of these components, shaping the ellipsoidal level sets of the distribution.[6]

The trace of \boldsymbol{\Sigma}, \operatorname{tr}(\boldsymbol{\Sigma}), represents the total variance across all dimensions, serving as a measure of the overall scale of dispersion.[3] In the non-degenerate case, where \boldsymbol{\Sigma} is positive definite and thus invertible, the precision matrix \boldsymbol{\Omega} = \boldsymbol{\Sigma}^{-1} is introduced, which inversely weights the variances and covariances and facilitates analysis of conditional dependencies.[6][5]
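As a concrete illustration of these parameters, the following sketch (in NumPy, with an arbitrary example covariance matrix chosen purely for demonstration) computes the total variance \operatorname{tr}(\boldsymbol{\Sigma}), the eigenvalues and eigenvectors that shape the ellipsoidal level sets, and the precision matrix \boldsymbol{\Omega} = \boldsymbol{\Sigma}^{-1}.

import numpy as np

# Illustrative parameters for a 3-dimensional multivariate normal.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[4.0, 1.2, 0.6],
                  [1.2, 2.0, 0.3],
                  [0.6, 0.3, 1.0]])

total_variance = np.trace(Sigma)          # tr(Sigma): overall scale of dispersion
eigvals, eigvecs = np.linalg.eigh(Sigma)  # scaling factors and principal directions
precision = np.linalg.inv(Sigma)          # Omega = Sigma^{-1}, the precision matrix

print("total variance:", total_variance)
print("eigenvalues:", eigvals)
print("precision matrix:\n", precision)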
Standard Multivariate Normal Vector
The standard multivariate normal vector, denoted \mathbf{Z} \sim \mathcal{N}_p(\mathbf{0}, \mathbf{I}_p), is a p-dimensional random vector whose components Z_1, \dots, Z_p are independent and each distributed as a standard univariate normal \mathcal{N}(0,1).[7] This distribution represents the canonical form of the multivariate normal with zero mean and unit variances along the diagonal, along with zero covariances between distinct components.[8]

The probability density function of \mathbf{Z} is

f(\mathbf{z}) = (2\pi)^{-p/2} \exp\left( -\frac{1}{2} \|\mathbf{z}\|^2 \right),

where \|\mathbf{z}\|^2 = \sum_{i=1}^p z_i^2 = \mathbf{z}^\top \mathbf{z}.[7] The components of \mathbf{Z} are independent and identically distributed (i.i.d.) as \mathcal{N}(0,1), reflecting the diagonal structure of the identity covariance matrix \mathbf{I}_p.[7] Consequently, the expected value is E[\mathbf{Z}] = \mathbf{0}, and the second-moment matrix satisfies E[\mathbf{Z} \mathbf{Z}^\top] = \mathbf{I}_p, confirming the orthogonality of the components.[8]

This standard form acts as a foundational reference distribution, enabling the construction of general multivariate normals \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) through affine transformations of \mathbf{Z}, such as \mathbf{X} = \boldsymbol{\mu} + \mathbf{A} \mathbf{Z} where \boldsymbol{\Sigma} = \mathbf{A} \mathbf{A}^\top.[9] It is commonly used for standardization in statistical inference and simulation, leveraging the simplicity of i.i.d. standard normals to model more complex correlated structures.[7]
General Multivariate Normal Vector
The general multivariate normal vector \mathbf{X} \in \mathbb{R}^p with mean \boldsymbol{\mu} \in \mathbb{R}^p and positive definite covariance matrix \boldsymbol{\Sigma} \in \mathbb{R}^{p \times p} is constructed via an affine transformation of a standard multivariate normal vector \mathbf{Z} \sim \mathcal{N}_p(\mathbf{0}, \mathbf{I}_p), where \mathbf{I}_p denotes the p \times p identity matrix.[10] Specifically, \mathbf{X} = \boldsymbol{\mu} + \mathbf{A} \mathbf{Z}, where \mathbf{A} is a p \times p matrix satisfying \boldsymbol{\Sigma} = \mathbf{A} \mathbf{A}^T. This representation ensures that \mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), as the transformation preserves normality while adjusting the location and scale.[10]

The matrix \mathbf{A} can be obtained through decompositions such as the Cholesky decomposition, where \mathbf{A} is the lower triangular Cholesky factor of \boldsymbol{\Sigma}, or the spectral decomposition, leveraging the eigenvalues and eigenvectors of \boldsymbol{\Sigma}.[7] In the non-degenerate case, \boldsymbol{\Sigma} is positive definite, guaranteeing that \mathbf{A} has full rank p and the distribution is absolutely continuous with respect to the Lebesgue measure on \mathbb{R}^p.[10]

A key property is that \mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) if and only if \mathbf{X} is an affine transformation of the standard normal vector \mathbf{Z}.[10] When \boldsymbol{\mu} = \mathbf{0}, this reduces to the centered multivariate normal \mathbf{X} = \mathbf{A} \mathbf{Z}, a special instance focusing solely on the covariance structure.[7]
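The factor \mathbf{A} is not unique; the following sketch (hypothetical example values) checks that both the Cholesky factor and the spectral factor \mathbf{Q}\mathbf{D}^{1/2} reproduce the same covariance matrix, so either may be used to construct \mathbf{X} = \boldsymbol{\mu} + \mathbf{A}\mathbf{Z}.

import numpy as np

Sigma = np.array([[4.0, 1.2, 0.6],
                  [1.2, 2.0, 0.3],
                  [0.6, 0.3, 1.0]])

# Cholesky factor: lower-triangular A with Sigma = A A^T.
A_chol = np.linalg.cholesky(Sigma)

# Spectral factor: A = Q D^{1/2} from the eigendecomposition Sigma = Q D Q^T.
eigvals, Q = np.linalg.eigh(Sigma)
A_spec = Q @ np.diag(np.sqrt(eigvals))

# Either factor satisfies A A^T = Sigma, so mu + A Z has covariance Sigma.
print(np.allclose(A_chol @ A_chol.T, Sigma))   # True
print(np.allclose(A_spec @ A_spec.T, Sigma))   # True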
Equivalent Characterizations
The multivariate normal distribution admits several equivalent mathematical characterizations that highlight its fundamental properties and distinguish it from other distributions. These characterizations provide alternative ways to define the distribution without relying solely on the affine transformation of a standard multivariate normal vector, emphasizing its uniqueness and structural features.

One key characterization is based on linear combinations of the random vector. A p-dimensional random vector \mathbf{X} follows a multivariate normal distribution with mean vector \boldsymbol{\mu} and covariance matrix \boldsymbol{\Sigma} if and only if, for every non-zero vector \mathbf{a} \in \mathbb{R}^p, the scalar random variable \mathbf{a}^T \mathbf{X} is univariate normal with mean \mathbf{a}^T \boldsymbol{\mu} and variance \mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a}. This property stems from the Cramér-Wold theorem, which asserts that the joint distribution of \mathbf{X} is uniquely determined by the one-dimensional distributions of all its linear projections. Consequently, verifying univariate normality for all such projections suffices to establish multivariate normality for \mathbf{X}.

Another equivalent characterization uses the characteristic function of \mathbf{X}. The distribution is multivariate normal with parameters \boldsymbol{\mu} and \boldsymbol{\Sigma} if and only if its characteristic function is

\phi_{\mathbf{X}}(\mathbf{t}) = \exp\left(i \mathbf{t}^T \boldsymbol{\mu} - \frac{1}{2} \mathbf{t}^T \boldsymbol{\Sigma} \mathbf{t}\right)

for all \mathbf{t} \in \mathbb{R}^p. This exponential quadratic form in \mathbf{t} uniquely identifies the multivariate normal among all distributions, since characteristic functions determine distributions uniquely, and this specific form corresponds precisely to the affine structure of the normal family.

The multivariate normal distribution also possesses a uniqueness property with respect to its first two moments. It is the unique distribution with given mean \boldsymbol{\mu} and covariance \boldsymbol{\Sigma} such that all linear combinations of its components are normally distributed; among elliptical distributions sharing these moments, only the multivariate normal satisfies this projection property, as other elliptical distributions (e.g., the multivariate t) yield non-normal univariate projections despite matching means and covariances. This uniqueness underscores why the multivariate normal is fully specified by \boldsymbol{\mu} and \boldsymbol{\Sigma}, unlike broader classes where higher-order moments vary independently.

Finally, joint normality propagates to subvectors. If \mathbf{X} is multivariate normal, then every subvector (any selection of its components) follows a multivariate normal distribution with the corresponding sub-mean and sub-covariance matrix derived from \boldsymbol{\mu} and \boldsymbol{\Sigma}. The converse does not hold for proper subvectors alone: normality of individual components or smaller subvectors does not by itself guarantee joint normality (see the counterexamples under Joint Normality and Independence). This consistency across marginals complements the linear combination characterization, as subvectors correspond to projections onto coordinate subspaces.
Probability Density Function
The probability density function (PDF) for a p-dimensional random vector \mathbf{X} following a non-degenerate multivariate normal distribution \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), where \boldsymbol{\mu} \in \mathbb{R}^p is the mean vector and \boldsymbol{\Sigma} is a p \times p positive definite covariance matrix, is given by

f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} \det(\boldsymbol{\Sigma})^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right), \quad \mathbf{x} \in \mathbb{R}^p.[11]

The exponent involves the quadratic form Q = (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}), which represents the squared Mahalanobis distance from \mathbf{x} to \boldsymbol{\mu}, a scaled Euclidean distance that accounts for the correlations encoded in \boldsymbol{\Sigma}.[11]

For the bivariate case (p=2), with mean \boldsymbol{\mu} = (\mu_1, \mu_2)^T and covariance matrix \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix}, where -1 < \rho < 1 is the correlation coefficient, the PDF simplifies to

f(x_1, x_2) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left( -\frac{1}{2(1 - \rho^2)} \left[ \frac{(x_1 - \mu_1)^2}{\sigma_1^2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2} - 2\rho \frac{(x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_1 \sigma_2} \right] \right).[11]

The level sets of constant density, where Q = c for some constant c > 0, form ellipsoids centered at \boldsymbol{\mu} with orientation and shape determined by the eigenvectors and eigenvalues of \boldsymbol{\Sigma}, providing a geometric interpretation of the distribution's spread and dependence structure.[11]
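A minimal sketch of evaluating this density numerically, using a Cholesky factorization for the log-determinant and the quadratic form, with scipy.stats.multivariate_normal as a cross-check (the example parameters are arbitrary):

import numpy as np
from scipy.stats import multivariate_normal

def mvn_logpdf(x, mu, Sigma):
    # Log-density of N_p(mu, Sigma) at x via the Cholesky factor of Sigma.
    p = len(mu)
    L = np.linalg.cholesky(Sigma)               # Sigma = L L^T
    z = np.linalg.solve(L, x - mu)              # solves L z = x - mu
    quad = z @ z                                # squared Mahalanobis distance
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log det(Sigma)
    return -0.5 * (p * np.log(2 * np.pi) + log_det + quad)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
x = np.array([0.5, 0.5])
print(mvn_logpdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).logpdf(x))  # should agree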
Cumulative and Tail Distributions
The cumulative distribution function (CDF) of a random vector \mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) is defined as

F(\mathbf{x}) = P(X_1 \leq x_1, \dots, X_p \leq x_p) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_p} f(\mathbf{u}) \, d\mathbf{u},

where f(\mathbf{u}) denotes the probability density function of \mathbf{X}.[12] In general, for p \geq 3, this p-dimensional integral lacks a closed-form expression and requires numerical methods for evaluation.[12] However, explicit expressions exist for the univariate (p=1) and bivariate (p=2) cases.[12]

For the univariate case (p=1), the CDF of the standard normal distribution \mathcal{N}(0,1) is

\Phi(z) = \frac{1}{2} \left[ 1 + \operatorname{erf}\left( \frac{z}{\sqrt{2}} \right) \right],

where \operatorname{erf}(\cdot) is the error function. For a general univariate normal \mathcal{N}(\mu, \sigma^2), the CDF is \Phi\left( \frac{x - \mu}{\sigma} \right).

For the bivariate case (p=2), the CDF can be expressed in explicit integral form as

\Phi_2(h,k;\rho) = \int_{-\infty}^h \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{t^2}{2} \right) \Phi\left( \frac{k - \rho t}{\sqrt{1 - \rho^2}} \right) \, dt,

where \Phi(\cdot) is the univariate standard normal CDF and \rho is the correlation coefficient (assuming standardized marginals with means 0 and variances 1). A closed-form expression is available specifically for orthant probabilities, such as the probability that both components are positive: for standardized \mathbf{X} \sim \mathcal{N}_2(\mathbf{0}, \mathbf{R}) with correlation \rho,

P(X_1 > 0, X_2 > 0) = \frac{1}{4} + \frac{1}{2\pi} \arcsin(\rho).

This formula, known as Sheppard's formula, applies to the positive orthant and extends to other orthants by symmetry.[13]

The complementary CDF, 1 - F(\mathbf{x}), gives the probability that at least one component exceeds its threshold, while the upper orthant probability P(X_1 > x_1, \dots, X_p > x_p) is the tail quantity most relevant in applications like risk assessment where joint extreme events are of interest.[12] More generally, probabilities over rectangular regions P(\mathbf{a} < \mathbf{X} \leq \mathbf{b}) can be recovered from the CDF by inclusion-exclusion over the 2^p corners of the rectangle; for p = 2, for example, P(a_1 < X_1 \leq b_1, a_2 < X_2 \leq b_2) = F(b_1, b_2) - F(a_1, b_2) - F(b_1, a_2) + F(a_1, a_2). Orthant probabilities are the special case in which each component's interval is a half-line, such as (0, \infty) in every coordinate.[12]
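The bivariate integral representation and Sheppard's formula can be checked against each other numerically; the sketch below (illustrative, using SciPy's one-dimensional quadrature) evaluates \Phi_2(0,0;\rho) and compares the implied positive-orthant probability with \frac{1}{4} + \frac{1}{2\pi}\arcsin(\rho).

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def bivariate_cdf(h, k, rho):
    # Phi_2(h, k; rho) via the one-dimensional integral representation above.
    integrand = lambda t: norm.pdf(t) * norm.cdf((k - rho * t) / np.sqrt(1 - rho**2))
    value, _ = quad(integrand, -np.inf, h)
    return value

rho = 0.6
# Positive-orthant probability by Sheppard's formula.
sheppard = 0.25 + np.arcsin(rho) / (2 * np.pi)
# The same probability from the CDF: P(X1>0, X2>0) = 1 - 1/2 - 1/2 + Phi_2(0, 0; rho).
via_cdf = 1 - norm.cdf(0) - norm.cdf(0) + bivariate_cdf(0.0, 0.0, rho)
print(sheppard, via_cdf)   # the two values should agree closely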
Properties
Moments and Expectations
The multivariate normal random vector \mathbf{X} \in \mathbb{R}^p with mean vector \boldsymbol{\mu} and covariance matrix \boldsymbol{\Sigma} has first moment \mathbb{E}[\mathbf{X}] = \boldsymbol{\mu}.[14] Element-wise, the expected value of the i-th component is \mathbb{E}[X_i] = \mu_i.[14] The second central moment, or variance-covariance matrix, is given by \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T] = \boldsymbol{\Sigma}, with the covariance between the i-th and j-th components satisfying \mathrm{Cov}(X_i, X_j) = \Sigma_{ij}.[14]

Due to the symmetry of the density function, all odd-order central moments vanish, so the third central moment tensor is zero and the distribution exhibits no skewness.[14] For even-order central moments, particularly the fourth, Isserlis' theorem provides a recursive structure: the expectation of a product of an even number 2m of centered components equals the sum, over all perfect matchings (pairings) of the 2m indices, of the product of the pairwise covariances for each pair.[15] For example, \mathbb{E}[(X_1 - \mu_1)(X_2 - \mu_2)(X_3 - \mu_3)(X_4 - \mu_4)] = \Sigma_{12}\Sigma_{34} + \Sigma_{13}\Sigma_{24} + \Sigma_{14}\Sigma_{23}. Higher raw moments follow by expanding \mathbf{X} = \boldsymbol{\mu} + (\mathbf{X} - \boldsymbol{\mu}) and applying the theorem term by term, yielding combinatorial sums of products of entries of \boldsymbol{\mu} and \boldsymbol{\Sigma}.[14]

The multivariate normal distribution is mesokurtic, with excess kurtosis equal to zero in the sense of Mardia's measure \gamma_2 = \frac{\beta_2}{p(p+2)} - 1 = 0, where \beta_2 = \mathbb{E}\left[\left((\mathbf{X} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu})\right)^2\right] = p(p+2).[16] In estimation contexts, the sample mean \bar{\mathbf{X}} is unbiased for \boldsymbol{\mu}, and the unbiased sample covariance \mathbf{S} = \frac{1}{n-1} \sum_{i=1}^n (\mathbf{X}_i - \bar{\mathbf{X}})(\mathbf{X}_i - \bar{\mathbf{X}})^T has expectation \boldsymbol{\Sigma}.[17]
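Isserlis' theorem is easy to verify by simulation; the sketch below (with an arbitrary positive definite covariance matrix) compares the theoretical fourth moment \Sigma_{12}\Sigma_{34} + \Sigma_{13}\Sigma_{24} + \Sigma_{14}\Sigma_{23} for a centered vector with a Monte Carlo estimate.

import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.5, 0.2, 0.1],
                  [0.5, 1.5, 0.4, 0.3],
                  [0.2, 0.4, 2.0, 0.6],
                  [0.1, 0.3, 0.6, 1.2]])
X = rng.multivariate_normal(np.zeros(4), Sigma, size=1_000_000)

# Isserlis: E[X1 X2 X3 X4] = S12*S34 + S13*S24 + S14*S23 for a centered Gaussian vector.
theory = (Sigma[0, 1] * Sigma[2, 3]
          + Sigma[0, 2] * Sigma[1, 3]
          + Sigma[0, 3] * Sigma[1, 2])
estimate = np.mean(X[:, 0] * X[:, 1] * X[:, 2] * X[:, 3])
print(theory, estimate)   # the Monte Carlo estimate should be close to the theoretical value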
Marginal and Conditional Distributions
One fundamental property of the multivariate normal distribution is that both its marginal and conditional distributions are also multivariate normal. Consider a random vector \mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) partitioned into subvectors \mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}, where \mathbf{X}_1 is p_1-dimensional and \mathbf{X}_2 is p_2-dimensional with p_1 + p_2 = p, and correspondingly \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix} and \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}, with \boldsymbol{\Sigma}_{21} = \boldsymbol{\Sigma}_{12}^\top.[18]

The marginal distribution of \mathbf{X}_1 is \mathcal{N}_{p_1}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}), and similarly the marginal distribution of \mathbf{X}_2 is \mathcal{N}_{p_2}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22}). This follows from integrating the joint density over the other subvector, yielding a density that matches the multivariate normal form with the corresponding sub-mean and sub-covariance matrix.[18][19]

The conditional distribution of \mathbf{X}_1 given \mathbf{X}_2 = \mathbf{x}_2 is \mathcal{N}_{p_1} \left( \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2), \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} \right), assuming \boldsymbol{\Sigma}_{22} is invertible. This result is derived by completing the square in the exponent of the joint density function, which separates the conditional density as a multivariate normal with the specified parameters. The conditional mean represents the best linear predictor of \mathbf{X}_1 based on \mathbf{X}_2, aligning with the linear regression interpretation in the multivariate normal framework.[18]

In the special case of the bivariate normal distribution, where \mathbf{X} = (X_1, X_2)^\top \sim \mathcal{N}_2 \left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix} \right) and \rho is the correlation coefficient, the conditional expectation is \mathbb{E}[X_1 \mid X_2 = x_2] = \mu_1 + \rho \frac{\sigma_1}{\sigma_2} (x_2 - \mu_2), and the conditional variance is \mathrm{Var}(X_1 \mid X_2 = x_2) = \sigma_1^2 (1 - \rho^2). These expressions specialize the general formulas, highlighting the linear dependence in the mean and the role of correlation in reducing variance.[20]

A key feature of these conditional distributions is homoscedasticity: the conditional covariance matrix \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} (or \sigma_1^2 (1 - \rho^2) in the bivariate case) is constant and does not depend on the observed value \mathbf{x}_2. This constant variance property underscores the multivariate normal's utility in regression models where prediction errors have uniform dispersion.[18][20]
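The conditional-distribution formulas translate directly into code; the following sketch (illustrative values) computes the conditional mean and covariance of one block given an observed value of the other from a partitioned \boldsymbol{\mu} and \boldsymbol{\Sigma}.

import numpy as np

def conditional_mvn(mu, Sigma, idx1, idx2, x2):
    # Parameters of X1 | X2 = x2 for a partitioned multivariate normal.
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    S22_inv = np.linalg.inv(S22)
    cond_mean = mu1 + S12 @ S22_inv @ (x2 - mu2)
    cond_cov = S11 - S12 @ S22_inv @ S12.T
    return cond_mean, cond_cov

mu = np.array([1.0, 2.0, 0.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
# Condition the first component on the last two being observed at (1.5, -0.5).
m, C = conditional_mvn(mu, Sigma, [0], [1, 2], np.array([1.5, -0.5]))
print(m, C)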
Joint Normality and Independence
A random vector \mathbf{X} = (X_1, \dots, X_k)^\top in \mathbb{R}^k is said to follow a multivariate normal distribution if every linear combination \mathbf{a}^\top \mathbf{X}, for any fixed vector \mathbf{a} \in \mathbb{R}^k, follows a univariate normal distribution. This characterization ensures that the joint distribution captures the full structure of normality across dimensions, as univariate normality of all such combinations implies the vector's joint normality.[21] Equivalently, the distribution is multivariate normal if all finite-dimensional projections onto linear subspaces preserve normality, providing a foundational property for many statistical applications.

In the context of multivariate normal distributions, uncorrelatedness between subvectors implies statistical independence, a property unique to the normal family among common continuous distributions. Specifically, for a partitioned vector \mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix} following a multivariate normal distribution with covariance matrix \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, the subvectors \mathbf{X}_1 and \mathbf{X}_2 are independent if and only if \Sigma_{12} = \mathbf{0}.[22] This equivalence holds because the joint density factors into the product of the marginal densities when the covariance block is zero, reflecting the absence of linear dependence that would otherwise couple the components.[23] Consequently, zero correlation suffices for independence under joint normality, simplifying analyses in fields like finance and signal processing where such assumptions facilitate model tractability.[22]

Marginal normality does not guarantee joint multivariate normality, highlighting the need for explicit checks on dependence structures. A classic counterexample involves two random variables X and Y where Z \sim N(0,1) and U \sim \text{Uniform}(0,1) are independent, X = Z, and Y = Z if U < 0.5 or Y = -Z if U > 0.5; here, both X and Y are marginally standard normal, but their joint distribution is not bivariate normal, since P(|X| = |Y|) = 1 whereas this event has probability zero under a zero-covariance bivariate normal.[24] More generally, copulas allow construction of joint distributions with normal marginals but non-normal joints by pairing normal marginals with non-Gaussian copulas, such as Clayton or Gumbel copulas, which introduce tail dependencies absent in the multivariate normal.[25] These examples underscore that nonlinear dependencies can violate joint normality even when marginals are normal.

When components of a random vector are independent and each marginally normal, the joint distribution is multivariate normal with a diagonal covariance matrix, and the joint probability density function equals the product of the individual marginal densities.[26] For instance, if X_i \sim N(\mu_i, \sigma_i^2) independently for i=1,\dots,k, then \mathbf{X} = (X_1, \dots, X_k)^\top \sim N(\boldsymbol{\mu}, \Sigma) where \Sigma is diagonal with entries \sigma_i^2, and the density is f(\mathbf{x}) = \prod_{i=1}^k f_i(x_i).[22] This factorization property reinforces the role of independence in generating multivariate normals from univariate ones, central to simulation and modeling techniques.[26]
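The sign-flip counterexample described above is simple to reproduce; in the sketch below, both marginals are standard normal and the sample covariance is near zero, yet |X| = |Y| always, which rules out joint normality.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000
Z = rng.standard_normal(n)
U = rng.uniform(size=n)
X = Z
Y = np.where(U < 0.5, Z, -Z)          # flip the sign of Z with probability 1/2

# Both marginals look standard normal and the empirical covariance is near zero ...
print(X.mean(), X.std(), Y.mean(), Y.std(), np.cov(X, Y)[0, 1])
# ... yet |X| = |Y| with probability one, an event of probability zero under a
# zero-correlation bivariate normal, so (X, Y) is not jointly normal.
print(np.mean(np.abs(X) == np.abs(Y)))   # prints 1.0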
Transformations and Functions
One key property of the multivariate normal distribution is its closure under affine transformations. If \mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), where \boldsymbol{\mu} is the mean vector and \boldsymbol{\Sigma} is the positive definite covariance matrix, then for any q \times p matrix \mathbf{B} and q \times 1 vector \mathbf{c}, the transformed vector \mathbf{Y} = \mathbf{BX} + \mathbf{c} follows a (possibly degenerate) multivariate normal distribution \mathcal{N}_q(\mathbf{B} \boldsymbol{\mu} + \mathbf{c}, \mathbf{B} \boldsymbol{\Sigma} \mathbf{B}^T); when \mathbf{B} is a p \times p invertible matrix, the result is again non-degenerate. This holds because the transformation preserves normality through the linearity of expectation and the quadratic nature of the exponent in the density function.

A special case of the affine transformation arises with linear combinations of the components of \mathbf{X}. For a p \times 1 vector \mathbf{a}, the scalar random variable \mathbf{a}^T \mathbf{X} is univariate normal, distributed as \mathcal{N}_1(\mathbf{a}^T \boldsymbol{\mu}, \mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a}). This follows directly from setting \mathbf{B} = \mathbf{a}^T and \mathbf{c} = 0 in the affine result, highlighting how individual projections of multivariate normals remain normal. Such linear combinations are fundamental in deriving moments and in applications like regression coefficients.

Quadratic forms involving multivariate normal vectors yield more complex distributions, specifically generalized chi-squared distributions determined by the covariance structure. If \mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) and \mathbf{A} is a p \times p symmetric matrix, then \mathbf{X}^T \mathbf{A} \mathbf{X} can be decomposed into a linear combination of independent (non-central) chi-squared random variables, where the weights are the eigenvalues of \mathbf{A} \boldsymbol{\Sigma} and the non-centrality parameters depend on \boldsymbol{\mu}. This generalized chi-squared arises because diagonalizing the form via the spectral decomposition of \mathbf{A} \boldsymbol{\Sigma} transforms the variables into independent normals, each contributing a scaled chi-squared term. For instance, when \boldsymbol{\mu} = \mathbf{0} and \mathbf{A} = \boldsymbol{\Sigma}^{-1}, the form \mathbf{X}^T \boldsymbol{\Sigma}^{-1} \mathbf{X} simplifies to a central chi-squared with p degrees of freedom.

The likelihood function for parameters \boldsymbol{\mu} and \boldsymbol{\Sigma} based on a single observed vector \mathbf{x} from \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) is the probability density function evaluated at \mathbf{x}:

L(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{x}) = (2\pi)^{-p/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right).

This expression encapsulates the joint contribution of the mean and covariance to the data fit, with the quadratic term measuring deviation in Mahalanobis distance.
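The chi-squared reduction of the quadratic form can be checked by simulation; this sketch (arbitrary example covariance) confirms that \mathbf{X}^T \boldsymbol{\Sigma}^{-1} \mathbf{X} behaves like a central chi-squared variable with p degrees of freedom when \boldsymbol{\mu} = \mathbf{0}.

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
p = 3
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=500_000)
Sigma_inv = np.linalg.inv(Sigma)

# Row-wise quadratic forms Q_i = x_i^T Sigma^{-1} x_i.
Q = np.einsum('ij,jk,ik->i', X, Sigma_inv, X)
print(Q.mean(), p)                       # the mean of chi2_p is p
print(np.mean(Q <= chi2.ppf(0.95, p)))   # should be close to 0.95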
Information-Theoretic Measures
The differential entropy of a p-dimensional multivariate normal random vector \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) is given by

h(\mathbf{X}) = \frac{p}{2} \log (2 \pi e) + \frac{1}{2} \log |\boldsymbol{\Sigma}|,

where |\boldsymbol{\Sigma}| denotes the determinant of the covariance matrix \boldsymbol{\Sigma}.[27] This expression arises from integrating the negative log-density against the probability density function of the multivariate normal.[27] Among all continuous distributions with fixed covariance matrix \boldsymbol{\Sigma}, the multivariate normal achieves the maximum differential entropy, reflecting its maximal uncertainty or spread under the given second-moment constraints.[28]

The Kullback-Leibler (KL) divergence between two p-dimensional multivariate normals \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) and \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2) is

D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) \parallel \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)) = \frac{1}{2} \left[ (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) + \operatorname{tr}(\boldsymbol{\Sigma}_2^{-1} \boldsymbol{\Sigma}_1) - p + \log \frac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|} \right].

This closed-form measure quantifies the information loss when approximating the first distribution by the second, incorporating both mean and covariance differences.[29] The divergence is always nonnegative and equals zero if and only if the two distributions are identical.[29]

For jointly normal random vectors \mathbf{X} and \mathbf{Y} with joint covariance matrix \boldsymbol{\Sigma} partitioned into blocks \boldsymbol{\Sigma}_{XX}, \boldsymbol{\Sigma}_{YY}, and cross-covariance \boldsymbol{\Sigma}_{XY}, the mutual information is

I(\mathbf{X}; \mathbf{Y}) = \frac{1}{2} \log \frac{|\boldsymbol{\Sigma}_{XX}| \, |\boldsymbol{\Sigma}_{YY}|}{|\boldsymbol{\Sigma}|},

which in the scalar case with correlation \rho reduces to -\frac{1}{2} \log(1 - \rho^2). This formula is the difference between the sum of the marginal entropies and the joint entropy, showing that for Gaussian vectors the mutual information depends solely on the correlation structure.[30] Under independence, \boldsymbol{\Sigma}_{XY} = \mathbf{0}, yielding I(\mathbf{X}; \mathbf{Y}) = 0 and additivity of the differential entropy: h(\mathbf{X}, \mathbf{Y}) = h(\mathbf{X}) + h(\mathbf{Y}).[30]
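These closed-form expressions are straightforward to implement; the sketch below (illustrative covariance matrices) computes the differential entropy and the KL divergence using numerically stable log-determinants.

import numpy as np

def mvn_entropy(Sigma):
    p = Sigma.shape[0]
    return 0.5 * p * np.log(2 * np.pi * np.e) + 0.5 * np.linalg.slogdet(Sigma)[1]

def mvn_kl(mu1, S1, mu2, S2):
    p = S1.shape[0]
    S2_inv = np.linalg.inv(S2)
    diff = mu1 - mu2
    return 0.5 * (diff @ S2_inv @ diff
                  + np.trace(S2_inv @ S1)
                  - p
                  + np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1])

S1 = np.array([[1.0, 0.3], [0.3, 1.0]])
S2 = np.array([[2.0, 0.0], [0.0, 1.5]])
print(mvn_entropy(S1))
print(mvn_kl(np.zeros(2), S1, np.ones(2), S2))  # nonnegative; zero only for identical distributions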
Geometric Interpretation
The level sets of the probability density function of the multivariate normal distribution, defined by the equation (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = c for constant c > 0, form ellipsoids in \mathbb{R}^p.[31] These ellipsoids are centered at the mean vector \boldsymbol{\mu} and represent surfaces of constant density, with the density decreasing as c increases.[31] The orientation of these ellipsoids is determined by the eigenvectors of the covariance matrix \boldsymbol{\Sigma}, which define the principal axes along the directions of maximum variance.[31] The lengths of these axes are scaled by the square roots of the corresponding eigenvalues of \boldsymbol{\Sigma}, reflecting the spread of the distribution in those directions; larger eigenvalues correspond to longer axes.[31][32]

The Mahalanobis distance, given by d(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}, quantifies the deviation of a point \mathbf{x} from the mean in a space standardized by the covariance structure.[31] This distance generalizes the standard deviation measure from the univariate case, where the "1-sigma" interval covers approximately 68% of the probability mass.[33] In the p-dimensional case, the ellipsoid whose squared Mahalanobis radius equals \chi^2_p(0.68), the 0.68 quantile of the chi-squared distribution with p degrees of freedom, encloses 68% of the probability.[32]

When the covariance matrix \boldsymbol{\Sigma} is singular with rank r < p, the distribution is degenerate, and the ellipsoids collapse onto lower-dimensional affine subspaces of dimension r embedded in \mathbb{R}^p.[7] In this case, the support of the distribution lies entirely within these subspaces, and the density is undefined with respect to the full p-dimensional Lebesgue measure.[7]

In the bivariate case (p=2), the density contours appear as tilted ellipses when the components are correlated, with the tilt angle determined by the off-diagonal elements of \boldsymbol{\Sigma} that capture the correlation.[33] For zero correlation, the ellipses align with the coordinate axes; positive or negative correlations rotate them accordingly, illustrating the elliptical symmetry of the distribution.[32]
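The geometry can be made concrete numerically; this sketch (arbitrary bivariate covariance) derives the principal axes of the 68% ellipse from the eigendecomposition and checks the coverage by Monte Carlo.

import numpy as np
from scipy.stats import chi2

Sigma = np.array([[3.0, 1.2], [1.2, 1.0]])
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Principal axes of the density contours: directions from the eigenvectors,
# half-axis lengths sqrt(c * lambda_i) for squared Mahalanobis radius c.
c = chi2.ppf(0.68, df=2)        # squared radius of the ellipse enclosing 68% of the mass
half_axes = np.sqrt(c * eigvals)
print("axis directions (columns):\n", eigvecs)
print("half-axis lengths of the 68% ellipse:", half_axes)

# Monte Carlo check of the coverage.
rng = np.random.default_rng(3)
X = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
d2 = np.einsum('ij,jk,ik->i', X, np.linalg.inv(Sigma), X)
print(np.mean(d2 <= c))         # approximately 0.68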
Statistical Inference
Parameter Estimation
Given a random sample \mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_n of p-dimensional vectors independently and identically distributed as multivariate normal N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), the classical frequentist estimators for the mean vector \boldsymbol{\mu} and covariance matrix \boldsymbol{\Sigma} are derived from moment-based and likelihood-based approaches.[34]

The sample mean vector is defined as \hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{i=1}^n \mathbf{X}_i. This estimator is unbiased, satisfying E[\hat{\boldsymbol{\mu}}] = \boldsymbol{\mu}, and follows an exact multivariate normal distribution \hat{\boldsymbol{\mu}} \sim N_p\left( \boldsymbol{\mu}, \frac{1}{n} \boldsymbol{\Sigma} \right).[34] It is also consistent, converging in probability to \boldsymbol{\mu} as n \to \infty.[34]

The unbiased sample covariance matrix is \mathbf{S} = \frac{1}{n-1} \sum_{i=1}^n (\mathbf{X}_i - \hat{\boldsymbol{\mu}})(\mathbf{X}_i - \hat{\boldsymbol{\mu}})^T, which satisfies E[\mathbf{S}] = \boldsymbol{\Sigma} and is independent of \hat{\boldsymbol{\mu}}.[34] This estimator is consistent for \boldsymbol{\Sigma} as n \to \infty.[34]

The maximum likelihood estimators (MLEs), obtained by maximizing the likelihood function of the multivariate normal, are \hat{\boldsymbol{\mu}}_{\text{MLE}} = \hat{\boldsymbol{\mu}} and \hat{\boldsymbol{\Sigma}}_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^n (\mathbf{X}_i - \hat{\boldsymbol{\mu}})(\mathbf{X}_i - \hat{\boldsymbol{\mu}})^T = \frac{n-1}{n} \mathbf{S}.[34] While \hat{\boldsymbol{\mu}}_{\text{MLE}} is unbiased, \hat{\boldsymbol{\Sigma}}_{\text{MLE}} is biased downward for \boldsymbol{\Sigma} (with bias -\frac{1}{n} \boldsymbol{\Sigma}), but both are consistent as n \to \infty.[34] Additionally, \sqrt{n} (\hat{\boldsymbol{\mu}} - \boldsymbol{\mu}) is asymptotically normal N_p(\mathbf{0}, \boldsymbol{\Sigma}) under the true model.[34]

Under the multivariate normal assumption, the scaled sample covariance (n-1) \mathbf{S} follows a Wishart distribution W_p(\boldsymbol{\Sigma}, n-1), which describes the sampling variability of the covariance estimator and was originally derived for samples from a normal multivariate population.[35][34]
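A short sketch of these estimators on simulated data (hypothetical true parameters chosen for illustration):

import numpy as np

rng = np.random.default_rng(4)
mu_true = np.array([1.0, -1.0, 0.0])
Sigma_true = np.array([[2.0, 0.5, 0.2],
                       [0.5, 1.0, 0.3],
                       [0.2, 0.3, 1.5]])
n = 5000
X = rng.multivariate_normal(mu_true, Sigma_true, size=n)

mu_hat = X.mean(axis=0)                 # sample mean vector (also the MLE of mu)
centered = X - mu_hat
S = centered.T @ centered / (n - 1)     # unbiased sample covariance
Sigma_mle = centered.T @ centered / n   # MLE of Sigma, smaller by a factor (n-1)/n

print(mu_hat)
print(S)
print(Sigma_mle)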
Bayesian Analysis
In Bayesian analysis, the mean vector \mu and covariance matrix \Sigma of the multivariate normal distribution are assigned prior distributions to incorporate uncertainty and prior knowledge. The conjugate prior, which ensures that the posterior distribution belongs to the same family, is the normal-inverse-Wishart (NIW) distribution. This prior was originally developed by Ando and Kaufman (1965) for the analysis of independent multinormal processes with unknown mean and precision.[36] Under the NIW prior, the conditional distribution of \mu given \Sigma is multivariate normal: \mu \mid \Sigma \sim \mathcal{N}(\mu_0, \Sigma / \kappa_0), where \mu_0 is the prior mean vector and \kappa_0 > 0 scales the prior precision relative to \Sigma. Marginally, \Sigma follows an inverse Wishart distribution: \Sigma \sim \text{IW}(\nu_0, \Psi_0), with degrees of freedom \nu_0 > p - 1 (where p is the dimension) and scale matrix \Psi_0 a positive definite p \times p matrix.[37]

Given n independent observations \mathbf{x}_1, \dots, \mathbf{x}_n \sim \mathcal{N}_p(\mu, \Sigma), the posterior distribution is also NIW, with updated hyperparameters derived from the sufficient statistics: the sample mean \bar{\mathbf{x}} and the sum of squared deviations S = \sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top. The updates are:
\begin{align*}
\kappa_n &= \kappa_0 + n, \\
\mu_n &= \frac{\kappa_0 \mu_0 + n \bar{\mathbf{x}}}{\kappa_n}, \\
\nu_n &= \nu_0 + n, \\
\Psi_n &= \Psi_0 + S + \frac{\kappa_0 n}{\kappa_n} (\bar{\mathbf{x}} - \mu_0)(\bar{\mathbf{x}} - \mu_0)^\top.
\end{align*}

Thus, the conditional posterior is \mu \mid \Sigma, \mathbf{x} \sim \mathcal{N}(\mu_n, \Sigma / \kappa_n) and \Sigma \mid \mathbf{x} \sim \text{IW}(\nu_n, \Psi_n). The posterior mean of \mu is \mu_n, which coincides with its mode due to the normality of the conditional posterior. For \Sigma, the posterior mean is \mathbb{E}[\Sigma \mid \mathbf{x}] = \Psi_n / (\nu_n - p - 1) (provided \nu_n > p + 1), while the posterior mode is \Psi_n / (\nu_n + p + 1). These expressions shift weight from the prior toward the sample estimates as n increases, reflecting the conjugate structure's shrinkage properties.[37]

The posterior predictive distribution for a new observation \mathbf{x}^* \mid \mathbf{x} integrates out the uncertainty in \mu and \Sigma, yielding a multivariate Student-t distribution:

\mathbf{x}^* \mid \mathbf{x} \sim t_p\left( \mu_n, \frac{\Psi_n}{\nu_n - p + 1} \left(1 + \frac{1}{\kappa_n}\right), \nu_n - p + 1 \right),

with \nu_n - p + 1 degrees of freedom (requiring \nu_n > p - 1). This distribution captures both parameter uncertainty and sampling variability, with heavier tails than the normal due to the inverse Wishart component.[37]

To select the NIW hyperparameters when substantive prior information is unavailable, empirical Bayes approaches estimate them by maximizing the marginal likelihood of the data, treating the hyperparameters as quantities to be tuned from the data. This method, which balances prior informativeness with data fit, has been applied in high-dimensional settings using Wishart-based priors to induce sparsity or shrinkage in covariance estimates.[38]
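A minimal sketch of the conjugate update (the helper name niw_posterior and the hyperparameter values are illustrative, not taken from the cited sources):

import numpy as np

def niw_posterior(X, mu0, kappa0, nu0, Psi0):
    # Posterior NIW hyperparameters given an n x p data matrix X and an NIW prior.
    n, p = X.shape
    xbar = X.mean(axis=0)
    centered = X - xbar
    S = centered.T @ centered                     # sum of squared deviations about xbar
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    dev = (xbar - mu0).reshape(-1, 1)
    Psi_n = Psi0 + S + (kappa0 * n / kappa_n) * (dev @ dev.T)
    return mu_n, kappa_n, nu_n, Psi_n

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.4], [0.4, 2.0]], size=200)
mu_n, kappa_n, nu_n, Psi_n = niw_posterior(X, mu0=np.zeros(2), kappa0=1.0, nu0=4.0, Psi0=np.eye(2))
print(mu_n)
print(Psi_n / (nu_n - 2 - 1))   # posterior mean of Sigma, valid when nu_n > p + 1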
Normality Testing
Testing whether a sample arises from a multivariate normal distribution is crucial in many statistical analyses, as violations can affect inference validity. Several formal and graphical methods exist for this purpose, focusing on deviations in skewness, kurtosis, or overall shape. These tests are particularly important in high-dimensional settings where univariate checks may overlook joint dependencies.[39]

Mardia's test assesses multivariate normality by evaluating sample skewness and kurtosis measures, which under the null hypothesis of normality follow known limiting distributions. The skewness measure \beta_{1,p} quantifies asymmetry, and the statistic n\beta_{1,p}/6 is approximately \chi^2-distributed with p(p+1)(p+2)/6 degrees of freedom, where p is the dimension; the kurtosis measure \beta_{2,p} captures tail heaviness and is assessed via the z-statistic z = (\beta_{2,p} - p(p + 2)) / \sqrt{8p(p + 2)/n}, which approximately follows a standard normal distribution N(0,1) for large n. These measures extend univariate moments to the multivariate case and are affine-invariant, making the test robust to linear transformations. The test rejects normality if either statistic exceeds critical values at a chosen significance level.[40][41]

The Henze-Zirkler test provides an omnibus approach by integrating a smoothed empirical characteristic function against the normal characteristic function, yielding a statistic that is invariant under affine transformations and consistent against a broad class of alternatives. This method avoids reliance on moments, offering advantages in detecting subtle departures from normality without assuming finite moments. It is computed via numerical integration and has shown competitive performance in simulation studies across various dimensions and sample sizes.

Graphical methods, such as multivariate quantile-quantile (Q-Q) plots, offer visual diagnostics by plotting ordered squared Mahalanobis distances against quantiles of the \chi^2_p distribution; deviations from the diagonal line indicate non-normality, particularly in tails or clusters. Extensions of the Shapiro-Wilk test to multivariate data, such as those adapting the W statistic via canonical correlations or projection pursuit, provide empirical p-values for formal assessment, though they are computationally intensive for high dimensions. These tools complement formal tests by highlighting specific violation types, like outliers or asymmetry.

Regarding power, these tests vary in sensitivity: Mardia's excels at detecting skewness and kurtosis excesses but may underperform against symmetric heavy-tailed alternatives, while the Henze-Zirkler test demonstrates robust power for departures in tails, dependence structures, or mixtures, often outperforming moment-based methods in moderate to high dimensions. Simulation studies indicate that power improves with sample size but can degrade in very high dimensions without adjustments.[39]
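Mardia's statistics are simple to compute directly; the sketch below (assuming the common convention of the biased 1/n covariance estimate) returns the skewness and kurtosis statistics with their approximate p-values.

import numpy as np
from scipy.stats import chi2, norm

def mardia_statistics(X):
    # Mardia's multivariate skewness and kurtosis with approximate p-values.
    n, p = X.shape
    centered = X - X.mean(axis=0)
    S = centered.T @ centered / n                 # biased covariance estimate
    G = centered @ np.linalg.inv(S) @ centered.T  # g_ij = (x_i - xbar)^T S^{-1} (x_j - xbar)

    b1 = np.sum(G ** 3) / n**2                    # multivariate skewness
    b2 = np.mean(np.diag(G) ** 2)                 # multivariate kurtosis

    skew_stat = n * b1 / 6.0
    skew_df = p * (p + 1) * (p + 2) / 6.0
    skew_p = chi2.sf(skew_stat, skew_df)

    kurt_z = (b2 - p * (p + 2)) / np.sqrt(8.0 * p * (p + 2) / n)
    kurt_p = 2 * norm.sf(abs(kurt_z))
    return skew_stat, skew_p, kurt_z, kurt_p

rng = np.random.default_rng(6)
X = rng.multivariate_normal(np.zeros(3), np.eye(3), size=1000)
print(mardia_statistics(X))   # large p-values are expected for truly normal data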
Discriminant Analysis
Gaussian discriminant analysis (GDA) is a probabilistic classification method that assumes observations from each class follow a multivariate normal distribution, leveraging Bayes' theorem to compute posterior class probabilities. Given an observation \mathbf{x} and K classes, the posterior probability for class j is P(y = j \mid \mathbf{x}) = \frac{\pi_j f_j(\mathbf{x})}{\sum_{k=1}^K \pi_k f_k(\mathbf{x})}, where \pi_j is the prior probability of class j and f_j(\mathbf{x}) is the class-conditional density, modeled as the probability density function of a multivariate normal distribution \mathcal{N}(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j). Classification assigns \mathbf{x} to the class with the maximum posterior probability, equivalent to minimizing expected misclassification risk under 0-1 loss.

When class-conditional covariance matrices are equal across all classes (\boldsymbol{\Sigma}_j = \boldsymbol{\Sigma} for all j), GDA simplifies to linear discriminant analysis (LDA), where decision boundaries between classes are linear hyperplanes in the feature space. LDA derives from maximizing the separation between class means relative to within-class variability, assuming multivariate normality and equal covariances, as originally formulated by Fisher for taxonomic classification using iris measurements. The linear discriminant function takes the form \delta_j(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_j - \frac{1}{2} \boldsymbol{\mu}_j^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_j + \log \pi_j, leading to linear boundaries since the quadratic terms cancel out.[42]

In contrast, quadratic discriminant analysis (QDA) relaxes the equal covariance assumption, allowing each class to have its own \boldsymbol{\Sigma}_j, resulting in quadratic decision boundaries that better capture differing class spreads and orientations. The discriminant function becomes \delta_j(\mathbf{x}) = -\frac{1}{2} \log |\boldsymbol{\Sigma}_j| - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j) + \log \pi_j, where the quadratic form in \mathbf{x} arises from the inverse covariance terms. QDA, as an extension of the Bayesian framework for multivariate normals, provides greater flexibility for datasets where class covariances vary but increases parameter estimation demands, potentially leading to overfitting in small samples.

The performance of these methods is evaluated through misclassification probabilities, which quantify the overlap between class-conditional densities. The overall error rate is P(\text{error}) = \sum_{j=1}^K \pi_j \sum_{k \neq j} \int_{\mathcal{R}_k} f_j(\mathbf{x}) \, d\mathbf{x}, where \mathcal{R}_k denotes the decision region for class k; for multivariate normals, these integrals reflect the degree of separation between means relative to covariance structures, with lower overlap yielding smaller error rates. In LDA, equal covariances promote linear separability, while QDA can achieve lower error rates when covariances differ but may estimate error rates less reliably due to higher variance in covariance estimates.[43]
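The discriminant functions above can be evaluated directly; this sketch uses two illustrative classes with hand-picked parameters (in practice the means, covariances, and priors are estimated from training data).

import numpy as np

def lda_score(x, mu_j, Sigma_inv, prior_j):
    # Linear discriminant score under a shared covariance matrix.
    return x @ Sigma_inv @ mu_j - 0.5 * mu_j @ Sigma_inv @ mu_j + np.log(prior_j)

def qda_score(x, mu_j, Sigma_j, prior_j):
    # Quadratic discriminant score with a class-specific covariance matrix.
    diff = x - mu_j
    _, logdet = np.linalg.slogdet(Sigma_j)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.inv(Sigma_j) @ diff + np.log(prior_j)

mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.array([[1.0, 0.2], [0.2, 1.0]]), np.array([[0.5, 0.0], [0.0, 2.0]])]
priors = [0.5, 0.5]
pooled_inv = np.linalg.inv(0.5 * (Sigmas[0] + Sigmas[1]))   # shared covariance for LDA

x = np.array([1.0, 0.5])
print("LDA class:", np.argmax([lda_score(x, m, pooled_inv, p) for m, p in zip(mu, priors)]))
print("QDA class:", np.argmax([qda_score(x, m, S, p) for m, S, p in zip(mu, Sigmas, priors)]))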
Computational Methods
Sampling from the Distribution
Generating random samples from a multivariate normal distribution \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), where \boldsymbol{\mu} is the mean vector and \boldsymbol{\Sigma} is the p \times p covariance matrix, relies on linear transformations of independent standard normal variables. This approach leverages the property that if \mathbf{Z} \sim \mathcal{N}_p(\mathbf{0}, \mathbf{I}_p), then \mathbf{X} = \boldsymbol{\mu} + \mathbf{A} \mathbf{Z} \sim \mathcal{N}_p(\boldsymbol{\mu}, \mathbf{A} \mathbf{A}^T) for any matrix \mathbf{A} such that \boldsymbol{\Sigma} = \mathbf{A} \mathbf{A}^T.

For the non-degenerate case where \boldsymbol{\Sigma} is positive definite, the Cholesky decomposition provides an efficient method to obtain \mathbf{A}. Specifically, compute the lower triangular matrix \mathbf{L} satisfying \boldsymbol{\Sigma} = \mathbf{L} \mathbf{L}^T, then generate \mathbf{Z} \sim \mathcal{N}_p(\mathbf{0}, \mathbf{I}_p) and set \mathbf{X} = \boldsymbol{\mu} + \mathbf{L} \mathbf{Z}. This factorization is numerically stable and computationally efficient, requiring O(p^3) operations for the decomposition followed by O(p^2) per sample.[12] An alternative is the eigenvalue decomposition \boldsymbol{\Sigma} = \mathbf{Q} \mathbf{D} \mathbf{Q}^T, where \mathbf{Q} is orthogonal and \mathbf{D} is diagonal with positive eigenvalues; here, \mathbf{X} = \boldsymbol{\mu} + \mathbf{Q} \mathbf{D}^{1/2} \mathbf{Z}, with \mathbf{D}^{1/2} the diagonal matrix of square roots. This method is useful when the eigendecomposition aids further analysis, though it is generally more expensive than Cholesky.[12]

When \boldsymbol{\Sigma} is singular (rank r < p), the distribution is degenerate, concentrating on an r-dimensional affine subspace. Sampling involves projecting to the range of \boldsymbol{\Sigma}, generating samples in the reduced space using a full-rank factorization, and setting components in the null space to match the mean. The Moore-Penrose pseudoinverse can facilitate this by identifying the effective dimensions.

Implementations in statistical software handle these cases automatically. In R, the mvrnorm function from the MASS package generates multivariate normal samples using such matrix factorizations, with a tolerance option for handling near-singular covariance matrices. Python's NumPy library provides numpy.random.multivariate_normal, which uses similar linear algebra techniques and can warn about or reject covariance matrices that are not positive semi-definite.
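A minimal sampling sketch (arbitrary example parameters) implementing the Cholesky approach and comparing with NumPy's built-in generator method:

import numpy as np

rng = np.random.default_rng(7)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])

# Cholesky-based sampling: X = mu + L Z with Sigma = L L^T.
L = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((100_000, 2))
X = mu + Z @ L.T                 # each row is one draw from N_2(mu, Sigma)

print(X.mean(axis=0))            # close to mu
print(np.cov(X, rowvar=False))   # close to Sigma

# Equivalent draws via the generator's built-in method.
Y = rng.multivariate_normal(mu, Sigma, size=100_000)
print(np.cov(Y, rowvar=False))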
Numerical Evaluation of Densities and CDFs
The probability density function (PDF) of the multivariate normal distribution \mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) is evaluated numerically by computing the normalizing constant through the log-determinant \log \det(\boldsymbol{\Sigma}), which avoids underflow issues from direct determinant calculation, and the quadratic form (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}), often via Cholesky decomposition for stability.[44] For near-singular covariance matrices where \boldsymbol{\Sigma} is ill-conditioned, conditioning methods, such as rescaling variables or using an LDL decomposition, mitigate numerical instability in inverting \boldsymbol{\Sigma} and computing the log-determinant, ensuring accurate evaluation even when eigenvalues approach zero.

The cumulative distribution function (CDF), defined as F(\mathbf{x}) = P(\mathbf{X} \leq \mathbf{x}), lacks a closed-form expression for p \geq 3, necessitating approximation via numerical integration. Monte Carlo integration, enhanced by importance sampling to focus on regions near the integration boundaries, provides reliable estimates for general rectangular probabilities, with error decreasing as O(1/\sqrt{N}) for N samples. Quasi-Monte Carlo methods, employing low-discrepancy sequences like Sobol points, achieve faster convergence rates of nearly O(1/N) and are particularly effective for smooth integrands in moderate dimensions up to 20. The Genz algorithm targets rectangle and orthant probabilities, such as P(\mathbf{X} \geq \mathbf{0}), through a transformation to standardized variables and adaptive quasi-Monte Carlo integration over [0,1]^p, offering high accuracy for correlated cases with computational complexity scaling well up to dimension 100.

Tail probabilities, such as P(\mathbf{X} > \mathbf{a}) for large \mathbf{a}, can be bounded in high dimensions using concentration inequalities; for instance, since linear projections \langle \mathbf{X}, \mathbf{u} \rangle are univariate normal and sub-Gaussian, they satisfy P(|\langle \mathbf{X}, \mathbf{u} \rangle - \mathbb{E}[\langle \mathbf{X}, \mathbf{u} \rangle]| \geq t) \leq 2 \exp\left( -\frac{t^2}{2\sigma^2} \right), where \sigma^2 = \mathbf{u}^\top \boldsymbol{\Sigma} \mathbf{u}, facilitating union bounds over directions.[45]

The R package mvtnorm implements these techniques for accurate computation of multivariate normal densities and CDFs, supporting dimensions up to several hundred via Genz's quasi-Monte Carlo method and providing options for log-scale evaluations to handle extreme values.[46]
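As a brief illustration of these numerical routines (using SciPy's multivariate_normal object, with arbitrary example parameters), the sketch below evaluates a log-density, a rectangular CDF value, and a Monte Carlo cross-check of the same probability.

import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(3)
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
dist = multivariate_normal(mean=mu, cov=Sigma)

x = np.array([0.5, -0.2, 1.0])
print(dist.logpdf(x))   # log-density, stable for extreme values
print(dist.cdf(x))      # P(X1 <= 0.5, X2 <= -0.2, X3 <= 1.0), computed numerically

# Monte Carlo cross-check of the same rectangular probability.
rng = np.random.default_rng(8)
samples = rng.multivariate_normal(mu, Sigma, size=500_000)
print(np.mean(np.all(samples <= x, axis=1)))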