A multivariate random variable, also referred to as a random vector, is a measurable function from a probability space to \mathbb{R}^n, mapping outcomes to an n-dimensional vector (X_1, X_2, \dots, X_n)^\top where each X_i is a scalar random variable, enabling the analysis of joint behavior among multiple interdependent quantities.[1]

The joint distribution of a multivariate random variable captures the probabilities of simultaneous outcomes for its components. It is specified by the joint cumulative distribution function F_{X_1, \dots, X_n}(x_1, \dots, x_n) = P(X_1 \leq x_1, \dots, X_n \leq x_n), together with the joint probability mass function in the discrete case or the joint probability density function in the continuous case.[1] Marginal distributions for individual components are derived by summing or integrating the joint distribution over the other variables, such as f_{X_i}(x_i) = \int f_X(x) \, dx_1 \cdots dx_{i-1} \, dx_{i+1} \cdots dx_n for continuous densities.[1] Dependence between components is quantified through measures like covariance, defined as \operatorname{Cov}(X_j, X_k) = E[(X_j - \mu_j)(X_k - \mu_k)], and the correlation coefficient \rho_{X_j, X_k} = \frac{\operatorname{Cov}(X_j, X_k)}{\sqrt{\operatorname{Var}(X_j) \operatorname{Var}(X_k)}}, which ranges from -1 to 1 and indicates the strength of linear association.[2] Independence holds if the joint distribution factors into the product of the marginals; independence implies zero correlation, though the converse is not always true.[1]

The multivariate normal distribution is the cornerstone example of a multivariate random variable, characterized by a mean vector \mu and covariance matrix \Sigma, with density f(x) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\} for x \in \mathbb{R}^d.[3] It possesses key properties such as normality of all linear combinations and marginals, and independence of components when they are uncorrelated.[2] Multivariate random variables find broad application in fields like statistics, where they model multiple measurements such as age, blood pressure, and cholesterol levels in health studies, and in geophysics for analyzing correlated seismic data.[4][2]
Fundamentals
Definition
In probability theory, a multivariate random variable, also known as a random vector, is formally defined as a measurable function \mathbf{X}: \Omega \to \mathbb{R}^n from a probability space (\Omega, \mathcal{F}, P) to the n-dimensional Euclidean space \mathbb{R}^n, where n \geq 2 and \mathbf{X} = (X_1, \dots, X_n) with each X_i: \Omega \to \mathbb{R} being a univariate random variable.[5] This definition ensures that the joint behavior of the components can be analyzed through the induced probability measure on \mathbb{R}^n.[5]

Unlike a univariate random variable, which captures the probabilistic behavior of a single scalar outcome, a multivariate random variable emphasizes the interdependence and joint distribution among multiple components, allowing for the study of relationships such as correlation or dependence across dimensions.[5] While the standard formulation targets real-valued spaces \mathbb{R}^n, the concept extends to more general settings, including complex vector spaces \mathbb{C}^n in applications like signal processing, where the random vector is a measurable mapping to \mathbb{C}^n.[6]
Notation and Examples
Standard notation for a multivariate random variable employs boldface letters to denote vectors, with \mathbf{X} = (X_1, X_2, \dots, X_p)^\top representing a p-dimensional random vector, where each X_i is a univariate random variable and the superscript \top indicates transposition to a column vector. A specific realization or observed value of this vector is denoted by lowercase \mathbf{x} = (x_1, x_2, \dots, x_p)^\top, allowing distinction between the random entity and its outcomes.

To illustrate, consider a bivariate case where \mathbf{X} = (X_1, X_2)^\top captures the height and weight of adult humans in a population; here, X_1 might represent height in inches and X_2 weight in pounds, with joint variability reflecting biological dependencies.[4] For a continuous example, a multivariate normal random vector \mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) has mean vector \boldsymbol{\mu} and positive semi-definite covariance matrix \boldsymbol{\Sigma}, commonly modeling correlated measurements like stock returns or sensor data in p dimensions.

In the discrete setting, rolling two fair six-sided dice yields \mathbf{X} = (X_1, X_2)^\top, where X_1 and X_2 are the outcomes (each uniformly distributed over \{1, 2, \dots, 6\}), and the joint probability mass function is P(X_1 = i, X_2 = j) = 1/36 for i, j = 1, \dots, 6, giving a uniform joint distribution over 36 support points.[7]

Basic vector operations on multivariate random variables proceed component-wise: addition \mathbf{X} + \mathbf{Y} = (X_1 + Y_1, \dots, X_p + Y_p)^\top combines vectors, while scalar multiplication a\mathbf{X} = (aX_1, \dots, aX_p)^\top scales them; the distribution of the result, however, is determined by the joint distribution of the operands rather than by the operation alone.

The foundations of multivariate analysis trace to 19th-century statistics, notably Karl Pearson's 1896 exploration of correlations among multiple organ measurements in biological data, which laid groundwork for handling joint variability beyond univariate cases.[8]
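The two-dice example above can be checked numerically. The following is a minimal sketch in plain numpy; the event "the faces sum to 7" is an arbitrary illustrative choice.

```python
import numpy as np

# Joint PMF of two fair six-sided dice: P(X1 = i, X2 = j) = 1/36.
pmf = np.full((6, 6), 1 / 36)             # rows index X1 = 1..6, columns index X2 = 1..6

print(pmf.sum())                          # 1.0 -- normalization over the 36 support points

# Probability of an event, e.g. the sum of the dice equals 7 (illustrative choice):
faces = np.arange(1, 7)
event = faces[:, None] + faces[None, :] == 7
print(pmf[event].sum())                   # 6/36 = 0.1666...
```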
Joint Distributions
Discrete Multivariate Random Variables
A discrete multivariate random variable is a vector \mathbf{X} = (X_1, X_2, \dots, X_n) where each component X_i takes values in a countable set, typically integers or subsets thereof.[9] The probability structure is fully specified by the joint probability mass function (PMF), which assigns probabilities to each possible outcome in the support.[10]

The joint PMF is defined as p(\mathbf{x}) = P(\mathbf{X} = \mathbf{x}), where \mathbf{x} = (x_1, x_2, \dots, x_n) is a specific realization, and it satisfies p(\mathbf{x}) \geq 0 for all \mathbf{x} in the support.[11] The normalization condition requires that the probabilities sum to unity over the entire support: \sum_{\mathbf{x}} p(\mathbf{x}) = 1, where the sum is taken over all possible \mathbf{x}.[10] The support of \mathbf{X} is the finite or countably infinite set of points in \mathbb{Z}^n (or a subset) where p(\mathbf{x}) > 0.[9]

Consider the example of two independent fair coin flips, where \mathbf{X} = (X_1, X_2) with X_1 indicating the outcome of the first flip (0 for tails, 1 for heads) and X_2 the outcome of the second. The support is \{(0,0), (0,1), (1,0), (1,1)\}, and the joint PMF is uniform due to independence and fairness: p(\mathbf{x}) = 0.25 for each point in the support.[12] This can be represented in a table:
| X_1 \setminus X_2 | 0 | 1 |
|---|---|---|
| 0 | 0.25 | 0.25 |
| 1 | 0.25 | 0.25 |
To compute the probability of \mathbf{X} belonging to a subset A of the support, sum the joint PMF over the relevant points: P(\mathbf{X} \in A) = \sum_{\mathbf{x} \in A} p(\mathbf{x}).[11] For instance, in the coin flip example, P(X_1 = 1) = \sum_{x_2} p(1, x_2) = 0.25 + 0.25 = 0.5.[12]
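The table and the subset-summation rule translate directly into array operations; a minimal numpy sketch using the coin-flip values above:

```python
import numpy as np

# Joint PMF of two independent fair coin flips, rows = X1, columns = X2 (0 = tails, 1 = heads).
p = np.array([[0.25, 0.25],
              [0.25, 0.25]])

# P(X in A) by summing the joint PMF over A, here A = {X1 = 1}:
print(p[1, :].sum())                  # 0.5, matching P(X1 = 1) = 0.25 + 0.25

# Marginal PMFs of X1 and X2 by summing out the other coordinate:
print(p.sum(axis=1), p.sum(axis=0))   # [0.5 0.5] [0.5 0.5]
```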
Continuous Multivariate Random Variables
A continuous multivariate random variable \mathbf{X} = (X_1, \dots, X_n) takes values in \mathbb{R}^n, where the probability structure is described by a joint probability density function (PDF) f(\mathbf{x}) that is nonnegative for all \mathbf{x} \in \mathbb{R}^n and integrates to 1 over the entire space, i.e., \int_{\mathbb{R}^n} f(\mathbf{x}) \, d\mathbf{x} = 1.[10][9] The nonnegativity ensures that probabilities are never negative, while the integrability condition normalizes the total probability.[13]

The probability that \mathbf{X} falls into any measurable set A \subseteq \mathbb{R}^n is given by the integral of the joint PDF over that set: P(\mathbf{X} \in A) = \int_A f(\mathbf{x}) \, d\mathbf{x}. This integral represents the volume under the density surface over A, in contrast with the discrete case, where probabilities are obtained by summation.[14]

The joint cumulative distribution function (CDF) for \mathbf{X} is defined as F(\mathbf{x}) = P(\mathbf{X} \leq \mathbf{x}) = P(X_1 \leq x_1, \dots, X_n \leq x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f(\mathbf{u}) \, du_n \cdots du_1, which gives the probability that all components are less than or equal to their respective values.[9][14] For continuous distributions, the CDF is continuous, and the joint PDF is recovered as its mixed partial derivative wherever that derivative exists: f(\mathbf{x}) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F(\mathbf{x}).[4]

A prominent example is the bivariate normal distribution for n=2, where the joint PDF is

f(x,y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2} - \frac{2\rho (x-\mu_x)(y-\mu_y)}{\sigma_x \sigma_y} \right] \right),

with means \mu_x, \mu_y, standard deviations \sigma_x, \sigma_y > 0, and correlation coefficient \rho \in (-1,1).[15] This density captures linear dependence through \rho and reduces to the product of independent univariate normals when \rho = 0.[16] Marginal distributions can be obtained by integrating out one variable, yielding univariate normals.[17]
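The bivariate normal density above can be evaluated directly and, as a sanity check, compared with scipy's implementation; the parameter values and the evaluation point below are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters (arbitrary choices).
mu_x, mu_y = 1.0, -0.5
sig_x, sig_y, rho = 2.0, 1.5, 0.6

def bivariate_normal_pdf(x, y):
    """Joint PDF written exactly as in the formula above."""
    z = ((x - mu_x)**2 / sig_x**2
         + (y - mu_y)**2 / sig_y**2
         - 2 * rho * (x - mu_x) * (y - mu_y) / (sig_x * sig_y))
    norm = 2 * np.pi * sig_x * sig_y * np.sqrt(1 - rho**2)
    return np.exp(-z / (2 * (1 - rho**2))) / norm

mean = [mu_x, mu_y]
cov = [[sig_x**2, rho * sig_x * sig_y],
       [rho * sig_x * sig_y, sig_y**2]]

print(bivariate_normal_pdf(0.3, 0.1))
print(multivariate_normal(mean, cov).pdf([0.3, 0.1]))   # should agree
```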
Moments and Central Tendency
Expected Value Vector
The expected value vector of a multivariate random variable \mathbf{X} = (X_1, \dots, X_n)^T, denoted \boldsymbol{\mu} = E[\mathbf{X}], is the vector whose components are the expected values of the individual random variables: \boldsymbol{\mu} = (E[X_1], \dots, E[X_n])^T.[18] This vector represents the first moment of the joint distribution and serves as a measure of central tendency for the multivariate distribution.[19]

The i-th component \mu_i = E[X_i] is computed by integrating or summing the i-th coordinate weighted by the joint probability distribution of \mathbf{X}. For a discrete multivariate random variable with joint probability mass function p(\mathbf{x}),

E[X_i] = \sum_{\mathbf{x} \in \mathcal{X}} x_i \, p(\mathbf{x}),

where \mathcal{X} is the support of \mathbf{X}.[20] For a continuous multivariate random variable with joint probability density function f(\mathbf{x}),

E[X_i] = \int_{\mathbb{R}^n} x_i \, f(\mathbf{x}) \, d\mathbf{x}.

These expressions coincide with the expectations computed from the marginal distributions of the individual components, which are obtained by summing or integrating the joint distribution over the remaining variables.[14]

The expected value operator exhibits linearity, a fundamental property that facilitates analysis and computation. Specifically, for scalar constants a and b, and random vectors \mathbf{X} and \mathbf{Y} of the same dimension,

E[a \mathbf{X} + b \mathbf{Y}] = a E[\mathbf{X}] + b E[\mathbf{Y}].

This holds component-wise and extends to affine transformations \mathbf{Y} = A \mathbf{X} + \mathbf{b}, yielding E[\mathbf{Y}] = A E[\mathbf{X}] + \mathbf{b}, where A is a matrix and \mathbf{b} a vector.[21][18]

An unbiased estimator of \boldsymbol{\mu} is the sample mean vector obtained from an independent and identically distributed sample \mathbf{X}_1, \dots, \mathbf{X}_m drawn from the distribution of \mathbf{X}:

\bar{\mathbf{X}} = \frac{1}{m} \sum_{j=1}^m \mathbf{X}_j.

By linearity of expectation, E[\bar{\mathbf{X}}] = E[\mathbf{X}] = \boldsymbol{\mu}, confirming its unbiasedness regardless of the dependence structure among the components.[22][21]
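A quick Monte Carlo illustration of the sample mean vector and the affine rule E[A\mathbf{X} + \mathbf{b}] = A E[\mathbf{X}] + \mathbf{b}; the distribution, the matrix A, and the vector b are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0, 0.5])                       # illustrative mean vector
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, -0.4],
                  [0.0, -0.4, 0.5]])                  # illustrative covariance

X = rng.multivariate_normal(mu, Sigma, size=200_000)  # rows are iid draws of X

x_bar = X.mean(axis=0)                                # sample mean vector, unbiased for mu
print(x_bar)                                          # close to mu

A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
b = np.array([0.5, -1.0])

Y = X @ A.T + b                                       # realizations of Y = A X + b
print(Y.mean(axis=0))                                 # approximately A @ mu + b
print(A @ mu + b)
```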
Covariance Matrix
The covariance matrix of a multivariate random variable \mathbf{X} = (X_1, \dots, X_p)^T with mean vector \boldsymbol{\mu} = E[\mathbf{X}] is the p \times p symmetric matrix \boldsymbol{\Sigma} = \operatorname{Cov}(\mathbf{X}) = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T].[23] The element in the i-th row and j-th column of \boldsymbol{\Sigma} is the covariance \sigma_{ij} = \operatorname{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)], which quantifies the joint variability between components X_i and X_j.[23] When i = j, \sigma_{ii} reduces to the variance \operatorname{Var}(X_i), capturing the marginal spread of each component.[24]

For computation, the elements of the covariance matrix are obtained via the definition of expectation. In the discrete case, where \mathbf{X} has joint probability mass function p(\mathbf{x}),

\sigma_{ij} = \sum_{\mathbf{x}} (x_i - \mu_i)(x_j - \mu_j) p(\mathbf{x}),

assuming the second moments exist.[25] In the continuous case, with joint probability density function f(\mathbf{x}),

\sigma_{ij} = \int (x_i - \mu_i)(x_j - \mu_j) f(\mathbf{x}) \, d\mathbf{x}.

These expressions extend the univariate covariance to measure linear dependence across multiple dimensions.[25]

The covariance matrix possesses key properties that reflect its role in describing multivariate spread. It is positive semi-definite, meaning that for any non-zero p \times 1 vector \mathbf{a}, \mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a} \geq 0, with equality if and only if \mathbf{a} lies in the nullspace of \boldsymbol{\Sigma}; this follows from \mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a} = \operatorname{Var}(\mathbf{a}^T \mathbf{X}) \geq 0.[23] Additionally, the trace of \boldsymbol{\Sigma}, \operatorname{tr}(\boldsymbol{\Sigma}) = \sum_{i=1}^p \sigma_{ii}, equals the sum of the component variances, providing a scalar summary of total variability.[24]

In practice, the population covariance matrix is estimated from a sample of m independent observations \mathbf{x}_1, \dots, \mathbf{x}_m using the unbiased sample covariance matrix

\mathbf{S} = \frac{1}{m-1} \sum_{k=1}^m (\mathbf{x}_k - \bar{\mathbf{x}})(\mathbf{x}_k - \bar{\mathbf{x}})^T,

where \bar{\mathbf{x}} = \frac{1}{m} \sum_{k=1}^m \mathbf{x}_k is the sample mean vector.[26] This estimator is positive semi-definite and converges to \boldsymbol{\Sigma} as m \to \infty under standard conditions.[26]

For two multivariate random variables \mathbf{X} and \mathbf{Y} with mean vectors \boldsymbol{\mu}_X and \boldsymbol{\mu}_Y, the cross-covariance matrix is the rectangular matrix \operatorname{Cov}(\mathbf{X}, \mathbf{Y}) = E[(\mathbf{X} - \boldsymbol{\mu}_X)(\mathbf{Y} - \boldsymbol{\mu}_Y)^T], whose elements are the pairwise covariances between components of \mathbf{X} and \mathbf{Y}.[23] When \mathbf{X} = \mathbf{Y}, this reduces to the standard covariance matrix.[23]
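The sample covariance matrix and its basic properties (symmetry, positive semi-definiteness, trace equal to the sum of the variances) can be verified numerically; a sketch with arbitrary population parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                      # illustrative population covariance

X = rng.multivariate_normal(mu, Sigma, size=100_000)

S = np.cov(X, rowvar=False)                         # unbiased sample covariance (divides by m - 1)
print(S)                                            # close to Sigma

print(np.allclose(S, S.T))                          # symmetric
print(np.linalg.eigvalsh(S) >= -1e-12)              # eigenvalues nonnegative: positive semi-definite
print(np.trace(S), S[0, 0] + S[1, 1])               # trace = sum of component variances
```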
Measures of Dependence
Correlation Coefficients
In multivariate statistics, the correlation coefficient standardizes the measure of linear dependence between pairs of random variables, providing a scale-free interpretation of their relationship. For components X_i and X_j of a random vector with covariance \sigma_{ij} and standard deviations \sigma_i and \sigma_j, the Pearson product-moment correlation coefficient is defined as

\rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}.

This scalar lies in the interval [-1, 1], where \rho_{ij} = 1 indicates a perfect positive linear relationship (i.e., X_j = aX_i + b with a > 0), \rho_{ij} = -1 indicates a perfect negative linear relationship (with a < 0), and values near 0 suggest weak or no linear association.[27][28]

The full correlation matrix \mathbf{P} = (\rho_{ij}) assembles these pairwise coefficients, with 1s along the diagonal since each variable is perfectly correlated with itself. A key property is scale invariance: the entries of \mathbf{P} remain unchanged under affine transformations of the form X_i' = c_i X_i + d_i (with c_i > 0), unlike raw covariances, which depend on the units of measurement. This invariance facilitates comparisons across heterogeneous variables, such as height in meters and weight in kilograms. Additionally, \mathbf{P} is symmetric and positive semi-definite, ensuring all eigenvalues are non-negative.[28]

Partial correlation refines this measure by isolating the linear association between two variables while controlling for the effects of others, which is useful in multivariate settings for detecting direct dependencies. The partial correlation \rho_{ij \cdot \mathbf{k}} between X_i and X_j given a set of controlling variables \mathbf{k} ranges in [-1, 1] and inherits scale invariance, with interpretation analogous to the simple case: it equals \pm 1 for perfect partial linear relations after adjustment.[29]

In multiple regression, the multiple correlation coefficient R assesses the combined linear predictive power of several variables on a target, defined in the population as

R = \sqrt{ \frac{ \boldsymbol{\sigma}_{XY}^T \boldsymbol{\Sigma}_{XX}^{-1} \boldsymbol{\sigma}_{XY} }{ \sigma_Y^2 } },

where \boldsymbol{\sigma}_{XY} = \operatorname{Cov}(\mathbf{X}, Y) is the cross-covariance vector between the predictor vector \mathbf{X} and response Y, \boldsymbol{\Sigma}_{XX} is the covariance matrix of \mathbf{X}, and \sigma_Y^2 = \operatorname{Var}(Y); R thus ranges in [0, 1], with R^2 indicating the proportion of variance explained.[30]

As a non-linear alternative, Spearman's rank correlation coefficient evaluates monotonic rather than strictly linear dependencies, computed by applying the Pearson correlation to ranked data, which makes it robust to outliers and suitable for ordinal scales.
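Both the correlation matrix and the population multiple correlation coefficient R can be computed directly from a covariance matrix; in the sketch below, the joint covariance of (X_1, X_2, Y) is an arbitrary positive-definite example, with Y taken as the last component and (X_1, X_2) as the predictors.

```python
import numpy as np

# Illustrative joint covariance of (X1, X2, Y), with Y the response.
Sigma = np.array([[1.0, 0.5, 0.6],
                  [0.5, 2.0, 0.7],
                  [0.6, 0.7, 1.5]])

# Correlation matrix: divide each covariance by the product of the standard deviations.
d = np.sqrt(np.diag(Sigma))
P = Sigma / np.outer(d, d)
print(P)                                   # ones on the diagonal, off-diagonal entries in [-1, 1]

# Population multiple correlation of Y on X = (X1, X2).
Sigma_XX = Sigma[:2, :2]
sigma_XY = Sigma[:2, 2]
sigma_YY = Sigma[2, 2]
R2 = sigma_XY @ np.linalg.solve(Sigma_XX, sigma_XY) / sigma_YY
print(np.sqrt(R2))                         # multiple correlation coefficient R
```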
Uncorrelatedness and Orthogonality
In the context of a multivariate random vector \mathbf{X} = (X_1, \dots, X_n)^\top, the components are said to be uncorrelated if \operatorname{Cov}(X_i, X_j) = 0 for all i \neq j. This condition is equivalent to E[X_i X_j] = E[X_i] E[X_j] for i \neq j, assuming the expectations exist.[31][32]

A key property of uncorrelated components is that the covariance matrix \Sigma = \operatorname{Cov}(\mathbf{X}) is diagonal, with the variances \operatorname{Var}(X_i) along the main diagonal. Additionally, for uncorrelated X_1, \dots, X_n, the variance of their sum satisfies \operatorname{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \operatorname{Var}(X_i), which simplifies computations in linear combinations and central limit theorem applications.[18][33]

Orthogonality of random variables X_i and X_j (with i \neq j) is defined by E[X_i X_j] = 0, without requiring zero means, distinguishing it from uncorrelatedness, which centers the variables first. This concept is particularly valuable in signal processing, where orthogonal signals or noise components facilitate efficient decomposition and filtering in Hilbert spaces of random variables.[34][35]

An illustrative example arises in expansions using orthogonal polynomials, such as polynomial chaos expansions, where basis polynomials are chosen to be orthogonal with respect to the inner product defined by the expectation over a probability measure, enabling efficient representation of multivariate random functions.[36]

Independence implies uncorrelatedness but is a stronger condition involving the full joint distribution.[37]
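A standard counterexample separating the two notions: if X is symmetric about zero and Y = X^2, then \operatorname{Cov}(X, Y) = 0 even though Y is a deterministic function of X. A brief Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1_000_000)   # symmetric about zero
y = x**2                             # completely determined by x, hence dependent on it

# Sample covariance is near zero even though the variables are dependent.
print(np.cov(x, y)[0, 1])            # approximately 0

# Orthogonality in the E[XY] = 0 sense also holds here, since E[X^3] = 0.
print(np.mean(x * y))                # approximately 0
```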
Independence and Conditional Concepts
Joint Independence
In probability theory, two vector-valued random variables (random vectors) \mathbf{X} = (X_1, \dots, X_m) and \mathbf{Y} = (Y_1, \dots, Y_n) defined on the same probability space are said to be independent if, for every pair of Borel sets A \subseteq \mathbb{R}^m and B \subseteq \mathbb{R}^n, the joint probability satisfies P(\mathbf{X} \in A, \mathbf{Y} \in B) = P(\mathbf{X} \in A) P(\mathbf{Y} \in B).[4] This condition is equivalent to the joint cumulative distribution function (CDF) factorizing as F_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) = F_{\mathbf{X}}(\mathbf{x}) F_{\mathbf{Y}}(\mathbf{y}) for all \mathbf{x} \in \mathbb{R}^m, \mathbf{y} \in \mathbb{R}^n.[38] For discrete random vectors, this corresponds to the joint probability mass function (PMF) factorizing as p_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) = p_{\mathbf{X}}(\mathbf{x}) p_{\mathbf{Y}}(\mathbf{y}); for continuous random vectors, the joint probability density function (PDF), if it exists, satisfies f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) = f_{\mathbf{X}}(\mathbf{x}) f_{\mathbf{Y}}(\mathbf{y}).[4] The joint distribution under independence is thus the product measure of the marginal distributions of \mathbf{X} and \mathbf{Y}.[39]

Joint independence implies component-wise independence: if \mathbf{X} and \mathbf{Y} are independent, then every component X_i is independent of every component Y_j for all i = 1, \dots, m and j = 1, \dots, n, and any subset of components of \mathbf{X} remains independent of any subset of components of \mathbf{Y}.[4] Independence is symmetric, meaning if \mathbf{X} is independent of \mathbf{Y}, then \mathbf{Y} is independent of \mathbf{X}.[38] A key property is that joint independence implies uncorrelatedness: the covariance between any X_i and Y_j is zero, so \text{Cov}(X_i, Y_j) = 0 for all i, j.[40] However, the converse does not hold in general, as zero covariance measures only linear dependence, whereas independence captures all forms of probabilistic dependence.[40] Additionally, independence is preserved under measurable functions: if \mathbf{X} and \mathbf{Y} are independent, then for any measurable functions g: \mathbb{R}^m \to \mathbb{R}^p and h: \mathbb{R}^n \to \mathbb{R}^q, the transformed vectors g(\mathbf{X}) and h(\mathbf{Y}) are also independent.[4]

A classic example illustrates the product measure structure. Consider two independent continuous uniform random variables U \sim \text{Uniform}[0,1] and V \sim \text{Uniform}[0,1]; their joint PDF is f_{U,V}(u,v) = 1 for 0 \leq u, v \leq 1, which is the product of the marginal PDFs f_U(u) = 1 and f_V(v) = 1 on the unit square.[41] This uniform distribution on the product space [0,1] \times [0,1] exemplifies how independence leads to a rectangular support and a separable density, in contrast with dependent cases where the support may be restricted or the density non-factorizable.[39]
Marginal and Conditional Distributions
In multivariate probability distributions, the marginal distribution of a subset of random variables is obtained by integrating or summing the joint distribution over the remaining variables, effectively ignoring their values to focus on the distribution of the subset alone. For a discrete multivariate random variable \mathbf{X} = (X_1, \dots, X_n) with joint probability mass function p(\mathbf{x}), the marginal distribution of the i-th component X_i is given by

p_{X_i}(x_i) = \sum_{\{x_j : j \neq i\}} p(\mathbf{x}),

where the sum is taken over all possible values of the other components x_j for j \neq i.[42] Similarly, for a continuous multivariate random variable with joint probability density function f(\mathbf{x}), the marginal density of X_i is

f_{X_i}(x_i) = \int f(\mathbf{x}) \, d\mathbf{x}_{-i},

with the integral over the variables \mathbf{x}_{-i} excluding x_i.[43] These marginal distributions are themselves univariate (or multivariate if the subset has more than one variable), providing the probability laws for the selected components irrespective of the values taken by the others.[44]

The conditional distribution describes the distribution of one subset of variables given fixed values of another subset, capturing dependencies within the joint distribution. For continuous random variables partitioned as \mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2) with joint density f(\mathbf{x}_1, \mathbf{x}_2), the conditional density of \mathbf{X}_1 given \mathbf{X}_2 = \mathbf{x}_2 is

f(\mathbf{x}_1 \mid \mathbf{x}_2) = \frac{f(\mathbf{x}_1, \mathbf{x}_2)}{f(\mathbf{x}_2)},

provided f(\mathbf{x}_2) > 0, where f(\mathbf{x}_2) is the marginal density of \mathbf{X}_2.[45] For discrete cases, the conditional probability mass function follows analogously as p(\mathbf{x}_1 \mid \mathbf{x}_2) = \frac{p(\mathbf{x}_1, \mathbf{x}_2)}{p(\mathbf{x}_2)}.[14] This normalization ensures that the conditional distribution integrates or sums to 1 over \mathbf{x}_1, preserving the axioms of probability while conditioning on observed or fixed values of \mathbf{X}_2.[43]

A key property of these distributions is their role in simplifying multivariate analyses: marginals reduce dimensionality for studying individual components, while conditionals enable inference about one set of variables informed by another. For instance, in the bivariate normal distribution with mean vector \boldsymbol{\mu} and covariance matrix \boldsymbol{\Sigma}, the marginal distribution of each component is univariate normal, specifically X_i \sim \mathcal{N}(\mu_i, \Sigma_{ii}).[44] If the variables are independent, the conditional distributions coincide with the corresponding marginals.[45]
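In the discrete case, marginalization and conditioning amount to summing and renormalizing the joint table; a small sketch with an arbitrary 2 × 3 joint PMF:

```python
import numpy as np

# Arbitrary joint PMF p(x1, x2) on a 2 x 3 support (rows: values of X1, columns: values of X2).
p = np.array([[0.10, 0.20, 0.05],
              [0.15, 0.30, 0.20]])
assert np.isclose(p.sum(), 1.0)

# Marginal distributions: sum the joint PMF over the other variable.
p_x1 = p.sum(axis=1)            # P(X1 = x1)
p_x2 = p.sum(axis=0)            # P(X2 = x2)

# Conditional distribution of X1 given X2 equal to its second value: p(x1, x2) / p_x2.
p_x1_given_x2 = p[:, 1] / p_x2[1]
print(p_x1, p_x2)
print(p_x1_given_x2, p_x1_given_x2.sum())   # the conditional PMF sums to 1
```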
Transformations
Affine Transformations
An affine transformation of a multivariate random variable \mathbf{X} is defined as \mathbf{Y} = A \mathbf{X} + \mathbf{b}, where A is a constant matrix and \mathbf{b} is a constant vector.[4] This generalizes linear transformations by incorporating a location shift via \mathbf{b}.

The expected value of \mathbf{Y} transforms linearly as E[\mathbf{Y}] = A E[\mathbf{X}] + \mathbf{b}, reflecting the linearity of expectation extended to vectors.[4] Similarly, the covariance matrix updates according to \operatorname{Cov}(\mathbf{Y}) = A \operatorname{Cov}(\mathbf{X}) A^T, which preserves the positive semi-definiteness of the covariance while accounting for the scaling and rotation induced by A.[4]

In general, the distribution of \mathbf{Y} is the pushforward measure of the distribution of \mathbf{X} under the affine map, altering the support and density accordingly. For instance, if \mathbf{X} follows a multivariate normal distribution \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}), then \mathbf{Y} also follows a multivariate normal \mathcal{N}(A \boldsymbol{\mu} + \mathbf{b}, A \boldsymbol{\Sigma} A^T), demonstrating closure under affine transformations.[46]

A key application is standardization, which centers and scales \mathbf{X} to have mean zero and identity covariance: \mathbf{Z} = \boldsymbol{\Sigma}^{-1/2} (\mathbf{X} - \boldsymbol{\mu}), where \boldsymbol{\Sigma} is positive definite, yielding \mathbf{Z} \sim \mathcal{N}(\mathbf{0}, I).[47] This transformation facilitates comparisons across dimensions by removing location and scale effects.
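The transformation rules E[\mathbf{Y}] = A\boldsymbol{\mu} + \mathbf{b} and \operatorname{Cov}(\mathbf{Y}) = A\boldsymbol{\Sigma}A^T, together with the standardization \mathbf{Z} = \boldsymbol{\Sigma}^{-1/2}(\mathbf{X} - \boldsymbol{\mu}), can be written out in a few lines; here \boldsymbol{\Sigma}^{-1/2} is formed via an eigendecomposition, and all parameter values are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

A = np.array([[1.0, -1.0],
              [0.5, 2.0]])
b = np.array([3.0, 0.0])

# Population moments transform exactly under Y = A X + b:
mu_Y = A @ mu + b
Sigma_Y = A @ Sigma @ A.T
print(mu_Y)
print(Sigma_Y)

# Standardization: Z = Sigma^{-1/2} (X - mu) has mean 0 and identity covariance.
w, V = np.linalg.eigh(Sigma)                       # Sigma = V diag(w) V^T
Sigma_inv_sqrt = V @ np.diag(1 / np.sqrt(w)) @ V.T

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Z = (X - mu) @ Sigma_inv_sqrt.T
print(Z.mean(axis=0))                              # approximately [0, 0]
print(np.cov(Z, rowvar=False))                     # approximately the 2 x 2 identity
```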
Invertible Mappings
In the context of multivariate random variables, an invertible mapping refers to a transformation \mathbf{Y} = g(\mathbf{X}), where \mathbf{X} = (X_1, \dots, X_n)^\top is an n-dimensional random vector, g: \mathbb{R}^n \to \mathbb{R}^n is a differentiable and bijective function with a differentiable inverse g^{-1}, and \mathbf{Y} is the transformed random vector.[48] Such mappings preserve the dimensionality of the random vector, ensuring that \mathbf{Y} remains n-dimensional, and are fundamental for deriving the distribution of transformed variables from the original joint distribution.[10]

For jointly continuous random vectors with joint probability density function (pdf) f_{\mathbf{X}}(\mathbf{x}), the pdf of \mathbf{Y} under an invertible mapping is given by the change-of-variables formula:

f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(g^{-1}(\mathbf{y})) \left| \det J_{g^{-1}}(\mathbf{y}) \right|,

where J_{g^{-1}}(\mathbf{y}) is the Jacobian matrix of the inverse transformation, whose (i,j)-th entry is \frac{\partial (g^{-1})_i}{\partial y_j}(\mathbf{y}), and \det J_{g^{-1}}(\mathbf{y}) is its determinant.[48] This formula arises from the requirement that probabilities are preserved under the transformation, accounting for the local volume distortion measured by the absolute value of the Jacobian determinant.[49] Affine transformations represent a special linear case of such mappings, where the Jacobian is constant.[10]

Invertible mappings play a key role in copula theory, where the dependence structure of a multivariate distribution is separated from its marginals via invertible transformations of the cumulative distribution functions (CDFs) to uniform marginals on [0,1]^n.[50] Specifically, Sklar's theorem expresses the joint CDF as F(\mathbf{x}) = C(F_1(x_1), \dots, F_n(x_n)), with copula C, and the corresponding density transformation incorporates the Jacobians of the marginal inverse CDFs to yield the copula density.[50]

A classic example is the transformation from Cartesian to polar coordinates for a bivariate random vector (X, Y) with joint pdf f_{X,Y}(x,y), defined by R = \sqrt{X^2 + Y^2} and \Theta = \tan^{-1}(Y/X), with inverse X = R \cos \Theta, Y = R \sin \Theta. The Jacobian determinant of the inverse is r, so the joint pdf of (R, \Theta) is f_{R,\Theta}(r, \theta) = f_{X,Y}(r \cos \theta, r \sin \theta) \cdot r for r > 0 and \theta \in [0, 2\pi).[48] This transformation simplifies computations for rotationally symmetric distributions, such as the bivariate normal.[49]
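For a standard bivariate normal, the polar-coordinate formula gives f_{R,\Theta}(r, \theta) = \frac{1}{2\pi} r e^{-r^2/2}, so the radius R has density r e^{-r^2/2} (a Rayleigh density) and \Theta is uniform. A Monte Carlo sketch checking the radial density against a histogram:

```python
import numpy as np

rng = np.random.default_rng(4)
xy = rng.standard_normal((500_000, 2))       # standard bivariate normal, rho = 0
r = np.hypot(xy[:, 0], xy[:, 1])             # R = sqrt(X^2 + Y^2)

# Density of R implied by the Jacobian formula, after integrating out theta:
# f_R(r) = r * exp(-r^2 / 2).
hist, edges = np.histogram(r, bins=50, range=(0, 4), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
theory = centers * np.exp(-centers**2 / 2)

print(np.max(np.abs(hist - theory)))         # small, up to Monte Carlo and binning error
```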
Generating Functions
Characteristic Function
The characteristic function of a multivariate random vector \mathbf{X} \in \mathbb{R}^k is defined as

\phi_{\mathbf{X}}(\mathbf{t}) = E\left[e^{i \mathbf{t}^T \mathbf{X}}\right],

where i = \sqrt{-1} and \mathbf{t} \in \mathbb{R}^k.[51] If \mathbf{X} admits a joint probability density function f(\mathbf{x}), this expectation can be expressed as the Fourier transform

\phi_{\mathbf{X}}(\mathbf{t}) = \int_{\mathbb{R}^k} e^{i \mathbf{t}^T \mathbf{x}} f(\mathbf{x}) \, d\mathbf{x}.

This function exists for any distribution, as the complex exponential is bounded, and serves as a generating tool that encodes the full probabilistic structure of \mathbf{X}.[24]

Key properties of the characteristic function include its continuity in \mathbf{t} and the normalization \phi_{\mathbf{X}}(\mathbf{0}) = 1, since e^{i \mathbf{0}^T \mathbf{X}} = 1.[51] Moreover, by the uniqueness theorem, two random vectors have the same distribution if and only if their characteristic functions coincide for all \mathbf{t}, enabling inversion to recover the distribution from \phi_{\mathbf{X}}(\mathbf{t}).[52] Assuming the moments of \mathbf{X} exist and are finite, higher-order moments can be extracted via differentiation at the origin. Specifically, the mean vector is

E[\mathbf{X}] = -i \nabla_{\mathbf{t}} \phi_{\mathbf{X}}(\mathbf{0}),

where \nabla_{\mathbf{t}} denotes the gradient with respect to \mathbf{t}.[51] The elements of the covariance matrix \boldsymbol{\Sigma} are obtained from the second partial derivatives:

\text{Cov}(X_j, X_k) = \left. -\frac{\partial^2 \phi_{\mathbf{X}}(\mathbf{t})}{\partial t_j \partial t_k} \right|_{\mathbf{t}=\mathbf{0}} - E[X_j] E[X_k],

with the full second-moment matrix given by E[\mathbf{X} \mathbf{X}^T] = (-i)^2 \mathbf{H} \phi_{\mathbf{X}}(\mathbf{0}), where \mathbf{H} denotes the Hessian operator, so \mathbf{H} \phi_{\mathbf{X}}(\mathbf{0}) is the Hessian of \phi_{\mathbf{X}} evaluated at the origin.[52]

A prominent example arises for the multivariate normal distribution with mean \boldsymbol{\mu} and covariance \boldsymbol{\Sigma}, whose characteristic function is

\phi_{\mathbf{X}}(\mathbf{t}) = \exp\left(i \mathbf{t}^T \boldsymbol{\mu} - \frac{1}{2} \mathbf{t}^T \boldsymbol{\Sigma} \mathbf{t}\right).

This closed form directly reflects the location and dispersion parameters, and differentiation confirms that the moments match E[\mathbf{X}] = \boldsymbol{\mu} and \text{Cov}(\mathbf{X}) = \boldsymbol{\Sigma}.[44]
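The closed-form characteristic function of the multivariate normal can be checked against a direct Monte Carlo estimate of E[e^{i \mathbf{t}^T \mathbf{X}}]; the parameters and the evaluation point t below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
t = np.array([0.3, -0.7])                     # arbitrary evaluation point

# Closed form for the multivariate normal characteristic function.
phi_exact = np.exp(1j * t @ mu - 0.5 * t @ Sigma @ t)

# Monte Carlo estimate of E[exp(i t^T X)].
X = rng.multivariate_normal(mu, Sigma, size=500_000)
phi_mc = np.mean(np.exp(1j * (X @ t)))

print(phi_exact)
print(phi_mc)                                 # agrees up to Monte Carlo error
```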
Moment-Generating Function
The moment-generating function (MGF) of a multivariate random variable \mathbf{X} \in \mathbb{R}^p is defined as M_{\mathbf{X}}(\mathbf{t}) = E[e^{\mathbf{t}^T \mathbf{X}}], where \mathbf{t} \in \mathbb{R}^p.[53] This function is analytic in some neighborhood of the origin \mathbf{t} = \mathbf{0} whenever it exists in that region, allowing for a power series expansion that facilitates the extraction of distributional properties.[54]

The MGF relates directly to the moments and cumulants of \mathbf{X} through its partial derivatives evaluated at \mathbf{0}. Specifically, the mean vector is given by the gradient \nabla M_{\mathbf{X}}(\mathbf{0}) = E[\mathbf{X}], while higher-order mixed partial derivatives yield the corresponding joint moments, such as E[X_i X_j] from the second partial derivative with respect to t_i and t_j at \mathbf{0}.[55] If the MGF exists, it uniquely determines the distribution of \mathbf{X}.[56]

Key properties of the MGF include its behavior under addition of independent random vectors. If \mathbf{X} and \mathbf{Y} are independent, then M_{\mathbf{X} + \mathbf{Y}}(\mathbf{t}) = M_{\mathbf{X}}(\mathbf{t}) M_{\mathbf{Y}}(\mathbf{t}), mirroring the convolution property of their densities.[57]

For the multivariate normal distribution \mathbf{X} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), the MGF takes the explicit form

M_{\mathbf{X}}(\mathbf{t}) = \exp\left( \mathbf{t}^T \boldsymbol{\mu} + \frac{1}{2} \mathbf{t}^T \boldsymbol{\Sigma} \mathbf{t} \right),

which exists for all \mathbf{t} and confirms the distribution's moments via differentiation.[58]

However, the MGF does not exist for all multivariate distributions, particularly those with heavy tails lacking finite moments of all orders. For instance, the multivariate Cauchy distribution has no MGF, as its marginal components lead to divergent expectations in the defining integral.[59]
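As a small illustration of how moments follow from derivatives of the MGF, a central finite-difference gradient of the multivariate normal MGF at \mathbf{t} = \mathbf{0} recovers the mean vector; the step size and parameters are arbitrary choices for the sketch.

```python
import numpy as np

mu = np.array([2.0, -1.0, 0.5])
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 1.5, 0.3],
                  [0.0, 0.3, 0.8]])

def mgf(t):
    """Closed-form MGF of N(mu, Sigma)."""
    return np.exp(t @ mu + 0.5 * t @ Sigma @ t)

# Central finite differences at t = 0 approximate the gradient, which equals E[X] = mu.
h = 1e-5
grad = np.zeros(3)
for i in range(3):
    e = np.zeros(3)
    e[i] = h
    grad[i] = (mgf(e) - mgf(-e)) / (2 * h)

print(grad)    # approximately mu
```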
Advanced Properties
Quadratic Forms
A quadratic form of a multivariate random vector \mathbf{X} \in \mathbb{R}^p is defined as Q = \mathbf{X}^T A \mathbf{X}, where A is a p \times p symmetric matrix. This construction ensures Q is a scalar random variable that captures second-order interactions among the components of \mathbf{X}, playing a central role in multivariate analysis for tasks like deriving test statistics and measuring deviations from a hypothesized structure. If A is not symmetric, the form can be adjusted to use the symmetrized matrix (A + A^T)/2 without altering the value.[60]

For \mathbf{X} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), the moments of Q admit closed-form expressions. The expectation is given by

E[Q] = \operatorname{tr}(A \boldsymbol{\Sigma}) + \boldsymbol{\mu}^T A \boldsymbol{\mu},

where \operatorname{tr}(\cdot) denotes the matrix trace. The variance follows as

\operatorname{Var}(Q) = 2 \operatorname{tr}(A \boldsymbol{\Sigma} A \boldsymbol{\Sigma}) + 4 \boldsymbol{\mu}^T A \boldsymbol{\Sigma} A \boldsymbol{\mu}.

These results derive from the properties of multivariate normal moments and can be obtained using the expectation of products of quadratic forms.[61]

A prominent example is the squared Mahalanobis distance, D^2 = (\mathbf{X} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu}), which corresponds to the choice A = \boldsymbol{\Sigma}^{-1} applied to the centered vector \mathbf{X} - \boldsymbol{\mu}. Under multivariate normality, D^2 \sim \chi^2_p, a central chi-squared distribution with p degrees of freedom, providing a scale-free measure of multivariate dispersion. This distribution facilitates probabilistic statements, such as prediction regions defined by D^2 \leq \chi^2_{p, 1-\alpha}.[62]

Quadratic forms also underlie the Wishart distribution, which describes the sampling variability of covariance matrices. If \mathbf{X}_1, \dots, \mathbf{X}_n \stackrel{\text{iid}}{\sim} N_p(\mathbf{0}, \boldsymbol{\Sigma}), then the random matrix W = \sum_{i=1}^n \mathbf{X}_i \mathbf{X}_i^T follows a Wishart distribution W_p(n, \boldsymbol{\Sigma}), generalizing the chi-squared distribution to matrix-valued sums of outer products. The mean is E[W] = n \boldsymbol{\Sigma}, and it serves as the exact distribution for sample covariance matrices in centered normal samples.[63]
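Both the moment formula E[Q] = \operatorname{tr}(A\boldsymbol{\Sigma}) + \boldsymbol{\mu}^T A \boldsymbol{\mu} and the \chi^2_p law of the squared Mahalanobis distance are easy to check by simulation; the matrices A, \boldsymbol{\Sigma}, and the mean below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
p = 3
mu = np.array([1.0, 0.0, -2.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 0.2, 0.0],
              [0.2, 2.0, -0.1],
              [0.0, -0.1, 0.5]])                     # symmetric matrix of the quadratic form

X = rng.multivariate_normal(mu, Sigma, size=300_000)

Q = np.einsum('ni,ij,nj->n', X, A, X)                # Q = x^T A x for each draw
print(Q.mean(), np.trace(A @ Sigma) + mu @ A @ mu)   # Monte Carlo mean vs closed form

# Squared Mahalanobis distance is chi-squared with p degrees of freedom under normality.
diff = X - mu
D2 = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
print(np.mean(D2 <= chi2.ppf(0.95, df=p)))           # approximately 0.95
```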
Isserlis' Theorem for Higher Moments
Isserlis' theorem provides a method for computing the higher-order moments of a multivariate normal random vector, particularly the expectation of the product of its components. For a centered (zero-mean) multivariate normal random vector \mathbf{X} = (X_1, \dots, X_n)^\top \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}), the theorem states that if the product has an odd number of factors, the expectation is zero: E\left[\prod_{k=1}^{2m+1} X_{i_k}\right] = 0. For even order 2m, the expectation is the sum over all possible perfect matchings (pairings) of the indices, where each pairing contributes the product of the corresponding covariances:

E\left[\prod_{k=1}^{2m} X_{i_k}\right] = \sum_{\pi} \prod_{(j,l) \in \pi} \operatorname{Cov}(X_{i_j}, X_{i_l}),

with the sum taken over all (2m-1)!! = (2m)! / (2^m m!) distinct pairings \pi.[64][65]

In the general non-centered case, where \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) with \boldsymbol{\mu} \neq \mathbf{0}, the full form of Isserlis' theorem extends to sums over all partitions of the index set into singletons and pairs. Each such partition contributes the product of the means for the singletons and the covariances for the pairs, weighted appropriately. This generalization, derived by Withers, allows computation of arbitrary moments by accounting for the location parameter \boldsymbol{\mu}.[66]

The theorem was originally derived by Leon Isserlis in 1918 as a formula for product-moment coefficients of normal distributions. In physics, it is recognized as Wick's theorem, introduced by Gian-Carlo Wick in 1950 for evaluating time-ordered products of Gaussian fields, and has been applied in quantum field theory via Wick rotations to compute Feynman diagrams.[65]

Isserlis' theorem facilitates the evaluation of multivariate Gaussian integrals and the determination of moments beyond the second order, which are otherwise intractable due to the complexity of the joint density. These applications are essential in statistical physics, signal processing, and the derivation of higher cumulants for asymptotic analyses.[67][66]

For illustration, consider four components X_1, X_2, X_3, X_4 of a centered multivariate normal \mathbf{X}. The fourth moment is

E[X_1 X_2 X_3 X_4] = \operatorname{Cov}(X_1, X_2) \operatorname{Cov}(X_3, X_4) + \operatorname{Cov}(X_1, X_3) \operatorname{Cov}(X_2, X_4) + \operatorname{Cov}(X_1, X_4) \operatorname{Cov}(X_2, X_3),

corresponding to the three possible pairings.[64]
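The three-pairing identity for the fourth moment can be verified by simulation (fourth-moment estimates converge slowly, so only rough agreement is expected); the covariance matrix below is an arbitrary positive-definite choice.

```python
import numpy as np

rng = np.random.default_rng(7)
Sigma = np.array([[1.0, 0.5, 0.2, 0.0],
                  [0.5, 2.0, 0.3, 0.1],
                  [0.2, 0.3, 1.5, 0.4],
                  [0.0, 0.1, 0.4, 1.0]])

# Isserlis / Wick: E[X1 X2 X3 X4] as a sum over the three pairings.
exact = (Sigma[0, 1] * Sigma[2, 3]
         + Sigma[0, 2] * Sigma[1, 3]
         + Sigma[0, 3] * Sigma[1, 2])

X = rng.multivariate_normal(np.zeros(4), Sigma, size=2_000_000)
mc = np.mean(X[:, 0] * X[:, 1] * X[:, 2] * X[:, 3])

print(exact, mc)    # close, up to Monte Carlo error
```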
Applications
Portfolio Theory
In portfolio theory, the returns of multiple assets are modeled as a multivariate random vector \mathbf{R} = (R_1, \dots, R_n)^T, where each R_i denotes the random return of the i-th asset, capturing their joint probabilistic behavior, including dependencies, via the covariance structure.[68] This setup allows for the analysis of portfolio performance under uncertainty, treating asset returns as realizations from a joint distribution.

A portfolio is formed by allocating weights \mathbf{w} = (w_1, \dots, w_n)^T to these assets, yielding the portfolio return R_p = \mathbf{w}^T \mathbf{R}. The expected portfolio return is then \mathbb{E}[R_p] = \mathbf{w}^T \boldsymbol{\mu}, where \boldsymbol{\mu} = \mathbb{E}[\mathbf{R}] is the vector of individual expected returns. The portfolio variance, a key measure of risk, is given by

\operatorname{Var}(R_p) = \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w},

where \boldsymbol{\Sigma} is the covariance matrix of \mathbf{R}. This quadratic form highlights how correlations between assets influence overall risk, enabling diversification benefits when covariances are low.[69]

The foundational Markowitz model, introduced in 1952, formalizes portfolio optimization by minimizing variance subject to a target expected return \mathbf{w}^T \boldsymbol{\mu} = \mu_p and the full investment constraint \mathbf{w}^T \mathbf{1} = 1, where \mathbf{1} is a vector of ones. This mean-variance optimization problem is solved via quadratic programming, yielding the set of portfolios that offer the highest return for a given risk level or the lowest risk for a given return. The resulting collection of optimal portfolios traces the efficient frontier in the mean-variance plane, a hyperbolic curve representing undominated choices. Markowitz's approach revolutionized asset allocation by quantifying the risk-return tradeoff through multivariate moments.[69]

For illustration, consider a two-asset portfolio with returns R_1 and R_2, weights w_1 + w_2 = 1, expected returns \mu_1 and \mu_2, variances \sigma_1^2 and \sigma_2^2, and correlation \rho. The portfolio variance simplifies to \sigma_p^2 = w_1^2 \sigma_1^2 + w_2^2 \sigma_2^2 + 2 w_1 w_2 \rho \sigma_1 \sigma_2. When \rho < 1, diversification reduces \sigma_p below the weighted average of the individual standard deviations, demonstrating how negative or low correlations lower risk without sacrificing return potential; for \rho = 1, no diversification benefit exists. This case underscores the multivariate nature of risk in Markowitz's framework.[69]
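The two-asset case can be tabulated to show how the correlation drives the diversification benefit; the expected returns, volatilities, and weights below are illustrative numbers only.

```python
import numpy as np

# Illustrative two-asset example (arbitrary annualized figures).
mu = np.array([0.08, 0.12])          # expected returns
sig = np.array([0.15, 0.25])         # standard deviations
w = np.array([0.6, 0.4])             # portfolio weights, summing to 1

for rho in (1.0, 0.5, 0.0, -0.5):
    Sigma = np.array([[sig[0]**2, rho * sig[0] * sig[1]],
                      [rho * sig[0] * sig[1], sig[1]**2]])
    ret = w @ mu                      # expected portfolio return, unaffected by rho
    vol = np.sqrt(w @ Sigma @ w)      # portfolio standard deviation w^T Sigma w, square-rooted
    print(f"rho = {rho:+.1f}: E[Rp] = {ret:.3f}, sd(Rp) = {vol:.4f}")
```

For rho = 1 the portfolio standard deviation equals the weighted average of the individual standard deviations; for any smaller correlation it falls below that average, which is the diversification effect described above.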
Multivariate Regression
Multivariate regression, also known as multivariate multiple regression, extends the linear regression model to scenarios involving multiple response variables and multiple predictor variables, allowing for the joint modeling of correlated outcomes.[70] The core model is expressed in matrix form as \mathbf{Y} = X \mathbf{B} + \mathbf{E}, where \mathbf{Y} is an n \times q matrix of response variables with n observations and q responses, X is an n \times p design matrix of predictors (including an intercept column), \mathbf{B} is a p \times q coefficient matrix, and \mathbf{E} is an n \times q error matrix whose rows are independent and follow a multivariate normal distribution \boldsymbol{\epsilon}_i \sim \mathcal{N}_q(\mathbf{0}, \boldsymbol{\Sigma}), with \boldsymbol{\Sigma} being the q \times q covariance matrix of the errors.[70] This setup accounts for correlations among the response variables, providing more efficient inferences than fitting separate univariate regressions.[71]

The estimation of \mathbf{B} is typically performed using ordinary least squares (OLS), yielding the estimator \hat{\mathbf{B}} = (X^T X)^{-1} X^T \mathbf{Y}, which minimizes the sum of squared residuals in the Frobenius norm and coincides with the maximum likelihood estimator under multivariate normality assumptions.[70] Because every response shares the same design matrix X, these point estimates are identical whether \boldsymbol{\Sigma} is known, unknown, or diagonal (the case of errors that are independent across responses), so correlated errors do not alter the coefficient estimates themselves.[70] The covariance matrix of the vectorized estimator is then estimated as \widehat{\operatorname{Cov}}(\operatorname{vec}(\hat{\mathbf{B}})) = \hat{\boldsymbol{\Sigma}} \otimes (X^T X)^{-1}, where \hat{\boldsymbol{\Sigma}} = \frac{1}{n-p} \hat{\mathbf{E}}^T \hat{\mathbf{E}} is the unbiased estimator of the error covariance matrix based on the residuals \hat{\mathbf{E}} = \mathbf{Y} - X \hat{\mathbf{B}}.[70]

Hypothesis testing in multivariate regression often employs multivariate analysis of variance (MANOVA) techniques to assess whether subsets of coefficients in \mathbf{B} differ from specified values, such as testing overall model significance or group differences in predictors.[71] A key test statistic is Wilks' lambda (\Lambda), defined as \Lambda = \frac{|\tilde{\boldsymbol{\Sigma}}|}{|\tilde{\boldsymbol{\Sigma}}_1|}, where \tilde{\boldsymbol{\Sigma}} is the maximum likelihood estimate of \boldsymbol{\Sigma} under the full model and \tilde{\boldsymbol{\Sigma}}_1 is the estimate under the null hypothesis (e.g., certain rows of \mathbf{B} are zero); small values of \Lambda indicate rejection of the null.[70] Under the null hypothesis, Bartlett's correction yields a statistic proportional to -\log \Lambda that is approximately chi-squared distributed, with degrees of freedom equal to the number of coefficients constrained by the null; this likelihood ratio test provides a unified framework for multivariate hypotheses.[70] Wilks' lambda was introduced by Samuel S. Wilks in 1932 as a criterion for testing equality of multivariate means, and was later adapted for regression contexts.[72]

A practical example involves predicting sales across multiple channels (e.g., online, retail, and wholesale) as response variables based on advertising expenditures in various media (e.g., TV, digital, print) as predictors; the model captures how ad spends jointly influence correlated sales outcomes, with MANOVA testing whether the coefficients differ significantly from zero across channels.[73]

The foundations of multivariate regression trace back to the 1930s, with Harold Hotelling's pioneering extensions of univariate methods to multivariate settings, including his 1936 work on relations between sets of variates that laid groundwork for joint modeling of multiple responses and predictors.[74]
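The estimators \hat{\mathbf{B}} = (X^T X)^{-1} X^T \mathbf{Y} and \hat{\boldsymbol{\Sigma}} = (n-p)^{-1} \hat{\mathbf{E}}^T \hat{\mathbf{E}} can be coded in a few lines; in the sketch below the data are simulated from known coefficients, chosen arbitrarily, purely to show that the estimates recover them.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, q = 500, 3, 2                                   # observations, predictors (incl. intercept), responses

X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])   # design matrix with intercept
B_true = np.array([[1.0, -0.5],
                   [2.0, 0.0],
                   [0.0, 1.5]])                       # true p x q coefficient matrix (illustrative)
Sigma_true = np.array([[1.0, 0.6],
                       [0.6, 2.0]])                   # error covariance across the q responses

E = rng.multivariate_normal(np.zeros(q), Sigma_true, size=n)
Y = X @ B_true + E

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)             # OLS / maximum likelihood estimate of B
resid = Y - X @ B_hat
Sigma_hat = resid.T @ resid / (n - p)                 # unbiased estimate of the error covariance

print(B_hat)        # close to B_true
print(Sigma_hat)    # close to Sigma_true
```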