In probability theory, conditional variance refers to the variance of a random variable X given the value of another random variable Y = y, denoted as \operatorname{Var}(X \mid Y = y), which quantifies the spread of X under the condition that Y is fixed at y.[1] It is formally defined as \operatorname{Var}(X \mid Y = y) = E[(X - E[X \mid Y = y])^2 \mid Y = y], where E[\cdot \mid Y = y] denotes the conditional expectation given Y = y.[2] Equivalently, it can be computed using the shortcut formula \operatorname{Var}(X \mid Y = y) = E[X^2 \mid Y = y] - (E[X \mid Y = y])^2, mirroring the unconditional variance but applied to the conditional distribution of X.[1]

The conditional variance \operatorname{Var}(X \mid Y) itself is a random variable, representing the function of Y that takes the value \operatorname{Var}(X \mid Y = y) whenever Y = y.[1] A fundamental property is the law of total variance, which decomposes the unconditional variance of X into two components: \operatorname{Var}(X) = E[\operatorname{Var}(X \mid Y)] + \operatorname{Var}(E[X \mid Y]), where the first term captures the average variability of X within the levels of Y, and the second term measures the variability of the conditional means across those levels.[1] This decomposition is crucial in statistical modeling, as it highlights how conditioning on additional information reduces or reallocates variance, aiding in applications such as regression analysis, risk assessment, and Bayesian inference.[2] For discrete random variables, the conditional variance is calculated by summing over the possible values of X weighted by the conditional probability mass function, ensuring precise computation in finite support cases.[2]
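As a concrete check of these definitions, the following sketch uses an assumed toy joint probability mass function (the numbers and helper names are illustrative, not drawn from any cited source) to compute \operatorname{Var}(X \mid Y = y) via the shortcut formula and to verify the law of total variance numerically.

```python
# Minimal sketch (illustrative values only): check the shortcut formula and the
# law of total variance on an assumed toy joint pmf for a discrete pair (X, Y),
# with X and Y each taking values 0 or 1.
joint = {(0, 0): 0.10, (1, 0): 0.30, (0, 1): 0.25, (1, 1): 0.35}  # assumed p(x, y)

def p_y(y):
    """Marginal P(Y = y)."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def cond_moment(y, k):
    """Conditional moment E[X^k | Y = y] from the conditional pmf p(x | y)."""
    return sum((x ** k) * p / p_y(y) for (x, yy), p in joint.items() if yy == y)

def cond_var(y):
    """Shortcut formula: Var(X | Y = y) = E[X^2 | Y = y] - E[X | Y = y]^2."""
    return cond_moment(y, 2) - cond_moment(y, 1) ** 2

ys = sorted({y for (_, y) in joint})
within = sum(cond_var(y) * p_y(y) for y in ys)                         # E[Var(X | Y)]
mean_x = sum(x * p for (x, _), p in joint.items())                     # E[X]
between = sum((cond_moment(y, 1) - mean_x) ** 2 * p_y(y) for y in ys)  # Var(E[X | Y])
var_x = sum(x * x * p for (x, _), p in joint.items()) - mean_x ** 2    # Var(X)

# Both printed values agree: Var(X) = E[Var(X | Y)] + Var(E[X | Y]).
print(round(var_x, 4), round(within + between, 4))
```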
Core Concepts
Formal Definition
In probability theory, the conditional variance of a random variable Y given a conditioning event or value X = x, denoted \operatorname{Var}(Y \mid X = x), is defined as the variance of the conditional distribution of Y given X = x. This is formally expressed as \operatorname{Var}(Y \mid X = x) = \mathbb{E}\left[(Y - \mathbb{E}[Y \mid X = x])^2 \mid X = x\right], where \mathbb{E}[\cdot \mid X = x] denotes the conditional expectation with respect to the conditional distribution of Y given X = x.[2][3]

For the general case where X is a random variable, the conditional variance \operatorname{Var}(Y \mid X) is itself a random variable defined as \operatorname{Var}(Y \mid X) = \mathbb{E}\left[(Y - \mathbb{E}[Y \mid X])^2 \mid X\right]. An alternative notation for the scalar conditioning case is \sigma^2_{Y \mid X = x}, emphasizing the variance of the conditional distribution.[2][3]

This concept originated in early 20th-century probability theory, formalized within the measure-theoretic framework by Andrey Kolmogorov in his foundational work on probability axioms.[4] When no conditioning is specified, the unconditional variance \operatorname{Var}(Y) arises as a special case.[2]
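To illustrate \operatorname{Var}(Y \mid X) as a random variable, the following sketch assumes a toy model (not from the cited sources) in which X is uniform on \{1, 2, 3\} and Y \mid X = x is binomial with x trials and success probability 0.5, so that \operatorname{Var}(Y \mid X = x) = x/4 and \operatorname{Var}(Y \mid X) is the random variable X/4.

```python
# Sketch with an assumed model: Var(Y | X) as a random variable.
# X is uniform on {1, 2, 3}; given X = x, Y ~ Binomial(x, 0.5), so
# Var(Y | X = x) = x/4 and Var(Y | X) is the random variable X/4.
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 4, size=200_000)   # draws of the conditioning variable X
y = rng.binomial(x, 0.5)               # Y drawn from its conditional distribution

for value in (1, 2, 3):
    empirical = y[x == value].var()    # sample estimate of Var(Y | X = value)
    print(value, round(empirical, 4), value / 4)   # empirical ≈ theoretical x/4
```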
Geometric Interpretation
The geometric interpretation of conditional variance arises naturally in the context of a bivariate scatter plot displaying realizations of the joint distribution of random variables X and Y. Each point in the plot represents a pair (x, y) drawn from this distribution. Fixing X = x corresponds to examining a narrow vertical strip at that x-value, where the conditional distribution of Y given X = x manifests as the vertical scatter of points within the strip. The conditional variance \operatorname{Var}(Y \mid X = x) quantifies the average squared vertical distance of these points from the horizontal line positioned at the conditional mean E[Y \mid X = x], which traces the regression curve across all such strips. This measure captures the residual spread or uncertainty in Y after accounting for the information provided by X = x.[5]

Visually, as one moves along the x-axis, the width of these vertical spreads can vary, reflecting how the conditional variance changes with different values of X. In regions where the points cluster tightly around the regression line, the conditional variance is small, indicating that X effectively predicts Y. Conversely, wider spreads signal greater remaining variability. This perspective emphasizes conditional variance as a local measure of dispersion in the joint distribution, distinct from the overall variance, which averages spreads across all strips.

The non-negativity of conditional variance follows directly from its construction as an average of squared deviations, ensuring \operatorname{Var}(Y \mid X = x) \geq 0 for all x. It achieves zero only when Y is deterministic given X = x, such that all points in the corresponding strip align perfectly on the regression line, implying no residual uncertainty and flawless prediction.[2]

A simple numerical illustration involves coin flips with a random bias. Let X = Q denote the unknown bias (probability of heads, taking values 0.3 or 0.7 with equal probability), and let Y be the outcome of one flip (1 for heads, 0 for tails). For Q = 0.3, the conditional outcomes cluster closer to 0, yielding \operatorname{Var}(Y \mid Q = 0.3) = 0.3 \times 0.7 = 0.21; for Q = 0.7, the conditional variance is likewise 0.7 \times 0.3 = 0.21, though the outcomes cluster closer to 1. If instead Q = 0 or 1, the variance drops to zero, as Y is fixed at 0 or 1, respectively, with no spread in the corresponding "strip." This example highlights how conditioning on the bias reduces the uncertainty about Y from its marginal level to the conditional level.
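The strip picture can also be mimicked numerically. The sketch below assumes an illustrative heteroscedastic model (not taken from the sources above) and compares the empirical variance of Y within each vertical strip to the model's conditional variance at the strip's midpoint.

```python
# Sketch (assumed heteroscedastic model, for illustration only): conditional
# variance read off vertical strips of a scatter of (X, Y). Here
# Y = 0.5*X + eps with sd(eps) = 0.1 + 0.5*X, so the strips widen with X.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=200_000)
y = 0.5 * x + rng.normal(scale=0.1 + 0.5 * x)   # vertical spread grows with x

edges = np.linspace(0.0, 1.0, 11)               # ten vertical strips
for lo, hi in zip(edges[:-1], edges[1:]):
    mid = 0.5 * (lo + hi)
    strip = y[(x >= lo) & (x < hi)]
    # empirical Var(Y | X near mid) vs. the model value (0.1 + 0.5*mid)**2
    print(round(mid, 2), round(strip.var(), 4), round((0.1 + 0.5 * mid) ** 2, 4))
```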
Properties and Decompositions
Law of Total Variance
The law of total variance states that for random variables X and Y with finite second moments, the unconditional variance of Y decomposes as \operatorname{Var}(Y) = \mathbb{E}[\operatorname{Var}(Y \mid X)] + \operatorname{Var}(\mathbb{E}[Y \mid X]).[6][7]

To derive this, begin with the definition of variance: \operatorname{Var}(Y) = \mathbb{E}[(Y - \mathbb{E}[Y])^2]. Expand the squared term:

(Y - \mathbb{E}[Y])^2 = (Y - \mathbb{E}[Y \mid X] + \mathbb{E}[Y \mid X] - \mathbb{E}[Y])^2 = (Y - \mathbb{E}[Y \mid X])^2 + 2(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - \mathbb{E}[Y]) + (\mathbb{E}[Y \mid X] - \mathbb{E}[Y])^2.

Taking the expectation and applying the law of iterated expectations yields

\mathbb{E}[(Y - \mathbb{E}[Y \mid X])^2] + 2\,\mathbb{E}[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - \mathbb{E}[Y])] + \mathbb{E}[(\mathbb{E}[Y \mid X] - \mathbb{E}[Y])^2].

The cross term vanishes because \mathbb{E}[(Y - \mathbb{E}[Y \mid X]) \mid X] = 0, and the factor \mathbb{E}[Y \mid X] - \mathbb{E}[Y] is a function of X that can be pulled out of the inner conditional expectation, so the overall expectation of the cross term is zero. The first term is \mathbb{E}[\operatorname{Var}(Y \mid X)], and the third is \operatorname{Var}(\mathbb{E}[Y \mid X]).[6][7]

This decomposition interprets \mathbb{E}[\operatorname{Var}(Y \mid X)] as the expected variance within the conditional distributions (average within-group variance) and \operatorname{Var}(\mathbb{E}[Y \mid X]) as the variance of the conditional means across values of X (between-group variance).[6][8]

The result holds under the assumptions that Y has a finite second moment (\mathbb{E}[Y^2] < \infty) and that the joint distribution of (X, Y) is defined, ensuring the conditional expectations and variances exist.[6][7]

For example, consider a mixture model where X is a Bernoulli random variable with parameter p = 0.5, and conditionally, Y \mid X = 0 \sim \mathcal{N}(0, 1) while Y \mid X = 1 \sim \mathcal{N}(2, 1). Here, \mathbb{E}[Y \mid X] = 0 or 2, so \operatorname{Var}(\mathbb{E}[Y \mid X]) = (0.5)(0 - 1)^2 + (0.5)(2 - 1)^2 = 1, and \mathbb{E}[\operatorname{Var}(Y \mid X)] = 1, yielding \operatorname{Var}(Y) = 2. This illustrates how the total variance combines the common within-group spread and the separation between group means.
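A Monte Carlo sketch of this mixture example (sample size and seed chosen purely for illustration) shows the empirical within-group and between-group components reproducing the decomposition.

```python
# Monte Carlo sketch of the mixture example above: X ~ Bernoulli(0.5) and
# Y | X = x ~ N(2x, 1), so E[Var(Y|X)] = 1, Var(E[Y|X]) = 1 and Var(Y) = 2.
import numpy as np

rng = np.random.default_rng(2)
x = rng.integers(0, 2, size=1_000_000)          # group labels, Bernoulli(0.5)
y = rng.normal(loc=2.0 * x, scale=1.0)          # Y drawn from its conditional law

p = np.array([(x == g).mean() for g in (0, 1)])      # group proportions
m = np.array([y[x == g].mean() for g in (0, 1)])     # conditional means E[Y | X = g]
v = np.array([y[x == g].var() for g in (0, 1)])      # conditional variances Var(Y | X = g)

within = (p * v).sum()                               # estimates E[Var(Y | X)] ≈ 1
between = (p * (m - (p * m).sum()) ** 2).sum()       # estimates Var(E[Y | X]) ≈ 1
print(round(y.var(), 3), round(within + between, 3)) # both ≈ 2, as the law predicts
```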
Relation to Conditional Expectation
The conditional variance of a random variable Y given another random variable X, denoted \operatorname{Var}(Y \mid X), is fundamentally linked to the conditional expectation \mathbb{E}[Y \mid X] through the standard variance formula applied in the conditional setting. Specifically, it is defined as \operatorname{Var}(Y \mid X) = \mathbb{E}\left[(Y - \mathbb{E}[Y \mid X])^2 \mid X\right], which expands to the identity \operatorname{Var}(Y \mid X) = \mathbb{E}[Y^2 \mid X] - (\mathbb{E}[Y \mid X])^2. This expression directly connects conditional variance to the second conditional moment \mathbb{E}[Y^2 \mid X] and the square of the first conditional moment, mirroring the unconditional variance formula and highlighting how conditioning refines the assessment of variability around the conditional mean.[9]

A natural extension of this relation appears in the conditional covariance between two random variables Y and Z given X, defined analogously as \operatorname{Cov}(Y, Z \mid X) = \mathbb{E}\left[(Y - \mathbb{E}[Y \mid X])(Z - \mathbb{E}[Z \mid X]) \mid X\right], which simplifies to \operatorname{Cov}(Y, Z \mid X) = \mathbb{E}[YZ \mid X] - \mathbb{E}[Y \mid X]\,\mathbb{E}[Z \mid X]. This formula serves as the conditional counterpart to the unconditional covariance, capturing the joint variability of Y and Z after accounting for the information in X, and it preserves key linearity properties of conditional expectations.[9]

Conditional variance exhibits several important properties derived from its ties to conditional expectation. It is always non-negative almost surely, \operatorname{Var}(Y \mid X) \geq 0, since it is the conditional expectation of a squared deviation from the conditional mean, a quantity that cannot be negative.[9] Furthermore, \operatorname{Var}(Y \mid X) = 0 almost surely if Y is measurable with respect to the \sigma-algebra generated by X, denoted \sigma(X), meaning Y is fully determined by the information in X, leaving no residual variability after conditioning.[9]

To derive the core identity, consider the definition of conditional variance as the conditional expectation of the squared deviation from the conditional mean. Expanding (Y - \mathbb{E}[Y \mid X])^2 = Y^2 - 2Y\,\mathbb{E}[Y \mid X] + (\mathbb{E}[Y \mid X])^2 and taking the conditional expectation given X yields \mathbb{E}[Y^2 \mid X] - 2\,\mathbb{E}[Y\,\mathbb{E}[Y \mid X] \mid X] + (\mathbb{E}[Y \mid X])^2. Since \mathbb{E}[Y \mid X] is \sigma(X)-measurable, the property \mathbb{E}[Y\,\mathbb{E}[Y \mid X] \mid X] = \mathbb{E}[Y \mid X]\,\mathbb{E}[Y \mid X] simplifies the middle term to 2(\mathbb{E}[Y \mid X])^2, resulting in \mathbb{E}[Y^2 \mid X] - (\mathbb{E}[Y \mid X])^2. This proof leverages the projection interpretation of conditional expectation as the L^2-orthogonal projection onto the subspace of \sigma(X)-measurable functions, ensuring the deviation Y - \mathbb{E}[Y \mid X] is orthogonal to that subspace, which underpins the non-negativity and zero-variance properties.[9]
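These moment identities can be checked numerically by replacing conditional expectations with within-group sample averages; the model, sample size, and seed in the following sketch are illustrative assumptions rather than part of the cited material.

```python
# Sketch (assumed toy model): numerical check of the conditional moment
# identities Var(Y | X = x) = E[Y^2 | X = x] - E[Y | X = x]^2 and
# Cov(Y, Z | X = x) = E[YZ | X = x] - E[Y | X = x] E[Z | X = x], with the
# conditional expectations replaced by within-group sample averages.
import numpy as np

rng = np.random.default_rng(3)
x = rng.integers(0, 3, size=300_000)            # discrete conditioning variable
y = x + rng.normal(size=x.size)                 # Y depends on X plus noise
z = 2.0 * y + rng.normal(size=x.size)           # Z correlated with Y

for g in (0, 1, 2):
    yy, zz = y[x == g], z[x == g]
    var_direct = ((yy - yy.mean()) ** 2).mean()
    var_shortcut = (yy ** 2).mean() - yy.mean() ** 2
    cov_direct = ((yy - yy.mean()) * (zz - zz.mean())).mean()
    cov_shortcut = (yy * zz).mean() - yy.mean() * zz.mean()
    # Both comparisons print True: the identities hold exactly for sample moments.
    print(g, np.isclose(var_direct, var_shortcut), np.isclose(cov_direct, cov_shortcut))
```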
Variations by Conditioning Type
Discrete Conditioning Variables
When the conditioning random variable Y takes discrete values, the conditional variance of a random variable X given Y = y_i is defined as the expected squared deviation from the conditional mean. Assuming X is also discrete, so that it has a conditional probability mass function, this is computed as \operatorname{Var}(X \mid Y = y_i) = \sum_x (x - \mu_i)^2 P(X = x \mid Y = y_i), where \mu_i = E[X \mid Y = y_i].[2] An equivalent form, often used for efficiency, is \operatorname{Var}(X \mid Y = y_i) = E[X^2 \mid Y = y_i] - \mu_i^2.[2] This summation extends over all possible values x in the support of the conditional distribution of X given Y = y_i. If X is continuous while Y is discrete, the formula instead uses an integral over the conditional density of X given Y = y_i: \operatorname{Var}(X \mid Y = y_i) = \int (x - \mu_i)^2 f_{X \mid Y}(x \mid y_i) \, dx.[10]

A representative example arises when X follows a Bernoulli distribution conditioned on a discrete Y, such as categories representing different groups (e.g., treatment versus control in an experiment), where the success probability p_i depends on the category y_i. For a Bernoulli random variable with parameter p_i, the conditional variance simplifies to \operatorname{Var}(X \mid Y = y_i) = p_i (1 - p_i).[11] This form highlights how variability in X can differ across discrete conditioning levels, with maximum variance at p_i = 0.5.

The marginal conditional variance, which averages the conditional variances over the distribution of Y, is given by E[\operatorname{Var}(X \mid Y)] = \sum_i \operatorname{Var}(X \mid Y = y_i) P(Y = y_i), where the sum is over the support of Y.[12] This expectation provides a measure of average within-group variability in decompositions like the law of total variance.

The use of finite summations in these computations offers a straightforward numerical approach, especially for random variables with limited support, avoiding the integration required in continuous settings.[13]
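A short numeric sketch of these formulas, using assumed success and group probabilities rather than values from any source, computes the per-group Bernoulli variances and their average:

```python
# Sketch with assumed numbers: Var(X | Y = y_i) = p_i (1 - p_i) for a Bernoulli
# outcome X across discrete groups Y, and the averaged within-group variance
# E[Var(X | Y)] = sum_i Var(X | Y = y_i) P(Y = y_i).
p_success = {"treatment": 0.6, "control": 0.4}   # assumed success probabilities p_i
p_group = {"treatment": 0.5, "control": 0.5}     # assumed group probabilities P(Y = y_i)

cond_var = {g: p * (1.0 - p) for g, p in p_success.items()}    # Var(X | Y = y_i)
avg_within = sum(cond_var[g] * p_group[g] for g in p_group)    # E[Var(X | Y)]
print(cond_var, avg_within)   # {'treatment': 0.24, 'control': 0.24} 0.24
```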
Continuous Conditioning Variables
When the conditioning variable Y is continuous, the conditional variance of X given Y = y, denoted \operatorname{Var}(X \mid Y = y), is defined, assuming X is continuous with conditional density function f_{X \mid Y}(x \mid y), as \operatorname{Var}(X \mid Y = y) = \int_{-\infty}^{\infty} (x - \mu(y))^2 f_{X \mid Y}(x \mid y) \, dx, where \mu(y) = \mathbb{E}[X \mid Y = y] is the conditional mean.[14] This integral form arises from the general definition of variance applied to the conditional distribution of X given Y = y, which is characterized by the density f_{X \mid Y}. If X is discrete while Y is continuous, the formula uses a sum over the conditional probability mass function of X given Y = y: \operatorname{Var}(X \mid Y = y) = \sum_x (x - \mu(y))^2 P(X = x \mid Y = y).[10]

The conditional density f_{X \mid Y}(x \mid y) is derived from the joint probability density function f_{X,Y}(x,y) and the marginal density f_Y(y) via the relation f_{X,Y}(x,y) = f_{X \mid Y}(x \mid y) f_Y(y), allowing computation of conditional quantities from joint distributions.[15] A classic example occurs in the bivariate normal distribution, where if (X, Y) follows a joint normal distribution with means \mu_X, \mu_Y, variances \sigma_X^2, \sigma_Y^2, and correlation \rho, then the conditional distribution of X given Y = y is normal with mean \mu_X + \rho \frac{\sigma_X}{\sigma_Y} (y - \mu_Y) and constant variance \sigma_X^2 (1 - \rho^2).[16] This homoscedastic conditional variance highlights how linearity in the normal case simplifies computations compared to more general continuous settings.

In practice, estimating the conditional variance from empirical data requires approximating the underlying densities, as direct integration is infeasible without parametric assumptions. Kernel density estimation methods address this by providing nonparametric estimators for f_{X \mid Y} and subsequently for \operatorname{Var}(X \mid Y = y), with developments since the early 2000s, including methods for long-memory errors and high-dimensional predictors, and more recent machine-learning-based semiparametric approaches that improve efficiency in complex models.[17][18][19] These approaches, such as local polynomial kernel estimators for conditional covariance, enable robust inference in regression models where heteroscedasticity varies continuously with Y.
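The constancy of the bivariate normal conditional variance can be checked by simulation; the parameters, strip widths, and seed in the sketch below are illustrative assumptions rather than values from the cited sources.

```python
# Sketch (assumed parameters): for a bivariate normal pair, Var(X | Y = y) is
# constant at sigma_X^2 * (1 - rho^2) regardless of y, checked on narrow strips.
import numpy as np

rng = np.random.default_rng(4)
sigma_x, sigma_y, rho = 2.0, 1.0, 0.8
cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]
xy = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000)
x, y = xy[:, 0], xy[:, 1]

theory = sigma_x**2 * (1.0 - rho**2)                           # 1.44 with these parameters
for lo, hi in [(-1.05, -0.95), (-0.05, 0.05), (0.95, 1.05)]:   # narrow strips in y
    strip = x[(y > lo) & (y < hi)]
    print(round(strip.var(), 3), theory)                       # each strip variance ≈ 1.44
```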
Applications in Statistics
Connection to Least Squares Estimation
In linear regression, the residual variance serves as an estimate of the expected conditional variance \mathbb{E}[\operatorname{Var}(Y \mid X)], capturing the variability in the response variable Y that remains unexplained by the predictors X after fitting the model.[20] This estimation arises because, under the standard assumptions of the linear model, the residuals represent deviations from the predicted conditional mean, and their sample variance provides an unbiased estimate of the underlying conditional variance when it is constant across values of X.[21]

The connection deepens through the principle that the conditional expectation \mathbb{E}[Y \mid X] minimizes the expected squared prediction error \mathbb{E}[(Y - g(X))^2] over all measurable functions g, with the conditional variance \operatorname{Var}(Y \mid X) representing the irreducible error that no predictor can eliminate.[22] In ordinary least squares (OLS) estimation, the fitted values approximate this conditional expectation under linearity, ensuring that the minimized mean squared error decomposes into the explained variance and the average conditional variance as the baseline noise. This minimization property underscores why OLS is a natural approach for regression, directly leveraging the structure of conditional moments.[23]

The Gauss-Markov theorem further ties conditional variance to least squares by establishing that, under assumptions including linearity, uncorrelated errors, and homoscedasticity (constant \operatorname{Var}(Y \mid X) = \sigma^2), the OLS estimator is the best linear unbiased estimator (BLUE), attaining minimum variance among linear unbiased estimators.[24] This optimality relies on the homoscedasticity condition; under violations such as heteroscedastic conditional variance, OLS is no longer minimum-variance among linear unbiased estimators, motivating weighted least squares adjustments. The law of total variance makes the link explicit by partitioning the total variance of Y into the variation explained by X and the unexplained component \mathbb{E}[\operatorname{Var}(Y \mid X)], which aligns with the regression's error term.[25]

For a concrete example, consider the simple linear model Y = \beta_0 + \beta_1 X + \epsilon, where \epsilon has mean zero and constant variance \sigma^2, implying homoscedastic conditional variance \operatorname{Var}(Y \mid X) = \sigma^2.[20] Here, OLS estimates \hat{\beta}_0 and \hat{\beta}_1 by minimizing the sum of squared residuals, and the residual variance s^2 = \frac{1}{n-2} \sum (y_i - \hat{y}_i)^2 unbiasedly estimates \sigma^2, providing a direct measure of the conditional variability around the regression line.[21] This setup exemplifies how conditional variance quantifies prediction reliability in the homoscedastic case, central to inference in basic regression analysis.
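The following sketch simulates this homoscedastic model (the coefficients, noise level, and sample size are assumed for illustration) and shows the OLS residual variance s^2 recovering \sigma^2 = \operatorname{Var}(Y \mid X):

```python
# Sketch (assumed simulated data): in the homoscedastic model Y = b0 + b1*X + eps
# with Var(eps) = sigma^2, the OLS residual variance s^2 = RSS/(n-2) estimates
# the constant conditional variance Var(Y | X) = sigma^2.
import numpy as np

rng = np.random.default_rng(5)
n, b0, b1, sigma = 10_000, 1.0, 2.0, 0.7
x = rng.uniform(0.0, 5.0, size=n)
y = b0 + b1 * x + rng.normal(scale=sigma, size=n)

X = np.column_stack([np.ones(n), x])                    # design matrix with intercept
beta_hat, rss, _, _ = np.linalg.lstsq(X, y, rcond=None) # OLS coefficients and RSS
s2 = rss[0] / (n - 2)                                   # residual variance estimate
print(beta_hat.round(3), round(s2, 3), sigma**2)        # s^2 ≈ 0.49 = sigma^2
```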
Components of Variance Analysis
Components of variance analysis employs conditional variance to partition the total variability in hierarchical or grouped data within random effects models and analysis of variance (ANOVA) frameworks. In these models, the total variance of the response variable Y, denoted \operatorname{Var}(Y), decomposes into between-group and within-group components: \operatorname{Var}(Y) = \operatorname{Var}(E[Y \mid \text{group}]) + E[\operatorname{Var}(Y \mid \text{group})], where the first term captures variability across groups and the second represents the average conditional variance within groups.[26] This approach, foundational to understanding sources of variation in multilevel data, builds on the law of total variance by applying it to structured experimental designs.[27]

Estimation of these variance components typically relies on methods like restricted maximum likelihood (REML), which reduces bias by accounting for the degrees of freedom lost to estimating fixed effects. Introduced by Patterson and Thompson in 1971 for handling unbalanced data in linear mixed models, REML maximizes the likelihood of residual contrasts after adjusting for the fixed parameters and gained prominence in the 1970s and 1980s as computational tools advanced. Unlike maximum likelihood, REML avoids the downward bias of maximum-likelihood variance estimates, making it suitable for random effects in ANOVA-like settings.[28]

A representative application occurs in balanced incomplete block designs (BIBD), where treatments are tested across incomplete blocks to control for block effects modeled as random. Here, variance components analysis estimates the between-block variance alongside the within-block error variance, enabling precise inference on treatment differences while accounting for design-induced clustering.[29] Similarly, in clustered data, such as student performance nested within schools, the method decomposes total variance into school-level (between-cluster) and student-level (within-cluster) components, revealing the proportion of variability attributable to clustering.[30]

Extensions to multivariate cases incorporate conditional variance into models like multivariate analysis of variance (MANOVA) with random effects, decomposing variance-covariance matrices across multiple responses. These modern mixed models address limitations in classical MANOVA by modeling correlated outcomes and hierarchical structures, as explored in frameworks for repeated measures and genetic analyses.[31]
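As a simplified illustration of this decomposition for clustered data, the sketch below simulates students nested within schools and uses a basic ANOVA-style moment estimator (rather than REML) to recover the between-school and within-school variance components; all numerical settings are assumed for illustration.

```python
# Sketch (assumed simulation; a simple ANOVA/method-of-moments estimator rather
# than REML): decompose the variance of a clustered response, e.g. students
# nested within schools, into between-school and within-school components.
import numpy as np

rng = np.random.default_rng(6)
n_schools, n_students = 200, 50
tau2, sigma2 = 0.5, 2.0                       # true between- and within-school variances

school_effect = rng.normal(scale=np.sqrt(tau2), size=n_schools)
y = school_effect[:, None] + rng.normal(scale=np.sqrt(sigma2), size=(n_schools, n_students))

within_hat = y.var(axis=1, ddof=1).mean()     # estimates sigma2 = E[Var(Y | school)]
# school means have variance tau2 + sigma2/n_students, so subtract the noise term
between_hat = y.mean(axis=1).var(ddof=1) - within_hat / n_students
print(round(within_hat, 3), round(between_hat, 3))   # ≈ 2.0 and ≈ 0.5
```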