Conditional probability distribution
In probability theory and statistics, a conditional probability distribution is the probability distribution of a random variable given the realized value of one or more other random variables, quantifying the likelihood of various outcomes under specified conditions.[1] It arises from the rules of conditional probability and is fundamental for modeling dependencies between variables in bivariate or multivariate settings.[2]

For discrete random variables X and Y, the conditional probability mass function (PMF) of X given Y = y is defined as P_{X|Y}(x|y) = \frac{P_{X,Y}(x,y)}{P_Y(y)} for P_Y(y) > 0, where P_{X,Y}(x,y) is the joint PMF and P_Y(y) is the marginal PMF of Y.[3] This conditional PMF satisfies the properties of a valid PMF: it is non-negative and sums to 1 over all possible x.[2] Similarly, for continuous random variables, the conditional probability density function (PDF) of X given Y = y is f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} for f_Y(y) > 0, where f_{X,Y}(x,y) is the joint PDF and f_Y(y) is the marginal PDF of Y; this conditional PDF integrates to 1 over the real line.[1] The joint distribution can be recovered as the product of the conditional and the marginal: P_{X,Y}(x,y) = P_{X|Y}(x|y) \cdot P_Y(y) in the discrete case, and analogously in the continuous case.[3]

Key properties include non-symmetry (P_{X|Y} \neq P_{Y|X} in general) and the special case of independence: the random variables are independent if and only if the conditional distribution equals the unconditional (marginal) distribution, i.e., P_{X|Y}(x|y) = P_X(x) for all x and y.[2] Conditional distributions enable computation of expectations, variances, and other moments restricted to subpopulations, such as the conditional mean E[X|Y=y], which plays a central role in regression and Bayesian inference.[1] They are widely applied in fields such as statistics, machine learning, and decision theory to update beliefs based on new evidence.[3]
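As an illustrative sketch of the continuous-case formula above, the following Python snippet numerically forms f_{X|Y}(x|y) = f_{X,Y}(x,y) / f_Y(y) and checks that it integrates to 1; the joint density f(x,y) = x + y on the unit square is an assumed toy example, not drawn from the cited sources.

```python
import numpy as np

# Assumed toy joint density on the unit square: f(x, y) = x + y for 0 <= x, y <= 1
def joint_pdf(x, y):
    return x + y

xs = np.linspace(0.0, 1.0, 2001)   # grid over the support of X
dx = xs[1] - xs[0]

def marginal_y(y):
    # f_Y(y) = integral of f(x, y) dx over [0, 1], approximated by a Riemann sum
    return float(np.sum(joint_pdf(xs, y)) * dx)

y0 = 0.3
cond_pdf = joint_pdf(xs, y0) / marginal_y(y0)   # f_{X|Y}(x | y0) on the grid

print(marginal_y(y0))                 # ~0.8, i.e. y0 + 1/2 for this toy density
print(float(np.sum(cond_pdf) * dx))   # ~1.0: the conditional PDF integrates to 1
```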
Fundamentals

Definition
In probability theory, a conditional probability distribution describes the probability distribution of a random variable conditional on the value of another random variable or the occurrence of a specific event, effectively restricting the probability measure to that conditioning information. This concept extends the basic notion of conditional probability, defined for events, to the entire distribution of a random variable, allowing probabilities to be assessed across its possible outcomes given partial knowledge about related variables.[4]

Formally, consider random variables X and Y defined on a probability space (\Omega, \mathcal{F}, P). The conditional distribution of X given Y = y is given by the family of conditional probabilities P(X \in A \mid Y = y) for all measurable sets A \subseteq \mathbb{R}, where y lies in the support of Y. This defines a probability measure on the space of X, normalized so that the total probability sums to 1 over the possible values or integrates to 1 over the range of X.[5] Unlike the unconditional (marginal) distribution of X, which captures the overall probabilities without additional constraints, the conditional distribution incorporates the observed value y to update and refine these probabilities, often leading to a narrower or shifted spread depending on the dependence between X and Y. This updating process is fundamental to inference and modeling in probabilistic systems.[2]

The foundational elements are probability spaces, comprising a sample space \Omega of possible outcomes, an event algebra \mathcal{F}, and a probability measure P: \mathcal{F} \to [0,1] satisfying Kolmogorov's axioms, together with random variables as measurable functions from \Omega to \mathbb{R}. These prerequisites enable the rigorous construction of conditional distributions.[6]
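To make the family of probabilities P(X \in A \mid Y = y) concrete in the simplest (discrete) setting, the sketch below uses an assumed toy joint PMF and evaluates one such conditional probability by summing the joint PMF over A and dividing by the marginal p_Y(y); the numbers are purely illustrative.

```python
# Assumed toy joint PMF p_{X,Y}(x, y) on a finite sample space (values sum to 1)
joint = {(0, 0): 0.10, (1, 0): 0.25, (2, 0): 0.15,
         (0, 1): 0.20, (1, 1): 0.05, (2, 1): 0.25}

def conditional_prob(A, y):
    """P(X in A | Y = y) = sum over x in A of p_{X,Y}(x, y), divided by p_Y(y) > 0."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)                 # marginal p_Y(y)
    p_A_and_y = sum(p for (x, yy), p in joint.items() if yy == y and x in A)
    return p_A_and_y / p_y

# P(X in {1, 2} | Y = 0) = (0.25 + 0.15) / 0.50 = 0.8
print(conditional_prob({1, 2}, y=0))
```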
Notation and Interpretation

The standard notation for a conditional probability distribution distinguishes between discrete and continuous random variables. For discrete random variables X and Y, the conditional probability mass function is denoted P_{X|Y}(x|y) = P(X = x \mid Y = y), which gives the probability that X takes the value x given that Y takes the value y.[3] For continuous random variables, the conditional probability density function is denoted f_{X|Y}(x|y) = f_{X,Y}(x,y) / f_Y(y), representing the density of X at x given Y = y.[1] In a more general sense, P(X \mid Y) refers to the entire conditional distribution of X given Y, which is a random probability measure that depends on the value of Y.[7] The vertical bar | is the conventional symbol used to denote conditioning in these notations, separating the conditioned variable from the conditioning one, as in P(X \mid Y = y).[1]

Intuitively, a conditional probability distribution can be interpreted as updating prior beliefs about a random variable based on new information from the conditioning variable; for instance, in weather forecasting, the distribution of rainfall amounts given a measured temperature refines predictions by restricting possibilities to scenarios consistent with that temperature.[8][9]

Conditioning can apply to events or to random variables, leading to distinct notational forms. When conditioning on an event A with positive probability, the notation P(X \mid A) describes the distribution of X restricted to outcomes in A, often using indicator functions to formalize the event.[1] In contrast, P(X \mid Y) conditions on the random variable Y, yielding a family of distributions parameterized by the possible values of Y, which is essential for handling continuous cases where individual values have zero probability.[3] In degenerate cases, such as when the conditioning set has zero measure in continuous spaces, the conditional distribution may be represented using the Dirac delta function \delta, which concentrates the probability mass at a point while preserving integrability; this generalized function keeps the formalism consistent even for singular distributions.[7]
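As a small sketch of the event-versus-variable distinction drawn above, the snippet below conditions a single fair die roll X on the event A = \{X is even\} (an assumed example): the unconditional PMF is restricted to A and renormalized by P(A), which is the discrete meaning of P(X \mid A).

```python
from fractions import Fraction

# Unconditional PMF of a fair six-sided die roll X
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# Conditioning on the event A = {X is even}: restrict to A and renormalize by P(A)
A = {2, 4, 6}
p_A = sum(pmf[x] for x in A)                      # P(A) = 1/2
cond_on_event = {x: pmf[x] / p_A for x in A}      # P(X = x | A) = 1/3 for each x in A

print(cond_on_event)   # {2: Fraction(1, 3), 4: Fraction(1, 3), 6: Fraction(1, 3)}
```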
Discrete Case

Conditional Probability Mass Function
In the discrete case, the conditional probability mass function (PMF) of a random variable X given another discrete random variable Y = y extends the fundamental concept of conditional probability to the probability mass over the support of X.[10] For y in the support of Y, where p_Y(y) > 0, the conditional PMF is defined as p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}, where p_{X,Y}(x,y) is the joint PMF of X and Y, and p_Y(y) is the marginal PMF of Y.[10][11] This formula follows directly from the definition of conditional probability applied to the events \{X = x\} and \{Y = y\}: P(X = x \mid Y = y) = P(X = x, Y = y) / P(Y = y). Equivalently, the joint PMF can be factored as p_{X,Y}(x,y) = p_{X|Y}(x|y) \, p_Y(y), and solving for the conditional term yields the expression above; this factorization is consistent with the law of total probability, since summing over x recovers the marginal p_Y(y) = \sum_x p_{X,Y}(x,y).[12][13]

The support of the conditional PMF p_{X|Y}(\cdot|y) consists of all x such that p_{X,Y}(x,y) > 0, and it is zero elsewhere; thus the domain of the conditional distribution may vary with the specific value of y, but it always forms a valid PMF summing to 1 over its support.[11][14]

To compute the conditional PMF, consider the outcomes of two independent fair six-sided dice, with X the result of the first die and Y the result of the second; the joint PMF is uniform at p_{X,Y}(x,y) = 1/36 for x, y = 1, \dots, 6. The marginal PMF of Y is p_Y(y) = 1/6 for each y. Thus, for any fixed y, the conditional PMF is p_{X|Y}(x|y) = (1/36) / (1/6) = 1/6 for x = 1, \dots, 6, independent of y. This can be visualized in the joint PMF table below, where the conditional PMF for a specific y (e.g., y = 3) is obtained by dividing the corresponding column of the joint table by the marginal p_Y(3) = 1/6; a computational sketch of this division follows the table.

| x \backslash y | 1 | 2 | 3 | ... | 6 |
|---|---|---|---|---|---|
| 1 | 1/36 | 1/36 | 1/36 | ... | 1/36 |
| 2 | 1/36 | 1/36 | 1/36 | ... | 1/36 |
| ... | ... | ... | ... | ... | ... |
| 6 | 1/36 | 1/36 | 1/36 | ... | 1/36 |
| Marginal p_Y(y) | 1/6 | 1/6 | 1/6 | ... | 1/6 |
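The division described above can be carried out directly, as in the following sketch: it builds the uniform joint PMF of the two dice, divides the entries for Y = 3 by the marginal p_Y(3), and confirms that the conditional PMF is 1/6 for every x and sums to 1.

```python
from fractions import Fraction

# Joint PMF of two independent fair dice: p_{X,Y}(x, y) = 1/36 for x, y = 1..6
joint = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}

def conditional_pmf_of_x(y):
    """p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y) for each x."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)   # marginal p_Y(y) = 1/6
    return {x: joint[(x, y)] / p_y for x in range(1, 7)}

cond = conditional_pmf_of_x(3)
print(cond[1])              # 1/6: dividing 1/36 by 1/6, the same for every x
print(sum(cond.values()))   # 1: a valid PMF
```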
Examples and Applications
A basic example of a conditional PMF for dependent discrete variables uses a joint probability table. Suppose X and Y are binary random variables with the joint PMF shown below; the conditional PMFs it yields are computed in the sketch that follows the table.

| X \backslash Y | 0 | 1 |
|---|---|---|
| 0 | 0.3 | 0.2 |
| 1 | 0.1 | 0.4 |
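A short sketch of the computation this table sets up: dividing each column by its marginal p_Y(y) gives the conditional PMFs of X given Y = 0 and Y = 1, which differ from each other and from the marginal of X, so X and Y are dependent.

```python
# Joint PMF from the table above, keyed as (x, y)
joint = {(0, 0): 0.3, (0, 1): 0.2,
         (1, 0): 0.1, (1, 1): 0.4}

for y in (0, 1):
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)   # p_Y(0) = 0.4, p_Y(1) = 0.6
    cond = {x: joint[(x, y)] / p_y for x in (0, 1)}
    print(f"p_X|Y(x | y={y}):", cond)

# Output: {0: 0.75, 1: 0.25} for y = 0 and {0: 0.333..., 1: 0.666...} for y = 1.
# The conditional PMF changes with y (and differs from the marginal p_X(0) = p_X(1) = 0.5),
# so X and Y are dependent.
```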