Conditional dependence
In probability theory, conditional dependence describes a relationship between two or more random variables or events where their statistical dependence persists or emerges even after accounting for the influence of one or more conditioning variables. Specifically, for random variables X and Y given Z, conditional dependence holds if the conditional probability P(X | Y, Z) differs from P(X | Z) for some values where P(Y, Z) > 0, meaning knowledge of Y provides additional information about X beyond what Z alone offers. This contrasts with unconditional dependence, where P(X, Y) ≠ P(X)P(Y), and can manifest in scenarios where variables appear independent marginally but become dependent upon conditioning, or vice versa, as seen in examples like disease indicators (e.g., malaria and bacterial infection) that are independent overall but dependent given the presence of fever.[1]

Conditional dependence plays a central role in probabilistic modeling, particularly in graphical models such as Bayesian networks, where it helps encode complex joint distributions through conditional relationships and d-separation criteria to identify independencies.[2] In causal inference and machine learning, measuring conditional dependence is essential for tasks like feature selection, where irrelevant variables are screened out given others, and causal discovery, which distinguishes direct effects from spurious correlations. Various metrics have been developed to quantify it, including kernel-based approaches using reproducing kernel Hilbert spaces for non-linear dependencies and simple coefficients based on mutual information for practical computation in high dimensions.[3] These concepts underpin advancements in artificial intelligence, enabling efficient inference in large-scale systems by exploiting conditional structures to reduce computational complexity.[4]

Core Concepts
Definition
Conditional dependence refers to a relationship between random variables or events where the probability distribution of one is influenced by the other, even after incorporating information from a conditioning variable or set.[5] Intuitively, it arises when knowing the outcome of one variable alters the expected behavior of another, despite accounting for the conditioning factor, reflecting a residual association not explained by the conditioner alone.[6] Formally, two random variables X and Y are conditionally dependent given a third variable Z (with P(Z = z) > 0) if there exist values x, y, z in their supports such that P(X = x, Y = y \mid Z = z) \neq P(X = x \mid Z = z) \, P(Y = y \mid Z = z).[5] This inequality indicates that the joint conditional distribution does not factorize into the product of the marginal conditionals, signifying dependence.[7]

Unlike unconditional (marginal) dependence, which assesses association without conditioning, conditional dependence can emerge or disappear based on the conditioner; notably, X and Y may be unconditionally independent yet conditionally dependent given Z, as in collider bias where Z is a common effect of X and Y, inducing spurious association upon conditioning.[8] Conversely, unconditional dependence may vanish under certain conditioning, highlighting the context-specific nature of probabilistic relationships.[5] The concept was first formalized within modern probability theory in the early 20th century, building on Andrei Kolmogorov's axiomatic foundations established in 1933, which provided the rigorous framework for conditional probabilities underlying dependence relations.
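To make the definition concrete, the following minimal Python sketch checks the defining inequality on a small, hypothetical joint distribution in which X and Y are independent fair Bernoulli variables and Z records whether at least one of them equals 1 (so Z is a common effect of X and Y). The numbers and helper function are illustrative only, not taken from the cited sources.

```python
# Minimal sketch: check the defining inequality for conditional dependence.
# Hypothetical joint distribution: X, Y independent Bernoulli(0.5), Z = max(X, Y).
joint = {
    (0, 0, 0): 0.25,
    (0, 1, 1): 0.25,
    (1, 0, 1): 0.25,
    (1, 1, 1): 0.25,
}

def prob(predicate):
    """Total probability of all outcomes (x, y, z) satisfying the predicate."""
    return sum(p for (x, y, z), p in joint.items() if predicate(x, y, z))

z = 1
p_z = prob(lambda x, y, zz: zz == z)
p_xy_given_z = prob(lambda x, y, zz: x == 1 and y == 1 and zz == z) / p_z
p_x_given_z = prob(lambda x, y, zz: x == 1 and zz == z) / p_z
p_y_given_z = prob(lambda x, y, zz: y == 1 and zz == z) / p_z

# 1/3 versus (2/3)*(2/3) = 4/9: the factorization fails, so X and Y are
# conditionally dependent given Z = 1, even though they are marginally independent.
print(p_xy_given_z, p_x_given_z * p_y_given_z)
```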
Relation to Unconditional Dependence
Unconditional dependence between two random variables X and Y occurs when their joint probability distribution does not factorize into the product of their marginal distributions, that is, when P(X, Y) \neq P(X) P(Y). This contrasts with conditional dependence, which, as defined earlier, evaluates the joint distribution relative to a conditioning variable Z. In essence, unconditional dependence captures marginal associations without additional context, while conditional dependence reveals how these associations may alter given knowledge of Z.

Conditioning on Z can induce conditional independence from unconditional dependence, particularly in scenarios involving a common cause. For instance, if Z directly influences both X and Y (as in a directed acyclic graph where arrows point from Z to X and from Z to Y), X and Y exhibit unconditional dependence due to their shared origin, but become conditionally independent given Z, as the influence of the common cause is accounted for.[5] This structure, known as a common cause or fork, illustrates how conditioning removes spurious associations propagated through Z.[9]

Conversely, conditioning can induce conditional dependence where unconditional independence previously held, a phenomenon exemplified by the V-structure in directed acyclic graphs. In a V-structure, arrows converge on Z from both X and Y (i.e., X \to Z \leftarrow Y), rendering X and Y unconditionally independent since they lack a direct path of influence.[5] However, conditioning on Z—the common effect—creates a dependence between X and Y, as observing Z provides evidence that selects paths linking the two causes through the collider at Z.[9] This is the basis for "explaining away," where evidence for one cause (say, X) reduces the likelihood of the alternative cause (Y) given the observed effect Z, thereby inducing negative conditional dependence between the causes.

Overall, conditioning on Z can thus create new dependencies, remove existing ones, or even invert the direction of association between X and Y, fundamentally altering the dependence structure depending on the underlying causal relationships.[5] These dynamics underscore the importance of graphical models like directed acyclic graphs in visualizing how marginal and conditional dependencies interact.[9]
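Both directions can be checked numerically. The sketch below, which is illustrative rather than drawn from the cited sources, simulates a fork (Z causes both X and Y) and a collider (X and Y both cause Z) with Gaussian noise, then compares the marginal correlation of X and Y with a residual-based partial correlation given Z (a linear proxy for conditioning, developed further in the measures section below).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def partial_corr(x, y, z):
    """Correlation of x and y after removing the best linear fit on z (residual method)."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

# Fork: Z -> X and Z -> Y. X and Y are marginally dependent but
# (approximately) conditionally independent given Z.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)
print("fork:     corr(X,Y) =", round(np.corrcoef(x, y)[0, 1], 3),
      " partial corr given Z =", round(partial_corr(x, y, z), 3))

# Collider: X -> Z <- Y. X and Y are marginally independent, but
# conditioning on Z induces a (strongly negative) dependence.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + 0.1 * rng.normal(size=n)
print("collider: corr(X,Y) =", round(np.corrcoef(x, y)[0, 1], 3),
      " partial corr given Z =", round(partial_corr(x, y, z), 3))
```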
Formal Framework
Probabilistic Formulation
In probability theory, conditional dependence between two events A and B given a third event C with P(C) > 0 is defined as the failure of the factorization P(A \cap B \mid C) = P(A \mid C) P(B \mid C), that is, the inequality P(A \cap B \mid C) \neq P(A \mid C) P(B \mid C), where the conditional probability is given by P(A \mid C) = P(A \cap C)/P(C).[10] This inequality indicates that the occurrence of A affects the probability of B (or vice versa) even after accounting for C.[11]

For random variables, consider X, Y, and Z defined on a common probability space; the joint conditional probability mass or density function encapsulates the probabilistic structure. Specifically, the joint conditional distribution satisfies P(X, Y \mid Z) = P(X \mid Y, Z) P(Y \mid Z), derived from the chain rule for conditional probabilities: starting from the joint distribution P(X, Y, Z) = P(X \mid Y, Z) P(Y, Z) = P(X \mid Y, Z) P(Y \mid Z) P(Z), dividing by P(Z) yields the conditional form, assuming P(Z) > 0.[12] Conditional dependence holds when P(X \mid Y, Z) \neq P(X \mid Z) for some admissible values, or equivalently when P(X, Y \mid Z) \neq P(X \mid Z) P(Y \mid Z). Unconditional dependence arises as the special case where Z is a constant event with probability 1.[10]

In the discrete case, for random variables taking values in countable sets, the conditional joint probability mass function is p_{X,Y \mid Z}(x,y \mid z) = p_{X,Y,Z}(x,y,z) / p_Z(z) for p_Z(z) > 0, and the marginal conditionals are p_{X \mid Z}(x \mid z) = \sum_y p_{X,Y \mid Z}(x,y \mid z) and similarly for Y. Dependence occurs if p_{X,Y \mid Z}(x,y \mid z) \neq p_{X \mid Z}(x \mid z) p_{Y \mid Z}(y \mid z) for some x, y, z with p_Z(z) > 0.[13]

For continuous random variables with joint density f_{X,Y,Z}, the conditional joint density is f_{X,Y \mid Z}(x,y \mid z) = f_{X,Y,Z}(x,y,z) / f_Z(z) for f_Z(z) > 0, with marginal conditionals f_{X \mid Z}(x \mid z) = \int f_{X,Y \mid Z}(x,y \mid z) \, dy and analogously for Y. Conditional dependence is present when f_{X,Y \mid Z}(x,y \mid z) \neq f_{X \mid Z}(x \mid z) f_{Y \mid Z}(y \mid z) for some x, y, z with f_Z(z) > 0.[12]

From an axiomatic perspective in measure-theoretic probability, conditional dependence is framed using sigma-algebras. Let (\Omega, \mathcal{F}, P) be a probability space, and let \sigma(X), \sigma(Y), \sigma(Z) be the sigma-algebras generated by measurable functions X, Y, Z: \Omega \to \mathbb{R}, respectively. The random variables X and Y are conditionally dependent given Z if \sigma(X) and \sigma(Y) are not conditionally independent given \sigma(Z), meaning there exist events A \in \sigma(X), B \in \sigma(Y) such that P(A \cap B \mid \sigma(Z)) \neq P(A \mid \sigma(Z)) P(B \mid \sigma(Z)) on a set of positive probability, where conditional probability given a sigma-algebra is defined via the Radon-Nikodym derivative of the restricted measures.[14] Equivalently, X and Y are conditionally dependent given Z when there exist bounded measurable functions f on the range of X and g on the range of Y such that E[f(X) g(Y) \mid \sigma(Z)] \neq E[f(X) \mid \sigma(Z)] E[g(Y) \mid \sigma(Z)] on a set of positive probability. This setup ensures the formulation aligns with Kolmogorov's axioms extended to conditional expectations.[14]
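For the discrete case, the formulas above translate directly into array operations. The following sketch (illustrative code with a made-up example distribution; the helper name conditional_dependence_gap is hypothetical) takes a joint probability mass function stored as a 3-D NumPy array p[x, y, z], forms the conditional joint p_{X,Y|Z} and the conditional marginals p_{X|Z} and p_{Y|Z}, and reports the largest violation of the factorization.

```python
import numpy as np

def conditional_dependence_gap(p_xyz):
    """Largest |p(x,y|z) - p(x|z)p(y|z)| over all x, y and all z with p(z) > 0.

    p_xyz: 3-D array of joint probabilities indexed as p[x, y, z], summing to 1.
    A strictly positive return value indicates conditional dependence of X and Y given Z.
    """
    p_z = p_xyz.sum(axis=(0, 1))                      # p(z)
    valid = p_z > 0
    p_xy_given_z = p_xyz[:, :, valid] / p_z[valid]    # p(x, y | z)
    p_x_given_z = p_xy_given_z.sum(axis=1)            # p(x | z), shape (nx, nz)
    p_y_given_z = p_xy_given_z.sum(axis=0)            # p(y | z), shape (ny, nz)
    product = p_x_given_z[:, None, :] * p_y_given_z[None, :, :]
    return np.max(np.abs(p_xy_given_z - product))

# Example: a common cause. Z is a fair bit; given Z = z, X and Y are independent
# Bernoulli(0.9 if z == 1 else 0.1). X and Y are conditionally independent given Z
# (although marginally dependent), so the gap is ~0 up to floating-point roundoff.
p = np.zeros((2, 2, 2))
for z, q in ((0, 0.1), (1, 0.9)):
    for x in (0, 1):
        for y in (0, 1):
            p[x, y, z] = 0.5 * (q if x else 1 - q) * (q if y else 1 - q)
print(conditional_dependence_gap(p))
```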
Measure of Conditional Dependence
One prominent measure of conditional dependence is the conditional mutual information, denoted I(X; Y \mid Z), which quantifies the amount of information shared between random variables X and Y after conditioning on Z.[15] Defined in terms of entropies as I(X; Y \mid Z) = H(X \mid Z) + H(Y \mid Z) - H(X, Y \mid Z), where H(X \mid Z) is the conditional entropy of X given Z measuring the remaining uncertainty in X after observing Z, and similarly for the other terms, this metric captures the expected reduction in uncertainty about one variable from knowing the other, conditional on Z.[15] It equals zero if and only if X and Y are conditionally independent given Z, providing a symmetric, non-negative measure applicable to both discrete and continuous variables without assuming linearity.[15]

For jointly Gaussian random variables, partial correlation offers a computationally efficient alternative, measuring the correlation between X and Y after removing the linear effects of Z. The partial correlation coefficient is given by \rho_{XY \cdot Z} = \frac{\rho_{XY} - \rho_{XZ} \rho_{YZ}}{\sqrt{(1 - \rho_{XZ}^2)(1 - \rho_{YZ}^2)}}, where \rho_{XY}, \rho_{XZ}, and \rho_{YZ} are the pairwise Pearson correlation coefficients.[16] Under Gaussian assumptions, \rho_{XY \cdot Z} = 0 if and only if X and Y are conditionally independent given Z, enabling straightforward hypothesis tests for dependence via its standardized distribution.[16]

For non-linear dependencies, rank-based measures such as conditional Kendall's tau and conditional Spearman's rho extend unconditional rank correlations to the conditional setting. Conditional Kendall's tau assesses the concordance probability between X and Y given Z, providing a robust, distribution-free measure of monotonic dependence that ranges from -1 to 1.[17] Similarly, conditional Spearman's rho evaluates the correlation of ranks after conditioning, suitable for detecting non-linear associations in non-Gaussian data.[18] Kernel-based approaches, like the conditional Hilbert-Schmidt Independence Criterion (HSIC), embed variables into reproducing kernel Hilbert spaces to detect arbitrary dependence forms, with the criterion equaling zero under conditional independence and otherwise positive, scaled by kernel choices.[2]

These measures have specific limitations tied to their assumptions and practicality. Partial correlation assumes linearity and Gaussianity, potentially underestimating non-linear dependencies, while requiring inversion of covariance matrices that scales cubically with the dimension of Z.[16] Conditional mutual information, though versatile, demands entropy estimation, which is computationally intensive for high dimensions and sensitive to sample size in continuous cases.[15] Rank-based metrics like conditional Kendall's tau and Spearman's rho are robust to outliers but may lack power against weak or non-monotonic relations, and kernel methods such as conditional HSIC suffer from the curse of dimensionality due to kernel matrix computations, often requiring careful hyperparameter tuning.[17][18][2]
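As a concrete illustration of two of these measures, the sketch below (illustrative code, not from the cited sources) computes conditional mutual information for a discrete joint distribution directly from the entropy identity I(X; Y \mid Z) = H(X \mid Z) + H(Y \mid Z) - H(X, Y \mid Z), and evaluates the partial correlation coefficient from three hypothetical pairwise Pearson correlations using the formula given above.

```python
import numpy as np

def conditional_mutual_information(p_xyz):
    """I(X; Y | Z) in bits for a discrete joint pmf stored as p[x, y, z]."""
    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))
    p_z = p_xyz.sum(axis=(0, 1))
    p_xz = p_xyz.sum(axis=1)
    p_yz = p_xyz.sum(axis=0)
    # H(X|Z) = H(X,Z) - H(Z), and similarly for the other conditional entropies.
    h_x_given_z = entropy(p_xz) - entropy(p_z)
    h_y_given_z = entropy(p_yz) - entropy(p_z)
    h_xy_given_z = entropy(p_xyz) - entropy(p_z)
    return h_x_given_z + h_y_given_z - h_xy_given_z

def partial_correlation(r_xy, r_xz, r_yz):
    """rho_{XY.Z} from the three pairwise Pearson correlations."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Collider example: X, Y independent fair bits, Z = max(X, Y).
# I(X; Y | Z) is about 0.19 bits even though I(X; Y) = 0.
p = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        p[x, y, max(x, y)] = 0.25
print(conditional_mutual_information(p))

# Hypothetical pairwise correlations, for illustration only.
print(partial_correlation(0.5, 0.7, 0.7))   # approximately 0.02
```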
Properties and Theorems
Basic Properties
Conditional dependence exhibits symmetry: if random variables X and Y are conditionally dependent given Z, then Y and X are also conditionally dependent given Z. This property arises directly from the definitional equivalence p(x \mid y, z) \neq p(x \mid z) if and only if p(y \mid x, z) \neq p(y \mid z).[11]

Measures of conditional dependence, such as conditional mutual information I(X; Y \mid Z), possess non-negativity, satisfying I(X; Y \mid Z) \geq 0, with equality holding if and only if X and Y are conditionally independent given Z. This non-negativity stems from the interpretation of conditional mutual information as a Kullback-Leibler divergence, which is inherently non-negative. Additionally, conditional mutual information is symmetric, as I(X; Y \mid Z) = I(Y; X \mid Z).[19]

Conditional dependence does not carry over to unconditional dependence: dependence between X and Y given Z does not imply dependence between X and Y unconditionally. A counterexample arises when X and Y are marginally independent yet become dependent upon conditioning on Z, such as when Z acts as a common effect (collider) of X and Y.[20]

Conditional dependence integrates with marginal distributions through the chain rule of probability, which expresses the joint distribution p(x, y, z) as a product of conditional probabilities, such as p(x, y, z) = p(z) p(x \mid z) p(y \mid x, z). In this factorization, conditional dependence between X and Y given Z manifests in the term p(y \mid x, z) deviating from p(y \mid z), thereby aggregating local dependencies into the overall joint structure while preserving the marginals.[21]
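A quick numerical check of two of these properties is sketched below for an arbitrary, made-up, strictly positive joint distribution: the chain-rule factorization p(x, y, z) = p(z) p(x \mid z) p(y \mid x, z) reconstructs the joint exactly, and the dependence criterion gives the same verdict whether stated in terms of X given Y or Y given X. This is illustrative code, not part of the cited sources.

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 3, 2))
p /= p.sum()                                      # arbitrary strictly positive joint p[x, y, z]

p_z = p.sum(axis=(0, 1))                          # p(z)
p_xz = p.sum(axis=1)                              # p(x, z)
p_x_given_z = p_xz / p_z                          # p(x | z)
p_y_given_xz = p / p_xz[:, None, :]               # p(y | x, z)
p_y_given_z = p.sum(axis=0) / p_z                 # p(y | z)
p_x_given_yz = p / p.sum(axis=0)[None, :, :]      # p(x | y, z)

# Chain rule: p(z) p(x|z) p(y|x,z) reproduces the joint.
reconstructed = p_z[None, None, :] * p_x_given_z[:, None, :] * p_y_given_xz
print(np.allclose(reconstructed, p))              # True

# Symmetry: "p(x|y,z) differs from p(x|z) somewhere" and "p(y|x,z) differs from
# p(y|z) somewhere" always agree for well-defined conditionals.
dep_xy = not np.allclose(p_x_given_yz, p_x_given_z[:, None, :])
dep_yx = not np.allclose(p_y_given_xz, p_y_given_z[None, :, :])
print(dep_xy, dep_yx)                             # both True for this random joint
```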
Key Theorems
The Hammersley-Clifford theorem establishes a foundational link between conditional independence structures in Markov random fields and the factorization of their joint distributions. Specifically, for a finite undirected graph G = (V, E) and random variables X_V = (X_v)_{v \in V} with strictly positive joint probability distribution P(X_V) > 0 that satisfies the local Markov property with respect to G—meaning that each X_v is conditionally independent of X_{V \setminus (N(v) \cup \{v\})} given X_{N(v)}, where N(v) is the set of neighbors of v—the distribution admits a factorization over the maximal cliques \mathcal{C} of G: P(X_V) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(X_C), where Z is the normalizing constant and each \psi_C is a non-negative potential function defined on the variables in clique C. This implies that the conditional dependence relations encoded by the graph's separation properties are fully captured by interactions within cliques, enabling the representation of complex dependence structures through local potentials in graphical models. A high-level proof outline proceeds by constructing the potentials iteratively from the conditional distributions implied by the Markov property, ensuring the product reproduces the joint via telescoping factorization and normalization; positivity guarantees that all conditional distributions used in the construction are well-defined, and the theorem can fail for distributions with zero probabilities.[22]

The decomposition property governs how conditional independence over composite sets implies independence over subsets, with direct implications for conditional dependence as its contrapositive. For conditional independence, if X \perp\!\!\!\perp (Y, W) \mid Z, then X \perp\!\!\!\perp Y \mid Z and X \perp\!\!\!\perp W \mid Z. Equivalently, for conditional dependence (the negation), if X \not\perp\!\!\!\perp Y \mid Z or X \not\perp\!\!\!\perp W \mid Z (i.e., X depends on at least one of Y or W given Z), then X \not\perp\!\!\!\perp (Y, W) \mid Z. This property, part of the semi-graphoid axioms, ensures that dependence of X on either component given Z is inherited by the composite (Y, W), although the converse need not hold. A proof sketch for the independence direction uses marginalization: integrate the joint conditional density p(x, y, w \mid z) = p(x \mid z) p(y, w \mid z) over w to obtain p(x, y \mid z) = p(x \mid z) p(y \mid z), and similarly for the other subset; the dependence contrapositive follows immediately.[23]

The intersection property further characterizes compositions of conditional independences, again with nuanced implications for dependence. For conditional independence under strictly positive distributions, if X \perp\!\!\!\perp Y \mid Z \cup W and X \perp\!\!\!\perp W \mid Z \cup Y, then X \perp\!\!\!\perp (Y, W) \mid Z. This axiom completes the graphoid properties, allowing inference of broader independences from restricted ones, but it fails without positivity—in distributions with zero probabilities, the property may not hold, so conditional dependences can persist that are not implied by the graph structure. For conditional dependence, the contrapositive reads: if X \not\perp\!\!\!\perp (Y, W) \mid Z, then either X \not\perp\!\!\!\perp Y \mid Z \cup W or X \not\perp\!\!\!\perp W \mid Z \cup Y; without positivity, however, joint dependence need not propagate to either restricted dependence, complicating graphical representations.
A high-level proof sketch relies on the definition: from X \perp\!\!\!\perp Y \mid Z \cup W, p(x \mid y, z, w) = p(x \mid z, w); from X \perp\!\!\!\perp W \mid Z \cup Y, p(x \mid y, z, w) = p(x \mid y, z). Hence p(x \mid z, w) = p(x \mid y, z) for all admissible y and w, so this common value depends on neither y nor w; averaging over w given z shows it equals p(x \mid z), yielding p(x \mid y, z, w) = p(x \mid z), which is the desired joint independence. Positivity ensures all conditionals are well-defined via Bayes' rule without division by zero. In information-theoretic terms, the chain rule I(X; (Y, W) \mid Z) = I(X; W \mid Z) + I(X; Y \mid Z \cup W) = I(X; Y \mid Z) + I(X; W \mid Z \cup Y) shows that the two premises force I(X; (Y, W) \mid Z) = I(X; W \mid Z) = I(X; Y \mid Z); positivity is then needed to conclude that this common value is zero.[23]
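The positivity caveat can be seen in a standard counterexample, checked numerically below (an illustrative sketch, not from the cited sources): let X = Y = W be a single fair bit copied three times, with Z trivial. The code verifies that X is conditionally independent of Y given W and of W given Y, yet X depends on the pair (Y, W), so the intersection property fails for this non-positive distribution.

```python
from itertools import product

# Degenerate joint distribution: X = Y = W, each value 0 or 1 with probability 1/2.
# Many outcomes have probability zero, so the distribution is not strictly positive.
joint = {(x, y, w): 0.5 if x == y == w else 0.0
         for x, y, w in product((0, 1), repeat=3)}

def cond_indep(a_idx, b_idx, c_idx):
    """Check whether components a and b are conditionally independent given component c,
    where the indices refer to positions in the (x, y, w) outcome tuples."""
    def marg(pred):
        return sum(p for o, p in joint.items() if pred(o))
    for c in (0, 1):
        p_c = marg(lambda o: o[c_idx] == c)
        if p_c == 0:
            continue
        for a, b in product((0, 1), repeat=2):
            p_ab = marg(lambda o: o[a_idx] == a and o[b_idx] == b and o[c_idx] == c) / p_c
            p_a = marg(lambda o: o[a_idx] == a and o[c_idx] == c) / p_c
            p_b = marg(lambda o: o[b_idx] == b and o[c_idx] == c) / p_c
            if abs(p_ab - p_a * p_b) > 1e-12:
                return False
    return True

X, Y, W = 0, 1, 2
print(cond_indep(X, Y, W))   # True: X independent of Y given W
print(cond_indep(X, W, Y))   # True: X independent of W given Y
# Jointly, however, (Y, W) determines X, so X depends on the pair (Y, W):
p_x1_given_yw = joint[(1, 1, 1)] / sum(p for o, p in joint.items() if o[1] == 1 and o[2] == 1)
p_x1 = sum(p for o, p in joint.items() if o[0] == 1)
print(p_x1_given_yw, p_x1)   # 1.0 versus 0.5: the factorization fails
```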
Examples and Illustrations
Elementary Example
Consider two fair coins flipped independently, resulting in random variables X and Y, where 1 denotes heads and 0 denotes tails, each with P(X=1) = P(Y=1) = 0.5. Define Z = X \oplus Y (the XOR operation), so Z = 0 if the outcomes match (both heads or both tails) and Z = 1 if they differ. In this setup, Z acts as a signal of outcome consistency, indicating whether the two flips match (Z=0) or disagree (Z=1).

Marginally, X and Y are independent, as their joint distribution factors: P(X,Y) = P(X)P(Y), with each of the four outcomes (X,Y) = (0,0), (0,1), (1,0), (1,1) having probability 0.25. Consequently, P(X=1,Y=1) = 0.25 = P(X=1)P(Y=1). Also, P(Z=0) = P(Z=1) = 0.5.

However, conditioning on Z=0 induces dependence between X and Y. The conditional joint probabilities are P(X=0,Y=0 \mid Z=0) = 0.5, P(X=1,Y=1 \mid Z=0) = 0.5, and P(X=0,Y=1 \mid Z=0) = P(X=1,Y=0 \mid Z=0) = 0. The marginals are P(X=0 \mid Z=0) = P(X=1 \mid Z=0) = 0.5 and similarly for Y. Thus, P(X=1,Y=1 \mid Z=0) = 0.5 \neq 0.25 = P(X=1 \mid Z=0) P(Y=1 \mid Z=0), demonstrating conditional dependence. A similar inequality holds for Z=1. The full joint probability distribution over X, Y, Z is given in the following table:

| X | Y | Z | P(X,Y,Z) |
|---|---|---|---|
| 0 | 0 | 0 | 0.25 |
| 0 | 1 | 1 | 0.25 |
| 1 | 0 | 1 | 0.25 |
| 1 | 1 | 0 | 0.25 |
The conditional joint distribution of X and Y given Z = 0 is shown below:

|     | Y=0 | Y=1 |
|---|---|---|
| X=0 | 0.5 | 0 |
| X=1 | 0 | 0.5 |
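The probabilities in this example can also be reproduced by direct simulation. The following minimal sketch (illustrative code, not part of the cited sources) draws many pairs of fair coin flips, sets Z = X XOR Y, and estimates the marginal and conditional quantities discussed above; the estimates converge to the exact values 0.25, 0.5, and 0.25 appearing in the tables.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000
x = rng.integers(0, 2, size=n)
y = rng.integers(0, 2, size=n)
z = x ^ y                          # Z = X XOR Y

# Marginal independence: P(X=1, Y=1) is close to P(X=1) * P(Y=1) = 0.25.
print(np.mean((x == 1) & (y == 1)), np.mean(x == 1) * np.mean(y == 1))

# Conditional dependence given Z = 0:
# P(X=1, Y=1 | Z=0) is about 0.5, while P(X=1 | Z=0) * P(Y=1 | Z=0) is about 0.25.
match = (z == 0)
p_joint = np.mean((x == 1) & (y == 1) & match) / np.mean(match)
p_x = np.mean((x == 1) & match) / np.mean(match)
p_y = np.mean((y == 1) & match) / np.mean(match)
print(p_joint, p_x * p_y)
```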