
Conditional independence

Conditional independence is a fundamental concept in probability theory and statistics that extends the idea of statistical independence to scenarios where additional information is available, stating that two random variables (or events) X and Y are conditionally independent given a third variable (or event) Z if P(X \mid Y, Z) = P(X \mid Z), or equivalently, if the joint distribution factors as P(X, Y \mid Z) = P(X \mid Z) \cdot P(Y \mid Z). This property can hold even when X and Y are unconditionally dependent, as conditioning on Z can "explain away" or block the dependence pathway between them, a phenomenon central to probabilistic graphical models and Bayesian reasoning. For instance, in graphical models like Bayesian networks, conditional independence is encoded via the absence of direct edges between nodes, enabling efficient computation of complex joint distributions by decomposing them into local conditional probabilities. Key properties include symmetry (X \perp Y \mid Z implies Y \perp X \mid Z) and the semi-graphoid axioms, which govern how conditional independences compose and decompose in probabilistic models, forming the basis for d-separation criteria in directed acyclic graphs. These axioms ensure that conditional independence satisfies symmetry, decomposition, weak union, and contraction, providing a rigorous framework for verifying independences without full distributional knowledge. Applications span statistics, machine learning, and artificial intelligence; for example, it underpins naive Bayes classifiers by assuming feature independence given the class label, and it facilitates inference in hidden Markov models where observations are conditionally independent given latent states. In causal discovery, conditional independence tests help identify graph structures from data, as formalized in algorithms like PC (Peter-Clark), distinguishing correlation from causation. Overall, conditional independence simplifies high-dimensional probabilistic modeling, making intractable problems tractable by exploiting modular structure.

Conditional Independence for Events

Definition and Basic Properties

In probability theory, the foundational concepts of conditional independence build upon the Kolmogorov axioms, which establish probability as a non-negative measure on a sigma-algebra of events that sums to 1 over the entire sample space. Conditional probability is defined as the probability of an event A given that event B has occurred, expressed as P(A \mid B) = \frac{P(A \cap B)}{P(B)} for P(B) > 0, providing the prerequisite framework for analyzing dependencies under partial information. Two events A and B in a probability space are said to be conditionally independent given a third event C with P(C) > 0 if P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C). This definition captures the idea that, upon observing C, the occurrence of A provides no additional information about the likelihood of B, or vice versa. The conditional probability measure inherits key properties from the unconditional case: it is non-negative, so 0 \leq P(\cdot \mid C) \leq 1, and it normalizes to 1 over any partition of the sample space into events whose union is the entire space, ensuring \sum_i P(A_i \mid C) = 1 when \bigcup_i A_i = \Omega and the A_i are disjoint. This structure extends the notion of unconditional independence, which arises as a special case when C is the full sample space \Omega (where P(\Omega) = 1), reducing the condition to P(A \cap B) = P(A) P(B). The formalization of conditional probability and independence originated in early 20th-century developments in measure-theoretic probability, with Andrey Kolmogorov providing a rigorous axiomatic foundation in his 1933 work Foundations of the Theory of Probability.
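
The definition can be checked directly on a small finite sample space. The sketch below is illustrative and not from the source; the coin scenario, the probabilities, and the helper functions P and cond are assumptions chosen for the example. It enumerates outcomes for a randomly chosen fair or biased coin tossed twice and verifies that the two tosses are conditionally independent given the coin, while being marginally dependent.

```python
from itertools import product

P_BIASED = 0.5                       # probability the biased coin is chosen
P_HEADS = {"fair": 0.5, "biased": 0.9}

# Enumerate the sample space: (coin, toss1, toss2) with its probability.
prob = {}
for coin, t1, t2 in product(("fair", "biased"), ("H", "T"), ("H", "T")):
    p_coin = P_BIASED if coin == "biased" else 1 - P_BIASED
    p1 = P_HEADS[coin] if t1 == "H" else 1 - P_HEADS[coin]
    p2 = P_HEADS[coin] if t2 == "H" else 1 - P_HEADS[coin]
    prob[(coin, t1, t2)] = p_coin * p1 * p2

def P(event):
    """Probability of an event given as a predicate on outcomes."""
    return sum(p for w, p in prob.items() if event(w))

A = lambda w: w[1] == "H"            # first toss heads
B = lambda w: w[2] == "H"            # second toss heads
C = lambda w: w[0] == "biased"       # the biased coin was chosen

def cond(event, given):
    """Conditional probability P(event | given)."""
    return P(lambda w: event(w) and given(w)) / P(given)

# Conditionally independent given C: 0.81 == 0.9 * 0.9
print(cond(lambda w: A(w) and B(w), C), cond(A, C) * cond(B, C))
# Marginally dependent: 0.53 != 0.7 * 0.7
print(P(lambda w: A(w) and B(w)), P(A) * P(B))
```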

Equivalent Formulations

Conditional independence of events A and B given event C (with P(C) > 0) is primarily defined as P(A \cap B \mid C) = P(A \mid C) P(B \mid C). This is equivalent to the joint probability formulation P(A \cap B \cap C) = \frac{P(A \cap C) P(B \cap C)}{P(C)}. To derive this equivalence, start with the definition of conditional probability: P(A \cap B \mid C) = \frac{P(A \cap B \cap C)}{P(C)} and P(A \mid C) = \frac{P(A \cap C)}{P(C)}, P(B \mid C) = \frac{P(B \cap C)}{P(C)}. Substituting the latter two into the independence condition yields \frac{P(A \cap B \cap C)}{P(C)} = \frac{P(A \cap C)}{P(C)} \cdot \frac{P(B \cap C)}{P(C)}. Multiplying both sides by P(C) gives P(A \cap B \cap C) = \frac{P(A \cap C) P(B \cap C)}{P(C)}, confirming the equivalence. The reverse direction follows by rearranging: assuming the joint form, divide both sides by P(C) to recover the conditional product form. This derivation relies on the basic properties of conditional probability and assumes P(C) > 0 to avoid division by zero. Another equivalent formulation is P(A \mid B \cap C) = P(A \mid C). To prove this from the primary definition, apply the definition of conditional probability: P(A \mid B \cap C) = \frac{P(A \cap B \cap C)}{P(B \cap C)}. Now substitute P(A \cap B \cap C) = P(A \cap C) P(B \cap C) / P(C) from the joint equivalence, yielding \frac{P(A \cap C) P(B \cap C) / P(C)}{P(B \cap C)} = \frac{P(A \cap C)}{P(C)} = P(A \mid C). By symmetry, P(B \mid A \cap C) = P(B \mid C) also holds. These equivalences extend to the factorization of the joint probability measure over events, where the joint measure factors conditionally on C. Bayes' theorem is not directly required here but supports the manipulations via the chain rule for probabilities. Edge cases arise when P(C) = 0, rendering conditional probabilities undefined under the standard Kolmogorov axioms, as division by P(C) is impossible. In such scenarios, conditional independence is vacuously true or not considered, depending on the measure-theoretic extension, but the formulations above do not apply. Degenerate events, like C being the empty set or the full sample space, similarly lead to undefined or trivial conditions, where conditioning reduces to unconditional forms if applicable.
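
As a quick sanity check of these equivalences, the following sketch is purely illustrative; the numeric parameters pC, pA_given, and pB_given are arbitrary assumptions. It builds a joint distribution over three binary events in which A and B are conditionally independent given C by construction, and confirms numerically that all three formulations agree.

```python
from itertools import product

pC = 0.4
pA_given = {1: 0.7, 0: 0.2}          # P(A | C) and P(A | not C), arbitrary
pB_given = {1: 0.6, 0: 0.5}          # P(B | C) and P(B | not C), arbitrary

def joint(a, b, c):
    """P(a, b, c) = P(c) P(a | c) P(b | c): conditional independence built in."""
    pc = pC if c else 1 - pC
    pa = pA_given[c] if a else 1 - pA_given[c]
    pb = pB_given[c] if b else 1 - pB_given[c]
    return pc * pa * pb

def P(pred):
    return sum(joint(a, b, c)
               for a, b, c in product((0, 1), repeat=3) if pred(a, b, c))

PC   = P(lambda a, b, c: c)
PAC  = P(lambda a, b, c: a and c)
PBC  = P(lambda a, b, c: b and c)
PABC = P(lambda a, b, c: a and b and c)

print(PABC / PC, (PAC / PC) * (PBC / PC))   # (i)   definition
print(PABC, PAC * PBC / PC)                 # (ii)  joint-probability form
print(PABC / PBC, PAC / PC)                 # (iii) P(A | B ∩ C) = P(A | C)
```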

Illustrative Examples

One illustrative example of conditional independence for events involves drawing balls from two differently composed boxes. Suppose there are two boxes: the red box contains two red balls, and the blue box contains one red ball and one blue ball. A box is selected at random with equal probability, a ball is drawn from it (and noted for color), returned, and then a ball is drawn from the other box. Let R_1 be the event that the first ball drawn is red, R_2 the event that the second ball is red, and B the event that the first box selected was the red box. Unconditionally, R_1 and R_2 are dependent because the box compositions affect both draws indirectly through the selection process. However, given B, R_1 and R_2 are conditionally independent, as the second draw is from the fixed remaining box, unaffected by the outcome of the first draw. To verify, compute P(R_1 \mid B) = 1 (since the red box has only red balls), P(R_2 \mid B) = 1/2 (second draw from the blue box), and P(R_1, R_2 \mid B) = 1 \times 1/2 = 1/2, so P(R_1, R_2 \mid B) = P(R_1 \mid B) P(R_2 \mid B). Similarly, P(R_1 \mid B^c) = 1/2, P(R_2 \mid B^c) = 1, and P(R_1, R_2 \mid B^c) = 1/2 \times 1 = 1/2, confirming the equality holds given B^c. A classic example from dice rolling demonstrates the converse phenomenon: events that are unconditionally independent can become dependent after conditioning. Consider two fair six-sided dice rolled independently. Let D_1 be the event that the first die shows an even number, and D_2 the event that the second die shows an even number. Unconditionally, D_1 and D_2 are independent since P(D_1) = P(D_2) = 1/2 and P(D_1 \cap D_2) = 1/4 = P(D_1) P(D_2). Now condition on the sum S being even. By symmetry, P(D_1 \mid S \text{ even}) = 1/2 and P(D_2 \mid S \text{ even}) = 1/2, but P(D_1 \cap D_2 \mid S \text{ even}) = 9/18 = 1/2, since there are 18 outcomes with even sum (9 even-even pairs such as (2,2), (2,4), \dots, (6,6) and 9 odd-odd pairs such as (1,1), (1,3), \dots, (5,5)), of which 9 have both dice even. Hence P(D_1 \cap D_2 \mid S \text{ even}) \neq P(D_1 \mid S \text{ even}) P(D_2 \mid S \text{ even}), showing dependence given the parity of the sum. The core point of this illustration is that conditioning on a shared function of the two dice (the parity of their sum) induces dependence between otherwise independent events, the reverse of the more familiar situation in which conditioning removes dependence. In everyday scenarios, conditional independence appears in the relationship between a child's height and vocabulary size given their age. Height and vocabulary size are dependent unconditionally, as both tend to increase with age: taller children often have larger vocabularies due to developmental progression. However, given the child's age, height and vocabulary size become conditionally independent, as age accounts for the common developmental factor, making additional information about one irrelevant to predicting the other. For instance, among 5-year-olds, height variations do not predict vocabulary differences beyond what age already explains. To illustrate formally, suppose H is the event of above-average height, V above-average vocabulary, and A a specific age group. Then P(H \mid A) and P(V \mid A) are fixed by age-group norms, and P(H \cap V \mid A) = P(H \mid A) P(V \mid A) if there is no direct link beyond age, as verified in developmental studies where the correlation drops to near zero when stratifying by age. Another intuitive example involves bus arrival delays on the same route. Let D_1 be the event of delay for bus 1 and D_2 for bus 2, which are dependent unconditionally due to shared traffic conditions causing both to be late together.
However, given the traffic condition T (e.g., heavy congestion), D_1 and D_2 are conditionally independent, as each bus's delay then depends only on its own factors like driver behavior or stops, independent of the other given T. Computationally, P(D_1 \mid T) is the probability of delay under known traffic (say 0.8 for heavy congestion), P(D_2 \mid T) = 0.8 similarly, and P(D_1 \cap D_2 \mid T) = 0.8 \times 0.8 = 0.64 under conditional independence given T, whereas unconditionally P(D_1 \cap D_2) > P(D_1) P(D_2) due to the positive association induced by T. This reflects a common causal fork, where traffic is the common cause.
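
The dice example above can be verified by brute-force enumeration of the 36 equally likely outcomes. The short sketch below is illustrative; the helper function P and the predicate names are ad hoc choices. It reproduces both the marginal independence and the conditional dependence given an even sum.

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely pairs

def P(pred, given=lambda o: True):
    """P(pred | given) by counting equally likely outcomes."""
    kept = [o for o in outcomes if given(o)]
    return sum(1 for o in kept if pred(o)) / len(kept)

D1 = lambda o: o[0] % 2 == 0          # first die even
D2 = lambda o: o[1] % 2 == 0          # second die even
S_even = lambda o: (o[0] + o[1]) % 2 == 0

# Marginally independent: 0.25 == 0.5 * 0.5
print(P(lambda o: D1(o) and D2(o)), P(D1) * P(D2))
# Dependent given an even sum: 0.5 != 0.5 * 0.5
print(P(lambda o: D1(o) and D2(o), S_even), P(D1, S_even) * P(D2, S_even))
```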

Conditional Independence for Random Variables

Formal Definition

In measure-theoretic probability, random variables X, Y, and Z are defined on a probability space (\Omega, \mathcal{F}, P). The random variables X and Y are said to be conditionally independent given Z if the \sigma-algebras they generate, \sigma(X) and \sigma(Y), are conditionally independent given \sigma(Z). Specifically, two sub-\sigma-algebras \mathcal{G} and \mathcal{H} of \mathcal{F} are conditionally independent given a third sub-\sigma-algebra \mathcal{K} \subseteq \mathcal{F} if, for every G \in \mathcal{G} and H \in \mathcal{H}, P(G \cap H \mid \mathcal{K}) = P(G \mid \mathcal{K}) \, P(H \mid \mathcal{K}) almost surely. An equivalent measure-theoretic formulation for the conditional independence of X and Y given Z is that, for all measurable sets A and B in the respective Borel \sigma-algebras, P(X \in A, Y \in B \mid Z) = P(X \in A \mid Z) \, P(Y \in B \mid Z) almost surely with respect to P. This holds under the assumption of a complete probability space, where null sets are included in \mathcal{F}, ensuring the conditional probabilities are well-defined. This definition for random variables generalizes the notion of conditional independence for events, as it reduces to the event case when X = \mathbf{1}_E and Y = \mathbf{1}_F for events E, F \in \mathcal{F}, where \sigma(X) = \{ \emptyset, E, E^c, \Omega \} and similarly for \sigma(Y). The formal definition applies uniformly to both discrete and continuous random variables, though verification in continuous settings typically relies on the existence of regular conditional distributions.

Key Properties and Verification

Conditional independence possesses several key properties that facilitate its use in probabilistic modeling. One fundamental property is preservation under additional conditioning by independent variables: if X \perp Y \mid Z and W \perp (X, Y, Z), then X \perp Y \mid (Z, W). Another important property for multiple variables is the decomposition (or mixing) property: if X \perp (Y, W) \mid Z, then X \perp Y \mid Z and X \perp W \mid Z. These properties ensure that conditional independence structures remain stable when expanding the conditioning set with irrelevant information or decomposing joint independences. Verification of conditional independence X \perp Y \mid Z can be performed theoretically through equivalence to zero conditional mutual information, defined as I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z), where H denotes conditional entropy; this quantity is zero if and only if the variables are conditionally independent given Z. Empirically, log-likelihood ratio tests, such as the deviance G^2 = 2 \sum o_{ijk} \log(o_{ijk}/e_{ijk}) in multi-way contingency tables (where o_{ijk} and e_{ijk} are observed and expected frequencies under conditional independence), provide a means to assess the hypothesis, with an asymptotic chi-squared distribution under the null. For computational verification, discrete cases often rely on contingency tables, where conditional independence is tested by stratifying over the conditioner Z and applying chi-squared or log-likelihood ratio statistics to each slice, aggregating for overall assessment. In continuous settings, copula-based methods transform marginals to uniform via empirical copulas and test for independence in the partial copula, while kernel methods embed variables into reproducing kernel Hilbert spaces and use Hilbert-Schmidt independence criteria conditioned on Z to detect dependence. Unlike unconditional independence, conditional independence does not imply marginal independence, and marginalizing over the conditioner Z can induce dependence between X and Y even if they are conditionally independent given Z. This distinction underscores the role of the conditioning set in revealing or masking dependencies.
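
For discrete data, both criteria can be computed directly from a three-way table of counts. The sketch below is an assumed, illustrative implementation (the function names and the table of counts are made up); it estimates the conditional mutual information I(X; Y \mid Z) and the deviance G^2 and shows that both are zero when the table satisfies conditional independence exactly.

```python
import numpy as np

def cond_mutual_info(counts):
    """I(X; Y | Z) in nats from a 3-D array of counts indexed (x, y, z)."""
    p = counts / counts.sum()
    pz = p.sum(axis=(0, 1))                  # P(z)
    pxz = p.sum(axis=1)                      # P(x, z)
    pyz = p.sum(axis=0)                      # P(y, z)
    mi = 0.0
    for x in range(p.shape[0]):
        for y in range(p.shape[1]):
            for z in range(p.shape[2]):
                if p[x, y, z] > 0:
                    mi += p[x, y, z] * np.log(
                        p[x, y, z] * pz[z] / (pxz[x, z] * pyz[y, z]))
    return mi

def g_squared(counts):
    """Deviance G^2 = 2 * sum o * log(o / e) under the conditional-independence null."""
    nz = counts.sum(axis=(0, 1))
    nxz = counts.sum(axis=1)
    nyz = counts.sum(axis=0)
    g2 = 0.0
    for x in range(counts.shape[0]):
        for y in range(counts.shape[1]):
            for z in range(counts.shape[2]):
                o = counts[x, y, z]
                if o > 0:
                    e = nxz[x, z] * nyz[y, z] / nz[z]   # expected count in stratum z
                    g2 += 2 * o * np.log(o / e)
    return g2

# Made-up table in which X ⟂ Y | Z holds exactly within each stratum of Z,
# so both statistics are ~0 up to floating-point rounding.
counts = np.array([[[40, 5], [40, 5]],
                   [[10, 45], [10, 45]]], dtype=float)
print(cond_mutual_info(counts), g_squared(counts))
```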

Common Examples

One common example of conditional independence for random variables arises in the context of Bernoulli trials with an unknown success parameter. Consider Bernoulli random variables X_1, X_2, \dots, X_n, each with success probability \theta, where \theta itself is a random variable (e.g., following a prior in Bayesian settings). Marginally, the X_i are dependent due to their shared dependence on \theta, as observing one X_i updates beliefs about \theta and thus affects the others. However, conditionally on \theta, the X_i are independent, since the joint conditional probability mass function factors as p(X_1, \dots, X_n \mid \theta) = \prod_{i=1}^n p(X_i \mid \theta) = \prod_{i=1}^n \theta^{X_i} (1 - \theta)^{1 - X_i}, demonstrating that X_1, \dots, X_n are conditionally independent given \theta. This structure is fundamental in Bayesian statistics for modeling sequences of binary outcomes, such as coin flips or diagnostic tests, where the parameter captures shared uncertainty. Another illustrative case involves jointly normal random variables. For a multivariate Gaussian vector \mathbf{X} = (X_1, \dots, X_p)^T \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma), conditional independence between components X_i and X_j (for i \neq j) given the remaining variables \mathbf{X}_{-ij} holds if and only if the (i,j)-entry of the precision matrix \Theta = \Sigma^{-1} is zero. This equivalence stems from the conditional distribution of X_i \mid \mathbf{X}_{-i}, whose mean and variance are determined by the i-th row and column of \Theta; a zero \Theta_{ij} implies no direct conditional dependence on X_j. For a simple example with a third conditioning variable, suppose \mathbf{X} = (X, Y, Z)^T with covariance matrix \Sigma = \begin{pmatrix} 1 & 0.25 & 0.5 \\ 0.25 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix}, yielding a precision matrix \Theta with \Theta_{12} = 0 after inversion, so X \perp Y \mid Z. This property underpins graphical models for high-dimensional data, such as in finance or genetics, where zero precision entries reveal sparse dependence structures. In agricultural modeling, consider temperature T (as a proxy for broader conditions) and crop yield Y, conditioned on rainfall amount R. Marginally, T and Y are dependent, as higher temperatures often correlate with lower yields through heat stress and reduced water availability. Given R, however, T and Y are conditionally independent if rainfall fully mediates the temperature effect, meaning yield depends on temperature only via its influence on rainfall patterns. This assumes a structure where p(Y \mid T, R) = p(Y \mid R), verifiable through conditional densities or residuals showing zero partial correlation. For instance, in rainfed systems, empirical models fit yield as a function of rainfall alone after controlling for temperature, with probability mass functions for discretized yield levels illustrating the factorization (e.g., p(Y = y \mid T = t, R = r) = p(Y = y \mid R = r)). This example is prevalent in climate-impact studies, aiding yield predictions under varying scenarios.
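
The Gaussian claim in this example is easy to verify numerically: inverting the stated covariance matrix gives a precision matrix whose (1,2) entry vanishes, and the same conclusion follows from the partial covariance of X and Y given Z. The sketch below is illustrative only.

```python
import numpy as np

# Covariance matrix from the example in the text, ordered (X, Y, Z).
Sigma = np.array([[1.0, 0.25, 0.5],
                  [0.25, 1.0, 0.5],
                  [0.5, 0.5, 1.0]])

Theta = np.linalg.inv(Sigma)
print(np.round(Theta, 6))            # Theta[0, 1] == 0 up to rounding: X ⟂ Y | Z

# Equivalent check via the partial covariance of (X, Y) given Z:
# Cov(X, Y) - Cov(X, Z) Var(Z)^{-1} Cov(Z, Y) = 0.25 - 0.5 * 0.5 = 0.
partial_cov = Sigma[0, 1] - Sigma[0, 2] * Sigma[2, 1] / Sigma[2, 2]
print(partial_cov)                   # 0.0
```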

Extensions to Random Vectors and Fields

Definition for Vectors

The concept of conditional independence extends naturally from scalar random variables to finite-dimensional random vectors, where components within each vector may exhibit internal dependencies. Let \mathbf{X} = (X_1, \dots, X_n), \mathbf{Y} = (Y_1, \dots, Y_m), and \mathbf{Z} denote random vectors in \mathbb{R}^n, \mathbb{R}^m, and \mathbb{R}^p, respectively. The vectors \mathbf{X} and \mathbf{Y} are said to be conditionally independent given \mathbf{Z} if the joint conditional distribution of (\mathbf{X}, \mathbf{Y}) given \mathbf{Z} = \mathbf{z} factors into the product of the marginal conditional distributions, i.e., the conditional cumulative distribution function satisfies F_{\mathbf{X},\mathbf{Y}|\mathbf{Z}}(\mathbf{x}, \mathbf{y} | \mathbf{z}) = F_{\mathbf{X}|\mathbf{Z}}(\mathbf{x} | \mathbf{z}) F_{\mathbf{Y}|\mathbf{Z}}(\mathbf{y} | \mathbf{z}) for all \mathbf{x}, \mathbf{y}, \mathbf{z} in the support. When the relevant conditional densities exist, this is equivalent to the factorization f_{\mathbf{X},\mathbf{Y}|\mathbf{Z}}(\mathbf{x}, \mathbf{y} | \mathbf{z}) = f_{\mathbf{X}|\mathbf{Z}}(\mathbf{x} | \mathbf{z}) \, f_{\mathbf{Y}|\mathbf{Z}}(\mathbf{y} | \mathbf{z}). In the measure-theoretic framework, conditional independence is defined via sigma-algebras: \mathbf{X} and \mathbf{Y} are conditionally independent given \mathbf{Z} if \sigma(\mathbf{X}) \perp\!\!\!\perp \sigma(\mathbf{Y}) \mid \sigma(\mathbf{Z}), meaning that for any bounded measurable functions f: \mathbb{R}^n \to \mathbb{R} and g: \mathbb{R}^m \to \mathbb{R}, \mathbb{E}[f(\mathbf{X}) g(\mathbf{Y}) \mid \mathbf{Z}] = \mathbb{E}[f(\mathbf{X}) \mid \mathbf{Z}] \, \mathbb{E}[g(\mathbf{Y}) \mid \mathbf{Z}] almost surely. This general formulation implies that every pair of components (X_i, Y_j) is conditionally independent given \mathbf{Z}, i.e., X_i \perp\!\!\!\perp Y_j \mid \mathbf{Z} for all i = 1, \dots, n and j = 1, \dots, m, though the converse does not hold in general due to potential higher-order dependencies. A notable special case arises for jointly multivariate Gaussian random vectors (\mathbf{X}, \mathbf{Y}, \mathbf{Z}) with mean zero and positive definite covariance matrix \Sigma. Here, \mathbf{X} \perp\!\!\!\perp \mathbf{Y} \mid \mathbf{Z} holds if and only if the conditional (partial) covariance matrix between \mathbf{X} and \mathbf{Y} given \mathbf{Z} is the zero matrix, which corresponds to the off-diagonal block of the precision matrix \Sigma^{-1} (the inverse covariance) linking the \mathbf{X} and \mathbf{Y} coordinates being zero.
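
The Gaussian special case can be illustrated with a small constructed example. In the sketch below (all dimensions, matrices, and variable names are arbitrary assumptions), X and Y are built as noisy linear functions of Z with independent noise, so \mathbf{X} \perp\!\!\!\perp \mathbf{Y} \mid \mathbf{Z} holds by construction, and the X-Y block of the precision matrix comes out numerically zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 2, 3, 2                        # dims of X, Y, Z (arbitrary)
A = rng.normal(size=(n, p))
B = rng.normal(size=(m, p))

# Exact joint covariance of (X, Y, Z) for X = A Z + E_x, Y = B Z + E_y,
# with Z ~ N(0, I) and independent standard-normal noises E_x, E_y.
Sxx = A @ A.T + np.eye(n)
Syy = B @ B.T + np.eye(m)
Sxy = A @ B.T
Sxz, Syz, Szz = A, B, np.eye(p)

Sigma = np.block([[Sxx,   Sxy,   Sxz],
                  [Sxy.T, Syy,   Syz],
                  [Sxz.T, Syz.T, Szz]])

Theta = np.linalg.inv(Sigma)
xy_block = Theta[:n, n:n + m]            # precision entries linking X and Y
print(np.max(np.abs(xy_block)))          # ~0: X ⟂ Y | Z
```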

Properties in Multivariate Settings

In the multivariate setting, conditional independence of random vectors \mathbf{X} and \mathbf{Y} given \mathbf{Z} does not imply their marginal independence, as dependencies between \mathbf{X} and \mathbf{Y} may persist marginally while being fully accounted for by \mathbf{Z}. This effect becomes more pronounced with increasing dimensionality, where the joint distribution p(\mathbf{X}, \mathbf{Y} \mid \mathbf{Z}) factors as p(\mathbf{X} \mid \mathbf{Z}) p(\mathbf{Y} \mid \mathbf{Z}), yet the unconditional joint p(\mathbf{X}, \mathbf{Y}) may exhibit intricate correlations due to the high-dimensional structure of \mathbf{Z}. When a random vector \mathbf{X} is partitioned into sub-blocks \mathbf{X} = (\mathbf{X}_A, \mathbf{X}_B), conditional independence \mathbf{X} \perp \mathbf{Y} \mid \mathbf{Z} implies the block-wise statements \mathbf{X}_A \perp \mathbf{Y} \mid \mathbf{Z} and \mathbf{X}_B \perp \mathbf{Y} \mid \mathbf{Z}, as these follow from marginalizing the factored conditional distribution. However, the converse fails: block-wise conditional independence does not guarantee the full statement (\mathbf{X}_A, \mathbf{X}_B) \perp \mathbf{Y} \mid \mathbf{Z}, since dependencies between \mathbf{X}_A and \mathbf{X}_B given \mathbf{Z} may induce residual associations with \mathbf{Y}. This property highlights the hierarchical nature of conditional independence under partitioning, requiring verification of cross-block relationships for a complete assessment. For jointly normal random vectors, conditional independence \mathbf{X} \perp \mathbf{Y} \mid \mathbf{Z} is equivalent to the partial covariance matrix between \mathbf{X} and \mathbf{Y} given \mathbf{Z} being zero, a condition that is preserved under nonsingular linear transformations applied separately to \mathbf{X}, \mathbf{Y}, and \mathbf{Z}. Specifically, if each block is transformed invertibly (e.g., \mathbf{X} \mapsto A \mathbf{X} + \mathbf{b} with A invertible), the transformed partial covariances preserve the zero structure, ensuring the conditional independence relations hold in the new coordinates. This invariance facilitates analysis in Gaussian graphical models, where the precision matrix encodes such structures robustly. In higher dimensions, direct computation or testing of conditional independence among random vectors suffers from the curse of dimensionality, as the number of potential dependence structures grows exponentially with the vector size, rendering exhaustive verification computationally intractable without structural assumptions. Graphical models mitigate this by leveraging conditional independence to factorize the joint distribution, enabling tractable inference via algorithms like junction trees, whose complexity scales with the graph's treewidth rather than the full dimensionality. Counterexamples illustrate that pairwise conditional independence among vector components given the conditioner does not imply joint conditional independence. Consider three scalar random variables X_1, X_2, Y forming components of vectors, where X_1 \perp X_2 \mid Z, X_1 \perp Y \mid Z, and X_2 \perp Y \mid Z, but (X_1, X_2) \not\perp Y \mid Z; a concrete instance arises in non-Gaussian settings with Y = X_1 \oplus X_2 (addition modulo 2 of fair binary variables) and Z independent of them, where the pairwise factorizations hold but the joint one does not due to the XOR dependence. Such cases underscore the need for joint checks in multivariate extensions.
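
The XOR counterexample can be checked exhaustively. The sketch below is illustrative (the helper function P and the enumeration scheme are ad hoc); it enumerates the joint distribution of two fair bits X_1, X_2, their XOR Y, and an independent bit Z, confirming a pairwise conditional independence while the joint statement fails.

```python
from itertools import product

# Joint distribution over (x1, x2, y, z); y is determined as x1 XOR x2,
# and the free bits (x1, x2, z) are uniform and independent.
outcomes = [(x1, x2, x1 ^ x2, z)
            for x1, x2, z in product((0, 1), repeat=3)]
prob = {o: 1 / 8 for o in outcomes}

def P(pred, given=lambda o: True):
    num = sum(p for o, p in prob.items() if given(o) and pred(o))
    den = sum(p for o, p in prob.items() if given(o))
    return num / den

Zis0 = lambda o: o[3] == 0
# Pairwise: X1 and Y are conditionally independent given Z = 0 (0.25 == 0.25).
print(P(lambda o: o[0] == 1 and o[2] == 1, Zis0),
      P(lambda o: o[0] == 1, Zis0) * P(lambda o: o[2] == 1, Zis0))
# Joint: (X1, X2) and Y are NOT conditionally independent given Z = 0 (0 != 0.125).
print(P(lambda o: o[0] == 1 and o[1] == 1 and o[2] == 1, Zis0),
      P(lambda o: o[0] == 1 and o[1] == 1, Zis0)
      * P(lambda o: o[2] == 1, Zis0))
```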

Applications in Stochastic Processes

In stochastic processes, conditional independence plays a central role in defining the Markov property, under which the future state of the process is conditionally independent of its past given the current state. For a vector-valued Markov process \{X_t\}_{t \geq 0} with state space \mathbb{R}^d, this means that for any s < t < u, the conditional distribution of X_u given \mathcal{F}_t (the sigma-algebra generated by \{X_r : r \leq t\}) depends only on X_t, so that X_u \perp\!\!\!\perp \mathcal{F}_s \mid X_t for all s \leq t. This property simplifies the analysis of dynamic systems by reducing the dimensionality of dependencies, enabling recursive computation of transition probabilities and facilitating the study of long-term behavior such as ergodicity and mixing. Hidden Markov models (HMMs) extend this framework to scenarios with unobserved states, where observations are conditionally independent given the hidden state sequence. In an HMM, the hidden states \{Z_t\} form a Markov chain, and the observations \{Y_t\} satisfy Y_t \perp\!\!\!\perp (Y_{1:t-1}, Z_{1:t-1}) \mid Z_t, meaning each observation depends only on the current hidden state. This conditional independence structure allows efficient inference via algorithms like the forward-backward procedure, which computes marginal posteriors over hidden states by exploiting the Markov property of the states and the observation independence. Seminal work established the probabilistic foundations for parameter estimation in such models under these assumptions. Gaussian processes (GPs) leverage conditional independence through their covariance structure to model continuous-time stochastic processes. A GP f \sim \mathcal{GP}(m, k) is defined by a mean function m and kernel k, where the joint distribution of any finite collection of function values is multivariate Gaussian. Conditional independence arises when the kernel induces zero covariance between subsets given others; for instance, given observations at points X, the predictive distribution at new points X_* is f(X_*) \mid f(X) \sim \mathcal{N}(\mu_*, \Sigma_*), with \mu_* and \Sigma_* derived from the conditional mean and covariance, effectively making predictions independent of unobserved regions except through the kernel-mediated conditioning. This property underpins scalable approximations like sparse GPs, where inducing points enforce conditional independences to reduce computational complexity from \mathcal{O}(n^3) to roughly \mathcal{O}(n m^2) for m \ll n inducing points. Autoregressive moving average (ARMA) models illustrate conditional independence in discrete-time linear processes, where innovations drive the dynamics. In an ARMA(p, q) model, X_t = \sum_{i=1}^p \phi_i X_{t-i} + \sum_{j=1}^q \theta_j \epsilon_{t-j} + \epsilon_t, the innovations \{\epsilon_t\} are white noise, uncorrelated across time (and independent under the common i.i.d. assumption), with each \epsilon_t independent of the past observations X_{t-1}, X_{t-2}, \dots given the model parameters. This independence ensures that the one-step-ahead forecast errors are uncorrelated, enabling maximum likelihood estimation and diagnostic checks via residual analysis. For vector ARMA processes, multivariate extensions preserve this through block-diagonal innovation covariances, assuming conditional independence across components unless coupled by the model structure. However, these applications assume stationarity or specific covariance structures, and non-stationarity can undermine conditional independence.
In non-stationary processes, such as those with time-varying means or variances, the conditional distributions may depend on absolute time, violating the Markovian independence of future from past given present; for example, structural breaks introduce residual dependencies that persist across regimes, complicating inference and leading to spurious correlations in vector settings. Testing for conditional independence in such cases requires adaptations like kernel-based methods robust to temporal non-stationarity, as standard assumptions fail when innovation variances evolve over time.
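
As a concrete illustration of how these conditional independences enable recursive inference, the following sketch implements forward filtering for a hypothetical two-state HMM with made-up transition and emission parameters (T_mat, E_mat, init are assumptions, not from the source). Each update uses only the previous filtered distribution, which is valid precisely because of the Markov and observation-independence assumptions.

```python
import numpy as np

T_mat = np.array([[0.9, 0.1],        # transition probabilities P(z_t | z_{t-1})
                  [0.2, 0.8]])
E_mat = np.array([[0.7, 0.3],        # emission probabilities P(y_t | z_t), 2 symbols
                  [0.1, 0.9]])
init = np.array([0.5, 0.5])          # initial state distribution

def forward(obs):
    """Return P(z_t | y_1..t) for each t (normalized forward messages)."""
    alpha = init * E_mat[:, obs[0]]
    alpha /= alpha.sum()
    filtered = [alpha]
    for y in obs[1:]:
        # Markov prediction step followed by an emission update; both use
        # only the previous message, thanks to the conditional independences.
        alpha = (T_mat.T @ alpha) * E_mat[:, y]
        alpha /= alpha.sum()
        filtered.append(alpha)
    return np.array(filtered)

print(forward([0, 0, 1, 1, 1]))
```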

Applications in Statistical Inference

Role in Bayesian Networks

Bayesian networks, also known as belief networks, are directed acyclic graphs (DAGs) that encode conditional independence relationships among a set of random variables, enabling compact representation of joint probability distributions. Each node in the DAG represents a random variable, and directed edges indicate direct probabilistic dependencies, implying that a variable is conditionally independent of its non-descendants given its parents. This graphical structure allows for the identification and exploitation of conditional independences, which underpin efficient probabilistic reasoning in complex systems. A key mechanism for reading conditional independences from the DAG is the d-separation criterion, which determines whether two sets of nodes X and Y are conditionally independent given a third set Z. According to d-separation, X and Y are d-separated by Z (and thus conditionally independent) if every undirected path between them is blocked by Z, where a path is blocked if it includes a chain A \to B \to C or fork A \leftarrow B \to C with B \in Z, or a collider A \to B \leftarrow C where neither B nor any of its descendants is in Z. This criterion provides a graphical test for conditional independence that is both sound and complete relative to the underlying probability distribution faithful to the DAG. The joint probability distribution over the variables in a Bayesian network factors according to the graph structure, leveraging these conditional independences: P(X_1, \dots, X_n) = \prod_{i=1}^n P(X_i \mid \mathrm{Pa}(X_i)), where \mathrm{Pa}(X_i) denotes the parents of X_i in the DAG. This factorization reduces the storage requirements from an exponential O(2^n) for the full joint table to the product of the sizes of the conditional probability tables (CPTs) for each node, which is typically much smaller when independences are present. Inference in Bayesian networks, such as computing marginal or conditional probabilities, relies on these independences through algorithms like variable elimination. In variable elimination, non-query variables are systematically summed out by multiplying relevant factors (from CPTs) and marginalizing over the variable, exploiting conditional independences to avoid unnecessary computations and prevent exponential growth in intermediate factors. The time complexity is O(n \cdot d^w), where n is the number of variables, d is the maximum domain size, and w is the treewidth (maximum clique size minus one) of the induced undirected graph, making it feasible for networks with sparse dependencies. A simple illustrative example is a diagnostic network for a disease D causing two symptoms, fever F and cough C, represented as a DAG with edges D \to F and D \to C. Here, F and C are conditionally independent given D (i.e., F \perp C \mid D), as d-separation blocks the path F \leftarrow D \to C when D is observed. The joint distribution factors as P(D, F, C) = P(D) \cdot P(F \mid D) \cdot P(C \mid D), allowing efficient computation of, say, P(D \mid F = \text{true}, C = \text{true}) via the factorization, summing over any unobserved variables. This structure captures real-world medical reasoning where symptoms provide evidence about the underlying disease without direct interaction. The use of conditional independences in Bayesian networks offers significant advantages in high-dimensional inference, particularly dimensionality reduction by avoiding the curse of dimensionality inherent in full joint distributions.
By parameterizing only local conditional probabilities, the model scales to hundreds or thousands of variables where independences hold, enabling applications in domains like medical diagnosis and fault detection that would otherwise require infeasible data and computation. This compactness also facilitates learning from data and updating beliefs with new evidence.
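
The disease-symptom network can be worked through numerically. The sketch below uses made-up CPT values (the prior and the two conditional tables are assumptions, not from the source) and computes P(D \mid F = \text{true}, C = \text{true}) directly from the factorization P(D) P(F \mid D) P(C \mid D).

```python
P_D = {True: 0.01, False: 0.99}                 # prior on disease (assumed)
P_F_given_D = {True: 0.9, False: 0.1}           # P(fever | D)      (assumed)
P_C_given_D = {True: 0.8, False: 0.2}           # P(cough | D)      (assumed)

def joint(d, f, c):
    """P(D=d, F=f, C=c) = P(D) P(F|D) P(C|D), using F ⟂ C | D."""
    pf = P_F_given_D[d] if f else 1 - P_F_given_D[d]
    pc = P_C_given_D[d] if c else 1 - P_C_given_D[d]
    return P_D[d] * pf * pc

# Posterior over D given both symptoms observed.
unnorm = {d: joint(d, True, True) for d in (True, False)}
Z = sum(unnorm.values())
posterior = {d: v / Z for d, v in unnorm.items()}
print(posterior)      # P(D=true | F, C) ≈ 0.267 with these numbers
```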

Implications for Markov Chains

In Markov chains, conditional independence underpins the core Markov property, which states that the future state of the process is independent of its past states given the current state. Formally, for a discrete-time Markov chain \{X_n\}_{n \geq 0} with state space \mathcal{S}, the property is expressed as X_{n+1} \perp X_{0:n-1} \mid X_n for all n \geq 1, meaning the distribution of X_{n+1} depends only on X_n and not on earlier history. This conditional independence simplifies the joint distribution to a product of transition probabilities: P(X_{0:n}) = P(X_0) \prod_{k=1}^n P(X_k \mid X_{k-1}). Higher-order Markov chains extend this by conditioning on multiple preceding states, preserving conditional independence but with a larger conditioning set. In a k-th order chain, the property becomes X_{n} \perp X_{0:n-k-1} \mid X_{n-k:n-1} for n > k, where the next state depends only on the immediate k past states. This formulation allows modeling of longer-range dependencies in sequences, such as in language modeling or financial time series, while the joint likelihood factorizes accordingly into conditional terms over the order-k history. Reversible Markov chains, which are time-symmetric in stationarity, maintain conditional independences when the process is run backward. A chain is reversible if its transition probabilities satisfy the detailed balance equations \pi_i P_{ij} = \pi_j P_{ji} for a stationary distribution \pi, implying that, conditional on the current state, the past and future trajectories are independent and, after time reversal, identically distributed. This preserves the Markov property and its conditional independences in the reversed chain, enabling applications like efficient simulation in reversible-jump MCMC. Parameter estimation in Markov chains leverages conditional independence to construct likelihoods from observed transitions. The maximum likelihood estimator for transition probabilities P_{ij} uses empirical frequencies of jumps from i to j, derived from the factorized likelihood \mathcal{L}(\mathbf{P}; \mathbf{x}) = \prod_{t=1}^T P_{x_{t-1} x_t}, where the conditioning on states is implicit via the chain structure. This approach yields consistent estimators under ergodicity, avoiding full history dependence thanks to the Markov property. Extensions to continuous-time Markov chains (CTMCs) retain conditional independence, with the future process independent of the past given the present state at time t. The Markov property holds as P(X_{s} \in A \mid X_u, u \leq t) = P(X_{s} \in A \mid X_t) for s > t, often realized via holding times that are exponentially distributed and memoryless. The Poisson process exemplifies this, as a pure birth CTMC where increments are conditionally independent given the current count, with interarrival times exponentially distributed and independent.
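
Because the likelihood factors over one-step transitions, the maximum likelihood estimate of each P_{ij} is just an empirical fraction of observed jumps. The sketch below is illustrative (the state sequence and variable names are hypothetical) and computes these estimates for a short observed path.

```python
from collections import Counter, defaultdict

path = "AABAACABBBCABAA"                 # hypothetical observed state sequence

counts = Counter(zip(path, path[1:]))    # n_ij = number of observed i -> j jumps
row_totals = defaultdict(int)
for (i, _), n in counts.items():
    row_totals[i] += n

# MLE: P_hat(i -> j) = n_ij / sum_j n_ij, a direct consequence of the
# factorized likelihood prod_t P_{x_{t-1} x_t}.
P_hat = {(i, j): n / row_totals[i] for (i, j), n in counts.items()}
for (i, j), p in sorted(P_hat.items()):
    print(f"P({i} -> {j}) = {p:.3f}")
```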

Uses in Causal Modeling

In causal modeling, conditional independence plays a central role in identifying causal effects from observational data by leveraging graphical criteria that block spurious associations. The back-door criterion, introduced by Pearl, specifies a set of variables Z, containing no descendant of the treatment X, that blocks every back-door path (a path from X to the outcome Y that begins with an arrow into X), so that the conditional distribution P(Y \mid X, Z) identifies the interventional distribution P(Y \mid do(X), Z); care must be taken not to condition on colliders, which could open bias-inducing paths. This allows estimation of the causal effect via the adjustment formula: P(Y \mid do(X=x)) = \sum_z P(Y \mid X=x, Z=z) P(Z=z), where do(X=x) denotes an intervention setting X to x. The front-door criterion complements the back-door approach when confounders are unobserved, requiring a set of intermediary variables Z that intercepts all directed paths from X to Y, such that X blocks all back-door paths from Z to Y, and no unblocked back-door paths exist from X to Z. This criterion identifies the causal effect through mediation, expressed as: P(Y \mid do(X=x)) = \sum_z P(Z=z \mid X=x) \sum_{x'} P(Y \mid Z=z, X=x') P(X=x'), enabling causal inference even without direct adjustment for confounders. Pearl's do-calculus provides a formal framework with three inference rules to manipulate expressions involving interventions, replacing do-operators with conditional probabilities based on conditional independences in the causal graph. Rule 1 (insertion/deletion of observations) states that P(y \mid do(x), z, w) = P(y \mid do(x), w) if Y \perp Z \mid X, W in G_{\overline{X}}; Rule 2 (action/observation exchange) states that P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w) if Y \perp Z \mid X, W in G_{\overline{X}\underline{Z}}, the graph with edges into X and edges out of Z removed; and Rule 3 (insertion/deletion of actions) states that P(y \mid do(x), do(z), w) = P(y \mid do(x), w) if Y \perp Z \mid X, W in G_{\overline{X}\,\overline{Z(W)}}, where Z(W) is the set of Z-nodes that are not ancestors of any W-node in G_{\overline{X}}. These rules systematically determine identifiability using conditional independences, generalizing the back-door and front-door criteria. A seminal example is Pearl's model of smoking (X), tar deposits in the lungs (Z), and lung cancer (Y), where an unobserved genotype confounds X and Y but Z mediates the effect. The front-door criterion applies since Z intercepts all X-to-Y directed paths, there are no unblocked back-door paths from X to Z, and all back-door paths from Z to Y are blocked by conditioning on X, allowing identification of smoking's causal effect on cancer despite the confounder. These methods assume all relevant variables are observed and the graph faithfully represents independences; unobserved confounders can violate assumptions, leading to biased estimates. As of 2025, active research in AI-assisted causal discovery addresses this by automating structure learning from data, incorporating large language models to integrate domain knowledge from the literature with observational data, such as in constructing causal graphs for material properties from microscopy data.
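
A minimal numerical sketch of the back-door adjustment follows; all probabilities are made-up illustrative values, not from the source. With a single observed confounder Z satisfying the criterion, the adjusted quantity \sum_z P(Y \mid X, Z=z) P(Z=z) generally differs from the naive observational conditional P(Y \mid X).

```python
P_Z = {0: 0.7, 1: 0.3}                           # confounder distribution
P_X1_given_Z = {0: 0.2, 1: 0.8}                  # confounded treatment assignment
P_Y1_given_XZ = {(0, 0): 0.1, (0, 1): 0.5,       # P(Y=1 | X, Z)
                 (1, 0): 0.3, (1, 1): 0.7}

# Interventional quantity via back-door adjustment over Z.
p_do = sum(P_Y1_given_XZ[(1, z)] * P_Z[z] for z in P_Z)

# Observational conditional P(Y=1 | X=1) for comparison (confounded).
num = sum(P_Y1_given_XZ[(1, z)] * P_X1_given_Z[z] * P_Z[z] for z in P_Z)
den = sum(P_X1_given_Z[z] * P_Z[z] for z in P_Z)
p_obs = num / den

print(p_do, p_obs)    # adjusted: 0.42 vs. confounded observational: ~0.553
```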

Axiomatic Structure of Conditional Independence

Symmetry and Decomposition

The symmetry property asserts that conditional independence is a symmetric relation. Specifically, if random variables X, Y, and Z satisfy X \perp Y \mid Z, then it follows that Y \perp X \mid Z. This holds directly from the probabilistic definition, as the joint conditional distribution factors as p(x,y \mid z) = p(x \mid z) \, p(y \mid z) if and only if p(y,x \mid z) = p(y \mid z) \, p(x \mid z). In measure-theoretic terms, for \sigma-algebras \mathcal{G}, \mathcal{H}, and \mathcal{K} generated by X, Y, and Z respectively, conditional independence \mathcal{G} \perp \mathcal{H} \mid \mathcal{K} means that for all bounded measurable functions f: \Omega \to \mathbb{R} with f \mathcal{G}-measurable and g: \Omega \to \mathbb{R} with g \mathcal{H}-measurable, \mathbb{E}[f g \mid \mathcal{K}] = \mathbb{E}[f \mid \mathcal{K}] \mathbb{E}[g \mid \mathcal{K}] almost surely; symmetry follows immediately by swapping f and g. This property is valid for any probability measure on the underlying space. The decomposition property states that conditional independence with respect to a joint set implies independence with respect to its components. Formally, if X \perp (Y, W) \mid Z, then X \perp Y \mid Z and X \perp W \mid Z. To prove this using \sigma-algebras, suppose \mathcal{G} \perp (\mathcal{H}_1 \vee \mathcal{H}_2) \mid \mathcal{K}, where \mathcal{H}_1 and \mathcal{H}_2 are generated by Y and W. For f \mathcal{G}-measurable and g \mathcal{H}_1-measurable, note that g is also (\mathcal{H}_1 \vee \mathcal{H}_2)-measurable, since \mathcal{H}_1 \subseteq \mathcal{H}_1 \vee \mathcal{H}_2. The assumed independence then gives \mathbb{E}[f g \mid \mathcal{K}] = \mathbb{E}[f \mid \mathcal{K}] \mathbb{E}[g \mid \mathcal{K}] almost surely, which is exactly \mathcal{G} \perp \mathcal{H}_1 \mid \mathcal{K}. The case for \mathcal{H}_2 is analogous. This axiom holds universally across probability measures, as the proof relies solely on the properties of conditional expectation. These properties simplify the verification of conditional independences involving multiple variables, allowing complex statements to be broken down into simpler pairwise checks without loss of validity. For instance, establishing independence from a vector (Y, W) given Z directly yields the separate independences, reducing computational or analytical effort in probabilistic modeling. Unlike more advanced axioms, symmetry and decomposition are foundational and always satisfied in probabilistic settings, providing a robust basis for further axiomatic extensions.

Union and Intersection Axioms

The weak union axiom asserts that if X \perp (Y, W) \mid Z, then X \perp Y \mid (Z, W). This property allows the conditioning set to be expanded by including part of the independent set without altering the independence relation for the remaining variables. It holds universally for any probability distribution over discrete or continuous random variables. To prove the weak union axiom using the chain rule of probability, start from the independence assumption: P(X, Y, W \mid Z) = P(X \mid Z) \cdot P(Y, W \mid Z). Then, P(X, Y \mid Z, W) = \frac{P(X, Y, W \mid Z)}{P(W \mid Z)} = \frac{P(X \mid Z) \cdot P(Y, W \mid Z)}{P(W \mid Z)} = P(X \mid Z) \cdot P(Y \mid Z, W). By decomposition, the premise also gives X \perp W \mid Z, so P(X \mid Z, W) = P(X \mid Z), and hence P(X, Y \mid Z, W) = P(X \mid Z, W) \cdot P(Y \mid Z, W), confirming the independence. This derivation relies on the chain rule and marginalization. The contraction axiom states that if X \perp Y \mid (Z, W) and X \perp W \mid Z, then X \perp (Y, W) \mid Z. This axiom enables combining two independence statements to form a joint independence over a larger set, effectively "contracting" the conditioning information. Like weak union, it applies to all probability distributions. The proof proceeds via the chain rule. From X \perp W \mid Z, P(X, W \mid Z) = P(X \mid Z) \cdot P(W \mid Z). From X \perp Y \mid (Z, W), P(X, Y \mid Z, W) = P(X \mid Z, W) \cdot P(Y \mid Z, W) = P(X \mid Z) \cdot P(Y \mid Z, W), since X \perp W \mid Z implies P(X \mid Z, W) = P(X \mid Z). Multiplying by P(W \mid Z) yields P(X, Y, W \mid Z) = P(X \mid Z) \cdot P(Y, W \mid Z), establishing the joint independence. This uses the product rule to chain the marginal and conditional factorizations. The intersection axiom provides a stronger form: if X \perp Y \mid (Z, W) and X \perp W \mid (Z, Y), then X \perp (Y, W) \mid Z. Unlike weak union and contraction, this holds only for strictly positive probability distributions, where all conditional probabilities are well-defined and greater than zero, preventing issues with undefined conditionals. It strengthens contraction by adjusting the second conditioning set to include Y, allowing more flexible derivations in positive domains. For the proof under positivity, assume both premises. These imply P(X \mid Z, Y, W) = P(X \mid Z, W) and P(X \mid Z, Y, W) = P(X \mid Z, Y), so P(X \mid Z, W) = P(X \mid Z, Y). Since this equality holds for all values of Y and W (by positivity, all combinations have positive probability), P(X \mid Z, Y) must be constant in Y (it equals P(X \mid Z, W) for any fixed W while Y varies), and similarly constant in W. Averaging over Y given Z then shows this common value equals P(X \mid Z), so P(X \mid Z, Y, W) = P(X \mid Z), which is exactly the statement X \perp (Y, W) \mid Z. This relies on positivity to ensure all conditionals are defined and the equalities hold universally. These axioms are essential for deriving additional conditional independences from partial knowledge, such as inferring broader separations in distributions or completing the independence structure in probabilistic models without full specification. They underpin algorithms for structure learning and inference in graphical models by propagating known relations efficiently.
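
A numerical sketch of the weak union axiom follows; all numbers and helper names are arbitrary illustrative assumptions. It builds a discrete joint in which X \perp (Y, W) \mid Z holds by construction, then confirms X \perp Y \mid (Z, W) by checking P(x \mid y, w, z) = P(x \mid w, z) for every cell with positive probability.

```python
from itertools import product

pZ = {0: 0.4, 1: 0.6}
pX_given_Z = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.8, 1: 0.2}}     # P(x | z)
pYW_given_Z = {0: {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4},
               1: {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}}

def joint(x, y, w, z):
    """P(x, y, w, z) = P(z) P(x|z) P(y, w|z), so X ⟂ (Y, W) | Z by construction."""
    return pZ[z] * pX_given_Z[z][x] * pYW_given_Z[z][(y, w)]

def P(pred):
    return sum(joint(x, y, w, z)
               for x, y, w, z in product((0, 1), repeat=4) if pred(x, y, w, z))

ok = True
for x, y, w, z in product((0, 1), repeat=4):
    lhs = P(lambda a, b, c, d: a == x and b == y and c == w and d == z) \
          / P(lambda a, b, c, d: b == y and c == w and d == z)
    rhs = P(lambda a, b, c, d: a == x and c == w and d == z) \
          / P(lambda a, b, c, d: c == w and d == z)
    ok = ok and abs(lhs - rhs) < 1e-12
print("weak union verified:", ok)
```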

Graphoid Properties and Beyond

The semi-graphoid axioms form a foundational set of properties satisfied by conditional independence relations in any probability distribution, consisting of four core rules: symmetry, decomposition, weak union, and contraction. Symmetry states that if X \perp\!\!\!\perp Y \mid Z, then Y \perp\!\!\!\perp X \mid Z. Decomposition asserts that if X \perp\!\!\!\perp (Y \cup W) \mid Z, then X \perp\!\!\!\perp Y \mid Z and X \perp\!\!\!\perp W \mid Z. Weak union implies that if X \perp\!\!\!\perp (Y \cup W) \mid Z, then X \perp\!\!\!\perp Y \mid (Z \cup W). Contraction holds when X \perp\!\!\!\perp Y \mid (Z \cup W) and X \perp\!\!\!\perp W \mid Z together imply X \perp\!\!\!\perp (Y \cup W) \mid Z. These axioms apply universally to any probability distribution without additional assumptions. A graphoid extends the semi-graphoid by incorporating the intersection axiom, and a compositional graphoid additionally satisfies composition (the converse of decomposition). Intersection requires that if X \perp\!\!\!\perp Y \mid (Z \cup W) and X \perp\!\!\!\perp W \mid (Z \cup Y), then X \perp\!\!\!\perp (Y \cup W) \mid Z; it holds for conditional independence in distributions with strictly positive densities. Composition states that if X \perp\!\!\!\perp Y \mid Z and X \perp\!\!\!\perp W \mid Z, then X \perp\!\!\!\perp (Y \cup W) \mid Z; it is not guaranteed by positivity alone but holds, for example, for regular multivariate Gaussian distributions and for separation in undirected graphs, which satisfies all of these properties. The intersection axiom fails in distributions lacking strict positivity, such as those assigning zero probability to certain events. For instance, consider three binary random variables A, B, and C whose joint distribution places probability 1/2 on each of the outcomes (A, B, C) = (0,0,0) and (1,1,1), so that A = B = C; the premises A \perp\!\!\!\perp B \mid C and A \perp\!\!\!\perp C \mid B hold degenerately (given either conditioner, the other two variables are constants), yet A \not\perp\!\!\!\perp (B \cup C), since A is completely determined by them. In deterministic settings more generally, where X is a function of Y, intersection breaks in the same way: premises such as X \perp\!\!\!\perp Z \mid Y and X \perp\!\!\!\perp Y \mid Z may hold while the conclusion X \perp\!\!\!\perp (Y \cup Z) fails, as X remains dependent on Y. Beyond classical probability, graphoid properties have been extended to quantum settings, where quantum conditional independence satisfies the semi-graphoid axioms but may fail to satisfy the stronger graphoid properties due to non-commutativity and entanglement effects, as explored in quantum causal models. In category theory, categoroids provide an algebraic framework hybridizing two categories to axiomatize universal conditional independence properties, generalizing graphoids to abstract structures like separoids. Graphoid axioms underpin faithful representations of conditional independence in directed acyclic graphs (DAGs), where d-separation criteria (blocking all paths between nodes) exactly capture the independences implied by the graph structure under the faithfulness assumption, enabling probabilistic graphical models.
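
The intersection counterexample can be enumerated directly. The sketch below is illustrative (the helper function P is ad hoc); it puts mass only on (0,0,0) and (1,1,1), so A = B = C, checks one of the degenerate premises, and shows that the conclusion fails.

```python
prob = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}   # P(a, b, c); all other cells are 0

def P(pred, given=lambda o: True):
    den = sum(p for o, p in prob.items() if given(o))
    return sum(p for o, p in prob.items() if given(o) and pred(o)) / den

# Premise: given C = 1, both A and B are constant, so A ⟂ B | C holds there
# (the C = 0 stratum is symmetric), and likewise A ⟂ C | B.
print(P(lambda o: o[0] == 1 and o[1] == 1, lambda o: o[2] == 1),
      P(lambda o: o[0] == 1, lambda o: o[2] == 1)
      * P(lambda o: o[1] == 1, lambda o: o[2] == 1))          # 1.0 vs 1.0
# Conclusion fails: A is not independent of (B, C).
print(P(lambda o: o[0] == 1 and o[1] == 1 and o[2] == 1),
      P(lambda o: o[0] == 1)
      * P(lambda o: o[1] == 1 and o[2] == 1))                 # 0.5 vs 0.25
```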
