
Conditional mutual information

Conditional mutual information is a fundamental quantity in information theory that measures the amount of information one random variable provides about another, conditional on a third. It extends the concept of mutual information by incorporating conditioning, capturing dependencies that persist or emerge given partial knowledge. Formally, for jointly distributed random variables X, Y, and Z, the conditional mutual information is defined as I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z), where H(\cdot \mid \cdot) denotes conditional entropy; equivalently, it can be expressed as I(X; Y \mid Z) = H(Y \mid Z) - H(Y \mid X, Z). This formulation, introduced as part of the broader framework of information measures in the mid-20th century, arises naturally in analyses of communication channels and coding under constraints.

Key properties of conditional mutual information include its non-negativity, I(X; Y \mid Z) \geq 0, which holds for both discrete and continuous random variables under standard regularity conditions, with equality if and only if X and Y are conditionally independent given Z. It is symmetric in X and Y, i.e., I(X; Y \mid Z) = I(Y; X \mid Z), and satisfies a chain rule analogous to that of mutual information: I(X_1, \dots, X_n; Y \mid Z) = \sum_{i=1}^n I(X_i; Y \mid X_1, \dots, X_{i-1}, Z). These properties make it a powerful tool for quantifying multipartite correlations and testing Markovian structures, since I(X; Y \mid Z) = 0 implies that Z fully mediates the dependence between X and Y.

In practice, conditional mutual information finds extensive applications across disciplines. In machine learning, it is widely used for feature selection, where algorithms such as conditional mutual information maximization (CMIM) identify subsets of features that are informative about a target variable while minimizing redundancy given already-selected features, improving model efficiency and interpretability. It also plays a role in causal discovery, helping to infer conditional independencies in graphical models, and in network information theory for analyzing multi-user channels and rate regions under side information. More broadly, its estimation from data supports tasks in neuroscience for detecting neural dependencies and in bioinformatics for inferring gene interaction networks.

Fundamentals

Definition

Conditional mutual information, denoted I(X; Y \mid Z), measures the amount of information shared between two random variables X and Y after accounting for the information each shares with a third variable Z. It quantifies the dependence between X and Y that remains even when Z is known, capturing how much knowing Y (and Z) reduces uncertainty about X beyond what Z alone provides. This concept generalizes mutual information I(X; Y), which assesses dependence without conditioning, to scenarios where partial knowledge of Z influences the relationship.

To understand conditional mutual information, recall the prerequisite concepts of entropy and mutual information from information theory. The entropy H(X) measures the uncertainty in a random variable X, while the conditional entropy H(X \mid Z) is the expected remaining uncertainty in X given Z, defined as H(X \mid Z) = H(X, Z) - H(Z). Mutual information is then I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X), representing the reduction in uncertainty of one variable due to knowledge of the other.

Formally, conditional mutual information is defined in terms of conditional entropies as I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z). This follows from the definition of conditional entropy: H(X \mid Z) = -\sum_{z} p(z) \sum_{x} p(x \mid z) \log p(x \mid z) and H(X \mid Y, Z) = -\sum_{y,z} p(y,z) \sum_{x} p(x \mid y,z) \log p(x \mid y,z), where the difference isolates the additional reduction in entropy from Y given Z. Equivalent expressions include I(X; Y \mid Z) = H(Y \mid Z) - H(Y \mid X, Z) and, expanding in terms of joint entropies, I(X; Y \mid Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z). The latter derives by substituting H(X \mid Z) = H(X, Z) - H(Z) and H(X \mid Y, Z) = H(X, Y, Z) - H(Y, Z) into the primary definition, yielding I(X; Y \mid Z) = [H(X, Z) - H(Z)] - [H(X, Y, Z) - H(Y, Z)].

Conditional mutual information sits within the framework established by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication," which introduced entropy, conditional entropy, and mutual information as core measures for analyzing communication systems, with early applications in noisy-channel coding. The explicit definition of conditional mutual information appeared in later developments of information theory, such as the textbook treatment by Cover and Thomas (1991).
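As a concrete companion to the entropy-based definition above, the following Python sketch (illustrative only, not taken from any cited source; the helper names entropy and conditional_mutual_information are arbitrary) computes I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) from a tabulated joint PMF. In the demo distribution, Z is a fair coin that decides whether Y copies X or is independent noise, so the result is 0.5 bits.

```python
# Illustrative sketch: I(X;Y|Z) = H(X|Z) - H(X|Y,Z) from a joint PMF.
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (zero entries skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def conditional_mutual_information(pxyz):
    """I(X;Y|Z) in bits from a 3-D array pxyz[x, y, z] that sums to 1."""
    p_xz = pxyz.sum(axis=1)                        # p(x, z)
    p_yz = pxyz.sum(axis=0)                        # p(y, z)
    p_z = pxyz.sum(axis=(0, 1))                    # p(z)
    h_x_given_z = entropy(p_xz) - entropy(p_z)     # H(X|Z)   = H(X,Z) - H(Z)
    h_x_given_yz = entropy(pxyz) - entropy(p_yz)   # H(X|Y,Z) = H(X,Y,Z) - H(Y,Z)
    return h_x_given_z - h_x_given_yz

# Demo: X is a fair bit; if Z = 0 then Y = X, if Z = 1 then Y is independent noise.
pxyz = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        pxyz[x, y, 0] = 0.25 if y == x else 0.0    # Z = 0: Y copies X
        pxyz[x, y, 1] = 0.125                      # Z = 1: Y independent of X
print(conditional_mutual_information(pxyz))        # 0.5 bits
```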

Notation Conventions

In standard information theory literature, conditional mutual information is denoted I(X; Y \mid Z), where the semicolon separates the two random variables X and Y whose shared information is being measured, and the vertical bar indicates conditioning on the third random variable Z. This notation emphasizes the mutual dependence between X and Y given Z, distinguishing it from joint entropy or other measures. An alternative notation, I(X, Y \mid Z), occasionally appears in some texts, using a comma instead of a semicolon; the two are equivalent in meaning, with the semicolon being the more conventional choice to avoid confusion with joint distributions. Random variables are represented by uppercase letters (e.g., X, Y, Z), while their specific realizations or values are denoted in lowercase (e.g., x, y, z). The measure is typically expressed in bits when using the base-2 logarithm, or in nats for the natural logarithm (base e); the base should be stated explicitly whenever it is not clear from context. A common notational pitfall arises from the similarity to conditional independence statements such as X \perp Y \mid Z, which denotes statistical independence between X and Y given Z; notably, I(X; Y \mid Z) = 0 is equivalent to X \perp Y \mid Z for discrete or absolutely continuous random variables, and the equivalence extends to more general distributions under additional regularity conditions.

Mathematical Expressions

Discrete Distributions

For discrete random variables X, Y, and Z taking values in finite alphabets \mathcal{X}, \mathcal{Y}, and \mathcal{Z} respectively, the joint probability mass function (PMF) p(x,y,z) is fully specified, with marginal and conditional PMFs derived accordingly. As defined generally, the conditional mutual information I(X;Y|Z) measures the expected reduction in uncertainty about X given Z upon observing Y, expressed via conditional entropies.

To derive the explicit form using PMFs, begin with the entropy-based definition: I(X;Y|Z) = H(X|Z) - H(X|Y,Z), where the conditional entropies are H(X|Z) = -\sum_{x,z} p(x,z) \log p(x|z) and H(X|Y,Z) = -\sum_{x,y,z} p(x,y,z) \log p(x|y,z). Substituting these into the difference yields I(X;Y|Z) = \sum_{x,y,z} p(x,y,z) \log p(x|y,z) - \sum_{x,z} p(x,z) \log p(x|z). The second term can be rewritten by introducing the dummy variable y and using the joint PMF: \sum_{x,z} p(x,z) \log p(x|z) = \sum_{x,y,z} p(x,y,z) \log p(x|z), since summing over y preserves the marginal. Thus, I(X;Y|Z) = \sum_{x,y,z} p(x,y,z) \log \frac{p(x|y,z)}{p(x|z)}. Recognizing that p(x|y,z) = \frac{p(x,y|z)}{p(y|z)} and substituting gives \log \frac{p(x|y,z)}{p(x|z)} = \log \frac{p(x,y|z)}{p(x|z) p(y|z)}, leading to the PMF expression I(X;Y|Z) = \sum_{x,y,z} p(x,y,z) \log \frac{p(x,y|z)}{p(x|z) p(y|z)}, where the conditional joint PMF is p(x,y|z) = p(x,y,z)/p(z) for p(z) > 0. This form quantifies the dependence between X and Y after conditioning on Z through the Kullback-Leibler divergence between the conditional joint distribution and the product of the conditional marginals, averaged over p(z).

An illustrative example is the binary symmetric channel (BSC), where X \in \{0,1\} is the input (Bernoulli with parameter 0.5), Z \in \{0,1\} is independent noise with crossover probability p = 0.1, and Y = X \oplus Z is the noisy output, all with finite binary alphabets. The unconditional mutual information is I(X;Y) = 1 - h_2(0.1) \approx 0.531 bits, where h_2(p) = -p \log_2 p - (1-p) \log_2 (1-p) is the binary entropy function, reflecting partial information loss due to the noise. Conditioning on Z, the channel becomes deterministic since X = Y \oplus Z, so H(X|Y,Z) = 0 and H(X|Z) = H(X) = 1 bit (by independence of X and Z). Thus, I(X;Y|Z) = 1 bit, showing that once the noise is observed, Y recovers the full bit of information about X, exceeding the unconditional dependence.
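The BSC numbers above can be checked numerically. The short Python sketch below (an illustration under the stated assumptions, not code from any cited source) builds the joint PMF for crossover probability 0.1 and evaluates both the unconditional and the conditional mutual information via joint entropies.

```python
# Illustrative check of the binary symmetric channel example (crossover 0.1).
import numpy as np

def H(p):
    """Shannon entropy in bits of any probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

eps = 0.1
pxyz = np.zeros((2, 2, 2))                 # indices [x, y, z] with Y = X xor Z
for x in (0, 1):
    for z in (0, 1):
        pxyz[x, x ^ z, z] = 0.5 * (eps if z == 1 else 1 - eps)

p_xy = pxyz.sum(axis=2)
I_xy = H(p_xy.sum(axis=1)) + H(p_xy.sum(axis=0)) - H(p_xy)      # I(X;Y)
p_xz, p_yz, p_z = pxyz.sum(axis=1), pxyz.sum(axis=0), pxyz.sum(axis=(0, 1))
I_xy_given_z = H(p_xz) + H(p_yz) - H(pxyz) - H(p_z)             # I(X;Y|Z)
print(round(I_xy, 3), round(I_xy_given_z, 3))                   # 0.531 1.0
```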

Continuous Distributions

For continuous random variables X, Y, and Z with joint probability density function f(x,y,z), the conditional mutual information I(X; Y \mid Z) is defined as the expected value of the log-ratio of the conditional joint density to the product of the conditional marginal densities: I(X; Y \mid Z) = \iiint f(x,y,z) \log \left( \frac{f(x,y \mid z)}{f(x \mid z) f(y \mid z)} \right) \, dx \, dy \, dz, where the integral extends over the support of the densities, and the conditional densities are given by f(x,y \mid z) = f(x,y,z)/f(z), f(x \mid z) = \int f(x,y,z) \, dy / f(z), and f(y \mid z) = \int f(x,y,z) \, dx / f(z). This expression arises analogously to the discrete case, replacing probability mass functions with density functions and summations with integrals; it equals the expectation \mathbb{E} \left[ \log \frac{f(X,Y \mid Z)}{f(X \mid Z) f(Y \mid Z)} \right], taken with respect to the joint density f(x,y,z). Unlike individual differential entropies, which can be negative or diverge to -\infty for continuous variables, the conditional mutual information remains non-negative and finite under mild regularity conditions on the densities.

A representative example occurs when X, Y, and Z are jointly multivariate normal with mean zero and covariance matrix \Sigma. In this case, I(X; Y \mid Z) admits a closed-form expression in terms of the partial correlation coefficient \rho(X,Y \mid Z) between X and Y given Z, specifically I(X; Y \mid Z) = -\frac{1}{2} \log \left(1 - \rho^2(X,Y \mid Z)\right) for scalar variables, where \rho(X,Y \mid Z) = \Sigma_{XY \mid Z} / \sqrt{\Sigma_{XX \mid Z} \Sigma_{YY \mid Z}} and \Sigma_{\cdot \mid Z} denotes the conditional (partial) covariance given Z. The value increases with the magnitude of the partial correlation: it equals zero when \rho(X,Y \mid Z) = 0 (conditional independence in the Gaussian case) and diverges to infinity as |\rho(X,Y \mid Z)| \to 1 (perfect conditional linear dependence). This formulation relies on differential entropy, the continuous analog of Shannon entropy, yet conditional mutual information quantifies dependence without the infinities that can plague absolute entropies of continuous distributions.
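For the Gaussian case, the closed form above can be evaluated directly from a covariance matrix. The sketch below (illustrative; the function name gaussian_cmi and the example covariance are assumptions, not from a cited source) computes the partial correlation of scalar X and Y given Z and returns I(X; Y \mid Z) in nats.

```python
# Illustrative sketch: Gaussian conditional mutual information (in nats)
# from a 3x3 covariance matrix, via the partial correlation rho(X,Y|Z).
import numpy as np

def gaussian_cmi(cov):
    """I(X;Y|Z) in nats for zero-mean jointly Gaussian scalars ordered (X, Y, Z)."""
    cov = np.asarray(cov, dtype=float)
    # Partial covariances given Z: Sigma_ab|Z = Sigma_ab - Sigma_aZ Sigma_ZZ^{-1} Sigma_Zb
    sxx = cov[0, 0] - cov[0, 2] ** 2 / cov[2, 2]
    syy = cov[1, 1] - cov[1, 2] ** 2 / cov[2, 2]
    sxy = cov[0, 1] - cov[0, 2] * cov[1, 2] / cov[2, 2]
    rho = sxy / np.sqrt(sxx * syy)            # partial correlation rho(X,Y|Z)
    return -0.5 * np.log(1.0 - rho ** 2)

cov = np.array([[1.0, 0.6, 0.5],
                [0.6, 1.0, 0.5],
                [0.5, 0.5, 1.0]])
# rho(X,Y|Z) = (0.6 - 0.25) / 0.75 ~ 0.467, so I(X;Y|Z) ~ 0.12 nats.
print(gaussian_cmi(cov))
```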

General Measure-Theoretic Formulation

In the measure-theoretic formulation, conditional mutual information is defined on a probability space (\Omega, \mathcal{F}, P), where the random variables X, Y, and Z are measurable functions from \Omega to respective measurable spaces (\mathcal{X}, \mathcal{B}_\mathcal{X}), (\mathcal{Y}, \mathcal{B}_\mathcal{Y}), and (\mathcal{Z}, \mathcal{B}_\mathcal{Z}). Conditioning is performed with respect to the \sigma-algebra \mathcal{F}_Z \subseteq \mathcal{F} generated by Z, which captures the information available from Z. The induced measures are the joint distribution P_{XYZ} on \mathcal{B}_\mathcal{X} \times \mathcal{B}_\mathcal{Y} \times \mathcal{B}_\mathcal{Z} and the relevant conditional distributions P_{XY|Z}, P_{X|Z}, and P_{Y|Z}, assuming the necessary absolute continuity conditions hold for the Radon-Nikodym derivatives to exist.

The abstract definition of conditional mutual information I(X; Y | Z) is the expected Kullback-Leibler divergence between the conditional joint distribution of X and Y given Z and the product of their conditional marginal distributions: I(X; Y | Z) = \mathbb{E}_{P_Z} \left[ D\left( P_{XY|Z} \,\middle\|\, P_{X|Z} \otimes P_{Y|Z} \right) \right], where the expectation is taken with respect to the distribution of Z, and the conditional KL divergence is D(P_{XY|z} \| P_{X|z} \otimes P_{Y|z}) = \int \log \frac{dP_{XY|z}}{d(P_{X|z} \otimes P_{Y|z})} \, dP_{XY|z} for each realization z of Z. Equivalently, it can be expressed in integral form over the joint space as I(X; Y | Z) = \int \log \frac{dP_{XYZ}}{d(P_{X|Z} \otimes P_{Y|Z} \otimes P_Z)} \, dP_{XYZ}, where P_{X|Z} \otimes P_{Y|Z} \otimes P_Z denotes the product measure that would arise under conditional independence of X and Y given Z. The pointwise integrand, \log \frac{dP_{XY|Z}}{d(P_{X|Z} \otimes P_{Y|Z})}, represents the local or pointwise conditional mutual information at each outcome.

This general setup specializes to the discrete case, where the distributions are atomic measures on countable spaces and the integral reduces to the summation \sum_{x,y,z} p_{XYZ}(x,y,z) \log \frac{p_{XY|Z}(x,y|z)}{p_{X|Z}(x|z) p_{Y|Z}(y|z)}, and to the continuous case, where Lebesgue densities exist and the expression becomes \int p_Z(z) \left[ \iint p_{XY|Z}(x,y|z) \log \frac{p_{XY|Z}(x,y|z)}{p_{X|Z}(x|z) p_{Y|Z}(y|z)} \, dx \, dy \right] dz. The measure-theoretic approach offers significant advantages over restricted formulations, as it accommodates arbitrary probability measures, including those with mixed discrete-continuous or singular components that lack densities with respect to Lebesgue or counting measures, and extends naturally to infinite-dimensional or nonstandard spaces such as the function spaces arising in stochastic processes. It serves as the foundational framework for advanced information-theoretic results, including ergodic decompositions, information rates in stationary processes, and capacity theorems for general channels.

Key Properties

Non-negativity

Conditional mutual information I(X; Y \mid Z) is always non-negative, i.e., I(X; Y \mid Z) \geq 0, for any random variables X, Y, and Z. This property follows directly from the expression of conditional mutual information as an averaged Kullback-Leibler (KL) divergence between the conditional joint distribution and the product of the conditional marginals. Specifically, for discrete random variables, I(X; Y \mid Z) = \sum_{z} p(z) \, D_{\mathrm{KL}}\left( P_{X,Y \mid Z=z} \,\middle\|\, P_{X \mid Z=z} \, P_{Y \mid Z=z} \right), where each term D_{\mathrm{KL}}(\cdot \| \cdot) \geq 0 by the non-negativity of the divergence, and thus the weighted average is also non-negative. The non-negativity of the divergence itself is proved using Jensen's inequality applied to the convex function f(u) = -\log u: D_{\mathrm{KL}}(P \| Q) = \mathbb{E}_{P} \left[ \log \frac{P}{Q} \right] = -\mathbb{E}_{P} \left[ \log \frac{Q}{P} \right] \geq -\log \mathbb{E}_{P} \left[ \frac{Q}{P} \right] = -\log 1 = 0, with equality if and only if P = Q. For continuous random variables, the proof is analogous, replacing sums with integrals under suitable regularity conditions to ensure the entropies are well-defined.

Equality in the non-negativity of I(X; Y \mid Z) holds if and only if X and Y are conditionally independent given Z, that is, p(x,y \mid z) = p(x \mid z) p(y \mid z) for all x, y, z with p(z) > 0. This condition means that Z fully accounts for any dependence between X and Y, rendering the conditional mutual information zero. In the general measure-theoretic formulation, the non-negativity extends via the relative entropy (KL divergence) between probability measures on abstract spaces, where I(X; Y \mid Z) is defined using Radon-Nikodym derivatives, and the inequality holds by the same convexity argument provided the required absolute continuity conditions are met.

Although conditional mutual information is always non-negative, it can exceed the unconditional mutual information I(X; Y), which may intuitively suggest that conditioning "increases" dependence; this does not violate non-negativity, as both quantities remain \geq 0. A classic example involves three binary random variables A, B, and C uniformly distributed over the set where A \oplus B \oplus C = 0: here, I(A; B) = 0 due to independence, but I(A; B \mid C) = 1 bit, as conditioning on C reveals perfect dependence between A and B.
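The parity example can be verified numerically. The following sketch (illustrative only, with an arbitrary entropy helper H) tabulates the uniform distribution over triples with A \oplus B \oplus C = 0 and confirms I(A; B) = 0 while I(A; B \mid C) = 1 bit.

```python
# Illustrative check: conditioning can increase mutual information
# (uniform distribution over triples with a xor b xor c = 0).
import numpy as np

def H(p):
    """Shannon entropy in bits of any probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p = np.zeros((2, 2, 2))              # indices [a, b, c]
for a in (0, 1):
    for b in (0, 1):
        p[a, b, a ^ b] = 0.25        # uniform over triples with a ^ b ^ c = 0

p_ab, p_a, p_b = p.sum(axis=2), p.sum(axis=(1, 2)), p.sum(axis=(0, 2))
I_ab = H(p_a) + H(p_b) - H(p_ab)                         # = 0 bits
p_ac, p_bc, p_c = p.sum(axis=1), p.sum(axis=0), p.sum(axis=(0, 1))
I_ab_given_c = H(p_ac) + H(p_bc) - H(p) - H(p_c)         # = 1 bit
print(I_ab, I_ab_given_c)
```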

Chain Rule

The chain rule for conditional mutual information provides a decomposition of the mutual information between multiple random variables and an output, conditioned on another variable, into a sum of individual conditional mutual informations with progressively more conditioning. For random variables X_1, \dots, X_n, Y, Z, the chain rule states I(X_1, \dots, X_n; Y \mid Z) = \sum_{i=1}^n I(X_i; Y \mid Z, X_1, \dots, X_{i-1}), where the conditioning on the previous X_j for j < i accounts for dependencies built sequentially. This equality holds for both discrete and continuous distributions under standard measurability assumptions.

The decomposition derives directly from the chain rule for conditional entropy, which expands the joint conditional entropy as H(X_1, \dots, X_n \mid Z) = \sum_{i=1}^n H(X_i \mid Z, X_1, \dots, X_{i-1}). Substituting into the definition of conditional mutual information, I(X_1, \dots, X_n; Y \mid Z) = H(X_1, \dots, X_n \mid Z) - H(X_1, \dots, X_n \mid Y, Z), and applying the entropy chain rule to the second term as well yields the telescoping sum, proving the result inductively by adding one variable at a time. For the bivariate case with random variables X, Y, Z, W, the chain rule simplifies to I(X, Y; Z \mid W) = I(X; Z \mid W) + I(Y; Z \mid X, W), capturing how the dependence of Y on Z is refined given knowledge of X.

This rule finds applications in iterative conditioning processes, such as successive interference cancellation in multi-user communication channels, where decoding one user's signal conditions the information rate available to subsequent users via the chain rule decomposition. In feature selection for machine learning, it enables greedy algorithms that sequentially select features by maximizing conditional mutual information with the target given previously chosen features, reducing redundancy while preserving relevance to the target variable.
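The bivariate chain rule I(X_1, X_2; Y \mid Z) = I(X_1; Y \mid Z) + I(X_2; Y \mid X_1, Z) can be checked on an arbitrary joint PMF. The sketch below (illustrative; the helper cmi and its axis-tuple interface are assumptions made for this example) evaluates both sides on a random four-variable distribution.

```python
# Illustrative numerical check of the chain rule
# I(X1,X2; Y | Z) = I(X1; Y | Z) + I(X2; Y | X1, Z).
import numpy as np

def H(p):
    """Shannon entropy in bits of any probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2))         # indices [x1, x2, y, z]
p /= p.sum()                         # arbitrary joint PMF

def cmi(p, left, right, cond):
    """I(left; right | cond), where each argument is a tuple of axes of p."""
    def marg(axes):
        keep = set(axes)
        drop = tuple(ax for ax in range(p.ndim) if ax not in keep)
        return p.sum(axis=drop) if drop else p
    return (H(marg(left + cond)) + H(marg(right + cond))
            - H(marg(left + right + cond)) - H(marg(cond)))

lhs = cmi(p, (0, 1), (2,), (3,))
rhs = cmi(p, (0,), (2,), (3,)) + cmi(p, (1,), (2,), (0, 3))
print(np.isclose(lhs, rhs))          # True
```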

Additional Identities

The conditional mutual information satisfies the symmetry property I(X; Y \mid Z) = I(Y; X \mid Z). This follows from the definition, as I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) and I(Y; X \mid Z) = H(Y \mid Z) - H(Y \mid X, Z); expanding each conditional entropy in terms of joint entropies shows that both expressions equal H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z), which is symmetric in X and Y.

Conditional mutual information relates to the unconditional mutual information through the interaction information I(X; Y; Z), defined as I(X; Y; Z) = I(X; Y \mid Z) - I(X; Y). Rearranging yields I(X; Y \mid Z) = I(X; Y) + I(X; Y; Z), illustrating a trade-off: the interaction term I(X; Y; Z) can be positive (indicating synergy beyond the pair), zero (no three-way interaction), or negative (indicating redundancy), which determines whether conditioning on Z increases or decreases the shared information between X and Y. To verify, expand using entropies: I(X; Y \mid Z) = H(X \mid Z) + H(Y \mid Z) - H(X, Y \mid Z), I(X; Y) = H(X) + H(Y) - H(X, Y), and I(X; Y; Z) = H(X \mid Z) + H(Y \mid Z) - H(X, Y \mid Z) - H(X) - H(Y) + H(X, Y); substituting confirms the relation.

A conditional form of the data processing inequality states that if X \to Y \to W \mid Z (i.e., X, Y, W form a Markov chain conditionally on Z), then I(X; W \mid Z) \leq I(X; Y \mid Z). This implies that processing Y to obtain W given Z cannot increase the information about X. The proof follows by applying the standard data processing inequality to the conditional distributions given each realization Z = z and averaging over Z, since conditional mutual information is the expectation over Z of these per-realization mutual informations. Equality holds if W is a sufficient statistic for X given Y and Z.
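The identity I(X; Y \mid Z) = I(X; Y) + I(X; Y; Z) can likewise be verified numerically. The sketch below (illustrative, using an arbitrary random PMF and an assumed entropy helper H) computes the interaction term from the entropy inclusion-exclusion expression and checks that it matches the difference of the two mutual informations.

```python
# Illustrative check of I(X;Y|Z) = I(X;Y) + II(X;Y;Z) on a random joint PMF.
import numpy as np

def H(p):
    """Shannon entropy in bits of any probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(1)
p = rng.random((2, 3, 2))            # indices [x, y, z], arbitrary joint PMF
p /= p.sum()

p_xy, p_xz, p_yz = p.sum(axis=2), p.sum(axis=1), p.sum(axis=0)
p_x, p_y, p_z = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))

I_xy = H(p_x) + H(p_y) - H(p_xy)                       # I(X;Y)
I_xy_given_z = H(p_xz) + H(p_yz) - H(p) - H(p_z)       # I(X;Y|Z)
# Interaction information from the entropy inclusion-exclusion expression.
II = -(H(p_x) + H(p_y) + H(p_z) - H(p_xy) - H(p_xz) - H(p_yz) + H(p))
print(np.isclose(I_xy_given_z, I_xy + II))             # True
```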

Interaction Information

Interaction information, also known as co-information, extends mutual information to three random variables X, Y, and Z, quantifying the synergistic or redundant information among them beyond pairwise dependencies. The measure was introduced by William McGill in 1954. It is defined as the difference between the conditional mutual information and the unconditional mutual information: II(X; Y; Z) = I(X; Y \mid Z) - I(X; Y). This measure captures how the dependency between X and Y is influenced by knowledge of Z. Equivalently, it can be expressed in terms of entropies as II(X; Y; Z) = - \left[ H(X) + H(Y) + H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z) \right].

The sign of II(X; Y; Z) provides insight into the nature of the three-way interaction: a positive value indicates synergy, where conditioning on Z increases the mutual information between X and Y (i.e., I(X; Y \mid Z) > I(X; Y)), revealing information that emerges only when all three variables are considered together; a negative value signifies redundancy, where conditioning on Z reduces the mutual information (i.e., I(X; Y \mid Z) < I(X; Y)), suggesting overlapping information shared among the variables. This dual interpretation makes interaction information useful for detecting higher-order dependencies that pairwise mutual information alone cannot identify.

A classic example illustrating synergy is the XOR logic gate, where Z = X \oplus Y for independent uniform binary variables X and Y. Here, the pairwise mutual information I(X; Y) = 0 since X and Y are independent, but the conditional mutual information I(X; Y \mid Z) = 1 bit, as conditioning on Z fully determines the relationship between X and Y. Thus, II(X; Y; Z) = 1 - 0 = 1 bit, indicating positive synergy and the presence of irreducible three-way dependence. In genetics, interaction information has been applied to identify synergistic effects between genes, such as in epistatic interactions where the combined effect of two loci on a phenotype exceeds their individual contributions, with positive II values highlighting non-additive dependencies in disease risk or susceptibility.

Multivariate Generalizations

Multivariate conditional mutual information extends the bivariate case to scenarios involving multiple conditioning variables, quantifying the dependence between two random variables given a set of others. Formally, for random variables X and Y conditioned on Z_1, \dots, Z_k, it is defined as I(X; Y \mid Z_1, \dots, Z_k) = H(X \mid Z_1, \dots, Z_k) - H(X \mid Y, Z_1, \dots, Z_k), where H denotes conditional entropy. This measures the reduction in uncertainty about X provided by Y after accounting for the joint conditioning set \{Z_1, \dots, Z_k\}. The chain rule for conditional mutual information allows iterative computation over multiple variables, such as I(X_1, \dots, X_n; Y \mid Z) = \sum_{i=1}^n I(X_i; Y \mid Z, X_1, \dots, X_{i-1}), facilitating analysis in high-dimensional settings.

Partial mutual information addresses scenarios where specific confounders must be excluded to isolate direct dependencies. It is defined as I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z), representing the shared information between X and Y after conditioning on Z. For additional conditioners, it extends to I(X; Y \mid Z, W). The difference I(X; Y \mid Z) - I(X; Y \mid Z, W) represents the portion of the dependence between X and Y given Z that is explained by the additional conditioner W. This quantity helps identify whether W mediates or confounds the relationship, effectively excluding its influence to focus on residual associations. In multivariate analysis, such measures detect couplings not attributable to common influences, enhancing the detection of direct interactions.

Higher-order generalizations include co-information and total correlation, which capture multi-way interactions among n variables. Co-information extends interaction information to arbitrary dimensions via an inclusion-exclusion principle on entropies, quantifying shared information across all variables with alternating signs in the sum. Total correlation, also known as multi-information, is given by C(X_1, \dots, X_n) = \sum_{i=1}^n H(X_i) - H(X_1, \dots, X_n), measuring the total multivariate dependence as the divergence between the joint distribution and the product of its marginals. These metrics reveal synergistic or redundant structures beyond pairwise relations.

These generalizations find application in causal inference, where they estimate edge strengths in directed acyclic graphs by assessing conditional dependencies, and in graphical models for structure learning through conditional independence testing. For instance, they aid in inferring gene regulatory networks by quantifying causal influences under observed covariates.
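As a small worked illustration of total correlation (again an illustrative sketch rather than a reference implementation; the function name total_correlation is arbitrary), the code below evaluates C(X_1, \dots, X_n) = \sum_i H(X_i) - H(X_1, \dots, X_n) for three perfectly coupled bits, giving 3 - 1 = 2 bits.

```python
# Illustrative sketch: total correlation (multi-information) of a joint PMF
# given as an n-dimensional array p[x1, ..., xn].
import numpy as np

def H(p):
    """Shannon entropy in bits of any probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def total_correlation(p):
    """C(X1,...,Xn) = sum_i H(Xi) - H(X1,...,Xn) in bits."""
    marginals = [p.sum(axis=tuple(j for j in range(p.ndim) if j != i))
                 for i in range(p.ndim)]
    return sum(H(m) for m in marginals) - H(p)

# Three perfectly coupled bits: X1 = X2 = X3, each uniform.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = p[1, 1, 1] = 0.5
print(total_correlation(p))   # 3*1 - 1 = 2 bits
```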

References

  1. [PDF] Lecture 2: Source coding, Conditional Entropy, Mutual Information
  2. [PDF] Lecture 11: Conditional Mutual Information and Letter Typical ...
  3. Feature selection with conditional mutual information maximin in text ...
  4. [PDF] A Unifying Framework for Information Theoretic Feature Selection
  5. Conditional Mutual Information Based Feature Selection for ...
  6. [PDF] 6.441S16: Chapter 2: Information Measures: Mutual Information
  7. [PDF] A Mathematical Theory of Communication
  8. [PDF] Lecture 3: Entropy, Relative Entropy, and Mutual Information
  9. [PDF] Entropy, Relative Entropy and Mutual Information - Columbia CS
  10. [PDF] On Measures of Entropy and Information - Gavin E. Crooks
  11. [PDF] CCMI: Classifier based Conditional Mutual Information Estimation
  12. [PDF] On the Conditional Mutual Information in the Gaussian-Markov ...
  13. [PDF] Entropy and Information Theory - Stanford Electrical Engineering
  14. [PDF] Entropy, Relative Entropy, and Mutual Information
  15. [PDF] Lecture 3: Relative Entropy - cs.Princeton
  16. [PDF] Lecture 2: Entropy and mutual information
  17. Proof: Non-negativity of the Kullback-Leibler divergence
  18. [PDF] Entropy, Mutual Information
  19. [PDF] Elements of Information Theory, Second Edition (Cover and Thomas)
  20. Capturing the Spectrum of Interaction Effects in Genetic Association ...
  21. Partial Mutual Information for Coupling Analysis of Multivariate Time ...
  22. Information Theoretical Analysis of Multivariate Correlation