Data processing inequality
The data processing inequality (DPI) is a cornerstone theorem in information theory stating that the mutual information between an input random variable and an output cannot increase when the output is further processed to produce another variable, formally expressed as I(X; Z) \leq I(X; Y) whenever X, Y, and Z form a Markov chain X \to Y \to Z.[1] This inequality implies that any stochastic or deterministic processing of data, such as encoding, filtering, or estimation, can only preserve or diminish the information content about the source, ensuring that no manipulation extracts more information than was originally available.[2]

Originally formulated by Claude E. Shannon in his seminal 1948 paper "A Mathematical Theory of Communication," the DPI appears in the context of communication channels, where it demonstrates that the rate of reliable information transmission R(X; V) from input X to a statistically processed version V of the output Y satisfies R(X; V) \leq R(X; Y).[3] Proofs of the inequality typically rely on the non-negativity of relative entropy (Kullback-Leibler divergence), showing that I(X; Z) = I(X; Y) - I(X; Y \mid Z), where I(X; Y \mid Z) = \mathbb{E}_{X,Z} \left[ D(P_{Y|X,Z} \| P_{Y|Z}) \right] \geq 0, with equality holding if and only if I(X; Y \mid Z) = 0, that is, if X \to Z \to Y also forms a Markov chain, so that Z retains all the information about X contained in Y.[1] The theorem extends to continuous variables and quantum settings, maintaining its core structure under appropriate generalizations.[4]

The DPI plays a pivotal role in information-theoretic limits, underpinning the converse to the channel coding theorem by bounding achievable rates in noisy communication systems and proving impossibility results for data compression and hypothesis testing.[2] Beyond communications, it finds applications in statistics for characterizing sufficient statistics and the Cramér-Rao bound, in machine learning for analyzing variational inference and model collapse under repeated processing, and in Bayesian networks to quantify information flow in causal structures.[5][6] Stronger variants, such as those quantifying the contraction rate of mutual information, further enable precise analyses in reinforcement learning and privacy-preserving data release.[7]
Foundations in Information Theory
Mutual Information
Mutual information is a central concept in information theory, serving as a measure of the shared information between two random variables; it underpins inequalities like the data processing inequality by quantifying dependence in probabilistic systems.[3] The mutual information I(X; Y) between random variables X and Y is defined as the reduction in uncertainty about one variable upon observing the other: I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X), where H(\cdot) denotes entropy and H(\cdot \mid \cdot) conditional entropy.[3] This can equivalently be expressed using joint entropy as I(X; Y) = H(X) + H(Y) - H(X, Y).[8]

Mutual information possesses several key properties that highlight its role as a dependence measure. It is non-negative, I(X; Y) \geq 0, with equality if and only if X and Y are independent.[8] Additionally, it is symmetric, I(X; Y) = I(Y; X), reflecting the bidirectional nature of shared information.[8]

The definitions apply to both discrete and continuous random variables. For discrete variables, entropy is the Shannon entropy H(X) = -\sum p(x) \log p(x).[3] For continuous variables, the analogous quantity is differential entropy h(X) = -\int f(x) \log f(x) \, dx, where f(x) is the probability density function, leading to I(X; Y) = h(X) - h(X \mid Y) = h(Y) - h(Y \mid X).[8]

Mutual information also satisfies a chain rule, extending its application to multiple variables: I(X, Y; Z) = I(X; Z) + I(Y; Z \mid X). This decomposes the total shared information into sequential contributions.[8]
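The discrete definition lends itself to direct computation from a joint probability table. The following Python snippet is a minimal sketch, with the helper name mutual_information and the example joint distribution chosen purely for illustration: it evaluates I(X; Y) in bits for a uniform bit X observed through a channel that flips it with probability 0.1.

```python
import numpy as np

def mutual_information(p_xy):
    """Mutual information I(X;Y) in bits from a joint probability table p_xy[x, y]."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y
    mask = p_xy > 0                         # convention: 0 log 0 = 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# Example: X is a uniform bit, Y equals X flipped with probability 0.1.
p = 0.1
joint = np.array([[0.5 * (1 - p), 0.5 * p],
                  [0.5 * p,       0.5 * (1 - p)]])
print(mutual_information(joint))  # about 0.531 bits
```

The result, about 0.531 bits, equals 1 - h_2(0.1), the closed-form value that reappears in the channel processing example below.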
Markov Chains
A Markov chain provides the probabilistic structure underlying data processing scenarios by modeling sequences of random variables where each subsequent variable depends only on the immediate predecessor, capturing the essence of information flow through successive transformations. For random variables X, Y, and Z, the notation X \to Y \to Z denotes a Markov chain if the conditional probability satisfies p(z \mid x, y) = p(z \mid y) for all x, y, z, meaning Z depends on X solely through Y.[9] This condition implies conditional independence between X and Z given Y, equivalently stated as I(X; Z \mid Y) = 0, where I measures mutual information as a quantification of dependence.[8]

The defining property of a Markov chain is its memorylessness: the probability distribution of the next state depends only on the current state and not on the sequence of prior states.[10] For chains with more than three variables, such as X_1 \to X_2 \to \cdots \to X_n, this extends via the Chapman-Kolmogorov equations, which express the n-step transition probabilities in terms of intermediate steps: for discrete states i, j, k, p_{ij}^{(m+n)} = \sum_k p_{ik}^{(m)} p_{kj}^{(n)}, where p_{ij}^{(r)} is the probability of transitioning from state i to j in r steps; this semigroup property ensures consistency across multiple processing stages.[11]

In information theory, both discrete-time and continuous-time Markov processes are relevant, with the former involving state transitions at fixed intervals and the latter modeling continuous evolution via rates of change, often analyzed through entropy rates to assess long-term information behavior.[8] Discrete-time chains suffice for many foundational models, while continuous-time variants, governed by infinitesimal generators analogous to transition matrices, apply to processes like diffusion models where time flows uninterrupted.[12]

A simple example is a binary discrete-time Markov chain with states {0, 1}. Suppose X starts as 0 or 1 with equal probability 0.5. The next variable Y stays the same as X with probability 0.9 or flips with probability 0.1, independent of prior history. Then Z follows similarly from Y, illustrating sequential processing where each step introduces limited dependence on the previous output alone (see the sketch below).[11]
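A minimal numerical sketch of this binary example, assuming the 0.9/0.1 transition probabilities above (the simulation and variable names are illustrative only), verifies the Chapman-Kolmogorov product for two steps and simulates the chain X \to Y \to Z:

```python
import numpy as np

# Transition matrix for the binary chain in the example: stay with prob 0.9, flip with prob 0.1.
T = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# Chapman-Kolmogorov: two-step transition probabilities are the matrix product T @ T.
T2 = T @ T
print(T2)  # [[0.82, 0.18], [0.18, 0.82]] -> two-step flip probability 0.18

# Simulate X -> Y -> Z: each step depends only on the previous state.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100_000)              # uniform start
y = np.where(rng.random(x.size) < 0.9, x, 1 - x)  # Y generated from X alone
z = np.where(rng.random(y.size) < 0.9, y, 1 - y)  # Z generated from Y alone, ignoring X
print(np.mean(z == x))  # close to 0.82, matching the two-step probability
```

The two-step flip probability 0.18 equals 0.1 + 0.1 - 2(0.1)(0.1), the same cascade formula that reappears in the channel processing example.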
Formal Statement
General Form
The data processing inequality provides a fundamental limit on the flow of information through a sequence of processing steps. In its general form, consider random variables X, Y, and Z defined over discrete or continuous alphabets that form a Markov chain X \to Y \to Z, meaning the conditional distribution of Z given Y is independent of X. Under this condition, the mutual information between X and Z is bounded above by the mutual information between X and Y: I(X; Z) \leq I(X; Y). This inequality quantifies that any further processing of Y to obtain Z cannot increase the information about X beyond what is already available in Y. Equality holds if and only if I(X; Y \mid Z) = 0, which is equivalent to the Markov chain X \to Z \to Y also holding; when Z is a function of Y, this means Z is a sufficient statistic for X.[13]

The inequality extends naturally to multivariate settings. For random vectors (X_1, \dots, X_n) and Z satisfying the joint Markov property (X_1, \dots, X_n) \to Y \to Z, the mutual information satisfies I(X_1, \dots, X_n; Z) \leq I(X_1, \dots, X_n; Y). This generalization preserves the core principle that processing diminishes or preserves, but does not enhance, the relevant information.[13]

The data processing inequality originated in the foundational work of Claude Shannon on communication theory, where it emerged as a consequence of properties of mutual information in channel models. It was later formalized and rigorously stated in standard information theory references.[3][13]
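The general form can be checked numerically for any finite-alphabet Markov chain by composing stochastic matrices. The following sketch, in which the helper mi and the randomly drawn channels are illustrative assumptions rather than part of the formal statement, builds X \to Y \to Z from channels P(Y \mid X) and P(Z \mid Y) and compares I(X; Y) with I(X; Z):

```python
import numpy as np

def mi(p_ab):
    """Mutual information in bits from a joint probability table p_ab."""
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    m = p_ab > 0
    return float(np.sum(p_ab[m] * np.log2(p_ab[m] / (pa @ pb)[m])))

rng = np.random.default_rng(1)
p_x = rng.dirichlet(np.ones(4))            # distribution of X
K1 = rng.dirichlet(np.ones(5), size=4)     # channel P(Y|X), rows sum to 1
K2 = rng.dirichlet(np.ones(3), size=5)     # channel P(Z|Y), drawn without reference to X

p_xy = p_x[:, None] * K1                   # joint of (X, Y)
p_xz = p_xy @ K2                           # joint of (X, Z) under X -> Y -> Z

print(mi(p_xy), mi(p_xz))                  # I(X;Y) >= I(X;Z)
```

Because P(Z \mid Y) is drawn without reference to X, the construction enforces the Markov condition, and the printed pair satisfies I(X; Z) \leq I(X; Y) on every run.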
Interpretations
The data processing inequality captures the intuitive notion that any processing of data acts as a filter, incapable of generating new information about the original source variable X; instead, it can only preserve the existing mutual information or allow some of it to be lost through the intermediate variable Y. This principle underscores the irreversibility of information flow in Markov chains X \to Y \to Z, where subsequent transformations cannot amplify the shared information between X and the output Z beyond what Y already provides.[14]

The implications of this inequality are profound for inference tasks, as it establishes strict upper bounds on the accuracy of estimating or decoding X following noisy observations or compressive mappings via Y. In practical settings, such as signal transmission over imperfect channels, the DPI quantifies inevitable information degradation, ensuring that no post-processing algorithm can surpass the informational limits imposed by the initial transformation.[5]

The data processing inequality is closely related to the theory of sufficient statistics. If Z is obtained as a function of Y, equality in the DPI holds if and only if Z is a sufficient statistic for X, meaning I(X; Y \mid Z) = 0. This condition signifies that Z captures all the information about X that is present in Y, rendering further processing of Y redundant for inference purposes and highlighting the efficiency of minimal representations that avoid unnecessary data retention (a small worked example appears below).[14]

A frequent point of confusion arises from conflating the data processing inequality with direct constraints on entropy; while the inequality strictly applies to mutual information, which measures dependence between variables, entropy itself, which gauges the uncertainty of a single variable, may increase under processing, as in the addition of noise, without contradicting the core tenet that shared information does not grow.[15]
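As a concrete illustration of the equality case, the sketch below (an assumed construction, not taken from the cited sources) builds Y as a pair consisting of a noisy copy of X and an independent fair coin; discarding the coin gives Z = g(Y), a sufficient statistic for X, and the two mutual informations coincide:

```python
import numpy as np

def mi(p_ab):
    """Mutual information in bits from a joint probability table."""
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    m = p_ab > 0
    return float(np.sum(p_ab[m] * np.log2(p_ab[m] / (pa @ pb)[m])))

p = 0.1                                               # flip probability of the noisy copy
y_states = [(b, c) for b in (0, 1) for c in (0, 1)]   # Y = (noisy copy b, independent coin c)

# Joint distribution of X and Y: P(x) * P(b|x) * P(c), with X uniform and the coin fair.
p_xy = np.zeros((2, 4))
for x in (0, 1):
    for j, (b, c) in enumerate(y_states):
        p_xy[x, j] = 0.5 * (p if b != x else 1 - p) * 0.5

# Z = g(Y) keeps only the noisy copy and discards the coin.
p_xz = np.zeros((2, 2))
for x in (0, 1):
    for j, (b, c) in enumerate(y_states):
        p_xz[x, b] += p_xy[x, j]

print(mi(p_xy), mi(p_xz))   # both about 0.531 bits: equality in the DPI, Z is sufficient for X
```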
Proof Techniques
Information-Theoretic Derivation
The information-theoretic derivation of the data processing inequality relies on fundamental properties of entropy and mutual information, particularly the chain rule and the non-negativity of mutual information. Consider discrete random variables X, Y, and Z forming a Markov chain X \to Y \to Z, meaning that X and Z are conditionally independent given Y, or equivalently, the conditional mutual information I(X; Z \mid Y) = 0. Under this assumption, the mutual information I(X; Z) cannot exceed I(X; Y), as subsequent processing through Y cannot increase the information about X.

The proof proceeds via the chain rule for mutual information applied to the joint distribution of X, Y, and Z, which can be expanded in two ways: I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y) and I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z). Equating the right-hand sides yields I(X; Y) + I(X; Z \mid Y) = I(X; Z) + I(X; Y \mid Z). Substituting the Markov condition I(X; Z \mid Y) = 0 and invoking the non-negativity of mutual information, I(X; Y \mid Z) \geq 0, gives I(X; Y) = I(X; Z) + I(X; Y \mid Z) \geq I(X; Z).

A key lemma underlying this derivation is the data processing inequality for conditional mutual information: for any conditioning variable W, if X \to Y \to Z holds conditionally on W, then I(X; Z \mid W) \leq I(X; Y \mid W). This follows analogously from the chain rule and conditional independence, ensuring the inequality propagates under additional conditioning.

Equality holds if and only if I(X; Y \mid Z) = 0, which occurs when Z is a sufficient statistic for X (i.e., Z contains all the information about X that is present in Y). This includes the case where the processing from Y to Z is invertible, so Y is a function of Z.

This derivation assumes discrete random variables with finite or countable support. For continuous random variables, the inequality holds in the limit of finer discretizations or via measure-theoretic extensions using differential entropy and relative entropy arguments, preserving the core structure.
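The two chain-rule expansions can be verified numerically on a small Markov chain. In the sketch below, the helpers mi and cmi are illustrative assumptions; the code constructs a random chain X \to Y \to Z, checks that I(X; Z \mid Y) vanishes, and confirms the identity I(X; Y) = I(X; Z) + I(X; Y \mid Z) used in the derivation:

```python
import numpy as np

def mi(p_ab):
    """I(A;B) in bits from a joint table p_ab[a, b]."""
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    m = p_ab > 0
    return float(np.sum(p_ab[m] * np.log2(p_ab[m] / (pa @ pb)[m])))

def cmi(p_abc):
    """I(A;B|C) in bits from a joint table p_abc[a, b, c]."""
    total = 0.0
    for c in range(p_abc.shape[2]):
        pc = p_abc[:, :, c].sum()
        if pc > 0:
            total += pc * mi(p_abc[:, :, c] / pc)
    return total

rng = np.random.default_rng(2)
p_x = rng.dirichlet(np.ones(3))
K1 = rng.dirichlet(np.ones(3), size=3)      # channel P(Y|X)
K2 = rng.dirichlet(np.ones(3), size=3)      # channel P(Z|Y)
p_xyz = p_x[:, None, None] * K1[:, :, None] * K2[None, :, :]   # p(x,y,z) for X -> Y -> Z

p_xy = p_xyz.sum(axis=2)
p_xz = p_xyz.sum(axis=1)
i_xz_given_y = cmi(np.transpose(p_xyz, (0, 2, 1)))  # I(X;Z|Y): condition on the middle variable
i_xy_given_z = cmi(p_xyz)                           # I(X;Y|Z): condition on the last variable

print(i_xz_given_y)                       # ~0 up to floating-point error: the Markov condition
print(mi(p_xy), mi(p_xz) + i_xy_given_z)  # both sides of I(X;Y) = I(X;Z) + I(X;Y|Z)
```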
Alternative Approaches
One alternative proof of the data processing inequality leverages the convexity of the Kullback-Leibler divergence, which can be viewed as a special case of an f-divergence where mutual information is expressed as a convex functional of the conditional distribution. Specifically, for random variables X, Y, and Z forming a Markov chain X \to Y \to Z, the mutual information I(X; Z) is bounded by applying Jensen's inequality to the convex function defining the divergence under the Markov kernel from Y to Z. This approach represents I(X; Z) = D_{KL}(P_{X,Z} \| P_X P_Z) and shows that the expected divergence after processing does not exceed the original, yielding I(X; Z) \leq I(X; Y).[16]

A variational characterization of the data processing inequality arises from the broader class of f-divergences, introduced by Csiszár, where the inequality follows from the monotonicity of these divergences under stochastic channels. For a convex function f with f(1) = 0, the f-divergence D_f(P \| Q) = \int f\left(\frac{dP}{dQ}\right) \, dQ satisfies D_f(P_{Y} \| Q_{Y}) \geq D_f(P_{Z} \| Q_{Z}) when Z is obtained by applying a channel to Y, due to the joint convexity in the pair of measures and the data-processing property inherent to the definition. When f(t) = t \log t, this recovers the KL divergence and thus the standard mutual information bound, providing a unified framework for proving the inequality across divergence families (a small finite-alphabet check appears at the end of this subsection).[17]

In continuous-time settings, the data processing inequality can be derived for diffusion processes governed by stochastic differential equations (SDEs), such as dX_t = b(X_t) \, dt + \sigma(X_t) \, dW_t, where the evolution of mutual information is analyzed via the Fokker-Planck equation or Itô calculus. The rate of change \frac{d}{dt} I(X_0; X_t) is non-positive, reflecting a monotonic decrease due to noise addition, analogous to the H-theorem in kinetic theory; for Ornstein-Uhlenbeck diffusions, explicit solutions show exponential decay bounded by the inequality. This formulation extends the discrete case to time-evolving Markov processes.[16][18]

Computational verification of the data processing inequality in high-dimensional settings poses significant challenges, primarily due to the difficulty of estimating mutual information accurately amid the curse of dimensionality and non-parametric density issues. Post-2020 research highlights that neural network-based estimators, such as variational bounds or kernel methods, often suffer from bias or high variance in dimensions exceeding 10, making direct numerical checks infeasible without dimensionality reduction. These methods underscore the need for scalable approximations in applications like machine learning bounds.[19]
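The monotonicity of f-divergences under a channel is easy to check on finite alphabets. The sketch below, assuming f(t) = t \log t so that the f-divergence reduces to the KL divergence (the helper kl and the random channel are illustrative), pushes two distributions on Y through the same channel and compares the divergences before and after:

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits for discrete distributions on the same alphabet."""
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

rng = np.random.default_rng(3)
P_y = rng.dirichlet(np.ones(6))          # two distributions on the alphabet of Y
Q_y = rng.dirichlet(np.ones(6))
K = rng.dirichlet(np.ones(4), size=6)    # a channel P(Z|Y), rows sum to 1

P_z = P_y @ K                            # push both distributions through the channel
Q_z = Q_y @ K

print(kl(P_y, Q_y), kl(P_z, Q_z))        # D(P_Z || Q_Z) <= D(P_Y || Q_Y)
```

Repeating the draw shows the divergence never increasing under any stochastic matrix, which is the finite-alphabet face of the general monotonicity property.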
Examples and Applications
Channel Processing Example
A concrete example of the data processing inequality arises in the context of communication channels, where information about the input degrades through successive noisy transmissions. Consider a cascade of two binary symmetric channels (BSCs) forming the Markov chain X \to Y \to Z, where X is the binary input symbol drawn uniformly from \{0, 1\}, Y is the output of the first BSC with crossover probability p (the probability that Y \neq X), and Z is the output obtained by passing Y through a second BSC with crossover probability q.[20][21]

For the first channel, the mutual information is I(X;Y) = 1 - h_2(p), where h_2(p) = -p \log_2 p - (1-p) \log_2 (1-p) is the binary entropy function in bits (with the convention 0 \log_2 0 = 0).[21] To arrive at this expression, note that H(X) = 1 bit due to uniformity, H(Y \mid X) = h_2(p) since, conditioned on X, the output Y flips with probability p, and H(Y) = 1 bit because a uniform input yields a uniform output, so I(X;Y) = H(Y) - H(Y \mid X) = 1 - h_2(p).

The second channel results in an overall BSC from X to Z with effective crossover probability \alpha = p + q - 2pq, computed as the probability that an odd number of flips occurs across both channels: P(Z \neq X) = P(Y = X, Z \neq Y) + P(Y \neq X, Z = Y) = (1-p)q + p(1-q).[22] Thus, I(X;Z) = 1 - h_2(\alpha). Since p < \alpha < 1 - p for 0 < q < 1 and p < 0.5, the symmetry of h_2 about 1/2 and its monotonicity on [0, 0.5] give h_2(\alpha) > h_2(p), so I(X;Z) < I(X;Y), demonstrating the inequality for this non-invertible processing.[21]

For a numerical illustration with p = 0.1 and q = 0.1, first compute h_2(0.1): \log_2(0.1) \approx -3.3219, so -0.1 \times -3.3219 \approx 0.3322; \log_2(0.9) \approx -0.1520, so -0.9 \times -0.1520 \approx 0.1368; thus h_2(0.1) \approx 0.4690 bits and I(X;Y) \approx 0.5310 bits. The effective crossover probability is \alpha = 0.1 + 0.1 - 2 \times 0.01 = 0.18. Then h_2(0.18): \log_2(0.18) \approx -2.4739, so -0.18 \times -2.4739 \approx 0.4453; \log_2(0.82) \approx -0.2863, so -0.82 \times -0.2863 \approx 0.2348; thus h_2(0.18) \approx 0.6801 bits and I(X;Z) \approx 0.3199 bits, showing a drop of approximately 0.2111 bits in mutual information.[21]

The information loss can be visualized through the BSC capacity curve, which plots C(p) = 1 - h_2(p) versus p \in [0, 0.5], starting at 1 bit (noiseless) and decreasing monotonically to 0 bits at p = 0.5 (pure noise). Successive processing shifts the effective crossover probability rightward along this convex, decreasing curve, bounding the remaining mutual information below the original. The following table summarizes capacity values at select points to highlight the degradation; a short numerical sketch reproducing these figures follows the table.

| Crossover Probability | Capacity C(p) (bits) |
|---|---|
| 0.00 | 1.000 |
| 0.10 | 0.531 |
| 0.18 | 0.320 |
| 0.50 | 0.000 |
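The worked numbers above can be reproduced with a few lines of Python; this is a minimal sketch in which the helper h2 is an assumed utility rather than part of the cited treatment:

```python
import numpy as np

def h2(p):
    """Binary entropy in bits, with the convention 0 * log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p, q = 0.1, 0.1
alpha = p + q - 2 * p * q          # effective crossover probability of the cascade

I_xy = 1 - h2(p)                   # mutual information over the first BSC
I_xz = 1 - h2(alpha)               # mutual information after the second BSC

print(round(I_xy, 4), round(I_xz, 4), round(I_xy - I_xz, 4))
# 0.531 0.3199 0.2111, matching the worked values above
```

Varying q shows the effective crossover probability \alpha moving rightward along the capacity curve summarized in the table, with the remaining mutual information shrinking accordingly.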