Data processing inequality

The data processing inequality (DPI) is a cornerstone theorem in information theory stating that the mutual information between an input random variable and an output cannot increase when the output is further processed to produce another variable, formally expressed as I(X; Z) \leq I(X; Y) whenever X, Y, and Z form a Markov chain X \to Y \to Z. This inequality implies that any stochastic or deterministic processing of data, such as encoding, filtering, or estimation, can only preserve or diminish the information content about the source, ensuring that no manipulation extracts more information than originally available. Originally formulated by Claude E. Shannon in his seminal 1948 paper "A Mathematical Theory of Communication," the DPI appears in the context of communication channels, where it demonstrates that the rate of reliable transmission R(X; V) from input X to a statistically processed version V of output Y satisfies R(X; V) \leq R(X; Y). Proofs of the inequality typically rely on the non-negativity of relative entropy (Kullback-Leibler divergence), showing that I(X; Z) = I(X; Y) - I(X; Y \mid Z), where I(X; Y \mid Z) = \mathbb{E}_{X,Z} \left[ D(P_{Y|X,Z} \| P_{Y|Z}) \right] \geq 0, with equality holding if and only if X \to Z \to Y also forms a Markov chain, that is, if Z retains all the information about X contained in Y. The theorem extends to continuous variables and quantum settings, maintaining its core structure under appropriate generalizations. The DPI plays a pivotal role in information-theoretic limits, underpinning the converse to the channel coding theorem by bounding achievable rates in noisy communication systems and proving impossibility results for data compression and hypothesis testing. Beyond communications, it finds applications in statistics for characterizing sufficient statistics and the Cramér-Rao bound, in machine learning for analyzing variational inference and model collapse under repeated processing, and in Bayesian networks to quantify information flow in causal structures. Stronger variants, such as those quantifying the contraction rate of mutual information through noisy channels, further enable precise analyses in statistical minimax problems and privacy-preserving data release.

Foundations in Information Theory

Mutual Information

Mutual information is a central concept in information theory, serving as a measure of the shared information between two random variables; it underpins inequalities like the data processing inequality by quantifying dependence in probabilistic systems. The mutual information I(X; Y) between random variables X and Y is defined as the reduction in uncertainty about one variable upon observing the other: I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X), where H(\cdot) denotes entropy and H(\cdot \mid \cdot) conditional entropy. This can equivalently be expressed using joint entropy as I(X; Y) = H(X) + H(Y) - H(X, Y). Mutual information possesses several key properties that highlight its role as a dependence measure. It is non-negative, I(X; Y) \geq 0, with equality if and only if X and Y are independent. Additionally, it is symmetric, I(X; Y) = I(Y; X), reflecting the bidirectional nature of shared information. The definitions apply to both discrete and continuous random variables. For discrete variables, the entropy is H(X) = -\sum p(x) \log p(x). For continuous variables, the analogous differential entropy is h(X) = -\int f(x) \log f(x) \, dx, where f(x) is the probability density function, leading to I(X; Y) = h(X) - h(X \mid Y) = h(Y) - h(Y \mid X). Mutual information also satisfies a chain rule, extending its application to multiple variables: I(X, Y; Z) = I(X; Z) + I(Y; Z \mid X). This decomposes the total shared information into sequential contributions.
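To make these identities concrete, the following Python sketch (an illustrative example, not drawn from any cited source; the joint pmf is chosen arbitrarily) evaluates the entropies and mutual information of a small discrete joint distribution and confirms that I(X; Y) = H(X) + H(Y) - H(X, Y) is non-negative:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (0 log 0 := 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Arbitrary example joint pmf p(x, y) over a 2x3 alphabet (rows: x, cols: y).
p_xy = np.array([[0.20, 0.10, 0.10],
                 [0.05, 0.25, 0.30]])

p_x = p_xy.sum(axis=1)          # marginal of X
p_y = p_xy.sum(axis=0)          # marginal of Y

H_X  = entropy(p_x)
H_Y  = entropy(p_y)
H_XY = entropy(p_xy.ravel())    # joint entropy H(X, Y)

# I(X;Y) = H(X) + H(Y) - H(X,Y); equivalently H(X) - H(X|Y) = H(Y) - H(Y|X).
I_XY = H_X + H_Y - H_XY
print(f"H(X)={H_X:.4f}  H(Y)={H_Y:.4f}  H(X,Y)={H_XY:.4f}  I(X;Y)={I_XY:.4f} bits")

# Non-negativity holds; independence (p_xy = np.outer(p_x, p_y)) would give I(X;Y) = 0.
assert I_XY >= -1e-12
```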

Markov Chains

A Markov chain provides the probabilistic structure underlying data processing scenarios by modeling sequences of random variables where each subsequent variable depends only on the immediate predecessor, capturing the essence of data processing through successive transformations. For random variables X, Y, and Z, the notation X \to Y \to Z denotes a Markov chain if the joint distribution satisfies p(z \mid x, y) = p(z \mid y) for all x, y, z, meaning Z depends on X solely through Y. This condition implies conditional independence between X and Z given Y, equivalently stated as I(X; Z \mid Y) = 0, where the conditional mutual information quantifies the residual dependence. The defining property of a Markov chain is its memorylessness: the distribution of the next state depends only on the current state and not on the sequence of prior states. For chains with more than three variables, such as X_1 \to X_2 \to \cdots \to X_n, this extends via the Chapman-Kolmogorov equations, which express the n-step transition probabilities in terms of intermediate steps: for discrete states i, j, k, p_{ij}^{(m+n)} = \sum_k p_{ik}^{(m)} p_{kj}^{(n)}, where p_{ij}^{(r)} is the probability of transitioning from state i to j in r steps; this semigroup property ensures consistency across multiple processing stages. In information theory, both discrete-time and continuous-time Markov processes are relevant: the former involve state transitions at fixed intervals, the latter model continuous evolution via transition rates, and both are often analyzed through entropy rates to assess long-term information behavior. Discrete-time chains suffice for many foundational models, while continuous-time variants, governed by infinitesimal generators analogous to transition matrices, apply to processes like diffusion models where time flows uninterrupted. A simple example is a binary Markov chain with states {0, 1}. Suppose X starts as 0 or 1 with equal probability 0.5. The next variable Y stays the same as X with probability 0.9 or flips with probability 0.1, independent of prior history. Then Z follows similarly from Y, illustrating sequential processing where each step introduces limited dependence on the previous output alone.
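A minimal simulation of this binary example, using the flip probability 0.1 assumed above, illustrates both the Chapman-Kolmogorov composition of transition matrices and the Markov property that Z depends on X only through Y (the numpy-based approach and sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
flip = 0.1
# One-step transition matrix: T[i, j] = P(next = j | current = i).
T = np.array([[1 - flip, flip],
              [flip, 1 - flip]])

# Chapman-Kolmogorov: the two-step transition matrix is the matrix product T @ T.
T2 = T @ T
print("Two-step transition matrix:\n", T2)   # off-diagonal = 2 * 0.1 * 0.9 = 0.18

# Monte Carlo check that Z depends on X only through Y.
n = 500_000
X = rng.integers(0, 2, size=n)
Y = np.where(rng.random(n) < flip, 1 - X, X)   # Y flips X with prob 0.1
Z = np.where(rng.random(n) < flip, 1 - Y, Y)   # Z flips Y with prob 0.1

# Empirical P(Z=1 | Y=1, X=0) vs P(Z=1 | Y=1, X=1): both close to 0.9 if Markov.
for x in (0, 1):
    mask = (Y == 1) & (X == x)
    print(f"P(Z=1 | Y=1, X={x}) ~ {Z[mask].mean():.3f}")
```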

Formal Statement

General Form

The data processing inequality provides a fundamental limit on the flow of information through a sequence of processing steps. In its general form, consider random variables X, Y, and Z defined over discrete or continuous alphabets that form a Markov chain X \to Y \to Z, meaning the conditional distribution of Z given Y is independent of X. Under this condition, the mutual information between X and Z is bounded above by the mutual information between X and Y: I(X; Z) \leq I(X; Y). This inequality quantifies that any further processing of Y to obtain Z cannot increase the information about X beyond what is already available in Y. Equality holds if and only if I(X; Y \mid Z) = 0, which is equivalent to the Markov chain X \to Z \to Y also holding, or, when Z is a function of Y, to Z being a sufficient statistic of Y for X. The inequality extends naturally to multivariate settings. For random vectors (X_1, \dots, X_n) and Z satisfying the joint Markov property (X_1, \dots, X_n) \to Y \to Z, the mutual information satisfies I(X_1, \dots, X_n; Z) \leq I(X_1, \dots, X_n; Y). This generalization preserves the core principle that processing preserves or diminishes, but does not enhance, the relevant information. The data processing inequality originated in the foundational work of Claude Shannon on communication theory, where it emerged as a consequence of properties of mutual information in channel models. It was later formalized and rigorously stated in standard references such as Cover and Thomas's Elements of Information Theory.
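The general form can be checked numerically. The sketch below (illustrative; the alphabet sizes and the Dirichlet-sampled source and channels are arbitrary assumptions) draws random Markov chains X \to Y \to Z over finite alphabets and verifies I(X; Z) \leq I(X; Y) in each case:

```python
import numpy as np

def mutual_information(p_joint):
    """Mutual information in bits from a joint pmf matrix p_joint[x, y]."""
    px = p_joint.sum(axis=1, keepdims=True)
    py = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log2(p_joint[mask] / (px @ py)[mask])))

rng = np.random.default_rng(1)
for trial in range(5):
    # Random source and random channels on small alphabets.
    p_x = rng.dirichlet(np.ones(4))                  # P_X
    W1 = rng.dirichlet(np.ones(5), size=4)           # P_{Y|X}, each row sums to 1
    W2 = rng.dirichlet(np.ones(3), size=5)           # P_{Z|Y}

    p_xy = p_x[:, None] * W1                         # joint of (X, Y)
    p_xz = p_xy @ W2                                 # joint of (X, Z) under X -> Y -> Z

    I_xy, I_xz = mutual_information(p_xy), mutual_information(p_xz)
    print(f"I(X;Y)={I_xy:.4f}  I(X;Z)={I_xz:.4f}  DPI holds: {I_xz <= I_xy + 1e-12}")
```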

Interpretations

The data processing inequality captures the intuitive notion that any processing of data acts as a filter, incapable of generating new information about the original variable X; instead, it can only preserve the existing information or allow some of it to be lost through the intermediate variable Y. This principle underscores the irreversibility of information loss in Markov chains X \to Y \to Z, where subsequent transformations cannot amplify the shared information between X and the output Z beyond what Y already provides. The implications of this inequality are profound for inference tasks, as it establishes strict upper bounds on the accuracy of estimating or decoding X following noisy observations or compressive mappings via Y. In practical settings, such as signal transmission over imperfect channels, the DPI quantifies inevitable information degradation, ensuring that no post-processing algorithm can surpass the informational limits imposed by the initial transformation. The data processing inequality is closely related to the theory of sufficient statistics. If Z is obtained as a function of Y, equality in the DPI holds if and only if Z is a sufficient statistic for X, meaning I(X; Y \mid Z) = 0. This condition signifies that Z captures all the information about X that is present in Y, rendering further processing of Y redundant for inference purposes and highlighting the efficiency of minimal representations that avoid unnecessary redundancy. A frequent point of confusion arises from conflating the data processing inequality with direct constraints on entropy: the inequality strictly applies to mutual information, which measures dependence between variables, whereas entropy, which gauges the uncertainty of a single variable, may increase under processing, as in the addition of noise, without contradicting the core tenet that shared information does not grow.
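The last point admits a small worked example. The parameters below (X ~ Bernoulli(0.1), Y = X, and Z obtained by flipping Y with probability 0.3) are illustrative assumptions chosen so that entropy grows under processing while mutual information shrinks:

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Assumed toy parameters: X ~ Bernoulli(0.1), Y = X (no processing yet),
# Z = Y XOR N with independent noise N ~ Bernoulli(0.3).
p, q = 0.1, 0.3
H_Y = h2(p)                       # entropy of Y
pz = p * (1 - q) + (1 - p) * q    # P(Z = 1)
H_Z = h2(pz)                      # entropy of Z increases after adding noise

I_XY = H_Y                        # Y = X, so I(X;Y) = H(X)
I_XZ = h2(pz) - h2(q)             # I(X;Z) = H(Z) - H(Z|X), and H(Z|X) = h2(q)

print(f"H(Y)={H_Y:.3f}  H(Z)={H_Z:.3f}  (entropy grew under processing)")
print(f"I(X;Y)={I_XY:.3f} >= I(X;Z)={I_XZ:.3f}  (mutual information did not)")
```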

Proof Techniques

Information-Theoretic

The information-theoretic derivation of the data processing inequality relies on fundamental properties of entropy and mutual information, particularly the chain rule and the non-negativity of conditional mutual information. Consider discrete random variables X, Y, and Z forming a Markov chain X \to Y \to Z, meaning that X and Z are conditionally independent given Y, or equivalently, the conditional mutual information I(X; Z \mid Y) = 0. Under this assumption, the mutual information I(X; Z) cannot exceed I(X; Y), as subsequent processing through Y cannot increase the information about X. The proof proceeds via the chain rule for mutual information, applied to the joint distribution of X, Y, and Z in two ways: I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y) and I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z). Equating the right-hand sides yields I(X; Y) + I(X; Z \mid Y) = I(X; Z) + I(X; Y \mid Z). Substituting the Markov condition I(X; Z \mid Y) = 0 and invoking the non-negativity of conditional mutual information, I(X; Y \mid Z) \geq 0, gives I(X; Y) = I(X; Z) + I(X; Y \mid Z) \geq I(X; Z). A key lemma underlying this derivation is the data processing inequality for conditional mutual information: for any conditioning variable W, if X \to Y \to Z holds conditionally on W, then I(X; Z \mid W) \leq I(X; Y \mid W). This follows analogously from the chain rule and non-negativity, ensuring the inequality propagates under additional conditioning. Equality holds if and only if I(X; Y \mid Z) = 0, which occurs when Z is a sufficient statistic for X, that is, when Z contains all the information about X that is present in Y. This includes the case where the processing from Y to Z is invertible, so Y is a deterministic function of Z. This derivation assumes discrete random variables with finite or countable support. For continuous random variables, the inequality holds in the limit of finer discretizations or via measure-theoretic extensions based on relative entropy arguments, preserving the core structure.
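The chain-rule argument can be verified numerically. In the sketch below (alphabet sizes and Dirichlet-sampled distributions are arbitrary assumptions), a joint pmf is built to satisfy X \to Y \to Z, and both expansions of I(X; Y, Z) as well as the resulting inequality are checked:

```python
import numpy as np

def mi(p):
    """Mutual information (bits) of a 2-D joint pmf."""
    px = p.sum(1, keepdims=True); py = p.sum(0, keepdims=True)
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / (px @ py)[m])))

rng = np.random.default_rng(2)
# Build a joint pmf p(x, y, z) that satisfies the Markov chain X -> Y -> Z.
p_x = rng.dirichlet(np.ones(3))
W1  = rng.dirichlet(np.ones(4), size=3)   # P(y | x)
W2  = rng.dirichlet(np.ones(3), size=4)   # P(z | y)
p_xyz = p_x[:, None, None] * W1[:, :, None] * W2[None, :, :]

p_xy = p_xyz.sum(axis=2)
p_xz = p_xyz.sum(axis=1)

# I(X; Y, Z): treat (Y, Z) as a single variable.
I_x_yz = mi(p_xyz.reshape(p_xyz.shape[0], -1))

def cmi_given_last(p):
    """I(first; middle | last) for a 3-D pmf p[a, b, c], averaging over c."""
    total = 0.0
    for k in range(p.shape[2]):
        slab = p[:, :, k]; pc = slab.sum()
        if pc > 0:
            total += pc * mi(slab / pc)
    return total

I_xy_given_z = cmi_given_last(p_xyz)                    # I(X; Y | Z)
I_xz_given_y = cmi_given_last(p_xyz.transpose(0, 2, 1)) # I(X; Z | Y) = 0 by construction

print(f"I(X;Y,Z) = {I_x_yz:.6f}")
print(f"I(X;Y) + I(X;Z|Y) = {mi(p_xy) + I_xz_given_y:.6f}")   # first chain-rule expansion
print(f"I(X;Z) + I(X;Y|Z) = {mi(p_xz) + I_xy_given_z:.6f}")   # second chain-rule expansion
print(f"I(X;Z) = {mi(p_xz):.6f} <= I(X;Y) = {mi(p_xy):.6f}")
```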

Alternative Approaches

One alternative proof of the data processing inequality leverages the convexity of the Kullback-Leibler divergence, which can be viewed as a special case of an f-divergence, with mutual information expressed as a convex functional of the underlying distributions. Specifically, for random variables X, Y, and Z forming a Markov chain X \to Y \to Z, the mutual information I(X; Z) is bounded by applying the convexity of the divergence to the averaging induced by the channel from Y to Z. This approach represents I(X; Z) = D_{KL}(P_{X,Z} \| P_X P_Z) and shows that the divergence after processing does not exceed the original, yielding I(X; Z) \leq I(X; Y). A variational characterization of the data processing inequality arises from the broader class of f-divergences, introduced by Csiszár, where the inequality follows from the monotonicity of these divergences under stochastic maps. For a convex function f with f(1) = 0, the f-divergence D_f(P \| Q) = \int f\!\left(\frac{dP}{dQ}\right) dQ satisfies D_f(P_{Y} \| Q_{Y}) \geq D_f(P_{Z} \| Q_{Z}) when Z is obtained by applying a channel to Y, due to the joint convexity in the pair of measures and the data-processing property inherent to the definition. When f(t) = t \log t, this recovers the KL divergence and thus the standard bound, providing a unified framework for proving the inequality across divergence families. In continuous-time settings, the data processing inequality can be derived for diffusion processes governed by stochastic differential equations (SDEs), such as dX_t = b(X_t)\, dt + \sigma(X_t)\, dW_t, where the evolution of mutual information is analyzed via the Fokker-Planck equation or Itô calculus. The rate of change \frac{d}{dt} I(X_0; X_t) is non-positive, reflecting a monotonic decrease due to noise addition, analogous to the H-theorem in kinetic theory; for Ornstein-Uhlenbeck diffusions, explicit solutions show exponential decay consistent with the inequality. This formulation extends the discrete case to time-evolving Markov processes. Computational verification of the data processing inequality in high-dimensional settings poses significant challenges, primarily due to the difficulty of estimating mutual information accurately amid the curse of dimensionality and non-parametric estimation issues. Post-2020 work highlights that neural network-based estimators, such as variational bounds, often suffer from bias or high variance in dimensions exceeding 10, making direct numerical checks infeasible without additional assumptions. These observations underscore the need for scalable approximations in applications that rely on such bounds.
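The monotonicity of f-divergences under a channel is easy to check numerically for a few common choices of f. The distributions and channel below are arbitrary illustrative assumptions, in the finite-alphabet case where the integral reduces to a sum:

```python
import numpy as np

rng = np.random.default_rng(3)

# Convex f with f(1) = 0 for three common f-divergences.
f_funcs = {
    "KL":          lambda t: t * np.log(t),
    "total var.":  lambda t: 0.5 * np.abs(t - 1),
    "chi-square":  lambda t: (t - 1) ** 2,
}

def f_div(p, q, f):
    """D_f(P || Q) = sum_y q(y) f(p(y) / q(y)) for strictly positive pmfs."""
    return float(np.sum(q * f(p / q)))

# Two input distributions on Y and a random channel P_{Z|Y} (rows sum to 1).
P_Y = rng.dirichlet(np.ones(5))
Q_Y = rng.dirichlet(np.ones(5))
W   = rng.dirichlet(np.ones(4), size=5)

P_Z, Q_Z = P_Y @ W, Q_Y @ W   # push both distributions through the same channel

for name, f in f_funcs.items():
    before, after = f_div(P_Y, Q_Y, f), f_div(P_Z, Q_Z, f)
    print(f"{name:12s}: before = {before:.4f}, after = {after:.4f}, "
          f"contracted: {after <= before + 1e-12}")
```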

Examples and Applications

Channel Processing Example

A concrete example of the data processing inequality arises in the context of communication channels, where information about the input degrades through successive noisy transmissions. Consider a cascade of binary symmetric channels (BSCs) forming the Markov chain X \to Y \to Z, where X is the input drawn uniformly from \{0, 1\}, Y is the output of the first BSC with crossover probability p (the probability that Y \neq X), and Z is the output obtained by passing Y through a second BSC with crossover probability q. For the first channel, the mutual information is I(X;Y) = 1 - h_2(p), where h_2(p) = -p \log_2 p - (1-p) \log_2 (1-p) is the binary entropy function in bits (with the convention 0 \log_2 0 = 0). To arrive at this expression, note that H(X) = 1 bit due to uniformity, H(Y|X) = h_2(p) since, conditioned on X, the output flips with probability p, and H(Y) = 1 bit by symmetry, so I(X;Y) = H(Y) - H(Y|X) = 1 - h_2(p). The second channel results in an overall BSC from X to Z with effective crossover probability \alpha = p + q - 2pq, computed as the probability that an odd number of flips occurs across both channels: P(Z \neq X) = P(Y = X, Z \neq Y) + P(Y \neq X, Z = Y) = (1-p)q + p(1-q). Thus, I(X;Z) = 1 - h_2(\alpha). Since \alpha > p for 0 < q < 1 and p < 0.5, and h_2 is increasing on [0, 0.5], it follows that I(X;Z) < I(X;Y), demonstrating the inequality for this non-invertible processing. For a numerical illustration with p = 0.1 and q = 0.1, first compute h_2(0.1): \log_2(0.1) \approx -3.3219, so -0.1 \times -3.3219 = 0.3322; \log_2(0.9) \approx -0.1520, so -0.9 \times -0.1520 = 0.1368; thus h_2(0.1) \approx 0.4690 bits and I(X;Y) \approx 0.5310 bits. The effective \alpha = 0.1 + 0.1 - 2 \times 0.01 = 0.18. Then h_2(0.18): \log_2(0.18) \approx -2.4739, so -0.18 \times -2.4739 = 0.4453; \log_2(0.82) \approx -0.2863, so -0.82 \times -0.2863 = 0.2348; thus h_2(0.18) \approx 0.6801 bits and I(X;Z) \approx 0.3199 bits, showing a drop of approximately 0.2111 bits in mutual information. The information loss is visualized through the BSC capacity curve, which plots C(p) = 1 - h_2(p) versus p \in [0, 0.5], starting at 1 bit (noiseless) and decreasing to 0 bits at p = 0.5 (pure noise). Successive processing shifts the effective crossover probability rightward along this decreasing curve, bounding the remaining mutual information below the original. The following table summarizes values at select points to highlight the degradation:
Crossover probability p    C(p) (bits)
0.00                       1.000
0.10                       0.531
0.18                       0.320
0.50                       0.000
This curve underscores how channel processing enforces the data processing inequality by reducing the achievable rate.
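The numerical values above, including the table entries, can be reproduced with a few lines of code; the sketch below simply re-evaluates h_2 and the cascade formulas for the assumed parameters p = q = 0.1:

```python
import numpy as np

def h2(p):
    """Binary entropy in bits, with 0 log 0 := 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p, q = 0.1, 0.1                    # crossover probabilities of the two cascaded BSCs
alpha = p + q - 2 * p * q          # effective crossover probability from X to Z

I_XY = 1 - h2(p)                   # mutual information across the first channel
I_XZ = 1 - h2(alpha)               # mutual information across the cascade

print(f"alpha = {alpha:.2f}")
print(f"I(X;Y) = {I_XY:.4f} bits,  I(X;Z) = {I_XZ:.4f} bits,  loss = {I_XY - I_XZ:.4f} bits")

# Capacity curve values from the table: C(p) = 1 - h2(p).
for cp in (0.00, 0.10, 0.18, 0.50):
    print(f"C({cp:.2f}) = {1 - h2(cp):.3f} bits")
```

Running it reproduces I(X;Y) of roughly 0.531 bits and I(X;Z) of roughly 0.320 bits.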

Estimation and Learning Contexts

In statistics, the data processing inequality (DPI) plays a key role in estimation theory by establishing that any processed version of the data cannot contain more information about the parameter than the original data. Specifically, if T(X) is a sufficient statistic for the parameter \theta, then the mutual information satisfies I(\theta; T(X)) = I(\theta; X), and further processing of T(X) to obtain Z = g(T(X)) satisfies I(\theta; Z) \leq I(\theta; T(X)) by the DPI, implying that estimators based on Z cannot outperform those based on X or T(X) in terms of the information they retain about \theta. This connection underscores the efficiency of sufficient statistics, as they capture all relevant information for inference without loss, a principle formalized in information-theoretic terms to bound the performance of downstream inferential procedures. In machine learning, particularly representation learning, the DPI imposes fundamental bounds on feature extraction in deep networks, ensuring that transformations cannot increase information about the input. For instance, if Z = f(X) represents the output of a neural network layer f, the DPI yields I(Y; Z) \leq I(Y; X) for any target Y, highlighting how successive layers progressively compress or preserve feature hierarchies in deep architectures. This bound is crucial for analyzing why networks learn disentangled, task-relevant representations, as excessive compression risks information loss that degrades downstream prediction or classification. Recent works leverage this to optimize architectures, ensuring that learned features remain sufficiently informative without violating the inequality. The DPI also informs causal inference frameworks by quantifying information flow in causal structures, such as directed acyclic graphs (DAGs), where it bounds the information transmitted along paths to assess direct and indirect causal influences. This application helps explain scenarios where causal effects become unrecoverable after data transformations, guiding the selection of adjustment sets to avoid unnecessary information degradation. In recent privacy-preserving data processing, particularly with differential privacy mechanisms in the 2020s, the DPI has been extended to derive inequalities that bound utility loss under privacy constraints. For example, forward and reverse DPIs for local differential privacy show that post-processing a privatized output Z from input X preserves the privacy guarantee, with bounds such as I(X; Z) \leq \epsilon for privacy parameter \epsilon, enabling tighter composition theorems for iterative algorithms. These results facilitate the design of mechanisms like noise addition, ensuring minimal information leakage while maintaining statistical utility in large-scale analyses. Such applications highlight the DPI's role in balancing privacy guarantees with practical data usability in modern systems.
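As an illustrative sketch of the privacy setting (the budget \epsilon = 0.5, the binary randomized response mechanism, and the post-processing channel are assumptions made for this example, not a construction taken from the cited works), the following code checks that I(X; Y) \leq \epsilon in nats for an \epsilon-local-DP mechanism and that post-processing cannot increase it:

```python
import numpy as np

def mi_nats(p_joint):
    """Mutual information in nats from a joint pmf matrix."""
    px = p_joint.sum(1, keepdims=True); py = p_joint.sum(0, keepdims=True)
    m = p_joint > 0
    return float(np.sum(p_joint[m] * np.log(p_joint[m] / (px @ py)[m])))

eps = 0.5                                        # assumed local privacy budget (nats)
keep = np.exp(eps) / (1 + np.exp(eps))           # randomized response: report truth w.p. keep
RR = np.array([[keep, 1 - keep],
               [1 - keep, keep]])                # P(Y | X), an eps-LDP mechanism

post = np.array([[0.7, 0.3],                     # arbitrary post-processing channel P(Z | Y)
                 [0.2, 0.8]])

p_x = np.array([0.5, 0.5])
p_xy = p_x[:, None] * RR                         # joint of (X, Y)
p_xz = p_xy @ post                               # joint of (X, Z): X -> Y -> Z

print(f"I(X;Y) = {mi_nats(p_xy):.4f} nats  <=  eps = {eps}")
print(f"I(X;Z) = {mi_nats(p_xz):.4f} nats  <=  I(X;Y)  (DPI / post-processing)")
```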

Extensions

Strong Data Processing Inequality

The strong data processing inequality (SDPI) provides a quantitative strengthening of the classical data processing inequality by establishing an explicit contraction rate for mutual information under successive processing. For a Markov chain X \to Y \to Z, it states that there exists a constant \alpha < 1 such that I(X; Z) \leq \alpha\, I(X; Y) for every joint distribution of X and Y, where \alpha is the strong data processing constant of the channel from Y to Z. This constant \alpha, often denoted \eta_{KL} for the Kullback-Leibler divergence, captures the maximal contraction ratio \sup_{P_Y \neq Q_Y} \frac{D(P_Z \| Q_Z)}{D(P_Y \| Q_Y)}, ensuring strict information loss unless the channel is noiseless. The Dobrushin coefficient plays a central role in bounding this contraction, defined as \eta_{TV}(P_{Z|Y}) = \sup_{y,y'} \frac{1}{2} \|P_{Z|y} - P_{Z|y'}\|_1, which upper-bounds \eta_{KL} via \eta_{KL} \leq \eta_{TV}. This coefficient originates from Dobrushin's 1956 work on ergodicity in Markov chains, where it quantified total variation contraction. The modern information-theoretic formulation of the SDPI was advanced by Polyanskiy and Wu in 2015, who derived explicit bounds for channels and Bayesian networks, with subsequent refinements in follow-up works such as those addressing input constraints and Gaussian settings to tighten the constants. For concrete computation, consider the binary symmetric channel (BSC) with crossover probability p \in (0, 0.5). Here, the contraction coefficient is \eta_{KL} = (1 - 2p)^2, while \eta_{TV} = 1 - 2p, illustrating that the contraction in the KL sense can be quadratically stronger than in total variation. This leads to broader implications for divergence contractions: the SDPI extends to f-divergences with channel-specific coefficients \eta_f \leq \eta_{TV}, enabling uniform bounds on information leakage across processing steps. In contrast to the weak data processing inequality, which merely asserts I(X; Z) \leq I(X; Y) without quantifying the loss, the strong version specifies the decay factor \alpha < 1, facilitating analyses in which repeated processing drives the retained mutual information toward zero at a geometric rate.
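A numerical sanity check of the SDPI for a BSC is straightforward; the second-channel crossover q = 0.2 and the randomly drawn sources and first channels below are illustrative assumptions:

```python
import numpy as np

def mi(p_joint):
    """Mutual information in bits from a joint pmf matrix."""
    px = p_joint.sum(1, keepdims=True); py = p_joint.sum(0, keepdims=True)
    m = p_joint > 0
    return float(np.sum(p_joint[m] * np.log2(p_joint[m] / (px @ py)[m])))

q = 0.2                                  # crossover of the second channel P_{Z|Y}
eta_kl = (1 - 2 * q) ** 2                # SDPI contraction coefficient of a BSC(q)
BSC_q = np.array([[1 - q, q], [q, 1 - q]])

rng = np.random.default_rng(4)
worst = 0.0
for _ in range(1000):
    p_x = rng.dirichlet(np.ones(2))                  # random input distribution
    p1 = rng.uniform(0, 0.5)                         # random first channel BSC(p1)
    BSC_p = np.array([[1 - p1, p1], [p1, 1 - p1]])
    p_xy = p_x[:, None] * BSC_p
    p_xz = p_xy @ BSC_q
    Ixy, Ixz = mi(p_xy), mi(p_xz)
    if Ixy > 1e-9:
        worst = max(worst, Ixz / Ixy)

print(f"eta_KL for BSC({q}) = {eta_kl:.3f}")
print(f"max observed I(X;Z)/I(X;Y) over random instances = {worst:.3f} <= eta_KL")
```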

Multivariate Generalizations

The data processing inequality extends seamlessly to multivariate settings, where the input consists of a joint random variable X = (X_1, \dots, X_n) forming a Markov chain X \to Y \to Z. In this case, the mutual information satisfies I(X; Z) \leq I(X; Y), with mutual information computed over the joint distributions of the respective random variables. This generalization follows directly from the chain rule for mutual information, the Markov condition I(X; Z \mid Y) = 0, and the non-negativity of conditional mutual information, without requiring independence among the components of X. A particular instance arises with parallel channels, where the processing operates independently on each component: suppose X_i \to Y_i \to Z_i for i = 1, \dots, n with independent components and identical conditional distributions, yielding X = (X_1, \dots, X_n), Y = (Y_1, \dots, Y_n), and Z = (Z_1, \dots, Z_n). The additivity of mutual information for independent components then implies I(X; Z) = \sum_{i=1}^n I(X_i; Z_i) \leq \sum_{i=1}^n I(X_i; Y_i) = I(X; Y). This form is crucial in analyzing systems with multiple parallel signals, such as multi-user communication channels, where the total information respects the bound componentwise. The functional data processing inequality provides another multivariate form, stating that for any function f: \mathcal{Y} \to \mathcal{Z}, the composition yields I(X; f(Y)) \leq I(X; Y) under the Markov chain X \to Y \to f(Y). This captures the effect of deterministic post-processing on the joint distribution, emphasizing that no function can amplify the information about X beyond its original content in Y. It underpins applications in compression and feature extraction, where transformations preserve or reduce information. Extensions beyond strict Markov chains address scenarios with partial dependencies among multiple variables, providing approximate bounds when the conditional independence is relaxed. For instance, in distributed source coding with correlated observations forming an n-letter Markov structure, a spectrum-based measure of correlation yields a new data processing inequality that provides single-letter necessary conditions on the joint distribution, tightening bounds for multi-terminal rate-distortion problems. Such approaches, developed since the mid-2000s, enable analysis of hyper-Markov-like structures where marginals preserve partial conditional independences across variables. Despite these advances, the multivariate data processing inequality has limitations, particularly in cases where equality fails to hold. Equality I(X; Z) = I(X; Y) requires that the processing from Y to Z retain all relevant information about X, equivalently that X \to Z \to Y also form a Markov chain; otherwise, strict inequality arises, as in non-invertible transformations. In non-monotonic processing, where information plateaus or decreases unevenly across variables, the bound remains valid but may not capture tight rates, necessitating stronger variants for precise characterization. The strong data processing inequality offers related scalar bounds in such multivariate contexts.
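The parallel-channel instance can be verified directly for two independent binary components; the Dirichlet-sampled sources and channels below are arbitrary illustrative choices:

```python
import numpy as np

def mi(p_joint):
    """Mutual information in bits from a joint pmf matrix."""
    px = p_joint.sum(1, keepdims=True); py = p_joint.sum(0, keepdims=True)
    m = p_joint > 0
    return float(np.sum(p_joint[m] * np.log2(p_joint[m] / (px @ py)[m])))

rng = np.random.default_rng(5)

# Two independent components, each with its own chain X_i -> Y_i -> Z_i over binary alphabets.
def component():
    p_x = rng.dirichlet(np.ones(2))
    W1 = rng.dirichlet(np.ones(2), size=2)   # P(Y_i | X_i)
    W2 = rng.dirichlet(np.ones(2), size=2)   # P(Z_i | Y_i)
    p_xy = p_x[:, None] * W1
    p_xz = p_xy @ W2
    return p_xy, p_xz

(p_xy1, p_xz1), (p_xy2, p_xz2) = component(), component()

# Joint distributions of the vector variables follow from independence across components.
P_XY = np.kron(p_xy1, p_xy2)   # joint of (X1, X2) and (Y1, Y2)
P_XZ = np.kron(p_xz1, p_xz2)   # joint of (X1, X2) and (Z1, Z2)

print(f"I(X;Z) joint = {mi(P_XZ):.4f}, sum of components = {mi(p_xz1) + mi(p_xz2):.4f}")
print(f"I(X;Y) joint = {mi(P_XY):.4f} >= I(X;Z) joint = {mi(P_XZ):.4f}")
```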

Quantum Extensions

The data processing inequality extends to quantum information theory, where it states that the quantum mutual information I(A;B)_\rho between quantum systems A and B is monotonically non-increasing under local quantum operations, formally I(A;B') \leq I(A;B) for a quantum channel applied to B yielding B', in analogy with the classical Markov chain A \to B \to B'. This quantum DPI holds for completely positive trace-preserving (CPTP) maps and underpins limits in quantum communication, such as the Holevo bound and quantum channel capacities. Proofs rely on the monotonicity of quantum relative entropy under CPTP maps. Equality holds if the channel is reversible on the support of the state.
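A small numerical illustration (assuming a two-qubit Bell state and a depolarizing channel of strength p = 0.3 acting on B, with entropies computed in bits) shows the quantum mutual information contracting under a local CPTP map:

```python
import numpy as np

def von_neumann_entropy(rho):
    """Entropy in bits from the eigenvalues of a density matrix."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

def partial_trace(rho, keep):
    """Partial trace of a two-qubit density matrix; keep = 0 (keep A) or 1 (keep B)."""
    r = rho.reshape(2, 2, 2, 2)       # indices: a, b, a', b'
    return np.trace(r, axis1=1, axis2=3) if keep == 0 else np.trace(r, axis1=0, axis2=2)

def qmi(rho):
    """Quantum mutual information I(A;B) = S(A) + S(B) - S(AB)."""
    return (von_neumann_entropy(partial_trace(rho, 0))
            + von_neumann_entropy(partial_trace(rho, 1))
            - von_neumann_entropy(rho))

# Maximally entangled Bell state on AB.
phi = np.array([1, 0, 0, 1]) / np.sqrt(2)
rho_AB = np.outer(phi, phi.conj())

# Depolarizing channel of strength p applied to subsystem B only:
# rho -> (1 - p) * rho + p * rho_A (x) I/2, a CPTP map acting locally on B.
p = 0.3
rho_A = partial_trace(rho_AB, 0)
rho_out = (1 - p) * rho_AB + p * np.kron(rho_A, np.eye(2) / 2)

print(f"I(A;B)  before channel = {qmi(rho_AB):.4f} bits")   # = 2 for a Bell state
print(f"I(A;B') after channel  = {qmi(rho_out):.4f} bits")  # <= before, quantum DPI
```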

References

  1. [1]
    [PDF] Entropy, Relative Entropy and Mutual Information - Columbia CS
    The data processing inequality can be used to show that no clever manipulation of the data can improve the inferences that can be made from the data.
  2. [2]
    [PDF] Lecture 7: Data Processing Inequality
    ECE 5630: Information Theory for Data Transmission, Security and Machine Learning ... Theorem 1 (Data Processing Inequality) Let PX,QX ∈ P ...
  3. [3]
    [PDF] A Mathematical Theory of Communication
379–423, 623–656, July, October, 1948. A Mathematical Theory of Communication. By C. E. SHANNON. INTRODUCTION. THE recent development of various methods of ...
  4. [4]
    Data processing inequality and open quantum systems
    Oct 17, 2017 · ... information theory as the quantum data processing inequality (DPI) [1–3] . Here local processing refers to an open quantum system [4, 5] as ...
  5. [5]
    [PDF] Lecture 4: Data-processing, Fano
    Sep 9, 2012 · • Data-processing inequality: data processing may (or may not) lose information ... Yao Xie, ECE587, Information Theory, Duke University. 23.
  6. [6]
    Data Processing Inequalities and Function-Space Variational ...
    Aug 14, 2023 · The data processing inequality (DPI) is a fundamental inequality in information theory that states the mutual information between two random variables cannot ...
  7. [7]
    [PDF] Strong data-processing inequalities for channels and Bayesian ...
    The data-processing inequality, that is, I(U; Y ) ≤ I(U; X) for a Markov ... Information Theory. Dover Publications Inc., New York, NY, 1965. [Bir57].
  8. [8]
    [PDF] Elements of Information Theory
    Page 1. Page 2. ELEMENTS OF. INFORMATION THEORY. Second Edition. THOMAS M. COVER ... Elements of Information Theory, Second Edition, By Thomas M. Cover and Joy A ...
  9. [9]
    [PDF] Lecture 4: Data processing and Fano's inequalities; AEP 1 Recap 2 ...
    Jan 24, 2013 · 3 Data Processing Inequality. Definition 3.1 (Markov Chain). Three random variables X,Y,Z are said to form a Markov. Chain, denoted by X → Y ...
  11. [11]
    [PDF] 4. Markov Chains (9/23/12, cf. Ross) 1. Introduction 2. Chapman ...
    Markov Chains. 4. Markov Chains (9/23/12, cf. Ross). 1. Introduction. 2. Chapman-Kolmogorov Equations. 3. Types of States. 4. Limiting Probabilities. 5.
  12. [12]
    [PDF] CONTINUOUS-TIME MARKOV CHAINS Definition 1. Acontinuous ...
    (3). Pt +s = Pt Ps . (These are the natural analogues of the Chapman-Kolmogorov equations for discrete-time chains.) As for discrete-time Markov chains, we ...
  13. [13]
    [PDF] entropy, relative entropy, and mutual information
    Elements of Information Theory, Second Edition, By Thomas M. Cover and Joy A ... Definition The mutual information between two random variables X and Y is defined ...
  14. [14]
    [PDF] Lecture 2: Gibb's, Data Processing and Fano's Inequalities
    Information theory will help us identify ... Intuitively, the data processing inequality says that no clever transformation of the received code (channel.
  15. [15]
    [PDF] Lecture 2: Entropy and mutual information
    In this case we have the data processing inequality: I(X;Z) ≤ I(X;Y ). (45). In other words, processing cannot increase the information contained in a signal.
  16. [16]
    [PDF] Data Processing Theorems and the Second Law of Thermodynamics
    Jul 16, 2010 · ... convex function that defines the generalized mutual information. ... Index Terms: Data processing inequality, convexity, perspective function, H– ...
  17. [17]
    [PDF] f-Divergence Inequalities - arXiv
    Dec 4, 2016 · This paper develops systematic approaches to obtain f-divergence inequalities, dealing with pairs of probability.
  18. [18]
    [PDF] Characterizing Dependence of Samples along the Langevin ... - arXiv
    Data processing inequality (DPI) is a fundamental concept in information theory, which states that information cannot increase along a noisy channel or a ...
  19. [19]
    [PDF] arXiv:2411.01609v2 [quant-ph] 26 May 2025
    May 26, 2025 · We also provide numerical verification of the tomography ... (S99), ϵk ≤ ϵ, ∀k = 1, ··· ,ℓ is given by the data processing inequality,.
  20. [20]
    [PDF] Channel Capacity - WINLAB, Rutgers University
    These symbols are transmitted over a binary symmetric channel which is ... So mutual information is maximized when Y is uniformly distributed, which oc-.
  21. [21]
    [PDF] Elements of Information Theory
    Throughout the book we use the method of weakly typical sequences, which has its origins in Shannon's original 1948 work but was formally developed in the early ...
  22. [22]
    [PDF] Relaying One Bit Across a Tandem of Binary-Symmetric Channels
    . This sequence is then transmitted through a binary symmetric channel (BSC) with crossover probability δ ≤ 1/2. The output of this channel, denoted by Y n ...
  23. [23]
    [PDF] Minimal Achievable Sufficient Statistic Learning - arXiv
    Jun 11, 2019 · A sufficient statistic f(X) for Y is minimal if for ... Proof of CDI Data Processing Inequality. CDI Data Processing Inequality (Theorem 1).
  25. [25]
    A Theory of Usable Information Under Computational Constraints
    Feb 25, 2020 · This is consistent with deep neural networks extracting hierarchies of progressively more informative features in representation learning.
  26. [26]
    [PDF] An Information-Theoretic View for Deep Learning - arXiv
    Oct 2, 2018 · Section 3 exploits the strong data processing inequality to derive how the mutual information, between intermediate features representations and ...
  27. [27]
    [PDF] Causal inference in degenerate systems: An impossibility result
of the data processing inequality in information theory (Cover & Thomas, 2012). Intuitively, it states that transmitting data through a noisy channel can ...
  28. [28]
    [PDF] arXiv:1508.03808v2 [stat.ME] 18 Mar 2016
Mar 18, 2016 · [41] address the problem from an interventionalist perspective using Pearl's do-calculus [19] which ... The data processing inequality [4] ...
  29. [29]
    Local Privacy, Data Processing Inequalities, and Statistical Minimax ...
    Feb 13, 2013 · Abstract:Working under a model of privacy in which data remains private even from the statistician, we study the tradeoff between privacy ...
  30. [30]
    [PDF] Concurrent Composition Theorems for Differential Privacy - arXiv
    Mar 30, 2023 · Differential privacy is a statistical notion of ... Max-divergence is closed under post-processing due to the data-processing inequality.
  31. [31]
    [PDF] The Composition Theorem for Differential Privacy
    differential privacy as well. We refer to the supplementary material for a ... Theorem 2.4 (Data processing inequality for differential privacy). If a ...
  32. [32]
    Strong data-processing inequalities for channels
Aug 25, 2015 · The data-processing inequality, that is, I(U;Y)≤I(U;X) for a Markov chain U→X→Y, has been the method of choice for proving impossibility ( ...
  33. [33]
    Strong Data Processing Inequalities for Input Constrained Additive ...
Dec 20, 2015 · This paper quantifies the intuitive observation that adding noise reduces available information by means of non-linear strong data processing inequalities.
  34. [34]
    Tsirelson's bound from a Generalised Data Processing Inequality
    Aug 23, 2011 · We prove that if the data processing inequality holds with respect to this generalized entropy measure then the underlying theory necessarily respects ...
  35. [35]
    A New Data Processing Inequality and Its Applications in Distributed ...
    Nov 3, 2006 · Based on this new data processing inequality, we provide a single-letter necessary condition for the required joint probability distribution. We ...