
Information theory

Information theory is a mathematical discipline that quantifies, stores, and communicates information, focusing on the fundamental limits of data transmission and processing in the presence of noise. Founded by Claude E. Shannon through his seminal 1948 paper "A Mathematical Theory of Communication", it treats information as a probabilistic entity, independent of its semantic meaning, and emphasizes the engineering aspects of reliable message reproduction across communication systems. At its core, information theory introduces entropy as the measure of uncertainty or average information content in a random source, defined for a probability distribution as H = -\sum p_i \log_2 p_i bits, where p_i is the probability of each outcome. This concept underpins source coding theorems, which establish that data can be compressed to a length approaching the source entropy without loss, enabling efficient storage and transmission. Shannon's work also defines channel capacity as the supremum of mutual information over all input distributions, representing the highest rate at which information can be sent reliably over a noisy channel, as formalized in his noisy-channel coding theorem. Key extensions include mutual information, which quantifies the shared information between two variables as I(X;Y) = H(X) + H(Y) - H(X,Y), essential for analyzing dependencies in communication and data processing. The theory's impact spans telecommunications, where it optimizes signal encoding and error correction; computer science, influencing algorithms for data compression and machine learning; and broader fields such as biology, physics, and neuroscience, where entropy models uncertainty in genetic codes, thermodynamic systems, and neural networks. Shannon's adoption of the binary digit (bit) as the unit of information laid the groundwork for the digital age, transforming telecommunications and enabling modern electronics.

Introduction and History

Overview

Information theory is a branch of applied mathematics and electrical engineering primarily concerned with the quantification, storage, and communication of information. It was pioneered by Claude E. Shannon in his seminal 1948 paper, which established a rigorous mathematical framework for analyzing information processes. The field focuses on understanding how information can be measured in probabilistic terms, independent of its semantic meaning, to enable efficient handling in technical systems. The core problems addressed by information theory include quantifying the uncertainty inherent in messages or random events, devising efficient encoding techniques to minimize redundancy for storage and transmission, and designing methods to achieve reliable communication despite noise or interference in channels. These challenges are foundational to modern digital systems, where resources like bandwidth and storage are limited. For instance, information is conceptually viewed as the reduction of uncertainty: observing the outcome of a fair coin flip eliminates one bit of uncertainty, whereas specifying a result from a fair six-sided die requires log₂(6) ≈ 2.58 bits on average. Beyond its origins in engineering, information theory has exerted significant influence across interdisciplinary domains, including computer science for algorithm design, biology for modeling genetic information flows, and physics for connections to thermodynamics. Central to the field is entropy as a measure of uncertainty, with channel capacity defining the maximum rate of reliable transmission.

Historical Development

The origins of information theory trace back to early 20th-century efforts to quantify signal transmission in telegraphy. In 1924, Harry Nyquist published "Certain Factors Affecting Telegraph Speed" in the Bell System Technical Journal, where he derived a fundamental result establishing that a signal's maximum transmission rate is limited by twice the channel bandwidth, providing a foundational measure of the amount of information that could be sent over a telegraph channel without distortion and later contributing to the Nyquist–Shannon sampling theorem. Building on this, Ralph Hartley introduced a quantitative measure of information in his 1928 paper "Transmission of Information" in the same journal, defining information as the logarithm of the number of possible distinguishable symbols, which laid the groundwork for logarithmic measures of uncertainty in communication systems. The field crystallized with Claude Shannon's seminal 1948 paper "A Mathematical Theory of Communication," published in the Bell System Technical Journal, which formalized information theory by introducing concepts like entropy as a measure of uncertainty and channel capacity as the maximum reliable transmission rate over noisy channels, fundamentally shaping modern digital communication. This work was popularized and extended beyond engineering to broader scientific audiences through Warren Weaver's collaboration in the 1949 book The Mathematical Theory of Communication, which emphasized the theory's implications for semantics and human communication. In the 1960s, information theory expanded into computational foundations with the development of algorithmic information theory. Ray Solomonoff's 1960 technical report "A Preliminary Report on a General Theory of Inductive Inference" proposed measuring an object's information content by the length of the shortest program that generates it, formalizing inductive inference through universal priors. Andrey Kolmogorov independently advanced this in his 1965 paper "Three Approaches to the Quantitative Definition of Information" in Problems of Information Transmission, as did Gregory Chaitin in his 1966 paper "On the Length of Programs for Computing Finite Binary Sequences" in the Journal of the ACM, defining complexity as the minimal program length for computation and providing a non-probabilistic, algorithmic basis for information that influenced studies of randomness and computability. Quantum extensions emerged in the 1970s, with Alexander Holevo's 1973 paper "Bounds for the Quantity of Information Transmittable by a Quantum Communication Channel" in Problems of Information Transmission establishing the Holevo bound, which limits the classical information transmittable through quantum channels and helped found quantum information theory. By 2000, Michael Nielsen and Isaac Chuang's textbook systematized quantum entropy and related measures, integrating classical information theory with quantum mechanics for applications in quantum computing. In the late 1990s and 2000s, information theory intersected with machine learning, notably through Naftali Tishby's 1999 introduction of the information bottleneck method in the Proceedings of the 37th Allerton Conference, which optimizes data compression for relevant predictions by balancing information retention and compression, later applied to understand representation learning in neural networks. This integration continued into the 2020s, with information-theoretic tools analyzing training dynamics, such as information flows in neural architectures, as explored in recent works unifying Bayesian statistics and Shannon entropy in frameworks published up to 2025.

Fundamental Quantities

Entropy

In information theory, entropy quantifies the average uncertainty or information content associated with a discrete random variable. For a discrete random variable X taking values in a finite alphabet \mathcal{X} with probability mass function p(x) = \Pr(X = x) for x \in \mathcal{X}, the entropy H(X) is defined as H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x), where the logarithm is base 2, yielding a measure in bits. This formula, introduced by Shannon, represents the expected value of the self-information of a single outcome, where the self-information of an event with probability p(x) is -\log_2 p(x), the number of bits needed to specify that outcome. Entropy serves as the fundamental limit on the average number of bits required to encode the outcomes of X without loss of information, as established in lossless source coding theorems. For instance, consider a fair coin flip, where X is uniform with p(0) = p(1) = 1/2; then H(X) = 1 bit, indicating one bit suffices to describe the outcome on average. For a biased coin with p(1) = 0.9 and p(0) = 0.1, H(X) = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.469 bits, reflecting lower uncertainty due to the bias. The binary entropy function h(p), a special case for Bernoulli random variables with success probability p, is given by h(p) = -p \log_2 p - (1-p) \log_2 (1-p), which reaches its maximum of 1 bit at p = 0.5 and approaches 0 as p nears 0 or 1. More generally, entropy can be expressed in other units: nats using the natural logarithm (values in nats convert to bits upon division by \ln 2) or hartleys using the base-10 logarithm, though bits are standard in digital contexts for aligning with binary encoding. Key properties of entropy include non-negativity, H(X) \geq 0, with equality if and only if X is deterministic (one outcome has probability 1); this follows because each term -\log_2 p(x) is non-negative when p(x) \leq 1. Entropy is maximized by the uniform distribution over \mathcal{X}, where H(X) = \log_2 |\mathcal{X}|, providing an upper bound for a given alphabet size. For independent random variables X and Y, entropy is additive: H(X,Y) = H(X) + H(Y). Shannon derived the entropy function from a set of axioms in his seminal 1948 paper: the measure must be continuous in the probabilities, monotonically increasing in the number of equally likely outcomes, and decomposable so that the uncertainty of a compound choice is the weighted sum of the uncertainties of its successive sub-choices. Solving these yields H(X) = -K \sum p(x) \log p(x) for some positive constant K, with the constant chosen so that the unit is the bit; the proof shows that the logarithmic form satisfies the axioms and is unique up to this constant.
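As a quick numerical check of these definitions, the short Python sketch below evaluates H(X) directly from a probability vector; the helper name `entropy` and the example distributions are illustrative choices rather than part of any standard library.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum p log p, skipping zero-probability outcomes."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits
print(entropy([1/6] * 6))    # fair die: log2(6) ~ 2.585 bits
```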

Joint and Conditional Entropy

Joint entropy extends the concept of entropy to two or more random variables, quantifying the total uncertainty in their joint occurrence. For discrete random variables X and Y with joint probability mass function p(x,y), the joint entropy H(X,Y) is defined as H(X,Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y), which represents the average number of bits needed to specify both X and Y together. This measure averages the self-information over the joint distribution, capturing dependencies between the variables. A key property of joint entropy is subadditivity: H(X,Y) \leq H(X) + H(Y), with equality holding if and only if X and Y are independent. Additionally, H(X,Y) \geq \max(H(X), H(Y)), reflecting that the combined uncertainty cannot be less than the uncertainty of either variable alone. The chain rule provides a decomposition: H(X,Y) = H(X) + H(Y|X), linking joint entropy to conditional entropy and enabling recursive extensions to more variables, such as H(X_1, \dots, X_n) = \sum_{i=1}^n H(X_i | X_1, \dots, X_{i-1}). Conditional entropy H(Y|X) measures the average remaining uncertainty about Y after observing X. It is formally defined as H(Y|X) = -\sum_{x,y} p(x,y) \log_2 p(y|x) = H(X,Y) - H(X), where p(y|x) = p(x,y)/p(x) is the conditional probability mass function. This quantity is interpreted as the expected entropy of Y given each possible value of X, weighted by the probability of X. Properties of conditional entropy include non-negativity: H(Y|X) \geq 0, with equality if Y is a deterministic function of X. Furthermore, conditioning cannot increase entropy: H(Y|X) \leq H(Y), with equality if and only if X and Y are independent, indicating that knowledge of X either reduces or leaves unchanged the uncertainty in Y. As an example, consider two fair coin flips X and Y. If independent, the joint entropy H(X,Y) = 2 bits, equal to H(X) + H(Y) = 1 + 1. If Y = X (perfect dependence), then H(X,Y) = 1 bit, which is \max(H(X), H(Y)). For conditional entropy, suppose Y is the next English letter after observing the previous letter X; conditioning on X reduces the entropy from the unconditional H(Y) \approx 4.1 bits to roughly 3.3 bits, and conditioning on long contexts pushes the per-letter entropy toward about 1.3 bits, due to contextual dependencies.
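The chain rule and the coin-flip examples can be verified numerically; the sketch below, using hypothetical helper names and toy joint distributions, computes H(X,Y) and H(Y|X) = H(X,Y) - H(X) by brute force.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def joint_entropy(joint):            # joint: dict mapping (x, y) -> probability
    return H(joint.values())

def conditional_entropy(joint):      # H(Y|X) = H(X,Y) - H(X), via the chain rule
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return joint_entropy(joint) - H(px.values())

independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}   # two fair coins
dependent = {(0, 0): 0.5, (1, 1): 0.5}                         # Y = X

print(joint_entropy(independent), conditional_entropy(independent))  # 2.0, 1.0
print(joint_entropy(dependent), conditional_entropy(dependent))      # 1.0, 0.0
```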

Mutual Information

Mutual information quantifies the amount of information that one random variable contains about another, serving as a measure of their statistical dependence. Introduced by Shannon in his foundational work on communication theory, it captures the shared information between two variables without assuming a specific form of relationship, such as linearity. Formally, for discrete random variables X and Y with joint distribution p(x,y), the mutual information I(X;Y) is defined as the difference between the entropy of X and the conditional entropy of X given Y: I(X;Y) = H(X) - H(X|Y). This equals the Kullback-Leibler divergence between the joint distribution and the product of the marginals: I(X;Y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}, where the logarithm is base 2 for units in bits. The measure is symmetric, such that I(X;Y) = I(Y;X), reflecting that the information Y provides about X is the same as the information X provides about Y. Mutual information possesses several key properties. It is non-negative, I(X;Y) \geq 0, with equality holding if and only if X and Y are independent, meaning knowledge of one provides no information about the other. Additionally, it is bounded above by the minimum of the individual entropies: I(X;Y) \leq \min(H(X), H(Y)), ensuring it cannot exceed the inherent uncertainty in either variable. In interpretive terms, mutual information represents the average reduction in uncertainty about X upon observing Y, or equivalently, the number of bits of information shared between the two variables. This makes it a fundamental tool for assessing dependence in probabilistic systems, distinct from joint and conditional entropy, which measure absolute or context-dependent uncertainties. A chain rule extends mutual information to multiple variables: for random variables X_1, X_2, \dots, X_n and Y, I(X_1, X_2, \dots, X_n; Y) = I(X_1; Y) + I(X_2; Y | X_1) + \cdots + I(X_n; Y | X_1, \dots, X_{n-1}), allowing decomposition of total dependence into conditional contributions. Examples illustrate these concepts clearly. If X and Y are independent, then p(x,y) = p(x)p(y), so I(X;Y) = 0. Conversely, if Y = X (identical variables), then H(X|Y) = 0, yielding I(X;Y) = H(X), the full entropy of X.
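The identity I(X;Y) = H(X) + H(Y) - H(X,Y) lends itself to direct computation; the sketch below, with illustrative helper names and toy joint distributions, evaluates it for a noisy copy of a fair bit and for two independent bits.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a dict mapping (x, y) -> probability."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return H(px.values()) + H(py.values()) - H(joint.values())

# Y is X passed through a binary symmetric channel with crossover probability 0.1
eps = 0.1
noisy_copy = {(0, 0): 0.5 * (1 - eps), (0, 1): 0.5 * eps,
              (1, 0): 0.5 * eps,       (1, 1): 0.5 * (1 - eps)}
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

print(mutual_information(noisy_copy))    # ~0.531 bits = 1 - h(0.1)
print(mutual_information(independent))   # 0.0 bits
```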

Divergence Measures

Divergence measures in information theory quantify the difference between two probability distributions, with the Kullback-Leibler (KL) divergence serving as a foundational example. For discrete probability distributions P and Q over the same alphabet, the KL divergence is defined as \mathrm{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}, where the logarithm is typically base 2 for bits or natural for nats. This measure is always non-negative, \mathrm{KL}(P \parallel Q) \geq 0, and equals zero if and only if P = Q, a property established via Jensen's inequality. For continuous distributions, the sum is replaced by an integral: \mathrm{KL}(P \parallel Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx. The KL divergence exhibits key properties that distinguish it from metric distances. It is asymmetric, meaning \mathrm{KL}(P \parallel Q) \neq \mathrm{KL}(Q \parallel P) in general, reflecting its directed nature from P to Q. Additionally, it does not satisfy the triangle inequality, preventing it from forming a true metric on the space of probability distributions. Introduced by Solomon Kullback and Richard Leibler in 1951, the KL divergence originated in the context of statistical hypothesis testing and information sufficiency. Interpretationally, \mathrm{KL}(P \parallel Q) represents the expected number of extra bits required to encode samples drawn from P using a code optimized for Q, rather than one optimal for P. Equivalently, it measures the information gain obtained when using the true distribution P instead of the approximation Q. This connects to mutual information, where I(X;Y) = \mathbb{E}_{Y} \left[ \mathrm{KL}(P_{X|Y} \parallel P_X) \right], quantifying the average divergence between the conditional and marginal distributions of X. A concrete example arises with Bernoulli distributions, where P \sim \mathrm{Bern}(p) and Q \sim \mathrm{Bern}(q). The KL divergence simplifies to \mathrm{KL}(P \parallel Q) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}, illustrating how it penalizes mismatches in success probabilities; for instance, with p=0.5 and q=0.6, \mathrm{KL}(P \parallel Q) \approx 0.029 bits. In applications like model selection, the KL divergence underpins criteria such as the Akaike information criterion (AIC), which estimates the expected relative KL divergence between the true data-generating distribution and a fitted model to balance goodness-of-fit and complexity.
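The Bernoulli example above, and the asymmetry of the measure, can be checked with a few lines of Python; the helper name and the probability values are illustrative.

```python
import math

def kl_divergence(p_dist, q_dist):
    """KL(P || Q) in bits for discrete distributions given as aligned probability lists."""
    return sum(p * math.log2(p / q) for p, q in zip(p_dist, q_dist) if p > 0)

# Bernoulli example from the text: p = 0.5 versus q = 0.6
print(kl_divergence([0.5, 0.5], [0.6, 0.4]))   # ~0.029 bits

# Asymmetry: swapping the roles of P and Q generally changes the value
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # ~0.531 bits
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # ~0.737 bits
```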

Source Coding

Compression Fundamentals

Source coding theory addresses the compression of data from information sources without loss of information, establishing fundamental limits on achievable compression rates. The source coding theorem, also known as the noiseless coding theorem, states that for a discrete memoryless source producing symbols from alphabet \mathcal{X} with probability distribution p(x), the optimal average codeword length L for any uniquely decodable code satisfies L \geq H(X), where H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) is the entropy of the source. This bound is achievable in the limit as the block length increases, meaning sequences of source symbols can be compressed to rates arbitrarily close to the entropy using block codes that are uniquely decodable. The theorem implies that the minimum compression rate R required for reliable lossless encoding of the source is R \geq H(X) bits per symbol, ensuring that the encoded representation captures the inherent uncertainty of the source without redundancy beyond this limit. For sources with entropy lower than the logarithm of the alphabet size, significant compression is possible by exploiting statistical structure, as entropy quantifies the average surprise or unpredictability per symbol. Huffman coding provides an optimal prefix-free code for discrete memoryless sources with known symbol probabilities, achieving an average code length within 1 bit of the entropy bound. The algorithm proceeds by first sorting the symbols in decreasing order of their probabilities. It then iteratively builds a binary tree: the two symbols (or subtrees) with the lowest probabilities are combined into a new node with probability equal to their sum, assigning 0 and 1 to the branches, and this process repeats until a single root remains. The codewords are the paths from root to leaves, ensuring no codeword is a prefix of another and minimizing the weighted path length, which equals the average code length. Arithmetic coding offers superior efficiency for memoryless sources by encoding entire messages as fractions of the unit interval, rather than assigning an integer-length codeword to each symbol, approaching the entropy bound more closely without the integer-length overhead of Huffman coding. In this method, the interval [0,1) is subdivided according to symbol probabilities; each symbol narrows the current interval proportionally, and the final interval's binary representation (or a tag within it) serves as the code, with the decoder reversing the process to recover the symbols. This fractional encoding allows the code length to be approximately -\log_2 P(m) bits for message m with probability P(m), yielding near-entropy performance even for short messages. A practical example is the compression of English text, where the entropy is approximately 1 to 1.5 bits per character due to letter frequencies and contextual redundancies in the language, far below the 4.7 bits per character (log₂ 26 for the lowercase alphabet) of uniform encoding. This redundancy, arising from predictable patterns like common digrams and grammatical structure, enables compression ratios of about 3:1 to 5:1 using entropy-based methods, as demonstrated in early human prediction experiments that estimated the per-character information content of English.
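To make the tree-building procedure concrete, the following Python sketch implements a minimal Huffman coder with a heap; the function name `huffman_code` and the dyadic example probabilities are illustrative assumptions, chosen so that the resulting average length coincides exactly with the source entropy.

```python
import heapq, math

def huffman_code(probs):
    """Minimal Huffman coder: probs maps symbol -> probability; returns symbol -> codeword."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)                      # tie-breaker so dicts are never compared
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)      # two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}    # dyadic probabilities
code = huffman_code(probs)
avg_len = sum(probs[s] * len(w) for s, w in code.items())
source_entropy = -sum(p * math.log2(p) for p in probs.values())
print(code)                          # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(avg_len, source_entropy)       # both 1.75 bits per symbol
```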

Practical Coding Techniques

Variable-length codes assign shorter codewords to more probable symbols and longer ones to less probable symbols, enabling efficient compression of sources with uneven symbol probabilities. One early method is Shannon-Fano coding, which constructs codes by recursively splitting the symbol set into two subsets of approximately equal total probability, assigning binary digits accordingly. This top-down approach ensures the code lengths are close to the negative logarithm of the probabilities, though it may not always yield optimal lengths. In contrast, Huffman coding achieves optimality among prefix-free codes over a fixed alphabet by building a code tree bottom-up, merging the two least probable symbols iteratively to form codewords whose lengths satisfy the Kraft inequality with equality. While Shannon-Fano coding is conceptually simpler, Huffman codes generally produce shorter average lengths, making them preferable for static sources where the full probability distribution is known in advance. Both methods approach the entropy bound for large alphabets but incur redundancy for small ones due to integer code lengths. For universal compression of unknown or streaming data, the Lempel-Ziv-Welch (LZW) algorithm builds a dictionary adaptively by replacing repeated substrings with codes referencing prior occurrences. Starting from an initial dictionary of single symbols, LZW parses the input into phrases, outputs the code for the longest matching prefix, and adds the extension to the dictionary, enabling real-time learning without prior statistics. This variant of the LZ78 scheme excels on repetitive data, achieving compression ratios approaching the entropy rate asymptotically, and is widely used in formats such as GIF images and compressed archives for its simplicity and effectiveness on text and graphics. Lossy compression techniques sacrifice fidelity for higher compression, guided by rate-distortion theory, which quantifies the minimum rate R(D) needed to represent a source at average distortion no greater than D. For a source X and reproduction \hat{X} with distortion E[d(X, \hat{X})] \leq D, R(D) = \min_{p(\hat{x}|x): E[d(X,\hat{X})] \leq D} I(X; \hat{X}), where the mutual information I(X; \hat{X}) captures the information preserved in the approximation. Practical source-side methods focus on transforming the signal to concentrate energy in a few coefficients, quantizing them according to the allowed distortion, and entropy-coding the result, balancing rate against perceptual quality without channel considerations. Adaptive coding addresses non-stationary sources by updating models on-the-fly, avoiding assumptions of fixed statistics. Prediction by partial matching (PPM) exemplifies this by estimating symbol probabilities based on increasingly longer contexts from recent history, escaping to lower-order models when unseen patterns arise to avoid assigning zero probabilities. This hierarchical approach achieves superior compression on text or mixed data by capturing evolving dependencies, often outperforming static Huffman coding when paired with an adaptive arithmetic coder. A representative application is the JPEG standard for image compression, which applies the discrete cosine transform (DCT) to 8x8 blocks to decorrelate pixels based on spatial statistics, yielding coefficients that are quantized and Huffman-coded. By exploiting human visual sensitivity, retaining low-frequency detail while discarding high-frequency detail, JPEG trades minor artifacts for compression ratios of 10:1 to 20:1 on typical photos, though computational costs in the DCT and quantization rise with image resolution. These techniques highlight the tension between achieving near-theoretical rates and practical constraints like computational simplicity.
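A minimal sketch of the LZW parsing loop is shown below; the function name `lzw_compress`, the choice of starting dictionary, and the toy input are assumptions made for illustration, and practical implementations add details such as fixed code widths and dictionary resets.

```python
def lzw_compress(data: str):
    """Minimal LZW encoder sketch: returns a list of dictionary indices."""
    # Start with single-character entries; here the alphabet is whatever appears in the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(data)))}
    next_code = len(dictionary)
    phrase, output = "", []
    for ch in data:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate                    # extend the current match
        else:
            output.append(dictionary[phrase])     # emit code for longest matching prefix
            dictionary[candidate] = next_code     # learn the new phrase
            next_code += 1
            phrase = ch
    if phrase:
        output.append(dictionary[phrase])
    return output, dictionary

codes, _ = lzw_compress("abababababab")
print(codes)   # 12 input characters collapse to 6 codes as repeated phrases are learned
```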

Channel Coding

Capacity and Noisy Channels

In information theory, a discrete memoryless channel (DMC) is modeled as a probabilistic mapping from input symbols X drawn from a finite alphabet \mathcal{X} to output symbols Y from a finite alphabet \mathcal{Y}, characterized by conditional transition probabilities p(y|x) that specify the probability of receiving y given that x was transmitted, with outputs independent across time slots given the inputs. The capacity C of such a channel represents the supremum of rates at which information can be reliably transmitted and is defined as the maximum mutual information between input and output over all possible input distributions: C = \max_{p(x)} I(X; Y), where I(X; Y) = H(Y) - H(Y|X) quantifies the reduction in uncertainty about Y provided by knowledge of X. Shannon's noisy-channel coding theorem establishes the fundamental limits of reliable communication over such channels: for any rate R < C, there exists a coding scheme achieving arbitrarily low probability of error as the block length n \to \infty; conversely, for R > C, the error probability is bounded away from zero. A sketch of the achievability proof relies on random coding and typical set decoding. In random coding, 2^{nR} codewords are generated independently, one for each message, with each symbol drawn i.i.d. from the capacity-achieving input distribution p(x); the average error probability over this ensemble vanishes as n \to \infty for R < I(X; Y), implying the existence of at least one good code by an averaging argument. For decoding, the receiver selects the unique codeword jointly typical with the received sequence y^n, where a pair (x^n, y^n) is jointly typical if its empirical statistics are within \epsilon of the true joint distribution; with high probability the correct codeword is decoded, because the probability mass concentrates on the jointly typical set of size approximately 2^{n H(X,Y)}. For the binary symmetric channel (BSC), where \mathcal{X} = \mathcal{Y} = \{0,1\} and bits flip with probability p < 1/2, the capacity-achieving input distribution is uniform, yielding C = 1 - h_2(p), with h_2(p) = -p \log_2 p - (1-p) \log_2 (1-p) the binary entropy function, so C = 1 bit per use when p=0 (noiseless) and decreases to 0 as p \to 1/2. In the continuous-time setting with additive white Gaussian noise (AWGN), the channel adds independent Gaussian noise to the transmitted signal under a power constraint; discretizing into 2W dimensions per second for bandwidth W, the capacity per dimension is \frac{1}{2} \log_2 (1 + \mathrm{SNR}) bits, where SNR is the signal-to-noise ratio P/N with transmit power P and noise power N per dimension. The capacity C is interpreted as the supremum of reliable transmission rates in bits per channel use, beyond which noise fundamentally prevents error-free communication, even with optimal encoding and decoding.
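The closed-form capacities quoted above are easy to evaluate numerically; the sketch below, with illustrative function names, computes the BSC capacity 1 - h_2(p) and the per-dimension AWGN capacity \frac{1}{2}\log_2(1 + \mathrm{SNR}).

```python
import math

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of the binary symmetric channel: C = 1 - h2(p) bits per channel use."""
    return 1.0 - binary_entropy(p)

def awgn_capacity_per_dimension(snr):
    """C = 0.5 * log2(1 + SNR) bits per real dimension."""
    return 0.5 * math.log2(1.0 + snr)

print(bsc_capacity(0.0))                  # 1.0 bit per use (noiseless)
print(bsc_capacity(0.11))                 # ~0.5 bits per use
print(bsc_capacity(0.5))                  # 0.0 (output independent of input)
print(awgn_capacity_per_dimension(10.0))  # ~1.73 bits per dimension at SNR = 10
```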

Specific Channel Models

The binary erasure channel (BEC) models a communication channel where each transmitted bit is received correctly with probability 1 - \alpha or is replaced by an erasure symbol with probability \alpha, allowing the receiver to detect but not recover erased bits. Introduced by Peter Elias in his foundational work on coding for noisy channels, the BEC simplifies analysis by separating erasures from bit errors. The capacity of the BEC is C = 1 - \alpha bits per channel use, derived as the maximum mutual information I(X;Y) over input distributions, since non-erased outputs reveal the input exactly while erasures carry no information about it. This capacity is achievable with low-complexity codes like parity-check or low-density parity-check (LDPC) codes that recover erasures reliably, as demonstrated by code sequences approaching the capacity limit under maximum a posteriori decoding. Asymmetric binary channels, such as the Z-channel, depart from this symmetry by introducing biased error patterns in which one input (typically 0) is received error-free, while the other (1) flips to 0 with probability \alpha and stays 1 otherwise. The Z-channel capacity requires numerical optimization over the input distribution p = P(X=1) and is given by C = \max_p [h(\beta) - p h(\alpha)], where \beta = p(1 - \alpha) is the probability of output 1 and h(\cdot) denotes the binary entropy function; this formula arises from maximizing I(X;Y) = h(Y) - h(Y|X), with h(Y|X) = p h(\alpha). Coding strategies for the Z-channel often involve asymmetric error-correcting codes, such as repeat-accumulate ensembles, which approach capacity through iterative decoding tailored to the error asymmetry. Similar optimization techniques apply to other asymmetric binary channels, such as the binary asymmetric channel with different error probabilities for each input, whose capacities follow from numerical maximization of the mutual information. For continuous-time channels, the additive white Gaussian noise (AWGN) channel under an average power constraint P represents a fundamental model, with C = W \log_2 \left(1 + \frac{P}{N_0 W}\right) bits per second for bandwidth W Hz, where N_0 is the noise power spectral density; this seminal result, established by Shannon, is achieved by a Gaussian input distribution that maximizes the mutual information. In frequency-selective fading, parallel Gaussian channels with varying noise levels \sigma_i^2 and total power constraint \sum P_i \leq P achieve capacity via the water-filling algorithm, allocating P_i = \max(\mu - \sigma_i^2, 0) such that \sum P_i = P, where \mu is chosen to satisfy the constraint, leading to C = \sum \frac{1}{2} \log_2 \left(1 + \frac{P_i}{\sigma_i^2}\right). This allocation prioritizes lower-noise subchannels, enhancing throughput in multicarrier systems like OFDM. Multiple-input multiple-output (MIMO) channels generalize the Gaussian model to systems with n_t transmit and n_r receive antennas, where the received signal is \mathbf{Y} = \mathbf{H} \mathbf{X} + \mathbf{Z}, with channel matrix \mathbf{H}, input covariance \boldsymbol{\Sigma} under trace constraint \operatorname{tr}(\boldsymbol{\Sigma}) \leq P, and \mathbf{Z} \sim \mathcal{N}(0, N_0 \mathbf{I}). The ergodic capacity is C = \max_{\boldsymbol{\Sigma}} \mathbb{E} \left[ \log_2 \det \left( \mathbf{I} + \frac{\mathbf{H} \boldsymbol{\Sigma} \mathbf{H}^H}{N_0} \right) \right] bits per channel use, achieved with Gaussian inputs; for fixed \mathbf{H}, it simplifies to \log_2 \det \left( \mathbf{I} + \frac{\mathbf{H} \boldsymbol{\Sigma} \mathbf{H}^H}{N_0} \right). This formula, derived by Telatar and independently by Foschini, highlights multiplexing gains scaling with \min(n_t, n_r), enabling high-throughput wireless systems. Erasure and deletion channels pose greater challenges than the BEC, as deletions remove symbols without explicit indication, complicating synchronization and decoding.
For the binary deletion channel with deletion probability \delta, the capacity remains an open problem: it is upper bounded by the erasure-channel capacity 1 - \delta, lower bounded by constructive arguments achieving rates close to 1 - h(\delta) for small \delta, and exact capacity expressions are unknown except in limiting regimes. The Varshamov-Tenengolts (VT) codes, introduced for single asymmetric errors but adaptable to single deletions, provide near-optimal constructions with rate \approx 1 - \frac{\log n}{n} for block length n, using a check sum \sum_i i x_i \equiv 0 \pmod{n+1} to locate the deletion. For multiple deletions, bounds rely on combinatorial arguments, with capacity for small \delta approaching 1 - h(\delta) \approx 1 - \delta \log_2 (1/\delta). These channel models inform practical system design, such as in modems, where BEC-like erasure handling via erasure-correcting codes improves reliability over noisy links, and water-filling optimizes power allocation in DSL modems to combat frequency-dependent attenuation. In 5G New Radio (NR) standards, channel models from TR 38.901 incorporate clustered delay line (CDL) and tapped delay line (TDL) profiles for frequencies up to 100 GHz, enabling capacity computations under spatial consistency and blockage modeling, with implications for enhanced mobile broadband (eMBB) achieving up to 20 Gbps peak rates.
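As an illustration of the water-filling allocation described above, the following sketch finds the water level \mu by bisection for a few hypothetical subchannel noise levels and a hypothetical power budget; the function name and the numbers are assumptions, not part of any standard.

```python
import math

def water_filling(noise_vars, total_power, iterations=100):
    """Allocate P_i = max(mu - sigma_i^2, 0) across parallel Gaussian subchannels,
    finding the water level mu by bisection so that sum(P_i) = total_power."""
    lo, hi = min(noise_vars), max(noise_vars) + total_power
    for _ in range(iterations):
        mu = 0.5 * (lo + hi)
        if sum(max(mu - nv, 0.0) for nv in noise_vars) > total_power:
            hi = mu
        else:
            lo = mu
    powers = [max(mu - nv, 0.0) for nv in noise_vars]
    capacity = sum(0.5 * math.log2(1 + p / nv) for p, nv in zip(powers, noise_vars))
    return powers, capacity

# Hypothetical example: three subchannels with noise variances 1, 4, 9 and power budget 10
powers, cap = water_filling([1.0, 4.0, 9.0], 10.0)
print(powers)   # ~[6.5, 3.5, 0.0]: the noisiest subchannel is left unused
print(cap)      # ~1.91 bits per channel use in total
```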

Advanced Concepts

Directed Information

Directed information generalizes mutual information to processes exhibiting temporal dependencies, providing a measure of causal information flow from one process to another. Formalized by James L. Massey in 1990, building on ideas from Hans Marko's bidirectional communication theory, it addresses scenarios where influence flows asymmetrically over time, such as in communication channels with memory or feedback, where traditional mutual information fails to capture directionality. This measure is particularly valuable in analyzing systems where past outputs influence future inputs, enabling the quantification of causal influence in non-stationary or sequential settings. The directed information from an input process X^n = (X_1, \dots, X_n) to an output process Y^n = (Y_1, \dots, Y_n) is formally defined as I(X^n \to Y^n) = \sum_{t=1}^n I(X^t; Y_t \mid Y^{t-1}), where I(\cdot; \cdot \mid \cdot) denotes conditional mutual information, X^t = (X_1, \dots, X_t), Y^{t-1} = (Y_1, \dots, Y_{t-1}) (with Y^0 empty), and the sum aggregates the incremental information the input prefix contributes to each output symbol given the prior output history. This formulation emphasizes causality by conditioning only on past outputs, avoiding non-causal dependencies on future values. The directed information rate is then \bar{I}(X \to Y) = \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n), assuming the limit exists. For channels with memory, where the output at each time depends on the entire input history and possibly past outputs, the capacity is characterized using the directed information rate. Specifically, the capacity C is given by C = \sup \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n), where the supremum is taken over all causal input distributions P_{X^n \| Y^{n-1}} that respect the channel's structure. This multi-letter expression accounts for temporal correlations, making it suitable for channels with inter-symbol interference or additive noise with memory. Key properties of directed information include non-negativity and a data-processing inequality, I(X^n \to Z^n) \leq I(X^n \to Y^n) when Z^n is obtained by causally processing Y^n. It reduces to the standard mutual information I(X^n; Y^n) when the channel is used without feedback, that is, when the inputs do not depend on past outputs. Unlike mutual information, directed information is asymmetric, I(X^n \to Y^n) \neq I(Y^n \to X^n) in general, which explicitly encodes the direction of information flow and distinguishes causal influences. The conditioning on past outputs parallels Granger-style causal analysis of time series, measuring the reduction in uncertainty about Y_t given its own past Y^{t-1}. In applications to feedback channels, where the input X_t may depend on previous outputs Y^{t-1}, directed information provides the appropriate capacity characterization, since ordinary mutual information implicitly assumes non-causal access to the inputs. The feedback capacity satisfies C_{\text{feedback}} \geq C_{\text{no feedback}}, with equality for memoryless channels but a strict increase possible for channels with memory. A notable example is the Schalkwijk-Kailath scheme for the AWGN channel with noiseless feedback, which achieves capacity through iterative linear encoding that refines message estimates, demonstrating practical gains in error exponents even though feedback does not increase the asymptotic capacity of this memoryless channel. This scheme highlights how directed information guides optimal coding strategies in feedback scenarios by focusing on causal contributions. Examples of directed information computation arise in Markovian settings, where sources or channels follow a Markov chain, simplifying the conditional terms in the sum. For a Markov source driving a Markov channel, the directed information rate can be evaluated using the stationary distribution and transition probabilities, yielding closed-form expressions for the causal flow.
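The causal decomposition I(X^n \to Y^n) = \sum_t I(X^t; Y_t \mid Y^{t-1}) can be evaluated by brute force for small examples; the sketch below (illustrative function names, a two-use binary symmetric channel with i.i.d. uniform inputs and no feedback) confirms that without feedback the directed information equals the ordinary mutual information.

```python
import math
from itertools import product
from collections import defaultdict

def H(dist):
    """Entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def directed_information(joint, n):
    """I(X^n -> Y^n) = sum_t [ H(Y_t | Y^{t-1}) - H(Y_t | X^t, Y^{t-1}) ],
    computed by brute force from a joint pmf over (x_sequence, y_sequence) pairs."""
    def project(keep_x, keep_y):
        d = defaultdict(float)
        for (xs, ys), p in joint.items():
            d[(xs[:keep_x], ys[:keep_y])] += p
        return d
    total = 0.0
    for t in range(1, n + 1):
        h_y_given_past = H(project(0, t)) - H(project(0, t - 1))          # H(Y_t | Y^{t-1})
        h_y_given_past_and_x = H(project(t, t)) - H(project(t, t - 1))    # H(Y_t | X^t, Y^{t-1})
        total += h_y_given_past - h_y_given_past_and_x
    return total

# Two uses of a binary symmetric channel with crossover 0.1, i.i.d. uniform inputs, no feedback
eps, n = 0.1, 2
joint = {}
for xs in product((0, 1), repeat=n):
    for ys in product((0, 1), repeat=n):
        p = 0.5 ** n
        for x, y in zip(xs, ys):
            p *= (1 - eps) if x == y else eps
        joint[(xs, ys)] = p

print(directed_information(joint, n))   # ~1.062 bits = 2 * (1 - h(0.1)) = I(X^2; Y^2)
```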

Information in Non-Standard Settings

Information theory extends beyond classical discrete and continuous domains to encompass quantum systems, computational paradigms, and complex stochastic processes, where traditional Shannon entropy must be generalized to capture phenomena like superposition, undecidability, and long-range correlations. In quantum settings, information is quantified using density operators rather than probability distributions, leading to measures that account for non-commutativity and entanglement. Algorithmic variants shift focus from probabilistic uncertainty to the intrinsic complexity of individual objects via program length. These extensions provide foundational tools for fields like quantum computing and algorithmic complexity, revealing limits unattainable in classical frameworks. In quantum information theory, the von Neumann entropy serves as the primary measure of uncertainty for a quantum state described by a density matrix \rho, defined as S(\rho) = -\operatorname{Tr}(\rho \log \rho), where \operatorname{Tr} denotes the trace operation and the logarithm is base 2 for bits. This entropy generalizes Shannon's entropy to quantum mechanics, quantifying the mixedness of \rho while vanishing for pure states. For an ensemble of states \{p_i, \rho_i\}, the Holevo bound \chi upper-bounds the accessible classical information, given by \chi = S\left(\sum_i p_i \rho_i\right) - \sum_i p_i S(\rho_i), establishing a fundamental limit on how much classical data can be reliably transmitted through quantum channels without measurement collapse. The quantum mutual information between subsystems A and B of a bipartite state \rho_{AB} extends this to correlations, defined as I(A:B) = S(\rho_A) + S(\rho_B) - S(\rho_{AB}), where \rho_A = \operatorname{Tr}_B(\rho_{AB}) and similarly for \rho_B; this measure is non-negative and equals twice the entanglement entropy for pure states. Quantum channels, modeled as completely positive trace-preserving maps, have capacities constrained by these quantities; for the qubit depolarizing channel \mathcal{N}_p(\rho) = (1-p)\rho + p \frac{I}{2}, where p is the depolarizing probability and I is the identity, the classical capacity equals the Holevo information \chi(\mathcal{N}_p), which is additive even when the channel is tensored with arbitrary channels, while the quantum capacity remains open for general p but is bounded above via approximate degradability for low p. Algorithmic information theory redefines information content through the lens of computation, independent of probability. The Kolmogorov complexity K(x) of a binary string x is the length of the shortest program for a fixed universal Turing machine that outputs x and halts, providing an absolute, observer-independent measure of complexity. This notion, introduced independently by Solomonoff, Kolmogorov, and Chaitin, captures the minimal description length required to specify x, rendering random those strings with K(x) \approx |x| (incompressible), but K(x) is uncomputable due to the halting problem, limiting its practical use to approximations via compression algorithms. Mutual information analogs, such as I(x:y) = K(x) + K(y) - K(x,y), quantify the shared complexity between objects, foundational for understanding similarity and shared structure despite theoretical intractability. For stochastic processes, the entropy rate quantifies average uncertainty per symbol in the long run. For a discrete-time process \{X_i\}, the entropy rate is defined as H = \lim_{n \to \infty} \frac{1}{n} H(X_1, \dots, X_n), where the limit exists under stationarity and equals \lim_{n \to \infty} H(X_n | X_1, \dots, X_{n-1}) by the chain rule.
This rate, originally studied by Shannon, governs the fundamental limit for lossless compression of process outputs, taking finite values for stationary ergodic processes such as Markov chains, while the limit may fail to exist for non-stationary ones. It bridges classical information theory to dynamical systems, enabling analysis of correlation decay in sources like language models or physical systems. These concepts find applications in quantum gravity, such as the AdS/CFT correspondence, where holographic duality posits that bulk gravitational degrees of freedom in anti-de Sitter (AdS) space encode the information of the boundary conformal field theory (CFT), with the entanglement entropy of a boundary region computed by the Ryu-Takayanagi formula S = \frac{\operatorname{Area}(\gamma)}{4G_N}, where \gamma is the extremal surface homologous to the boundary region; this duality suggests emergent spacetime from entanglement, as explored in generalizations to excited states and deformed geometries.
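The quantum quantities above reduce to small eigenvalue computations for qubit examples; the sketch below, assuming NumPy is available and using illustrative function names, evaluates the von Neumann entropy of pure and maximally mixed qubit states and the Holevo bound for two equiprobable non-orthogonal pure states.

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log2 rho), computed from the eigenvalues of the density matrix."""
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]      # drop numerically zero eigenvalues
    return float(-np.sum(eigvals * np.log2(eigvals)))

def holevo_bound(priors, states):
    """chi = S(sum_i p_i rho_i) - sum_i p_i S(rho_i) for an ensemble of density matrices."""
    average_state = sum(p * rho for p, rho in zip(priors, states))
    return von_neumann_entropy(average_state) - sum(
        p * von_neumann_entropy(rho) for p, rho in zip(priors, states))

pure = np.array([[1.0, 0.0], [0.0, 0.0]])     # |0><0|
mixed = 0.5 * np.eye(2)                       # maximally mixed qubit
print(von_neumann_entropy(pure), von_neumann_entropy(mixed))   # 0.0, 1.0

# Two equiprobable non-orthogonal pure states |0> and |+>
ket0 = np.array([[1.0], [0.0]])
ket_plus = np.array([[1.0], [1.0]]) / np.sqrt(2)
ensemble = [ket0 @ ket0.T, ket_plus @ ket_plus.T]
print(holevo_bound([0.5, 0.5], ensemble))     # ~0.60 bits, strictly below one classical bit
```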

Applications

Communication and Secrecy

Information theory provides foundational principles for secure communication, ensuring that messages remain confidential even against adversaries with unlimited computational power. A cornerstone is the concept of perfect secrecy, introduced by Shannon, which requires that the mutual information between the plaintext and the ciphertext is zero, meaning the adversary gains no information about the message from the ciphertext alone. This condition holds if the key used for encryption is at least as long as the message and uniformly random, independent of the message, as demonstrated in the one-time pad scheme where the ciphertext is the modular sum of the message and key. In terms of entropy, perfect secrecy demands that the key entropy H(K) satisfies H(K) \geq H(M), where H(M) is the entropy of the message M, ensuring the key cannot be shorter than the message's uncertainty without compromising security. In noisy communication environments, the wiretap channel model extends these ideas to scenarios where an eavesdropper receives a degraded version of the signal. Aaron Wyner defined the secrecy capacity as the maximum rate at which a message can be reliably transmitted to the legitimate receiver while keeping it perfectly secret from the eavesdropper, given by C_s = \max_{p(x)} [I(X;Y) - I(X;Z)], where X is the input, Y the legitimate output, and Z the eavesdropper's output. For degraded channels, where the eavesdropper's channel is a degraded version of the main channel, this simplifies to C_s = C_{\text{main}} - C_{\text{wiretap}}, with C_{\text{main}} and C_{\text{wiretap}} being the capacities of the respective channels. Achieving this capacity involves stochastic encoding, where the transmitter adds randomness to confuse the eavesdropper without affecting the legitimate receiver's decoding, leveraging the difference in mutual informations. Authentication and key distribution in information-theoretic settings rely on mutual information measures to bound security, particularly in protocols that generate shared secrets over public channels. Ueli Maurer showed that parties observing correlated random variables can distill a secret key through public discussion, with the achievable key rate bounded by the mutual information between their observations minus the leakage to the adversary. In practical key distillation, information-theoretic bounds use the asymptotic equipartition property (AEP) to analyze randomness extractors, such as universal hash functions, which compress high-entropy sources into nearly uniform keys while preserving secrecy; the AEP ensures that typical sequences carry per-symbol information content close to the source entropy, enabling efficient extraction. These extractors guarantee that the length of the output key approaches the entropy of the input, providing strong security guarantees independent of computational assumptions. Steganography applies information theory to hide messages within cover media such that the presence of the hidden information is undetectable. Christian Cachin formalized this using the Kullback-Leibler divergence, defining perfect security as the case where the divergence between the distributions of cover and stego (hidden) objects is zero, ensuring the adversary cannot distinguish them better than random guessing. The embedding rate is limited by the KL divergence, as higher rates increase detectability; secure schemes minimize D(P_{\text{cover}} \| P_{\text{stego}}) through careful modification of the cover while preserving its statistical properties. Practical examples illustrate these principles.
In quantum key distribution, the BB84 protocol by Charles Bennett and Gilles Brassard enables information-theoretically secure key exchange using polarized photons, where error correction and privacy amplification, via hash-based extractors, distill a secret key from the shared quantum measurements, secure against any eavesdropper due to the quantum uncertainty principle and entropy extraction.
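A minimal sketch of the one-time pad illustrates the perfect-secrecy condition discussed above: with a uniformly random key as long as the message, every ciphertext is equally likely regardless of the plaintext. The function name and example message are illustrative assumptions.

```python
import secrets

def one_time_pad(message: bytes, key: bytes) -> bytes:
    """XOR cipher: encryption and decryption are the same operation."""
    assert len(key) == len(message), "the key must be as long as the message"
    return bytes(m ^ k for m, k in zip(message, key))

message = b"ATTACK AT DAWN"
key = secrets.token_bytes(len(message))      # uniformly random key, used once
ciphertext = one_time_pad(message, key)      # reveals nothing about the plaintext
recovered = one_time_pad(ciphertext, key)    # XOR with the same key decrypts
print(ciphertext.hex())
print(recovered)                             # b'ATTACK AT DAWN'
```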

Computing and Machine Learning

Information theory plays a pivotal role in the design and analysis of pseudorandom number generators (PRNGs), where information-theoretic tests evaluate the unpredictability and uniformity of generated sequences to ensure they mimic true randomness for cryptographic and simulation purposes. These tests, grounded in entropy measures, assess whether a PRNG's output distribution is indistinguishable from uniform by quantifying deviations in entropy or correlations between consecutive outputs. For instance, tests based on source coding compress PRNG outputs and compare the achieved compression rates against those expected from truly random sources, revealing biases if the empirical entropy falls short of the theoretical maximum. A key application in cryptography involves randomness extraction from weak random sources, where the leftover hash lemma provides a foundational result for converting sources with partial randomness into nearly uniform keys. The lemma states that, for a source with min-entropy k and a family of universal hash functions, hashing the source to \ell output bits yields a distribution whose statistical distance from uniform is at most on the order of 2^{-(k - \ell)/2}, enabling secure key derivation even from imperfect sources such as physical noise. This principle underpins privacy amplification in cryptographic protocols, ensuring that adversaries gain negligible information about the extracted bits. In machine learning, the information bottleneck (IB) principle formalizes the trade-off between data compression and predictive utility by seeking representations that minimize mutual information with the inputs while maximizing it with the outputs. Introduced as a method to extract relevant features through a compressed intermediate variable, IB minimizes the objective I(X; T) - \beta I(T; Y), where X is the input, T the compressed representation, Y the target, and \beta balances compression against relevance. Variational bounds approximate this non-convex optimization, enabling scalable training via neural networks that approximate the posterior p(t|x). This framework has influenced deep learning by promoting parsimonious models that discard irrelevant noise. Neural networks leverage information theory for representation learning, as seen in InfoGAN, which extends generative adversarial networks by maximizing the mutual information between latent codes and generated images to yield interpretable, disentangled factors. In InfoGAN, the objective augments the GAN loss with I(c; G(z, c)), where c is a subset of informative latent variables, estimated via a variational lower bound to encourage structured latent spaces without supervision. Similarly, rate-distortion theory informs autoencoders by framing encoding as a lossy compression problem, minimizing distortion d(x, \hat{x}) subject to a rate constraint on the mutual information I(x; z), where z is the latent code. This approach yields efficient embeddings for tasks like dimensionality reduction, with the rate-distortion function R(D) = \min I(X; Z) such that E[d(X, \hat{X})] \leq D. In data analysis, information measures enhance clustering by quantifying the uncertainty reduction achieved by partitioning data into groups. An information-theoretic perspective determines the optimal number of clusters by minimizing the total entropy across clusters while penalizing complexity, treating clustering as a balance between description length and fit, akin to the minimum description length principle. For example, the entropy of a clustering is H = -\sum_k p_k \log p_k + \sum_k p_k H_k, where p_k is the probability of cluster k and H_k its internal entropy, guiding algorithms to avoid over- or under-clustering.
Feature selection in high-dimensional settings employs conditional mutual information to identify variables that provide unique predictive power given others, selecting subsets that maximize I(f; y | S), where f is a candidate feature, y the target, and S the already-selected set, mitigating redundancy and improving model efficiency. Recent advances as of 2025 integrate information theory with generative models, particularly diffusion models, where the forward and reverse processes are analyzed through information flows to bound sample quality and training efficiency. In information-theoretic frameworks, the score function aligns with minimum mean square error estimation under Gaussian noise, revealing that diffusion paths degrade and restore information in a manner akin to rate-distortion curves during reverse denoising. For large language models (LLMs), compression bounds limit training efficacy: an "entropy law" links first-epoch loss to the compressibility of the training data, suggesting that LLM performance scales with the negative logarithm of the compression ratios the model achieves on training corpora. This implies fundamental ceilings on generalization from finite datasets, where the achievable loss cannot drop below the entropy of the underlying data source.
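The greedy use of conditional mutual information for feature selection can be demonstrated on a toy dataset; in the sketch below the helper names, the dataset (a relevant feature, an irrelevant one, and a redundant copy), and the stopping rule are all illustrative assumptions.

```python
import math
from collections import Counter

def entropy_of(samples):
    """Empirical entropy in bits of a list of hashable observations."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def conditional_mi(f, y, s):
    """Empirical I(f; y | s) = H(f,s) + H(y,s) - H(f,y,s) - H(s) over parallel sample lists."""
    return (entropy_of(list(zip(f, s))) + entropy_of(list(zip(y, s)))
            - entropy_of(list(zip(f, y, s))) - entropy_of(s))

# Toy dataset: column 0 is relevant (y copies it), column 1 is irrelevant noise,
# column 2 is a redundant copy of column 0.
X = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
y = [row[0] for row in X]

selected, remaining = [], [0, 1, 2]
while remaining:
    context = [tuple(row[i] for i in selected) for row in X]
    scores = {j: conditional_mi([row[j] for row in X], y, context) for j in remaining}
    best = max(scores, key=scores.get)
    if scores[best] <= 1e-12:
        break                      # no remaining feature adds predictive information
    selected.append(best)
    remaining.remove(best)
    print(scores, "-> pick feature", best)
print("selected features:", selected)   # [0]: the redundant copy is never chosen
```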
