
Information theory

Information theory is a mathematical discipline that quantifies, stores, and communicates information, focusing on the fundamental limits of data transmission and processing in the presence of noise. Founded by Claude E. Shannon through his seminal 1948 paper "A Mathematical Theory of Communication", it treats information as a probabilistic entity, independent of its semantic meaning, and emphasizes the engineering aspects of reliable message reproduction across communication systems. At its core, information theory introduces entropy as the measure of uncertainty or average information content in a random source, defined for a probability distribution as H = -\sum p_i \log_2 p_i bits, where p_i is the probability of each outcome. This concept underpins source coding theorems, which establish that data can be compressed to a length approaching the source entropy without loss, enabling efficient storage and transmission. Shannon's work also defines channel capacity as the supremum of mutual information over all input distributions, representing the highest rate at which information can be sent reliably over a noisy channel, as formalized in his noisy-channel coding theorem. Key extensions include mutual information, which quantifies the shared information between two variables as I(X;Y) = H(X) + H(Y) - H(X,Y), essential for analyzing dependencies in communication and data processing. The theory's impact spans telecommunications, where it optimizes signal encoding and error correction; computer science, influencing algorithms for data compression and machine learning; and broader fields such as biology, physics, and neuroscience, where entropy models uncertainty in genetic codes, thermodynamic systems, and neural networks. Shannon's adoption of the binary digit (bit) as the unit of information laid the groundwork for the digital age, transforming telecommunications and enabling modern electronics.

Introduction and History

Overview

Information theory is a branch of applied mathematics and electrical engineering primarily concerned with the quantification, storage, and communication of information. It was pioneered by Claude E. Shannon in his seminal 1948 paper, which established a rigorous mathematical framework for analyzing information processes. The field focuses on understanding how information can be measured in probabilistic terms, independent of its semantic meaning, to enable efficient handling in technical systems. The core problems addressed by information theory include quantifying the uncertainty inherent in messages or random events, devising efficient encoding techniques to minimize redundancy for storage and transmission, and designing methods to achieve reliable communication despite noise or interference in channels. These challenges are foundational to modern digital systems, where resources like bandwidth and storage are limited. For instance, information is conceptually viewed as the reduction of uncertainty: observing the outcome of a fair coin flip eliminates one bit of uncertainty, whereas specifying a result from a fair six-sided die requires log₂(6) ≈ 2.58 bits on average. Beyond its origins in engineering, information theory has exerted significant influence across interdisciplinary domains, including computer science for algorithm design, biology for modeling genetic information flows, and physics for connections to thermodynamics. Central to the field is entropy as a measure of uncertainty, with channel capacity defining the maximum rate of reliable transmission.

Historical Development

The origins of information theory trace back to early 20th-century efforts to quantify signal transmission in telegraphy. In 1924, Harry Nyquist published "Certain Factors Affecting Telegraph Speed" in the Bell System Technical Journal, where he derived a fundamental result establishing that a signal's maximum transmission rate is limited by twice the channel bandwidth, providing a foundational measure of the amount of information that could be sent over a telegraph channel without distortion and later contributing to the Nyquist–Shannon sampling theorem. Building on this, Ralph Hartley introduced a quantitative measure of information in his 1928 paper "Transmission of Information" in the same journal, defining information as the logarithm of the number of possible distinguishable symbols, which laid the groundwork for logarithmic measures of uncertainty in communication systems. The field crystallized with Claude Shannon's seminal 1948 paper "A Mathematical Theory of Communication," published in the Bell System Technical Journal, which formalized information theory by introducing concepts like entropy as a measure of uncertainty and channel capacity as the maximum reliable transmission rate over noisy channels, fundamentally shaping modern digital communication. This work was popularized and extended beyond engineering to broader scientific audiences through Warren Weaver's collaboration in the 1949 book The Mathematical Theory of Communication, which emphasized the theory's implications for semantics and human communication. In the 1960s, information theory expanded into computational foundations with the development of algorithmic information theory. Ray Solomonoff's 1960 technical report "A Preliminary Report on a General Theory of Inductive Inference" proposed measuring an object's information content by the length of the shortest program that generates it, formalizing inductive inference through universal priors. Andrey Kolmogorov independently advanced this in his 1965 paper "Three Approaches to the Quantitative Definition of Information" in Problems of Information Transmission, as did Gregory Chaitin in his 1966 paper "On the Length of Programs for Computing Finite Binary Sequences" in the Journal of the ACM, defining complexity as the minimal program length for computation and providing a non-probabilistic, algorithmic basis for information that influenced studies of randomness and computability. Quantum extensions emerged in the 1970s, with Alexander Holevo's 1973 paper "Bounds for the Quantity of Information Transmittable by a Quantum Communication Channel" in Problems of Information Transmission establishing the Holevo bound, which limits the classical information transmittable through quantum channels and helped found quantum information theory. By 2000, Michael Nielsen and Isaac Chuang's textbook systematized quantum entropy and related measures, integrating classical information theory with quantum mechanics for applications in quantum computing. In the late 1990s and 2000s, information theory intersected with machine learning, notably through Naftali Tishby's 1999 introduction of the information bottleneck method in the Proceedings of the 37th Allerton Conference, which optimizes data compression for relevant predictions by balancing information retention and compression, later applied to understand representation learning in neural networks. This integration continued into the 2020s, with information-theoretic tools analyzing training dynamics, such as information flows in neural architectures, as explored in recent works unifying Bayesian statistics and Shannon entropy in frameworks published up to 2025.

Fundamental Quantities

Entropy

In information theory, entropy quantifies the average uncertainty or information content associated with a discrete random variable. For a discrete random variable X taking values in a finite alphabet \mathcal{X} with probability mass function p(x) = \Pr(X = x) for x \in \mathcal{X}, the entropy H(X) is defined as H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x), where the logarithm is base 2, yielding a measure in bits. This formula, introduced by Shannon, represents the expected value of the self-information of a single outcome, where the self-information of an event with probability p(x) is -\log_2 p(x), the number of bits needed to specify that outcome. Entropy serves as the fundamental limit on the average number of bits required to encode the outcomes of X without loss of information, as established in lossless source coding theorems. For instance, consider a fair coin flip, where X is uniform with p(0) = p(1) = 1/2; then H(X) = 1 bit, indicating one bit suffices to describe the outcome on average. For a biased coin with p(1) = 0.9 and p(0) = 0.1, H(X) = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.469 bits, reflecting lower uncertainty due to the bias. The binary entropy function h(p), a special case for Bernoulli random variables with success probability p, is given by h(p) = -p \log_2 p - (1-p) \log_2 (1-p), which reaches its maximum of 1 bit at p = 0.5 and approaches 0 as p nears 0 or 1. More generally, entropy can be expressed in other units: nats using the natural logarithm (values in nats convert to bits upon division by \ln 2) or hartleys using the base-10 logarithm, though bits are standard in digital contexts for aligning with binary encoding. Key properties of entropy include non-negativity, H(X) \geq 0, with equality if and only if X is deterministic (one outcome has probability 1); this follows because each term -\log_2 p(x) is non-negative when p(x) \leq 1. Entropy is maximized by the uniform distribution over \mathcal{X}, where H(X) = \log_2 |\mathcal{X}|, providing an upper bound for a given alphabet size. For independent random variables X and Y, entropy is additive: H(X,Y) = H(X) + H(Y). Shannon derived the entropy function from a set of axioms in his seminal 1948 paper: the measure must be continuous in the probabilities, monotonically increasing in the number of equally likely outcomes, and decomposable so that the uncertainty of a compound choice is the weighted sum of the uncertainties of its successive sub-choices. Solving these yields H(X) = -K \sum p(x) \log p(x) for some positive constant K, with the constant chosen so that the unit is the bit; the proof shows that the logarithmic form satisfies the axioms and is unique up to this constant.
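As a quick numerical check of these definitions, the short Python sketch below evaluates H(X) directly from a probability vector; the helper name `entropy` and the example distributions are illustrative choices rather than part of any standard library.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum p log p, skipping zero-probability outcomes."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits
print(entropy([1/6] * 6))    # fair die: log2(6) ~ 2.585 bits
```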

Joint and Conditional Entropy

Joint entropy extends the concept of entropy to two or more random variables, quantifying the total uncertainty in their joint occurrence. For discrete random variables X and Y with joint probability mass function p(x,y), the joint entropy H(X,Y) is defined as H(X,Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y), which represents the average number of bits needed to specify both X and Y together. This measure averages the self-information over the joint distribution, capturing dependencies between the variables. A key property of joint entropy is subadditivity: H(X,Y) \leq H(X) + H(Y), with equality holding if and only if X and Y are independent. Additionally, H(X,Y) \geq \max(H(X), H(Y)), reflecting that the combined uncertainty cannot be less than the uncertainty of either variable alone. The chain rule provides a decomposition: H(X,Y) = H(X) + H(Y|X), linking joint entropy to conditional entropy and enabling recursive extensions to more variables, such as H(X_1, \dots, X_n) = \sum_{i=1}^n H(X_i | X_1, \dots, X_{i-1}). Conditional entropy H(Y|X) measures the average remaining uncertainty about Y after observing X. It is formally defined as H(Y|X) = -\sum_{x,y} p(x,y) \log_2 p(y|x) = H(X,Y) - H(X), where p(y|x) = p(x,y)/p(x) is the conditional probability mass function. This quantity is interpreted as the expected entropy of Y given each possible value of X, weighted by the probability of X. Properties of conditional entropy include non-negativity: H(Y|X) \geq 0, with equality if Y is a deterministic function of X. Furthermore, conditioning cannot increase entropy: H(Y|X) \leq H(Y), with equality if and only if X and Y are independent, indicating that knowledge of X either reduces or leaves unchanged the uncertainty in Y. As an example, consider two fair coin flips X and Y. If independent, the joint entropy H(X,Y) = 2 bits, equal to H(X) + H(Y) = 1 + 1. If Y = X (perfect dependence), then H(X,Y) = 1 bit, which is \max(H(X), H(Y)). For conditional entropy, suppose Y is the next English letter after observing the previous letter X; conditioning on X reduces the entropy from the unconditional H(Y) \approx 4.1 bits to roughly 3.3 bits, and conditioning on long contexts pushes the per-letter entropy toward about 1.3 bits, due to contextual dependencies.
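The chain rule and the coin-flip examples can be verified numerically; the sketch below, using hypothetical helper names and toy joint distributions, computes H(X,Y) and H(Y|X) = H(X,Y) - H(X) by brute force.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def joint_entropy(joint):            # joint: dict mapping (x, y) -> probability
    return H(joint.values())

def conditional_entropy(joint):      # H(Y|X) = H(X,Y) - H(X), via the chain rule
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return joint_entropy(joint) - H(px.values())

independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}   # two fair coins
dependent = {(0, 0): 0.5, (1, 1): 0.5}                         # Y = X

print(joint_entropy(independent), conditional_entropy(independent))  # 2.0, 1.0
print(joint_entropy(dependent), conditional_entropy(dependent))      # 1.0, 0.0
```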

Mutual Information

Mutual information quantifies the amount of information that one random variable contains about another, serving as a measure of their statistical dependence. Introduced by Shannon in his foundational work on communication theory, it captures the shared information between two variables without assuming a specific form of relationship, such as linearity. Formally, for discrete random variables X and Y with joint distribution p(x,y), the mutual information I(X;Y) is defined as the difference between the entropy of X and the conditional entropy of X given Y: I(X;Y) = H(X) - H(X|Y). This equals the Kullback-Leibler divergence between the joint distribution and the product of the marginals: I(X;Y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}, where the logarithm is base 2 for units in bits. The measure is symmetric, such that I(X;Y) = I(Y;X), reflecting that the information Y provides about X is the same as the information X provides about Y. Mutual information possesses several key properties. It is non-negative, I(X;Y) \geq 0, with equality holding if and only if X and Y are independent, meaning knowledge of one provides no information about the other. Additionally, it is bounded above by the minimum of the individual entropies: I(X;Y) \leq \min(H(X), H(Y)), ensuring it cannot exceed the inherent uncertainty in either variable. In interpretive terms, mutual information represents the average reduction in uncertainty about X upon observing Y, or equivalently, the number of bits of information shared between the two variables. This makes it a fundamental tool for assessing dependence in probabilistic systems, distinct from joint and conditional entropy, which measure absolute or context-dependent uncertainties. A chain rule extends mutual information to multiple variables: for random variables X_1, X_2, \dots, X_n and Y, I(X_1, X_2, \dots, X_n; Y) = I(X_1; Y) + I(X_2; Y | X_1) + \cdots + I(X_n; Y | X_1, \dots, X_{n-1}), allowing decomposition of total dependence into conditional contributions. Examples illustrate these concepts clearly. If X and Y are independent, then p(x,y) = p(x)p(y), so I(X;Y) = 0. Conversely, if Y = X (identical variables), then H(X|Y) = 0, yielding I(X;Y) = H(X), the full entropy of X.
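The identity I(X;Y) = H(X) + H(Y) - H(X,Y) lends itself to direct computation; the sketch below, with illustrative helper names and toy joint distributions, evaluates it for a noisy copy of a fair bit and for two independent bits.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a dict mapping (x, y) -> probability."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return H(px.values()) + H(py.values()) - H(joint.values())

# Y is X passed through a binary symmetric channel with crossover probability 0.1
eps = 0.1
noisy_copy = {(0, 0): 0.5 * (1 - eps), (0, 1): 0.5 * eps,
              (1, 0): 0.5 * eps,       (1, 1): 0.5 * (1 - eps)}
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

print(mutual_information(noisy_copy))    # ~0.531 bits = 1 - h(0.1)
print(mutual_information(independent))   # 0.0 bits
```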

Divergence Measures

Divergence measures in information theory quantify the difference between two probability distributions, with the Kullback-Leibler (KL) divergence serving as a foundational example. For discrete probability distributions P and Q over the same alphabet, the KL divergence is defined as \mathrm{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}, where the logarithm is typically base 2 for bits or natural for nats. This measure is always non-negative, \mathrm{KL}(P \parallel Q) \geq 0, and equals zero if and only if P = Q, a property established via Jensen's inequality. For continuous distributions, the sum is replaced by an integral: \mathrm{KL}(P \parallel Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx. The KL divergence exhibits key properties that distinguish it from metric distances. It is asymmetric, meaning \mathrm{KL}(P \parallel Q) \neq \mathrm{KL}(Q \parallel P) in general, reflecting its directed nature from P to Q. Additionally, it does not satisfy the triangle inequality, preventing it from forming a true metric on the space of probability distributions. Introduced by Solomon Kullback and Richard Leibler in 1951, the KL divergence originated in the context of statistical hypothesis testing and information sufficiency. Interpretationally, \mathrm{KL}(P \parallel Q) represents the expected number of extra bits required to encode samples drawn from P using a code optimized for Q, rather than one optimal for P. Equivalently, it measures the information gain obtained when using the true distribution P instead of the approximation Q. This connects to mutual information, where I(X;Y) = \mathbb{E}_{Y} \left[ \mathrm{KL}(P_{X|Y} \parallel P_X) \right], quantifying the average divergence between the conditional and marginal distributions of X. A concrete example arises with Bernoulli distributions, where P \sim \mathrm{Bern}(p) and Q \sim \mathrm{Bern}(q). The KL divergence simplifies to \mathrm{KL}(P \parallel Q) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}, illustrating how it penalizes mismatches in success probabilities; for instance, with p=0.5 and q=0.6, \mathrm{KL}(P \parallel Q) \approx 0.029 bits. In applications like model selection, the KL divergence underpins criteria such as the Akaike information criterion (AIC), which estimates the expected relative KL divergence between the true data-generating distribution and a fitted model to balance goodness-of-fit and complexity.
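The Bernoulli example above, and the asymmetry of the measure, can be checked with a few lines of Python; the helper name and the probability values are illustrative.

```python
import math

def kl_divergence(p_dist, q_dist):
    """KL(P || Q) in bits for discrete distributions given as aligned probability lists."""
    return sum(p * math.log2(p / q) for p, q in zip(p_dist, q_dist) if p > 0)

# Bernoulli example from the text: p = 0.5 versus q = 0.6
print(kl_divergence([0.5, 0.5], [0.6, 0.4]))   # ~0.029 bits

# Asymmetry: swapping the roles of P and Q generally changes the value
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # ~0.531 bits
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # ~0.737 bits
```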

Source Coding

Compression Fundamentals

Source coding theory addresses the compression of data from information sources without loss of information, establishing fundamental limits on achievable compression rates. The source coding theorem, also known as the noiseless coding theorem, states that for a discrete memoryless source producing symbols from alphabet \mathcal{X} with probability distribution p(x), the optimal average codeword length L for any uniquely decodable code satisfies L \geq H(X), where H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) is the entropy of the source. This bound is achievable in the limit as the block length increases, meaning sequences of source symbols can be compressed to rates arbitrarily close to the entropy using block codes that are uniquely decodable. The theorem implies that the minimum compression rate R required for reliable lossless encoding of the source is R \geq H(X) bits per symbol, ensuring that the encoded representation captures the inherent uncertainty of the source without redundancy beyond this limit. For sources with entropy lower than the logarithm of the alphabet size, significant compression is possible by exploiting statistical structure, as entropy quantifies the average surprise or unpredictability per symbol. Huffman coding provides an optimal prefix-free code for discrete memoryless sources with known symbol probabilities, achieving an average code length within 1 bit of the entropy bound. The algorithm proceeds by first sorting the symbols in decreasing order of their probabilities. It then iteratively builds a binary tree: the two symbols (or subtrees) with the lowest probabilities are combined into a new node with probability equal to their sum, assigning 0 and 1 to the branches, and this process repeats until a single root remains. The codewords are the paths from root to leaves, ensuring no codeword is a prefix of another and minimizing the weighted path length, which equals the average code length. Arithmetic coding offers superior efficiency for memoryless sources by encoding entire messages as fractions of the unit interval, rather than assigning an integer-length codeword to each symbol, approaching the entropy bound more closely without the integer-length overhead of Huffman coding. In this method, the interval [0,1) is subdivided according to symbol probabilities; each symbol narrows the current interval proportionally, and the final interval's binary representation (or a tag within it) serves as the code, with the decoder reversing the process to recover the symbols. This fractional encoding allows the code length to be approximately -\log_2 P(m) bits for message m with probability P(m), yielding near-entropy performance even for short messages. A practical example is the compression of English text, where the entropy is approximately 1 to 1.5 bits per character due to letter frequencies and contextual redundancies in the language, far below the 4.7 bits per character (log₂ 26 for the lowercase alphabet) of uniform encoding. This redundancy, arising from predictable patterns like common digrams and grammatical structure, enables compression ratios of about 3:1 to 5:1 using entropy-based methods, as demonstrated in early human prediction experiments that estimated the per-character information content of English.
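To make the tree-building procedure concrete, the following Python sketch implements a minimal Huffman coder with a heap; the function name `huffman_code` and the dyadic example probabilities are illustrative assumptions, chosen so that the resulting average length coincides exactly with the source entropy.

```python
import heapq, math

def huffman_code(probs):
    """Minimal Huffman coder: probs maps symbol -> probability; returns symbol -> codeword."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)                      # tie-breaker so dicts are never compared
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)      # two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}    # dyadic probabilities
code = huffman_code(probs)
avg_len = sum(probs[s] * len(w) for s, w in code.items())
source_entropy = -sum(p * math.log2(p) for p in probs.values())
print(code)                          # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(avg_len, source_entropy)       # both 1.75 bits per symbol
```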

Practical Coding Techniques

Variable-length codes assign shorter codewords to more probable symbols and longer ones to less probable symbols, enabling efficient compression of sources with uneven symbol probabilities. One early method is Shannon-Fano coding, which constructs codes by recursively splitting the symbol set into two subsets of approximately equal total probability, assigning binary digits accordingly. This top-down approach ensures the code lengths are close to the negative logarithm of the probabilities, though it may not always yield optimal lengths. In contrast, Huffman coding achieves optimality among prefix-free codes over a fixed alphabet by building a code tree bottom-up, merging the two least probable symbols iteratively to form codewords whose lengths satisfy the Kraft inequality with equality. While Shannon-Fano coding is conceptually simpler, Huffman codes generally produce shorter average lengths, making them preferable for static sources where the full probability distribution is known in advance. Both methods approach the entropy bound for large alphabets but incur redundancy for small ones due to integer code lengths. For universal compression of unknown or streaming data, the Lempel-Ziv-Welch (LZW) algorithm builds a dictionary adaptively by replacing repeated substrings with codes referencing prior occurrences. Starting from an initial dictionary of single symbols, LZW parses the input into phrases, outputs the code for the longest matching prefix, and adds the extension to the dictionary, enabling real-time learning without prior statistics. This variant of the LZ78 scheme excels on repetitive data, achieving compression ratios approaching the entropy rate asymptotically, and is widely used in formats such as GIF images and compressed archives for its simplicity and effectiveness on text and graphics. Lossy compression techniques sacrifice fidelity for higher compression, guided by rate-distortion theory, which quantifies the minimum rate R(D) needed to represent a source at average distortion no greater than D. For a source X and reproduction \hat{X} with distortion E[d(X, \hat{X})] \leq D, R(D) = \min_{p(\hat{x}|x): E[d(X,\hat{X})] \leq D} I(X; \hat{X}), where the mutual information I(X; \hat{X}) captures the information preserved in the approximation. Practical source-side methods focus on transforming the signal to concentrate energy in a few coefficients, quantizing them according to the allowed distortion, and entropy-coding the result, balancing rate against perceptual quality without channel considerations. Adaptive coding addresses non-stationary sources by updating models on-the-fly, avoiding assumptions of fixed statistics. Prediction by partial matching (PPM) exemplifies this by estimating symbol probabilities based on increasingly longer contexts from recent history, escaping to lower-order models when unseen patterns arise to avoid assigning zero probabilities. This hierarchical approach achieves superior compression on text or mixed data by capturing evolving dependencies, often outperforming static Huffman coding when paired with an adaptive arithmetic coder. A representative application is the JPEG standard for image compression, which applies the discrete cosine transform (DCT) to 8x8 blocks to decorrelate pixels based on spatial statistics, yielding coefficients that are quantized and Huffman-coded. By exploiting human visual sensitivity, retaining low-frequency detail while discarding high-frequency detail, JPEG trades minor artifacts for compression ratios of 10:1 to 20:1 on typical photos, though computational costs in the DCT and quantization rise with image resolution. These techniques highlight the tension between achieving near-theoretical rates and practical constraints like computational simplicity.
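A minimal sketch of the LZW parsing loop is shown below; the function name `lzw_compress`, the choice of starting dictionary, and the toy input are assumptions made for illustration, and practical implementations add details such as fixed code widths and dictionary resets.

```python
def lzw_compress(data: str):
    """Minimal LZW encoder sketch: returns a list of dictionary indices."""
    # Start with single-character entries; here the alphabet is whatever appears in the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(data)))}
    next_code = len(dictionary)
    phrase, output = "", []
    for ch in data:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate                    # extend the current match
        else:
            output.append(dictionary[phrase])     # emit code for longest matching prefix
            dictionary[candidate] = next_code     # learn the new phrase
            next_code += 1
            phrase = ch
    if phrase:
        output.append(dictionary[phrase])
    return output, dictionary

codes, _ = lzw_compress("abababababab")
print(codes)   # 12 input characters collapse to 6 codes as repeated phrases are learned
```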

Channel Coding

Capacity and Noisy Channels

In information theory, a discrete memoryless channel (DMC) is modeled as a probabilistic mapping from input symbols X drawn from a finite alphabet \mathcal{X} to output symbols Y from a finite alphabet \mathcal{Y}, characterized by conditional transition probabilities p(y|x) that specify the probability of receiving y given that x was transmitted, with outputs independent across time slots given the inputs. The capacity C of such a channel represents the supremum of rates at which information can be reliably transmitted and is defined as the maximum mutual information between input and output over all possible input distributions: C = \max_{p(x)} I(X; Y), where I(X; Y) = H(Y) - H(Y|X) quantifies the reduction in uncertainty about Y provided by knowledge of X. Shannon's noisy-channel coding theorem establishes the fundamental limits of reliable communication over such channels: for any rate R < C, there exists a coding scheme achieving arbitrarily low probability of error as the block length n \to \infty; conversely, for R > C, the error probability is bounded away from zero. A sketch of the achievability proof relies on random coding and typical set decoding. In random coding, 2^{nR} codewords are generated independently, one for each message, with each symbol drawn i.i.d. from the capacity-achieving input distribution p(x); the average error probability over this ensemble vanishes as n \to \infty for R < I(X; Y), implying the existence of at least one good code by an averaging argument. For decoding, the receiver selects the unique codeword jointly typical with the received sequence y^n, where a pair (x^n, y^n) is jointly typical if its empirical statistics are within \epsilon of the true joint distribution; with high probability the correct codeword is decoded, because the probability mass concentrates on the jointly typical set of size approximately 2^{n H(X,Y)}. For the binary symmetric channel (BSC), where \mathcal{X} = \mathcal{Y} = \{0,1\} and bits flip with probability p < 1/2, the capacity-achieving input distribution is uniform, yielding C = 1 - h_2(p), with h_2(p) = -p \log_2 p - (1-p) \log_2 (1-p) the binary entropy function, so C = 1 bit per use when p=0 (noiseless) and decreases to 0 as p \to 1/2. In the continuous-time setting with additive white Gaussian noise (AWGN), the channel adds independent Gaussian noise to the transmitted signal under a power constraint; discretizing into 2W dimensions per second for bandwidth W, the capacity per dimension is \frac{1}{2} \log_2 (1 + \mathrm{SNR}) bits, where SNR is the signal-to-noise ratio P/N with transmit power P and noise power N per dimension. The capacity C is interpreted as the supremum of reliable transmission rates in bits per channel use, beyond which noise fundamentally prevents error-free communication, even with optimal encoding and decoding.
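The closed-form capacities quoted above are easy to evaluate numerically; the sketch below, with illustrative function names, computes the BSC capacity 1 - h_2(p) and the per-dimension AWGN capacity \frac{1}{2}\log_2(1 + \mathrm{SNR}).

```python
import math

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of the binary symmetric channel: C = 1 - h2(p) bits per channel use."""
    return 1.0 - binary_entropy(p)

def awgn_capacity_per_dimension(snr):
    """C = 0.5 * log2(1 + SNR) bits per real dimension."""
    return 0.5 * math.log2(1.0 + snr)

print(bsc_capacity(0.0))                  # 1.0 bit per use (noiseless)
print(bsc_capacity(0.11))                 # ~0.5 bits per use
print(bsc_capacity(0.5))                  # 0.0 (output independent of input)
print(awgn_capacity_per_dimension(10.0))  # ~1.73 bits per dimension at SNR = 10
```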

Specific Channel Models

The binary erasure channel (BEC) models a communication channel where each transmitted bit is received correctly with probability 1 - \alpha or is replaced by an erasure symbol with probability \alpha, allowing the receiver to detect but not recover erased bits. Introduced by Peter Elias in his foundational work on coding for noisy channels, the BEC simplifies analysis by separating erasures from bit errors. The capacity of the BEC is C = 1 - \alpha bits per channel use, derived as the maximum mutual information I(X;Y) over input distributions, since non-erased outputs reveal the input exactly while erasures carry no information about it. This capacity is achievable with low-complexity codes like parity-check or low-density parity-check (LDPC) codes that recover erasures reliably, as demonstrated by code sequences approaching the capacity limit under maximum a posteriori decoding. Asymmetric binary channels, such as the Z-channel, depart from this symmetry by introducing biased error patterns in which one input (typically 0) is received error-free, while the other (1) flips to 0 with probability \alpha and stays 1 otherwise. The Z-channel capacity requires numerical optimization over the input distribution p = P(X=1) and is given by C = \max_p [h(\beta) - p h(\alpha)], where \beta = p(1 - \alpha) is the probability of output 1 and h(\cdot) denotes the binary entropy function; this formula arises from maximizing I(X;Y) = h(Y) - h(Y|X), with h(Y|X) = p h(\alpha). Coding strategies for the Z-channel often involve asymmetric error-correcting codes, such as repeat-accumulate ensembles, which approach capacity through iterative decoding tailored to the error asymmetry. Similar optimization techniques apply to other asymmetric binary channels, such as the binary asymmetric channel with different error probabilities for each input, whose capacities follow from numerical maximization of the mutual information. For continuous-time channels, the additive white Gaussian noise (AWGN) channel under an average power constraint P represents a fundamental model, with C = W \log_2 \left(1 + \frac{P}{N_0 W}\right) bits per second for bandwidth W Hz, where N_0 is the noise power spectral density; this seminal result, established by Shannon, is achieved by a Gaussian input distribution that maximizes the mutual information. In frequency-selective fading, parallel Gaussian channels with varying noise levels \sigma_i^2 and total power constraint \sum P_i \leq P achieve capacity via the water-filling algorithm, allocating P_i = \max(\mu - \sigma_i^2, 0) such that \sum P_i = P, where \mu is chosen to satisfy the constraint, leading to C = \sum \frac{1}{2} \log_2 \left(1 + \frac{P_i}{\sigma_i^2}\right). This allocation prioritizes lower-noise subchannels, enhancing throughput in multicarrier systems like OFDM. Multiple-input multiple-output (MIMO) channels generalize the Gaussian model to systems with n_t transmit and n_r receive antennas, where the received signal is \mathbf{Y} = \mathbf{H} \mathbf{X} + \mathbf{Z}, with channel matrix \mathbf{H}, input covariance \boldsymbol{\Sigma} under trace constraint \operatorname{tr}(\boldsymbol{\Sigma}) \leq P, and \mathbf{Z} \sim \mathcal{N}(0, N_0 \mathbf{I}). The ergodic capacity is C = \max_{\boldsymbol{\Sigma}} \mathbb{E} \left[ \log_2 \det \left( \mathbf{I} + \frac{\mathbf{H} \boldsymbol{\Sigma} \mathbf{H}^H}{N_0} \right) \right] bits per channel use, achieved with Gaussian inputs; for fixed \mathbf{H}, it simplifies to \log_2 \det \left( \mathbf{I} + \frac{\mathbf{H} \boldsymbol{\Sigma} \mathbf{H}^H}{N_0} \right). This formula, derived by Telatar and independently by Foschini, highlights multiplexing gains scaling with \min(n_t, n_r), enabling high-throughput wireless systems. Erasure and deletion channels pose greater challenges than the BEC, as deletions remove symbols without explicit indication, complicating synchronization and decoding.
For the binary deletion channel with deletion probability \delta, the capacity remains an open problem: it is upper bounded by the erasure-channel capacity 1 - \delta, lower bounded by constructive arguments achieving rates close to 1 - h(\delta) for small \delta, and exact capacity expressions are unknown except in limiting regimes. The Varshamov-Tenengolts (VT) codes, introduced for single asymmetric errors but adaptable to single deletions, provide near-optimal constructions with rate \approx 1 - \frac{\log n}{n} for block length n, using a check sum \sum_i i x_i \equiv 0 \pmod{n+1} to locate the deletion. For multiple deletions, bounds rely on combinatorial arguments, with capacity for small \delta approaching 1 - h(\delta) \approx 1 - \delta \log_2 (1/\delta). These channel models inform practical system design, such as in modems, where BEC-like erasure handling via erasure-correcting codes improves reliability over noisy links, and water-filling optimizes power allocation in DSL modems to combat frequency-dependent attenuation. In 5G New Radio (NR) standards, channel models from TR 38.901 incorporate clustered delay line (CDL) and tapped delay line (TDL) profiles for frequencies up to 100 GHz, enabling capacity computations under spatial consistency and blockage modeling, with implications for enhanced mobile broadband (eMBB) achieving up to 20 Gbps peak rates.
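As an illustration of the water-filling allocation described above, the following sketch finds the water level \mu by bisection for a few hypothetical subchannel noise levels and a hypothetical power budget; the function name and the numbers are assumptions, not part of any standard.

```python
import math

def water_filling(noise_vars, total_power, iterations=100):
    """Allocate P_i = max(mu - sigma_i^2, 0) across parallel Gaussian subchannels,
    finding the water level mu by bisection so that sum(P_i) = total_power."""
    lo, hi = min(noise_vars), max(noise_vars) + total_power
    for _ in range(iterations):
        mu = 0.5 * (lo + hi)
        if sum(max(mu - nv, 0.0) for nv in noise_vars) > total_power:
            hi = mu
        else:
            lo = mu
    powers = [max(mu - nv, 0.0) for nv in noise_vars]
    capacity = sum(0.5 * math.log2(1 + p / nv) for p, nv in zip(powers, noise_vars))
    return powers, capacity

# Hypothetical example: three subchannels with noise variances 1, 4, 9 and power budget 10
powers, cap = water_filling([1.0, 4.0, 9.0], 10.0)
print(powers)   # ~[6.5, 3.5, 0.0]: the noisiest subchannel is left unused
print(cap)      # ~1.91 bits per channel use in total
```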

Advanced Concepts

Directed Information

Directed information generalizes mutual information to processes exhibiting temporal dependencies, providing a measure of causal information flow from one process to another. Formalized by James L. Massey in 1990, building on ideas from Hans Marko's bidirectional communication theory, it addresses scenarios where influence flows asymmetrically over time, such as in communication channels with memory or feedback, where traditional mutual information fails to capture directionality. This measure is particularly valuable in analyzing systems where past outputs influence future inputs, enabling the quantification of causal influence in non-stationary or sequential settings. The directed information from an input process X^n = (X_1, \dots, X_n) to an output process Y^n = (Y_1, \dots, Y_n) is formally defined as I(X^n \to Y^n) = \sum_{t=1}^n I(X^t; Y_t \mid Y^{t-1}), where I(\cdot; \cdot \mid \cdot) denotes conditional mutual information, X^t = (X_1, \dots, X_t), Y^{t-1} = (Y_1, \dots, Y_{t-1}) (with Y^0 empty), and the sum aggregates the incremental information the input prefix contributes to each output symbol given the prior output history. This formulation emphasizes causality by conditioning only on past outputs, avoiding non-causal dependencies on future values. The directed information rate is then \bar{I}(X \to Y) = \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n), assuming the limit exists. For channels with memory, where the output at each time depends on the entire input history and possibly past outputs, the capacity is characterized using the directed information rate. Specifically, the capacity C is given by C = \sup \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n), where the supremum is taken over all causal input distributions P_{X^n \| Y^{n-1}} that respect the channel's structure. This multi-letter expression accounts for temporal correlations, making it suitable for channels with inter-symbol interference or additive noise with memory. Key properties of directed information include non-negativity and a data-processing inequality, I(X^n \to Z^n) \leq I(X^n \to Y^n) when Z^n is obtained by causally processing Y^n. It reduces to the standard mutual information I(X^n; Y^n) when the channel is used without feedback, that is, when the inputs do not depend on past outputs. Unlike mutual information, directed information is asymmetric, I(X^n \to Y^n) \neq I(Y^n \to X^n) in general, which explicitly encodes the direction of information flow and distinguishes causal influences. The conditioning on past outputs parallels Granger-style causal analysis of time series, measuring the reduction in uncertainty about Y_t given its own past Y^{t-1}. In applications to feedback channels, where the input X_t may depend on previous outputs Y^{t-1}, directed information provides the appropriate capacity characterization, since ordinary mutual information implicitly assumes non-causal access to the inputs. The feedback capacity satisfies C_{\text{feedback}} \geq C_{\text{no feedback}}, with equality for memoryless channels but a strict increase possible for channels with memory. A notable example is the Schalkwijk-Kailath scheme for the AWGN channel with noiseless feedback, which achieves capacity through iterative linear encoding that refines message estimates, demonstrating practical gains in error exponents even though feedback does not increase the asymptotic capacity of this memoryless channel. This scheme highlights how directed information guides optimal coding strategies in feedback scenarios by focusing on causal contributions. Examples of directed information computation arise in Markovian settings, where sources or channels follow a Markov chain, simplifying the conditional terms in the sum. For a Markov source driving a Markov channel, the directed information rate can be evaluated using the stationary distribution and transition probabilities, yielding closed-form expressions for the causal flow.
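The causal decomposition I(X^n \to Y^n) = \sum_t I(X^t; Y_t \mid Y^{t-1}) can be evaluated by brute force for small examples; the sketch below (illustrative function names, a two-use binary symmetric channel with i.i.d. uniform inputs and no feedback) confirms that without feedback the directed information equals the ordinary mutual information.

```python
import math
from itertools import product
from collections import defaultdict

def H(dist):
    """Entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def directed_information(joint, n):
    """I(X^n -> Y^n) = sum_t [ H(Y_t | Y^{t-1}) - H(Y_t | X^t, Y^{t-1}) ],
    computed by brute force from a joint pmf over (x_sequence, y_sequence) pairs."""
    def project(keep_x, keep_y):
        d = defaultdict(float)
        for (xs, ys), p in joint.items():
            d[(xs[:keep_x], ys[:keep_y])] += p
        return d
    total = 0.0
    for t in range(1, n + 1):
        h_y_given_past = H(project(0, t)) - H(project(0, t - 1))          # H(Y_t | Y^{t-1})
        h_y_given_past_and_x = H(project(t, t)) - H(project(t, t - 1))    # H(Y_t | X^t, Y^{t-1})
        total += h_y_given_past - h_y_given_past_and_x
    return total

# Two uses of a binary symmetric channel with crossover 0.1, i.i.d. uniform inputs, no feedback
eps, n = 0.1, 2
joint = {}
for xs in product((0, 1), repeat=n):
    for ys in product((0, 1), repeat=n):
        p = 0.5 ** n
        for x, y in zip(xs, ys):
            p *= (1 - eps) if x == y else eps
        joint[(xs, ys)] = p

print(directed_information(joint, n))   # ~1.062 bits = 2 * (1 - h(0.1)) = I(X^2; Y^2)
```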

Information in Non-Standard Settings

Information theory extends beyond classical discrete and continuous domains to encompass quantum systems, computational paradigms, and complex stochastic processes, where traditional Shannon entropy must be generalized to capture phenomena like superposition, undecidability, and long-range correlations. In quantum settings, information is quantified using density operators rather than probability distributions, leading to measures that account for non-commutativity and entanglement. Algorithmic variants shift focus from probabilistic uncertainty to the intrinsic complexity of individual objects via program length. These extensions provide foundational tools for fields like quantum computing and algorithmic complexity, revealing limits unattainable in classical frameworks. In quantum information theory, the von Neumann entropy serves as the primary measure of uncertainty for a quantum state described by a density matrix \rho, defined as S(\rho) = -\operatorname{Tr}(\rho \log \rho), where \operatorname{Tr} denotes the trace operation and the logarithm is base 2 for bits. This entropy generalizes Shannon's entropy to quantum mechanics, quantifying the mixedness of \rho while vanishing for pure states. For an ensemble of states \{p_i, \rho_i\}, the Holevo bound \chi upper-bounds the accessible classical information, given by \chi = S\left(\sum_i p_i \rho_i\right) - \sum_i p_i S(\rho_i), establishing a fundamental limit on how much classical data can be reliably transmitted through quantum channels without measurement collapse. The quantum mutual information between subsystems A and B of a bipartite state \rho_{AB} extends this to correlations, defined as I(A:B) = S(\rho_A) + S(\rho_B) - S(\rho_{AB}), where \rho_A = \operatorname{Tr}_B(\rho_{AB}) and similarly for \rho_B; this measure is non-negative and equals twice the entanglement entropy for pure states. Quantum channels, modeled as completely positive trace-preserving maps, have capacities constrained by these quantities; for the qubit depolarizing channel \mathcal{N}_p(\rho) = (1-p)\rho + p \frac{I}{2}, where p is the depolarizing probability and I is the identity, the classical capacity equals the Holevo information \chi(\mathcal{N}_p), which is additive even when the channel is tensored with arbitrary channels, while the quantum capacity remains open for general p but is bounded above via approximate degradability for low p. Algorithmic information theory redefines information content through the lens of computation, independent of probability. The Kolmogorov complexity K(x) of a binary string x is the length of the shortest program for a fixed universal Turing machine that outputs x and halts, providing an absolute, observer-independent measure of complexity. This notion, introduced independently by Solomonoff, Kolmogorov, and Chaitin, captures the minimal description length required to specify x, rendering random those strings with K(x) \approx |x| (incompressible), but K(x) is uncomputable due to the halting problem, limiting its practical use to approximations via compression algorithms. Mutual information analogs, such as I(x:y) = K(x) + K(y) - K(x,y), quantify the shared complexity between objects, foundational for understanding similarity and shared structure despite theoretical intractability. For stochastic processes, the entropy rate quantifies average uncertainty per symbol in the long run. For a discrete-time process \{X_i\}, the entropy rate is defined as H = \lim_{n \to \infty} \frac{1}{n} H(X_1, \dots, X_n), where the limit exists under stationarity and equals \lim_{n \to \infty} H(X_n | X_1, \dots, X_{n-1}) by the chain rule.
This rate, originally studied by Shannon, governs the fundamental limit for lossless compression of process outputs, taking finite values for stationary ergodic processes such as Markov chains, while the limit may fail to exist for non-stationary ones. It bridges classical information theory to dynamical systems, enabling analysis of correlation decay in sources like language models or physical systems. These concepts find applications in quantum gravity, such as the AdS/CFT correspondence, where holographic duality posits that bulk gravitational degrees of freedom in anti-de Sitter (AdS) space encode the information of the boundary conformal field theory (CFT), with the entanglement entropy of a boundary region computed by the Ryu-Takayanagi formula S = \frac{\operatorname{Area}(\gamma)}{4G_N}, where \gamma is the extremal surface homologous to the boundary region; this duality suggests emergent spacetime from entanglement, as explored in generalizations to excited states and deformed geometries.
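The quantum quantities above reduce to small eigenvalue computations for qubit examples; the sketch below, assuming NumPy is available and using illustrative function names, evaluates the von Neumann entropy of pure and maximally mixed qubit states and the Holevo bound for two equiprobable non-orthogonal pure states.

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log2 rho), computed from the eigenvalues of the density matrix."""
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]      # drop numerically zero eigenvalues
    return float(-np.sum(eigvals * np.log2(eigvals)))

def holevo_bound(priors, states):
    """chi = S(sum_i p_i rho_i) - sum_i p_i S(rho_i) for an ensemble of density matrices."""
    average_state = sum(p * rho for p, rho in zip(priors, states))
    return von_neumann_entropy(average_state) - sum(
        p * von_neumann_entropy(rho) for p, rho in zip(priors, states))

pure = np.array([[1.0, 0.0], [0.0, 0.0]])     # |0><0|
mixed = 0.5 * np.eye(2)                       # maximally mixed qubit
print(von_neumann_entropy(pure), von_neumann_entropy(mixed))   # 0.0, 1.0

# Two equiprobable non-orthogonal pure states |0> and |+>
ket0 = np.array([[1.0], [0.0]])
ket_plus = np.array([[1.0], [1.0]]) / np.sqrt(2)
ensemble = [ket0 @ ket0.T, ket_plus @ ket_plus.T]
print(holevo_bound([0.5, 0.5], ensemble))     # ~0.60 bits, strictly below one classical bit
```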

Applications

Communication and Secrecy

Information theory provides foundational principles for secure communication, ensuring that messages remain confidential even against adversaries with unlimited computational power. A cornerstone is the concept of perfect secrecy, introduced by Shannon, which requires that the mutual information between the plaintext and the ciphertext is zero, meaning the adversary gains no information about the message from the ciphertext alone. This condition holds if the key used for encryption is at least as long as the message and uniformly random, independent of the message, as demonstrated in the one-time pad scheme where the ciphertext is the modular sum of the message and key. In terms of entropy, perfect secrecy demands that the key entropy H(K) satisfies H(K) \geq H(M), where H(M) is the entropy of the message M, ensuring the key cannot be shorter than the message's uncertainty without compromising security. In noisy communication environments, the wiretap channel model extends these ideas to scenarios where an eavesdropper receives a degraded version of the signal. Aaron Wyner defined the secrecy capacity as the maximum rate at which a message can be reliably transmitted to the legitimate receiver while keeping it perfectly secret from the eavesdropper, given by C_s = \max_{p(x)} [I(X;Y) - I(X;Z)], where X is the input, Y the legitimate output, and Z the eavesdropper's output. For degraded channels, where the eavesdropper's channel is a degraded version of the main channel, this simplifies to C_s = C_{\text{main}} - C_{\text{wiretap}}, with C_{\text{main}} and C_{\text{wiretap}} being the capacities of the respective channels. Achieving this capacity involves stochastic encoding, where the transmitter adds randomness to confuse the eavesdropper without affecting the legitimate receiver's decoding, leveraging the difference in mutual informations. Authentication and key distribution in information-theoretic settings rely on mutual information measures to bound security, particularly in protocols that generate shared secrets over public channels. Ueli Maurer showed that parties observing correlated random variables can distill a secret key through public discussion, with the achievable key rate bounded by the mutual information between their observations minus the leakage to the adversary. In practical key distillation, information-theoretic bounds use the asymptotic equipartition property (AEP) to analyze randomness extractors, such as universal hash functions, which compress high-entropy sources into nearly uniform keys while preserving secrecy; the AEP ensures that typical sequences carry per-symbol information content close to the source entropy, enabling efficient extraction. These extractors guarantee that the length of the output key approaches the entropy of the input, providing strong security guarantees independent of computational assumptions. Steganography applies information theory to hide messages within cover media such that the presence of the hidden information is undetectable. Christian Cachin formalized this using the Kullback-Leibler divergence, defining perfect security as the case where the divergence between the distributions of cover and stego (hidden) objects is zero, ensuring the adversary cannot distinguish them better than random guessing. The embedding rate is limited by the KL divergence, as higher rates increase detectability; secure schemes minimize D(P_{\text{cover}} \| P_{\text{stego}}) through careful modification of the cover while preserving its statistical properties. Practical examples illustrate these principles.
In quantum key distribution, the BB84 protocol by Charles Bennett and Gilles Brassard enables information-theoretically secure key exchange using polarized photons, where error correction and privacy amplification, via hash-based extractors, distill a secret key from the shared quantum measurements, secure against any eavesdropper due to the quantum uncertainty principle and entropy extraction.
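A minimal sketch of the one-time pad illustrates the perfect-secrecy condition discussed above: with a uniformly random key as long as the message, every ciphertext is equally likely regardless of the plaintext. The function name and example message are illustrative assumptions.

```python
import secrets

def one_time_pad(message: bytes, key: bytes) -> bytes:
    """XOR cipher: encryption and decryption are the same operation."""
    assert len(key) == len(message), "the key must be as long as the message"
    return bytes(m ^ k for m, k in zip(message, key))

message = b"ATTACK AT DAWN"
key = secrets.token_bytes(len(message))      # uniformly random key, used once
ciphertext = one_time_pad(message, key)      # reveals nothing about the plaintext
recovered = one_time_pad(ciphertext, key)    # XOR with the same key decrypts
print(ciphertext.hex())
print(recovered)                             # b'ATTACK AT DAWN'
```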

Computing and Machine Learning

Information theory plays a pivotal role in the design and analysis of pseudorandom number generators (PRNGs), where information-theoretic tests evaluate the unpredictability and uniformity of generated sequences to ensure they mimic true randomness for cryptographic and simulation purposes. These tests, grounded in entropy measures, assess whether a PRNG's output distribution is indistinguishable from uniform by quantifying deviations in entropy or correlations between consecutive outputs. For instance, tests based on source coding compress PRNG outputs and compare the achieved compression rates against those expected from truly random sources, revealing biases if the empirical entropy falls short of the theoretical maximum. A key application in cryptography involves randomness extraction from weak random sources, where the leftover hash lemma provides a foundational result for converting sources with partial randomness into nearly uniform keys. The lemma states that, for a source with min-entropy k and a family of universal hash functions, hashing the source to \ell output bits yields a distribution whose statistical distance from uniform is at most on the order of 2^{-(k - \ell)/2}, enabling secure key derivation even from imperfect sources such as physical noise. This principle underpins privacy amplification in cryptographic protocols, ensuring that adversaries gain negligible information about the extracted bits. In machine learning, the information bottleneck (IB) principle formalizes the trade-off between data compression and predictive utility by seeking representations that minimize mutual information with the inputs while maximizing it with the outputs. Introduced as a method to extract relevant features through a compressed intermediate variable, IB minimizes the objective I(X; T) - \beta I(T; Y), where X is the input, T the compressed representation, Y the target, and \beta balances compression against relevance. Variational bounds approximate this non-convex optimization, enabling scalable training via neural networks that approximate the posterior p(t|x). This framework has influenced deep learning by promoting parsimonious models that discard irrelevant noise. Neural networks leverage information theory for representation learning, as seen in InfoGAN, which extends generative adversarial networks by maximizing the mutual information between latent codes and generated images to yield interpretable, disentangled factors. In InfoGAN, the objective augments the GAN loss with I(c; G(z, c)), where c is a subset of informative latent variables, estimated via a variational lower bound to encourage structured latent spaces without supervision. Similarly, rate-distortion theory informs autoencoders by framing encoding as a lossy compression problem, minimizing distortion d(x, \hat{x}) subject to a rate constraint on the mutual information I(x; z), where z is the latent code. This approach yields efficient embeddings for tasks like dimensionality reduction, with the rate-distortion function R(D) = \min I(X; Z) such that E[d(X, \hat{X})] \leq D. In data analysis, information measures enhance clustering by quantifying the uncertainty reduction achieved by partitioning data into groups. An information-theoretic perspective determines the optimal number of clusters by minimizing the total entropy across clusters while penalizing complexity, treating clustering as a balance between description length and fit, akin to the minimum description length principle. For example, the entropy of a clustering is H = -\sum_k p_k \log p_k + \sum_k p_k H_k, where p_k is the probability of cluster k and H_k its internal entropy, guiding algorithms to avoid over- or under-clustering.
Feature selection in high-dimensional settings employs conditional mutual information to identify variables that provide unique predictive power given others, selecting subsets that maximize I(f; y | S), where f is a candidate feature, y the target, and S the already-selected set, mitigating redundancy and improving model efficiency. Recent advances as of 2025 integrate information theory with generative models, particularly diffusion models, where the forward and reverse processes are analyzed through information flows to bound sample quality and training efficiency. In information-theoretic frameworks, the score function aligns with minimum mean square error estimation under Gaussian noise, revealing that diffusion paths degrade and restore information in a manner akin to rate-distortion curves during reverse denoising. For large language models (LLMs), compression bounds limit training efficacy: an "entropy law" links first-epoch loss to the compressibility of the training data, suggesting that LLM performance scales with the negative logarithm of the compression ratios the model achieves on training corpora. This implies fundamental ceilings on generalization from finite datasets, where the achievable loss cannot drop below the entropy of the underlying data source.
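The greedy use of conditional mutual information for feature selection can be demonstrated on a toy dataset; in the sketch below the helper names, the dataset (a relevant feature, an irrelevant one, and a redundant copy), and the stopping rule are all illustrative assumptions.

```python
import math
from collections import Counter

def entropy_of(samples):
    """Empirical entropy in bits of a list of hashable observations."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def conditional_mi(f, y, s):
    """Empirical I(f; y | s) = H(f,s) + H(y,s) - H(f,y,s) - H(s) over parallel sample lists."""
    return (entropy_of(list(zip(f, s))) + entropy_of(list(zip(y, s)))
            - entropy_of(list(zip(f, y, s))) - entropy_of(s))

# Toy dataset: column 0 is relevant (y copies it), column 1 is irrelevant noise,
# column 2 is a redundant copy of column 0.
X = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
y = [row[0] for row in X]

selected, remaining = [], [0, 1, 2]
while remaining:
    context = [tuple(row[i] for i in selected) for row in X]
    scores = {j: conditional_mi([row[j] for row in X], y, context) for j in remaining}
    best = max(scores, key=scores.get)
    if scores[best] <= 1e-12:
        break                      # no remaining feature adds predictive information
    selected.append(best)
    remaining.remove(best)
    print(scores, "-> pick feature", best)
print("selected features:", selected)   # [0]: the redundant copy is never chosen
```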
