Differential entropy

Differential entropy is a measure of uncertainty for continuous random variables, serving as the continuous analog of Shannon's discrete entropy. It is defined as h(X) = -\int f_X(x) \log f_X(x) \, dx, where f_X is the probability density function of the random variable X and the integral is taken over the support of f_X. This quantity, introduced by Claude Shannon in his seminal 1948 paper on the mathematical theory of communication, quantifies the expected information content, or average surprisal, in observing outcomes from a continuous distribution. Unlike discrete entropy, which is always non-negative and represents the minimum number of bits needed to encode outcomes, differential entropy can be negative, reflecting that continuous distributions lack a natural finite "alphabet" and that the value depends on the choice of units or scale. It is invariant under translations, so h(X + c) = h(X) for any constant c, but changes under linear transformations: h(aX) = h(X) + \log |a| for scalar a \neq 0, and more generally h(AX) = h(X) + \log |\det(A)| for an invertible matrix A.

Differential entropy connects to discrete entropy through quantization: if a continuous random variable is discretized into bins of width \Delta, the discrete entropy H(X_\Delta) approximates h(X) - \log \Delta, and the limit \Delta \to 0 highlights differential entropy's role as a limiting case of discrete entropy. Key properties include the joint differential entropy h(X,Y) = -\iint f_{X,Y}(x,y) \log f_{X,Y}(x,y) \, dx \, dy, which satisfies subadditivity h(X,Y) \leq h(X) + h(Y) with equality if X and Y are independent, and the conditional entropy h(X|Y) = h(X,Y) - h(Y), which is non-increasing under conditioning: h(X|Y) \leq h(X). The asymptotic equipartition property (AEP) extends to continuous i.i.d. sequences, where the probability of the typical set approaches 1 and its volume is approximately 2^{n h(X)} for n samples (with entropy measured in bits), enabling analysis of compression and typical behavior in continuous sources.

Among all distributions with fixed variance, the Gaussian achieves the maximum differential entropy, given by h(X) = \frac{1}{2} \log (2 \pi e \sigma^2) for a univariate normal with variance \sigma^2, or more generally \frac{1}{2} \log ((2\pi e)^n |\mathbf{K}|) for a multivariate normal with covariance matrix \mathbf{K}. This maximum entropy principle underscores its utility in modeling noise and signals. Differential entropy plays a central role in continuous-channel capacity derivations, rate-distortion theory for analog sources, and estimation bounds analogous to Fano's inequality, which relate estimation error to entropy: \mathbb{E}[(X - \hat{X})^2] \geq \frac{1}{2\pi e} 2^{2 h(X)} (with h in bits). These aspects make it indispensable for applications in signal processing, communications, and statistical inference involving continuous data.

Fundamentals

Definition

Differential entropy, also known as continuous entropy, is a measure of uncertainty for continuous random variables in information theory. For a continuous random variable X with probability density function f(x), the differential entropy h(X) is defined as h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx, where the integral is taken over the support of f and the logarithm is typically base 2 (yielding bits) or natural (yielding nats). This definition extends the concept of Shannon entropy from discrete to continuous distributions and was first introduced by Claude Shannon in his foundational work on communication theory.

The differential entropy arises as the limiting case of discrete Shannon entropy applied to a quantized version of the continuous random variable. Consider partitioning the real line into small intervals of width \Delta and approximating f(x) by a discrete distribution with probabilities p_i = f(x_i) \Delta for interval centers x_i; the discrete entropy of this approximation is H(p) \approx h(X) + \log (1/\Delta), and in the limit as \Delta \to 0, h(X) = \lim_{\Delta \to 0} [H(p) - \log (1/\Delta)]. This derivation highlights that differential entropy is not an absolute measure of information content but an asymptotic quantity relative to the discretization scale.

Unlike discrete entropy, which is dimensionless and non-negative, differential entropy carries units dependent on the measurement scale of X and can take negative values. In particular, it is not invariant under linear transformations: if Y = aX + b with a \neq 0, then h(Y) = h(X) + \log |a|, reflecting how rescaling the variable (e.g., changing units from meters to centimeters) shifts the entropy by the logarithm of the scaling factor. For the definition to be well posed, the distribution of X must be absolutely continuous with respect to Lebesgue measure (so that the density f exists), and the integral must converge absolutely, ensuring a finite entropy.
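As a minimal illustration of the definition (not taken from a cited source), the sketch below numerically integrates -f(x) ln f(x) for a Gaussian density and compares the result with the closed form \frac{1}{2} \ln(2\pi e \sigma^2); the choice \sigma = 2 and the finite integration range are assumptions made purely for the example.

```python
# Numerical check of the definition h(X) = -∫ f(x) ln f(x) dx (in nats),
# using a Gaussian test density with an illustrative standard deviation.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

sigma = 2.0
f = norm(loc=0.0, scale=sigma).pdf

# Integrand -f(x) ln f(x); the range ±12σ keeps f strictly positive
# while capturing essentially all of the mass.
integrand = lambda x: -f(x) * np.log(f(x))

h_numeric, _ = quad(integrand, -12 * sigma, 12 * sigma)
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)  # Gaussian closed form

print(h_numeric, h_closed)  # both ≈ 2.112 nats for sigma = 2
```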

Relation to Discrete Entropy

Differential entropy extends Claude Shannon's discrete entropy to continuous random variables, providing a measure of uncertainty or information content for probability density functions rather than discrete probability mass functions. Introduced by Shannon in 1948 as part of the foundational work on information theory for continuous channels, it formalizes the notion of entropy in scenarios where signals or noise are modeled continuously, such as in communication systems.

Despite this analogy, differential entropy differs fundamentally from its discrete counterpart and is not directly comparable to it. Discrete entropy is always non-negative and represents an absolute measure of uncertainty, whereas differential entropy can be negative and depends on the units of measurement. A negative value arises when the distribution is more concentrated than a uniform distribution over a unit interval (e.g., [0,1]), which has zero differential entropy, indicating lower uncertainty relative to that baseline. Additionally, scaling the random variable X to aX (with a \neq 0) shifts the differential entropy by \log |a|, highlighting its relative nature, tied to the chosen scale rather than to an absolute quantity. Interpretation pitfalls include this unit dependence and the need for careful normalization to ensure that additivity for independent variables holds, since the joint entropy equals the sum of the individual entropies only when the joint density factors into the product of the marginals.

The precise relation between the two entropies emerges from a discretization process. Consider partitioning the support of a continuous random variable into small bins of width \Delta. The resulting discrete random variable, with probabilities p_i \approx f(x_i) \Delta where f is the density, has entropy H approximating the differential entropy h(X) adjusted for the bin size: H \approx h(X) - \log \Delta. This approximation becomes exact in the limit as \Delta \to 0, where the -\log \Delta term (diverging to +\infty) compensates for the infinite resolution of the continuous case, ensuring that the discrete entropy remains non-negative while the differential entropy stays finite. This limiting procedure underscores why differential entropy is not a limiting case of discrete entropy without the adjustment term, emphasizing the conceptual shift from finite to infinite sample spaces.
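A small sketch of the discretization argument follows; it is illustrative only, with the standard normal, the bin widths, and the truncation at ±10 chosen as assumptions for the example. The quantity H + ln Δ should approach h(X) ≈ 1.4189 nats as the bins shrink.

```python
# Illustration of H([X]_Δ) ≈ h(X) - ln Δ for a standard normal: the discrete
# entropy of the binned density, plus ln Δ, approaches the differential entropy.
import numpy as np
from scipy.stats import norm

h_true = 0.5 * np.log(2 * np.pi * np.e)   # h(X) for N(0,1), ≈ 1.4189 nats

for delta in [1.0, 0.1, 0.01]:
    edges = np.arange(-10.0, 10.0 + delta, delta)
    p = np.diff(norm.cdf(edges))          # bin probabilities p_i ≈ f(x_i) * delta
    p = p[p > 0]
    H_discrete = -np.sum(p * np.log(p))   # discrete entropy of the quantized variable
    print(delta, H_discrete + np.log(delta), h_true)
```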

Properties

Basic Properties

Differential entropy exhibits invariance under translation. Consider a continuous random variable X with density f_X(x), so that its differential entropy is h(X) = -\int_{-\infty}^{\infty} f_X(x) \log f_X(x) \, dx. For a constant c, let Y = X + c; the density of Y is f_Y(y) = f_X(y - c). Substituting into the entropy integral gives h(Y) = -\int_{-\infty}^{\infty} f_X(y - c) \log f_X(y - c) \, dy. By the change of variable u = y - c, this simplifies to -\int_{-\infty}^{\infty} f_X(u) \log f_X(u) \, du = h(X), demonstrating the invariance.

The scaling property introduces a dependence on units. For a scalar a \neq 0, let Y = aX; the density of Y is f_Y(y) = \frac{1}{|a|} f_X\left(\frac{y}{a}\right). Thus, h(Y) = -\int_{-\infty}^{\infty} \frac{1}{|a|} f_X\left(\frac{y}{a}\right) \log \left[ \frac{1}{|a|} f_X\left(\frac{y}{a}\right) \right] dy = -\int_{-\infty}^{\infty} \frac{1}{|a|} f_X\left(\frac{y}{a}\right) \left[ \log \frac{1}{|a|} + \log f_X\left(\frac{y}{a}\right) \right] dy. Substituting z = y/a yields h(Y) = h(X) + \log |a|, showing that rescaling shifts the entropy by the logarithm of the scaling factor, which accounts for changes in measurement units.

Unlike discrete entropy, which is always non-negative, differential entropy can be negative, highlighting its dependence on the measurement scale. For a random variable X with bounded support S of finite volume V = \int_S dx, the differential entropy satisfies h(X) \leq \log V, with equality achieved when X is uniformly distributed over S. This upper bound arises because the uniform distribution maximizes differential entropy for a fixed support, and \log V < 0 when V < 1, allowing negative values even at the maximum.

Differential entropy is continuous with respect to suitable convergence of densities. Specifically, if a sequence of densities f_n converges to f in the L^1 norm (i.e., \int |f_n - f| \, dx \to 0) and satisfies suitable integrability conditions for the entropy to be well defined, then h(f_n) \to h(f). This continuity ensures that small perturbations in the density lead to small changes in entropy, facilitating analysis in estimation and approximation contexts.
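The translation and scaling properties can be checked directly with closed-form entropies; the sketch below is an assumed example using an exponential variable (whose entropy with unit rate is 1 nat) and arbitrary illustrative values of the shift c and scale a.

```python
# Check of h(X + c) = h(X) and h(aX) = h(X) + ln|a| for an exponential variable
# with rate 1; scipy's entropy() returns differential entropy in nats.
import numpy as np
from scipy.stats import expon

a, c = 3.0, 5.0
h_X = expon(scale=1.0).entropy()             # h(X) = 1 nat for rate 1
h_shift = expon(loc=c, scale=1.0).entropy()  # shifting leaves the entropy unchanged
h_scale = expon(scale=a).entropy()           # aX is exponential with scale a

print(h_X, h_shift)               # both 1.0
print(h_scale, h_X + np.log(a))   # both 1 + ln 3 ≈ 2.0986
```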

Chain Rule and Additivity

The joint differential entropy of a random vector \mathbf{X} = (X_1, \dots, X_n) in \mathbb{R}^n, with joint probability density function f_{\mathbf{X}}(\mathbf{x}), extends the scalar case and is given by h(\mathbf{X}) = -\int_{\mathbb{R}^n} f_{\mathbf{X}}(\mathbf{x}) \log f_{\mathbf{X}}(\mathbf{x}) \, d\mathbf{x}. This measure quantifies the uncertainty in the joint distribution of the vector, analogous to the scalar differential entropy but integrated over the higher-dimensional space.

A key property is the chain rule for differential entropy, which decomposes the joint entropy into a sum of conditional entropies: h(X_1, \dots, X_n) = \sum_{i=1}^n h(X_i \mid X_1, \dots, X_{i-1}), where h(X_i \mid X_1, \dots, X_{i-1}) = -\int f(x_i \mid x_1, \dots, x_{i-1}) \log f(x_i \mid x_1, \dots, x_{i-1}) \, dx_i, averaged over the conditioning variables. This decomposition holds whenever the relevant densities exist and follows from the factorization of the joint density into conditional densities. If the random variables X_1, \dots, X_n are mutually independent, each conditional entropy reduces to the corresponding marginal entropy, yielding h(X_1, \dots, X_n) = \sum_{i=1}^n h(X_i), demonstrating additivity of the joint entropy under independence.

The conditional differential entropy h(X \mid Y) for jointly continuous random variables X and Y with joint density f_{X,Y}(x,y) is defined as h(X \mid Y) = h(X,Y) - h(Y) = -\iint f_{X,Y}(x,y) \log f_{X \mid Y}(x \mid y) \, dx \, dy, where the outer integral over y averages the per-value conditional entropies. This quantity represents the residual uncertainty in X given knowledge of Y. A fundamental inequality is h(X \mid Y) \leq h(X), with equality if and only if X and Y are independent, reflecting that conditioning cannot increase differential entropy. This follows directly from the non-negativity of mutual information, I(X;Y) = h(X) - h(X \mid Y) \geq 0. For the joint case, subadditivity holds as h(X,Y) = h(X) + h(Y \mid X) \leq h(X) + h(Y), again with equality under independence.

For the sum of independent random variables, adding an independent component cannot decrease differential entropy: h(X + Y) \geq \max(h(X), h(Y)), since h(X + Y) \geq h(X + Y \mid Y) = h(X \mid Y) = h(X) by translation invariance and independence. Unlike the discrete case, the upper bound h(X + Y) \leq h(X) + h(Y) does not hold in general, because conditional differential entropies can be negative; a sharper lower bound is provided by the entropy power inequality, e^{2 h(X+Y)} \geq e^{2 h(X)} + e^{2 h(Y)}, with equality when X and Y are Gaussian.

In continuous channels, the data processing inequality preserves the structure of the discrete case. For a Markov chain X \to Y \to Z where X, Y, Z are continuous random variables, the mutual information satisfies I(X; Z) \leq I(X; Y), with equality if Z is a sufficient statistic for X given Y. This implies that processing through a continuous channel cannot increase information about the input, and it extends to differential entropies via I(X;Y) = h(X) - h(X \mid Y). The inequality holds under the existence of densities and is proven using the chain rule and the non-negativity of relative entropy.
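As a concrete check of the chain rule and subadditivity, the sketch below uses a bivariate Gaussian, for which all the entropies have closed forms; the correlation and variances are illustrative assumptions.

```python
# Chain rule h(X, Y) = h(X) + h(Y | X) and subadditivity h(X, Y) <= h(X) + h(Y)
# for a bivariate Gaussian, evaluated from the closed-form expressions.
import numpy as np

rho, sx, sy = 0.8, 1.0, 2.0
K = np.array([[sx**2, rho * sx * sy],
              [rho * sx * sy, sy**2]])

h_joint = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(K))
h_X = 0.5 * np.log(2 * np.pi * np.e * sx**2)
h_Y = 0.5 * np.log(2 * np.pi * np.e * sy**2)
# Given X, Y is Gaussian with conditional variance sy^2 * (1 - rho^2).
h_Y_given_X = 0.5 * np.log(2 * np.pi * np.e * sy**2 * (1 - rho**2))

print(h_joint, h_X + h_Y_given_X)   # equal: chain rule
print(h_joint <= h_X + h_Y)         # True: subadditivity, strict since rho != 0
```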

Maximum Entropy

Gaussian Maximization Theorem

The Gaussian maximization theorem asserts that, among all continuous probability distributions for a random variable X with fixed variance \sigma^2, the differential entropy h(X) is maximized uniquely by the normal distribution \mathcal{N}(\mu, \sigma^2), where the maximum value is \frac{1}{2} \log (2 \pi e \sigma^2) in nats. This result holds regardless of the mean \mu, since shifting the distribution does not affect the entropy. The theorem underscores the Gaussian distribution's role as the embodiment of maximum uncertainty under a second-moment constraint, a principle central to information theory that favors the least informative prior consistent with the observed variance. It has profound implications, such as establishing the capacity of the additive white Gaussian noise channel as \frac{1}{2} \log \left(1 + \frac{P}{N}\right), where the input distribution maximizing mutual information is Gaussian.

The theorem generalizes to multivariate cases: for an n-dimensional random vector X with fixed covariance matrix \Sigma, the maximum differential entropy is achieved by the multivariate Gaussian \mathcal{N}(\mu, \Sigma), yielding h(X) = \frac{n}{2} \log (2 \pi e) + \frac{1}{2} \log \det(\Sigma) nats. For uncorrelated components with variances \sigma_i^2, this simplifies to \frac{n}{2} \log (2 \pi e) + \sum_{i=1}^n \log \sigma_i. Without variance or similar constraints, differential entropy is unbounded above, as densities can be made arbitrarily flat, while it approaches -\infty for distributions degenerating toward zero variance.
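The sketch below illustrates the theorem numerically: several densities are rescaled to unit variance (the specific scale factors are standard variance formulas, assumed here for the example) and their entropies are compared against the Gaussian maximum \frac{1}{2}\ln(2\pi e) \approx 1.4189 nats.

```python
# Unit-variance entropies compared against the Gaussian maximum 0.5*ln(2*pi*e).
import numpy as np
from scipy.stats import norm, laplace, uniform

gauss_bound = 0.5 * np.log(2 * np.pi * np.e)

candidates = {
    "gaussian": norm(scale=1.0),
    "laplace":  laplace(scale=1.0 / np.sqrt(2.0)),      # Var = 2 b^2 = 1
    "uniform":  uniform(loc=0.0, scale=np.sqrt(12.0)),  # Var = a^2 / 12 = 1
}
for name, dist in candidates.items():
    print(name, dist.entropy(), "<=", gauss_bound)      # only the Gaussian attains it
```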

Proof of the Theorem

To prove the theorem, consider the problem of maximizing the differential entropy h(X) = -\mathbb{E}[\log f(X)] = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx over all probability density functions f on \mathbb{R}, subject to the normalization constraint \int_{-\infty}^{\infty} f(x) \, dx = 1 and the fixed second-moment constraint \mathbb{E}[X^2] = \int_{-\infty}^{\infty} x^2 f(x) \, dx = \sigma^2 (assuming without loss of generality that \mathbb{E}[X] = 0). This is a constrained optimization problem in the space of densities, solved using the method of Lagrange multipliers for functionals. Introduce the Lagrangian functional \mathcal{L} = -\int f \log f \, dx + \lambda \left( \int f \, dx - 1 \right) - \mu \left( \int x^2 f \, dx - \sigma^2 \right), where \lambda is the multiplier for normalization and \mu > 0 for the variance constraint. Setting the functional derivative with respect to f to zero, \frac{\delta \mathcal{L}}{\delta f} = -\log f - 1 + \lambda - \mu x^2 = 0, yields f(x) = \exp(\lambda - 1 - \mu x^2), where the normalization constant is incorporated via \lambda. This form is proportional to a Gaussian density. To identify the parameters, impose the second-moment constraint: the variance of this distribution is 1/(2\mu), so \mu = 1/(2\sigma^2) to match \sigma^2. The normalization then gives the density of \mathcal{N}(0, \sigma^2).

An alternative proof uses the non-negativity of the Kullback-Leibler (KL) divergence. Let g denote the density of \mathcal{N}(0, \sigma^2). Then D(f \| g) = \int f \log \frac{f}{g} \, dx = -\int f \log g \, dx - h(f) \geq 0, with equality if and only if f = g almost everywhere. Rearranging gives h(f) \leq -\int f \log g \, dx. Now \log g(x) = -\frac{1}{2} \log(2\pi \sigma^2) - \frac{x^2}{2\sigma^2}, so -\int f \log g \, dx = \frac{1}{2} \log(2\pi \sigma^2) + \frac{1}{2\sigma^2} \int x^2 f(x) \, dx = \frac{1}{2} \log(2\pi e \sigma^2), where the last step substitutes the constraint \int x^2 f = \sigma^2 and adds the constant \frac{1}{2} \log e = \frac{1}{2} (in nats). Thus h(f) \leq \frac{1}{2} \log(2\pi e \sigma^2), which is precisely the differential entropy of the Gaussian, with equality only for the Gaussian density.

To verify, compute the differential entropy of the Gaussian directly: for X \sim \mathcal{N}(0, \sigma^2), h(X) = -\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{x^2}{2\sigma^2} \right) \log \left[ \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{x^2}{2\sigma^2} \right) \right] dx. The logarithm expands to -\frac{1}{2} \log (2\pi \sigma^2) - \frac{x^2}{2\sigma^2}, so the integral separates into \frac{1}{2} \log(2\pi \sigma^2) + \frac{1}{2\sigma^2} \mathbb{E}[X^2] = \frac{1}{2} \log(2\pi \sigma^2) + \frac{1}{2} = \frac{1}{2} \log(2\pi e \sigma^2), confirming that the bound is achieved.

Examples

Exponential Distribution

The exponential distribution with rate parameter \lambda > 0 has probability density function
f(x) = \lambda e^{-\lambda x}, \quad x \geq 0,
and mean 1/\lambda. This distribution models phenomena such as inter-arrival times in Poisson processes and is characterized by its memoryless property.
The differential entropy h(X) of a continuous random variable X with density f is h(X) = -\int f(x) \log f(x) \, dx. For the exponential distribution,
h(X) = -\int_{0}^{\infty} \lambda e^{-\lambda x} \log(\lambda e^{-\lambda x}) \, dx = -\int_{0}^{\infty} \lambda e^{-\lambda x} (\log \lambda - \lambda x) \, dx.
The first term evaluates to -\log \lambda \int_{0}^{\infty} \lambda e^{-\lambda x} \, dx = -\log \lambda, since the integral of the density is 1. The second term is \lambda \int_{0}^{\infty} x \lambda e^{-\lambda x} \, dx = \lambda \mathbb{E}[X] = \lambda \cdot (1/\lambda) = 1. Thus, h(X) = 1 - \log \lambda (in nats), derived by direct integration.
This entropy value increases with the mean 1/\lambda, as larger means correspond to broader spreads and greater uncertainty in the distribution. For \lambda > e, the entropy is negative, a feature unique to differential entropy that does not imply negative information but reflects the continuous nature relative to a uniform reference measure. The exponential distribution maximizes the differential entropy among all continuous distributions supported on [0, \infty) with fixed mean 1/\lambda. It serves as the continuous analog of the geometric distribution, which maximizes entropy among discrete non-negative integer-valued random variables with fixed mean.
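A brief numerical sketch of the formula h(X) = 1 - \ln \lambda follows; the rate values are illustrative, and scipy's exponential is parametrized by the scale 1/\lambda rather than the rate.

```python
# h(X) = 1 - ln(lambda) for the exponential distribution: positive for small
# rates, zero at lambda = e, and negative beyond.
import numpy as np
from scipy.stats import expon

for lam in [0.5, 1.0, np.e, 5.0]:
    h_closed = 1.0 - np.log(lam)
    h_scipy = expon(scale=1.0 / lam).entropy()  # scipy uses scale = 1/lambda
    print(lam, h_closed, h_scipy)               # the two values agree
```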

Uniform Distribution

The uniform distribution over an interval [b, b+a] has density f(x) = \frac{1}{a} for x \in [b, b+a] and f(x) = 0 otherwise. The differential entropy h(X) is computed directly from the definition: h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx = -\int_{b}^{b+a} \frac{1}{a} \log \left( \frac{1}{a} \right) \, dx = \log a, where the logarithm is the natural logarithm (nats). This value is independent of the location b, depending only on the length a of the interval. The entropy \log a grows logarithmically with the interval length, quantifying the uncertainty in locating the variable within the fixed support; a larger a spreads the possible outcomes over a wider range, hence higher uncertainty.

Among all distributions supported on an interval of fixed length a, the uniform distribution achieves the maximum differential entropy, as established by the non-negativity of the Kullback-Leibler divergence between any density f on the interval and the uniform density g(x) = 1/a: D(f \| g) = \log a - h(X) \geq 0, implying h(X) \leq \log a.

In quantization theory, the uniform distribution provides a reference for bounding the entropy of discretized approximations to continuous random variables. As the quantization bin size \Delta \to 0, the discrete entropy H(X^\Delta) of the quantized variable X^\Delta satisfies H(X^\Delta) + \log \Delta \to h(X), a relation that is particularly transparent in the uniform case, linking differential entropy to the limiting behavior of discrete entropy.

The differential entropy of the uniform distribution can be negative when a < 1, for instance h(X) = \log 0.5 < 0 for a = 0.5, which underscores the unit dependence of differential entropy, unlike its non-negative discrete counterpart. This property aligns with the scaling behavior of differential entropy under linear transformations.
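The sketch below illustrates two of these points: h(X) = \log a for a few interval lengths (including a negative value for a < 1), and the bound h \leq \log V on a fixed support, using Beta(2,2) on [0,1] as an assumed example of a more concentrated density on the same interval.

```python
# Uniform entropy equals log(a); any other density on the same support has
# strictly lower entropy (here Beta(2,2) on [0,1] versus the bound log(1) = 0).
import numpy as np
from scipy.stats import uniform, beta

for a in [0.5, 1.0, 4.0]:
    print(a, uniform(loc=0.0, scale=a).entropy(), np.log(a))  # equal: h = log(a)

print(beta(2, 2).entropy())  # ≈ -0.125 < 0, below the uniform bound on [0, 1]
```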

Information Measures

Relation to Mutual Information

The mutual information between two continuous random variables X and Y, denoted I(X;Y), is defined as the difference between the differential entropy of X and the conditional differential entropy of X given Y: I(X;Y) = h(X) - h(X|Y). Equivalently, it can be expressed as I(X;Y) = h(X) + h(Y) - h(X,Y), where h(X,Y) is the joint differential entropy. This quantity quantifies the amount of information that X and Y share, representing the reduction in uncertainty about one variable upon knowing the other. A key property of mutual information is its non-negativity: I(X;Y) \geq 0, with equality holding if and only if X and Y are independent. This follows directly from the conditioning inequality for differential entropy, which states that h(X|Y) \leq h(X), implying that observing Y cannot increase the uncertainty about X. For multiple variables, the chain rule extends mutual information analogously to the discrete case: I(X_1, \dots, X_n; Y) = \sum_{i=1}^n I(X_i; Y \mid X_1, \dots, X_{i-1}), where the conditional mutual information I(X_i; Y \mid X_1, \dots, X_{i-1}) measures the additional shared information between X_i and Y given the previous variables. In information theory, interprets the shared uncertainty between variables, with serving as the foundational building block for analyzing continuous systems. It plays a central role in defining the capacity of continuous channels, where the capacity C is the maximum achievable I(X;Y) over input distributions subject to constraints, representing the highest rate of reliable communication. For instance, in the additive white Gaussian noise channel under a power constraint, the mutual information I(X;Y) is maximized when X is Gaussian, achieving the channel capacity C = \frac{1}{2} \log_2 \left(1 + \frac{P}{N}\right) bits per transmission, where P is the signal power and N is the noise power.
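Two standard continuous-case computations illustrate these definitions; the correlation coefficient and the power values below are assumptions chosen only for the example.

```python
# Mutual information of a correlated bivariate Gaussian via I = h(X)+h(Y)-h(X,Y),
# which reduces to -0.5*ln(1 - rho^2), and the AWGN capacity 0.5*log2(1 + P/N).
import numpy as np

rho = 0.9
h_X = h_Y = 0.5 * np.log(2 * np.pi * np.e)                   # unit variances
h_XY = 0.5 * np.log((2 * np.pi * np.e) ** 2 * (1 - rho**2))  # det K = 1 - rho^2
I = h_X + h_Y - h_XY
print(I, -0.5 * np.log(1 - rho**2))   # both ≈ 0.830 nats

P, N = 4.0, 1.0
print(0.5 * np.log2(1 + P / N))       # AWGN capacity ≈ 1.161 bits per channel use
```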

Connection to Estimator Error

The Cramér-Rao bound establishes a fundamental limit on the performance of unbiased estimators: for an unbiased estimator \hat{\theta} of a parameter \theta based on n i.i.d. observations, the variance satisfies \operatorname{Var}(\hat{\theta}) \geq \frac{1}{n I(\theta)}, where the Fisher information is I(\theta) = \mathbb{E}\left[\left(\frac{\partial \log f(X;\theta)}{\partial \theta}\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2 \log f(X;\theta)}{\partial \theta^2}\right]. The Fisher information quantifies the sensitivity of the likelihood to changes in \theta and connects to the differential entropy h(X;\theta) = -\mathbb{E}[\log f(X;\theta)]: viewing the expected log-likelihood \ell(\theta') = \mathbb{E}_{\theta}[\log f(X;\theta')] as a function of \theta', its value at \theta' = \theta equals -h(X;\theta), and its negative second derivative at that point equals I(\theta). Higher Fisher information, corresponding to lower entropy for a fixed support, implies tighter bounds on the estimation variance.

In estimation from noisy observations, the entropy power inequality provides insight into the minimal mean squared error (MSE). The inequality asserts that for independent scalar random variables X and Z, h(X + Z) \geq \frac{1}{2} \log\left( e^{2 h(X)} + e^{2 h(Z)} \right) (in nats), with equality when X and Z are Gaussian; loosely, greater differential entropy h(X) is associated with larger minimal MSE when estimating X from Y = X + Z for fixed noise variance \operatorname{Var}(Z). This connection is made quantitative by the I-MMSE relation \frac{d}{d\,\mathrm{snr}} I\left(X; \sqrt{\mathrm{snr}}\, X + Z\right) = \frac{1}{2} \operatorname{mmse}(\mathrm{snr}) for standard Gaussian noise Z, which ties mutual information, and hence differential entropy, to the minimum MSE across signal-to-noise ratios.

In Bayesian estimation, the posterior differential entropy h(\theta \mid \mathbf{X}) serves as a measure of residual uncertainty about the parameter \theta after incorporating the data \mathbf{X}, with lower posterior entropy indicating more precise inference. Optimal Bayesian estimators, such as those minimizing expected Bregman divergence, aim to concentrate the posterior, thereby reducing this entropy as a proxy for estimation quality. Asymptotically, with large samples, the posterior becomes approximately Gaussian with covariance close to I(\theta)^{-1}/n, yielding posterior entropy roughly \frac{d}{2} \log(2\pi e / n) - \frac{1}{2} \log \det I(\theta) for a d-dimensional \theta, which characterizes the achievable large-sample uncertainty. For the Gaussian case, the differential entropy h(X) = \frac{1}{2} \log(2\pi e \sigma^2) ties directly to estimation error: the variance of the sample mean from n i.i.d. N(\mu, \sigma^2) observations is \sigma^2 / n, saturating the Cramér-Rao bound and scaling inversely with sample size while reflecting the entropy's dependence on the variance.
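The Gaussian sample-mean case at the end of this discussion can be checked empirically; the sketch below uses illustrative values of \mu, \sigma, n, and the number of Monte Carlo trials, all of which are assumptions for the example.

```python
# For n i.i.d. N(mu, sigma^2) observations the Fisher information for mu is
# n/sigma^2, so the Cramer-Rao bound is sigma^2/n, attained by the sample mean.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 3.0, 2.0, 50, 20000

samples = rng.normal(mu, sigma, size=(trials, n))
sample_means = samples.mean(axis=1)

crb = sigma**2 / n                  # Cramer-Rao bound = 1 / (n * I(mu))
print(sample_means.var(), crb)      # empirical variance ≈ 0.08, matching the bound
```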

Common Distributions

Formulas for Specific Distributions

The differential entropy formulas for several standard continuous distributions are listed in the table below, expressed in nats using the natural logarithm. These expressions assume the conventional parametrizations and are independent of location parameters where applicable, as shifting does not affect the entropy value.
Distribution | Parameters | Support | Differential entropy h(X)
Gaussian | variance \sigma^2 > 0 | (-\infty, \infty) | \frac{1}{2} \log (2 \pi e \sigma^2)
Exponential | rate \lambda > 0 | [0, \infty) | 1 - \log \lambda
Gamma | shape \alpha > 0, rate \beta > 0 | [0, \infty) | \alpha - \log \beta + \log \Gamma(\alpha) + (1 - \alpha) \psi(\alpha), where \psi is the digamma function
Laplace | scale b > 0 | (-\infty, \infty) | 1 + \log (2b)
Cauchy | scale \gamma > 0 | (-\infty, \infty) | \log (4 \pi \gamma)
Weibull | shape k > 0, scale \lambda > 0 | [0, \infty) | \gamma \left(1 - \frac{1}{k}\right) + \log \left( \frac{\lambda}{k} \right) + 1, where \gamma \approx 0.57721 is the Euler-Mascheroni constant
The exponential distribution appears as a special case of both the gamma (with \alpha = 1) and Weibull (with k = 1) distributions.
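The tabulated closed forms can be checked against scipy's entropy() method, which returns differential entropy in nats; the parameter values in the sketch below are arbitrary illustrative choices, and scipy's parametrizations (scale = 1/rate for the exponential and gamma) are noted in the comments.

```python
# Closed-form entropies from the table versus scipy.stats entropy() values.
import numpy as np
from scipy.stats import norm, expon, gamma, laplace, cauchy, weibull_min
from scipy.special import gammaln, digamma

sigma, lam = 2.0, 1.5                      # Gaussian std dev, exponential rate
shape_a, rate_b = 3.0, 2.0                 # gamma shape and rate
b, gam, k, lam_w = 0.7, 1.2, 1.8, 2.5      # Laplace scale, Cauchy scale, Weibull shape/scale

checks = [
    ("gaussian",    norm(scale=sigma),               0.5 * np.log(2 * np.pi * np.e * sigma**2)),
    ("exponential", expon(scale=1 / lam),            1 - np.log(lam)),
    ("gamma",       gamma(shape_a, scale=1 / rate_b), shape_a - np.log(rate_b)
                                                      + gammaln(shape_a)
                                                      + (1 - shape_a) * digamma(shape_a)),
    ("laplace",     laplace(scale=b),                 1 + np.log(2 * b)),
    ("cauchy",      cauchy(scale=gam),                np.log(4 * np.pi * gam)),
    ("weibull",     weibull_min(k, scale=lam_w),      np.euler_gamma * (1 - 1 / k)
                                                      + np.log(lam_w / k) + 1),
]
for name, dist, closed in checks:
    print(f"{name:12s} scipy={dist.entropy():.6f} closed={closed:.6f}")  # pairs agree
```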

Comparison Across Distributions

Among all continuous distributions with a fixed variance, the Gaussian distribution achieves the maximum differential entropy, a result derived from the principle of maximum entropy subject to a second-moment constraint. This positions the Gaussian as the distribution of maximum uncertainty under this constraint, with differential entropy \frac{1}{2} \log (2 \pi e \sigma^2) for variance \sigma^2. Distributions with heavier tails, such as the Cauchy distribution, can attain large differential entropy values; for a Cauchy distribution with scale parameter \gamma, the differential entropy is \log (4 \pi \gamma), which exceeds that of a Gaussian with comparable scale. However, the infinite variance of the Cauchy prevents a direct comparison within the fixed-variance framework. For distributions constrained to the positive real line with fixed mean \mu > 0, the exponential distribution maximizes the differential entropy, yielding 1 + \log \mu.

Shape parameters within parametric families reveal systematic patterns in differential entropy. For the Weibull distribution with fixed scale \lambda, as the shape parameter k approaches 1, the distribution converges to the exponential distribution and the differential entropy approaches 1 + \log \lambda, the entropy of an exponential with mean \lambda. Across families, differential entropy generally increases with kurtosis for fixed variance up to the Gaussian limit (kurtosis of 3), beyond which heavier tails reduce entropy relative to the maximum.

In multivariate isotropic cases with covariance \sigma^2 I_d in d dimensions, differential entropy scales linearly with dimension for the Gaussian, given by \frac{d}{2} \log (2 \pi e \sigma^2), reflecting additivity across independent components. Other distributions sharing this covariance exhibit lower entropies but follow a similar linear scaling, with the gap to the Gaussian bound widening for non-Gaussian forms in higher dimensions. The following table compares differential entropies for selected univariate distributions normalized to unit variance, highlighting the Gaussian's supremacy and the decreasing order for heavier-tailed and more concentrated alternatives:
Distribution | Parameter | Differential entropy (nats)
Gaussian | - | 1.419
Student-t | \nu = 5 | 1.369
Laplace | - | 1.347
Uniform | - | 1.243
Student-t | \nu = 3 | 1.222
These values confirm the ordering Gaussian > Student-t (\nu=5) > Laplace > uniform > Student-t (\nu=3), where decreasing degrees of freedom in the Student-t introduce heavier tails and correspondingly lower entropy under the unit-variance constraint.
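The table values can be reproduced approximately with the sketch below, which rescales each distribution to unit variance using the standard variance formulas (the scale factors are derived assumptions, e.g. \operatorname{Var} = \nu/(\nu-2) for the Student-t) and evaluates the differential entropy in nats.

```python
# Unit-variance differential entropies reproducing the comparison table.
import numpy as np
from scipy.stats import norm, t, laplace, uniform

unit_variance = {
    "gaussian":       norm(scale=1.0),
    "student-t nu=5": t(df=5, scale=np.sqrt(3.0 / 5.0)),     # Var = nu/(nu-2), rescaled
    "laplace":        laplace(scale=1.0 / np.sqrt(2.0)),     # Var = 2 b^2
    "uniform":        uniform(loc=0.0, scale=np.sqrt(12.0)), # Var = a^2 / 12
    "student-t nu=3": t(df=3, scale=np.sqrt(1.0 / 3.0)),
}
for name, dist in unit_variance.items():
    print(f"{name:15s} {dist.entropy():.3f}")   # ≈ 1.419, 1.37, 1.347, 1.243, 1.22
```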

Variants

Conditional Differential Entropy

The conditional differential entropy of a continuous random variable X given a fixed value y of another continuous random variable Y with joint density f_{X,Y}(x,y) is defined as h(X \mid Y = y) = -\int_{-\infty}^{\infty} f_{X \mid Y}(x \mid y) \log f_{X \mid Y}(x \mid y) \, dx, where f_{X \mid Y}(x \mid y) = f_{X,Y}(x,y)/f_Y(y) denotes the conditional density of X given Y = y, assuming it exists. The average conditional differential entropy h(X \mid Y) is then obtained by taking the expectation over the distribution of Y: h(X \mid Y) = \mathbb{E}_Y \left[ h(X \mid Y = y) \right] = -\iint f_{X,Y}(x,y) \log f_{X \mid Y}(x \mid y) \, dx \, dy. This measure extends the concept of differential entropy to scenarios where partial information about Y reduces the uncertainty in X.

Key properties of conditional differential entropy mirror those of its discrete counterpart but account for the peculiarities of continuous distributions. Notably, it is non-increasing under additional conditioning: h(X \mid Y, Z) \leq h(X \mid Y) for any Z, with equality if Z is conditionally independent of X given Y (i.e., Z \perp X \mid Y), reflecting that additional information cannot increase uncertainty. Unlike discrete conditional entropy, which is always non-negative, conditional differential entropy can take negative values, since the underlying differential entropy itself may be negative for sufficiently concentrated densities. The chain rule for differential entropy holds in conditional form, enabling the decomposition of joint differential entropies into sums involving conditional terms.

In the context of stochastic processes, conditional differential entropy serves as a measure of prediction error, quantifying the residual uncertainty in future states after observing the past. For instance, in linear prediction tasks it bounds the minimum mean squared error via relations like the Kolmogorov-Szegő formula, where lower conditional entropy corresponds to tighter predictions. Computing h(X \mid Y) poses challenges when X and Y exhibit complex dependencies, as it requires estimating the full joint density f_{X,Y}, often demanding high-dimensional integration or approximation techniques such as Monte Carlo methods or kernel density estimation.

Generalizations to infinite-dimensional settings, such as Gaussian processes, extend conditional differential entropy to function spaces, where it is defined through limits of finite-dimensional projections or via the log-determinant of conditional covariance operators. For a Gaussian process, the conditional entropy given observations is \frac{1}{2} \log \left( (2\pi e)^n \det(\Sigma_{X \mid Y}) \right) in n-dimensional approximations, with \Sigma_{X \mid Y} the conditional covariance matrix, converging appropriately in the infinite-dimensional limit. This framework is central to applications such as Gaussian process regression and Bayesian inverse problems, where it quantifies posterior uncertainty over infinite-dimensional parameters.
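For the jointly Gaussian case the conditional entropy has a closed form, h(X \mid Y) = \frac{1}{2}\log\left(2\pi e\,\sigma_X^2 (1-\rho^2)\right), which the sketch below verifies against the definition h(X,Y) - h(Y); the correlation and variances are illustrative assumptions.

```python
# Conditional differential entropy of a bivariate Gaussian, computed two ways,
# plus the check that conditioning does not increase entropy.
import numpy as np

rho, sx, sy = 0.7, 1.5, 1.0
K = np.array([[sx**2, rho * sx * sy],
              [rho * sx * sy, sy**2]])

h_XY = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(K))
h_Y = 0.5 * np.log(2 * np.pi * np.e * sy**2)
h_X = 0.5 * np.log(2 * np.pi * np.e * sx**2)

h_X_given_Y = h_XY - h_Y
print(h_X_given_Y, 0.5 * np.log(2 * np.pi * np.e * sx**2 * (1 - rho**2)))  # equal
print(h_X_given_Y <= h_X)   # True: conditioning never increases differential entropy
```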

Relative Differential Entropy

The relative differential entropy, commonly referred to as the Kullback-Leibler (KL) divergence, quantifies the difference between two probability density functions f and g over a continuous space. Introduced by Kullback and Leibler as a measure of information for distinguishing hypotheses, it is defined for absolutely continuous distributions as D(f \| g) = \int_{-\infty}^{\infty} f(x) \log \frac{f(x)}{g(x)} \, dx, assuming the integral exists and g(x) > 0 wherever f(x) > 0. This expression can be equivalently rewritten using the differential entropy h(f) as D(f \| g) = -h(f) - \int_{-\infty}^{\infty} f(x) \log g(x) \, dx, where the second term is the cross-entropy between f and g. Alternatively, it takes the form of an expectation under f: D(f \| g) = \mathbb{E}_{X \sim f} \left[ \log \frac{f(X)}{g(X)} \right]. These formulations highlight its role as the expected excess log-likelihood when approximating f with g.

The KL divergence exhibits key properties that distinguish it from symmetric distances. It is asymmetric, meaning D(f \| g) \neq D(g \| f) in general, reflecting the directional nature of information loss from one distribution to another. It is also non-negative, D(f \| g) \geq 0, with equality if and only if f = g almost everywhere; this follows from Jensen's inequality applied to the convex function t \mapsto -\log t, or equivalently from Gibbs' inequality. Non-negativity makes it a valid divergence measure, though it is not a true metric because of the asymmetry and the lack of a triangle inequality. In relation to differential entropy, the KL divergence measures the excess uncertainty incurred by describing samples from f using g as a reference, vanishing precisely when the distributions coincide, which makes it a natural gauge of distributional similarity.

Applications abound in statistical inference. In variational inference, minimizing D(q \| p) (or its reverse) approximates intractable posteriors by optimizing a tractable q to bound the model evidence, as in mean-field methods for graphical models. In model selection, asymptotic expansions of the KL divergence underpin criteria like the Akaike information criterion (AIC), which penalizes model complexity to select the distribution closest to the true data-generating process. Furthermore, Pinsker's inequality relates it to the total variation distance, \frac{1}{2} \| f - g \|_1^2 \leq D(f \| g), providing a bound on the L^1 difference in terms of the divergence, useful for convergence analysis in density estimation. A notable connection arises with continuous mutual information, defined as I(X; Y) = D(p_{XY} \| p_X p_Y) = \iint p_{XY}(x,y) \log \frac{p_{XY}(x,y)}{p_X(x) p_Y(y)} \, dx \, dy, which measures the dependence between random variables X and Y as the KL divergence from the joint density to the product of the marginals; it coincides with the mutual information defined earlier and vanishes for independent variables.
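As a worked example (with arbitrary illustrative Gaussian parameters), the sketch below evaluates D(f \| g) for two univariate Gaussians numerically, compares it with the known closed form \log(\sigma_2/\sigma_1) + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}, and checks Pinsker's inequality.

```python
# KL divergence between two Gaussians: numerical integral vs. closed form,
# together with a check of Pinsker's inequality 0.5*||f - g||_1^2 <= D(f || g).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

f = norm(loc=0.0, scale=1.0)
g = norm(loc=1.0, scale=2.0)

kl_numeric, _ = quad(lambda x: f.pdf(x) * (f.logpdf(x) - g.logpdf(x)), -15, 15)
kl_closed = np.log(2.0 / 1.0) + (1.0**2 + (0.0 - 1.0)**2) / (2 * 2.0**2) - 0.5

l1, _ = quad(lambda x: abs(f.pdf(x) - g.pdf(x)), -15, 15)
print(kl_numeric, kl_closed)       # both ≈ 0.443 nats
print(0.5 * l1**2 <= kl_numeric)   # True: Pinsker's inequality holds
```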
