Differential entropy

Differential entropy is a measure of uncertainty for continuous random variables, serving as the continuous analog of Shannon's discrete entropy. It is defined as h(X) = -\int f_X(x) \log f_X(x) \, dx, where f_X is the probability density function of the random variable X and the integral is taken over the support of f_X. This quantity, introduced by Claude Shannon in his seminal 1948 paper on the mathematical theory of communication, quantifies the expected information content, or average surprisal, in observing outcomes from a continuous distribution. Unlike discrete entropy, which is always non-negative and represents the minimum number of bits needed to encode outcomes, differential entropy can be negative, reflecting that continuous distributions lack a natural finite "alphabet" and that the value depends on the choice of units or scale. It is invariant under translations, so h(X + c) = h(X) for any constant c, but changes under linear transformations: h(aX) = h(X) + \log |a| for scalar a \neq 0, and more generally h(AX) = h(X) + \log |\det(A)| for an invertible matrix A.

Differential entropy connects to discrete entropy through quantization: if a continuous random variable is discretized into bins of width \Delta, the discrete entropy H(X_\Delta) approximates h(X) - \log \Delta, and the limit \Delta \to 0 highlights differential entropy's role as a limiting case of discrete entropy. Key properties include the joint differential entropy h(X,Y) = -\iint f_{X,Y}(x,y) \log f_{X,Y}(x,y) \, dx \, dy, which satisfies subadditivity h(X,Y) \leq h(X) + h(Y) with equality if X and Y are independent, and the conditional entropy h(X|Y) = h(X,Y) - h(Y), which is non-increasing under conditioning: h(X|Y) \leq h(X). The asymptotic equipartition property (AEP) extends to continuous i.i.d. sequences, where the probability of the typical set approaches 1 and its volume is approximately 2^{n h(X)} for n samples (with entropy measured in bits), enabling analysis of compression and typical behavior in continuous sources.

Among all distributions with fixed variance, the Gaussian achieves the maximum differential entropy, given by h(X) = \frac{1}{2} \log (2 \pi e \sigma^2) for a univariate normal with variance \sigma^2, or more generally \frac{1}{2} \log ((2\pi e)^n |\mathbf{K}|) for a multivariate normal with covariance matrix \mathbf{K}. This maximum entropy principle underscores its utility in modeling noise and signals. Differential entropy plays a central role in continuous-channel capacity derivations, rate-distortion theory for analog sources, and estimation bounds analogous to Fano's inequality, which relate estimation error to entropy: \mathbb{E}[(X - \hat{X})^2] \geq \frac{1}{2\pi e} 2^{2 h(X)} (with h in bits). These aspects make it indispensable for applications in signal processing, communications, and statistical inference involving continuous data.

Fundamentals

Definition

Differential entropy, also known as continuous entropy, is a measure of uncertainty for continuous random variables in information theory. For a continuous random variable X with probability density function f(x), the differential entropy h(X) is defined as h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx, where the integral is taken over the support of f and the logarithm is typically base 2 (yielding bits) or natural (yielding nats). This definition extends the concept of Shannon entropy from discrete to continuous distributions and was first introduced by Claude Shannon in his foundational work on communication theory.

The differential entropy arises as the limiting case of discrete Shannon entropy applied to a quantized version of the continuous random variable. Consider partitioning the real line into small intervals of width \Delta and approximating f(x) by a discrete distribution with probabilities p_i = f(x_i) \Delta for interval centers x_i; the discrete entropy of this approximation is H(p) \approx h(X) + \log (1/\Delta), and in the limit as \Delta \to 0, h(X) = \lim_{\Delta \to 0} [H(p) - \log (1/\Delta)]. This derivation highlights that differential entropy is not an absolute measure of information content but an asymptotic quantity relative to the discretization scale.

Unlike discrete entropy, which is dimensionless and non-negative, differential entropy carries units dependent on the measurement scale of X and can take negative values. In particular, it is not invariant under linear transformations: if Y = aX + b with a \neq 0, then h(Y) = h(X) + \log |a|, reflecting how rescaling the variable (e.g., changing units from meters to centimeters) shifts the entropy by the logarithm of the scaling factor. For the definition to be well posed, the distribution of X must be absolutely continuous with respect to Lebesgue measure (so that the density f exists), and the integral must converge absolutely, ensuring a finite entropy.
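As a minimal illustration of the definition (not taken from a cited source), the sketch below numerically integrates -f(x) ln f(x) for a Gaussian density and compares the result with the closed form \frac{1}{2} \ln(2\pi e \sigma^2); the choice \sigma = 2 and the finite integration range are assumptions made purely for the example.

```python
# Numerical check of the definition h(X) = -∫ f(x) ln f(x) dx (in nats),
# using a Gaussian test density with an illustrative standard deviation.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

sigma = 2.0
f = norm(loc=0.0, scale=sigma).pdf

# Integrand -f(x) ln f(x); the range ±12σ keeps f strictly positive
# while capturing essentially all of the mass.
integrand = lambda x: -f(x) * np.log(f(x))

h_numeric, _ = quad(integrand, -12 * sigma, 12 * sigma)
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)  # Gaussian closed form

print(h_numeric, h_closed)  # both ≈ 2.112 nats for sigma = 2
```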

Relation to Discrete Entropy

Differential entropy extends Claude Shannon's discrete entropy to continuous random variables, providing a measure of uncertainty or information content for probability density functions rather than discrete probability mass functions. Introduced by Shannon in 1948 as part of the foundational work on information theory for continuous channels, it formalizes the notion of entropy in scenarios where signals or noise are modeled continuously, such as in communication systems.

Despite this analogy, differential entropy differs fundamentally from its discrete counterpart and is not directly comparable to it. Discrete entropy is always non-negative and represents an absolute measure of uncertainty, whereas differential entropy can be negative and depends on the units of measurement. A negative value arises when the distribution is more concentrated than a uniform distribution over a unit interval (e.g., [0,1]), which has zero differential entropy, indicating lower uncertainty relative to that baseline. Additionally, scaling the random variable X to aX (with a \neq 0) shifts the differential entropy by \log |a|, highlighting its relative nature, tied to the chosen scale rather than to an absolute quantity. Interpretation pitfalls include this unit dependence and the need for careful normalization to ensure that additivity for independent variables holds, since the joint entropy equals the sum of the individual entropies only when the joint density factors into the product of the marginals.

The precise relation between the two entropies emerges from a discretization process. Consider partitioning the support of a continuous random variable into small bins of width \Delta. The resulting discrete random variable, with probabilities p_i \approx f(x_i) \Delta where f is the density, has entropy H approximating the differential entropy h(X) adjusted for the bin size: H \approx h(X) - \log \Delta. This approximation becomes exact in the limit as \Delta \to 0, where the -\log \Delta term (diverging to +\infty) compensates for the infinite resolution of the continuous case, ensuring that the discrete entropy remains non-negative while the differential entropy stays finite. This limiting procedure underscores why differential entropy is not a limiting case of discrete entropy without the adjustment term, emphasizing the conceptual shift from finite to infinite sample spaces.
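A small sketch of the discretization argument follows; it is illustrative only, with the standard normal, the bin widths, and the truncation at ±10 chosen as assumptions for the example. The quantity H + ln Δ should approach h(X) ≈ 1.4189 nats as the bins shrink.

```python
# Illustration of H([X]_Δ) ≈ h(X) - ln Δ for a standard normal: the discrete
# entropy of the binned density, plus ln Δ, approaches the differential entropy.
import numpy as np
from scipy.stats import norm

h_true = 0.5 * np.log(2 * np.pi * np.e)   # h(X) for N(0,1), ≈ 1.4189 nats

for delta in [1.0, 0.1, 0.01]:
    edges = np.arange(-10.0, 10.0 + delta, delta)
    p = np.diff(norm.cdf(edges))          # bin probabilities p_i ≈ f(x_i) * delta
    p = p[p > 0]
    H_discrete = -np.sum(p * np.log(p))   # discrete entropy of the quantized variable
    print(delta, H_discrete + np.log(delta), h_true)
```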

Properties

Basic Properties

Differential entropy exhibits invariance under translation. Consider a continuous random variable X with density f_X(x), so that its differential entropy is h(X) = -\int_{-\infty}^{\infty} f_X(x) \log f_X(x) \, dx. For a constant c, let Y = X + c; the density of Y is f_Y(y) = f_X(y - c). Substituting into the entropy integral gives h(Y) = -\int_{-\infty}^{\infty} f_X(y - c) \log f_X(y - c) \, dy. By the change of variable u = y - c, this simplifies to -\int_{-\infty}^{\infty} f_X(u) \log f_X(u) \, du = h(X), demonstrating the invariance.

The scaling property introduces a dependence on units. For a scalar a \neq 0, let Y = aX; the density of Y is f_Y(y) = \frac{1}{|a|} f_X\left(\frac{y}{a}\right). Thus, h(Y) = -\int_{-\infty}^{\infty} \frac{1}{|a|} f_X\left(\frac{y}{a}\right) \log \left[ \frac{1}{|a|} f_X\left(\frac{y}{a}\right) \right] dy = -\int_{-\infty}^{\infty} \frac{1}{|a|} f_X\left(\frac{y}{a}\right) \left[ \log \frac{1}{|a|} + \log f_X\left(\frac{y}{a}\right) \right] dy. Substituting z = y/a yields h(Y) = h(X) + \log |a|, showing that rescaling shifts the entropy by the logarithm of the scaling factor, which accounts for changes in measurement units.

Unlike discrete entropy, which is always non-negative, differential entropy can be negative, highlighting its dependence on the measurement scale. For a random variable X with bounded support S of finite volume V = \int_S dx, the differential entropy satisfies h(X) \leq \log V, with equality achieved when X is uniformly distributed over S. This upper bound arises because the uniform distribution maximizes differential entropy for a fixed support, and \log V < 0 when V < 1, allowing negative values even at the maximum.

Differential entropy is continuous with respect to suitable convergence of densities. Specifically, if a sequence of densities f_n converges to f in the L^1 norm (i.e., \int |f_n - f| \, dx \to 0) and satisfies suitable integrability conditions for the entropy to be well defined, then h(f_n) \to h(f). This continuity ensures that small perturbations in the density lead to small changes in entropy, facilitating analysis in estimation and approximation contexts.
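The translation and scaling properties can be checked directly with closed-form entropies; the sketch below is an assumed example using an exponential variable (whose entropy with unit rate is 1 nat) and arbitrary illustrative values of the shift c and scale a.

```python
# Check of h(X + c) = h(X) and h(aX) = h(X) + ln|a| for an exponential variable
# with rate 1; scipy's entropy() returns differential entropy in nats.
import numpy as np
from scipy.stats import expon

a, c = 3.0, 5.0
h_X = expon(scale=1.0).entropy()             # h(X) = 1 nat for rate 1
h_shift = expon(loc=c, scale=1.0).entropy()  # shifting leaves the entropy unchanged
h_scale = expon(scale=a).entropy()           # aX is exponential with scale a

print(h_X, h_shift)               # both 1.0
print(h_scale, h_X + np.log(a))   # both 1 + ln 3 ≈ 2.0986
```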

Chain Rule and Additivity

The joint differential entropy of a random vector \mathbf{X} = (X_1, \dots, X_n) in \mathbb{R}^n, with joint probability density function f_{\mathbf{X}}(\mathbf{x}), extends the scalar case and is given by h(\mathbf{X}) = -\int_{\mathbb{R}^n} f_{\mathbf{X}}(\mathbf{x}) \log f_{\mathbf{X}}(\mathbf{x}) \, d\mathbf{x}. This measure quantifies the uncertainty in the joint distribution of the vector, analogous to the scalar differential entropy but integrated over the higher-dimensional space.

A key property is the chain rule for differential entropy, which decomposes the joint entropy into a sum of conditional entropies: h(X_1, \dots, X_n) = \sum_{i=1}^n h(X_i \mid X_1, \dots, X_{i-1}), where h(X_i \mid X_1, \dots, X_{i-1}) = -\int f(x_i \mid x_1, \dots, x_{i-1}) \log f(x_i \mid x_1, \dots, x_{i-1}) \, dx_i, averaged over the conditioning variables. This decomposition holds whenever the relevant densities exist and follows from the factorization of the joint density into conditional densities. If the random variables X_1, \dots, X_n are mutually independent, each conditional entropy reduces to the corresponding marginal entropy, yielding h(X_1, \dots, X_n) = \sum_{i=1}^n h(X_i), demonstrating additivity of the joint entropy under independence.

The conditional differential entropy h(X \mid Y) for jointly continuous random variables X and Y with joint density f_{X,Y}(x,y) is defined as h(X \mid Y) = h(X,Y) - h(Y) = -\iint f_{X,Y}(x,y) \log f_{X \mid Y}(x \mid y) \, dx \, dy, where the outer integral over y averages the per-value conditional entropies. This quantity represents the residual uncertainty in X given knowledge of Y. A fundamental inequality is h(X \mid Y) \leq h(X), with equality if and only if X and Y are independent, reflecting that conditioning cannot increase differential entropy. This follows directly from the non-negativity of mutual information, I(X;Y) = h(X) - h(X \mid Y) \geq 0. For the joint case, subadditivity holds as h(X,Y) = h(X) + h(Y \mid X) \leq h(X) + h(Y), again with equality under independence.

For the sum of independent random variables, adding an independent component cannot decrease differential entropy: h(X + Y) \geq \max(h(X), h(Y)), since h(X + Y) \geq h(X + Y \mid Y) = h(X \mid Y) = h(X) by translation invariance and independence. Unlike the discrete case, the upper bound h(X + Y) \leq h(X) + h(Y) does not hold in general, because conditional differential entropies can be negative; a sharper lower bound is provided by the entropy power inequality, e^{2 h(X+Y)} \geq e^{2 h(X)} + e^{2 h(Y)}, with equality when X and Y are Gaussian.

In continuous channels, the data processing inequality preserves the structure of the discrete case. For a Markov chain X \to Y \to Z where X, Y, Z are continuous random variables, the mutual information satisfies I(X; Z) \leq I(X; Y), with equality if Z is a sufficient statistic for X given Y. This implies that processing through a continuous channel cannot increase information about the input, and it extends to differential entropies via I(X;Y) = h(X) - h(X \mid Y). The inequality holds under the existence of densities and is proven using the chain rule and the non-negativity of relative entropy.
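As a concrete check of the chain rule and subadditivity, the sketch below uses a bivariate Gaussian, for which all the entropies have closed forms; the correlation and variances are illustrative assumptions.

```python
# Chain rule h(X, Y) = h(X) + h(Y | X) and subadditivity h(X, Y) <= h(X) + h(Y)
# for a bivariate Gaussian, evaluated from the closed-form expressions.
import numpy as np

rho, sx, sy = 0.8, 1.0, 2.0
K = np.array([[sx**2, rho * sx * sy],
              [rho * sx * sy, sy**2]])

h_joint = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(K))
h_X = 0.5 * np.log(2 * np.pi * np.e * sx**2)
h_Y = 0.5 * np.log(2 * np.pi * np.e * sy**2)
# Given X, Y is Gaussian with conditional variance sy^2 * (1 - rho^2).
h_Y_given_X = 0.5 * np.log(2 * np.pi * np.e * sy**2 * (1 - rho**2))

print(h_joint, h_X + h_Y_given_X)   # equal: chain rule
print(h_joint <= h_X + h_Y)         # True: subadditivity, strict since rho != 0
```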

Maximum Entropy

Gaussian Maximization Theorem

The Gaussian maximization theorem asserts that, among all continuous probability distributions for a random variable X with fixed variance \sigma^2, the differential entropy h(X) is maximized uniquely by the normal distribution \mathcal{N}(\mu, \sigma^2), where the maximum value is \frac{1}{2} \log (2 \pi e \sigma^2) in nats. This result holds regardless of the mean \mu, since shifting the distribution does not affect the entropy. The theorem underscores the Gaussian distribution's role as the embodiment of maximum uncertainty under a second-moment constraint, a principle central to information theory that favors the least informative prior consistent with the observed variance. It has profound implications, such as establishing the capacity of the additive white Gaussian noise channel as \frac{1}{2} \log \left(1 + \frac{P}{N}\right), where the input distribution maximizing mutual information is Gaussian.

The theorem generalizes to multivariate cases: for an n-dimensional random vector X with fixed covariance matrix \Sigma, the maximum differential entropy is achieved by the multivariate Gaussian \mathcal{N}(\mu, \Sigma), yielding h(X) = \frac{n}{2} \log (2 \pi e) + \frac{1}{2} \log \det(\Sigma) nats. For uncorrelated components with variances \sigma_i^2, this simplifies to \frac{n}{2} \log (2 \pi e) + \sum_{i=1}^n \log \sigma_i. Without variance or similar constraints, differential entropy is unbounded above, as densities can be made arbitrarily flat, while it approaches -\infty for distributions degenerating toward zero variance.
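The sketch below illustrates the theorem numerically: several densities are rescaled to unit variance (the specific scale factors are standard variance formulas, assumed here for the example) and their entropies are compared against the Gaussian maximum \frac{1}{2}\ln(2\pi e) \approx 1.4189 nats.

```python
# Unit-variance entropies compared against the Gaussian maximum 0.5*ln(2*pi*e).
import numpy as np
from scipy.stats import norm, laplace, uniform

gauss_bound = 0.5 * np.log(2 * np.pi * np.e)

candidates = {
    "gaussian": norm(scale=1.0),
    "laplace":  laplace(scale=1.0 / np.sqrt(2.0)),      # Var = 2 b^2 = 1
    "uniform":  uniform(loc=0.0, scale=np.sqrt(12.0)),  # Var = a^2 / 12 = 1
}
for name, dist in candidates.items():
    print(name, dist.entropy(), "<=", gauss_bound)      # only the Gaussian attains it
```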

Proof of the Theorem

To prove the theorem, consider the problem of maximizing the differential entropy h(X) = -\mathbb{E}[\log f(X)] = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx over all probability density functions f on \mathbb{R}, subject to the normalization constraint \int_{-\infty}^{\infty} f(x) \, dx = 1 and the fixed second-moment constraint \mathbb{E}[X^2] = \int_{-\infty}^{\infty} x^2 f(x) \, dx = \sigma^2 (assuming without loss of generality that \mathbb{E}[X] = 0). This is a constrained optimization problem in the space of densities, solved using the method of Lagrange multipliers for functionals. Introduce the Lagrangian functional \mathcal{L} = -\int f \log f \, dx + \lambda \left( \int f \, dx - 1 \right) - \mu \left( \int x^2 f \, dx - \sigma^2 \right), where \lambda is the multiplier for normalization and \mu > 0 for the variance constraint. Setting the functional derivative with respect to f to zero, \frac{\delta \mathcal{L}}{\delta f} = -\log f - 1 + \lambda - \mu x^2 = 0, yields f(x) = \exp(\lambda - 1 - \mu x^2), where the normalization constant is incorporated via \lambda. This form is proportional to a Gaussian density. To identify the parameters, impose the second-moment constraint: the variance of this distribution is 1/(2\mu), so \mu = 1/(2\sigma^2) to match \sigma^2. The normalization then gives the density of \mathcal{N}(0, \sigma^2).

An alternative proof uses the non-negativity of the Kullback-Leibler (KL) divergence. Let g denote the density of \mathcal{N}(0, \sigma^2). Then D(f \| g) = \int f \log \frac{f}{g} \, dx = -\int f \log g \, dx - h(f) \geq 0, with equality if and only if f = g almost everywhere. Rearranging gives h(f) \leq -\int f \log g \, dx. Now \log g(x) = -\frac{1}{2} \log(2\pi \sigma^2) - \frac{x^2}{2\sigma^2}, so -\int f \log g \, dx = \frac{1}{2} \log(2\pi \sigma^2) + \frac{1}{2\sigma^2} \int x^2 f(x) \, dx = \frac{1}{2} \log(2\pi e \sigma^2), where the last step substitutes the constraint \int x^2 f = \sigma^2 and adds the constant \frac{1}{2} \log e = \frac{1}{2} (in nats). Thus h(f) \leq \frac{1}{2} \log(2\pi e \sigma^2), which is precisely the differential entropy of the Gaussian, with equality only for the Gaussian density.

To verify, compute the differential entropy of the Gaussian directly: for X \sim \mathcal{N}(0, \sigma^2), h(X) = -\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{x^2}{2\sigma^2} \right) \log \left[ \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{x^2}{2\sigma^2} \right) \right] dx. The logarithm expands to -\frac{1}{2} \log (2\pi \sigma^2) - \frac{x^2}{2\sigma^2}, so the integral separates into \frac{1}{2} \log(2\pi \sigma^2) + \frac{1}{2\sigma^2} \mathbb{E}[X^2] = \frac{1}{2} \log(2\pi \sigma^2) + \frac{1}{2} = \frac{1}{2} \log(2\pi e \sigma^2), confirming that the bound is achieved.

Examples

Exponential Distribution

The exponential distribution with rate parameter \lambda > 0 has probability density function
f(x) = \lambda e^{-\lambda x}, \quad x \geq 0,
and mean 1/\lambda. This distribution models phenomena such as inter-arrival times in Poisson processes and is characterized by its memoryless property.
The differential entropy h(X) of a continuous random variable X with density f is h(X) = -\int f(x) \log f(x) \, dx. For the exponential distribution,
h(X) = -\int_{0}^{\infty} \lambda e^{-\lambda x} \log(\lambda e^{-\lambda x}) \, dx = -\int_{0}^{\infty} \lambda e^{-\lambda x} (\log \lambda - \lambda x) \, dx.
The first term evaluates to -\log \lambda \int_{0}^{\infty} \lambda e^{-\lambda x} \, dx = -\log \lambda, since the integral of the density is 1. The second term is \lambda \int_{0}^{\infty} x \lambda e^{-\lambda x} \, dx = \lambda \mathbb{E}[X] = \lambda \cdot (1/\lambda) = 1. Thus, h(X) = 1 - \log \lambda (in nats), derived by direct integration.
This entropy value increases with the mean 1/\lambda, as larger means correspond to broader spreads and greater uncertainty in the distribution. For \lambda > e, the entropy is negative, a feature unique to differential entropy that does not imply negative information but reflects the continuous nature relative to a uniform reference measure. The exponential distribution maximizes the differential entropy among all continuous distributions supported on [0, \infty) with fixed mean 1/\lambda. It serves as the continuous analog of the geometric distribution, which maximizes entropy among discrete non-negative integer-valued random variables with fixed mean.
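A brief numerical sketch of the formula h(X) = 1 - \ln \lambda follows; the rate values are illustrative, and scipy's exponential is parametrized by the scale 1/\lambda rather than the rate.

```python
# h(X) = 1 - ln(lambda) for the exponential distribution: positive for small
# rates, zero at lambda = e, and negative beyond.
import numpy as np
from scipy.stats import expon

for lam in [0.5, 1.0, np.e, 5.0]:
    h_closed = 1.0 - np.log(lam)
    h_scipy = expon(scale=1.0 / lam).entropy()  # scipy uses scale = 1/lambda
    print(lam, h_closed, h_scipy)               # the two values agree
```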

Uniform Distribution

The uniform distribution over an interval [b, b+a] has density f(x) = \frac{1}{a} for x \in [b, b+a] and f(x) = 0 otherwise. The differential entropy h(X) is computed directly from the definition: h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx = -\int_{b}^{b+a} \frac{1}{a} \log \left( \frac{1}{a} \right) \, dx = \log a, where the logarithm is the natural logarithm (nats). This value is independent of the location b, depending only on the length a of the interval. The entropy \log a grows logarithmically with the interval length, quantifying the uncertainty in locating the variable within the fixed support; a larger a spreads the possible outcomes over a wider range, hence higher uncertainty.

Among all distributions supported on an interval of fixed length a, the uniform distribution achieves the maximum differential entropy, as established by the non-negativity of the Kullback-Leibler divergence between any density f on the interval and the uniform density g(x) = 1/a: D(f \| g) = \log a - h(X) \geq 0, implying h(X) \leq \log a.

In quantization theory, the uniform distribution provides a reference for bounding the entropy of discretized approximations to continuous random variables. As the quantization bin size \Delta \to 0, the discrete entropy H(X^\Delta) of the quantized variable X^\Delta satisfies H(X^\Delta) + \log \Delta \to h(X), a relation that is particularly transparent in the uniform case, linking differential entropy to the limiting behavior of discrete entropy.

The differential entropy of the uniform distribution can be negative when a < 1, for instance h(X) = \log 0.5 < 0 for a = 0.5, which underscores the unit dependence of differential entropy, unlike its non-negative discrete counterpart. This property aligns with the scaling behavior of differential entropy under linear transformations.
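The sketch below illustrates two of these points: h(X) = \log a for a few interval lengths (including a negative value for a < 1), and the bound h \leq \log V on a fixed support, using Beta(2,2) on [0,1] as an assumed example of a more concentrated density on the same interval.

```python
# Uniform entropy equals log(a); any other density on the same support has
# strictly lower entropy (here Beta(2,2) on [0,1] versus the bound log(1) = 0).
import numpy as np
from scipy.stats import uniform, beta

for a in [0.5, 1.0, 4.0]:
    print(a, uniform(loc=0.0, scale=a).entropy(), np.log(a))  # equal: h = log(a)

print(beta(2, 2).entropy())  # ≈ -0.125 < 0, below the uniform bound on [0, 1]
```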

Information Measures

Relation to Mutual Information

The mutual information between two continuous random variables X and Y, denoted I(X;Y), is defined as the difference between the differential entropy of X and the conditional differential entropy of X given Y: I(X;Y) = h(X) - h(X|Y). Equivalently, it can be expressed as I(X;Y) = h(X) + h(Y) - h(X,Y), where h(X,Y) is the joint differential entropy. This quantity quantifies the amount of information that X and Y share, representing the reduction in uncertainty about one variable upon knowing the other. A key property of mutual information is its non-negativity: I(X;Y) \geq 0, with equality holding if and only if X and Y are independent. This follows directly from the conditioning inequality for differential entropy, which states that h(X|Y) \leq h(X), implying that observing Y cannot increase the uncertainty about X. For multiple variables, the chain rule extends mutual information analogously to the discrete case: I(X_1, \dots, X_n; Y) = \sum_{i=1}^n I(X_i; Y \mid X_1, \dots, X_{i-1}), where the conditional mutual information I(X_i; Y \mid X_1, \dots, X_{i-1}) measures the additional shared information between X_i and Y given the previous variables. In information theory, interprets the shared uncertainty between variables, with serving as the foundational building block for analyzing continuous systems. It plays a central role in defining the capacity of continuous channels, where the capacity C is the maximum achievable I(X;Y) over input distributions subject to constraints, representing the highest rate of reliable communication. For instance, in the additive white Gaussian noise channel under a power constraint, the mutual information I(X;Y) is maximized when X is Gaussian, achieving the channel capacity C = \frac{1}{2} \log_2 \left(1 + \frac{P}{N}\right) bits per transmission, where P is the signal power and N is the noise power.
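Two standard continuous-case computations illustrate these definitions; the correlation coefficient and the power values below are assumptions chosen only for the example.

```python
# Mutual information of a correlated bivariate Gaussian via I = h(X)+h(Y)-h(X,Y),
# which reduces to -0.5*ln(1 - rho^2), and the AWGN capacity 0.5*log2(1 + P/N).
import numpy as np

rho = 0.9
h_X = h_Y = 0.5 * np.log(2 * np.pi * np.e)                   # unit variances
h_XY = 0.5 * np.log((2 * np.pi * np.e) ** 2 * (1 - rho**2))  # det K = 1 - rho^2
I = h_X + h_Y - h_XY
print(I, -0.5 * np.log(1 - rho**2))   # both ≈ 0.830 nats

P, N = 4.0, 1.0
print(0.5 * np.log2(1 + P / N))       # AWGN capacity ≈ 1.161 bits per channel use
```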

Connection to Estimator Error

The Cramér-Rao bound establishes a fundamental limit on the performance of unbiased estimators: for an unbiased estimator \hat{\theta} of a parameter \theta based on n i.i.d. observations, the variance satisfies \operatorname{Var}(\hat{\theta}) \geq \frac{1}{n I(\theta)}, where the Fisher information is I(\theta) = \mathbb{E}\left[\left(\frac{\partial \log f(X;\theta)}{\partial \theta}\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2 \log f(X;\theta)}{\partial \theta^2}\right]. The Fisher information quantifies the sensitivity of the likelihood to changes in \theta and connects to the differential entropy h(X;\theta) = -\mathbb{E}[\log f(X;\theta)]: viewing the expected log-likelihood \ell(\theta') = \mathbb{E}_{\theta}[\log f(X;\theta')] as a function of \theta', its value at \theta' = \theta equals -h(X;\theta), and its negative second derivative at that point equals I(\theta). Higher Fisher information, corresponding to lower entropy for a fixed support, implies tighter bounds on the estimation variance.

In estimation from noisy observations, the entropy power inequality provides insight into the minimal mean squared error (MSE). The inequality asserts that for independent scalar random variables X and Z, h(X + Z) \geq \frac{1}{2} \log\left( e^{2 h(X)} + e^{2 h(Z)} \right) (in nats), with equality when X and Z are Gaussian; loosely, greater differential entropy h(X) is associated with larger minimal MSE when estimating X from Y = X + Z for fixed noise variance \operatorname{Var}(Z). This connection is made quantitative by the I-MMSE relation \frac{d}{d\,\mathrm{snr}} I\left(X; \sqrt{\mathrm{snr}}\, X + Z\right) = \frac{1}{2} \operatorname{mmse}(\mathrm{snr}) for standard Gaussian noise Z, which ties mutual information, and hence differential entropy, to the minimum MSE across signal-to-noise ratios.

In Bayesian estimation, the posterior differential entropy h(\theta \mid \mathbf{X}) serves as a measure of residual uncertainty about the parameter \theta after incorporating the data \mathbf{X}, with lower posterior entropy indicating more precise inference. Optimal Bayesian estimators, such as those minimizing expected Bregman divergence, aim to concentrate the posterior, thereby reducing this entropy as a proxy for estimation quality. Asymptotically, with large samples, the posterior becomes approximately Gaussian with covariance close to I(\theta)^{-1}/n, yielding posterior entropy roughly \frac{d}{2} \log(2\pi e / n) - \frac{1}{2} \log \det I(\theta) for a d-dimensional \theta, which characterizes the achievable large-sample uncertainty. For the Gaussian case, the differential entropy h(X) = \frac{1}{2} \log(2\pi e \sigma^2) ties directly to estimation error: the variance of the sample mean from n i.i.d. N(\mu, \sigma^2) observations is \sigma^2 / n, saturating the Cramér-Rao bound and scaling inversely with sample size while reflecting the entropy's dependence on the variance.
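The Gaussian sample-mean case at the end of this discussion can be checked empirically; the sketch below uses illustrative values of \mu, \sigma, n, and the number of Monte Carlo trials, all of which are assumptions for the example.

```python
# For n i.i.d. N(mu, sigma^2) observations the Fisher information for mu is
# n/sigma^2, so the Cramer-Rao bound is sigma^2/n, attained by the sample mean.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 3.0, 2.0, 50, 20000

samples = rng.normal(mu, sigma, size=(trials, n))
sample_means = samples.mean(axis=1)

crb = sigma**2 / n                  # Cramer-Rao bound = 1 / (n * I(mu))
print(sample_means.var(), crb)      # empirical variance ≈ 0.08, matching the bound
```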

Common Distributions

Formulas for Specific Distributions

The differential entropy formulas for several standard continuous distributions are listed in the table below, expressed in nats using the natural logarithm. These expressions assume the conventional parametrizations and are independent of location parameters where applicable, as shifting does not affect the entropy value.
Distribution | Parameters | Support | Differential entropy h(X)
Gaussian | variance \sigma^2 > 0 | (-\infty, \infty) | \frac{1}{2} \log (2 \pi e \sigma^2)
Exponential | rate \lambda > 0 | [0, \infty) | 1 - \log \lambda
Gamma | shape \alpha > 0, rate \beta > 0 | [0, \infty) | \alpha - \log \beta + \log \Gamma(\alpha) + (1 - \alpha) \psi(\alpha), where \psi is the digamma function
Laplace | scale b > 0 | (-\infty, \infty) | 1 + \log (2b)
Cauchy | scale \gamma > 0 | (-\infty, \infty) | \log (4 \pi \gamma)
Weibull | shape k > 0, scale \lambda > 0 | [0, \infty) | \gamma \left(1 - \frac{1}{k}\right) + \log \left( \frac{\lambda}{k} \right) + 1, where \gamma \approx 0.57721 is the Euler-Mascheroni constant
The exponential distribution appears as a special case of both the gamma (with \alpha = 1) and Weibull (with k = 1) distributions.
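The tabulated closed forms can be checked against scipy's entropy() method, which returns differential entropy in nats; the parameter values in the sketch below are arbitrary illustrative choices, and scipy's parametrizations (scale = 1/rate for the exponential and gamma) are noted in the comments.

```python
# Closed-form entropies from the table versus scipy.stats entropy() values.
import numpy as np
from scipy.stats import norm, expon, gamma, laplace, cauchy, weibull_min
from scipy.special import gammaln, digamma

sigma, lam = 2.0, 1.5                      # Gaussian std dev, exponential rate
shape_a, rate_b = 3.0, 2.0                 # gamma shape and rate
b, gam, k, lam_w = 0.7, 1.2, 1.8, 2.5      # Laplace scale, Cauchy scale, Weibull shape/scale

checks = [
    ("gaussian",    norm(scale=sigma),               0.5 * np.log(2 * np.pi * np.e * sigma**2)),
    ("exponential", expon(scale=1 / lam),            1 - np.log(lam)),
    ("gamma",       gamma(shape_a, scale=1 / rate_b), shape_a - np.log(rate_b)
                                                      + gammaln(shape_a)
                                                      + (1 - shape_a) * digamma(shape_a)),
    ("laplace",     laplace(scale=b),                 1 + np.log(2 * b)),
    ("cauchy",      cauchy(scale=gam),                np.log(4 * np.pi * gam)),
    ("weibull",     weibull_min(k, scale=lam_w),      np.euler_gamma * (1 - 1 / k)
                                                      + np.log(lam_w / k) + 1),
]
for name, dist, closed in checks:
    print(f"{name:12s} scipy={dist.entropy():.6f} closed={closed:.6f}")  # pairs agree
```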

Comparison Across Distributions

Among all continuous distributions with a fixed variance, the Gaussian distribution achieves the maximum differential entropy, a result derived from the principle of maximum entropy subject to a second-moment constraint. This positions the Gaussian as the distribution of maximum uncertainty under this constraint, with differential entropy \frac{1}{2} \log (2 \pi e \sigma^2) for variance \sigma^2. Distributions with heavier tails, such as the Cauchy distribution, can attain large differential entropy values; for a Cauchy distribution with scale parameter \gamma, the differential entropy is \log (4 \pi \gamma), which exceeds that of a Gaussian with comparable scale. However, the infinite variance of the Cauchy prevents a direct comparison within the fixed-variance framework. For distributions constrained to the positive real line with fixed mean \mu > 0, the exponential distribution maximizes the differential entropy, yielding 1 + \log \mu.

Shape parameters within parametric families reveal systematic patterns in differential entropy. For the Weibull distribution with fixed scale \lambda, as the shape parameter k approaches 1, the distribution converges to the exponential distribution and the differential entropy approaches 1 + \log \lambda, the entropy of an exponential with mean \lambda. Across families, differential entropy generally increases with kurtosis for fixed variance up to the Gaussian limit (kurtosis of 3), beyond which heavier tails reduce entropy relative to the maximum.

In multivariate isotropic cases with covariance \sigma^2 I_d in d dimensions, differential entropy scales linearly with dimension for the Gaussian, given by \frac{d}{2} \log (2 \pi e \sigma^2), reflecting additivity across independent components. Other distributions sharing this covariance exhibit lower entropies but follow a similar linear scaling, with the gap to the Gaussian bound widening for non-Gaussian forms in higher dimensions. The following table compares differential entropies for selected univariate distributions normalized to unit variance, highlighting the Gaussian's supremacy and the decreasing order for heavier-tailed and more concentrated alternatives:
Distribution | Parameter | Differential entropy (nats)
Gaussian | - | 1.419
Student-t | \nu = 5 | 1.369
Laplace | - | 1.347
Uniform | - | 1.243
Student-t | \nu = 3 | 1.222
These values confirm the ordering Gaussian > Student-t (\nu=5) > Laplace > uniform > Student-t (\nu=3), where decreasing degrees of freedom in the Student-t introduce heavier tails and correspondingly lower entropy under the unit-variance constraint.
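The table values can be reproduced approximately with the sketch below, which rescales each distribution to unit variance using the standard variance formulas (the scale factors are derived assumptions, e.g. \operatorname{Var} = \nu/(\nu-2) for the Student-t) and evaluates the differential entropy in nats.

```python
# Unit-variance differential entropies reproducing the comparison table.
import numpy as np
from scipy.stats import norm, t, laplace, uniform

unit_variance = {
    "gaussian":       norm(scale=1.0),
    "student-t nu=5": t(df=5, scale=np.sqrt(3.0 / 5.0)),     # Var = nu/(nu-2), rescaled
    "laplace":        laplace(scale=1.0 / np.sqrt(2.0)),     # Var = 2 b^2
    "uniform":        uniform(loc=0.0, scale=np.sqrt(12.0)), # Var = a^2 / 12
    "student-t nu=3": t(df=3, scale=np.sqrt(1.0 / 3.0)),
}
for name, dist in unit_variance.items():
    print(f"{name:15s} {dist.entropy():.3f}")   # ≈ 1.419, 1.37, 1.347, 1.243, 1.22
```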

Variants

Conditional Differential Entropy

The conditional differential entropy of a continuous random variable X given a fixed value y of another continuous random variable Y with joint density f_{X,Y}(x,y) is defined as h(X \mid Y = y) = -\int_{-\infty}^{\infty} f_{X \mid Y}(x \mid y) \log f_{X \mid Y}(x \mid y) \, dx, where f_{X \mid Y}(x \mid y) = f_{X,Y}(x,y)/f_Y(y) denotes the conditional density of X given Y = y, assuming it exists. The average conditional differential entropy h(X \mid Y) is then obtained by taking the expectation over the distribution of Y: h(X \mid Y) = \mathbb{E}_Y \left[ h(X \mid Y = y) \right] = -\iint f_{X,Y}(x,y) \log f_{X \mid Y}(x \mid y) \, dx \, dy. This measure extends the concept of differential entropy to scenarios where partial information about Y reduces the uncertainty in X.

Key properties of conditional differential entropy mirror those of its discrete counterpart but account for the peculiarities of continuous distributions. Notably, it is non-increasing under additional conditioning: h(X \mid Y, Z) \leq h(X \mid Y) for any Z, with equality if Z is conditionally independent of X given Y (i.e., Z \perp X \mid Y), reflecting that additional information cannot increase uncertainty. Unlike discrete conditional entropy, which is always non-negative, conditional differential entropy can take negative values, since the underlying differential entropy itself may be negative for sufficiently concentrated densities. The chain rule for differential entropy holds in conditional form, enabling the decomposition of joint differential entropies into sums involving conditional terms.

In the context of stochastic processes, conditional differential entropy serves as a measure of prediction error, quantifying the residual uncertainty in future states after observing the past. For instance, in linear prediction tasks it bounds the minimum mean squared error via relations like the Kolmogorov-Szegő formula, where lower conditional entropy corresponds to tighter predictions. Computing h(X \mid Y) poses challenges when X and Y exhibit complex dependencies, as it requires estimating the full joint density f_{X,Y}, often demanding high-dimensional integration or approximation techniques such as Monte Carlo methods or kernel density estimation.

Generalizations to infinite-dimensional settings, such as Gaussian processes, extend conditional differential entropy to function spaces, where it is defined through limits of finite-dimensional projections or via the log-determinant of conditional covariance operators. For a Gaussian process, the conditional entropy given observations is \frac{1}{2} \log \left( (2\pi e)^n \det(\Sigma_{X \mid Y}) \right) in n-dimensional approximations, with \Sigma_{X \mid Y} the conditional covariance matrix, converging appropriately in the infinite-dimensional limit. This framework is central to applications such as Gaussian process regression and Bayesian inverse problems, where it quantifies posterior uncertainty over infinite-dimensional parameters.
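For the jointly Gaussian case the conditional entropy has a closed form, h(X \mid Y) = \frac{1}{2}\log\left(2\pi e\,\sigma_X^2 (1-\rho^2)\right), which the sketch below verifies against the definition h(X,Y) - h(Y); the correlation and variances are illustrative assumptions.

```python
# Conditional differential entropy of a bivariate Gaussian, computed two ways,
# plus the check that conditioning does not increase entropy.
import numpy as np

rho, sx, sy = 0.7, 1.5, 1.0
K = np.array([[sx**2, rho * sx * sy],
              [rho * sx * sy, sy**2]])

h_XY = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(K))
h_Y = 0.5 * np.log(2 * np.pi * np.e * sy**2)
h_X = 0.5 * np.log(2 * np.pi * np.e * sx**2)

h_X_given_Y = h_XY - h_Y
print(h_X_given_Y, 0.5 * np.log(2 * np.pi * np.e * sx**2 * (1 - rho**2)))  # equal
print(h_X_given_Y <= h_X)   # True: conditioning never increases differential entropy
```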

Relative Differential Entropy

The relative differential entropy, commonly referred to as the Kullback-Leibler (KL) divergence, quantifies the difference between two probability density functions f and g over a continuous space. Introduced by Kullback and Leibler as a measure of information for distinguishing hypotheses, it is defined for absolutely continuous distributions as D(f \| g) = \int_{-\infty}^{\infty} f(x) \log \frac{f(x)}{g(x)} \, dx, assuming the integral exists and g(x) > 0 wherever f(x) > 0. This expression can be equivalently rewritten using the differential entropy h(f) as D(f \| g) = -h(f) - \int_{-\infty}^{\infty} f(x) \log g(x) \, dx, where the second term is the cross-entropy between f and g. Alternatively, it takes the form of an expectation under f: D(f \| g) = \mathbb{E}_{X \sim f} \left[ \log \frac{f(X)}{g(X)} \right]. These formulations highlight its role as the expected excess log-likelihood when approximating f with g.

The KL divergence exhibits key properties that distinguish it from symmetric distances. It is asymmetric, meaning D(f \| g) \neq D(g \| f) in general, reflecting the directional nature of information loss from one distribution to another. It is also non-negative, D(f \| g) \geq 0, with equality if and only if f = g almost everywhere; this follows from Jensen's inequality applied to the convex function t \mapsto -\log t, or equivalently from Gibbs' inequality. Non-negativity makes it a valid divergence measure, though it is not a true metric because of the asymmetry and the lack of a triangle inequality. In relation to differential entropy, the KL divergence measures the excess uncertainty incurred by describing samples from f using g as a reference, vanishing precisely when the distributions coincide, which makes it a natural gauge of distributional similarity.

Applications abound in statistical inference. In variational inference, minimizing D(q \| p) (or its reverse) approximates intractable posteriors by optimizing a tractable q to bound the model evidence, as in mean-field methods for graphical models. In model selection, asymptotic expansions of the KL divergence underpin criteria like the Akaike information criterion (AIC), which penalizes model complexity to select the distribution closest to the true data-generating process. Furthermore, Pinsker's inequality relates it to the total variation distance, \frac{1}{2} \| f - g \|_1^2 \leq D(f \| g), providing a bound on the L^1 difference in terms of the divergence, useful for convergence analysis in density estimation. A notable connection arises with continuous mutual information, defined as I(X; Y) = D(p_{XY} \| p_X p_Y) = \iint p_{XY}(x,y) \log \frac{p_{XY}(x,y)}{p_X(x) p_Y(y)} \, dx \, dy, which measures the dependence between random variables X and Y as the KL divergence from the joint density to the product of the marginals; it coincides with the mutual information defined earlier and vanishes for independent variables.
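As a worked example (with arbitrary illustrative Gaussian parameters), the sketch below evaluates D(f \| g) for two univariate Gaussians numerically, compares it with the known closed form \log(\sigma_2/\sigma_1) + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}, and checks Pinsker's inequality.

```python
# KL divergence between two Gaussians: numerical integral vs. closed form,
# together with a check of Pinsker's inequality 0.5*||f - g||_1^2 <= D(f || g).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

f = norm(loc=0.0, scale=1.0)
g = norm(loc=1.0, scale=2.0)

kl_numeric, _ = quad(lambda x: f.pdf(x) * (f.logpdf(x) - g.logpdf(x)), -15, 15)
kl_closed = np.log(2.0 / 1.0) + (1.0**2 + (0.0 - 1.0)**2) / (2 * 2.0**2) - 0.5

l1, _ = quad(lambda x: abs(f.pdf(x) - g.pdf(x)), -15, 15)
print(kl_numeric, kl_closed)       # both ≈ 0.443 nats
print(0.5 * l1**2 <= kl_numeric)   # True: Pinsker's inequality holds
```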
