Differential entropy
Differential entropy is a measure of uncertainty in information theory for continuous random variables, serving as the continuous analog to Shannon's discrete entropy. It is defined as h(X) = -\int f_X(x) \log f_X(x) \, dx, where f_X is the probability density function of the random variable X, and the integral is taken over the support of f_X.[1] This quantity, introduced by Claude Shannon in his seminal 1948 paper on communication theory, quantifies the expected information content or average surprise in observing outcomes from a continuous distribution.[2] Unlike discrete entropy, which is always non-negative and represents the minimum bits needed to encode outcomes, differential entropy can be negative, reflecting that continuous distributions lack a natural finite "alphabet" and depend on the choice of units or coordinate system.[1] For instance, it remains invariant under translations, so h(X + c) = h(X) for any constant c, but scales under linear transformations: h(aX) = h(X) + \log |a| for scalar a \neq 0, and more generally h(AX) = h(X) + \log |\det(A)| for invertible matrix A.[3] It connects to discrete entropy through quantization: if a continuous variable is discretized into bins of width \Delta, the discrete entropy H(X_\Delta) approximates h(X) - \log \Delta, and as \Delta \to 0, this relation highlights differential entropy's role as a limiting case.[1]
Key properties include joint differential entropy h(X,Y) = -\iint f_{X,Y}(x,y) \log f_{X,Y}(x,y) \, dx \, dy, which satisfies subadditivity h(X,Y) \leq h(X) + h(Y) with equality if X and Y are independent, and conditional entropy h(X|Y) = h(X,Y) - h(Y), which is non-increasing under conditioning: h(X|Y) \leq h(X).[1] The asymptotic equipartition property (AEP) extends to continuous i.i.d. sequences, where the probability of the typical set approaches 1, and its volume is approximately 2^{n h(X)} for n samples (with entropy in bits), enabling analysis of compression and typical behavior in continuous sources.[1]
Among all distributions with fixed variance, the Gaussian achieves the maximum differential entropy, given by h(X) = \frac{1}{2} \log (2 \pi e \sigma^2) for a univariate normal with variance \sigma^2, or more generally \frac{1}{2} \log ((2\pi e)^n |\mathbf{K}|) for a multivariate normal with covariance matrix \mathbf{K}.[1] This maximum entropy principle underscores its utility in modeling noise and signals. Differential entropy plays a central role in continuous-channel capacity derivations, rate-distortion theory for analog sources, and bounds like the continuous analog of Fano's inequality, which relates estimation error to entropy: \mathbb{E}[(X - \hat{X})^2] \geq \frac{1}{2\pi e} 2^{2 h(X)} (with entropy in bits).[3] These aspects make it indispensable for applications in signal processing, communications, and statistical inference involving continuous data.[1]
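The AEP statement above can be seen directly in simulation. The following minimal sketch (assuming NumPy and SciPy are available; the standard normal density and the sample sizes are illustrative choices, not tied to any source) shows the normalized log-density -(1/n) \sum_i \log f(X_i) concentrating around h(X) as n grows, which is the sense in which the typical set has volume about e^{n h(X)} in nats.
```python
# Illustrative sketch of the continuous AEP for i.i.d. N(0,1) samples:
# the empirical average of -log f(X_i) concentrates around h(X) = 0.5*ln(2*pi*e).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
h = 0.5 * np.log(2 * np.pi * np.e)          # differential entropy of N(0,1) in nats

for n in (10, 1_000, 100_000):
    x = rng.normal(size=n)
    print(n, -np.mean(norm.logpdf(x)), h)   # empirical average -> h(X) as n grows
```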
Fundamentals
Definition
Differential entropy, also known as continuous entropy, is a measure of uncertainty for continuous random variables in information theory. For a continuous random variable X with probability density function f(x), the differential entropy h(X) is defined as h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx, where the integral is taken over the support of f, and the logarithm is typically base-2 (yielding bits) or the natural logarithm (yielding nats).[2] This definition extends the concept of Shannon entropy from discrete to continuous distributions and was first introduced by Claude Shannon in his foundational work on communication theory.[2]
The differential entropy arises as the limiting case of discrete Shannon entropy applied to a quantized approximation of the continuous distribution. Consider partitioning the real line into small intervals of width \Delta and approximating f(x) by a discrete probability mass function p_i = f(x_i) \Delta for interval centers x_i; the discrete entropy of this approximation is H(p) \approx h(X) + \log (1/\Delta), and in the limit as \Delta \to 0, h(X) = \lim_{\Delta \to 0} [H(p) - \log (1/\Delta)]. This derivation highlights that differential entropy is not a direct probability but an asymptotic quantity relative to the discretization scale.[3]
Unlike discrete entropy, which is dimensionless and nonnegative, differential entropy carries units dependent on the measurement scale of X and can take negative values. Specifically, it is not invariant under linear transformations: if Y = aX + b with a \neq 0, then h(Y) = h(X) + \log |a|, reflecting how rescaling the variable (e.g., changing units from meters to centimeters) shifts the entropy by the logarithm of the scale factor. For the definition to be well-defined, the distribution of X must be absolutely continuous with respect to the Lebesgue measure (so that a density f exists), and the integral must converge absolutely, ensuring finite entropy.
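As a concrete check of the definition, the sketch below (a non-authoritative illustration assuming NumPy and SciPy; the standard normal density and scipy.integrate.quad are choices made here, not part of the original derivation) evaluates -\int f \log f \, dx numerically and compares it with the closed form \frac{1}{2} \log (2 \pi e \sigma^2).
```python
# Numerically evaluate h(X) = -∫ f(x) ln f(x) dx for a standard normal density
# and compare it with the closed-form Gaussian entropy in nats.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma = 1.0
f = norm(loc=0.0, scale=sigma).pdf

def neg_f_log_f(x):
    fx = f(x)
    return -fx * np.log(fx) if fx > 0 else 0.0   # integrand vanishes where f underflows

h_numeric, _ = quad(neg_f_log_f, -np.inf, np.inf)
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(h_numeric, h_closed)   # both ≈ 1.4189 nats
```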
Relation to Discrete Entropy
Differential entropy extends Claude Shannon's discrete entropy to continuous random variables, providing a measure of uncertainty or information content for probability density functions rather than discrete probability mass functions. Introduced by Shannon in 1948 as part of the foundational work on information theory for continuous channels, it formalizes the notion of entropy in scenarios where signals or noise are modeled continuously, such as in communication systems.[2]
Despite this analogy, differential entropy differs fundamentally from its discrete counterpart and is not directly comparable. Discrete entropy is always non-negative and represents an absolute measure of uncertainty, whereas differential entropy can be negative and depends on the units of measurement. A negative value arises when the distribution is more concentrated than a uniform distribution over a unit interval (e.g., [0,1]), which has zero differential entropy, indicating lower uncertainty relative to that baseline. Additionally, scaling the random variable X to aX (with a \neq 0) shifts the differential entropy by \log |a|, highlighting its relative nature tied to the chosen coordinate system rather than an invariant quantity. Interpretation pitfalls include this unit dependence and the fact that additivity for independent variables, h(X,Y) = h(X) + h(Y), requires the joint density to factor into the product of the marginals.[2][4]
The precise relation between the two entropies emerges from a discretization process. Consider partitioning the support of a continuous random variable into small bins of width \Delta. The resulting discrete random variable, with probabilities p_i \approx f(x_i) \Delta where f is the density, has entropy H approximating the differential entropy h(X) adjusted for the bin size: H \approx h(X) - \log \Delta. The relation becomes exact in the limit as \Delta \to 0, in the sense that H + \log \Delta \to h(X); the -\log \Delta term (diverging to +\infty) compensates for the infinite resolution of the continuous case, ensuring the discrete entropy remains non-negative while the differential entropy stays finite. This limiting procedure underscores why differential entropy is not a limiting case of discrete entropy without the adjustment term, emphasizing the conceptual shift from finite to infinite sample spaces.[5][4]
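The discretization relation can be illustrated numerically. In the sketch below (an illustrative setup assuming SciPy; the Gaussian density, grid range, and bin widths \Delta are arbitrary choices), the discrete entropy of the binned variable plus \log \Delta approaches h(X) as \Delta shrinks.
```python
# Bin a standard normal into cells of width Δ and check H(X_Δ) + log Δ → h(X).
import numpy as np
from scipy.stats import norm

h_true = 0.5 * np.log(2 * np.pi * np.e)     # h(X) for N(0,1) ≈ 1.4189 nats

for delta in (0.5, 0.1, 0.01):
    edges = np.arange(-10, 10 + delta, delta)
    p = np.diff(norm.cdf(edges))            # exact bin probabilities p_i
    p = p[p > 0]
    H_discrete = -np.sum(p * np.log(p))     # Shannon entropy of the quantized variable
    print(delta, H_discrete + np.log(delta), h_true)
```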
Properties
Basic Properties
Differential entropy exhibits invariance under translation. Consider a continuous random variable X with probability density function f_X(x), so that its differential entropy is h(X) = -\int_{-\infty}^{\infty} f_X(x) \log f_X(x) \, dx. For a constant c, let Y = X + c; the density of Y is f_Y(y) = f_X(y - c). Substituting into the entropy integral gives h(Y) = -\int_{-\infty}^{\infty} f_X(y - c) \log f_X(y - c) \, dy. By the change of variable u = y - c, this simplifies to -\int_{-\infty}^{\infty} f_X(u) \log f_X(u) \, du = h(X), demonstrating the invariance.[6]
The scaling property introduces a dependence on units. For a scalar a \neq 0, let Y = aX; the density of Y is f_Y(y) = \frac{1}{|a|} f_X\left(\frac{y}{a}\right). Thus, h(Y) = -\int_{-\infty}^{\infty} \frac{1}{|a|} f_X\left(\frac{y}{a}\right) \log \left[ \frac{1}{|a|} f_X\left(\frac{y}{a}\right) \right] dy = -\int_{-\infty}^{\infty} \frac{1}{|a|} f_X\left(\frac{y}{a}\right) \left[ \log \frac{1}{|a|} + \log f_X\left(\frac{y}{a}\right) \right] dy. Substituting z = y/a yields h(Y) = h(X) + \log |a|, reflecting how rescaling affects the entropy by the logarithm of the scaling factor, which accounts for changes in measurement units.[7]
Unlike discrete entropy, which is always non-negative, differential entropy can be negative, since it measures uncertainty relative to the Lebesgue measure rather than counting outcomes of a finite alphabet. For a random variable X with bounded support S of finite volume V = \int_S dx, the entropy satisfies h(X) \leq \log V, with equality achieved when X is uniformly distributed over S. This upper bound arises because the uniform distribution maximizes the entropy for a fixed support, and \log V < 0 when V < 1, allowing negative values even at the maximum.[8]
Differential entropy is continuous under sufficiently strong convergence of densities. Specifically, if a sequence of densities f_n converges to f in the L^1 norm (i.e., \int |f_n - f| dx \to 0) and satisfies suitable uniform integrability conditions for the entropy integrals to be well-defined, then h(f_n) \to h(f). This continuity ensures that small perturbations in the density lead to small changes in entropy, facilitating analysis in estimation and approximation contexts.[9]
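The translation and scaling properties can also be checked empirically with a nonparametric entropy estimator. The sketch below (assuming SciPy's scipy.stats.differential_entropy is available; the sample size, the factor a = 3, and the shift are arbitrary choices) estimates h(X), h(aX), and h(X + c) from samples.
```python
# Empirical check of h(aX) = h(X) + log|a| and h(X + c) = h(X) from samples.
import numpy as np
from scipy.stats import differential_entropy

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)            # X ~ N(0, 1)
a = 3.0

h_x = differential_entropy(x)           # ≈ 1.419 nats (estimator, so approximate)
h_ax = differential_entropy(a * x)      # ≈ h(X) + ln 3 ≈ 2.518 nats
print(h_ax - h_x, np.log(abs(a)))       # the two should be close

# Translation leaves the estimate unchanged up to sampling noise:
print(differential_entropy(x + 7.0) - h_x)   # ≈ 0
```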
Chain Rule and Additivity
The joint differential entropy of a random vector \mathbf{X} = (X_1, \dots, X_n) in \mathbb{R}^n, with joint probability density function f_{\mathbf{X}}(\mathbf{x}), extends the scalar case and is given by h(\mathbf{X}) = -\int_{\mathbb{R}^n} f_{\mathbf{X}}(\mathbf{x}) \log f_{\mathbf{X}}(\mathbf{x}) \, d\mathbf{x}. This measure quantifies the uncertainty in the joint distribution over the vector, analogous to the scalar differential entropy but integrated over the higher-dimensional space.
A key property is the chain rule for differential entropy, which decomposes the joint entropy into a sum of conditional entropies: h(X_1, \dots, X_n) = \sum_{i=1}^n h(X_i \mid X_1, \dots, X_{i-1}), where h(X_i \mid X_1, \dots, X_{i-1}) = -\int f(x_i \mid x_1, \dots, x_{i-1}) \log f(x_i \mid x_1, \dots, x_{i-1}) \, dx_i, averaged over the conditioning variables. This decomposition holds whenever the relevant densities exist and is derived from the definition of conditional density. If the random variables X_1, \dots, X_n are mutually independent, then each conditional entropy simplifies to the marginal entropy, yielding h(X_1, \dots, X_n) = \sum_{i=1}^n h(X_i), demonstrating additivity for the joint entropy under independence.
The conditional differential entropy h(X \mid Y) for jointly continuous random variables X and Y with joint density f_{X,Y}(x,y) is defined as h(X \mid Y) = h(X,Y) - h(Y) = -\iint f_{X,Y}(x,y) \log f_{X \mid Y}(x \mid y) \, dx \, dy, where the outer integral over y effectively averages the scalar conditional entropies. This quantity represents the residual uncertainty in X given knowledge of Y. A fundamental inequality is h(X \mid Y) \leq h(X), with equality if and only if X and Y are independent, reflecting that conditioning cannot increase differential entropy. This follows directly from the non-negativity of mutual information, I(X;Y) = h(X) - h(X \mid Y) \geq 0. For the joint case, subadditivity holds as h(X,Y) = h(X) + h(Y \mid X) \leq h(X) + h(Y), again with equality under independence.
For the sum of random variables, the behavior differs from the discrete case. Unlike discrete entropy, differential entropy is not monotone under deterministic transformations (for example, h(2X) = h(X) + \log 2 > h(X)), so the discrete bound H(g(X)) \leq H(X) has no direct continuous analog, and h(X + Y) \leq h(X) + h(Y) can fail. What does hold for independent X and Y is that adding an independent variable cannot reduce uncertainty: h(X + Y) \geq \max\{h(X), h(Y)\}, since h(X + Y) \geq h(X + Y \mid Y) = h(X \mid Y) = h(X). The entropy power inequality sharpens this to e^{2 h(X+Y)} \geq e^{2 h(X)} + e^{2 h(Y)} (entropies in nats), with equality when X and Y are Gaussian.[10]
In continuous channels, the data processing inequality preserves the structure of the discrete case. For a Markov chain X \to Y \to Z where X, Y, Z are continuous random variables, the mutual information satisfies I(X; Z) \leq I(X; Y), with equality if and only if Z is a sufficient statistic of Y for X (that is, X \to Z \to Y also forms a Markov chain). This implies that processing through a continuous channel cannot increase information about the input, and it extends to differential entropies via I(X;Y) = h(X) - h(X \mid Y).
The inequality holds under the existence of densities and is proven using the chain rule and non-negativity of relative entropy.
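For jointly Gaussian variables all of these quantities have closed forms, which makes the chain rule and the conditioning inequality easy to verify. The following sketch (the covariance matrix is an arbitrary illustrative choice) checks h(X,Y) = h(X) + h(Y|X), h(X|Y) \leq h(X), I(X;Y) \geq 0, and subadditivity.
```python
# Closed-form check for a bivariate Gaussian with covariance K.
import numpy as np

K = np.array([[2.0, 1.2],
              [1.2, 1.5]])                      # covariance of (X, Y)

h_joint = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(K))
h_x = 0.5 * np.log(2 * np.pi * np.e * K[0, 0])
h_y = 0.5 * np.log(2 * np.pi * np.e * K[1, 1])

# h(Y|X) via the conditional variance of Y given X (Gaussian conditional law)
var_y_given_x = K[1, 1] - K[0, 1] ** 2 / K[0, 0]
h_y_given_x = 0.5 * np.log(2 * np.pi * np.e * var_y_given_x)

print(np.isclose(h_joint, h_x + h_y_given_x))   # chain rule h(X,Y) = h(X) + h(Y|X)
print(h_joint - h_y <= h_x)                     # h(X|Y) <= h(X)
print(h_x + h_y - h_joint >= 0)                 # I(X;Y) >= 0
print(h_joint <= h_x + h_y)                     # subadditivity
```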
Maximum Entropy
Gaussian Maximization Theorem
The Gaussian maximization theorem asserts that, among all continuous probability distributions for a random variable X with fixed variance \sigma^2, the differential entropy h(X) is maximized uniquely by the Gaussian distribution \mathcal{N}(\mu, \sigma^2), where the maximum value is \frac{1}{2} \log (2 \pi e \sigma^2) in nats.[11] This result holds regardless of the mean \mu, as shifting the distribution does not affect the entropy.[11]
This theorem underscores the Gaussian distribution's role as the embodiment of maximum uncertainty under a second-moment constraint, a principle central to information theory that favors the least informative prior consistent with observed variance.[11] It has profound implications, such as establishing the capacity of additive white Gaussian noise channels as \frac{1}{2} \log (1 + \frac{P}{N}), where the input distribution maximizing mutual information is Gaussian.
The theorem generalizes to multivariate cases: for an n-dimensional random vector X with fixed covariance matrix \Sigma, the maximum differential entropy is achieved by the multivariate Gaussian \mathcal{N}(\mu, \Sigma), yielding h(X) = \frac{n}{2} \log (2 \pi e) + \frac{1}{2} \log \det(\Sigma) nats.[11] For uncorrelated components with variances \sigma_i^2, this simplifies to \frac{n}{2} \log (2 \pi e) + \sum_{i=1}^n \log \sigma_i.[11]
Without variance or similar constraints, differential entropy is unbounded above, as densities can be made arbitrarily flat, while it tends to -\infty as the variance shrinks toward the degenerate, zero-variance limit.[11]
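The multivariate formula can be evaluated directly and compared against a library implementation. The sketch below (assuming SciPy; the positive-definite \Sigma is an arbitrary example) computes \frac{n}{2} \log (2 \pi e) + \frac{1}{2} \log \det(\Sigma) and checks it against scipy.stats.multivariate_normal(...).entropy().
```python
# Multivariate Gaussian entropy: closed form vs. SciPy, both in nats.
import numpy as np
from scipy.stats import multivariate_normal

Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])      # example positive-definite covariance
n = Sigma.shape[0]

h_formula = 0.5 * n * np.log(2 * np.pi * np.e) + 0.5 * np.log(np.linalg.det(Sigma))
h_scipy = multivariate_normal(mean=np.zeros(n), cov=Sigma).entropy()
print(h_formula, h_scipy)                # the two values should agree
```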
Proof of the Theorem
To prove the Gaussian maximization theorem, consider the problem of maximizing the differential entropy h(X) = -\mathbb{E}[\log f(X)] = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx over all probability density functions f on \mathbb{R}, subject to the normalization constraint \int_{-\infty}^{\infty} f(x) \, dx = 1 and the fixed second-moment constraint \mathbb{E}[X^2] = \int_{-\infty}^{\infty} x^2 f(x) \, dx = \sigma^2 (assuming without loss of generality that \mathbb{E}[X] = 0). This is a constrained optimization problem in the space of densities, solved using the method of Lagrange multipliers for functionals.
Introduce the Lagrangian functional \mathcal{L} = -\int f \log f \, dx + \lambda \left( \int f \, dx - 1 \right) - \mu \left( \int x^2 f \, dx - \sigma^2 \right), where \lambda is the multiplier for normalization and \mu > 0 for the variance constraint. The functional derivative with respect to f is set to zero: \frac{\delta \mathcal{L}}{\delta f} = -\log f - 1 + \lambda - \mu x^2 = 0, yielding f(x) = \exp(\lambda - 1 - \mu x^2), where the normalization constant is incorporated via \lambda. This form is proportional to a Gaussian density. To identify the parameters, impose the second-moment constraint: the variance of this distribution is 1/(2\mu), so \mu = 1/(2\sigma^2) to match \sigma^2. The normalization then gives the density of \mathcal{N}(0, \sigma^2).
An alternative proof uses the non-negativity of the Kullback-Leibler (KL) divergence. Let g denote the density of \mathcal{N}(0, \sigma^2). Then, D(f \| g) = \int f \log \frac{f}{g} \, dx = -\int f \log g \, dx - h(f) \geq 0, with equality if and only if f = g almost everywhere. Rearranging gives h(f) \leq -\int f \log g \, dx. Now, \log g(x) = -\frac{1}{2} \log(2\pi \sigma^2) - \frac{x^2}{2\sigma^2}, so -\int f \log g \, dx = \frac{1}{2} \log(2\pi \sigma^2) + \frac{1}{2\sigma^2} \int x^2 f(x) \, dx = \frac{1}{2} \log(2\pi e \sigma^2), where the last step substitutes the constraint \int x^2 f = \sigma^2 and adds the constant \frac{1}{2} \log e = \frac{1}{2}. Thus, h(f) \leq \frac{1}{2} \log(2\pi e \sigma^2), which is precisely the differential entropy of the Gaussian, with equality only for the Gaussian density.
To verify, compute the entropy of the Gaussian directly: for X \sim \mathcal{N}(0, \sigma^2), h(X) = -\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{x^2}{2\sigma^2} \right) \log \left[ \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{x^2}{2\sigma^2} \right) \right] dx. The logarithm expands to -\frac{1}{2} \log (2\pi \sigma^2) - \frac{x^2}{2\sigma^2}, so the integral separates into \frac{1}{2} \log(2\pi \sigma^2) + \frac{1}{2\sigma^2} \mathbb{E}[X^2] = \frac{1}{2} \log(2\pi \sigma^2) + \frac{1}{2} = \frac{1}{2} \log(2\pi e \sigma^2), confirming the bound is achieved.
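The divergence argument is easy to reproduce numerically for a specific non-Gaussian competitor. The sketch below (the unit-variance Laplace density f is a choice made here, assuming SciPy) confirms D(f \| g) \geq 0 for the unit-variance Gaussian g and that h(f) + D(f \| g) equals the Gaussian bound \frac{1}{2} \log(2\pi e).
```python
# KL-divergence check: h(f) = ½·log(2πe) − D(f‖g) for any unit-variance f.
import numpy as np
from scipy.integrate import quad
from scipy.stats import laplace, norm

f = laplace(scale=1 / np.sqrt(2))   # unit-variance Laplace (variance = 2b^2 = 1)
g = norm()                          # N(0, 1), the maximum-entropy competitor

kl, _ = quad(lambda x: f.pdf(x) * np.log(f.pdf(x) / g.pdf(x)), -30, 30)
h_f = f.entropy()                                # 1 + ln(2b) ≈ 1.3466 nats
gaussian_bound = 0.5 * np.log(2 * np.pi * np.e)  # ≈ 1.4189 nats

print(kl >= 0)                                   # non-negativity of D(f || g)
print(np.isclose(h_f + kl, gaussian_bound))      # h(f) + D(f||g) = ½·log(2πe)
```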
Examples
Exponential Distribution
The exponential distribution with rate parameter \lambda > 0 has probability density function
f(x) = \lambda e^{-\lambda x}, \quad x \geq 0,
and mean 1/\lambda.[12] This distribution models phenomena such as inter-arrival times in Poisson processes and is characterized by its memoryless property. The differential entropy h(X) of a continuous random variable X with density f is h(X) = -\int f(x) \log f(x) \, dx. For the exponential distribution,
h(X) = -\int_{0}^{\infty} \lambda e^{-\lambda x} \log(\lambda e^{-\lambda x}) \, dx = -\int_{0}^{\infty} \lambda e^{-\lambda x} (\log \lambda - \lambda x) \, dx.
The first term evaluates to -\log \lambda \int_{0}^{\infty} \lambda e^{-\lambda x} \, dx = -\log \lambda, since the integral of the density is 1. The second term is \lambda \int_{0}^{\infty} x \lambda e^{-\lambda x} \, dx = \lambda \mathbb{E}[X] = \lambda \cdot (1/\lambda) = 1. Thus, h(X) = 1 - \log \lambda (in nats), derived by direct integration.[13]
This entropy value increases with the mean 1/\lambda, as larger means correspond to broader spreads and greater uncertainty in the distribution. For \lambda > e, the entropy is negative, a feature unique to differential entropy that does not imply negative information but reflects the continuous nature relative to a uniform reference measure.[13] The exponential distribution maximizes the differential entropy among all continuous distributions supported on [0, \infty) with fixed mean 1/\lambda.[13] It serves as the continuous analog to the geometric distribution, which maximizes entropy for discrete non-negative integer-valued random variables with fixed mean.[13]
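The closed form can be checked against SciPy's exponential distribution, which is parametrized by the scale 1/\lambda; the rates below are arbitrary examples, with \lambda = 5 > e illustrating a negative entropy.
```python
# h(X) = 1 − log λ for the exponential distribution, compared with SciPy (nats).
import numpy as np
from scipy.stats import expon

for lam in (0.5, 1.0, np.e, 5.0):
    h_formula = 1 - np.log(lam)
    h_scipy = expon(scale=1 / lam).entropy()    # SciPy uses scale = 1/λ
    print(lam, h_formula, h_scipy)
# λ = 5 gives 1 − ln 5 ≈ −0.609 nats, a negative differential entropy.
```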
Uniform Distribution
The uniform distribution over an interval [b, b+a] has probability density function f(x) = \frac{1}{a} for x \in [b, b+a] and f(x) = 0 otherwise.[1] The differential entropy h(X) is computed via the integral definition: h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx = -\int_{b}^{b+a} \frac{1}{a} \log \left( \frac{1}{a} \right) \, dx = \log a, where the logarithm is the natural logarithm (nats).[1] This value is independent of the location parameter b, depending only on the length a of the support interval.[1]
This entropy \log a scales logarithmically with the interval length, quantifying the uncertainty in locating the random variable within the fixed support volume; larger a increases the possible outcomes, hence higher entropy.[1] Among all distributions supported on an interval of fixed length a, the uniform achieves the maximum differential entropy, as established by the non-negativity of the Kullback-Leibler divergence between any density f and the uniform density g(x) = 1/a: D(f \| g) = \log a - h(X) \geq 0, implying h(X) \leq \log a.[14]
In quantization theory, the uniform distribution provides a reference for bounding the entropy of discrete approximations to continuous random variables. As the quantization bin size \Delta \to 0, the entropy H(X^*) of the quantized variable X^* satisfies H(X^*) + \log \Delta \to h(X); for a uniform distribution partitioned into an integer number of equal-width bins, the relation H(X^*) = h(X) - \log \Delta holds exactly, linking differential entropy to the limiting behavior of discrete entropy.[1]
The differential entropy of the uniform can be negative when a < 1, for instance h(X) = \log 0.5 < 0 for a = 0.5, which underscores the unit dependence of differential entropy unlike its non-negative discrete counterpart.[1] This property aligns with the scaling behavior of differential entropy under linear transformations.[1]
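A brief numerical illustration (assuming SciPy; the interval parameters and the Beta(2,2) comparison density are arbitrary choices) confirms h(X) = \log a, its independence of b, and the maximality of the uniform over a fixed interval.
```python
# Uniform entropy log a (independent of b), and maximality on a fixed interval.
import numpy as np
from scipy.stats import uniform, beta

a, b = 0.5, 3.0
h_uniform = uniform(loc=b, scale=a).entropy()   # = log a, here log 0.5 < 0 nats
print(h_uniform, np.log(a))

# Any other density on an interval of length 1 has entropy ≤ log 1 = 0:
print(beta(2, 2).entropy())     # ≈ −0.125 nats, below the uniform's 0 nats
```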
Information Measures
Relation to Mutual Information
The mutual information between two continuous random variables X and Y, denoted I(X;Y), is defined as the difference between the differential entropy of X and the conditional differential entropy of X given Y: I(X;Y) = h(X) - h(X|Y). Equivalently, it can be expressed as I(X;Y) = h(X) + h(Y) - h(X,Y), where h(X,Y) is the joint differential entropy.[1] This quantity quantifies the amount of information that X and Y share, representing the reduction in uncertainty about one variable upon knowing the other.[1]
A key property of mutual information is its non-negativity: I(X;Y) \geq 0, with equality holding if and only if X and Y are independent.[1] This follows directly from the conditioning inequality for differential entropy, which states that h(X|Y) \leq h(X), implying that observing Y cannot increase the uncertainty about X.[1]
For multiple variables, the chain rule extends mutual information analogously to the discrete case: I(X_1, \dots, X_n; Y) = \sum_{i=1}^n I(X_i; Y \mid X_1, \dots, X_{i-1}), where the conditional mutual information I(X_i; Y \mid X_1, \dots, X_{i-1}) measures the additional shared information between X_i and Y given the previous variables.[1]
In information theory, mutual information quantifies the shared uncertainty between variables, with differential entropy serving as the foundational building block for analyzing continuous systems.[1] It plays a central role in defining the capacity of continuous channels, where the capacity C is the maximum achievable I(X;Y) over input distributions subject to constraints, representing the highest rate of reliable communication.[15] For instance, in the additive white Gaussian noise channel under a power constraint, the mutual information I(X;Y) is maximized when X is Gaussian, achieving the channel capacity C = \frac{1}{2} \log_2 \left(1 + \frac{P}{N}\right) bits per transmission, where P is the signal power and N is the noise power.[15]
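The Gaussian-channel example reduces to a difference of two differential entropies, as in the sketch below (the power values P and N are arbitrary; entropies are in nats and converted to bits at the end).
```python
# AWGN capacity through differential entropies: C = h(Y) − h(Y|X) = ½·log(1 + P/N).
import numpy as np

P, N = 4.0, 1.0                                   # signal power, noise power
h_Y = 0.5 * np.log(2 * np.pi * np.e * (P + N))    # output Y = X + Z is N(0, P + N)
h_Y_given_X = 0.5 * np.log(2 * np.pi * np.e * N)  # given X, only the noise remains
C_nats = h_Y - h_Y_given_X

print(C_nats, 0.5 * np.log(1 + P / N))            # identical, in nats
print(C_nats / np.log(2))                         # ≈ 1.16 bits per transmission
```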
Connection to Estimator Error
The Cramér-Rao bound establishes a fundamental limit on the performance of unbiased estimators, stating that for an unbiased estimator \hat{\theta} of a parameter \theta based on n i.i.d. observations, the variance satisfies \operatorname{Var}(\hat{\theta}) \geq \frac{1}{n I(\theta)}, where the Fisher information I(\theta) = \mathbb{E}\left[\left(\frac{\partial \log f(X;\theta)}{\partial \theta}\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2 \log f(X;\theta)}{\partial \theta^2}\right].[16] The Fisher information quantifies the sensitivity of the likelihood to changes in \theta, and it connects to entropy-type quantities through the relative entropy: for a small perturbation, D(f(\cdot;\theta) \| f(\cdot;\theta')) \approx \frac{1}{2} I(\theta) (\theta' - \theta)^2, so I(\theta) is the curvature at \theta' = \theta of the cross-entropy -\mathbb{E}_\theta[\log f(X;\theta')], whose value at \theta' = \theta is the differential entropy h(X;\theta) = -\mathbb{E}[\log f(X;\theta)].[16][17] Higher Fisher information implies a tighter bound on estimation variance; loosely, sharply peaked, low-entropy likelihoods are more informative about \theta.[18]
In estimation contexts involving noisy observations, the entropy power inequality provides insight into minimal mean squared error (MSE). The inequality asserts that for independent random vectors X and Z, h(X + Z) \geq \frac{1}{2} \log\left( e^{2 h(X)} + e^{2 h(Z)} \right) (in nats for scalars), suggesting that greater differential entropy h(X) entails larger minimal MSE when estimating X from Y = X + Z for fixed noise variance \operatorname{Var}(Z).[19] Equality holds when X and Z are Gaussian. The connection is formalized by the I-MMSE relation for Gaussian noise, \frac{d}{d\,\mathrm{snr}} I(X; \sqrt{\mathrm{snr}}\, X + Z) = \frac{1}{2} \operatorname{mmse}(\mathrm{snr}), which ties mutual information, and through it differential entropy, to the minimum MSE across signal-to-noise ratios: in an integrated sense, larger h(X) corresponds to higher minimum estimation error over the range of signal-to-noise ratios.[19]
In Bayesian estimation, the posterior differential entropy h(\theta \mid \mathbf{X}) serves as a measure of residual uncertainty about the parameter \theta after incorporating data \mathbf{X}, with lower posterior entropy indicating more precise inference and better estimators in terms of uncertainty reduction.[20] Optimal Bayesian estimators, such as those minimizing expected Bregman divergence, aim to concentrate the posterior, thereby minimizing this entropy as a proxy for estimation quality.[20]
Asymptotically, in sequential estimation with large samples, the entropy rate (the average uncertainty per observation) links to error rates through the Fisher information matrix, where the posterior covariance approximates I(\theta)^{-1}/n, yielding posterior entropy roughly \frac{d}{2} \log(2\pi e / n) - \frac{1}{2} \log \det I(\theta) for d-dimensional \theta, bounding large-sample MSE.[18]
For the Gaussian case, the differential entropy h(X) = \frac{1}{2} \log(2\pi e \sigma^2) directly ties to estimation error, as the MSE for the sample mean estimator from n i.i.d. N(\mu, \sigma^2) observations is \sigma^2 / n, saturating the Cramér-Rao bound and scaling inversely with sample size while reflecting the entropy's dependence on variance.[18]
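For the Gaussian case these bounds can be checked by simulation. The sketch below (assuming NumPy; \mu, \sigma, the sample size, and the number of trials are arbitrary) estimates the sample-mean MSE, compares it with the Cramér-Rao bound \sigma^2/n, and evaluates the entropy-based bound \frac{1}{2\pi e} e^{2 h(X)} in nats, which for a Gaussian X estimated by its mean equals the variance exactly.
```python
# Monte Carlo: sample-mean MSE vs. Cramér–Rao bound, plus the entropy-based bound.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, trials = 2.0, 1.5, 50, 20_000

samples = rng.normal(mu, sigma, size=(trials, n))
mse_sample_mean = np.mean((samples.mean(axis=1) - mu) ** 2)
print(mse_sample_mean, sigma**2 / n)        # ≈ Cramér–Rao bound σ²/n

# Entropy bound (nats): E[(X − X̂)²] ≥ (1/2πe)·e^{2h(X)}; with X ~ N(μ, σ²)
# estimated by the constant μ, E[(X − μ)²] = σ² meets the bound with equality.
h_X = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(sigma**2, np.exp(2 * h_X) / (2 * np.pi * np.e))   # both equal 2.25
```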
Common Distributions
Formulas for Specific Distributions
The differential entropy formulas for several standard continuous distributions are listed in the table below, expressed in nats using the natural logarithm. These expressions assume the conventional parametrizations and are independent of location parameters where applicable, as shifting does not affect the entropy value.[1] A short numerical cross-check of the formulas follows the table.
| Distribution | Parameters | Support | Differential entropy h(X) |
|---|---|---|---|
| Gaussian | Variance \sigma^2 > 0 | (-\infty, \infty) | \frac{1}{2} \log (2 \pi e \sigma^2)[1] |
| Exponential | Rate \lambda > 0 | [0, \infty) | 1 - \log \lambda[1] |
| Gamma | Shape \alpha > 0, rate \beta > 0 | [0, \infty) | \alpha - \log \beta + \log \Gamma(\alpha) + (1 - \alpha) \psi(\alpha), where \psi is the digamma function[21] |
| Laplace | Scale b > 0 | (-\infty, \infty) | 1 + \log (2b)[1] |
| Cauchy | Scale \gamma > 0 | (-\infty, \infty) | \log (4 \pi \gamma)[1] |
| Weibull | Shape k > 0, scale \lambda > 0 | [0, \infty) | \gamma \left(1 - \frac{1}{k}\right) + \log \left( \frac{\lambda}{k} \right) + 1, where \gamma \approx 0.57721 is the Euler-Mascheroni constant[22] |
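The table entries can be cross-checked against SciPy's entropy() method, which returns values in nats; the sketch below does so for the Gamma, Laplace, Cauchy, and Weibull rows with arbitrary parameter values.
```python
# Cross-check of the tabulated entropy formulas against SciPy (nats).
import numpy as np
from scipy.special import gammaln, digamma
from scipy.stats import gamma, laplace, cauchy, weibull_min

alpha, beta_rate = 2.5, 1.7      # Gamma shape and rate
b = 0.8                          # Laplace scale
g_scale = 2.0                    # Cauchy scale
k, lam = 1.5, 3.0                # Weibull shape and scale
euler = np.euler_gamma           # Euler–Mascheroni constant ≈ 0.57721

print(gamma(a=alpha, scale=1 / beta_rate).entropy(),
      alpha - np.log(beta_rate) + gammaln(alpha) + (1 - alpha) * digamma(alpha))
print(laplace(scale=b).entropy(), 1 + np.log(2 * b))
print(cauchy(scale=g_scale).entropy(), np.log(4 * np.pi * g_scale))
print(weibull_min(c=k, scale=lam).entropy(),
      euler * (1 - 1 / k) + np.log(lam / k) + 1)
```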
Comparison Across Distributions
Among all continuous distributions with a fixed variance, the Gaussian distribution achieves the maximum differential entropy, a result derived from the principle of maximum entropy subject to a second-moment constraint. This positions the Gaussian as the distribution of maximum uncertainty under this constraint, with its differential entropy given by \frac{1}{2} \log (2 \pi e \sigma^2) for variance \sigma^2.
Distributions exhibiting heavier tails, such as the Cauchy distribution, can attain higher differential entropy values; for a Cauchy distribution with scale parameter \gamma, the differential entropy is \log (4 \pi \gamma), which surpasses that of a Gaussian whose standard deviation equals the same scale parameter. However, the infinite variance of the Cauchy prevents direct comparison within the fixed-variance framework.[11]
For distributions constrained to the positive real line with a fixed mean \mu > 0, the exponential distribution maximizes the differential entropy, yielding 1 + \log \mu.[13]
Shape parameters within parametric families reveal systematic patterns in differential entropy. For the Weibull distribution with fixed scale, as the shape parameter k approaches 1, the distribution converges to the exponential, and the differential entropy approaches 1 + \log \mu, the exponential value (at k = 1 the scale equals the mean \mu). Among unit-variance distributions, any departure from the Gaussian shape, whether lighter-tailed like the uniform or heavier-tailed like the Laplace and Student-t, lowers the differential entropy, although the relationship with kurtosis is not monotone.
In multivariate isotropic cases with covariance matrix \sigma^2 I_d in d dimensions, differential entropies scale linearly with dimension for the Gaussian, given by \frac{d}{2} \log (2 \pi e \sigma^2), reflecting additive uncertainty across independent components. Other distributions sharing this covariance exhibit lower entropies but follow a similar linear scaling, with the gap to the Gaussian bound widening for non-Gaussian forms in higher dimensions.[11]
The following table compares differential entropies for select univariate distributions normalized to unit variance, highlighting that the Gaussian attains the largest value and the non-Gaussian alternatives fall below it; a script recomputing these values follows the table:
| Distribution | Parameter | Differential Entropy (nats) |
|---|---|---|
| Gaussian | - | 1.419 |
| Student-t | \nu = 5 | 1.369 |
| Laplace | - | 1.347 |
| Uniform | - | 1.243 |
| Student-t | \nu = 3 | 1.222 |
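The entries above can be recomputed with SciPy by rescaling each distribution to unit variance, as in the sketch below (an illustrative script only; the printed values should reproduce the ranking in the table, with possible small differences in the final digits due to rounding).
```python
# Differential entropies (nats) of several distributions rescaled to unit variance.
import numpy as np
from scipy.stats import norm, t, laplace, uniform

entries = {
    "Gaussian": norm(scale=1.0),
    "Student-t (nu=5)": t(df=5, scale=np.sqrt(3 / 5)),          # var = scale²·ν/(ν−2) = 1
    "Laplace": laplace(scale=1 / np.sqrt(2)),                   # var = 2b² = 1
    "Uniform": uniform(loc=-np.sqrt(3), scale=2 * np.sqrt(3)),  # var = a²/12 = 1
    "Student-t (nu=3)": t(df=3, scale=np.sqrt(1 / 3)),
}
for name, dist in entries.items():
    print(name, dist.entropy())   # decreasing order, Gaussian largest
```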