
Entropy estimation

Entropy estimation refers to the approximation of the entropy of a probability distribution based on a finite set of observed samples, serving as a measure of uncertainty or information content in data from discrete or continuous sources. In information theory, Shannon entropy quantifies the average unpredictability of a discrete random variable X with probability mass function p(x) as H(X) = -\sum p(x) \log p(x), while differential entropy extends this to continuous variables with density f(x) as h(X) = -\int f(x) \log f(x) \, dx. The entropy rate further generalizes this to stochastic processes, representing the average entropy per symbol or time step in sequences.

For discrete distributions, estimation is particularly challenging when the support size is large or unknown, as unseen symbols in finite samples lead to underestimation by plug-in methods like the maximum likelihood estimator. Nonparametric approaches, such as the Good-Turing estimator or Bayesian methods using Dirichlet or Pitman-Yor priors, address this by incorporating smoothing or hierarchical modeling to account for unobserved outcomes, achieving consistency under power-law tail assumptions. Parametric methods assume a specific model, such as a Gaussian distribution, and use maximum likelihood for faster convergence but risk bias from model misspecification. In the continuous case, nonparametric techniques dominate due to the lack of finite support, including histogram-based, kernel density estimation (KDE), nearest-neighbor, and spacing methods that approximate the density integral via sample spacings or local densities. These estimators aim for root-n consistency under smoothness and tail decay conditions, though they suffer from the curse of dimensionality, where performance degrades rapidly beyond low dimensions (e.g., d > 3). Hybrid approaches, like resubstitution or cross-validation variants, mitigate bias but increase computational demands.

Key challenges across both domains include finite-sample bias and variance, sensitivity to hyperparameters (e.g., bin width in histograms or kernel bandwidth in KDE), and scalability for high-dimensional or sequential data. Recent advances incorporate machine learning, such as neural estimators for entropy production or min-entropy, enhancing robustness in nonequilibrium systems. Applications span data compression for source coding limits, machine learning for feature selection and model evaluation, neuroscience for neural information quantification, and physics for complexity analysis in time series. In cryptography, accurate entropy estimation ensures randomness quality in random number generators.

Fundamentals

Definition and Importance

Entropy estimation refers to the process of approximating the entropy of a random variable from a finite set of observed samples, where the true underlying distribution is unknown. For discrete random variables, the quantity of interest is the Shannon entropy of X, denoted H(X), which for probability mass function p(x) is defined as H(X) = -\sum_{x} p(x) \log p(x), where the logarithm is typically base 2 to measure entropy in bits. This measure quantifies the average uncertainty or information content associated with the possible outcomes of X. Introduced by Claude Shannon in his seminal 1948 paper "A Mathematical Theory of Communication," entropy provided a foundational metric for analyzing communication systems and has since become central to information theory.

The importance of entropy estimation stems from its wide-ranging applications across disciplines. In data compression, it sets the theoretical limit on the average number of bits needed to encode messages from a source, enabling efficient storage and transmission. In machine learning, entropy guides feature selection by identifying variables that reduce uncertainty in predictive models, such as in decision trees where lower conditional entropy indicates informative splits. Applications extend to neuroscience, where estimating entropy from neural spike patterns reveals coding efficiency and information processing in the brain, and to physics, drawing analogies to thermodynamic entropy as a measure of disorder in physical systems.

In practice, the true p(x) is rarely known, necessitating estimation from finite samples \{x_1, \dots, x_n\} drawn from the process generating the data. The goal of entropy estimation is to compute an estimate \hat{H} such that \hat{H} \approx H(X), with controlled bias (systematic error) and variance (random fluctuation) to ensure reliability in downstream analyses. For continuous random variables, a related concept is differential entropy, which extends the discrete case but requires careful handling of densities and support.
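As a concrete illustration of the empirical (plug-in) approach described above, the following minimal sketch computes \hat{H} from observed symbol frequencies. It assumes only NumPy; the function name and the fair-coin example are illustrative, not taken from any particular reference.

```python
import numpy as np

def plugin_entropy(samples, base=2):
    """Plug-in (empirical-frequency) Shannon entropy estimate, in bits by default."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()              # empirical probabilities p_hat(x)
    return -np.sum(p * np.log(p)) / np.log(base)

# Example: a fair coin has H = 1 bit; the estimate approaches 1 as n grows.
rng = np.random.default_rng(0)
sample = rng.integers(0, 2, size=10_000)
print(plugin_entropy(sample))              # close to 1.0 bit
```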

Types of Entropy

Entropy estimation commonly involves several variants of entropy, each tailored to different types of probability distributions and applications in information theory. The primary forms include discrete Shannon entropy for finite or countable sample spaces, differential entropy for continuous distributions, and Rényi entropy as a parameterized generalization. These types differ in their mathematical formulations and properties, influencing the choice of estimation methods from data samples. Related quantities, such as mutual information, capture dependence between variables and are expressed in terms of entropies. The entropy rate extends these notions to stochastic processes, defined for a process \{X_t\} as H = \lim_{n \to \infty} \frac{1}{n} H(X_1, \dots, X_n), representing the average entropy per symbol or time step in sequences.

Discrete Shannon entropy, introduced by Claude Shannon, quantifies the uncertainty in a random variable X taking values in a finite alphabet \mathcal{X} with probability mass function p(x). It is defined as H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x), where the logarithm is typically base 2 for bits or natural for nats. This measure is always non-negative and achieves its maximum when all outcomes are equally likely.

For continuous random variables, differential entropy extends the concept but addresses the challenges of uncountable support. For a continuous X with density p(x), the differential entropy is h(X) = -\int_{-\infty}^{\infty} p(x) \log p(x) \, dx. Unlike discrete entropy, differential entropy can take negative values, for example for distributions more concentrated than a uniform over a unit interval, and it is not invariant under nonlinear transformations of the variable, requiring careful treatment of coordinates in applications. Continuous forms also demand handling of infinite support, where densities integrate to 1 but probabilities over finite intervals are less than 1, complicating direct analogies to discrete cases.

Rényi entropy generalizes Shannon entropy through an order parameter \alpha > 0, \alpha \neq 1, providing a family of measures useful for their varying sensitivity to probability distributions. For a random variable X, the Rényi entropy of order \alpha is H_\alpha(X) = \frac{1}{1-\alpha} \log \sum_{x \in \mathcal{X}} p(x)^\alpha, with the Shannon entropy recovered in the limit as \alpha \to 1. Higher orders (\alpha > 1) emphasize high-probability outcomes and de-emphasize rare events, offering robustness to outliers or noise in estimation contexts compared to Shannon entropy.

Mutual information captures the shared information between two random variables X and Y, expressed in terms of entropies as I(X; Y) = H(X) + H(Y) - H(X,Y), where H(X,Y) is the joint entropy. This quantity is always non-negative and equals zero when X and Y are independent, making it a key tool for dependence estimation in joint distributions.

A fundamental distinction among these types is that discrete entropy is inherently non-negative, since each term -p(x) \log p(x) is non-negative when p(x) \leq 1, whereas differential entropy lacks this bound and requires density estimation techniques sensitive to the choice of support and binning. Rényi entropies for \alpha \neq 1 also remain non-negative in the discrete case but adapt differently to estimation robustness needs.
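The following short sketch (NumPy assumed; function names are illustrative) evaluates the discrete Shannon and Rényi entropies for a known distribution and shows the Rényi value approaching the Shannon value as \alpha \to 1.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy of a discrete distribution p, in bits."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha (alpha > 0, alpha != 1), in bits."""
    p = p[p > 0]
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.25, 0.125, 0.125])
print(shannon_entropy(p))                # 1.75 bits
print(renyi_entropy(p, alpha=0.999))     # approaches the Shannon value as alpha -> 1
print(renyi_entropy(p, alpha=2.0))       # collision entropy, never above Shannon
```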

Estimation Challenges

Estimating entropy from finite samples presents significant challenges due to the bias-variance trade-off inherent in most estimators. Nonparametric estimators, such as the plug-in method, exhibit a negative bias for finite sample sizes n, systematically underestimating the true entropy, though this bias diminishes asymptotically as n \to \infty under appropriate conditions like m = o(n), where m is the support size. The variance of these estimators decreases with increasing n, typically scaling as O((\log n)^2 / n), but achieving low overall error requires balancing the bias reduction against potential variance inflation from overly complex models.

Sample size requirements pose a major hurdle, particularly for discrete distributions, where reliable estimation demands n \gg |\mathcal{X}|, the alphabet size, to mitigate severe underestimation from unseen symbols. In high-dimensional settings, the curse of dimensionality exacerbates this issue, as the effective sample size needed grows exponentially with the dimension d, leading to error terms decaying only as O(n^{-\gamma/d}) for methods like k-nearest neighbors, rendering estimation impractical without dimensionality reduction. For continuous distributions, similar demands arise, often requiring n on the order of c^d for accuracy, where c > 1.

Consistency of entropy estimators relies on asymptotic behavior as n \to \infty; strong consistency, in the almost sure sense, holds for many nonparametric approaches when the support size grows slower than n, ensuring convergence to the true entropy. However, only weak consistency in probability may be established in more restricted regimes, such as with fixed support size. The underlying distribution further influences accuracy: uniform distributions, with their maximum entropy, are generally easier to estimate with lower bias compared to heavy-tailed ones like the Student's t, where tail behavior amplifies error and requires tailored smoothing to achieve root-n consistency. Non-stationarity in the data-generating process, violating the standard i.i.d. assumption, introduces additional inconsistency by altering the effective sample size.

Evaluation of entropy estimators commonly employs the mean squared error (MSE), defined as \text{MSE} = \text{bias}^2 + \text{variance}, which quantifies the total estimation error and guides comparisons across methods. Cross-validation techniques are frequently used to tune hyperparameters, such as bandwidths in kernel-based approaches, by minimizing empirical MSE on held-out data to balance bias and variance.
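The bias-variance decomposition above can be made concrete with a small Monte Carlo experiment. The sketch below (NumPy assumed; the choice of a uniform distribution, support size, and trial count are illustrative) repeatedly applies the plug-in estimator to samples from a known distribution and reports the empirical bias, variance, and resulting MSE.

```python
import numpy as np

rng = np.random.default_rng(1)

def plugin_entropy(counts):
    """Plug-in Shannon entropy (nats) from a vector of symbol counts."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

# True distribution: uniform over k symbols, so H = log(k) nats.
k, n, trials = 100, 200, 2000
true_H = np.log(k)

estimates = []
for _ in range(trials):
    sample = rng.integers(0, k, size=n)
    counts = np.bincount(sample, minlength=k)
    estimates.append(plugin_entropy(counts))
estimates = np.array(estimates)

bias = estimates.mean() - true_H      # negative for the plug-in estimator
variance = estimates.var()
mse = bias**2 + variance              # MSE = bias^2 + variance
print(f"bias={bias:.4f}, variance={variance:.5f}, MSE={mse:.5f}")
```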

Non-Parametric Estimators

Histogram Estimator

The histogram estimator, also known as the plug-in estimator, provides a straightforward non-parametric approach to entropy estimation for discrete random variables or continuous variables discretized via binning. Given a sample of n independent and identically distributed observations from a distribution over a finite or countably infinite support, the method partitions the support into k non-overlapping bins. For each bin i, the empirical probability is estimated as \hat{p}_i = n_i / n, where n_i denotes the number of samples falling into bin i. The entropy estimate is then calculated using Shannon's formula applied to these empirical probabilities: \hat{H} = -\sum_{i=1}^k \hat{p}_i \log \hat{p}_i, with the convention that terms where n_i = 0 are omitted, as \hat{p}_i \log \hat{p}_i \to 0 as \hat{p}_i \to 0. This plug-in approach directly substitutes the empirical distribution into the entropy functional, making it computationally efficient for low-dimensional or discrete data.

A key limitation of the histogram estimator is its negative bias, which arises primarily from unobserved bins, that is, regions with positive true probability mass that contain no samples, leading to an underestimation of the true entropy H. The expected bias scales approximately as -(k-1)/(2n) for large n, where k is the number of bins. To mitigate this, the Miller-Madow correction adjusts the estimate by adding a term that accounts for the unseen mass: \hat{H}_{MM} = \hat{H} + \frac{k - 1}{2n}, where k approximates the effective support size (e.g., the alphabet size for discrete data or the number of bins for binned continuous data). This correction, derived from asymptotic bias analysis, significantly reduces the underestimation for moderate sample sizes but assumes k is known or well-estimated.

Selecting the number of bins k is critical, as it governs the bias-variance trade-off: too few bins increase bias by oversmoothing the distribution, while too many amplify variance by fragmenting the sample. Empirical rules for one-dimensional cases include Sturges' formula, k = \lceil 1 + \log_2 n \rceil, which aims for bins that capture the data's scale without excessive fragmentation, or the rule of thumb k \approx \sqrt{n} for roughly balanced resolution. In practice, these choices work well for n in the thousands but degrade for very small or very large samples. For multidimensional continuous data, the histogram estimator extends by forming a grid of bins across dimensions, but this quickly becomes impractical due to the curse of dimensionality, where the number of bins grows exponentially with the dimension d (e.g., k^d bins for k bins per dimension).

The advantages of the histogram estimator include its simplicity, interpretability, and minimal computational demands, making it a common baseline method across fields. However, its performance is highly sensitive to binning choices, particularly in high dimensions, where sparse bins lead to unreliable probability estimates and exacerbated bias.
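A minimal sketch of the plug-in estimate with the Miller-Madow correction follows (NumPy assumed). As one practical reading of the correction described above, k is taken here as the number of non-empty bins; the undersampled-regime example is illustrative.

```python
import numpy as np

def plugin_entropy(counts):
    """Uncorrected plug-in Shannon entropy (nats) from bin counts."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def miller_madow_entropy(counts):
    """Plug-in estimate plus the (k-1)/(2n) first-order bias correction,
    with k taken as the number of observed (non-empty) bins."""
    n = counts.sum()
    k = np.count_nonzero(counts)
    return plugin_entropy(counts) + (k - 1) / (2 * n)

rng = np.random.default_rng(2)
alphabet, n = 50, 100                      # undersampled: n is only twice the alphabet
sample = rng.integers(0, alphabet, size=n)
counts = np.bincount(sample, minlength=alphabet)

print("true H      :", np.log(alphabet))
print("plug-in     :", plugin_entropy(counts))        # underestimates
print("Miller-Madow:", miller_madow_entropy(counts))  # partially corrected
```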

Nearest-Neighbor Estimator

The nearest-neighbor estimator, particularly the Kozachenko-Leonenko (KL) estimator, provides a non-parametric method for approximating the differential entropy of a continuous random vector from a finite sample of i.i.d. observations. It relies on distances to k-nearest neighbors to locally estimate the underlying probability density, offering an adaptive approach that adjusts to variations in data density without requiring predefined bins or parametric assumptions. This makes it particularly suitable for multidimensional data where global partitioning methods may fail.

The KL estimator is defined as \hat{H}(X) = \psi(n) - \psi(k) + \log(c_d) + \frac{d}{n} \sum_{i=1}^n \log \varepsilon_i(k), where \psi(\cdot) is the digamma function, n is the sample size, k is the number of nearest neighbors, d is the data dimensionality, c_d is the volume of the unit ball in d dimensions (e.g., c_d = \pi^{d/2} / \Gamma(d/2 + 1) for the Euclidean norm), and \varepsilon_i(k) is the Euclidean distance from the i-th observation to its k-th nearest neighbor (excluding itself). This formulation derives from approximating the local density at each point x_i as k / ((n-1) c_d \varepsilon_i(k)^d), with the entropy contribution averaged using digamma functions to correct for finite-sample bias. The estimator is asymptotically unbiased and consistent under mild conditions on the density, such as absolute continuity. The choice of k balances bias and variance: small values like k=1 minimize bias but increase variance, while larger k (typically 3 to 5) stabilize estimates, especially in low-density regions.

As a non-parametric method, the KL estimator adapts to local density variations, avoiding artifacts from fixed binning and performing better than histogram-based approaches in moderate to high dimensions by leveraging geometric distances rather than arbitrary partitions. It has been widely adopted in fields such as neuroscience and machine learning for its computational efficiency, with O(n \log n) complexity via efficient nearest-neighbor searches. For discrete data, the estimator can be adapted by employing discrete distance metrics such as the Hamming distance or by embedding the symbols into a continuous space (e.g., via random projections), allowing estimation of Shannon entropy while mitigating issues like zero distances from identical samples.

Despite its strengths, the estimator remains sensitive to the choice of distance metric (Euclidean may underperform on non-isotropic data compared to Manhattan or Mahalanobis distances) and suffers from the curse of dimensionality, where the bias decays only as O(n^{-1/d}) in high d, leading to unreliable estimates beyond roughly 10–20 dimensions without dimensionality reduction.
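The formula above translates directly into code. The sketch below assumes NumPy and SciPy (cKDTree for neighbor search, digamma and gammaln for the correction terms) and checks the estimate against the known differential entropy of a 2-D standard Gaussian; the sample size and k are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x, k=3):
    """Kozachenko-Leonenko k-NN differential entropy estimate, in nats."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    tree = cKDTree(x)
    # Distance to the k-th nearest neighbor of each point, excluding the point itself.
    eps = tree.query(x, k=k + 1)[0][:, -1]
    # log c_d = log volume of the d-dimensional Euclidean unit ball.
    log_c_d = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_c_d + d * np.mean(np.log(eps))

rng = np.random.default_rng(3)
x = rng.standard_normal((5000, 2))             # 2-D standard Gaussian sample
true_h = 0.5 * 2 * np.log(2 * np.pi * np.e)    # (d/2) log(2*pi*e), identity covariance
print(kl_entropy(x, k=3), true_h)
```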

Sample-Spacing Estimator

The sample-spacing estimator is a non-parametric approach to differential entropy estimation that leverages the gaps, or spacings, between ordered observations from a one-dimensional continuous probability density. Given an i.i.d. sample X_1, \dots, X_n from a density f, the observations are ordered as X_{(1)} \leq \dots \leq X_{(n)}. Adjacent spacings are computed as D_i = X_{(i)} - X_{(i-1)} for i = 2, \dots, n, often with boundary adjustments such as replicating the extreme order statistics or incorporating the sample range. A local density estimate is then formed over each spacing, approximating f(x) \approx \frac{1}{(n+1) D_i}, which provides an intuitive inverse relationship between spacing size and density height.

A prominent variant is the Vasicek estimator, which uses wider m-spacings for improved stability: \hat{H}_V = \frac{1}{n} \sum_{i=1}^{n-m} \log \left( \frac{n+1}{m} (X_{(i+m)} - X_{(i)}) \right), where m is a window parameter satisfying m \to \infty and m/n \to 0 as n \to \infty. This estimator achieves weak consistency under mild conditions on f, such as boundedness and positivity over its support.

Sample-spacing estimators offer advantages in simplicity and computational efficiency, requiring only a sorting of the sample. They perform especially well for the uniform distribution, whose spacings are approximately exponentially distributed, making them particularly reliable in that case. Extensions to multidimensional settings are possible through projections onto one-dimensional subspaces, though this introduces complexity and potential bias from the projection step. However, these methods exhibit high variance for non-uniform densities with strongly varying local structure, and they are inherently designed for one dimension, with higher-dimensional adaptations remaining challenging and less developed.
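A minimal sketch of the m-spacing formula above follows (NumPy assumed). The default window m ≈ √n is a common heuristic rather than a prescription from the text, and the test distributions are illustrative.

```python
import numpy as np

def m_spacing_entropy(x, m=None):
    """One-sided m-spacing (Vasicek-type) differential entropy estimate, in nats."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    if m is None:
        m = int(round(np.sqrt(n)))        # common heuristic: m grows like sqrt(n)
    spacings = x[m:] - x[:-m]             # X_(i+m) - X_(i) for i = 1, ..., n - m
    # (1/n) * sum log((n+1)/m * spacing), matching the formula in the text.
    return np.sum(np.log((n + 1) / m * spacings)) / n

rng = np.random.default_rng(4)
u = rng.uniform(0, 1, size=5000)          # true h = 0 nats
g = rng.standard_normal(5000)             # true h = 0.5 * log(2*pi*e) ≈ 1.419 nats
print(m_spacing_entropy(u), m_spacing_entropy(g))
```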

Parametric Estimators

Bayesian Estimator

Bayesian estimators for entropy incorporate prior distributions to regularize probability estimates, particularly beneficial in scenarios with limited samples where maximum likelihood approaches suffer from severe bias. For discrete distributions, the Dirichlet prior is commonly employed, assuming the probability vector \mathbf{p} follows p \sim \mathrm{Dir}(\alpha, \dots, \alpha) with symmetric concentration parameter \alpha > 0. Given n observations with counts n_i for i = 1, \dots, k categories, the posterior mean probabilities are \hat{h}_i = \frac{n_i + \alpha}{n + k\alpha}, and the entropy estimate is \hat{H} = -\sum_{i=1}^k \hat{h}_i \log \hat{h}_i. A prominent example is the Nemenman-Shafee-Bialek (NSB) estimator, which uses an infinite of Dirichlet , chosen such that the induced prior over the is approximately . This approach mitigates underestimation in sparse data by adaptively smoothing based on the data, yielding low even when the sample size n is much smaller than the alphabet size |\mathcal{X}|. The method ensures desirable asymptotic properties, such as consistency and reduced variance in undersampled regimes. For continuous distributions, Bayesian estimation often approximates the density via discretization into bins, applying the Dirichlet prior to the resulting multinomial counts as in the discrete case. This binning strategy allows extension of the posterior mean approach while preserving Bayesian regularization. These estimators excel in handling undersampling, such as in genomics where large alphabets (e.g., DNA n-mers) lead to many unobserved events, and provide uncertainty quantification through the posterior variance of the entropy, enabling credible intervals for reliable inference. In neuroscience, they have been applied to estimate entropy in neural spike trains, demonstrating small bias and accurate information measures from limited recordings.
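The simple Dirichlet posterior-mean plug-in described in the first paragraph can be sketched as follows (NumPy assumed). Note that this is the entropy of the posterior-mean probabilities, not the full NSB posterior mean of the entropy; the alphabet size, sample size, and \alpha = 1 choice are illustrative.

```python
import numpy as np

def dirichlet_entropy(counts, alpha=1.0):
    """Entropy (nats) of the posterior-mean probabilities under a symmetric
    Dirichlet(alpha) prior: p_i = (n_i + alpha) / (n + k * alpha)."""
    counts = np.asarray(counts, dtype=float)
    n, k = counts.sum(), len(counts)
    p_post = (counts + alpha) / (n + k * alpha)   # posterior mean probabilities
    return -np.sum(p_post * np.log(p_post))

rng = np.random.default_rng(5)
k, n = 200, 50                                    # heavily undersampled: n << k
true_p = rng.dirichlet(np.ones(k))
counts = np.bincount(rng.choice(k, size=n, p=true_p), minlength=k)

true_H = -np.sum(true_p * np.log(true_p))
p_mle = counts[counts > 0] / n
print("true                  :", true_H)
print("MLE plug-in           :", -np.sum(p_mle * np.log(p_mle)))   # strong underestimate
print("Dirichlet(1) posterior:", dirichlet_entropy(counts, 1.0))
```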

Maximum Likelihood Estimator

The maximum likelihood estimator (MLE) for entropy assumes a parametric form for the underlying distribution and proceeds by first obtaining the maximum likelihood estimates of the model parameters from the observed data, then substituting these estimates into the analytical expression for the entropy of the assumed distribution. This approach serves as a fundamental baseline in parametric entropy estimation, leveraging the asymptotic efficiency of the MLE under correct model specification.

In the discrete case, the distribution is typically modeled as multinomial over a finite alphabet \mathcal{X}, with the MLE of the symbol probabilities given by \hat{p}_i = n_i / n for each symbol i, where n_i denotes the observed count of symbol i and n is the total sample size. The resulting entropy estimate is then \hat{H}_{\text{MLE}} = -\sum_{i \in \mathcal{X}} \hat{p}_i \log \hat{p}_i, which coincides with the uncorrected empirical (histogram) estimator when bin counts align with symbol frequencies. This estimator exhibits negative bias, with its expected value approximated as \mathbb{E}[\hat{H}_{\text{MLE}}] \approx H - (|\mathcal{X}| - 1)/(2n) for large n, where H is the true entropy; the bias arises from the underestimation of low-probability events and intensifies as the alphabet size |\mathcal{X}| grows relative to n, though no explicit correction is incorporated in the standard formulation.

For continuous distributions belonging to a parametric family \{p(x \mid \theta) : \theta \in \Theta\}, the MLE \hat{\theta} maximizes the likelihood \prod_{j=1}^n p(x_j \mid \theta) over the observed samples \{x_1, \dots, x_n\}, after which the differential entropy is evaluated as \hat{h}_{\text{MLE}}(\hat{\theta}) = -\int p(x \mid \hat{\theta}) \log p(x \mid \hat{\theta}) \, dx. A prominent example is the multivariate Gaussian distribution in d dimensions, where \hat{\theta} includes the mean vector and covariance matrix, yielding the closed-form entropy \hat{h}_{\text{MLE}} = \frac{d}{2} \log (2 \pi e) + \frac{1}{2} \log \det \hat{\Sigma}, with \hat{\Sigma} the maximum likelihood estimate of the covariance.

Under the correct model, the MLE-based estimator is asymptotically unbiased and achieves the Cramér-Rao lower bound as n \to \infty, making it statistically optimal in well-specified low-dimensional settings; moreover, its simplicity facilitates implementation when the entropy functional admits closed-form evaluation or standard optimization routines. However, the estimator remains biased for finite n without adjustments and is inconsistent when the assumed model is misspecified, particularly suffering from severe underestimation in high-dimensional spaces, as the plug-in approach ignores distributional mismatches and amplifies parameter estimation errors in the entropy computation.
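The Gaussian case admits the closed form given above, which the following sketch implements (NumPy assumed; the covariance matrix and sample size are illustrative) and compares against the known differential entropy of the generating distribution.

```python
import numpy as np

def gaussian_mle_entropy(x):
    """Parametric differential entropy estimate (nats) under a multivariate Gaussian model:
    h_hat = (d/2) * log(2*pi*e) + (1/2) * log det(Sigma_hat), Sigma_hat the MLE covariance."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    sigma_hat = np.atleast_2d(np.cov(x, rowvar=False, bias=True))  # MLE: divide by n
    _, logdet = np.linalg.slogdet(sigma_hat)
    return 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * logdet

rng = np.random.default_rng(6)
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=10_000)
true_h = 0.5 * 2 * np.log(2 * np.pi * np.e) + 0.5 * np.log(np.linalg.det(cov))
print(gaussian_mle_entropy(x), true_h)
```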

Advanced Estimators

Expected Entropy Estimator

The expected entropy estimator addresses the challenge of bias in traditional plug-in estimates by computing the conditional expected entropy E[H \mid n] given a finite sample of size n, rather than directly approximating the true entropy H. This approach is particularly valuable when samples are limited, as it incorporates analytical corrections derived from the expected value of the estimator under the observed distribution. For independent and identically distributed (i.i.d.) samples, the plug-in estimator \hat{H} is known to be biased such that E[\hat{H}] = H - b(n), where b(n) is a positive term decreasing with n; the expected entropy estimator inverts this relation by adding a correction term to yield an approximately unbiased estimate.

A seminal approach uses a Bayesian framework with a mixture of Dirichlet priors to flatten the induced prior over possible entropy values, ensuring the posterior expected entropy E[H \mid \{n_i\}] is robust to undersampling. The estimator averages the posterior entropy over a range of concentration parameters \beta, leveraging the digamma function \psi: the prior expected entropy under a symmetric Dirichlet with parameter \beta is \xi(\beta) = \psi(\kappa + 1) - \psi(\beta + 1), where \kappa = K \beta and K is the alphabet size. This achieves low relative bias (under 10%) even for n \ll K, outperforming simple corrections like Miller-Madow in small-sample regimes.

For dependent processes such as Markov chains, the expected entropy estimator extends to the entropy rate H = -\sum_i \pi_i \sum_j p_{ij} \log p_{ij}, where \pi is the stationary distribution and p_{ij} are the transition probabilities; finite-sample adjustments account for bias in estimated transitions via digamma corrections applied to transition counts. One adaptation, the Chao-Wang-Jost estimator, applies digamma corrections to the discovery rates of states and transitions, \hat{H}^{\text{CWJ}} = \sum_i \frac{n_i}{n} \left( \psi(n) - \psi(n_i) \right) + \text{(unseen-state adjustment)}, yielding low bias for sequences with memory.

These estimators offer advantages in handling dependent data, where standard i.i.d. methods fail due to unaccounted correlations; they reduce bias by explicitly modeling sample-size effects on transition estimates, making them suitable for data with temporal structure. Applications include source coding for memoryful processes, where accurate entropy rates enable efficient compression, and small-n scenarios like genomic sequences or neural spike trains exhibiting Markov-like dependence.
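The observed-state part of the digamma-corrected formula above can be sketched as follows (NumPy and SciPy assumed). The additional unseen-state adjustment based on rare-count statistics is deliberately omitted here, so this is only the first term of the Chao-Wang-Jost-style correction described in the text; the test distribution is illustrative.

```python
import numpy as np
from scipy.special import digamma

def digamma_corrected_entropy(counts):
    """Observed-state part of the digamma-corrected entropy estimate (nats).
    Uses psi(n) - psi(n_i) = sum_{k=n_i}^{n-1} 1/k as a bias-reduced
    replacement for -log(n_i / n); no unseen-state adjustment is applied."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    n = counts.sum()
    return np.sum((counts / n) * (digamma(n) - digamma(counts)))

rng = np.random.default_rng(7)
k, n = 100, 80
true_p = rng.dirichlet(np.ones(k))
counts = np.bincount(rng.choice(k, size=n, p=true_p), minlength=k)

p_mle = counts[counts > 0] / n
print("true     :", -np.sum(true_p * np.log(true_p)))
print("plug-in  :", -np.sum(p_mle * np.log(p_mle)))
print("corrected:", digamma_corrected_entropy(counts))
```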

Neural Network Estimator

The Neural Joint Entropy Estimator (NJEE) is a deep learning-based approach for estimating the joint entropy of discrete random variables, particularly effective for high-dimensional data with large alphabets. It leverages the universal approximation capabilities of neural networks to model conditional distributions autoregressively via the chain rule of entropy, decomposing the joint entropy H(X) into a marginal entropy term plus conditional entropies. Specifically, for a multivariate variable X = (X_1, \dots, X_{d_x}), the estimate is given by \hat{H}_n(X) = \hat{H}_n(X_1) + \sum_{m=2}^{d_x} \hat{\text{CE}}_n(G_{\theta_m}(X_m \mid X_{1:m-1})), where \hat{H}_n(X_1) is the marginal entropy of the first component (often estimated via a plug-in method or another NJEE recursion), and \hat{\text{CE}}_n(G_{\theta_m}) = -\frac{1}{n} \sum_{i=1}^n \log G_{\theta_m}(x_{i,m} \mid x_{i,1:m-1}) is the empirical cross-entropy loss of a neural network G_{\theta_m} that approximates the conditional probability mass function P(X_m \mid X_{1:m-1}). The network G_{\theta_m} is trained by minimizing this loss over samples from the joint distribution, using a softmax output layer whose size matches the alphabet of X_m. This setup allows NJEE to capture complex dependencies without assuming a parametric form for the joint distribution.

In practice, the architecture of each G_{\theta_m} consists of fully connected layers (e.g., two hidden layers with 50 nodes each, ReLU activations) for low-dimensional or tabular data, but extends to convolutional neural networks for image data or transformer-based models for sequential data like text, enabling handling of structured inputs. Training involves standard optimizers like ADAM and relies on the data samples themselves, often augmented with shuffling for marginal estimates if needed. NJEE demonstrates strong consistency, with the estimation error bounded by |\hat{H}_n(X) - H(X)| \leq C \epsilon + \delta for sufficiently large n, under the universal approximation theorem for continuously differentiable activations.

Key advantages of NJEE include its ability to manage large alphabet sizes (e.g., outperforming histogram-based methods in error for alphabets up to 1000 symbols with n \leq 1000 samples) and to model intricate dependencies in high-dimensional settings, where traditional non-parametric estimators struggle due to the curse of dimensionality. It also facilitates extensions to mutual information and related measures by differencing joint and conditional entropies.

For continuous (differential) entropy estimation, neural network methods adapt normalizing flows, which are invertible transformations f: \mathbb{R}^d \to \mathbb{R}^d mapping data samples to a simple base distribution (e.g., Gaussian) with known entropy. The differential entropy is then estimated as \hat{h}(X) = h(Z) - \frac{1}{n} \sum_{i=1}^n \log |\det J_f(x_i)|, where Z = f(X) follows the base distribution, h(Z) is the base entropy, and J_f is the Jacobian of f; the flow is trained to maximize the likelihood of the data under the induced density. Common architectures include autoregressive flows or coupling layers, suitable for high-dimensional continuous data like images.

Despite these strengths, estimators like NJEE and flow-based methods are computationally intensive, requiring significant training time and resources (e.g., GPU acceleration for flows in dimensions > 10). They demand large sample sizes (n \gg d) to avoid overfitting and provide reliable approximations, and their black-box nature limits interpretability compared to explicit geometric estimators.
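The chain-rule decomposition can be illustrated with a deliberately small two-component example. The sketch below assumes PyTorch and NumPy; the single hidden layer, synthetic data, optimizer settings, and step count are illustrative choices rather than the architecture prescribed by the NJEE authors. The joint entropy is approximated as the plug-in marginal entropy of X_1 plus the trained network's cross-entropy for X_2 given X_1.

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(8)
n, k = 20_000, 10                            # sample size and per-component alphabet
x1 = rng.integers(0, k, size=n)
x2 = (x1 + rng.integers(0, 3, size=n)) % k   # X2 depends on X1; H(X2|X1) = log 3

# Plug-in estimate of the marginal entropy H(X1), in nats.
p1 = np.bincount(x1, minlength=k) / n
h1 = -np.sum(p1[p1 > 0] * np.log(p1[p1 > 0]))

# Small network G_theta approximating P(X2 | X1): one-hot input, softmax over k outputs.
model = nn.Sequential(nn.Linear(k, 50), nn.ReLU(), nn.Linear(50, k))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()              # empirical cross-entropy, in nats

X1 = torch.eye(k)[torch.as_tensor(x1)]       # one-hot encoding of X1
X2 = torch.as_tensor(x2)
for _ in range(300):                         # a few full-batch gradient steps
    opt.zero_grad()
    loss = loss_fn(model(X1), X2)
    loss.backward()
    opt.step()

h_joint = h1 + loss.item()                   # H_hat(X) = H_hat(X1) + CE_n(G_theta)
print("NJEE-style estimate   :", h_joint)
print("true H(X1) + H(X2|X1) :", np.log(k) + np.log(3))
```

Because the cross-entropy upper-bounds the true conditional entropy, the estimate approaches the target from above as the network fits the conditional distribution.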
