
Entropy estimation

Entropy estimation refers to the approximation of the entropy of a probability distribution based on a finite set of observed samples, serving as a measure of uncertainty or information content in data from discrete or continuous sources. In information theory, Shannon entropy quantifies the average unpredictability of a discrete random variable X with probability mass function p(x) as H(X) = -\sum p(x) \log p(x), while differential entropy extends this to continuous variables with density f(x) as h(X) = -\int f(x) \log f(x) \, dx. The entropy rate further generalizes this to stochastic processes, representing the average entropy per symbol or time step in sequences.

For discrete distributions, estimation is particularly challenging when the support size is large or unknown, as unseen symbols in finite samples lead to underestimation by plug-in methods like the maximum likelihood estimator. Nonparametric approaches, such as the Good-Turing estimator or Bayesian methods using Dirichlet or Pitman-Yor priors, address this by incorporating smoothing or hierarchical modeling to account for unobserved outcomes, achieving consistency under power-law tail assumptions. Parametric methods assume a specific model, such as a Gaussian distribution, and use maximum likelihood for faster convergence but risk bias from model misspecification. In the continuous case, nonparametric techniques dominate due to the lack of finite support, including histogram-based, kernel density estimation (KDE), nearest-neighbor, and spacing methods that approximate the density integral via sample spacings or local densities. These estimators aim for root-n consistency under smoothness and tail decay conditions, though they suffer from the curse of dimensionality, where performance degrades rapidly beyond low dimensions (e.g., d > 3). Hybrid approaches, like resubstitution or cross-validation variants, mitigate bias but increase computational demands.

Key challenges across both domains include finite-sample bias and variance, sensitivity to hyperparameters (e.g., bin width in histograms or kernel bandwidth in KDE), and scalability for high-dimensional or sequential data. Recent advances incorporate machine learning, such as neural estimators for entropy production or min-entropy, enhancing robustness in nonequilibrium systems. Applications span data compression for source coding limits, machine learning for feature selection and model evaluation, neuroscience for neural information quantification, and physics for complexity analysis in time series. In cryptography, accurate entropy estimation ensures randomness quality in random number generators.

Fundamentals

Definition and Importance

Entropy estimation refers to the process of approximating the entropy of a random variable from a finite set of observed samples, where the true underlying distribution is unknown. For discrete random variables, the quantity of interest is the Shannon entropy of X, denoted H(X), which for probability mass function p(x) is defined as H(X) = -\sum_{x} p(x) \log p(x), where the logarithm is typically base 2 to measure entropy in bits. This measure quantifies the average uncertainty or information content associated with the possible outcomes of X. Introduced by Claude Shannon in his seminal 1948 paper "A Mathematical Theory of Communication," entropy provided a foundational metric for analyzing communication systems and has since become central to information theory.

The importance of entropy estimation stems from its wide-ranging applications across disciplines. In data compression, it sets the theoretical limit on the average number of bits needed to encode messages from a source, enabling efficient storage and transmission. In machine learning, entropy guides feature selection by identifying variables that reduce uncertainty in predictive models, such as in decision trees where lower conditional entropy indicates informative splits. Applications extend to neuroscience, where estimating entropy from neural spike patterns reveals coding efficiency and information processing in the brain, and to physics, drawing analogies to thermodynamic entropy as a measure of disorder in physical systems.

In practice, the true p(x) is rarely known, necessitating estimation from finite samples \{x_1, \dots, x_n\} drawn from the process generating the data. The goal of entropy estimation is to compute an estimate \hat{H} such that \hat{H} \approx H(X), with controlled bias (systematic error) and variance (random fluctuation) to ensure reliability in downstream analyses. For continuous random variables, a related concept is differential entropy, which extends the discrete case but requires careful handling of densities and support.
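As a concrete illustration of the empirical (plug-in) approach described above, the following minimal sketch computes \hat{H} from observed symbol frequencies. It assumes only NumPy; the function name and the fair-coin example are illustrative, not taken from any particular reference.

```python
import numpy as np

def plugin_entropy(samples, base=2):
    """Plug-in (empirical-frequency) Shannon entropy estimate, in bits by default."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()              # empirical probabilities p_hat(x)
    return -np.sum(p * np.log(p)) / np.log(base)

# Example: a fair coin has H = 1 bit; the estimate approaches 1 as n grows.
rng = np.random.default_rng(0)
sample = rng.integers(0, 2, size=10_000)
print(plugin_entropy(sample))              # close to 1.0 bit
```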

Types of Entropy

Entropy estimation commonly involves several variants of entropy, each tailored to different types of probability distributions and applications in information theory. The primary forms include discrete Shannon entropy for finite or countable sample spaces, differential entropy for continuous distributions, and Rényi entropy as a parameterized generalization. These types differ in their mathematical formulations and properties, influencing the choice of estimation methods from data samples. Related quantities, such as mutual information, capture dependence between variables and are expressed in terms of entropies. The entropy rate extends these notions to stochastic processes, defined for a process \{X_t\} as H = \lim_{n \to \infty} \frac{1}{n} H(X_1, \dots, X_n), representing the average entropy per symbol or time step in sequences.

Discrete Shannon entropy, introduced by Claude Shannon, quantifies the uncertainty in a random variable X taking values in a finite alphabet \mathcal{X} with probability mass function p(x). It is defined as H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x), where the logarithm is typically base 2 for bits or natural for nats. This measure is always non-negative and achieves its maximum when all outcomes are equally likely.

For continuous random variables, differential entropy extends the concept but addresses the challenges of uncountable support. For a continuous X with density p(x), the differential entropy is h(X) = -\int_{-\infty}^{\infty} p(x) \log p(x) \, dx. Unlike discrete entropy, differential entropy can take negative values, for example for distributions more concentrated than a uniform over a unit interval, and it is not invariant under nonlinear transformations of the variable, requiring careful treatment of coordinates in applications. Continuous forms also demand handling of infinite support, where densities integrate to 1 but probabilities over finite intervals are less than 1, complicating direct analogies to discrete cases.

Rényi entropy generalizes Shannon entropy through an order parameter \alpha > 0, \alpha \neq 1, providing a family of measures useful for their varying sensitivity to probability distributions. For a random variable X, the Rényi entropy of order \alpha is H_\alpha(X) = \frac{1}{1-\alpha} \log \sum_{x \in \mathcal{X}} p(x)^\alpha, with the Shannon entropy recovered in the limit as \alpha \to 1. Higher orders (\alpha > 1) emphasize high-probability outcomes and de-emphasize rare events, offering robustness to outliers or noise in estimation contexts compared to Shannon entropy.

Mutual information captures the shared information between two random variables X and Y, expressed in terms of entropies as I(X; Y) = H(X) + H(Y) - H(X,Y), where H(X,Y) is the joint entropy. This quantity is always non-negative and equals zero when X and Y are independent, making it a key tool for dependence estimation in joint distributions.

A fundamental distinction among these types is that discrete entropy is inherently non-negative, since each term -p(x) \log p(x) is non-negative when p(x) \leq 1, whereas differential entropy lacks this bound and requires density estimation techniques sensitive to the choice of support and binning. Rényi entropies for \alpha \neq 1 also remain non-negative in the discrete case but adapt differently to estimation robustness needs.
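The following short sketch (NumPy assumed; function names are illustrative) evaluates the discrete Shannon and Rényi entropies for a known distribution and shows the Rényi value approaching the Shannon value as \alpha \to 1.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy of a discrete distribution p, in bits."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha (alpha > 0, alpha != 1), in bits."""
    p = p[p > 0]
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.25, 0.125, 0.125])
print(shannon_entropy(p))                # 1.75 bits
print(renyi_entropy(p, alpha=0.999))     # approaches the Shannon value as alpha -> 1
print(renyi_entropy(p, alpha=2.0))       # collision entropy, never above Shannon
```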

Estimation Challenges

Estimating entropy from finite samples presents significant challenges due to the bias-variance trade-off inherent in most estimators. Nonparametric estimators, such as the plug-in method, exhibit a negative bias for finite sample sizes n, systematically underestimating the true entropy, though this bias diminishes asymptotically as n \to \infty under appropriate conditions like m = o(n), where m is the support size. The variance of these estimators decreases with increasing n, typically scaling as O((\log n)^2 / n), but achieving low overall error requires balancing the bias reduction against potential variance inflation from overly complex models.

Sample size requirements pose a major hurdle, particularly for discrete distributions, where reliable estimation demands n \gg |\mathcal{X}|, the alphabet size, to mitigate severe underestimation from unseen symbols. In high-dimensional settings, the curse of dimensionality exacerbates this issue, as the effective sample size needed grows exponentially with the dimension d, leading to error terms decaying only as O(n^{-\gamma/d}) for methods like k-nearest neighbors, rendering estimation impractical without dimensionality reduction. For continuous distributions, similar demands arise, often requiring n on the order of c^d for accuracy, where c > 1.

Consistency of entropy estimators relies on asymptotic behavior as n \to \infty; strong consistency, in the almost sure sense, holds for many nonparametric approaches when the support size grows slower than n, ensuring convergence to the true entropy. However, only weak consistency in probability may be established in more restricted regimes, such as with fixed support size. The underlying distribution further influences accuracy: uniform distributions, with their maximum entropy, are generally easier to estimate with lower bias compared to heavy-tailed ones like the Student's t, where tail behavior amplifies error and requires tailored smoothing to achieve root-n consistency. Non-stationarity in the data-generating process, violating the standard i.i.d. assumption, introduces additional inconsistency by altering the effective sample size.

Evaluation of entropy estimators commonly employs the mean squared error (MSE), defined as \text{MSE} = \text{bias}^2 + \text{variance}, which quantifies the total estimation error and guides comparisons across methods. Cross-validation techniques are frequently used to tune hyperparameters, such as bandwidths in kernel-based approaches, by minimizing empirical MSE on held-out data to balance bias and variance.
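The bias-variance decomposition above can be made concrete with a small Monte Carlo experiment. The sketch below (NumPy assumed; the choice of a uniform distribution, support size, and trial count are illustrative) repeatedly applies the plug-in estimator to samples from a known distribution and reports the empirical bias, variance, and resulting MSE.

```python
import numpy as np

rng = np.random.default_rng(1)

def plugin_entropy(counts):
    """Plug-in Shannon entropy (nats) from a vector of symbol counts."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

# True distribution: uniform over k symbols, so H = log(k) nats.
k, n, trials = 100, 200, 2000
true_H = np.log(k)

estimates = []
for _ in range(trials):
    sample = rng.integers(0, k, size=n)
    counts = np.bincount(sample, minlength=k)
    estimates.append(plugin_entropy(counts))
estimates = np.array(estimates)

bias = estimates.mean() - true_H      # negative for the plug-in estimator
variance = estimates.var()
mse = bias**2 + variance              # MSE = bias^2 + variance
print(f"bias={bias:.4f}, variance={variance:.5f}, MSE={mse:.5f}")
```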

Non-Parametric Estimators

Histogram Estimator

The histogram estimator, also known as the plug-in estimator, provides a straightforward non-parametric approach to entropy estimation for discrete random variables or continuous variables discretized via binning. Given a sample of n independent and identically distributed observations from a distribution over a finite or countably infinite support, the method partitions the support into k non-overlapping bins. For each bin i, the empirical probability is estimated as \hat{p}_i = n_i / n, where n_i denotes the number of samples falling into bin i. The entropy estimate is then calculated using Shannon's formula applied to these empirical probabilities: \hat{H} = -\sum_{i=1}^k \hat{p}_i \log \hat{p}_i, with the convention that terms where n_i = 0 are omitted, as \hat{p}_i \log \hat{p}_i \to 0 as \hat{p}_i \to 0. This plug-in approach directly substitutes the empirical distribution into the entropy functional, making it computationally efficient for low-dimensional or discrete data.

A key limitation of the histogram estimator is its negative bias, which arises primarily from unobserved bins, that is, regions with positive true probability mass that contain no samples, leading to an underestimation of the true entropy H. The expected bias scales approximately as -(k-1)/(2n) for large n, where k is the number of bins. To mitigate this, the Miller-Madow correction adjusts the estimate by adding a term that accounts for the unseen mass: \hat{H}_{MM} = \hat{H} + \frac{k - 1}{2n}, where k approximates the effective support size (e.g., the alphabet size for discrete data or the number of bins for binned continuous data). This correction, derived from asymptotic bias analysis, significantly reduces the underestimation for moderate sample sizes but assumes k is known or well-estimated.

Selecting the number of bins k is critical, as it governs the bias-variance trade-off: too few bins increase bias by oversmoothing the distribution, while too many amplify variance by fragmenting the sample. Empirical rules for one-dimensional cases include Sturges' formula, k = \lceil 1 + \log_2 n \rceil, which aims for bins that capture the data's scale without excessive fragmentation, or the rule of thumb k \approx \sqrt{n} for roughly balanced resolution. In practice, these choices work well for n in the thousands but degrade for very small or very large samples. For multidimensional continuous data, the histogram estimator extends by forming a grid of bins across dimensions, but this quickly becomes impractical due to the curse of dimensionality, where the number of bins grows exponentially with the dimension d (e.g., k^d bins for k bins per dimension).

The advantages of the histogram estimator include its simplicity, interpretability, and minimal computational demands, making it a common baseline method across fields. However, its performance is highly sensitive to binning choices, particularly in high dimensions, where sparse bins lead to unreliable probability estimates and exacerbated bias.
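A minimal sketch of the plug-in estimate with the Miller-Madow correction follows (NumPy assumed). As one practical reading of the correction described above, k is taken here as the number of non-empty bins; the undersampled-regime example is illustrative.

```python
import numpy as np

def plugin_entropy(counts):
    """Uncorrected plug-in Shannon entropy (nats) from bin counts."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def miller_madow_entropy(counts):
    """Plug-in estimate plus the (k-1)/(2n) first-order bias correction,
    with k taken as the number of observed (non-empty) bins."""
    n = counts.sum()
    k = np.count_nonzero(counts)
    return plugin_entropy(counts) + (k - 1) / (2 * n)

rng = np.random.default_rng(2)
alphabet, n = 50, 100                      # undersampled: n is only twice the alphabet
sample = rng.integers(0, alphabet, size=n)
counts = np.bincount(sample, minlength=alphabet)

print("true H      :", np.log(alphabet))
print("plug-in     :", plugin_entropy(counts))        # underestimates
print("Miller-Madow:", miller_madow_entropy(counts))  # partially corrected
```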

Nearest-Neighbor Estimator

The nearest-neighbor estimator, particularly the Kozachenko-Leonenko (KL) estimator, provides a non-parametric method for approximating the differential entropy of a continuous random vector from a finite sample of i.i.d. observations. It relies on distances to k-nearest neighbors to locally estimate the underlying probability density, offering an adaptive approach that adjusts to variations in data density without requiring predefined bins or parametric assumptions. This makes it particularly suitable for multidimensional data where global partitioning methods may fail.

The KL estimator is defined as \hat{H}(X) = \psi(n) - \psi(k) + \log(c_d) + \frac{d}{n} \sum_{i=1}^n \log \varepsilon_i(k), where \psi(\cdot) is the digamma function, n is the sample size, k is the number of nearest neighbors, d is the data dimensionality, c_d is the volume of the unit ball in d dimensions (e.g., c_d = \pi^{d/2} / \Gamma(d/2 + 1) for the Euclidean norm), and \varepsilon_i(k) is the Euclidean distance from the i-th observation to its k-th nearest neighbor (excluding itself). This formulation derives from approximating the local density at each point x_i as k / ((n-1) c_d \varepsilon_i(k)^d), with the entropy contribution averaged using digamma functions to correct for finite-sample bias. The estimator is asymptotically unbiased and consistent under mild conditions on the density, such as absolute continuity. The choice of k balances bias and variance: small values like k=1 minimize bias but increase variance, while larger k (typically 3 to 5) stabilize estimates, especially in low-density regions.

As a non-parametric method, the KL estimator adapts to local density variations, avoiding artifacts from fixed binning and performing better than histogram-based approaches in moderate to high dimensions by leveraging geometric distances rather than arbitrary partitions. It has been widely adopted in fields such as neuroscience and machine learning for its computational efficiency, with O(n \log n) complexity via efficient nearest-neighbor searches. For discrete data, the estimator can be adapted by employing discrete distance metrics such as the Hamming distance or by embedding the symbols into a continuous space (e.g., via random projections), allowing estimation of Shannon entropy while mitigating issues like zero distances from identical samples.

Despite its strengths, the estimator remains sensitive to the choice of distance metric (Euclidean may underperform on non-isotropic data compared to Manhattan or Mahalanobis distances) and suffers from the curse of dimensionality, where the bias decays only as O(n^{-1/d}) in high d, leading to unreliable estimates beyond roughly 10–20 dimensions without dimensionality reduction.
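The formula above translates directly into code. The sketch below assumes NumPy and SciPy (cKDTree for neighbor search, digamma and gammaln for the correction terms) and checks the estimate against the known differential entropy of a 2-D standard Gaussian; the sample size and k are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x, k=3):
    """Kozachenko-Leonenko k-NN differential entropy estimate, in nats."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    tree = cKDTree(x)
    # Distance to the k-th nearest neighbor of each point, excluding the point itself.
    eps = tree.query(x, k=k + 1)[0][:, -1]
    # log c_d = log volume of the d-dimensional Euclidean unit ball.
    log_c_d = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_c_d + d * np.mean(np.log(eps))

rng = np.random.default_rng(3)
x = rng.standard_normal((5000, 2))             # 2-D standard Gaussian sample
true_h = 0.5 * 2 * np.log(2 * np.pi * np.e)    # (d/2) log(2*pi*e), identity covariance
print(kl_entropy(x, k=3), true_h)
```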

Sample-Spacing Estimator

The sample-spacing estimator is a non-parametric approach to differential entropy estimation that leverages the gaps, or spacings, between ordered observations from a one-dimensional continuous probability density. Given an i.i.d. sample X_1, \dots, X_n from a density f, the observations are ordered as X_{(1)} \leq \dots \leq X_{(n)}. Adjacent spacings are computed as D_i = X_{(i)} - X_{(i-1)} for i = 2, \dots, n, often with boundary adjustments such as replicating the extreme order statistics or incorporating the sample range. A local density estimate is then formed over each spacing, approximating f(x) \approx \frac{1}{(n+1) D_i}, which provides an intuitive inverse relationship between spacing size and density height.

A prominent variant is the Vasicek estimator, which uses wider m-spacings for improved stability: \hat{H}_V = \frac{1}{n} \sum_{i=1}^{n-m} \log \left( \frac{n+1}{m} (X_{(i+m)} - X_{(i)}) \right), where m is a window parameter satisfying m \to \infty and m/n \to 0 as n \to \infty. This estimator achieves weak consistency under mild conditions on f, such as boundedness and positivity over its support.

Sample-spacing estimators offer advantages in simplicity and computational efficiency, requiring only a sorting of the sample. They perform especially well for the uniform distribution, whose spacings are approximately exponentially distributed, making them particularly reliable in that case. Extensions to multidimensional settings are possible through projections onto one-dimensional subspaces, though this introduces complexity and potential bias from the projection step. However, these methods exhibit high variance for non-uniform densities with strongly varying local structure, and they are inherently designed for one dimension, with higher-dimensional adaptations remaining challenging and less developed.
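A minimal sketch of the m-spacing formula above follows (NumPy assumed). The default window m ≈ √n is a common heuristic rather than a prescription from the text, and the test distributions are illustrative.

```python
import numpy as np

def m_spacing_entropy(x, m=None):
    """One-sided m-spacing (Vasicek-type) differential entropy estimate, in nats."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    if m is None:
        m = int(round(np.sqrt(n)))        # common heuristic: m grows like sqrt(n)
    spacings = x[m:] - x[:-m]             # X_(i+m) - X_(i) for i = 1, ..., n - m
    # (1/n) * sum log((n+1)/m * spacing), matching the formula in the text.
    return np.sum(np.log((n + 1) / m * spacings)) / n

rng = np.random.default_rng(4)
u = rng.uniform(0, 1, size=5000)          # true h = 0 nats
g = rng.standard_normal(5000)             # true h = 0.5 * log(2*pi*e) ≈ 1.419 nats
print(m_spacing_entropy(u), m_spacing_entropy(g))
```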

Parametric Estimators

Bayesian Estimator

Bayesian estimators for entropy incorporate prior distributions to regularize probability estimates, particularly beneficial in scenarios with limited samples where maximum likelihood approaches suffer from severe bias. For discrete distributions, the Dirichlet prior is commonly employed, assuming the probability vector \mathbf{p} follows p \sim \mathrm{Dir}(\alpha, \dots, \alpha) with symmetric concentration parameter \alpha > 0. Given n observations with counts n_i for i = 1, \dots, k categories, the posterior mean probabilities are \hat{h}_i = \frac{n_i + \alpha}{n + k\alpha}, and the entropy estimate is \hat{H} = -\sum_{i=1}^k \hat{h}_i \log \hat{h}_i. A prominent example is the Nemenman-Shafee-Bialek (NSB) estimator, which uses an infinite of Dirichlet , chosen such that the induced prior over the is approximately . This approach mitigates underestimation in sparse data by adaptively smoothing based on the data, yielding low even when the sample size n is much smaller than the alphabet size |\mathcal{X}|. The method ensures desirable asymptotic properties, such as consistency and reduced variance in undersampled regimes. For continuous distributions, Bayesian estimation often approximates the density via discretization into bins, applying the Dirichlet prior to the resulting multinomial counts as in the discrete case. This binning strategy allows extension of the posterior mean approach while preserving Bayesian regularization. These estimators excel in handling undersampling, such as in genomics where large alphabets (e.g., DNA n-mers) lead to many unobserved events, and provide uncertainty quantification through the posterior variance of the entropy, enabling credible intervals for reliable inference. In neuroscience, they have been applied to estimate entropy in neural spike trains, demonstrating small bias and accurate information measures from limited recordings.
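The simple Dirichlet posterior-mean plug-in described in the first paragraph can be sketched as follows (NumPy assumed). Note that this is the entropy of the posterior-mean probabilities, not the full NSB posterior mean of the entropy; the alphabet size, sample size, and \alpha = 1 choice are illustrative.

```python
import numpy as np

def dirichlet_entropy(counts, alpha=1.0):
    """Entropy (nats) of the posterior-mean probabilities under a symmetric
    Dirichlet(alpha) prior: p_i = (n_i + alpha) / (n + k * alpha)."""
    counts = np.asarray(counts, dtype=float)
    n, k = counts.sum(), len(counts)
    p_post = (counts + alpha) / (n + k * alpha)   # posterior mean probabilities
    return -np.sum(p_post * np.log(p_post))

rng = np.random.default_rng(5)
k, n = 200, 50                                    # heavily undersampled: n << k
true_p = rng.dirichlet(np.ones(k))
counts = np.bincount(rng.choice(k, size=n, p=true_p), minlength=k)

true_H = -np.sum(true_p * np.log(true_p))
p_mle = counts[counts > 0] / n
print("true                  :", true_H)
print("MLE plug-in           :", -np.sum(p_mle * np.log(p_mle)))   # strong underestimate
print("Dirichlet(1) posterior:", dirichlet_entropy(counts, 1.0))
```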

Maximum Likelihood Estimator

The maximum likelihood estimator (MLE) for entropy assumes a parametric form for the underlying distribution and proceeds by first obtaining the maximum likelihood estimates of the model parameters from the observed data, then substituting these estimates into the analytical expression for the entropy of the assumed distribution. This approach serves as a fundamental baseline in parametric entropy estimation, leveraging the asymptotic efficiency of the MLE under correct model specification.

In the discrete case, the distribution is typically modeled as multinomial over a finite alphabet \mathcal{X}, with the MLE of the symbol probabilities given by \hat{p}_i = n_i / n for each symbol i, where n_i denotes the observed count of symbol i and n is the total sample size. The resulting entropy estimate is then \hat{H}_{\text{MLE}} = -\sum_{i \in \mathcal{X}} \hat{p}_i \log \hat{p}_i, which coincides with the uncorrected empirical (histogram) estimator when bin counts align with symbol frequencies. This estimator exhibits negative bias, with its expected value approximated as \mathbb{E}[\hat{H}_{\text{MLE}}] \approx H - (|\mathcal{X}| - 1)/(2n) for large n, where H is the true entropy; the bias arises from the underestimation of low-probability events and intensifies as the alphabet size |\mathcal{X}| grows relative to n, though no explicit correction is incorporated in the standard formulation.

For continuous distributions belonging to a parametric family \{p(x \mid \theta) : \theta \in \Theta\}, the MLE \hat{\theta} maximizes the likelihood \prod_{j=1}^n p(x_j \mid \theta) over the observed samples \{x_1, \dots, x_n\}, after which the differential entropy is evaluated as \hat{h}_{\text{MLE}}(\hat{\theta}) = -\int p(x \mid \hat{\theta}) \log p(x \mid \hat{\theta}) \, dx. A prominent example is the multivariate Gaussian distribution in d dimensions, where \hat{\theta} includes the mean vector and covariance matrix, yielding the closed-form entropy \hat{h}_{\text{MLE}} = \frac{d}{2} \log (2 \pi e) + \frac{1}{2} \log \det \hat{\Sigma}, with \hat{\Sigma} the maximum likelihood estimate of the covariance.

Under the correct model, the MLE-based estimator is asymptotically unbiased and achieves the Cramér-Rao lower bound as n \to \infty, making it statistically optimal in well-specified low-dimensional settings; moreover, its simplicity facilitates implementation when the entropy functional admits closed-form evaluation or standard optimization routines. However, the estimator remains biased for finite n without adjustments and is inconsistent when the assumed model is misspecified, particularly suffering from severe underestimation in high-dimensional spaces, as the plug-in approach ignores distributional mismatches and amplifies parameter estimation errors in the entropy computation.
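The Gaussian case admits the closed form given above, which the following sketch implements (NumPy assumed; the covariance matrix and sample size are illustrative) and compares against the known differential entropy of the generating distribution.

```python
import numpy as np

def gaussian_mle_entropy(x):
    """Parametric differential entropy estimate (nats) under a multivariate Gaussian model:
    h_hat = (d/2) * log(2*pi*e) + (1/2) * log det(Sigma_hat), Sigma_hat the MLE covariance."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    sigma_hat = np.atleast_2d(np.cov(x, rowvar=False, bias=True))  # MLE: divide by n
    _, logdet = np.linalg.slogdet(sigma_hat)
    return 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * logdet

rng = np.random.default_rng(6)
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=10_000)
true_h = 0.5 * 2 * np.log(2 * np.pi * np.e) + 0.5 * np.log(np.linalg.det(cov))
print(gaussian_mle_entropy(x), true_h)
```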

Advanced Estimators

Expected Entropy Estimator

The expected entropy estimator addresses the challenge of bias in traditional plug-in estimates by computing the conditional expected entropy E[H \mid n] given a finite sample of size n, rather than directly approximating the true entropy H. This approach is particularly valuable when samples are limited, as it incorporates analytical corrections derived from the expected value of the estimator under the observed distribution. For independent and identically distributed (i.i.d.) samples, the plug-in estimator \hat{H} is known to be biased such that E[\hat{H}] = H - b(n), where b(n) is a positive term decreasing with n; the expected entropy estimator inverts this relation by adding a correction term to yield an approximately unbiased estimate.

A seminal approach uses a Bayesian framework with a mixture of Dirichlet priors to flatten the induced prior over possible entropy values, ensuring the posterior expected entropy E[H \mid \{n_i\}] is robust to undersampling. The estimator averages the posterior entropy over a range of concentration parameters \beta, leveraging the digamma function \psi: the prior expected entropy under a symmetric Dirichlet with parameter \beta is \xi(\beta) = \psi(\kappa + 1) - \psi(\beta + 1), where \kappa = K \beta and K is the alphabet size. This achieves low relative bias (under 10%) even for n \ll K, outperforming simple corrections like Miller-Madow in small-sample regimes.

For dependent processes such as Markov chains, the expected entropy estimator extends to the entropy rate H = -\sum_i \pi_i \sum_j p_{ij} \log p_{ij}, where \pi is the stationary distribution and p_{ij} are the transition probabilities; finite-sample adjustments account for bias in estimated transitions via digamma corrections applied to transition counts. One adaptation, the Chao-Wang-Jost estimator, applies digamma corrections to the discovery rates of states and transitions, \hat{H}^{\text{CWJ}} = \sum_i \frac{n_i}{n} \left( \psi(n) - \psi(n_i) \right) + \text{(unseen-state adjustment)}, yielding low bias for sequences with memory.

These estimators offer advantages in handling dependent data, where standard i.i.d. methods fail due to unaccounted correlations; they reduce bias by explicitly modeling sample-size effects on transition estimates, making them suitable for data with temporal structure. Applications include source coding for memoryful processes, where accurate entropy rates enable efficient compression, and small-n scenarios like genomic sequences or neural spike trains exhibiting Markov-like dependence.
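The observed-state part of the digamma-corrected formula above can be sketched as follows (NumPy and SciPy assumed). The additional unseen-state adjustment based on rare-count statistics is deliberately omitted here, so this is only the first term of the Chao-Wang-Jost-style correction described in the text; the test distribution is illustrative.

```python
import numpy as np
from scipy.special import digamma

def digamma_corrected_entropy(counts):
    """Observed-state part of the digamma-corrected entropy estimate (nats).
    Uses psi(n) - psi(n_i) = sum_{k=n_i}^{n-1} 1/k as a bias-reduced
    replacement for -log(n_i / n); no unseen-state adjustment is applied."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    n = counts.sum()
    return np.sum((counts / n) * (digamma(n) - digamma(counts)))

rng = np.random.default_rng(7)
k, n = 100, 80
true_p = rng.dirichlet(np.ones(k))
counts = np.bincount(rng.choice(k, size=n, p=true_p), minlength=k)

p_mle = counts[counts > 0] / n
print("true     :", -np.sum(true_p * np.log(true_p)))
print("plug-in  :", -np.sum(p_mle * np.log(p_mle)))
print("corrected:", digamma_corrected_entropy(counts))
```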

Neural Network Estimator

The Neural Joint Entropy Estimator (NJEE) is a deep learning-based approach for estimating the joint entropy of discrete random variables, particularly effective for high-dimensional data with large alphabets. It leverages the universal approximation capabilities of neural networks to model conditional distributions autoregressively via the chain rule of entropy, decomposing the joint entropy H(X) into a marginal entropy term plus conditional entropies. Specifically, for a multivariate variable X = (X_1, \dots, X_{d_x}), the estimate is given by \hat{H}_n(X) = \hat{H}_n(X_1) + \sum_{m=2}^{d_x} \hat{\text{CE}}_n(G_{\theta_m}(X_m \mid X_{1:m-1})), where \hat{H}_n(X_1) is the marginal entropy of the first component (often estimated via a plug-in method or another NJEE recursion), and \hat{\text{CE}}_n(G_{\theta_m}) = -\frac{1}{n} \sum_{i=1}^n \log G_{\theta_m}(x_{i,m} \mid x_{i,1:m-1}) is the empirical cross-entropy loss of a neural network G_{\theta_m} that approximates the conditional probability mass function P(X_m \mid X_{1:m-1}). The network G_{\theta_m} is trained by minimizing this loss over samples from the joint distribution, using a softmax output layer whose size matches the alphabet of X_m. This setup allows NJEE to capture complex dependencies without assuming a parametric form for the joint distribution.

In practice, the architecture of each G_{\theta_m} consists of fully connected layers (e.g., two hidden layers with 50 nodes each, ReLU activations) for low-dimensional or tabular data, but extends to convolutional neural networks for image data or transformer-based models for sequential data like text, enabling handling of structured inputs. Training involves standard optimizers like ADAM and relies on the data samples themselves, often augmented with shuffling for marginal estimates if needed. NJEE demonstrates strong consistency, with the estimation error bounded by |\hat{H}_n(X) - H(X)| \leq C \epsilon + \delta for sufficiently large n, under the universal approximation theorem for continuously differentiable activations.

Key advantages of NJEE include its ability to manage large alphabet sizes (e.g., outperforming histogram-based methods in error for alphabets up to 1000 symbols with n \leq 1000 samples) and to model intricate dependencies in high-dimensional settings, where traditional non-parametric estimators struggle due to the curse of dimensionality. It also facilitates extensions to mutual information and related measures by differencing joint and conditional entropies.

For continuous (differential) entropy estimation, neural network methods adapt normalizing flows, which are invertible transformations f: \mathbb{R}^d \to \mathbb{R}^d mapping data samples to a simple base distribution (e.g., Gaussian) with known entropy. The differential entropy is then estimated as \hat{h}(X) = h(Z) - \frac{1}{n} \sum_{i=1}^n \log |\det J_f(x_i)|, where Z = f(X) follows the base distribution, h(Z) is the base entropy, and J_f is the Jacobian of f; the flow is trained to maximize the likelihood of the data under the induced density. Common architectures include autoregressive flows or coupling layers, suitable for high-dimensional continuous data like images.

Despite these strengths, estimators like NJEE and flow-based methods are computationally intensive, requiring significant training time and resources (e.g., GPU acceleration for flows in dimensions > 10). They demand large sample sizes (n \gg d) to avoid overfitting and provide reliable approximations, and their black-box nature limits interpretability compared to explicit geometric estimators.
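The chain-rule decomposition can be illustrated with a deliberately small two-component example. The sketch below assumes PyTorch and NumPy; the single hidden layer, synthetic data, optimizer settings, and step count are illustrative choices rather than the architecture prescribed by the NJEE authors. The joint entropy is approximated as the plug-in marginal entropy of X_1 plus the trained network's cross-entropy for X_2 given X_1.

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(8)
n, k = 20_000, 10                            # sample size and per-component alphabet
x1 = rng.integers(0, k, size=n)
x2 = (x1 + rng.integers(0, 3, size=n)) % k   # X2 depends on X1; H(X2|X1) = log 3

# Plug-in estimate of the marginal entropy H(X1), in nats.
p1 = np.bincount(x1, minlength=k) / n
h1 = -np.sum(p1[p1 > 0] * np.log(p1[p1 > 0]))

# Small network G_theta approximating P(X2 | X1): one-hot input, softmax over k outputs.
model = nn.Sequential(nn.Linear(k, 50), nn.ReLU(), nn.Linear(50, k))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()              # empirical cross-entropy, in nats

X1 = torch.eye(k)[torch.as_tensor(x1)]       # one-hot encoding of X1
X2 = torch.as_tensor(x2)
for _ in range(300):                         # a few full-batch gradient steps
    opt.zero_grad()
    loss = loss_fn(model(X1), X2)
    loss.backward()
    opt.step()

h_joint = h1 + loss.item()                   # H_hat(X) = H_hat(X1) + CE_n(G_theta)
print("NJEE-style estimate   :", h_joint)
print("true H(X1) + H(X2|X1) :", np.log(k) + np.log(3))
```

Because the cross-entropy upper-bounds the true conditional entropy, the estimate approaches the target from above as the network fits the conditional distribution.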
