Cross-entropy
In information theory, cross-entropy is a measure of the inefficiency of representing events from one probability distribution using an encoding scheme optimized for a different distribution. For two discrete probability distributions p and q defined over the same event space, the cross-entropy H(p, q) is given by

H(p, q) = -\sum_x p(x) \log_2 q(x),

where the logarithm is taken base 2 so that the result is expressed in bits; this quantifies the expected number of bits required to encode a sample from p using a code designed for q.[1] Cross-entropy generalizes Shannon entropy, which is the special case H(p, p) = H(p) representing the inherent uncertainty in p. The cross-entropy is always at least as large as the entropy of p, i.e., H(p, q) \geq H(p), with equality if and only if p = q almost everywhere (Gibbs' inequality). The nonnegative difference D_{\text{KL}}(p \parallel q) = H(p, q) - H(p) defines the Kullback-Leibler (KL) divergence, an asymmetric measure of the discrepancy between distributions introduced in the context of statistical discrimination.[2]
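To make these definitions concrete, the following sketch (plain Python; the distributions p and q and the function names are invented for this example) computes the cross-entropy, the Shannon entropy, and their difference, the KL divergence, in bits for two small discrete distributions.

```python
import math

def cross_entropy_bits(p, q):
    """H(p, q) = -sum_x p(x) * log2 q(x), in bits.

    Terms with p(x) == 0 contribute nothing (0 * log 0 is taken as 0);
    if q(x) == 0 where p(x) > 0, the cross-entropy is infinite.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total -= px * math.log2(qx)
    return total

def entropy_bits(p):
    """Shannon entropy H(p) = H(p, p), in bits."""
    return cross_entropy_bits(p, p)

# Example distributions over a three-symbol alphabet (illustrative values).
p = [0.5, 0.25, 0.25]   # "true" source distribution
q = [0.25, 0.25, 0.5]   # mismatched coding distribution

H_pq = cross_entropy_bits(p, q)   # expected bits per symbol using a code built for q
H_p = entropy_bits(p)             # optimal bits per symbol for p
kl = H_pq - H_p                   # Kullback-Leibler divergence D_KL(p || q)

print(f"H(p, q) = {H_pq:.3f} bits")   # 1.750
print(f"H(p)    = {H_p:.3f} bits")    # 1.500
print(f"D_KL    = {kl:.3f} bits")     # 0.250
```

For these particular values, H(p, q) = 1.75 bits exceeds H(p) = 1.5 bits, and the 0.25-bit gap is exactly the KL divergence, consistent with the inequality above.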
The term "cross-entropy" was introduced in the mid-20th century, building on Claude Shannon's foundational work on information measures in the late 1940s. In applications, cross-entropy plays a central role in source coding and communication theory: it gives the average code length incurred when data drawn from p are compressed with a code optimized for a mismatched model q, so minimizing it over q approaches the optimal rate H(p). It also appears in rate-distortion theory and in bounds on channel capacity, where it quantifies the cost of approximating one distribution by another in noisy settings.[3]

Beyond information theory, cross-entropy is widely used in machine learning as a loss function for probabilistic classifiers such as logistic regression and neural networks with softmax outputs. In this setting, p typically represents the one-hot encoded true label (an empirical distribution) and q the model's predicted class probabilities; minimizing the empirical cross-entropy over a training set is equivalent to maximum likelihood estimation under a categorical model and, because the logarithmic loss is a proper scoring rule, it encourages well-calibrated predictions.[4] The same loss appears in generative models, reinforcement learning, and knowledge distillation, where it provides a smooth, efficiently optimizable measure of the discrepancy between distributions.[5]
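As a sketch of the machine-learning usage, the following NumPy example (the logits, labels, and helper names are invented for illustration; note that learning libraries typically use the natural logarithm, so the loss is in nats rather than bits) computes the softmax cross-entropy of a small batch against one-hot targets, which reduces to the average negative log-likelihood of the true classes.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, with the usual max subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_loss(logits, labels):
    """Mean cross-entropy between one-hot labels and softmax(logits), in nats.

    With a one-hot p, H(p, q) = -log q(true class), so the loss is the
    average negative log-likelihood of the correct classes.
    """
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.log(probs[np.arange(n), labels]).mean()

# Illustrative batch: 3 examples, 4 classes (values chosen arbitrarily).
logits = np.array([[ 2.0, 0.5, -1.0, 0.0],
                   [ 0.1, 1.2,  0.3, 0.4],
                   [-0.5, 0.0,  2.5, 1.0]])
labels = np.array([0, 1, 2])   # indices of the true classes

print(f"cross-entropy loss = {cross_entropy_loss(logits, labels):.4f} nats")
```

In practice, frameworks usually fuse the softmax and the logarithm into a single numerically stable operation, but the quantity being minimized is the same empirical cross-entropy described above.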