Cross-entropy
In information theory, cross-entropy is a measure of the inefficiency of representing events from one probability distribution using an encoding scheme optimized for a different distribution. For two discrete probability distributions p and q defined over the same event space, the cross-entropy H(p, q) is given by

H(p, q) = -\sum_x p(x) \log_2 q(x),

where the logarithm is taken base 2 so that the result is expressed in bits; this quantifies the expected number of bits required to encode a sample from p using a code designed for q.[1] Cross-entropy generalizes Shannon entropy, which is the special case H(p, p) = H(p) representing the inherent uncertainty in p. The cross-entropy is always at least as large as the entropy of p, i.e., H(p, q) \geq H(p), with equality if and only if p = q almost everywhere (Gibbs' inequality). The nonnegative difference D_{\text{KL}}(p \parallel q) = H(p, q) - H(p) defines the Kullback-Leibler (KL) divergence, an asymmetric measure of the discrepancy between distributions introduced in the context of statistical discrimination.[2]
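To make these definitions concrete, the following sketch (plain Python; the distributions p and q and the function names are invented for this example) computes the cross-entropy, the Shannon entropy, and their difference, the KL divergence, in bits for two small discrete distributions.

```python
import math

def cross_entropy_bits(p, q):
    """H(p, q) = -sum_x p(x) * log2 q(x), in bits.

    Terms with p(x) == 0 contribute nothing (0 * log 0 is taken as 0);
    if q(x) == 0 where p(x) > 0, the cross-entropy is infinite.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total -= px * math.log2(qx)
    return total

def entropy_bits(p):
    """Shannon entropy H(p) = H(p, p), in bits."""
    return cross_entropy_bits(p, p)

# Example distributions over a three-symbol alphabet (illustrative values).
p = [0.5, 0.25, 0.25]   # "true" source distribution
q = [0.25, 0.25, 0.5]   # mismatched coding distribution

H_pq = cross_entropy_bits(p, q)   # expected bits per symbol using a code built for q
H_p = entropy_bits(p)             # optimal bits per symbol for p
kl = H_pq - H_p                   # Kullback-Leibler divergence D_KL(p || q)

print(f"H(p, q) = {H_pq:.3f} bits")   # 1.750
print(f"H(p)    = {H_p:.3f} bits")    # 1.500
print(f"D_KL    = {kl:.3f} bits")     # 0.250
```

For these particular values, H(p, q) = 1.75 bits exceeds H(p) = 1.5 bits, and the 0.25-bit gap is exactly the KL divergence, consistent with the inequality above.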
The term "cross-entropy" was introduced in the mid-20th century, building on Claude Shannon's foundational work on information measures in the late 1940s. In applications, cross-entropy plays a central role in source coding and communication theory: it gives the average code length incurred when data drawn from p are compressed with a code optimized for a mismatched model q, so minimizing it over q approaches the optimal rate H(p). It also appears in rate-distortion theory and in bounds on channel capacity, where it quantifies the cost of approximating one distribution by another in noisy settings.[3]

Beyond information theory, cross-entropy is widely used in machine learning as a loss function for probabilistic classifiers such as logistic regression and neural networks with softmax outputs. In this setting, p typically represents the one-hot encoded true label (an empirical distribution) and q the model's predicted class probabilities; minimizing the empirical cross-entropy over a training set is equivalent to maximum likelihood estimation under a categorical model and, because the logarithmic loss is a proper scoring rule, it encourages well-calibrated predictions.[4] The same loss appears in generative models, reinforcement learning, and knowledge distillation, where it provides a smooth, efficiently optimizable measure of the discrepancy between distributions.[5]
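As a sketch of the machine-learning usage, the following NumPy example (the logits, labels, and helper names are invented for illustration; note that learning libraries typically use the natural logarithm, so the loss is in nats rather than bits) computes the softmax cross-entropy of a small batch against one-hot targets, which reduces to the average negative log-likelihood of the true classes.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, with the usual max subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_loss(logits, labels):
    """Mean cross-entropy between one-hot labels and softmax(logits), in nats.

    With a one-hot p, H(p, q) = -log q(true class), so the loss is the
    average negative log-likelihood of the correct classes.
    """
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.log(probs[np.arange(n), labels]).mean()

# Illustrative batch: 3 examples, 4 classes (values chosen arbitrarily).
logits = np.array([[ 2.0, 0.5, -1.0, 0.0],
                   [ 0.1, 1.2,  0.3, 0.4],
                   [-0.5, 0.0,  2.5, 1.0]])
labels = np.array([0, 1, 2])   # indices of the true classes

print(f"cross-entropy loss = {cross_entropy_loss(logits, labels):.4f} nats")
```

In practice, frameworks usually fuse the softmax and the logarithm into a single numerically stable operation, but the quantity being minimized is the same empirical cross-entropy described above.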