Hellinger distance

The Hellinger distance is a metric in probability theory and statistics that measures the dissimilarity between two probability distributions, defined for probability densities p and q with respect to a common measure \mu as H(P, Q) = \left[ \int (\sqrt{p(x)} - \sqrt{q(x)})^2 \, d\mu(x) \right]^{1/2}. This distance is equivalent to the \ell_2 distance between the square roots of the densities and is bounded between 0 and \sqrt{2}, with equality to 0 if and only if P = Q. It originates from the Hellinger integral, introduced by German mathematician Ernst Hellinger in his 1909 paper on quadratic forms of infinitely many variables. A key feature of the Hellinger distance is its relation to other statistical distances, such as the total variation distance TV(P, Q) and the Kullback-Leibler divergence KL(P, Q), satisfying the inequalities \frac{1}{2} H^2(P, Q) \leq TV(P, Q) \leq H(P, Q) \leq \sqrt{KL(P, Q)}. It is also a special case of an f-divergence, corresponding to the convex function f(t) = 2(1 - \sqrt{t}), which ensures it is jointly convex in its arguments and satisfies the data processing inequality. Additionally, the squared Hellinger distance connects to the Bhattacharyya coefficient \rho(P, Q) = \int \sqrt{p(x) q(x)} \, d\mu(x) via H^2(P, Q) = 2(1 - \rho(P, Q)), highlighting its interpretability as a measure of overlap between distributions. For independent samples, it tensorizes nicely, with H^2(P^n, Q^n) = 2(1 - (1 - H^2(P, Q)/2)^n), making it useful for analyzing sample sizes in statistical inference. In statistical applications, the Hellinger distance plays a central role in hypothesis testing, where a small value indicates that distinguishing between distributions P_0 and P_1 is challenging, leading to error bounds of the form \inf_{\Psi} \frac{1}{2} [P_0(\Psi(X) \neq 0) + P_1(\Psi(X) \neq 1)] \leq c_1 \exp(-c_2 n H^2(P_0, P_1)) for test functions \Psi. It is employed in robust methods, such as minimum Hellinger distance estimators, which are less sensitive to model misspecification than maximum likelihood. Beyond classical statistics, the distance appears in machine learning for tasks such as evaluating generative models and in data science for quantifying distribution shifts, owing to its metric properties, boundedness, and computational tractability.

Introduction

Overview and Motivation

The Hellinger distance serves as a metric on the space of probability measures, characterized by symmetry (the distance between two measures μ and ν equals that between ν and μ), non-negativity (the distance is always greater than or equal to zero), and the identity of indiscernibles (the distance is zero if and only if the measures are identical). This structure makes it a robust tool for comparing distributions in a geometrically meaningful way within the space of all probability measures. Its motivation stems from the need to quantify dissimilarity between probability distributions in settings where traditional measures like the Kullback-Leibler divergence may fail due to asymmetry or unboundedness; the Hellinger distance addresses these issues by remaining bounded between 0 and \sqrt{2} while providing embeddability into a Hilbert space via the square-root transformation of densities. This boundedness and Hilbert structure facilitate its utility in hypothesis testing, where it bounds error rates in distinguishing distributions, in density estimation for controlling approximation errors, and in information theory for analyzing channel capacities under metric constraints. As a specific instance, the squared Hellinger distance can be understood as an f-divergence corresponding to the convex function f(t) = 2(1 - \sqrt{t}), which captures the divergence in a canonical form without requiring absolute continuity assumptions upfront. Intuitively, for two distributions with densities p and q, it measures the L^2 distance between their square-root densities, offering a geometrically interpretable notion of overlap that penalizes differences in a Euclidean-like manner on the transformed space.

Historical Development

The Hellinger integral, foundational to the Hellinger distance, was introduced by Ernst Hellinger in his 1909 habilitation thesis as part of developing a new foundation for the theory of quadratic forms involving infinitely many variables, within the emerging framework of measure theory and integral equations. This work built on contemporary advances in integration theory, providing a tool to compare measures through their square roots, predating the full formalization of Radon-Nikodym derivatives. In the 1910s and 1920s, the Hellinger integral found early applications in measure theory and integration theory. Johann Radon incorporated it into his 1913 generalization of integration theory for absolutely additive set functions, combining it with Lebesgue's and Stieltjes' concepts to extend integrals over abstract spaces. Henri Lebesgue's measure-theoretic framework indirectly influenced these developments, as Hellinger's approach addressed limitations in handling non-absolutely continuous measures, fostering its use in studying convergence and transformations in infinite-dimensional spaces. The concept experienced a revival in statistics during the mid-20th century, aligning with broader explorations of divergences for probabilistic comparisons. Shizuo Kakutani formalized the Hellinger distance in 1948, embedding it into Hilbert spaces and applying it to infinite product measures, which connected it to contemporaneous ideas on information measures and symmetric divergences. Lucien Le Cam further advanced its statistical role in the 1960s, popularizing the squared Hellinger distance as a tool for analyzing asymptotic normality and contiguity in statistical experiments through key papers like his 1960 analysis of locally asymptotically normal families. By the post-1950s period, it gained traction in statistics, with C.R. Rao's 1963 contributions naming and promoting it for comparing distributions. In recent years (2020–2025), the Hellinger distance has seen increased adoption in machine learning, particularly for tasks involving generative models and distribution shift detection. For instance, empirical estimators for the squared Hellinger distance have been developed to assess dissimilarity between continuous distributions, enabling robust evaluations in high-dimensional settings. Its connections to Fisher-Rao metrics on statistical manifolds have also been highlighted in 2025 analyses, underscoring its relevance for geometric interpretations in model training and optimal transport problems. As a member of the f-divergence family, it continues to bridge classical measure theory with modern computational applications.

Definitions

General Measure-Theoretic Definition

The Hellinger distance is defined in the general measure-theoretic setting for two probability measures P and Q on a measurable space (X, \mathcal{A}). Let \mu be a \sigma-finite measure that dominates both P and Q, meaning P \ll \mu and Q \ll \mu; such a dominating measure always exists, for instance, \mu = P + Q. The squared Hellinger distance is then given by H^2(P, Q) = \int_X \left( \sqrt{\frac{dP}{d\mu}} - \sqrt{\frac{dQ}{d\mu}} \right)^2 \, d\mu, where \frac{dP}{d\mu} and \frac{dQ}{d\mu} denote the Radon-Nikodym derivatives of P and Q with respect to \mu. This definition applies even when P and Q are not absolutely continuous with respect to each other, as the dominating measure \mu accommodates singular components: on sets where one measure has zero density relative to \mu, the corresponding square root term vanishes, contributing to the distance without requiring mutual absolute continuity. Moreover, the value of H^2(P, Q) is independent of the choice of dominating measure \mu, provided it dominates both P and Q. An equivalent expression for the squared Hellinger distance is H^2(P, Q) = 2 \left( 1 - \int_X \sqrt{ \frac{dP}{d\mu} \cdot \frac{dQ}{d\mu} } \, d\mu \right), where the integral \int_X \sqrt{ \frac{dP}{d\mu} \cdot \frac{dQ}{d\mu} } \, d\mu = \int_X \sqrt{dP \, dQ} is known as the Hellinger affinity. To see the equivalence, expand the squared term in the original definition: \int_X \left( \sqrt{\frac{dP}{d\mu}} - \sqrt{\frac{dQ}{d\mu}} \right)^2 \, d\mu = \int_X \frac{dP}{d\mu} \, d\mu + \int_X \frac{dQ}{d\mu} \, d\mu - 2 \int_X \sqrt{ \frac{dP}{d\mu} \cdot \frac{dQ}{d\mu} } \, d\mu = 1 + 1 - 2 \int_X \sqrt{dP \, dQ}, yielding the affinity form. The Hellinger distance is typically defined as the square root H(P, Q) = \sqrt{H^2(P, Q)}, which takes values in [0, \sqrt{2}] since the Hellinger affinity lies in [0, 1]. In some texts, it is normalized by scaling with 1/\sqrt{2} to bound the distance in [0, 1]. As a special instance, this general definition specializes to discrete probability distributions by taking the counting measure as the dominating measure \mu.
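
As a small numerical illustration of the invariance to the dominating measure (a sketch in Python with arbitrary example distributions, not taken from any particular library), the squared Hellinger distance of two discrete measures can be computed from Radon-Nikodym derivatives with respect to two different choices of \mu and yields the same value:

```python
import numpy as np

# Two probability mass vectors on a common four-point space (arbitrary example values).
P = np.array([0.2, 0.5, 0.3, 0.0])
Q = np.array([0.1, 0.1, 0.4, 0.4])

def hellinger_sq(P, Q, mu):
    """Squared Hellinger distance computed from densities dP/dmu and dQ/dmu."""
    support = mu > 0                      # points carrying mu-mass
    p = P[support] / mu[support]          # Radon-Nikodym derivative dP/dmu
    q = Q[support] / mu[support]          # Radon-Nikodym derivative dQ/dmu
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2 * mu[support])

mu1 = P + Q                               # canonical dominating measure P + Q
mu2 = np.ones_like(P)                     # counting measure on the four points

print(hellinger_sq(P, Q, mu1))            # same value for either choice of mu
print(hellinger_sq(P, Q, mu2))
```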

For Absolutely Continuous Measures

When two probability measures P and Q on \mathbb{R}^n are absolutely continuous with respect to the Lebesgue measure \lambda, they admit probability density functions f = dP/d\lambda and g = dQ/d\lambda with respect to \lambda. In this case, the Hellinger distance specializes to H^2(P, Q) = \int_{\mathbb{R}^n} \left( \sqrt{f(x)} - \sqrt{g(x)} \right)^2 \, d\lambda(x), which is well-defined provided the integral exists. This expression follows directly from the general measure-theoretic definition by substituting the dominating measure \mu = \lambda and the Radon-Nikodym derivatives dP/d\mu = f and dQ/d\mu = g. An equivalent form, often useful for computation, is H^2(P, Q) = 2 \left(1 - \int_{\mathbb{R}^n} \sqrt{f(x) g(x)} \, d\lambda(x)\right), where the integral represents the Bhattacharyya coefficient between the densities. The squared distance takes values in [0, 2], with equality to 0 if and only if f = g almost everywhere with respect to \lambda. The form involving square roots admits a natural interpretation in the Hilbert space L^2(\lambda). The square-root transformation maps the densities to unit vectors \sqrt{f}, \sqrt{g} \in L^2(\lambda) on the unit sphere, since \int f \, d\lambda = \int g \, d\lambda = 1 implies \|\sqrt{f}\|_2 = \|\sqrt{g}\|_2 = 1. The Hellinger distance is then the L^2 distance between these embedded points: H(P, Q) = \left\| \sqrt{f} - \sqrt{g} \right\|_{L^2(\lambda)}. This embedding facilitates analysis of densities as points in a Hilbert space, highlighting the geometric structure underlying the distance. For illustration, consider the uniform distribution on [0,1] with density f(x) = \mathbf{1}_{[0,1]}(x) and the uniform distribution on [0,2] with g(x) = \frac{1}{2} \mathbf{1}_{[0,2]}(x). The affinity is \int_0^1 \sqrt{1 \cdot \frac{1}{2}} \, dx = \frac{1}{\sqrt{2}}, so H^2(P, Q) = 2\left(1 - \frac{1}{\sqrt{2}}\right) \approx 0.586, \quad H(P, Q) \approx 0.765. This value reflects moderate divergence due to the differing supports and scales.
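
The uniform example can be verified numerically; the following sketch (assuming NumPy and SciPy are available) evaluates the affinity by quadrature and recovers H \approx 0.765:

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0   # density of Uniform[0, 1]
g = lambda x: 0.5 if 0.0 <= x <= 2.0 else 0.0   # density of Uniform[0, 2]

# Hellinger affinity (Bhattacharyya coefficient) by numerical quadrature;
# x = 1 is flagged because sqrt(f * g) is discontinuous there.
affinity, _ = quad(lambda x: np.sqrt(f(x) * g(x)), 0.0, 2.0, points=[1.0])

H2 = 2.0 * (1.0 - affinity)
print(affinity, H2, np.sqrt(H2))                # ~0.7071, ~0.5858, ~0.7654
```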

Discrete Probability Distributions

The Hellinger distance between two discrete probability distributions with probability mass functions p = (p_i)_{i=1}^k and q = (q_i)_{i=1}^k on a finite sample space \{1, \dots, k\} is defined as H(p, q) = \sqrt{\sum_{i=1}^k (\sqrt{p_i} - \sqrt{q_i})^2} = \sqrt{2 \left(1 - \sum_{i=1}^k \sqrt{p_i q_i}\right)}. This formulation ensures that H(p, q) ranges from 0 (when p = q) to \sqrt{2} (when p and q have disjoint supports). This discrete definition arises as a special case of the general measure-theoretic formulation by considering the underlying probability measures with respect to the counting measure on the sample space, where the integral reduces to a sum over point masses analogous to Dirac measures. For illustration, consider two Bernoulli distributions: p with success probability 0.3 (so p = (0.3, 0.7)) and q with success probability 0.7 (so q = (0.7, 0.3)). The Hellinger distance is H(p, q) = \sqrt{ (\sqrt{0.3} - \sqrt{0.7})^2 + (\sqrt{0.7} - \sqrt{0.3})^2 } \approx 0.409, computed via the first equivalent form, or equivalently \sqrt{2\left(1 - 2\sqrt{0.3 \cdot 0.7}\right)} \approx 0.409. This value quantifies moderate dissimilarity between the distributions, reflecting their differing concentrations on the two outcomes. The definition extends naturally to countably infinite sample spaces with probability mass functions p = (p_i)_{i=1}^\infty and q = (q_i)_{i=1}^\infty, using the same formulas, where the sums \sum_{i=1}^\infty (\sqrt{p_i} - \sqrt{q_i})^2 and \sum_{i=1}^\infty \sqrt{p_i q_i} are guaranteed to converge to finite values between 0 and 2, and 0 and 1, respectively, due to the normalization of p and q as probability distributions.
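
A brief sketch of the discrete formula, reproducing the Bernoulli example above in Python:

```python
import numpy as np

def hellinger_discrete(p, q):
    """Hellinger distance between two discrete distributions given as mass vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

p = [0.3, 0.7]   # Bernoulli(0.3) written as (P(success), P(failure))
q = [0.7, 0.3]   # Bernoulli(0.7)

print(hellinger_discrete(p, q))                                   # ~0.409
print(np.sqrt(2 * (1 - np.sum(np.sqrt(np.multiply(p, q))))))      # same via the affinity form
```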

Properties

As a Metric

The Hellinger distance H(P, Q) satisfies the axioms of a metric on the space of probability measures. First, non-negativity holds since H(P, Q) = \left( \int (\sqrt{dP/d\mu} - \sqrt{dQ/d\mu})^2 \, d\mu \right)^{1/2} \geq 0 for a dominating measure \mu, with equality if and only if P = Q, as the integrand is nonnegative and the L^2 norm vanishes only when the square-root densities coincide almost everywhere. Symmetry follows immediately from the definition, as H(P, Q) = H(Q, P). The triangle inequality H(P, R) \leq H(P, Q) + H(Q, R) is inherited from the triangle inequality of the L^2 norm applied to the square-root densities: \|\sqrt{dP/d\mu} - \sqrt{dR/d\mu}\|_2 \leq \|\sqrt{dP/d\mu} - \sqrt{dQ/d\mu}\|_2 + \|\sqrt{dQ/d\mu} - \sqrt{dR/d\mu}\|_2, which can be verified using the Cauchy-Schwarz inequality on the inner products of square-root densities. This confirms that the Hellinger distance defines a metric on the set of probability measures. The topology induced by the Hellinger distance on the space of finite measures is equivalent to the total variation topology, meaning convergence in one implies convergence in the other. On the space of probability measures \mathcal{P}(S) over a complete separable metric space S, the Hellinger metric is complete, as it is topologically equivalent to the complete total variation metric; for tight families of measures, this completeness aligns with relative compactness properties under the induced topology. Unlike divergences such as the Kullback-Leibler divergence, which lack symmetry and the triangle inequality, the Hellinger distance qualifies as a true metric, enabling its use in geometric and topological analyses of probability spaces.
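
The metric axioms are easy to check empirically; the following sketch tests symmetry and the triangle inequality on randomly generated probability vectors (the dimension and number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def hellinger(p, q):
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def random_pmf(k):
    w = rng.random(k)
    return w / w.sum()

# Symmetry and the triangle inequality should hold for every random triple.
for _ in range(1000):
    p, q, r = random_pmf(5), random_pmf(5), random_pmf(5)
    assert np.isclose(hellinger(p, q), hellinger(q, p))
    assert hellinger(p, r) <= hellinger(p, q) + hellinger(q, r) + 1e-12

print("symmetry and triangle inequality hold on all random triples")
```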

Boundedness and Key Inequalities

The Hellinger distance satisfies 0 \leq H(P, Q) \leq \sqrt{2} for any probability measures P and Q, where equality to 0 holds if and only if P = Q, and equality to \sqrt{2} holds if and only if P and Q are mutually singular. This bound follows from the definition H(P, Q) = \left( \int (\sqrt{dP/d\mu} - \sqrt{dQ/d\mu})^2 \, d\mu \right)^{1/2}, which expands to H^2(P, Q) = 2\left(1 - \int \sqrt{(dP/d\mu)(dQ/d\mu)} \, d\mu \right), and the integral term, known as the Bhattacharyya coefficient BC(P, Q), satisfies 0 \leq BC(P, Q) \leq 1 by the Cauchy-Schwarz inequality applied to the square-root densities. Thus, H(P, Q) = \sqrt{2(1 - BC(P, Q))}, providing a direct link between the distance and this affinity measure. A key inequality relates the squared Hellinger distance to the total variation distance: H^2(P, Q) \leq 2 \, TV(P, Q). This bound, adapted from Pinsker-type inequalities for other divergences, highlights the Hellinger distance's controlled growth relative to total variation and underscores its utility in bounding error probabilities in statistical testing. The Hellinger distance exhibits monotonicity under Markov kernels: if P' = K P and Q' = K Q for a stochastic kernel K, then H(P', Q') \leq H(P, Q). This data-processing property arises from the f-divergence structure of the Hellinger distance and ensures that processing through channels does not increase the distance between distributions.
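
Both the bound H^2 \leq 2\,TV and monotonicity under a stochastic kernel can be checked numerically; a sketch with arbitrary random discrete distributions and a random row-stochastic matrix standing in for the kernel K:

```python
import numpy as np

rng = np.random.default_rng(1)

def hellinger_sq(p, q):
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def total_variation(p, q):
    return 0.5 * np.sum(np.abs(p - q))

def random_pmf(k):
    w = rng.random(k)
    return w / w.sum()

p, q = random_pmf(6), random_pmf(6)

# Random row-stochastic Markov kernel: K[i, j] = probability of moving from state i to j.
K = rng.random((6, 6))
K /= K.sum(axis=1, keepdims=True)

print(hellinger_sq(p, q) <= 2 * total_variation(p, q) + 1e-12)    # H^2 <= 2 TV
print(hellinger_sq(p @ K, q @ K) <= hellinger_sq(p, q) + 1e-12)   # data processing
```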

Relations to Other Divergences

Connection to Total Variation Distance

The total variation distance between two probability measures P and Q on a measurable space is defined as \text{TV}(P, Q) = \sup_{A} |P(A) - Q(A)|, where the supremum is taken over all measurable sets A, and it admits the equivalent integral representation \text{TV}(P, Q) = \frac{1}{2} \int |dP - dQ|. Assuming P and Q are absolutely continuous with respect to a common dominating measure \mu with densities p and q, the squared Hellinger distance H^2(P, Q) and the total variation distance \text{TV}(P, Q) satisfy the inequalities \frac{1}{2} H^2(P, Q) \leq \text{TV}(P, Q) \leq H(P, Q), where H(P, Q) = \left\| \sqrt{p} - \sqrt{q} \right\|_2 is the Hellinger distance. The lower bound follows from the fact that \text{TV}(P, Q) \geq 1 - \int \sqrt{pq} \, d\mu = \frac{1}{2} H^2(P, Q), derived by considering the optimal coupling or direct comparison of integrals. The upper bound can be proved using the Cauchy-Schwarz inequality applied to the difference of densities: \int |p - q| \, d\mu = \int |\sqrt{p} - \sqrt{q}| \cdot |\sqrt{p} + \sqrt{q}| \, d\mu \leq \left\| \sqrt{p} - \sqrt{q} \right\|_2 \left\| \sqrt{p} + \sqrt{q} \right\|_2 \leq \left\| \sqrt{p} - \sqrt{q} \right\|_2 \cdot 2, yielding \text{TV}(P, Q) \leq H(P, Q) after dividing by 2; alternatively, it follows from the coupling interpretation where the minimal probability of disagreement bounds the L^1 norm via the L^2 norm on square roots. These inequalities imply topological equivalence between the Hellinger and total variation distances on the space of absolutely continuous probability measures, as convergence in one implies convergence in the other. For contiguous sequences of distributions (where one sequence is contiguous to another if no test can distinguish them asymptotically), the total variation distance is asymptotically equivalent to H(P, Q) in local asymptotic normality regimes, capturing the same rate of distinguishability up to constants. Both distances metrize the same mode of convergence on the subspace of probability measures absolutely continuous with respect to a fixed dominating measure: a sequence converges in Hellinger distance if and only if it converges in total variation, and either form of convergence implies weak convergence. However, the Hellinger distance is often computationally more tractable for densities, as it involves norms of square-root transformed densities rather than the supremum over sets required for total variation.
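
The chain \frac{1}{2} H^2 \leq TV \leq H can be checked directly for simple families; a sketch over a grid of Bernoulli parameter pairs (the grid itself is an arbitrary choice):

```python
import numpy as np

def hellinger(a, b):
    """Hellinger distance between Bernoulli(a) and Bernoulli(b)."""
    return np.sqrt((np.sqrt(a) - np.sqrt(b)) ** 2 + (np.sqrt(1 - a) - np.sqrt(1 - b)) ** 2)

def tv(a, b):
    """Total variation distance between Bernoulli(a) and Bernoulli(b)."""
    return abs(a - b)

grid = np.linspace(0.01, 0.99, 50)
for a in grid:
    for b in grid:
        H, T = hellinger(a, b), tv(a, b)
        assert 0.5 * H**2 <= T + 1e-12      # lower bound
        assert T <= H + 1e-12               # upper bound

print("1/2 H^2 <= TV <= H holds on the whole grid")
```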

Relation to Bhattacharyya Coefficient

The Bhattacharyya coefficient between two probability measures P and Q, also known as the Hellinger affinity, is defined as \mathrm{BC}(P, Q) = \int \sqrt{dP \, dQ}, where the integral is taken with respect to a dominating measure on the underlying space. This coefficient quantifies the similarity or overlap between the two measures. The Hellinger distance H(P, Q) is directly related to the Bhattacharyya coefficient via H^2(P, Q) = 2 \bigl(1 - \mathrm{BC}(P, Q)\bigr). This connection arises from the integral definition of the squared Hellinger distance: H^2(P, Q) = \int \biggl( \sqrt{\frac{dP}{d\mu}} - \sqrt{\frac{dQ}{d\mu}} \biggr)^2 \, d\mu, where \mu is a dominating measure and p = dP/d\mu, q = dQ/d\mu are the corresponding densities. Expanding the integrand yields (\sqrt{p} - \sqrt{q})^2 = p + q - 2\sqrt{pq}, and integrating term by term gives H^2(P, Q) = \int p \, d\mu + \int q \, d\mu - 2 \int \sqrt{pq} \, d\mu = 2 - 2 \, \mathrm{BC}(P, Q), since \int p \, d\mu = \int q \, d\mu = 1 for probability measures. The Bhattacharyya coefficient \mathrm{BC}(P, Q) ranges from 1, when P = Q, to 0, when P and Q are mutually singular with no overlap. This overlap interpretation makes it particularly useful in deriving probabilistic bounds, such as the Bhattacharyya bound on the Bayes probability of error between two classes, which provides an upper limit based on the coefficient's value. For illustration, consider two multivariate Gaussian distributions \mathcal{N}(\mu_1, \Sigma) and \mathcal{N}(\mu_2, \Sigma) sharing the same covariance matrix \Sigma. Their Bhattacharyya coefficient simplifies to \mathrm{BC} = \exp\left( -\frac{1}{8} (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2) \right) = \exp\left( -\frac{d^2}{8} \right), where d = \sqrt{(\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)} is the Mahalanobis distance between the means. This closed-form expression highlights how the overlap decreases exponentially with the squared Mahalanobis distance.
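
A short sketch of the equal-covariance Gaussian case, with example means and covariance chosen purely for illustration:

```python
import numpy as np

def gaussian_affinity_shared_cov(mu1, mu2, Sigma):
    """Bhattacharyya coefficient of N(mu1, Sigma) and N(mu2, Sigma)."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    d2 = diff @ np.linalg.solve(Sigma, diff)   # squared Mahalanobis distance
    return np.exp(-d2 / 8.0)

mu1, mu2 = [0.0, 0.0], [1.0, 2.0]
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])

bc = gaussian_affinity_shared_cov(mu1, mu2, Sigma)
H2 = 2.0 * (1.0 - bc)
print(bc, H2, np.sqrt(H2))
```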

Comparisons with Other f-Divergences

The Hellinger distance belongs to the broad class of f-divergences, which quantify the difference between two probability measures P and Q via the formula D_f(P \| Q) = \int q \, f\left(\frac{p}{q}\right) \, d\mu, where f: (0, \infty) \to \mathbb{R} is a convex function with f(1) = 0, and p = dP/d\mu, q = dQ/d\mu are densities with respect to a dominating measure \mu. The squared Hellinger distance corresponds to the f-divergence with f(t) = 2(1 - \sqrt{t}), yielding H^2(P, Q) = D_f(P \| Q). In comparison to the Kullback-Leibler (KL) divergence, another prominent f-divergence defined by f(t) = t \log t, the Hellinger distance exhibits distinct behavioral properties. The KL divergence is asymmetric, since D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P) in general, and unbounded above, making it sensitive to rare events where p > 0 but q = 0. By contrast, the Hellinger distance is symmetric and bounded (0 \leq H(P, Q) \leq \sqrt{2}), providing a more stable measure for distributions with differing supports. A key inequality linking them is D_{\text{KL}}(P \| Q) \geq H^2(P, Q), which arises from the properties of f-divergences. The chi-squared divergence, an f-divergence with f(t) = (t - 1)^2, shares the asymmetry and unboundedness of the KL divergence but amplifies differences where p/q deviates substantially from 1, rendering it particularly sensitive to heavy tails or outliers in the ratio p/q. This contrasts with the Hellinger distance, whose square-root transformation moderates such sensitivities. The Jensen-Shannon (JS) divergence offers a symmetrized alternative to the KL divergence, defined as \text{JS}(P, Q) = \frac{1}{2} D_{\text{KL}}(P \| M) + \frac{1}{2} D_{\text{KL}}(Q \| M), where M = (P + Q)/2; like the Hellinger distance, it is symmetric and bounded (by \log 2), but its averaging over the mixture M yields a smoother profile that avoids extreme values even when P and Q have disjoint supports. A primary advantage of the Hellinger distance lies in its square-root transform, which downweights the influence of regions where densities differ greatly (e.g., tails or outliers), positioning it as a robust intermediary between the overly punitive KL divergence and the L1-based total variation distance. This property makes the Hellinger distance preferable in scenarios requiring insensitivity to sparse or extreme data variations, such as robust estimation or testing under model misspecification.
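
To make the comparison concrete, the following sketch computes the squared Hellinger distance alongside KL, chi-squared, total variation, and Jensen-Shannon for one arbitrary pair of discrete distributions and checks KL \geq H^2:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.15, 0.05])
q = np.array([0.25, 0.25, 0.25, 0.25])

hellinger_sq = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)          # f(t) = 2(1 - sqrt(t))
kl = np.sum(p * np.log(p / q))                                 # f(t) = t log t
chi_sq = np.sum((p - q) ** 2 / q)                              # f(t) = (t - 1)^2
tv = 0.5 * np.sum(np.abs(p - q))
m = 0.5 * (p + q)
js = 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

print(hellinger_sq, kl, chi_sq, tv, js)
print(kl >= hellinger_sq)                                      # KL(P||Q) >= H^2(P, Q)
```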

Computation and Estimation

Closed-Form Expressions for Parametric Distributions

Closed-form expressions for the Hellinger distance between distributions from common parametric families are valuable for exact computations in statistical inference and model comparison, as they avoid numerical approximation of the defining integral \int (\sqrt{f(x)} - \sqrt{g(x)})^2 \, dx = 2(1 - \int \sqrt{f g} \, dx). These formulas typically arise from evaluating the affinity (Bhattacharyya coefficient) \rho = \int \sqrt{f g} \, dx, with the squared Hellinger distance given by H^2 = 2(1 - \rho). For families like the normal, Poisson, and exponential, the expressions involve only elementary functions; other families lead to special functions such as the gamma, beta, or modified Bessel functions.

For univariate normal distributions N(\mu_1, \sigma_1^2) and N(\mu_2, \sigma_2^2), the squared Hellinger distance is H^2 = 2 \left(1 - \sqrt{\frac{2 \sigma_1 \sigma_2}{\sigma_1^2 + \sigma_2^2}} \exp\left( -\frac{(\mu_1 - \mu_2)^2}{4(\sigma_1^2 + \sigma_2^2)} \right) \right). This form highlights the influence of both mean and variance differences, with the exponential term capturing location shifts and the square-root term reflecting scale differences.[](https://arxiv.org/pdf/1810.08693) The formula extends naturally to multivariate normals N(\mu_1, \Sigma_1) and N(\mu_2, \Sigma_2): H^2 = 2 \left(1 - \frac{\det(\Sigma_1)^{1/4} \det(\Sigma_2)^{1/4}}{\det\left( \frac{\Sigma_1 + \Sigma_2}{2} \right)^{1/2}} \exp\left( -\frac{1}{8} (\mu_1 - \mu_2)^T \left( \frac{\Sigma_1 + \Sigma_2}{2} \right)^{-1} (\mu_1 - \mu_2) \right) \right).[](https://arxiv.org/pdf/1810.08693) The determinant terms account for covariance structure mismatches, making this useful in high-dimensional settings like multivariate analysis.

For Poisson distributions \mathrm{Pois}(\lambda) and \mathrm{Pois}(\mu), the squared Hellinger distance is H^2 = 2 \left(1 - \exp\left( -\frac{(\sqrt{\lambda} - \sqrt{\mu})^2}{2} \right) \right). In the exponential family, for distributions \mathrm{Exp}(\lambda) and \mathrm{Exp}(\mu) (with rate parameters \lambda, \mu > 0), the squared Hellinger distance is H^2 = 2 \left(1 - \frac{2 \sqrt{\lambda \mu}}{\lambda + \mu} \right). Direct evaluation of the integral over [0, \infty) yields this simple form, emphasizing the role of rate differences in lifetime or waiting-time models.

For other parametric families, closed forms follow similar derivations from the affinity integral. For two gamma distributions \mathrm{Gamma}(a_1, b_1) and \mathrm{Gamma}(a_2, b_2) (shape-rate parameterization), the affinity is \rho = \sqrt{\frac{b_1^{a_1} b_2^{a_2}}{\Gamma(a_1)\Gamma(a_2)}} \, \Gamma\!\left(\frac{a_1 + a_2}{2}\right) \left(\frac{b_1 + b_2}{2}\right)^{-(a_1 + a_2)/2}, expressed entirely in terms of gamma functions. Similarly, for Weibull distributions with common shape parameter k > 0 and scale parameters \lambda_1, \lambda_2 > 0, the squared Hellinger distance is H^2 = 2 \left(1 - \frac{2 (\lambda_1 \lambda_2)^{k/2}}{\lambda_1^k + \lambda_2^k} \right), capturing reliability and survival analysis scenarios. For beta distributions \mathrm{Beta}(\alpha_1, \beta_1) and \mathrm{Beta}(\alpha_2, \beta_2), the affinity is \rho = \frac{B\!\left(\frac{\alpha_1 + \alpha_2}{2}, \frac{\beta_1 + \beta_2}{2}\right)}{\sqrt{B(\alpha_1, \beta_1)\, B(\alpha_2, \beta_2)}}, a closed expression in terms of the beta function for proportions on [0,1]. These formulas, derived from the affinity integral, enable efficient applications in parametric hypothesis testing and density estimation.
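
The closed forms above translate directly into code; a minimal sketch for the normal, Poisson, and exponential cases (parameter values arbitrary):

```python
import numpy as np

def h2_normal(mu1, s1, mu2, s2):
    """Squared Hellinger distance between N(mu1, s1^2) and N(mu2, s2^2)."""
    bc = np.sqrt(2 * s1 * s2 / (s1**2 + s2**2)) * np.exp(-(mu1 - mu2) ** 2 / (4 * (s1**2 + s2**2)))
    return 2 * (1 - bc)

def h2_poisson(lam, mu):
    """Squared Hellinger distance between Pois(lam) and Pois(mu)."""
    return 2 * (1 - np.exp(-0.5 * (np.sqrt(lam) - np.sqrt(mu)) ** 2))

def h2_exponential(lam, mu):
    """Squared Hellinger distance between Exp(lam) and Exp(mu), rate parameterization."""
    return 2 * (1 - 2 * np.sqrt(lam * mu) / (lam + mu))

print(h2_normal(0.0, 1.0, 1.0, 2.0))
print(h2_poisson(3.0, 5.0))
print(h2_exponential(1.0, 3.0))
```
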
Numerical Estimation Methods

When closed-form expressions are unavailable, such as for non-parametric or complex continuous distributions, numerical methods rely on samples drawn from the underlying distributions to estimate the Hellinger distance. A common approach is the empirical plug-in estimator, which approximates the densities from data and then computes the distance via numerical integration. For instance, given independent samples X_1, \dots, X_n \sim P and Y_1, \dots, Y_m \sim Q, kernel density estimates \hat{p} and \hat{q} are formed, and the squared Hellinger distance is approximated as \hat{H}^2(P, Q) \approx \int (\sqrt{\hat{p}(x)} - \sqrt{\hat{q}(x)})^2 \, dx, evaluated via quadrature or Monte Carlo sampling over a grid or additional points.[](https://stats.stackexchange.com/questions/50931/calculating-hellinger-divergence-from-results-of-kernel-density-estimates-in-mat)[](https://www.sciencedirect.com/science/article/pii/016771529390022B)

A specific empirical estimator for the squared Hellinger distance between continuous distributions, proposed by Ding and Mullhaupt, uses the empirical cumulative distribution functions (ECDFs) of the samples. It estimates the scaled Hellinger affinity as \hat{A}(P, Q) = \frac{1}{n} \sum_{i=1}^n \sqrt{\frac{\delta Q_c(X_i)}{\delta P_c(X_i)}}, where \delta P_c(X_i) and \delta Q_c(X_i) are the left slopes (differences in ECDF values) at the ordered sample points. The squared distance is then \hat{H}^2(P, Q) = 2 \left(1 - \frac{4}{\pi} \hat{A}(P, Q)\right), with the multiplicative factor correcting the bias of the affinity estimate. A symmetric variant averages \hat{A}(P, Q) and \hat{A}(Q, P) for improved stability. This estimator converges almost surely to the true value under mild continuity assumptions, and its computation involves sorting the samples, which is efficient for moderate sample sizes.[](https://doi.org/10.3390/e25040612)

For kernel density estimation (KDE)-based methods, Gaussian or Epanechnikov kernels are typically used to estimate \hat{p} and \hat{q} on a fine grid, followed by trapezoidal integration of (\sqrt{\hat{p}} - \sqrt{\hat{q}})^2. The bandwidth is chosen to minimize the asymptotic mean Hellinger distance, often via cross-validation, giving consistency rates of O(1/\sqrt{n} + h^2 + 1/(n h)^{1/2}), where h is the bandwidth. Monte Carlo integration can approximate the integral by averaging over K additional samples, with variance decreasing as O(1/K), though it requires careful sampling from a bounding measure to avoid bias on unbounded supports.[](https://www.sciencedirect.com/science/article/pii/016771529390022B)[](https://digitalcommons.wayne.edu/cgi/viewcontent.cgi?article=1079&context=jmasm)

Bias in these empirical estimators, arising from finite samples and density approximation, can be corrected using bootstrapping: resample with replacement from the original samples to generate B bootstrap replicates, compute the estimator for each, and adjust the original estimate by the average bootstrap bias.
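
As one concrete instance of the plug-in approach (a sketch only: Gaussian KDEs with SciPy's default bandwidth rule rather than a Hellinger-optimal bandwidth, trapezoidal integration on a simple grid, and samples drawn from known normals so the estimate can be compared with the closed form):

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)

# Samples from two continuous distributions (normals here, so the truth is known).
x = rng.normal(loc=0.0, scale=1.0, size=2000)
y = rng.normal(loc=1.0, scale=1.5, size=2000)

# Kernel density estimates with SciPy's default (Scott's rule) bandwidth.
p_hat, q_hat = gaussian_kde(x), gaussian_kde(y)

# Evaluate both estimates on a common grid covering the pooled sample range.
grid = np.linspace(min(x.min(), y.min()) - 1.0, max(x.max(), y.max()) + 1.0, 2000)
p, q = p_hat(grid), q_hat(grid)

# Plug-in squared Hellinger distance via trapezoidal integration.
h2_plugin = trapezoid((np.sqrt(p) - np.sqrt(q)) ** 2, grid)

# Closed form for the two generating normals, for comparison.
s1, s2, dm = 1.0, 1.5, 1.0
h2_exact = 2 * (1 - np.sqrt(2 * s1 * s2 / (s1**2 + s2**2)) * np.exp(-dm**2 / (4 * (s1**2 + s2**2))))

print(h2_plugin, h2_exact)
```
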
This approach yields consistent variance estimates and reduces bias in small samples, particularly for the affinity component. Error bounds for these estimators typically show standard deviation \sqrt{\mathrm{Var}(\hat{A})} \approx 1/\sqrt{nm} under regularity conditions like bounded densities, ensuring \sqrt{n \wedge m}-consistency.[](https://projecteuclid.org/journals/bernoulli/volume-22/issue-2/Consistency-efficiency-and-robustness-of-conditional-disparity-methods/10.3150/14-BEJ678.pdf)[](https://doi.org/10.3390/e25040612)

Recent developments include robust estimators for the squared Hellinger distance that generalize to f-divergences and establish almost sure convergence via strong-law arguments. Additionally, minimum Hellinger distance estimators adapted for complex survey designs with unequal probabilities use Horvitz-Thompson-adjusted KDEs, maintaining robustness and efficiency in finite samples.[](https://doi.org/10.3390/e25040612) The naive empirical computation via double summation over samples for the affinity approximation has O(nm) complexity, but one-dimensional cases can be accelerated to O((n+m) \log (n+m)) using sorting for ECDFs or FFT for kernel convolutions. These methods complement parametric closed-form expressions when family assumptions hold but samples are available for validation.[](https://doi.org/10.3390/e25040612)

Applications

In Statistics

In statistics, the Hellinger distance serves as a key tool in hypothesis testing, particularly for goodness-of-fit tests in parametric models. Hellinger deviance tests, which minimize the Hellinger distance between empirical and model distributions, provide analogs of likelihood ratio tests and exhibit high efficiency under the null hypothesis while retaining robustness, as measured by breakdown points, against outliers.[](https://www.jstor.org/stable/2289852) Under Le Cam's theory of contiguity, where sequences of measures are contiguous if the squared Hellinger distance converges to zero, these tests achieve asymptotic chi-squared distributions for local alternatives, enabling reliable inference in large samples.[](https://projecteuclid.org/journals/annals-of-statistics/volume-30/issue-3/The-statistical-work-of-Lucien-Le-Cam/10.1214/aos/1028674836.pdf)

For density estimation, minimum Hellinger distance estimators (MHDE) offer a robust alternative to maximum likelihood estimation, especially under model misspecification or contaminated data.
Introduced for parametric models with independent identically distributed observations, MHDE minimizes the Hellinger distance between a nonparametric density estimate of the data and a parametric family, yielding asymptotically efficient and minimax robust estimates within Hellinger neighborhoods.[](https://projecteuclid.org/journals/annals-of-statistics/volume-5/issue-3/Minimum-Hellinger-Distance-Estimates-for-Parametric-Models/10.1214/aos/1176343842.full) In finite mixture models, for instance, MHDE demonstrates superior mean squared error performance compared to maximum likelihood when data include contamination, as it downweights outliers through the integral form of the distance.[](https://www.jstor.org/stable/2291601)

In sequential analysis, the Hellinger distance facilitates drift detection in nonstationary data streams, where distributions evolve over time. The Hellinger Distance Drift Detection Method (HDDDM), proposed in 2011, computes the distance between histograms of reference and current data batches to identify abrupt or gradual changes, using adaptive thresholds for decision-making and enabling classifier resets to maintain performance.[](https://users.rowan.edu/~polikar/RESEARCH/PUBLICATIONS/cidue11.pdf) This approach has been extended in subsequent work on stream mining, incorporating windowing schemes and comparisons with other metrics for real-time monitoring.

Recent advancements highlight the Hellinger distance's versatility in complex settings. In 2025, minimum Hellinger distance estimators were developed for parametric superpopulation models under complex survey designs, such as Poisson probability-proportional-to-size sampling, using Horvitz-Thompson-adjusted kernel densities to ensure L1-consistency, asymptotic normality, and robustness against high-leverage observations, as demonstrated on National Health and Nutrition Examination Survey data.[](https://arxiv.org/abs/2510.14055) Similarly, a 2024 extension to semiparametric covariate models introduces minimum profile Hellinger distance estimation, which profiles out nonparametric components to yield consistent and asymptotically normal estimators for the parametric part, enhancing robustness in regression contexts.[](https://www.sciencedirect.com/science/article/pii/S0167947324001385)

Compared to the Kullback-Leibler divergence, the Hellinger distance provides advantages in robust statistics through its boundedness and symmetry, offering finite-sample stability and resistance to heavy-tailed contamination where KL-based methods like maximum likelihood can fail.[](https://projecteuclid.org/journals/annals-of-statistics/volume-5/issue-3/Minimum-Hellinger-Distance-Estimates-for-Parametric-Models/10.1214/aos/1176343842.full) This leads to better performance in small or perturbed samples, with MHDE exhibiting minimax robustness properties absent in KL estimators.[](https://www.researchgate.net/publication/220451886_Hellinger_distance_decision_trees_are_robust_and_skew-insensitive)

In Machine Learning and Data Science

In machine learning, the Hellinger distance has proven particularly valuable for addressing class imbalance in classification tasks, where minority classes are underrepresented.
One approach involves Hellinger-based oversampling, which generates synthetic samples for minority classes by minimizing the Hellinger distance to the majority class distribution, thereby reducing overlap and skewness in multi-class imbalanced datasets. This method, applied to datasets like those in medical diagnostics, has demonstrated up to a 20% improvement in classification accuracy compared to traditional oversampling techniques such as SMOTE. Complementing this, stable sparse feature selection using Hellinger distance (sssHD) integrates the metric into a lasso-like penalty to select robust features in high-dimensional, imbalanced data, outperforming alternatives like mutual-information-based selection in stability and false discovery rate control on bioinformatics datasets.[](https://ieeexplore.ieee.org/document/8418525/)[](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3411-3)

In ensemble methods, the Hellinger distance serves as an effective splitting criterion for random forests, enhancing performance on imbalanced datasets by prioritizing splits that maximize divergence between class-conditional distributions rather than impurity measures like the Gini index. This adaptation leads to more balanced tree growth and improved minority-class recall, with empirical evaluations on benchmark imbalanced datasets showing superior AUC-ROC scores over standard random forests, especially in scenarios with imbalance ratios exceeding 1:10.[](https://www.sciencedirect.com/science/article/abs/pii/S0957417420300890)

For bias detection in AI systems, the Hellinger distance quantifies disparities in predictive outcome distributions across demographic groups, such as gender or ethnicity, enabling the identification of algorithmic fairness issues. As cataloged by the OECD, this metric supports mitigation strategies by measuring divergence in probability distributions of model predictions, facilitating audits that align with responsible AI principles without requiring access to sensitive labels.[](https://oecd.ai/en/catalogue/metrics/hellinger-distance)

In biomedical signal processing, particularly EEG-based seizure detection, the Hellinger distance combined with particle swarm optimization (PSO) enables efficient feature selection from high-dimensional signals.
This hybrid method selects features that maximize the Hellinger divergence between epileptic and non-epileptic states, reducing dimensionality by up to 90% while achieving classification accuracies above 98% on public EEG datasets like CHB-MIT, and outperforming genetic-algorithm-based selectors in computational efficiency.[](https://www.sciencedirect.com/science/article/abs/pii/S2214212623002387) Recent advancements derive closed-form expressions for the mean and variance of the squared Hellinger distance between pairs of random density matrices in quantum settings.[](https://link.aps.org/doi/10.1103/PhysRevE.111.054204) Additionally, in handling nonstationary data streams, Hellinger-based drift detection identifies gradual or abrupt shifts in feature distributions; originally proposed for evolving environments, it has since been integrated into modern online learning frameworks for applications like fraud detection, where it outperforms Kolmogorov-Smirnov tests in sensitivity to subtle changes.[](https://users.rowan.edu/~polikar/RESEARCH/PUBLICATIONS/cidue11.pdf)

Generalizations and Variants

Squared Hellinger Distance

The squared Hellinger distance between two probability measures P and Q on a measurable space, with densities p = dP/d\mu and q = dQ/d\mu with respect to a dominating measure \mu, is defined as H^2(P, Q) = \int \left( \sqrt{p} - \sqrt{q} \right)^2 \, d\mu = 2 - 2 \int \sqrt{p q} \, d\mu.[](https://people.lids.mit.edu/yp/homepage/data/LN_fdiv.pdf) This is an f-divergence with generating function f(x) = (1 - \sqrt{x})^2.[](https://people.lids.mit.edu/yp/homepage/data/LN_fdiv.pdf) Unlike the standard Hellinger distance H(P, Q) = \sqrt{H^2(P, Q)}, the squared variant is often preferred in applications due to its tensorization over independent product measures: if P = P_1 \times P_2 and Q = Q_1 \times Q_2 with independent components, then H^2(P, Q) = 2 \left(1 - \left(1 - \frac{H^2(P_1, Q_1)}{2}\right)\left(1 - \frac{H^2(P_2, Q_2)}{2}\right)\right), which approximates H^2(P_1, Q_1) + H^2(P_2, Q_2) when the distances are small and simplifies analysis in high-dimensional or sequential settings. Additionally, H^2 exhibits superior differentiability properties compared to H, facilitating optimization and asymptotic expansions in statistical inference.[](http://www.stat.yale.edu/~pollard/Manuscripts%2BNotes/Paris2001/Lectures/DQM.pdf) As an f-divergence, H^2(P, Q) is jointly convex in the pair (P, Q).[](https://people.lids.mit.edu/yp/homepage/data/LN_fdiv.pdf) For distributions that are close, such as members P_\theta and P_{\theta + h} of a parametric family under regularity conditions, H^2 admits the local quadratic approximation H^2(P_\theta, P_{\theta + h}) \approx \frac{1}{4} h^T I(\theta) h, where I(\theta) is the Fisher information matrix; this relates H^2 to the Fisher-Rao metric and underscores its role in local asymptotics.[](https://par.nsf.gov/servlets/purl/10157852) In the univariate normal case with equal variance, this simplifies further, providing a direct link to information geometry.
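
Both the tensorization identity and the local quadratic (Fisher information) approximation can be illustrated for a normal location family; a sketch with arbitrary parameter values:

```python
import numpy as np

def h2_normal_same_var(mu1, mu2, sigma):
    """Exact squared Hellinger distance between N(mu1, sigma^2) and N(mu2, sigma^2)."""
    return 2 * (1 - np.exp(-(mu1 - mu2) ** 2 / (8 * sigma**2)))

sigma, h = 1.0, 0.1

# Local quadratic approximation: H^2(P_theta, P_{theta+h}) ~ (1/4) h^2 I(theta), I = 1/sigma^2.
print(h2_normal_same_var(0.0, h, sigma), 0.25 * h**2 / sigma**2)

# Tensorization over n independent copies: H^2(P^n, Q^n) = 2(1 - (1 - H^2/2)^n).
n, delta = 10, 1.0
h2_one = h2_normal_same_var(0.0, delta, sigma)
h2_tensor = 2 * (1 - (1 - h2_one / 2) ** n)
# Direct check: the affinity of the n-fold products is exp(-n delta^2 / (8 sigma^2)).
h2_direct = 2 * (1 - np.exp(-n * delta**2 / (8 * sigma**2)))
print(h2_tensor, h2_direct)
```
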
The squared Hellinger distance finds distinct applications in empirical process theory, where it governs uniform convergence rates of empirical measures over function classes due to its boundedness and equivalence to total variation in weak topologies.[](https://sites.stat.columbia.edu/bodhi/Talks/Emp-Proc-Lecture-Notes.pdf) For instance, maximal Hellinger inequalities bound suprema of empirical processes, enabling concentration results for estimators in nonparametric settings.[](https://sites.stat.columbia.edu/bodhi/Talks/Emp-Proc-Lecture-Notes.pdf) Recent work has developed an almost surely consistent empirical estimator for H^2 between continuous distributions, built from empirical distribution function estimates and achieving convergence without bias under mild smoothness assumptions.[](https://www.mdpi.com/1099-4300/25/4/612)

In relation to the total variation distance TV(P, Q), the squared Hellinger distance satisfies \frac{1}{2} H^2(P, Q) \leq TV(P, Q) \leq H(P, Q) \sqrt{1 - \frac{H^2(P, Q)}{4}}, with both metrics inducing the same topology on probability measures.[](https://nobel.web.unc.edu/wp-content/uploads/sites/13591/2020/11/Distance-Divergence.pdf) For binary outcomes with disjoint supports, such as Bernoulli(1) and Bernoulli(0), equality holds in the form H^2(P, Q) = 2 = 2 \, TV(P, Q).[](https://people.lids.mit.edu/yp/homepage/data/LN_fdiv.pdf) A representative example arises with univariate normal distributions N(\mu_1, \sigma^2) and N(\mu_2, \sigma^2), where the exact value is H^2 = 2 \left(1 - \exp\left( -\frac{(\mu_1 - \mu_2)^2}{8 \sigma^2} \right) \right); for close means with |\mu_1 - \mu_2| \ll \sigma, this approximates H^2 \approx \frac{(\mu_1 - \mu_2)^2}{4 \sigma^2}, illustrating the local scaling with the Fisher information I = 1/\sigma^2.[](https://www.stat.cmu.edu/~larry/=stat705/Lecture27.pdf)

Hellinger-Kantorovich Distance

The Hellinger-Kantorovich distance, also known as the Wasserstein-Fisher-Rao or HK distance, generalizes the Hellinger distance to the space of nonnegative (not necessarily probability) Radon measures on a metric space, incorporating both transport and mass-variation aspects. It can be described via a lifting of measures to a cone space over the base metric space, compatible with the Riemannian structure on the space of square-root densities. Formally, for measures μ and ν on a Polish space (X, d), the squared HK distance can be written as \mathrm{HK}^2(\mu, \nu) = \inf_{\pi \in \Pi(\sqrt{\mu}, \sqrt{\nu})} \int_{X \times X} d(x,y)^2 \, d\pi(x,y), where \Pi(\sqrt{\mu}, \sqrt{\nu}) denotes the set of couplings between the "square-root measures" \sqrt{\mu} and \sqrt{\nu}, interpreted via the cone construction that embeds densities into positive functions.
This formulation unifies optimal transport with Hellinger-type affinities, allowing for dynamic interpolations between measures of differing total masses.[](https://epubs.siam.org/doi/10.1137/15M1041420)[](https://link.springer.com/article/10.1007/s00222-017-0759-8)

Key properties of the Hellinger-Kantorovich distance include its extension to non-probability measures, making it suitable for unbalanced transport problems where mass creation or annihilation is permitted. It induces geodesic distances on the space of measures, analogous to Wasserstein geodesics but embedded in the Hellinger geometry, which provides a sub-Riemannian structure on the cone of positive measures. For probability measures P and Q, the HK distance satisfies \mathrm{HK}(P, Q) \leq H(P, Q) (up to the normalization convention for the cone), with equality when the optimal plan involves no transport and reduces to the L^2 comparison of the densities, and the HK metric metrizes weak-* convergence of measures, together with convergence of total masses, in appropriate settings. This inequality highlights its role as a coarser metric than the standard Hellinger distance while preserving metric properties like non-negativity, symmetry, and the triangle inequality.[](https://epubs.siam.org/doi/10.1137/15M1041420)[](https://link.springer.com/article/10.1007/s00222-017-0759-8)

A notable development is the local linearization of the Hellinger-Kantorovich distance, explored in a 2022 SIAM Journal on Imaging Sciences paper (received in 2021), which approximates the metric for small perturbations via its Riemannian exponential and logarithmic maps. Locally, the linearized HK distance approximates the standard Hellinger distance, i.e., HK(μ, ν) ≈ H(μ, ν) for nearby measures, facilitating explicit computations of tangent vectors and inner products in the space of densities. This linearization proves useful in optimization, as it enables efficient Euclidean-like algorithms for tasks involving measure comparisons while retaining the geometric insights of the full HK metric.[](https://epubs.siam.org/doi/10.1137/21M1400080)

In applications, the Hellinger-Kantorovich distance serves as a smoother alternative to the Wasserstein distance in variational inference and generative modeling, balancing rigid transport costs with flexible mass adjustments to improve convergence in gradient flows. For instance, it underpins algorithms for mean-field variational inference by enabling polyhedral optimizations in unbalanced settings and supports kernel approximations for sampling in high-dimensional spaces. In generative models, it enhances training stability over pure optimal transport by incorporating Hellinger regularization, as demonstrated in semi-dual formulations for adversarial training.[](https://dl.acm.org/doi/10.5555/3666122.3667962)[](https://kantorovich.org/event/ki-seminar-zhu/)

References

  1. [1]
    [PDF] Lecture Notes 27 36-705 1 The Fundamental Statistical Distances
    Hellinger distance: The Hellinger distance between two distributions is H(P, Q) = [ ∫ (√p(x) − √q(x))² dx ]^{1/2}, i.e. the Hellinger distance is the ℓ2 norm ...
  2. [2]
    Neue Begründung der Theorie quadratischer Formen von ... - EuDML
    Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. E. Hellinger · Journal für die reine und angewandte Mathematik (1909).
  3. [3]
    [PDF] Total variation distance between measures
    Feb 15, 2005 · The Hellinger distance is closely related to the total variation distance—for example, both distances define the same topology of the space of ...
  4. [4]
    [PDF] Some notes on the Hellinger distance and various Fisher-Rao ...
    Oct 2, 2025 · These expository notes introduce the Hellinger distance on the set of all measures and the induced Fisher-Rao distances for subsets of measures, ...
  5. [5]
    [PDF] Hilbert Space Embeddings and Metrics on Probability Measures
    A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and ...
  6. [6]
    [PDF] 7.1 Definition and basic properties of f-divergences - People
    Indeed, f's differing by a linear term lead to the same f-divergence, cf. Proposition 7.1. • Squared Hellinger distance: f(x) = (1 −. √x)2. ,. H2(P, Q) , EQ ...
  7. [7]
    [PDF] Theorie und Anwendungen der absolut additiven Mengenfunktionen
    von. Johann Radon. (Vorgelegt in der Sitzung am 26. Juni 1918.) Einleitung. In ... ein verallgemeinertes Hellinger'sches Integral, an Stelle des. »u-ten ...
  8. [8]
    Johann Radon (1887 - 1956) - Biography - MacTutor
    He did this, overcoming considerable obstacles, through a combination of Stieltjes', Lebesgue's and Hellinger's concepts of an integral. The paper is full of ...
  9. [9]
    THE STATISTICAL WORK OF LUCIEN LE CAM Free University ...
    In fact, the 1960 paper predates the introduction of the deficiency distance and ... metric of choice is the Hellinger distance, whose square is given in (9.4).
  10. [10]
    [PDF] Some notes on the Hellinger distance and various Fisher-Rao ...
    Oct 15, 2025 · These expository notes introduce the Hellinger distance on the set of all measures and the induced Fisher-Rao distances for subsets of ...
  11. [11]
    Empirical Squared Hellinger Distance Estimator and ... - MDPI
    Apr 4, 2023 · We present an empirical estimator for the squared Hellinger distance between two continuous distributions, which almost surely converges.
  12. [12]
  13. [13]
    [PDF] Hellinger differentiability
    The Hellinger distance between densities corresponds to the L2 norm of the difference between the unit vectors. This Chapter explains some of the statistical ...
  14. [14]
    [PDF] probability metrics
    Jan 17, 2020 · Hellinger metric is complete on P(S), since dTV is. The Hellinger integral and distance are convenient when considering prod- uct measures.
  15. [15]
    [PDF] 12. Hellinger distance
    In this lecture, we will introduce a new notion of distance between probability distributions called Hellinger distance. Using some of the nice properties ...
  16. [16]
  17. [17]
    [PDF] On Loss Functions and f-Divergences - Department of Statistics
    If(µ, π) := Xz. |µ(z) − π(z)|. • Hellinger distance: f(u) = 1. 2(√u − 1). 2 ... • Hellinger distance corresponds to an f-divergence with f(u) = −2. √u.
  18. [18]
    arXiv:1111.6372v2 — chi-squared divergence and Hellinger distance / squared Hellinger.
  19. [19]
    [PDF] October 31 1 Distribution Distances
    Both the Hellinger distance and the total variation distance satisfy the triangle inequality, which we can see because the L1 and L2 norms already exhibit ...
  20. [20]
    [PDF] The total variation distance between high-dimensional Gaussians ...
    Oct 22, 2023 · Bounds for the total variation distance using the Hellinger distance. For distributions P and. Q over Rd with densities p and q, their ...
  21. [21]
    [PDF] arXiv:2002.05094v1 [math.DS] 12 Feb 2020
    Feb 12, 2020 · the Hellinger distance on the set of probability measures on Z+. ... χa,b(k) = ea−b ab k/2. Ik(2√ab), where Ik is the modified Bessel function of ...
  22. [22]
    Calculating Hellinger Divergence from Results of Kernel Density ...
    Feb 27, 2013 · The Hellinger distance is H=∑i(√fi−√gi)2.
  23. [23]
    Hellinger distance and Kullback—Leibler loss for the kernel density ...
    The optimal window width, which asymptotically minimizes mean Hellinger distance between the kernel estimator and density, is known to be equivalent to the ...
  24. [24]
  25. [25]
    [PDF] The Weighted Hellinger Distance for Kernel Distribution Estimator of ...
    May 1, 2012 · The asymptotic mean weighted Hellinger distance (AMWHD) is derived for the kernel distribution estimator of a function of observations.
  26. [26]
    Consistency, efficiency and robustness of conditional disparity ...
    We also observe that Hellinger distance estimators have large variances in some cases, mostly due to occasional outlying parameter esti- mates. By contrast, ...
  27. [27]
    Hellinger Deviance Tests: Efficiency, Breakdown Points, and ... - jstor
    Hellinger distance analogs of likelihood ratio tests are proposed for parametric inference. The proposed tests are based on minimized Hellinger distances ...
  28. [28]
    Minimum Hellinger Distance Estimates for Parametric Models
    This paper defines and studies for independent identically distributed observations a new parametric estimation procedure which is asymptotically efficient.
  29. [29]
    Minimum Hellinger Distance Estimation for Finite Mixture Models - jstor
    MSE's are calculated assuming that go is "truth." In general, the MHDE is considerably more efficient than the MLE for contaminated data. The only apparent ...
  30. [30]
    [PDF] Hellinger Distance Based Drift Detection for Nonstationary ...
    In this work, we propose and analyze a feature based drift detection method using the Hellinger distance to detect gradual or abrupt changes in the distribution ...
  31. [31]
  32. [32]
    Minimum profile Hellinger distance estimation of general covariate ...
    For semiparametric covariate models, the minimum Hellinger distance method is extended and a minimum profile Hellinger distance estimator is proposed. Its ...
  33. [33]
    Hellinger distance decision trees are robust and skew-insensitive
    Aug 6, 2025 · We analytically and empirically demonstrate the strong skew insensitivity of Hellinger distance and its advantages over popular alternatives ...
  34. [34]
    Hellinger distance based oversampling method to solve multi-class ...
    Obtained results show increase of 20% in classification accuracy compared to classification of imbalance multi-class dataset.
  35. [35]
    Hellinger distance-based stable sparse feature selection for high ...
    Mar 23, 2020 · As mentioned above, Hellinger distance essentially captures the divergence between the feature value distributions of different classes and is ...
  36. [36]
    Study of Hellinger Distance as a splitting metric for Random Forests ...
    Jul 1, 2020 · Hellinger Distance (HD) is a splitting metric that has been shown to have an excellent performance for imbalanced classification problems ...
  37. [37]
    Hellinger Distance - OECD.AI
    The Hellinger distance is a metric used to measure the similarity between two probability distributions. It is related to the Euclidean distance but applied ...
  38. [38]
    An efficient feature selection and explainable classification method ...
    This study introduces a novel feature selection method based on Hellinger distance and particle swarm optimization (PSO) for reducing the dimensionality of ...
  39. [39]
    Exact mean and variance of the squared Hellinger distance for ...
    May 5, 2025 · In this work, we derive the mean and variance of the Hellinger distance between pairs of density matrices, where one or both matrices are random.
  40. [40]
    [PDF] Hellinger differentiability - Yale Statistics and Data Science
    Mar 20, 2001 · The Hellinger distance between densities corresponds to the L2 norm of the difference between the unit vectors. This Chapter explains some of ...
  41. [41]
    [PDF] On optimal designs for nonregular models
    Therefore, a model is regular if the squared Hellinger distance is locally approxi- mately quadratic, with the Fisher information matrix characterizing that ...
  42. [42]
    [PDF] A Gentle Introduction to Empirical Process Theory and Applications
    ... Hellinger distance between p and q is equivalent to the Hellinger distance between p and (p + q)/2. The maximal inequality is now a consequence of Theorem ...
  43. [43]
    [PDF] Distances and Divergences for Probability Distributions
    Hellinger Distance vs. Total Variation. Fact: For any pair of densities f, g we have the following inequalities: ∫ min(f, g) dx ≥ (1/2) (∫ √(f g) dx)² = (1/2)(1 ...
  44. [44]
    The Hellinger--Kantorovich Distance and Geodesic Curves - SIAM.org
    We discuss a new notion of distance on the space of finite and nonnegative measures on Ω ⊂ R d , which we call the Hellinger--Kantorovich distance.
  45. [45]
    Optimal Entropy-Transport problems and a new Hellinger ...
    Dec 14, 2017 · The Hellinger–Kantorovich distance can then be defined by taking the best Kantorovich–Wasserstein distance between all the possible lifts of \mu ...
  46. [46]
    The Linearized Hellinger--Kantorovich Distance - SIAM.org
    We discuss a new notion of distance on the space of finite and nonnegative measures on Ω ⊂ ℝ 𝑑 , which we call the Hellinger--Kantorovich distance. It can be ...
  47. [47]
    Generative modeling through the semi-dual formulation of ...
    Dec 10, 2023 · Our model outperforms existing OT-based generative models ... Optimal entropy-transport problems and a new hellinger-kantorovich distance between ...
  48. [48]
    Kernel Approximation of Wasserstein and Fisher-Rao Gradient flows
    I will showcase inference and sampling algorithms using a new kernel approximation of the Wasserstein-Fisher-Rao (aka Hellinger-Kantorovich) gradient flows.