Le Cam's theorem

Le Cam's theorem is a foundational result in statistical decision theory that characterizes the deficiency distance between two statistical experiments in terms of achievable risks: the deficiency satisfies \delta(P_1, P_2) \leq \epsilon if and only if, for every decision rule \rho_2 in P_2 and every loss function L with \|L\|_\infty \leq 1, there exists a decision rule \rho_1 in P_1 such that the risk satisfies R_\theta(P_1, \rho_1, L) \leq R_\theta(P_2, \rho_2, L) + \epsilon for all parameters \theta. The theorem emerged from Lucien Le Cam's work in the 1960s, building on earlier ideas by Blackwell on the comparison of experiments through information loss.

Le Cam formalized statistical experiments as triples (\Omega, \mathcal{T}, \{P_\theta : \theta \in \Theta\}), consisting of a measurable space (\Omega, \mathcal{T}), a parameter space \Theta, and a family of probability measures P_\theta indexed by \theta. This framework allows for a precise quantification of how closely two experiments can approximate each other via randomized decision procedures, known as Markov kernels. Central to the theorem is the Le Cam distance \Delta(P_1, P_2) = \max(\delta(P_1, P_2), \delta(P_2, P_1)), where the deficiency \delta(P_1, P_2) is the smallest worst-case total variation discrepancy achievable when transforming P_1 toward P_2 through a Markov kernel (see the display below). This distance captures the "cost" of transforming one experiment into another, with \Delta(P_1, P_2) = 0 implying the experiments are equivalent in the sense that they yield the same minimal risks for all decision problems.

In asymptotic statistics, sequences of experiments (P_{1,n}) and (P_{2,n}) are asymptotically equivalent if \Delta(P_{1,n}, P_{2,n}) \to 0 as n \to \infty, meaning their inferential properties coincide in the large-sample limit. Le Cam's theorem underpins this equivalence by linking distance to risk bounds, enabling the simplification of complex models, such as replacing nonparametric density estimation with Gaussian white noise experiments, without altering asymptotic minimax risks. Applications span contiguity of measures, local asymptotic normality, and modern nonparametric inference, influencing fields like econometrics and signal processing. The name "Le Cam's theorem" is also commonly attached to Le Cam's Poisson approximation inequality, which bounds the total variation distance between a Poisson binomial distribution and a Poisson distribution with the same mean; that inequality and its refinements are treated in the statement and proof sections below.
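In symbols, under one common normalization of the total variation norm (conventions differ by a factor of two), the deficiency and the Le Cam distance can be written as

\delta(P_1, P_2) = \inf_{T} \sup_{\theta \in \Theta} \bigl\| T P_{1,\theta} - P_{2,\theta} \bigr\|_{\mathrm{TV}}, \qquad \Delta(P_1, P_2) = \max\bigl( \delta(P_1, P_2), \delta(P_2, P_1) \bigr),

where the infimum runs over Markov kernels T mapping observations of P_1 into the sample space of P_2.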

Background concepts

Poisson binomial distribution

The Poisson binomial distribution is the discrete probability distribution of the sum S_n = \sum_{i=1}^n X_i, where the X_i are independent Bernoulli random variables with respective success probabilities p_i \in [0,1]. This setup generalizes the binomial distribution, which arises when all p_i are identical, and forms a foundational model for scenarios involving heterogeneous success probabilities, such as reliability analysis or randomized algorithms. The probability mass function of S_n is \Pr(S_n = k) = \sum_{A \subseteq \{1, \dots, n\},\, |A| = k} \prod_{i \in A} p_i \prod_{j \notin A} (1 - p_j), for integers k = 0, 1, \dots, n, where the sum runs over all subsets A of \{1, \dots, n\} of size k. Exact evaluation of this expression directly from the definition is computationally intensive for large n, as it involves \binom{n}{k} terms, each requiring products over the probabilities. The expected value is \mathbb{E}[S_n] = \sum_{i=1}^n p_i, commonly denoted \lambda_n. The variance is \mathrm{Var}(S_n) = \sum_{i=1}^n p_i (1 - p_i), reflecting the additive nature of independent summands. These moments provide essential summaries of the distribution's location and spread. The distribution derives its name from the mathematician Siméon Denis Poisson, who introduced the concept in 1837 while considering sums of trials with varying probabilities, thereby extending the binomial framework to heterogeneous cases.
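As a concrete illustration, the following Python sketch (the helper name poisson_binomial_pmf and the example probabilities are illustrative choices, not from any standard library) computes the probability mass function by iterated convolution in O(n^2) time rather than summing over \binom{n}{k} subsets, and checks the mean and variance formulas against it.

```python
import numpy as np

def poisson_binomial_pmf(p):
    """PMF of S_n = X_1 + ... + X_n for independent Bernoulli(p_i),
    built by convolving in one Bernoulli factor at a time (O(n^2))."""
    pmf = np.array([1.0])               # empty sum: P(S_0 = 0) = 1
    for pi in p:
        new = np.zeros(len(pmf) + 1)
        new[:-1] += pmf * (1 - pi)      # X_i = 0 leaves the count unchanged
        new[1:]  += pmf * pi            # X_i = 1 shifts the count up by one
        pmf = new
    return pmf

p = [0.1, 0.3, 0.05, 0.2]               # heterogeneous success probabilities
pmf = poisson_binomial_pmf(p)
mean = sum(p)                            # E[S_n] = sum of p_i  (= lambda_n)
var = sum(pi * (1 - pi) for pi in p)     # Var(S_n) = sum of p_i (1 - p_i)

ks = np.arange(len(pmf))
assert np.isclose(pmf.sum(), 1.0)
assert np.isclose((ks * pmf).sum(), mean)
assert np.isclose((ks**2 * pmf).sum() - mean**2, var)
print(pmf)
```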

Total variation distance

The total variation distance, often denoted d_{\mathrm{TV}}(P, Q), serves as a fundamental metric for comparing two probability measures P and Q defined on the same measurable space. For discrete probability measures on a countable space, it is defined as d_{\mathrm{TV}}(P, Q) = \frac{1}{2} \sum_{x} |P(\{x\}) - Q(\{x\})|, where the sum is over all points x in the support. Equivalently, it can be expressed as d_{\mathrm{TV}}(P, Q) = \sup_{A} |P(A) - Q(A)|, with the supremum taken over all measurable sets A. This formulation highlights its interpretation as the maximum possible difference in the probabilities assigned by P and Q to any event.

As a metric on the space of probability measures, the distance satisfies non-negativity, symmetry, and the triangle inequality, with d_{\mathrm{TV}}(P, Q) = 0 if and only if P = Q. It is bounded above by 1, since probabilities lie between 0 and 1, and in the discrete case it equals half the \ell^1 distance between the probability mass functions. This distance is particularly valuable in approximation theorems, such as Le Cam's, because it provides an upper bound on differences of expectations: for any bounded function f with |f| \leq 1, |\mathbb{E}_P[f] - \mathbb{E}_Q[f]| \leq 2\, d_{\mathrm{TV}}(P, Q), thereby quantifying the error in approximating expectations under one distribution by those under another.

To illustrate, consider two Bernoulli distributions: let P be Bernoulli(p = 0.3) and Q be Bernoulli(q = 0.5). The probability mass functions differ at 0 and 1, with |P(1) - Q(1)| = |0.3 - 0.5| = 0.2 and |P(0) - Q(0)| = 0.2. Thus, d_{\mathrm{TV}}(P, Q) = \frac{1}{2} (0.2 + 0.2) = 0.2. This value represents the largest discrepancy in event probabilities, such as the probability of success differing by 0.2. In the context of Le Cam's theorem, the total variation distance measures the quality of approximating the Poisson binomial distribution by a Poisson distribution.
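A minimal Python sketch of this computation (the function name tv_distance is an illustrative choice) reproduces the Bernoulli example above:

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions,
    given as probability vectors over the same (aligned) support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

# Bernoulli(0.3) vs Bernoulli(0.5), written as [P(0), P(1)]
P = [0.7, 0.3]
Q = [0.5, 0.5]
print(tv_distance(P, Q))   # 0.2, matching the worked example

# the supremum definition is attained here at the event A = {1}
print(abs(P[1] - Q[1]))    # 0.2
```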

Statement of the theorem

Formal statement

Le Cam's theorem asserts that if X_1, \dots, X_n are independent Bernoulli random variables with success probabilities p_i satisfying 0 \leq p_i \leq 1, and if S_n = \sum_{i=1}^n X_i with \lambda_n = \sum_{i=1}^n p_i, then the total variation distance between the law of S_n (the Poisson binomial distribution) and the Poisson distribution with parameter \lambda_n satisfies d_{\mathrm{TV}}\bigl( \mathcal{L}(S_n), \mathrm{Po}(\lambda_n) \bigr) \leq \sum_{i=1}^n p_i^2, where d_{\mathrm{TV}}(\mu, \nu) = \sup_{A} \bigl| \mu(A) - \nu(A) \bigr|. This inequality holds for arbitrary finite n and any choice of the p_i in [0,1]. The right-hand side measures the approximation error, which is controlled by the sum of the squared success probabilities and becomes small whenever the p_i are sufficiently sparse or close to zero. In the special case where p_i = \lambda / n for all i (so that \mathcal{L}(S_n) is binomial with parameters n and \lambda / n), the bound simplifies to \lambda^2 / n, thereby quantifying the rate of convergence in the classical Poisson limit theorem (the law of rare events).
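The bound is easy to check numerically. The sketch below (assuming NumPy and SciPy are available; the PMF helper mirrors the one shown earlier and the parameter choices are arbitrary) computes the exact total variation distance for a heterogeneous example and compares it with \sum p_i^2.

```python
import numpy as np
from scipy.stats import poisson

def poisson_binomial_pmf(p):
    """Exact PMF of the Poisson binomial distribution via iterated convolution."""
    pmf = np.array([1.0])
    for pi in p:
        new = np.zeros(len(pmf) + 1)
        new[:-1] += pmf * (1 - pi)
        new[1:]  += pmf * pi
        pmf = new
    return pmf

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 0.1, size=200)        # small, heterogeneous success probabilities
lam = p.sum()

pb = poisson_binomial_pmf(p)                        # law of S_n on {0, ..., n}
po = poisson.pmf(np.arange(len(pb)), lam)           # Poisson(lambda_n) on the same range

# TV distance: compare on {0,...,n} and add the Poisson mass beyond n,
# where the Poisson binomial law puts no mass at all.
tv = 0.5 * (np.abs(pb - po).sum() + poisson.sf(len(pb) - 1, lam))
print(f"TV distance = {tv:.5f},  Le Cam bound sum p_i^2 = {(p**2).sum():.5f}")
```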

Improved bounds

Subsequent refinements of Le Cam's original bound provide sharper estimates for the total variation distance between the Poisson binomial distribution and the approximating Poisson distribution. A key improvement replaces the constant factor in the original inequality with \left(1 \wedge \frac{1}{\lambda_n}\right), yielding d_{\mathrm{TV}}\left( \mathcal{L}(S_n), \mathrm{Po}(\lambda_n) \right) \leq \left(1 \wedge \frac{1}{\lambda_n}\right) \sum_{i=1}^n p_i^2, where \mathcal{L}(S_n) denotes the law of the sum S_n = \sum_{i=1}^n X_i. This refinement, developed in subsequent work including contributions by Barbour and collaborators, accounts for the dependence of the error on the mean parameter \lambda_n = \sum_{i=1}^n p_i, tightening the bound particularly when \lambda_n is large. The factor 1 \wedge 1/\lambda_n leaves the bound at \sum p_i^2 when \lambda_n \leq 1, matching the scale of the original estimate, while scaling it down to approximately \sum p_i^2 / \lambda_n for large \lambda_n, reflecting the reduced relative error in that regime. This adjustment arises from more precise coupling arguments that exploit the structure of the Poisson process.

In the asymptotic regime where \max_i p_i \to 0 and \lambda_n remains fixed or grows moderately, the total variation distance is of order O\left(\sum p_i^2\right), with the refined bound capturing the leading term. Moreover, equality in this order can be approached in specific cases, such as when the p_i are equal, demonstrating the sharpness of the estimate under homogeneity. An extension of these bounds applies to the multinomial setting, where one considers the vector (X_1, \dots, X_m) of sums over disjoint categories with the fixed total sum constraint \sum_{j=1}^m X_j = S_n. Here, the joint distribution is approximated by independent Poisson random variables with parameters \lambda_j = \sum_{i \in \mathcal{I}_j} p_i, and the bound remains analogous for the scalar projections, scaling with the \sum p_i^2 terms within each category.
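A short numerical comparison (illustrative parameter choices only) shows how the refined factor behaves across regimes of \lambda_n: for small \lambda_n the two bounds coincide, while for large \lambda_n the original bound can exceed the trivial bound of 1 and the refined one stays informative.

```python
import numpy as np

# Compare Le Cam's original bound with the refined (1 ∧ 1/lambda_n) version
# for identical p_i = 0.02 and increasing n (hence increasing lambda_n).
for n in [50, 500, 5000]:
    p = np.full(n, 0.02)
    lam = p.sum()
    original = (p**2).sum()
    refined = min(1.0, 1.0 / lam) * (p**2).sum()
    print(f"n = {n:5d}, lambda_n = {lam:6.1f}: "
          f"original bound {original:.4f}, refined bound {refined:.6f}")
```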

Historical development

Lucien Le Cam's contributions

Lucien Le Cam (1924–2000) was a French-born mathematical statistician, best known for his foundational contributions to asymptotic theory in statistics. Born in France, he earned his doctorate from the University of California, Berkeley, and spent much of his career as a professor there, becoming Professor Emeritus of Mathematics and Statistics. Le Cam's research emphasized limit theorems and approximation techniques, influencing modern mathematical statistics and probability.

Le Cam's theorem on the deficiency distance between statistical experiments was formalized in his 1964 paper "Sufficiency and Approximate Sufficiency", published in the Annals of Mathematical Statistics. In this work, he introduced the deficiency measure \delta(P_1, P_2) to quantify how well one experiment P_1 can approximate another P_2 in terms of minimal expected risks for decision problems. The theorem establishes that the deficiency equals the supremum, over decision problems with loss bounded by one, of the minimal excess risk incurred when procedures in the second experiment are mimicked in the first, providing a precise metric for comparing statistical models. This contribution built on Le Cam's earlier investigations into asymptotic properties of statistical procedures, including his introduction of contiguity of probability measures, which laid groundwork for understanding convergence in statistical experiments. The deficiency concept addressed the need for a decision-theoretic framework to evaluate approximate sufficiency and the "information loss" in reducing data, revolutionizing the comparison of complex models in asymptotic statistics. Le Cam further developed these ideas in subsequent works, such as his 1969 work linking likelihood ratio convergence to deficiency, solidifying the theorem's role in local asymptotic normality and modern inference theory.

Blackwell's comparison of experiments

The foundations for comparing statistical experiments trace back to David Blackwell's 1951 paper "Comparison of Experiments" in the Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. Blackwell defined one experiment as more informative than another if, for every decision problem, the minimal risk achievable with the former is no larger than with the latter, using randomized decision rules (Markov kernels) to formalize the comparison. This work established conditions based on the existence of stochastic transformations, or channels, between experiments. Earlier concepts of sufficiency, introduced by Ronald Fisher in 1920 and refined during the 1930s, provided the backdrop for evaluating data reduction without information loss, but lacked a quantitative measure of approximation. Blackwell's framework extended these ideas to a general decision-theoretic setting, motivating Le Cam's later quantification via the deficiency distance. These developments filled gaps in the prior literature, which focused on exact sufficiency rather than the approximate comparisons essential for asymptotic analysis, paving the way for Le Cam's theorem as a key advancement in statistical decision theory.

Proof outline

Coupling approach

The coupling approach provides an intuitive probabilistic proof of Le Cam's theorem by constructing auxiliary random variables to compare the distribution of S_n = \sum_{i=1}^n X_i with a Poisson random variable of mean \lambda_n = \sum_{i=1}^n p_i. Independent Poisson random variables Y_i \sim \mathrm{Po}(p_i) are introduced for i = 1, \dots, n, so that T_n = \sum_{i=1}^n Y_i \sim \mathrm{Po}(\lambda_n). Each pair (X_i, Y_i) is then coupled, independently across i, so as to preserve the marginal distributions \mathrm{Bern}(p_i) for X_i and \mathrm{Po}(p_i) for Y_i while minimizing the mismatch probability \mathbb{P}(X_i \neq Y_i). By the coupling characterization of total variation, the distance between \mathrm{Bern}(p_i) and \mathrm{Po}(p_i) equals this minimal mismatch probability, given explicitly by d_{\mathrm{TV}}(\mathrm{Bern}(p_i), \mathrm{Po}(p_i)) = p_i (1 - e^{-p_i}). This quantity satisfies p_i (1 - e^{-p_i}) \leq p_i^2, since 1 - e^{-p_i} \leq p_i follows from the convexity of the exponential function.

Under the product coupling of all pairs, the event \{S_n \neq T_n\} is contained in the union of the mismatch events \{X_i \neq Y_i\} for i = 1, \dots, n. Thus, by the union bound, \mathbb{P}(S_n \neq T_n) \leq \sum_{i=1}^n \mathbb{P}(X_i \neq Y_i) = \sum_{i=1}^n d_{\mathrm{TV}}(\mathrm{Bern}(p_i), \mathrm{Po}(p_i)) \leq \sum_{i=1}^n p_i^2. The total variation distance then inherits this bound, since for any coupling (U, V) of two random variables with marginal laws \mu and \nu, d_{\mathrm{TV}}(\mu, \nu) \leq \mathbb{P}(U \neq V); applied to the constructed joint distribution of (S_n, T_n), this yields d_{\mathrm{TV}}(\mathcal{L}(S_n), \mathrm{Po}(\lambda_n)) \leq \sum_{i=1}^n p_i^2, establishing the theorem.

Intuitively, mismatches in each pair occur primarily when Y_i \geq 2, which has probability O(p_i^2), or because of the small adjustment needed to match the marginals at 1, as \mathbb{P}(Y_i = 1) = p_i e^{-p_i} \approx p_i - p_i^2; these discrepancies accumulate additively across the indicators to bound the distance by \sum p_i^2. Note that while Le Cam's original proof relied on different analytic techniques and provided bounds like \|Q - P\| < 2 \sum p_i^2 (with tighter versions involving \min(1, 1/\lambda_n)), the coupling method offers a simple derivation of the \sum p_i^2 bound.
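The coupling can also be simulated directly. The sketch below (an illustrative construction, not Le Cam's original argument; parameter values are arbitrary and SciPy is assumed) couples each pair (X_i, Y_i) through a shared uniform variable via the quantile transform, which in this case achieves the minimal mismatch probability p_i (1 - e^{-p_i}), and then estimates \mathbb{P}(S_n \neq T_n) by Monte Carlo to compare it with the union bound and with \sum p_i^2.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
p = rng.uniform(0.0, 0.1, size=50)   # success probabilities (illustrative)
reps = 100_000

# Quantile (maximal) coupling from a shared uniform U_i:
#   X_i = 1 iff U_i > 1 - p_i          -> Bernoulli(p_i) marginal
#   Y_i = Poisson(p_i) quantile at U_i -> Poisson(p_i) marginal
U = rng.uniform(size=(reps, len(p)))
X = (U > 1 - p).astype(int)
Y = poisson.ppf(U, p).astype(int)

mismatch = (X.sum(axis=1) != Y.sum(axis=1)).mean()   # Monte Carlo P(S_n != T_n)
union_bound = (p * (1 - np.exp(-p))).sum()           # sum of per-pair mismatch probabilities
print(f"P(S_n != T_n) ~ {mismatch:.4f} <= {union_bound:.4f} <= {(p**2).sum():.4f}")
```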

Distance estimation

In the coupling construction, the total variation distance between the distribution of the sum S_n = \sum_{i=1}^n X_i and that of \sum_{i=1}^n Y_i (where the Y_i are independent Poisson random variables with means p_i, so \sum Y_i \sim \mathrm{Po}(\lambda_n) with \lambda_n = \sum p_i) is bounded using the subadditivity of the total variation distance under convolution of independent components: d_{\mathrm{TV}}\left( \mathcal{L}\left(\sum_{i=1}^n X_i\right), \mathcal{L}\left(\sum_{i=1}^n Y_i\right) \right) \leq \sum_{i=1}^n d_{\mathrm{TV}}(\mathcal{L}(X_i), \mathcal{L}(Y_i)). This follows from coupling each pair (X_i, Y_i) independently while preserving the marginal distributions, which allows the error to accumulate additively across components. For each individual pair, the distance d_{\mathrm{TV}}(\mathcal{L}(X_i), \mathcal{L}(Y_i)) is p_i (1 - e^{-p_i}), which is upper-bounded by p_i^2. Aggregating these errors yields the explicit bound d_{\mathrm{TV}}\bigl(\mathcal{L}(S_n), \mathrm{Po}(\lambda_n)\bigr) \leq \sum_{i=1}^n p_i (1 - e^{-p_i}) \leq \sum_{i=1}^n p_i^2, a simple quadratic expression in the probabilities that controls the approximation quality when the p_i are small or sparse. This summation term highlights the theorem's strength in regimes where \sum p_i^2 is much smaller than \lambda_n, ensuring the Poisson approximation is effective even for non-identical Bernoulli variables.
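The per-pair quantity p_i (1 - e^{-p_i}) can be verified directly from the two mass functions. A small sketch (the function name and the cutoff kmax are illustrative; SciPy assumed) compares the exact Bernoulli-versus-Poisson total variation distance with the closed form and with the cruder bound p_i^2:

```python
import numpy as np
from scipy.stats import poisson

def tv_bernoulli_vs_poisson(p, kmax=30):
    """Exact TV distance between Bernoulli(p) and Poisson(p), computed
    from the mass functions on {0,...,kmax} plus the Poisson tail beyond kmax."""
    bern = np.zeros(kmax + 1)
    bern[0], bern[1] = 1 - p, p
    pois = poisson.pmf(np.arange(kmax + 1), p)
    return 0.5 * (np.abs(bern - pois).sum() + poisson.sf(kmax, p))

for p in [0.01, 0.1, 0.3]:
    exact = tv_bernoulli_vs_poisson(p)
    closed_form = p * (1 - np.exp(-p))    # value used in the proof outline
    print(f"p = {p:4.2f}: exact {exact:.6f}, p(1-e^-p) {closed_form:.6f}, p^2 {p*p:.6f}")
```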

Applications and extensions

Approximations in probability theory

Le Cam's theorem facilitates approximations between statistical experiments by quantifying how closely one can mimic another through Markov kernels, enabling the replacement of complex models with simpler ones without significant loss in inferential performance. In asymptotic statistics, sequences of experiments (P_n) and (Q_n) are asymptotically equivalent if their Le Cam distance \Delta(P_n, Q_n) \to 0, implying identical limiting minimax risks for decision problems. A prominent application is in nonparametric inference, where the theorem establishes asymptotic equivalence between density estimation from i.i.d. samples and a Gaussian white noise model of the form dY(t) = \sqrt{f(t)}\, dt + \tfrac{1}{2} n^{-1/2} dW(t), for sufficiently smooth densities f. This equivalence, with deficiency bounded by terms involving the metric entropy of the function class, simplifies the analysis of estimation rates and allows results to be transferred from white noise settings to direct observations. Such approximations are vital in signal processing for denoising and in econometrics for semiparametric models. The theorem also underpins approximations in high-dimensional settings, such as sparse signal recovery, where the deficiency measures the information loss from projection or dimension reduction, ensuring that the projected experiments retain near-optimal testing power.

Connections to other methods

Le Cam's deficiency theorem connects intimately to the theory of contiguity of probability measures, also developed by Le Cam, in which one sequence of measures is contiguous to another if no test can distinguish them asymptotically. The theorem links contiguity to the limiting behavior of likelihood ratios, with the deficiency providing a metric to quantify how closely experiments support the same contiguous alternatives, which is essential for deriving asymptotic distributions under local perturbations. It further ties to local asymptotic normality (LAN), where Le Cam showed that many parametric models locally approximate a Gaussian shift experiment around the true parameter, with the deficiency vanishing as the neighborhood shrinks. This connection enables efficiency bounds for estimators such as maximum likelihood, extending to semiparametric and nonparametric cases via Le Cam's framework. Extensions of the theorem appear in modern areas, such as quantum statistical experiments, where analogs of the deficiency compare quantum channels, preserving the risk characterization for quantum hypothesis testing. In generalized linear models, the framework establishes asymptotic equivalence to Gaussian regression models, facilitating robust inference.