Le Cam's theorem

Le Cam's theorem is a foundational result in statistical decision theory that characterizes the deficiency distance between two statistical experiments in terms of achievable risks: the deficiency satisfies \delta(P_1, P_2) \leq \epsilon if and only if, for every decision rule \rho_2 in P_2 and every loss function L with \|L\|_\infty \leq 1, there exists a decision rule \rho_1 in P_1 such that the risk satisfies R_\theta(P_1, \rho_1, L) \leq R_\theta(P_2, \rho_2, L) + \epsilon for all parameters \theta. The theorem emerged from Lucien Le Cam's work in the 1960s, building on earlier ideas by Blackwell on the comparison of experiments through information loss.

Le Cam formalized statistical experiments as triples (\Omega, \mathcal{T}, \{P_\theta : \theta \in \Theta\}), consisting of a measurable space (\Omega, \mathcal{T}), a parameter space \Theta, and a family of probability measures P_\theta indexed by \theta. This framework allows for a precise quantification of how closely two experiments can approximate each other via randomized decision procedures, known as Markov kernels. Central to the theorem is the Le Cam distance \Delta(P_1, P_2) = \max(\delta(P_1, P_2), \delta(P_2, P_1)), where the deficiency \delta(P_1, P_2) is the smallest worst-case total variation discrepancy achievable when transforming P_1 toward P_2 through a Markov kernel (see the display below). This distance captures the "cost" of transforming one experiment into another, with \Delta(P_1, P_2) = 0 implying the experiments are equivalent in the sense that they yield the same minimal risks for all decision problems.

In asymptotic statistics, sequences of experiments (P_{1,n}) and (P_{2,n}) are asymptotically equivalent if \Delta(P_{1,n}, P_{2,n}) \to 0 as n \to \infty, meaning their inferential properties coincide in the large-sample limit. Le Cam's theorem underpins this equivalence by linking distance to risk bounds, enabling the simplification of complex models, such as replacing nonparametric density estimation with Gaussian white noise experiments, without altering asymptotic minimax risks. Applications span contiguity of measures, local asymptotic normality, and modern nonparametric inference, influencing fields like econometrics and signal processing. The name "Le Cam's theorem" is also commonly attached to Le Cam's Poisson approximation inequality, which bounds the total variation distance between a Poisson binomial distribution and a Poisson distribution with the same mean; that inequality and its refinements are treated in the statement and proof sections below.
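In symbols, under one common normalization of the total variation norm (conventions differ by a factor of two), the deficiency and the Le Cam distance can be written as

\delta(P_1, P_2) = \inf_{T} \sup_{\theta \in \Theta} \bigl\| T P_{1,\theta} - P_{2,\theta} \bigr\|_{\mathrm{TV}}, \qquad \Delta(P_1, P_2) = \max\bigl( \delta(P_1, P_2), \delta(P_2, P_1) \bigr),

where the infimum runs over Markov kernels T mapping observations of P_1 into the sample space of P_2.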

Background concepts

Poisson binomial distribution

The Poisson binomial distribution is the discrete probability distribution of the sum S_n = \sum_{i=1}^n X_i, where the X_i are independent Bernoulli random variables with respective success probabilities p_i \in [0,1]. This setup generalizes the binomial distribution, which arises when all p_i are identical, and forms a foundational model for scenarios involving heterogeneous success probabilities, such as reliability analysis or randomized algorithms. The probability mass function of S_n is \Pr(S_n = k) = \sum_{A \subseteq \{1, \dots, n\},\, |A| = k} \prod_{i \in A} p_i \prod_{j \notin A} (1 - p_j), for integers k = 0, 1, \dots, n, where the sum runs over all subsets A of \{1, \dots, n\} of size k. Exact evaluation of this expression directly from the definition is computationally intensive for large n, as it involves \binom{n}{k} terms, each requiring products over the probabilities. The expected value is \mathbb{E}[S_n] = \sum_{i=1}^n p_i, commonly denoted \lambda_n. The variance is \mathrm{Var}(S_n) = \sum_{i=1}^n p_i (1 - p_i), reflecting the additive nature of independent summands. These moments provide essential summaries of the distribution's location and spread. The distribution derives its name from the mathematician Siméon Denis Poisson, who introduced the concept in 1837 while considering sums of trials with varying probabilities, thereby extending the binomial framework to heterogeneous cases.
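As a concrete illustration, the following Python sketch (the helper name poisson_binomial_pmf and the example probabilities are illustrative choices, not from any standard library) computes the probability mass function by iterated convolution in O(n^2) time rather than summing over \binom{n}{k} subsets, and checks the mean and variance formulas against it.

```python
import numpy as np

def poisson_binomial_pmf(p):
    """PMF of S_n = X_1 + ... + X_n for independent Bernoulli(p_i),
    built by convolving in one Bernoulli factor at a time (O(n^2))."""
    pmf = np.array([1.0])               # empty sum: P(S_0 = 0) = 1
    for pi in p:
        new = np.zeros(len(pmf) + 1)
        new[:-1] += pmf * (1 - pi)      # X_i = 0 leaves the count unchanged
        new[1:]  += pmf * pi            # X_i = 1 shifts the count up by one
        pmf = new
    return pmf

p = [0.1, 0.3, 0.05, 0.2]               # heterogeneous success probabilities
pmf = poisson_binomial_pmf(p)
mean = sum(p)                            # E[S_n] = sum of p_i  (= lambda_n)
var = sum(pi * (1 - pi) for pi in p)     # Var(S_n) = sum of p_i (1 - p_i)

ks = np.arange(len(pmf))
assert np.isclose(pmf.sum(), 1.0)
assert np.isclose((ks * pmf).sum(), mean)
assert np.isclose((ks**2 * pmf).sum() - mean**2, var)
print(pmf)
```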

Total variation distance

The total variation distance, often denoted d_{\mathrm{TV}}(P, Q), serves as a fundamental metric for comparing two probability measures P and Q defined on the same measurable space. For discrete probability measures on a countable space, it is defined as d_{\mathrm{TV}}(P, Q) = \frac{1}{2} \sum_{x} |P(\{x\}) - Q(\{x\})|, where the sum is over all points x in the support. Equivalently, it can be expressed as d_{\mathrm{TV}}(P, Q) = \sup_{A} |P(A) - Q(A)|, with the supremum taken over all measurable sets A. This formulation highlights its interpretation as the maximum possible difference in the probabilities assigned by P and Q to any event.

As a metric on the space of probability measures, the distance satisfies non-negativity, symmetry, and the triangle inequality, with d_{\mathrm{TV}}(P, Q) = 0 if and only if P = Q. It is bounded above by 1, since probabilities lie between 0 and 1, and in the discrete case it equals half the \ell^1 distance between the probability mass functions. This distance is particularly valuable in approximation theorems, such as Le Cam's, because it provides an upper bound on differences of expectations: for any bounded function f with |f| \leq 1, |\mathbb{E}_P[f] - \mathbb{E}_Q[f]| \leq 2\, d_{\mathrm{TV}}(P, Q), thereby quantifying the error in approximating expectations under one distribution by those under another.

To illustrate, consider two Bernoulli distributions: let P be Bernoulli(p = 0.3) and Q be Bernoulli(q = 0.5). The probability mass functions differ at 0 and 1, with |P(1) - Q(1)| = |0.3 - 0.5| = 0.2 and |P(0) - Q(0)| = 0.2. Thus, d_{\mathrm{TV}}(P, Q) = \frac{1}{2} (0.2 + 0.2) = 0.2. This value represents the largest discrepancy in event probabilities, such as the probability of success differing by 0.2. In the context of Le Cam's theorem, the total variation distance measures the quality of approximating the Poisson binomial distribution by a Poisson distribution.
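A minimal Python sketch of this computation (the function name tv_distance is an illustrative choice) reproduces the Bernoulli example above:

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions,
    given as probability vectors over the same (aligned) support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

# Bernoulli(0.3) vs Bernoulli(0.5), written as [P(0), P(1)]
P = [0.7, 0.3]
Q = [0.5, 0.5]
print(tv_distance(P, Q))   # 0.2, matching the worked example

# the supremum definition is attained here at the event A = {1}
print(abs(P[1] - Q[1]))    # 0.2
```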

Statement of the theorem

Formal statement

Le Cam's theorem asserts that if X_1, \dots, X_n are independent Bernoulli random variables with success probabilities p_i satisfying 0 \leq p_i \leq 1, and if S_n = \sum_{i=1}^n X_i with \lambda_n = \sum_{i=1}^n p_i, then the total variation distance between the law of S_n (the Poisson binomial distribution) and the Poisson distribution with parameter \lambda_n satisfies d_{\mathrm{TV}}\bigl( \mathcal{L}(S_n), \mathrm{Po}(\lambda_n) \bigr) \leq \sum_{i=1}^n p_i^2, where d_{\mathrm{TV}}(\mu, \nu) = \sup_{A} \bigl| \mu(A) - \nu(A) \bigr|. This inequality holds for arbitrary finite n and any choice of the p_i in [0,1]. The right-hand side measures the approximation error, which is controlled by the sum of the squared success probabilities and becomes small whenever the p_i are sufficiently sparse or close to zero. In the special case where p_i = \lambda / n for all i (so that \mathcal{L}(S_n) is binomial with parameters n and \lambda / n), the bound simplifies to \lambda^2 / n, thereby quantifying the rate of convergence in the classical Poisson limit theorem (the law of rare events).
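The bound is easy to check numerically. The sketch below (assuming NumPy and SciPy are available; the PMF helper mirrors the one shown earlier and the parameter choices are arbitrary) computes the exact total variation distance for a heterogeneous example and compares it with \sum p_i^2.

```python
import numpy as np
from scipy.stats import poisson

def poisson_binomial_pmf(p):
    """Exact PMF of the Poisson binomial distribution via iterated convolution."""
    pmf = np.array([1.0])
    for pi in p:
        new = np.zeros(len(pmf) + 1)
        new[:-1] += pmf * (1 - pi)
        new[1:]  += pmf * pi
        pmf = new
    return pmf

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 0.1, size=200)        # small, heterogeneous success probabilities
lam = p.sum()

pb = poisson_binomial_pmf(p)                        # law of S_n on {0, ..., n}
po = poisson.pmf(np.arange(len(pb)), lam)           # Poisson(lambda_n) on the same range

# TV distance: compare on {0,...,n} and add the Poisson mass beyond n,
# where the Poisson binomial law puts no mass at all.
tv = 0.5 * (np.abs(pb - po).sum() + poisson.sf(len(pb) - 1, lam))
print(f"TV distance = {tv:.5f},  Le Cam bound sum p_i^2 = {(p**2).sum():.5f}")
```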

Improved bounds

Subsequent refinements of Le Cam's original bound provide sharper estimates for the total variation distance between the Poisson binomial distribution and the approximating Poisson distribution. A key improvement replaces the constant factor in the original inequality with \left(1 \wedge \frac{1}{\lambda_n}\right), yielding d_{\mathrm{TV}}\left( \mathcal{L}(S_n), \mathrm{Po}(\lambda_n) \right) \leq \left(1 \wedge \frac{1}{\lambda_n}\right) \sum_{i=1}^n p_i^2, where \mathcal{L}(S_n) denotes the law of the sum S_n = \sum_{i=1}^n X_i. This refinement, developed in subsequent work including contributions by Barbour and collaborators, accounts for the dependence of the error on the mean parameter \lambda_n = \sum_{i=1}^n p_i, tightening the bound particularly when \lambda_n is large. The factor 1 \wedge 1/\lambda_n leaves the bound at \sum p_i^2 when \lambda_n \leq 1, matching the scale of the original estimate, while scaling it down to approximately \sum p_i^2 / \lambda_n for large \lambda_n, reflecting the reduced relative error in that regime. This adjustment arises from more precise coupling arguments that exploit the structure of the Poisson process.

In the asymptotic regime where \max_i p_i \to 0 and \lambda_n remains fixed or grows moderately, the total variation distance is of order O\left(\sum p_i^2\right), with the refined bound capturing the leading term. Moreover, equality in this order can be approached in specific cases, such as when the p_i are equal, demonstrating the sharpness of the estimate under homogeneity. An extension of these bounds applies to the multinomial setting, where one considers the vector (X_1, \dots, X_m) of sums over disjoint categories with the fixed total sum constraint \sum_{j=1}^m X_j = S_n. Here, the joint distribution is approximated by independent Poisson random variables with parameters \lambda_j = \sum_{i \in \mathcal{I}_j} p_i, and the bound remains analogous for the scalar projections, scaling with the \sum p_i^2 terms within each category.
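A short numerical comparison (illustrative parameter choices only) shows how the refined factor behaves across regimes of \lambda_n: for small \lambda_n the two bounds coincide, while for large \lambda_n the original bound can exceed the trivial bound of 1 and the refined one stays informative.

```python
import numpy as np

# Compare Le Cam's original bound with the refined (1 ∧ 1/lambda_n) version
# for identical p_i = 0.02 and increasing n (hence increasing lambda_n).
for n in [50, 500, 5000]:
    p = np.full(n, 0.02)
    lam = p.sum()
    original = (p**2).sum()
    refined = min(1.0, 1.0 / lam) * (p**2).sum()
    print(f"n = {n:5d}, lambda_n = {lam:6.1f}: "
          f"original bound {original:.4f}, refined bound {refined:.6f}")
```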

Historical development

Lucien Le Cam's contributions

Lucien Le Cam (1924–2000) was a French-born mathematical statistician, best known for his foundational contributions to asymptotic theory in statistics. Born in France, he earned his doctorate from the University of California, Berkeley, and spent much of his career as a professor there, becoming Professor Emeritus of Mathematics and Statistics. Le Cam's research emphasized limit theorems and approximation techniques, influencing modern mathematical statistics and probability.

Le Cam's theorem on the deficiency distance between statistical experiments was formalized in his 1964 paper "Sufficiency and Approximate Sufficiency", published in the Annals of Mathematical Statistics. In this work, he introduced the deficiency measure \delta(P_1, P_2) to quantify how well one experiment P_1 can approximate another P_2 in terms of minimal expected risks for decision problems. The theorem establishes that the deficiency equals the supremum, over decision problems with loss bounded by one, of the minimal excess risk incurred when procedures in the second experiment are mimicked in the first, providing a precise metric for comparing statistical models. This contribution built on Le Cam's earlier investigations into asymptotic properties of statistical procedures, including his introduction of contiguity of probability measures, which laid groundwork for understanding convergence in statistical experiments. The deficiency concept addressed the need for a decision-theoretic framework to evaluate approximate sufficiency and the "information loss" in reducing data, revolutionizing the comparison of complex models in asymptotic statistics. Le Cam further developed these ideas in subsequent works, such as his 1969 work linking likelihood ratio convergence to deficiency, solidifying the theorem's role in local asymptotic normality and modern inference theory.

Blackwell's comparison of experiments

The foundations for comparing statistical experiments trace back to David Blackwell's 1951 paper "Comparison of Experiments" in the Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. Blackwell defined one experiment as more informative than another if, for every decision problem, the minimal risk achievable with the former is no larger than with the latter, using randomized decision rules (Markov kernels) to formalize the comparison. This work established conditions based on the existence of stochastic transformations, or channels, between experiments. Earlier concepts of sufficiency, introduced by Ronald Fisher in 1920 and refined during the 1930s, provided the backdrop for evaluating data reduction without information loss, but lacked a quantitative measure of approximation. Blackwell's framework extended these ideas to a general decision-theoretic setting, motivating Le Cam's later quantification via the deficiency distance. These developments filled gaps in the prior literature, which focused on exact sufficiency rather than the approximate comparisons essential for asymptotic analysis, paving the way for Le Cam's theorem as a key advancement in statistical decision theory.

Proof outline

Coupling approach

The coupling approach provides an intuitive probabilistic proof of Le Cam's theorem by constructing auxiliary random variables to compare the distribution of S_n = \sum_{i=1}^n X_i with a Poisson random variable of mean \lambda_n = \sum_{i=1}^n p_i. Independent Poisson random variables Y_i \sim \mathrm{Po}(p_i) are introduced for i = 1, \dots, n, so that T_n = \sum_{i=1}^n Y_i \sim \mathrm{Po}(\lambda_n). Each pair (X_i, Y_i) is then coupled, independently across i, so as to preserve the marginal distributions \mathrm{Bern}(p_i) for X_i and \mathrm{Po}(p_i) for Y_i while minimizing the mismatch probability \mathbb{P}(X_i \neq Y_i). By the coupling characterization of total variation, the distance between \mathrm{Bern}(p_i) and \mathrm{Po}(p_i) equals this minimal mismatch probability, given explicitly by d_{\mathrm{TV}}(\mathrm{Bern}(p_i), \mathrm{Po}(p_i)) = p_i (1 - e^{-p_i}). This quantity satisfies p_i (1 - e^{-p_i}) \leq p_i^2, since 1 - e^{-p_i} \leq p_i follows from the convexity of the exponential function.

Under the product coupling of all pairs, the event \{S_n \neq T_n\} is contained in the union of the mismatch events \{X_i \neq Y_i\} for i = 1, \dots, n. Thus, by the union bound, \mathbb{P}(S_n \neq T_n) \leq \sum_{i=1}^n \mathbb{P}(X_i \neq Y_i) = \sum_{i=1}^n d_{\mathrm{TV}}(\mathrm{Bern}(p_i), \mathrm{Po}(p_i)) \leq \sum_{i=1}^n p_i^2. The total variation distance then inherits this bound, since for any coupling (U, V) of two random variables with marginal laws \mu and \nu, d_{\mathrm{TV}}(\mu, \nu) \leq \mathbb{P}(U \neq V); applied to the constructed joint distribution of (S_n, T_n), this yields d_{\mathrm{TV}}(\mathcal{L}(S_n), \mathrm{Po}(\lambda_n)) \leq \sum_{i=1}^n p_i^2, establishing the theorem.

Intuitively, mismatches in each pair occur primarily when Y_i \geq 2, which has probability O(p_i^2), or because of the small adjustment needed to match the marginals at 1, as \mathbb{P}(Y_i = 1) = p_i e^{-p_i} \approx p_i - p_i^2; these discrepancies accumulate additively across the indicators to bound the distance by \sum p_i^2. Note that while Le Cam's original proof relied on different analytic techniques and provided bounds like \|Q - P\| < 2 \sum p_i^2 (with tighter versions involving \min(1, 1/\lambda_n)), the coupling method offers a simple derivation of the \sum p_i^2 bound.
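The coupling can also be simulated directly. The sketch below (an illustrative construction, not Le Cam's original argument; parameter values are arbitrary and SciPy is assumed) couples each pair (X_i, Y_i) through a shared uniform variable via the quantile transform, which in this case achieves the minimal mismatch probability p_i (1 - e^{-p_i}), and then estimates \mathbb{P}(S_n \neq T_n) by Monte Carlo to compare it with the union bound and with \sum p_i^2.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
p = rng.uniform(0.0, 0.1, size=50)   # success probabilities (illustrative)
reps = 100_000

# Quantile (maximal) coupling from a shared uniform U_i:
#   X_i = 1 iff U_i > 1 - p_i          -> Bernoulli(p_i) marginal
#   Y_i = Poisson(p_i) quantile at U_i -> Poisson(p_i) marginal
U = rng.uniform(size=(reps, len(p)))
X = (U > 1 - p).astype(int)
Y = poisson.ppf(U, p).astype(int)

mismatch = (X.sum(axis=1) != Y.sum(axis=1)).mean()   # Monte Carlo P(S_n != T_n)
union_bound = (p * (1 - np.exp(-p))).sum()           # sum of per-pair mismatch probabilities
print(f"P(S_n != T_n) ~ {mismatch:.4f} <= {union_bound:.4f} <= {(p**2).sum():.4f}")
```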

Distance estimation

In the coupling construction, the total variation distance between the distribution of the sum S_n = \sum_{i=1}^n X_i and that of \sum_{i=1}^n Y_i (where the Y_i are independent Poisson random variables with means p_i, so \sum Y_i \sim \mathrm{Po}(\lambda_n) with \lambda_n = \sum p_i) is bounded using the subadditivity of the total variation distance under convolution of independent components: d_{\mathrm{TV}}\left( \mathcal{L}\left(\sum_{i=1}^n X_i\right), \mathcal{L}\left(\sum_{i=1}^n Y_i\right) \right) \leq \sum_{i=1}^n d_{\mathrm{TV}}(\mathcal{L}(X_i), \mathcal{L}(Y_i)). This follows from coupling each pair (X_i, Y_i) independently while preserving the marginal distributions, which allows the error to accumulate additively across components. For each individual pair, the distance d_{\mathrm{TV}}(\mathcal{L}(X_i), \mathcal{L}(Y_i)) is p_i (1 - e^{-p_i}), which is upper-bounded by p_i^2. Aggregating these errors yields the explicit bound d_{\mathrm{TV}}\bigl(\mathcal{L}(S_n), \mathrm{Po}(\lambda_n)\bigr) \leq \sum_{i=1}^n p_i (1 - e^{-p_i}) \leq \sum_{i=1}^n p_i^2, a simple quadratic expression in the probabilities that controls the approximation quality when the p_i are small or sparse. This summation term highlights the theorem's strength in regimes where \sum p_i^2 is much smaller than \lambda_n, ensuring the Poisson approximation is effective even for non-identical Bernoulli variables.
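The per-pair quantity p_i (1 - e^{-p_i}) can be verified directly from the two mass functions. A small sketch (the function name and the cutoff kmax are illustrative; SciPy assumed) compares the exact Bernoulli-versus-Poisson total variation distance with the closed form and with the cruder bound p_i^2:

```python
import numpy as np
from scipy.stats import poisson

def tv_bernoulli_vs_poisson(p, kmax=30):
    """Exact TV distance between Bernoulli(p) and Poisson(p), computed
    from the mass functions on {0,...,kmax} plus the Poisson tail beyond kmax."""
    bern = np.zeros(kmax + 1)
    bern[0], bern[1] = 1 - p, p
    pois = poisson.pmf(np.arange(kmax + 1), p)
    return 0.5 * (np.abs(bern - pois).sum() + poisson.sf(kmax, p))

for p in [0.01, 0.1, 0.3]:
    exact = tv_bernoulli_vs_poisson(p)
    closed_form = p * (1 - np.exp(-p))    # value used in the proof outline
    print(f"p = {p:4.2f}: exact {exact:.6f}, p(1-e^-p) {closed_form:.6f}, p^2 {p*p:.6f}")
```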

Applications and extensions

Approximations in probability theory

Le Cam's theorem facilitates approximations between statistical experiments by quantifying how closely one can mimic another through Markov kernels, enabling the replacement of complex models with simpler ones without significant loss in inferential performance. In asymptotic statistics, sequences of experiments (P_n) and (Q_n) are asymptotically equivalent if their Le Cam distance \Delta(P_n, Q_n) \to 0, implying identical limiting minimax risks for decision problems. A prominent application is in nonparametric inference, where the theorem establishes asymptotic equivalence between density estimation from i.i.d. samples and a Gaussian white noise model of the form dY(t) = \sqrt{f(t)}\, dt + \tfrac{1}{2} n^{-1/2} dW(t), for sufficiently smooth densities f. This equivalence, with deficiency bounded by terms involving the metric entropy of the function class, simplifies the analysis of estimation rates and allows results to be transferred from white noise settings to direct observations. Such approximations are vital in signal processing for denoising and in econometrics for semiparametric models. The theorem also underpins approximations in high-dimensional settings, such as sparse signal recovery, where the deficiency measures the information loss from projection or dimension reduction, ensuring that the projected experiments retain near-optimal testing power.

Connections to other methods

Le Cam's deficiency theorem connects intimately to the theory of contiguity of probability measures, also developed by Le Cam, in which one sequence of measures is contiguous to another if no test can distinguish them asymptotically. The theorem links contiguity to the limiting behavior of likelihood ratios, with the deficiency providing a metric to quantify how closely experiments support the same contiguous alternatives, which is essential for deriving asymptotic distributions under local perturbations. It further ties to local asymptotic normality (LAN), where Le Cam showed that many parametric models locally approximate a Gaussian shift experiment around the true parameter, with the deficiency vanishing as the neighborhood shrinks. This connection enables efficiency bounds for estimators such as maximum likelihood, extending to semiparametric and nonparametric cases via Le Cam's framework. Extensions of the theorem appear in modern areas, such as quantum statistical experiments, where analogs of the deficiency compare quantum channels, preserving the risk characterization for quantum hypothesis testing. In generalized linear models, the framework establishes asymptotic equivalence to Gaussian regression models, facilitating robust inference.