Hoeffding's inequality

Hoeffding's inequality is a concentration inequality in probability theory that bounds the probability of large deviations of the sum (or average) of independent, bounded random variables from its expected value, providing exponential tail bounds without requiring knowledge of the variables' distributions beyond their bounds. It was established by statistician Wassily Hoeffding in his 1963 paper "Probability Inequalities for Sums of Bounded Random Variables," published in the Journal of the American Statistical Association. The result is notable for its simplicity and its applicability to non-identically distributed variables within specified ranges.

In one common form, the inequality considers independent random variables X_1, \dots, X_n such that a_i \leq X_i \leq b_i for each i, with expected value \mu = \mathbb{E}[\sum_{i=1}^n X_i]. It asserts that for any t > 0,

P\left( \left| \sum_{i=1}^n X_i - \mu \right| \geq t \right) \leq 2 \exp\left( -\frac{2 t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).

When the variables are identically bounded, say a \leq X_i \leq b for all i, the bound simplifies to

P\left( \left| \frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] \right| \geq t \right) \leq 2 \exp\left( -\frac{2 n t^2}{(b - a)^2} \right).

These bounds hold because bounded random variables are sub-Gaussian, as shown via moment-generating functions in the original proof.

Hoeffding's inequality is foundational in statistical learning theory, where it underpins analyses of generalization error, empirical risk minimization, and the convergence of sample averages in algorithms for classification and regression. For instance, it is used to derive probably approximately correct (PAC) learning guarantees, ensuring that with high probability a model's performance on unseen data closely matches its training performance, assuming bounded loss functions. Beyond machine learning, it yields confidence intervals for parameters in bounded settings, such as binomial proportions, and supports the weak law of large numbers by quantifying deviation rates. The inequality has inspired numerous extensions, including versions for martingales (Azuma-Hoeffding) and dependent variables (e.g., Markov chains), extending its utility to sequential decision-making and time-series analysis. Its sharpness and distribution-free nature make it a preferred tool over weaker bounds like Chebyshev's, though it can be loose when variances are small compared to ranges.
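As a quick illustration of the general form, the sketch below computes the two-sided Hoeffding bound from the interval endpoints; the function name and inputs are illustrative, not from any particular library.

```python
import math

def hoeffding_bound(t, intervals):
    """Two-sided Hoeffding bound P(|sum X_i - mu| >= t) for independent
    X_i with X_i in [a_i, b_i]; `intervals` is a list of (a_i, b_i)."""
    v = sum((b - a) ** 2 for a, b in intervals)  # sum of squared range lengths
    return 2 * math.exp(-2 * t ** 2 / v)

# Example: 100 variables each bounded in [0, 1], deviation t = 10.
print(hoeffding_bound(10, [(0.0, 1.0)] * 100))  # 2*exp(-2) ~ 0.2707
```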

Statement

For Independent Bounded Random Variables

Hoeffding's inequality establishes a concentration bound for the sum of independent random variables that are bounded but not necessarily identically distributed. Consider independent random variables X_1, \dots, X_n such that \mathbb{E}[X_i] = 0 and a_i \leq X_i \leq b_i, where a_i < b_i are real numbers for each i = 1, \dots, n. The inequality states that for any t > 0,

\mathbb{P}\left( \left| \sum_{i=1}^n X_i \right| \geq t \right) \leq 2 \exp\left( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).

This result, originally derived by Wassily Hoeffding, applies to the centered sum and holds regardless of the specific distributions beyond independence and boundedness.

A key feature of the bound is its dependence solely on the lengths of the intervals (b_i - a_i), which cap the possible deviations of each X_i from its mean, rather than on the variances \mathrm{Var}(X_i). This makes the inequality robust in settings where variance estimates are unavailable or the variables exhibit high variability within their bounds. Unlike bounds relying on second moments, such as Chebyshev's inequality, Hoeffding's provides exponential decay in the tail probability without using variance information or assuming identical distributions. Intuitively, the bound quantifies how the collective range of the variables limits the likelihood of large deviations in their sum, ensuring sub-Gaussian concentration for the total. This distribution-free property has made it a foundational tool in probability theory for analyzing sums of random variables in diverse applications, from statistical estimation to the analysis of randomized algorithms.
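The following sketch is a Monte Carlo check of this statement for a heterogeneous collection of bounded variables; the choice of uniform sampling on each interval and all names are illustrative assumptions.

```python
import math
import random

random.seed(0)
intervals = [(-1.0, 1.0), (0.0, 3.0), (-2.0, 0.5)] * 10  # 30 heterogeneous ranges
v = sum((b - a) ** 2 for a, b in intervals)  # sum of squared range lengths
t = 15.0

trials = 100_000
hits = 0
for _ in range(trials):
    # Centered sum: draw each X_i uniformly on [a_i, b_i], subtract its mean.
    s = sum(random.uniform(a, b) - (a + b) / 2 for a, b in intervals)
    hits += abs(s) >= t

print("empirical frequency:", hits / trials)          # far below the bound
print("Hoeffding bound:    ", 2 * math.exp(-2 * t ** 2 / v))  # ~0.193
```

The empirical frequency is much smaller than the bound here, reflecting that Hoeffding's inequality is conservative when the actual variances are small relative to the ranges.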

For Bernoulli Random Variables

Hoeffding's inequality specializes naturally to independent Bernoulli random variables, which take values in the set \{0, 1\}. Consider independent random variables X_1, \dots, X_n where each X_i \sim \text{Bernoulli}(p_i), so X_i \in \{0, 1\} with \mathbb{E}[X_i] = p_i for 0 \leq p_i \leq 1. Define the sum S_n = \sum_{i=1}^n X_i and its mean \mu = \mathbb{E}[S_n] = \sum_{i=1}^n p_i. This setup arises frequently in modeling binary outcomes, such as success/failure trials in statistical experiments.

For this configuration, the inequality provides a tight concentration bound around the mean. Specifically, for any t \geq 0,

P(|S_n - \mu| \geq t) \leq 2 \exp\left(-\frac{2t^2}{n}\right).

This two-sided bound follows from the general Hoeffding inequality for bounded variables together with a union bound over the two tails. Each X_i is bounded in [0, 1], so the range length b_i - a_i = 1 for all i, and the bound holds regardless of the individual p_i, making it distribution-free within the Bernoulli class. The simplification arises directly from the general bounded case, where the exponent depends on the sum of squared range lengths: \sum_{i=1}^n (b_i - a_i)^2 = \sum_{i=1}^n 1^2 = n. Substituting this into the general exponent -\frac{2t^2}{\sum (b_i - a_i)^2} yields the n-scaled denominator, giving the cleaner form \exp(-2t^2 / n) for the one-sided tail, doubled for the two-sided probability. This uniform scaling with n highlights the inequality's utility for large sample sizes in binary summation problems.

A classic example is the sum of fair coin flips, where each p_i = 1/2, so \mu = n/2. The bound then states that

P\left(\left|S_n - \frac{n}{2}\right| \geq t\right) \leq 2 \exp\left(-\frac{2t^2}{n}\right),

quantifying the low probability of significant deviation from the expected number of heads in n tosses. For instance, with n = 100 and t = 10, the probability is at most 2 \exp(-2) \approx 0.27, illustrating rapid concentration as n grows.
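To make the coin-flip example concrete, this short sketch (plain standard library, illustrative only) compares the Hoeffding bound 2\exp(-2t^2/n) against the exact binomial tail probability for n = 100, t = 10.

```python
import math

n, t = 100, 10
mu = n / 2

# Exact two-sided tail P(|S_n - n/2| >= t) for S_n ~ Binomial(n, 1/2).
exact = sum(math.comb(n, k) for k in range(n + 1) if abs(k - mu) >= t) / 2 ** n

bound = 2 * math.exp(-2 * t ** 2 / n)
print(f"exact tail:      {exact:.4f}")  # ~0.057
print(f"Hoeffding bound: {bound:.4f}")  # ~0.271
```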

Generalizations

For Sub-Gaussian Random Variables

A zero-mean random variable X is \sigma^2-sub-Gaussian if its moment generating function satisfies \mathbb{E}[\exp(\lambda X)] \leq \exp(\lambda^2 \sigma^2 / 2) for all \lambda \in \mathbb{R}. This definition captures distributions with tails decaying at least as fast as those of a Gaussian with variance \sigma^2, using \sigma^2 as a variance proxy that controls the concentration behavior. For independent \sigma_i^2-sub-Gaussian random variables X_1, \dots, X_n with zero means, Hoeffding's inequality extends to bound the tails of their sum:

\mathbb{P}\left( \left| \sum_{i=1}^n X_i \right| \geq t \right) \leq 2 \exp\left( -\frac{t^2}{2 \sum_{i=1}^n \sigma_i^2} \right)

for all t \geq 0. This form mirrors the original Hoeffding bound but replaces range-based parameters with the more flexible sub-Gaussian variance proxies, enabling tighter control when the actual tail decay is lighter than the worst-case bounded scenario.

Bounded random variables fit naturally into this framework: a zero-mean random variable X_i taking values in [a_i, b_i] almost surely is \left( (b_i - a_i)/2 \right)^2-sub-Gaussian. Substituting this proxy into the sub-Gaussian inequality recovers the classical Hoeffding bound for bounded variables, while tightening the estimate whenever the actual variance or tail behavior is small relative to the full range (b_i - a_i). The sub-Gaussian extension is particularly advantageous for unbounded random variables: a normal distribution \mathcal{N}(0, \sigma^2) is precisely \sigma^2-sub-Gaussian, so the variance proxy matches the true variance exactly. This allows Hoeffding-style concentration to apply beyond explicit bounds, facilitating analysis in settings where variables have unbounded support but controlled tails.
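As a sanity check on the Gaussian case, the sketch below compares the sub-Gaussian Hoeffding bound for a sum of independent \mathcal{N}(0, \sigma_i^2) variables with the exact Gaussian tail; the variance values are illustrative.

```python
import math

sigmas2 = [1.0, 4.0, 0.25] * 5  # variance proxies sigma_i^2 (exact for Gaussians)
total = sum(sigmas2)            # the sum is distributed as N(0, total)
t = 12.0

bound = 2 * math.exp(-t ** 2 / (2 * total))
exact = math.erfc(t / math.sqrt(2 * total))  # P(|N(0, total)| >= t)
print(f"sub-Gaussian bound: {bound:.4f}")  # ~0.129
print(f"exact tail:         {exact:.4f}")  # ~0.019
```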

For Dependent Processes

Extensions of Hoeffding's inequality to dependent processes, particularly those arising from Markov chains, address scenarios where random variables exhibit temporal dependencies, such as sequential data or stochastic processes. In this context, the setup typically involves sums of the form \sum_{i=1}^n f(X_i), where \{X_i\}_{i=1}^n is a Markov chain and f is a bounded function on the state space, often with |f(x)| \leq b for some constant b. Because the terms are dependent, concentration bounds must account for the chain's mixing properties rather than assume independence.

A key result establishes a Hoeffding-type bound for general-state-space Markov chains that are not necessarily reversible. Specifically, for a stationary Markov chain with invariant measure \pi and absolute spectral gap 1 - \lambda > 0, and a bounded function f: \mathcal{X} \to [a, b],

P_\pi \left( \left| \sum_{i=1}^n f(X_i) - n \pi(f) \right| \geq t \right) \leq 2 \exp\left( -\frac{1-\lambda}{1+\lambda} \cdot \frac{2 t^2}{n (b - a)^2} \right),

where the factor \frac{1+\lambda}{1-\lambda} acts as a variance proxy that inflates the bound due to dependence, with \lambda capturing the mixing rate. This extends the classical case by incorporating the chain's mixing behavior via the spectral gap, enabling applications in statistical learning where data points are Markov-dependent.

A more recent development (2025) extends this to continuous-time Markov chains using martingale methods combined with spectral properties of the generator. For an irreducible, positive recurrent continuous-time Markov chain on a countable state space E with invariant distribution \pi, generator Q, and \mathcal{L}^2(\pi)-spectral gap \lambda(Q) > 0, the bound applies to the time average: for a bounded function g: E \to [a, b],

\mathbb{P}_{\pi}\left( \left| \frac{1}{t} \int_0^t g(X_s) \, ds - \pi(g) \right| \geq \varepsilon \right) \leq 2 \exp\left\{ -\frac{\lambda(Q) t \varepsilon^2}{(b-a)^2} \right\}.

This result yields concentration for integrals over continuous paths, such as time averages in continuous-time MCMC, by leveraging the generator's spectral properties.

These bounds for dependent processes are inherently looser than their independent counterparts, as the dependence introduces a multiplicative factor (e.g., involving the spectral gap) that slows the decay rate, reflecting slower mixing. They typically require assumptions like irreducibility, positive recurrence, or stationarity to ensure a spectral gap exists, which limits their usefulness for non-ergodic or slowly mixing chains.
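The sketch below illustrates the discrete-time bound on a two-state chain, where the second eigenvalue of the transition matrix is available in closed form; the chain, function range, and parameter choices are all illustrative assumptions.

```python
import math

# Two-state chain with transition matrix [[1-p, p], [q, 1-q]]:
# eigenvalues are 1 and 1 - p - q, so lambda = |1 - p - q|.
p, q = 0.3, 0.2
lam = abs(1 - p - q)  # absolute spectral gap is 1 - lam = 0.5
n = 1000
a, b = 0.0, 1.0       # f takes values in [a, b]
t = 100.0

markov = 2 * math.exp(-((1 - lam) / (1 + lam)) * 2 * t ** 2 / (n * (b - a) ** 2))
iid = 2 * math.exp(-2 * t ** 2 / (n * (b - a) ** 2))
print(f"Markov bound: {markov:.3g}")  # looser: dependence inflates the bound
print(f"i.i.d. bound: {iid:.3g}")
```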

Proof

Of the Basic Inequality

The proof of Hoeffding's inequality for the sum of independent bounded random variables begins by assuming that X_1, \dots, X_n are independent random variables with E[X_i] = 0 and a_i \leq X_i \leq b_i for each i, where a_i < b_i. If the original variables do not have zero mean, they can be centered by subtracting their expectations, which preserves the range lengths b_i - a_i. The goal is to bound P\left(\sum_{i=1}^n X_i \geq t\right) for t > 0.

A key ingredient is Hoeffding's lemma, which provides an upper bound on the moment generating function of a zero-mean bounded random variable. Specifically, for such an X_i,

\log \mathbb{E}\left[\exp(\lambda X_i)\right] \leq \frac{\lambda^2 (b_i - a_i)^2}{8}

for all \lambda \in \mathbb{R}. This bound follows from the convexity of the exponential function and a Taylor expansion of the cumulant generating function.

Applying Markov's inequality to the exponential,

P\left(\sum_{i=1}^n X_i \geq t\right) = P\left(\exp\left(\lambda \sum_{i=1}^n X_i\right) \geq e^{\lambda t}\right) \leq \frac{\mathbb{E}\left[\exp\left(\lambda \sum_{i=1}^n X_i\right)\right]}{e^{\lambda t}}

for any \lambda > 0. By independence, the moment generating function factors as \mathbb{E}\left[\exp\left(\lambda \sum_{i=1}^n X_i\right)\right] = \prod_{i=1}^n \mathbb{E}\left[\exp(\lambda X_i)\right], so the bound becomes

P\left(\sum_{i=1}^n X_i \geq t\right) \leq \frac{\prod_{i=1}^n \mathbb{E}\left[\exp(\lambda X_i)\right]}{e^{\lambda t}} = \exp\left( \sum_{i=1}^n \log \mathbb{E}\left[\exp(\lambda X_i)\right] - \lambda t \right).

Substituting the bound from Hoeffding's lemma yields

P\left(\sum_{i=1}^n X_i \geq t\right) \leq \exp\left( \frac{\lambda^2}{8} \sum_{i=1}^n (b_i - a_i)^2 - \lambda t \right).

To tighten the bound, optimize over \lambda > 0 by minimizing the exponent \frac{\lambda^2}{8} \sum_{i=1}^n (b_i - a_i)^2 - \lambda t. The minimizer is \lambda = \frac{4t}{\sum_{i=1}^n (b_i - a_i)^2}, which gives

P\left(\sum_{i=1}^n X_i \geq t\right) \leq \exp\left( -\frac{2 t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).

This is the one-sided bound. For the two-sided version, note that the argument is symmetric: replacing X_i with -X_i (which has the same range length) yields the same bound for P\left(\sum_{i=1}^n X_i \leq -t\right). Thus, by the union bound,

P\left( \left| \sum_{i=1}^n X_i \right| \geq t \right) \leq 2 \exp\left( -\frac{2 t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).

As a simple illustration, for independent centered Bernoulli random variables (taking values in \{0,1\} before centering, so each range length is 1), the bound specializes to P\left( \left| \sum_{i=1}^n X_i \right| \geq t \right) \leq 2 \exp\left( -2 t^2 / n \right).
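A small numeric check of the optimization step: the sketch below scans \lambda over a grid and confirms the closed-form minimizer \lambda^* = 4t / \sum_i (b_i - a_i)^2 of the exponent; the values of V and t are illustrative.

```python
# Verify numerically that lambda* = 4t / V minimizes the Chernoff exponent
# g(lambda) = lambda^2 * V / 8 - lambda * t, where V = sum (b_i - a_i)^2.
V, t = 25.0, 10.0

def exponent(lam):
    return lam ** 2 * V / 8 - lam * t

grid = [i * 0.001 for i in range(1, 5000)]
lam_grid = min(grid, key=exponent)
lam_star = 4 * t / V
print(f"grid minimizer:   {lam_grid:.3f}")            # ~1.600
print(f"closed form 4t/V: {lam_star:.3f}")            # 1.600
print(f"min exponent:     {exponent(lam_star):.3f}")  # -2*t^2/V = -8.0
```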

Of the Generalizations

The proofs of the generalizations of Hoeffding's inequality adapt the core moment generating function (MGF) technique from the basic case, where independence allows direct product bounds on the MGF of the sum, but relax assumptions through tail characterizations or conditional control to handle sub-Gaussian tails and dependence structures.

The sub-Gaussian generalization extends the inequality to random variables with Gaussian-like tails without requiring bounded support, relying directly on the sub-Gaussian definition. A zero-mean random variable X_i is \sigma_i^2-sub-Gaussian if \log E[\exp(\lambda X_i)] \leq \frac{\lambda^2 \sigma_i^2}{2} for all \lambda \in \mathbb{R}; by Hoeffding's lemma, bounded variables satisfy this with \sigma_i^2 = \frac{(b_i - a_i)^2}{4}. For independent such X_i, the sum S_n = \sum X_i satisfies E[\exp(\lambda S_n)] = \prod E[\exp(\lambda X_i)] \leq \exp\left(\frac{\lambda^2}{2} \sum \sigma_i^2\right) by the additivity of the log-MGF under independence. Markov's inequality applied to \exp(\lambda S_n) then implies P(S_n \geq t) \leq \exp\left(-\lambda t + \frac{\lambda^2}{2} \sum \sigma_i^2\right), and setting \lambda = \frac{t}{\sum \sigma_i^2} optimizes this to P(S_n \geq t) \leq \exp\left(-\frac{t^2}{2 \sum \sigma_i^2}\right), with the symmetric two-sided form P(|S_n| \geq t) \leq 2 \exp\left(-\frac{t^2}{2 \sum \sigma_i^2}\right). This direct derivation highlights sub-Gaussianity as a unifying framework bridging classical and modern concentration; one-sided boundedness can likewise be handled within sub-Gaussian-type assumptions.

For dependent processes, such as Markov chains, proofs employ martingale decompositions or conditional MGF bounds to control dependence while preserving exponential moments, often paying for the dependence through mixing parameters. In the Markov setting, define the Doob martingale M_k = E[f(X_n) \mid \mathcal{F}_k] for a bounded function f on the state space and the filtration \mathcal{F}_k generated by the chain up to time k, with differences satisfying |M_{k+1} - M_k| \leq c_k under a bounded-differences condition. The Azuma-Hoeffding inequality applies directly: conditional versions of Hoeffding's lemma give \log E[\exp(\lambda (M_{n} - M_0))] \leq \frac{\lambda^2}{2} \sum c_k^2, yielding P(M_n - M_0 \geq t) \leq \exp\left(-\frac{t^2}{2 \sum c_k^2}\right) after optimization, as the conditional expectations make the MGF bound telescope like the product in the independent case.

For general-state-space chains, this extends via operator norms on L^2(\pi) (for stationary measure \pi), bounding \|P f - \pi(f)\|_2 \leq \lambda \|f - \pi(f)\|_2 for a transition kernel P with norm \lambda < 1, leading to a variance proxy inflated by the factor \frac{1 + \lambda}{1 - \lambda} applied to \sum \frac{(b_i - a_i)^2}{4} in the tail bound. Alternatively, for uniformly ergodic chains satisfying a minorization condition P_x(X_m \in \cdot) \geq \delta \phi(\cdot) for some mixing time m and measure \phi, Poisson's equation decomposes the sum into a martingale plus negligible boundary terms; applying the basic MGF bound after ergodicity absorbs the dependence gives P(S_n - E[S_n] \geq \epsilon n) \leq \exp\left(-\frac{\delta^2 (\epsilon n - 2 \|f\| m / \delta)^2}{2 n \|f\|^2 m^2}\right) for large n. Across these approaches, the common thread is exponential moment control via conditional sub-Gaussianity or spectral gaps, trading independence for quantifiable dependence decay.
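To illustrate the martingale route, the sketch below applies the Azuma-Hoeffding bound to a simple ±1 random walk, a martingale with differences c_k = 1, and compares it against simulation; the setup is illustrative.

```python
import math
import random

random.seed(1)
n, t = 200, 40
trials = 50_000

# Simple random walk M_n = sum of +/-1 steps: a martingale with |M_{k+1} - M_k| <= 1.
hits = sum(
    abs(sum(random.choice((-1, 1)) for _ in range(n))) >= t
    for _ in range(trials)
)

azuma = 2 * math.exp(-t ** 2 / (2 * n))  # two-sided Azuma-Hoeffding with c_k = 1
print(f"empirical: {hits / trials:.4f}")  # ~0.005
print(f"Azuma:     {azuma:.4f}")          # ~0.037
```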

Applications

Concentration Bounds

Hoeffding's inequality serves as a foundational tool in probability theory for deriving concentration bounds on the deviations of sums of independent bounded random variables from their expected values. It provides non-asymptotic upper bounds on tail probabilities, ensuring that large deviations occur with exponentially small probability, which is crucial for analyzing how sums concentrate around their means in finite samples. These bounds exploit the sub-Gaussian behavior of bounded variables, offering stronger guarantees than moment-based inequalities like Chebyshev's, which yield only polynomial decay rates.

Compared to Chernoff bounds, Hoeffding's inequality is simpler for bounded variables, as it delivers a direct exponential form without the need to minimize over the moment-generating-function parameter. Relative to variance-aware bounds such as Bernstein's inequality, Hoeffding's approach is more conservative because it disregards the variances of the variables, depending only on their ranges; this yields looser but easier-to-apply bounds when variance information is unavailable or hard to compute. The core intuition stems from the exponential decay in the bound, with rate governed by -2t^2 / v, where v is the sum of the squared ranges of the individual variables; this rapid decay ensures that the sum remains close to its mean with overwhelmingly high probability, even for moderately large deviations t. A representative example bounds the deviation of the sample mean \bar{X} from the true mean \mu for n independent identically distributed random variables X_i, each bounded in a range of length R:
P(|\bar{X} - \mu| \geq \epsilon) \leq 2 \exp\left( -\frac{2n\epsilon^2}{R^2} \right).
This bound demonstrates the inequality's utility in large-sample settings, where the probability of deviation shrinks exponentially with n.
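For instance, inverting this bound gives the sample size needed to guarantee a deviation of at most \epsilon with probability 1 - \delta; a minimal sketch with illustrative names follows.

```python
import math

def samples_needed(eps, delta, R):
    """Smallest n with 2*exp(-2*n*eps^2/R^2) <= delta, i.e.
    n >= R^2 * ln(2/delta) / (2*eps^2), for variables in a range of length R."""
    return math.ceil(R ** 2 * math.log(2 / delta) / (2 * eps ** 2))

# Variables in [0, 1], accuracy 0.05, confidence 95%.
print(samples_needed(0.05, 0.05, 1.0))  # 738
```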

Machine Learning and Statistics

In machine learning, Hoeffding's inequality plays a central role in establishing generalization bounds for empirical risk minimization (ERM), where the goal is to ensure that the empirical risk computed on a finite sample approximates the true expected risk over the underlying distribution. For a hypothesis class with loss functions bounded in [0, B], the inequality gives, for a single hypothesis h with empirical risk \hat{R}(h) and true risk R(h),
P\left( \left| \hat{R}(h) - R(h) \right| \geq \epsilon \right) \leq 2 \exp\left( -\frac{2 n \epsilon^2}{B^2} \right).
For uniform convergence over a finite class of size |H|, a union bound yields
P\left( \sup_{h} \left| \hat{R}(h) - R(h) \right| \geq \epsilon \right) \leq 2 |H| \exp\left( -\frac{2 n \epsilon^2}{B^2} \right).
For infinite or VC-bounded classes, complexity is controlled via the VC dimension, Rademacher averages, or covering numbers, leading to bounds in which \log |H| is replaced by terms scaling with the VC dimension d, typically producing deviations of order \sqrt{(d \log n)/n}. Such bounds underpin PAC-learning guarantees, ensuring that ERM hypotheses generalize with high probability when sample sizes are sufficiently large relative to the desired accuracy \epsilon.
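A minimal sketch of the finite-class calculation: solving 2|H|\exp(-2n\epsilon^2/B^2) \leq \delta for n gives the familiar \log|H| sample complexity (function name and parameters are illustrative).

```python
import math

def erm_sample_complexity(eps, delta, H_size, B=1.0):
    """n large enough that sup_h |R_hat(h) - R(h)| <= eps holds with
    probability >= 1 - delta, via Hoeffding + union bound over |H| hypotheses."""
    return math.ceil(B ** 2 * math.log(2 * H_size / delta) / (2 * eps ** 2))

print(erm_sample_complexity(0.1, 0.05, H_size=10_000))  # 645
```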
In statistics, Hoeffding's inequality is instrumental in non-parametric density estimation, providing exponential concentration for estimators like the kernel density estimator f_n(x) = \frac{1}{n} \sum_{i=1}^n K_h(x - X_i), where deviations of the L^1 error \int |f_n - f| from its expectation are controlled as

P \left\{ \left| \int |f_n - f| - E \int |f_n - f| \right| > t \right\} \leq 2 \exp\left( -\frac{n t^2}{2 \left( \int |K| \right)^2} \right).

It also bounds supremum-type statistics, such as the Vapnik-Chervonenkis distance V_n = \sup_{A} |\mu_n(A) - \mu(A)|, yielding P \{ |V_n - E V_n| \geq t \} \leq 2 e^{-2 n t^2}, which follows from McDiarmid's inequality (a bounded-differences inequality closely related to Hoeffding's) and ensures consistent estimation in infinite-dimensional function spaces without strong distributional assumptions. These applications highlight its utility in settings where parametric models fail.

Hoeffding's inequality formed a cornerstone of Probably Approximately Correct (PAC) learning theory in the 1990s, enabling foundational results on sample complexity for consistent learning independent of data dimensionality in low-complexity classes. Post-2010, extensions have integrated it into analyses of stochastic gradient descent (SGD) in deep learning, bounding martingale differences in the iterates to derive high-probability generalization guarantees for multi-pass optimization, even under non-convex losses. A key advantage over weaker bounds like Chebyshev's is the exponential tail decay, yielding sample complexities of order O(1/\epsilon^2) for \epsilon-accurate estimation, dimension-independent for bounded variables and free of variance-dependent terms.
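As an illustration of the V_n concentration, the sketch below specializes the class of sets A to half-lines (so V_n is the Kolmogorov-Smirnov statistic for uniform samples) and checks the 2 e^{-2 n t^2} bound by simulation; all parameter choices are illustrative.

```python
import math
import random

random.seed(2)
n, trials, t = 500, 2000, 0.05

def ks_stat():
    """Sup over half-lines of |mu_n - mu| for n U(0,1) samples: the KS statistic."""
    xs = sorted(random.random() for _ in range(n))
    return max(max(abs((i + 1) / n - x), abs(i / n - x)) for i, x in enumerate(xs))

stats = [ks_stat() for _ in range(trials)]
mean_v = sum(stats) / trials
hits = sum(abs(v - mean_v) >= t for v in stats) / trials
print(f"empirical P(|V_n - EV_n| >= {t}): {hits:.4f}")
print(f"McDiarmid bound: {2 * math.exp(-2 * n * t ** 2):.4f}")  # ~0.164
```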

Confidence Intervals

Hoeffding's inequality enables the construction of explicit confidence intervals for the population mean \mu of a bounded random variable based on independent and identically distributed (i.i.d.) samples X_1, \dots, X_n \in [a, b]. Let \bar{X} = n^{-1} \sum_{i=1}^n X_i denote the sample mean. For any \delta \in (0,1), the interval

\left[ \bar{X} - \sqrt{\frac{(b-a)^2 \log(2/\delta)}{2n}}, \ \bar{X} + \sqrt{\frac{(b-a)^2 \log(2/\delta)}{2n}} \right]

contains \mu with probability at least 1 - \delta. This interval arises by applying Hoeffding's inequality, which states that P(|\bar{X} - \mu| \geq t/n) \leq 2 \exp(-2 t^2 / (n (b-a)^2)) for t > 0. To achieve the desired confidence level 1 - \delta, set the right-hand side equal to \delta and solve for t: t = \sqrt{\frac{(b-a)^2 n}{2} \log\left(\frac{2}{\delta}\right)}, yielding the half-width \sqrt{(b-a)^2 \log(2/\delta) / (2n)} after dividing by n.

A key advantage of this construction is its distribution-free nature: it requires no assumptions beyond boundedness and independence, such as normality or known variance, making it applicable in settings like survey sampling where distributional details are unknown. It is particularly valuable in A/B testing for metrics like conversion rates, which are naturally bounded. However, these intervals can be wider than those based on the central limit theorem's normal approximation when the true variance is small relative to the range (b-a), as the bound does not exploit variance information. For the special case of Bernoulli random variables modeling binary outcomes in polls, the interval simplifies with a = 0, b = 1, providing a conservative yet valid bound on the success probability.
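A minimal sketch of the construction, with an illustrative function name and synthetic conversion data:

```python
import math

def hoeffding_ci(samples, a, b, delta=0.05):
    """Distribution-free (1 - delta) confidence interval for the mean of
    i.i.d. samples bounded in [a, b], via Hoeffding's inequality."""
    n = len(samples)
    mean = sum(samples) / n
    half = math.sqrt((b - a) ** 2 * math.log(2 / delta) / (2 * n))
    return mean - half, mean + half

# Example: 95% interval for a conversion rate from 1,000 binary outcomes.
data = [1] * 132 + [0] * 868
print(hoeffding_ci(data, a=0, b=1))  # ~(0.089, 0.175)
```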
