
Sample complexity

Sample complexity refers to the minimal number of training examples required for a learning algorithm to output a hypothesis that approximates the target concept with error ε, with probability at least 1 − δ over the random draw of examples, in a given learning framework. This quantifies the data efficiency of learning procedures and is central to computational learning theory, where a concept class is PAC-learnable if its sample complexity is polynomial in 1/ε, log(1/δ), and intrinsic parameters of the class. Introduced by Leslie Valiant in 1984 as part of the foundational PAC model, sample complexity addresses the challenge of learning from examples without explicit programming, emphasizing polynomial bounds to ensure feasibility. In 1989, Blumer et al. established that a concept class is PAC-learnable if and only if it has finite Vapnik-Chervonenkis (VC) dimension d, a combinatorial measure of the class's expressive power, with sample complexity bounded by O((d/ε) log(1/ε) + (1/ε) log(1/δ)) for consistent learners. Subsequent work refined these bounds; notably, Hanneke (2016) proved the optimal sample complexity is Θ((d + log(1/δ))/ε) in the realizable case, achieved via empirical risk minimization over subsamples, matching known lower bounds up to constants. Beyond classical supervised learning, sample complexity extends to settings like agnostic learning, reinforcement learning, and active learning, where it influences algorithm design by highlighting trade-offs between model complexity and data requirements. For instance, classes with low VC dimension, such as halfspaces in low dimensions, exhibit favorable sample complexity, enabling efficient generalization, while high-dimensional or overparameterized models demand substantially more samples to avoid overfitting. These insights underpin modern machine learning practices, guiding the selection of architectures and regularization techniques to optimize empirical performance.

Fundamentals

Definition and Motivation

In supervised learning, the foundational setup involves an instance space \mathcal{X}, a label space \mathcal{Y}, a target function f: \mathcal{X} \to \mathcal{Y}, examples drawn from a distribution D over \mathcal{X} \times \mathcal{Y}, and a hypothesis class \mathcal{H} \subseteq \{h: \mathcal{X} \to \mathcal{Y}\}. The learner receives independent samples (x_i, y_i) where y_i = f(x_i) and must output a hypothesis h \in \mathcal{H} that approximates f. Performance is typically measured using the 0-1 loss \ell(h(x), y) = \mathbb{1}[h(x) \neq y], with the true risk L_D(h) = \mathbb{E}_{(x,y) \sim D} [\ell(h(x), y)] and empirical risk L_S(h) = \frac{1}{m} \sum_{i=1}^m \ell(h(x_i), y_i), where m is the number of samples. Sample complexity, denoted m(\epsilon, \delta, \mathcal{H}), is defined as the minimal number of i.i.d. samples required for a learning algorithm to output a hypothesis h such that, with probability at least 1 - \delta over the choice of samples, the true risk satisfies L_D(h) \leq \epsilon. This quantity captures the statistical resources needed to ensure reliable generalization from finite data to the underlying distribution D. In the probably approximately correct (PAC) learning framework, a hypothesis class \mathcal{H} is PAC-learnable if there exists an algorithm with polynomial sample complexity in \frac{1}{\epsilon}, \frac{1}{\delta}, and the representation size of \mathcal{H}. The motivation for studying sample complexity arises from the need to guarantee that learned models generalize beyond training data, avoiding overfitting where models memorize noise rather than patterns, which occurs with insufficient samples. In practice, low sample complexity enables efficient learning in resource-constrained settings, such as data-scarce domains, while high complexity signals fundamental limitations of the hypothesis class. The no-free-lunch theorem underscores this by proving that, without restrictions on \mathcal{H} or D, no algorithm can achieve better-than-random average performance across all possible problems, implying infinite sample complexity in the unrestricted case and necessitating inductive biases like bounded \mathcal{H}. A general form of sample complexity bound is m \geq \frac{1}{\epsilon} \log \frac{1}{\delta} plus terms depending on \mathcal{H}, highlighting the interplay between desired accuracy \epsilon, confidence 1 - \delta, and class complexity.
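
As a concrete illustration of these definitions, the short Python sketch below computes the empirical 0-1 risk L_S(h) of a candidate hypothesis on a labeled sample; the threshold data and the hypothesis are hypothetical stand-ins chosen for illustration, not part of any standard setup.

```python
import numpy as np

def empirical_risk(h, X, y):
    """Empirical 0-1 risk L_S(h): the fraction of sample points h mislabels."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Hypothetical example: labels generated by a true threshold at 0.3,
# evaluated against a candidate hypothesis with threshold 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=200)
y = (X >= 0.3).astype(int)       # target function f
h = lambda x: int(x >= 0.5)      # candidate hypothesis
print(empirical_risk(h, X, y))   # roughly P[0.3 <= x < 0.5] = 0.2
```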

Historical Development

The foundations of sample complexity in computational learning theory trace back to the statistical learning work of Vladimir Vapnik and Alexey Chervonenkis in 1971, who introduced concepts of uniform convergence and growth functions that quantified the amount of data needed for empirical risk minimization to approximate true risk, laying the groundwork for later bounds on learnable classes. Their theory emphasized the role of data volume in controlling generalization error for pattern recognition tasks, though it was framed in probabilistic terms rather than explicitly algorithmic learning models. The modern notion of sample complexity crystallized with Leslie Valiant's 1984 introduction of the Probably Approximately Correct (PAC) learning framework, which formalized learning as efficiently identifying a hypothesis close to the target concept with high probability, using sample size as a key measure alongside computational efficiency. Valiant's model was motivated by the feasibility of machine learning, positing that learnability from examples could bridge the gap between explicit programming and autonomous concept acquisition in machines. In 1989, seminal advances by Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred Warmuth connected sample complexity directly to the VC dimension, proving that finite VC dimension implies polynomial sample requirements for PAC learnability and providing tight upper bounds of order O\left(\frac{1}{\epsilon} (d \log \frac{1}{\epsilon} + \log \frac{1}{\delta})\right), where d is the VC dimension, \epsilon the error, and \delta the failure probability. Subsequent evolution extended sample complexity analysis beyond the realizable PAC setting. Haussler formalized agnostic PAC learning in 1992, relaxing the perfect hypothesis assumption to minimize error relative to the best-in-class, which increased sample demands to O\left(\frac{VC(H)}{\epsilon^2} \log \frac{1}{\epsilon\delta}\right) for hypothesis class H. This framework influenced active learning variants, such as those by Sanjoy Dasgupta in 2005, where adaptive querying reduces sample needs by focusing on informative examples. In the 2010s, focus shifted to optimal bounds, with Steve Hanneke establishing in 2016 that the optimal sample complexity for realizable PAC learning is \Theta\left(\frac{d + \log(1/\delta)}{\epsilon}\right), where d is the VC dimension, resolving a longstanding gap by matching minimax rates up to constants. Post-2020 developments have adopted information-theoretic perspectives, viewing sample complexity through the mutual information between training samples and learned hypotheses to derive tighter, distribution-dependent bounds that generalize beyond VC-based uniform convergence. For instance, recent works have linked conditional mutual information to sample requirements in sequential settings, emphasizing minimal information leakage for robust generalization. These approaches highlight a shift toward probabilistic and Bayesian interpretations, influencing extensions in robust and PAC-Bayesian learning.

PAC Learning Framework

Realizable PAC Learning

In the realizable PAC learning framework, it is assumed that there exists a target hypothesis c \in \mathcal{H} that perfectly classifies the data generated from an unknown distribution D over the instance space \mathcal{X} \times \mathcal{Y}, meaning the error of c is zero: \Pr_{(x,y) \sim D}[c(x) \neq y] = 0. The learner's goal is to output a hypothesis h \in \mathcal{H} such that, with probability at least 1 - \delta over the choice of training samples drawn i.i.d. from D, the generalization error satisfies \Pr_{(x,y) \sim D}[h(x) \neq y] \leq \epsilon, for given accuracy parameter \epsilon > 0 and confidence parameter \delta > 0. This setting captures scenarios where the hypothesis class \mathcal{H} is rich enough to contain a flawless representation of the underlying concept, allowing for clean theoretical analysis of sample requirements. Probably Approximately Correct (PAC) learning formalizes the requirement that a concept class \mathcal{C} \subseteq \{0,1\}^\mathcal{X} (with corresponding hypotheses \mathcal{H}) is learnable if there exists an algorithm whose sample complexity—the number of examples m needed—is polynomial in \frac{1}{\epsilon}, \frac{1}{\delta}, and the effective size of the instance space (such as the dimension n for \mathcal{X} = \{0,1\}^n). Introduced by Valiant, this framework emphasizes computational and statistical efficiency, ensuring that learning is feasible without exhaustive enumeration of all possible hypotheses. In the realizable case, the polynomial dependence highlights classes where restrictions on \mathcal{H} enable finite, tractable sample sizes, contrasting with settings lacking such structure. For unrestricted hypothesis spaces, such as \mathcal{H} comprising all possible functions from \mathcal{X} to \{0,1\}, the sample complexity is infinite, as no finite number of examples can guarantee low error with high probability against arbitrary distributions. Finite sample complexity arises only under restrictions on \mathcal{H}, such as finiteness or structural constraints, which limit the number of hypotheses that can remain consistent with the data. In the finite case, a basic guarantee follows from the empirical risk minimization (ERM) principle combined with a union bound over \mathcal{H}: to ensure that the output h has error at most \epsilon with probability at least 1 - \delta, it suffices to draw m \geq \frac{1}{\epsilon} \left( \ln |\mathcal{H}| + \ln \frac{1}{\delta} \right) samples, assuming the ERM hypothesis is consistent with the training data. This bound demonstrates logarithmic dependence on the hypothesis class size |\mathcal{H}|, underscoring the role of \mathcal{H}'s cardinality in controlling sample needs. A canonical example is the PAC learnability of conjunctions of literals over the Boolean cube \mathcal{X} = \{0,1\}^n, where \mathcal{H} consists of functions that output 1 only if all selected literals (positive or negative features) are satisfied. Valiant showed that this class, with |\mathcal{H}| = 3^n (accounting for including each variable positively, negatively, or not at all), admits a polynomial-time algorithm achieving sample complexity O\left(\frac{n}{\epsilon} \ln \frac{1}{\epsilon} + \frac{1}{\epsilon} \ln \frac{1}{\delta}\right), which is polynomial in n, \frac{1}{\epsilon}, and \frac{1}{\delta}. This result illustrates how structured \mathcal{H} enables efficient realizable learning, serving as a foundational demonstration of PAC feasibility for restricted Boolean functions.
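
A minimal Python sketch of the elimination-style learner for conjunctions, in the spirit of Valiant's analysis: the representation (index sets of surviving positive and negated literals) and the function names are illustrative choices, assuming the realizable setting described above.

```python
def learn_conjunction(examples):
    """Elimination learner for conjunctions of literals.

    examples: list of (x, y) with x a 0/1 sequence of length n, y in {0, 1}.
    Start from the conjunction of all 2n literals and delete every literal
    falsified by a positive example; negative examples are ignored.
    """
    n = len(examples[0][0])
    pos, neg = set(range(n)), set(range(n))   # candidate x_i and not-x_i literals
    for x, y in examples:
        if y == 1:                            # positive example
            for i in range(n):
                if x[i] == 1:
                    neg.discard(i)            # not-x_i is falsified
                else:
                    pos.discard(i)            # x_i is falsified
    return pos, neg

def predict(pos, neg, x):
    """Evaluate the learned conjunction on a new instance x."""
    return int(all(x[i] == 1 for i in pos) and all(x[i] == 0 for i in neg))

# Hypothetical target: x_0 AND not-x_2 over n = 3 variables.
sample = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0)]
pos, neg = learn_conjunction(sample)
print(pos, neg, predict(pos, neg, (1, 0, 0)))
```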

Agnostic PAC Learning

In the agnostic PAC learning framework, the objective is to identify a hypothesis h \in \mathcal{H} that approximately minimizes the true risk L_{\mathcal{D}}(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}} [\ell(h(x), y)] with respect to an unknown distribution \mathcal{D} over the instance-label space, without assuming the existence of a zero-loss hypothesis in the class. The learner, given a sample S of m i.i.d. examples from \mathcal{D}, outputs h such that L_{\mathcal{D}}(h) \leq \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon with probability at least 1 - \delta, where \epsilon > 0 is the excess risk tolerance and \delta > 0 is the failure probability. This setting, formalized as a generalization of the realizable model to handle arbitrary labeling noise or model mismatch, applies to bounded loss functions \ell: \mathcal{Y} \times \mathcal{Y} \to [0,1] and emphasizes relative performance to the best-in-class hypothesis. The sample complexity in agnostic PAC learning presents greater challenges than in the realizable case, as the irreducible error from model mismatch or label noise demands tighter concentration to approximate the optimal risk. Success hinges on uniform convergence, where the empirical risk \hat{L}_S(h) = \frac{1}{m} \sum_{(x,y) \in S} \ell(h(x), y) converges to L_{\mathcal{D}}(h) simultaneously for all h \in \mathcal{H}, with high probability over S. This property ensures that empirical risk minimization (ERM), which selects \hat{h} = \arg\min_{h \in \mathcal{H}} \hat{L}_S(h), yields a hypothesis with controlled excess risk, but requires larger m to overcome variance introduced by noisy labels. For finite hypothesis classes with |\mathcal{H}| < \infty, agnostic PAC learnability is achieved via ERM, with sample complexity bounded by Hoeffding's inequality applied to the uniform deviation of risks: m \geq \frac{2}{\epsilon^2} \ln \left( \frac{2 |\mathcal{H}|}{\delta} \right). This guarantee holds for any distribution \mathcal{D} and loss in [0,1], underscoring the quadratic 1/\epsilon dependence essential for distinguishing the ERM hypothesis from the optimal one amid noise. Unlike realizable PAC learning, which targets absolute error \epsilon under a zero-risk assumption, agnostic PAC introduces excess risk as the performance metric, reflecting approximation to the best achievable in \mathcal{H}. For broader analysis, especially infinite classes, tools like uniform stability—measuring sensitivity to sample perturbations—or Rademacher complexity—quantifying average correlation with random noise—provide high-level complexity measures to bound generalization without explicit enumeration.
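
The finite-class agnostic bound translates directly into a sample-size calculation; a minimal sketch, assuming the Hoeffding-plus-union-bound constant of 2 quoted above (the function name and the example numbers are illustrative):

```python
import math

def agnostic_sample_size(h_size, eps, delta):
    """Samples sufficient for uniform convergence over a finite class:
    m >= (2 / eps^2) * ln(2|H| / delta), via Hoeffding plus a union bound."""
    return math.ceil((2.0 / eps**2) * math.log(2.0 * h_size / delta))

# e.g. |H| = 10^6 hypotheses, excess risk 0.05, confidence 95%
print(agnostic_sample_size(10**6, eps=0.05, delta=0.05))   # 14004
```

Note the quadratic dependence on 1/\epsilon: halving the excess-risk tolerance quadruples the required sample size, while growing the class only adds logarithmically.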

Hypothesis Spaces

Unrestricted Hypothesis Spaces

In the PAC learning framework, an unrestricted hypothesis space consists of all possible functions mapping the input domain X to the output space Y. When X is finite with size n and Y = \{0,1\}, the hypothesis class H = Y^X has cardinality |H| = 2^n, exhibiting exponential growth in the domain size n. This vast expressiveness captures every conceivable labeling of the inputs, without imposing structural constraints on the target concept. A fundamental counting argument reveals why such unrestricted spaces lead to high sample complexity in the realizable PAC setting, even when the number of samples required scales only polynomially with the domain size. To identify the target hypothesis with error at most \epsilon and failure probability at most \delta, an algorithm must effectively distinguish the correct hypothesis from the exponentially many alternatives consistent with the samples drawn from any distribution. If the number of samples m is too small, numerous hypotheses remain viable, resulting in outputs that generalize poorly. Specifically, a lower bound on the sample complexity is given by m(\epsilon, \delta) = \Omega\left( \frac{1}{\epsilon} \log \frac{|H|}{\delta} \right), which scales polynomially with n (as \log |H| = n \log 2) for |H| exponential in n. However, the exponential size of H makes it computationally infeasible to search or optimize over the space in polynomial time, precluding polynomial-time learnability as required by the PAC model. A canonical illustration is the task of learning an arbitrary Boolean function defined over n input bits, where the domain X = \{0,1\}^n has |X| = 2^n points and H includes all 2^{2^n} possible functions. In the worst case, under a uniform distribution over X, achieving small \epsilon requires \Theta(2^n / \epsilon) samples to sufficiently cover the domain and resolve the target labeling, as fewer samples leave too many inconsistent hypotheses unexplored. This exponential dependence echoes the no-free-lunch theorem's implication that, absent restrictions on H, no general-purpose learner can succeed efficiently across all possible targets.
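
The counting argument can be made concrete with a few lines of arithmetic; the numbers below assume the Boolean-cube setup just described and omit constants.

```python
n = 4                              # number of input bits
domain_size = 2 ** n               # |X| = 2^n points
num_functions = 2 ** domain_size   # |H| = 2^(2^n) Boolean functions

eps = 0.1
# log2 |H| = 2^n bits of ambiguity must be resolved, so the
# (1/eps) * log(|H|/delta) requirement scales like 2^n / eps.
samples_needed = domain_size / eps
print(domain_size, num_functions, samples_needed)   # 16 65536 160.0
```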

Restricted Hypothesis Spaces

In the context of Probably Approximately Correct (PAC) learning, restricting the hypothesis space \mathcal{H} to a structured subset with limited expressive power is essential for achieving finite sample complexity, as it enables uniform convergence of empirical risks to true risks over \mathcal{H}. Unlike unrestricted hypothesis spaces, which lead to no-free-lunch impossibilities requiring exponentially many samples, finite or low-capacity \mathcal{H} ensures that a learner can generalize from polynomially many examples with high probability. This restriction introduces an inductive bias, prioritizing simpler models such as lookup tables for finite domains or linear functions over infinite domains, thereby bounding the risk of overfitting and guaranteeing learnability in both realizable and agnostic settings. For finite hypothesis spaces, where |\mathcal{H}| < \infty, the sample complexity exhibits a logarithmic dependence on the size of \mathcal{H}. In the realizable case, an empirical risk minimization (ERM) learner requires m = O\left(\frac{1}{\epsilon} \left( \ln |\mathcal{H}| + \ln \frac{1}{\delta} \right) \right) samples to output a hypothesis with error at most \epsilon with probability at least 1 - \delta. In the agnostic case, the complexity increases to m = O\left(\frac{1}{\epsilon^2} \left( \ln |\mathcal{H}| + \ln \frac{1}{\delta} \right) \right), ensuring the output's error is within \epsilon of the best-in-class error. These bounds arise from Hoeffding's inequality combined with the union bound over \mathcal{H}, highlighting how finiteness controls the multiplicity of hypotheses and facilitates uniform convergence. When \mathcal{H} is infinite but possesses low capacity, such as classes of linear classifiers in n-dimensional space, sample complexity remains finite by relying on measures like the growth function \tau_{\mathcal{H}}(m), which quantifies the maximum number of distinct labelings \mathcal{H} can induce on m points. For such restricted spaces, the growth function grows polynomially in m, leading to sample complexities that scale with the "effective size" of \mathcal{H}, often captured by the VC dimension as a combinatorial proxy. Both realizable and agnostic PAC learning benefit, with polynomial restrictions yielding overall complexities of the form \mathrm{poly}(1/\epsilon, 1/\delta, n), where n is the input dimension, thus ensuring efficient learnability for structured models like halfspaces.
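
The growth function can be computed directly for simple classes. The sketch below enumerates the distinct labelings that one-dimensional thresholds h(x) = \mathbb{1}[x \geq t] induce on m points — a standard illustration, with hypothetical data; for this class \tau_{\mathcal{H}}(m) = m + 1, far below 2^m.

```python
import numpy as np

def threshold_labelings(points):
    """All distinct labelings of `points` induced by thresholds 1[x >= t]."""
    labelings = set()
    # Sweeping t through each point and past the maximum covers every
    # achievable labeling: all ones, then one more zero at each step.
    candidates = sorted(points) + [max(points) + 1.0]
    for t in candidates:
        labelings.add(tuple(int(x >= t) for x in points))
    return labelings

pts = list(np.random.default_rng(1).uniform(0, 1, size=6))
print(len(threshold_labelings(pts)))   # m + 1 = 7, versus 2^6 = 64
```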

Examples of PAC-Learnable Spaces

One canonical example of a PAC-learnable hypothesis space is the class of intervals on the real line, where the instance space is \mathbb{R} and each hypothesis h_{a,b} classifies points x \in [a, b] as positive (label 1) and points outside as negative (label 0). This class has finite VC dimension of 2, as it can shatter any set of two points (e.g., by placing the interval to cover one, both, or neither) but cannot shatter any three points x_1 < x_2 < x_3, since no interval can label the outer two positive while labeling the middle one negative. By the VC dimension theorem, the class is PAC learnable in the realizable case, with sample complexity O\left(\frac{1}{\epsilon} \log \frac{1}{\epsilon} + \frac{1}{\epsilon} \log \frac{1}{\delta}\right). An efficient consistent learner for this space draws a sample and outputs the hypothesis defined by the minimum and maximum positively labeled points as endpoints a and b, which fits all positives and excludes negatives in the sample; this approach leverages the restricted structure of intervals to ensure low error with high probability. Another prominent example is the class of linear classifiers, or half-spaces, in d-dimensional Euclidean space \mathbb{R}^d, where hypotheses are of the form h_{\mathbf{w},b}(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b) and separate points via a hyperplane. The VC dimension of this class is d+1, allowing it to shatter any d+1 points in general position but no set of d+2 points. Consequently, it is PAC learnable, with sample complexity scaling as O\left(\frac{d}{\epsilon} \log \frac{1}{\epsilon} + \frac{1}{\epsilon} \log \frac{1}{\delta}\right) in the realizable setting. The perceptron algorithm provides an efficient method to find a consistent linear separator by iteratively updating weights based on misclassified examples, converging in a number of updates bounded by the inverse squared margin of the data. Similarly, support vector machines can learn half-spaces in the agnostic case, though their PAC guarantees stem from the finite VC dimension. In the domain of boolean functions over \{0,1\}^n, the class of k-DNF formulas—disjunctions of conjunctions each involving at most k literals—is PAC learnable when k is fixed, as the restriction on term size bounds the effective complexity. The VC dimension grows with k and n, leading to sample complexity that is polynomial in n^k, 1/\epsilon, and \log(1/\delta), which becomes inefficient as k increases but remains tractable for constant k. For constant k, the class is learnable from random examples alone by treating each of the O(n^k) possible terms as a feature; for unrestricted DNF, Jackson's harmonic sieve algorithm achieves learnability under the uniform distribution using membership queries to identify relevant terms and prune the hypothesis space, demonstrating how parameterizing the formula length enables PAC guarantees in expressive yet restricted boolean classes.
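
The consistent interval learner described above reduces to taking the extremes of the positive examples; a minimal sketch, with synthetic data standing in for draws from D (the true interval and sample size are illustrative):

```python
import numpy as np

def learn_interval(X, y):
    """Consistent learner for intervals on the line (realizable case):
    return the tightest interval [a, b] covering all positive examples."""
    positives = X[y == 1]
    if len(positives) == 0:
        return (0.0, -1.0)        # empty interval: classify everything as 0
    return (positives.min(), positives.max())

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=100)
y = ((X >= 0.4) & (X <= 0.7)).astype(int)   # hypothetical target [0.4, 0.7]
a, b = learn_interval(X, y)
print(a, b)   # close to (0.4, 0.7); the gap shrinks as the sample grows
```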

Theoretical Bounds

Finite Sample Complexity Bounds

In the realizable case of Probably Approximately Correct (PAC) learning, where there exists a hypothesis in the finite class \mathcal{H} that perfectly fits the target concept, the empirical risk minimizer (ERM) over \mathcal{H} achieves PAC learnability with sample complexity m = O\left( \frac{1}{\epsilon} \log \frac{|\mathcal{H}|}{\delta} \right). This bound ensures that with probability at least 1 - \delta, the ERM hypothesis has error at most \epsilon relative to the true error of zero. The derivation relies on the union bound applied over all hypotheses in \mathcal{H}: for each bad hypothesis h with true error at least \epsilon, the probability that its empirical error is zero on a random sample of size m is at most (1 - \epsilon)^m \leq e^{-\epsilon m}; thus, the probability that any bad hypothesis appears consistent is at most |\mathcal{H}| e^{-\epsilon m}, which is set below \delta to solve for m. An exact form of this requirement is m \geq \frac{\ln |\mathcal{H}| + \ln (1/\delta)}{\ln (1/(1-\epsilon))}, which approximates to \frac{1}{\epsilon} \left( \ln |\mathcal{H}| + \ln \frac{1}{\delta} \right) for small \epsilon. Matching lower bounds of order \Omega\left( \frac{1}{\epsilon} \log \frac{1}{\delta} \right) hold even for simple finite hypothesis classes, such as |\mathcal{H}| = 2, where that many samples are needed merely to distinguish the two hypotheses with the desired confidence. In the agnostic PAC setting, where no perfect hypothesis exists and the goal is to approximate the minimum error over \mathcal{H} within additive \epsilon, the ERM achieves sample complexity m = O\left( \frac{1}{\epsilon^2} (\log |\mathcal{H}| + \log (1/\delta)) \right). This follows from uniform convergence guarantees: Hoeffding's inequality bounds the probability that the empirical error of a fixed hypothesis deviates from its true error by more than \epsilon as 2 e^{-2 m \epsilon^2}, and the union bound over |\mathcal{H}| hypotheses keeps the overall probability of any such deviation below \delta, solving for m. Lower bounds in the agnostic case scale as \Omega\left( \frac{\log |\mathcal{H}| + \log (1/\delta)}{\epsilon^2} \right) for finite classes, reflecting the need to reliably estimate errors even without realizability, with the quadratic dependence on 1/\epsilon arising from variance in error estimation.
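
The (1 - \epsilon)^m step of the union-bound derivation can be checked by simulation; the sketch below estimates the probability that a single hypothesis with true error \epsilon looks perfectly consistent on m samples (trial counts and parameters are illustrative).

```python
import numpy as np

def consistency_probability(eps, m, trials=100_000, seed=0):
    """Estimate the probability that a hypothesis with true error `eps`
    shows zero empirical error on m i.i.d. samples."""
    rng = np.random.default_rng(seed)
    # Each sample independently disagrees with the hypothesis w.p. eps.
    errors = rng.random((trials, m)) < eps
    return np.mean(~errors.any(axis=1))

eps, m = 0.1, 50
print(consistency_probability(eps, m))   # close to the exact value below
print((1 - eps) ** m)                    # (0.9)^50 ~ 0.00515
```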

VC Dimension and Shattering

The Vapnik–Chervonenkis (VC) dimension provides a fundamental measure of the capacity of a hypothesis class \mathcal{H} in the context of probably approximately correct (PAC) learning, particularly for infinite classes where the cardinality |\mathcal{H}| is unbounded. A set of points S \subseteq \mathcal{X} of size |S| = k is said to be shattered by \mathcal{H} if for every possible labeling of S with binary labels \{0,1\}^k, there exists a hypothesis h \in \mathcal{H} that realizes that labeling exactly on S. Shattering captures the expressive power of \mathcal{H}, as it indicates the ability to fit any dichotomy on the points without regard to the underlying distribution. The VC dimension, denoted d = \mathrm{VC}(\mathcal{H}), is the cardinality of the largest set that can be shattered by \mathcal{H}; if no such finite maximum exists, then \mathrm{VC}(\mathcal{H}) = \infty. Finite VC dimension is necessary and sufficient for PAC learnability of \mathcal{H}, extending the finite-class bounds to infinite settings by replacing |\mathcal{H}| with a combinatorial measure of complexity. For realizable PAC learning, where there exists a hypothesis in \mathcal{H} consistent with the target concept, the sample complexity m required to achieve error at most \epsilon with probability at least 1 - \delta satisfies m = O\left( \frac{d}{\epsilon} \log \frac{d}{\epsilon} + \frac{1}{\epsilon} \log \frac{1}{\delta} \right). Tight bounds show that the sample complexity is \Theta\left( \frac{d + \log(1/\delta)}{\epsilon} \right) in the realizable case, matching lower bounds up to constants (Hanneke, 2016). In the agnostic setting, where no perfect hypothesis exists and the goal is to output a hypothesis with error close to the minimum over \mathcal{H}, the sample complexity is m = O\left( \frac{d}{\epsilon^2} \log \frac{d}{\epsilon} + \frac{1}{\epsilon^2} \log \frac{1}{\delta} \right). These bounds highlight how low VC dimension controls the number of samples needed to generalize, preventing overfitting in high-capacity classes. Illustrative examples demonstrate the VC dimension for simple geometric hypothesis classes. Consider thresholds on the real line, defined as \mathcal{H} = \{ h_\theta : x \mapsto \mathbf{1}[x \geq \theta] \mid \theta \in \mathbb{R} \}, which corresponds to half-lines starting from a threshold. This class has VC dimension 1, as it can shatter any single point (by choosing \theta appropriately) but cannot shatter two points, since it always labels points to the right of the threshold positively and to the left negatively, failing to realize the opposite labeling. In contrast, linear separators (half-planes) in the plane, \mathcal{H} = \{ h_w : (x_1, x_2) \mapsto \mathrm{sgn}(w_1 x_1 + w_2 x_2 + b) \mid w_1, w_2, b \in \mathbb{R} \}, have VC dimension 3: any three non-collinear points can be shattered by adjusting the line to separate any dichotomy, but no four points in general position can be shattered, as some labelings require a non-linear separator. The growth function \Pi_{\mathcal{H}}(m) = \max_{S \subseteq \mathcal{X}, |S|=m} |\{ (h(x_1), \dots, h(x_m)) \mid h \in \mathcal{H} \}| quantifies the maximum number of distinct labelings inducible by \mathcal{H} on m points, directly relating to shattering since \Pi_{\mathcal{H}}(d) = 2^d if d = \mathrm{VC}(\mathcal{H}). The Sauer–Shelah lemma provides a tight combinatorial bound on this growth for finite VC dimension: \Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i}.
A looser but analytically convenient upper bound follows: for m > d, \Pi_{\mathcal{H}}(m) \leq \left( \frac{em}{d} \right)^d. These bounds ensure that the growth remains polynomial in m when d is fixed, underpinning the sample complexity guarantees by controlling the uniform deviation of empirical risks over \mathcal{H}.
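
Both forms of the Sauer–Shelah bound are easy to evaluate numerically; a minimal sketch (the choice of m = 100 and d = 3 is illustrative):

```python
import math

def sauer_shelah(m, d):
    """Sauer-Shelah bound on the growth function: sum_{i<=d} C(m, i)."""
    return sum(math.comb(m, i) for i in range(d + 1))

m, d = 100, 3
print(sauer_shelah(m, d))        # 166751 -- polynomial in m
print((math.e * m / d) ** d)     # ~7.4e5, the looser (em/d)^d bound
print(2 ** m)                    # 2^100 ~ 1.27e30, exponentially larger
```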

Extensions and Variations

Active Learning Sample Complexity

In active learning, the algorithm adaptively selects which unlabeled examples to query for labels, in contrast to passive learning, which relies on a fixed set of independent and identically distributed (i.i.d.) labeled samples. This adaptivity allows active learning to focus on informative examples, potentially reducing the label complexity in settings where labels are scarce or expensive to obtain, such as by achieving asymptotically superior sample requirements compared to passive methods for finite VC dimension classes. In the realizable case, where the target hypothesis lies in the hypothesis class, disagreement-based sampling algorithms can achieve label complexities of O\left(\theta d \log^2(1/\epsilon)\right), where d is the VC dimension, \theta is the disagreement coefficient, and \epsilon is the desired error rate, improving upon passive learning's \Theta(d/\epsilon) dependence under favorable distributions. However, lower bounds demonstrate that active learning provides no universal speedup over passive learning across all distributions and hypothesis classes; for instance, certain contrived distributions require \Omega(1/\epsilon) labels even for classes of VC dimension 1, matching passive rates up to constants. Improvements are possible for structured hypothesis spaces, such as homogeneous linear separators under uniform distributions on the unit sphere, where active methods yield \tilde{O}(d \log^2(1/\epsilon)) labels, exponentially better than passive bounds. An example illustrating active learning's potential is learning a threshold on the unit interval [0,1], where the hypothesis class consists of threshold functions defined by a single point w \in [0,1]. In passive learning, \Theta(1/\epsilon) i.i.d. samples are needed to achieve error \epsilon, as the decision boundary may lie in a low-density region. Active learning, however, can perform a binary search to locate w within an \epsilon-interval using only O(\log(1/\epsilon)) queries, yielding an exponential reduction in label complexity. Seminal results from the 2000s, including works by Dasgupta on splitting-based bounds and by Dasgupta, Kalai, and Monteleoni on perceptron-based active learning, established foundational analyses of optimal label complexities tailored to hypothesis-class structure. More recent perspectives, such as those in multi-group agnostic active learning, emphasize disagreement coefficients to derive instance-dependent bounds that extend beyond worst-case analyses, highlighting adaptive query strategies' ability to achieve near-optimal rates in diverse settings.
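
The threshold example admits a direct implementation of the binary-search strategy; in the sketch below, the oracle is a hypothetical stand-in for a human labeler, and the specific threshold value is illustrative.

```python
def active_learn_threshold(oracle, eps):
    """Binary-search a threshold w in [0, 1] to precision eps using
    O(log(1/eps)) label queries, versus Theta(1/eps) passive samples."""
    lo, hi, queries = 0.0, 1.0, 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        queries += 1
        if oracle(mid):            # label 1 means mid lies at or above w
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2, queries

w_true = 0.6180339887              # hypothetical target threshold
estimate, queries = active_learn_threshold(lambda x: x >= w_true, eps=1e-6)
print(estimate, queries)           # ~0.618034 after only 20 queries
```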

Information-Theoretic Perspectives

From an information-theoretic viewpoint, sample complexity in PAC learning is fundamentally limited by the amount of information samples must convey to resolve ambiguity within the hypothesis class \mathcal{H}. Specifically, the number of samples m required satisfies m \geq \frac{\log |\mathcal{H}|}{I(Y; H)}, where I(Y; H) denotes the mutual information between the true label Y and the true hypothesis H \in \mathcal{H} under the underlying data distribution. This bound, derived via Fano's inequality applied to the multi-hypothesis testing problem of identifying the correct H, ensures that the cumulative information from m i.i.d. samples suffices to distinguish the true hypothesis from the others with high probability, thereby achieving the desired generalization guarantees. In noisy settings, such as when labels are corrupted by observation noise, the perspective incorporates channel capacity to quantify the effective information transmitted per sample. Here, the data-label channel models the mapping from true labels to observed (noisy) ones, with capacity C representing the maximum information transmissible per use. This formulation highlights how noise reduces the information yield per sample, inflating the required m proportionally to 1/C, as seen in analyses of Gaussian mixture models under label noise. Recent advances have further connected these ideas to practical learning guarantees. For instance, a 2023 analysis leverages the asymptotic equipartition property to derive sample complexity bounds tied directly to label accuracy and hypothesis class characteristics, offering guarantees on generalization that hold across varying sample sizes and contrasting with purely combinatorial approaches like VC dimension bounds. These information-theoretic tools provide a complementary lens to earlier combinatorial perspectives, revealing fundamental limits independent of algorithmic details.
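
A Fano-style bound of the form m \geq \log|\mathcal{H}| / I is straightforward to evaluate; in the sketch below, the per-sample information value is an assumed figure chosen purely for illustration.

```python
import math

def fano_lower_bound(h_size, info_per_sample):
    """Information-theoretic sample lower bound: m >= log2|H| / I,
    where I is the information (in bits) a single sample can convey."""
    return math.ceil(math.log2(h_size) / info_per_sample)

# 2^20 hypotheses; noisy labels assumed to yield ~0.5 bits per sample.
print(fano_lower_bound(2**20, info_per_sample=0.5))   # 40 samples minimum
```

Halving the per-sample capacity C doubles the bound, mirroring the 1/C inflation described above.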

Sample Complexity in Reinforcement Learning

Sample complexity also extends to reinforcement learning (RL), where agents learn policies from interactions with an environment. In tabular RL, model-based algorithms achieve sample complexities of \tilde{O}(S A H^3 / \epsilon^2) for finding an \epsilon-optimal policy in an MDP with S states, A actions, and horizon H, under standard assumptions. Model-free methods like Q-learning have higher complexities, often \tilde{O}(S^2 A H^3 / \epsilon^2). Recent works as of 2024 explore information-theoretic bounds, showing minimax lower bounds of order \Omega\left(\frac{S A}{(1-\gamma)^3 \epsilon^2}\right) for discounted infinite-horizon settings, emphasizing the role of exploration in resolving uncertainty about transition dynamics and rewards. These bounds inform efficient algorithms for large-scale RL applications.
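
The quoted tabular bounds give order-of-magnitude sample counts; the sketch below drops all constants and logarithmic factors, uses hypothetical MDP sizes, and should be read as illustrative only.

```python
def tabular_rl_samples(S, A, H, eps, model_based=True):
    """Order-of-magnitude sample counts from the bounds quoted above
    (constants and log factors omitted; illustrative only)."""
    if model_based:
        return S * A * H**3 / eps**2       # ~O(S A H^3 / eps^2)
    return S**2 * A * H**3 / eps**2        # model-free: extra factor of S

print(f"{tabular_rl_samples(100, 10, 20, 0.1):.1e}")                      # 8.0e+08
print(f"{tabular_rl_samples(100, 10, 20, 0.1, model_based=False):.1e}")   # 8.0e+10
```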

Applications

General Machine Learning

In machine learning practice, the curse of dimensionality poses significant challenges to sample complexity, particularly in high-dimensional spaces where the VC dimension of hypothesis spaces grows rapidly with the number of model parameters, necessitating far more samples for reliable generalization. For deep neural networks with piecewise-linear (e.g., ReLU) activations, the VC dimension scales as O(W L \log W), where W is the total number of weights and L is the depth, implying that models with billions of parameters—such as large transformers—require millions of examples to mitigate overfitting and achieve low generalization error. This underscores the practical difficulty in deep learning tasks, where insufficient samples lead to poor performance despite computational advances. To address these challenges, strategies like data augmentation and transfer learning effectively reduce the effective complexity of the hypothesis space without explicitly bounding it. Data augmentation expands limited datasets by applying transformations such as rotations, flips, and style transfers, effectively increasing sample diversity and improving classification accuracy on tasks like distinguishing dogs from cats by up to 7% with doubled effective dataset size. Transfer learning further enhances sample efficiency by initializing models with pre-trained features from large source datasets (e.g., ImageNet), constraining the search space to general low-level features in early layers while requiring far fewer target samples—often orders of magnitude less—for fine-tuning on new classification tasks. A key observation reconciling theory and practice is the double descent phenomenon, where overparameterized models exhibit a second descent in test error as complexity increases beyond the interpolation threshold, allowing low effective complexity and strong generalization despite high VC dimension. This behavior, observed across neural architectures and datasets in the 2020s, challenges classical bounds and enables billion-parameter models to succeed with sample sizes like ImageNet's 1.2 million images for image classification, achieving top-5 accuracies over 90%.
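
Combining the O(W L \log W) VC scaling with the realizable-case rate gives rough, worst-case sample estimates of the kind that double descent shows to be pessimistic in practice; the sketch below omits all constants and is only a back-of-envelope illustration with assumed network sizes.

```python
import math

def relu_vc_estimate(W, L):
    """Rough VC-dimension scale O(W L log W) for piecewise-linear networks
    (constants omitted)."""
    return W * L * math.log2(W)

def realizable_samples(d, eps, delta):
    """Realizable-case order Theta((d + log(1/delta)) / eps), constants omitted."""
    return (d + math.log(1 / delta)) / eps

d = relu_vc_estimate(W=10**6, L=20)                  # a one-million-weight net
print(f"{d:.1e}")                                    # ~4.0e8 effective capacity
print(f"{realizable_samples(d, 0.05, 0.01):.1e}")    # ~8.0e9 worst-case samples
```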

Robotics and Control Systems

In robotics and control systems, sample complexity is particularly critical due to the high cost and risk associated with physical interactions, where each trial can involve real-world hardware wear, energy consumption, or safety concerns. Unlike supervised learning scenarios with abundant i.i.d. data, robotic tasks often require learning dynamics models or control policies in sequential environments modeled as Markov Decision Processes (MDPs) with state space |S| and action space |A|. These interactions generate non-i.i.d. trajectories, amplifying the need for algorithms that minimize the number of episodes or steps to achieve reliable performance. In reinforcement learning (RL) for such systems, model-free methods like tabular Q-learning exhibit sample complexity bounds of \tilde{O}(|S| |A| (1 - \gamma)^{-4} / \epsilon^2) to find an \epsilon-optimal policy with high probability, assuming finite MDPs and appropriate exploration. Model-based approaches mitigate this by learning an approximate dynamics model from fewer real samples and then generating simulated trajectories for planning or policy optimization, thereby reducing the reliance on costly physical trials. For instance, in robotic locomotion or manipulation, these methods can achieve orders-of-magnitude improvements in sample efficiency by leveraging the learned model for virtual rollouts. A seminal framework for analyzing these bounds in MDPs is the PAC-MDP model introduced by Kearns and Singh, which guarantees that an algorithm outputs a policy with value within \epsilon of the optimal one using a polynomial number of episodes in |S|, |A|, 1/(1-\gamma), and 1/\epsilon. This enables bounds on approximate policy iteration, where the policy is refined iteratively using the estimated model or value function. In the 2020s, research has increasingly focused on sample-efficient RL for robots through meta-learning techniques, which enable rapid adaptation by reusing samples across related tasks, such as varying object grasps or terrains. For example, meta-RL algorithms like PEARL learn task embeddings from prior interactions, allowing robots to adapt policies with just a few dozen trials per new task in industrial insertion scenarios, significantly lowering the barrier for deployment in dynamic environments. As of 2025, further advances have improved sample efficiency for dexterous manipulation tasks, enabling robots to learn complex skills like peg insertion with fewer than 100 real-world interactions by incorporating human demonstrations.

References

  1. Hanneke, S. "The Optimal Sample Complexity of PAC Learning." Journal of Machine Learning Research, 2016.
  2. Valiant, L. G. "A Theory of the Learnable." Communications of the ACM, 1984.
  3. Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. "Learnability and the Vapnik–Chervonenkis Dimension." Journal of the ACM, 1989.
  4. Shalev-Shwartz, S., and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
  5. Vapnik, V. N., and Chervonenkis, A. Ya. "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities." 1971.
  6. "Information Complexity and Generalization Bounds." arXiv, 2021.
  7. "Towards a Unified Information-Theoretic Framework for Generalization." 2021.
  8. Kearns, M. J., Schapire, R. E., and Sellie, L. "Toward Efficient Agnostic Learning." Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992.
  9. "CS 391L: Machine Learning: Computational Learning Theory." Lecture notes.
  10. "The Vapnik–Chervonenkis Dimension." UPenn CIS.
  11. Jackson, J. C. "An Efficient Membership-Query Algorithm for Learning DNF with Respect to the Uniform Distribution."
  12. "Lecture 2: PAC Learnability — PAC Bound for Finite Hypothesis Classes." Lecture notes.
  13. "Machine Learning Theory, Lecture 3: PAC Bounds for Finite Concept Classes." Lecture notes, 2011.
  14. "Lecture 2: ERM, Finite Classes and the Uniform Convergence Property." Lecture notes.
  15. "Exact Lower Bounds for the Agnostic Probably-Approximately-Correct (PAC) Machine Learning Model."
  16. "The True Sample Complexity of Active Learning."
  17. Dasgupta, S. "Coarse Sample Complexity Bounds for Active Learning." UCSD CSE.
  18. Dasgupta, S., Kalai, A. T., and Monteleoni, C. "Analysis of Perceptron-Based Active Learning."
  19. "Agnostic Multi-Group Active Learning."
  20. Scarlett, J., and Cevher, V. "An Introductory Guide to Fano's Inequality with Applications in Statistical Estimation." arXiv, 2019.
  21. "On the Role of Channel Capacity in Learning Gaussian Mixture Models."
  22. "Information-Theoretic Perspective on Sample Complexity."
  23. Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. "Nearly-Tight VC-Dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks." arXiv, 2017.
  24. "The Effectiveness of Data Augmentation in Image Classification Using Deep Learning."
  25. Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. "How Transferable Are Features in Deep Neural Networks?" arXiv, 2014.
  26. Belkin, M., Hsu, D., Ma, S., and Mandal, S. "Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-Off."
  27. Krizhevsky, A., Sutskever, I., and Hinton, G. E. "ImageNet Classification with Deep Convolutional Neural Networks."
  28. "Complexity Analysis of Reinforcement Learning and Its Application to Robotics."
  29. "Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis." 2023.
  30. "High-Accuracy Model-Based Reinforcement Learning: A Survey." arXiv, 2021.
  31. Kearns, M., and Singh, S. "Near-Optimal Reinforcement Learning in Polynomial Time."
  32. "Meta-Reinforcement Learning for Robotic Industrial Insertion Tasks."