
Bootstrap aggregating

Bootstrap aggregating, commonly abbreviated as bagging, is an ensemble learning technique that enhances the performance of machine learning models by generating multiple instances of a base predictor, each trained on a random bootstrap sample of the original training data, and then aggregating their predictions to produce a final output. For regression tasks, predictions are typically averaged across the models, while for classification, a majority vote is used to determine the class label. This method leverages the bootstrap resampling technique, which draws samples with replacement from the training set, resulting in subsets that are approximately 63.2% unique on average, to introduce diversity among the estimators. Introduced by statistician Leo Breiman in his 1996 paper "Bagging Predictors," the approach was developed to address instability in certain learning algorithms, such as decision trees and neural networks, where small changes in the training data can lead to large variations in predictions. Breiman's work built on earlier ideas of averaging predictors but formalized bagging as a practical strategy, demonstrating its effectiveness through experiments on both real-world and simulated datasets. Theoretically, bagging reduces the variance of high-variance, unstable procedures without significantly increasing bias, making it particularly suitable for noisy data environments. Bagging has proven especially beneficial for tree-based methods, where empirical studies show error rate reductions of 20% to 47% in misclassification tasks compared to single models. It serves as a foundational technique in ensemble learning, influencing subsequent developments like Random Forests, which extend bagging by incorporating feature randomness to further decorrelate the base learners. Widely applied in fields such as finance and bioinformatics, bagging remains a robust method for improving predictive accuracy and stability in noisy, high-variance prediction settings.

Introduction and Background

Definition and Purpose

Bootstrap aggregating, commonly known as bagging, is an ensemble technique that generates multiple versions of a base predictor by training each on a distinct bootstrap sample drawn from the original training dataset, followed by aggregating their individual predictions to produce a final output. For regression tasks, aggregation typically involves averaging the predictions, while for classification, it employs a majority (plurality) voting mechanism to determine the most supported class. This approach leverages the bootstrap resampling method to introduce variability in the training data, enabling the creation of diverse models from the same base learner. The primary purpose of bagging is to reduce the variance of high-variance, unstable predictors, such as decision trees, thereby enhancing overall model accuracy and generalization performance without substantially increasing bias. Unstable learners like decision trees are highly sensitive to the training sample: small perturbations in the training data lead to significantly different models and predictions, and bagging mitigates this by averaging across multiple such models, smoothing out idiosyncratic errors and promoting more robust decision boundaries. As a result, bagging is particularly effective for unstable procedures where prediction instability amplifies variance, leading to improved accuracy on unseen data compared to a single instance of the base learner. At its core, the intuition behind bagging lies in the statistical principle that averaging predictions from moderately correlated models, each trained on perturbed data, diminishes the overall prediction error, as uncorrelated components of variance tend to cancel out. Mathematically, for a regression setting, the aggregated prediction is formulated as \hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b(x), where B denotes the number of bootstrap models and \hat{y}_b(x) is the prediction from the b-th model for input x. This averaging process exploits the diversity introduced by bootstrap sampling to yield a lower-variance ensemble, aligning with broader ensemble learning goals of variance reduction through model combination.
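As a concrete illustration of this aggregation rule, the following minimal NumPy sketch averages the outputs of an already-fitted list of per-bootstrap regression models; the function name and the assumption of scikit-learn-style .predict() methods are illustrative, not part of the original formulation:
import numpy as np

def bagged_predict_regression(models, X):
    # Stack per-model predictions into shape (B, n_samples), then average over B.
    # `models` is assumed to be a list of fitted estimators exposing .predict(X).
    per_model = np.stack([m.predict(X) for m in models], axis=0)
    return per_model.mean(axis=0)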

Bootstrap Sampling Fundamentals

The bootstrap method is a non-parametric resampling procedure introduced by Bradley Efron in 1979, which involves drawing samples with replacement from an original dataset D of size N to generate new datasets that approximate the sampling distribution of a statistic. In this process, each bootstrap dataset, denoted D_b for b = 1, \dots, B, is constructed by independently selecting N data points from D, where every original sample has an equal probability of 1/N of being chosen at each draw. This resampling mimics the empirical distribution of the original data, enabling statistical inference without parametric assumptions about the underlying population. A key characteristic of bootstrap sampling is the expected composition of each resampled dataset, in which approximately 63.2% of the original observations appear at least once due to the with-replacement nature of the draws. This proportion arises from the probability that a particular original sample is not selected in the bootstrap set, calculated as (1 - 1/N)^N, which converges to 1/e \approx 0.3679 as N grows large, leaving the expected fraction of distinct original samples as 1 - (1 - 1/N)^N \approx 1 - 1/e \approx 0.632. Consequently, each D_b contains duplicates of some observations while excluding others; the excluded observations, known as out-of-bag samples, on average comprise about 36.8% of the original dataset. In the context of bootstrap aggregating, or bagging, as formalized by Leo Breiman in 1996, this resampling generates B diverse training sets on which multiple base models are fitted, thereby introducing variability in the training data that helps decorrelate the predictions across the ensemble. The bootstrap's ability to replicate the original empirical distribution ensures that each D_b serves as a reliable proxy for the full dataset D, while the induced diversity reduces the covariance between base learners without requiring additional data collection. Furthermore, the method facilitates variance estimation of model parameters or predictions by treating the bootstrap replicates as pseudo-samples from the population, a property that underpins its utility in ensemble learning.
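A short simulation makes the 63.2%/36.8% split concrete; this sketch simply draws one bootstrap sample of indices with NumPy and measures the unique and out-of-bag fractions (the variable names and sample size are illustrative):
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# One bootstrap sample: N indices drawn with replacement from {0, ..., N-1}.
boot_idx = rng.integers(0, N, size=N)

unique_fraction = np.unique(boot_idx).size / N   # close to 1 - 1/e ~ 0.632
oob_fraction = 1.0 - unique_fraction             # close to 1/e ~ 0.368
print(unique_fraction, oob_fraction)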

Ensemble Learning Prerequisites

Ensemble learning is a machine learning paradigm that combines predictions from multiple base models, known as base learners, to achieve superior performance compared to a single model. This approach leverages the collective strength of diverse models to enhance accuracy, reduce errors, and improve robustness on unseen data. Ensemble methods are broadly categorized into three main types: bagging, which operates in parallel to primarily reduce variance; boosting, which builds models sequentially to focus on reducing bias by correcting the errors of earlier models; and stacking, which uses a meta-learner to combine outputs from heterogeneous base models. Central to the effectiveness of ensembles are principles such as the "wisdom of the crowds," where aggregating diverse predictions approximates a better overall estimate than any individual model, and the promotion of diversity among base learners to minimize correlated errors and amplify strengths. Diversity ensures that errors made by one model are not replicated across the ensemble, leading to more stable and accurate collective decisions. Bootstrap sampling serves as a key mechanism to introduce this diversity by generating varied training subsets. Base learners in ensemble methods are typically weak or unstable algorithms that perform reasonably well but exhibit high variance, such as unpruned decision trees, making them particularly amenable to averaging or weighting in an ensemble to smooth out inconsistencies. These models benefit from the ensemble framework because their individual instabilities are mitigated through combination, without requiring extensive tuning of the base algorithm itself. Compared to single models, ensembles generally offer improved predictive accuracy and greater robustness to noise and outliers, albeit at the expense of increased computational resources due to training and aggregating multiple models. This trade-off is often justified in scenarios where model reliability is paramount, as the ensemble's aggregated output provides a more reliable approximation of the underlying data distribution.

Core Algorithm

Steps in Bagging

Bootstrap aggregating, or bagging, follows a straightforward procedure to construct an ensemble of base learners. The process begins with generating B bootstrap samples from the original training dataset L of size n, where each sample L_b is drawn with replacement and typically has the same size n. This step leverages the bootstrap method to introduce variability among the samples, allowing for diverse training instances while preserving the empirical distribution of the data. Next, for each bootstrap sample L_b, where b = 1, \dots, B, an independent model f_b is trained using the same learning algorithm. The choice of base learner is crucial; bagging is most effective with unstable learners, such as unpruned decision trees or subset selection in linear regression, whose predictions vary significantly with small changes in the training data. Stable learners, like linear regression with all predictors or k-nearest neighbors, yield minimal benefits from aggregation. The training of each f_b can occur in parallel, as the models are independent, facilitating computational efficiency on multiprocessor systems. Finally, predictions from the B base models are combined to form the ensemble prediction \hat{f}. For regression tasks, where the outcome is numerical, the aggregation computes the average: \hat{f}(x) = \frac{1}{B} \sum_{b=1}^B f_b(x). For classification tasks, with categorical outcomes, a plurality (majority) vote is used: the class receiving the most votes across the f_b(x) is selected as \hat{f}(x). In cases of ties during voting, strategies include selecting an odd B to reduce the probability of ties or applying a tie-breaking rule, such as random selection among tied classes or choosing the class with the lowest index. Key parameters in bagging include the number of bootstrap samples B and the base learner type. Originally, Breiman recommended B = 25 for regression and B = 50 for classification, noting that performance stabilizes after these values in experiments on datasets such as the waveform and simulated two-class Gaussian problems. In modern implementations, larger B values, such as 100 to 500, are commonly used to further reduce variance until error estimates plateau, balancing accuracy gains against computational cost. The base learner is selected based on the problem domain, with decision trees being a frequent choice due to their instability and interpretability. The algorithm can be expressed in pseudocode as follows:
Algorithm Bagging(L, B, BaseLearner):
    for b = 1 to B do
        Draw bootstrap sample L_b from L (with replacement, |L_b| = |L|)
        Train f_b = BaseLearner(L_b)
    end for
    
    To predict for input x:
        if Regression:
            return (1/B) * sum_{b=1 to B} f_b(x)
        else:  // Classification
            Compute votes: for each class c, count = |{b : f_b(x) = c}|
            return argmax_c count  // plurality vote, with tie-breaker if needed
This structure ensures the ensemble's robustness by averaging out instabilities in individual models.
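A runnable Python rendering of this pseudocode is sketched below; it assumes NumPy arrays and scikit-learn-style base learners with fit/predict methods, and the helper names are illustrative rather than part of Breiman's formulation:
import numpy as np
from collections import Counter
from copy import deepcopy

def bagging_fit(X, y, base_learner, B=100, random_state=0):
    # Train B copies of the base learner, each on a bootstrap sample of (X, y).
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # draw n indices with replacement
        models.append(deepcopy(base_learner).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X, task="classification"):
    preds = np.stack([m.predict(X) for m in models])  # shape (B, n_samples)
    if task == "regression":
        return preds.mean(axis=0)                     # average the numeric outputs
    # Plurality vote per sample; ties resolved by the first class encountered.
    return np.array([Counter(column).most_common(1)[0][0] for column in preds.T])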

Out-of-Bag Error Estimation

In bootstrap aggregating (bagging), each bootstrap sample is generated by sampling with replacement from the original training set of size N, such that approximately 63.2% of the distinct data points are included in the sample and the remaining ~37% are left out; these excluded points are termed out-of-bag (OOB) samples for that particular base learner. The OOB mechanism leverages these naturally occurring validation sets to enable error estimation without partitioning the data or training additional models. For a given data point i, its OOB prediction \hat{y}_{OOB,i} is obtained by aggregating (e.g., via majority vote for classification or averaging for regression) the predictions from all base learners whose bootstrap samples did not include i. The overall OOB error is then computed as the average prediction error across all data points: \text{OOB error} = \frac{1}{N} \sum_{i=1}^N L(y_i, \hat{y}_{OOB,i}), where L is the loss function (e.g., 0-1 loss for classification or squared error for regression). This yields a nearly unbiased estimate of the generalization error, as each point is evaluated on models trained independently of it. The primary advantages of OOB error estimation include its computational efficiency, since no separate validation or holdout set is required, allowing full use of the training data. Additionally, empirical evaluations demonstrate that OOB estimates are about as accurate as those from the .632+ bootstrap or 10-fold cross-validation, often with lower variance due to the larger effective sample size.
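The sketch below computes an OOB estimate for a regression base learner under squared-error loss, tracking which observations each bootstrap sample excluded; it assumes NumPy arrays and a scikit-learn-style estimator, and the function name is illustrative:
import numpy as np
from copy import deepcopy

def oob_mse(X, y, base_learner, B=200, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(X)
    pred_sum = np.zeros(n)
    pred_count = np.zeros(n)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # bootstrap sample indices
        oob = np.setdiff1d(np.arange(n), idx)         # observations left out of this bag
        model = deepcopy(base_learner).fit(X[idx], y[idx])
        pred_sum[oob] += model.predict(X[oob])
        pred_count[oob] += 1
    covered = pred_count > 0                          # points that were OOB at least once
    oob_pred = pred_sum[covered] / pred_count[covered]
    return np.mean((y[covered] - oob_pred) ** 2)      # average squared-error loss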

Mathematical Formulation

Bootstrap aggregating, or bagging, formalizes the aggregation of predictions from multiple base learners trained on bootstrap samples of the training data. For regression tasks, the bagged predictor at input x is given by the average over B base predictors: \hat{f}(x) = \frac{1}{B} \sum_{b=1}^B f_b(x), where each f_b(x) is the output of the base learner fitted to the b-th bootstrap replicate of the training set. For classification, the predicted class is determined by majority vote, equivalently the argmax over class proportions: \hat{y}(x) = \arg\max_k \frac{1}{B} \sum_{b=1}^B I(f_b(x) = k), with I(\cdot) the indicator function and k ranging over possible classes. These formulations assume the base learners are identically distributed across bootstrap samples, drawn from the empirical distribution of the data. Bagging achieves variance reduction through this averaging process. Consider base predictors with common variance \sigma^2; if they were independent, the variance of the bagged predictor would be \sigma^2 / B. However, since bootstrap samples introduce correlation among the predictors, the actual variance is \rho \sigma^2 + (1 - \rho) \sigma^2 / B, where \rho is the pairwise correlation between base predictors, leading to a moderated reduction compared to the independent case. This effect is most pronounced for unstable base learners, such as unpruned decision trees, where high \sigma^2 amplifies the benefits. Under standard assumptions of independent and identically distributed (i.i.d.) data from an underlying distribution P, and a given learning algorithm, the bagged predictor is asymptotically consistent as the sample size n \to \infty and B \to \infty. Specifically, it converges in probability to the expected predictor \phi_A(x; P) = E_L[\phi(x; L)], where L denotes a random learning set drawn from P; this limit retains the bias of the single learner while minimizing variance contributions from sampling variability. The out-of-bag (OOB) error formalizes a nearly unbiased estimate of generalization performance without requiring a held-out test set. For each training observation i, the OOB prediction is the average (or majority vote) over the subset of base predictors whose bootstrap samples exclude i, which occurs for approximately e^{-1} \approx 0.368 of the bags under the standard bootstrap. The overall OOB error is then the average loss over all such per-observation predictions, assuming approximate independence between inclusion status and prediction errors.
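The stated variance expression follows from the standard formula for the variance of an average of B identically distributed predictors with common variance \sigma^2 and common pairwise correlation \rho; under those simplifying assumptions, \operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^B f_b(x)\right) = \frac{1}{B^2}\left[\sum_{b=1}^B \operatorname{Var}(f_b(x)) + \sum_{b \neq b'} \operatorname{Cov}(f_b(x), f_{b'}(x))\right] = \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right] = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2, which reduces to \sigma^2 / B when \rho = 0 and approaches \rho\sigma^2 as B \to \infty.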

Applications and Variants

Bagging with Decision Trees

Decision trees are particularly well-suited for bootstrap aggregating, or bagging, due to their inherent instability and high variance as base learners. Small perturbations in the training data can lead to substantial changes in the tree structure, such as different splits at internal nodes, resulting in diverse predictors across bootstrap samples. This instability allows bagging to effectively reduce variance by averaging or voting over multiple trees, smoothing out erratic predictions without introducing additional bias. In the bagging process applied to decision trees, multiple unpruned trees are trained on distinct bootstrap samples drawn with replacement from the original dataset, capturing the variability in tree structures induced by sampling noise. For classification tasks, predictions from these trees are aggregated via majority voting, while regression tasks use averaging of the outputs. This approach uses the full feature set for each tree, without subsampling features, which highlights the benefits of bootstrap-induced diversity in reducing overfitting to the data. Empirically, bagging with decision trees has demonstrated notable improvements in predictive accuracy, especially on noisy datasets. For instance, on the waveform classification dataset, which includes artificial noise, the test set error rate decreased from 29.0% for a single tree to 19.4% for the bagged ensemble, a relative error reduction of roughly 33% (about 10 percentage points). Similar relative error reductions, typically in the range of 20% to 47%, have been observed across other benchmarks, underscoring bagging's value in stabilizing tree-based models.
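A brief scikit-learn sketch of this setup, using a synthetic noisy dataset as a stand-in for the benchmarks above; the estimator parameter is named estimator in recent scikit-learn releases (older versions call it base_estimator):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (flip_y injects label noise).
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
bagged_trees = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # unpruned trees, full feature set
    n_estimators=100,
    random_state=0,
).fit(X_tr, y_tr)

print("single tree accuracy :", single_tree.score(X_te, y_te))
print("bagged trees accuracy:", bagged_trees.score(X_te, y_te))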

Relation to Random Forests

Random forests represent a direct extension of bootstrap aggregating (bagging) specifically tailored for decision tree ensembles, introducing an additional layer of randomness to enhance performance. In bagging, multiple decision trees are constructed on bootstrap samples of the training data, and their predictions are aggregated to reduce variance. Random forests build upon this by modifying the tree induction process: at each node of every tree, instead of considering all available features for the split, a randomly selected subset of features is evaluated. Typically, the size of this subset is set to \sqrt{p} for classification problems, where p is the total number of features, or p/3 for regression, though smaller values like \log_2(p) + 1 can also be used depending on the dataset. This feature randomness, first proposed by Leo Breiman, aims to further decorrelate the individual trees beyond what bootstrap sampling alone achieves. The core algorithmic tweak in random forests involves integrating this feature selection mechanism into the standard bagging procedure during the tree-growing phase. Specifically, while the bootstrap sampling of instances remains unchanged—drawing with replacement from the original training set to create diverse training sets for each tree—the splitting criterion at internal nodes is altered to draw a fresh random sample of features for consideration. This process is repeated independently at every split and for every tree in the forest, ensuring that no single feature dominates across the ensemble. By limiting the feature pool at each decision point, random forests prevent the trees from becoming overly similar, as would occur if all trees were grown using the full set of features on similar bootstrap samples. This increased diversity among trees leads to lower correlation \rho between their predictions, which, according to the bias-variance-covariance decomposition for ensembles, contributes to greater overall variance reduction without substantially increasing bias. A key difference from plain bagging lies in this dual randomization: bootstrap sampling addresses data variability, while feature subsampling tackles attribute redundancy, particularly beneficial in scenarios with many irrelevant or correlated features. In bagging with unpruned decision trees, trees can still exhibit high correlation if the dataset has a small number of informative features that consistently lead to similar splits. Random forests mitigate this by enforcing variability in feature consideration, resulting in more robust ensembles that generalize better. Empirical studies in the original formulation demonstrate that this approach yields lower prediction errors compared to bagging, especially as the number of features increases, with error rates approaching the Bayes error in high-dimensional settings—for instance, achieving around 2.8% error on datasets with 1,000 inputs using modest subset sizes like 25 features. Overall, random forests are typically superior to plain bagging on high-dimensional data, where feature noise or dimensionality can amplify variance in standard ensembles. This superiority stems from the enhanced decorrelation, allowing random forests to maintain low variance while preserving the low bias of individual deep trees. The method's effectiveness has been validated across diverse applications, confirming its role as a refined bagging variant that leverages both instance and feature randomness for improved predictive accuracy.
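In scikit-learn terms, the contrast between plain bagging of trees and a random forest largely reduces to the per-split feature subsampling controlled by max_features, as in this illustrative comparison on synthetic high-dimensional data (results depend on the dataset; the parameter values shown are assumptions, not tuned settings):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Many features, few of them informative, to mimic a high-dimensional setting.
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, random_state=0)

bagged_trees = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=200, random_state=0)
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

print("bagging       CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
print("random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())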

Other Ensemble Extensions

Bootstrap aggregating, commonly known as bagging, extends beyond decision trees to other base learners, including neural networks and support vector machines, where it helps mitigate overfitting by averaging predictions from multiple models trained on bootstrapped subsets of the data. In the case of neural networks, bagged neural networks (BNNs) train each network on a different bootstrap sample, reducing variance and improving generalization in deep models prone to overfitting due to their high capacity. Similarly, for support vector machines (SVMs), bagging ensembles construct multiple SVM classifiers on bootstrap samples and aggregate their outputs, enhancing classification performance on real-world datasets where individual SVMs may underperform due to sensitivity to noise or outliers. This approach is particularly effective for SVMs in classification tasks, as demonstrated in empirical studies showing improved accuracy over standalone SVMs. A notable tree-based variant related to bagging is Extremely Randomized Trees (Extra-Trees), which introduces greater randomization by selecting split thresholds uniformly at random from the input feature range rather than optimizing them, leading to faster training times compared to random forests while maintaining comparable predictive accuracy. Developed for supervised classification and regression, Extra-Trees build an ensemble of such randomized trees, by default grown on the full learning sample rather than bootstrap replicates, reducing computational overhead in high-dimensional spaces without sacrificing robustness. Another adaptation, pasting, modifies bagging by sampling training subsets without replacement, which is advantageous for smaller datasets to ensure diverse yet non-overlapping instances across ensemble members, thereby preserving data efficiency. Post-2010 developments have integrated bagging into hybrid frameworks, such as combinations with boosting algorithms, where bagging stabilizes variance while boosting focuses on bias correction, yielding superior performance in tasks like intrusion detection and image classification. For instance, hybrid bagging-boosting models have shown enhanced accuracy on imbalanced datasets by leveraging bagging's diversity alongside boosting's sequential error correction. Additionally, online bagging adaptations enable learning from data streams, updating ensemble members dynamically with new bootstrap-like samples to support anytime learning in evolving environments. Recent work (as of 2025) includes bagging enhancements for financial forecasting and for robustness against unlearnable or adversarial data in machine learning pipelines. These extensions highlight bagging's versatility in modern machine learning workflows, particularly for non-stationary or resource-constrained settings.
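Two of these variants are directly available in scikit-learn; the sketch below shows a pasting-style SVM ensemble (sampling without replacement via bootstrap=False) and an Extra-Trees classifier, with illustrative parameter choices rather than recommended settings:
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Pasting: bagging-style ensemble whose subsets are drawn WITHOUT replacement.
pasting_svm = BaggingClassifier(
    estimator=SVC(), n_estimators=25, max_samples=0.6, bootstrap=False, random_state=0
).fit(X, y)

# Extremely Randomized Trees: randomized split thresholds on top of feature subsampling.
extra_trees = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)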

Theoretical Analysis

Bias-Variance Decomposition

In the context of regression tasks, the expected prediction error of a learning algorithm can be decomposed into three components: the squared bias, the variance, and the irreducible noise. Specifically, for a predictor \hat{f}(x; D) trained on dataset D, the expected mean squared error is given by \mathbb{E}_{D, y} \left[ (y - \hat{f}(x; D))^2 \right] = \left( \mathbb{E}_{D} [\hat{f}(x; D)] - \mathbb{E}[y \mid x] \right)^2 + \mathbb{E}_{D} \left[ (\hat{f}(x; D) - \mathbb{E}_{D} [\hat{f}(x; D)])^2 \right] + \sigma^2, where the first term represents the squared bias (systematic deviation from the true regression function), the second term captures the variance (sensitivity to fluctuations in the training data D), and \sigma^2 is the variance of the noise in the data. This decomposition highlights the fundamental bias-variance tradeoff in statistical learning, where reducing bias often increases variance and vice versa. Bootstrap aggregating, or bagging, primarily targets the variance component of this decomposition by generating multiple bootstrap resamples of the training data and averaging the predictions from base learners fitted to each resample. The averaging process approximately preserves the expectation of the predictor, resulting in negligible change to the bias term, as the ensemble mean aligns closely with the bias of the individual base models. However, for unstable base learners—those highly sensitive to small perturbations in the training data, such as unpruned decision trees—bagging substantially reduces variance by smoothing out idiosyncratic fluctuations across the ensemble. To illustrate the variance reduction quantitatively in regression, consider an ensemble of B base predictors \hat{f}_b(x), each with variance \sigma^2, and average pairwise correlation \rho between any two predictors. The variance of the bagged predictor \hat{f}_B(x) = \frac{1}{B} \sum_{b=1}^B \hat{f}_b(x) approximates \sigma^2 \left( \rho + \frac{1 - \rho}{B} \right), which approaches \rho \sigma^2 as B grows large and can be much lower than \sigma^2 when \rho is small. This reduction is most pronounced when the base learners exhibit low correlation, a property encouraged by the diversity introduced via bootstrap sampling. For bagging to effectively lower overall error through variance reduction, the base learning algorithm must inherently possess low bias but high variance; stable learners with already low variance, such as linear regression, yield minimal benefits from aggregation.
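The variance-reduction claim can be checked numerically with a small Monte Carlo experiment that repeatedly redraws the training set and records predictions at a fixed test point; the sketch below uses scikit-learn trees and a synthetic data-generating process, and all settings are illustrative:
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.array([[0.5]])
single_preds, bagged_preds = [], []

# Variance over training sets: re-simulate the data, refit, and predict at x_test.
for _ in range(200):
    X = rng.uniform(0, 1, size=(200, 1))
    y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.3, size=200)
    single_preds.append(DecisionTreeRegressor().fit(X, y).predict(x_test)[0])
    bagged = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50, random_state=0)
    bagged_preds.append(bagged.fit(X, y).predict(x_test)[0])

print("prediction variance, single tree :", np.var(single_preds))
print("prediction variance, bagged trees:", np.var(bagged_preds))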

Error Bounds and Convergence

Bootstrap aggregating, or bagging, exhibits strong convergence properties as the number of bootstrap replicates B increases. Specifically, the bagged predictor, which is the average (for regression) or majority vote (for classification) over B base predictors trained on bootstrap samples, converges almost surely to the infinite ensemble average—the expected value of the base predictor under the empirical distribution of the training data—as B \to \infty. This result follows directly from the strong law of large numbers applied to the sequence of bootstrap predictors, assuming they are independent and identically distributed under the bootstrap measure. In practice, this infinite ensemble serves as a stable approximation to the true expected predictor under the underlying data distribution, provided the bootstrap samples faithfully represent the original data. Theoretical error bounds for bagging focus on the excess risk, defined as the difference between the risk of the bagged predictor and the optimal Bayes risk. For general base learners, recent analyses provide finite-sample upper bounds on the excess risk that include a term scaling as O(1/\sqrt{B}), derived from the variance reduction achieved by averaging correlated bootstrap estimates. These bounds hold under mild conditions on the base learner, without requiring specific margin assumptions on the data distribution, and demonstrate that bagging can achieve guarantees comparable to O(1/\sqrt{n}) in sample size n when combined with appropriate bootstrap subsample sizes. In Breiman's foundational work, initial results showed that the aggregated predictor's mean squared error is no greater than the average error of the individual predictors, with the largest improvement for unstable procedures; the gain is quantified via the elementary inequality (E[Z])^2 \leq E[Z^2] (an instance of Jensen's inequality) applied to the squared error, where Z represents prediction deviations. Bagging's stability analysis highlights its role in mitigating the high variance of unstable base learners, such as unpruned decision trees or neural networks, by averaging out fluctuations across bootstrap replicates. Breiman established that for such procedures, bagging reduces the overall error by stabilizing the predictions, with empirical and theoretical evidence showing variance reduction to a level of roughly \rho \sigma^2, where \rho is the average correlation among bootstrap predictors and \sigma^2 is the base variance. This stabilization is less pronounced for inherently stable learners like linear regression, where bagging may even slightly increase error due to the introduced variability. Despite these guarantees, theoretical bounds for bagging are tighter and more straightforward in regression settings, where the error decomposes cleanly into bias and variance components, compared to classification, where the discrete 0-1 loss complicates risk analysis and bounds often rely on additional approximations. All such analyses assume the training data are independent and identically distributed (i.i.d.), ensuring the bootstrap samples are valid resampling approximations; violations of i.i.d., such as in time-series data, can degrade performance and invalidate the bounds.
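As a rough empirical counterpart to the convergence in B, the following scikit-learn sketch tracks test error as the ensemble grows; the dataset and grid of B values are illustrative, and on typical runs the error falls quickly and then flattens:
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for B in (1, 5, 10, 25, 50, 100, 200):
    clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=B, random_state=0)
    clf.fit(X_tr, y_tr)
    print(B, round(1 - clf.score(X_te, y_te), 4))     # test error for this ensemble size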

Empirical Performance Metrics

Empirical studies have demonstrated that bootstrap aggregating, or bagging, consistently improves predictive performance over single base learners, particularly when applied to unstable algorithms like decision trees. In classification tasks on UCI repository datasets such as Waveform, Heart, and Breast Cancer, bagging reduced misclassification rates by 20% to 47% relative to single trees, translating to absolute test error reductions of approximately 5 to 10 percentage points on average. For regression problems, including Boston Housing and the simulated Friedman datasets, bagging achieved mean squared error (MSE) reductions of 22% to 46%, with notable gains in noisy environments where single models overfit. These improvements stem from variance reduction through averaging or majority voting across bootstrap replicates, leading to more stable predictions without altering bias significantly. Bagging excels empirically on noisy and high-dimensional data, where base learners exhibit high variance. For instance, on high-dimensional, noise-prone benchmarks such as the Ionosphere dataset (34 features), bagging lowered test errors by roughly 23% to 27% compared to single classifiers. Conversely, it provides minimal benefits for low-variance, stable models such as linear regression or nearest neighbors, where ensemble averaging yields negligible error reductions due to the inherent stability of the base predictors. Aggregate benchmarks across UCI datasets confirm these patterns, with bagging most effective when the stability of the base learner is low to moderate, enhancing robustness in real-world applications like medical diagnostics or environmental modeling. Comparisons with other ensembles highlight bagging's strengths in variance-heavy scenarios. Relative to single decision trees, bagging delivers major gains, often cutting test errors by 5-15% in tree-based setups on UCI benchmarks. Versus boosting methods like AdaBoost, bagging is complementary, performing better on datasets dominated by variance issues rather than bias, though boosting generally achieves higher overall accuracy (e.g., 24-31% relative error reduction vs. bagging's 4% in some comparative studies). In noisy conditions, bagging's uniform resampling avoids overfitting to outliers more reliably than boosting's weighted emphasis on hard examples. Out-of-bag (OOB) error serves as a reliable proxy for test error in bagging, providing a nearly unbiased estimate comparable to cross-validation and enabling efficient performance assessment without separate validation sets. This validates bagging's practical utility in high-dimensional, noisy settings.
Dataset          Single Tree Error (%)    Bagged Error (%)    Relative Reduction (%)
Waveform         29.0                     19.4                33
Heart            10.0                     5.3                 47
Breast Cancer    6.0                      4.2                 30
Ionosphere       11.2                     8.6                 23
This table illustrates representative classification results from UCI datasets, underscoring bagging's consistent error mitigation.

Practical Implementation

Algorithm for Classification

In classification tasks, bootstrap aggregating, or bagging, involves training multiple base classifiers on bootstrap samples of the training data, where each classifier outputs class labels or probability estimates for a given input. The final prediction is obtained by aggregating these outputs: for hard predictions (class labels), a majority vote (or plurality vote for multi-class problems) determines the predicted class, selecting the one receiving the most votes across the ensemble. Alternatively, if base classifiers provide class probability estimates \hat{p}(j \mid x) for each class j, bagging averages these probabilities over all classifiers to yield \hat{p}_B(j \mid x) = \frac{1}{B} \sum_{b=1}^B \hat{p}_b(j \mid x), with the predicted class given by \arg\max_j \hat{p}_B(j \mid x). For multi-class classification problems involving more than two classes, bagging employs direct plurality voting on the hard predictions from each base classifier, without requiring decomposition into one-vs-all or one-vs-one strategies. This approach naturally extends the binary case, where plurality voting coincides with majority voting; ties can be resolved by a deterministic rule, such as choosing the lowest-indexed class, or broken at random (with a fixed seed if reproducibility matters). When using averaged probability outputs, thresholding is applied to determine the class assignment; for binary classification, a threshold of 0.5 is commonly used on the averaged probability for the positive class, though this can be tuned based on the problem's requirements, such as class imbalance. In multi-class settings, no explicit thresholding is needed beyond the argmax operation, but probability calibration can refine decision boundaries if necessary. The following pseudocode outlines the bagging algorithm adapted for classification, emphasizing the voting mechanism (assuming B base classifiers and a training set with N examples):
Algorithm Bagging-Classification(D_train, B, BaseClassifier):
    Input: Training data D_train = {(x_i, y_i)} for i=1 to N, number of bootstrap iterations B, base classifier type
    Output: Ensemble classifier H(x)

    for b = 1 to B do:
        Draw bootstrap sample D_b of size N from D_train (with replacement)
        Train base classifier h_b = BaseClassifier(D_b)
    
    // Prediction function for new input x
    H(x):
        Initialize vote counts: votes[j] = 0 for each class j
        for b = 1 to B do:
            y_b = h_b(x)  // Predicted class label from b-th classifier
            votes[y_b] += 1
        return argmax_j votes[j]  // Class with plurality (majority) vote; random tie-break if needed
        
    return H
This procedure follows the general steps of bagging but focuses on vote aggregation for classification.
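For the soft-aggregation variant described above, the following sketch averages class-probability estimates; it assumes every fitted classifier exposes predict_proba with the same class ordering (true for scikit-learn estimators that saw all classes during training), and the function name is illustrative:
import numpy as np

def bagged_predict_proba(classifiers, X):
    # Average the (n_samples, n_classes) probability matrices across the B classifiers,
    # then pick the class with the highest mean probability for each sample.
    probs = np.stack([clf.predict_proba(X) for clf in classifiers], axis=0)  # (B, n, K)
    avg_probs = probs.mean(axis=0)                                           # (n, K)
    return avg_probs, avg_probs.argmax(axis=1)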

Ozone Dataset Example

The ozone dataset, analyzed by Breiman in his seminal work on bagging, consists of 330 complete observations of daily maximum ozone concentrations in the Los Angeles area during the summer months, paired with eight meteorological predictor variables, including temperature, humidity, and pressure-related measurements. These features capture environmental factors influencing ozone levels, making the dataset a classic example for regression tasks in statistical learning. In applying bagging to this dataset, Breiman used regression trees as base learners, constructing an ensemble of 25 trees via bootstrap sampling from the training data. The process begins by randomly partitioning the data into a learning set (approximately 85% of the data) and a test set (15%), followed by growing an initial regression tree on the learning set with subtree size selected via 10-fold cross-validation to balance bias and variance. Bootstrap samples are then drawn with replacement from the learning set, and an unpruned tree is fitted to each sample; the bagged predictor aggregates these by averaging the individual tree predictions for new instances. This approach yields a mean squared error (MSE) of 18.0 on the test set, compared to 23.1 for a single pruned tree, representing a 22% reduction in error attributable to the averaging that mitigates the high variance of individual trees. Out-of-bag (OOB) estimates provide an internal validation mechanism in bagging, where predictions for each observation are averaged only over trees whose bootstrap samples excluded it, offering a nearly unbiased MSE approximation without a separate test set. Overall, the application illustrates bagging's effectiveness in reducing prediction variance on real-world environmental data prone to instability in single tree models, enhancing reliability for ozone prediction without altering the underlying base learning algorithm.
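The original ozone data are not bundled with common Python libraries, so the sketch below reruns the same procedure (85%/15% split, 25 bagged regression trees, test-set MSE against a single tree) on Friedman's simulated regression problem as a stand-in; the numbers it prints are not the ozone results reported above:
from sklearn.datasets import make_friedman1
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in data with the same sample size as the ozone example (330 observations).
X, y = make_friedman1(n_samples=330, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)

single = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
bagged = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=25,
                          random_state=0).fit(X_tr, y_tr)

print("single tree test MSE :", mean_squared_error(y_te, single.predict(X_te)))
print("bagged trees test MSE:", mean_squared_error(y_te, bagged.predict(X_te)))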

Computational Considerations

The computational cost of bootstrap aggregating (bagging) scales linearly with the number of bootstrap iterations B and the training time T of the base learner, yielding an overall complexity of O(B \cdot T). This arises because each of the B models is trained independently on a bootstrap sample of the training data. Since the models do not depend on one another, bagging lends itself to efficient parallelization, enabling simultaneous training across multiple processors with no need for inter-processor communication, which can significantly reduce wall-clock time on multi-core systems. Memory usage in bagging implementations typically scales with B, as all base models must be stored to aggregate predictions during inference; for instance, scikit-learn's BaggingClassifier retains the fitted estimators and their associated bootstrap samples in attributes like estimators_ and estimators_samples_. Alternatives such as online aggregation mitigate this by incrementally updating ensemble predictions without storing the full set of models, particularly useful in streaming or resource-constrained environments. Out-of-bag (OOB) estimation, which approximates generalization error, incurs additional minor memory overhead for tracking bootstrap samples to identify OOB instances per model. Bagging is readily available in established software libraries, facilitating practical deployment. In Python's scikit-learn, the BaggingClassifier supports parallel execution via the n_jobs parameter, which can be set to -1 to use all available processors (default: None, meaning 1 job), and enables tuning of B (via n_estimators, default 10) using OOB scores computed with oob_score=True. Similarly, R's ipred package implements bagging through its bagging function (default B=25), with options for OOB error estimation (coob=TRUE) and control over sample sizes. To enhance scalability on large training sets, subsampling techniques—such as drawing bootstrap samples smaller than the full dataset (sometimes called subagging)—can reduce both training time and memory demands while often maintaining accuracy; the ipred package supports this via the ns argument. Additionally, B can be tuned adaptively by monitoring OOB error, with early experiments indicating diminishing returns beyond roughly 25 iterations, allowing training to stop early and avoid unnecessary computation.
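In scikit-learn these practical knobs appear directly as constructor arguments; a short sketch combining parallel training, subsampled bags, and OOB scoring (parameter values are illustrative):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.5,     # each bag uses half the training set (subsampling)
    oob_score=True,      # compute an out-of-bag accuracy estimate during fit
    n_jobs=-1,           # train the bags in parallel on all available cores
    random_state=0,
).fit(X, y)

print("OOB accuracy estimate:", clf.oob_score_)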

Evaluation

Advantages

Bootstrap aggregating, commonly known as bagging, primarily excels in reducing the variance of predictions, particularly when applied to unstable base learners such as decision trees and neural networks. By averaging the outputs from multiple models trained on bootstrap samples of the data, bagging mitigates the impact of fluctuations in individual predictors, leading to more stable and robust ensemble predictions. This variance reduction is most pronounced for unstable procedures where small perturbations in the training data cause large changes in the model output, effectively smoothing out noise and improving generalization performance. Unlike sequential ensemble methods such as boosting, bagging requires minimal hyperparameter tuning, typically limited to the number of bootstrap iterations B and the choice of base learner, making it simpler to implement and less prone to overfitting through excessive configuration. Additionally, bagging incorporates out-of-bag (OOB) estimation as a built-in mechanism for unbiased error assessment; each bootstrap sample leaves out approximately 37% of the training data, which can be used to evaluate the ensemble without requiring a separate validation set, providing an efficient and reliable proxy for generalization error. The independent training of base models on separate bootstrap samples enables bagging to be highly parallelizable, with no interdependencies between learners, facilitating efficient distribution across multi-core processors or distributed systems and reducing overall training time for large datasets. When using decision trees as base learners, bagging also supports interpretability by allowing the aggregation of variable importance measures across the ensemble, such as the average decrease in impurity from splits on each feature, which provides a more reliable ranking of predictor relevance than a single tree.

Disadvantages and Limitations

Bootstrap aggregating, or bagging, primarily addresses variance in predictions but offers limited benefits in reducing bias, especially for models that already exhibit low variance and high bias, such as linear regression. In these scenarios, the ensemble average retains the inherent bias of the base learners while providing minimal additional variance stabilization, resulting in little to no improvement over the single base model. A key limitation arises from the computational demands of bagging, which scale linearly with the number of bootstrap iterations B, effectively multiplying the training cost of the base learner by B. While this overhead can be mitigated through parallelization, since the bootstrap models are trained independently, it remains a barrier for resource-constrained environments or when using computationally expensive base algorithms. The variance reduction achieved by bagging depends on the correlation of predictions from different bootstrap samples; if the base learners produce highly correlated outputs—due to insufficient diversity in the bootstrapped datasets—the covariance term dominates, leading to diminished ensemble benefits. Theoretical analyses show that the ensemble variance is given by \frac{1}{B} \mathrm{Var}(\hat{f}) + \left(1 - \frac{1}{B}\right) \mathrm{Cov}(\hat{f}, \hat{f}^*), where \hat{f} and \hat{f}^* denote predictors trained on different bootstrap samples, so high covariance limits gains even as B increases. In low-noise datasets, bagging can cause over-smoothing by averaging predictions, which may obscure sharp patterns or decision boundaries critical to the underlying relationship, particularly when the base procedure is stable and already accurate. This effect can degrade performance compared to the unbagged model. Finally, bagging's success is highly sensitive to the choice of base learner, performing effectively only with unstable, high-variance algorithms like classification and regression trees, while it fails to improve, or may even worsen, results for stable learners such as linear models or k-nearest neighbors, highlighting its unsuitability as a universal enhancement technique.

Historical Development

Origins and Key Publications

Bootstrap aggregating, commonly known as bagging, was introduced by statistician Leo Breiman of the University of California, Berkeley, who coined the term as a portmanteau of "bootstrap" and "aggregating." The method was first presented in a technical report in September 1994, with the seminal publication appearing in 1996. The key paper, "Bagging Predictors," published in the Machine Learning journal, formalized bagging as a technique for generating multiple versions of a predictor through bootstrap sampling and aggregating their outputs to improve overall performance. In this work, Breiman demonstrated the approach using examples such as classification and regression trees, including a brief illustration on the ozone dataset to show error reduction. The paper emphasized bagging's simplicity and effectiveness, particularly for unstable predictors where small data perturbations lead to large prediction variations. The foundations of bagging trace back to the bootstrap resampling method, developed by Efron in 1979 as a nonparametric tool for estimating statistical properties without assuming distributional forms. Earlier ideas of combining multiple models, precursors to modern ensemble methods, emerged in the late 1980s and early 1990s, including work on averaging predictions from decision trees and neural networks, such as efforts by Kwok and Carter (1990) on committee machines and Dietterich and Bakiri (1991) on error-correcting output codes for multiclass problems. Breiman's motivation for bagging stemmed from the need to stabilize high-variance classifiers and regressors, like those based on trees or neural networks, by leveraging the increasing availability of computational resources to train and average numerous bootstrap-generated models. This approach aimed to reduce prediction error without requiring modifications to the base learning algorithm, making it broadly applicable amid the computational advances of the mid-1990s.

Evolution and Influential Works

Following its foundational introduction, bootstrap aggregating, or bagging, evolved through key integrations that enhanced its practical utility. A pivotal advancement came in 2001 when Leo Breiman introduced random forests, which extended bagging by incorporating random feature selection at each split of decision trees, thereby increasing ensemble diversity and mitigating correlation among base models for better generalization in high-dimensional settings. Subsequent influential works provided deeper theoretical and empirical insights into bagging's mechanisms. In 2000, Thomas G. Dietterich's analysis of ensemble methods experimentally evaluated bagging alongside boosting and randomization, highlighting bagging's strength in variance reduction for unstable classifiers like decision trees through bootstrap resampling. Complementing this, Andreas Buja and Werner Stuetzle's 2006 study offered a rigorous examination of bagging's effects on U-statistics, demonstrating that it generally lowers variance at the potential cost of a slight increase in bias, thus clarifying conditions under which bagging yields net prediction improvements. Bagging's modern impact is evident in its integration into core software ecosystems and diverse applications. The scikit-learn library, starting from version 0.15 released in 2014, has included bagging as a standard ensemble tool, enabling accessible implementations for classifiers and regressors in Python-based workflows. By the 2010s, bagging powered advancements in specialized domains; in genomic prediction, it bolstered genome-enabled predictions by stabilizing genomic best linear unbiased prediction models against overfitting. Similarly, in finance, bagging ensembles improved forecasting by aggregating predictions to enhance accuracy and stability in stock return models. Recent extensions have bridged bagging with deep learning paradigms, addressing gaps in traditional applications. In the 2020s, researchers have applied bagging to transformer architectures, using bootstrap aggregation to create robust deep ensembles for tasks like time series forecasting, where it reduces variance in foundation model outputs without substantial computational overhead.
