Bagging and Boosting are two widely used Ensemble Learning methods for improving model performance. While there are many articles focusing on the coding and implementation of these models, the purpose of this article is to:
- Appreciate these models by highlighting their significance, and
- Understand the circumstances when a particular method would be the ideal choice for implementation.
Ensemble Learning:
Building a highly accurate prediction model is certainly a difficult task.
Our main challenge is the Prediction Error, which can be broken down into:
Bias – The average distance between the predictions and the target.
Variance – The variability of the predictions.
Noise – The irreducible error, i.e. the part of the target value that the model cannot predict or explain.
Since it is impossible to reduce the noise (hence the term irreducible error), we shift our focus to reducing Bias and Variance.
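To make this decomposition concrete, here is a minimal sketch, assuming a synthetic sine-curve regression task and a deliberately simple predictor (both are illustrative choices, not part of the original article), that estimates the three components at a single test point by retraining on many independent training sets:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(x)                 # the true (unknown) target function

noise_sd = 0.3                       # irreducible noise level
x0 = 1.0                             # a single test point

# Retrain the same deliberately simple predictor (it just predicts the mean
# of its training targets) on many independent training sets and record
# its prediction at x0.
preds = []
for _ in range(2000):
    x_train = rng.uniform(0, 2 * np.pi, 20)
    y_train = f(x_train) + rng.normal(0, noise_sd, 20)
    preds.append(y_train.mean())
preds = np.array(preds)

bias_sq  = (preds.mean() - f(x0)) ** 2   # squared Bias at x0
variance = preds.var()                   # Variance of the predictions
noise    = noise_sd ** 2                 # Noise (irreducible error)

# The expected squared error on a fresh noisy observation at x0 is
# (approximately) the sum of the three components.
print(f"bias^2={bias_sq:.3f}  variance={variance:.3f}  noise={noise:.3f}  "
      f"total={bias_sq + variance + noise:.3f}")
```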
Ensemble learning methods reduce the Bias and Variance of a model by combining multiple models together (hence the term Ensemble), achieving better predictive performance than any single model used for prediction on its own.
There are a number of Ensemble methods; in this article I will discuss the two most widely used ones, Bagging and Boosting.
BOOSTING
When we apply a single (weak) learning algorithm repeatedly, each time on a reweighted version of the training data, and combine the resulting predictors.
BAGGING
When we generate multiple versions of a predictor on bootstrap samples of the training data and aggregate them into a single model.
BAGGING (Bootstrap Aggregating):
The procedure of drawing Bootstrap samples from the training dataset and then Aggregating the resulting predictors, by averaging when estimating a numerical outcome and by plurality vote when predicting a class, is called BAGGING.
MATH of Bagging algorithm:
Since we rarely have the luxury of many large training sets, we take repeated Bootstrap samples SB (drawn with replacement) from the single training set S = {(yn, xn), n = 1, …, N}, where y is either a class label or a numerical target response.
The objective is to use the predictors ø(x, SB) built on these samples to obtain a better predictor than the single-training-set predictor ø(x, S).
When y is numerical, we take the average of the ø(x, SB):
øB(x) = avgB ø(x, SB).
When y is a class label, we take the plurality vote of the ø(x, SB) to form øB(x).
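As a quick numerical illustration of these two aggregation rules, here is a minimal sketch (NumPy assumed; the per-bootstrap predictions at a single point x are made-up numbers, not real model output):

```python
import numpy as np

# Predictions of B bootstrap-trained models ø(x, SB) at a single point x.
numeric_preds = np.array([2.1, 1.9, 2.4, 2.0, 2.2])            # y numerical
class_preds   = np.array(["cat", "dog", "cat", "cat", "dog"])  # y a class label

# øB(x) = average of the individual predictors (numerical outcome)
phi_B_numeric = numeric_preds.mean()

# øB(x) = plurality vote of the individual predictors (class label)
labels, counts = np.unique(class_preds, return_counts=True)
phi_B_class = labels[counts.argmax()]

print(phi_B_numeric)   # 2.12
print(phi_B_class)     # "cat"
```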
When we want to reduce the Variance in the model, we can use Bagging.
According to Leo Breiman, Bagging can push good but unstable classification algorithms, such as Classification and Regression Trees, C4.5 and Neural Networks, towards optimality, because it reduces the prediction error by reducing variance without affecting bias.
On the other hand, applying Bagging to stable algorithms such as k-Nearest Neighbours, Discriminant Analysis and Naïve Bayes can degrade their performance: each bootstrap sample contains only about 63% of the distinct records in the original data, so each base model is trained without roughly 37% of it.
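The roughly 63% figure comes from the fact that the probability of a given record appearing in a bootstrap sample of size N is 1 − (1 − 1/N)^N ≈ 1 − 1/e ≈ 0.632; a quick simulation (NumPy assumed) confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000                                     # size of the original training set

# Draw one bootstrap sample (with replacement) and measure how much of the
# original data it actually covers.
sample = rng.integers(0, N, size=N)
coverage = np.unique(sample).size / N

print(f"unique fraction ≈ {coverage:.3f}")         # ≈ 0.632, i.e. 1 - 1/e
print(f"theoretical value {1 - (1 - 1/N)**N:.3f}")
```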
The Bagging Algorithm:
Step 1: Bootstrap samples of the same size are repeatedly taken (with replacement) from the training data set, so that each record has an equal probability of being selected.
Step 2: A classification or estimation model is trained on each bootstrap sample drawn in Step 1, and a prediction is recorded for each sample.
Step 3: The bagging ensemble prediction is then defined to be the class with the most votes in Step 2 (for classification models) or the average of the predictions made in Step 2 (for estimation models).
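A minimal from-scratch sketch of Steps 1 to 3, assuming scikit-learn decision trees as the base classifier and a synthetic dataset (both assumptions of this example, not requirements of the method):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

B = 25                                  # number of bootstrap samples
per_model_preds = []

# Steps 1 and 2: draw a bootstrap sample, train a tree on it, record its predictions.
for _ in range(B):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # sampling with replacement
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    per_model_preds.append(tree.predict(X_test))

# Step 3: plurality vote across the B trees (for a numerical target this
# would be a plain average of the predictions instead).
per_model_preds = np.array(per_model_preds)                  # shape (B, n_test)
voted = np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                            axis=0, arr=per_model_preds)

print("bagged test accuracy:", (voted == y_test).mean())
```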
Boosting (AdaBoost):
The idea behind Boosting is that, instead of searching for a single highly accurate prediction rule, we combine multiple simple learning algorithms, none of which is strong enough to make accurate predictions on its own.
These simple algorithms are known as weak learners. The Boosting algorithm calls a weak learner repeatedly, each time feeding it a different subset (or weighting) of the training samples, so that each call generates a new weak prediction rule; because the weighting adapts to the mistakes of the previous rules, Boosting is said to be adaptive.
The Boosting algorithm then combines these multiple weak rules into a single prediction rule that is much more accurate than any one of the weak rules, reducing both the variance and the bias of the individual models.
Two fundamental questions for an effective implementation of a Boosting algorithm:
- How to choose a different subset (or weighting) of the training dataset for each iteration:
- Higher weight is placed on the examples that were misclassified by the earlier weak learners, so the next learner focuses on them.
- How to combine the weak learners together:
- By taking a weighted majority vote of their predictions.
[Figure credit: Sirakorn – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=85888769]
MATH of Boosting algorithm:
Given: (x1, y1), …, (xm, ym), where xi ∈ X and yi ∈ Y = {−1, +1}.
Initialize D1(i) = 1/m for i = 1, …, m.
For t = 1, …, T:
- Train the base learner using distribution Dt.
- Get a base classifier ht : X → ℝ.
- The base learner should choose ht to (approximately) minimise the weighted error
εt = Pr i∼Dt [ ht(xi) ≠ yi ].
- Choose αt ∈ ℝ (typically αt = ½ ln((1 − εt) / εt)).
- Update: Dt+1(i) = [ Dt(i) exp( −αt yi ht(xi) ) ] / Zt,
where Zt is a normalization factor chosen so that Dt+1 is a distribution.
Output of the final classifier:
H(x) = sign( Σ t=1..T αt ht(x) )
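The loop above translates almost directly into code. The following is a bare-bones sketch rather than a production implementation, assuming scikit-learn decision stumps as the base learner, a synthetic dataset, and the usual choice αt = ½ ln((1 − εt)/εt):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=400, n_features=10, random_state=0)
y = np.where(y01 == 1, 1, -1)            # labels in {-1, +1} for the sign() rule

m, T = len(y), 30
D = np.full(m, 1.0 / m)                  # D1(i) = 1/m
stumps, alphas = [], []

for t in range(T):
    h = DecisionTreeClassifier(max_depth=1)      # a decision stump as weak learner
    h.fit(X, y, sample_weight=D)                 # train base learner on distribution Dt
    pred = h.predict(X)

    eps = D[pred != y].sum()                     # weighted error εt
    alpha = 0.5 * np.log((1 - eps) / (eps + 1e-10))   # αt (small constant avoids /0)

    D = D * np.exp(-alpha * y * pred)            # up-weight the misclassified examples
    D /= D.sum()                                 # Zt normalisation

    stumps.append(h)
    alphas.append(alpha)

# Final classifier: H(x) = sign( sum_t αt * ht(x) )
agg = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
H = np.sign(agg)
print("training accuracy (for illustration):", (H == y).mean())
```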
Boosting can help us reduce both Bias and Variance. Note, however, that reducing a model's Bias typically increases its Variance and vice versa, so we must find an optimal point of balance; this is called the Bias-Variance Trade-off.
The Boosting Algorithm:
Step 1: All observations have equal weight in the original training data set D1. An initial "base" classifier h1 is determined.
Step 2: The observations that were incorrectly classified by the previous base classifier have their weights increased, while the observations that were correctly classified have their weights decreased. This gives the data distribution Dm, m = 2, …, M. A new base classifier hm, m = 2, …, M is determined based on the new weights. This step is repeated until the desired number of iterations M is reached.
Step 3: The final boosted classifier is the weighted sum of the M base classifiers.
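In practice you would rarely code this by hand; scikit-learn's AdaBoostClassifier follows the same Step 1 to Step 3 pattern (its defaults use depth-1 decision trees as base classifiers). A minimal usage example on a synthetic dataset, purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# M = 100 boosting rounds; by default the base classifier is a depth-1
# decision tree (a "stump").
booster = AdaBoostClassifier(n_estimators=100, random_state=0)
booster.fit(X_tr, y_tr)

print("test accuracy:", booster.score(X_te, y_te))
```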
Generalization & Test Error:
Generalization error refers to performance on examples that were not seen during training: it is the probability that the final classifier misclassifies a new example.
Test error is the fraction of mistakes made on a newly sampled test set, and serves as an estimate of the generalization error.
Bagging vs. Boosting:
Based on the available data, the objective of the problem and other constraints, choose between Bagging and Boosting. Both methods increase accuracy by reducing the variability of the estimates, so the aggregated results show higher stability than the individual results.
- When the challenge is low performance (high bias) in a single model, Bagging will not give a better result, since it does not reduce bias. Boosting, on the other hand, concentrates on building on the strengths and correcting the weaknesses of the individual models, and therefore yields lower error.
- When the challenge is overfitting (high variance) in a single model, Bagging outperforms Boosting, since Boosting itself is prone to over-fitting.
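As a rough way to see this trade-off in practice, one can run both off-the-shelf ensembles on the same data; this is only a sketch, assuming scikit-learn and a synthetic dataset, and results will vary with the data and base learners:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# By default, BaggingClassifier aggregates fully grown (low-bias, high-variance)
# trees, while AdaBoostClassifier boosts shallow stumps (high-bias, low-variance),
# i.e. each method is paired with the kind of base learner it tends to help most.
models = {
    "bagging":  BaggingClassifier(n_estimators=100, random_state=1),
    "boosting": AdaBoostClassifier(n_estimators=100, random_state=1),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:9s} mean CV accuracy: {scores.mean():.3f}")
```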
References:
Bagging Predictors by Leo Breiman
The Boosting Approach to Machine Learning by Robert E. Schapire
Data Mining and Predictive Analytics by Daniel T. Larose and Chantal D. Larose
https://en.wikipedia.org/wiki/Ensemble_learning
https://blog.paperspace.com/bagging-ensemble-methods/
https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205