Bagging and Boosting are two widely used Ensemble Learning methods for improving model performance. While there are many articles focusing on the coding and implementation of these models, the purpose of this article is to:
- Appreciate these models by highlighting their significance, and
- Understand the circumstances when a particular method would be the ideal choice for implementation.
Ensemble Learning:
Building a highly accurate prediction model is certainly a difficult task.
Our main challenge is the Prediction Error, which can be broken down into:
Bias – The average distance between the predictions and the target.
Variance – The variability of the predictions.
Noise – The irreducible error, i.e. the part of the target value that the model cannot predict or explain.
Since it is impossible to reduce the noise (hence the term irreducible error), we shift our focus to reducing Bias and Variance.
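To make this decomposition concrete, here is a minimal sketch, assuming a synthetic sine-curve regression task and a deliberately simple predictor (both are illustrative choices, not part of the original article), that estimates the three components at a single test point by retraining on many independent training sets:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(x)                 # the true (unknown) target function

noise_sd = 0.3                       # irreducible noise level
x0 = 1.0                             # a single test point

# Retrain the same deliberately simple predictor (it just predicts the mean
# of its training targets) on many independent training sets and record
# its prediction at x0.
preds = []
for _ in range(2000):
    x_train = rng.uniform(0, 2 * np.pi, 20)
    y_train = f(x_train) + rng.normal(0, noise_sd, 20)
    preds.append(y_train.mean())
preds = np.array(preds)

bias_sq  = (preds.mean() - f(x0)) ** 2   # squared Bias at x0
variance = preds.var()                   # Variance of the predictions
noise    = noise_sd ** 2                 # Noise (irreducible error)

# The expected squared error on a fresh noisy observation at x0 is
# (approximately) the sum of the three components.
print(f"bias^2={bias_sq:.3f}  variance={variance:.3f}  noise={noise:.3f}  "
      f"total={bias_sq + variance + noise:.3f}")
```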
Ensemble learning methods reduce the Bias and Variance of a model by combining multiple models together (hence the term Ensemble), achieving better predictive performance than any single model used for prediction on its own.
There are a number of Ensemble methods; in this article I will discuss the two most widely used ones, Bagging and Boosting.
BOOSTING
When we apply a single (weak) learning algorithm repeatedly, each time on a reweighted version of the training data, and combine the resulting predictors.
BAGGING
When we generate multiple versions of a predictor on bootstrap samples of the training data and aggregate them into a single model.
BAGGING (Bootstrap Aggregating):
The procedure of drawing Bootstrap samples from the training dataset and then Aggregating the resulting predictors, by averaging when estimating a numerical outcome and by plurality vote when predicting a class, is called BAGGING.
MATH of Bagging algorithm:
Since we rarely have the luxury of many large training sets, we take repeated Bootstrap samples SB (drawn with replacement) from the single training set S = {(yn, xn), n = 1, …, N}, where y is either a class label or a numerical target response.
The objective is to use the predictors ø(x, SB) built on these samples to obtain a better predictor than the single-training-set predictor ø(x, S).
When y is numerical, we take the average of the ø(x, SB):
øB(x) = avgB ø(x, SB).
When y is a class label, we take the plurality vote of the ø(x, SB) to form øB(x).
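As a quick numerical illustration of these two aggregation rules, here is a minimal sketch (NumPy assumed; the per-bootstrap predictions at a single point x are made-up numbers, not real model output):

```python
import numpy as np

# Predictions of B bootstrap-trained models ø(x, SB) at a single point x.
numeric_preds = np.array([2.1, 1.9, 2.4, 2.0, 2.2])            # y numerical
class_preds   = np.array(["cat", "dog", "cat", "cat", "dog"])  # y a class label

# øB(x) = average of the individual predictors (numerical outcome)
phi_B_numeric = numeric_preds.mean()

# øB(x) = plurality vote of the individual predictors (class label)
labels, counts = np.unique(class_preds, return_counts=True)
phi_B_class = labels[counts.argmax()]

print(phi_B_numeric)   # 2.12
print(phi_B_class)     # "cat"
```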
When we want to reduce the Variance in the model, we can use Bagging.
According to Leo Breiman, Bagging can push good but unstable classification algorithms, such as Classification and Regression Trees, C4.5 and Neural Networks, towards optimality, because it reduces the prediction error by reducing variance without affecting bias.
On the other hand, applying Bagging to stable algorithms such as k-Nearest Neighbours, Discriminant Analysis and Naïve Bayes can degrade their performance: each bootstrap sample contains only about 63% of the distinct records in the original data, so each base model is trained without roughly 37% of it.
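The roughly 63% figure comes from the fact that the probability of a given record appearing in a bootstrap sample of size N is 1 − (1 − 1/N)^N ≈ 1 − 1/e ≈ 0.632; a quick simulation (NumPy assumed) confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000                                     # size of the original training set

# Draw one bootstrap sample (with replacement) and measure how much of the
# original data it actually covers.
sample = rng.integers(0, N, size=N)
coverage = np.unique(sample).size / N

print(f"unique fraction ≈ {coverage:.3f}")         # ≈ 0.632, i.e. 1 - 1/e
print(f"theoretical value {1 - (1 - 1/N)**N:.3f}")
```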
The Bagging Algorithm:
Step 1: Bootstrap samples of the same size are repeatedly taken (with replacement) from the training data set, so that each record has an equal probability of being selected.
Step 2: A classification or estimation model is trained on each bootstrap sample drawn in Step 1, and a prediction is recorded for each sample.
Step 3: The bagging ensemble prediction is then defined to be the class with the most votes in Step 2 (for classification models) or the average of the predictions made in Step 2 (for estimation models).
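A minimal from-scratch sketch of Steps 1 to 3, assuming scikit-learn decision trees as the base classifier and a synthetic dataset (both assumptions of this example, not requirements of the method):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

B = 25                                  # number of bootstrap samples
per_model_preds = []

# Steps 1 and 2: draw a bootstrap sample, train a tree on it, record its predictions.
for _ in range(B):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # sampling with replacement
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    per_model_preds.append(tree.predict(X_test))

# Step 3: plurality vote across the B trees (for a numerical target this
# would be a plain average of the predictions instead).
per_model_preds = np.array(per_model_preds)                  # shape (B, n_test)
voted = np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                            axis=0, arr=per_model_preds)

print("bagged test accuracy:", (voted == y_test).mean())
```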
Boosting (AdaBoost):
The idea behind Boosting is that, instead of searching for a single highly accurate prediction rule, we combine multiple simple learning algorithms, none of which is strong enough to make accurate predictions on its own.
These simple algorithms are known as weak learners. The Boosting algorithm calls a weak learner repeatedly, each time feeding it a different subset (or weighting) of the training samples, so that each call generates a new weak prediction rule; because the weighting adapts to the mistakes of the previous rules, Boosting is said to be adaptive.
The Boosting algorithm then combines these multiple weak rules into a single prediction rule that is much more accurate than any one of the weak rules, reducing both the variance and the bias of the individual models.
Two fundamental questions for an effective implementation of a Boosting algorithm:
- How to choose a different subset (or weighting) of the training dataset for each iteration:
- Higher weight is placed on the examples that were misclassified by the earlier weak learners, so the next learner focuses on them.
- How to combine the weak learners together:
- By taking a weighted majority vote of their predictions.
[Figure credit: Sirakorn – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=85888769]
MATH of Boosting algorithm:
Given: (x1, y1), …, (xm, ym), where xi ∈ X and yi ∈ Y = {−1, +1}.
Initialize D1(i) = 1/m for i = 1, …, m.
For t = 1, …, T:
- Train the base learner using distribution Dt.
- Get a base classifier ht : X → ℝ.
- The base learner should choose ht to (approximately) minimise the weighted error
εt = Pr i∼Dt [ ht(xi) ≠ yi ].
- Choose αt ∈ ℝ (typically αt = ½ ln((1 − εt) / εt)).
- Update: Dt+1(i) = [ Dt(i) exp( −αt yi ht(xi) ) ] / Zt,
where Zt is a normalization factor chosen so that Dt+1 is a distribution.
Output of the final classifier:
H(x) = sign( Σ t=1..T αt ht(x) )
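The loop above translates almost directly into code. The following is a bare-bones sketch rather than a production implementation, assuming scikit-learn decision stumps as the base learner, a synthetic dataset, and the usual choice αt = ½ ln((1 − εt)/εt):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=400, n_features=10, random_state=0)
y = np.where(y01 == 1, 1, -1)            # labels in {-1, +1} for the sign() rule

m, T = len(y), 30
D = np.full(m, 1.0 / m)                  # D1(i) = 1/m
stumps, alphas = [], []

for t in range(T):
    h = DecisionTreeClassifier(max_depth=1)      # a decision stump as weak learner
    h.fit(X, y, sample_weight=D)                 # train base learner on distribution Dt
    pred = h.predict(X)

    eps = D[pred != y].sum()                     # weighted error εt
    alpha = 0.5 * np.log((1 - eps) / (eps + 1e-10))   # αt (small constant avoids /0)

    D = D * np.exp(-alpha * y * pred)            # up-weight the misclassified examples
    D /= D.sum()                                 # Zt normalisation

    stumps.append(h)
    alphas.append(alpha)

# Final classifier: H(x) = sign( sum_t αt * ht(x) )
agg = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
H = np.sign(agg)
print("training accuracy (for illustration):", (H == y).mean())
```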
Boosting can help us reduce both Bias and Variance. Note, however, that reducing a model's Bias typically increases its Variance and vice versa, so we must find an optimal point of balance; this is called the Bias-Variance Trade-off.
The Boosting Algorithm:
Step 1: All observations have equal weight in the original training data set D1. An initial "base" classifier h1 is determined.
Step 2: The observations that were incorrectly classified by the previous base classifier have their weights increased, while the observations that were correctly classified have their weights decreased. This gives the data distribution Dm, m = 2, …, M. A new base classifier hm, m = 2, …, M is determined based on the new weights. This step is repeated until the desired number of iterations M is reached.
Step 3: The final boosted classifier is the weighted sum of the M base classifiers.
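In practice you would rarely code this by hand; scikit-learn's AdaBoostClassifier follows the same Step 1 to Step 3 pattern (its defaults use depth-1 decision trees as base classifiers). A minimal usage example on a synthetic dataset, purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# M = 100 boosting rounds; by default the base classifier is a depth-1
# decision tree (a "stump").
booster = AdaBoostClassifier(n_estimators=100, random_state=0)
booster.fit(X_tr, y_tr)

print("test accuracy:", booster.score(X_te, y_te))
```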
Generalization & Test Error:
Generalization error refers to performance on examples that were not seen during training: it is the probability that the final classifier misclassifies a new example.
Test error is the fraction of mistakes made on a newly sampled test set, and serves as an estimate of the generalization error.
Bagging vs. Boosting:
Based on the available data, the objective of the problem and other constraints, choose between Bagging and Boosting. Both methods increase accuracy by reducing the variability of the estimates, so the aggregated results show higher stability than the individual results.
- When the challenge is low performance (high bias) in a single model, Bagging will not give a better result, since it does not reduce bias. Boosting, on the other hand, concentrates on building on the strengths and correcting the weaknesses of the individual models, and therefore yields lower error.
- When the challenge is overfitting (high variance) in a single model, Bagging outperforms Boosting, since Boosting itself is prone to over-fitting.
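As a rough way to see this trade-off in practice, one can run both off-the-shelf ensembles on the same data; this is only a sketch, assuming scikit-learn and a synthetic dataset, and results will vary with the data and base learners:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# By default, BaggingClassifier aggregates fully grown (low-bias, high-variance)
# trees, while AdaBoostClassifier boosts shallow stumps (high-bias, low-variance),
# i.e. each method is paired with the kind of base learner it tends to help most.
models = {
    "bagging":  BaggingClassifier(n_estimators=100, random_state=1),
    "boosting": AdaBoostClassifier(n_estimators=100, random_state=1),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:9s} mean CV accuracy: {scores.mean():.3f}")
```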
References:
Bagging Predictors by Leo Breiman
The Boosting Approach to Machine Learning by Robert E. Schapire
Data Mining and Predictive Analytics by Daniel T. Larose and Chantal D. Larose
https://en.wikipedia.org/wiki/Ensemble_learning
https://blog.paperspace.com/bagging-ensemble-methods/
https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205