Last updated May 30, 2022
In AI Mysteries

Demystifying AdaBoost: The Origin Of Boosting

AdaBoost was the first successful boosting algorithm that combined multiple weak classifier to create a strong classifier.

Share

Published on May 6, 2021

by Aditya Singh

Listen to this story

Introduced by Yoav Freund and Robert E. Schapire in their paper “A Short Introduction to Boosting”, AdaBoost was the first successful boosting algorithm; all modern boosting algorithms build upon the success of boosting theory employed by AdaBoost. The premise of this theory is simple; combine multiple weak classifiers to create one strong classifier. This boosting approach can be used to boost the performance of any machine learning algorithm. But it’s best paired with weak learners, which are models that perform just slightly better than a random guess on classification problems. The most commonly used weak learners are decision trees with a depth of one, also known as decision stumps. This article will go through the boosting theory and then implement Adaboost from scratch.

Before we delve into the mathematics and code, let’s understand the gist of Adaboost with the help of a simple 2D example:

Adaboost/Boosting example — Source: http://sli.ics.uci.edu/Classes/2015W-273a?action=download&upname=09-ensembles.pdf

The first classifier makes the classification based on the y-axis feature and creates a horizontal decision line at some threshold. This decision line is by no means perfect, and based on its predictions, a performance score is calculated. The performance measure is then used to calculate and update weights for all training samples. Moreover, the second classifier uses these weights and creates a different and maybe better decision line. The performance measure of the second classifier is used to update the weights, which are then used by the third classifier. This process is repeated for the pre-set number, n_estimators, of classifiers. All the classifiers are combined through a weighted sum; here, the classifiers with better performance have a higher weight. This means that better classifiers have more impact on the final decision boundary.

Briefly put, Adaboost trains a set of weak learners sequentially. The errors of the earlier learners are used to tag the harder examples that the later learners focus on. In the end, the complete set of learners is combined and converted onto one complex classifier.

Decision Stump used by Adaboost — Decision Stump

For the first iteration, the weights are initialized to 1/N and the error is calculated as misclassifications/N, where N is the number of training examples. This is later updated to include the weights and becomes ???? (w_i× error_i)/????w_i, where w_i and error_idenote the weight and prediction error of the i-th training example. The error is 1 if misclassified and 0 if classified correctly. The performance measure of the learners, denoted by ????, is calculated as 0.5 × log(1 – error/error).

This performance score is then used to update the weights using w = w × e ^{-???? × y_true × y_pred}, where y_true and y_pred are the actual and predicted classes. Finally, for making predictions a weighted sum of all classifiers’ prediction is calculated: ???? ????_i× y_pred_i and the sign of the sum is used to make the classification. Here ????_i and y_pred_i refer to the performance score and prediction of the i-th classifier.

Implementing Adaboost From Scratch

Before we can code Adaboost, we need to implement the weak learner. Each decision stump will have four attributes:

threshold – the threshold that differentiates the two classes based on one feature
alpha – the performance measure, used as its weight in the final prediction
index – the index of the feature the classifier is built on
polarity – used to invert weak classifiers that perform worse than random(0.5).

class DecisionStump():
    def __init__(self):
         self.threshold = None
         self.alpha = None
         self.index = None
         self.polarity = 1

In addition to these attributes, it will also need a function for making predictions.

 def predict(self, X):
     n_samples = X.shape[0]
     X_column = X[:, self.index]
     predictions = np.ones(n_samples)
     # classifying based on the polarity and threshold
     if self.polarity == 1:
        predictions[X_column < self.threshold] = -1
     else:
        predictions[X_column > self.threshold] = -1
     return predictions

Using this DecisionStump class we can start working on Adaboost. It will have two attributes:

n_estimators – number of decision stumps to use
ensemble_of_stumps – a collection of trained decision stumps

class Adaboost():

def __init__(self, n_estimators=5):
    self.n_estimators = n_estimators
    self.ensemble_of_stumps = []

Now comes the hard part, the training method that creates the ensemble of decision stumps. We initialize the weights to 1/N and start training the first stump.

def fit(self, X, y):
   N, n_features = X.shape
   weights = np.full(N, (1 / N))
   for _ in range(self.n_estimators):
      stump = DecisionStump()

Each of the decision stumps will use one feature to make the prediction, and it will use a threshold to make classifications. To ‘train’ the stump we need to find the best possible feature-threshold combination. To achieve this we iteratively go through all features, and for each feature, we test all of its unique values as the threshold. In the end, the feature-threshold combination with the lowest error is selected.

     min_error = float('inf')
     # looking for the best threshold and feature configuration
     for i in range(n_features):
        X_column = X[:, i]
        thresholds = np.unique(X_column)
        for threshold in thresholds:
            p = 1
            predictions = np.ones(N)
            predictions[X_column < threshold] = -1
            # Calculating error
            misclassified = weights[y != predictions]
            error = sum(misclassified)

If the error of a stump is less than 0.5, we reverse its polarity. This is done because, by definition, the weak learners need to perform better than random.

         # if error is less than .5, reverse polarity of learner
         if error > 0.5:
               error = 1 - error
               p = -1
         # Storing the best configuration for the current stump
         if error < min_error:
               stump.polarity = p
               stump.threshold = threshold
               stump.index = i
               min_error = error

Using the error of the stump, we calculate its performance measure alpha. Which is then used to update the weights for the next stump.

         stump.alpha = 0.5 * np.log((1.0 - min_error) / (min_error))
         weights  = weights * np.exp(-stump.alpha * y * stump.predict(X))
         # Normalizing the weights to one
         weights /= np.sum(weights)

Finally, we add the current decision stump to Adaboost’s ensemble of weak learners.

        self.ensemble_of_stumps.append(stump)

In addition to the training method, we need a method for making predictions. This method will calculate the weighted sum of all stump’s predictions and use the sign of the sum as the final prediction.

   def predict(self, X):
       # weighted sum of all stumps
       stump_preds = [stump.alpha * stump.predict(X) for stump in self.ensemble_of_stumps]
       y_pred = np.sum(stump_preds, axis=0)
       y_pred = np.sign(y_pred)
       return y_pred

Let’s test our Adaboost implementation on the breast cancer dataset and compare it with sklearn’s implementation.

 import time
 from sklearn import datasets
 from sklearn.metrics import accuracy_score
 from sklearn.model_selection import train_test_split

 X, y = datasets.load_breast_cancer(return_X_y=True)
 #changing class 0 to -1
 y[y == 0] = -1
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

 start = time.perf_counter()
 classifier = Adaboost(n_estimators=5)
 classifier.fit(X_train, y_train)
 y_pred = classifier.predict(X_test)
 end = time.perf_counter()

 accuracy = accuracy_score(y_test, y_pred)
 print ("Accuracy:", accuracy)
 print(f'Finished in {round(end-start, 2)} second(s)')

-----------------------------Output-----------------------------
 Accuracy: 0.9649122807017544
 Finished in 5.48 second(s)

 from sklearn.ensemble import AdaBoostClassifier

 start = time.perf_counter()
 sk_classifier = AdaBoostClassifier(n_estimators=5, random_state=42)
 sk_classifier.fit(X_train, y_train)
 sk_y_pred = classifier.predict(X_test)
 end = time.perf_counter()

 sk_accuracy = accuracy_score(y_test, y_pred)
 print ("Accuracy:", sk_accuracy)
 print(f'Finished in {round(end-start, 2)} second(s)')

-----------------------------Output-----------------------------
 Accuracy: 0.9649122807017544
 Finished in 0.02 second(s)

Our implementation can match sklearn’s implementation in terms of accuracy but is rather slow.

The Colab notebook for the above implementation can be found here.

Last Epoch

This article deconstructed boosting and implemented Adaboost. I hope this gives you a good understanding of the boosting approach and helps you understand modern boosting algorithms more easily. Boosting has many advantages; it is fast and easy to program. Excluding the number of weak learners, n_estimators, it has no parameters to tune. One thing to keep in mind when working with boosting algorithm is the data. Because the boosting method obsessively tries to correct its errors, any outliers or noise can send the model down a tangent of trying to learn unrealistic or wrong examples.