Listen to this story

Introduced by Yoav Freund and Robert E. Schapire in their paper “A Short Introduction to Boosting”, AdaBoost was the first successful boosting algorithm; all modern boosting algorithms build upon the success of boosting theory employed by AdaBoost. The premise of this theory is simple; combine multiple weak classifiers to create one strong classifier. This boosting approach can be used to boost the performance of any machine learning algorithm. But it’s best paired with weak learners, which are models that perform just slightly better than a random guess on classification problems. The most commonly used weak learners are decision trees with a depth of one, also known as decision stumps. This article will go through the boosting theory and then implement Adaboost from scratch.
Before we delve into the mathematics and code, let’s understand the gist of Adaboost with the help of a simple 2D example:
The first classifier makes the classification based on the yaxis feature and creates a horizontal decision line at some threshold. This decision line is by no means perfect, and based on its predictions, a performance score is calculated. The performance measure is then used to calculate and update weights for all training samples. Moreover, the second classifier uses these weights and creates a different and maybe better decision line. The performance measure of the second classifier is used to update the weights, which are then used by the third classifier. This process is repeated for the preset number, n_estimators
, of classifiers. All the classifiers are combined through a weighted sum; here, the classifiers with better performance have a higher weight. This means that better classifiers have more impact on the final decision boundary.
Subscribe to our Newsletter
Join our editors every weekday evening as they steer you through the most significant news of the day, introduce you to fresh perspectives, and provide unexpected moments of joy
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.
Briefly put, Adaboost trains a set of weak learners sequentially. The errors of the earlier learners are used to tag the harder examples that the later learners focus on. In the end, the complete set of learners is combined and converted onto one complex classifier.
For the first iteration, the weights are initialized to 1/N and the error is calculated as misclassifications/N, where N is the number of training examples. This is later updated to include the weights and becomes ???? (w_{i }× error_{i})/????w_{i}_{ }, where w_{i} and error_{i }denote the weight and prediction error of the ith training example. The error is 1 if misclassified and 0 if classified correctly. The performance measure of the learners, denoted by ????, is calculated as 0.5 × log(1 – error/error).
This performance score is then used to update the weights using w = w × e ^{???? × y_true × y_pred}, where y_true and y_pred are the actual and predicted classes. Finally, for making predictions a weighted sum of all classifiers’ prediction is calculated: ???? ????_{i }× y_pred_{i} and the sign of the sum is used to make the classification. Here ????_{i} and y_pred_{i} refer to the performance score and prediction of the ith classifier.
Implementing Adaboost From Scratch
Before we can code Adaboost, we need to implement the weak learner. Each decision stump will have four attributes:
threshold
– the threshold that differentiates the two classes based on one featurealpha
– the performance measure, used as its weight in the final predictionindex
– the index of the feature the classifier is built onpolarity
– used to invert weak classifiers that perform worse than random(0.5).
class DecisionStump(): def __init__(self): self.threshold = None self.alpha = None self.index = None self.polarity = 1
In addition to these attributes, it will also need a function for making predictions.
def predict(self, X): n_samples = X.shape[0] X_column = X[:, self.index] predictions = np.ones(n_samples) # classifying based on the polarity and threshold if self.polarity == 1: predictions[X_column < self.threshold] = 1 else: predictions[X_column > self.threshold] = 1 return predictions
Using this DecisionStump class we can start working on Adaboost. It will have two attributes:
n_estimators
– number of decision stumps to useensemble_of_stumps
– a collection of trained decision stumps
class Adaboost():
def __init__(self, n_estimators=5): self.n_estimators = n_estimators self.ensemble_of_stumps = []
Now comes the hard part, the training method that creates the ensemble of decision stumps. We initialize the weights to 1/N and start training the first stump.
def fit(self, X, y): N, n_features = X.shape weights = np.full(N, (1 / N)) for _ in range(self.n_estimators): stump = DecisionStump()
Each of the decision stumps will use one feature to make the prediction, and it will use a threshold to make classifications. To ‘train’ the stump we need to find the best possible featurethreshold combination. To achieve this we iteratively go through all features, and for each feature, we test all of its unique values as the threshold. In the end, the featurethreshold combination with the lowest error is selected.
min_error = float('inf') # looking for the best threshold and feature configuration for i in range(n_features): X_column = X[:, i] thresholds = np.unique(X_column) for threshold in thresholds: p = 1 predictions = np.ones(N) predictions[X_column < threshold] = 1 # Calculating error misclassified = weights[y != predictions] error = sum(misclassified)
If the error of a stump is less than 0.5, we reverse its polarity. This is done because, by definition, the weak learners need to perform better than random.
# if error is less than .5, reverse polarity of learner if error > 0.5: error = 1  error p = 1 # Storing the best configuration for the current stump if error < min_error: stump.polarity = p stump.threshold = threshold stump.index = i min_error = error
Using the error of the stump, we calculate its performance measure alpha. Which is then used to update the weights for the next stump.
stump.alpha = 0.5 * np.log((1.0  min_error) / (min_error)) weights = weights * np.exp(stump.alpha * y * stump.predict(X)) # Normalizing the weights to one weights /= np.sum(weights)
Finally, we add the current decision stump to Adaboost’s ensemble of weak learners.
self.ensemble_of_stumps.append(stump)
In addition to the training method, we need a method for making predictions. This method will calculate the weighted sum of all stump’s predictions and use the sign of the sum as the final prediction.
def predict(self, X): # weighted sum of all stumps stump_preds = [stump.alpha * stump.predict(X) for stump in self.ensemble_of_stumps] y_pred = np.sum(stump_preds, axis=0) y_pred = np.sign(y_pred) return y_pred
Let’s test our Adaboost implementation on the breast cancer dataset and compare it with sklearn’s implementation.
import time from sklearn import datasets from sklearn.metrics import accuracy_score from sklearn.model_selection import train_test_split X, y = datasets.load_breast_cancer(return_X_y=True) #changing class 0 to 1 y[y == 0] = 1 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) start = time.perf_counter() classifier = Adaboost(n_estimators=5) classifier.fit(X_train, y_train) y_pred = classifier.predict(X_test) end = time.perf_counter() accuracy = accuracy_score(y_test, y_pred) print ("Accuracy:", accuracy) print(f'Finished in {round(endstart, 2)} second(s)')
Output Accuracy: 0.9649122807017544 Finished in 5.48 second(s)
from sklearn.ensemble import AdaBoostClassifier start = time.perf_counter() sk_classifier = AdaBoostClassifier(n_estimators=5, random_state=42) sk_classifier.fit(X_train, y_train) sk_y_pred = classifier.predict(X_test) end = time.perf_counter() sk_accuracy = accuracy_score(y_test, y_pred) print ("Accuracy:", sk_accuracy) print(f'Finished in {round(endstart, 2)} second(s)')
Output Accuracy: 0.9649122807017544 Finished in 0.02 second(s)
Our implementation can match sklearn’s implementation in terms of accuracy but is rather slow.
The Colab notebook for the above implementation can be found here.
Last Epoch
This article deconstructed boosting and implemented Adaboost. I hope this gives you a good understanding of the boosting approach and helps you understand modern boosting algorithms more easily. Boosting has many advantages; it is fast and easy to program. Excluding the number of weak learners, n_estimators
, it has no parameters to tune. One thing to keep in mind when working with boosting algorithm is the data. Because the boosting method obsessively tries to correct its errors, any outliers or noise can send the model down a tangent of trying to learn unrealistic or wrong examples.
References
 Paper
 Slides from the University of California CS273a course
 This article by Jason Brownlee