
# Demystifying AdaBoost: The Origin Of Boosting

AdaBoost was the first successful boosting algorithm, combining multiple weak classifiers to create a strong classifier.

Introduced by Yoav Freund and Robert E. Schapire (who later summarized it in their paper “A Short Introduction to Boosting”), AdaBoost was the first successful boosting algorithm; all modern boosting algorithms build upon the boosting theory it employed. The premise of this theory is simple: combine multiple weak classifiers to create one strong classifier. This boosting approach can be used to boost the performance of any machine learning algorithm, but it is best paired with weak learners, models that perform just slightly better than a random guess on classification problems. The most commonly used weak learners are decision trees with a depth of one, also known as decision stumps. This article will go through the boosting theory and then implement AdaBoost from scratch.

Before we delve into the mathematics and code, let’s understand the gist of Adaboost with the help of a simple 2D example:

The first classifier makes its classification based on the y-axis feature, creating a horizontal decision line at some threshold. This decision line is by no means perfect, and based on its predictions, a performance score is calculated. That score is then used to update the weights of all training samples. The second classifier uses these weights and creates a different, and possibly better, decision line. Its performance score is in turn used to update the weights for the third classifier, and so on. This process is repeated for a pre-set number of classifiers, `n_estimators`. Finally, all the classifiers are combined through a weighted sum in which classifiers with better performance have higher weights, so better classifiers have more impact on the final decision boundary.


Briefly put, AdaBoost trains a set of weak learners sequentially. The errors of the earlier learners are used to flag the harder examples that the later learners focus on. In the end, the complete set of learners is combined into one strong classifier.
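This sequential loop is exactly what library implementations run under the hood. As a quick illustration before building it ourselves, here is the same procedure via sklearn's `AdaBoostClassifier` on a synthetic dataset (the default weak learner is already a depth-one decision tree, i.e. a stump):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# synthetic binary classification problem for illustration
X, y = make_classification(n_samples=200, random_state=42)

# default base learner is a decision stump; 50 of them are trained sequentially
clf = AdaBoostClassifier(n_estimators=50, random_state=42)
clf.fit(X, y)
print(clf.score(X, y))
```

The rest of the article reconstructs what `fit` and `predict` do internally.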

For the first iteration, the weights are initialized to 1/N and the error is simply misclassifications/N, where N is the number of training examples. In later iterations the error is weighted: error = Σ(wᵢ × errorᵢ) / Σwᵢ, where wᵢ and errorᵢ denote the weight and prediction error of the i-th training example. Here errorᵢ is 1 if the example is misclassified and 0 if it is classified correctly. The performance measure of a learner, denoted by α, is calculated as α = 0.5 × log((1 − error) / error).
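As a small numeric sanity check, here is the weighted error and α for a toy stump that misclassifies two of five equally-weighted samples (the numbers are made up for illustration):

```python
import numpy as np

# 5 training samples, uniform initial weights of 1/N
weights = np.full(5, 1 / 5)

# suppose the stump misclassifies samples 0 and 3
misclassified = np.array([1, 0, 0, 1, 0])

# weighted error: sum(w_i * error_i) / sum(w_i)
error = np.sum(weights * misclassified) / np.sum(weights)
print(round(error, 4))  # 0.4

# performance measure: alpha = 0.5 * log((1 - error) / error)
alpha = 0.5 * np.log((1 - error) / error)
print(round(alpha, 4))  # 0.2027
```

Note that α is positive when the error is below 0.5, zero at exactly 0.5, and grows as the stump gets more accurate.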

This performance score is then used to update the weights using w = w × e^(−α × y_true × y_pred), where y_true and y_pred are the actual and predicted classes. Finally, to make a prediction, the weighted sum of all classifiers’ predictions, Σ αᵢ × y_predᵢ, is calculated, and the sign of the sum gives the classification. Here αᵢ and y_predᵢ refer to the performance score and prediction of the i-th classifier.
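The weight update and the weighted vote can be sketched on toy numbers (all values here are illustrative):

```python
import numpy as np

alpha = 0.2027                              # performance score of the stump
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([-1, -1, 1, -1, -1])      # samples 0 and 3 misclassified

# w = w * exp(-alpha * y_true * y_pred): misclassified samples
# (y_true * y_pred == -1) get larger weights, correct ones smaller
weights = np.full(5, 1 / 5)
weights = weights * np.exp(-alpha * y_true * y_pred)
weights /= np.sum(weights)                  # renormalize to sum to one

# final prediction: sign of the alpha-weighted sum of stump outputs
# (two stumps voting on two samples in this toy case)
stump_alphas = [0.8, 0.3]
stump_preds = [np.array([1, -1]), np.array([-1, -1])]
combined = sum(a * p for a, p in zip(stump_alphas, stump_preds))
print(np.sign(combined))  # [ 1. -1.]
```

The first sample ends up classified as 1 because the stronger stump (α = 0.8) outvotes the weaker one.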

Before we can code Adaboost, we need to implement the weak learner. Each decision stump will have four attributes:

• `threshold` – the threshold that differentiates the two classes based on one feature
• `alpha` – the performance measure, used as its weight in the final prediction
• `index` – the index of the feature the classifier is built on
• `polarity` – used to invert weak classifiers that perform worse than random guessing (error above 0.5).
```
import numpy as np

class DecisionStump:
    def __init__(self):
        self.threshold = None
        self.alpha = None
        self.index = None
        self.polarity = 1
```

In addition to these attributes, it will also need a function for making predictions.

```
    def predict(self, X):
        n_samples = X.shape[0]
        X_column = X[:, self.index]
        # start with everything classified as 1, then flip to -1
        # based on the polarity and threshold
        predictions = np.ones(n_samples)
        if self.polarity == 1:
            predictions[X_column < self.threshold] = -1
        else:
            predictions[X_column > self.threshold] = -1
        return predictions
```
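To see what the prediction rule does, here is a minimal stand-alone version of the same logic (a throwaway `stump_predict` helper, not part of the class) applied to a tiny single-feature dataset:

```python
import numpy as np

# stand-alone copy of the stump's prediction rule for a quick check
def stump_predict(X, index, threshold, polarity=1):
    X_column = X[:, index]
    predictions = np.ones(X.shape[0])
    if polarity == 1:
        predictions[X_column < threshold] = -1
    else:
        predictions[X_column > threshold] = -1
    return predictions

X = np.array([[1.0], [2.0], [3.0], [4.0]])
print(stump_predict(X, index=0, threshold=2.5))                # [-1. -1.  1.  1.]
print(stump_predict(X, index=0, threshold=2.5, polarity=-1))   # [ 1.  1. -1. -1.]
```

Flipping the polarity simply inverts every prediction, which is how a worse-than-random stump is turned into a useful one.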

Using this `DecisionStump` class, we can start building AdaBoost. It will have two attributes:

• `n_estimators` – number of decision stumps to use
• `ensemble_of_stumps` – a collection of trained decision stumps

```
class AdaBoost:
    def __init__(self, n_estimators=5):
        self.n_estimators = n_estimators
        self.ensemble_of_stumps = []
```

Now comes the hard part: the training method that creates the ensemble of decision stumps. We initialize the weights to 1/N and start training the first stump.

```
    def fit(self, X, y):
        N, n_features = X.shape
        # all sample weights start at 1/N
        weights = np.full(N, 1 / N)
        for _ in range(self.n_estimators):
            stump = DecisionStump()
```

Each decision stump uses one feature and a threshold on it to make classifications. To ‘train’ the stump we need to find the best possible feature-threshold combination. To achieve this, we iterate over all features and, for each feature, test all of its unique values as the threshold. In the end, the feature-threshold combination with the lowest weighted error is selected.

```
            min_error = float('inf')
            # looking for the best feature-threshold configuration
            for i in range(n_features):
                X_column = X[:, i]
                thresholds = np.unique(X_column)
                for threshold in thresholds:
                    p = 1
                    predictions = np.ones(N)
                    predictions[X_column < threshold] = -1
                    # weighted error: sum of weights of misclassified samples
                    misclassified = weights[y != predictions]
                    error = sum(misclassified)
```

If the error of a stump is greater than 0.5, we reverse its polarity, which turns its error into 1 − error. This is done because, by definition, weak learners need to perform better than random guessing.

```
                    # if the error is greater than .5, reverse the learner's polarity
                    if error > 0.5:
                        error = 1 - error
                        p = -1
                    # storing the best configuration for the current stump
                    if error < min_error:
                        stump.polarity = p
                        stump.threshold = threshold
                        stump.index = i
                        min_error = error
```

Using the error of the stump, we calculate its performance measure alpha, which is then used to update the weights for the next stump.

```
            # a small epsilon avoids division by zero when min_error is 0
            EPS = 1e-10
            stump.alpha = 0.5 * np.log((1.0 - min_error) / (min_error + EPS))
            # update the sample weights and normalize them to sum to one
            weights = weights * np.exp(-stump.alpha * y * stump.predict(X))
            weights /= np.sum(weights)
```

Finally, we add the current decision stump to Adaboost’s ensemble of weak learners.

`            self.ensemble_of_stumps.append(stump)`

In addition to the training method, we need a method for making predictions. It calculates the weighted sum of all stumps’ predictions and uses the sign of the sum as the final prediction.

```
    def predict(self, X):
        # weighted sum of all stumps' predictions
        stump_preds = [stump.alpha * stump.predict(X) for stump in self.ensemble_of_stumps]
        y_pred = np.sum(stump_preds, axis=0)
        # the sign of the sum gives the final classification
        return np.sign(y_pred)
```

Let’s test our AdaBoost implementation on the breast cancer dataset and compare it with sklearn’s implementation.

```
import time
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)
# changing class 0 to -1, since AdaBoost expects labels in {-1, 1}
y[y == 0] = -1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifier = AdaBoost(n_estimators=5)
start = time.perf_counter()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
end = time.perf_counter()

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(f'Finished in {round(end-start, 2)} second(s)')
```
```-----------------------------Output-----------------------------
Accuracy: 0.9649122807017544
Finished in 5.48 second(s) ```
```
from sklearn.ensemble import AdaBoostClassifier

sk_classifier = AdaBoostClassifier(n_estimators=5)
start = time.perf_counter()
sk_classifier.fit(X_train, y_train)
sk_y_pred = sk_classifier.predict(X_test)
end = time.perf_counter()

sk_accuracy = accuracy_score(y_test, sk_y_pred)
print("Accuracy:", sk_accuracy)
print(f'Finished in {round(end-start, 2)} second(s)')
```
```-----------------------------Output-----------------------------
Accuracy: 0.9649122807017544
Finished in 0.02 second(s) ```

Our implementation matches sklearn’s in accuracy but is considerably slower.
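Most of the gap comes from the pure-Python double loop over features and thresholds; sklearn's tree learners run this search in compiled code. As a rough illustration of one possible speed-up (a sketch with a hypothetical `best_threshold` helper, not part of the article's implementation), the threshold scan for a single feature can be vectorized with NumPy broadcasting:

```python
import numpy as np

# sketch: score every unique threshold of one feature at once via broadcasting,
# instead of looping threshold by threshold in Python
def best_threshold(X_column, y, weights):
    thresholds = np.unique(X_column)
    # shape (n_thresholds, n_samples): prediction of every candidate stump
    preds = np.where(X_column[None, :] < thresholds[:, None], -1, 1)
    # weighted error of each candidate threshold
    errors = (weights[None, :] * (preds != y[None, :])).sum(axis=1)
    # a flipped stump has error 1 - error, so score each threshold by its better side
    best = np.argmin(np.minimum(errors, 1 - errors))
    return thresholds[best], errors[best]

X_column = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([-1, -1, 1, 1])
weights = np.full(4, 0.25)
t, e = best_threshold(X_column, y, weights)
print(t, e)  # 3.0 0.0
```

This trades memory (an n_thresholds × n_samples array) for speed, which is usually a good deal on datasets the size of breast cancer.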

The Colab notebook for the above implementation can be found here.

### Last Epoch

This article deconstructed boosting and implemented AdaBoost from scratch. I hope this gives you a good understanding of the boosting approach and helps you understand modern boosting algorithms more easily. Boosting has many advantages: it is fast and easy to program, and excluding the number of weak learners, `n_estimators`, it has no parameters to tune. One thing to keep in mind when working with boosting algorithms is the data. Because the boosting method obsessively tries to correct its errors, outliers or noisy labels can send the model down a tangent of trying to fit unrealistic or mislabeled examples.

