
Handling Imbalanced Datasets: A Guide With Hands-on Implementation

A hands-on guide to balancing imbalanced datasets using various resampling techniques, and a comparison of their results.


In classification problems (binary and multiclass), datasets are often imbalanced, meaning one class has far more samples than the others. This leads to bias during training: the class with more samples is favoured over the classes with fewer samples, so minority-class examples are frequently misclassified even while overall accuracy looks high. To overcome this bias, we need to balance the dataset so that all classes contain an approximately equal number of samples.

In this article, I'll discuss various techniques for achieving a balanced dataset and compare them.

For demonstration, I've taken the Pima Indians Diabetes Database, published by UCI Machine Learning on Kaggle. Get the dataset from here. This is a binary classification dataset. It consists of various factors related to diabetes – Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age and Outcome (1 for positive, 0 for negative). 'Outcome' is the dependent variable; the rest are independent variables.


Python provides the imbalanced-learn package for handling imbalanced datasets:

pip install imbalanced-learn

Exploring the dataset

import pandas as pd 
import matplotlib.pyplot as plt
df = pd.read_csv('/input/pima-indians-diabetes-database/diabetes.csv')
df['Outcome'].value_counts()
0    500
1    268
Name: Outcome, dtype: int64

This clearly shows an imbalanced dataset: class 0 has 500 samples and class 1 has 268, which is roughly half of class 0.

For a better understanding, let's visualise the class distribution.

count_classes = df['Outcome'].value_counts(sort=True)
count_classes.plot(kind='bar', rot=0)
plt.title("Class Distribution")
plt.xlabel("Class")
plt.ylabel("Frequency")
(Bar chart of the class distribution: 500 samples of class 0 versus 268 of class 1.)

We separate the independent and dependent variables into X and Y, respectively.

X = df.drop('Outcome',axis = 1)
Y = df['Outcome']

Original dataset size

X.shape,Y.shape
((768, 8), (768,))

Undersampling

This technique samples down the class containing more data until it matches the class containing the fewest samples. Suppose class A has 900 samples and class B has 100 samples; the imbalance ratio is then 9:1. Using undersampling, we keep all 100 samples of class B and randomly select 100 of the 900 samples of class A. The ratio becomes 1:1 and we can say the dataset is balanced.
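To make the 9:1 example concrete, here is a minimal sketch (not part of the original walkthrough) that randomly undersamples a synthetic imbalanced dataset with imblearn's RandomUnderSampler; the toy data and exact counts are illustrative only.

# Minimal sketch: random undersampling of a synthetic ~9:1 dataset (illustrative).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X_toy, y_toy = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                   random_state=42)
print(Counter(y_toy))          # roughly 900 samples of class 0 and 100 of class 1

rus = RandomUnderSampler(random_state=42)
X_bal, y_bal = rus.fit_resample(X_toy, y_toy)
print(Counter(y_bal))          # both classes reduced to the minority count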

The imblearn library has an under_sampling module that contains various undersampling methods. Of those, I've demonstrated NearMiss.

from imblearn.under_sampling import NearMiss
nm = NearMiss()
X_res, y_res = nm.fit_resample(X, Y)
X_res.shape, y_res.shape

((536, 8), (536,))

After undersampling, the dataset has 536 samples, 232 fewer than the original 768.

from collections import Counter
print('Original dataset shape {}'.format(Counter(Y)))
print('Resampled dataset shape {}'.format(Counter(y_res)))
Original dataset shape Counter({0: 500, 1: 268})
Resampled dataset shape Counter({0: 268, 1: 268})

Comparing the original and undersampled target values, we see that both classes now have an equal number of samples, matching the count of the original minority class (class 1, with 268 samples).

Undersampling is also referred to as downsampling, as it reduces the number of samples. It should only be used on large datasets; otherwise, a large portion of the data is discarded, which is not good for the model, since the discarded samples may hold important information about the dataset.

Oversampling

Oversampling is the opposite of undersampling. Here the class containing less data is brought up to the size of the class containing more data, by adding samples to the minority class. Taking the same example as above, class A remains at 900 samples and class B grows from 100 to 900. The ratio becomes 1:1 and the dataset is balanced.

The imblearn library's over_sampling module contains various oversampling methods. Of those, I've demonstrated RandomOverSampler.

from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()
X_train_res, y_train_res = ros.fit_resample(X, Y)
X_train_res.shape, y_train_res.shape

((1000, 8), (1000,))

After oversampling, the dataset size increases from the original 768 to 1000.

print('Original dataset shape {}'.format(Counter(Y)))
print('Resampled dataset shape {}'.format(Counter(y_train_res)))
Original dataset shape Counter({0: 500, 1: 268})
Resampled dataset shape Counter({1: 500, 0: 500})

Comparing the original and oversampled target values, we see that both classes now have an equal number of samples, matching the count of the original majority class (class 0, with 500 samples).

Oversampling is also referred to as upsampling, as it increases the number of samples. It is primarily useful for small or medium-sized datasets. Unlike undersampling, there is no loss of data; instead more data is added, which can prove to be good for the model.
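Note that RandomOverSampler simply duplicates existing minority rows. The same over_sampling module also provides SMOTE (used later as part of SMOTETomek), which interpolates new synthetic minority samples instead. A minimal sketch on the same X and Y, assuming the imports above:

# Minimal sketch: SMOTE synthesises new minority samples instead of duplicating rows.
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_sm, y_sm = sm.fit_resample(X, Y)
print(Counter(y_sm))               # expected: 500 samples in each class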

SMOTETomek

SMOTETomek sits somewhere between upsampling and downsampling. It is a hybrid method that combines the two approaches above, pairing an undersampling method (Tomek links) with an oversampling method (SMOTE). It is available in the imblearn.combine module.

from imblearn.combine import SMOTETomek
smk = SMOTETomek()
X_res, y_res = smk.fit_resample(X, Y)
X_res.shape, y_res.shape

((944, 8), (944,))

Here the dataset size after resampling increases from 768 to 944.

print('Original dataset shape {}'.format(Counter(Y)))
print('Resampled dataset shape {}'.format(Counter(y_res)))
Original dataset shape Counter({0: 500, 1: 268})
Resampled dataset shape Counter({1: 472, 0: 472})

Class 0 has been downsampled from 500 to 472 and class 1 has been upsampled from 268 to 472.

Note that by combining both techniques, SMOTETomek produces a 1:1 ratio of samples, i.e. a balanced dataset.
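To build intuition for the Tomek part of the hybrid, the sketch below (my addition, not from the original article) applies Tomek-links cleaning on its own via imblearn.under_sampling.TomekLinks; it only removes borderline majority samples, so by itself it does not fully balance the classes.

# Minimal sketch: Tomek-links cleaning alone, reusing X and Y from above.
from imblearn.under_sampling import TomekLinks

tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(X, Y)   # drops majority samples that form Tomek links
print(Counter(y_tl))                 # class 0 shrinks slightly; class 1 is untouched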

SMOTEENN

SMOTEENN is another method in the imblearn.combine module. It works similarly to SMOTETomek, combining SMOTE with Edited Nearest Neighbours (ENN) cleaning, but the two methods give somewhat different results.

from imblearn.combine import SMOTEENN
smk = SMOTEENN()
X_res, y_res = smk.fit_resample(X, Y)
X_res.shape, y_res.shape

((532, 8), (532,))

Here the dataset size after resampling decreases from 768 to 532.

print('Original dataset shape {}'.format(Counter(Y)))
print('Resampled dataset shape {}'.format(Counter(y_res)))
Original dataset shape Counter({0: 500, 1: 268})
Resampled dataset shape Counter({1: 309, 0: 223})

Class 0 has been downsampled from 500 to 223 and class 1 has been upsampled from 268 to 309.

Unlike SMOTETomek, the ratio is not exactly 1:1, but the difference between the class counts is not very large.
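The ENN part of SMOTEENN can likewise be run on its own. Here is a small sketch (my addition) using imblearn's EditedNearestNeighbours, which removes samples whose neighbours mostly disagree with their label; because it cleans more aggressively than Tomek links, SMOTEENN typically ends up with fewer samples than SMOTETomek, as seen above.

# Minimal sketch: Edited Nearest Neighbours cleaning alone, reusing X and Y.
from imblearn.under_sampling import EditedNearestNeighbours

enn = EditedNearestNeighbours()
X_enn, y_enn = enn.fit_resample(X, Y)   # removes samples whose neighbours disagree with their label
print(Counter(y_enn))                   # more aggressive cleaning than Tomek links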

All four of the techniques above share an important parameter called sampling_strategy, which specifies how the resampling should be done. By default, it's set to ‘auto’; a short usage sketch follows the list of options below.

The other available options are:

‘minority’ – resampling done only to the minority class.

‘not majority’ – resample all classes except the majority class (for over-samplers, this is what ‘auto’ means).

‘not minority’ – resample all classes except the minority class (for under-samplers, this is what ‘auto’ means).

‘all’ – resample all classes.
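As a rough usage sketch (assuming the same X and Y as above), sampling_strategy can be passed either as one of the strings listed above or, for binary problems, as a float giving the desired minority-to-majority ratio after resampling:

# Minimal sketch of sampling_strategy, reusing X and Y.
from imblearn.over_sampling import RandomOverSampler

# Float: target a minority/majority ratio of 0.8 after resampling (binary problems only).
ros_partial = RandomOverSampler(sampling_strategy=0.8)
X_p, y_p = ros_partial.fit_resample(X, Y)
print(Counter(y_p))    # {0: 500, 1: 400}

# String: oversample only the minority class.
ros_min = RandomOverSampler(sampling_strategy='minority')
X_m, y_m = ros_min.fit_resample(X, Y)
print(Counter(y_m))    # {0: 500, 1: 500}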

BalancedRandomForestClassifier

The imblearn.ensemble module contains ensemble methods such as BalancedRandomForestClassifier and EasyEnsembleClassifier.

from sklearn.model_selection import train_test_split
from imblearn.ensemble import BalancedRandomForestClassifier

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
brf = BalancedRandomForestClassifier()
brf.fit(X_train, Y_train)
brf.score(X_train, Y_train)

0.9583333333333334

brf.score(X_test,Y_test)

0.7272727272727273

These methods perform well on the training set but not as well on the test set. Their advantage is that they can be used directly on an imbalanced dataset, and they can later be stacked with other models.
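Since overall accuracy can hide how the minority class fares, here is a quick sketch (my addition) for inspecting per-class performance of the model fitted above, using scikit-learn's standard metrics:

# Minimal sketch: look at per-class precision/recall rather than accuracy alone.
from sklearn.metrics import classification_report, confusion_matrix

y_pred = brf.predict(X_test)
print(confusion_matrix(Y_test, y_pred))        # rows: actual class, columns: predicted class
print(classification_report(Y_test, y_pred))   # per-class precision, recall and F1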

Conclusion

Here I've discussed some of the most commonly used techniques for handling imbalanced datasets. To avoid model bias, an imbalanced dataset should be converted into a balanced one. Tree-based models are often less affected by class imbalance, though this depends entirely on the data itself. The imblearn library also has pipeline and metrics modules.
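As a rough illustration of those modules (an assumed workflow, not code from the original article), a resampler can be chained with a classifier in an imblearn Pipeline so that resampling is applied only while fitting, and imblearn.metrics offers an imbalance-aware report:

# Minimal sketch of imblearn's pipeline and metrics modules (assumed workflow).
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.metrics import classification_report_imbalanced
from sklearn.linear_model import LogisticRegression

# SMOTE runs only during fit(), so the test data is never resampled.
pipe = Pipeline([('smote', SMOTE(random_state=42)),
                 ('clf', LogisticRegression(max_iter=1000))])
pipe.fit(X_train, Y_train)
print(classification_report_imbalanced(Y_test, pipe.predict(X_test)))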
The complete code of the above implementation is available in AIM's GitHub repository. Please visit this link to find it.
