What is Imblearn Technique – Everything To Know For Class Imbalance Issues In Machine Learning

Rohit Dwivedi

In machine learning, while building a classification model we sometimes face situations where the classes are not present in equal proportion. For example, we may have 500 records of class 0 and only 200 records of class 1. This is called class imbalance. Most machine learning models are trained to maximize overall accuracy, so in these situations the model gets biased towards the majority class, and this shows up as poor precision and recall for the minority class. So how do we build a model on such data so that it classifies each class correctly and does not get biased?

To deal with class imbalance, a set of resampling techniques is commonly used, referred to in this article as Imblearn techniques after imblearn (imbalanced-learn), the Python package that implements them. They either upsample the minority class or downsample the majority class so that the classes end up in equal proportion. In this article, we will discuss these techniques and how to use them for upsampling and downsampling. For this experiment, we are using the Pima Indians Diabetes data set since its classes are imbalanced. The data is available on Kaggle for download.

What will we learn from this article?

  1. How to deal with class imbalanced data sets?
  2. What are Imblearn techniques? How do they work?
  3. How to implement Imblearn techniques over a data set having imbalanced classes?

  1. How to deal with class imbalanced data sets?

Class imbalance is the problem of not having equal ratios of the different classes. Consider building a machine learning model that predicts whether a loan applicant will default or not. Suppose the data set has 500 rows for the default class but only 200 rows for the non-default class. A model built on this data is likely to be biased towards the default class because it is the majority class: it will learn to classify default cases better than non-default cases. This would not be a good predictive model. To resolve the problem we use Imblearn techniques, which either reduce the majority class (default) to the same size as the minority class (non-default) or increase the minority class to match the majority.
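To make the bias concrete, here is a minimal sketch using the 500/200 split from the loan example. The "model" is a hypothetical stand-in that always predicts the majority class, and the label encoding (1 = default, 0 = non-default) is an assumption made only for this illustration.

```python
import numpy as np

# 500 "default" rows (label 1, majority) and 200 "non-default" rows (label 0, minority)
y_true = np.array([1] * 500 + [0] * 200)

# A trivial stand-in model that always predicts the majority class
y_pred = np.ones_like(y_true)

# Overall accuracy: fraction of rows predicted correctly
accuracy = (y_pred == y_true).mean()

# Recall for the minority class: fraction of true non-default rows found
minority_mask = y_true == 0
minority_recall = (y_pred[minority_mask] == 0).mean()

print(f"accuracy: {accuracy:.3f}")
print(f"minority recall: {minority_recall:.3f}")
```

The accuracy looks acceptable even though not a single minority row is identified, which is why accuracy alone is a poor guide on imbalanced data.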

  2. What are Imblearn techniques? How do they work?

Imblearn techniques are methods for generating a data set that has an equal ratio of classes. A predictive model built on such a data set is less likely to be biased towards one class. We mainly have two options for treating an imbalanced data set: upsampling and downsampling. In upsampling, we generate synthetic data for the minority class until it matches the size of the majority class, whereas in downsampling we remove data points from the majority class until it matches the minority class.
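The two options can be sketched with plain NumPy on a toy data set. Note the assumptions: the data and variable names are made up for illustration, and the upsampling here simply duplicates existing minority rows at random rather than generating synthetic ones, so it is a simpler cousin of the approach described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 10 majority rows (class 0) and 4 minority rows (class 1)
X = np.arange(28).reshape(14, 2)
y = np.array([0] * 10 + [1] * 4)

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Downsampling: keep a random subset of the majority class, as large as the minority
keep_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
down_idx = np.concatenate([keep_majority, minority_idx])
X_down, y_down = X[down_idx], y[down_idx]

# Upsampling by duplication: draw extra minority rows with replacement
n_extra = len(majority_idx) - len(minority_idx)
extra_minority = rng.choice(minority_idx, size=n_extra, replace=True)
up_idx = np.concatenate([np.arange(len(y)), extra_minority])
X_up, y_up = X[up_idx], y[up_idx]

print("downsampled class counts:", np.bincount(y_down))
print("upsampled class counts:", np.bincount(y_up))
```

Downsampling throws information away, while upsampling keeps every original row at the cost of a larger training set.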

  3. How to implement Imblearn techniques over a data set having imbalanced classes?

Now let us see in practice how upsampling and downsampling are done. We will first install the imblearn package and then import all the required libraries. The snippets that follow assume the Pima data set has been loaded into a pandas DataFrame named df. Use the code below.

# the imblearn package is published on PyPI as imbalanced-learn
!pip install imbalanced-learn
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from imblearn.over_sampling import SMOTE

Now we will check the value counts of both classes in the data set. Use the code below.

df['class'].value_counts()
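As a small illustration of what this check returns, the sketch below runs it on a toy DataFrame. The frame `df_demo` is a made-up stand-in with the class counts the article reports for this data set (500 and 268), not the real data.

```python
import pandas as pd

# Toy stand-in: only the 'class' column, with the counts reported in the article
df_demo = pd.DataFrame({"class": [0] * 500 + [1] * 268})

# Absolute number of rows per class
counts = df_demo["class"].value_counts()

# Share of each class, which makes the imbalance easier to read
shares = df_demo["class"].value_counts(normalize=True)

print(counts)
print(shares.round(3))
```

Passing `normalize=True` is optional, but the proportions are handy when comparing the class balance before and after resampling.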

As we can see, a total of 500 rows fall under class 0 and 268 rows under class 1. This is an imbalanced data set where the majority of the data points lie in class 0. We now have two options: upsampling or downsampling. We will do both and compare the results. We first divide the data into features and target, X and y respectively, and then split it into training and testing sets. Use the code below.

X = df.values[:,0:8]   # the eight feature columns (0:7 would leave out the last feature)

y = df.values[:,8]     # the class label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)
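One optional refinement, shown here only as a sketch and not used in the article's code: passing `stratify` to `train_test_split` keeps the class ratio roughly the same in the training and testing sets. The toy arrays below are assumptions that reuse the class counts of this data set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the same class counts as the data set: 500 zeros, 268 ones
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo asks for the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=7, stratify=y_demo
)

print("share of class 1 overall:", round(y_demo.mean(), 3))
print("share of class 1 in train:", round(y_tr.mean(), 3))
print("share of class 1 in test:", round(y_te.mean(), 3))
```

Without stratification the split is purely random, so the minority share in the test set can drift from the share in the full data.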

Now we will check the count of both classes in the training data and then use upsampling to generate new data points for the minority class. Use the code below.

print("Count of 1 class in training set before upsampling:", sum(y_train == 1))

print("Count of 0 class in training set before upsampling:", sum(y_train == 0))

We are using the SMOTE technique (Synthetic Minority Over-sampling Technique) from imblearn to do the upsampling. It creates each synthetic data point by interpolating between a minority-class sample and one of its k nearest minority-class neighbours. We have set k = 3, but since it is a hyperparameter it can be tuned. We will first generate the data points and then compare the class counts after upsampling. Refer to the code below.
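The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not imblearn's implementation: the function name `smote_sketch`, its parameters, and the toy minority rows are all hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all minority rows
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest rows for each row

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))    # pick a random minority row
        nb = rng.choice(neighbours[base])  # pick one of its k nearest neighbours
        gap = rng.random()                 # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Toy minority-class rows with two features
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0], [2.5, 2.5]])
new_rows = smote_sketch(X_min, n_new=4, k=3)
print(new_rows.shape)
```

Because every synthetic row lies on a line segment between two existing minority rows, the new points stay inside the region the minority class already occupies.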

smote = SMOTE(sampling_strategy=1.0, k_neighbors=3, random_state=1)

# fit_resample replaces fit_sample, which was removed from recent imblearn releases
X_train_new, y_train_new = smote.fit_resample(X_train, y_train.ravel())

print("Count of 1 class in training set after upsampling:", sum(y_train_new == 1))

print("Count of 0 class in training set after upsampling:", sum(y_train_new == 0))

Now the classes in the training set are balanced. We will build a random forest model first on the original training data and then on the upsampled data, evaluating both on the same test set. Use the code below.

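# Train and evaluate on the original (imbalanced) training data
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(model.score(X_test, y_test))
# scikit-learn metrics expect the true labels first, then the predictions
print(metrics.classification_report(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))

# Refit the same model on the upsampled training data and evaluate again
model.fit(X_train_new, y_train_new)
y_pred = model.predict(X_test)
print(model.score(X_test, y_test))
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))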
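When reading these reports, the argument order matters: scikit-learn's metric functions take the true labels first and the predictions second. The sketch below uses small made-up label arrays (an assumption for illustration only) to show how swapping the two transposes the confusion matrix and exchanges precision and recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: six true negatives/positives mix, class 1 is the minority
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Recall: share of true class-1 rows that were found
recall = recall_score(y_true, y_pred)
# Precision: share of predicted class-1 rows that are correct
precision = precision_score(y_true, y_pred)
print("recall:", recall, "precision:", round(precision, 3))

# Swapping the arguments transposes the matrix and swaps the two scores
cm_swapped = confusion_matrix(y_pred, y_true)
recall_swapped = recall_score(y_pred, y_true)
```

For imbalanced data the minority-class recall is usually the number to watch, so it is worth double-checking which array is passed first.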

Now we will downsample the majority class by randomly dropping records from the original data until it matches the minority class. Use the code below.

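Non_diabetic_indices = df[df['class'] == 0].index   # index labels of the majority class
Non_diabetic = len(df[df['class'] == 0])            # number of majority rows
Diabetic_indices = df[df['class'] == 1].index       # index labels of the minority class
Diabetic = len(df[df['class'] == 1])                # number of minority rows
print(Non_diabetic)
print(Diabetic)

# Randomly keep as many majority rows as there are minority rows
random_indices = np.random.choice(Non_diabetic_indices, Diabetic, replace=False)

down_sample_indices = np.concatenate([Diabetic_indices, random_indices])
# Build the downsampled DataFrame used in the next step
down_sample = df.loc[down_sample_indices]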
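As an alternative sketch, pandas can do the same balancing in one step with `groupby(...).sample(...)`. The toy frame `df_demo` and its placeholder `feature` column are assumptions for illustration; only the `class` column name and the 500/268 counts come from the article.

```python
import pandas as pd

# Toy stand-in with the class counts of the data set
df_demo = pd.DataFrame({
    "feature": range(768),            # placeholder feature column
    "class": [0] * 500 + [1] * 268,   # 0 = majority, 1 = minority
})

# Size of the smallest class
minority_size = int(df_demo["class"].value_counts().min())

# Draw the same number of rows from every class, without replacement
balanced = df_demo.groupby("class").sample(n=minority_size, random_state=7)

print(balanced["class"].value_counts())
```

Setting `random_state` makes the draw repeatable, which helps when comparing models trained on the downsampled data.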

Now we will again split the data set into features and target and build the model again. Use the code below.

X = down_sample.values[:,0:8]   # the eight feature columns
Y = down_sample.values[:,8]     # the class label
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=7)
print('After DownSampling X_train:', X_train.shape)
print('After DownSampling X_test:', X_test.shape)
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(model.score(X_test, y_test))
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))

Conclusion 

In this article, we discussed how to pre-process a data set with imbalanced classes before building predictive models. We explored Imblearn techniques and used the SMOTE method to generate synthetic data. We first did upsampling and then performed downsampling. imblearn offers more methods, such as Tomek links and cluster centroids, that can be used for the same problem. You can check the official imbalanced-learn documentation for details.

Also check this article “Complete Tutorial on Tkinter To Deploy Machine Learning Model” that will help you to deploy machine learning models.

Copyright Analytics India Magazine Pvt Ltd
