# (A) Introduction

*This article assumes that the readers have some knowledge about binary classification problems.*

Consider a binary classification problem where the target variable is highly imbalanced. You may imagine problems like detecting fraudulent transactions, predicting attrition, or detecting cancer, where the number of positive examples is far smaller than the number of negative examples. In such cases, training a classification algorithm to detect the positive class accurately becomes difficult because the model becomes biased towards predicting negatives.

In this article, I will discuss some popular class-imbalance correction techniques that will help you balance such data. The aim of this article is to present the algorithms in their simplest form and discuss their purposes and limitations. The article also includes some animated videos that demonstrate how these algorithms work. At the end, you will find the R/Python code you need to start using these techniques.

There are three general ways to balance a class-imbalanced dataset: 1) under-sampling, 2) over-sampling, and 3) hybrid techniques. To make the class distribution uniform (or nearly uniform), under-sampling techniques delete observations from the majority class, while over-sampling techniques generate observations for the minority class. Hybrid techniques combine under-sampling and over-sampling.

# (B) Under-sampling Techniques

**(B.1) Random Under-sampling**

Random under-sampling randomly selects a number of observations from the majority class and removes them from the dataset. This reduces the size of the majority class, which helps balance the data.

Video 1 shows an illustration of the technique. For the purpose of demonstration, a simple dataset has been used in this video which contains two numerical predictors and a target. The observations belonging to the majority class are coloured ‘magenta’ and the observations belonging to the minority class are coloured black. The video shows the random deletion of observations from the majority class.

**Disadvantage**: The deletion of observations in this random manner may lead to the deletion of some very useful information from the dataset.
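As a sketch of the idea above, here is a minimal random under-sampler in plain Python (the function name and the toy data are mine, not from the article):

```python
import random

def random_undersample(X, y, majority_label, seed=42):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]
    keep = rng.sample(majority, len(minority))  # random subset of the majority
    idx = sorted(minority + keep)
    return [X[i] for i in idx], [y[i] for i in idx]

X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2                 # 8 negatives, 2 positives
X_bal, y_bal = random_undersample(X, y, majority_label=0)
print(y_bal.count(0), y_bal.count(1))  # 2 2
```

Because the deleted rows are chosen uniformly at random, informative majority observations may be lost, which is exactly the disadvantage noted above.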

**(B.2) Tomek Links**

This is a heuristic approach. The technique identifies all pairs of observations that are nearest to each other but belong to different classes. These pairs are called Tomek links. In other words, two observations a and b form a Tomek link if:

- a is the nearest neighbour of b.
- b is the nearest neighbour of a.
- a and b belong to different classes.

The heuristic is that these links usually lie on the boundary separating the two classes. Therefore, deleting both observations in each link may increase class separation (Fig. 3). For class-imbalance handling, however, we delete only the majority-class observation from each Tomek link (Fig. 4). Note that this method does not solve the class imbalance problem to any great extent, because only a few observations get deleted.

Video 2 shows an illustration of this technique. In each iteration, an observation belonging to the minority class is indicated with a red circle and its nearest neighbour with a blue circle. If a Tomek link is identified, the count of Tomek links in the chart title increases by one.
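The three conditions above can be checked directly with a brute-force mutual-nearest-neighbour search. A minimal sketch (helper names and the toy data are mine, assuming Euclidean distance):

```python
import math

def nearest(i, points):
    # index of the nearest *other* point, by Euclidean distance
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))

def tomek_links(X, y):
    """Return pairs (a, b) that are mutual nearest neighbours of different classes."""
    links = []
    for a in range(len(X)):
        b = nearest(a, X)
        # a < b avoids reporting the same link twice
        if y[a] != y[b] and nearest(b, X) == a and a < b:
            links.append((a, b))
    return links

X = [[0.0], [1.0], [1.2], [5.0], [6.0]]
y = [0,     0,     1,     1,     1]
print(tomek_links(X, y))   # [(1, 2)] — points 1 and 2 form a Tomek link
```

For imbalance correction you would then drop only the majority-class member of each returned pair.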

**(B.3) Neighbourhood Cleaning Rule using Wilson’s Edited Nearest Neighbour (ENN) Rule**

The ENN algorithm can be written in a simple way, as follows:

- For every observation O
  - Find the three nearest neighbours of O
  - If O gets misclassified by its three nearest neighbours
    - Then delete O
  - End If
- End For

This is a heuristic approach and is popularly used as a data-cleaning technique. With a slight modification, it can also be used as a class-imbalance correction technique: we apply the rule only to observations of the majority class. That is, if an observation belonging to the majority class gets misclassified by its three nearest neighbours, we delete that observation.

Video 3 shows an illustration of this technique. In each iteration, a red circle marks an observation from the majority class and blue circles mark its three nearest neighbours. If two or more of the nearest neighbours have the same values, you may see fewer than three blue circles. If the red circle disappears, one of the nearest neighbours has exactly the same values as the observation. If an observation from the majority class gets misclassified by its three nearest neighbours, it is marked with a cross inside a circle.
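A minimal sketch of the modified ENN rule described above (the function name and the toy data are assumptions for illustration):

```python
import math
from collections import Counter

def enn_undersample(X, y, majority_label, k=3):
    """Drop majority-class points whose k nearest neighbours vote against them."""
    drop = set()
    for i, label in enumerate(y):
        if label != majority_label:      # only majority observations are candidates
            continue
        neighbours = sorted((j for j in range(len(X)) if j != i),
                            key=lambda j: math.dist(X[i], X[j]))[:k]
        vote = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if vote != label:                # misclassified by its k-NN vote
            drop.add(i)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

X = [[0.0], [0.1], [0.2], [0.3], [5.0], [6.0], [7.0]]
y = [1,     1,     0,     1,     0,     0,     0]
Xc, yc = enn_undersample(X, y, majority_label=0)
print(yc)   # [1, 1, 1, 0, 0, 0] — the stray majority point at 0.2 is deleted
```

The unmodified data-cleaning version of ENN would simply apply the same vote to every observation, regardless of class.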

# (C) Over-sampling Techniques

**(C.1) Random Over-sampling**

Random over-sampling randomly samples a number of observations from the minority class (sampling is done with replacement) and adds them to the dataset. This increases the size of the minority class, which helps balance the data. By repeating the same observations a number of times, the algorithm is forced to consider the minority class amid the many majority observations while creating a decision boundary.

Video 4 shows an illustration of this technique. In the video, you will see integers appearing on the top of the observations belonging to the minority class. The integer indicates the number of times the observation is sampled.

**Disadvantage:** This technique may increase the chance of model overfitting.
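The idea can be sketched in a few lines of plain Python, resampling minority indices with replacement until the two classes match (names and toy data are mine):

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate minority rows (sampling with replacement) until classes balance."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority_n = len(y) - len(minority)
    # draw enough duplicates to reach the majority count
    extra = [rng.choice(minority) for _ in range(majority_n - len(minority))]
    return X + [X[i] for i in extra], y + [y[i] for i in extra]

X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_over, y_over = random_oversample(X, y, minority_label=1)
print(y_over.count(0), y_over.count(1))   # 8 8
```

Because the added rows are exact copies, the overfitting risk noted above applies: the model may memorise the repeated minority points.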

**(C.2) Synthetic Minority Over-Sampling Technique (SMOTE)**

This is also a heuristic approach, based on the k-nearest-neighbour algorithm. It generates synthetic minority-class observations by interpolating a point on the line segment between a minority-class observation and one of its nearest minority-class neighbours. A simple SMOTE algorithm can be written as follows:

**SMOTE Algorithm (for k = 1)**

- For each observation O in the minority class
  - Find its nearest minority neighbour N
  - Let d = distance(O, N)
  - Let r = a random number drawn from the open interval (0, 1)
  - Let d1 = d * r
  - Generate a new point P at distance d1 from O, collinear with O and N
- End For

Video 5 shows an illustration of SMOTE for k=1. In every iteration, an observation belonging to the minority class is marked with a red circle and its nearest neighbour with a blue circle. The distance between the two points is represented by a dotted line, and a new synthetic observation is inserted on this dotted line, marked by a cross enclosed in a circle. At the end, all the synthetically generated observations are marked on the graph as red dots.
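The k=1 procedure above can be sketched as follows. This is a toy illustration, not a production implementation; note that Python's `random()` draws from [0, 1), which I use as a stand-in for the open interval (0, 1):

```python
import math
import random

def smote_k1(X_min, seed=1):
    """One synthetic point per minority observation, interpolated toward
    its nearest minority neighbour (SMOTE with k = 1)."""
    rng = random.Random(seed)
    synthetic = []
    for i, o in enumerate(X_min):
        # nearest *minority* neighbour of O
        n = min((p for j, p in enumerate(X_min) if j != i),
                key=lambda p: math.dist(o, p))
        r = rng.random()  # fraction of the distance d to travel from O toward N
        synthetic.append([a + r * (b - a) for a, b in zip(o, n)])
    return synthetic

minority = [[1.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
new_points = smote_k1(minority)
print(len(new_points))   # 3 — one synthetic observation per original
```

Each generated point lies on the segment O→N, so it is collinear with O and N at distance d·r from O, exactly as in the pseudocode.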

**SMOTE Algorithm (for k = 2)**

- For each observation O in the minority class
  - Find its two nearest minority neighbours N1 and N2
  - For N in [N1, N2]
    - Let d = distance(O, N)
    - Let r = a random number drawn from the open interval (0, 1)
    - Let d1 = d * r
    - Generate a new point P at distance d1 from O, collinear with O and N
  - End For
- End For

Video 6 shows an illustration of SMOTE for k=2.

Of course, for any k, the algorithm can be extended as follows:

- For each observation O in the minority class
  - Find its k nearest minority neighbours N1, N2, …, Nk
  - For N in [N1, N2, …, Nk]
    - Let d = distance(O, N)
    - Let r = a random number drawn from the open interval (0, 1)
    - Let d1 = d * r
    - Generate a new point P at distance d1 from O, collinear with O and N
  - End For
- End For
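The general-k pseudocode above can be sketched directly: one synthetic point along the segment to each of the k nearest minority neighbours (function and data names are mine):

```python
import math
import random

def smote(X_min, k=2, seed=7):
    """Generate k synthetic points per minority observation, one toward each
    of its k nearest minority neighbours (per the pseudocode above)."""
    rng = random.Random(seed)
    synthetic = []
    for i, o in enumerate(X_min):
        # all other minority points, ordered by distance to O
        others = sorted((p for j, p in enumerate(X_min) if j != i),
                        key=lambda p: math.dist(o, p))
        for n in others[:k]:
            r = rng.random()  # fraction of the distance d to travel from O
            synthetic.append([a + r * (b - a) for a, b in zip(o, n)])
    return synthetic

minority = [[1.0, 1.0], [2.0, 2.0], [3.0, 1.0], [2.0, 0.0]]
print(len(smote(minority, k=2)))   # 8 — two synthetic points per observation
```

The canonical SMOTE paper instead picks neighbours at random to hit a target over-sampling percentage; the version here follows the article's loop, which emits exactly one point per neighbour.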

# (D) Hybrid Techniques

**(D.1) SMOTE + Tomek Links**

As discussed earlier, under-sampling with Tomek links does not solve the class imbalance problem to any great extent, because only a few observations get deleted; it cleans the boundary of separation rather than making the class distribution uniform. Combining it with SMOTE may give us a dataset that is balanced and has a clean boundary of separation. Note, however, that SMOTE should be applied first.

**(D.2) SMOTE + ENN**

Like the Tomek-links technique, Wilson's ENN also does not solve the class imbalance problem to a major extent, because only a few observations get deleted; it focuses on cleaning the data rather than making the class distribution uniform. Applying SMOTE to the dataset, followed by ENN, may give us a cleaner version of balanced data in which some minority observations are synthetically generated.

Similarly, one may try a combination of all these techniques, i.e. SMOTE + ENN + Tomek links.

# (E) Lab

Visit the links to get the R and Python codes of the techniques discussed above.

- Download the Python codes (Jupyter notebook).
- Download the R codes (R script file).

# (F) Some Important Points

Before closing, I want to list some important points you must keep in mind when using these techniques.

- Apply these techniques only to the training data. That is, first make the train-validation-test split, and then apply the class-imbalance handling techniques to the training data only. Your aim is to balance the training data so that the classification algorithm trains properly; its performance, however, should be evaluated on data that reflects the original distribution, so keep the validation and test data unchanged as representative samples of the original data.
- Use stratified random sampling for the train-validation-test split. This ensures that the class distribution in each of these splits is the same and therefore each of these data subsets is a better representative of the population.
- While random under-sampling and random over-sampling work on datasets containing both numerical and categorical variables, techniques like Tomek links, ENN, and SMOTE can be applied only to datasets that contain numeric variables. This is because they rely on distance calculations between observations to find nearest neighbours.
- With some modification, these techniques may be applied to datasets containing non-numerical variables. For example, if all your predictors are binary, you may use Jaccard similarity in place of Euclidean distance; however, in such cases you may have to write these algorithms from scratch. The SMOTE algorithm has been extended for nominal features (SMOTE-N) and for datasets containing both nominal and continuous features (SMOTE-NC). These may be topics of future blogs. You may refer to the original SMOTE paper to understand these algorithms (see section 6), and check the SMOTE-NC function in Python. The SMOTE() function in the DMwR library can be applied to datasets with both numerical and categorical variables.
- Your dataset may contain binary predictors. The SMOTE algorithm will treat those variables as continuous, and a synthetic observation may receive a value between 0 and 1 for such a variable. You may want to post-process that variable afterwards and convert those continuous values back to binary values.
- Much research has shown that AUROC is usually the preferred metric for evaluating model performance on imbalanced datasets.
- While Tomek links and ENN can be applied to multiclass problems, SMOTE can be applied only to binary classification problems. If you would like to explore a class-imbalance handling technique for multiclass classification, look at the SCUT algorithm, which uses SMOTE and cluster-based under-sampling for multiclass imbalanced data. **This may also be an elaborate topic of a future blog post.**
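For the binary-predictor point above, one simple post-processing step is to threshold the interpolated column back to {0, 1} (a sketch; the 0.5 cutoff and the helper name are my assumptions, not from the article):

```python
def rethreshold(rows, binary_cols, cutoff=0.5):
    """Round SMOTE-interpolated values in binary columns back to 0/1."""
    for row in rows:
        for c in binary_cols:
            row[c] = 1 if row[c] >= cutoff else 0
    return rows

# column 0 is a binary predictor that SMOTE turned into fractions
print(rethreshold([[0.2, 3.7], [0.9, 1.1]], binary_cols=[0]))
# [[0, 3.7], [1, 1.1]]
```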

This article is presented by AIM Expert Network (AEN), an invite-only thought leadership platform for tech experts.


*About the author:* I am a full-time faculty at the Praxis Business School’s Post Graduate Program in Data Science at Bangalore. I have a Master’s Degree in Statistics and am deeply passionate about research, learning and teaching. I have five years of experience in teaching in this field, of which the last 3 years have been at Praxis. I am constantly working towards designing effective industry-relevant pedagogy to learn Data Science, which can help the students connect theoretical concepts with industry practice. I love mentoring students and have guided several Capstone Projects in the areas of machine learning and deep learning at Praxis. Some of these projects have been converted into papers and have won accolades at international conferences in places like IIM Bangalore. I have appeared twice (2018 and 2019) as a speaker in Cypher, one of the largest analytics summits in India organized by the Analytics India Magazine. I have presented research papers in some reputed International Conferences and have received the “Best Paper” award for one of my conference publications. I had also been one of the key speakers in one National level FDP and delivered some MDPs for large corporates. Presently, in addition to teaching and leading Capstone projects, I am engaging myself in research and consultancy.