Classification is probably the most common machine learning problem, and one of the biggest issues that arises in it is an imbalanced dataset. When the classes are distributed unequally, a model's predictions tend to be biased towards the majority class and become unreliable for the minority class.
So, how do we handle a model that is trained on imbalanced data? There are various techniques, such as resampling the dataset or tweaking the machine learning model itself. No single technique applies to every problem; which one works best depends on the dataset at hand.
Here are a few techniques to help developers get the best out of imbalanced data:
Verifying the accuracy, validity, and performance of a machine learning model requires choosing the right evaluation metrics. When the data is imbalanced, this is tricky because some metrics give an almost perfect score to a poor model. Accuracy, for example, looks excellent for a model that always predicts the majority class. To avoid this, we can use the following metrics to evaluate a model on imbalanced data:
- Recall/Sensitivity: Of all the samples that actually belong to a class, the fraction that the model predicts correctly.
- Precision: Of all the samples that the model assigns to a class, the fraction that actually belong to it.
- F1 Score: Harmonic mean of recall and precision.
- MCC: Matthews correlation coefficient, the correlation between predicted and observed binary classifications.
- AUC–ROC: The area under the curve of true positive rate against false positive rate across classification thresholds. It is independent of changes in the proportion of responders (positive samples).
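To make the point about misleading scores concrete, here is a minimal sketch that computes accuracy, recall, precision, F1, and MCC from confusion-matrix counts. It uses only the Python standard library; the function names and the toy labels are made up for illustration, and a library such as scikit-learn offers equivalent metric functions.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def imbalance_metrics(y_true, y_pred, positive=1):
    """Return accuracy plus the metrics listed above (0.0 when undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1, "mcc": mcc}

# Toy data: 95 negatives, 5 positives, and a "model" that always predicts 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
scores = imbalance_metrics(y_true, y_pred)
# Accuracy is 0.95, while recall, precision, F1 and MCC are all 0.0.
print(scores)
```

The always-negative model scores 95% accuracy yet never finds a single positive sample, which is exactly what recall, F1, and MCC expose.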
Oversampling and undersampling the training data are often seen as having major drawbacks when a model is deployed in the real world, but they can introduce balance into an imbalanced dataset. Which of the two works better depends on the model, and both can be applied to the same dataset.
Oversampling is used when the quantity of data is insufficient. In this process, we increase the number of rare-class samples to balance the dataset. The new samples are generated using techniques such as SMOTE, bootstrapping, or repetition. The most common technique is ‘Random Over Sampling’, wherein random copies of minority-class samples are added until the minority class balances the majority class. However, because the added samples are exact duplicates, this can cause overfitting.
Undersampling, on the other hand, is used when the dataset is large enough, and it reduces the size of the abundant class. The rare samples are kept intact, and the dataset is balanced by selecting an equal number of samples from the abundant class to create a new dataset for further modelling. However, this can remove important information from the dataset.
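The two resampling strategies can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the helper names are hypothetical, the samples are plain integers standing in for feature rows, and both functions balance the classes to a 1:1 ratio.

```python
import random

def split_by_class(samples, labels, minority_label):
    """Separate the samples into minority and majority lists."""
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    majority = [s for s, y in zip(samples, labels) if y != minority_label]
    return minority, majority

def random_oversample(minority, majority, rng):
    """Add random copies of minority samples until both classes are equal."""
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return minority + extra, majority

def random_undersample(minority, majority, rng):
    """Keep every minority sample and an equally sized random majority subset."""
    return minority, rng.sample(majority, len(minority))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = list(range(100))
labels = [1 if i < 10 else 0 for i in samples]  # 10 rare, 90 abundant

minority, majority = split_by_class(samples, labels, minority_label=1)
over_min, over_maj = random_oversample(minority, majority, rng)
under_min, under_maj = random_undersample(minority, majority, rng)

print(len(over_min), len(over_maj))    # 90 90
print(len(under_min), len(under_maj))  # 10 10
```

Oversampling grows the dataset to 180 rows without adding any new information, while undersampling shrinks it to 20 rows and throws away 80 majority samples, which mirrors the overfitting and information-loss trade-offs described above.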
SMOTE (Synthetic Minority Oversampling Technique)
A good alternative that addresses the problems with plain oversampling and undersampling is SMOTE, wherein a random point is picked from the minority class and its K nearest neighbours within that class are computed. Synthetic points are then added between the chosen point and its neighbours.
In this manner, the points added are new but plausible minority samples rather than exact copies, which lowers the risk of overfitting that comes with simple duplication. This method can therefore give better results than simple undersampling and oversampling.
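The interpolation step can be illustrated with a simplified, SMOTE-style sketch in standard-library Python. It is not a reference implementation: the function name `smote_like`, the tiny 2-D dataset, and the default `k` are assumptions made for the example, and production code would normally rely on a maintained library such as imbalanced-learn.

```python
import math
import random

def smote_like(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority class with two features per sample.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote_like(minority, n_new=6, k=2, rng=random.Random(42))
for point in new_points:
    print(point)
```

Every synthetic point lies on a line segment between two real minority samples, so it stays inside the region the minority class already occupies.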
K-fold Cross Validation
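Cross validation is commonly used by data scientists to stabilise and generalise a machine learning model on an imbalanced dataset, but it has to be combined with oversampling in the right order. If the dataset is oversampled first and split afterwards, copies of the same minority samples land in both the training and the validation set. This data leakage makes the minority class look easier to predict than it really is. Oversampling only the training portion of each fold prevents the leakage.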
The proper procedure for K-fold cross validation on an imbalanced dataset is to:
- Exclude some amount of data for validation that will not be used for oversampling, feature selection, or model building;
- Oversample the minority class in the training set only, without the excluded data;
- Repeat this ‘K’ times, once for each fold.
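The steps above can be sketched with index lists in standard-library Python. The function names are hypothetical, random duplication stands in for whichever oversampling method is chosen, and the model fitting itself is left as a comment because it depends on the problem.

```python
import random

def stratified_folds(labels, k, rng):
    """Assign every index to one of k folds, class by class, so each
    fold keeps roughly the original class proportions."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def oversample_indices(train_idx, labels, rng):
    """Duplicate random indices of smaller classes up to the largest class."""
    by_class = {}
    for idx in train_idx:
        by_class.setdefault(labels[idx], []).append(idx)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for indices in by_class.values():
        extra = [rng.choice(indices) for _ in range(target - len(indices))]
        balanced += indices + extra
    return balanced

def cv_splits_with_oversampling(labels, k, seed=0):
    """Yield (oversampled training indices, untouched validation indices)."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, rng)
    for i, val_idx in enumerate(folds):
        # Step 1: the validation fold is set aside before any resampling.
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        # Step 2: oversample the training part only.
        yield oversample_indices(train_idx, labels, rng), val_idx

labels = [1] * 10 + [0] * 90  # 10 rare samples, 90 abundant ones
splits = list(cv_splits_with_oversampling(labels, k=5))  # Step 3: K rounds
for train_idx, val_idx in splits:
    # A model would be fitted on train_idx and scored on val_idx here.
    print(len(train_idx), len(val_idx))  # 144 20
```

Each validation fold keeps the original 2-in-20 class ratio and shares no sample with its training set, so the score reflects how the model would behave on unseen, still-imbalanced data.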
Ensembling resampled datasets
The most obvious way to handle imbalanced data, though not always a feasible one, is to use more data. When that is not possible, ensembling different resampled datasets is another technique. Out-of-the-box classifiers such as random forest or logistic regression tend to generalise by discarding the rare class, and an ensemble trained on resampled data helps the rare class get identified again.
Such an ensemble is obtained by training multiple models, each on a differently resampled (oversampled or undersampled) version of the same dataset, and combining their predictions to get better performance.
One popular way is to use ‘BaggingClassifier’ for ensembling. In this method, each model is trained on a resampled subset that contains both the minority class and the abundant class, and the predictions of all the models are combined.
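The idea can be shown with a hand-rolled sketch in standard-library Python. It does not use scikit-learn's `BaggingClassifier`; instead, a toy one-dimensional nearest-centroid "model" and made-up Gaussian data stand in for a real classifier and dataset, and all function names are hypothetical.

```python
import random
from collections import Counter

def balanced_subsets(minority, majority, n_models, rng):
    """Pair all minority samples with a different random majority subset
    of the same size for each model in the ensemble."""
    return [(minority, rng.sample(majority, len(minority)))
            for _ in range(n_models)]

def fit_centroids(minority, majority):
    """Toy 1-D model: remember the mean of each class."""
    return sum(minority) / len(minority), sum(majority) / len(majority)

def predict(model, x):
    """Predict 1 (rare class) if x is closer to the minority centroid."""
    minority_centroid, majority_centroid = model
    return 1 if abs(x - minority_centroid) < abs(x - majority_centroid) else 0

def ensemble_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict(model, x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(1)
majority = [rng.gauss(0.0, 1.0) for _ in range(200)]  # abundant class
minority = [rng.gauss(4.0, 1.0) for _ in range(20)]   # rare class

# An odd number of models avoids tied votes.
models = [fit_centroids(mino, majo)
          for mino, majo in balanced_subsets(minority, majority, 5, rng)]

print(ensemble_predict(models, 4.2), ensemble_predict(models, -0.3))
```

Each model sees every rare sample but only a slice of the abundant class, so no single model is dominated by the majority, while the ensemble as a whole still draws on much more of the majority data than one undersampled model would.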
No single technique works for every imbalanced dataset, but a combination of the methods above can be used as a starting point for perfecting the models. A few further options are:
- Choosing the right model: Some models, like XGBoost, are better suited to imbalanced datasets and may not require you to make changes to the data.
- Collecting more data: The simplest approach is to collect more data, especially positive (rare-class) examples, which gives a better picture of both the abundant and the rare classes.
- Anomaly Detection: Reframing the classification problem as one of detecting rare items or observations.
- Resampling using different ratios: When putting together different resampled datasets, the ratio between the rare and the abundant class can be tuned. Changing it changes the influence of each class on the model and thus alters its predictions.
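As a small illustration of the last point, the following standard-library Python sketch undersamples the abundant class to a chosen ratio instead of a fixed 1:1 balance. The function name and the `ratio` parameter (majority samples kept per minority sample) are assumptions for the example; which ratio works best has to be found by evaluating the model.

```python
import random

def undersample_to_ratio(minority, majority, ratio, rng):
    """Keep all minority samples and about `ratio` majority samples
    for every minority sample (capped at the available majority size)."""
    n_keep = min(len(majority), round(ratio * len(minority)))
    return minority, rng.sample(majority, n_keep)

rng = random.Random(7)
minority = list(range(10))        # 10 rare samples
majority = list(range(100, 400))  # 300 abundant samples

sizes = {}
for ratio in (1, 2, 5):
    kept_min, kept_maj = undersample_to_ratio(minority, majority, ratio, rng)
    sizes[ratio] = (len(kept_min), len(kept_maj))

print(sizes)  # {1: (10, 10), 2: (10, 20), 5: (10, 50)}
```

A higher ratio keeps more of the majority class's information at the cost of a less balanced training set, so the ratio can be treated as one more setting to tune.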