Imbalanced datasets are encountered in many fields where machine learning is applied, including business, finance, and biomedical science. In an imbalanced dataset, the number of instances in one or more classes is very low compared to the others. Training standard machine learning models on such datasets tends to produce models biased toward the majority class, with a high false-negative rate when the minority class is treated as the positive class.
A common approach for overcoming this issue is generating synthetic instances of the minority class using an oversampling algorithm. SMOTE is a widely used oversampling technique. It selects an arbitrary minority class data point and finds its k nearest neighbours within the minority class. SMOTE then generates synthetic minority class data points along the line segments joining the selected point to these neighbours. SMOTE, however, has several limitations. For example, it does not consider the distribution of the minority classes or latent noise in the dataset.
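The interpolation step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the imbalanced-learn implementation; the function name `smote_sample` and the toy data are assumptions made for the example.

```python
import numpy as np

def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point by SMOTE-style interpolation (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick an arbitrary minority class point p.
    p = minority[rng.integers(len(minority))]
    # Find its k nearest minority neighbours (index 0 of the sort is p itself).
    dists = np.linalg.norm(minority - p, axis=1)
    neighbour_idx = np.argsort(dists)[1:k + 1]
    # Pick one neighbour q and place the new point at a random position
    # on the line segment joining p and q.
    q = minority[rng.choice(neighbour_idx)]
    lam = rng.random()
    return p + lam * (q - p)

# Toy minority class: 20 points in 2 dimensions (hypothetical data).
rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))
synthetic = np.array([smote_sample(minority, k=5, rng=rng) for _ in range(100)])
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two existing minority points, the new data never leaves the region already spanned by the minority class, and noise in the original points is carried straight into the synthetic ones.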
SMOTE often over-generalizes the minority class, which leads to misclassification of majority class instances and affects the model’s overall balance. In their paper, “LoRAS: An oversampling approach for imbalanced datasets”, Saptarshi Bej, Narek Davtyan, et al. proposed Localized Randomized Affine Shadowsampling (LoRAS), which produces better machine learning models for imbalanced datasets.
Algorithm & Approach
LoRAS relies on locally approximating the manifold by generating random convex combinations of noisy minority class data points. It adds Gaussian noise in small neighbourhoods around the minority class samples and creates the final synthetic data from convex combinations of multiple noisy data points (shadow samples), as opposed to SMOTE-based strategies, which combine only two minority class data points. Adding these shadow samples allows LoRAS to better estimate the local mean of the latent minority class data distribution.
An Iteration Visualised
For a data point p, its three closest neighbours (found using KNN) are chosen to build a neighbourhood of p, depicted as a box in the figure above.
The four data points in this neighbourhood (p and its three neighbours) are selected as parent data points.
A normal distribution is centred at each of these parent data points, and shadow points are drawn from these distributions.
Three random shadow points are chosen at a time to obtain a random affine combination of them (spanning a triangle). This combination is the new LoRAS sample point, generated from the neighbourhood of the single data point p.
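The iteration above can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions, not the code of the `loras` package: the function name `loras_sample`, the noise scale `sigma`, the number of shadow points per parent, and the use of Dirichlet-distributed weights for the convex combination are all choices made for this example.

```python
import numpy as np

def loras_sample(minority, p_idx, k=3, n_shadow=10, sigma=0.05, n_affine=3, rng=None):
    """Generate one LoRAS-style synthetic point around minority[p_idx] (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n_features = minority.shape[1]
    p = minority[p_idx]
    # Neighbourhood: p plus its k nearest minority neighbours (the sort puts p first).
    dists = np.linalg.norm(minority - p, axis=1)
    parents = minority[np.argsort(dists)[:k + 1]]
    # Shadow points: Gaussian noise drawn around every parent point.
    noise = rng.normal(scale=sigma, size=(len(parents), n_shadow, n_features))
    shadows = (parents[:, None, :] + noise).reshape(-1, n_features)
    # Choose n_affine shadow points and combine them with random
    # non-negative weights that sum to 1 (assumed Dirichlet here).
    chosen = shadows[rng.choice(len(shadows), size=n_affine, replace=False)]
    weights = rng.dirichlet(np.ones(n_affine))
    return weights @ chosen

# Toy minority class: 30 points in 2 dimensions (hypothetical data).
rng = np.random.default_rng(0)
minority = rng.normal(size=(30, 2))
new_point = loras_sample(minority, p_idx=0, rng=rng)
print(new_point)
```

With `n_affine=3` the new point falls inside the triangle spanned by the three chosen shadow points, which mirrors the description above.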
Comparing LoRAS with ADASYN, SMOTE, and its variants
- Install LoRAS and imbalanced-learn from PyPI
!pip install -U imbalanced-learn
!pip install loras
- Import necessary libraries and classes
import pandas as pd
import numpy as np
import loras
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN
from sklearn.metrics import f1_score, balanced_accuracy_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
%matplotlib inline
- We will be using the Credit Card Fraud Detection dataset available on Kaggle. You can download and upload it to Colab manually or fetch it using Kaggle’s API as shown below.
Go to your Kaggle account and generate an API token, then place the resulting .json file in a folder named Kaggle in your Google Drive.
Mount drive to Colab.
from google.colab import drive
drive.mount('/content/gdrive')
Define the config path, and navigate into the directory with the API token.
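import os
# Tell the Kaggle CLI where to find the API token
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"
# Use the %cd magic: a directory change made with !cd runs in a subshell and does not persist
%cd /content/gdrive/My Drive/Kaggle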
Download and unzip the dataset using the API command you can get from the dataset page.
!kaggle datasets download -d mlg-ulb/creditcardfraud
# unzip and delete the zip
!unzip \*.zip && rm *.zip
Load the dataset and convert it into a NumPy array for oversampling.
filename = 'creditcard.csv'
data = pd.read_csv(filename)
data = data.values
data.shape
- Separate the labels from features.
# column 30 holds the class label; columns 0-29 are the features
labels, features = data[:, 30], data[:, :30]
- Divide the dataset into train and test sets before oversampling. Never test on an oversampled or undersampled dataset.
# fraud transactions (label 1): first half for training, second half for testing
label_1 = np.where(labels == 1)[0]
print(len(label_1))
features_1 = features[label_1]
features_1_train = features_1[:246]
features_1_test = features_1[246:492]

# normal transactions (label 0): first half for training, second half for testing
label_0 = np.where(labels == 0)[0]
features_0 = features[label_0]
features_0_train = features_0[:142157]
features_0_test = features_0[142157:284315]

# stack minority rows first, then majority rows; labels follow the same order
training_data = np.concatenate((features_1_train, features_0_train))
test_data = np.concatenate((features_1_test, features_0_test))
training_labels = np.concatenate((np.ones(246), np.zeros(142157)))
test_labels = np.concatenate((np.ones(246), np.zeros(142158)))
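The split above hard-codes the class counts of this particular dataset. As a hedged alternative, the same per-class split can be written as a small helper that derives the cut points from the labels. The name `split_per_class` and the toy data are hypothetical and only serve to illustrate the idea.

```python
import numpy as np

def split_per_class(features, labels, train_fraction=0.5):
    """Split every class separately so both sets keep the class ratio (sketch)."""
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        # The first part of each class goes to training, the rest to testing.
        cut = int(len(idx) * train_fraction)
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    train_idx, test_idx = np.array(train_idx), np.array(test_idx)
    return features[train_idx], labels[train_idx], features[test_idx], labels[test_idx]

# Toy data: 101 majority rows (label 0) and 8 minority rows (label 1).
toy_labels = np.concatenate((np.zeros(101), np.ones(8)))
toy_features = np.arange(len(toy_labels) * 3, dtype=float).reshape(-1, 3)
X_train, y_train, X_test, y_test = split_per_class(toy_features, toy_labels)
print(X_train.shape, X_test.shape)
```

Unlike the code above, this helper returns the rows in ascending label order (majority first), which does not matter for the classifiers used here. Oversampling would then be applied to the training part only, leaving the test part untouched.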
- Oversample minority class using LoRAS and other methods.
min_class_points = features_1_train
maj_class_points = features_0_train

# LoRAS: returns the oversampled minority class, which is then stacked on the majority class
loras_min_class_points = loras.fit_resample(maj_class_points, min_class_points)
print(loras_min_class_points.shape)
LoRAS_feat = np.concatenate((loras_min_class_points, maj_class_points))
LoRAS_labels = np.concatenate((np.ones(len(loras_min_class_points)), np.zeros(len(maj_class_points))))

# SMOTE
sm = SMOTE(random_state=42, k_neighbors=30, sampling_strategy=1)
SMOTE_feat, SMOTE_labels = sm.fit_resample(training_data, training_labels)
print(SMOTE_feat.shape)
print(SMOTE_labels.shape)

# SMOTE Borderline
smb = BorderlineSMOTE(random_state=42, k_neighbors=30)
SMOTEb_feat, SMOTEb_labels = smb.fit_resample(training_data, training_labels)
print(SMOTEb_feat.shape)
print(SMOTEb_labels.shape)

# SMOTE SVM
sms = SVMSMOTE(random_state=42, k_neighbors=30)
SMOTEs_feat, SMOTEs_labels = sms.fit_resample(training_data, training_labels)
print(SMOTEs_feat.shape)
print(SMOTEs_labels.shape)

# ADASYN
ada = ADASYN(random_state=42, n_neighbors=30)
ADA_feat, ADA_labels = ada.fit_resample(training_data, training_labels)
print(ADA_feat.shape)
print(ADA_labels.shape)
- Create functions for creating and evaluating models
def get_metrics(y_test, y_pred):
    # Note: the "accuracy" entry holds the balanced accuracy, not the plain accuracy
    metrics = {}
    metrics["f1_score"] = f1_score(y_test, y_pred)
    metrics["accuracy"] = balanced_accuracy_score(y_test, y_pred)
    metrics["precision"] = precision_score(y_test, y_pred)
    metrics["recall"] = recall_score(y_test, y_pred)
    return metrics

# Note: despite its name, this function fits a logistic regression classifier
def linear_regression(X_train, y_train, X_test, y_test):
    # multi_class is deprecated in recent scikit-learn releases; drop it if it raises a warning or error
    logreg = LogisticRegression(random_state=42, C=.005, multi_class='multinomial', max_iter=685)
    logreg.fit(X_train, y_train)
    y_pred = logreg.predict(X_test)
    return get_metrics(y_test, y_pred)

def random_forest(X_train, y_train, X_test, y_test):
    det = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=42)
    det.fit(X_train, y_train)
    y_pred = det.predict(X_test)
    return get_metrics(y_test, y_pred)
- Train and evaluate models for different oversampling methods.
# no oversampling
results_normal_lr = linear_regression(training_data, training_labels, test_data, test_labels)
results_normal_rf = random_forest(training_data, training_labels, test_data, test_labels)
# LoRAS
results_loras_lr = linear_regression(LoRAS_feat, LoRAS_labels, test_data, test_labels)
results_loras_rf = random_forest(LoRAS_feat, LoRAS_labels, test_data, test_labels)
# SMOTE
results_sm_lr = linear_regression(SMOTE_feat, SMOTE_labels, test_data, test_labels)
results_sm_rf = random_forest(SMOTE_feat, SMOTE_labels, test_data, test_labels)
# SMOTE SVM
results_sms_lr = linear_regression(SMOTEs_feat, SMOTEs_labels, test_data, test_labels)
results_sms_rf = random_forest(SMOTEs_feat, SMOTEs_labels, test_data, test_labels)
# SMOTE Borderline
results_smb_lr = linear_regression(SMOTEb_feat, SMOTEb_labels, test_data, test_labels)
results_smb_rf = random_forest(SMOTEb_feat, SMOTEb_labels, test_data, test_labels)
# ADASYN
results_ada_lr = linear_regression(ADA_feat, ADA_labels, test_data, test_labels)
results_ada_rf = random_forest(ADA_feat, ADA_labels, test_data, test_labels)

# labels in the same order as the result lists built for plotting
sampling_method = ['Normal', 'LoRAS', 'SMOTE', 'SMOTE SVM', 'SMOTE BORDERLINE', 'ADASYN']
- Plot the results.
# logistic regression results
LR_results = [results_normal_lr, results_loras_lr, results_sm_lr, results_sms_lr, results_smb_lr, results_ada_lr]
LR_dict = dict(zip(sampling_method, LR_results))
# create a DataFrame from the results dictionary
LR_df = pd.DataFrame.from_dict(LR_dict, orient='index')
# create a column for sampling methods
LR_df['sampling_method'] = LR_df.index
LR_df.plot(x='sampling_method', y=['f1_score', 'accuracy', 'precision', 'recall'], kind="bar", figsize=(12, 8), rot=45,
           title="Performance of Logistic Regression models built with different Sampling Methods")

# random forest results
RF_results = [results_normal_rf, results_loras_rf, results_sm_rf, results_sms_rf, results_smb_rf, results_ada_rf]
RF_dict = dict(zip(sampling_method, RF_results))
# create a DataFrame from the results dictionary
RF_df = pd.DataFrame.from_dict(RF_dict, orient='index')
# create a column for sampling methods
RF_df['sampling_method'] = RF_df.index
RF_df.plot(x='sampling_method', y=['f1_score', 'accuracy', 'precision', 'recall'], kind="bar", figsize=(12, 8), rot=45,
           title="Performance of Random Forest models built with different Sampling Methods")
Here is the Colab Notebook for the above code.
Last Epoch
LoRAS aims to improve the precision-recall balance (F1-Score) and the class-wise average accuracy (balanced accuracy) of the models. The F1-Score measures how well the classifier handles the minority class, whereas balanced accuracy measures how well it handles both the majority and minority classes. Together, these two measures give a holistic view of a classifier’s performance on a dataset. LoRAS was benchmarked against existing algorithms on a total of 14 imbalanced datasets.
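To make the two measures concrete, here is a small worked sketch that computes them directly from confusion-matrix counts. The function name and the counts are made up for illustration and are not results from the paper or from the experiment above.

```python
def f1_and_balanced_accuracy(tp, fp, fn, tn):
    """Compute F1-Score and balanced accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # accuracy on the minority (positive) class
    specificity = tn / (tn + fp)   # accuracy on the majority (negative) class
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    return f1, balanced_accuracy

# Hypothetical classifier on 100 minority and 9,900 majority test samples:
# it finds only 10 of the 100 minority samples and raises no false alarms.
tp, fp, fn, tn = 10, 0, 90, 9900
f1, bal_acc = f1_and_balanced_accuracy(tp, fp, fn, tn)
plain_accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"accuracy={plain_accuracy:.3f}  f1={f1:.3f}  balanced_accuracy={bal_acc:.3f}")
# prints: accuracy=0.991  f1=0.182  balanced_accuracy=0.550
```

Plain accuracy looks excellent here because the majority class dominates, while the F1-Score and balanced accuracy both expose the poor performance on the minority class.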
Averaged over all the datasets, LoRAS has the best balanced accuracy and F1-Score. SMOTE improves balanced accuracy compared to models trained without any oversampling, but it lags behind in F1-Score on quite a few datasets with a high baseline F1-Score. Applying ADASYN increases balanced accuracy compared to SMOTE, but again compromises on the F1-Score. LoRAS produces the best balanced accuracy on average while maintaining the highest average F1-Score among all oversampling techniques.
References
To better understand the mathematics behind LoRAS and to learn more about its fine-tuning parameters, refer to the following resources: