Hands-On Guide To LoRAS: A Better Oversampling Algorithm

Localized Randomized Affine Shadowsampling (LoRAS) locally approximates the manifold by generating a random convex combination of noisy minority class data points.

Published on April 3, 2021

by Aditya Singh

Imbalanced datasets are encountered in many fields where machine learning is applied, including business, finance, and biomedical science. In an imbalanced dataset, the number of instances in one or more classes is very low compared to the others. Standard machine learning models trained on such datasets tend to be biased toward the majority class, producing a high false-negative rate on the minority class.

A common approach for overcoming this issue is to generate synthetic instances of the minority class using an oversampling algorithm. SMOTE is a widely used oversampling technique. It selects an arbitrary minority class data point and finds its k nearest neighbours within the minority class. SMOTE then generates synthetic minority class data points along the line segments joining the selected point to these neighbours. SMOTE, however, has several limitations. For example, it does not consider the distribution of the minority class or latent noise in the data set.
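The interpolation step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the imbalanced-learn implementation: the function name `smote_like_sample`, the toy data, and the brute-force neighbour search are all assumptions made for the example.

```python
import numpy as np

def smote_like_sample(minority, k=5, seed=None):
    """Create one synthetic point by SMOTE-style interpolation (illustrative only)."""
    rng = np.random.default_rng(seed)
    # pick an arbitrary minority class point
    i = rng.integers(len(minority))
    # brute-force distances from that point to every minority point
    dists = np.linalg.norm(minority - minority[i], axis=1)
    # index 0 of the sorted order is the point itself, so skip it
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    # move a random fraction of the way along the segment from point i to point j
    gap = rng.random()
    return minority[i] + gap * (minority[j] - minority[i])

# toy minority class: 20 points in 2 dimensions
toy_rng = np.random.default_rng(0)
minority = toy_rng.normal(size=(20, 2))
synthetic = smote_like_sample(minority, k=5, seed=1)
print(synthetic)
```

Because the new point is a convex combination of exactly two existing points, it always lies on the segment between them. That is the property LoRAS relaxes by combining several noisy points instead.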

SMOTE often over-generalizes the minority class, which leads to misclassification of majority class points and affects the model’s overall balance. In their paper, “LoRAS: An oversampling approach for imbalanced datasets”, Saptarshi Bej, Narek Davtyan, et al. proposed Localized Randomized Affine Shadowsampling (LoRAS), which produces better machine learning models for imbalanced datasets.

Algorithm & Approach

LoRAS relies on locally approximating the manifold by generating a random convex combination of noisy minority class data points. It adds Gaussian noise in small neighbourhoods around the minority class samples and creates the final synthetic data from convex combinations of multiple noisy data points (shadow samples). SMOTE-based strategies, in contrast, consider a combination of only two minority class data points. Using these shadow samples allows LoRAS to better estimate the local mean of the latent minority class data distribution.

LoRAS Algorithm as mentioned in the paper

An Iteration Visualised

For a data point p, its three closest neighbours (found using KNN) are chosen to build a neighbourhood of p, depicted as a box in the figure above.

The four data points in this neighbourhood (p and its three neighbours) are selected as parent data points.

A normal distribution is centered at each parent data point, and shadow points are drawn from these distributions.

Three random shadow points are chosen at a time, and a random affine combination of them is computed (the three points span a triangle). The result is a new LoRAS sample point generated from the neighbourhood of the single data point p.
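The iteration above can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions, not the code of the `loras` package: the function name `loras_like_sample`, its parameters, the noise scale `sigma`, and the use of Dirichlet-distributed weights (non-negative and summing to 1) are choices made for this example only.

```python
import numpy as np

def loras_like_sample(minority, p_index, k=3, n_shadow=10, n_affine=3,
                      sigma=0.05, seed=None):
    """Generate one LoRAS-style synthetic point around minority[p_index] (illustrative only)."""
    rng = np.random.default_rng(seed)
    p = minority[p_index]
    # neighbourhood: p itself plus its k nearest minority class neighbours
    dists = np.linalg.norm(minority - p, axis=1)
    neighbourhood = minority[np.argsort(dists)[:k + 1]]
    # shadow points: Gaussian noise drawn around each parent point
    shadows = np.concatenate([
        parent + rng.normal(scale=sigma, size=(n_shadow, minority.shape[1]))
        for parent in neighbourhood
    ])
    # choose a few shadow points at random, without repetition
    chosen = shadows[rng.choice(len(shadows), size=n_affine, replace=False)]
    # random weights that are non-negative and sum to 1 (assumed: Dirichlet draw)
    weights = rng.dirichlet(np.ones(n_affine))
    return weights @ chosen

# toy minority class: 20 points in 2 dimensions
toy_rng = np.random.default_rng(0)
minority = toy_rng.normal(size=(20, 2))
sample = loras_like_sample(minority, p_index=0, seed=1)
print(sample)
```

With `k=3` and `n_affine=3` this mirrors the figure: four parent points, and each new sample lies inside a triangle spanned by three shadow points.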

Comparing LoRAS with ADASYN, SMOTE, and its variants

Install LoRAS and imbalanced-learn from PyPI

 !pip install -U imbalanced-learn
 !pip install loras

Import necessary libraries and classes

import pandas as pd
import numpy as np

import loras
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN
 
from sklearn.metrics import f1_score, balanced_accuracy_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

import matplotlib.pyplot as plt
%matplotlib inline

We will be using the Credit Card Fraud Detection dataset available on Kaggle. You can download and upload it to Colab manually or fetch it using Kaggle’s API as shown below.

Go to your Kaggle account and generate an API token, then place the downloaded kaggle.json file in a folder named Kaggle in your Google Drive.

Mount drive to Colab.

 from google.colab import drive
 drive.mount('/content/gdrive')

Define the config path, and navigate into the directory with the API token.

 import os
 os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"
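 # %cd persists across cells (a "!cd" runs in a throwaway subshell); the quotes handle the space in the path
 %cd "/content/gdrive/My Drive/Kaggle"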

Download and unzip the dataset using the API command you can get from the dataset page.

 !kaggle datasets download -d mlg-ulb/creditcardfraud
 #unzip and delete the zip
 !unzip \*.zip  && rm *.zip

Load the dataset and convert it into a NumPy array for oversampling.

 filename='creditcard.csv'
 data=pd.read_csv(filename)
 data=data.values
 data.shape

Separate the labels from features.

 # column 30 is the 'Class' label; columns 0-29 are Time, V1-V28, and Amount
 labels, features = data[:, 30], data[:, :30]

Divide the dataset into train and test set before oversampling. Never test on the oversampled or undersampled dataset.

 # fraud transactions
 label_1=np.where(labels == 1)[0]
 label_1=list(label_1)
 print((len(label_1)))
 features_1=features[label_1]
 features_1_train=features_1[list(range(0,246))]
 features_1_test=features_1[list(range(246,492))]

 # normal transactions
 label_0=np.where(labels == 0)[0]
 label_0=list(label_0)
 features_0=features[label_0]
 features_0_train=features_0[list(range(0,142157))]
 features_0_test=features_0[list(range(142157,284315))]

 training_data=np.concatenate((features_1_train,features_0_train))
 test_data=np.concatenate((features_1_test,features_0_test))
 training_labels=np.concatenate((np.zeros(246)+1, np.zeros(142157)))
 test_labels=np.concatenate((np.zeros(246)+1, np.zeros(142158)))
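The split above takes the first half of each class for training and the second half for testing, in the order the rows appear in the file. As an alternative, here is a minimal sketch of a shuffled split that preserves the class ratio in both halves. The helper `stratified_split` and the toy arrays are hypothetical and exist only for this illustration. Oversampling would still be applied to the training half only.

```python
import numpy as np

def stratified_split(features, labels, test_fraction=0.5, seed=42):
    """Shuffle each class separately and split it, so both halves keep the class ratio."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # shuffled row indices belonging to this class
        idx = rng.permutation(np.where(labels == cls)[0])
        n_test = int(round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    train_idx, test_idx = np.array(train_idx), np.array(test_idx)
    return (features[train_idx], labels[train_idx],
            features[test_idx], labels[test_idx])

# toy data: 1000 rows with 2% positives
toy_labels = np.array([1] * 20 + [0] * 980)
toy_features = np.arange(1000 * 3, dtype=float).reshape(1000, 3)
X_tr, y_tr, X_te, y_te = stratified_split(toy_features, toy_labels)
print(y_tr.sum(), y_te.sum())
```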

Oversample minority class using LoRAS and other methods.

 min_class_points = features_1_train
 maj_class_points = features_0_train

 #LoRAS
 loras_min_class_points = loras.fit_resample(maj_class_points, min_class_points)
 print(loras_min_class_points.shape)
 LoRAS_feat = np.concatenate((loras_min_class_points, maj_class_points))
 LoRAS_labels = np.concatenate((np.zeros(len(loras_min_class_points))+1, np.zeros(len(maj_class_points))))

 #SMOTE
 sm = SMOTE(random_state=42, k_neighbors=30, sampling_strategy=1)
 SMOTE_feat, SMOTE_labels = sm.fit_resample(training_data,training_labels)
 print(SMOTE_feat.shape)
 print(SMOTE_labels.shape)

 #SMOTE Borderline
 smb = BorderlineSMOTE(random_state=42, k_neighbors=30)
 SMOTEb_feat, SMOTEb_labels = smb.fit_resample(training_data,training_labels)
 print(SMOTEb_feat.shape)
 print(SMOTEb_labels.shape)

 #SMOTE SVM
 sms = SVMSMOTE(random_state=42, k_neighbors=30)
 SMOTEs_feat, SMOTEs_labels = sms.fit_resample(training_data,training_labels)
 print(SMOTEs_feat.shape)
 print(SMOTEs_labels.shape)

 #ADASYN
 ada = ADASYN(random_state=42,n_neighbors=30)
 ADA_feat, ADA_labels = ada.fit_resample(training_data,training_labels)
 print(ADA_feat.shape)
 print(ADA_labels.shape)

Create functions for creating and evaluating models

 def get_metrics(y_test, y_pred):
     metrics = {}
     metrics["f1_score"] = f1_score(y_test, y_pred)
     metrics["accuracy"] = balanced_accuracy_score(y_test, y_pred)
     metrics["precision"] = precision_score(y_test, y_pred)
     metrics["recall"] = recall_score(y_test, y_pred)
     return metrics

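 # Note: despite its name, this helper fits a logistic regression classifier.
 def linear_regression(X_train, y_train, X_test, y_test):
     # multi_class is deprecated in recent scikit-learn releases; drop the argument if your version rejects it
     logreg = LogisticRegression(random_state=42, C=.005, multi_class='multinomial', max_iter=685)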
     logreg.fit(X_train, y_train)
     y_pred = logreg.predict(X_test)
     return get_metrics(y_test, y_pred)

 def random_forest(X_train, y_train, X_test, y_test):
     det = RandomForestClassifier(n_estimators=100, max_depth=2,random_state=42)
     det.fit(X_train, y_train)
     y_pred = det.predict(X_test)
     return get_metrics(y_test, y_pred)

Train and evaluate models for different oversampling methods.

 results_normal_lr = linear_regression(training_data, training_labels, test_data, test_labels)
 results_normal_rf = random_forest(training_data, training_labels, test_data, test_labels)

 results_loras_lr = linear_regression(LoRAS_feat, LoRAS_labels, test_data, test_labels)
 results_loras_rf = random_forest(LoRAS_feat, LoRAS_labels, test_data, test_labels)

 results_sm_lr = linear_regression(SMOTE_feat, SMOTE_labels, test_data, test_labels)
 results_sm_rf = random_forest(SMOTE_feat, SMOTE_labels, test_data, test_labels)

 results_sms_lr = linear_regression(SMOTEs_feat, SMOTEs_labels, test_data, test_labels)
 results_sms_rf = random_forest(SMOTEs_feat, SMOTEs_labels, test_data, test_labels)

 results_smb_lr = linear_regression(SMOTEb_feat, SMOTEb_labels, test_data, test_labels)
 results_smb_rf = random_forest(SMOTEb_feat, SMOTEb_labels, test_data, test_labels)

 results_ada_lr = linear_regression(ADA_feat, ADA_labels, test_data, test_labels)
 results_ada_rf = random_forest(ADA_feat, ADA_labels, test_data, test_labels)

 sampling_method = ['Normal', 'LoRAS', 'SMOTE', 'SMOTE SVM', 'SMOTE BORDERLINE', 'ADASYN']

Plot the results.

 # logistic regression results
 LR_results = [results_normal_lr, results_loras_lr, results_sm_lr, results_sms_lr, 
               results_smb_lr, results_ada_lr]
 LR_dict = dict(zip(sampling_method, LR_results))
 #create a DataFrame from the results dictionary
 LR_df = pd.DataFrame.from_dict(LR_dict, orient='index')
 # create a column for sampling methods
 LR_df['sampling_method'] = LR_df.index
 LR_df.plot(x='sampling_method', y=['f1_score', 'accuracy', 'precision', 'recall'],
            kind="bar", figsize=(12, 8), rot = 45,
            title = "Performance of Logistic Regression models built with different Sampling Methods")

 # random forest results
 RF_results = [results_normal_rf, results_loras_rf, results_sm_rf,
              results_sms_rf, results_smb_rf, results_ada_rf]
 RF_dict = dict(zip(sampling_method, RF_results))
 #create a DataFrame from the results dictionary
 RF_df = pd.DataFrame.from_dict(RF_dict, orient='index')
 # create a column for sampling methods
 RF_df['sampling_method'] = RF_df.index
 RF_df.plot(x='sampling_method', y=['f1_score', 'accuracy', 'precision', 'recall'],
            kind="bar", figsize=(12, 8), rot = 45,
            title = "Performance of Random Forest models built with different Sampling Methods")

Plot of different metrics of models created with LoRAS vs other oversampling methods

Here is the Colab Notebook for the above code.

Last Epoch

LoRAS aims to improve the precision-recall balance (F1-Score) and class-wise average accuracy (balanced accuracy) of the models. The F1-Score measures how well the classification model handles the minority class, whereas balanced accuracy measures how well it handles both the majority and minority classes. Together, these two measures give a holistic understanding of a classifier’s performance on a dataset. LoRAS was benchmarked against the existing algorithms on a total of 14 imbalanced datasets.
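To make the two measures concrete, here is a small worked example that computes them by hand. The labels and predictions are made up for illustration and do not come from the benchmark.

```python
import numpy as np

# made-up example: 10 minority (1) and 90 majority (0) instances
y_true = np.array([1] * 10 + [0] * 90)
# the classifier finds 6 of the 10 positives and raises 3 false alarms
y_pred = np.array([1] * 6 + [0] * 4 + [1] * 3 + [0] * 87)

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

precision = tp / (tp + fp)
recall = tp / (tp + fn)          # true-positive rate (minority class accuracy)
specificity = tn / (tn + fp)     # true-negative rate (majority class accuracy)

f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2
plain_accuracy = (tp + tn) / len(y_true)

print(f"F1={f1:.3f}  balanced accuracy={balanced_accuracy:.3f}  accuracy={plain_accuracy:.3f}")
```

Plain accuracy looks high here because the majority class dominates it, while the F1-Score and balanced accuracy expose the four missed minority instances.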

LoRAS vs other oversampling methods — Balanced accuracy/F1-Score for oversampling strategies

Considering the average performance over all the datasets, LoRAS has the best balanced accuracy and F1-Score. SMOTE improves balanced accuracy compared to models trained without any oversampling, but it lags behind in F1-Score on quite a few datasets that have a high baseline F1-Score. Applying ADASYN increases balanced accuracy compared to SMOTE, but again compromises on the F1-Score. LoRAS produces the best balanced accuracy on average while maintaining the highest average F1-Score among all oversampling techniques.