Hands-On Guide To LoRAS: A Better Oversampling Algorithm


Imbalanced datasets are encountered in many fields where machine learning is applied, including business, finance, and biomedical science. In an imbalanced dataset, the number of instances in one or more classes is very low compared to the others. Training standard machine learning models on such datasets tends to produce models biased toward the majority class, which misclassify many minority class instances (a high false-negative rate when the minority class is treated as the positive class).

A common approach to overcoming this issue is to generate synthetic instances of the minority class using an oversampling algorithm. SMOTE is a widely used oversampling technique. It selects an arbitrary minority class data point and finds its k nearest neighbours within the minority class. SMOTE then generates synthetic minority class data points along the line segments joining the selected point to these neighbours. SMOTE, however, has several limitations; for example, it does not consider the distribution of the minority class or latent noise in the data set.
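
The interpolation step described above can be sketched in a few lines. This is a minimal illustration written with the standard library only, not the imbalanced-learn implementation; the function name `smote_sample` and its parameters are hypothetical.

```python
import math
import random

def smote_sample(minority, k=2, rng=random):
    """Generate one synthetic point by SMOTE-style interpolation (illustrative only)."""
    # Pick an arbitrary minority class point.
    i = rng.randrange(len(minority))
    p = minority[i]
    # Find its k nearest neighbours among the other minority points.
    others = [q for j, q in enumerate(minority) if j != i]
    neighbours = sorted(others, key=lambda q: math.dist(p, q))[:k]
    # Choose one neighbour and interpolate at a random position on the segment p -> q.
    q = rng.choice(neighbours)
    t = rng.random()
    return [a + t * (b - a) for a, b in zip(p, q)]

# Toy minority class in two dimensions (made-up values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
rng = random.Random(0)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Each synthetic point depends on exactly two original points, which is the property LoRAS changes.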

SMOTE often over-generalizes the minority class, which leads to misclassification of majority class instances and affects the model’s overall balance. In their paper, “LoRAS: An oversampling approach for imbalanced datasets”, Saptarshi Bej, Narek Davtyan, et al. proposed Localized Randomized Affine Shadowsampling (LoRAS), which is designed to produce better machine learning models for imbalanced datasets.

Algorithm & Approach

LoRAS relies on locally approximating the manifold by generating random convex combinations of noisy minority class data points. LoRAS generates Gaussian noise in small neighbourhoods around the minority class samples and creates the final synthetic data from convex combinations of multiple noisy data points (shadowsamples), as opposed to SMOTE-based strategies that combine only two minority class data points. Adding these shadowsamples allows LoRAS to better estimate the local mean of the latent minority class data distribution.

LoRAS Algorithm as mentioned in the paper
An Iteration Visualised

For a data point p, the three closest neighbours (found using k-nearest neighbours) are chosen to build a neighbourhood of p, depicted as a box in the figure above.

The four data points in the closest neighbourhood of p (including p) are selected.

A normal distribution is centred at each of these parent data points, and shadow points are drawn from these distributions.

Three random shadow points are chosen at a time to obtain a random affine combination of them (spanning a triangle). Finally, a new LoRAS sample point is generated from the neighbourhood of a single data point p.
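
The iteration described above can be sketched as follows. This is an illustrative reading of the steps using only the standard library, not the code of the `loras` package; the function `loras_sample` and the parameter names `n_shadow`, `sigma` and `n_affine` are assumptions made for this example.

```python
import math
import random

def loras_sample(minority, p_index, k=3, n_shadow=10, sigma=0.05, n_affine=3, rng=random):
    """One LoRAS-style synthetic point for minority[p_index] (illustrative sketch)."""
    p = minority[p_index]
    # Neighbourhood: p plus its k nearest minority neighbours.
    # p comes first in the ordering because its distance to itself is 0.
    order = sorted(range(len(minority)), key=lambda j: math.dist(p, minority[j]))
    neighbourhood = [minority[j] for j in order[:k + 1]]
    # Shadow points: Gaussian noise added around each parent point.
    shadows = [
        [x + rng.gauss(0.0, sigma) for x in parent]
        for parent in neighbourhood
        for _ in range(n_shadow)
    ]
    # Pick a few shadow points and draw random non-negative weights that sum to 1.
    chosen = rng.sample(shadows, n_affine)
    raw = [rng.expovariate(1.0) for _ in chosen]
    total = sum(raw)
    weights = [w / total for w in raw]
    # The synthetic point is the weighted (convex) combination of the chosen shadows.
    return [sum(w * s[d] for w, s in zip(weights, chosen)) for d in range(len(p))]

# Toy minority class (made-up values); the last point lies outside p's neighbourhood.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [8.0, 8.0]]
rng = random.Random(42)
new_points = [loras_sample(minority, p_index=0, rng=rng) for _ in range(4)]
print(new_points)
```

With `n_affine=3` each new point falls inside a triangle of shadow points, matching the description above; because the weights are non-negative and sum to one, the result never leaves the neighbourhood's noisy hull.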

Comparing LoRAS with ADASYN, SMOTE, and its variants

  1. Install LoRAS and imbalanced-learn from PyPI
 !pip install -U imbalanced-learn
 !pip install loras 
  2. Import necessary libraries and classes
import pandas as pd
import numpy as np

import loras
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN
 
from sklearn.metrics import f1_score, balanced_accuracy_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

import matplotlib.pyplot as plt
%matplotlib inline 
  3. We will be using the Credit Card Fraud Detection dataset available on Kaggle. You can download and upload it to Colab manually or fetch it using Kaggle’s API as shown below.

Go to your Kaggle account and generate an API token. Place the downloaded .json file in a folder named Kaggle in your Google Drive.

Mount drive to Colab.

 from google.colab import drive
 drive.mount('/content/gdrive') 

Define the config path, and navigate into the directory with the API token.

 import os
 os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"
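 # a "!cd" runs in a temporary subshell and does not persist, so change directory from Python
 os.chdir("/content/gdrive/My Drive/Kaggle")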

Download and unzip the dataset using the API command you can get from the dataset page.

 !kaggle datasets download -d mlg-ulb/creditcardfraud
 #unzip and delete the zip
 !unzip \*.zip  && rm *.zip 

Load the dataset and convert it into a NumPy array for oversampling.

 filename='creditcard.csv'
 data=pd.read_csv(filename)
 data=data.values
 data.shape 
  4. Separate the labels from features.

labels, features =data[:,30], data[:,:30]

  5. Divide the dataset into train and test sets before oversampling. Never test on the oversampled or undersampled dataset.
 # fraud transactions
 label_1=np.where(labels == 1)[0]
 label_1=list(label_1)
 print((len(label_1)))
 features_1=features[label_1]
 features_1_train=features_1[list(range(0,246))]
 features_1_test=features_1[list(range(246,492))]

 # normal transactions
 label_0=np.where(labels == 0)[0]
 label_0=list(label_0)
 features_0=features[label_0]
 features_0_train=features_0[list(range(0,142157))]
 features_0_test=features_0[list(range(142157,284315))]

 training_data=np.concatenate((features_1_train,features_0_train))
 test_data=np.concatenate((features_1_test,features_0_test))
 training_labels=np.concatenate((np.zeros(246)+1, np.zeros(142157)))
 test_labels=np.concatenate((np.zeros(246)+1, np.zeros(142158))) 
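
The hard-coded boundaries above (246 and 142157) are simply half of each class: 492 fraud rows and 284315 normal rows. As a sketch of the same idea without fixed numbers, the hypothetical helper below sends the first half of every class to the training set and the rest to the test set. It uses plain Python lists and is not part of the original notebook.

```python
def split_half_per_class(rows, labels):
    """Send the first half of each class to train and the remainder to test."""
    train, test = [], []
    for cls in sorted(set(labels)):
        # Collect (row, label) pairs of this class in their original order.
        members = [(r, l) for r, l in zip(rows, labels) if l == cls]
        cut = len(members) // 2
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Toy data (made-up values): five rows of class 0 and two rows of class 1.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7]]
labels = [0, 0, 0, 0, 1, 1, 0]
train, test = split_half_per_class(rows, labels)
print(len(train), len(test))
```

Because the split happens before any oversampling, synthetic points are only ever created from training rows.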
  6. Oversample minority class using LoRAS and other methods.
 min_class_points = features_1_train
 maj_class_points = features_0_train

 #LoRAS
 loras_min_class_points = loras.fit_resample(maj_class_points, min_class_points)
 print(loras_min_class_points.shape)
 LoRAS_feat = np.concatenate((loras_min_class_points, maj_class_points))
 LoRAS_labels = np.concatenate((np.zeros(len(loras_min_class_points))+1, np.zeros(len(maj_class_points))))

 #SMOTE
 sm = SMOTE(random_state=42, k_neighbors=30, sampling_strategy=1)
 SMOTE_feat, SMOTE_labels = sm.fit_resample(training_data,training_labels)
 print(SMOTE_feat.shape)
 print(SMOTE_labels.shape)

 #SMOTE Borderline
 smb = BorderlineSMOTE(random_state=42, k_neighbors=30)
 SMOTEb_feat, SMOTEb_labels = smb.fit_resample(training_data,training_labels)
 print(SMOTEb_feat.shape)
 print(SMOTEb_labels.shape)

 #SMOTE SVM
 sms = SVMSMOTE(random_state=42, k_neighbors=30)
 SMOTEs_feat, SMOTEs_labels = sms.fit_resample(training_data,training_labels)
 print(SMOTEs_feat.shape)
 print(SMOTEs_labels.shape)

 #ADASYN
 ada = ADASYN(random_state=42,n_neighbors=30)
 ADA_feat, ADA_labels = ada.fit_resample(training_data,training_labels)
 print(ADA_feat.shape)
 print(ADA_labels.shape) 
  7. Create functions for creating and evaluating models
 def get_metrics(y_test, y_pred):
     metrics = {}
     metrics["f1_score"] = f1_score(y_test, y_pred)
     metrics["accuracy"] = balanced_accuracy_score(y_test, y_pred)
     metrics["precision"] = precision_score(y_test, y_pred)
     metrics["recall"] = recall_score(y_test, y_pred)
     return metrics

 # note: despite its name, this helper trains a logistic regression classifier
 def linear_regression(X_train, y_train, X_test, y_test):
     logreg = LogisticRegression(random_state=42, C=.005, multi_class='multinomial', max_iter=685)
     logreg.fit(X_train, y_train)
     y_pred = logreg.predict(X_test)
     return get_metrics(y_test, y_pred)

 def random_forest(X_train, y_train, X_test, y_test):
     det = RandomForestClassifier(n_estimators=100, max_depth=2,random_state=42)
     det.fit(X_train, y_train)
     y_pred = det.predict(X_test)
     return get_metrics(y_test, y_pred) 
  8. Train and evaluate models for different oversampling methods.
 results_normal_lr = linear_regression(training_data, training_labels, test_data, test_labels)
 results_normal_rf = random_forest(training_data, training_labels, test_data, test_labels)

 results_loras_lr = linear_regression(LoRAS_feat, LoRAS_labels, test_data, test_labels)
 results_loras_rf = random_forest(LoRAS_feat, LoRAS_labels, test_data, test_labels)

 results_sm_lr = linear_regression(SMOTE_feat, SMOTE_labels, test_data, test_labels)
 results_sm_rf = random_forest(SMOTE_feat, SMOTE_labels, test_data, test_labels)

 results_sms_lr = linear_regression(SMOTEs_feat, SMOTEs_labels, test_data, test_labels)
 results_sms_rf = random_forest(SMOTEs_feat, SMOTEs_labels, test_data, test_labels)

 results_smb_lr = linear_regression(SMOTEb_feat, SMOTEb_labels, test_data, test_labels)
 results_smb_rf = random_forest(SMOTEb_feat, SMOTEb_labels, test_data, test_labels)

 results_ada_lr = linear_regression(ADA_feat, ADA_labels, test_data, test_labels)
 results_ada_rf = random_forest(ADA_feat, ADA_labels, test_data, test_labels)

 sampling_method = ['Normal', 'LoRAS','SMOTE','SMOTE SVM', 'SMOTE BORDERLINE', 'ADASYN'] 
  9. Plot the results.
 # logistic regression results
 LR_results = [results_normal_lr, results_loras_lr, results_sm_lr, results_sms_lr, 
               results_smb_lr, results_ada_lr]
 LR_dict = dict(zip(sampling_method, LR_results))
 #create a DataFrame from the results dictionary
 LR_df = pd.DataFrame.from_dict(LR_dict, orient='index')
 # create a column for sampling methods
 LR_df['sampling_method'] = LR_df.index
 LR_df.plot(x='sampling_method', y=['f1_score', 'accuracy', 'precision', 'recall'],
            kind="bar", figsize=(12, 8), rot = 45,
            title = "Performance of Logistic Regression models built with different Sampling Methods")

 # random forest results
 RF_results = [results_normal_rf, results_loras_rf, results_sm_rf,
              results_sms_rf, results_smb_rf, results_ada_rf]
 RF_dict = dict(zip(sampling_method, RF_results))
 #create a DataFrame from the results dictionary
 RF_df = pd.DataFrame.from_dict(RF_dict, orient='index')
 # create a column for sampling methods
 RF_df['sampling_method'] = RF_df.index
 RF_df.plot(x='sampling_method', y=['f1_score', 'accuracy', 'precision', 'recall'],
            kind="bar", figsize=(12, 8), rot = 45,
            title = "Performance of Random Forest models built with different Sampling Methods") 
Plot of different metrics of models created with LoRAS vs other oversampling methods

Here is the Colab Notebook for the above code.

Last Epoch 

LoRAS aims to improve the precision-recall balance (F1-Score) and class-wise average accuracy (balanced accuracy) of the models. The F1-Score measures how well the classification model handled the minority class classification, whereas Balanced accuracy provides us with a measure of how the classification model handled both majority and minority classes. Thus, these two measures together give a holistic understanding of a classifier’s performance on a dataset. LoRAS was benchmarked against the existing algorithms on a total of 14 imbalanced datasets.
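
To make the two measures concrete, here is a small worked example computed directly from confusion-matrix counts. The helper `f1_and_balanced_accuracy` is hypothetical, and the classifier is an imagined one that labels every test transaction as normal; only the test split sizes (246 fraud, 142158 normal) come from the code above.

```python
def f1_and_balanced_accuracy(tp, fp, fn, tn):
    """Return (F1, balanced accuracy, plain accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # minority class accuracy
    specificity = tn / (tn + fp) if (tn + fp) else 0.0     # majority class accuracy
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    balanced_accuracy = (recall + specificity) / 2
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return f1, balanced_accuracy, accuracy

# Imagined classifier that predicts "normal" for all 142404 test rows.
f1, bal_acc, acc = f1_and_balanced_accuracy(tp=0, fp=0, fn=246, tn=142158)
print(f1, bal_acc, round(acc, 4))
```

Plain accuracy stays above 0.998 even though no fraud is detected, while the F1-Score drops to 0 and the balanced accuracy to 0.5, which is why the comparison relies on these two measures.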

LoRAS vs other oversampling methods
Balanced accuracy/F1-Score for oversampling strategies

Considering the average performance over all the datasets, LoRAS has the best Balanced accuracy and F1-Score. SMOTE improves Balanced accuracy compared to models trained without any oversampling, but it lags behind in F1-Score on quite a few datasets that have a high baseline F1-Score. Applying ADASYN increases the Balanced accuracy compared to SMOTE, but again compromises on the F1-Score. LoRAS produces the best Balanced accuracy on average while also maintaining the highest average F1-Score among all the oversampling techniques.

References

To better understand the mathematics behind LoRAS, and to learn more about the finetuning parameters refer to the following resources:



Copyright Analytics India Magazine Pvt Ltd
