Consider a scenario where you have to classify between apples and oranges and 90% of your dataset consists of apples. That leaves only 10% of the data for oranges, so the model tends to become biased towards apples. Such a dataset is called an imbalanced dataset, and it degrades the performance of the model. To overcome this, the near-miss algorithm can be applied to the dataset.
In this article, we will learn about the near-miss algorithm and its different versions, and implement those versions on an imbalanced dataset.
What is the Near-Miss Algorithm?
Near-miss is an algorithm that can help in balancing an imbalanced dataset. It belongs to the family of undersampling algorithms and is an efficient way to balance the data. The algorithm does this by looking at the class distribution and eliminating samples from the larger class based on their distance to the smaller class. When two points belonging to different classes are very close to each other in the distribution, the algorithm eliminates the data point of the larger class, thereby trying to balance the distribution.
The steps taken by this algorithm are:
- The algorithm first calculates the distances between all the points in the larger class and all the points in the smaller class. This makes the process of undersampling easier.
- It then selects the n instances of the larger class that have the shortest distances to the smaller class. These n instances are stored for elimination.
- If there are m instances of the smaller class, the algorithm will return m*n instances of the larger class.
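The selection step above can be sketched in plain NumPy. This is a minimal illustration of the NearMiss-1 rule (keep the majority samples whose average distance to their k closest minority samples is smallest), not the library implementation; the function name and toy data are made up for the example.

```python
import numpy as np

def near_miss_1(X_maj, X_min, n_keep, k=3):
    """Sketch of NearMiss-1: keep the n_keep majority samples whose
    average distance to their k closest minority samples is smallest."""
    # pairwise Euclidean distances, shape (n_majority, n_minority)
    d = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # average distance to the k nearest minority points, per majority point
    avg_nearest = np.sort(d, axis=1)[:, :k].mean(axis=1)
    # indices of the majority samples closest (on average) to the minority class
    keep = np.argsort(avg_nearest)[:n_keep]
    return X_maj[keep]

# toy data: 6 majority points, 3 minority points in 2-D
X_maj = np.array([[0., 0.], [1., 0.], [5., 5.], [6., 5.], [0.5, 0.5], [7., 7.]])
X_min = np.array([[0., 1.], [1., 1.], [0.5, 1.5]])
kept = near_miss_1(X_maj, X_min, n_keep=3)
# the three majority points nearest the minority cluster are retained
```

The three majority points far from the minority cluster are the ones discarded, which is exactly the "eliminate the close-but-overrepresented class" idea described above.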
Types of the near-miss algorithm:
Version 1: In the first version, the data is balanced by selecting the samples of the larger class whose average distance to the k closest samples of the smaller class is smallest.
Version 2: Here, the data is balanced by selecting the samples of the larger class whose average distance to the k farthest samples of the smaller class is smallest.
Version 3: This version works in two steps. First, for each instance of the smaller class, its m nearest neighbours from the larger class are stored. Then, from these candidates, the samples of the larger class whose average distance to the k nearest smaller-class samples is largest are kept, and the rest are eliminated.
To better understand the concept, we will first implement the algorithm without balancing the data and check its accuracy. Then, we will apply the near-miss algorithm to the data and then compare the accuracy.
We will select the diabetes dataset for this. You can download the dataset from here.
Let us now import the required libraries and load our dataset.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('https://raw.githubusercontent.com/npradaschnor/Pima-Indians-Diabetes-Dataset/master/diabetes.csv')
data
```
As you can see, there are only 268 instances of class 1 and 500 instances of class 0, hence the data is imbalanced.
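These counts can be verified with `value_counts()` on the label column (named `Outcome` in this dataset). A stand-in frame with the same class sizes is used here so the snippet runs without the download:

```python
import pandas as pd

# stand-in for the loaded `data` frame, with the same class sizes
data = pd.DataFrame({'Outcome': [0] * 500 + [1] * 268})
counts = data['Outcome'].value_counts()
print(counts)   # class 0: 500, class 1: 268
```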
Building Classifier With Unbalanced Data
Now, let us build the classification model on this data.
First, we will split the data into train and test sets.
```python
from sklearn.model_selection import train_test_split

# the label column in this dataset is 'Outcome'
X = data.drop('Outcome', axis=1)
Y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
```
We will first create a plot of the train data and visualize the data distribution before applying the algorithm.
```python
from collections import Counter
from numpy import where
from matplotlib import pyplot

X_train = X_train.values
c = Counter(y_train)
for out, _ in c.items():
    points = where(y_train == out)[0]
    pyplot.scatter(X_train[points, 0], X_train[points, 1], label=str(out))
pyplot.legend()
pyplot.show()
```
As shown above, there are more orange points than blue ones.
Now, we will build a logistic regression model on this data and check the classification report on it.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

reg = LogisticRegression(max_iter=1000)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)
print(classification_report(y_test, pred))
```
As you can see, although the precision for class 0 is higher, the recall for class 1 is very low, indicating a strong bias towards class 0.
Using the Near-Miss Algorithm to Treat Class-Imbalance problem
In order to overcome this we will use the near-miss algorithm as follows:
```python
from imblearn.under_sampling import NearMiss

nr = NearMiss()
X_near, Y_near = nr.fit_resample(X_train, y_train)
c = Counter(Y_near)
for out, _ in c.items():
    points = where(Y_near == out)[0]
    pyplot.scatter(X_near[points, 0], X_near[points, 1], label=str(out))
pyplot.legend()
pyplot.show()
```
After the undersampling, you can see roughly equal numbers of blue and orange points, which means the dataset is much more balanced and better to train on since the bias is reduced. Let us again build the model on this data.
```python
reg1 = LogisticRegression(max_iter=1000)
reg1.fit(X_near, Y_near)
pred = reg1.predict(X_test)
print(classification_report(y_test, pred))
```
As you can see in the recall column, there is a significant improvement in the values, and the predictions are spread more evenly across the classes. This makes the model more reliable for real-world inputs.
In this article, we learnt what the near-miss algorithm is and how to use it in a problem. We also compared results on the imbalanced data and the balanced one. It is essential to make sure the data is not biased before the model is trained, and the near-miss algorithm is one way to ensure that the data is more evenly distributed and does not cause bias.
The complete code of the above implementation is available at the AIM’s GitHub repository. Please visit this link to find the notebook of this code.