Using Near-Miss Algorithm For Imbalanced Datasets

Consider a scenario where you have to classify between apples and oranges, and 90% of your dataset contains apples. This leaves only 10% of the data for oranges, and the model tends to become biased towards apples. Such a dataset is called an imbalanced dataset, and it hurts the performance of the model. To overcome this, the near-miss algorithm can be applied to the dataset. 

In this article, we will learn about the near-miss algorithm and its different versions, and implement them on an imbalanced dataset. 

What is the Near-Miss Algorithm?

Near-miss is an algorithm that can help in balancing an imbalanced dataset. It belongs to the family of undersampling algorithms and is an efficient way to balance the data. Rather than dropping samples at random, the algorithm looks at the class distribution and eliminates samples from the larger (majority) class based on distance. When two points belonging to different classes are very close to each other in the feature space, the algorithm eliminates the data point of the larger class, thereby trying to balance the distribution. 


The steps taken by this algorithm are:

  1. The algorithm first calculates the distances between every point in the larger (majority) class and every point in the smaller (minority) class. This makes the process of undersampling easier. 
  2. It then selects the n instances of the larger class that are at the shortest distance from the smaller class. These n instances are retained, while the remaining majority instances are eliminated. 
  3. If there are m instances in the smaller class, the algorithm will return at most m*n instances of the larger class. 
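The selection step above can be sketched in plain NumPy. This is a minimal, illustrative NearMiss-1-style selection, not the library implementation; the names `nearmiss_select`, `X_maj`, `X_min`, `n_keep` and `k` are hypothetical:

```python
import numpy as np

def nearmiss_select(X_maj, X_min, n_keep, k=3):
    # Pairwise Euclidean distances: each majority point vs every minority point
    d = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # Average distance to the k closest minority points (NearMiss-1 criterion)
    avg_near = np.sort(d, axis=1)[:, :k].mean(axis=1)
    # Retain the n_keep majority points closest to the minority class
    keep = np.argsort(avg_near)[:n_keep]
    return X_maj[keep]

# Toy data: 50 majority points around the origin, 10 minority points around (3, 3)
X_maj = np.random.RandomState(0).normal(0, 1, (50, 2))
X_min = np.random.RandomState(1).normal(3, 1, (10, 2))
kept = nearmiss_select(X_maj, X_min, n_keep=10)
print(kept.shape)  # (10, 2)
```

The retained majority points are the ones sitting closest to the minority cluster, which is exactly where class overlap (and hence bias) arises.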

Types of the near-miss algorithm:

Version 1: In the first version, the data is balanced by retaining those samples of the larger class whose average distance to the k closest samples of the smaller class is the smallest.

Version 2: Here, the data is balanced by retaining those samples of the larger class whose average distance to the k farthest samples of the smaller class is the smallest. 

Version 3: Here, each instance of the smaller class is considered and its m nearest neighbours from the larger class are stored. From these, the samples of the larger class with the largest average distance to the closest smaller-class samples are retained. 


To better understand the concept, we will first build a classifier without balancing the data and check its accuracy. Then, we will apply the near-miss algorithm to the data and compare the accuracy.

We will select the diabetes dataset for this. You can download the dataset from here.

Let us now import the required libraries and load our dataset. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset (the column name assumes the standard Pima diabetes CSV)
data = pd.read_csv('diabetes.csv')
print(data['Outcome'].value_counts())


As you can see, there are only 268 instances of class 1 against 500 instances of class 0; hence the data is imbalanced. 

Building a Classifier With Imbalanced Data

Now, let us build the classification model on this data. 

First, we will split the data into train and test sets. 

from sklearn.model_selection import train_test_split

X = data.drop('Outcome', axis=1).values
Y = data['Outcome'].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

We will first create a plot of the train data and visualize the data distribution before applying the algorithm.

from collections import Counter

c = Counter(y_train)
for out, _ in c.items():
    points = np.where(y_train == out)[0]
    plt.scatter(X_train[points, 0], X_train[points, 1], label=str(out))
plt.legend()
plt.show()

As shown above, there are more orange points than blue ones. 

Now, we will build a logistic regression model on this data and check the classification report on it.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

reg = LogisticRegression()
reg.fit(X_train, y_train.ravel())
pred = reg.predict(X_test)
print(classification_report(y_test, pred))

As you can see, although the precision for class 0 is higher, the recall for class 1 is very low, indicating a strong bias towards class 0. 

Using the Near-Miss Algorithm to Treat Class-Imbalance problem

In order to overcome this we will use the near-miss algorithm as follows:

from imblearn.under_sampling import NearMiss

nr = NearMiss()
X_near, Y_near = nr.fit_resample(X_train, y_train.ravel())

c = Counter(Y_near)
for out, _ in c.items():
    points = np.where(Y_near == out)[0]
    plt.scatter(X_near[points, 0], X_near[points, 1], label=str(out))
plt.legend()
plt.show()

After the undersampling, you can see roughly equal numbers of blue and orange points, which means the dataset is much more balanced and better to train on since the bias is reduced. Let us again build the model on this data. 

reg1 = LogisticRegression()
reg1.fit(X_near, Y_near.ravel())
pred = reg1.predict(X_test)  
print(classification_report(y_test, pred)) 

As you can see in the recall column, there is a significant improvement in the values, and performance is more evenly distributed across the classes. This makes the model more reliable for real-world inputs. 


In this article, we learnt what the near-miss algorithm is and how to use it in a problem. We also compared results on the imbalanced data and the balanced data. It is essential to make sure the data is not biased before the model is trained, and the near-miss algorithm is one way to ensure that the data is more evenly distributed and does not cause bias.

The complete code of the above implementation is available at the AIM’s GitHub repository. Please visit this link to find the notebook of this code.

Bhoomika Madhukar
I am an aspiring data scientist with a passion for teaching. I am a computer science graduate from Dayananda Sagar Institute. I have experience in building models in deep learning and reinforcement learning. My goal is to use AI in the field of education to make learning meaningful for everyone.
