# Using Near-Miss Algorithm For Imbalanced Datasets

Consider a scenario where you have to classify between apples and oranges and 90% of your dataset consists of apples. This leaves only 10% of the data for oranges, so the model tends to get biased towards apples. Such a dataset is called an imbalanced dataset, and it hurts the performance of the model. To overcome this, the near-miss algorithm can be applied to the dataset.
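To make the imbalance concrete, here is a small sketch (using scikit-learn's `make_classification` with made-up parameters) that generates a 90/10 split like the apples-and-oranges example:

```
from collections import Counter
from sklearn.datasets import make_classification

# Synthetic two-class dataset: roughly 90% class 0 ("apples")
# and 10% class 1 ("oranges")
X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.9, 0.1], random_state=42)
print(Counter(y))
```

Counting the labels shows one class dominating the other by roughly nine to one, which is exactly the situation the near-miss algorithm is designed to correct.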

In this article, we will learn about the near-miss algorithm and its different versions, and implement them on an imbalanced dataset.

### What is the Near-Miss Algorithm?

Near-miss is an algorithm that can help in balancing an imbalanced dataset. It belongs to the family of undersampling algorithms and is an efficient way to balance the data. The algorithm does this by looking at the class distribution and eliminating samples from the larger class: it keeps the majority-class points that lie closest to the minority-class points and discards the rest, thereby bringing the two distributions closer to balance.

The steps taken by this algorithm are:

1. The algorithm first calculates the distances between all the points in the larger (majority) class and all the points in the smaller (minority) class. This makes the undersampling step straightforward.
2. For each instance of the smaller class, the n instances of the larger class with the shortest distance to it are selected and stored; these are the majority samples that are kept, while the rest are eliminated.
3. If there are m instances of the smaller class, the algorithm therefore returns m × n instances of the larger class.
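The steps above can be sketched in NumPy. The toy 1-D example below uses made-up points and illustrates the NearMiss-1-style selection rule, not the full library implementation:

```
import numpy as np

# Made-up 1-D points: six majority samples, two minority samples
majority = np.array([[0.0], [1.0], [2.0], [5.0], [6.0], [9.0]])
minority = np.array([[1.5], [2.5]])

# Step 1: distance from every majority point to every minority point
dists = np.abs(majority - minority.T)            # shape (6, 2)

# Step 2: average distance to the k closest minority points (k=2 here)
avg_closest = np.sort(dists, axis=1)[:, :2].mean(axis=1)

# Step 3: keep only the majority points with the smallest averages,
# as many as there are minority points
keep = np.argsort(avg_closest)[:len(minority)]
print(majority[keep].ravel())
```

Here the majority points at 1.0 and 2.0 survive because they sit closest to the minority points, while the distant ones (5.0, 6.0, 9.0) are discarded.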

### Types of the Near-Miss Algorithm

Version 1: In the first version, the data is balanced by keeping those majority-class samples whose average distance to the three closest minority-class samples is the smallest.

Version 2: Here, the data is balanced by keeping those majority-class samples whose average distance to the three farthest minority-class samples is the smallest.

Version 3: Here, each minority-class instance is considered and its m nearest majority-class neighbours are stored; from these, the majority samples with the largest average distance to the minority class are retained.

### Implementation

To better understand the concept, we will first build a classifier without balancing the data and check its accuracy. Then, we will apply the near-miss algorithm to the data and compare the accuracy.

We will select the diabetes dataset for this. You can download the dataset from here.

Let us now import the required libraries and load our dataset.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset (the CSV file name is assumed here)
data = pd.read_csv('diabetes.csv')
data.head()
```

`data['Outcome'].value_counts()`

As you can see, there are only 268 instances of class 1 against 500 instances of class 0, hence the data is imbalanced.

### Building Classifier With Unbalanced Data

Now, let us build the classification model on this data.

First, we will split the data into train and test sets.

```
from sklearn.model_selection import train_test_split

# Separate the features and the target column
X = data.drop('Outcome', axis=1)
Y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
```

We will first create a plot of the train data and visualize the data distribution before applying the algorithm.

```
from collections import Counter
from numpy import where
from matplotlib import pyplot

X_train = X_train.values
counter = Counter(y_train)
for label, _ in counter.items():
    points = where(y_train == label)[0]
    pyplot.scatter(X_train[points, 0], X_train[points, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

As shown above, there are more orange points than blue ones.

Now, we will build a logistic regression model on this data and check the classification report on it.

```
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

reg = LogisticRegression()
reg.fit(X_train, y_train)
pred = reg.predict(X_test)
print(classification_report(y_test, pred))
```

As you can see, although the precision for class 0 is higher, the recall for class 1 is very low, indicating a strong bias towards class 0.

### Using the Near-Miss Algorithm to Treat Class-Imbalance problem

In order to overcome this, we will use the near-miss algorithm as follows:

```
from imblearn.under_sampling import NearMiss

nr = NearMiss()
# fit_resample replaces the older fit_sample API
X_near, Y_near = nr.fit_resample(X_train, y_train)
counter = Counter(Y_near)
for label, _ in counter.items():
    points = where(Y_near == label)[0]
    pyplot.scatter(X_near[points, 0], X_near[points, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

After the undersampling, you can see roughly equal numbers of blue and orange points, which means the dataset is much more balanced and better to train on, since the bias is reduced. Let us build the model again on this data.

```
reg1 = LogisticRegression()
reg1.fit(X_near, Y_near)
pred = reg1.predict(X_test)
print(classification_report(y_test, pred))
```

As you can see in the recall column, there is a significant improvement in the values, and the performance is more evenly distributed across the classes. This makes the model more reliable for real-world inputs.

### Conclusion

In this article, we learnt what the near-miss algorithm is and how to use it in a classification problem. We also compared a model trained on imbalanced data with one trained on balanced data. It is essential to make sure the data is not biased before the model is trained, and the near-miss algorithm is one way to ensure that the data is more evenly distributed.

The complete code of the above implementation is available at the AIM’s GitHub repository. Please visit this link to find the notebook of this code.
