
A Complete Guide To Outlier Detection With Hands-On Implementation For Beginners

Through this article, we will be discussing outliers, their impact and methods to treat the outlier present in the data. We will also demonstrate the hands-on implementation of these methods.


To start off, we need to know what an outlier is. An outlier is a data point whose value differs greatly from the rest of the data. Let us try to understand this with an example: take a class of 20 students whose heights mostly fall between 170 cm and 195 cm, with an average of about 175 cm. A student who is 215 cm tall differs greatly from the group, so that data point is called an outlier. A student with a height of 125 cm would be an outlier too, on the low side. Another term commonly used for an outlier is “anomaly”.


Impact of outliers

Now, let us understand why it is important to identify outliers when it comes to machine learning.

Machine learning models are highly sensitive to missing or false values in their training data, since such values can severely limit the accuracy of the model. For this reason we need a thorough understanding of the data we have. Understanding its outliers is equally important, because outliers can severely affect the accuracy of the model we want to train.

Source: Medium

Here, 300 is clearly an outlier. The example also shows how strongly an outlier distorts the summary statistics of the data: it pulls the mean toward itself and inflates the standard deviation and therefore the variance, while the median and mode are far less affected.

As a brief reminder: the mean is the average of all the values, the median is the middle value of the sorted data, and the mode is the most frequent value. The standard deviation measures how much the values vary around the mean.
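To make the effect concrete, here is a minimal sketch using Python's built-in statistics module. The numbers are hypothetical and chosen only for illustration; 300 plays the role of the outlier.

```python
import statistics

# Hypothetical sample: nine ordinary values, then the same values plus one extreme value.
values = [10, 12, 12, 13, 14, 15, 15, 15, 16]
with_outlier = values + [300]

def summarize(data):
    """Return mean, median, mode and sample standard deviation of data."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "std": statistics.stdev(data),
    }

before = summarize(values)
after = summarize(with_outlier)

# Print each statistic without and with the outlier, side by side.
for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

The mean and standard deviation jump sharply once 300 is added, while the median barely moves and the mode stays the same.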

It is easy to spot an outlier when there are only a few values, but what do we do when the data points number in the tens of thousands? In such cases we can identify outliers in the following ways:

How to identify Outliers and remove them?

We cannot always remove outliers just by looking at them. The right method for removing outliers depends on the type of data set you are working with.

Still, a few important methods are worth knowing well.

You can try them out yourself too. I will demonstrate them on the New York City AirBnB data set.

The methods we will look at are:

  1. Percentile 
  2. Using Scatter/Box Plots
  3. Using Standard Deviation
  4. Z test

NOTE: You can also find the link to Google Colab at the end of this article.

#dependencies
import pandas as pd  #pip install pandas
import numpy as np   #pip install numpy
df=pd.read_csv('/content/AB_NYC_2019.csv') #reading the csv file
df.sample(10) #randomly gives out the rows from the data set

OUTPUT

df.shape  # in the format of (row,columns)

# here we will be studying the price column from the data set

df.columns

OUTPUT

  1. Removing Outliers using Percentile
#setting the limits, i.e. our criteria for a value to be called an outlier
upper_bound=df['price'].quantile(0.9995)   #value at the 99.95th percentile
print('Upper bound:',upper_bound)
lower_bound=df['price'].quantile(0.0005)   #value at the 0.05th percentile
print('Lower bound:',lower_bound)
max_price=df['price'].max()
print(max_price)
min_price=df['price'].min()
print(min_price)
#number of rows above the upper bound
print(len(df[df['price']>upper_bound]))
#number of rows below the lower bound
print(len(df[df['price']<lower_bound]))

OUTPUT

#removing outliers with the help of percentiles: keep only rows strictly between the two bounds
df_percentile=df[(df['price']<upper_bound) & (df['price']>lower_bound)]

print(df_percentile)

OUTPUT

After removing the outliers we are left with 48841 entries, down from 48895.
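The same idea can be wrapped in a small reusable function. This is only a sketch: the function name remove_by_percentile, its default quantiles and the toy data are my own assumptions for illustration, not part of the AirBnB example.

```python
import pandas as pd

def remove_by_percentile(frame, column, lower_q=0.0005, upper_q=0.9995):
    """Keep rows whose value lies strictly between the two quantile bounds."""
    lower = frame[column].quantile(lower_q)
    upper = frame[column].quantile(upper_q)
    mask = (frame[column] > lower) & (frame[column] < upper)
    return frame[mask], lower, upper

# Hypothetical toy data: prices 1..1000 plus one extreme listing.
toy = pd.DataFrame({"price": list(range(1, 1001)) + [10_000]})

# Wider cut-offs than in the article so the effect is visible on such a small sample.
cleaned, lower, upper = remove_by_percentile(toy, "price", 0.01, 0.99)
print(len(toy), "->", len(cleaned), "rows; bounds:", lower, upper)
```

Because the comparison is strict, rows exactly equal to a bound are dropped as well. Switch to >= and <= if you would rather keep them.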

  2. Identifying and removing outliers with a scatter plot
import matplotlib.pyplot as plt
plt.scatter(df.index,df['price'],color='red')
plt.title('Price of accommodation')
plt.xlabel('indices')
plt.ylabel('Price')
plt.show()

OUTPUT

#rows above the upper bound
x_upper=list(df[df['price']>upper_bound].index)
y_upper=df[df['price']>upper_bound]
#rows below the lower bound
x_lower=list(df[df['price']<lower_bound].index)
y_lower=df[df['price']<lower_bound]
#rows strictly between the two bounds
x_inlier=list(df[(df['price']<upper_bound) & (df['price']>lower_bound)].index)
y_inlier=df[(df['price']<upper_bound) & (df['price']>lower_bound)]
print(x_inlier)
print(y_inlier)

OUTPUT

plt.scatter(x_upper,y_upper['price'],color='black',marker='d',label='Above upper bound')
plt.scatter(x_lower,y_lower['price'],color='red',label='Below lower bound')
plt.scatter(x_inlier,y_inlier['price'],color='green',label='Inlier')
plt.title('Price of accommodation')
plt.xlabel('indices')
plt.ylabel('Price')
plt.legend()
plt.show()

OUTPUT

If you look carefully at the bottom of the graph, you will see some red data points; these are the ones below the lower bound. They are hard to distinguish because the large number of green data points gets in the way. Moreover, the gap between the lower bound and the normal range is too small to be clearly visible.

NOTE:

  1. Here we reused the same bounds as in the percentile method to keep the two examples consistent.
  2. Avoid scatter plots when there are a lot of data points, especially when those data points have little variance.

We will look at the remaining two methods on a different data set, because not every data set is suitable for demonstrating every method.

Link to dataset: https://www.kaggle.com/mustafaali96/weight-height

3. Standard Deviation

Formula:


Here, n is the number of samples. X is the value and x̄ is the mean of all values.
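For reference, the sample standard deviation in the notation above can be written as follows. The n − 1 denominator is an assumption on my part: it matches the default of pandas' Series.std() (ddof=1), which the code in this section relies on. The population form divides by n instead.

```latex
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{x}\right)^2}
```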

from scipy.stats import norm
df=pd.read_csv('weight-height.csv')
df.head()
plt.hist(df.Height,bins=10,rwidth=0.8,density=True)  #histogram of heights, normalised to a density
rng = np.arange(df.Height.min(), df.Height.max(), 0.1)
plt.plot(rng, norm.pdf(rng,df.Height.mean(),df.Height.std()))  #overlay the fitted normal curve
plt.show()

NOTE:

Here, the standard deviation is helpful when the data follows a normal distribution (a bell-shaped curve) that can easily be observed.

(10000×3 is the shape of the data frame)

#setting limits for outliers
#constraining with the 3 standard deviation technique
#one may use 2 or even 4-5 standard deviations instead; it depends on the type of data being used
upper_limit=df.Height.mean()+3*(df.Height.std())
print(upper_limit)

OUTPUT

77.91014411714076

lower_limit=df.Height.mean()-3*(df.Height.std())

print(lower_limit)

OUTPUT

54.82497539250136

Removing the outliers

Observe the shape of the data frame now: 9993×3.
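The filtering step itself can be sketched as below. To keep the snippet self-contained it runs on synthetic, roughly normal heights rather than the Kaggle file, so df_demo, its parameters and the resulting row counts are assumptions for illustration and will not match the shapes quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Height column: synthetic, roughly normal data.
rng = np.random.default_rng(seed=0)
df_demo = pd.DataFrame({"Height": rng.normal(loc=66.0, scale=4.0, size=10_000)})

# Limits at 3 standard deviations on either side of the mean.
mean, std = df_demo.Height.mean(), df_demo.Height.std()
lower_limit, upper_limit = mean - 3 * std, mean + 3 * std

# Rows outside the limits are the outliers; everything else is kept.
outliers = df_demo[(df_demo.Height < lower_limit) | (df_demo.Height > upper_limit)]
df_no_outliers = df_demo[(df_demo.Height >= lower_limit) & (df_demo.Height <= upper_limit)]

print(df_demo.shape, "->", df_no_outliers.shape)
```

On the real data you would apply the same two masks to df, using the upper_limit and lower_limit computed earlier.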

4. Z-Test

Formula:

df['zscore'] = ( df.Height - df.Height.mean() ) / df.Height.std()

df.head(5)

OUTPUT

The agenda:

Get the data points whose z-score is higher than 3 or lower than -3.

Another way of saying the same thing: get the data points that lie more than 3 standard deviations away from the mean.

#the outliers

df[(df.zscore<-3) | (df.zscore>3)]

OUTPUT


Outliers Removed
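A minimal sketch of the removal step, again on synthetic heights so that it runs on its own. The frame df_demo and its parameters are assumptions for illustration; on the real data the same filter would be applied to df and its zscore column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Height column: synthetic, roughly normal data.
rng = np.random.default_rng(seed=1)
df_demo = pd.DataFrame({"Height": rng.normal(loc=66.0, scale=4.0, size=10_000)})

# z-score: how many standard deviations each value lies from the mean.
df_demo["zscore"] = (df_demo.Height - df_demo.Height.mean()) / df_demo.Height.std()

# Keep only the rows whose z-score lies within [-3, 3].
df_no_outliers = df_demo[df_demo.zscore.abs() <= 3]

print(df_demo.shape, "->", df_no_outliers.shape)
```

Taking the absolute value lets a single comparison cover both the upper and the lower tail.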


This is how one can identify outliers and then, depending on the data and on how removal would affect it, remove them.

Conclusion

This article was written to help beginners understand the concept of outliers, how to identify them and how to remove them. There are many more methods, such as IQR, but these are sufficient for a beginner. I hope this adds considerable value to your work.

The complete code for the above implementation is available at the AIM’s GitHub repository. Please visit this link for this complete code.


Bhavishya Pandit

Understanding and building fathomable approaches to problem statements is what I like the most. I love talking about conversations whose main plot is machine learning, computer vision, deep learning, data analysis and visualization. Apart from them, my interest also lies in listening to business podcasts, use cases and reading self help books.