Last updated October 13, 2021

A Complete Guide To Outlier Detection With Hands-On Implementation For Beginners

In this article, we discuss outliers, their impact, and methods for treating the outliers present in the data. We also demonstrate hands-on implementations of these methods.


Published on October 20, 2020

by Bhavishya Pandit

To start off, we need to know what an outlier is. An outlier is a data point whose value differs markedly from the rest of the data. Let us understand this with an example: in a class of 20 students, most heights lie between 170 cm and 195 cm, with an average of 175 cm, but one student is 215 cm tall. The student at 215 cm differs markedly from the group, so that data point is called an outlier. A student with a height of 125 cm would be considered an outlier too. Another term commonly used for an outlier is “anomaly”.


Impact of outliers

Now, let us understand why it is important to identify outliers when it comes to machine learning.

Machine learning models are highly sensitive to missing or false values in the training data, which can severely limit their accuracy. For this reason, we need a thorough understanding of the data we have. Understanding outliers is part of that, because outliers can severely affect the accuracy of the model we want to train.

Source: Medium

Here we see that 300 is clearly an outlier. The example also shows how an outlier affects the summary statistics of the data: it strongly affects the mean, and therefore the standard deviation and the variance as well, while the median typically changes far less.

As a quick refresher: the mean is the average of all the values, the median is the middle value of the sorted data, and the mode is the most frequently occurring value. The standard deviation measures how much the values vary around the mean.
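To see the effect in numbers, here is a minimal sketch using Python's built-in statistics module. The sample values are made up for illustration (only the outlier value 300 comes from the example above); they are not the data from the original figure.

```python
import statistics

# Hypothetical sample; only the outlier value 300 is taken from the example above.
values = [10, 12, 12, 13, 14, 15, 16]
values_with_outlier = values + [300]

for name, data in [("without outlier", values), ("with outlier", values_with_outlier)]:
    print(name)
    print("  mean  :", round(statistics.mean(data), 2))
    print("  median:", statistics.median(data))
    print("  mode  :", statistics.mode(data))
    print("  stdev :", round(statistics.stdev(data), 2))
```

With these made-up numbers the mean jumps from about 13.14 to 49, while the median only moves from 13 to 13.5 and the mode stays at 12.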

Outliers are easy to spot when there are only a few values, but what should we do when there are tens of thousands of data points? In such cases, we can identify outliers in the following ways:

How to identify Outliers and remove them?

We cannot always remove outliers just by looking at them. The right method for removing outliers depends on the type of data set you are working with.

Still, it is worth learning a few important methods.

You can try them out yourself too. I will demonstrate them on the New York City Airbnb data set.

The methods we will look at are:

Percentile
Using Scatter/Box Plots
Using Standard Deviation
Z-score (Z test)

NOTE: You can also find the link to Google Colab at the end of this article.

#dependencies
import pandas as pd  #pip install pandas
import numpy as np   #pip install numpy
df=pd.read_csv('/content/AB_NYC_2019.csv') #reading the csv file
df.sample(10) #randomly gives out the rows from the data set

OUTPUT

df.shape # in the format of (row,columns)

# here we will be studying the price column from the data set

df.columns

OUTPUT

Removing Outliers using Percentile

#setting the limits or our criteria for an item to be called as an outlier
upper_bound=df['price'].quantile(0.9995)   #value at 99.95 percentile
print('Upper bound:',upper_bound)
lower_bound=df['price'].quantile(0.0005)   #value at 0.05 percentile
print('Lower bound:',lower_bound)
max_price=max(df['price'])
print(max_price)
min_price=min(df['price'])
print(min_price)
df[df['price']>upper_bound]
print(len(df[df['price']>upper_bound]))
df[df['price']<lower_bound]
print(len(df[df['price']<lower_bound]))

OUTPUT

df_percentile=df[(df['price']<upper_bound) & (df['price']>lower_bound)]

print(df_percentile) #removing outliers with the help of percentile

OUTPUT

After removing the outliers, we have 48841 entries, down from 48895.
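The same filter can be wrapped in a small helper so that it can be reused on other columns. This is only a sketch: the function name `remove_percentile_outliers` and the toy prices are hypothetical, and unlike the strict `<` and `>` comparisons above, `Series.between` keeps rows that are exactly equal to a bound by default.

```python
import pandas as pd

def remove_percentile_outliers(frame, column, lower_q=0.0005, upper_q=0.9995):
    """Keep rows whose `column` value lies between the two quantiles (inclusive)."""
    lower = frame[column].quantile(lower_q)
    upper = frame[column].quantile(upper_q)
    return frame[frame[column].between(lower, upper)]

# Toy data standing in for the price column (not the Airbnb data set).
toy = pd.DataFrame({"price": [0, 40, 45, 50, 55, 60, 65, 70, 75, 10000]})

# Wider quantiles than in the article, so the effect is visible on 10 rows.
trimmed = remove_percentile_outliers(toy, "price", lower_q=0.1, upper_q=0.9)
print(len(toy), "->", len(trimmed))
```

Whether to include or exclude values that sit exactly on a bound is a judgment call; on a large data set it can change the number of rows removed.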

Identifying and removing outliers with a scatter plot

import matplotlib.pyplot as plt
plt.scatter(df.index,df['price'],color='red')
plt.title('Price of accommodation')
plt.xlabel('indices')
plt.ylabel('Price')
plt.show()

OUTPUT

x_upper=list(df[df['price']>upper_bound].index)
y_upper=df[df['price']>upper_bound]
#print(x_upper)
#print(y_upper['price'])
x_lower=list(df[df['price']<lower_bound].index)
y_lower=df[df['price']<lower_bound]
#print(x_lower)
#print(y_lower['price'])
x_inlier=list(df[(df['price']<upper_bound) & (df['price']>lower_bound)].index)
y_inlier=df[(df['price']<upper_bound) & (df['price']>lower_bound)]
print(x_inlier)
print(y_inlier)

OUTPUT

plt.scatter(x_upper,y_upper['price'],color='black',marker='d',label='Above upper bound')
plt.scatter(x_lower,y_lower['price'],color='red',label='Below lower bound')
plt.scatter(x_inlier,y_inlier['price'],color='green',label='Inlier')
plt.title('Price of accommodation')
plt.xlabel('indices')
plt.ylabel('Price')
plt.legend()
plt.show()

OUTPUT

If you look carefully at the bottom of the graph, you will see some red data points; these are the ones that fall below the lower bound. They are hard to distinguish because the many green data points get in the way. Moreover, the difference between the lower bound and the normal range is too small to be clearly visible.

NOTE:

Here we used the same bounds as before to keep things consistent.
Also remember not to use a scatter plot when there are a lot of data points, especially when those data points lack variance.
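The list of methods above also mentions box plots. Here is a minimal sketch with matplotlib's `plt.boxplot`, using made-up prices rather than the Airbnb data. With matplotlib's default setting (`whis=1.5`), points lying more than 1.5 times the interquartile range beyond the box are drawn as individual markers, called fliers.

```python
import matplotlib.pyplot as plt

# Hypothetical prices for illustration only (not from the Airbnb data set).
prices = [50, 60, 65, 70, 75, 80, 90, 100, 1000]

box = plt.boxplot(prices)   # returns a dict of the drawn artists
plt.title('Box plot of price')
plt.ylabel('Price')

# The 'fliers' artists hold the points matplotlib drew as outliers.
flier_values = list(box['fliers'][0].get_ydata())
print('Points flagged as outliers:', flier_values)
plt.show()
```

Unlike a scatter plot, a box plot summarizes the bulk of the data in a single box, so it stays readable even with many data points.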

We will look at the remaining two methods on a different data set, because not every data set is suitable for demonstrating all the methods.

Link to dataset: https://www.kaggle.com/mustafaali96/weight-height

3. Standard Deviation

Formula:
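One standard way to write it is the sample standard deviation, which is also what pandas' `Series.std()` computes by default (`ddof=1`). The population version divides by n instead of n - 1. Here x_i denotes the i-th value:

```latex
s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}
```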

Here, n is the number of samples. X is the value and x̄ is the mean of all values.

from scipy.stats import norm
df=pd.read_csv('weight-height.csv')
df.head()

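#histogram of heights, normalised so that it can be compared with a density curve
plt.hist(df.Height,bins=10,rwidth=0.8,density=True)
#overlay a normal curve with the same mean and standard deviation as the data
rng = np.arange(df.Height.min(), df.Height.max(), 0.1)
plt.plot(rng, norm.pdf(rng,df.Height.mean(),df.Height.std()))
plt.show()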

NOTE:

The standard deviation method is most helpful when the data clearly follows a normal distribution (a bell-shaped curve), as it does here.

(10000×3 is the shape of the data frame)

#setting limits for outliers 
#constraining by using 3 std dev technique
#one may use 2 std dev or even 4-5 std dev instead, depending on the type of data being used
upper_limit=df.Height.mean()+3*(df.Height.std())
print(upper_limit)

OUTPUT

77.91014411714076

lower_limit=df.Height.mean()-3*(df.Height.std())

print(lower_limit)

OUTPUT

54.82497539250136

Removing the outliers

Observe the shape of the data frame now (9993×3).
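The filtering step itself can be sketched as follows. So that it runs without the weight-height.csv file, this example draws synthetic heights with NumPy (the mean and spread are assumptions made for illustration), so the row counts will differ from the 10000 and 9993 reported above. The name `df_no_outlier` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Height column (assumed mean and spread, not the real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({'Height': rng.normal(loc=66, scale=4, size=10000)})

upper_limit = df.Height.mean() + 3 * df.Height.std()
lower_limit = df.Height.mean() - 3 * df.Height.std()

# Keep only the rows that fall inside the 3 standard deviation band.
df_no_outlier = df[(df.Height < upper_limit) & (df.Height > lower_limit)]
print(df.shape, '->', df_no_outlier.shape)
```

On the real data set, the same boolean mask applied to `df.Height` would keep the other columns of each surviving row as well.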

4. Z-Score (Z-Test)

Formula:
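In symbols, and matching the code that follows, the z-score of a value x is its distance from the mean x̄ measured in units of the standard deviation s:

```latex
z = \frac{x - \bar{x}}{s}
```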

df['zscore'] = ( df.Height - df.Height.mean() ) / df.Height.std()

df.head(5)

OUTPUT

The agenda:

Get data points that have z score higher than 3 or lower than -3.

Another way of saying the same thing is to get data points that are more than 3 standard deviations away.

#the outliers

df[(df.zscore<-3) | (df.zscore>3)]

OUTPUT

Outliers Removed
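A minimal sketch of the removal step, again on synthetic heights generated with NumPy rather than the real data set (the mean and spread are assumptions, and `df_no_outliers` is a hypothetical name). It also checks that filtering on the z-score should select the same rows as filtering on the mean plus or minus 3 standard deviations.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Height column (assumed mean and spread, not the real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({'Height': rng.normal(loc=66, scale=4, size=10000)})
df['zscore'] = (df.Height - df.Height.mean()) / df.Height.std()

# Keep rows whose z-score lies strictly between -3 and 3.
df_no_outliers = df[(df.zscore > -3) & (df.zscore < 3)]

# Filtering directly on mean +/- 3 standard deviations should give the same rows.
upper_limit = df.Height.mean() + 3 * df.Height.std()
lower_limit = df.Height.mean() - 3 * df.Height.std()
by_std = df[(df.Height < upper_limit) & (df.Height > lower_limit)]
print(df_no_outliers.shape, by_std.shape)
```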

This is how one can identify outliers and, depending on the data and on how their removal will affect it, remove them.

Conclusion

This article was written to help beginners understand what outliers are, how to identify them and how to remove them. There are more methods, such as IQR, but these are sufficient for a beginner. I hope this adds value to your work.

The complete code for the above implementation is available at AIM’s GitHub repository. Please visit this link for the complete code.


Bhavishya Pandit

Understanding and building fathomable approaches to problem statements is what I like the most. I love talking about conversations whose main plot is machine learning, computer vision, deep learning, data analysis and visualization. Apart from them, my interest also lies in listening to business podcasts, use cases and reading self help books.