
Outlier Detection Using z-Score – A Complete Guide With Python Codes

In this article, we discuss how to detect outliers in a data set and remove them using two different methods.

In machine learning, we often encounter data sets that contain outliers. Outliers are extreme values: observations that diverge from the rest of the data and do not follow its overall pattern.

These outliers can arise for different reasons, such as human error while preparing the data, or outliers placed in the data intentionally to test a model. But are they beneficial when building predictive models? It depends: sometimes we have to drop them, and sometimes we retain them because they carry interesting meaning.

We will work with a weight-height dataset that is publicly available on Kaggle, detecting outliers first with limits based on the standard deviation and then with the z-score. The dataset contains weight and height values; we will search for outliers in the weight column.

What will you learn from this article?

  • What are Outliers? How to find them? 
  • What are Z-score and Standard deviation?
  • How to remove Outliers using Z-score and Standard deviation? 

As explained earlier, an outlier is an extreme value that is very unusual compared with the rest of the data. Let us find the outliers in the weight column of the dataset. We will first import the libraries and load the data. Use the code below.

import pandas as pd

import matplotlib.pyplot as plt

df = pd.read_csv("weight.csv")

df.Weight

Now we will plot a histogram to check the distribution of this column. Use the code below.

plt.hist(df.Weight, bins=20, rwidth=0.8)

plt.xlabel('Weight')

plt.ylabel('Count')

plt.show()

From the above graph, we can see that the data is centred on the mean and roughly follows a normal distribution. The counts are highest near the mean and fall away towards both the left and the right. Let us look at the descriptive statistics of this column, such as the mean, standard deviation, minimum, and maximum. Use the code below.

df.Weight.describe()

The mean of the weight column is 161.44 and the standard deviation is 32.108. The minimum and maximum values in the column are 64 and 269 respectively. We will now treat everything lying more than 3 standard deviations from the mean as an outlier. To do so, we compute an upper limit and a lower limit at 3 standard deviations above and below the mean; every data point beyond either limit is an outlier. Use the code below.

upper = df.Weight.mean() + 3*df.Weight.std()

lower = df.Weight.mean() - 3*df.Weight.std()

print(upper)

print(lower)
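As a rough check on these limits, the arithmetic can be worked by hand from the summary statistics quoted above. The sketch below is only an illustration: it uses the rounded mean and standard deviation from the text, and the names mean_weight, std_weight, upper_approx and lower_approx are made up for this example, so the results may differ slightly from what the code above prints on the full-precision data.

```python
# Summary statistics as quoted in the text (rounded, so results are approximate)
mean_weight = 161.44
std_weight = 32.108

# Limits at 3 standard deviations on either side of the mean
upper_approx = mean_weight + 3 * std_weight
lower_approx = mean_weight - 3 * std_weight

print(round(upper_approx, 3))  # 257.764
print(round(lower_approx, 3))  # 65.116
```

With these approximate limits, the reported minimum of 64 and maximum of 269 both fall outside the range, so at least those two extremes would be flagged.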

Now we will see which data points fall beyond these limits.
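The selection step can be sketched as follows. Because weight.csv is not bundled here, the example builds a small synthetic Weight column with two planted extremes; this data and the name demo_df are assumptions made purely for illustration. On the real data, the equivalent filter would presumably be applied to df directly with the upper and lower values computed above.

```python
import pandas as pd

# Synthetic stand-in for the Weight column (hypothetical data, not the Kaggle file):
# 200 ordinary values between 150 and 170, plus two planted extremes.
weights = [150 + (i % 21) for i in range(200)] + [20, 300]
demo_df = pd.DataFrame({"Weight": weights})

# Same limits as in the text: mean plus/minus 3 standard deviations
upper = demo_df.Weight.mean() + 3 * demo_df.Weight.std()
lower = demo_df.Weight.mean() - 3 * demo_df.Weight.std()

# Rows beyond either limit are the candidate outliers
outliers = demo_df[(demo_df.Weight > upper) | (demo_df.Weight < lower)]
print(outliers)
```

Note the use of | (or) rather than & (and): a point is an outlier if it lies above the upper limit or below the lower limit.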

The above two data points are now treated as outliers. To remove them, we simply keep the data points that fall within these limits. Use the code below.

new_df = df[(df.Weight < upper) & (df.Weight > lower)]

new_df.head()

new_df.shape

The original data had 10,000 rows and the new data frame has 9,998; the 2 rows treated as outliers have been removed. Now we will do the same thing using the z-score, which tells us how many standard deviations a data point lies from the mean. It is calculated by subtracting the mean from the data point and dividing the result by the standard deviation. Let us see how this is done in practice.

df['zscore'] = ( df.Weight - df.Weight.mean() ) / df.Weight.std()

df.head(5)
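To make the formula concrete, the z-score can also be worked by hand. The sketch below is an illustration only: z_score is a hypothetical helper, and it uses the rounded mean and standard deviation quoted earlier, so the values are approximate.

```python
# Summary statistics as quoted in the text (rounded, so results are approximate)
mean_weight = 161.44
std_weight = 32.108

def z_score(value, mean, std):
    """Number of standard deviations that value lies from mean."""
    return (value - mean) / std

# The reported maximum, the reported minimum, and the mean itself
print(round(z_score(269, mean_weight, std_weight), 2))     # 3.35
print(round(z_score(64, mean_weight, std_weight), 2))      # -3.03
print(round(z_score(161.44, mean_weight, std_weight), 2))  # 0.0
```

A value equal to the mean has a z-score of 0, and the sign shows on which side of the mean a value lies.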

We can see that a z-score is computed for each row. Now we will look at only those rows that have a z-score greater than 3 or less than -3. Use the code below.

df[df['zscore']>3]

df[df['zscore']<-3]

We have found the same outliers that were found earlier with the standard deviation method. We can remove them in the same way as before, keeping only the data points that lie within 3 standard deviations of the mean.

df_new = df[(df.zscore>-3) & (df.zscore<3)]
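The two approaches agree because they are the same condition written in two ways: a z-score below 3 means exactly that the value is below the mean plus 3 standard deviations. A minimal sketch of that check, using hypothetical synthetic data (demo_df is a stand-in, not the Kaggle file), might look like this:

```python
import pandas as pd

# Hypothetical stand-in data: 200 ordinary values plus two planted extremes
weights = [150 + (i % 21) for i in range(200)] + [20, 300]
demo_df = pd.DataFrame({"Weight": weights})

mu, sigma = demo_df.Weight.mean(), demo_df.Weight.std()

# Method 1: keep rows strictly inside mean +/- 3 standard deviations
by_limits = demo_df[(demo_df.Weight < mu + 3 * sigma) & (demo_df.Weight > mu - 3 * sigma)]

# Method 2: keep rows whose z-score lies strictly between -3 and 3
zscore = (demo_df.Weight - mu) / sigma
by_zscore = demo_df[(zscore > -3) & (zscore < 3)]

# Both filters should retain exactly the same rows
print(by_limits.index.equals(by_zscore.index))
print(len(demo_df), len(by_limits))
```

One detail to keep in mind: the pandas std() method uses the sample standard deviation (ddof=1) by default, whereas NumPy's std defaults to ddof=0, so z-scores computed with NumPy can differ slightly from the ones shown here.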


Conclusion 

To conclude, outliers are very important, and one needs to be careful when deciding whether to remove or retain them. In this article, we discussed two methods for detecting and removing outliers. We first detected them using an upper and a lower limit set at 3 standard deviations from the mean. We then used the z-score to do the same. Both methods are effective for finding outliers in data that is roughly normally distributed, as it is here. A boxplot can also be used to check for outliers visually. At the same time, we should handle outliers with care, as they can sometimes be very informative.

Please go through this Colab notebook for the complete code.


Rohit Dwivedi
I am currently enrolled in a Post Graduate Program in Artificial Intelligence and Machine Learning. I am a data science enthusiast who likes to draw insights from data, and I am always amazed by the intelligence of AI. It is fascinating to teach a machine to see and understand images, and the interest doubles when the machine can tell you what it just saw. That is why I am highly interested in Computer Vision and Natural Language Processing. I love exploring different use cases that can be built with the power of AI. I am the kind of person who first develops something and then explains it to the whole community through my writing.
