# Outlier Detection Using z-Score – A Complete Guide With Python Codes

In this article, we will discuss how to detect outliers in a dataset and remove them using different methods.

In machine learning, we often come across datasets that contain outliers. These are extreme values, or values that do not follow the overall pattern of the data. Values that diverge sharply from all other observations are termed outliers.

These outliers can arise from different factors, such as human error while preparing the data, or from intentionally inserting extreme values to test a model. Are they beneficial when building predictive models? Sometimes we have to drop these outliers, and sometimes we retain them because they carry interesting meaning.

We will use the weight-height dataset that is publicly available on Kaggle. It contains weight and height values; we will search for outliers in the weight column. In particular, we will cover:

• What are Outliers? How to find them?
• What are Z-score and Standard deviation?
• How to remove Outliers using Z-score and Standard deviation?

An outlier is simply one of the most extreme values in a dataset, a value that is very unusual relative to the rest, as explained earlier. Let us find the outliers in the weight column. We will first import the libraries and the data with the code below.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("weight.csv")
df.Weight
```

Now we will plot a histogram and check the distribution of this column with the code below.

```python
# rwidth (not width) controls the relative bar width in plt.hist
plt.hist(df.Weight, bins=20, rwidth=0.8)
plt.xlabel('Weight')
plt.ylabel('Count')
plt.show()
```

From the graph above, we can see that the data is centred on the mean and follows a roughly normal distribution: the counts fall off as we move away from the mean in either direction. Let us look at the descriptive statistics of this column, such as the mean, standard deviation, and minimum and maximum values, with the code below.

`df.Weight.describe()`

The mean of the weight column is 161.44 and the standard deviation is 32.108. The minimum and maximum values in the column are 64 and 269 respectively. Now we will treat everything lying more than 3 standard deviations from the mean as an outlier: we compute an upper limit and a lower limit at mean ± 3 standard deviations, and every data point beyond these limits is an outlier. Use the code below.

```python
upper = df.Weight.mean() + 3 * df.Weight.std()
lower = df.Weight.mean() - 3 * df.Weight.std()
print(upper)
print(lower)
```

Now let us see which data points fall beyond these limits. With the statistics above, the limits come out to approximately 257.76 and 65.12.
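A minimal sketch of this filter; in the article, `df`, `upper`, and `lower` come from the steps above, so here a seeded synthetic column with one planted extreme value stands in to make the snippet run on its own:

```python
import numpy as np
import pandas as pd

# Stand-in for the article's weight.csv column: a seeded normal sample
# with one planted extreme value, so this snippet runs on its own.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Weight": np.append(rng.normal(161, 32, 999), 500.0)})

upper = df.Weight.mean() + 3 * df.Weight.std()
lower = df.Weight.mean() - 3 * df.Weight.std()

# Rows outside the 3-standard-deviation band are the outliers.
print(df[(df.Weight > upper) | (df.Weight < lower)])
```

On the real weight data, this filter returns the handful of rows whose weights fall outside the mean ± 3 standard deviation band.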

These data points, two rows in total, are treated as outliers. To remove them, we simply keep the data points that fall within the limits. Use the code below.

```python
new_df = df[(df.Weight < upper) & (df.Weight > lower)]
new_df.head()
new_df.shape  # shape is an attribute, not a method
```

The original data had 10,000 rows; the new data frame has 9,998, with the 2 rows treated as outliers removed. Now we will do the same thing using a z-score, which tells us how many standard deviations a data point lies from the mean. It is calculated by subtracting the mean from the data point and dividing by the standard deviation: z = (x − mean) / std. Let us see how this is done in practice.

```python
df['zscore'] = (df.Weight - df.Weight.mean()) / df.Weight.std()
df.head(5)
```

We can see that a z-score is computed for each row. Now we will select only those rows with a z-score greater than 3 or less than -3, using the code below.

```python
df[df['zscore'] > 3]
df[df['zscore'] < -3]
```

We have found the same outliers as with the standard-deviation method. We can remove them in the same way as before, keeping only the data points that fall within 3 standard deviations.

```python
df_new = df[(df.zscore > -3) & (df.zscore < 3)]
```


## Conclusion

To conclude: outliers are important, and one needs to be careful when deciding whether to remove or retain them. In this article, we discussed two methods for detecting and removing outliers. We first detected them using upper and lower limits set at 3 standard deviations from the mean, and then used the z-score method to do the same; both methods are effective. A boxplot visualization can also be used to check for outliers. At the same time, we should handle outliers carefully, as they can sometimes be very informative.
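As a quick sketch of the boxplot idea (again using a seeded synthetic column in place of the article's weight data): note that Matplotlib's boxplot flags points beyond 1.5 × IQR from the quartiles by default, a different rule than 3 standard deviations, so it may not flag exactly the same points.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Seeded synthetic stand-in for the article's Weight column, with one
# planted extreme value, so this sketch runs on its own.
rng = np.random.default_rng(0)
weights = pd.Series(np.append(rng.normal(161, 32, 999), 500.0), name="Weight")

# Points drawn beyond the whiskers (by default 1.5 * IQR past the
# quartiles) are the candidate outliers.
plt.boxplot(weights, vert=False)
plt.xlabel("Weight")
plt.show()
```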

Please go through the Colab notebook for the complete code.

I am currently enrolled in a Post Graduate Program in Artificial Intelligence and Machine Learning. I am a data science enthusiast who likes to draw insights from data, and I am always amazed by the intelligence of AI. It is fascinating to teach a machine to see and understand images, and the interest doubles when the machine can tell you what it just saw; this is why I am highly interested in Computer Vision and Natural Language Processing. I love exploring different use cases that can be built with the power of AI. I am the kind of person who first develops something and then explains it to the whole community through my writing.
