In machine learning, we often come across data sets that contain outliers. Outliers are extreme values: observations that diverge from the rest of the data and do not follow its overall pattern.
Outliers can arise for many reasons, such as human error while preparing the data or values inserted intentionally to test a model. Are they useful when building predictive models? It depends: sometimes we have to drop them, and sometimes we retain them because they carry meaningful information.
In this article, we will discuss how to detect outliers in a data set and remove them in two different ways. We will use a weight-height data set that is publicly available on Kaggle. The data set contains weight and height values, and we will search for outliers in the weight column.
What will you learn from this article?
- What are Outliers? How to find them?
- What are Z-score and Standard deviation?
- How to remove Outliers using Z-score and Standard deviation?
An outlier is an extreme value in the data set, one that is very unusual compared with the rest of the data, as explained earlier. Let us find the outliers in the weight column of the data set. We first import the libraries and load the data, using the code below.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("weight.csv")
Next, we plot a histogram to check the distribution of this column, using the code below.
plt.hist(df.Weight, bins=20, rwidth=0.8)
From the above graph, we can see that the data is centred on the mean and follows an approximately normal distribution. Values to the left of the mean are smaller and values to the right are larger, with fewer observations the further we move from the mean. Let us look at the descriptive statistics of this column, such as the mean, standard deviation, minimum, and maximum values.
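The descriptive statistics can be obtained with pandas' `describe()` method. The sketch below is only illustrative: `weight.csv` is not reproduced here, so it builds a small made-up stand-in (50 typical weights plus two extremes) rather than the real Kaggle data, and its numbers will differ from the ones reported in this article.

```python
import pandas as pd

# Made-up stand-in for weight.csv (an assumption, not the Kaggle data):
# 50 typical weights plus two extreme values.
weights = [150, 170] * 25 + [64, 269]
df = pd.DataFrame({"Weight": weights})

# describe() reports count, mean, std, min, quartiles and max in one call.
stats = df.Weight.describe()
print(stats)
```

Individual figures can be read from the result by label, for example `stats["mean"]` or `stats["std"]`.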
The mean of the weight column is 161.44 and the standard deviation is 32.108. The minimum and maximum values in the column are 64 and 269 respectively. We will now apply the 3 standard deviations rule: we compute an upper limit and a lower limit at 3 standard deviations above and below the mean, and treat every data point that lies beyond either limit as an outlier. Use the code below to compute the limits.
upper = df.Weight.mean() + 3 * df.Weight.std()
lower = df.Weight.mean() - 3 * df.Weight.std()
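As a quick sanity check, the same arithmetic can be done by hand with the figures reported above (mean 161.44, standard deviation 32.108, minimum 64, maximum 269). The variable names in this sketch are illustrative only.

```python
# Figures reported in this article for the Weight column.
mean, std = 161.44, 32.108
minimum, maximum = 64, 269

upper = mean + 3 * std  # about 257.76
lower = mean - 3 * std  # about 65.12

print(f"upper limit: {upper:.2f}, lower limit: {lower:.2f}")

# Both reported extremes lie outside the limits, so each is an outlier.
print(maximum > upper, minimum < lower)
```

With these figures, the maximum (269) lies above the upper limit and the minimum (64) lies below the lower limit.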
Now let us see which data points fall beyond these limits.
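One way to list those points is a boolean filter that keeps rows above the upper limit or below the lower limit. This is a minimal sketch on a made-up stand-in for `weight.csv` (an assumption, since the real file is not reproduced here), so the rows it prints are not the ones from the Kaggle data.

```python
import pandas as pd

# Made-up stand-in for weight.csv: 50 typical weights plus two extremes.
weights = [150, 170] * 25 + [64, 269]
df = pd.DataFrame({"Weight": weights})

upper = df.Weight.mean() + 3 * df.Weight.std()
lower = df.Weight.mean() - 3 * df.Weight.std()

# Rows lying beyond either limit are the candidate outliers.
outliers = df[(df.Weight > upper) | (df.Weight < lower)]
print(outliers)
```

Note the `|` (or) here: a point is an outlier if it is beyond either limit, whereas keeping the inliers uses `&` (and).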
Two data points fall beyond these limits, and we treat them as outliers. To remove them, we keep only the data points that fall within the limits. Use the code below to do this.
new_df = df[(df.Weight < upper) & (df.Weight > lower)]
The original data had 10,000 rows and the new data frame has 9,998: the 2 rows that were treated as outliers have been removed. Now we will do the same thing using the z-score, which tells us how many standard deviations a data point lies from the mean. It is calculated by subtracting the mean from the data point and dividing the result by the standard deviation. Let us see how this is done in practice.
df['zscore'] = ( df.Weight - df.Weight.mean() ) / df.Weight.std()
A z-score is now computed for each row. Next, we check only those rows whose z-score is greater than 3 or less than -3.
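The check for z-scores beyond 3 in either direction might look like the sketch below. It again uses a made-up stand-in for `weight.csv` (an assumption), and `z_outliers` is an illustrative name.

```python
import pandas as pd

# Made-up stand-in for weight.csv: 50 typical weights plus two extremes.
weights = [150, 170] * 25 + [64, 269]
df = pd.DataFrame({"Weight": weights})

# Series.std() uses the sample standard deviation (ddof=1) by default.
df["zscore"] = (df.Weight - df.Weight.mean()) / df.Weight.std()

# abs() > 3 covers both tails at once: z > 3 and z < -3.
z_outliers = df[df.zscore.abs() > 3]
print(z_outliers)
```

Writing the condition as `(df.zscore > 3) | (df.zscore < -3)` is equivalent; the `abs()` form is just shorter.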
We have found the same outliers that we found earlier with the standard deviation method. We can remove them in the same way as before, keeping only the data points that lie within 3 standard deviations of the mean.
df_new = df[(df.zscore>-3) & (df.zscore<3)]
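Because a z-score is just the distance from the mean measured in standard deviations, the two filters should keep exactly the same rows. The sketch below checks this on a made-up stand-in for `weight.csv`; `remove_outliers_zscore` is a hypothetical helper written for this illustration, not part of pandas.

```python
import pandas as pd


def remove_outliers_zscore(frame, column, threshold=3.0):
    """Hypothetical helper: keep rows whose |z-score| is below the threshold."""
    z = (frame[column] - frame[column].mean()) / frame[column].std()
    return frame[z.abs() < threshold]


# Made-up stand-in for weight.csv: 50 typical weights plus two extremes.
weights = [150, 170] * 25 + [64, 269]
df = pd.DataFrame({"Weight": weights})

# Limit-based filter, as in the standard deviation method.
upper = df.Weight.mean() + 3 * df.Weight.std()
lower = df.Weight.mean() - 3 * df.Weight.std()
by_limits = df[(df.Weight < upper) & (df.Weight > lower)]

# Z-score filter through the helper.
by_zscore = remove_outliers_zscore(df, "Weight")

print(len(df), len(by_limits), len(by_zscore))
print(by_limits.index.equals(by_zscore.index))
```

Making the threshold a parameter is a design choice for the sketch: 3 is the value used in this article, but other cut-offs can be tried without rewriting the filter.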
To conclude, outliers are important, and deciding whether to remove or retain them calls for care. In this article, we discussed two methods for detecting and removing outliers. We first detected them using upper and lower limits set at 3 standard deviations from the mean, and then used z-scores to do the same. Both methods are effective at finding outliers. A box plot can also be used to check for them visually. Whichever method we use, we should handle outliers carefully, as they can sometimes be very informative.
Please refer to this Colab notebook for the complete code.