To start, we need to know what an outlier is. An outlier is a data point whose value differs greatly from the rest of the data. For example, consider a class of 20 students whose average height is 175 cm. Nearly everyone is between 170 cm and 195 cm tall, but one student is 215 cm. That student's height differs greatly from the rest of the group, so the data point is called an outlier. Another common term for an outlier is “anomaly”. Outliers can lie at either end: a student with a height of 125 cm would be considered an outlier too.
Impact of outliers
Now, let us understand why it is important to identify outliers when it comes to machine learning.
Machine learning models are highly sensitive to missing or incorrect values in their training data, which can severely limit a model's accuracy. For that reason, we need a solid understanding of the data we have. Understanding outliers is an important part of this, because outliers can severely affect the accuracy of the model we want to train.
Here, 300 is clearly an outlier. The example also shows how an outlier affects the summary statistics of the data: it strongly shifts the mean, and therefore the standard deviation and the variance as well.
As a quick reminder: the mean is the average of all the values, the median is the middle value of the sorted data, and the mode is the most frequently occurring value. The standard deviation measures how much the values vary around the mean.
Spotting an outlier is easy when there are only a few values, but what do we do when there are tens of thousands of data points? In such cases, we can identify outliers in the following ways:
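To make the effect on these statistics concrete, here is a minimal sketch using Python's built-in `statistics` module. The sample values are hypothetical and chosen only for illustration; the single value 300 echoes the outlier mentioned above.

```python
import statistics

# Hypothetical sample; only the value 300 echoes the example in the text.
values = [10, 12, 12, 13, 14, 15, 16]
with_outlier = values + [300]

def summarize(data):
    """Return mean, median, mode and sample standard deviation."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "stdev": statistics.stdev(data),
    }

clean_stats = summarize(values)
outlier_stats = summarize(with_outlier)

# Print each statistic without and with the outlier, side by side.
for name in clean_stats:
    print(f"{name:>6}: {clean_stats[name]:8.2f} -> {outlier_stats[name]:8.2f}")
```

In this sample the mean jumps from about 13.1 to 49, while the median only moves from 13 to 13.5 and the mode stays at 12. This is why the mean and standard deviation are the statistics to watch when outliers are present.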
How to identify Outliers and remove them?
We cannot always remove outliers just by looking at them. The right method for removing outliers depends on the type of data set you are working with.
Still, a few important methods are worth knowing.
You can try them out yourself too. I will demonstrate them on the New York City Airbnb data set.
The methods we will look at are:
- Using Scatter/Box Plots
- Using Standard Deviation
- Z-score
NOTE: You can also find the link to Google Colab at the end of this article.
# dependencies
import pandas as pd  # pip install pandas
import numpy as np   # pip install numpy

df = pd.read_csv('/content/AB_NYC_2019.csv')  # reading the csv file
df.sample(10)  # returns 10 random rows from the data set
df.shape # in the format of (row,columns)
# here we will be studying the price column from the data set
- Removing Outliers using Percentile
# setting the limits, i.e. our criteria for calling an item an outlier
upper_bound = df['price'].quantile(0.9995)  # value at the 99.95th percentile
print('Upper bound:', upper_bound)
lower_bound = df['price'].quantile(0.0005)  # value at the 0.05th percentile
print('Lower bound:', lower_bound)

max_price = max(df['price'])
print(max_price)
min_price = min(df['price'])
print(min_price)

# rows above the upper bound and rows below the lower bound
df[df['price'] > upper_bound]
print(len(df[df['price'] > upper_bound]))
df[df['price'] < lower_bound]
print(len(df[df['price'] < lower_bound]))

# removing outliers with the help of percentiles
df_percentile = df[(df['price'] < upper_bound) & (df['price'] > lower_bound)]
print(df_percentile)
After removing the outliers, we have 48841 entries, down from 48895.
- Identifying and removing outliers with a scatter plot
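The percentile approach can also be wrapped in a small reusable function. The sketch below is only an illustration: `trim_by_percentile` is a hypothetical helper name, and the prices are made-up values rather than rows from the Airbnb file.

```python
import pandas as pd

def trim_by_percentile(frame, column, lower_q, upper_q):
    """Keep rows whose `column` value lies strictly between two quantiles."""
    lower_bound = frame[column].quantile(lower_q)
    upper_bound = frame[column].quantile(upper_q)
    mask = (frame[column] > lower_bound) & (frame[column] < upper_bound)
    return frame[mask], lower_bound, upper_bound

# Hypothetical prices, not taken from the Airbnb file.
prices = pd.DataFrame({"price": [0, 40, 55, 60, 75, 80, 95, 120, 150, 10000]})

# Wider tails (5% on each side) than the 0.05% used above,
# because this sample is tiny.
trimmed, lower_bound, upper_bound = trim_by_percentile(prices, "price", 0.05, 0.95)

print("bounds:", lower_bound, upper_bound)
print("kept", len(trimmed), "of", len(prices), "rows")
```

The quantile levels are the knob to tune: the larger the data set, the closer to 0 and 1 they can sit without discarding ordinary values.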
import matplotlib.pyplot as plt

plt.scatter(df.index, df['price'], color='red')
plt.title('Price of accommodation')
plt.xlabel('indices')
plt.ylabel('Price')
plt.show()
# points above the upper bound
x_upper = list(df[df['price'] > upper_bound].index)
y_upper = df[df['price'] > upper_bound]
# print(x_upper)
# print(y_upper['price'])

# points near the lower end; note that the +2500 offset widens this group well beyond lower_bound
x_lower = list(df[df['price'] < lower_bound + 2500].index)
y_lower = df[df['price'] < lower_bound + 2500]
# print(x_lower)
# print(y_lower['price'])

# points strictly between the two bounds
x_inlier = list(df[(df['price'] < upper_bound) & (df['price'] > lower_bound)].index)
y_inlier = df[(df['price'] < upper_bound) & (df['price'] > lower_bound)]
print(x_inlier)
print(y_inlier)
plt.scatter(x_upper, y_upper['price'], color='black', marker='d', label='Above upper bound')
plt.scatter(x_lower, y_lower['price'], color='red', label='Below lower bound')
plt.scatter(x_inlier, y_inlier['price'], color='green', label='Inlier')
plt.title('Price of accommodation')
plt.xlabel('indices')
plt.ylabel('Price')
plt.legend()
plt.show()
If you look carefully at the bottom of the graph, you will see some red data points; these are the ones below the lower bound. They are hard to distinguish because the many green data points get in the way. Moreover, the gap between the lower bound and the normal range is too small to be clearly visible.
- Here we used the same bounds as before to keep the results consistent.
- Also, remember not to use a scatter plot when there are a lot of data points, especially when those data points lack variance.
We will look at the remaining two methods on a different data set, because not every data set is suitable for demonstrating every method.
Link to dataset: https://www.kaggle.com/mustafaali96/weight-height
3. Standard Deviation
Here, n is the number of samples, x is a value, and x̄ is the mean of all values.
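For reference, the sample standard deviation in terms of these symbols can be written as below. This is the form that pandas' `std()` computes by default (`ddof=1`); the population version divides by n instead of n - 1, and which of the two was originally intended here is an assumption.

```latex
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```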
from scipy.stats import norm

df = pd.read_csv('weight-height.csv')
df.head()

plt.hist(df.Height, bins=10, rwidth=0.8, density=True)
rng = np.arange(df.Height.min(), df.Height.max(), 0.1)
plt.plot(rng, norm.pdf(rng, df.Height.mean(), df.Height.std()))
The standard deviation method is helpful when the data follow a normal distribution (a bell-shaped curve), as can easily be observed here.
(10000×3 is the shape of the data frame)
# setting limits for outliers
# constraining by using the 3 standard deviation technique
# one may take 2 or even 4-5 standard deviations instead; it depends on the type of data being used
upper_limit = df.Height.mean() + 3 * df.Height.std()
print(upper_limit)
lower_limit = df.Height.mean() - 3 * df.Height.std()
print(lower_limit)
Removing the outliers
Observe the shape of the data frame now: (9993×3).
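A minimal sketch of the removal step is shown below. It runs on synthetic heights (assumed values, not the Kaggle file) with two planted extremes, so the resulting shapes will differ from the ones quoted above.

```python
import numpy as np
import pandas as pd

# Synthetic heights (assumed values, not the Kaggle data),
# plus two planted extremes so the filter has something to remove.
rng = np.random.default_rng(0)
heights = np.append(rng.normal(loc=66.0, scale=4.0, size=1000), [30.0, 100.0])
df = pd.DataFrame({"Height": heights})

mean, std = df["Height"].mean(), df["Height"].std()
lower_limit = mean - 3 * std
upper_limit = mean + 3 * std

# Keep only rows within three standard deviations of the mean.
df_no_outliers = df[(df["Height"] > lower_limit) & (df["Height"] < upper_limit)]

print(df.shape, "->", df_no_outliers.shape)
```

Note that the outliers themselves inflate the standard deviation used to set the limits, so on heavily contaminated data a 3 standard deviation rule can be more forgiving than expected.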
df['zscore'] = ( df.Height - df.Height.mean() ) / df.Height.std()
Get data points that have z score higher than 3 or lower than -3.
Another way of saying the same thing is to get data points that are more than 3 standard deviations away.
df[(df.zscore<-3) | (df.zscore>3)]
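The same idea can be taken one step further to actually drop the flagged rows. The sketch below again uses synthetic heights with two planted extremes (assumed values, not the Kaggle data) and uses `abs()` to test both tails in a single comparison.

```python
import numpy as np
import pandas as pd

# Synthetic heights (assumed values) with two planted extremes.
rng = np.random.default_rng(1)
heights = np.append(rng.normal(66.0, 4.0, size=1000), [30.0, 100.0])
df = pd.DataFrame({"Height": heights})

# z-score: how many standard deviations a value lies from the mean.
df["zscore"] = (df["Height"] - df["Height"].mean()) / df["Height"].std()

# |z| > 3 covers both z > 3 and z < -3.
outliers = df[df["zscore"].abs() > 3]
df_clean = df[df["zscore"].abs() <= 3]

print(len(outliers), "outliers;", len(df_clean), "rows kept")
```

Because the z-score is just the distance from the mean measured in standard deviations, this filter selects the same rows as the 3 standard deviation limits; the extra column simply makes the threshold easier to read and adjust.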
This is how you can identify outliers and then, depending on the data and on how removal would affect it, remove them.
This article was written to help beginners understand the concept of outliers, how to identify them, and how to remove them. There are many more methods, such as IQR, but these are sufficient for a beginner. I hope this adds value to your work.
The complete code for the above implementation is available at the AIM’s GitHub repository. Please visit this link for this complete code.