Outliers, also known as anomalous cases, frequently appear in real data. They can invalidate an analysis, but they can also carry important information. In either case, the ability to spot these abnormalities is crucial. Robust statistics, which first fits the bulk of the data and then highlights data points that depart from it, is a good technique for this goal. Since removing outliers is not appropriate in every scenario, this article focuses on when not to drop them. Following are the topics to be covered.
Table of contents
- A brief about outliers and their identification
- Situations related to safety
- When the majority is the outlier
- Data related to experiments
Outliers are generally considered the extreme values in a dataset. Let’s have a look at what exactly is meant by extreme values.
A brief about outliers and their identification
Outliers are those data points in a dataset which do not follow the pattern of the majority of the data. In simpler words, they are values that lie outside the range of the other values in the population. An outlier can represent a mistake, for example, incorrectly coded data or an experiment that was not run correctly. But sometimes it is a true representation of the population. In either case, the outliers first have to be identified, and only then can the necessary steps be taken to handle them.
To identify outliers, this article outlines general approaches that apply to both categorical and numerical data. If one needs to understand them in depth, read here.
Sorting the data
Quantitative variables can be sorted from low to high to look for extremely low or extremely high values. Any extreme values found should be marked. This is an easy way to decide whether particular data points deserve a closer look before applying more complex techniques.
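As a minimal sketch of this approach, an invented list of readings can be sorted and the extremes at each end inspected:

```python
# Hypothetical readings; 412 stands out once the list is sorted
values = [21, 19, 23, 412, 20, 22, 18]
ordered = sorted(values)

# Inspect the smallest and largest values for anything unusual
print("lowest:", ordered[:3])
print("highest:", ordered[-3:])
```

In practice this is often done by sorting a spreadsheet column or a DataFrame, but the idea is the same: scan both tails before reaching for formal tests.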
To quickly see the data distribution, software with a box plot (box-and-whisker plot) can be used to visualise the data. This style of graphic highlights the minimum and maximum values (the range), the median, and the interquartile range of the data.
Many software packages use an asterisk or a similar marker to indicate an outlier on the chart, and these points appear beyond the whiskers of the plot.
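As an illustration, matplotlib’s `boxplot` returns the points plotted beyond the whiskers under the `"fliers"` key; the data below is invented for the sketch:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]  # 95 is an obvious extreme value
fig, ax = plt.subplots()
result = ax.boxplot(data)

# Points beyond the whiskers (the plot's outlier markers)
fliers = result["fliers"][0].get_ydata()
print(list(fliers))
```

With the default whisker setting (1.5 × IQR), only the value 95 falls outside the whiskers here.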
Statistical outlier identification is the process of applying statistical tests or techniques to find extreme values.
The extreme data points can be transformed into z scores that indicate how many standard deviations they lie from the mean. A value can be categorised as an outlier if its z score is sufficiently high or low. Generally speaking, values with a z score greater than 3 or lower than -3 are regarded as outliers.
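A minimal sketch of this screening, using NumPy and an invented list of values (with only a handful of points the achievable z scores are mathematically bounded, so a threshold of 2 is used here instead of 3 purely for illustration):

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z score exceeds the threshold."""
    arr = np.asarray(values, dtype=float)
    z = (arr - arr.mean()) / arr.std()
    return arr[np.abs(z) > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]  # 95 is far from the rest
flagged = zscore_outliers(data, threshold=2.0)
print(flagged)
```

`zscore_outliers` is an illustrative helper, not a library function; on larger datasets the conventional threshold of 3 applies.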
The IQR method
The interquartile range (IQR) describes the spread of the middle half of the dataset. It can be used to draw “fences” around the data, and any values that fall outside those boundaries are referred to as outliers.
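The IQR rule can be sketched as follows; `iqr_bounds` is an illustrative helper, and the multiplier of 1.5 is the conventional Tukey fence:

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
low, high = iqr_bounds(data)
outliers = [x for x in data if x < low or x > high]
print(outliers)
```

For this data Q1 = 11 and Q3 = 12, so the fences sit at 9.5 and 13.5, and only 95 is flagged.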
As mentioned above, an outlier sometimes represents an error and sometimes represents a true value of the population. An erroneous outlier can be dropped or corrected accordingly. But in the second case, the outlier cannot simply be dropped, because doing so could introduce unusual distortions into the dataset. Let’s understand this situation with some real-life examples.
Situations related to safety
Consider a dataset from a car safety experiment, and suppose you, as an analyst at Global NCAP, find some outliers (unusual patterns) in the dataset. Specifically, the outliers are in data received from the tyre pressure monitoring, ABS and traction control systems.
In this scenario, before dropping the outliers, one needs to validate with the engineering team how much these outliers affect the total NCAP rating for the vehicle. If the effect is small, they could be dropped; otherwise, they should be mitigated. Most probably these readings come from features that are important for the safety of the car, in which case they should not be dropped but modified instead.
When the majority is the outlier
Consider a dataset where the majority of the data lies in the outlier regions, at the two extreme ends. This data cannot simply be dropped: if, say, 70% of the data is outlying, the data left over would not be sufficient for a sound analysis. The data could instead be modified, provided the modification does not introduce Type 1 or Type 2 errors, because altering such a large chunk of data would definitely affect the analysis.
Data related to experiments
When the data comes from experiments, an outlier is usually an error that can be corrected, but sometimes it is natural variation. Here is a statement from a zoologist describing a real-life situation.
“I exclude outliers only when the inaccuracy or mistake is evident. My most recent exclusion, which required a complete recalculation of the statistics, was a mistake in a captured mouse’s recorded body length. It measured 67 mm yet weighed more than 40 g. Having worked in the field for more than 30 years, I know that this species cannot possibly be this small (or this obese). As the inaccuracy could not be remedied, I removed both measurements from the data.”
The statement makes clear that this person has extensive experience in the domain and decided, on that basis, whether or not to drop the outlier.
Outliers are not always just the extreme data points in a given range; they are sometimes errors and sometimes genuine anomalies. Dealing with outliers requires domain knowledge of the data in question. With this article, we have understood when not to drop outliers from data.