When not to remove outliers from data?

Outliers are not always the extreme data points.
Outliers, also known as anomalous cases, frequently appear in real data. They might invalidate the results of an analysis, but they might also hold important information. In either scenario, the capacity to spot these abnormalities is crucial. Robust statistics, which seeks to identify outliers by first fitting the bulk of the data and then highlighting the data points that depart from it, is a good technique for this goal. As removing outliers is not a good option in every scenario, this article focuses on when not to drop them. The following topics are covered.

Table of contents

  1. A brief about outliers and their identification
  2. Situations related to safety
  3. When the majority is the outlier
  4. Data related to experiments

Outliers are generally considered to be the extreme values in a dataset. Let’s have a look at what exactly is meant by extreme values.

A brief about outliers and their identification

Outliers are data points that do not follow the pattern of the majority of the data. In simpler words, an outlier is a value that lies outside the other values in the population. Outliers sometimes represent a mistake, for example incorrect coding of the data or an experiment that was not run correctly. But sometimes an outlier is a true representation of the population. In either case, the outliers first have to be identified, and only then can the necessary steps be taken to handle them.

This article describes general ways of identifying outliers in a dataset, for both categorical and numerical data. If one needs to understand them in depth, read here.

Sorting the data

The quantitative variables can be sorted from low to high and scanned for extremely low or extremely high values. Any extreme values found should be marked. This is an easy way to decide which data points deserve a closer look before using more complex techniques.
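The sorting approach can be sketched in a few lines of Python. This is only a minimal illustration: the `extremes` helper, the choice of three values per end, and the sample readings are all assumptions made for the example, not part of any particular dataset.

```python
def extremes(values, k=3):
    """Sort low to high and return the k smallest and k largest values."""
    ordered = sorted(values)
    return ordered[:k], ordered[-k:]

# Hypothetical measurements, made up purely for illustration.
readings = [52, 49, 51, 50, 48, 53, 47, 250, 51, 2]

lowest, highest = extremes(readings)
print("lowest values: ", lowest)   # candidates to inspect at the low end
print("highest values:", highest)  # candidates to inspect at the high end
```

The values printed at each end are only candidates to inspect by eye; sorting alone does not decide whether a value is an error.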

Visualizing

To quickly see the data distribution, the data can be visualised with a box plot (box-and-whisker plot). This style of graphic highlights the minimum and maximum values (the range), the median, and the interquartile range of the data.

Many software packages mark outliers on the chart with an asterisk or a dot, and these points lie beyond the whiskers of the plot.
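The numbers that a box plot draws can also be computed directly. The sketch below uses Python's standard `statistics` module to produce the five-number summary; the `five_number_summary` helper and the sample readings are hypothetical, and the `"inclusive"` quartile method is one of several conventions, so plotting tools may report slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return the quantities a box plot highlights, plus the IQR."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,
    }

# Hypothetical measurements, made up purely for illustration.
readings = [52, 49, 51, 50, 48, 53, 47, 250, 51, 2]

summary = five_number_summary(readings)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

A large gap between the minimum or maximum and the nearest quartile is the same signal that a point plotted beyond the whiskers gives visually.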

Statistical methods

Applying statistical tests or techniques to find extreme values is the process of statistical outlier identification.

Data points can be transformed into z scores, which indicate how many standard deviations each value lies from the mean. A value can be categorised as an outlier if its z score is sufficiently high or low. Generally speaking, values with a z score greater than 3 or less than -3 are regarded as outliers.
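As a rough sketch of the z score rule, the following Python function flags values whose z score is beyond the threshold of 3 mentioned above. The function name and the sample data are hypothetical, and using the population standard deviation is an assumption of this example; the sample standard deviation is an equally common choice.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute z score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (assumed)
    if stdev == 0:
        return []  # all values identical, so nothing can be flagged
    return [v for v in values if abs((v - mean) / stdev) > threshold]

# Hypothetical data: twenty ordinary readings and one extreme value.
data = [48, 49, 50, 51, 52] * 4 + [250]

print(z_score_outliers(data))
```

Note that in very small samples a single point cannot reach a z score of 3, because the extreme value itself inflates the standard deviation, so this rule works best with a reasonable number of observations.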

The IQR method

The interquartile range (IQR) is the spread of the middle half of the data, that is, the distance between the first and third quartiles. It can be used to draw “limits” around the data, and any values that fall outside those limits are treated as outliers.
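A minimal sketch of the IQR method follows. It assumes the widely used convention of placing the limits 1.5 times the IQR below the first quartile and above the third quartile; that multiplier, the helper names, and the sample readings are assumptions of this example rather than something fixed by the article.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR limits."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements, made up purely for illustration.
readings = [52, 49, 51, 50, 48, 53, 47, 250, 51, 2]

print("limits:  ", iqr_fences(readings))
print("outliers:", iqr_outliers(readings))
```

Because quartiles are barely affected by a few extreme values, this rule tends to be more stable than the z score rule on small or skewed datasets.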

As mentioned above, outliers sometimes represent an error and sometimes represent a true value of the population. An error can either be dropped or corrected. In the second case, however, the outlier should not be dropped, because removing a genuine value distorts the dataset. Let’s understand this situation with some real-life examples.

Situations related to safety

Consider a dataset from a car safety experiment in which you, as an analyst at Global NCAP, find some outliers (unusual patterns). Specifically, the outliers are in the data received from the tire pressure monitoring system, ABS and traction control system.

In this scenario, before dropping the outliers, one needs to validate with the engineering team how much these outliers matter to the total NCAP rating of the vehicle. If they do not affect it much, they could be dropped; otherwise, try to mitigate them. Most probably these are important features for the safety of the car, and hence the outliers should not be dropped without this validation.
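One way to act on this in code is to flag suspicious records for review instead of deleting them. The sketch below is only an illustration under stated assumptions: the record layout, the `tyre_pressure_kpa` field, the readings, and the use of the IQR limits with a 1.5 multiplier are all hypothetical and do not describe real NCAP data.

```python
import statistics

def flag_for_review(records, field, k=1.5):
    """Return copies of the records with a 'needs_review' flag; nothing is dropped."""
    values = [r[field] for r in records]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [
        {**r, "needs_review": not (lower <= r[field] <= upper)}
        for r in records
    ]

# Hypothetical tire pressure readings (kPa), made up purely for illustration.
pressures = [220, 225, 218, 222, 224, 221, 223, 150, 219, 226]
records = [{"test_id": i, "tyre_pressure_kpa": p} for i, p in enumerate(pressures)]

flagged = flag_for_review(records, "tyre_pressure_kpa")
for row in flagged:
    if row["needs_review"]:
        print("send to engineering team:", row)
```

Every record survives this step, so the decision to drop, mitigate or keep a flagged reading stays with the people who understand its effect on the rating.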

When the majority is the outlier

Consider a dataset in which the majority of the data lies in the outlier regions at the two extreme ends. Such data should not be dropped: if, say, 70% of the data is flagged as outliers, the data left over would not be sufficient for a sound understanding. The data could be modified only if doing so does not introduce Type 1 or Type 2 errors, because modifying such a big chunk of data would definitely affect the analysis.
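A simple guard against this situation is to measure what share of the data a rule would flag before removing anything. In the sketch below, the function names, the limits of 40 and 60, the sample values, and the 5% cut-off are all hypothetical choices for illustration; a suitable cut-off depends on the domain.

```python
def outlier_share(values, lower, upper):
    """Return the fraction of values that fall outside [lower, upper]."""
    if not values:
        raise ValueError("values must not be empty")
    flagged = [v for v in values if v < lower or v > upper]
    return len(flagged) / len(values)

def safe_to_drop(values, lower, upper, max_share=0.05):
    """True only if the flagged share is at most max_share (an assumed cut-off)."""
    return outlier_share(values, lower, upper) <= max_share

# Hypothetical data clustered at the two extreme ends.
values = [1, 2, 1, 3, 2, 50, 98, 99, 97, 100]

share = outlier_share(values, lower=40, upper=60)
print(f"flagged share: {share:.0%}")
print("safe to drop: ", safe_to_drop(values, lower=40, upper=60))
```

When the flagged share is this large, the more likely explanation is that the limits or the assumed distribution do not fit the data, not that most of the data is wrong.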

Data related to experiments

When the data is related to experiments, an outlier is mostly an error that can be corrected, but sometimes it is a natural variation. Here is a statement from a zoologist who shares a real-life situation.

“If the inaccuracy or mistake is evident, I exclude outliers. My most recent exclusion, which required a complete recalculation of the statistics, was a mistake in a captured mouse’s recorded body length. It measured 67 mm and weighed more than 40 g. Having worked in the field for more than 30 years, I am aware that this species cannot possibly be this little (or obese). As the inaccuracy could not be remedied, I removed both measurements from the data.”

The statement makes it clear that the person has long experience in the domain and, on the basis of that experience, decides whether or not to drop an outlier.

Conclusion

Outliers are not always the extreme data points in the given range. Sometimes an outlier is an error, and sometimes it is a genuine anomaly. To deal with outliers, one needs domain knowledge of the data involved. With this article, we have understood when not to drop outliers from data.

Sourabh Mehta
Sourabh has worked as a full-time data scientist for an ISP organisation, experienced in analysing patterns and their implementation in product development. He has a keen interest in developing solutions for real-time problems with the help of data both in this universe and metaverse.
