Last updated July 20, 2022
When not to remove outliers from data?

Outliers are not always the extreme data points.

Published on July 20, 2022
by Sourabh Mehta

Outliers, also known as anomalous cases, frequently appear in real data. They might invalidate the results of an analysis, but they might also hold important information. In either case, being able to spot these anomalies is crucial. Robust statistics, which identifies outliers by first fitting the bulk of the data and then flagging the data points that depart from that fit, is a good technique for this purpose. Because removing outliers is not a good option in every scenario, this article focuses on when not to drop them. The following topics are covered.

A brief about outliers and their identification
Situations related to safety
When the majority is the outlier
Data related to experiments

Outliers are generally considered to be the extreme values in a dataset. Let’s look at what exactly is meant by extreme values.

A brief about outliers and their identification

Outliers are data points that do not follow the pattern of the majority of the data. In simpler words, an outlier is a value that lies outside the other values in the population. Outliers sometimes represent a mistake, for example incorrect coding of the data or an experiment that was not run correctly. But sometimes they are a true representation of the population. In either case, the outliers first have to be identified, and then the necessary steps can be taken to handle them.

To identify the outliers in a dataset, this article summarises general approaches for both categorical and numerical data.

Sorting the data

Quantitative variables can be sorted from low to high and scanned for extremely low or extremely high values. Any extreme values found should be flagged. This is an easy way to decide whether particular data points deserve a closer look before applying more complex techniques.
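The sorting check can be sketched in a few lines of Python. This is only an illustration: the helper name `extreme_candidates`, the choice of three values per end, and the sample numbers are all assumptions made for the example.

```python
def extreme_candidates(values, k=3):
    """Return the k lowest and k highest values after sorting."""
    ordered = sorted(values)
    return ordered[:k], ordered[-k:]

# Hypothetical measurements; 4 and 670 stand out once the data is sorted.
lengths_mm = [92, 88, 95, 670, 90, 87, 93, 91, 4, 89]

lowest, highest = extreme_candidates(lengths_mm)
print("lowest:", lowest)
print("highest:", highest)
```

The two lists are only candidates for a closer look; sorting by itself does not decide whether a value is an error.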

Visualizing

To see the data distribution quickly, the data can be visualised with a box plot (box-and-whisker plot). This type of chart shows the minimum and maximum values (the range), the median, and the interquartile range of the data.

Many software packages mark outliers with a symbol such as an asterisk, and these points lie beyond the whiskers of the box plot.

Statistical methods

Statistical outlier identification means applying statistical tests or techniques to find extreme values.

Data points can be converted to z scores, which indicate how many standard deviations each point lies from the mean. A value can be classified as an outlier if its z score is sufficiently high or low. Generally speaking, values with a z score greater than 3 or lower than -3 are regarded as outliers.
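A minimal sketch of the z score rule, using only Python's standard library, might look like the following. The function names and the sample data are hypothetical, and the population standard deviation (`pstdev`) is an assumption; the article does not say which variant to use.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert each value to its distance from the mean in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # All values are identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def z_score_outliers(values, threshold=3.0):
    """Return the values whose z score is above threshold or below -threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical sample: twenty readings, one of them far from the rest.
readings = [48, 49, 50, 51, 52] * 3 + [50, 50, 50, 50, 120]
print(z_score_outliers(readings))
```

One caveat with this rule: when the population standard deviation is used, a sample of ten or fewer points can never produce a z score beyond 3, so the threshold is only meaningful for larger samples.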

The IQR method

The interquartile range (IQR) is the range of the middle half of the dataset. It can be used to draw “limits” around the data, and any values that fall outside those limits are treated as outliers.

As mentioned above, outliers sometimes represent an error and sometimes represent true values of the population. An error can either be dropped or corrected. In the second case, however, the outliers should not be dropped, because removing them would distort the dataset. Let’s understand this situation with some real-life examples.
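As a rough sketch, the limits can be computed with `statistics.quantiles` from the standard library. The multiplier of 1.5 is a widely used convention rather than something stated in this article, and the function names, the `"inclusive"` quantile method, and the sample data are assumptions for illustration.

```python
from statistics import quantiles

def iqr_limits(values, k=1.5):
    """Return (lower, upper) limits placed k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR-based limits."""
    low, high = iqr_limits(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical sample with a single value far above the rest.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(iqr_limits(scores))
print(iqr_outliers(scores))
```

Different quartile definitions give slightly different limits, so results near the boundary may vary between tools.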

Situations related to safety

Consider a dataset from a car safety experiment in which you, as an analyst at Global NCAP, find some outliers (unusual patterns). Specifically, the outliers are in data received from the tire pressure monitoring system, ABS and traction control system.

In this scenario, before dropping the outliers, one needs to check with the engineering team how much these outliers affect the overall NCAP rating of a particular vehicle. If they do not affect it much, they could be dropped; otherwise, try to mitigate them. Most probably, these are important features for the safety of the car, and hence the outliers should not simply be dropped; at most, they could be modified.

When the majority is the outlier

Consider a dataset in which the majority of the data lies in the outlier regions at the two extreme ends. Such data cannot be dropped: if, say, 70% of the data is flagged as outliers, the remaining data would not be sufficient for a sound understanding. The data could be modified only if doing so does not introduce Type 1 or Type 2 errors, because modifying a big chunk of the data would definitely affect the analysis.
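One way to notice this situation early is to measure what share of the data falls outside the expected range before deciding anything. The sketch below is illustrative only: the function name, the expected range of 40 to 60, the 50% cut-off, and the sample values are all hypothetical choices, not part of any standard procedure.

```python
def share_outside(values, low, high):
    """Return the fraction of values that lie outside the range [low, high]."""
    if not values:
        raise ValueError("values must not be empty")
    outside = sum(1 for v in values if v < low or v > high)
    return outside / len(values)

# Hypothetical readings clustered at the two extreme ends.
readings = [2, 3, 50, 97, 98, 1, 99, 4, 96, 51]
share = share_outside(readings, low=40, high=60)

# Assumed cut-off: if more than half of the data is "outlying",
# treat it as the real pattern and do not drop it.
keep_all = share > 0.5
print(share, keep_all)
```

When the flagged share is this large, the more useful question is whether the expected range itself was wrong.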

Data related to experiments

When the data comes from experiments, an outlier is mostly an error that can be corrected, but sometimes it is natural variation. Here is a statement from a zoologist who shares a real-life situation.

“If the inaccuracy or mistake is evident, I exclude outliers. My most recent exclusion, which required a complete recalculation of the statistics, was a mistake in a captured mouse’s recorded body length. It measured 67 mm and weighed more than 40 g. Having worked in the field for more than 30 years, I am aware that this species cannot possibly be this small (or this obese). As the inaccuracy could not be remedied, I removed both measurements from the data.”

The statement makes clear that the person has good experience in the domain and relied on it to decide whether to drop the outlier.

Conclusion

Outliers are not always the extreme data points in a given range. Sometimes they are errors, and sometimes they are genuine anomalies. Dealing with outliers requires domain knowledge of the data. This article has covered when not to drop outliers from data.

Sourabh Mehta

Sourabh has worked as a full-time data scientist for an ISP organisation, experienced in analysing patterns and their implementation in product development. He has a keen interest in developing solutions for real-time problems with the help of data both in this universe and metaverse.
