# When not to remove outliers from data?

Outliers are not always the extreme data points.

Outliers, also known as anomalous cases, frequently appear in real data. They can invalidate an analysis, but they can also hold important information. Either way, being able to spot them is crucial. Robust statistics, which fits the bulk of the data first and then flags the points that depart from that fit, is a good technique for this. Because removing outliers is not the right choice in every scenario, this article focuses on when not to drop them. The following topics are covered.

1. A brief about outliers and their identification
2. Situations related to safety
3. When the majority is the outlier
4. Data related to experiments

Outliers are generally considered as the extreme values in the dataset. Let’s have a look at what is exactly meant by extreme values.

## A brief about outliers and their identification

Outliers are data points in a dataset that do not follow the pattern of the majority of the data. In simpler words, an outlier is a value that lies outside the other values in the population. Outliers sometimes represent a mistake, for example incorrect coding of the data or an experiment that was not run correctly. But sometimes they are a true representation of the population. In either case, the outliers first have to be identified, and only then can the necessary steps be taken to deal with them.

### Sorting the data

Quantitative variables can be sorted from low to high and then inspected for extremely low or extremely high values. Any extreme values found should be marked. This is an easy way to decide whether particular data points deserve a closer look before using more complex techniques.
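The sorting approach can be sketched in a few lines of Python. This is only an illustration: the helper name `extreme_candidates`, the parameter `k`, and the sample measurements are all made up for the example, and the function merely surfaces candidates for manual review rather than deciding what is an outlier.

```python
def extreme_candidates(values, k=2):
    """Return the k smallest and k largest values after sorting.

    These are only candidates for inspection, not confirmed outliers.
    """
    ordered = sorted(values)
    return ordered[:k], ordered[-k:]


# Hypothetical measurements, invented for illustration.
measurements = [92, 88, 95, 67, 90, 91, 140, 89, 93, 94]

lowest, highest = extreme_candidates(measurements, k=2)
print("lowest:", lowest)
print("highest:", highest)
```

Here the values 67 and 140 stand apart from their neighbours at each end of the sorted list, so they would be the ones to mark for a closer look.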

### Visualizing

To quickly see the data distribution, software with a box plot or box-and-whisker plot might be used to visualise the data. The minimum and maximum values (the range), the median, and the interquartile range for your data are highlighted in this style of graphic.

Many software packages mark outliers on the chart with an asterisk or a similar symbol, plotted beyond the ends of the whiskers.

### Statistical methods

Applying statistical tests or techniques to find extreme values is the process of statistical outlier identification.

Data points can be converted into z scores, which indicate how many standard deviations each value lies from the mean. A value can be categorised as an outlier if its z score is sufficiently high or low. Generally speaking, values with a z score greater than 3 or less than -3 are regarded as outliers.
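As a rough sketch of the z score rule, the snippet below uses only Python's standard `statistics` module. The function name `z_score_outliers` and the sample data are hypothetical, and using the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption of this example.

```python
import statistics


def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute z score exceeds the threshold."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(values)
    if std == 0:
        return []  # all values are identical, so nothing can be flagged
    return [v for v in values if abs((v - mean) / std) > threshold]


# Hypothetical readings: nineteen values near 50 and one at 120.
readings = [50, 52, 49, 51, 48, 50, 53, 47, 50, 51,
            49, 52, 50, 48, 51, 50, 49, 52, 51, 120]

flagged = z_score_outliers(readings)
print(flagged)
```

Note that the extreme value itself inflates the mean and the standard deviation, so in very small samples no point can reach a z score of 3 at all. That is one reason the rule is best treated as a screening step and not as a verdict.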

### The IQR method

The interquartile range (IQR) describes the spread of the middle half of the data. It can be used to draw “limits” around the data, and any values that fall outside those limits are referred to as outliers.
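A minimal sketch of the IQR method follows. The article does not say how wide the limits should be; the multiplier of 1.5 used here is the common convention for box plots and is an assumption of this example, as are the function name `iqr_outliers`, the choice of the `"inclusive"` quartile method, and the sample data.

```python
import statistics


def iqr_outliers(values, multiplier=1.5):
    """Return (lower_limit, upper_limit, outliers) using IQR-based limits."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical data, invented for illustration.
data = [12, 14, 14, 15, 16, 17, 18, 19, 21, 45]

lower, upper, outliers = iqr_outliers(data)
print(f"limits: [{lower}, {upper}]")
print("outliers:", outliers)
```

Because quartiles are barely affected by a few extreme values, these limits are less distorted by the outliers themselves than limits based on the mean and standard deviation.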

As mentioned above, outliers sometimes represent an error and sometimes represent true values of the population. An error can either be dropped or be corrected accordingly. In the second case, however, the outlier cannot simply be dropped, because removing genuine values distorts the dataset. Let’s understand this situation with some real-life examples.


## Situations related to safety

Consider a dataset from a car safety experiment in which you, as an analyst at Global NCAP, find some outliers (unusual patterns). Specifically, the outliers are in the data received from the tire pressure monitoring system, ABS and traction control system.

In this scenario, before dropping the outliers, one needs to check with the engineering team how much they matter for the overall NCAP rating of the vehicle. If they have little effect, they could be dropped; otherwise, try to mitigate them. Most probably, these readings come from features that are important for the safety of the car, and hence they should not simply be dropped.

## When the majority is the outlier

Consider a dataset in which the majority of the data lies in the outlier regions at the two extreme ends. These points cannot be dropped: if, say, 70% of the data is flagged as an outlier, the data left would not be sufficient for a sound understanding. The data could be modified only if doing so does not introduce Type 1 or Type 2 errors, because modifying a big chunk of the data would definitely affect the analysis.
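Before dropping anything, it can help to check how much of the data would actually be lost. The sketch below assumes an expected range supplied from domain knowledge; the function name `outlier_share`, the range of 40 to 60, and the readings are hypothetical and only illustrate the check.

```python
def outlier_share(values, lower, upper):
    """Fraction of values that fall outside the closed range [lower, upper]."""
    if not values:
        raise ValueError("values must not be empty")
    outside = sum(1 for v in values if v < lower or v > upper)
    return outside / len(values)


# Hypothetical readings clustered at the two extreme ends.
readings = [2, 3, 1, 50, 98, 97, 99, 4, 96, 51]

share = outlier_share(readings, lower=40, upper=60)
print(f"{share:.0%} of readings fall outside the expected range")
```

When the share is this large, the flagged points are the bulk of the data, and the assumed range or the data collection deserves scrutiny before any value is removed.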

## Data related to experiments

When the data comes from experiments, an outlier is mostly an error that can be corrected, but sometimes it is natural variation. Here is a statement from a zoologist who shares a real-life situation.

“If the inaccuracy or mistake is evident, I exclude outliers. My most recent exclusion, which required a complete recalculation of the statistics, was due to a mistake in a captured mouse’s recorded body length. It measured 67 mm and weighed more than 40 g. Having worked in the field for more than 30 years, I am aware that this species cannot possibly be this small (or this obese). As the inaccuracy could not be remedied, I removed both measurements from the data.”

The statement makes it clear that the decision to drop the outlier rested on the zoologist’s long experience in the domain.

## Conclusion

Outliers are not always the extreme data points in the given range. Sometimes an outlier is an error, and sometimes it is a genuine anomaly. To deal with outliers, one needs domain knowledge of the data concerned. In this article, we have looked at when not to drop outliers from data.
