Essential Ways To Handle Data Cleaning

Data cleaning isn’t the most attractive part of data science or machine learning, but it is one of the most important. There are no tricks or shortcuts: to build the best possible model, one needs clean, high-quality data. Machine learning engineers and data scientists spend a lot of time on data cleaning because of a common belief among them that the results of an algorithm depend on whatever data is put into it.

Below are some tips when it comes to data cleaning:


Better Data Quality

It is common for developers to chase a perfect, fancy-looking algorithm while ignoring one of the major factors in its success: data quality. Data cleaning is far more important than it sounds. No matter how good or sophisticated the algorithm is, untidy data will give abysmal results. Poor-quality data also leads to biased outcomes, which can hurt a business if the firm fails to identify the flaws in its data.

Filtering Unnecessary Outliers

Outliers can cause problems for specific models, such as linear regression, whose fit they can distort. But removing an outlier just because it is large, and not because it is uninformative, might make the model miss out on information. Have a legitimate reason before removing an outlier.
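One way to find candidates for review, without deleting anything automatically, is the interquartile range (IQR) rule. The sketch below is only an illustration: the IQR rule, the `k=1.5` threshold, the function name and the sample data are assumptions, not something this article prescribes.

```python
from statistics import quantiles


def flag_outliers_iqr(values, k=1.5):
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR] as possible outliers."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the sample
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Hypothetical sample: house sizes in square meters, one suspiciously large.
sizes = [52, 61, 58, 49, 65, 70, 55, 940]
flags = flag_outliers_iqr(sizes)

# Collect the flagged values for manual review instead of dropping them.
suspects = [v for v, is_outlier in zip(sizes, flags) if is_outlier]
print(suspects)
```

The flagged values are only suspects. Whether each one is a data-entry error or a genuine, informative observation still has to be decided case by case.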

Removing Duplicate Observations

This is one of the basic steps of data cleaning in data science. Duplicate observations frequently arise during data collection, for example when combining datasets from multiple places, receiving data from other parties, or scraping data. Irrelevant observations are those that don’t actually fit the specific problem under consideration. Spotting and removing both kinds will improve one’s model. It is recommended to check for these observations before feature engineering comes into play.
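As a rough sketch of de-duplication, the plain-Python helper below keeps the first occurrence of each record. The helper name and the sample records are hypothetical, and it assumes every field value is hashable.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        # Sort the items so key order inside a dict does not matter.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical records, e.g. after combining two sources.
records = [
    {"id": 1, "city": "Pune"},
    {"id": 2, "city": "Delhi"},
    {"id": 1, "city": "Pune"},  # exact duplicate of the first record
]
cleaned = drop_duplicates(records)
print(len(records), "->", len(cleaned))
```

This only catches exact duplicates. Near-duplicates (different capitalisation, stray spaces) need the values to be standardised first.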

Syntax Errors

Making sure values are stored in the correct data types can save a lot of time and help in creating a better model. Every value must be stored in a relevant data type.
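A small illustration of why types matter, using made-up values: numbers stored as text sort alphabetically, and dates stored as text cannot be compared as dates. The field names here are hypothetical.

```python
from datetime import date

# Hypothetical raw record in which every field arrived as text.
raw = {"age": "42", "price": "19.90", "signup": "2021-03-15"}

# Convert each field to a type that matches its meaning.
typed = {
    "age": int(raw["age"]),
    "price": float(raw["price"]),
    "signup": date.fromisoformat(raw["signup"]),
}

# Numbers stored as strings sort alphabetically, not numerically.
as_text = sorted(["10", "9", "100"])
as_numbers = sorted(int(s) for s in ["10", "9", "100"])
print(as_text, as_numbers)
```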

There are some types of errors that need to be kept in mind:

Padding strings: Strings can be padded with spaces or other characters to a certain width. For example, some numerical codes are padded with leading zeros so that they always have the same number of digits.

401 => 000401 (6 digits)

Removing white spaces: This simply means removing extra white space at the beginning or the end of a string.

“  hello world  ” => “hello world”
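Both fixes above map onto built-in Python string methods. A minimal sketch, with variable names chosen only for illustration:

```python
# Pad a numerical code with leading zeros to a fixed width of 6 digits.
code = 401
padded = str(code).zfill(6)

# Remove extra white space at the beginning and the end of a string.
text = "  hello world  "
stripped = text.strip()

print(padded)
print(repr(stripped))
```

Note that a padded code has to be stored as a string, because an integer type drops the leading zeros.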

Fixing Structural Errors

Structural errors are those that arise during processes like measurement and data transfer. For example, one can check for typos and inconsistent capitalisation. Another fix is to merge mislabelled classes that should really be a single class.
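A sketch of how such fixes might look for a label column. The labels and the merge mapping are hypothetical; in practice the mapping comes from inspecting the distinct values actually present in the data.

```python
# Hypothetical labels with stray spaces, inconsistent capitalisation,
# and two spellings of the same class.
labels = ["Composite", "composite", "asphalt ", "Asphalt", "N/A", "Not Applicable"]

# Hypothetical mapping that merges mislabelled classes into one.
merge_map = {"n/a": "not applicable"}


def normalise(label):
    """Trim white space, lower-case, then merge known aliases."""
    cleaned = label.strip().lower()
    return merge_map.get(cleaned, cleaned)


normalised = [normalise(label) for label in labels]
classes = sorted(set(normalised))
print(classes)
```

Six raw labels collapse into three real classes once capitalisation, white space and aliases are handled.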

Standardising the Values

Standardising strings means making sure all values are in either lower case or upper case. In the same way, numerical values can be standardised to a single measurement unit. For example, a length can be recorded in meters or in feet. An algorithm treats a difference of one meter the same as a difference of one foot, so one has to convert every length to one single unit.
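The unit conversion can be sketched as follows, assuming each measurement arrives as a hypothetical (value, unit) pair. One foot is exactly 0.3048 meters by definition.

```python
METERS_PER_FOOT = 0.3048  # exact by definition


def to_meters(value, unit):
    """Convert a length in meters or feet to meters."""
    unit = unit.strip().lower()  # standardise the unit label first
    if unit in ("m", "meter", "meters"):
        return float(value)
    if unit in ("ft", "foot", "feet"):
        return value * METERS_PER_FOOT
    raise ValueError(f"unknown unit: {unit!r}")


# Hypothetical mixed-unit measurements.
lengths = [(1.80, "m"), (6.0, "ft"), (5.5, "FT")]
in_meters = [round(to_meters(value, unit), 3) for value, unit in lengths]
print(in_meters)
```

Raising an error on an unknown unit, rather than guessing, keeps a new unit label from slipping through unnoticed.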

Missing Data

Most algorithms do not accept missing values, so handling missing data is a crucial part of making one’s data cleaner.

The missing data can be handled in two ways:

Dropping observations with missing values: Dropping is not optimal, because dropping an observation means dropping information.

Imputing missing values based on other observations: Imputing is not optimal either. The value was originally missing and someone filled it in, which leads to a loss of information no matter what imputation method one uses.
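To make the trade-off concrete, here is a small sketch of both options on a made-up column. Mean imputation is used only as one example of an imputation method.

```python
from statistics import mean

# Hypothetical column with two missing values.
ages = [34, None, 29, None, 41]

# Option 1: drop observations with missing values (two of five rows are lost).
dropped = [a for a in ages if a is not None]

# Option 2: impute with the mean of the observed values.
fill_value = mean(dropped)
imputed = [fill_value if a is None else a for a in ages]

print(len(dropped), len(imputed))
```

After imputation, nothing in the column records which values were real and which were filled in.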

The fact that a value is missing can itself be informative, and the algorithm should know about it. Imputing is like trying to fit a missing piece of the puzzle back in after you have taken it out. Models built on imputed values might not gain any real information, because the imputed values only reinforce the patterns already provided by other features.

A possible solution? Just tell the algorithm that something is missing.

Handling missing categorical data: For a categorical feature, one can label missing values as ‘Missing’. This is like adding a new class to the feature.

Handling missing numerical data: For numerical data, one should flag and fill the values. First, flag the observation with an indicator variable for missingness, then fill the original missing value with 0 to meet the technical requirement of no missing values.

Flagging and filling essentially allows the algorithm to estimate the optimal constant for missingness, instead of just filling in a guessed value.
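The two rules above can be sketched in plain Python. The function name, the `_missing` suffix for the indicator column, and the sample rows are all hypothetical, and `None` stands in for a missing value.

```python
def flag_and_fill(rows, categorical, numerical):
    """Label missing categories as 'Missing'; flag and zero-fill missing numbers."""
    prepared = []
    for row in rows:
        new_row = dict(row)
        for col in categorical:
            if row[col] is None:
                new_row[col] = "Missing"  # acts as a new class for the feature
        for col in numerical:
            is_missing = row[col] is None
            new_row[f"{col}_missing"] = 1 if is_missing else 0  # indicator variable
            if is_missing:
                new_row[col] = 0  # fill so that no missing values remain
        prepared.append(new_row)
    return prepared


# Hypothetical observations.
rows = [
    {"colour": "red", "size": 12.0},
    {"colour": None, "size": None},
    {"colour": "blue", "size": 7.5},
]
prepared = flag_and_fill(rows, categorical=["colour"], numerical=["size"])
print(prepared[1])
```

Because the indicator column travels with the data, a model can learn its own adjustment for rows where the value was missing.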

Sameer Balaganur
Sameer is an aspiring Content Writer. Occasionally writes poems, loves food and is head over heels with Basketball.
