How To Build A Fair Dataset For A Machine Learning Project

Ambika Choudhury

A machine learning model can seem like a miracle, but it won’t amount to anything if it isn’t fed a good dataset. Before a dataset is ready for training, a lot of work has to be done, and that work remains unnamed and uncelebrated behind every successful machine learning algorithm.

Data Collection

This part can be tricky for those who have just entered the world of machine learning and want to try their hand at it. Collecting data is easy only if you know your project well and know what kind of data you need. How much to collect depends largely on the complexity of the project: a simple machine learning project requires far less data than a complex one. Large volumes of data are available from open sources as well as private ones. While collecting, make sure the data is ML-friendly.

Data Preprocessing

Preprocessing converts the raw dataset into a clean and meaningful one. Over several steps, noisy data is transformed into a structured dataset that carries the right information. The steps are described below.



Data Formatting

It is crucial to format the data so that it fits your machine learning model. Raw data as gathered is usually in poor shape or in a format that does not suit the task, so it cannot be used directly to get good predictions. Convert it into the format your project needs, for example by parsing text fields into numbers and dates.
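As a minimal sketch of what formatting can look like, the snippet below turns all-string records into typed values. The column names (`age`, `income`, `signup`), the sample rows and the helper `format_row` are hypothetical and only serve as illustration; real data will need its own rules.

```python
from datetime import datetime

# Hypothetical raw records, as they might come out of a CSV export:
# every value is a string, with stray spaces and thousands separators.
raw_rows = [
    {"age": " 34 ", "income": "52,000", "signup": "2019-03-14"},
    {"age": "27", "income": "38,500", "signup": "2019-11-02"},
]


def format_row(row):
    """Convert one raw, all-string record into typed values."""
    return {
        "age": int(row["age"].strip()),
        "income": float(row["income"].replace(",", "")),
        "signup": datetime.strptime(row["signup"], "%Y-%m-%d").date(),
    }


formatted = [format_row(r) for r in raw_rows]
```

Keeping the conversion in one small function makes it easy to apply the same rules to any new batch of data collected later.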


Data Cleaning

Removing or fixing the missing values in a dataset is generally known as data cleaning. Records with missing values and attributes you do not need can be omitted to leave a meaningful dataset.
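The three moves mentioned here (removing records, fixing values, omitting attributes) can be sketched in plain Python. The records, the column names and the choice of the mean as a fill value are assumptions made for illustration; whether to drop or to fill depends on the data and the project.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "income": 52000.0},
    {"id": 2, "age": None, "income": 38500.0},
    {"id": 3, "age": 45, "income": None},
    {"id": 4, "age": 29, "income": 61000.0},
]


def drop_missing(rows, column):
    """Remove every record whose value in `column` is missing."""
    return [r for r in rows if r[column] is not None]


def fill_missing_with_mean(rows, column):
    """Replace missing values in `column` with the mean of the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def drop_columns(rows, columns):
    """Omit attributes that are not wanted in the final dataset."""
    return [{k: v for k, v in r.items() if k not in columns} for r in rows]


cleaned = drop_missing(rows, "income")            # remove: record 3 has no income
cleaned = fill_missing_with_mean(cleaned, "age")  # fix: fill the missing age
cleaned = drop_columns(cleaned, {"id"})           # omit the identifier column
```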


Data Sampling

Data sampling helps beginners and professionals alike. A large volume of data consumes not only memory but also time. Working on a representative sample lets the algorithms run much faster, so you can find out quickly whether the dataset is worth taking further in your project.
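A simple random sample is one way to do this. The sketch below assumes the dataset fits in a list; the function name `sample_rows`, the 10% fraction and the fixed seed are illustrative choices, not recommendations.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Return a simple random sample containing `fraction` of the rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(rows) * fraction))
    rng = random.Random(seed)  # local generator so the sample is reproducible
    return rng.sample(rows, k)


full_dataset = list(range(10_000))  # stand-in for 10,000 records
subset = sample_rows(full_dataset, fraction=0.1)
```

Sampling without replacement, as `random.Random.sample` does, guarantees that no record appears twice in the subset. If the classes in the data are very unbalanced, a stratified sample may be a better fit than a purely random one.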

Feature Engineering

Feature engineering transforms the dataset so that it exposes the patterns a model can learn from. It is a crucial part of the work, where the right features are selected and extracted from the data: the ones that are most useful and most relevant. It involves the steps below.

Feature Extraction

Feature extraction reduces the resources required to represent a large dataset by optimising the number of features. It is important to find out which features are beneficial for the particular ML project; keeping only those results in faster computation and lower memory consumption.
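One very simple way to cut down the number of features is a variance filter, which drops columns that barely change and so carry little information. This is a swapped-in illustration rather than a method named in this article, and the column names and the zero threshold are assumptions.

```python
from statistics import pvariance

# Hypothetical feature columns; "country_code" is constant in this sample.
columns = {
    "age": [34, 27, 45, 29],
    "country_code": [91, 91, 91, 91],
    "income": [52000.0, 38500.0, 47000.0, 61000.0],
}


def select_by_variance(columns, threshold=0.0):
    """Keep only the columns whose population variance exceeds `threshold`."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > threshold
    }


selected = select_by_variance(columns)
```

Here `selected` keeps `age` and `income` and drops the constant `country_code`. A variance filter only looks at each feature on its own; it says nothing about how useful a feature is for predicting the target, so it is best treated as a first pass.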

Feature Scaling

Feature scaling brings the independent variables of a dataset onto a comparable scale. The two prominent ways of doing it are standardisation and normalisation. Standardisation is one of the techniques most used by machine learning practitioners; it rescales each feature to zero mean and unit variance, the parameters of the standard normal distribution, although it does not by itself make the data Gaussian. Normalisation, also called min-max normalisation, rescales each feature to the range between 0 and 1.
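Both techniques fit in a few lines. The sketch below uses the usual formulas, z = (x - mean) / std for standardisation and (x - min) / (max - min) for min-max normalisation; the function names and the sample `ages` are illustrative.

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale to zero mean and unit variance: z = (x - mean) / std."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardise a constant feature")
    return [(x - mu) / sigma for x in values]


def min_max_normalise(values):
    """Rescale to the range [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalise a constant feature")
    return [(x - lo) / (hi - lo) for x in values]


ages = [20, 30, 40, 50, 60]
z_scores = standardise(ages)
normalised = min_max_normalise(ages)
```

In a real project the mean, standard deviation, minimum and maximum are normally computed on the training set only and then reused to scale the other sets, so that no information from held-out data leaks into training.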

Data Splitting

Data splitting holds an important position in a machine learning project. In general the dataset is split into a training set, a test set and a validation set, commonly in proportions of about 70%, 20% and 10% respectively. The training set is the known, labelled data used to build the model; from it the model forms the hypothesis it uses to predict the outcome of future data. The validation set is used while the model is being developed, to tune it and to catch overfitting. The test set is held back until the end and used to check how well the final hypothesis performs on data it has never seen.
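A minimal sketch of such a split, using the 70/20/10 proportions mentioned above, could look like this. The function name `split_dataset`, the fixed seed and the stand-in data are assumptions for illustration.

```python
import random


def split_dataset(rows, train=0.7, test=0.2, seed=0):
    """Shuffle the rows, then split them into training, test and validation sets.

    The validation set receives whatever remains after the first two.
    """
    if train <= 0 or test <= 0 or train + test >= 1:
        raise ValueError("train and test must be positive and sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, reproducibly
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )


train_set, test_set, validation_set = split_dataset(range(1000))
```

Shuffling before splitting matters when the records are stored in some order, such as by date or by class; for time series, where order carries meaning, a chronological split is usually preferred instead.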


Machine learning models are basically “garbage in, garbage out”. So, while building one, it is more important to pay attention to preparing the dataset than to the algorithm. Remember that a simple algorithm can perform robustly if the dataset fed to it is good enough.



Copyright Analytics India Magazine Pvt Ltd
