What exactly happens when a model is trained on a data set? How do algorithms make use of the provided data to make predictions or classify samples into different categories? Let us look at an example.
Let us assume that we have collected data on different types of alcohol. A particular drink is categorised by humans based on several factors. For example, if a drink is “strong”, that is, it has a high percentage of alcohol, we call it hard liquor. And if it is a mixture of different types of alcohols, then it is a cocktail. An algorithm or our ML model will read this data in a numerical form, assigning ‘1’ if it is hard liquor and so on.
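As a rough illustration (the drink names and categories below are invented, and pandas is just one way of doing this; the article does not prescribe a tool), turning such human-assigned categories into numbers might look like this:

```python
import pandas as pd

# Invented example data: each drink has been labelled by a human
drinks = pd.DataFrame({
    "name": ["whisky", "mojito", "vodka", "margarita"],
    "category": ["hard liquor", "cocktail", "hard liquor", "cocktail"],
})

# The model only sees numbers, so encode the category as 1 for hard liquor
# and 0 for everything else
drinks["label"] = (drinks["category"] == "hard liquor").astype(int)
print(drinks)
```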
Data preparation is an uncelebrated side of data science. Before we look into when to use a specific algorithm, let us go through the processes involved in cleansing a data set.
What is Data Preprocessing?
Data preprocessing involves transforming noisy data from a database into a usable shape: giving it structure and assigning proper column names where feature names are missing. It also includes converting raw data into information that can be understood easily.
This involves the following steps:
- Data Cleansing: Handling missing values and clearing out noisy data so that what remains makes sense. Null or missing values provide no information to a model, so we need a strategy for dealing with them. If more than 80 percent of a feature's values are missing, we can delete the feature entirely: it would contribute very little during training and have almost no effect on classification or regression. If only a small portion is missing (less than 10-15 percent), we can instead use the available values of that feature to build a regression model that predicts the missing ones (a sketch of both strategies follows this list).
- Data Integration: Bringing similar data together and reconciling differences between sources. It involves accessing different databases, extracting information from various trusted sources and combining them. This process gives meaning to unstructured data by framing it in a way that can be understood.
- Data Transformation: Raw features often differ hugely in magnitude, and a model will give more weight to the larger numbers. Scaling the data is mandatory to make sure the model gives equal weight to all the values. Take bank data with two features, the age and the income of a person: age is a 1- to 3-digit number, while income can run to 6 to 10 digits. The model sees both simply as numbers, while we read them as age versus income. To give both features equal weight, we scale the values; this process is also called ‘normalisation’ (see the scaling sketch after this list).
- Data Reduction or Dimension Reduction: Reducing the number of features based on how much information each one provides. There are two approaches. First, feature selection, where features that carry very little information are dropped before training. Second, feature extraction, where methods such as PCA (Principal Component Analysis) project the original features onto a smaller number of new components. Both linear and non-linear methods can be used here. An advantage of dimension reduction is that highly correlated features are collapsed into a few components that still retain most of the variance (sketches of both approaches also follow this list).
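To make the data-cleansing step concrete, here is a minimal sketch in Python (pandas and scikit-learn are my assumptions, not tools named in this article) of the two strategies above: drop features that are more than 80 percent empty, and impute numeric features with small gaps using a simple regression on the fully populated numeric columns.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    # Drop features where more than 80 percent of the values are missing
    missing_ratio = df.isna().mean()
    df = df.drop(columns=missing_ratio[missing_ratio > 0.80].index)

    # Fully populated numeric columns act as predictors for the imputation
    predictors = [c for c in df.select_dtypes("number").columns
                  if df[c].notna().all()]

    # For numeric features with only a few gaps (< 15 percent), predict the
    # missing entries with a simple regression on the complete columns
    for col in df.select_dtypes("number").columns:
        ratio = df[col].isna().mean()
        if 0 < ratio < 0.15 and predictors:
            known, unknown = df[df[col].notna()], df[df[col].isna()]
            model = LinearRegression().fit(known[predictors], known[col])
            df.loc[df[col].isna(), col] = model.predict(unknown[predictors])
    return df
```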
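The age-versus-income scaling example can be sketched like this; the numbers are invented, and MinMaxScaler is just one common way to normalise values into the 0-1 range.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented bank data: age spans 2 digits, income spans 6-7 digits
bank = pd.DataFrame({
    "age": [23, 35, 47, 62],
    "income": [250_000, 1_200_000, 3_400_000, 900_000],
})

# After scaling, both features lie in [0, 1], so neither dominates the model
scaled = pd.DataFrame(MinMaxScaler().fit_transform(bank), columns=bank.columns)
print(scaled)
```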
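And here is a sketch of both flavours of dimension reduction, feature selection followed by feature extraction with PCA; the Iris data set and the variance threshold are placeholders chosen purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

X, y = load_iris(return_X_y=True)

# Feature selection: discard near-constant features that carry little information
X_selected = VarianceThreshold(threshold=0.1).fit_transform(X)

# Feature extraction: project onto 2 principal components that capture
# most of the variance in the selected features
X_reduced = PCA(n_components=2).fit_transform(X_selected)
print(X.shape, X_selected.shape, X_reduced.shape)
```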
Once we have prepared and analysed the data, we need to focus on selecting the model that suits it best. This depends largely on the type of data we are dealing with. Since there are so many ML models, let us compare two of the most frequently used algorithms: Random Forest and Support Vector Machine (SVM).
Random Forest works very well on a data set containing both numerical and categorical variables, since it is a supervised, non-linear classification algorithm. Data transformation is usually not required for Random Forest: each tree splits on one feature at a time, so no single feature dominates simply because of its scale, and every feature is treated equally. SVM, on the other hand, is at its core a linear algorithm that works with the distances between points in the feature space. Those distances grow with the magnitude of the raw values, which is why data scaling is necessary here. In both algorithms, categorical values must be converted into numerical features, because the model cannot read raw category labels.
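To see the contrast in practice, here is a hedged sketch using scikit-learn: the random forest is trained on the raw features, while the SVM is wrapped in a scaling pipeline. The breast-cancer data set is only a stand-in; results will of course depend on your own data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random Forest: no scaling needed, each split looks at one feature at a time
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# SVM: distance-based, so features are standardised before fitting
svm = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

print("random forest accuracy:", rf.score(X_test, y_test))
print("svm + scaling accuracy:", svm.score(X_test, y_test))
```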
Therefore, as a rule of thumb, SVM works well when there are many features and the values already lie within a fixed range that needs no further scaling, such as image pixels (0-255). Features spanning 6 to 10 digits are hard to handle in SVM without scaling.
For classification, Random Forest gives class probabilities directly, whereas SVM relies on ‘support vectors’ and works with distances, which have to be converted into probabilities afterwards.
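A short sketch of that difference (again on a stand-in data set): the forest's `predict_proba` returns class probabilities directly, while the SVM exposes raw distances through `decision_function` and needs `probability=True`, which applies Platt scaling in scikit-learn, to convert those distances into probabilities.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Random Forest: probabilities come straight from the ensemble of trees
rf = RandomForestClassifier(random_state=0).fit(X, y)
print(rf.predict_proba(X[:2]))

# SVM: raw signed distances to the hyperplane, then calibrated probabilities
svm = SVC(probability=True, random_state=0).fit(X, y)
print(svm.decision_function(X[:2]))
print(svm.predict_proba(X[:2]))
```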
Steps to consider while applying your ML algorithm:
- Check your data for missing values and deal with them.
- Clean the data and frame it in a structured manner to maintain its integrity.
- Find the relevant features that account for the classification or regression during training. Remove unwanted features to reduce the dimensionality.
- Depending on the data and your required output, choose the ML algorithm.
- Split off a small portion of the training data as a validation set to check whether your model is overfitting during training; if it is, correct it with a regulariser (a sketch of this check follows the list).
- Keep the learning rate at an optimum level so that the model does not overshoot or undershoot during error correction.
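As a sketch of the validation step in the checklist (the model and split size here are illustrative assumptions, not recommendations), hold out part of the training data and compare training accuracy against validation accuracy:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:  ", model.score(X_train, y_train))
print("validation accuracy:", model.score(X_val, y_val))
# A large gap between the two scores suggests overfitting; constrain the model
# (e.g. limit max_depth) or add regularisation and check again.
```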
These are some of the basic steps to consider while training an ML model. There is no perfect model or algorithm that works for every data set. While building an ML model, the most important and hardest part is cleansing and preprocessing the data. Ironically, applying the algorithm and predicting the output takes just a few lines of code, which is the easiest part of the build.