Scaling the data is one of the data pre-processing steps performed before applying machine learning algorithms to a data set. As we know, most supervised and unsupervised learning methods make decisions according to the data sets applied to them, and the algorithms often calculate the distance between the data points to draw better inferences from the data.
Take a real-life example: when purchasing apples from a heap of apples, we go up to the shop, examine various apples, and pick several with the same attributes. This is because we have learned about the attributes of apples: we know which are better and which are not good, and also which attributes can be compromised on and which cannot. So if most of the apples have pretty similar attributes, we take less time in selecting the apples, which directly reduces the time the purchase takes. The moral of the example is that if every apple in the shop is good we take less time to purchase, and if the apples are not good enough we take more time in the selection process. In other words, if the values of the attributes are closer together we work faster, and the chances of selecting good apples are also stronger.
Similarly, in machine learning, if the values of the features are closer to each other there is a good chance the algorithm will train well and quickly, whereas on a data set where the data points or feature values differ widely from each other, the algorithm will take more time to understand the data and the accuracy will be lower.
So whenever the data points of a data set lie far from each other, scaling is a technique for bringing them closer, or in simpler words, scaling is used to generalise the data points so that the distance between them becomes lower.
As we know, most machine learning models learn from the data as the learning model maps data points from input to output, and the distribution of the data points can be different for every feature of the data. Larger differences between the data points of the input variables increase the uncertainty in the results of the model.
A machine learning model assigns weights to the input variables according to their data points and to the inferences drawn for the output. In that case, if the differences between the data points are very high, the model will need to assign larger weights to those points, and a model with large weight values is often unstable. This means the model can produce poor results or perform poorly during learning.
I am not saying that all algorithms face this problem, but most of the basic algorithms, like linear and logistic regression, artificial neural networks, and clustering algorithms with a k value, are affected by differences in the scale of the input variables.
Scaling the target value is also a good idea in regression modelling; scaling the data makes it easier for a model to learn and understand the problem. In the case of neural networks, a variable with a wide spread of values may result in a large loss during training and testing and make the learning process unstable.
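As a quick, hedged sketch of target scaling in regression (the data and numbers here are made up for illustration), scikit-learn's TransformedTargetRegressor scales the target before fitting and automatically inverts the transform at prediction time:
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
# Hypothetical data: one feature, a target with a wide spread of values
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([100000.0, 220000.0, 290000.0, 410000.0])
# The regressor is fit on the standardized target; predictions come back in the original units
model = TransformedTargetRegressor(regressor=LinearRegression(), transformer=StandardScaler())
model.fit(X, y)
print(model.predict([[5.0]]))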
Normalization and standardization are the two main methods for scaling data, and they are widely used with the algorithms where scaling is required. Both of them can be implemented through the scikit-learn library's preprocessing package.
Why feature scaling?
So far, we have discussed that various machine learning algorithms are sensitive when the data is not scaled. There are various machine learning algorithms that use the same kinds of basic strategies as the base concept under the algorithm, and these base concepts rely entirely on the distances between data points. For example:
Gradient Descent Algorithm
Machine learning algorithms like linear regression and logistic regression use this algorithm as their basic optimization routine. Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. Basically, when applied to a data set, the gradient descent function slides through the data set step by step. So if the distances between the data points increase, the size of the step changes and the movement of the function will not be smooth. Take a look at the formula.
$$\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
Here x represents the feature value, theta represents the parameter that the optimization function moves at each step, and alpha is the learning rate. So if the range of the feature values increases, the size of the movement increases and the function will not work properly. In that situation, we need a well-rescaled data set so that the function can better help in the development of the machine learning model.
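To make the effect concrete, here is a minimal sketch (with made-up numbers) of a single batch gradient descent update for linear regression; notice how the unscaled third column dominates the update:
import numpy as np
def gradient_descent_step(theta, X, y, lr=0.01):
    # One batch update: theta := theta - lr * (1/m) * X^T (X theta - y)
    m = len(y)
    gradient = X.T @ (X @ theta - y) / m
    return theta - lr * gradient
# A bias column plus two features on very different scales
X = np.array([[1.0, 2.0, 1000.0],
              [1.0, 3.0, 3000.0],
              [1.0, 4.0, 2000.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.zeros(3)
print(gradient_descent_step(theta, X, y))
# The update for the large-scale feature is orders of magnitude bigger than the others,
# which is exactly what makes the steps jumpy when the data is not rescaled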
Distance-Based Algorithms
Various unsupervised and supervised learning methods use distance-based algorithms; KNN, K-Means, SVM, etc. are examples of algorithms that use the distance between data points behind the scenes. For example, in a corporate office the salary of the employees depends largely on experience: some are newcomers, some are well experienced, and some have medium experience. If we need to build a model that predicts salary, and the features sit on very different numeric scales, the feature with the larger range will dominate the distance calculation and the model will be biased towards it. We need to rescale the data so that it is well spread in the space and the algorithms can learn better from it, as the toy sketch below shows.
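As a toy sketch (the employee numbers are hypothetical), the Euclidean distance between two employees is dominated by the large-scale salary feature and almost ignores the experience gap until the features are rescaled:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Columns: [years of experience, salary] (hypothetical values)
employees = np.array([[2.0, 40000.0],
                      [10.0, 42000.0],
                      [3.0, 41000.0]])
# The raw distance between employee 0 and employee 1 is ruled by the salary column
print(np.linalg.norm(employees[0] - employees[1]))  # ~2000: the 8-year experience gap is invisible
scaled = MinMaxScaler().fit_transform(employees)
# After scaling, the experience gap contributes to the distance again
print(np.linalg.norm(scaled[0] - scaled[1]))  # ~1.41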
Now we know the situations where we need to rescale the data, and which algorithms expect scaled data in order to perform better in learning and testing.
Let’s take a closer look at normalization and standardization.
Data Normalization
Normalization can have various meanings; in the simplest case, normalization means adjusting values measured on different scales to a common scale.
In statistics, normalization is a method of rescaling data in which we try to fit all the data points into the range of 0 to 1 so that the data points become closer to each other.
It is a very common approach to scaling data. In this method, the minimum value of a feature gets converted into 0 and the maximum value of the feature gets converted into 1.
Basically, under the normalization operation, the difference between any value and the minimum value gets divided by the difference between the maximum and minimum values. We can represent normalization as follows.
$$x' = \frac{x - \min(X)}{\max(X) - \min(X)}$$
where x is any value from the feature X, min(X) is the minimum value of the feature, and max(X) is the maximum value of the feature.
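As a quick check of the formula, here it is applied by hand with NumPy to the first column of the example array used below:
import numpy as np
x = np.array([2, 9, 8, 20])
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # 0, 0.3889, 0.3333 and 1 (rounded)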
Scikit-learn provides an implementation of normalization in its preprocessing package. Let’s see how it works.
Implementing min-max normalization
import numpy as np
from sklearn.preprocessing import MinMaxScaler
Defining an array
df = np.array([[2, 3, 7, 30],
               [9, 4, 6, 1],
               [8, 15, 2, 40],
               [20, 10, 2, 6]])
print(df)
Output:
[[ 2  3  7 30]
 [ 9  4  6  1]
 [ 8 15  2 40]
 [20 10  2  6]]
Visualizing the array.
import matplotlib.pyplot as plt
fig = plt.figure(figsize =(10, 7))
plt.boxplot(df)
plt.show()
Output:
[Boxplot of the four unscaled feature columns; the fourth column spans a much wider range than the others.]
Normalizing the array.
scaler = MinMaxScaler()
scaler.fit(df)
scaled_features = scaler.transform(df)
print(scaled_features)
Output:
[[0.         0.         1.         0.74358974]
 [0.38888889 0.08333333 0.8        0.        ]
 [0.33333333 1.         0.         1.        ]
 [1.         0.58333333 0.         0.12820513]]
Visualizing scaled data:
fig = plt.figure(figsize =(10, 7))
plt.boxplot(scaled_features)
plt.show()
Output:
[Boxplot of the normalized features; every column now lies between 0 and 1.]
In the graphs and in the array we can see how the values have been changed by normalization. Before moving on to standardization, let’s see where normalization is the right choice.
Where to Use Normalization?
Since the normalization method scales data between zero and one, it is better to use it with data whose distribution does not follow a Gaussian distribution, or with an algorithm that does not make assumptions about the distribution of the data during its procedure, like K-means and KNN.
Standardization
Like normalization, standardization is also required in some forms of machine learning when the input data points come on different scales; standardization can provide a common scale for these data points.
The basic concept behind the standardization function is to centre the data points about the mean of all the data points present in a feature, with a unit standard deviation. This means the mean of the rescaled data points will be zero and their standard deviation will be 1.
This technique also rescales the data points, but here we do not use the maximum or minimum values; instead, we work with the mean and the standard deviation.
In statistics, the mean is the average of all the numbers in a set, and the standard deviation is a measure of the dispersion of the data points from their mean value.
So in standardization the data points are rescaled around the mean, and if the original values follow a bell-shaped curve, the rescaled values follow the standard bell curve. Mathematically, we can represent it as follows.
$$x' = \frac{x - \mu}{\sigma}$$
where $\mu$ is the mean of the feature and $\sigma$ is its standard deviation.
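Before turning to scikit-learn, the formula can be applied by hand; note that NumPy's default standard deviation (ddof=0) matches the one StandardScaler uses:
import numpy as np
x = np.array([2, 9, 8, 20])  # first column of the example array
z = (x - x.mean()) / x.std()
print(z.mean(), z.std())  # ~0.0 and 1.0 after standardization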
We can implement it in Python using the preprocessing package provided by scikit-learn.
Implementation
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc_X = sc.fit_transform(df)
print(sc_X)
Output:
[Printed array of standardized values; each column now has mean 0 and unit standard deviation, with values ranging roughly from -1.2 to 1.6.]
Visualization:
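The standardized array can be plotted with the same boxplot snippet used for the normalized data:
fig = plt.figure(figsize=(10, 7))
plt.boxplot(sc_X)
plt.show()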
[Boxplot of the standardized features, centred on zero.]
Here we can see that the visualization is pretty similar to the normalized one, but the values vary roughly between -2 and 2. In this case we cannot define a fixed range, but the distribution of the data points keeps a similar shape over a bigger space.
Where to use Standardization?
Since the results provided by standardization are not bounded to any range, as we saw with normalization, it can be used with data whose distribution follows a Gaussian distribution. In the case of outliers, standardization does not distort the relative positions of the data points, whereas normalization forces all the data points into its fixed range.
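A small sketch (hypothetical numbers) of this difference: with an outlier present, min-max normalization squashes the ordinary points close together, while standardization keeps their relative spacing:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
x = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100 is an outlier
print(MinMaxScaler().fit_transform(x).ravel())
# roughly [0, 0.01, 0.02, 1]: the first three points are crushed near zero
print(StandardScaler().fit_transform(x).ravel())
# roughly [-0.60, -0.58, -0.55, 1.73]: the spacing in sigma units is preserved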
Final Words
Here in the article we got an overview of scaling: we saw the methods we can use for scaling, how we can implement them, and the different use cases where the different methods apply.
It is recommended not to rule out any of the methods based on data quality alone. Whenever going for modelling, we should start with the raw data, then apply each scaling method and compare all the results. This is good practice because the scaling part takes only a few lines of code, and if we try everything there is less chance of missing the best result; one way to set up the comparison is sketched below.
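One hedged way to set up that comparison (the classifier and data set are placeholders) is to treat the scaler as a tunable step in a scikit-learn Pipeline and let GridSearchCV try raw, normalized, and standardized inputs:
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
X, y = load_wine(return_X_y=True)
# "passthrough" leaves the data unscaled; the grid swaps in each scaler in turn
pipe = Pipeline([("scaler", "passthrough"), ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, {"scaler": ["passthrough", MinMaxScaler(), StandardScaler()]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)  # which scaling (if any) worked best for this model and data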