Regularization is a set of techniques that help avoid overfitting in neural networks, thereby improving the accuracy of deep learning models when they are fed entirely new data from the problem domain. There are various regularization techniques; some of the most popular are L1, L2, dropout, early stopping, and data augmentation.
Why is Regularization Required?
The mark of a good machine learning model is its ability to generalise well from the training data to any data from the problem domain; this allows it to make good predictions on data that the model has never seen. Generalisation refers to how well the model has learnt concepts that apply to any data, rather than only to the specific data it was trained on.
On the flip side, if the model does not generalise, the problem of overfitting emerges. An overfitted model performs very well on the training data but fails when applied to the test data. It even picks up the noise and random fluctuations in the training data and learns them as if they were concepts. This is where regularization steps in: it makes slight changes to the learning algorithm so that the model generalises better. Some of the regularization techniques are as follows:
L2 and L1 Regularization
L2 and L1 are the most common types of regularization. They work on the premise that smaller weights lead to simpler models, which in turn helps avoid overfitting. To obtain a smaller weight matrix, these techniques add a ‘regularization term’ to the loss to obtain the cost function.
Cost function = Loss + Regularization term
The difference between the L1 and L2 regularization techniques lies in the nature of this regularization term. In general, adding this regularization term causes the values in the weight matrices to shrink, leading to simpler models.
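In L2, we depict the cost function as
$$\text{Cost function} = \text{Loss} + \lambda \sum_i w_i^2$$
Here, lambda ($\lambda$) is the regularization parameter, a hyperparameter that controls the strength of the penalty, and the regularization term is the sum of the squares of all the weights $w_i$ (some formulations also scale this term by a constant). The L2 technique forces the weights to shrink towards zero but does not make them exactly zero. Also referred to as ridge regularization, this technique performs best when all the input features influence the output and all the weights are of almost equal size.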
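To make the L2 cost function concrete, here is a minimal sketch in plain Python. The function names (`l2_penalty`, `l2_cost`, `l2_gradient_step`), the example weights, and the unscaled form of the penalty (lambda times the sum of squared weights) are illustrative assumptions, not something the article prescribes.

```python
def l2_penalty(weights, lam):
    """Return lam * sum of squared weights (the L2 regularization term)."""
    return lam * sum(w * w for w in weights)


def l2_cost(loss, weights, lam):
    """Cost function = Loss + L2 regularization term."""
    return loss + l2_penalty(weights, lam)


def l2_gradient_step(weights, loss_grads, lam, lr):
    """One gradient-descent step on the regularized cost.

    The penalty adds 2 * lam * w to each gradient, so every weight is
    pulled towards zero in proportion to its own size.
    """
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, loss_grads)]


# Hypothetical weights and loss value, chosen only for illustration.
weights = [0.5, -1.0, 2.0]
print(l2_cost(loss=1.0, weights=weights, lam=0.1))

# With a zero loss gradient, the step does nothing but shrink the weights.
print(l2_gradient_step(weights, [0.0, 0.0, 0.0], lam=0.1, lr=0.5))
```

Because the pull on each weight is proportional to the weight itself, the update shrinks weights multiplicatively. This is one way to see why L2 makes weights small without making them exactly zero.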
In the L1 regularization technique, the cost function is
$$\text{Cost function} = \text{Loss} + \lambda \sum_i |w_i|$$
Here, the absolute values of the weights are penalised. Unlike L2 regularization, where weights are never reduced to exactly zero, L1 can drive weights all the way to zero. This makes the technique useful when the aim is to compress the model. Also called Lasso regularization, it assigns zero weight to insignificant input features and non-zero weights to useful ones.
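The sparsity effect of L1 can be illustrated with a short sketch. The article does not say how the L1 cost is optimised, so the example below assumes soft-thresholding (the proximal update commonly paired with an L1 penalty). The names `l1_penalty` and `soft_threshold`, and the example weights, are hypothetical.

```python
def l1_penalty(weights, lam):
    """Return lam * sum of absolute weights (the L1 regularization term)."""
    return lam * sum(abs(w) for w in weights)


def soft_threshold(w, threshold):
    """Shrink w towards zero by `threshold`, clamping small values to exactly 0."""
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0


# Hypothetical weights: two large (useful) and two tiny (insignificant).
weights = [0.05, -1.0, 2.0, -0.02]
lam, lr = 0.1, 1.0

print(l1_penalty(weights, lam))

# Every weight moves towards zero by lr * lam; weights smaller than that
# amount become exactly zero, which is what makes the model sparse.
shrunk = [soft_threshold(w, lr * lam) for w in weights]
print(shrunk)
```

Unlike the L2 update, the shrinkage here is a fixed amount per step rather than a fraction of the weight, so small weights reach zero and stay there.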
Dropout
Another frequently used regularization technique is dropout. During training, randomly selected neurons are turned off, or ‘dropped out’. They are temporarily prevented from influencing or activating downstream neurons in the forward pass, and no weight updates are applied to them in the backward pass.
When neurons are randomly dropped out of the network during training, the other neurons have to step in and make the predictions for the missing neurons. As a result, the network learns independent internal representations and becomes less sensitive to the specific weights of individual neurons. Such a network generalises better and is less likely to overfit.
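The mechanism can be sketched in a few lines of plain Python. This version assumes the common "inverted dropout" variant, in which the surviving activations are scaled up by 1 / keep probability so that nothing needs to change at inference time. The scaling, the function name `dropout`, and its parameters are assumptions for illustration and are not specified in the article.

```python
import random


def dropout(activations, drop_prob, training=True, rng=random):
    """Apply inverted dropout to a list of activations.

    During training each activation is zeroed with probability `drop_prob`;
    survivors are divided by the keep probability so the expected sum is
    unchanged. At inference time the input is returned untouched.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


# Hypothetical layer output; a seeded generator keeps the run repeatable.
layer_output = [0.2, 1.5, 0.7, 0.9, 1.1]
print(dropout(layer_output, drop_prob=0.5, rng=random.Random(0)))
print(dropout(layer_output, drop_prob=0.5, training=False))
```

A different random subset is dropped on every call, which is why no single neuron can be relied upon by the neurons downstream of it.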
Early Stopping
Early stopping is a kind of cross-validation strategy in which one part of the training set is held out as a validation set, and the performance of the model is gauged against this set. If the performance on the validation set gets worse, training is stopped.
The main idea behind this technique is that, while a neural network is being fitted to the training data, the model is also evaluated on unseen data (the validation set) after each iteration. If the performance on the validation set decreases or stays the same for a certain number of iterations, model training is stopped.
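The stopping rule described above can be sketched as follows. The `patience` and `min_delta` parameters, the function name, and the validation losses are hypothetical values chosen for illustration. Here performance is measured as validation loss, so "worse" means a higher number.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs. If that never
    happens, the last epoch is returned as the stopping point.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: improving at first, then stalling.
losses = [0.90, 0.70, 0.60, 0.61, 0.60, 0.62, 0.65, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

In practice the weights saved at `best_epoch` are usually the ones kept, since the epochs after it did not improve on the validation set.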
Data Augmentation
The simplest way to reduce overfitting is to increase the amount of training data, and this technique helps in doing so.
Data augmentation is a regularization technique that is generally used when the data set consists of images. It generates additional data artificially from the existing training data by making minor changes to each image, such as rotating, flipping, cropping, or blurring a few pixels. Through this technique, the model variance is reduced, which in turn decreases the generalisation error.
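A few of these transformations can be sketched on a tiny image represented as a nested list of pixel values. The helper names and the 2×3 example image are hypothetical. A real pipeline would typically use an image library, which the article does not name.

```python
import random


def flip_horizontal(image):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w window from a random position in the image."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


# Hypothetical 2x3 grayscale image.
image = [[1, 2, 3],
         [4, 5, 6]]

# One original sample becomes several training samples.
augmented = [
    image,
    flip_horizontal(image),
    rotate_90_clockwise(image),
    random_crop(image, 2, 2, rng=random.Random(0)),
]
for sample in augmented:
    print(sample)
```

Each variant keeps the content of the original image, so it can reuse the original label while still showing the model something it has not seen before.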