When to Apply L1 or L2 Regularization to Neural Network Weights?

L1 and L2 regularization can be applied to the weights of a neural network. By regularizing the weights, we can reduce the network's tendency to overfit.

In regularization, we penalize the coefficients of a model, or restrict their size, which helps a predictive model overfit less and generalize better. The same procedure can be applied to the weights of a neural network to make it more robust. In this article, we will discuss how to perform L1 and L2 regularization of neural network weights and what effect regularization has on the network. The major points that we will discuss are listed below.


Table of Contents


  1. Problem with Larger Weights
  2. Benefits of Small Weights
  3. Penalizing Large Weights
    1. Weight Size Calculation 
    2. Determine Amount of Attention
  4. Where to Use Regularization?

Let’s begin the discussion by understanding the problems with larger weights in neural networks.

Problem with Larger Weights

A neural network learns a task by adjusting its parameters, chiefly its weights, using training data and stochastic gradient descent. The longer we train, the more closely the weights adapt to the training data, and this can cause the network to overfit. As training proceeds, the weights tend to grow in size so that the network can fit the examples in the training data more closely.

The main goal of any modelling procedure is a model with low variance and low bias, one that generalizes well instead of specializing in the training data. Large weights make a network more complex, which can be a sign that it is overly specialized to the training data. Large weights also make the network unstable: small changes in the input can cause larger-than-expected changes in the output. That is why we want to keep the weights in the network small, which makes it simpler and more task-oriented.
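To make the sensitivity point concrete, here is a minimal sketch using a single linear unit. The function name, the weight values, and the inputs are illustrative assumptions, not taken from any particular network.

```python
def linear_output(weights, inputs):
    """Weighted sum of the inputs (a single linear unit without a bias term)."""
    return sum(w * x for w, x in zip(weights, inputs))


# Two hypothetical weight vectors that differ only in scale.
small_weights = [0.5, -0.3, 0.2]
large_weights = [50.0, -30.0, 20.0]

x = [1.0, 2.0, 3.0]
x_perturbed = [1.01, 2.0, 3.0]  # a tiny change in the first input only

# How far the output moves when the input changes slightly.
change_small = abs(linear_output(small_weights, x_perturbed) - linear_output(small_weights, x))
change_large = abs(linear_output(large_weights, x_perturbed) - linear_output(large_weights, x))

print("output change with small weights:", change_small)
print("output change with large weights:", change_large)
```

The same 0.01 change in the input moves the output by roughly 0.005 with the small weights and roughly 0.5 with the large ones, because the change is multiplied by the corresponding weight.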

Another problem with large weights is that they fit poorly with the way the input variables relate to the output. Often we have multiple input variables, each with some relevance to the output variable. The relevance can be higher or lower, and a network with smaller weights is better able to learn these differences in relevance than a network with larger weights.

Benefits of Small Weights

By understanding the problem with large weights, we can see the importance of small weights. We can encourage the network to use smaller weights during learning. A simple way to do this is to make the optimization process consider the size of the weights along with the loss. This is what regularization methods do: they limit the capacity of the model by adding penalties to the objective function.

More formally, the larger the weights in the network, the larger the penalty and therefore the larger the loss score. In response to the larger loss, the optimization process pushes the model toward smaller weights.
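The idea of a loss that grows with the weights can be sketched as follows. This is only an illustration: the function names, the choice of mean squared error and a squared-weight penalty, and the value of `alpha` are all assumptions, and the two weight sets are assumed to give the same predictions so that only the penalty differs.

```python
def mean_squared_error(targets, predictions):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w ** 2 for w in weights)


def penalized_loss(targets, predictions, weights, alpha):
    """Data loss plus a penalty on weight size, scaled by alpha."""
    return mean_squared_error(targets, predictions) + alpha * l2_penalty(weights)


targets = [1.0, 2.0]
predictions = [1.1, 1.9]  # assumed identical for both weight sets

small_weights = [0.5, -0.5]
large_weights = [5.0, -5.0]

loss_small = penalized_loss(targets, predictions, small_weights, alpha=0.01)
loss_large = penalized_loss(targets, predictions, large_weights, alpha=0.01)

print("loss with small weights:", loss_small)
print("loss with large weights:", loss_large)
```

With equal prediction error, the optimizer sees a lower score for the small-weight solution and so prefers it.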

Small weights are more regular and less specialized to the training data, which benefits the model's performance. The penalties we use for this purpose are called weight regularization. In statistical models, this approach of penalizing model coefficients is known as shrinkage, because the penalty makes the coefficients shrink during the optimization process.

When we apply such penalties to the weights of a neural network, the network pays less attention to irrelevant input variables, which reduces generalization error.

Penalizing Large Weights

So far, we have seen that a neural network with small weights is generally a better model. During training we do not control the growth of the weights directly, but by penalizing them we can control their size. Penalizing the model involves two parts: first, calculating the size of the weights, and second, deciding how much attention the optimization procedure should pay to the penalty. Both parts are necessary when modelling networks.

Weight Size Calculation

In neural networks, the weights are real values that can be either positive or negative, so simply summing them is not an appropriate measure of their size, because positive and negative values cancel out. Regularization can be defined as the process of restricting, or regularizing, the estimated coefficients of a model. In machine learning, regularization penalizes complex models, which helps reduce overfitting and improves performance on new inputs by encouraging small weights in the model or network. The size of the weights can be calculated in two main ways, listed as follows:

  • L1 Regularization: Here we add an L1 penalty, which is the sum of the absolute values of the coefficients or weights, to restrict their size. In regression analysis, the L1 penalty is best known from lasso regression.

[Image: the lasso cost function, with the L1 penalty term shown in a box.]

  • L2 Regularization: Here we add an L2 penalty, which is the sum of the squares of the coefficients or weights. In regression analysis, the L2 penalty is best known from ridge regression.

[Image: the ridge cost function, with the L2 penalty term shown in a box.]
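For reference, the lasso and ridge cost functions are commonly written as below. The notation is an assumption, since the article does not define its symbols: y_i is the target for example i, \hat{y}_i is the model's prediction, w_j is the j-th coefficient or weight, and \lambda is the penalty strength. The last term in each line is the penalty.

```latex
\text{Lasso:}\quad J(w) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \lvert w_j \rvert

\text{Ridge:}\quad J(w) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} w_j^{2}
```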

From these definitions, we can say that the L1 penalty pushes weights toward zero, and to exactly zero where possible, which results in sparser weights in the network. The L2 penalty shrinks weights without usually driving them all the way to zero, and because it squares each weight, it penalizes large weights more severely.
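A small sketch of the two size measures, next to a plain sum for comparison. The weight values are made up for illustration and the function names are assumptions.

```python
def l1_penalty(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w ** 2 for w in weights)


weights = [3.0, -3.0, 0.5, -0.5]  # hypothetical weights

plain_sum = sum(weights)         # positive and negative values cancel
l1_value = l1_penalty(weights)   # each weight counts by its magnitude
l2_value = l2_penalty(weights)   # large weights dominate after squaring

print("plain sum:", plain_sum)   # 0.0
print("L1 penalty:", l1_value)   # 7.0
print("L2 penalty:", l2_value)   # 18.5
```

The plain sum reports zero even though the weights are clearly not small. Under L1, a weight of 3.0 contributes 6 times as much as a weight of 0.5, while under L2 it contributes 36 times as much, which is why L2 is harsher on large weights.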

In neural networks, the L2 penalty is also known as weight decay, and it is the most commonly used approach for regularization, or weight size reduction. A third approach combines both kinds of penalty, as in elastic net regression. After this calculation, we can proceed to the next part, where we determine how much attention the optimization process pays to the penalty.
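The sketch below contrasts how one update step from each penalty alone moves a weight. It rests on several assumptions: the data loss is ignored, the L2 step is a plain gradient step on alpha * w**2, and the L1 step uses soft-thresholding (a proximal step), which is one common way to handle the absolute value and is not something the article itself prescribes. All names and values are hypothetical.

```python
def l2_decay_step(w, learning_rate, alpha):
    """Gradient step on alpha * w**2 alone: scales w by a constant factor."""
    return w - learning_rate * 2.0 * alpha * w


def l1_shrink_step(w, learning_rate, alpha):
    """Soft-thresholding step for alpha * |w|: moves w toward zero by a fixed
    amount and stops at exactly zero instead of crossing it."""
    step = learning_rate * alpha
    if w > step:
        return w - step
    if w < -step:
        return w + step
    return 0.0


weights = [2.0, -0.05, 0.01]
learning_rate, alpha = 0.1, 1.0

after_l2 = [l2_decay_step(w, learning_rate, alpha) for w in weights]
after_l1 = [l1_shrink_step(w, learning_rate, alpha) for w in weights]

print("after L2 step:", after_l2)
print("after L1 step:", after_l1)
```

Under these assumptions, the L2 step scales every weight by the same factor, so none of them becomes exactly zero, while the L1 step sets the two small weights to zero and only slightly reduces the large one.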

Determine Amount of Attention

In the above section, we saw how the L1 and L2 penalties measure the size of the weights. Once calculated, this size is added to the loss function that the network optimizes, and the result is called the loss objective function.

Normally we do not add the penalty to the loss directly. We first scale it by a hyperparameter, usually called alpha or lambda, which controls how much attention the learning process pays to the penalty.

The value of alpha typically lies between zero and one: zero means no penalty and one means the full penalty. Through this value, the hyperparameter moves the model from low bias (little penalty) to high bias (strong penalty).

Through the size of the penalty, we can control whether the model underfits or overfits the training data. If the penalty is weak, the model is allowed to overfit, and if the penalty is too strong, the model will underfit the training data.
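The effect of alpha can be sketched on the simplest possible model, a single weight fitted with an L2 penalty, where the minimizer has a closed form. The data, the function names, and the alpha values are assumptions chosen for illustration; a neural network would be trained iteratively instead.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Closed-form minimizer of sum((y - w*x)**2) + alpha * w**2 for one weight w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def training_error(xs, ys, w):
    """Mean squared error of the one-weight model on the training data."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x, so the unpenalized fit is w = 2

alphas = [0.0, 0.1, 0.5, 1.0]
fitted = [fit_ridge_1d(xs, ys, a) for a in alphas]
errors = [training_error(xs, ys, w) for w in fitted]

for a, w, e in zip(alphas, fitted, errors):
    print(f"alpha={a}: weight={w:.4f}, training error={e:.4f}")
```

As alpha grows, the fitted weight shrinks away from the value that fits the training data best and the training error rises, which is the drift toward underfitting described above.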

The vector norm of the weights is usually calculated for each layer separately instead of across the whole network. This makes both the type of regularization and the alpha value flexible: we can use the same default alpha for every layer, or choose a different alpha for each layer.
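A per-layer penalty can be sketched as below. The layer names, weights, alpha values, and the choice of L1 on one layer and L2 on another are all hypothetical, and each layer's weights are flattened into a plain list for simplicity.

```python
def l1_norm(weights):
    """Sum of absolute values of a layer's weights."""
    return sum(abs(w) for w in weights)


def l2_norm_squared(weights):
    """Sum of squared values of a layer's weights."""
    return sum(w * w for w in weights)


# Each layer carries its own penalty type and its own alpha.
layers = [
    {"name": "input", "weights": [0.5, -1.5, 0.0], "penalty": l1_norm, "alpha": 0.01},
    {"name": "hidden", "weights": [2.0, -1.0], "penalty": l2_norm_squared, "alpha": 0.001},
]

# The total penalty added to the loss is the sum of the scaled per-layer penalties.
total_penalty = sum(
    layer["alpha"] * layer["penalty"](layer["weights"]) for layer in layers
)

print("total penalty:", total_penalty)
```

Setting the same `alpha` and the same penalty function on every layer gives the default, uniform configuration.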

Where to Use Regularization 

In the above section, we saw how to regularize the weights of a neural network. There are certain conditions where applying regularization is particularly worthwhile. Some of them are listed below.

  • We can use it with any neural network, because it is a generic approach to improving model performance. It is especially recommended for LSTM models, where it is mostly used with sequential input and with recurrent connections.
  • If the input variables are on different scales, the weights will be on different scales too, which distorts a penalty based on weight size. In this case we should first update the input variables to have the same scale, for example by standardizing them, and then apply regularization.
  • Large networks tend to overfit the training data more easily, so regularization is well suited to large networks.
  • Pre-trained neural networks tend to perform best on data similar to what they were trained on. When using them with newer data or different inputs, we can apply regularization, which helps the network handle more varied data.
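On the point about input scale, one common way to bring input variables to the same scale before applying a weight penalty is standardization. The sketch below is a minimal version using only the standard library; the feature names and values are invented for illustration.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit standard deviation.
    Assumes the values are not all identical (otherwise the deviation is zero)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Two hypothetical input variables on very different scales.
ages = [20.0, 30.0, 40.0]
incomes = [20000.0, 30000.0, 40000.0]

scaled_ages = standardize(ages)
scaled_incomes = standardize(incomes)

print("scaled ages:", scaled_ages)
print("scaled incomes:", scaled_incomes)
```

After standardization both variables span the same range, so the weights attached to them are comparable in size and a penalty on weight size treats them evenly.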

As we have seen in this article, both L1 and L2 regularization are useful, and we can apply them together instead of choosing between them. In regression, the elastic net, which uses both penalties, has been successful, and we can try the same approach in neural networks. We also use small values of the hyperparameter, which controls how much each weight contributes to the penalty. Instead of assigning the hyperparameter value manually, we can use grid search to find the value that gives the best performance.
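A grid search over the hyperparameter can be sketched as follows. To keep it short and runnable, the model is a single weight with an L2 penalty and a closed-form fit; the data, the grid values, and the function names are assumptions, and with a real network each grid value would require a full training run.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Closed-form minimizer of sum((y - w*x)**2) + alpha * w**2 for one weight w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mean_squared_error(xs, ys, w):
    """Mean squared error of the one-weight model on the given data."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data (noisy) and held-out validation data.
train_x, train_y = [1.0, 2.0, 3.0], [2.5, 3.5, 6.5]
val_x, val_y = [4.0, 5.0], [8.0, 10.0]

alpha_grid = [0.0, 0.001, 0.01, 0.1, 1.0]

# Fit on the training data for each alpha, then score on the validation data.
scores = {
    alpha: mean_squared_error(val_x, val_y, fit_ridge_1d(train_x, train_y, alpha))
    for alpha in alpha_grid
}
best_alpha = min(scores, key=scores.get)

print("validation error per alpha:", scores)
print("selected alpha:", best_alpha)
```

The value is chosen by validation error, not training error, because the unpenalized model always fits the training data at least as well as any penalized one.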

Final Words

In this article, we looked at the problem that large weights cause in neural networks and saw how the L1 and L2 regularization techniques from regression analysis help address it. We also saw how to perform the regularization, and in which situations it can be used to improve the performance of a neural network.


Yugesh Verma
Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.
