Neural networks have come to our rescue many times, and the range of use cases they have been applied to is simply phenomenal. Deep learning has been discussed for decades, but computational limits pushed it to the sidelines for years. It is now back in the spotlight. Many think it appeared only a few years ago, but that isn't true.
Computation was not the only obstacle; neural networks had other issues too.
One of them is the exploding gradient.
In this article you will learn about:
- What is the exploding gradient and what issues does it cause?
- How to identify it?
- How to rectify it?
What is exploding gradient and how does it hamper us?
The problem is easiest to understand in a recurrent neural network (RNN). If you are not familiar with RNNs, you can think of one as a neural network that feeds its own hidden state back into itself at every time step. Here, feedback means that the state from one step is passed, through a weight, into the next step.
[Figure: recurrent neural network diagram. Source: ResearchGate.]
In gradient descent, we try to find the global minimum of the cost function, which would be the optimal solution for us.
In the figure, information flows forward from x1 to y3. In between are h0, h1 and so on, which are the hidden layers. These hidden layers apply weights, referred to as w, and biases. During backpropagation, the error travels from y3 back to x1 and has to pass through the hidden layers. With every iteration, the weights are updated. In an RNN, a hidden layer is also connected to itself from one time step to the next through a weight. That weight is called Wrec, the recurrent weight.
During backpropagation, the error at y3 is multiplied by the weight at h2, the result is passed to h1 and multiplied by the weight there, and so on. The key point is that if the weights being multiplied are less than 1, the value shrinks with every step. Similarly, if the weights are greater than 1, the value grows exponentially.
So, for the value to stay unchanged, the weight has to be equal to 1.
When the weights are larger than 1, the problem is called the exploding gradient, because the gradient grows so large that it disrupts the gradient descent algorithm. When the weights are less than 1, it is called the vanishing gradient, because the gradient becomes very small over time. With an exploding gradient, the weight updates become exponentially large, which destabilises training and hurts accuracy. A network with this problem won't be able to learn from its training data. This is a serious problem.
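The effect of repeated multiplication can be seen in a few lines of Python. This is a minimal sketch, assuming a single scalar recurrent weight and ignoring activation functions; the function name `scale_through_time` is made up for illustration and is not part of any library.

```python
def scale_through_time(w_rec, steps, grad=1.0):
    """Multiply a starting gradient by the same recurrent weight once per step.

    Assumption: one scalar weight, no activation derivatives. Real RNNs
    multiply by matrices, but the shrinking/growing behaviour is analogous.
    """
    for _ in range(steps):
        grad *= w_rec
    return grad


if __name__ == "__main__":
    # A weight below 1 shrinks the gradient, exactly 1 preserves it,
    # and a weight above 1 makes it grow exponentially.
    for w_rec in (0.5, 1.0, 1.5):
        print(f"w_rec={w_rec}: gradient after 50 steps = "
              f"{scale_through_time(w_rec, steps=50):.3e}")
```

Even a weight only modestly above 1 produces a very large value after a few dozen steps, which is why long sequences are especially prone to the problem.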
How to identify exploding gradients?
There are a few ways by which you can get an idea of whether your model is suffering from exploding gradients or not. They are:
- The model weights become unexpectedly large by the end of training.
- The model has a poor loss.
- The model displays a NaN loss during training.
- The gradient of the error stays above 1.0 for every subsequent iteration during training.
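The checks above can be sketched as a small monitoring helper that you call once per training step. This is only an illustration: the names `global_grad_norm` and `check_training_step` are hypothetical, the gradients are assumed to be available as a flat list of floats, and the default threshold of 1.0 simply mirrors the rule of thumb in the list.

```python
import math


def global_grad_norm(grads):
    """Euclidean (L2) norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))


def check_training_step(loss, grads, norm_threshold=1.0):
    """Return a list of warnings for one training step (empty if all is fine).

    Assumption: `grads` is a flat list of floats; in a real framework you
    would first collect the gradients of every parameter.
    """
    warnings = []
    if math.isnan(loss) or math.isinf(loss):
        warnings.append("loss is NaN or infinite")
    norm = global_grad_norm(grads)
    if math.isnan(norm) or norm > norm_threshold:
        warnings.append(f"gradient norm {norm} exceeds {norm_threshold}")
    return warnings


if __name__ == "__main__":
    print(check_training_step(0.3, [0.1, 0.2]))          # healthy step
    print(check_training_step(float("nan"), [3.0, 4.0]))  # both warnings
```

A single large norm is not necessarily a problem; the warning sign is a norm that stays high, or keeps growing, step after step.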
How to deal with an exploding gradient?
- Use LSTMs (long short-term memory)
LSTMs store information in a memory cell and carry it forward, so each step can draw on the values from previous iterations. In effect, the value of Wrec is set to 1, so repeated multiplication no longer scales the gradient up or down.
In the LSTM diagram, the symbol σ stands for the sigmoid activation function and tanh for the hyperbolic tangent activation function. The value that comes out at ht is the final output value.
Xt is the value fed into the cell at step t, that is, the input vector.
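To make the roles of σ, tanh, Xt and ht concrete, here is a sketch of one step of a standard LSTM cell, reduced to scalars. The scalar form, the function names and the parameter dictionary keys (`wf`, `uf`, `bf` and so on) are assumptions made for illustration; real implementations use vectors and weight matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell (illustrative sketch only).

    `p` maps hypothetical keys such as "wf", "uf", "bf" to the input weight,
    recurrent weight and bias of each gate: f, i, o and the candidate g.
    """
    def gate(name, activation):
        z = p["w" + name] * x_t + p["u" + name] * h_prev + p["b" + name]
        return activation(z)

    f = gate("f", sigmoid)    # forget gate: how much old memory to keep
    i = gate("i", sigmoid)    # input gate: how much new information to add
    o = gate("o", sigmoid)    # output gate: how much memory to expose
    g = gate("g", math.tanh)  # candidate value computed from x_t and h_prev

    c_t = f * c_prev + i * g  # memory is updated additively
    h_t = o * math.tanh(c_t)  # h_t is the output of the cell at this step
    return h_t, c_t


if __name__ == "__main__":
    zeros = {k + n: 0.0 for k in "wub" for n in "fiog"}
    print(lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=zeros))
```

Because the memory update is additive, the factor that carries the old memory forward is the forget gate `f`. When that gate stays close to 1, the memory passes through almost unchanged, which is the sense in which the recurrent weight behaves like 1.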
There is a lot more to it, which you can read about here.
- Gradient Clipping
In simple terms, gradient clipping limits the size of the gradient to a range of acceptable values.
It is done before the gradient descent update step takes place.
You can read more about gradient clipping in the research paper here.
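Two common variants are clipping each gradient value to a fixed range and rescaling the whole gradient when its norm is too large. The sketch below shows both in plain Python, assuming the gradients are a flat list of floats; the function names and the limit of 1.0 are chosen for illustration.

```python
import math


def clip_by_value(grads, limit):
    """Clamp every gradient value to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


def clip_by_norm(grads, max_norm):
    """Rescale the gradient so its L2 norm is at most max_norm.

    The direction of the gradient is preserved; only its length changes.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


if __name__ == "__main__":
    grads = [3.0, -4.0]
    print(clip_by_value(grads, limit=1.0))
    print(clip_by_norm(grads, max_norm=1.0))
    # The clipped gradient is what would then be used in the update,
    # for example: w = w - learning_rate * clipped_gradient
```

Clipping by norm keeps the relative sizes of the individual gradients, whereas clipping by value can change the direction of the update.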
- Weight Regularization
Here, we penalise large weights by adding a regularisation term to the network's loss function.
We use L1 regularisation, which adds the absolute value of the weights to the loss, or L2 regularisation, which adds the square of the weights.
Both techniques, L1 and L2, can be used to control exploding gradients. You can read more in the research paper here.
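As a rough sketch, the two penalties can be written as follows. The strength parameter `lam` and the function names are assumptions for illustration; libraries differ in how they scale the penalty (some include a factor of one half for L2, for example).

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Add the chosen penalty to the loss computed on the data."""
    if kind == "l1":
        return data_loss + l1_penalty(weights, lam)
    if kind == "l2":
        return data_loss + l2_penalty(weights, lam)
    raise ValueError(f"unknown regularisation kind: {kind!r}")


if __name__ == "__main__":
    weights = [1.0, -2.0]
    print(regularized_loss(1.0, weights, lam=0.1, kind="l1"))
    print(regularized_loss(1.0, weights, lam=0.1, kind="l2"))
```

Because large weights increase the loss, the optimiser is pushed towards smaller weights, which in turn keeps the repeated multiplications during backpropagation from growing as quickly.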
This article discussed an issue that can arise during backpropagation while training a neural network. The issue is called the exploding gradient when the recurrent weight is greater than 1, and the vanishing gradient when the recurrent weight is less than 1. We also discussed how to identify exploding gradients by observing the loss and the weights of the model.
We closed the article with a few solutions to the problem. LSTM is one of the most widely used; apart from that, we discussed gradient clipping and regularization techniques.
Hope you liked the article.