Deep neural networks are vulnerable to the problem of vanishing and exploding gradients. This is especially true for Recurrent Neural Networks (RNNs), which are commonly used. Because RNNs are typically used in situations requiring short-term memory, their gradients can easily explode during training, resulting in unexpected outcomes such as NaN values or a model that fails to converge to the desired point. To reduce this effect, various methods, such as regularizers, are used. Of all those methods, we will focus on gradient clipping in this article and attempt to understand it both theoretically and practically. The major points to be discussed in this article are listed below.
Table Of Contents
- The Exploding Gradient Problem
- What is Gradient Clipping?
- How to Use Gradient Clipping?
- Implementing Gradient Clipping
Let’s start the discussion by understanding the problem and its causes.
The Exploding Gradient Problem
The exploding gradient problem is a problem that arises when using gradient-based learning methods and backpropagation to train artificial neural networks. An artificial neural network, also known as a neural network or a neural net, is a learning algorithm that employs a network of functions to comprehend and translate data input into a specific output. This type of learning algorithm aims to replicate the way neurons in the human brain work.
When large error gradients accumulate, exploding gradients occur, resulting in very large updates to neural network model weights during training. Gradients are used to update the network weights during training, but this process typically works best when the updates are small and controlled. When the magnitudes of the gradients add up, an unstable network is likely to form, resulting in poor prediction results or even a model that reports nothing useful at all.
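To make the accumulation concrete, here is a minimal sketch under a strong simplifying assumption: a scalar, linear recurrence h_t = w * h_(t-1), which is not the LSTM used later in this article. The function name `backprop_factor` and the chosen weights are hypothetical and only illustrate how the chain rule multiplies one factor per time step.

```python
def backprop_factor(w, steps):
    """Return d h_T / d h_0 for the linear recurrence h_t = w * h_(t-1)."""
    factor = 1.0
    for _ in range(steps):
        factor *= w  # chain rule: one multiplication by w per time step
    return factor


# Illustrative weights: below, at, and above 1.0
for w in (0.5, 1.0, 1.5):
    print(f"w={w}: gradient factor after 50 steps = {backprop_factor(w, 50):.3e}")
```

With a weight above 1.0 the factor grows exponentially with the number of time steps (exploding), and with a weight below 1.0 it shrinks toward zero (vanishing).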
In the training of artificial neural networks, exploding gradients can cause issues. When gradients explode, the network becomes unstable, and the learning cannot be completed. The weights’ values can also grow to the point where they overflow, resulting in NaN values.
NaN stands for “not a number” and denotes a value that is undefined or unrepresentable. To correct the training, it helps to know how to spot exploding gradients. They are a common occurrence in recurrent networks, because gradient-based learning there has to backpropagate through long sequences. There are techniques for repairing exploding gradients, such as gradient clipping and weight regularization, among others. In this post, we will take a look at the gradient clipping method.
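One possible way to spot the problem is to inspect gradient values for NaN, infinity, or an unusually large norm. The sketch below uses only the Python standard library; the helper `gradient_status` and its `norm_limit` alert threshold are hypothetical names chosen for illustration, not part of any framework.

```python
import math


def gradient_status(grads, norm_limit=1e3):
    """Classify a flat list of gradient values.

    norm_limit is an illustrative alert threshold, not a recommended value.
    """
    if any(math.isnan(g) for g in grads):
        return "nan"
    if any(math.isinf(g) for g in grads):
        return "overflow"
    norm = math.sqrt(sum(g * g for g in grads))
    return "exploding" if norm > norm_limit else "ok"


huge = 1e308 * 10  # exceeds the float64 range, so the result is inf

print(gradient_status([0.1, -0.2]))    # ok
print(gradient_status([5e3, 1.0]))     # exploding
print(gradient_status([huge, 1.0]))    # overflow
print(gradient_status([huge - huge]))  # nan (inf - inf is undefined)
```

The last two calls show the path from overflow to NaN: once a value becomes infinite, ordinary arithmetic on it can produce NaN.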
What is Gradient Clipping?
Gradient clipping is a technique for preventing exploding gradients, most often discussed in the context of recurrent neural networks. Gradient clipping can be done in a variety of ways, but one of the most common is to rescale gradients so that their norm is at most a certain value. This involves introducing a pre-determined threshold and then scaling down any gradient whose norm exceeds it so that its norm matches the threshold.
This ensures that no gradient has a norm greater than the threshold. Although clipping introduces a bias in the resulting values, it can keep training stable.
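The rescaling rule can be written compactly. The symbols are assumptions introduced here for illustration: g is the gradient vector, c is the threshold, and \hat{g} is the clipped gradient.

```latex
\hat{g} =
\begin{cases}
g, & \lVert g \rVert_2 \le c \\[4pt]
\dfrac{c}{\lVert g \rVert_2}\, g, & \lVert g \rVert_2 > c
\end{cases}
\qquad
\text{e.g. } g = (3, 4),\; c = 1:\quad
\lVert g \rVert_2 = 5,\quad \hat{g} = \tfrac{1}{5}(3, 4) = (0.6, 0.8)
```

In the example the direction of the gradient is unchanged and only its length shrinks, from 5 to 1.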
It can be difficult to train recurrent neural networks. Vanishing gradients and exploding gradients are two common problems when training them. When error gradients accumulate and become too large, the network becomes unstable.
Vanishing gradients, by contrast, leave optimization stuck at a certain point because the gradient is too small to make progress. Gradient clipping targets the exploding case: it prevents oversized gradients from messing up the parameters during training.
In general, exploding gradients can be avoided by carefully configuring the network model, such as using a small learning rate, scaling the target variables, and using a standard loss function. However, in recurrent networks with a large number of input time steps, exploding gradients may still be an issue.
How to Use Gradient Clipping?
Changing the error derivative before propagating it back through the network and using it to update the weights is a common solution to exploding gradients. By rescaling the error derivative, the updates to the weights are also rescaled, reducing the likelihood of an overflow or underflow dramatically.
Gradient scaling is the process of normalizing the error gradient vector so that the vector norm (magnitude) equals a predefined value, such as 1.0. Gradient clipping is the process of forcing gradient values (element-by-element) to a specific minimum or maximum value if they exceed an expected range. These techniques are frequently referred to collectively as “gradient clipping.”
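The difference between the two techniques is easiest to see on a single vector. The following is a minimal sketch in plain Python; the function names `scale_to_norm` and `clip_values` and the sample gradient are hypothetical, chosen only for illustration.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def scale_to_norm(v, max_norm):
    """Rescale the whole vector if its L2 norm exceeds max_norm."""
    norm = l2_norm(v)
    if norm <= max_norm:
        return list(v)
    return [x * max_norm / norm for x in v]


def clip_values(v, limit):
    """Clamp every element independently to the range [-limit, limit]."""
    return [max(-limit, min(limit, x)) for x in v]


def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))


grad = [3.0, -0.2]
scaled = scale_to_norm(grad, 1.0)
clipped = clip_values(grad, 0.5)

print("scaled :", scaled, "cosine with original:", cosine(grad, scaled))
print("clipped:", clipped, "cosine with original:", cosine(grad, clipped))
```

Norm scaling shrinks the vector but keeps its direction, while element-wise clipping only touches the components outside the range and can therefore change the direction of the update.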
It is common practice to use the same gradient clipping configuration for all network layers. Nonetheless, there are some cases where a wider range of error gradients is permitted in the output layer than in the hidden layer.
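A per-layer configuration could be sketched as a simple lookup table. The layer names, limits, and gradient values below are made up for illustration and do not correspond to any framework API.

```python
def clip_values(grads, limit):
    """Clamp every gradient value to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


# Hypothetical layer names and limits: the output layer gets a wider range
clip_limits = {"hidden": 0.5, "output": 1.0}
layer_grads = {"hidden": [2.0, -0.1, -3.0], "output": [0.8, -1.7]}

clipped = {
    name: clip_values(grads, clip_limits[name])
    for name, grads in layer_grads.items()
}
print(clipped)
```

Using one shared limit for every layer is the special case where all entries of `clip_limits` are equal.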
Implementing Gradient Clipping
We now understand why exploding gradients occur and how gradient clipping can help to resolve them. We also saw two different methods for applying clipping to your deep neural network. Let’s look at how both methods are exposed in a major machine learning framework, TensorFlow.
We will use the Fashion MNIST dataset, which is an open-source data set of clothing images designed for image classification.
Gradient clipping is simple to implement in TensorFlow models. All you have to do is pass the relevant parameter to the optimizer. To clip the gradients, Keras optimizers accept ‘clipnorm’ and ‘clipvalue’ parameters.
Before proceeding further, we quickly discuss how the clipnorm and clipvalue parameters work.
Gradient norm scaling entails modifying the derivatives of the loss function to have a specified vector norm when the gradient vector’s L2 norm (the square root of the sum of squared values) exceeds a threshold value. For example, we may provide a norm of 1.0, which means that if the vector norm for a gradient exceeds 1.0, the vector values will be rescaled so that the vector norm equals 1.0.
Gradient value clipping entails clipping the derivatives of the loss function to a specific value if a gradient value falls below a negative threshold or rises above a positive one. For instance, we may define a clip value of 0.5, which means that if a gradient value is less than -0.5, it is set to -0.5, and if it is greater than 0.5, it is set to 0.5.
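One detail worth knowing is which vector the norm is computed over. As the Keras optimizer documentation describes it, ‘clipnorm’ is applied to the gradient of each weight individually, and a separate ‘global_clipnorm’ argument uses the joint norm of all gradients. The plain-Python sketch below only illustrates that distinction; the function names are hypothetical and it is not the Keras implementation.

```python
import math


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def clip_each_by_norm(tensors, max_norm):
    """Rescale every gradient tensor on its own (per-weight clipping)."""
    result = []
    for t in tensors:
        norm = l2_norm(t)
        scale = max_norm / norm if norm > max_norm else 1.0
        result.append([v * scale for v in t])
    return result


def clip_by_global_norm(tensors, max_norm):
    """Rescale all tensors by one shared factor based on their joint norm."""
    global_norm = l2_norm([v for t in tensors for v in t])
    scale = max_norm / global_norm if global_norm > max_norm else 1.0
    return [[v * scale for v in t] for t in tensors]


grads = [[3.0, 4.0], [0.3, 0.4]]  # individual norms: 5.0 and 0.5

print(clip_each_by_norm(grads, 1.0))    # only the first tensor is rescaled
print(clip_by_global_norm(grads, 1.0))  # both tensors shrink by the same factor
```

Per-weight clipping changes the relative size of the two gradients, whereas global clipping preserves it.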
Now that we understand the role of these parameters, let’s start the implementation by importing the necessary package and submodule.
    import tensorflow as tf
    from tensorflow.keras.datasets import fashion_mnist
Next, load the Fashion MNIST dataset and pre-process it so that the TF model can handle it.
    # load the data
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    # make compatible for tensorflow: scale pixel values to [0, 1]
    x_train, x_test = x_train / 255., x_test / 255.
    # build a shuffled, batched input pipeline
    train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    train_data = train_data.repeat().shuffle(5000).batch(32).prefetch(1)
Now we will define and compile the model without gradient clipping. Here I’m intentionally limiting the number of layers and the number of neurons in each layer so as to reproduce the unstable training behavior.
    # build a model
    model = tf.keras.models.Sequential([
        tf.keras.layers.LSTM(10, input_shape=(28, 28)),
        tf.keras.layers.Dense(10)
    ])
    # compile the model
    model.compile(
        # plain SGD, no gradient clipping yet
        optimizer=tf.keras.optimizers.SGD(),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
    )
Next, we’ll fit the model and observe the loss and accuracy movement.
[Figure: training loss and accuracy without gradient clipping]
As we can see, after training for a few epochs the model is struggling to reduce the loss and to improve accuracy. Now let’s see whether gradient clipping makes any difference here.
As we discussed earlier, to implement gradient clipping we need to set the desired method inside the optimizer. Here I’m going with the clipvalue method, replacing the optimizer argument in the model.compile call above.
    # inside the optimizer we are doing clipping
    optimizer=tf.keras.optimizers.SGD(clipvalue=0.5)
Next, we’ll train the model with gradient clipping and observe the loss and accuracy.
This shows that clipping gradient values can improve the training performance of the model.
Clipping the gradients can speed up training by allowing the model to converge more quickly, which means that training reaches a minimum error rate faster. Because the error diverges as the gradients explode, no minimum, global or local, can be reached. When the exploding gradients are clipped, the errors begin to converge to a minimum point.
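The effect can be reproduced on a toy problem without any framework. The sketch below is an assumption-laden illustration, not the LSTM experiment above: it runs plain gradient descent on f(x) = x**4 with a deliberately large learning rate, once without clipping and once with a hypothetical clip value of 10.0.

```python
import math


def grad(x):
    """Derivative of the toy loss f(x) = x**4."""
    return 4.0 * x * x * x


def descend(x, lr, steps, clip=None):
    """Gradient descent with optional clipping of the gradient value."""
    for _ in range(steps):
        g = grad(x)
        if clip is not None:
            g = max(-clip, min(clip, g))  # clamp to [-clip, clip]
        x -= lr * g
    return x


unclipped = descend(5.0, lr=0.1, steps=20)
clipped = descend(5.0, lr=0.1, steps=20, clip=10.0)

print("without clipping:", unclipped)  # overshoots, overflows, ends as nan
print("with clipping   :", clipped)    # moves steadily toward the minimum at 0
```

Without clipping, each update overshoots further than the last until the value overflows and becomes NaN. With clipping, the early steps are bounded and the iterate approaches the minimum.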
This post has discussed what exploding gradients are and why they happen. To counter this effect, we discussed a technique known as gradient clipping and saw how it can address the problem both theoretically and practically.