Training neural networks faster is one of the important concerns in deep learning. The difficulty usually comes from the complex architecture of the networks and the large number of parameters they use. As the size of the data, the network and its weights increases, the training time of a model also increases, which is a problem for modellers and practitioners. In this article, we are going to discuss some tips and tricks that can speed up the training of neural networks. The major points to be discussed in the article are listed below.
Table of contents
- Multi-GPU Training
- Learning rate scaling
- Cyclic learning rate schedules
- Mix up training
- Label smoothing
- Transfer learning
- Mixed precision training
Let’s start by discussing how Multi-GPU training can improve the speed of learning.
Multi-GPU Training
This tip is purely about speeding up training and has no direct connection with the performance of the model. It can be costly, but it is very effective. Even a single GPU makes the training of a neural network faster, but adding more GPUs brings more benefit. If you cannot install a GPU in your own system, Google Colab notebooks provide online access to GPUs and TPUs.
Training on multiple GPUs distributes the data across them: each GPU holds a copy of the network weights and learns from its own mini-batch of the data. For example, if we have a batch size of 8192 and 256 GPUs, every GPU gets a mini-batch of size 32, i.e. 32 samples on which to train the network. This makes training the network faster.
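As a rough NumPy sketch of why this works (the sizes and the toy regression problem here are invented for illustration), each simulated "GPU" computes a gradient on its own shard of the batch, and averaging those gradients is equivalent to one step on the whole batch:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(4)                              # weights, replicated on every GPU
X = rng.normal(size=(8, 4))                  # global batch of 8 samples
y = X @ np.array([1.0, -2.0, 0.5, 3.0])      # toy regression targets

def grad(w, Xb, yb):
    """Gradient of mean squared error on one mini-batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Split the global batch across 4 simulated GPUs (2 samples each)
shards = zip(np.split(X, 4), np.split(y, 4))
avg_grad = np.mean([grad(w, Xb, yb) for Xb, yb in shards], axis=0)

# The averaged per-GPU gradients match the single full-batch gradient
assert np.allclose(avg_grad, grad(w, X, y))
```

In a real setup a framework feature such as distributed data parallelism performs this gradient averaging across devices for you; the sketch only shows the arithmetic behind it.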
Learning rate scaling
This is another tip that can help us enhance the speed of training neural networks. In general, training a neural network with a large batch size tends to lower validation accuracy. In the above section, we have seen that applying multiple GPUs distributes the batch across devices to prevent slow training of the network.
To compensate for the averaging effect that the large mini-batch has on the gradients, we can also scale the learning rate along with the batch size. For example, if we increase the batch size 4 times when training over four GPUs, we can multiply the learning rate by 4 as well to keep the effective step size and speed up training.
This leads to learning rate warmup, a simple strategy for reaching high learning rates safely. At the very start, we train with a small learning rate and increase it to the preset (scaled) value during a warm-up phase; the small learning rate is used through the first few epochs. After that, the learning rate decays as usually happens in standard training.
Both of these tricks are useful when training the network in a distributed fashion across multiple GPUs. With learning rate warmup, harder-to-train models can be stabilised regardless of the batch size and the number of GPUs we are using.
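The two tricks together can be sketched as a tiny schedule function (the base rate, GPU count and warm-up length below are illustrative, and the usual post-warm-up decay is omitted for brevity):

```python
# Linear scaling rule with warmup
base_lr = 0.1
num_gpus = 4
scaled_lr = base_lr * num_gpus               # batch grew 4x, so scale lr by 4

def lr_at(epoch, warmup_epochs=5):
    """Ramp linearly from base_lr to scaled_lr, then hold (decay omitted)."""
    if epoch < warmup_epochs:
        return base_lr + (scaled_lr - base_lr) * epoch / warmup_epochs
    return scaled_lr

assert lr_at(0) == base_lr                   # warm start at the small rate
assert lr_at(5) == scaled_lr                 # reaches the scaled rate
```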
Cyclic Learning Rate Schedules
Learning rate schedules come in various types; one of them is the cyclic learning rate schedule, which helps increase the speed of training neural networks. It works by increasing and decreasing the learning rate in cycles between predefined upper and lower bounds. In some of these schedules, the upper bound decreases as the training procedure progresses.
One-cycle learning rate schedules are a variant of cyclic learning rate schedules that increase and decrease the learning rate only once during the entire training process. We can consider this similar to the learning rate warmup discussed in the section above. The same idea can also be applied to the momentum parameter of the optimizer, but in the reverse order.
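A common concrete instance is the triangular policy, where the rate bounces linearly between the two bounds; a minimal sketch (the bounds and cycle length are illustrative defaults, not prescriptions):

```python
def triangular_lr(step, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Triangular cyclic schedule: lr cycles between base_lr and max_lr,
    rising for step_size steps and falling for the next step_size steps."""
    cycle = step // (2 * step_size)
    x = abs(step / step_size - 2 * cycle - 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

assert triangular_lr(0) == 0.001                     # lower bound at cycle start
assert abs(triangular_lr(2000) - 0.006) < 1e-12      # upper bound at mid-cycle
assert triangular_lr(4000) == 0.001                  # back down at cycle end
```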
Mix up training
This is a very simple tip, also called mixup, which mainly works with networks in the field of computer vision. The idea comes from the mixup paper by Zhang et al. Mixup helps avoid overfitting and reduces the sensitivity of models to adversarial data. We can also think of it as a data augmentation technique that randomly blends input samples. Digging deeper, the trick takes a pair of data samples and generates new samples by computing a weighted average of both the inputs and the outputs.
One of our articles explains the blending process utilised by mixup. For example, in an image classification task, mixup blends two input images and uses the same blending parameter to compute a weighted average of the two output labels.
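A minimal NumPy sketch of that blending, with the mixing weight drawn from a Beta distribution as in the mixup paper (the toy "images" and the alpha value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend a pair of samples and their one-hot labels with one Beta weight."""
    lam = rng.beta(alpha, alpha)             # blending parameter in [0, 1]
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Two toy "images" with one-hot labels for a 3-class problem
img_a, lab_a = np.full((4, 4), 0.2), np.array([1.0, 0.0, 0.0])
img_b, lab_b = np.full((4, 4), 0.8), np.array([0.0, 1.0, 0.0])
x_mix, y_mix = mixup(img_a, lab_a, img_b, lab_b)

assert np.isclose(y_mix.sum(), 1.0)          # soft label still sums to one
assert x_mix.shape == img_a.shape            # blended input keeps its shape
```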
Label smoothing
Label smoothing is a general technique to speed up the training process of neural networks. A normal classification dataset consists of labels that are one-hot encoded, where the true class has the value one and the other classes have the value zero. A softmax function, however, never outputs an exact one-hot vector, which creates a gap between the distribution of the ground-truth labels and the model's predictions.
Applying label smoothing reduces this gap between the distribution of the ground-truth labels and the model predictions. In label smoothing, we subtract a small epsilon from the true label and distribute it over the other classes. This prevents the model from overfitting and works as a regularizer. One thing to notice here is that if the value of epsilon becomes very large, the labels get flattened too much.
Strong label smoothing retains less of the information from the labels. The effect of label smoothing can also be seen in training speed, because the model learns faster from soft targets, which are a weighted average of the hard targets and the uniform distribution over labels. This method can be utilised in various modelling tasks such as image classification, language translation and speech recognition.
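The standard form of this weighted average is easy to write down; a small NumPy sketch (the class count and epsilon are illustrative):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Take eps of the probability mass off the true class and spread it
    uniformly over all classes: (1 - eps) * one_hot + eps / K."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

labels = np.eye(4)[[0, 2]]                   # two one-hot labels, 4 classes
soft = smooth_labels(labels)

assert np.allclose(soft.sum(axis=-1), 1.0)   # still valid distributions
assert np.isclose(soft[0, 0], 0.925)         # true class: 1 - eps + eps/K
assert np.isclose(soft[0, 1], 0.025)         # other classes: eps/K
```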
Transfer learning
Transfer learning can be explained as a process where a model starts its training from weights transferred from another model instead of training from scratch. This is a very good way to speed up the training of a model, and it can also improve performance, because the weights we start from have already been trained, so a huge amount of training time can be cut from the whole process. One of our articles explains how this type of learning can be performed.
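The usual pattern is to freeze the transferred weights and train only a new task-specific head; here is a toy NumPy sketch of that split (the "pretrained" weights are just random stand-ins, and the sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for downloaded pretrained weights (purely illustrative)
backbone_w = rng.normal(size=(8, 4))         # frozen feature extractor
head_w = np.zeros((4, 2))                    # new task head, trained from zero
backbone_before = backbone_w.copy()

def features(x):
    """Frozen backbone: ReLU features from the pretrained weights."""
    return np.maximum(x @ backbone_w, 0.0)

# One gradient step on the head only; the backbone is never updated
x = rng.normal(size=(16, 8))
y = rng.normal(size=(16, 2))
f = features(x)
head_w -= 0.01 * f.T @ (f @ head_w - y) / len(y)

assert np.array_equal(backbone_w, backbone_before)   # backbone untouched
assert not np.allclose(head_w, 0.0)                  # head started learning
```

In practice, a deep learning framework does the same thing by loading published pretrained weights and marking the backbone parameters as non-trainable.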
Mixed precision training
We can define this type of training as making a model use both 16-bit and 32-bit floating-point numbers so that it learns faster and the training time is reduced. Let's take the example of a simple neural network that needs to detect objects in images. Training such a model means finding the network weights that allow it to perform object detection on the data. These weights are typically stored in a 32-bit format.
Training involves forward and backward propagation, which requires billions of multiplications when the numbers are in 32 bits. We can avoid using 32 bits to represent every number during training. During backpropagation, however, the gradients the network computes can be very small values, and such numbers underflow to zero when stored with fewer bits. We can scale up (inflate) these gradients so that they survive the lower-precision representation.
Representing the numbers with 16 bits saves a lot of the model's memory, and the training program runs faster than before. This procedure can be described as training the model while the arithmetic operations use very few bits. The only thing that needs to be watched in such training is accuracy, which can drop significantly if everything is kept in low precision.
As explained above, we can train the model with both 16-bit and 32-bit numbers using mixed-precision training, which maintains a master copy of the actual weight parameters in the original 32-bit precision format. This master copy is the set of weights we use with the model after training.
During training, we convert the 32-bit weights into 16-bit precision and perform the forward propagation, with all the arithmetic operations using less memory; the computed loss is then scaled up before being fed into backpropagation, so the gradients are scaled up as well.
In backpropagation, we compute 16-bit gradients, and the final, unscaled gradients are applied to the 32-bit master copy of the weights. This is one iteration, and the loop repeats for many iterations. The image below explains the whole procedure that we have discussed.
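The underflow-and-rescale part of the loop can be demonstrated directly with NumPy dtypes (the weight values, gradient magnitudes and scale factor below are invented for illustration):

```python
import numpy as np

master_w = np.array([0.5, -1.5], dtype=np.float32)   # 32-bit master copy

# Small gradients underflow to zero in half precision...
tiny_grad = np.array([1e-8, 2e-8], dtype=np.float32)
assert np.all(tiny_grad.astype(np.float16) == 0)

# ...so the loss (and hence the gradients) is scaled up before the
# 16-bit backward pass, then unscaled in 32 bits before the update.
scale = np.float32(1024.0)
grad16 = (tiny_grad * scale).astype(np.float16)      # survives in float16
assert np.all(grad16 != 0)

grad32 = grad16.astype(np.float32) / scale           # unscale in 32 bits
master_w -= np.float32(0.1) * grad32                 # update the master copy
```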
In this article, we have discussed tricks and tips that can be utilised to speed up the training of neural networks. Since training time and model performance are major factors to consider in modelling, we should try to employ these techniques in our processes.