One of the most popular applications of computer vision is image classification, which uses a pre-trained and optimised model to assign images to one of hundreds of object classes, including people, animals, places and more. For several years now, the technique has been used across sectors such as healthcare, finance and e-commerce to identify and describe specific features in images.
Below, we have compiled eight interesting tricks and techniques, in roughly alphabetical order, that can be used to improve the results and accuracy of an image classification model.
Cosine Learning Rate Decay
Cosine learning rate decay lowers the learning rate along a cosine curve over the course of training, optionally resetting it to its initial value at periodic restarts. Cosine annealing was popularised by stochastic gradient descent with warm restarts (SGDR), which helps accelerate the training of deep neural networks. According to its authors, SGDR reaches good performance faster, which makes it practical to train larger networks, and it can be used to build efficient ensembles at no additional training cost.
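A cosine schedule with restarts can be sketched in a few lines of Python. This is a minimal illustration rather than the SGDR reference implementation: the function name `cosine_lr` and the parameters `cycle_steps`, `base_lr` and `min_lr` are assumptions made for this example, and every cycle has the same length.

```python
import math

def cosine_lr(step, cycle_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate with a restart every `cycle_steps` steps."""
    # Position inside the current cycle; it wraps to 0 at each restart.
    t = step % cycle_steps
    # Half a cosine wave, falling from base_lr towards min_lr within the cycle.
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t / cycle_steps))

# Learning rates for two cycles of 100 steps each (hypothetical values).
schedule = [cosine_lr(s, cycle_steps=100, base_lr=0.1) for s in range(200)]
```

Setting `cycle_steps` to the total number of training steps gives plain cosine decay without restarts.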
Knowledge Distillation
Knowledge distillation follows a teacher-student approach. The strategy involves first training a teacher model with a typical loss function on the available data. Next, a different student model, typically much smaller than the teacher, is trained. Instead of optimising a loss function defined only on the hard data labels, the student is trained to mimic the outputs of the teacher.
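One common way to make the student mimic the teacher is to compare their softened output distributions. The sketch below assumes that formulation: the helper names `softmax` and `distillation_loss`, the `temperature` parameter and the logit values are all hypothetical, and practical setups usually combine this term with the ordinary hard-label loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's soft predictions against the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

teacher = [2.0, 1.0, 0.1]    # hypothetical teacher logits for one image
agreeing = [2.0, 1.0, 0.1]   # a student that matches the teacher
disagreeing = [0.1, 1.0, 2.0]  # a student that ranks the classes the other way
loss_agree = distillation_loss(agreeing, teacher)
loss_disagree = distillation_loss(disagreeing, teacher)
```

The loss is lowest when the student reproduces the teacher's distribution, so `loss_agree` comes out smaller than `loss_disagree`.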
Linear Scaling Learning Rate
Linear scaling of the learning rate addresses an optimisation challenge that arises when training with large batches. According to a study by researchers at AWS, mini-batch stochastic gradient descent (SGD) groups multiple samples into a mini-batch to increase parallelism and decrease communication costs, but a large batch size may slow down training progress.
Linear scaling of the learning rate helps scale up the batch size for single-machine training. It rests on two observations:
- In mini-batch SGD, increasing the batch size does not change the expectation of the stochastic gradient but reduces its variance.
- With a large batch size, the gradient is less noisy, so one can increase the learning rate to make larger progress along the opposite of the gradient direction.
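The rule itself is a single proportion. The sketch below assumes a reference setting of learning rate 0.1 at batch size 256, a commonly cited starting point for ResNet-style training; the function name `scaled_learning_rate` and its parameters are hypothetical, and the reference values should be replaced with whatever configuration is known to train well.

```python
def scaled_learning_rate(batch_size, base_lr=0.1, base_batch_size=256):
    """Scale the learning rate in proportion to the batch size."""
    # base_lr and base_batch_size are assumed reference values, not fixed constants.
    return base_lr * batch_size / base_batch_size

lr_small = scaled_learning_rate(256)    # the reference setting itself
lr_large = scaled_learning_rate(1024)   # four times the batch, four times the rate
```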
Learning Rate Warmup
Learning rate warmup involves increasing the learning rate to a larger value over a certain number of training iterations and then decreasing it. The decrease can follow step decay, exponential decay or other such schemes.
According to researchers at Salesforce Research, the technique was introduced to induce stability in the initial phase of training with large learning rates. Learning rate warmup has been employed in the training of several architectures at scale, including ResNets as well as Transformer networks.
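As an illustration, the following sketch pairs a linear warmup with step decay. The linear ramp is one common choice among several, and the function name, parameter names and all numeric values are assumptions made for this example.

```python
def warmup_step_decay_lr(step, base_lr, warmup_steps, decay_every, decay_factor=0.1):
    """Linear warmup to base_lr, followed by step decay."""
    if step < warmup_steps:
        # Ramp up linearly so that base_lr is reached on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # After warmup, multiply by decay_factor once every `decay_every` steps.
    drops = (step - warmup_steps) // decay_every
    return base_lr * decay_factor ** drops

# Hypothetical schedule: 5 warmup steps, then a tenfold drop every 30 steps.
schedule = [warmup_step_decay_lr(s, base_lr=0.1, warmup_steps=5, decay_every=30)
            for s in range(100)]
```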
Label Smoothing
Label smoothing is a popular regularisation technique for classification models. It helps by preventing the model from predicting the training labels too confidently: instead of a hard target of 1 for the correct class and 0 for all others, the model is trained on slightly softened targets. Label smoothing has been used successfully to improve the accuracy of deep learning models across a range of tasks, including image classification, speech recognition and machine translation, and it is widely regarded as a simple "trick" for improving the performance of an image classification network.
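The softened targets can be built by mixing the one-hot label with a uniform distribution over the classes. The sketch below assumes that variant; the function name `smooth_labels` and the default `epsilon` of 0.1 are assumptions, and some formulations spread `epsilon` over the incorrect classes only.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return a smoothed target distribution for one training example."""
    # Every class receives an equal share of epsilon ...
    uniform_share = epsilon / num_classes
    # ... and the correct class keeps the remaining 1 - epsilon on top of it.
    return [
        (1.0 - epsilon) + uniform_share if k == true_class else uniform_share
        for k in range(num_classes)
    ]

targets = smooth_labels(true_class=2, num_classes=4)
```

With four classes and `epsilon=0.1`, the correct class gets a target of 0.925 and each other class gets 0.025, so the model is no longer pushed towards a prediction of exactly 1.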
Mixed Precision Training
Mixed precision training combines 16-bit and 32-bit floating-point types while training an image classification model, in order to make it run faster and use less memory. More generally, it is the combined use of different numerical precisions in a computational method.
According to a blog post by NVIDIA, mixed-precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network.
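A small experiment shows why some information is kept in single precision. The sketch below uses only Python's `struct` module to round numbers to half and single precision; it is not how a deep learning framework implements mixed precision, and the helper names and the update size are assumptions chosen so that the effect is visible.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def apply_updates(weight, update, steps, cast):
    """Subtract `update` from `weight` repeatedly, rounding with `cast` each time."""
    w = cast(weight)
    for _ in range(steps):
        w = cast(w - update)
    return w

# Half precision cannot represent 1.0 - 0.0001, so every update is rounded away.
fp16_only = apply_updates(1.0, 1e-4, 100, to_fp16)
# A single-precision master copy accumulates the same updates correctly.
fp32_master = apply_updates(1.0, 1e-4, 100, to_fp32)
# The master copy can then be cast down to half precision for fast computation.
fp16_view = to_fp16(fp32_master)
```

Here `fp16_only` never moves away from 1.0, while `fp32_master` ends up close to 0.99, which is why the weights are typically stored in single precision even when most arithmetic runs in half precision.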
Model Tweaks
According to the study by the AWS researchers, a model tweak is a minor adjustment to the network architecture, such as changing the stride of a particular convolution layer. Such a tweak often barely changes the computational complexity while having a non-negligible effect on model accuracy.
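To see why the placement of a stride can matter, consider a one-dimensional toy case. The sketch below is an illustration under simplifying assumptions rather than a reproduction of any specific tweak: the function `covered_positions` is hypothetical and simply records which input positions a strided convolution reads.

```python
def covered_positions(input_len, kernel, stride, padding=0):
    """Return the set of input indices read by a 1-D convolution."""
    covered = set()
    out_len = (input_len + 2 * padding - kernel) // stride + 1
    for out_idx in range(out_len):
        start = out_idx * stride - padding
        for offset in range(kernel):
            pos = start + offset
            if 0 <= pos < input_len:  # skip positions that fall in the padding
                covered.add(pos)
    return covered

# A kernel of size 1 with stride 2 skips every other input position.
pointwise = covered_positions(8, kernel=1, stride=2)
# A kernel of size 3 with stride 2 and padding 1 reads every input position.
wide = covered_positions(8, kernel=3, stride=2, padding=1)
```

Both layers produce four outputs from eight inputs, but the first ignores half of its input. Moving the stride to a layer with a wider kernel avoids discarding that information.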
No Bias Decay
Weight decay is often applied to all learnable parameters, including both the weights and the biases. According to the AWS researchers, it is equivalent to applying L2 regularisation to all parameters to drive their values towards 0.
However, it’s recommended to only apply the regularisation to weights in order to avoid the issue of overfitting. The no bias decay heuristic technique follows this recommendation as it only applies the weight decay to the weights in convolution and fully connected layers.
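In practice the heuristic comes down to choosing, per parameter, whether the decay term is applied. The sketch below assumes a hypothetical naming convention in which the weights of convolution and fully connected layers end in `.weight`; the parameter names, the scalar values and the `sgd_step` helper are all made up for illustration.

```python
def sgd_step(params, grads, lr, weight_decay):
    """One SGD step that applies weight decay only to parameters named '*.weight'."""
    updated = {}
    for name, value in params.items():
        # Biases (and any other non-weight parameters) are left unregularised.
        decay = weight_decay if name.endswith(".weight") else 0.0
        updated[name] = value - lr * (grads[name] + decay * value)
    return updated

# Scalar stand-ins for real tensors, with zero gradients to isolate the decay.
params = {"conv1.weight": 1.0, "fc.weight": -2.0, "fc.bias": 0.5}
grads = {name: 0.0 for name in params}
new_params = sgd_step(params, grads, lr=0.1, weight_decay=0.01)
```

With zero gradients, the two weights shrink slightly towards 0 while `fc.bias` stays exactly where it was.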