How to automate finding the optimal learning rate?

The Cyclical Learning Rate method finds the rate automatically.

Finding the best settings for deep learning hyperparameters has long been regarded as more an art than a science, relying primarily on trial and error. The learning rate (LR) is arguably the most significant hyperparameter in deep learning, since it determines how large a step the optimizer takes along the backpropagated gradient and, in turn, how quickly we progress towards a minimum. Too low a learning rate makes the model converge slowly, whereas too high a learning rate can make it diverge. As a result, the learning rate must be chosen carefully. This article explains the automatic learning rate finder and its implementation. The following are the topics to be considered.

Table of contents

  1. Dilemma behind the optimal Learning Rate
  2. Common architecture of Automatic LR finder
  3. Finding optimal learning rate with PyTorch

The learning rate can’t be the same for every network. Let’s look behind the scenes of learning rate testing.

Dilemma behind the optimal Learning Rate

The learning rate is a hyperparameter that governs how much the network’s weights are altered in relation to the loss gradient. It quantifies how much is learned from each new mini-batch of training data: at every step, the weights are updated by the gradient scaled by the learning rate. The larger the steps taken along the trajectory towards the minimum of the loss function, where the optimal model parameters lie, the faster the model learns.
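As a toy illustration (a sketch, not the article's code), a single gradient descent update shows how the learning rate scales the step:

```python
# One step of gradient descent: the learning rate scales how far the
# weight moves along the negative gradient of the loss.
def sgd_step(weight, grad, lr):
    return weight - lr * grad

w, g = 0.5, 2.0                     # current weight and its loss gradient
small = sgd_step(w, g, lr=0.001)    # cautious step
large = sgd_step(w, g, lr=0.1)      # aggressive step
print(small, large)
```

The same gradient produces very different moves depending on the learning rate, which is exactly why the choice matters so much.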

The range of learning rates is tested over a single epoch of training, with the learning rate increased after each mini-batch of data. The learning rate grows from a very small to a very large value during the procedure, causing the training loss to begin on a plateau, then drop to some minimum, and finally explode. This typical behaviour can be plotted (as shown in the figure below) and used to choose a suitable range for the learning rate, especially in the region where the loss is decreasing.

[Figure: typical training loss vs. learning rate curve]

The suggested minimum learning rate is the value at which the loss decreases fastest (the steepest negative gradient), whereas the recommended maximum learning rate is well below the rate at which the loss is smallest, typically about ten times less. Because the plotted loss is a smoothed version, picking the learning rate that corresponds to the smallest loss is likely to be too large, causing the loss to diverge during training.
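The sweep used in such a range test can be sketched as an exponential schedule (the bounds below are hypothetical, not any library's defaults):

```python
def lr_range_schedule(lr_min, lr_max, num_steps):
    """Exponentially spaced learning rates for a one-epoch LR range test."""
    ratio = lr_max / lr_min
    return [lr_min * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]

# one learning rate per mini-batch, growing from tiny to very large
lrs = lr_range_schedule(1e-6, 1.0, num_steps=100)
print(lrs[0], lrs[-1])
```

An exponential (rather than linear) sweep gives equal resolution to every order of magnitude, which is what matters when the useful range may span several decades.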

Common architecture of Automatic LR finder

A typical automatic learning rate finder uses the Cyclical Learning Rate (CLR) method. The algorithm’s goal is to provide a way of determining good learning rates for training neural networks that eliminates the need for hundreds of experiments, with no additional computation. By introducing the notion of the LR Range Test, CLR delivers an excellent learning rate range (LR range) for an experiment.

A good learning rate is one that produces a significant decrease in the network’s loss. Here comes CLR’s sorcery. The original CLR paper describes an experiment in which you can monitor the behaviour of the loss in relation to the learning rate. The experiment is simple: after each mini-batch, progressively raise the learning rate while recording the loss at each step. The increase can be linear or exponential. This is precisely the LR Range Test.

[Figure: loss vs. learning rate during the LR Range Test]

Carrying out this experiment, Leslie Smith showed that at excessively low learning rates the loss may decrease, but only very slowly. On entering the ideal learning rate zone, you observe a sharp decline in the loss function. Increasing the learning rate further can destabilise the network’s parameters, causing the loss to rise again. Based on this experiment, it is evident that you are looking for a sharp drop in the loss function, and you can find it by analysing the gradients of the loss function at various stages of training.
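That selection rule can be sketched in a few lines: given the losses recorded during a range test (the values below are hypothetical), pick the learning rate where the loss falls fastest:

```python
# Hypothetical smoothed losses recorded at exponentially increasing LRs.
lrs    = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [2.30, 2.28, 2.10, 1.40, 1.80, 4.00]

# Finite-difference slope between consecutive recordings; the most
# negative slope marks the sharpest drop in the loss.
slopes = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
best = min(range(len(slopes)), key=lambda i: slopes[i])
print(f"suggested lr: {lrs[best]}")   # prints: suggested lr: 0.001
```

Note that the suggestion is well below the learning rate with the smallest loss (1e-2 here), matching the rule of thumb described above.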

Finding optimal learning rate with PyTorch

To find the optimal learning rate for the neural network, this article uses the PyTorch Lightning package. The model is a LeNet classifier, a typical beginner convolutional neural network, used here as an image classifier on the famous MNIST dataset.

Let’s start with installing and importing the dependencies and the prerequisites.

!pip install pytorch-lightning
!pip install torchmetrics

torchmetrics needs to be installed alongside PyTorch Lightning because the metrics module has been moved out of the PyTorch Lightning package. torchmetrics offers predefined metrics and also lets you build custom evaluation methods.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
 
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger
from torchmetrics import functional as FM

One can easily download the MNIST dataset or use the code from the notebook attached in the references section for downloading and splitting the data.

The model is a LeNet classifier built using the “LightningModule” of PyTorch Lightning. The model has 2 convolution layers and 3 linear layers with 120, 84, and 10 neurons respectively.

class LeNet_classifier_model(pl.LightningModule):
    def __init__(self, num_classes=10):
        super().__init__()
        self.lr = 2e-3

        self.conv1 = nn.Conv2d(1, 6, 5, padding=2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)

        acc = FM.accuracy(y_hat, y)
        loss = F.cross_entropy(y_hat, y)
        self.log_dict({'train_acc': acc, 'train_loss': loss}, on_step=True, on_epoch=False)
        return {"training accuracy": acc, "loss": loss}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)

        val_acc = FM.accuracy(y_hat, y)
        loss = F.cross_entropy(y_hat, y)
        self.log_dict({'val_acc': val_acc, 'val_loss': loss}, reduce_fx=torch.mean)
        return {"validation accuracy": val_acc, "validation loss": loss}

Here is a glimpse of the classifier model. For the details, refer to the Colab notebook attached in the reference section.
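The snippet above omits the forward pass; a minimal sketch consistent with the layer shapes shown (an assumption on my part, not the notebook's exact code) would be:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNetForward(nn.Module):
    """Minimal LeNet forward pass matching the layer shapes above."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5, padding=2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # 28x28 -> 14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # 10x10 -> 5x5
        x = x.flatten(1)                            # 16*5*5 = 400 features
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

logits = LeNetForward()(torch.randn(2, 1, 28, 28))
print(logits.shape)  # torch.Size([2, 10])
```

The `padding=2` on the first convolution keeps MNIST's 28x28 input at 28x28, so the fully connected input works out to 16*5*5 after the two pooling stages.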

Now it’s time to train the model. There will be two versions of this model: one without the automatic learning rate finder and one with it. The losses and accuracy metrics are logged with TensorBoard for better visualisation.

data_directory = "/content/"
batch_size = 60
logger_directory = 'logs/without_auto_lr'
name_of_log = 'LeNet classifier'
version_of_log = 1.0
default_lr = 1e-3
max_epochs = 8

model = LeNet_classifier_model()
dataset = MNISTData(data_directory, batch_size)
logger = TensorBoardLogger(save_dir=logger_directory, version=version_of_log, name=name_of_log)
trainer = pl.Trainer(gpus=1, max_epochs=max_epochs, logger=logger, auto_lr_find=False, val_check_interval=0.5)
 
model.lr = default_lr
 
print(f'Default model LR: {model.lr}')
 
trainer.fit(model, dataset)

[Figures: TensorBoard training and validation loss/accuracy curves for the default learning rate]

So with a learning rate of 0.001 over a total of 8 epochs, the minimum training loss is reached at about 5,000 steps, and the minimum validation loss at about 6,500 steps, continuing to fall as the epochs increased.

Let’s find the optimal learning rate, which should need fewer steps while reaching lower loss and higher accuracy. To use the automatic LR finder, everything stays the same as before; just add these lines to the code, which find the optimal learning rate and plot the loss vs learning rate curve for better visualisation.

# a second Trainer, this time with the automatic LR finder enabled
trainer_2 = pl.Trainer(gpus=1, max_epochs=max_epochs, logger=logger, auto_lr_find=True, val_check_interval=0.5)
 
lr_finder = trainer_2.tuner.lr_find(model, dataset)
model.hparams.lr = lr_finder.suggestion()
print(f'Auto-find model LR: {model.hparams.lr}')
 
fig = lr_finder.plot(suggest=True)

[Figures: suggested learning rate and the loss vs. learning rate plot]

So the optimal learning rate for the model is 0.025, which is greater than the default learning rate. Training at this rate therefore converges in fewer steps, saving computation time and cost. You can now retrain the model with this learning rate; that is left to you.

Conclusions

The LR finder is a great tool for determining the best learning rate for a given situation, but it should be used with caution. It is critical to use the same initial weights in both the Learning rate range test and subsequent model training. Never assume that the discovered learning rates are optimal for any model initialization. With this article, we have understood the concept of automatic learning rate finder with implementation.

References

Sourabh Mehta
Sourabh has worked as a full-time data scientist for an ISP organisation, experienced in analysing patterns and their implementation in product development. He has a keen interest in developing solutions for real-time problems with the help of data both in this universe and metaverse.
