# Comprehensive Guide To Learning Rate Algorithms (With Python Codes)

The learning rate is an important hyperparameter that controls how much we adjust the weights of the network in response to the gradient. One of the most commonly asked questions in machine learning is: “How do we know what the right value for the learning rate is?”

Unfortunately, there is no one-size-fits-all answer to this question. However, there are methods that can help you estimate a suitable value. This article covers the types of learning rate (LR) algorithms, the behaviour of learning rates with SGD, and the implementation of techniques for finding suitable LR values.

### Types of LR algorithms

The learning rate algorithms are broadly classified into two categories:

1. Constant Learning rate algorithm – As the name suggests, these algorithms deal with learning rates that remain constant throughout the training process. Stochastic Gradient Descent falls under this category.

In the update rule θ ← θ − η∇J(θ), where J(θ) is the loss, η represents the learning rate. The smaller the value of η, the slower the training and the adjustment of weights. But if the value is too high, the model may overshoot the minimum or converge too quickly to a suboptimal solution.

### Stochastic Gradient Descent and Learning rate

Stochastic Gradient Descent (SGD) is one of the most common optimizers used in machine learning. Let us see how SGD looks for a single sample. Consider a loss function J(x, y, θ), where x is the input sample, y is the label, and θ is the weight. For a batch of size N, the gradient used in the update is the average of the per-sample partial derivatives of the loss with respect to θ.

In its most basic form, SGD updates θ by moving the weights in the direction opposite to the gradient of the loss.
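To make the update concrete, here is a minimal NumPy sketch of mini-batch SGD. It assumes a linear model with a squared-error loss, which is chosen purely for illustration and is not necessarily the loss the article has in mind; `sgd_step` and `train` are hypothetical helper names. It trains the same model with a very small and a moderate learning rate to show how η affects progress.

```python
import numpy as np


def sgd_step(theta, x_batch, y_batch, lr):
    """One mini-batch SGD update for a linear model with squared-error loss."""
    preds = x_batch @ theta
    # Gradient of (1 / 2N) * sum((preds - y)^2) with respect to theta
    grad = x_batch.T @ (preds - y_batch) / len(y_batch)
    # Move against the gradient, scaled by the learning rate
    return theta - lr * grad


# Synthetic, noise-free regression data (an assumption for the demo)
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = x @ true_theta


def train(lr, steps=200, batch_size=32):
    theta = np.zeros(3)
    batch_rng = np.random.default_rng(1)
    for _ in range(steps):
        idx = batch_rng.choice(len(x), size=batch_size, replace=False)
        theta = sgd_step(theta, x[idx], y[idx], lr)
    return theta


err_small = np.linalg.norm(train(lr=0.001) - true_theta)
err_good = np.linalg.norm(train(lr=0.1) - true_theta)
print("error with lr=0.001:", err_small)
print("error with lr=0.1:  ", err_good)
```

With the same number of steps, the smaller learning rate leaves the weights much further from the target, which is the slow-training behaviour described above.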

Unfortunately, when it comes to deep learning models, the reality is different. In smaller models, the local minima are relatively shallow and easy to get past. But because of the millions of parameters involved in deep models, the local minima tend to be wider, which creates a problem called plateaus. Such a region is also called a saddle because of its shape. When this situation occurs, the model gets stuck in the saddle and struggles to get out.

One solution is to keep the learning rate large enough to escape the saddle. Let us now look at the methods that can be used.

1. Reduce LR on Plateau: This is one way of moving out of the saddle. Every time the loss begins to plateau, the learning rate is decreased by a set fraction. When a high learning rate keeps the loss bouncing around in the saddle, reducing it can help the optimizer find a smoother path out.
2. LR Finder: In this method, training starts from a very small learning rate, which is increased by a constant factor after every batch while the loss is recorded. The loss is then plotted against the learning rate, and a suitable LR is read off the graph.
3. Cyclic Learning Rate: This method eliminates the need to experimentally find the best values and schedule for global learning rates. Instead of monotonically decreasing the learning rate, this method lets the learning rate vary cyclically between two boundaries.
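The first method in the list can be sketched in a few lines of plain Python. This is an illustrative sketch only: `reduce_on_plateau` is a hypothetical helper, and the `factor`, `patience`, and `min_lr` defaults as well as the loss values are made up for the demo. In practice, Keras ships a `ReduceLROnPlateau` callback in `keras.callbacks` that follows the same idea.

```python
def reduce_on_plateau(losses, initial_lr, factor=0.5, patience=2, min_lr=1e-6):
    """Return the learning rate used at each epoch, given per-epoch losses.

    The rate is multiplied by `factor` once the loss has failed to improve
    for `patience` consecutive epochs.
    """
    lr = initial_lr
    best = float("inf")
    wait = 0
    schedule = []
    for loss in losses:
        schedule.append(lr)
        if loss < best:
            # Improvement: remember it and reset the patience counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau detected: shrink the rate, but never below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
    return schedule


# Made-up loss curve with two plateaus
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.65, 0.65, 0.65, 0.65]
schedule = reduce_on_plateau(losses, initial_lr=0.1)
print(schedule)
```

The learning rate stays at its initial value while the loss improves and is halved each time the loss stalls for two epochs in a row.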

Let us implement Cyclic LR and the LR finder on CIFAR-10 to understand the difference and see the improvement in accuracy.

We will import the required libraries and load our data.

```python
# The code in this article uses legacy Keras argument names
# (border_mode, nb_epoch, samples_per_epoch).
from keras import backend as K
import time
import matplotlib.pyplot as plt
import numpy as np
# Jupyter magic; remove this line when running as a plain script
%matplotlib inline
np.random.seed(2017)
from keras import regularizers
from keras.models import Sequential
from keras.layers.convolutional import Convolution2D, MaxPooling2D, AveragePooling2D
from keras.layers import Activation, Flatten, Dense, Dropout
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.preprocessing.image import ImageDataGenerator
from keras.datasets import cifar10

(train_features, train_labels), (test_features, test_labels) = cifar10.load_data()
num_train, img_rows, img_cols, img_channels = train_features.shape
num_test, _, _, _ = test_features.shape
num_classes = len(np.unique(train_labels))

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Show one random training image per class
fig = plt.figure(figsize=(8, 3))
for i in range(num_classes):
    ax = fig.add_subplot(2, 5, 1 + i, xticks=[], yticks=[])
    idx = np.where(train_labels[:] == i)[0]
    features_idx = train_features[idx, ::]
    img_num = np.random.randint(features_idx.shape[0])
    im = features_idx[img_num]
    ax.set_title(class_names[i])
    plt.imshow(im)
plt.show()
```

Our data has been loaded and saved in variables. Let us normalize the data and convert the labels to categorical (one-hot) form.

```python
# Scale pixel values to [0, 1]
train_features = train_features.astype('float32') / 255
test_features = test_features.astype('float32') / 255
# One-hot encode the class labels
train_labels = np_utils.to_categorical(train_labels, num_classes)
test_labels = np_utils.to_categorical(test_labels, num_classes)
```

For simplicity, let us standardize the data (feature-wise centering and scaling) using ImageDataGenerator and collect it in a single batch.

```python
(trainX, trainy), (testX, testy) = cifar10.load_data()
datagen = ImageDataGenerator(featurewise_center=True, featurewise_std_normalization=True)
# Compute the dataset mean and standard deviation
datagen.fit(trainX)
iterator = datagen.flow(trainX, trainy, batch_size=128)
batchX, batchy = iterator.next()
# Pull the whole standardized training set as one batch, in its original order
iterator = datagen.flow(trainX, trainy, batch_size=len(trainX), shuffle=False)
batchX, batchy = iterator.next()
print(batchX.shape, batchX.mean(), batchX.std())
min_pix, max_pix = batchX.min(), batchX.max()
```

Let us do the same for the test data as well.

```python
iterator1 = datagen.flow(testX, testy, batch_size=len(testX), shuffle=False)
batch_testX, batch_testy = iterator1.next()
X_train = batchX
X_test = batch_testX
y_train = batchy
y_test = batch_testy
```

Now, build a CNN model with batch normalization and L2 regularization (for faster convergence), and remember to use the SGD optimizer.

```python
from keras import optimizers

model1 = Sequential()

# Block 1
model1.add(Convolution2D(32, 3, 3, border_mode='same', kernel_regularizer=regularizers.l2(0.0001), input_shape=(32, 32, 3)))
model1.add(Activation('relu'))
model1.add(BatchNormalization())
model1.add(Convolution2D(64, 3, 3, kernel_regularizer=regularizers.l2(0.0001), border_mode='same'))
model1.add(Activation('relu'))
model1.add(BatchNormalization())
model1.add(MaxPooling2D(pool_size=(2, 2)))
model1.add(Dropout(0.2))

# Block 2 (the 1x1 convolution reduces the channel count)
model1.add(Convolution2D(32, 1, 1))
model1.add(Convolution2D(64, 3, 3, kernel_regularizer=regularizers.l2(0.0001), border_mode='same'))
model1.add(Activation('relu'))
model1.add(BatchNormalization())
model1.add(Convolution2D(128, 3, 3, kernel_regularizer=regularizers.l2(0.0001), border_mode='same'))
model1.add(Activation('relu'))
model1.add(BatchNormalization())
model1.add(MaxPooling2D(pool_size=(2, 2)))
model1.add(Dropout(0.3))

# Block 3
model1.add(Convolution2D(32, 1, 1))
model1.add(Convolution2D(128, 3, 3, kernel_regularizer=regularizers.l2(0.0001), border_mode='same'))
model1.add(Activation('relu'))
model1.add(BatchNormalization())
model1.add(Convolution2D(256, 3, 3, kernel_regularizer=regularizers.l2(0.0001), border_mode='same'))
model1.add(Activation('relu'))
model1.add(BatchNormalization())
model1.add(MaxPooling2D(pool_size=(2, 2)))
model1.add(Dropout(0.5))

# Classifier head: one channel per class, averaged over the 4x4 feature map
model1.add(Convolution2D(10, 1, 1))
model1.add(AveragePooling2D(pool_size=(4, 4)))
model1.add(Flatten())
model1.add(Activation('softmax'))

sgd = optimizers.SGD(lr=0.0001, momentum=0.9, nesterov=True)
model1.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
```

Do not worry about the lr that is assigned above. You can assign any value here since we will be overriding it soon.

To help the model generalize better, I will use the cutout function, which erases a random rectangle of each training image.

```python
def get_random_eraser(p=0.5, s_l=0.02, s_h=0.4, r_1=0.3, r_2=1/0.3, v_l=0, v_h=255, pixel_level=False):
    def eraser(input_img):
        img_h, img_w, img_c = input_img.shape
        # Leave the image untouched with probability 1 - p
        p_1 = np.random.rand()
        if p_1 > p:
            return input_img

        # Sample rectangles until one fits inside the image
        while True:
            s = np.random.uniform(s_l, s_h) * img_h * img_w  # area
            r = np.random.uniform(r_1, r_2)                  # aspect ratio
            w = int(np.sqrt(s / r))
            h = int(np.sqrt(s * r))
            left = np.random.randint(0, img_w)
            top = np.random.randint(0, img_h)
            if left + w <= img_w and top + h <= img_h:
                break

        # Fill with per-pixel noise or with a single random value
        if pixel_level:
            c = np.random.uniform(v_l, v_h, (h, w, img_c))
        else:
            c = np.random.uniform(v_l, v_h)
        input_img[top:top + h, left:left + w, :] = c
        return input_img

    return eraser
```

Let us go ahead and implement the LR finder algorithm as a Keras callback.

```python
from keras.callbacks import Callback


class LR_Finder(Callback):

    def __init__(self, start_lr=1e-5, end_lr=10, step_size=None, beta=.98):
        super().__init__()
        self.start_lr = start_lr
        self.end_lr = end_lr
        self.step_size = step_size  # number of batches in the sweep; must be provided
        self.beta = beta
        # Per-batch multiplier that takes the LR from start_lr to end_lr in step_size batches
        self.lr_mult = (end_lr / start_lr) ** (1 / step_size)

    def on_train_begin(self, logs=None):
        self.best_loss = 1e9
        self.avg_loss = 0
        self.losses, self.smoothed_losses, self.lrs, self.iterations = [], [], [], []
        self.iteration = 0
        logs = logs or {}
        K.set_value(self.model.optimizer.lr, self.start_lr)

    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        loss = logs.get('loss')
        self.iteration += 1

        # Bias-corrected exponential moving average of the loss
        self.avg_loss = self.beta * self.avg_loss + (1 - self.beta) * loss
        smoothed_loss = self.avg_loss / (1 - self.beta ** self.iteration)

        # Stop once the loss has clearly blown up
        if self.iteration > 1 and smoothed_loss > self.best_loss * 4:
            self.model.stop_training = True
            return

        if smoothed_loss < self.best_loss or self.iteration == 1:
            self.best_loss = smoothed_loss

        lr = self.start_lr * (self.lr_mult ** self.iteration)

        self.losses.append(loss)
        self.smoothed_losses.append(smoothed_loss)
        self.lrs.append(lr)
        self.iterations.append(self.iteration)

        K.set_value(self.model.optimizer.lr, lr)

    def plot_lr(self):
        plt.xlabel('Iterations')
        plt.ylabel('Learning rate')
        plt.plot(self.iterations, self.lrs)

    def plot(self, n_skip=1):
        plt.ylabel('Loss')
        plt.xlabel('Learning rate (log scale)')
        plt.plot(self.lrs[n_skip:-5], self.losses[n_skip:-5])
        plt.xscale('log')

    def plot_smoothed_loss(self, n_skip=10):
        plt.ylabel('Smoothed Losses')
        plt.xlabel('Learning rate (log scale)')
        plt.plot(self.lrs[n_skip:-5], self.smoothed_losses[n_skip:-5])
        plt.xscale('log')

    def plot_loss(self):
        plt.ylabel('Losses')
        plt.xlabel('Iterations')
        plt.plot(self.iterations[10:], self.losses[10:])
```
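The two pieces of arithmetic inside the callback, the exponential learning rate sweep and the bias-corrected loss smoothing, can be checked in isolation without Keras. The sketch below is a stand-alone illustration; `lr_sweep` and `smooth` are hypothetical helper names, and the 391 steps assume one CIFAR-10 epoch at a batch size of 128.

```python
import numpy as np


def lr_sweep(start_lr, end_lr, num_steps):
    """Learning rates visited when the LR is multiplied by a constant factor each batch."""
    mult = (end_lr / start_lr) ** (1 / num_steps)
    return start_lr * mult ** np.arange(1, num_steps + 1)


def smooth(losses, beta=0.98):
    """Bias-corrected exponential moving average of a sequence of losses."""
    avg, out = 0.0, []
    for i, loss in enumerate(losses, start=1):
        avg = beta * avg + (1 - beta) * loss
        # Dividing by (1 - beta^i) removes the bias toward the zero initial value
        out.append(avg / (1 - beta ** i))
    return np.array(out)


lrs = lr_sweep(1e-5, 1e-2, num_steps=391)
print(lrs[0], lrs[-1])

# With a constant loss, the bias-corrected average equals that constant from step one
smoothed = smooth(np.full(50, 2.3))
print(smoothed[:3])
```

The sweep reaches `end_lr` after exactly `num_steps` batches, and the bias correction keeps the first few smoothed values from being dragged toward zero.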

It is time to put everything together. We will define our accuracy function and a function to plot the model graphically.

```python
def plot_model_history(model_history):
    history = model_history.history
    # Older Keras versions log accuracy as 'acc', newer ones as 'accuracy'
    acc_key = 'acc' if 'acc' in history else 'accuracy'
    n_epochs = len(history['loss'])
    epochs = range(1, n_epochs + 1)
    ticks = np.arange(1, n_epochs + 1, max(1, n_epochs // 10))

    fig, axs = plt.subplots(1, 2, figsize=(15, 5))
    # Left panel: accuracy
    axs[0].plot(epochs, history[acc_key])
    axs[0].plot(epochs, history['val_' + acc_key])
    axs[0].set_title('Model Accuracy')
    axs[0].set_ylabel('Accuracy')
    axs[0].set_xlabel('Epoch')
    axs[0].set_xticks(ticks)
    axs[0].legend(['train', 'val'], loc='best')
    # Right panel: loss
    axs[1].plot(epochs, history['loss'])
    axs[1].plot(epochs, history['val_loss'])
    axs[1].set_title('Model Loss')
    axs[1].set_ylabel('Loss')
    axs[1].set_xlabel('Epoch')
    axs[1].set_xticks(ticks)
    axs[1].legend(['train', 'val'], loc='best')
    plt.show()


def accuracy(test_x, test_y, model):
    result = model.predict(test_x)
    predicted_class = np.argmax(result, axis=1)
    true_class = np.argmax(test_y, axis=1)
    num_correct = np.sum(predicted_class == true_class)
    accuracy = float(num_correct) / result.shape[0]
    return accuracy * 100


datagen = ImageDataGenerator(zoom_range=0.0,
                             horizontal_flip=False,
                             preprocessing_function=get_random_eraser(v_l=min_pix, v_h=max_pix, pixel_level=True))

# Sweep the learning rate from 1e-5 to 1e-2 over one epoch's worth of batches
lr_finder = LR_Finder(start_lr=1e-5, end_lr=1e-2, step_size=np.ceil(X_train.shape[0] / 128))

# train_labels / test_labels are the one-hot labels created earlier; they are in
# the same order as X_train / X_test because the generators used shuffle=False
start = time.time()
model_info = model1.fit_generator(datagen.flow(X_train, train_labels, batch_size=128),
                                  samples_per_epoch=train_features.shape[0], nb_epoch=100,
                                  validation_data=(X_test, test_labels), verbose=0,
                                  callbacks=[lr_finder])
end = time.time()
print("Model took %0.2f seconds to train" % (end - start))
print("Accuracy on test data is: %0.2f" % accuracy(X_test, test_labels, model1))
lr_finder.plot_lr()
```

`lr_finder.plot_smoothed_loss()`

Typically, a good static learning rate can be found halfway down the descending part of the loss curve. In the plot shown, that would be around 0.002 (of the order of 10^-3), because that is where the descent is steepest.

For cyclic learning rates, we need boundaries (start and end), and these can be identified from the graph as well. The boundaries are the point at which the loss starts descending and the point at which it stops descending. From the graph above, the loss starts descending at 0.002 and stops at 0.2 (of the order of 10^-1). Having identified our boundaries, let us implement the cyclic LR and begin our training.
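Reading these values off the plot by eye can be backed up with a small numerical check. The sketch below is a rough heuristic, not part of the original workflow: `steepest_lr` is a hypothetical helper that returns the learning rate at which the smoothed loss falls fastest per decade of LR, and the curve it is run on is synthetic (a smooth drop centred at 10^-3 followed by a blow-up), not the article's measured data.

```python
import numpy as np


def steepest_lr(lrs, smoothed_losses):
    """Learning rate at which the smoothed loss decreases fastest on a log-LR axis."""
    lrs = np.asarray(lrs, dtype=float)
    losses = np.asarray(smoothed_losses, dtype=float)
    # Slope of the loss with respect to log10(lr); most negative = steepest descent
    slopes = np.gradient(losses, np.log10(lrs))
    return lrs[np.argmin(slopes)]


# Synthetic range-test curve (assumption for illustration only)
lrs = np.logspace(-5, 0, 200)
log_lr = np.log10(lrs)
losses = 2.3 - 1.5 / (1 + np.exp(-4 * (log_lr + 3))) + np.exp(6 * (log_lr + 0.5))

suggested = steepest_lr(lrs, losses)
print("steepest descent near lr =", suggested)
```

With a real run, the same function could be called on the `lrs` and `smoothed_losses` lists recorded by the LR finder, after dropping the first few noisy points.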

```python
from keras.callbacks import Callback, ModelCheckpoint


class CyclicLR(Callback):

    def __init__(self, min_lr, max_lr, stepsize=1000):
        super().__init__()
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.currstep = 0
        self.stepsize = stepsize  # number of batches in each half cycle

    def on_train_batch_begin(self, batch, logs=None):
        currstep = self.currstep
        stepsize = self.stepsize
        min_lr = self.min_lr
        max_lr = self.max_lr
        dlr = (max_lr - min_lr) / stepsize
        if currstep < stepsize:
            # Rising half of the triangle
            dlr = dlr * currstep
        else:
            # Falling half of the triangle
            dlr = dlr * (2 * stepsize - currstep)
        lr = min_lr + dlr
        K.set_value(self.model.optimizer.lr, lr)
        self.currstep += 1

    def on_train_batch_end(self, batch, logs=None):
        # Restart the cycle after a full rise and fall
        if self.currstep == 2 * self.stepsize:
            self.currstep = 0


clr = CyclicLR(2e-4, 2e-2, 2000)
model1.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])


# Simple decaying schedule; defined here but not used in the run below
def scheduler(epoch, lr):
    return round(1e-2 / (1 + 0.1 * epoch), 10)


# train_labels / test_labels are the one-hot labels created earlier
start = time.time()
model_info = model1.fit_generator(datagen.flow(X_train, train_labels, batch_size=128),
                                  samples_per_epoch=train_features.shape[0], nb_epoch=100,
                                  validation_data=(X_test, test_labels), verbose=1,
                                  callbacks=[clr])
end = time.time()
print("Model took %0.2f seconds to train" % (end - start))
plot_model_history(model_info)
print("Accuracy on test data is: %0.2f" % accuracy(X_test, test_labels, model1))
```
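The shape of the cyclic schedule can be inspected without training anything. The following is a minimal stand-alone sketch of a triangular cycle; `triangular_lr` is a hypothetical helper, and the boundaries and step size simply mirror the values passed to `CyclicLR` above.

```python
def triangular_lr(step, min_lr, max_lr, stepsize):
    """Learning rate at a given training step for a triangular cycle of length 2 * stepsize."""
    pos = step % (2 * stepsize)
    if pos < stepsize:
        # Rising half: min_lr -> max_lr
        frac = pos / stepsize
    else:
        # Falling half: max_lr -> min_lr
        frac = (2 * stepsize - pos) / stepsize
    return min_lr + (max_lr - min_lr) * frac


# Two full cycles with the same boundaries and step size as the callback
schedule = [triangular_lr(s, 2e-4, 2e-2, 2000) for s in range(8000)]
print(schedule[0], schedule[1000], schedule[2000], schedule[4000])
```

Plotting `schedule` against the step index shows the learning rate climbing linearly to the upper boundary and falling back to the lower one every 4000 batches.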

Although we see no sharp spikes, it is essential to check for overfitting while tuning hyperparameters. A useful way to do this is to identify the misclassified images in the dataset. Once this is done, you can always go back to the learning rate curves or the model and tweak them further to get the best results possible. I will use Grad-CAM to inspect the misclassifications below.

```python
import cv2

model1.summary()


def gradcam(idx, images, normimage, layername):
    ii = idx
    x = normimage[ii].reshape((1, 32, 32, 3))
    preds = model1.predict(x)
    class_idx = np.argmax(preds)
    class_output = model1.output[:, class_idx]
    last_conv_layer = model1.get_layer(layername)

    # Gradient of the predicted class score with respect to the conv feature map
    grads = K.gradients(class_output, last_conv_layer.output)[0]
    pooled_grads = K.mean(grads, axis=(0, 1, 2))
    iterate = K.function([model1.input], [pooled_grads, last_conv_layer.output[0]])
    pooled_grads_value, conv_layer_output_value = iterate([x])

    # Weight each channel by its pooled gradient
    depth = conv_layer_output_value.shape[-1]
    for i in range(depth):
        conv_layer_output_value[:, :, i] *= pooled_grads_value[i]

    heatmap = np.mean(conv_layer_output_value, axis=-1)
    heatmap = np.maximum(heatmap, 0)
    max_heatmap = np.max(heatmap)
    if max_heatmap > 0:  # avoid dividing by zero
        heatmap /= max_heatmap

    img = images[ii]
    heatmap = cv2.resize(heatmap, (img.shape[1], img.shape[0]))
    heatmap = np.uint8(255 * heatmap)
    heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
    heatmap = cv2.cvtColor(heatmap, cv2.COLOR_BGR2RGB)

    superimposed_img = cv2.addWeighted(img, 0.7, heatmap, 0.3, 0)
    return superimposed_img


y_pred = model1.predict(X_test)
i = 0
fig, ax = plt.subplots(10, 5, figsize=(15, 30))
fig.suptitle('Misclassified Images')
fig.tight_layout(pad=0.3, rect=[0, 0, 0.9, 0.9])
for (x, y) in [(i, j) for i in range(5) for j in range(5)]:
    # Skip ahead to the next misclassified test image
    while i < 10000 and np.argmax(y_pred[i, :]) == testy[i]:
        i += 1
    if i >= 10000:
        break
    ax[2*x, y].imshow(testX[i])
    ax[2*x, y].axis('off')
    acls, pcls = class_names[int(testy[i][0])], class_names[np.argmax(y_pred[i, :])]
    ax[2*x, y].set_title('%d A: %s P: %s' % (i, acls, pcls))
    # The layer name depends on the session; check model1.summary() for the actual name
    ax[2*x+1, y].imshow(gradcam(i, testX, X_test, 'conv2d_17'))
    ax[2*x+1, y].axis('off')
    i += 1
```

You can see here that these images are misclassified. But now that we have the tools to improve our learning rates, we can go back to the model and tune it better.

### Conclusion

Hyper-parameter optimization is a very important and time-consuming process in the life of a good machine learning model. It helps in making the model stand out and be better. With the techniques discussed above, you can improve your model by tuning the learning rates better.

Copyright Analytics India Magazine Pvt Ltd