Understanding Adaptive Optimization techniques in Deep learning


Optimization is an important part of deep learning and has attracted much attention from researchers as the amount of data has grown exponentially. Neural networks contain millions of parameters to handle complex tasks, which makes training them a challenge, so optimization algorithms have to be efficient to achieve good results. The job of an optimization algorithm is to minimize the loss function, ideally by reaching the global minimum.

Two important metrics determine the efficiency of an optimization algorithm: the speed of convergence, that is, how quickly it reaches a minimum of the loss, and generalization, that is, how well the model performs on unseen data. Researchers design optimization algorithms with these two metrics in mind. Throughout this article, we will discuss these optimization techniques with their intuition and implementation.

Topics we cover in this article

  • Understanding adaptive optimization 
  • Adagrad optimization
  • Adadelta optimization
  • Adam optimization
  • Adabound optimization

Understanding Adaptive optimization

Optimization techniques like Gradient Descent, SGD, and mini-batch Gradient Descent require the learning rate hyperparameter to be set before training the model. If this learning rate doesn't give good results, we need to change it and train the model again. In deep learning, training a model generally takes a lot of time, so tuning the learning rate this way is expensive. This motivated adaptive optimization techniques. Here we do not need to tune the learning rate by hand; we only initialize it (for example, to 0.001), and the adaptive optimization algorithm keeps adjusting the effective learning rate while training the model.

So what is the learning rate? The learning rate is one of the most important hyperparameters in the learning process of the model. It sets the size of the steps the model takes towards a minimum of the loss.
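To make the effect of the step size concrete, here is a minimal sketch that runs plain gradient descent on the toy function f(x) = x². The function, the starting point, and the three learning rates are illustrative assumptions and are not part of the MNIST experiment in this article.

```python
def gradient_descent(learning_rate, steps=20, x0=5.0):
    """Minimize f(x) = x**2 with plain gradient descent and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x                 # derivative of x**2
        x = x - learning_rate * grad   # step against the gradient
    return x

small = gradient_descent(0.01)  # too small: still far from the minimum at 0
good = gradient_descent(0.1)    # reasonable: ends close to 0
large = gradient_descent(1.1)   # too large: every step overshoots and x grows

print(small, good, large)
```

A learning rate that is too small wastes training time, while one that is too large makes the loss blow up, which is why it normally has to be tuned.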

Hands-on implementation

Here we will implement a Convolutional Neural Network (CNN) model for MNIST digit classification, which we will use to compare the optimization techniques.

import tensorflow
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization

# Model configuration
batch_size = 250
no_epochs = 5
no_classes = 10
validation_split = 0.2
verbosity = 1

# Load MNIST dataset
(input_train, target_train), (input_test, target_test) = mnist.load_data()

# Shape of the input sets
input_train_shape = input_train.shape
input_test_shape = input_test.shape 

# Keras layer input shape
input_shape = (input_train_shape[1], input_train_shape[2], 1)

# Reshape the training data to include channels
input_train = input_train.reshape(input_train_shape[0], input_train_shape[1], input_train_shape[2], 1)

input_test = input_test.reshape(input_test_shape[0], input_test_shape[1], input_test_shape[2], 1)

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Normalize input data
input_train = input_train / 255
input_test = input_test / 255

# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# Flatten the feature maps so the dense layers receive one vector per image
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(no_classes, activation='softmax'))

Now, we will discuss the adaptive optimization techniques one by one and add each of them to the CNN model defined above.


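Adagrad optimization

Adagrad adapts the learning rate for each parameter by dividing the base learning rate by the square root of the cumulative sum of the squared gradients, from the first step up to the current one. Parameters that have received large gradients therefore take smaller steps.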

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \, g_t$$

Here θ is the parameter we need to update, η is the learning rate, ε is a small constant added to avoid division by zero, g_t is the gradient at time t, and G_t is the sum of the squared gradients up to time t.
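The Adagrad update can be sketched in a few lines of NumPy. This is a minimal illustration on the toy loss f(θ) = Σθ², not the Keras implementation; the function name `adagrad_step`, the starting values, and the learning rate of 0.1 are assumptions made for the example.

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=0.001, eps=1e-7):
    """One Adagrad update; G accumulates the squared gradients per parameter."""
    G = G + grad ** 2                             # running sum of squared gradients
    theta = theta - lr / np.sqrt(G + eps) * grad  # per-parameter scaled step
    return theta, G

theta = np.array([1.0, -2.0])   # toy parameters (assumed starting point)
G = np.zeros_like(theta)
for _ in range(100):
    grad = 2.0 * theta          # gradient of sum(theta**2)
    theta, G = adagrad_step(theta, grad, G, lr=0.1)

print(theta, G)
```

Because G only ever grows, the effective step size keeps shrinking during training, which is the behaviour Adadelta was designed to avoid.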

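Compiling the CNN model with the Adagrad optimizer

# Compile the model (learning rate 0.001, the same value used for the other optimizers)
model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy, optimizer=tensorflow.keras.optimizers.Adagrad(learning_rate=0.001), metrics=['accuracy'])

# Fit data to model, using the configuration values defined above
history = model.fit(input_train, target_train,
                    batch_size=batch_size, epochs=no_epochs,
                    verbose=verbosity, validation_split=validation_split)

# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss using Adagrad: {score[0]} / Test accuracy: {score[1]}')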



Adadelta optimization

Adadelta works with exponential moving averages of the squared gradients and of the squared deltas, where delta refers to the difference between the current weight and the newly updated weight. In its original formulation, Adadelta removes the learning rate parameter and replaces it with the root mean square of the past deltas.
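The following NumPy sketch shows one way to write the Adadelta update in its original form, without a learning rate. It is an illustration on the toy loss f(θ) = Σθ², and the function name `adadelta_step` and the starting values are assumptions; note that the Keras optimizer used below additionally multiplies the step by its `learning_rate` argument.

```python
import numpy as np

def adadelta_step(theta, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-7):
    """One Adadelta update using moving averages of squared gradients and deltas."""
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    # The ratio of the two root mean squares takes the place of the learning rate
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
    return theta + delta, avg_sq_grad, avg_sq_delta

theta = np.array([1.0, -2.0])        # toy parameters (assumed starting point)
avg_sq_grad = np.zeros_like(theta)
avg_sq_delta = np.zeros_like(theta)
for _ in range(50):
    grad = 2.0 * theta               # gradient of sum(theta**2)
    theta, avg_sq_grad, avg_sq_delta = adadelta_step(theta, grad, avg_sq_grad, avg_sq_delta)

print(theta)
```

The first steps are tiny because the average of squared deltas starts at zero, and they grow as that average builds up.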

Compiling the CNN model with the Adadelta optimizer, replacing Adagrad in the above CNN model

model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy, optimizer=tensorflow.keras.optimizers.Adadelta(learning_rate=0.001, rho=0.95, epsilon=1e-07, name="Adadelta"), metrics=['accuracy'])

After updating the optimizer to Adadelta, we again trained the model.


Adam – Adaptive moment estimation 

Adam is a very popular optimization technique, used as the optimizer in many models and often the first choice for beginners. Adam combines RMSprop and momentum: like RMSprop, it uses the squared gradients to scale the learning rate, and like momentum, it uses a moving average of the gradients instead of the raw gradient. It computes a separate adaptive learning rate for each parameter.


In the momentum technique, instead of using only the gradient of the current step, the optimizer also accumulates the gradients of past steps to decide the direction of the update. SGD is often used with momentum because momentum helps accelerate training.

The momentum term γ is typically set to 0.9.
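As a rough illustration of how momentum speeds up training, the sketch below compares gradient descent with and without momentum on the toy function f(x) = x². The function name `sgd_momentum`, the learning rate of 0.01, and the starting point are assumptions chosen for the example.

```python
def sgd_momentum(grad_fn, x0, lr=0.01, gamma=0.9, steps=50):
    """Gradient descent with momentum; gamma=0.0 reduces to plain gradient descent."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        # Keep a fraction gamma of the previous update and add the new gradient step
        velocity = gamma * velocity + lr * grad_fn(x)
        x = x - velocity
    return x

def grad(x):
    return 2.0 * x  # derivative of x**2

plain = sgd_momentum(grad, 5.0, gamma=0.0)
with_momentum = sgd_momentum(grad, 5.0, gamma=0.9)

print(plain, with_momentum)
```

With the same learning rate and number of steps, the run with momentum ends closer to the minimum at 0 than the run without it.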
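$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

Here m and v are the moving averages of the gradient and of the squared gradient, and g is the gradient on the current mini-batch. The betas are the decay rates of the two averages and are specific to Adam; the default values are beta_1 = 0.9 and beta_2 = 0.999.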
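A minimal NumPy sketch of one Adam step, using the moving averages m and v and the default betas, might look as follows. The bias-correction terms are part of the standard Adam algorithm although they are not discussed in this article, and the function name `adam_step` and the toy values are assumptions.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update; t is the 1-based step counter."""
    m = beta_1 * m + (1 - beta_1) * grad        # moving average of gradients
    v = beta_2 * v + (1 - beta_2) * grad ** 2   # moving average of squared gradients
    m_hat = m / (1 - beta_1 ** t)               # bias correction for the zero start
    v_hat = v / (1 - beta_2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])   # toy parameters (assumed starting point)
m = np.zeros_like(theta)
v = np.zeros_like(theta)
grad = 2.0 * theta              # gradient of sum(theta**2)
theta, m, v = adam_step(theta, grad, m, v, t=1)

print(theta)
```

On the first step the update has a magnitude of roughly `lr` for every parameter regardless of the gradient scale, which shows how the squared-gradient average normalizes the step.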

Compiling the CNN model with the Adam optimizer, replacing Adadelta in the above CNN model
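The compile call for Adam follows the same pattern as for the other optimizers. The snippet below is a sketch, assuming the learning rate of 0.001 and the beta values quoted in this article; to stay self-contained it builds a tiny stand-in model, whereas in the article's workflow the `compile` call would be applied to the CNN `model` defined earlier.

```python
import tensorflow as tf

# Tiny stand-in model so the snippet runs on its own (assumption, not the article's CNN)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Adam with the learning rate and betas discussed above
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
```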


After updating the optimizer to Adam, we again trained the model



Adabound optimization

Adabound is an Adam variant that applies dynamic bounds to the learning rate. Its authors describe it as being as fast as Adam and as good as SGD. The main problem with adaptive techniques is that they can fail to converge to a good solution because of unstable and extreme learning rates. In Adabound, the lower and upper bounds on the learning rate are initialized as 0 and infinity, and both gradually converge to a constant final learning rate.

This concept was inspired by gradient clipping, which clips gradients larger than a threshold to avoid gradient explosion.
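The idea of dynamic bounds can be sketched as follows. The bound schedule and the `gamma` value are assumptions modelled on the form commonly given for AdaBound, with `final_lr` playing the same role as in the Keras call later in this article; the function names are hypothetical.

```python
import numpy as np

def adabound_bounds(step, final_lr=0.1, gamma=1e-3):
    """Lower and upper learning-rate bounds at a given 1-based training step."""
    lower = final_lr * (1.0 - 1.0 / (gamma * step + 1.0))  # rises from ~0 to final_lr
    upper = final_lr * (1.0 + 1.0 / (gamma * step))        # falls from very large to final_lr
    return lower, upper

def clip_step_size(adam_step_size, step, final_lr=0.1, gamma=1e-3):
    """Clip per-parameter Adam step sizes into the current bounds."""
    lower, upper = adabound_bounds(step, final_lr, gamma)
    return np.clip(adam_step_size, lower, upper)

early = adabound_bounds(1)          # very wide interval: behaves like Adam
late = adabound_bounds(1_000_000)   # narrow interval around final_lr: behaves like SGD
clipped = clip_step_size(np.array([1e-6, 0.05, 10.0]), step=1_000_000)

print(early, late, clipped)
```

Early in training the bounds barely constrain the step size, while late in training every parameter ends up with nearly the same step size, which is how the method moves from Adam-like to SGD-like behaviour.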


Compiling the CNN model with the Adabound optimizer, replacing Adam in the above CNN model

Adabound is not built into the Keras optimizers library, so in the code snippet below we import it from the separate keras_adabound package.

from keras_adabound import AdaBound
model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy, optimizer=AdaBound(lr=1e-3, final_lr=0.1), metrics=['accuracy'])

After updating the optimizer to Adabound, we again trained the model          



Conclusion

In this article, we discussed adaptive optimization techniques and demonstrated their implementation. As discussed above, a good optimization algorithm converges quickly and generalizes well to new data. In our experiments, Adabound reached a higher accuracy than the other optimizers, balancing convergence and generalization.

Prudhvi varma
AI enthusiast, currently working with Analytics India Magazine. I have experience working with machine learning, deep learning real-time problems, neural networks, and structuring machine learning projects. I am a computer vision researcher interested in solving real-time computer vision problems.
