
Guide To The Latest AdaBelief Optimizer for Machine/Deep learning


We have previously covered many of the optimizers available in the TensorFlow and PyTorch libraries; today we will discuss a specific one, AdaBelief. Almost every neural network and machine learning algorithm uses an optimizer to minimize its loss function via gradient descent. PyTorch and TensorFlow provide many optimizers for different types of problems, such as SGD, Adam, and RMSprop. To choose the best one, we need to consider several factors, such as convergence speed, model generalization, and the loss metric.

SGD (Stochastic Gradient Descent) tends to generalize better, while Adam converges faster.


Recently, researchers from Yale University introduced a novel optimizer in the paper “AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients”. It combines the strengths of several other optimizers into one. The most popular optimizers for deep learning and machine learning tasks fall into two categories: adaptive methods (such as Adam) and accelerated schemes (such as SGD). For tasks such as convolutional neural networks (CNNs), adaptive methods converge faster but generalize worse than SGD (stochastic gradient descent); for more complex tasks such as generative adversarial networks (GANs), adaptive methods are typically the default because of their stability.

The authors introduce AdaBelief, which can simultaneously achieve three goals: 

  1. fast convergence as in adaptive methods,
  2. good generalization as in SGD, 
  3. training stability.

Adam and AdaBelief

Let’s look at both optimizers in detail to see what has been changed and optimized.

Adam (Adaptive Moment Estimation)


The Adam Optimizer is one of the most used optimizers to train different kinds of neural networks.

In Adam, the update direction is m_t / √v_t, where v_t is the EMA (exponential moving average) of g_t². It essentially combines the optimization techniques of momentum and RMSprop. Adam maintains two internal states: m_t, the EMA of the gradient g_t (momentum), and v_t, the EMA of the squared gradient. With every training batch, each of them is updated via exponential moving averaging:

m_t = β₁ m_{t-1} + (1 − β₁) g_t
v_t = β₂ v_{t-1} + (1 − β₂) g_t²

Here β₁ and β₂ are hyperparameters. These states are then used to update the parameters at each step, as shown below:

θ_t = θ_{t-1} − α m_t / (√v_t + ϵ)

where α is the learning rate, and ϵ is added to improve stability.
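
To make the formulas concrete, here is a minimal NumPy sketch of a single Adam step. It follows the simplified update above (bias correction is omitted for brevity), and the function and variable names are illustrative only:

import numpy as np

def adam_step(theta, g, m, v, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # EMA of the gradient (momentum) and of the squared gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    # Parameter update: the step is scaled by 1 / (sqrt(v) + eps)
    theta = theta - alpha * m / (np.sqrt(v) + eps)
    return theta, m, v

When v is large the effective step shrinks, and when v is small the step grows; this adaptive scaling is exactly the part that AdaBelief modifies.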

AdaBelief (Adapting Stepsizes by the Belief in Observed Gradients)


The AdaBelief optimizer is extremely similar to Adam, with one slight difference: instead of using v_t, the EMA of the squared gradient, it introduces a new quantity s_t, the EMA of the squared difference between the gradient and its EMA prediction m_t:

s_t = β₂ s_{t-1} + (1 − β₂) (g_t − m_t)²

This s_t replaces v_t in the update direction:

θ_t = θ_{t-1} − α m_t / (√s_t + ϵ)
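
The change can be expressed in a few lines. Below is a minimal NumPy sketch mirroring the Adam sketch above (bias correction again omitted; names are illustrative only):

import numpy as np

def adabelief_step(theta, g, m, s, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-16):
    m = beta1 * m + (1 - beta1) * g
    # The only change from Adam: track the squared deviation (g - m)^2
    # of the observed gradient from its EMA prediction, not g^2
    s = beta2 * s + (1 - beta2) * (g - m) ** 2
    theta = theta - alpha * m / (np.sqrt(s) + eps)
    return theta, m, s

When the observed gradient g is close to its prediction m (high “belief”), s is small and the optimizer takes a large step; when they deviate strongly, s is large and the step shrinks.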

Installation and Usage

git clone https://github.com/juntang-zhuang/Adabelief-Optimizer.git

1. PyTorch implementations

See the folder PyTorch_Experiments; in each subfolder, execute sh run.sh. See readme.txt in each subfolder for visualization, or refer to the Jupyter notebooks.

 pip install adabelief-pytorch==0.2.0
 from adabelief_pytorch import AdaBelief
 AdaBelief_optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=1e-16, betas=(0.9,0.999), weight_decouple = True, rectify = False) 
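
As a quick sanity check of the API above, here is a minimal training-loop sketch; the linear model and random data are placeholders purely for illustration:

import torch
import torch.nn as nn
from adabelief_pytorch import AdaBelief

model = nn.Linear(10, 1)  # toy model, for illustration only
optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=1e-16,
                      betas=(0.9, 0.999), weight_decouple=True, rectify=False)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)  # random placeholder data
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()  # AdaBelief is used exactly like any torch.optim optimizer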

2. TensorFlow implementation

The repository also includes TensorFlow projects such as text classification and word embedding.


 pip install adabelief-tf==0.2.0
 from adabelief_tf import AdaBeliefOptimizer
 adabelief_optimizer = AdaBeliefOptimizer(learning_rate=1e-3, epsilon=1e-14, rectify=False)
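
Assuming TensorFlow 2.x with the Keras API, the optimizer can be passed to model.compile like any built-in optimizer; the model and data below are placeholders purely for illustration:

import tensorflow as tf
from adabelief_tf import AdaBeliefOptimizer

# Toy Keras model, for illustration only
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])

model.compile(optimizer=AdaBeliefOptimizer(learning_rate=1e-3, epsilon=1e-14, rectify=False),
              loss="mse")

x = tf.random.normal((256, 10))  # random placeholder data
y = tf.random.normal((256, 1))
model.fit(x, y, epochs=2, batch_size=32, verbose=0)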

Below are some of the experimental results comparing the performance of the AdaBelief optimizer with other optimizers on different neural networks, such as CNNs, LSTMs, GANs, and Transformers:

1. Results on Image Classification:


2. Results on LSTM (Time Series Modeling):


3. Results on a small GAN (Generative Adversarial Network) with a vanilla CNN generator:


4. Results on Transformer:


5. Results on a Toy Example:


Conclusion

We got to know AdaBelief, an optimizer derived from Adam that introduces no extra parameters; only one term in the update rule is changed. It offers both the fast convergence of adaptive methods and the good generalization of SGD. It adapts its step size according to its “belief” in the current gradient direction: when the observed gradient agrees with its prediction, it takes a large step, and when they disagree, it takes a small step. It also performs well in the “large gradient, small curvature” case because it considers both the amplitude and the sign of the gradients. For more details, refer to the original paper and the official GitHub repository.
