Guide to the AdaBelief Optimizer for Machine and Deep Learning



We have previously covered many of the optimizers available in the TensorFlow and PyTorch libraries; today we will discuss a specific one: AdaBelief. Almost every neural network and machine learning algorithm uses an optimizer to minimize its loss function via gradient descent. PyTorch and TensorFlow offer many optimizers suited to different kinds of problems, such as SGD, Adam, and RMSprop. To choose the best optimizer, we need to consider several factors: speed of convergence, generalization of the model, and the loss metric.

SGD (Stochastic Gradient Descent) generally generalizes better, while Adam converges faster.



Recently, researchers from Yale University introduced a novel optimizer in a paper titled “AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients”. It combines the strengths of several other optimizers into one. The most popular optimizers for deep learning and machine learning tasks fall into two categories: adaptive methods (e.g. Adam) and accelerated schemes (e.g. SGD). For tasks such as training convolutional neural networks (CNNs), adaptive methods converge faster but generalize worse than SGD; for more complex tasks like training generative adversarial networks (GANs), adaptive methods are typically the default because of their stability.

The authors introduce AdaBelief, which can simultaneously achieve three goals:

  1. fast convergence as in adaptive methods,
  2. good generalization as in SGD, 
  3. training stability.

Adam and AdaBelief

Let’s look at both optimizers in detail to see what has been changed and optimized.

Adam(Adaptive Moment Estimation)


The Adam Optimizer is one of the most used optimizers to train different kinds of neural networks.

In Adam, the update direction is m_t / √v_t, where m_t is the EMA (Exponential Moving Average) of the gradient g_t and v_t is the EMA of g_t². Adam essentially combines the optimization techniques of momentum and RMSprop, and maintains two internal states: the momentum m_t and the EMA of the squared gradient, v_t. With every training batch, each of them is updated by exponential weighted averaging:

m_t = β₁ · m_{t−1} + (1 − β₁) · g_t
v_t = β₂ · v_{t−1} + (1 − β₂) · g_t²

Here β₁ and β₂ are hyperparameters that control the decay rates of the two moving averages. These states are then used to update the parameters at each step as shown below:

θ_t = θ_{t−1} − α · m_t / (√v_t + ϵ)

where α is the learning rate and ϵ is a small constant added for numerical stability.
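As a concrete illustration, the Adam update for a single scalar parameter can be sketched in plain Python. This is a minimal sketch with the common default hyperparameters and the standard bias correction, not the library implementation:

```python
def adam_step(theta, m, v, g, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta, given gradient g at step t."""
    m = beta1 * m + (1 - beta1) * g          # EMA of the gradient
    v = beta2 * v + (1 - beta2) * g * g      # EMA of the squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction for the EMAs
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v
```

In a real optimizer this update is applied element-wise to every parameter tensor; the scalar version just makes the two internal states easy to follow.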

AdaBelief (Adapting Stepsizes by the Belief in Observed Gradients)


The AdaBelief optimizer is extremely similar to Adam, with one slight difference: instead of using v_t, the EMA of the squared gradient, it maintains a new state s_t, the EMA of the squared deviation of the gradient from its EMA:

s_t = β₂ · s_{t−1} + (1 − β₂) · (g_t − m_t)²

This s_t replaces v_t to form the update direction:

θ_t = θ_{t−1} − α · m_t / (√s_t + ϵ)

Intuitively, s_t is small when the observed gradient g_t stays close to its prediction m_t, i.e. when the optimizer has a strong “belief” in the gradient, and the resulting step is large; a large deviation shrinks the step.

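The difference from Adam can be made concrete in the same kind of scalar plain-Python sketch. This is illustrative only; the official implementation also adds ϵ inside the s_t update and supports options such as weight decoupling and rectification:

```python
def adabelief_step(theta, m, s, g, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-16):
    """One AdaBelief update for a scalar parameter theta, given gradient g at step t."""
    m = beta1 * m + (1 - beta1) * g          # EMA of the gradient, as in Adam
    # The only change from Adam: track the EMA of (g - m)^2, the squared
    # deviation of the gradient from its EMA, instead of the EMA of g^2.
    s = beta2 * s + (1 - beta2) * (g - m) ** 2
    m_hat = m / (1 - beta1 ** t)             # bias correction
    s_hat = s / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (s_hat ** 0.5 + eps)
    return theta, m, s
```

When successive gradients agree with their moving average, s stays small, so the denominator shrinks and the effective step grows.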
Installation and Usage

To try the official code, clone the repository:

```shell
git clone https://github.com/juntang-zhuang/Adabelief-Optimizer.git
```

1. PyTorch implementation

See the folder PyTorch_Experiments; for each subfolder, run the provided shell script, and refer to the readme.txt in each subfolder or the accompanying Jupyter notebooks for visualization.

```shell
pip install adabelief-pytorch==0.2.0
```

```python
from adabelief_pytorch import AdaBelief

AdaBelief_optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=1e-16,
                                betas=(0.9, 0.999), weight_decouple=True, rectify=False)
```

2. TensorFlow implementation

The repository also includes example TensorFlow projects, such as text classification and word embeddings.

```shell
pip install adabelief-tf==0.2.0
```

```python
from adabelief_tf import AdaBeliefOptimizer

adabelief_optimizer = AdaBeliefOptimizer(learning_rate=1e-3, epsilon=1e-14, rectify=False)
```

Below are some of the experimental results from the paper comparing the performance of the AdaBelief optimizer against other optimizers on different neural networks such as CNNs, LSTMs, and GANs:

1. Results on Image Classification:


2. Result on LSTM(Time Series Modeling):


3. Results on a small GAN(Generative Adversarial Network) with vanilla CNN generator:


4. Results on Transformer


5. Results on Toy Example



To sum up, AdaBelief is an optimizer derived from Adam with no extra hyperparameters, just a change in one of its internal states. It delivers both the fast convergence of adaptive methods and the good generalization of SGD. It adapts its step size according to its “belief” in the current gradient direction: when the observed gradient deviates little from its prediction, the step grows, and vice versa. It also performs well in the “large gradient, small curvature” case because it considers both the amplitude and the sign of the gradients.
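The “large gradient, small curvature” behaviour is easy to verify numerically: feed both update rules a large but nearly constant gradient, and the AdaBelief denominator √s stays much smaller than Adam’s √v, so its effective step is larger. A plain-Python sketch with illustrative values:

```python
beta1, beta2 = 0.9, 0.999
m = v = s = 0.0
g = 10.0  # large, nearly constant gradient: large gradient, small curvature

for _ in range(100):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g          # Adam: EMA of g^2 grows large
    s = beta2 * s + (1 - beta2) * (g - m) ** 2   # AdaBelief: EMA of (g - m)^2 stays small

# Effective step size per unit learning rate (ignoring bias correction)
adam_step_size = abs(m) / (v ** 0.5 + 1e-8)
adabelief_step_size = abs(m) / (s ** 0.5 + 1e-8)
```

Here adabelief_step_size comes out several times larger than adam_step_size, matching the intuition that a confident, consistent gradient deserves a bigger step.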


Mohit Maithani
Mohit is a data and technology enthusiast with good exposure to solving real-world problems in various avenues of IT and the deep learning domain. He believes in solving people's daily problems with the help of technology.
