Having covered many of the optimizers available in TensorFlow and PyTorch, today we will discuss a specific one: AdaBelief. Almost every neural network and machine learning algorithm uses an optimizer to minimize its loss function via gradient descent. PyTorch and TensorFlow offer many optimizers for different kinds of problems, such as SGD, Adam, and RMSprop. Choosing the best one involves several factors, such as speed of convergence, generalization of the model, and the resulting loss metrics.
SGD (Stochastic Gradient Descent) generalizes better, while Adam converges faster.
Recently, researchers from Yale University introduced a novel optimizer in the paper “AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients”. It combines desirable features of other optimizers into one. The most popular optimizers for deep learning and machine learning tasks fall into two categories: adaptive methods (such as Adam) and accelerated schemes (such as SGD). For many tasks, such as training convolutional neural networks (CNNs), adaptive methods converge faster but generalize worse than SGD; for more complex tasks such as generative adversarial networks (GANs), adaptive methods are typically the default because of their stability.
The authors introduce AdaBelief, which can simultaneously achieve three goals:
- fast convergence as in adaptive methods,
- good generalization as in SGD,
- training stability.
Adam and AdaBelief
Let’s look at both optimizers in detail to see what has been changed and optimized.
Adam (Adaptive Moment Estimation)
The Adam Optimizer is one of the most used optimizers to train different kinds of neural networks.
In Adam, the update direction is $m_t / \sqrt{v_t}$, where $m_t$ is the EMA (Exponential Moving Average) of the gradient $g_t$ and $v_t$ is the EMA of $g_t^2$. It essentially combines the optimization techniques of momentum and RMSprop. Adam maintains two internal states: the momentum $m_t$ and the squared momentum $v_t$ of the gradient $g_t$. With every training batch, each of them is updated by exponentially weighted averaging (EWA):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

Here $\beta_1$ and $\beta_2$ are hyperparameters. These states are then used to update the parameters at each step:

$$\theta_t = \theta_{t-1} - \alpha \, \frac{m_t}{\sqrt{v_t} + \epsilon}$$

where $\alpha$ is the learning rate, and $\epsilon$ is a small constant added to improve numerical stability.
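To make the moving-average formulas concrete, here is a minimal pure-Python sketch of Adam updating a single scalar parameter. This is an illustrative toy, not the PyTorch or TensorFlow implementation; the bias-correction terms follow the original Adam paper.

```python
def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a scalar parameter theta with gradient g at step t."""
    m = beta1 * m + (1 - beta1) * g        # EMA of the gradient
    v = beta2 * v + (1 - beta2) * g * g    # EMA of the squared gradient
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 (gradient 2*theta) starting from theta = 5.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2 * theta
    theta, m, v = adam_step(theta, g, m, v, t, lr=0.05)
```

After a couple of thousand steps, `theta` settles close to the minimum at 0, oscillating slightly as the momentum term carries it past the optimum.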
The AdaBelief optimizer is extremely similar to Adam, with one slight difference. Instead of using $v_t$, the EMA of the squared gradient, it maintains a new state $s_t$, the EMA of the squared deviation of the gradient from its own EMA:

$$s_t = \beta_2 s_{t-1} + (1 - \beta_2)(g_t - m_t)^2$$

This $s_t$ replaces $v_t$ to form the update direction:

$$\theta_t = \theta_{t-1} - \alpha \, \frac{m_t}{\sqrt{s_t} + \epsilon}$$
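The change is small enough to show as a one-line diff from the Adam sketch. Below is a toy pure-Python version (an assumption for illustration, not the official `adabelief-pytorch` code): the denominator now tracks $(g_t - m_t)^2$, so when observed gradients stay close to their EMA (high “belief”), $s_t$ is small and the effective step is large.

```python
def adabelief_step(theta, g, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-16):
    """One AdaBelief step: identical to Adam except s_t replaces v_t."""
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps  # deviation from the EMA
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (s_hat ** 0.5 + eps)
    return theta, m, s

# Same toy problem: minimize f(theta) = theta^2 from theta = 5.
theta, m, s = 5.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2 * theta
    theta, m, s = adabelief_step(theta, g, m, s, t, lr=0.05)
```

In the smooth “large gradient, small curvature” regime, successive gradients agree with their EMA, so $s_t$ stays small and AdaBelief takes larger steps than Adam would; near the minimum, where gradients flip sign, $(g_t - m_t)^2$ grows and the step size shrinks.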
Installation and Usage
git clone https://github.com/juntang-zhuang/Adabelief-Optimizer.git
1. PyTorch implementations
See the folder PyTorch_Experiments; in each subfolder, execute sh run.sh. See readme.txt in each subfolder, or the accompanying Jupyter notebooks, for visualization.
```
pip install adabelief-pytorch==0.2.0
```

```python
from adabelief_pytorch import AdaBelief

optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=1e-16,
                      betas=(0.9, 0.999), weight_decouple=True, rectify=False)
```
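The `weight_decouple=True` flag enables AdamW-style decoupled weight decay. A rough pure-Python sketch of the distinction follows; the helper names are hypothetical, chosen only to illustrate the idea, and are not part of the package's API.

```python
# Coupled (classic L2 regularization): the decay term is folded into the
# gradient, so it gets rescaled by the adaptive denominator with everything else.
def coupled_grad(theta, g, wd=1e-2):
    return g + wd * theta

# Decoupled (AdamW-style, what weight_decouple=True selects): the weight is
# shrunk directly, independently of the adaptive step.
def decoupled_update(theta, adaptive_step, lr=1e-3, wd=1e-2):
    theta = theta * (1.0 - lr * wd)   # shrink the weight itself
    return theta - adaptive_step      # then apply the optimizer step
```

Decoupling keeps the strength of the weight decay independent of the gradient statistics, which is why it tends to pair better with adaptive methods.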
2. Tensorflow implementation
```
pip install adabelief-tf==0.2.0
```

```python
from adabelief_tf import AdaBeliefOptimizer

optimizer = AdaBeliefOptimizer(learning_rate=1e-3, epsilon=1e-14, rectify=False)
```
Below are some of the experimental results comparing the performance of the AdaBelief optimizer against other optimizers on different neural networks such as CNNs, LSTMs, and GANs:
1. Results on Image Classification:
2. Results on LSTM (Time Series Modeling):
3. Results on a small GAN (Generative Adversarial Network) with a vanilla CNN generator:
4. Results on Transformer
5. Results on Toy Example
To summarize, AdaBelief is an optimizer derived from Adam that introduces no extra hyperparameters; only one internal state changes. It offers both fast convergence and good generalization. It adapts its step size according to its “belief” in the current gradient direction, and it performs well in the “large gradient, small curvature” case because it considers both the magnitude and the sign of the gradients. Some external resources you can explore:
- SN-GAN https://github.com/juntang-zhuang/SNGAN-AdaBelief
- Transformer (PyTorch 1.1) https://github.com/juntang-zhuang/transformer-adabelief
- Transformer (PyTorch 1.6) https://github.com/juntang-zhuang/fairseq-adabelief
- Reinforcement Learning (Toy) https://github.com/juntang-zhuang/rainbow-adabelief
- Reinforcement Learning (HalfCheetah-v2 Walker2d-v2) https://github.com/juntang-zhuang/SAC-Adabelief
- AdaBelief GitHub Repository
- Research Paper