Optimizers are the broader class of methods used to train your machine/deep learning model. Choosing the right optimizer matters because it affects training speed and performance. There are many optimizer algorithms available in the PyTorch and TensorFlow libraries, but today we will discuss how to initialize TensorFlow Keras optimizers, with a small demonstration in a Jupyter notebook.
Before optimizers, it’s good to have some preliminary exposure to loss functions, as both work together in deep learning projects. We have already covered TensorFlow loss functions and PyTorch loss functions in our previous articles. Loss functions are just a mathematical way of measuring how well your machine/deep learning model performs.
During the training of the model, we adjust the weights (the model’s parameters) to minimize the loss and make our predictions as accurate as possible. This is separate from hyperparameter tuning, which adjusts settings such as the learning rate. This is where the optimizer comes in: it ties the model parameters to the loss function by updating the model in response to the loss function output. Put simply, optimizers shape the model into its most accurate form by adjusting the model weights. The loss function tells the optimizer whether it is moving in the right or wrong direction.
Optimizers are classes or methods used to change the attributes of your machine/deep learning model, such as weights and learning rate, in order to reduce the losses. Optimizers help to get results faster.
TensorFlow Keras Optimizer Classes:
[Figure: Gradient descent optimizers, the year in which the papers were published, and the components they act upon]
TensorFlow mainly supports 9 optimizer classes: eight algorithm implementations such as Adadelta, FTRL, NAdam, and Adam, plus the base Optimizer class.
- Adadelta: Optimizer that implements the Adadelta algorithm.
- Adagrad: Optimizer that implements the Adagrad algorithm.
- Adam: Optimizer that implements the Adam algorithm.
- Adamax: Optimizer that implements the Adamax algorithm.
- Ftrl: Optimizer that implements the FTRL algorithm.
- Nadam: Optimizer that implements the NAdam algorithm.
- Optimizer class: Base class for Keras optimizers.
- RMSprop: Optimizer that implements the RMSprop algorithm.
- SGD: Gradient descent (with momentum) optimizer.
Gradient Descent algorithm
Before explaining the individual optimizers, let’s first look at the most popular algorithm, gradient descent. Many other algorithms, like Adagrad, RMSprop, and Adam, have been built on top of it. It is often regarded as the king of all optimizers because it is fast, robust, and flexible. A basic gradient descent workflow follows these steps:
- Calculate how a small change in each weight parameter affects the loss function.
- Adjust each individual weight on the basis of its gradient.
- Repeat the first two steps until the loss function reaches its minimum.
But there are some complications with this algorithm. The gradient is a partial derivative, a measure of change, that connects the loss function and the weights. It tells us what operation we should apply to the weights to minimize the loss function: subtract 0.04, add 0.2, or whatever is relevant.
The problem comes when the algorithm gets stuck at a local minimum, which can happen when we deal with large multi-dimensional datasets. The global minimum is the lowest value of a function over its whole domain, while a local minimum is the lowest value of the function within a certain neighborhood.
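The workflow above can be sketched in plain Python. This is a minimal illustration, assuming a hypothetical one-weight loss `(w - 3)^2` with a hand-written derivative. The names `loss`, `grad`, and `learning_rate` are chosen here for illustration and are not part of TensorFlow.

```python
def loss(w):
    # Hypothetical loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # Derivative of the hypothetical loss with respect to the weight.
    return 2.0 * (w - 3.0)


w = 0.0              # initial weight
learning_rate = 0.1  # step size
for step in range(100):
    g = grad(w)             # step 1: how the loss changes with w
    w -= learning_rate * g  # step 2: move the weight against the gradient
    if abs(g) < 1e-8:       # step 3: stop once the gradient has vanished
        break

print(w, loss(w))
```

The sign of the gradient decides whether the weight is increased or decreased, and the learning rate decides by how much.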
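To make the local-minimum problem concrete, here is a small sketch on a hypothetical one-dimensional function, `x^4 - 3x^2 + x`, which has one shallow and one deep valley. The function, the starting points, and the helper `descend` are assumptions made for this illustration only.

```python
def f(x):
    # Hypothetical function with a local and a global minimum.
    return x ** 4 - 3.0 * x ** 2 + x


def df(x):
    # Derivative of f.
    return 4.0 * x ** 3 - 6.0 * x + 1.0


def descend(x, learning_rate=0.01, steps=1000):
    # Plain gradient descent from a given starting point.
    for _ in range(steps):
        x -= learning_rate * df(x)
    return x


x_from_right = descend(2.0)   # settles in the shallower, local minimum
x_from_left = descend(-2.0)   # settles in the deeper, global minimum
print(x_from_right, f(x_from_right))
print(x_from_left, f(x_from_left))
```

Both runs stop where the gradient is zero, but only one of them finds the lowest value of the function. Which valley is reached depends entirely on the starting point.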
For setup you can simply use Google Colab, or on a local machine you can download Anaconda, which bundles all the major data science packages into one distribution. Use the import command below to initialize TensorFlow:
import tensorflow as tf
Adagrad Optimizer
Adagrad adapts the learning rate to individual features: this means that some of the weights in your model have different learning rates than others. It works best on sparse datasets where many inputs are missing. In TensorFlow, you can call the optimizer using the command below.
tf.keras.optimizers.Adagrad( learning_rate=0.001, initial_accumulator_value=0.1, epsilon=1e-07, name="Adagrad", **kwargs )
It uses parameter-specific learning rates, which are adapted according to how frequently a parameter gets updated during training. The parameters we pass to this optimizer are learning_rate, initial_accumulator_value, epsilon, name, and **kwargs; you can read more about them in the Keras documentation or the TensorFlow docs. This optimizer is described in the Duchi et al., 2011 paper.
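The per-parameter behaviour can be sketched in plain Python. This is a simplified illustration of an Adagrad-style update, not TensorFlow's actual implementation. The helper `adagrad_step` and the two toy gradient streams (one feature seen every step, one seen every tenth step) are assumptions made for this example.

```python
import math


def adagrad_step(weights, grads, accumulators, learning_rate=0.001, epsilon=1e-7):
    """Apply one Adagrad-style update in place to each weight."""
    for i, g in enumerate(grads):
        accumulators[i] += g * g  # running sum of squared gradients
        weights[i] -= learning_rate * g / (math.sqrt(accumulators[i]) + epsilon)


weights = [1.0, 1.0]
accumulators = [0.1, 0.1]  # mirrors initial_accumulator_value=0.1
for step in range(100):
    frequent_grad = 1.0                         # feature present in every example
    rare_grad = 1.0 if step % 10 == 0 else 0.0  # feature present in 1 of 10 examples
    adagrad_step(weights, [frequent_grad, rare_grad], accumulators, learning_rate=0.1)

# Effective step size each weight would get on its next non-zero gradient.
effective_lr = [0.1 / (math.sqrt(a) + 1e-7) for a in accumulators]
print(effective_lr)
```

The rarely updated weight keeps a larger effective learning rate than the frequently updated one, which is why this family of methods is often suggested for sparse inputs.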
RMSprop Optimizer
It is a refinement of Adagrad proposed by Geoffrey Hinton. The thinking behind this optimizer is pretty straightforward: instead of letting all of the squared gradients accumulate, it keeps a moving average of them, so that only a recent window of gradients matters. It is very similar to Adadelta (another updated version of Adagrad with some improvements). You can call it in the TensorFlow framework using the command below:
tf.keras.optimizers.RMSprop( learning_rate=0.001, rho=0.9, momentum=0.0, epsilon=1e-07, centered=False, name='RMSprop', **kwargs )
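A rough sketch of the idea in plain Python follows. It assumes the common textbook form of the update with the default `rho` and `epsilon` values shown above. The helper `rmsprop_step` is hypothetical and leaves out the `momentum` and `centered` options.

```python
import math


def rmsprop_step(w, g, avg_sq, learning_rate=0.001, rho=0.9, epsilon=1e-7):
    """One simplified RMSprop-style update for a single weight."""
    # Exponential moving average of squared gradients instead of a running sum.
    avg_sq = rho * avg_sq + (1.0 - rho) * g * g
    w -= learning_rate * g / (math.sqrt(avg_sq) + epsilon)
    return w, avg_sq


w, avg_sq = 0.0, 0.0
for _ in range(200):
    w, avg_sq = rmsprop_step(w, 1.0, avg_sq)  # constant gradient of 1.0

# The average levels off near g**2 instead of growing with every step.
print(w, avg_sq)
```

With a plain running sum the denominator would keep growing and the steps would shrink toward zero. The moving average keeps the step size roughly constant when the gradients are stable.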
Adadelta(adaptive delta) Optimizer
Like RMSprop, Adadelta (read the paper: Zeiler, 2012) is another improved optimization algorithm built on Adagrad. Here delta refers to the difference between the current weight and the newly updated weight. The original Adadelta method removes the need to set a learning rate by replacing it with an exponential moving average of squared deltas, although the Keras class still accepts a learning_rate argument. You can call it in your machine learning project using the command below with basic parameters like learning_rate, rho, epsilon, and **kwargs.
tf.keras.optimizers.Adadelta( learning_rate=0.001, rho=0.95, epsilon=1e-07, name='Adadelta', **kwargs )
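The role of the two moving averages can be sketched as follows. This is a simplified version of the update described in the paper, written for a single weight and without the extra learning-rate scaling. The helper `adadelta_step` and the toy loss `w**2 / 2` are assumptions for illustration.

```python
import math


def adadelta_step(w, g, avg_sq_grad, avg_sq_delta, rho=0.95, epsilon=1e-7):
    """One simplified Adadelta-style update for a single weight."""
    # Moving average of squared gradients.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
    # Step size comes from the ratio of past deltas to past gradients.
    delta = -math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon) * g
    # Moving average of squared deltas (weight changes).
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta * delta
    return w + delta, avg_sq_grad, avg_sq_delta


w, avg_sq_grad, avg_sq_delta = 1.0, 0.0, 0.0
for _ in range(100):
    g = w  # gradient of the toy loss (w ** 2) / 2
    w, avg_sq_grad, avg_sq_delta = adadelta_step(w, g, avg_sq_grad, avg_sq_delta)

print(w)
```

No learning rate appears in the step itself. With a small `epsilon` the first steps are tiny, so progress is slow at the start of training.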
Adam Optimizer
Adam stands for adaptive moment estimation, which is another way of using past gradients to compute the current update. For the deep mathematical explanation you can read its official paper (Kingma & Ba, 2014). Adam uses the concept of momentum by adding fractions of previous gradients to the current one, and it is widely used in practice for training neural nets.
You can use it in TensorFlow by adding the command below to your project.
tf.keras.optimizers.Adam( learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='Adam', **kwargs )
Here is the standalone usage for the algorithm:
opt = tf.keras.optimizers.Adam(learning_rate=0.1)
var1 = tf.Variable(10.0)
loss = lambda: (var1 ** 2) / 2.0  # d(loss)/d(var1) == var1
step_count = opt.minimize(loss, [var1]).numpy()
# The first step is `-learning_rate*sign(grad)`
var1.numpy()
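The comment in the standalone example says the first step is `-learning_rate*sign(grad)`. A plain-Python sketch of the update rule from the paper shows where that comes from. The helper `adam_step` is a hypothetical name and this is not TensorFlow's implementation.

```python
import math


def adam_step(w, g, m, v, t, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """One simplified Adam update for a single weight at step t (t starts at 1)."""
    m = beta_1 * m + (1.0 - beta_1) * g      # moving average of gradients
    v = beta_2 * v + (1.0 - beta_2) * g * g  # moving average of squared gradients
    m_hat = m / (1.0 - beta_1 ** t)          # bias correction for m
    v_hat = v / (1.0 - beta_2 ** t)          # bias correction for v
    w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v


w, m, v = 10.0, 0.0, 0.0
g = w  # gradient of (w ** 2) / 2
w, m, v = adam_step(w, g, m, v, t=1, learning_rate=0.1)
print(w)
```

On the first step the bias-corrected averages reduce to `g` and `g**2`, so the ratio is roughly `sign(g)` and the weight moves by about one learning rate, from 10.0 to about 9.9.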
AdaMax Optimizer Class
You can call it using the command below:
tf.keras.optimizers.Adamax( learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name='Adamax', **kwargs )
It is a variant of Adam based on the infinity norm. Sometimes it is considered superior to Adam, especially in models with embeddings.
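The infinity-norm idea can be sketched in plain Python. Instead of averaging squared gradients, the update keeps a decayed running maximum of the gradient magnitude. The helper `adamax_step` is a hypothetical, simplified single-weight version based on the form given in the Adam paper.

```python
def adamax_step(w, g, m, u, t, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """One simplified Adamax update for a single weight at step t (t starts at 1)."""
    m = beta_1 * m + (1.0 - beta_1) * g  # moving average of gradients
    u = max(beta_2 * u, abs(g))          # decayed running maximum (infinity norm)
    w -= (learning_rate / (1.0 - beta_1 ** t)) * m / (u + epsilon)
    return w, m, u


w, m, u = 10.0, 0.0, 0.0
for t in range(1, 4):
    g = w  # gradient of the toy loss (w ** 2) / 2
    w, m, u = adamax_step(w, g, m, u, t, learning_rate=0.1)

print(w, u)
```

Because `u` is a maximum rather than an average, a single large gradient dominates the scale for a while and then decays slowly at rate `beta_2`.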
NAdam Optimizer
NAdam is short for Nesterov-accelerated Adam. Its research paper (Dozat, 2015) was published in 2015. Nadam replaces Adam’s plain momentum with Nesterov momentum, which applies a look-ahead to the momentum step and is often more efficient. You can use the NAdam optimizer class while training your model in TensorFlow with the command below:
tf.keras.optimizers.Nadam( learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name='Nadam', **kwargs )
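One commonly cited simplified form of the Nadam update is sketched below. Library implementations may differ in details such as bias correction and momentum scheduling, so treat the helper `nadam_step` as an assumption-laden illustration and not as the Keras code.

```python
import math


def nadam_step(w, g, m, v, t, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """One simplified Nadam-style update for a single weight at step t (t starts at 1)."""
    m = beta_1 * m + (1.0 - beta_1) * g
    v = beta_2 * v + (1.0 - beta_2) * g * g
    m_hat = m / (1.0 - beta_1 ** t)
    v_hat = v / (1.0 - beta_2 ** t)
    # Nesterov look-ahead: blend the momentum estimate with the current gradient.
    lookahead = beta_1 * m_hat + (1.0 - beta_1) * g / (1.0 - beta_1 ** t)
    w -= learning_rate * lookahead / (math.sqrt(v_hat) + epsilon)
    return w, m, v


w, m, v = 10.0, 0.0, 0.0
for t in range(1, 11):
    g = w  # gradient of the toy loss (w ** 2) / 2
    w, m, v = nadam_step(w, g, m, v, t, learning_rate=0.1)

print(w)
```

The only change from a plain Adam step is the `lookahead` term, which gives the current gradient extra weight in early steps.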
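FTRL Optimizer
This optimizer implements the FTRL (Follow The Regularized Leader) algorithm as described in Algorithm 1 of the research paper by Google. This version supports both online L2 (the L2 penalty given in that paper) and shrinkage-type L2 (which is the addition of an L2 penalty to the loss function). You can call it using the command below: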
tf.keras.optimizers.Ftrl( learning_rate=0.001, learning_rate_power=-0.5, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, name="Ftrl", l2_shrinkage_regularization_strength=0.0, beta=0.0, **kwargs )
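A per-coordinate sketch of the FTRL-Proximal update helps show why the L1 term produces exact zeros. It uses the paper's notation (`alpha`, `beta`, `l1`, `l2`) in place of the Keras argument names, feeds in fixed toy gradient streams, and leaves out shrinkage-type L2. All helper names are hypothetical.

```python
import math


def ftrl_weight(z, n, alpha, beta, l1, l2):
    """Closed-form weight for one coordinate given its accumulated state."""
    if abs(z) <= l1:
        return 0.0  # the L1 threshold keeps weak signals exactly at zero
    sign = 1.0 if z > 0 else -1.0
    return -(z - sign * l1) / ((beta + math.sqrt(n)) / alpha + l2)


def ftrl_update(z, n, grad, weight, alpha):
    """Fold one new gradient into the accumulators z and n."""
    sigma = (math.sqrt(n + grad * grad) - math.sqrt(n)) / alpha
    return z + grad - sigma * weight, n + grad * grad


def run_ftrl(grads, alpha=0.1, beta=1.0, l1=1.0, l2=0.0):
    z, n = 0.0, 0.0
    for g in grads:
        w = ftrl_weight(z, n, alpha, beta, l1, l2)
        z, n = ftrl_update(z, n, g, w, alpha)
    return ftrl_weight(z, n, alpha, beta, l1, l2)


noisy = run_ftrl([0.1, -0.1] * 10)   # gradients that cancel out
consistent = run_ftrl([-1.0] * 10)   # gradients that always point the same way
print(noisy, consistent)
```

The weight fed by cancelling gradients never leaves zero, while the consistently pushed weight becomes non-zero once its accumulated gradient exceeds the `l1` threshold.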
SGD Optimizer
The stochastic gradient descent (SGD) optimization algorithm, in contrast to batch gradient descent, performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$: $\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$.
Batch gradient descent performs redundant computations for bigger datasets, as it recomputes gradients for similar examples before each parameter update. SGD avoids this redundancy by performing one update at a time. It performs frequent updates with a high variance, which causes the objective function to fluctuate heavily.
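The per-example update can be sketched in plain Python on a tiny, made-up regression problem. The data (`y = 2x`), the learning rate, and the variable names are assumptions for illustration.

```python
import random

random.seed(0)  # make the shuffling reproducible
data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # toy examples of y = 2x

w = 0.0
learning_rate = 0.01
for epoch in range(50):
    random.shuffle(data)  # visit the examples in a random order
    for x, y in data:
        g = (w * x - y) * x     # gradient of 0.5 * (w * x - y) ** 2 for one example
        w -= learning_rate * g  # update immediately, before seeing the next example

print(w)
```

Each example triggers its own update, so the weight is revised four times per pass over the data, where batch gradient descent would revise it once.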
You can call the SGD optimizer using the command below:
tf.keras.optimizers.SGD( learning_rate=0.01, momentum=0.0, nesterov=False, name="SGD", **kwargs )
As a starting point, you can run a standalone example like this to see the output:
opt = tf.keras.optimizers.SGD(learning_rate=0.1)
var = tf.Variable(1.0)
loss = lambda: (var ** 2) / 2.0  # d(loss)/d(var) = var
step_count = opt.minimize(loss, [var]).numpy()
# Step is `- learning_rate * grad`
var.numpy()
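The `momentum` and `nesterov` arguments can be understood through a small plain-Python sketch of the commonly documented velocity update. The helpers `sgd_step` and `run` are hypothetical, and the toy loss is the same `(w ** 2) / 2` used in the standalone example.

```python
def sgd_step(w, g, velocity, learning_rate=0.01, momentum=0.0, nesterov=False):
    """One simplified SGD update with optional (Nesterov) momentum."""
    velocity = momentum * velocity - learning_rate * g
    if nesterov:
        w = w + momentum * velocity - learning_rate * g  # look-ahead step
    else:
        w = w + velocity
    return w, velocity


def run(steps, **kwargs):
    # Minimize the toy loss (w ** 2) / 2 starting from w = 1.0.
    w, velocity = 1.0, 0.0
    for _ in range(steps):
        g = w  # gradient of the toy loss
        w, velocity = sgd_step(w, g, velocity, learning_rate=0.1, **kwargs)
    return w


plain = run(1)                                       # one step without momentum
with_momentum = run(2, momentum=0.9)                 # two steps with momentum
with_nesterov = run(2, momentum=0.9, nesterov=True)  # two steps with Nesterov momentum
print(plain, with_momentum, with_nesterov)
```

Without momentum a single step moves the weight by `-learning_rate * grad`, matching the comment in the standalone example. With momentum the previous step keeps contributing, so the weight travels further in the same number of steps.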
We have covered all the major optimizer classes supported by the TensorFlow framework. To learn more about usage and practical demonstrations, you can follow the official documentation from Keras and TensorFlow. Both are essentially the same, since Keras is now merged into TensorFlow, but the TensorFlow documentation also shows each optimizer’s usage in some projects.