# Guide To Tensorflow Keras Optimizers

Optimizers are the classes and methods used to train your machine/deep learning model. Choosing the right optimizer matters because it directly affects training speed and final performance. Both PyTorch and TensorFlow ship many optimization algorithms, but in this article we will focus on how to initialize TensorFlow Keras optimizers, with a small demonstration in a Jupyter notebook.

Before diving into optimizers, it helps to have some prior exposure to loss functions, as the two work hand in hand in deep learning projects. We have already covered TensorFlow loss functions and PyTorch loss functions in our previous articles. A loss function is simply a mathematical way of measuring how well your machine/deep learning model performs.

So how are loss functions and optimizers related?

During training we tune the model's weights (and, separately, its hyperparameters) to minimize the loss and make our predictions as accurate as possible. Changing these weights is the optimizer's job: it ties the model parameters to the loss function by updating the model in response to the loss function's output. Put simply, optimizers shape the model into its most accurate form by adjusting its weights, while the loss function tells the optimizer whether it is moving in the right or wrong direction.
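To make that relationship concrete, here is a minimal sketch (a toy single-weight "model", not from this article) in which the optimizer repeatedly reads the gradient of the loss and updates the weight in response:

```
import tensorflow as tf

# A single trainable weight standing in for a model parameter (toy example)
w = tf.Variable(5.0)
opt = tf.keras.optimizers.SGD(learning_rate=0.1)

for step in range(3):
    with tf.GradientTape() as tape:
        loss = (w - 2.0) ** 2               # loss is smallest when w == 2
    grads = tape.gradient(loss, [w])        # how the loss changes with respect to w
    opt.apply_gradients(zip(grads, [w]))    # optimizer nudges w against the gradient
    print(step, float(w.numpy()), float(loss.numpy()))
```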

Optimizers are classes or methods used to change the attributes of your machine/deep learning model, such as weights and the learning rate, in order to reduce the loss. They also help you get results faster.


## Tensorflow Keras Optimizers Classes:

The tf.keras.optimizers module exposes the following optimizer classes:

• Adadelta: Optimizer that implements the Adadelta algorithm.
• Adagrad: Optimizer that implements the Adagrad algorithm.
• Adam: Optimizer that implements the Adam algorithm.
• Adamax: Optimizer that implements the Adamax algorithm.
• Ftrl: Optimizer that implements the FTRL algorithm.
• Nadam: Optimizer that implements the Nadam algorithm.
• Optimizer class: Base class for Keras optimizers.
• RMSprop: Optimizer that implements the RMSprop algorithm.
• SGD: Gradient descent (with momentum) optimizer.

Before going through these classes, let's first look at the most popular algorithm, gradient descent. Many other algorithms, such as Adagrad, RMSprop, and Adam, have been built on top of it. It is often called the king of all optimizers, being fast, robust, and flexible. A basic gradient descent workflow follows these steps:

1. Calculate how a small change in each weight parameter would affect the loss function (i.e. compute the gradient).
2. Adjust each individual weight on the basis of its gradient.
3. Repeat steps 1 and 2 until the loss function reaches its minimum.

But there are some subtleties. The gradient is a partial derivative, a measure of change: it connects the loss function to the weights, telling us what operation to apply to each weight to reduce the loss (subtract 0.04, add 0.2, or whatever is relevant).

The real problem arises when the algorithm gets stuck in a local minimum, which happens easily with large, multi-dimensional datasets. The global minimum is the smallest value a function takes over its entire domain, whereas a local minimum is only the smallest value within a certain neighborhood.
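As a quick illustration of the three steps above, here is a minimal, framework-free sketch on a toy one-dimensional loss (chosen only for illustration; it has a single minimum, so it cannot get stuck the way a real multi-dimensional loss can):

```
# Toy example: minimise loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = 0.0                             # start from an arbitrary weight
learning_rate = 0.1

for step in range(25):
    grad = 2 * (w - 3)              # step 1: how a small change in w affects the loss
    w = w - learning_rate * grad    # step 2: adjust w against the gradient
                                    # step 3: repeat until the loss stops improving
print(w)                            # w converges towards 3, the minimum
```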

## Initialize

To get started you can simply use Google Colab, or, for a local machine, download Anaconda, which bundles all the major data science packages in one place. Use the import command below to initialize TensorFlow:

`import tensorflow as tf`

## Adagrad Optimizer

Adagrad adapts the learning rate to individual features: some of the weights in your model end up with different effective learning rates than others. It works best on sparse datasets where many inputs are missing. In TensorFlow, you can call the optimizer using the below command:

```
tf.keras.optimizers.Adagrad(
    learning_rate=0.001,
    initial_accumulator_value=0.1,
    epsilon=1e-07,
    **kwargs
)
```

Adagrad uses parameter-specific learning rates, which are adapted according to how frequently a parameter gets updated during training. The arguments we pass to this optimizer are learning_rate, initial_accumulator_value, epsilon, name, and **kwargs; you can read more about them in the Keras documentation or the TensorFlow docs. The optimizer follows Duchi et al., 2011.
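As a rough usage sketch (the model architecture and loss below are placeholders, not recommendations), an Adagrad instance is typically handed to model.compile():

```
import tensorflow as tf

# Toy model just to show where the optimizer plugs in
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])

opt = tf.keras.optimizers.Adagrad(learning_rate=0.01, initial_accumulator_value=0.1)
model.compile(optimizer=opt, loss="mse")
```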

## RMSprop Optimizer

RMSprop is an extension of Adagrad developed by Geoffrey Hinton. The thinking behind it is straightforward: instead of letting all past gradients accumulate, it only accumulates gradients over a fixed window, via an exponentially decaying average. It is closely related to Adadelta (another improved version of Adagrad, discussed below). You can call it in the TensorFlow framework using the below command:

```
tf.keras.optimizers.RMSprop(
    learning_rate=0.001,
    rho=0.9,
    momentum=0.0,
    epsilon=1e-07,
    centered=False,
    name='RMSprop',
    **kwargs
)
```
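Keras accepts an optimizer either as a string identifier (with default hyperparameters) or as a configured instance; here is a small sketch with a toy model:

```
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])  # toy model

# Option 1: pass the string identifier and accept the library defaults
model.compile(optimizer="rmsprop", loss="mse")

# Option 2: configure the instance yourself, e.g. adding momentum on top of rho
opt = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, momentum=0.9)
model.compile(optimizer=opt, loss="mse")
```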

## Adadelta Optimizer

Like RMSprop, Adadelta (read the paper: Zeiler, 2012) is another improved optimization algorithm in the Adagrad family; here "delta" refers to the difference between the current weight and the newly updated weight. In the original paper, Adadelta removes the need for a hand-set learning rate entirely, replacing it with an exponential moving average of squared deltas (the Keras class still exposes a learning_rate argument). You can call it in your machine learning project using the below command, with basic parameters such as learning_rate, rho, epsilon, and **kwargs:

```
tf.keras.optimizers.Adadelta(
    learning_rate=0.001,
    rho=0.95,
    epsilon=1e-07,
    **kwargs
)
```
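One quick way to double-check what the Adadelta class exposes on your installed TensorFlow version is to instantiate it and inspect its configuration (a small sketch; the exact values printed may differ slightly between versions):

```
import tensorflow as tf

opt = tf.keras.optimizers.Adadelta()   # library defaults
print(opt.get_config())                # shows learning_rate, rho, epsilon, name, ...
```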

## Adam Optimizer

Adam stands for adaptive moment estimation; it is another way of using past gradients to compute the current update. For the full mathematical explanation you can read the official paper (Kingma & Ba, 2014). Adam borrows the concept of momentum by adding fractions of previous gradients to the current one, and it is widely used in practice when training neural nets.

You can call it in TensorFlow using the below command:

```
tf.keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
    amsgrad=False,
    **kwargs
)
```

Here is the standalone usage for the algorithm:

```
opt = tf.keras.optimizers.Adam(learning_rate=0.1)
var1 = tf.Variable(10.0)
loss = lambda: (var1 ** 2) / 2.0     # d(loss)/d(var1) == var1
step_count = opt.minimize(loss, [var1]).numpy()
# The first step is `-learning_rate*sign(grad)`
var1.numpy()
```

## Adamax Optimizer

Adamax is a variant of Adam based on the infinity norm. It is sometimes considered superior to Adam, especially in models with embeddings. You can call it using the below command:

```
tf.keras.optimizers.Adamax(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
)
```

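Since the claimed advantage is for models with embeddings, here is a minimal sketch along those lines (the vocabulary size, embedding dimension, and loss are placeholders):

```
import tensorflow as tf

# Toy text-style model with an embedding layer
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=32),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer=tf.keras.optimizers.Adamax(learning_rate=0.001),
              loss="binary_crossentropy")
```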

## Nadam Optimizer

The Nadam optimizer gets its name from Nesterov and Adam. Its research paper was published in 2015; the Nesterov component it uses is more efficient than earlier implementations of Nesterov momentum. In short, Nadam uses Nesterov momentum to update the gradient. You can call the Nadam optimizer class while training your model in TensorFlow using the below command:

```
tf.keras.optimizers.Nadam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
)
```

## Ftrl Optimizer

Following Algorithm 1 of Google's FTRL research paper, this version supports both online L2 regularization (the L2 penalty given in the paper) and shrinkage-type L2 regularization (the addition of an L2 penalty to the loss function).

```
tf.keras.optimizers.Ftrl(
    learning_rate=0.001,
    learning_rate_power=-0.5,
    initial_accumulator_value=0.1,
    l1_regularization_strength=0.0,
    l2_regularization_strength=0.0,
    name="Ftrl",
    l2_shrinkage_regularization_strength=0.0,
    beta=0.0,
    **kwargs
)
```
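FTRL is most often used for large, sparse linear models such as wide logistic regression; here is a minimal sketch with non-zero regularization strengths (the feature width and values are placeholders):

```
import tensorflow as tf

# Toy "wide" logistic-regression style model over a large sparse feature vector
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(1000,)),
])

opt = tf.keras.optimizers.Ftrl(
    learning_rate=0.05,
    l1_regularization_strength=0.01,    # non-zero L1 pushes irrelevant weights towards exactly zero
    l2_regularization_strength=0.001,
)
model.compile(optimizer=opt, loss="binary_crossentropy")
```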

## SGD Optimizer

Stochastic gradient descent (SGD), in contrast, performs a parameter update for each individual training example (x⁽ⁱ⁾, y⁽ⁱ⁾):

θ = θ − η · ∇θ J(θ; x⁽ⁱ⁾, y⁽ⁱ⁾)

where θ are the parameters and η is the learning rate.

Unlike batch gradient descent, which performs redundant computations on bigger datasets by recomputing gradients for similar examples before each parameter update, SGD avoids this redundancy by updating one example at a time. However, it performs frequent updates with high variance, which causes the objective function to fluctuate heavily.

You can call the SGD optimizer using the below command:

```
tf.keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.0,
    nesterov=False,
    name="SGD",
    **kwargs
)
```

As a starter, you can run a standalone example like this to see the output:

```
opt = tf.keras.optimizers.SGD(learning_rate=0.1)
var = tf.Variable(1.0)
loss = lambda: (var ** 2) / 2.0      # d(loss)/d(var) == var
step_count = opt.minimize(loss, [var]).numpy()
# Step is `-learning_rate * grad`
var.numpy()
```

## Conclusion

We have covered all the major optimizer classes supported by the TensorFlow framework. To learn more about their usage and for practical demonstrations, you can follow the official documentation curated by Keras and TensorFlow; the two are essentially the same, since Keras is now merged into TensorFlow, and the TensorFlow documentation also shows each optimizer being used in example projects.

Mohit is a Data & Technology Enthusiast with good exposure to solving real-world problems in various avenues of IT and Deep learning domain. He believes in solving human's daily problems with the help of technology.
