# Ultimate guide to PyTorch Optimizers

PyTorch is one of the fastest-growing deep learning frameworks and is used by many large companies, including Tesla, Apple, Qualcomm, and Facebook. It wraps many algorithms, methods, and classes so that each can be used with a single line of code. This guide covers the PyTorch optimizers. Updating parameters by hand from the computed gradients may be fine for two parameters, but real-world models have far too many parameters to write out the optimization algorithm each time. Instead, we use one of PyTorch's optimizer classes, such as SGD or Adagrad.

The optimizer takes the parameters we want to update and the learning rate we want to use (and possibly many other hyperparameters as well), and performs the updates through its step() method.
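As a minimal sketch of that workflow, the loop below fits a one-layer model to toy data. The data, the `nn.Linear` model, and the choice of 200 iterations are assumptions made for illustration and are not part of the original article.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy data for illustration: y = 2x + 1.
x = torch.linspace(-1, 1, 50).unsqueeze(1)
y = 2 * x + 1

model = nn.Linear(1, 1)            # placeholder model
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

initial_loss = loss_fn(model(x), y).item()
for _ in range(200):
    optimizer.zero_grad()          # clear gradients left over from the last step
    loss = loss_fn(model(x), y)
    loss.backward()                # compute gradients for every parameter
    optimizer.step()               # update the parameters in place
final_loss = loss_fn(model(x), y).item()
```

Swapping `optim.SGD` for another optimizer class leaves the rest of the loop unchanged, which is the main convenience of the shared interface.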


Simply put, an optimizer is the method that updates the model's parameters so as to reduce the loss with much less effort. Let's look at some of the optimizer classes supported by the PyTorch framework:

## TORCH.OPTIM

torch.optim is a PyTorch package containing various optimization algorithms. The most commonly used optimization methods are already supported, and the interface is simple enough that more complex ones can easily be integrated in the future.

To use torch.optim, you construct an optimizer object that holds the current state and updates the parameters based on the computed gradients.

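```
import torch.nn as nn
import torch.optim as optim

# Placeholder model for illustration; substitute your own nn.Module.
model = nn.Linear(10, 1)
SGD_optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.7)
```

## Adadelta class

Written by hand, an Adadelta update looks like this (the snippet uses MXNet's `nd` array API):

```
from mxnet import nd

def Adadelta(weights, sqrs, deltas, rho, batch_size):
    eps_stable = 1e-5
    for weight, sqr, delta in zip(weights, sqrs, deltas):
        # Assumed: g is the gradient averaged over the mini-batch.
        g = weight.grad / batch_size
        # running average of squared gradients
        sqr[:] = rho * sqr + (1. - rho) * nd.square(g)
        cur_delta = nd.sqrt(delta + eps_stable) / nd.sqrt(sqr + eps_stable) * g
        # running average of squared updates
        delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta
        # update weight in place.
        weight[:] -= cur_delta
```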

With PyTorch you can do the same with a single line of code:

`torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)`
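A minimal usage sketch of `torch.optim.Adadelta` follows. The quadratic objective and the 100 iterations are assumptions chosen only to keep the example self-contained.

```python
import torch

# Toy objective for illustration: f(w) = sum((w - 3)^2).
w = torch.zeros(2, requires_grad=True)
optimizer = torch.optim.Adadelta([w], lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)

def objective():
    return ((w - 3.0) ** 2).sum()

start = objective().item()
for _ in range(100):
    optimizer.zero_grad()
    loss = objective()
    loss.backward()
    optimizer.step()   # the running averages are kept inside the optimizer state
end = objective().item()
```

The squared-gradient and squared-update averages that the hand-written version passes around explicitly are stored per parameter by the optimizer object here.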

## Adagrad class

Adagrad (short for adaptive gradient) lowers the learning rate for parameters that are updated frequently and gives a higher learning rate to sparse parameters, that is, parameters that are updated less often. You can implement it without any class like this:

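```
import numpy as np
def Adagrad(weights, grad_fn, lr=0.01, num_iterations=100, eps=1e-10):
    accum = np.zeros_like(weights)  # running sum of squared gradients
    for t in range(num_iterations):
        gradient_update = grad_fn(weights)  # grad_fn: user-supplied gradient function
        accum += gradient_update ** 2
        weights = weights - lr * gradient_update / (np.sqrt(accum) + eps)
    return weights
```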

In many problems, the most critical information is carried by the features that appear least often. So if your use case involves sparse data, Adagrad can be useful. You can call the algorithm in PyTorch with the command below:

`torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)`

It has drawbacks too: it is computationally expensive, and because the accumulated squared gradients only grow, the effective learning rate keeps decreasing, which slows training.
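The shrinking learning rate can be observed directly. The sketch below feeds `torch.optim.Adagrad` a constant gradient of 1 (an artificial assumption made for illustration) and records how far the parameter moves on each step.

```python
import torch

p = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.Adagrad([p], lr=0.1)

step_sizes = []
for _ in range(5):
    before = p.item()
    p.grad = torch.ones_like(p)    # pretend the gradient is always 1
    optimizer.step()
    step_sizes.append(before - p.item())
```

With a constant gradient the accumulated sum of squares after step t is t, so each step is about 0.1 / sqrt(t): the first step is 0.1 and the fourth is already down to 0.05.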

## Adam class

Adam, short for Adaptive Moment Estimation, is one of the most popular optimizers. It combines the good properties of AdaGrad and RMSprop into one optimizer and hence tends to do well on most problems. You can call this class with the command below:

`torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)`
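To make the "moment estimation" concrete, here is a sketch that writes the Adam update by hand for a single scalar and runs the same ten steps with `torch.optim.Adam`. The objective f(theta) = theta^2 and the variable names are assumptions for illustration.

```python
import math
import torch

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-08

def grad_fn(theta):
    # Gradient of the toy objective f(theta) = theta^2.
    return 2.0 * theta

# Hand-written Adam on a single scalar.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 11):
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g        # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g    # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)

# The same ten steps with torch.optim.Adam.
p = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.Adam([p], lr=lr, betas=(beta1, beta2), eps=eps)
for _ in range(10):
    optimizer.zero_grad()
    (p ** 2).sum().backward()
    optimizer.step()
```

Both versions should end at nearly the same value, up to float32 rounding in the PyTorch run.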

## AdamW class

The authors of Decoupled Weight Decay Regularization proposed an improved version of Adam, called AdamW, in which weight decay is applied only after the parameter-wise step size has been computed (line 12 of the algorithm in the paper).

The weight decay (regularization) term therefore does not end up in the moving averages, so the decay applied to each weight is proportional only to the weight itself. The authors show experimentally that AdamW yields better training loss and that the resulting models generalize much better than models trained with Adam, allowing the new version to compete with stochastic gradient descent with momentum.

`torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)`
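The difference between the two weight-decay styles shows up clearly when the loss gradient is zero. The sketch below takes one step with each optimizer under that artificial assumption; the helper `one_step` and the values `lr=0.1`, `weight_decay=0.1` are chosen only for illustration.

```python
import torch

def one_step(optimizer_cls):
    p = torch.tensor([1.0], requires_grad=True)
    optimizer = optimizer_cls([p], lr=0.1, weight_decay=0.1)
    p.grad = torch.zeros_like(p)   # zero loss gradient, so only weight decay acts
    optimizer.step()
    return p.item()

# Adam folds the decay term into the gradient, where it is rescaled
# by the adaptive step size.
adam_value = one_step(torch.optim.Adam)

# AdamW shrinks the weight directly by the factor (1 - lr * weight_decay).
adamw_value = one_step(torch.optim.AdamW)
```

Here AdamW moves the weight from 1.0 to about 0.99, a change proportional to the weight itself, while Adam moves it to about 0.9 because the decay term passes through the moment estimates and is normalized like any other gradient.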

This class implements the AdamW algorithm. Adam was introduced in the paper Adam: A Method for Stochastic Optimization, and the AdamW variant was proposed in the paper Decoupled Weight Decay Regularization. The complete source code for AdamW is available in the PyTorch repository.

## SparseAdam class

SparseAdam implements a lazy version of the Adam algorithm that is suitable for sparse tensors.

In this variant of the Adam optimizer, only the moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters. You can use this optimizer with the code below:

`torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)`
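A small sketch of the lazy behaviour, assuming an `nn.Embedding` created with `sparse=True` so that its gradient is a sparse tensor. The table size and the looked-up indices are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embedding = nn.Embedding(num_embeddings=6, embedding_dim=3, sparse=True)
optimizer = torch.optim.SparseAdam(list(embedding.parameters()), lr=0.001)

before = embedding.weight.detach().clone()

indices = torch.tensor([1, 4])          # only rows 1 and 4 are looked up
embedding(indices).sum().backward()     # produces a sparse gradient
grad_is_sparse = embedding.weight.grad.is_sparse
optimizer.step()

# Rows of the embedding table that were modified by the step.
changed_rows = (
    (embedding.weight.detach() != before).any(dim=1).nonzero().flatten().tolist()
)
```

Only the rows that appear in the gradient are touched; the other rows, and their moment estimates, are left as they were.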

## Adamax class

This class implements the Adamax algorithm, a variant of Adam based on the infinity norm. Adamax was proposed in the paper Adam: A Method for Stochastic Optimization.

`torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)`

## LBFGS class

This class implements the L-BFGS algorithm, heavily inspired by minFunc (unconstrained differentiable multivariate optimization in Matlab). You can call it with the torch method below:

```
torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-07, tolerance_change=1e-09, history_size=100, line_search_fn=None)
```
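Unlike most other optimizers, L-BFGS needs to re-evaluate the objective during a step, so `step()` is given a closure. The sketch below minimizes a toy quadratic; the objective and the number of outer calls are assumptions for illustration.

```python
import torch

x = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.LBFGS([x], lr=1, max_iter=20)

def closure():
    # L-BFGS may evaluate the objective several times per step,
    # so the loss computation is wrapped in a function.
    optimizer.zero_grad()
    loss = ((x - 3.0) ** 2).sum()
    loss.backward()
    return loss

for _ in range(5):
    optimizer.step(closure)

result = x.item()
```

The closure clears the gradients, computes the loss, calls `backward()`, and returns the loss; the optimizer decides how many times to call it.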

## RMSprop class

This class implements the RMSprop algorithm, which was proposed by G. Hinton in his course.

The centered version first appeared in the paper Generating Sequences With Recurrent Neural Networks.

`torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)`
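As a rough sketch of what a single RMSprop step does, the example below applies one update with a gradient of 1 (an artificial value chosen for illustration) and reads the running average of squared gradients back from the optimizer state.

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.RMSprop([p], lr=0.01, alpha=0.99, eps=1e-08)

p.grad = torch.ones_like(p)    # pretend the gradient is 1
optimizer.step()

# Running average of squared gradients kept for this parameter.
square_avg = optimizer.state[p]["square_avg"].item()
new_value = p.item()
```

After one step the running average is (1 - alpha) * 1 = 0.01, so the gradient is divided by sqrt(0.01) = 0.1 and the parameter moves by lr / 0.1 = 0.1.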

## Rprop class

This class implements the resilient backpropagation algorithm.

`torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))`
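Resilient backpropagation uses only the sign of each gradient component, not its magnitude. The sketch below illustrates this with two gradient values that differ by six orders of magnitude; the values are assumptions chosen for illustration.

```python
import torch

p = torch.tensor([1.0, 1.0], requires_grad=True)
optimizer = torch.optim.Rprop([p], lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))

p.grad = torch.tensor([0.001, 1000.0])   # very different magnitudes, same sign
optimizer.step()

updated = p.detach().clone()
```

On the first step both entries move by the initial step size `lr`, regardless of how large the gradient is. On later steps the per-element step size grows or shrinks by the `etas` factors depending on whether the gradient sign stayed the same.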

## SGD class

Implements stochastic gradient descent (optionally with momentum).

Nesterov momentum is based on the formula from the paper On the importance of initialization and momentum in deep learning.

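```
import random

def SGD(data, batch_size, lr, backprop):
    # data: list of (x, y) pairs; backprop: user-supplied function that
    # computes gradients for one mini-batch and updates the weights.
    N = len(data)
    random.shuffle(data)  # reshuffle in place before each pass
    mini_batches = [data[i:i + batch_size] for i in range(0, N, batch_size)]
    for batch in mini_batches:
        X, y = zip(*batch)  # split the pairs into inputs and targets
        backprop(X, y, lr)
```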

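PyTorch class usage:

`torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)`

```
# usage (model, loss_fn, input and target are assumed to be defined elsewhere)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
optimizer.zero_grad()  # clear gradients from any previous step
loss_fn(model(input), target).backward()
optimizer.step()
```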

Plain stochastic gradient descent is very basic and is seldom used on its own now. One problem is the single global learning rate it applies to every parameter. It therefore doesn't work well when the parameters are on different scales, since a low learning rate makes training slow while a large learning rate might cause oscillations. Stochastic gradient descent also generally has a hard time escaping saddle points. Adagrad, Adadelta, RMSprop, and Adam generally handle saddle points better. SGD with momentum adds some speed to the optimization and also helps escape local minima better.
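The speed-up from momentum can be seen in a sketch with a constant gradient of 1 (an artificial assumption for illustration): each step is larger than the last because the momentum buffer keeps accumulating.

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([p], lr=0.1, momentum=0.9)

values = []
for _ in range(3):
    p.grad = torch.ones_like(p)    # constant gradient, chosen for illustration
    optimizer.step()
    values.append(p.item())
```

The buffer follows buf = momentum * buf + grad, giving 1, 1.9, and 2.71 over the three steps, so the parameter goes from 1.0 to about 0.9, 0.71, and 0.439. Without momentum it would move by a fixed 0.1 each time.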

## ASGD class

This class implements the Averaged Stochastic Gradient Descent (ASGD) algorithm.

It was proposed in the paper Acceleration of stochastic approximation by averaging.

`torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)`
