PyTorch is one of the fastest-growing deep learning frameworks and is used by large companies such as Tesla, Apple, Qualcomm, and Facebook. It wraps many algorithms, methods, and classes behind short, often one-line, calls. Today we are going to discuss PyTorch optimizers. So far, we have been manually updating the parameters using the computed gradients. That may be fine for two parameters, but real-world models have many parameters, and we cannot hand-write an optimization algorithm each time. Instead, we use one of PyTorch's optimizer classes, such as SGD or Adagrad.
The optimizer takes the parameters we want to update and the learning rate we want to use (and possibly many other hyperparameters as well), and performs the updates through its step() method.
Simply put, an optimizer is the method that updates the model parameters so that the loss is reduced with much less effort. Let's look at some of the optimizer classes supported by the PyTorch framework:
torch.optim is a PyTorch package containing various optimization algorithms. The most commonly used methods are already supported, and the interface is general enough that more sophisticated ones can also be easily integrated in the future.
To use torch.optim, you have to construct an optimizer object that holds the current state and updates the parameters based on the computed gradients.
import torch.optim as optim

SGD_optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.7)
## or
Adam_optimizer = optim.Adam([var1, var2], lr=0.001)
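To see how the optimizer object fits into training, here is a minimal sketch of a complete loop. The toy data, the one-layer model, and the hyperparameters are assumptions chosen for illustration; they are not part of the original example.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy data (assumed for illustration): y = 2x + 1
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 2 * x + 1

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

initial_loss = loss_fn(model(x), y).item()
for _ in range(300):
    optimizer.zero_grad()        # clear gradients left over from the previous step
    loss = loss_fn(model(x), y)  # forward pass
    loss.backward()              # compute gradients for every parameter
    optimizer.step()             # let the optimizer update the parameters in place
final_loss = loss_fn(model(x), y).item()
```

The same four calls (zero_grad, forward, backward, step) stay unchanged when SGD is swapped for most other optimizers in this article; L-BFGS is the exception, since its step() needs a closure.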
This class implements the Adadelta algorithm, which was proposed in the paper ADADELTA: An Adaptive Learning Rate Method. Adadelta does not require you to tune an initial learning rate constant. You can also implement it without any torch method by defining a function like this (the nd calls are MXNet NDArray operations, not PyTorch):
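from mxnet import nd  # the nd.* calls below are MXNet NDArray operations

def Adadelta(weights, sqrs, deltas, rho, batch_size):
    eps_stable = 1e-5
    for weight, sqr, delta in zip(weights, sqrs, deltas):
        g = weight.grad / batch_size
        # running average of squared gradients
        sqr[:] = rho * sqr + (1. - rho) * nd.square(g)
        # step scaled by RMS(previous updates) / RMS(gradients)
        cur_delta = nd.sqrt(delta + eps_stable) / nd.sqrt(sqr + eps_stable) * g
        # running average of squared updates
        delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta
        # update weight in place.
        weight[:] -= cur_delta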
With PyTorch, you can do the same with a single line of code, as shown below:
torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
Adagrad (short for adaptive gradient) lowers the learning rate for parameters that are updated frequently and gives a larger learning rate to sparse parameters, that is, parameters that are updated less often. You can implement it without any class like this:
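import numpy as np

def Adagrad(data, weights, compute_gradients, lr=0.01, num_iterations=100, epsilon=1e-8):
    # compute_gradients(data, weights) is a user-supplied function
    gradient_sums = np.zeros(weights.shape)
    for t in range(num_iterations):
        gradients = compute_gradients(data, weights)
        # accumulate squared gradients per parameter
        gradient_sums += gradients ** 2
        gradient_update = gradients / (np.sqrt(gradient_sums + epsilon))
        weights = weights - lr * gradient_update
    return weights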
In many problems, the most critical information is present in the data that occurs less frequently. So if your use case involves sparse data, Adagrad can be useful. You can create the optimizer in torch with the command below:
torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)
There are some drawbacks too: it is computationally more expensive than plain SGD, and because the sum of squared gradients only grows, the effective learning rate keeps decreasing, which can make training slow.
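The shrinking learning rate is easy to see in a small sketch. The helper name adagrad_effective_lrs is hypothetical, and the constant gradient of 1.0 is an assumption chosen to make the numbers easy to follow; the formula lr / (sqrt(sum of squared gradients) + eps) mirrors the Adagrad update described above.

```python
import math

def adagrad_effective_lrs(gradients, lr=0.01, eps=1e-10):
    """Return the per-step effective learning rate lr / (sqrt(sum g^2) + eps)."""
    grad_sq_sum = 0.0
    effective = []
    for g in gradients:
        grad_sq_sum += g * g  # the accumulator never shrinks
        effective.append(lr / (math.sqrt(grad_sq_sum) + eps))
    return effective

# With a constant gradient of 1.0 the effective rate decays like lr / sqrt(t).
rates = adagrad_effective_lrs([1.0] * 100)
print(rates[0], rates[3], rates[99])
```

After 4 steps the effective rate has halved, and after 100 steps it is a tenth of the initial value, even though the gradient itself never changed.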
Adam, short for Adaptive Moment Estimation, is one of the most popular optimizers. It combines the good properties of Adagrad and RMSprop into one optimizer and hence tends to do well on most problems. You can simply call this class using the command below:
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
- Paper: Adam: A Method for Stochastic Optimization.
- Implementation of the L2 penalty follows changes proposed in the paper Decoupled Weight Decay Regularization.
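For intuition, the Adam update for a single scalar parameter can be sketched in a few lines of plain Python. This is a simplified illustration, not PyTorch's implementation; the function name adam_minimize and the quadratic test problem are assumptions.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.001, betas=(0.9, 0.999), eps=1e-8, steps=1000):
    """Run Adam on one scalar parameter and return its final value."""
    beta1, beta2 = betas
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g      # first moment: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g  # second moment: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Assumed toy problem: minimize (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = adam_minimize(lambda x: 2 * (x - 3.0), x0=0.0, lr=0.1, steps=500)
x_one_step = adam_minimize(lambda x: 2 * (x - 3.0), x0=0.0, lr=0.1, steps=1)
```

Note that the very first step has a size of roughly lr regardless of how large the gradient is, because the gradient is divided by its own magnitude.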
The authors of Decoupled Weight Decay Regularization proposed an improved version of Adam called AdamW, in which weight decay is applied only after the parameter-wise step size has been computed, as shown in line 12 of the algorithm in the paper.
The weight decay (regularization) term does not end up in the moving averages, so it stays proportional only to the weight itself. The authors show experimentally that AdamW yields better training loss and that the resulting models generalize much better than models trained with Adam, allowing the new version to compete with stochastic gradient descent with momentum.
torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)
This class implements the AdamW algorithm. The original Adam algorithm was introduced in Adam: A Method for Stochastic Optimization, and the AdamW variant was proposed in Decoupled Weight Decay Regularization. The complete source code for AdamW is part of the torch.optim package.
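The difference between the two ways of applying weight decay can be sketched for a single scalar weight. This is an illustrative simplification, not the library source; the function name adam_step, the state dictionary layout, and the decoupled flag are assumptions made for the example.

```python
import math

def adam_step(w, grad, state, lr=0.001, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.01, decoupled=True):
    """One Adam step on a scalar weight with either decoupled or L2 weight decay."""
    beta1, beta2 = betas
    state["t"] += 1
    t = state["t"]
    if decoupled:
        # AdamW style: shrink the weight directly; the decay never enters m or v.
        w = w * (1 - lr * weight_decay)
    else:
        # Adam with L2 penalty: the decay is folded into the gradient.
        grad = grad + weight_decay * w
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def fresh_state():
    return {"t": 0, "m": 0.0, "v": 0.0}

# Zero loss gradient, so only the weight decay acts on w = 10.
w_decoupled = adam_step(10.0, 0.0, fresh_state(), decoupled=True)
w_l2 = adam_step(10.0, 0.0, fresh_state(), decoupled=False)
```

With decoupled decay the weight shrinks by lr * weight_decay * w. With the L2 variant the decay term is rescaled by the adaptive denominator, so the first step is about lr in size no matter what weight_decay is.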
SparseAdam implements a lazy version of the Adam algorithm that is suitable for sparse tensors.
In this variant of the Adam optimizer, only the moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters. You can use this optimizer with the code below:
torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)
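The lazy behaviour can be sketched with plain Python lists, where the sparse gradient is represented as a dictionary from index to value. This representation and the function name sparse_adam_step are assumptions for illustration; PyTorch itself works on sparse tensors and differs in small details such as where eps is added.

```python
import math

def sparse_adam_step(params, sparse_grad, m, v, step,
                     lr=0.001, betas=(0.9, 0.999), eps=1e-8):
    """Update only the entries whose index appears in sparse_grad (index -> gradient)."""
    beta1, beta2 = betas
    bias1 = 1 - beta1 ** step
    bias2 = 1 - beta2 ** step
    for i, g in sparse_grad.items():
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        params[i] -= lr * (m[i] / bias1) / (math.sqrt(v[i] / bias2) + eps)

params = [1.0, 1.0, 1.0, 1.0]
m = [0.0] * 4
v = [0.0] * 4
# Only index 1 received a gradient in this step.
sparse_adam_step(params, {1: 0.5}, m, v, step=1)
```

Entries 0, 2, and 3 keep both their values and their moment estimates, which is what makes the update cheap for large embedding tables where each batch touches only a few rows.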
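Adamax implements the Adamax algorithm, a variant of Adam based on the infinity norm that was proposed in the same paper as Adam:
torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)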
This class implements the L-BFGS algorithm, heavily inspired by minFunc (unconstrained differentiable multivariate optimization in Matlab). You can create it with the torch call below:
torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-07, tolerance_change=1e-09, history_size=100, line_search_fn=None)
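Unlike most other optimizers, L-BFGS re-evaluates the loss several times per step, so step() has to be given a closure that clears the gradients, computes the loss, and calls backward(). A minimal sketch follows; the quadratic objective and the target values are assumptions chosen for illustration.

```python
import torch

# Assumed toy objective: squared distance to a fixed target point.
target = torch.tensor([3.0, -1.0])
params = torch.zeros(2, requires_grad=True)
optimizer = torch.optim.LBFGS([params], lr=1, max_iter=20)

def closure():
    optimizer.zero_grad()                  # clear old gradients
    loss = ((params - target) ** 2).sum()  # recompute the loss
    loss.backward()                        # recompute the gradients
    return loss                            # L-BFGS needs the loss value back

for _ in range(5):
    optimizer.step(closure)
```

Each call to step() may run up to max_iter inner iterations, so far fewer outer calls are typically needed than with SGD.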
This class implements the RMSprop algorithm, which was proposed by G. Hinton in his course.
The centered version first appears in Generating Sequences With Recurrent Neural Networks.
torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
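As a rough sketch of what the alpha and centered arguments do, here is one RMSprop step for a scalar parameter in plain Python (momentum and weight decay are left out). The function name rmsprop_step and the state dictionary are assumptions for the example, not library API.

```python
import math

def rmsprop_step(x, g, state, lr=0.01, alpha=0.99, eps=1e-8, centered=False):
    """One RMSprop step on a scalar parameter."""
    # running average of squared gradients
    state["sq"] = alpha * state["sq"] + (1 - alpha) * g * g
    if centered:
        # the centered version also tracks the mean gradient and
        # normalizes by an estimate of the gradient variance
        state["avg"] = alpha * state["avg"] + (1 - alpha) * g
        scale = state["sq"] - state["avg"] ** 2
    else:
        scale = state["sq"]
    return x - lr * g / (math.sqrt(scale) + eps)

x_plain = rmsprop_step(1.0, 2.0, {"sq": 0.0, "avg": 0.0})
x_centered = rmsprop_step(1.0, 2.0, {"sq": 0.0, "avg": 0.0}, centered=True)
```

Because the centered denominator subtracts the squared mean gradient, it is never larger than the uncentered one, so the centered step is at least as long for the same gradient.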
This class implements the resilient backpropagation algorithm.
torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
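Rprop uses only the sign of the gradient: the per-parameter step size grows by the second value in etas while the sign stays the same, shrinks by the first value when the sign flips, and is clamped to the step_sizes range. A scalar sketch under these assumptions (the function name rprop_minimize and the quadratic test problem are illustrative):

```python
def rprop_minimize(grad_fn, x0, lr=0.01, etas=(0.5, 1.2),
                   step_sizes=(1e-6, 50.0), steps=100):
    """Minimize a scalar function with a sign-based Rprop update."""
    eta_minus, eta_plus = etas
    step_min, step_max = step_sizes
    x, step, prev_g = x0, lr, 0.0
    for _ in range(steps):
        g = grad_fn(x)
        if g * prev_g > 0:
            # same sign as last time: accelerate
            step = min(step * eta_plus, step_max)
        elif g * prev_g < 0:
            # sign flipped, we overshot: shrink the step and skip this update
            step = max(step * eta_minus, step_min)
            g = 0.0
        if g > 0:
            x -= step
        elif g < 0:
            x += step
        prev_g = g
    return x

# Assumed toy problem: minimize (x - 3)^2.
x_rprop = rprop_minimize(lambda x: 2 * (x - 3.0), x0=0.0)
```

Since the gradient magnitude is ignored, the update is insensitive to how the loss is scaled.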
Implements stochastic gradient descent (optionally with momentum).
Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.
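import numpy as np

def SGD(data, batch_size, lr):
    # data is a list of (X, y) pairs; backprop(X, y, lr) is assumed to be defined elsewhere
    N = len(data)
    np.random.shuffle(data)
    mini_batches = [data[i:i+batch_size] for i in range(0, N, batch_size)]
    for batch in mini_batches:
        X, y = zip(*batch)  # split the batch into inputs and targets
        backprop(X, y, lr)
PyTorch class usage: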
torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)

#usage
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
optimizer.zero_grad()
loss_fn(model(input), target).backward()
optimizer.step()
Plain stochastic gradient descent is very basic and is seldom used on its own now. One problem is the global learning rate that is shared by all parameters. Hence it does not work well when the parameters are on different scales, since a low learning rate will make training slow while a large learning rate might cause oscillations. Also, stochastic gradient descent generally has a hard time escaping saddle points. Adagrad, Adadelta, RMSprop, and Adam generally handle saddle points better. SGD with momentum adds some speed to the optimization and also helps escape local minima better.
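The scale problem and the effect of momentum can be reproduced on a tiny example. The sketch below assumes a quadratic with curvatures 1 and 100 (two parameters on very different scales) and uses the same momentum form as torch.optim.SGD with dampening set to 0; the function name sgd_quadratic is hypothetical.

```python
def sgd_quadratic(lr, momentum=0.0, nesterov=False, steps=100):
    """Minimize f(x, y) = 0.5 * (x**2 + 100 * y**2) starting from (1, 1)."""
    p = [1.0, 1.0]
    curvature = [1.0, 100.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        for i in range(2):
            g = curvature[i] * p[i]      # gradient of the quadratic
            v[i] = momentum * v[i] + g   # momentum buffer
            d = g + momentum * v[i] if nesterov else v[i]
            p[i] -= lr * d
    return p

slow = sgd_quadratic(lr=0.015)                    # stable, but x barely moves
diverged = sgd_quadratic(lr=0.021)                # slightly larger lr: y oscillates and blows up
with_momentum = sgd_quadratic(lr=0.015, momentum=0.9)
```

With the small learning rate the flat direction x is still far from zero after 100 steps, the slightly larger rate already makes the steep direction y diverge, and adding momentum at the small rate brings both coordinates close to the minimum.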
This class implements the Averaged Stochastic Gradient Descent (ASGD) algorithm.
It was proposed in the paper Acceleration of stochastic approximation by averaging.
torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)
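The core idea of averaging can be sketched without the lambd, alpha, and t0 schedules that the PyTorch class adds: run ordinary gradient steps and keep a running mean of the iterates. The function name averaged_sgd and the toy objective |x - 3| are assumptions for illustration.

```python
def averaged_sgd(grad_fn, x0, lr, steps):
    """Run plain gradient steps and also return the running average of the iterates."""
    x, avg = x0, 0.0
    for t in range(1, steps + 1):
        x -= lr * grad_fn(x)
        avg += (x - avg) / t  # running mean of x_1 .. x_t
    return x, avg

# Gradient of |x - 3|: with a fixed step the iterates bounce around the minimum.
abs_grad = lambda x: 1.0 if x > 3.0 else -1.0
last_x, avg_x = averaged_sgd(abs_grad, x0=2.95, lr=0.1, steps=100)
```

The last iterate keeps jumping between the two sides of the minimum, while the average settles in the middle, which is the benefit averaging is meant to provide.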
We have covered some optimizer classes that we did not see in the earlier TensorFlow Keras optimizer article. PyTorch is a very powerful tool for deep learning research as well as for business use. You can learn more about PyTorch and its supported optimizers in the official documentation, which explains the parameters of each class thoroughly.