Ultimate guide to PyTorch Optimizers

PyTorch is one of the fastest-growing deep learning frameworks and is used by many large companies such as Tesla, Apple, Qualcomm, and Facebook. It wraps many algorithms, methods, and classes behind concise calls, often a single line of code. In this article we discuss the PyTorch optimizers. Updating parameters manually from the computed gradients may be fine for two parameters, but real-world models have far too many parameters to hand-write the update rule each time. Instead, we use one of PyTorch's optimizer classes, such as SGD or Adagrad.

The optimizer takes the parameters we want to update and the learning rate we want to use (and possibly many other hyperparameters as well), and performs the updates through its step() method.
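To make the role of step() concrete, here is a minimal sketch of a training loop. The toy data, the single linear layer, and the learning rate are assumptions chosen only for illustration.

```python
import torch

torch.manual_seed(0)

# Toy regression data following y = 2x + 1 (an assumption for illustration).
x = torch.linspace(-1, 1, 50).unsqueeze(1)
y = 2 * x + 1

model = torch.nn.Linear(1, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

initial_loss = loss_fn(model(x), y).item()
for epoch in range(200):
    optimizer.zero_grad()          # clear gradients left over from the previous step
    loss = loss_fn(model(x), y)    # forward pass
    loss.backward()                # compute gradients of the loss
    optimizer.step()               # update the parameters from those gradients
final_loss = loss_fn(model(x), y).item()
```

The three calls zero_grad(), backward(), and step() form the usual pattern, and swapping SGD for another optimizer class leaves the loop unchanged.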

Simply put, an optimizer is the method that updates the model parameters to reduce the loss with much less effort. Let's look at some of the optimizer classes supported by the PyTorch framework:


torch.optim is a PyTorch package containing various optimization algorithms. The most commonly used methods are already supported, and the interface is general enough that more sophisticated ones can be easily integrated in the future.

To use torch.optim, you construct an optimizer object that holds the current state and updates the parameters based on the computed gradients.

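 import torch
 import torch.optim as optim
 # model, var1 and var2 are placeholders for your own module and tensors
 model = torch.nn.Linear(10, 1)
 var1, var2 = torch.zeros(3, requires_grad=True), torch.zeros(3, requires_grad=True)
 SGD_optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.7)
 # or
 Adam_optimizer = optim.Adam([var1, var2], lr=0.001)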
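As a rough sketch of what "holding the current state" means, the example below builds an optimizer with two parameter groups and then inspects its state_dict(). The two-layer model and the specific learning rates are hypothetical and only serve the illustration.

```python
import torch
import torch.optim as optim

# Hypothetical two-layer model, used only to have parameters to optimize.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
)

# Per-parameter options: the second group overrides the default learning rate.
optimizer = optim.SGD(
    [
        {"params": model[0].parameters()},
        {"params": model[2].parameters(), "lr": 1e-3},
    ],
    lr=1e-2,
    momentum=0.9,
)

group_lrs = [group["lr"] for group in optimizer.param_groups]

# One update so that the optimizer has state (momentum buffers) to save.
model(torch.randn(16, 4)).pow(2).mean().backward()
optimizer.step()

# state_dict() holds the hyperparameters and the per-parameter state.
checkpoint = optimizer.state_dict()
```

Saving this dictionary alongside the model weights lets training resume with the same momentum buffers.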

AdaDelta Class

[Figure: Adadelta algorithm formula]

It implements the Adadelta algorithm, which was proposed in the paper ADADELTA: An Adaptive Learning Rate Method. Adadelta does not require you to pick an initial learning rate constant. You can implement it by hand, without torch.optim, with a function like this:

 import torch

 def Adadelta(weights, sqrs, deltas, rho, batch_size):
     eps_stable = 1e-5
     with torch.no_grad():  # the update itself must not be tracked by autograd
         for weight, sqr, delta in zip(weights, sqrs, deltas):
             g = weight.grad / batch_size
             # running average of squared gradients
             sqr[:] = rho * sqr + (1. - rho) * torch.square(g)
             cur_delta = torch.sqrt(delta + eps_stable) / torch.sqrt(sqr + eps_stable) * g
             # running average of squared updates
             delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta
             # update weight in place.
             weight -= cur_delta

With PyTorch you can do the same in a single line of code, as shown below:

torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
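As a small usage sketch, the built-in class can be dropped into an ordinary loop. The quadratic objective and the step count here are assumptions made for illustration, not a recommended setup.

```python
import torch

# Minimize f(w) = (w - 3)^2 starting from w = 0 (a toy objective for illustration).
w = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.Adadelta([w], lr=1.0, rho=0.9, eps=1e-6)

losses = []
for _ in range(500):
    optimizer.zero_grad()
    loss = ((w - 3.0) ** 2).sum()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Adadelta starts with very small steps because its accumulated update history is initially zero, so the loss falls slowly at first.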


AdaGrad Class

Adagrad (short for adaptive gradient) lowers the learning rate for parameters that are updated frequently and gives a larger learning rate to sparse parameters, that is, parameters that are updated less often. You can implement it without any class like this:

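 import numpy as np

 def Adagrad(data, weights, compute_gradients, lr=0.01, epsilon=1e-10, num_iterations=100):
     # compute_gradients(data, weights) is supplied by the caller
     # and returns an array with the same shape as weights
     gradient_sums = np.zeros_like(weights, dtype=float)
     for t in range(num_iterations):
         gradients = compute_gradients(data, weights)
         gradient_sums += gradients ** 2
         gradient_update = gradients / (np.sqrt(gradient_sums + epsilon))
         weights = weights - lr * gradient_update
     return weights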

In many problems the most critical information sits in the data that occurs least frequently. So if your use case involves sparse data, Adagrad can be useful. You can call the algorithm with torch using the command below:

torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)

It has drawbacks too: it adds extra computation per parameter, and its effective learning rate keeps shrinking because the squared gradients accumulate without bound, which slows training.
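The shrinking step size can be seen in a few lines of plain Python. This sketch assumes a constant gradient of 1.0 at every step, purely to isolate the effect of the accumulator.

```python
import math

lr = 0.01
eps = 1e-10
gradient_sum = 0.0
effective_lrs = []

# Assume a constant gradient of 1.0 so only the accumulator changes the step.
for step in range(1, 101):
    grad = 1.0
    gradient_sum += grad ** 2
    effective_lrs.append(lr / math.sqrt(gradient_sum + eps))
```

Under this assumption the effective learning rate after t steps is lr divided by the square root of t, so it falls from 0.01 to about 0.001 over 100 steps.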


Adam Class

Adam, short for Adaptive Moment Estimation, is one of the most popular optimizers. It combines the good properties of the AdaGrad and RMSprop optimizers and hence tends to do well on most problems. You can call this class using the command below:

torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
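To show what the betas and eps arguments control, here is a hand-written sketch of a single Adam update for one scalar parameter. The helper name adam_step and the toy quadratic are assumptions for illustration; the built-in class operates on tensors and has more options.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(w) = (w - 3)^2 from w = 0; lr=0.1 is chosen for this toy problem.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1)
```

Because the step is the ratio of the two moment estimates, its size is roughly bounded by lr regardless of the raw gradient scale.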

AdamW Class

The authors of Decoupled Weight Decay Regularization proposed an improved version of Adam called AdamW, in which weight decay is performed only after controlling the parameter-wise step size, as shown in line 12 of the algorithm listing in that paper.

The weight decay or regularization term does not end up in the moving averages, so it is only proportional to the weight itself. The authors show experimentally that AdamW yields better training loss and that the models generalize much better than models trained with Adam, allowing the new version to compete with stochastic gradient descent with momentum.

torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)

It implements the AdamW algorithm. The original Adam algorithm was introduced in Adam: A Method for Stochastic Optimization, and the AdamW variant was proposed in the paper Decoupled Weight Decay Regularization. The complete source code for AdamW is available in the PyTorch repository.
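The difference between the two weight decay styles can be sketched with a single step on one weight. The setup is an assumption made for illustration: the loss gradient is forced to zero so that only weight decay acts on the weight.

```python
import torch

def one_step(optimizer_cls):
    # Single weight of 1.0 with a zero loss gradient, so only weight decay acts.
    w = torch.ones(1, requires_grad=True)
    optimizer = optimizer_cls([w], lr=0.1, weight_decay=0.1)
    w.grad = torch.zeros_like(w)
    optimizer.step()
    return w.item()

w_adam = one_step(torch.optim.Adam)    # decay is added to the gradient, then rescaled
w_adamw = one_step(torch.optim.AdamW)  # decay is applied directly: w * (1 - lr * wd)
```

In this setup AdamW shrinks the weight by the factor 1 - lr * weight_decay, while Adam feeds the decay term through its moment estimates, so the weight moves by roughly a full learning-rate-sized step instead.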

SparseAdam Class

SparseAdam implements a lazy version of the Adam algorithm which is suitable for sparse tensors.

In this variant of the Adam optimizer, only the moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters. You can use this optimizer with the code below:

torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)
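A short sketch of the lazy behaviour, assuming an embedding layer created with sparse=True as the source of sparse gradients (the sizes and indices are arbitrary choices for illustration):

```python
import torch

torch.manual_seed(0)

# sparse=True makes the embedding produce sparse gradients, which SparseAdam requires.
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=3, sparse=True)
optimizer = torch.optim.SparseAdam(list(embedding.parameters()), lr=0.1)

before = embedding.weight.detach().clone()

indices = torch.tensor([1, 3])      # only these rows take part in the forward pass
loss = embedding(indices).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

after = embedding.weight.detach()
changed_rows = (before != after).any(dim=1).nonzero().flatten().tolist()
```

Only the rows that were looked up receive an update; all other rows of the embedding matrix stay untouched.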


Adamax Class

It implements the Adamax algorithm, a variant of Adam based on the infinity norm. It was proposed in the paper Adam: A Method for Stochastic Optimization.

torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)


LBFGS Class

This class implements the L-BFGS algorithm and is heavily inspired by minFunc, a Matlab package for unconstrained differentiable multivariate optimization. You can call it with the torch method below:

torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-07, tolerance_change=1e-09, history_size=100, line_search_fn=None)
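Unlike most other optimizers, LBFGS needs to re-evaluate the loss several times per update, so step() is given a closure. The following is a minimal sketch on a toy quadratic, which is an assumption chosen for illustration.

```python
import torch

# Minimize f(x) = (x - 3)^2 starting from x = 0 (toy objective for illustration).
x = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.LBFGS([x], lr=1, max_iter=20)

def closure():
    # LBFGS may call this several times per step to re-evaluate the loss.
    optimizer.zero_grad()
    loss = ((x - 3.0) ** 2).sum()
    loss.backward()
    return loss

for _ in range(5):
    optimizer.step(closure)
```

The closure clears the gradients, recomputes the loss, runs backward(), and returns the loss.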


RMSprop Class

This class implements the RMSprop algorithm, which was proposed by G. Hinton in his course.

The centered version first appears in Generating Sequences With Recurrent Neural Networks.

torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
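To illustrate what alpha and eps do, here is a hand-written sketch of the basic RMSprop update (no momentum, not centered) for one scalar parameter. The helper name rmsprop_step and the toy objective are assumptions for illustration.

```python
import math

def rmsprop_step(param, grad, square_avg, lr=0.01, alpha=0.99, eps=1e-8):
    """One RMSprop update for a scalar parameter (no momentum, not centered)."""
    square_avg = alpha * square_avg + (1 - alpha) * grad * grad  # running mean of g^2
    param = param - lr * grad / (math.sqrt(square_avg) + eps)
    return param, square_avg

# Minimize f(w) = (w - 3)^2 from w = 0.
w, square_avg = 0.0, 0.0
for _ in range(2000):
    grad = 2 * (w - 3.0)
    w, square_avg = rmsprop_step(w, grad, square_avg)
```

Dividing by the root of the running average normalizes the step, so parameters with large and small gradients move at comparable speeds.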

Rprop Class

This class implements the resilient backpropagation algorithm.

torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
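The etas and step_sizes arguments are easier to read next to the update rule. Below is a simplified scalar sketch of resilient backpropagation; the helper name rprop_step, the initial step size, and the toy objective are assumptions for illustration.

```python
def rprop_step(param, grad, prev_grad, step_size,
               etas=(0.5, 1.2), step_sizes=(1e-6, 50.0)):
    """One Rprop update for a scalar parameter; only the gradient's sign is used."""
    eta_minus, eta_plus = etas
    step_min, step_max = step_sizes
    if grad * prev_grad > 0:        # same sign as last time: grow the step
        step_size = min(step_size * eta_plus, step_max)
    elif grad * prev_grad < 0:      # sign flipped: we overshot, shrink the step
        step_size = max(step_size * eta_minus, step_min)
        grad = 0.0                  # skip the update for this iteration
    sign = (grad > 0) - (grad < 0)
    param = param - sign * step_size
    return param, grad, step_size

# Minimize f(w) = (w - 3)^2 from w = 0 with an initial step size of 0.01.
w, prev_grad, step_size = 0.0, 0.0, 0.01
for _ in range(200):
    grad = 2 * (w - 3.0)
    w, prev_grad, step_size = rprop_step(w, grad, prev_grad, step_size)
```

Because only the sign of the gradient is used, the magnitude of the gradient has no influence on how far the parameter moves.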

SGD Class

Implements stochastic gradient descent (optionally with momentum).

Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.

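 def SGD(data, batch_size, lr, backprop):
     # Assumes data is a list of (X, y) pairs and backprop is a
     # caller-supplied function that performs one update on a batch
     N = len(data)
     mini_batches = [data[i:i+batch_size]
                     for i in range(0, N, batch_size)]
     for batch in mini_batches:
         X, y = zip(*batch)
         backprop(X, y, lr)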

PyTorch class usage:

 torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)

 optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
 optimizer.zero_grad()
 loss_fn(model(input), target).backward()
 optimizer.step()

Plain stochastic gradient descent is very basic and is seldom used on its own now. One problem is that a single global learning rate is applied to all parameters. Hence it does not work well when the parameters are on different scales, since a low learning rate makes training slow while a large learning rate might cause oscillations. Stochastic gradient descent also generally has a hard time escaping saddle points. Adagrad, Adadelta, RMSprop, and Adam generally handle saddle points better. SGD with momentum speeds up the optimization and also helps escape local minima better.
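The scale problem can be reproduced with plain gradient descent on a two-parameter quadratic. The objective and the two learning rates are assumptions picked to make the effect visible.

```python
def gradient_descent(lr, steps=100):
    """Plain gradient descent on f(x, y) = 0.5 * (x**2 + 100 * y**2) from (1, 1)."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        grad_x, grad_y = x, 100.0 * y   # the y direction is 100 times steeper
        x -= lr * grad_x
        y -= lr * grad_y
    return x, y

x_slow, y_slow = gradient_descent(lr=0.001)   # safe for y, but x barely moves
x_osc, y_osc = gradient_descent(lr=0.021)     # faster for x, but y overshoots each step
```

With the low learning rate, y converges while x has moved less than 10% of the way after 100 steps. With the larger one, y flips sign on every step and grows in magnitude, which is the oscillation described above.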


ASGD Class

It implements the Averaged Stochastic Gradient Descent (ASGD) algorithm.

It was proposed in the paper Acceleration of stochastic approximation by averaging.

torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)



We have covered some optimizer classes that did not appear in our earlier TensorFlow Keras optimizer article. PyTorch is a very powerful tool for deep learning research and for business use. You can learn more about PyTorch and its supported optimizers in the official documentation, which explains the parameters of each class thoroughly.

Mohit Maithani
Mohit is a Data & Technology Enthusiast with good exposure to solving real-world problems in various avenues of IT and Deep learning domain. He believes in solving human's daily problems with the help of technology.
