
What is momentum in a neural network and how does it work?

In a neural network, the loss measures how well the model performs: the higher the loss, the poorer the performance, which is why we always try to minimize it. The process of minimizing the loss is called optimization, and an optimizer is a method that modifies the weights of the neural network to reduce the loss. Several neural network optimizers exist; in this article we will learn about gradient descent with momentum and compare its performance with others. Below are the major points that we are going to discuss in this post.

Table of contents

  1. Introduction to Optimizers
  2. How does Gradient descent work?
  3. Stochastic gradient descent / SGD with momentum
  4. Performance analysis

Let’s start the discussion by understanding what an optimizer is.



Introduction to Optimizers

To get a feel for how an optimizer works, suppose you are at the top of a hill and want to get down to level ground. You walk in the direction in which the slope goes down. If you find yourself moving upwards, you are going in the wrong direction, so you change direction and head downwards again, and eventually you reach the bottom. This is how an optimizer works. The figure below shows exactly that: θ0 and θ1 are the weights, J(θ) is the loss function, and the black line is the path of a person moving towards the lowest point of the graph.

The optimal weights of a model cannot be known in advance, so the weights are initialized randomly using some initialization method and are then changed by the optimizer until the loss reaches a minimum. In short, an optimizer reduces the loss function by updating the weights of the model, which leads to better results.

Various optimizers have been created over the years, each with its own advantages and disadvantages, such as Gradient Descent, Stochastic Gradient Descent (SGD), Mini Batch Stochastic Gradient Descent, and SGD with momentum.

How does Gradient descent work?

Gradient descent is an optimization algorithm used to train machine learning and deep learning models. It numerically finds a minimum of a function by repeatedly taking a step in the direction of the negative gradient.

$$X_{n+1} = X_n - \alpha \nabla f(X_n)$$

Xn+1:  the new weight

Xn:  the old weight

α:  the learning rate

∇f(Xn):  the gradient of the cost function with respect to X

The figure below shows the cost versus weight graph for gradient descent. The model weights are initialized randomly and then changed repeatedly to minimize the cost function. The size of each step is proportional to the slope of the cost function, so the steps gradually become smaller as they approach the minimum cost. The yellow line is the tangent, whose slope is the gradient. At the minimum cost the tangent becomes parallel to the x-axis, meaning the slope is zero and the cost cannot go any lower.
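The update rule described above (new weight = old weight minus learning rate times gradient) can be sketched in a few lines of Python. This is a minimal illustration, not the article's own code: the cost function f(x) = (x - 3)^2, the starting point, and the names `gradient_descent` and `grad_f` are assumptions chosen for the example.

```python
def gradient_descent(grad, x0, alpha=0.1, n_steps=20):
    """Apply x_new = x_old - alpha * grad(x_old) and record every iterate."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] - alpha * grad(xs[-1]))
    return xs


def grad_f(x):
    # Gradient of the illustrative cost f(x) = (x - 3)^2 (an assumption),
    # whose minimum is at x = 3.
    return 2.0 * (x - 3.0)


path = gradient_descent(grad_f, x0=10.0)

# Distance moved at each step; it is proportional to the slope at that point.
step_sizes = [abs(b - a) for a, b in zip(path, path[1:])]

print("final weight:", path[-1])
print("first step sizes:", step_sizes[:3])
```

Because each step is the learning rate times the slope, the recorded step sizes shrink as the weight approaches the minimum, which is the behaviour the cost versus weight graph illustrates.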

Choosing the learning rate α is also important: it should be neither too high nor too low. The two images below show what happens when α is chosen badly. The left image shows α too high, which makes the weights bounce back and forth across the curve; the other image shows α too small, in which case the minimum is still reached but it takes a long time.
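The effect of the learning rate can be reproduced on a toy problem. The sketch below assumes the cost f(x) = x^2 (gradient 2x, minimum at 0) and three hand-picked values of alpha; both the function and the values are illustrative assumptions, not taken from the article's figures.

```python
def run(alpha, x0=5.0, n_steps=50):
    """Run plain gradient descent on f(x) = x^2 and return the final weight."""
    x = x0
    for _ in range(n_steps):
        x = x - alpha * 2.0 * x  # gradient of x^2 is 2x
    return x


too_high = run(alpha=1.1)     # overshoots the minimum on every step and moves further away
too_low = run(alpha=0.001)    # moves in the right direction, but only slightly
reasonable = run(alpha=0.1)   # ends very close to the minimum at 0

print("alpha=1.1  ->", too_high)
print("alpha=0.001 ->", too_low)
print("alpha=0.1  ->", reasonable)
```

With the largest value, every update jumps past the minimum to the other side of the curve and lands further out than before; with the smallest value, the weight has barely moved from its starting point after 50 steps.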

Gradient descent is simple and easy to implement, but it can get trapped in a local minimum instead of finding the global minimum. The variants of gradient descent share the same update rule and differ in how much data is used per update. When the weights are updated one data point at a time, the error is calculated for a single data point and the weights are updated immediately. In batch gradient descent, the model calculates the error for every instance of the training dataset and does not update the weights until all of the examples have been evaluated. We can see in the figure that it converges regularly but very slowly.
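As a rough sketch of batch gradient descent, the example below fits a straight line with a mean squared error cost, computing the error on every training instance before making a single weight update. The synthetic data (true slope 4, intercept 1), the learning rate, and the number of epochs are all assumptions made for illustration.

```python
import numpy as np

# Synthetic data (an assumption): y = 4x + 1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = 4.0 * X + 1.0 + rng.normal(0.0, 0.1, size=200)

w, b = 0.0, 0.0   # weights to learn
alpha = 0.3       # learning rate

for epoch in range(200):
    # Evaluate the error on EVERY training instance first...
    err = (w * X + b) - y
    # ...then average the gradients of the squared error over the whole dataset...
    grad_w = 2.0 * np.mean(err * X)
    grad_b = 2.0 * np.mean(err)
    # ...and only then update the weights, once per pass over the data.
    w -= alpha * grad_w
    b -= alpha * grad_b

print("w:", w, "b:", b)
```

Each update uses the exact gradient of the cost on the full dataset, so the path is smooth, but every single step costs a full pass over the data.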

Stochastic gradient descent / SGD with momentum

In batch gradient descent, the gradient is computed over the entire dataset at each step, which makes it very slow when the dataset is large. Stochastic gradient descent instead picks a random instance from the dataset at every step and calculates the gradient on that single instance only. This makes each SGD step much cheaper than a batch gradient descent step, so it also runs fast on huge datasets. Due to its stochastic (random) nature, the algorithm converges less regularly than batch gradient descent: it keeps oscillating on its way to convergence. Momentum is used to damp these oscillations.
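A minimal sketch of stochastic gradient descent on a straight-line fit is shown here, where each update uses the gradient from one randomly chosen instance. The synthetic data (true slope 4, intercept 1), the learning rate, and the step count are illustrative assumptions.

```python
import numpy as np

# Synthetic data (an assumption): y = 4x + 1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = 4.0 * X + 1.0 + rng.normal(0.0, 0.1, size=200)

w, b = 0.0, 0.0
alpha = 0.05
history = []  # value of w after every update, to inspect the noisy path

for step in range(5000):
    i = rng.integers(len(X))            # pick ONE random training instance
    err = (w * X[i] + b) - y[i]         # error on that single instance
    w -= alpha * 2.0 * err * X[i]       # gradient of the squared error w.r.t. w
    b -= alpha * 2.0 * err              # gradient of the squared error w.r.t. b
    history.append(w)

print("w:", w, "b:", b)
```

Every step touches only one data point, so it is cheap, but the recorded values of `w` jitter around the solution rather than approaching it smoothly, since each single-instance gradient is only a noisy estimate of the full gradient.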

[Figure: SGD (left) and SGD with momentum (right)]

The symbol ‘p’ is the momentum term. The update at time ‘t’ is calculated from all previous updates, giving more weight to the most recent updates than to older ones, in order to speed up convergence. After adding momentum, the convergence of stochastic gradient descent looks like this.

It is much smoother than before.
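The idea of accumulating past updates can be sketched as follows. The exact formula from the original figure is not reproduced here; the code assumes one common formulation, v_new = p * v - alpha * gradient followed by w_new = w + v_new, where `v` is a running "velocity" and `p` is the momentum coefficient. The helper name `momentum_step`, the synthetic data, and the hyperparameter values are assumptions for illustration.

```python
import numpy as np


def momentum_step(w, v, grad, alpha=0.01, p=0.9):
    """One update of a common momentum formulation (assumed, see lead-in).

    The velocity keeps a fraction p of the previous update, so older
    gradients are still present but decay by a factor p at every step.
    """
    v_new = p * v - alpha * grad
    return w + v_new, v_new


# Synthetic data (an assumption): y = 4x + 1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = 4.0 * X + 1.0 + rng.normal(0.0, 0.1, size=200)

params = np.zeros(2)     # [w, b]
velocity = np.zeros(2)   # one velocity entry per parameter

for step in range(5000):
    i = rng.integers(len(X))                         # single random instance, as in SGD
    err = (params[0] * X[i] + params[1]) - y[i]
    grad = np.array([2.0 * err * X[i], 2.0 * err])   # gradient w.r.t. [w, b]
    params, velocity = momentum_step(params, velocity, grad, alpha=0.005, p=0.9)

print("w:", params[0], "b:", params[1])
```

Because the velocity is a decaying sum of past gradients, a single noisy gradient changes the direction of travel only a little, while gradients that keep pointing the same way build up speed.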

Performance analysis

The Colab notebook linked in the references compares the effect of momentum on several measures, namely training time, accuracy (train and validation), and loss (train and validation). The performance of the SGD and Adam optimizers is evaluated.

Final words

In this article, we learned about optimizers and their types, and built up the intuition behind gradient descent, batch gradient descent, stochastic gradient descent, and SGD with momentum.



Waqqas Ansari
Waqqas Ansari is a data science guy with a math background. He likes solving challenging business problems through predictive modelling, descriptive modelling, and machine learning algorithms. He is fascinated by new technologies, especially those relating to machine learning.
