
What is momentum in a Neural network and how does it work?  


In a neural network, there is the concept of loss, which is used to measure performance: the higher the loss, the poorer the performance of the neural network, which is why we always try to minimize the loss so that the network performs better. The process of minimizing loss is called optimization, and an optimizer is a method that modifies the weights of the neural network to reduce the loss. Although several neural network optimizers exist, in this article we will learn about gradient descent with momentum and compare its performance with that of other optimizers. Below are the major points that we are going to discuss in this post.

Table of contents

  1. Introduction to Optimizers
  2. How does Gradient descent work?
  3. Stochastic gradient descent / SGD with momentum
  4. Performance analysis

Let’s start the discussion by understanding what an optimizer is.

Introduction to Optimizers

To understand how an optimizer works, suppose you are standing at the top of a hill and want to reach the ground below. What would you do? You would move down along the slope; if you notice you are moving upwards, you are heading in the wrong direction, so you change direction and keep moving downwards until you finally reach the bottom. This is exactly how an optimizer works. The figure below shows this: θ0 and θ1 are the weights, J(θ) is the loss function, and the black line traces a person moving towards the lowest point of the surface.

It is not possible to know the optimal weights of the model in advance, so the weights are initialized randomly using some initialization method and then updated by the optimizer until the loss reaches a minimum. In other words, an optimizer reduces the loss function by updating the weights of the model, which leads to better results.

Various optimizers have been developed in recent years, each with its own advantages and disadvantages, such as Gradient Descent, Stochastic Gradient Descent (SGD), Mini-Batch Stochastic Gradient Descent, and SGD with momentum.

How does Gradient descent work?

Gradient descent is an optimization algorithm used to optimize machine learning and deep learning models. It finds a function’s lowest value, or minimum, numerically by repeatedly following the negative of the gradient. The update rule is:

Xₙ₊₁ = Xₙ − α ∇f(Xₙ)

where:

Xₙ₊₁ is the new weight

Xₙ is the old weight

α is the learning rate

∇f(Xₙ) is the gradient of the cost function with respect to X
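To make the update rule concrete, here is a minimal sketch (illustrative, not from the article) that applies it to the simple cost function f(X) = X², whose gradient is 2X; the starting point and learning rate are arbitrary choices.

# Gradient descent on f(x) = x**2, whose gradient is 2*x.
# The starting point and learning rate are illustrative choices.

def grad_f(x):
    return 2 * x

x = 5.0        # Xn: the initial (randomly chosen) weight
alpha = 0.1    # learning rate

for step in range(50):
    x = x - alpha * grad_f(x)   # X(n+1) = Xn - alpha * grad_f(Xn)

print(x)  # ends up very close to 0, the minimum of f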

The figure below shows the cost-versus-weight graph for gradient descent. The model weights are initialized randomly and then updated repeatedly to minimize the cost function. The size of the learning steps is proportional to the slope of the cost function, so the steps gradually become smaller as the weights approach the minimum cost. The yellow line is the tangent, which is used to calculate the gradient; as the weights reach the minimum cost, the tangent becomes parallel to the x-axis, meaning the gradient is zero and the cost cannot go any lower.

It is also worth mentioning that choosing the learning rate α is important: it should be neither too high nor too low. The two images below show what happens when α is chosen poorly. The left image shows that when α is too high, the updates bounce back and forth across the curve, while the right image shows that when α is too small, the weights still reach the minimum, but it takes a long time.
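As a rough illustration of both failure modes, the same toy quadratic can be run with different learning rates (the α values below are deliberately extreme and purely illustrative):

# Effect of the learning rate alpha on gradient descent for f(x) = x**2.
# The alpha values are illustrative extremes, not recommendations.

def grad_f(x):
    return 2 * x

def run_gd(alpha, steps=30, x0=5.0):
    x = x0
    for _ in range(steps):
        x = x - alpha * grad_f(x)
    return x

print(run_gd(alpha=1.1))    # too high: the iterate overshoots and diverges
print(run_gd(alpha=0.001))  # too low: still far from the minimum after 30 steps
print(run_gd(alpha=0.1))    # reasonable: ends up close to the minimum at 0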

Gradient descent is simple and easy to implement, but it can get trapped in a local minimum instead of finding the global minimum. There is a variant of gradient descent called batch gradient descent whose inner workings are the same. In plain gradient descent the error is calculated for one data point and the weights are updated immediately, but in batch gradient descent the model calculates the error for each instance of the training dataset and does not update the weights until all of the examples have been evaluated. As the figure shows, it converges steadily but very slowly.
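The following sketch contrasts the two update schemes on a toy one-parameter linear model (the data and hyperparameters are made up for illustration): the per-sample version updates the weight after every data point, while the batch version accumulates the gradient over the whole dataset and updates once per pass.

import numpy as np

# Toy data for y = 3*x (the true weight is 3); values are illustrative.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * X

alpha = 0.01
epochs = 100

# Per-sample updates: the weight changes immediately after each data point.
w = 0.0
for _ in range(epochs):
    for xi, yi in zip(X, y):
        grad = 2 * (w * xi - yi) * xi      # gradient of (w*xi - yi)**2 w.r.t. w
        w = w - alpha * grad
print("per-sample:", w)

# Batch updates: the gradient is averaged over the full dataset,
# and the weight is updated only once per pass over the data.
w = 0.0
for _ in range(epochs):
    grad = np.mean(2 * (w * X - y) * X)    # mean gradient over all examples
    w = w - alpha * grad
print("batch:", w)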

Stochastic gradient descent / SGD with momentum

In batch gradient descent, the gradient is computed over the entire dataset at each step, which makes it very slow when the dataset is large. Stochastic gradient descent, in contrast, picks a random instance from the dataset at every step and calculates the gradient on that single instance only. This makes SGD much faster than batch gradient descent, and it runs quickly even on huge datasets, since only one instance needs to be processed at a time. Due to its stochastic (random) nature, however, the algorithm converges less regularly than batch gradient descent; it keeps oscillating on its way to convergence. Momentum is used to smooth out this erratic convergence.

SGD:                  Xₜ₊₁ = Xₜ − α ∇f(Xₜ)

SGD with momentum:    Vₜ = ρ Vₜ₋₁ + (1 − ρ) ∇f(Xₜ),   Xₜ₊₁ = Xₜ − α Vₜ

The symbol ρ is the momentum coefficient. The momentum term at time t is calculated using all the previous updates, giving more weight to the latest gradients than to older ones in order to speed up convergence. After adding momentum, the convergence of stochastic gradient descent looks like this.

It is much smoother than before.
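A minimal sketch of the momentum update above, reusing the toy one-parameter example (ρ, α, and the data are illustrative choices): the velocity V is an exponentially weighted average of past gradients, and the weight moves along V rather than along the raw, noisy per-sample gradient.

import numpy as np

rng = np.random.default_rng(0)

# Toy data for y = 3*x, as before; values are illustrative.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * X

alpha = 0.05   # learning rate
rho = 0.9      # momentum coefficient
w = 0.0        # weight
v = 0.0        # velocity: exponentially weighted average of past gradients

for t in range(200):
    i = rng.integers(len(X))               # SGD: pick one random instance
    grad = 2 * (w * X[i] - y[i]) * X[i]    # gradient on that single instance
    v = rho * v + (1 - rho) * grad         # V_t = rho*V_(t-1) + (1-rho)*grad
    w = w - alpha * v                      # X_(t+1) = X_t - alpha*V_t

print(w)  # close to the true weight, 3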

Performance analysis

In the Colab notebook linked in the references, the effect of momentum on various model metrics is compared, such as training time, accuracy (train and validation), and loss (train and validation), and the performance of the SGD and Adam optimizers is evaluated.
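The article’s own notebook is not reproduced here, but a comparable experiment can be sketched along the following lines, assuming tf.keras and the MNIST dataset (the architecture and hyperparameters are arbitrary choices, not the article’s):

import time
import tensorflow as tf

# Load and normalize MNIST (an arbitrary dataset choice for this sketch).
(x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
x_train, x_val = x_train / 255.0, x_val / 255.0

def build_model():
    # A small dense network; the architecture is illustrative only.
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

optimizers = {
    "SGD": tf.keras.optimizers.SGD(learning_rate=0.01),
    "SGD + momentum": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "Adam": tf.keras.optimizers.Adam(),
}

for name, opt in optimizers.items():
    model = build_model()
    model.compile(optimizer=opt,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    start = time.time()
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        epochs=5, batch_size=128, verbose=0)
    print(name,
          "| time: %.1fs" % (time.time() - start),
          "| val_acc: %.4f" % history.history["val_accuracy"][-1],
          "| val_loss: %.4f" % history.history["val_loss"][-1])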

Final words

In this article, we learned about optimizers and their types and developed an intuition for how they work, covering gradient descent, batch gradient descent, stochastic gradient descent, and SGD with momentum.

Reference

Waqqas Ansari

Waqqas Ansari is a data science guy with a math background. He likes solving challenging business problems through predictive modelling, descriptive modelling, and machine learning algorithms. He is fascinated by new technologies, especially those relating to machine learning.