In a neural network, there is the concept of loss, which measures the network's performance. The higher the loss, the poorer the performance, which is why we always try to minimize the loss so that the neural network performs better. The process of minimizing the loss is called optimization, and an optimizer is a method that modifies the weights of the neural network to reduce the loss. Although several neural network optimizers exist, in this article we will learn about gradient descent with momentum and compare its performance with other optimizers. Below are the major points that we are going to discuss in this post.
Table of contents
- Introduction to Optimizers
- How does Gradient descent work?
- Stochastic gradient descent / SGD with momentum
- Performance analysis
Let’s start the discussion by understanding what an optimizer is.
Introduction to Optimizers
Simply put, to understand how an optimizer works, suppose you are standing at the top of a hill and you want to reach the ground. What would you do? You would move down the slope; if you find yourself moving upwards, it means you are heading in the wrong direction, so you change direction and keep moving downwards until you finally reach the ground. This is how an optimizer works. The figure below shows exactly that: θ0 and θ1 are the weights, J(θ) is the loss function, and the black line traces a person moving toward the lowest point of the surface.
It is not possible to know the optimal weights of a model in advance, so the weights are initialized randomly using some initialization method and then changed by the optimizer until the loss reaches a minimum. In short, an optimizer reduces the loss function by updating the weights of the model, producing better results.
Various optimizers have been created in recent years, each with its own advantages and disadvantages: gradient descent, stochastic gradient descent (SGD), mini-batch stochastic gradient descent, SGD with momentum, etc.
How does Gradient descent work?
Gradient descent is an optimization algorithm used to optimize machine learning and deep learning models. It finds a function’s lowest value, or minimum, numerically by repeatedly stepping in the direction of the negative gradient:

X(n+1) = Xn − α ∇f(Xn)

where:
- X(n+1): the new weight
- Xn: the old weight
- α: the learning rate
- ∇f(Xn): the gradient of the cost function with respect to X
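The update rule X(n+1) = Xn − α ∇f(Xn) can be sketched in a few lines of Python. This is a minimal one-variable illustration of our own, not code from the article; the function names and values are assumptions:

```python
def gradient_descent(grad_f, x0, alpha=0.1, n_steps=100):
    """Repeatedly apply X(n+1) = Xn - alpha * grad_f(Xn)."""
    x = x0
    for _ in range(n_steps):
        x = x - alpha * grad_f(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(minimum)  # converges toward 3.0, the minimizer of f
```

Each iteration moves the weight a distance proportional to the slope, so the steps shrink automatically as the minimum is approached.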
The figure below shows the cost-versus-weight graph for gradient descent. We can see that the model weights are initialized randomly and changed repeatedly to minimize the cost function. The size of each learning step is proportional to the slope of the cost function, so the steps gradually become smaller as they get closer to the minimum cost. The yellow line is the tangent, which is used to calculate the gradient value; as the cost approaches its minimum, the tangent becomes parallel to the x-axis, meaning the gradient has reached zero and the cost cannot go any lower.
It is also worth mentioning that choosing the learning rate α is an important part: it should be neither too high nor too low. The two images below show what happens when α is chosen poorly. The left image shows an α that is too high, which causes the updates to bounce back and forth across the curve; the right image shows an α that is too small, which will still reach the minimum but will take a long time.
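The effect of the learning rate can be demonstrated on the simple function f(x) = x², whose gradient is 2x (a toy example of our own, not from the article): each update multiplies x by (1 − 2α), so the iterates diverge when α is too high and shrink only slowly when α is too low.

```python
def run(alpha, n_steps=10):
    """Run gradient descent on f(x) = x^2 starting from x = 1.0."""
    x = 1.0
    for _ in range(n_steps):
        x -= alpha * 2 * x  # each step multiplies x by (1 - 2*alpha)
    return x

print(run(alpha=1.1))   # too high: |1 - 2*alpha| > 1, so x oscillates and grows
print(run(alpha=0.01))  # too low: x creeps toward 0 but is still far away
print(run(alpha=0.4))   # reasonable: x is already very close to 0
```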
Gradient descent is simple and easy to implement, but it can get trapped in a local minimum instead of finding the global minimum. There is a variant of gradient descent called batch gradient descent whose inner functionality is the same. In gradient descent as described above, the error is calculated for one data point and the weights are updated immediately, whereas in batch gradient descent the model calculates the error for each instance of the training dataset but does not update the weights until all of the examples have been evaluated. We can see in the figure that it converges steadily but very slowly.
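The distinction can be made concrete with a short sketch of batch gradient descent for linear regression. The dataset, hyperparameters, and function name below are our own illustration, not from the article; the key point is that the weights are updated once per pass, after averaging the gradient over every training instance:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=1.0, n_epochs=500):
    """Fit w for the linear model y ≈ X @ w; one update per full pass."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        error = X @ w - y              # residual on every training instance
        grad = X.T @ error / len(y)    # gradient averaged over the whole set
        w -= alpha * grad              # single weight update per epoch
    return w

# Toy data generated from y = 1 + 2*x (bias column plus one feature).
X = np.c_[np.ones(50), np.linspace(0, 1, 50)]
y = X @ np.array([1.0, 2.0])
print(batch_gradient_descent(X, y))  # approaches [1.0, 2.0]
```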
Stochastic gradient descent / SGD with momentum
In batch gradient descent, the gradient is computed over the entire dataset at each step, which makes it very slow when the dataset is large. Stochastic gradient descent instead picks a random instance from the dataset at every step and calculates the gradient on that single instance only. This makes SGD work much faster than batch gradient descent, and it also runs quickly on huge datasets, since only one instance needs to be processed per step. Due to its stochastic (random) nature, however, this algorithm converges less smoothly than batch gradient descent; it keeps oscillating on its way toward convergence. Momentum is used to dampen this erratic convergence.
The SGD update is the same rule as before, applied to a single random instance:

X(n+1) = Xn − α ∇f(Xn)

SGD with momentum adds a velocity term:

Vt = ρ V(t−1) + ∇f(Xn)
X(n+1) = Xn − α Vt

The symbol ρ is the momentum coefficient. The momentum term Vt is calculated from all previous updates, giving more weight to the most recent updates compared to older ones in order to speed up convergence. After adding momentum, the convergence of stochastic gradient descent looks like this.
It is much smoother than before.
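The SGD-with-momentum loop can be sketched in Python on a small linear-regression problem. Everything below (the data, the seed, and the hyperparameters ρ = 0.9 and α = 0.02) is our own illustration, not from the article; the point is that each step uses a single random instance, and the velocity v accumulates past gradients with decay ρ:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_momentum(X, y, alpha=0.02, rho=0.9, n_steps=2000):
    """SGD with momentum: v_t = rho * v_(t-1) + grad; w -= alpha * v_t."""
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    for _ in range(n_steps):
        i = rng.integers(len(y))          # pick one random instance
        grad = (X[i] @ w - y[i]) * X[i]   # gradient on that instance only
        v = rho * v + grad                # exponentially weighted sum of gradients
        w -= alpha * v                    # smoothed update
    return w

# Toy data: y = 1 + 2*x with a bias column.
X = np.c_[np.ones(50), np.linspace(0, 1, 50)]
y = X @ np.array([1.0, 2.0])
print(sgd_momentum(X, y))  # approaches [1.0, 2.0]
```

Because consecutive gradients are averaged into v, the random per-instance noise largely cancels out, which is why the trajectory is smoother than plain SGD.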
Performance analysis
In the Colab notebook linked in the references, the effect of momentum on various model metrics is compared, such as training time, accuracy (train and validation), and loss (train and validation). The performance of the SGD and Adam optimizers is evaluated.
Final words
In this article, we learned about optimizers and their types and built the intuition behind them, including gradient descent, batch gradient descent, stochastic gradient descent, and SGD with momentum.