
How Stochastic Gradient Descent Is Solving Optimisation Problems In Deep Learning


To a large extent, deep learning is all about solving optimisation problems. According to computer science researchers, stochastic gradient descent, better known as SGD, has become the workhorse of deep learning, which, in turn, is responsible for the remarkable progress in computer vision.

SGD is a simple variant of classical gradient descent in which the stochasticity comes from employing a random subset of the training examples (a mini-batch) to compute the gradient at each descent step. Despite its simplicity, it also has implicit regularisation effects, making it well suited to highly non-convex loss functions, such as those entailed in training deep networks for classification.
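To make the mechanics concrete, here is a minimal NumPy sketch of one mini-batch SGD step for a linear model with squared loss. The model, batch size, and learning rate are illustrative assumptions for this example, not anything prescribed in the article.

```python
import numpy as np

def sgd_step(w, X, y, lr=0.01, batch_size=32, rng=None):
    """One mini-batch SGD step for linear regression with squared loss."""
    if rng is None:
        rng = np.random.default_rng()
    # The stochasticity: sample a random mini-batch instead of the full dataset.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient of the mean squared error computed on the mini-batch only.
    grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)
    # Descend along this noisy estimate of the true gradient.
    return w - lr * grad

# Toy usage on synthetic data: 1,000 examples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)
w = np.zeros(5)
for _ in range(500):
    w = sgd_step(w, X, y, rng=rng)
```

Each update sees only `batch_size` examples, which is exactly where the noise, and the per-step cheapness, of SGD comes from.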

SGD is so popular that it is now being billed as the cornerstone of deep learning. According to Sanjeev Arora, a professor of Computer Science at Princeton, research in deep learning is taking place in four core areas:

  • Non-convex optimisation
  • Over-parameterisation and generalisation
  • Role of depth
  • Generative models

SGD falls under the non-convex optimisation problem. Google researcher Ali Rahimi indicated that the study of non-convex optimisation for deep neural networks will largely address two questions:

  1. What does the loss function look like?
  2. Why does SGD converge?

Good optimisation is at the core of deep learning, and a significant performance boost often comes from better optimisation techniques. In fact, researchers believe the choice of optimisation algorithm matters, especially when dealing with large datasets. This is particularly true for stochastic algorithms: because only a subset of the data is observed at any given time, improved optimisation techniques help make the most efficient use of it. One particular trick is maintaining a running mean of gradients over time and adding that to the current gradient, the idea behind momentum.
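As a rough illustration of that trick, the sketch below extends the earlier update with an exponentially decayed running mean of gradients (momentum); the decay factor and learning rate are assumed values, and the linear-model setup is carried over from the previous sketch.

```python
import numpy as np

def sgd_momentum_step(w, v, X, y, lr=0.01, beta=0.9, batch_size=32, rng=None):
    """One SGD step that keeps a running mean of past gradients (momentum)."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Mini-batch gradient for the same linear model with squared loss as above.
    grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)
    # Exponentially decayed running mean of gradients.
    v = beta * v + (1.0 - beta) * grad
    # Step along the smoothed direction rather than the raw noisy gradient.
    return w - lr * v, v

# Usage, reusing X, y, w, rng from the earlier sketch:
#   v = np.zeros(5)
#   for _ in range(500):
#       w, v = sgd_momentum_step(w, v, X, y, rng=rng)
```

Averaging over past gradients damps the noise from individual mini-batches, which is why variants of this idea appear in most modern optimisers.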

Advantages of Stochastic Gradient Descent for learning problems:

  • According to a senior data scientist, one of the distinct advantages of using Stochastic Gradient Descent is that it performs its calculations faster than full-batch gradient descent, since each update uses only a small subset of the data. Full-batch gradient descent, however, can still be a reasonable choice when the dataset is small enough to process in one go.
  • Computer scientists claim that performing one pass of SGD on a particular dataset is statistically (minimax) optimal. In other words, no other algorithm can achieve better results on the expected loss across all possible data distributions.
  • On massive datasets, stochastic gradient descent can converge faster because it performs updates more frequently. In addition, mini-batch training takes advantage of vectorised operations, processing each mini-batch all at once instead of training on single data points.
  • Facebook’s chief AI scientist has emphasised that the reason behind the popularity of SGD is that it can process more examples within the available computation time.
  • A lot of modern optimisation algorithms, such as RMSProp and Adam, are based on gradient descent, but the open question is whether they are superior to standard stochastic gradient descent.
  • In particular, stochastic gradient descent delivers similar guarantees to empirical risk minimisation, which exactly minimises an empirical average of the loss on training data. So, for many learning problems, SGD is not really a “poor” optimisation procedure.
  • In the context of large-scale learning, SGD has received considerable attention and is applied to text classification and natural language processing. Two key benefits of Stochastic Gradient Descent are efficiency and ease of implementation. Even with sparse data, SGD-based classifiers scale to problems with more than 10^5 training examples and more than 10^5 features (see the text-classification sketch after this list).
  • Stochastic gradient descent is best suited for unconstrained optimisation problems. In contrast to batch gradient descent (BGD), SGD approximates the true gradient of E(w, b) by considering a single training example at a time.
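Since the list mentions text classification as a typical large-scale application, here is a hedged scikit-learn sketch that trains a linear classifier with SGD (hinge loss) on the 20 newsgroups dataset; the chosen categories and hyperparameters are illustrative assumptions, not the article's own setup.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Two newsgroup categories stand in for a generic text-classification task.
train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
test = fetch_20newsgroups(subset="test", categories=["sci.space", "rec.autos"])

model = make_pipeline(
    TfidfVectorizer(),                                   # sparse, high-dimensional features
    SGDClassifier(loss="hinge", alpha=1e-4, max_iter=20, tol=1e-3),
)
model.fit(train.data, train.target)
print("test accuracy:", model.score(test.data, test.target))
```

The TF-IDF features are sparse and high-dimensional, which is exactly the regime where an SGD-trained linear model stays cheap per update.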

The disadvantages of SGD include:

  • SGD requires a number of hyperparameters, such as the regularisation parameter and the number of iterations
  • It is also sensitive to feature scaling, which is why inputs are typically standardised first (see the sketch after this list)
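To illustrate both points, the sketch below standardises features before fitting scikit-learn's SGDRegressor and spells out the usual hyperparameters one has to choose; the dataset and the specific values are assumptions made for the example.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Standardising the features keeps SGD's single learning rate sensible across
# dimensions; the regularisation strength, learning-rate schedule and iteration
# budget are the hyperparameters the list above refers to.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(loss="squared_error", penalty="l2", alpha=1e-4,
                 learning_rate="invscaling", eta0=0.01,
                 max_iter=1000, tol=1e-3),
)
model.fit(X, y)
print("R^2 on training data:", model.score(X, y))
```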

Conclusion

According to a paper from the University of Buffalo’s Department of Computer Science and Engineering, Stochastic Gradient Descent powers nearly all deep learning applications today. SGD is an extension of the gradient descent algorithm, and the aim of learning with it is to generalise beyond the training set. Furthermore, the paper states that outside of deep learning, SGD is the main way to train large linear models on very large datasets. With the exponential growth of interest in Deep Learning, which started in the academic world around 2006, SGD, thanks to its simplicity of implementation and its efficiency on large-scale datasets, has become by far the most common method for training deep neural networks and other large-scale models.


Richa Bhatia

Richa Bhatia is a seasoned journalist with six years’ experience in reportage and news coverage and has had stints at Times of India and The Indian Express. She is an avid reader, mum to a feisty two-year-old and loves writing about the next-gen technology that is shaping our world.