Inspired by the human brain, Artificial Neural Networks (ANNs) are now used by enterprises across the globe to solve complex computing tasks such as speech recognition, computer vision and stock market prediction.
In this article, we list six techniques that can be used to optimise deep neural networks.
1| Stochastic Gradient Descent
Backpropagation, or backward propagation of errors, is one of the most common and popular techniques for training neural networks. It searches for the minimum of the error function in weight space using a technique known as gradient descent. The basic idea behind stochastic approximation can be traced back to the research paper "A Stochastic Approximation Method" by Herbert Robbins and Sutton Monro, which was published in 1951.
Stochastic Gradient Descent (SGD) is a variant of gradient descent and one of the most popular iterative methods for optimising an objective function with suitable smoothness properties in a deep neural network. SGD replaces the actual gradient, which is calculated from the entire dataset, with an estimate calculated from randomly selected data. In its strict form, the technique uses a single sample to perform each iteration.
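The single-sample update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production optimiser: the toy data, the linear model and the names `sgd_linear_fit`, `lr` and `epochs` are assumptions made for the example.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b by SGD, using one sample per parameter update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)              # visit the samples in random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2, estimated from this one sample only
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b

# Hypothetical noise-free toy data drawn from y = 2x + 1
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(round(w, 2), round(b, 2))
```

On this toy problem the estimates should end up close to the true slope and intercept. With noisy data, a decaying learning rate is usually needed for the iterates to settle.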
2| Limited memory BFGS (L-BFGS) & Conjugate gradient (CG)
Limited memory BFGS (L-BFGS) and Conjugate gradient (CG) are batch methods which can help simplify and speed up the process of pretraining deep algorithms. Their use in deep learning was studied by researchers Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Y. Ng at Stanford University as a way to mitigate the tuning and parallelisation issues of the Stochastic Gradient Descent technique.
L-BFGS is highly competitive with, and sometimes superior to, SGD and CG for low dimensional problems where the number of parameters is relatively small, for instance convolutional models. For high dimensional problems, on the other hand, CG is more competitive and usually outperforms both L-BFGS and SGD.
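To give a feel for how a batch method differs from SGD, here is a minimal sketch of the linear conjugate gradient method in plain Python. It minimises a small quadratic objective using the full gradient and an exact line search at every step. The objective, the matrix `A`, the vector `b` and the helper names are assumptions chosen for illustration, not the setup used in the Stanford study. L-BFGS is not shown.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def conjugate_gradient(A, b, tol=1e-12, max_iter=100):
    """Minimise f(x) = 0.5 * x.A.x - b.x for a symmetric positive-definite A."""
    x = [0.0] * len(b)
    r = list(b)               # residual b - A x, the negative gradient at x = 0
    d = list(r)               # first search direction: steepest descent
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old < tol:      # squared gradient norm is small enough
            break
        Ad = matvec(A, d)
        alpha = rs_old / dot(d, Ad)     # exact line search along d
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old          # keeps the new direction conjugate to d
        d = [ri + beta * di for ri, di in zip(r, d)]
        rs_old = rs_new
    return x

# Hypothetical 2-parameter problem; the exact minimiser is (1/11, 7/11)
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
print(x)
```

Because every step uses the whole objective rather than a random sample, there is no learning rate to tune here, which is one of the practical attractions of batch methods.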
3| Mini-Batch Gradient Descent
Mini-batch gradient descent is a variant of the gradient descent algorithm which works by splitting the training dataset into small batches and performing an update for each batch. It has two main advantages. First, it reduces the variance of the parameter updates, which can lead to more stable convergence. Second, it can make use of the highly optimised matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient with respect to a mini-batch very efficient. Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually also used when mini-batches are employed.
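As a rough sketch of the idea, the following plain-Python example averages the gradient over each small batch before updating the parameters. The toy data, the linear model and the names `minibatch_gd` and `batch_size` are assumptions for illustration. A real deep learning library would compute the per-batch gradient with vectorised matrix operations instead of an inner loop.

```python
import random

def minibatch_gd(xs, ys, batch_size=2, lr=0.1, epochs=300, seed=0):
    """Fit y ~ w*x + b, updating once per mini-batch of samples."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # Split the shuffled dataset into consecutive small batches
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += error * xs[i]
                grad_b += error
            # Averaging over the batch reduces the variance of the update
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b

# Hypothetical noise-free toy data drawn from y = 3x - 1
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [3.0 * x - 1.0 for x in xs]
w, b = minibatch_gd(xs, ys)
print(round(w, 3), round(b, 3))
```

Setting `batch_size=1` recovers single-sample SGD, while setting it to the dataset size gives full-batch gradient descent.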
4| Weight Initialization
Weight initialisation helps optimise deep neural networks by preventing layer activation outputs from vanishing during a forward pass through the network. Two basic approaches are zero weight initialisation and random weight initialisation. Zero weight initialisation was proposed by Sarfaraz Masood and Pravin Chandra in their paper "Training Neural Network with Zero Weight Initialization".
In zero weight initialisation, all the weights are the same, so the activations in all hidden units are also the same and the gradient with respect to each weight is identical. In random weight initialisation, by contrast, the weights are assigned random values very close to zero, which breaks this symmetry and gives better accuracy than zero weight initialisation.
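The symmetry problem can be checked on a tiny network. The sketch below is an illustrative assumption rather than code from the cited paper: the network has two inputs, two sigmoid hidden units, one linear output and no biases, and `hidden_gradients` is a hypothetical helper. It computes the gradient for each hidden unit under both initialisation schemes.

```python
import math
import random

def hidden_gradients(W1, w2, x, y):
    """Gradient of 0.5 * (pred - y)**2 w.r.t. each hidden unit's weights.

    Network: hidden = sigmoid(W1 x), pred = w2 . hidden (no biases).
    """
    hidden = [1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(row, x))))
              for row in W1]
    pred = sum(w * h for w, h in zip(w2, hidden))
    error = pred - y
    grads = []
    for w_out, h in zip(w2, hidden):
        delta = error * w_out * h * (1.0 - h)   # signal backpropagated to this unit
        grads.append([delta * xi for xi in x])
    return grads

x, y = [1.0, -2.0], 1.0   # one hypothetical training example

# Zero initialisation: every hidden unit receives the same gradient
zero_grads = hidden_gradients([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], x, y)

# Small random initialisation: the units receive different gradients
rng = random.Random(0)
W1 = [[rng.uniform(-0.1, 0.1) for _ in x] for _ in range(2)]
w2 = [rng.uniform(-0.1, 0.1) for _ in range(2)]
rand_grads = hidden_gradients(W1, w2, x, y)

print(zero_grads[0] == zero_grads[1])
print(rand_grads[0] == rand_grads[1])
```

The first comparison should print `True` and the second `False`: only the randomly initialised units get distinct updates and can go on to learn different features.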
5| Synthetic Gradients
Researchers at Google's DeepMind developed Synthetic Gradients, an optimisation technique which has been claimed to improve communication between multiple neural networks. The method uses the activations at each network layer to predict the gradient for that layer, so the layer can be updated without waiting for the full backward pass. It is reported to give quick and accurate results in complex neural network computing.
6| Gradient Descent with Momentum
Gradient Descent with Momentum is used to speed up the training of deep neural networks. It accelerates gradient descent by accumulating a velocity vector in directions of persistent reduction in the objective across iterations. However, if the input dataset is sparse, this method tends to perform poorly.
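The velocity update can be sketched as follows. The example compares plain gradient descent with the momentum variant on a simple, poorly conditioned quadratic. The objective, the starting point and the hyperparameters (`lr=0.03`, `momentum=0.9`) are assumptions picked for illustration.

```python
def objective(p):
    x, y = p
    return 0.5 * (x ** 2 + 25.0 * y ** 2)

def gradient(p):
    x, y = p
    return [x, 25.0 * y]

def descend(momentum, lr=0.03, steps=100):
    """Gradient descent with a velocity term; momentum=0.0 gives plain descent."""
    p = [10.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(p)
        # Velocity accumulates gradients that keep pointing the same way
        v = [momentum * vi - lr * gi for vi, gi in zip(v, g)]
        p = [pi + vi for pi, vi in zip(p, v)]
    return p

plain = descend(momentum=0.0)
with_momentum = descend(momentum=0.9)
print(objective(plain), objective(with_momentum))
```

With these settings the momentum run should reach a noticeably lower objective value in the same number of steps, because the velocity builds up along the shallow `x` direction where plain gradient descent makes slow progress.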