Gradient Descent is the most common optimisation strategy used in machine learning frameworks. It is an iterative algorithm used to minimise a function to its local or global minimum. In simple words, Gradient Descent iterates over a function, adjusting its parameters until it finds the minimum.

Gradient Descent minimises the error by adjusting the weights after passing through all the samples in the training set. If the weights are updated after a specified subset of training samples, or after each individual sample, the method is called Stochastic Gradient Descent.

Stochastic Gradient Descent (SGD) and many of its variants are popular state-of-the-art methods for training deep learning models due to their efficiency. However, SGD suffers from limitations that prevent its more widespread use: for example, the error signal diminishes as the gradient is back-propagated (the vanishing gradient problem), and SGD is sensitive to poor conditioning, meaning a small change in the input can change the gradient dramatically.

The learning rate, denoted by α (alpha), is a parameter used to tune how accurately a model converges on a result (a classification, a prediction, etc.). It can be thought of as a ball thrown down a staircase: a higher learning rate is equivalent to a higher speed of the descending ball. The ball leaps over adjacent steps and reaches the bottom quickly, but does not settle immediately because of the momentum it carries.

The learning rate is a scalar, a value that tells the machine how fast or how slow to arrive at a conclusion. The speed at which a model learns is important, and it varies across applications. A very fast learning algorithm can miss data points or correlations that would give better insight into the data.
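The update rule behind this can be sketched in a few lines. The example below is a minimal illustration (the loss function, starting point, and learning rate are invented for demonstration, not taken from any particular framework): plain gradient descent on a one-dimensional quadratic, where the parameter is repeatedly nudged against the gradient, scaled by α.

```python
# Minimal sketch of gradient descent on f(w) = (w - 3)^2,
# whose gradient is 2(w - 3). All values here are illustrative.
def gradient_descent(lr=0.1, steps=100):
    w = 0.0                      # initial parameter guess
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)   # gradient of the loss at the current w
        w -= lr * grad           # step against the gradient, scaled by alpha
    return w

print(round(gradient_descent(), 4))  # prints 3.0, the minimum of the quadratic
```

With a learning rate of 0.1 the iterate contracts toward the minimum at w = 3; a much larger rate would overshoot back and forth (the leaping ball), and a much smaller one would crawl, which is exactly the trade-off the staircase analogy describes.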
Missing this will eventually lead to wrong classifications.

This momentum can be controlled through three common ways of implementing learning rate decay:

- Step decay: reduce the learning rate by some factor every few epochs. Typical values might be halving the learning rate every 5 epochs, or multiplying it by 0.1 every 20 epochs.
- Exponential decay has the mathematical form α = α₀e^(−kt), where α₀ and k are hyperparameters and t is the iteration number.
- 1/t decay has the mathematical form α = α₀/(1 + kt), where α₀ and k are hyperparameters and t is the iteration number.

There is no one-stop answer for how hyperparameters should be tuned to reduce the loss; it is more or less trial-and-error experimentation.

To narrow down the values, there are a few methods for skimming through the parameter space to find values that align with the objective of the model being trained:

- Adagrad is an adaptive learning rate method: weights with a high gradient get a low learning rate, and vice versa.
- RMSprop adjusts the Adagrad method in a very simple way to reduce its aggressive, monotonically decreasing learning rate, making use of a moving average of squared gradients.
- Adam is similar to RMSprop, but with momentum.

The Alternating Direction Method of Multipliers (ADMM), meanwhile, has been used successfully in many conventional machine learning applications and is considered a useful alternative to Stochastic Gradient Descent (SGD) as a deep learning optimiser.

ADMM And Alternatives

Adam is the most popular method because it is computationally efficient and requires little tuning.
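The three decay schedules above translate directly into one-line formulas. The sketch below implements them as written (α₀, k, the drop factor, and the epoch interval are hyperparameters; the specific numbers passed in are illustrative only):

```python
import math

# Sketches of the three learning rate decay schedules described above.
# alpha0 and k are hyperparameters, t is the epoch/iteration number.
def step_decay(alpha0, t, drop=0.5, every=5):
    # Multiply the rate by `drop` once every `every` epochs
    return alpha0 * (drop ** (t // every))

def exponential_decay(alpha0, k, t):
    # alpha = alpha0 * e^(-k t)
    return alpha0 * math.exp(-k * t)

def inverse_time_decay(alpha0, k, t):
    # alpha = alpha0 / (1 + k t)
    return alpha0 / (1 + k * t)

print(round(step_decay(0.1, 10), 4))              # 0.025 (two halvings)
print(round(exponential_decay(0.1, 0.1, 10), 4))  # 0.0368
print(round(inverse_time_decay(0.1, 0.1, 10), 4)) # 0.05
```

Note how differently the schedules behave at the same iteration: step decay drops in discrete jumps, exponential decay shrinks smoothly and fastest, and 1/t decay shrinks smoothly but more gently, which is part of what the trial-and-error tuning decides between.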
Other well-known methods that incorporate adaptive learning rates include AdaGrad, RMSprop and AMSGrad.

The Alternating Direction Method of Multipliers (ADMM) has been proposed as an alternative to SGD, and has recently become popular with researchers due to its excellent scalability.

However, as an emerging approach, several challenges remain, including:

- The lack of global convergence guarantees,
- Slow convergence towards solutions, and
- Cubic time complexity with regard to feature dimensions.

To address these problems, researchers at George Mason University propose a novel optimisation framework for deep learning via ADMM (dlADMM) that tackles these challenges simultaneously.

Here's how dlADMM tries to solve these challenges:

- The parameters in each layer are updated backward and then forward, so that the parameter information in each layer is exchanged efficiently.
- The time complexity is reduced from cubic to quadratic in the (latent) feature dimensions via a dedicated algorithm design for the subproblems, using iterative quadratic approximations and backtracking.
- Experiments on benchmark datasets demonstrate that the proposed dlADMM algorithm outperforms most of the comparison methods.

Know more about dlADMM here.