
# A Lowdown On Alternatives To Gradient Descent Optimization Algorithms

Gradient Descent is the most common optimisation strategy used in machine learning frameworks. It is an iterative algorithm that minimises a function towards its local or global minimum. In simple words, Gradient Descent iterates over a function, adjusting its parameters until it finds the minimum.

Gradient Descent minimises the error by adjusting the weights after passing through all the samples in the training set. If the weights are instead updated after each training sample, or after a small subset of training samples, the method is called Stochastic Gradient Descent (SGD).
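As a minimal sketch of the difference (illustrative data and a simple squared-error loss, not from the article), the distinction is simply when the weight update happens:

```python
import random

# Fit y = w*x with squared error. Batch gradient descent updates w once
# per pass over all samples; stochastic gradient descent updates w after
# each individual sample.

def batch_gd(data, w=0.0, lr=0.1, epochs=100):
    for _ in range(epochs):
        # Gradient of the mean squared error over the full dataset
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def sgd(data, w=0.0, lr=0.1, epochs=100):
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            # Update immediately using a single sample's gradient
            w -= lr * 2 * (w * x - y) * x
    return w

data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0]]  # true weight is 3
print(batch_gd(list(data)))  # converges near 3.0
print(sgd(list(data)))       # also converges near 3.0, with more noise en route
```

SGD's per-sample updates are noisier but much cheaper per step, which is why it dominates in deep learning.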

Stochastic Gradient Descent (SGD) and many of its variants are popular state-of-the-art methods for training deep learning models due to their efficiency. However, SGD suffers from limitations that prevent its more widespread use: for example, the error signal diminishes as the gradient is back-propagated (i.e. the gradient vanishes), and SGD is sensitive to poor conditioning, meaning a small change in the input can change the gradient dramatically.

The learning rate is a parameter, denoted by α (alpha), that tunes how quickly a model converges on a result (a classification, prediction, etc.). It can be thought of as a ball thrown down a staircase: a higher learning rate corresponds to a faster descending ball. The ball leaps down, skipping adjacent steps and reaching the bottom quickly, but it does not settle immediately because of the momentum it carries.

The learning rate is a scalar – a value that tells the machine how fast or how slow to arrive at a conclusion. The speed at which a model learns is important, and it varies across applications. An overly fast learning algorithm can miss data points or correlations that would give better insights into the data, which eventually leads to wrong classifications.

This momentum can be controlled through three common ways of implementing learning rate decay:

1. Step decay: reduce the learning rate by some factor every few epochs. Typical choices are halving the learning rate every 5 epochs, or multiplying it by 0.1 every 20 epochs
2. Exponential decay has the mathematical form α = α_0·e^(−kt), where α_0 and k are hyperparameters and t is the iteration number
3. 1/t decay has the mathematical form α = α_0/(1 + kt), where α_0 and k are hyperparameters and t is the iteration number
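The three schedules above can be sketched directly; the hyperparameter values below (the drop factor, the interval, and k) are illustrative defaults, not prescriptions:

```python
import math

def step_decay(alpha0, t, drop=0.5, every=5):
    # Multiply the learning rate by `drop` once every `every` epochs
    return alpha0 * (drop ** (t // every))

def exponential_decay(alpha0, t, k=0.1):
    # alpha = alpha_0 * e^(-k t)
    return alpha0 * math.exp(-k * t)

def inv_t_decay(alpha0, t, k=0.1):
    # alpha = alpha_0 / (1 + k t)
    return alpha0 / (1 + k * t)

for t in (0, 5, 10, 20):
    print(t, step_decay(0.1, t), exponential_decay(0.1, t), inv_t_decay(0.1, t))
```

All three shrink α over time; they differ only in how aggressively the shrinkage compounds.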

There is no one-stop answer for how hyperparameters should be tuned to reduce the loss; it is more or less trial-and-error experimentation.

To narrow down the values, there are a few methods for skimming through the parameter space to figure out values that align with the objective of the model being trained:

• Adagrad is an adaptive learning rate method: weights with a high accumulated gradient receive a low learning rate, and vice versa
• RMSprop adjusts the Adagrad method in a very simple way to reduce its aggressive, monotonically decreasing learning rate, by using a moving average of squared gradients instead of an accumulated sum
• Adam is similar to RMSProp, but with momentum
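A rough sketch of these per-parameter update rules on a one-dimensional quadratic loss (the hyperparameter values are conventional defaults, not from the article):

```python
import math

# Toy loss: f(w) = (w - 5)^2, so the gradient is 2*(w - 5) and the
# minimum sits at w = 5. Each optimizer starts from w = 0.

def grad(w):
    return 2 * (w - 5.0)

def adagrad(w=0.0, lr=1.0, eps=1e-8, steps=200):
    g2_sum = 0.0
    for _ in range(steps):
        g = grad(w)
        g2_sum += g * g                       # accumulate squared gradients
        w -= lr * g / (math.sqrt(g2_sum) + eps)
    return w

def rmsprop(w=0.0, lr=0.1, decay=0.9, eps=1e-8, steps=200):
    g2_avg = 0.0
    for _ in range(steps):
        g = grad(w)
        g2_avg = decay * g2_avg + (1 - decay) * g * g  # moving average
        w -= lr * g / (math.sqrt(g2_avg) + eps)
    return w

def adam(w=0.0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=200):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g             # first moment (momentum)
        v = b2 * v + (1 - b2) * g * g         # second moment
        m_hat = m / (1 - b1 ** t)             # bias correction
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

for name, fn in [("Adagrad", adagrad), ("RMSprop", rmsprop), ("Adam", adam)]:
    print(name, fn())  # each approaches the minimum at w = 5
```

The key structural difference is visible in the denominators: Adagrad divides by an ever-growing sum, RMSprop by a moving average, and Adam by a bias-corrected moving average while also smoothing the gradient itself.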

The Alternating Direction Method of Multipliers (ADMM), on the other hand, has been used successfully in many conventional machine learning applications and is considered a useful alternative to Stochastic Gradient Descent (SGD) as a deep learning optimizer.

### ADMM And Alternatives

Adam is the most popular method because it is computationally efficient and requires little tuning. Other well-known methods that incorporate adaptive learning rates include AdaGrad, RMSProp and AMSGrad.

Recently, the Alternating Direction Method of Multipliers (ADMM) has become popular with researchers as an alternative to SGD due to its excellent scalability.
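To give a flavour of the alternating updates that give ADMM its name, here is a minimal scalar sketch (a hypothetical example, not the dlADMM method itself): minimizing (x − 3)² + λ|z| subject to x = z. Each sub-step has a closed form, and the dual update enforces the consensus constraint.

```python
def admm_scalar(lam=1.0, rho=1.0, iters=100):
    x = z = u = 0.0
    for _ in range(iters):
        # x-update: minimize (x - 3)^2 + (rho/2)*(x - z + u)^2 in closed form
        x = (2 * 3.0 + rho * (z - u)) / (2 + rho)
        # z-update: minimize lam*|z| + (rho/2)*(x - z + u)^2 -> soft-threshold
        v = x + u
        z = max(abs(v) - lam / rho, 0.0) * (1 if v >= 0 else -1)
        # dual update: push x and z towards agreement (the constraint x = z)
        u += x - z
    return z

print(round(admm_scalar(), 3))  # analytic solution is 3 - lam/2 = 2.5
```

Because each sub-problem is solved independently and exactly, this structure decomposes naturally across blocks of variables, which is the source of ADMM's scalability.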

However, as this is an emerging domain, several challenges remain, including:

1. The lack of global convergence guarantees,
2. Slow convergence towards solutions, and
3. Cubic time complexity with regard to feature dimensions.

To address these challenges simultaneously, researchers at George Mason University propose a novel optimization framework for deep learning via ADMM (dlADMM).

Here’s how dlADMM tries to solve these challenges:

• The parameters in each layer are updated backward and then forward, so that parameter information is exchanged efficiently across layers.
• The time complexity is reduced from cubic to quadratic in the (latent) feature dimensions via a dedicated algorithm design for the subproblems, using iterative quadratic approximations and backtracking.
• Experiments on benchmark datasets demonstrate that the proposed dlADMM algorithm outperforms most of the comparison methods.
