“There is no single optimizer that dominates its competitors across all tasks.”
Critics often call machine learning ‘glorified statistics’, and there is some merit to the argument. The fundamental function of any machine learning model is pattern recognition, which relies on convergence: the process of fitting the model to data. To that end, neural networks use optimization methods, typically categorised as first-order, higher-order and derivative-free. First-order methods such as gradient descent and its variants are especially popular.
The gradient descent method updates the variables iteratively in the direction opposite to the gradients of the objective function. After each update, it guides the model gradually towards the target, converging to the optimal value of the objective function.
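On a toy one-dimensional problem, the update rule can be sketched as follows (a minimal illustration, not code from the paper; the function and learning rate are arbitrary choices):

```python
# Minimal gradient descent sketch: repeatedly step opposite to the gradient,
# x <- x - lr * grad(x). The function and learning rate are illustrative.
def gradient_descent(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)  # move against the gradient
    return x

# f(x) = x^2 has gradient 2x and its minimum at x = 0.
x_min = gradient_descent(lambda x: 2 * x, x0=5.0)
```

Each step shrinks the iterate by a constant factor here, so it converges geometrically towards the minimiser at zero.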
The stochastic gradient method replaces the full gradient with an unbiased estimate computed on a small sample of the data. This reduces the cost of each update when dealing with large numbers of samples and removes a certain amount of computational redundancy. Then there are other variants that claim to do an even better job.
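A hedged sketch of the idea: instead of the full gradient, each step uses the gradient of a random mini-batch, which is an unbiased estimate of the full gradient. The toy task below (estimating a dataset's mean by minimising the average squared loss) and all constants are illustrative assumptions, not details from the paper:

```python
import random

# Mini-batch SGD sketch: each update touches only batch_size samples, yet the
# batch gradient is an unbiased estimate of the full gradient over the data.
def sgd_mean(data, lr=0.05, batch_size=4, steps=2000, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        # gradient of mean((theta - x)^2) over the mini-batch
        grad = sum(2 * (theta - x) for x in batch) / batch_size
        theta -= lr * grad
    return theta

data = [float(i) for i in range(1, 11)]  # true mean is 5.5
theta_hat = sgd_mean(data)
```

Because each batch gradient is noisy, the iterate hovers around the true minimiser rather than landing on it exactly; that noise is the price paid for the cheaper per-step cost.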
Large-scale stochastic optimization drives a wide variety of machine learning tasks. The permutations and combinations of this whole ordeal quickly get out of hand when one stumbles upon scores of benevolent-sounding optimizers. Fatigue by abundance is now a serious challenge among researchers, and choosing the right optimization method can be a nightmare. Not to mention the effective tuning of hyperparameters, which heavily influences both the training speed and the final performance of the learned model. These tasks are time- and resource-intensive.
Choosing an optimizer is one of the most crucial design decisions in deep learning, and it is not an easy one. The above illustration shows the number of times ArXiv titles and abstracts mention specific optimizers per year. The growing literature now lists hundreds of optimization methods. Researchers at the University of Tübingen performed an extensive, standardized benchmark of fifteen popular deep learning optimizers. Their paper is one of the few works that focus on large-scale benchmarking of optimizers. According to the researchers, the objective is to help understand how optimization methods and hyperparameters influence the training performance.
“While some optimizers are frequently decent, they also generally perform similarly, often switching their positions in the ranking.”
Paper by Schmidt et al.
Aiming for generality, the researchers evaluated performance on eight diverse real-world deep learning problems from different disciplines. From a collection of more than a hundred deep learning optimizers, they selected fifteen of the most popular choices for benchmarking. “There are enough optimizers,” said the researchers. The authors also noted that the conclusions of the paper might not generalize to other workloads such as GANs, reinforcement learning, or applications where, for example, memory usage is crucial.
The researchers analysed more than 50,000 individual runs and have open-sourced all the baseline results of their experiments. This seminal work underlines the dangers of chasing state-of-the-art hype and highlights the following:
- There are now enough optimizers.
- Optimizer performance varies greatly across tasks.
- There is no single optimizer that dominates its competitors across all tasks.
- ADAM and ADABOUND consistently perform well.
- The performance distribution across different optimizers is surprisingly similar to that of a single method re-tuned or simply re-run with different random seeds.
- Having accurate, well-tuned baselines for optimizers can drastically reduce the required computational budget.
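The seed-variance finding can be illustrated with a toy experiment (my own sketch, not the paper's benchmark code): two first-order optimizers minimising f(x) = x² from random starting points, with final losses collected over several seeds.

```python
import random

# Toy illustration of the seed-re-run comparison (not the paper's code):
# plain gradient descent versus heavy-ball momentum on f(x) = x^2, each
# re-run from random initialisations across several seeds.
def run(update, seed, lr=0.1, steps=200):
    rng = random.Random(seed)
    x, v = rng.uniform(-5.0, 5.0), 0.0
    for _ in range(steps):
        g = 2.0 * x            # gradient of f(x) = x^2
        x, v = update(x, v, g, lr)
    return x * x               # final loss

def sgd(x, v, g, lr):
    return x - lr * g, v

def momentum(x, v, g, lr, beta=0.9):
    v = beta * v + g           # heavy-ball velocity
    return x - lr * v, v

results = {}
for name, update in [("sgd", sgd), ("momentum", momentum)]:
    losses = sorted(run(update, seed) for seed in range(10))
    results[name] = losses[len(losses) // 2]  # median final loss
```

Both methods reach near-zero loss on this easy problem; the point of the sketch is the harness itself, re-running each optimizer across seeds and comparing distributions rather than single runs, which is the kind of comparison the benchmark argues for.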
Given these results, the researchers question the rationale behind the development of new methods when more fundamental problems remain. They hope their experiments will nudge the ML community to “move beyond inventing yet another optimizer and to focus on key challenges, such as automatic, inner-loop tuning for truly robust and efficient optimization.” The researchers also admitted that the creators of new optimizers cannot be expected to compare their work with every possible previous method. The baselines of all the experiments have been open-sourced: the ML community can access the data set of 53,760 unique runs, each consisting of thousands of individual data points such as the mini-batch training losses of every iteration or epoch-wise performance measures. These can serve as competitive and well-tuned baselines for future benchmarks of new optimizers.
Know more here.