“There is no single optimizer that dominates its competitors across all tasks.”
Critics often call machine learning ‘glorified statistics’, and there is some merit to the argument. The fundamental function of any machine learning model is pattern recognition, which relies on convergence: the process of fitting the model to data. To that end, neural networks use optimization methods, typically categorised as first-order, higher-order and derivative-free. First-order methods such as gradient descent and its variants are especially popular.
The gradient descent method updates the variables iteratively in the direction opposite to the gradients of the objective function. After each update, it guides the model gradually towards the target, converging to the optimal value of the objective function.
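On a toy one-dimensional problem, the update rule can be sketched as follows (a minimal illustration, not code from the paper; the function and learning rate are arbitrary choices):

```python
# Minimal gradient descent sketch: repeatedly step opposite to the gradient,
# x <- x - lr * grad(x). The function and learning rate are illustrative.
def gradient_descent(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)  # move against the gradient
    return x

# f(x) = x^2 has gradient 2x and its minimum at x = 0.
x_min = gradient_descent(lambda x: 2 * x, x0=5.0)
```

Each step shrinks the iterate by a constant factor here, so it converges geometrically towards the minimiser at zero.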
The stochastic gradient method replaces the full gradient with an unbiased estimate computed on a small sample of the data. This reduces the cost of each update when dealing with large numbers of samples and removes a certain amount of computational redundancy. Then there are other variants that claim to do an even better job.
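A hedged sketch of the idea: instead of the full gradient, each step uses the gradient of a random mini-batch, which is an unbiased estimate of the full gradient. The toy task below (estimating a dataset's mean by minimising the average squared loss) and all constants are illustrative assumptions, not details from the paper:

```python
import random

# Mini-batch SGD sketch: each update touches only batch_size samples, yet the
# batch gradient is an unbiased estimate of the full gradient over the data.
def sgd_mean(data, lr=0.05, batch_size=4, steps=2000, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        # gradient of mean((theta - x)^2) over the mini-batch
        grad = sum(2 * (theta - x) for x in batch) / batch_size
        theta -= lr * grad
    return theta

data = [float(i) for i in range(1, 11)]  # true mean is 5.5
theta_hat = sgd_mean(data)
```

Because each batch gradient is noisy, the iterate hovers around the true minimiser rather than landing on it exactly; that noise is the price paid for the cheaper per-step cost.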
Large-scale stochastic optimization drives a wide variety of machine learning tasks. The permutations and combinations of this whole ordeal quickly get out of hand when one stumbles upon scores of benevolent-sounding optimizers. Fatigue by abundance is now a serious challenge among researchers, and choosing the right optimization method can be a nightmare. Not to mention the effective tuning of hyperparameters, which heavily influences both the training speed and the final performance of the learned model. These tasks are time- and resource-intensive.
Choosing an optimizer is one of the most crucial design decisions in deep learning, and it is not an easy one. The above illustration shows the number of times ArXiv titles and abstracts mention specific optimizers per year. The growing literature now lists hundreds of optimization methods. Researchers at the University of Tübingen performed an extensive, standardized benchmark of fifteen popular deep learning optimizers. Their paper is one of the few works that focus on large-scale benchmarking of optimizers. According to the researchers, the objective is to help understand how optimization methods and hyperparameters influence the training performance.
“While some optimizers are frequently decent, they also generally perform similarly, often switching their positions in the ranking.”
Paper by Schmidt et al.
Aiming for generality, the researchers evaluated performance on eight diverse real-world deep learning problems from different disciplines. From a collection of more than a hundred deep learning optimizers, they selected fifteen of the most popular choices for benchmarking. “There are enough optimizers,” said the researchers. The authors also noted that the conclusions of the paper might not generalize to other workloads such as GANs, reinforcement learning, or applications where, for example, memory usage is crucial.
The researchers analysed more than 50,000 individual runs and have open-sourced all the baseline results of their experiments. This seminal work underlines the dangers of chasing state-of-the-art hype and highlights the following:
- There are now enough optimizers.
- Optimizer performance varies greatly across tasks.
- There is no single optimizer that dominates its competitors across all tasks.
- ADAM and ADABOUND consistently perform well.
- The performance distribution across different optimizers is surprisingly similar to that of a single method re-tuned or simply re-run with different random seeds.
- Having accurate, well-tuned baselines for optimizers can drastically reduce the required computational budget.
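The seed-variance finding can be illustrated with a toy experiment (my own sketch, not the paper's benchmark code): two first-order optimizers minimising f(x) = x² from random starting points, with final losses collected over several seeds.

```python
import random

# Toy illustration of the seed-re-run comparison (not the paper's code):
# plain gradient descent versus heavy-ball momentum on f(x) = x^2, each
# re-run from random initialisations across several seeds.
def run(update, seed, lr=0.1, steps=200):
    rng = random.Random(seed)
    x, v = rng.uniform(-5.0, 5.0), 0.0
    for _ in range(steps):
        g = 2.0 * x            # gradient of f(x) = x^2
        x, v = update(x, v, g, lr)
    return x * x               # final loss

def sgd(x, v, g, lr):
    return x - lr * g, v

def momentum(x, v, g, lr, beta=0.9):
    v = beta * v + g           # heavy-ball velocity
    return x - lr * v, v

results = {}
for name, update in [("sgd", sgd), ("momentum", momentum)]:
    losses = sorted(run(update, seed) for seed in range(10))
    results[name] = losses[len(losses) // 2]  # median final loss
```

Both methods reach near-zero loss on this easy problem; the point of the sketch is the harness itself, re-running each optimizer across seeds and comparing distributions rather than single runs, which is the kind of comparison the benchmark argues for.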
Given these results, the researchers question the rationale behind the development of new methods when more fundamental problems remain. They hope their experiments will nudge the ML community to “move beyond inventing yet another optimizer and to focus on key challenges, such as automatic, inner-loop tuning for truly robust and efficient optimization.” The researchers also admitted that the creators of new optimizers cannot be expected to compare their work with every possible previous method. The baselines of all the experiments have been open-sourced: the ML community can access the data set of 53,760 unique runs, each consisting of thousands of individual data points such as the mini-batch training losses of every iteration or epoch-wise performance measures. These can serve as competitive and well-tuned baselines for future benchmarks of new optimizers.
Know more here.