Neural networks are often trained to fit their training data exactly. Such models would usually be considered overfit, and yet they manage to achieve high accuracy on test data. It is counter-intuitive, but it works. This has raised many eyebrows, especially regarding the mathematical foundations of machine learning and their relevance to practitioners.
To address these contradictions, researchers at OpenAI, in their recent work, take a closer look at the widely held belief that bigger is better.
The paper attempts to reconcile classical understanding and modern practice within a single, unified performance curve.
The “double descent” curve subsumes the classic U-shaped bias-variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance.
Neither the classical statisticians’ conventional wisdom that overly large models are worse, nor the modern ML paradigm that bigger models are better, holds up.
This in itself highlights the gap between how models actually perform and what classical analyses can explain.
So far, many models have been viewed through the lens of the classical U-shaped bias-variance trade-off curve; now a new phenomenon, called double descent, has been discovered. Mikhail Belkin and his peers, who pioneered this work, challenged the very popular claim that “bigger models are always better.”
Why Deep Double Descent
The bias-variance trade-off describes whether a model is under-fitting or over-fitting. It gives insight into the underlying structure of the data, so that a practitioner can adjust their models to avoid fitting spurious patterns.
Standard statistical machine learning theory predicts that bigger models should be more prone to overfitting. Contrary to this common notion, Belkin et al., in their seminal paper, found that the standard bias-variance trade-off actually breaks down once model capacity reaches the “interpolation threshold”.
In other words, before the interpolation threshold the bias-variance trade-off holds: increased model complexity leads to overfitting and a higher test error.
After the interpolation threshold, however, the researchers found that test error actually starts to go down as one keeps increasing model complexity!
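The shape of this curve can be reproduced in a toy setting. What follows is a minimal sketch, not the experiments from either paper: it assumes a linear regression problem with Gaussian features, fits it by minimum-norm least squares using only the first p features, and records the test error as p grows past the number of training samples. The function name `double_descent_curve`, the decaying coefficient vector, and all sizes and noise levels are hypothetical choices made for illustration.

```python
import numpy as np

def double_descent_curve(n_train=40, n_total_features=200, noise_std=0.5,
                         feature_counts=(1, 5, 10, 20, 30, 40, 50, 80, 120, 200),
                         n_trials=20, n_test=2000, seed=0):
    """Average test MSE of least squares fitted on the first p features."""
    rng = np.random.default_rng(seed)
    # Assumed ground truth: coefficients decay as 1/j, so early features
    # carry most of the signal (an illustrative choice, not from the papers).
    beta = 1.0 / np.arange(1, n_total_features + 1)
    beta /= np.linalg.norm(beta)
    errors = np.zeros(len(feature_counts))
    for _ in range(n_trials):
        X_train = rng.standard_normal((n_train, n_total_features))
        y_train = X_train @ beta + noise_std * rng.standard_normal(n_train)
        X_test = rng.standard_normal((n_test, n_total_features))
        y_test = X_test @ beta + noise_std * rng.standard_normal(n_test)
        for i, p in enumerate(feature_counts):
            # pinv gives the ordinary least-squares fit when p < n_train and
            # the minimum-norm interpolating fit when p >= n_train.
            w = np.linalg.pinv(X_train[:, :p]) @ y_train
            errors[i] += np.mean((X_test[:, :p] @ w - y_test) ** 2)
    return dict(zip(feature_counts, errors / n_trials))

curve = double_descent_curve()
for p, err in curve.items():
    print(f"features={p:4d}  test MSE={err:.3f}")
```

In this setup the interpolation threshold sits at p = 40, the number of training samples. The printed test error should fall at first, spike around p = 40, and come back down as more features are added. Whether the second descent ends up below the best "classical" model depends on the signal and noise chosen, and this toy makes no claim either way.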
Does Over-parametrization Always Hurt?
For test error to increase with sample size, the researchers explain in the open review of the paper, at least one of the following must occur:
- (A): Training error increases with sample size.
- (B): Generalization gap (= Test Error - Train Error) increases with sample size.
All of these settings can occur in practice:
There exist cases where (A) is true and (B) is false, where (A) is false and (B) is true, and where both (A) and (B) are true.
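The reason at least one of (A) or (B) must hold follows directly from the definition of the generalization gap. As a short worked step, with n denoting the sample size and Train, Test and Gap used here as shorthand symbols (they are not notation from the paper):

```latex
\[
\mathrm{Test}(n) = \mathrm{Train}(n) + \mathrm{Gap}(n),
\qquad
\mathrm{Gap}(n) := \mathrm{Test}(n) - \mathrm{Train}(n).
\]
For two sample sizes $n_1 < n_2$:
\[
\mathrm{Test}(n_2) - \mathrm{Test}(n_1)
= \bigl[\mathrm{Train}(n_2) - \mathrm{Train}(n_1)\bigr]
+ \bigl[\mathrm{Gap}(n_2) - \mathrm{Gap}(n_1)\bigr].
\]
```

If the left-hand side is positive, the two brackets on the right cannot both be non-positive, so the training error, the gap, or both must have increased.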
At the critical number of samples, the models must “try very hard” to fit the train set, which can destroy their global structure. With fewer samples, by contrast, the models are overparameterized enough to fit the train set while still behaving well on the distribution.
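The same effect can be seen from the sample side in a toy linear model. The sketch below is an illustration under assumed settings, not the authors' experiment: the number of parameters is held fixed at 40 while the training set grows, and each model is fitted by minimum-norm least squares. The function name `error_vs_sample_size` and all constants are hypothetical.

```python
import numpy as np

def error_vs_sample_size(n_params=40, noise_std=0.5,
                         sample_sizes=(10, 20, 30, 40, 50, 80, 160),
                         n_trials=20, n_test=2000, seed=0):
    """Average test MSE of a fixed-size linear model as training data grows."""
    rng = np.random.default_rng(seed)
    # Assumed ground truth: equal weight on every feature, unit norm.
    beta = np.ones(n_params) / np.sqrt(n_params)
    errors = np.zeros(len(sample_sizes))
    for _ in range(n_trials):
        X_test = rng.standard_normal((n_test, n_params))
        y_test = X_test @ beta + noise_std * rng.standard_normal(n_test)
        for i, n in enumerate(sample_sizes):
            X = rng.standard_normal((n, n_params))
            y = X @ beta + noise_std * rng.standard_normal(n)
            # Minimum-norm fit when n < n_params, least squares otherwise.
            w = np.linalg.pinv(X) @ y
            errors[i] += np.mean((X_test @ w - y_test) ** 2)
    return dict(zip(sample_sizes, errors / n_trials))

by_n = error_vs_sample_size()
for n, err in by_n.items():
    print(f"samples={n:4d}  test MSE={err:.3f}")
```

Here the critical sample size is n = 40, where the model has just enough parameters to fit the noisy labels exactly. The test error there should come out higher than with 20 samples, which is a case of more data hurting, before it drops again once the samples clearly outnumber the parameters.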
The double descent behaviour has so far not been explored owing to several cultural and practical barriers. The authors say that observing the double descent curve requires a parametric family of spaces with functions of arbitrary complexity.
That said, over-parameterization does have practical advantages.
“According to experiments, modern models usually outperform the optimal ‘classical’ model on the test set,” admitted the authors in their paper.
There is a growing understanding that larger models are “easy” to optimise, as methods like stochastic gradient descent (SGD) converge to global minima of the training risk in over-parameterized regimes.
Thus, large interpolating models can have low test risk and are easy to optimise, especially with methods like SGD. The models to the left of the interpolation peak, wrote the authors, are likely to have optimisation properties qualitatively different from those to the right.
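The optimisation claim can be checked in the simplest over-parameterized case, a linear model with more parameters than samples. The sketch below is an illustration under assumptions, not the authors' setup: it uses plain full-batch gradient descent rather than SGD, starts from zero weights, and fits arbitrary random labels. The function name `gradient_descent_least_squares` and the problem sizes are hypothetical.

```python
import numpy as np

def gradient_descent_least_squares(X, y, n_steps=2000):
    """Full-batch gradient descent on the mean squared training error."""
    n_samples, n_params = X.shape
    # Starting at zero keeps every iterate in the row space of X.
    w = np.zeros(n_params)
    # Step size 1/L, where L is the largest eigenvalue of X^T X / n_samples.
    step = n_samples / np.linalg.norm(X, 2) ** 2
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y) / n_samples
        w -= step * grad
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))   # 20 samples, 100 parameters
y = rng.standard_normal(20)          # arbitrary labels
w_gd = gradient_descent_least_squares(X, y)

train_mse = np.mean((X @ w_gd - y) ** 2)
w_min_norm = np.linalg.pinv(X) @ y   # minimum-norm interpolating solution
print("training MSE:", train_mse)
print("max gap to min-norm solution:", np.max(np.abs(w_gd - w_min_norm)))
```

With 100 parameters and only 20 samples, the training loss has many global minima with zero error, and gradient descent reaches one of them. From a zero start it arrives at the minimum-norm interpolating solution, the same fit that a pseudoinverse produces.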
With this work, the authors have provided evidence for the existence of the double descent phenomenon across a wide spectrum of models and datasets.
They also speculate that their work would help in the understanding of model performance as it underlines the limits of conventional methodologies while exposing new avenues to study and compare computational, statistical, and mathematical properties of the classical and modern regimes in machine learning.