For years, the bias-variance tradeoff has served machine learning well as a lens for model selection: one picks a good model by balancing bias against variance as a function of model complexity. The training-testing paradigm of building ML models is well known. As the names suggest, the training dataset is used to fit the model, whereas the testing dataset is used to evaluate the model’s prediction capabilities. The smaller the prediction error, the better. However, in the pursuit of reducing these errors, one might end up overfitting, i.e., the model gets so good at fitting the training dataset that it falters on anything even remotely different from it. The error in this context is a combination of noise, bias and variance. While noise is usually uncontrollable, bias and variance can be traded against each other: increasing bias lowers the variance, and vice versa. This is why researchers tussle to find the sweet spot known as the bias-variance tradeoff.
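This classical picture is easy to reproduce in a few lines. The sketch below is our own toy illustration (the sine target, noise level and degree grid are assumed, not taken from the article): it fits polynomials of growing degree to noisy data and records training and test error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (our own choices): noisy samples of a smooth target.
def target(x):
    return np.sin(2 * np.pi * x)

n_train, n_test, noise = 30, 200, 0.3
x_train = rng.uniform(0, 1, n_train)
y_train = target(x_train) + noise * rng.standard_normal(n_train)
x_test = rng.uniform(0, 1, n_test)
y_test = target(x_test) + noise * rng.standard_normal(n_test)

def mse(degree):
    # Least-squares polynomial fit; the degree controls model complexity.
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_err, test_err

errors = {d: mse(d) for d in (1, 3, 9, 15)}
# Training error falls monotonically with degree, while test error
# typically traces the U-shape: high bias at degree 1, a sweet spot
# near degree 3, and high variance (overfitting) at degree 15.
```

Training error can only decrease as the model class grows, so the interesting quantity is the test error, which first falls (bias shrinking) and then rises (variance growing).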
However, ML researchers are slowly starting to question the usefulness of such methods. The bias-variance tradeoff can be imprecise: it is unclear for which class of models the tradeoff actually holds, and the notion of complexity is left for the user to choose. According to the researchers, one usually ends up with the following questions:
a) When and why is there a benefit to adding more parameters? and,
b) Why does interpolating noise not cause harmful overfitting?
According to classical learning theory, generalisation first improves and then deteriorates with increasing model complexity, following the U-shaped curve characteristic of the bias-variance tradeoff: the test error, as a function of the number of parameters, is U-shaped in the underparameterized regime. In deep neural networks and other modern machine learning models, however, a double descent curve is observed instead of this typical U-shaped curve. According to Ascoli et al., deep neural networks achieve better generalisation performance while interpolating the training data. “Rather than the U-curve emblematic of the bias-variance tradeoff, their test error often follows a ‘double descent’ — a mark of the beneficial role of overparameterization,” stated the authors.
The double descent curve demonstrates that increasing model capacity past the interpolation threshold can decrease test error; increasing neural network capacity through width, for instance, produces double descent. The discovery of double descent showed that highly overparameterized models often improve over the best underparameterized model in test performance. As illustrated above, the first descent occurs in the underparameterized regime and is a consequence of the classical bias-variance tradeoff.
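Double descent also appears in a minimal setting. The sketch below is our own illustration, not the papers’ experiments (the random ReLU feature model, dimensions and noise level are all assumed): it fits minimum-norm least-squares solutions on a growing number of random features and records test error on either side of the interpolation threshold, which here sits at 40 features for 40 training points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (our own construction): minimum-norm least squares on
# random ReLU features, a standard toy model in which double descent
# appears as the number of features grows past the number of samples.
def experiment(n_features, n_train=40, n_test=500, d=10, noise=0.2, trials=20):
    errs = []
    for _ in range(trials):
        w = rng.standard_normal(d)                 # ground-truth linear signal
        X_tr = rng.standard_normal((n_train, d))
        X_te = rng.standard_normal((n_test, d))
        y_tr = X_tr @ w + noise * rng.standard_normal(n_train)
        y_te = X_te @ w
        V = rng.standard_normal((d, n_features))   # random feature directions
        F_tr = np.maximum(X_tr @ V, 0.0)           # ReLU features
        F_te = np.maximum(X_te @ V, 0.0)
        beta = np.linalg.pinv(F_tr) @ y_tr         # minimum-norm solution
        errs.append(np.mean((F_te @ beta - y_te) ** 2))
    return float(np.mean(errs))

widths = [5, 20, 40, 80, 400]                      # threshold at n_train = 40
curve = {p: experiment(p) for p in widths}
# Test error typically rises toward a sharp peak near 40 features (the
# interpolation threshold) and falls again in the overparameterized
# regime: the second descent.
```

Below the threshold, `pinv` returns the ordinary least-squares fit; above it, the minimum-norm interpolator. The peak comes from the near-singular feature matrix at the threshold, which is exactly where the curve departs from the classical U-shape.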
“The reasons behind the performance of deep neural networks in the overparameterized regime are still poorly understood.” – Ascoli et al.
As illustrated above, the double descent curve displays two regimes. At the interpolation threshold, the training error vanishes and, in the absence of regularisation, the test error curve peaks. Despite such striking results, the real reason behind the performance of deep neural networks in the overparameterized regime remains elusive. According to the researchers, the implicit regularisation of stochastic gradient descent and the convergence to mean-field limits are among the proposed explanations for this performance.
The double descent phenomenon demonstrated that the classical bias-variance tradeoff, a cornerstone of conventional ML, is predictive only in the underparameterized regime, where the learned model is not sufficiently complex to interpolate the training data.
In their investigation of the double descent regime, Ascoli et al. observed that:
- The double descent curve originates from the behaviour of the noise and initialisation variances.
- The peak at the interpolation threshold is entirely due to the noise and initialisation variances.
- The sampling variance and the bias do not vary substantially in the overparameterized regime.
- The benefit of overparameterization stems solely from reducing the noise and initialisation variances.
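These two variance sources can be estimated empirically. The sketch below is a crude, law-of-total-variance-style split of our own devising, not the authors’ exact decomposition: holding the training inputs fixed, it resamples the label noise and the random-feature initialisation separately and measures how much each perturbs the prediction at one test point, in a random-feature model sized just above the interpolation threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy decomposition (our own construction): for fixed training inputs,
# estimate how much the prediction at a test point varies when we resample
# (a) the label noise and (b) the random-feature initialisation.
n, d, p, noise = 40, 10, 45, 0.2        # p just above n: near the threshold
w = rng.standard_normal(d)              # assumed ground-truth linear signal
X = rng.standard_normal((n, d))
x_test = rng.standard_normal(d)

def predict(V, eps):
    F = np.maximum(X @ V, 0.0)                    # ReLU features
    beta = np.linalg.pinv(F) @ (X @ w + eps)      # min-norm fit to noisy labels
    return float(np.maximum(x_test @ V, 0.0) @ beta)

inits = [rng.standard_normal((d, p)) for _ in range(30)]
noises = [noise * rng.standard_normal(n) for _ in range(30)]
preds = np.array([[predict(V, eps) for eps in noises] for V in inits])

# Variance over noise draws, averaged over initialisations...
noise_var = float(np.mean(np.var(preds, axis=1)))
# ...and variance over initialisations of the noise-averaged prediction.
init_var = float(np.var(np.mean(preds, axis=1)))
```

Repeating the same estimate at a much larger `p` would show both contributions shrinking, which is the authors’ point about where the benefit of overparameterization comes from.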
Explaining why we need to be wary of applying classical statistical methods to modern ML regimes, the researchers from Rice University, in their work titled “Farewell to Bias-Variance Tradeoff”, noted that work on interpolating solutions in the supervised setting focuses on random design settings, where both the inputs and the outputs are randomly drawn from an unknown distribution. This random design is what enables interpolating solutions to provide remarkable performance. Classical statistics, however, typically employs a fixed design setting, in which the test inputs are the same as the training inputs but are matched to different random outputs. This is why, in fixed design settings, interpolating the input-output training examples is likely to “seriously fail” when applied to test inputs, wrote the authors. The authors believe that understanding the improved performance of ML in the overparameterized regime requires new thinking, theories, and foundational empirical studies, even for the simplest case of the linear model.
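The fixed-versus-random design distinction can be made concrete. Everything in the sketch below (the ReLU feature map, dimensions and noise level) is our own illustrative choice, not the Rice paper’s setup: it builds one minimum-norm interpolator and evaluates it on fresh inputs (random design) and on the original inputs with freshly drawn noisy outputs (fixed design).

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy contrast (our own construction): a minimum-norm interpolator on
# random ReLU features, evaluated under random design (fresh inputs) and
# fixed design (the same inputs, matched to new random outputs).
n, d, p, noise = 100, 10, 1000, 0.5
w = rng.standard_normal(d)                   # assumed linear target
X = rng.standard_normal((n, d))
y = X @ w + noise * rng.standard_normal(n)   # noisy training labels

V = rng.standard_normal((d, p))
feats = lambda A: np.maximum(A @ V, 0.0)     # random ReLU feature map
beta = np.linalg.pinv(feats(X)) @ y          # interpolates the training set

# Random design: fresh inputs drawn from the same distribution.
X_new = rng.standard_normal((2000, d))
random_design_err = np.mean((feats(X_new) @ beta - X_new @ w) ** 2)

# Fixed design: the very same inputs, matched to new noisy outputs. The
# interpolator reproduces the old noisy labels, so its error cannot fall
# below roughly E[(eps - eps')^2] = 2 * noise**2, however many features
# we add -- the "seriously fail" scenario.
y_new = X @ w + noise * rng.standard_normal(n)
fixed_design_err = np.mean((feats(X) @ beta - y_new) ** 2)
```

In the fixed design evaluation the error keeps a floor of about twice the noise variance no matter how wide the model is, whereas the random design error is not subject to that floor, which is why conclusions drawn in one setting need not transfer to the other.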