Ensembling is one of the most popular and highly preferred tricks for data scientists working in deep learning and in ML competitions such as Kaggle.
Ensemble, or model averaging, is a technique used to improve the performance of deep learning models. All you have to do is average the outputs of a few neural networks trained independently on the same training data set. It is common knowledge that this strategy leads to a significant boost in prediction accuracy on the test set compared to each individual model. The gains hold even when all the architectures are the same, and even when the models are trained with the same training algorithm on the same training data set.
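The averaging step itself is simple. Below is a minimal sketch, assuming each trained model is a callable returning class probabilities of shape `(n_samples, n_classes)`; the stand-in models and the `ensemble_predict` helper are hypothetical names for illustration:

```python
import numpy as np

def ensemble_predict(models, x):
    """Unweighted model averaging: mean of each model's output probabilities.

    `models` is assumed to be a list of callables mapping an input batch
    to class probabilities of shape (n_samples, n_classes).
    """
    probs = np.mean([m(x) for m in models], axis=0)  # average the outputs
    return probs.argmax(axis=1)                      # predicted class per sample

# Toy stand-ins for independently trained networks on a 3-class task.
m1 = lambda x: np.array([[0.6, 0.3, 0.1]] * len(x))
m2 = lambda x: np.array([[0.2, 0.5, 0.3]] * len(x))
m3 = lambda x: np.array([[0.3, 0.5, 0.2]] * len(x))

x = np.zeros((4, 8))  # dummy input batch of 4 samples
print(ensemble_predict([m1, m2, m3], x))  # class with highest averaged probability
```

Note that two of the three toy models disagree with the first, and the averaged prediction follows the majority signal; with real networks the averaged probabilities smooth out each model's individual mistakes.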
“Ensemble gives a performance boost to test accuracies in deep learning applications, but such accuracy gains cannot be matched by training the average of the models directly.” — Allen-Zhu and Li
Ensemble boosts accuracy. But how does it really work? Why does an unweighted average of the outputs of individual models outperform each model on its own? To address these questions, researchers from Microsoft and Carnegie Mellon University recently published a work titled “Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning.”
Demystifying Ensemble Success
Along with ensemble, the authors also consider knowledge distillation and self-distillation. Self-distillation, the authors explain, refers to training a single model to match the output of another single model of the same architecture. In the paper, they also show that self-distillation can improve test accuracy under their multi-view setting.
Self-distillation implicitly performs ensemble plus knowledge distillation. The success of ensemble, knowledge distillation and self-distillation is not yet fully understood, and the authors refer to these open questions as mysteries.
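In both knowledge distillation and self-distillation, the student is trained to match the teacher's softened output distribution. A minimal sketch of that matching loss, using cross-entropy against temperature-softened teacher outputs (the logits, the temperature `T=4.0`, and the function names are illustrative assumptions, not from the paper):

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T gives softer distributions."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between softened teacher and student outputs.

    In knowledge distillation the teacher is an ensemble or a larger model;
    in self-distillation it is a previously trained copy of the same model.
    """
    p_teacher = softmax(teacher_logits, T)              # soft targets
    log_p_student = np.log(softmax(student_logits, T))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

teacher = np.array([[2.0, 0.5, -1.0]])  # hypothetical teacher logits
close   = np.array([[1.8, 0.6, -0.9]])  # student near the teacher: low loss
far     = np.array([[-1.0, 0.0, 2.0]])  # mismatched student: higher loss
print(distillation_loss(close, teacher) < distillation_loss(far, teacher))  # True
```

The soft targets carry more information than hard labels (relative probabilities of the wrong classes), which is what lets a single student absorb some of the teacher ensemble's behaviour.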
“Unlike in the deep learning case, the superior performance of ensemble in the random feature setting cannot be distilled to an individual model.”
Three mysteries of deep learning as summarised by the researchers:
- Why does ensembling give a sudden performance boost? And if one instead directly trains the average of the individual functions, why does the boost disappear?
- Does ensembling models that have already undergone knowledge distillation further improve test accuracy?
- Why does training the same model again, using itself as the teacher, suddenly boost the test accuracy?
Most existing theories on ensemble, wrote the authors, only apply to the case where individual models are fundamentally different or trained over different datasets. Since the success of ensemble appears restricted to structured inputs, the authors explored a common structure found in many datasets used for deep learning. In vision datasets in particular, an object can usually be classified using multiple views. A ResNet model can be trained to pick up multiple views/features of an object — for example, the front, side and rear of a car.
On such multi-view data, a single network tends to quickly learn a subset of the view features and then memorise the few examples that cannot be classified correctly using them. An ensemble of independently trained networks, however, collects all of these learnable view features, which accounts for the accuracy gain. The researchers also demonstrated the advantages of the multi-view structure for knowledge distillation and self-distillation.
- This work provides the first theoretical proof toward understanding how ensemble works in deep learning.
- The multi-view framework introduced here can be applied to settings where data augmentation is required.
- This work provides new theoretical insights on how neural networks pick up features during training.
- This, in turn, can help design new, principled approaches to improve the test accuracy of a neural network, potentially matching that of ensemble.
Find the original paper here