In yet another effort to shed the image of black box models for machine learning, a group of researchers probed something fundamental this time – what happens at the initial stages of training and how effective is pre-training?
Historically, most work has focused on what happens during the later stages of training, while the initial phase has been explored less.
To explore the initial phase of training, researchers from Facebook AI and MIT CSAIL collaborated on a unified framework for studying it. To do this, they employed iterative magnitude pruning (IMP) with rewinding.
The authors built on previous results showing that rewinding the weights to their values from early in the training of the unpruned model, rather than to their initial values, leads to better performance on deeper networks such as ResNets.
In other words, this suggests that the changes in the network during this initial phase are vital for successfully training small networks. If these claims hold up, the implications are many: smaller models, shorter training times, machine learning that fits better on edge devices, and more explainable predictions.
This approach provides a simple scheme for measuring the importance of the weights at an early stage of training within an actionable and causal framework.
Overview Of The Approach
For experiments, the authors considered ResNet-20 and tracked the changes during the earliest phase of training by specifically focusing on the first 4,000 iterations (10 epochs).
The procedure involves pruning 20% of weights and rewinding the remaining weights to their values from an earlier iteration during the pre-pruning training run. This process is then iterated.
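The prune-and-rewind loop described above can be sketched in a few lines. This is a minimal toy illustration, not the authors' code: `train` is a hypothetical stand-in for SGD that simply pulls each weight toward a made-up target, and the names `prune_lowest` and `imp_with_rewinding` are assumptions introduced here for illustration.

```python
import random

def train(weights, mask, start_iter, end_iter, lr=0.1):
    """Hypothetical stand-in for SGD: pull each unmasked weight toward a fixed target."""
    targets = [((i * 37) % 11 - 5) / 5.0 for i in range(len(weights))]
    w = list(weights)
    for _ in range(start_iter, end_iter):
        # Pruned weights (mask == 0) are held at zero throughout training.
        w = [m * (wi - lr * (wi - t)) for wi, t, m in zip(w, targets, mask)]
    return w

def prune_lowest(weights, mask, percent=20):
    """Remove the smallest-magnitude `percent` of the weights that are still alive."""
    alive = [i for i, m in enumerate(mask) if m == 1]
    n_prune = len(alive) * percent // 100
    alive.sort(key=lambda i: abs(weights[i]))
    new_mask = list(mask)
    for i in alive[:n_prune]:
        new_mask[i] = 0
    return new_mask

def imp_with_rewinding(init_weights, rewind_iter, total_iters, rounds, percent=20):
    mask = [1] * len(init_weights)
    # Snapshot the unpruned network's weights at the rewind iteration.
    rewind_weights = train(init_weights, mask, 0, rewind_iter)
    for _ in range(rounds):
        # Rewind the surviving weights, train to completion, then prune again.
        start = [w * m for w, m in zip(rewind_weights, mask)]
        final = train(start, mask, rewind_iter, total_iters)
        mask = prune_lowest(final, mask, percent)
    return mask, [w * m for w, m in zip(rewind_weights, mask)]

rng = random.Random(0)
init = [rng.uniform(-1.0, 1.0) for _ in range(100)]
mask, rewound = imp_with_rewinding(init, rewind_iter=5, total_iters=50, rounds=3)
print(sum(mask), "of", len(mask), "weights survive")
```

In this toy setup, three rounds at 20% leave 52 of the 100 weights (100, then 80, then 64, then 52), and each survivor is reset to its value at the rewind iteration rather than to its initial value.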
During the first 4,000 iterations of training, the authors observed three sub-phases:
- In the first phase, which lasts only a few iterations, gradient magnitudes are enormous and the network changes rapidly.
- In the next 500 iterations, performance quickly improves and weight magnitudes increase, while sign differences from initialization quickly increase and gradient magnitudes reach a minimum before settling.
- Finally, all these quantities continue to change in the same direction but begin to decelerate.
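The quantities named in these sub-phases (weight magnitude, gradient magnitude, and sign differences from initialization) are simple to compute at each iteration. The sketch below is an illustration under assumptions: it treats the network as a flat list of weights, uses the L2 norm as the measure of magnitude, and the function names are hypothetical rather than taken from the paper.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))

def sign_flip_fraction(init_weights, weights):
    """Fraction of weights whose sign differs from initialization.

    Zero is treated as non-positive here, which is a simplifying assumption.
    """
    flips = sum(1 for a, b in zip(init_weights, weights) if (a > 0) != (b > 0))
    return flips / len(weights)

def early_phase_stats(init_weights, weights, grads):
    """Collect the three statistics for one training iteration."""
    return {
        "weight_norm": l2_norm(weights),
        "grad_norm": l2_norm(grads),
        "sign_flips": sign_flip_fraction(init_weights, weights),
    }

stats = early_phase_stats(
    init_weights=[1.0, -1.0],
    weights=[3.0, 4.0],
    grads=[0.6, 0.8],
)
print(stats)
```

Logging this dictionary at every iteration and plotting each entry against the iteration count would give curves of the kind the three sub-phases describe.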
It is unclear, however, to what extent various aspects of the data distribution are necessary; notably, whether the change in weights during the early phase of training depends on p(x) or p(y|x).
To investigate, the authors pre-trained the network in three ways: with techniques that ignore labels entirely (self-supervised), with random labels, and with blurred training examples. These experiments were done on the CIFAR-10 dataset.
The results show that pre-training on random labels provides no improvement over rewinding and that pre-training for longer begins to hurt accuracy. Blurring the examples, meanwhile, makes the IMP approach underperform regardless of the pre-training provided.
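Two of the data perturbations discussed above are easy to picture in code. Random labels leave the inputs untouched, so p(x) is preserved while the link between x and y is destroyed; blurring instead changes the inputs themselves. The sketch below is illustrative only: the article does not say how the images were blurred, so the box blur here is an assumption, and both function names are hypothetical.

```python
import random

def randomize_labels(labels, num_classes, seed=0):
    """Replace every label with a uniformly random class, ignoring the input."""
    rng = random.Random(seed)
    return [rng.randrange(num_classes) for _ in labels]

def box_blur(image, radius=1):
    """Blur a 2D grayscale image (list of rows) by averaging each pixel's neighborhood.

    A box blur is an assumed choice; windows are clipped at the image border.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            rows = range(max(0, r - radius), min(h, r + radius + 1))
            cols = range(max(0, c - radius), min(w, c + radius + 1))
            vals = [image[i][j] for i in rows for j in cols]
            out[r][c] = sum(vals) / len(vals)
    return out

labels = [0, 1, 2, 3] * 5
noisy_labels = randomize_labels(labels, num_classes=10)

image = [[0.0, 0.0, 0.0],
         [0.0, 9.0, 0.0],
         [0.0, 0.0, 0.0]]
blurred = box_blur(image)
print(noisy_labels)
print(blurred)
```

For CIFAR-10, `num_classes` would be 10 and the blur would be applied per color channel.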
The contributions of this work can be summarised as follows:
- An in-depth summary of learning over the early part of training, with an overview of various statistics
- Evidence that deeper networks are not robust to reinitialization with random weights
- Evidence that the distribution of weights after the early phase of training is already highly non-i.i.d. (independent and identically distributed)
- A measurement of how dependent the early phase of training is on data
The authors conclude that weights are highly non-independent at the rewinding point. They claim that the weights at this point cannot be easily approximated and any kind of approach that aims at skipping directly to the rewinding point is unlikely to succeed. However, rewinding may not be necessary if networks are pre-trained appropriately.