“Transfer Learning will be the next driver of Machine Learning Success”
Andrew Ng
Recently, researchers from Google addressed a fundamental question in the machine learning community: what is being transferred in transfer learning? They presented a set of tools and analyses to answer it.
The ability to transfer knowledge from the domain a model was trained on to another domain where data is scarce is one of the most desired capabilities for machines. Researchers around the globe have been using transfer learning in various deep learning applications, including object detection, image classification and medical imaging tasks.
Despite these successes, several researchers have found cases where there is a nontrivial difference in visual form between the source and the target domain, which makes it difficult to understand what enables a successful transfer and which parts of the network are responsible for it.
The Methodology
In order to investigate transfer learning, the researchers analysed networks in four different cases: the pre-trained network, the network at random initialisation, the network fine-tuned on the target domain after pre-training on the source domain, and the model trained on the target domain from random initialisation.
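As a rough illustration, the four configurations could be set up as in the following sketch. The ResNet-50 backbone, the ImageNet source weights and the ten-class target head are illustrative assumptions, not the paper's exact setup.

```python
import torch.nn as nn
from torchvision import models

NUM_TARGET_CLASSES = 10  # assumption: placeholder for the number of target-domain labels

def make_model(pretrained: bool) -> nn.Module:
    # ResNet-50 backbone; ImageNet weights stand in for source-domain pre-training.
    # (Older torchvision versions use `pretrained=True` instead of the `weights=` argument.)
    weights = models.ResNet50_Weights.IMAGENET1K_V1 if pretrained else None
    backbone = models.resnet50(weights=weights)
    # Swap the classification head for the target domain.
    backbone.fc = nn.Linear(backbone.fc.in_features, NUM_TARGET_CLASSES)
    return backbone

pretrained_net  = make_model(pretrained=True)   # 1. the pre-trained network
random_init_net = make_model(pretrained=False)  # 2. the network at random initialisation
finetuned_net   = make_model(pretrained=True)   # 3. to be fine-tuned on the target domain
scratch_net     = make_model(pretrained=False)  # 4. to be trained on the target domain from scratch
```

Cases 3 and 4 would then go through the same training loop on the target-domain data, which is omitted here.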
They also used a series of analyses to understand what is being transferred between the models:
- Firstly, they investigated feature reuse by shuffling blocks of pixels in the input images, which disrupts their visual features. This analysis confirmed the importance of feature reuse, but also showed that low-level statistics of the data, which are not disturbed by the shuffling, play a role in successful transfer (a sketch of this block-shuffling probe follows the list).
- Next, they compared the detailed behaviours of the trained models by investigating the agreements and disagreements between models trained from pre-trained weights versus from scratch. This experiment showed that two instances of a model trained from pre-trained weights are more similar in feature space than two instances trained from random initialisation (a simple agreement check is sketched after the list).
- The researchers then investigated the loss landscape of models trained from pre-trained and randomly initialised weights. They observed that there is no performance barrier between two instances of a model trained from pre-trained weights, which suggests that the pre-trained weights guide the optimisation to a flat basin of the loss landscape (see the interpolation sketch below).
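The block-shuffling probe referenced in the first point can be illustrated as follows. This is a minimal sketch, assuming square images whose sides are divisible by the block size; it is not the paper's exact preprocessing code.

```python
import numpy as np

def shuffle_blocks(image: np.ndarray, block_size: int, rng=None) -> np.ndarray:
    """Partition an HxWxC image into block_size x block_size patches and shuffle them.
    Small blocks destroy the visual features while low-level statistics, such as the
    marginal distribution of pixel values, stay intact."""
    rng = rng or np.random.default_rng(0)
    h, w, c = image.shape
    assert h % block_size == 0 and w % block_size == 0, "sketch assumes divisible sizes"
    gh, gw = h // block_size, w // block_size
    # Split the image into a stack of (gh * gw) patches.
    patches = (image.reshape(gh, block_size, gw, block_size, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(gh * gw, block_size, block_size, c))
    patches = patches[rng.permutation(len(patches))]
    # Reassemble the shuffled patches into an image of the original shape.
    return (patches.reshape(gh, gw, block_size, block_size, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(h, w, c))

# block_size=1 shuffles individual pixels; larger blocks preserve more visual structure.
shuffled = shuffle_blocks(np.random.rand(224, 224, 3), block_size=32)
```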
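For the behaviour comparison in the second point, one simple proxy is the fraction of test examples on which two trained models predict the same label. This is a minimal sketch, assuming two PyTorch classifiers and a standard DataLoader over the target-domain test set; the paper also compares the models' internal features, which is not shown here.

```python
import torch

@torch.no_grad()
def prediction_agreement(model_a, model_b, loader, device="cpu"):
    """Fraction of examples on which two classifiers predict the same class."""
    model_a.eval()
    model_b.eval()
    agree, total = 0, 0
    for images, _ in loader:
        images = images.to(device)
        preds_a = model_a(images).argmax(dim=1)
        preds_b = model_b(images).argmax(dim=1)
        agree += (preds_a == preds_b).sum().item()
        total += images.size(0)
    return agree / total
```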
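The loss-landscape probe in the last point amounts to linearly interpolating between the weights of two trained models and evaluating the loss along the path; the absence of a bump (a performance barrier) suggests the two solutions sit in the same basin. A minimal sketch, assuming two models with identical architectures and a user-supplied `evaluate_loss` helper (a hypothetical function, not part of the paper's released code):

```python
import copy
import torch

@torch.no_grad()
def interpolation_losses(model_a, model_b, evaluate_loss, steps: int = 11):
    """Evaluate the loss at evenly spaced points on the straight line between the
    parameters of model_a and model_b in weight space."""
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        mixed = {
            k: (1 - alpha) * v + alpha * state_b[k] if v.is_floating_point() else v
            for k, v in state_a.items()
        }
        probe.load_state_dict(mixed)
        losses.append(evaluate_loss(probe))  # hypothetical helper: loss on the target data
    return losses
```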
Dataset Used
The researchers used the CheXpert dataset, a medical imaging dataset of chest X-rays labelled for different diseases. Besides this, they also used the DomainNet dataset, which is specifically designed to probe transfer learning across diverse domains, ranging from real images to sketches, clipart and painting samples.
Contributions Of This Research
The researchers made several contributions through this work. They are mentioned below:
- For a successful transfer, both feature-reuse and low-level statistics of the data are important.
- Models trained from pre-trained weights make similar mistakes on the target domain. They also have similar features and are surprisingly close to each other in parameter space, usually sitting in the same basin of the loss landscape.
- Models trained from random initialisation do not live in the same basin. They usually make different mistakes, have different features and are farther apart in parameter space.
- Modules in the lower layers are responsible for general features, while modules in the higher layers are more sensitive to perturbation of their parameters (a sketch of such a module-wise check follows this list).
- One can start fine-tuning from earlier checkpoints of the pre-trained model without losing accuracy in the fine-tuned model. The point from which this becomes possible depends on when the pre-trained model enters its final basin.
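As referenced above, sensitivity to parameter perturbation can be probed module by module. The following is a simplified sketch of such a check, assuming a trained PyTorch model and a user-supplied `evaluate_accuracy` helper (a hypothetical function); the paper's own analysis of module criticality is more involved.

```python
import copy
import torch

@torch.no_grad()
def module_sensitivity(model, evaluate_accuracy, noise_std: float = 0.01):
    """Perturb one top-level module at a time with Gaussian noise and record the drop
    in accuracy, as a rough proxy for how sensitive each module is to its parameters."""
    baseline = evaluate_accuracy(model)  # hypothetical helper: accuracy on the target data
    drops = {}
    for name, _ in model.named_children():
        probe = copy.deepcopy(model)
        for p in dict(probe.named_children())[name].parameters():
            p.add_(noise_std * torch.randn_like(p))
        drops[name] = baseline - evaluate_accuracy(probe)
    return drops
```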
Wrapping Up
In this work, the researchers showed that when a model is trained from pre-trained weights, it stays in the same basin of the loss landscape, and different instances of such models are similar in feature space and close in parameter space.
They concluded that feature reuse plays a vital role in transfer learning, especially when the downstream task shares visual features with the pre-training domain. However, other factors, such as the low-level statistics of the data, also contribute to the benefits of transfer learning, especially to optimisation speed.
On a concluding note, the researchers said, “Our observation of low-level data statistics improving training speed could lead to better network initialisation methods. Using these findings to improve transfer learning is of interest for future work.”
Read the paper here.