It has become increasingly common to pre-train models to develop general-purpose abilities and knowledge that can then be “transferred” to downstream tasks.
In applications of transfer learning to computer vision, pre-training is typically done via supervised learning on a large labelled dataset like ImageNet. In contrast, modern techniques for transfer learning in NLP often pre-train using unsupervised learning on unlabeled data.
Despite its wide popularity, a few pressing questions about transfer learning in ML remain open:
- How much of the original task has the model forgotten?
- Why don’t large models change as much as small models?
- Can we make more out of pre-trained weight statistics?
- Are the results similar to other tasks, such as segmentation?
The rapid rate of progress and diversity of techniques can make it difficult to compare different algorithms and understand the space of existing methods for transfer learning. The researchers at Google say that there is a need for a unified approach to understanding the effectiveness of transfer learning.
To build such a unified approach, the researchers treated every NLP problem as a “text-to-text” problem, i.e. one that takes text as input and produces new text as output.
The text-to-text framework offers flexibility that helps in evaluating performance on a wide variety of English-based NLP problems, including question answering, document summarization, and sentiment classification, to name a few.
According to the authors, this unified approach makes it possible to compare the effectiveness of different transfer learning objectives, unlabeled datasets, and other factors, while exploring the limits of transfer learning for NLP by scaling up models and datasets beyond what has previously been considered.
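To make the text-to-text idea concrete, here is a minimal sketch of how different tasks could be cast into the same input/output shape by prepending a short task prefix to the input. The helper name, the prefix strings, and the sample texts are assumptions made up for illustration and are not taken from the paper.

```python
from typing import NamedTuple


class TextToTextExample(NamedTuple):
    source: str  # text fed to the model
    target: str  # text the model is trained to produce


def to_text_to_text(task: str, text: str, answer: str) -> TextToTextExample:
    """Prepend a task prefix so a single model can tell tasks apart.

    Hypothetical helper: the prefix format is an illustrative assumption.
    """
    return TextToTextExample(source=f"{task}: {text}", target=answer)


# Three different tasks, all expressed as "text in, text out".
examples = [
    to_text_to_text("summarize", "The council met on Monday to vote on the budget.", "Council votes on budget."),
    to_text_to_text("sentiment", "A warm, funny and moving film.", "positive"),
    to_text_to_text("question", "Who wrote Hamlet?", "William Shakespeare"),
]

for ex in examples:
    print(f"{ex.source!r} -> {ex.target!r}")
```

Because even a class label such as "positive" is emitted as plain text, the same model, loss, and decoding procedure can be reused across all three tasks.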
Testing Effectiveness of Transfer Learning
In transfer learning, the neural network is trained in two stages:
- Pre-training: The network is generally trained on a large-scale benchmark dataset representing a wide range of categories.
- Fine-tuning: The pre-trained network is further trained on the specific target task of interest, which may have fewer labelled examples than the pre-training dataset.
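The two stages can be sketched with a deliberately tiny model. The example below is a toy illustration, not the setup used in the study: it fits a one-variable linear model with plain gradient descent, first on a larger "pre-training" set and then on three "target" points. All data, learning rates, and step counts are assumptions chosen for illustration.

```python
def sgd_fit(data, w, b, lr, steps):
    """Full-batch gradient descent on mean squared error for y = w*x + b."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * ((w * x + b) - y) * x for x, y in data) / n
        grad_b = sum(2 * ((w * x + b) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(data, w, b):
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


# Stage 1 (pre-training): many examples from a related problem, y = 3x.
pretrain_data = [(i / 50, 3.0 * (i / 50)) for i in range(-50, 51)]
w_pre, b_pre = sgd_fit(pretrain_data, 0.0, 0.0, lr=0.1, steps=500)

# Stage 2 (fine-tuning): only three labelled target examples, y = 3x + 0.2.
target_data = [(-0.5, -1.3), (0.0, 0.2), (0.5, 1.7)]
w_ft, b_ft = sgd_fit(target_data, w_pre, b_pre, lr=0.1, steps=5)

# Baseline: the same five steps on the target data, starting from scratch.
w_sc, b_sc = sgd_fit(target_data, 0.0, 0.0, lr=0.1, steps=5)

print("fine-tuned loss:  ", mse(target_data, w_ft, b_ft))
print("from-scratch loss:", mse(target_data, w_sc, b_sc))
```

With the same small fine-tuning budget, the model that starts from the pre-trained weights ends with a lower loss on the target data than the one trained from scratch, which is the effect transfer learning relies on.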
For this study, all experiments were based on the Transformer architecture, chosen for its wide applicability and adoption. The baseline model is designed so that the encoder and decoder are each similar in size and configuration to BERT.
In the example shown above, the words “for”, “inviting” and “last” are chosen at random and corrupted (crossed out). Each consecutive span of corrupted tokens is replaced by a single sentinel token, here <X> and <Y>. Since “for” and “inviting” occur consecutively, they are replaced together by <X>, while “last” is replaced by <Y>. The output sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input, plus a final sentinel token <Z>.
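The corruption scheme described here can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name, the whitespace tokenisation, and the example sentence (chosen so that it contains the three words named above) are assumptions, and the corrupted positions are passed in by hand rather than sampled at random.

```python
def corrupt_spans(tokens, corrupt_idx, sentinels=("<X>", "<Y>", "<Z>")):
    """Build (input, target) strings from a token list and corrupted positions.

    Hypothetical helper. Requires one more sentinel than there are
    corrupted spans, because the target ends with a final sentinel.
    """
    corrupt = set(corrupt_idx)
    inputs, targets = [], []
    next_sentinel = 0
    prev_corrupted = False
    for i, tok in enumerate(tokens):
        if i in corrupt:
            if not prev_corrupted:
                # A new span starts: one sentinel stands in for the whole span.
                sentinel = sentinels[next_sentinel]
                next_sentinel += 1
                inputs.append(sentinel)
                targets.append(sentinel)
            targets.append(tok)  # dropped-out token goes to the target
            prev_corrupted = True
        else:
            inputs.append(tok)
            prev_corrupted = False
    targets.append(sentinels[next_sentinel])  # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week".split()
# Positions of "for", "inviting" and "last" in the token list.
model_input, model_target = corrupt_spans(tokens, corrupt_idx=[2, 3, 8])
print(model_input)   # Thank you <X> me to your party <Y> week
print(model_target)  # <X> for inviting <Y> last <Z>
```

Note that the target contains only the dropped-out tokens and sentinels, so it is much shorter than the input.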
The authors pre-trained their models for 1 million steps with a batch size of 2^11 sequences of length 512, corresponding to a total of about 1 trillion pre-training tokens. Pre-training on the RealNews-like or Wikipedia + TBC (Toronto Books Corpus) datasets outperformed pre-training on C4 on a few downstream tasks. However, these dataset variants are small enough that they would be repeated hundreds of times over the course of pre-training on 1 trillion tokens.
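The "about 1 trillion tokens" figure follows directly from the three numbers quoted above, as the short calculation below shows. The 5-billion-token dataset size used for the repetition estimate is a hypothetical value picked for illustration, not a figure from the paper.

```python
steps = 1_000_000      # pre-training steps
batch_size = 2 ** 11   # 2,048 sequences per batch
seq_len = 512          # tokens per sequence

total_tokens = steps * batch_size * seq_len
print(f"{total_tokens:,}")  # 1,048,576,000,000 (about 1 trillion)


def passes_over(dataset_tokens: int, budget: int = total_tokens) -> float:
    """How many times a dataset is repeated to fill the token budget."""
    return budget / dataset_tokens


# Hypothetical small in-domain dataset of 5 billion tokens.
print(round(passes_over(5_000_000_000)))  # 210
```

A dataset of that assumed size would be seen roughly 210 times, which is the kind of repetition the authors warn about.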
The results show that additional pre-training can indeed be helpful and that both increasing the batch size and increasing the number of training steps can confer this benefit.
The authors summarised their findings as follows:
Finding The Right Training Strategies
The paper states that updating all of a pre-trained model’s parameters during fine-tuning outperformed methods that are designed to update fewer parameters, although updating all parameters is expensive. In a multi-task setting, the authors couldn’t find a strategy that matched the performance of the basic approach of unsupervised pre-training, followed by supervised fine-tuning. However, they found that fine-tuning after pre-training on a mixture of tasks produced comparable performance to unsupervised pre-training.
The authors also posit that larger models trained for longer might benefit from a larger proportion of unlabeled data because they are more likely to overfit to smaller training datasets.
While some work on transfer learning for NLP has considered architectural variants of the Transformer, the original encoder-decoder form worked best in the text-to-text framework. Though an encoder-decoder model uses twice as many parameters as “encoder-only” (e.g. BERT) or “decoder-only” (language model) architectures, it has a similar computational cost. This study also demonstrates that sharing the parameters in the encoder and decoder did not result in a substantial performance drop while reducing the total parameter count by 50 per cent.
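A rough back-of-the-envelope sketch shows why twice the parameters need not mean twice the compute. It assumes, as a simplification, that cost is proportional to the parameters applied to each token, and it ignores embeddings and encoder-decoder attention. The layer count, parameters per layer, and sequence lengths are hypothetical values.

```python
def encoder_decoder(layers, params_per_layer, n_input, n_target, shared=False):
    """Parameter count and cost proxy for an encoder-decoder model."""
    stacks = 1 if shared else 2  # sharing reuses one stack for both roles
    params = stacks * layers * params_per_layer
    # The encoder only processes the input, the decoder only the target.
    cost = layers * params_per_layer * n_input + layers * params_per_layer * n_target
    return params, cost


def decoder_only(layers, params_per_layer, n_input, n_target):
    """Parameter count and cost proxy for a single-stack language model."""
    params = layers * params_per_layer
    # One stack processes input and target concatenated.
    cost = layers * params_per_layer * (n_input + n_target)
    return params, cost


# Hypothetical sizes, for illustration only.
L, P, N_IN, N_OUT = 12, 7_000_000, 512, 128

ed_params, ed_cost = encoder_decoder(L, P, N_IN, N_OUT)
sh_params, sh_cost = encoder_decoder(L, P, N_IN, N_OUT, shared=True)
lm_params, lm_cost = decoder_only(L, P, N_IN, N_OUT)

print(ed_params / lm_params)  # 2.0: twice the parameters
print(ed_cost / lm_cost)      # 1.0: same cost under this proxy
print(sh_params / ed_params)  # 0.5: sharing halves the parameter count
```

Under these assumptions each token passes through the same number of layers in both designs, so the cost matches even though the encoder-decoder model stores two stacks of weights.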
When comparing C4 to datasets that use additional filtering, the researchers found that training on in-domain unlabeled data could boost performance in a few downstream tasks. However, constraining to a single domain typically results in a smaller dataset. Performance can degrade when an unlabeled dataset is small enough that it is repeated many times over the course of pre-training. This hints at the preference for large and diverse datasets for generic language understanding tasks.
A few other key conclusions of the work:
- Larger models tend to perform better
- Pre-training on unlabeled in-domain data can improve performance on downstream tasks
- English-only pre-training did not achieve state-of-the-art results on the translation tasks
- Ensembling models that were fine-tuned from the same base pre-trained model performed worse than pre-training and fine-tuning all models completely separately