
Testing The Limits Of Transfer Learning In Natural Language Processing


It has become increasingly common to pre-train models to develop general-purpose abilities and knowledge that can then be “transferred” to downstream tasks.

In applications of transfer learning to computer vision, pre-training is typically done via supervised learning on a large labelled dataset like ImageNet. In contrast, modern techniques for transfer learning in NLP often pre-train using unsupervised learning on unlabeled data.

Despite being widely popular, a few pressing questions still surround transfer learning in ML:

  • How much of the original task has the model forgotten? 
  • Why don’t large models change as much as small models? 
  • Can we make more out of pre-trained weight statistics?
  • Are the results similar for other tasks, such as segmentation?

The rapid rate of progress and diversity of techniques can make it difficult to compare different algorithms and understand the space of existing methods for transfer learning. The researchers at Google say that there is a need for a unified approach to understanding the effectiveness of transfer learning. 

To formulate this unified approach, the researchers treated every NLP problem as a “text-to-text” problem, i.e., taking text as input and producing new text as output. 

The text-to-text framework offers flexibility that helps in evaluating performance on a wide variety of English-based NLP problems, including question answering, document summarization, and sentiment classification, to name a few. 
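As a rough illustration of this framing (not code from the paper), every task can be reduced to mapping an input string to an output string by prepending a task prefix and rendering the label or answer as text. The helper function and example prefixes below are illustrative assumptions:

```python
# Minimal sketch of the text-to-text framing: every task becomes
# "input string -> output string". Prefixes and fields are illustrative.

def to_text_to_text(task: str, example: dict) -> tuple[str, str]:
    """Convert a raw example into an (input_text, target_text) pair."""
    if task == "sentiment":
        return (f"sst2 sentence: {example['sentence']}", example["label"])
    if task == "summarization":
        return (f"summarize: {example['document']}", example["summary"])
    if task == "translation":
        return (f"translate English to German: {example['en']}", example["de"])
    raise ValueError(f"unknown task: {task}")

# Usage: a classification example and a translation example share one interface.
print(to_text_to_text("sentiment", {"sentence": "A delightful film.", "label": "positive"}))
print(to_text_to_text("translation", {"en": "That is good.", "de": "Das ist gut."}))
```

Because inputs and outputs are both plain text, a single model, loss, and decoding procedure can serve every task.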

With this unified approach, the authors write, one can compare the effectiveness of different transfer learning objectives, unlabeled datasets, and other factors, while exploring the limits of transfer learning for NLP by scaling up models and datasets beyond what has previously been considered.

Testing Effectiveness of Transfer Learning

In transfer learning, the neural network is trained in two stages: 

  • Pre-training: The network is generally trained on a large-scale benchmark dataset representing a wide range of categories
  • Fine-tuning: The pre-trained network is further trained on the specific target task of interest, which may have fewer labelled examples than the pre-training dataset.
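A minimal sketch of the two stages, assuming a generic seq2seq model whose forward pass returns an object with a .loss attribute and hypothetical batch iterators; this illustrates the recipe, not the authors' training code:

```python
import torch

def pretrain_then_finetune(model, unlabeled_batches, task_batches,
                           pretrain_steps=1_000, finetune_steps=100, lr=1e-4):
    """Stage 1: self-supervised pre-training on unlabeled text;
    Stage 2: supervised fine-tuning on the labelled target task."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)

    def run(batches, num_steps):
        for _, (inputs, targets) in zip(range(num_steps), batches):
            loss = model(inputs, labels=targets).loss  # assumed seq2seq interface
            loss.backward()
            optimiser.step()
            optimiser.zero_grad()

    run(unlabeled_batches, pretrain_steps)  # stage 1: e.g. a span-corruption objective
    run(task_batches, finetune_steps)       # stage 2: far fewer labelled examples
    return model
```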

For this study, all the experiments were based on the Transformer architecture, considering its wide applicability and adoption. The baseline model is designed so that the encoder and decoder are each similar in size and configuration to BERT. 

In one of the paper's examples, the words “for”, “inviting” and “last” are randomly chosen and corrupted (dropped out). Each consecutive span of corrupted tokens is replaced by a sentinel token; since “for” and “inviting” occur consecutively, they are replaced together by <X>, while “last” is replaced by <Y>. The output sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input, plus a final sentinel token <Z>.
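A minimal sketch of this span-corruption preprocessing, assuming whitespace-tokenised words and using the literal strings <X>, <Y>, <Z> as stand-ins for the reserved sentinel tokens:

```python
def corrupt_spans(tokens, corrupted_positions, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each consecutive run of corrupted tokens with one sentinel;
    the target lists the dropped spans delimited by the same sentinels."""
    inp, tgt, s, i = [], [], 0, 0
    while i < len(tokens):
        if i in corrupted_positions:
            span = []
            while i < len(tokens) and i in corrupted_positions:
                span.append(tokens[i])
                i += 1
            inp.append(sentinels[s])        # one sentinel per corrupted span
            tgt += [sentinels[s]] + span
            s += 1
        else:
            inp.append(tokens[i])
            i += 1
    tgt.append(sentinels[s])                # final sentinel closes the target
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week .".split()
print(corrupt_spans(tokens, corrupted_positions={2, 3, 8}))
# -> ('Thank you <X> me to your party <Y> week .', '<X> for inviting <Y> last <Z>')
```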

The authors pre-trained their models for 1 million steps on a batch size of 2^11 (2,048) sequences of length 512, corresponding to a total of about 1 trillion pre-training tokens. Pre-training on the RealNews-like or Wikipedia + TBC dataset variants outperformed pre-training on C4 on a few downstream tasks. However, these dataset variants are small enough that they would be repeated hundreds of times over the course of pre-training on 1 trillion tokens.
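As a quick sanity check on those figures (plain arithmetic, not the paper's code), the total token budget and the number of repeats of a smaller corpus follow directly; the 5-billion-token corpus size below is a placeholder, not a number from the paper:

```python
steps      = 1_000_000   # pre-training steps
batch_size = 2 ** 11     # 2,048 sequences per batch
seq_length = 512         # tokens per sequence

total_tokens = steps * batch_size * seq_length
print(f"{total_tokens:.3e}")  # ~1.05e12, i.e. about 1 trillion tokens

# A hypothetical smaller pre-training corpus of ~5 billion tokens would be
# cycled through roughly this many times over the full pre-training run:
corpus_tokens = 5e9           # placeholder size
print(round(total_tokens / corpus_tokens))  # ~210 repeats
```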

The results show that additional pre-training can indeed be helpful and that both increasing the batch size and increasing the number of training steps can confer this benefit.

The authors summarised their findings as follows:

Finding The Right Training Strategies 

The paper states that updating all of a pre-trained model’s parameters during fine-tuning outperformed methods that are designed to update fewer parameters, although updating all parameters is expensive. In a multi-task setting, the authors couldn’t find a strategy that matched the performance of the basic approach of unsupervised pre-training, followed by supervised fine-tuning. However, they found that fine-tuning after pre-training on a mixture of tasks produced comparable performance to unsupervised pre-training.
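A minimal sketch of that contrast in generic PyTorch, where freezing all but a named subset of parameters stands in for the lighter-weight fine-tuning alternatives; the name filter is purely illustrative:

```python
import torch

def trainable_parameters(model: torch.nn.Module, update_all: bool,
                         keep=("layer_norm",)):
    """Full fine-tuning sets every parameter trainable; otherwise freeze
    everything except parameters whose names contain a pattern in `keep`."""
    for name, param in model.named_parameters():
        param.requires_grad = update_all or any(k in name for k in keep)
    return [p for p in model.parameters() if p.requires_grad]

# Usage with some pre-trained seq2seq model `model` (hypothetical):
# opt = torch.optim.Adam(trainable_parameters(model, update_all=True), lr=1e-4)   # best results, costly
# opt = torch.optim.Adam(trainable_parameters(model, update_all=False), lr=1e-4)  # cheaper, weaker
```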

The authors also posit that larger models trained for longer might benefit from a larger proportion of unlabeled data because they are more likely to overfit to smaller training datasets.

Architectures 

While some work on transfer learning for NLP has considered architectural variants of the Transformer, the original encoder-decoder form worked best in the text-to-text framework. Though an encoder-decoder model uses twice as many parameters as “encoder-only” (e.g. BERT) or “decoder-only” (language model) architectures, it has a similar computational cost. This study also demonstrates that sharing the parameters in the encoder and decoder did not result in a substantial performance drop while reducing the total parameter count by 50 per cent.
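A toy illustration of the parameter-sharing idea in PyTorch: reusing the same layer objects for both stacks halves the unique parameter count. (Real decoder layers also contain cross-attention, which is omitted here; this is not the paper's implementation.)

```python
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6
make_layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

# Separate stacks: every layer has its own weights.
encoder = nn.ModuleList([make_layer() for _ in range(n_layers)])
decoder = nn.ModuleList([make_layer() for _ in range(n_layers)])

# Shared variant: both roles reuse the very same layer objects.
shared = [make_layer() for _ in range(n_layers)]
shared_encoder, shared_decoder = nn.ModuleList(shared), nn.ModuleList(shared)

def n_params(*stacks):
    """Count parameters, counting tensors shared between stacks only once."""
    return sum(p.numel() for p in
               {id(p): p for s in stacks for p in s.parameters()}.values())

print(n_params(encoder, decoder))                # separate: full count
print(n_params(shared_encoder, shared_decoder))  # shared: about half
```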

Datasets 

When comparing C4 to datasets that use additional filtering, the researchers found that training on in-domain unlabeled data could boost performance in a few downstream tasks. However, constraining to a single domain typically results in a smaller dataset. Performance can degrade when an unlabeled dataset is small enough that it is repeated many times over the course of pre-training. This hints at the preference for large and diverse datasets for generic language understanding tasks.

A few other key conclusions of the work:

  • Larger models tend to perform better
  • Pre-training on unlabeled in-domain data can improve performance on downstream tasks 
  • English-only pre-training did not achieve state-of-the-art results on the translation tasks
  • Ensembling models that were fine-tuned from the same base pre-trained model performed worse than pre-training and fine-tuning all models completely separately 

Link to the paper
