Pre-training is a machine learning technique in which a model is trained to recognise patterns on one task and the learned parameters are then applied to similar tasks, much like how humans draw on prior knowledge when processing new information.
Pre-trained language models have made Natural Language Processing significantly cheaper, faster, and easier. Compared with training from scratch, starting from a pre-trained model achieves better performance with less training data. Language model pre-training uses self-supervision, which doesn’t require any labelled training data. Fine-tuning, on the other hand, adapts the pre-trained model to a specific end task to improve performance on it.
Now, researchers from Facebook have proposed an additional stage between pre-training and fine-tuning, called pre-finetuning: a large-scale, multitask learning stage with over 50 datasets and 4.8 million labelled examples. The research showed that pre-finetuning improves the performance of pre-trained discriminators and generation models, and also improves sample efficiency during fine-tuning.
“We show that multitask supervised tuning, if done at a sufficiently large scale with many different tasks, can be an effective second stage of task-agnostic pre-training, removing the need to pre-select the best intermediate tasks,” the authors of the study said.
What Is MUPPET?
Multitask training is a sub-field of machine learning where a shared model learns multiple tasks simultaneously. The technique is generally used on top of traditional pre-training. This approach comes with greater data efficiency, faster learning through auxiliary information, and reduced overfitting. Models such as multitask deep neural networks (MT-DNN) have improved results on several language benchmarks.
Pre-finetuning is an intermediate stage bookended by pre-training and fine-tuning, and involves large-scale multitask learning on around 50 tasks spanning classification, summarisation, question answering, and more. Standard multitasking schemes often fail to learn ‘high-quality representations’. However, the newly introduced training technique, called Massive Multi-task Representations with Pre-Finetuning, or MUPPET, substantially improves training stability and overall performance using loss scaling and task-heterogeneous batches.
For this model, RoBERTa and BART, two popular pre-trained models, were chosen as the initial pre-trained models. A different prediction scheme was used for each task. The pre-finetuning procedure was run for both models, and each model configuration was trained on 64 GPUs.
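To make the two ingredients more concrete, here is a minimal, framework-free Python sketch. It is an illustration under assumptions, not the authors' implementation: it assumes loss scaling means dividing each task's loss by the logarithm of the size of its output space, and it approximates task-heterogeneous batches by shuffling examples from all tasks into one pool. The names `scaled_loss`, `heterogeneous_batches`, and the toy task data are hypothetical.

```python
import math
import random
from collections import Counter


def scaled_loss(raw_loss: float, num_classes: int) -> float:
    """Assumed scaling rule: divide a task's loss by log(num_classes).

    This keeps tasks with large output spaces from dominating the update.
    """
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    return raw_loss / math.log(num_classes)


def heterogeneous_batches(task_examples, batch_size, seed=0):
    """Yield batches drawn from a shuffled pool of every task's examples.

    Simplification: a real system might instead accumulate gradients from
    per-task mini-batches; pooling is used here only to show the idea.
    """
    rng = random.Random(seed)
    pool = [
        (task, example)
        for task, examples in task_examples.items()
        for example in examples
    ]
    rng.shuffle(pool)
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]


# Hypothetical toy data: three tasks with eight examples each.
toy_tasks = {
    "sentiment": [f"s{i}" for i in range(8)],
    "nli": [f"n{i}" for i in range(8)],
    "boolq": [f"b{i}" for i in range(8)],
}

batches = list(heterogeneous_batches(toy_tasks, batch_size=6))
# Count how many examples of each task landed in each batch.
task_mix = [Counter(task for task, _ in batch) for batch in batches]
```

Under this assumed rule, a uniform random guess over k classes has cross-entropy log k, so its scaled loss is 1 for every task regardless of k. That is one way to see why the scaling puts tasks with very different output spaces on a comparable footing.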
MUPPET’s Performance
After pre-finetuning, RoBERTa and BART were tested on widely used benchmarks such as RTE, BoolQ, RACE, SQuAD, and MNLI. It was observed that pre-finetuning hurts performance when only a few tasks (up to 15) are used. Beyond this point, however, pre-finetuning on a larger number of language tasks leads to performance improvements. Overall, the MUPPET models performed better than their vanilla pre-trained counterparts.
Wrapping Up
The MUPPET study demonstrated that:
- Pre-trained models, when further refined with pre-finetuning, perform significantly better on downstream tasks.
- MUPPET, which uses loss scaling and task-heterogeneous batches, is effective for learning at scale.
- Beyond a threshold (in this case, 15), representation quality improves linearly with the number of tasks.
- Pre-finetuned models require less data for fine-tuning than a vanilla pre-trained model.
- MUPPET outperforms previous models on benchmarks such as Recognizing Textual Entailment (RTE) and HellaSWAG, and improves on pre-trained representations for Multi-Genre Natural Language Inference (MNLI), CommonsenseQA, and the Stanford Question Answering Dataset (SQuAD).
Read the full paper here.