Pre-training is a machine learning technique in which a model learns patterns on one task so that the learned parameters can be reused on similar tasks, much like how humans build on prior knowledge when processing new information.
Pre-trained language models have made Natural Language Processing significantly cheaper, faster, and easier. Pre-trained models (as opposed to models trained from scratch) achieve better performance with less training data. Language model pre-training uses self-supervision, which does not require labelled training data. Fine-tuning, on the other hand, adapts the model to a downstream task to improve performance.
Now, researchers from Facebook have proposed an additional stage between pre-training and fine-tuning, called pre-finetuning: a large-scale multitask learning stage spanning around 50 datasets and over 4.8 million labelled examples. The research showed that pre-finetuning improves the performance of pre-trained discriminative and generation models, as well as sample efficiency during fine-tuning.
“We show that multitask supervised tuning, if done at a sufficiently large scale with many different tasks, can be an effective second stage of task-agnostic pre-training, removing the need to pre-select the best intermediate tasks,” the authors of the study said.

What Is MUPPET?
Multitask training is a sub-field of machine learning in which a shared model learns multiple tasks simultaneously. The technique is generally used on top of traditional pre-training and brings greater data efficiency, faster learning from auxiliary information, and reduced overfitting. Models such as multi-task deep neural networks (MT-DNN) have improved several language benchmarks.
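As a rough illustration of the shared-model idea (not Facebook's implementation; the encoder, task names and label counts below are placeholders), a multitask setup typically routes every example through one shared encoder and a small task-specific head:

```python
import torch
import torch.nn as nn

class SharedMultitaskModel(nn.Module):
    """Toy multitask setup: one shared encoder, one lightweight head per task."""

    def __init__(self, task_num_labels, hidden_size=768):
        super().__init__()
        # Stand-in for a pre-trained transformer body shared by all tasks.
        self.encoder = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
        # One classification head per task (e.g. MNLI has 3 labels, BoolQ has 2).
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_size, n) for task, n in task_num_labels.items()}
        )

    def forward(self, features, task):
        # Every task flows through the same encoder; only the output head differs.
        return self.heads[task](self.encoder(features))

model = SharedMultitaskModel({"mnli": 3, "boolq": 2})
logits = model(torch.randn(4, 768), task="mnli")  # shape: (4, 3)
```

Because the encoder is shared, gradients from every task update the same representation, which is where the data-efficiency and regularisation benefits of multitask learning come from.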
Pre-finetuning is an intermediate stage sandwiched between pre-training and fine-tuning, and involves large-scale multitask learning over roughly 50 tasks such as classification, summarisation and question answering. Standard multitasking schemes often fail to learn ‘high-quality representations’. However, the newly introduced technique, MUPPET (Massive Multi-task Representations with Pre-Finetuning), markedly improves training stability and overall performance using loss scaling and task-heterogeneous batches.
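A minimal sketch of those two ingredients, under the assumption that loss scaling means dividing each task's loss by the log of its label-space size and that heterogeneous batches mix examples from several tasks in every update (the task names, label counts and dataset format below are illustrative, not the paper's exact configuration):

```python
import math
import random
import torch
import torch.nn.functional as F

# Hypothetical per-task label counts; the real setup covers ~50 tasks.
TASK_NUM_LABELS = {"mnli": 3, "boolq": 2, "race": 4}

def scaled_loss(logits, labels, task):
    """Scale each task's cross-entropy so tasks with larger label spaces
    don't dominate the shared update (one plausible scaling scheme)."""
    return F.cross_entropy(logits, labels) / math.log(TASK_NUM_LABELS[task])

def heterogeneous_batch(task_datasets, batch_size=8):
    """Draw examples from several tasks into a single batch instead of
    training on one task at a time."""
    tasks = random.choices(list(task_datasets), k=batch_size)
    return [(task, random.choice(task_datasets[task])) for task in tasks]

# Usage with placeholder data.
logits = torch.randn(4, TASK_NUM_LABELS["mnli"])
labels = torch.randint(0, TASK_NUM_LABELS["mnli"], (4,))
print(scaled_loss(logits, labels, "mnli"))
batch = heterogeneous_batch({"mnli": ["ex1", "ex2"], "boolq": ["ex3"], "race": ["ex4"]})
```

Mixing tasks within each batch keeps any single task from dominating a gradient step, which is the intuition behind the improved training stability claimed above.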
For pre-finetuning, two popular pre-trained models, RoBERTa and BART, were chosen as the initial models. A different prediction scheme was used for each task. The pre-finetuning procedure was applied to both models, and each model configuration was trained on 64 GPUs.
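To give a sense of what a per-task prediction scheme looks like in practice, here is a hedged sketch using the Hugging Face transformers API; the checkpoint name, task types and label counts are assumptions, and in the actual pre-finetuning setup the transformer body is shared across tasks while only the heads differ:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoModelForQuestionAnswering,
)

def build_task_model(task_type, checkpoint="roberta-base", num_labels=2):
    # BART variants would swap in a seq2seq checkpoint such as "facebook/bart-base".
    if task_type == "classification":
        # e.g. MNLI, BoolQ: a classification head over the pooled representation.
        return AutoModelForSequenceClassification.from_pretrained(
            checkpoint, num_labels=num_labels
        )
    if task_type == "span_extraction":
        # e.g. SQuAD: start/end span prediction heads.
        return AutoModelForQuestionAnswering.from_pretrained(checkpoint)
    raise ValueError(f"Unknown task type: {task_type}")

mnli_model = build_task_model("classification", num_labels=3)
squad_model = build_task_model("span_extraction")
```

Building separate models per task, as above, is for illustration only; the point is simply that the head architecture changes with the task type while the underlying encoder can stay the same.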
MUPPET’s Performance
After pre-finetuning, RoBERTa and BART were tested on widely used benchmarks such as RTE, BoolQ, RACE, SQuAD, and MNLI. It was observed that pre-finetuning hurts performance when only a few tasks (up to around 15) are used. Beyond this point, however, pre-finetuning on a larger number of language tasks leads to performance improvements, and the MUPPET models outperformed the plain pre-trained models.
Wrapping Up
The MUPPET model demonstrated:
- Pre-trained models, when further refined with pre-finetuning, significantly improve performance on downstream tasks
- MUPPET, which uses loss scaling and task-heterogeneous batches, is effective for learning at scale
- Beyond a threshold (in this case, around 15 tasks), representation quality improves linearly with the number of tasks.
- Pre-finetuned models require less data for fine-tuning than vanilla pre-trained models.
- It outperforms previous models on tasks such as Recognising Textual Entailment (RTE) and HellaSWAG, and improves on pre-trained representations for Multi-Genre Natural Language Inference (MNLI), CommonsenseQA, and the Stanford Question Answering Dataset (SQuAD).
Read the full paper here.