Microsoft Research has collaborated with OpenAI to release a paper titled ‘Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer’ that describes a technique called µTransfer. The method makes the expensive process of tuning hyperparameters for wide neural networks far more cost-effective by reducing the amount of trial and error involved.
The Tensor Programs series was first introduced by Microsoft Research in 2020. The new study builds on µ-Parametrisation (µP), a parametrisation that enables maximal feature learning in the infinite-width limit. µTransfer could eventually speed up work on massive neural networks such as GPT-3 and even larger successors.
Tuning the hyperparameters of wide neural networks drains resources because every new model size typically requires a fresh round of guesswork over which hyperparameters to use. The paper shows that there exists a specific parametrisation, µP, under which the optimal hyperparameters remain stable across model sizes.
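To make the idea concrete, the snippet below sketches, in simplified form, the kind of width-dependent scaling µP prescribes for a square hidden layer trained with Adam: the initialisation variance shrinks with fan-in and the per-layer learning rate shrinks with width, so a learning rate tuned on a narrow model stays near-optimal as the model widens. The widths and base learning rate here are illustrative assumptions, not values from the paper.

```python
import math

def mup_hidden_layer_scales(width: int, base_width: int, base_lr: float):
    """Return (init_std, adam_lr) for a square hidden layer of a given
    width, relative to a base model whose learning rate was tuned."""
    init_std = 1.0 / math.sqrt(width)       # init variance ~ 1 / fan_in
    adam_lr = base_lr * base_width / width  # hidden-layer Adam LR ~ 1 / width
    return init_std, adam_lr

# The same base learning rate, tuned once at width 256, is rescaled
# automatically as the model widens.
for width in (256, 1024, 4096):
    std, lr = mup_hidden_layer_scales(width, base_width=256, base_lr=1e-2)
    print(f"width={width:5d}  init_std={std:.4f}  adam_lr={lr:.2e}")
```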
The team partnered with OpenAI to assess how effective µTransfer would be for GPT-3. The technique was first used to tune a small proxy model with 40 million parameters; the resulting optimal hyperparameter combination was then copied directly to the 6.7-billion-parameter version of GPT-3. The study found that the total compute spent on tuning amounted to a mere 7 percent of the compute used to pretrain the model.
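Microsoft has released an open-source PyTorch package, `mup` (github.com/microsoft/mup), that implements this workflow. The sketch below shows roughly how a model would be set up with it; the toy MLP architecture, widths, and learning rate are assumptions for illustration, not the GPT-3 configuration from the study.

```python
import torch
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

class MLP(nn.Module):
    def __init__(self, width: int, d_in: int = 32, d_out: int = 10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, width)
        self.fc2 = nn.Linear(width, width)
        # MuReadout replaces the final nn.Linear so the output layer
        # is scaled correctly under µP.
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = torch.relu(self.fc2(h))
        return self.readout(h)

# The base and delta models tell mup which dimensions are "width";
# the target model is the one actually trained.
base_model = MLP(width=64)
delta_model = MLP(width=128)
model = MLP(width=4096)
set_base_shapes(model, base_model, delta=delta_model)

# A hyperparameter such as this learning rate would be tuned on a
# narrow proxy model and reused here unchanged; MuAdam applies the
# width-dependent per-layer scaling that makes the transfer valid.
optimizer = MuAdam(model.parameters(), lr=1e-2)
```

In the paper's experiment, the proxy model plays the role of the narrow model here, and the tuned hyperparameters carry over to the full-width model without any further search.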
“µP provides an impressive step toward removing some of the black magic from scaling up neural networks. It also provides a theoretically backed explanation of some tricks used by past works, like the T5 model. I believe both practitioners and researchers alike will find this work valuable,” said Colin Raffel, co-creator of T5 and assistant professor of computer science at the University of North Carolina.