How Pre-Finetuning Boosts Performance Of Language Models

Pre-training is a machine learning technique in which a model is trained to recognise patterns on one task, and the learned parameters are then applied to similar tasks, much like how humans draw on prior knowledge when processing new information.

Pre-trained language models have made Natural Language Processing significantly cheaper, faster, and easier. Starting from a pre-trained model, instead of training a model from scratch, achieves better performance with less training data. Language model pre-training uses self-supervision, which doesn’t require any labelled training data. Fine-tuning, on the other hand, adapts the pre-trained model to a specific end task to enhance performance.

Now, researchers from Facebook have proposed an additional stage between pre-training and fine-tuning, called pre-finetuning: a large-scale, multitask learning stage with around 50 datasets and over 4.8 million labelled examples. The research showed that pre-finetuning improves the performance of pre-trained discriminators and generation models, as well as sample efficiency during fine-tuning.

“We show that multitask supervised tuning, if done at a sufficiently large scale with many different tasks, can be an effective second stage of task-agnostic pre-training, removing the need to pre-select the best intermediate tasks,” the authors of the study said.



Multitask training is a sub-field of machine learning where a shared model learns multiple tasks simultaneously. The technique is generally used on top of traditional pre-training. This approach comes with greater data efficiency, faster learning using auxiliary information, and reduced overfitting. Models such as Multi-Task Deep Neural Networks (MT-DNN) have improved results on several language benchmarks.

Pre-finetuning is an intermediate stage bookended by pre-training and fine-tuning, and involves a large-scale multitask learning step performed on around 50 datasets covering tasks such as classification, summarisation, and question answering. Standard multitasking schemes often fail to learn ‘high-quality representations’. However, the newly introduced training technique, called Massive Multi-task Representations with Pre-Finetuning, or MUPPET, substantially improves training stability and overall performance using loss scaling and task-heterogeneous batches.
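
The two stabilising ideas named above can be illustrated with a small, library-free Python sketch. It is not the authors' implementation. It assumes that loss scaling means dividing each example's loss by the logarithm of the size of its task's output space (so that a 2-class task and a 1,000-class task contribute losses on a comparable scale), and that a task-heterogeneous batch simply mixes examples from several tasks in one update. The function names, the toy task names, and the pooling-and-shuffling strategy are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def scaled_loss(raw_loss, num_outputs):
    """Scale a per-example loss by log(num_outputs).

    Assumption: num_outputs is the size of the task's output space,
    e.g. the number of classes for a classification task.
    """
    if num_outputs < 2:
        raise ValueError("num_outputs must be at least 2")
    return raw_loss / math.log(num_outputs)


def heterogeneous_batches(task_examples, batch_size, seed=0):
    """Pool examples from every task, shuffle, and cut into batches.

    This is a simplification: it only shows that each batch may hold
    examples from several tasks rather than from a single task.
    """
    rng = random.Random(seed)
    pool = [
        (task, example)
        for task, examples in task_examples.items()
        for example in examples
    ]
    rng.shuffle(pool)
    return [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]


# Toy data: three hypothetical tasks with eight examples each.
tasks = {
    "sentiment": [f"review {i}" for i in range(8)],
    "nli": [f"premise/hypothesis pair {i}" for i in range(8)],
    "qa": [f"question {i}" for i in range(8)],
}

batches = heterogeneous_batches(tasks, batch_size=6)
for number, batch in enumerate(batches):
    # Show how many examples each task contributed to this batch.
    print(number, dict(Counter(task for task, _ in batch)))

# A uniform guess over n classes has cross-entropy log(n); after scaling,
# that baseline loss is the same for a 2-class and a 1,000-class task.
print(scaled_loss(math.log(2), 2), scaled_loss(math.log(1000), 1000))
```

In this sketch the scaling keeps tasks with large output spaces from dominating the summed loss, while the mixed batches mean each update reflects several tasks at once instead of swinging between them.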

For this study, RoBERTa and BART, two popular pre-trained models, were chosen as the starting points. A different prediction scheme was used for each task. The pre-finetuning procedure was performed for both models, and each model configuration was trained on 64 GPUs.


MUPPET’s Performance

After pre-finetuning, RoBERTa and BART were tested on widely used benchmarks such as RTE, BoolQ, RACE, SQuAD, and MNLI. It was observed that pre-finetuning hurts performance when only a few tasks, up to 15, are used. Beyond this point, however, pre-finetuning on a larger number of language tasks leads to performance improvements. MUPPET models performed better than their vanilla pre-trained counterparts.


Wrapping Up

The MUPPET model demonstrated:

  • Pre-trained models, when further refined with pre-finetuning, perform significantly better on downstream tasks.
  • MUPPET, which uses loss scaling and task-heterogeneous batches, is effective for learning at scale.
  • Beyond a threshold (in this case, 15 tasks), the quality of the representations improves linearly with the number of tasks.
  • Pre-finetuned models require less data for fine-tuning than a vanilla pre-trained model.
  • MUPPET outperforms previous models on benchmarks such as Recognising Textual Entailment (RTE) and HellaSwag, and improves on pre-trained representations for Multi-Genre Natural Language Inference (MNLI), CommonsenseQA, and the Stanford Question Answering Dataset (SQuAD).

Read the full paper here.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world, with a special interest in analysing its long-term impact on individuals and societies.
