“What do data-rich models know that models with less pre-training data do not?”
The performance of language models is determined mostly by the amount of training data, the quality of that data, and the choice of modelling technique. At the same time, the need to scale a novel algorithm up to very large datasets raises the barrier to entry for new modelling techniques.
Pretrained language models like BERT use massive datasets on the order of tens or even hundreds of billions of words to learn linguistic features and world knowledge, and they can be fine-tuned to achieve good performance on many downstream tasks.
General-purpose pre-trained language models achieve strong performance on natural language understanding (NLU) tasks through pretraining on billions of words. But what exactly, ask researchers at NYU, do these models learn from large-scale pretraining that they cannot learn from less data?
Probing Pretraining Data
To understand the relationship between the amount of pretraining data and what language models learn, the researchers adopted four probing methods: classifier probing, information-theoretic probing, unsupervised relative acceptability judgment, and fine-tuning on NLU tasks. They then plotted learning curves (shown above) for the four methods. For each method, they computed the overall performance of each RoBERTa model tested as the macro average over sub-task performance after normalisation.
As shown above, a separate learning curve is drawn for each probing method to show how the amount of pretraining data affects masked language models (MLMs).
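The overall score described above, a macro average over normalised sub-task scores, can be sketched roughly as follows. This is only an illustration: the article does not say which normalisation is used, so min-max scaling across models is assumed here, and the task names and numbers are made up.

```python
from statistics import mean


def min_max_normalise(scores):
    """Rescale one sub-task's scores across models to the range [0, 1].

    Assumption: min-max scaling; the article only says "normalisation".
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All models tie on this sub-task, so it carries no signal.
        return {model: 0.0 for model in scores}
    return {model: (s - lo) / (hi - lo) for model, s in scores.items()}


def macro_average(results):
    """Map {sub_task: {model: raw_score}} to {model: overall_score}.

    Every sub-task counts equally, whatever its original scale.
    """
    normalised = {task: min_max_normalise(s) for task, s in results.items()}
    models = next(iter(results.values())).keys()
    return {m: mean(normalised[t][m] for t in normalised) for m in models}


# Hypothetical raw scores for two sub-tasks on different scales.
results = {
    "task_a": {"1M": 50.0, "10M": 70.0, "100M": 80.0, "1B": 90.0},
    "task_b": {"1M": 0.25, "10M": 0.625, "100M": 0.75, "1B": 0.75},
}
overall = macro_average(results)
for model, score in overall.items():
    print(f"{model}: {score:.3f}")
```

Normalising before averaging keeps a sub-task scored in percentage points from drowning out one scored between 0 and 1.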
For the experiments, the researchers probed the MiniBERTas, a set of 12 RoBERTa models pre-trained from scratch on 1M, 10M, 100M, and 1B words sampled from a combination of Wikipedia and Smashwords.
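Of the four probing methods, unsupervised relative acceptability judgment is perhaps the easiest to picture: a model is shown pairs of sentences that differ minimally, and it counts as correct when it assigns the higher score to the acceptable one. The sketch below assumes that setup. The `toy_score` function and its bigram table are hypothetical stand-ins; in a real experiment the score would come from the pretrained model itself.

```python
def acceptability_accuracy(pairs, score):
    """Fraction of minimal pairs where the acceptable sentence scores higher.

    pairs: iterable of (acceptable, unacceptable) sentences.
    score: callable mapping a sentence to a number (higher = more probable).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)


# Hypothetical log-probabilities standing in for a language model.
TOY_BIGRAM_LOGPROB = {
    ("cats", "sleep"): -1.0,
    ("cats", "sleeps"): -3.0,
    ("cat", "sleeps"): -1.0,
    ("cat", "sleep"): -3.0,
}


def toy_score(sentence):
    """Sum bigram log-probabilities; unseen bigrams get a flat -2.0."""
    words = sentence.lower().split()
    return sum(TOY_BIGRAM_LOGPROB.get(bg, -2.0) for bg in zip(words, words[1:]))


pairs = [
    ("The cats sleep", "The cats sleeps"),
    ("The cat sleeps", "The cat sleep"),
    ("The dog barks", "The dog bark"),  # toy scorer cannot tell these apart
]
accuracy = acceptability_accuracy(pairs, toy_score)
print(f"accuracy: {accuracy:.2f}")
```

Because no labelled training is involved, the accuracy reflects only what the model picked up during pretraining.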
The authors report that performance on classifier probing, Minimum Description Length (MDL) probing (the information-theoretic method), and acceptability judgment improved rapidly between 1M and 10M words and slowed down beyond 100M words. In contrast, performance on the NLU tasks in SuperGLUE continued to improve rapidly beyond 1B words, and the trend continues at larger data scales. This implies, said the researchers, that some of the skills RoBERTa uses to solve typical NLU tasks take billions of words to acquire.
The most striking result, according to the authors, is that improvements in NLU task performance require far more data than improvements in representations of linguistic features as measured by these methods.
According to the authors, the findings of their work can be summarised as follows:
- Language models require only about 10M or 100M words to learn representations that reliably encode most of the syntactic and semantic features tested.
- A much larger quantity of data is needed to acquire enough commonsense knowledge to master typical downstream NLU tasks.
- It is likely that forms of knowledge other than linguistic features are responsible for recent improvements in language understanding among large pre-trained models.
Big Might Be Better After All
OpenAI’s popular GPT-2 had 1.5 billion parameters, which made it the biggest model at the time. It was followed by NVIDIA’s Megatron, with 8 billion parameters, and later by Microsoft’s Turing NLG, with 17 billion parameters. Earlier this year, OpenAI surprised everyone again with GPT-3, a far larger model with 175 billion parameters.
Research circles are familiar with the ‘bigger is better’ trend and the lure of double descent when it comes to natural language understanding. The main barrier to this mode of research is who gets to do it. Crunching these massive datasets is resource-intensive, and only labs such as OpenAI can afford it. Individual researchers rely heavily on pre-trained models. So probing pre-trained models to see what they actually learn, as discussed above, does help.
Natural language understanding is one of the final barriers that researchers are looking to breach in order to lay the foundations of artificial general intelligence (AGI). However, language is tricky for machines, especially when one considers the number of widely spoken languages. It is usually assumed that, given appropriately annotated data, language models should be trainable in any language.
However, despite this crude cross-linguistic compatibility, it is unlikely that all languages are equally easy to model, or that any one method is equally good at all of them. There is still a long way to go before machines get better at understanding language and before researchers can explain why popular models are good at what they do. Despite successfully demonstrating their probing methodology, the NYU researchers admit that their experimental results do not explain what causes NLU task performance to improve with large quantities of data.
Check the original paper here.