OpenAI’s GPT-3 release recently reinforced the idea that larger models perform better. It is not just GPT: other NLP models such as T5 have also delivered better results than previous work. Historically, NLP systems have struggled to learn from a few examples. With GPT-3, however, the researchers showed that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. While scaling up has been linked to improved unsupervised, or at least semi-supervised, performance in NLP, the same could not be said of computer vision applications.
To explore the notion that ‘big is better’ with CV models, researchers at Google Brain ran experiments with a modified SimCLR model, SimCLRv2.
They found that the fewer the labels, the more the task-agnostic use of unlabeled data benefits from a bigger network!
How Well Do Bigger Networks Perform?
Learning from just a few labeled examples while making the best use of a large amount of unlabeled data is a long-standing problem in machine learning. A common approach in computer vision is to leverage unlabeled data directly during supervised learning, as a form of regularisation.
In this work, the proposed semi-supervised learning framework leverages unlabeled data in two ways:
- task-agnostic use in unsupervised pretraining, and
- task-specific use in self-training / distillation
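The task-specific distillation step can be illustrated with a small sketch. It assumes the usual formulation in which a student network is trained to match the temperature-scaled class probabilities of a fine-tuned teacher on unlabeled images. The function names and the toy logits are hypothetical, the loss is averaged over the batch for readability, and this is not the authors' implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one list of class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean cross-entropy between teacher and student class distributions.

    No ground-truth labels are used: the teacher's predictions on
    unlabeled examples act as soft targets for the student.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p_teacher = softmax(t_row, temperature)
        p_student = softmax(s_row, temperature)
        total -= sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    return total / len(teacher_logits)

# Toy logits for two unlabeled images and three classes (made-up values).
teacher = [[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]]
good_student = [[3.8, 1.1, 0.4], [0.3, 3.2, 0.2]]
poor_student = [[0.5, 1.0, 4.0], [3.5, 0.2, 0.1]]

print("student that agrees with teacher:", distillation_loss(teacher, good_student))
print("student that disagrees:          ", distillation_loss(teacher, poor_student))
```

A student whose predictions track the teacher's gets a lower loss than one that disagrees, which is the signal that lets a smaller network absorb what the larger one has learned.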
One approach to semi-supervised learning involves unsupervised or self-supervised pretraining, followed by supervised fine-tuning.
Although it has received little attention in computer vision, this approach has become predominant in NLP, where one first trains a large language model on unlabeled text (e.g., Wikipedia), and then fine-tunes the model on a few labeled examples.
The plots show the top-1 accuracy of previous state-of-the-art (SOTA) methods and SimCLRv2 on ImageNet using only 1% or 10% of the labels. The dashed line denotes a fully supervised ResNet-50 trained with 100% of the labels.
The results show that bigger models are more label-efficient for both supervised and semi-supervised learning, but the gains appear to be larger for semi-supervised learning.
Link to the paper.
What Does This Mean For Smaller Labs?
According to a report by AI21 Labs, the estimated costs of training differently sized BERT models on the Wikipedia and Book corpora (15 GB) for a single training run are as follows:
- $2.5k – $50k (110 M parameters)
- $10k – $200k (340 M parameters)
- $80k – $1.6m (1.5 B parameters)
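To see how steeply these estimates grow, the short sketch below compares each model's parameter count and quoted cost range with the smallest model. The figures are the ones listed above; the function name `scaling_table` is hypothetical, and the calculation is plain arithmetic on the quoted ranges, not a cost model.

```python
# (parameters in millions, low estimate in $k, high estimate in $k), as quoted above
ESTIMATES = [
    (110, 2.5, 50),
    (340, 10, 200),
    (1500, 80, 1600),
]

def scaling_table(estimates):
    """Return (params, parameter ratio, low-cost ratio, high-cost ratio) vs. the first row."""
    base_params, base_low, base_high = estimates[0]
    return [
        (params, params / base_params, low / base_low, high / base_high)
        for params, low, high in estimates
    ]

for params, param_ratio, low_ratio, high_ratio in scaling_table(ESTIMATES):
    print(f"{params:>5} M params: {param_ratio:5.1f}x parameters, "
          f"{low_ratio:4.0f}x low estimate, {high_ratio:4.0f}x high estimate")
```

On these figures, going from 110 M to 1.5 B parameters is roughly a 13.6x increase in model size but a 32x increase in estimated cost at both ends of the range, so cost grows faster than parameter count.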
At list price, training the 11-billion-parameter variant of T5 costs well above $1.3 million for a single run. And with repeated runs, training costs might end up north of $10 million.
A similar but more rigorous approach was taken by Emma Strubell and her colleagues in a work published last year. The results of their work can be seen as follows:
With every new study, the idea that bigger networks are better is gaining more credibility. But who can afford to experiment with such large networks just to check for gains in accuracy? Definitely not individual researchers or startup AI labs. Companies like Google and OpenAI (backed by Microsoft) enjoy the advantage of vast computational resources, which they can afford to experiment with.
Smaller organisations do not have the resources to replicate the successes of larger ones. Just last month, Uber was forced to wind down its AI labs due to unfavourable circumstances. AI R&D does not look commercially viable in the short run, so smaller research groups will have little choice but to rely on the APIs released by the likes of OpenAI. On the other hand, cloud computing costs are on a declining trend. For instance, AWS has cut its prices more than 65 times since its launch in 2006, and some prices fell by as much as 73% between 2014 and 2017.
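As a rough illustration of what such a decline implies, the sketch below converts the 73% drop into an equivalent yearly rate. It assumes the drop was spread evenly over three years (2014 to 2017) with constant compounding, which is a simplification and not something the pricing data itself states. The function names are hypothetical.

```python
def annualised_decline(total_decline, years):
    """Constant yearly decline that compounds to `total_decline` over `years`."""
    remaining = 1.0 - total_decline
    return 1.0 - remaining ** (1.0 / years)

def projected_cost(cost_today, yearly_decline, years_ahead):
    """Cost after `years_ahead` years if the same yearly decline were to continue."""
    return cost_today * (1.0 - yearly_decline) ** years_ahead

# Assumption: the 73% drop is spread evenly over 3 years.
rate = annualised_decline(0.73, 3)
print(f"equivalent yearly decline: {rate:.1%}")

# Purely hypothetical extrapolation of a $1.3 million training run.
for years_ahead in (1, 3, 5):
    print(years_ahead, "years:", round(projected_cost(1_300_000, rate, years_ahead)))
```

Under these assumptions the yearly decline works out to roughly 35%. Whether prices keep falling at that pace is an open question, so the projection is only indicative.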
So we can safely assume that this trend will continue and that computation will become more affordable. Making networks more compact is also an active area of research. Even the SimCLRv2 work by Google researchers discussed above demonstrates that task-agnostically learned general representations can be distilled into a more specialised and compact network using unlabeled examples. That said, there is no denying that the inherent nature of deep learning innovation is rigged against smaller groups.