Transformers have become one of the most popular approaches in deep learning, especially large-scale transformer models like GPT-2, GPT-3, BERT, Turing NLG, Megatron-LM, XLNet, and RoBERTa. These models have the potential to power real-world applications such as machine translation, time series prediction, and video understanding, among others.
Every time a large-scale model comes into the limelight, controversy follows. Such was the fate of the Megatron-Turing Natural Language Generation model (MT-NLG), with 530 billion parameters, which Microsoft launched two weeks ago in collaboration with NVIDIA.
An amalgamation of the DeepSpeed and Megatron-LM transformer frameworks, MT-NLG has 3x the number of parameters of the existing largest models of its kind, including GPT-3 (175 billion parameters), Turing NLG (17 billion parameters), and Megatron-LM (8 billion parameters).
Further, the duo also claimed that MT-NLG is the largest and most powerful monolithic transformer language model trained to date. Other large-scale language models, including BAAI’s Wu Dao 2.0 (1.75 trillion parameters) and Google’s Switch Transformer (1.6 trillion parameters), surpass MT-NLG in parameter count, but these are sparse mixture-of-experts models rather than monolithic ones.
The latest development comes a year after Microsoft had already announced a bigger and more powerful programme: a model with 1T, or one trillion, parameters. In the blog post, Microsoft showed that 1T is not only bigger than Megatron-Turing NLG with its 530 billion parameters; 1T also posts higher numbers on every performance figure, including achieved teraFLOPs, batch size, and number of GPUs.
So, if 1T is bigger on every measure, how can Megatron-Turing NLG, with 530 billion parameters, be the biggest?
NVIDIA’s senior director of product management and marketing, Paresh Kharya, in an interview with ZDNet, said that the key is that 1T was never ‘trained to convergence’ – the term for a model that has been fully trained and can be used for performing inference, the stage where predictions are made. “Instead, it went through a limited number of training runs, or ‘epochs,’ which does not lead to convergence,” he added.
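The distinction Kharya draws can be sketched in a toy training loop: a run "to convergence" continues until the loss stops improving, while a partial run stops after a fixed, small number of steps. This is an illustrative sketch only – the names (`ToyModel`, `train`) are hypothetical, not from the MT-NLG codebase.

```python
# Hypothetical sketch: training to convergence vs. a partial run.
class ToyModel:
    """Stand-in model whose loss decays a little on each update."""
    def __init__(self):
        self.loss = 1.0
    def step(self, batch):
        self.loss *= 0.9          # pretend each update shrinks the loss
        return self.loss

def train(model, batches, steps=None, patience=3, tol=1e-3):
    """If steps is set, do a partial run of that many updates.
    Otherwise run until the loss plateaus ('trained to convergence')."""
    best, stale, history = float("inf"), 0, []
    for i, batch in enumerate(batches):
        loss = model.step(batch)            # one optimizer update
        history.append(loss)
        if steps is not None and i + 1 >= steps:
            return history                  # partial run: enough for timing
        if loss < best - tol:
            best, stale = loss, 0           # still improving meaningfully
        else:
            stale += 1
        if stale >= patience:               # loss stopped improving
            return history                  # converged
    return history
```

A partial run of this kind yields timing and throughput numbers without ever producing a usable model, which is exactly the gap Kharya describes between 1T and MT-NLG.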
Further, he said that training large-scale models to convergence takes weeks or months, depending on the size of the supercomputer used. Pointing to the table on the GitHub page, Kharya explained that the listings are ‘scaling studies’. These measure what kind of performance can be obtained without training a model to convergence, by doing partial training runs of a few minutes each at different scales and model sizes.
Drawing an analogy with a car’s ‘miles per gallon’ rating, Kharya said that figures for metrics like ‘achieved teraFLOPs’ are ‘real data points’ measured by conducting partial training runs – a way to gauge what it takes to train and deploy a particular model before committing to doing so.
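A scaling study of this sort can be roughed out as follows: time a handful of training steps, then convert throughput into achieved FLOP/s. The `6 × parameters` FLOPs-per-token figure below is a widely used rule of thumb for a dense transformer's forward-plus-backward pass, not a number from the MT-NLG paper, and `step_fn` is a hypothetical stand-in for one training step.

```python
# Illustrative scaling study: measure step time, derive achieved FLOP/s.
import time

def measure_throughput(step_fn, n_steps=10):
    """Average wall-clock seconds per training step over a short partial run."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return (time.perf_counter() - start) / n_steps

def achieved_flops(params, tokens_per_batch, sec_per_step):
    """~6 FLOPs per parameter per token (forward + backward) for a
    dense transformer -- a common estimate, used here as an assumption."""
    return 6 * params * tokens_per_batch / sec_per_step
```

Running `measure_throughput` at several model sizes for a few minutes each gives exactly the kind of ‘miles per gallon’ data points Kharya describes, with no converged model required.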
He explained further that different customers use different models and need estimates: if they were to bring a model of a given size online on an NVIDIA platform, how much computing resource would they need to invest; or, given a fixed amount of computing resources, how long would it take to train the model.
Similarly, the data points in FLOPs tell a customer how long they would need a cloud instance, or how large an instance they would need, for a given amount of training time. In other words, Megatron-Turing NLG, with 530 billion parameters, is the largest model whose neural weights are fully developed enough to perform on benchmark tests, for which Microsoft and NVIDIA reported several results. What is unique, said Kharya, is the ability to deploy such a large model across parallelised infrastructure.
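The kind of estimate described above can be made with the same rule of thumb: total training FLOPs divided by the cluster's sustained FLOP/s. All inputs in the worked example are illustrative assumptions – the token count, GPU count, and per-GPU peak are not published MT-NLG figures; only the ~50% efficiency mirrors the number quoted later in this article.

```python
# Back-of-the-envelope training-time estimate from a FLOPs budget.
def training_days(params, tokens, n_gpus, peak_tflops_per_gpu, efficiency=0.5):
    """Estimated wall-clock days = total FLOPs / sustained cluster FLOP/s.
    efficiency=0.5 mirrors the ~50% of theoretical peak cited by Kharya."""
    total_flops = 6 * params * tokens                 # dense-transformer rule of thumb
    sustained = n_gpus * peak_tflops_per_gpu * 1e12 * efficiency
    return total_flops / sustained / 86_400           # 86,400 seconds per day

# Hypothetical example: a 530B-parameter model on 300B tokens,
# with 2,000 GPUs at 120 peak teraFLOPs each.
days = training_days(530e9, 300e9, 2000, 120)         # roughly three months
```

The result lands in the weeks-to-months range Kharya mentions, which is why customers want these data points before committing resources.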
According to him, as these models continue to scale, they exceed the memory of a single GPU and sometimes do not even fit in the memory of a single server. Using the Megatron software to split models between different GPUs and different servers – combining ‘data parallelism and model parallelism’ with smarter networking – very high efficiency can be achieved: “50 per cent of theoretical peak performance of GPUs,” added Kharya. That is a very high number, amounting to hundreds of teraFLOPs for every GPU.
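The model-parallelism idea – splitting one layer's weights across devices – can be shown in miniature with plain Python lists. This is a conceptual stand-in for what Megatron does across real GPUs: shard a weight matrix column-wise, let each "device" compute its slice of the output, then concatenate the partial results.

```python
# Minimal illustration of tensor model parallelism on plain lists.
def matmul(x, w):
    """x @ w for row-major nested lists."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]

def split_columns(w, n_devices):
    """Shard a weight matrix column-wise, one shard per 'device'."""
    cols = list(zip(*w))
    k = len(cols) // n_devices
    shards = [cols[i * k:(i + 1) * k] for i in range(n_devices)]
    return [[list(row) for row in zip(*shard)] for shard in shards]

def parallel_forward(x, w, n_devices=2):
    """Each device multiplies its shard; outputs are concatenated,
    standing in for the all-gather a real GPU cluster would perform."""
    outs = [matmul(x, shard) for shard in split_columns(w, n_devices)]
    return [sum(rows, []) for rows in zip(*outs)]
```

Data parallelism, by contrast, would replicate the whole matrix on every device and split the batch `x` instead; large-scale training frameworks combine both.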
So, will NVIDIA and Microsoft ‘train to convergence’ an actual one-trillion-parameter model? To this, Kharya said that everyone in the industry is working on these really large models and it is going to happen, but only time will tell by whom and when – we will have to wait and watch.
Currently, MT-NLG is not a commercial product; it is a research project between NVIDIA and Microsoft. However, NVIDIA’s website has a catalogue page offering dozens of models, including transformer-based language models and other kinds of neural networks for classification, language translation, object detection, text-to-speech, recommender engines, sentiment analysis, and more.
Kharya said that the models are ‘pre-trained’ and ready to be used for inference, though a few customers enhance a model further with additional training runs on their own data.