Earlier this week, in partnership with Microsoft, NVIDIA introduced one of the largest transformer language models, the Megatron-Turing Natural Language Generation (MT-NLG) model with 530 billion parameters. The language model is powered by DeepSpeed and Megatron transformer models.
Interestingly, MT-NLG has roughly three times as many parameters as the largest existing models of its kind, including GPT-3 (175 billion parameters), Turing NLG (17 billion parameters), Megatron-LM (8 billion parameters), and others.
By comparison, the Chinese government-backed Beijing Academy of Artificial Intelligence’s (BAAI) Wu Dao 2.0 (1.75 trillion parameters) and Google’s Switch Transformer (1.6 trillion parameters) have higher parameter counts and remain among the largest transformer language models in the space.
Evolution of Language Models
In recent years, transformer-based language models in NLP have witnessed rapid progress, fueled by computation at scale, large datasets, and advanced algorithms to train these models. As a result, these models generalise as effective zero- or few-shot learners, with high accuracy on many tasks and datasets. Some notable downstream applications include code autocompletion, summarization, automatic dialogue generation, translation, and semantic search.
Here is a timeline graph of some of the popular language models, whose sizes have grown at an exponential rate in recent years (as shown below):
How is MT-NLG different from others?
Training such models (GPT-3, Megatron-LM, Turing NLG, etc.) is challenging for two main reasons. Firstly, it is no longer possible to fit the parameters of these models in the memory of even the largest GPU. Secondly, the large number of computing operations required can result in unrealistically long training times if special attention is not paid to optimising the algorithms, software, and hardware stack.
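To put the first point in numbers, here is a rough back-of-the-envelope sketch. It assumes 2 bytes per parameter for fp16 weights and roughly 16 bytes per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 weights, momentum, and variance); both figures are common rules of thumb rather than numbers from NVIDIA or Microsoft, and activations are ignored.

```python
# Back-of-the-envelope memory estimate for a 530-billion-parameter model.
# Assumptions (not from the article): 2 bytes/param for fp16 weights and
# about 16 bytes/param for mixed-precision Adam training state.

PARAMS = 530e9
GPU_MEMORY_BYTES = 80e9  # A100 80GB, using decimal gigabytes for simplicity


def gpus_needed(bytes_per_param: float) -> float:
    """Return how many GPUs' worth of memory the model states would occupy."""
    return PARAMS * bytes_per_param / GPU_MEMORY_BYTES


weights_only = gpus_needed(2)     # fp16 weights alone
training_state = gpus_needed(16)  # weights + gradients + optimizer state

print(f"fp16 weights alone: {weights_only:.2f} GPUs")
print(f"training state: {training_state:.1f} GPUs")
```

Under these assumptions, the weights alone would fill more than 13 A100 80GB GPUs, and the full training state more than 100, before counting any activations.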
However, training MT-NLG was possible thanks to NVIDIA’s and Microsoft’s combined effort, in which they achieved unprecedented training efficiency by combining SOTA GPU-accelerated training infrastructure with a distributed learning software stack. The team also built high-quality, realistic training corpora with hundreds of billions of tokens and co-developed training recipes to improve optimisation efficiency and stability.
SOTA supercomputing clusters such as the NVIDIA Selene and Microsoft Azure NDv4 were used to train this model. But achieving the full potential of these supercomputers requires parallelism across thousands of GPUs, along with efficiency and scalability in both memory and compute.
On their own, existing parallelism techniques such as data, pipeline, or tensor-slicing involve trade-offs in memory and compute efficiency, so no single one of them can be used to train models at this scale. Here’s why:
- Data parallelism might help achieve good compute efficiency, but it replicates model states and cannot utilise aggregate distributed memory.
- Pipeline parallelism can scale efficiently across nodes, but it requires large batch sizes, coarse-grained parallelism, and perfect load balancing, which is impractical at scale.
- Tensor-slicing requires significant communication between GPUs, which limits compute efficiency beyond a single node, where high-bandwidth NVLink is unavailable.
By combining NVIDIA Megatron-LM and Microsoft DeepSpeed, the duo created an efficient and scalable 3D parallel system capable of blending data, pipeline, and tensor-slicing based parallelism to address these challenges.
To build MT-NLG with 530 billion parameters, each model replica spans 280 NVIDIA A100 GPUs, with 8-way tensor-slicing within a node and 35-way pipeline parallelism across nodes (8 × 35 = 280). The data parallelism from DeepSpeed was then used to scale out further to thousands of GPUs.
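One way to picture this layout is to give every GPU a (data, pipeline, tensor) coordinate. The sketch below is purely illustrative: the function name and the rank ordering (tensor index varying fastest, so the eight tensor-parallel GPUs of a stage sit in one 8-GPU node) are assumptions, not the actual Megatron-LM or DeepSpeed mapping.

```python
from typing import Tuple

TENSOR_PARALLEL = 8     # from the article: 8-way tensor-slicing
PIPELINE_PARALLEL = 35  # from the article: 35-way pipeline parallelism
GPUS_PER_REPLICA = TENSOR_PARALLEL * PIPELINE_PARALLEL  # 280


def coordinates(rank: int) -> Tuple[int, int, int]:
    """Map a global GPU rank to (data_idx, pipeline_stage, tensor_idx).

    Assumed ordering: tensor index varies fastest, then pipeline stage,
    then data-parallel replica.
    """
    data_idx, within_replica = divmod(rank, GPUS_PER_REPLICA)
    pipeline_stage, tensor_idx = divmod(within_replica, TENSOR_PARALLEL)
    return data_idx, pipeline_stage, tensor_idx


print(GPUS_PER_REPLICA)  # 280
print(coordinates(0))    # (0, 0, 0)
print(coordinates(279))  # (0, 34, 7): last GPU of the first replica
print(coordinates(280))  # (1, 0, 0): first GPU of the second replica
```

With this ordering, ranks 0 to 7 form the first pipeline stage of the first replica, and every further block of 280 ranks is another data-parallel copy of the model.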
In terms of hardware, the model was trained with mixed precision on the NVIDIA DGX SuperPOD-based Selene supercomputer, which is powered by 560 DGX A100 servers networked with HDR InfiniBand in a full fat-tree configuration. Each DGX A100 has 8 NVIDIA A100 80GB Tensor Core GPUs, fully connected by NVSwitch and NVLink. Microsoft used a similar reference architecture for Azure NDv4 cloud supercomputers.
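The figures above imply some simple arithmetic. The sketch below works out an upper bound on how many 280-GPU model replicas would fit if the whole machine were dedicated to one job; the article does not say how many GPUs were actually used, so the replica count is hypothetical.

```python
SERVERS = 560           # DGX A100 servers in Selene (from the article)
GPUS_PER_SERVER = 8     # A100 GPUs per DGX A100 (from the article)
GPUS_PER_REPLICA = 280  # 8-way tensor x 35-way pipeline (from the article)

total_gpus = SERVERS * GPUS_PER_SERVER                     # GPUs in the cluster
servers_per_replica = GPUS_PER_REPLICA // GPUS_PER_SERVER  # nodes per model copy
max_replicas = total_gpus // GPUS_PER_REPLICA              # upper bound only

print(total_gpus)           # 4480
print(servers_per_replica)  # 35
print(max_replicas)         # 16
```

Each replica occupies 35 servers, one per pipeline stage, which matches the 35-way pipeline parallelism across nodes.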
The team built its training dataset based on The Pile. Curated by EleutherAI, it is an 825 GB English text corpus targeted at training large-scale language models. The data consists of text scraped from various sources on the internet, including Wikipedia, news clippings, and academic journal repositories. In addition, to diversify the training data, the team also collected Common Crawl (CC) snapshots and the RealNews and CC-Stories datasets.
Furthermore, they deduplicated the data at the document level, using min-hash LSH to compute a sparse document graph and then finding its connected components to identify duplicate documents. Within each group of duplicates, the team used a priority order based on the quality of the datasets to choose which document to keep. Lastly, they used n-gram-based filtering to remove downstream task data from the training datasets to avoid contamination.
As a result, the researchers ended up with a set of 15 datasets consisting of 339 billion tokens. While training, they opted to blend the datasets into heterogeneous batches according to variable sampling weights (shown below), emphasising higher-quality datasets. They trained the model on 270 billion tokens.
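The deduplication step can be illustrated with a small, pure-Python sketch. This is not the team's pipeline: the shingle size, signature length, and band count are arbitrary, the document IDs and priorities are made up, and it assumes the priority order decides which copy in each duplicate group is kept.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32  # signature length (assumed)
BANDS = 8        # LSH bands of NUM_HASHES // BANDS rows each (assumed)
SHINGLE = 3      # word n-gram size used for shingling (assumed)


def shingles(text):
    words = text.lower().split()
    count = max(1, len(words) - SHINGLE + 1)
    return {" ".join(words[i:i + SHINGLE]) for i in range(count)}


def minhash(text):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    return tuple(
        min(hashlib.sha1(f"{seed}:{g}".encode()).digest()[:8] for g in grams)
        for seed in range(NUM_HASHES)
    )


def duplicate_groups(docs):
    """Return connected components of the graph linking LSH bucket mates."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for band in range(BANDS):
            buckets[(band, sig[band * rows:(band + 1) * rows])].append(doc_id)

    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Chaining neighbours in a bucket keeps the graph sparse but connected.
    for members in buckets.values():
        for left, right in zip(members, members[1:]):
            parent[find(left)] = find(right)

    groups = defaultdict(set)
    for doc_id in docs:
        groups[find(doc_id)].add(doc_id)
    return list(groups.values())


def deduplicate(docs, priority):
    """Keep one document per group; a lower priority number wins."""
    return sorted(min(g, key=lambda d: (priority[d], d)) for g in duplicate_groups(docs))


# Hypothetical documents; an exact duplicate is used so the grouping is certain.
docs = {
    "web-1": "megatron turing nlg is a large language model with 530 billion parameters",
    "news-1": "megatron turing nlg is a large language model with 530 billion parameters",
    "web-2": "selene is a supercomputer built from dgx a100 servers",
}
priority = {"news-1": 0, "web-1": 1, "web-2": 1}
kept = deduplicate(docs, priority)
print(kept)  # ['news-1', 'web-2']
```

Near-duplicates that are not identical land in a shared bucket only with some probability, which rises with their Jaccard similarity and depends on the band and row settings.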
|Dataset|Tokens (billions)|Weights (per cent)|Epochs|
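The relationship between tokens, weights, and epochs in such a table can be sketched as follows. The two datasets and their numbers are invented for illustration and are not the article's values; only the 270-billion-token budget comes from the text. A dataset's expected number of epochs is its share of the token budget divided by its size.

```python
import random

# Hypothetical datasets: name -> (tokens in billions, sampling weight).
DATASETS = {
    "high_quality": (20.0, 0.5),
    "web_crawl": (200.0, 0.5),
}
TRAINING_TOKENS = 270.0  # billions of tokens trained on (from the article)


def epochs(tokens, weight, budget=TRAINING_TOKENS):
    """Expected passes over a dataset: weight * budget / dataset size."""
    return weight * budget / tokens


def sample_batch(batch_size, rng):
    """Draw the source dataset of each sample according to the weights."""
    names = list(DATASETS)
    weights = [DATASETS[name][1] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)


for name, (tokens, weight) in DATASETS.items():
    print(name, round(epochs(tokens, weight), 3))

batch = sample_batch(8, random.Random(0))
print(batch)  # a mix of dataset names, drawn by weight
```

In this made-up case, the small high-quality dataset is seen 6.75 times while the large crawl is seen less than once, which is how weighting emphasises higher-quality data.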
NVIDIA, along with Microsoft, evaluated MT-NLG by selecting eight tasks spanning five different areas of NLP. Namely,
- Text prediction task LAMBADA: Here, the model predicts the last word of a given paragraph.
- Reading comprehension tasks like RACE-h and BoolQ: The model generates answers to questions based on a given paragraph.
- Commonsense reasoning tasks PiQA, Winogrande, and HellaSwag: Each requires some commonsense knowledge beyond statistical language patterns to solve.
- Natural language inference: ANLI-R2 and HANS target the typical failure cases of past models.
- Word sense disambiguation task WiC: The model evaluates polysemy understanding from context.
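In zero-, one- and few-shot evaluation, tasks like these are usually posed to the model as a text prompt with zero, one, or a few solved examples placed before the actual question. The sketch below shows the idea; the "Q:/A:" template, the function name, and the sample questions are all hypothetical and are not the prompts used for MT-NLG.

```python
def build_prompt(examples, query, shots):
    """Prepend `shots` solved examples to the query; shots=0 is zero-shot."""
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples[:shots]]
    blocks.append(f"Q: {query}\nA:")  # the model is expected to continue from here
    return "\n\n".join(blocks)


# Hypothetical yes/no questions in the spirit of a reading comprehension task.
examples = [
    ("Is the sky green?", "no"),
    ("Is water wet?", "yes"),
]

zero_shot = build_prompt(examples, "Is fire cold?", shots=0)
one_shot = build_prompt(examples, "Is fire cold?", shots=1)
print(one_shot)
```

The only difference between the settings is how many solved examples precede the final question; the model's weights are not updated.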
The team evaluated MT-NLG in zero-, one- and few-shot settings without searching for the optimal number of shots. Here are the results:
What about Bias?
As large language models rapidly advance the state of the art in language generation, they also suffer from issues like bias and toxicity. NVIDIA and Microsoft believe that understanding and removing these problems in language models is crucial. But how are they solving this?
As per their observation, MT-NLG picks up stereotypes and biases from the data it is trained on. The duo said they aim to address this problem through continued research that helps quantify the model’s bias.
MT-NLG is an example of what becomes possible when supercomputers like NVIDIA Selene or Microsoft Azure NDv4 are used with software breakthroughs such as DeepSpeed and Megatron-LM to train large language models. NVIDIA and Microsoft believe that the quality and results they have obtained are a step forward in the journey towards unlocking the full potential of AI in natural language processing (NLP).
In other words, the innovations of Microsoft’s DeepSpeed and NVIDIA’s Megatron-LM will benefit existing and future AI model development, making large AI models cheaper and faster to train. “We look forward to how ‘MT-NLG’ will shape tomorrow’s products and motivate the community to push the boundaries of natural language processing (NLP) even further,” said the NVIDIA team.