NVIDIA, Microsoft Introduce New Language Model MT-NLG With 530 Billion Parameters, Leaves GPT-3 Behind

MT-NLG has roughly 3x the parameters of GPT-3, the largest existing model of its kind, dwarfing earlier models such as Turing NLG and Megatron-LM.

Earlier this week, in partnership with Microsoft, NVIDIA introduced one of the largest transformer language models, the Megatron-Turing Natural Language Generation (MT-NLG) model with 530 billion parameters. The language model is powered by DeepSpeed and Megatron transformer models.

Interestingly, MT-NLG has roughly 3x the parameters of the largest existing model of its kind, GPT-3 (175 billion parameters), and far more than Turing NLG (17 billion parameters), Megatron-LM (8 billion parameters), and others.

In comparison, the Chinese government-backed Beijing Academy of Artificial Intelligence’s (BAAI) Wu Dao 2.0 (1.75 trillion parameters) and Google’s Switch Transformer (1.6 trillion parameters) are some of the largest transformer language models in the space. 


Previously, Microsoft had also partnered with OpenAI, acquiring exclusive rights to use its GPT-3 language models for commercial use cases.

Evolution of Language Models

In recent years, transformer-based language models in NLP have witnessed rapid progress, fuelled by computation at scale, large datasets, and advanced algorithms to train these models. As a result, these models generalise as effective zero- or few-shot learners, with high accuracy on many tasks and datasets. Some notable downstream applications include code autocompletion, summarisation, automatic dialogue generation, translation, semantic search, etc. 

Here is a timeline graph of some of the popular language models, whose size has grown at an exponential rate over the past few years: 

[Figure: timeline of language model sizes]

(Source: NVIDIA) 

How is MT-NLG different from others? 

Training such models – GPT-3, Megatron-LM, Turing NLG, etc. – is challenging for two main reasons. Firstly, it is no longer possible to fit the parameters of these models in the memory of even the largest GPU. Secondly, the large number of computing operations required can result in unrealistically long training times if special attention is not paid to optimising the algorithms, software, and hardware stack. 
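A back-of-envelope calculation shows why memory is the first bottleneck. Assuming fp16 weights and gradients with fp32 Adam optimizer states (the usual mixed-precision recipe, not a figure stated in the article), the training state of a 530-billion-parameter model runs to several terabytes:

```python
# Rough estimate (assumption: fp16 weights/gradients, fp32 Adam states,
# as in typical mixed-precision training) of the memory footprint of a
# 530B-parameter model.

PARAMS = 530e9
BYTES_FP16 = 2          # weights and gradients in half precision
BYTES_FP32 = 4          # fp32 master weights + two Adam moment buffers

weights_gb   = PARAMS * BYTES_FP16 / 1e9
grads_gb     = PARAMS * BYTES_FP16 / 1e9
optimizer_gb = PARAMS * 3 * BYTES_FP32 / 1e9   # master copy + momentum + variance

total_gb = weights_gb + grads_gb + optimizer_gb
print(f"weights: {weights_gb:.0f} GB, total training state: {total_gb:.0f} GB")
```

Even the weights alone (~1,060 GB) dwarf the 80 GB of memory on a single A100 GPU, before counting gradients, optimizer states, or activations.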

However, training MT-NLG was possible thanks to the combined efforts of NVIDIA and Microsoft, which achieved unprecedented training efficiency by combining SOTA GPU-accelerated training infrastructure with a distributed learning software stack. The team also built high-quality, realistic training corpora with hundreds of billions of tokens and co-developed training recipes to improve optimisation efficiency and stability. 

MT-NLG Explained 

SOTA supercomputing clusters such as the NVIDIA Selene and Microsoft Azure NDv4 were used to train this model. But achieving the full potential of these supercomputers requires parallelism across thousands of GPUs that is efficient and scalable in both memory and compute.

However, existing parallelism techniques such as data, pipeline, and tensor-slicing parallelism each involve trade-offs in memory and compute efficiency and cannot, on their own, be used to train models at this scale. Here’s why: 

  • Data parallelism achieves good compute efficiency, but it replicates the model state on every GPU and cannot utilise aggregate distributed memory.  
  • Pipeline parallelism scales efficiently across nodes, but it requires large batch sizes, coarse-grained parallelism, and perfect load balancing, which is impossible at scale. 
  • Tensor-slicing needs significant communication between GPUs, which limits compute efficiency beyond a node, where high-bandwidth NVLink is unavailable. 
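To make the tensor-slicing trade-off concrete, here is a minimal pure-Python sketch (an illustration, not Megatron code) of slicing a linear layer's weight matrix column-wise across two "devices": each device computes a partial output, and the results must then be gathered, which is exactly the inter-GPU communication the technique requires.

```python
# Illustrative sketch of column-wise tensor slicing: two "devices" each hold
# half the columns of a weight matrix, compute partial outputs, and the
# results are concatenated (the communication step).

def matmul(a, b):
    """Plain-Python matrix multiply for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

x = [[1.0, 2.0],
     [3.0, 4.0]]                 # activations (batch of 2, hidden size 2)
W = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 2.0]]       # full weight matrix (2 x 4)

W0 = [row[:2] for row in W]      # columns held by "device 0"
W1 = [row[2:] for row in W]      # columns held by "device 1"

y0 = matmul(x, W0)               # partial output on device 0
y1 = matmul(x, W1)               # partial output on device 1

# All-gather: concatenate partial outputs along the column dimension.
y = [r0 + r1 for r0, r1 in zip(y0, y1)]

assert y == matmul(x, W)         # identical to the unsliced computation
```

Within a node, this gather runs over fast NVLink; across nodes it falls back to the slower network, which is why tensor-slicing is usually kept inside a node.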

By combining NVIDIA Megatron-LM and Microsoft DeepSpeed, the duo created an efficient and scalable 3D parallel system capable of blending data, pipeline, and tensor-slicing based parallelism to address these challenges. 

For building MT-NLG with 530 billion parameters, each model replica consists of 280 NVIDIA A100 GPUs, with 8-way tensor-slicing within a node and 35-way pipeline parallelism across nodes. DeepSpeed’s data parallelism was then used to scale out further to thousands of GPUs. 
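The arithmetic behind that layout is straightforward; the replica-count figure below is an inference from the hardware numbers in the article, not a number the teams reported:

```python
# 3D-parallel layout described above (degrees taken from the article).

tensor_parallel   = 8     # 8-way tensor slicing within a node (8 GPUs per DGX A100)
pipeline_parallel = 35    # 35-way pipeline parallelism across nodes

gpus_per_replica = tensor_parallel * pipeline_parallel
print(gpus_per_replica)                      # GPUs needed per model replica

# DeepSpeed data parallelism then replicates this layout. For example, the
# full Selene system of 560 DGX A100 nodes (8 GPUs each) could hold:
total_gpus = 560 * 8
data_parallel_degree = total_gpus // gpus_per_replica
print(total_gpus, data_parallel_degree)      # total GPUs and replica count
```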

In terms of the hardware system, the model training is done via mixed precision on the NVIDIA DGX SuperPOD-based Selene supercomputer, backed by 560 DGX A100 servers networked with HDR InfiniBand in a full fat-tree configuration. Each DGX A100 has 8 NVIDIA A100 80GB Tensor Core GPUs, fully connected by NVSwitch and NVLink. Microsoft used a similar reference architecture for Azure NDv4 cloud supercomputers. 

Training Datasets 

The team built its training dataset based on The Pile. Curated by EleutherAI, it consists of an 825 GB English text corpus targeted at training large-scale language models. The data consists of text scraped from various sources on the internet, including Wikipedia, news clippings, and academic journal repositories. In addition, to diversify the training, the team also collected Common Crawl (CC) snapshots and the RealNews and CC-Stories datasets. 

Furthermore, they deduplicated at the document level, using min-hash LSH to compute a sparse document graph and its connected components to identify duplicate documents. The team then applied a priority order based on the quality of the datasets. Lastly, they used n-gram-based filtering to remove downstream task data from the training datasets to avoid contamination. 
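The min-hash idea can be sketched in a few lines. This is a simplified illustration (not the team's actual pipeline, which uses locality-sensitive hashing to avoid all-pairs comparison): near-duplicate documents share most of their shingles, so their min-hash signatures agree on most slots, and pairs above a similarity threshold become edges of the sparse document graph whose connected components are duplicate clusters.

```python
import hashlib
from itertools import combinations

# Simplified min-hash deduplication sketch (assumption: toy shingling and an
# all-pairs comparison; a real pipeline buckets signatures with LSH instead).

NUM_PERM = 64

def shingles(text, n=3):
    """Character-level n-gram shingles of a normalised document."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash(sh):
    """One hash function per signature slot, seeded by the slot index."""
    return tuple(
        min(int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in sh)
        for seed in range(NUM_PERM))

def similarity(a, b):
    """Estimated Jaccard similarity = fraction of matching signature slots."""
    return sum(x == y for x, y in zip(a, b)) / NUM_PERM

docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The quick brown fox jumps over the lazy dog!",   # near-duplicate of "a"
    "c": "Large language models are trained on web-scale corpora.",
}
sigs = {k: minhash(shingles(v)) for k, v in docs.items()}

# Edges of the sparse document graph: pairs above a similarity threshold.
dupes = [(i, j) for i, j in combinations(docs, 2)
         if similarity(sigs[i], sigs[j]) > 0.8]
print(dupes)    # only the near-duplicate pair is linked
```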

As a result, the researchers ended up with a set of 15 datasets totalling 339 billion tokens. During training, they blended the datasets into heterogeneous batches according to the variable sampling weights shown below, emphasising higher-quality datasets. The model was trained on 270 billion tokens.

Dataset             Tokens (billions)   Weight (%)   Epochs
Books3                           25.7         14.3      1.5
OpenWebText2                     14.8         19.3      3.6
Stack Exchange                   11.6          5.7      1.4
PubMed Abstracts                  4.4          2.9      1.8
Wikipedia                         4.2          4.8      3.2
Gutenberg (PG-19)                 2.7          0.9      0.9
BookCorpus2                       1.5          1.0      1.8
NIH ExPorter                      0.3          0.2      1.8
Pile-CC                          49.8          9.4      0.5
ArXiv                            20.8          1.4      0.2
GitHub                           24.3          1.6      0.2
CC-2020-50                       68.7         13.0      0.5
CC-2021-04                       82.6         15.7      0.5
RealNews                         21.9          9.0      1.1
CC-Stories                        5.3          0.9      0.5
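Blending by weight rather than by raw size means higher-quality corpora are sampled more often than their token counts alone would imply (e.g. OpenWebText2 is seen for 3.6 epochs while Pile-CC is seen for only half an epoch). A minimal sketch of such weighted sampling, using a few of the weights from the table above (a simplified illustration, not the actual training code):

```python
import random

# Weighted sampling of source datasets when assembling a heterogeneous batch.
# Weights (per cent) taken from a few rows of the table above.
weights = {
    "Books3": 14.3,
    "OpenWebText2": 19.3,
    "Pile-CC": 9.4,
    "CC-2021-04": 15.7,
}

datasets = list(weights)
probs = [weights[d] for d in datasets]

random.seed(0)  # fixed seed for reproducibility of the sketch
batch_sources = random.choices(datasets, weights=probs, k=8)
print(batch_sources)   # which dataset each of the 8 batch examples came from
```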

Experiments 

NVIDIA, along with Microsoft, evaluated MT-NLG on eight tasks spanning five different areas of NLP: 

  • Text prediction task LAMBADA: Here, the model predicts the last word of a given paragraph.
  • Reading comprehension tasks like RACE-h and BoolQ: The model generates answers to questions based on a given paragraph. 
  • Commonsense reasoning tasks PiQA, Winogrande, and HellaSwag: Each requires some commonsense knowledge beyond statistical language patterns to solve. 
  • Natural language inference: ANLI-R2 and HANS target the typical failure cases of past models. 
  • Word sense disambiguation task WiC: Evaluates the model’s understanding of polysemy from context. 

The team evaluated MT-NLG in zero-, one-, and few-shot settings without searching for the optimal number of shots. Here are the results: 

Task         Zero-shot   One-shot   Few-shot
LAMBADA         0.766*     0.731*     0.872*
BoolQ           0.782      0.825      0.848
RACE-h          0.479      0.484      0.479
PiQA            0.820*     0.810*     0.832*
HellaSwag       0.802      0.802      0.824
WinoGrande      0.730      0.737      0.789
ANLI-R2         0.366      0.397      0.396
HANS            0.607      0.649      0.702
WiC             0.486      0.513      0.585
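The zero-, one-, and few-shot settings differ only in how many solved demonstrations are prepended to the test input; no gradient updates are involved. A sketch of a generic prompt-construction routine for a LAMBADA-style task (assumption: this is a typical prompt format, not the exact template the MT-NLG team used):

```python
# Assemble a k-shot evaluation prompt: k solved demonstrations, then the
# test passage the model must complete. k=0 is zero-shot, k=1 one-shot, etc.

def build_prompt(test_passage, demonstrations, k):
    """Prepend k solved demonstrations to the test passage."""
    parts = [f"{passage} {answer}" for passage, answer in demonstrations[:k]]
    parts.append(test_passage)           # the model must complete this one
    return "\n\n".join(parts)

# Hypothetical demonstrations for illustration only.
demos = [
    ("She poured the tea into her favourite", "cup"),
    ("He locked the door and pocketed the", "key"),
]
test_passage = "The orchestra fell silent as the conductor raised his"

zero_shot = build_prompt(test_passage, demos, k=0)
one_shot  = build_prompt(test_passage, demos, k=1)
few_shot  = build_prompt(test_passage, demos, k=2)

print(few_shot)
```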

What about Bias? 

As large language models rapidly advance the SOTA in language generation, they also suffer from issues like bias and toxicity. NVIDIA and Microsoft believe that understanding and removing these problems in language models is crucial. But the question is, how are they solving this? 

As per their observations, MT-NLG picks up stereotypes and biases from the data it is trained on. The two companies said they look to address this problem through continued research and work to quantify the model’s bias. 

What’s Next? 

MT-NLG is an example of what becomes possible when supercomputers like NVIDIA Selene or Microsoft Azure NDv4 are combined with the software breakthroughs of DeepSpeed and Megatron-LM to train large language AI models. NVIDIA and Microsoft believe that the quality and results they have obtained are a step forward in the journey towards unlocking the full potential of AI in natural language processing (NLP).  

In other words, the innovations of Microsoft’s DeepSpeed and NVIDIA’s Megatron-LM will benefit existing and future AI model development, thereby making large AI models cheaper and faster to train. “We look forward to how ‘MT-NLG’ will shape tomorrow’s products and motivate the community to push the boundaries of (NLP) natural language processing even further,” said the NVIDIA team. 

Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.