Microsoft, NVIDIA test waters for a large-scale generative language model with promising results

Thanks to self-supervised learning and few-shot, zero-shot, and fine-tuning techniques, language models are growing significantly in size with each passing day, calling for high-performance hardware, software, and algorithms to train such large models.

Microsoft and NVIDIA have joined hands to train one of the largest monolithic transformer-based language models, the 530-billion-parameter Megatron-Turing NLG (MT-NLG). The duo claims state-of-the-art (SOTA) accuracies on natural language processing (NLP) tasks by adapting the model to downstream tasks via few-shot, zero-shot, and fine-tuning techniques.
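
To make the distinction concrete, here is a minimal illustration of zero-shot versus few-shot adaptation; the task, prompts, and function names are invented for this sketch and are not taken from the paper. The model's weights stay frozen, and the task is conveyed entirely through the prompt:

```python
# Hypothetical illustration of zero-shot vs. few-shot prompting.
# The model's weights are never updated; only the prompt changes.
def zero_shot_prompt(review: str) -> str:
    # Zero-shot: a task description only, with no solved examples.
    return ("Classify the sentiment of this review as Positive or Negative.\n"
            f"Review: {review}\nSentiment:")

def few_shot_prompt(review: str) -> str:
    # Few-shot: a handful of solved examples precede the query,
    # so the model can pick up the task in context.
    examples = ("Review: Loved every minute of it.\nSentiment: Positive\n"
                "Review: A complete waste of time.\nSentiment: Negative\n")
    return ("Classify the sentiment of each review as Positive or Negative.\n"
            + examples + f"Review: {review}\nSentiment:")

print(few_shot_prompt("The plot was thin, but the acting was superb."))
```

Fine-tuning, by contrast, updates the model's weights on task-specific data rather than relying on the prompt alone.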

In the research paper “Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model,” the researchers from NVIDIA and Microsoft discussed the challenges of training neural networks at scale. They presented the 3D parallelism strategies and hardware infrastructure that enabled efficient training of MT-NLG.
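
For intuition, the arithmetic below sketches how 3D parallelism decomposes a cluster into tensor-, pipeline-, and data-parallel dimensions. The 8-way tensor and 35-way pipeline split is the one reported for MT-NLG; the total GPU count is an assumption for illustration only:

```python
# Back-of-the-envelope sketch of a 3D parallelism layout.
TENSOR_PARALLEL = 8     # each layer's weights sharded across 8 GPUs in a node
PIPELINE_PARALLEL = 35  # consecutive groups of layers placed on 35 stages
TOTAL_GPUS = 4480       # assumed cluster size (560 nodes x 8 GPUs each)

# One full copy (replica) of the model spans tensor x pipeline GPUs.
gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL   # 280

# The remaining dimension is data parallelism: identical replicas
# processing different slices of each training batch.
data_parallel = TOTAL_GPUS // gpus_per_replica           # 16

print(f"{gpus_per_replica} GPUs per model replica, "
      f"{data_parallel} data-parallel replicas")
```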

“Large language model training is challenging to stabilise, and experimentation can be costly; therefore, we documented our training configurations and datasets extensively to facilitate future research,” shared the researchers. 

They also analysed the social biases exhibited by MT-NLG and examined various factors that can affect in-context learning, bringing attention to certain limitations of the current generation of large language models. “We believe that our results and findings can help, shape, and facilitate future research in foundational, large-scale pretraining,” the researchers added. 

Unleashing the power of large-scale language models 

In October last year, the duo introduced MT-NLG. The SOTA language model is powered by Microsoft’s DeepSpeed library and NVIDIA’s Megatron-LM framework. It has roughly three times as many parameters as GPT-3 (175 billion parameters), the largest existing model of its kind, and dwarfs earlier models such as Turing NLG (17 billion parameters) and Megatron-LM (8 billion parameters), as well as the more recent EleutherAI GPT-NeoX, with 20 billion parameters, trained on CoreWeave GPUs. 

As part of Microsoft’s AI at Scale effort, the DeepSpeed team has investigated model applications and optimisations for mixture-of-experts (MoE) models. These models are said to reduce the cost of training and inference for large models while allowing next-generation models to be trained and served on today’s technology. 
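
For a sense of how MoE keeps compute per token low while parameter counts grow, here is a minimal top-1 routing layer in plain PyTorch; it is a sketch of the general idea, not DeepSpeed’s actual MoE implementation, and all sizes are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Toy mixture-of-experts layer: each token is routed to one expert,
    so only a fraction of the total parameters is active per token."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); pick the highest-probability expert per token.
        gate = F.softmax(self.router(x), dim=-1)    # (tokens, n_experts)
        weight, idx = gate.max(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                           # run only the routed tokens
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

moe = Top1MoE(d_model=64, d_ff=256, n_experts=4)
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

Doubling the number of experts here doubles the parameter count but leaves the per-token compute essentially unchanged, which is the cost advantage the DeepSpeed team is pursuing.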

In comparison, Google’s Switch Transformer (1.6 trillion parameters) and China’s Wu Dao 2.0 (1.75 trillion parameters) are the largest transformer language models in the space, though both are sparse mixture-of-experts models rather than monolithic ones. When it comes to large-scale language models and their use cases, however, Microsoft has been upping the game: it has partnered with OpenAI, acquiring the exclusive licence to use its GPT-3 language models for commercial use cases.

Microsoft-owned GitHub last year released GitHub Copilot, which is powered by Codex, an AI system created by OpenAI that has been trained on a selection of English-language text and source code from open sources, including code in public repositories on GitHub.

There is more

MT-NLG, with its 530 billion parameters, is not the largest language model developed by NVIDIA and Microsoft. Last year, Microsoft announced a bigger, more powerful model with one trillion (1T) parameters. This 1T model posts the highest numbers for every performance figure reported, including achieved teraFLOPs, batch size, and number of GPUs. 

[Image] Language model with one trillion parameters, developed by NVIDIA and Microsoft, highlighted in red (Source: GitHub)

This brings us to the question: if the one-trillion-parameter language model is bigger on every measure, how can MT-NLG, with 530 billion parameters, be the biggest?

To this, NVIDIA had said that the one-trillion-parameter model was never ‘trained to convergence’, the stage at which a model is fully trained and can be used to perform inference, i.e., make predictions. Instead, this particular model went through only a limited number of training passes (epochs), which does not lead to convergence. 
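
In code terms, training to convergence means running the optimisation loop until an explicit stopping criterion is met rather than for a small, fixed number of passes. A minimal sketch with simulated loss values (real training would monitor validation loss):

```python
def converged(history, patience=3, tol=1e-3):
    # Converged when the loss has not improved by more than `tol`
    # over the last `patience` evaluations.
    if len(history) <= patience:
        return False
    best_before = min(history[:-patience])
    return min(history[-patience:]) > best_before - tol

losses, loss = [], 10.0
for epoch in range(1000):
    loss *= 0.9            # stand-in for one epoch of optimisation
    losses.append(loss)
    if converged(losses):
        print(f"Converged after {epoch + 1} epochs, loss={loss:.4f}")
        break
```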

MT-NLG, with 530 billion parameters, is still a research project between NVIDIA and Microsoft and is yet to see the light of day as a commercial product. Check out NVIDIA’s catalogue page for other popular models it has made available, including transformer-based language models and other neural networks for classification, language translation, text-to-speech, object detection, recommender engines, sentiment analysis, and more.

Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
