Training Models With Over 100 Billion Parameters


Following the open-source release of the DeepSpeed library and the Zero Redundancy Optimiser (ZeRO), Microsoft announced an upgrade, ZeRO-2, in mid-2020 to train even larger neural networks. Training large-scale models comes with several challenges, such as hardware limitations and tradeoffs between computation and efficiency. To overcome the limits of both data parallelism and model parallelism, Microsoft had introduced ZeRO to train models with 100 billion parameters.

While ZeRO was known for reducing memory redundancy in data parallelism, the new framework, ZeRO-2, according to the company, allows training AI models with up to 170 billion parameters. ZeRO-2 further optimises memory consumption by reducing activation memory and fragmented memory, and it enables these memory optimisations on every single GPU. The release has also cut training time by about 30% for models like BERT, which trained in 44 minutes, compared to NVIDIA's record of 67 minutes. Such an advancement is a big leap over previous releases and over other frameworks that focus on distributing training across GPUs.

This new release improves on its predecessor's memory consumption. In its blog post, the company stated that, compared with the previous version, ZeRO-2 can train models twice as large and can train 100-billion-parameter models up to 10x faster.


Deep Dive Into ZeRO-2

Training large-scale deep learning models involves massive memory consumption, which comprises model state memory, activation memory and fragmented memory. ZeRO-2 optimises each of these three sources of memory consumption separately, as explained below:

Model State Memory

To optimise model state memory, ZeRO partitions three components of model state: optimiser states, gradients and parameters. According to the company, the predecessor only supported the first stage, optimiser state partitioning, saving up to 4x of memory. ZeRO-2 additionally partitions gradients, adding another 2x of savings and bringing the total to 8x compared with conventional data parallelism.
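As a back-of-the-envelope illustration (following the memory accounting in the ZeRO paper, not code from DeepSpeed itself), the per-GPU model-state memory under mixed-precision Adam can be sketched as follows. The stage names and the K = 12 multiplier are from the paper; the model size and GPU count are made-up examples.

```python
# Approximate per-GPU memory (bytes) for model states under mixed-precision
# Adam: 2M bytes for fp16 parameters, 2M for fp16 gradients, and K*M for
# optimiser states (K = 12: fp32 parameters, momentum and variance).

def model_state_bytes(params: float, num_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory for stage 0 (plain data parallelism),
    stage 1 (optimiser state partitioning) or stage 2 (+ gradient partitioning)."""
    K = 12  # Adam optimiser-state multiplier in mixed precision
    if stage == 0:
        return (2 + 2 + K) * params
    if stage == 1:
        return (2 + 2) * params + K * params / num_gpus
    if stage == 2:
        return 2 * params + (2 + K) * params / num_gpus
    raise ValueError("stage must be 0, 1 or 2")

GPUS = 64
M = 1.5e9  # e.g. a GPT-2-scale 1.5-billion-parameter model

baseline = model_state_bytes(M, GPUS, 0)  # 24 GB per GPU
stage1 = model_state_bytes(M, GPUS, 1)    # about 6.3 GB, approaching 4x savings
stage2 = model_state_bytes(M, GPUS, 2)    # about 3.3 GB, approaching 8x savings
```

As the GPU count grows, the partitioned terms shrink towards zero, so the savings approach the 4x and 8x figures quoted above.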

Activation Memory

Even after optimising model states, stored activations can still pose challenges, such as activation replication in existing model parallelism approaches. Although activation checkpointing can address these issues, the task gets challenging for large-scale models. ZeRO-2, with its activation partitioning, removes such replication, and it can further offload activation memory consumption to the host CPUs.
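As a hedged sketch, a DeepSpeed configuration fragment along these lines would enable activation partitioning and CPU offload; the key names follow DeepSpeed's activation-checkpointing documentation, while the values are illustrative placeholders, not recommendations.

```python
import json

# Illustrative DeepSpeed config fragment enabling activation partitioning
# and CPU offloading of checkpointed activations. Key names follow the
# DeepSpeed activation-checkpointing docs; the values are placeholders.
ds_config = {
    "activation_checkpointing": {
        "partition_activations": True,           # split activations across model-parallel GPUs
        "cpu_checkpointing": True,               # offload checkpointed activations to host CPU memory
        "contiguous_memory_optimization": True,  # copy checkpoints into a contiguous buffer
        "number_checkpoints": 4,                 # checkpoints to store (model dependent)
    }
}

print(json.dumps(ds_config, indent=2))
```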

Fragmented Memory

Because different tensors have different lifetimes and contiguous memory runs short, some memory gets fragmented while training large-scale deep learning models. To eliminate this issue, ZeRO-2 proactively manages memory based on the lifetimes of the different tensors.


Efficiency Evaluation Of ZeRO-2

ZeRO-2 also brings several advantages that help systems train bigger models: model scalability to hundreds of billions of parameters, up to 10x training speed, superlinear speedup, and the ability to train models of up to 13 billion parameters without model parallelism.

Scalability Of The Model

With the capability of supporting 170 billion parameters while training deep learning models, ZeRO-2 is one of the few systems that support models of that order of magnitude. The company stated that in trials conducted with 400 NVIDIA GPUs, the system scaled to around 200 billion parameters.

Robust Speed Of The Model

With better memory consumption comes faster training. Compared with the baseline of combining state-of-the-art model parallelism with data parallelism, the company reports that ZeRO-2 runs 100-billion-parameter models at 38 teraflops per GPU, around 30% of hardware peak. Moreover, ZeRO-2 reduces, and sometimes eliminates, the degree of model parallelism needed, which lowers communication cost and brings additional memory savings.

Superlinear Speedup With DeepSpeed

Along with the advantages above, ZeRO-2 reduces the memory footprint of the model states, which allows larger models, and larger batch sizes per GPU, to fit in memory; the higher per-GPU efficiency that results is what yields the superlinear speedup as more GPUs are added.

Democratising Training For Larger Models

ZeRO-2 can train models of up to 13 billion parameters without model parallelism, which typically requires refactoring the model. This advancement allows scientists and researchers to work with larger models without worrying about partitioning them across multiple GPUs.
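For context, enabling this in DeepSpeed is largely a configuration change rather than a model rewrite. A minimal, illustrative config might look like the sketch below; ZeRO stage 2 partitions optimiser states and gradients, while the batch and bucket sizes shown are placeholders, not tuned values.

```python
import json

# Illustrative DeepSpeed config selecting ZeRO stage 2: optimiser states
# and gradients are partitioned across data-parallel GPUs, while the
# parameters stay replicated, so no model refactoring is needed.
ds_config = {
    "train_batch_size": 64,               # placeholder global batch size
    "fp16": {"enabled": True},            # mixed-precision training
    "zero_optimization": {
        "stage": 2,                       # 1 = optimiser states only, 2 = + gradients
        "contiguous_gradients": True,     # reduce gradient memory fragmentation
        "overlap_comm": True,             # overlap gradient reduction with backward pass
        "reduce_bucket_size": 500000000,  # communication bucket size in bytes
    },
}

# In a training script this config would be handed to DeepSpeed, e.g.:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
print(json.dumps(ds_config, indent=2))
```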


Case In Point: BERT Training With ZeRO-2

Unlike previous releases, the highly optimised ZeRO-2 enhances computation by speeding up input and output processing on every single GPU. Along with building a solid foundation for scaling to larger deep learning models, these optimisations also refine the training performance of moderately-sized models like BERT.

With ZeRO-2, Microsoft set a record time of 44 minutes for training BERT, purely by improving software efficiency. To test this, Microsoft compared against two prominent implementations, NVIDIA BERT and HuggingFace BERT, where ZeRO-2 showed higher throughput at higher sequence lengths, with improvements of 28% and 62% respectively over the two. Furthermore, the company claimed that the system supports 1.8x larger batch sizes without running out of memory.

Much of this can be attributed to its highly optimised transformer kernels and asynchronous input and output, which together give models faster learning, better convergence and reduced redundancy. Furthermore, the company has stated that these optimisations are not BERT-specific and can easily be deployed across various workloads.

Wrapping Up

Along with ZeRO-2, Microsoft also announced an AI-first supercomputer designed to manage AI workloads and execute large pre-trained models like Turing-NLG. With the release of ZeRO-2 and its other announcements at Build 2020, Microsoft has taken a bold step towards dominating large-scale AI, giving strong competition to the likes of Google, NVIDIA and IBM.


Sejuti Das
Sejuti currently works as Associate Editor at Analytics India Magazine (AIM). Reach out at
