Microsoft Releases Latest Version Of DeepSpeed, Its Python Library For Deep Learning Optimisation

Recently, Microsoft announced new advancements in its popular deep learning optimisation library, DeepSpeed. The library is an important part of Microsoft’s AI at Scale initiative to enable next-generation AI capabilities at scale. 

DeepSpeed, the open-source deep learning training optimisation library, was unveiled in February this year along with ZeRO (Zero Redundancy Optimiser), a memory optimisation technology in the library that assists large-model training by improving scale, speed, cost, and usability. 

The researchers at the tech giant developed the library to make distributed training easy, efficient, and effective. DeepSpeed can now train a language model with one trillion parameters using as few as 800 NVIDIA V100 GPUs. 
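To get a sense of why trillion-parameter training is a memory problem before it is a compute problem, a rough back-of-the-envelope calculation helps. The sketch below is illustrative arithmetic, not code from DeepSpeed; it assumes the commonly cited figure of roughly 16 bytes of model state per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights, momentum, and variance).

```python
# Rough memory arithmetic for trillion-parameter training (illustrative).
params = 1_000_000_000_000           # one trillion parameters

# Mixed-precision Adam keeps, per parameter: fp16 weight (2 B) + fp16 grad (2 B)
# + fp32 master weight (4 B) + fp32 momentum (4 B) + fp32 variance (4 B).
bytes_per_param = 2 + 2 + 4 + 4 + 4  # = 16 bytes

model_state_tb = params * bytes_per_param / 1e12   # terabytes of model state
print(f"Model states: {model_state_tb:.1f} TB")

# 800 NVIDIA V100 GPUs with 32 GB of memory each:
cluster_tb = 800 * 32e9 / 1e12
print(f"Aggregate memory on 800 x 32 GB V100s: {cluster_tb:.1f} TB")
```

Even in aggregate, the cluster only just fits the model states, which is why partitioning them across GPUs, as ZeRO does, rather than replicating them on every GPU, is essential.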


DeepSpeed combines three powerful technologies to enable training of trillion-scale models and to scale to thousands of GPUs: data-parallel training, model-parallel training, and pipeline-parallel training. 


This deep learning library offers several notable features. Some of them are mentioned below:

  • Speed: DeepSpeed achieves high performance and fast convergence through a combination of efficiency optimisations (compute, memory, I/O) and effectiveness optimisations (advanced hyperparameter tuning and optimisers).
  • Memory Efficiency: The library provides memory-efficient data parallelism that enables training large models without resorting to model parallelism.
  • Scalability: DeepSpeed supports efficient data parallelism, model parallelism, pipeline parallelism and their combinations, also known as 3D parallelism.
  • Communication Efficiency: Pipeline parallelism of DeepSpeed reduces communication volume during distributed training, which allows users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth.
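The three forms of parallelism compose multiplicatively: the data-parallel, model-parallel (tensor-slicing), and pipeline-parallel degrees multiply to give the total GPU count. The sketch below uses hypothetical degrees chosen for illustration; the actual layout for a given model and cluster is a tuning decision.

```python
def total_gpus(data_parallel: int, model_parallel: int, pipeline_parallel: int) -> int:
    """Total GPUs consumed by a 3D-parallel layout: the degrees multiply."""
    return data_parallel * model_parallel * pipeline_parallel

# A hypothetical layout: each model replica is split 4-way by tensor slicing
# and 8-way into pipeline stages, and 25 such replicas train in data parallel,
# which happens to add up to the 800 V100s mentioned above.
print(total_gpus(data_parallel=25, model_parallel=4, pipeline_parallel=8))
```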

Why Use DeepSpeed

Training a large, advanced deep learning model is complex as well as challenging. It involves a number of considerations, such as model design and setting up state-of-the-art training techniques, including distributed training, mixed precision, and gradient accumulation, among others. 

Even after putting in a lot of effort, there is no certainty that the system will perform up to expectations or achieve the desired convergence rate. This is because large models easily run out of memory with pure data parallelism, and it is hard to utilise model parallelism in such cases. This is where DeepSpeed comes into the picture. The library not only addresses these drawbacks but also accelerates model development and training.
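DeepSpeed exposes most of these techniques through a single JSON configuration file rather than hand-written training code. The fragment below is a sketch based on DeepSpeed's documented configuration keys; the values are illustrative, not a recommended recipe.

```json
{
  "train_batch_size": 512,
  "gradient_accumulation_steps": 16,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2
  }
}
```

With a file like this, mixed precision, gradient accumulation, and ZeRO-style memory optimisation are switched on declaratively instead of being re-implemented for every project.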

What’s New

According to the blog post announcing the release, the DeepSpeed library adds four new system technologies that offer extreme compute, memory, and communication efficiency, and that power model training with billions to trillions of parameters. The post notes that DeepSpeed can handle long input sequences and runs on hardware ranging from a single GPU to low-end clusters with very slow Ethernet networks.

The new technologies are mentioned below:

  • Trillion-parameter model training with 3D parallelism: The library enables a flexible combination of three parallelism approaches: ZeRO-powered data parallelism, pipeline parallelism, and tensor-slicing model parallelism. This 3D parallelism adapts to the varying needs of different workloads to power extremely large models with over a trillion parameters, while achieving near-perfect memory- and throughput-scaling efficiency.
  • 10x bigger model training on a single GPU with ZeRO-Offload: The ZeRO-2 technology is extended to leverage both CPU and GPU memory for training large models. Using a single NVIDIA V100 GPU, users can now run models of up to 13 billion parameters without running out of memory, ten times larger than existing approaches allow, while still obtaining competitive throughput.
  • Powering 10x longer sequences and 6x faster execution through DeepSpeed Sparse Attention: The library offers sparse attention kernels, an instrumental technology for supporting long sequences of model input, whether text, image, or sound. It also outperforms state-of-the-art sparse implementations with 1.5–3x faster execution.
  • 1-bit Adam with up to 5x communication volume reduction: Adam is an effective optimiser for training many large-scale deep learning models. The researchers at Microsoft introduced a new algorithm, known as 1-bit Adam, along with an efficient implementation. The new algorithm reduces communication volume by up to 5x while achieving convergence efficiency similar to Adam. 
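The saving in 1-bit Adam comes from transmitting only the sign of each update element plus a single shared scale, while an error-feedback buffer carries the discarded quantisation error into the next step. The plain-Python sketch below illustrates that sign-plus-scale idea only; it is a conceptual illustration, not Microsoft's implementation, and the function name is hypothetical.

```python
def one_bit_compress(values, error):
    """Compress a gradient-like vector to one sign bit per element plus a
    single shared scale, carrying the quantisation error forward
    (error feedback) so nothing is permanently lost."""
    # Add back the residual left over from the previous step.
    corrected = [v + e for v, e in zip(values, error)]
    # One shared magnitude: the mean absolute value of the vector.
    scale = sum(abs(c) for c in corrected) / len(corrected)
    # One bit per element: just the sign.
    signs = [1.0 if c >= 0 else -1.0 for c in corrected]
    # What a receiver would reconstruct from (signs, scale).
    decompressed = [scale * s for s in signs]
    # Remember what was lost so it is re-applied next step.
    new_error = [c - d for c, d in zip(corrected, decompressed)]
    return signs, scale, new_error

signs, scale, err = one_bit_compress([0.4, -0.1, 0.3], [0.0, 0.0, 0.0])
# Each element now travels as one sign bit plus one shared float,
# instead of a full float per element.
```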

Wrapping Up

Researchers at the tech giant have continued to innovate at a fast rate, pushing the boundaries of speed and scale for deep learning training. The library has already enabled the creation of Turing Natural Language Generation (Turing-NLG), one of the largest language models, with 17 billion parameters and state-of-the-art accuracy.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
