Microsoft Releases Latest Version Of DeepSpeed, Its Python Library For Deep Learning Optimisation

Recently, Microsoft announced new advancements in DeepSpeed, its popular deep learning optimisation library. The library is a key part of Microsoft’s AI at Scale initiative to enable next-generation AI capabilities at scale. 

DeepSpeed, the open-source deep learning training optimisation library, was unveiled in February this year along with ZeRO (Zero Redundancy Optimiser), a memory optimisation technology in the library that assists large-model training by improving scale, speed, cost, and usability. 

The researchers at the tech giant developed the library to make distributed training easy, efficient, and effective. DeepSpeed can now train a language model with one trillion parameters using as few as 800 NVIDIA V100 GPUs. 


DeepSpeed combines three powerful technologies to enable training of trillion-scale models and to scale to thousands of GPUs: data-parallel training, model-parallel training, and pipeline-parallel training. 


The deep learning library offers several notable features. Some of them are mentioned below:


  • Speed: DeepSpeed achieves high performance and fast convergence through a combination of efficiency optimisations for compute, memory, and I/O, and effectiveness optimisations such as advanced hyperparameter tuning and optimisers.
  • Memory Efficiency: The library provides memory-efficient data parallelism and enables training models without model parallelism.
  • Scalability: DeepSpeed supports efficient data parallelism, model parallelism, pipeline parallelism and their combinations, also known as 3D parallelism.
  • Communication Efficiency: Pipeline parallelism of DeepSpeed reduces communication volume during distributed training, which allows users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth.
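In practice, these features are switched on through a JSON configuration file passed to the library. A minimal sketch in Python (the keys below follow DeepSpeed's documented config schema, but the values are illustrative placeholders rather than tuned settings):

```python
import json

# Illustrative DeepSpeed configuration (keys follow DeepSpeed's
# documented JSON schema; the numbers are placeholders, not tuned values).
ds_config = {
    "train_batch_size": 256,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},  # mixed-precision training
    "zero_optimization": {
        "stage": 2,  # partition optimiser state and gradients across GPUs
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload to CPU memory
    },
}

# DeepSpeed reads this as a JSON file, e.g. ds_config.json:
print(json.dumps(ds_config, indent=2))
```

The same file is then passed to the training script via DeepSpeed's launcher, so the parallelism and memory strategies can be changed without touching model code.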

Why Use DeepSpeed

Training a large, state-of-the-art deep learning model is both complex and challenging. It involves many considerations, such as model design and setting up advanced training techniques, including distributed training, mixed precision, and gradient accumulation, among others. 

Even after putting in a lot of effort, there is no certainty that the system will perform up to expectations or achieve the desired convergence rate. Large models easily run out of memory under pure data parallelism, and model parallelism is hard to apply in such cases. This is where DeepSpeed comes into the picture: the library not only addresses these drawbacks but also accelerates model development and training.
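A rough back-of-the-envelope calculation illustrates this memory wall (a sketch assuming the commonly cited 16 bytes of model state per parameter for mixed-precision Adam training; the figures are approximations, not DeepSpeed measurements):

```python
# Approximate model-state memory per parameter with mixed-precision Adam:
# fp16 weights (2 B) + fp16 gradients (2 B) + fp32 master weights,
# momentum, and variance (4 B each) = 16 bytes per parameter.
BYTES_PER_PARAM = 16

def model_state_gb(num_params: float) -> float:
    """Model-state memory in GB if one GPU holds a full replica."""
    return num_params * BYTES_PER_PARAM / 1e9

v100_memory_gb = 32  # a single NVIDIA V100 has 16-32 GB of HBM

for billions in (1.5, 13, 175):
    need = model_state_gb(billions * 1e9)
    fits = "fits" if need <= v100_memory_gb else "does not fit"
    print(f"{billions:>6.1f}B params -> ~{need:,.0f} GB of model state ({fits})")
```

Under pure data parallelism every GPU holds a full replica of this state, so even a 13-billion-parameter model overwhelms a single 32 GB V100; partitioning or offloading that state is exactly what ZeRO targets.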

What’s New

According to Microsoft’s blog post, the DeepSpeed library adds four new system technologies that offer extreme compute, memory, and communication efficiency and power model training with billions to trillions of parameters. The post notes that these technologies can handle long input sequences and run on hardware ranging from a single GPU to low-end clusters with very slow ethernet networks, and more.

The new technologies are mentioned below:

  • Trillion-parameter model training with 3D parallelism: The library enables a flexible combination of three parallelism approaches: ZeRO-powered data parallelism, pipeline parallelism, and tensor-slicing model parallelism. This 3D parallelism adapts to the varying needs of workloads to power extremely large models with over a trillion parameters while achieving near-perfect memory-scaling and throughput-scaling efficiency.
  • 10x bigger model training on a single GPU with ZeRO-Offload: The ZeRO-2 technology is extended to leverage both CPU and GPU memory for training large models. Using a single NVIDIA V100 GPU, users can now run models of up to 13 billion parameters without running out of memory, 10 times bigger than existing approaches, while obtaining competitive throughput.
  • Powering 10x longer sequences and 6x faster execution through DeepSpeed Sparse Attention: The library offers sparse attention kernels, an instrumental technology for supporting long sequences of model input, whether text, image, or sound. It also outperforms state-of-the-art sparse implementations with 1.5–3x faster execution.
  • 1-bit Adam with up to 5x communication volume reduction: Adam is an effective optimiser for training many large-scale deep learning models. The researchers at Microsoft introduced a new algorithm, known as 1-bit Adam, with an efficient implementation. The new algorithm reduces communication volume by up to 5x while achieving convergence efficiency similar to Adam. 
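To give a feel for the 1-bit idea, here is a simplified sketch of sign-based compression with error feedback in plain Python (an illustration of the general technique, not DeepSpeed's actual implementation; the function name and toy numbers are made up):

```python
def onebit_compress(values, error):
    """Compress a vector to signs plus one scale, with error feedback.

    Returns (signs, scale, new_error). Each entry can be sent as 1 bit
    instead of 32, which is where the communication savings come from;
    the quantisation error is carried over and added to the next step
    so it is not lost.
    """
    corrected = [v + e for v, e in zip(values, error)]
    scale = sum(abs(c) for c in corrected) / len(corrected)
    signs = [1.0 if c >= 0 else -1.0 for c in corrected]
    decompressed = [scale * s for s in signs]
    new_error = [c - d for c, d in zip(corrected, decompressed)]
    return signs, scale, new_error

# One step on a toy "update" vector:
update = [0.4, -0.1, 0.25, -0.3]
error = [0.0] * len(update)
signs, scale, error = onebit_compress(update, error)
print("signs:", signs, "scale:", round(scale, 4))
```

Note that DeepSpeed's reported figure is up to 5x end-to-end rather than the raw 32x per message, since 1-bit Adam still runs an initial warm-up phase with uncompressed communication.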

Wrapping Up

Researchers at the tech giant continue to innovate at a fast pace, pushing the boundaries of speed and scale for deep learning training. The library has already enabled researchers to create Turing Natural Language Generation (Turing-NLG), one of the largest language models, with 17 billion parameters and state-of-the-art accuracy.


Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
