Recently, Microsoft announced the new advancements in the popular deep learning optimisation library known as DeepSpeed. This library is an important part of Microsoft’s new AI at Scale initiative to enable next-generation AI capabilities at scale.
DeepSpeed, the open-source deep learning training optimisation library was unveiled in February this year along with ZeRO (Zero Redundancy Optimiser), which is a memory optimisation technology in the library that assists large model training by improving scale, speed, cost, and usability.
The researchers at the tech giant developed this library in order to make distributed training easy, efficient, and effective. DeepSpeed now can train a language model with one trillion parameters using as few as 800 NVIDIA V100 GPUs.
DeepSpeed has combined three powerful technologies to enable training of trillion-scale models and to scale to thousands of GPUs, which include data-parallel training, model parallel training, and pipeline parallel training.
There are several intuitive features of this deep learning library. Some of them are mentioned below:
- Speed: DeepSpeed achieves high performance and fast convergence through a combination of efficiency optimisations on computing, memory, IO, etc. and effectiveness optimisations on advanced hyperparameter tuning and optimisers.
- Memory Efficiency: The library provides memory-efficient data parallelism and enables training models without model parallelism.
- Scalability: DeepSpeed supports efficient data parallelism, model parallelism, pipeline parallelism and their combinations, also known as 3D parallelism.
- Communication Efficiency: Pipeline parallelism of DeepSpeed reduces communication volume during distributed training, which allows users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth.
Why Use DeepSpeed
Training a large and advanced deep learning model is complex as well as challenging. It includes a number of conditions, such as model design, setting up state-of-the-art training techniques including distributed training, mixed precision, gradient accumulation, among others.
Even after putting in a lot of effort, there is no certainty that the system will perform up to the expectation or achieve the desired convergence rate. This is because large models easily run out of memory with pure data parallelism, and it is hard to utilise model parallelism in such cases. This is where DeepSpeed comes into the picture. The library not only addresses these drawbacks but also accelerates model development and training.
According to this blog post, the DeepSpeed library specifically added four new system technologies that offer extreme compute, memory, communication efficiency and powers model training with billions to trillions of parameters. The blog post mentioned that it can compute long input sequences and power on hardware systems with a single GPU, low-end clusters with very slow ethernet networks, and more.
The new technologies are mentioned below-
- Trillion parameter model training with 3D parallelism: The library enables a flexible combination of three parallelism approaches, which are ZeRO-powered data parallelism, pipeline parallelism, and tensor-slicing model parallelism. This 3D parallelism adapts to the varying needs of the workload requirements in order to power extremely large models with over a trillion parameters while achieving near-perfect memory-scaling and throughput-scaling efficiency.
- 10x bigger model training on a single GPU with ZeRO-Offload: The ZeRO-2 technology is extended in order to leverage both CPU and GPU memory for training large models. With the help of a single NVIDIA V100 GPU, users now can run models of up to 13 billion parameters without running out of memory. It is 10 times bigger than the existing approaches while obtaining competitive throughput.
- Powering 10x longer sequences and 6x faster execution through DeepSpeed Sparse Attention: The library offers sparse attention kernels, which is an instrumental technology to support long sequences of model inputs, whether for text, image as well as sound. It also outperforms state-of-the-art sparse implementations with 1.5–3x faster execution.
- 1-bit Adam with up to 5x communication volume reduction: Adam is an effective optimiser for training many large-scale deep learning models. The researchers at Microsoft introduced a new algorithm, known as 1-bit Adam, with efficient implementation. The new algorithm reduces communication volume by up to 5x while achieving similar convergence efficiency to Adam.
Researchers at the tech giant have been continuing to innovate at a fast rate, pushing the boundaries of speed and scale for deep learning training. The library has enabled researchers to create the Turing Natural Language Generation (Turing-NLG), also known as one of the largest language models with 17 billion parameters and state-of-the-art accuracy.