Natural Language Processing (NLP) has made considerable strides in recent years on the back of the availability of larger datasets and computation at scale. Recent works have demonstrated that large language models can have high accuracy on many NLP datasets without additional finetuning.
However, researchers find it challenging to train large language models, primarily due to two main reasons:
- Since the GPU memory is limited, it isn’t easy to fit these large models on a single or even multi-GPU server.
- To train these models, one needs to run several computing operations, resulting in unrealistic training times; for example, a model like GPT-3 with 175 billion parameters may take 36 years on eight V100 GPUs.
In 2019, NVIDIA introduced MegatronLM, an 8.3 billion transformer language with model and data parallelism trained on 512 GPUs. At the time, it was the largest transformer model ever trained. While this method works for models with up to 20 billion parameters on DGX A100 servers with eight A100 GPUs, it is ineffective in larger models. This is because the models need to be split across multiple servers, causing problems like:
- Slower communication rate since the all-reduce communication now needs to go through inter-server links. This is slower than the high bandwidth NVLink available inside a DGX A100 server.
- A greater degree of model parallelism decreases GPU utilisation.
To overcome these limitations, teams from NVIDIA, Microsoft Research, and Stanford have proposed a technique which involves smart combination of different parallelism methods — tensor, pipeline, and data parallelism to achieve a two-order magnitude increase in the size of models that can be trained, as compared to the current systems.
Modes of parallelism
There are three major modes of parallelism.
Model parallelism: Individual layers of the model are partitioned over multiple devices. The team deployed a partitioning strategy employed by Megatron for transformer layers that form the base for language models.
Pipeline parallelism: The layers of the model are shared across multiple devices. When used on repetitive transformer-based models, each device can be assigned to an equal number of transformer layers.
Here, the batch is split into smaller micro-batches, and the execution is pipelined across these micro-batches. The pipeline schemes ensure the inputs see consistent weight updates in both backward and forward passes.
Data parallelism: Each worker has a copy of the full model. The workers aggregate gradients periodically to ensure that all workers see a consistent version of the weights. For large models that do not fit a single worker, data parallelism can be used on smaller model shards.
Scaling model training
The team showed a way to combine tensor, pipeline and data parallelism methods to train large models. “The combination of tensor parallelism within a multi-GPU server, pipeline parallelism across multi-GPU servers, and data parallelism, allows us to practically train models with a trillion parameters with graceful scaling in a highly-optimised cluster environment with high-bandwidth links between GPUs on the same server and across servers,” the researchers said.
The team performed training iterations on models with a trillion parameters at 502 petaFLOP/s on 3072 GPUs by combining three techniques. They achieved a per-GPU throughput of 52 percent, against the 36 percent throughput achieved with previous techniques.
To achieve this high throughput, the team implemented several innovative techniques and careful engineering along multiple axes: efficient kernel implementation. This approach allowed most computations to be compute-bound rather than memory bound. The smart partitioning of computation graphs over the devices reduced the number of bytes required to be sent over network links. This also resulted in limiting idle device periods, optimisation of domain-specific communication, and fast hardware. The authors also studied interactions between the components which affects throughput.
In the future, the team will work on optimising the pipeline schedule. Tradeoffs associated with hyperparameters such as micro batch size, global batch size, and activation recomputation on throughput will be explored. They will also study the implication of using scheduling without pipeline flushes, as used in the present set up.
Read the full paper here.
Subscribe to our NewsletterGet the latest updates and relevant offers by sharing your email.
I am a journalist with a postgraduate degree in computer network engineering. When not reading or writing, one can find me doodling away to my heart’s content.