“Training GPT-3 with 175 billion parameters would require approximately 36 years with 8 V100 GPUs.”
Training large machine learning models calls for enormous compute (on the order of hundreds of exaflops), efficient memory management to reduce the memory footprint, and other optimisations. Meanwhile, language models have grown at a remarkable pace: in the span of two years, parameter counts went from billions to a trillion. The memory on a single GPU falls short when training such large models, which is why researchers at NVIDIA and elsewhere have turned to parallelism strategies. Model parallelism, for example, splits a model's parameters across multiple GPUs. Most model parallelism techniques, however, are either difficult to use or hard to scale.
Scaling to many GPUs comes with its own challenges, since training involves both compute-intensive and memory-intensive components. For instance, GPUs training state-of-the-art personal recommendation models are heavily affected by model architecture choices such as dense versus sparse features or the dimensions of the neural network. These models often contain large embedding tables that do not fit into limited GPU memory. It gets even worse when GPUs are made to run billion-parameter language models like BERT or GPT-3.
- GPU memory capacity is limited. It is impossible to fit large models on a single GPU or even on a multi-GPU server.
- Training times become unrealistically long.
Typically, model training relies on weak scaling and distributed data parallelism to grow the batch size with the number of GPUs. Though this approach lets the model train on larger datasets, it comes with a trade-off: all parameters must still fit on a single GPU. This is where model parallelism comes into the picture. Model-parallel training overcomes this limitation by partitioning the model across multiple GPUs. General-purpose model-parallel frameworks such as GPipe and Mesh-TensorFlow were previously proposed for this purpose: GPipe divides groups of layers across different processors, while Mesh-TensorFlow employs intra-layer model parallelism.
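The data-parallel regime mentioned above can be sketched in a few lines. The toy example below (plain NumPy, simulating workers sequentially on one machine; function and variable names are illustrative, not from any framework) averages per-shard gradients of a linear model, standing in for an all-reduce:

```python
import numpy as np

def data_parallel_step(weights, shards, lr=0.1):
    """One data-parallel SGD step: each worker computes the gradient on its
    shard of the batch, then the gradients are averaged (the all-reduce in a
    real multi-GPU setup) so every replica applies the same update.
    Toy linear regression with mean-squared-error loss."""
    grads = []
    for X, y in shards:                               # each worker's local shard
        pred = X @ weights
        grads.append(2 * X.T @ (pred - y) / len(y))   # local MSE gradient
    avg_grad = np.mean(grads, axis=0)                 # simulated all-reduce
    return weights - lr * avg_grad                    # identical update everywhere
```

For equal-sized shards, averaging the per-shard gradients reproduces the full-batch gradient exactly, which is why all replicas stay in sync.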
Other model parallelism methods, such as tensor and pipeline parallelism, have been proposed too. Unfortunately, the NVIDIA researchers wrote, naive usage leads to fundamental scaling issues at thousands of GPUs, with expensive cross-node communication and idle periods spent waiting on other devices among the reasons. Moreover, without model parallelism, the sheer number of compute operations required can result in unrealistically long training times. For example, OpenAI’s GPT-3 has 175 billion parameters and, according to the researchers, would require approximately 36 years to train on eight V100 GPUs, or seven months on 512 V100 GPUs assuming perfect data-parallel scaling.
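The quoted figures are consistent with simple linear scaling, which a quick back-of-the-envelope calculation confirms:

```python
# Sanity check of the scaling claim: if training takes 36 years on 8 GPUs,
# perfect data-parallel scaling to 512 GPUs divides the time by 512 / 8 = 64.
years_on_8 = 36
months_on_512 = years_on_8 * (8 / 512) * 12  # 36 / 64 years, in months
```

That works out to 6.75 months, matching the researchers’ figure of roughly seven months.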
The researchers at NVIDIA, Stanford and Microsoft Research combined tensor parallelism and pipeline parallelism in their experiments to use GPUs efficiently for large language models. Parallelism enables efficient training of models too large to fit in the memory of a single GPU.
- Data parallelism: every worker holds a copy of the full model, and the input dataset is sharded, i.e. horizontally partitioned across workers. Gradients are periodically aggregated so that all workers see a consistent version of the weights. Data parallelism can also be applied to smaller model shards.
- Pipeline parallelism: the layers of a model are sharded across multiple devices. On repetitive transformer-based models, each device can be assigned an equal number of transformer layers. A batch is split into smaller microbatches, and execution is pipelined across those microbatches.
- Tensor model parallelism: individual layers of the model are partitioned over multiple devices.
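As a concrete illustration of the tensor (intra-layer) case, the sketch below splits a single linear layer’s weight matrix column-wise across simulated devices. In a real multi-GPU setting each slice would live on a different GPU and the concatenation would be an all-gather; this is a CPU-only NumPy sketch, not Megatron-LM’s actual implementation:

```python
import numpy as np

def column_parallel_linear(x, W, n_parts=2):
    """Tensor model parallelism on one linear layer, simulated on CPU:
    the weight matrix W is split column-wise, each 'device' computes its
    slice of the output independently, and the slices are concatenated
    (an all-gather on real hardware). Illustrative sketch only."""
    shards = np.split(W, n_parts, axis=1)   # one column block per device
    partial = [x @ Ws for Ws in shards]     # independent local matmuls
    return np.concatenate(partial, axis=-1) # gather the output slices
```

Because each column block only contributes to its own output columns, the result is identical to the unsplit matmul `x @ W`, while each device stores only `1 / n_parts` of the weights.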
The researchers used tensor model parallelism within a DGX A100 server and pipeline parallelism across DGX A100 servers. According to them, combining these parallelism strategies allows models to scale up to a trillion parameters.
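One way to picture this combination: with eight GPUs per DGX A100 server, the tensor-parallel rank varies within a server while the pipeline stage varies across servers. The helper below is hypothetical (not from NVIDIA’s codebase) and only illustrates that mapping:

```python
def parallel_coords(global_gpu_id, gpus_per_server=8):
    """Map a global GPU index to (pipeline_stage, tensor_parallel_rank),
    assuming tensor parallelism spans the GPUs inside one server and
    pipeline parallelism spans servers. Hypothetical illustration."""
    return global_gpu_id // gpus_per_server, global_gpu_id % gpus_per_server
```

Under this mapping, GPUs 0 through 7 form pipeline stage 0, GPU 8 starts stage 1, and so on; GPUs in the same stage exchange heavy tensor-parallel traffic over the fast intra-server links.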
“Larger language models are dramatically more useful for NLP tasks than smaller models.”
Modern-day machine learning models outperform the optimal “classical” model on the test set. Last year, researchers at OpenAI demonstrated with the double-descent curve that larger models are “easy” to optimise: methods such as stochastic gradient descent (SGD) converge to global minima of the training risk in over-parameterised regimes. According to the researchers, large interpolating models can have low test risk and are easy to optimise. Models to the left of the interpolation peak are likely to have optimisation properties qualitatively different from those to the right.
The main focus of these experiments has been language models, because of their wide availability, but recommendation systems can benefit from some of these parallelism strategies too. For the whole community to reap the benefits of these mega models, hardware parallelism is definitely the area to watch.