While smaller models can be trained on a light amount of data, as larger models were introduced, the demand for processing data has outgrown the computational power of the machinery. Eventually, it made more sense to distribute the machine learning workload across multiple machines instead of having a centralised system. The volume of data has increased to such an extent that it has become difficult to move or centralise. Or even in large enterprises where transaction processing is such an extensive process that the relevant data is stored in a different location, centralised solutions aren’t suitable. Distributed machine learning quickens the pace of training for neural networks by using a cluster of GPUs during training. When a model is fine-tuning hyperparameters, parallelised training looks through multiple configurations at the same time, which makes it faster.
There are two main branches under distributed training, called data parallelism and model parallelism.
In data parallelism, the dataset is split into ‘N’ parts, where ‘N’ is the number of GPUs. These parts are then assigned to parallel computational machines. Post that, gradients are calculated for each copy of the model, after which all the models exchange the gradients. In the end, the values of these gradients are averaged.
Subscribe to our Newsletter
Join our editors every weekday evening as they steer you through the most significant news of the day, introduce you to fresh perspectives, and provide unexpected moments of joy
For every GPU or node, the same parameters are used for the forward propagation. A small batch of data is sent to every node, and the gradient is computed normally and sent back to the main node. There are two strategies using which distributed training is practised called synchronous and asynchronous.
As a part of sync training, the model sends different parts of the data into each accelerator. Every model has a complete copy of the model and is trained solely on a part of the data. Every single part starts forward pass simultaneously and computes a different output and gradient.
Synchronous training uses an all-reduce algorithm which collects all the trainable parameters from various workers and accelerators.
While synchronous training can be advantageous in many ways, it is harder to scale and can result in workers staying idle at times. In asynchronous, workers don’t have to wait on each other during downtime in maintenance or a difference in capacity or priorities. Especially if devices are smaller, less reliable and more limited, asynchronous training may be a better choice. If the devices are stronger and with a powerful connection, synchronous training is more suited.
In model parallelism, every model is partitioned into ‘N’ parts, just like data parallelism, where ‘N’ is the number of GPUs. Each model is then placed on an individual GPU. The batch of GPUs is then calculated sequentially in this manner, starting with GPU#0, GPU#1 and continuing until GPU#N. This is forward propagation. Backward propagation on the other end begins with the reverse, GPU#N and ends at GPU#0.
Model parallelism has some obvious benefits. It can be used to train a model such that it does not fit into just a single GPU. But when computing is moving in a sequential fashion, for example, when GPU#1 is in computation, the others simply lie idle. This can be resolved by shifting to an asynchronous style of GPU functioning.
Source: Research Paper
There are multiple mini-batches in progress in the pipeline; first the initial mini-batches update weights, the mini-batches next in the pipeline adopt stale weights to derive gradients. In model parallelism, staleness leads to instability and low model accuracy. A study titled ‘Efficient and Robust Parallel DNN Training through Model Parallelism as Multi-GPU Platform’ that tested model parallelism against data parallelism showed that models using data parallelism increase in their accuracy as training proceeds, but the accuracy starts fluctuating with model parallelism.
The study also demonstrated that data parallelism is expected to have a scalability issue. Model parallelism seemed more apt for DNN models as a bigger number of GPUs was added.
In a recent and prominent instance, Google AI’s large language model PaLM or Pathways Language Model used a combination of data and model parallelism as a part of its state-of-the-art training. The model was scaled using data parallelism at the Pod level across two Cloud TPU v4 Pods while each Pod used model parallelism with standard data.