Data parallelism vs. model parallelism – How do they differ in distributed training?

Model parallelism proved more apt for DNN models as the number of GPUs grew.

While smaller models can be trained on modest amounts of data, the introduction of larger models meant that the demand for processing data outgrew the computational power of a single machine. Eventually, it made more sense to distribute the machine learning workload across multiple machines instead of relying on a centralised system. The volume of data has also grown to the point where it is difficult to move or centralise, and in large enterprises, where transaction processing is so extensive that the relevant data ends up stored in different locations, centralised solutions are not suitable. Distributed machine learning quickens the pace of training neural networks by spreading the work across a cluster of GPUs. During hyperparameter tuning, parallelised training can also evaluate multiple configurations at the same time, which makes the search faster.

Distributed training has two main branches: data parallelism and model parallelism.

Data parallelism

In data parallelism, the dataset is split into ‘N’ parts, where ‘N’ is the number of GPUs, and each part is assigned to a separate machine. Each machine holds a full copy of the model and computes gradients on its own part of the data; the copies then exchange their gradients, and the values are averaged to produce a single update.

Every GPU or node uses the same parameters for forward propagation. A mini-batch of data is sent to each node, the gradient is computed as usual, and the result is sent back to the main node. Distributed training is practised under two strategies: synchronous and asynchronous.
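A minimal PyTorch-style sketch of this flow is shown below. It is illustrative only: MyModel and MyDataset are placeholders, and it assumes one process per GPU launched with a tool such as torchrun.

```python
# Minimal data-parallelism sketch (illustrative; MyModel and MyDataset are placeholders).
# Assumes one process per GPU, launched e.g. with `torchrun --nproc_per_node=N train.py`.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")              # join the process group
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DDP(MyModel().cuda(rank), device_ids=[rank]) # every rank holds a full model copy

dataset = MyDataset()                                # placeholder dataset
sampler = DistributedSampler(dataset)                # shards the dataset across the ranks
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for inputs, targets in loader:
    inputs, targets = inputs.cuda(rank), targets.cuda(rank)
    loss = loss_fn(model(inputs), targets)
    loss.backward()                                  # DDP averages gradients across ranks here
    optimizer.step()
    optimizer.zero_grad()
```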

Synchronous training

In synchronous training, different parts of the data are sent to each accelerator. Every accelerator holds a complete copy of the model and trains only on its share of the data. All copies start the forward pass at the same time and each computes a different output and gradient.

Synchronous training uses an all-reduce algorithm, which aggregates the gradients of the trainable parameters from the various workers and accelerators so that every copy applies the same update.
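In essence, the all-reduce step sums each gradient tensor across workers and divides by the number of workers. A hedged sketch of that operation with torch.distributed (assuming the process group is already initialised) could look like this:

```python
import torch.distributed as dist

def all_reduce_gradients(model):
    """Average each parameter's gradient across all workers (illustrative sketch)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum this gradient over all workers
            param.grad /= world_size                           # turn the sum into an average
```

In practice, libraries such as PyTorch’s DistributedDataParallel or Horovod perform this aggregation automatically and overlap it with backpropagation.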

Asynchronous training

While synchronous training can be advantageous in many ways, it is harder to scale and can leave workers idle at times. In asynchronous training, workers do not have to wait for one another, whether the delay is caused by maintenance downtime or by differences in capacity and priority. Asynchronous training may be the better choice when the devices are smaller, less reliable and more limited; when the devices are powerful and well connected, synchronous training is more suitable.

Model parallelism 

In model parallelism, the model itself is partitioned into ‘N’ parts, where ‘N’ is again the number of GPUs, and each part is placed on an individual GPU. A batch is then computed sequentially across the GPUs, starting with GPU#0, moving to GPU#1 and continuing until GPU#N; this is forward propagation. Backward propagation runs in reverse, beginning at GPU#N and ending at GPU#0.
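A rough sketch of the idea with two GPUs follows; the two-stage split and the layer sizes are made up for illustration, not taken from any particular model.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Toy model split across two GPUs; the layer sizes are arbitrary."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))     # GPU#0 computes its part first...
        return self.stage1(x.to("cuda:1"))  # ...then GPU#1 continues with the rest

model = TwoStageModel()
logits = model(torch.randn(32, 1024))       # backward would run in reverse, GPU#1 -> GPU#0
```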

Model parallelism has an obvious benefit: it makes it possible to train a model that does not fit on a single GPU. But because the computation moves sequentially, while GPU#1 is busy the other GPUs simply lie idle. This can be mitigated by shifting to an asynchronous, pipelined style of GPU use, as sketched below.
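One common remedy, sketched here as a continuation of the two-stage example above, is to split each batch into micro-batches and stream them through the pipeline so that GPU#1 can work on one micro-batch while GPU#0 starts the next (CUDA kernel launches are asynchronous, so the stages can overlap). The number of micro-batches is arbitrary in this sketch.

```python
# Continues the TwoStageModel sketch above (torch is already imported there).
def pipelined_forward(model, batch, micro_batches=4):
    """Overlap the two stages by streaming micro-batches through the pipeline."""
    outputs = []
    splits = batch.chunk(micro_batches)
    prev = model.stage0(splits[0].to("cuda:0")).to("cuda:1")
    for split in splits[1:]:
        outputs.append(model.stage1(prev))                    # GPU#1 finishes micro-batch i...
        prev = model.stage0(split.to("cuda:0")).to("cuda:1")  # ...while GPU#0 starts micro-batch i+1
    outputs.append(model.stage1(prev))
    return torch.cat(outputs)
```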

With multiple mini-batches in flight in the pipeline, the earliest mini-batches update the weights first, while the mini-batches behind them derive their gradients from stale weights. In model parallelism, this staleness leads to instability and lower model accuracy. A study titled ‘Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform’, which tested model parallelism against data parallelism, showed that models trained with data parallelism improve in accuracy as training proceeds, whereas accuracy starts fluctuating under model parallelism.

The study also showed that data parallelism is expected to run into scalability issues, and that model parallelism proved more apt for DNN models as the number of GPUs grew.

In a recent and prominent instance, Google AI’s large language model PaLM, or Pathways Language Model, used a combination of data and model parallelism as part of its state-of-the-art training. The model was scaled with data parallelism at the Pod level across two Cloud TPU v4 Pods, while standard data and model parallelism were used within each Pod.
