In recent years, deep learning has succeeded across multiple domains. However, as models and datasets grow in size and computational complexity, training becomes slow. Parallel and distributed deep learning approaches have been introduced to address this. In this article, we discuss parallel and distributed deep learning methods in detail and look at how they help speed up the deep learning process. The major points to be discussed in this article are listed below.
Table of Contents
- Need for Parallel and Distributed Deep Learning
- Parallel and Distributed Methods
- Data Parallelism
- How to Implement Data Parallelism?
- Model Parallelism
- How to Implement Model Parallelism?
- Data Parallelization vs Model Parallelization
Let’s begin the discussion by understanding the need for parallel and distributed deep learning.
Need for Parallel and Distributed Deep Learning
Deep neural networks are good at extracting meaningful information from data and modelling it for given tasks. When the data is high dimensional or the model has a very large number of parameters, training requires heavy computation. In such cases, parallel or distributed deep learning can help by spreading that computation across multiple cores or machines.
A typical neural network under such intensive computation takes a long time to train. For example, training a VGG network can take more than 10 hours on a single-core CPU, and about 1/8 of that on a single machine with an 8-core CPU.
Another problem is that the amount of data is sometimes too large to store on a single computer. It therefore becomes important to use methods that reduce the storage and computation problems of the single-machine approach. Parallel and distributed algorithms can drastically reduce training time.
Parallel and Distributed Methods
There can be various ways to parallelize or distribute computation for deep neural networks using multiple machines or cores. Some of the ways are listed below:
Local Training: Here the model and data are stored on a single machine, and training uses that machine’s multiple cores or GPUs.
Multi-Core Processing: Multiple cores of a single machine can be used for fitting the data and model, where these cores share memory (PRAM model). The cores can be used in the following ways:
- Multiple cores can process multiple images at once in each layer. This is a core-parallel process.
- SGD on multiple mini-batches can be performed in parallel using multiple cores.
- A computationally intensive subroutine like matrix multiplication can be performed on a GPU (Graphics Processing Unit).
- Multiple cores and GPUs can also be used together, where the cores share the GPU and computationally intensive subroutines are offloaded to the GPU.
Distributed Training: When storing the data or the model on a single machine is not feasible, we can use multiple machines to hold the data and model and to share the computation. The memory and computation problems can be addressed with the following methods:
- Data Parallelism: The data is distributed across multiple machines so that training and computation on it become faster. This helps when the data is high dimensional or too large for one machine’s memory.
- Model Parallelism: Sometimes a neural network model is too big to fit on a single machine, so we distribute the model across multiple machines. For example, each machine can hold a single layer of the network, and the machines communicate their outputs to each other during forward and backward propagation.
This gives the basic intuition behind data and model parallelism. Let’s discuss each in more detail.
Data Parallelism
Data parallelism can be defined as splitting the data into N partitions, where each partition is used for training on a different device, such as a CPU core, a GPU, or even a separate machine. In the traditional approach, the training procedure produces one gradient for every mini-batch; with data parallelism, it produces N gradients. The question is then how these N gradients should be combined.
The gradients can be combined in the following ways:
Synchronous Distributed SGD: The average of the N gradients is used to update the model parameters once. Averaging produces a more accurate gradient. The major drawback of this technique is that every device must finish computing its local gradient before the update can happen.
Asynchronous Distributed SGD: Each gradient is used to update the model’s parameters on its own, without being combined with the others. In this way, we get N parameter updates for every mini-batch.
Synchronous and asynchronous distributed SGD are discussed in many papers. From the definitions above, synchronous distributed SGD takes more time per step because it has to wait for all the processes on the different machines to complete.
Asynchronous distributed SGD, on the other hand, does not wait for other devices to finish, so it takes less time to complete a mini-batch step than the sync approach. In terms of performance, the sync approach produces a less noisy gradient than the async approach because it averages over N mini-batches.
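The difference between the two update rules can be sketched on a toy problem. The following is a minimal illustration rather than a real distributed implementation: it assumes a one-parameter linear model, simulates the N workers inside a single process, and ignores the stale gradients that real asynchronous training has to deal with. The helper names (`shard`, `grad`, `sync_step`, `async_step`) are made up for this example.

```python
# Toy comparison of synchronous and asynchronous distributed SGD.
# Assumption: workers are simulated sequentially in one process.

def shard(data, n):
    """Split the data into n partitions, one per simulated worker."""
    return [data[i::n] for i in range(n)]

def grad(w, batch):
    """Gradient of the mean of 0.5 * (w*x - y)**2 over one mini-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def sync_step(w, shards, lr):
    """Wait for all N local gradients, average them, update once."""
    grads = [grad(w, s) for s in shards]
    return w - lr * sum(grads) / len(grads)

def async_step(w, shards, lr):
    """Apply each worker's gradient on its own: N updates per step."""
    for s in shards:
        w = w - lr * grad(w, s)
    return w

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # generated with w = 3
shards = shard(data, 4)                            # N = 4 workers

w_sync = w_async = 0.0
for _ in range(200):
    w_sync = sync_step(w_sync, shards, lr=0.01)
    w_async = async_step(w_async, shards, lr=0.01)

print(w_sync, w_async)  # both should approach the true value 3
```

With equal-sized partitions, the averaged gradient in `sync_step` equals the gradient over the whole data, which is why the sync approach is described as the less noisy of the two.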
How to Implement Data Parallelism?
Data parallelism can be implemented in the following ways.
Using PyTorch
    from torch.nn.parallel import DistributedDataParallel as DDP

    # `model` is the model we previously initialized
    model = ...
    # `rank` is a device number starting from 0
    model = model.to(rank)
    ddp_model = DDP(model, device_ids=[rank])
The above code can be used to implement synchronous distributed SGD with the PyTorch library. PyTorch has no equivalent wrapper for the async approach, so it has to be implemented more manually; third-party implementations of it are available. An official tutorial for the sync approach can be found in the PyTorch documentation.
Using TensorFlow Keras
TensorFlow Keras provides a strategy for both kinds of SGD. For synchronous training, we can use tf.distribute.MirroredStrategy, and for asynchronous training, we can use tf.distribute.experimental.ParameterServerStrategy. The synchronous case looks as follows:
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        model = Model(...)
        model.compile(...)
For more details, the reader can refer to the official guides on the TensorFlow website.
Here we have seen data parallelism in detail. Let’s now discuss the details of model parallelism.
Model Parallelism
Model parallelism can be considered as the process of splitting the model into N partitions, each of which does its work on a different device. The model can be split on the basis of its layers: since a neural network consists of various layers, a single layer or a set of layers can be assigned to a specific device that performs its operations. However, we can also split the model in more intricate ways, depending on its architecture.
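The layer-based split described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the "devices" are ordinary objects in one process, the layers are stand-in functions on a single number, and `Device` and `partition` are hypothetical names, not part of any library.

```python
# Toy layer-wise model parallelism: each simulated device owns a
# contiguous slice of the layers and passes its output to the next one.

class Device:
    def __init__(self, name, layers):
        self.name = name
        self.layers = layers  # the layers stored on this device

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def partition(layers, n):
    """Split the layer list into n contiguous, near-equal chunks."""
    size, extra = divmod(len(layers), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(layers[start:end])
        start = end
    return chunks

layers = [
    lambda x: x * 2,      # stand-in for layer 1
    lambda x: x + 1,      # stand-in for layer 2
    lambda x: max(x, 0),  # stand-in for a ReLU
    lambda x: x * 3,      # stand-in for layer 4
]
devices = [Device(f"dev{i}", chunk)
           for i, chunk in enumerate(partition(layers, 2))]

x = 5
for d in devices:
    x = d.forward(x)  # the activation is handed to the next device
print(x)  # 33
```

Note that the second device cannot start until the first one has produced its output, which is the waiting cost of this kind of split.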
How to Implement Model Parallelism?
Model parallelism can be implemented in the following ways.
Using PyTorch
Using the torch.nn.Module.to method, we can move model parameters onto different devices. Consider the code below.
    import torch.nn as nn

    linear1 = nn.Linear(16, 8).to('cuda:0')
    linear2 = nn.Linear(8, 4).to('cuda:1')
The above code snippet will create two linear layers, each of which will be placed on a different GPU.
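Placing the layers is only half of the picture: during the forward pass, the intermediate activation also has to be moved to the device of the next layer. The sketch below shows one way this might look. The class name `TwoDeviceNet`, the ReLU between the layers, and the fallback to CPU when two GPUs are not available are assumptions added for illustration.

```python
import torch
import torch.nn as nn

# Assumption: fall back to CPU so the sketch also runs without two GPUs.
two_gpus = torch.cuda.is_available() and torch.cuda.device_count() >= 2
dev0, dev1 = ("cuda:0", "cuda:1") if two_gpus else ("cpu", "cpu")

class TwoDeviceNet(nn.Module):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(16, 8).to(dev0)
        self.linear2 = nn.Linear(8, 4).to(dev1)

    def forward(self, x):
        # Run the first layer on its own device.
        h = torch.relu(self.linear1(x.to(dev0)))
        # Move the activation to the second layer's device before using it.
        return self.linear2(h.to(dev1))

model = TwoDeviceNet()
out = model(torch.randn(3, 16))  # batch of 3 inputs with 16 features
print(out.shape)
```

The output ends up on the second device, so a loss computed from it should use labels moved to that device as well.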
Using TensorFlow
We can use the tf.device context manager to place operations on different devices. A simple implementation of model parallelism with the TensorFlow library is shown in the code below.
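    import tensorflow as tf
    from tensorflow.keras import layers

    # Note: Keras creates a layer's weights lazily, when the layer is first
    # built or called, so each layer should also be called inside its scope.
    with tf.device('/GPU:0'):
        linear1 = layers.Dense(8, input_dim=16)
    with tf.device('/GPU:1'):
        linear2 = layers.Dense(4, input_dim=8)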
For more details on this, readers can refer to the TensorFlow documentation.
So far we have discussed the methods for parallelization. Let’s see the difference between them.
Data Parallelization vs Model Parallelization
Data parallelism is used more often than model parallelism. Just as synchronizing operations is time-consuming in synchronous distributed SGD, model parallelism as a whole has a similar limitation: each part of the neural network has to wait for the parts it depends on, and the parts must synchronize with each other. On the other hand, some models are well suited to model parallelization, such as inception networks.
The above image is a representation of the inception module with dimension reductions. In this representation, 4 independent paths run in parallel from the previous layer, with only 2 synchronization points (the previous layer and the filter concatenation).
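The structure of such a module, several independent paths between two synchronization points, can be mimicked with a small sketch. Everything here is a stand-in: the four branches are arbitrary list functions rather than convolutions or pooling, threads play the role of devices, and `inception_like` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor

# Four independent branches, loosely mirroring the four paths of the module.
branches = [
    lambda x: [v + 1 for v in x],  # stand-in for the 1x1 convolution path
    lambda x: [v * 2 for v in x],  # stand-in for the 1x1 then 3x3 path
    lambda x: [v * v for v in x],  # stand-in for the 1x1 then 5x5 path
    lambda x: [max(x)] * len(x),   # stand-in for the pooling then 1x1 path
]

def inception_like(x):
    # Sync point 1: every branch reads the same previous-layer output.
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        outputs = list(pool.map(lambda f: f(x), branches))
    # Sync point 2: concatenate once all branches have finished.
    return [v for out in outputs for v in out]

print(inception_like([1, 2, 3]))
# [2, 3, 4, 2, 4, 6, 1, 4, 9, 3, 3, 3]
```

Because the branches never exchange data with each other, they only have to wait at the final concatenation.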
Data parallelization is often used to speed up training. In data parallelization, we simply replicate the network on several devices, split each batch across the replicas for the forward pass, and concatenate the results back into a single batch. Model parallelization is used when the model is too large to fit on a single machine. The more parallel paths a model has, and the longer they are, the better suited it may be to model parallelization.
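The replicate, split, and concatenate pattern of data parallelization can be illustrated as follows. This is a minimal sketch that assumes a non-empty batch and a one-line stand-in for the network; the replicas run one after another here instead of on separate devices, and the names `split_batch`, `model` and `data_parallel_forward` are made up for the example.

```python
def split_batch(batch, n):
    """Split a non-empty batch into at most n contiguous chunks."""
    k = -(-len(batch) // n)  # chunk size, rounded up
    return [batch[i:i + k] for i in range(0, len(batch), k)]

def model(x, w=2.0, b=1.0):
    """Stand-in for the network; every replica uses the same parameters."""
    return w * x + b

def data_parallel_forward(batch, n_replicas):
    chunks = split_batch(batch, n_replicas)
    # Each replica would process its chunk on its own device.
    outputs = [[model(x) for x in chunk] for chunk in chunks]
    # Concatenate the per-replica outputs back into a single batch.
    return [y for out in outputs for y in out]

batch = [0.0, 1.0, 2.0, 3.0, 4.0]
print(data_parallel_forward(batch, 2))  # [1.0, 3.0, 5.0, 7.0, 9.0]
```

The result matches what a single copy of the model would produce on the whole batch, since all replicas share the same parameters.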
In this article, we have discussed the need to parallelize deep learning, along with the ways that can be used to overcome the limitations of traditional single-machine training. We have also discussed data parallelism and model parallelism, which are the main concepts of parallel deep learning.