In distributed training, the workload is shared between mini processors called the worker nodes. The nodes run in parallel to speed up the model training. Traditionally, distributed training has been used for machine learning models. But of late, it’s making inroads into compute-intensive tasks such as deep learning to train deep neural networks.
Below, we have corralled the most popular distributed training frameworks in 2021:
Uber introduced Horovod in 2017, just a month after launching Michelangelo, an internal machine learning-as-a-service platform to build and deploy systems at scale. Uber built Horovod to make distributed deep learning fast and easy to use. It could bring down the model training time to hours from days and weeks. Horovod can scale up an existing training script to run on hundreds of GPUs in just a few lines of Python code. Horovod applies Baidu’s draft implementation of the TensorFlow ring-allreduce algorithm and builds upon it. At any given point, various teams at Uber would use different releases of TensorFlow. The idea with Horovod was to help all these teams leverage the ring-allreduce algorithm without having to upgrade to TensorFlow’s latest version.
Horovod can be installed on-premise or on cloud platforms such as AWS, Azure, and Databricks. Horovod can also run on top of Apache Spark and unify data processing and model training on a single pipeline. After configuring Horovod, the same infrastructure can be used to train models with any framework–TensorFlow, PyTorch, and MXNet.
Distributed TensorFlow applications consist of a cluster with parameter servers and workers. The workers are placed in the GPU as they calculate gradients during training; the parameters are placed in the CPU and used for only aggregating gradients and broadcasting updates. The ‘chief worker’ coordinates model training, initialises the model, counts the number of steps taken, monitors the session, and saves and recovers the model checkpoints to recover in case of failure. If a chief worker dies, the training needs to be restarted from the most recent checkpoint.
One of the major disadvantages of Distributed TensorFlow is that the practitioner needs to manage starting and stopping of servers explicitly. This requires keeping track of the IP addresses and ports of all the TensorFlow servers in the network.
TensorFlowOnSpark is a framework for running TensorFlow on Apache Spark. It allows distributed training and inference on Apache Spark Clusters. It is made available by Yahoo and is designated to work along with SparkSQL, MLlib and other Spark libraries in a single pipeline.
It enables both synchronous and asynchronous training and inference while supporting all types of TensorFlow programs. It supports model parallelism and data parallelism, as well as TensorFlow tools like TensorBoard on Spark clusters.
Typically, by changing just 10 lines of Python code, any TensorFlow program can be easily modified to work with TensorFlowOn Spark
It supports direct communication between TensorFlow processes–worker and parameter servers. Process-to-process communication helps TensorFlowOnSpark to scale easily by adding machines.
BigDL is a distributed deep learning framework for Apache Spark. It allows deep learning applications to run on Apache Hadoop/Spark cluster to directly process the production data and for deployment and management as part of the end-to-end data analysis process. It implements distributed, data-parallel training directly on top of the functional compute model of Spark. It provides synchronous data parallel training to train a deep neural network model across the cluster, offering better scalability and efficiency compared to asynchronous training. Distributed training in BigDL is implemented as an iterative process; each iteration runs a couple of Spark jobs to compute the gradients using the current mini-batch and then update the parameters.
PyTorch offers distributed training capabilities made available through torch.distributed package. It can be categorised into three main components:
- Distributed data-parallel training (DDP) is a widely adopted single-program multiple-data training program paradigm that enables model replication on every process to be fed with a different set of input data samples.
- RPC-based distributed training to support general training structures that cannot fit into data-parallel training. It helps in managing object lifetime and extends the autograd engine beyond the machine.
- Collective communication supports sending tensors across processes within a group.