Top Distributed Training Frameworks In 2021

In distributed training, the workload is shared between mini processors called the worker nodes. The nodes run in parallel to speed up the model training. Traditionally, distributed training has been used for machine learning models.  But of late, it’s making inroads into compute-intensive tasks such as deep learning to train deep neural networks.

Below, we have corralled the most popular distributed training frameworks in 2021:


Uber introduced Horovod in 2017, just a month after launching Michelangelo, an internal machine learning-as-a-service platform to build and deploy systems at scale. Uber built Horovod to make distributed deep learning fast and easy to use. It could bring down the model training time to hours from days and weeks. Horovod can scale up an existing training script to run on hundreds of GPUs in just a few lines of Python code. Horovod applies Baidu’s draft implementation of the TensorFlow ring-allreduce algorithm and builds upon it. At any given point, various teams at Uber would use different releases of TensorFlow. The idea with Horovod was to help all these teams leverage the ring-allreduce algorithm without having to upgrade to TensorFlow’s latest version. 

Horovod can be installed on-premise or on cloud platforms such as AWS, Azure, and Databricks. Horovod can also run on top of Apache Spark and unify data processing and model training on a single pipeline. After configuring Horovod, the same infrastructure can be used to train models with any framework–TensorFlow, PyTorch, and MXNet.

Distributed TensorFlow

Distributed TensorFlow applications consist of a cluster with parameter servers and workers. The workers are placed in the GPU as they calculate gradients during training; the parameters are placed in the CPU and used for only aggregating gradients and broadcasting updates. The ‘chief worker’ coordinates model training, initialises the model, counts the number of steps taken, monitors the session, and saves and recovers the model checkpoints to recover in case of failure. If a chief worker dies, the training needs to be restarted from the most recent checkpoint.

One of the major disadvantages of Distributed TensorFlow is that the practitioner needs to manage starting and stopping of servers explicitly. This requires keeping track of the IP addresses and ports of all the TensorFlow servers in the network.


TensorFlowOnSpark is a framework for running TensorFlow on Apache Spark. It allows distributed training and inference on Apache Spark Clusters. It is made available by Yahoo and is designated to work along with SparkSQL, MLlib and other Spark libraries in a single pipeline.

It enables both synchronous and asynchronous training and inference while supporting all types of TensorFlow programs. It supports model parallelism and data parallelism, as well as TensorFlow tools like TensorBoard on Spark clusters.

Typically, by changing just 10 lines of Python code, any TensorFlow program can be easily modified to work with TensorFlowOn Spark

It supports direct communication between TensorFlow processes–worker and parameter servers. Process-to-process communication helps TensorFlowOnSpark to scale easily by adding machines.


BigDL is a distributed deep learning framework for Apache Spark. It allows deep learning applications to run on Apache Hadoop/Spark cluster to directly process the production data and for deployment and management as part of the end-to-end data analysis process. It implements distributed, data-parallel training directly on top of the functional compute model of Spark. It provides synchronous data parallel training to train a deep neural network model across the cluster, offering better scalability and efficiency compared to asynchronous training. Distributed training in BigDL is implemented as an iterative process; each iteration runs a couple of Spark jobs to compute the gradients using the current mini-batch and then update the parameters.


PyTorch offers distributed training capabilities made available through torch.distributed package. It can be categorised into three main components:

  • Distributed data-parallel training (DDP) is a widely adopted single-program multiple-data training program paradigm that enables model replication on every process to be fed with a different set of input data samples.
  • RPC-based distributed training to support general training structures that cannot fit into data-parallel training. It helps in managing object lifetime and extends the autograd engine beyond the machine.
  • Collective communication supports sending tensors across processes within a group.

Download our Mobile App

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at

Subscribe to our newsletter

Join our editors every weekday evening as they steer you through the most significant news of the day.
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.

Our Upcoming Events

15th June | Bangalore

Future Ready | Lead the AI Era Summit

15th June | Online

Building LLM powered applications using LangChain

17th June | Online

Mastering LangChain: A Hands-on Workshop for Building Generative AI Applications

20th June | Bangalore

Women in Data Science (WiDS) by Intuit India

Jun 23, 2023 | Bangalore

MachineCon 2023 India

26th June | Online

Accelerating inference for every workload with TensorRT

MachineCon 2023 USA

Jul 21, 2023 | New York

Cypher 2023

Oct 11-13, 2023 | Bangalore

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox