
DeepSpeed Vs Horovod: A Comparative Analysis

A comparative analysis of the open-source deep learning optimisation libraries DeepSpeed and Horovod for advancing large-scale model training.


Deep learning represents a new artificial intelligence (AI) and machine learning paradigm. It has achieved enormous appeal in scientific computing, and its algorithms are widely employed to address challenging problems. To a certain degree, all deep learning algorithms depend on the capacity of deep neural networks (DNNs) to scale across GPU topologies. However, that same scalability has led to compute-intensive programs, which pose operational problems for enterprises. Thus, from training to optimisation, the life cycle of a deep learning project demands strong infrastructure building blocks that can scale compute workloads.

Over the years, tech giants such as Google, Microsoft, Uber and DeepMind have released many open-source deep learning optimisation libraries. In this article, we compare two of them: DeepSpeed and Horovod.

DeepSpeed

In February 2020, Microsoft announced the release of an open-source library called DeepSpeed.

Training a large, advanced deep learning model is complex and involves a number of challenges, from model design to setting up state-of-the-art training techniques such as distributed training, mixed precision and gradient accumulation.

Even then, there is no certainty that the system will perform up to expectation or achieve the desired convergence rate: large models easily run out of memory under pure data parallelism, and model parallelism is hard to apply in such cases. This is where DeepSpeed comes into the picture; it addresses these drawbacks and accelerates model development and training.

One of the most important applications of DeepSpeed has been the development of Turing Natural Language Generation (Turing-NLG), one of the largest language models, with 17 billion parameters.

DeepSpeed stands apart in four important areas:

  • Scale: DeepSpeed supports running models with up to 100 billion parameters, a tenfold improvement over existing training optimisation frameworks. Its 3D parallelism can efficiently train deep learning models with trillions of parameters on modern GPU clusters with hundreds of devices.
  • Speed: in initial tests, DeepSpeed was four to five times faster than competing libraries.
  • Cost: models could be trained at roughly a third of the cost of the alternatives.
  • Usability: DeepSpeed requires no refactoring of PyTorch models and can be enabled with only a few lines of code (a minimal sketch follows this list).
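
To make the usability point concrete, below is a minimal sketch of wrapping an existing PyTorch model with DeepSpeed. The tiny model and configuration values are illustrative placeholders, and the call follows DeepSpeed's documented initialize API, whose exact signature can vary between versions:

import torch
import torch.nn as nn
import deepspeed

class SimpleNet(nn.Module):
    """Tiny placeholder standing in for an existing PyTorch model."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10)

    def forward(self, x):
        return self.fc(x)

# Illustrative configuration; real runs would tune these values.
ds_config = {
    "train_batch_size": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model = SimpleNet()

# deepspeed.initialize wraps the model in an engine that handles
# distributed training, optimizer state and precision behind the scenes.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# The training step swaps loss.backward()/optimizer.step() for the
# engine's equivalents; the rest of the script stays the same.
x = torch.randn(4, 784).to(model_engine.device)
y = torch.randint(0, 10, (4,)).to(model_engine.device)
loss = nn.functional.cross_entropy(model_engine(x), y)
model_engine.backward(loss)
model_engine.step()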

Horovod 

Horovod is Uber's free, open-source framework for distributed deep learning training with TensorFlow, PyTorch, Keras and Apache MXNet. Horovod aims to make distributed deep learning fast and easy to use. It was originally built at Uber so that existing training scripts could be scaled to run on hundreds of GPUs with just a few lines of Python code, bringing model training time down from days and weeks to hours and minutes. Horovod can be installed on-premises or run out of the box on cloud platforms, including AWS, Azure and Databricks.

Furthermore, Horovod can run on top of Apache Spark, allowing data processing and model training to be unified under a single pipeline. Once Horovod is configured, the same infrastructure may be used to train models with any framework, allowing switching between TensorFlow, PyTorch, MXNet and future frameworks. The main principles of Horovod are built on MPI notions, namely size, rank, local rank, allreduce and allgather.
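
A short sketch of these MPI-style primitives, assuming the PyTorch backend (horovod.torch); the toy tensors exist only to show what each call does:

import torch
import horovod.torch as hvd

hvd.init()  # start the Horovod runtime for this worker process

print("size:", hvd.size())              # total number of worker processes
print("rank:", hvd.rank())              # this worker's global index
print("local rank:", hvd.local_rank())  # this worker's index on its host

# allreduce: combine (by default, average) a tensor across all workers,
# which is how Horovod aggregates gradients during training.
averaged = hvd.allreduce(torch.tensor([float(hvd.rank())]))

# allgather: concatenate every worker's tensor into one result visible
# on all workers.
gathered = hvd.allgather(torch.tensor([hvd.rank()]))

Run under MPI or horovodrun, each process sees the same size but a different rank, and the collectives operate across all of them.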

DeepSpeed vs Horovod 

Advanced deep learning models are tough to train. Besides model design, model scientists need modern training approaches such as distributed training, mixed precision, gradient accumulation and monitoring; even then, the ideal system performance and convergence rate may not be achieved. Large models give considerable accuracy benefits, but training billions to trillions of parameters often runs into fundamental hardware limits. Existing systems trade off between computation, communication and development efficiency to fit these models into memory. DeepSpeed and Horovod both address these difficulties to expedite model development and training.

DeepSpeed brings advanced training techniques, such as ZeRO, distributed training, mixed precision and monitoring, to PyTorch through lightweight, compatible APIs. With only a few lines of code change to a PyTorch model, DeepSpeed addresses the underlying performance difficulties and improves the speed and scale of training.
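
Features such as ZeRO and mixed precision are switched on declaratively through the engine's configuration rather than by rewriting the model. A minimal sketch using DeepSpeed's documented configuration fields (the values are illustrative, not tuned recommendations):

# Illustrative DeepSpeed configuration enabling fp16 mixed precision
# and ZeRO stage 1 (partitioning optimizer states across workers).
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# Passed to deepspeed.initialize(..., config=ds_config), the same model
# code now trains with ZeRO and mixed precision enabled.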

On the other hand, Horovod's primary motivation is to make it easy to take a single-GPU training script and scale it successfully to train across several GPUs. At Uber, the MPI model was found to be considerably more straightforward and to need far fewer code modifications than earlier alternatives such as Distributed TensorFlow with parameter servers. Once a training script has been built with Horovod, it can run on a single GPU, several GPUs or even numerous hosts without code changes. Furthermore, Horovod is not only easy to use but also fast.
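
As a rough sketch of what that scaling looks like in practice with the PyTorch backend (the tiny model here stands in for an existing single-GPU script):

import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # pin each process to one GPU

model = nn.Linear(784, 10).cuda()
# A common convention is to scale the learning rate by the worker count.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start every worker from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

Launched with, for example, horovodrun -np 4 python train.py, the same script runs on one GPU, four GPUs or several hosts without modification.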

Ritika Sagar

Ritika Sagar is currently pursuing PDG in Journalism from St. Xavier's, Mumbai. She is a journalist in the making who spends her time playing video games and analyzing the developments in the tech world.