Deep learning is a major paradigm in artificial intelligence (AI) and machine learning. It has gained enormous popularity in scientific computing, and its algorithms are widely used to tackle challenging problems. To a large degree, deep learning algorithms depend on the capacity of deep neural networks (DNNs) to scale across GPU topologies. However, the same scalability has led to compute-intensive programmes, which pose operational problems for enterprises. Thus, from training to optimisation, the life cycle of a deep learning project demands strong infrastructure building blocks that can scale compute workloads.
Over the years, many open-source deep learning optimisation libraries have been announced by tech companies such as Google, Microsoft, Uber and DeepMind. In this article, we compare two of these libraries: DeepSpeed and Horovod.
Training a large, advanced deep learning model is complex and involves a number of challenges, such as designing the model and setting up state-of-the-art training techniques, including distributed training, mixed precision and gradient accumulation.
Even then, there is no certainty that the system will perform as expected or achieve the desired convergence rate. One reason is that large models easily run out of memory with pure data parallelism, and model parallelism is hard to use in such cases. This is where DeepSpeed, Microsoft's open-source deep learning optimisation library, comes in: it addresses these drawbacks and accelerates model development and training.
One of the most important applications of DeepSpeed has been the development of Turing Natural Language Generation (Turing-NLG), which, with 17 billion parameters, was one of the largest language models at the time of its release.
DeepSpeed stands apart in four important areas:
- Scale: DeepSpeed supports running models with up to 100 billion parameters, a tenfold improvement over existing training optimisation frameworks. DeepSpeed’s 3D parallelism can efficiently train deep learning models with trillions of parameters on contemporary GPU clusters with hundreds of devices.
- Speed: DeepSpeed was 4-5 times faster than competing libraries in initial tests.
- Cost: Models could be trained for about a third of the cost using DeepSpeed compared with the alternatives.
- Usability: DeepSpeed does not require PyTorch models to be refactored and can be used with only a few lines of code.
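To give a sense of what "a few lines of code" can look like, the sketch below builds a DeepSpeed-style configuration dictionary and derives the global batch size from it. The key names follow DeepSpeed's documented JSON configuration, but the values, the 16-GPU cluster and the helper `effective_batch_size` are illustrative assumptions. The commented lines at the end indicate where such a config would typically be passed in a real script; they are not executed here.

```python
# Illustrative DeepSpeed-style configuration (values are assumptions, not tuned).
ds_config = {
    "train_micro_batch_size_per_gpu": 8,  # samples per GPU per forward/backward pass
    "gradient_accumulation_steps": 4,     # micro-batches accumulated before each update
    "fp16": {"enabled": True},            # mixed-precision training
    "zero_optimization": {"stage": 2},    # partition optimiser states and gradients
}


def effective_batch_size(config, world_size):
    """Global batch size implied by the config for a given number of GPUs.

    DeepSpeed expects train_batch_size to equal
    micro-batch size x gradient accumulation steps x number of GPUs.
    """
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * world_size
    )


# Hypothetical cluster of 16 GPUs.
ds_config["train_batch_size"] = effective_batch_size(ds_config, world_size=16)
print(ds_config["train_batch_size"])

# In a real training script (requires the deepspeed package and a PyTorch model):
#   import deepspeed
#   engine, optimizer, _, _ = deepspeed.initialize(
#       model=model, model_parameters=model.parameters(), config=ds_config)
#   # then call engine.backward(loss) and engine.step() in the training loop
```

Keeping mixed precision, gradient accumulation and ZeRO in the config rather than in the model code is what lets the PyTorch model itself stay largely unchanged.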
Horovod is Uber’s free, open-source framework for distributed deep learning training with TensorFlow, PyTorch, Keras and Apache MXNet. Uber originally built Horovod to make distributed deep learning quick and easy to use, so that an existing training script can be scaled to run on hundreds of GPUs with just a few lines of Python code. It also brought model training time down from days and weeks to hours and minutes. Horovod can be installed on-premises or run out of the box on cloud platforms, including AWS, Azure and Databricks.
Furthermore, Horovod can run on top of Apache Spark, allowing data processing and model training to be unified in a single pipeline. Once Horovod is configured, the same infrastructure can be used to train models with any supported framework, making it possible to switch between TensorFlow, PyTorch, MXNet and future frameworks. Horovod's core principles are based on MPI concepts, namely size, rank, local rank, allreduce and allgather.
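The MPI concepts listed above can be illustrated with a toy, single-process simulation. This is only a sketch: the `Worker` class and the helper functions are hypothetical names invented for illustration, and no real communication takes place. It assumes two hosts with two GPUs each, and uses an averaging allreduce, which is the variant usually applied to gradients.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    rank: int        # unique ID across all processes, 0 .. size - 1
    local_rank: int  # ID within the worker's own host (typically the GPU index)
    host: int        # which host the worker runs on
    value: float     # stand-in for a locally computed gradient


def make_workers(num_hosts, gpus_per_host):
    """Lay out one worker per GPU; size is the total number of workers."""
    workers = []
    for host in range(num_hosts):
        for local_rank in range(gpus_per_host):
            rank = host * gpus_per_host + local_rank
            workers.append(Worker(rank, local_rank, host, value=float(rank + 1)))
    return workers


def allreduce_average(workers):
    """Every worker ends up with the same average of all workers' values."""
    mean = sum(w.value for w in workers) / len(workers)
    return [mean for _ in workers]


def allgather(workers):
    """Every worker ends up with the full list of values, ordered by rank."""
    gathered = [w.value for w in sorted(workers, key=lambda w: w.rank)]
    return [list(gathered) for _ in workers]


workers = make_workers(num_hosts=2, gpus_per_host=2)
size = len(workers)
print(size)
print([(w.rank, w.local_rank) for w in workers])
print(allreduce_average(workers))
print(allgather(workers)[0])
```

In this layout, rank runs from 0 to 3 across the whole job, while local rank restarts at 0 on each host, which is why local rank is commonly used to pick the GPU a process should use.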
DeepSpeed vs Horovod
Advanced deep learning models are tough to train. Besides model design, data scientists also need to set up modern training techniques such as distributed training, mixed precision, gradient accumulation and monitoring. Even then, the desired system performance and convergence rate may not be achieved. Large models offer considerable accuracy benefits, but training billions to trillions of parameters often runs into fundamental hardware limits. To fit these models into memory, existing systems make trade-offs between computation, communication and development efficiency. DeepSpeed and Horovod address these difficulties to speed up model development and training.
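Of the techniques mentioned above, gradient accumulation is the easiest to show in a few lines. The sketch below uses a hypothetical one-parameter model, y = w * x, with a mean squared error loss, and checks that averaging the gradients of equally sized micro-batches reproduces the full-batch gradient. The function names and data are made up for illustration, and the batch size is assumed to be divisible by the micro-batch size.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Average the gradients of equally sized micro-batches."""
    steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(steps):
        start = i * micro_batch_size
        stop = start + micro_batch_size
        # Dividing by the number of steps keeps the scale equal to a full-batch mean.
        total += mse_gradient(w, xs[start:stop], ys[start:stop]) / steps
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = mse_gradient(0.5, xs, ys)
accumulated = accumulated_gradient(0.5, xs, ys, micro_batch_size=2)
print(full, accumulated)
```

Because only one micro-batch needs to be held in memory at a time, this lets a large effective batch size be used on hardware that could not fit the whole batch at once.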
DeepSpeed brings advanced training techniques, such as ZeRO (Zero Redundancy Optimizer), distributed training, mixed precision and monitoring, to PyTorch through lightweight, PyTorch-compatible APIs. DeepSpeed addresses the underlying performance difficulties and improves the speed and scale of training with only a few lines of code changed in the PyTorch model.
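A rough calculation shows why ZeRO matters for fitting large models into memory. The sketch below follows the accounting commonly used to describe ZeRO, assuming mixed-precision training with the Adam optimiser: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients and 12 for fp32 optimiser states. It counts model states only (activations and buffers are ignored), and the function name and the 7.5-billion-parameter, 64-GPU example are illustrative assumptions.

```python
BYTES_FP16_PARAMS = 2
BYTES_FP16_GRADS = 2
BYTES_OPTIMIZER_STATE = 12  # fp32 weights, momentum and variance for Adam


def per_gpu_memory_gb(num_params, num_gpus, zero_stage=0):
    """Approximate model-state memory per GPU in GB (1 GB = 1e9 bytes).

    Stage 0 is plain data parallelism (everything replicated).
    Stage 1 partitions optimiser states, stage 2 also gradients,
    stage 3 also the parameters themselves.
    """
    params = BYTES_FP16_PARAMS * num_params
    grads = BYTES_FP16_GRADS * num_params
    optim = BYTES_OPTIMIZER_STATE * num_params
    if zero_stage >= 1:
        optim /= num_gpus
    if zero_stage >= 2:
        grads /= num_gpus
    if zero_stage >= 3:
        params /= num_gpus
    return (params + grads + optim) / 1e9


# Hypothetical 7.5-billion-parameter model on 64 GPUs.
for stage in range(4):
    print(stage, round(per_gpu_memory_gb(7.5e9, 64, stage), 1))
```

Under the same assumptions, a 17-billion-parameter model needs about 272 GB of model states per GPU with plain data parallelism, far more than a single GPU offers, which is the memory wall that partitioning is meant to remove.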
Horovod's primary motivation, on the other hand, is to make it easy to take a single-GPU training script and scale it successfully to train across many GPUs. Uber found that the MPI model was considerably more straightforward and needed far fewer code modifications than earlier alternatives such as Distributed TensorFlow with parameter servers. Once a training script has been written with Horovod, it can run on a single GPU, several GPUs or even multiple hosts without any further code changes. Furthermore, Horovod is not only easy to use but also fast.
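As a rough illustration of how few changes a single-GPU PyTorch script needs, the sketch below marks the Horovod-specific lines. It is a minimal sketch rather than a reference implementation: `train` and `scale_learning_rate` are hypothetical helpers, the data loader is assumed to be already sharded per worker, and scaling the learning rate by the number of workers is a common heuristic, not a requirement. Running `train` requires the `torch` and `horovod` packages, which are imported lazily so the file itself loads without them.

```python
def scale_learning_rate(base_lr, num_workers):
    """Common heuristic: scale the learning rate with the number of workers."""
    return base_lr * num_workers


def train(model, loader, loss_fn, base_lr=0.01, epochs=1):
    """Train a PyTorch model with Horovod; `loader` is assumed sharded per worker."""
    import torch
    import horovod.torch as hvd

    hvd.init()  # Horovod: start the communication layer
    device = torch.device("cpu")
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())  # Horovod: one GPU per process
        device = torch.device("cuda")
    model.to(device)

    optimizer = torch.optim.SGD(
        model.parameters(), lr=scale_learning_rate(base_lr, hvd.size())
    )
    # Horovod: average gradients across workers before each update.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )
    # Horovod: start every worker from the same weights and optimiser state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    for _ in range(epochs):
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
    return model
```

A script built around such a function would typically be launched with something like `horovodrun -np 4 python train.py`, where `train.py` is a hypothetical file name and `-np` sets the number of processes. The training loop itself is the same as in the single-GPU version.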