Microsoft’s DeepSpeed was introduced in 2020 and is one of the most popular deep learning optimisation libraries today. The mission behind the library was to make distributed training easy, efficient, and effective. Today, DeepSpeed can train a language model with one trillion parameters using as few as 800 NVIDIA V100 GPUs. Over the years, many open-source deep learning optimisation libraries have been announced by tech giants such as Google, Microsoft, Uber, DeepMind and others, but DeepSpeed remains one of the most popular. Here, we look at the top six alternatives to DeepSpeed.
DeepSpeed enables trillion-scale model training through a combination of powerful technologies: it scales to thousands of GPUs by combining data parallelism, model parallelism, and pipeline parallelism, optimises mixture-of-experts (MoE) models at scale, and reduces the cost of both training and inference for large models. The library is PyTorch-compatible and has been shown to improve large-model training, achieving over 5x gains in system performance. Turing-NLG, with 17 billion parameters, was one of the earliest models to leverage the library.
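A DeepSpeed training run is driven by a JSON configuration. A minimal sketch of such a config, written here as a Python dict (the keys mirror DeepSpeed's documented config schema; the values are illustrative choices, not recommendations):

```python
# Illustrative DeepSpeed configuration expressed as a Python dict.
# Keys follow DeepSpeed's JSON config schema; values are examples only.
ds_config = {
    "train_batch_size": 256,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},          # mixed-precision training
    "zero_optimization": {"stage": 2},  # shard optimizer state + gradients
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-4},
    },
}

# With DeepSpeed installed, the engine would be created roughly like this:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```

Raising the ZeRO stage (1 → 2 → 3) shards progressively more training state across GPUs, trading communication for memory.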
FairScale by Facebook Research
FairScale, by Facebook AI Research (FAIR), is a PyTorch extension for high-performance, large-scale training. The library extends basic PyTorch capabilities with state-of-the-art scaling and the latest distributed training techniques, delivered as composable modules and easy-to-use APIs that help scale models with limited resources.
FairScale follows three major missions: letting users understand it with minimal cognitive overload, letting them seamlessly combine multiple FairScale APIs in their training loop, and delivering the best scaling and efficiency performance. FairScale addresses scaling along multiple axes: it provides solutions for scaling models through layer parallelism and tensor parallelism, achieves low memory utilisation and efficient computation, optimises memory usage irrespective of model scale, and supports training without hyperparameter tuning, among other techniques to optimise training performance. It also features inter- and intra-layer parallelism, splitting models across multiple GPUs and hosts.
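One of the memory-optimisation ideas FairScale popularised (in its OSS / sharded optimizer modules) is to have each worker keep optimizer state only for its own shard of the parameters. A toy sketch of that idea in plain Python, with illustrative names rather than FairScale's actual API:

```python
# Toy sketch of optimizer-state sharding, the idea behind FairScale's
# OSS / ZeRO-style techniques. Names here are illustrative only.

def shard_parameters(params, world_size):
    """Round-robin assignment of parameter indices to workers."""
    shards = [[] for _ in range(world_size)]
    for i, _ in enumerate(params):
        shards[i % world_size].append(i)
    return shards

params = [f"layer{i}.weight" for i in range(10)]
shards = shard_parameters(params, world_size=4)

# Each worker now holds optimizer state (momentum, variance, ...) only
# for its own shard, cutting per-GPU optimizer memory by ~1/world_size.
owned = {rank: [params[i] for i in shard]
         for rank, shard in enumerate(shards)}
```

At each step, a worker updates only its owned parameters and broadcasts the results, so the full optimizer state never needs to fit on a single device.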
TensorRT by NVIDIA
NVIDIA’s TensorRT is a C++ library for high-performance inference on NVIDIA GPUs and deep learning accelerators. The library is built on CUDA, NVIDIA’s parallel programming model, and enables developers to calibrate for lower precision with high accuracy, optimise neural network models, and deploy them to hyperscale data centres, embedded platforms, or automotive product platforms. Deep learning models from almost all popular frameworks can be parsed and optimised for low-latency, high-throughput inference on NVIDIA GPUs using TensorRT. The three essential optimisations it allows for are mixed-precision inference, layer fusion, and batching. In addition, it uses the NVIDIA Ampere architecture GPUs and sparse tensor cores for an additional performance boost.
TensorRT offers INT8 based on quantisation-aware training and post-training quantisation, as well as FP16 optimisations. These suit production deployments of deep learning inference applications such as video streaming, speech recognition, recommendation, fraud detection, text generation, and natural language processing. It also minimises application latency for better real-time services.
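The core arithmetic behind post-training INT8 quantisation, which TensorRT's calibration automates, can be sketched in a few lines of NumPy. This is a simplified symmetric scheme for illustration, not TensorRT's actual calibrator:

```python
import numpy as np

# Toy post-training INT8 quantisation. A symmetric scale is derived from
# calibration activations, values are rounded into int8, and dequantised
# back to float to measure the introduced error.

def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0          # map the observed range onto int8
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
activations = rng.normal(size=1000).astype(np.float32)
q, scale = quantize_int8(activations)
error = np.abs(dequantize(q, scale) - activations).max()
# the round-trip error is bounded by about half a quantisation step
```

Real calibrators refine this by choosing a clipping range from the activation distribution (e.g. minimising information loss) rather than using the raw maximum.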
Horovod by Uber
Uber developed Horovod to make distributed deep learning fast and easy to use. It is a distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet, and is claimed to bring model training time down from days and weeks to hours and minutes.
Horovod leverages message passing interface stacks such as OpenMPI to run a training job on highly parallel, distributed infrastructure without modification. It is currently hosted by the LF AI & Data Foundation. The primary motivation is to ease the task of taking a single-GPU training script and scaling it to train across many GPUs in parallel. Once a training script has been written for scale with Horovod, the Uber team claims it can run on a single GPU, multiple GPUs, or even multiple hosts without any further code changes. The framework achieved 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% for VGG-16. Under the hood, it builds on communication libraries such as the NVIDIA Collective Communications Library (NCCL) and the Message Passing Interface (MPI). Horovod helps distribute and aggregate model parameters across workers, optimise network bandwidth usage, and scale deep neural network models.
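Horovod's core pattern is simple: every worker computes local gradients, then an allreduce averages them so all workers apply the same update. A minimal in-process sketch of that averaging step (Horovod itself performs this over NCCL or MPI):

```python
# Sketch of Horovod-style gradient averaging. The allreduce is simulated
# in-process here; in Horovod it runs over NCCL or MPI across hosts.

def allreduce_mean(grads_per_worker):
    """Average each parameter's gradient across all workers."""
    n = len(grads_per_worker)
    return [sum(g) / n for g in zip(*grads_per_worker)]

# Gradients for the same two parameters, computed on three workers
worker_grads = [
    [0.1, -0.2],
    [0.3,  0.0],
    [0.2, -0.4],
]
avg = allreduce_mean(worker_grads)   # ~[0.2, -0.2] on every worker
```

Because every worker receives the identical averaged gradient, the model replicas stay in sync without a central parameter server.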
Mesh TensorFlow by Google
Mesh TensorFlow is a language for distributed deep learning that can specify a broad class of distributed tensor computations. The language aims to formalise and implement distribution strategies for computation graphs over hardware and processors, and is implemented as a layer over TensorFlow. Distributed training typically follows two methods, data parallelism and model parallelism; Mesh TensorFlow targets model parallelism, up to supercomputer scale. It allows users to specify any tensor dimensions to be split across any dimensions of a multi-dimensional mesh of processors. The Mesh TensorFlow graph compiles into an SPMD program with parallel operations coupled with collective communication primitives such as Allreduce.
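The split-and-Allreduce pattern described above can be illustrated with NumPy: split one tensor dimension across a one-dimensional "mesh" of processors, compute locally on each shard, and sum the partial results. This is a conceptual sketch, not the Mesh TensorFlow API:

```python
import numpy as np

# Toy illustration of Mesh TensorFlow's idea: split a named tensor
# dimension across a 1-D mesh of processors. Each "processor" holds a
# slice of the input dimension, computes a partial matmul locally, and
# an Allreduce-style sum recombines the partials (the SPMD pattern).

mesh_size = 4
x = np.arange(8.0)              # activations, shape [d_in]
w = np.ones((8, 8))             # weight, dims [d_in, d_out]

# Split the d_in dimension of both x and w across the mesh
x_shards = np.split(x, mesh_size)
w_shards = np.split(w, mesh_size, axis=0)

# Each processor computes a partial result on its shard...
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
# ...and an Allreduce sums the partials into the full output
y = np.sum(partials, axis=0)

assert np.allclose(y, x @ w)    # matches the unsplit computation
```

Because the split dimension is contracted away by the matmul, only one collective sum is needed to recover the exact unsplit result.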
MNN by Alibaba
MNN (Mobile Neural Network) is a blazing-fast, lightweight deep learning framework, battle-tested by business-critical use cases at Alibaba. It is a mobile-side inference engine that focuses on the acceleration and optimisation of inference while solving efficiency problems during model deployment.
Mobile Neural Network handles the optimisation, conversion, and inference of deep neural network models. Apps leverage it for live broadcast, short-video capture, search recommendation, product search by image, interactive marketing, equity distribution, security risk control, and other scenarios. MNN runs stably more than 100 million times per day, is applied in IoT devices, and is used in scenarios like smiley-face red envelopes, scans, and a finger-guessing game.
TF-Replicator by DeepMind
DeepMind’s TF-Replicator is a framework for distributed machine learning designed for DeepMind researchers and implemented as an abstraction over TensorFlow. It focuses on how TensorFlow programs scale on Tensor Processing Units (TPUs), and mitigates the TPU’s portability and adoption barriers with a simpler, developer-friendly programming model.
TF-Replicator is based on the “in-graph replication” pattern, in which the computation for each device is replicated in the same TensorFlow graph. This allows a high degree of parallelism: TF-Replicator first builds independent computations for each device, leaving placeholders where cross-device computation is needed. Once the sub-graphs for all devices have been built, TF-Replicator connects them by replacing the placeholders with actual cross-device computation. This lets users scale workloads up to many devices and seamlessly switch between different types of accelerators.
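The build-then-connect pattern above can be sketched schematically: build each replica's computation with a placeholder marking the cross-device step, then patch in the real reduction once every replica exists. This is a plain-Python illustration, not the TF-Replicator API:

```python
# Schematic of "in-graph replication": per-device sub-graphs are built
# independently with a placeholder for cross-device work, then connected
# by replacing the placeholder with the actual cross-device computation.

CROSS_DEVICE = object()   # placeholder left during per-device building

def build_replica(device_value):
    """Per-device computation ending at a cross-device placeholder."""
    local = device_value * 2
    return [("local", local), ("reduce", CROSS_DEVICE)]

replicas = [build_replica(v) for v in [1.0, 2.0, 3.0]]

# Once all sub-graphs exist, replace the placeholder with the real
# cross-device computation (here, a mean across replicas)
locals_ = [ops[0][1] for ops in replicas]
reduced = sum(locals_) / len(locals_)
connected = [[("local", l), ("reduce", reduced)] for l in locals_]
# every replica now sees the same reduced value
```

Deferring the cross-device step is what lets the same user code target GPUs or TPUs: only the placeholder's replacement changes per accelerator type.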