There are two fundamentally different and complementary ways of accelerating machine learning workloads:
- By vertical scaling or scaling-up, where one adds more resources to a single machine
2. By horizontal scaling or scaling-out, where one adds more nodes to the system
But when it comes to the degree of distribution within a machine learning ecosystem, they are classified as:
Top Data Scientists for our Hackathons
- Fully Distributed
Centralised systems employ a strictly hierarchical approach. But the distributed system consists of a network of independent nodes and where no specific roles are assigned to certain nodes.
A centralised solution is not the right choice when data is inherently distributed or too big to store on single machines. For instance, think about astronomical data that is too large to move and centralise.
In a recent work published by the researchers at Delft University of Technology, Netherlands, they wrote in detail about the current state-of-the-art distributed ML models and how they affect computation latency and other attributes.
The advantages of using distributed ML models are plenty, it is beyond the scope of this article, however, here we list down of popular toolkits and techniques that enable distributed machine learning:
MapReduce and Hadoop
MapReduce is a framework for processing data and was developed by Google in order to process data in a distributed setting. First, all data is split into tuples during the map phase, which is followed by the reduce phase, where these tuples are grouped to generate a single output value per key. MapReduce and Hadoop heavily rely on the distributed file system in every phase of the execution.
Transformations in linear algebra, as they occur in many machine learning algorithms, are typically highly iterative in nature and the paradigm of the map and the reduce operations are not ideal for such iterative tasks. This is what Apache Spark has been developed to resolve.
The key difference here is the MapReduce tasks, which would require to write all (intermediate) data to disk for it to be executed. Whereas, Spark can keep all the data in memory, which saves expensive reads from the disk.
AllReduce uses common high-performance computing technology to iteratively train stochastic gradient descent models on separate mini-batches of the training data. Baidu claims linear speedup when applying this technique in order to train deep learning networks.
Horovod like Baidu, adds a layer of AllReduce-based MPI training to Tensorflow. Horovod uses the NVIDIA Collective Communications Library (NCCL) for increased efficiency when training on (Nvidia) GPUs. However, Horovod lacks fault tolerance and therefore suffers from the same scalability issues as those of Baidu’s.
This deep learning framework distributes machine learning through AllReduce algorithms. It does this by using NCCL between GPUs on a single host, and custom code between hosts based on Facebook’s Gloo library.
Microsoft Cognitive Toolkit
This toolkit offers multiple ways of data-parallel distribution. Many of them use the Ring AllReduce tactic as previously described, making the same trade-off of linear scalability over fault-tolerance.
Developed by Google, DistBelief is one of the early practical implementations of large-scale distributed machine learning. It supports data and model parallel training on tens of thousands of CPU cores. They are also capable of training a huge model with 1.7 billion parameters.
Developed by Google, Tensorflow has evolved from DistBelief and borrows the concepts of a computation graph and parameter server from it. Unlike DistBelief, defining a new type of neural network layer in Tensorflow requires no custom code, composed of fundamental math operations.
DIANNE (Distributed Artificial Neural Networks)
A Java-based distributed deep learning framework, DIANNE, uses the Torch native backend for executing the necessary computations. Each basic building block of a neural network can be deployed on a specific node, hence enabling model-parallelism.
On a small cluster of 10 machines equipped with a GPU, MXNet achieves almost linear speedup compared to a single machine when training GoogleNet. Similar to that of Tensorflow, the models are represented as dataflow graphs.
This approach is aimed at exploiting ML’s error tolerance, dependencies, and non-uniform convergence in order to achieve good scalability on large datasets.
Petuum uses the Parameter Server paradigm to keep track of the model being trained.
Petuum provides an abstraction layer that also allows it to run on systems using the Hadoop job scheduler and HDFS (Hadoop file system), which simplifies compatibility with the pre-existing clusters.
Scaling out is still a pressing challenge that delays the widespread usage of distributed models. Not all machine learning algorithms lend themselves to a distributed computing model that can achieve a high degree of parallelism.