
Top 11 Tools For Distributed Machine Learning


There are two fundamentally different and complementary ways of accelerating machine learning workloads: 

  1. Vertical scaling, or scaling up, where one adds more resources to a single machine, or
  2. Horizontal scaling, or scaling out, where one adds more nodes to the system.

When it comes to the degree of distribution within a machine learning ecosystem, systems are classified as:

  • Centralised
  • Decentralised
  • Fully Distributed

Centralised systems employ a strictly hierarchical approach, whereas a fully distributed system consists of a network of independent nodes in which no specific roles are assigned to particular nodes.

A centralised solution is not the right choice when data is inherently distributed or too big to store on a single machine. Think, for instance, of astronomical data that is too large to move and centralise.

In a recent work, researchers at Delft University of Technology, Netherlands, wrote in detail about the current state-of-the-art distributed ML systems and how they affect computation latency and other attributes.

The advantages of distributed ML are plenty, and a full treatment is beyond the scope of this article. Here, we list popular toolkits and techniques that enable distributed machine learning:

MapReduce and Hadoop

MapReduce is a framework developed by Google for processing data in a distributed setting. First, all data is split into key-value tuples during the map phase, which is followed by the reduce phase, where tuples sharing a key are grouped to generate a single output value per key. MapReduce, and its open-source implementation Hadoop, rely heavily on the distributed file system in every phase of execution.
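
To make the two phases concrete, here is a minimal, single-process word-count sketch in plain Python; a real Hadoop job would run the mapper and reducer on the nodes holding the data, and the function names here are purely illustrative.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) tuple for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Reduce: collapse all values for one key into a single output value.
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle step: group intermediate tuples by key before reducing.
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(word_counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```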

Apache Spark 

Linear-algebra transformations, as they occur in many machine learning algorithms, are typically highly iterative in nature, and the map-and-reduce paradigm is not ideal for such iterative tasks. Apache Spark was developed to resolve this.

The key difference is that MapReduce tasks write all (intermediate) data to disk between steps, whereas Spark can keep the data in memory, which saves expensive reads from disk.
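
The difference matters for iterative workloads. The sketch below, assuming a local PySpark installation, caches a toy dataset in memory once and reuses it across gradient-descent passes instead of re-reading it from disk on every iteration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()

# Parallelize a toy dataset (x, y) and keep it in memory across iterations.
points = spark.sparkContext.parallelize([(1.0, 2.0), (2.0, 4.1), (3.0, 6.2)]).cache()

w = 0.0  # single weight for the model y ≈ w * x
for _ in range(10):
    # Each pass reuses the cached RDD; no repeated disk I/O as in MapReduce.
    gradient = points.map(lambda p: (w * p[0] - p[1]) * p[0]).mean()
    w -= 0.1 * gradient

print("fitted weight:", w)  # converges towards roughly 2.0
spark.stop()
```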

Baidu AllReduce

Baidu's AllReduce uses common high-performance computing technology to iteratively train stochastic gradient descent models on separate mini-batches of the training data. Baidu claims linear speedup when applying this technique to train deep learning networks.
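
The core primitive is an all-reduce: every worker contributes its local gradient and receives the aggregated result back, so all model replicas apply the same update. The sketch below uses mpi4py to illustrate the collective; it is not Baidu's implementation, just the same pattern.

```python
# Run with e.g.: mpirun -np 4 python allreduce_sgd.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world_size = comm.Get_rank(), comm.Get_size()

# Each worker computes a gradient on its own mini-batch (random here for illustration).
local_grad = np.random.rand(10).astype(np.float64)

# All-reduce: sum gradients across all workers; the result is available on every worker.
global_grad = np.zeros_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= world_size  # average across workers

# Every worker applies the identical averaged update to its copy of the model.
weights = np.zeros(10)
weights -= 0.01 * global_grad
```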

Horovod

Horovod, like Baidu's approach, adds a layer of AllReduce-based MPI training to Tensorflow. Horovod uses the NVIDIA Collective Communications Library (NCCL) for increased efficiency when training on (Nvidia) GPUs. However, Horovod lacks fault tolerance and therefore suffers from the same scalability issues as Baidu's approach.
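
In practice, the changes Horovod requires over a single-GPU TensorFlow/Keras script are small. The following is a minimal sketch of the usual pattern; the model is a placeholder and the data pipeline is omitted.

```python
# Run with e.g.: horovodrun -np 4 python train.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU

# Pin each process to its own GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation='softmax')])

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged with an (NCCL-backed) all-reduce on every step.
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
# model.fit(train_data, callbacks=callbacks)  # data pipeline omitted in this sketch
```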

Caffe2

This deep learning framework distributes machine learning through AllReduce algorithms. It does this by using NCCL between GPUs on a single host, and custom code between hosts based on Facebook’s Gloo library.
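
Gloo is also exposed as a collective backend in other frameworks, for example in PyTorch's torch.distributed. The sketch below shows that generic Gloo usage for a CPU all-reduce; it is not Caffe2-specific code.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # Initialise the Gloo process group (Gloo handles CPU collectives between hosts).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each process holds a local gradient tensor; all_reduce sums them in place.
    grad = torch.ones(4) * (rank + 1)
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {grad.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```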

Microsoft Cognitive Toolkit

This toolkit offers multiple ways of data-parallel distribution. Many of them use the AllReduce tactic described above, making the same trade-off of linear scalability over fault tolerance.

DistBelief

Developed by Google, DistBelief is one of the early practical implementations of large-scale distributed machine learning. It supports data- and model-parallel training on tens of thousands of CPU cores, and has been used to train a huge model with 1.7 billion parameters.

Tensorflow

Developed by Google, Tensorflow has evolved from DistBelief and borrows the concepts of a computation graph and parameter server from it. Unlike in DistBelief, defining a new type of neural network layer in Tensorflow requires no custom code: layers can be composed from fundamental math operations.
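
For instance, a dense layer can be written directly as a composition of primitive operations; a minimal sketch with the current TensorFlow API:

```python
import tensorflow as tf

# A "new layer" composed purely of fundamental math operations:
# y = relu(x @ W + b), with gradients derived automatically from the graph.
W = tf.Variable(tf.random.normal([4, 3]))
b = tf.Variable(tf.zeros([3]))

@tf.function  # traced into a computation graph
def dense_relu(x):
    return tf.nn.relu(tf.matmul(x, W) + b)

x = tf.random.normal([2, 4])
print(dense_relu(x))
```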

DIANNE (Distributed Artificial Neural Networks) 

A Java-based distributed deep learning framework, DIANNE uses the Torch native backend for executing the necessary computations. Each basic building block of a neural network can be deployed on a specific node, thereby enabling model parallelism.

MXNet 

On a small cluster of 10 GPU-equipped machines, MXNet achieves almost linear speedup over a single machine when training GoogLeNet. As in Tensorflow, models are represented as dataflow graphs.
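
MXNet coordinates workers through its KVStore, a parameter-server abstraction with push and pull operations. A minimal single-machine sketch is shown below; in a cluster, the store type would be something like 'dist_sync' instead of 'local'.

```python
import mxnet as mx

# Create a key-value store; on a cluster this would be e.g. mx.kv.create('dist_sync').
kv = mx.kv.create('local')

shape = (2, 3)
kv.init(3, mx.nd.ones(shape))        # initialise key 3 with ones

# Workers push gradients; the store aggregates them.
kv.push(3, mx.nd.ones(shape) * 2)

# Workers pull the aggregated value back.
out = mx.nd.zeros(shape)
kv.pull(3, out=out)
print(out.asnumpy())
```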

Petuum

This approach is aimed at exploiting ML’s error tolerance, dependencies, and non-uniform convergence in order to achieve good scalability on large datasets.

Petuum uses the Parameter Server paradigm to keep track of the model being trained. 

Petuum provides an abstraction layer that also allows it to run on systems using the Hadoop job scheduler and HDFS (the Hadoop Distributed File System), which simplifies compatibility with pre-existing clusters.
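
Petuum's own programming interface is not shown here; instead, the sketch below illustrates the general parameter-server pattern it builds on, with hypothetical class and function names: workers pull the current model, compute updates on local data, and push deltas back.

```python
import numpy as np

class ParameterServer:
    """Holds the shared model; workers pull parameters and push updates."""
    def __init__(self, dim):
        self.weights = np.zeros(dim)

    def pull(self):
        return self.weights.copy()

    def push(self, delta):
        # A bounded-staleness system such as Petuum tolerates slightly stale
        # pushes; in this toy sketch updates are simply applied immediately.
        self.weights += delta

def worker_step(server, x_batch, y_batch, lr=0.1):
    w = server.pull()                                    # pull current model
    grad = x_batch.T @ (x_batch @ w - y_batch) / len(y_batch)
    server.push(-lr * grad)                              # push the update

# Toy usage: two "workers" alternating on shards of a linear-regression problem.
rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(100, 5)), np.arange(5, dtype=float)
y = X @ true_w
server = ParameterServer(dim=5)
for _ in range(200):
    for shard in (slice(0, 50), slice(50, 100)):
        worker_step(server, X[shard], y[shard])
print(server.weights.round(2))  # approaches [0. 1. 2. 3. 4.]
```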

Scaling out is still a pressing challenge that delays the widespread usage of distributed models. Not all machine learning algorithms lend themselves to a distributed computing model that can achieve a high degree of parallelism.

