
Distributed Machine Learning Is The Answer To Scalability And Computation Requirements


We all know the traditional way of machine learning, where programmers use an integrated tool for data mining and conduct analysis on the results. However, this approach breaks down when the data is too large to fit in the RAM of a single computer. Most existing ML algorithms are designed under the assumption that the data can be accessed easily, and accessed many times, which does not hold at that scale.

Scalability Led to the Rise of Distributed ML

It was the challenge of scaling learning algorithms, in terms of both computation and memory, to large-scale data that gave rise to distributed ML. For example, if an algorithm's memory demands outpace the main memory available, the algorithm will not scale: it will either fail to process the training data set or fail to run at all due to memory restrictions. Distributed ML algorithms were developed to handle very large data sets efficiently and scalably, balancing accuracy against the requirements of computation (memory, time and communication).

Distributed ML algorithms are part of large-scale learning, which has received considerable attention over the last few years thanks to its ability to allocate the learning process across several workstations, that is, to use distributed computing to scale up learning algorithms. It is these advances that make ML tasks on big data scalable, flexible and efficient. Distributed learning algorithms take two broad approaches, corresponding to the two most common types of data fragmentation (illustrated in the sketch after this list):

  1. Horizontal fragmentation, where subsets of instances are stored at different sites
  2. Vertical fragmentation, where subsets of attributes of instances are stored at different sites
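
To make the distinction concrete, here is a minimal Python sketch that partitions a small matrix both ways; the dataset, the number of sites and the split points are all made up for illustration:

```python
import numpy as np

# 5 instances (rows) x 4 attributes (columns); values are arbitrary.
X = np.arange(20).reshape(5, 4)

# Horizontal fragmentation: each site stores a subset of the instances (rows).
horizontal_fragments = np.array_split(X, 2, axis=0)  # e.g. 2 sites

# Vertical fragmentation: each site stores a subset of the attributes (columns).
vertical_fragments = np.array_split(X, 2, axis=1)

for i, frag in enumerate(horizontal_fragments):
    print(f"site {i} (horizontal fragment): shape {frag.shape}")
for i, frag in enumerate(vertical_fragments):
    print(f"site {i} (vertical fragment): shape {frag.shape}")
```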

Some of the most common scenarios where distributed ML algorithms are deployed are healthcare and advertising, where even a simple application can accumulate a lot of data. Since the data is huge, programmers frequently re-train models so as not to interrupt the workflow, and rely on parallel loading. For example, MapReduce was built to allow automatic parallelisation and distribution of large-scale special-purpose computations that process large amounts of raw data, such as crawled documents or web request logs, and compute various kinds of derived data.
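
To give a feel for the MapReduce model described above, here is a minimal single-process sketch of its map, shuffle and reduce phases; a real framework such as Hadoop runs the map and reduce calls in parallel across a cluster, and the word-count task and sample documents below are purely illustrative:

```python
from collections import defaultdict

def map_phase(document):
    # Emit (key, value) pairs; here, one (word, 1) pair per word.
    for word in document.split():
        yield word, 1

def reduce_phase(key, values):
    # Aggregate all values emitted for the same key.
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle step: group intermediate values by key (done by the framework).
grouped = defaultdict(list)
for doc in documents:  # in a real cluster, the map calls run in parallel
    for key, value in map_phase(doc):
        grouped[key].append(value)

# Reduce step: in a real cluster, the reducers also run in parallel.
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)  # {'the': 3, 'quick': 1, ...}
```
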
Several Distributed ML Platforms Have Emerged

Among the most widely used distributed data processing systems for ML workloads are Apache Spark's MLlib and Apache Mahout. Microsoft has also released its Distributed Machine Learning Toolkit (DMTK), which contains both algorithmic and system innovations. The DMTK framework supports a unified interface for data parallelisation, a hybrid data structure for big model storage, model scheduling for big model training, and automatic pipelining for high training efficiency. Together, such system and ML innovations are pushing the frontiers of distributed ML.
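
As a rough illustration of what distributed training looks like from the programmer's side, here is a minimal PySpark MLlib sketch; it assumes a running Spark installation, and the Parquet path, column names and model choice are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("distributed-ml-sketch").getOrCreate()

# Executors read and transform their own partitions of the data in parallel.
df = spark.read.parquet("hdfs:///data/clicks.parquet")  # hypothetical path

# Assemble the (hypothetical) feature columns into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df).select("features", "label")

# Gradients are computed per partition and aggregated by the driver.
model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)
```

The point of the sketch is that the code reads like single-machine training: Spark partitions the DataFrame across executors, so the heavy computation inside `fit` runs in parallel while the driver only coordinates.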

Single Machine vs Distributed ML

  • Experts emphasise that traditional ML approaches are designed to address the dataset at hand, which implies processing the data centrally in a single database. However, this is usually not possible: the cost of storing a single monolithic dataset is higher than that of storing the data in smaller parts, and the computational cost of mining one large data repository is higher than that of processing smaller parts of the data.
  • As opposed to a centralised approach, a distributed mining approach enables parallel processing. Distributed learning algorithms also have their foundations in ensemble learning, which builds a set of classifiers to improve on the accuracy of any single classifier. The ensemble approach maps naturally onto a distributed environment, since each classifier can be trained at the site where its subset of the data is stored (see the sketch after this list).
  • Distributed learning also provides the best solution to large-scale learning, given that memory limitation and algorithmic complexity are the main obstacles. Besides avoiding centralised storage, distributed learning is scalable: growing data volumes can be offset by adding more processors.
  • Experts also predict that, in the future, data analytics will primarily be done in distributed environments.
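
As a minimal sketch of the ensemble idea mentioned above, the following example trains one decision tree per horizontal fragment and combines the local models by majority vote; the synthetic dataset, the three-site split and the choice of decision trees are all assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A hypothetical binary classification dataset.
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_train, y_train, X_test, y_test = X[:2400], y[:2400], X[2400:], y[2400:]

# Horizontally fragment the training data across three "sites".
site_X = np.array_split(X_train, 3)
site_y = np.array_split(y_train, 3)

# Each site trains a classifier locally, on its own fragment only.
local_models = [
    DecisionTreeClassifier(random_state=0).fit(Xs, ys)
    for Xs, ys in zip(site_X, site_y)
]

# Combine the local classifiers by majority vote (labels are 0/1 here).
votes = np.stack([m.predict(X_test) for m in local_models])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("ensemble accuracy:", (ensemble_pred == y_test).mean())
```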

Disadvantages of Distributed ML Algorithms

Unfortunately, writing and running a distributed ML algorithm is highly complicated, and platform dependency makes developing distributed ML packages difficult. There are also no standardised measures for evaluating distributed algorithms: many ML researchers report that existing measures, benchmarked against classical ML methods, are less reliable.

But one thing is clear: the practice of ML, which has so far concentrated on monolithic data sets from which learning algorithms generate a single model, is gradually being phased out in favour of distributed learning algorithms. The rise of big data and IoT has also produced many distributed data sets, and storing such big datasets in a central repository imposes huge processing and computing requirements. That is why researchers assert that distributed processing of data is the right computing platform.


Richa Bhatia

Richa Bhatia is a seasoned journalist with six years of experience in reportage and news coverage and has had stints at Times of India and The Indian Express. She is an avid reader, mum to a feisty two-year-old and loves writing about the next-gen technology that is shaping our world.