How Do Large Firms Train ML Models At Scale?

Microsoft's DeepSpeed abstracts difficult aspects of large-scale learning such as parallelisation, mixed precision, and gradient accumulation.

One of the easiest ways to improve a machine learning model is to make it bigger: the additional capacity gives it room to learn more complex relationships.

However, taking an ML model from conceptualisation to production is a complex and time-consuming process. The challenges include managing large amounts of data, choosing the best training algorithm, provisioning compute capacity during training, and finally deploying the model to the production environment.

Below, we look at how large firms such as Microsoft, Google, Uber, Tesla, Twitter, and LinkedIn train models at scale.

Microsoft

In 2019, Microsoft developed the Zero Redundancy Optimizer (ZeRO) to optimise memory and improve training speed as model size increases. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining high computational granularity and low communication volume. The team claimed ZeRO could scale beyond a trillion parameters.
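
For intuition, the memory arithmetic behind ZeRO can be sketched in a few lines. The stage semantics below follow the ZeRO paper (stage 1 partitions optimizer states, stage 2 adds gradients, stage 3 adds parameters); the function itself and the example figures are illustrative.

```python
# Back-of-the-envelope model-state memory per GPU for mixed-precision Adam,
# following the arithmetic in the ZeRO paper (activations and memory
# fragmentation are ignored here).

def zero_model_state_gb(params_billion: float, gpus: int, stage: int = 0) -> float:
    """Approximate model-state memory (GB) per GPU at a given ZeRO stage."""
    p = params_billion * 1e9
    fp16_params, fp16_grads, optim_states = 2 * p, 2 * p, 12 * p  # bytes each
    if stage >= 1:                 # ZeRO-1: partition optimizer states
        optim_states /= gpus
    if stage >= 2:                 # ZeRO-2: also partition gradients
        fp16_grads /= gpus
    if stage >= 3:                 # ZeRO-3: also partition parameters
        fp16_params /= gpus
    return (fp16_params + fp16_grads + optim_states) / 1e9

# A hypothetical 7.5B-parameter model on 64 GPUs:
print(zero_model_state_gb(7.5, 64, stage=0))  # ~120 GB, fits on no single GPU
print(zero_model_state_gb(7.5, 64, stage=2))  # ~16.6 GB, fits on a 32 GB GPU
```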

In 2020, Microsoft released ZeRO-2, which trains large AI models with up to 170 billion parameters. It optimises memory consumption and reduces activation and fragmented memory, and Microsoft reported a 30 percent reduction in training time for models like BERT.

ZeRO-2 optimises the full spectrum of memory consumed during deep learning training: model state, activation memory, and fragmented memory. Besides optimising large models in distributed training, it introduces a new technique that accelerates single-GPU performance through kernel optimisation.

Microsoft also offers DeepSpeed, an open-source framework built on PyTorch that optimises the training of large models by providing a simple API for training parallelisation. ZeRO and ZeRO-2 are implemented in DeepSpeed. DeepSpeed abstracts challenging aspects of large-scale learning such as parallelisation, mixed precision, and gradient accumulation.
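
A minimal sketch of that API, assuming a recent DeepSpeed release that accepts the configuration as a Python dict; the model, data loader, and config values are hypothetical placeholders rather than recommendations.

```python
# Hypothetical DeepSpeed training loop with ZeRO-2 and mixed precision.
import deepspeed
import torch

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,    # handled by the engine, not the user
    "fp16": {"enabled": True},           # mixed-precision training
    "zero_optimization": {"stage": 2},   # ZeRO-2: partition optimizer states + grads
}

model = MyModel()                        # hypothetical torch.nn.Module
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for batch, labels in dataloader:         # hypothetical data loader
    loss = torch.nn.functional.cross_entropy(engine(batch), labels)
    engine.backward(loss)                # replaces loss.backward()
    engine.step()                        # replaces optimizer.step()
```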

Google

In 2019, Google introduced GPipe, a technique for efficiently training giant neural networks using pipeline parallelism. In pipeline parallelism, consecutive steps depend on each other, but their execution can overlap, with the output of one step fed as input to the next.

Credit: Google

GPipe is a distributed machine learning library that combines pipeline parallelism with synchronous stochastic gradient descent to train any DNN composed of multiple sequential layers. GPipe partitions a model across several accelerators and splits each mini-batch of training examples into even smaller micro-batches. The accelerators can therefore operate in parallel, maximising the scalability of the training process, and more accelerators can be deployed to train larger models and further scale performance without tuning hyperparameters.
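
As a concrete illustration, torchgpipe, an open-source PyTorch implementation of GPipe, exposes exactly these two ideas: `balance` assigns consecutive layers to pipeline stages, and `chunks` sets how many micro-batches each mini-batch is split into. This sketch assumes a machine with two CUDA devices; the layer sizes are arbitrary.

```python
import torch
from torch import nn
from torchgpipe import GPipe

# The model must be an nn.Sequential so it can be cut into stages.
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 10),
)

# Two pipeline stages (3 layers + 2 layers), 8 micro-batches per mini-batch.
model = GPipe(model, balance=[3, 2], chunks=8)

x = torch.randn(64, 1024).to(model.devices[0])  # inputs go to the first stage
y = model(x)                                    # micro-batches overlap across stages
```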

GPipe performed well on multiple popular datasets, reaching 84.3 percent accuracy on ImageNet, 99 percent on CIFAR-10, and 91.3 percent on CIFAR-100.

Uber

Uber’s Horovod is an open-source framework for distributed deep learning training with TensorFlow, PyTorch, Keras, and Apache MXNet. Named after a traditional Russian folk dance, Horovod leverages message passing interface (MPI) stacks such as Open MPI so that a model can run on highly parallel, distributed infrastructure with minimal modification.
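
The core pattern is small enough to sketch; `MyModel` is a hypothetical stand-in for any `torch.nn.Module`, and the script would be launched with something like `horovodrun -np 4 python train.py`.

```python
import horovod.torch as hvd
import torch

hvd.init()                                  # one process per GPU
torch.cuda.set_device(hvd.local_rank())

model = MyModel().cuda()                    # hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Start every worker from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```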

More recently, the team introduced Elastic Horovod for distributed training, which scales the number of workers dynamically throughout the training process and thereby solves the problem of autoscaling training.
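
In Elastic Horovod, the training state is wrapped in a synchronisable object, and the training function is decorated so it restarts cleanly whenever workers join or leave; `train_one_epoch` below is a hypothetical helper.

```python
import horovod.torch as hvd

hvd.init()

@hvd.elastic.run                     # re-runs on worker membership changes
def train(state):
    # Resume from the last committed epoch after any reset.
    for state.epoch in range(state.epoch, 100):
        train_one_epoch(state.model, state.optimizer)  # hypothetical helper
        state.commit()               # checkpoint state so new workers can sync

state = hvd.elastic.TorchState(model=model, optimizer=optimizer, epoch=0)
train(state)
```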

Tesla

PyTorch is the backbone of various features that run in the background of Tesla's AI stack and powers the company's fully autonomous driving objectives.

Tesla collects swathes of data covering road markings, traffic signals, overhead signs, moving and static objects, crosswalks, and environment tags.

The collected data is labelled, and training is done on on-premise GPU clusters before the model is taken through the entire stack. Tesla's workflow is a multi-task setting: since it is impractical to run a separate neural network for each task, Tesla employs HydraNets, which share computation across recurring tasks. HydraNets are trained with a combination of data-parallel and model-parallel training.
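
An illustrative multi-head network in the spirit of HydraNets (not Tesla's actual architecture): the shared backbone is computed once per image and feeds several task-specific heads, with task names and sizes invented for the example.

```python
import torch
from torch import nn

class HydraNet(nn.Module):
    def __init__(self, num_lane_classes=4, num_sign_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(              # shared feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.heads = nn.ModuleDict({                # one head per task
            "lanes": nn.Linear(64, num_lane_classes),
            "signs": nn.Linear(64, num_sign_classes),
        })

    def forward(self, x):
        features = self.backbone(x)                 # computed once
        return {name: head(features) for name, head in self.heads.items()}

outputs = HydraNet()(torch.randn(2, 3, 128, 128))   # {"lanes": ..., "signs": ...}
```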

At Tesla, multi-task training is done in three main ways: round-robin training, a synchronised pool of workers, and an asynchronous pool of workers. To give a sense of scale, Tesla trains 48 networks that make 1,000 distinct predictions for Autopilot, a job that takes 70,000 GPU hours.
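
Round-robin scheduling, the first of the three, might look like this in a training loop; it reuses the HydraNet sketch above, and `next_batch` is a hypothetical per-task data loader.

```python
import itertools
import torch

tasks = ["lanes", "signs"]                    # hypothetical task list
model = HydraNet()                            # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

# Tasks take turns; each step updates the shared backbone plus one head.
for step, task in zip(range(1000), itertools.cycle(tasks)):
    images, labels = next_batch(task)         # hypothetical per-task loader
    optimizer.zero_grad()
    loss = loss_fn(model(images)[task], labels)
    loss.backward()
    optimizer.step()
```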

Twitter

Twitter’s models are primarily trained on sparse data, that is, data with many fields of which only a few carry values. To increase development speed and build more relevant models, Twitter uses distributed model training in TensorFlow.

In a blog post, Twitter claimed its customised distributed training delivered a 100-fold performance improvement over the standard TensorFlow distribution strategies, which themselves delivered a 60-fold speedup over training on a single machine. Data parallelism and model parallelism form the core of training models at Twitter.
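
For context, the standard strategies Twitter benchmarked against follow this pattern; the tiny Keras model and the dataset are illustrative.

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every local GPU and averages
# gradients with all-reduce after each step.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():                        # variables created here are mirrored
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

model.fit(dataset, epochs=3)                  # hypothetical tf.data.Dataset
```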

LinkedIn

At LinkedIn, most model training occurs offline: models are trained and retrained every few hours using Hadoop. LinkedIn also runs a proprietary Pro-ML training service that leverages Azkaban and Spark to execute training workflows. The infrastructure supports different models as well as tools for tasks such as hyperparameter tuning.
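
Pro-ML's internals are proprietary, but a periodic offline retraining job of the kind such workflows execute might look roughly like this in Spark ML; the paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("periodic-retrain").getOrCreate()

# Read the latest feature snapshot produced by upstream Hadoop jobs.
df = spark.read.parquet("hdfs:///features/latest")        # hypothetical path

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train_df = assembler.transform(df)

# Fit and overwrite the serving model; a scheduler (e.g. Azkaban) re-runs
# this job every few hours.
model = LogisticRegression(labelCol="label").fit(train_df)
model.write().overwrite().save("hdfs:///models/latest")   # hypothetical path
```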
