How Do Large Firms Train ML Models At Scale?

  • Microsoft's DeepSpeed abstracts difficult aspects of large-scale training such as parallelisation, mixed precision, and gradient accumulation.

One of the easiest ways to improve a machine learning model is to make it bigger: the additional capacity gives the model room to learn more complex relationships in the data.

However, taking an ML model from the conceptualisation stage to production is a complex and time-consuming process. The challenges include managing large amounts of data, choosing the right training algorithm, managing compute capacity during training, and finally deploying the model to the production environment.


Below, we look at how large firms such as Microsoft, Google, Uber, Tesla, Twitter, and LinkedIn train models at scale.

Microsoft

In 2019, Microsoft developed the Zero Redundancy Optimizer (ZeRO) to optimise memory and improve training speed as model size increases. ZeRO eliminates memory redundancies in data-parallel and model-parallel training while retaining high computational granularity and low communication volume. The team claimed ZeRO could scale beyond a trillion parameters.

In 2020, Microsoft released ZeRO-2, which trains large AI models with up to 170 billion parameters. It optimises memory consumption, reducing activation memory and memory fragmentation, and has cut training time by 30 percent for models like BERT.


ZeRO-2 optimises the full spectrum of memory consumption during deep learning training: model state, activation memory, and fragmented memory. It optimises large models during distributed training and introduces a new technique to accelerate single-GPU performance using kernel optimisation.

Credit: Microsoft
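Concretely, the ZeRO paper accounts for per-parameter memory in mixed-precision Adam training as 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state (master weights, momentum, variance), and each ZeRO stage partitions more of this state across GPUs instead of replicating it. A rough sketch of that arithmetic, assuming those constants (the function name and stage numbering below are ours for illustration):

```python
def zero_memory_per_gpu(num_params, num_gpus, stage):
    """Approximate per-GPU memory (bytes) for mixed-precision Adam training.

    Per parameter: 2 B fp16 weights, 2 B fp16 gradients,
    12 B optimizer state (fp32 weights, momentum, variance).
    stage 0: plain data parallelism (everything replicated)
    stage 1: partition optimizer states across GPUs (ZeRO-1)
    stage 2: also partition gradients (ZeRO-2)
    stage 3: also partition the fp16 weights (ZeRO-3)
    """
    weights, grads, opt = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        opt /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        weights /= num_gpus
    return weights + grads + opt

# A 7.5-billion-parameter model on 64 GPUs:
gib = 1024 ** 3
for s in range(4):
    print(f"stage {s}: {zero_memory_per_gpu(7.5e9, 64, s) / gib:.1f} GiB per GPU")
```

The baseline (stage 0) needs roughly 112 GiB per GPU, far beyond any single accelerator, while each successive stage divides another term by the GPU count.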

Microsoft also offers DeepSpeed, an open-source framework built on PyTorch that optimises the training of large models by providing a simple API for training parallelisation. ZeRO and ZeRO-2 are implemented in DeepSpeed. DeepSpeed abstracts challenging aspects of large-scale training such as parallelisation, mixed precision, and gradient accumulation.
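Gradient accumulation, one of the techniques DeepSpeed automates, can be illustrated in plain Python: gradients from several micro-batches are averaged before a single optimizer step, simulating a batch larger than fits in memory at once. The toy model and function names below are illustrative, not DeepSpeed's API:

```python
def sgd_step_with_accumulation(w, micro_batches, grad_fn, lr=0.1, accum_steps=4):
    """One effective optimizer step: accumulate gradients over `accum_steps`
    micro-batches, then update once with the averaged gradient.

    `grad_fn(w, batch)` returns the gradient of the loss on that batch.
    """
    accum = 0.0
    for batch in micro_batches[:accum_steps]:
        accum += grad_fn(w, batch)            # sum micro-batch gradients
    return w - lr * (accum / accum_steps)     # single step with the average

# Toy example: minimise (w - mean(x))^2; the gradient is 2*(w - mean(batch)).
def grad_fn(w, batch):
    return 2 * (w - sum(batch) / len(batch))

batches = [[1.0, 2.0], [3.0, 4.0], [2.0, 2.0], [3.0, 3.0]]
w = 0.0
for _ in range(100):
    w = sgd_step_with_accumulation(w, batches, grad_fn)
print(round(w, 3))  # converges to the overall mean, 2.5
```

Because the four micro-batch gradients are averaged, the update is identical to one computed on the combined batch, which is exactly why accumulation lets small-memory devices emulate large batches.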

Google

In 2019, Google introduced GPipe, a technique for efficient training of giant neural networks using pipeline parallelism. In pipeline parallelism, multiple steps depend on each other, but their execution can overlap, with the output of one step fed as the input of the next.

Credit: Google

GPipe is a distributed machine learning library that uses synchronous stochastic gradient descent together with pipeline parallelism to train any DNN consisting of multiple sequential layers. GPipe partitions a model across various accelerators and splits mini-batches of training examples into even smaller micro-batches. Hence, GPipe's accelerators can operate in parallel and maximise the scalability of the training process. It allows easy deployment of more accelerators to train large models and further scale performance without tuning hyperparameters.
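The benefit of micro-batching can be quantified: with K sequential partitions and M micro-batches, a GPipe-style forward pipeline takes M + K - 1 time slots, so the idle "bubble" fraction is (K - 1) / (M + K - 1), which shrinks as M grows (the GPipe paper gives this bubble overhead as O((K - 1)/(M + K - 1))). A small sketch, with a function name of our own:

```python
def pipeline_stats(num_stages, num_micro_batches):
    """Timeline length and idle fraction of a GPipe-style pipeline.

    With K stages and M micro-batches, the forward pass takes
    M + K - 1 time slots; the first and last slots leave some stages
    waiting, giving a 'bubble' (idle) fraction of (K - 1) / (M + K - 1).
    """
    k, m = num_stages, num_micro_batches
    slots = m + k - 1
    bubble = (k - 1) / slots
    return slots, bubble

# Splitting a mini-batch into more micro-batches shrinks the bubble:
for m in (1, 4, 32):
    slots, bubble = pipeline_stats(4, m)
    print(f"4 stages, {m:2d} micro-batches: {slots} slots, {bubble:.0%} idle")
```

With a single micro-batch the four stages sit idle 75 percent of the time (plain model parallelism); at 32 micro-batches the bubble falls under 10 percent, which is why GPipe's accelerators can operate near-fully in parallel.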


GPipe performed well on multiple popular datasets, with 84.3 percent accuracy on ImageNet, 99 percent on CIFAR-10, and 91.3 percent on CIFAR-100.

Uber

Uber’s Horovod is an open-source framework for distributed deep learning training with TensorFlow, PyTorch, Keras, and Apache MXNet. Named after a traditional Russian folk dance, Horovod leverages message passing interface (MPI) implementations such as Open MPI to let a model train on a highly parallel and distributed infrastructure without modification.
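Horovod's gradient exchange is built on the ring-allreduce algorithm, which sums each worker's gradients in N - 1 reduce-scatter steps followed by N - 1 all-gather steps, so per-worker communication stays constant as workers are added. A minimal pure-Python simulation of that algorithm (not Horovod's actual implementation, which runs over MPI or NCCL):

```python
def ring_allreduce(workers):
    """Sum-allreduce over N workers via the ring algorithm Horovod uses.

    workers[w] is worker w's gradient, stored as N chunks (lists of floats).
    After the call, every worker holds the element-wise sum of all gradients.
    """
    n = len(workers)
    # Reduce-scatter: at step s, worker w sends chunk (w - s) % n to its
    # right neighbour, which adds it in. Afterwards, worker w owns the
    # fully-summed chunk (w + 1) % n.
    for s in range(n - 1):
        sends = [(w, (w - s) % n, list(workers[w][(w - s) % n])) for w in range(n)]
        for w, c, data in sends:
            dst = workers[(w + 1) % n][c]
            for i, v in enumerate(data):
                dst[i] += v
    # All-gather: at step s, worker w passes finished chunk (w + 1 - s) % n
    # to its right neighbour, which overwrites its stale copy.
    for s in range(n - 1):
        sends = [(w, (w + 1 - s) % n, list(workers[w][(w + 1 - s) % n])) for w in range(n)]
        for w, c, data in sends:
            workers[(w + 1) % n][c] = data

# Three workers, each gradient split into three one-element chunks:
workers = [[[1.0], [2.0], [3.0]],
           [[4.0], [5.0], [6.0]],
           [[7.0], [8.0], [9.0]]]
ring_allreduce(workers)
print(workers[0])  # every worker now holds [[12.0], [15.0], [18.0]]
```

Each worker sends only 2 x (N - 1) chunks, i.e. roughly twice its gradient size, regardless of N, which is what makes the scheme bandwidth-optimal for large clusters.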

Recently, Horovod introduced Elastic Horovod, which scales the number of workers dynamically during training, solving the problem of autoscaling the training process.

Tesla

PyTorch is the backbone of various features running behind the scenes of Tesla’s AI stack, powering the self-driving objectives of Tesla’s vehicles.

Tesla collects swathes of data from multiple sources, such as road markings, traffic signals, overhead signs, moving and static objects, crosswalks, and environment tags.

The collected data is labelled, and training is done on on-premise GPU clusters before models are taken through the entire stack. Tesla’s workflow is a multi-task setting. Since it is impractical to have a separate neural network for each of these tasks, Tesla employs HydraNets, networks with a shared backbone and task-specific heads, to solve recurring tasks. HydraNet training combines data-parallel and model-parallel training.

Credit: PyTorch

At Tesla, multi-task training is done in three main ways: round-robin training, a sync pool of workers, and an async pool of workers. For example, Tesla’s Autopilot uses 48 networks that make 1,000 predictions and take 70,000 GPU hours to train.
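Of these, round-robin training is the simplest to sketch: each training step takes the next task in a fixed rotation, so every head of the shared network receives an equal share of update steps. The task names and function below are illustrative, not Tesla's code:

```python
from itertools import cycle

def round_robin_schedule(tasks, num_steps):
    """Round-robin multi-task schedule: each training step picks the next
    task in a fixed rotation, so every head of a shared-backbone network
    (a HydraNet-style model) gets an equal share of update steps."""
    order = cycle(tasks)
    return [next(order) for _ in range(num_steps)]

steps = round_robin_schedule(["detection", "lanes", "depth"], 7)
print(steps)
# ['detection', 'lanes', 'depth', 'detection', 'lanes', 'depth', 'detection']
```

The sync and async worker pools trade this strict fairness for throughput: workers train different heads concurrently and merge updates either at a barrier or asynchronously.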

Twitter

Twitter’s models are primarily trained on sparse data, that is, data with a large number of fields of which only a few have values. To increase development speed and build more relevant models, Twitter uses distributed model training in TensorFlow.
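A minimal sketch of what "sparse" means here: each example stores only its populated fields as an index-to-value map, and computations touch only those fields rather than the full feature space. The names below are illustrative:

```python
def sparse_dot(features, weights):
    """Dot product between a sparse example (feature index -> value, with
    only non-empty fields stored) and a dense weight vector. The work is
    proportional to the populated fields, not the full feature space."""
    return sum(value * weights[idx] for idx, value in features.items())

# An example with a million possible fields but only three populated:
example = {7: 1.0, 4242: 0.5, 999_999: 2.0}
weights = [0.01] * 1_000_000
print(round(sparse_dot(example, weights), 6))  # 0.035
```

Storing and shipping only the populated fields is what makes data-parallel training over such wide feature spaces tractable: each worker exchanges kilobytes per example instead of megabytes.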

In a blog post, Twitter claimed that with customised distributed training the team was able to increase performance 100-fold over the standard TensorFlow distribution strategies, and achieve a 60-fold speedup compared with training on a single machine. Data parallelism and model parallelism form the core of training models at Twitter.

LinkedIn

At LinkedIn, most model training occurs offline. Models are trained and retrained every few hours using Hadoop. Further, LinkedIn uses a proprietary Pro-ML training service that leverages Azkaban and Spark to execute training workflows. The infrastructure supports different models as well as tools for tasks such as hyperparameter tuning.


Copyright Analytics India Magazine Pvt Ltd
