How Google’s GPipe Is Using Pipeline Parallelism For Training Neural Networks

Training bigger neural networks is challenging when accelerator memory is limited. The datasets used by machine learning models today are enormous; a standard image classification dataset such as hashtagged Instagram contains millions of images, and as image quality increases, so does the memory required to process them. Meanwhile, a high-end NVIDIA GPU today offers only 32 GB of memory.

Practitioners are therefore forced to trade off between the memory allocated to a model's parameters and the memory needed to store its activations, which is why finding ways around the accelerator memory limit matters so much.

A deep neural network benefits from larger datasets, which alleviate the problem of overfitting. And to train these ever-growing networks, we need deep learning supercomputers such as Google's TPU pods or NVIDIA's DGX, which enable parallelism by providing fast interconnects between accelerators.


Today, the average ImageNet image resolution is 469 x 387, and it has been shown that increasing the size of the input image improves a classifier's final accuracy. To fit within current accelerator memory limits, however, most models are made to process images at 299 x 299 or 331 x 331.

Meet GPipe

In a recent paper, researchers at Google Brain propose pipeline parallelism to scale up the training of deep neural networks, and introduce a new machine learning library called GPipe.


GPipe can be used to partition a model across different accelerators and to automatically split a mini-batch of training examples into smaller micro-batches. Pipelining these micro-batches allows the accelerators to operate in parallel.
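To build intuition for how pipelining keeps the accelerators busy, the sketch below (illustrative only, not GPipe's actual implementation) enumerates which accelerator works on which micro-batch at each clock step of the forward pass:

```python
def pipeline_schedule(K, T):
    """Forward-pass schedule for K accelerators and T micro-batches:
    at clock step t, accelerator k processes micro-batch t - k
    (when that index is valid). The whole forward pass takes
    K + T - 1 steps instead of K * T sequential ones."""
    steps = []
    for t in range(K + T - 1):
        active = [(k, t - k) for k in range(K) if 0 <= t - k < T]
        steps.append(active)
    return steps

# 2 accelerators, 3 micro-batches: after a 1-step "fill" phase,
# both accelerators work simultaneously on different micro-batches.
sched = pipeline_schedule(K=2, T=3)
```

Once the pipeline is full, every accelerator is occupied at every step, which is where the near-linear speedup comes from.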

GPipe also reduces the memory required during backpropagation: instead of storing every forward activation, it recomputes activations during the backward pass. This enables users to deploy more accelerators to train larger models and to scale performance without re-tuning hyperparameters.
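This recomputation trick can be sketched in a few lines of pure Python. The function names are illustrative, not the real GPipe API: each partition caches only its boundary input, and intermediate activations are re-derived on demand during the backward pass:

```python
def forward(partitions, x):
    """Run each partition's layers in order, caching only the
    activations at partition boundaries (K + 1 values for K
    partitions), not every intermediate layer output."""
    boundary_cache = [x]
    for layers in partitions:
        for layer in layers:
            x = layer(x)
        boundary_cache.append(x)
    return x, boundary_cache

def recompute_intermediates(layers, boundary_input):
    """During backprop, re-run one partition's forward pass from its
    cached boundary input to recover every intermediate activation."""
    acts = [boundary_input]
    for layer in layers:
        acts.append(layer(acts[-1]))
    return acts

# Toy model: two partitions of two layers each, on a scalar input.
partitions = [
    [lambda v: v + 1, lambda v: v * 2],   # partition 0
    [lambda v: v - 3, lambda v: v * v],   # partition 1
]

out, cache = forward(partitions, 5.0)
# Only 3 boundary activations are cached, not all 4 layer outputs.
acts = recompute_intermediates(partitions[1], cache[1])
```

The memory saved grows with the depth of each partition, at the cost of one extra forward pass per partition during backpropagation.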

Researchers at Google Brain say, “GPipe can support models up to 25 times larger using 8 accelerators without reducing the batch size. The implementation of GPipe is very efficient: with 4 times more accelerators we can achieve a 3.5 times speedup for training giant neural networks.”

To test and demonstrate GPipe's functionality, the researchers scaled an AmoebaNet model up to 557 million parameters on the ImageNet ILSVRC-2012 dataset, with an input image size of 480 x 480. This scaled-up model attains a top-1 validation accuracy of 84.3%, outperforming all other models trained from scratch on ImageNet.

For context, the 2014 ImageNet challenge saw accuracy scores of 74.8% with 4 million parameters. By 2017, accuracy had risen to 82.7% while using 145.8 million parameters, 36 times as many as before.

The researchers have also managed to push the CIFAR-10 accuracy to 99%. The CIFAR-10 dataset contains 60,000 32 x 32 color images in 10 different classes. The 10 different classes represent aeroplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6,000 images of each class.

Design Features Of GPipe

The core algorithm is implemented on top of the TensorFlow library. When invoking the GPipe library, the user specifies a sequential list of L layers, where each layer specifies its model parameters, a stateless forward computation function, and an optional cost estimation function.
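A minimal sketch of such a layer specification might look as follows. `LayerSpec` and its fields are hypothetical stand-ins, not the real TensorFlow/GPipe API; the point is simply that each layer bundles parameters, a stateless forward function, and an optional cost estimator:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class LayerSpec:
    """Illustrative layer specification: parameters, a stateless
    forward function of (params, input), and an optional cost
    estimator used when balancing partitions across accelerators."""
    params: Any
    forward: Callable
    cost_fn: Optional[Callable] = None

# A sequential list of L = 2 toy layers operating on scalars.
layers = [
    LayerSpec(params=2.0, forward=lambda p, x: p * x),
    LayerSpec(params=1.0, forward=lambda p, x: p + x,
              cost_fn=lambda p: 1),
]

x = 3.0
for spec in layers:
    x = spec.forward(spec.params, x)
# x is now (2 * 3) + 1
```

Because the forward functions are stateless, any partition can re-execute them from cached inputs, which is exactly what the recomputation step relies on.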

After the layer specifications have been defined, GPipe partitions the network into K composite layers and places the k-th composite layer onto the k-th accelerator. The number of partitions, K, is user-defined. During training, GPipe first divides a mini-batch of size N into T micro-batches, each containing N/T examples.
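The micro-batch split itself is straightforward; the sketch below (illustrative, assuming T divides N evenly as GPipe does) shows the arithmetic:

```python
def split_microbatches(minibatch, T):
    """Split a mini-batch of N examples into T micro-batches of
    N / T examples each; T must divide N evenly."""
    N = len(minibatch)
    assert N % T == 0, "T must divide the mini-batch size N"
    size = N // T
    return [minibatch[i * size:(i + 1) * size] for i in range(T)]

# A mini-batch of N = 8 examples split into T = 4 micro-batches.
batch = list(range(8))
micro = split_microbatches(batch, T=4)
```

Choosing T trades pipeline utilisation (more micro-batches mean less idle "bubble" time) against per-micro-batch overhead.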

Each accelerator stores only the output activations at the partition boundaries, rather than the activations of all intermediate layers within its partition. During the backward pass, the accelerator recomputes its composite forward function from the cached boundary activations, reducing the overall memory allocation.

The gradients for each micro-batch are computed with the same model parameters as the forward pass. At the end of each mini-batch, the accumulated gradients are applied to update the model parameters across all accelerators. GPipe therefore behaves like ordinary synchronous gradient descent, independent of the number of partitions.
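This equivalence can be demonstrated with a toy example. The sketch below (hypothetical names, scalar linear model) accumulates gradients over micro-batches against fixed parameters, then applies a single synchronous update, and the result matches plain mini-batch gradient descent:

```python
def train_step(params, micro_batches, grad_fn, lr):
    """Accumulate gradients over all micro-batches, each computed
    against the SAME parameters, then apply one synchronous update.
    Averaging over equal-sized micro-batches matches the plain
    mini-batch gradient exactly."""
    total_grad = 0.0
    for mb in micro_batches:
        total_grad += grad_fn(params, mb)
    return params - lr * total_grad / len(micro_batches)

# Gradient of mean squared error for the scalar model y = w * x.
def grad_fn(w, batch):
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
# One "micro-batch" (plain SGD) vs. two micro-batches of two examples:
w1 = train_step(1.0, [data], grad_fn, lr=0.1)
w2 = train_step(1.0, [data[:2], data[2:]], grad_fn, lr=0.1)
# w1 and w2 are identical, regardless of how the batch was split.
```

This is why GPipe needs no hyperparameter re-tuning when the partition or micro-batch counts change: the update rule is mathematically the same.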

To scale up the models, the researchers used the RMSProp optimizer with a decay of 0.9 and a label smoothing coefficient of 0.1. The learning rate starts at 0.00125 times the batch size and decays by a factor of 0.97 every 3 epochs. This scaled-up giant model reached 84.3% single-crop top-1 accuracy.
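The learning rate schedule described above can be written out as a small function. This is a sketch of our reading of the article's numbers; the exact decay granularity (per 3 epochs, staircase) is an assumption:

```python
def learning_rate(epoch, batch_size, base=0.00125, decay=0.97, every=3):
    """Staircase schedule: initial LR of base * batch_size,
    multiplied by `decay` once every `every` epochs.
    (Granularity of the decay is an assumption.)"""
    return base * batch_size * decay ** (epoch // every)

# With a batch size of 256 (illustrative), the initial LR is 0.32.
lr0 = learning_rate(0, batch_size=256)
lr3 = learning_rate(3, batch_size=256)   # one decay step applied
```

Scaling the initial learning rate with the batch size is a common practice when large (micro-)batched training changes the effective batch size.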

What Do The Results Say

With GPipe, it is possible to:

  • Support models up to 25 times larger using 8 accelerators, thanks to recomputation and model parallelism.
  • Achieve up to 3.5 times speedup with four times more accelerators using pipelining in our experiments.
  • Train consistently regardless of the number of partitions due to synchronous gradient descent.
  • Free researchers from the time-consuming process of re-tuning hyperparameters. GPipe can also be combined with data parallelism to scale neural network training to even more accelerators.
  • Advance the performance of visual recognition tasks on multiple datasets, including pushing ImageNet top-5 accuracy to 97.0%, CIFAR-10 accuracy to 99.0%, and CIFAR-100 accuracy to 91.3%.
  • Improve training efficiency further through better graph partitioning algorithms.


Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.

