
All You Need To Know About PyTorch’s New PipeTransformer

It combines Freeze Algorithm, AutoPipe, AutoDP, and AutoCache modules to increase the training speed significantly.



Transformer models are growing in size at an unprecedented rate, with parameter counts reaching the trillion level since the release of GPT-3 a year ago. PyTorch’s PipeTransformer is the latest addition to the tools used to train such models.

PyTorch has introduced a new tool for distributed training of Transformer models, PipeTransformer. The tool leverages automated elastic pipelining and an adaptive, on-the-fly freeze algorithm, which allow PipeTransformer to identify and gradually freeze some layers during training. An elastic pipelining system then reallocates the resources used to train the remaining active layers. “PipeTransformer automatically excludes frozen layers from the pipeline, packs active layers into fewer GPUs, and forks more replicas to increase data-parallel width,” PyTorch’s blog post explained. 

Freeze training builds on the observation that neural networks generally converge from the bottom up. Researchers at PyTorch have therefore applied freeze training to the distributed training of Transformer models, allowing resources to be dynamically reallocated away from frozen layers and towards the set of active layers. According to the team, this strategy is “especially pertinent to pipeline parallelism, as excluding consecutive bottom layers from the pipeline can reduce computation, memory, and communication overhead”. 
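
To make the idea concrete, here is a minimal sketch of bottom-up freezing in plain PyTorch; the layer count and the naive linear freezing schedule are illustrative assumptions, not PipeTransformer’s adaptive freeze algorithm.

```python
import torch.nn as nn

def freeze_bottom_layers(model: nn.Sequential, num_frozen: int):
    """Freeze the bottom `num_frozen` layers so they no longer receive
    gradients or optimizer updates (illustrative sketch)."""
    for layer in list(model.children())[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False
        layer.eval()  # also fix dropout / normalization statistics

# Hypothetical usage: grow the frozen prefix as training converges.
model = nn.Sequential(*[nn.TransformerEncoderLayer(d_model=768, nhead=12) for _ in range(12)])
for epoch in range(10):
    freeze_bottom_layers(model, num_frozen=epoch)  # linear schedule, for illustration only
    # ... run one epoch of training on the remaining active layers ...
```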

The design excludes frozen layers from the pipeline, allowing the model to be packed into fewer GPUs. This results in fewer cross-GPU communications and smaller pipeline bubbles. The cluster can also accommodate more pipeline replicas, which increases the width of data parallelism. Together, these effects make the speedups multiplicative and accelerate training even further.  

About PipeTransformer Design

Image: PyTorch Blog Post

PipeTransformer is built from four core components. It combines the Freeze Algorithm, AutoPipe, AutoDP, and AutoCache modules to increase the training speed significantly. 

The first component is the Freeze Algorithm, a tunable and adaptive algorithm that generates signals to guide which layers should be frozen over different iterations. The second is AutoPipe, the elastic pipelining module: triggered by the freeze algorithm, it considers the activation sizes and the variance of workloads across different partitions to pack the remaining active layers into fewer GPUs, and it splits a mini-batch into micro-batches based on prior profiling results for different pipeline lengths. Third, AutoDP maintains hierarchical communication process groups to attain membership for collective communications. Fourth, AutoCache shares activations across processes to replace stale caches during these transitions. A rough sketch of how the four modules fit around the training loop follows below. 
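
The following is a hedged, high-level sketch of that control flow; every function here is a simplified stand-in for illustration, not PipeTransformer’s real API, and the numbers are arbitrary.

```python
# Stand-ins for the four modules; all logic here is invented for illustration.

def freeze_algorithm(epoch):
    """Stand-in freeze schedule: freeze one more layer per epoch."""
    return epoch

def autopipe_repartition(num_frozen, num_layers=12, num_gpus=8):
    """Stand-in AutoPipe: fewer active layers -> shorter pipeline."""
    active = num_layers - num_frozen
    return max(1, min(num_gpus, active // 2))  # pipeline length in GPUs

def autodp_replicas(pipeline_len, num_gpus=8):
    """Stand-in AutoDP: freed GPUs become extra data-parallel replicas."""
    return num_gpus // pipeline_len

num_frozen = 0
for epoch in range(6):
    new_frozen = freeze_algorithm(epoch)                  # 1. Freeze Algorithm decides
    if new_frozen > num_frozen:
        num_frozen = new_frozen
        pipe_len = autopipe_repartition(num_frozen)       # 2. AutoPipe packs active layers
        replicas = autodp_replicas(pipe_len)              # 3. AutoDP widens data parallelism
        # 4. AutoCache would now share frozen-layer activations across processes
        print(f"epoch {epoch}: frozen={num_frozen}, pipeline length={pipe_len}, replicas={replicas}")
    # ... normal training iterations on the active layers would run here ...
```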

The developers have used a customised version of PyTorch Pipeline to support elastic pipelining in the Transformer, and PyTorch DDP as the baseline for data parallelism. In addition, they have decoupled the training system into the four core components to ensure the generality of the framework. This process is illustrated in the diagram below. 

In the diagram, the freeze algorithm (grey) samples indicators from the training loop to make freezing decisions and shares them with AutoPipe (green). AutoPipe passes the pipeline length information to AutoDP (purple), which spawns more pipeline replicas. AutoCache (orange) handles activation caching across pipelines. 
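
As a rough illustration of the baseline building blocks mentioned above, the snippet below runs a small two-stage pipeline through torch.distributed.pipeline.sync.Pipe on PyTorch versions that still ship this (since-deprecated) API; the model, device placement, and chunk count are simplified assumptions, and none of PipeTransformer’s elasticity is shown.

```python
import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Single-process pipeline parallelism across two GPUs; layer sizes and
# chunk count are arbitrary choices for illustration.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)  # Pipe relies on the RPC framework

stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")
model = Pipe(nn.Sequential(stage0, stage1), chunks=8)  # split each mini-batch into 8 micro-batches

x = torch.randn(64, 1024, device="cuda:0")
output = model(x).local_value()  # Pipe returns an RRef; fetch the local tensor
print(output.shape)  # torch.Size([64, 1024]), resident on cuda:1
```

In the baseline, data parallelism across such pipeline replicas is handled by torch.nn.parallel.DistributedDataParallel; PipeTransformer’s AutoDP replaces that static setup with the elastic process groups described further below.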

AutoPipe accelerates training by excluding frozen layers from the pipeline and packing the active layers into fewer GPUs. Its sub-components balance the pipeline partitions, minimise the number of pipeline devices, and optimise the mini-batch chunk size accordingly.

AutoPipe balances pipeline partitions based on parameter sizes while accounting for the cross-partition communication overhead and the memory footprint of frozen layers. It uses a greedy algorithm to allocate the frozen and active layers so that the partitioned sublayers are distributed evenly across GPU devices, as sketched below. 
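
A minimal sketch of such a greedy, parameter-balanced partitioning might look like the following; it is an illustrative simplification that balances only parameter counts and ignores the frozen-layer memory footprint and communication costs that AutoPipe also takes into account.

```python
def greedy_partition(layer_param_counts, num_gpus):
    """Assign consecutive layers to GPUs so parameter counts stay balanced.
    Illustrative sketch only, not AutoPipe's actual partitioning logic."""
    total = sum(layer_param_counts)
    target = total / num_gpus          # ideal load per GPU
    partitions, current, load = [], [], 0
    for count in layer_param_counts:
        # Start a new partition once the current one reaches its target share,
        # unless we are already filling the last GPU.
        if current and load + count > target and len(partitions) < num_gpus - 1:
            partitions.append(current)
            current, load = [], 0
        current.append(count)
        load += count
    partitions.append(current)
    return partitions

# Example: 8 layers with uneven parameter counts split across 3 GPUs.
print(greedy_partition([10, 10, 40, 40, 20, 20, 30, 30], num_gpus=3))
```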

While AutoPipe compresses the same pipeline into fewer GPUs, AutoDP can automatically create new pipeline replicas to increase the data-parallel width. To make this work, the researchers developed ‘double communication process groups’ for DDP. There are two groups: the message process group carries lightweight control messages and covers all processes, while the active training process group contains only the active processes and serves as the vehicle for heavyweight tensor communication during training. A minimal sketch of this grouping appears below.
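
The sketch below uses torch.distributed.new_group to set up the two groups; the rank assignments are invented for illustration, and the logic AutoDP uses to re-form groups during transitions is omitted.

```python
import torch.distributed as dist

def build_process_groups(world_size, active_ranks):
    """Build the two groups described above (illustrative sketch).

    - message group: all ranks, for lightweight control messages
    - active training group: only ranks currently holding pipeline replicas,
      for heavyweight gradient/tensor communication
    """
    message_group = dist.new_group(ranks=list(range(world_size)))
    active_group = dist.new_group(ranks=active_ranks)
    return message_group, active_group

# Hypothetical usage after dist.init_process_group(...) has been called:
# message_group, active_group = build_process_groups(world_size=8, active_ranks=[0, 2, 4, 6])
# dist.all_reduce(grad_tensor, group=active_group)              # heavy tensor traffic
# dist.broadcast(control_tensor, src=0, group=message_group)    # lightweight control
```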

PipeTransformer leverages automated elastic pipelining to facilitate distributed training. The researchers claim that theirs is the first paper to study layer freezing in the context of both pipeline and data-parallel training. In the evaluation, PipeTransformer attained a speedup of close to 2.83-fold while maintaining accuracy. The team evaluated using Vision Transformer on ImageNet and BERT on the SQuAD and GLUE datasets.
