All You Need To Know About PyTorch’s New PipeTransformer

It combines Freeze Algorithm, AutoPipe, AutoDP, and AutoCache modules to increase the training speed significantly.

Transformer models are growing in size at an unprecedented rate, with models reaching trillion-parameter scale since the release of GPT-3 a year ago. PyTorch's PipeTransformer is the latest addition to the tools used to train such models.

PyTorch has introduced a new tool for distributed training of Transformer models, PipeTransformer. The tool leverages automated elastic pipelining and an adaptive, on-the-fly freeze algorithm. This allows PipeTransformer to identify and gradually freeze some layers during training. An elastic pipelining system is then used to reallocate the resources to train the remaining active layers. “PipeTransformer automatically excludes frozen layers from the pipeline, packs active layers into fewer GPUs, and forks more replicas to increase data-parallel width,” PyTorch’s blog post explained.

In freeze training, neural networks generally converge from the bottom up. Researchers at PyTorch have therefore utilised freeze training for distributed training of transformer models. This process allows for the dynamic allocation of resources to the set of active layers. According to the team, this strategy is “especially pertinent to pipeline parallelism, as excluding consecutive bottom layers from the pipeline can reduce computation, memory, and communication overhead”.
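The freeze step itself can be sketched in plain PyTorch. Below is a minimal, illustrative example (the helper name and the toy model are assumptions for illustration, not from the paper) that disables gradients for the bottom layers of a sequential model:

```python
import torch.nn as nn

def freeze_bottom_layers(model: nn.Sequential, num_frozen: int) -> None:
    """Freeze the bottom `num_frozen` layers by disabling their gradients.

    Hypothetical helper: PipeTransformer's real freeze algorithm decides
    `num_frozen` adaptively from training signals."""
    for layer in list(model)[:num_frozen]:
        for p in layer.parameters():
            p.requires_grad = False

# A toy 4-layer stand-in for a transformer's layer stack.
model = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])
freeze_bottom_layers(model, num_frozen=2)

# Which layers remain trainable, bottom to top?
trainable = [any(p.requires_grad for p in layer.parameters()) for layer in model]
print(trainable)  # [False, False, True, True]
```

Frozen parameters no longer require gradients, so they need no backward computation, no optimizer state, and no gradient synchronisation, which is exactly the overhead reduction the team describes.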

The design excludes frozen layers from the pipeline, allowing the model to be packed into fewer GPUs. This brings the advantage of fewer cross-GPU communications and smaller pipeline bubbles. The clusters can also accommodate more pipeline replicas, which increases the width of data parallelism. Together, these features make the speedups multiplicative, accelerating training even further.

About PipeTransformer Design

Image: PyTorch Blog Post

PipeTransformer is built from four core components: the Freeze Algorithm, AutoPipe, AutoDP, and AutoCache, which together increase the training speed significantly.

The first component is the Freeze Algorithm: a tunable, adaptive algorithm that generates signals guiding which layers to freeze over successive iterations. Second, AutoPipe, the elastic pipelining module, is triggered by the freeze algorithm; it considers activation sizes and the variance of workloads across partitions to pack the remaining active layers into fewer GPUs, and splits each mini-batch into micro-batches based on prior profiling results for different pipeline lengths. Third, AutoDP maintains hierarchical communication process groups to manage membership for collective communications. Fourth, AutoCache shares activations across processes to replace stale caches during transitions.
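The micro-batch split in particular is easy to illustrate. Below is a minimal sketch using `torch.chunk` (the chunk count here is fixed by hand, whereas PipeTransformer chooses it from profiling results):

```python
import torch

def split_into_microbatches(minibatch: torch.Tensor, chunks: int):
    # torch.chunk splits along the batch dimension; the last chunk may be
    # smaller if the batch size is not divisible by `chunks`.
    return list(torch.chunk(minibatch, chunks, dim=0))

batch = torch.randn(32, 16)          # a mini-batch of 32 samples
micro = split_into_microbatches(batch, chunks=4)
print([m.shape[0] for m in micro])   # [8, 8, 8, 8]
```

Smaller micro-batches keep more pipeline stages busy at once, shrinking pipeline bubbles, at the cost of more per-chunk overhead; hence the need to profile for the right chunk count per pipeline length.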

The developers have used a customised version of PyTorch Pipeline to support elastic pipelining in the transformer, and PyTorch DDP as a baseline for data parallelism. In addition, they have decoupled the training system into the four core components to ensure the generality of the framework. The process is illustrated in the diagram below.

In the diagram, the freeze algorithm (grey) samples indicators from the training loop to make freezing decisions and shares them with AutoPipe (green). AutoPipe passes the pipeline length information to AutoDP (purple), which spawns more pipeline replicas. AutoCache (orange) ensures connections between pipelines.

AutoPipe accelerates training by excluding frozen layers from the pipeline and packing the active layers into fewer GPUs. Its components balance the pipeline partitions, minimise the number of pipeline devices, and optimise the mini-batch chunk size accordingly.

AutoPipe balances pipeline partitions based on parameter sizes while considering the cross-partition communication overhead and the frozen layers’ memory footprint. It uses a greedy algorithm to allocate all frozen and active layers so that the partitioned sublayers are evenly distributed across GPU devices.
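A simplified sketch of such greedy balancing, in pure Python, balancing by parameter count only and ignoring the contiguity, communication, and frozen-memory considerations that the real AutoPipe weighs:

```python
def greedy_partition(layer_sizes, num_gpus):
    """Greedily assign layers (by parameter count) to GPUs, always giving
    the next layer to the currently lightest partition.

    Illustrative simplification: AutoPipe's actual algorithm also accounts
    for cross-partition communication and frozen-layer memory."""
    partitions = [[] for _ in range(num_gpus)]
    loads = [0] * num_gpus
    for idx, size in enumerate(layer_sizes):
        target = loads.index(min(loads))  # lightest partition so far
        partitions[target].append(idx)
        loads[target] += size
    return partitions, loads

parts, loads = greedy_partition([4, 3, 2, 2, 1], num_gpus=2)
print(parts, loads)  # [[0, 3], [1, 2, 4]] [6, 6]
```

Even this toy version shows the goal: per-GPU loads end up balanced, so no single pipeline stage becomes the bottleneck.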

While AutoPipe compresses the same pipeline into fewer GPUs, AutoDP can automatically create new pipeline replicas to increase data-parallel width. In addition, the researchers have developed ‘double communication process groups’ for DDP to overcome various challenges. There are two groups: the message process group is for lightweight control messages and covers all processes, while the active training process group contains only the active processes and serves as a vehicle for heavyweight tensor communications during training.
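One plausible way to lay out such groups can be sketched in plain Python. Note that the function name and the per-replica layout below are assumptions for illustration, not the paper’s actual scheme (which would build these as `torch.distributed` process groups):

```python
def plan_process_groups(world_size: int, pipeline_len: int):
    """Hypothetical sketch of a two-tier group layout: one message group
    spanning every rank, plus one training group per pipeline replica."""
    message_group = list(range(world_size))        # all processes
    num_replicas = world_size // pipeline_len      # data-parallel width
    training_groups = [
        list(range(r * pipeline_len, (r + 1) * pipeline_len))
        for r in range(num_replicas)
    ]
    return message_group, training_groups

msg, train = plan_process_groups(world_size=8, pipeline_len=2)
print(train)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

As the pipeline shrinks (smaller `pipeline_len`), the same world size yields more replicas, which is how compressing the pipeline translates into wider data parallelism.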

PipeTransformer leverages automated elastic pipelining to facilitate distributed training. The researchers claim that theirs is the first paper studying layer freezing for pipeline and data-parallel training. In the evaluation, PipeTransformer attained up to a 2.83-fold speedup without losing accuracy. The team evaluated it using Vision Transformer on ImageNet and BERT on the SQuAD and GLUE datasets.


Avi Gopani
Avi Gopani is a technology journalist that seeks to analyse industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories that are curated with a focus on the evolving technologies of artificial intelligence and data analytics.
