Transformer models are growing in size at an unprecedented rate, reaching trillion-parameter scale in the time since the release of GPT-3 a year ago. PyTorch’s PipeTransformer is the latest addition to the tooling used to train such models on distributed computing resources.
PyTorch has introduced a new tool for distributed training of Transformer models, PipeTransformer. The tool leverages automated elastic pipelining and an adaptive on-the-fly freeze algorithm. This allows PipeTransformer to identify and freeze some layers gradually during training, while an elastic pipelining system reallocates resources to train the remaining active layers. “PipeTransformer automatically excludes frozen layers from the pipeline, packs active layers into fewer GPUs, and forks more replicas to increase data-parallel width,” PyTorch’s blog post explained.
Neural networks generally converge from the bottom up, a property that freeze training exploits. Researchers at PyTorch have therefore applied freeze training to the distributed training of Transformer models, dynamically allocating resources to the shrinking set of active layers. According to the team, this strategy is “especially pertinent to pipeline parallelism, as excluding consecutive bottom layers from the pipeline can reduce computation, memory, and communication overhead”.
The design excludes frozen layers from the pipeline, thus allowing the model to be packed into fewer GPUs. This yields fewer cross-GPU communications and smaller pipeline bubbles. The cluster can also accommodate more pipeline replicas, which increases the width of data parallelism. Together, these features make the speedups multiplicative and accelerate training even further.
About PipeTransformer Design
Image: PyTorch Blog Post
PipeTransformer is built from four core components. It combines the Freeze Algorithm, AutoPipe, AutoDP, and AutoCache modules to increase training speed significantly.
The first component is the Freeze Algorithm: a tunable, adaptive algorithm that generates signals to guide which layers to freeze over different iterations. Second, these signals trigger AutoPipe, the elastic pipelining module, which considers activation sizes and the variance of workloads across different partitions to pack the remaining active layers into fewer GPUs; it also splits each mini-batch into micro-batches based on prior profiling results for different pipeline lengths. Third, AutoDP maintains hierarchical communication process groups to attain membership for collective communications. Fourth, AutoCache shares activations across processes to replace stale caches during transitions.
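The blog post does not spell out the freeze signal itself; one common indicator in freeze training is the per-layer gradient norm. Below is a minimal sketch of such a decision rule, assuming gradient norms are collected each epoch; the function name, the threshold rule, and the `alpha` knob are illustrative, not taken from the paper.

```python
def choose_frozen_layers(grad_norms, frozen_so_far, alpha=0.5):
    """Decide how many bottom layers to freeze next.

    grad_norms: per-layer gradient norms for the current epoch,
                ordered bottom-up (index 0 = input-side layer).
    frozen_so_far: number of layers already frozen.
    alpha: tunable aggressiveness knob (hypothetical).

    Freezing is monotonic and bottom-up: a layer is frozen only if
    every layer below it is frozen, mirroring bottom-up convergence.
    """
    active = grad_norms[frozen_so_far:]
    if not active:
        return frozen_so_far
    mean_norm = sum(active) / len(active)
    n_frozen = frozen_so_far
    # Freeze consecutive bottom layers whose gradients have shrunk
    # well below the mean of the still-active layers.
    for norm in active:
        if norm < alpha * mean_norm:
            n_frozen += 1
        else:
            break
    return n_frozen

# Example: the two bottom layers have near-zero gradients, so they
# are frozen; the third layer is still changing, so freezing stops.
norms = [0.01, 0.02, 0.5, 0.6, 0.7]
print(choose_frozen_layers(norms, frozen_so_far=0))  # 2
```

A rule like this is what makes the algorithm "adaptive": the freeze boundary only advances when the signal says the bottom layers have converged, rather than on a fixed schedule.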
The developers have used a customised version of PyTorch Pipeline to support elastic pipelining in the Transformer, and PyTorch DDP as a baseline for data parallelism. In addition, they have decoupled the training system into the four core components to ensure the generality of the framework. The process is illustrated in the diagram below.
In the diagram, the freeze algorithm (grey) samples indicators from the training loop to make freezing decisions and shares them with AutoPipe (green). AutoPipe passes the pipeline length information to AutoDP (purple), which spawns more pipeline replicas, while AutoCache (orange) ensures connections between pipelines.
AutoPipe accelerates training by excluding frozen layers from the pipeline and packing the active layers into fewer GPUs. To do so, it partitions the pipeline, minimises the number of pipeline devices, and optimises the mini-batch chunk size accordingly.
AutoPipe balances pipeline partitions based on parameter sizes while considering the cross-partition communication overhead and the smaller memory footprint of frozen layers. It uses a greedy algorithm to allocate all frozen and active layers so that the partitioned sublayers are evenly distributed across GPU devices.
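A minimal sketch of such greedy balancing, assuming frozen layers are discounted because they carry no gradient or optimizer state; the discount factor and the contiguous-fill strategy are illustrative assumptions, not the paper's exact algorithm:

```python
def partition_layers(layer_params, frozen, num_gpus, frozen_discount=1 / 3):
    """Greedily split layers into `num_gpus` contiguous partitions
    with roughly equal load. Load = parameter count, discounted for
    frozen layers (no gradient/optimizer memory). The discount value
    is an assumption for illustration."""
    costs = [p * (frozen_discount if f else 1.0)
             for p, f in zip(layer_params, frozen)]
    target = sum(costs) / num_gpus
    partitions, current, filled = [], [], 0.0
    for i, c in enumerate(costs):
        current.append(i)
        filled += c
        # Close this partition once it reaches the per-GPU target,
        # keeping at least one layer for each remaining GPU.
        remaining_layers = len(costs) - i - 1
        remaining_parts = num_gpus - len(partitions) - 1
        if filled >= target and remaining_parts > 0 \
                and remaining_layers >= remaining_parts:
            partitions.append(current)
            current, filled = [], 0.0
    partitions.append(current)
    return partitions

# Four equal layers, two GPUs. With the bottom two frozen, three
# layers fit on GPU 0; with nothing frozen, the split is 2 + 2.
print(partition_layers([4, 4, 4, 4], [True, True, False, False], 2))
print(partition_layers([4, 4, 4, 4], [False, False, False, False], 2))
```

The example shows the effect the blog describes: once layers freeze, their discounted cost lets more of the model pack onto earlier devices, freeing later GPUs for additional replicas.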
While AutoPipe compresses the same pipeline into fewer GPUs, AutoDP can automatically create new pipeline replicas to increase data-parallel width. In addition, the researchers have developed ‘double communication process groups’ for DDP to overcome various challenges. There are two groups: the message process group handles lightweight control messages and covers all processes, while the active training process group contains only the active processes and serves as a vehicle for heavyweight tensor communications during training.
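The two groups can be pictured as plain rank lists; in a real PyTorch job each list would be passed to `torch.distributed.new_group(ranks=...)`. A pure-Python sketch, where the convention that rank r leads the replica occupying ranks [r, r + pipeline_len) is an assumption for illustration:

```python
def double_process_groups(world_size, pipeline_len):
    """Sketch of AutoDP's two process groups as rank lists.

    - message group: every process, for lightweight control messages
      (e.g. announcing a new pipeline length after a freeze event)
    - active group: one lead rank per pipeline replica, which carry
      the heavyweight gradient synchronisation during training

    Layout assumption: replicas occupy consecutive rank blocks of
    size `pipeline_len`, led by their first rank.
    """
    message_group = list(range(world_size))
    num_replicas = world_size // pipeline_len
    active_group = [r * pipeline_len for r in range(num_replicas)]
    return message_group, active_group

# 8 GPUs: a pipeline of length 4 gives 2 replicas; after freezing
# shrinks the pipeline to length 2, 4 replicas fit on the same GPUs.
print(double_process_groups(8, 4))  # ([0, 1, ..., 7], [0, 4])
print(double_process_groups(8, 2))  # ([0, 1, ..., 7], [0, 2, 4, 6])
```

Keeping the all-process message group static while rebuilding only the active group is what lets membership change cheaply each time AutoPipe shortens the pipeline.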
PipeTransformer leverages automated elastic pipelining to facilitate distributed training. The researchers claim theirs is the first paper to study layer freezing for combined pipeline- and data-parallel training. In their evaluation, PipeTransformer attained close to a 2.83-fold speedup without losing accuracy, using Vision Transformer on ImageNet and BERT on the SQuAD and GLUE datasets.