Colossal-AI is a system for training large-scale deep learning models in a distributed fashion. It combines several parallelization techniques, such as data parallelism, pipeline parallelism, tensor parallelism, and sequence parallelism, and it lets developers write distributed models much as they would write ordinary single-device models. With the help of this tool, developers can concentrate on deep learning model development rather than the mechanics of distributed training. In this article, we will understand the concepts of distributed and parallel learning and how these can be achieved in deep learning. The major points to be covered in this article are listed below.
Table of contents
- What are Distributed Systems and Parallelism?
- Introduction to Colossal-AI
- The motive behind Colossal-AI creation
- Working of Colossal-AI
What are Distributed Systems and Parallelism?
A distributed system is made up of multiple software components that run on multiple hardware machines to achieve high performance and low latency; it can accomplish what a single machine cannot. Scalability is a key metric for evaluating the performance of a distributed system. For example, if we run a model on 4 machines, we expect it to run roughly 4x faster than on a single machine. However, there are challenges to achieving this linear speedup, such as designing algorithms that parallelize well and the communication latency introduced by heterogeneous hardware. The figure below illustrates a distributed system in which different software components run on different hardware machines.
Parallelism means running computations simultaneously. In machine learning, there are different paradigms of parallelism, such as data parallelism, model parallelism, tensor parallelism, and pipeline parallelism.
Introduction to Colossal-AI
Colossal-AI was created by Zhengda Bian, Hongxin Liu, and their collaborators. It is built on the popular deep learning framework PyTorch. It is a powerful system that can perform complicated distributed training and provides an easy way to set up each form of parallelism: data parallelism, model parallelism, tensor parallelism, and pipeline parallelism. It also optimizes tensor parallelism through multi-dimensional matrix-matrix multiplication algorithms. The overall workflow is smooth, which makes it very easy to use.
The motive behind Colossal-AI creation
As the field of deep learning advances, we are accumulating huge amounts of data, and deep learning has shown spectacular performance in many applications. Neural networks such as BERT can learn and predict at a high level of intelligence by training on large amounts of data. Given more memory and computational power, neural networks become more capable, but they also become expensive to train, requiring ever more GPU power to execute.
The trend of models growing significantly larger is constantly on the rise: it took only 3 months to go from BERT-Large to GPT-2 (Radford et al., 2019) as the largest model, and more recently GLM was introduced with an enormous 1.75 trillion parameters. Running such large models requires distributed training, which is why this system came into existence.
Working of Colossal-AI
Colossal-AI provides a very easy way to combine data parallelism, model parallelism, tensor parallelism, and pipeline parallelism. With the help of its API, users can create a distributed deep learning model that uses tensor parallelism. A minimal launch sketch is given below, after which we will walk through the different parallelism approaches that can be used when building parallel deep learning models.
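As a concrete starting point, here is a minimal sketch of a training script, assuming the legacy config-based Colossal-AI API (around v0.1.x); the config values, model, and hyperparameters are illustrative, the API has evolved since, and data-layout details for the 2D mode (inputs must also be partitioned across the device mesh) are glossed over, so consult the current documentation.

```python
# Minimal sketch of the legacy (v0.1.x-style) Colossal-AI workflow; illustrative only.
import colossalai
import colossalai.nn as col_nn
import torch

# 4 GPUs arranged as a 2x2 mesh for 2D tensor parallelism (assumed config format).
CONFIG = dict(parallel=dict(data=1, pipeline=1, tensor=dict(size=4, mode='2d')))

def main():
    colossalai.launch_from_torch(config=CONFIG)  # rank/world size come from torchrun

    # colossalai.nn layers shard their weights according to the tensor-parallel mode
    model = torch.nn.Sequential(
        col_nn.Linear(1024, 4096),
        torch.nn.GELU(),
        col_nn.Linear(4096, 1024),
    ).cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.MSELoss()

    # Wrap model, optimizer and loss into a distributed training engine
    engine, *_ = colossalai.initialize(model, optimizer, criterion)

    engine.train()
    x, target = torch.randn(8, 1024).cuda(), torch.randn(8, 1024).cuda()
    engine.zero_grad()
    loss = engine.criterion(engine(x), target)
    engine.backward(loss)
    engine.step()

if __name__ == '__main__':
    main()  # launched with e.g.: torchrun --nproc_per_node 4 train.py
```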
Tensor parallelism: Colossal-AI provides tools for 2D, 2.5D, and 3D tensor parallelism; the researchers mention that 1D tensor parallelism will be added in the future.
2D tensor parallelism: This scheme is based on SUMMA (Scalable Universal Matrix Multiplication Algorithm), which splits the input data, weights, and layer outputs along two dimensions. The resulting tensors are distributed over a 2D mesh of $q \times q$ devices, where each dimension is split into $q$ chunks. Let $X$ be the input and $W$ be the weight; with $q = 2$, both are split into $2 \times 2$ blocks:

$$X = \begin{bmatrix} X_{00} & X_{01} \\ X_{10} & X_{11} \end{bmatrix} \quad \text{and} \quad W = \begin{bmatrix} W_{00} & W_{01} \\ W_{10} & W_{11} \end{bmatrix},$$

where block $(i, j)$ of each tensor lives on device $(i, j)$.
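As a sanity check of the arithmetic, here is a single-process PyTorch sketch (an assumption for illustration, not Colossal-AI's internal code) that cuts $X$ and $W$ into a $q \times q$ grid and accumulates each output block as $Y_{ij} = \sum_k X_{ik} W_{kj}$, the SUMMA update, then verifies the result against the full matmul:

```python
import torch

q = 2
X = torch.randn(8, 8)
W = torch.randn(8, 8)

def blocks(M):
    # split into q block-rows, then each block-row into q block-columns
    return [list(row.chunk(q, dim=1)) for row in M.chunk(q, dim=0)]

Xb, Wb = blocks(X), blocks(W)

# Device (i, j) computes its output block from a row of X-blocks and a column of W-blocks.
Y = torch.cat([
    torch.cat([sum(Xb[i][k] @ Wb[k][j] for k in range(q)) for j in range(q)], dim=1)
    for i in range(q)
], dim=0)

assert torch.allclose(Y, X @ W, atol=1e-5)  # block result matches the full matmul
```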
2.5D tensor parallelism: 2D tensor parallelism reduces the memory cost but introduces more communication, so a 2.5D scheme, based on the 2.5D matrix multiplication algorithm, was introduced to reduce communication by using more devices. Given $P = q \times q \times d$ processors, with e.g. $q = d = 2$, we split the input $X$ into $d \times q$ block-rows and $q$ block-columns:

$$X = \begin{bmatrix} X_{00} & X_{01} \\ X_{10} & X_{11} \\ X_{20} & X_{21} \\ X_{30} & X_{31} \end{bmatrix},$$

which can be reshaped into $d$ layers:

$$X_0 = \begin{bmatrix} X_{00} & X_{01} \\ X_{10} & X_{11} \end{bmatrix} \quad \text{and} \quad X_1 = \begin{bmatrix} X_{20} & X_{21} \\ X_{30} & X_{31} \end{bmatrix}.$$

The weight $W$ is split into

$$W = \begin{bmatrix} W_{00} & W_{01} \\ W_{10} & W_{11} \end{bmatrix}$$

and shared by every layer.
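Continuing the single-process illustration (again an assumption for exposition, not the library's implementation), the sketch below builds the $d$ layers from block-rows of $X$, runs the SUMMA-style product of each layer against the shared $W$ blocks, and checks the stacked result against $XW$:

```python
import torch

q = d = 2
X = torch.randn(8, 8)
W = torch.randn(8, 8)

row_blocks = [list(r.chunk(q, dim=1)) for r in X.chunk(d * q, dim=0)]  # (d*q) x q blocks
layers = [row_blocks[l * q:(l + 1) * q] for l in range(d)]             # d layers of q x q blocks
W_blocks = [list(r.chunk(q, dim=1)) for r in W.chunk(q, dim=0)]        # one q x q grid, reused

# Each layer independently runs the 2D (SUMMA-style) product against the same W.
def layer_matmul(Xl):
    return torch.cat([
        torch.cat([sum(Xl[i][k] @ W_blocks[k][j] for k in range(q)) for j in range(q)], dim=1)
        for i in range(q)
    ], dim=0)

Y = torch.cat([layer_matmul(l) for l in layers], dim=0)
assert torch.allclose(Y, X @ W, atol=1e-5)  # stacked layer outputs match the full matmul
```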
3D tensor parallelism: This scheme, based on the 3D matrix multiplication algorithm, parallelizes the computation of deep learning models at optimal communication cost. Given $P = q \times q \times q$ processors, with e.g. $q = 2$, we again split the input $X$ and weight $W$:

$$X = \begin{bmatrix} X_{000} & X_{001} \\ X_{010} & X_{011} \\ X_{100} & X_{101} \\ X_{110} & X_{111} \end{bmatrix} \quad \text{and} \quad W = \begin{bmatrix} W_{000} & W_{001} \\ W_{010} & W_{011} \\ W_{100} & W_{101} \\ W_{110} & W_{111} \end{bmatrix},$$

where each $X_{ijl}$ and $W_{lji}$ is stored at processor $(i, j, l)$.
(Note: in the figure, A denotes the weight that we have been calling W.)
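The placement can be made concrete with a small single-process sketch (an assumption for exposition; the gather and reduce-scatter communication of the actual 3D algorithm is omitted). It cuts $X$ and $W$ into $q^2 \times q$ grids of blocks and prints which block of each tensor device $(i, j, l)$ would hold:

```python
import torch

q = 2
X = torch.randn(8, 8)
W = torch.randn(8, 8)

def grid(M, rows, cols):
    # cut M into a rows x cols grid of equally sized blocks
    return [list(r.chunk(cols, dim=1)) for r in M.chunk(rows, dim=0)]

Xg = grid(X, q * q, q)  # block X_{ijl} sits at row i*q + j, column l
Wg = grid(W, q * q, q)  # block W_{lji} sits at row l*q + j, column i

for i in range(q):
    for j in range(q):
        for l in range(q):
            x_blk = Xg[i * q + j][l]   # X_{ijl}, held by device (i, j, l)
            w_blk = Wg[l * q + j][i]   # W_{lji}, held by the same device
            print((i, j, l), tuple(x_blk.shape), tuple(w_blk.shape))
```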
Sequence parallelism: Language models that work on very long sequences, such as document-level text understanding, are memory-inefficient to train because the layer activations along the large sequence dimension consume huge amounts of memory. Through sequence parallelism, a long sequence is split into shorter subsequences that can be processed simultaneously by an array of devices, enabling the model to be trained on longer sequences than a single GPU could handle.
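To make the splitting concrete, here is a single-process PyTorch illustration (an assumption for exposition, not Colossal-AI's implementation) of chunking activations along the sequence dimension; the ring-style communication that sequence-parallel attention needs between devices is omitted:

```python
import torch

world_size = 4                       # number of devices in the sequence-parallel group
batch, seq_len, hidden = 2, 8192, 1024
tokens = torch.randn(batch, seq_len, hidden)

# Each rank would keep exactly one shard of shape (batch, seq_len / world_size, hidden).
shards = tokens.chunk(world_size, dim=1)
print([tuple(s.shape) for s in shards])

# Position-wise layers (e.g. an MLP) run on shards independently, so each device
# only materializes activations for its subsequence: 1/world_size of the memory.
mlp = torch.nn.Linear(hidden, hidden)
local_out = mlp(shards[0])           # rank 0's local activation
```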
Final words
Through this article, we learned what the Colossal-AI system is, what a distributed system is, and the different types of parallelism. We also got to know why such a system is necessary, and went over how tensor parallelism and sequence parallelism are used to achieve parallelism in deep learning.