A guide to parallel deep learning with Colossal-AI

Colossal-AI is a powerful system that can perform complicated distributed training and offers an easy way to set up different types of parallelism.

Colossal-AI is a large-scale deep learning training system designed to train models in parallel. It combines several parallelization techniques, such as data parallelism, pipeline parallelism, tensor parallelism and sequence parallelism, and it lets developers write models for parallel computing much as they would write models for ordinary single-device computing. With the help of this tool, developers can concentrate on deep learning model development and stay free from the mechanics of distributed training. In this article, we will understand the concepts of distributed and parallel training and how they can be achieved in deep learning. The major points to be covered in this article are listed below.

Table of contents

  1. What are Distributed Systems and Parallelism?
  2. Introduction to Colossal-AI
  3. The motive behind Colossal-AI creation
  4. Working of Colossal-AI

What are Distributed Systems and Parallelism?

A distributed system is made up of multiple software components running on multiple hardware machines in order to achieve high performance and low latency; it can accomplish work that no single machine can. Scalability is the metric used to evaluate the performance of a distributed system: if we run a model on 4 machines, we expect it to run 4x faster than on a single machine. In practice, however, there are challenges to achieving linear speedup, such as designing good parallel algorithms and the communication latency introduced by heterogeneous hardware. See the figure below for a distributed system in which different software components run on different hardware.


Parallelism means running computations simultaneously. In machine learning, there are different paradigms of parallelism, such as data parallelism, model parallelism, tensor parallelism and pipeline parallelism.
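As a toy illustration of data parallelism, the following NumPy sketch (hypothetical illustrative code, not Colossal-AI itself) shards a batch across four simulated "devices". Each device computes the gradient of a linear model on its own shard, and averaging the local gradients (the all-reduce step in a real system) reproduces the full-batch gradient when the shards are of equal size.

```python
import numpy as np

def grad_mse(w, X, y):
    """Gradient of the mean-squared error 0.5 * mean((Xw - y)^2) w.r.t. w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # full batch of 8 samples, 3 features
y = rng.normal(size=8)
w = rng.normal(size=3)

# Data parallelism: shard the batch across 4 simulated "devices",
# compute local gradients, then all-reduce (average) them.
shards = np.split(np.arange(8), 4)
local_grads = [grad_mse(w, X[idx], y[idx]) for idx in shards]
avg_grad = np.mean(local_grads, axis=0)

# With equal shard sizes, the averaged gradient equals the full-batch gradient.
full_grad = grad_mse(w, X, y)
assert np.allclose(avg_grad, full_grad)
```

In a real framework the all-reduce is a collective communication operation between GPUs; here it is simply `np.mean` over the list of local gradients.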

Introduction to Colossal-AI

Colossal-AI was created by Zhengda Bian, Hongxin Liu and their co-authors. It is built on the popular deep learning framework PyTorch. It is a powerful system that can perform complicated distributed training and gives an easy way to set up all the kinds of parallelism, such as data parallelism, model parallelism, tensor parallelism and pipeline parallelism. It also provides optimizations for tensor parallelism based on multi-dimensional matrix-matrix multiplication. The Colossal-AI workflow is smooth and very easy to use.



The motive behind Colossal-AI creation

As the field of deep learning advances, we are collecting huge amounts of data, and deep learning shows spectacular performance in many applications. Neural networks such as BERT can learn and predict at a high level by training on large amounts of data. Given more memory resources and computational power, neural networks can become more capable, but they also become expensive, requiring more GPU power to train and execute.

The trend of models becoming significantly larger is constantly accelerating: it took only 3 months to go from BERT-large to the next largest model, GPT-2 (Radford et al., 2019), and more recently GLM was introduced with an enormous 1.75 trillion parameters. Running such large models makes distributed training a necessity, and that is why this system came into existence.

Working of Colossal-AI

Colossal-AI provides a very easy way to combine data parallelism, model parallelism, tensor parallelism and pipeline parallelism. Through its API, users can create a distributed deep learning model that uses tensor parallelism. Let us now look at the different parallelism approaches that can be used when building parallel deep learning models.
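As a rough sketch of how such a combination is declared, earlier Colossal-AI releases accepted a Python config describing the parallelism layout; the field names below follow that legacy style and are an assumption to be checked against the documentation for your version. The data-parallel size is then inferred from the total number of devices.

```python
# A hedged sketch of a legacy-style Colossal-AI parallel config
# (field names are assumptions based on older releases; consult the
# current Colossal-AI docs before using).
parallel = dict(
    pipeline=2,                      # 2 pipeline stages
    tensor=dict(size=4, mode='2d'),  # 2D tensor parallelism over 4 devices
)

# With 16 GPUs in total, the data-parallel size is inferred as
# world_size / (pipeline stages * tensor-parallel size) = 16 / (2 * 4).
world_size = 16
data_parallel_size = world_size // (parallel['pipeline'] * parallel['tensor']['size'])
```

Here `data_parallel_size` works out to 2, i.e. two replicas of the pipeline-and-tensor-parallel model group.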

Tensor parallelism: Colossal-AI provides tools for 2D, 2.5D and 3D tensor parallelism; the researchers mention that 1D tensor parallelism will be added in the future.



2D tensor parallelism: This scheme is based on SUMMA (Scalable Universal Matrix Multiplication Algorithm), which splits the input data, weights and layer outputs along two dimensions. These tensors are then distributed over a 2D mesh of N² devices, where the tensor is split into N × N chunks. Let X be the input and W the weight; we split both X and W into 2 × 2 blocks:

<math xmlns="http://www.w3.org/1998/Math/MathML"><mfenced open="[" close="]"><mtable><mtr><mtd><msub><mi>X</mi><mn>00</mn></msub></mtd><mtd><msub><mi>X</mi><mn>01</mn></msub></mtd></mtr><mtr><mtd><msub><mi>X</mi><mn>10</mn></msub></mtd><mtd><msub><mi>X</mi><mn>11</mn></msub></mtd></mtr></mtable></mfenced></math>


<math xmlns="http://www.w3.org/1998/Math/MathML"><mfenced open="[" close="]"><mtable><mtr><mtd><msub><mi>W</mi><mn>00</mn></msub></mtd><mtd><msub><mi>W</mi><mn>01</mn></msub></mtd></mtr><mtr><mtd><msub><mi>W</mi><mn>10</mn></msub></mtd><mtd><msub><mi>W</mi><mn>11</mn></msub></mtd></mtr></mtable></mfenced></math>
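The split above can be checked numerically. The following NumPy sketch is a single-process simulation (not the real distributed kernel): block (i, j) lives on simulated device (i, j), each device accumulates the partial products for its output block, and reassembling the blocks reproduces the full product XW. In the actual SUMMA algorithm the X and W blocks are broadcast along rows and columns of the device mesh rather than being locally available.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(4, 4))
W = rng.normal(size=(4, 4))
q = 2  # a q x q device mesh, so q^2 = 4 devices

# Split X and W into q x q blocks: block (i, j) lives on device (i, j).
Xb = [[X[2*i:2*i+2, 2*j:2*j+2] for j in range(q)] for i in range(q)]
Wb = [[W[2*i:2*i+2, 2*j:2*j+2] for j in range(q)] for i in range(q)]

# SUMMA-style computation: device (i, j) accumulates sum_k X_ik @ W_kj.
Yb = [[sum(Xb[i][k] @ Wb[k][j] for k in range(q)) for j in range(q)]
      for i in range(q)]

# Reassembling the output blocks gives the full product.
Y = np.block(Yb)
assert np.allclose(Y, X @ W)
```

Note that each device only ever stores 1/q² of each matrix, which is where the memory saving of 2D tensor parallelism comes from.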

2.5D tensor parallelism: This is based on the 2.5D matrix multiplication algorithm. 2D parallelism reduces the memory cost but introduces more communication; that is why the 2.5D scheme was introduced, reducing communication by using more devices.

Given P = q × q × d processors with q = d = 2, we split X into d·q block-rows and q block-columns:

<math xmlns="http://www.w3.org/1998/Math/MathML"><mfenced open="[" close="]"><mtable><mtr><mtd><msub><mi>X</mi><mn>00</mn></msub></mtd><mtd><msub><mi>X</mi><mn>01</mn></msub></mtd></mtr><mtr><mtd><msub><mi>X</mi><mn>10</mn></msub></mtd><mtd><msub><mi>X</mi><mn>11</mn></msub></mtd></mtr><mtr><mtd><msub><mi>X</mi><mn>20</mn></msub></mtd><mtd><msub><mi>X</mi><mn>21</mn></msub></mtd></mtr><mtr><mtd><msub><mi>X</mi><mn>30</mn></msub></mtd><mtd><msub><mi>X</mi><mn>31</mn></msub></mtd></mtr></mtable></mfenced></math>

This can then be reshaped into d = 2 layers,

<math xmlns="http://www.w3.org/1998/Math/MathML"><mfenced open="[" close="]"><mtable><mtr><mtd><msub><mi>X</mi><mn>00</mn></msub></mtd><mtd><msub><mi>X</mi><mn>01</mn></msub></mtd></mtr><mtr><mtd><msub><mi>X</mi><mn>10</mn></msub></mtd><mtd><msub><mi>X</mi><mn>11</mn></msub></mtd></mtr></mtable></mfenced></math>

<math xmlns="http://www.w3.org/1998/Math/MathML"><mfenced open="[" close="]"><mtable><mtr><mtd><msub><mi>X</mi><mn>20</mn></msub></mtd><mtd><msub><mi>X</mi><mn>21</mn></msub></mtd></mtr><mtr><mtd><msub><mi>X</mi><mn>30</mn></msub></mtd><mtd><msub><mi>X</mi><mn>31</mn></msub></mtd></mtr></mtable></mfenced></math>


And the weight W,

<math xmlns="http://www.w3.org/1998/Math/MathML"><mfenced open="[" close="]"><mtable><mtr><mtd><msub><mi>W</mi><mn>00</mn></msub></mtd><mtd><msub><mi>W</mi><mn>01</mn></msub></mtd></mtr><mtr><mtd><msub><mi>W</mi><mn>10</mn></msub></mtd><mtd><msub><mi>W</mi><mn>11</mn></msub></mtd></mtr></mtable></mfenced></math>
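The reshaping step can be illustrated with a small NumPy sketch (illustrative only; shapes and block conventions are assumptions for this toy example). X is split into d·q = 4 block-rows and q = 2 block-columns, and the block-rows are then stacked into d = 2 layers of q × q blocks.

```python
import numpy as np

q, d = 2, 2                    # mesh side q, depth d, so P = q*q*d = 8 processors
X = np.arange(8 * 4, dtype=float).reshape(8, 4)  # toy input

# Split X into d*q = 4 block-rows and q = 2 block-columns:
# blocks[i][j] corresponds to X_ij in the matrix above.
rows = np.array_split(X, d * q, axis=0)
blocks = [np.array_split(r, q, axis=1) for r in rows]

# Reshape the block-rows into d layers of q x q blocks:
# layer 0 holds X_00..X_11, layer 1 holds X_20..X_31.
layers = [blocks[l * q:(l + 1) * q] for l in range(d)]

# Each layer reassembles into one contiguous half of X.
assert len(layers) == d
assert np.allclose(np.block(layers[0]), X[:4])
assert np.allclose(np.block(layers[1]), X[4:])
```

Each of the d layers then runs a 2D (SUMMA-style) multiplication against W on its own q × q sub-mesh, which is how 2.5D trades extra devices for less communication.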

3D tensor parallelism: Based on the 3D matrix multiplication algorithm, this scheme parallelizes the computation of deep learning models at optimal cost.

Again we split the input X and the weight W into blocks, where each block of X and W is stored at processor (i, j, l).


(Image source)

(Note: in the figure, “A” denotes the weight, which we have been calling “W”.)
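The essence of the 3D scheme can also be simulated in NumPy (a single-process sketch, with the block placement convention assumed for illustration): processor (i, j, l) holds one block of X and one block of W and computes a single partial product, and the output block Y_ij is obtained by reducing over the depth index l. Unlike the 2D case, the inner sum is itself distributed across processors.

```python
import numpy as np

rng = np.random.default_rng(7)
q = 2                          # P = q^3 = 8 processors, indexed (i, j, l)
X = rng.normal(size=(4, 4))
W = rng.normal(size=(4, 4))

def split(M):
    """Split a matrix into a q x q grid of blocks."""
    return [np.array_split(r, q, axis=1) for r in np.array_split(M, q, axis=0)]

Xb, Wb = split(X), split(W)

# Processor (i, j, l) stores X_il and W_lj and computes one partial product.
partial = {(i, j, l): Xb[i][l] @ Wb[l][j]
           for i in range(q) for j in range(q) for l in range(q)}

# Reducing over the depth dimension l (a cross-processor reduction in a
# real system) yields the output block Y_ij.
Yb = [[sum(partial[i, j, l] for l in range(q)) for j in range(q)]
      for i in range(q)]

assert np.allclose(np.block(Yb), X @ W)
```

This distribution of both storage and the reduction over q³ processors is what gives the 3D algorithm its favorable communication and memory costs.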

Sequence parallelism: Language models that work on very long sequences, such as in document-level text understanding, are memory-inefficient because the layer activations along the large sequence dimension consume large amounts of memory. Through sequence parallelism, long sequences are split up into shorter subsequences that can be processed simultaneously by an array of devices, enabling the model to be trained on longer sequences than could be handled by a single GPU.
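A NumPy sketch of the idea (illustrative only): for position-wise layers, which act on each token independently, splitting the sequence into subsequences and concatenating the per-chunk results reproduces the full-sequence computation exactly. Attention layers additionally need cross-device communication (Colossal-AI's sequence parallelism uses a ring self-attention scheme for this), which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(3)
seq_len, d_model = 1024, 8
x = rng.normal(size=(seq_len, d_model))   # one long input sequence
W1 = rng.normal(size=(d_model, 16))
W2 = rng.normal(size=(16, d_model))

def ffn(h):
    """A position-wise feed-forward layer: acts on each token independently."""
    return np.maximum(h @ W1, 0.0) @ W2

# Sequence parallelism: split the sequence into 4 subsequences, process
# each on its own simulated "device", and concatenate the results.
chunks = np.array_split(x, 4, axis=0)
y_parallel = np.concatenate([ffn(c) for c in chunks], axis=0)

# For position-wise ops this matches processing the full sequence at once.
assert np.allclose(y_parallel, ffn(x))
```

Each device now stores activations for only a quarter of the sequence, which is exactly the memory saving sequence parallelism is after.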

Final words

Through this article, we learned what Colossal-AI is, what a distributed system is, and the different types of parallelism. We also got to know why such a system is necessary, and we went over how tensor parallelism and sequence parallelism are used to achieve parallelism in deep learning.


References

  1. Colossal-AI
  2. Colossal-AI: Official Research Paper

Waqqas Ansari
Waqqas Ansari is a data science guy with a math background. He likes solving challenging business problems through predictive modelling, descriptive modelling, and machine learning algorithms. He is fascinated by new technologies, especially those relating to machine learning.
