Swin Transformers: Computer Vision’s Most Powerful Tool

Image classification using Swin Transformers, a general-purpose computer vision backbone.

Swin Transformers (ST) are a subcategory of Vision Transformers. They construct hierarchical feature maps by merging image patches in deeper layers, and they have computational complexity linear in input image size because self-attention is computed only within each local window. As a result, ST can serve as a general-purpose backbone for image classification and dense recognition tasks. Earlier vision transformers, by contrast, produce feature maps at a single low resolution and have computational complexity quadratic in input image size because self-attention is computed globally.

ICCV 2021 has announced the winners of its Best Paper Award, honourable mentions, and Best Student Paper. “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows” was selected as the best paper at ICCV 2021.


The paper was authored by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo of Microsoft Research Asia.

Challenges in Vision Applications

Tokens in existing transformer-based models are fixed in size, which makes them ill-suited to certain vision applications. Another difference is scale: images contain far more pixels than text passages contain words, so visual inputs arrive at much higher resolution. Finally, many vision tasks, such as semantic segmentation, require dense prediction at the pixel level, which is intractable for a transformer on high-resolution images because the computational complexity of its self-attention is quadratic in the number of tokens. To address these challenges, the researchers propose ST, a general-purpose transformer backbone that produces hierarchical feature maps and has computational complexity linear in image size.
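To make the quadratic-versus-linear difference concrete, here is a rough back-of-the-envelope cost model (a sketch only: it counts just the pairwise attention term and ignores projections and constant factors; the token counts and window size are illustrative):

```python
def global_attention_cost(num_tokens, dim):
    # Every token attends to every other token: quadratic in token count.
    return num_tokens ** 2 * dim

def windowed_attention_cost(num_tokens, dim, window=7):
    # Tokens attend only within fixed window x window regions:
    # per-window cost is constant, so the total is linear in token count.
    num_windows = num_tokens // window ** 2
    return num_windows * (window ** 2) ** 2 * dim

# Doubling the image side quadruples the token count.
small, large = 56 * 56, 112 * 112
print(global_attention_cost(large, 96) / global_attention_cost(small, 96))      # 16.0
print(windowed_attention_cost(large, 96) / windowed_attention_cost(small, 96))  # 4.0
```

Quadrupling the tokens multiplies the global-attention cost by 16 but the windowed cost only by 4, which is what "linear in image size" buys.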

Source: Swin Transformer

Process of a Swin Transformer

The above figure shows that ST builds a hierarchical representation by starting with small patches and gradually merging neighbouring patches in deeper transformer layers. With these hierarchical feature maps, the ST model can readily use dense prediction techniques such as feature pyramid networks (FPN) or U-Net. Linear computational complexity is achieved by computing self-attention within non-overlapping windows that partition the image. Because the number of patches in each window is fixed, the total complexity grows linearly with image size. Thus, unlike earlier transformer-based systems, which generate feature maps at a single resolution and have quadratic complexity, ST is well-suited as a general-purpose backbone for various vision applications.
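The patch-merging step between stages can be sketched as follows (an illustrative NumPy version; in the actual model a learned linear layer then reduces the concatenated 4C channels to 2C):

```python
import numpy as np

def merge_patches(x):
    # x: (H, W, C) grid of patch tokens. Concatenate the features of each
    # 2x2 neighbourhood, halving spatial resolution and quadrupling channels.
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/2, W/2, 2, 2, C)
    return x.reshape(H // 2, W // 2, 4 * C)

feats = np.random.rand(56, 56, 96)          # stage-1 token grid, as in Swin-T
print(merge_patches(feats).shape)           # (28, 28, 384)
```

Repeating this merge at each stage produces the pyramid of resolutions that FPN- or U-Net-style decoders expect.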


ST’s essential design element is the shifting of the window partition between consecutive self-attention layers. The shifted windows bridge the windows of the preceding layer, considerably increasing modelling power. This technique is also efficient in real-world latency: all query patches within a window share the same key set, which simplifies hardware memory access.
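The shift itself is cheap to implement as a cyclic roll of the token grid, so the same regular window partition can be reused in alternating layers (a minimal NumPy sketch; the real implementation additionally masks attention between tokens that are only adjacent because of the wrap-around):

```python
import numpy as np

def window_partition(x, M):
    # Split an (H, W, C) token grid into non-overlapping M x M windows.
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M, M, C)

def shifted_windows(x, M):
    # Cyclically shift the grid by (-M//2, -M//2) so that the *same*
    # regular partition now yields windows straddling the previous
    # layer's window borders.
    shifted = np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
    return window_partition(shifted, M)

tokens = np.random.rand(56, 56, 96)
print(window_partition(tokens, 7).shape)  # (64, 7, 7, 96)
print(shifted_windows(tokens, 7).shape)   # (64, 7, 7, 96)
```

Because the shifted layout keeps the same number and size of windows, no extra padding or irregular windows are introduced, which is what keeps the latency low in practice.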

Source: Swin Transformer

The above figure depicts the ST architecture in its smallest form (Swin-T). It first splits the input RGB image into non-overlapping patches using a patch-splitting module, as in ViT. Each patch is then treated as a “token”, with its feature set to the concatenation of the raw RGB values of its pixels.
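This patch-splitting step can be sketched in NumPy (illustrative sizes: 4×4 patches over a 224×224 RGB image, giving 4×4×3 = 48 raw features per token, as described in the paper; a learned linear embedding then projects these to the model dimension):

```python
import numpy as np

def split_into_patch_tokens(img, patch=4):
    # img: (H, W, 3) RGB array. Split into non-overlapping patch x patch
    # squares and flatten each into a raw-value token.
    H, W, C = img.shape
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)      # one row per token

img = np.random.rand(224, 224, 3)
print(split_into_patch_tokens(img).shape)        # (3136, 48)
```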

ST is constructed by replacing the standard multi-head self-attention (MSA) module in a transformer block with a module based on shifted windows, keeping the other layers unchanged. An ST block thus comprises a shifted-window-based MSA module followed by a two-layer MLP with GELU nonlinearity. A LayerNorm (LN) layer precedes each MSA and MLP module, and a residual connection is applied after each module.
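The block's pre-norm residual layout can be sketched as follows (an illustrative NumPy version: the windowed attention is passed in as a placeholder callable, and biases are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token's features to zero mean and unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU nonlinearity.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swin_block(x, msa, w1, w2):
    # Pre-norm residual layout of one ST block:
    #   x = x + (S)W-MSA(LN(x));  x = x + MLP(LN(x))
    x = x + msa(layer_norm(x))
    return x + gelu(layer_norm(x) @ w1) @ w2

x = np.random.rand(49, 96)               # one 7x7 window of tokens
w1 = np.random.rand(96, 384) * 0.01      # MLP expands the dimension 4x ...
w2 = np.random.rand(384, 96) * 0.01      # ... then projects it back
print(swin_block(x, lambda t: t, w1, w2).shape)  # (49, 96)
```

Consecutive blocks alternate between the regular and the shifted window partition in the `msa` slot; everything else stays identical.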


ST is a novel vision transformer that generates a hierarchical feature representation and has computational complexity linear in input image size. ST outperforms the previous best approaches on COCO object detection and ADE20K semantic segmentation. These characteristics make it suitable for many vision applications, including image classification and dense prediction tasks such as object detection and semantic segmentation. The researchers hope that ST’s strong performance across vision problems will encourage unified modelling of vision and language signals. They have also demonstrated that shifted-window-based self-attention, ST’s fundamental component, is both effective and efficient for vision problems.

Dr. Nivash Jeevanandam
Nivash holds a doctorate in information technology and has been a research associate at a university and a development engineer in the IT industry. Data science and machine learning excite him.
