Swin Transformers: Computer Vision’s Most Powerful Tool

Image classification using Swin Transformers, a general-purpose computer vision backbone.

Swin Transformers (ST) are a subcategory of Vision Transformers. They construct hierarchical feature maps by merging image patches in deeper layers, and they have computational complexity linear in input image size because self-attention is computed only within each local window. As a result, ST can serve as a general-purpose backbone for image classification and dense recognition tasks. Earlier vision transformers, by contrast, produce feature maps at a single low resolution and have computational complexity quadratic in input image size because self-attention is computed globally.

ICCV 2021 has announced the winners of its Best Paper Award, honourable mentions, and Best Student Paper. “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows” was selected as the best paper at ICCV 2021.


The paper was authored by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo of Microsoft Research Asia.

Challenges in Vision Applications

Tokens in existing transformer-based models are fixed in size, which makes them ill-suited to certain vision applications. Another difference is scale: images contain far more pixels than text passages contain words, so visual inputs arrive at much higher resolution. Finally, many vision tasks, such as semantic segmentation, require dense prediction at the pixel level, which is intractable for a transformer on high-resolution images because the computational complexity of its self-attention is quadratic in the number of tokens. To address these challenges, the researchers propose ST, a general-purpose transformer backbone that produces hierarchical feature maps and has computational complexity linear in image size.
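To make the quadratic-versus-linear difference concrete, here is a rough back-of-the-envelope cost model (a sketch only: it counts just the pairwise attention term and ignores projections and constant factors; the token counts and window size are illustrative):

```python
def global_attention_cost(num_tokens, dim):
    # Every token attends to every other token: quadratic in token count.
    return num_tokens ** 2 * dim

def windowed_attention_cost(num_tokens, dim, window=7):
    # Tokens attend only within fixed window x window regions:
    # per-window cost is constant, so the total is linear in token count.
    num_windows = num_tokens // window ** 2
    return num_windows * (window ** 2) ** 2 * dim

# Doubling the image side quadruples the token count.
small, large = 56 * 56, 112 * 112
print(global_attention_cost(large, 96) / global_attention_cost(small, 96))      # 16.0
print(windowed_attention_cost(large, 96) / windowed_attention_cost(small, 96))  # 4.0
```

Quadrupling the tokens multiplies the global-attention cost by 16 but the windowed cost only by 4, which is what "linear in image size" buys.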

Source: Swin Transformer

Process of a Swin Transformer

The above figure shows that ST builds a hierarchical representation by starting with small patches and gradually merging neighbouring patches in deeper transformer layers. With these hierarchical feature maps, the ST model can readily use dense prediction techniques such as feature pyramid networks (FPN) or U-Net. Linear computational complexity is achieved by computing self-attention within non-overlapping windows that partition the image. Because the number of patches in each window is fixed, the total complexity grows linearly with image size. Thus, unlike earlier transformer-based systems, which generate feature maps at a single resolution and have quadratic complexity, ST is well-suited as a general-purpose backbone for various vision applications.
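The patch-merging step between stages can be sketched as follows (an illustrative NumPy version; in the actual model a learned linear layer then reduces the concatenated 4C channels to 2C):

```python
import numpy as np

def merge_patches(x):
    # x: (H, W, C) grid of patch tokens. Concatenate the features of each
    # 2x2 neighbourhood, halving spatial resolution and quadrupling channels.
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/2, W/2, 2, 2, C)
    return x.reshape(H // 2, W // 2, 4 * C)

feats = np.random.rand(56, 56, 96)          # stage-1 token grid, as in Swin-T
print(merge_patches(feats).shape)           # (28, 28, 384)
```

Repeating this merge at each stage produces the pyramid of resolutions that FPN- or U-Net-style decoders expect.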


ST’s essential design element is the shifting of the window partition between consecutive self-attention layers. The shifted windows bridge the windows of the preceding layer, considerably increasing modelling power. This technique is also efficient in real-world latency: all query patches within a window share the same key set, which simplifies hardware memory access.
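The shift itself is cheap to implement as a cyclic roll of the token grid, so the same regular window partition can be reused in alternating layers (a minimal NumPy sketch; the real implementation additionally masks attention between tokens that are only adjacent because of the wrap-around):

```python
import numpy as np

def window_partition(x, M):
    # Split an (H, W, C) token grid into non-overlapping M x M windows.
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M, M, C)

def shifted_windows(x, M):
    # Cyclically shift the grid by (-M//2, -M//2) so that the *same*
    # regular partition now yields windows straddling the previous
    # layer's window borders.
    shifted = np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
    return window_partition(shifted, M)

tokens = np.random.rand(56, 56, 96)
print(window_partition(tokens, 7).shape)  # (64, 7, 7, 96)
print(shifted_windows(tokens, 7).shape)   # (64, 7, 7, 96)
```

Because the shifted layout keeps the same number and size of windows, no extra padding or irregular windows are introduced, which is what keeps the latency low in practice.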

Source: Swin Transformer

The above figure depicts the ST architecture in its smallest form (Swin-T). It first splits the input RGB image into non-overlapping patches using a patch-splitting module, as in ViT. Each patch is then treated as a “token”, with its feature set to the concatenation of the raw RGB values of its pixels.
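This patch-splitting step can be sketched in NumPy (illustrative sizes: 4×4 patches over a 224×224 RGB image, giving 4×4×3 = 48 raw features per token, as described in the paper; a learned linear embedding then projects these to the model dimension):

```python
import numpy as np

def split_into_patch_tokens(img, patch=4):
    # img: (H, W, 3) RGB array. Split into non-overlapping patch x patch
    # squares and flatten each into a raw-value token.
    H, W, C = img.shape
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)      # one row per token

img = np.random.rand(224, 224, 3)
print(split_into_patch_tokens(img).shape)        # (3136, 48)
```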

ST is constructed by replacing the standard multi-head self-attention (MSA) module in a transformer block with a module based on shifted windows, keeping the other layers unchanged. An ST block thus comprises a shifted-window-based MSA module followed by a two-layer MLP with GELU nonlinearity. A LayerNorm (LN) layer precedes each MSA and MLP module, and a residual connection is applied after each module.
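The block's pre-norm residual layout can be sketched as follows (an illustrative NumPy version: the windowed attention is passed in as a placeholder callable, and biases are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token's features to zero mean and unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU nonlinearity.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swin_block(x, msa, w1, w2):
    # Pre-norm residual layout of one ST block:
    #   x = x + (S)W-MSA(LN(x));  x = x + MLP(LN(x))
    x = x + msa(layer_norm(x))
    return x + gelu(layer_norm(x) @ w1) @ w2

x = np.random.rand(49, 96)               # one 7x7 window of tokens
w1 = np.random.rand(96, 384) * 0.01      # MLP expands the dimension 4x ...
w2 = np.random.rand(384, 96) * 0.01      # ... then projects it back
print(swin_block(x, lambda t: t, w1, w2).shape)  # (49, 96)
```

Consecutive blocks alternate between the regular and the shifted window partition in the `msa` slot; everything else stays identical.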


ST is a novel vision transformer that generates a hierarchical feature representation and has computational complexity linear in input image size. ST outperforms the previous best approaches on COCO object detection and ADE20K semantic segmentation. These characteristics make it suitable for many vision applications, including image classification and dense prediction tasks such as object detection and semantic segmentation. The researchers hope that ST’s strong performance across vision problems will encourage unified modelling of vision and language signals. They have also demonstrated that shifted-window-based self-attention, ST’s fundamental component, is both effective and efficient for vision problems.

Dr. Nivash Jeevanandam
Nivash holds a doctorate in information technology and has been a research associate at a university and a development engineer in the IT industry. Data science and machine learning excite him.
