The Swin Transformer (ST) is a type of Vision Transformer. It constructs hierarchical feature maps by merging image patches in deeper layers, and its computational complexity is linear in the size of the input image because self-attention is computed only within each local window. As a result, it can serve as a general-purpose backbone for image classification and dense recognition tasks. In comparison, earlier vision transformers produce feature maps at a single low resolution and have computational complexity that is quadratic in the size of the input image because self-attention is computed globally.
ICCV 2021 announced the winners of its Best Paper Award, honourable mentions, and Best Student Paper Award. “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows” was selected as the best paper at ICCV 2021.
The paper's authors are Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo of Microsoft Research Asia.
Challenges in Vision Applications
Visual elements can vary substantially in scale, yet tokens in existing transformer-based models all have a fixed scale, which makes these models a poor fit for certain vision applications. Another difference is that images contain far more pixels than text passages contain words. Finally, many vision tasks, such as semantic segmentation, require dense prediction at the pixel level, which is intractable for a transformer on high-resolution images because the computational complexity of its self-attention is quadratic in image size. To address these challenges, the researchers propose the ST, a general-purpose Transformer backbone that produces hierarchical feature maps and has computational complexity linear in image size.
Source: Swin Transformer
Process of a Swin Transformer
The figure above shows that ST builds a hierarchical representation by starting with small patches and gradually merging neighbouring patches in deeper transformer layers. With these hierarchical feature maps, the ST model can readily use established dense prediction techniques such as feature pyramid networks (FPN) or U-Net. The linear computational complexity comes from computing self-attention locally within non-overlapping windows that partition the image. Because the number of patches in each window is fixed, the total cost grows linearly with image size. Thus, ST is well suited as a general-purpose backbone for various vision applications, unlike earlier transformer-based architectures, which produce feature maps at a single resolution and have quadratic complexity.
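To make the linear-versus-quadratic claim concrete, here is a minimal sketch that compares the two costs using the approximate operation counts given in the Swin Transformer paper: 4hwC² + 2(hw)²C for global self-attention and 4hwC² + 2M²hwC for window-based self-attention, where h × w is the number of patches, C the channel dimension, and M the window side length. The function names and the sample values (C = 96, M = 7) are chosen here for illustration only.

```python
def global_msa_cost(h, w, c):
    """Approximate cost of global self-attention over an h x w grid of patches."""
    n = h * w
    # Projections are linear in n; the attention matrix is quadratic in n.
    return 4 * n * c**2 + 2 * n**2 * c


def window_msa_cost(h, w, c, m):
    """Approximate cost when attention is restricted to m x m windows."""
    n = h * w
    # Each patch attends to only m*m patches, so every term is linear in n.
    return 4 * n * c**2 + 2 * m**2 * n * c


# Illustrative values: 96 channels, 7 x 7 windows.
for side in (56, 112, 224):
    g = global_msa_cost(side, side, 96)
    wnd = window_msa_cost(side, side, 96, 7)
    print(f"{side}x{side} patches: global={g:.3e} windowed={wnd:.3e}")
```

Doubling the side length quadruples the number of patches. The windowed cost then grows by exactly four times, while the global cost grows by considerably more because of its quadratic term.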
Architecture
ST’s key design element is the shift of the window partition between consecutive self-attention layers. The shifted windows bridge the windows of the preceding layer, providing connections between them that considerably increase modelling power. This strategy is also efficient in terms of real-world latency: all query patches within a window share the same key set, which simplifies memory access in hardware.
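The effect of shifting the partition can be illustrated with a small sketch. It assigns each patch in a grid to a window, once with the regular partition and once after displacing the grid by half a window, which is the displacement the paper describes. The shift is implemented here as a cyclic roll of patch coordinates. The function name is hypothetical, the grid size is assumed to be divisible by the window size, and the attention masking that the paper applies to wrapped-around patches is left out.

```python
from collections import Counter


def window_ids(height, width, m, shift=0):
    """Map each patch (row, col) to the index of the m x m window containing it.

    A non-zero shift cyclically rolls the patch grid by `shift` positions
    along both axes before the regular partition is applied.
    """
    windows_per_row = width // m
    ids = {}
    for r in range(height):
        for c in range(width):
            rr = (r - shift) % height  # position after the cyclic roll
            cc = (c - shift) % width
            ids[(r, c)] = (rr // m) * windows_per_row + (cc // m)
    return ids


regular = window_ids(8, 8, m=4)
shifted = window_ids(8, 8, m=4, shift=2)

# Neighbouring patches on either side of a window border:
a, b = (3, 3), (4, 4)
print(regular[a] == regular[b])  # False: separated by the regular partition
print(shifted[a] == shifted[b])  # True: the shifted partition joins them
print(Counter(shifted.values()))  # every window still holds m * m patches
```

Alternating the two partitions in consecutive layers lets information cross window borders while each layer still attends only within fixed-size windows.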
Source: Swin Transformer
The figure above depicts the tiny version of the ST architecture (Swin-T). It first splits the input RGB image into non-overlapping patches using a patch splitting module, as ViT does. Each patch is then treated as a “token” whose feature is the concatenation of the raw RGB values of its pixels.
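The patch splitting step can be sketched in a few lines. The example below assumes a patch size of 4 × 4, the value used in the paper, so each token carries 4 × 4 × 3 = 48 raw values. The function name and the toy 8 × 8 image are illustrative, and the linear embedding that the paper applies to these raw features afterwards is not shown.

```python
def split_into_patches(image, patch):
    """Split an image given as nested lists [H][W][3] into flat patch tokens.

    Returns one feature vector per patch, in row-major patch order; each
    vector concatenates the RGB values of the pixels inside that patch.
    Assumes H and W are divisible by `patch`.
    """
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            feature = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    feature.extend(image[r][c])
            tokens.append(feature)
    return tokens


# Toy 8 x 8 image whose pixels encode their own coordinates.
image = [[[r, c, 0] for c in range(8)] for r in range(8)]
tokens = split_into_patches(image, patch=4)
print(len(tokens), len(tokens[0]))
```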
An ST block is built by replacing the standard multi-head self-attention (MSA) module in a transformer block with a module based on shifted windows, while the other layers stay the same. Thus, an ST block consists of a shifted window-based MSA module followed by a two-layer MLP with a GELU nonlinearity in between. A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module.
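The order of operations in one block can be written out as a minimal sketch: normalise, attend, add the residual, then normalise, apply the MLP, and add the residual again. The attention module is passed in as a callable stand-in rather than implemented, the LayerNorm omits its learnable scale and shift, and all function and parameter names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v):
    """Exact GELU, expressed with the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Dense layer; `weight` is laid out as [out_features][in_features]."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with a GELU between the layers."""
    hidden = [gelu(v) for v in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def swin_block(tokens, attention, w1, b1, w2, b2):
    """One block: LN -> attention -> residual, then LN -> MLP -> residual.

    `attention` stands in for the (shifted) window-based MSA module: it
    takes a list of normalised tokens and returns a list of the same shape.
    """
    attended = attention([layer_norm(t) for t in tokens])
    tokens = [[a + b for a, b in zip(t, u)]
              for t, u in zip(tokens, attended)]
    out = []
    for t in tokens:
        update = mlp(layer_norm(t), w1, b1, w2, b2)
        out.append([a + b for a, b in zip(t, update)])
    return out
```

Because both sub-modules sit on residual branches, a block whose attention and MLP outputs are zero passes its input through unchanged, which is a convenient sanity check for the wiring.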
Conclusion
ST is a vision transformer that produces a hierarchical feature representation and has computational complexity linear in the size of the input image. ST outperforms the previous best approaches on COCO object detection and ADE20K semantic segmentation. Its characteristics make it suitable for a broad range of vision applications, including image classification and dense prediction tasks such as object detection and semantic segmentation. The researchers hope that ST’s strong performance on various vision problems will encourage unified modelling of vision and language signals. They have also demonstrated that shifted window-based self-attention, a key component of ST, is effective and efficient on vision problems.