Last updated March 14, 2021

What Is Transformer-iN-Transformer?


Published on March 14, 2021

by Ambika Choudhury

A team of researchers from Huawei, ISCAS and UCAS recently introduced a new Transformer model for visual recognition, known as Transformer-iN-Transformer (TNT). The researchers said this neural network architecture outperforms the conventional Vision Transformer (ViT) and could address shortcomings of existing Transformer-based models for computer vision tasks.

Transformer, a popular self-attention-based neural network, is used for various natural language processing (NLP) tasks. Lately, researchers have also been using pure Transformer-based models to solve various computer vision problems, such as object detection, image recognition, image processing and more.

Why This Research

Researchers have recently applied Transformers to several computer vision tasks. For instance, Facebook’s DEtection Transformer (DETR) tackles object detection with a Transformer encoder-decoder architecture, treating the task as a direct set prediction problem. Vision Transformer (ViT) treats an image as a sequence of patches and then performs classification with a Transformer encoder.

The researchers said computer vision models, purely based on Transformer architecture, are compelling because they provide a computing paradigm without the image-specific inductive bias, which is entirely different from convolutional neural networks (CNNs). They said: “Compared to the mainstream CNN models, these Transformer-based models have also shown promising performance on visual tasks.”

According to the researchers, most vision Transformers view an image as a sequence of patches and ignore the local relations and intrinsic structure information inside each patch, which are essential for visual recognition. To mitigate this issue, the researchers proposed the new Transformer architecture.

How TNT Works

First, an image is split into a sequence of patches, and each patch is then reshaped into a sequence of pixels. The pixel embeddings and patch embeddings are obtained by a linear transformation of the pixels and patches, respectively, and are then fed into a stack of TNT blocks for representation learning.
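The splitting and embedding steps described above can be sketched in a few lines of NumPy. This is only an illustration of the idea as this article describes it, not the authors' implementation: the image size, patch size and embedding dimensions are hypothetical, and random matrices stand in for the learned linear transformations.

```python
import numpy as np

def split_into_patches(image, patch_size):
    """Split an (H, W, C) image into non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size, patch_size, C),
    with patches ordered row by row.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size, patch_size, c)

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only.
image = rng.random((32, 32, 3))
patch_size = 8
pixel_dim, patch_dim = 12, 48

# Step 1: image -> sequence of patches.
patches = split_into_patches(image, patch_size)        # (16, 8, 8, 3)
num_patches = patches.shape[0]

# Step 2: each patch -> sequence of pixels.
pixel_seq = patches.reshape(num_patches, patch_size * patch_size, 3)  # (16, 64, 3)
flat_patches = patches.reshape(num_patches, -1)                       # (16, 192)

# Step 3: linear transformations (random stand-ins for learned weights).
w_pixel = rng.normal(size=(3, pixel_dim))
w_patch = rng.normal(size=(flat_patches.shape[1], patch_dim))

pixel_embeddings = pixel_seq @ w_pixel     # (16, 64, 12): one embedding per pixel
patch_embeddings = flat_patches @ w_patch  # (16, 48): one embedding per patch
```

The result is two sequences for the same image: a coarse one with one vector per patch, and a fine one with one vector per pixel inside each patch.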

ViT uses a standard Transformer to process the sequence of patches, which corrupts the local structure within each patch. The TNT architecture, by contrast, can better learn to model local information for visual recognition. It learns both global and local information in an image, and each patch is further transformed into the target size with a pixel unfold followed by a linear projection.

As shown in the above figure, there are two Transformer blocks in the TNT block, where the outer Transformer block models the global relation among patch embeddings and the inner Transformer block extracts the local structure information of the pixel embeddings. The TNT architecture is built by stacking these TNT blocks.
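The interplay between the inner and outer Transformer blocks can be sketched as follows. This is a heavily simplified illustration, not the authors' code: it uses single-head attention with a residual connection and leaves out layer normalisation, the feed-forward layers, position encodings and the class token. The way the inner block's output is folded back into the patch embedding (flatten, project, add) is an assumption based on a reading of the paper and is not spelled out in this article. All sizes and the helper names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head self-attention with a residual connection.

    x has shape (..., seq_len, dim); attention runs over seq_len.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return x + softmax(scores) @ v

def tnt_block(pixel_emb, patch_emb, params):
    """One simplified TNT block.

    pixel_emb: (num_patches, pixels_per_patch, pixel_dim)
    patch_emb: (num_patches, patch_dim)
    """
    # Inner block: pixels attend only to pixels in the same patch (local).
    pixel_emb = self_attention(pixel_emb, *params["inner"])
    # Assumed merge step: flatten each patch's pixel embeddings,
    # project to patch_dim and add to the patch embedding.
    n = pixel_emb.shape[0]
    patch_emb = patch_emb + pixel_emb.reshape(n, -1) @ params["proj"]
    # Outer block: patches attend to all other patches (global).
    patch_emb = self_attention(patch_emb, *params["outer"])
    return pixel_emb, patch_emb

def init_block(rng, pixels_per_patch, pixel_dim, patch_dim):
    """Random stand-ins for the learned weights of one block."""
    def attn(dim):
        return [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3)]
    return {
        "inner": attn(pixel_dim),
        "outer": attn(patch_dim),
        "proj": rng.normal(scale=0.1, size=(pixels_per_patch * pixel_dim, patch_dim)),
    }

rng = np.random.default_rng(0)
num_patches, pixels_per_patch = 16, 64  # hypothetical sizes
pixel_dim, patch_dim = 12, 48

pixel_emb = rng.normal(size=(num_patches, pixels_per_patch, pixel_dim))
patch_emb = rng.normal(size=(num_patches, patch_dim))

# The architecture is built by stacking blocks; two are used here.
blocks = [init_block(rng, pixels_per_patch, pixel_dim, patch_dim) for _ in range(2)]
for params in blocks:
    pixel_emb, patch_emb = tnt_block(pixel_emb, patch_emb, params)

print(pixel_emb.shape, patch_emb.shape)
```

Because each block returns embeddings of the same shape it received, blocks can be stacked to any depth, which matches the description of the architecture as a stack of TNT blocks.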

Outperforming ResNet

Through the proposed TNT model, the researchers showed that it is possible to model both global and local structure information of the images and improve the representation ability of the feature.

To gauge the current progress of visual Transformers, the researchers compared TNT with popular representative CNN-based models such as ResNet, RegNet and EfficientNet on the ImageNet dataset. The results showed that TNT outperformed the widely used ResNet and RegNet models, but not EfficientNet.

Benefits of TNT

The features of TNT are more diverse and contain richer information than those of the DeiT model. These benefits are attributed to the inner Transformer block, which models local features.
TNT has a strong generalisation ability.
TNT architecture can better learn to model local information for visual recognition.

Wrapping Up

Transformer models like TNT will help advance computer vision research. Compared to the conventional Vision Transformer (ViT), the TNT architecture can better preserve and model local information for visual recognition.

Read the paper here.
