What Is Transformer-iN-Transformer?

A team of researchers from Huawei, ISCAS and UCAS recently introduced a new Transformer model for visual recognition, known as Transformer-iN-Transformer (TNT). The researchers said the architecture outperforms the conventional Vision Transformer (ViT) and addresses a key weakness of Transformer-based models for computer vision tasks: they ignore the structure inside each image patch.

Transformer, a popular self-attention-based neural network architecture, is used for various natural language processing (NLP) tasks. Lately, researchers have also been applying pure Transformer-based models to computer vision problems such as object detection, image recognition and image processing.

Why This Research

Researchers have recently applied Transformers to several computer vision tasks. For instance, Facebook's DEtection TRansformer (DETR) solves object detection with a Transformer encoder-decoder architecture, treating detection as a direct set prediction problem. Vision Transformer (ViT) views an image as a sequence of patches and then performs classification with a Transformer encoder.

The researchers said computer vision models, purely based on Transformer architecture, are compelling because they provide a computing paradigm without the image-specific inductive bias, which is entirely different from convolutional neural networks (CNNs). They said: “Compared to the mainstream CNN models, these Transformer-based models have also shown promising performance on visual tasks.”

According to the researchers, most vision Transformers view an image as a sequence of patches and ignore the local relations and intrinsic structural information inside each patch, which are essential for visual recognition. To mitigate this, the researchers proposed the new Transformer architecture.

How TNT Works

First, an image is split into a sequence of patches, and each patch is further reshaped into a sequence of pixels. Pixel embeddings and patch embeddings are obtained from the pixels and patches, respectively, through linear transformations, and both are then fed into a stack of TNT blocks for representation learning.
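
To make the pipeline concrete, here is a minimal PyTorch sketch of the patch and pixel embedding steps. The image size, patch size, pixel-group size and embedding dimensions below are illustrative assumptions, not necessarily the paper's exact settings.

```python
# A minimal sketch of the TNT input pipeline. The sizes (224x224 image,
# 16x16 patches, 4x4 pixel groups, 384/24-dim embeddings) are assumed
# for illustration only.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)            # (batch, channels, H, W)

# Split the image into non-overlapping 16x16 patches -> 196 patches.
patch_size = 16
to_patches = nn.Unfold(kernel_size=patch_size, stride=patch_size)
patches = to_patches(image).transpose(1, 2)    # (1, 196, 3*16*16=768)

# Patch embeddings: one linear projection per flattened patch.
patch_dim = 384
patch_embed = nn.Linear(3 * patch_size ** 2, patch_dim)
patch_tokens = patch_embed(patches)            # (1, 196, 384)

# Pixel embeddings: view each patch as a sequence of 4x4 pixel groups
# (16 "pixel" tokens per patch) and linearly project each group.
pixel_dim = 24
pixels = patches.reshape(-1, 3, patch_size, patch_size)  # (196, 3, 16, 16)
to_pixels = nn.Unfold(kernel_size=4, stride=4)
pixel_seq = to_pixels(pixels).transpose(1, 2)  # (196, 16, 3*4*4=48)
pixel_embed = nn.Linear(3 * 4 * 4, pixel_dim)
pixel_tokens = pixel_embed(pixel_seq)          # (196, 16, 24)
```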

Unlike ViT, which processes the sequence of patches with a standard Transformer and thereby corrupts the local structure within each patch, the TNT architecture learns both global and local information in an image. Inside each TNT block, the pixel-level representation of a patch is unfolded and linearly projected to the patch embedding size, so that local details flow into the patch-level representation.

Each TNT block contains two Transformer blocks: the outer Transformer block models the global relations among patch embeddings, while the inner Transformer block extracts local structure information from the pixel embeddings. The TNT architecture is built by stacking these TNT blocks.
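
A hypothetical sketch of one such block, continuing the sizes from the sketch above, might look as follows; the linear projection that folds pixel information back into the patch tokens is a simplified stand-in for the paper's exact fusion step.

```python
# An illustrative TNT block: an inner Transformer layer over pixel
# tokens, a projection that fuses them into their patch token, then an
# outer Transformer layer over patch tokens. Sizes are assumptions.
import torch.nn as nn

class TNTBlock(nn.Module):
    def __init__(self, pixel_dim=24, patch_dim=384, n_pixels=16):
        super().__init__()
        # Inner block: models local structure inside each patch.
        self.inner = nn.TransformerEncoderLayer(
            d_model=pixel_dim, nhead=4, batch_first=True)
        # Fuses the flattened pixel tokens into the patch embedding.
        self.norm = nn.LayerNorm(pixel_dim * n_pixels)
        self.proj = nn.Linear(pixel_dim * n_pixels, patch_dim)
        # Outer block: models global relations among patch tokens.
        self.outer = nn.TransformerEncoderLayer(
            d_model=patch_dim, nhead=6, batch_first=True)

    def forward(self, pixel_tokens, patch_tokens):
        # pixel_tokens: (batch * n_patches, n_pixels, pixel_dim)
        # patch_tokens: (batch, n_patches, patch_dim)
        pixel_tokens = self.inner(pixel_tokens)
        b, n, _ = patch_tokens.shape
        fused = self.proj(self.norm(pixel_tokens.reshape(b, n, -1)))
        patch_tokens = self.outer(patch_tokens + fused)
        return pixel_tokens, patch_tokens

# The full network stacks these blocks, e.g.:
#   blocks = nn.ModuleList(TNTBlock() for _ in range(12))
#   for blk in blocks:
#       pixel_tokens, patch_tokens = blk(pixel_tokens, patch_tokens)
```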

Outperforming ResNet

Through the proposed TNT model, the researchers showed that it is possible to model both the global and local structure information of an image and thereby improve the representational power of the learned features.

To gauge the current progress of visual Transformers, the researchers compared TNT against popular representative CNN-based models, namely ResNet, RegNet and EfficientNet, on the ImageNet dataset. TNT outperformed the widely used ResNet and RegNet models, though it did not surpass EfficientNet.

Benefits of TNT

  • The features of TNT are more diverse and contain richer information than those of the DeiT model, a benefit the researchers attribute to the inner Transformer block's modelling of local features.
  • TNT has a strong generalisation ability.
  • TNT architecture can better learn to model local information for visual recognition.

Wrapping Up

Transformer models like TNT will help advance computer vision research. Compared to the conventional Vision Transformer (ViT), the TNT architecture better preserves and models local information for visual recognition.

Read the paper, "Transformer in Transformer", on arXiv.

Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.