Are Visual Transformers Better Than CNNs?

So far, convolutional neural networks (CNNs) have been the de-facto model for visual data.

“In 2021, language models will start to become aware of the visual world. The next generation of models will understand text better because of the many images they’ve seen.”

Ilya Sutskever, co-founder, OpenAI

The line dividing pixels and prose is blurring with every passing day, and the reason is the flourishing of vision transformers. Over the years, neural networks have become better at natural language processing. These networks can now generate fixed- or variable-length vector-space representations and even aggregate information from adjacent words to determine the meaning of a word in a given context. The Transformer architecture also allowed developers to fully harness parallel processing units such as TPUs. Convolutional neural networks (CNNs), on the other hand, though less sequential, require a relatively large number of steps to combine information from distant parts of an input.

(Image credits: paper by S Khan et al.)

The introduction of the Transformer architecture allowed for significant parallelisation and improved translation quality. CNNs, by contrast, operate on a fixed-sized window and have trouble capturing relations at the pixel level in both the spatial and time domains. Furthermore, filter weights in CNNs remain fixed after training, so the operation cannot adapt dynamically to variations in the input. These shortcomings of CNNs fueled a wave of hybrid models that incorporate the best of both worlds.

What Do ViTs See?

Google’s Vision Transformer (ViT) applies the Transformer architecture, originally built for natural language, to the vision tasks traditionally handled by CNNs. In ViTs, images are represented as sequences, and class labels for the image are predicted, which allows the model to learn image structure on its own. Input images are treated as a sequence of patches, where every patch is flattened into a single vector by concatenating the channels of all pixels in the patch and then linearly projecting it to the desired input dimension. Google even claimed that ViT outperforms state-of-the-art CNNs with roughly four times fewer computational resources when trained on sufficient data.
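
To make the patch-flattening-and-projection step concrete, here is a minimal sketch in PyTorch. The image size, patch size and embedding dimension follow ViT-Base defaults but are assumptions for illustration; the strided convolution is simply an efficient way to flatten each patch and linearly project it, as described above.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches, flatten each patch,
    and linearly project it to the model's embedding dimension."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution flattens and projects each patch in one step.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (batch, 196, embed_dim)
        return x                             # a sequence of patch embeddings

# Example: a batch of two images becomes two sequences of 196 patch tokens.
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```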

So far, convolutional neural networks (CNNs) have been the de-facto model for visual data. Works from Google, and later from OpenAI with CLIP and DALL·E, have shown that there is something beyond CNNs for vision tasks. Vision Transformer (ViT) models already perform on par with CNNs on image classification tasks. This brings us to a few questions: How different are ViTs from CNNs in the way they learn an image? How do they solve tasks? What learning techniques do they employ for visual representations?

To explore the differences between ViTs and CNNs, researchers from Google have surveyed the various factors influencing the learning processes of these models. In this report, the researchers analysed how local/global spatial information is utilised and have found that ViT incorporates more global information than ResNet at lower layers, leading to quantitatively different features.

To understand ViTs, the researchers focused on the representational structure of these models and have drawn insights from techniques such as neural network representation similarity, which allows for easier comparison of neural networks.

(Image credits: paper by Raghu et al.)

As illustrated above, when CKA (centered kernel alignment, the metric used for the comparisons) similarities are plotted between all pairs of layers across the different architectures, ViTs show a relatively uniform layer similarity structure, with a clear grid-like pattern and large similarity between lower and higher layers. By contrast, the ResNet models show clear stages in their similarity structure, with much smaller similarity scores between lower and higher layers. In short, ViTs maintain highly similar representations throughout the model, while ResNets do not.
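
For intuition about the metric, below is a minimal sketch of linear CKA, one common variant; the random matrices are purely illustrative stand-ins for layer activations, and the report's exact estimator may differ.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape
    (num_examples, num_features); feature counts may differ."""
    # Centre each feature column.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, 'fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return cross / (norm_x * norm_y)

# Example: compare two (random, illustrative) layer representations for 100 inputs.
rng = np.random.default_rng(0)
acts_layer_a = rng.normal(size=(100, 768))   # e.g. a ViT block's outputs
acts_layer_b = rng.normal(size=(100, 512))   # e.g. a ResNet stage's outputs
print(linear_cka(acts_layer_a, acts_layer_b))
```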

ViTs indeed have access to more global information than CNNs in their lower layers. But how much of a difference does this make in the features these models learn?

This report also investigates the lower attention layers of the models and their access to global information. The researchers observe that lower layer effective receptive fields for ViT are indeed larger than in ResNets, and become much more global midway through the network. Receptive field size is considered to be a crucial issue in many visual tasks, as the output must respond to large enough areas in the image to capture information about large objects. ViT receptive fields also show a strong dependence on their centre patch due to their strong residual connections. 
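
As a rough illustration of why attention can be so much more global than convolution, the hypothetical sketch below back-propagates from a single output position and counts how many input positions receive a non-zero gradient. The toy 1-D layers and sizes are arbitrary assumptions, not the effective-receptive-field measurement used in the report.

```python
import torch
import torch.nn as nn

def receptive_positions(layer, seq_len=16, dim=8):
    """Count input positions that influence the middle output position."""
    x = torch.randn(1, seq_len, dim, requires_grad=True)
    out = layer(x)
    out[0, seq_len // 2].sum().backward()          # gradient from one output token
    return int((x.grad.abs().sum(-1)[0] > 0).sum())

conv = nn.Conv1d(8, 8, kernel_size=3, padding=1)            # local window of 3
attn = nn.MultiheadAttention(8, num_heads=2, batch_first=True)  # attends everywhere

print(receptive_positions(lambda x: conv(x.transpose(1, 2)).transpose(1, 2)))  # 3
print(receptive_positions(lambda x: attn(x, x, x)[0]))                         # 16
```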

Like Transformers, ViTs contain skip (aka identity or shortcut) connections throughout, added after (i) the self-attention layer and (ii) the MLP layer. These skip connections are much more influential in ViT than in ResNet.

Understanding the fundamental differences between ViTs and CNNs matters as the Transformer architecture becomes more ubiquitous. Transformers have extended their reach from taking over the world of language models to usurping CNNs as the de-facto vision model. This also sets the context for applying these models: their different representational capacities come with trade-offs that make one more suitable for certain use cases than the other. One model might be better at crunching satellite data, while the other might excel with data related to diabetes. The era of brute-forcing CNNs for all things vision might be over, and experts envision a more hybrid future for ML models.
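
A minimal sketch of one ViT encoder block, assuming PyTorch and ViT-Base-like dimensions, shows where these two skip connections sit; the class and parameter names are illustrative rather than taken from the report.

```python
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """One Transformer encoder block as used in ViT: a skip connection
    is added around (i) the self-attention layer and (ii) the MLP layer."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                      # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]          # skip connection 1 (around attention)
        x = x + self.mlp(self.norm2(x))        # skip connection 2 (around MLP)
        return x

# 197 tokens = 196 image patches + 1 CLS token; the block preserves the shape.
print(ViTBlock()(torch.randn(2, 197, 768)).shape)  # torch.Size([2, 197, 768])
```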

Key Takeaways

The findings of this report can be summarised as follows:

  • ViT incorporates more global information than ResNet at lower layers, leading to quantitatively different features.
  • Skip connections in ViT are even more influential than in ResNets, having strong effects on performance and representation similarity.
  • ResNet needs more of its lower layers to compute representations similar to those produced by a smaller number of ViT lower layers.
  • Incorporating local information in the early layers (which is hardcoded into CNN architectures) is important for strong performance on image tasks.
  • ResNets (CNNs) classify with a global average pooling step, while ViT uses a separate classification (CLS) token; see the sketch after this list.
  • Larger ViT models develop significantly stronger intermediate representations when pre-trained on larger datasets.
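
To illustrate the classification-head difference above, here is a hedged sketch of the two approaches; the tensor shapes (a ViT-Base token sequence and a ResNet-50-like feature map) and the 1,000-class head are assumptions for illustration.

```python
import torch
import torch.nn as nn

tokens = torch.randn(2, 197, 768)         # 1 CLS token + 196 patch tokens (ViT)
features = torch.randn(2, 2048, 7, 7)     # final feature map (ResNet-50-like)

head_vit = nn.Linear(768, 1000)
head_resnet = nn.Linear(2048, 1000)

# ViT: classify from the dedicated CLS token (the first token in the sequence).
logits_vit = head_vit(tokens[:, 0])

# ResNet: global average pooling over spatial positions, then classify.
pooled = features.mean(dim=(2, 3))        # (2, 2048)
logits_resnet = head_resnet(pooled)

print(logits_vit.shape, logits_resnet.shape)  # both torch.Size([2, 1000])
```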

Read the complete report here.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.