Are Vision Transformers Better Than CNNs?

So far, convolutional neural networks (CNNs) have been the de-facto model for visual data.

“In 2021, language models will start to become aware of the visual world. The next generation of models will understand text better because of the many images they’ve seen.”

Ilya Sutskever, co-founder, OpenAI

The line dividing pixels and prose is blurring with every passing day, and the reason is the flourishing of vision transformers. Over the years, neural networks have become better at natural language processing. These networks can now generate fixed- or variable-length vector-space representations and aggregate information from adjacent words to determine meaning in a given context. The transformer architecture allowed developers to fully harness parallel processing units like TPUs. Convolutional neural networks (CNNs), on the other hand, though less sequential, take a relatively large number of steps to combine information.

(Image credits: Paper by S. Khan et al.)

The introduction of the transformer architecture allowed for significant parallelisation and improved translation quality. CNNs, by contrast, operate on a fixed-sized window and have trouble capturing relations at the pixel level in both the spatial and time domains. Furthermore, filter weights in CNNs remain fixed after training, so the operation cannot adapt dynamically to variations in the input. These shortcomings of CNNs fuelled a wave of hybrid models that incorporate the best of both worlds.

What Do ViTs See?

Google’s Vision Transformer (ViT) resembles the Transformer in the way its architecture processes natural language, and CNNs in the way it handles vision tasks. In ViTs, images are represented as sequences, and class labels for the image are predicted, which allows the model to learn image structure independently. Input images are treated as a sequence of patches: every patch is flattened into a single vector by concatenating the channels of all the pixels in that patch and then linearly projecting it to the desired input dimension. Google even claimed that ViT outperforms a state-of-the-art CNN with four times fewer computational resources when trained on sufficient data.
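The patch-embedding step described above can be sketched in a few lines of numpy. The sizes here (an 8×8 image, 4×4 patches, a 16-dimensional embedding) are toy values for illustration, not ViT's actual configuration, and the random projection stands in for a learned one:

```python
import numpy as np

# Hypothetical toy sizes, for illustration only.
H = W = 8        # image height/width
C = 3            # channels
P = 4            # patch size -> (H // P) * (W // P) = 4 patches
D = 16           # embedding dimension

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# Split the image into non-overlapping P x P patches and flatten each one
# by concatenating the channels of all its pixels.
patches = image.reshape(H // P, P, W // P, P, C)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)  # (4, 48)

# Linearly project every flattened patch to the model dimension D
# (in the real model, W_proj is learned).
W_proj = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ W_proj

print(tokens.shape)  # (4, 16): a sequence of 4 patch tokens
```

The resulting token sequence is what the transformer encoder consumes, exactly as it would consume a sequence of word embeddings.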


Works from Google, and later from OpenAI with CLIP and DALL·E, have shown that there is something beyond CNNs for vision tasks. Vision Transformer (ViT) models already perform on par with CNNs on image classification tasks. This brings us to a few questions: How different are ViTs from CNNs when it comes to learning an image? How do they solve tasks? What learning techniques do they employ for visual representations?

To explore the differences between ViTs and CNNs, researchers from Google have surveyed the various factors influencing the learning processes of these models. In this report, the researchers analysed how local/global spatial information is utilised and have found that ViT incorporates more global information than ResNet at lower layers, leading to quantitatively different features.

To understand ViTs, the researchers focused on the representational structure of these models and have drawn insights from techniques such as neural network representation similarity, which allows for easier comparison of neural networks.

(Image credits: Paper by Raghu et al.)

As illustrated above, when CKA (centered kernel alignment, the measure used for comparisons) similarities are plotted between all pairs of layers across different model architectures, the researchers observe that ViTs have a relatively uniform layer similarity structure, with a clear grid-like pattern and large similarity between lower and higher layers. By contrast, the ResNet models show clear stages in their similarity structure, with smaller similarity scores between lower and higher layers. The report concludes that the representational structures of ViTs and CNNs differ significantly: ViTs have highly similar representations throughout the model, while ResNets show much lower similarity between lower and higher layers.
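The CKA measure behind these plots has a compact linear-kernel form: the (normalised) Frobenius norm of the cross-covariance between two activation matrices. A minimal sketch, with hypothetical activation sizes, also demonstrates a key property, that CKA is invariant to orthogonal transformations of a representation:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape
    (n_examples, n_features); a sketch of the layer-similarity measure."""
    X = X - X.mean(axis=0)   # centre each feature
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
acts = rng.standard_normal((100, 32))   # activations of one hypothetical layer

same = linear_cka(acts, acts)           # a layer compared with itself
Q = np.linalg.qr(rng.standard_normal((32, 32)))[0]   # random rotation
rotated = linear_cka(acts, acts @ Q)    # same representation, rotated

print(round(same, 4), round(rotated, 4))  # both 1.0
```

This invariance is what makes CKA suitable for comparing layers that describe the same information in differently oriented coordinate systems.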

ViTs indeed have access to more global information than CNNs in their lower layers. But, how much of a difference does this make when these models learn features?

The report also investigates the lower attention layers of the models and their access to global information. The researchers observe that lower-layer effective receptive fields for ViT are indeed larger than in ResNets, and become much more global midway through the network. Receptive field size is a crucial issue in many visual tasks, as the output must respond to large enough areas of the image to capture information about large objects. ViT receptive fields also show a strong dependence on their centre patch due to their strong residual connections.

Like Transformers, ViTs contain skip (aka identity or shortcut) connections throughout, added after (i) the self-attention layer and (ii) the MLP layer. These skip connections are much more influential in ViT than in ResNet. Understanding the fundamental differences between ViTs and CNNs matters as the transformer architecture becomes more ubiquitous: transformers have extended their reach from taking over the world of language models to usurping CNNs as the de-facto vision model. This also sets the context for applying these models. Their individual representational capacities offer trade-offs, making each more suitable for some use cases than others: one model might be better at crunching satellite data, while the other might excel with data related to diabetes. The era of brute-forcing CNNs for all things vision might be over, and experts envision a more hybrid future for ML models.
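The placement of the two skip connections, one around self-attention and one around the MLP, can be sketched as a single (pre-norm) ViT encoder block. All sizes and weights below are toy stand-ins, and single-head attention is used for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention over the token sequence.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

def vit_block(x, Wq, Wk, Wv, W1, W2):
    # (i) self-attention sublayer, with its output *added* onto the input
    x = x + attention(layer_norm(x), Wq, Wk, Wv)
    # (ii) MLP sublayer (ReLU here for brevity), again added via a skip
    x = x + np.maximum(layer_norm(x) @ W1, 0) @ W2
    return x

rng = np.random.default_rng(0)
D = 16
x = rng.standard_normal((4, D))   # 4 patch tokens
params = [rng.standard_normal((D, D)) * 0.02 for _ in range(5)]
out = vit_block(x, *params)
print(out.shape)  # (4, 16): the residual path preserves the sequence shape
```

Because each sublayer's output is added to, rather than replacing, its input, information can flow unchanged through the residual path, which is consistent with the report's finding that these connections strongly shape ViT's representations.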

Key Takeaways

The findings of this report can be summarised as follows:

  • ViT incorporates more global information than ResNet at lower layers, leading to quantitatively different features.
  • Skip connections in ViT are even more influential than in ResNets, having strong effects on performance and representation similarity.
  • ResNet requires more lower layers to compute representations similar to those of a smaller set of ViT lower layers.
  • Local information early on for image tasks (which is hardcoded into CNN architectures) is important for strong performance.
  • ResNet (CNNs) is trained to classify with a global average pooling step, while ViT has a separate classification (CLS) token.
  • Larger ViT models develop significantly stronger intermediate representations through larger pre-training datasets.
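The pooling-versus-CLS distinction in the takeaways above can be sketched side by side. The token count, feature width, class count, and weights here are hypothetical, and both heads are reduced to a single linear projection:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.standard_normal((1 + 4, 16))   # one CLS token, then 4 patch tokens
W_head = rng.standard_normal((16, 10)) * 0.02   # 10 hypothetical classes

# ResNet-style head: global average pool the spatial features, then classify.
gap_logits = tokens[1:].mean(axis=0) @ W_head

# ViT-style head: classify from the dedicated CLS token alone.
cls_logits = tokens[0] @ W_head

print(gap_logits.shape, cls_logits.shape)  # (10,) (10,)
```

The GAP head forces every spatial position to contribute equally, while the CLS token lets attention decide which patches inform the final prediction.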

Read the complete report here.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
