“In 2021, language models will start to become aware of the visual world. The next generation of models will understand text better because of the many images they’ve seen.”
– Ilya Sutskever, co-founder, OpenAI
The line dividing pixels and prose is blurring by the day, and the reason is the flourishing of vision transformers. Over the years, neural networks have become much better at natural language processing. These networks can generate fixed- or variable-length vector-space representations and aggregate information from adjacent words to determine meaning in a given context. The transformer architecture allowed developers to fully harness parallel processing hardware such as TPUs. Convolutional neural networks (CNNs), on the other hand, though less sequential, take a relatively large number of steps to combine information from distant parts of an input.
The introduction of the transformer architecture allowed for significant parallelisation and improved translation quality. CNNs, by contrast, operate on fixed-size windows and have trouble capturing relations at the pixel level in both the spatial and temporal domains. Furthermore, filter weights in CNNs remain fixed after training, so the operation cannot adapt dynamically to variations in the input. These shortcomings of CNNs fuelled a wave of hybrid models, models that incorporate the best of both worlds.
What Do ViTs See?
Google’s Vision Transformer (ViT) applies the architecture transformers use for natural language to the vision tasks traditionally handled by CNNs. In ViTs, an image is represented as a sequence, and class labels for the image are predicted from that sequence, which allows the model to learn image structure independently. An input image is treated as a sequence of patches, where every patch is flattened into a single vector by concatenating the channels of all its pixels and then linearly projecting it to the desired input dimension. Google has even claimed that, when trained on sufficient data, ViT outperforms a state-of-the-art CNN with four times fewer computational resources.
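The patch-and-project step described above can be sketched in a few lines of NumPy. The 16-pixel patch size, 224x224 input, and 768-dimensional embedding below are the values used by the base ViT; the random projection matrix is only a stand-in for the learned linear layer:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an image of shape (H, W, C) into flattened patches,
    one row per patch, with all pixel channels concatenated."""
    h, w, c = image.shape
    grid = image.reshape(h // patch_size, patch_size,
                         w // patch_size, patch_size, c)
    grid = grid.transpose(0, 2, 1, 3, 4)       # group each patch's pixels together
    return grid.reshape(-1, patch_size * patch_size * c)

image = np.random.rand(224, 224, 3)            # dummy 224x224 RGB image
patches = patchify(image)                      # (196, 768): a 14x14 grid of patches
projection = np.random.rand(768, 768) * 0.01   # stand-in for the learned projection
tokens = patches @ projection                  # (196, 768) input sequence for the model
```

Each row of `tokens` then plays the same role a word embedding plays in a language model.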
So far, convolutional neural networks (CNNs) have been the de-facto model for visual data. Works from Google, and later from OpenAI with CLIP and DALL·E, have shown that there is something beyond CNNs for vision tasks. Vision Transformer (ViT) models already perform on par with CNNs on image classification tasks. This brings us to the question: how different are these ViTs from CNNs when it comes to learning an image? How do they solve tasks? What learning techniques do they employ for visual representations?
To explore the differences between ViTs and CNNs, researchers from Google surveyed the various factors influencing the learning processes of these models. In their report, the researchers analysed how local and global spatial information is utilised and found that ViT incorporates more global information than ResNet at lower layers, leading to quantitatively different features.
To understand ViTs, the researchers focused on the representational structure of these models and have drawn insights from techniques such as neural network representation similarity, which allows for easier comparison of neural networks.
When CKA (centered kernel alignment, the similarity measure used for these comparisons) is plotted between all pairs of layers across the different architectures, the researchers observe that ViTs have a relatively uniform layer similarity structure, with a clear grid-like pattern and large similarity between lower and higher layers. By contrast, the ResNet models show clear stages in their similarity structure, with much smaller similarity scores between lower and higher layers. In short, ViTs carry highly similar representations throughout the model, while ResNets do not.
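Linear CKA itself is straightforward to compute. A minimal sketch, with random matrices standing in for real layer activations:

```python
import numpy as np

def linear_cka(x, y):
    """Linear centered kernel alignment between two activation
    matrices of shape (examples, features). Returns a value in [0, 1]."""
    x = x - x.mean(axis=0)                       # centre each feature
    y = y - y.mean(axis=0)
    numerator = np.linalg.norm(y.T @ x) ** 2     # Frobenius norm by default
    return numerator / (np.linalg.norm(x.T @ x) * np.linalg.norm(y.T @ y))

rng = np.random.default_rng(0)
layer_a = rng.normal(size=(100, 64))             # fake activations: 100 examples
layer_b = rng.normal(size=(100, 32))             # a second, unrelated "layer"
print(linear_cka(layer_a, layer_a))              # a layer is maximally similar to itself
print(linear_cka(layer_a, layer_b))              # unrelated activations score much lower
```

Computing this score for every pair of layers in two networks yields exactly the kind of similarity grid the report plots.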
ViTs indeed have access to more global information than CNNs in their lower layers. But how much of a difference does this make when these models learn features?
The report also investigates the models’ lower attention layers and their access to global information. The researchers observe that the effective receptive fields of ViT’s lower layers are indeed larger than those of ResNets, and become much more global midway through the network. Receptive field size is a crucial issue in many visual tasks, as the output must respond to large enough areas of the image to capture information about large objects. ViT receptive fields also show a strong dependence on their centre patch owing to the model’s strong residual connections.
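One common way to quantify how local or global a self-attention layer is (not necessarily the exact metric used in this report) is mean attention distance: the attention-weighted average distance between a query patch and the patches it attends to. A sketch on a 14x14 patch grid, with two synthetic attention patterns:

```python
import numpy as np

def mean_attention_distance(attn, grid=14):
    """Attention-weighted average distance (in patch units) between each
    query patch and the patches it attends to. `attn` has shape
    (grid*grid, grid*grid), with each row summing to 1."""
    coords = np.array([(i, j) for i in range(grid) for j in range(grid)])
    dists = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    return float((attn * dists).sum(axis=1).mean())

n = 14 * 14
uniform = np.full((n, n), 1.0 / n)   # every patch attends everywhere: global
local = np.eye(n)                    # every patch attends only to itself: local
print(mean_attention_distance(uniform))   # large: a global "receptive field"
print(mean_attention_distance(local))     # 0.0: purely local
```

A lower layer whose heads score closer to the `uniform` case than the `local` case is, in this sense, already seeing the whole image.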
Like transformers in NLP, ViTs contain skip (also known as identity or shortcut) connections throughout, added after (i) the self-attention layer and (ii) the MLP layer. These skip connections are much more influential in ViT than in ResNet. Understanding the fundamental differences between ViTs and CNNs is important as the transformer architecture becomes more ubiquitous: transformers have extended their reach from taking over the world of language models to usurping CNNs as the de-facto vision model. This also sets the context for applying these models, since their individual representational capacities offer trade-offs that make each more suitable for some use cases than others. One model might be better at crunching satellite data while the other might excel with data related to diabetes. The era of brute-forcing CNNs for all things vision might be over, and experts envision a more hybrid future for ML models.
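The placement of the two skip connections can be sketched as a single encoder block. The `attention` and `mlp` callables below are placeholders for the learned sub-layers, and the pre-normalisation layout shown is the one used by the original ViT:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each token vector to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_block(tokens, attention, mlp):
    """One ViT encoder block: a skip connection is added after
    (i) self-attention and (ii) the MLP, as described above."""
    tokens = tokens + attention(layer_norm(tokens))   # skip connection (i)
    tokens = tokens + mlp(layer_norm(tokens))         # skip connection (ii)
    return tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 768))                  # one token per image patch
attention = lambda x: x @ (rng.normal(size=(768, 768)) * 0.01)   # toy stand-in
mlp = lambda x: np.maximum(x @ (rng.normal(size=(768, 768)) * 0.01), 0)
out = encoder_block(tokens, attention, mlp)           # shape preserved: (196, 768)
```

Because each sub-layer's output is merely added to its input, information can flow from the first layer to the last almost unchanged, which is one reason the report finds these connections so influential in ViT.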
The findings of this report can be summarised as follows:
- ViT incorporates more global information than ResNet at lower layers, leading to quantitatively different features.
- Skip connections in ViT are even more influential than in ResNets, having strong effects on performance and representation similarity.
- ResNet requires more lower layers to compute representations similar to those of a smaller number of ViT lower layers.
- Local information early on (which is hardcoded into CNN architectures) is important for strong performance on image tasks.
- ResNet (CNNs) is trained to classify with a global average pooling step, while ViT has a separate classification (CLS) token.
- Larger ViT models develop significantly stronger intermediate representations when pre-trained on larger datasets.
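The classification-head difference noted above (global average pooling versus a dedicated CLS token) can be shown in two lines; the shapes assume the base ViT's 196 patch tokens plus one CLS token:

```python
import numpy as np

tokens = np.random.rand(197, 768)        # 1 CLS token followed by 196 patch tokens

# ResNet-style head: average every spatial feature, then classify.
gap_features = tokens[1:].mean(axis=0)   # (768,)

# ViT-style head: classify from the dedicated CLS token alone.
cls_features = tokens[0]                 # (768,)
```

Both heads feed a 768-dimensional vector to the final classifier; they differ only in whether that vector summarises all patches or a single learned token.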
Read the complete report here.