What’s Wrong With Limiting Transformers To Just Language Models

Transformers can help workers spend more time doing meaningful work and boost productivity.

In 2017, Google introduced the Transformer, a novel neural network architecture. Built on the self-attention mechanism and well suited to language understanding, transformers could outperform both recurrent and convolutional neural network models.

Transformers process the entire input sequence simultaneously rather than one element at a time. Like other sequence-to-sequence models, they transform one sequence into another, but they rely on the attention mechanism to do so.
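To make the idea of attending to a whole sequence at once more concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: real transformers apply learned query, key and value projections and use several attention heads, all of which are left out here, and the function names are illustrative only.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections (assumption).

    tokens: list of equal-length vectors, one per sequence position.
    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Compare this position with every position in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        row = softmax(scores)
        # Each output is a weighted average of all token vectors.
        mixed = [sum(w * value[d] for w, value in zip(row, tokens))
                 for d in range(dim)]
        weights.append(row)
        outputs.append(mixed)
    return outputs, weights

# Three toy 2-dimensional token vectors (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Every output position is computed from all input positions, and no position has to wait for the previous one, which is what allows the computation to be parallelised.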


Since its introduction, some of the most popular and groundbreaking language models have been based on transformers. These include Bidirectional Encoder Representations from Transformers (BERT), the Generative Pre-trained Transformer models (GPT-2 and GPT-3), and XLNet.

Though invented mostly for language processing, the scope of transformers has grown well beyond that. Transformers can help workers spend more time doing meaningful work and boost productivity. As per a McKinsey Global Institute report, technologies such as transformers could help capture an additional 20-25 per cent in net economic benefits, valued at $13 trillion globally, over the next 12 years.

In this article, we look at the use of transformers in fields other than language processing.

Transformers for Computer Vision

Convolutional neural networks (CNNs) have been used in computer vision since the 1980s and have become the most popular approach for such applications. Their main benefit is that they avoid the need for hand-designed visual features, learning instead to perform tasks directly from data, end-to-end. Although CNNs remove the need for hand-crafted feature extraction, the architecture itself is designed specifically for images and can be computationally demanding. This limitation encouraged researchers to look for more domain-agnostic and computationally efficient architectures capable of state-of-the-art results.

To this end, Google introduced the Vision Transformer (ViT), a vision model based on the Transformer neural network. ViT represents an input image as a sequence of image patches and directly predicts class labels for the image. ViT has demonstrated excellent performance: when trained on sufficient data, it can outperform CNN models while using four times fewer computational resources.
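The step of turning an image into a sequence of patches can be sketched as follows. This is an illustrative toy, assuming a single-channel image stored as a list of rows and a hypothetical helper named image_to_patches. In ViT itself, each flattened patch is additionally projected to an embedding and combined with position information before it reaches the transformer, which this sketch leaves out.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image (list of rows) into a sequence of flattened patches.

    Patches are read left to right, top to bottom. The image height and
    width must be multiples of patch_size.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches

# A toy 4x4 single-channel image with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patches(image, patch_size=2)
# patches[0] is the top-left block: [0, 1, 4, 5]
```

The resulting list plays the same role for the model as a sequence of word tokens does in a language task.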

Google described ViT as a preliminary step towards a generic, scalable architecture that can solve vision problems as well as tasks from other domains.

Improving on ViT, Facebook introduced the Data-efficient Image Transformer (DeiT). DeiT is trained with a transformer-specific knowledge distillation procedure based on a distillation token, a learned vector that flows through the network alongside the transformed image data and improves image classification performance with less training data. DeiT can be trained with 1.2 million images, whereas earlier vision transformers required hundreds of millions of images for similar training.

Notably, ViT is not the first application of transformers to computer vision tasks. In May 2020, Facebook AI researchers introduced Detection Transformers (DETR). A new approach to object detection and panoptic segmentation, DETR is the first object detection framework to integrate transformers as a central building block for detection.

While traditional computer vision models generally use complex pipelines based on custom layers to localise objects in images and then extract features, DETR uses a simpler neural network to offer an end-to-end deep learning solution. Transformers’ self-attention mechanisms let DETR reason globally about the image as well as about the specific objects detected.

Reinforcement Learning

In reinforcement learning, problems are usually framed around the Markov property, with estimates made one step at a time as the agent works through a task. The problem can also be formulated as sequence modelling, where the goal is to predict a sequence of actions that leads to high rewards. Researchers have therefore explored whether high-capacity sequence prediction models that work well in domains like NLP can also provide effective solutions to the reinforcement learning problem.

This year, researchers from the University of California, Berkeley, demonstrated that state-of-the-art Transformer architectures could be used to reframe reinforcement learning as one big sequence modelling problem by modelling distributions over sequences of states, actions, and rewards. The researchers observed that this reframing significantly simplifies design decisions, such as removing the requirement for separate behaviour policy constraints. The approach can be applied to long-horizon dynamics prediction, offline reinforcement learning, imitation learning, and goal-conditioned reinforcement learning.
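As a rough illustration of what treating states, actions and rewards as one sequence might look like, the sketch below flattens a short trajectory into a single list of discrete tokens that a sequence model could be trained on. The uniform binning, the value range, the bin count and the function names are assumptions made for this example and are not taken from the researchers' work.

```python
def discretise(value, low, high, bins):
    """Map a continuous value in [low, high] to an integer bin index."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # keep the upper edge in the last bin

def trajectory_to_tokens(trajectory, low=-1.0, high=1.0, bins=10):
    """Flatten [(state, action, reward), ...] into one token sequence.

    state and action are lists of floats; reward is a float. Every
    dimension becomes one discrete token, in the order state, action,
    reward for each time step.
    """
    tokens = []
    for state, action, reward in trajectory:
        for value in list(state) + list(action) + [reward]:
            tokens.append(discretise(value, low, high, bins))
    return tokens

# Two time steps with a 2-dimensional state and a 1-dimensional action
# (made-up numbers).
trajectory = [
    ([0.1, 0.5], [-0.9], 0.3),
    ([0.3, 0.7], [0.9], 1.0),
]
tokens = trajectory_to_tokens(trajectory)
```

Once a trajectory is in this form, predicting the next token covers predicting the next state, action or reward, depending on the position in the sequence.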

Other Applications 

Biology and Medicine

Protein sequence analysis has been a field of great interest for researchers. Several deep learning approaches have emerged, but they have computational limitations or are designed to solve specific problems. In 2020, a group of researchers demonstrated a transformer neural network that pre-trains task-agnostic sequence representations. The model was applied to two problems: protein family classification and protein interaction prediction. The results offer a promising framework for fine-tuning the pre-trained representations on other protein prediction tasks.

Similarly, a recent study published in Nature demonstrated the use of a transformer to generate novel molecules predicted to bind a target protein, relying only on the protein’s amino acid sequence.

DeepMind’s AlphaFold 2, which made huge waves a few months back, is also built on an attention mechanism similar to the one used in transformers. This mechanism replaced the convolutional neural network used in AlphaFold 1.


In January this year, OpenAI introduced a transformer-based text-to-image engine called DALL·E, which is essentially a visual idea generator. Given a text prompt as input, it generates images to match the prompt. It can also combine disparate ideas to synthesise objects that may not even exist in the real world. The DALL·E model can also perform image-to-image translation tasks based on prompts.

Along with DALL·E, OpenAI also introduced CLIP, which learns from text-image pairs already available on the internet and can perform a broad range of visual classification tasks without requiring additional training examples.
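Classifying without additional training examples can be pictured as a nearest-match search: the image and each candidate label description are mapped to vectors, and the label whose vector is most similar to the image vector wins. The sketch below shows only that matching step. The embedding values are made up for illustration, the function names are hypothetical, and no real CLIP model or API is used.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the image embedding."""
    scores = {label: cosine_similarity(image_embedding, embedding)
              for label, embedding in label_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical embeddings standing in for the outputs of a text encoder
# and an image encoder; a real model would produce much longer vectors.
label_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]
best, scores = zero_shot_classify(image_embedding, label_embeddings)
```

Adding a new class then only requires writing a new label description, not collecting new labelled images.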

Wrapping up

Sequence-related tasks were traditionally handled with RNNs and CNNs. However, such networks have certain limitations, including a lack of parallelisation, which makes them challenging to train. Transformers, based on attention mechanisms, have not only overcome these limitations but have also revolutionised the way we perform sequence tasks. Moreover, with further research and development, the applications of transformers can be extended well beyond NLP tasks.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.
