Andrej Karpathy Believes AI Models Will Consolidate. What Is He Talking About?


Andrej Karpathy, Director of AI at Tesla, wrote a very compelling Twitter thread last week, making an argument about the consolidation of AI model architecture. In the thread, he detailed how, over the last few years, neural network architectures across different areas of application have begun to look similar. In particular, he cited the example of Transformers – traditionally associated with language models, their scope has proved far more general.

He further noted that areas such as vision once relied on different techniques for classification, segmentation, detection, and generation. All of these are now converging on the same framework, with the only distinguishing features being the data, the input and output specifications, and the pattern of the attention mask.

Generality of Transformers 

The Transformer, a neural network architecture built around the self-attention mechanism, was introduced by Google. Like other sequence-to-sequence models, a Transformer maps one sequence to another while employing attention. It was initially introduced for natural language processing, leading to the launch of models such as BERT, XLNet, and GPT. However, Transformers can be used for far more diverse applications.


Computer vision

In 2020, Google introduced the Vision Transformer (ViT), a vision model based on the Transformer neural network. The model takes a sequence of image patches as input and directly predicts class labels for the image. When trained on sufficiently large datasets, ViT has been found to outperform the CNN models traditionally used for computer vision applications.
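As a rough illustration of the idea – a minimal sketch, not Google's actual implementation – the snippet below splits an image into patches, embeds them, prepends a learnable class token, and runs a standard Transformer encoder; the class token's output is used to predict the label. The class name, layer sizes and depth here are illustrative assumptions.

```python
# Minimal ViT-style classifier sketch (PyTorch). Illustrative only; the real ViT
# uses pre-norm blocks, careful initialisation and large-scale pretraining.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=256, depth=6, heads=8, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution maps each 16x16 patch to a vector.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                      # images: (B, 3, H, W)
        x = self.patch_embed(images)                # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)            # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                   # class token -> class logits

logits = TinyViT()(torch.randn(2, 3, 224, 224))     # (2, 1000)
```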


Facebook (now Meta) has also launched a similar Transformer model, the Data-efficient Image Transformer (DeiT). It relies on a knowledge-distillation procedure based on a distillation token and can be trained with 1.2 million images.
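A hedged sketch of the distillation-token idea: alongside the class token, the student carries a second learnable token whose output is trained to match a teacher network, typically a convnet. The loss below is a simplified hard-label variant with assumed shapes and weighting; DeiT's exact recipe differs in its details.

```python
# Simplified distillation-token training loss (PyTorch sketch). The student model is
# assumed to return two sets of logits: one from its class token, one from its
# distillation token. Shapes and the 0.5/0.5 weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def deit_style_loss(cls_logits, dist_logits, teacher_logits, labels):
    # The class token learns from the ground-truth labels.
    ce = F.cross_entropy(cls_logits, labels)
    # The distillation token learns from the teacher's hard predictions.
    teacher_labels = teacher_logits.argmax(dim=-1)
    kd = F.cross_entropy(dist_logits, teacher_labels)
    return 0.5 * ce + 0.5 * kd

# Toy batch of 8 examples over 1000 classes.
cls_logits = torch.randn(8, 1000)
dist_logits = torch.randn(8, 1000)
teacher_logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
loss = deit_style_loss(cls_logits, dist_logits, teacher_logits, labels)
```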

Reinforcement learning

Researchers from the University of California, Berkeley demonstrated that state-of-the-art Transformer architectures can be used to frame reinforcement learning as one big sequence modelling problem, by modelling distributions over sequences of states, actions, and rewards. The technique can also be applied to dynamics prediction, offline reinforcement learning, imitation learning, long-horizon dynamics, and conditioned reinforcement learning.
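A rough sketch of that framing, under assumptions of our own rather than the Berkeley authors' code: each trajectory is flattened into one interleaved stream of state, action and reward tokens, and a causal Transformer is run over it so that every position only sees earlier tokens, just as a language model would.

```python
# Flatten (state, action, reward) trajectories into one token stream for a causal
# Transformer. Purely illustrative; real implementations discretise each dimension.
import torch
import torch.nn as nn

def flatten_trajectory(states, actions, rewards):
    # states: (T, d_s), actions: (T, d_a), rewards: (T, 1) -> one flat token sequence
    steps = [torch.cat([s, a, r]) for s, a, r in zip(states, actions, rewards)]
    return torch.cat(steps)

T, d_s, d_a = 10, 4, 2
tokens = flatten_trajectory(torch.randn(T, d_s), torch.randn(T, d_a), torch.randn(T, 1))

seq = tokens.view(1, -1, 1)                         # (batch, seq_len, 1) scalar tokens
seq_len = seq.size(1)
embed = nn.Linear(1, 64)                            # project scalar tokens to model width
# Causal mask: each position may attend only to itself and earlier positions.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
out = nn.TransformerEncoder(layer, num_layers=2)(embed(seq), mask=causal_mask)
# Trained with next-token prediction, this treats control exactly like language modelling.
```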

For tabular data

A group of Russian researchers demonstrated a simple adaptation of the Transformer architecture for tabular data, introducing the FT-Transformer (Feature Tokenizer + Transformer). The model transforms all features – categorical and numerical alike – into embeddings and applies a stack of Transformer layers to those embeddings, with every Transformer layer operating at the feature level of a single object.
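A minimal sketch of the feature-tokenizer idea, with assumed dimensions and class names: each numerical feature is scaled by a learned vector plus a bias, each categorical feature is looked up in its own embedding table, and the resulting per-feature tokens are passed through Transformer layers.

```python
# Feature tokenizer sketch for tabular data (PyTorch). Illustrative; the published
# FT-Transformer also prepends a [CLS] token and uses pre-norm Transformer blocks.
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    def __init__(self, n_num, cat_cardinalities, dim=64):
        super().__init__()
        # One learned (weight, bias) pair per numerical feature.
        self.num_weight = nn.Parameter(torch.randn(n_num, dim))
        self.num_bias = nn.Parameter(torch.zeros(n_num, dim))
        # One embedding table per categorical feature.
        self.cat_embeds = nn.ModuleList([nn.Embedding(c, dim) for c in cat_cardinalities])

    def forward(self, x_num, x_cat):
        # x_num: (B, n_num) floats; x_cat: (B, n_cat) integer category ids.
        num_tokens = x_num.unsqueeze(-1) * self.num_weight + self.num_bias    # (B, n_num, dim)
        cat_tokens = torch.stack(
            [emb(x_cat[:, i]) for i, emb in enumerate(self.cat_embeds)], dim=1
        )                                                                     # (B, n_cat, dim)
        return torch.cat([num_tokens, cat_tokens], dim=1)                     # one token per feature

tokenizer = FeatureTokenizer(n_num=3, cat_cardinalities=[10, 5])
tokens = tokenizer(torch.randn(8, 3), torch.randint(0, 5, (8, 2)))
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
out = nn.TransformerEncoder(layer, num_layers=2)(tokens)                      # (8, 5, 64)
```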

DABS

This year, researchers from Stanford University published a paper titled ‘DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning’. The paper points out that self-supervised learning algorithms such as BERT and SimCLR are highly successful in their respective domains but are also very domain-specific, meaning new algorithms have to be developed for each new setting. To address this, the researchers introduced the Domain-Agnostic Benchmark for Self-Supervised Learning, or DABS.

To perform well on the benchmark, an algorithm is evaluated across seven domains, including natural images, English text, speech recordings, multichannel sensor data, chest x-rays, and images with text descriptions. Each domain provides unlabelled datasets for pretraining, and the model is then scored on its downstream performance on a set of labelled tasks in that domain. The team also introduced e-Mix and ShED, two domain-agnostic baseline algorithms.

DABS (Source: arXiv) 

Is such consolidation desirable?

The Twitter thread saw massive engagement, with many people chipping in to offer their views on Karpathy’s observation. Several called the development the next paradigm in AI and machine learning. One commenter noted that, if realised fully, this consolidation will make multimodal fusion easy. “For problems with multiple input modalities (e.g. vision+language) picking a fusion strategy used to be a big deal. Now (given enough data) you can just let self-attention learn one,” he wrote.
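That point can be made concrete with a small sketch using assumed shapes, not drawn from any specific paper: rather than hand-designing a fusion module, image-patch tokens and text tokens are simply concatenated into one sequence, and self-attention is left to learn how the modalities should interact.

```python
# Early fusion of two modalities by concatenating token sequences (PyTorch sketch).
import torch
import torch.nn as nn

dim = 128
image_tokens = torch.randn(2, 49, dim)    # e.g. a 7x7 grid of patch embeddings
text_tokens = torch.randn(2, 20, dim)     # e.g. 20 word-piece embeddings

# Optional learned "modality type" embeddings so the model can tell the streams apart.
img_type = nn.Parameter(torch.zeros(1, 1, dim))
txt_type = nn.Parameter(torch.zeros(1, 1, dim))

fused = torch.cat([image_tokens + img_type, text_tokens + txt_type], dim=1)   # (2, 69, dim)
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
out = nn.TransformerEncoder(layer, num_layers=4)(fused)   # self-attention mixes both modalities
```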

Others pointed out that parallels can be drawn with the neocortex, which has a highly uniform architecture across different input modalities. “Perhaps nature has stumbled by a very similar powerful architecture and replicated it in a similar fashion, varying only some of the details,” Karpathy wrote.

