“In 2021, language models will start to become aware of the visual world.”
Ilya Sutskever, co-founder, OpenAI
For many years, the AI community has talked about Artificial General Intelligence, or AGI: algorithms that can learn on the go and simulate human cognition. However, the elusive human brain remains too complex to clone, so AI researchers started winning small by taming specific skills. Rules were written, models were tweaked. Today, we have computer vision models that can detect faces in a crowd and in the dark, and language models that can write prose. But these are two separate skills, vision and language, and there is now a major push for neural networks that can do both. These objectives were reiterated in Ilya Sutskever’s recent interview in Andrew Ng’s The Batch.
“If you can expose models to data similar to those absorbed by humans, they should learn concepts in a way that’s more similar to humans. This is an aspiration — it has yet to be proven — but I’m hopeful that we’ll see something like it in 2021,” posited Ilya Sutskever, chief scientist of OpenAI. The ability to process text and images together, he believes, should make models smarter.
Transformers Claim New Territory
Over the years, neural networks have become better at processing language. These networks generate fixed- or variable-length vector-space representations and aggregate information from surrounding words to determine meaning in a given context. These new abilities of language models were made possible by the Transformer architecture. Recurrent neural networks have been around for over a decade, but their sequential nature makes it difficult to fully harness parallel processing units like TPUs. Convolutional neural networks (CNNs), though less sequential, take a relatively large number of steps to combine information from distant positions.
Transformers, on the other hand, allow for significant parallelisation and were originally introduced to optimise translation quality. That said, even the convolution operation, the de facto standard for computer vision, has its fair share of problems. Convolutions operate on a fixed-size window and cannot capture long-range dependencies, such as arbitrary relations between pixels across the spatial and time domains of a video. Furthermore, convolution filter weights remain fixed after training, so the operation cannot adapt dynamically to input variations. Text-to-image synthesis has been an active area of research since the release of the “Generative Adversarial Text to Image Synthesis” paper in 2016. While the Transformer architecture has become the go-to solution for many natural language processing tasks, its applications in computer vision remain limited. However, there is growing interest in making these models work for computer vision applications.
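To make the parallelism argument concrete, here is a minimal sketch of the scaled dot-product self-attention that sits at the heart of the Transformer: every position attends to every other position in a single batched matrix multiplication, rather than step by step as in an RNN. The dimensions, weights and function name are illustrative assumptions, not taken from any of the papers discussed.

```python
# A minimal sketch of scaled dot-product self-attention, the core Transformer
# operation. Every token attends to every other token in one batched matrix
# multiplication, which is why the architecture parallelises so well on GPUs
# and TPUs, unlike the step-by-step processing of a recurrent network.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); w_q/w_k/w_v are learned projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                        # queries, keys, values
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)    # pairwise similarity
    weights = F.softmax(scores, dim=-1)                        # attention distribution
    return weights @ v                                         # mix the whole sequence at once

# Toy usage with 4 tokens and 8-dimensional embeddings (illustrative values only).
d = 8
x = torch.randn(1, 4, d)
w_q, w_k, w_v = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([1, 4, 8])
```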
Let’s take a look at a few works released recently:
DETR
DE⫶TR: End-to-End Object Detection with Transformers https://t.co/4W27PVx6Js (+paper https://t.co/xmxnSOBjBa) awesome to see a solid swing at (non-autoregressive) end-to-end detection. Anchor boxes + nms is a mess. (I was hoping detection would go end-to-end back in ~2013)
— Andrej Karpathy (@karpathy) May 27, 2020
In this work, to show how Transformers make end-to-end object detection simpler, the researchers pitted DETR against the state-of-the-art Faster R-CNN, a traditional two-stage detection system. DETR eliminates the hand-designed refinement and deduplication steps, such as anchor boxes and non-maximum suppression, that state-of-the-art CNN detectors rely on, by using a transformer to predict the final set of detections directly. This simplifies the pipeline, since the transformer takes over the operations that were traditionally specific to object detection. Read more about DETR here.
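For readers who want to try it, the sketch below shows roughly how a pretrained DETR model can be run through PyTorch Hub. The hub entry point, output keys and the image path are assumptions based on the facebookresearch/detr repository at the time of writing, not an official recipe from this article.

```python
# A rough sketch of running a pretrained DETR model from PyTorch Hub.
# Entry point and output keys follow the facebookresearch/detr repository;
# treat them as assumptions, not a specification.
import torch
from PIL import Image
import torchvision.transforms as T

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),  # ImageNet statistics
])

img = transform(Image.open('street.jpg').convert('RGB')).unsqueeze(0)  # hypothetical image
with torch.no_grad():
    outputs = model(img)

# DETR predicts a fixed set of (class, box) pairs in parallel; no anchor
# boxes and no non-maximum suppression are needed to deduplicate them.
probs = outputs['pred_logits'].softmax(-1)[0, :, :-1]   # drop the "no object" class
keep = probs.max(-1).values > 0.9                        # keep confident detections only
print(outputs['pred_boxes'][0, keep])                    # normalised (cx, cy, w, h) boxes
```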
DeiT
Facebook AI has also open-sourced Data-efficient image Transformers (DeiT), a new method for training computer vision models with Transformers. DeiT is a high-performance image classification model that requires less data and computing resources to train than previous AI models.
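As a rough illustration, a pretrained DeiT classifier can be pulled in through the timm library; the model name and dummy input below are assumptions made for the sake of the sketch.

```python
# A minimal sketch of loading a pretrained DeiT image classifier via timm.
# The model name is an assumption based on timm's model catalogue.
import timm
import torch

model = timm.create_model('deit_base_patch16_224', pretrained=True)
model.eval()

x = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed 224x224 image
with torch.no_grad():
    logits = model(x)                  # (1, 1000) ImageNet class scores
print(logits.argmax(-1))               # index of the predicted class
```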
Vision Transformer (ViT)
Recent conversation with a friend: @ilyasut: what's your take on https://t.co/fqVhQNaBWQ? @OriolVinyalsML: my take is: farewell convolutions : )
— Oriol Vinyals (@OriolVinyalsML) October 3, 2020
Last year, a paper under double-blind review for ICLR 2021 caught the imagination of the ML community. The paper, titled ‘An Image is Worth 16x16 Words’, was discussed by the likes of Tesla’s AI head Andrej Karpathy, among many others. In computer vision, attention has so far either been applied alongside CNNs or used to replace certain components of convolutional networks while keeping their overall structure in place, and convolutional architectures have remained dominant. The Vision Transformer (ViT), however, applies a pure transformer directly to sequences of image patches and is poised to make convolutions obsolete.
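The title is quite literal: ViT cuts an image into fixed-size patches and feeds them to a standard Transformer encoder as if they were word tokens. The toy sketch below illustrates that patch-embedding step; the 16x16 patch size and 768-dimensional tokens loosely follow the paper's base configuration, while everything else is made up for illustration.

```python
# A small sketch of the idea behind "An Image is Worth 16x16 Words": cut an
# image into fixed-size patches, flatten each patch, and project it into a
# token embedding that a standard Transformer encoder can consume.
import torch
import torch.nn as nn

patch, d_model = 16, 768
img = torch.randn(1, 3, 224, 224)                                      # (batch, channels, H, W)

# (1, 3, 224, 224) -> (1, 196, 768): 196 patch "words", each a flattened 3x16x16 block
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)          # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)   # (1, 196, 768)

to_token = nn.Linear(3 * patch * patch, d_model)                       # learned patch embedding
tokens = to_token(patches)                                             # transformer-ready input
print(tokens.shape)                                                    # torch.Size([1, 196, 768])
```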
Now, with the proliferation of large general-purpose language models like GPT-3, there is great potential to apply these architectures to a wide range of applications.
OpenAI Ups The Game
As promised by Sutskever, OpenAI released two neural networks last week: DALL·E and CLIP. DALL·E is a neural network capable of creating images from simple English prompts. CLIP, or Contrastive Language–Image Pre-training, is a neural network that, much like GPT-3, uses its zero-shot abilities to handle out-of-the-box categories with minimal task-specific training.
According to the OpenAI team, DALL·E is highly imaginative. It can exploit the compositional structure of language to create meaningful images. When prompted with only part of an image, DALL·E can fill in the remaining portion. DALL·E can also give images a 3D feel, working like a rendering engine driven purely by text. The neural network can even control the location and angle from which a scene is rendered, and can generate known objects in compliance with precise specifications of angle and lighting conditions. This feature alone has great implications for the multi-billion dollar movie industry. Speaking of billion-dollar industries, data labelling companies (e.g. Scale AI) are a recent entry to the list. Deep learning models rely on annotated data, and finding labelled data for every trivial category is difficult. For instance, the popular ImageNet dataset required over 25,000 workers to annotate 14 million images for 22,000 object categories. OpenAI claims it can shave off a large part of this tedious work with its latest neural network, CLIP.
Unlike models trained on ImageNet, CLIP can be adapted to perform various visual classification tasks without additional training examples. To apply CLIP to a new task, all one has to do is “tell” CLIP’s text encoder the names of the task’s visual concepts, and it will output a linear classifier over CLIP’s visual representations. And as we make these models smarter, wrote Sutskever, they will become safer as well.
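Below is a hedged sketch of that zero-shot workflow, following the usage pattern of OpenAI's open-sourced CLIP package; the labels and image path are made up for illustration and are not from the article or the CLIP paper.

```python
# A hedged sketch of zero-shot classification with OpenAI's CLIP, following
# the usage pattern of the openai/CLIP package. Labels and image path are
# hypothetical examples.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "Tell" the text encoder the names of the new task's visual concepts.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bicycle"]
text = clip.tokenize(labels).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)    # image-text similarity scores
    probs = logits_per_image.softmax(dim=-1)    # zero-shot class probabilities

print(dict(zip(labels, probs[0].tolist())))
```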
Though NLP models have grown exponentially in size, their reliability remains largely underwhelming. So, researchers at OpenAI are tapping into reinforcement learning to make them better. “At OpenAI, we’ve developed a new method called reinforcement learning from human feedback,” wrote Sutskever.
Using RL with human feedback on language models like GPT-3 makes it possible to eliminate the undesirable behaviours that large models pick up from passively absorbed information.
The OpenAI co-founder believes that by exposing language models to both text and images, and training them under the supervision of human judges, more useful and trustworthy models can be built. “In 2021, language models will start to become aware of the visual world. The next generation of models will [hopefully] understand text better because of the many images they’ve seen,” assures Sutskever.