It is truly the age of large models, with each new model bigger and more capable than the last. Take GPT-3: when it was introduced in 2020, it was the largest language model, trained on 175 billion parameters. Fast forward a year, and we already have Google's GLaM, a 1.2-trillion-parameter model. Transformer models like GPT-3 and GLaM are transforming natural language processing, and there are active conversations about such models making roles like writer and even programmer obsolete. While these claims can be dismissed as speculation for now, one cannot deny that large language models have truly transformed the field of NLP.
Could this innovation be extended to other fields – like computer vision? Can we have a GPT-3 moment for computer vision?
GPT for computer vision
OpenAI recently released GLIDE, a text-to-image generator in which the researchers applied guided diffusion to the problem of text-conditional image synthesis. For GLIDE, the researchers trained a 3.5-billion-parameter diffusion model that uses a text encoder. Next, they compared CLIP (Contrastive Language-Image Pre-training) guidance against classifier-free guidance. They found that samples generated with classifier-free guidance were more photorealistic and reflected a broader range of world knowledge. The most striking feature of the model is that it achieves performance comparable to that of DALL·E with less than one-third of its parameters.
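The classifier-free guidance idea the GLIDE researchers compared can be summarised in a few lines. The sketch below is a simplified illustration, not OpenAI's actual implementation: at each denoising step the diffusion model produces two noise predictions, one conditioned on the text and one unconditioned, and the guided estimate extrapolates away from the unconditional one. The function name and the toy values are illustrative assumptions.

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Combine conditional and unconditional noise predictions.

    The model is run twice per denoising step: once with the text
    conditioning and once with an empty prompt. Scales above 1.0
    push the sample towards the text condition.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy noise predictions (illustrative values only, not real model output)
eps_cond = np.array([0.5, -0.2])
eps_uncond = np.array([0.1, 0.0])
guided = classifier_free_guidance(eps_cond, eps_uncond, guidance_scale=3.0)
# A guidance scale of 1.0 would recover the plain conditional prediction
```

The appeal of this trick is that it needs no separate classifier (unlike CLIP guidance, which requires a second trained model), only a second forward pass through the same diffusion model.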
For the uninitiated, OpenAI had released two models modelled after GPT-3 – DALL·E and CLIP – in early 2021. Trained on 12 billion parameters, DALL·E can render an image from scratch and edit aspects of an existing image using text prompts. CLIP, on the other hand, is a neural network trained on 400 million pairs of images and text. OpenAI's blog noted that CLIP, like the GPT family, can perform tasks such as optical character recognition and action recognition. Notably, OpenAI's team anticipated CLIP's applications to be image classification and generation. One year later, however, CLIP has found a variety of applications, including content moderation, image search, image similarity, image ranking, object tracking, and robotics control.
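CLIP's zero-shot classification recipe is simple once the image and text encoders exist: embed the image, embed a caption for each candidate label, and pick the caption whose embedding is most similar to the image's. The sketch below uses toy vectors standing in for real CLIP encoder outputs; the function names and embeddings are illustrative assumptions, not CLIP's API.

```python
import numpy as np

def cosine_similarity(a, b):
    # CLIP compares embeddings by cosine similarity
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_classify(image_embedding, label_embeddings):
    """Return the index of the label text most similar to the image."""
    scores = [cosine_similarity(image_embedding, t) for t in label_embeddings]
    return int(np.argmax(scores))

# Toy embeddings standing in for real encoder outputs
image_emb = np.array([0.9, 0.1, 0.0])
labels = [np.array([1.0, 0.0, 0.0]),   # e.g. "a photo of a cat"
          np.array([0.0, 1.0, 0.0])]   # e.g. "a photo of a dog"
best = zero_shot_classify(image_emb, labels)  # → 0 (the first prompt)
```

Because the label set is just a list of text prompts, the same trained model can be repointed at new classification tasks without retraining, which is what makes the downstream applications listed above possible.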
Even before GLIDE, DALL·E, and CLIP, OpenAI researchers released Image GPT. As per the team, the motivation behind the research was that, just as a large transformer model can be trained on language, similar models can be trained on pixel sequences to generate coherent image completions and samples.
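The core idea behind Image GPT is that an image, flattened into a 1-D pixel sequence, can be modelled the same way a language model models text: predict the next element from the prefix. The sketch below shows only that data setup, not Image GPT's actual training code; the helper names are illustrative assumptions.

```python
import numpy as np

def pixels_to_sequence(image):
    """Flatten a 2-D image in raster order into a 1-D sequence,
    analogous to a sequence of word tokens."""
    return image.reshape(-1)

def next_pixel_targets(seq):
    """Autoregressive setup: at position i the model sees seq[:i+1]
    and must predict seq[i+1], exactly as a language model predicts
    the next token."""
    return seq[:-1], seq[1:]

image = np.array([[3, 1],
                  [4, 1]])
seq = pixels_to_sequence(image)          # [3, 1, 4, 1]
inputs, targets = next_pixel_targets(seq)
# inputs  → [3, 1, 4]
# targets → [1, 4, 1]
```

Once the data is framed this way, the transformer itself needs no image-specific changes, which is exactly the point the Image GPT team was making.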
While GLIDE and DALL·E are much smaller than GPT-3, they may be considered precursors to similarly large models for computer vision.
“GPT-3 has compiled and built ready to use or build upon powerful natural language models using probably the largest language datasets, allowing it to be used in different tasks with just additional parameters without any additional training. Computer vision and image processing are primed for a similar jump in highly generalised & applicable AI models – image datasets, the use cases & compute capabilities to train on them are readily available. OpenAI itself is working on image GPT working on image generation & completion,” said Arvind Saraf, Head of Engineering, Drishti Technologies.
He further added, “While such highly generalised models have great potential to expand the reach and use of computer vision technology, like any off the shelf neural network architecture, it is likely to have limitations. The long term ethical implications of the same and potential for issues like identity theft and impersonation are yet to be explored. However, the potential applications are immense – such as scene reconstruction, event detection and motion estimation in video analytics, and 3D scene modelling.”
Having GPT-like models for computer vision comes with its own challenges. Speaking about these, Kyle Fernandes, co-founder of Memechat, said, “Companies like EleutherAI and OpenAI are working on image GPTs. These models use concepts of both natural language and computer vision. Models like GPT require a lot of data; even smaller models like Ada is 25 GB. Having 25 GB for a graphic processor is huge – imagine this for a 175 billion parameter model. That is why, while these models are interesting, you need huge resources for them. Bigger companies may be able to afford that kind of resources, but smaller companies may find it difficult to lay their hands on this tech. A possible solution could be building smaller models and tuning them for specific focus.”
Other Transformer models for computer vision
One of the most popular Transformer models for computer vision came from Google, aptly named Vision Transformer (ViT). It adapts the Transformer architecture from natural language processing by splitting an image into fixed-size patches and representing the input as a sequence of patch embeddings, from which it predicts a class label for the image, allowing the model to learn image structure on its own. Google claimed that ViT can outperform state-of-the-art CNNs with four times fewer computational resources when trained on sufficient data.
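The patch-to-sequence step that makes ViT possible can be shown in a few lines of NumPy. This is a minimal sketch of the idea, not Google's implementation; the function name and toy image are assumptions for illustration.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an H x W x C image into non-overlapping flattened patches,
    yielding a (num_patches, patch_size * patch_size * C) array that a
    Transformer can consume like a sequence of token embeddings."""
    h, w, c = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches

# Toy 4x4 RGB image; 2x2 patches give a sequence of 4 "tokens"
image = np.arange(4 * 4 * 3).reshape(4, 4, 3).astype(float)
patches = image_to_patches(image, patch_size=2)
# patches.shape → (4, 12): 4 patches, each 2*2*3 = 12 values long
```

In the full model, each flattened patch is then linearly projected and given a positional embedding, after which the standard Transformer encoder takes over unchanged.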
As an improvement over Google's ViT, Facebook released the Data-efficient Image Transformer (DeiT). DeiT can be trained on just 1.2 million images, against the hundreds of millions of images that vision transformers previously required. It relies on a transformer-specific knowledge distillation procedure that reduces training data requirements.
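At the heart of DeiT's data efficiency is distillation from a CNN teacher. The sketch below shows a simplified version of a hard-distillation objective, assuming a student trained half on the ground-truth label and half on the teacher's predicted label; the function names, weighting, and logits are illustrative assumptions, not DeiT's exact loss.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hard_distillation_loss(student_logits, true_label, teacher_logits):
    """Simplified DeiT-style hard distillation: average the cross-entropy
    against the true label and against the CNN teacher's predicted label."""
    teacher_label = int(np.argmax(teacher_logits))
    ce = lambda logits, y: -np.log(softmax(logits)[y])
    return 0.5 * ce(student_logits, true_label) + 0.5 * ce(student_logits, teacher_label)

# Toy logits (illustrative values); here the teacher agrees with the label,
# so both cross-entropy terms supervise class 0
loss = hard_distillation_loss(
    student_logits=np.array([2.0, 0.5, 0.1]),
    true_label=0,
    teacher_logits=np.array([1.5, 0.2, 0.0]),
)
```

The teacher effectively injects the inductive biases of a CNN into the transformer through its labels, which is what lets the student get away with far less training data.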
Facebook researchers have also built the DEtection TRansformer (DETR), which combines a set-based global loss that forces unique predictions via bipartite matching with a transformer encoder-decoder architecture. DETR uses a CNN to compute a representation of the input image, adds positional encodings, and then passes the result to a transformer encoder.
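The bipartite matching step is what replaces the hand-tuned anchor assignment of older detectors: each ground-truth object is matched to exactly one prediction by minimising a matching cost. The sketch below illustrates the idea with a toy cost (L1 distance between box centres) rather than DETR's full cost, which also includes classification and IoU terms; the function name and coordinates are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_boxes, gt_boxes):
    """One-to-one matching between predictions and ground-truth boxes.

    Builds a cost matrix (here just L1 distance between box centres)
    and solves the assignment with the Hungarian algorithm. Predictions
    left unmatched are supervised as "no object" in DETR's loss.
    """
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx, gt_idx))

# Toy normalised box centres (illustrative values)
preds = np.array([[0.9, 0.9], [0.1, 0.1], [0.5, 0.5]])
gts = np.array([[0.12, 0.1], [0.88, 0.92]])
pairs = match_predictions(preds, gts)
# Prediction 1 matches ground truth 0, prediction 0 matches ground truth 1;
# prediction 2 is left unmatched ("no object")
```

Because the matching is one-to-one, DETR needs no non-maximum suppression at inference time, which is one of its main simplifications over CNN-based detectors.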