In November last year, AIM predicted that GPT-4 would arrive soon. Among the many open questions, including its rumoured enormous size of 100 trillion parameters, one prediction was that it would be multimodal. With recent releases and rumours, that prediction increasingly looks like it might come true.
Multimodal systems are an emerging paradigm in AI in which a single system can accept input and produce output across multiple data types, such as text, speech, images, and video.
Big-tech Leading the Way
Microsoft recently introduced Kosmos-1, a multimodal large language model (MLLM), in its research paper, ‘Language is Not All You Need: Aligning Perception with Language Models’. The paper highlights the importance of integrating language, action, and multimodal perception as a step towards AGI. This hints that the company may already be fine-tuning multimodality with OpenAI’s GPT-4.
Notably, Kosmos-1 was trained with only 1.6 billion parameters, yet the model performs remarkably well. Could GPT-4 be smaller than we think, or simply a collection of models that together make up the size?
Along similar lines, Google released PaLM-E, a single model that can control different robots in the real world while remaining equally competent at VQA and captioning tasks. Much like Kosmos-1, this model integrates multimodal information into a pre-trained large language model (LLM). This came after Google researchers published ‘Multimodal Chain-of-Thought Reasoning in Language Models’, which generates intermediate reasoning chains using language and vision with under one billion parameters.
Though not exactly called an MLLM, Meta’s LLaMA has far fewer parameters than GPT-3 (65 billion) but outperforms it on several tasks. This echoes the idea behind multimodal systems, which essentially consist of multiple small models. Who knows, Meta might soon release something similar to Google and Microsoft; the company has been talking about the importance of multimodal systems since March last year.
Omnivore, FLAVA, CM3, and Data2vec are all Meta creations. Each model takes a multimodal approach to solving different tasks across speech, vision, text, and even 3D. The company said, “Our results from Omnivore, FLAVA, and CM3 suggest that, over the horizon, we may be able to train a single AI model that solves challenging tasks across all the modalities”.
DeepMind was ahead in the game as well. Gato, a multimodal, multitask, multi-embodiment generalist policy, was released in May 2022. The system can control a real robot arm and, based on the input, decide the output action across text, vision, and robotics tasks.
Language Models Are Getting Smaller, But Better
LLMs have revolutionised the field of AI but remain largely restricted to language tasks, most visibly chatbots. When OpenAI released ChatGPT, the whole internet was talking about it, which eventually led people to recognise its limitations as well. Now, the companies harnessing LLMs’ potential are moving beyond language and delving into robotics, and possibly AGI.
Let’s not forget Tesla. Elon Musk’s Optimus robot is probably just around the corner, and may arrive around the same time as GPT-4. Who knows what it might be capable of? The company has been working behind closed doors, but the potential is undeniable.
Smaller pre-trained language models are proving to be sufficient when combined with other modalities. The PaLM-E researchers explain that with an LLM at the centre, they can give robots more autonomy with less fine-tuning than previous models required.
Microsoft also recently released Visual ChatGPT, hinting at similar possibilities. The model bridges the gap between NLP and image generation.
Similarly, the Kosmos-1 researchers suggest the same thing: LLMs with a knack for multimodal perception are the way to go if we want them to be smarter than the average AI. By broadening their senses beyond just reading text, they can start to pick up some common sense, like knowing not to walk into a wall while bringing you your coffee. Bonus points!
One more thing to add: speaking with AIM, Dr Thomas Hartung, the professor behind recent developments in organoid intelligence, said that using organoids in the push towards AGI does not sound like such a bad idea, and is actually fascinating. “I would not like to speculate too much but I can tell you that there are, at the moment, a lot of exciting people doing some crazy stuff and I think a lot will come out in the future.” Maybe we will soon be able to incorporate emotional inputs from different models into AI. Who knows?