OpenAI just revealed that ChatGPT can now see, speak, and hear, making it a true multimodal system. It also plans to add DALL·E 3 to ChatGPT and ChatGPT Enterprise. Meanwhile, Google is doing something similar with Gemini, its own multimodal system, coming this fall.
While we expect to have two multimodal products by October, it would be interesting to see contributions from open-source players in the multimodal market. At present, Stability AI and Meta look like the strongest contenders.
Stability AI has the Means
Stability AI possesses all the resources needed to craft an open-source multimodal model. It has Stable Diffusion for text-to-image, Stable LM for text-to-text, and its latest addition, Stable Audio, for text-to-music generation. By merging these three models, Stability AI could create a one-of-a-kind multimodal model, much like OpenAI's. Though Stable Audio is not open source, Stability AI has revealed plans to introduce an open-source model based on the Stable Audio architecture, trained on different data.
Furthermore, earlier this year, Stability AI and its multimodal AI research lab DeepFloyd announced the research release of DeepFloyd IF, a powerful text-to-image cascaded pixel diffusion model. It wouldn't be a surprise to see a multimodal model from Stability AI in the near future.
Meta has Plans
Recently, at a social event, OpenAI engineer Jason Wei overheard a conversation suggesting that Meta has amassed sufficient computing power to train both Llama 3 and Llama 4. While Llama 3 aims to match GPT-4's performance, it is expected to remain free of cost. Llama 3 is also anticipated to bring open-source multimodal capabilities.
ImageBind is part of Meta’s efforts to create multimodal AI systems that learn from all possible types of data around them. ImageBind is the first AI model capable of binding information from six modalities. The model learns a single embedding, or shared representation space, not just for text, image/video, and audio, but also for sensors that record depth (3D), thermal (infrared radiation), and inertial measurement units (IMU), which calculate motion and position.
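The core idea behind a shared representation space can be illustrated in a few lines. The sketch below is not ImageBind's actual code; it is a minimal, hypothetical example of two modality-specific encoders projecting into one common space so that cross-modal similarity reduces to a dot product. All dimensions are illustrative assumptions, not ImageBind's real sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SHARED_DIM = 64  # illustrative size of the joint embedding space

class ToyEncoder(nn.Module):
    """Stand-in for a modality-specific backbone plus projection head."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, SHARED_DIM)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalise so dot products become cosine similarities
        return F.normalize(self.proj(x), dim=-1)

image_encoder = ToyEncoder(in_dim=512)   # stand-in for a vision backbone
audio_encoder = ToyEncoder(in_dim=128)   # stand-in for an audio backbone

image_feats = torch.randn(4, 512)  # 4 fake image feature vectors
audio_feats = torch.randn(4, 128)  # 4 fake audio feature vectors

img_emb = image_encoder(image_feats)   # (4, 64), unit-norm
aud_emb = audio_encoder(audio_feats)   # (4, 64), unit-norm

# Cross-modal similarity matrix: entry [i, j] scores image i against audio j
similarity = img_emb @ aud_emb.T
print(similarity.shape)  # torch.Size([4, 4])
```

In ImageBind, encoders like these are aligned with contrastive training against image data, which is what "binds" six modalities into one space; the sketch only shows the shared-space geometry that makes that possible.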
Furthermore, Meta released CM3leon, a multimodal model that performs both text-to-image and image-to-text generation. Additionally, Meta's SeamlessM4T can handle speech-to-text, speech-to-speech, text-to-text translation, and speech recognition for up to 100 languages, depending on the task.
Multimodal is the Future
Open source LLMs can be customised to meet the specific needs of an organisation, which can reduce the cost of developing and maintaining AI applications. The lack of a true multimodal model in the open source market has led developers to try building one themselves. Some attempts worked and some did not, but that is the nature of the open source community, which advances through trial and error.
Earlier this year, a group of scientists at the University of Wisconsin-Madison, Microsoft Research, and Columbia University created a multimodal model called LLaVA. It is a multimodal LLM that handles both text and image inputs, using Vicuna as the large language model (LLM) and CLIP ViT-L/14 as the visual encoder.
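The way LLaVA wires those two pieces together can be sketched briefly. This is not the project's actual code: it is a minimal, hypothetical illustration of the design, in which patch features from a frozen CLIP encoder are mapped by a learned projection into the language model's embedding space and prepended to the text embeddings. The dimensions (1024 for CLIP ViT-L/14 features, 4096 for a Vicuna-7B-class hidden size, 576 patch tokens) are illustrative assumptions.

```python
import torch
import torch.nn as nn

VISION_DIM = 1024   # width of CLIP ViT-L/14 patch features (assumed)
LLM_DIM = 4096      # hidden size of a Vicuna-7B-class model (assumed)

# The learned bridge: a projection from vision-feature space to the
# LLM's token-embedding space
projector = nn.Linear(VISION_DIM, LLM_DIM)

# Fake inputs standing in for real encoder/tokeniser outputs:
visual_feats = torch.randn(1, 576, VISION_DIM)  # 576 image patch tokens
text_embeds = torch.randn(1, 16, LLM_DIM)       # 16 text token embeddings

visual_tokens = projector(visual_feats)                     # (1, 576, 4096)
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # (1, 592, 4096)
print(llm_input.shape)
```

Once concatenated, the language model attends over image and text tokens alike, which is what lets a text-only LLM answer questions about an image.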
Similarly, another group of researchers at King Abdullah University of Science and Technology created MiniGPT-4, an open-source model that performs complex vision-language tasks like GPT-4. To build MiniGPT-4, the researchers used Vicuna, which is built on LLaMA, as the language decoder and BLIP-2's vision components as the visual encoder. Moreover, to simplify the process of creating multimodal models, open-source communities have also introduced models like BLIP-2 and mPLUG-Owl.
While the open source community is experimenting to create a viable multimodal system, it’s essential for Meta and Stability AI to step up their game and develop a multimodal solution soon. Otherwise, Google and OpenAI might pull ahead, further widening the gap between open source and closed source players.