
Can Stability AI and Meta Meet OpenAI’s Multimodal Challenge?

Llama 3 is anticipated to introduce open-source multimodal capabilities


OpenAI just revealed that ChatGPT can now see, speak, and hear, making it a true multimodal system. Furthermore, it plans to add DALL·E 3 to ChatGPT and ChatGPT Enterprise. Meanwhile, Google is doing something similar with Gemini, its own multimodal system, coming this fall.

With two closed-source multimodal products expected by October, it will be interesting to see what open-source players bring to the multimodal market. At present, Stability AI and Meta look like the strongest contenders capable of achieving this.

Stability AI has the Means 

Stability AI possesses all the resources needed to craft an open-source multimodal model. It has Stable Diffusion for text-to-image, StableLM for text-to-text, and its latest addition, Stable Audio, for text-to-music generation. By merging these three models, Stability AI could potentially create a one-of-a-kind multimodal model much like OpenAI's. Though Stable Audio is not open source, Stability AI has revealed plans to introduce an open-source model based on the Stable Audio architecture, trained on different data.

Furthermore, earlier this year, Stability AI and its multimodal AI research lab DeepFloyd announced the research release of DeepFloyd IF, a powerful text-to-image cascaded pixel diffusion model. It would not be a surprise to see a multimodal model from Stability AI in the near future.

Meta has Plans

In a surprising turn of events, OpenAI engineer Jason Wei recently overheard a conversation at a social event suggesting that Meta has amassed sufficient computing power to train both Llama 3 and Llama 4. While Llama 3 aims to achieve performance on par with GPT-4, it will remain free of cost. Moreover, Llama 3 is anticipated to introduce open-source multimodal capabilities as well.

ImageBind is part of Meta’s efforts to create multimodal AI systems that learn from all possible types of data around them. ImageBind is the first AI model capable of binding information from six modalities. The model learns a single embedding, or shared representation space, not just for text, image/video, and audio, but also for sensors that record depth (3D), thermal (infrared radiation), and inertial measurement units (IMU), which calculate motion and position.
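To see why a single shared embedding space matters, here is a minimal Python sketch of cross-modal retrieval. The encoders below are random stand-ins rather than ImageBind's actual API, and the file names are hypothetical; the point is only that once every modality maps into one space, an audio query can be scored directly against image embeddings.

import numpy as np

rng = np.random.default_rng(0)

def encode(item, dim=1024):
    # Stand-in for a modality-specific encoder (image, audio, depth, IMU, ...).
    # A real model maps every modality into the same shared space; here we
    # just return a random unit vector for illustration.
    vec = rng.normal(size=dim)
    return vec / np.linalg.norm(vec)

def cross_modal_search(query_vec, candidate_vecs):
    # Dot product equals cosine similarity on unit vectors, so a query from
    # one modality (say, audio) can be scored directly against embeddings
    # from another (say, images).
    scores = np.stack(candidate_vecs) @ query_vec
    return int(np.argmax(scores))

audio_query = encode("dog_bark.wav")                    # hypothetical audio clip
image_gallery = [encode(f"photo_{i}.jpg") for i in range(5)]
print(cross_modal_search(audio_query, image_gallery))   # index of best match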

Furthermore, Meta released a multimodal model, CM3leon, that handles both text-to-image and image-to-text generation. Additionally, Meta's SeamlessM4T can perform speech-to-text, speech-to-speech, and text-to-text translation, as well as speech recognition, for up to 100 languages depending on the task.

Multimodal is the Future 

Open-source LLMs can be customised to meet the specific needs of an organisation, which helps reduce the cost of developing and maintaining AI applications. The lack of a true multimodal model in the open-source market has led developers to try their own hand at building one. Some attempts worked and some did not, but that is the nature of the open-source community, which progresses by trial and error.

Earlier this year, a group of scientists at the University of Wisconsin-Madison, Microsoft Research, and Columbia University created a multimodal model called LLaVA, a multimodal LLM that can handle both text and image inputs. It uses Vicuna as the large language model (LLM) and CLIP ViT-L/14 as the visual encoder.
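The LLaVA recipe is simple enough to sketch: freeze a vision encoder, learn a small projection that maps its patch features into the LLM's embedding space, and feed the projected "visual tokens" to the LLM alongside the text tokens. The PyTorch snippet below illustrates the idea with stand-in tensors; the dimensions (1024 for CLIP ViT-L/14 features, 4096 for a 7B-class Vicuna) are illustrative assumptions, and this is not LLaVA's actual code.

import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    # Maps frozen vision-encoder patch features into the LLM's embedding space.
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)

# Stand-in tensors in place of real CLIP features and embedded prompt tokens.
patch_features = torch.randn(1, 256, 1024)   # e.g. CLIP ViT-L/14 patch outputs
text_embeddings = torch.randn(1, 32, 4096)   # embedded text prompt

projector = VisionToLLMProjector()
visual_tokens = projector(patch_features)

# The LLM then attends over [visual tokens; text tokens] as a single sequence.
llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
print(llm_input.shape)  # torch.Size([1, 288, 4096])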

Similarly, another group of researchers at King Abdullah University of Science and Technology created MiniGPT-4, an open-source model that performs complex vision-language tasks like GPT-4. To build MiniGPT-4, the researchers used Vicuna, which is built on LLaMA, as the language decoder and the pretrained vision encoder from the BLIP-2 vision-language model. Moreover, to simplify the process of building multimodal models, open-source communities have also introduced models like BLIP-2 and mPLUG-Owl.

While the open source community is experimenting to create a viable multimodal system, it’s essential for Meta and Stability AI to step up their game and develop a multimodal solution soon. Otherwise, Google and OpenAI might pull ahead, further widening the gap between open source and closed source players.
