Can Stability AI and Meta Meet OpenAI’s Multimodal Challenge?

Llama 3 is anticipated to introduce open-source multimodal capabilities
OpenAI just revealed that ChatGPT can now see, speak, and hear, making it a true multimodal system. Furthermore, it plans to add DALL·E 3 to ChatGPT and ChatGPT Enterprise. Meanwhile, Google is doing something similar with Gemini, its own multimodal system, coming this fall.

While we expect two closed-source multimodal products by October, it will be interesting to see open-source players enter the multimodal market. At present, Stability AI and Meta look like the strongest contenders capable of doing so.

Stability AI has the Means 

Stability AI possesses all the resources needed to craft an open-source multimodal model. It has Stable Diffusion for text-to-image, Stable LM for text-to-text, and its latest addition, Stable Audio, for text-to-music generation. By merging these three models, Stability AI could potentially create a one-of-a-kind multimodal model much like OpenAI's. Though Stable Audio is not open source, Stability AI has revealed plans to introduce an open-source model based on the Stable Audio architecture, trained on different data.

Furthermore, earlier this year, Stability AI and its multimodal AI research lab DeepFloyd announced the research release of DeepFloyd IF, a powerful text-to-image cascaded pixel diffusion model. It would be no surprise to see a multimodal model from Stability AI in the near future.

Meta has Plans

In a surprising turn of events, OpenAI engineer Jason Wei recently overheard a conversation at a social event suggesting that Meta has amassed sufficient computing power to train both Llama 3 and Llama 4. While Llama 3 aims to achieve performance on par with GPT-4, it will remain free of cost. Moreover, Llama 3 is anticipated to introduce open-source multimodal capabilities as well.

ImageBind is part of Meta’s efforts to create multimodal AI systems that learn from all possible types of data around them. ImageBind is the first AI model capable of binding information from six modalities. The model learns a single embedding, or shared representation space, not just for text, image/video, and audio, but also for sensors that record depth (3D), thermal (infrared radiation), and inertial measurement units (IMU), which calculate motion and position.
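The core idea behind a shared representation space like ImageBind's can be sketched in a few lines: a separate encoder maps each modality into one common vector space, where cross-modal similarity reduces to a dot product between unit vectors. The toy encoders and dimensions below are stand-ins for illustration, not ImageBind's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Project onto the unit sphere so a dot product equals cosine similarity
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical per-modality encoders: each maps raw features of a
# different dimensionality into the same 4-d shared space.
W_image = rng.normal(size=(8, 4))   # image features -> shared space
W_audio = rng.normal(size=(6, 4))   # audio features -> shared space

image_feat = rng.normal(size=(8,))  # stand-in image features
audio_feat = rng.normal(size=(6,))  # stand-in audio features

img_emb = l2_normalize(image_feat @ W_image)
aud_emb = l2_normalize(audio_feat @ W_audio)

# Cross-modal similarity: one scalar comparable across any pair of modalities
similarity = float(img_emb @ aud_emb)
print(similarity)
```

In training, such encoders are pulled together on paired data (e.g. an image and its audio) via a contrastive loss, which is what lets one embedding space "bind" six modalities.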

Furthermore, Meta released a multimodal model, CM3leon, that does both text-to-image and image-to-text generation. Additionally, Meta's SeamlessM4T can perform speech-to-text, speech-to-speech, and text-to-text translation as well as speech recognition for up to 100 languages, depending on the task.

Multimodal is the Future 

Open-source LLMs can be customised to meet the specific needs of an organisation, which can reduce the cost of developing and maintaining AI applications. The lack of a true multimodal model in the open-source market has led developers to try building one themselves. Some attempts worked and some did not, but that is the nature of the open-source community: it advances by trial and error.

Earlier this year, a group of scientists at the University of Wisconsin-Madison, Microsoft Research, and Columbia University created LLaVA, a multimodal LLM that handles both text and image inputs. It uses Vicuna as the large language model (LLM) and CLIP ViT-L/14 as a visual encoder.
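The recipe described above can be sketched schematically: a frozen vision encoder produces image features, a trainable projection maps them into the LLM's token-embedding space, and the LLM attends over visual and text tokens as one sequence. The dimensions below are illustrative stand-ins, not LLaVA's real sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions only; CLIP ViT-L/14 and Vicuna use far larger ones.
VISION_DIM = 16   # stand-in for the vision encoder's patch-feature size
LLM_DIM = 8       # stand-in for the LLM's token-embedding size

# Frozen vision encoder output: one feature vector per image patch
patch_features = rng.normal(size=(4, VISION_DIM))

# Trainable projection mapping visual features into the LLM's
# token-embedding space (LLaVA trains a layer playing this role)
W_proj = rng.normal(size=(VISION_DIM, LLM_DIM))
visual_tokens = patch_features @ W_proj        # shape (4, LLM_DIM)

# Text prompt, already embedded by the LLM's own embedding table
text_tokens = rng.normal(size=(5, LLM_DIM))

# The LLM then processes visual and text tokens as a single sequence
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)  # (9, 8)
```

The appeal of this design is that only the small projection needs training from scratch; the vision encoder and LLM arrive pretrained, which keeps the cost of building an open multimodal model low.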

Similarly, another group of researchers at King Abdullah University of Science and Technology created MiniGPT-4, an open-source model that performs complex vision-language tasks like GPT-4. To build MiniGPT-4, the researchers used Vicuna, which is built on LLaMA, as the language decoder and BLIP-2's vision components as the visual encoder. Moreover, to simplify the process of creating multimodal models, open-source communities have also introduced models like BLIP-2 and mPLUG-Owl.

While the open-source community experiments its way towards a viable multimodal system, it is essential for Meta and Stability AI to step up and ship a multimodal solution soon. Otherwise, Google and OpenAI may pull ahead, further widening the gap between open-source and closed-source players.

Siddharth Jindal
Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.
