Google has been talking about Gemini for a while now, and people are growing increasingly impatient with the all talk and no show. Meanwhile, OpenAI sensed the lull and grabbed the opportunity by announcing plans to integrate DALL-E 3 with ChatGPT Plus and ChatGPT Enterprise.
This is surely a game-changing move by OpenAI, as it props up GPT-4 as the first functional multimodal model on the market, one that generates both text and images, similar to what Gemini promises.
To make up for the absence of Gemini, Google recently added extensions to Bard along with the ability to upload images with Lens and get Search images in responses. It was Google’s attempt to make Bard multimodal. However, only time will tell if it will be able to withstand the incoming competition from DALL-E-integrated ChatGPT Plus, scheduled to be launched in October.
That said, OpenAI has the potential not only to impact Google Bard and Gemini but also to put pressure on other text-to-image generation models like Midjourney and Stable Diffusion, as DALL-E 3 has shown promise by creating high-quality images.
Integrating DALL-E 3 with ChatGPT Plus gives OpenAI an edge over other image generation tools, as ChatGPT has the largest user base of any model out there in any segment.
At the moment, ChatGPT is one of the world’s most popular websites, attracting a staggering 1.4 billion visits globally in August. During the same month, Bard received 183.5 million visits. Midjourney, on the other hand, has over 15 million active users and saw 21 million visits in August. Stable Diffusion has more than 10 million daily active users across all channels, according to Stability AI chief Emad Mostaque.
From the users’ perspective, DALL-E 3 on ChatGPT gives them the freedom to generate both text and images on a single platform. And naturally, if users get easy results from one popular platform, they will prefer it over the others any day.
If we look at the numbers, ChatGPT boasts a huge user base that won’t shy away from paying $20 or a little more for newer versions of ChatGPT Plus. Midjourney, by contrast, sells monthly plans ranging widely from $10 to $120. It can be said that OpenAI is paving the way for a unified multimodal model capable of handling a wide range of tasks. Additionally, there have been user complaints about the interface of Midjourney, which is currently hosted on Discord.
Multimodal Market is Scattered
If we examine the currently available multimodal models, we find that they are quite scattered, for there isn’t a single model that can perform all tasks. Alongside closed-source models, there are also various open-source models claiming to be multimodal. It is, however, still not clear which model truly deserves the multimodal label.
For example, Hugging Face recently introduced a multimodal model named IDEFICS. It has the ability to process both text and image inputs and generate descriptions for the images. Similarly, Bard possesses the capability to accept image inputs. Also, Meta recently launched SeamlessM4T, a foundational speech/text translation and transcription model with an all-in-one system that performs multiple tasks such as speech-to-speech, speech-to-text, text-to-text translation, and speech recognition. OpenAI and Google have also developed their own speech-to-text models, namely Whisper and AudioPaLM-2, respectively.
If OpenAI also adds text-to-speech and speech-to-text features to ChatGPT Plus, it could race ahead of other models, making it challenging for others to catch up. Meanwhile, OpenAI doesn’t seem to have any plans to stop here. According to recent reports, it is also planning to integrate GPT-Vision into GPT-4, indicating that it is here to stay.