We had previously raised the question: What happened to multimodal GPT-4? Six months later, it appears that Google’s Gemini has compelled OpenAI to strongly consider expediting the release of GPT-4 with multimodal capabilities. According to reports, Google may release Gemini any day now, and OpenAI needs to buckle up.
OpenAI is currently in the process of integrating GPT-4 with multimodal capabilities, much like what Google is planning with Gemini. This integrated model is expected to be named GPT-Vision, as per a recent report. The timing appears to be quite opportune, as both Gemini and GPT-Vision are expected to enter the scene and potentially compete against each other this fall.
Although OpenAI CEO Sam Altman had earlier made it clear that one shouldn’t expect GPT-5 or GPT-4.5 in the near future, The Information reports that OpenAI might follow up GPT-Vision with an even more powerful multimodal model, codenamed Gobi. Unlike GPT-4, Gobi is being designed to be multimodal from the start.
It remains to be seen whether OpenAI is making the right decision by clashing with Gemini. Many anticipate that OpenAI may introduce a multimodal GPT-4 at its first-ever developer conference, OpenAI DevDay, which will be held on November 6 in San Francisco.
Is GPT-Vision better than Gemini?
OpenAI’s decision to withhold the multimodal capabilities does not stem from an inability to develop them. The ChatGPT creator has, in fact, collaborated with a startup called Be My Eyes, which is developing an app that describes images to blind users, helping them interpret their surroundings so that they can interact with the world more independently.
During this collaboration, OpenAI recognised that adding multimodal capabilities to GPT-4 at this stage might be premature as the integration of images could potentially raise privacy concerns. Moreover, there’s a risk of misinterpreting facial features such as gender or emotional state, which could result in harmful or inappropriate responses.
Meanwhile, OpenAI has its bases covered. A few months ago, reports emerged that OpenAI is working on DALL-E 3. Early samples leaked by YouTuber MattVidPro suggest that the model performs much better than other image generators, including Midjourney, which is widely regarded as the best for producing realistic images.
Interestingly, in a recent interview, Google chief Sundar Pichai, when asked what edge Gemini has over ChatGPT, replied, “Today you have separate text models and image-generation models and so on. With Gemini, these will converge.” This suggests that, at the very least, Gemini will be able to generate both text and images from user prompts.
If OpenAI combines the capabilities of DALL-E 3 and ChatGPT Plus, it is well positioned to go up against Gemini. To gain an edge over GPT-4, Gemini is being trained on YouTube videos, making it the first multimodal model trained on video rather than just text (or, in GPT-4’s case, text plus images). Moreover, Demis Hassabis recently claimed that engineers at DeepMind are using techniques from AlphaGo for Gemini.
On the other hand, Google’s Bard hasn’t been able to make a strong impression and falls short of ChatGPT when it comes to generating text. Thus, placing hope on Gemini to turn Google’s fortunes around is a huge bet.
OpenAI can afford to risk it
OpenAI’s process of shipping products is different from Google’s. Google, being an old and reputed player in the market with 4.3 billion customers worldwide, thinks twice before launching any product, making sure its products are fully ready without any loose ends.
On the other hand, OpenAI has shipped products in the past even when they were not fully finished, in the hope that consumer feedback would help it make the necessary changes. Consider GPT-4: when OpenAI initially introduced it, the company said it would be multimodal. However, this didn’t turn out to be the case. Moreover, OpenAI acknowledged the limitations of GPT-4, stating that it still isn’t entirely dependable, often generating inaccurate information and making reasoning errors.
Pichai expressed similar views during a recent interview, noting that ChatGPT’s launch before LaMDA signalled to Google that LLM technology is well-suited for the market. He stated, “Credit to OpenAI for the launch of ChatGPT, which showed a product-market fit and that people are ready to understand and play with the technology.”
It would be safe to say that with both Google and OpenAI striving to take the lead in the multimodal war, this fall will surely be interesting.