
What Happened to Multimodal GPT-4?

As of now, the multimodal features of GPT-4 are not accessible in the API



On March 14, OpenAI released GPT-4 with much fanfare, proudly exhibiting its multimodal features. Months have passed, but there seems to be little buzz or interest around them anymore. GPT-4 was billed as capable of accepting both image and text inputs while generating text, an improvement over its predecessor, GPT-3.5, which accepted only text. Interestingly, even ChatGPT Plus is not multimodal.

Meanwhile, OpenAI recently filed for the GPT-5 trademark with the United States Patent and Trademark Office (USPTO). Trademark attorney Josh Gerben took to Twitter on July 31 to point out that the filing hints the company may be working on a fresh iteration of its language model. But shouldn’t OpenAI deliver on its GPT-4 promises before proceeding with GPT-5? Users were expecting to interact with the chatbot easily using images, but this multimodal functionality hasn’t been fully realised, and the internet has been abuzz with questions about its status.

During the GPT-4 demo livestream, several impressive capabilities of the model were showcased. It interpreted a funny image and accurately described what made it humorous. Additionally, Greg Brockman, president and co-founder of OpenAI, demonstrated how he could create a website simply by inputting a photo of an idea sketched in his notebook, with GPT-4 generating the necessary code.

He specifically mentioned that these features would take time to roll out, but the wait has now grown long. At present, only the GPT-4-based Bing Search lets you search using images, and its responses still need refinement. So what exactly is holding OpenAI back from exploring multimodal features and shipping a product of its own?

https://twitter.com/MikePFrank/status/1685795943172767744

Multimodal features aren’t available in the API

While introducing GPT-4, OpenAI said it was releasing GPT-4’s text input capability through ChatGPT and the API, and working on making the image input capability more widely available by collaborating closely with Be My Eyes. As of now, this collaboration is in closed beta, being tested for feedback with a small subset of users, and no official update has been released yet.

As of now, the multimodal features of GPT-4 are not accessible in the API. OpenAI’s blog post mentions that users can currently make text-only requests to the GPT-4 model, with the capability to input images still in a limited alpha stage. However, OpenAI assures users that they will automatically be updated to the recommended stable model as new versions are released over time, suggesting that more advanced capabilities may become available as the model continues to evolve.
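In practice, that means a GPT-4 API call today can carry only text. A minimal sketch using the openai Python SDK of the time (the API key is a placeholder); note that the chat request has no supported field for attaching an image:

```python
# Minimal sketch: a text-only GPT-4 request with the openai Python SDK
# (v0.27-era interface). Image input is not supported here; there is
# no parameter for attaching an image to a chat completion request.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Describe what GPT-4 can do."},
    ],
)

print(response.choices[0].message["content"])
```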

OpenAI recently introduced Code Interpreter in ChatGPT Plus. Many termed it a GPT-4.5 moment, but interestingly it handled images with old-school OCR from Python libraries rather than any multimodal capability.
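For a sense of what “old-school OCR from Python libraries” looks like, here is a minimal sketch with pytesseract, a common Python OCR wrapper; whether Code Interpreter relies on this particular library is an assumption for illustration:

```python
# Minimal sketch of library-based OCR, the kind of approach Code
# Interpreter could lean on instead of a multimodal model.
# pytesseract as the specific library is an assumption for illustration.
from PIL import Image
import pytesseract

# Extract whatever text the OCR engine can find in the image.
text = pytesseract.image_to_string(Image.open("screenshot.png"))
print(text)
```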

GPU Scarcity 

Due to a shortage of GPUs, OpenAI is facing challenges in letting users process more data through its large language models like ChatGPT. The shortage has also pushed back its plans to introduce new features and services on their original schedule.

A month back, Altman acknowledged this concern and explained that most of the issue was a result of GPU shortages, according to a blog post by Raza Habib, CEO and co-founder of Humanloop, which was later taken down at OpenAI’s request. The blog specifically mentioned that multimodality, which was demoed as part of the GPT-4 release, can’t be extended to everyone until more GPUs come online.

GPT-4 was probably trained on around 10,000 to 25,000 Nvidia A100s. For GPT-5, Musk suggested it might require 30,000 to 50,000 H100s, while in February 2023 Morgan Stanley predicted it would use 25,000 GPUs. With so many GPUs required and Nvidia the only reliable supplier in the market, everything boils down to GPU availability.

Focus on DALL-E 3?

Going by these developments, OpenAI appears to be focusing on text-to-image generation at present. Recently, YouTuber MattVidPro shared details of OpenAI’s next project, which is likely to be DALL-E 3. The model’s official name and OpenAI’s plans for public access remain uncertain. Currently, the unreleased model is in a testing phase, available to a select group of around 400 people worldwide on an invite-only basis, as per MattVidPro’s information.

Ultimately, only time will tell whether OpenAI will improve GPT-4 or move on to GPT-5.


Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism, putting forward ideas worth pondering in the era of artificial intelligence.