The introduction of image functionality in GPT-4 has piqued the interest of ChatGPT users over the last couple of weeks, with many already experimenting with the features of GPT-4V(ision). Be it reading and recognising images, answering specific queries, coding, or designing a website, the multimodality that GPT-4V brings is proving to be a game-changer.
The versatility that this feature brings is set to further revolutionise the way various industries work.
Multimodal Functionality
Microsoft researchers recently released a paper on preliminary explorations with GPT-4V. The paper analyses the model to better understand large multimodal models (LMMs), assessing and testing GPT-4V's capabilities across a wide range of structured tasks. It highlights GPT-4V's distinct ability to understand visual cues sketched or placed on input images, which opens up innovative human-computer interaction techniques such as visual referring prompting.
While only some basic functions were tested with GPT-4V, the paper also lists a larger range of functions with possible use cases across industries, with medical and insurance among the most prominent.
Crucial Medical Field
While there have already been discussions on GPT-4's capabilities in the medical field, the latest update has only cemented its future there. The model's ability to decipher and critically analyse images can help infer details from a scan or X-ray, with radiology standing to benefit the most.
In the below example, GPT-4V has been fed a tooth X-ray and prompted with various questions. Interestingly, the chatbot has been careful to put a disclaimer at the start and not give conclusive results.
Source: arxiv.org
Auto Insurance
The capabilities of GPT-4V were also tested to see how well it fits into auto insurance. With a focus on car accident reporting, two aspects were examined: vehicle damage evaluation and insurance reporting (recognising vehicle information such as the licence plate, model, etc.). While the model was able to give a detailed explanation of the type and severity of the damage, it could not conclusively estimate the cost of repairs, which is where its limitations begin to show.
Coding Game Stepped Up
While coding has been facilitated by ChatGPT's Code Interpreter, GPT-4V advances the chatbot's coding capabilities further. From a basic drawing or scribbles on a whiteboard, the model can generate the code for a website or app with ease. The low-code/no-code option has been simplified even further for end users, raising questions about the fate of dedicated coding platforms.
The CEO of HyperWriteAI, Matt Shumer, tweeted about a GPT-4V-powered frontend engineer agent. By simply uploading a design image, the model is able to write the code, correct the rendered output, and even refine the code to improve design quality.
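To make the image-to-code workflow concrete, the sketch below shows how an image-plus-prompt request can be assembled for OpenAI's chat completions API. The payload shape (a `content` list mixing `text` and `image_url` parts, with the image inlined as a base64 data URL) follows the publicly documented vision endpoint at the time of writing; the model name, prompt, and placeholder image bytes are illustrative assumptions, not taken from the paper.

```python
import base64


def build_vision_request(prompt, image_bytes, model="gpt-4-vision-preview"):
    """Construct the JSON body for a GPT-4V chat completion.

    The image is sent inline as a base64 data URL, one of the two
    forms (remote URL or base64) the vision endpoint accepts.
    """
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
        "max_tokens": 300,
    }


# Example: ask the model to turn a whiteboard sketch into a web page.
# In practice you would read real PNG bytes from disk and POST this body
# to the chat completions endpoint with your API key.
request = build_vision_request(
    "Write the HTML and CSS for the page sketched in this image.",
    b"\x89PNG placeholder bytes",
)
```

The same request shape applies to the medical and insurance examples above; only the prompt and the attached image change.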
All is Not Perfect
As impressive as the results have been, the model is still not 100% accurate. GPT-4V can make errors when reading minute details or counting objects that look too similar, making it necessary to cross-check its output before relying on it completely.
Source: arxiv.org
Surpassing Other LMMs
A few months ago, when ChatGPT was compared with its closest rival Bard, the latter surpassed OpenAI's chatbot on many counts. Multimodal features such as voice, image input, and web browsing were the key advantages Bard had to its credit. However, ChatGPT has now caught up on all of them.
While GPT-4V may be versatile, accuracy remains a concern, which is also a problem with Bard. A few users found incorrect responses from both GPT-4V and Bard when given a prompt that requires strategic thinking in a Pac-Man game.
Redemption of ChatGPT
The multimodal feature of ChatGPT arrives at a time when the chatbot has reportedly been losing users since July. As per recent reports, website and mobile visits to ChatGPT decreased by 3.2% to 1.43 billion in August, following roughly 10% drops in each of the preceding months. With a series of product feature launches over the last couple of weeks, including voice integration, OpenAI may be looking at a redemption.
Given the buzz created by people experimenting with GPT-4V, the chatbot might well see a spike in usage over the next few months. Furthermore, with OpenAI DevDay just around the corner and the company expected to make a few major announcements, the latest 'vision' function may be a teaser for what lies ahead for ChatGPT.
On a lighter note, in addition to enabling breakthroughs across a serious range of functions, GPT-4V can probably help you save face too, in case you don't understand a meme or joke: a probable win for all.
Source: arxiv.org