ViperGPT vs GPT-4

ViperGPT uses Python code to interpret and solve image queries.

Former Google Research Scientist Carl Vondrick, currently an Assistant Professor at Columbia University, along with two computer vision PhD researchers from the same university, Dídac Surís and Sachit Menon, has proposed ViperGPT, a framework for programmatic composition of specialised vision, language, math, and logic functions to handle complex visual queries. 

The framework connects individual advances in vision and language models, enabling capabilities beyond what any single model can achieve on its own. Simply put, you can input a query about any visual format, including images and video, and obtain the desired result. Depending on the type of query, the output can be either an image or text. GPT-4, by contrast, produces output in text format alone. 

How does it work?

The framework combines step-by-step reasoning from Codex with external knowledge queried from GPT-3’s text model, a combination that yields impressive performance.

ViperGPT currently uses the Codex API, built on the GPT-3 model, for code generation, and the GPT-3 text model (text-davinci-003) for its LLM query function. Both are accessed through the official OpenAI Python API. 

The LLM then uses the API to write a Python programme that solves the input query. This generated code is then run on the input image or video, invoking vision and language models to produce the desired output. 
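To make this concrete, here is a minimal sketch of the program-generation step. The prompt layout, the API specification text, and the function names are illustrative assumptions for this sketch, not ViperGPT's actual prompt; the `complete` parameter stands in for a call to a code model such as Codex.

```python
# Illustrative sketch: an API spec for visual primitives is concatenated with
# the user's query, and a code model completes the prompt with a programme.
# The spec below is a stand-in, not the real ViperGPT API.

API_SPEC = '''\
class ImagePatch:
    def find(self, object_name: str) -> list:
        """Return image patches matching object_name."""
    def compute_depth(self) -> float:
        """Return the estimated depth of this patch."""
'''


def build_prompt(query: str) -> str:
    """Combine the visual API spec with the user's query so a code model
    can complete it with a Python programme."""
    return f"{API_SPEC}\n# Query: {query}\ndef execute_command(image):\n"


def generate_program(query: str, complete) -> str:
    """`complete` stands in for a code LLM call (e.g. Codex via the OpenAI
    API): it maps a prompt string to a code completion."""
    return "def execute_command(image):\n" + complete(build_prompt(query))


# Usage with a stubbed completion function in place of a real model call:
stub = lambda prompt: '    return str(len(image.find("muffin")))\n'
program = generate_program("How many muffins are there?", stub)
print(program)
```

The generated string is itself a runnable Python function, which is what lets the framework execute it against the input image in the next step.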

Providing Codex with an API exposing visual capabilities, such as “find” and “compute_depth”, is enough to create these programmes. Having been pre-trained on code, the model is able to reason about how to use these functions and implement the relevant logic. With this approach, the model has delivered remarkable ‘zero-shot performance’—without training on task-specific images.
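A toy, self-contained sketch of what such a generated programme might look like when executed. The `ImagePatch` class and its `find` method here are simplified stand-ins for the kind of visual API the paper describes, not the actual ViperGPT library; a real implementation would back `find` with an object detector.

```python
# Toy stand-in for the visual API exposed to the code model. In ViperGPT,
# methods like find() would call real vision models; here detections are
# supplied directly so the example is self-contained.

class ImagePatch:
    def __init__(self, objects):
        # objects: mapping from object name to a list of bounding boxes
        self.objects = objects

    def find(self, object_name):
        """Return the patches (bounding boxes) matching an object name."""
        return self.objects.get(object_name, [])


# A programme of the kind the LLM might generate for the query
# "How many muffins are in the image?"
def execute_command(image_patch):
    muffin_patches = image_patch.find("muffin")
    return str(len(muffin_patches))


# Usage with a mock "image" containing three detected muffins:
image = ImagePatch({"muffin": [(0, 0, 10, 10), (12, 0, 22, 10), (24, 0, 34, 10)]})
print(execute_command(image))  # prints 3
```

The point is that the logic (counting, comparing, branching) lives in ordinary Python, while perception is delegated to the API calls—which is what lets the framework generalise without task-specific training.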

The paper also mentions that as the underlying models improve, ViperGPT will produce improved results. To support research on the proposed framework, the team said a Python library will be developed to enable rapid development of programme synthesis for visual tasks, and that it will eventually be open sourced. 

Queries solved on ViperGPT (Source: arXiv)

Evaluation model

ViperGPT was evaluated on four tasks to understand the model’s diverse capabilities in varied contexts without additional training. The tasks include: 

  • Visual grounding, which implies associating language with visual input.
  • Compositional image question answering, which means that the model works on answering questions using complex compositional reasoning that combines multiple visual and textual inputs.
  • External knowledge-dependent image question answering, which is a framework to answer questions about images that require external knowledge beyond what information is shown in the image, such as general knowledge or factual information. 
  • Video causal and temporal reasoning, which indicates a model’s ability to reason about events and causal relationships in a video based on both visual and temporal cues.

The researchers considered these tasks to roughly build on one another, with visual grounding being a prerequisite for compositional image question answering, and so forth. The result: improved reasoning about spatial relations, the best accuracy among compared methods, and performance exceeding all zero-shot baselines. 

ViperGPT vs GPT-4

However, one question remains: how does it differ from generative models such as GPT-4? The latest multimodal model, GPT-4, takes inputs in text and image format; users can specify any vision- or language-related task through text prompts and images. However, the image input capability is still a research preview, and the output is text alone. 

ViperGPT, depending on the query, can produce output in a range of formats, such as text, a multiple-choice selection, or image regions. 

It should be noted that the parameters used for training GPT-4 and ViperGPT are not publicly available. It also remains to be seen whether ViperGPT can be used in tandem with other problem-solving models, such as GPT-4 itself, to provide an integrated framework combining recognition and generative models. 

Vandana Nair