ViperGPT vs GPT-4

ViperGPT uses Python code to interpret and solve image queries.

Published on March 21, 2023

by Vandana Nair

Carl Vondrick, a former Google research scientist who is now an assistant professor at Columbia University, and two computer vision PhD researchers from the same university, Dídac Surís and Sachit Menon, have proposed ViperGPT, a framework for the programmatic composition of specialised vision, language, math, and logic functions to answer complex visual queries.

The framework connects individual advances in vision and language models, giving the combined system capabilities beyond what any single model can achieve on its own. Simply put, you can pose a query about visual input, such as an image or a video, and obtain the desired result. Depending on the type of query, the output can be either an image or text. GPT-4, by contrast, outputs only text.

How does it work?

The framework combines step-by-step reasoning from Codex with external knowledge queried from GPT-3’s text model, which results in impressive performance on complex visual queries.

ViperGPT currently uses the Codex API, which is built on the GPT-3 model. For the LLM query function, it uses the pre-trained GPT-3 model text-davinci-003. The official OpenAI Python API is also used.

Given this API, the LLM writes a Python programme to solve the input query. The programme is then executed on the input image or video, calling vision and language models to generate the desired output.
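The write-then-run loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `API_SPEC`, `build_prompt`, `fake_code_model`, `run_query`, and the entry-point name `execute_command` are all hypothetical, the code model is replaced by a stub that returns a canned programme, and the "image" is a toy list of labels.

```python
from typing import Any, Callable, Dict, List

# Hypothetical API description that would be shown to the code-writing model.
API_SPEC = '''
def find(image, object_name):
    """Return the parts of the image that contain object_name."""
'''


def build_prompt(api_spec: str, query: str) -> str:
    """Combine the API description and the user's query into one prompt."""
    return f"{api_spec}\n# Query: {query}\ndef execute_command(image):\n"


def fake_code_model(prompt: str) -> str:
    """Stand-in for the code-generating LLM: returns a canned programme."""
    return (
        "def execute_command(image):\n"
        "    patches = find(image, 'muffin')\n"
        "    return len(patches)\n"
    )


def run_query(image: Any, query: str,
              code_model: Callable[[str], str],
              api: Dict[str, Callable]) -> Any:
    """Ask the model for a programme, then execute it on the input."""
    source = code_model(build_prompt(API_SPEC, query))
    namespace: Dict[str, Any] = dict(api)  # generated code sees only the API
    exec(source, namespace)                # defines execute_command
    return namespace["execute_command"](image)


def find(image: List[str], object_name: str) -> List[str]:
    """Toy 'vision model': the image is just a list of object labels."""
    return [label for label in image if label == object_name]


toy_image = ["muffin", "kid", "muffin", "kid"]
answer = run_query(toy_image, "How many muffins are there?",
                   fake_code_model, {"find": find})
print(answer)
```

In a real system the stub would be replaced by a call to a code model, and running generated code with `exec` would need sandboxing; both are outside the scope of this sketch.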

Providing Codex with an API that exposes visual capabilities, such as “find” and “compute_depth”, is enough to create these programmes. Because of its prior training on code, the model can reason about how to use these functions and implement the relevant logic. With this approach, the framework delivers remarkable ‘zero-shot performance’, that is, without any training on task-specific images.
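To make this concrete, here is a minimal sketch of the kind of programme a code model might write against such an API. Only the function names “find” and “compute_depth” come from the article; their signatures, the `Patch` class, the toy scene, and the query are assumptions made for illustration, and the real functions would wrap vision models rather than read stored values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Patch:
    """Toy stand-in for an image region (hypothetical, for illustration)."""
    label: str
    depth_m: float  # value a depth-estimation model would normally predict


def find(image: List[Patch], object_name: str) -> List[Patch]:
    """Assumed signature: return every region matching object_name."""
    return [patch for patch in image if patch.label == object_name]


def compute_depth(patch: Patch) -> float:
    """Assumed signature: return the estimated distance of a region."""
    return patch.depth_m


def execute_command(image: List[Patch]) -> Optional[float]:
    """Programme a code model might write for 'How far is the nearest car?'."""
    cars = find(image, "car")
    if not cars:
        return None  # no car in the scene, so there is nothing to measure
    return min(compute_depth(car) for car in cars)


scene = [Patch("car", 12.5), Patch("tree", 3.0), Patch("car", 7.25)]
nearest = execute_command(scene)
print(nearest)
```

The logic that ties the two calls together (filtering, the empty-scene check, taking the minimum) is ordinary Python, which is the part a model trained on code can supply without seeing any task-specific images.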

The paper also notes that ViperGPT’s results will improve as the underlying models improve. To support research on the approach, the team said it will develop a Python library that promotes rapid development of programme synthesis for visual tasks, and that the library will eventually be open-sourced.

**Queries solved on ViperGPT** (*Source:* *arXiv*)

Evaluation model

ViperGPT was evaluated on four tasks to understand the model’s diverse capabilities in varied contexts without additional training. The tasks include:

Visual grounding, which means associating language with visual input.
Compositional image question answering, in which the model answers questions using complex compositional reasoning that combines multiple visual and textual inputs.
External knowledge-dependent image question answering, in which the model answers questions about images that require knowledge beyond what is shown in the image, such as general knowledge or factual information.
Video causal and temporal reasoning, which tests a model’s ability to reason about events and causal relationships in a video based on both visual and temporal cues.
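The external knowledge task in the list above can be pictured as a programme that mixes a visual lookup with a query to a text model. The sketch below is hypothetical throughout: `simple_query` and `llm_query` are illustrative names with assumed signatures, both are replaced by toy stubs, and the canned fact stands in for what a text model such as text-davinci-003 would return.

```python
from typing import Dict, List

# Canned answers standing in for a text model's general knowledge.
CANNED_FACTS: Dict[str, str] = {
    "Which vitamin is an orange best known for?": "vitamin C",
}


def simple_query(image: List[str], question: str) -> str:
    """Toy visual question answering: the image is a list of labels,
    and the first label is returned whatever the question is."""
    return image[0]


def llm_query(question: str) -> str:
    """Toy stand-in for a text-model call; looks up a canned fact."""
    return CANNED_FACTS.get(question, "unknown")


def execute_command(image: List[str]) -> str:
    """Programme for: 'Which vitamin is this fruit best known for?'."""
    fruit = simple_query(image, "What fruit is this?")   # visual step
    return llm_query(f"Which vitamin is an {fruit} best known for?")  # knowledge step


result = execute_command(["orange", "table"])
print(result)
```

The visual step identifies what is in the picture, and the knowledge step answers the part of the question that no amount of looking at the image could resolve.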

The researchers considered these tasks to roughly build on one another, with visual grounding being a prerequisite for compositional image question answering, and so on. The result: better handling of spatial relations, the best accuracy, and stronger performance than all other zero-shot methods.

ViperGPT vs GPT-4

However, one question remains: how is it different from generative models such as GPT-4? GPT-4, the latest multimodal model, accepts both text and image inputs, and the user can specify any vision or language task in the prompt. However, the image input capability is still a research prototype, and the output is text alone.

In ViperGPT, depending on the query, the output can take different formats, such as text, a multiple-choice selection, or image regions.

Note that the parameters used for training the GPT-4 and ViperGPT models are not available. It also remains to be seen whether, in the long run, ViperGPT can be used in tandem with other problem-solving models, such as GPT-4 itself, to provide an integrated framework that uses both recognition and generative models.

Vandana Nair

With a rare blend of engineering, MBA, and journalism degrees, Vandana Nair brings a unique combination of technical know-how, business acumen, and storytelling skills to the table. Her insatiable curiosity for all things startups, businesses, and AI technologies ensures that there's always a fresh and insightful perspective in her reporting.