ByteDance Uses GPT-4V to Create a Multimodal LLM, Groma, for Enhanced Image Region Understanding

“Groma demonstrates superior performances in standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization”


Researchers from ByteDance and the University of Hong Kong recently developed Groma, a multimodal large language model (MLLM) that handles region-level image tasks through a localized visual tokenization approach, with GPT-4V used to generate part of its instruction-tuning data.

Groma not only handles holistic image understanding but is also adept at region-level tasks such as region captioning and visual grounding. Instead of depending on the LLM or external modules for localization, Groma leverages the spatial understanding capability of the visual tokenizer. This ‘perceive-then-understand’ design also resembles the human vision process.

In this localized visual tokenization mechanism, an image is segmented into regions of interest, which are then converted into region tokens. Groma thus encodes the image into both global image tokens and local region tokens. By integrating region tokens into user instructions and model responses, Groma can understand user-specified region inputs and ground its textual output in the image.
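
As a rough illustration of the idea, the sketch below encodes a tiny feature grid into global tokens and region tokens, then references the regions from a text instruction. It is not Groma's implementation: the mean pooling, the `<region_N>` marker format, and every name in the code (`tokenize`, `mean_pool`, `refer_to_regions`) are assumptions made for this example, and the region boxes are supplied by hand rather than proposed by a model.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]
FeatureMap = List[List[Vector]]   # grid of per-cell feature vectors, indexed [y][x]
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1) in grid cells, end-exclusive


@dataclass
class TokenizedImage:
    global_tokens: List[Vector]   # one token per grid cell (whole image)
    region_tokens: List[Vector]   # one token per region of interest


def mean_pool(feature_map: FeatureMap, box: Box) -> Vector:
    """Average the feature vectors of all cells inside the box.

    Mean pooling is an assumption chosen for simplicity in this sketch.
    """
    x0, y0, x1, y1 = box
    cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]


def tokenize(feature_map: FeatureMap, boxes: List[Box]) -> TokenizedImage:
    """Produce global tokens (flattened grid) and one token per region."""
    global_tokens = [cell for row in feature_map for cell in row]
    region_tokens = [mean_pool(feature_map, box) for box in boxes]
    return TokenizedImage(global_tokens, region_tokens)


def refer_to_regions(instruction: str, num_regions: int) -> str:
    """Replace {r0}, {r1}, ... with hypothetical region markers."""
    markers = {f"r{i}": f"<region_{i}>" for i in range(num_regions)}
    return instruction.format(**markers)


# A 2x2 grid with 2-dimensional features, and two hand-picked regions.
features = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[5.0, 2.0], [7.0, 2.0]],
]
regions = [(0, 0, 1, 1), (0, 0, 2, 2)]

tokens = tokenize(features, regions)
prompt = refer_to_regions("What is {r0} doing next to {r1}?", len(regions))
```

Here `tokens.global_tokens` holds four vectors, `tokens.region_tokens` holds two, and `prompt` mentions each region by its marker, so a downstream model could tie each marker to the matching region token.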


Furthermore, to improve Groma’s ability to engage in visually grounded conversations, the team curated a dataset of 30k visually grounded conversations for instruction finetuning. According to the team, this is the first grounded chat dataset constructed with both visual and textual prompts, with GPT-4V used for data generation.

Compared with other MLLMs that depend on the language model or an external module for localization, Groma consistently shows superior performance on standard referring and grounding benchmarks. This highlights the benefits of embedding localization into image tokenization.


Sukriti Gupta

Having done her undergrad in engineering and master's in journalism, Sukriti likes combining her technical know-how and storytelling to simplify seemingly complicated tech topics in a way everyone can understand.