This ML Model Can Help Robots Learn About The Relationships Between Objects

MIT researchers have developed a new machine-learning model that could help robots understand interactions between objects in the world the way humans do. Many deep learning models struggle to perceive scenes this way because they do not grasp the entangled relationships between individual objects. Without this understanding, a robot built to assist someone in a kitchen would have difficulty following a command such as “pick up the spatula to the left of the stove and place it on top of the cutting board.”

Composing Visual Relationships

To address this issue, researchers at MIT devised a model that comprehends the underlying relationships between objects in a scene. Their model represents individual relationships one at a time and then combines these representations to describe the entire scene. This research could be applicable in scenarios where industrial robots must perform complex, multistep manipulation tasks, such as stacking items in a warehouse or assembling appliances. It also moves the field a step closer to robots that can learn from and interact with their environments much as people do.

“When I look at a table, I cannot say that there is an object at location XYZ. Our minds do not operate in this manner. When we comprehend a scene in our minds, we do it based on the relationships between the objects. We believe that by developing a system that can comprehend the relationships between objects, we will be able to manipulate and change our environments more effectively,” says Yilun Du, a PhD student at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).

One Relationship at a Time

The researchers built a framework that generates an image of a scene from a text description of objects and their relationships, such as “A wood table to the left of a blue stool. A red couch to the right of the blue stool.” Their method breaks such a description into smaller components, one for each individual relationship, models each component independently, and then combines the components through an optimisation process to produce an image of the scene.
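To make the decomposition step concrete, here is a minimal, hypothetical sketch (not the authors' code) of how a relational description might be split into per-relation pieces. The RELATIONS list and the one-relation-per-sentence assumption are illustrative, not taken from the paper.

```python
# Illustrative sketch: split a scene description into
# (subject, relation, object) triples, one per sentence.
import re

RELATIONS = ["to the left of", "to the right of", "above", "below"]

def decompose(description: str) -> list:
    """Return one (subject, relation, object) triple per clause."""
    triples = []
    for clause in re.split(r"[.;]", description):
        clause = clause.strip().lower()
        if not clause:
            continue
        for rel in RELATIONS:
            if rel in clause:
                subj, obj = clause.split(rel, 1)
                triples.append((subj.strip(), rel, obj.strip()))
                break
    return triples

print(decompose(
    "A wood table to the left of a blue stool. "
    "A red couch to the right of the blue stool."
))
# [('a wood table', 'to the left of', 'a blue stool'),
#  ('a red couch', 'to the right of', 'the blue stool')]
```

Each triple can then be handed to its own relational model, which is what makes the later recombination step possible.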

Source: Learning to Compose Visual Relations

The researchers represented the individual object relationships in a scene description using a machine-learning technique called energy-based models. Each relational description is encoded with its own energy-based model, and the models are then combined in a way that infers all of the objects and relationships together. By breaking the descriptions into shorter pieces for each relationship, the system can recombine them in a variety of new ways, making it more adaptable to scene descriptions it has not seen before, explains Shuang Li, co-lead author of the paper.
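The composition idea can be sketched in a few lines of PyTorch: each relation gets its own energy function, the scene energy is their sum, and an image is refined by gradient-based (Langevin-style) sampling on that sum. The tiny convolutional energy network, image size, and step sizes below are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
# Hedged sketch of composing energy-based models: E(x) = sum_i E_i(x),
# then sample an image x by noisy gradient descent on E(x).
import torch
import torch.nn as nn

class RelationEnergy(nn.Module):
    """Scores how well an image satisfies one relation (lower is better)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Flatten(), nn.LazyLinear(1),
        )

    def forward(self, x):
        return self.net(x).sum()  # scalar energy

def compose_and_sample(energies, steps=60, step_size=10.0, noise=0.005):
    """Langevin-style sampling on the summed energy of all relations."""
    x = torch.rand(1, 3, 64, 64)  # start from random noise
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        total_energy = sum(e(x) for e in energies)  # E(x) = sum_i E_i(x)
        grad, = torch.autograd.grad(total_energy, x)
        x = (x - step_size * grad + noise * torch.randn_like(x)).clamp(0, 1)
    return x.detach()

# One energy model per relation in the description (untrained here).
scene = compose_and_sample([RelationEnergy(), RelationEnergy()])
print(scene.shape)  # torch.Size([1, 3, 64, 64])
```

Because the scene energy is just a sum, adding a relation to the description only means adding one more term, which is what lets the approach scale to descriptions with more relationships than it saw during training.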

“Other systems would consider all relationships holistically and generate the image from the description in a single shot. However, these approaches fail when we have out-of-distribution descriptions, such as those with more relations, because these models cannot truly adapt to generate images with more relationships from a single shot. However, by combining these smaller, independent models, we can model a greater number of relationships and respond to unique combinations,” Du explains.

The system also works in reverse: given an image, it can find text descriptions that match the relationships between the objects in the scene. Moreover, the model can edit an image by rearranging the scene's objects so that they correspond to a new description.
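Running the model “in reverse” can be framed as a scoring problem: compose the energies for each candidate description's relations and pick the description the image satisfies best, i.e. the one with the lowest energy. Continuing the sketch above (the stand-in models are untrained and hypothetical, and the per-relation averaging is an illustrative choice to keep descriptions with different relation counts comparable):

```python
def best_description(image, candidates):
    """candidates: list of (text, [energy models for its relations]).
    Returns the text whose average per-relation energy is lowest."""
    scores = {
        text: sum(e(image).item() for e in energies) / len(energies)
        for text, energies in candidates
    }
    return min(scores, key=scores.get)

image = torch.rand(1, 3, 64, 64)
candidates = [
    ("a table to the left of a stool", [RelationEnergy()]),
    ("a table to the left of a stool; a couch to the right of the stool",
     [RelationEnergy(), RelationEnergy()]),
]
print(best_description(image, candidates))
```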

Recognising Complex Scenes

The researchers compared their model against other deep learning approaches that were given text descriptions and tasked with generating images of the corresponding objects and relationships; their model outperformed the baselines in each case. They also asked people to judge whether the generated images matched the original scene descriptions. In the most difficult cases, where descriptions contained three relationships, 91 per cent of participants concluded that the new model performed better.

The researchers also presented the model with images of previously unseen scenes, along with several alternative text descriptions of each image, and it was able to correctly identify the description that best matched the object relationships in the image. Furthermore, when the researchers provided the system with two relational scene descriptions that depicted the same image in distinct ways, the model recognised that the descriptions were equivalent.

“One of the outstanding fundamental problems in computer vision is developing visual representations that can deal with the compositional nature of the world around us. This article makes substantial progress toward resolving this issue by providing an energy-based model that explicitly represents the numerous relationships between the items displayed in the image. The results are quite remarkable,” says Josef Sivic, a distinguished researcher from Czech Technical University’s Czech Institute of Informatics, Robotics, and Cybernetics who was not involved in this research.
For more information, refer to the paper, Learning to Compose Visual Relations.

Dr. Nivash Jeevanandam

Nivash holds a doctorate in information technology and has been a research associate at a university and a development engineer in the IT industry. Data science and machine learning excite him.