
Google Unveils Multimodal Chain of Thought Reasoning With PaLM-E

The model can make use of visual data to enhance its language processing capabilities.


Researchers at Google have proposed PaLM-E, a single model that can control different robots in simulation and in the real world while remaining quantitatively competent at general visual question answering (VQA) and captioning tasks. The embodied language model is built by injecting multimodal information, such as images, into the embedding space of a pre-trained large language model (LLM).
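The injection step can be pictured as a learned projection that maps a vision encoder's output into vectors of the same size as the LLM's word embeddings, so that images can be consumed like ordinary tokens. The PyTorch sketch below illustrates this idea only; the dimensions, module names, and the ImageToTokenProjector class are hypothetical and not taken from Google's implementation.

```python
# Minimal sketch (PyTorch): projecting vision-encoder features into an LLM's
# word-embedding space so they can be fed to the model like ordinary tokens.
# All sizes and names here are illustrative assumptions, not PaLM-E's.
import torch
import torch.nn as nn

IMG_FEAT_DIM = 1024   # hypothetical vision-encoder output dimension
LLM_EMBED_DIM = 4096  # hypothetical LLM word-embedding dimension

class ImageToTokenProjector(nn.Module):
    """Maps image features to vectors the size of the LLM's word embeddings."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(IMG_FEAT_DIM, LLM_EMBED_DIM)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (num_patches, IMG_FEAT_DIM), e.g. from a ViT
        # returns: (num_patches, LLM_EMBED_DIM), one "soft token" per patch
        return self.proj(image_features)

projector = ImageToTokenProjector()
fake_vit_features = torch.randn(16, IMG_FEAT_DIM)  # stand-in for real ViT output
image_tokens = projector(fake_vit_features)
print(image_tokens.shape)  # torch.Size([16, 4096])
```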

The PaLM-E model, where the "E" stands for Embodied, extends Google's Pathways Language Model (PaLM) and represents a significant leap forward, particularly in the realm of human-robot interaction. With the ability to seamlessly control various robots across multiple environments, PaLM-E demonstrates a level of flexibility and adaptability previously unseen in similar technologies.

The model makes use of visual data to enhance its language processing capabilities, resulting in an embodied language model that is both versatile and quantitatively competent. It is trained on a diverse mixture of tasks spanning multiple robot embodiments and general vision-language tasks. Importantly, the researchers demonstrated that this diversity in training enables multiple avenues of transfer from the vision-language domain into embodied decision making, allowing robot planning tasks to be solved in a data-efficient manner.

Their largest model, PaLM-E-562B, exhibits capabilities such as multimodal chain-of-thought reasoning over multiple images, despite being trained only on single-image prompts. Through zero-shot reasoning, PaLM-E-562B can tell visually conditioned jokes about a given image, and it demonstrates capabilities including perception, visually grounded dialogue, and planning.

PaLM-E can also perform math given an image with handwritten numbers. The model transfers knowledge from robot planning in environments with complex dynamics and physical constraints to answering questions about the observable world. PaLM-E operates on multimodal sentences, i.e., sequences of tokens in which inputs from arbitrary modalities (e.g., images, neural 3D representations, or states) are inserted alongside text tokens as input to an LLM that is trained end-to-end.

 

(Image credit: https://palm-e.github.io/)
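To make the notion of a "multimodal sentence" concrete, the sketch below interleaves projected image embeddings with text-token embeddings at the position of a placeholder token. The vocabulary size, token ids, and IMG_PLACEHOLDER_ID are invented for illustration and are not PaLM-E's actual interface.

```python
# Minimal sketch: assembling a "multimodal sentence" by interleaving projected
# image embeddings with text-token embeddings wherever a placeholder appears.
# The vocabulary size and IMG_PLACEHOLDER_ID are invented for illustration.
import torch
import torch.nn as nn

LLM_EMBED_DIM = 4096
VOCAB_SIZE = 32000
IMG_PLACEHOLDER_ID = 31999  # hypothetical special-token id marking an image slot

word_embeddings = nn.Embedding(VOCAB_SIZE, LLM_EMBED_DIM)

def build_multimodal_sequence(token_ids: torch.Tensor,
                              image_tokens: torch.Tensor) -> torch.Tensor:
    """Replace each image placeholder with the block of image soft tokens."""
    pieces = []
    for tid in token_ids.tolist():
        if tid == IMG_PLACEHOLDER_ID:
            pieces.append(image_tokens)                           # (patches, dim)
        else:
            pieces.append(word_embeddings(torch.tensor([tid])))   # (1, dim)
    return torch.cat(pieces, dim=0)  # (text_tokens + patches, dim)

# "What is happening in <img>?" with a fake 16-patch image embedding.
token_ids = torch.tensor([101, 202, 303, 404, IMG_PLACEHOLDER_ID, 505])
image_tokens = torch.randn(16, LLM_EMBED_DIM)
inputs_embeds = build_multimodal_sequence(token_ids, image_tokens)
print(inputs_embeds.shape)  # torch.Size([21, 4096])
```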

The concluding results indicate that frozen language models appear to be a viable path towards general-purpose embodied multimodal models that fully retain their language capabilities. The researchers also found an alternative with unfrozen models: scaling up the language model leads to significantly less catastrophic forgetting as it becomes an embodied agent.
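As a rough illustration of the frozen-versus-unfrozen distinction, the sketch below keeps a placeholder language model's weights fixed and trains only the input projector, with the unfrozen alternative noted in a comment. The llm and projector modules are simplified stand-ins, not the actual PaLM-E architecture.

```python
# Minimal sketch: the "frozen LLM" setting, in which the language model's
# weights stay fixed and only the projector feeding visual features into its
# embedding space is updated. `llm` and `projector` are simplified stand-ins.
import torch
import torch.nn as nn

llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)
projector = nn.Linear(256, 512)  # maps hypothetical vision features to d_model

# Frozen variant: exclude the language model from gradient updates.
for param in llm.parameters():
    param.requires_grad_(False)

# Only the projector's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(projector.parameters(), lr=1e-4)

# Unfrozen alternative: optimize both the LLM and the projector end-to-end.
# optimizer = torch.optim.Adam(
#     list(llm.parameters()) + list(projector.parameters()), lr=1e-4
# )
```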


Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.