Google Unveils Multimodal Chain of Thought Reasoning With PaLM-E

The model can make use of visual data to enhance its language processing capabilities.

Researchers at Google have proposed PaLM-E, a single model that can control different robots in simulation and in the real world while remaining quantitatively competent at general visual question answering (VQA) and captioning tasks. The embodied language model is built by injecting multimodal information, such as images, into the embedding space of a pre-trained large language model (LLM).

The PaLM-E model, short for Pathways Language Model, Embodied, represents a significant leap forward, particularly in the realm of human-robot interaction. With the ability to seamlessly control various robots across multiple environments, PaLM-E demonstrates a level of flexibility and adaptability previously unseen in similar technologies.

The model makes use of visual data to enhance its language processing capabilities, resulting in an embodied language model that is both versatile and quantitatively competent. It is trained on a diverse mixture of tasks across multiple robot embodiments and general vision-language tasks. Importantly, the researchers demonstrated that this diversity in training opens up several avenues of transfer from the vision-language domains into embodied decision making, enabling robot planning tasks to be learned data-efficiently.


Their largest model, PaLM-E-562B, exhibits capabilities such as multimodal chain-of-thought reasoning over multiple images, despite being trained on only single-image prompts. Through zero-shot reasoning, PaLM-E-562B can tell visually conditioned jokes given an image, and demonstrates capabilities including perception, visually grounded dialogue, and planning.

PaLM-E can also perform math given an image with handwritten numbers. The model transfers knowledge from robot planning in environments with complex dynamics and physical constraints to answering questions about the observable world. PaLM-E operates on multimodal sentences, i.e., sequences of tokens in which inputs from arbitrary modalities (e.g., images, neural 3D representations, or states) are inserted alongside text tokens as input to an LLM, trained end-to-end.
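The "multimodal sentence" idea can be sketched in a few lines. The code below is a toy illustration, not Google's implementation: the embedding width, the stand-in random encoders, and all function names (`embed_text`, `encode_image`, `build_multimodal_sentence`) are invented for this example. The point it shows is only the interleaving: image vectors are placed in the same sequence, in the same embedding space, as text token vectors before the sequence is handed to the language model.

```python
import numpy as np

EMBED_DIM = 4  # toy width; in practice this is the LLM's hidden size

def embed_text(tokens):
    """Stand-in for the LLM's token-embedding lookup (illustrative only)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(tokens), EMBED_DIM))

def encode_image(image_path):
    """Stand-in for a vision encoder (e.g. a ViT) that maps an image to a
    short sequence of vectors living in the LLM's embedding space."""
    rng = np.random.default_rng(1)
    return rng.normal(size=(2, EMBED_DIM))  # pretend each image yields 2 "tokens"

def build_multimodal_sentence(parts):
    """Interleave text spans and non-text inputs into one embedding sequence,
    which would then be fed to the LLM end-to-end."""
    chunks = []
    for kind, value in parts:
        if kind == "text":
            chunks.append(embed_text(value.split()))
        elif kind == "image":
            chunks.append(encode_image(value))
    return np.concatenate(chunks, axis=0)

# "What happened in <image>?" -> 3 text tokens + 2 image tokens + 1 text token
seq = build_multimodal_sentence([
    ("text", "What happened in"),
    ("image", "scene.png"),
    ("text", "?"),
])
print(seq.shape)  # (6, 4)
```

Because the image vectors enter through the same interface as word embeddings, the language model needs no architectural change to consume them; training end-to-end teaches it to treat them as just another part of the sentence.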




The concluding results indicate that frozen language models offer a path toward general-purpose embodied multimodal models that fully retain their language capabilities. The researchers also found an alternative with unfrozen models: scaling up the language model leads to significantly less catastrophic forgetting as it becomes an embodied agent.


Tasmia Ansari
Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.
