Researchers at Google have proposed PaLM-E, a single model that can control different robots in simulation and in the real world while remaining quantitatively competent at general visual question answering (VQA) and captioning tasks. The embodied language model is built by injecting multimodal information, such as images, into the embedding space of a pre-trained large language model (LLM).
PaLM-E takes its name from PaLM (Pathways Language Model), the pre-trained language model it builds on, with the "E" standing for embodied. The model represents a significant leap forward, particularly in human-robot interaction. Because a single model can control various robots across multiple environments, PaLM-E demonstrates a level of flexibility and adaptability previously unseen in similar technologies.
The model uses visual data to enhance its language processing capabilities, resulting in an embodied language model that is both versatile and quantitatively competent. It is trained on a diverse mixture of tasks across multiple robot embodiments and general vision-language tasks. Importantly, the researchers demonstrated that this diversity in training leads to several forms of transfer from the vision-language domains into embodied decision making, so that robot planning tasks can be learned from less data.
Their largest model, PaLM-E-562B, shows capabilities such as multimodal chain-of-thought reasoning over multiple images, despite being trained only on single-image prompts. Through zero-shot reasoning, PaLM-E-562B can tell visually conditioned jokes given an image, and it demonstrates capabilities including perception, visually grounded dialogue, and planning.
PaLM-E can also perform math given an image with handwritten numbers. The model transfers knowledge from robot planning in environments with complex dynamics and physical constraints, to answering questions about the observable world. PaLM-E operates on multimodal sentences, i.e. sequences of tokens where inputs from arbitrary modalities (e.g. images, neural 3D representations, or states) are inserted alongside text tokens as input to an LLM, trained end-to-end.
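To make the idea of a multimodal sentence more concrete, here is a minimal Python sketch of how image inputs might be interleaved with text tokens in one sequence of embedding vectors. It is an illustration only, not PaLM-E's implementation: the names `Text`, `Image`, `embed_token`, `encode_image` and `build_multimodal_sentence`, the embedding size, and the toy encoders are all assumptions made for this example.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple, Union

EMBED_DIM = 4      # toy embedding width (assumption; real models are far larger)
IMAGE_TOKENS = 2   # number of vectors each image contributes (assumption)


@dataclass
class Text:
    content: str


@dataclass
class Image:
    pixels: List[float]  # stand-in for real image data


Segment = Union[Text, Image]


def embed_token(token: str) -> List[float]:
    """Hypothetical text embedding: a fixed pseudo-random vector per token."""
    rng = random.Random(sum(ord(c) for c in token))
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def encode_image(image: Image) -> List[List[float]]:
    """Hypothetical image encoder: maps an image to IMAGE_TOKENS vectors
    that have the same width as the text embeddings."""
    mean = sum(image.pixels) / len(image.pixels)
    return [[mean * (slot + 1)] * EMBED_DIM for slot in range(IMAGE_TOKENS)]


def build_multimodal_sentence(
    segments: List[Segment],
) -> Tuple[List[List[float]], List[str]]:
    """Interleave text and image embeddings in prompt order.

    Returns the vector sequence a language model would consume, plus a
    parallel list of modality labels for inspection.
    """
    sequence: List[List[float]] = []
    modalities: List[str] = []
    for segment in segments:
        if isinstance(segment, Text):
            vectors = [embed_token(tok) for tok in segment.content.split()]
            label = "text"
        else:
            vectors = encode_image(segment)
            label = "image"
        sequence.extend(vectors)
        modalities.extend([label] * len(vectors))
    return sequence, modalities


prompt = [Text("What is in"), Image([0.1, 0.5, 0.9]), Text("?")]
sequence, modalities = build_multimodal_sentence(prompt)
print(modalities)
```

The point of the sketch is that, once every modality is mapped to vectors of the same width, the language model sees one ordinary sequence and does not need a separate input path for images.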
The results indicate that keeping the language model frozen during training is a viable path towards general-purpose embodied multimodal models that fully retain their language capabilities. The researchers also found an alternative route for unfrozen models: scaling up the language model leads to significantly less catastrophic forgetting as it becomes an embodied agent.