Google DeepMind introduced RT-2, a successor to its Robotics Transformer 1 (RT-1) model. RT-2 is a Transformer-based model trained on text and images from the web, enabling it to directly produce robotic actions.
Unlike chatbots, robots face real-world challenges: they must be grounded in the physical environment and handle complex tasks. RT-2 is a significant step towards creating more capable and helpful robots, addressing the time-consuming and expensive training methods used previously. Just as language models learn general concepts from web data, RT-2 employs web data to inform and guide robot behaviour.
It is an advancement that extends the capabilities of vision-language models (VLMs), which take images as input and generate text. It builds upon models like PaLI-X and PaLM-E and adapts them to serve as the foundation for RT-2. To enable robot control, RT-2 represents actions as tokens in its output, similar to language tokens, allowing actions to be processed using standard natural language tokenizers. This approach enables the model to output robotic actions and control the behaviour of robots effectively.
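The actions-as-tokens idea can be illustrated with a minimal sketch. This is not DeepMind's implementation; it merely assumes a simple scheme in which each continuous action dimension is clipped to its range and discretized into a fixed number of bins, so an action becomes a short string of integer tokens that a standard text tokenizer can process alongside language. The function names and the choice of 256 bins are illustrative assumptions.

```python
import numpy as np

def action_to_tokens(action, low, high, num_bins=256):
    """Discretize a continuous action vector into a token string.

    Illustrative sketch only: clip each dimension to [low, high],
    map it to one of num_bins integer bins, and emit the bins as a
    space-separated string a language tokenizer could consume.
    """
    action = np.clip(action, low, high)
    bins = np.round((action - low) / (high - low) * (num_bins - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def tokens_to_action(token_str, low, high, num_bins=256):
    """Invert the discretization: token string back to a continuous action."""
    bins = np.array([int(t) for t in token_str.split()], dtype=float)
    return low + bins / (num_bins - 1) * (high - low)

# Hypothetical 3-DoF action (e.g. an end-effector displacement) in [-1, 1].
low, high = np.full(3, -1.0), np.full(3, 1.0)
action = np.array([0.0, 0.5, -1.0])
tokens = action_to_tokens(action, low, high)
recovered = tokens_to_action(tokens, low, high)
```

Round-tripping through the tokens loses at most half a bin width of precision per dimension, which is the trade-off this representation makes to let one model emit text and actions through the same output vocabulary.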
Tests and Abilities
DeepMind conducted qualitative and quantitative experiments on RT-2 models using over 6,000 robotic trials. Three categories of skills were defined: symbol understanding, reasoning, and human recognition, which required combining knowledge from web-scale data and the robot’s experience.
RT-2 demonstrated emergent robotic skills that were not present in the robot data, thanks to knowledge transfer from web pre-training. For instance, by leveraging knowledge from a vast web dataset, RT-2 understands concepts like identifying trash and throwing it away, without the need for specific training. It can even grasp abstract concepts, recognizing that certain objects become trash after use.
RT-2 simplifies the process of instructing robots by combining complex reasoning with robotic actions in a single model. It can perform tasks even without explicit training for them. RT-2’s ability to transfer knowledge from language and vision training data to robot actions showcases its versatility and effectiveness in handling various tasks.
RT-2 showed more than a 3x improvement in generalization performance compared to previous baselines like RT-1 and VC-1. It retained performance on original tasks seen in robot data and significantly improved performance on previously unseen scenarios, showcasing the benefits of large-scale pre-training. Moreover, RT-2 outperformed baselines pre-trained on visual-only tasks, indicating its superior performance in handling novel situations.
Google ventured into developing smarter robots by incorporating its large language model PaLM into robotics, resulting in the PaLM-SayCan system. However, the robot showed imperfections during a live demo: The New York Times witnessed it inaccurately identifying soda flavours and misidentifying the colour of a fruit as white.
Others in the Game
While Google DeepMind has been pushing ahead in robotics, Boston Dynamics has also bolstered its efforts and is one of the leading competitors. Boston Dynamics has made significant advancements with the release of robots like Spot and the improved capabilities of its humanoid robot ‘Atlas.’
Atlas is now capable of navigating uneven terrain, recovering from falls, carrying objects, opening doors, climbing ladders, and performing various other tasks. These improvements stem from enhanced grasping and manipulation capabilities and new control algorithms that allow Atlas to improvise and adapt to different conditions, putting it on par with, if not ahead of, the field’s best.
The robot’s 28 hydraulically operated joints and various sensors, such as LIDAR and cameras, contribute to its flexibility and understanding of its surroundings. Boston Dynamics has a history of developing advanced robots, including Spot and Handle, with the goal of creating versatile robots that can perform a wide range of activities.
Musk’s Tesla, meanwhile, has come up with Optimus, but the project is still in progress and looks lacklustre at the moment.
OpenAI, on the other hand, had a robotics division that created a robotic arm capable of solving the Rubik’s cube. However, the company shut down this division in 2021. Yet, OpenAI has now decided to re-enter the robotics domain and has invested in a Norwegian startup called 1x.
In 2021, Google DeepMind made strides in building more generalized robots with RGB-Stacking, a benchmark for vision-based robotic manipulation. This technology enables robots to understand the environment and objects around them.
Meanwhile, Microsoft seems to be focusing on the development of ChatGPT, extending its capabilities to robotics arms, drones, and home assistant robots. The company’s AI Lab Projects division is experimenting with AI and robots together to automate various tasks using the collaborative robot Paul-E, which possesses embedded vision and high-res force control. However, Microsoft’s research efforts in robotics are not as extensive as those of Google DeepMind.
Google DeepMind is deeply involved in researching the integration of language models into machines, which could potentially impact the ongoing debate about embodiment’s significance for AGI.
Overall, the robotics landscape is highly competitive, with various companies investing in different approaches and technologies to push the boundaries of what robots can achieve.