Google’s DeepMind unit has introduced RT-2, a first-of-its-kind vision-language-action (VLA) model for robot control. Short for “Robotics Transformer”, the model is set to change the way robots interpret their environment and execute tasks with precision.
RT-2 is a quick study. The model learns from diverse sources, including web and robotics data, and understands both language and images. Because it can generalise from that training, it can tackle tasks it has never encountered or been explicitly trained on, and it continues to learn and adapt in real-world scenarios.
The researchers built RT-2 on two pre-existing models, the Pathways Language and Image model (PaLI-X) and the Pathways Language Model Embodied (PaLM-E). The resulting VLA model enables robots to understand both language and visuals and to take appropriate actions. The system was trained on extensive text and image data from the internet, much like popular chatbots such as ChatGPT.
According to the researchers, an RT-2-enabled robot can undertake a diverse range of complex tasks using both visual and language data. These include activities like organising files alphabetically: reading the labels on the documents, then sorting and placing them in the appropriate locations.
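A core idea behind VLA models like RT-2 is that robot actions are expressed as text tokens, so the same language model that reads instructions and images can also emit motor commands. The sketch below illustrates that idea in minimal form; the bin count, value range, and action layout are illustrative assumptions, not details from the article.

```python
# Illustrative sketch (not DeepMind's code): discretize a continuous
# robot action so a language model can emit it as a string of tokens.

def discretize(value, low=-1.0, high=1.0, bins=256):
    """Map a continuous value in [low, high] to an integer bin index."""
    value = max(low, min(high, value))  # clamp into the valid range
    return round((value - low) / (high - low) * (bins - 1))

def action_to_tokens(action):
    """Render an action vector as a space-separated token string."""
    return " ".join(str(discretize(v)) for v in action)

# Hypothetical example: a 3-DoF position delta plus a gripper command.
print(action_to_tokens([0.0, -1.0, 1.0, 0.5]))
```

In this scheme, predicting a robot action becomes ordinary next-token prediction, which is what lets web-scale language and image training transfer to control.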
The paper, titled “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control”, is authored by Anthony Brohan and colleagues and is featured in the latest DeepMind blog post.
Read more: Google DeepMind Takes Back What it Lost to OpenAI