
Google Introduces UniPi, Generative AI in Decision Making and Robotics

UniPi will be able to help agents learn how to do many different tasks in many different environments in the real world.


Amid the hype around multimodal systems, Google researchers have created a new model called UniPi that can learn to perform different tasks in different environments. In the blog post UniPi: Learning universal policies via text-guided video generation, the researchers describe the model’s ability to carry out different tasks using text guidance.

The new Universal Policy (UniPi) addresses the challenges of environmental diversity and reward specification, which is important for guiding robots to perform different tasks in different environments using text inputs.

To do this, the policy leverages text to describe the task and videos to show how to complete it. UniPi uses a text-guided video generator to produce videos that show the steps an agent should take to complete the task. An inverse dynamics model then works out the actions needed to make those steps happen. Finally, UniPi can execute those actions to complete the task in the real world or in a simulation.
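To make the flow concrete, here is a minimal sketch of how those three stages could be chained together; every helper function is passed in as a placeholder and none of the names come from the actual UniPi release.

```python
from typing import Callable, Sequence
import numpy as np

# Hypothetical sketch of a UniPi-style pipeline: text -> planned video -> actions -> execution.
# The callables are placeholders supplied by the caller, not components from the UniPi code.
def run_unipi(
    task_text: str,
    first_frame: np.ndarray,
    generate_video: Callable[[str, np.ndarray], Sequence[np.ndarray]],
    infer_actions: Callable[[Sequence[np.ndarray]], Sequence[np.ndarray]],
    step_env: Callable[[np.ndarray], np.ndarray],
) -> np.ndarray:
    # 1. Text-conditioned video generation: "plan" the task as a sequence of frames.
    planned_frames = generate_video(task_text, first_frame)
    # 2. Inverse dynamics: recover the low-level action between consecutive frames.
    actions = infer_actions(planned_frames)
    # 3. Execute the actions in the real or simulated environment.
    observation = first_frame
    for action in actions:
        observation = step_env(action)
    return observation
```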

According to the researchers, UniPi is able to generalise to both seen and novel combinations of language prompts. UniPi will be able to help agents learn how to do many different tasks in many different environments in the real world.

Researchers evaluated the quality of videos generated by UniPi, after pre-training on non-robot data, using the Fréchet Inception Distance (FID) and Fréchet Video Distance (FVD) metrics. The language-image alignment was measured using Contrastive Language-Image Pre-training scores (CLIPScores). The results showed that pre-trained UniPi had significantly better FID and FVD scores and a higher CLIPScore than UniPi without pre-training. These findings suggest that pre-training on non-robot data can assist with generating plans for robots.
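For context, CLIPScore is typically computed as a scaled cosine similarity between CLIP image and text embeddings. The snippet below is a rough, generic version of that calculation using the open-source CLIP checkpoint on Hugging Face, not the evaluation code used by the researchers.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic CLIPScore-style check of language-image alignment (illustrative only).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # Common CLIPScore convention: scale cosine similarity by 100 and clamp at 0.
    return max(0.0, 100.0 * float((image_emb * text_emb).sum()))
```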

Earlier, Google had also released ‘PaLM-E’, a single model that can control different robots in the real world while also being equally competent at VQA and captioning tasks.

This came after Google researchers published Multimodal Chain-of-Thought Reasoning in Language Models, which proposed generating intermediate reasoning chains using both language and vision with a model of under one billion parameters.

UniPi has four major components:

Consistent video generation with first-frame tiling

To ensure environment consistency during conditional video synthesis, the observed image is added as supplementary context when denoising each frame in the synthesised video. UniPi achieves this by concatenating each intermediate frame that is sampled from noise with the conditioned observed image. This technique helps maintain a consistent environment state over time by providing a clear signal to the model.
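A minimal sketch of that concatenation, assuming the observation is tiled across time and joined with each noisy frame along the channel axis (the tensor shapes and channel-wise concatenation are assumptions, not the released implementation):

```python
import torch

def condition_on_first_frame(noisy_video: torch.Tensor, observed_image: torch.Tensor) -> torch.Tensor:
    # noisy_video:    (batch, time, channels, height, width), sampled from noise
    # observed_image: (batch, channels, height, width), the current observation
    batch, time, _, height, width = noisy_video.shape
    # Tile the clean observation across the time axis.
    tiled = observed_image.unsqueeze(1).expand(batch, time, -1, height, width)
    # Each denoising step then sees the observation alongside the noisy frame.
    return torch.cat([noisy_video, tiled], dim=2)
```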

Hierarchical planning through temporal super resolution

UniPi first creates coarse videos by sampling frames sparsely in time, capturing only select moments; these sparse videos are referred to as “abstractions”. UniPi then enhances these videos so that they accurately represent valid behaviour within the environment. This is done through temporal super-resolution, which improves consistency by filling in the gaps between frames and creating a more detailed representation of the desired behaviour.
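The toy sketch below illustrates the idea: keep sparse keyframes as the coarse “abstraction”, then fill in the frames between them. Plain linear interpolation stands in for the learned super-resolution step, which in UniPi is a diffusion model.

```python
import torch

def coarse_plan(video: torch.Tensor, stride: int = 4) -> torch.Tensor:
    # video: (time, channels, height, width); keep every `stride`-th frame as a keyframe.
    return video[::stride]

def temporal_super_resolve(keyframes: torch.Tensor, factor: int = 4) -> torch.Tensor:
    # Fill in `factor` frames between each pair of keyframes.
    frames = []
    for start, end in zip(keyframes[:-1], keyframes[1:]):
        for i in range(factor):
            t = i / factor
            frames.append((1 - t) * start + t * end)  # stand-in for the learned refinement
    frames.append(keyframes[-1])
    return torch.stack(frames)
```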

Flexible behaviour synthesis 

The video generation model, which guides the synthesised frames towards a particular set of states determined by the text input, is trained using video diffusion and conditioned on pre-trained language features encoded by the Text-To-Text Transfer Transformer (T5).
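A small sketch of the text-conditioning side, using a frozen T5 encoder from Hugging Face Transformers; how the resulting features are injected into the denoiser (commonly via cross-attention) is an assumption here, not a detail confirmed in the blog.

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
text_encoder = T5EncoderModel.from_pretrained("t5-base")

def encode_task(prompt: str) -> torch.Tensor:
    tokens = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # (1, sequence_length, hidden_size) language features, typically fed to the
        # video denoiser, e.g. through cross-attention layers.
        return text_encoder(**tokens).last_hidden_state

text_features = encode_task("pick up the red block and place it in the bowl")
```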

Task-specific action adaptation

A compact inverse dynamics model, which translates frames into low-level control actions, is trained on the collection of synthesised videos. This training is separate from the planner and can be done with a smaller, possibly less ideal, dataset generated by a simulator. Given the input frame and a text description of the current goal, UniPi synthesises the image frames, and the inverse dynamics model generates the sequence of control actions that realises them. An agent then executes the inferred low-level control actions via closed-loop control.
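Below is a minimal PyTorch sketch of such an inverse dynamics model, predicting one action per pair of consecutive planned frames; the network architecture and action dimension are illustrative assumptions, not UniPi’s actual design.

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Predict the action that takes the agent from one frame to the next."""

    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, action_dim),
        )

    def forward(self, frame_t: torch.Tensor, frame_next: torch.Tensor) -> torch.Tensor:
        # Both frames: (batch, 3, height, width); concatenate along the channel axis.
        return self.net(torch.cat([frame_t, frame_next], dim=1))

def actions_from_plan(model: InverseDynamics, planned_frames: torch.Tensor) -> torch.Tensor:
    # planned_frames: (time, 3, height, width) produced by the video planner.
    pairs = zip(planned_frames[:-1], planned_frames[1:])
    return torch.stack(
        [model(a.unsqueeze(0), b.unsqueeze(0)).squeeze(0) for a, b in pairs]
    )
```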


Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.