Amid the hype around multimodal systems, Google researchers have created a new model called UniPi that can learn to perform different tasks in different environments. In the blog post ‘UniPi: Learning universal policies via text-guided video generation’, the researchers describe how the model carries out these tasks using text guidance.
The new Universal Policy (UniPi) addresses environmental diversity and reward specification challenges. This is important for guiding robots to do different tasks in different environments using text inputs.
To do this, the policy uses text to describe the task and video to show how to complete it. First, a text-conditioned video generation model produces a video showing the steps an agent should take to complete the task. Next, an inverse dynamics model works out the actions needed to make those steps happen. Finally, the agent executes those actions to complete the task in the real world or in a simulation.
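The three stages described above can be sketched as a toy pipeline. This is only an illustration under strong assumptions: a "frame" is reduced to a single number (a 1-D position), and `plan_video`, `infer_actions`, `execute` and `run_policy` are hypothetical names that stand in for the video generator, the inverse dynamics model and the environment. None of them come from the UniPi code.

```python
from typing import List, Sequence

Frame = float   # toy stand-in: a "frame" is just a 1-D position
Action = float  # toy stand-in: an action is a displacement


def plan_video(first_frame: Frame, text: str, horizon: int = 4) -> List[Frame]:
    """Hypothetical planner standing in for text-conditioned video generation."""
    step_for_prompt = {"move right": 1.0, "move left": -1.0}
    step = step_for_prompt[text]
    # The "video" is the sequence of positions the agent should pass through.
    return [first_frame + step * t for t in range(horizon + 1)]


def infer_actions(frames: Sequence[Frame]) -> List[Action]:
    """Hypothetical inverse dynamics: the action that turns frame t into frame t+1."""
    return [nxt - cur for cur, nxt in zip(frames, frames[1:])]


def execute(start: Frame, actions: Sequence[Action]) -> Frame:
    """Toy environment: every action is added to the current position."""
    state = start
    for action in actions:
        state += action
    return state


def run_policy(first_frame: Frame, text: str) -> Frame:
    frames = plan_video(first_frame, text)   # 1. text -> planned video
    actions = infer_actions(frames)          # 2. planned video -> actions
    return execute(first_frame, actions)     # 3. actions -> behaviour


final_position = run_policy(0.0, "move right")
```

The point of the split is that the planner never needs to know about actions: it only produces frames, and a separate, much smaller component maps frames to controls.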
According to the researchers, UniPi is able to generalise to both seen and novel combinations of language prompts. This could help agents learn to perform many different tasks in many different real-world environments.
Researchers evaluated the quality of videos generated by UniPi, after pre-training on non-robot data, using the Fréchet Inception Distance (FID) and Fréchet Video Distance (FVD) metrics. The language-image alignment was measured using Contrastive Language-Image Pre-training scores (CLIPScores). The results showed that pre-trained UniPi had significantly better FID and FVD scores and a higher CLIPScore than UniPi without pre-training. These findings suggest that pre-training on non-robot data can assist with generating plans for robots.
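FID and FVD share the same core computation: each fits a Gaussian to feature vectors extracted from real and generated samples, then takes the Fréchet distance between the two Gaussians, with lower values meaning the distributions are closer. The sketch below shows only that distance. The feature extractor (an image network for FID, a video network for FVD) is omitted, and the random vectors are placeholder data, not results from the study.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two (n_samples, dim) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Placeholder features standing in for network activations (assumption).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(200, 4))
shifted_feats = real_feats + np.array([3.0, 0.0, 0.0, 0.0])

same = frechet_distance(real_feats, real_feats)        # close to 0
shifted = frechet_distance(real_feats, shifted_feats)  # close to 3**2 = 9
```

Shifting every feature vector by the same offset leaves the covariance unchanged, so the distance reduces to the squared length of the offset.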
Earlier, Google had also released ‘PaLM-E’, a single model that can control different robots in the real world while being equally competent at visual question answering (VQA) and captioning tasks.
This followed the publication by Amazon researchers of Multimodal Chain-of-Thought Reasoning in Language Models, a model with under one billion parameters that generates intermediate reasoning chains from both language and vision inputs.
UniPi has four major components:
Consistent video generation with first-frame tiling
To ensure environment consistency during conditional video synthesis, the observed image is added as supplementary context when denoising each frame in the synthesised video. UniPi achieves this by concatenating each intermediate frame that is sampled from noise with the conditioned observed image. This technique helps maintain a consistent environment state over time by providing a clear signal to the model.
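A minimal sketch of the tiling step, assuming frames are arrays of shape (height, width, channels) and that the concatenation happens along the channel axis. The article does not state the axis, so that choice, and the name `tile_first_frame`, are assumptions for illustration.

```python
import numpy as np


def tile_first_frame(noisy_video: np.ndarray, first_frame: np.ndarray) -> np.ndarray:
    """Attach the observed first frame to every noisy frame as extra channels.

    noisy_video: (num_frames, height, width, channels), sampled from noise
    first_frame: (height, width, channels), the observed image
    returns:     (num_frames, height, width, 2 * channels)
    """
    num_frames = noisy_video.shape[0]
    # Repeat ("tile") the observed image once per frame without copying data.
    tiled = np.broadcast_to(first_frame, (num_frames, *first_frame.shape))
    # Assumed: channel-wise concatenation, so the denoiser sees both at each pixel.
    return np.concatenate([noisy_video, tiled], axis=-1)


rng = np.random.default_rng(0)
noisy = rng.normal(size=(8, 16, 16, 3))   # 8 frames of pure noise
observed = rng.normal(size=(16, 16, 3))   # placeholder for the observed image
denoiser_input = tile_first_frame(noisy, observed)
```

Because the same observed image accompanies every frame, the denoiser always has the true environment state available, whichever frame it is working on.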
Hierarchical planning through temporal super resolution
UniPi creates videos at a coarse level by sampling videos sparsely, capturing select moments in time. These videos are referred to as “abstractions.” UniPi then enhances these videos to ensure that they accurately represent valid behaviour within the environment. This is done through a process called super-resolution, which improves consistency by filling in the gaps between frames and creating a more detailed representation of the desired behaviour.
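The coarse-to-fine idea can be illustrated with a deliberately simple stand-in. In the sketch below, linear interpolation replaces the learned super-resolution model, so it shows only the shape of the computation (sparse keyframes in, dense frames out), not how UniPi actually fills the gaps. The function name and the `factor` parameter are hypothetical.

```python
import numpy as np


def temporal_upsample(keyframes: np.ndarray, factor: int) -> np.ndarray:
    """Insert (factor - 1) frames between each pair of consecutive keyframes.

    Linear interpolation is used here purely as a placeholder for a learned
    temporal super-resolution model. Requires at least two keyframes.
    """
    num_keys = keyframes.shape[0]
    dense_times = np.linspace(0, num_keys - 1, (num_keys - 1) * factor + 1)
    lower = np.clip(np.floor(dense_times).astype(int), 0, num_keys - 2)
    weight = dense_times - lower
    # Reshape so the per-frame weight broadcasts over the remaining frame axes.
    weight = weight.reshape(-1, *([1] * (keyframes.ndim - 1)))
    return (1.0 - weight) * keyframes[lower] + weight * keyframes[lower + 1]


# Three coarse "abstraction" frames, each a single value for readability.
coarse_plan = np.array([[0.0], [4.0], [8.0]])
dense_plan = temporal_upsample(coarse_plan, factor=4)  # 9 frames in total
```

The original keyframes are preserved at every fourth position of the dense plan, which mirrors the role of the coarse video as a fixed skeleton that the finer frames must agree with.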
Flexible behaviour synthesis
The video generation model is trained to steer generated frames towards a particular set of states that depends on the text input. To do this, UniPi uses the video diffusion algorithm together with pre-trained language features encoded by the Text-To-Text Transfer Transformer (T5).
Task-specific action adaptation
Given a set of synthesised videos, a compact, task-specific inverse dynamics model is trained to translate frames into low-level control actions. This training is independent of the planner and can use a separate, smaller and possibly suboptimal dataset generated by a simulator. Given the input frame and a text description of the current goal, UniPi synthesises image frames, and the inverse dynamics model generates the sequence of control actions needed to realise them. An agent then executes the inferred low-level control actions via closed-loop control.
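As a rough sketch of this step, the example below fits an inverse dynamics model on transitions from a toy simulator and then follows a planned sequence of frames in closed loop. Everything here is an assumption made for illustration: frames are 2-D state vectors rather than images, the simulator is linear with a made-up `GAIN`, and the model is a least-squares fit rather than a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)
GAIN = 0.5  # hypothetical simulator dynamics: next_state = state + GAIN * action


def simulate(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    return state + GAIN * action


# 1. Collect (frame, next_frame, action) triples from random simulator rollouts.
states = rng.normal(size=(500, 2))
actions = rng.normal(size=(500, 2))
next_states = simulate(states, actions)

# 2. Fit a linear inverse dynamics model: [frame, next_frame] -> action.
inputs = np.hstack([states, next_states])                   # shape (500, 4)
weights, *_ = np.linalg.lstsq(inputs, actions, rcond=None)  # shape (4, 2)


def inverse_dynamics(frame: np.ndarray, next_frame: np.ndarray) -> np.ndarray:
    return np.concatenate([frame, next_frame]) @ weights


# 3. Closed-loop control: re-observe the state before inferring each action.
def follow_plan(start: np.ndarray, planned_frames: np.ndarray) -> np.ndarray:
    state = start
    for target in planned_frames:
        action = inverse_dynamics(state, target)  # uses the observed state
        state = simulate(state, action)
    return state


plan = np.array([[1.0, 0.0], [1.0, 1.0], [2.0, 2.0]])  # placeholder planned frames
final_state = follow_plan(np.zeros(2), plan)
```

Inferring each action from the state actually observed, rather than from the previous planned frame, is what makes the loop closed: an error at one step is corrected at the next instead of accumulating.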