The success of deep learning has been linked to how well its algorithms generalise in open-world settings. This notion transformed computer vision and natural language processing. Reinforcement learning, on the other hand, has been playing catch-up within AI. The potential is immense, but applying RL is not straightforward.
To apply RL to a new problem, one needs to set up an environment, define a reward function and train the agent to solve the task. In short, every new task starts from scratch. Online RL methods are data-hungry, and starting from scratch for every new problem makes them impractical for real-world robotics.
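This from-scratch workflow can be sketched with a toy example: a hypothetical five-state chain environment standing in for a real robot task, trained with plain tabular Q-learning (an illustrative stand-in, not the paper's method). Note how nothing carries over to the next task — the Q-table starts empty every time.

```python
import random

class ChainEnv:
    """Toy 5-state chain (hypothetical stand-in for a real environment).
    The agent starts at state 0 and must walk right to reach the goal."""
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: 0 = left, 1 = right
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.n_states - 1, self.state + move))
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0  # sparse reward, defined per task
        return self.state, reward, done

def train_from_scratch(env, episodes=300, alpha=0.5, gamma=0.9, eps=0.3):
    """Plain online Q-learning: every new task begins with an empty Q-table,
    so all knowledge must be gathered through fresh interaction."""
    q = [[0.0, 0.0] for _ in range(env.n_states)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:                       # explore
                a = random.randrange(2)
            else:                                           # exploit
                a = 0 if q[s][0] >= q[s][1] else 1
            s2, r, done = env.step(a)
            target = r + gamma * max(q[s2]) * (0.0 if done else 1.0)
            q[s][a] += alpha * (target - q[s][a])
            s = s2
    return q

random.seed(0)
q = train_from_scratch(ChainEnv())
# Greedy policy over non-terminal states: should always move right.
policy = [0 if q[s][0] >= q[s][1] else 1 for s in range(4)]
```

Even on this trivial task the agent needs hundreds of episodes of live interaction before the sparse reward propagates back — a hint of why starting from scratch does not scale to real robots.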
An effective data-driven method for deep reinforcement learning should be able to pre-train offline on existing data and then improve with online fine-tuning. Pre-training lets the agent learn about the dynamics of the world and the task being solved, though the usefulness of this captured knowledge depends on the quality of the data provided.
Since this prior data can come from a variety of sources, we require an algorithm that does not treat any type of data in a privileged way.
In a recent paper, researchers at Berkeley investigate how to build RL algorithms that are not only effective for pre-training from a variety of off-policy datasets but also well suited for continuous improvement with online data collection. They also propose such an algorithm: Advantage Weighted Actor Critic (AWAC).
Overview Of AWAC
In robotics, collecting high-quality data for a task is very difficult, often as difficult as solving the task itself. Generalisation in robotics therefore requires RL algorithms that can take advantage of vast amounts of prior data. This is unlike, for instance, computer vision, where humans can simply label the data.
For this experiment, the researchers use Advantage Weighted Actor Critic (AWAC) to learn from offline data and then fine-tune, reaching expert-level performance after collecting only a limited amount of interaction data.
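At the heart of AWAC's actor update is a supervised, likelihood-style update on the actions in the buffer, reweighted by the exponentiated advantage exp(A(s, a)/λ) estimated by the off-policy critic. A minimal numpy sketch (the clipping constant and the batch values below are illustrative assumptions, not from the paper):

```python
import numpy as np

def awac_weights(advantages, lam=1.0, clip=20.0):
    """Per-sample weights for AWAC's actor update: exp(A(s, a) / lambda).
    Actions with higher estimated advantage get exponentially more weight.
    `clip` guards against numerical overflow (an implementation detail)."""
    return np.exp(np.clip(np.asarray(advantages) / lam, None, clip))

def awac_actor_loss(log_probs, advantages, lam=1.0):
    """Advantage-weighted negative log-likelihood: minimising this pushes
    the policy toward buffer actions in proportion to their advantage."""
    w = awac_weights(advantages, lam)
    return -np.mean(w * np.asarray(log_probs))

# Hypothetical batch: two actions with positive advantage, one with negative.
adv = [1.0, 0.5, -2.0]
logp = [-0.3, -0.7, -0.1]
w = awac_weights(adv)
loss = awac_actor_loss(logp, adv)
```

Because the update only ever reweights actions that are already in the buffer, it stays well-behaved on purely offline data, yet the same update keeps working once fresh online transitions are added.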
First, policies are trained offline on a prior dataset D and then fine-tuned with online interaction. The prior data could come from previous runs of RL, expert demonstrations or any other source of transitions.
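This offline-then-online protocol can be sketched as a single replay buffer seeded with the prior dataset D and grown during fine-tuning, with both phases sampling from it uniformly so no data source is privileged. The function names `agent_update` and `collect_online` are placeholders for this sketch, not the authors' API:

```python
import random

class ReplayBuffer:
    """One buffer holding both the prior dataset D and online transitions;
    batches are sampled uniformly, treating neither source as privileged."""
    def __init__(self):
        self.transitions = []

    def add(self, transition):
        self.transitions.append(transition)

    def sample(self, batch_size):
        return random.sample(self.transitions, min(batch_size, len(self.transitions)))

def train(agent_update, prior_data, collect_online, offline_steps, online_steps):
    """Skeleton of offline pre-training followed by online fine-tuning.
    `agent_update` applies one gradient step; `collect_online` returns one
    fresh transition from the environment (both hypothetical callables)."""
    buffer = ReplayBuffer()
    for t in prior_data:                      # load the prior dataset D
        buffer.add(t)
    for _ in range(offline_steps):            # offline phase: no interaction
        agent_update(buffer.sample(256))
    for _ in range(online_steps):             # online phase: collect, update
        buffer.add(collect_online())
        agent_update(buffer.sample(256))
    return buffer
```

The key property is that the update rule is identical in both phases; only the source of new transitions changes when fine-tuning begins.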
In their experiments, the researchers studied tasks representative of the difficulties of real-world robot learning, where offline learning and online fine-tuning are most relevant. These tasks involve complex manipulation skills: in-hand rotation of a pen, opening a door by unlatching the handle, and picking up a sphere and relocating it to a target location.
In this work, the researchers showed that difficult, high-dimensional, sparse-reward dexterous manipulation problems can be learnt from human demonstrations and off-policy data. They also evaluated AWAC with suboptimal prior data generated by a random controller.
- The new method solves the pen rotation task in 120K timesteps, the equivalent of just 20 minutes of online interaction.
- Alternative off-policy RL and offline RL algorithms are largely unable to solve the door and relocate tasks within the given time.
- Using off-policy critic estimation allows AWAC to significantly outperform other methods.
- AWAC learns fastest online and makes effective use of the offline dataset, unlike some methods that are completely unable to learn from it.
With successful demonstrations on single tasks, the researchers now aim to apply AWAC to the multi-task regime in reinforcement learning, with data sharing and generalisation between tasks.
The researchers believe that being able to use prior data and fine-tune quickly on new problems opens up many new avenues of research, and that active data collection (online learning) will be key in the future.
Read the original paper here.