Reinforcement learning (RL) is one of the main machine learning paradigms, alongside supervised and unsupervised learning and the less common self-supervised and semi-supervised learning. In RL, an agent is given a set of actions it can take in an environment and a reward signal that scores the outcomes, and it learns by trial and error which actions lead to the highest reward.
Several techniques have been proposed to improve data efficiency, including storing experience in a transition memory (replay buffer) and reusing it during online learning. In recent years, off-policy actor-critic algorithms have gained prominence, and offline RL has emerged as a setting in which agents learn entirely from a fixed dataset, without further interaction with the environment.
Despite advances in the field, several challenges still hold reinforcement learning back, mainly how to use the collected data, how to grow the dataset, and how to build the most effective datasets. Data efficiency is critical in many real-world scenarios where gathering data is the main bottleneck, especially in robotics.
Building on these developments, DeepMind researchers have proposed splitting the reinforcement learning process into two distinct sub-processes, data collection and inference of knowledge, with the aim of improving data efficiency and enhancing the capabilities of the next generation of RL agents. The researchers call this the ‘Collect and Infer (C&I)’ paradigm.
Introducing Collect and Infer
In a paper, ‘Collect & Infer – a fresh look at data-efficient Reinforcement Learning,’ the DeepMind researchers explain how this works and give a lightweight overview of the core concepts and implications of the C&I paradigm.
The C&I method assumes two sub-processes, acting (data collection) and learning (inference), which are decoupled but connected through a transition memory. All data resulting from environment interaction is stored in this memory and later drawn from it for learning. Viewing RL as two separate processes offers additional flexibility in algorithm design, and the researchers emphasise that these processes can and should be optimised independently.
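To make the role of the transition memory concrete, here is a minimal sketch in Python. The class name `TransitionMemory`, the `Transition` fields, and the `add`/`sample` methods are illustrative assumptions, not the interface used in the paper.

```python
import random
from collections import namedtuple

# One environment step; the field names are illustrative, not taken from the paper.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class TransitionMemory:
    """Shared store that the acting process writes to and the learning process reads from."""

    def __init__(self, seed=0):
        self._data = []
        self._rng = random.Random(seed)

    def add(self, transition):
        # Called by the collection process after every environment step.
        self._data.append(transition)

    def sample(self, batch_size):
        # Called by the inference process; sampling never touches the environment.
        k = min(batch_size, len(self._data))
        return self._rng.sample(self._data, k)

    def __len__(self):
        return len(self._data)


# Usage: the collector adds transitions, the learner draws batches later.
memory = TransitionMemory()
for step in range(5):
    memory.add(Transition(state=step, action=1, reward=0.0, next_state=step + 1, done=False))
batch = memory.sample(3)
```

Because the learner only ever sees what is in the memory, the policy that collected the data and the policy being learned do not have to be the same.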
The image below shows the ‘collect and infer’ agent. The top part depicts collecting experience, and the lower part is inference (the two parts share policy pool and transition memory).
Collect and Infer agent (Source: arXiv)
Here’s how it works
DeepMind researchers said that the key idea of the C&I paradigm is to separate reinforcement learning into two processes, each of which can be optimised on its own.
- Process 1: Collects data into a transition memory by interacting with the environment
- Process 2: Infers knowledge about the environment by learning from the data in the memory
Further, the team set two objectives to optimise each process:
- Optimal inference: given a fixed data batch, what is the correct learning setup to get to the maximally performing policy?
- Optimal collection: given an inference process, what is the minimal set of data required to get to a maximally performing policy?
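One way to write the two objectives down more formally is sketched below. The notation is an assumption made here for illustration and does not come from the paper: $\mathcal{D}$ is a batch of transitions, $\mathcal{A}$ is a learning setup (algorithm plus hyperparameters) that maps a batch to a policy, and $J(\pi)$ is the expected return of a policy $\pi$.

```latex
\begin{align*}
\text{Optimal inference (fixed } \mathcal{D}\text{):} \quad
  & \mathcal{A}^{*} = \arg\max_{\mathcal{A}} \; J\big(\mathcal{A}(\mathcal{D})\big) \\
\text{Optimal collection (fixed } \mathcal{A}\text{):} \quad
  & \mathcal{D}^{*} = \arg\min_{\mathcal{D}} \; |\mathcal{D}|
    \quad \text{s.t.} \quad J\big(\mathcal{A}(\mathcal{D})\big) = \max_{\pi} J(\pi)
\end{align*}
```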
The researchers also characterised C&I algorithms through the following points:
- Learning is done offline (in a batch setting) assuming fixed data as suggested by ‘optimal inference.’ Data may have been collected by a behaviour policy different from the one that is the learning target. That enables the use of the same data to optimise for multiple objectives simultaneously and coincides with interest in offline reinforcement learning.
- Data collection is a process that needs to be optimised in its own right. Naive exploration policies that employ simple random perturbations of a task policy (epsilon greedy) are likely to be insufficient. The behaviour that is optimal for data collection in the sense of ‘optimal collection’ may be quite different from the optimal behaviour for a task of interest.
- Treating data collection as a separate process provides novel ways to integrate known methods like skills, innovative exploration schemes, or model-based approaches into the learning process without biasing the final task solution.
- Data collection may happen concurrently with inference or can be conducted separately.
- ‘Collect and Infer’ suggests a different focus for evaluation compared to the usual regret-based frameworks for exploration. C&I does not aim to optimise task performance during collection. Instead, it distinguishes between a learning phase and a deployment phase.
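The points above can be illustrated with a toy example. The sketch below is not the researchers' implementation: the five-state chain environment, the uniformly random behaviour policy, and the tabular batch Q-learning update are all assumptions chosen to keep the example small. It collects data with a policy unrelated to the task policy, learns offline from the fixed memory, and only then deploys the greedy policy.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal (toy environment, an assumption)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA = 0.9           # discount factor


def step(state, action):
    """Deterministic chain dynamics: reward 1.0 only on reaching the goal state."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def collect(num_episodes, max_steps, rng):
    """Process 1: a uniformly random behaviour policy fills the transition memory."""
    memory = []
    for _ in range(num_episodes):
        state = 0
        for _ in range(max_steps):
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            memory.append((state, action, reward, next_state, done))
            if done:
                break
            state = next_state
    return memory


def infer(memory, sweeps=200, lr=0.5):
    """Process 2: batch Q-learning over the fixed memory, with no environment interaction."""
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(sweeps):
        for state, action, reward, next_state, done in memory:
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += lr * (target - q[state][action])
    return q


def deploy(q, max_steps=20):
    """Deployment phase: follow the greedy policy; return steps taken to reach the goal, or None."""
    state = 0
    for t in range(1, max_steps + 1):
        action = max(ACTIONS, key=lambda a: q[state][a])
        state, _, done = step(state, action)
        if done:
            return t
    return None


rng = random.Random(0)
memory = collect(num_episodes=50, max_steps=30, rng=rng)   # learning phase: collect
q_table = infer(memory)                                    # learning phase: infer
steps_to_goal = deploy(q_table)                            # deployment phase
```

Task performance is never measured while the random policy is collecting data. The learned policy is evaluated only at deployment, which mirrors the distinction between the learning phase and the deployment phase described above.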
Regarding implications, C&I suggests alternative solutions to several problems that will become prominent as reinforcement learning is applied to more challenging tasks, including multi-task, transfer, or life-long learning.
The team also discussed the use of C&I in robotics, how existing algorithms can be interpreted from the C&I perspective, and where that perspective suggests changes or improvements. One example is SAC-X, which uses basic C&I principles and learns to solve a complex task: opening the lid of a box and putting two items inside. The example highlights the flexibility of the C&I paradigm. It suggests an interpolation between pure offline and more conventional online learning scenarios, and it fits naturally with the growing interest in data-driven approaches, where large datasets of experience are built up over time and enable rapid learning of new behaviours with only small amounts of online experience.
DeepMind researchers said that decoupling acting and learning, together with the emphasis on off-policy learning, gives greater flexibility when designing exploration or other actively optimised data collection strategies. This includes schemes for unsupervised reinforcement learning and unsupervised skill discovery. Leveraging data as a vehicle for knowledge transfer enables new algorithms for multi-task and transfer scenarios.
According to DeepMind, the main idea of the ‘Collect & Infer’ paradigm is to rethink data-efficient reinforcement learning through a clear separation of data collection and exploitation into two distinct but connected processes. A second aim is to exploit the flexibility of off-policy reinforcement learning in agent design for problems as diverse as online RL, offline RL, and life-long learning.
The team believes that C&I will become a go-to option for data-efficient learning agents that treat data as a resource to be transformed into different types of representations, which are either used for action selection (policies) or may facilitate future learning problems (models, skills, or perceptual representations).
Amit Raja Naik is a senior writer at Analytics India Magazine, where he dives deep into the latest technology innovations. He is also a professional bass player.