DeepMind recently introduced a meta-learning approach that discovers a reinforcement learning algorithm, dubbed Learned Policy Gradient (LPG). According to the researchers, automating the discovery of update rules from data could yield algorithms that are more efficient and better adapted to specific environments.
Reinforcement learning is the branch of machine learning most often compared to the way animals learn from reward and punishment. Its objective is to maximise the expected cumulative or average reward, and the technique has gained considerable traction among researchers and developers over the past few years.
Progress in reinforcement learning has historically required years of manual effort to design the rules by which an agent's parameters are updated. Automating the discovery of these update rules has long been a significant challenge, and researchers have sought alternatives to fundamental concepts such as value functions and temporal-difference learning. DeepMind researchers have now addressed this challenge with a new approach.
Behind The Framework
According to the researchers, the new approach discovers an entirely new update rule that specifies both ‘what to predict’ (such as value functions) and ‘how to learn from it’ (such as bootstrapping), purely by interacting with a set of environments.
They stated, “This paper takes a step towards discovering general-purpose RL algorithms. We introduce a meta-learning framework that jointly discovers both ‘what the agent should predict’ and ‘how to use predictions for policy improvement’ from data generated by interacting with a distribution of environments.”
The goal of this meta-learning approach is to find the optimal update rule over a distribution of environments and initial agent parameters. In other words, it automatically discovers reinforcement learning algorithms from data generated by agents interacting with a set of environments.
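The two nested loops described above — an inner loop where agents learn in sampled environments, and an outer loop that improves the update rule itself — can be sketched with a deliberately simplified toy. Everything here (two-armed bandit environments, a single scalar step size as the only meta-parameter, a finite-difference meta-update) is an illustrative stand-in for the idea, not DeepMind's actual LPG implementation.

```python
# Toy sketch of the meta-learning loop: the outer loop tunes the
# meta-parameters of an update rule; the inner loop applies that rule
# to agents interacting with environments drawn from a distribution.
import numpy as np

rng = np.random.default_rng(0)

def sample_env():
    """Sample an environment: a two-armed bandit with random reward probs."""
    return rng.uniform(0.0, 1.0, size=2)

def inner_train(eta, env, steps=200):
    """Train a softmax agent on one environment using the (toy) update
    rule with step size `eta`; return the mean reward achieved."""
    logits = np.zeros(2)
    rewards = []
    for _ in range(steps):
        probs = np.exp(logits) / np.exp(logits).sum()
        a = rng.choice(2, p=probs)
        r = float(rng.random() < env[a])
        # REINFORCE-style policy-gradient step applied by the update rule.
        grad = -probs
        grad[a] += 1.0
        logits += eta * r * grad
        rewards.append(r)
    return np.mean(rewards)

def meta_train(meta_steps=20, eps=0.1):
    """Outer loop: nudge `eta` towards whichever perturbation earns
    more reward on freshly sampled environments (crude finite differences)."""
    eta = 0.1
    for _ in range(meta_steps):
        env = sample_env()
        hi = inner_train(eta + eps, env)
        lo = inner_train(max(eta - eps, 0.01), env)
        eta += 0.05 * np.sign(hi - lo)
        eta = float(np.clip(eta, 0.01, 2.0))
    return eta

learned_eta = meta_train()
print(learned_eta)
```

In the real framework the "update rule" is a neural network with many meta-parameters trained by meta-gradients, not a single step size tuned by finite differences; the sketch only shows the shape of the optimisation.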
Learned Policy Gradient (LPG)
The outcome of this approach is a reinforcement learning algorithm known as Learned Policy Gradient (LPG). LPG is an update rule, parameterised by meta-parameters, which requires the agent to produce both a policy and an m-dimensional categorical prediction vector. The update rule itself is a backward LSTM network, which updates both the policy and the prediction vector from the agent's trajectory.
Crucially, LPG leaves it to the update rule to decide what the agent's prediction vector should represent. The meta-learning framework discovers such update rules from a population of learning agents, each of which interacts with a different environment.
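A minimal numpy sketch can make the two pieces concrete: an agent that outputs a policy plus a categorical prediction vector y, and an update rule that scans the trajectory backwards, as LPG's backward LSTM does. The linear "heads" and the hand-rolled recurrent cell below are illustrative placeholders assumed for the sketch, not the paper's actual architecture (only the shapes and the backward scan reflect its description).

```python
# Sketch: agent outputs (policy, prediction vector y) and a backward
# scan over the trajectory, mimicking how an LPG-style update rule
# consumes an episode in reverse. Not DeepMind's actual code.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

n_actions, m, obs_dim = 4, 30, 8  # y is an m-dimensional categorical vector

# Agent parameters: one linear head for the policy, one for predictions.
W_pi = rng.normal(scale=0.1, size=(n_actions, obs_dim))
W_y = rng.normal(scale=0.1, size=(m, obs_dim))

def agent(obs):
    """The agent produces a policy and a prediction vector y; the update
    rule, not the designer, decides what y should come to mean."""
    policy = softmax(W_pi @ obs)
    y = softmax(W_y @ obs)
    return policy, y

def backward_scan(trajectory):
    """Process the trajectory in reverse, carrying a recurrent state from
    the end of the episode back to the start (as a backward LSTM would)."""
    h = np.zeros(m)
    targets = []
    for obs, reward in reversed(trajectory):
        _, y = agent(obs)
        h = np.tanh(0.9 * h + reward * y)  # stand-in recurrent cell
        targets.append(h)
    return targets[::-1]  # realign with forward time order

# A 5-step trajectory with a reward only at the final step.
traj = [(rng.normal(size=obs_dim), float(t == 4)) for t in range(5)]
targets = backward_scan(traj)
print(len(targets), targets[0].shape)
```

Scanning backwards lets information from late rewards reach targets for early timesteps in a single pass, which is how the learned rule can implement its own form of bootstrapping.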
Contributions Of This Research
The contributions made by the researchers in this project are listed below:
- This research made the first attempt to meta-learn a full RL update rule by jointly discovering both ‘what to predict’ and ‘how to bootstrap’.
- The approach discovered its own alternative to the concept of value functions.
- It discovers a bootstrapping mechanism to maintain and use its predictions.
- When trained solely on toy environments, LPG generalises effectively to complex Atari games and achieves non-trivial performance.
- The approach has the potential to accelerate the process of discovering new reinforcement learning algorithms.
The researchers claimed that this work is the first attempt to meta-learn a full RL update rule by jointly discovering both ‘what to predict’ and ‘how to bootstrap’, replacing existing RL concepts such as the value function and temporal-difference learning. The algorithm can discover useful prediction functions and use them effectively to update the agents' policies.
In terms of broader impact, the approach could dramatically accelerate the discovery of new reinforcement learning algorithms by automating the process in a data-driven way. At the same time, because of its data-driven nature, the resulting algorithm may capture unintended biases present in the training set of environments.
Furthermore, the proposed approach may serve as a tool to help reinforcement learning researchers develop and improve their hand-designed algorithms, by providing insight into what a good update rule looks like for the architecture they supply as input — potentially speeding up the manual discovery of RL algorithms.
Read the paper here.