
Behind DeepMind’s Framework That Discovers New Reinforcement Learning Algorithms


DeepMind recently introduced a new meta-learning approach that generates a reinforcement learning algorithm known as Learned Policy Gradient (LPG). According to the researchers, automating the discovery of update rules from data could lead to more efficient algorithms that could also be better adapted to specific environments.   

Reinforcement learning is the machine learning technique most often compared to the way animals learn from reward and punishment. The objective of reinforcement learning is to maximise the expected cumulative reward, or the average reward. The field has gained considerable traction among researchers and developers over the past few years.
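For readers less familiar with the setup, a minimal and purely illustrative Python snippet of the discounted cumulative-reward objective is shown below; the reward sequence and the discount factor gamma are made-up example values, not anything from the paper.

```python
# Minimal sketch of the discounted-return objective an RL agent maximises.
# The rewards and the discount factor gamma are hypothetical example values.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):    # work backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0]))  # reward arrives only at the final step
```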

Progress in reinforcement learning has relied on years of manual effort to design the rules by which an agent's parameters are updated. Automating the discovery of these update rules has long been a significant challenge, and researchers have been looking for alternatives to fundamental concepts of reinforcement learning such as value functions and temporal-difference learning. DeepMind researchers have now addressed this challenge with a new approach.
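To make the target of this automation concrete, here is a minimal sketch of one such hand-designed rule, a tabular TD(0) value-function update; the state names, step size and discount factor are illustrative assumptions, not code from the paper.

```python
# Hand-designed TD(0) update for a tabular value function V (illustrative only).
# alpha is the step size and gamma the discount factor; both are example values.
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    td_target = reward + gamma * V.get(next_state, 0.0)   # bootstrapped target
    td_error = td_target - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error        # move V towards the target
    return V

V = {}
V = td0_update(V, state="s0", reward=1.0, next_state="s1")
print(V)  # {'s0': 0.1}
```

It is update rules of this kind, written by hand over decades of research, that the new framework aims to discover automatically from data.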

Behind The Framework

According to the researchers, this new approach sheds light on an entirely new kind of update rule, one that specifies both 'what to predict', such as value functions, and 'how to learn from it', such as by bootstrapping, and is discovered by interacting with a set of environments.

They stated, “This paper takes a step towards discovering general-purpose RL algorithms. We introduce a meta-learning framework that jointly discovers both ‘what the agent should predict’ and ‘how to use predictions for policy improvement’ from data generated by interacting with a distribution of environments.”

The goal of this meta-learning approach is to find the optimal update rule for a distribution of environments and initial agent parameters. It automatically discovers reinforcement learning algorithms from data generated by interaction with a set of environments.
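A very rough way to picture this outer loop is the toy sketch below, which searches over the parameter of a simple inner update rule so that agents trained with it do well across a small distribution of bandit environments. The environments, the search-based meta-optimisation and every name in the snippet are assumptions for illustration; the paper itself meta-trains LPG with gradients rather than a grid search.

```python
import math
import random

# Toy sketch of meta-learning an update rule: search over the parameters of a
# simple inner update rule (here just a step size eta) so that agents trained
# with it earn high reward across a small distribution of environments.

def run_lifetime(eta, good_arm, steps=200):
    """Train a one-parameter policy on a two-armed bandit whose rewarding arm is good_arm."""
    pref = 0.0                                    # preference for arm 1 over arm 0
    rng = random.Random(0)
    for _ in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-pref))        # probability of pulling arm 1
        arm = 1 if rng.random() < p1 else 0
        reward = 1.0 if arm == good_arm else 0.0
        grad = (1.0 - p1) if arm == 1 else -p1    # REINFORCE-style log-prob gradient
        pref += eta * reward * grad               # the inner "update rule" being meta-learned
    return p1 if good_arm == 1 else 1.0 - p1      # final probability of the rewarding arm

best_eta, best_score = None, -1.0
for eta in [0.01, 0.1, 0.5, 1.0]:                 # crude meta-search over the rule's parameter
    score = sum(run_lifetime(eta, good_arm) for good_arm in [0, 1]) / 2.0
    if score > best_score:
        best_eta, best_score = eta, score
print("best step size:", best_eta, "avg prob of rewarding arm:", round(best_score, 3))
```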

Learned Policy Gradient (LPG)

The outcome of this approach is a reinforcement learning algorithm known as Learned Policy Gradient (LPG). LPG is an update rule, parameterised by meta-parameters, which requires an agent to produce a policy and an m-dimensional categorical prediction vector. The update rule itself is a backward LSTM network, which takes the agent's trajectory as input and outputs how the policy and the prediction vector should be updated.
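The shape of that computation can be pictured with the toy sketch below: a recurrent pass scans the trajectory backwards and emits, at every step, targets for the policy probability and the prediction vector. The simple hand-written cell stands in for the LSTM used in the paper, and the inputs, sizes and numbers are assumptions rather than the paper's actual architecture.

```python
# Toy sketch of the shape of the LPG computation: a recurrent network scans the
# trajectory *backwards* and, at every step, emits targets for the policy
# probability and the categorical prediction vector. The simple linear "cell"
# below is a stand-in for the paper's LSTM; inputs and sizes are assumptions.

def lpg_like_update_targets(trajectory, m=4):
    """trajectory: list of (reward, terminal, pi_a, y) tuples, y an m-vector."""
    h = [0.0] * m                                   # state carried backwards in time
    targets = []
    for reward, terminal, pi_a, y in reversed(trajectory):
        if terminal:
            h = [0.0] * m                           # reset at episode boundaries
        # toy recurrence mixing reward, current policy prob and prediction vector
        h = [0.5 * hi + 0.5 * (reward + yi) for hi, yi in zip(h, y)]
        pi_hat = min(1.0, pi_a + 0.1 * sum(h) / m)  # target for the policy probability
        y_hat = h[:]                                # target for the prediction vector
        targets.append((pi_hat, y_hat))
    return list(reversed(targets))                  # back to time order

traj = [(0.0, False, 0.5, [0.0] * 4), (1.0, True, 0.7, [0.0] * 4)]
print(lpg_like_update_targets(traj))
```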

Because the semantics of the prediction vector are not fixed in advance, LPG is free to decide what the agent's vector should be predicting. The meta-learning framework discovers such update rules from multiple learning agents, each of which interacts with a different environment.

Contributions Of This Research

The contributions made by the researchers in this project are listed below:

  • This research makes the first attempt to meta-learn a full RL update rule by jointly discovering both 'what to predict' and 'how to bootstrap'.
  • The approach discovered its own alternative to the concept of value functions.
  • It discovers a bootstrapping mechanism to maintain and use its predictions.
  • When trained solely on toy environments, LPG generalises effectively to complex Atari games and achieves non-trivial performance.
  • The approach has the potential to accelerate the process of discovering new reinforcement learning algorithms.

Wrapping Up

The researchers claimed that this work makes the first attempt to meta-learn a full RL update rule by discovering both 'what to predict' and 'how to bootstrap', replacing existing RL concepts such as value functions and temporal-difference learning. The discovered algorithm is able to find useful functions and use them effectively to update the agents' policies.

In terms of broader impact, the approach is said to have the potential to dramatically accelerate the discovery of new reinforcement learning algorithms by automating the process in a data-driven way. At the same time, because the approach is data-driven, the resulting algorithm may capture unintended biases present in the training set of environments.

Furthermore, the proposed approach may also serve as a tool to help reinforcement learning researchers develop and improve their hand-designed algorithms, by providing insights into what a good update rule looks like for the architectures they provide as input, which could speed up the manual discovery of RL algorithms.

Read the paper here.

Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.