DeepMind Finds New Approach To Create Faster Reinforcement Learning Models

Recently, researchers from DeepMind and McGill University proposed a new approach to speed up the solution of complex reinforcement learning problems. They introduced a divide-and-conquer approach to reinforcement learning (RL) which, combined with deep learning, is meant to scale up the capabilities of agents.

For a few years now, reinforcement learning has provided a conceptual framework for addressing several fundamental problems. It has been applied in a wide range of settings, such as controlling robots, simulating artificial limbs, developing self-driving cars, and playing games like poker and Go.

The recent combination of reinforcement learning with deep learning has also delivered several impressive achievements and is considered a promising approach to important sequential decision-making problems that are currently intractable. One obstacle, however, is the amount of data an RL agent needs to learn to perform a task.

Behind the Approach

In this work, the researchers argue that the range of problems RL agents can tackle could be significantly extended if the agents were endowed with appropriate mechanisms for leveraging prior knowledge. The framework rests on the premise that an RL problem can usually be decomposed into a multitude of “tasks.”

The researchers generalised two fundamental operations underlying much of RL, policy evaluation and policy improvement, from single to multiple operands: from one task to a set of tasks, and from one policy to a set of policies, respectively. According to them, this generalisation allows the solution of one task to speed up the solution of other tasks.


Generalised policy evaluation (GPE) is the computation of the value function of a policy on a set of tasks, while generalised policy improvement (GPI) is the computation of a policy whose performance on a task is no worse than that of any policy in a given set. The generalised versions of these two procedures are jointly referred to as “generalised policy updates.”
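To make the two operations concrete, the sketch below shows one way they can be expressed in NumPy. It assumes, purely for illustration, that the agent has already learned a table of successor features for each known policy, so that a policy's value on any task described by a preference vector w is a dot product; all names and shapes here are hypothetical and not taken from the paper's code.

```python
import numpy as np

# Hypothetical setup: psi[i][s, a] is a d-dimensional successor-feature vector
# for the i-th known policy, so its value on a task with reward weights w is
# Q_i(s, a) = psi[i][s, a] . w.
n_policies, n_states, n_actions, d = 2, 5, 3, 4
rng = np.random.default_rng(0)
psi = rng.random((n_policies, n_states, n_actions, d))

def gpe(psi, w):
    """Generalised policy evaluation: value of every known policy on task w."""
    return psi @ w  # shape (n_policies, n_states, n_actions)

def gpi_action(psi, w, state):
    """Generalised policy improvement: act greedily w.r.t. the best known policy."""
    q = gpe(psi, w)                                  # evaluate all policies on task w
    best_over_policies = q[:, state, :].max(axis=0)  # take the max over policies...
    return int(best_over_policies.argmax())          # ...then the greedy action

w_new = rng.random(d)  # preference vector describing a new task
print(gpi_action(psi, w_new, state=0))
```

In words, GPE evaluates every known policy on the new task at once, and GPI then behaves at least as well as the best of them by acting greedily over that set of value estimates.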

Generalised policy updates make it possible to reuse the solutions of previous tasks in two distinct ways:

  • When a task’s reward function can be approximated as a linear combination of the reward functions of other tasks, the reinforcement learning problem can be reduced to a simpler linear regression that is solvable with only a fraction of the data (see the sketch after this list).
  • When the linearity constraint is not satisfied, the agent can still leverage the solutions of previous tasks, in this case by using them to interact with and learn about the environment. This, too, can considerably reduce the amount of data needed to solve the problem.
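The first strategy amounts to fitting a preference vector by least squares. The snippet below is a minimal sketch with synthetic data; the variable names are hypothetical, and in practice the regressors would come from rewards (or reward features) observed while interacting with the environment.

```python
import numpy as np

# Synthetic data: base_rewards[:, k] is the reward the k-th previously solved
# task would have assigned to each observed transition; new_rewards is the
# reward actually observed for the new task on the same transitions.
rng = np.random.default_rng(1)
n_samples, n_base_tasks = 200, 4
base_rewards = rng.normal(size=(n_samples, n_base_tasks))
true_w = np.array([0.5, -1.0, 0.0, 2.0])  # unknown in practice
new_rewards = base_rewards @ true_w + 0.01 * rng.normal(size=n_samples)

# Least-squares fit of w such that r_new ≈ sum_k w_k * r_k. If the fit is
# good, the new task can be tackled through GPE/GPI without further RL training.
w_hat, *_ = np.linalg.lstsq(base_rewards, new_rewards, rcond=None)
print(w_hat)
```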

The researchers combined these two strategies to produce a divide-and-conquer approach to RL that can help scale agents to problems that are currently intractable, for instance because of a lack of data.

They stated, “If the reward function of a task can be well approximated as a linear combination of the reward functions of tasks previously solved, we can reduce a reinforcement-learning problem to a simpler linear regression.” 

They further added, “When this is not the case, the agent can still exploit the task solutions by using them to interact with and learn about the environment. Both strategies considerably reduce the amount of data needed to solve a reinforcement-learning problem.”

The Outcome

In the paper, the researchers showed how GPE and GPI can be implemented efficiently and discussed how their combination leads to a generalised policy whose behaviour is modulated by a vector of preferences.

The vector of preferences, in turn, can be computed as the solution of a linear regression problem. This reduces a reinforcement learning task to a much simpler problem that can be solved using only a fraction of the data.
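To see what “behaviour modulated by a vector of preferences” means in practice, the short sketch below (again with made-up successor features and hypothetical names) shows how the same GPI computation produces different greedy action choices as the preference vector changes.

```python
import numpy as np

# Illustrative only: the same set of successor features induces different
# behaviours as the preference vector w changes.
rng = np.random.default_rng(2)
psi = rng.random((3, 6, 2, 4))  # (n_policies, n_states, n_actions, d)

def generalised_policy(psi, w):
    """Greedy GPI policy for task w: per state, pick the action with the
    highest value under the best-performing known policy."""
    q = psi @ w                          # GPE over all known policies
    return q.max(axis=0).argmax(axis=1)  # one action index per state

print(generalised_policy(psi, np.array([1.0, 0.0, 0.0, 0.0])))  # value only feature 0
print(generalised_policy(psi, np.array([0.0, 0.0, 0.0, 1.0])))  # value only feature 3
```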

Wrapping Up

The researchers proposed a divide-and-conquer approach in which two fundamental operations in RL, policy evaluation and policy improvement, are generalised so that the solution of one task can be used to speed up the solution of a new reinforcement learning problem. The strategy is also claimed to improve sample efficiency when the mapping from states to preferences is simpler to learn than the corresponding policy.

The source code used to generate all of the data in this research is available on GitHub.
