DeepMind Introduces TayPO, A Policy Optimisation Framework For RL Algorithms

Recently, DeepMind collaborated with Columbia University to propose Taylor expansion Policy Optimisation (TayPO), which is a policy optimisation formalism that generalises methods like trust region policy optimisation (TRPO) and improves the performance of several state-of-the-art distributed algorithms.

Policy optimisation is one of the main approaches for deriving reinforcement learning algorithms. It has several successful applications in various challenging domains and is known to be a significant framework in model-free reinforcement learning (RL). 

Policy optimisation methods are centred around the policy, the function that maps the agent’s state to its next action. These methods view reinforcement learning as a numerical optimisation problem, where one optimises the expected reward with respect to the policy’s parameters. Framing the problem this way has driven significant gains in algorithmic performance.
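To make the "numerical optimisation" view concrete, here is a minimal sketch, assuming a toy problem with a single state and three actions with fixed rewards. The softmax parameterisation, the function names and the step size are illustrative assumptions and do not come from the paper.

```python
import math


def softmax(logits):
    """Turn policy parameters (logits) into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    """Objective J(theta): expected reward under the softmax policy."""
    probs = softmax(logits)
    return sum(p * r for p, r in zip(probs, rewards))


def gradient_ascent(rewards, steps=200, lr=0.5):
    """Maximise J(theta) by plain gradient ascent on the logits.

    For a softmax policy the exact gradient is
    dJ/dtheta_k = pi_k * (r_k - J).
    """
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        j = sum(p * r for p, r in zip(probs, rewards))
        grad = [p * (r - j) for p, r in zip(probs, rewards)]
        logits = [t + lr * g for t, g in zip(logits, grad)]
    return logits
```

Calling `gradient_ascent([1.0, 2.0, 3.0])` and passing the result to `softmax` should show the probability mass shifting towards the highest-reward action. Real RL problems replace the exact gradient with estimates from sampled trajectories.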

Among these improvements, the two most prominent are trust-region policy search and off-policy corrections, which focus on orthogonal aspects of policy optimisation.

In trust-region policy search, the idea is to constrain the size of policy updates, which limits the deviation between consecutive policies and lower-bounds the performance of the new policy. Off-policy corrections, on the other hand, account for the discrepancy between the target policy and the behaviour policy that generated the data. According to the researchers, both ideas have contributed significantly to stabilising policy optimisation.
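The trust-region idea can be sketched for a single state with discrete actions as follows. This is a simplified illustration, not the TRPO algorithm itself: it assumes the update is a proposed action distribution, and it shrinks the step towards that proposal until the KL divergence from the old policy falls under a threshold. The function names and the default values of `max_kl` and `shrink` are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf  # q assigns zero mass where p does not
        total += pi * math.log(pi / qi)
    return total


def trust_region_step(old_probs, proposed_probs, max_kl=0.01,
                      shrink=0.5, max_tries=20):
    """Move from old_probs towards proposed_probs, but only as far as
    the KL constraint allows (simple backtracking, an assumption here)."""
    step = 1.0
    for _ in range(max_tries):
        # A convex mixture of two distributions is still a distribution.
        candidate = [(1.0 - step) * o + step * n
                     for o, n in zip(old_probs, proposed_probs)]
        if kl_divergence(old_probs, candidate) <= max_kl:
            return candidate, step
        step *= shrink
    return list(old_probs), 0.0  # give up and keep the old policy
```

For example, `trust_region_step([0.5, 0.5], [0.9, 0.1])` rejects the full update and returns a policy much closer to the old one, which is the sense in which consecutive policies are kept from deviating too far.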

In this work, the researchers partially unify both algorithmic ideas in a single framework. They note that Taylor expansions, a ubiquitous approximation method, share high-level similarities with both trust-region policy search and off-policy corrections.

This fundamental connection between trust-region policy search and Taylor expansions is formalised in a general framework known as Taylor expansion policy optimisation (TayPO).

Behind TayPO

In policy optimisation, it is crucial that the update function, such as a policy gradient or a surrogate objective, can be evaluated with data sampled under the behaviour policy. According to the researchers, Taylor expansions are a natural paradigm to satisfy this requirement.

The researchers stated, “We start with a general result of applying Taylor expansions to Q-functions. When we apply the same technique to the RL objective, we reuse the general result and derive a higher-order policy optimisation objective.”

Taylor expansion policy optimisation (TayPO) is a general framework built on the observation that Taylor expansions share high-level similarities with both trust-region policy search and off-policy corrections. A Taylor expansion approximates a function near a chosen point using the terms of its Taylor series; keeping more terms gives a higher-order, and typically more accurate, approximation close to that point.
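As a reminder of how Taylor expansions behave, the sketch below approximates the exponential function around zero. The choice of function and the name `taylor_exp` are only for illustration and are unrelated to the paper's objectives.

```python
import math


def taylor_exp(x, order):
    """Taylor polynomial of exp(x) around 0, truncated at `order`.

    exp(x) ~ sum_{k=0}^{order} x^k / k!
    """
    return sum(x ** k / math.factorial(k) for k in range(order + 1))


def approximation_error(x, order):
    """Absolute gap between the truncated expansion and the true value."""
    return abs(taylor_exp(x, order) - math.exp(x))
```

Comparing `approximation_error(0.1, 1)` with `approximation_error(0.1, 2)` shows the higher-order expansion being more accurate, and comparing `approximation_error(0.1, 1)` with `approximation_error(1.0, 1)` shows the approximation degrading away from the expansion point. The same trade-off is why such approximations pair naturally with keeping the new policy close to the old one.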

In the paper, the researchers describe how Taylor expansions construct approximations to the full importance sampling (IS) corrections and show how they relate closely to established off-policy evaluation techniques. They stated, “The idea of Importance Sampling is the core of most off-policy evaluation techniques.”
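One way to picture this is to expand a product of per-step importance ratios around the point where the target and behaviour policies agree, so that every ratio equals one. The sketch below illustrates that general idea and is not necessarily the exact construction used in the paper; the function names are hypothetical, and `ratios` is assumed to hold the per-step values of target probability divided by behaviour probability.

```python
import math
from itertools import combinations


def full_is_weight(ratios):
    """Full importance sampling weight: the product of per-step ratios."""
    return math.prod(ratios)


def first_order_weight(ratios):
    """First-order expansion of the product around all ratios = 1:
    prod(1 + d_i) ~ 1 + sum(d_i), with d_i = ratio_i - 1."""
    return 1.0 + sum(r - 1.0 for r in ratios)


def second_order_weight(ratios):
    """Second-order expansion: adds the pairwise terms d_i * d_j, i < j."""
    deltas = [r - 1.0 for r in ratios]
    pairs = sum(a * b for a, b in combinations(deltas, 2))
    return 1.0 + sum(deltas) + pairs
```

With ratios close to one, such as `[1.05, 0.97, 1.02]`, the first-order value is already near the full product and the second-order value is nearer still. When the two policies differ a lot, the truncated expansions can drift far from the full weight.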

Contributions In This Paper

  • In this paper, the researchers investigated the application of Taylor expansions in reinforcement learning.
  • They proposed Taylor expansion policy optimisation (TayPO), a policy optimisation formalism that generalises prior methods such as TRPO.
  • The researchers also showed that Taylor expansions intimately relate to off-policy evaluation. 
  • Finally, the researchers showed that the new formulation, TayPO, entails modifications which improve the performance of several state-of-the-art distributed algorithms.

Wrapping Up

Taylor expansions naturally connect trust-region policy search with off-policy evaluation. According to the researchers, the new formulation unifies previous results, opens the door to new algorithms and brings significant gains to certain state-of-the-art deep reinforcement learning agents.

Read the paper here.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
