Recently, DeepMind collaborated with Columbia University to propose Taylor Expansion Policy Optimisation (TayPO), a policy optimisation formalism that generalises methods such as trust region policy optimisation (TRPO) and improves the performance of several state-of-the-art distributed algorithms.
Policy optimisation is one of the main approaches for deriving reinforcement learning algorithms. It has several successful applications in various challenging domains and is known to be a significant framework in model-free reinforcement learning (RL).
Policy optimisation methods are centred around the policy, the function that maps the agent’s state to its next action. These methods view reinforcement learning as a numerical optimisation problem in which the expected reward is optimised with respect to the policy’s parameters. Framing the problem this way has driven significant algorithmic performance gains.
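To make the "numerical optimisation" view concrete, here is a minimal sketch on a toy three-action problem with no state. The reward values, step size and function names are assumptions chosen for illustration, not anything taken from the paper. It performs plain gradient ascent on the expected reward of a softmax policy.

```python
import numpy as np

def softmax(theta):
    """Turn unconstrained parameters into action probabilities."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def expected_reward(theta, rewards):
    """J(theta) = sum_a pi_theta(a) * r(a)."""
    return float(softmax(theta) @ rewards)

def policy_gradient(theta, rewards):
    """Exact gradient of J for a softmax policy: pi(a) * (r(a) - J)."""
    pi = softmax(theta)
    return pi * (rewards - pi @ rewards)

rewards = np.array([1.0, 2.0, 3.0])  # hypothetical reward for each action
theta = np.zeros(3)                  # start from the uniform policy
initial = expected_reward(theta, rewards)

for _ in range(200):
    theta = theta + 0.5 * policy_gradient(theta, rewards)  # gradient ascent

final = expected_reward(theta, rewards)
print(initial, final)
```

The policy shifts probability towards the highest-reward action, so the expected reward rises from its value under the uniform policy. Real agents estimate this gradient from sampled trajectories rather than computing it exactly.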
Among these improvements, the two most prominent are trust-region policy search and off-policy corrections, which focus on orthogonal aspects of policy optimisation.
In trust-region policy search, the idea is to constrain the size of policy updates, which limits the deviation between consecutive policies and lower-bounds the performance of the new policy. Off-policy corrections, by contrast, account for the discrepancy between the target policy and the behaviour policy that generated the data. According to the researchers, both algorithmic ideas have contributed significantly to stabilising policy optimisation.
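Both ideas can be sketched in a few lines for a toy problem with three actions and no state. Everything here (the two policies, the reward table, the KL threshold and the function names) is a hypothetical illustration rather than the paper's setup. The trust-region check measures how far a new policy has moved from the old one, and the importance sampling estimate reweights behaviour-policy samples to evaluate the target policy.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def within_trust_region(old_policy, new_policy, max_kl):
    """Accept an update only if the new policy stays close to the old one."""
    return kl_divergence(old_policy, new_policy) <= max_kl

def importance_sampling_estimate(actions, rewards, target, behaviour):
    """Reweight behaviour-policy samples by pi(a) / mu(a)."""
    ratios = target[actions] / behaviour[actions]
    return float(np.mean(ratios * rewards))

rng = np.random.default_rng(0)
behaviour = np.array([0.5, 0.3, 0.2])     # mu: the policy that collected data
target = np.array([0.2, 0.3, 0.5])        # pi: the policy to be evaluated
reward_table = np.array([1.0, 2.0, 3.0])  # hypothetical reward per action

actions = rng.choice(3, size=100_000, p=behaviour)
rewards = reward_table[actions]

naive = float(rewards.mean())  # estimates the value of mu, not pi
corrected = importance_sampling_estimate(actions, rewards, target, behaviour)
true_value = float(target @ reward_table)

print(naive, corrected, true_value)
print(within_trust_region(behaviour, target, max_kl=0.01))
```

The uncorrected average reflects the behaviour policy, while the reweighted average approaches the target policy's true value. The two policies here are far apart, so the update fails the (arbitrary) KL threshold.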
In this work, the researchers partially unify both algorithmic ideas into a single framework. They observe that Taylor expansions, a ubiquitous approximation method, share high-level similarities with both trust-region policy search and off-policy corrections.
This fundamental connection between trust-region policy search and Taylor expansions is formalised in a general framework known as Taylor expansion policy optimisation (TayPO).
In policy optimisation, it is crucial that the update, such as a policy gradient or a surrogate objective, can be evaluated with data sampled under the behaviour policy. According to the researchers, Taylor expansions are a natural paradigm for satisfying this requirement.
The researchers stated, “We start with a general result of applying Taylor expansions to Q-functions. When we apply the same technique to the RL objective, we reuse the general result and derive a higher-order policy optimisation objective.”
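One way to picture a Taylor expansion of a Q-function is the sketch below, for a small random tabular MDP. It rests on a standard matrix identity rather than on the paper's own notation, and the MDP, the two policies and names such as `taylor_q` are assumptions made for illustration. Writing M_pi and M_mu for the state-action transition matrices of the target and behaviour policies, the target Q-function equals the sum over k of C^k Q_mu, where C = (I - gamma M_mu)^{-1} gamma (M_pi - M_mu). Truncating the sum gives a k-th order approximation expressed around the behaviour policy.

```python
import numpy as np

def state_action_transition(P, policy):
    """M[(s,a),(s',a')] = P[s,a,s'] * policy[s',a']."""
    S, A, _ = P.shape
    return np.einsum("ijk,kl->ijkl", P, policy).reshape(S * A, S * A)

def q_values(P, R, policy, gamma):
    """Exact Q-function: solve (I - gamma * M) q = r."""
    S, A, _ = P.shape
    M = state_action_transition(P, policy)
    return np.linalg.solve(np.eye(S * A) - gamma * M, R.reshape(S * A))

def taylor_q(P, R, target, behaviour, gamma, order):
    """Truncated expansion of Q^pi around Q^mu (hypothetical helper)."""
    S, A, _ = P.shape
    M_mu = state_action_transition(P, behaviour)
    M_pi = state_action_transition(P, target)
    resolvent = np.linalg.inv(np.eye(S * A) - gamma * M_mu)
    step = resolvent @ (gamma * (M_pi - M_mu))
    term = q_values(P, R, behaviour, gamma)  # zeroth-order term: Q^mu
    total = term.copy()
    for _ in range(order):
        term = step @ term                   # next-order correction
        total = total + term
    return total

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.8
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)            # normalise into probabilities
R = rng.random((S, A))
behaviour = np.full((S, A), 0.5)
target = np.tile([0.55, 0.45], (S, 1))       # deliberately close to behaviour

exact = q_values(P, R, target, gamma)
errors = [np.abs(taylor_q(P, R, target, behaviour, gamma, k) - exact).max()
          for k in (0, 1, 2, 10)]
print(errors)
```

The series only converges when the two policies are close enough, which is where the link to trust regions shows up. In this example the per-state difference between the policies is small relative to (1 - gamma) / gamma, so the truncation error shrinks as the order grows.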
TayPO is a general framework built on the observation that Taylor expansions share high-level similarities with both trust-region policy search and off-policy corrections. A Taylor expansion is a method, based on the Taylor series, for describing and approximating mathematical functions.
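For reference, the textbook form of a Taylor expansion of a sufficiently smooth function f around a point x_0, truncated at order K, is shown below. The symbols f, x_0, K and R_K are generic and are not taken from the paper.

```latex
f(x) = \sum_{k=0}^{K} \frac{f^{(k)}(x_0)}{k!}\,(x - x_0)^k + R_K(x)
```

Here R_K(x) is the remainder, which is small when x is close to x_0. Loosely speaking, in the policy setting the behaviour policy plays the role of the expansion point x_0 and the target policy the role of x, so the approximation is most accurate when the two policies are close.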
In the paper, the researchers describe how Taylor expansions construct approximations to the full importance sampling (IS) corrections and show how they are intimately related to established off-policy evaluation techniques. They stated, “The idea of Importance Sampling is the core of most off-policy evaluation techniques.”
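A rough illustration of approximating a full IS correction is given below. It is a sketch of the general idea, not the paper's exact construction, and the ratio values and function names are made up. The full IS weight along a trajectory is a product of per-step ratios. Writing each ratio as 1 + delta and expanding the product, the first-order approximation keeps only the sum of the deltas, the second-order one adds pairwise products, and so on.

```python
import numpy as np
from itertools import combinations

def full_is_weight(ratios):
    """Full importance sampling weight: product of per-step ratios."""
    return float(np.prod(ratios))

def taylor_is_weight(ratios, order):
    """Expand prod(1 + delta_t) and keep terms up to the given order."""
    deltas = np.asarray(ratios, dtype=float) - 1.0
    total = 1.0
    for k in range(1, order + 1):
        # k-th order term: sum of products over all k-subsets of the deltas
        total += sum(np.prod(c) for c in combinations(deltas, k))
    return float(total)

ratios = [1.05, 0.97, 1.02, 0.99]  # hypothetical per-step pi/mu ratios
exact = full_is_weight(ratios)
first = taylor_is_weight(ratios, order=1)
second = taylor_is_weight(ratios, order=2)
complete = taylor_is_weight(ratios, order=len(ratios))

print(exact, first, second, complete)
```

When the ratios are near 1, meaning the target and behaviour policies are close, the low-order approximations are already accurate. Keeping every order recovers the full product.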
Contributions In This Paper
- The researchers investigated the application of Taylor expansions in reinforcement learning.
- They proposed a policy optimisation formalism, Taylor expansion policy optimisation (TayPO), that generalises methods such as TRPO.
- They showed that Taylor expansions are intimately related to off-policy evaluation.
- Finally, they showed that the new formulation, TayPO, entails modifications that improve the performance of several state-of-the-art distributed algorithms.
Taylor expansions naturally connect trust-region policy search with off-policy evaluation. According to the researchers, this new formulation unifies previous results, opens the door to new algorithms and brings significant gains to certain state-of-the-art deep reinforcement learning agents.