The environment and its underlying dynamics of all the reinforcement learning problems are typically abstracted as a Markov decision process (MDP). Because MDPs are useful for studying optimisation problems solved via dynamic programming and reinforcement learning. Today, most of the effective existing reinforcement learning (RL) algorithms are rooted in this dynamic programming paradigm.
An alternative paradigm for RL is based on linear programming (LP). The researchers at Google, try to generalise this LP approach and have tried to demonstrate how relevant it is to RL.
A Brief Of Overview Of Duality
Duality or the duality principle is associated with the optimisation theory, which posits that optimisation problems can be perceived as the primal problem or the dual problem.
The solution to the dual problem provides a lower bound to the solution of the primal (minimisation) problem. The Fenchel-Rockafellar duality theorem (FRDT) is one of the many theorems on duality, named after mathematicians Werner Fenchel and R.T Rockafellar. This theorem is considered to be one of the most powerful tools in all of the convex analysis.
The concept of duality is a basic and powerful tool in optimisation and machine learning, especially in the field of convex analysis, allowing a researcher to easily reformulate optimisation problems in alternative ways that are potentially more tractable.
The paper states that the Fenchel-Rockafellar duality and its variants are at the heart of a number of recent RL algorithms, although many of these were originally presented through less generalisable derivations.
Duality Applied to RL
In practical scenarios such as robotic navigation in warehouses or on highways, there are multiple constraints that are difficult to keep track of. The researchers at Google with the help of duality try to address this question of how to learn to satisfy the tremendous number of constraints associated with arbitrary input?
Traditionally, RL implements Q-learning and actor-critic, which, the authors write, often find function approximation, bootstrapping, and off-policy learning to be challenging and the conventional methods obscure them through a series of rough approximations. This detaches the practical implementations of these algorithms from their mathematical foundations.
So, to test their new approach of duality, the researchers implemented a duality-based training on a navigational agent. The objective of the agent is to start at one corner of a multi-room map and navigate to the opposite corner. Based on the performance in this objective, the authors compared their algorithm to an actor-critic approach.
Although the underlying mathematical problem is the same in both cases, actor-critic uses a number of approximations due to the infeasibility of satisfying a large number of constraints.
Whereas, the algorithm based on duality, the authors claim, is better suited for practical implementation. According to the experimental results, the duality-based implementation achieved significantly higher reward compared to actor-critic.
Depending on the exact form of regularisation, wrote the authors, application of Fenchel-Rockafellar duality leads to objectives which hint at connections to actor-critic via squared Bellman error minimisation as well as max-likelihood policy learning.
According to the authors, the techniques of this new approach can be summarised as follows:
- When presented with a problem that appears difficult to solve, consider writing the problem as a constrained convex optimisation and solving its Fenchel-Rockafellar
dual, or its Lagrangian form
- If the dual is still difficult to solve (e.g. when the primal objective is linear, yielding
a dual with constraints), consider modifying the original objective, e.g., by applying
an appropriate convex regulariser.
This work attempts to formulate the well-known reinforcement learning problem as a mathematical objective with constraints. And, when convex duality is applied repeatedly in combination with a regulariser, an equivalent problem without constraints is obtained. When the problem becomes unconstrained then it enables easier implementation in real world settings.
The researchers conclude that their algorithms are not only more mathematically principled than existing RL methods, but they also often yield better practical performance, showing the value of unifying mathematical principles with practical implementation.