Reinforcement learning (RL) algorithm designers often hard-code use cases into the system because the environment in which an agent operates is usually chaotic. For RL to be adopted widely, the algorithms need to be smarter. To be precise, they should learn on their own to the point where they can pick the better reward function when given a choice for the same task.
However, the reward functions for most real-world tasks are difficult or impossible to procedurally specify. Even a task as simple as peg insertion from pixels has a non-trivial reward function that must usually be learned. Most real-world tasks have far more complex reward functions than this. In particular, tasks involving human interaction depend on complex and user-dependent preferences.
Reinforcement learning is founded on the observation that it is usually easier and more robust to specify a reward function, rather than a policy maximising that reward function. Applying this insight to reward function analysis, the researchers at UC Berkeley and DeepMind developed methods to compare reward functions directly, without training a policy.
A Novel Way Of Quantifying Rewards
Prior work usually evaluates the learned reward function using the “rollout method”, where a policy is trained to optimise the reward function. Unfortunately, this method is computationally expensive because it requires us to solve an RL problem.
Furthermore, the rollout method produces false negatives when the reward matches user preferences but the RL algorithm fails to maximise it. The rollout method also produces false positives. Of the many reward functions inducing the desired rollout in a given environment, only a small subset aligns with the user’s preferences. If the initial state distribution or transition dynamics change, misaligned rewards may induce undesirable policies.
To tackle this, the researchers introduce the Equivalent-Policy Invariant Comparison (EPIC). Before we go further into the details of EPIC and its significance to reinforcement learning, the reader has to be familiar with the concepts of metric spaces and potential-based shaping. There are two important definitions, to begin with:
- Pseudometric: A metric is a function that defines a distance between any two members of a set, which are usually called points. A pseudometric is a generalisation of a metric in which the distance between two distinct points can be zero.
- Potential-Based Shaping (PBS): Potential-based reward shaping adds a term of the form γΦ(s′) − Φ(s) to the reward, where Φ is a potential function over states and γ is the discount factor. It is typically used to speed up learning by injecting extra knowledge, for example knowledge extracted from previously learned tasks and transferred to a target task. Because the added term leaves the optimal policy unchanged, two reward functions that differ only by such shaping are equivalent for the purposes of the task.
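To make the shaping definition concrete, here is a minimal sketch on a made-up four-state chain. The environment, the potential values in `PHI`, and the helper names (`step`, `base_reward`, `shaped_reward`, `greedy_policy`) are all illustrative assumptions and do not come from the paper. The sketch runs plain value iteration on the base reward and on a shaped version of it, and checks that both give the same greedy policy.

```python
GAMMA = 0.9
STATES = range(4)
ACTIONS = (0, 1)  # 0 = move left, 1 = move right (hypothetical toy chain)


def step(s, a):
    """Deterministic transition on a 4-state chain with walls at both ends."""
    return max(0, s - 1) if a == 0 else min(3, s + 1)


def base_reward(s, a, s_next):
    """Reward of 1 whenever the agent lands in (or stays in) state 3."""
    return 1.0 if s_next == 3 else 0.0


# Arbitrary potential over states, chosen only for illustration.
PHI = [0.0, 5.0, -2.0, 1.0]


def shaped_reward(s, a, s_next):
    """Base reward plus the potential-based shaping term."""
    return base_reward(s, a, s_next) + GAMMA * PHI[s_next] - PHI[s]


def greedy_policy(reward, iterations=500):
    """Value iteration followed by a greedy action choice in every state."""
    def q(s, a, v):
        s_next = step(s, a)
        return reward(s, a, s_next) + GAMMA * v[s_next]

    v = [0.0] * len(STATES)
    for _ in range(iterations):
        v = [max(q(s, a, v) for a in ACTIONS) for s in STATES]
    return [max(ACTIONS, key=lambda a: q(s, a, v)) for s in STATES]


if __name__ == "__main__":
    print("base policy:  ", greedy_policy(base_reward))
    print("shaped policy:", greedy_policy(shaped_reward))
```

The individual reward values differ a great deal between the two functions, yet the resulting behaviour is identical. That is why a reward comparison should treat them as the same point.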
In the case of EPIC, the authors define a pseudometric that first canonicalises each reward function, removing the effect of potential-based shaping, and then compares the canonicalised rewards using Pearson distance.
Pearson distance between two random variables X and Y is calculated as follows:
D_ρ(X, Y) = √(1 − ρ(X, Y)) / √2
where ρ(X, Y) is the Pearson correlation between X and Y.
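As a rough illustration, the sketch below computes Pearson distance from paired samples. It assumes the form D_ρ(X, Y) = √(1 − ρ(X, Y)) / √2, which maps a correlation of 1 to a distance of 0 and a correlation of −1 to a distance of 1. The function names are hypothetical, and the correlation is estimated from a finite sample rather than from a true distribution.

```python
import math


def pearson_correlation(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pearson_distance(xs, ys):
    """sqrt(1 - rho) / sqrt(2), clamped to guard against rounding error."""
    rho = pearson_correlation(xs, ys)
    return math.sqrt(max(0.0, 1.0 - rho)) / math.sqrt(2.0)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    # A positive rescaling and shift of xs: distance is approximately 0.
    print(pearson_distance(xs, [3.0 * x + 7.0 for x in xs]))
    # The negation of xs: distance is approximately 1.
    print(pearson_distance(xs, [-x for x in xs]))
```

Because correlation ignores positive rescaling and constant offsets, two reward signals that differ only in scale or baseline end up at distance zero.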
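EPIC distance is defined using Pearson distance as follows:
D_EPIC(R_A, R_B) = D_ρ(C(R_A)(S, A, S′), C(R_B)(S, A, S′))
where D is a distance, R_A and R_B are the two reward functions being compared, C(R) is the canonicalised version of reward R (with potential-based shaping removed), S is the current state, A is the action performed, and S′ is the next state.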
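The following is a minimal sketch of an EPIC-style computation for small tabular rewards, not the authors' implementation. It assumes uniform distributions over a made-up set of 3 states and 2 actions, and it assumes the canonicalisation takes the form C(R)(s, a, s′) = R(s, a, s′) + E[γR(s′, A, S′) − R(s, A, S′) − γR(S, A, S′)], with the expectation taken over those uniform distributions. The names `canonicalize` and `epic_distance` are hypothetical.

```python
import math
import random

GAMMA = 0.9
STATES = range(3)   # hypothetical toy state space
ACTIONS = range(2)  # hypothetical toy action space


def pearson_distance(xs, ys):
    """sqrt(1 - rho) / sqrt(2) for the sample correlation rho of xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    rho = cov / math.sqrt(vx * vy)
    return math.sqrt(max(0.0, 1.0 - rho)) / math.sqrt(2.0)


def canonicalize(reward):
    """Remove potential-based shaping from a tabular reward.

    `reward` maps (s, a, s_next) to a float. Expectations are taken over
    uniform distributions on STATES and ACTIONS (an assumption of this sketch).
    """
    def mean_from(s):
        # E[R(s, A, S')] with A and S' drawn uniformly.
        total = sum(reward[s, a, s2] for a in ACTIONS for s2 in STATES)
        return total / (len(ACTIONS) * len(STATES))

    mean_all = sum(mean_from(s) for s in STATES) / len(STATES)
    return {
        (s, a, s2): r + GAMMA * mean_from(s2) - mean_from(s) - GAMMA * mean_all
        for (s, a, s2), r in reward.items()
    }


def epic_distance(reward_a, reward_b):
    """Pearson distance between canonicalised rewards over all transitions."""
    canon_a, canon_b = canonicalize(reward_a), canonicalize(reward_b)
    keys = sorted(canon_a)
    return pearson_distance([canon_a[k] for k in keys],
                            [canon_b[k] for k in keys])


random.seed(0)
base = {(s, a, s2): random.uniform(-1.0, 1.0)
        for s in STATES for a in ACTIONS for s2 in STATES}
phi = [random.uniform(-5.0, 5.0) for _ in STATES]
# Same preferences as `base`: rescaled by 2 and shaped with potential `phi`.
shaped = {(s, a, s2): 2.0 * r + GAMMA * phi[s2] - phi[s]
          for (s, a, s2), r in base.items()}
# Opposite preferences.
opposite = {k: -r for k, r in base.items()}

if __name__ == "__main__":
    print("base vs shaped:  ", epic_distance(base, shaped))
    print("base vs opposite:", epic_distance(base, opposite))
```

In this toy setting the shaped and rescaled reward sits at a distance of roughly zero from the original, while the negated reward sits at roughly one. No policy is trained at any point.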
The distance calculated using this approach can then be used to predict the outcome of training a policy on a given reward function, without actually running that training. The authors claim that this works even in an unseen test environment.
This work introduces novel ways of evaluating reward functions for reinforcement learning tasks. However, the underlying principles of this new method are founded in mathematics (metric spaces, topologies), the explanation of which is beyond the scope of this article. The contribution of this work can be summarised as follows:
- Current reward learning algorithms have considerable limitations
- The distance between reward functions is a highly informative addition for evaluation
- EPIC distance compares reward functions directly, without training a policy
For further understanding, read the original work: Link to the paper.