“Specifying a reward function can be one of the trickiest parts of applying RL to a problem.” — DeepMind
At the heart of a successful reinforcement learning (RL) algorithm sits a well-specified reward function. Yet for most real-world tasks, reward functions are difficult to specify procedurally; tasks involving human interaction, in particular, depend on complex and user-dependent preferences. A popular belief within the RL community is that it is usually easier and more robust to specify a reward function than to specify a policy maximising that reward function.
Today there are many techniques to learn a reward function from data sources as varied as the initial state, demonstrations, corrections and preference comparisons. A group of researchers from DeepMind, Berkeley and OpenAI has introduced EPIC, a new way to evaluate reward functions and reward learning algorithms.
Evaluating a learned reward function by training a policy on it is computationally expensive and unreliable: if the policy performs poorly, you can’t tell whether the learned reward failed to match user preferences or the RL algorithm failed to optimise the learned reward. Equivalent-Policy Invariant Comparison (EPIC) instead compares reward functions directly, without training a policy. According to the researchers, EPIC is a fast and reliable way to compute the similarity of two reward functions, and it can be used to benchmark reward learning algorithms by comparing learned reward functions to a ground-truth reward.
In a paper titled “Quantifying Differences in Reward Functions”, recently accepted at the prestigious ICLR conference, the researchers claim EPIC is up to 1,000 times faster than alternative evaluation methods and requires little to no hyperparameter tuning. They also showed that reward functions judged as similar by EPIC induce policies with similar returns, even in unseen environments.
How EPIC works:
- As shown in the illustration above, EPIC compares reward functions Rᵤ and Rᵥ by first mapping them to canonical representatives.
- It then computes the Pearson distance between the canonical representatives on a coverage distribution 𝒟.
Note: the Pearson distance between two random variables X and Y is defined as Dρ(X, Y) = √((1 − ρ(X, Y)) / 2), where ρ(X, Y) is the Pearson correlation between X and Y.
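The Pearson distance, √((1 − ρ) / 2), can be sketched in a few lines of NumPy. This is an illustrative helper, not the authors' implementation; the sample data below is made up:

```python
import numpy as np

def pearson_distance(x, y):
    """Pearson distance sqrt((1 - rho(X, Y)) / 2), ranging from 0 to 1."""
    rho = np.corrcoef(x, y)[0, 1]
    # Clamp to guard against rho drifting just above 1 due to float error.
    return np.sqrt(max(0.0, 1.0 - rho) / 2.0)

# Rewards sampled from two reward functions over the same transitions.
# r_b is a positive affine transformation of r_a, so rho = 1 and the
# Pearson distance is (numerically) zero.
rng = np.random.default_rng(0)
r_a = rng.normal(size=1000)
r_b = 3.0 * r_a + 5.0
print(pearson_distance(r_a, r_b))   # ≈ 0, up to floating-point noise
print(pearson_distance(r_a, -r_a))  # = 1: perfect anti-correlation
```

Note how a positive affine transformation leaves the distance at zero, which is exactly the invariance EPIC relies on.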
The EPIC distance is then defined using the Pearson distance as D(Rᵤ, Rᵥ) = Dρ(C(Rᵤ)(S, A, S′), C(Rᵥ)(S, A, S′)), where D is the distance, Rᵤ and Rᵥ are the rewards, C maps a reward function to its canonical representative, S is the current state, A the action performed and S′ the next state, with transitions (S, A, S′) drawn from the coverage distribution 𝒟.
The distance calculated this way can then be used to predict the outcome of deploying a given reward function. According to the researchers, canonicalisation removes the effect of potential shaping, while the Pearson distance is invariant to positive affine transformations of the rewards.
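The canonicalise-then-compare pipeline can be sketched for a small tabular reward. This is a simplified sketch assuming uniform coverage distributions over states and actions, not the authors' implementation; the reward array and potential below are invented for illustration:

```python
import numpy as np

def canonicalize(R, gamma):
    """Canonically shaped reward for a tabular R[s, a, s'].

    C(R)(s,a,s') = R(s,a,s') + E[g*R(s',A,S')] - E[R(s,A,S')] - g*E[R(S,A,S')],
    with S, S' uniform over states and A uniform over actions (all independent).
    This cancels any potential-shaping term g*phi(s') - phi(s).
    """
    mean_from = R.mean(axis=(1, 2))  # E_{A,S'}[R(s, A, S')] for each state s
    mean_all = R.mean()              # E_{S,A,S'}[R(S, A, S')]
    return (R
            + gamma * mean_from[None, None, :]   # evaluated at next state s'
            - mean_from[:, None, None]           # evaluated at current state s
            - gamma * mean_all)

def epic_distance(Ra, Rb, gamma=0.99):
    """Pearson distance between canonicalized rewards over all transitions."""
    ca = canonicalize(Ra, gamma).ravel()
    cb = canonicalize(Rb, gamma).ravel()
    rho = np.corrcoef(ca, cb)[0, 1]
    return np.sqrt(max(0.0, 1.0 - rho) / 2.0)

# A reward and a rescaled, potential-shaped version of it: the two induce
# the same optimal policies, and their EPIC distance is (numerically) zero.
rng = np.random.default_rng(0)
R = rng.normal(size=(6, 4, 6))   # R[s, a, s'] on 6 states, 4 actions
phi = rng.normal(size=6)         # an arbitrary shaping potential
gamma = 0.9
R_shaped = 2.0 * R + gamma * phi[None, None, :] - phi[:, None, None]
print(epic_distance(R, R_shaped, gamma))  # ≈ 0, up to floating-point noise
```

Because canonicalisation is linear in the reward, shaping terms cancel exactly, and the Pearson distance then ignores the remaining positive rescaling.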
RL agents strive to maximise reward. That is fine as long as high reward is found only in states the user finds desirable. In systems deployed in the real world, however, there may be undesired shortcuts to high reward, including the agent tampering with the very process that determines its reward: the reward function. Consider a self-driving car: a positive reward may be given once it reaches the correct destination, and negative rewards for breaking traffic rules or causing accidents. In a two-player game, by contrast, the reward is non-zero only at the end of an episode, where it is −1, 0 or 1 depending on who won. Either way, in any practically implemented system, agent reward may not coincide with user utility. Standardised metrics are an important driver of progress in machine learning, but traditional policy-based metrics do not guarantee the fidelity of the learned reward function.
To evaluate EPIC, the researchers developed two alternatives as baselines: Episode Return Correlation (ERC) and Nearest Point in Equivalence Class (NPEC). Comparing procedurally specified reward functions across four tasks, they found EPIC to be more reliable than both NPEC and ERC, and more computationally efficient than NPEC. The experiments showed that EPIC correctly infers zero distance between equivalent reward functions that the NPEC and ERC baselines wrongly consider dissimilar.
Reinforcement learning has never got the attention it deserves, largely because of its areas of application. Unlike a typical convolutional neural network used for photo tagging on social media, RL’s use cases, such as self-driving and robotics for medical surgeries, are safety-critical. This makes reward function evaluation all the more important. Given the uncertain nature of the real world, there can’t be enough stress tests for an RL algorithm, and the evaluation of reward functions is a good place to start. As RL is increasingly applied to complex, user-facing applications such as recommender systems, chatbots and autonomous vehicles, reward function evaluation will need more attention. Since there are various techniques to specify a reward function, the researchers believe EPIC can play a crucial role here.
Key takeaways:
- Current reward learning algorithms have considerable limitations
- The distance between reward functions is a highly informative addition for evaluation
- EPIC distance compares reward functions directly, without training a policy.
- EPIC is fast, reliable and predicts policy returns even in unseen deployment environments.
EPIC is now available as an open-source library; check out the GitHub repo.