Finding A Better Reward Function In Reinforcement Learning

Reinforcement learning (RL) algorithm designers often hard-code use cases into the system because the environment in which an agent operates is usually chaotic. For RL to be adopted widely, the algorithms need to be smarter. More precisely, they should learn to the point where they can pick the better reward function when given a choice for the same task.

However, the reward functions for most real-world tasks are difficult or impossible to procedurally specify. Even a task as simple as peg insertion from pixels has a non-trivial reward function that must usually be learned. Most real-world tasks have far more complex reward functions than this. In particular, tasks involving human interaction depend on complex and user-dependent preferences.

Reinforcement learning is founded on the observation that it is usually easier and more robust to specify a reward function, rather than a policy maximising that reward function. Applying this insight to reward function analysis, the researchers at UC Berkeley and DeepMind developed methods to compare reward functions directly, without training a policy.

A Novel Way Of Quantifying Rewards

Prior work usually evaluates the learned reward function using the “rollout method”, where a policy is trained to optimise the reward function. Unfortunately, this method is computationally expensive because it requires us to solve an RL problem.

Furthermore, the rollout method produces false negatives when the reward matches user preferences but the RL algorithm fails to maximise it. The rollout method also produces false positives: of the many reward functions that induce the desired rollout in a given environment, only a small subset aligns with the user’s preferences. If the initial state distribution or transition dynamics change, misaligned rewards may induce undesirable policies.

To tackle this, the researchers introduce the Equivalent-Policy Invariant Comparison (EPIC). Before we go further into the details of EPIC and its significance to reinforcement learning, the reader has to be familiar with the concepts of metric spaces and potential-based shaping. There are two important definitions, to begin with:

  • Pseudometric: A metric is a function that defines a distance between any two members of a set, which are usually called points. A pseudometric generalises this by allowing the distance between two distinct points to be zero.
  • Potential-Based Shaping (PBS): Potential-based reward shaping adds a term of the form γΦ(s′) − Φ(s) to the reward, where Φ is a potential function over states, γ is the discount factor, s is the current state and s′ is the next state. Shaping is typically used to speed up learning by encoding extra knowledge about the task, and it leaves the optimal policy unchanged. Two rewards that differ only by such a term are therefore equivalent for the purpose of finding the best policy.
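
To make the shaping definition concrete, here is a minimal sketch in Python (NumPy) on a small random tabular MDP. The MDP, the potential `phi`, and the helper `value_iteration` are all made up for illustration and are not taken from the paper. The sketch adds γΦ(s′) − Φ(s) to a reward table and checks that the greedy policy found by value iteration stays the same.

```python
import numpy as np

def value_iteration(P, R, gamma, iters=500):
    """Return the greedy policy for a tabular MDP.

    P[s, a, s2] holds transition probabilities, R[s, a, s2] holds rewards.
    """
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        # Q[s, a] = sum over s2 of P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Hypothetical random MDP: normalise each row into a probability distribution.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_states, n_actions, n_states))

# Arbitrary potential function over states (an assumption for this demo).
phi = rng.normal(size=n_states)

# Shaped reward: R'(s, a, s2) = R(s, a, s2) + gamma * phi(s2) - phi(s)
R_shaped = R + gamma * phi[None, None, :] - phi[:, None, None]

policy_original = value_iteration(P, R, gamma)
policy_shaped = value_iteration(P, R_shaped, gamma)
print(policy_original, policy_shaped)
```

The two printed policies should agree, since shaping only shifts each state's action values by the same amount, −Φ(s). This is why a sensible distance between reward functions should treat a reward and its shaped version as identical.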

In the case of EPIC, the authors define a pseudometric that first removes any potential-based shaping from each reward function, bringing it to a canonical form, and then compares the canonicalised rewards using Pearson distance.

Pearson distance between two random variables X and Y is calculated as follows:

D_ρ(X, Y) = √(1 − ρ(X, Y)) / √2

where ρ(X, Y) is the Pearson correlation between X and Y.
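
As a quick illustration, the sketch below computes Pearson distance with NumPy, using the form √(1 − ρ) / √2 given in the EPIC paper. The function name `pearson_distance` and the sample data are assumptions made for this example. It also shows the pseudometric property: two distinct variables related by a positive affine transformation are at distance zero.

```python
import numpy as np

def pearson_distance(x, y):
    """Pearson distance sqrt(1 - rho) / sqrt(2), which lies in [0, 1]."""
    rho = np.corrcoef(x, y)[0, 1]
    # Guard against tiny negative values caused by floating-point rounding.
    return np.sqrt(max(0.0, 1.0 - rho)) / np.sqrt(2.0)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

d_same = pearson_distance(x, 2.0 * x + 3.0)              # positively rescaled copy
d_opposite = pearson_distance(x, -x)                     # perfectly anti-correlated
d_unrelated = pearson_distance(x, rng.normal(size=1000)) # independent samples

print(d_same, d_opposite, d_unrelated)
```

The first distance is close to 0 and the second close to 1. For independent samples the correlation is near zero, so the distance lands near 1/√2.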

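EPIC distance is defined using Pearson distance as follows:

D_EPIC(R_A, R_B) = D_ρ(C(R_A)(S, A, S′), C(R_B)(S, A, S′))

where D_EPIC is the distance between the reward functions R_A and R_B, C(R) is the canonical form of reward R with potential-based shaping removed, S is the current state, A is the action performed, and S′ is the next state.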
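
The following is a rough sketch of how this could look for small tabular rewards, not the authors' implementation. It assumes rewards are stored as arrays `R[s, a, s2]` and that states and actions are sampled uniformly. The paper defines the expectations over general coverage distributions. The canonicalisation follows the paper's definition, C(R)(s, a, s′) = R(s, a, s′) + E[γR(s′, A, S′) − R(s, A, S′) − γR(S, A, S′)], and all function names here are hypothetical.

```python
import numpy as np

def canonicalize(R, gamma):
    """Remove potential shaping from a tabular reward R[s, a, s2].

    Assumes uniform distributions over states and actions.
    """
    mean_from = R.mean(axis=(1, 2))  # E[R(x, A, S')] for each state x
    mean_all = R.mean()              # E[R(S, A, S')]
    return (R
            + gamma * mean_from[None, None, :]  # gamma * E[R(s', A, S')]
            - mean_from[:, None, None]          # E[R(s, A, S')]
            - gamma * mean_all)                 # gamma * E[R(S, A, S')]

def pearson_distance(x, y):
    rho = np.corrcoef(x, y)[0, 1]
    return np.sqrt(max(0.0, 1.0 - rho)) / np.sqrt(2.0)

def epic_distance(R_a, R_b, gamma):
    """Pearson distance between canonicalised rewards over all (s, a, s2)."""
    return pearson_distance(canonicalize(R_a, gamma).ravel(),
                            canonicalize(R_b, gamma).ravel())

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 4, 0.9
R = rng.normal(size=(n_states, n_actions, n_states))

# An equivalent reward: positively rescaled and potential-shaped version of R.
phi = rng.normal(size=n_states)
R_equiv = 3.0 * (R + gamma * phi[None, None, :] - phi[:, None, None])

d_equiv = epic_distance(R, R_equiv, gamma)   # same optimal policies
d_opposite = epic_distance(R, -R, gamma)     # opposite preferences
d_random = epic_distance(R, rng.normal(size=R.shape), gamma)

print(d_equiv, d_opposite, d_random)
```

In this sketch the rescaled and shaped reward comes out at a distance of roughly zero from the original, while the negated reward sits at roughly one. No policy is trained at any point, which is the main practical appeal of the approach.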

The distance calculated using this approach can then be used to predict the outcome of using a certain reward function. The authors claim that this works even in an unseen test environment.

Key Takeaways

This work introduces novel ways of evaluating reward functions for reinforcement learning tasks. However, the underlying principles of this new method are founded in mathematics (metric spaces, topologies), the explanation of which is beyond the scope of this article. The contribution of this work can be summarised as follows:

  • Current reward learning algorithms have considerable limitations
  • The distance between reward functions is a highly informative addition for evaluation
  • EPIC distance compares reward functions directly, without training a policy

For further understanding, read the original work: Link to the paper.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
