
Finding A Better Reward Function In Reinforcement Learning


Reinforcement learning (RL) algorithm designers often tend to hard-code use cases into the system because the environment in which an agent operates is usually chaotic. For RL to be adopted widely, the algorithms need to be more clever. To be precise, these algorithms should self-learn to the point where, given a choice for the same task, they can pick the better reward function.

However, the reward functions for most real-world tasks are difficult or impossible to procedurally specify. Even a task as simple as peg insertion from pixels has a non-trivial reward function that must usually be learned. Most real-world tasks have far more complex reward functions than this. In particular, tasks involving human interaction depend on complex and user-dependent preferences.

Reinforcement learning is founded on the observation that it is usually easier and more robust to specify a reward function than a policy that maximises it. Applying this insight to reward function analysis, researchers at UC Berkeley and DeepMind developed methods to compare reward functions directly, without training a policy.

A Novel Way Of Quantifying Rewards

Prior work usually evaluates a learned reward function using the "rollout method", where a policy is trained to optimise the reward function. Unfortunately, this method is computationally expensive because it requires us to solve an RL problem.

Furthermore, the rollout method produces false negatives when the reward matches user preferences but the RL algorithm fails to maximise it. The rollout method also produces false positives: of the many reward functions inducing the desired rollouts in a given environment, only a small subset aligns with the user's preferences. If the initial state distribution or the transition dynamics change, these misaligned rewards may induce undesirable policies.

To tackle this, the researchers introduce the Equivalent-Policy Invariant Comparison (EPIC). Before going further into the details of EPIC and its significance to reinforcement learning, the reader should be familiar with the concepts of metric spaces and potential-based shaping. Two definitions are important to begin with:

  • Pseudometric: A metric is a function that defines a notion of distance between any two members of a set, usually called points. A pseudometric space is a generalisation of a metric space in which the distance between two distinct points is allowed to be zero.
  • Potential-Based Shaping (PBS): Potential-based reward shaping (PBRS) aims to improve the learning speed of an RL agent by adding extra knowledge to the reward while it performs a task, typically knowledge extracted from previously learned tasks and transferred to the target task. Crucially, adding a shaping term of the form F(s, s′) = γΦ(s′) − Φ(s) changes the reward's numerical values but leaves the optimal policy unchanged, which is why EPIC is designed to be invariant to it (see the sketch after this list).
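As a minimal sketch (in Python, with illustrative names not taken from the paper), potential-based shaping simply adds γΦ(s′) − Φ(s) on top of an existing reward:

```python
GAMMA = 0.99  # illustrative discount factor

def shape(reward_fn, potential_fn, gamma=GAMMA):
    """Return R'(s, a, s') = R(s, a, s') + gamma * Phi(s') - Phi(s).

    The added term is potential-based shaping: it changes the reward's
    numerical values but provably leaves the optimal policy unchanged
    (Ng, Harada and Russell, 1999), so a reward-comparison metric should
    treat R and R' as equivalent.
    """
    def shaped(s, a, s_next):
        return reward_fn(s, a, s_next) + gamma * potential_fn(s_next) - potential_fn(s)
    return shaped

# Toy example on integer states: the task is to reach state 3.
base_reward = lambda s, a, s_next: float(s_next == 3)
potential = lambda s: -abs(3 - s)              # an arbitrary potential function
shaped_reward = shape(base_reward, potential)
print(base_reward(2, 0, 3), shaped_reward(2, 0, 3))  # 1.0 vs 2.0: same task, different numbers
```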

In the case of EPIC, the authors use a pseudometric: each reward function is first canonicalised to remove potential-based shaping, and the resulting canonicalised rewards are then compared using the Pearson distance.

The Pearson distance between two random variables X and Y is calculated as follows:

D_ρ(X, Y) = √((1 − ρ(X, Y)) / 2)

where ρ(X, Y) is the Pearson correlation between X and Y.
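This quantity is easy to compute from samples. The snippet below is only an illustration of the formula using NumPy, not the authors' implementation:

```python
import numpy as np

def pearson_distance(x, y):
    """Pearson distance D_rho(X, Y) = sqrt((1 - rho(X, Y)) / 2).

    The distance is 0 for perfectly positively correlated variables and
    1 for perfectly anti-correlated ones; positive affine rescaling of
    either argument leaves it unchanged.
    """
    rho = np.clip(np.corrcoef(x, y)[0, 1], -1.0, 1.0)
    return np.sqrt((1.0 - rho) / 2.0)

x = np.array([1.0, 2.0, 3.0, 4.0])
print(pearson_distance(x, 2 * x + 5))  # 0.0: rescaled and shifted copy
print(pearson_distance(x, -x))         # 1.0: perfectly anti-correlated
```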

The EPIC distance is then defined using the Pearson distance as follows:

D_EPIC(R_A, R_B) = D_ρ(C(R_A)(S, A, S′), C(R_B)(S, A, S′))

where D_ρ is the Pearson distance, R_A and R_B are the two reward functions being compared, C(·) is the canonicalisation that removes potential-based shaping, S is the current state, A is the action performed, and S′ is the next (changed) state, each sampled from fixed distributions.
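The sketch below shows how such a distance could be computed from samples. It assumes discrete, scalar states and actions supplied as 1-D NumPy arrays, reuses the supplied batch as the coverage distributions over states and actions, and is meant to illustrate the idea rather than reproduce the authors' implementation:

```python
import numpy as np

def canonicalise(reward_fn, states, actions, next_states, gamma=0.99,
                 n_mean=256, seed=0):
    """Sample-based canonicalisation that removes potential-based shaping.

    Approximates C(R)(s, a, s') = R(s, a, s')
        + E[gamma * R(s', A, S') - R(s, A, S') - gamma * R(S, A, S')],
    with S, S' drawn from a state distribution and A from an action
    distribution (here: resampled uniformly from the given batch).
    """
    rng = np.random.default_rng(seed)
    S = rng.choice(states, size=n_mean)        # S  ~ D_S
    A = rng.choice(actions, size=n_mean)       # A  ~ D_A
    Sp = rng.choice(states, size=n_mean)       # S' ~ D_S

    def mean_from(s):
        # Estimate E[R(s, A, S')] for a fixed state s.
        return np.mean([reward_fn(s, a, sp) for a, sp in zip(A, Sp)])

    baseline = np.mean([reward_fn(s, a, sp) for s, a, sp in zip(S, A, Sp)])
    return np.array([
        reward_fn(s, a, sp) + gamma * mean_from(sp) - mean_from(s) - gamma * baseline
        for s, a, sp in zip(states, actions, next_states)
    ])

def epic_distance(reward_a, reward_b, states, actions, next_states, gamma=0.99):
    """Pearson distance between the canonicalised samples of two rewards."""
    ca = canonicalise(reward_a, states, actions, next_states, gamma)
    cb = canonicalise(reward_b, states, actions, next_states, gamma)
    rho = np.clip(np.corrcoef(ca, cb)[0, 1], -1.0, 1.0)
    return np.sqrt((1.0 - rho) / 2.0)

# Example: a reward and a potential-shaped version of it are ~0 apart.
states = np.arange(6.0)
actions = np.zeros_like(states)               # a single dummy action
next_states = np.minimum(states + 1.0, 5.0)   # walk right towards state 5
R = lambda s, a, sp: float(sp == 5)
Phi = lambda s: -abs(5 - s)
R_shaped = lambda s, a, sp: R(s, a, sp) + 0.99 * Phi(sp) - Phi(s)
print(epic_distance(R, R, states, actions, next_states))         # 0.0
print(epic_distance(R, R_shaped, states, actions, next_states))  # ~0.0
```

Because the canonicalisation strips out potential-based shaping before the comparison, two rewards that differ only by shaping, and therefore induce the same optimal policies, end up at a distance close to zero.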

The distance calculated using this approach can then be used to predict the outcome of optimising a given reward function. The authors claim that this prediction holds even in an unseen test environment.

Key Takeaways

This work introduces novel ways of evaluating reward functions for reinforcement learning tasks. However, the underlying principles of this new method are founded in mathematics (metric spaces, topologies), the explanation of which is beyond the scope of this article. The contribution of this work can be summarised as follows:

  • Current reward learning algorithms have considerable limitations
  • The distance between reward functions is a highly informative addition for evaluation
  • EPIC distance compares reward functions directly, without training a policy

For further understanding, read the original work: Link to the paper.
