Can RL Agents Behave More Human-Like Without Relying On Task Rewards?

Recently, a team of researchers from Google Brain, the Vector Institute, and the University of Toronto showed that the entropy, information gain, and empowerment of reinforcement learning agents correlate strongly with a human behaviour similarity metric.

In the past few years, reinforcement learning has achieved notable successes on complex problems. From mastering board games to robotic manipulation, RL agents have accomplished these tasks by optimising manually defined reward functions.

However, according to the researchers, designing informative reward functions is often expensive, time-consuming, and prone to human error, and these difficulties grow with the complexity of the task of interest. Inspired by the learning abilities of natural agents such as children, and to mitigate these problems, the researchers studied several common types of intrinsic motivation.

The Mechanism Behind

To accelerate the development of intrinsic objectives, the researchers computed potential objectives on pre-collected datasets of agent behaviour rather than optimising them online, and compared them by analysing their correlations. They focused on three types of intrinsic motivation: mathematical objectives for reinforcement learning agents that do not depend on a specific task and can be applied in any unknown environment.

They are as follows:

  • Input Entropy: Input entropy encourages the agent to encounter rare sensory inputs, as measured by a learned density model.
  • Information Gain: Information gain rewards the agent for discovering the rules of its environment.
  • Empowerment: Empowerment rewards the agent for maximising its influence over its sensory inputs and environment.
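As a rough illustration of the first objective, input entropy can be sketched with a simple empirical density model over discrete observations. This is a toy stand-in for the learned density models the paper describes, not the authors' implementation; the function and data here are hypothetical:

```python
import numpy as np
from collections import Counter

def input_entropy_rewards(observations):
    """Score each observation o by -log p_hat(o), where p_hat is an
    empirical density model fit to the dataset. Rare observations
    therefore receive a higher intrinsic reward than common ones."""
    counts = Counter(observations)
    total = len(observations)
    return [-np.log(counts[o] / total) for o in observations]

# Toy dataset: observation "a" is common, "b" is rare.
observations = ["a", "a", "a", "b"]
rewards = input_entropy_rewards(observations)
# The rare observation "b" earns a larger reward than the common "a".
```

In practice, the density model would be a learned neural model over high-dimensional inputs rather than a count table, but the reward shape, negative log-probability, is the same.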

It is worth noting that designing intrinsic objectives that yield intelligent behaviour across diverse environments remains a crucial unsolved problem in reinforcement learning. Moreover, training agents to evaluate different intrinsic objectives is a slow and expensive process.

To address this problem, the researchers collected a diverse dataset of 26 agents in four complex environments in order to compare task reward, similarity to human players, and three representative intrinsic objectives.

With the dataset collected, they analysed the correlations between the intrinsic objectives and supervised objectives such as task reward and human similarity. That is, the researchers computed the three intrinsic objectives on each dataset and analysed their correlations with each other, with task reward, and with a human similarity objective.
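The correlation analysis can be sketched as follows. The per-episode scores below are made-up illustrative numbers, not the paper's data, and the variable names are hypothetical:

```python
import numpy as np

# Hypothetical per-episode scores, computed offline on the same
# pre-collected episodes: an intrinsic objective (e.g. input entropy),
# a human-similarity score, and the environment's task reward.
intrinsic_scores = np.array([0.2, 0.5, 0.9, 1.4, 1.8])
human_similarity = np.array([0.1, 0.4, 0.8, 1.2, 1.9])
task_reward = np.array([1.0, 0.2, 0.9, 0.3, 0.6])

def pearson(x, y):
    """Pearson correlation coefficient between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# A strong positive correlation suggests the intrinsic objective
# tracks human-like behaviour on this dataset.
r_human = pearson(intrinsic_scores, human_similarity)
r_task = pearson(intrinsic_scores, task_reward)
```

Because the objectives are scored on fixed datasets rather than optimised online, comparing correlations like these is far cheaper than training a separate agent for every candidate objective.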

The RL Environment

For the learning environments, the researchers chose three Atari games provided by the Arcade Learning Environment: Breakout, Seaquest, and Montezuma's Revenge. Additionally, they used the Minecraft Treechop environment provided by MineRL. The agents represented in the dataset include trivial agents (random and no-op) as well as learning algorithms such as PPO, RND, and ICM.

Benefits of This Research

This research offers several findings that could help create machines that act more efficiently and, most importantly, more like humans. Some of them are:

  • The intrinsic objectives correlate more strongly with human similarity across all studied environments than the manual task reward does. To develop agents that behave similarly to human players, intrinsic objectives may therefore be more relevant than typical task rewards.
  • The implementations of the intrinsic objectives could lead to effective exploration when optimised online and could serve as evaluation metrics when task rewards and demonstrations are unavailable.
  • Optimising the third intrinsic objective, empowerment, together with either of the other two objectives could be beneficial for designing exploration methods.

Wrapping Up

The reward function is central to how current reinforcement learning approaches complete complex tasks. This research showed the other side of the coin, where RL agents pursue goals without relying on any task reward.

Studying the three intrinsic objectives, the researchers found that all three correlate more strongly with a human behaviour similarity metric than with task reward, and that input entropy and information gain in particular correlate more strongly with human similarity than task reward does. This suggests that intrinsic objectives can be used to design agents that behave similarly to human players.

Read the paper here.


Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
