Top Exploration Methods In Reinforcement Learning

In their survey, researchers from the Mila AI Institute, Quebec, have departed from the traditional categorisation of reinforcement learning (RL) exploration methods and offered a broader treatment of the field.

Over the years, many exploration methods have been formulated using a range of mathematical approaches. RL exploration techniques have traditionally been categorised into undirected and directed methods, based on the information the exploration algorithm considers.

(Image credits: Survey by Amin et al.)

Instead, the researchers segregate exploration methods based on rewards, memory, and more. In the next sections, we briefly discuss a few of the most important exploration methods listed above.

Reward-Free vs Reward-Based Exploration

According to the survey, reward-free techniques either select the agent's actions randomly or use some notion of intrinsic information to guide exploration, without taking extrinsic rewards into account. Reward-free exploration comes in handy in environments where the reward signal is not immediately available to the agent. In comparison, reward-based exploration methods leverage information related to the reward signal; these methods are further categorised by the type of information used and how it informs the selection of exploratory actions.

Memory-free vs Memory-based Exploration

Memory-free exploration methods take only the current state of the environment into account, whereas memory-based methods consider additional information about the history of the agent's interaction with the environment. For example, DeepMind's Agent57, which set a new benchmark on Atari games last year, employs episodic memory in its RL policy.

Blind Exploration

Blind exploration methods explore environments via random action selection. The agents are not guided along their exploratory path by any form of information, and are therefore categorised as uninformed or blind.
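A blind policy of this kind can be sketched in a few lines. The helper below (the names are our own illustration, not code from the survey) draws actions uniformly at random from a discrete action set, ignoring the state entirely:

```python
import random

def blind_policy(n_actions, rng=random):
    """Uniform random action selection: no state, value, or reward information is used."""
    return rng.randrange(n_actions)

# Over many draws, every action is chosen roughly equally often.
random.seed(0)
counts = [0] * 4
for _ in range(10_000):
    counts[blind_policy(4)] += 1
```

Because no feedback is used, such a policy never concentrates on promising actions; it is the simplest possible baseline.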

Intrinsically-Motivated Exploration

As part of reward-free exploration, intrinsically motivated exploration methods utilise an intrinsic motivation to explore the unexplored parts of the environment. Unlike blind exploration, intrinsically motivated exploration techniques utilise some form of intrinsic information to encourage exploring the state-action spaces in the absence of external rewards.
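One common intrinsic signal is state-visitation novelty. The sketch below is a hypothetical example, not code from the survey: it emits an intrinsic reward that decays as a state is revisited, nudging the agent towards unexplored states even when no extrinsic reward is available.

```python
import math
from collections import defaultdict

class CountNovelty:
    """Intrinsic reward 1/sqrt(N(s)) that decays with the visitation count of a state."""
    def __init__(self):
        self.counts = defaultdict(int)

    def intrinsic_reward(self, state):
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])

novelty = CountNovelty()
rewards = [novelty.intrinsic_reward("s0") for _ in range(4)]  # monotonically decaying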

Value-Based Methods

This exploration approach selects stochastic actions based on a value function or rewards from the environment. These methods use the value function to decide whether the preferred action should favour knowledge acquisition (exploration) or reward maximisation (exploitation).
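The classic value-based example is epsilon-greedy selection. The sketch below (our own illustration) exploits the current value estimates with probability 1 - epsilon and explores a random action otherwise:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action (knowledge acquisition);
    otherwise pick the greedy action under the current value estimates."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With epsilon = 0 the agent always exploits; with epsilon = 1 it behaves like a blind explorer.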

Policy-Search Based Methods

Unlike value-based methods, policy-search based methods, as the name suggests, explicitly represent a policy instead of, or in addition to, a value function. Most policy search methods learn a stochastic policy. The initialisation of the exploration policy can be freely chosen. In some policy architectures, the amount of exploration is fixed to some constant or decreased according to a set schedule.
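A set schedule for the amount of exploration can be illustrated as follows (a hypothetical sketch, not the survey's code): an exploration parameter, say the policy's noise scale, is annealed linearly from a start value down to a floor.

```python
def exploration_schedule(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal an exploration parameter according to a set schedule."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

Early in training the policy explores heavily; after `decay_steps` the parameter stays at its floor value.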

Randomised Action Selection Exploration Methods

Randomised exploration methods assign action selection probabilities to the possible actions based on the estimated value functions/rewards or policies, akin to Value-Based Exploration and Policy-Search Based Exploration. 
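A standard instance is Boltzmann (softmax) action selection, sketched below with hypothetical names: each action's selection probability grows with its estimated value, and a temperature parameter controls how random the choice is.

```python
import math

def boltzmann_probs(q_values, temperature=1.0):
    """Selection probabilities proportional to exp(Q/T); lower T is greedier."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]
```

At high temperature the distribution approaches uniform random selection; as the temperature falls, probability mass concentrates on the highest-valued action.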

Optimism/Bonus-Based Exploration

In these methods, actions with uncertain values are preferred over the rest of the possible actions. As the name suggests, they usually involve a form of bonus that is added to the extrinsic reward, so the augmented reward motivates exploration of the environment.
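A simple count-based version of such a bonus can be sketched as follows (an illustrative example, not the survey's method): the reward handed to the learner is the extrinsic reward plus beta / sqrt(N(s, a)), so rarely tried state-action pairs look optimistically good.

```python
import math
from collections import defaultdict

class BonusReward:
    """Add an optimism bonus beta / sqrt(N(s, a)) to the extrinsic reward."""
    def __init__(self, beta=0.5):
        self.beta = beta
        self.counts = defaultdict(int)

    def shaped(self, state, action, extrinsic_reward):
        self.counts[(state, action)] += 1
        return extrinsic_reward + self.beta / math.sqrt(self.counts[(state, action)])
```

As a pair is visited more often, its bonus shrinks and the shaped reward converges to the extrinsic one.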

Deliberate Exploration 

Deliberate exploration deals with Bayes-adaptive exploration methods. As per the survey, deliberate exploration requires assuming a prior over the transition dynamics, computing a posterior distribution over models, and updating it as experience is gathered. This category also includes meta-learning based exploration techniques, via which the agent learns to adapt quickly using previously seen tasks.

Probability Matching 

This exploration method chooses an action by sampling a single instance from the posterior belief over environments or value functions and solving for that sampled environment exactly. The agent then acts in accordance with that solution, so each action is taken with the probability that the agent believes it to be the optimal action.
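The best-known instance of probability matching is Thompson sampling. For a Bernoulli bandit it can be sketched as below (the names are our own): keep a Beta posterior per arm, sample one value from each posterior, and play the arm whose sample is highest.

```python
import random

def thompson_sample(successes, failures, rng=random):
    """Sample one estimate per arm from its Beta posterior and act greedily on the samples."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)
```

Each arm is then pulled with exactly the posterior probability that it is the best arm, which is the probability-matching property described above.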

Meta-Learning Based Methods

In meta-learning-based RL, agents interact with multiple training Markov Decision Processes (MDPs), allowing them to learn an exploration strategy. According to the survey, meta-reinforcement learning strategies have the potential to learn an approximately optimal exploration-exploitation trade-off with respect to the distribution of MDPs.

Learn more about the exploration methods in this survey.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
