Over the years, many exploration methods have been formulated by incorporating mathematical approaches. Reinforcement Learning (RL) exploration techniques have generally been categorised into undirected and directed methods, based on the information the exploration algorithm takes into account.
However, in their survey, researchers from Mila AI Institute, Quebec, depart from this traditional categorisation and instead organise RL exploration methods by criteria such as rewards, memory, and more. In the sections below, we briefly discuss a few of the most important exploration methods covered in the survey.
Reward-Free vs Reward-Based Exploration
According to the survey, reward-free techniques either select an agent’s actions randomly or use some notion of intrinsic information to guide exploration, without taking extrinsic rewards into account. Reward-free exploration comes in handy in environments where the reward signal is not immediately available to the agent. In comparison, reward-based exploration methods leverage information related to the reward signal. These methods are further categorised by the type of information used and by how it is used to select exploratory actions.
Memory-Free vs Memory-Based Exploration
Memory-free exploration methods take only the current state of the environment into account, whereas memory-based methods also consider information about the history of the agent’s interaction with the environment. For example, DeepMind’s Agent57, which set a new benchmark for Atari games last year, employed episodic memory in its RL policy.
Blind Exploration
Blind exploration methods explore environments via random action selection. The agents are not guided along their exploratory path by any form of information. Thus, they are categorised as uninformed or blind.
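As a rough illustration, blind exploration can be sketched as a policy that draws actions uniformly at random. The 1-D chain environment and the function names below are hypothetical and are not taken from the survey.

```python
import random

def blind_action(actions, rng=random):
    """Pick an action uniformly at random, ignoring state, reward, and history."""
    return rng.choice(actions)

def random_walk(n_steps, start=0, seed=0):
    """Roll out the blind policy on a hypothetical 1-D chain; return visited states."""
    rng = random.Random(seed)
    state, visited = start, [start]
    for _ in range(n_steps):
        # The agent moves one step left or right with equal probability.
        state += blind_action([-1, +1], rng)
        visited.append(state)
    return visited
```

Calling `random_walk(50)` returns the 51 states visited. Nothing in the action choice depends on what the agent has seen so far.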
Intrinsically-Motivated Exploration
As part of reward-free exploration, intrinsically motivated exploration methods use some form of intrinsic information, rather than external rewards, to encourage the agent to explore unvisited parts of the state-action space. This guidance is what sets them apart from blind exploration.
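One common choice of intrinsic signal is novelty measured by visit counts. The sketch below assumes hashable states and a bonus of `scale / sqrt(count)`. Both the class name and the bonus form are illustrative assumptions, not the survey's definition.

```python
import math
from collections import defaultdict

class CountNovelty:
    """Intrinsic reward that shrinks as a state is visited more often."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.counts = defaultdict(int)  # visit count per state

    def reward(self, state):
        # Register the visit, then return a bonus of scale / sqrt(visits).
        self.counts[state] += 1
        return self.scale / math.sqrt(self.counts[state])
```

A state seen for the first time yields the full bonus. Repeated visits yield progressively less, which pushes the agent toward less familiar states even when the environment gives no reward.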
Value-Based Methods
This exploration approach selects actions stochastically based on value functions or rewards from the environment. These methods use the value functions to decide whether the preferred action should serve knowledge acquisition or reward maximisation.
Policy-Search Based Methods
Unlike value-based methods, policy-search based methods, as the name suggests, explicitly represent a policy instead of, or in addition to, a value function. Most policy search methods learn a stochastic policy. The initialisation of the exploration policy can be freely chosen. In some policy architectures, the amount of exploration is fixed to some constant or decreased according to a set schedule.
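To make the idea of a scheduled amount of exploration concrete, here is a minimal sketch assuming a Gaussian policy over a single continuous action, with a standard deviation that decays linearly. The function names, default values, and the linear schedule are all assumptions chosen for illustration.

```python
import random

def exploration_std(step, initial=1.0, final=0.1, decay_steps=1000):
    """Linearly anneal the policy's standard deviation from initial to final."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return initial + frac * (final - initial)

def sample_gaussian_action(mean, step, rng=random):
    """Sample from a Gaussian policy centred on the policy's mean action."""
    return rng.gauss(mean, exploration_std(step))
```

Holding `exploration_std` at a constant value instead would correspond to the fixed-exploration case.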
Randomised Action Selection Exploration Methods
Randomised exploration methods assign action selection probabilities to the possible actions based on the estimated value functions/rewards or policies, akin to Value-Based Exploration and Policy-Search Based Exploration.
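A familiar instance of this idea is Boltzmann (softmax) action selection over estimated action values. The sketch below is one possible implementation. The `temperature` parameter and the function names are assumptions introduced here.

```python
import math
import random

def softmax_probs(values, temperature=1.0):
    """Turn estimated action values into selection probabilities."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann_action(values, temperature=1.0, rng=random):
    """Sample an action index with probability given by the softmax of its value."""
    probs = softmax_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]
```

A high temperature makes the choice close to uniform. A low temperature makes it close to greedy.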
Optimism/Bonus-Based Exploration
In this method, actions with uncertain values are preferred over the rest of the possible actions. As the name suggests, these exploration methods usually involve a form of bonus, which is added to the reward. Bonus-based techniques utilise an extrinsic reward for motivating the exploration of the environment.
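As an example of such a bonus, the sketch below uses a UCB1-style term for a multi-armed bandit. The bonus is added to each action's estimated mean reward and shrinks as the action is tried more often. The constant `c` and the function names are illustrative assumptions.

```python
import math

def ucb_scores(mean_rewards, counts, total_steps, c=2.0):
    """Estimated mean reward plus an uncertainty bonus for each action."""
    scores = []
    for mean, n in zip(mean_rewards, counts):
        if n == 0:
            # Untried actions are treated as maximally uncertain.
            scores.append(math.inf)
        else:
            scores.append(mean + math.sqrt(c * math.log(total_steps) / n))
    return scores

def ucb_action(mean_rewards, counts, total_steps, c=2.0):
    """Pick the action with the highest optimistic score."""
    scores = ucb_scores(mean_rewards, counts, total_steps, c)
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal estimated means, the less frequently tried action receives the larger bonus and is preferred.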
Deliberate Exploration
Deliberate exploration deals with Bayes-Adaptive exploration methods. As per the survey, deliberate exploration requires assuming a prior over the transition dynamics, computing a posterior distribution over models, and updating that posterior as the agent gathers experience. This category also includes Meta-Learning Based Exploration techniques, through which the agent learns to adapt quickly by using previously given tasks.
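The belief-update part of this idea can be sketched for a small tabular problem, assuming an independent Dirichlet prior over next states for each state-action pair. The class below is a hypothetical illustration and covers only the posterior update, not the planning over beliefs that Bayes-Adaptive methods also perform.

```python
from collections import defaultdict

class DirichletTransitionModel:
    """Posterior over tabular transition dynamics under a symmetric Dirichlet prior."""

    def __init__(self, n_states, prior=1.0):
        self.n_states = n_states
        self.prior = prior  # pseudo-count assigned to every next state
        self.counts = defaultdict(lambda: [0] * n_states)

    def update(self, state, action, next_state):
        # Observing a transition adds one count to the matching Dirichlet parameter.
        self.counts[(state, action)][next_state] += 1

    def posterior_mean(self, state, action):
        """Expected next-state distribution under the current posterior."""
        alphas = [self.prior + c for c in self.counts[(state, action)]]
        total = sum(alphas)
        return [a / total for a in alphas]
```

Before any data, `posterior_mean` returns the uniform distribution implied by the prior. Each observed transition shifts it toward the observed next state.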
Probability Matching
This exploration method decides an action by sampling a single instance from the posterior belief over environments or value functions (feedback) and solving for that sampled environment exactly. The agent then acts in accordance with that solution. Each action is thus taken with the probability that the agent considers it to be the optimal action.
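Thompson sampling on a Bernoulli bandit is a standard small-scale instance of probability matching. The sketch below assumes a uniform Beta(1, 1) prior on each arm's success probability. In a bandit, "solving" the sampled environment reduces to picking the arm with the highest sampled value. The function name is an assumption.

```python
import random

def thompson_action(successes, failures, rng=random):
    """Sample one success rate per arm from its Beta posterior, then act greedily on the sample."""
    samples = [
        rng.betavariate(s + 1, f + 1)  # Beta(1, 1) prior plus observed counts
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=samples.__getitem__)
```

With no observations, each arm is chosen about equally often. As evidence accumulates, the arm that is more likely to be optimal is chosen proportionally more often.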
Meta-Learning Based Methods
In meta-learning-based RL, agents interact with multiple training Markov Decision Processes (MDPs), allowing them to learn an exploration strategy. According to the survey, meta-reinforcement learning strategies have the potential to learn an approximately optimal exploration-exploitation trade-off with respect to the distribution of MDPs they are trained on.
Learn more about the exploration methods in this survey.