A classical approach to any reinforcement learning (RL) problem is to explore and to exploit. Explore the most rewarding way that reaches the target and keep on exploiting a certain action; exploration is hard. Without proper reward functions, the algorithms can end up chasing their own tails to eternity. When we say rewards, think of them as mathematical functions crafted carefully to nudge the algorithm. To be more precise, consider teaching a robotic arm or an AI playing a strategic game like Go or Chess to reach a target on its own.
Over the years, many exploration strategies have been formulated by incorporating mathematical approaches. In the next section, we list popular exploration strategies used in reinforcement learning models.
Curiosity Based Exploration
First introduced by Dr Juergen Schmidhuber in 1991, curiosity in RL models was implemented through a framework of curious neural controllers. This described how a particular algorithm can be driven by curiosity and boredom. This was done by introducing (delayed) reinforcement for actions that increase the model network’s knowledge about the world. This, in turn, requires the model network to model its own ignorance, thus showing a rudimentary form of self-introspective behaviour.
As the name suggests, the objective of this approach is to identify a potential way and keep on exploiting it ‘greedily’. This approach is popularly associated with the multi-arm bandit problem, a simplified RL problem where the agent has to find the best slot machine to make more money. The agent randomly explores with probability ϵ and takes the optimal action most of the time with probability 1−ϵ.
A machine with the highest current average payout is selected with
probability = (1 – epsilon) + (epsilon / k)
And, machines that don’t have the highest current payout average are selected with
probability = epsilon / k.
Over time, the best paying machine will be played more and more often.
Upper Confidence Bound
Upper confidence bound Q^t(a)+U^t(a), where Q^t(a) is the average rewards associated with action a up to time t and U^t(a) is a function reversely proportional to how many times action a has been taken. Maximising this upper confidence bound is a strategy employed by the agents to move towards the goal. The popular AlphaGo zero program of DeepMind used Monte Carlo Tree Search, which, in turn, uses a neural network to guide the simulations. Each simulation in the Go game iteratively selects moves that maximise the upper confidence bound.
If we consider the canonical example of slot machines again, an arm produces a random payout drawn independently of the past.
Because the distribution of payouts corresponding to each arm is not listed, the player (read agent) can learn it only by experimenting. By continuing to explore alternative arms, an agent may learn how to earn higher payouts in future — making sure that the agent is not stuck with one strategy and exploring others while exploiting one.
Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The agent keeps track of the probability of optimal actions and samples from this distribution.
At each time step, an action is taken using the following probabilistic function (via Lilian Weng):
Where π(a|h_t) is the probability of taking action, given the history h_t
Inspired by Thompson sampling, Bootstrapped deep Q networks DQN introduced a notion of uncertainty in Q-value approximation in classic DQN by using the bootstrapping method. Bootstrapping is to approximate a distribution by sampling with replacement from the same population multiple times and then aggregate the results.
Within Reinforcement Learning, exponential weighting schemes are broadly used for balancing exploration and exploitation, and are equivalently referred to as Boltzmann, Gibbs, or softmax exploration policies. In the most common version of Boltzmann exploration, the probability of determining an arm is proportional to an exponential function of the empirical mean of the reward of that arm, which is denoted as follows:
Memory-based exploration strategies were introduced to resolve the disadvantages of intrinsic motivation or reward-based reinforcement learning. Rewards in varying environments can be inadequate in real time scenarios. DeepMind’s Agent57, which set a new benchmark for Atari games recently, employed episodic memory in their RL policy. Agent57 is built on an NGU (never give up) agent, which combines curiosity-driven exploration and distributed deep RL agents to compute an intrinsic reward in order to encourage exploration. This reward is defined by combining per episode and life-long novelty. The per-episode novelty rapidly vanishes over the course of an episode, and it is computed by comparing observations to the contents of episodic memory.
Apart from Agent57, there have been other works on RL without the disadvantages of intrinsic motivations. One such is the Multi Agent Reinforcement Learning (MARL) by MIT and DeepMind, that encourages coordination among agents within an environment and learns from each other.
Know more about exploration strategies in detail here.