Now Reading
Why Solving Multi-armed Bandit Problem Is Significant For Reinforcement Learning

Why Solving Multi-armed Bandit Problem Is Significant For Reinforcement Learning

The objective of any algorithm is to find patterns in the training data that map the input data attributes to the target (the answer to be predicted), and an ML model that captures these patterns.

A typical ML pipeline consists of a sequence of components; components which are a compilation of computations. Data is sent through these components and is manipulated with the help of computation.

Reinforcement learning is one popular ML algorithm right now. With the applications knowing no bounds, the developers continue to tweak it for better results.

Multi-armed bandit problem is one such challenge that reinforcement learning poses to the developers.

Also known as k- or N-bandit problem, it deals with the allocation of resources when there are multiple options with not much information about the options. This problem can also be categorised as being a part of stochastic scheduling; scheduling that deals with the random nature of real-world events.

The origin of this problem was based on defining the strategies for a player at the slot machines (one-armed bandit) in the casino. The player pulls the lever to operate the machine with a display. The display usually gives patterns of fruits or other flashy symbols. These symbols have their rewards(cash). The objective of the player is to make more money by finding a high pay-off pattern. These machines are called bandits because the chance of player losing their money is more. So the player has little upside and almost guaranteed downside.

Just like in real-world problems involving markets or diagnosis, the risk is high and the model that does the analysis should be reliable enough to thwart the anomalies.

One way to do is to have more information which is a looming problem with building machine learning models as the incoming data is unstructured, non-linear and involves outliers.

To find the right combination and the right machine is very important for the player in the casino. The player will do better if they are tipped with better performing machines. So, in a hypothetical scenario, a player can win big if they can operate all the high rewarding machines with multiple levers(arms).

Variants Of The Bandit Problem

It is important to know what these bandit algorithms are competing with. For example, in K-armed bandits, the agents are competing with the arm or the lever that has the highest expected reward.

Whereas, in contextual bandits with expert advice,  it is crucial to know which expert has the highest expected reward.

Contextual bandits, also known as associative reinforcement learning or multi-armed bandits with covariates.

At the beginning of each round, the world generates a set of covariates of fixed dimensionality (so-called ”context”) and rewards for each arm which are related to the covariates. The agent chooses an arm for that round, and the reward  for that arm gets revealed, but not for the others. The goal is for the agent to maximize its obtained rewards in the long term, using the history of his previous actions.

See Also

The best available strategies for maximizing the gains appear in establishing upper confidence bounds and the more popular Thompson sampling.

The Upper Confidence Bounds (UCB) algorithm measures this potential by upper confidence bound of the reward value, ˆUt(a) so that the true value is below with bound Q(a)≤ˆQt(a)+ˆUt(a) with high probability. The upper bound ˆUt(a) is a function of Nt(a); a larger number of trials Nt(a)should give us a smaller bound ˆUt(a).

In UCB algorithm, we always select the greediest action to maximize the upper confidence bound:


“Recent advances in deep reinforcement learning have made significant strides in performance on applications such as Go and Atari games. Thompson Sampling and its extension to reinforcement learning provide an elegant approach to exploration that only requires access to posterior samples of the model.   Thus, it is attractive to consider approximate Bayesian neural networks in a Thompson Sampling framework,” wrote the researchers at Google Brain in their paper titled Deep Bayesian Bandits Showdown.

Also, know how to implement Thompson sampling using Python here


Register for our upcoming events:

Enjoyed this story? Join our Telegram group. And be part of an engaging community.

Provide your comments below


What's Your Reaction?
In Love
Not Sure
Scroll To Top