Q-Learning is a classic model-free approach to training reinforcement learning agents. It can also be viewed as a form of asynchronous dynamic programming. It was introduced by Watkins in 1989, and Watkins and Dayan published a convergence proof in 1992.
Q-Learning Overview
In Q-Learning we build a Q-table that stores a Q-value for every state-action pair. The "Q" stands for quality: each value estimates how good it is for the agent to take a certain action in a given state.
The agent uses the Q-table to choose the action with the highest expected cumulative reward in its current state. The Q-table thus acts as a cheat sheet for the agent, since it holds an entry for every state-action combination in the environment. Q-Learning is called model-free because the agent needs no model of the environment’s transitions or rewards; it learns from experience alone. In this tabular form the Q-values are not approximated by a function either: they are simply stored in a table, with states as rows and actions as columns.
However, Q-Learning suffers from the curse of dimensionality: when the number of state-action pairs is very large, it is no longer feasible to store all the mappings in a table.
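To make the "cheat sheet" idea concrete, here is a minimal sketch of how an agent reads its best action from a Q-table. The table size and its values are made up purely for illustration.

```python
import numpy as np

# Toy Q-table: 3 states (rows) x 2 actions (columns). Values are invented.
q_table = np.array([
    [0.1, 0.5],
    [0.7, 0.2],
    [0.0, 0.0],
])

state = 1
# The greedy choice is the column holding the largest Q-value in this row.
best_action = int(np.argmax(q_table[state]))
print(best_action)  # 0

# Greedy action for every state at once.
greedy_policy = np.argmax(q_table, axis=1).tolist()
print(greedy_policy)  # [1, 0, 0]
```

Note that `np.argmax` returns the first index on ties, so an all-zero row always yields action 0.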
Q-Learning Algorithm
Let’s implement the Q-Learning algorithm using NumPy and see how it works.
The Q-function can be optimized iteratively, using the Bellman equation, until it converges to the optimal Q-values.
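For reference, the standard tabular Q-Learning update derived from the Bellman equation can be written as follows. The symbols are notation assumed here rather than taken from the article: \alpha is the learning rate, \gamma the discount factor, r_{t+1} the reward received after taking action a_t in state s_t, and s_{t+1} the resulting state.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```

The bracketed term is the gap between the current estimate and a one-step look-ahead target, so each update nudges the stored value toward that target.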
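This is what a Q-table schema looks like: one row per state, one column per action, and each cell holding the current Q-value estimate for that pair.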
Q-Learning Implementation
Let’s implement a Q-Learning algorithm from scratch to play Frozen Lake, an environment provided by OpenAI Gym. We will use NumPy to implement the entire algorithm.
Environment Details
The Frozen Lake environment is a 4x4 grid laid out as follows, and the agent is rewarded for finding a walkable path to the goal tile.
SFFF (S: starting point, safe)
FHFH (F: frozen surface, safe)
FFFH (H: hole, fall to your doom)
HFFG (G: goal, where the frisbee is located)
The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.
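As a small sketch of how this map turns into 16 discrete states, the grid can be loaded into a NumPy array and numbered. The row-major numbering (left to right, top to bottom, starting at 0) is an assumption here, and the variable names are illustrative.

```python
import numpy as np

# The 4x4 map from the article, one string per row.
lake_map = ["SFFF", "FHFH", "FFFH", "HFFG"]
grid = np.array([list(row) for row in lake_map])

# Assumed numbering: states run left to right, top to bottom (0..15).
state_ids = np.arange(grid.size).reshape(grid.shape)

holes = state_ids[grid == "H"].tolist()   # terminal states with reward 0
goal = int(state_ids[grid == "G"][0])     # terminal state with reward 1
print(holes, goal)  # [5, 7, 11, 12] 15
```

Under this numbering the agent starts in state 0 and the episode ends in state 15 (goal) or in one of the four hole states.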
Code Walkthrough
Let’s understand the NumPy code step by step.
- Let’s declare a two-dimensional array with as many rows as there are states and as many columns as there are actions.
- Let’s see what the Q-table looks like: it has 16 possible states with 4 different actions.
- The 16 possible states in the Frozen Lake environment are as follows.
- SFFF (S: starting point, safe)
- FHFH (F: frozen surface, safe)
- FFFH (H: hole, fall to your doom)
- HFFG (G: goal, where the frisbee is located)
- The 4 possible actions in the Frozen Lake environment are as follows.
- Left, Down, Right, Up
- Finally, the Q-table has 16 rows (states) and 4 columns (actions).
- Let’s define some hyperparameters needed to learn the Q-values.
- Let’s go ahead and implement the Q-Learning algorithm now.
- Based on the hyperparameters defined above, we iterate over the total number of episodes. In every episode the agent may take at most 99 steps (max_steps).
- We balance exploration against exploitation by drawing a random number, exp_tradeoff, at every step.
- We take a random step (explore) if exp_tradeoff is less than epsilon; otherwise we take the action with the highest Q-value (exploit).
- We record the reward for every step and update the Q-table using the Bellman equation.
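The two core steps in the list above, choosing an action and updating the table, can be sketched in isolation. The helper name `choose_action`, the seed, and the values of `learning_rate` and `gamma` are assumptions made for illustration, and the transition is hand-picked rather than sampled from the environment.

```python
import numpy as np

rng = np.random.default_rng(42)


def choose_action(qtable, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    exp_tradeoff = rng.uniform(0, 1)
    if exp_tradeoff < epsilon:
        return int(rng.integers(qtable.shape[1]))  # random action
    return int(np.argmax(qtable[state]))           # best known action


qtable = np.zeros((16, 4))          # 16 states x 4 actions
learning_rate, gamma = 0.8, 0.95    # assumed values

# Hand-picked transition: from state 14, action 2 (Right) reaches the goal (15).
state, action, reward, new_state = 14, 2, 1.0, 15
td_target = reward + gamma * np.max(qtable[new_state])
qtable[state, action] += learning_rate * (td_target - qtable[state, action])
print(qtable[14, 2])  # 0.8
```

With `epsilon=0.0` the helper always exploits, so after this single update `choose_action(qtable, 14, 0.0)` picks action 2.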
Let’s have a look at the Q-Learning algorithm code snippet.
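The original snippet is not reproduced here, so what follows is a minimal end-to-end sketch of the walkthrough rather than the article's exact code. To stay runnable without extra dependencies it replaces the Gym environment with a small hypothetical `step` function over the same map, with deterministic moves (no slipping). The hyperparameter values, the seed, and the epsilon decay schedule are likewise assumptions.

```python
import numpy as np

# Hypothetical stand-in for the Gym environment (not Gym's API):
# same 4x4 map, but every move is deterministic.
LAKE = "SFFFFHFHFFFHHFFG"
N_STATES, N_ACTIONS, SIDE = 16, 4, 4
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # Left, Down, Right, Up


def step(state, action):
    """Return (new_state, reward, done) for one deterministic move."""
    row, col = divmod(state, SIDE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), SIDE - 1)  # moving into a wall keeps the agent in place
    col = min(max(col + d_col, 0), SIDE - 1)
    new_state = row * SIDE + col
    tile = LAKE[new_state]
    return new_state, float(tile == "G"), tile in "HG"


# Illustrative hyperparameters (assumed values).
total_episodes, max_steps = 2000, 99
learning_rate, gamma = 0.8, 0.95
epsilon, min_epsilon, decay_rate = 1.0, 0.01, 0.005

rng = np.random.default_rng(0)
qtable = np.zeros((N_STATES, N_ACTIONS))

for episode in range(total_episodes):
    state = 0
    for _ in range(max_steps):
        exp_tradeoff = rng.uniform(0, 1)
        if exp_tradeoff < epsilon:
            action = int(rng.integers(N_ACTIONS))    # explore
        else:
            action = int(np.argmax(qtable[state]))   # exploit
        new_state, reward, done = step(state, action)
        # Bellman-based update toward the one-step look-ahead target.
        td_target = reward + gamma * np.max(qtable[new_state])
        qtable[state, action] += learning_rate * (td_target - qtable[state, action])
        state = new_state
        if done:
            break
    # Explore less as training progresses.
    epsilon = min_epsilon + (1.0 - min_epsilon) * np.exp(-decay_rate * (episode + 1))


def greedy_rollout(qtable, max_steps=99):
    """Follow the learned greedy policy once; return (reached_goal, steps)."""
    state = 0
    for n in range(1, max_steps + 1):
        state, reward, done = step(state, int(np.argmax(qtable[state])))
        if done:
            return reward == 1.0, n
    return False, max_steps


reached_goal, n_steps = greedy_rollout(qtable)
print("reached goal:", reached_goal, "steps:", n_steps)
```

Gym's own Frozen Lake is slippery by default, so results on the real environment will be noisier than on this deterministic stand-in.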
Results
The figure above shows the number of steps the Q-Learning agent took to reach the goal. We tested the agent on 5 episodes, and in every episode it reached the goal (G).
This is how we can train an end-to-end Q-Learning agent using NumPy.