Q-Learning is a classic *model-free* approach to training Reinforcement Learning agents. It can also be viewed as a form of asynchronous dynamic programming. It was introduced by Watkins in 1989, and Watkins & Dayan published its convergence proof in 1992.

## Q-Learning **Overview**

In Q-Learning we build a Q-table that stores a Q-value for every state-action pair. The "Q" stands for quality: each value represents how useful a given action is when taken in a given state.

The agent uses the Q-table to choose, in each state, the action with the highest expected reward. The Q-table therefore acts as a cheat sheet for the agent, because it covers every state-action combination in the environment. Q-Learning is called *model-free* because the agent learns these values directly from experience, without a model of the environment’s transitions or rewards. In the tabular version used here, the Q-values are not approximated by a function; they are simply stored in a table with states as rows and actions as columns.

*However, tabular Q-learning suffers from the curse of dimensionality: when the number of state-action pairs is very large, it is no longer feasible to store every mapping.*

### Q-Learning **Algorithm**

Let’s implement the Q-Learning algorithm using NumPy and see how it works.

The Q-function can be improved iteratively toward the optimal Q-values using the Bellman equation.
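For reference, the standard Q-learning update that follows from the Bellman equation is written out below. The symbols are not defined in the text above and are introduced here as assumptions: $\alpha$ is the learning rate, $\gamma$ the discount factor, $r$ the reward received after taking action $a$ in state $s$, and $s'$ the state the agent lands in.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

The bracketed term is the gap between the new estimate and the stored one, and $\alpha$ controls how much of that gap is absorbed on each update.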

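A Q-table has one row per state and one column per action, and each cell holds the current estimate of the Q-value for that state-action pair.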

## Q-Learning **Implementation**

Let’s implement a Q-Learning algorithm from scratch to play the Frozen Lake environment provided by OpenAI Gym. We will use NumPy to implement the entire algorithm.

### Environment Details

The Frozen Lake environment has the following layout, and the agent is rewarded for finding a walkable path to the goal tile.

*SFFF (S: starting point, safe)*

*FHFH (F: frozen surface, safe)*

*FFFH (H: hole, fall to your doom)*

*HFFG (G: goal, where the frisbee is located)*

*The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.*
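As a small illustration, the map above can be held in a NumPy array and looked up by state number. This sketch assumes states are numbered row by row, from the top-left tile (0) to the bottom-right tile (15), which is how Gym’s FrozenLake numbers them. The names `LAKE_MAP`, `grid` and `tile_at` are made up for this example.

```python
import numpy as np

# The 4x4 layout from the text, one string per row.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
grid = np.array([list(row) for row in LAKE_MAP])  # shape (4, 4), one letter per cell

def tile_at(state):
    """Return the tile letter for a state index (assumed row-major numbering)."""
    row, col = divmod(state, grid.shape[1])
    return str(grid[row, col])

# Collect the state indices of all hole tiles.
holes = [s for s in range(grid.size) if tile_at(s) == "H"]
print(tile_at(0), tile_at(15), holes)  # S G [5, 7, 11, 12]
```

Under this numbering the start tile is state 0, the goal is state 15, and the four holes are the terminal states an agent has to learn to avoid.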

### Code Walkthrough

Let’s understand the NumPy code step by step.

- Let’s declare a two-dimensional array with rows equal to state size and columns equal to action size.

- Let’s see what the Q-table looks like. It has 16 possible states with 4 possible actions each.
- The 16 possible **states** in the Frozen Lake environment are the tiles of the 4x4 grid shown above, from the starting point (S) to the goal (G).

- The 4 possible **actions** in the Frozen Lake environment are Left, Down, Right and Up.

- Finally, the Q-table has 16 rows (states) and 4 columns (actions).
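The Q-table described above can be sketched in a few lines of NumPy. The sizes are hard-coded here to keep the example self-contained; with a Gym environment object they would normally be read from `env.observation_space.n` and `env.action_space.n`. The variable names are assumptions, not taken from the original code.

```python
import numpy as np

state_size = 16   # one row per tile of the 4x4 grid
action_size = 4   # one column per possible move

# Every Q-value starts at zero; training gradually fills the table in.
qtable = np.zeros((state_size, action_size))

print(qtable.shape)  # (16, 4)
```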

- Let’s define some hyperparameters needed to learn the Q-values.
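A possible set of hyperparameters is sketched below. Only `max_steps = 99` comes from this article; every other value, along with the exponential decay of the exploration rate in `epsilon_at`, is an illustrative assumption that is common in tabular Q-learning tutorials and would need tuning.

```python
import numpy as np

total_episodes = 2000   # assumed number of training episodes
max_steps = 99          # step limit per episode (from the text)
learning_rate = 0.8     # assumed step size for the Q-value update
gamma = 0.95            # assumed discount factor for future rewards

# Assumed exploration schedule: start fully random, decay toward a small floor.
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.005

def epsilon_at(episode):
    """Exploration rate after `episode` episodes under the assumed decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
```

With these values the agent explores almost purely at random in the first episodes and acts almost purely greedily by the end of training.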

- Let’s go ahead and implement the Q-Learning algorithm now.
- Based on the hyperparameters defined above, we iterate through the total number of episodes. In every episode the agent may take at most 99 steps (max_steps).
- We balance exploration and exploitation by drawing a random number, here exp_tradeoff, at every step.
- We take a random action if exp_tradeoff is less than epsilon; otherwise we take the action with the highest Q-value for the current state.
- We record the reward for every step and update the Q-table using the Bellman equation.

Let’s have a look at the Q-Learning algorithm code snippet:
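The training loop described in the steps above can be sketched as follows. This is a minimal sketch, not the original notebook code. To stay runnable with NumPy alone, it replaces Gym with a hypothetical deterministic stand-in (`step`) for the same 4x4 map, whereas Gym’s FrozenLake is slippery by default. The hyperparameter defaults other than `max_steps` are assumptions, and ties between equal Q-values are broken at random so that an untrained row does not always pick action 0.

```python
import numpy as np

# Hypothetical NumPy-only stand-in for Gym's FrozenLake (deterministic moves).
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_ROWS, N_COLS = 4, 4
# Action ids follow Gym's FrozenLake order: 0=Left, 1=Down, 2=Right, 3=Up.
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]

def step(state, action):
    """Apply one move and return (new_state, reward, done)."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    # Moving into the border leaves the agent on its current tile.
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    tile = LAKE_MAP[row][col]
    return row * N_COLS + col, float(tile == "G"), tile in "GH"

def train(total_episodes=2000, max_steps=99, learning_rate=0.8, gamma=0.95,
          max_epsilon=1.0, min_epsilon=0.01, decay_rate=0.005, seed=0):
    """Tabular Q-learning; the default values are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    qtable = np.zeros((N_ROWS * N_COLS, len(MOVES)))
    epsilon = max_epsilon
    for episode in range(total_episodes):
        state = 0  # each episode starts on the S tile
        for _ in range(max_steps):
            exp_tradeoff = rng.uniform(0, 1)
            if exp_tradeoff > epsilon:
                # Exploit: best known action, ties broken at random.
                best = np.flatnonzero(qtable[state] == qtable[state].max())
                action = int(rng.choice(best))
            else:
                # Explore: uniformly random action.
                action = int(rng.integers(len(MOVES)))
            new_state, reward, done = step(state, action)
            # Move Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = reward + gamma * qtable[new_state].max()
            qtable[state, action] += learning_rate * (target - qtable[state, action])
            state = new_state
            if done:
                break
        # Assumed exponential decay of the exploration rate after each episode.
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * (episode + 1))
    return qtable

qtable = train()
print(np.round(qtable, 2))
```

Because Q-learning bootstraps from `qtable[new_state].max()`, the reward at the goal spreads backward one tile at a time until the start state also has a preferred action. Rows for the holes and the goal stay at zero, since no action is ever taken from a terminal tile.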


### Results

The above figure shows the number of steps the Q-learning agent took to reach the goal. We tested our agent on 5 episodes, and in every episode the agent reached the goal (G).
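Step counts of this kind could be collected with a greedy rollout like the one sketched below. It does not reproduce the trained agent or the figure: it uses a hypothetical deterministic NumPy stand-in for FrozenLake and a hand-filled Q-table that marks one safe path, purely to show how the steps in an episode might be counted. All names are assumptions.

```python
import numpy as np

# Hypothetical deterministic stand-in for Gym's FrozenLake.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = 4
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # 0=Left, 1=Down, 2=Right, 3=Up

def step(state, action):
    """Apply one move and return (new_state, reward, done)."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), 3)
    col = min(max(col + d_col, 0), 3)
    tile = LAKE_MAP[row][col]
    return row * N_COLS + col, float(tile == "G"), tile in "GH"

def run_greedy_episode(qtable, max_steps=99):
    """Follow argmax actions from the start tile; return (final_state, steps)."""
    state, steps = 0, 0
    for _ in range(max_steps):
        action = int(np.argmax(qtable[state]))
        state, reward, done = step(state, action)
        steps += 1
        if done:
            break
    return state, steps

# Hand-filled Q-table (not a trained one): mark one safe path as preferred.
qtable = np.zeros((16, 4))
for s, a in [(0, 1), (4, 1), (8, 2), (9, 1), (13, 2), (14, 2)]:
    qtable[s, a] = 1.0

for episode in range(5):
    final_state, steps = run_greedy_episode(qtable)
    print(f"episode {episode}: steps={steps}, reached_goal={final_state == 15}")
```

In this deterministic stand-in every episode follows the same six-step path. On the slippery Gym environment the step count would vary from episode to episode.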

This is how we can train an end-to-end Q-learning agent using NumPy.