Hands-On Guide to Understand and Implement Q-Learning

Q-Learning is a classic model-free approach to training Reinforcement Learning agents. It can also be viewed as a method of asynchronous dynamic programming. It was introduced by Chris Watkins in 1989, and a convergence proof was published by Watkins and Dayan in 1992.

Q-Learning Overview

In Q-Learning we build a Q-table to store a Q-value for every possible state and action pair. The "Q" stands for quality: each Q-value estimates how good it is to take a certain action in a given state, measured by the expected future reward.
The agent uses the Q-table to choose, for the current state, the action with the highest expected reward. So the Q-table acts as a cheat sheet for the agent, since it covers every state and action combination in the environment. Q-Learning is called model-free because the agent does not need a model of the environment's transitions or rewards; it learns purely from experience. In the tabular version used here, the Q-values are not approximated by a function but simply stored in a table, with states as rows and actions as columns.

However, tabular Q-learning suffers from the curse of dimensionality: when the number of state and action pairs is very large, it is no longer practical to store and update an entry for every pair.

Q-Learning Algorithm

Let’s first see how the Q-Learning algorithm works before implementing it with NumPy.

The Q-function is optimized iteratively, with updates based on the Bellman equation, until the Q-values converge toward their optimal values.
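For reference, the standard tabular Q-learning update can be written as below. The symbols are the usual textbook ones and are not taken from this article: \alpha is the learning rate, \gamma is the discount factor, r_{t+1} is the reward received after taking action a_t in state s_t, and s_{t+1} is the resulting state.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

The term in square brackets is the difference between the new estimate of the return and the value currently stored in the table, so each update moves the stored value a fraction \alpha of the way toward the new estimate.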

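A Q-table has one row per state and one column per action, and each cell holds the current estimate of the Q-value for that state and action pair.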

Q-Learning Implementation

Let’s implement a Q-Learning algorithm from scratch to play Frozen Lake provided by OpenAI Gym. We will use NumPy to implement the entire algorithm.

Environment Details

The Frozen Lake environment is a 4x4 grid with the layout shown below. The agent is rewarded for finding a walkable path from the start tile to the goal tile.

SFFF       (S: starting point, safe)

FHFH       (F: frozen surface, safe)

FFFH       (H: hole, fall to your doom)

HFFG       (G: goal, where the frisbee is located)

The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.

Code Walkthrough

Let’s understand the NumPy code step by step.

  • Let’s declare a two-dimensional array with rows equal to state size and columns equal to action size.
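A minimal sketch of this step is shown below. The names state_size, action_size and qtable are assumptions chosen for illustration, and the sizes are hard-coded for the 4x4 map rather than read from the environment.

```python
import numpy as np

# FrozenLake 4x4 has 16 states (tiles) and 4 actions (sizes assumed, not read from Gym).
state_size = 16
action_size = 4

# One row per state, one column per action; every estimate starts at zero.
qtable = np.zeros((state_size, action_size))
print(qtable.shape)
```

With Gym installed, the same sizes are usually read from the environment's observation and action spaces instead of being hard-coded.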


  • Let’s see what the Q-table looks like: it has 16 rows, one per state, and 4 columns, one per action.
    • The 16 states in the Frozen Lake environment are the 16 tiles of the 4x4 grid, numbered 0 to 15 from left to right and top to bottom:
      1. SFFF       (states 0 to 3)
      2. FHFH       (states 4 to 7)
      3. FFFH       (states 8 to 11)
      4. HFFG       (states 12 to 15)

  • The 4 actions in the Frozen Lake environment are as follows.
    • Left (0), Down (1), Right (2), Up (3)

  • The Q-table therefore has 16 rows (states) and 4 columns (actions).
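To make the state numbering concrete, here is a small sketch that maps a state index back to its tile on the map. LAKE_MAP, ACTIONS and tile_at are hypothetical helpers written for this illustration; they are not part of Gym.

```python
# The 4x4 map from the environment description, one string per row.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]

# Action encoding used by Gym's FrozenLake environment.
ACTIONS = {0: "Left", 1: "Down", 2: "Right", 3: "Up"}

def tile_at(state, lake_map=LAKE_MAP):
    """Return the tile letter for a state index numbered row by row."""
    row, col = divmod(state, len(lake_map[0]))
    return lake_map[row][col]

# State 0 is the start, state 5 is a hole, state 15 is the goal.
print(tile_at(0), tile_at(5), tile_at(15))  # S H G
```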

  • Let’s define some hyperparameters needed to learn the Q-values.
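The original hyperparameter values are not available in the text, apart from the limit of 99 steps per episode. The sketch below therefore uses assumed, commonly seen values, together with an assumed exponential decay schedule for epsilon (epsilon_at is a hypothetical helper).

```python
import numpy as np

total_episodes = 5000    # assumed: number of training episodes
max_steps = 99           # from the text: maximum steps per episode
learning_rate = 0.8      # assumed: how far each update moves the stored Q-value
gamma = 0.95             # assumed: discount factor for future rewards

# Exploration schedule (all values assumed).
max_epsilon = 1.0        # start fully exploratory
min_epsilon = 0.01       # never stop exploring completely
decay_rate = 0.001       # speed of the exponential decay

def epsilon_at(episode):
    """Decay the exploration rate exponentially from max_epsilon toward min_epsilon."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
```

A slow decay keeps the agent exploring for longer, which matters here because a purely random walk rarely reaches the goal.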

  • Let’s go ahead and implement the Q-Learning algorithm now.
    • Using the hyperparameters defined above, we iterate over the total number of episodes; in every episode the agent may take at most 99 steps (max_steps).
    • We balance exploration against exploitation by drawing a random number between 0 and 1, here exp_tradeoff, at every step.
    • We take a random action (explore) if exp_tradeoff is less than or equal to epsilon; otherwise we take the action with the highest Q-value for the current state (exploit).
    • We record the reward for every step and update the Q-table using the Bellman equation.

Let’s have a look at the Q-Learning algorithm code snippet:
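The original snippet is not available here, so what follows is a sketch of the training loop under stated assumptions rather than the exact code. To keep it self-contained, it replaces the Gym environment with a small deterministic stand-in (the hypothetical step function), whereas Gym's FrozenLake is slippery by default. The hyperparameter values, the train and run_greedy helpers and the fixed seed are likewise assumptions.

```python
import numpy as np

LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_ROWS, N_COLS = 4, 4
# Row/column offsets for actions 0=Left, 1=Down, 2=Right, 3=Up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def step(state, action):
    """Deterministic stand-in for env.step: returns (new_state, reward, done)."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N_ROWS - 1)  # moving into a wall keeps the agent in place
    col = min(max(col + d_col, 0), N_COLS - 1)
    tile = LAKE_MAP[row][col]
    return row * N_COLS + col, float(tile == "G"), tile in "GH"

def train(total_episodes=5000, max_steps=99, learning_rate=0.8, gamma=0.95,
          max_epsilon=1.0, min_epsilon=0.01, decay_rate=0.001, seed=0):
    rng = np.random.default_rng(seed)
    qtable = np.zeros((N_ROWS * N_COLS, len(MOVES)))
    epsilon = max_epsilon
    for episode in range(total_episodes):
        state = 0  # every episode starts on the S tile
        for _ in range(max_steps):
            exp_tradeoff = rng.uniform(0, 1)
            if exp_tradeoff > epsilon:
                action = int(np.argmax(qtable[state]))   # exploit the best known action
            else:
                action = int(rng.integers(len(MOVES)))   # explore with a random action
            new_state, reward, done = step(state, action)
            # Move Q(s, a) toward reward + gamma * max_a' Q(s', a').
            td_target = reward + gamma * np.max(qtable[new_state])
            qtable[state, action] += learning_rate * (td_target - qtable[state, action])
            state = new_state
            if done:
                break
        # Explore less as training progresses.
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * (episode + 1))
    return qtable

def run_greedy(qtable, max_steps=99):
    """Follow the greedy policy for one episode; return (reached_goal, steps_taken)."""
    state = 0
    for t in range(1, max_steps + 1):
        state, reward, done = step(state, int(np.argmax(qtable[state])))
        if done:
            return reward == 1.0, t
    return False, max_steps

qtable = train()
print(run_greedy(qtable))
```

With Gym installed, env.reset() and env.step(action) would take the place of the stand-in; note that their return values differ between Gym versions, so the unpacking has to match the installed release.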

[Figure: number of steps the agent took to reach the goal in each test episode]

The figure above shows the number of steps the Q-learning agent took to reach the goal. We tested the agent on 5 episodes, and in every episode it reached the goal (G).

This is how we can train an end-to-end Q-learning agent using NumPy.


Anurag Upadhyaya
Experienced Data Scientist with a demonstrated history of working in Industrial IOT (IIOT), Industry 4.0, Power Systems and Manufacturing domain. I have experience in designing robust solutions for various clients using Machine Learning, Artificial Intelligence, and Deep Learning. I have been instrumental in developing end to end solutions from scratch and deploying them independently at scale.
