Comprehensive Guide To Deep Q-Learning For Data Science Enthusiasts

Deep Q-Learning

This article covers reinforcement learning (RL) and Deep Q-Learning using OpenAI’s Gym environment and TensorFlow 2, and implements a case study in Python. I assume that readers have a good understanding of reinforcement learning and deep learning. For complete beginners, I recommend this article before going further.

Introduction to Reinforcement Learning 

Reinforcement learning is a branch of machine learning concerned with the actions that an agent takes in an environment to maximize its rewards. It differs from supervised and unsupervised learning in that it does not need labelled input/output pairs, and it does not require suboptimal actions to be explicitly corrected; the agent improves its actions through the rewards it receives.

Let’s say I want to build a bot (agent) that plays ludo against three other players on a ludo board with dice (environment). The bot should be able to read the dice roll and the board position (state), pick the right token to move (action), and move it by the number on the dice to gain an advantage (reward).

For a better understanding, you can learn reinforcement learning in depth from here.

In reinforcement learning, the Markov Decision Process (MDP) plays a central role. An important point to notice is that each state of the environment results from its previous state, which in turn results from the state before it. The present state of any environment therefore summarises the information gathered from the previous states.

So the task of any agent is to perform actions that earn a higher reward from the environment. The Markov Decision Process frames how an agent chooses an optimal action in a given state to maximize the reward. The probability of choosing a particular action in a given state at a given time is called the policy. So the goal of solving a Markov Decision Process is to find the optimal policy.
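To make the idea of a policy concrete, here is a minimal sketch of a stochastic policy stored as a table of action probabilities per state. The state names, action names and probabilities are invented for illustration and are not taken from any environment used later.

```python
import random

# Hypothetical two-state example: each state maps to a probability per action.
policy = {
    "passenger_waiting": {"pick_up": 0.9, "drive_on": 0.1},
    "taxi_empty":        {"pick_up": 0.1, "drive_on": 0.9},
}

def sample_action(state, rng=random):
    """Draw one action for `state` according to the policy's probabilities."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

print(sample_action("passenger_waiting"))
```

Finding the optimal policy then means finding the probabilities (or, in the deterministic case, the single action per state) that lead to the highest reward over time.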

Introduction to Q-Learning Algorithm 

In the figure, we can see the process of Q-Learning from start to end: Q-Learning follows four steps and two sub-processes. So, let’s discuss the details of every step.

  1. Initialize parameters – In this step, the model sets up the actions and states that an agent can encounter in a given environment.
  2. Identify current state – An agent needs to keep a record of previous experience to act optimally and maximize rewards. To act, it first needs to identify the state it is currently in.
  3. Choose an action and gain experience – The Q-table generated during initialisation holds a value for each combination of state and action. The agent looks up its past experience for the current state and compares the values of the available actions. If it is a new situation, the Q-table will be updated for the next step.
  4. Update the reward in Q-table and determine the next state – After acting, the agent gets a reward from the environment. The magnitude of that reward is recorded in the Q-table as experience data, which helps in predicting the actions in the next step.
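The four steps above can be sketched as a small tabular Q-learning loop. This is only an illustration under my own assumptions: the five-cell corridor environment, the learning rate ALPHA, the discount GAMMA and the exploration rate EPSILON are all made up, and the update rule is the standard Q-learning one rather than anything specific to this article.

```python
import random

N_STATES = 5                  # corridor cells 0..4; cell 4 is the goal (toy task)
ACTIONS = [0, 1]              # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # assumed learning rate, discount, exploration

def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, max_steps=200, seed=0):
    rng = random.Random(seed)
    # 1. Initialise: one row per state, one column per action, all zeros.
    q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0                                   # 2. Identify the current state.
        for _t in range(max_steps):
            # 3. Choose an action: explore with probability EPSILON, else exploit
            #    (ties are broken at random so the untrained agent still moves).
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                best = max(q_table[state])
                action = rng.choice([a for a in ACTIONS if q_table[state][a] == best])
            next_state, reward, done = step(state, action)
            # 4. Update the Q-table entry and move on to the next state.
            target = reward if done else reward + GAMMA * max(q_table[next_state])
            q_table[state][action] += ALPHA * (target - q_table[state][action])
            state = next_state
            if done:
                break
    return q_table

q_table = train()
# Best action per non-goal state according to the learned table.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

After training, the greedy policy read from the table should move right in every cell, since that is the only way to reach the rewarding goal cell.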

Let’s go into more depth about the Q-table; it works like this:

In Q-Learning, we learn the function Q(s, a), which maps each state-action pair to a value. Say that in some state an agent can perform three actions: each action gets its own Q value, and each of these three values is updated in the Q-table, as the image shows.

Here we have a Q-table for each state of the game board. At each timestep, the Q value for the action taken is updated according to the reward for that action; in this example the Q values vary between 0 and 1. Mathematically, in its simplest form, it can be represented as

$$Q(s_t, a_t) = r_t + \gamma \max_{a} Q(s_{t+1}, a)$$

γ = discount factor (controls the future contribution of rewards)

The Q-table holds an entry for every state and action. In our example, a vehicle can move in three directions: the vehicle (agent) can perform three actions and earns a reward for each performed action, which generates a Q value in the Q-table. In a real-life situation, there can be more than 10,000 states and as many as 1,000,000 actions; in that case, the Q-table would be huge, and the model would be costly in both memory and time.
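A quick back-of-the-envelope calculation shows how fast the table grows. The state and action counts are the ones quoted above; storing each Q value as a 4-byte float is my assumption.

```python
states = 10_000          # figure quoted in the text
actions = 1_000_000      # figure quoted in the text
bytes_per_value = 4      # assumption: each Q value stored as a 32-bit float

entries = states * actions
table_bytes = entries * bytes_per_value
print(f"{entries:,} entries, about {table_bytes / 1e9:.0f} GB")

# For comparison, the Taxi-v3 task used later has 500 states and 6 actions.
taxi_entries = 500 * 6
print(f"Taxi-v3: {taxi_entries:,} entries")
```

Ten billion entries need roughly 40 GB before a single value has been learned, while the Taxi-v3 table fits in 3,000 entries.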

In that situation, Deep Q-learning comes to save us. 

What is Deep Q-Learning?

In deep Q-Learning, we combine Q-Learning with a neural network that approximates the optimal Q-value function, so that no Q-table is needed. The network takes a state as input and outputs the Q-values of all possible actions. The difference between the techniques of Q-learning and Deep Q-learning can be illustrated as follows:

In Deep Q-Learning, the agent stores all past experiences in memory, and the next action is determined by the output of the Q-network. The Q-network estimates the Q-value at state s_t, while a target network (a second neural network) calculates the Q-value for state s_{t+1} (the next state). To stabilise training and block abrupt jumps in the Q-value targets, the target network is refreshed only periodically by copying the Q-network's parameters into it. This process is presented in the image below.

As already discussed, the agent has to gather and store its previous experiences. Deep Q-Learning combines a neural network with Q-Learning, and each experience is stored in memory as a tuple <State, Next state, Action, Reward>. Neural network training tends to be more stable when it uses random batches of previous data. So deep Q-Learning uses one more concept to increase the agent's performance: experience replay, which is simply a store of previous experiences from which random batches are drawn.
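A minimal sketch of such a replay memory, using only the standard library, could look like the following. The class and method names are my own; the tuple follows the order used in the implementation later in the article (state, action, reward, next state, terminated flag), and the capacity of 3 is chosen only to make the eviction visible.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of transitions; the oldest are dropped when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, terminated):
        self.buffer.append((state, action, reward, next_state, terminated))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks up the correlation
        # between consecutive transitions.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=3)
for t in range(5):                       # made-up transitions
    memory.store(t, 0, -1, t + 1, False)
batch = memory.sample(batch_size=2)
print(len(memory), batch)
```

Because the deque has a maximum length, only the three most recent transitions survive, and each training batch is a random subset of them.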

The Q-network is trained on batches drawn from experience replay, and the target network is used to calculate the target Q-value. The loss is the squared difference between the target Q-value and the predicted Q-value.

This loss is used only to train the Q-network; the target network is not trained directly and is updated by copying the Q-network's parameters to it.
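As a worked example of the target and the loss, here is a small sketch with made-up numbers. The discount factor of 0.6 and all Q-values are assumptions for illustration; the helper names are hypothetical.

```python
GAMMA = 0.6   # assumed discount factor

def td_target(reward, next_q_values, terminated, gamma=GAMMA):
    """Target for the action taken: r if the episode ended,
    otherwise r + gamma * max over the target network's Q-values."""
    if terminated:
        return reward
    return reward + gamma * max(next_q_values)

def squared_loss(target, predicted):
    return (target - predicted) ** 2

predicted_q = 1.0                       # Q-network output for the action taken (made up)
next_q_from_target = [0.5, 2.0, 1.0]    # target-network outputs for the next state (made up)

target = td_target(reward=-1, next_q_values=next_q_from_target, terminated=False)
loss = squared_loss(target, predicted_q)
print(round(target, 3), round(loss, 3))   # 0.2 0.64
```

Here the target is -1 + 0.6 × 2.0 = 0.2, so the Q-network's prediction of 1.0 is pulled down towards 0.2 with a squared error of 0.64.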

Let’s sum up the Deep Q-learning process in steps:

  1. First, provide the environment’s state to the agent.
  2. The agent computes the Q-values of all possible actions for the provided state.
  3. The agent picks and performs an action based on the Q-values, aiming for higher rewards.
  4. Observe the reward and the next state.
  5. Store the experience in the experience replay memory.
  6. Train the networks using the experience replay memory.
  7. Repeat steps 2-6 for each state.

Implementation using Deep Q-Learning 

Set up the environment in Colab 

Requirements: Python 3.6 or above, TensorFlow 2.0 or above, OpenAI Gym

Gym environment 

We are using the Taxi-v3 Gym environment. It is a very simple one in which a taxi is the agent; it needs to pick up and drop off passengers at four locations, and it can perform six actions: pick up the passenger, drop off the passenger, or move in one of four directions (south, east, north, west). More information about the Gym environment is here.

Import the required libraries:

 import numpy as np
 import random
 from IPython.display import clear_output
 from collections import deque
 import progressbar
 import gym
 from tensorflow.keras import Model, Sequential
 from tensorflow.keras.layers import Dense, Embedding, Reshape
 from tensorflow.keras.optimizers import Adam

Creation of the Gym environment:

 env_taxi = gym.make("Taxi-v3").env

Number of states and actions of the environment:

 print('Number of states: {}'.format(env_taxi.observation_space.n))
 print('Number of actions: {}'.format(env_taxi.action_space.n))

Output:

 Number of states: 500
 Number of actions: 6

We used the make function to create the Taxi-v3 environment object; its render function can be used to display the current state of the environment. In the next cell, we print the number of states and actions of the environment, which shows 500 states and six actions available to our agent.
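The figure of 500 states comes from combining the taxi's position on the 5x5 grid, the passenger's location (one of the four locations, or inside the taxi) and the destination. The sketch below shows how such a combination can be packed into a single number and unpacked again. It mirrors, to the best of my understanding, the layout used by Gym's Taxi environment, but the helper functions here are my own and do not call Gym.

```python
# Sizes of each component of a Taxi state.
ROWS, COLS = 5, 5            # taxi position on the grid
PASSENGER_LOCS = 5           # four named locations + "inside the taxi"
DESTINATIONS = 4             # four named locations

def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into one integer state."""
    return ((taxi_row * COLS + taxi_col) * PASSENGER_LOCS + passenger_loc) * DESTINATIONS + destination

def decode(state):
    """Unpack an integer state into (taxi_row, taxi_col, passenger_loc, destination)."""
    state, destination = divmod(state, DESTINATIONS)
    state, passenger_loc = divmod(state, PASSENGER_LOCS)
    taxi_row, taxi_col = divmod(state, COLS)
    return taxi_row, taxi_col, passenger_loc, destination

print(ROWS * COLS * PASSENGER_LOCS * DESTINATIONS)   # 500
print(encode(3, 1, 2, 0), decode(328))               # 328 (3, 1, 2, 0)
```

This is why the agent only ever sees a single integer per step: all the information about the board is folded into that one number.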

Next, I implement the agent in the taxi class:

 class taxi:
     def __init__(self, env_taxi, optimizer):
         # Initialize attributes
         self._env = env_taxi
         self._state_size = env_taxi.observation_space.n
         self._action_size = env_taxi.action_space.n
         self._optimizer = optimizer
         self.expirience_replay_memory = deque(maxlen=2000)
         # Initialize discount and exploration rate
         self.gamma = 0.6
         self.exploration = 0.1
         # Build networks and start them with identical weights
         self.q_network = self._build_compile_model()
         self.target_network = self._build_compile_model()
         self.align_both_model()

     def gather(self, state, action, reward, next_state, terminated):
         # Store one transition in the replay memory
         self.expirience_replay_memory.append((state, action, reward, next_state, terminated))

     def _build_compile_model(self):
         model = Sequential()
         model.add(Embedding(self._state_size, 10, input_length=1))
         # Flatten the (1, 10) embedding output to a vector of 10 values
         model.add(Reshape((10,)))
         model.add(Dense(50, activation='relu'))
         model.add(Dense(50, activation='relu'))
         model.add(Dense(self._action_size, activation='linear'))
         model.compile(loss='mse', optimizer=self._optimizer)
         return model

     def align_both_model(self):
         # Copy the Q-network's weights into the target network
         self.target_network.set_weights(self.q_network.get_weights())

     def active(self, state):
         # Explore with probability `exploration`, otherwise exploit
         if np.random.rand() <= self.exploration:
             return self._env.action_space.sample()
         q_values = self.q_network.predict(state)
         return np.argmax(q_values[0])

     def retraining(self, batch_size):
         minibatch = random.sample(self.expirience_replay_memory, batch_size)
         for state, action, reward, next_state, terminated in minibatch:
             target = self.q_network.predict(state)
             if terminated:
                 target[0][action] = reward
             else:
                 # Bootstrap from the target network's estimate of the next state
                 t = self.target_network.predict(next_state)
                 target[0][action] = reward + self.gamma * np.amax(t)
             self.q_network.fit(state, target, epochs=1, verbose=0)

I have defined six methods in the taxi class: __init__, gather, _build_compile_model, align_both_model, active, and retraining.

First, in the __init__ method, we initialize the sizes of the state and action spaces from observation_space and action_space, then the optimizer and expirience_replay_memory. Then we set the discount rate and exploration rate, create q_network and target_network with the _build_compile_model method, and align them with the align_both_model method. Finally, in the gather method, expirience_replay_memory is appended with the state, action, reward, next state and terminated values.

Let’s go through the _build_compile_model method. 

 def _build_compile_model(self):
     model = Sequential()
     model.add(Embedding(self._state_size, 10, input_length=1))
     # Flatten the (1, 10) embedding output to a vector of 10 values
     model.add(Reshape((10,)))
     model.add(Dense(50, activation='relu'))
     model.add(Dense(50, activation='relu'))
     model.add(Dense(self._action_size, activation='linear'))
     model.compile(loss='mse', optimizer=self._optimizer)
     return model

As we see, it’s a feed-forward sequential model whose first layer is an embedding layer. Embedding layers are mostly used in language processing, but here the Gym environment returns the state as a single discrete number, and the embedding layer maps each of these numbers to a small dense vector. The layer's input_dim parameter takes the number of distinct values we have, and output_dim gives the size of the vector space for the result. We want to represent 500 possible states by ten values each, which is why we are using the embedding layer here. After this, the reshape layer flattens the embedding output into a vector of ten values that is fed to the dense (feed-forward) layers.
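To illustrate what the embedding layer does with a state number, here is a plain-Python sketch that does not use Keras. The table of random numbers stands in for the layer's trainable weights, and the function names are hypothetical. It shows that looking up a row gives the same result as multiplying a 500-element one-hot vector by the table, only far more cheaply.

```python
import random

NUM_STATES, EMBED_DIM = 500, 10
rng = random.Random(0)
# Stand-in for the layer's trainable weights: one row of 10 numbers per state.
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
                   for _ in range(NUM_STATES)]

def embed(state):
    """Embedding lookup: pick the row that belongs to the state index."""
    return embedding_table[state]

def embed_via_one_hot(state):
    """Same result the long way: one-hot vector (length 500) times the table."""
    one_hot = [1.0 if i == state else 0.0 for i in range(NUM_STATES)]
    return [sum(one_hot[i] * embedding_table[i][j] for i in range(NUM_STATES))
            for j in range(EMBED_DIM)]

state = 328
vector = embed(state)
print(len(vector), NUM_STATES * EMBED_DIM)   # 10 5000
```

Each state is therefore described by 10 learned numbers, and the whole layer holds 500 × 10 = 5,000 weights.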

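In the active method, we choose the action: with a probability equal to the exploration rate we pick a random action, and otherwise we ask the Q-network for the Q-values of the current state and pick the action with the highest value.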
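This selection rule is commonly called epsilon-greedy. The following standalone sketch shows the same logic without a neural network; the list of Q-values is made up and the function name is hypothetical.

```python
import random

def choose_action(q_values, exploration, rng=random):
    """Random action with probability `exploration`, otherwise the best-valued one."""
    if rng.random() <= exploration:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, -0.4, 0.7, 0.0, 0.2, -1.0]   # made-up Q-values for six actions
rng = random.Random(0)
picks = [choose_action(q_values, exploration=0.1, rng=rng) for _ in range(10_000)]
print(picks.count(2) / len(picks))
```

With an exploration rate of 0.1, the best action (index 2 here) is chosen a little over 90% of the time: 90% by exploitation, plus its share of the random picks.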

In the retraining method, we train the Q-network on a random sample picked from expirience_replay_memory.

After defining these six methods, we create an object of the taxi class and prepare it for training:

 optimizer = Adam(learning_rate=0.01)
 taxi = taxi(env_taxi, optimizer)
 batch_size = 32
 num_of_episodes = 100
 timesteps_per_episode = 1000

Now we can train our model using the following code:

 for e in range(0, num_of_episodes):
     # Reset the environment (older Gym API, before 0.26: reset() returns the state)
     state = env_taxi.reset()
     state = np.reshape(state, [1, 1])
     # Initialize variables
     reward = 0
     terminated = False
     bar = progressbar.ProgressBar(maxval=timesteps_per_episode/10, widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
     bar.start()
     for timestep in range(timesteps_per_episode):
         # Run Action
         action = taxi.active(state)
         # Take action (older Gym API, before 0.26: step() returns four values)
         next_state, reward, terminated, info = env_taxi.step(action)
         next_state = np.reshape(next_state, [1, 1])
         taxi.gather(state, action, reward, next_state, terminated)
         state = next_state
         if terminated:
             # Episode finished: sync the target network with the Q-network
             taxi.align_both_model()
             break
         # Train once enough experience has been collected
         if len(taxi.expirience_replay_memory) > batch_size:
             taxi.retraining(batch_size)
         if timestep%10 == 0:
             bar.update(timestep/10 + 1)
     bar.finish()
     if (e + 1) % 10 == 0:
         print("Episode: {}".format(e + 1))

This article discussed how to run the Taxi-v3 environment provided by OpenAI Gym using the Deep Q-learning algorithm, and how the algorithm works. There are many environments available in OpenAI Gym, so I encourage you to try Deep Q-learning in different environments.

Yugesh Verma
Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.
