How to generate recommendations using reinforcement learning?

Combining Q-learning with a policy network to generate recommendations

The growth of information on the World Wide Web makes it hard for customers to find products that interest them. Recommendation systems address this problem by helping customers find satisfying products. Recommendation engines are a class of machine learning systems that rank or rate items for users. A recommender system predicts the rating that a user would give to a certain item; these predictions are then ranked and returned to the user. This article focuses on using a reinforcement learning algorithm to build a recommendation system. The following topics are covered.

Table of contents

  1. About reinforcement learning
  2. How reinforcement learning is used for recommendation
  3. Building a recommendation system with RL

In Reinforcement Learning, AI agents act in a highly dynamic environment to complete certain tasks. Let’s understand more about Reinforcement Learning (RL).

About Reinforcement Learning

Reinforcement Learning (RL) is a machine learning approach that allows an agent to learn in an interactive environment via trial and error, based on feedback from its actions and experiences. The feedback comes in the form of rewards or punishments, and the agent learns to choose the actions that lead to the best results.


Agents use trial and error to achieve certain tasks in a given context. The goal of this learning process is to maximise the cumulative reward received from the environment. RL differs from both supervised and unsupervised learning because it learns online: the agent gathers its data by interacting with a dynamic environment, and each action it takes affects the states and actions that follow.


In the fundamental reinforcement learning model, the agent first observes the current state of the environment. It then performs an action in the environment. Finally, the environment changes in response to the action and returns a reward or punishment based on the new state. Depending on the environment, the agent can visit a limited number of states, and it receives a numerical reward after visiting each one. Punishments are represented by negative numbers. Intelligent agents strive to maximise cumulative rewards while minimising penalties.
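The observe, act, reward loop can be sketched in a few lines of Python. This is only an illustration: the LineWorld environment, its reward values, and the two policies are hypothetical and are not part of the article's recommendation setup.

```python
import random

class LineWorld:
    """Hypothetical toy environment: positions 0..4 on a line, goal at 4."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to [0, 4]
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 10 if done else -1  # a negative number is a punishment
        return self.state, reward, done


def run_episode(env, policy, max_steps=50):
    """Run one episode and return the cumulative reward."""
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                  # the agent picks an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    return +1

def random_policy(state):
    return random.choice([-1, +1])

best_return = run_episode(LineWorld(), always_right)
random_return = run_episode(LineWorld(), random_policy)
```

In this toy world the policy that always moves right reaches the goal in four steps, so no other policy can collect a higher cumulative reward.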


How is reinforcement learning used for recommendation?

The goal is to maximise the expected sum of future rewards from each state. On-policy TD control, often known as the SARSA technique, is a key reinforcement learning algorithm. It works on state-action pairs: the agent learns by updating the action value Q(s, a) as it moves from one state-action pair to the next.
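As a minimal sketch, the standard SARSA update moves Q(s, a) towards the observed reward plus the discounted value of the next state-action pair. The learning rate alpha, the discount factor gamma, and the state and action names below are illustrative assumptions.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """Move Q(s, a) towards r + gamma * Q(s_next, a_next)."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]

# Q-table with a default value of 0.0 for unseen state-action pairs
Q = defaultdict(float)
Q[("s1", "a1")] = 2.0

# One observed transition: (s0, a0) -> reward 1.0 -> (s1, a1)
new_value = sarsa_update(Q, "s0", "a0", r=1.0, s_next="s1", a_next="a1")
```

SARSA is on-policy because the target uses a_next, the action the agent actually takes next, rather than the best available action.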

Q-learning is one of the most important reinforcement learning strategies. Reinforcement Learning is the process through which agents learn to pick optimum behaviours in their specific environment.

To reach the target state, the agent executes an action ‘a’ in each state, which moves it to a new state. Each action should bring the agent to the goal state as efficiently as possible. An agent’s policy is the rule that determines which action the agent takes in each state. Q-Learning is one variation of this approach. Instead of computing explicit values for each state, this technique computes a value function Q(s, a) that represents the value of taking action a in state s.

Formally, the value of Q(s, a) is the discounted sum of future rewards obtained by taking action a in state s and then selecting optimal actions. To address recommendation problems using Q-learning, we must first define the relevant actions, states, and the reward and punishment scheme.
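A toy sketch of how these pieces could be defined for recommendation is shown below. The state names, item categories, reward values (+1 for an accepted suggestion, -1 for a rejected one), and hyperparameters are all hypothetical; a real system would derive them from logged user behaviour.

```python
from collections import defaultdict

ACTIONS = ["comedy", "drama", "thriller"]  # hypothetical item categories

def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy update: bootstrap from the best action in s_next."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def recommend(Q, s):
    """Greedy recommendation: the action with the highest Q-value."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])

Q = defaultdict(float)

# Logged interactions: (state, recommended item, reward, next state)
feedback = [
    ("watched_comedy", "drama", -1.0, "watched_comedy"),
    ("watched_comedy", "comedy", 1.0, "watched_comedy"),
    ("watched_comedy", "thriller", -1.0, "watched_comedy"),
]

# Replay the logged feedback several times so the Q-values settle
for _ in range(20):
    for s, a, r, s_next in feedback:
        q_learning_update(Q, s, a, r, s_next)

top_pick = recommend(Q, "watched_comedy")
```

Unlike SARSA, the update bootstraps from the best next action regardless of which action the agent actually takes, which is what makes Q-learning off-policy.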

This technique primarily focuses on estimating how well each action performs in each state; these estimates are called Q-values. Recommendation systems use Q-learning to score the candidate next items and steer towards a high-value state. The reward signal is typically based on whether the customer accepts or rejects a suggestion, or on the time spent on a recommended item.

The Q-learning approach provides a suitable framework for personalised recommendations and can be applied directly to many types of recommendation problem. Each Q-value for a state-action pair is an estimate of how rewarding that action is expected to be. Because the problem is non-deterministic, the update rule for non-deterministic environments is used.

This rule takes into account the fact that taking the same action in the same state may yield different rewards. As the value of n decreases, the influence of reward values decreases continually.

Conventional recommendation methods have several shortcomings, and some are not very effective. The key disadvantage is that they do not provide truly personalised recommendations; instead, they rely on content filtering or user-type-based filtering. However, society and technology are increasingly focused on presenting the right product or service to the right market or consumer.

Building a recommendation system with RL

This article uses Deep Deterministic Policy Gradient (DDPG), a reinforcement learning algorithm that blends Q-learning with policy gradients. As an actor-critic technique, DDPG has two models: an actor and a critic. The actor is a policy network that takes the state as input and outputs a precise (continuous) action rather than a probability distribution over actions. The critic is a Q-value network that takes the state and action as input and returns the Q-value as output. DDPG is an off-policy approach. The word “deterministic” in DDPG refers to the fact that the actor computes the action directly rather than sampling from a probability distribution over actions.
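The division of labour between the two models can be illustrated with a toy sketch in plain Python. The linear maps and fixed weights below are stand-ins chosen for illustration only; in DDPG both models are neural networks trained by gradient descent, and this sketch does not reproduce RecNN's models.

```python
import math

def actor(state, weights):
    """Deterministic policy: state vector -> continuous action vector.

    Each output is tanh(w . state), so the same state always yields the
    same action (nothing is sampled from a probability distribution).
    """
    return [math.tanh(sum(w * s for w, s in zip(row, state)))
            for row in weights]

def critic(state, action, weights):
    """Q-value estimate: (state, action) -> a single scalar."""
    features = list(state) + list(action)
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical fixed weights standing in for trained network parameters
actor_weights = [[0.5, -0.2, 0.1],
                 [0.3, 0.8, -0.5]]           # 3-dim state -> 2-dim action
critic_weights = [0.1, 0.2, 0.3, 1.0, -1.0]  # 3 state dims + 2 action dims

state = [1.0, 0.0, -1.0]
action = actor(state, actor_weights)
q_value = critic(state, action, critic_weights)
```

During training, the critic's Q-value is what tells the actor how to adjust its output: the actor is updated in the direction that increases the critic's estimate.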

The recommendation system will be built on the famous IMDb movie rating dataset. Because training a DDPG model takes considerable time, this article uses a pre-trained DDPG model provided with the RecNN toolkit.

Installing the RecNN

from IPython.display import clear_output
! git clone
! pip install -r ./RecNN/requirements.txt
! pip install ./RecNN
! pip install gdown

The commands above clone the GitHub repository and install RecNN, a recommendation toolkit based on Reinforcement Learning.

Reading the data

This article uses the IMDb movie rating dataset. The following commands download and unzip the metadata and the pre-trained model.

! wget
! gdown
! unzip

If you are using a Colab notebook, make sure the hardware accelerator is set to GPU, because the code that follows requires one.

Import necessary libraries

import pandas as pd
import numpy as np
from scipy.spatial import distance
import matplotlib.pyplot as plt
import recnn
from tqdm import tqdm
import pickle
import gc
import json
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

The library’s main abstraction for datasets is called an environment, following the naming used by other reinforcement learning libraries. FrameEnv provides fixed-length states, whereas SeqEnv supports dynamic length along with a sequential state representation encoder. First, let’s look at FrameEnv. To initialise an env, you must specify the embeddings and ratings directories. Caching is optional.

cuda = torch.device('cuda')
frame_size = 10
batch_size = 1
# Load the movie metadata used to label the recommendations
with open('/content/drive/MyDrive/Datasets/omdb.json') as f:
    meta = json.load(f)
dirs =
env = recnn.data.env.FrameEnv(dirs, frame_size, batch_size)
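To give an intuition for what a fixed-length frame means, here is a rough sketch of how a user's rating history could be cut into samples with frame_size previous ratings as the state. This is an assumption made for illustration, not RecNN's actual implementation; the make_frames helper and the sample history are hypothetical.

```python
def make_frames(ratings, frame_size):
    """Split one user's chronological (item_id, rating) history into
    fixed-length training samples.

    state  = the previous `frame_size` (item, rating) pairs
    action = the next item the user rated
    reward = the rating given to that item
    """
    samples = []
    for t in range(frame_size, len(ratings)):
        state = ratings[t - frame_size:t]
        action, reward = ratings[t]
        samples.append({"state": state, "action": action, "reward": reward})
    return samples

# Hypothetical history of one user: (item_id, rating) in time order
history = [(11, 4.0), (27, 3.5), (5, 5.0), (42, 2.0), (8, 4.5)]
samples = make_frames(history, frame_size=3)
```

With five ratings and a frame size of three, the history yields two samples, each with a state of exactly three entries.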

Now move the model to the GPU.

Since DDPG is an actor-critic technique, it has two models: actor and critic. This article uses the actor model; trying the critic model is left to the reader as an exercise.

ddpg = recnn.nn.models.Actor(1290, 128, 256).to(cuda)
test_batch = next(iter(env.test_dataloader))
state, action, reward, next_state, done =

Create a custom function

To store the recommendation scores, create a custom function that compares the action generated by the model with every movie embedding, keeps the ten closest movies, and returns them together with their metadata in a pandas DataFrame.

def rank_score(gen_action, metric):
    scores = []
    for i in env.base.key_to_id.keys():
        # Skip the entry with id 0
        if i == 0 or i == '0':
            continue
        # Distance between this movie's embedding and the generated action
        scores.append([i, metric(env.base.embeddings[env.base.key_to_id[i]], gen_action)])
    # A smaller distance means a closer match; keep the ten closest movies
    scores = sorted(scores, key=lambda x: x[1])[:10]
    # Append the metadata fields for each selected movie
    for row in scores:
        row.extend([meta[str(row[0])]['omdb'][key] for key in ['Title',
                                'Genre', 'Language', 'Released', 'imdbRating']])
    indexes = ['id', 'score', 'Title', 'Genre', 'Language', 'Released', 'imdbRating']
    table_dict = dict([(key, [i[idx] for i in scores]) for idx, key in enumerate(indexes)])
    table = pd.DataFrame(table_dict)
    return table

Generate recommendations

ddpg_model = ddpg(state)
ddpg_model = ddpg_model[np.random.randint(0, state.size(0), 1)[0]].detach().cpu().numpy()

Call the custom function with different distance metrics to get the recommendations for the randomly sampled action.

rank_score(ddpg_model, distance.euclidean)
rank_score(ddpg_model, distance.correlation)
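The two metrics can rank the same candidates differently. The sketch below reimplements Euclidean distance and correlation distance (1 minus the Pearson correlation) in plain Python and applies them to a few made-up embeddings; the movie names, vectors, and the rank helper are hypothetical and only illustrate the idea behind rank_score.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def correlation(u, v):
    """Correlation distance: 1 - Pearson correlation of u and v."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du, dv = [a - mu for a in u], [b - mv for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return 1.0 - num / den

def rank(embeddings, gen_action, metric, k=2):
    """Return the k item ids whose embeddings are closest to gen_action."""
    ordered = sorted(embeddings, key=lambda i: metric(embeddings[i], gen_action))
    return ordered[:k]

# Hypothetical 3-dimensional movie embeddings
embeddings = {
    "movie_a": [1.0, 2.0, 3.0],
    "movie_b": [10.0, 20.0, 31.0],
    "movie_c": [3.0, 2.0, 1.5],
}
gen_action = [1.1, 2.1, 2.9]

by_euclidean = rank(embeddings, gen_action, euclidean)
by_correlation = rank(embeddings, gen_action, correlation)
```

Euclidean distance is sensitive to the magnitude of the vectors, so it places movie_b far away, while correlation distance only compares the shape of the vectors and treats movie_b as a close match.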


The purpose of Reinforcement Learning is to learn an action model that maximises the agent’s total cumulative reward. The main disadvantage of conventional recommendation methods is that they do not offer tailored recommendations; they rely on content filtering or user-type-based filtering. In this article, we have seen how Reinforcement Learning can be used to build a recommendation system.


Sourabh Mehta
Sourabh has worked as a full-time data scientist for an ISP organisation, experienced in analysing patterns and their implementation in product development. He has a keen interest in developing solutions for real-time problems with the help of data both in this universe and metaverse.
