With the growth of information on the World Wide Web, customers struggle to find products that interest them. Recommendation systems address this problem by helping customers find satisfying products. A recommender system anticipates the ratings a user would give to particular items; these predictions are then ranked and returned to the user. Recommendation engines are a class of machine learning systems concerned with ranking or scoring items for consumers. This article focuses on using a reinforcement learning algorithm to build a recommendation system. The following topics are covered.
Table of contents
- About reinforcement learning
- How reinforcement learning is used for recommendation
- Building a recommendation system with RL
In reinforcement learning, AI agents act in a highly dynamic environment to complete certain tasks. Let’s understand more about Reinforcement Learning (RL).
About Reinforcement Learning
Reinforcement Learning (RL) is a machine learning approach in which an agent learns in an interactive environment via trial and error, based on feedback from its own actions and experiences. The environment provides this feedback in the form of rewards or punishments, and the self-learning agent chooses the actions that yield the best results.
Agents use trial and error to achieve certain tasks in a given context, and the goal of this learning process is to maximise the cumulative reward received from the environment. RL differs from both supervised and unsupervised learning because it is an online learning method: the agent learns from data generated by a dynamic environment, where each action immediately affects the situation faced by the next one.

In the basic reinforcement learning loop, the first step is for the agent to obtain the input state from the environment. The agent then performs an action in that environment. Finally, the environment changes in response to the action and assigns a reward or punishment based on the new state. Depending on the environment, the agent can visit a limited number of states, and a numerical reward is received after each state is visited; punishments are represented by negative numbers. Intelligent agents strive to maximise cumulative rewards while minimising penalties. A minimal sketch of this loop is shown below.
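To make this loop concrete, here is a minimal, self-contained Python sketch of the agent-environment interaction. The toy environment and the random "agent" are illustrative placeholders only, not part of any library used later in this article.

import random

class ToyEnv:
    """A tiny toy environment: states 0 to 4, where state 4 is the goal."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, any other action moves left
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        reward = 1.0 if self.state == 4 else -0.1   # negative reward acts as a small punishment
        done = (self.state == 4)
        return self.state, reward, done

toy_env = ToyEnv()
state = toy_env.reset()
total_reward = 0.0
for _ in range(20):
    action = random.choice([0, 1])                # the (here random) agent chooses an action
    state, reward, done = toy_env.step(action)    # the environment changes and returns feedback
    total_reward += reward                        # the agent's objective: maximise cumulative reward
    if done:
        break
print(total_reward)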
How is reinforcement learning used for recommendation?
The goal is to maximise the expected sum of future rewards from each state. On-policy TD control, better known as SARSA, is a key reinforcement learning technique: it operates on state-action pairs, and learning proceeds by updating the value Q(s, a) as the agent moves from one state-action pair to the next.
Q-learning is one of the most important reinforcement learning strategies: it is the process through which agents learn to pick optimal actions in their particular environment.
In each state, the agent executes an action ‘a’ and transitions to a new state, and each action should move the agent towards the goal state as efficiently as possible. An agent’s policy describes which action the agent takes in each state. Q-Learning is one variation of this approach: instead of computing an explicit value for each state, it calculates a value function Q(s, a) that represents the value of taking action a in state s.
Formally, the value of Q(s, a) is the discounted sum of future rewards obtained by taking action a in state s and then selecting optimal actions. To address recommendation problems with Q-learning, we must first define suitable actions, states, and reward and punishment procedures.
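As a concrete illustration (a minimal sketch of tabular Q-learning, not the implementation used later in this article), the update that corresponds to this definition can be written as:

from collections import defaultdict

Q = defaultdict(float)      # Q[(state, action)] -> estimated value
alpha, gamma = 0.1, 0.9     # learning rate and discount factor

def q_update(state, action, reward, next_state, actions):
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

In a recommendation setting, a state could be, for instance, the last few items a customer interacted with, and an action the item recommended next.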
This technique focuses on estimating the quality of each action in each state; these estimates are referred to as Q-values. Recommendation systems employ this Q-learning approach to estimate the value of the next recommendation and reach a high-performing state; the rewards are typically based on, for example, whether a customer accepts or rejects a suggestion or how much time they spend on a certain activity.
The Q-learning approach provides an appropriate framework for personalised recommendations and can be applied directly to almost any type of recommendation problem. Each reward/action/state value is an estimate of how accurate the prediction can be, and the update rule is adapted to the non-deterministic character of the problem.
This rule takes into account the fact that taking the same action in the same state may yield different rewards. As the number of visits n to a state-action pair grows, the influence of newly observed rewards on the Q-value estimate steadily decreases. A sketch of this weighted update is given below.
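A minimal sketch of this weighted update for non-deterministic rewards; the common 1/(1 + visits) learning-rate schedule is assumed here for illustration and is not prescribed by the article.

from collections import defaultdict

Q = defaultdict(float)
visits = defaultdict(int)   # how many times each (state, action) pair has been updated
gamma = 0.9

def q_update_nondeterministic(state, action, reward, next_state, actions):
    visits[(state, action)] += 1
    alpha_n = 1.0 / (1.0 + visits[(state, action)])   # new rewards weigh less as n grows
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = (1 - alpha_n) * Q[(state, action)] + alpha_n * (reward + gamma * best_next)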
Conventional recommendation approaches suffer from several concerns, and some are not very effective. The key disadvantage is that truly personalised recommendations are not available; instead, these methods rely on content-based filtering or user-type-based filtering. However, society and technology are increasingly focused on presenting the right product or service to the right market or consumer.
Building a recommendation system with RL
This article uses Deep Deterministic Policy Gradient (DDPG), a reinforcement learning algorithm that blends Q-learning with policy gradients. As an actor-critic technique, DDPG has two models: an actor and a critic. The actor is a policy network that takes the state as input and outputs a precise (continuous) action rather than a probability distribution over actions. The critic is a Q-value network that takes state and action as input and returns the Q-value. DDPG is an off-policy approach, and the word “deterministic” refers to the fact that the actor computes the action directly rather than sampling from a probability distribution over actions. A minimal sketch of the two networks follows.
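The following PyTorch sketch is illustrative only: the layer sizes and architecture are assumptions made for clarity, not RecNN’s exact implementation.

import torch
import torch.nn as nn

class SimpleActor(nn.Module):
    """Policy network: state -> continuous action."""
    def __init__(self, state_dim, action_dim, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim)
        )
    def forward(self, state):
        return self.net(state)          # a deterministic action, not a distribution

class SimpleCritic(nn.Module):
    """Q-value network: (state, action) -> scalar Q-value."""
    def __init__(self, state_dim, action_dim, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1)
        )
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))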
The recommendation system will be built on the MovieLens 20M movie-rating dataset together with IMDb movie metadata. Due to the time it would take to train a DDPG model from scratch, this article relies on RecNN, a reinforcement learning based recommendation toolkit.
Installing the RecNN
from IPython.display import clear_output
! git clone https://github.com/awarebayes/RecNN
! pip install -r ./RecNN/requirements.txt
! pip install ./RecNN
! pip install gdown
clear_output()
Cloning the GitHub repository to install the RecNN recommendation toolkit based on Reinforcement Learning.
Reading the data
This article uses the MovieLens 20M movie-rating data with IMDb metadata. The following code downloads and unzips the ratings and fetches the supporting files used by RecNN.
! wget http://files.grouplens.org/datasets/movielens/ml-20m.zip
! gdown https://drive.google.com/uc?id=1EQ_zXBR3DKpmJR3jBgLvt-xoOvArGMsL
! unzip ml-20m.zip
clear_output()
If using a Colab notebook, make sure the hardware accelerator is set to GPU, because the model will be run on it.
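Optionally, you can verify that PyTorch can see the GPU before continuing:

import torch
print(torch.cuda.is_available())   # should print True when the GPU runtime is active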
Import necessary libraries
import pandas as pd
import numpy as np
from scipy.spatial import distance
import matplotlib.pyplot as plt
import recnn
from tqdm.auto import tqdm
import pickle
import gc
import json
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
The library’s main abstraction for datasets is called an environment, similar to the naming used by other reinforcement learning libraries. FrameEnv provides a static-length state representation, whereas SeqEnv implements dynamic length together with a sequential state-representation encoder. Let’s look at FrameEnv first. To initialise an env, you must specify the embeddings and ratings paths; caching is also an option.
cuda = torch.device('cuda')
frame_size = 10
meta = json.load(open('/content/drive/MyDrive/Datasets/omdb.json'))
tqdm.pandas()
frame_size = 10
batch_size = 1
dirs = recnn.data.env.DataPath(
    base="",
    embeddings="ml20_pca128.pkl",
    ratings="ml-20m/ratings.csv",
    cache="cache_frame_env.pkl",
    use_cache=True
)
env = recnn.data.env.FrameEnv(dirs, frame_size, batch_size)

Now move the model to the GPU (the torch device defined earlier).
Since DDPG is an actor-critic technique, it has two models: an actor and a critic. This article uses the actor model; trying out the critic model is left to you (a hedged sketch follows the code below).
ddpg = recnn.nn.models.Actor(1290, 128, 256).to(cuda)
test_batch = next(iter(env.test_dataloader))
state, action, reward, next_state, done = recnn.data.get_base_batch(test_batch)
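If you want to experiment with the critic as suggested above, the following is only a hedged sketch: it assumes RecNN exposes a Critic model with a constructor and forward signature analogous to Actor, so verify against the library’s documentation before relying on it.

# Assumption: RecNN's Critic takes (state_dim, action_dim, hidden_size) like Actor,
# and its forward pass takes (state, action). Check the RecNN source/docs first.
critic = recnn.nn.models.Critic(1290, 128, 256).to(cuda)
q_values = critic(state, ddpg(state))   # estimated Q-values for the actor's proposed actions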
Create a custom function
To compute recommendation scores, create a custom function that takes the action generated by the model and a distance metric, scores every movie embedding against that action, enriches the top results with metadata, and finally stores the results in a pandas dataframe.
def rank_score(gen_action, metric):
    # Score every movie embedding against the generated action using the given distance metric
    scores = []
    for i in env.base.key_to_id.keys():
        if i == 0 or i == '0':
            continue
        scores.append([i, metric(env.base.embeddings[env.base.key_to_id[i]], gen_action)])
    # Keep the 10 closest movies
    scores = list(sorted(scores, key=lambda x: x[1]))
    scores = scores[:10]
    # Enrich each movie with metadata from the OMDB file
    for i in range(10):
        scores[i].extend([meta[str(scores[i][0])]['omdb'][key]
                          for key in ['Title', 'Genre', 'Language', 'Released', 'imdbRating']])
    indexes = ['id', 'score', 'Title', 'Genre', 'Language', 'Released', 'imdbRating']
    table_dict = dict([(key, [i[idx] for i in scores]) for idx, key in enumerate(indexes)])
    table = pd.DataFrame(table_dict)
    return table
Generate recommendations
ddpg_model = ddpg(state)
ddpg_model = ddpg_model[np.random.randint(0, state.size(0), 1)[0]].detach().cpu().numpy()
Use the custom function with different distance metrics to get recommendations for a randomly sampled state from the test batch.
rank_score(ddpg_model, distance.euclidean)

rank_score(ddpg_model, distance.correlation)

Conclusion
The purpose of reinforcement learning is to develop an action model that maximises the agent’s total cumulative reward. The main disadvantage of conventional recommendation approaches is that they do not provide truly personalised recommendations; they rely on content-based or user-type-based filtering. With this article, we have understood how reinforcement learning can be used to build a recommendation system.