How to Evaluate Recommender Systems with RGRecSys?

Traditionally, evaluations of recommender systems have focused on algorithmic performance, most often recommendation accuracy. Today, however, evaluation should not be limited to such metrics: it should also consider constraints such as incorrect user-supplied data, performance on subpopulations of users or items, changes in data distribution, and so on. In this post, we will look at RGRecSys, a library that performs this kind of constraint-based robustness evaluation of recommender systems. The major points to be covered in this article are listed below.

Table of contents

  1. About recommender systems
  2. Evaluating a recommender system
  3. How RGRecSys helps to evaluate
  4. Evaluation features of RGRecSys 

Let’s begin with a brief introduction to recommender systems.

About recommender systems

A recommender system, sometimes known as a recommendation engine, is a type of information filtering system that attempts to forecast a user’s “rating” or “preference” for an item. Playlist generators for video and music services, product recommenders for online businesses, content recommenders for social media platforms, and open web content recommenders are all examples of recommender systems in use. 



These systems can function with a single input, such as music, or with several inputs within and across platforms, such as news, books, and search queries. There are also popular recommender systems for specific themes such as restaurants and online dating. Recommender systems often employ collaborative filtering or content-based filtering (also known as the personality-based approach), alongside other approaches such as knowledge-based systems.

Collaborative filtering techniques build a model from a user’s past behaviour (products previously purchased or selected, as well as numerical ratings given to those items) and from analogous decisions made by other users. This model is then used to estimate which items (or item ratings) the user may be interested in. Content-based filtering approaches, by contrast, use a collection of discrete, pre-tagged features of an item to recommend additional items with similar attributes.
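
As a toy illustration of the collaborative filtering idea (independent of RGRecSys; the rating matrix and helper functions below are made up for the example):

```python
from math import sqrt

# Toy user-item rating matrix (0 = unrated); rows are users, columns are items.
ratings = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
]

def cosine_sim(a, b):
    """Cosine similarity between two users' rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def predict_rating(ratings, user, item):
    """User-based collaborative filtering: average the ratings other users
    gave `item`, weighted by how similar each of them is to `user`."""
    sims, vals = [], []
    for other, row in enumerate(ratings):
        if other != user and row[item] > 0:
            sims.append(cosine_sim(ratings[user], row))
            vals.append(row[item])
    if not sims:
        return 0.0
    return sum(s * v for s, v in zip(sims, vals)) / sum(sims)

# User 0 is most similar to user 1, who rated item 2 low, so the
# prediction lands near that low rating.
print(round(predict_rating(ratings, user=0, item=2), 2))  # ~1.73
```

A content-based approach would instead compare pre-tagged item features (genre, price range, and so on) rather than other users' behaviour.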


Evaluating a recommender system

Traditionally, recommender systems have been created and analyzed under simple but unrealistic assumptions, such as i.i.d. assumptions (training and testing data being independent and identically distributed) and noiseless and abundant data. 

Recent studies have relaxed these assumptions and focus on constructing models in more difficult yet realistic settings, in which the data fed into recommender systems may be intentionally attacked, scarce, or biased. However, there are more aspects to consider when assessing robustness.

Some user and item features, for example, may be corrupted (transformation), or the i.i.d. assumption linking training and testing data may be violated (distributional shift). The performance of recommender systems that rely too heavily on unrealistic assumptions can suffer greatly. Hence, to account for such constraints, a recommender system should be properly evaluated for robustness.

How RGRecSys helps to evaluate

The authors of RGRecSys propose a comprehensive definition of robustness for recommendation systems, one that includes and formalizes several perspectives on robustness: robustness with respect to subpopulations, transformations, distributional shift, attack, and sparsity.

Robustness Gym for RecSys (RGRecSys) is a robustness evaluation toolkit for recommendation systems that allows us to quickly and uniformly conduct a comprehensive robustness evaluation for recommendation system models. RGRecSys assesses the robustness of recommender system models to data subpopulation, transformation, distributional shift, attack, and sparsity.

To demonstrate the utility of RGRecSys, the system makes use of RecBole’s built-in models. RecBole uses PyTorch throughout the library and proposes a unified framework that includes data, model, and evaluation modules. The library contains a diverse collection of models for general, sequential, context-aware, and knowledge-based recommendation.

RecBole’s general and extensible data structure makes it easy to add new models to the library and gives users enough flexibility to configure experimental settings such as hyperparameters and splitting criteria. Using RGRecSys’ robustness evaluation module, we can conduct unified and comprehensive robustness evaluations of recommender system models.

Evaluation features of RGRecSys

Subpopulation evaluation

Most existing recommender system libraries report performance metrics averaged across all users and items. A high average metric, however, does not guarantee that the model performs well for every subset of users or items.

For example, a recommender system may perform well on average across all users but poorly on subgroups of users such as females or people of a specific race. With the increasing importance of fair recommender systems, reporting performance for a subset of users or items is critical.

Here the RGRecSys library allows us to evaluate model performance for any subgroup of interest, such as users sharing a specific feature value, users grouped by activeness (number of interactions), or users grouped by how critical their rating scores are. That is, given a trained model, the library can slice the test data to perform a fine-grained evaluation and assess slicing robustness.
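
The slicing idea can be sketched in plain Python (this is a conceptual illustration, not RGRecSys’ actual API; the user fields and the hit-rate metric below are made-up stand-ins):

```python
test_set = [
    # (user_id, gender, n_interactions, hit); hit = 1 if the model's top-k
    # list contained the held-out item for this user, else 0.
    (1, "F", 3,  1),
    (2, "M", 40, 1),
    (3, "F", 5,  0),
    (4, "M", 55, 1),
    (5, "F", 2,  0),
]

def hit_rate(rows):
    """Fraction of users whose held-out item was recommended."""
    return sum(hit for *_, hit in rows) / len(rows)

def evaluate_slice(rows, predicate):
    """Evaluate the metric only on the subset of users matching `predicate`."""
    return hit_rate([r for r in rows if predicate(r)])

overall = hit_rate(test_set)                               # 3/5 = 0.6
female  = evaluate_slice(test_set, lambda r: r[1] == "F")  # 1/3
cold    = evaluate_slice(test_set, lambda r: r[2] < 10)    # low-activity users
print(overall, female, cold)
```

A high overall score (0.6) masks much worse performance on the female and low-activity slices, which is exactly the gap subpopulation evaluation is meant to expose.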

Distributional shift evaluation

Many recommender system models are built on the assumption that training and testing data are independent and identically distributed. However, in real-world scenarios, this i.i.d. assumption is frequently broken. RGRecSys can be used to validate recommender system models under distributional shift.

To accomplish this, RGRecSys first provides users with the training data distribution based on user features and then allows them to manipulate the testing data distribution by sampling it to differ from the training distribution. For example, users of the library can set the female-to-male ratio in the testing data so that it differs greatly from the training data.
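
The resampling step can be sketched as follows (an illustrative simulation, not RGRecSys’ actual API; the user list and feature names are invented):

```python
import random

random.seed(0)

# Hypothetical test users tagged with a gender feature: 30% "F" in training.
test_users = [("u%d" % i, "F" if i < 30 else "M") for i in range(100)]

def resample_to_ratio(users, target_female_ratio, n):
    """Draw a shifted test set of size `n` whose female/male ratio
    deliberately differs from the training distribution."""
    females = [u for u in users if u[1] == "F"]
    males   = [u for u in users if u[1] == "M"]
    n_f = round(n * target_female_ratio)
    return random.sample(females, n_f) + random.sample(males, n - n_f)

# Shift the test distribution from 30% female to 70% female.
shifted = resample_to_ratio(test_users, target_female_ratio=0.7, n=40)
ratio = sum(1 for _, g in shifted if g == "F") / len(shifted)
print(ratio)  # 0.7 by construction
```

Comparing a model's metrics on the original versus the shifted test set then quantifies its sensitivity to this distributional change.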

Attack evaluation

Because recommender systems have such a large economic impact, they are extremely vulnerable to attacks aimed at changing the rankings of specific items. When dealing with malicious attacks, it is critical to evaluate the performance of recommender system models. 

RGRecSys allows us to test models in the context of a Cross-Site Request Forgery (CSRF) attack, in which the attacker causes a victim user to unintentionally perform an action. For example, a user’s ratings may be changed against their will, resulting in corrupted interaction data in the training dataset. With RGRecSys, we can control the severity of the attack by specifying what fraction of the interactions will be corrupted.
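
One simple way to simulate such corrupted interactions is to flip a chosen fraction of ratings (an illustrative sketch with invented data, not RGRecSys’ actual attack module; `fraction` plays the role of the severity knob):

```python
import random

random.seed(1)

# Toy interaction log: (user, item, rating) triples on a 1-5 scale.
interactions = [(u, i, random.choice([1, 2, 3, 4, 5]))
                for u in range(20) for i in range(5)]

def corrupt(interactions, fraction, seed=0):
    """Return a copy in which `fraction` of the ratings are pushed to the
    opposite end of the scale, mimicking ratings changed against users' will."""
    rng = random.Random(seed)
    idx = set(rng.sample(range(len(interactions)),
                         int(fraction * len(interactions))))
    return [(u, i, 1 if r >= 3 else 5) if k in idx else (u, i, r)
            for k, (u, i, r) in enumerate(interactions)]

attacked = corrupt(interactions, fraction=0.2)
changed = sum(1 for a, b in zip(interactions, attacked) if a != b)
print(changed, "of", len(interactions), "interactions corrupted")
```

Training on the corrupted log and comparing metrics against a clean-data baseline shows how gracefully a model degrades as the attack severity grows.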

Sparsity evaluation

Data fed into recommender systems usually includes users’ explicit or implicit feedback, such as ratings or clicks. Typically, such data is sparse, and recommender systems are known to perform poorly when fed sparse data. The library allows users to compare the robustness of different models under sparse data by randomly removing a fraction of user interaction data. We can select the level of sparsity as well as which users’ interactions to drop, based on their activity.
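
Dropping interactions by activity level can be sketched like this (a conceptual illustration with invented data, not RGRecSys’ actual API):

```python
import random

# Toy log: user -> list of items interacted with; u0 has 1 item, u9 has 10.
log = {f"u{u}": list(range(u + 1)) for u in range(10)}

def sparsify(log, drop_fraction, min_activity=0, seed=0):
    """Randomly drop `drop_fraction` of each user's interactions, but only
    for users whose activity (interaction count) is at least `min_activity`."""
    rng = random.Random(seed)
    out = {}
    for user, items in log.items():
        if len(items) >= min_activity:
            keep = max(1, int(len(items) * (1 - drop_fraction)))
            out[user] = rng.sample(items, keep)
        else:
            out[user] = list(items)
    return out

# Halve the interactions of every user with 5+ interactions.
sparse = sparsify(log, drop_fraction=0.5, min_activity=5)
total_before = sum(len(v) for v in log.values())
total_after = sum(len(v) for v in sparse.values())
print(total_before, "->", total_after)
```

Retraining on the sparsified log at several `drop_fraction` levels then yields a robustness curve for each model under comparison.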

Transformation Evaluation

Most recommender system models require access to user and item features in order to provide users with a set of recommended items. Such information can be gathered by asking users and content providers to fill out a profile, or extracted automatically by the recommender system, for example from user reviews. However, this data could be tainted by misleading information, by errors that occur when recommender models attempt to extract it, or by a malicious attack.

As a result, it is more realistic to assume that some user or item features are inaccurate. RGRecSys allows us to evaluate models under transformations of user or item features, letting us choose which features to transform and how severe the transformation should be. The transformation can be random, in which case the feature value can take any value, or structured, in which case the feature value stays within a certain range of its true value.
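
The random-versus-structured distinction can be sketched for a single numeric feature (an illustrative example; the price feature and function names are invented, not RGRecSys’ API):

```python
import random

rng = random.Random(3)

# Hypothetical numeric item feature: a price known to lie in [0, 100].
true_price = 40.0

def transform(value, mode, severity=0.0, lo=0.0, hi=100.0, rng=rng):
    """Corrupt a feature value. 'random' draws anywhere in [lo, hi];
    'structured' perturbs within +/- `severity` of the true value."""
    if mode == "random":
        return rng.uniform(lo, hi)
    if mode == "structured":
        return min(hi, max(lo, value + rng.uniform(-severity, severity)))
    raise ValueError(f"unknown mode: {mode}")

noisy = transform(true_price, "structured", severity=5.0)
print(noisy)  # somewhere in [35, 45]
```

Sweeping `severity` (or switching to fully random corruption) and re-evaluating the model shows how sensitive its recommendations are to inaccurate features.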

Final words

Through this article, we have discussed recommendation systems and how to evaluate them. As we have seen, evaluating a recommendation system is not only about accuracy testing but also about robustness to constraints commonly observed in practice, such as sparse data and misleading user-fed information. This evaluation can be carried out with RGRecSys, which is still in the development phase.



Vijaysinh Lendave
Vijaysinh is an enthusiast in machine learning and deep learning. He is skilled in ML algorithms, data manipulation, handling and visualization, model building.
