Deep learning algorithms are the go-to choice for almost all recommender systems nowadays. Deep learning thrives on devouring tonnes of data and spewing out recommendations with great accuracy. These systems are ubiquitous and have touched many lives in one form or another. From YouTube to Netflix, their applications have multiplied.
Since there is no going back on deep learning-based recommendation systems, it makes sense to examine the route they take in building the final model. Many popular recommendation approaches have been widely accepted. But are they as good as they seem? Are they reproducible? Are there simpler, better alternatives to the deep learning approach?
To address these questions, Maurizio Ferrari Dacrema and his colleagues conducted a study of recent recommendation approaches.
Finding How Good Is Good Enough
The authors considered 18 algorithms presented at top-level research conferences in recent years. They arrived at these 18 by going through the recent proceedings of KDD, SIGIR, TheWebConf (WWW) and RecSys and selecting the papers that proposed new deep learning approaches for top-n recommendation tasks.
The following baseline methods were used in the experiments as a point of comparison for the new recommendation approaches:
TopPopular: A non-personalized method that recommends the most popular items to everyone. Popularity is measured by the number of explicit or implicit ratings.
ItemKNN: A traditional collaborative filtering (CF) approach based on k-nearest neighbours (KNN). Other baselines included UserKNN and ItemKNN-CFCBF, among others.
The recent approaches that were checked for reproducibility include:
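To make the two baselines concrete, here is a minimal sketch in Python with NumPy. It is not the authors' code: the function names, the use of cosine similarity for ItemKNN, and the toy interaction matrix are all assumptions made for illustration, and a tuned baseline would involve more choices (similarity measure, neighbourhood size and so on).

```python
import numpy as np


def top_popular(interactions: np.ndarray, n: int) -> np.ndarray:
    """Return the n most interacted-with items; every user gets the same list."""
    popularity = (interactions > 0).sum(axis=0)
    # Stable sort on negated counts: most popular first, ties go to the lower item index.
    return np.argsort(-popularity, kind="stable")[:n]


def item_knn(interactions: np.ndarray, k: int, n: int) -> np.ndarray:
    """Return the top-n unseen items per user, scored by item-item similarity.

    Cosine similarity is an assumption of this sketch, not taken from the article.
    """
    x = (interactions > 0).astype(float)
    norms = np.linalg.norm(x, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody touched
    sim = (x.T @ x) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)  # an item is not its own neighbour
    # Keep only the k most similar neighbours of each item.
    for item in range(sim.shape[0]):
        dropped = np.argsort(-sim[item], kind="stable")[k:]
        sim[item, dropped] = 0.0
    # Score of item j for user u: summed similarity between j and the items u already has.
    scores = x @ sim.T
    scores[x > 0] = -np.inf  # never recommend what the user has already seen
    return np.argsort(-scores, axis=1, kind="stable")[:, :n]


# Hypothetical toy data: rows are users, columns are items, 1 marks an interaction.
ratings = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
])
print(top_popular(ratings, n=2))
print(item_knn(ratings, k=2, n=1))
```

Both methods fit in a few lines and have very few knobs to tune, which is part of why they make useful reference points.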
- Collaborative Memory Networks (CMN)
- Metapath based Context for RECommendation (MCRec)
- Collaborative Variational Autoencoder (CVAE)
- Collaborative Deep Learning (CDL)
- Neural Collaborative Filtering (NCF) and others.
After collating the relevant works, the authors selected the papers with code to check for reproducibility. They lament that they could reproduce the published results with an acceptable degree of certainty for only seven of the 18 papers.
The reproducibility of a work was decided based on the following factors:
- A working version of the source code is available, or the code only has to be modified in minimal ways to work correctly
- At least one dataset used in the original paper is available. A further requirement here is that either the originally used train-test splits are publicly available or they can be reconstructed from the information in the paper
To check for reproducibility, the authors refactored the original implementations so that they could apply the same evaluation procedure that was used in the original papers. Specifically, the refactoring separated the original code for training, hyper-parameter optimization and prediction from the evaluation code.
The study, to the surprise of the authors, revealed that in the large majority of the investigated cases (6 out of 7), the proposed deep learning techniques did not consistently outperform simple but well-tuned baseline methods.
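The separation described above can be sketched roughly as follows. This is an illustrative layout, not the authors' framework: the `Recommender` protocol, the `evaluate` function, the `FixedListRecommender` stand-in and the choice of hit rate and NDCG as metrics are all assumptions. The point is only that the evaluation code needs nothing from a model except a ranked list, so a neural model and a simple baseline can go through exactly the same procedure.

```python
import math
from typing import Dict, List, Protocol, Set


class Recommender(Protocol):
    """Anything that can produce a ranked list of item ids for a user."""

    def recommend(self, user: int, n: int) -> List[int]: ...


def evaluate(model: Recommender, held_out: Dict[int, Set[int]], n: int) -> Dict[str, float]:
    """Compute hit rate@n and NDCG@n over held-out items, independent of the model."""
    if not held_out:
        return {"hit_rate": 0.0, "ndcg": 0.0}
    hits = 0
    ndcg_total = 0.0
    for user, relevant in held_out.items():
        ranked = model.recommend(user, n)[:n]
        # Discounted gain for every relevant item, by its 0-based rank.
        gains = [1.0 / math.log2(rank + 2)
                 for rank, item in enumerate(ranked) if item in relevant]
        # Best achievable gain if all relevant items were ranked first.
        ideal = sum(1.0 / math.log2(rank + 2)
                    for rank in range(min(len(relevant), n)))
        hits += 1 if gains else 0
        ndcg_total += sum(gains) / ideal if ideal > 0 else 0.0
    users = len(held_out)
    return {"hit_rate": hits / users, "ndcg": ndcg_total / users}


class FixedListRecommender:
    """Hypothetical stand-in model: returns the same ranking for every user."""

    def __init__(self, ranking: List[int]) -> None:
        self.ranking = ranking

    def recommend(self, user: int, n: int) -> List[int]:
        return self.ranking[:n]


# Hypothetical held-out test items per user.
test_items = {0: {1}, 1: {9}}
print(evaluate(FixedListRecommender([3, 1, 2]), test_items, n=3))
```

Swapping `FixedListRecommender` for any other object with a `recommend` method leaves the evaluation untouched, which is what makes results comparable across methods.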
Future Direction
The paper set out to answer two questions:
Reproducibility: To what extent is recent research in the area reproducible (with reasonable effort)?
Progress: To what extent are recent algorithms actually leading to better performance results when compared to relatively simple, but well-tuned, baseline methods?
Besides issues related to the baselines, an additional challenge is that researchers use various types of datasets, evaluation protocols, performance measures, and data preprocessing steps, which makes it difficult to conclude which method is the best across different application scenarios.
That newer approaches looked like progress when published, yet fall short of the baselines here, can be a consequence of the following:
(i) weak baselines;
(ii) establishment of weak methods as new baselines; and
(iii) difficulties in comparing or reproducing results across papers
If the newer approaches cannot outrank the older, simpler ones, there is little reason to implement them.
Read the full paper here