Whether you are watching a web series or shopping online, recommender systems save time for many users. These systems predict a user's preferences for content or products. Popular online platforms such as Facebook, Netflix, and Myntra use this technology in many ways.
In this article, we list down – in no particular order – ten datasets one must know to build recommender systems.
1| MovieLens 25M Dataset
About: MovieLens is a rating dataset from the MovieLens website, collected over several time periods. The MovieLens 25M dataset describes 5-star rating and free-text tagging activity on MovieLens and contains 25,000,095 ratings and 1,093,360 tag applications across 62,423 movies. The data were created by 162,541 users between 9 January 1995 and 21 November 2019.
Click here to know more.
2| Social Network Influencer
About: This dataset is provided by PeerIndex and poses a standard pairwise preference learning task. Each datapoint describes two individuals, with pre-computed, standardized features based on Twitter activity (such as the volume of interactions and the number of followers) provided for each individual. With this dataset, one can train a machine learning model to predict which of the two individuals is more influential.
Click here to know more.
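One way to approach a pairwise preference task like this is to fit a logistic model on the difference between the two individuals' feature vectors. The sketch below is only an illustration under stated assumptions: the two features, the toy values, and the function names `train_pairwise` and `prob_a_wins` are hypothetical and do not come from the dataset itself.

```python
import math

def train_pairwise(pairs, labels, lr=0.1, epochs=500):
    """Fit weights for a logistic model on feature differences (A minus B).

    pairs  : list of (features_a, features_b) tuples
    labels : 1 if individual A is more influential, otherwise 0
    """
    n_features = len(pairs[0][0])
    w = [0.0] * n_features
    for _ in range(epochs):
        for (a, b), y in zip(pairs, labels):
            diff = [x - z for x, z in zip(a, b)]
            score = sum(wi * d for wi, d in zip(w, diff))
            p = 1.0 / (1.0 + math.exp(-score))
            # Gradient ascent step on the log-likelihood.
            for j in range(n_features):
                w[j] += lr * (y - p) * diff[j]
    return w

def prob_a_wins(w, a, b):
    """Estimated probability that A is more influential than B."""
    score = sum(wi * (x - z) for wi, x, z in zip(w, a, b))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical standardized features: [followers, interactions]
pairs = [
    ([1.2, 0.8], [0.1, 0.2]),
    ([-0.5, -0.3], [0.9, 1.1]),
    ([0.4, 1.5], [0.3, -0.2]),
    ([-1.0, 0.0], [0.2, 0.6]),
]
labels = [1, 0, 1, 0]

weights = train_pairwise(pairs, labels)
print(prob_a_wins(weights, [1.0, 1.0], [0.0, 0.0]))
```

Because the model has no bias term and only sees the feature difference, swapping A and B simply flips the prediction, which is a natural property for a pairwise comparison.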
3| Million Song Dataset
About: The Million Song Dataset is a collection of audio features and metadata for a million contemporary popular music tracks. Provided by The Echo Nest, the core of this dataset is the feature analysis and metadata for one million songs. The dataset aims to encourage research on algorithms that scale to commercial sizes, provide a reference dataset for evaluating research, help new researchers get started in the music information retrieval (MIR) field, and more.
Click here to know more.
4| Free Music Archive
About: Free Music Archive (FMA) is a collection of high-quality, legal audio downloads for music analysis. The dataset is suitable for evaluating several tasks in MIR, a field which is concerned with browsing, searching, and organizing vast music collections. It contains 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. FMA provides full-length and high-quality audio, pre-computed features, together with the track- and user-level metadata, tags, and free-form text such as biographies.
Click here to know more.
5| Netflix Prize Dataset
About: The Netflix Prize dataset is the multivariate time-series dataset that was used in the Netflix Prize competition. It consists of about 100 million movie ratings from over 480,000 customers, each identified by a unique integer ID. With this dataset, one can train models that predict the missing entries in the movie-user rating matrix.
Click here to know more.
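Predicting missing entries in a rating matrix is commonly approached with matrix factorization. The sketch below is a minimal illustration on made-up toy ratings, not the method of any particular competition entry; the function names `factorize` and `predict`, the hyperparameters, and the sample data are all assumptions.

```python
import random

def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.01, reg=0.02, epochs=2000, seed=0):
    """Learn user and item factor vectors from (user, item, rating) triples via SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            err = r - pred
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Move both factors toward a smaller error, with L2 shrinkage.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating: dot product of the user's and item's factor vectors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))

# Hypothetical observed ratings: (user index, movie index, rating)
ratings = [
    (0, 0, 5), (0, 1, 4),
    (1, 0, 4), (1, 2, 1),
    (2, 1, 5), (2, 2, 2),
    (3, 0, 1), (3, 2, 5),
]

P, Q = factorize(ratings, n_users=4, n_items=3)
# Estimate a rating that was never observed: user 0 on movie 2.
print(predict(P, Q, 0, 2))
```

On data at the scale of 100 million ratings, one would normally rely on an optimized library rather than pure-Python loops; the loop form is used here only to keep each update step visible.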
6| Book-Crossing Dataset
About: The Book-Crossing Dataset comes from a 4-week crawl of the Book-Crossing community. It contains 278,858 anonymized users, with demographic information, who provided 1,149,780 ratings (explicit and implicit) about 271,379 books.
Click here to know more.
7| Amazon Review Data
About: Amazon Review Data is a collection of reviews (ratings, text, and helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links between products that were viewed. The dataset contains product reviews and metadata from Amazon, and the total number of reviews is 233.1 million.
Click here to know more.
8| Yahoo! Music User Ratings
About: This dataset represents a collection of the Yahoo! Music community’s preferences for various musical artists. It contains over 10 million ratings of musical artists given by Yahoo! Music users. Researchers may use the dataset to validate recommender systems or collaborative filtering algorithms. It may also serve as a testbed for matrix and graph algorithms, including PCA and clustering algorithms.
Click here to know more.
9| LastFM
About: This dataset contains social networking, tagging, and music artist listening information from a set of about 2,000 users of the Last.fm online music system. It covers 17,632 music artists listened to and tagged by these users.
Click here to know more.
10| Steam Video Games
About: The Steam Video Games dataset is a collection of user behaviours, such as purchase and play, on Steam, the most popular PC gaming hub. Its columns are user-id, game-title, behaviour-name, and value, where value indicates the degree to which the behaviour was performed.
Click here to know more.
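Rows in the four-column layout described above can be turned into per-user play totals, a common starting point for implicit-feedback recommenders. The sketch below uses made-up sample rows and a hypothetical helper, `hours_played`; it assumes the behaviour names are the literal strings "purchase" and "play" and that the value on a play row is the number of hours played.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the column order:
# user-id, game-title, behaviour-name, value
SAMPLE = """\
101,Dota 2,purchase,1.0
101,Dota 2,play,35.5
101,Portal,purchase,1.0
202,Portal,purchase,1.0
202,Portal,play,4.2
"""

def hours_played(rows):
    """Return {user_id: {game_title: total value of 'play' rows}}."""
    hours = defaultdict(dict)
    for user_id, title, behaviour, value in rows:
        if behaviour == "play":  # purchase rows are skipped
            hours[user_id][title] = hours[user_id].get(title, 0.0) + float(value)
    return dict(hours)

rows = csv.reader(io.StringIO(SAMPLE))
print(hours_played(rows))
```

Users who bought a game but never played it do not appear under that title, which keeps purchases and actual engagement separate.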