10 Open-Source Datasets One Must Know To Build Recommender Systems

Ambika Choudhury

Whether you are watching a web series or shopping online, recommender systems save time by surfacing relevant items. A recommender system predicts a user's preferences for content or products. Popular online platforms such as Facebook, Netflix, and Myntra use this technology in many ways.

In this article, we list, in no particular order, ten datasets one must know to build recommender systems.

1| MovieLens 25M Dataset

About: MovieLens is a rating dataset collected from the MovieLens website over several periods. The MovieLens 25M dataset describes 5-star rating and free-text tagging activity from MovieLens. It contains 25,000,095 ratings and 1,093,360 tag applications across 62,423 movies. These data were created by 162,541 users between 9 January 1995 and 21 November 2019.
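
To give a feel for how sparse such rating data is, the short sketch below derives the density of the user-movie rating matrix from the counts quoted above. The variable names are illustrative and not part of the dataset.

```python
# Counts quoted in the MovieLens 25M description above.
num_ratings = 25_000_095
num_users = 162_541
num_movies = 62_423

# Number of cells in the full user x movie rating matrix.
possible_pairs = num_users * num_movies

# Fraction of cells that actually hold a rating, and its complement.
density = num_ratings / possible_pairs
sparsity = 1.0 - density

print(f"possible user-movie pairs: {possible_pairs:,}")
print(f"density:  {density:.4%}")
print(f"sparsity: {sparsity:.4%}")
```

Roughly a quarter of one percent of the possible user-movie pairs carry a rating, which is why recommender models have to generalize from very few observations per user.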



2| Social Network Influencer

About: This dataset, provided by PeerIndex, comprises a standard pairwise preference learning task. Each data point describes two individuals using pre-computed, standardized features based on their Twitter activity, such as the volume of interactions and the number of followers. With this dataset, one can train a machine learning model that predicts, with high accuracy, which of the two individuals is more influential.
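
A pairwise preference task of this kind can be sketched as logistic regression on the difference between the two individuals' feature vectors. The example below is a minimal illustration under stated assumptions: the data is synthetic, the two features and the hidden weights are made up, and none of it reflects the actual dataset columns or the method used by the dataset's provider.

```python
import math
import random

random.seed(0)

# Hidden weights used only to generate toy labels (an assumption for this sketch).
TRUE_WEIGHTS = [1.5, 0.8]


def make_pair():
    """Return (features_a, features_b, label); label is 1 if A is 'more influential'."""
    a = [random.gauss(0, 1) for _ in TRUE_WEIGHTS]
    b = [random.gauss(0, 1) for _ in TRUE_WEIGHTS]
    score_a = sum(w * x for w, x in zip(TRUE_WEIGHTS, a))
    score_b = sum(w * x for w, x in zip(TRUE_WEIGHTS, b))
    return a, b, 1 if score_a > score_b else 0


def sigmoid(z):
    # Clamp the input so math.exp never overflows.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, epochs=50, lr=0.1):
    """Fit logistic regression on the feature difference (A - B) with plain SGD."""
    weights = [0.0] * len(TRUE_WEIGHTS)
    for _ in range(epochs):
        for a, b, label in pairs:
            diff = [x - y for x, y in zip(a, b)]
            pred = sigmoid(sum(w * d for w, d in zip(weights, diff)))
            for i, d in enumerate(diff):
                weights[i] += lr * (label - pred) * d
    return weights


def predict(weights, a, b):
    """Return 1 if the model ranks A above B, else 0."""
    diff = [x - y for x, y in zip(a, b)]
    return 1 if sum(w * d for w, d in zip(weights, diff)) > 0 else 0


train_pairs = [make_pair() for _ in range(500)]
test_pairs = [make_pair() for _ in range(200)]

weights = train(train_pairs)
accuracy = sum(predict(weights, a, b) == y for a, b, y in test_pairs) / len(test_pairs)
print(f"learned weights: {weights}")
print(f"held-out accuracy: {accuracy:.2%}")
```

Working on the feature difference keeps the model antisymmetric: swapping A and B flips the prediction, which is a natural property for a "who is more influential" comparison.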

3| Million Song Dataset

About: The Million Song Dataset is a collection of audio features and metadata for a million contemporary popular music tracks. Its core is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset aims to encourage research on algorithms that scale to commercial sizes, provide a reference dataset for evaluating research, and help new researchers get started in the music information retrieval (MIR) field, among other goals.

4| Free Music Archive

About: Free Music Archive (FMA) is a collection of high-quality, legal audio downloads for music analysis. The dataset is suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. It contains 917 GiB and 343 days of Creative Commons-licensed audio: 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. FMA provides full-length, high-quality audio and pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies.



5| Netflix Prize Dataset

About: The Netflix Prize dataset is the multivariate time-series dataset used in the Netflix Prize competition. It consists of about 100 million movie ratings from over 480,000 customers, each identified by a unique integer ID. With this dataset, one can build models that predict the missing entries in the movie-user rating matrix.
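
One common way to predict missing entries in a rating matrix is matrix factorization trained with stochastic gradient descent. The sketch below illustrates that general technique on a tiny, made-up set of ratings; the toy data, the hyperparameters, and all names are assumptions for illustration, and the article does not prescribe this particular method.

```python
import random

random.seed(42)

# Toy (user, movie, rating) triples with made-up values, not Netflix Prize data.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 1),
]
num_users, num_movies, k = 4, 4, 2  # k is the number of latent factors

# Small random latent factors for users (P) and movies (Q).
P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(num_users)]
Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(num_movies)]

# Global mean rating, used as a baseline for every prediction.
mean = sum(r for _, _, r in ratings) / len(ratings)


def predict(u, m):
    """Predicted rating: global mean plus the dot product of latent factors."""
    return mean + sum(P[u][f] * Q[m][f] for f in range(k))


def rmse():
    """Root mean squared error over the observed ratings."""
    total = sum((r - predict(u, m)) ** 2 for u, m, r in ratings)
    return (total / len(ratings)) ** 0.5


initial_rmse = rmse()

lr, reg = 0.05, 0.02  # learning rate and L2 regularization strength
for _ in range(500):
    for u, m, r in ratings:
        err = r - predict(u, m)
        for f in range(k):
            pu, qm = P[u][f], Q[m][f]
            P[u][f] += lr * (err * qm - reg * pu)
            Q[m][f] += lr * (err * pu - reg * qm)

final_rmse = rmse()
print(f"training RMSE: {initial_rmse:.3f} -> {final_rmse:.3f}")

# User 0 never rated movie 3, so this is a filled-in "missing entry".
print(f"predicted rating for user 0, movie 3: {predict(0, 3):.2f}")
```

On the real dataset the same idea would need held-out data for evaluation, since a low training error alone says little about how well the missing entries are predicted.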

6| Book-Crossing Dataset 

About: The Book-Crossing Dataset comes from a 4-week crawl of the Book-Crossing community. It contains 278,858 users (anonymized, but with demographic information) who provide 1,149,780 ratings (explicit and implicit) about 271,379 books.

7| Amazon Review Data

About: Amazon Review Data is a collection of product reviews and metadata from Amazon. It includes reviews (ratings, text, and helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links between products, such as those that are also viewed. The total number of reviews in this dataset is 233.1 million.

8| Yahoo! Music User Ratings

About: This dataset represents a collection of the Yahoo! Music community's preferences for various musical artists. It contains over 10 million ratings of musical artists given by Yahoo! Music users. Researchers may use the dataset to validate recommender systems or collaborative filtering algorithms. It may also serve as a testbed for matrix and graph algorithms, including PCA and clustering.

9| LastFM

About: This dataset contains social networking, tagging, and music artist listening information from a set of about 2,000 users of the Last.fm online music system. It covers 17,632 music artists listened to and tagged by those users.

10| Steam Video Games

About: The Steam Video Games dataset is a collection of user behaviours on Steam, the most popular PC gaming hub. It has four columns: user-id, game-title, behaviour-name, and value. The behaviours are purchase and play, and the value indicates the degree to which the behaviour was performed.
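
The four-column layout lends itself to building implicit-feedback signals. The sketch below parses a few made-up rows in that layout and separates purchases from play activity. It assumes the behaviour names appear as the strings "purchase" and "play" and that the value for a play row is a numeric amount; the sample rows and helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the layout described above: user-id, game-title, behaviour-name, value.
SAMPLE = """\
101,Dota 2,purchase,1.0
101,Dota 2,play,120.5
101,Portal 2,purchase,1.0
202,Portal 2,purchase,1.0
202,Portal 2,play,8.0
202,Dota 2,purchase,1.0
202,Dota 2,play,3.5
"""

owned = defaultdict(set)    # user-id -> set of purchased game titles
played = defaultdict(dict)  # user-id -> {game title: total play value}

for user_id, game, behaviour, value in csv.reader(io.StringIO(SAMPLE)):
    if behaviour == "purchase":
        owned[user_id].add(game)
    elif behaviour == "play":
        played[user_id][game] = played[user_id].get(game, 0.0) + float(value)


def owned_but_unplayed(user_id):
    """Games a user purchased but has no play record for (hypothetical helper)."""
    return sorted(owned[user_id] - set(played[user_id]))


for user_id in sorted(owned):
    print(user_id, dict(played[user_id]), owned_but_unplayed(user_id))
```

Play values give a graded preference signal, while a purchase with no play record is a weaker and more ambiguous one, so the two are usually worth keeping apart when building a recommender.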


Copyright Analytics India Magazine Pvt Ltd
