10 Open-Source Datasets One Must Know To Build Recommender Systems

Whether you are watching a web series or shopping online, recommender systems work as time-savers. These systems predict a user's preferences for content or products. Popular online platforms such as Facebook, Netflix and Myntra, among others, use this technology in many ways.

In this article, we list, in no particular order, ten datasets one must know to build recommender systems.

1| MovieLens 25M Dataset

About: MovieLens is a rating dataset from the MovieLens website, collected over several periods. The MovieLens 25M dataset describes 5-star rating and free-text tagging activity from MovieLens. It contains 25,000,095 ratings and 1,093,360 tag applications across 62,423 movies. These data were created by 162,541 users between 9 January 1995 and 21 November 2019.


Click here to know more.

2| Social Network Influencer

About: This dataset, provided by PeerIndex, comprises a standard pairwise preference learning task. Each datapoint describes two individuals through pre-computed, standardized features based on their Twitter activity, such as the volume of interactions and the number of followers. With this dataset, one can train a machine learning model that predicts, with high accuracy, which of the two individuals is more influential.
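A pairwise preference task of this kind can be sketched in a few lines. The example below is a minimal illustration, not the dataset's official baseline: it fits a logistic regression on the difference between the two individuals' feature vectors. The two feature columns and all numbers are made up for illustration, and the function names are hypothetical.

```python
import math

def train_pairwise(pairs, labels, lr=0.1, epochs=200):
    """Fit logistic regression on feature differences (A minus B).

    labels[i] is 1 if individual A in pairs[i] is the more influential one.
    """
    n_features = len(pairs[0][0])
    w = [0.0] * n_features
    for _ in range(epochs):
        for (a, b), y in zip(pairs, labels):
            diff = [ai - bi for ai, bi in zip(a, b)]
            z = sum(wi * di for wi, di in zip(w, diff))
            p = 1.0 / (1.0 + math.exp(-z))  # P(A is more influential)
            for i in range(n_features):
                w[i] += lr * (y - p) * diff[i]  # gradient ascent step
    return w

def a_is_more_influential(w, a, b):
    """Return True if the model ranks A above B."""
    return sum(wi * (ai - bi) for wi, ai, bi in zip(w, a, b)) > 0

# Toy, made-up standardized features: [followers, interactions]
pairs = [
    ([1.2, 0.8], [0.1, 0.2]),
    ([-0.5, 0.0], [0.9, 1.1]),
    ([0.3, 1.5], [0.2, -0.4]),
    ([-1.0, -0.7], [0.4, 0.6]),
]
labels = [1, 0, 1, 0]

weights = train_pairwise(pairs, labels)
print(a_is_more_influential(weights, [1.0, 1.0], [0.0, 0.0]))
```

Working on the difference of the two feature vectors, with no bias term, keeps the model symmetric: swapping A and B flips the prediction.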

Click here to know more.

3| Million Song Dataset

About: The Million Song Dataset is a collection of audio features and metadata for a million contemporary popular music tracks. Its core is the feature analysis and metadata for one million songs, provided by The Echo Nest. The purpose of this dataset is to encourage research on algorithms that scale to commercial sizes, provide a reference dataset for evaluating research, help new researchers get started in the music information retrieval (MIR) field, and more.

Click here to know more.

4| Free Music Archive

About: Free Music Archive (FMA) is a collection of high-quality, legal audio downloads for music analysis. The dataset is suitable for evaluating several tasks in music information retrieval (MIR), a field concerned with browsing, searching, and organizing vast music collections. It contains 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks by 16,341 artists on 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. FMA provides full-length, high-quality audio and pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies.

Click here to know more.

5| Netflix Prize Dataset

About: The Netflix Prize dataset is the multivariate time-series dataset that was used in the Netflix Prize competition. It consists of about 100 million movie ratings from over 480,000 customers, each identified by a unique integer ID. With this dataset, one can train models that predict the missing entries in the movie-user rating matrix.
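To make "predicting missing entries in the rating matrix" concrete, here is a minimal sketch of one common approach: a latent-factor model trained with stochastic gradient descent. It is not presented as the method used by any competition entry. The tiny rating dictionary is made up, and the function names and hyperparameters are illustrative assumptions.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02,
              epochs=2000, seed=0):
    """Learn user factors P and item factors Q from observed ratings.

    ratings maps (user_index, item_index) -> rating.
    """
    rng = random.Random(seed)
    P = [[rng.random() * 0.1 for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.random() * 0.1 for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # SGD step on squared error with L2 regularization
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating is the dot product of user and item factors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def rmse(P, Q, ratings):
    """Root mean squared error over the observed ratings."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for (u, i), r in ratings.items())
    return (total / len(ratings)) ** 0.5

# Toy 3x3 matrix (made-up values); entries (1, 1) and (2, 0) are missing.
ratings = {(0, 0): 5, (0, 1): 3, (0, 2): 1, (1, 0): 4,
           (1, 2): 1, (2, 1): 2, (2, 2): 5}

P, Q = factorize(ratings, n_users=3, n_items=3)
print("training RMSE:", rmse(P, Q, ratings))
print("predicted rating for user 1, movie 1:", predict(P, Q, 1, 1))
```

On the full dataset one would hold out some ratings to measure error on unseen entries, rather than reporting only the training error as this sketch does.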

Click here to know more.

6| Book-Crossing Dataset 

About: The Book-Crossing dataset was collected in a 4-week crawl of the Book-Crossing community. It contains 278,858 anonymized users, with demographic information, who provided 1,149,780 ratings (explicit and implicit) of 271,379 books.

Click here to know more.

7| Amazon Review Data

About: Amazon Review Data is a collection of reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links between products, such as those that were also viewed. The dataset contains product reviews and metadata from Amazon, with 233.1 million reviews in total.

Click here to know more.

8| Yahoo! Music User Ratings

About: This dataset represents a collection of the Yahoo! Music community's preferences for various musical artists. It contains over 10 million ratings of musical artists given by Yahoo! Music users. Researchers may use the dataset to validate recommender systems or collaborative filtering algorithms. It may also serve as a testbed for matrix and graph algorithms, including PCA and clustering algorithms.

Click here to know more.

9| LastFM

About: This dataset contains social networking, tagging, and music artist listening information from a set of about 2,000 users of the Last.fm online music system. It covers 17,632 music artists listened to and tagged by these users.

Click here to know more.

10| Steam Video Games

About: The Steam Video Games dataset is a collection of user behaviours on Steam, the most popular PC gaming hub. Its columns are user-id, game-title, behaviour-name and value. The behaviours included are purchase and play, and value indicates the degree to which the behaviour was performed.
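Rows in this four-column shape can be collapsed into one implicit-feedback score per user and game, which is the usual input to a recommender. The sketch below is a rough illustration under stated assumptions: the rows are made up, value is assumed to be 1 for purchase rows and hours played for play rows, and the scoring rule (1 for a purchase plus a log-damped play time) is an arbitrary choice, not part of the dataset.

```python
import math
from collections import defaultdict

# Made-up rows in the column order: user-id, game-title, behaviour-name, value
rows = [
    (1, "Dota 2", "purchase", 1.0),
    (1, "Dota 2", "play", 120.0),
    (1, "Portal", "purchase", 1.0),
    (2, "Portal", "purchase", 1.0),
    (2, "Portal", "play", 3.5),
]

def implicit_scores(rows):
    """Combine purchase and play rows into one score per (user, game)."""
    scores = defaultdict(float)
    for user, game, behaviour, value in rows:
        if behaviour == "purchase":
            scores[(user, game)] += 1.0
        elif behaviour == "play":
            # log1p damps very long play times (assumes value is in hours)
            scores[(user, game)] += math.log1p(value)
    return dict(scores)

scores = implicit_scores(rows)
for (user, game), score in sorted(scores.items()):
    print(user, game, round(score, 2))
```

A game that was bought but never played keeps a score of exactly 1, so it still counts as a weak positive signal.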

Click here to know more.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
