10 Open-Source Datasets One Must Know To Build Recommender Systems

Whether you are watching a web series or shopping online, recommender systems work as time-savers. Such a system predicts a user's preferences for content. Popular online platforms such as Facebook, Netflix and Myntra have been using this technology in many ways.

In this article, we list, in no particular order, ten datasets one must know to build recommender systems.

1| MovieLens 25M Dataset

About: MovieLens is a ratings dataset from the MovieLens website, collected over several periods. The MovieLens 25M dataset describes 5-star rating and free-text tagging activity from MovieLens. It contains 25,000,095 ratings and 1,093,360 tag applications across 62,423 movies. These data were created by 162,541 users between 9 January 1995 and 21 November 2019.

Click here to know more.

2| Social Network Influencer

About: This dataset, provided by PeerIndex, comprises a standard pairwise preference learning task. Each data point describes two individuals using pre-computed, standardized features based on their Twitter activity, such as the volume of interactions and the number of followers. With this dataset, one can train a machine learning model that accurately predicts which of the two individuals is more influential.

Click here to know more.
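
The pairwise setup described above can be sketched in a few lines. This is a minimal illustration, not PeerIndex's method: it assumes two made-up standardized features per person (stand-ins for something like follower count and interaction volume), generates synthetic pairs, and fits a plain logistic regression on the feature difference A minus B. All function names and the hidden labelling weights are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical hidden weights, used only to label the synthetic pairs.
TRUE_WEIGHTS = [2.0, 1.0]


def score(features):
    """Hidden 'influence' score used to generate labels."""
    return sum(w * f for w, f in zip(TRUE_WEIGHTS, features))


def difference(features_a, features_b):
    """Turn a pair of feature vectors into one vector: A minus B."""
    return [a - b for a, b in zip(features_a, features_b)]


def make_pair():
    """Create two synthetic individuals and a label (1 if A outranks B)."""
    a = [random.uniform(-1, 1) for _ in TRUE_WEIGHTS]
    b = [random.uniform(-1, 1) for _ in TRUE_WEIGHTS]
    label = 1 if score(a) > score(b) else 0
    return a, b, label


def train(pairs, lr=0.1, epochs=50):
    """Logistic regression on feature differences, fitted with SGD."""
    weights = [0.0] * len(TRUE_WEIGHTS)
    for _ in range(epochs):
        for a, b, label in pairs:
            x = difference(a, b)
            z = sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            weights = [w + lr * (label - p) * xi for w, xi in zip(weights, x)]
    return weights


def a_is_more_influential(weights, a, b):
    """Predict True when the model ranks A above B."""
    return sum(w * xi for w, xi in zip(weights, difference(a, b))) > 0


pairs = [make_pair() for _ in range(200)]
weights = train(pairs)
accuracy = sum(
    a_is_more_influential(weights, a, b) == bool(label) for a, b, label in pairs
) / len(pairs)
print(f"training accuracy: {accuracy:.2f}")
```

Working on the difference of the two feature vectors keeps the model antisymmetric: swapping A and B flips the prediction, which is usually what a pairwise preference task calls for.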

3| Million Song Dataset

About: The Million Song Dataset is a collection of audio features and metadata for a million contemporary popular music tracks. Provided by Echo Nest, the core of this dataset is the feature analysis and metadata for one million songs. The purpose of this dataset is to encourage research on algorithms that scale to commercial sizes, provide a reference dataset for evaluating research, help new researchers get started in the music information retrieval (MIR) field, and more.

Click here to know more.

4| Free Music Archive

About: Free Music Archive (FMA) is a collection of high-quality, legal audio downloads for music analysis. The dataset is suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing vast music collections. It contains 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks by 16,341 artists on 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. FMA provides full-length, high-quality audio and pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies.

Click here to know more.

5| Netflix Prize Dataset

About: The Netflix Prize dataset is a multivariate, time-series dataset that was used in the Netflix Prize competition. It consists of about 100 million movie ratings from over 480,000 customers, each identified by a unique integer ID. With this dataset, one can predict missing entries in the movie-user rating matrix.

Click here to know more.
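
Predicting missing entries in a rating matrix is commonly approached with matrix factorization. The following is a minimal sketch on a toy set of invented ratings, not the Netflix data or any competition entry: it learns two small factor vectors per user and per movie with stochastic gradient descent, then uses their dot product to estimate an unobserved rating. The rank, learning rate, and regularization values are illustrative assumptions.

```python
import random

random.seed(0)

# Toy (user, movie, rating) triples, invented for illustration.
ratings = [
    (0, 0, 5), (0, 1, 3), (0, 3, 1),
    (1, 0, 4), (1, 3, 1),
    (2, 0, 1), (2, 1, 1), (2, 3, 5),
    (3, 0, 1), (3, 3, 4),
    (4, 1, 1), (4, 2, 5), (4, 3, 4),
]
n_users, n_movies, rank = 5, 4, 2

# Small random starting factors for every user and movie.
user_factors = [[random.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_users)]
movie_factors = [[random.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_movies)]


def predict(user, movie):
    """Estimated rating: dot product of the user and movie factors."""
    return sum(p * q for p, q in zip(user_factors[user], movie_factors[movie]))


def train(lr=0.01, reg=0.02, epochs=2000):
    """Fit the factors to the observed ratings with regularized SGD."""
    for _ in range(epochs):
        for user, movie, rating in ratings:
            error = rating - predict(user, movie)
            p, q = user_factors[user], movie_factors[movie]
            for k in range(rank):
                p_k, q_k = p[k], q[k]
                p[k] += lr * (error * q_k - reg * p_k)
                q[k] += lr * (error * p_k - reg * q_k)


def rmse():
    """Root-mean-square error over the observed ratings only."""
    squared = [(r - predict(u, m)) ** 2 for u, m, r in ratings]
    return (sum(squared) / len(squared)) ** 0.5


train()
print(f"training RMSE: {rmse():.3f}")
# User 1 never rated movie 1, so this is a filled-in "missing entry".
print(f"predicted rating for user 1, movie 1: {predict(1, 1):.2f}")
```

Only observed ratings contribute to the updates, so the missing cells are never treated as zeros. On real data one would hold out some ratings to measure error on entries the model has not seen.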

6| Book-Crossing Dataset 

About: The Book-Crossing Dataset comes from a 4-week crawl of the Book-Crossing community. It contains 278,858 anonymized users, with demographic information, who provided 1,149,780 ratings (explicit and implicit) about 271,379 books.

Click here to know more.

7| Amazon Review Data

About: Amazon Review Data is a collection of reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links between products that were viewed. The dataset contains product reviews and metadata from Amazon, and the total number of reviews is 233.1 million.

Click here to know more.

8| Yahoo! Music User Ratings

About: This dataset represents a collection of the Yahoo! Music community’s preferences for various musical artists. The dataset contains over 10 million ratings of musical artists which were given by the Yahoo! Music users. The dataset may be used by researchers to validate recommender systems or collaborative filtering algorithms. It may serve as a testbed for matrix and graph algorithms, including PCA and clustering algorithms.  

Click here to know more.

9| LastFM

About: This dataset contains social networking, tagging, and music artist listening information from a set of about 2,000 users of the Last.fm online music system. It covers 17,632 music artists listened to and tagged by these users.

Click here to know more.

10| Steam Video Games

About: The Steam Video Games dataset is a collection of user behaviours on Steam, the most popular PC gaming hub. Its columns are user-id, game-title, behaviour-name, and value. The behaviours included are purchase and play, and value indicates the degree to which the behaviour was performed.

Click here to know more.
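
The four-column layout described above lends itself to a simple implicit-feedback table. The sketch below is a rough illustration on invented rows, not the actual file: it assumes that for play rows the value column holds hours played, and it adds a header row to the sample for readability, which the real file may not have. The function name `hours_played` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the four-column layout; the header row is an
# assumption of this sketch, so check the real file before reusing it.
SAMPLE = """user-id,game-title,behaviour-name,value
101,Game A,purchase,1.0
101,Game A,play,12.5
101,Game B,purchase,1.0
202,Game A,purchase,1.0
202,Game A,play,3.0
202,Game C,purchase,1.0
202,Game C,play,40.0
"""


def hours_played(rows):
    """Build {user: {game: hours}} from play rows, ignoring purchases."""
    hours = defaultdict(dict)
    for row in rows:
        if row["behaviour-name"] == "play":
            user, game = row["user-id"], row["game-title"]
            hours[user][game] = hours[user].get(game, 0.0) + float(row["value"])
    return dict(hours)


interactions = hours_played(csv.DictReader(io.StringIO(SAMPLE)))
for user, games in interactions.items():
    print(user, games)
```

Hours played can then serve as a confidence weight for each user-game pair, while purchase-only rows (such as Game B here) may be kept separately as a weaker signal.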

Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.