10 Open-Source Datasets For Text Classification


Ambika Choudhury

Text classification, a popular field of research, is the task of analysing textual data and assigning it to categories in order to extract meaningful information. The global text analytics market is reportedly expected to post a CAGR of more than 20% during the period 2020-2024. Applications of text classification include automating CRM tasks, improving web browsing and e-commerce, among others.

In this article, we list 10 open-source datasets that can be used for text classification.

(The list is in alphabetical order)



1| Amazon Reviews Dataset

The Amazon Reviews dataset consists of a few million Amazon customer reviews (input text) and star ratings (output labels), and is intended for learning how to train fastText for sentiment analysis. The dataset is 493MB in size.
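
fastText's supervised mode expects each line to begin with one or more labels carrying the `__label__` prefix, followed by the text. The sketch below shows one way to split such lines into labels and text, assuming the download follows that convention. The sample lines and the helper name `parse_fasttext_line` are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def parse_fasttext_line(line):
    """Split one fastText-format line into (labels, text)."""
    labels = []
    tokens = line.strip().split(" ")
    i = 0
    # Leading tokens that carry the prefix are labels; the rest is the text.
    while i < len(tokens) and tokens[i].startswith(LABEL_PREFIX):
        labels.append(tokens[i][len(LABEL_PREFIX):])
        i += 1
    return labels, " ".join(tokens[i:])


# Hypothetical sample lines (assumption: not real rows from the dataset).
sample_lines = [
    "__label__2 Great product, works exactly as described.",
    "__label__1 Stopped working after two days.",
    "__label__2 Would buy again.",
]

parsed = [parse_fasttext_line(line) for line in sample_lines]
label_counts = Counter(label for labels, _ in parsed for label in labels)
print(label_counts)
```

Counting labels this way is a quick check of class balance before training.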

Get the data here.



2| Enron Email Dataset

The Enron Email Dataset contains email data from about 150 users, mostly senior management at Enron. The dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes) and contains about 0.5M messages in total.

Get the data here.  

3| Goodreads Book Reviews

This dataset contains reviews from the Goodreads book review website along with a variety of attributes describing the items. It includes reviews, read and review actions, book attributes and more. The dataset covers a total of 1,561,465 items.

Get the data here.  

4| IMDB Dataset 

The IMDB dataset includes 50K movie reviews for natural language processing or text analytics. This is a dataset for binary sentiment classification, which includes a set of 25,000 highly polar movie reviews for training and 25,000 for testing. 
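
To make binary sentiment classification concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. Naive Bayes is used here only as a simple baseline and is not prescribed by the dataset. The four training sentences are toy examples invented for illustration, not IMDB reviews, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(examples):
    """examples: list of (text, label). Returns the counts needed to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior of the class.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Log likelihood with Laplace (add-one) smoothing.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (assumption: invented, not drawn from IMDB).
toy_train = [
    ("a wonderful moving film", "pos"),
    ("brilliant acting and a wonderful story", "pos"),
    ("a dull boring film", "neg"),
    ("terrible acting and a boring story", "neg"),
]
model = train_naive_bayes(toy_train)
print(predict(model, "wonderful story"))
print(predict(model, "boring acting"))
```

With the real data, the 25,000 training reviews would replace `toy_train` and the 25,000 test reviews would be used to measure accuracy.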

Get the data here.

5| MovieLens Latest Datasets

This dataset is a collection of movies, their ratings, tag applications and users. There are two sets of this data, which have been collected over a period of time. The small set includes 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users, and the large set includes 27,000,000 ratings and 1,100,000 tag applications applied to 58,000 movies by 280,000 users. The large set also includes tag genome data with 14 million relevance scores across 1,100 tags.

Get the data here.

6| OpinRank Dataset 

This dataset contains full reviews of cars and hotels, collected from Edmunds and Tripadvisor respectively. It covers hotels in 10 different cities and cars for model years 2007, 2008 and 2009. In total, there are approximately 42,230 car reviews and approximately 259,000 hotel reviews.

Get the data here.

7| SMS Spam Collection

The SMS Spam Collection is a public dataset of labelled SMS messages, which have been collected for mobile phone spam research. The dataset has one collection composed of 5,574 real, non-encoded English messages, each tagged as legitimate or spam. The dataset is available in both plain text and ARFF format.
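
The plain-text version can be read with Python's standard `csv` module. This sketch assumes each line holds a label (`ham` for legitimate, `spam` otherwise), a tab character, and then the message. The sample messages and the function name `read_sms` are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the assumed "label<TAB>message" layout (invented text).
sample = (
    "ham\tAre we still meeting for lunch today?\n"
    "spam\tYou have won a prize! Reply now to claim.\n"
    "ham\tCall me when you get home.\n"
)


def read_sms(handle):
    """Yield (label, message) pairs from a tab-separated file object."""
    # QUOTE_NONE keeps quotation marks inside messages as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip lines that do not match the assumed layout
        yield row[0], row[1]


messages = list(read_sms(io.StringIO(sample)))
counts = Counter(label for label, _ in messages)
print(counts)
```

To read the downloaded file instead, pass `read_sms` a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the file was saved.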

Get the data here.

8| The Blog Authorship Corpus 

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person. In this dataset, each blog is presented as a separate file, the name of which indicates a blogger ID number and the blogger’s self-provided gender, age, industry and astrological sign.
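
Because the labels live in the file names, they can be recovered without opening the files. The sketch below assumes names of the form `id.gender.age.industry.sign.xml`; that exact layout is an assumption and should be checked against the download. The example filename and the `BloggerInfo` type are invented for illustration.

```python
from typing import NamedTuple


class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str


def parse_blog_filename(filename):
    """Parse an assumed 'id.gender.age.industry.sign.xml' filename."""
    parts = filename.split(".")
    if len(parts) != 6 or parts[-1] != "xml":
        raise ValueError(f"unexpected filename layout: {filename!r}")
    blogger_id, gender, age, industry, sign, _ext = parts
    return BloggerInfo(blogger_id, gender, int(age), industry, sign)


# Hypothetical filename (assumption: not a real file from the corpus).
info = parse_blog_filename("1234567.female.25.Student.Leo.xml")
print(info)
```

Any of the parsed fields, such as gender or age, can then serve as the target label for a classifier trained on the post text.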

Get the data here.

9| WordNet

WordNet is a large lexical database of English in which nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. The database contains 117,000 synsets, each of which is linked to other synsets by means of a small number of conceptual relations.

Get the data here.

10| Yelp Reviews

The Yelp dataset is an all-purpose dataset for learning. It is a subset of Yelp’s businesses, reviews and user data, and can be used for personal, educational and academic purposes. The dataset includes 6,685,900 reviews, 200,000 pictures and 192,609 businesses from 10 metropolitan areas.

Get the data here.


Copyright Analytics India Magazine Pvt Ltd
