10 Open-Source Datasets For Text Classification

Text classification, a popular field of research, is the task of assigning categories to textual data so that meaningful information can be drawn from it. According to sources, the global text analytics market is expected to post a CAGR of more than 20% during the period 2020-2024. Text classification is used in a number of applications, such as automating CRM tasks, improving web browsing and e-commerce, among others.

In this article, we list 10 open-source datasets that can be used for text classification.

(The list is in alphabetical order)

1| Amazon Reviews Dataset

The Amazon Reviews dataset consists of a few million Amazon customer reviews (input text) and star ratings (output labels), prepared for learning how to train fastText for sentiment analysis. The dataset is 493MB in size.
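
Since the dataset is meant for fastText, a short sketch of reading fastText-style labelled lines may help. It assumes each line starts with one or more labels carrying fastText's default `__label__` prefix, followed by the review text. The sample lines and the function name `parse_fasttext_line` are made up for illustration and are not taken from the dataset.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"  # fastText's default label marker


def parse_fasttext_line(line: str) -> Tuple[List[str], str]:
    """Split one fastText-style line into (labels, text)."""
    tokens = line.strip().split(" ")
    labels = []
    i = 0
    # Labels are the leading tokens that start with the prefix.
    while i < len(tokens) and tokens[i].startswith(LABEL_PREFIX):
        labels.append(tokens[i][len(LABEL_PREFIX):])
        i += 1
    text = " ".join(tokens[i:])
    return labels, text


# Made-up sample lines (assumption: the real file follows the same shape).
sample = [
    "__label__2 Great read: I could not put this book down.",
    "__label__1 Broke after a week: Would not buy again.",
]
parsed = [parse_fasttext_line(s) for s in sample]
for labels, text in parsed:
    print(labels, text)
```

How the label values map to star ratings depends on how the file was prepared, so it is worth checking the dataset description before training.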

Get the data here.

2| Enron Email Dataset

The Enron Email Dataset contains email data from about 150 users, mostly senior management at Enron. The dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes) and contains about 0.5 million messages in total.

Get the data here.  

3| Goodreads Book Reviews

This dataset contains reviews from the Goodreads book review website along with a variety of attributes describing the items. It includes reviews, read actions, review actions, book attributes and more. The dataset covers a total of 1,561,465 items.

Get the data here.  

4| IMDB Dataset 

The IMDB dataset includes 50K movie reviews for natural language processing or text analytics. This is a dataset for binary sentiment classification, which includes a set of 25,000 highly polar movie reviews for training and 25,000 for testing. 

Get the data here.

5| MovieLens Latest Datasets

This dataset is a collection of movies, their ratings, tag applications and users. It comes in two sets, both collected over a period of time. The small set includes 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users, and the large set includes 27,000,000 ratings and 1,100,000 tag applications applied to 58,000 movies by 280,000 users. The large set also includes tag genome data with 14 million relevance scores across 1,100 tags.

Get the data here.

6| OpinRank Dataset 

This dataset contains full reviews of cars and hotels collected from Edmunds and Tripadvisor, respectively. It covers hotels in 10 different cities as well as cars for model-years 2007, 2008 and 2009. In total, there are approximately 42,230 car reviews and approximately 259,000 hotel reviews.

Get the data here.

7| SMS Spam Collection

The SMS Spam Collection is a public dataset of labelled SMS messages collected for mobile phone spam research. It consists of one collection of 5,574 real, non-encoded English messages, each tagged as either legitimate or spam. The dataset is available in both plain text and ARFF format.
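
As a rough illustration of working with the plain-text version, the sketch below assumes one message per line, with the label and the message text separated by a tab. The label strings `ham` and `spam`, the sample messages and the function name `parse_sms_lines` are assumptions for illustration, so check them against the downloaded file.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_sms_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Turn 'label<TAB>text' lines into (label, text) pairs."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside a message survive.
        label, _, text = line.partition("\t")
        rows.append((label, text))
    return rows


# Made-up sample lines, not taken from the dataset.
sample_lines = [
    "ham\tSee you at lunch tomorrow?\n",
    "spam\tYou have won a prize! Reply now to claim.\n",
    "ham\tRunning late, start without me.\n",
]
rows = parse_sms_lines(sample_lines)
label_counts = Counter(label for label, _ in rows)
print(label_counts)
```

Counting the labels first is a quick way to see how imbalanced the two classes are before choosing an evaluation metric.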

Get the data here.

8| The Blog Authorship Corpus 

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person. In this dataset, each blog is presented as a separate file, the name of which indicates a blogger ID and the blogger's self-provided gender, age, industry and astrological sign.
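
Because the file names carry the labels, they can be parsed directly into classification targets. The sketch below assumes the fields are dot-separated in the order ID, gender, age, industry, sign, with an `.xml` extension. That layout, the example file name and the helper `parse_blog_filename` are assumptions for illustration and should be checked against the downloaded corpus.

```python
from typing import NamedTuple


class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str


def parse_blog_filename(name: str) -> BloggerInfo:
    """Parse an assumed '<id>.<gender>.<age>.<industry>.<sign>.xml' name."""
    stem = name[:-4] if name.lower().endswith(".xml") else name
    parts = stem.split(".")
    if len(parts) != 5:
        raise ValueError(f"unexpected file name layout: {name!r}")
    blogger_id, gender, age, industry, sign = parts
    return BloggerInfo(blogger_id, gender, int(age), industry, sign)


# Hypothetical file name, not taken from the corpus.
info = parse_blog_filename("1234567.female.25.Student.Leo.xml")
print(info)
```

Any of the parsed fields (gender, age group, industry or sign) can then serve as the target label for an author-profiling classifier.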

Get the data here.

9| WordNet

WordNet is a large lexical database of English in which nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. The database contains about 117,000 synsets, each of which is linked to other synsets by means of a small number of conceptual relations.

Get the data here.

10| Yelp Reviews

The Yelp dataset is an all-purpose dataset for learning. It is a subset of Yelp's businesses, reviews and user data, and can be used for personal, educational and academic purposes. The dataset includes 6,685,900 reviews, 200,000 pictures and 192,609 businesses from 10 metropolitan areas.

Get the data here.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
