10 NLP Open-Source Datasets To Start Your First NLP Project

Natural language processing (NLP) has grown significantly over the last few years. Demand for advanced text recognition, sentiment analysis, speech recognition, and machine-to-human communication has driven several innovations. According to industry estimates, the global NLP market will reach US$ 28.6 billion in 2026, growing at a CAGR of 11.71% over the 2018 to 2026 forecast period.

In this article, we list 10 free and open-source NLP datasets to kickstart your first NLP project.

1| The Blog Authorship Corpus

About: The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers, gathered from blogger.com in August 2004. The corpus contains a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person. In this dataset, each blog is presented as a separate file whose name indicates the blogger id and the blogger’s self-provided gender, age, industry, and astrological sign.

Category: Sentiment Analysis

Get the dataset here
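
Because the blogger's attributes are carried in each file name, they can be recovered without opening the file. The sketch below is illustrative only: it assumes a dot-separated name of the form id.gender.age.industry.sign.xml, and both the field order and the sample file name are assumptions, not details confirmed by the description above.

```python
from typing import NamedTuple


class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str


def parse_blog_filename(filename: str) -> BloggerInfo:
    """Split a name of the assumed form id.gender.age.industry.sign.xml."""
    stem = filename.rsplit(".", 1)[0]  # drop the file extension
    blogger_id, gender, age, industry, sign = stem.split(".")
    return BloggerInfo(blogger_id, gender, int(age), industry, sign)


# Hypothetical file name, made up for this example.
info = parse_blog_filename("1234567.female.25.Student.Leo.xml")
print(info)
```

If the real file names use a different separator or field order, only the unpacking line needs to change.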

2| Amazon Product Dataset

About: The Amazon Product dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 to July 2014. This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

Category: Sentiment Analysis

Get the dataset here

3| Multi-Domain Sentiment Dataset

About: The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from 4 product types (domains): kitchen, books, DVDs, and electronics. Each domain has several thousand reviews, but the exact number varies by domain. The reviews contain star ratings (1 to 5 stars), which can also be converted into binary labels.

Category: Sentiment Analysis

Get the dataset here
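
The conversion from star ratings to binary labels mentioned for this dataset could look roughly like the sketch below. The threshold is an assumption made for illustration: ratings above 3 are treated as positive, ratings below 3 as negative, and 3-star reviews are dropped as ambiguous. The sample reviews are invented.

```python
from typing import Optional


def star_to_binary(stars: float) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None (ambiguous)."""
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None  # assumed convention: 3-star reviews carry no clear polarity


# Invented (text, stars) pairs standing in for real reviews.
reviews = [
    ("Great blender", 5.0),
    ("Broke in a week", 1.0),
    ("It is okay", 3.0),
    ("Decent read", 4.0),
]

# Keep only the reviews that received a clear label.
labeled = [
    (text, star_to_binary(stars))
    for text, stars in reviews
    if star_to_binary(stars) is not None
]
print(labeled)
```

Dropping the neutral ratings keeps the two classes cleanly separated, at the cost of discarding some data.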

4| LibriSpeech

About: LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project and has been carefully segmented and aligned.

Category: Speech recognition

Get the dataset here

5| Free Spoken Digit Dataset (FSDD)

About: Free Spoken Digit Dataset (FSDD) is a simple, open audio/speech dataset consisting of recordings of spoken digits in WAV files at 8kHz. The recordings are trimmed so that they have near minimal silence at the beginnings and ends.

Category: Speech recognition

Get the data here

6| Stanford Question Answering Dataset (SQuAD)

About: Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowd-workers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written adversarially by crowd-workers to look similar to answerable ones.

Category: Question & Answering Analysis

Get the data here
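
To make the idea of an answer "span" concrete, the sketch below recovers an answer from a passage using a character offset, and returns nothing for an unanswerable question. The record layout is a simplified stand-in (the keys `answers`, `text`, `answer_start`, and `is_impossible` are assumed here, not taken from the description above), and the passage and questions are invented.

```python
def extract_answer(context, qa):
    """Return the answer span from the passage, or None if unanswerable."""
    if qa.get("is_impossible", False) or not qa["answers"]:
        return None
    answer = qa["answers"][0]
    start = answer["answer_start"]        # character offset into the passage
    end = start + len(answer["text"])
    return context[start:end]


# Invented passage and questions for illustration.
context = "WordNet is a large lexical database of English."

answerable = {
    "question": "What kind of database is WordNet?",
    "is_impossible": False,
    "answers": [
        {
            "text": "large lexical database",
            "answer_start": context.index("large lexical database"),
        }
    ],
}

unanswerable = {
    "question": "Who funds WordNet?",
    "is_impossible": True,
    "answers": [],
}

print(extract_answer(context, answerable))
print(extract_answer(context, unanswerable))
```

Slicing by offset rather than searching for the answer text avoids picking the wrong occurrence when the same phrase appears more than once in a passage.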

7| Jeopardy! Questions in a JSON file

About: This dataset is a JSON file containing 216,930 Jeopardy! questions, answers, and other data. According to j-archive, the total number of Jeopardy! questions over the show’s span is 252,583. The size of the file is approximately 53 MB.

Category: Questions & Answers Analysis

Get the data here
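
Since the Jeopardy! data ships as a single JSON file, a first exploration can be done with the standard library alone. The sketch below parses a small inline sample rather than the real file. The field names (`category`, `round`, `value`, `question`, `answer`) are assumptions about the schema, and the records are invented placeholders.

```python
import json
from collections import Counter

# Invented stand-in records; the field names are assumptions about the schema.
sample_json = """
[
  {"category": "SCIENCE", "round": "Jeopardy!", "value": "$200",
   "question": "Placeholder clue one", "answer": "Placeholder response one"},
  {"category": "HISTORY", "round": "Jeopardy!", "value": "$400",
   "question": "Placeholder clue two", "answer": "Placeholder response two"},
  {"category": "SCIENCE", "round": "Double Jeopardy!", "value": "$800",
   "question": "Placeholder clue three", "answer": "Placeholder response three"}
]
"""

records = json.loads(sample_json)

# Count how many clues fall into each category.
category_counts = Counter(record["category"] for record in records)


def parse_value(value):
    """Turn a string such as "$1,200" into an int; return None when missing."""
    if not value:
        return None
    return int(value.replace("$", "").replace(",", ""))


values = [parse_value(record["value"]) for record in records]
total_value = sum(v for v in values if v is not None)

print(category_counts.most_common())
print(total_value)
```

For the real file, `json.load` on an open file handle would replace `json.loads` on the inline string; at roughly 53 MB the whole file should fit in memory on most machines.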

8| Yelp Reviews

About: The Yelp dataset is an all-purpose dataset for learning. It is a subset of Yelp’s businesses, reviews, and user data for personal, educational, and academic use. The dataset contains 6,685,900 reviews, 200,000 pictures, and 192,609 businesses from 10 metropolitan areas.

Category: Text Classification

Get the data here

9| WordNet

About: WordNet is a large lexical database of English. This dataset superficially resembles a thesaurus, in that it groups words together based on their meanings. The main relation among words in WordNet is synonymy, as between the words shut and close or car and automobile. There are a total of 117,000 synsets in WordNet, each of which is linked to other synsets by means of a small number of conceptual relations.

Category: Text Classification

Get the data here

10| TIMIT

About: TIMIT Acoustic-Phonetic Continuous Speech Corpus is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. The dataset contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. 

Category: Speech Recognition

Get the data here

Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.