10 NLP Open-Source Datasets To Start Your First NLP Project

Natural language processing (NLP) has grown significantly over the last few years. Demand for advanced text recognition, sentiment analysis, speech recognition, and machine-to-human communication has driven several innovations. According to industry estimates, the global NLP market will reach US$ 28.6 billion in 2026, growing at a CAGR of 11.71% over the 2018 to 2026 forecast period.

In this article, we list 10 free and open-source NLP datasets to kickstart your first NLP project.


1| The Blog Authorship Corpus

About: The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers, gathered from blogger.com in August 2004. The corpus contains a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person. Each blog is stored as a separate file, the name of which indicates a blogger ID and the blogger’s self-provided gender, age, industry, and astrological sign.

Category: Sentiment analysis

Get the dataset here
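
As a rough illustration of the file-naming scheme described above, the sketch below splits a file name into its metadata fields. The exact pattern (`<id>.<gender>.<age>.<industry>.<sign>.xml`), the `BloggerInfo` type, and the example file name are assumptions made for illustration; check the files in your own download before relying on them.

```python
from typing import NamedTuple


class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str


def parse_blog_filename(filename: str) -> BloggerInfo:
    """Split a corpus file name into the blogger's self-reported metadata.

    Assumes the pattern <id>.<gender>.<age>.<industry>.<sign>.xml,
    which is an assumption here, not a documented guarantee.
    """
    parts = filename.split(".")
    if len(parts) != 6 or parts[-1] != "xml":
        raise ValueError(f"unexpected file name: {filename!r}")
    blogger_id, gender, age, industry, sign, _ext = parts
    return BloggerInfo(blogger_id, gender, int(age), industry, sign)


# Hypothetical file name, made up for this example.
info = parse_blog_filename("1234567.female.27.Education.Leo.xml")
print(info.gender, info.age, info.sign)
```

Parsing the metadata up front makes it easy to group posts by age band or gender before any text processing starts.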

2| Amazon Product Dataset

About: The Amazon Product dataset contains product reviews and metadata from Amazon, including 142.8 million reviews from May 1996 to July 2014. It includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

Category: Sentiment Analysis

Get the dataset here

3| Multi-Domain Sentiment Dataset

About: The Multi-Domain Sentiment Dataset contains product reviews taken from four product types (domains): kitchen, books, DVDs, and electronics. Each domain has several thousand reviews, but the exact number varies by domain. The reviews carry star ratings (1 to 5 stars), which can also be converted into binary labels.

Category: Sentiment Analysis

Get the dataset here
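
The conversion from star ratings to binary labels mentioned above might look like the following sketch. The threshold is an assumed convention (more than 3 stars is positive, fewer than 3 is negative, exactly 3 is dropped as ambiguous), and `star_to_binary` is a hypothetical helper name, not part of the dataset.

```python
from typing import Optional


def star_to_binary(stars: float) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None.

    Assumed convention: more than 3 stars is positive, fewer than 3 is
    negative, and exactly 3 is treated as ambiguous and dropped.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"rating out of range: {stars}")
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None


ratings = [5, 4, 3, 2, 1]
labels = [star_to_binary(r) for r in ratings]
print(labels)  # [1, 1, None, 0, 0]
```

Dropping the middle rating keeps the two classes cleanly separated, at the cost of discarding some reviews.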

4| LibriSpeech

About: LibriSpeech is a corpus of approximately 1,000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project and has been carefully segmented and aligned.

Category: Speech recognition

Get the dataset here

5| Free Spoken Digit Dataset (FSDD)

About: Free Spoken Digit Dataset (FSDD) is a simple, open audio/speech dataset consisting of recordings of spoken digits in WAV files at 8kHz. The recordings are trimmed so that they have near-minimal silence at the beginning and end.

Category: Speech recognition

Get the data here
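
Before training on WAV recordings like these, it can help to confirm the sample rate and duration of each file. The sketch below uses only Python's standard `wave` module; since no real recording is bundled here, it builds a short synthetic tone in memory as a stand-in. The 16-bit mono format and the helper names `make_tone` and `describe_wav` are assumptions for illustration.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000  # 8kHz, the rate stated for FSDD


def make_tone(seconds: float = 0.25, freq: float = 440.0) -> bytes:
    """Build a short 16-bit mono WAV in memory as a stand-in for a recording."""
    n_samples = int(SAMPLE_RATE * seconds)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)))
        for i in range(n_samples)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono (assumed)
        w.setsampwidth(2)   # 2 bytes per sample, i.e. 16-bit (assumed)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(frames)
    return buf.getvalue()


def describe_wav(data: bytes) -> tuple:
    """Return (sample_rate_hz, duration_seconds) for WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as w:
        rate = w.getframerate()
        return rate, w.getnframes() / rate


rate, duration = describe_wav(make_tone())
print(rate, duration)
```

With a real recording, the bytes passed to `describe_wav` would come from reading the file in binary mode instead of from `make_tone`.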

6| Stanford Question Answering Dataset (SQuAD)

About: Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

Category: Question Answering

Get the data here
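
To make the idea of a span answer concrete, here is a minimal sketch that pulls an answer out of a passage by character offset and returns nothing for an unanswerable question. The passage and both records are hand-written for illustration, not taken from the dataset, and `extract_answer` is a hypothetical helper.

```python
from typing import Optional

context = "LibriSpeech is a corpus of read English speech sampled at 16kHz."

# Hand-written records in the spirit of SQuAD 2.0; not taken from the dataset.
questions = [
    {
        "question": "At what rate is the speech sampled?",
        "is_impossible": False,
        "answers": [{"text": "16kHz", "answer_start": context.index("16kHz")}],
    },
    {
        "question": "Who funded the corpus?",
        "is_impossible": True,
        "answers": [],
    },
]


def extract_answer(context: str, qa: dict) -> Optional[str]:
    """Return the answer span from the passage, or None if unanswerable."""
    if qa["is_impossible"] or not qa["answers"]:
        return None
    answer = qa["answers"][0]
    start = answer["answer_start"]
    span = context[start:start + len(answer["text"])]
    # The stored text should match the characters at the stored offset.
    if span != answer["text"]:
        raise ValueError("answer offset does not match the passage")
    return span


for qa in questions:
    print(qa["question"], "->", extract_answer(context, qa))
```

Checking the stored text against the offset is a cheap way to catch passages that were altered (for example by whitespace normalisation) after the answers were annotated.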

7| Jeopardy! Questions in a JSON file

About: This dataset is a JSON file containing 216,930 Jeopardy! questions, answers, and other data. According to j-archive, the total number of Jeopardy! questions over the show’s span is 252,583. The file is approximately 53 MB.

Category: Question Answering

Get the data here
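
A JSON file of this kind can be explored with nothing more than Python's standard library. The sketch below parses a few made-up records, counts questions per category, and converts dollar strings to integers. The field names (`category`, `round`, `value`, `question`, `answer`) and the sample records are assumptions for illustration; inspect the real file to confirm its layout.

```python
import json
from collections import Counter
from typing import Optional

# A few made-up records; the field names are assumptions for illustration.
SAMPLE = """
[
  {"category": "SCIENCE", "round": "Jeopardy!", "value": "$200",
   "question": "This gas makes up most of Earth's atmosphere",
   "answer": "Nitrogen"},
  {"category": "SCIENCE", "round": "Double Jeopardy!", "value": "$1,200",
   "question": "This planet is known as the Red Planet",
   "answer": "Mars"},
  {"category": "WORLD CAPITALS", "round": "Final Jeopardy!", "value": null,
   "question": "This city is the capital of France",
   "answer": "Paris"}
]
"""


def parse_value(value: Optional[str]) -> Optional[int]:
    """Turn a dollar string such as "$1,200" into an int; keep None as None."""
    if value is None:
        return None
    return int(value.replace("$", "").replace(",", ""))


records = json.loads(SAMPLE)
per_category = Counter(r["category"] for r in records)
values = [parse_value(r["value"]) for r in records]
print(per_category.most_common(1), values)
```

With the downloaded file, `json.loads(SAMPLE)` would be replaced by `json.load` on an open file handle; at roughly 53 MB the whole file fits comfortably in memory.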

8| Yelp Reviews

About: The Yelp dataset is an all-purpose dataset for learning. It is a subset of Yelp’s businesses, reviews, and user data for personal, educational, and academic use. The dataset contains 6,685,900 reviews, 200,000 pictures, and 192,609 businesses from 10 metropolitan areas.

Category: Text Classification

Get the data here

9| WordNet

About: WordNet is a large lexical database of English. It superficially resembles a thesaurus, in that it groups words together based on their meanings. The main relation among words in WordNet is synonymy, as between the words shut and close or car and automobile. WordNet contains a total of 117,000 synsets, each of which is linked to other synsets by means of a small number of conceptual relations.

Category: Text Classification

Get the data here


10| TIMIT Acoustic-Phonetic Continuous Speech Corpus

About: The TIMIT Acoustic-Phonetic Continuous Speech Corpus is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. The dataset contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences.

Category: Speech Recognition

Get the data here

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
