Last updated November 17, 2021

10 Popular Datasets For Sentiment Analysis

Published on February 4, 2020
by Sameer Balaganur

Sentiment analysis is now used across many fields to help enterprises gauge and learn from what their clients or customers think. It is increasingly applied to social media monitoring, brand monitoring, the voice of the customer (VoC), customer service, and market research. Sentiment analysis uses NLP methods and algorithms that are rule-based, machine learning-based, or a hybrid of the two; the machine learning approaches learn from datasets.

The data needed for sentiment analysis should be specialised and is required in large quantities. The most challenging part of the training process isn’t finding data in large amounts; it is finding relevant datasets. These datasets must cover a wide range of sentiment analysis applications and use cases.

Below are listed some of the most popular datasets for sentiment analysis.

This list is in no particular order.

Amazon Product Data

Amazon product data is drawn from a large dataset of 142.8 million Amazon reviews made available by Julian McAuley, a professor at UC San Diego. This sentiment analysis dataset contains reviews from May 1996 to July 2014. It includes reviews (ratings, text, and helpfulness votes) and product metadata (descriptions, category information, price, brand, and image features).
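
As a rough sketch of how review records like these might be read, the snippet below parses reviews stored one JSON object per line and pulls out the rating, text, and helpfulness votes. The field names (`overall`, `reviewText`, `helpful`), the one-object-per-line layout, and the two sample records are assumptions made for illustration; check them against the files you actually download.

```python
import json

# Two made-up records in a one-JSON-object-per-line layout (an assumption).
SAMPLE_LINES = [
    '{"overall": 5.0, "reviewText": "Works great.", "helpful": [3, 4]}',
    '{"overall": 2.0, "reviewText": "Stopped working after a week.", "helpful": [0, 1]}',
]

def parse_reviews(lines):
    """Yield (rating, text, helpful_votes, total_votes) for each non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed shape: "helpful" is a [helpful_votes, total_votes] pair.
        helpful, total = record.get("helpful", [0, 0])
        yield record["overall"], record.get("reviewText", ""), helpful, total

reviews = list(parse_reviews(SAMPLE_LINES))
for rating, text, helpful, total in reviews:
    print(f"{rating:.0f} stars, {helpful}/{total} helpful: {text}")
```

Reading line by line keeps memory use flat, which matters for a dataset of this size; the same function accepts an open file handle in place of the sample list.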

Stanford Sentiment Treebank

This dataset contains just over 10,000 pieces of data from HTML files of Rotten Tomatoes. Sentiment is rated on a scale from 1 to 25, where 1 is the most negative and 25 is the most positive. Stanford's deep learning model builds a representation of each sentence from its structure instead of just assigning points for positive and negative words.

For example:

The Interview was neither that funny nor that witty.

Even though it contains words like funny and witty, the sentence as a whole is negative.
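
To see why counting words alone falls short, here is a minimal sketch of a naive word-counting scorer applied to the example sentence. The two tiny word lists are hypothetical and exist only for this illustration; this is not the Stanford model, just the kind of baseline it improves on.

```python
# Tiny hypothetical lexicon, for illustration only.
POSITIVE_WORDS = {"funny", "witty"}
NEGATIVE_WORDS = {"boring", "dull"}

def word_count_score(sentence):
    """Score a sentence as (#positive words - #negative words), ignoring structure."""
    tokens = [tok.strip(".,!?").lower() for tok in sentence.split()]
    positives = sum(tok in POSITIVE_WORDS for tok in tokens)
    negatives = sum(tok in NEGATIVE_WORDS for tok in tokens)
    return positives - negatives

score = word_count_score("The Interview was neither that funny nor that witty.")
print(score)  # 2: the word counter rates a negative sentence as positive
```

The scorer never sees "neither ... nor", so it gets the polarity backwards; a model that follows sentence structure can take the negation into account.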

Multi-Domain Sentiment Dataset

This dataset contains positive and negative review files for thousands of Amazon products. Although the reviews are for older products, the dataset is still useful. The data comes from the Department of Computer Science at Johns Hopkins University.

The reviews contain ratings from 1 to 5 stars that can be converted to binary as needed.
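
One possible way to do the star-to-binary conversion is sketched below. The cut-off is an assumption, not something the dataset prescribes: ratings above 3 count as positive, ratings below 3 as negative, and 3-star reviews are treated as ambiguous and returned as `None` so they can be dropped.

```python
def stars_to_binary(stars):
    """Map a 1-5 star rating to 'positive', 'negative', or None (ambiguous)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"expected a rating from 1 to 5, got {stars}")
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return None  # 3 stars: treated as ambiguous in this sketch

labels = [stars_to_binary(s) for s in (1, 2, 3, 4, 5)]
print(labels)
```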

Download original data:

Unprocessed.tar.gz

processed_acl.tar.gz

processed_stars.tar.gz

IMDB Movie Reviews Dataset

This large movie dataset contains a collection of about 50,000 movie reviews from IMDB. Only highly polarised reviews are included: a negative review has a score of ≤ 4 out of 10, and a positive review has a score of ≥ 7 out of 10. The positive and negative reviews are equal in number.
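
The polarity thresholds can be expressed as a small filter. The sketch below applies them to a handful of made-up (score, text) pairs, discards the reviews that fall between the two thresholds, and checks the class balance of what remains; the sample reviews and the label names `neg` and `pos` are assumptions for illustration.

```python
from collections import Counter

def polarity(score):
    """Return 'neg' for scores <= 4, 'pos' for scores >= 7, else None."""
    if score <= 4:
        return "neg"
    if score >= 7:
        return "pos"
    return None  # scores of 5 or 6 are not polarised enough to keep

# Hypothetical (score out of 10, review text) pairs.
raw_reviews = [
    (1, "Dreadful."),
    (5, "It was fine."),
    (9, "Loved it."),
    (7, "Solid."),
    (4, "Weak plot."),
]

labelled = [(polarity(s), text) for s, text in raw_reviews if polarity(s) is not None]
balance = Counter(label for label, _ in labelled)
print(balance)
```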

Sentiment140

Sentiment140 is used to discover the sentiment around a brand, product, or topic on the social media platform Twitter. Rather than relying on a keyword-based approach, which may give higher precision but lower recall, Sentiment140 uses classifiers built from machine learning algorithms. It also shows the classification results for individual tweets alongside the aggregated metrics that such tools traditionally surface. Sentiment140 is used for brand management, polling, and planning a purchase.

Twitter US Airline Sentiment

This sentiment analysis dataset contains tweets from February 2015 about each of the major US airlines. Each tweet is classified as positive, negative, or neutral. The included features are Twitter ID, sentiment confidence score, sentiment, negative reasons, airline name, retweet count, name, tweet text, tweet coordinates, date and time of the tweet, and the location of the tweet.
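
Because each tweet carries a sentiment confidence score, one common preprocessing step is to keep only the confidently labelled rows. The sketch below does this with the standard `csv` module; the column names, the sample rows, and the 0.8 cut-off are assumptions for illustration and should be checked against the real file's header.

```python
import csv
import io

# Made-up rows; the column names are assumptions about the CSV header.
SAMPLE_CSV = """airline,airline_sentiment,airline_sentiment_confidence,text
Delta,negative,0.95,Two hour delay and no updates.
United,neutral,0.55,Is the flight on time?
Delta,positive,1.0,Great crew today!
"""

def confident_rows(csv_text, min_confidence=0.8):
    """Return rows whose sentiment confidence is at least min_confidence."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if float(row["airline_sentiment_confidence"]) >= min_confidence
    ]

kept = confident_rows(SAMPLE_CSV)
for row in kept:
    print(row["airline"], row["airline_sentiment"])
```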

Paper Reviews Data Set

The Paper Reviews Data Set contains reviews, in English and Spanish, from computing and informatics conferences. It is intended for predicting the opinion expressed in academic paper reviews. Most of the reviews are in Spanish. It contains N = 405 instances evaluated on a 5-point scale: -2 (very negative), -1 (negative), 0 (neutral), 1 (positive), and 2 (very positive). The distribution of the scores is uniform, and there is a difference between the way a paper is evaluated and the way the review is written by the original reviewer.

Sentiment Lexicons For 81 Languages

Sentiment Lexicons for 81 Languages covers languages from Afrikaans to Yiddish. It includes both positive and negative sentiment lexicons for each of the 81 languages. The lexicons were generated via graph propagation over a knowledge graph, which is a graphical representation of real-world objects and the relationships between them. The general idea is that words closely linked on a knowledge graph may have similar sentiment polarities. The lexicons were built on the basis of English sentiment lexicons.
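
The idea that linked words share polarity can be illustrated with a very simplified propagation sketch. This is not the procedure used to build the lexicons: the word graph, the seed words, and the averaging update are all assumptions chosen to keep the example small.

```python
# Hypothetical word graph: edges link words assumed to be closely related.
EDGES = [("good", "great"), ("great", "superb"), ("bad", "awful"), ("awful", "dreadful")]
SEEDS = {"good": 1.0, "bad": -1.0}  # seed polarities, standing in for a seed lexicon

def propagate(edges, seeds, iterations=10):
    """Spread seed polarities to neighbours by repeated averaging."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    scores = {word: seeds.get(word, 0.0) for word in neighbours}
    for _ in range(iterations):
        updated = {}
        for word, linked in neighbours.items():
            if word in seeds:
                updated[word] = seeds[word]  # seed scores stay fixed
            else:
                updated[word] = sum(scores[n] for n in linked) / len(linked)
        scores = updated
    return scores

scores = propagate(EDGES, SEEDS)
polarity = {w: ("positive" if s > 0 else "negative") for w, s in scores.items() if s != 0}
print(polarity)
```

Words that are only reachable through other unlabelled words, such as "superb" here, pick up a polarity after a couple of iterations, which is what lets a small seed lexicon cover a much larger vocabulary.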

Lexicoder Sentiment Dictionary

This sentiment dictionary is designed to be used within Lexicoder, which performs content analysis. It consists of 2,858 negative sentiment words and 1,709 positive sentiment words. In addition, it includes 2,860 negations of negative words and 1,721 negations of positive words. The developers advise users to subtract negated positive words from the positive count and negated negative words from the negative count.
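
The subtraction rule described above amounts to simple arithmetic on the dictionary match counts. The sketch below applies it to made-up counts for a single document; how the matches are obtained is outside the scope of this example, and the function name is hypothetical.

```python
def adjusted_counts(positive, negative, negated_positive, negated_negative):
    """Subtract negated matches from the raw positive and negative counts."""
    return positive - negated_positive, negative - negated_negative

# Hypothetical match counts for one document.
pos, neg = adjusted_counts(positive=12, negative=7, negated_positive=3, negated_negative=2)
print(pos, neg)  # 9 5
```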

Opin-Rank Review Dataset

Opin-Rank Review Dataset contains full reviews of cars and hotels. It includes about 259,000 hotel reviews and 42,230 car reviews collected from TripAdvisor and Edmunds, respectively. The car dataset covers model years 2007, 2008, and 2009, with about 140-250 cars from each year; its fields include dates, favourites, author names, and the full text of the review. The hotel dataset covers 10 different cities, including Dubai, Beijing, Las Vegas, and San Francisco, with reviews of about 80-700 hotels from each city; its fields include the date, the review title, and the full text of the review.


