Sentiment analysis has found applications in many fields, helping enterprises to understand and learn from their clients or customers. It is increasingly used for social media monitoring, brand monitoring, the voice of the customer (VoC), customer service, and market research. Sentiment analysis uses NLP methods and algorithms that are rule-based, rely on machine learning techniques to learn from datasets, or combine both in a hybrid approach.
Sentiment analysis needs specialised data in large quantities. The most challenging part of the training process isn’t finding data in large amounts; it is finding relevant datasets. These datasets must cover a wide range of sentiment analysis applications and use cases.
Below are listed some of the most popular datasets for sentiment analysis.
This list is in no particular order.
Amazon Product Data
Amazon product data is a subset of a large dataset of 142.8 million Amazon reviews made available by Julian McAuley, a professor at UC San Diego. This sentiment analysis dataset contains reviews from May 1996 to July 2014. Each review includes ratings, text, and helpful votes, together with product metadata: description, category information, price, brand, and image features.
Stanford Sentiment Treebank
This dataset contains just over 10,000 pieces of data collected by Stanford from HTML files of Rotten Tomatoes. The sentiments are rated on a scale from 1 to 25, where 1 is the most negative and 25 is the most positive. The deep learning model by Stanford builds its representation of a sentence from the sentence structure instead of just assigning points based on positive and negative words.
Consider the following sentence:
The Interview was neither that funny nor that witty.
Although it contains words like funny and witty, the structure of the sentence makes its overall sentiment negative.
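To see why counting positive and negative words is not enough for a sentence like this, here is a minimal sketch of a word-counting scorer. The tiny word lists and the function name `naive_word_score` are assumptions made for illustration; this is not the Stanford model.

```python
# Toy word lists; hypothetical, chosen only for this illustration.
POSITIVE_WORDS = {"funny", "witty", "great"}
NEGATIVE_WORDS = {"boring", "dull", "bad"}


def naive_word_score(sentence: str) -> int:
    """Return (# positive words) - (# negative words), ignoring structure."""
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    pos = sum(t in POSITIVE_WORDS for t in tokens)
    neg = sum(t in NEGATIVE_WORDS for t in tokens)
    return pos - neg


score = naive_word_score("The Interview was neither that funny nor that witty.")
print(score)  # 2: the counter sees two positive words and misses "neither ... nor"
```

The positive score for a clearly negative sentence is the kind of mistake a structure-aware model is meant to avoid.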
Multi-Domain Sentiment Dataset
This dataset contains positive and negative review files for thousands of Amazon products. Although the reviews are for older products, the dataset remains useful. The data comes from the Department of Computer Science at Johns Hopkins University.
The reviews contain ratings from 1 to 5 stars that can be converted to binary labels as needed.
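One possible way to turn star ratings into binary labels is sketched below. The thresholds (4-5 stars positive, 1-2 stars negative, 3 stars dropped as ambiguous) are a common convention assumed here, not something the dataset prescribes, and `stars_to_binary` is a hypothetical helper name.

```python
from typing import Optional


def stars_to_binary(stars: int) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative) or None (dropped)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"expected a rating from 1 to 5, got {stars}")
    if stars >= 4:
        return 1  # assumed threshold: 4 and 5 stars count as positive
    if stars <= 2:
        return 0  # assumed threshold: 1 and 2 stars count as negative
    return None   # 3 stars is treated as ambiguous and left unlabelled


ratings = [5, 1, 3, 4, 2]
labels = [stars_to_binary(s) for s in ratings]
print(labels)  # [1, 0, None, 1, 0]
```

Whether to drop or keep the middle rating depends on the use case; keeping it would call for a three-class setup instead.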
IMDB Movie Reviews Dataset
This large movie dataset contains a collection of about 50,000 movie reviews from IMDB. Only highly polarised reviews are included. Positive and negative reviews are equal in number: a negative review has a score of ≤ 4 out of 10, and a positive review has a score of ≥ 7 out of 10.
Sentiment140
Sentiment140 is used to discover the sentiment around a brand, a product or even a topic on the social media platform Twitter. Rather than relying on a keyword-based approach, which trades lower recall for high precision, Sentiment140 works with classifiers built from machine learning algorithms. It exposes the classification results for individual tweets alongside the traditional aggregated metrics. Sentiment140 is used for brand management, polling, and planning a purchase.
Twitter US Airline Sentiment
This sentiment analysis dataset contains tweets from February 2015 about each of the major US airlines. Each tweet is classified as positive, negative or neutral. The included features are Twitter ID, sentiment confidence score, sentiment, negative reason, airline name, retweet count, name, tweet text, tweet coordinates, date and time of the tweet, and the location of the tweet.
Paper Reviews Data Set
Paper Reviews Data Set contains reviews in English and Spanish from conferences on computing and informatics. It is intended for predicting the opinion expressed in academic paper reviews. Most of the reviews in the dataset are written in Spanish. It has a total of N=405 instances, each evaluated on a 5-point scale: -2: very negative, -1: negative, 0: neutral, 1: positive, 2: very positive. The distribution of the scores is uniform, and there is a difference between the way the paper is evaluated and the review written by the original reviewer.
Sentiment Lexicons For 81 Languages
Sentiment Lexicons for 81 Languages covers languages from Afrikaans to Yiddish, with both positive and negative sentiment lexicons for each of the 81 languages. The lexicons were generated via graph propagation over a knowledge graph, which is a graphical representation of real-world objects and the relationships between them. The general idea is that words closely linked on a knowledge graph may have similar sentiment polarities. The lexicons were built from English sentiment lexicons.
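The propagation idea can be illustrated with a toy example. This is only a sketch of one simple propagation scheme (repeated neighbour averaging with the English seed words held fixed); it is not the procedure the lexicon authors used, and the graph, the seed words and the function name `propagate` are all assumptions made for illustration.

```python
# Toy undirected graph: word -> linked words (hypothetical links, standing in
# for translation or synonym edges in a knowledge graph).
GRAPH = {
    "good": ["bueno", "gut"],
    "bad": ["malo", "schlecht"],
    "bueno": ["good", "gut"],
    "gut": ["good", "bueno"],
    "malo": ["bad", "schlecht"],
    "schlecht": ["bad", "malo"],
}

# English seed words with known polarity (+1 positive, -1 negative).
SEEDS = {"good": 1.0, "bad": -1.0}


def propagate(graph, seeds, iterations=20):
    """Spread seed polarities to linked words by repeated neighbour averaging."""
    scores = {word: seeds.get(word, 0.0) for word in graph}
    for _ in range(iterations):
        updated = {}
        for word, neighbours in graph.items():
            if word in seeds:
                updated[word] = seeds[word]  # seed scores stay fixed
            else:
                updated[word] = sum(scores[n] for n in neighbours) / len(neighbours)
        scores = updated
    return scores


scores = propagate(GRAPH, SEEDS)
for word, value in scores.items():
    if word not in SEEDS:
        print(word, "positive" if value > 0 else "negative")
```

In this toy graph the words linked to "good" end up with positive scores and the words linked to "bad" with negative ones, which mirrors the intuition that closely linked words tend to share polarity.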
Lexicoder Sentiment Dictionary
This dataset is designed to be used with Lexicoder, which performs content analysis. The dictionary consists of 2,858 negative sentiment words and 1,709 positive sentiment words. In addition, 2,860 negations of negative words and 1,721 negations of positive words are included. The developers advise anyone testing it to subtract the negated positive words from the positive count and the negated negative words from the negative count.
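The adjustment recommended by the developers amounts to simple arithmetic, sketched below. The function `adjusted_counts` and the example numbers are hypothetical; the sketch assumes the four raw counts have already been obtained by matching a text against the dictionary, and it is not part of Lexicoder itself.

```python
def adjusted_counts(positive, negative, negated_positive, negated_negative):
    """Apply the negation adjustment to raw dictionary match counts.

    Negated positive words are removed from the positive count, and
    negated negative words are removed from the negative count.
    """
    adj_positive = positive - negated_positive
    adj_negative = negative - negated_negative
    return adj_positive, adj_negative


# Made-up counts for a single document.
pos, neg = adjusted_counts(
    positive=10, negative=6, negated_positive=2, negated_negative=1
)
print(pos, neg)  # 8 5
```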
Opin-Rank Review Dataset
Opin-Rank Review Dataset contains full reviews of cars and hotels. It includes about 259,000 hotel reviews and 42,230 car reviews collected from TripAdvisor and Edmunds, respectively. The car dataset covers models from 2007, 2008 and 2009, with about 140-250 cars from each year. Its fields include dates, favourites, author names, and the full review text. The hotel dataset covers 10 different cities, including Dubai, Beijing, Las Vegas and San Francisco, with reviews of about 80-700 hotels from each city. Its fields include date, title and the full review text.