With the growing prominence of blogging and social platforms, a massive number of users share reviews on all aspects of life every day. As a result, popular sites such as Amazon and Twitter are rich sources of data for opinion mining and sentiment analysis.
Sentiment analysis is a natural language processing technique concerned with classifying the opinions expressed in a piece of text. In other words, it is used to determine the polarity of sentences.
Sentiment analysis typically takes either a machine learning approach or a lexicon-based approach to gauging human opinion about a topic. The challenge lies in recognising the emotions expressed in short, informal content such as Twitter data.
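To make the lexicon-based idea concrete, here is a minimal sketch using NLTK's VADER analyzer; the example sentences are purely illustrative.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the sentiment lexicon
analyzer = SentimentIntensityAnalyzer()

for sentence in ["The movie was absolutely wonderful!",
                 "This was a waste of two hours."]:
    scores = analyzer.polarity_scores(sentence)
    # 'compound' is a normalised score in [-1, 1]; > 0 indicates positive polarity
    label = "positive" if scores['compound'] > 0 else "negative"
    print(sentence, "->", label, scores)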
Here, our focus will be to cover the details of some of the most popular datasets used in sentiment analysis. Further, we will walk through code for loading these datasets using TensorFlow and PyTorch.
SST
The Stanford Sentiment Treebank (SST) was collected from rottentomatoes.com by the researchers Pang and Lee. It comprises 10,662 sentences, half of which are labelled positive and the other half negative. Each snippet was extracted from a longer movie review and reflects the writer's overall opinion in that review. The Stanford Parser was used to parse all 10,662 sentences; in around 1,100 cases it split a snippet into multiple sentences. Amazon Mechanical Turk was then used to label the resulting 215,154 phrases.
Dataset size: 19 MB
State of the Art
The current state of the art on the SST dataset is T5-3B, which achieves an accuracy of 97.4%. ALBERT and T5-11B are close contenders with accuracies of around 97%.
Loading the dataset using TensorFlow
!pip install tflite-model-maker
import numpy as np
import os
import tensorflow as tf
assert tf.__version__.startswith('2')

from tflite_model_maker import configs
from tflite_model_maker import ExportFormat
from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker import TextClassifierDataLoader

# download and extract the SST-2 archive; get_file returns the path to the zip
data_dir = tf.keras.utils.get_file(
    fname='SST-2.zip',
    origin='https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8',
    extract=True)
data_dir = os.path.join(os.path.dirname(data_dir), 'SST-2')
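Once the archive is extracted, the splits can be fed to Model Maker. The following is a minimal sketch, assuming the standard GLUE SST-2 TSV layout with 'sentence' and 'label' columns; the exact loader API may differ across tflite-model-maker versions.

# a lightweight word-embedding model spec; heavier specs like MobileBERT also exist
spec = model_spec.get('average_word_vec')

train_data = TextClassifierDataLoader.from_csv(
    filename=os.path.join(data_dir, 'train.tsv'),
    text_column='sentence',   # assumed column name from the SST-2 TSVs
    label_column='label',
    model_spec=spec,
    delimiter='\t',
    is_training=True)

# train a text classifier on the loaded data
model = text_classifier.create(train_data, model_spec=spec, epochs=2)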
Loading the dataset using Pytorch
import os
from torchtext import data


class SST(data.Dataset):
    # torchtext's download helper relies on the `name` and `dirname` attributes
    urls = ['http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip']
    dirname = 'trees'
    name = 'sst'

    @staticmethod
    def sort_key(ex):
        return len(ex.text)

    def __init__(self, path, text_field, label_field, subtrees=False,
                 fine_grained=False, **kwargs):
        fields = [('text', text_field), ('label', label_field)]

        def get_label_str(label):
            pre = 'very ' if fine_grained else ''
            return {'0': pre + 'negative', '1': 'negative', '2': 'neutral',
                    '3': 'positive', '4': pre + 'positive', None: None}[label]

        label_field.preprocessing = data.Pipeline(get_label_str)
        with open(os.path.expanduser(path)) as f:
            if subtrees:
                # one example per subtree of each parse tree
                examples = [ex for line in f
                            for ex in data.Example.fromtree(line, fields, True)]
            else:
                examples = [data.Example.fromtree(line, fields) for line in f]
        super(SST, self).__init__(examples, fields, **kwargs)

    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train.txt', validation='dev.txt', test='test.txt',
               train_subtrees=False, **kwargs):
        path = cls.download(root)
        train_data = None if train is None else cls(
            os.path.join(path, train), text_field, label_field,
            subtrees=train_subtrees, **kwargs)
        val_data = None if validation is None else cls(
            os.path.join(path, validation), text_field, label_field, **kwargs)
        test_data = None if test is None else cls(
            os.path.join(path, test), text_field, label_field, **kwargs)
        return tuple(d for d in (train_data, val_data, test_data)
                     if d is not None)

    @classmethod
    def iters(cls, batch_size=32, device=0, root='.data', vectors=None, **kwargs):
        text = data.Field()
        label = data.Field(sequential=False)
        train, val, test = cls.splits(text, label, root=root, **kwargs)
        text.build_vocab(train, vectors=vectors)
        label.build_vocab(train)
        return data.BucketIterator.splits(
            (train, val, test), batch_size=batch_size, device=device)
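With the class defined, the splits can be loaded and batched as follows. Note this uses the older torchtext API (moved under torchtext.legacy in later releases); the split sizes in the comment are the standard sentence-level counts.

TEXT = data.Field()
LABEL = data.Field(sequential=False)

# downloads and parses the treebank on first use
train, val, test = SST.splits(TEXT, LABEL)
print(len(train), len(val), len(test))  # 8544 / 1101 / 2210 sentences

TEXT.build_vocab(train)
LABEL.build_vocab(train)

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=32)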
Sentiment140
The Sentiment140 dataset contains 1,600,000 tweets extracted through the Twitter API. Each tweet is annotated with one of three polarity labels: 0 (negative), 2 (neutral) and 4 (positive), so the data can be used to train sentiment classifiers.
The CSV file contains the following fields (a loading sketch follows the list):
- polarity of the tweet
- id of the tweet
- date of the tweet
- query
- user that tweeted
- content of the tweet
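As a quick sanity check outside of any framework, the raw CSV can be read with pandas. This sketch assumes the training file has been downloaded locally.

import pandas as pd

# the file ships without a header row; column order follows the list above
cols = ['polarity', 'id', 'date', 'query', 'user', 'text']
df = pd.read_csv('training.1600000.processed.noemoticon.csv',
                 names=cols, encoding='latin-1')  # the file uses Latin-1 encoding

# 0 = negative, 4 = positive; the neutral label 2 appears only in the test file
print(df['polarity'].value_counts())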
Dataset size: 305.13 MB
Loading the dataset using TensorFlow
import codecs  # used by the (omitted) example generator
import csv
import os

import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds

# constants referenced below; values follow the official TFDS builder
_HOMEPAGE_URL = "http://help.sentiment140.com/home"
_DOWNLOAD_URL = "http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip"


class Sentiment140(tfds.core.GeneratorBasedBuilder):
    VERSION = tfds.core.Version("1.0.0")

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                "polarity": tf.int32,
                "date": tfds.features.Text(),
                "query": tfds.features.Text(),
                "user": tfds.features.Text(),
                "text": tfds.features.Text(),
            }),
            supervised_keys=("text", "polarity"),
            homepage=_HOMEPAGE_URL,
        )

    def _split_generators(self, dl_manager):
        dl_paths = dl_manager.download_and_extract(_DOWNLOAD_URL)
        return [
            tfds.core.SplitGenerator(
                name=tfds.Split.TRAIN,
                gen_kwargs={
                    "path": os.path.join(
                        dl_paths, "training.1600000.processed.noemoticon.csv")
                }),
            tfds.core.SplitGenerator(
                name=tfds.Split.TEST,
                gen_kwargs={
                    "path": os.path.join(
                        dl_paths, "testdata.manual.2009.06.14.csv")
                }),
        ]
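Since Sentiment140 ships with TensorFlow Datasets, a builder like the one above is already registered, and in practice the data can simply be loaded by name.

import tensorflow_datasets as tfds

# (text, polarity) pairs, following the supervised_keys declared above
train_ds = tfds.load('sentiment140', split='train', as_supervised=True)

for text, polarity in train_ds.take(2):
    print(polarity.numpy(), text.numpy()[:60])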
Yelp Polarity Review
The Yelp polarity review dataset is used for binary sentiment classification. It contains 560,000 Yelp reviews for training and 38,000 for testing. It was introduced in 2015 by the researchers Xiang Zhang, Junbo Zhao and Yann LeCun. The dataset was built by treating reviews with stars 1 and 2 as negative, and 3 and 4 as positive. For each polarity, 280,000 training samples and 19,000 test samples were taken at random.
Dataset size: 435.18 MB
State of the Art
The current state of the art on the Yelp polarity dataset is BERT-large, which achieves an error rate of 1.89%.
Loading the dataset using TensorFlow
import os

import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds

url = "https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz"


class YelpPolarityReviews(tfds.core.GeneratorBasedBuilder):
    VERSION = tfds.core.Version("0.2.0")

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                "text": tfds.features.Text(),
                # class "1" is negative, class "2" is positive
                "label": tfds.features.ClassLabel(names=["1", "2"]),
            }),
            supervised_keys=("text", "label"),
            homepage="https://course.fast.ai/datasets",
        )

    def _split_generators(self, dl_manager):
        arch_path = dl_manager.download_and_extract(url)
        train_file = os.path.join(
            arch_path, "yelp_review_polarity_csv", "train.csv")
        test_file = os.path.join(
            arch_path, "yelp_review_polarity_csv", "test.csv")
        return [
            tfds.core.SplitGenerator(
                name=tfds.Split.TRAIN,
                gen_kwargs={"filepath": train_file}),
            tfds.core.SplitGenerator(
                name=tfds.Split.TEST,
                gen_kwargs={"filepath": test_file}),
        ]
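As with Sentiment140, this builder is registered in TensorFlow Datasets, so the dataset can be loaded by name.

import tensorflow_datasets as tfds

# 'yelp_polarity_reviews' is the registered TFDS name for this builder
train_ds, test_ds = tfds.load('yelp_polarity_reviews',
                              split=['train', 'test'], as_supervised=True)

for text, label in train_ds.take(1):
    # ClassLabel "1" (negative) maps to 0, "2" (positive) maps to 1
    print(label.numpy(), text.numpy()[:80])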
Conclusion
In this article, we discussed the details of some of the most widely benchmarked datasets used in sentiment analysis, along with code for loading them using the TensorFlow and PyTorch libraries. Among these, SST is regularly used to evaluate new language models such as BERT and ELMo, primarily as a way to demonstrate superiority across a variety of semantic tasks.