Most Benchmarked Datasets in Neural Sentiment Analysis With Implementation in PyTorch and TensorFlow

With the growing popularity of blogging and social media sites, a massive number of users share reviews on various aspects of life every day. Popular platforms such as Amazon and Twitter are therefore rich sources of data for opinion mining and sentiment analysis.

Sentiment analysis is a technique in natural language processing that deals with the classification of opinions expressed in a piece of text. In other words, it is used to determine the polarity of sentences.

Sentiment analysis typically relies on either a machine learning approach or a lexicon-based approach to analyse human sentiment about a topic. The challenge lies in recognizing the emotions expressed in such content, for example, Twitter data.
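To make the lexicon-based approach concrete, here is a minimal sketch that scores a sentence against a small hand-written polarity lexicon; the word lists are toy examples for illustration, not a standard lexicon such as VADER.

# Toy lexicon-based polarity scorer (illustrative word lists only).
POSITIVE = {"good", "great", "excellent", "love", "wonderful"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "awful"}

def polarity(text):
    # Count matches against each word list and compare the totals.
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("the movie was great and i love it"))  # positive
print(polarity("terrible plot and poor acting"))      # negative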

Here, our focus will be on the details of some of the most popular datasets used in sentiment analysis. Further, we will walk through code for loading each of these datasets using TensorFlow and PyTorch.

SST

The Stanford Sentiment Treebank builds on movie-review sentences collected from rottentomatoes.com by the researchers Pang and Lee. It comprises 10,662 sentences, half of which were viewed as positive and the other half negative. Each label was extracted from a longer movie review and reflects the writer's overall intention for that review. The Stanford Parser was used to parse all 10,662 sentences; in around 1,100 cases it splits a snippet into multiple sentences. Amazon Mechanical Turk was then used to label the resulting 215,154 phrases.

Dataset Source: Sentiment Treebank

Dataset size: 19 MB

State of the Art

The present state of the art on the SST dataset is T5-3B, which achieves an accuracy of 97.4%. ALBERT and T5-11B are close contenders with accuracies of around 97%.

Loading the dataset using TensorFlow

!pip install tflite-model-maker
import os
import tensorflow as tf
assert tf.__version__.startswith('2')
from tflite_model_maker import configs
from tflite_model_maker import ExportFormat
from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker import TextClassifierDataLoader
# Download and extract the SST-2 archive; get_file returns the path to the zip.
data_dir = tf.keras.utils.get_file(
      fname='SST-2.zip',
      origin='https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8',
      extract=True)
# The extracted SST-2 folder sits alongside the downloaded zip.
directory = os.path.join(os.path.dirname(data_dir), 'SST-2')
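As a follow-up, here is a minimal sketch of how the extracted TSV files can be loaded and used to train a small classifier with the tflite_model_maker API imported above; the column names 'sentence' and 'label' match the SST-2 TSV layout, and 'average_word_vec' is one of the built-in model specs.

spec = model_spec.get('average_word_vec')
# Load the train and dev splits from the extracted tab-separated files.
train_data = TextClassifierDataLoader.from_csv(
      filename=os.path.join(directory, 'train.tsv'),
      text_column='sentence',
      label_column='label',
      model_spec=spec,
      delimiter='\t',
      is_training=True)
dev_data = TextClassifierDataLoader.from_csv(
      filename=os.path.join(directory, 'dev.tsv'),
      text_column='sentence',
      label_column='label',
      model_spec=spec,
      delimiter='\t',
      is_training=False)
# Train briefly and evaluate on the dev split.
model = text_classifier.create(train_data, model_spec=spec, epochs=2)
loss, accuracy = model.evaluate(dev_data)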

Loading the dataset using Pytorch

import os
from torchtext import data
class SST(data.Dataset):
    # Attribute names `name` and `dirname` are what torchtext's built-in
    # download helper expects when fetching and extracting the archive.
    urls = ['http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip']
    dirname = 'trees'
    name = 'sst'
    @staticmethod
    def sort_key(ex):
        return len(ex.text)
    def __init__(self, path, text_field, label_field, subtrees=False,
                 fine_grained=False, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        def get_label_str(label):
            # Map the tree labels 0-4 to strings. In fine-grained mode all
            # five classes are kept; otherwise 0/1 -> negative, 3/4 -> positive.
            pre = 'very ' if fine_grained else ''
            return {'0': pre + 'negative', '1': 'negative', '2': 'neutral',
                    '3': 'positive', '4': pre + 'positive', None: None}[label]
        label_field.preprocessing = data.Pipeline(get_label_str)
        # Each line in the data files holds one parse tree; with
        # subtrees=True every subtree becomes its own example.
        with open(os.path.expanduser(path)) as f:
            if subtrees:
                examples = [ex for line in f for ex in
                            data.Example.fromtree(line, fields, True)]
            else:
                examples = [data.Example.fromtree(line, fields) for line in f]
        super(SST, self).__init__(examples, fields, **kwargs)
    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train.txt', validation='dev.txt', test='test.txt',
               train_subtrees=False, **kwargs):
        path = cls.download(root)
        train_data = None if train is None else cls(
            os.path.join(path, train), text_field, label_field, subtrees=train_subtrees,
            **kwargs)
        val_data = None if validation is None else cls(
            os.path.join(path, validation), text_field, label_field, **kwargs)
        test_data = None if test is None else cls(
            os.path.join(path, test), text_field, label_field, **kwargs)
        return tuple(d for d in (train_data, val_data, test_data)
                     if d is not None)
    @classmethod
    def iters(cls, batch_size=32, device=0, root='.data', vectors=None, **kwargs):
        text = data.Field()
        label = data.Field(sequential=False)
        train, val, test = cls.splits(text, label, root=root, **kwargs)
        text.build_vocab(train, vectors=vectors)
        label.build_vocab(train)
        return data.BucketIterator.splits(
            (train, val, test), batch_size=batch_size, device=device)
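A minimal usage sketch for the class above, assuming the legacy torchtext API (versions where torchtext.data still exposes Field and BucketIterator):

TEXT = data.Field()
LABEL = data.Field(sequential=False)
# Download SST, build the three splits, and construct vocabularies.
train_set, val_set, test_set = SST.splits(TEXT, LABEL)
TEXT.build_vocab(train_set)
LABEL.build_vocab(train_set)
# Bucketed iterators group examples of similar length to reduce padding.
train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train_set, val_set, test_set), batch_size=32)
batch = next(iter(train_iter))
print(batch.text.shape, batch.label.shape)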

Sentiment140

The Sentiment140 dataset contains 1,600,000 tweets extracted via the Twitter API. The tweets have been categorized into three classes: 0 (negative), 2 (neutral), and 4 (positive), and can be used to detect sentiment.

The CSV file contains the following fields (a minimal loading sketch follows the list):

  1. polarity of the tweet
  2. id of the tweet
  3. date of the tweet
  4. query
  5. user that tweeted
  6. content of the tweet
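For a quick look at these fields, the sketch below loads the training CSV with pandas; the file name matches the distributed archive, the latin-1 encoding reflects how the file is encoded, and the column names are assigned manually since the CSV ships without a header row.

import pandas as pd
# Name the six fields ourselves; the CSV has no header row.
cols = ['polarity', 'id', 'date', 'query', 'user', 'text']
df = pd.read_csv('training.1600000.processed.noemoticon.csv',
                 names=cols, encoding='latin-1')
print(df['polarity'].value_counts())  # 0 = negative, 4 = positive
print(df[['user', 'text']].head())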

Dataset Source

Dataset size: 305.13 MB

Loading the dataset using TensorFlow

import codecs
import csv
import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds

_HOMEPAGE_URL = "http://help.sentiment140.com/"
_DOWNLOAD_URL = "http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip"

class Sentiment140(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")

  def _info(self):
    # Declare the feature schema and the (input, target) pair used by
    # tfds.load(..., as_supervised=True).
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "polarity": tf.int32,
            "date": tfds.features.Text(),
            "query": tfds.features.Text(),
            "user": tfds.features.Text(),
            "text": tfds.features.Text(),
        }),
        supervised_keys=("text", "polarity"),
        homepage=_HOMEPAGE_URL,
    )

  def _split_generators(self, dl_manager):
    dl_paths = dl_manager.download_and_extract(_DOWNLOAD_URL)
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={
                "path":
                    os.path.join(dl_paths,
                                 "training.1600000.processed.noemoticon.csv")
            }),
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={
                "path": os.path.join(dl_paths, "testdata.manual.2009.06.14.csv")
            }),
    ]

  def _generate_examples(self, path):
    # The CSVs ship without a header row and are latin-1 encoded.
    with tf.io.gfile.GFile(path, "rb") as f:
      reader = csv.reader(codecs.iterdecode(f, "latin-1"),
                          delimiter=",", quotechar='"')
      for i, row in enumerate(reader):
        polarity, _, date, query, user, text = row
        yield i, {
            "polarity": int(polarity),
            "date": date,
            "query": query,
            "user": user,
            "text": text,
        }
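Since Sentiment140 is already registered in TensorFlow Datasets under the name 'sentiment140', the builder above does not have to be written by hand; the dataset can also be loaded directly:

import tensorflow_datasets as tfds
# Load the train split as (text, polarity) pairs.
ds = tfds.load('sentiment140', split='train', as_supervised=True)
for text, polarity in ds.take(2):
    print(polarity.numpy(), text.numpy()[:60])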

Yelp Polarity Review

The Yelp polarity review dataset is used for binary sentiment classification. It contains 560,000 Yelp reviews for training and 38,000 for testing. It was presented in 2015 by the researchers Xiang Zhang, Junbo Zhao, and Yann LeCun. The dataset was built by treating reviews with stars 1 and 2 as negative and those with stars 3 and 4 as positive. For each polarity, 280,000 training samples and 19,000 testing samples were taken at random, as sketched below.
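A minimal sketch of that construction rule, assuming raw reviews carrying a star rating; the labels "1" and "2" match the label names used in the loading code further down.

# Map star ratings to polarity labels as described by Zhang et al. (2015):
# stars 1-2 -> negative (label "1"), stars 3-4 -> positive (label "2").
def stars_to_polarity(stars):
    if stars in (1, 2):
        return '1'  # negative
    if stars in (3, 4):
        return '2'  # positive
    return None     # ratings outside the mapping are dropped

reviews = [{'stars': 1, 'text': 'Awful service.'},
           {'stars': 4, 'text': 'Great food!'}]
labeled = [(stars_to_polarity(r['stars']), r['text']) for r in reviews]
print(labeled)  # [('1', 'Awful service.'), ('2', 'Great food!')]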

Dataset Source

Dataset size: 435.18 MB

State of the Art

The present state of the art on the Yelp polarity dataset is BERT large, which achieves an error rate of 1.89%.

Loading the dataset using TensorFlow

import csv
import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds

url = "https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz"

class YelpPolarityReviews(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("0.2.0")

  def _info(self):
    # Labels follow the CSV convention: "1" = negative, "2" = positive.
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "text": tfds.features.Text(),
            "label": tfds.features.ClassLabel(names=["1", "2"]),
        }),
        supervised_keys=("text", "label"),
        homepage="https://course.fast.ai/datasets",
    )

  def _split_generators(self, dl_manager):
    arch_path = dl_manager.download_and_extract(url)
    train_file = os.path.join(
        arch_path, "yelp_review_polarity_csv", "train.csv")
    test_file = os.path.join(arch_path, "yelp_review_polarity_csv", "test.csv")
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={"filepath": train_file}),
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={"filepath": test_file}),
    ]

  def _generate_examples(self, filepath):
    # Each CSV row is (label, review text); there is no header row.
    with tf.io.gfile.GFile(filepath) as f:
      reader = csv.reader(f, delimiter=",", quotechar='"')
      for i, row in enumerate(reader):
        label, text = row
        yield i, {"text": text, "label": label}
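As with Sentiment140, TFDS already registers this dataset, under the name 'yelp_polarity_reviews', so it can also be loaded in one line:

import tensorflow_datasets as tfds
# Load the test split as (text, label) pairs.
ds = tfds.load('yelp_polarity_reviews', split='test', as_supervised=True)
for text, label in ds.take(2):
    print(label.numpy(), text.numpy()[:60])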

Conclusion

In this article, we discussed the details and implementation of some of the most benchmarked datasets used in sentiment analysis, along with code for loading them with the TensorFlow and PyTorch libraries. Of all these datasets, SST is the one most regularly used to evaluate new language models such as BERT and ELMo, primarily as a way to demonstrate performance across a variety of semantic tasks.
