Most Benchmarked Datasets in Neural Sentiment Analysis With Implementation in PyTorch and TensorFlow

Ankit Das

With the growing popularity of blogging and social platforms, a massive number of users share reviews on every aspect of life. As a result, popular sites like Amazon and Twitter are rich sources of data for opinion mining and sentiment analysis.

Sentiment analysis is a technique in natural language processing that deals with classifying the opinions expressed in a piece of text. In other words, it is used to determine the polarity of sentences.

Sentiment analysis typically relies on either a machine learning approach or a lexicon-based approach to analyse human sentiment about a topic. The challenge lies in recognising the emotions expressed in informal content such as Twitter data.

Here, our focus will be on the details of some of the most popular datasets used in sentiment analysis. Further, we will walk through code for loading these datasets using TensorFlow and PyTorch.

SST

The Stanford Sentiment Treebank is built on the corpus of movie-review snippets collected from rottentomatoes.com by Pang and Lee. It comprises 10,662 sentences, half of which were labelled positive and the other half negative. Each label was extracted from a longer movie review and reflects the writer's overall intention for that review. The Stanford Parser was used to parse all 10,662 sentences; in around 1,100 cases it split a snippet into multiple sentences. Amazon Mechanical Turk was then used to label the resulting 215,154 phrases.

Dataset Source

Dataset size: 19 MB


State of the Art

The current state of the art on the SST dataset is T5-3B, which achieves an accuracy of 97.4%. ALBERT and T5-11B are close contenders, with accuracies of around 97%.

Loading the dataset using TensorFlow

!pip install tflite-model-maker

import numpy as np
import os
import tensorflow as tf
assert tf.__version__.startswith('2')

from tflite_model_maker import configs
from tflite_model_maker import ExportFormat
from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker import TextClassifierDataLoader

# Download and extract the GLUE copy of SST-2, then point to the folder.
archive = tf.keras.utils.get_file(
      fname='SST-2.zip',
      origin='https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8',
      extract=True)
directory = os.path.join(os.path.dirname(archive), 'SST-2')
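
With the files extracted, the TSV splits can be wrapped in model-maker data loaders. The sketch below is one way to do it, using the from_csv loader and the default average-word-embedding model spec; the sentence and label column names follow the standard SST-2 TSV layout.

# A minimal sketch: wrap the extracted SST-2 TSVs in data loaders.
spec = model_spec.get('average_word_vec')

train_data = TextClassifierDataLoader.from_csv(
    filename=os.path.join(directory, 'train.tsv'),
    text_column='sentence',   # column holding the sentence text
    label_column='label',     # 0 = negative, 1 = positive
    model_spec=spec,
    delimiter='\t',
    is_training=True)

test_data = TextClassifierDataLoader.from_csv(
    filename=os.path.join(directory, 'dev.tsv'),
    text_column='sentence',
    label_column='label',
    model_spec=spec,
    delimiter='\t',
    is_training=False)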

Loading the dataset using PyTorch

import os
from torchtext import data   # legacy torchtext.data API (torchtext < 0.9)

class SST(data.Dataset):
    # Download location and folder layout used by data.Dataset.download().
    urls = ['http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip']
    dirname = 'trees'
    name = 'sst'

    @staticmethod
    def sort_key(ex):
        # Bucket examples of similar length together for efficient batching.
        return len(ex.text)

    def __init__(self, path, text_field, label_field, subtrees=False,
                 fine_grained=False, **kwargs):
        fields = [('text', text_field), ('label', label_field)]

        def get_label_str(label):
            # Map the 0-4 treebank labels to strings; with fine_grained=False
            # the extreme classes collapse into plain negative/positive.
            pre = 'very ' if fine_grained else ''
            return {'0': pre + 'negative', '1': 'negative', '2': 'neutral',
                    '3': 'positive', '4': pre + 'positive', None: None}[label]

        label_field.preprocessing = data.Pipeline(get_label_str)
        with open(os.path.expanduser(path)) as f:
            if subtrees:
                # Emit one example per subtree of every parse tree.
                examples = [ex for line in f for ex in
                            data.Example.fromtree(line, fields, True)]
            else:
                examples = [data.Example.fromtree(line, fields) for line in f]
        super(SST, self).__init__(examples, fields, **kwargs)

    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train.txt', validation='dev.txt', test='test.txt',
               train_subtrees=False, **kwargs):
        # Download the treebank (if needed) and build the three splits.
        path = cls.download(root)
        train_data = None if train is None else cls(
            os.path.join(path, train), text_field, label_field,
            subtrees=train_subtrees, **kwargs)
        val_data = None if validation is None else cls(
            os.path.join(path, validation), text_field, label_field, **kwargs)
        test_data = None if test is None else cls(
            os.path.join(path, test), text_field, label_field, **kwargs)
        return tuple(d for d in (train_data, val_data, test_data)
                     if d is not None)

    @classmethod
    def iters(cls, batch_size=32, device=0, root='.data', vectors=None,
              **kwargs):
        # Convenience wrapper: build fields, vocabularies and iterators.
        text = data.Field()
        label = data.Field(sequential=False)
        train, val, test = cls.splits(text, label, root=root, **kwargs)
        text.build_vocab(train, vectors=vectors)
        label.build_vocab(train)
        return data.BucketIterator.splits(
            (train, val, test), batch_size=batch_size, device=device)
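
A minimal usage sketch for the class above, assuming the legacy torchtext.data API (torchtext versions below 0.9):

# Build batched iterators over SST with the helper defined above.
# device=-1 selects the CPU in the legacy torchtext API.
train_iter, val_iter, test_iter = SST.iters(batch_size=32, device=-1)
batch = next(iter(train_iter))
print(batch.text.shape, batch.label.shape)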

Sentiment140

The Sentiment140 dataset contains 1,600,000 tweets extracted via the Twitter API. The tweets are annotated with three classes: 0 = negative, 2 = neutral, and 4 = positive, and they can be used to train sentiment detection models.

The CSV file contains the following fields:

  1. polarity of the tweet
  2. id of the tweet
  3. date of the tweet
  4. query used (NO_QUERY if there is none)
  5. user that tweeted
  6. text of the tweet

Dataset Source

Dataset size: 305.13 MB


Loading the dataset using TensorFlow

import codecs
import csv
import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds

# Constants referenced below (missing from the original snippet).
_HOMEPAGE_URL = "http://help.sentiment140.com/home"
_DOWNLOAD_URL = "http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip"

class Sentiment140(tfds.core.GeneratorBasedBuilder):
  """Sentiment140: 1.6M tweets annotated with polarity labels."""

  VERSION = tfds.core.Version("1.0.0")

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "polarity": tf.int32,
            "date": tfds.features.Text(),
            "query": tfds.features.Text(),
            "user": tfds.features.Text(),
            "text": tfds.features.Text(),
        }),
        supervised_keys=("text", "polarity"),
        homepage=_HOMEPAGE_URL,
    )

  def _split_generators(self, dl_manager):
    dl_paths = dl_manager.download_and_extract(_DOWNLOAD_URL)
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={
                "path":
                    os.path.join(dl_paths,
                                 "training.1600000.processed.noemoticon.csv")
            }),
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={
                "path": os.path.join(dl_paths, "testdata.manual.2009.06.14.csv")
            }),
    ]

  def _generate_examples(self, path):
    # The CSVs are ISO-8859-1 encoded; columns are
    # polarity, id, date, query, user, text.
    with tf.io.gfile.GFile(path, "rb") as f:
      csv_file = codecs.iterdecode(f, encoding="ISO-8859-1")
      reader = csv.reader(csv_file, delimiter=",", quotechar='"')
      for i, row in enumerate(reader):
        polarity, _, date, query, user, text = row
        yield i, {
            "polarity": int(polarity),
            "date": date,
            "query": query,
            "user": user,
            "text": text,
        }
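
Since Sentiment140 ships with TensorFlow Datasets, the same data can also be loaded without writing a builder by hand. A minimal sketch:

import tensorflow_datasets as tfds

# Load (text, polarity) pairs from the registered TFDS builder.
train_ds = tfds.load('sentiment140', split='train', as_supervised=True)
for text, polarity in train_ds.take(2):
    print(polarity.numpy(), text.numpy()[:60])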

Yelp Polarity Review

The Yelp polarity review dataset is used for binary sentiment classification. It contains 560,000 Yelp reviews for training and 38,000 for testing. It was introduced in 2015 by Xiang Zhang, Junbo Zhao, and Yann LeCun. The dataset was built by treating reviews with 1 and 2 stars as negative and those with 3 and 4 stars as positive. For each polarity, 280,000 training samples and 19,000 test samples were drawn at random.

Dataset Source

Dataset size: 435.18 MB

State of the Art

The current state of the art on the Yelp polarity dataset is BERT-large, which achieves an error rate of 1.89%.

Loading the dataset using TensorFlow

import csv
import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds

url = "https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz"

class YelpPolarityReviews(tfds.core.GeneratorBasedBuilder):
  """Yelp polarity reviews: binary sentiment from review stars."""

  VERSION = tfds.core.Version("0.2.0")

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "text": tfds.features.Text(),
            # "1" = negative (1-2 stars), "2" = positive (3-4 stars).
            "label": tfds.features.ClassLabel(names=["1", "2"]),
        }),
        supervised_keys=("text", "label"),
        homepage="https://course.fast.ai/datasets",
    )

  def _split_generators(self, dl_manager):
    arch_path = dl_manager.download_and_extract(url)
    train_file = os.path.join(
        arch_path, "yelp_review_polarity_csv", "train.csv")
    test_file = os.path.join(arch_path, "yelp_review_polarity_csv", "test.csv")
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={"filepath": train_file}),
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={"filepath": test_file}),
    ]

  def _generate_examples(self, filepath):
    # Each CSV row is: label, review text.
    with tf.io.gfile.GFile(filepath) as f:
      reader = csv.reader(f, delimiter=",", quotechar='"')
      for i, row in enumerate(reader):
        label, text = row
        yield i, {"text": text, "label": label}
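
As with Sentiment140, a registered TFDS copy (yelp_polarity_reviews) can be loaded in a single call. A minimal sketch:

import tensorflow_datasets as tfds

# Load the train and test splits as (text, label) pairs.
train_ds, test_ds = tfds.load('yelp_polarity_reviews',
                              split=['train', 'test'],
                              as_supervised=True)
for text, label in train_ds.take(1):
    print(label.numpy(), text.numpy()[:80])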

Conclusion

In this article, we discussed the details of some of the most benchmarked datasets used in sentiment analysis, along with code for loading them with TensorFlow and PyTorch. Of these datasets, SST is the one most regularly used to evaluate new language models such as BERT and ELMo, primarily as a way to demonstrate superiority on a variety of semantic tasks.

