Most Benchmarked Datasets in Neural Sentiment Analysis With Implementation in PyTorch and TensorFlow


With the growing popularity of blogging and review sites, a massive number of users share opinions on various aspects of life every day. Popular platforms such as Amazon and Twitter are therefore rich sources of data for opinion mining and sentiment analysis.

Sentiment analysis is a natural language processing technique that deals with classifying the opinions expressed in a piece of text. In other words, it is used to determine the polarity of sentences.

Sentiment analysis uses either a machine learning approach or a lexicon-based approach to analyse human sentiment about a topic. The challenge lies in recognising the emotions expressed in informal text such as Twitter data.
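As a small illustration of the lexicon-based approach, the sketch below scores a sentence with NLTK's VADER analyzer; the example sentence and the thresholds on the compound score are our own choices, following common convention:

# Lexicon-based polarity scoring with NLTK's VADER.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')  # one-time download of the sentiment lexicon
analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores('The movie was surprisingly good!')
# compound is an aggregate score in [-1, 1]; its sign gives the polarity.
label = ('positive' if scores['compound'] >= 0.05 else
         'negative' if scores['compound'] <= -0.05 else 'neutral')
print(scores, label)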

Here, our focus will be to cover the details of some of the most popular datasets used in sentiment analysis. Further, we will walk through code for loading these datasets using TensorFlow and PyTorch.

SST

The Stanford Sentiment Treebank builds on sentences collected from the website rottentomatoes.com by the researchers Pang and Lee. It incorporates 10,662 sentences, half of which were viewed as positive and the other half negative. Each label was extracted from a longer movie review and reflects the writer's overall intention for that review. The Stanford Parser was used to parse all 10,662 sentences; in around 1,100 cases it splits a snippet into multiple sentences. Amazon Mechanical Turk was then used to label the resulting 215,154 phrases.
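Each line of the treebank files is a PTB-style parse tree whose nodes carry sentiment labels from 0 (very negative) to 4 (very positive). The short sketch below, using a made-up example line, shows how the root label and the tokens can be recovered:

import re
# A hypothetical SST line: every node is written as '(label ...)'.
tree = '(4 (2 An) (4 (3 (4 excellent) (2 film)) (2 .)))'
root_label = int(tree[1])  # root sentiment label, 0-4
tokens = re.findall(r'\(\d ([^()]+)\)', tree)  # leaves are '(label token)'
print(root_label, ' '.join(tokens))  # 4 An excellent film .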

Dataset Source

Dataset size: 19 MB


State of the Art

The present state of the art on the SST dataset is T5-3B, with an accuracy of 97.4%. ALBERT and T5-11B are close contenders with accuracies of around 97%.

Loading the dataset using TensorFlow

!pip install tflite-model-maker
import numpy as np
import os
import tensorflow as tf
assert tf.__version__.startswith('2')
from tflite_model_maker import configs
from tflite_model_maker import ExportFormat
from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker import TextClassifierDataLoader
# Download and extract the SST-2 archive; get_file returns the archive path.
directory = tf.keras.utils.get_file(
      fname='SST-2.zip',
      origin='https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8',
      extract=True)
# The extracted files live in an 'SST-2' folder next to the downloaded archive.
directory = os.path.join(os.path.dirname(directory), 'SST-2')
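From here a classifier can be trained on the extracted split. The sketch below assumes the 0.x tflite-model-maker API imported above; SST-2's train.tsv uses the GLUE column names 'sentence' and 'label'.

# Train a lightweight text classifier on the SST-2 training split.
spec = model_spec.get('average_word_vec')
train_data = TextClassifierDataLoader.from_csv(
    filename=os.path.join(directory, 'train.tsv'),
    text_column='sentence',
    label_column='label',
    model_spec=spec,
    delimiter='\t',
    is_training=True)
model = text_classifier.create(train_data, model_spec=spec)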

Loading the dataset using Pytorch

import os
from torchtext import data  # torchtext.legacy.data in torchtext >= 0.9
class SST(data.Dataset):
    # cls.download() below uses these class attributes to fetch the data.
    urls = ['http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip']
    dirname = 'trees'
    name = 'sst'
    @staticmethod
    def sort_key(ex):
        return len(ex.text)
    def __init__(self, path, text_field, label_field, subtrees=False,
                 fine_grained=False, **kwargs):
        fields = [('text', text_field), ('label', label_field)]
        def get_label_str(label):
            # Map treebank labels 0-4 to strings; fine_grained keeps all
            # five classes, from 'very negative' to 'very positive'.
            pre = 'very ' if fine_grained else ''
            return {'0': pre + 'negative', '1': 'negative', '2': 'neutral',
                    '3': 'positive', '4': pre + 'positive', None: None}[label]
        label_field.preprocessing = data.Pipeline(get_label_str)
        with open(os.path.expanduser(path)) as f:
            if subtrees:
                # Treat every labelled subtree as a separate example.
                examples = [ex for line in f for ex in
                            data.Example.fromtree(line, fields, True)]
            else:
                examples = [data.Example.fromtree(line, fields) for line in f]
        super(SST, self).__init__(examples, fields, **kwargs)
    @classmethod
    def splits(cls, text_field, label_field, root='.data',
               train='train.txt', validation='dev.txt', test='test.txt',
               train_subtrees=False, **kwargs):
        path = cls.download(root)
        train_data = None if train is None else cls(
            os.path.join(path, train), text_field, label_field, subtrees=train_subtrees,
            **kwargs)
        val_data = None if validation is None else cls(
            os.path.join(path, validation), text_field, label_field, **kwargs)
        test_data = None if test is None else cls(
            os.path.join(path, test), text_field, label_field, **kwargs)
        return tuple(d for d in (train_data, val_data, test_data)
                     if d is not None)
    @classmethod
    def iters(cls, batch_size=32, device=0, root='.data', vectors=None, **kwargs):
        text = data.Field()
        label = data.Field(sequential=False)
        train, val, test = cls.splits(text, label, root=root, **kwargs)
        text.build_vocab(train, vectors=vectors)
        label.build_vocab(train)
        return data.BucketIterator.splits(
            (train, val, test), batch_size=batch_size, device=device)
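A minimal usage sketch of the class above (assuming the old torchtext API, where device=-1 keeps tensors on the CPU):

# Download SST, build the vocabularies and get bucketed iterators in one call.
train_iter, val_iter, test_iter = SST.iters(batch_size=64, device=-1)
batch = next(iter(train_iter))
print(batch.text.shape)  # (sequence_length, batch_size)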

Sentiment140

The Sentiment140 dataset contains 1,600,000 tweets extracted through the Twitter API. The tweets are annotated with three classes (0 = negative, 2 = neutral, 4 = positive) and can be used to detect sentiment.

The CSV file contains the following fields:

  1. polarity of the tweet
  2. id of the tweet
  3. date of the tweet
  4. query
  5. user that tweeted
  6. content of the tweet
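As a quick illustration, the training CSV can be read directly with pandas; the column names below are our own labels for the six fields, and the files use Latin-1 encoding:

import pandas as pd
# The six columns, in the order listed above.
cols = ['polarity', 'id', 'date', 'query', 'user', 'text']
df = pd.read_csv('training.1600000.processed.noemoticon.csv',
                 encoding='ISO-8859-1', names=cols)
print(df['polarity'].value_counts())  # the training split contains 0s and 4s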

Dataset Source

Dataset size: 305.13 MB

Loading the dataset using TensorFlow

import codecs
import csv
import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds
# Constants referenced below: the dataset's homepage and download archive.
_HOMEPAGE_URL = "http://help.sentiment140.com/home"
_DOWNLOAD_URL = "http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip"
class Sentiment140(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "polarity": tf.int32,
            "date": tfds.features.Text(),
            "query": tfds.features.Text(),
            "user": tfds.features.Text(),
            "text": tfds.features.Text(),
        }),
        supervised_keys=("text", "polarity"),
        homepage=_HOMEPAGE_URL,
    )
  def _split_generators(self, dl_manager):
    dl_paths = dl_manager.download_and_extract(_DOWNLOAD_URL)
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={
                "path":
                    os.path.join(dl_paths,
                                 "training.1600000.processed.noemoticon.csv")
            }),
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={
                "path": os.path.join(dl_paths, "testdata.manual.2009.06.14.csv")
            }),
    ]
  def _generate_examples(self, path):
    # The CSV files are Latin-1 encoded; decode them while streaming.
    with tf.io.gfile.GFile(path, "rb") as f:
      reader = csv.reader(
          codecs.iterdecode(f, "ISO-8859-1"), delimiter=",", quotechar='"')
      for i, row in enumerate(reader):
        polarity, _, date, query, user, text = row
        yield i, {
            "polarity": int(polarity),
            "date": date,
            "query": query,
            "user": user,
            "text": text,
        }
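Since this builder ships with TensorFlow Datasets under the name sentiment140, in practice the dataset can simply be loaded by name:

import tensorflow_datasets as tfds
# Loads (text, polarity) pairs from the registered sentiment140 builder.
ds = tfds.load('sentiment140', split='train', as_supervised=True)
for text, polarity in ds.take(2):
    print(polarity.numpy(), text.numpy()[:60])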

Yelp Polarity Review

The Yelp polarity review dataset is used for binary sentiment classification. It contains 560,000 Yelp reviews for training and 38,000 for testing. It was presented in 2015 by the researchers Xiang Zhang, Junbo Zhao, and Yann LeCun. The dataset was built by considering reviews with stars 1 and 2 as negative, and 3 and 4 as positive. For each polarity, 280,000 training and 19,000 testing samples were taken at random.

Dataset Source

Dataset size: 435.18 MB

State of the Art

The present state of the art on the Yelp polarity dataset is BERT-large, with an error rate of 1.89%.

Loading the dataset using TensorFlow

import csv
import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds
url = "https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz"
class YelpPolarityReviews(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("0.2.0")
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "text": tfds.features.Text(),
            "label": tfds.features.ClassLabel(names=["1", "2"]),
        }),
        supervised_keys=("text", "label"),
        homepage="https://course.fast.ai/datasets",
    )
  def _split_generators(self, dl_manager):
    arch_path = dl_manager.download_and_extract(url)
    train_file = os.path.join(
        arch_path, "yelp_review_polarity_csv", "train.csv")
    test_file = os.path.join(arch_path, "yelp_review_polarity_csv", "test.csv")
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={"filepath": train_file}),
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={"filepath": test_file}),
    ]
  def _generate_examples(self, filepath):
    # Each row is: label ("1" = negative, "2" = positive), review text.
    with tf.io.gfile.GFile(filepath) as f:
      reader = csv.reader(f, delimiter=",", quotechar='"')
      for i, row in enumerate(reader):
        label, text = row
        yield i, {"text": text, "label": label}
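Likewise, the builder is registered with TensorFlow Datasets as yelp_polarity_reviews and can be loaded directly:

import tensorflow_datasets as tfds
# Loads (text, label) pairs; labels 0/1 correspond to class names '1'/'2'.
ds = tfds.load('yelp_polarity_reviews', split='train', as_supervised=True)
for text, label in ds.take(2):
    print(label.numpy(), text.numpy()[:60])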

Conclusion

In this article, we discussed the details and implementation of some of the most benchmarked datasets used in sentiment analysis, along with code for loading them with the TensorFlow and PyTorch libraries. Of all these datasets, SST is the one most regularly used to test new language models such as BERT and ELMo, primarily as a way to demonstrate superiority on a variety of semantic tasks.


Ankit Das

A data analyst with expertise in statistical analysis and data visualization, ready to serve the industry using various analytical platforms. I look forward to gaining in-depth knowledge of machine learning and data science. Outside work, you can find me as a fun-loving person with hobbies such as sports and music.