
A Comprehensive Guide To 15 Most Important NLP Datasets


If you are just getting started with NLP, or you are a researcher deep into natural language processing, this comprehensive guide walks you through the major datasets, with starter implementations for your next NLP projects. But first, let’s discuss what NLP is, what kind of data it uses, and what the outcomes/predictions of NLP techniques look like.

NLP (natural language processing) is a subfield of AI and computer science concerned with the interactions between computers and natural human language. Simply put, it is about programming computers to process, analyze, and understand large amounts of natural language data. NLP is a significant part of many machine learning use cases, but it requires a lot of training on different kinds of datasets, since the data can be text, speech, customer reviews, ratings, and more. On this basis, there are many kinds of NLP techniques for different purposes. Let’s see some of the use cases:

  • Speech datasets for making voice assistants more human-friendly
  • Textual datasets for virtual assistants
  • Chatbots, which rely heavily on NLP techniques
  • Online translation services
  • Neural machine translation
  • Sentiment analysis of customer data
  • Hiring and recruitment
  • Advertising and market intelligence
  • Healthcare applications of NLP
  • Recommendation systems

Let’s review some of the already published articles on different NLP datasets by Analytics India Magazine, along with starter implementations:

1. Sentiment Analysis

Sentiment analysis is one of the most widely used techniques in natural language processing (NLP); it systematically identifies, extracts, quantifies, and studies affective states and subjective information. It is widely used on reviews and survey responses. Let’s see some popular datasets used for sentiment analysis:

1.1 SST dataset 

The SST dataset was collected by Stanford researchers for sentiment analysis. Some key points of this dataset:

  • The Stanford Sentiment Treebank (SST) dataset was collected from the website rottentomatoes.com
  • Researchers: Pang and Lee
  • It consists of 10,662 sentences
  • Half of the sentences are positive and the other half negative
  • Amazon Mechanical Turk was used to label the resulting 215,154 phrases
  • The SST dataset is available on Kaggle
  • The total size of this dataset is only 19 MB
  • The present state-of-the-art model on the SST dataset is T5-3B, with an accuracy of 97.4%

Loading the dataset using TensorFlow

!pip install tflite-model-maker

import numpy as np
import os
import tensorflow as tf
assert tf.__version__.startswith('2')
from tflite_model_maker import configs
from tflite_model_maker import ExportFormat
from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker import TextClassifierDataLoader
# Download and extract the SST-2 archive; get_file returns the path to the zip.
directory = tf.keras.utils.get_file(
      fname='SST-2.zip',
      origin='https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8',
      extract=True)
# The extracted folder sits next to the downloaded zip file.
directory = os.path.join(os.path.dirname(directory), 'SST-2')
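From here, a training split can be wrapped into a data loader. The snippet below is a minimal sketch following the TFLite Model Maker text-classification workflow; the column names sentence and label and the tab delimiter are assumptions based on the SST-2 TSV layout, so verify them against the extracted files.

# Sketch: wrap the extracted SST-2 training split in a data loader
# (column names assumed from the SST-2 TSV format).
spec = model_spec.get('average_word_vec')
train_data = TextClassifierDataLoader.from_csv(
      filename=os.path.join(directory, 'train.tsv'),
      text_column='sentence',
      label_column='label',
      model_spec=spec,
      delimiter='\t',
      is_training=True)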

1.2 Sentiment140 dataset

Another dataset for sentiment analysis, the Sentiment140 dataset contains 1,600,000 tweets extracted from Twitter using the Twitter API. The tweets are categorized into three classes:

  • 0: negative
  • 2: neutral
  • 4: positive

The information contained in the dataset:

  • The polarity of the tweet
  • The id of the tweet
  • The date of the tweet
  • The query
  • The user that tweeted
  • The content of the tweet
  • Dataset size: 305.13 MB

Download dataset from here

Loading the dataset using TensorFlow

import codecs
import csv
import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds
# Note: _HOMEPAGE_URL and _DOWNLOAD_URL must be defined to point to the
# Sentiment140 homepage and data archive; the _generate_examples method
# (which parses the CSV files) is omitted here for brevity.
class Sentiment140(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "polarity": tf.int32,
            "date": tfds.features.Text(),
            "query": tfds.features.Text(),
            "user": tfds.features.Text(),
            "text": tfds.features.Text(),
        }),
        supervised_keys=("text", "polarity"),
        homepage=_HOMEPAGE_URL,
    )
  def _split_generators(self, dl_manager):
    dl_paths = dl_manager.download_and_extract(_DOWNLOAD_URL)
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={
                "path":
                    os.path.join(dl_paths,
                                 "training.1600000.processed.noemoticon.csv")
            }),
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={
                "path": os.path.join(dl_paths, "testdata.manual.2009.06.14.csv")
            }),
    ]
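If you only need the data rather than a custom builder, the dataset is also registered in the TFDS catalog as sentiment140 and can be loaded directly (a quick alternative, assuming tensorflow-datasets is installed):

import tensorflow_datasets as tfds
# Load the ready-made Sentiment140 dataset from the TFDS catalog.
train_ds = tfds.load('sentiment140', split='train', as_supervised=True)
for text, polarity in train_ds.take(1):
    print(text.numpy(), polarity.numpy())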

1.3 Yelp Polarity Review Dataset

  • The Yelp polarity review dataset is used for sentiment classification.
  • It was presented in 2015 by the researchers Xiang Zhang, Junbo Zhao, and Yann LeCun.
  • The dataset contains 560,000 Yelp reviews for training and 38,000 for testing.
  • The Yelp review dataset was built by treating 1- and 2-star reviews as negative, and 3- and 4-star reviews as positive.
  • The present state of the art on the Yelp polarity dataset is BERT large.

Loading the dataset using TensorFlow

import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds
url = "https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz"
# Note: the _generate_examples method (which reads the CSV files) is omitted for brevity.
class YelpPolarityReviews(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("0.2.0")
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "text": tfds.features.Text(),
            "label": tfds.features.ClassLabel(names=["1", "2"]),
        }),
        supervised_keys=("text", "label"),
        homepage="https://course.fast.ai/datasets",
    )
  def _split_generators(self, dl_manager):
    arch_path = dl_manager.download_and_extract(url)
    train_file = os.path.join(
        arch_path, "yelp_review_polarity_csv", "train.csv")
    test_file = os.path.join(arch_path, "yelp_review_polarity_csv", "test.csv")
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={"filepath": train_file}),
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={"filepath": test_file}),
    ]
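The same dataset is also registered in the TFDS catalog as yelp_polarity_reviews, which avoids writing the builder by hand (assuming tensorflow-datasets is installed):

import tensorflow_datasets as tfds
# (text, label) pairs for the train and test splits.
train_ds, test_ds = tfds.load('yelp_polarity_reviews', split=['train', 'test'], as_supervised=True)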

Read More:

1.4 IMDB (Internet Movie DataBase) dataset


This dataset is an online collection of thousands of movie reviews for natural language processing, text analytics, and sentiment analysis. It was first published in 2011 by Stanford University and developed by the researchers Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. The dataset is divided into training and test sets, each containing 25,000 reviews.

  • Every review is scored out of 10
  • Negative reviews have a score of <= 4
  • Positive reviews have a score of >= 7
  • All neutral reviews have been excluded from the IMDB dataset
  • The total size of the dataset is 80.2 MB
  • The present best sentiment analysis model on the IMDB dataset is NB-weighted-BON + dv-cosine, with an accuracy of 97.4%

Download Dataset from here

Loading the IMDB dataset manually

import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
# Download and extract the raw IMDB archive, then inspect one negative review.
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!cat aclImdb/train/neg/5003_4.txt
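Once the archive is extracted, the reviews can also be read straight into a tf.data pipeline. This is a minimal sketch using tf.keras.preprocessing.text_dataset_from_directory (available from TF 2.3 onward); it infers the pos/neg labels from the folder names, so the unlabeled unsup folder is removed first.

# Remove the unlabeled reviews, then build labeled train/test datasets.
!rm -r aclImdb/train/unsup
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=32)
test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test', batch_size=32)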

See full implementation here

Loading the dataset using Keras

It is straightforward: Keras ships with several prebuilt datasets, and IMDB is one of them.

from keras.datasets import imdb
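A minimal loading sketch: each review arrives already encoded as a sequence of word indices, and num_words caps the vocabulary size.

from keras.datasets import imdb
# x_* are lists of word-index sequences, y_* are 0/1 sentiment labels.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
print(len(x_train), 'training reviews,', len(x_test), 'test reviews')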

Read more: 

2. Language Modelling

Language modelling powers major NLP applications such as Google Assistant, Alexa, and Apple’s Siri. In language modelling we look through language data and build a model that can answer questions or generate text based on what it learned from the dataset. Here are some of the datasets used in language modelling:

2.1 WikiText-103 dataset

This dataset was created by Salesforce. It contains more than 100 million tokens retrieved from featured articles on Wikipedia. The dataset comprises 28,475 articles and retains long-range dependencies across its 103 million tokens. It has a vocabulary size of 267,735 after replacing all tokens that appear no more than twice with an unknown token. For an embedding size of 400, the embedding layer alone has 267K x 400 ≈ 107 million parameters.

Loading the WikiText-103 Dataset using Tensorflow

Using wget, let’s first manually download the dataset and import the required libraries.

import tensorflow as tf
# Download and unzip the raw WikiText-103 archive.
!wget --quiet https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
!unzip wikitext-103-raw-v1.zip

def wiki103(path):
  data2 = tf.data.TextLineDataset(path)
  # Drop the ' = Heading = ' marker lines so only article text remains.
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data2 = data2.filter(content_filter)
  # Split each line on ' . ' and flatten so every element is one sentence.
  data2 = data2.map(lambda x: tf.strings.split(x, ' . '))
  data2 = data2.unbatch()
  return data2

train = wiki103('/content/wikitext-103-raw/wiki.train.raw')
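To sanity-check the pipeline, you can peek at a few elements; after the split and unbatch steps each element is a single sentence string.

# Print the first few sentences from the training pipeline.
for sentence in train.take(3):
    print(sentence.numpy().decode('utf-8'))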

Loading WikiText-103 dataset using PyTorch

from torchtext import data
import io

class LanguageModelingDataset(data.Dataset):
    def __init__(self, path, text_field, newline_eos=True,
                 encoding='utf-8', **kwargs):
        fields = [('text', text_field)]
        text = []
        # Read the raw file, tokenize each line, and append an <eos> marker.
        with io.open(path, encoding=encoding) as f:
            for line in f:
                text += text_field.preprocess(line)
                if newline_eos:
                    text.append(u'<eos>')
        examples = [data.Example.fromlist([text], fields)]
        super(LanguageModelingDataset, self).__init__(
            examples, fields, **kwargs)

class WikiText103(LanguageModelingDataset):
    urls = ['https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip']
    name = 'wikitext-103'
    dirname = 'wikitext-103'

    @classmethod
    def splits(cls, text_field, root='.data', train='wiki.train.tokens',
               validation='wiki.valid.tokens', test='wiki.test.tokens',
               **kwargs):
        return super(WikiText103, cls).splits(
            root=root, train=train, validation=validation, test=test,
            text_field=text_field, **kwargs)

    @classmethod
    def iters(cls, batch_size=32, bptt_len=35, device=0, root='.data',
              vectors=None, **kwargs):
        text2 = data.Field()
        train, val, test = cls.splits(text2, root=root, **kwargs)
        text2.build_vocab(train, vectors=vectors)
        return data.BPTTIterator.splits(
            (train, val, test), batch_size=batch_size, bptt_len=bptt_len,
            device=device)
  • root – directory where the dataset’s zip file will be stored.
  • batch_size – number of training examples passed in one iteration.
  • bptt_len – length of the sequence used for backpropagation through time.
  • device – use -1 for the CPU and None for the currently active GPU device.
  • text_field – the field that will be used for the text data.
  • train – training dataset
  • validation – validation dataset
  • test – testing dataset
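A minimal usage sketch with the class defined above (legacy torchtext API): iters downloads the archive, builds the vocabulary, and returns BPTT iterators.

# -1 selects the CPU in the legacy torchtext API (newer versions accept 'cpu').
train_iter, valid_iter, test_iter = WikiText103.iters(batch_size=32, bptt_len=35, device=-1)
batch = next(iter(train_iter))
print(batch.text.shape, batch.target.shape)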

2.2 WikiText-2

This dataset is a smaller version of the WikiText-103 dataset discussed above, with a vocabulary size of 33,278 and about 2 million tokens. The present state-of-the-art model on the WikiText-2 dataset is GPT-2, which achieved a test perplexity of 18.34 with 1,542 million parameters.

Loading the WikiText-2 dataset using Tensorflow

!wget --quiet https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
!unzip wikitext-2-raw-v1.zip
def wiki2(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train= wiki2('/content/wikitext-2-raw/wiki.train.raw')

Read more: https://analyticsindiamag.com/datasets-for-language-modelling-in-nlp-using-tensorflow-and-pytorch/

3. Machine translation

Machine Translation (MT) is the task of automatically converting one natural language into another, preserving the meaning of the input text, and producing fluent text in the output language.

Here are some of the datasets used in machine translation:

3.1 Multi-30k dataset

Multi-30K is a large dataset of images paired with sentences in English and German; it is a step toward studying the value of multilingual-multimodal data. The dataset was developed in 2016 by the researchers Desmond Elliott, Stella Frank, and Khalil Sima’an.

  • Multi-30K is an extension of the Flickr30k dataset.
  • It contains 31,014 German translations of English descriptions,
  • as well as 155,070 independently collected German descriptions.
  • The translations were gathered from professionally contracted translators.
  • Their research paper is published here
Figure: multilingual examples from the Multi30K dataset

The figure above shows multilingual examples from the Multi30K dataset. The independent sentences are all accurate descriptions of the image, but they do not contain the same details in both languages, such as the shirt colour or the scaffolding. In the second translation pair (bottom left), the translator has rendered “glide” as “schweben” (“to float”), probably because they did not see the image context.

Load the Multi-30k dataset using TensorFlow

import tensorflow as tf

# Note: TextLineDataset expects local text-file paths, so the training archive
# should first be downloaded and extracted before pointing Multi30K() at the
# resulting plain-text files.
def Multi30K(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train= Multi30K('http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz')

3.2 IWSLT Dataset

The IWSLT 14 dataset contains about 160K sentence pairs, comprising English-German (En-De) and German-English (De-En) descriptions. The IWSLT 13 dataset has about 200K training sentence pairs; its English-French and French-English pairs are used for translation tasks. The IWSLT dataset was developed in 2013 by the researchers Zoltán Tüske, M. Ali Basha Shaik, and Simon Wiesler.

The present state of the art on the IWSLT dataset is MAT+Knee, with a BLEU score of 36.6.

Loading using Tensorflow

import tensorflow as tf
def IWSLT_data(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train = IWSLT_data('https://wit3.fbk.eu/archive/2016-01//texts/{}/{}/{}.tgz')  # fill in the {} placeholders before downloading

3.3 WMT14 dataset

  • WMT14 is a machine translation dataset.
  • The WMT14 dataset was developed in 2014 by the researchers Nicolas Pecheux, Li Gong, and Thomas Lavergne.
  • WMT14 contains English-German (En-De) and English-French (En-Fr) pairs for machine translation.
  • The training sets contain about 4.5M and 35M sentence pairs respectively.
  • The sentences are encoded with byte-pair encoding (BPE) using roughly 32K merge operations.
  • The present state-of-the-art system on the WMT14 dataset is noisy back-translation, with a BLEU score of 35.

Loading the WMT14 dataset Using Tensorflow

import tensorflow as tf
def WMT14_data(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train = WMT14_data('https://drive.google.com/uc?export=download&')

Read more about machine translation datasets:

4. Sequence tagging

Sequence tagging is a type of pattern recognition task that involves algorithmically assigning a categorical tag to each member of a sequence of observed values. It covers various sequence labelling tasks: part-of-speech (POS) tagging, named entity recognition (NER), and chunking.

Here are some of the datasets that are used in Sequence tagging:

4.1 CoNLL 2012 

CoNLL has had many versions over the years. The first, CoNLL 2000, was introduced in the year 2000 by the researchers Tjong Kim Sang and Buchholz. The data consists of the same partitions of the Wall Street Journal corpus (WSJ) that are generally used for noun-phrase chunking: the CoNLL 2000 dataset uses 211,727 tokens for training and 47,377 tokens for testing. Subsequent shared tasks introduced further CoNLL datasets:

  1. CoNLL 2000
  2. CoNLL 2003
  3. CoNLL 2012

CoNLL datasets are used for sequence tagging (a type of pattern recognition task that assigns a categorical tag to each member of a sequence of observed values).

  • The CoNLL 2012 dataset was created for a shared task on multilingual unrestricted coreference resolution. It is larger than the previous NER-based CoNLL datasets.
  • The current state of the art on the CoNLL 2012 dataset is CorefQA + SpanBERT-large, with an F1 score of 83.1.

Loading the CoNLL dataset using Tensorflow

import tensorflow as tf
def colnll(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train = colnll('https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz')
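Because TextLineDataset expects local text files, an alternative is to download and parse the CoNLL 2000 file directly. The sketch below assumes the standard CoNLL 2000 chunking format: one token per line with the word, POS tag, and chunk tag separated by spaces, and blank lines between sentences.

import gzip
import urllib.request

# Fetch and decompress the CoNLL 2000 training file.
urllib.request.urlretrieve(
    'https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz', 'train.txt.gz')

sentences, current = [], []
with gzip.open('train.txt.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:                 # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
        else:
            word, pos, chunk = line.split()
            current.append((word, pos, chunk))
print(len(sentences), 'training sentences')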

Read more: 

4.2 UDPOS dataset

  • UDPOS is a parsed text corpus dataset that annotates syntactic and semantic sentence structure.
  • The dataset follows the original format of the Universal Dependencies English Treebank (an open community effort with more than 300 contributors producing over 150 treebanks in 90 languages).
  • The current state-of-the-art framework on the Universal Dependencies dataset is UDPipe 2.0 + mBERT + FLAIR, with a LAS score of 84.60.

Loading the dataset using Tensorflow

import tensorflow as tf
def UDPOS(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train = UDPOS('https://bitbucket.org/sivareddyg/public/downloads/en-ud-v2.zip')

4.3 Tiger Corpus dataset

  • The TIGER Corpus is a large collection of German newspaper text.
  • It was developed in 2002 by the researcher Brants.
  • Its part-of-speech annotations are used to set up a German POS-tagging task.
  • The first 40,472 sentences in the original order are used for training, the next 5,000 for validation, and the remaining 5,000 for testing.

Loading the dataset using Tensorflow

import tensorflow as tf
def tigcor(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train = tigcor('https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/Brants-CLAUS98.ps.gz')

Read more: 

5. Question Answering Datasets in NLP

5.1 SQuAD dataset

“Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.”

  • It comprises 100,000+ questions posed by crowdworkers on Wikipedia articles.
  • The answer to each question is a segment of text from the corresponding reading passage.
  • The dataset was presented by the researchers Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang from Stanford University.
  • The current state-of-the-art framework on the SQuAD dataset is SA-Net on Albert, with an F1 score of 93.011.
  • Learn more here

Loading the dataset using TensorFlow

import tensorflow as tf
def squad_data(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train= squad_data('https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json')
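Because SQuAD ships as a nested JSON file rather than plain lines, a more direct way to inspect it is Python’s json module; the field names below (data, paragraphs, qas, question, answers) follow the official SQuAD format.

import json
import urllib.request

url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json'
with urllib.request.urlopen(url) as f:
    squad = json.load(f)

# Flatten the nested structure into (context, question, answers) triples.
examples = []
for article in squad['data']:
    for paragraph in article['paragraphs']:
        context = paragraph['context']
        for qa in paragraph['qas']:
            answers = [a['text'] for a in qa['answers']]
            examples.append((context, qa['question'], answers))
print(len(examples), 'question-answer examples')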

5.2 bAbI dataset

The bAbI question answering dataset is a benchmark described as “task generation for testing text understanding and reasoning”. It is composed of a set of contexts, with multiple question-answer pairs available for each context.

  • It contains both English and Hindi data.
  • The “ContentElements” field contains training data and testing data.
  • The current state-of-the-art model on the bAbI dataset is STM (Self-Attentive Associative Memory), with an accuracy of 99.85%.
  • The examples shown are retrieved from the English 10,000-example (en-10k) variant. bAbI was introduced by Facebook AI Research.
  • The officially published bAbI paper is Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks.
  • Official GitHub Repository

Loading the dataset using Keras

import re
import tarfile
import numpy as np
from functools import reduce
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences
try:
    path_new = get_file('babi-tasks-v1-2.tar.gz',
                        origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
except Exception:
    print('Error downloading dataset, please download it manually:\n'
          '$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
          '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
    raise
readfile = tarfile.open(path_new)
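Once the archive is open, individual task files can be pulled out by member name. The path below follows the layout of the official archive (tasks_1-20_v1-2/en/...); adjust it for the en-10k variant or for other tasks.

# Read one bAbI task file (task 1, single supporting fact) from the tarball.
member = 'tasks_1-20_v1-2/en/qa1_single-supporting-fact_train.txt'
lines = readfile.extractfile(member).read().decode('utf-8').splitlines()
print(lines[:3])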

6. Question Classification Datasets

Question classification is a significant part of question answering systems; one of the most important steps is identifying the type of question being asked. Early systems used Naive Bayes, k-nearest neighbour, and SVM algorithms, but now that neural networks are taking a big leap, CNN models are widely used for this NLP task.

6.1 Trec dataset

The Text REtrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology and the U.S. Department of Defense, was introduced in 1992.

  • The TREC (Text REtrieval Conference) dataset is used for question classification.
  • It consists of open-domain, fact-based questions partitioned into broad semantic classes.
  • TREC has both a six-class (TREC-6) and a fifty-class (TREC-50) version.
  • TREC-6 and TREC-50 both have 5,452 training examples and 500 test examples.
  • The current state-of-the-art model trained on the TREC-6 dataset is USE_T+CNN.

Loading the dataset using TensorFlow

import tensorflow as tf
import tensorflow_datasets.public_api as tfds
_URLs = {
    "train": "http://cogcomp.org/Data/QA/QC/train_5500.label",
    "test": "http://cogcomp.org/Data/QA/QC/TREC_10.label",
}
coarse_label = ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"]
fine_label = [ "manner","cremat","animal","exp","ind","title","date","reason","event","state","desc","count"]
# Note: _DESCRIPTION and _CITATION must be defined as strings, and fine_label
# above is only a subset of the full TREC-50 label set; the _split_generators
# and _generate_examples methods are omitted here for brevity.
class Trec(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            "label-coarse": tfds.features.ClassLabel(names=coarse_label),
            "label-fine": tfds.features.ClassLabel(names=fine_label),
            "text": tfds.features.Text(),
        }),
        homepage="https://cogcomp.seas.upenn.edu/Data/QA/QC/",
        citation=_CITATION,
    )
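The dataset is also registered in the TFDS catalog as trec, so the builder above is not strictly required (assuming tensorflow-datasets is installed):

import tensorflow_datasets as tfds
train_ds, test_ds = tfds.load('trec', split=['train', 'test'])
for example in train_ds.take(1):
    print(example['text'].numpy(), example['label-coarse'].numpy())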

6.2 UMICH SI650 dataset

  • The University of Michigan (UMICH) SI650 (Information Retrieval) dataset is a sentiment analysis dataset whose task is to detect negative and positive sentiment in sentences.
  • The data was originally collected from opinmind.com (which is no longer active).
  • The training data contains 7,086 sentences, each already labelled with 1 (positive sentiment) or 0 (negative sentiment).
  • The test data contains 33,052 unlabelled sentences, one per line.

Loading the UMICH dataset using TensorFlow

import tensorflow as tf
def UMICH(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
# Note: this Kaggle URL points to the competition page, not a raw data file;
# download the data locally first and point UMICH() at the extracted text file.
train = UMICH('https://www.kaggle.com/c/si650winter11/data')
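Since the Kaggle link is a competition page rather than a raw file, the data has to be downloaded first (for example with the Kaggle CLI). Assuming the training file is the tab-separated training.txt from the SI650 competition, with a 0/1 label followed by the sentence on each line (verify the format after downloading), it can be read with pandas:

import pandas as pd
# Hypothetical local path after downloading the competition data from Kaggle.
train_df = pd.read_csv('training.txt', sep='\t', header=None, names=['label', 'sentence'])
print(train_df['label'].value_counts())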

6.3 ARC dataset

Can a computer learn complex, abstract tasks from just a few examples?


The Abstraction and Reasoning Corpus (ARC) provides a benchmark to measure AI skills on unknown tasks, with the constraint that only a handful of demonstrations are shown to learn a complex task. This dataset also provides a glimpse of a future where AI could quickly learn to solve new problems on its own.

The dataset description is shown below:

  • training: contains the task files for training (400 tasks). 
  • evaluation: contains the task files for evaluation (400 tasks).

The tasks are stored in JSON format. Each task JSON file contains two fields:

  • "train": demonstration input/output pairs (3 pairs).
  • "test": test input/output pairs (1 pair).
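A task file can therefore be read with a few lines of Python. The sketch below assumes a local copy of one training task (the filename is only an example); each pair holds an input grid and an output grid stored as nested lists of integers.

import json
# Hypothetical path to one downloaded ARC training task file.
with open('training/0a938d79.json') as f:
    task = json.load(f)
for pair in task['train']:
    print('input grid:', len(pair['input']), 'x', len(pair['input'][0]))
print('test pairs:', len(task['test']))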

There are a few more things to know about the ARC dataset:

  • It was developed by Dongfang Xu, Peter Jansen, and Jaycie Martin.
  • It comprises 7,787 science exam questions.
  • It has around 400 detailed question classes.
  • Download the dataset from here

Loading the dataset using TensorFlow

import tensorflow as tf
def ARC_data(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train= ARC_data('https://www.kaggle.com/jeromeblanchet/arc-ai2-reasoning-challenge/ARC-Challenge-Dev.csv')

Conclusion

Sentiment Analysis: IMDB, SST, Sentiment140, Yelp Polarity Review
Language Modelling: WikiText-103, WikiText-2
Question Answering: SQuAD, bAbI
Question Classification: TREC, UMICH SI650, ARC
Sequence Tagging: CoNLL 2012, UDPOS, TIGER Corpus
Machine Translation: Multi-30k, IWSLT, WMT14

We have covered almost all the major natural language processing datasets that are used extensively, from machine translation to sentiment analysis, and we have seen how to import each dataset into your coding environment to get started. For more articles on NLP (natural language processing) datasets, visit here.
