
A Comprehensive Guide To 15 Most Important NLP Datasets


If you are just getting started with NLP, or you are a researcher deep into natural language processing, this comprehensive guide walks you through the major datasets, with starter implementations for your next NLP projects. But first, let’s discuss what NLP is, what kind of data it uses, and what the outcomes/predictions of NLP techniques look like.

NLP (natural language processing) is a subfield of AI and computer science concerned with the interactions between computers and natural human language. Simply put, it is about programming computers to process, analyze, and understand large amounts of natural language data. NLP is a significant part of many machine learning use cases, but it requires a lot of training on different kinds of datasets, since the data can be text, speech, customer reviews, ratings, and more. On this basis, there are many kinds of NLP techniques for different purposes. Let’s see some of the use cases:

  • Speech datasets for making voice assistants more human-friendly
  • Textual datasets for virtual assistants
  • Chatbots, which rely heavily on NLP techniques
  • Online translation services
  • Neural machine translation
  • Sentiment analysis of customer data
  • Hiring and recruitment
  • Advertising and market intelligence
  • Healthcare applications of NLP
  • Recommendation systems

Let’s review some of the already published articles on different NLP datasets by Analytics India Magazine, along with starter implementations:

1. Sentiment Analysis

Sentiment analysis is one of the most widely used techniques in natural language processing (NLP); it systematically identifies, extracts, quantifies, and studies affective states and subjective information. It is widely used on reviews and survey responses. Let’s see some popular datasets used for sentiment analysis:

1.1 SST dataset 

The SST dataset was collected by Stanford researchers for sentiment analysis. Some key points of this dataset:

  • The Stanford Sentiment Treebank (SST) dataset was collected from the website rottentomatoes.com
  • Researchers: Pang and Lee
  • It consists of 10,662 sentences
  • Half of the sentences are positive and the other half negative
  • Amazon Mechanical Turk was used to label the resulting 215,154 phrases
  • The SST dataset is available on Kaggle
  • The total size of this dataset is only 19 MB
  • The present state-of-the-art model on the SST dataset is T5-3B, with an accuracy of 97.4%

Loading the dataset using TensorFlow

!pip install tflite-model-maker

import numpy as np
import os
import tensorflow as tf
assert tf.__version__.startswith('2')
from tflite_model_maker import configs
from tflite_model_maker import ExportFormat
from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker import TextClassifierDataLoader
# Download and extract the SST-2 archive; get_file returns the path to the zip.
directory = tf.keras.utils.get_file(
      fname='SST-2.zip',
      origin='https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8',
      extract=True)
# The extracted folder sits next to the downloaded zip file.
directory = os.path.join(os.path.dirname(directory), 'SST-2')
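From here, a training split can be wrapped into a data loader. The snippet below is a minimal sketch following the TFLite Model Maker text-classification workflow; the column names sentence and label and the tab delimiter are assumptions based on the SST-2 TSV layout, so verify them against the extracted files.

# Sketch: wrap the extracted SST-2 training split in a data loader
# (column names assumed from the SST-2 TSV format).
spec = model_spec.get('average_word_vec')
train_data = TextClassifierDataLoader.from_csv(
      filename=os.path.join(directory, 'train.tsv'),
      text_column='sentence',
      label_column='label',
      model_spec=spec,
      delimiter='\t',
      is_training=True)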

1.2 Sentiment140 dataset

Another dataset for sentiment analysis, the Sentiment140 dataset contains 1,600,000 tweets extracted from Twitter using the Twitter API. The tweets are categorized into three classes:

  • 0: negative
  • 2: neutral
  • 4: positive

The information contained in the dataset:

  • The polarity of the tweet
  • The id of the tweet
  • The date of the tweet
  • The query
  • The user that tweeted
  • The content of the tweet
  • Dataset size: 305.13 MB

Download dataset from here

Loading the dataset using TensorFlow

import codecs
import csv
import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds
# Note: _HOMEPAGE_URL and _DOWNLOAD_URL must be defined to point to the
# Sentiment140 homepage and data archive; the _generate_examples method
# (which parses the CSV files) is omitted here for brevity.
class Sentiment140(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "polarity": tf.int32,
            "date": tfds.features.Text(),
            "query": tfds.features.Text(),
            "user": tfds.features.Text(),
            "text": tfds.features.Text(),
        }),
        supervised_keys=("text", "polarity"),
        homepage=_HOMEPAGE_URL,
    )
  def _split_generators(self, dl_manager):
    dl_paths = dl_manager.download_and_extract(_DOWNLOAD_URL)
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={
                "path":
                    os.path.join(dl_paths,
                                 "training.1600000.processed.noemoticon.csv")
            }),
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={
                "path": os.path.join(dl_paths, "testdata.manual.2009.06.14.csv")
            }),
    ]
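If you only need the data rather than a custom builder, the dataset is also registered in the TFDS catalog as sentiment140 and can be loaded directly (a quick alternative, assuming tensorflow-datasets is installed):

import tensorflow_datasets as tfds
# Load the ready-made Sentiment140 dataset from the TFDS catalog.
train_ds = tfds.load('sentiment140', split='train', as_supervised=True)
for text, polarity in train_ds.take(1):
    print(text.numpy(), polarity.numpy())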

1.3 Yelp Polarity Review Dataset

  • The Yelp polarity review dataset is used for sentiment classification.
  • It was presented in 2015 by the researchers Xiang Zhang, Junbo Zhao, and Yann LeCun.
  • The dataset contains 560,000 Yelp reviews for training and 38,000 for testing.
  • The Yelp review dataset was built by treating 1- and 2-star reviews as negative, and 3- and 4-star reviews as positive.
  • The present state of the art on the Yelp polarity dataset is BERT large.

Loading the dataset using TensorFlow

import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds
url = "https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz"
# Note: the _generate_examples method (which reads the CSV files) is omitted for brevity.
class YelpPolarityReviews(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("0.2.0")
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "text": tfds.features.Text(),
            "label": tfds.features.ClassLabel(names=["1", "2"]),
        }),
        supervised_keys=("text", "label"),
        homepage="https://course.fast.ai/datasets",
    )
  def _split_generators(self, dl_manager):
    arch_path = dl_manager.download_and_extract(url)
    train_file = os.path.join(
        arch_path, "yelp_review_polarity_csv", "train.csv")
    test_file = os.path.join(arch_path, "yelp_review_polarity_csv", "test.csv")
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={"filepath": train_file}),
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={"filepath": test_file}),
    ]
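The same dataset is also registered in the TFDS catalog as yelp_polarity_reviews, which avoids writing the builder by hand (assuming tensorflow-datasets is installed):

import tensorflow_datasets as tfds
# (text, label) pairs for the train and test splits.
train_ds, test_ds = tfds.load('yelp_polarity_reviews', split=['train', 'test'], as_supervised=True)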

Read More:

1.4 IMDB (Internet Movie DataBase) dataset


This dataset is an online collection of thousands of movie reviews for natural language processing, text analytics, and sentiment analysis. It was first published in 2011 by Stanford University and developed by the researchers Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. The dataset is divided into training and test sets, each containing 25,000 reviews.

  • Every review is scored out of 10
  • Negative reviews have a score of <= 4
  • Positive reviews have a score of >= 7
  • All neutral reviews have been excluded from the IMDB dataset
  • The total size of the dataset is 80.2 MB
  • The present best sentiment analysis model on the IMDB dataset is NB-weighted-BON + dv-cosine, with an accuracy of 97.4%

Download Dataset from here

Loading the IMDB dataset manually

import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
# Download and extract the raw IMDB archive, then inspect one negative review.
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!cat aclImdb/train/neg/5003_4.txt
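Once the archive is extracted, the reviews can also be read straight into a tf.data pipeline. This is a minimal sketch using tf.keras.preprocessing.text_dataset_from_directory (available from TF 2.3 onward); it infers the pos/neg labels from the folder names, so the unlabeled unsup folder is removed first.

# Remove the unlabeled reviews, then build labeled train/test datasets.
!rm -r aclImdb/train/unsup
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=32)
test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test', batch_size=32)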

See full implementation here

Loading the dataset using Keras

It is straightforward: Keras ships with several prebuilt datasets, and IMDB is one of them.

from keras.datasets import imdb
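A minimal loading sketch: each review arrives already encoded as a sequence of word indices, and num_words caps the vocabulary size.

from keras.datasets import imdb
# x_* are lists of word-index sequences, y_* are 0/1 sentiment labels.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
print(len(x_train), 'training reviews,', len(x_test), 'test reviews')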

Read more: 

2. Language Modelling

Language modelling powers major NLP applications such as Google Assistant, Alexa, and Apple’s Siri. In language modelling we look through language data and build a model that can answer questions or generate text based on what it learned from the dataset. Here are some of the datasets used in language modelling:

2.1 WikiText-103 dataset

This dataset was created by Salesforce. It contains more than 100 million tokens retrieved from featured articles on Wikipedia. The dataset comprises 28,475 articles and retains long-range dependencies across its 103 million tokens. It has a vocabulary size of 267,735 after replacing all tokens that appear no more than twice with an unknown token. For an embedding size of 400, the embedding layer alone has 267K x 400 ≈ 107 million parameters.

Loading the WikiText-103 Dataset using Tensorflow

Using wget, let’s first manually download the dataset and import the required libraries.

import tensorflow as tf
# Download and unzip the raw WikiText-103 archive.
!wget --quiet https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
!unzip wikitext-103-raw-v1.zip

def wiki103(path):
  data2 = tf.data.TextLineDataset(path)
  # Drop the ' = Heading = ' marker lines so only article text remains.
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data2 = data2.filter(content_filter)
  # Split each line on ' . ' and flatten so every element is one sentence.
  data2 = data2.map(lambda x: tf.strings.split(x, ' . '))
  data2 = data2.unbatch()
  return data2

train = wiki103('/content/wikitext-103-raw/wiki.train.raw')
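To sanity-check the pipeline, you can peek at a few elements; after the split and unbatch steps each element is a single sentence string.

# Print the first few sentences from the training pipeline.
for sentence in train.take(3):
    print(sentence.numpy().decode('utf-8'))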

Loading WikiText-103 dataset using PyTorch

from torchtext import data
import io

class LanguageModelingDataset(data.Dataset):
    def __init__(self, path, text_field, newline_eos=True,
                 encoding='utf-8', **kwargs):
        fields = [('text', text_field)]
        text = []
        # Read the raw file, tokenize each line, and append an <eos> marker.
        with io.open(path, encoding=encoding) as f:
            for line in f:
                text += text_field.preprocess(line)
                if newline_eos:
                    text.append(u'<eos>')
        examples = [data.Example.fromlist([text], fields)]
        super(LanguageModelingDataset, self).__init__(
            examples, fields, **kwargs)

class WikiText103(LanguageModelingDataset):
    urls = ['https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip']
    name = 'wikitext-103'
    dirname = 'wikitext-103'

    @classmethod
    def splits(cls, text_field, root='.data', train='wiki.train.tokens',
               validation='wiki.valid.tokens', test='wiki.test.tokens',
               **kwargs):
        return super(WikiText103, cls).splits(
            root=root, train=train, validation=validation, test=test,
            text_field=text_field, **kwargs)

    @classmethod
    def iters(cls, batch_size=32, bptt_len=35, device=0, root='.data',
              vectors=None, **kwargs):
        text2 = data.Field()
        train, val, test = cls.splits(text2, root=root, **kwargs)
        text2.build_vocab(train, vectors=vectors)
        return data.BPTTIterator.splits(
            (train, val, test), batch_size=batch_size, bptt_len=bptt_len,
            device=device)
  • root – directory where the dataset’s zip file will be stored.
  • batch_size – number of training examples passed in one iteration.
  • bptt_len – length of the sequence used for backpropagation through time.
  • device – use -1 for the CPU and None for the currently active GPU device.
  • text_field – the field that will be used for the text data.
  • train – training dataset
  • validation – validation dataset
  • test – testing dataset
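A minimal usage sketch with the class defined above (legacy torchtext API): iters downloads the archive, builds the vocabulary, and returns BPTT iterators.

# -1 selects the CPU in the legacy torchtext API (newer versions accept 'cpu').
train_iter, valid_iter, test_iter = WikiText103.iters(batch_size=32, bptt_len=35, device=-1)
batch = next(iter(train_iter))
print(batch.text.shape, batch.target.shape)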

2.2 WikiText-2

This dataset is a smaller version of the WikiText-103 dataset discussed above, with a vocabulary size of 33,278 and about 2 million tokens. The present state-of-the-art model on the WikiText-2 dataset is GPT-2, which achieved a test perplexity of 18.34 with 1,542 million parameters.

Loading the WikiText-2 dataset using Tensorflow

!wget --quiet https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
!unzip wikitext-2-raw-v1.zip
def wiki2(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train= wiki2('/content/wikitext-2-raw/wiki.train.raw')

Read more: https://analyticsindiamag.com/datasets-for-language-modelling-in-nlp-using-tensorflow-and-pytorch/

3. Machine translation

Machine Translation (MT) is the task of automatically converting one natural language into another, preserving the meaning of the input text, and producing fluent text in the output language.

Here are some of the datasets used in machine translation:

3.1 Multi-30k dataset

Multi-30K is a large dataset of images paired with sentences in English and German; it is a step toward studying the value of multilingual-multimodal data. The dataset was developed in 2016 by the researchers Desmond Elliott, Stella Frank, and Khalil Sima’an.

  • Multi-30K is an extension of the Flickr30k dataset.
  • It contains 31,014 German translations of English descriptions,
  • as well as 155,070 independently collected German descriptions.
  • The translations were gathered from professionally contracted translators.
  • Their research paper is published here
Figure: multilingual examples from the Multi30K dataset

The figure above shows multilingual examples from the Multi30K dataset. The independent sentences are all accurate descriptions of the image, but they do not contain the same details in both languages, such as the shirt colour or the scaffolding. In the second translation pair (bottom left), the translator has rendered “glide” as “schweben” (“to float”), probably because they did not see the image context.

Load the Multi-30k dataset using TensorFlow

import tensorflow as tf

# Note: TextLineDataset expects local text-file paths, so the training archive
# should first be downloaded and extracted before pointing Multi30K() at the
# resulting plain-text files.
def Multi30K(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train= Multi30K('http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz')

3.2 IWSLT Dataset

The IWSLT 14 dataset contains about 160K sentence pairs, comprising English-German (En-De) and German-English (De-En) descriptions. The IWSLT 13 dataset has about 200K training sentence pairs; its English-French and French-English pairs are used for translation tasks. The IWSLT dataset was developed in 2013 by the researchers Zoltán Tüske, M. Ali Basha Shaik, and Simon Wiesler.

The present state of the art on the IWSLT dataset is MAT+Knee, with a BLEU score of 36.6.

Loading using Tensorflow

import tensorflow as tf
def IWSLT_data(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train = IWSLT_data('https://wit3.fbk.eu/archive/2016-01//texts/{}/{}/{}.tgz')  # fill in the {} placeholders before downloading

3.3 WMT14 dataset

  • WMT14 is a machine translation dataset.
  • The WMT14 dataset was developed in 2014 by the researchers Nicolas Pecheux, Li Gong, and Thomas Lavergne.
  • WMT14 contains English-German (En-De) and English-French (En-Fr) pairs for machine translation.
  • The training sets contain about 4.5M and 35M sentence pairs respectively.
  • The sentences are encoded with byte-pair encoding (BPE) using roughly 32K merge operations.
  • The present state-of-the-art system on the WMT14 dataset is noisy back-translation, with a BLEU score of 35.

Loading the WMT14 dataset Using Tensorflow

import tensorflow as tf
def WMT14_data(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train = WMT14_data('https://drive.google.com/uc?export=download&')

Read more about machine translation datasets:

4. Sequence tagging

Sequence tagging is a type of pattern recognition task that involves algorithmically assigning a categorical tag to each member of a sequence of observed values. It covers various sequence labelling tasks: part-of-speech (POS) tagging, named entity recognition (NER), and chunking.

Here are some of the datasets that are used in Sequence tagging:

4.1 CoNLL 2012 

CoNLL has had many versions over the years. The first, CoNLL 2000, was introduced in the year 2000 by the researchers Tjong Kim Sang and Buchholz. The data consists of the same partitions of the Wall Street Journal corpus (WSJ) that are generally used for noun-phrase chunking: the CoNLL 2000 dataset uses 211,727 tokens for training and 47,377 tokens for testing. Subsequent shared tasks introduced further CoNLL datasets:

  1. CoNLL 2000
  2. CoNLL 2003
  3. CoNLL 2012

CoNLL datasets are used for sequence tagging (a type of pattern recognition task that assigns a categorical tag to each member of a sequence of observed values).

  • The CoNLL 2012 dataset was created for a shared task on multilingual unrestricted coreference resolution. It is larger than the previous NER-based CoNLL datasets.
  • The current state of the art on the CoNLL 2012 dataset is CorefQA + SpanBERT-large, with an F1 score of 83.1.

Loading the CoNLL dataset using Tensorflow

import tensorflow as tf
def colnll(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train = colnll('https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz')
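Because TextLineDataset expects local text files, an alternative is to download and parse the CoNLL 2000 file directly. The sketch below assumes the standard CoNLL 2000 chunking format: one token per line with the word, POS tag, and chunk tag separated by spaces, and blank lines between sentences.

import gzip
import urllib.request

# Fetch and decompress the CoNLL 2000 training file.
urllib.request.urlretrieve(
    'https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz', 'train.txt.gz')

sentences, current = [], []
with gzip.open('train.txt.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:                 # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
        else:
            word, pos, chunk = line.split()
            current.append((word, pos, chunk))
print(len(sentences), 'training sentences')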

Read more: 

4.2 UDPOS dataset

  • UDPOS is a parsed text corpus dataset that annotates syntactic and semantic sentence structure.
  • The dataset follows the original format of the Universal Dependencies English Treebank (an open community effort with more than 300 contributors producing over 150 treebanks in 90 languages).
  • The current state-of-the-art framework on the Universal Dependencies dataset is UDPipe 2.0 + mBERT + FLAIR, with a LAS score of 84.60.

Loading the dataset using Tensorflow

import tensorflow as tf
def UDPOS(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train = UDPOS('https://bitbucket.org/sivareddyg/public/downloads/en-ud-v2.zip')

4.3 Tiger Corpus dataset

  • The TIGER Corpus is a large collection of German newspaper text.
  • It was developed in 2002 by the researcher Brants.
  • Its part-of-speech annotations are used to set up a German POS-tagging task.
  • The first 40,472 sentences in the original order are used for training, the next 5,000 for validation, and the remaining 5,000 for testing.

Loading the dataset using Tensorflow

import tensorflow as tf
def tigcor(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train = tigcor('https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/Brants-CLAUS98.ps.gz')

Read more: 

5. Question Answering Datasets in NLP

5.1 SQuAD dataset

“Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.”

  • It comprises 100,000+ questions posed by crowdworkers on Wikipedia articles.
  • The answer to each question is a segment of text from the corresponding reading passage.
  • The dataset was presented by the researchers Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang from Stanford University.
  • The current state-of-the-art framework on the SQuAD dataset is SA-Net on Albert, with an F1 score of 93.011.
  • Learn more here

Loading the dataset using TensorFlow

import tensorflow as tf
def squad_data(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train= squad_data('https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json')
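Because SQuAD ships as a nested JSON file rather than plain lines, a more direct way to inspect it is Python’s json module; the field names below (data, paragraphs, qas, question, answers) follow the official SQuAD format.

import json
import urllib.request

url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json'
with urllib.request.urlopen(url) as f:
    squad = json.load(f)

# Flatten the nested structure into (context, question, answers) triples.
examples = []
for article in squad['data']:
    for paragraph in article['paragraphs']:
        context = paragraph['context']
        for qa in paragraph['qas']:
            answers = [a['text'] for a in qa['answers']]
            examples.append((context, qa['question'], answers))
print(len(examples), 'question-answer examples')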

5.2 bAbI dataset

The bAbI question answering dataset is a benchmark described as “task generation for testing text understanding and reasoning”. It is composed of a set of contexts, with multiple question-answer pairs available for each context.

  • It contains both English and Hindi data.
  • The “ContentElements” field contains training data and testing data.
  • The current state-of-the-art model on the bAbI dataset is STM (Self-Attentive Associative Memory), with an accuracy of 99.85%.
  • The examples shown are retrieved from the English 10,000-example (en-10k) variant. bAbI was introduced by Facebook AI Research.
  • The officially published bAbI paper is Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks.
  • Official GitHub Repository

Loading the dataset using Keras

import re
import tarfile
import numpy as np
from functools import reduce
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences
try:
    path_new = get_file('babi-tasks-v1-2.tar.gz',
                        origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
except Exception:
    print('Error downloading dataset, please download it manually:\n'
          '$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
          '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
    raise
readfile = tarfile.open(path_new)
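Once the archive is open, individual task files can be pulled out by member name. The path below follows the layout of the official archive (tasks_1-20_v1-2/en/...); adjust it for the en-10k variant or for other tasks.

# Read one bAbI task file (task 1, single supporting fact) from the tarball.
member = 'tasks_1-20_v1-2/en/qa1_single-supporting-fact_train.txt'
lines = readfile.extractfile(member).read().decode('utf-8').splitlines()
print(lines[:3])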

6. Question Classification Datasets

Question classification is a significant part of question answering systems; one of the most important steps is identifying the type of question being asked. Early systems used Naive Bayes, k-nearest neighbour, and SVM algorithms, but now that neural networks are taking a big leap, CNN models are widely used for this NLP task.

6.1 Trec dataset

The Text REtrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology and the U.S. Department of Defense, was introduced in 1992.

  • The TREC (Text REtrieval Conference) dataset is used for question classification.
  • It consists of open-domain, fact-based questions partitioned into broad semantic classes.
  • TREC has both a six-class (TREC-6) and a fifty-class (TREC-50) version.
  • TREC-6 and TREC-50 both have 5,452 training examples and 500 test examples.
  • The current state-of-the-art model trained on the TREC-6 dataset is USE_T+CNN.

Loading the dataset using TensorFlow

import tensorflow as tf
import tensorflow_datasets.public_api as tfds
_URLs = {
    "train": "http://cogcomp.org/Data/QA/QC/train_5500.label",
    "test": "http://cogcomp.org/Data/QA/QC/TREC_10.label",
}
coarse_label = ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"]
fine_label = [ "manner","cremat","animal","exp","ind","title","date","reason","event","state","desc","count"]
# Note: _DESCRIPTION and _CITATION must be defined as strings, and fine_label
# above is only a subset of the full TREC-50 label set; the _split_generators
# and _generate_examples methods are omitted here for brevity.
class Trec(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        description=_DESCRIPTION,
        features=tfds.features.FeaturesDict({
            "label-coarse": tfds.features.ClassLabel(names=coarse_label),
            "label-fine": tfds.features.ClassLabel(names=fine_label),
            "text": tfds.features.Text(),
        }),
        homepage="https://cogcomp.seas.upenn.edu/Data/QA/QC/",
        citation=_CITATION,
    )
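The dataset is also registered in the TFDS catalog as trec, so the builder above is not strictly required (assuming tensorflow-datasets is installed):

import tensorflow_datasets as tfds
train_ds, test_ds = tfds.load('trec', split=['train', 'test'])
for example in train_ds.take(1):
    print(example['text'].numpy(), example['label-coarse'].numpy())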

6.2 UMICH SI650 dataset

  • The University of Michigan (UMICH) SI650 (Information Retrieval) dataset is a sentiment analysis dataset whose task is to detect negative and positive sentiment in sentences.
  • The data was originally collected from opinmind.com (which is no longer active).
  • The training data contains 7,086 sentences, each already labelled with 1 (positive sentiment) or 0 (negative sentiment).
  • The test data contains 33,052 unlabelled sentences, one per line.

Loading the UMICH dataset using TensorFlow

import tensorflow as tf
def UMICH(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
# Note: this Kaggle URL points to the competition page, not a raw data file;
# download the data locally first and point UMICH() at the extracted text file.
train = UMICH('https://www.kaggle.com/c/si650winter11/data')
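Since the Kaggle link is a competition page rather than a raw file, the data has to be downloaded first (for example with the Kaggle CLI). Assuming the training file is the tab-separated training.txt from the SI650 competition, with a 0/1 label followed by the sentence on each line (verify the format after downloading), it can be read with pandas:

import pandas as pd
# Hypothetical local path after downloading the competition data from Kaggle.
train_df = pd.read_csv('training.txt', sep='\t', header=None, names=['label', 'sentence'])
print(train_df['label'].value_counts())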

6.3 ARC dataset

Can a computer learn complex, abstract tasks from just a few examples?


The Abstraction and Reasoning Corpus (ARC) provides a benchmark to measure AI skills on unknown tasks, with the constraint that only a handful of demonstrations are shown to learn a complex task. This dataset also provides a glimpse of a future where AI could quickly learn to solve new problems on its own.

The dataset description is shown below:

  • training: contains the task files for training (400 tasks). 
  • evaluation: contains the task files for evaluation (400 tasks).

The tasks are stored in JSON format. Each task JSON file contains two fields:

  • "train": demonstration input/output pairs (3 pairs).
  • "test": test input/output pairs (1 pair).
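A task file can therefore be read with a few lines of Python. The sketch below assumes a local copy of one training task (the filename is only an example); each pair holds an input grid and an output grid stored as nested lists of integers.

import json
# Hypothetical path to one downloaded ARC training task file.
with open('training/0a938d79.json') as f:
    task = json.load(f)
for pair in task['train']:
    print('input grid:', len(pair['input']), 'x', len(pair['input'][0]))
print('test pairs:', len(task['test']))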

There are a few more things to know about the ARC dataset:

  • It was developed by Dongfang Xu, Peter Jansen, and Jaycie Martin.
  • It comprises 7,787 science exam questions.
  • It has around 400 detailed question classes.
  • Download the dataset from here

Loading the dataset using TensorFlow

import tensorflow as tf
def ARC_data(path):
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source, 
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data
train= ARC_data('https://www.kaggle.com/jeromeblanchet/arc-ai2-reasoning-challenge/ARC-Challenge-Dev.csv')

Conclusion

Sentiment Analysis: IMDB, SST, Sentiment140, Yelp Polarity Review
Language Modelling: WikiText-103, WikiText-2
Question Answering: SQuAD, bAbI
Question Classification: TREC, UMICH SI650, ARC
Sequence Tagging: CoNLL 2012, UDPOS, TIGER Corpus
Machine Translation: Multi-30k, IWSLT, WMT14

We have covered almost all the major natural language processing datasets that are used extensively, from machine translation to sentiment analysis, and we have seen how to import each dataset into your coding environment to get started. For more articles on NLP (natural language processing) datasets, visit here.
