Most Popular Datasets For Neural Textual Entailment With Implementation In PyTorch And TensorFlow

Textual entailment is a natural language processing task that tries to determine whether one sentence can be inferred from another. A pair of sentences is categorized into one of three classes: positive (entailment), negative (contradiction) or neutral. Positive entailment occurs when the first sentence can be used to show that the second sentence is true. Negative entailment, or contradiction, occurs when the first sentence can be used to disprove the second sentence. Finally, if the two sentences have no relationship, they are considered to have a neutral entailment.

Textual entailment is valuable in several applications. For example, question-answering systems use it to verify an answer against stored data, and it can also be used to filter out sentences that carry no new information.

This article gives a detailed overview of the most popular datasets used for textual entailment, along with code for loading them in TensorFlow and PyTorch.

SNLI

The Stanford Natural Language Inference (SNLI) corpus contains 570,000 human-written English sentence pairs that are manually labelled for balanced classification. It was developed in 2015 by Samuel R. Bowman, Gabor Angeli, Christopher Potts and Christopher D. Manning of Stanford University. The data was collected using Amazon Mechanical Turk, and each pair falls into one of three categories: entailment, contradiction or neutral. For example, the premise "A man inspects the uniform of a figure in some East Asian country." and the hypothesis "The man is sleeping." form a contradiction pair.

Loading the dataset using TensorFlow

Load the libraries required for this project.

import csv
import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds

url = 'https://nlp.stanford.edu/projects/snli/snli_1.0.zip'

class Snli(tfds.core.GeneratorBasedBuilder):
  """Dataset builder for the SNLI corpus."""
  VERSION = tfds.core.Version('1.1.0')

  def _info(self):
    # Describe the features: two text fields and a three-way class label.
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            'premise':
                tfds.features.Text(),
            'hypothesis':
                tfds.features.Text(),
            'label':
                tfds.features.ClassLabel(
                    names=['entailment', 'neutral', 'contradiction']),
        }),
        supervised_keys=None,
        homepage='https://nlp.stanford.edu/projects/snli/',
    )

  def _split_generators(self, dl_manager):
    # Download and extract the zip archive, then point the generator at the
    # tab-separated training file (the dev/test files can be added the same way).
    dl_directory = dl_manager.download_and_extract(url)
    data_directory = os.path.join(dl_directory, 'snli_1.0')
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={
                'filepath': os.path.join(data_directory, 'snli_1.0_train.txt')
            }),
    ]

  def _generate_examples(self, filepath):
    # Yield one example per row of the tab-separated file; rows whose gold
    # label is '-' (no annotator consensus) are given the label -1.
    with tf.io.gfile.GFile(filepath) as f:
      reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
      for idx, row in enumerate(reader):
        label = -1 if row['gold_label'] == '-' else row['gold_label']
        yield idx, {
            'premise': row['sentence1'],
            'hypothesis': row['sentence2'],
            'label': label,
        }
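
Once the builder is defined, it can be prepared and read through the standard TFDS pipeline. Below is a minimal usage sketch; note that SNLI also ships with TensorFlow Datasets as the built-in 'snli' dataset, so the same data can be loaded without writing a custom builder.

# Usage sketch for the Snli builder defined above.
builder = Snli()
builder.download_and_prepare()            # downloads the archive and writes TFRecords
ds = builder.as_dataset(split='train')    # a tf.data.Dataset of feature dicts

# Alternatively, use the ready-made TFDS dataset:
# ds = tfds.load('snli', split='train')

for example in ds.take(1):
  print(example['premise'].numpy(), example['hypothesis'].numpy(), example['label'].numpy())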

Loading the dataset using PyTorch

The dataset class below downloads the archive from the URL listed in its urls attribute and then splits the data into train, validation and test sets.

from torchtext import data
# Note: in torchtext >= 0.9 these classes live under torchtext.legacy.data.

class ShiftReduceField(data.Field):
    # Converts a binary parse into a sequence of shift/reduce transitions.
    def __init__(self):
        super(ShiftReduceField, self).__init__(preprocessing=lambda parse: [
            'reduce' if t == ')' else 'shift' for t in parse if t != '('])

        self.build_vocab([['reduce'], ['shift']])

class ParsedTextField(data.Field):
    def __init__(self, eos_token='<pad>', lower=False, reverse=False):
        if reverse:
            super(ParsedTextField, self).__init__(
                eos_token=eos_token, lower=lower,
                preprocessing=lambda parse: [t for t in parse if t not in ('(', ')')],
                postprocessing=lambda parse, _: [list(reversed(p)) for p in parse],
                include_lengths=True)
        else:
            super(ParsedTextField, self).__init__(
                eos_token=eos_token, lower=lower,
                preprocessing=lambda parse: [t for t in parse if t not in ('(', ')')],
                include_lengths=True)
            
class NLIDataset(data.TabularDataset):
    # Base class for NLI corpora. The `urls`, `dirname` and `name` attributes
    # are the names torchtext's download machinery expects.
    urls = []
    dirname = ''
    name = 'nli'

    @staticmethod
    def sort_key(ex):
        return data.interleave_keys(
            len(ex.premise), len(ex.hypothesis))

    @classmethod
    def splits(cls, text_field, label_field, parse_field=None,
               extra_fields={}, root='.data', train='train.jsonl',
               validation='val.jsonl', test='test.jsonl'):
        path = cls.download(root)

        if parse_field is None:
            fields = {'sentence1': ('premise', text_field),
                      'sentence2': ('hypothesis', text_field),
                      'gold_label': ('label', label_field)}
        else:
            fields = {'sentence1_binary_parse': [('premise', text_field),
                                                 ('premise_transitions', parse_field)],
                      'sentence2_binary_parse': [('hypothesis', text_field),
                                                 ('hypothesis_transitions', parse_field)],
                      'gold_label': ('label', label_field)}

        for key in extra_fields:
            if key not in fields.keys():
                fields[key] = extra_fields[key]

        return super(NLIDataset, cls).splits(
            path, root, train, validation, test,
            format='json', fields=fields,
            filter_pred=lambda ex: ex.label != '-')

    @classmethod
    def iters(cls, batch_size=32, device=0, root='.data',
              vectors=None, trees=False, **kwargs):
        if trees:
            TEXT = ParsedTextField()
            TRANSITIONS = ShiftReduceField()
        else:
            TEXT = data.Field(tokenize='spacy')
            TRANSITIONS = None
        LABEL = data.Field(sequential=False)

        train, val, test = cls.splits(
            TEXT, LABEL, TRANSITIONS, root=root, **kwargs)

        TEXT.build_vocab(train, vectors=vectors)
        LABEL.build_vocab(train)

        return data.BucketIterator.splits(
            (train, val, test), batch_size=batch_size, device=device)

class SNLI(NLIDataset):
    urls = ['http://nlp.stanford.edu/projects/snli/snli_1.0.zip']
    dirname = 'snli_1.0'
    name = 'snli'
    @classmethod
    def splits(cls, text_field, label_field, parse_field=None, root='.data',
               train='snli_1.0_train.jsonl', validation='snli_1.0_dev.jsonl',
               test='snli_1.0_test.jsonl'):
        return super(SNLI, cls).splits(text_field, label_field, parse_field=parse_field,
                                       root=root, train=train, validation=validation,
                                       test=test)

Parameters Specification

text_field: The field that will be used for premise and hypothesis data.

label_field: The field that will be used for label data.

parse_field: The field that will be used for shift-reduce parser transitions, or None to skip them.

root: The root directory under which the dataset's zip archive is downloaded and extracted.

train: The filename of the training set. Default: 'train.jsonl'.

test: The filename of the test set.

validation: The filename of the validation set.
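
Putting it all together, below is a minimal usage sketch with the legacy torchtext API (the batch size and lowercasing are illustrative choices, not fixed by the dataset). This mirrors what the iters classmethod does internally, but leaves vocabulary construction and batching under the caller's control.

# Usage sketch: load SNLI with parse transitions and wrap it in iterators.
TEXT = ParsedTextField(lower=True)
TRANSITIONS = ShiftReduceField()
LABEL = data.Field(sequential=False)

train, val, test = SNLI.splits(TEXT, LABEL, TRANSITIONS)

TEXT.build_vocab(train)
LABEL.build_vocab(train)

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=32)

batch = next(iter(train_iter))
print(batch.premise, batch.hypothesis, batch.label)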

State of the Art

The current state of the art on the SNLI dataset is CA-MTL, with a reported accuracy of 92.1%.

Multi-NLI

The Multi-Genre Natural Language Inference (MultiNLI) corpus contains 433,000 sentence pairs annotated with textual entailment information. It was developed by the researchers Adina Williams, Nikita Nangia and Samuel R. Bowman. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text and supports a distinctive cross-genre generalization evaluation. Like SNLI, every pair is labelled entailment, neutral or contradiction, and each example additionally carries a genre tag.

Loading the dataset using TensorFlow

import csv
import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds

class MultiNLI(tfds.core.GeneratorBasedBuilder):
  """Dataset builder for the MultiNLI corpus."""
  VERSION = tfds.core.Version("1.1.0")

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "premise":
                tfds.features.Text(),
            "hypothesis":
                tfds.features.Text(),
            "label":
                tfds.features.ClassLabel(
                    names=["entailment", "neutral", "contradiction"]),
        }),
        supervised_keys=None,
        homepage="https://www.nyu.edu/projects/bowman/multinli/",
    )

  def _split_generators(self, dl_manager):
    # Download the archive and register the training split together with the
    # matched and mismatched development splits.
    downloaded_dir = dl_manager.download_and_extract(
        "https://cims.nyu.edu/~sbowman/multinli/multinli_1.0.zip")
    multinli_path = os.path.join(downloaded_dir, "multinli_1.0")
    train_path = os.path.join(multinli_path, "multinli_1.0_train.txt")
    matched_validation_path = os.path.join(multinli_path,
                                           "multinli_1.0_dev_matched.txt")
    mismatched_validation_path = os.path.join(
        multinli_path, "multinli_1.0_dev_mismatched.txt")
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={"filepath": train_path}),
        tfds.core.SplitGenerator(
            name="validation_matched",
            gen_kwargs={"filepath": matched_validation_path}),
        tfds.core.SplitGenerator(
            name="validation_mismatched",
            gen_kwargs={"filepath": mismatched_validation_path}),
    ]

  def _generate_examples(self, filepath):
    # The MultiNLI text files share the SNLI column layout, so examples are
    # generated in exactly the same way.
    with tf.io.gfile.GFile(filepath) as f:
      reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
      for idx, row in enumerate(reader):
        label = -1 if row["gold_label"] == "-" else row["gold_label"]
        yield idx, {
            "premise": row["sentence1"],
            "hypothesis": row["sentence2"],
            "label": label,
        }
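
MultiNLI is also available as a ready-made TFDS dataset named 'multi_nli', so the custom builder above can be bypassed entirely. A minimal sketch:

# Load MultiNLI through the built-in TFDS dataset.
ds = tfds.load('multi_nli', split='train')

for example in ds.take(1):
  print(example['premise'].numpy(), example['hypothesis'].numpy(), example['label'].numpy())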

Loading the dataset using PyTorch

class MultiNLI(NLIDataset):
    urls = ['http://www.nyu.edu/projects/bowman/multinli/multinli_1.0.zip']
    dirname = 'multinli_1.0'
    name = 'multinli'
    @classmethod
    def splits(cls, text_field, label_field, parse_field=None, genre_field=None,
               root='.data',
               train='multinli_1.0_train.jsonl',
               validation='multinli_1.0_dev_matched.jsonl',
               test='multinli_1.0_dev_mismatched.jsonl'):
        extra_fields = {}
        if genre_field is not None:
            extra_fields["genre"] = ("genre", genre_field)
        return super(MultiNLI, cls).splits(text_field, label_field,
                                           parse_field=parse_field,
                                           extra_fields=extra_fields,
                                           root=root, train=train,
                                           validation=validation, test=test)
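
Because every MultiNLI example carries a genre tag, the optional genre_field can be supplied so that the tag is retained on each example. A minimal sketch (the field configuration below is illustrative):

# Usage sketch: load MultiNLI and keep the genre tag as an extra field.
TEXT = data.Field(lower=True)
LABEL = data.Field(sequential=False)
GENRE = data.Field(sequential=False)

train, val, test = MultiNLI.splits(TEXT, LABEL, genre_field=GENRE)

TEXT.build_vocab(train)
LABEL.build_vocab(train)
GENRE.build_vocab(train)

print(GENRE.vocab.itos)   # genre names such as 'fiction' or 'government'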

State of the Art

The current state of the art on the MultiNLI dataset is T5-11B, with a reported accuracy of about 92%.

XNLI

The Cross-lingual Natural Language Inference (XNLI) corpus extends MultiNLI with 5,000 test and 2,500 development sentence pairs. It was developed by researchers from Facebook AI and NYU, including Alexis Conneau, Adina Williams and Samuel R. Bowman. The pairs are annotated with textual entailment labels and translated into 14 additional languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu.


Loading the dataset using TensorFlow

import collections
import csv
import os
import six
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds

url = 'https://cims.nyu.edu/~sbowman/xnli/XNLI-1.0.zip'
languages = ('ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th',
             'tr', 'ur', 'vi', 'zh')

class Xnli(tfds.core.GeneratorBasedBuilder):
  """Dataset builder for the XNLI corpus."""
  VERSION = tfds.core.Version('1.1.0')

  def _info(self):
    # The premise and hypothesis are stored as multilingual translation
    # features; the label is the usual three-way class label.
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            'premise':
                tfds.features.Translation(
                    languages=languages,),
            'hypothesis':
                tfds.features.TranslationVariableLanguages(
                    languages=languages,),
            'label':
                tfds.features.ClassLabel(
                    names=['entailment', 'neutral', 'contradiction']),
        }),
        supervised_keys=None,
        homepage='https://www.nyu.edu/projects/bowman/xnli/',
    )

  def _split_generators(self, dl_manager):
    # Only the test split is registered here; a validation split can be added
    # the same way using 'xnli.dev.tsv'.
    dl_directory = dl_manager.download_and_extract(url)
    data_directory = os.path.join(dl_directory, 'XNLI-1.0')
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={'filepath': os.path.join(data_directory, 'xnli.test.tsv')}),
    ]

  def _generate_examples(self, filepath):
    # Group the per-language rows by pair ID so that each yielded example
    # contains the premise and hypothesis in every available language.
    rows_per_pair_id = collections.defaultdict(list)
    with tf.io.gfile.GFile(filepath) as f:
      reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
      for row in reader:
        rows_per_pair_id[row['pairID']].append(row)
    for idx, rows in enumerate(six.itervalues(rows_per_pair_id)):
      premise = {row['language']: row['sentence1'] for row in rows}
      hypothesis = {row['language']: row['sentence2'] for row in rows}
      yield idx, {
          'premise': premise,
          'hypothesis': hypothesis,
          'label': rows[0]['gold_label'],
      }
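
XNLI also ships as a built-in TFDS dataset named 'xnli'. Because the premise is stored as a Translation feature, each example holds one string per language. A minimal sketch:

# Load the built-in XNLI test split and inspect one language pair.
ds = tfds.load('xnli', split='test')

for example in ds.take(1):
  print(example['premise']['en'].numpy())   # English premise
  print(example['premise']['fr'].numpy())   # French translation of the same premise
  print(example['label'].numpy())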

Loading the dataset using PyTorch

class XNLI(NLIDataset):
    urls = ['http://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip']
    dirname = 'XNLI-1.0'
    name = 'xnli'
    @classmethod
    def splits(cls, text_field, label_field, genre_field=None, language_field=None,
               root='.data',
               validation='xnli.dev.jsonl',
               test='xnli.test.jsonl'):
        extra_fields = {}
        if genre_field is not None:
            extra_fields["genre"] = ("genre", genre_field)
        if language_field is not None:
            extra_fields["language"] = ("language", language_field)
        return super(XNLI, cls).splits(text_field, label_field,
                                       extra_fields=extra_fields,
                                       root=root, train=None,
                                       validation=validation, test=test)
    @classmethod
    def iters(cls, *args, **kwargs):
        raise NotImplementedError('XNLI dataset does not support iters')
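
Since iters is deliberately disabled for XNLI (there is no training split from which to build vocabularies), the splits have to be wrapped manually. A rough sketch follows; in practice the vocabularies would usually be reused from a model trained on MultiNLI.

# Usage sketch: XNLI only provides dev and test splits.
TEXT = data.Field(lower=True)
LABEL = data.Field(sequential=False)
LANGUAGE = data.Field(sequential=False)

val, test = XNLI.splits(TEXT, LABEL, language_field=LANGUAGE)

TEXT.build_vocab(val)
LABEL.build_vocab(val)
LANGUAGE.build_vocab(val)

val_iter, test_iter = data.BucketIterator.splits((val, test), batch_size=32)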

State of the Art

The current state of the art on the XNLI dataset is RoBERTa-wwm-ext-large, with a reported accuracy of 81.2%.

Conclusion

In this article, we discussed some of the most popular datasets used for textual entailment and showed how to load each corpus with PyTorch and TensorFlow. Textual entailment is a powerful vehicle for reasoning: essentially all questions about meaning in language can be reduced to questions of entailment and contradiction in context. This makes textual entailment an ideal testing ground for theories of semantic representation.
