
Most Popular Datasets For Neural Textual Entailment With Implementation In PyTorch And Tensorflow


Textual entailment is a natural language processing task that determines whether one sentence can be inferred from another. A pair of sentences is assigned to one of three categories: positive entailment, negative entailment (contradiction), or neutral. Positive entailment holds when the first sentence can be used to show that the second sentence is true. Negative entailment, or contradiction, occurs when the first sentence can be used to show that the second sentence is false. Finally, if the two sentences have no such relationship, they are said to have a neutral entailment.
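As a concrete illustration of the three categories, here are three made-up premise/hypothesis pairs (the sentences are invented for this sketch, not taken from any dataset):

```python
# Three invented premise/hypothesis pairs, one per entailment category.
examples = [
    ("A man is playing a guitar on stage.",
     "A person is playing an instrument.", "entailment"),
    ("A man is playing a guitar on stage.",
     "The stage is empty.", "contradiction"),
    ("A man is playing a guitar on stage.",
     "The concert is sold out.", "neutral"),
]

for premise, hypothesis, label in examples:
    print(f"{label:13} {premise!r} -> {hypothesis!r}")
```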

Textual entailment is valuable in several applications. For example, question-answering systems can use it to verify an answer against stored data, and summarization systems can use it to discard sentences that add no new information.

This article gives a detailed overview of the most popular textual entailment datasets, with code to load each of them in TensorFlow and PyTorch.

SNLI

SNLI (the Stanford Natural Language Inference corpus) contains 570,000 human-written English sentence pairs, manually labelled for balanced classification. It was developed in 2015 by Samuel R. Bowman, Gabor Angeli, Christopher Potts and Christopher D. Manning of Stanford University. The data was collected using Amazon Mechanical Turk, and each pair carries one of three labels: entailment, contradiction or neutral.

Loading the dataset using TensorFlow

Load the libraries required for this project.

import csv
import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds
url = 'https://nlp.stanford.edu/projects/snli/snli_1.0.zip'
class Snli(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version('1.1.0')
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            'premise':
                tfds.features.Text(),
            'hypothesis':
                tfds.features.Text(),
            'label':
                tfds.features.ClassLabel(
                    names=['entailment', 'neutral', 'contradiction']),
        }),
        supervised_keys=None,
        homepage='https://nlp.stanford.edu/projects/snli/',
    )
  def _split_generators(self, dl_manager):
    dl_directory = dl_manager.download_and_extract(url)
    data_directory = os.path.join(dl_directory, 'snli_1.0')
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={
                'filepath': os.path.join(data_directory, 'snli_1.0_train.txt')
            }),
    ]
  def _generate_examples(self, filepath):
    with tf.io.gfile.GFile(filepath) as f:
      reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
      for idx, row in enumerate(reader):
        # '-' marks pairs with no gold label; ClassLabel treats -1 as missing.
        label = -1 if row['gold_label'] == '-' else row['gold_label']
        yield idx, {
            'premise': row['sentence1'],
            'hypothesis': row['sentence2'],
            'label': label,
        }

Loading the dataset using PyTorch

The helper classes below (built on torchtext's legacy data API) download the corpus from its URL and split it into train, validation and test sets.

from torchtext import data
class ShiftReduceField(data.Field):
    def __init__(self):

        super(ShiftReduceField, self).__init__(preprocessing=lambda parse: [
            'reduce' if t == ')' else 'shift' for t in parse if t != '('])

        self.build_vocab([['reduce'], ['shift']])

class ParsedTextField(data.Field):
    def __init__(self, eos_token='<pad>', lower=False, reverse=False):
        if reverse:
            super(ParsedTextField, self).__init__(
                eos_token=eos_token, lower=lower,
                preprocessing=lambda parse: [t for t in parse if t not in ('(', ')')],
                postprocessing=lambda parse, _: [list(reversed(p)) for p in parse],
                include_lengths=True)
        else:
            super(ParsedTextField, self).__init__(
                eos_token=eos_token, lower=lower,
                preprocessing=lambda parse: [t for t in parse if t not in ('(', ')')],
                include_lengths=True)
            
class NLIDataset(data.TabularDataset):

    urls = []
    dirname = ''
    name = 'nli'

    @staticmethod
    def sort_key(ex):
        return data.interleave_keys(
            len(ex.premise), len(ex.hypothesis))

    @classmethod
    def splits(cls, text_field, label_field, parse_field=None,
               extra_fields={}, root='.data', train='train.jsonl',
               validation='val.jsonl', test='test.jsonl'):
        path = cls.download(root)

        if parse_field is None:
            fields = {'sentence1': ('premise', text_field),
                      'sentence2': ('hypothesis', text_field),
                      'gold_label': ('label', label_field)}
        else:
            fields = {'sentence1_binary_parse': [('premise', text_field),
                                                 ('premise_transitions', parse_field)],
                      'sentence2_binary_parse': [('hypothesis', text_field),
                                                 ('hypothesis_transitions', parse_field)],
                      'gold_label': ('label', label_field)}

        for key in extra_fields:
            if key not in fields.keys():
                fields[key] = extra_fields[key]

        return super(NLIDataset, cls).splits(
            path, root, train, validation, test,
            format='json', fields=fields,
            filter_pred=lambda ex: ex.label != '-')

    @classmethod
    def iters(cls, batch_size=32, device=0, root='.data',
              vectors=None, trees=False, **kwargs):
        if trees:
            TEXT = ParsedTextField()
            TRANSITIONS = ShiftReduceField()
        else:
            TEXT = data.Field(tokenize='spacy')
            TRANSITIONS = None
        LABEL = data.Field(sequential=False)

        train, val, test = cls.splits(
            TEXT, LABEL, TRANSITIONS, root=root, **kwargs)

        TEXT.build_vocab(train, vectors=vectors)
        LABEL.build_vocab(train)

        return data.BucketIterator.splits(
            (train, val, test), batch_size=batch_size, device=device)
class SNLI(NLIDataset):
    urls = ['http://nlp.stanford.edu/projects/snli/snli_1.0.zip']
    dirname = 'snli_1.0'
    name = 'snli'
    @classmethod
    def splits(cls, text_field, label_field, parse_field=None, root='.data',
               train='snli_1.0_train.jsonl', validation='snli_1.0_dev.jsonl',
               test='snli_1.0_test.jsonl'):
        return super(SNLI, cls).splits(text_field, label_field, parse_field=parse_field,
                                       root=root, train=train, validation=validation,
                                       test=test)
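The ShiftReduceField above turns a binarised parse into a sequence of parser transitions: every closing bracket becomes a reduce, every word a shift, and opening brackets are dropped. A plain-Python sketch of that preprocessing (the parse string is illustrative):

```python
def parse_to_transitions(parse_tokens):
    # Mirror of the preprocessing lambda in ShiftReduceField:
    # ')' -> 'reduce', any word -> 'shift', '(' is discarded.
    return ['reduce' if t == ')' else 'shift'
            for t in parse_tokens if t != '(']

tokens = "( ( A man ) ( plays guitar ) )".split()
print(parse_to_transitions(tokens))
```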

Parameters Specification

text_field: The field used for the premise and hypothesis data.

label_field: The field used for the label data.

parse_field: The field used for shift-reduce parser transitions; pass None to skip parses.

root: The root directory where the dataset's zip archive is downloaded and extracted. Default: '.data'.

train: Filename of the training set. Default: 'train.jsonl'.

test: Filename of the test set. Default: 'test.jsonl'.

validation: Filename of the validation set. Default: 'val.jsonl'.
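The sort_key defined on NLIDataset above calls data.interleave_keys, which interleaves the binary digits of the premise and hypothesis lengths so that sorting groups examples that are similar in both lengths. A stdlib-only sketch of that idea (a simplified re-implementation for illustration, not torchtext's own code):

```python
def interleave_keys(a, b):
    # Interleave the bits of a and b so that sorting by the result
    # keeps pairs that are similar in BOTH lengths close together.
    a_bits = format(a, '016b')
    b_bits = format(b, '016b')
    return int(''.join(x + y for x, y in zip(a_bits, b_bits)), 2)

# A pair of lengths (5, 7) sorts near (6, 7), but far from (5, 40).
print(interleave_keys(5, 7), interleave_keys(6, 7), interleave_keys(5, 40))
```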

State of the Art

The current state of the art on the SNLI dataset is CA-MTL, with an accuracy of 92.1%.

Multi-NLI

The Multi-Genre Natural Language Inference (MultiNLI) corpus contains 433,000 sentence pairs annotated with textual entailment information. It was developed by Adina Williams, Nikita Nangia and Samuel R. Bowman. The corpus is modelled on SNLI, but differs in that it covers a range of genres of spoken and written text and supports a distinctive cross-genre generalisation evaluation.


Loading the dataset using TensorFlow

import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds
class MultiNLI(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.1.0")
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "premise":
                tfds.features.Text(),
            "hypothesis":
                tfds.features.Text(),
            "label":
                tfds.features.ClassLabel(
                    names=["entailment", "neutral", "contradiction"]),
        }),
        supervised_keys=None,
        homepage="https://www.nyu.edu/projects/bowman/multinli/",
    )
  def _split_generators(self, dl_manager):
    downloaded_dir = dl_manager.download_and_extract(
        "https://cims.nyu.edu/~sbowman/multinli/multinli_1.0.zip")
    multinli_path = os.path.join(downloaded_dir, "multinli_1.0")
    train_path = os.path.join(multinli_path, "multinli_1.0_train.txt")
    matched_validation_path = os.path.join(multinli_path,
                                           "multinli_1.0_dev_matched.txt")
    mismatched_validation_path = os.path.join(
        multinli_path, "multinli_1.0_dev_mismatched.txt")
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={"filepath": train_path}),
        tfds.core.SplitGenerator(
            name="validation_matched",
            gen_kwargs={"filepath": matched_validation_path}),
        tfds.core.SplitGenerator(
            name="validation_mismatched",
            gen_kwargs={"filepath": mismatched_validation_path}),
    ]
  def _generate_examples(self, filepath):
    for idx, line in enumerate(tf.io.gfile.GFile(filepath, "rb")):
      if idx == 0:
        continue  # skip the tab-separated header row
      line = tf.compat.as_text(line.strip())
      split_line = line.split("\t")
      yield idx, {
          "premise": split_line[5],
          "hypothesis": split_line[6],
          "label": split_line[0],
      }

Loading the dataset using PyTorch

class MultiNLI(NLIDataset):
    urls = ['http://www.nyu.edu/projects/bowman/multinli/multinli_1.0.zip']
    dirname = 'multinli_1.0'
    name = 'multinli'
    @classmethod
    def splits(cls, text_field, label_field, parse_field=None, genre_field=None,
               root='.data',
               train='multinli_1.0_train.jsonl',
               validation='multinli_1.0_dev_matched.jsonl',
               test='multinli_1.0_dev_mismatched.jsonl'):
        extra_fields = {}
        if genre_field is not None:
            extra_fields["genre"] = ("genre", genre_field)
        return super(MultiNLI, cls).splits(text_field, label_field,
                                           parse_field=parse_field,
                                           extra_fields=extra_fields,
                                           root=root, train=train,
                                           validation=validation, test=test)
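The genre_field handling above relies on NLIDataset's extra_fields mechanism, which merges additional (json_key -> (attribute, field)) entries into the base field mapping without overriding keys the dataset already defines. In plain Python (with field objects replaced by strings for illustration):

```python
def merge_fields(fields, extra_fields):
    # Copy the base mapping, then add extra entries only when the key
    # is not already claimed by the base dataset definition.
    merged = dict(fields)
    for key, value in extra_fields.items():
        if key not in merged:
            merged[key] = value
    return merged

base = {'gold_label': ('label', 'LABEL_FIELD')}
extra = {'genre': ('genre', 'GENRE_FIELD'),
         'gold_label': ('label', 'SHOULD_NOT_OVERRIDE')}
print(merge_fields(base, extra))
```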

State of the Art

The current state of the art on the MultiNLI dataset is T5-11B, with an accuracy of 92%.

XNLI

The Cross-lingual Natural Language Inference (XNLI) corpus contains 5,000 test and 2,500 development pairs for the MultiNLI corpus. It was developed by a team including Alexis Conneau, Adina Williams and Samuel R. Bowman. The pairs are annotated with textual entailment labels and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu.


Loading the dataset using TensorFlow

import collections
import csv
import os
import six
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds
url = 'https://cims.nyu.edu/~sbowman/xnli/XNLI-1.0.zip'
languages = ('ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th',
              'tr', 'ur', 'vi', 'zh')
class Xnli(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version('1.1.0')
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            'premise':
                tfds.features.Translation(
                    languages=languages,),
            'hypothesis':
                tfds.features.TranslationVariableLanguages(
                    languages=languages,),
            'label':
                tfds.features.ClassLabel(
                    names=['entailment', 'neutral', 'contradiction']),
        }),
        supervised_keys=None,
        homepage='https://www.nyu.edu/projects/bowman/xnli/',
    )
  def _split_generators(self, dl_manager):
    dl_directory = dl_manager.download_and_extract(url)
    data_directory = os.path.join(dl_directory, 'XNLI-1.0')
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={'filepath': os.path.join(data_directory, 'xnli.test.tsv')}),
    ]
  def _generate_examples(self, filepath):
    # The test file holds one row per (pair, language); group the rows
    # by pair id so each example carries all of its translations.
    rows_per_pair_id = collections.defaultdict(list)
    with tf.io.gfile.GFile(filepath) as f:
      reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
      for row in reader:
        rows_per_pair_id[row['pairID']].append(row)
    for idx, rows in enumerate(six.itervalues(rows_per_pair_id)):
      premise = {row['language']: row['sentence1'] for row in rows}
      hypothesis = {row['language']: row['sentence2'] for row in rows}
      yield idx, {
          'premise': premise,
          'hypothesis': hypothesis,
          'label': rows[0]['gold_label'],
      }

Loading the dataset using PyTorch

class XNLI(NLIDataset):
    urls = ['http://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip']
    dirname = 'XNLI-1.0'
    name = 'xnli'
    @classmethod
    def splits(cls, text_field, label_field, genre_field=None, language_field=None,
               root='.data',
               validation='xnli.dev.jsonl',
               test='xnli.test.jsonl'):
        extra_fields = {}
        if genre_field is not None:
            extra_fields["genre"] = ("genre", genre_field)
        if language_field is not None:
            extra_fields["language"] = ("language", language_field)
        return super(XNLI, cls).splits(text_field, label_field,
                                       extra_fields=extra_fields,
                                       root=root, train=None,
                                       validation=validation, test=test)
    @classmethod
    def iters(cls, *args, **kwargs):
        raise NotImplementedError('XNLI dataset does not support iters')

State of the Art

The current state of the art on the XNLI dataset is RoBERTa-wwm-ext-large, with an accuracy of 81.2%.

Conclusion

In this article, we discussed some of the most popular datasets used in textual entailment and showed how to load each corpus with PyTorch and TensorFlow. Entailment tasks are excellent vehicles for reasoning: essentially all questions about meaning in language can be reduced to questions of entailment and contradiction in context. This makes textual entailment an ideal testing ground for theories of semantic representation.


Copyright Analytics India Magazine Pvt Ltd
